[PATCH RFCv1 00/10] hw/arm/virt: Add multiple nested SMMUs

Hi all,

This is a draft solution adding multiple nested SMMU instances to a VM.
The main goal of the series is to collect opinions and figure out a
reasonable solution that fits our needs.

I understand that there are concerns regarding this support, based on
our previous discussion:
https://lore.kernel.org/all/ZEcT%2F7erkhHDaNvD@Asurada-Nvidia/

Yet, some follow-up discussion on the kernel mailing list has shifted
the direction of regular SMMU nesting toward potentially having
multiple vSMMU instances as well:
https://lore.kernel.org/all/20240611121756.GT19897@nvidia.com/

I will summarize all the points in the following paragraphs:

[ Why do we need multiple nested SMMUs? ]
1, This is a must-have feature for NVIDIA's Grace SoC to support its
   CMDQV (an extension HW for SMMU). It allows assigning a Command
   Queue HW exclusively to a VM, which then controls it via an mmap'd
   MMIO page:
   https://lore.kernel.org/all/f00da8df12a154204e53b343b2439bf31517241f.1712978213.git.nicolinc@nvidia.com/
   Each Grace SoC has 5 SMMUs (i.e. 5 CMDQVs), meaning there can be 5
   MMIO pages. If QEMU supports only one vSMMU and all passthrough
   devices attach to it, QEMU technically cannot mmap the 5 MMIO
   pages, nor assign devices to use their corresponding pages.
2, This is optional for nested SMMU, and essentially a design choice
   between a single-vSMMU design and a multiple-vSMMU design. Here
   are the pros and cons:
   + Pros for single vSMMU design
     a) It is easy and clean, by all means.
   - Cons for single vSMMU design
     b) It can have complications if the underlying pSMMUs are different.
     c) Emulated devices might have to be added to the nested SMMU,
        since "iommu=nested-smmuv3" enables for the whole VM. This
        means the vSMMU instance has to act at the same time as both
        a nested SMMU and a para-virt SMMU.
     d) IOTLB inefficiency. Since devices behind different pSMMUs are
        attached to a single vSMMU, the vSMMU code traps invalidation
        commands from a shared guest CMDQ and needs to dispatch those
        commands correctly to the pSMMUs, either by broadcasting or by
        walking through a lookup table. Note that a command not tied
        to any hwpt or device still has to be broadcast (see the
        dispatch sketch after this list).
   + Pros for multiple vSMMU design
     e) Emulated devices can be isolated from any nested SMMU.
     f) Cache invalidation commands will always be forwarded to the
        corresponding pSMMU, reducing the overhead from vSMMU walking
        through a lookup table or broadcasting.
     g) It will accommodate CMDQV very easily.
   - Cons for multiple vSMMU design
     h) Complications in the VIRT and IORT design
     i) Difficulty in supporting device hotplug
     j) Potential to run out of PCI bus numbers, as QEMU doesn't
        support multiple PCI domains.
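
To make point d) above concrete, here is a conceptual, self-contained
sketch in plain C (hypothetical types and stub helpers, not QEMU code)
of the dispatch work a single shared vSMMU has to do for every trapped
guest command: route it via a SID lookup table when the command is
device-scoped, or broadcast it to every pSMMU otherwise.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical, simplified types -- not the QEMU/SMMUv3 structures. */
    typedef struct { bool has_sid; uint32_t sid; } Cmd;
    typedef struct { int id; } PhysSMMU;

    #define NUM_PSMMUS 3
    static PhysSMMU psmmus[NUM_PSMMUS] = { { .id = 0 }, { .id = 1 }, { .id = 2 } };

    /* Stand-in for issuing a trapped command to one physical SMMU. */
    static void forward_cmd(PhysSMMU *p, const Cmd *c)
    {
        printf("forward cmd (sid=0x%x) to pSMMU%d\n", (unsigned)c->sid, p->id);
    }

    /* Stand-in for the SID -> pSMMU lookup table a shared vSMMU must keep. */
    static PhysSMMU *lookup_psmmu_by_sid(uint32_t sid)
    {
        return &psmmus[sid % NUM_PSMMUS];
    }

    static void dispatch_cmd(const Cmd *cmd)
    {
        if (cmd->has_sid) {
            /* Tied to a device: route to exactly one pSMMU. */
            forward_cmd(lookup_psmmu_by_sid(cmd->sid), cmd);
        } else {
            /* Not tied to any device/hwpt: has to go to every pSMMU. */
            for (int i = 0; i < NUM_PSMMUS; i++) {
                forward_cmd(&psmmus[i], cmd);
            }
        }
    }

    int main(void)
    {
        dispatch_cmd(&(Cmd){ .has_sid = true, .sid = 0x100 });
        dispatch_cmd(&(Cmd){ .has_sid = false }); /* must broadcast */
        return 0;
    }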

[ How is it implemented with this series? ]
 * As an experimental series, this is all done in VIRT and ACPI code.
 * Scan the host's iommu sysfs nodes and build an SMMU node list
   (PATCH-03); a standalone sketch of this host-side logic (covering
   PATCH-03/05/06/09) follows the lspci tree below.
 * Create one PCIe Expander Bridge (+ one vSMMU) per pSMMU, allocated
   from the top of the bus number space (0xFF) downward with intervals
   reserved for root ports (PATCH-05). E.g. a host system with three
   pSMMUs:
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
 * Loop over the vfio-pci devices against the SMMU node list, assign
   them automatically in the VIRT code to the corresponding
   smmu_bridges, and then attach each one by creating a root port
   (PATCH-06):
   [ pcie.0 bus ]
   -----------------------------------------------------------------------------
           |                  |                   |                  |
   -----------------  ------------------  ------------------  ------------------
   | emulated devs |  | smmu_bridge.e5 |  | smmu_bridge.ee |  | smmu_bridge.f7 |
   -----------------  ------------------  ------------------  ------------------
                                                  |
                                          ----------------   -----------
                                          | root_port.ef |---| PCI dev |
                                          ----------------   -----------
 * Set the "pcie.0" root bus to iommu bypass, so its entire ID space
   will be directed to ITS in IORT. If a vfio-pci device chooses to
   bypass 2-stage translation, it can be added to "pcie.0" (PATCH-07):
     --------------build_iort: its_idmaps
     build_iort_id_mapping: input_base=0x0, id_count=0xe4ff, out_ref=0x30
 * Map IDs of smmu_bridges to corresponding vSMMUs (PATCH-09):
     --------------build_iort: smmu_idmaps
     build_iort_id_mapping: input_base=0xe500, id_count=0x8ff, out_ref=0x48
     build_iort_id_mapping: input_base=0xee00, id_count=0x8ff, out_ref=0xa0
     build_iort_id_mapping: input_base=0xf700, id_count=0x8ff, out_ref=0xf8
 * Finally, "lspci -tv" in the guest looks like this:
     -+-[0000:ee]---00.0-[ef]----00.0  [vfio-pci passthrough]
      \-[0000:00]-+-00.0  Red Hat, Inc. QEMU PCIe Host bridge
                  +-01.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-02.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-03.0  Red Hat, Inc. QEMU PCIe Expander bridge
                  +-04.0  Red Hat, Inc. QEMU NVM Express Controller [emulated]
                  \-05.0  Intel Corporation 82540EM Gigabit Ethernet [emulated]
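
As noted in the PATCH-03 bullet above, below is a minimal standalone
sketch (plain C + GLib, not the actual QEMU patches) of the host-side
logic: scan /sys/class/iommu for SMMUv3 nodes, match a passthrough
device to its pSMMU via its sysfs "iommu" link, allocate one
smmu_bridge per pSMMU from the top of the bus space, and print the
resulting IORT ID mappings. The "smmu3." name prefix, the per-device
"iommu" symlink, the example BDF, and the 9-bus interval per bridge
are assumptions inferred from the example output above.

    #include <glib.h>
    #include <stdio.h>

    #define SMMU_SYSFS_DIR   "/sys/class/iommu"
    #define SMMU_NAME_PREFIX "smmu3."   /* assumed SMMUv3 node naming */
    #define BUSES_PER_BRIDGE 9          /* bridge + root-port interval (assumed) */

    /* PATCH-03: build the host SMMUv3 node list from sysfs. */
    static GPtrArray *scan_host_smmus(void)
    {
        GPtrArray *smmus = g_ptr_array_new_with_free_func(g_free);
        GDir *dir = g_dir_open(SMMU_SYSFS_DIR, 0, NULL);
        const char *name;

        if (!dir) {
            return smmus;
        }
        while ((name = g_dir_read_name(dir))) {
            if (g_str_has_prefix(name, SMMU_NAME_PREFIX)) {
                g_ptr_array_add(smmus, g_strdup(name));
            }
        }
        g_dir_close(dir);
        return smmus;
    }

    /* PATCH-06: find which pSMMU (list index) a passthrough device is behind. */
    static int smmu_index_for_device(GPtrArray *smmus, const char *bdf)
    {
        /* e.g. /sys/bus/pci/devices/0000:09:00.0/iommu -> .../smmu3.0x... */
        g_autofree char *link_path =
            g_strdup_printf("/sys/bus/pci/devices/%s/iommu", bdf);
        g_autofree char *target = g_file_read_link(link_path, NULL);
        g_autofree char *node = target ? g_path_get_basename(target) : NULL;

        for (guint i = 0; node && i < smmus->len; i++) {
            if (!g_strcmp0(node, g_ptr_array_index(smmus, i))) {
                return (int)i;
            }
        }
        return -1;  /* not behind any known pSMMU: leave it on pcie.0 */
    }

    int main(void)
    {
        GPtrArray *smmus = scan_host_smmus();

        /*
         * PATCH-05/09: allocate bridge bus numbers downward from 0xFF and
         * map each bridge's RID range starting at (bus << 8) to its vSMMU.
         */
        for (guint i = 0; i < smmus->len; i++) {
            int bus = 0x100 - (int)(smmus->len - i) * BUSES_PER_BRIDGE;

            printf("%s -> smmu_bridge.%02x: idmap input_base=0x%04x id_count=0x%x\n",
                   (char *)g_ptr_array_index(smmus, i), bus,
                   bus << 8, BUSES_PER_BRIDGE * 0x100 - 1);
        }
        if (smmus->len) {
            int low = 0x100 - (int)smmus->len * BUSES_PER_BRIDGE;

            /* PATCH-07: everything below the bridges stays on pcie.0 -> ITS. */
            printf("pcie.0 -> ITS: idmap input_base=0x0 id_count=0x%x\n",
                   low * 0x100 - 1);
        }

        int idx = smmu_index_for_device(smmus, "0000:09:00.0"); /* example BDF */
        printf("0000:09:00.0 -> %s\n",
               idx >= 0 ? (char *)g_ptr_array_index(smmus, idx)
                        : "pcie.0 (bypass)");

        g_ptr_array_free(smmus, TRUE);
        return 0;
    }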

[ Topics for discussion ]
 * Some of the bits can be moved to backends/iommufd.c, e.g.
     -object iommufd,id=iommufd0,[nesting=smmu3,[max-hotplugs=1]]
   And I was hoping that the vfio-pci device could take the iommufd
   BE pointer so it could redirect the PCI bus. Yet, that seems to be
   more complicated than I thought... (a hypothetical command line is
   sketched after this list)
 * Possibility of adding nesting support for vfio-pci-nohotplug only?
   The kernel uAPI (even for nesting cap detection) requires a device
   handle. If a VM boots without a vfio-pci device and then gets a
   hotplug after boot-to-console, will a vSMMU that has already
   finished a reset cycle need to sync the idr/idrr bits and reset
   again?
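
For the first bullet above, a hypothetical command line under that
proposal could look like the following. The nesting= and max-hotplugs=
properties are only the sketch from that bullet, not an existing QEMU
interface; the host BDF is a placeholder; the iommu=nested-smmuv3
machine option is the one referenced in the pros/cons section; and
vfio-pci's iommufd= link to the backend object already exists:

    qemu-system-aarch64 -machine virt,iommu=nested-smmuv3 ... \
        -object iommufd,id=iommufd0,nesting=smmu3,max-hotplugs=1 \
        -device vfio-pci,host=0000:09:00.0,iommufd=iommufd0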

This series is on Github:
https://github.com/nicolinc/qemu/commits/iommufd_multi_vsmmu-rfcv1

Thanks!
Nicolin

Eric Auger (1):
  hw/arm/virt-acpi-build: Add IORT RMR regions to handle MSI nested
    binding

Nicolin Chen (9):
  hw/arm/virt: Add iommufd link to virt-machine
  hw/arm/virt: Get the number of host-level SMMUv3 instances
  hw/arm/virt: Add an SMMU_IO_LEN macro
  hw/arm/virt: Add VIRT_NESTED_SMMU
  hw/arm/virt: Assign vfio-pci devices to nested SMMUs
  hw/arm/virt: Bypass iommu for default PCI bus
  hw/arm/virt-acpi-build: Handle reserved bus number of pxb buses
  hw/arm/virt-acpi-build: Build IORT with multiple SMMU nodes
  hw/arm/virt-acpi-build: Enable ATS for nested SMMUv3

 hw/arm/virt-acpi-build.c | 144 ++++++++++++++++----
 hw/arm/virt.c            | 277 +++++++++++++++++++++++++++++++++++++--
 include/hw/arm/virt.h    |  63 +++++++++
 3 files changed, 449 insertions(+), 35 deletions(-)

-- 
2.43.0