[PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Joao Martins 1 year, 11 months ago
v4[5] -> v5:
* Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
* Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
commit message;

---

This series lets QEMU spawn i386 guests with >= 1010G of RAM with VFIO devices,
particularly when running on AMD systems with an IOMMU.

Since Linux v5.4, VFIO validates whether the IOVA passed in the DMA_MAP ioctl is
valid and returns -EINVAL otherwise. On x86, Intel hosts aren't particularly
affected by this extra validation. But AMD systems with an IOMMU have a hole at
the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
here: FD_0000_0000h - FF_FFFF_FFFFh. See the IOMMU manual [1], specifically
section '2.1.2 IOMMU Logical Topology', Table 3, for what those addresses mean.
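
In C terms the window boils down to the constants below (names are made up
for illustration here, not taken from the series):

#include <stdbool.h>
#include <stdint.h>

/* AMD HyperTransport reserved window (FD_0000_0000h - FF_FFFF_FFFFh). */
#define AMD_HT_START  0xfd00000000ULL   /* FD_0000_0000h = 1012 GiB */
#define AMD_HT_END    0xffffffffffULL   /* FF_FFFF_FFFFh = 1 TiB - 1 */
#define AMD_HT_SIZE   (AMD_HT_END - AMD_HT_START + 1)   /* 12 GiB */

/* True if a guest physical address layout reaching up to max_used_gpa
 * would overlap the reserved window. */
static inline bool ht_window_crossed(uint64_t max_used_gpa)
{
    return max_used_gpa >= AMD_HT_START;
}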

VFIO DMA_MAP calls with IOVAs in this address range trip over this check and hence
return -EINVAL, consequently failing the creation of guests bigger than 1010G.
Example of the failure:

qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
	failed to setup container for group 258: memory listener initialization failed:
		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)

Prior to v5.4 we could map these IOVAs, *but* that's still not the right thing
to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2]),
as documented in the links below.

This small series tries to address that by dealing with this AMD-specific 1Tb hole;
rather than treating it like the 4G hole, it relocates the RAM above 4G to start
above 1T if the maximum RAM range crosses the HT reserved range.
It is organized as follows:

patch 1: Introduce @above_4g_mem_start, defaulting to 4 GiB, as the start
         address of the RAM above the 4G boundary

patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
	     to get access to pci_hole64_size. The actual pci-host
	     initialization is kept as is; only the qdev_new() call moves.

patch 4: Change @above_4g_mem_start to 1 TiB if the vCPU vendor is AMD and the
max possible address crosses the HT region (see the sketch after this list).
Error out if phys-bits is too low, which is only the case for >= 1010G
configurations or anything else that crosses the HT region.

patch 5: Ensure valid IOVAs only on new machine types, but not older
ones (<= v7.0.0)
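
To make the patch 4 logic concrete, a simplified sketch follows, reusing the
constants from the earlier snippet; the helper and argument names are made up
(the function in the series is x86_update_above_4g_mem_start()):

#define AMD_ABOVE_1TB_START  0x10000000000ULL   /* 1 TiB */

/* Compute where RAM above 4G should start, relocating it to 1 TiB when
 * the highest used GPA would otherwise cross the HT reserved window. */
static uint64_t compute_above_4g_mem_start(bool amd_vcpu,
                                           uint64_t above_4g_mem_size,
                                           uint64_t sgx_epc_size,
                                           uint64_t pci_hole64_size)
{
    uint64_t start = 0x100000000ULL;   /* default: 4 GiB */
    uint64_t max_used_gpa = start + above_4g_mem_size + sgx_epc_size +
                            pci_hole64_size;

    if (amd_vcpu && ht_window_crossed(max_used_gpa)) {
        /* Relocate so RAM above 4G (and the PCI hole stacked on top of it)
         * sits entirely past the reserved window. */
        start = AMD_ABOVE_1TB_START;
        /* With this, e.g. a >1010G guest tops out above 2^40, so at least
         * 41 phys-bits are needed; the series errors out if the configured
         * phys-bits cannot cover the relocated maximum. */
    }
    return start;
}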

The 'consequence' of this approach is that we may need more than the default
phys-bits, e.g. a guest with >1010G will have most of its RAM above the 1TB
address, consequently needing 41 phys-bits as opposed to the default of 40
(TCG_PHYS_ADDR_BITS). There is already a precedent of depending on the user to
pick the right value of phys-bits (regardless of this series), so we warn in
case phys-bits isn't enough. Finally, the CMOS encoding of the RAM blocks above
4G loses its meaning, but it was mentioned over the RFC that CMOS is only useful
for very old SeaBIOS.

Additionally, the reserved region is added to E820 if the relocation is done.
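
Concretely, something along the lines of QEMU's existing e820 helper (the
constants are from the sketch above; the exact call site is in the series):

/* Mark the HT window as reserved in the guest's E820 map. */
e820_add_entry(AMD_HT_START, AMD_HT_SIZE, E820_RESERVED);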

Alternative options considered (in RFC[0]):

a) Dealing with the 1T hole like the 4G hole -- which also closely matches what
the hardware does.

Thanks,
	Joao

Older Changelog,

v3[4] -> v4[5]:
(changes in patch 4 and 5 only)
* Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
* Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
* Squash previous patch 5 into patch 4 to tie in the phys-bits check
  into the relocate-4g-start logic: We now error out if the phys-bits
  aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
* Make the error message more explicit when phys-bits isn't enough to also
  mention: "cannot avoid AMD HT range"
* Add comments inside x86_update_above_4g_mem_start() explaining the
  logic behind it. (Michael Tsirkin)
* Tested on old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
  alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin)
  Also validated -numa topologies and made sure qtests observe no regressions;

Notes from v4:

* the machine attribute that enables this new logic (see last patch)
is called ::enforce_valid_iova since the RFC. Let me know if folks think it
is poorly named, and whether something a bit more obvious is preferred
(e.g. ::amd_relocate_1t).

* @mst one of your comments was to add "host checks" in vdpa/vfio devices.
In discussion with Alex and you over the last version of the patches it seems
that we weren't keen on making this device-specific or behind any machine
property flags (besides machine-compat). Just to reiterate, making sure we do
the above-4g relocation, requiring properly sized phys-bits and AMD as the vCPU
vendor (as this series does), already ensures that this is going to be right for
offending configurations with vDPA/VFIO devices that might be
configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
start poking into machine-specific details when we fail to relocate due to the
lack of phys-bits? Otherwise QEMU just doesn't have enough information to tell
what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
that needs fixing (as VFIO did in v5.4).

RFCv2[3] -> v3[4]:

* Add missing brackets in single line statement, in patch 5 (David)
* Change ranges printf to use PRIx64, in patch 5 (David)
* Move the check to after changing above_4g_mem_start, in patch 5 (David)
* Make the check generic and move it to pc_memory_init rather than being specific
to AMD, as the check is useful to capture invalid phys-bits
configs (patch 5, Igor).
* Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
* Consider pci_hole64_size in patch 4 (Igor)
* To consider pci_hole64_size in max used addr we need to get it from pci-host,
so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
qdev_new("q35") to be before pc_memory_init().
* Consider sgx_epc.size in max used address, in patch 4 (Igor)
* Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
* Keep warn_report() in patch 5, as erroring out would break a few x86_64 qtests
due to pci_hole64 accounting surpassing the maximum physical address allowed by phys-bits.

RFC[0] -> RFCv2[3]:

* At Igor's suggestion in one of the patches I reworked the series entirely,
and, more or less as he was thinking, it is far simpler to relocate the
ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
and less intrusive. (patch 1 & 2)
* Check phys-bits is big enough prior to relocating (new patch 3)
* Remove the machine property; it's now only internal and set by the new machine
version (Igor, patch 4).
* Clarify whether it's GPA or HPA for a clearer meaning (Igor, patch 2)
* Add the IOMMU SDM reference to the commit message (Igor, patch 2)

[0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
[1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
[2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
[3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
[4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
[5] https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.martins@oracle.com/

Joao Martins (5):
  hw/i386: add 4g boundary start to X86MachineState
  i386/pc: create pci-host qdev prior to pc_memory_init()
  i386/pc: pass pci_hole64_size to pc_memory_init()
  i386/pc: relocate 4g start to 1T where applicable
  i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
    type

 hw/i386/acpi-build.c         |   2 +-
 hw/i386/pc.c                 | 126 +++++++++++++++++++++++++++++++++--
 hw/i386/pc_piix.c            |  12 +++-
 hw/i386/pc_q35.c             |  14 +++-
 hw/i386/sgx.c                |   2 +-
 hw/i386/x86.c                |   1 +
 hw/pci-host/i440fx.c         |  10 ++-
 include/hw/i386/pc.h         |   4 +-
 include/hw/i386/x86.h        |   3 +
 include/hw/pci-host/i440fx.h |   3 +-
 10 files changed, 161 insertions(+), 16 deletions(-)

-- 
2.17.2
Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Alex Williamson 1 year, 10 months ago
On Fri, 20 May 2022 11:45:27 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> v4[5] -> v5:
> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> commit message;
> 
> ---
> 
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as following:
> 
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>          address of the 4G boundary
> 
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> 	     to get accessing to pci_hole64_size. The actual pci-host
> 	     initialization is kept as is, only the qdev_new.
> 
> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> possible address acrosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
> 
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios. 
> 
> Additionally, the reserved region is added to E820 if the relocation is done.

I was helping a user on irc yesterday who was assigning a bunch of GPUs
on an AMD system and was not specifying an increased PCI hole and
therefore was not triggering the relocation.  The result was that the
VM doesn't know about this special range and given their guest RAM
size, firmware was mapping GPU BARs overlapping this reserved range
anyway.  I didn't see any evidence that this user was doing anything
like booting with pci=nocrs to blatantly ignore the firmware provided
bus resources.

To avoid this sort of thing, shouldn't this hypertransport range always
be marked reserved regardless of whether the relocation is done?

vfio-pci won't generate a fatal error when MMIO mappings fail, so this
scenario can be rather subtle.  NB, it also did not resolve this user's
problem to specify the PCI hole size and activate the relocation, so
this was not necessarily the issue they were fighting, but I noted it
as an apparent gap in this series.  Thanks,

Alex
Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Joao Martins 1 year, 10 months ago
On 6/22/22 23:37, Alex Williamson wrote:
> On Fri, 20 May 2022 11:45:27 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> v4[5] -> v5:
>> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
>> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
>> commit message;
>>
>> ---
>>
>> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
>> particularly when running on AMD systems with an IOMMU.
>>
>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>
>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>> of the failure:
>>
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>> 	failed to setup container for group 258: memory listener initialization failed:
>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>
>> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>> as documented on the links down below.
>>
>> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
>> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
>> to be above the 1T if the maximum RAM range crosses the HT reserved range.
>> It is organized as following:
>>
>> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>>          address of the 4G boundary
>>
>> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
>> 	     to get accessing to pci_hole64_size. The actual pci-host
>> 	     initialization is kept as is, only the qdev_new.
>>
>> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
>> possible address acrosses the HT region. Errors out if the phys-bits is too
>> low, which is only the case for >=1010G configurations or something that
>> crosses the HT region.
>>
>> patch 5: Ensure valid IOVAs only on new machine types, but not older
>> ones (<= v7.0.0)
>>
>> The 'consequence' of this approach is that we may need more than the default
>> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
>> address, consequently needing 41 phys-bits as opposed to the default of 40
>> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
>> pick the right value of phys-bits (regardless of this series), so we warn in
>> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
>> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
>> old seabios. 
>>
>> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> I was helping a user on irc yesterday who was assigning a bunch of GPUs
> on an AMD system and was not specifying an increased PCI hole and
> therefore was not triggering the relocation.  The result was that the
> VM doesn't know about this special range and given their guest RAM
> size, firmware was mapping GPU BARs overlapping this reserved range
> anyway.  I didn't see any evidence that this user was doing anything
> like booting with pci=nocrs to blatantly ignore the firmware provided
> bus resources.
> 
> To avoid this sort of thing, shouldn't this hypertransport range always
> be marked reserved regardless of whether the relocation is done?
> 
Yeap, I think that's the right thing to do. We were alluding to that in patch 4.

I can switch said patch to IS_AMD() together with a phys-bits check to add the
range to e820.
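
Roughly something like this (just a sketch; the guard variables and constants
are placeholders that would need checking against the actual patch):

/* Always publish the HT window as reserved on AMD vCPUs that can address
 * it, regardless of whether the above-4G relocation happened. */
if (amd_vcpu && cpu_phys_bits >= 40) {
    /* FD_0000_0000h, 12 GiB */
    e820_add_entry(0xfd00000000ULL, 0x300000000ULL, E820_RESERVED);
}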

But in practice, right now, this is going to be merely informative and doesn't
change the outcome, as OVMF ignores reserved ranges if I understood that code
correctly.

Relocation is most effective at avoiding this reserved-range overlap issue
on guests with less than 1010GiB.

> vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> scenario can be rather subtle.  NB, it also did not resolve this user's
> problem to specify the PCI hole size and activate the relocation, so
> this was not necessarily the issue they were fighting, but I noted it
> as an apparent gap in this series.  Thanks,

So I take it that even after the user expanded the PCI hole64 size, and thus
the GPU BARs were placed in a non-reserved range... they still saw the MMIO
mappings fail?

	Joao
Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Alex Williamson 1 year, 10 months ago
On Thu, 23 Jun 2022 00:18:06 +0100
Joao Martins <joao.m.martins@oracle.com> wrote:

> On 6/22/22 23:37, Alex Williamson wrote:
> > On Fri, 20 May 2022 11:45:27 +0100
> > Joao Martins <joao.m.martins@oracle.com> wrote:  
> >> v4[5] -> v5:
> >> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> >> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> >> commit message;
> >>
> >> ---
> >>
> >> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> >> particularly when running on AMD systems with an IOMMU.
> >>
> >> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> >> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> >> affected by this extra validation. But AMD systems with IOMMU have a hole in
> >> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> >> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> >> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> >>
> >> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
> >>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> >> of the failure:
> >>
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> >> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> >> 	failed to setup container for group 258: memory listener initialization failed:
> >> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> >>
> >> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> >> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> >> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> >> as documented on the links down below.
> >>
> >> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> >> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> >> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> >> It is organized as following:
> >>
> >> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
> >>          address of the 4G boundary
> >>
> >> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> >> 	     to get accessing to pci_hole64_size. The actual pci-host
> >> 	     initialization is kept as is, only the qdev_new.
> >>
> >> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> >> possible address acrosses the HT region. Errors out if the phys-bits is too
> >> low, which is only the case for >=1010G configurations or something that
> >> crosses the HT region.
> >>
> >> patch 5: Ensure valid IOVAs only on new machine types, but not older
> >> ones (<= v7.0.0)
> >>
> >> The 'consequence' of this approach is that we may need more than the default
> >> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> >> address, consequently needing 41 phys-bits as opposed to the default of 40
> >> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> >> pick the right value of phys-bits (regardless of this series), so we warn in
> >> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> >> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> >> old seabios. 
> >>
> >> Additionally, the reserved region is added to E820 if the relocation is done.  
> > 
> > I was helping a user on irc yesterday who was assigning a bunch of GPUs
> > on an AMD system and was not specifying an increased PCI hole and
> > therefore was not triggering the relocation.  The result was that the
> > VM doesn't know about this special range and given their guest RAM
> > size, firmware was mapping GPU BARs overlapping this reserved range
> > anyway.  I didn't see any evidence that this user was doing anything
> > like booting with pci=nocrs to blatantly ignore the firmware provided
> > bus resources.
> > 
> > To avoid this sort of thing, shouldn't this hypertransport range always
> > be marked reserved regardless of whether the relocation is done?
> >   
> Yeap, I think that's the right thing to do. We were alluding to that in patch 4.
> 
> I can switch said patch to IS_AMD() together with a phys-bits check to add the
> range to e820.
> 
> But in practice, right now, this is going to be merely informative and doesn't
> change the outcome, as OVMF ignores reserved ranges if I understood that code
> correctly.

:-\

> relocation is most effective at avoiding this reserved-range overlapping issue
> on guests with less than a 1010GiB.

Do we need to do the relocation by default?

> > vfio-pci won't generate a fatal error when MMIO mappings fail, so this
> > scenario can be rather subtle.  NB, it also did not resolve this user's
> > problem to specify the PCI hole size and activate the relocation, so
> > this was not necessarily the issue they were fighting, but I noted it
> > as an apparent gap in this series.  Thanks,  
> 
> So I take it that even after the user expanded the PCI hole64 size and thus
> the GPU BARS were placed in a non-reserved range... still saw the MMIO
> mappings fail?

No, the mapping failures are resolved if the hole64 size is set, it's
just that there seem to be remaining issues that a device occasionally
gets into a bad state that isn't resolved by restarting the VM.
AFAICT, p2p mappings are not being used, so the faults were more of a
nuisance than actually contributing to the issues this user is working
through.  Thanks,

Alex
Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Joao Martins 1 year, 10 months ago
On 6/23/22 17:03, Alex Williamson wrote:
> On Thu, 23 Jun 2022 00:18:06 +0100
> Joao Martins <joao.m.martins@oracle.com> wrote:
>> On 6/22/22 23:37, Alex Williamson wrote:
>>> On Fri, 20 May 2022 11:45:27 +0100
>>> Joao Martins <joao.m.martins@oracle.com> wrote:  
>>>> v4[5] -> v5:
>>>> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
>>>> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
>>>> commit message;
>>>>
>>>> ---
>>>>
>>>> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
>>>> particularly when running on AMD systems with an IOMMU.
>>>>
>>>> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
>>>> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
>>>> affected by this extra validation. But AMD systems with IOMMU have a hole in
>>>> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
>>>> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
>>>> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
>>>>
>>>> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>>>>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
>>>> of the failure:
>>>>
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
>>>> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
>>>> 	failed to setup container for group 258: memory listener initialization failed:
>>>> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
>>>>
>>>> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
>>>> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
>>>> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
>>>> as documented on the links down below.
>>>>
>>>> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
>>>> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
>>>> to be above the 1T if the maximum RAM range crosses the HT reserved range.
>>>> It is organized as following:
>>>>
>>>> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>>>>          address of the 4G boundary
>>>>
>>>> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
>>>> 	     to get accessing to pci_hole64_size. The actual pci-host
>>>> 	     initialization is kept as is, only the qdev_new.
>>>>
>>>> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
>>>> possible address acrosses the HT region. Errors out if the phys-bits is too
>>>> low, which is only the case for >=1010G configurations or something that
>>>> crosses the HT region.
>>>>
>>>> patch 5: Ensure valid IOVAs only on new machine types, but not older
>>>> ones (<= v7.0.0)
>>>>
>>>> The 'consequence' of this approach is that we may need more than the default
>>>> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
>>>> address, consequently needing 41 phys-bits as opposed to the default of 40
>>>> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
>>>> pick the right value of phys-bits (regardless of this series), so we warn in
>>>> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
>>>> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
>>>> old seabios. 
>>>>
>>>> Additionally, the reserved region is added to E820 if the relocation is done.  
>>>
>>> I was helping a user on irc yesterday who was assigning a bunch of GPUs
>>> on an AMD system and was not specifying an increased PCI hole and
>>> therefore was not triggering the relocation.  The result was that the
>>> VM doesn't know about this special range and given their guest RAM
>>> size, firmware was mapping GPU BARs overlapping this reserved range
>>> anyway.  I didn't see any evidence that this user was doing anything
>>> like booting with pci=nocrs to blatantly ignore the firmware provided
>>> bus resources.
>>>
>>> To avoid this sort of thing, shouldn't this hypertransport range always
>>> be marked reserved regardless of whether the relocation is done?
>>>   
>> Yeap, I think that's the right thing to do. We were alluding to that in patch 4.
>>
>> I can switch said patch to IS_AMD() together with a phys-bits check to add the
>> range to e820.
>>
>> But in practice, right now, this is going to be merely informative and doesn't
>> change the outcome, as OVMF ignores reserved ranges if I understood that code
>> correctly.
> 
> :-\
> 
>> relocation is most effective at avoiding this reserved-range overlapping issue
>> on guests with less than a 1010GiB.
> 
> Do we need to do the relocation by default?
> 
Given the dependency on phys-bits > 40 (TCG_PHYS_ADDR_BITS), maybe not.

Plus this might not be the common case, considering that it is restricted to VMs that have
something close to 1TB of memory (say 768GB+) *and* have VFs attached with BARs big
enough to cross the 1010G..1T reserved region.

... Unless we could get an idea of how much surplus the PCI hole64 size will have (based
on device BAR sizes) to understand if it's enough, and relocate based on that. Albeit in
QEMU, VFIO comes in at a late stage compared to the memmap construction.

>>> vfio-pci won't generate a fatal error when MMIO mappings fail, so this
>>> scenario can be rather subtle.  NB, it also did not resolve this user's
>>> problem to specify the PCI hole size and activate the relocation, so
>>> this was not necessarily the issue they were fighting, but I noted it
>>> as an apparent gap in this series.  Thanks,  
>>
>> So I take it that even after the user expanded the PCI hole64 size and thus
>> the GPU BARS were placed in a non-reserved range... still saw the MMIO
>> mappings fail?
> 
> No, the mapping failures are resolved if the hole64 size is set, it's
> just that there seem to be remaining issues that a device occasionally
> gets into a bad state that isn't resolved by restarting the VM.

/me nods

> AFAICT, p2p mappings are not being used, so the faults were more of a
> nuisance than actually contributing to the issues this user is working
> through.  Thanks

Ah OK -- thanks for enlightening me.
Re: [PATCH v5 0/5] i386/pc: Fix creation of >= 1010G guests on AMD systems with IOMMU
Posted by Joao Martins 1 year, 10 months ago
On 5/20/22 11:45, Joao Martins wrote:
> v4[5] -> v5:
> * Fixed the 32-bit build(s) (patch 1, Michael Tsirkin)
> * Fix wrong reference (patch 4) to TCG_PHYS_BITS in code comment and
> commit message;
> 
> ---
> 
> This series lets Qemu spawn i386 guests with >= 1010G with VFIO,
> particularly when running on AMD systems with an IOMMU.
> 
> Since Linux v5.4, VFIO validates whether the IOVA in DMA_MAP ioctl is valid and it
> will return -EINVAL on those cases. On x86, Intel hosts aren't particularly
> affected by this extra validation. But AMD systems with IOMMU have a hole in
> the 1TB boundary which is *reserved* for HyperTransport I/O addresses located
> here: FD_0000_0000h - FF_FFFF_FFFFh. See IOMMU manual [1], specifically
> section '2.1.2 IOMMU Logical Topology', Table 3 on what those addresses mean.
> 
> VFIO DMA_MAP calls in this IOVA address range fall through this check and hence return
>  -EINVAL, consequently failing the creation the guests bigger than 1010G. Example
> of the failure:
> 
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: VFIO_MAP_DMA: -22
> qemu-system-x86_64: -device vfio-pci,host=0000:41:10.1,bootindex=-1: vfio 0000:41:10.1: 
> 	failed to setup container for group 258: memory listener initialization failed:
> 		Region pc.ram: vfio_dma_map(0x55ba53e7a9d0, 0x100000000, 0xff30000000, 0x7ed243e00000) = -22 (Invalid argument)
> 
> Prior to v5.4, we could map to these IOVAs *but* that's still not the right thing
> to do and could trigger certain IOMMU events (e.g. INVALID_DEVICE_REQUEST), or
> spurious guest VF failures from the resultant IOMMU target abort (see Errata 1155[2])
> as documented on the links down below.
> 
> This small series tries to address that by dealing with this AMD-specific 1Tb hole,
> but rather than dealing like the 4G hole, it instead relocates RAM above 4G
> to be above the 1T if the maximum RAM range crosses the HT reserved range.
> It is organized as following:
> 
> patch 1: Introduce a @above_4g_mem_start which defaults to 4 GiB as starting
>          address of the 4G boundary
> 
> patches 2-3: Move pci-host qdev creation to be before pc_memory_init(),
> 	     to get accessing to pci_hole64_size. The actual pci-host
> 	     initialization is kept as is, only the qdev_new.
> 
> patch 4: Change @above_4g_mem_start to 1TiB /if we are on AMD and the max
> possible address acrosses the HT region. Errors out if the phys-bits is too
> low, which is only the case for >=1010G configurations or something that
> crosses the HT region.
> 
> patch 5: Ensure valid IOVAs only on new machine types, but not older
> ones (<= v7.0.0)
> 
> The 'consequence' of this approach is that we may need more than the default
> phys-bits e.g. a guest with >1010G, will have most of its RAM after the 1TB
> address, consequently needing 41 phys-bits as opposed to the default of 40
> (TCG_PHYS_ADDR_BITS). Today there's already a precedent to depend on the user to
> pick the right value of phys-bits (regardless of this series), so we warn in
> case phys-bits aren't enough. Finally, CMOS loosing its meaning of the above 4G
> ram blocks, but it was mentioned over RFC that CMOS is only useful for very
> old seabios. 
> 
> Additionally, the reserved region is added to E820 if the relocation is done.
> 
> Alternative options considered (in RFC[0]):
> 
> a) Dealing with the 1T hole like the 4G hole -- which also represents what
> hardware closely does.
> 
> Thanks,
> 	Joao
> 

Ping?

> Older Changelog,
> 
> v3[4] -> v4[5]:
> (changes in patch 4 and 5 only)
> * Rebased to 7.1.0, hence move compat machine attribute to <= 7.0.0 versions
> * Check guest vCPU vendor rather than host CPU vendor (Michael Tsirkin)
> * Squash previous patch 5 into patch 4 to tie in the phys-bits check
>   into the relocate-4g-start logic: We now error out if the phys-bits
>   aren't enough on configurations that require above-4g ram relocation. (Michael Tsirkin)
> * Make the error message more explicit when phys-bits isn't enough to also
>   mention: "cannot avoid AMD HT range"
> * Add comments inside x86_update_above_4g_mem_start() explaining the
>   logic behind it. (Michael Tsirkin)
> * Tested on old guests old guests with Linux 2.6.32/3.10/4.14.35/4.1 based kernels
>   alongside Win2008/2K12/2K16/2K19 on configs spanning 1T and 2T (Michael Tsirkin)
>   Validated -numa topologies too as well as making sure qtests observe no regressions;
> 
>  Notes from v4:
> 
> * the machine attribute that enables this new logic (see last patch)
> is called ::enforce_valid_iova since the RFC. Let me know if folks think it
> is poorly named, and whether something a bit more obvious is preferred
> (e.g. ::amd_relocate_1t).
> 
> * @mst one of the comments you said was to add "host checks" in vdpa/vfio devices.
> In discussion with Alex and you over the last version of the patches it seems
> that we weren't keen on making this device-specific or behind any machine
> property flags (besides machine-compat). Just to reiterate there, making sure we do
> the above-4g relocation requiring properly sized phys-bits and AMD as vCPU
> vendor (as this series) already ensures thtat this is going to be right for
> offending configuration with VDPA/VFIO device that might be
> configured/hotplugged. Unless you were thinking that somehow vfio/vdpa devices
> start poking into machine-specific details when we fail to relocate due to the
> lack of phys-bits? Otherwise Qemu, just doesn't have enough information to tell
> what's a valid IOVA or not, in which case kernel vhost-iotlb/vhost-vdpa is the one
> that needs fixing (as VFIO did in v5.4).
> 
> RFCv2[3] -> v3[4]:
> 
> * Add missing brackets in single line statement, in patch 5 (David)
> * Change ranges printf to use PRIx64, in patch 5 (David)
> * Move the check to after changing above_4g_mem_start, in patch 5 (David)
> * Make the check generic and move it to pc_memory_init rather being specific
> to AMD, as the check is useful to capture invalid phys-bits
> configs (patch 5, Igor).
> * Fix comment as 'Start address of the initial RAM above 4G' in patch 1 (Igor)
> * Consider pci_hole64_size in patch 4 (Igor)
> * To consider pci_hole64_size in max used addr we need to get it from pci-host,
> so introduce two new patches (2 and 3) which move only the qdev_new("i440fx") or
> qdev_new("q35") to be before pc_memory_init().
> * Consider sgx_epc.size in max used address, in patch 4 (Igor)
> * Rename relocate_4g() to x86_update_above_4g_mem_start() (Igor)
> * Keep warn_report() in patch 5, as erroring out will break a few x86_64 qtests
> due to pci_hole64 accounting surprass phys-bits possible maxphysaddr.
> 
> RFC[0] -> RFCv2[3]:
> 
> * At Igor's suggestion in one of the patches I reworked the series enterily,
> and more or less as he was thinking it is far simpler to relocate the
> ram-above-4g to be at 1TiB where applicable. The changeset is 3x simpler,
> and less intrusive. (patch 1 & 2)
> * Check phys-bits is big enough prior to relocating (new patch 3)
> * Remove the machine property, and it's only internal and set by new machine
> version (Igor, patch 4).
> * Clarify whether it's GPA or HPA as a more clear meaning (Igor, patch 2)
> * Add IOMMU SDM in the commit message (Igor, patch 2)
> 
> [0] https://lore.kernel.org/qemu-devel/20210622154905.30858-1-joao.m.martins@oracle.com/
> [1] https://www.amd.com/system/files/TechDocs/48882_IOMMU.pdf
> [2] https://developer.amd.com/wp-content/resources/56323-PUB_0.78.pdf
> [3] https://lore.kernel.org/qemu-devel/20220207202422.31582-1-joao.m.martins@oracle.com/T/#u
> [4] https://lore.kernel.org/all/20220223184455.9057-1-joao.m.martins@oracle.com/
> [5] https://lore.kernel.org/qemu-devel/20220420201138.23854-1-joao.m.martins@oracle.com/
> 
> Joao Martins (5):
>   hw/i386: add 4g boundary start to X86MachineState
>   i386/pc: create pci-host qdev prior to pc_memory_init()
>   i386/pc: pass pci_hole64_size to pc_memory_init()
>   i386/pc: relocate 4g start to 1T where applicable
>   i386/pc: restrict AMD only enforcing of valid IOVAs to new machine
>     type
> 
>  hw/i386/acpi-build.c         |   2 +-
>  hw/i386/pc.c                 | 126 +++++++++++++++++++++++++++++++++--
>  hw/i386/pc_piix.c            |  12 +++-
>  hw/i386/pc_q35.c             |  14 +++-
>  hw/i386/sgx.c                |   2 +-
>  hw/i386/x86.c                |   1 +
>  hw/pci-host/i440fx.c         |  10 ++-
>  include/hw/i386/pc.h         |   4 +-
>  include/hw/i386/x86.h        |   3 +
>  include/hw/pci-host/i440fx.h |   3 +-
>  10 files changed, 161 insertions(+), 16 deletions(-)
>