MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ 6 files changed, 168 insertions(+), 3 deletions(-)
Repo: https://github.com/lersek/edk2.git Branch: pci_host_controllers_unmap This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Eric Dong <eric.dong@intel.com> Cc: Star Zeng <star.zeng@intel.com> Thanks Laszlo Laszlo Ersek (4): MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/AtaAtapiPassThru: unmap common buffers at ExitBootServices() MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ 6 files changed, 168 insertions(+), 3 deletions(-) -- 2.14.1.3.gb7cf6e02401b _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Curious about one question when I am reviewing this patch series. Does/Should the IOMMU component disable the address translation at ExitBootServices? Thanks, Star -----Original Message----- From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek Sent: Monday, September 4, 2017 3:55 AM To: edk2-devel-01 <edk2-devel@lists.01.org> Cc: Dong, Eric <eric.dong@intel.com>; Zeng, Star <star.zeng@intel.com> Subject: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() Repo: https://github.com/lersek/edk2.git Branch: pci_host_controllers_unmap This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Eric Dong <eric.dong@intel.com> Cc: Star Zeng <star.zeng@intel.com> Thanks Laszlo Laszlo Ersek (4): MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot services MdeModulePkg/AtaAtapiPassThru: unmap common buffers at ExitBootServices() MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ 6 files changed, 168 insertions(+), 3 deletions(-) -- 2.14.1.3.gb7cf6e02401b _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/04/17 12:36, Zeng, Star wrote: > Curious about one question when I am reviewing this patch series. > > Does/Should the IOMMU component disable the address translation at ExitBootServices? It is a very good and interesting question. My answer is, I don't know. The IOMMU protocol abstraction is specific to edk2, and its behavior is not documented in the PI or UEFI specs (as far as I know). For one example, the VT-d IOMMU implementation in IntelSiliconPkg/IntelVTdDxe zaps all translations at ExitBootServices(): VOID EFIAPI OnExitBootServices ( IN EFI_EVENT Event, IN VOID *Context ) { DEBUG ((DEBUG_INFO, "Vtd OnExitBootServices\n")); DumpVtdRegsAll (); DisableDmar (); DumpVtdRegsAll (); } Now, in case someone suggested that the same approach should be followed by all IOMMU implementations, I'd have three counter-arguments: (1) My series states the problem (and seeks to solve it) on the PciIo protocol level, which is standardized. The IOMMU protocol is edk2-only. I feel that this kind of "unmap on ExitBootServices" should be done by all UEFI drivers that map DMA common buffers, regardless of platform-specific PciIo implementations. (2) ExitBootServices() notify functions are called in unspecified order (there's no way to express dependencies between them). This means that, following the above pattern, an IO address translation could be torn down before a reliant PCI device is de-configured in its own UEFI_DRIVER ExitBootServices() handler. The result can be a series of IOMMU faults. I'm not sure if that's bad in practice, but it does look like a layering violation. (We shouldn't tear down resources from the bottom up.) (3) The DisableDmar() function is relatively simple. It uses VT-d registers that exist independently of any PCI devices, and of any "live" mappings. Some IOMMU protocol implementations might be different -- for them the live configuration might only be known by constantly tracking the full set of Map() and Unmap() operations. That's a hurdle for the implementation. Furthermore, what is supposed to happen if both the IOMMU driver and the individual PCI driver destroy the mapping at ExitBootServices()? If the PCI driver's notification function runs first, then the IOMMU driver can mark the mapping as "unmapped", and ignore it in its own notification function. If the IOMMU driver's notification function runs first, then it can again only mark the mapping as "unmapped", so that when the PCI driver tries to unmap it as well, things don't blow up. This seems to imply that an unmapped mapping data structure can never be recycled, because we never know what action might follow. This sounds very confusing to me. Now, there are two arguments in favor of the IOMMU ExitBootServices() callback as well: - Security. It doesn't matter if some driver forgets to de-configure a translation, the IOMMU driver won't. Layering violations be damned. - In ExitBootServices() notification functions, the UEFI memory map must not be altered (memory cannot be released). This means that Bus Master Write and Bus Master Read operations, for pending transfers, cannot be unmapped (only CommonBuffer operations can be), because unmapping Write/Read might free memory (bounce buffers) *implicitly*. For CommonBuffer operations, separate PciIo.FreeBuffer() calls are necessary for releasing memory, thus the notification functions can simply *not* call FreeBuffer(), to stay safe. But this is not possible for pending Read/Write operations. My current thinking is that (a) PCI drivers should unmap pending Common Buffer operations at ExitBootServices; (b) PCI drivers should never return from their protocol member functions with pending Write/Read operations, only CommonBuffer operations (put differently, if the operation is asynchronous on the higher protocol level, then make the underlying BMDMA operation CommonBuffer); (c) IOMMU protocol implementations should not be required to clean up translations, only allowed to, if they can do it without interfering with (a). Basically, the question is, who *owns* the translations. Based on the PciIo interfaces, and on the requirement that PCI drivers Map() and Unmap() buffers, my opinion is that the PCI drivers own the translations. If you think that we should take this question to the USWG, I'm OK with that. Thanks, Laszlo > -----Original Message----- > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek > Sent: Monday, September 4, 2017 3:55 AM > To: edk2-devel-01 <edk2-devel@lists.01.org> > Cc: Dong, Eric <eric.dong@intel.com>; Zeng, Star <star.zeng@intel.com> > Subject: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > Repo: https://github.com/lersek/edk2.git > Branch: pci_host_controllers_unmap > > This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. > > Cc: Brijesh Singh <brijesh.singh@amd.com> > Cc: Eric Dong <eric.dong@intel.com> > Cc: Star Zeng <star.zeng@intel.com> > > Thanks > Laszlo > > Laszlo Ersek (4): > MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/AtaAtapiPassThru: unmap common buffers at > ExitBootServices() > > MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ > MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- > MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- > MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- > MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ > 6 files changed, 168 insertions(+), 3 deletions(-) > > -- > 2.14.1.3.gb7cf6e02401b > > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
HI Laszlo Thank you very much to raise this DMA topic. I have a chat with Star. We have couples of question, from high level to low level. Hope you can clarify: 1) About disable DMA As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature ========================= Examples from the EDK II that use this feature are the PCI device drivers for USB Host Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a memory buffer to poll for operation requests. Access to this memory buffer by a USB Host Controller may be required to boot an operation system, but this activity must be terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot Services Event for these types of drivers is to disable the PCI bus master and place the USB Host Controller into a halted state ========================= I believe the fundamental requirement is to "disable DMA". As such, the simplest action is to clear PCI Bus Master Enable bit in the command register. Do you think clearing BME is enough? Is there any special reason that we must unmap the DMA buffer? 2) About common buffer If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer? I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice. If we have strong requirement to unmap the DMA buffer, we should achieve that. 3) About interaction with IOMMU I am not worried about IOMMU interaction. I believe an proper IOMMU implementation should not block any action taken by a PCI device driver. Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer. 4) About the driver you modified I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI). If we need take this approach, I assume they also need update, right? Thank you Yao Jiewen From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek Sent: Tuesday, September 5, 2017 5:20 AM To: Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: Dong, Eric <eric.dong@intel.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() On 09/04/17 12:36, Zeng, Star wrote: > Curious about one question when I am reviewing this patch series. > > Does/Should the IOMMU component disable the address translation at ExitBootServices? It is a very good and interesting question. My answer is, I don't know. The IOMMU protocol abstraction is specific to edk2, and its behavior is not documented in the PI or UEFI specs (as far as I know). For one example, the VT-d IOMMU implementation in IntelSiliconPkg/IntelVTdDxe zaps all translations at ExitBootServices(): VOID EFIAPI OnExitBootServices ( IN EFI_EVENT Event, IN VOID *Context ) { DEBUG ((DEBUG_INFO, "Vtd OnExitBootServices\n")); DumpVtdRegsAll (); DisableDmar (); DumpVtdRegsAll (); } Now, in case someone suggested that the same approach should be followed by all IOMMU implementations, I'd have three counter-arguments: (1) My series states the problem (and seeks to solve it) on the PciIo protocol level, which is standardized. The IOMMU protocol is edk2-only. I feel that this kind of "unmap on ExitBootServices" should be done by all UEFI drivers that map DMA common buffers, regardless of platform-specific PciIo implementations. (2) ExitBootServices() notify functions are called in unspecified order (there's no way to express dependencies between them). This means that, following the above pattern, an IO address translation could be torn down before a reliant PCI device is de-configured in its own UEFI_DRIVER ExitBootServices() handler. The result can be a series of IOMMU faults. I'm not sure if that's bad in practice, but it does look like a layering violation. (We shouldn't tear down resources from the bottom up.) (3) The DisableDmar() function is relatively simple. It uses VT-d registers that exist independently of any PCI devices, and of any "live" mappings. Some IOMMU protocol implementations might be different -- for them the live configuration might only be known by constantly tracking the full set of Map() and Unmap() operations. That's a hurdle for the implementation. Furthermore, what is supposed to happen if both the IOMMU driver and the individual PCI driver destroy the mapping at ExitBootServices()? If the PCI driver's notification function runs first, then the IOMMU driver can mark the mapping as "unmapped", and ignore it in its own notification function. If the IOMMU driver's notification function runs first, then it can again only mark the mapping as "unmapped", so that when the PCI driver tries to unmap it as well, things don't blow up. This seems to imply that an unmapped mapping data structure can never be recycled, because we never know what action might follow. This sounds very confusing to me. Now, there are two arguments in favor of the IOMMU ExitBootServices() callback as well: - Security. It doesn't matter if some driver forgets to de-configure a translation, the IOMMU driver won't. Layering violations be damned. - In ExitBootServices() notification functions, the UEFI memory map must not be altered (memory cannot be released). This means that Bus Master Write and Bus Master Read operations, for pending transfers, cannot be unmapped (only CommonBuffer operations can be), because unmapping Write/Read might free memory (bounce buffers) *implicitly*. For CommonBuffer operations, separate PciIo.FreeBuffer() calls are necessary for releasing memory, thus the notification functions can simply *not* call FreeBuffer(), to stay safe. But this is not possible for pending Read/Write operations. My current thinking is that (a) PCI drivers should unmap pending Common Buffer operations at ExitBootServices; (b) PCI drivers should never return from their protocol member functions with pending Write/Read operations, only CommonBuffer operations (put differently, if the operation is asynchronous on the higher protocol level, then make the underlying BMDMA operation CommonBuffer); (c) IOMMU protocol implementations should not be required to clean up translations, only allowed to, if they can do it without interfering with (a). Basically, the question is, who *owns* the translations. Based on the PciIo interfaces, and on the requirement that PCI drivers Map() and Unmap() buffers, my opinion is that the PCI drivers own the translations. If you think that we should take this question to the USWG, I'm OK with that. Thanks, Laszlo > -----Original Message----- > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek > Sent: Monday, September 4, 2017 3:55 AM > To: edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> > Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>> > Subject: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > Repo: https://github.com/lersek/edk2.git > Branch: pci_host_controllers_unmap > > This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. > > Cc: Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> > Cc: Eric Dong <eric.dong@intel.com<mailto:eric.dong@intel.com>> > Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com>> > > Thanks > Laszlo > > Laszlo Ersek (4): > MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot > services > MdeModulePkg/AtaAtapiPassThru: unmap common buffers at > ExitBootServices() > > MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ > MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- > MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- > MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- > MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ > 6 files changed, 168 insertions(+), 3 deletions(-) > > -- > 2.14.1.3.gb7cf6e02401b > > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> https://lists.01.org/mailman/listinfo/edk2-devel _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/05/17 04:18, Yao, Jiewen wrote: > HI Laszlo > Thank you very much to raise this DMA topic. > > I have a chat with Star. We have couples of question, from high level to low level. > Hope you can clarify: > > > 1) About disable DMA > > As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature > ========================= > Examples from the EDK II that use this feature are the PCI device drivers for USB Host > Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a > memory buffer to poll for operation requests. Access to this memory buffer by a USB > Host Controller may be required to boot an operation system, but this activity must be > terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot > Services Event for these types of drivers is to disable the PCI bus master and place the > USB Host Controller into a halted state > ========================= > > I believe the fundamental requirement is to "disable DMA". > As such, the simplest action is to clear PCI Bus Master Enable bit in the command register. > > Do you think clearing BME is enough? > Is there any special reason that we must unmap the DMA buffer? Modifying the device state is enough to ask the device *nicely* to stop doing DMA. But, I think, if we are on an IOMMU-protected system, where part of PciIo.Map() is to add a translation (= system memory access permission) to the IOMMU, then we should also revoke this permission explicitly, *after* we asked the device nicely to stop doing DMA. Basically, ask the device nicely first, and then enforce the protection. ... BTW, I mentioned "IOMMU faults" earlier, and I have some further thoughts on them. This is about tearing down IOMMU translations *first*, in the IOMMU driver's ExitBootServices() handler, *before* the PCI drivers get a chance -- in *their* ExitBootServices() handlers -- to ask the PCI devices nicely to stop doing DMA. I think a quite problematic failure mode is possible here. Assume that a disk controller has some write commands queued (which are mapped with Bus Master Read or Bus Master CommonBuffer operations). If the ExitBootServices() handler for the IOMMU driver runs first, then the device could act upon these queued commands with missing IOMMU translations, before the device's own PCI driver asked the device to shut down DMA. In some cases, I guess this could result in those queued write commands simply failing, without ill effects. However, dependent on the IOMMU implementation, the write commands might not fail explicitly, but read garbage from system memory, and write garbage to disk blocks. That would be bad. I think this is exactly what could happen with the SEV IOMMU we have. The memory would be first re-encrypted (this is equivalent to revoking a translation), and then the write commands (queued in the emulated AHCI controller, in the hypervisor) would read garbage from guest memory, and write garbage to the virtual disk. > > > 2) About common buffer > If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer? > > I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice. I believe I disagree with this (i.e., I think it is not an implementation choice). In the general case, we can assume that PciIo.Map() allocates a bounce buffer internally, when Read or Write action is requested. In turn, PciIo.Unmap() has to release the same buffer, internally. PciIo.Unmap() does not know if it is called from an ExitBootServices() notification function, or from a normal context. It cannot decide if it is allowed to release memory, or not. There's no additional parameter to provide this hint either. If PciIo.Unmap() never releases buffers, then ExitBootServices() will be happy, but we're going to leak memory. Now, PciIo.Unmap() could move the bounce buffer to some internal "free list" instead, rather than freeing it. And, the next PciIo.Map() call could try to reuse a buffer from the "free list". But this is difficult to do, because the buffers may have any size, and recycling them would require an internal, general "page pool", in parallel to the Boot Services page pool (AllocatePages, FreePages). If PciIo.Unmap() could be informed that it is running in ExitBootServices() notification context -- called from PCI drivers --, then unmapping Read and Write operations would be simple. > If we have strong requirement to unmap the DMA buffer, we should achieve that. I totally agree that we *should* achieve that, but I don't see how the current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read and Write operations. For CommonBuffer operations, it is possible, because AllocateBuffer() and FreeBuffer() are separate actions, and an ExitBootServices() notification function can *avoid* calling FreeBuffer. (But only for CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release a bounce buffer for Read/Write. > > > 3) About interaction with IOMMU > I am not worried about IOMMU interaction. > I believe an proper IOMMU implementation should not block any action taken by a PCI device driver. > Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer. Ah, I understand! So, on one hand, that is totally safe, for PCI device drivers. It means all pending DMA can get through, until the PCI drivers ask the devices (nicely) to stop doing DMA. On the other hand, it is the opposite of the security goal of SEV. With SEV, the default state is "all DMA forbidden" (= all guest memory encrypted). This is also the status that we'd like to achieve when ExitBootServices() completes. When control is transferred to the OS, all guest memory should be encrypted again, even those areas that were once used as CommonBuffers. So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU are opposites -- VT-d permits all DMA unless configured otherwise, while SEV forbids all DMA unless configured otherwise. > 4) About the driver you modified > I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI). > If we need take this approach, I assume they also need update, right? Yes, they should be updated similarly. SDMMC and UFS are not included in OVMF, so I couldn't test their behavior. NVMe is included in OVMF, but it is used very rarely in practice (we can call it a corner case), so I thought I would first submit patches for the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was difficult :) Thank you! Laszlo > > Thank you > Yao Jiewen > > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek > Sent: Tuesday, September 5, 2017 5:20 AM > To: Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> > Cc: Dong, Eric <eric.dong@intel.com> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > On 09/04/17 12:36, Zeng, Star wrote: >> Curious about one question when I am reviewing this patch series. >> >> Does/Should the IOMMU component disable the address translation at ExitBootServices? > > It is a very good and interesting question. > > My answer is, I don't know. The IOMMU protocol abstraction is specific > to edk2, and its behavior is not documented in the PI or UEFI specs (as > far as I know). > > For one example, the VT-d IOMMU implementation in > > IntelSiliconPkg/IntelVTdDxe > > zaps all translations at ExitBootServices(): > > VOID > EFIAPI > OnExitBootServices ( > IN EFI_EVENT Event, > IN VOID *Context > ) > { > DEBUG ((DEBUG_INFO, "Vtd OnExitBootServices\n")); > DumpVtdRegsAll (); > DisableDmar (); > DumpVtdRegsAll (); > } > > Now, in case someone suggested that the same approach should be followed > by all IOMMU implementations, I'd have three counter-arguments: > > (1) My series states the problem (and seeks to solve it) on the PciIo > protocol level, which is standardized. The IOMMU protocol is edk2-only. > I feel that this kind of "unmap on ExitBootServices" should be done by > all UEFI drivers that map DMA common buffers, regardless of > platform-specific PciIo implementations. > > (2) ExitBootServices() notify functions are called in unspecified order > (there's no way to express dependencies between them). This means that, > following the above pattern, an IO address translation could be torn > down before a reliant PCI device is de-configured in its own UEFI_DRIVER > ExitBootServices() handler. The result can be a series of IOMMU faults. > I'm not sure if that's bad in practice, but it does look like a layering > violation. (We shouldn't tear down resources from the bottom up.) > > (3) The DisableDmar() function is relatively simple. It uses VT-d > registers that exist independently of any PCI devices, and of any "live" > mappings. > > Some IOMMU protocol implementations might be different -- for them the > live configuration might only be known by constantly tracking the full > set of Map() and Unmap() operations. That's a hurdle for the implementation. > > Furthermore, what is supposed to happen if both the IOMMU driver and the > individual PCI driver destroy the mapping at ExitBootServices()? If the > PCI driver's notification function runs first, then the IOMMU driver can > mark the mapping as "unmapped", and ignore it in its own notification > function. If the IOMMU driver's notification function runs first, then > it can again only mark the mapping as "unmapped", so that when the PCI > driver tries to unmap it as well, things don't blow up. This seems to > imply that an unmapped mapping data structure can never be recycled, > because we never know what action might follow. This sounds very > confusing to me. > > > Now, there are two arguments in favor of the IOMMU ExitBootServices() > callback as well: > > - Security. It doesn't matter if some driver forgets to de-configure a > translation, the IOMMU driver won't. Layering violations be damned. > > - In ExitBootServices() notification functions, the UEFI memory map must > not be altered (memory cannot be released). This means that Bus Master > Write and Bus Master Read operations, for pending transfers, cannot be > unmapped (only CommonBuffer operations can be), because unmapping > Write/Read might free memory (bounce buffers) *implicitly*. For > CommonBuffer operations, separate PciIo.FreeBuffer() calls are necessary > for releasing memory, thus the notification functions can simply *not* > call FreeBuffer(), to stay safe. But this is not possible for pending > Read/Write operations. > > > My current thinking is that (a) PCI drivers should unmap pending Common > Buffer operations at ExitBootServices; (b) PCI drivers should never > return from their protocol member functions with pending Write/Read > operations, only CommonBuffer operations (put differently, if the > operation is asynchronous on the higher protocol level, then make the > underlying BMDMA operation CommonBuffer); (c) IOMMU protocol > implementations should not be required to clean up translations, only > allowed to, if they can do it without interfering with (a). > > Basically, the question is, who *owns* the translations. Based on the > PciIo interfaces, and on the requirement that PCI drivers Map() and > Unmap() buffers, my opinion is that the PCI drivers own the translations. > > If you think that we should take this question to the USWG, I'm OK with > that. > > Thanks, > Laszlo > > > >> -----Original Message----- >> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek >> Sent: Monday, September 4, 2017 3:55 AM >> To: edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>> >> Subject: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> >> Repo: https://github.com/lersek/edk2.git >> Branch: pci_host_controllers_unmap >> >> This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. >> >> Cc: Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> >> Cc: Eric Dong <eric.dong@intel.com<mailto:eric.dong@intel.com>> >> Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com>> >> >> Thanks >> Laszlo >> >> Laszlo Ersek (4): >> MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/AtaAtapiPassThru: unmap common buffers at >> ExitBootServices() >> >> MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ >> MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- >> MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- >> MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- >> MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ >> 6 files changed, 168 insertions(+), 3 deletions(-) >> >> -- >> 2.14.1.3.gb7cf6e02401b >> >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> >> https://lists.01.org/mailman/listinfo/edk2-devel >> > > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Good discussion. Comment in line. From: Laszlo Ersek [mailto:lersek@redhat.com] Sent: Tuesday, September 5, 2017 5:16 PM To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: Dong, Eric <eric.dong@intel.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() On 09/05/17 04:18, Yao, Jiewen wrote: > HI Laszlo > Thank you very much to raise this DMA topic. > > I have a chat with Star. We have couples of question, from high level to low level. > Hope you can clarify: > > > 1) About disable DMA > > As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature > ========================= > Examples from the EDK II that use this feature are the PCI device drivers for USB Host > Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a > memory buffer to poll for operation requests. Access to this memory buffer by a USB > Host Controller may be required to boot an operation system, but this activity must be > terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot > Services Event for these types of drivers is to disable the PCI bus master and place the > USB Host Controller into a halted state > ========================= > > I believe the fundamental requirement is to "disable DMA". > As such, the simplest action is to clear PCI Bus Master Enable bit in the command register. > > Do you think clearing BME is enough? > Is there any special reason that we must unmap the DMA buffer? Modifying the device state is enough to ask the device *nicely* to stop doing DMA. But, I think, if we are on an IOMMU-protected system, where part of PciIo.Map() is to add a translation (= system memory access permission) to the IOMMU, then we should also revoke this permission explicitly, *after* we asked the device nicely to stop doing DMA. Basically, ask the device nicely first, and then enforce the protection. ... BTW, I mentioned "IOMMU faults" earlier, and I have some further thoughts on them. This is about tearing down IOMMU translations *first*, in the IOMMU driver's ExitBootServices() handler, *before* the PCI drivers get a chance -- in *their* ExitBootServices() handlers -- to ask the PCI devices nicely to stop doing DMA. I think a quite problematic failure mode is possible here. Assume that a disk controller has some write commands queued (which are mapped with Bus Master Read or Bus Master CommonBuffer operations). If the ExitBootServices() handler for the IOMMU driver runs first, then the device could act upon these queued commands with missing IOMMU translations, before the device's own PCI driver asked the device to shut down DMA. In some cases, I guess this could result in those queued write commands simply failing, without ill effects. However, dependent on the IOMMU implementation, the write commands might not fail explicitly, but read garbage from system memory, and write garbage to disk blocks. That would be bad. I think this is exactly what could happen with the SEV IOMMU we have. The memory would be first re-encrypted (this is equivalent to revoking a translation), and then the write commands (queued in the emulated AHCI controller, in the hypervisor) would read garbage from guest memory, and write garbage to the virtual disk. [Jiewen] I am not clear on how it happens in SEV here. Below is my thought, please correct me if I am wrong. First, the fundamental requirement is to disable BME at exit boot service, or make sure the controller is in halt state. If a device driver fails to meet such requirement, it is a bug we need to fix at first. Now, let's assume all device driver is in halt state at ExitBootService. - that is my understanding for all rest discussion. I checked OVMF and I do not see any code in IOMMU driver to register exit boot services event, which means the DMA buffer is still decrypted. If that is the case, the DMA transfer before BME disable is safety (DMA buffer is decrypted) and there will be no DMA transfer any more at ExitBootService (BME disabled). Then how does the write command queued in the emulated AHCI controller start read garbage? The SEV IOMMU driver only modifies the page table. Then after ExitBootService, the OS will take control of CR3 and set correct encryption bit. NOTE: The device should still be halt state, because the device is also controlled by OS. > > > 2) About common buffer > If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer? > > I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice. I believe I disagree with this (i.e., I think it is not an implementation choice). In the general case, we can assume that PciIo.Map() allocates a bounce buffer internally, when Read or Write action is requested. [Jiewen] When I say implementation choice, I mean Map() *may* allocates a bounce buffer or it just uses the same address, if the platform is capable of using DRAM as DMA buffer directly. In turn, PciIo.Unmap() has to release the same buffer, internally. PciIo.Unmap() does not know if it is called from an ExitBootServices() notification function, or from a normal context. It cannot decide if it is allowed to release memory, or not. There's no additional parameter to provide this hint either. If PciIo.Unmap() never releases buffers, then ExitBootServices() will be happy, but we're going to leak memory. Now, PciIo.Unmap() could move the bounce buffer to some internal "free list" instead, rather than freeing it. And, the next PciIo.Map() call could try to reuse a buffer from the "free list". But this is difficult to do, because the buffers may have any size, and recycling them would require an internal, general "page pool", in parallel to the Boot Services page pool (AllocatePages, FreePages). If PciIo.Unmap() could be informed that it is running in ExitBootServices() notification context -- called from PCI drivers --, then unmapping Read and Write operations would be simple. > If we have strong requirement to unmap the DMA buffer, we should achieve that. I totally agree that we *should* achieve that, but I don't see how the current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read and Write operations. For CommonBuffer operations, it is possible, because AllocateBuffer() and FreeBuffer() are separate actions, and an ExitBootServices() notification function can *avoid* calling FreeBuffer. (But only for CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release a bounce buffer for Read/Write. [Jiewen] Again, it is an implementation choice. It is possible that a host bridge allocate common buffer without calling any Allocate(). But another implementation may also allocate internal data structure to record every map() action, including common buffer. (For event log, or tracing purpose, et etc.) I do not think Free() should bring difference between common buffer or read/write buffer. I am a little worried that we have too many assumption here: e.g. Map() common buffer never involves Allocate() even for driver internal data structure , and Map() read/write buffer always involves Allocate() even there is no need to allocate anything. With that assumption, we purposely avoid Unmap() read/write buffer, but force Unmap() common buffer. I am not sure if that assumption is correct. > > > 3) About interaction with IOMMU > I am not worried about IOMMU interaction. > I believe an proper IOMMU implementation should not block any action taken by a PCI device driver. > Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer. Ah, I understand! So, on one hand, that is totally safe, for PCI device drivers. It means all pending DMA can get through, until the PCI drivers ask the devices (nicely) to stop doing DMA. [Jiewen] Yes. I think any IOMMU implementation should take that into consideration. On the other hand, it is the opposite of the security goal of SEV. With SEV, the default state is "all DMA forbidden" (= all guest memory encrypted). This is also the status that we'd like to achieve when ExitBootServices() completes. When control is transferred to the OS, all guest memory should be encrypted again, even those areas that were once used as CommonBuffers. [Jiewen] I am not sure who will control "When control is transferred to the OS, all guest memory should be encrypted again, even those areas that were once used as CommonBuffers." For SEV case, I think it should be controlled by OS, because it is just CR3. If so, we should not meet any problem, because we already disable BME in device driver's ExitBootService event. So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU are opposites -- VT-d permits all DMA unless configured otherwise, while SEV forbids all DMA unless configured otherwise. [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. I setup translation table, but mark all entry to be NOT-PRESENT. I grant DMA access only if there is a specific request by a device. I open DMA access in ExitBootServices, just want to make sure it is compatible to the existing OS, which might not have knowledge on IOMMU. I will provide a patch to make it a policy very soon. As such VTd IOMMU may turn off IOMMU, or keep it enabled at ExitBootService. An IOMMU aware OS may take over IOMMU directly and reprogram it. > 4) About the driver you modified > I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI). > If we need take this approach, I assume they also need update, right? Yes, they should be updated similarly. SDMMC and UFS are not included in OVMF, so I couldn't test their behavior. NVMe is included in OVMF, but it is used very rarely in practice (we can call it a corner case), so I thought I would first submit patches for the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was difficult :) [Jiewen] Understood. That is OK. We can handle that in separate patch. Thank you! Laszlo > > Thank you > Yao Jiewen > > > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek > Sent: Tuesday, September 5, 2017 5:20 AM > To: Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> > Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > On 09/04/17 12:36, Zeng, Star wrote: >> Curious about one question when I am reviewing this patch series. >> >> Does/Should the IOMMU component disable the address translation at ExitBootServices? > > It is a very good and interesting question. > > My answer is, I don't know. The IOMMU protocol abstraction is specific > to edk2, and its behavior is not documented in the PI or UEFI specs (as > far as I know). > > For one example, the VT-d IOMMU implementation in > > IntelSiliconPkg/IntelVTdDxe > > zaps all translations at ExitBootServices(): > > VOID > EFIAPI > OnExitBootServices ( > IN EFI_EVENT Event, > IN VOID *Context > ) > { > DEBUG ((DEBUG_INFO, "Vtd OnExitBootServices\n")); > DumpVtdRegsAll (); > DisableDmar (); > DumpVtdRegsAll (); > } > > Now, in case someone suggested that the same approach should be followed > by all IOMMU implementations, I'd have three counter-arguments: > > (1) My series states the problem (and seeks to solve it) on the PciIo > protocol level, which is standardized. The IOMMU protocol is edk2-only. > I feel that this kind of "unmap on ExitBootServices" should be done by > all UEFI drivers that map DMA common buffers, regardless of > platform-specific PciIo implementations. > > (2) ExitBootServices() notify functions are called in unspecified order > (there's no way to express dependencies between them). This means that, > following the above pattern, an IO address translation could be torn > down before a reliant PCI device is de-configured in its own UEFI_DRIVER > ExitBootServices() handler. The result can be a series of IOMMU faults. > I'm not sure if that's bad in practice, but it does look like a layering > violation. (We shouldn't tear down resources from the bottom up.) > > (3) The DisableDmar() function is relatively simple. It uses VT-d > registers that exist independently of any PCI devices, and of any "live" > mappings. > > Some IOMMU protocol implementations might be different -- for them the > live configuration might only be known by constantly tracking the full > set of Map() and Unmap() operations. That's a hurdle for the implementation. > > Furthermore, what is supposed to happen if both the IOMMU driver and the > individual PCI driver destroy the mapping at ExitBootServices()? If the > PCI driver's notification function runs first, then the IOMMU driver can > mark the mapping as "unmapped", and ignore it in its own notification > function. If the IOMMU driver's notification function runs first, then > it can again only mark the mapping as "unmapped", so that when the PCI > driver tries to unmap it as well, things don't blow up. This seems to > imply that an unmapped mapping data structure can never be recycled, > because we never know what action might follow. This sounds very > confusing to me. > > > Now, there are two arguments in favor of the IOMMU ExitBootServices() > callback as well: > > - Security. It doesn't matter if some driver forgets to de-configure a > translation, the IOMMU driver won't. Layering violations be damned. > > - In ExitBootServices() notification functions, the UEFI memory map must > not be altered (memory cannot be released). This means that Bus Master > Write and Bus Master Read operations, for pending transfers, cannot be > unmapped (only CommonBuffer operations can be), because unmapping > Write/Read might free memory (bounce buffers) *implicitly*. For > CommonBuffer operations, separate PciIo.FreeBuffer() calls are necessary > for releasing memory, thus the notification functions can simply *not* > call FreeBuffer(), to stay safe. But this is not possible for pending > Read/Write operations. > > > My current thinking is that (a) PCI drivers should unmap pending Common > Buffer operations at ExitBootServices; (b) PCI drivers should never > return from their protocol member functions with pending Write/Read > operations, only CommonBuffer operations (put differently, if the > operation is asynchronous on the higher protocol level, then make the > underlying BMDMA operation CommonBuffer); (c) IOMMU protocol > implementations should not be required to clean up translations, only > allowed to, if they can do it without interfering with (a). > > Basically, the question is, who *owns* the translations. Based on the > PciIo interfaces, and on the requirement that PCI drivers Map() and > Unmap() buffers, my opinion is that the PCI drivers own the translations. > > If you think that we should take this question to the USWG, I'm OK with > that. > > Thanks, > Laszlo > > > >> -----Original Message----- >> From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek >> Sent: Monday, September 4, 2017 3:55 AM >> To: edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>> >> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com<mailto:eric.dong@intel.com%3cmailto:eric.dong@intel.com>>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>> >> Subject: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> >> Repo: https://github.com/lersek/edk2.git >> Branch: pci_host_controllers_unmap >> >> This series updates four PCI Host Controller drivers so that they don't leave CommonBuffer mappings hanging when control is transfered to the OS. >> >> Cc: Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com<mailto:brijesh.singh@amd.com%3cmailto:brijesh.singh@amd.com>>> >> Cc: Eric Dong <eric.dong@intel.com<mailto:eric.dong@intel.com<mailto:eric.dong@intel.com%3cmailto:eric.dong@intel.com>>> >> Cc: Star Zeng <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>> >> >> Thanks >> Laszlo >> >> Laszlo Ersek (4): >> MdeModulePkg/UhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/EhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/XhciDxe: unmap BM common buffers when exiting boot >> services >> MdeModulePkg/AtaAtapiPassThru: unmap common buffers at >> ExitBootServices() >> >> MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AhciMode.h | 18 ++++++ >> MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.h | 5 ++ MdeModulePkg/Bus/Ata/AtaAtapiPassThru/AtaAtapiPassThru.c | 67 +++++++++++++++++++- >> MdeModulePkg/Bus/Pci/EhciDxe/Ehci.c | 25 +++++++- >> MdeModulePkg/Bus/Pci/UhciDxe/Uhci.c | 25 +++++++- >> MdeModulePkg/Bus/Pci/XhciDxe/Xhci.c | 31 +++++++++ >> 6 files changed, 168 insertions(+), 3 deletions(-) >> >> -- >> 2.14.1.3.gb7cf6e02401b >> >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>> >> https://lists.01.org/mailman/listinfo/edk2-devel >> > > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/05/17 15:44, Yao, Jiewen wrote: > Good discussion. Comment in line. > > From: Laszlo Ersek [mailto:lersek@redhat.com] > Sent: Tuesday, September 5, 2017 5:16 PM > To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> > Cc: Dong, Eric <eric.dong@intel.com> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > On 09/05/17 04:18, Yao, Jiewen wrote: >> HI Laszlo >> Thank you very much to raise this DMA topic. >> >> I have a chat with Star. We have couples of question, from high level to low level. >> Hope you can clarify: >> >> >> 1) About disable DMA >> >> As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature >> ========================= >> Examples from the EDK II that use this feature are the PCI device drivers for USB Host >> Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a >> memory buffer to poll for operation requests. Access to this memory buffer by a USB >> Host Controller may be required to boot an operation system, but this activity must be >> terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot >> Services Event for these types of drivers is to disable the PCI bus master and place the >> USB Host Controller into a halted state >> ========================= >> >> I believe the fundamental requirement is to "disable DMA". >> As such, the simplest action is to clear PCI Bus Master Enable bit in the command register. >> >> Do you think clearing BME is enough? >> Is there any special reason that we must unmap the DMA buffer? > > Modifying the device state is enough to ask the device *nicely* to stop > doing DMA. > > But, I think, if we are on an IOMMU-protected system, where part of > PciIo.Map() is to add a translation (= system memory access permission) > to the IOMMU, then we should also revoke this permission explicitly, > *after* we asked the device nicely to stop doing DMA. > > Basically, ask the device nicely first, and then enforce the protection. > > ... BTW, I mentioned "IOMMU faults" earlier, and I have some further > thoughts on them. This is about tearing down IOMMU translations *first*, > in the IOMMU driver's ExitBootServices() handler, *before* the PCI > drivers get a chance -- in *their* ExitBootServices() handlers -- to ask > the PCI devices nicely to stop doing DMA. > > I think a quite problematic failure mode is possible here. Assume that a > disk controller has some write commands queued (which are mapped with > Bus Master Read or Bus Master CommonBuffer operations). If the > ExitBootServices() handler for the IOMMU driver runs first, then the > device could act upon these queued commands with missing IOMMU > translations, before the device's own PCI driver asked the device to > shut down DMA. In some cases, I guess this could result in those queued > write commands simply failing, without ill effects. However, dependent > on the IOMMU implementation, the write commands might not fail > explicitly, but read garbage from system memory, and write garbage to > disk blocks. That would be bad. > > I think this is exactly what could happen with the SEV IOMMU we have. > The memory would be first re-encrypted (this is equivalent to revoking a > translation), and then the write commands (queued in the emulated AHCI > controller, in the hypervisor) would read garbage from guest memory, and > write garbage to the virtual disk. > > [Jiewen] I am not clear on how it happens in SEV here. > Below is my thought, please correct me if I am wrong. > > First, the fundamental requirement is to disable BME at exit boot service, > or make sure the controller is in halt state. > If a device driver fails to meet such requirement, it is a bug we need to fix at first. > > Now, let's assume all device driver is in halt state at ExitBootService. - that is my > understanding for all rest discussion. OK. > I checked OVMF and I do not see any code in IOMMU driver to register exit boot services event, > which means the DMA buffer is still decrypted. More precisely, the IOMMU driver does not take it upon itself to re-encrypt DMA buffers at ExitBootServices(). Instead it relies on the PCI drivers to Unmap() CommonBuffers at ExitBootServices(). > If that is the case, the DMA transfer before BME disable is safety (DMA buffer is decrypted) and > there will be no DMA transfer any more at ExitBootService (BME disabled). > Then how does the write command queued in the emulated AHCI controller start read garbage? It is not happening right now. When I wrote that, I was just speculating about possible consequences; i.e., what *could* happen if the IOMMU driver itself re-encrypted memory at ExitBootServices(). *Then* the garbage case could happen. In other words, the case I described does not happen right now; I only described it as a counter-argument against *adding* an ExitBootServices() handler to the IOMMU driver. Under OvmfPkg, the IOMMU behavior for SEV, and the PCI drivers for the virtio devices, are fully in synch. The PCI drivers *alone* are responsible for unmapping CommonBuffers at ExitBootServices(), and so the IOMMU driver doesn't do it. The question is if the same distribution of responsibilities applies to PCI drivers that live outside of OvmfPkg, such as MdeModulePkg. The current patch set suggests "yes". > The SEV IOMMU driver only modifies the page table. Not exactly. It modifies page table entries, and then re-writes the memory covered by those page table entries. Simply flipping the C-bit in page table entries does not reencrypt memory contents, it only sets up the next read and/or write that happens through that page mapping, for "decrypted view" or "encrypted view", from the virtualization host side. > Then after ExitBootService, the OS will take control of CR3 and set correct > encryption bit. This is true, the guest OS will set up new page tables, and in those PTEs, the C-bit ("encrypt") will likely be set by default. However, the guest OS will not *rewrite* all of the memory, with the C-bit set. This means that any memory that the firmware didn't actually re-encrypt (in the IOMMU driver) will still expose stale data to the hypervisor. > NOTE: The device should still be halt state, because the device is > also controlled by OS. This is the key point -- the device is *not* controlled by the guest OS, in the worst case. The virtual device is ultimately implemented by the hypervisor. We don't expect the virtual device (= the hypervisor) to mis-behave on purpose. However, if the hypervisor is compromised by an external attacker (for example over the network, or via privilege escalation from another guest), then all guest RAM that has not been encrypted up to that point becomes visible to the attacker. In other words, the hypervisor ("the device") becomes retro-actively malicious. The point of SEV is to keep as much guest data encrypted at all times as possible, so that *whenever* the hypervisor is compromised by an attacker, the guest memory -- which the attacker can inevitably dump from the host side -- remains un-intellegible to the attacker. >> 2) About common buffer >> If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer? >> >> I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice. > > I believe I disagree with this (i.e., I think it is not an > implementation choice). > > In the general case, we can assume that PciIo.Map() allocates a bounce > buffer internally, when Read or Write action is requested. > > [Jiewen] When I say implementation choice, I mean Map() *may* allocates > a bounce buffer or it just uses the same address, if the platform is capable of > using DRAM as DMA buffer directly. I have two counter-arguments: (a) As long as we don't want to reimplement PciIo.Map() from the scratch, the current layering of protocols in edk2 implies that even if the PCI device driver sets DUAL_ADDRESS_CYCLE, this flag can be cleared on the way to the IOMMU protocol, in PciHostBridgeDxe, if PciHostBridgeLib says "no DMA above 4GB". OvmfPkg's PciHostBridgeLib currently says "no DMA above 4GB", simply because that's always been the case, and I'm not sure if we can lift the limitation without regressions. (b) The more important counter-argument is that Map() and Unmap() need auxiliary memory (a temporary buffer) for actually encrypting and decrypting memory, so that the virtual device (implemented by the hypervisor) can read it or write it (temporarily). As I wrote above, encrypting or decrypting the memory contents works by writing the original data through a new page mapping that has the desired C-bit setting. In a guest operating system, in-place encryption is possible by having two virtual mappings to the same physical address range, one with the C-bit set, and another with the C-bit clear. For encrypting, the guest OS reads a cache-line sized chunk of data through the "C-bit clear" mapping, and writes it back through the "C-bit set" mapping. For decrypting, it does the inverse. In UEFI, we don't have two virtual page mappings for the same physical address, we only have a 1:1 mapping, on which we can flip the C-bit. This means that, for decryption for example, we have to copy the data from its original location (currently encrypted) to a "stash" (which is always encrypted), modify the PTEs for the original location (clear the C-bit, flush the TLB), then finally copy the data back from the "stash" (still encrypted) to the original location (which will effectively decrypt it, for the hypervisor to read). This "stash" requires room. Currently the "stash" has an equal number of pages to the actual region being mapped. For CommonBuffer operations, the stash is allocated in AllocateBuffer(), together with the normal buffer's allocation. It is freed similarly in FreeBuffer(). For Read/Write BMDMA operations, for which the PCI driver only calls Map()/Unmap(), we cannot avoid allocating the stash in Map(), and freeing it in Unmap(). We are practically forced to allocate a bounce buffer not due to address width limitations (32 vs 64), but due to C-bit status. Now, earlier I had an idea to add a 1-page sized global array variable to the IOMMU driver, and do the encryption/decryption page by page. In other words, "unroll" the PTE management and the copying into page-sized chunks. This would eliminate the need to allocate a "stash" dynamically. However, the drawback of this approach is that the PTE's must be toggled one by one even for a larger region. It could cause undesirably high page table splitting and bad performance. So, the SEV IOMMU could only eliminate dynamic memory alloc/dealloc for Read/Write DMA operations allocation if both conditions were changed: - if we knew for sure that 64-bit DMA was safe to advertize in PciHostBridgeLib, - if we replaced the dynamically allocated stash, and the "batch" PTE management + copying, with a single-page static buffer, and "unrolled" PTE management + copying. More generally speaking, 64-bit clean DMA might simply not be possible on a given hardware platform, and on that platform, PciIo.Map() could not avoid allocating bounce buffers for Read/Write DMA operations, generally speaking. Thus, we shouldn't require a PCI driver to call Unmap() at ExitBootServices(). > In turn, PciIo.Unmap() has to release the same buffer, internally. > PciIo.Unmap() does not know if it is called from an ExitBootServices() > notification function, or from a normal context. It cannot decide if it > is allowed to release memory, or not. There's no additional parameter to > provide this hint either. If PciIo.Unmap() never releases buffers, then > ExitBootServices() will be happy, but we're going to leak memory. > > Now, PciIo.Unmap() could move the bounce buffer to some internal "free > list" instead, rather than freeing it. And, the next PciIo.Map() call > could try to reuse a buffer from the "free list". But this is difficult > to do, because the buffers may have any size, and recycling them would > require an internal, general "page pool", in parallel to the Boot > Services page pool (AllocatePages, FreePages). > > If PciIo.Unmap() could be informed that it is running in > ExitBootServices() notification context -- called from PCI drivers --, > then unmapping Read and Write operations would be simple. > >> If we have strong requirement to unmap the DMA buffer, we should achieve that. > > I totally agree that we *should* achieve that, but I don't see how the > current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read > and Write operations. > > For CommonBuffer operations, it is possible, because AllocateBuffer() > and FreeBuffer() are separate actions, and an ExitBootServices() > notification function can *avoid* calling FreeBuffer. (But only for > CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release > a bounce buffer for Read/Write. > > [Jiewen] Again, it is an implementation choice. It is possible that a host > bridge allocate common buffer without calling any Allocate(). > But another implementation may also allocate internal data structure to > record every map() action, including common buffer. (For event log, or > tracing purpose, et etc.) True. My point is that - we could reasonably require that Unmap() work for CommonBuffer operations without releasing memory, - but we couldn't reasonably require the same for Read and Write operations. This is because for CommonBuffer, a platform that right now allocates extra stuff in Map(), could move the same allocations into AllocateBuffer(). Similarly, move the current de-allocations from Unmap() into FreeBuffer(). The same is not possible for Read/Write, so we shouldn't require that. > I do not think Free() should bring difference between common buffer or > read/write buffer. > I am a little worried that we have too many assumption here: > e.g. Map() common buffer never involves Allocate() even for driver internal > data structure , and Map() read/write buffer always involves Allocate() even > there is no need to allocate anything. > With that assumption, we purposely avoid Unmap() read/write buffer, > but force Unmap() common buffer. Exactly. > I am not sure if that assumption is correct. I'm just trying to operate with the APIs currently defined by the UEFI spec, and these assumptions were the best I could come up with. For example, if PciIo.Unmap() took one more parameter: IN BOOLEAN CalledAtExitBootServices then all this would be solved at once. Namely, PciIo implementations would be required to call Unmap() for *all* DMA operations (read, write, commonbuffer) in their ExitBootServices() notification functions, with the parameter set to TRUE. All other contexts should pass FALSE. In turn, CalledAtExitBootServices==TRUE would require PciIo.Unmap() to work exactly as otherwise, but all memory allocated earlier with gBS->AllocatePool() and gBS->AllocatePages() would have to be leaked on purpose. This is all. Given that this parameter does not exist, I'm trying to find other ways to: - perform the unmapping (re-encryption) at ExitBootServices, - without releasing memory, - and not in the IOMMU driver (because that could remove the translation before the PCI driver's own callback could disable the device, and that would be wrong) >> 3) About interaction with IOMMU >> I am not worried about IOMMU interaction. >> I believe an proper IOMMU implementation should not block any action taken by a PCI device driver. >> Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer. > > Ah, I understand! > > So, on one hand, that is totally safe, for PCI device drivers. It means > all pending DMA can get through, until the PCI drivers ask the devices > (nicely) to stop doing DMA. > [Jiewen] Yes. I think any IOMMU implementation should take that > into consideration. > > > > On the other hand, it is the opposite of the security goal of SEV. With > SEV, the default state is "all DMA forbidden" (= all guest memory > encrypted). This is also the status that we'd like to achieve when > ExitBootServices() completes. When control is transferred to the OS, all > guest memory should be encrypted again, even those areas that were once > used as CommonBuffers. > [Jiewen] I am not sure who will control "When control is transferred to the OS, all > guest memory should be encrypted again, even those areas that were once > used as CommonBuffers." > For SEV case, I think it should be controlled by OS, because it is just CR3. If it was indeed just CR3, then I would fully agree with you. But -- to my understanding --, ensuring that the memory is actually encrypted requires that the guest *write* to the memory through a virtual address whose PTE has the C-bit set. And I don't think the guest OS can be expected to rewrite all of its memory at launch. :( > If so, we should not meet any problem, because we already disable BME > in device driver's ExitBootService event. I agree, but installing new page tables is not enough. > So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU > are opposites -- VT-d permits all DMA unless configured otherwise, while > SEV forbids all DMA unless configured otherwise. > [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. > I setup translation table, but mark all entry to be NOT-PRESENT. > I grant DMA access only if there is a specific request by a device. > > I open DMA access in ExitBootServices, just want to make sure it is compatible to > the existing OS, which might not have knowledge on IOMMU. > I will provide a patch to make it a policy very soon. As such VTd IOMMU may > turn off IOMMU, or keep it enabled at ExitBootService. > An IOMMU aware OS may take over IOMMU directly and reprogram it. Thanks for the clarification. But, again, will this not lead to the possibility that the VT-d IOMMU's ExitBootServices() handler disables all DMA *before* the PCI drivers get a chance to run their own ExitBootServices() handlers, disabling the individual devices? If that happens, then a series of IOMMU faults could be generated, which I described above. (I.e., such IOMMU faults could result, at least in the case of SEV, in garbage being written to disk, via queued commands.) >> 4) About the driver you modified >> I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI). >> If we need take this approach, I assume they also need update, right? > > Yes, they should be updated similarly. > > SDMMC and UFS are not included in OVMF, so I couldn't test their behavior. > > NVMe is included in OVMF, but it is used very rarely in practice (we can > call it a corner case), so I thought I would first submit patches for > the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used > with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was > difficult :) > [Jiewen] Understood. That is OK. We can handle that in separate patch. Thanks! Now, to finish up, here's an idea I just had. Are we allowed to call gBS->SignalEvent() in an ExitBootServices() notification function? If we are, then we could do the following: * PciIo.Unmap() and friends would work as always (no restrictions on dynamic memory allocation or freeing, for any kind of DMA operation). * PCI drivers would not be expected to call PciIo.Unmap() in their ExitBootServices() handlers. * The IOMMU driver would have an ExitBootServices() notification function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level (doesn't matter which). * In this notification function, the IOMMU driver would signal *another* event (a private one). The notification function for this event would be enqueued strictly at the TPL_CALLBACK level. * The notification function for the second event (private to the IOMMU) would iterate over all existent mappings, and unmap them, without allocating or freeing memory. The key point is that by signaling the second event, the IOMMU driver could order its own cleanup code after all other ExitBootServices() callbacks. (Assuming, at least, that no other ExitBootServices() callback employs the same trick to defer itself!) Therefore by the time the second callback runs, all PCI devices have been halted, and it is safe to tear down the translations. The ordering would be ensured by the following: - The UEFI spec says under CreateEventEx(), "All events are guaranteed to be signaled before the first notification action is taken." This means that, by the time the IOMMU's ExitBootServices() handler is entered, all other ExitBootServices() handlers have been *queued* at least, at TPL_CALLBACK or TPL_NOTIFY. - Therefore, when we signal our second callback from there, for TPL_CALLBACK, the second callback will be queued at the end of the TPL_CALLBACK queue. - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag for the Type argument of CreateEvent." So it wouldn't matter if a driver used CreateEvent() or CreateEventEx() for setting up its own handler, the handler would be queued just the same. I think this is ugly and brittle, but perhaps the only way to clean up *all* translations safely, with regard to PciIo.Unmap() + ExitBootServices(), without updating the UEFI spec. What do you think? Thanks, Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Thanks for the clarification. Comment in line. From: Laszlo Ersek [mailto:lersek@redhat.com] Sent: Wednesday, September 6, 2017 1:57 AM To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: Dong, Eric <eric.dong@intel.com>; Brijesh Singh <brijesh.singh@amd.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() On 09/05/17 15:44, Yao, Jiewen wrote: > Good discussion. Comment in line. > > From: Laszlo Ersek [mailto:lersek@redhat.com] > Sent: Tuesday, September 5, 2017 5:16 PM > To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> > Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > On 09/05/17 04:18, Yao, Jiewen wrote: >> HI Laszlo >> Thank you very much to raise this DMA topic. >> >> I have a chat with Star. We have couples of question, from high level to low level. >> Hope you can clarify: >> >> >> 1) About disable DMA >> >> As you mentioned, in UEFI_Driver_Writer_Guide_V1.0.1_120308.pdf, 7.7 Adding the Exit Boot Services feature >> ========================= >> Examples from the EDK II that use this feature are the PCI device drivers for USB Host >> Controllers. Some USB Host Controllers are PCI Bus Masters that continuously access a >> memory buffer to poll for operation requests. Access to this memory buffer by a USB >> Host Controller may be required to boot an operation system, but this activity must be >> terminated when the OS calls ExitBootServices(). The typical action in the Exit Boot >> Services Event for these types of drivers is to disable the PCI bus master and place the >> USB Host Controller into a halted state >> ========================= >> >> I believe the fundamental requirement is to "disable DMA". >> As such, the simplest action is to clear PCI Bus Master Enable bit in the command register. >> >> Do you think clearing BME is enough? >> Is there any special reason that we must unmap the DMA buffer? > > Modifying the device state is enough to ask the device *nicely* to stop > doing DMA. > > But, I think, if we are on an IOMMU-protected system, where part of > PciIo.Map() is to add a translation (= system memory access permission) > to the IOMMU, then we should also revoke this permission explicitly, > *after* we asked the device nicely to stop doing DMA. > > Basically, ask the device nicely first, and then enforce the protection. > > ... BTW, I mentioned "IOMMU faults" earlier, and I have some further > thoughts on them. This is about tearing down IOMMU translations *first*, > in the IOMMU driver's ExitBootServices() handler, *before* the PCI > drivers get a chance -- in *their* ExitBootServices() handlers -- to ask > the PCI devices nicely to stop doing DMA. > > I think a quite problematic failure mode is possible here. Assume that a > disk controller has some write commands queued (which are mapped with > Bus Master Read or Bus Master CommonBuffer operations). If the > ExitBootServices() handler for the IOMMU driver runs first, then the > device could act upon these queued commands with missing IOMMU > translations, before the device's own PCI driver asked the device to > shut down DMA. In some cases, I guess this could result in those queued > write commands simply failing, without ill effects. However, dependent > on the IOMMU implementation, the write commands might not fail > explicitly, but read garbage from system memory, and write garbage to > disk blocks. That would be bad. > > I think this is exactly what could happen with the SEV IOMMU we have. > The memory would be first re-encrypted (this is equivalent to revoking a > translation), and then the write commands (queued in the emulated AHCI > controller, in the hypervisor) would read garbage from guest memory, and > write garbage to the virtual disk. > > [Jiewen] I am not clear on how it happens in SEV here. > Below is my thought, please correct me if I am wrong. > > First, the fundamental requirement is to disable BME at exit boot service, > or make sure the controller is in halt state. > If a device driver fails to meet such requirement, it is a bug we need to fix at first. > > Now, let's assume all device driver is in halt state at ExitBootService. - that is my > understanding for all rest discussion. OK. > I checked OVMF and I do not see any code in IOMMU driver to register exit boot services event, > which means the DMA buffer is still decrypted. More precisely, the IOMMU driver does not take it upon itself to re-encrypt DMA buffers at ExitBootServices(). Instead it relies on the PCI drivers to Unmap() CommonBuffers at ExitBootServices(). > If that is the case, the DMA transfer before BME disable is safety (DMA buffer is decrypted) and > there will be no DMA transfer any more at ExitBootService (BME disabled). > Then how does the write command queued in the emulated AHCI controller start read garbage? It is not happening right now. When I wrote that, I was just speculating about possible consequences; i.e., what *could* happen if the IOMMU driver itself re-encrypted memory at ExitBootServices(). [Jiewen] OK. Got it. *Then* the garbage case could happen. In other words, the case I described does not happen right now; I only described it as a counter-argument against *adding* an ExitBootServices() handler to the IOMMU driver. [Jiewen] I agree with you. A IOMMU driver should NOT use ExitBootServices to stop DMA transfer. Under OvmfPkg, the IOMMU behavior for SEV, and the PCI drivers for the virtio devices, are fully in synch. The PCI drivers *alone* are responsible for unmapping CommonBuffers at ExitBootServices(), and so the IOMMU driver doesn't do it. [Jiewen] I agree that IOMMU does not. But I hold my opinion on if the PCI driver is responsible for unmapping CommonBuffer at ExitBootServices(). The question is if the same distribution of responsibilities applies to PCI drivers that live outside of OvmfPkg, such as MdeModulePkg. [Jiewen] Yes. If we want to define a rule, the rule should be for *any* PCI device driver, no matter where it is from. It could be in OVMF package, MdeModulePkg, a private repo from Independent BIOS Vendor (IBV) or Original Equipment Manufacturer (OEM), and even from Independent Hardware Vendor (IHV). A special notice: The IHV driver should only follow the interface defined in UEFI specification, such as PCI_IO. The IHV driver should NOT consume any interface defined in MdeModulePkg, such as EDKII_IOMMU. The current patch set suggests "yes". > The SEV IOMMU driver only modifies the page table. Not exactly. It modifies page table entries, and then re-writes the memory covered by those page table entries. [Jiewen] Yes, you are right. There is StashBuffer associated with every common buffer. Simply flipping the C-bit in page table entries does not reencrypt memory contents, it only sets up the next read and/or write that happens through that page mapping, for "decrypted view" or "encrypted view", from the virtualization host side. [Jiewen] Thanks for the info. > Then after ExitBootService, the OS will take control of CR3 and set correct > encryption bit. This is true, the guest OS will set up new page tables, and in those PTEs, the C-bit ("encrypt") will likely be set by default. However, the guest OS will not *rewrite* all of the memory, with the C-bit set. This means that any memory that the firmware didn't actually re-encrypt (in the IOMMU driver) will still expose stale data to the hypervisor. [Jiewen] That is an interesting question. Is there any security concern associated? If the C-bit is cleared for a read/write buffer, is the content in the read/write buffer also exposed to hypervisor? I means if there is security concern, the concern should be applied to both common buffer and read/write buffer. Then we have to figure a way to resolve both buffer. To be honest, that is exactly my biggest question on this patch: why do we only handle the common buffer specially? > NOTE: The device should still be halt state, because the device is > also controlled by OS. This is the key point -- the device is *not* controlled by the guest OS, in the worst case. The virtual device is ultimately implemented by the hypervisor. We don't expect the virtual device (= the hypervisor) to mis-behave on purpose. However, if the hypervisor is compromised by an external attacker (for example over the network, or via privilege escalation from another guest), then all guest RAM that has not been encrypted up to that point becomes visible to the attacker. In other words, the hypervisor ("the device") becomes retro-actively malicious. [Jiewen] If that is the threat model, I have a question on the exposure: 1) If the concern is for existing data, it means all DMA data already written. However, the DMA data is already consumed or produced by virtual device or a hypervisor. If the hypervisor is attacked, it already gets all the data content. 2) if the concern is for future data, it means all user data to be written. However, the C-bit already prevent that. Maybe I do not fully understand the threat model defined. If you can explain more, that would be great. The point of SEV is to keep as much guest data encrypted at all times as possible, so that *whenever* the hypervisor is compromised by an attacker, the guest memory -- which the attacker can inevitably dump from the host side -- remains un-intellegible to the attacker. [Jiewen] OK. If this is a security question, I suggest we define a clear threat model at first. Or what we are doing might be unnecessary or insufficient. >> 2) About common buffer >> If there is a special requirement that we must unmap the DMA buffer, why we only unmap the common buffer? >> >> I understand your statement that unmapping a read buffer or write buffer may cause free memory action. But that is implementation choice. > > I believe I disagree with this (i.e., I think it is not an > implementation choice). > > In the general case, we can assume that PciIo.Map() allocates a bounce > buffer internally, when Read or Write action is requested. > > [Jiewen] When I say implementation choice, I mean Map() *may* allocates > a bounce buffer or it just uses the same address, if the platform is capable of > using DRAM as DMA buffer directly. I have two counter-arguments: (a) As long as we don't want to reimplement PciIo.Map() from the scratch, the current layering of protocols in edk2 implies that even if the PCI device driver sets DUAL_ADDRESS_CYCLE, this flag can be cleared on the way to the IOMMU protocol, in PciHostBridgeDxe, if PciHostBridgeLib says "no DMA above 4GB". OvmfPkg's PciHostBridgeLib currently says "no DMA above 4GB", simply because that's always been the case, and I'm not sure if we can lift the limitation without regressions. (b) The more important counter-argument is that Map() and Unmap() need auxiliary memory (a temporary buffer) for actually encrypting and decrypting memory, so that the virtual device (implemented by the hypervisor) can read it or write it (temporarily). As I wrote above, encrypting or decrypting the memory contents works by writing the original data through a new page mapping that has the desired C-bit setting. In a guest operating system, in-place encryption is possible by having two virtual mappings to the same physical address range, one with the C-bit set, and another with the C-bit clear. For encrypting, the guest OS reads a cache-line sized chunk of data through the "C-bit clear" mapping, and writes it back through the "C-bit set" mapping. For decrypting, it does the inverse. In UEFI, we don't have two virtual page mappings for the same physical address, we only have a 1:1 mapping, on which we can flip the C-bit. This means that, for decryption for example, we have to copy the data from its original location (currently encrypted) to a "stash" (which is always encrypted), modify the PTEs for the original location (clear the C-bit, flush the TLB), then finally copy the data back from the "stash" (still encrypted) to the original location (which will effectively decrypt it, for the hypervisor to read). This "stash" requires room. Currently the "stash" has an equal number of pages to the actual region being mapped. For CommonBuffer operations, the stash is allocated in AllocateBuffer(), together with the normal buffer's allocation. It is freed similarly in FreeBuffer(). For Read/Write BMDMA operations, for which the PCI driver only calls Map()/Unmap(), we cannot avoid allocating the stash in Map(), and freeing it in Unmap(). We are practically forced to allocate a bounce buffer not due to address width limitations (32 vs 64), but due to C-bit status. Now, earlier I had an idea to add a 1-page sized global array variable to the IOMMU driver, and do the encryption/decryption page by page. In other words, "unroll" the PTE management and the copying into page-sized chunks. This would eliminate the need to allocate a "stash" dynamically. However, the drawback of this approach is that the PTE's must be toggled one by one even for a larger region. It could cause undesirably high page table splitting and bad performance. So, the SEV IOMMU could only eliminate dynamic memory alloc/dealloc for Read/Write DMA operations allocation if both conditions were changed: - if we knew for sure that 64-bit DMA was safe to advertize in PciHostBridgeLib, - if we replaced the dynamically allocated stash, and the "batch" PTE management + copying, with a single-page static buffer, and "unrolled" PTE management + copying. More generally speaking, 64-bit clean DMA might simply not be possible on a given hardware platform, and on that platform, PciIo.Map() could not avoid allocating bounce buffers for Read/Write DMA operations, generally speaking. Thus, we shouldn't require a PCI driver to call Unmap() at ExitBootServices(). [Jiewen] I agree what you said above. And I did not treat that to be counter-arguments. :-) > In turn, PciIo.Unmap() has to release the same buffer, internally. > PciIo.Unmap() does not know if it is called from an ExitBootServices() > notification function, or from a normal context. It cannot decide if it > is allowed to release memory, or not. There's no additional parameter to > provide this hint either. If PciIo.Unmap() never releases buffers, then > ExitBootServices() will be happy, but we're going to leak memory. > > Now, PciIo.Unmap() could move the bounce buffer to some internal "free > list" instead, rather than freeing it. And, the next PciIo.Map() call > could try to reuse a buffer from the "free list". But this is difficult > to do, because the buffers may have any size, and recycling them would > require an internal, general "page pool", in parallel to the Boot > Services page pool (AllocatePages, FreePages). > > If PciIo.Unmap() could be informed that it is running in > ExitBootServices() notification context -- called from PCI drivers --, > then unmapping Read and Write operations would be simple. > >> If we have strong requirement to unmap the DMA buffer, we should achieve that. > > I totally agree that we *should* achieve that, but I don't see how the > current PciIo.Map() and PciIo.Unmap() APIs make it possible, for Read > and Write operations. > > For CommonBuffer operations, it is possible, because AllocateBuffer() > and FreeBuffer() are separate actions, and an ExitBootServices() > notification function can *avoid* calling FreeBuffer. (But only for > CommonBuffer operations.) PciIo.Unmap() cannot be told *not* to release > a bounce buffer for Read/Write. > > [Jiewen] Again, it is an implementation choice. It is possible that a host > bridge allocate common buffer without calling any Allocate(). > But another implementation may also allocate internal data structure to > record every map() action, including common buffer. (For event log, or > tracing purpose, et etc.) True. My point is that - we could reasonably require that Unmap() work for CommonBuffer operations without releasing memory, - but we couldn't reasonably require the same for Read and Write operations. This is because for CommonBuffer, a platform that right now allocates extra stuff in Map(), could move the same allocations into AllocateBuffer(). Similarly, move the current de-allocations from Unmap() into FreeBuffer(). The same is not possible for Read/Write, so we shouldn't require that. [Jiewen] For "require that Unmap() work for CommonBuffer operations without releasing memory", I would hold my opinion until it is documented clearly in UEFI spec. My current concern is: First, this sentence is NOT defined in UEFI specification. Second, it might break an existing implementation or existing feature, such as tracing. Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed to call memory services. The only restriction is 1) TPL restriction, where memory service must be <= TPL_NOTIFY. 2) AP restriction, where no UEFI service/protocol is allowed for AP. > I do not think Free() should bring difference between common buffer or > read/write buffer. > I am a little worried that we have too many assumption here: > e.g. Map() common buffer never involves Allocate() even for driver internal > data structure , and Map() read/write buffer always involves Allocate() even > there is no need to allocate anything. > With that assumption, we purposely avoid Unmap() read/write buffer, > but force Unmap() common buffer. Exactly. > I am not sure if that assumption is correct. I'm just trying to operate with the APIs currently defined by the UEFI spec, and these assumptions were the best I could come up with. [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. Especially the IHV Option ROM should not consume any private API. For example, if PciIo.Unmap() took one more parameter: IN BOOLEAN CalledAtExitBootServices then all this would be solved at once. Namely, PciIo implementations would be required to call Unmap() for *all* DMA operations (read, write, commonbuffer) in their ExitBootServices() notification functions, with the parameter set to TRUE. All other contexts should pass FALSE. In turn, CalledAtExitBootServices==TRUE would require PciIo.Unmap() to work exactly as otherwise, but all memory allocated earlier with gBS->AllocatePool() and gBS->AllocatePages() would have to be leaked on purpose. This is all. Given that this parameter does not exist, I'm trying to find other ways to: - perform the unmapping (re-encryption) at ExitBootServices, - without releasing memory, - and not in the IOMMU driver (because that could remove the translation before the PCI driver's own callback could disable the device, and that would be wrong) >> 3) About interaction with IOMMU >> I am not worried about IOMMU interaction. >> I believe an proper IOMMU implementation should not block any action taken by a PCI device driver. >> Current IntelVTdDxe driver disables VTd engine in ExitBootService, which means the DMA access is no longer managed by VTd engine. It is NOT related to disable DMA buffer. > > Ah, I understand! > > So, on one hand, that is totally safe, for PCI device drivers. It means > all pending DMA can get through, until the PCI drivers ask the devices > (nicely) to stop doing DMA. > [Jiewen] Yes. I think any IOMMU implementation should take that > into consideration. > > > > On the other hand, it is the opposite of the security goal of SEV. With > SEV, the default state is "all DMA forbidden" (= all guest memory > encrypted). This is also the status that we'd like to achieve when > ExitBootServices() completes. When control is transferred to the OS, all > guest memory should be encrypted again, even those areas that were once > used as CommonBuffers. > [Jiewen] I am not sure who will control "When control is transferred to the OS, all > guest memory should be encrypted again, even those areas that were once > used as CommonBuffers." > For SEV case, I think it should be controlled by OS, because it is just CR3. If it was indeed just CR3, then I would fully agree with you. But -- to my understanding --, ensuring that the memory is actually encrypted requires that the guest *write* to the memory through a virtual address whose PTE has the C-bit set. And I don't think the guest OS can be expected to rewrite all of its memory at launch. :( [Jiewen] That is fine. I suggest we get clear on the threat model as the first step. The threat model for SEV usage in OVMF. I am sorry if that is already discussed before. I might ignore the conversation. If you or any SEV owner can share the insight, that will be great. See my question above "If that is the threat model, I have a question on the exposure:..." > If so, we should not meet any problem, because we already disable BME > in device driver's ExitBootService event. I agree, but installing new page tables is not enough. > So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU > are opposites -- VT-d permits all DMA unless configured otherwise, while > SEV forbids all DMA unless configured otherwise. > [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. > I setup translation table, but mark all entry to be NOT-PRESENT. > I grant DMA access only if there is a specific request by a device. > > I open DMA access in ExitBootServices, just want to make sure it is compatible to > the existing OS, which might not have knowledge on IOMMU. > I will provide a patch to make it a policy very soon. As such VTd IOMMU may > turn off IOMMU, or keep it enabled at ExitBootService. > An IOMMU aware OS may take over IOMMU directly and reprogram it. Thanks for the clarification. But, again, will this not lead to the possibility that the VT-d IOMMU's ExitBootServices() handler disables all DMA *before* the PCI drivers get a chance to run their own ExitBootServices() handlers, disabling the individual devices? [Jiewen] Sorry for the confusing. Let me explain: No, VTd never disables *all* DMA buffer at ExitBootService. "disable VTd" means to "disable IOMMU protection" and "allow all DMA". "Keep VTd enabled" means to "keep IOMMU protection enabled" and "only allow the DMA from Map() request". If that happens, then a series of IOMMU faults could be generated, which I described above. (I.e., such IOMMU faults could result, at least in the case of SEV, in garbage being written to disk, via queued commands.) [Jiewen] You are right. That would not happen. IOMMU driver should not bring any compatibility problem for the PCI driver, iff the PCI driver follows the UEFI specification. >> 4) About the driver you modified >> I believe we have some other PCI device driver (NVMe/SDMMC/UFS), besides the ones you modified in this patch (ATA/UCHI/EHCI/XHCI). >> If we need take this approach, I assume they also need update, right? > > Yes, they should be updated similarly. > > SDMMC and UFS are not included in OVMF, so I couldn't test their behavior. > > NVMe is included in OVMF, but it is used very rarely in practice (we can > call it a corner case), so I thought I would first submit patches for > the four main drivers (UHCI, EHCI, XHCI, AHCI) that are frequently used > with OVMF and QEMU. Tracking down the CommonBuffer lifecycles was > difficult :) > [Jiewen] Understood. That is OK. We can handle that in separate patch. Thanks! Now, to finish up, here's an idea I just had. Are we allowed to call gBS->SignalEvent() in an ExitBootServices() notification function? If we are, then we could do the following: * PciIo.Unmap() and friends would work as always (no restrictions on dynamic memory allocation or freeing, for any kind of DMA operation). * PCI drivers would not be expected to call PciIo.Unmap() in their ExitBootServices() handlers. * The IOMMU driver would have an ExitBootServices() notification function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level (doesn't matter which). * In this notification function, the IOMMU driver would signal *another* event (a private one). The notification function for this event would be enqueued strictly at the TPL_CALLBACK level. * The notification function for the second event (private to the IOMMU) would iterate over all existent mappings, and unmap them, without allocating or freeing memory. The key point is that by signaling the second event, the IOMMU driver could order its own cleanup code after all other ExitBootServices() callbacks. (Assuming, at least, that no other ExitBootServices() callback employs the same trick to defer itself!) Therefore by the time the second callback runs, all PCI devices have been halted, and it is safe to tear down the translations. The ordering would be ensured by the following: - The UEFI spec says under CreateEventEx(), "All events are guaranteed to be signaled before the first notification action is taken." This means that, by the time the IOMMU's ExitBootServices() handler is entered, all other ExitBootServices() handlers have been *queued* at least, at TPL_CALLBACK or TPL_NOTIFY. - Therefore, when we signal our second callback from there, for TPL_CALLBACK, the second callback will be queued at the end of the TPL_CALLBACK queue. - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag for the Type argument of CreateEvent." So it wouldn't matter if a driver used CreateEvent() or CreateEventEx() for setting up its own handler, the handler would be queued just the same. I think this is ugly and brittle, but perhaps the only way to clean up *all* translations safely, with regard to PciIo.Unmap() + ExitBootServices(), without updating the UEFI spec. What do you think? [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. We do not need update PCI driver and we do not need update UEFI spec. We only need update IOMMU driver which is concerned, based upon the threat model. That probably is best solution. :-) I assume you want to handle both common buffer and read/write buffer, right? And if possible, I still have interest to get clear on the threat model for SEV in OVMF. If you or any SEV owner can share ... Thanks, Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/06/17 06:37, Yao, Jiewen wrote: > Thanks for the clarification. Comment in line. > > From: Laszlo Ersek [mailto:lersek@redhat.com] > Sent: Wednesday, September 6, 2017 1:57 AM > To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> > Cc: Dong, Eric <eric.dong@intel.com>; Brijesh Singh <brijesh.singh@amd.com> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> Then after ExitBootService, the OS will take control of CR3 and set correct >> encryption bit. > > This is true, the guest OS will set up new page tables, and in those > PTEs, the C-bit ("encrypt") will likely be set by default. > > However, the guest OS will not *rewrite* all of the memory, with the > C-bit set. This means that any memory that the firmware didn't actually > re-encrypt (in the IOMMU driver) will still expose stale data to the > hypervisor. > [Jiewen] That is an interesting question. > Is there any security concern associated? I can't tell for sure. Answering this question is up to the proponents of SEV, who have designed the threat model for SEV. My operating assumption is that we should keep memory as tightly encrypted as possible at the firmware --> OS control transfer. It could be an exaggerated expectation from my side; I'd just better be safe than sorry :) > If the C-bit is cleared for a read/write buffer, is the content in the > read/write buffer also exposed to hypervisor? Not immediately. Only when the guest also rewrites the memory through the appropriate virtual address. Basically, in the virtual machine, the C-bit in a PTE that covers a particular guest virtual address (GVA) controls whether a guest write through that GVA will result in memory encryption, and if a gues read through that GVA will result in memory decryption. The current state of the C-bit doesn't matter for the hypervisor, what matters is the last guest write through a GVA whose PTE had the C-bit set, or clear. If the C-bit was clear when the guest last wrote the area, then the hypervisor can read the data. If the C-bit was set, then the hypervisor can only read garbage. > I means if there is security concern, the concern should be applied to > both common buffer and read/write buffer. > Then we have to figure a way to resolve both buffer. Yes, this is correct. The PciIo.Unmap operation, as currently implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption correctly for all operations, but it can only guarantee *not* freeing memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices is safe, while Unmap()ing Read/Write is not. The encryption would be re-established just fine for Read/Write as well, but we would change the UEFI memmap. In OVMF, we currently manage this problem by making all asynchronous DMA mappings CommonBuffer, even if they could othewise be Read or Write. We use Read and Write only if the DMA operation completes before the higher-level protocol function returns (so we can immediately Unmap the operation, and the ExitBootServices() handler doesn't have to care). > To be honest, that is exactly my biggest question on this patch: > why do we only handle the common buffer specially? For the following reason: - Given that CommonBuffer mappings take a separate AllocateBuffer() / FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo implementors -- beyond what the UEFI spec requries -- to keep *all* dynamic memory management out of Map/Unmap. If they need dynamic memory management, we ask them to do it in AllocateBuffer/FreeBuffer instead. This way Unmap is safe in ExitBootServices handlers. - We cannot *reasonably ask* PciIo implementors to provide the same guarantee for Read and Write mappings, simply because there are no other APIs that could manage the dynamic memory for Map and Unmap -- AllocateBuffer and FreeBuffer are not required for Read and Write mappings. PciIo implementors couldn't agree to our request even if they wanted to. Therefore Unmapping Read/Write operations is inherently unsafe in EBS handlers. >> NOTE: The device should still be halt state, because the device is >> also controlled by OS. > > This is the key point -- the device is *not* controlled by the guest OS, > in the worst case. > > The virtual device is ultimately implemented by the hypervisor. We don't > expect the virtual device (= the hypervisor) to mis-behave on purpose. > However, if the hypervisor is compromised by an external attacker (for > example over the network, or via privilege escalation from another > guest), then all guest RAM that has not been encrypted up to that point > becomes visible to the attacker. In other words, the hypervisor ("the > device") becomes retro-actively malicious. > [Jiewen] If that is the threat model, I have a question on the exposure: > 1) If the concern is for existing data, it means all DMA data already written. > However, the DMA data is already consumed or produced by virtual device > or a hypervisor. If the hypervisor is attacked, it already gets all the data content. Maybe, maybe not. The data may have been sent to a different host over the network, and wiped from memory. Or, the data may have been written to a disk image that is separately encrypted by the host OS, and has been detached (unplugged) from the guest, and also unmounted from the host, with the disk key securely wiped from host memory. We can also imagine the reverse direction. Assume that the user of the virtual machine goes into the UEFI shell in OVMF, starts the EDIT program, and types some secret information into a text file on the ESP. Further assume that the disk image is encrypted on the host OS. It is conceivable that fragments of the secret could remain stuck in the AHCI (disk) and USB (keyboard) DMA buffers. Maybe you think that these are "made up" examples, and I agree, I just made them up. The point is, it is not my place to discount possible attack vectors (I haven't designed SEV). Attackers will be limited by their *own* imaginations only, not by mine :) > 2) if the concern is for future data, it means all user data to be written. > However, the C-bit already prevent that. As far as I understand SEV, future data is out of scope. The goal is to prevent *retroactive* guest data leaks (= whatever is currently in host OS memory) if an attacker compromises an otherwise non-malicious hypervisor. If an attacker compromises a virtualization host, then they can just sit silent and watch data flow into and out of guests from that point onward, because they can then monitor all DMA (which always happens in clear-text) real-time. > Maybe I do not fully understand the threat model defined. > If you can explain more, that would be great. Well I tried to explain *my* understanding of SEV :) I hope Brijesh will correct me if I'm wrong. > The point of SEV is to keep as much guest data encrypted at all times as > possible, so that *whenever* the hypervisor is compromised by an > attacker, the guest memory -- which the attacker can inevitably dump > from the host side -- remains un-intellegible to the attacker. > [Jiewen] OK. If this is a security question, I suggest we define a clear > threat model at first. > Or what we are doing might be unnecessary or insufficient. I agree. As I said above, my operating principle has been to re-encrypt all DMA buffers as soon as possible. For long-standing DMA buffers, re-encrypt them at ExitBootServices at the latest. > [Jiewen] For "require that Unmap() work for CommonBuffer > operations without releasing memory", I would hold my opinion until > it is documented clearly in UEFI spec. You are right, of course. It's just that we cannot avoid exploring ways, for this feature, that currently fall outside of the spec. Whatever we do here will be outside of the spec, one way or another. I suggested what I thought could be a reasonable extension to the spec, one that could be implemented by PciIo implementors even before the spec is modified. edk2's PciIo.Unmap already behaves like this, without breaking the spec. If there's a better way -- for example modifying CoreExitBootServices() in edk2, to signal IOMMU drivers separately, *after* notifying other ExitBootServices() handlers --, that might work as well. > My current concern is: > First, this sentence is NOT defined in UEFI specification. Correct. > Second, it might break an existing implementation or existing feature, such as tracing. Correct. > Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed > to call memory services. > The only restriction is > 1) TPL restriction, where memory service must be <= TPL_NOTIFY. > 2) AP restriction, where no UEFI service/protocol is allowed for AP. I agree. > I'm just trying to operate with the APIs currently defined by the UEFI > spec, and these assumptions were the best I could come up with. > [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. > Especially the IHV Option ROM should not consume any private API. I disagree about "only follow". If there are additional requirements that can be satisfied without breaking the spec, driver implementors can choose to conform to both sets of requirements. For example (if I understand correctly), Microsoft / Windows present a bunch of requirements *beyond* the UEFI spec, for both platform and add-on card firmware. Most vendors conform :) >> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >> guest memory should be encrypted again, even those areas that were once >> used as CommonBuffers." >> For SEV case, I think it should be controlled by OS, because it is just CR3. > > If it was indeed just CR3, then I would fully agree with you. > > But -- to my understanding --, ensuring that the memory is actually > encrypted requires that the guest *write* to the memory through a > virtual address whose PTE has the C-bit set. > > And I don't think the guest OS can be expected to rewrite all of its > memory at launch. :( > > [Jiewen] That is fine. > I suggest we get clear on the threat model as the first step. > The threat model for SEV usage in OVMF. > > I am sorry if that is already discussed before. I might ignore the conversation. No problem; it's always good to re-investigate our assumptions and operating principles. > If you or any SEV owner can share the insight, that will be great. > See my question above "If that is the threat model, I have a question on the exposure:..." I shared *my* impressions of the threat model (which are somewhat unclear, admittedly, and thus could make me overly cautious). I hope Brijesh can extend and/or correct my description. >> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >> are opposites -- VT-d permits all DMA unless configured otherwise, while >> SEV forbids all DMA unless configured otherwise. >> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >> I setup translation table, but mark all entry to be NOT-PRESENT. >> I grant DMA access only if there is a specific request by a device. >> >> I open DMA access in ExitBootServices, just want to make sure it is compatible to >> the existing OS, which might not have knowledge on IOMMU. >> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >> turn off IOMMU, or keep it enabled at ExitBootService. >> An IOMMU aware OS may take over IOMMU directly and reprogram it. > > Thanks for the clarification. > > But, again, will this not lead to the possibility that the VT-d IOMMU's > ExitBootServices() handler disables all DMA *before* the PCI drivers get > a chance to run their own ExitBootServices() handlers, disabling the > individual devices? > [Jiewen] Sorry for the confusing. Let me explain: > No, VTd never disables *all* DMA buffer at ExitBootService. > > "disable VTd" means to "disable IOMMU protection" and "allow all DMA". > "Keep VTd enabled" means to "keep IOMMU protection enabled" and > "only allow the DMA from Map() request". Okay, but this interpretation was exactly what I thought of first (see above): "VT-d permits all DMA unless configured otherwise". You answered that it wasn't the case. So basically, if I understand it correctly now, at ExitBootServices the VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? That is equivalent to decrypting all memory under SEV, and is the exact opposite of what we want. (As I understand it.) > If that happens, then a series of IOMMU faults could be generated, which > I described above. (I.e., such IOMMU faults could result, at least in > the case of SEV, in garbage being written to disk, via queued commands.) > [Jiewen] You are right. That would not happen. > IOMMU driver should not bring any compatibility problem for the PCI driver, > iff the PCI driver follows the UEFI specification. Agreed. > Now, to finish up, here's an idea I just had. > > Are we allowed to call gBS->SignalEvent() in an ExitBootServices() > notification function? > > If we are, then we could do the following: > > * PciIo.Unmap() and friends would work as always (no restrictions on > dynamic memory allocation or freeing, for any kind of DMA operation). > > * PCI drivers would not be expected to call PciIo.Unmap() in their > ExitBootServices() handlers. > > * The IOMMU driver would have an ExitBootServices() notification > function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level > (doesn't matter which). > > * In this notification function, the IOMMU driver would signal *another* > event (a private one). The notification function for this event would > be enqueued strictly at the TPL_CALLBACK level. > > * The notification function for the second event (private to the IOMMU) > would iterate over all existent mappings, and unmap them, without > allocating or freeing memory. > > The key point is that by signaling the second event, the IOMMU driver > could order its own cleanup code after all other ExitBootServices() > callbacks. (Assuming, at least, that no other ExitBootServices() > callback employs the same trick to defer itself!) Therefore by the time > the second callback runs, all PCI devices have been halted, and it is > safe to tear down the translations. > > The ordering would be ensured by the following: > > - The UEFI spec says under CreateEventEx(), "All events are guaranteed > to be signaled before the first notification action is taken." This > means that, by the time the IOMMU's ExitBootServices() handler is > entered, all other ExitBootServices() handlers have been *queued* at > least, at TPL_CALLBACK or TPL_NOTIFY. > > - Therefore, when we signal our second callback from there, for > TPL_CALLBACK, the second callback will be queued at the end of the > TPL_CALLBACK queue. > > - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is > functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag > for the Type argument of CreateEvent." So it wouldn't matter if a > driver used CreateEvent() or CreateEventEx() for setting up its own > handler, the handler would be queued just the same. > > I think this is ugly and brittle, but perhaps the only way to clean up > *all* translations safely, with regard to PciIo.Unmap() + > ExitBootServices(), without updating the UEFI spec. > > What do you think? > > [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. > We do not need update PCI driver and we do not need update UEFI spec. > We only need update IOMMU driver which is concerned, based upon the threat model. > That probably is best solution. :-) I'm very glad to hear that you like the idea. However, it depends on whether we are permitted, by the UEFI spec, to signal another event in an ExitBootServices() notification function. Are we permitted to do that? Does the UEFI spec guarantee that the notification function for the *second* event will be queued just like it would be under "normal" circumstances? (I know we must not allocate or free memory in the notification function of the *second* event either; I just want to know if the second event's handler is *queued* like it would normally be.) > I assume you want to handle both common buffer and read/write buffer, right? Yes. Under this idea, all kinds of operations would be cleaned up. > And if possible, I still have interest to get clear on the threat model for SEV in OVMF. > If you or any SEV owner can share ... Absolutely. Please see above. Thank you! Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/06/2017 07:14 AM, Laszlo Ersek wrote: > On 09/06/17 06:37, Yao, Jiewen wrote: >> Thanks for the clarification. Comment in line. >> >> From: Laszlo Ersek [mailto:lersek@redhat.com] >> Sent: Wednesday, September 6, 2017 1:57 AM >> To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> >> Cc: Dong, Eric <eric.dong@intel.com>; Brijesh Singh <brijesh.singh@amd.com> >> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > >>> Then after ExitBootService, the OS will take control of CR3 and set correct >>> encryption bit. >> >> This is true, the guest OS will set up new page tables, and in those >> PTEs, the C-bit ("encrypt") will likely be set by default. >> >> However, the guest OS will not *rewrite* all of the memory, with the >> C-bit set. This means that any memory that the firmware didn't actually >> re-encrypt (in the IOMMU driver) will still expose stale data to the >> hypervisor. >> [Jiewen] That is an interesting question. >> Is there any security concern associated? > > I can't tell for sure. Answering this question is up to the proponents > of SEV, who have designed the threat model for SEV. > > My operating assumption is that we should keep memory as tightly > encrypted as possible at the firmware --> OS control transfer. It could > be an exaggerated expectation from my side; I'd just better be safe than > sorry :) > > Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then we can discuss a security model (if needed). SEV is an extension to the AMD-V architecture which supports running encrypted virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their pages (code and data) secured such that only the guest itself has access to the unencrypted version. Each encrypted VMs is associated with a unique encryption key; if its data is accessed by a different entity using a different key the encrypted guest data will be incorrectly decrypted, leading to unintelligible data. You can also find some detail for SEV in white paper [1]. SEV encrypted Vs have the concept of private and shared memory. The private memory is encrypted with the guest-specific key, while shared memory may be encrypted with hypervisor key. SEV guest VMs can choose which pages they would like to be private. But the instruction pages and guest page tables are always treated as private by the hardware. The DMA operation inside the guest must be performed on shared pages -- this is mainly because in virtualization world the device DMA needs some assistance from the hypervisor. The security model provided by the SEV ensure that hypervisor will no longer able to inspect or alter any guest code or data. The guest OS controls what it want to share with outside world (in this case with Hypervisor). In software implementation we took approach to encrypt everything possible starting early in boot. In this approach whenever OVMF wants to perform certain DMA operations it allocate a shared page, issues the command, free the shared page after the command is completed (in other word we use sw bounce buffer to complete the DMA operation). We have implemented IOMMU driver to provide the following functions: AllocateBuffer(): -------------------- it allocate a private pages, as per UEFI spec the driver will map the buffer allocated from this routine using BusMasterCommonBuffer hence we allocate extra stash pages in addition to requested pages. Map ---- If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer and DeviceAddress point to shared buffer. If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the contents and update the page-table to clear the C-bit (basically make it shared page) Unmap ------ If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free the shared buffer allocated in Map(). If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit (basically make the page private) FreeBuffer: ----------- Free the pages Lets run with the below scenario: 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) 2) hypervisor completes the request operation 3) hypervisor caches the shared physical address and monitor it for information leak 4) sometime later if guest write data in its "shared" physical address then hypervisor can easily read it without guest knowledge. SEV hardware can protect us against the attack where someone tries to inject or alter the guest code. As I noted above any instruction fetch is forced as private hence if attacker write some code into a shared buffer and point the RIP to his/her code then instruction fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" then its guest responsibility to make it "private" after its done communicating with hypervisor. [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf >> If the C-bit is cleared for a read/write buffer, is the content in the >> read/write buffer also exposed to hypervisor? > > Not immediately. Only when the guest also rewrites the memory through > the appropriate virtual address. > > Basically, in the virtual machine, the C-bit in a PTE that covers a > particular guest virtual address (GVA) controls whether a guest write > through that GVA will result in memory encryption, and if a gues read > through that GVA will result in memory decryption. > > The current state of the C-bit doesn't matter for the hypervisor, what > matters is the last guest write through a GVA whose PTE had the C-bit > set, or clear. If the C-bit was clear when the guest last wrote the > area, then the hypervisor can read the data. If the C-bit was set, then > the hypervisor can only read garbage. > > >> I means if there is security concern, the concern should be applied to >> both common buffer and read/write buffer. >> Then we have to figure a way to resolve both buffer. > > Yes, this is correct. The PciIo.Unmap operation, as currently > implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption > correctly for all operations, but it can only guarantee *not* freeing > memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices > is safe, while Unmap()ing Read/Write is not. The encryption would be > re-established just fine for Read/Write as well, but we would change the > UEFI memmap. > > In OVMF, we currently manage this problem by making all asynchronous DMA > mappings CommonBuffer, even if they could othewise be Read or Write. We > use Read and Write only if the DMA operation completes before the > higher-level protocol function returns (so we can immediately Unmap the > operation, and the ExitBootServices() handler doesn't have to care). > > >> To be honest, that is exactly my biggest question on this patch: >> why do we only handle the common buffer specially? > > For the following reason: > > - Given that CommonBuffer mappings take a separate AllocateBuffer() / > FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo > implementors -- beyond what the UEFI spec requries -- to keep *all* > dynamic memory management out of Map/Unmap. If they need dynamic memory > management, we ask them to do it in AllocateBuffer/FreeBuffer instead. > This way Unmap is safe in ExitBootServices handlers. > > - We cannot *reasonably ask* PciIo implementors to provide the same > guarantee for Read and Write mappings, simply because there are no other > APIs that could manage the dynamic memory for Map and Unmap -- > AllocateBuffer and FreeBuffer are not required for Read and Write > mappings. PciIo implementors couldn't agree to our request even if they > wanted to. Therefore Unmapping Read/Write operations is inherently > unsafe in EBS handlers. > > >>> NOTE: The device should still be halt state, because the device is >>> also controlled by OS. >> >> This is the key point -- the device is *not* controlled by the guest OS, >> in the worst case. >> >> The virtual device is ultimately implemented by the hypervisor. We don't >> expect the virtual device (= the hypervisor) to mis-behave on purpose. >> However, if the hypervisor is compromised by an external attacker (for >> example over the network, or via privilege escalation from another >> guest), then all guest RAM that has not been encrypted up to that point >> becomes visible to the attacker. In other words, the hypervisor ("the >> device") becomes retro-actively malicious. >> [Jiewen] If that is the threat model, I have a question on the exposure: >> 1) If the concern is for existing data, it means all DMA data already written. >> However, the DMA data is already consumed or produced by virtual device >> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. > > Maybe, maybe not. The data may have been sent to a different host over > the network, and wiped from memory. > > Or, the data may have been written to a disk image that is separately > encrypted by the host OS, and has been detached (unplugged) from the > guest, and also unmounted from the host, with the disk key securely > wiped from host memory. > > We can also imagine the reverse direction. Assume that the user of the > virtual machine goes into the UEFI shell in OVMF, starts the EDIT > program, and types some secret information into a text file on the ESP. > Further assume that the disk image is encrypted on the host OS. It is > conceivable that fragments of the secret could remain stuck in the AHCI > (disk) and USB (keyboard) DMA buffers. > > Maybe you think that these are "made up" examples, and I agree, I just > made them up. The point is, it is not my place to discount possible > attack vectors (I haven't designed SEV). Attackers will be limited by > their *own* imaginations only, not by mine :) > > >> 2) if the concern is for future data, it means all user data to be written. >> However, the C-bit already prevent that. > > As far as I understand SEV, future data is out of scope. The goal is to > prevent *retroactive* guest data leaks (= whatever is currently in host > OS memory) if an attacker compromises an otherwise non-malicious hypervisor. > > If an attacker compromises a virtualization host, then they can just sit > silent and watch data flow into and out of guests from that point > onward, because they can then monitor all DMA (which always happens in > clear-text) real-time. > > >> Maybe I do not fully understand the threat model defined. >> If you can explain more, that would be great. > > Well I tried to explain *my* understanding of SEV :) I hope Brijesh will > correct me if I'm wrong. > > >> The point of SEV is to keep as much guest data encrypted at all times as >> possible, so that *whenever* the hypervisor is compromised by an >> attacker, the guest memory -- which the attacker can inevitably dump >> from the host side -- remains un-intellegible to the attacker. >> [Jiewen] OK. If this is a security question, I suggest we define a clear >> threat model at first. >> Or what we are doing might be unnecessary or insufficient. > > I agree. > > As I said above, my operating principle has been to re-encrypt all DMA > buffers as soon as possible. For long-standing DMA buffers, re-encrypt > them at ExitBootServices at the latest. > > >> [Jiewen] For "require that Unmap() work for CommonBuffer >> operations without releasing memory", I would hold my opinion until >> it is documented clearly in UEFI spec. > > You are right, of course. > > It's just that we cannot avoid exploring ways, for this feature, that > currently fall outside of the spec. Whatever we do here will be outside > of the spec, one way or another. I suggested what I thought could be a > reasonable extension to the spec, one that could be implemented by PciIo > implementors even before the spec is modified. > > edk2's PciIo.Unmap already behaves like this, without breaking the spec. > > If there's a better way -- for example modifying CoreExitBootServices() > in edk2, to signal IOMMU drivers separately, *after* notifying other > ExitBootServices() handlers --, that might work as well. > > >> My current concern is: >> First, this sentence is NOT defined in UEFI specification. > > Correct. > > >> Second, it might break an existing implementation or existing feature, such as tracing. > > Correct. > >> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >> to call memory services. >> The only restriction is >> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >> 2) AP restriction, where no UEFI service/protocol is allowed for AP. > > I agree. > > >> I'm just trying to operate with the APIs currently defined by the UEFI >> spec, and these assumptions were the best I could come up with. >> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >> Especially the IHV Option ROM should not consume any private API. > > I disagree about "only follow". If there are additional requirements > that can be satisfied without breaking the spec, driver implementors can > choose to conform to both sets of requirements. > > For example (if I understand correctly), Microsoft / Windows present a > bunch of requirements *beyond* the UEFI spec, for both platform and > add-on card firmware. Most vendors conform :) > > >>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>> guest memory should be encrypted again, even those areas that were once >>> used as CommonBuffers." >>> For SEV case, I think it should be controlled by OS, because it is just CR3. >> >> If it was indeed just CR3, then I would fully agree with you. >> >> But -- to my understanding --, ensuring that the memory is actually >> encrypted requires that the guest *write* to the memory through a >> virtual address whose PTE has the C-bit set. >> >> And I don't think the guest OS can be expected to rewrite all of its >> memory at launch. :( >> >> [Jiewen] That is fine. >> I suggest we get clear on the threat model as the first step. >> The threat model for SEV usage in OVMF. >> >> I am sorry if that is already discussed before. I might ignore the conversation. > > No problem; it's always good to re-investigate our assumptions and > operating principles. > > >> If you or any SEV owner can share the insight, that will be great. >> See my question above "If that is the threat model, I have a question on the exposure:..." > > I shared *my* impressions of the threat model (which are somewhat > unclear, admittedly, and thus could make me overly cautious). > > I hope Brijesh can extend and/or correct my description. > > >>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>> SEV forbids all DMA unless configured otherwise. >>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>> I setup translation table, but mark all entry to be NOT-PRESENT. >>> I grant DMA access only if there is a specific request by a device. >>> >>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>> the existing OS, which might not have knowledge on IOMMU. >>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>> turn off IOMMU, or keep it enabled at ExitBootService. >>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >> >> Thanks for the clarification. >> >> But, again, will this not lead to the possibility that the VT-d IOMMU's >> ExitBootServices() handler disables all DMA *before* the PCI drivers get >> a chance to run their own ExitBootServices() handlers, disabling the >> individual devices? >> [Jiewen] Sorry for the confusing. Let me explain: >> No, VTd never disables *all* DMA buffer at ExitBootService. >> >> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >> "only allow the DMA from Map() request". > > Okay, but this interpretation was exactly what I thought of first (see > above): "VT-d permits all DMA unless configured otherwise". You answered > that it wasn't the case. > > So basically, if I understand it correctly now, at ExitBootServices the > VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? > > That is equivalent to decrypting all memory under SEV, and is the exact > opposite of what we want. (As I understand it.) > > >> If that happens, then a series of IOMMU faults could be generated, which >> I described above. (I.e., such IOMMU faults could result, at least in >> the case of SEV, in garbage being written to disk, via queued commands.) >> [Jiewen] You are right. That would not happen. >> IOMMU driver should not bring any compatibility problem for the PCI driver, >> iff the PCI driver follows the UEFI specification. > > Agreed. > > >> Now, to finish up, here's an idea I just had. >> >> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >> notification function? >> >> If we are, then we could do the following: >> >> * PciIo.Unmap() and friends would work as always (no restrictions on >> dynamic memory allocation or freeing, for any kind of DMA operation). >> >> * PCI drivers would not be expected to call PciIo.Unmap() in their >> ExitBootServices() handlers. >> >> * The IOMMU driver would have an ExitBootServices() notification >> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >> (doesn't matter which). >> >> * In this notification function, the IOMMU driver would signal *another* >> event (a private one). The notification function for this event would >> be enqueued strictly at the TPL_CALLBACK level. >> >> * The notification function for the second event (private to the IOMMU) >> would iterate over all existent mappings, and unmap them, without >> allocating or freeing memory. >> >> The key point is that by signaling the second event, the IOMMU driver >> could order its own cleanup code after all other ExitBootServices() >> callbacks. (Assuming, at least, that no other ExitBootServices() >> callback employs the same trick to defer itself!) Therefore by the time >> the second callback runs, all PCI devices have been halted, and it is >> safe to tear down the translations. >> >> The ordering would be ensured by the following: >> >> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >> to be signaled before the first notification action is taken." This >> means that, by the time the IOMMU's ExitBootServices() handler is >> entered, all other ExitBootServices() handlers have been *queued* at >> least, at TPL_CALLBACK or TPL_NOTIFY. >> >> - Therefore, when we signal our second callback from there, for >> TPL_CALLBACK, the second callback will be queued at the end of the >> TPL_CALLBACK queue. >> >> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >> for the Type argument of CreateEvent." So it wouldn't matter if a >> driver used CreateEvent() or CreateEventEx() for setting up its own >> handler, the handler would be queued just the same. >> >> I think this is ugly and brittle, but perhaps the only way to clean up >> *all* translations safely, with regard to PciIo.Unmap() + >> ExitBootServices(), without updating the UEFI spec. >> >> What do you think? >> >> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >> We do not need update PCI driver and we do not need update UEFI spec. >> We only need update IOMMU driver which is concerned, based upon the threat model. >> That probably is best solution. :-) > > I'm very glad to hear that you like the idea. > > However, it depends on whether we are permitted, by the UEFI spec, to > signal another event in an ExitBootServices() notification function. > > Are we permitted to do that? > > Does the UEFI spec guarantee that the notification function for the > *second* event will be queued just like it would be under "normal" > circumstances? > > (I know we must not allocate or free memory in the notification function > of the *second* event either; I just want to know if the second event's > handler is *queued* like it would normally be.) > > >> I assume you want to handle both common buffer and read/write buffer, right? > > Yes. Under this idea, all kinds of operations would be cleaned up. > > >> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >> If you or any SEV owner can share ... > > Absolutely. Please see above. > > Thank you! > Laszlo > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Thanks for the sharing, Brijesh. "If a page was marked as "shared" then its guest responsibility to make it "private" after its done communicating with hypervisor." I believe I have same understanding - a *guest* should guarantee that. My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"? Maybe I can ask the specific question to get it more clear. 1) If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue? 2) If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue? Thank you Yao Jiewen From: Brijesh Singh [mailto:brijesh.singh@amd.com] Sent: Wednesday, September 6, 2017 11:40 PM To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() On 09/06/2017 07:14 AM, Laszlo Ersek wrote: > On 09/06/17 06:37, Yao, Jiewen wrote: >> Thanks for the clarification. Comment in line. >> >> From: Laszlo Ersek [mailto:lersek@redhat.com] >> Sent: Wednesday, September 6, 2017 1:57 AM >> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> >> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > >>> Then after ExitBootService, the OS will take control of CR3 and set correct >>> encryption bit. >> >> This is true, the guest OS will set up new page tables, and in those >> PTEs, the C-bit ("encrypt") will likely be set by default. >> >> However, the guest OS will not *rewrite* all of the memory, with the >> C-bit set. This means that any memory that the firmware didn't actually >> re-encrypt (in the IOMMU driver) will still expose stale data to the >> hypervisor. >> [Jiewen] That is an interesting question. >> Is there any security concern associated? > > I can't tell for sure. Answering this question is up to the proponents > of SEV, who have designed the threat model for SEV. > > My operating assumption is that we should keep memory as tightly > encrypted as possible at the firmware --> OS control transfer. It could > be an exaggerated expectation from my side; I'd just better be safe than > sorry :) > > Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then we can discuss a security model (if needed). SEV is an extension to the AMD-V architecture which supports running encrypted virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their pages (code and data) secured such that only the guest itself has access to the unencrypted version. Each encrypted VMs is associated with a unique encryption key; if its data is accessed by a different entity using a different key the encrypted guest data will be incorrectly decrypted, leading to unintelligible data. You can also find some detail for SEV in white paper [1]. SEV encrypted Vs have the concept of private and shared memory. The private memory is encrypted with the guest-specific key, while shared memory may be encrypted with hypervisor key. SEV guest VMs can choose which pages they would like to be private. But the instruction pages and guest page tables are always treated as private by the hardware. The DMA operation inside the guest must be performed on shared pages -- this is mainly because in virtualization world the device DMA needs some assistance from the hypervisor. The security model provided by the SEV ensure that hypervisor will no longer able to inspect or alter any guest code or data. The guest OS controls what it want to share with outside world (in this case with Hypervisor). In software implementation we took approach to encrypt everything possible starting early in boot. In this approach whenever OVMF wants to perform certain DMA operations it allocate a shared page, issues the command, free the shared page after the command is completed (in other word we use sw bounce buffer to complete the DMA operation). We have implemented IOMMU driver to provide the following functions: AllocateBuffer(): -------------------- it allocate a private pages, as per UEFI spec the driver will map the buffer allocated from this routine using BusMasterCommonBuffer hence we allocate extra stash pages in addition to requested pages. Map ---- If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer and DeviceAddress point to shared buffer. If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the contents and update the page-table to clear the C-bit (basically make it shared page) Unmap ------ If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free the shared buffer allocated in Map(). If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit (basically make the page private) FreeBuffer: ----------- Free the pages Lets run with the below scenario: 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) 2) hypervisor completes the request operation 3) hypervisor caches the shared physical address and monitor it for information leak 4) sometime later if guest write data in its "shared" physical address then hypervisor can easily read it without guest knowledge. SEV hardware can protect us against the attack where someone tries to inject or alter the guest code. As I noted above any instruction fetch is forced as private hence if attacker write some code into a shared buffer and point the RIP to his/her code then instruction fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" then its guest responsibility to make it "private" after its done communicating with hypervisor. [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf >> If the C-bit is cleared for a read/write buffer, is the content in the >> read/write buffer also exposed to hypervisor? > > Not immediately. Only when the guest also rewrites the memory through > the appropriate virtual address. > > Basically, in the virtual machine, the C-bit in a PTE that covers a > particular guest virtual address (GVA) controls whether a guest write > through that GVA will result in memory encryption, and if a gues read > through that GVA will result in memory decryption. > > The current state of the C-bit doesn't matter for the hypervisor, what > matters is the last guest write through a GVA whose PTE had the C-bit > set, or clear. If the C-bit was clear when the guest last wrote the > area, then the hypervisor can read the data. If the C-bit was set, then > the hypervisor can only read garbage. > > >> I means if there is security concern, the concern should be applied to >> both common buffer and read/write buffer. >> Then we have to figure a way to resolve both buffer. > > Yes, this is correct. The PciIo.Unmap operation, as currently > implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption > correctly for all operations, but it can only guarantee *not* freeing > memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices > is safe, while Unmap()ing Read/Write is not. The encryption would be > re-established just fine for Read/Write as well, but we would change the > UEFI memmap. > > In OVMF, we currently manage this problem by making all asynchronous DMA > mappings CommonBuffer, even if they could othewise be Read or Write. We > use Read and Write only if the DMA operation completes before the > higher-level protocol function returns (so we can immediately Unmap the > operation, and the ExitBootServices() handler doesn't have to care). > > >> To be honest, that is exactly my biggest question on this patch: >> why do we only handle the common buffer specially? > > For the following reason: > > - Given that CommonBuffer mappings take a separate AllocateBuffer() / > FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo > implementors -- beyond what the UEFI spec requries -- to keep *all* > dynamic memory management out of Map/Unmap. If they need dynamic memory > management, we ask them to do it in AllocateBuffer/FreeBuffer instead. > This way Unmap is safe in ExitBootServices handlers. > > - We cannot *reasonably ask* PciIo implementors to provide the same > guarantee for Read and Write mappings, simply because there are no other > APIs that could manage the dynamic memory for Map and Unmap -- > AllocateBuffer and FreeBuffer are not required for Read and Write > mappings. PciIo implementors couldn't agree to our request even if they > wanted to. Therefore Unmapping Read/Write operations is inherently > unsafe in EBS handlers. > > >>> NOTE: The device should still be halt state, because the device is >>> also controlled by OS. >> >> This is the key point -- the device is *not* controlled by the guest OS, >> in the worst case. >> >> The virtual device is ultimately implemented by the hypervisor. We don't >> expect the virtual device (= the hypervisor) to mis-behave on purpose. >> However, if the hypervisor is compromised by an external attacker (for >> example over the network, or via privilege escalation from another >> guest), then all guest RAM that has not been encrypted up to that point >> becomes visible to the attacker. In other words, the hypervisor ("the >> device") becomes retro-actively malicious. >> [Jiewen] If that is the threat model, I have a question on the exposure: >> 1) If the concern is for existing data, it means all DMA data already written. >> However, the DMA data is already consumed or produced by virtual device >> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. > > Maybe, maybe not. The data may have been sent to a different host over > the network, and wiped from memory. > > Or, the data may have been written to a disk image that is separately > encrypted by the host OS, and has been detached (unplugged) from the > guest, and also unmounted from the host, with the disk key securely > wiped from host memory. > > We can also imagine the reverse direction. Assume that the user of the > virtual machine goes into the UEFI shell in OVMF, starts the EDIT > program, and types some secret information into a text file on the ESP. > Further assume that the disk image is encrypted on the host OS. It is > conceivable that fragments of the secret could remain stuck in the AHCI > (disk) and USB (keyboard) DMA buffers. > > Maybe you think that these are "made up" examples, and I agree, I just > made them up. The point is, it is not my place to discount possible > attack vectors (I haven't designed SEV). Attackers will be limited by > their *own* imaginations only, not by mine :) > > >> 2) if the concern is for future data, it means all user data to be written. >> However, the C-bit already prevent that. > > As far as I understand SEV, future data is out of scope. The goal is to > prevent *retroactive* guest data leaks (= whatever is currently in host > OS memory) if an attacker compromises an otherwise non-malicious hypervisor. > > If an attacker compromises a virtualization host, then they can just sit > silent and watch data flow into and out of guests from that point > onward, because they can then monitor all DMA (which always happens in > clear-text) real-time. > > >> Maybe I do not fully understand the threat model defined. >> If you can explain more, that would be great. > > Well I tried to explain *my* understanding of SEV :) I hope Brijesh will > correct me if I'm wrong. > > >> The point of SEV is to keep as much guest data encrypted at all times as >> possible, so that *whenever* the hypervisor is compromised by an >> attacker, the guest memory -- which the attacker can inevitably dump >> from the host side -- remains un-intellegible to the attacker. >> [Jiewen] OK. If this is a security question, I suggest we define a clear >> threat model at first. >> Or what we are doing might be unnecessary or insufficient. > > I agree. > > As I said above, my operating principle has been to re-encrypt all DMA > buffers as soon as possible. For long-standing DMA buffers, re-encrypt > them at ExitBootServices at the latest. > > >> [Jiewen] For "require that Unmap() work for CommonBuffer >> operations without releasing memory", I would hold my opinion until >> it is documented clearly in UEFI spec. > > You are right, of course. > > It's just that we cannot avoid exploring ways, for this feature, that > currently fall outside of the spec. Whatever we do here will be outside > of the spec, one way or another. I suggested what I thought could be a > reasonable extension to the spec, one that could be implemented by PciIo > implementors even before the spec is modified. > > edk2's PciIo.Unmap already behaves like this, without breaking the spec. > > If there's a better way -- for example modifying CoreExitBootServices() > in edk2, to signal IOMMU drivers separately, *after* notifying other > ExitBootServices() handlers --, that might work as well. > > >> My current concern is: >> First, this sentence is NOT defined in UEFI specification. > > Correct. > > >> Second, it might break an existing implementation or existing feature, such as tracing. > > Correct. > >> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >> to call memory services. >> The only restriction is >> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >> 2) AP restriction, where no UEFI service/protocol is allowed for AP. > > I agree. > > >> I'm just trying to operate with the APIs currently defined by the UEFI >> spec, and these assumptions were the best I could come up with. >> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >> Especially the IHV Option ROM should not consume any private API. > > I disagree about "only follow". If there are additional requirements > that can be satisfied without breaking the spec, driver implementors can > choose to conform to both sets of requirements. > > For example (if I understand correctly), Microsoft / Windows present a > bunch of requirements *beyond* the UEFI spec, for both platform and > add-on card firmware. Most vendors conform :) > > >>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>> guest memory should be encrypted again, even those areas that were once >>> used as CommonBuffers." >>> For SEV case, I think it should be controlled by OS, because it is just CR3. >> >> If it was indeed just CR3, then I would fully agree with you. >> >> But -- to my understanding --, ensuring that the memory is actually >> encrypted requires that the guest *write* to the memory through a >> virtual address whose PTE has the C-bit set. >> >> And I don't think the guest OS can be expected to rewrite all of its >> memory at launch. :( >> >> [Jiewen] That is fine. >> I suggest we get clear on the threat model as the first step. >> The threat model for SEV usage in OVMF. >> >> I am sorry if that is already discussed before. I might ignore the conversation. > > No problem; it's always good to re-investigate our assumptions and > operating principles. > > >> If you or any SEV owner can share the insight, that will be great. >> See my question above "If that is the threat model, I have a question on the exposure:..." > > I shared *my* impressions of the threat model (which are somewhat > unclear, admittedly, and thus could make me overly cautious). > > I hope Brijesh can extend and/or correct my description. > > >>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>> SEV forbids all DMA unless configured otherwise. >>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>> I setup translation table, but mark all entry to be NOT-PRESENT. >>> I grant DMA access only if there is a specific request by a device. >>> >>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>> the existing OS, which might not have knowledge on IOMMU. >>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>> turn off IOMMU, or keep it enabled at ExitBootService. >>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >> >> Thanks for the clarification. >> >> But, again, will this not lead to the possibility that the VT-d IOMMU's >> ExitBootServices() handler disables all DMA *before* the PCI drivers get >> a chance to run their own ExitBootServices() handlers, disabling the >> individual devices? >> [Jiewen] Sorry for the confusing. Let me explain: >> No, VTd never disables *all* DMA buffer at ExitBootService. >> >> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >> "only allow the DMA from Map() request". > > Okay, but this interpretation was exactly what I thought of first (see > above): "VT-d permits all DMA unless configured otherwise". You answered > that it wasn't the case. > > So basically, if I understand it correctly now, at ExitBootServices the > VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? > > That is equivalent to decrypting all memory under SEV, and is the exact > opposite of what we want. (As I understand it.) > > >> If that happens, then a series of IOMMU faults could be generated, which >> I described above. (I.e., such IOMMU faults could result, at least in >> the case of SEV, in garbage being written to disk, via queued commands.) >> [Jiewen] You are right. That would not happen. >> IOMMU driver should not bring any compatibility problem for the PCI driver, >> iff the PCI driver follows the UEFI specification. > > Agreed. > > >> Now, to finish up, here's an idea I just had. >> >> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >> notification function? >> >> If we are, then we could do the following: >> >> * PciIo.Unmap() and friends would work as always (no restrictions on >> dynamic memory allocation or freeing, for any kind of DMA operation). >> >> * PCI drivers would not be expected to call PciIo.Unmap() in their >> ExitBootServices() handlers. >> >> * The IOMMU driver would have an ExitBootServices() notification >> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >> (doesn't matter which). >> >> * In this notification function, the IOMMU driver would signal *another* >> event (a private one). The notification function for this event would >> be enqueued strictly at the TPL_CALLBACK level. >> >> * The notification function for the second event (private to the IOMMU) >> would iterate over all existent mappings, and unmap them, without >> allocating or freeing memory. >> >> The key point is that by signaling the second event, the IOMMU driver >> could order its own cleanup code after all other ExitBootServices() >> callbacks. (Assuming, at least, that no other ExitBootServices() >> callback employs the same trick to defer itself!) Therefore by the time >> the second callback runs, all PCI devices have been halted, and it is >> safe to tear down the translations. >> >> The ordering would be ensured by the following: >> >> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >> to be signaled before the first notification action is taken." This >> means that, by the time the IOMMU's ExitBootServices() handler is >> entered, all other ExitBootServices() handlers have been *queued* at >> least, at TPL_CALLBACK or TPL_NOTIFY. >> >> - Therefore, when we signal our second callback from there, for >> TPL_CALLBACK, the second callback will be queued at the end of the >> TPL_CALLBACK queue. >> >> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >> for the Type argument of CreateEvent." So it wouldn't matter if a >> driver used CreateEvent() or CreateEventEx() for setting up its own >> handler, the handler would be queued just the same. >> >> I think this is ugly and brittle, but perhaps the only way to clean up >> *all* translations safely, with regard to PciIo.Unmap() + >> ExitBootServices(), without updating the UEFI spec. >> >> What do you think? >> >> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >> We do not need update PCI driver and we do not need update UEFI spec. >> We only need update IOMMU driver which is concerned, based upon the threat model. >> That probably is best solution. :-) > > I'm very glad to hear that you like the idea. > > However, it depends on whether we are permitted, by the UEFI spec, to > signal another event in an ExitBootServices() notification function. > > Are we permitted to do that? > > Does the UEFI spec guarantee that the notification function for the > *second* event will be queued just like it would be under "normal" > circumstances? > > (I know we must not allocate or free memory in the notification function > of the *second* event either; I just want to know if the second event's > handler is *queued* like it would normally be.) > > >> I assume you want to handle both common buffer and read/write buffer, right? > > Yes. Under this idea, all kinds of operations would be cleaned up. > > >> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >> If you or any SEV owner can share ... > > Absolutely. Please see above. > > Thank you! > Laszlo > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/07/17 06:46, Yao, Jiewen wrote: > Thanks for the sharing, Brijesh. > > "If a page was marked as "shared" > then its guest responsibility to make it "private" after its done communicating with > hypervisor." > > I believe I have same understanding - a *guest* should guarantee that. > > My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"? > > Maybe I can ask the specific question to get it more clear. > > 1) If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue? > > 2) If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue? Very good questions, totally to the point. On the authoritative answer, I defer to Brijesh. ( My personal opinion is that both situations (#1 and #2) are problems, because they break the *practical* security invariant for SEV guests: - most memory should be encrypted at all times, *and* - any memory that is decrypted must have an owner that is responsible for re-encrypting the memory eventually. Therefore, *either* the firmware has to re-encrypt all remaining DMA buffers at ExitBootServices(), *or* a new information channel must be designed, from firmware to OS, to carry over the decryption status. I strongly prefer the first option, for the following reason: the same questions apply to all EDK2 IOMMU protocol interfaces, not just the one exported by the SEV driver. ) Thanks, Laszlo > From: Brijesh Singh [mailto:brijesh.singh@amd.com] > Sent: Wednesday, September 6, 2017 11:40 PM > To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> > Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > > > On 09/06/2017 07:14 AM, Laszlo Ersek wrote: >> On 09/06/17 06:37, Yao, Jiewen wrote: >>> Thanks for the clarification. Comment in line. >>> >>> From: Laszlo Ersek [mailto:lersek@redhat.com] >>> Sent: Wednesday, September 6, 2017 1:57 AM >>> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >>> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> >>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> >>>> Then after ExitBootService, the OS will take control of CR3 and set correct >>>> encryption bit. >>> >>> This is true, the guest OS will set up new page tables, and in those >>> PTEs, the C-bit ("encrypt") will likely be set by default. >>> >>> However, the guest OS will not *rewrite* all of the memory, with the >>> C-bit set. This means that any memory that the firmware didn't actually >>> re-encrypt (in the IOMMU driver) will still expose stale data to the >>> hypervisor. >>> [Jiewen] That is an interesting question. >>> Is there any security concern associated? >> >> I can't tell for sure. Answering this question is up to the proponents >> of SEV, who have designed the threat model for SEV. >> >> My operating assumption is that we should keep memory as tightly >> encrypted as possible at the firmware --> OS control transfer. It could >> be an exaggerated expectation from my side; I'd just better be safe than >> sorry :) >> >> > > Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then > we can discuss a security model (if needed). > > SEV is an extension to the AMD-V architecture which supports running encrypted > virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their > pages (code and data) secured such that only the guest itself has access to the > unencrypted version. Each encrypted VMs is associated with a unique encryption > key; if its data is accessed by a different entity using a different key the > encrypted guest data will be incorrectly decrypted, leading to unintelligible data. > You can also find some detail for SEV in white paper [1]. > > SEV encrypted Vs have the concept of private and shared memory. The private memory > is encrypted with the guest-specific key, while shared memory may be encrypted > with hypervisor key. SEV guest VMs can choose which pages they would like to > be private. But the instruction pages and guest page tables are always treated > as private by the hardware. The DMA operation inside the guest must be performed > on shared pages -- this is mainly because in virtualization world the device > DMA needs some assistance from the hypervisor. > > The security model provided by the SEV ensure that hypervisor will no longer able > to inspect or alter any guest code or data. The guest OS controls what it want to > share with outside world (in this case with Hypervisor). > > In software implementation we took approach to encrypt everything possible starting > early in boot. In this approach whenever OVMF wants to perform certain DMA operations > it allocate a shared page, issues the command, free the shared page after the command > is completed (in other word we use sw bounce buffer to complete the DMA operation). > > We have implemented IOMMU driver to provide the following functions: > > AllocateBuffer(): > -------------------- > it allocate a private pages, as per UEFI spec the driver will map the buffer allocated > from this routine using BusMasterCommonBuffer hence we allocate extra stash pages > in addition to requested pages. > > > Map > ---- > If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer > and DeviceAddress point to shared buffer. > > If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the > contents and update the page-table to clear the C-bit (basically make it shared page) > > Unmap > ------ > If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free > the shared buffer allocated in Map(). > > If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit > (basically make the page private) > > FreeBuffer: > ----------- > Free the pages > > > Lets run with the below scenario: > > 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) > 2) hypervisor completes the request operation > 3) hypervisor caches the shared physical address and monitor it for information leak > 4) sometime later if guest write data in its "shared" physical address then hypervisor can > easily read it without guest knowledge. > > SEV hardware can protect us against the attack where someone tries to inject or alter the > guest code. As I noted above any instruction fetch is forced as private hence if attacker > write some code into a shared buffer and point the RIP to his/her code then instruction > fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" > then its guest responsibility to make it "private" after its done communicating with > hypervisor. > > [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf > > >>> If the C-bit is cleared for a read/write buffer, is the content in the >>> read/write buffer also exposed to hypervisor? >> >> Not immediately. Only when the guest also rewrites the memory through >> the appropriate virtual address. >> >> Basically, in the virtual machine, the C-bit in a PTE that covers a >> particular guest virtual address (GVA) controls whether a guest write >> through that GVA will result in memory encryption, and if a gues read >> through that GVA will result in memory decryption. >> >> The current state of the C-bit doesn't matter for the hypervisor, what >> matters is the last guest write through a GVA whose PTE had the C-bit >> set, or clear. If the C-bit was clear when the guest last wrote the >> area, then the hypervisor can read the data. If the C-bit was set, then >> the hypervisor can only read garbage. >> >> >>> I means if there is security concern, the concern should be applied to >>> both common buffer and read/write buffer. >>> Then we have to figure a way to resolve both buffer. >> >> Yes, this is correct. The PciIo.Unmap operation, as currently >> implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption >> correctly for all operations, but it can only guarantee *not* freeing >> memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices >> is safe, while Unmap()ing Read/Write is not. The encryption would be >> re-established just fine for Read/Write as well, but we would change the >> UEFI memmap. >> >> In OVMF, we currently manage this problem by making all asynchronous DMA >> mappings CommonBuffer, even if they could othewise be Read or Write. We >> use Read and Write only if the DMA operation completes before the >> higher-level protocol function returns (so we can immediately Unmap the >> operation, and the ExitBootServices() handler doesn't have to care). >> >> >>> To be honest, that is exactly my biggest question on this patch: >>> why do we only handle the common buffer specially? >> >> For the following reason: >> >> - Given that CommonBuffer mappings take a separate AllocateBuffer() / >> FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo >> implementors -- beyond what the UEFI spec requries -- to keep *all* >> dynamic memory management out of Map/Unmap. If they need dynamic memory >> management, we ask them to do it in AllocateBuffer/FreeBuffer instead. >> This way Unmap is safe in ExitBootServices handlers. >> >> - We cannot *reasonably ask* PciIo implementors to provide the same >> guarantee for Read and Write mappings, simply because there are no other >> APIs that could manage the dynamic memory for Map and Unmap -- >> AllocateBuffer and FreeBuffer are not required for Read and Write >> mappings. PciIo implementors couldn't agree to our request even if they >> wanted to. Therefore Unmapping Read/Write operations is inherently >> unsafe in EBS handlers. >> >> >>>> NOTE: The device should still be halt state, because the device is >>>> also controlled by OS. >>> >>> This is the key point -- the device is *not* controlled by the guest OS, >>> in the worst case. >>> >>> The virtual device is ultimately implemented by the hypervisor. We don't >>> expect the virtual device (= the hypervisor) to mis-behave on purpose. >>> However, if the hypervisor is compromised by an external attacker (for >>> example over the network, or via privilege escalation from another >>> guest), then all guest RAM that has not been encrypted up to that point >>> becomes visible to the attacker. In other words, the hypervisor ("the >>> device") becomes retro-actively malicious. >>> [Jiewen] If that is the threat model, I have a question on the exposure: >>> 1) If the concern is for existing data, it means all DMA data already written. >>> However, the DMA data is already consumed or produced by virtual device >>> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. >> >> Maybe, maybe not. The data may have been sent to a different host over >> the network, and wiped from memory. >> >> Or, the data may have been written to a disk image that is separately >> encrypted by the host OS, and has been detached (unplugged) from the >> guest, and also unmounted from the host, with the disk key securely >> wiped from host memory. >> >> We can also imagine the reverse direction. Assume that the user of the >> virtual machine goes into the UEFI shell in OVMF, starts the EDIT >> program, and types some secret information into a text file on the ESP. >> Further assume that the disk image is encrypted on the host OS. It is >> conceivable that fragments of the secret could remain stuck in the AHCI >> (disk) and USB (keyboard) DMA buffers. >> >> Maybe you think that these are "made up" examples, and I agree, I just >> made them up. The point is, it is not my place to discount possible >> attack vectors (I haven't designed SEV). Attackers will be limited by >> their *own* imaginations only, not by mine :) >> >> >>> 2) if the concern is for future data, it means all user data to be written. >>> However, the C-bit already prevent that. >> >> As far as I understand SEV, future data is out of scope. The goal is to >> prevent *retroactive* guest data leaks (= whatever is currently in host >> OS memory) if an attacker compromises an otherwise non-malicious hypervisor. >> >> If an attacker compromises a virtualization host, then they can just sit >> silent and watch data flow into and out of guests from that point >> onward, because they can then monitor all DMA (which always happens in >> clear-text) real-time. >> >> >>> Maybe I do not fully understand the threat model defined. >>> If you can explain more, that would be great. >> >> Well I tried to explain *my* understanding of SEV :) I hope Brijesh will >> correct me if I'm wrong. >> >> >>> The point of SEV is to keep as much guest data encrypted at all times as >>> possible, so that *whenever* the hypervisor is compromised by an >>> attacker, the guest memory -- which the attacker can inevitably dump >>> from the host side -- remains un-intellegible to the attacker. >>> [Jiewen] OK. If this is a security question, I suggest we define a clear >>> threat model at first. >>> Or what we are doing might be unnecessary or insufficient. >> >> I agree. >> >> As I said above, my operating principle has been to re-encrypt all DMA >> buffers as soon as possible. For long-standing DMA buffers, re-encrypt >> them at ExitBootServices at the latest. >> >> >>> [Jiewen] For "require that Unmap() work for CommonBuffer >>> operations without releasing memory", I would hold my opinion until >>> it is documented clearly in UEFI spec. >> >> You are right, of course. >> >> It's just that we cannot avoid exploring ways, for this feature, that >> currently fall outside of the spec. Whatever we do here will be outside >> of the spec, one way or another. I suggested what I thought could be a >> reasonable extension to the spec, one that could be implemented by PciIo >> implementors even before the spec is modified. >> >> edk2's PciIo.Unmap already behaves like this, without breaking the spec. >> >> If there's a better way -- for example modifying CoreExitBootServices() >> in edk2, to signal IOMMU drivers separately, *after* notifying other >> ExitBootServices() handlers --, that might work as well. >> >> >>> My current concern is: >>> First, this sentence is NOT defined in UEFI specification. >> >> Correct. >> >> >>> Second, it might break an existing implementation or existing feature, such as tracing. >> >> Correct. >> >>> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >>> to call memory services. >>> The only restriction is >>> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >>> 2) AP restriction, where no UEFI service/protocol is allowed for AP. >> >> I agree. >> >> >>> I'm just trying to operate with the APIs currently defined by the UEFI >>> spec, and these assumptions were the best I could come up with. >>> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >>> Especially the IHV Option ROM should not consume any private API. >> >> I disagree about "only follow". If there are additional requirements >> that can be satisfied without breaking the spec, driver implementors can >> choose to conform to both sets of requirements. >> >> For example (if I understand correctly), Microsoft / Windows present a >> bunch of requirements *beyond* the UEFI spec, for both platform and >> add-on card firmware. Most vendors conform :) >> >> >>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>>> guest memory should be encrypted again, even those areas that were once >>>> used as CommonBuffers." >>>> For SEV case, I think it should be controlled by OS, because it is just CR3. >>> >>> If it was indeed just CR3, then I would fully agree with you. >>> >>> But -- to my understanding --, ensuring that the memory is actually >>> encrypted requires that the guest *write* to the memory through a >>> virtual address whose PTE has the C-bit set. >>> >>> And I don't think the guest OS can be expected to rewrite all of its >>> memory at launch. :( >>> >>> [Jiewen] That is fine. >>> I suggest we get clear on the threat model as the first step. >>> The threat model for SEV usage in OVMF. >>> >>> I am sorry if that is already discussed before. I might ignore the conversation. >> >> No problem; it's always good to re-investigate our assumptions and >> operating principles. >> >> >>> If you or any SEV owner can share the insight, that will be great. >>> See my question above "If that is the threat model, I have a question on the exposure:..." >> >> I shared *my* impressions of the threat model (which are somewhat >> unclear, admittedly, and thus could make me overly cautious). >> >> I hope Brijesh can extend and/or correct my description. >> >> >>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>>> SEV forbids all DMA unless configured otherwise. >>>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>>> I setup translation table, but mark all entry to be NOT-PRESENT. >>>> I grant DMA access only if there is a specific request by a device. >>>> >>>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>>> the existing OS, which might not have knowledge on IOMMU. >>>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>>> turn off IOMMU, or keep it enabled at ExitBootService. >>>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >>> >>> Thanks for the clarification. >>> >>> But, again, will this not lead to the possibility that the VT-d IOMMU's >>> ExitBootServices() handler disables all DMA *before* the PCI drivers get >>> a chance to run their own ExitBootServices() handlers, disabling the >>> individual devices? >>> [Jiewen] Sorry for the confusing. Let me explain: >>> No, VTd never disables *all* DMA buffer at ExitBootService. >>> >>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >>> "only allow the DMA from Map() request". >> >> Okay, but this interpretation was exactly what I thought of first (see >> above): "VT-d permits all DMA unless configured otherwise". You answered >> that it wasn't the case. >> >> So basically, if I understand it correctly now, at ExitBootServices the >> VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? >> >> That is equivalent to decrypting all memory under SEV, and is the exact >> opposite of what we want. (As I understand it.) >> >> >>> If that happens, then a series of IOMMU faults could be generated, which >>> I described above. (I.e., such IOMMU faults could result, at least in >>> the case of SEV, in garbage being written to disk, via queued commands.) >>> [Jiewen] You are right. That would not happen. >>> IOMMU driver should not bring any compatibility problem for the PCI driver, >>> iff the PCI driver follows the UEFI specification. >> >> Agreed. >> >> >>> Now, to finish up, here's an idea I just had. >>> >>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >>> notification function? >>> >>> If we are, then we could do the following: >>> >>> * PciIo.Unmap() and friends would work as always (no restrictions on >>> dynamic memory allocation or freeing, for any kind of DMA operation). >>> >>> * PCI drivers would not be expected to call PciIo.Unmap() in their >>> ExitBootServices() handlers. >>> >>> * The IOMMU driver would have an ExitBootServices() notification >>> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >>> (doesn't matter which). >>> >>> * In this notification function, the IOMMU driver would signal *another* >>> event (a private one). The notification function for this event would >>> be enqueued strictly at the TPL_CALLBACK level. >>> >>> * The notification function for the second event (private to the IOMMU) >>> would iterate over all existent mappings, and unmap them, without >>> allocating or freeing memory. >>> >>> The key point is that by signaling the second event, the IOMMU driver >>> could order its own cleanup code after all other ExitBootServices() >>> callbacks. (Assuming, at least, that no other ExitBootServices() >>> callback employs the same trick to defer itself!) Therefore by the time >>> the second callback runs, all PCI devices have been halted, and it is >>> safe to tear down the translations. >>> >>> The ordering would be ensured by the following: >>> >>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >>> to be signaled before the first notification action is taken." This >>> means that, by the time the IOMMU's ExitBootServices() handler is >>> entered, all other ExitBootServices() handlers have been *queued* at >>> least, at TPL_CALLBACK or TPL_NOTIFY. >>> >>> - Therefore, when we signal our second callback from there, for >>> TPL_CALLBACK, the second callback will be queued at the end of the >>> TPL_CALLBACK queue. >>> >>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >>> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >>> for the Type argument of CreateEvent." So it wouldn't matter if a >>> driver used CreateEvent() or CreateEventEx() for setting up its own >>> handler, the handler would be queued just the same. >>> >>> I think this is ugly and brittle, but perhaps the only way to clean up >>> *all* translations safely, with regard to PciIo.Unmap() + >>> ExitBootServices(), without updating the UEFI spec. >>> >>> What do you think? >>> >>> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >>> We do not need update PCI driver and we do not need update UEFI spec. >>> We only need update IOMMU driver which is concerned, based upon the threat model. >>> That probably is best solution. :-) >> >> I'm very glad to hear that you like the idea. >> >> However, it depends on whether we are permitted, by the UEFI spec, to >> signal another event in an ExitBootServices() notification function. >> >> Are we permitted to do that? >> >> Does the UEFI spec guarantee that the notification function for the >> *second* event will be queued just like it would be under "normal" >> circumstances? >> >> (I know we must not allocate or free memory in the notification function >> of the *second* event either; I just want to know if the second event's >> handler is *queued* like it would normally be.) >> >> >>> I assume you want to handle both common buffer and read/write buffer, right? >> >> Yes. Under this idea, all kinds of operations would be cleaned up. >> >> >>> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >>> If you or any SEV owner can share ... >> >> Absolutely. Please see above. >> >> Thank you! >> Laszlo >> > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Hi Jiewen, On 09/07/2017 06:46 AM, Laszlo Ersek wrote: > On 09/07/17 06:46, Yao, Jiewen wrote: >> Thanks for the sharing, Brijesh. >> >> "If a page was marked as "shared" >> then its guest responsibility to make it "private" after its done communicating with >> hypervisor." >> >> I believe I have same understanding - a *guest* should guarantee that. >> >> My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"? >> >> Maybe I can ask the specific question to get it more clear. >> >> 1) If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue? >> >> 2) If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue? > > Very good questions, totally to the point. > > On the authoritative answer, I defer to Brijesh. > Both the above cases (#1 and #2) are problems. Since buffers was owned by firmware and firmware made it "shared" hence firmware is responsible to mark them as private after its done with the buffer. In other words, we must call Unmap() from ExitBootServices to ensure that buffers mapped using BusMasterCommonBuffer/BusMasterRead/BusMasterWrite is marked as "private" before we make it available to the guest OS. (we do similar thing in Linux OS). Having any kind of side channel to communicate the encryption status of a page will not work -- we should be able to support a usecase where we boot OVMF in 64-bit but launch 32-bit Linux guest. When Linux boots in 32-bit mode it does not have access to encryption bit (C-bit is bit-47 in page table) and can't mark the page as private (even if we provide some kind of side-channel). thank you very much for all your help. > ( > > My personal opinion is that both situations (#1 and #2) are problems, > because they break the *practical* security invariant for SEV guests: > > - most memory should be encrypted at all times, *and* > > - any memory that is decrypted must have an owner that is responsible > for re-encrypting the memory eventually. > > Therefore, *either* the firmware has to re-encrypt all remaining DMA > buffers at ExitBootServices(), *or* a new information channel must be > designed, from firmware to OS, to carry over the decryption status. > > I strongly prefer the first option, for the following reason: the same > questions apply to all EDK2 IOMMU protocol interfaces, not just the one > exported by the SEV driver. > > ) > > Thanks, > Laszlo > >> From: Brijesh Singh [mailto:brijesh.singh@amd.com] >> Sent: Wednesday, September 6, 2017 11:40 PM >> To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> >> Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com> >> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> >> >> >> On 09/06/2017 07:14 AM, Laszlo Ersek wrote: >>> On 09/06/17 06:37, Yao, Jiewen wrote: >>>> Thanks for the clarification. Comment in line. >>>> >>>> From: Laszlo Ersek [mailto:lersek@redhat.com] >>>> Sent: Wednesday, September 6, 2017 1:57 AM >>>> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >>>> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> >>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >>> >>>>> Then after ExitBootService, the OS will take control of CR3 and set correct >>>>> encryption bit. >>>> >>>> This is true, the guest OS will set up new page tables, and in those >>>> PTEs, the C-bit ("encrypt") will likely be set by default. >>>> >>>> However, the guest OS will not *rewrite* all of the memory, with the >>>> C-bit set. This means that any memory that the firmware didn't actually >>>> re-encrypt (in the IOMMU driver) will still expose stale data to the >>>> hypervisor. >>>> [Jiewen] That is an interesting question. >>>> Is there any security concern associated? >>> >>> I can't tell for sure. Answering this question is up to the proponents >>> of SEV, who have designed the threat model for SEV. >>> >>> My operating assumption is that we should keep memory as tightly >>> encrypted as possible at the firmware --> OS control transfer. It could >>> be an exaggerated expectation from my side; I'd just better be safe than >>> sorry :) >>> >>> >> >> Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then >> we can discuss a security model (if needed). >> >> SEV is an extension to the AMD-V architecture which supports running encrypted >> virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their >> pages (code and data) secured such that only the guest itself has access to the >> unencrypted version. Each encrypted VMs is associated with a unique encryption >> key; if its data is accessed by a different entity using a different key the >> encrypted guest data will be incorrectly decrypted, leading to unintelligible data. >> You can also find some detail for SEV in white paper [1]. >> >> SEV encrypted Vs have the concept of private and shared memory. The private memory >> is encrypted with the guest-specific key, while shared memory may be encrypted >> with hypervisor key. SEV guest VMs can choose which pages they would like to >> be private. But the instruction pages and guest page tables are always treated >> as private by the hardware. The DMA operation inside the guest must be performed >> on shared pages -- this is mainly because in virtualization world the device >> DMA needs some assistance from the hypervisor. >> >> The security model provided by the SEV ensure that hypervisor will no longer able >> to inspect or alter any guest code or data. The guest OS controls what it want to >> share with outside world (in this case with Hypervisor). >> >> In software implementation we took approach to encrypt everything possible starting >> early in boot. In this approach whenever OVMF wants to perform certain DMA operations >> it allocate a shared page, issues the command, free the shared page after the command >> is completed (in other word we use sw bounce buffer to complete the DMA operation). >> >> We have implemented IOMMU driver to provide the following functions: >> >> AllocateBuffer(): >> -------------------- >> it allocate a private pages, as per UEFI spec the driver will map the buffer allocated >> from this routine using BusMasterCommonBuffer hence we allocate extra stash pages >> in addition to requested pages. >> >> >> Map >> ---- >> If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer >> and DeviceAddress point to shared buffer. >> >> If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the >> contents and update the page-table to clear the C-bit (basically make it shared page) >> >> Unmap >> ------ >> If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free >> the shared buffer allocated in Map(). >> >> If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit >> (basically make the page private) >> >> FreeBuffer: >> ----------- >> Free the pages >> >> >> Lets run with the below scenario: >> >> 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) >> 2) hypervisor completes the request operation >> 3) hypervisor caches the shared physical address and monitor it for information leak >> 4) sometime later if guest write data in its "shared" physical address then hypervisor can >> easily read it without guest knowledge. >> >> SEV hardware can protect us against the attack where someone tries to inject or alter the >> guest code. As I noted above any instruction fetch is forced as private hence if attacker >> write some code into a shared buffer and point the RIP to his/her code then instruction >> fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" >> then its guest responsibility to make it "private" after its done communicating with >> hypervisor. >> >> [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf >> >> >>>> If the C-bit is cleared for a read/write buffer, is the content in the >>>> read/write buffer also exposed to hypervisor? >>> >>> Not immediately. Only when the guest also rewrites the memory through >>> the appropriate virtual address. >>> >>> Basically, in the virtual machine, the C-bit in a PTE that covers a >>> particular guest virtual address (GVA) controls whether a guest write >>> through that GVA will result in memory encryption, and if a gues read >>> through that GVA will result in memory decryption. >>> >>> The current state of the C-bit doesn't matter for the hypervisor, what >>> matters is the last guest write through a GVA whose PTE had the C-bit >>> set, or clear. If the C-bit was clear when the guest last wrote the >>> area, then the hypervisor can read the data. If the C-bit was set, then >>> the hypervisor can only read garbage. >>> >>> >>>> I means if there is security concern, the concern should be applied to >>>> both common buffer and read/write buffer. >>>> Then we have to figure a way to resolve both buffer. >>> >>> Yes, this is correct. The PciIo.Unmap operation, as currently >>> implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption >>> correctly for all operations, but it can only guarantee *not* freeing >>> memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices >>> is safe, while Unmap()ing Read/Write is not. The encryption would be >>> re-established just fine for Read/Write as well, but we would change the >>> UEFI memmap. >>> >>> In OVMF, we currently manage this problem by making all asynchronous DMA >>> mappings CommonBuffer, even if they could othewise be Read or Write. We >>> use Read and Write only if the DMA operation completes before the >>> higher-level protocol function returns (so we can immediately Unmap the >>> operation, and the ExitBootServices() handler doesn't have to care). >>> >>> >>>> To be honest, that is exactly my biggest question on this patch: >>>> why do we only handle the common buffer specially? >>> >>> For the following reason: >>> >>> - Given that CommonBuffer mappings take a separate AllocateBuffer() / >>> FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo >>> implementors -- beyond what the UEFI spec requries -- to keep *all* >>> dynamic memory management out of Map/Unmap. If they need dynamic memory >>> management, we ask them to do it in AllocateBuffer/FreeBuffer instead. >>> This way Unmap is safe in ExitBootServices handlers. >>> >>> - We cannot *reasonably ask* PciIo implementors to provide the same >>> guarantee for Read and Write mappings, simply because there are no other >>> APIs that could manage the dynamic memory for Map and Unmap -- >>> AllocateBuffer and FreeBuffer are not required for Read and Write >>> mappings. PciIo implementors couldn't agree to our request even if they >>> wanted to. Therefore Unmapping Read/Write operations is inherently >>> unsafe in EBS handlers. >>> >>> >>>>> NOTE: The device should still be halt state, because the device is >>>>> also controlled by OS. >>>> >>>> This is the key point -- the device is *not* controlled by the guest OS, >>>> in the worst case. >>>> >>>> The virtual device is ultimately implemented by the hypervisor. We don't >>>> expect the virtual device (= the hypervisor) to mis-behave on purpose. >>>> However, if the hypervisor is compromised by an external attacker (for >>>> example over the network, or via privilege escalation from another >>>> guest), then all guest RAM that has not been encrypted up to that point >>>> becomes visible to the attacker. In other words, the hypervisor ("the >>>> device") becomes retro-actively malicious. >>>> [Jiewen] If that is the threat model, I have a question on the exposure: >>>> 1) If the concern is for existing data, it means all DMA data already written. >>>> However, the DMA data is already consumed or produced by virtual device >>>> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. >>> >>> Maybe, maybe not. The data may have been sent to a different host over >>> the network, and wiped from memory. >>> >>> Or, the data may have been written to a disk image that is separately >>> encrypted by the host OS, and has been detached (unplugged) from the >>> guest, and also unmounted from the host, with the disk key securely >>> wiped from host memory. >>> >>> We can also imagine the reverse direction. Assume that the user of the >>> virtual machine goes into the UEFI shell in OVMF, starts the EDIT >>> program, and types some secret information into a text file on the ESP. >>> Further assume that the disk image is encrypted on the host OS. It is >>> conceivable that fragments of the secret could remain stuck in the AHCI >>> (disk) and USB (keyboard) DMA buffers. >>> >>> Maybe you think that these are "made up" examples, and I agree, I just >>> made them up. The point is, it is not my place to discount possible >>> attack vectors (I haven't designed SEV). Attackers will be limited by >>> their *own* imaginations only, not by mine :) >>> >>> >>>> 2) if the concern is for future data, it means all user data to be written. >>>> However, the C-bit already prevent that. >>> >>> As far as I understand SEV, future data is out of scope. The goal is to >>> prevent *retroactive* guest data leaks (= whatever is currently in host >>> OS memory) if an attacker compromises an otherwise non-malicious hypervisor. >>> >>> If an attacker compromises a virtualization host, then they can just sit >>> silent and watch data flow into and out of guests from that point >>> onward, because they can then monitor all DMA (which always happens in >>> clear-text) real-time. >>> >>> >>>> Maybe I do not fully understand the threat model defined. >>>> If you can explain more, that would be great. >>> >>> Well I tried to explain *my* understanding of SEV :) I hope Brijesh will >>> correct me if I'm wrong. >>> >>> >>>> The point of SEV is to keep as much guest data encrypted at all times as >>>> possible, so that *whenever* the hypervisor is compromised by an >>>> attacker, the guest memory -- which the attacker can inevitably dump >>>> from the host side -- remains un-intellegible to the attacker. >>>> [Jiewen] OK. If this is a security question, I suggest we define a clear >>>> threat model at first. >>>> Or what we are doing might be unnecessary or insufficient. >>> >>> I agree. >>> >>> As I said above, my operating principle has been to re-encrypt all DMA >>> buffers as soon as possible. For long-standing DMA buffers, re-encrypt >>> them at ExitBootServices at the latest. >>> >>> >>>> [Jiewen] For "require that Unmap() work for CommonBuffer >>>> operations without releasing memory", I would hold my opinion until >>>> it is documented clearly in UEFI spec. >>> >>> You are right, of course. >>> >>> It's just that we cannot avoid exploring ways, for this feature, that >>> currently fall outside of the spec. Whatever we do here will be outside >>> of the spec, one way or another. I suggested what I thought could be a >>> reasonable extension to the spec, one that could be implemented by PciIo >>> implementors even before the spec is modified. >>> >>> edk2's PciIo.Unmap already behaves like this, without breaking the spec. >>> >>> If there's a better way -- for example modifying CoreExitBootServices() >>> in edk2, to signal IOMMU drivers separately, *after* notifying other >>> ExitBootServices() handlers --, that might work as well. >>> >>> >>>> My current concern is: >>>> First, this sentence is NOT defined in UEFI specification. >>> >>> Correct. >>> >>> >>>> Second, it might break an existing implementation or existing feature, such as tracing. >>> >>> Correct. >>> >>>> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >>>> to call memory services. >>>> The only restriction is >>>> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >>>> 2) AP restriction, where no UEFI service/protocol is allowed for AP. >>> >>> I agree. >>> >>> >>>> I'm just trying to operate with the APIs currently defined by the UEFI >>>> spec, and these assumptions were the best I could come up with. >>>> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >>>> Especially the IHV Option ROM should not consume any private API. >>> >>> I disagree about "only follow". If there are additional requirements >>> that can be satisfied without breaking the spec, driver implementors can >>> choose to conform to both sets of requirements. >>> >>> For example (if I understand correctly), Microsoft / Windows present a >>> bunch of requirements *beyond* the UEFI spec, for both platform and >>> add-on card firmware. Most vendors conform :) >>> >>> >>>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>>>> guest memory should be encrypted again, even those areas that were once >>>>> used as CommonBuffers." >>>>> For SEV case, I think it should be controlled by OS, because it is just CR3. >>>> >>>> If it was indeed just CR3, then I would fully agree with you. >>>> >>>> But -- to my understanding --, ensuring that the memory is actually >>>> encrypted requires that the guest *write* to the memory through a >>>> virtual address whose PTE has the C-bit set. >>>> >>>> And I don't think the guest OS can be expected to rewrite all of its >>>> memory at launch. :( >>>> >>>> [Jiewen] That is fine. >>>> I suggest we get clear on the threat model as the first step. >>>> The threat model for SEV usage in OVMF. >>>> >>>> I am sorry if that is already discussed before. I might ignore the conversation. >>> >>> No problem; it's always good to re-investigate our assumptions and >>> operating principles. >>> >>> >>>> If you or any SEV owner can share the insight, that will be great. >>>> See my question above "If that is the threat model, I have a question on the exposure:..." >>> >>> I shared *my* impressions of the threat model (which are somewhat >>> unclear, admittedly, and thus could make me overly cautious). >>> >>> I hope Brijesh can extend and/or correct my description. >>> >>> >>>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>>>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>>>> SEV forbids all DMA unless configured otherwise. >>>>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>>>> I setup translation table, but mark all entry to be NOT-PRESENT. >>>>> I grant DMA access only if there is a specific request by a device. >>>>> >>>>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>>>> the existing OS, which might not have knowledge on IOMMU. >>>>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>>>> turn off IOMMU, or keep it enabled at ExitBootService. >>>>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >>>> >>>> Thanks for the clarification. >>>> >>>> But, again, will this not lead to the possibility that the VT-d IOMMU's >>>> ExitBootServices() handler disables all DMA *before* the PCI drivers get >>>> a chance to run their own ExitBootServices() handlers, disabling the >>>> individual devices? >>>> [Jiewen] Sorry for the confusing. Let me explain: >>>> No, VTd never disables *all* DMA buffer at ExitBootService. >>>> >>>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >>>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >>>> "only allow the DMA from Map() request". >>> >>> Okay, but this interpretation was exactly what I thought of first (see >>> above): "VT-d permits all DMA unless configured otherwise". You answered >>> that it wasn't the case. >>> >>> So basically, if I understand it correctly now, at ExitBootServices the >>> VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? >>> >>> That is equivalent to decrypting all memory under SEV, and is the exact >>> opposite of what we want. (As I understand it.) >>> >>> >>>> If that happens, then a series of IOMMU faults could be generated, which >>>> I described above. (I.e., such IOMMU faults could result, at least in >>>> the case of SEV, in garbage being written to disk, via queued commands.) >>>> [Jiewen] You are right. That would not happen. >>>> IOMMU driver should not bring any compatibility problem for the PCI driver, >>>> iff the PCI driver follows the UEFI specification. >>> >>> Agreed. >>> >>> >>>> Now, to finish up, here's an idea I just had. >>>> >>>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >>>> notification function? >>>> >>>> If we are, then we could do the following: >>>> >>>> * PciIo.Unmap() and friends would work as always (no restrictions on >>>> dynamic memory allocation or freeing, for any kind of DMA operation). >>>> >>>> * PCI drivers would not be expected to call PciIo.Unmap() in their >>>> ExitBootServices() handlers. >>>> >>>> * The IOMMU driver would have an ExitBootServices() notification >>>> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >>>> (doesn't matter which). >>>> >>>> * In this notification function, the IOMMU driver would signal *another* >>>> event (a private one). The notification function for this event would >>>> be enqueued strictly at the TPL_CALLBACK level. >>>> >>>> * The notification function for the second event (private to the IOMMU) >>>> would iterate over all existent mappings, and unmap them, without >>>> allocating or freeing memory. >>>> >>>> The key point is that by signaling the second event, the IOMMU driver >>>> could order its own cleanup code after all other ExitBootServices() >>>> callbacks. (Assuming, at least, that no other ExitBootServices() >>>> callback employs the same trick to defer itself!) Therefore by the time >>>> the second callback runs, all PCI devices have been halted, and it is >>>> safe to tear down the translations. >>>> >>>> The ordering would be ensured by the following: >>>> >>>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >>>> to be signaled before the first notification action is taken." This >>>> means that, by the time the IOMMU's ExitBootServices() handler is >>>> entered, all other ExitBootServices() handlers have been *queued* at >>>> least, at TPL_CALLBACK or TPL_NOTIFY. >>>> >>>> - Therefore, when we signal our second callback from there, for >>>> TPL_CALLBACK, the second callback will be queued at the end of the >>>> TPL_CALLBACK queue. >>>> >>>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >>>> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >>>> for the Type argument of CreateEvent." So it wouldn't matter if a >>>> driver used CreateEvent() or CreateEventEx() for setting up its own >>>> handler, the handler would be queued just the same. >>>> >>>> I think this is ugly and brittle, but perhaps the only way to clean up >>>> *all* translations safely, with regard to PciIo.Unmap() + >>>> ExitBootServices(), without updating the UEFI spec. >>>> >>>> What do you think? >>>> >>>> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >>>> We do not need update PCI driver and we do not need update UEFI spec. >>>> We only need update IOMMU driver which is concerned, based upon the threat model. >>>> That probably is best solution. :-) >>> >>> I'm very glad to hear that you like the idea. >>> >>> However, it depends on whether we are permitted, by the UEFI spec, to >>> signal another event in an ExitBootServices() notification function. >>> >>> Are we permitted to do that? >>> >>> Does the UEFI spec guarantee that the notification function for the >>> *second* event will be queued just like it would be under "normal" >>> circumstances? >>> >>> (I know we must not allocate or free memory in the notification function >>> of the *second* event either; I just want to know if the second event's >>> handler is *queued* like it would normally be.) >>> >>> >>>> I assume you want to handle both common buffer and read/write buffer, right? >>> >>> Yes. Under this idea, all kinds of operations would be cleaned up. >>> >>> >>>> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >>>> If you or any SEV owner can share ... >>> >>> Absolutely. Please see above. >>> >>> Thank you! >>> Laszlo >>> >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org >> https://lists.01.org/mailman/listinfo/edk2-devel >> > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Great. Thanks to confirm that. Now it is clear to me. Then let's wait Laszlo's new patch to make all DMA buffer to private. :) Thank you Yao, Jiewen From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Brijesh Singh Sent: Thursday, September 7, 2017 10:40 PM To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() Hi Jiewen, On 09/07/2017 06:46 AM, Laszlo Ersek wrote: > On 09/07/17 06:46, Yao, Jiewen wrote: >> Thanks for the sharing, Brijesh. >> >> "If a page was marked as "shared" >> then its guest responsibility to make it "private" after its done communicating with >> hypervisor." >> >> I believe I have same understanding - a *guest* should guarantee that. >> >> My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"? >> >> Maybe I can ask the specific question to get it more clear. >> >> 1) If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue? >> >> 2) If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue? > > Very good questions, totally to the point. > > On the authoritative answer, I defer to Brijesh. > Both the above cases (#1 and #2) are problems. Since buffers was owned by firmware and firmware made it "shared" hence firmware is responsible to mark them as private after its done with the buffer. In other words, we must call Unmap() from ExitBootServices to ensure that buffers mapped using BusMasterCommonBuffer/BusMasterRead/BusMasterWrite is marked as "private" before we make it available to the guest OS. (we do similar thing in Linux OS). Having any kind of side channel to communicate the encryption status of a page will not work -- we should be able to support a usecase where we boot OVMF in 64-bit but launch 32-bit Linux guest. When Linux boots in 32-bit mode it does not have access to encryption bit (C-bit is bit-47 in page table) and can't mark the page as private (even if we provide some kind of side-channel). thank you very much for all your help. > ( > > My personal opinion is that both situations (#1 and #2) are problems, > because they break the *practical* security invariant for SEV guests: > > - most memory should be encrypted at all times, *and* > > - any memory that is decrypted must have an owner that is responsible > for re-encrypting the memory eventually. > > Therefore, *either* the firmware has to re-encrypt all remaining DMA > buffers at ExitBootServices(), *or* a new information channel must be > designed, from firmware to OS, to carry over the decryption status. > > I strongly prefer the first option, for the following reason: the same > questions apply to all EDK2 IOMMU protocol interfaces, not just the one > exported by the SEV driver. > > ) > > Thanks, > Laszlo > >> From: Brijesh Singh [mailto:brijesh.singh@amd.com] >> Sent: Wednesday, September 6, 2017 11:40 PM >> To: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com>>; Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >> Cc: brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>; Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>> >> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> >> >> >> On 09/06/2017 07:14 AM, Laszlo Ersek wrote: >>> On 09/06/17 06:37, Yao, Jiewen wrote: >>>> Thanks for the clarification. Comment in line. >>>> >>>> From: Laszlo Ersek [mailto:lersek@redhat.com] >>>> Sent: Wednesday, September 6, 2017 1:57 AM >>>> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com<mailto:jiewen.yao@intel.com%3cmailto:jiewen.yao@intel.com>>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>> >>>> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com<mailto:eric.dong@intel.com%3cmailto:eric.dong@intel.com>>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com<mailto:brijesh.singh@amd.com%3cmailto:brijesh.singh@amd.com>>> >>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >>> >>>>> Then after ExitBootService, the OS will take control of CR3 and set correct >>>>> encryption bit. >>>> >>>> This is true, the guest OS will set up new page tables, and in those >>>> PTEs, the C-bit ("encrypt") will likely be set by default. >>>> >>>> However, the guest OS will not *rewrite* all of the memory, with the >>>> C-bit set. This means that any memory that the firmware didn't actually >>>> re-encrypt (in the IOMMU driver) will still expose stale data to the >>>> hypervisor. >>>> [Jiewen] That is an interesting question. >>>> Is there any security concern associated? >>> >>> I can't tell for sure. Answering this question is up to the proponents >>> of SEV, who have designed the threat model for SEV. >>> >>> My operating assumption is that we should keep memory as tightly >>> encrypted as possible at the firmware --> OS control transfer. It could >>> be an exaggerated expectation from my side; I'd just better be safe than >>> sorry :) >>> >>> >> >> Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then >> we can discuss a security model (if needed). >> >> SEV is an extension to the AMD-V architecture which supports running encrypted >> virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their >> pages (code and data) secured such that only the guest itself has access to the >> unencrypted version. Each encrypted VMs is associated with a unique encryption >> key; if its data is accessed by a different entity using a different key the >> encrypted guest data will be incorrectly decrypted, leading to unintelligible data. >> You can also find some detail for SEV in white paper [1]. >> >> SEV encrypted Vs have the concept of private and shared memory. The private memory >> is encrypted with the guest-specific key, while shared memory may be encrypted >> with hypervisor key. SEV guest VMs can choose which pages they would like to >> be private. But the instruction pages and guest page tables are always treated >> as private by the hardware. The DMA operation inside the guest must be performed >> on shared pages -- this is mainly because in virtualization world the device >> DMA needs some assistance from the hypervisor. >> >> The security model provided by the SEV ensure that hypervisor will no longer able >> to inspect or alter any guest code or data. The guest OS controls what it want to >> share with outside world (in this case with Hypervisor). >> >> In software implementation we took approach to encrypt everything possible starting >> early in boot. In this approach whenever OVMF wants to perform certain DMA operations >> it allocate a shared page, issues the command, free the shared page after the command >> is completed (in other word we use sw bounce buffer to complete the DMA operation). >> >> We have implemented IOMMU driver to provide the following functions: >> >> AllocateBuffer(): >> -------------------- >> it allocate a private pages, as per UEFI spec the driver will map the buffer allocated >> from this routine using BusMasterCommonBuffer hence we allocate extra stash pages >> in addition to requested pages. >> >> >> Map >> ---- >> If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer >> and DeviceAddress point to shared buffer. >> >> If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the >> contents and update the page-table to clear the C-bit (basically make it shared page) >> >> Unmap >> ------ >> If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free >> the shared buffer allocated in Map(). >> >> If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit >> (basically make the page private) >> >> FreeBuffer: >> ----------- >> Free the pages >> >> >> Lets run with the below scenario: >> >> 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) >> 2) hypervisor completes the request operation >> 3) hypervisor caches the shared physical address and monitor it for information leak >> 4) sometime later if guest write data in its "shared" physical address then hypervisor can >> easily read it without guest knowledge. >> >> SEV hardware can protect us against the attack where someone tries to inject or alter the >> guest code. As I noted above any instruction fetch is forced as private hence if attacker >> write some code into a shared buffer and point the RIP to his/her code then instruction >> fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" >> then its guest responsibility to make it "private" after its done communicating with >> hypervisor. >> >> [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf >> >> >>>> If the C-bit is cleared for a read/write buffer, is the content in the >>>> read/write buffer also exposed to hypervisor? >>> >>> Not immediately. Only when the guest also rewrites the memory through >>> the appropriate virtual address. >>> >>> Basically, in the virtual machine, the C-bit in a PTE that covers a >>> particular guest virtual address (GVA) controls whether a guest write >>> through that GVA will result in memory encryption, and if a gues read >>> through that GVA will result in memory decryption. >>> >>> The current state of the C-bit doesn't matter for the hypervisor, what >>> matters is the last guest write through a GVA whose PTE had the C-bit >>> set, or clear. If the C-bit was clear when the guest last wrote the >>> area, then the hypervisor can read the data. If the C-bit was set, then >>> the hypervisor can only read garbage. >>> >>> >>>> I means if there is security concern, the concern should be applied to >>>> both common buffer and read/write buffer. >>>> Then we have to figure a way to resolve both buffer. >>> >>> Yes, this is correct. The PciIo.Unmap operation, as currently >>> implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption >>> correctly for all operations, but it can only guarantee *not* freeing >>> memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices >>> is safe, while Unmap()ing Read/Write is not. The encryption would be >>> re-established just fine for Read/Write as well, but we would change the >>> UEFI memmap. >>> >>> In OVMF, we currently manage this problem by making all asynchronous DMA >>> mappings CommonBuffer, even if they could othewise be Read or Write. We >>> use Read and Write only if the DMA operation completes before the >>> higher-level protocol function returns (so we can immediately Unmap the >>> operation, and the ExitBootServices() handler doesn't have to care). >>> >>> >>>> To be honest, that is exactly my biggest question on this patch: >>>> why do we only handle the common buffer specially? >>> >>> For the following reason: >>> >>> - Given that CommonBuffer mappings take a separate AllocateBuffer() / >>> FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo >>> implementors -- beyond what the UEFI spec requries -- to keep *all* >>> dynamic memory management out of Map/Unmap. If they need dynamic memory >>> management, we ask them to do it in AllocateBuffer/FreeBuffer instead. >>> This way Unmap is safe in ExitBootServices handlers. >>> >>> - We cannot *reasonably ask* PciIo implementors to provide the same >>> guarantee for Read and Write mappings, simply because there are no other >>> APIs that could manage the dynamic memory for Map and Unmap -- >>> AllocateBuffer and FreeBuffer are not required for Read and Write >>> mappings. PciIo implementors couldn't agree to our request even if they >>> wanted to. Therefore Unmapping Read/Write operations is inherently >>> unsafe in EBS handlers. >>> >>> >>>>> NOTE: The device should still be halt state, because the device is >>>>> also controlled by OS. >>>> >>>> This is the key point -- the device is *not* controlled by the guest OS, >>>> in the worst case. >>>> >>>> The virtual device is ultimately implemented by the hypervisor. We don't >>>> expect the virtual device (= the hypervisor) to mis-behave on purpose. >>>> However, if the hypervisor is compromised by an external attacker (for >>>> example over the network, or via privilege escalation from another >>>> guest), then all guest RAM that has not been encrypted up to that point >>>> becomes visible to the attacker. In other words, the hypervisor ("the >>>> device") becomes retro-actively malicious. >>>> [Jiewen] If that is the threat model, I have a question on the exposure: >>>> 1) If the concern is for existing data, it means all DMA data already written. >>>> However, the DMA data is already consumed or produced by virtual device >>>> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. >>> >>> Maybe, maybe not. The data may have been sent to a different host over >>> the network, and wiped from memory. >>> >>> Or, the data may have been written to a disk image that is separately >>> encrypted by the host OS, and has been detached (unplugged) from the >>> guest, and also unmounted from the host, with the disk key securely >>> wiped from host memory. >>> >>> We can also imagine the reverse direction. Assume that the user of the >>> virtual machine goes into the UEFI shell in OVMF, starts the EDIT >>> program, and types some secret information into a text file on the ESP. >>> Further assume that the disk image is encrypted on the host OS. It is >>> conceivable that fragments of the secret could remain stuck in the AHCI >>> (disk) and USB (keyboard) DMA buffers. >>> >>> Maybe you think that these are "made up" examples, and I agree, I just >>> made them up. The point is, it is not my place to discount possible >>> attack vectors (I haven't designed SEV). Attackers will be limited by >>> their *own* imaginations only, not by mine :) >>> >>> >>>> 2) if the concern is for future data, it means all user data to be written. >>>> However, the C-bit already prevent that. >>> >>> As far as I understand SEV, future data is out of scope. The goal is to >>> prevent *retroactive* guest data leaks (= whatever is currently in host >>> OS memory) if an attacker compromises an otherwise non-malicious hypervisor. >>> >>> If an attacker compromises a virtualization host, then they can just sit >>> silent and watch data flow into and out of guests from that point >>> onward, because they can then monitor all DMA (which always happens in >>> clear-text) real-time. >>> >>> >>>> Maybe I do not fully understand the threat model defined. >>>> If you can explain more, that would be great. >>> >>> Well I tried to explain *my* understanding of SEV :) I hope Brijesh will >>> correct me if I'm wrong. >>> >>> >>>> The point of SEV is to keep as much guest data encrypted at all times as >>>> possible, so that *whenever* the hypervisor is compromised by an >>>> attacker, the guest memory -- which the attacker can inevitably dump >>>> from the host side -- remains un-intellegible to the attacker. >>>> [Jiewen] OK. If this is a security question, I suggest we define a clear >>>> threat model at first. >>>> Or what we are doing might be unnecessary or insufficient. >>> >>> I agree. >>> >>> As I said above, my operating principle has been to re-encrypt all DMA >>> buffers as soon as possible. For long-standing DMA buffers, re-encrypt >>> them at ExitBootServices at the latest. >>> >>> >>>> [Jiewen] For "require that Unmap() work for CommonBuffer >>>> operations without releasing memory", I would hold my opinion until >>>> it is documented clearly in UEFI spec. >>> >>> You are right, of course. >>> >>> It's just that we cannot avoid exploring ways, for this feature, that >>> currently fall outside of the spec. Whatever we do here will be outside >>> of the spec, one way or another. I suggested what I thought could be a >>> reasonable extension to the spec, one that could be implemented by PciIo >>> implementors even before the spec is modified. >>> >>> edk2's PciIo.Unmap already behaves like this, without breaking the spec. >>> >>> If there's a better way -- for example modifying CoreExitBootServices() >>> in edk2, to signal IOMMU drivers separately, *after* notifying other >>> ExitBootServices() handlers --, that might work as well. >>> >>> >>>> My current concern is: >>>> First, this sentence is NOT defined in UEFI specification. >>> >>> Correct. >>> >>> >>>> Second, it might break an existing implementation or existing feature, such as tracing. >>> >>> Correct. >>> >>>> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >>>> to call memory services. >>>> The only restriction is >>>> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >>>> 2) AP restriction, where no UEFI service/protocol is allowed for AP. >>> >>> I agree. >>> >>> >>>> I'm just trying to operate with the APIs currently defined by the UEFI >>>> spec, and these assumptions were the best I could come up with. >>>> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >>>> Especially the IHV Option ROM should not consume any private API. >>> >>> I disagree about "only follow". If there are additional requirements >>> that can be satisfied without breaking the spec, driver implementors can >>> choose to conform to both sets of requirements. >>> >>> For example (if I understand correctly), Microsoft / Windows present a >>> bunch of requirements *beyond* the UEFI spec, for both platform and >>> add-on card firmware. Most vendors conform :) >>> >>> >>>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>>>> guest memory should be encrypted again, even those areas that were once >>>>> used as CommonBuffers." >>>>> For SEV case, I think it should be controlled by OS, because it is just CR3. >>>> >>>> If it was indeed just CR3, then I would fully agree with you. >>>> >>>> But -- to my understanding --, ensuring that the memory is actually >>>> encrypted requires that the guest *write* to the memory through a >>>> virtual address whose PTE has the C-bit set. >>>> >>>> And I don't think the guest OS can be expected to rewrite all of its >>>> memory at launch. :( >>>> >>>> [Jiewen] That is fine. >>>> I suggest we get clear on the threat model as the first step. >>>> The threat model for SEV usage in OVMF. >>>> >>>> I am sorry if that is already discussed before. I might ignore the conversation. >>> >>> No problem; it's always good to re-investigate our assumptions and >>> operating principles. >>> >>> >>>> If you or any SEV owner can share the insight, that will be great. >>>> See my question above "If that is the threat model, I have a question on the exposure:..." >>> >>> I shared *my* impressions of the threat model (which are somewhat >>> unclear, admittedly, and thus could make me overly cautious). >>> >>> I hope Brijesh can extend and/or correct my description. >>> >>> >>>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>>>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>>>> SEV forbids all DMA unless configured otherwise. >>>>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>>>> I setup translation table, but mark all entry to be NOT-PRESENT. >>>>> I grant DMA access only if there is a specific request by a device. >>>>> >>>>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>>>> the existing OS, which might not have knowledge on IOMMU. >>>>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>>>> turn off IOMMU, or keep it enabled at ExitBootService. >>>>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >>>> >>>> Thanks for the clarification. >>>> >>>> But, again, will this not lead to the possibility that the VT-d IOMMU's >>>> ExitBootServices() handler disables all DMA *before* the PCI drivers get >>>> a chance to run their own ExitBootServices() handlers, disabling the >>>> individual devices? >>>> [Jiewen] Sorry for the confusing. Let me explain: >>>> No, VTd never disables *all* DMA buffer at ExitBootService. >>>> >>>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >>>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >>>> "only allow the DMA from Map() request". >>> >>> Okay, but this interpretation was exactly what I thought of first (see >>> above): "VT-d permits all DMA unless configured otherwise". You answered >>> that it wasn't the case. >>> >>> So basically, if I understand it correctly now, at ExitBootServices the >>> VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? >>> >>> That is equivalent to decrypting all memory under SEV, and is the exact >>> opposite of what we want. (As I understand it.) >>> >>> >>>> If that happens, then a series of IOMMU faults could be generated, which >>>> I described above. (I.e., such IOMMU faults could result, at least in >>>> the case of SEV, in garbage being written to disk, via queued commands.) >>>> [Jiewen] You are right. That would not happen. >>>> IOMMU driver should not bring any compatibility problem for the PCI driver, >>>> iff the PCI driver follows the UEFI specification. >>> >>> Agreed. >>> >>> >>>> Now, to finish up, here's an idea I just had. >>>> >>>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >>>> notification function? >>>> >>>> If we are, then we could do the following: >>>> >>>> * PciIo.Unmap() and friends would work as always (no restrictions on >>>> dynamic memory allocation or freeing, for any kind of DMA operation). >>>> >>>> * PCI drivers would not be expected to call PciIo.Unmap() in their >>>> ExitBootServices() handlers. >>>> >>>> * The IOMMU driver would have an ExitBootServices() notification >>>> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >>>> (doesn't matter which). >>>> >>>> * In this notification function, the IOMMU driver would signal *another* >>>> event (a private one). The notification function for this event would >>>> be enqueued strictly at the TPL_CALLBACK level. >>>> >>>> * The notification function for the second event (private to the IOMMU) >>>> would iterate over all existent mappings, and unmap them, without >>>> allocating or freeing memory. >>>> >>>> The key point is that by signaling the second event, the IOMMU driver >>>> could order its own cleanup code after all other ExitBootServices() >>>> callbacks. (Assuming, at least, that no other ExitBootServices() >>>> callback employs the same trick to defer itself!) Therefore by the time >>>> the second callback runs, all PCI devices have been halted, and it is >>>> safe to tear down the translations. >>>> >>>> The ordering would be ensured by the following: >>>> >>>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >>>> to be signaled before the first notification action is taken." This >>>> means that, by the time the IOMMU's ExitBootServices() handler is >>>> entered, all other ExitBootServices() handlers have been *queued* at >>>> least, at TPL_CALLBACK or TPL_NOTIFY. >>>> >>>> - Therefore, when we signal our second callback from there, for >>>> TPL_CALLBACK, the second callback will be queued at the end of the >>>> TPL_CALLBACK queue. >>>> >>>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >>>> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >>>> for the Type argument of CreateEvent." So it wouldn't matter if a >>>> driver used CreateEvent() or CreateEventEx() for setting up its own >>>> handler, the handler would be queued just the same. >>>> >>>> I think this is ugly and brittle, but perhaps the only way to clean up >>>> *all* translations safely, with regard to PciIo.Unmap() + >>>> ExitBootServices(), without updating the UEFI spec. >>>> >>>> What do you think? >>>> >>>> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >>>> We do not need update PCI driver and we do not need update UEFI spec. >>>> We only need update IOMMU driver which is concerned, based upon the threat model. >>>> That probably is best solution. :-) >>> >>> I'm very glad to hear that you like the idea. >>> >>> However, it depends on whether we are permitted, by the UEFI spec, to >>> signal another event in an ExitBootServices() notification function. >>> >>> Are we permitted to do that? >>> >>> Does the UEFI spec guarantee that the notification function for the >>> *second* event will be queued just like it would be under "normal" >>> circumstances? >>> >>> (I know we must not allocate or free memory in the notification function >>> of the *second* event either; I just want to know if the second event's >>> handler is *queued* like it would normally be.) >>> >>> >>>> I assume you want to handle both common buffer and read/write buffer, right? >>> >>> Yes. Under this idea, all kinds of operations would be cleaned up. >>> >>> >>>> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >>>> If you or any SEV owner can share ... >>> >>> Absolutely. Please see above. >>> >>> Thank you! >>> Laszlo >>> >> _______________________________________________ >> edk2-devel mailing list >> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> >> https://lists.01.org/mailman/listinfo/edk2-devel >> > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> https://lists.01.org/mailman/listinfo/edk2-devel _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/07/17 16:48, Yao, Jiewen wrote: > Great. Thanks to confirm that. Now it is clear to me. > > Then let's wait Laszlo's new patch to make all DMA buffer to private. :) I've got the patches ready and smoke-tested. Let me look a bit more at the logs, and re-review the patches. I hope I can post the series still today (well, in my timezone anyway :) ) Thanks! Laszlo > From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Brijesh Singh > Sent: Thursday, September 7, 2017 10:40 PM > To: Laszlo Ersek <lersek@redhat.com>; Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> > Cc: brijesh.singh@amd.com; Dong, Eric <eric.dong@intel.com> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() > > Hi Jiewen, > > On 09/07/2017 06:46 AM, Laszlo Ersek wrote: >> On 09/07/17 06:46, Yao, Jiewen wrote: >>> Thanks for the sharing, Brijesh. >>> >>> "If a page was marked as "shared" >>> then its guest responsibility to make it "private" after its done communicating with >>> hypervisor." >>> >>> I believe I have same understanding - a *guest* should guarantee that. >>> >>> My question is: During the *transition* from firmware to OS, *which guest* should make the shared buffer to be private? Is it "guest firmware" or "guest OS"? >>> >>> Maybe I can ask the specific question to get it more clear. >>> >>> 1) If the common DMA buffer is not unmapped at ExitBootService, is that treated as an issue? >>> >>> 2) If the read/write DMA buffer is not unmapped at ExitBootService, is that treated as an issue? >> >> Very good questions, totally to the point. >> >> On the authoritative answer, I defer to Brijesh. >> > > > Both the above cases (#1 and #2) are problems. Since buffers was owned by firmware > and firmware made it "shared" hence firmware is responsible to mark them as private > after its done with the buffer. In other words, we must call Unmap() from ExitBootServices > to ensure that buffers mapped using BusMasterCommonBuffer/BusMasterRead/BusMasterWrite > is marked as "private" before we make it available to the guest OS. (we do similar thing > in Linux OS). > > Having any kind of side channel to communicate the encryption status of a page > will not work -- we should be able to support a usecase where we boot OVMF in > 64-bit but launch 32-bit Linux guest. When Linux boots in 32-bit mode it does not have > access to encryption bit (C-bit is bit-47 in page table) and can't mark the page as > private (even if we provide some kind of side-channel). > > thank you very much for all your help. > >> ( >> >> My personal opinion is that both situations (#1 and #2) are problems, >> because they break the *practical* security invariant for SEV guests: >> >> - most memory should be encrypted at all times, *and* >> >> - any memory that is decrypted must have an owner that is responsible >> for re-encrypting the memory eventually. >> >> Therefore, *either* the firmware has to re-encrypt all remaining DMA >> buffers at ExitBootServices(), *or* a new information channel must be >> designed, from firmware to OS, to carry over the decryption status. >> >> I strongly prefer the first option, for the following reason: the same >> questions apply to all EDK2 IOMMU protocol interfaces, not just the one >> exported by the SEV driver. >> >> ) >> >> Thanks, >> Laszlo >> >>> From: Brijesh Singh [mailto:brijesh.singh@amd.com] >>> Sent: Wednesday, September 6, 2017 11:40 PM >>> To: Laszlo Ersek <lersek@redhat.com<mailto:lersek@redhat.com>>; Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> >>> Cc: brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>; Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>> >>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >>> >>> >>> >>> On 09/06/2017 07:14 AM, Laszlo Ersek wrote: >>>> On 09/06/17 06:37, Yao, Jiewen wrote: >>>>> Thanks for the clarification. Comment in line. >>>>> >>>>> From: Laszlo Ersek [mailto:lersek@redhat.com] >>>>> Sent: Wednesday, September 6, 2017 1:57 AM >>>>> To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com<mailto:jiewen.yao@intel.com%3cmailto:jiewen.yao@intel.com>>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com<mailto:star.zeng@intel.com%3cmailto:star.zeng@intel.com>>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org%3cmailto:edk2-devel@lists.01.org>>> >>>>> Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com<mailto:eric.dong@intel.com%3cmailto:eric.dong@intel.com>>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com<mailto:brijesh.singh@amd.com%3cmailto:brijesh.singh@amd.com>>> >>>>> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >>>> >>>>>> Then after ExitBootService, the OS will take control of CR3 and set correct >>>>>> encryption bit. >>>>> >>>>> This is true, the guest OS will set up new page tables, and in those >>>>> PTEs, the C-bit ("encrypt") will likely be set by default. >>>>> >>>>> However, the guest OS will not *rewrite* all of the memory, with the >>>>> C-bit set. This means that any memory that the firmware didn't actually >>>>> re-encrypt (in the IOMMU driver) will still expose stale data to the >>>>> hypervisor. >>>>> [Jiewen] That is an interesting question. >>>>> Is there any security concern associated? >>>> >>>> I can't tell for sure. Answering this question is up to the proponents >>>> of SEV, who have designed the threat model for SEV. >>>> >>>> My operating assumption is that we should keep memory as tightly >>>> encrypted as possible at the firmware --> OS control transfer. It could >>>> be an exaggerated expectation from my side; I'd just better be safe than >>>> sorry :) >>>> >>>> >>> >>> Let me give some brief intro on SEV (Secure Encrypted Virtualization) and then >>> we can discuss a security model (if needed). >>> >>> SEV is an extension to the AMD-V architecture which supports running encrypted >>> virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their >>> pages (code and data) secured such that only the guest itself has access to the >>> unencrypted version. Each encrypted VMs is associated with a unique encryption >>> key; if its data is accessed by a different entity using a different key the >>> encrypted guest data will be incorrectly decrypted, leading to unintelligible data. >>> You can also find some detail for SEV in white paper [1]. >>> >>> SEV encrypted Vs have the concept of private and shared memory. The private memory >>> is encrypted with the guest-specific key, while shared memory may be encrypted >>> with hypervisor key. SEV guest VMs can choose which pages they would like to >>> be private. But the instruction pages and guest page tables are always treated >>> as private by the hardware. The DMA operation inside the guest must be performed >>> on shared pages -- this is mainly because in virtualization world the device >>> DMA needs some assistance from the hypervisor. >>> >>> The security model provided by the SEV ensure that hypervisor will no longer able >>> to inspect or alter any guest code or data. The guest OS controls what it want to >>> share with outside world (in this case with Hypervisor). >>> >>> In software implementation we took approach to encrypt everything possible starting >>> early in boot. In this approach whenever OVMF wants to perform certain DMA operations >>> it allocate a shared page, issues the command, free the shared page after the command >>> is completed (in other word we use sw bounce buffer to complete the DMA operation). >>> >>> We have implemented IOMMU driver to provide the following functions: >>> >>> AllocateBuffer(): >>> -------------------- >>> it allocate a private pages, as per UEFI spec the driver will map the buffer allocated >>> from this routine using BusMasterCommonBuffer hence we allocate extra stash pages >>> in addition to requested pages. >>> >>> >>> Map >>> ---- >>> If requested operation is BusMasterRead or BusMasterWrite then we allocate a shared buffer >>> and DeviceAddress point to shared buffer. >>> >>> If requested operation is BusMasterCommonBuffer then we perform in-place decryption of the >>> contents and update the page-table to clear the C-bit (basically make it shared page) >>> >>> Unmap >>> ------ >>> If operation was BusMasterRead or BusMasterWrite then we complete the unmapping and free >>> the shared buffer allocated in Map(). >>> >>> If operation was BusMasterCommonBuffer then we perform in-place encryption and set the C-bit >>> (basically make the page private) >>> >>> FreeBuffer: >>> ----------- >>> Free the pages >>> >>> >>> Lets run with the below scenario: >>> >>> 1) guest marks a page as "shared" and gives its physical address to HV (e.g DMA read) >>> 2) hypervisor completes the request operation >>> 3) hypervisor caches the shared physical address and monitor it for information leak >>> 4) sometime later if guest write data in its "shared" physical address then hypervisor can >>> easily read it without guest knowledge. >>> >>> SEV hardware can protect us against the attack where someone tries to inject or alter the >>> guest code. As I noted above any instruction fetch is forced as private hence if attacker >>> write some code into a shared buffer and point the RIP to his/her code then instruction >>> fetch will try to decrypt it and get unintelligible opcode. If a page was marked as "shared" >>> then its guest responsibility to make it "private" after its done communicating with >>> hypervisor. >>> >>> [1] http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf >>> >>> >>>>> If the C-bit is cleared for a read/write buffer, is the content in the >>>>> read/write buffer also exposed to hypervisor? >>>> >>>> Not immediately. Only when the guest also rewrites the memory through >>>> the appropriate virtual address. >>>> >>>> Basically, in the virtual machine, the C-bit in a PTE that covers a >>>> particular guest virtual address (GVA) controls whether a guest write >>>> through that GVA will result in memory encryption, and if a gues read >>>> through that GVA will result in memory decryption. >>>> >>>> The current state of the C-bit doesn't matter for the hypervisor, what >>>> matters is the last guest write through a GVA whose PTE had the C-bit >>>> set, or clear. If the C-bit was clear when the guest last wrote the >>>> area, then the hypervisor can read the data. If the C-bit was set, then >>>> the hypervisor can only read garbage. >>>> >>>> >>>>> I means if there is security concern, the concern should be applied to >>>>> both common buffer and read/write buffer. >>>>> Then we have to figure a way to resolve both buffer. >>>> >>>> Yes, this is correct. The PciIo.Unmap operation, as currently >>>> implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption >>>> correctly for all operations, but it can only guarantee *not* freeing >>>> memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices >>>> is safe, while Unmap()ing Read/Write is not. The encryption would be >>>> re-established just fine for Read/Write as well, but we would change the >>>> UEFI memmap. >>>> >>>> In OVMF, we currently manage this problem by making all asynchronous DMA >>>> mappings CommonBuffer, even if they could othewise be Read or Write. We >>>> use Read and Write only if the DMA operation completes before the >>>> higher-level protocol function returns (so we can immediately Unmap the >>>> operation, and the ExitBootServices() handler doesn't have to care). >>>> >>>> >>>>> To be honest, that is exactly my biggest question on this patch: >>>>> why do we only handle the common buffer specially? >>>> >>>> For the following reason: >>>> >>>> - Given that CommonBuffer mappings take a separate AllocateBuffer() / >>>> FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo >>>> implementors -- beyond what the UEFI spec requries -- to keep *all* >>>> dynamic memory management out of Map/Unmap. If they need dynamic memory >>>> management, we ask them to do it in AllocateBuffer/FreeBuffer instead. >>>> This way Unmap is safe in ExitBootServices handlers. >>>> >>>> - We cannot *reasonably ask* PciIo implementors to provide the same >>>> guarantee for Read and Write mappings, simply because there are no other >>>> APIs that could manage the dynamic memory for Map and Unmap -- >>>> AllocateBuffer and FreeBuffer are not required for Read and Write >>>> mappings. PciIo implementors couldn't agree to our request even if they >>>> wanted to. Therefore Unmapping Read/Write operations is inherently >>>> unsafe in EBS handlers. >>>> >>>> >>>>>> NOTE: The device should still be halt state, because the device is >>>>>> also controlled by OS. >>>>> >>>>> This is the key point -- the device is *not* controlled by the guest OS, >>>>> in the worst case. >>>>> >>>>> The virtual device is ultimately implemented by the hypervisor. We don't >>>>> expect the virtual device (= the hypervisor) to mis-behave on purpose. >>>>> However, if the hypervisor is compromised by an external attacker (for >>>>> example over the network, or via privilege escalation from another >>>>> guest), then all guest RAM that has not been encrypted up to that point >>>>> becomes visible to the attacker. In other words, the hypervisor ("the >>>>> device") becomes retro-actively malicious. >>>>> [Jiewen] If that is the threat model, I have a question on the exposure: >>>>> 1) If the concern is for existing data, it means all DMA data already written. >>>>> However, the DMA data is already consumed or produced by virtual device >>>>> or a hypervisor. If the hypervisor is attacked, it already gets all the data content. >>>> >>>> Maybe, maybe not. The data may have been sent to a different host over >>>> the network, and wiped from memory. >>>> >>>> Or, the data may have been written to a disk image that is separately >>>> encrypted by the host OS, and has been detached (unplugged) from the >>>> guest, and also unmounted from the host, with the disk key securely >>>> wiped from host memory. >>>> >>>> We can also imagine the reverse direction. Assume that the user of the >>>> virtual machine goes into the UEFI shell in OVMF, starts the EDIT >>>> program, and types some secret information into a text file on the ESP. >>>> Further assume that the disk image is encrypted on the host OS. It is >>>> conceivable that fragments of the secret could remain stuck in the AHCI >>>> (disk) and USB (keyboard) DMA buffers. >>>> >>>> Maybe you think that these are "made up" examples, and I agree, I just >>>> made them up. The point is, it is not my place to discount possible >>>> attack vectors (I haven't designed SEV). Attackers will be limited by >>>> their *own* imaginations only, not by mine :) >>>> >>>> >>>>> 2) if the concern is for future data, it means all user data to be written. >>>>> However, the C-bit already prevent that. >>>> >>>> As far as I understand SEV, future data is out of scope. The goal is to >>>> prevent *retroactive* guest data leaks (= whatever is currently in host >>>> OS memory) if an attacker compromises an otherwise non-malicious hypervisor. >>>> >>>> If an attacker compromises a virtualization host, then they can just sit >>>> silent and watch data flow into and out of guests from that point >>>> onward, because they can then monitor all DMA (which always happens in >>>> clear-text) real-time. >>>> >>>> >>>>> Maybe I do not fully understand the threat model defined. >>>>> If you can explain more, that would be great. >>>> >>>> Well I tried to explain *my* understanding of SEV :) I hope Brijesh will >>>> correct me if I'm wrong. >>>> >>>> >>>>> The point of SEV is to keep as much guest data encrypted at all times as >>>>> possible, so that *whenever* the hypervisor is compromised by an >>>>> attacker, the guest memory -- which the attacker can inevitably dump >>>>> from the host side -- remains un-intellegible to the attacker. >>>>> [Jiewen] OK. If this is a security question, I suggest we define a clear >>>>> threat model at first. >>>>> Or what we are doing might be unnecessary or insufficient. >>>> >>>> I agree. >>>> >>>> As I said above, my operating principle has been to re-encrypt all DMA >>>> buffers as soon as possible. For long-standing DMA buffers, re-encrypt >>>> them at ExitBootServices at the latest. >>>> >>>> >>>>> [Jiewen] For "require that Unmap() work for CommonBuffer >>>>> operations without releasing memory", I would hold my opinion until >>>>> it is documented clearly in UEFI spec. >>>> >>>> You are right, of course. >>>> >>>> It's just that we cannot avoid exploring ways, for this feature, that >>>> currently fall outside of the spec. Whatever we do here will be outside >>>> of the spec, one way or another. I suggested what I thought could be a >>>> reasonable extension to the spec, one that could be implemented by PciIo >>>> implementors even before the spec is modified. >>>> >>>> edk2's PciIo.Unmap already behaves like this, without breaking the spec. >>>> >>>> If there's a better way -- for example modifying CoreExitBootServices() >>>> in edk2, to signal IOMMU drivers separately, *after* notifying other >>>> ExitBootServices() handlers --, that might work as well. >>>> >>>> >>>>> My current concern is: >>>>> First, this sentence is NOT defined in UEFI specification. >>>> >>>> Correct. >>>> >>>> >>>>> Second, it might break an existing implementation or existing feature, such as tracing. >>>> >>>> Correct. >>>> >>>>> Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed >>>>> to call memory services. >>>>> The only restriction is >>>>> 1) TPL restriction, where memory service must be <= TPL_NOTIFY. >>>>> 2) AP restriction, where no UEFI service/protocol is allowed for AP. >>>> >>>> I agree. >>>> >>>> >>>>> I'm just trying to operate with the APIs currently defined by the UEFI >>>>> spec, and these assumptions were the best I could come up with. >>>>> [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. >>>>> Especially the IHV Option ROM should not consume any private API. >>>> >>>> I disagree about "only follow". If there are additional requirements >>>> that can be satisfied without breaking the spec, driver implementors can >>>> choose to conform to both sets of requirements. >>>> >>>> For example (if I understand correctly), Microsoft / Windows present a >>>> bunch of requirements *beyond* the UEFI spec, for both platform and >>>> add-on card firmware. Most vendors conform :) >>>> >>>> >>>>>> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >>>>>> guest memory should be encrypted again, even those areas that were once >>>>>> used as CommonBuffers." >>>>>> For SEV case, I think it should be controlled by OS, because it is just CR3. >>>>> >>>>> If it was indeed just CR3, then I would fully agree with you. >>>>> >>>>> But -- to my understanding --, ensuring that the memory is actually >>>>> encrypted requires that the guest *write* to the memory through a >>>>> virtual address whose PTE has the C-bit set. >>>>> >>>>> And I don't think the guest OS can be expected to rewrite all of its >>>>> memory at launch. :( >>>>> >>>>> [Jiewen] That is fine. >>>>> I suggest we get clear on the threat model as the first step. >>>>> The threat model for SEV usage in OVMF. >>>>> >>>>> I am sorry if that is already discussed before. I might ignore the conversation. >>>> >>>> No problem; it's always good to re-investigate our assumptions and >>>> operating principles. >>>> >>>> >>>>> If you or any SEV owner can share the insight, that will be great. >>>>> See my question above "If that is the threat model, I have a question on the exposure:..." >>>> >>>> I shared *my* impressions of the threat model (which are somewhat >>>> unclear, admittedly, and thus could make me overly cautious). >>>> >>>> I hope Brijesh can extend and/or correct my description. >>>> >>>> >>>>>> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >>>>>> are opposites -- VT-d permits all DMA unless configured otherwise, while >>>>>> SEV forbids all DMA unless configured otherwise. >>>>>> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >>>>>> I setup translation table, but mark all entry to be NOT-PRESENT. >>>>>> I grant DMA access only if there is a specific request by a device. >>>>>> >>>>>> I open DMA access in ExitBootServices, just want to make sure it is compatible to >>>>>> the existing OS, which might not have knowledge on IOMMU. >>>>>> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >>>>>> turn off IOMMU, or keep it enabled at ExitBootService. >>>>>> An IOMMU aware OS may take over IOMMU directly and reprogram it. >>>>> >>>>> Thanks for the clarification. >>>>> >>>>> But, again, will this not lead to the possibility that the VT-d IOMMU's >>>>> ExitBootServices() handler disables all DMA *before* the PCI drivers get >>>>> a chance to run their own ExitBootServices() handlers, disabling the >>>>> individual devices? >>>>> [Jiewen] Sorry for the confusing. Let me explain: >>>>> No, VTd never disables *all* DMA buffer at ExitBootService. >>>>> >>>>> "disable VTd" means to "disable IOMMU protection" and "allow all DMA". >>>>> "Keep VTd enabled" means to "keep IOMMU protection enabled" and >>>>> "only allow the DMA from Map() request". >>>> >>>> Okay, but this interpretation was exactly what I thought of first (see >>>> above): "VT-d permits all DMA unless configured otherwise". You answered >>>> that it wasn't the case. >>>> >>>> So basically, if I understand it correctly now, at ExitBootServices the >>>> VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? >>>> >>>> That is equivalent to decrypting all memory under SEV, and is the exact >>>> opposite of what we want. (As I understand it.) >>>> >>>> >>>>> If that happens, then a series of IOMMU faults could be generated, which >>>>> I described above. (I.e., such IOMMU faults could result, at least in >>>>> the case of SEV, in garbage being written to disk, via queued commands.) >>>>> [Jiewen] You are right. That would not happen. >>>>> IOMMU driver should not bring any compatibility problem for the PCI driver, >>>>> iff the PCI driver follows the UEFI specification. >>>> >>>> Agreed. >>>> >>>> >>>>> Now, to finish up, here's an idea I just had. >>>>> >>>>> Are we allowed to call gBS->SignalEvent() in an ExitBootServices() >>>>> notification function? >>>>> >>>>> If we are, then we could do the following: >>>>> >>>>> * PciIo.Unmap() and friends would work as always (no restrictions on >>>>> dynamic memory allocation or freeing, for any kind of DMA operation). >>>>> >>>>> * PCI drivers would not be expected to call PciIo.Unmap() in their >>>>> ExitBootServices() handlers. >>>>> >>>>> * The IOMMU driver would have an ExitBootServices() notification >>>>> function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level >>>>> (doesn't matter which). >>>>> >>>>> * In this notification function, the IOMMU driver would signal *another* >>>>> event (a private one). The notification function for this event would >>>>> be enqueued strictly at the TPL_CALLBACK level. >>>>> >>>>> * The notification function for the second event (private to the IOMMU) >>>>> would iterate over all existent mappings, and unmap them, without >>>>> allocating or freeing memory. >>>>> >>>>> The key point is that by signaling the second event, the IOMMU driver >>>>> could order its own cleanup code after all other ExitBootServices() >>>>> callbacks. (Assuming, at least, that no other ExitBootServices() >>>>> callback employs the same trick to defer itself!) Therefore by the time >>>>> the second callback runs, all PCI devices have been halted, and it is >>>>> safe to tear down the translations. >>>>> >>>>> The ordering would be ensured by the following: >>>>> >>>>> - The UEFI spec says under CreateEventEx(), "All events are guaranteed >>>>> to be signaled before the first notification action is taken." This >>>>> means that, by the time the IOMMU's ExitBootServices() handler is >>>>> entered, all other ExitBootServices() handlers have been *queued* at >>>>> least, at TPL_CALLBACK or TPL_NOTIFY. >>>>> >>>>> - Therefore, when we signal our second callback from there, for >>>>> TPL_CALLBACK, the second callback will be queued at the end of the >>>>> TPL_CALLBACK queue. >>>>> >>>>> - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is >>>>> functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag >>>>> for the Type argument of CreateEvent." So it wouldn't matter if a >>>>> driver used CreateEvent() or CreateEventEx() for setting up its own >>>>> handler, the handler would be queued just the same. >>>>> >>>>> I think this is ugly and brittle, but perhaps the only way to clean up >>>>> *all* translations safely, with regard to PciIo.Unmap() + >>>>> ExitBootServices(), without updating the UEFI spec. >>>>> >>>>> What do you think? >>>>> >>>>> [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. >>>>> We do not need update PCI driver and we do not need update UEFI spec. >>>>> We only need update IOMMU driver which is concerned, based upon the threat model. >>>>> That probably is best solution. :-) >>>> >>>> I'm very glad to hear that you like the idea. >>>> >>>> However, it depends on whether we are permitted, by the UEFI spec, to >>>> signal another event in an ExitBootServices() notification function. >>>> >>>> Are we permitted to do that? >>>> >>>> Does the UEFI spec guarantee that the notification function for the >>>> *second* event will be queued just like it would be under "normal" >>>> circumstances? >>>> >>>> (I know we must not allocate or free memory in the notification function >>>> of the *second* event either; I just want to know if the second event's >>>> handler is *queued* like it would normally be.) >>>> >>>> >>>>> I assume you want to handle both common buffer and read/write buffer, right? >>>> >>>> Yes. Under this idea, all kinds of operations would be cleaned up. >>>> >>>> >>>>> And if possible, I still have interest to get clear on the threat model for SEV in OVMF. >>>>> If you or any SEV owner can share ... >>>> >>>> Absolutely. Please see above. >>>> >>>> Thank you! >>>> Laszlo >>>> >>> _______________________________________________ >>> edk2-devel mailing list >>> edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> >>> https://lists.01.org/mailman/listinfo/edk2-devel >>> >> > _______________________________________________ > edk2-devel mailing list > edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> > https://lists.01.org/mailman/listinfo/edk2-devel > _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
Thanks. It seems the discussion becomes too long. I try to make it short. 1) As long as we use same mechanism to handle both common buffer and read/write buffer. I have no concern at all. :) 2) I am sorry that I do not have an immediate answer for your question on event handling in ExitBootServices. Although it seems tricky, I believe it works. Can you have a try? 3) Looking at the code (DxeMain.c), I have another idea for your consideration only. // // Notify other drivers that we are exiting boot services. // CoreNotifySignalList (&gEfiEventExitBootServicesGuid); // // Report that ExitBootServices() has been called // REPORT_STATUS_CODE ( EFI_PROGRESS_CODE, (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES) ); The cleanup driver may register a report status code handle to do the cleanup. :) We already have such example in MdeModulePkg\Universal\Acpi\FirmwarePerformanceDataTableDxe\ FirmwarePerformanceDxe.c. FpdtStatusCodeListenerDxe() } else if (Value == (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES)) { ...... The requirement will become: If a platform wants to add cleanup, it must use a real report status code library. Thank you Yao Jiewen From: edk2-devel [mailto:edk2-devel-bounces@lists.01.org] On Behalf Of Laszlo Ersek Sent: Wednesday, September 6, 2017 8:15 PM To: Yao, Jiewen <jiewen.yao@intel.com>; Zeng, Star <star.zeng@intel.com>; edk2-devel-01 <edk2-devel@lists.01.org> Cc: Brijesh Singh <brijesh.singh@amd.com>; Dong, Eric <eric.dong@intel.com> Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() On 09/06/17 06:37, Yao, Jiewen wrote: > Thanks for the clarification. Comment in line. > > From: Laszlo Ersek [mailto:lersek@redhat.com] > Sent: Wednesday, September 6, 2017 1:57 AM > To: Yao, Jiewen <jiewen.yao@intel.com<mailto:jiewen.yao@intel.com>>; Zeng, Star <star.zeng@intel.com<mailto:star.zeng@intel.com>>; edk2-devel-01 <edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org>> > Cc: Dong, Eric <eric.dong@intel.com<mailto:eric.dong@intel.com>>; Brijesh Singh <brijesh.singh@amd.com<mailto:brijesh.singh@amd.com>> > Subject: Re: [edk2] [PATCH 0/4] MdeModulePkg: some PCI HC drivers: unmap common buffers at ExitBootServices() >> Then after ExitBootService, the OS will take control of CR3 and set correct >> encryption bit. > > This is true, the guest OS will set up new page tables, and in those > PTEs, the C-bit ("encrypt") will likely be set by default. > > However, the guest OS will not *rewrite* all of the memory, with the > C-bit set. This means that any memory that the firmware didn't actually > re-encrypt (in the IOMMU driver) will still expose stale data to the > hypervisor. > [Jiewen] That is an interesting question. > Is there any security concern associated? I can't tell for sure. Answering this question is up to the proponents of SEV, who have designed the threat model for SEV. My operating assumption is that we should keep memory as tightly encrypted as possible at the firmware --> OS control transfer. It could be an exaggerated expectation from my side; I'd just better be safe than sorry :) > If the C-bit is cleared for a read/write buffer, is the content in the > read/write buffer also exposed to hypervisor? Not immediately. Only when the guest also rewrites the memory through the appropriate virtual address. Basically, in the virtual machine, the C-bit in a PTE that covers a particular guest virtual address (GVA) controls whether a guest write through that GVA will result in memory encryption, and if a gues read through that GVA will result in memory decryption. The current state of the C-bit doesn't matter for the hypervisor, what matters is the last guest write through a GVA whose PTE had the C-bit set, or clear. If the C-bit was clear when the guest last wrote the area, then the hypervisor can read the data. If the C-bit was set, then the hypervisor can only read garbage. > I means if there is security concern, the concern should be applied to > both common buffer and read/write buffer. > Then we have to figure a way to resolve both buffer. Yes, this is correct. The PciIo.Unmap operation, as currently implemented in OvmfPkg/IoMmuDxe/, handles the encryption/decryption correctly for all operations, but it can only guarantee *not* freeing memory for CommonBuffer. So Unmap()ing CommonBuffer at ExitBootServices is safe, while Unmap()ing Read/Write is not. The encryption would be re-established just fine for Read/Write as well, but we would change the UEFI memmap. In OVMF, we currently manage this problem by making all asynchronous DMA mappings CommonBuffer, even if they could othewise be Read or Write. We use Read and Write only if the DMA operation completes before the higher-level protocol function returns (so we can immediately Unmap the operation, and the ExitBootServices() handler doesn't have to care). > To be honest, that is exactly my biggest question on this patch: > why do we only handle the common buffer specially? For the following reason: - Given that CommonBuffer mappings take a separate AllocateBuffer() / FreeBuffer() call-pair, around Map/Unmap, we can *reasonably ask* PciIo implementors -- beyond what the UEFI spec requries -- to keep *all* dynamic memory management out of Map/Unmap. If they need dynamic memory management, we ask them to do it in AllocateBuffer/FreeBuffer instead. This way Unmap is safe in ExitBootServices handlers. - We cannot *reasonably ask* PciIo implementors to provide the same guarantee for Read and Write mappings, simply because there are no other APIs that could manage the dynamic memory for Map and Unmap -- AllocateBuffer and FreeBuffer are not required for Read and Write mappings. PciIo implementors couldn't agree to our request even if they wanted to. Therefore Unmapping Read/Write operations is inherently unsafe in EBS handlers. >> NOTE: The device should still be halt state, because the device is >> also controlled by OS. > > This is the key point -- the device is *not* controlled by the guest OS, > in the worst case. > > The virtual device is ultimately implemented by the hypervisor. We don't > expect the virtual device (= the hypervisor) to mis-behave on purpose. > However, if the hypervisor is compromised by an external attacker (for > example over the network, or via privilege escalation from another > guest), then all guest RAM that has not been encrypted up to that point > becomes visible to the attacker. In other words, the hypervisor ("the > device") becomes retro-actively malicious. > [Jiewen] If that is the threat model, I have a question on the exposure: > 1) If the concern is for existing data, it means all DMA data already written. > However, the DMA data is already consumed or produced by virtual device > or a hypervisor. If the hypervisor is attacked, it already gets all the data content. Maybe, maybe not. The data may have been sent to a different host over the network, and wiped from memory. Or, the data may have been written to a disk image that is separately encrypted by the host OS, and has been detached (unplugged) from the guest, and also unmounted from the host, with the disk key securely wiped from host memory. We can also imagine the reverse direction. Assume that the user of the virtual machine goes into the UEFI shell in OVMF, starts the EDIT program, and types some secret information into a text file on the ESP. Further assume that the disk image is encrypted on the host OS. It is conceivable that fragments of the secret could remain stuck in the AHCI (disk) and USB (keyboard) DMA buffers. Maybe you think that these are "made up" examples, and I agree, I just made them up. The point is, it is not my place to discount possible attack vectors (I haven't designed SEV). Attackers will be limited by their *own* imaginations only, not by mine :) > 2) if the concern is for future data, it means all user data to be written. > However, the C-bit already prevent that. As far as I understand SEV, future data is out of scope. The goal is to prevent *retroactive* guest data leaks (= whatever is currently in host OS memory) if an attacker compromises an otherwise non-malicious hypervisor. If an attacker compromises a virtualization host, then they can just sit silent and watch data flow into and out of guests from that point onward, because they can then monitor all DMA (which always happens in clear-text) real-time. > Maybe I do not fully understand the threat model defined. > If you can explain more, that would be great. Well I tried to explain *my* understanding of SEV :) I hope Brijesh will correct me if I'm wrong. > The point of SEV is to keep as much guest data encrypted at all times as > possible, so that *whenever* the hypervisor is compromised by an > attacker, the guest memory -- which the attacker can inevitably dump > from the host side -- remains un-intellegible to the attacker. > [Jiewen] OK. If this is a security question, I suggest we define a clear > threat model at first. > Or what we are doing might be unnecessary or insufficient. I agree. As I said above, my operating principle has been to re-encrypt all DMA buffers as soon as possible. For long-standing DMA buffers, re-encrypt them at ExitBootServices at the latest. > [Jiewen] For "require that Unmap() work for CommonBuffer > operations without releasing memory", I would hold my opinion until > it is documented clearly in UEFI spec. You are right, of course. It's just that we cannot avoid exploring ways, for this feature, that currently fall outside of the spec. Whatever we do here will be outside of the spec, one way or another. I suggested what I thought could be a reasonable extension to the spec, one that could be implemented by PciIo implementors even before the spec is modified. edk2's PciIo.Unmap already behaves like this, without breaking the spec. If there's a better way -- for example modifying CoreExitBootServices() in edk2, to signal IOMMU drivers separately, *after* notifying other ExitBootServices() handlers --, that might work as well. > My current concern is: > First, this sentence is NOT defined in UEFI specification. Correct. > Second, it might break an existing implementation or existing feature, such as tracing. Correct. > Till now, I did not see any restriction in UEFI spec say: In this function, you are not allowed > to call memory services. > The only restriction is > 1) TPL restriction, where memory service must be <= TPL_NOTIFY. > 2) AP restriction, where no UEFI service/protocol is allowed for AP. I agree. > I'm just trying to operate with the APIs currently defined by the UEFI > spec, and these assumptions were the best I could come up with. > [Jiewen] The PCI device driver must follow and *only* follow UEFI spec. > Especially the IHV Option ROM should not consume any private API. I disagree about "only follow". If there are additional requirements that can be satisfied without breaking the spec, driver implementors can choose to conform to both sets of requirements. For example (if I understand correctly), Microsoft / Windows present a bunch of requirements *beyond* the UEFI spec, for both platform and add-on card firmware. Most vendors conform :) >> [Jiewen] I am not sure who will control "When control is transferred to the OS, all >> guest memory should be encrypted again, even those areas that were once >> used as CommonBuffers." >> For SEV case, I think it should be controlled by OS, because it is just CR3. > > If it was indeed just CR3, then I would fully agree with you. > > But -- to my understanding --, ensuring that the memory is actually > encrypted requires that the guest *write* to the memory through a > virtual address whose PTE has the C-bit set. > > And I don't think the guest OS can be expected to rewrite all of its > memory at launch. :( > > [Jiewen] That is fine. > I suggest we get clear on the threat model as the first step. > The threat model for SEV usage in OVMF. > > I am sorry if that is already discussed before. I might ignore the conversation. No problem; it's always good to re-investigate our assumptions and operating principles. > If you or any SEV owner can share the insight, that will be great. > See my question above "If that is the threat model, I have a question on the exposure:..." I shared *my* impressions of the threat model (which are somewhat unclear, admittedly, and thus could make me overly cautious). I hope Brijesh can extend and/or correct my description. >> So apparently the default behaviors of the VT-d IOMMU and the SEV IOMMU >> are opposites -- VT-d permits all DMA unless configured otherwise, while >> SEV forbids all DMA unless configured otherwise. >> [Jiewen] I do not think so. The default behaviors of VT-d IOMMU is to disable all DMA access. >> I setup translation table, but mark all entry to be NOT-PRESENT. >> I grant DMA access only if there is a specific request by a device. >> >> I open DMA access in ExitBootServices, just want to make sure it is compatible to >> the existing OS, which might not have knowledge on IOMMU. >> I will provide a patch to make it a policy very soon. As such VTd IOMMU may >> turn off IOMMU, or keep it enabled at ExitBootService. >> An IOMMU aware OS may take over IOMMU directly and reprogram it. > > Thanks for the clarification. > > But, again, will this not lead to the possibility that the VT-d IOMMU's > ExitBootServices() handler disables all DMA *before* the PCI drivers get > a chance to run their own ExitBootServices() handlers, disabling the > individual devices? > [Jiewen] Sorry for the confusing. Let me explain: > No, VTd never disables *all* DMA buffer at ExitBootService. > > "disable VTd" means to "disable IOMMU protection" and "allow all DMA". > "Keep VTd enabled" means to "keep IOMMU protection enabled" and > "only allow the DMA from Map() request". Okay, but this interpretation was exactly what I thought of first (see above): "VT-d permits all DMA unless configured otherwise". You answered that it wasn't the case. So basically, if I understand it correctly now, at ExitBootServices the VT-d IOMMU driver opens up all RAM for DMA access. Is that correct? That is equivalent to decrypting all memory under SEV, and is the exact opposite of what we want. (As I understand it.) > If that happens, then a series of IOMMU faults could be generated, which > I described above. (I.e., such IOMMU faults could result, at least in > the case of SEV, in garbage being written to disk, via queued commands.) > [Jiewen] You are right. That would not happen. > IOMMU driver should not bring any compatibility problem for the PCI driver, > iff the PCI driver follows the UEFI specification. Agreed. > Now, to finish up, here's an idea I just had. > > Are we allowed to call gBS->SignalEvent() in an ExitBootServices() > notification function? > > If we are, then we could do the following: > > * PciIo.Unmap() and friends would work as always (no restrictions on > dynamic memory allocation or freeing, for any kind of DMA operation). > > * PCI drivers would not be expected to call PciIo.Unmap() in their > ExitBootServices() handlers. > > * The IOMMU driver would have an ExitBootServices() notification > function, to be enqueued at the TPL_CALLBACK or TPL_NOTIFY level > (doesn't matter which). > > * In this notification function, the IOMMU driver would signal *another* > event (a private one). The notification function for this event would > be enqueued strictly at the TPL_CALLBACK level. > > * The notification function for the second event (private to the IOMMU) > would iterate over all existent mappings, and unmap them, without > allocating or freeing memory. > > The key point is that by signaling the second event, the IOMMU driver > could order its own cleanup code after all other ExitBootServices() > callbacks. (Assuming, at least, that no other ExitBootServices() > callback employs the same trick to defer itself!) Therefore by the time > the second callback runs, all PCI devices have been halted, and it is > safe to tear down the translations. > > The ordering would be ensured by the following: > > - The UEFI spec says under CreateEventEx(), "All events are guaranteed > to be signaled before the first notification action is taken." This > means that, by the time the IOMMU's ExitBootServices() handler is > entered, all other ExitBootServices() handlers have been *queued* at > least, at TPL_CALLBACK or TPL_NOTIFY. > > - Therefore, when we signal our second callback from there, for > TPL_CALLBACK, the second callback will be queued at the end of the > TPL_CALLBACK queue. > > - CreateEventEx() also says that EFI_EVENT_GROUP_EXIT_BOOT_SERVICES "is > functionally equivalent to the EVT_SIGNAL_EXIT_BOOT_SERVICES flag > for the Type argument of CreateEvent." So it wouldn't matter if a > driver used CreateEvent() or CreateEventEx() for setting up its own > handler, the handler would be queued just the same. > > I think this is ugly and brittle, but perhaps the only way to clean up > *all* translations safely, with regard to PciIo.Unmap() + > ExitBootServices(), without updating the UEFI spec. > > What do you think? > > [Jiewen] If the order is correct, and all PCI device driver is halt at ExitBootServices, that works. > We do not need update PCI driver and we do not need update UEFI spec. > We only need update IOMMU driver which is concerned, based upon the threat model. > That probably is best solution. :-) I'm very glad to hear that you like the idea. However, it depends on whether we are permitted, by the UEFI spec, to signal another event in an ExitBootServices() notification function. Are we permitted to do that? Does the UEFI spec guarantee that the notification function for the *second* event will be queued just like it would be under "normal" circumstances? (I know we must not allocate or free memory in the notification function of the *second* event either; I just want to know if the second event's handler is *queued* like it would normally be.) > I assume you want to handle both common buffer and read/write buffer, right? Yes. Under this idea, all kinds of operations would be cleaned up. > And if possible, I still have interest to get clear on the threat model for SEV in OVMF. > If you or any SEV owner can share ... Absolutely. Please see above. Thank you! Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org<mailto:edk2-devel@lists.01.org> https://lists.01.org/mailman/listinfo/edk2-devel _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
On 09/07/17 06:34, Yao, Jiewen wrote: > Thanks. > It seems the discussion becomes too long. I try to make it short. > > > 1) As long as we use same mechanism to handle both common buffer and read/write buffer. I have no concern at all. :) > > > 2) I am sorry that I do not have an immediate answer for your question on event handling in ExitBootServices. > Although it seems tricky, I believe it works. Can you have a try? > > > 3) Looking at the code (DxeMain.c), I have another idea for your consideration only. > > // > // Notify other drivers that we are exiting boot services. > // > CoreNotifySignalList (&gEfiEventExitBootServicesGuid); > > // > // Report that ExitBootServices() has been called > // > REPORT_STATUS_CODE ( > EFI_PROGRESS_CODE, > (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES) > ); > > The cleanup driver may register a report status code handle to do the cleanup. :) > > We already have such example in MdeModulePkg\Universal\Acpi\FirmwarePerformanceDataTableDxe\ FirmwarePerformanceDxe.c. > FpdtStatusCodeListenerDxe() > > } else if (Value == (EFI_SOFTWARE_EFI_BOOT_SERVICE | EFI_SW_BS_PC_EXIT_BOOT_SERVICES)) { > ...... > > The requirement will become: If a platform wants to add cleanup, it must use a real report status code library. I think I prefer option #2; it presents fewer requirements for the platform. In addition: I've just realized that we don't need the UEFI spec to promise us that option #2 works. We only have to care whether it works in edk2. The reason is that EDKII_IOMMU_PROTOCOL is *already* edk2-specific; it is a platform (DXE) driver, not a UEFI driver. In other words, we would use option #2 only inside edk2. So, I'll try option #2. Thanks! Laszlo _______________________________________________ edk2-devel mailing list edk2-devel@lists.01.org https://lists.01.org/mailman/listinfo/edk2-devel
© 2016 - 2024 Red Hat, Inc.