[v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

[PATCH v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

Posted by Alex Williamson 2 years, 8 months ago

NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
previously reserved for use by hypervisors to implement the GPUDirect
Cliques capability.  A revised specification provides an alternate
location.  Add a config space walk to the quirk to check for conflicts,
allowing us to fall back to the new location or generate an error at the
quirk setup rather than when the real conflicting capability is added
should there be no available location.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index f0147a050aaa..0ed2fcd53152 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -1490,6 +1490,9 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
  * +---------------------------------+---------------------------------+
  *
  * https://lists.gnu.org/archive/html/qemu-devel/2017-08/pdfUda5iEpgOS.pdf
+ *
+ * Specification for Turning and later GPU architectures:
+ * https://lists.gnu.org/archive/html/qemu-devel/2023-06/pdf142OR4O4c2.pdf
  */
 static void get_nv_gpudirect_clique_id(Object *obj, Visitor *v,
                                        const char *name, void *opaque,
@@ -1530,7 +1533,9 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
 static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
 {
     PCIDevice *pdev = &vdev->pdev;
-    int ret, pos = 0xC8;
+    int ret, pos;
+    bool c8_conflict = false, d4_conflict = false;
+    uint8_t tmp;
 
     if (vdev->nv_gpudirect_clique == 0xFF) {
         return 0;
@@ -1547,6 +1552,40 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
         return -EINVAL;
     }
 
+    /*
+     * Per the updated specification above, it's recommended to use offset
+     * D4h for Turing and later GPU architectures due to a conflict of the
+     * MSI-X capability at C8h.  We don't know how to determine the GPU
+     * architecture, instead we walk the capability chain to mark conflicts
+     * and choose one or error based on the result.
+     *
+     * NB. Cap list head in pdev->config is already cleared, read from device.
+     */
+    ret = pread(vdev->vbasedev.fd, &tmp, 1,
+                vdev->config_offset + PCI_CAPABILITY_LIST);
+    if (ret != 1 || !tmp) {
+        error_setg(errp, "NVIDIA GPUDirect Clique ID: error getting cap list");
+        return -EINVAL;
+    }
+
+    do {
+        if (tmp == 0xC8) {
+            c8_conflict = true;
+        } else if (tmp == 0xD4) {
+            d4_conflict = true;
+        }
+        tmp = pdev->config[tmp + PCI_CAP_LIST_NEXT];
+    } while (tmp);
+
+    if (!c8_conflict) {
+        pos = 0xC8;
+    } else if (!d4_conflict) {
+        pos = 0xD4;
+    } else {
+        error_setg(errp, "NVIDIA GPUDirect Clique ID: invalid config space");
+        return -EINVAL;
+    }
+
     ret = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8, errp);
     if (ret < 0) {
         error_prepend(errp, "Failed to add NVIDIA GPUDirect cap: ");
-- 
2.39.2

Re: [PATCH v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

Posted by Cédric Le Goater 2 years, 8 months ago

On 6/8/23 20:05, Alex Williamson wrote:
> NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
> previously reserved for use by hypervisors to implement the GPUDirect
> Cliques capability.  A revised specification provides an alternate
> location.  Add a config space walk to the quirk to check for conflicts,
> allowing us to fall back to the new location or generate an error at the
> quirk setup rather than when the real conflicting capability is added
> should there be no available location.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>   hw/vfio/pci-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>   1 file changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index f0147a050aaa..0ed2fcd53152 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -1490,6 +1490,9 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>    * +---------------------------------+---------------------------------+
>    *
>    * https://lists.gnu.org/archive/html/qemu-devel/2017-08/pdfUda5iEpgOS.pdf
> + *
> + * Specification for Turning and later GPU architectures:

s/Turning/Turing/

I will fix that.

> + * https://lists.gnu.org/archive/html/qemu-devel/2023-06/pdf142OR4O4c2.pdf
>    */
>   static void get_nv_gpudirect_clique_id(Object *obj, Visitor *v,
>                                          const char *name, void *opaque,
> @@ -1530,7 +1533,9 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
>   static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
>   {
>       PCIDevice *pdev = &vdev->pdev;
> -    int ret, pos = 0xC8;
> +    int ret, pos;
> +    bool c8_conflict = false, d4_conflict = false;
> +    uint8_t tmp;
>   
>       if (vdev->nv_gpudirect_clique == 0xFF) {
>           return 0;
> @@ -1547,6 +1552,40 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
>           return -EINVAL;
>       }
>   
> +    /*
> +     * Per the updated specification above, it's recommended to use offset
> +     * D4h for Turing and later GPU architectures due to a conflict of the
> +     * MSI-X capability at C8h.  We don't know how to determine the GPU

There is a way :

   # nvidia-smi -q | grep Architecture
       Product Architecture                  : Turing

but it must be vendor specific and the proposed solution is as good.

Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks,

C.

> +     * architecture, instead we walk the capability chain to mark conflicts
> +     * and choose one or error based on the result.
> +     *
> +     * NB. Cap list head in pdev->config is already cleared, read from device.
> +     */
> +    ret = pread(vdev->vbasedev.fd, &tmp, 1,
> +                vdev->config_offset + PCI_CAPABILITY_LIST);
> +    if (ret != 1 || !tmp) {
> +        error_setg(errp, "NVIDIA GPUDirect Clique ID: error getting cap list");
> +        return -EINVAL;
> +    }
> +
> +    do {
> +        if (tmp == 0xC8) {
> +            c8_conflict = true;
> +        } else if (tmp == 0xD4) {
> +            d4_conflict = true;
> +        }
> +        tmp = pdev->config[tmp + PCI_CAP_LIST_NEXT];
> +    } while (tmp);
> +
> +    if (!c8_conflict) {
> +        pos = 0xC8;
> +    } else if (!d4_conflict) {
> +        pos = 0xD4;
> +    } else {
> +        error_setg(errp, "NVIDIA GPUDirect Clique ID: invalid config space");
> +        return -EINVAL;
> +    }
> +
>       ret = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8, errp);
>       if (ret < 0) {
>           error_prepend(errp, "Failed to add NVIDIA GPUDirect cap: ");

Re: [PATCH v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

Posted by Alex Williamson 2 years, 8 months ago

On Mon, 12 Jun 2023 16:07:33 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> On 6/8/23 20:05, Alex Williamson wrote:
> > NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
> > previously reserved for use by hypervisors to implement the GPUDirect
> > Cliques capability.  A revised specification provides an alternate
> > location.  Add a config space walk to the quirk to check for conflicts,
> > allowing us to fall back to the new location or generate an error at the
> > quirk setup rather than when the real conflicting capability is added
> > should there be no available location.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >   hw/vfio/pci-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> >   1 file changed, 40 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index f0147a050aaa..0ed2fcd53152 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -1490,6 +1490,9 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> >    * +---------------------------------+---------------------------------+
> >    *
> >    * https://lists.gnu.org/archive/html/qemu-devel/2017-08/pdfUda5iEpgOS.pdf
> > + *
> > + * Specification for Turning and later GPU architectures:  
> 
> s/Turning/Turing/
> 
> I will fix that.

Yes, thanks!
 
> > + * https://lists.gnu.org/archive/html/qemu-devel/2023-06/pdf142OR4O4c2.pdf
> >    */
> >   static void get_nv_gpudirect_clique_id(Object *obj, Visitor *v,
> >                                          const char *name, void *opaque,
> > @@ -1530,7 +1533,9 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
> >   static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
> >   {
> >       PCIDevice *pdev = &vdev->pdev;
> > -    int ret, pos = 0xC8;
> > +    int ret, pos;
> > +    bool c8_conflict = false, d4_conflict = false;
> > +    uint8_t tmp;
> >   
> >       if (vdev->nv_gpudirect_clique == 0xFF) {
> >           return 0;
> > @@ -1547,6 +1552,40 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
> >           return -EINVAL;
> >       }
> >   
> > +    /*
> > +     * Per the updated specification above, it's recommended to use offset
> > +     * D4h for Turing and later GPU architectures due to a conflict of the
> > +     * MSI-X capability at C8h.  We don't know how to determine the GPU  
> 
> There is a way :
> 
>    # nvidia-smi -q | grep Architecture
>        Product Architecture                  : Turing

There are a few problems with that:

 1) nvidia-smi is a proprietary tool.

 2) Using nvidia-smi, or even the PCI IDs database, would require
    ongoing maintenance to update the string or IDs for future
    architectures.

 3) nvidia-smi requires the device to be managed by the nvidia driver,
    which becomes and chicken and egg problem when we require the
    device to be managed by a vfio compatible driver by this point.

> but it must be vendor specific and the proposed solution is as good.
> 
> Reviewed-by: Cédric Le Goater <clg@redhat.com>

Thanks!

Alex

> > +     * architecture, instead we walk the capability chain to mark conflicts
> > +     * and choose one or error based on the result.
> > +     *
> > +     * NB. Cap list head in pdev->config is already cleared, read from device.
> > +     */
> > +    ret = pread(vdev->vbasedev.fd, &tmp, 1,
> > +                vdev->config_offset + PCI_CAPABILITY_LIST);
> > +    if (ret != 1 || !tmp) {
> > +        error_setg(errp, "NVIDIA GPUDirect Clique ID: error getting cap list");
> > +        return -EINVAL;
> > +    }
> > +
> > +    do {
> > +        if (tmp == 0xC8) {
> > +            c8_conflict = true;
> > +        } else if (tmp == 0xD4) {
> > +            d4_conflict = true;
> > +        }
> > +        tmp = pdev->config[tmp + PCI_CAP_LIST_NEXT];
> > +    } while (tmp);
> > +
> > +    if (!c8_conflict) {
> > +        pos = 0xC8;
> > +    } else if (!d4_conflict) {
> > +        pos = 0xD4;
> > +    } else {
> > +        error_setg(errp, "NVIDIA GPUDirect Clique ID: invalid config space");
> > +        return -EINVAL;
> > +    }
> > +
> >       ret = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8, errp);
> >       if (ret < 0) {
> >           error_prepend(errp, "Failed to add NVIDIA GPUDirect cap: ");  
>

Re: [PATCH v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

Posted by Cédric Le Goater 2 years, 8 months ago

On 6/12/23 17:05, Alex Williamson wrote:
> On Mon, 12 Jun 2023 16:07:33 +0200
> Cédric Le Goater <clg@redhat.com> wrote:
> 
>> On 6/8/23 20:05, Alex Williamson wrote:
>>> NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
>>> previously reserved for use by hypervisors to implement the GPUDirect
>>> Cliques capability.  A revised specification provides an alternate
>>> location.  Add a config space walk to the quirk to check for conflicts,
>>> allowing us to fall back to the new location or generate an error at the
>>> quirk setup rather than when the real conflicting capability is added
>>> should there be no available location.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>    hw/vfio/pci-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>>>    1 file changed, 40 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>> index f0147a050aaa..0ed2fcd53152 100644
>>> --- a/hw/vfio/pci-quirks.c
>>> +++ b/hw/vfio/pci-quirks.c
>>> @@ -1490,6 +1490,9 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
>>>     * +---------------------------------+---------------------------------+
>>>     *
>>>     * https://lists.gnu.org/archive/html/qemu-devel/2017-08/pdfUda5iEpgOS.pdf
>>> + *
>>> + * Specification for Turning and later GPU architectures:
>>
>> s/Turning/Turing/
>>
>> I will fix that.
> 
> Yes, thanks!
>   
>>> + * https://lists.gnu.org/archive/html/qemu-devel/2023-06/pdf142OR4O4c2.pdf
>>>     */
>>>    static void get_nv_gpudirect_clique_id(Object *obj, Visitor *v,
>>>                                           const char *name, void *opaque,
>>> @@ -1530,7 +1533,9 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
>>>    static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
>>>    {
>>>        PCIDevice *pdev = &vdev->pdev;
>>> -    int ret, pos = 0xC8;
>>> +    int ret, pos;
>>> +    bool c8_conflict = false, d4_conflict = false;
>>> +    uint8_t tmp;
>>>    
>>>        if (vdev->nv_gpudirect_clique == 0xFF) {
>>>            return 0;
>>> @@ -1547,6 +1552,40 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
>>>            return -EINVAL;
>>>        }
>>>    
>>> +    /*
>>> +     * Per the updated specification above, it's recommended to use offset
>>> +     * D4h for Turing and later GPU architectures due to a conflict of the
>>> +     * MSI-X capability at C8h.  We don't know how to determine the GPU
>>
>> There is a way :
>>
>>     # nvidia-smi -q | grep Architecture
>>         Product Architecture                  : Turing
> 
> There are a few problems with that:
> 
>   1) nvidia-smi is a proprietary tool.
> 
>   2) Using nvidia-smi, or even the PCI IDs database, would require
>      ongoing maintenance to update the string or IDs for future
>      architectures.
> 
>   3) nvidia-smi requires the device to be managed by the nvidia driver,
>      which becomes and chicken and egg problem when we require the
>      device to be managed by a vfio compatible driver by this point.

For my education, could such information be exposed in a PCI vendor
specific capability ? May be it is ?

Thanks,

C.


> 
>> but it must be vendor specific and the proposed solution is as good.
>>
>> Reviewed-by: Cédric Le Goater <clg@redhat.com>
> 
> Thanks!
> 
> Alex
> 
>>> +     * architecture, instead we walk the capability chain to mark conflicts
>>> +     * and choose one or error based on the result.
>>> +     *
>>> +     * NB. Cap list head in pdev->config is already cleared, read from device.
>>> +     */
>>> +    ret = pread(vdev->vbasedev.fd, &tmp, 1,
>>> +                vdev->config_offset + PCI_CAPABILITY_LIST);
>>> +    if (ret != 1 || !tmp) {
>>> +        error_setg(errp, "NVIDIA GPUDirect Clique ID: error getting cap list");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>> +    do {
>>> +        if (tmp == 0xC8) {
>>> +            c8_conflict = true;
>>> +        } else if (tmp == 0xD4) {
>>> +            d4_conflict = true;
>>> +        }
>>> +        tmp = pdev->config[tmp + PCI_CAP_LIST_NEXT];
>>> +    } while (tmp);
>>> +
>>> +    if (!c8_conflict) {
>>> +        pos = 0xC8;
>>> +    } else if (!d4_conflict) {
>>> +        pos = 0xD4;
>>> +    } else {
>>> +        error_setg(errp, "NVIDIA GPUDirect Clique ID: invalid config space");
>>> +        return -EINVAL;
>>> +    }
>>> +
>>>        ret = pci_add_capability(pdev, PCI_CAP_ID_VNDR, pos, 8, errp);
>>>        if (ret < 0) {
>>>            error_prepend(errp, "Failed to add NVIDIA GPUDirect cap: ");
>>
>

Re: [PATCH v2] hw/vfio/pci-quirks: Support alternate offset for GPUDirect Cliques

Posted by Alex Williamson 2 years, 8 months ago

On Wed, 14 Jun 2023 14:37:08 +0200
Cédric Le Goater <clg@redhat.com> wrote:

> On 6/12/23 17:05, Alex Williamson wrote:
> > On Mon, 12 Jun 2023 16:07:33 +0200
> > Cédric Le Goater <clg@redhat.com> wrote:
> >   
> >> On 6/8/23 20:05, Alex Williamson wrote:  
> >>> NVIDIA Turing and newer GPUs implement the MSI-X capability at the offset
> >>> previously reserved for use by hypervisors to implement the GPUDirect
> >>> Cliques capability.  A revised specification provides an alternate
> >>> location.  Add a config space walk to the quirk to check for conflicts,
> >>> allowing us to fall back to the new location or generate an error at the
> >>> quirk setup rather than when the real conflicting capability is added
> >>> should there be no available location.
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>>    hw/vfio/pci-quirks.c | 41 ++++++++++++++++++++++++++++++++++++++++-
> >>>    1 file changed, 40 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> >>> index f0147a050aaa..0ed2fcd53152 100644
> >>> --- a/hw/vfio/pci-quirks.c
> >>> +++ b/hw/vfio/pci-quirks.c
> >>> @@ -1490,6 +1490,9 @@ void vfio_setup_resetfn_quirk(VFIOPCIDevice *vdev)
> >>>     * +---------------------------------+---------------------------------+
> >>>     *
> >>>     * https://lists.gnu.org/archive/html/qemu-devel/2017-08/pdfUda5iEpgOS.pdf
> >>> + *
> >>> + * Specification for Turning and later GPU architectures:  
> >>
> >> s/Turning/Turing/
> >>
> >> I will fix that.  
> > 
> > Yes, thanks!
> >     
> >>> + * https://lists.gnu.org/archive/html/qemu-devel/2023-06/pdf142OR4O4c2.pdf
> >>>     */
> >>>    static void get_nv_gpudirect_clique_id(Object *obj, Visitor *v,
> >>>                                           const char *name, void *opaque,
> >>> @@ -1530,7 +1533,9 @@ const PropertyInfo qdev_prop_nv_gpudirect_clique = {
> >>>    static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
> >>>    {
> >>>        PCIDevice *pdev = &vdev->pdev;
> >>> -    int ret, pos = 0xC8;
> >>> +    int ret, pos;
> >>> +    bool c8_conflict = false, d4_conflict = false;
> >>> +    uint8_t tmp;
> >>>    
> >>>        if (vdev->nv_gpudirect_clique == 0xFF) {
> >>>            return 0;
> >>> @@ -1547,6 +1552,40 @@ static int vfio_add_nv_gpudirect_cap(VFIOPCIDevice *vdev, Error **errp)
> >>>            return -EINVAL;
> >>>        }
> >>>    
> >>> +    /*
> >>> +     * Per the updated specification above, it's recommended to use offset
> >>> +     * D4h for Turing and later GPU architectures due to a conflict of the
> >>> +     * MSI-X capability at C8h.  We don't know how to determine the GPU  
> >>
> >> There is a way :
> >>
> >>     # nvidia-smi -q | grep Architecture
> >>         Product Architecture                  : Turing  
> > 
> > There are a few problems with that:
> > 
> >   1) nvidia-smi is a proprietary tool.
> > 
> >   2) Using nvidia-smi, or even the PCI IDs database, would require
> >      ongoing maintenance to update the string or IDs for future
> >      architectures.
> > 
> >   3) nvidia-smi requires the device to be managed by the nvidia driver,
> >      which becomes and chicken and egg problem when we require the
> >      device to be managed by a vfio compatible driver by this point.  
> 
> For my education, could such information be exposed in a PCI vendor
> specific capability ? May be it is ?

Sure, nothing technically prevents it, but the vendor has to have a
need to do so whereas NVIDIA probably has their own means to interrogate
the device to determine the architectural level or doesn't mind
maintaining PCI IDs.  Probably a bit of both.  Thanks,

Alex