[RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs

Klaus Jensen posted 11 patches 2 years, 9 months ago
Patches applied successfully
git fetch https://github.com/patchew-project/next-importer-push tags/patchew/20210604065237.873228-1-its@irrelevant.dk
Maintainers: Keith Busch <kbusch@kernel.org>, Klaus Jensen <its@irrelevant.dk>, Max Reitz <mreitz@redhat.com>, Fam Zheng <fam@euphon.net>, Stefan Hajnoczi <stefanha@redhat.com>, Kevin Wolf <kwolf@redhat.com>
There is a newer version of this series
[RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Klaus Jensen 2 years, 9 months ago
From: Klaus Jensen <k.jensen@samsung.com>

This series reimplements flush, dsm, copy, zone reset and format nvm to
allow cancellation. I posted an RFC back in March ("hw/block/nvme:
convert ad-hoc aio tracking to aiocb") and I've applied some feedback
from Stefan and reimplemented the remaining commands.

The basic idea is to define custom AIOCBs for these commands. The custom
AIOCB takes care of issuing all the "nested" AIOs one by one instead of
blindly sending them off simultaneously without tracking the returned
aiocbs.
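
For flush, the wrapper looks roughly like this (a simplified sketch; names and
fields are approximate, see the actual patch for the details):

typedef struct NvmeFlushAIOCB {
    BlockAIOCB common;    /* the cancellable handle handed back to the core */
    BlockAIOCB *aiocb;    /* the currently outstanding "nested" aio, if any */
    NvmeRequest *req;
    int ret;
    /* ... iteration state, e.g. the next namespace to flush ... */
} NvmeFlushAIOCB;

static void nvme_flush_cancel(BlockAIOCB *acb)
{
    NvmeFlushAIOCB *iocb = container_of(acb, NvmeFlushAIOCB, common);

    /* cancel the nested aio currently in flight, if it supports it */
    if (iocb->aiocb) {
        blk_aio_cancel_async(iocb->aiocb);
    }
}

static const AIOCBInfo nvme_flush_aiocb_info = {
    .aiocb_size      = sizeof(NvmeFlushAIOCB),
    .cancel_async    = nvme_flush_cancel,
    .get_aio_context = nvme_get_aio_context, /* returns the main loop context */
};

The nested aios are then issued one at a time from their completion callback,
e.g.

  iocb->aiocb = blk_aio_flush(ns->blkconf.blk, nvme_flush_ns_cb, iocb);

where nvme_flush_ns_cb picks the next namespace or completes the command when
there are none left.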

I've kept the RFC since I'm still new to using the block layer like
this. I was hoping that Stefan could find some time to look over this -
this is a huge series, so I don't expect non-nvme folks to spend a large
amount of time on it, but I would really like feedback on my approach in
the reimplementation of flush and format. Those commands are special in
that they may issue AIOs to multiple namespaces and thus to multiple block
backends. Since this device does not support iothreads, I've opted for
simply always returning the main loop aio context, but I wonder if this
is acceptable or not. It might be the case that this should contain an
assert of some kind, in case someone starts adding iothread support.

Klaus Jensen (11):
  hw/nvme: reimplement flush to allow cancellation
  hw/nvme: add nvme_block_status_all helper
  hw/nvme: reimplement dsm to allow cancellation
  hw/nvme: save reftag when generating pi
  hw/nvme: remove assert from nvme_get_zone_by_slba
  hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
  hw/nvme: add dw0/1 to the req completion trace event
  hw/nvme: reimplement the copy command to allow aio cancellation
  hw/nvme: reimplement zone reset to allow cancellation
  hw/nvme: reimplement format nvm to allow cancellation
  Partially revert "hw/block/nvme: drain namespaces on sq deletion"

 hw/nvme/nvme.h       |   10 +-
 include/block/nvme.h |    8 +
 hw/nvme/ctrl.c       | 1861 ++++++++++++++++++++++++------------------
 hw/nvme/dif.c        |   64 +-
 hw/nvme/trace-events |   21 +-
 5 files changed, 1102 insertions(+), 862 deletions(-)

-- 
2.31.1


Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Vladimir Sementsov-Ogievskiy 2 years, 9 months ago
04.06.2021 09:52, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> This series reimplements flush, dsm, copy, zone reset and format nvm to
> allow cancellation. I posted an RFC back in March ("hw/block/nvme:
> convert ad-hoc aio tracking to aiocb") and I've applied some feedback
> from Stefan and reimplemented the remaining commands.
> 
> The basic idea is to define custom AIOCBs for these commands. The custom
> AIOCB takes care of issuing all the "nested" AIOs one by one instead of
> blindly sending them off simultaneously without tracking the returned
> aiocbs.

I'm not familiar with NVMe. But intuitively, isn't it less efficient to send multiple commands one by one? Overall, wouldn't it be slower? In the block layer we mostly do the opposite transition: instead of doing I/O operations one by one, we run them simultaneously to keep a non-empty queue on the device, even on a single device. This way overall performance is increased.

If you need to store the nested AIOCBs, you may store them in a list, for example, and cancel them in a loop, while keeping the simultaneous start for all flushes. If you send two flushes to two different disks, what's the reason to wait for the first flush to finish before issuing the second?

> 
> I've kept the RFC since I'm still new to using the block layer like
> this. I was hoping that Stefan could find some time to look over this -
> this is a huge series, so I don't expect non-nvme folks to spend a large
> amount of time on it, but I would really like feedback on my approach in
> the reimplementation of flush and format.

If I understand your code correctly, you start the next I/O operation from the callback of the previous one. It works, and it is similar to how the mirror block job operated some time ago (it still maintained several in-flight requests simultaneously, but I mean its use of the _aio_ functions). Still, mirror doesn't use the _aio_ functions like this any more.

A better approach for calling several block-layer I/O functions one by one is to create a coroutine. You can just add a coroutine function that does all your linear logic as you want, without any callbacks, like:

nvme_co_flush()
{
    for (...) {
       blk_co_flush();
    }
}

(and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to start a coroutine).
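
For example, roughly:

  Coroutine *co = qemu_coroutine_create(nvme_co_flush, req);
  qemu_coroutine_enter(co);

(with nvme_co_flush taking a void *opaque argument; "req" here is just whatever
state your coroutine needs).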

Still, I'm not sure that moving from issuing several I/O commands simultaneously to issuing them sequentially is a good idea.
And this way you of course can't use blk_aio_cancel. This leads to my last doubt:

One more thing I don't understand after a quick look at the series: how does cancellation work? It seems to me that you just call cancel on the nested AIOCBs produced by the blk_* I/O functions, but none of them implement cancel. I see only four implementations of the .cancel_async callback in the whole QEMU code: in iscsi, in ide/core.c, in dma-helpers.c and in thread-pool.c. None of them seem related to blk_aio_flush() and the other blk_* functions you call in the series. Or what am I missing?


> Those commands are special in
> that they may issue AIOs to multiple namespaces and thus to multiple block
> backends. Since this device does not support iothreads, I've opted for
> simply always returning the main loop aio context, but I wonder if this
> is acceptable or not. It might be the case that this should contain an
> assert of some kind, in case someone starts adding iothread support.
> 
> Klaus Jensen (11):
>    hw/nvme: reimplement flush to allow cancellation
>    hw/nvme: add nvme_block_status_all helper
>    hw/nvme: reimplement dsm to allow cancellation
>    hw/nvme: save reftag when generating pi
>    hw/nvme: remove assert from nvme_get_zone_by_slba
>    hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
>    hw/nvme: add dw0/1 to the req completion trace event
>    hw/nvme: reimplement the copy command to allow aio cancellation
>    hw/nvme: reimplement zone reset to allow cancellation
>    hw/nvme: reimplement format nvm to allow cancellation
>    Partially revert "hw/block/nvme: drain namespaces on sq deletion"
> 
>   hw/nvme/nvme.h       |   10 +-
>   include/block/nvme.h |    8 +
>   hw/nvme/ctrl.c       | 1861 ++++++++++++++++++++++++------------------
>   hw/nvme/dif.c        |   64 +-
>   hw/nvme/trace-events |   21 +-
>   5 files changed, 1102 insertions(+), 862 deletions(-)
> 


-- 
Best regards,
Vladimir

Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Klaus Jensen 2 years, 9 months ago
Hi Vladimir,

Thanks for taking the time to look through this!

I'll try to comment on all your observations below.

On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
>04.06.2021 09:52, Klaus Jensen wrote:
>>From: Klaus Jensen <k.jensen@samsung.com>
>>
>>This series reimplements flush, dsm, copy, zone reset and format nvm to
>>allow cancellation. I posted an RFC back in March ("hw/block/nvme:
>>convert ad-hoc aio tracking to aiocb") and I've applied some feedback
>>from Stefan and reimplemented the remaining commands.
>>
>>The basic idea is to define custom AIOCBs for these commands. The custom
>>AIOCB takes care of issuing all the "nested" AIOs one by one instead of
>>blindly sending them off simultaneously without tracking the returned
>>aiocbs.
>
>I'm not familiar with nvme. But intuitively, isn't it less efficient to 
>send multiple commands one by one? Overall, wouldn't it be slower?

No, you are right, it is of course slower overall.

>In block layer we mostly do opposite transition: instead of doing IO 
>operations one-by-one, run them simultaneously to make a non-empty 
>queue on a device.. Even on one device. This way overall performance is 
>increased.
>

Of these commands, Copy is the only one that I would consider optimizing 
like this. But the most obvious use of the Copy command is host-driven 
garbage collection in the context of zoned namespaces. And I would not 
consider that operation to be performance critical in terms of latency. 
All regular I/O commands are "one aiocb" and don't need any of this.
And we already "parallelize" this heavily.

>If you need to store nested AIOCBs, you may store them in a list for 
>example, and cancel in a loop, keeping simultaneous start for all 
>flushes.. If you send two flushes on two different disks, what's the 
>reason to wait for first flush finish before issuing the second?
>

Keeping a list of returned aiocbs was my initial approach actually. But 
when I looked at hw/ide I got the impression that the AIOCB approach was 
the right one. My first approach involved adding an aiocblist to the 
core NvmeRequest structure, but I ended up killing that approach because 
I didn't want to deal with it on the normal I/O path.

But you are absolutely correct that waiting for the first flush to 
finish is suboptimal.

>>
>>I've kept the RFC since I'm still new to using the block layer like
>>this. I was hoping that Stefan could find some time to look over this -
>>this is a huge series, so I don't expect non-nvme folks to spend a large
>>amount of time on it, but I would really like feedback on my approach in
>>the reimplementation of flush and format.
>
>If I understand your code correctly, you start the next I/O operation from 
>the callback of the previous one. It works, and it is similar to how the mirror 
>block-job was operating some time ago (still it maintained several 
>in-flight requests simultaneously, but I'm about using _aio_ 
>functions). Still, now mirror doesn't use _aio_ functions like this.
>
>Better approach to call several io functions of block layer one-by-one 
>is creating a coroutine. You may just add a coroutine function, that 
>does all your linear logic as you want, without any callbacks like:
>
>nvme_co_flush()
>{
>   for (...) {
>      blk_co_flush();
>   }
>}
>
>(and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to 
>start a coroutine).
>

So, this is definitely a tempting way to implement this. I must admit 
that I did not consider it like this because I thought this was at the 
wrong level of abstraction (looked to me like this was something that 
belonged in block/, not hw/). Again, I referred to the Trim 
implementation in hw/ide as the source of inspiration on the sequential 
AIOCB approach.

>Still, I'm not sure that moving from simultaneous issuing several IO 
>commands to sequential is good idea..
>And this way you of course can't use blk_aio_cancel. This leads to my last doubt:
>
>One more thing I don't understand after a quick look at the series: how does 
>cancellation work? It seems to me that you just call cancel on nested 
>AIOCBs, produced by blk_<io_functions>, but none of them implement 
>cancel. I see only four implementations of the .cancel_async callback in 
>the whole Qemu code: in iscsi, in ide/core.c, in dma-helpers.c and in 
>thread-pool.c. None of them seem related to blk_aio_flush() and other 
>blk_* functions you call in the series. Or what am I missing?
>

Right now, cancellation is only initiated by the device when a 
submission queue is deleted. This causes blk_aio_cancel() to be called 
on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In most 
cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, which 
implements .cancel_async. Prior to this patchset, Flush, DSM, Copy and 
so on, did not have any BlockAIOCB to cancel since we did not keep 
references to the ongoing AIOs.

The blk_aio_cancel() call is synchronous, but it does call 
bdrv_aio_cancel_async() which calls the .cancel_async callback if 
implemented. This means that we can now cancel ongoing DSM or Copy 
commands while they are processing their individual LBA ranges. So while 
blk_aio_cancel() subsequently waits for the AIO to complete this may 
cause them to complete earlier than if they had run to full completion 
(i.e. if they did not implement .cancel_async).
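
Concretely, deleting a submission queue boils down to roughly this (simplified):

    NvmeRequest *r;

    while (!QTAILQ_EMPTY(&sq->out_req_list)) {
        r = QTAILQ_FIRST(&sq->out_req_list);
        assert(r->aiocb);
        blk_aio_cancel(r->aiocb); /* -> bdrv_aio_cancel_async() -> .cancel_async */
    }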

There are two things I'd like to do subsequent to this patch series:

   1. Fix the Abort command to actually do something. Currently the 
   command is a no-op (which is allowed by the spec), but I'd like it to 
   actually cancel the command that the host specified.

   2. Make submission queue deletion asynchronous.

The infrastructure provided by this refactor should allow this if I am 
not mistaken.

Overall, I think this "sequentialization" makes it easier to reason 
about cancellation, but that might just be me ;)

>
>>Those commands are special in
>>that may issue AIOs to multiple namespaces and thus, to multiple block
>>backends. Since this device does not support iothreads, I've opted for
>>simply always returning the main loop aio context, but I wonder if this
>>is acceptable or not. It might be the case that this should contain an
>>assert of some kind, in case someone starts adding iothread support.
>>
>>Klaus Jensen (11):
>>   hw/nvme: reimplement flush to allow cancellation
>>   hw/nvme: add nvme_block_status_all helper
>>   hw/nvme: reimplement dsm to allow cancellation
>>   hw/nvme: save reftag when generating pi
>>   hw/nvme: remove assert from nvme_get_zone_by_slba
>>   hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
>>   hw/nvme: add dw0/1 to the req completion trace event
>>   hw/nvme: reimplement the copy command to allow aio cancellation
>>   hw/nvme: reimplement zone reset to allow cancellation
>>   hw/nvme: reimplement format nvm to allow cancellation
>>   Partially revert "hw/block/nvme: drain namespaces on sq deletion"
>>
>>  hw/nvme/nvme.h       |   10 +-
>>  include/block/nvme.h |    8 +
>>  hw/nvme/ctrl.c       | 1861 ++++++++++++++++++++++++------------------
>>  hw/nvme/dif.c        |   64 +-
>>  hw/nvme/trace-events |   21 +-
>>  5 files changed, 1102 insertions(+), 862 deletions(-)
>>
>
>
Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Vladimir Sementsov-Ogievskiy 2 years, 9 months ago
07.06.2021 09:17, Klaus Jensen wrote:
> Hi Vladimir,
> 
> Thanks for taking the time to look through this!
> 
> I'll try to comment on all your observations below.
> 
> On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
>> 04.06.2021 09:52, Klaus Jensen wrote:
>>> From: Klaus Jensen <k.jensen@samsung.com>
>>>
>>> This series reimplements flush, dsm, copy, zone reset and format nvm to
>>> allow cancellation. I posted an RFC back in March ("hw/block/nvme:
>>> convert ad-hoc aio tracking to aiocb") and I've applied some feedback
>>> from Stefan and reimplemented the remaining commands.
>>>
>>> The basic idea is to define custom AIOCBs for these commands. The custom
>>> AIOCB takes care of issuing all the "nested" AIOs one by one instead of
>>> blindly sending them off simultaneously without tracking the returned
>>> aiocbs.
>>
>> I'm not familiar with nvme. But intuitively, isn't it less efficient to send mutltiple commands one-by-one? Overall, wouldn't it be slower?
> 
> No, you are right, it is of course slower overall.
> 
>> In block layer we mostly do opposite transition: instead of doing IO operations one-by-one, run them simultaneously to make a non-empty queue on a device.. Even on one device. This way overall performance is increased.
>>
> 
> Of these commands, Copy is the only one that I would consider optimizing like this. But the most obvious use of the Copy command is host-driven garbage collection in the context of zoned namespaces. And I would not consider that operation to be performance critical in terms of latency. All regular I/O commands are "one aiocb" and don't need any of this. And we already "parallelize" this heavily.
> 
>> If you need to store nested AIOCBs, you may store them in a list for example, and cancel in a loop, keeping simultaneous start for all flushes.. If you send two flushes on two different disks, what's the reason to wait for first flush finish before issuing the second?
>>
> 
> Keeping a list of returned aiocbs was my initial approach actually. But when I looked at hw/ide I got the impression that the AIOCB approach was the right one. My first approach involved adding an aiocblist to the core NvmeRequest structure, but I ended up killing that approach because I didn't want to deal with it on the normal I/O path.
> 
> But you are absolutely correct that waiting for the first flush to finish is suboptimal.
> 

Aha. OK, if that's not a performance-critical path.

>>>
>>> I've kept the RFC since I'm still new to using the block layer like
>>> this. I was hoping that Stefan could find some time to look over this -
>>> this is a huge series, so I don't expect non-nvme folks to spend a large
>>> amount of time on it, but I would really like feedback on my approach in
>>> the reimplementation of flush and format.
>>
>> If I understand your code correctly, you do stat next io operation from call-back of a previous. It works, and it is similar to haw mirror block-job was operating some time ago (still it maintained several in-flight requests simultaneously, but I'm about using _aio_ functions). Still, now mirror doesn't use _aio_ functions like this.
>>
>> Better approach to call several io functions of block layer one-by-one is creating a coroutine. You may just add a coroutine function, that does all your linear logic as you want, without any callbacks like:
>>
>> nvme_co_flush()
>> {
>>   for (...) {
>>      blk_co_flush();
>>   }
>> }
>>
>> (and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to start a coroutine).
>>
> 
> So, this is definitely a tempting way to implement this. I must admit that I did not consider it like this because I thought this was at the wrong level of abstraction (looked to me like this was something that belonged in block/, not hw/). Again, I referred to the Trim implementation in hw/ide as the source of inspiration on the sequential AIOCB approach.

No, I think it's OK from an abstraction point of view. Everybody is welcome to use coroutines where appropriate, and especially for doing sequential I/O :)
Actually, it's just more efficient: the way I propose, you create one coroutine which does all your logic as you want, whereas the blk_aio_ functions actually create a coroutine under the hood each time (I don't think it noticeably affects performance, but the logic becomes more straightforward).

The only problem is that for this approach we don't have a cancellation API, so you can't use it for cancellation anyway :(

> 
>> Still, I'm not sure that moving from simultaneous issuing several IO commands to sequential is good idea..
>> And this way you of course can't use blk_aio_canel.. This leads to my last doubt:
>>
>> One more thing I don't understand after fast look at the series: how cancelation works? It seems to me, that you just call cancel on nested AIOCBs, produced by blk_<io_functions>, but no one of them implement cancel.. I see only four implementations of .cancel_async callback in the whole Qemu code: in iscsi, in ide/core.c, in dma-helpers.c and in thread-pool.c.. Seems no one is related to blk_aio_flush() and other blk_* functions you call in the series. Or, what I miss?
>>
> 
> Right now, cancellation is only initiated by the device when a submission queue is deleted. This causes blk_aio_cancel() to be called on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In most cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, which implements .cancel_async. Prior to this patchset, Flush, DSM, Copy and so on, did not have any BlockAIOCB to cancel since we did not keep references to the ongoing AIOs.

Hmm. Looking at flush, for example, I don't see where a DMAAIOCB comes from.

You do

   iocb->aiocb = blk_aio_flush(ns->blkconf.blk, nvme_flush_ns_cb, iocb);

which calls blk_aio_prwv(), which in turn calls blk_aio_get() with blk_aio_em_aiocb_info, which doesn't implement .cancel_async.
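
For reference, blk_aio_em_aiocb_info in block/block-backend.c only sets the
aiocb size, so bdrv_aio_cancel_async() has nothing to call there:

   static const AIOCBInfo blk_aio_em_aiocb_info = {
       .aiocb_size = sizeof(BlkAioEmAIOCB),
   };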

> 
> The blk_aio_cancel() call is synchronous, but it does call bdrv_aio_cancel_async() which calls the .cancel_async callback if implemented. This means that we can now cancel ongoing DSM or Copy commands while they are processing their individual LBA ranges. So while blk_aio_cancel() subsequently waits for the AIO to complete this may cause them to complete earlier than if they had run to full completion (i.e. if they did not implement .cancel_async).
> 
> There are two things I'd like to do subsequent to this patch series:
> 
>    1. Fix the Abort command to actually do something. Currently the command is a no-op (which is allowed by the spec), but I'd like it to actually cancel the command that the host specified.
> 
>    2. Make submission queue deletion asynchronous.
> 
> The infrastructure provided by this refactor should allow this if I am not mistaken.
> 
> Overall, I think this "sequentialization" makes it easier to reason about cancellation, but that might just be me ;)
> 

I just don't like sequential logic simulated on top of the aio-callback async API, which in turn is simulated on top of the coroutine-driven sequential API (which is built on top of the real async block API, thread-based or linux-aio based, etc.) :) Still, I can't currently suggest an alternative that supports cancellation. But I still think that cancellation doesn't work for blk_aio_flush and friends either.

>>
>>> Those commands are special in
>>> that may issue AIOs to multiple namespaces and thus, to multiple block
>>> backends. Since this device does not support iothreads, I've opted for
>>> simply always returning the main loop aio context, but I wonder if this
>>> is acceptable or not. It might be the case that this should contain an
>>> assert of some kind, in case someone starts adding iothread support.
>>>
>>> Klaus Jensen (11):
>>>   hw/nvme: reimplement flush to allow cancellation
>>>   hw/nvme: add nvme_block_status_all helper
>>>   hw/nvme: reimplement dsm to allow cancellation
>>>   hw/nvme: save reftag when generating pi
>>>   hw/nvme: remove assert from nvme_get_zone_by_slba
>>>   hw/nvme: use prinfo directly in nvme_check_prinfo and nvme_dif_check
>>>   hw/nvme: add dw0/1 to the req completion trace event
>>>   hw/nvme: reimplement the copy command to allow aio cancellation
>>>   hw/nvme: reimplement zone reset to allow cancellation
>>>   hw/nvme: reimplement format nvm to allow cancellation
>>>   Partially revert "hw/block/nvme: drain namespaces on sq deletion"
>>>
>>>  hw/nvme/nvme.h       |   10 +-
>>>  include/block/nvme.h |    8 +
>>>  hw/nvme/ctrl.c       | 1861 ++++++++++++++++++++++++------------------
>>>  hw/nvme/dif.c        |   64 +-
>>>  hw/nvme/trace-events |   21 +-
>>>  5 files changed, 1102 insertions(+), 862 deletions(-)
>>>
>>
>>


-- 
Best regards,
Vladimir

Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Klaus Jensen 2 years, 9 months ago
On Jun  7 10:11, Vladimir Sementsov-Ogievskiy wrote:
>07.06.2021 09:17, Klaus Jensen wrote:
>>On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
>>>04.06.2021 09:52, Klaus Jensen wrote:
>>>>
>>>>I've kept the RFC since I'm still new to using the block layer like
>>>>this. I was hoping that Stefan could find some time to look over this -
>>>>this is a huge series, so I don't expect non-nvme folks to spend a large
>>>>amount of time on it, but I would really like feedback on my approach in
>>>>the reimplementation of flush and format.
>>>
>>>If I understand your code correctly, you do stat next io operation 
>>>from call-back of a previous. It works, and it is similar to haw 
>>>mirror block-job was operating some time ago (still it maintained 
>>>several in-flight requests simultaneously, but I'm about using _aio_ 
>>>functions). Still, now mirror doesn't use _aio_ functions like this.
>>>
>>>Better approach to call several io functions of block layer 
>>>one-by-one is creating a coroutine. You may just add a coroutine 
>>>function, that does all your linear logic as you want, without any 
>>>callbacks like:
>>>
>>>nvme_co_flush()
>>>{
>>>  for (...) {
>>>     blk_co_flush();
>>>  }
>>>}
>>>
>>>(and you'll need qemu_coroutine_create() and qemu_coroutine_enter() 
>>>to start a coroutine).
>>>
>>
>>So, this is definitely a tempting way to implement this. I must admit 
>>that I did not consider it like this because I thought this was at the 
>>wrong level of abstraction (looked to me like this was something that 
>>belonged in block/, not hw/). Again, I referred to the Trim 
>>implementation in hw/ide as the source of inspiration on the 
>>sequential AIOCB approach.
>
>No, I think it's OK from abstraction point of view. Everybody is 
>welcome to use coroutines if it is appropriate and especially for doing 
>sequential IOs :)
>Actually, it's just more efficient: the way I propose, you create one 
>coroutine, which does all your logic as you want, when blk_aio_ 
>functions actually create a coroutine under the hood each time (I don't 
>think that it noticeably affects performance, but logic becomes more 
>straightforward)
>
>The only problem is that for this way we don't have cancellation API, 
>so you can't use it for cancellation anyway :(
>

Yeah, I'm not really feeling up for adding that :P

>>
>>>Still, I'm not sure that moving from simultaneous issuing several IO 
>>>commands to sequential is good idea..
>>>And this way you of course can't use blk_aio_canel.. This leads to my 
>>>last doubt:
>>>
>>>One more thing I don't understand after fast look at the series: how 
>>>cancelation works? It seems to me, that you just call cancel on 
>>>nested AIOCBs, produced by blk_<io_functions>, but no one of them 
>>>implement cancel.. I see only four implementations of .cancel_async 
>>>callback in the whole Qemu code: in iscsi, in ide/core.c, in 
>>>dma-helpers.c and in thread-pool.c.. Seems no one is related to 
>>>blk_aio_flush() and other blk_* functions you call in the series. Or, 
>>>what I miss?
>>>
>>
>>Right now, cancellation is only initiated by the device when a 
>>submission queue is deleted. This causes blk_aio_cancel() to be called 
>>on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In 
>>most cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, 
>>which implements .cancel_async. Prior to this patchset, Flush, DSM, 
>>Copy and so on, did not have any BlockAIOCB to cancel since we did not 
>>keep references to the ongoing AIOs.
>
>Hmm. Looking at flush for example, I don't see how DMAAIOCB comes.
>
>You do
>
>  iocb->aiocb = blk_aio_flush(ns->blkconf.blk, nvme_flush_ns_cb, iocb);
>
>it calls blk_aio_prwv(), it calls blk_aio_get() with 
>blk_aio_em_aiocb_info, that doesn't implement .cancel_async..
>

I meant that most I/O in the regular path (read/write) goes through the DMA 
helpers (since it involves DMA). We might use blk_aio_p{read,write} 
directly when we read from/write to memory on the device (the controller 
memory buffer), but it is not the common case.

You are correct that BlkAioEmAIOCB does not implement cancel, but the 
"wrapper" (NvmeFlushAIOCB) *does*. This means that, from the NVMe 
controller perspective, we can cancel the flush in between 
(un-cancellable blk_aio_flush-initiated) flushes to multiple namespaces.

>>
>>The blk_aio_cancel() call is synchronous, but it does call 
>>bdrv_aio_cancel_async() which calls the .cancel_async callback if 
>>implemented. This means that we can now cancel ongoing DSM or Copy 
>>commands while they are processing their individual LBA ranges. So 
>>while blk_aio_cancel() subsequently waits for the AIO to complete this 
>>may cause them to complete earlier than if they had run to full 
>>completion (i.e. if they did not implement .cancel_async).
>>
>>There are two things I'd like to do subsequent to this patch series:
>>
>>   1. Fix the Abort command to actually do something. Currently the 
>> command is a no-op (which is allowed by the spec), but I'd like it to 
>> actually cancel the command that the host specified.
>>
>>   2. Make submission queue deletion asynchronous.
>>
>>The infrastructure provided by this refactor should allow this if I am 
>>not mistaken.
>>
>>Overall, I think this "sequentialization" makes it easier to reason 
>>about cancellation, but that might just be me ;)
>>
>
>I just don't like sequential logic simulated on top of aio-callback 
>async API, which in turn is simulated on top of coroutine-driven 
>sequential API (which is made on top of real async block API 
>(thread-based or linux-aio based, etc)) :)

Ha! Yes, we are not exactly improving on that layering here ;)

> Still I can't suggest now an alternative that supports cancellation. 
>But I still think that cancellation doesn't work for blk_aio_flush and 
>friends either..
>

The aiocb from blk_aio_flush is considered "un-cancellable" I guess (by 
design from the block layer), but the NVMe command "Flush" is 
cancellable from the perspective of the NVMe controller. Or at least, 
that's what I am intending to do here.
Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Vladimir Sementsov-Ogievskiy 2 years, 9 months ago
07.06.2021 13:00, Klaus Jensen wrote:
> On Jun  7 10:11, Vladimir Sementsov-Ogievskiy wrote:
>> 07.06.2021 09:17, Klaus Jensen wrote:
>>> On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
>>>> 04.06.2021 09:52, Klaus Jensen wrote:
>>>>>
>>>>> I've kept the RFC since I'm still new to using the block layer like
>>>>> this. I was hoping that Stefan could find some time to look over this -
>>>>> this is a huge series, so I don't expect non-nvme folks to spend a large
>>>>> amount of time on it, but I would really like feedback on my approach in
>>>>> the reimplementation of flush and format.
>>>>
>>>> If I understand your code correctly, you do stat next io operation from call-back of a previous. It works, and it is similar to haw mirror block-job was operating some time ago (still it maintained several in-flight requests simultaneously, but I'm about using _aio_ functions). Still, now mirror doesn't use _aio_ functions like this.
>>>>
>>>> Better approach to call several io functions of block layer one-by-one is creating a coroutine. You may just add a coroutine function, that does all your linear logic as you want, without any callbacks like:
>>>>
>>>> nvme_co_flush()
>>>> {
>>>>   for (...) {
>>>>      blk_co_flush();
>>>>   }
>>>> }
>>>>
>>>> (and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to start a coroutine).
>>>>
>>>
>>> So, this is definitely a tempting way to implement this. I must admit that I did not consider it like this because I thought this was at the wrong level of abstraction (looked to me like this was something that belonged in block/, not hw/). Again, I referred to the Trim implementation in hw/ide as the source of inspiration on the sequential AIOCB approach.
>>
>> No, I think it's OK from abstraction point of view. Everybody is welcome to use coroutines if it is appropriate and especially for doing sequential IOs :)
>> Actually, it's just more efficient: the way I propose, you create one coroutine, which does all your logic as you want, when blk_aio_ functions actually create a coroutine under the hood each time (I don't think that it noticeably affects performance, but logic becomes more straightforward)
>>
>> The only problem is that for this way we don't have cancellation API, so you can't use it for cancellation anyway :(
>>
> 
> Yeah, I'm not really feeling up for adding that :P
> 
>>>
>>>> Still, I'm not sure that moving from simultaneous issuing several IO commands to sequential is good idea..
>>>> And this way you of course can't use blk_aio_canel.. This leads to my last doubt:
>>>>
>>>> One more thing I don't understand after fast look at the series: how cancelation works? It seems to me, that you just call cancel on nested AIOCBs, produced by blk_<io_functions>, but no one of them implement cancel.. I see only four implementations of .cancel_async callback in the whole Qemu code: in iscsi, in ide/core.c, in dma-helpers.c and in thread-pool.c.. Seems no one is related to blk_aio_flush() and other blk_* functions you call in the series. Or, what I miss?
>>>>
>>>
>>> Right now, cancellation is only initiated by the device when a submission queue is deleted. This causes blk_aio_cancel() to be called on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In most cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, which implements .cancel_async. Prior to this patchset, Flush, DSM, Copy and so on, did not have any BlockAIOCB to cancel since we did not keep references to the ongoing AIOs.
>>
>> Hmm. Looking at flush for example, I don't see how DMAAIOCB comes.
>>
>> You do
>>
>>  iocb->aiocb = blk_aio_flush(ns->blkconf.blk, nvme_flush_ns_cb, iocb);
>>
>> it calls blk_aio_prwv(), it calls blk_aio_get() with blk_aio_em_aiocb_info, that doesn't implement .cancel_async..
>>
> 
> I meant that most I/O in the regular path (read/write) are using the dma helpers (since they do DMA). We might use the blk_aio_p{read,write} directly when we read from/write to memory on the device (the controller memory buffer), but it is not the common case.
> 
> You are correct that BlkAioEmAIOCB does not implement cancel, but the "wrapper" (NvmeFlushAIOCB) *does*. This means that, from the NVMe controller perspective, we can cancel the flush in between (un-cancellable blk_aio_flush-initiated) flushes to multiple namespaces.

Hm. But it will work the same way if you just don't implement .cancel_async in nvme_flush_aiocb_info.

> 
>>>
>>> The blk_aio_cancel() call is synchronous, but it does call bdrv_aio_cancel_async() which calls the .cancel_async callback if implemented. This means that we can now cancel ongoing DSM or Copy commands while they are processing their individual LBA ranges. So while blk_aio_cancel() subsequently waits for the AIO to complete this may cause them to complete earlier than if they had run to full completion (i.e. if they did not implement .cancel_async).
>>>
>>> There are two things I'd like to do subsequent to this patch series:
>>>
>>>   1. Fix the Abort command to actually do something. Currently the command is a no-op (which is allowed by the spec), but I'd like it to actually cancel the command that the host specified.
>>>
>>>   2. Make submission queue deletion asynchronous.
>>>
>>> The infrastructure provided by this refactor should allow this if I am not mistaken.
>>>
>>> Overall, I think this "sequentialization" makes it easier to reason about cancellation, but that might just be me ;)
>>>
>>
>> I just don't like sequential logic simulated on top of aio-callback async API, which in turn is simulated on top of coroutine-driven sequential API (which is made on top of real async block API (thread-based or linux-aio based, etc)) :)
> 
> Ha! Yes, we are not exactly improving on that layering here ;)
> 
>> Still I can't suggest now an alternative that supports cancellation. But I still think that cancellation doesn't work for blk_aio_flush and friends either..
>>
> 
> The aiocb from blk_aio_flush is considered "un-cancellable" I guess (by design from the block layer), but the NVMe command "Flush" is cancellable from the perspective of the NVMe controller. Or at least, that's what I am intending to do here.


-- 
Best regards,
Vladimir

Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Klaus Jensen 2 years, 9 months ago
On Jun  7 13:24, Vladimir Sementsov-Ogievskiy wrote:
>07.06.2021 13:00, Klaus Jensen wrote:
>>On Jun  7 10:11, Vladimir Sementsov-Ogievskiy wrote:
>>>07.06.2021 09:17, Klaus Jensen wrote:
>>>>On Jun  7 08:14, Vladimir Sementsov-Ogievskiy wrote:
>>>>>04.06.2021 09:52, Klaus Jensen wrote:
>>>>>>
>>>>>>I've kept the RFC since I'm still new to using the block layer like
>>>>>>this. I was hoping that Stefan could find some time to look over this -
>>>>>>this is a huge series, so I don't expect non-nvme folks to spend a large
>>>>>>amount of time on it, but I would really like feedback on my approach in
>>>>>>the reimplementation of flush and format.
>>>>>
>>>>>If I understand your code correctly, you do stat next io operation from call-back of a previous. It works, and it is similar to haw mirror block-job was operating some time ago (still it maintained several in-flight requests simultaneously, but I'm about using _aio_ functions). Still, now mirror doesn't use _aio_ functions like this.
>>>>>
>>>>>Better approach to call several io functions of block layer one-by-one is creating a coroutine. You may just add a coroutine function, that does all your linear logic as you want, without any callbacks like:
>>>>>
>>>>>nvme_co_flush()
>>>>>{
>>>>>  for (...) {
>>>>>     blk_co_flush();
>>>>>  }
>>>>>}
>>>>>
>>>>>(and you'll need qemu_coroutine_create() and qemu_coroutine_enter() to start a coroutine).
>>>>>
>>>>
>>>>So, this is definitely a tempting way to implement this. I must admit that I did not consider it like this because I thought this was at the wrong level of abstraction (looked to me like this was something that belonged in block/, not hw/). Again, I referred to the Trim implementation in hw/ide as the source of inspiration on the sequential AIOCB approach.
>>>
>>>No, I think it's OK from abstraction point of view. Everybody is welcome to use coroutines if it is appropriate and especially for doing sequential IOs :)
>>>Actually, it's just more efficient: the way I propose, you create one coroutine, which does all your logic as you want, when blk_aio_ functions actually create a coroutine under the hood each time (I don't think that it noticeably affects performance, but logic becomes more straightforward)
>>>
>>>The only problem is that for this way we don't have cancellation API, so you can't use it for cancellation anyway :(
>>>
>>
>>Yeah, I'm not really feeling up for adding that :P
>>
>>>>
>>>>>Still, I'm not sure that moving from simultaneous issuing several IO commands to sequential is good idea..
>>>>>And this way you of course can't use blk_aio_canel.. This leads to my last doubt:
>>>>>
>>>>>One more thing I don't understand after fast look at the series: how cancelation works? It seems to me, that you just call cancel on nested AIOCBs, produced by blk_<io_functions>, but no one of them implement cancel.. I see only four implementations of .cancel_async callback in the whole Qemu code: in iscsi, in ide/core.c, in dma-helpers.c and in thread-pool.c.. Seems no one is related to blk_aio_flush() and other blk_* functions you call in the series. Or, what I miss?
>>>>>
>>>>
>>>>Right now, cancellation is only initiated by the device when a submission queue is deleted. This causes blk_aio_cancel() to be called on each BlockAIOCB (NvmeRequest.aiocb) for outstanding requests. In most cases this BlockAIOCB is a DMAAIOCB from softmmu/dma-helpers.c, which implements .cancel_async. Prior to this patchset, Flush, DSM, Copy and so on, did not have any BlockAIOCB to cancel since we did not keep references to the ongoing AIOs.
>>>
>>>Hmm. Looking at flush for example, I don't see how DMAAIOCB comes.
>>>
>>>You do
>>>
>>> iocb->aiocb = blk_aio_flush(ns->blkconf.blk, nvme_flush_ns_cb, iocb);
>>>
>>>it calls blk_aio_prwv(), it calls blk_aio_get() with blk_aio_em_aiocb_info, that doesn't implement .cancel_async..
>>>
>>
>>I meant that most I/O in the regular path (read/write) are using the dma helpers (since they do DMA). We might use the blk_aio_p{read,write} directly when we read from/write to memory on the device (the controller memory buffer), but it is not the common case.
>>
>>You are correct that BlkAioEmAIOCB does not implement cancel, but the "wrapper" (NvmeFlushAIOCB) *does*. This means that, from the NVMe controller perspective, we can cancel the flush in between (un-cancellable blk_aio_flush-initiated) flushes to multiple namespaces.
>
>Hm. But it will work the same way if you just don't implement .cancel_async in nvme_flush_aiocb_info.
>

Oh. Crap, I see. I screwed this up in flush.

blk_aio_cancel_async(iocb->aiocb) will be a no-op and ret will never be 
-ECANCELED. I think I do this correctly in the other reimplementations 
(explicitly set iocb->ret), but not here in flush.
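
So the flush cancel callback needs to record the cancellation itself, something
like:

    static void nvme_flush_cancel(BlockAIOCB *acb)
    {
        NvmeFlushAIOCB *iocb = container_of(acb, NvmeFlushAIOCB, common);

        iocb->ret = -ECANCELED;

        if (iocb->aiocb) {
            blk_aio_cancel_async(iocb->aiocb);
        }
    }

with nvme_flush_ns_cb then checking iocb->ret and bailing out instead of
issuing the next flush.
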
Re: [RFC PATCH 00/11] hw/nvme: reimplement all multi-aio commands with custom aiocbs
Posted by Stefan Hajnoczi 2 years, 9 months ago
On Fri, Jun 04, 2021 at 08:52:26AM +0200, Klaus Jensen wrote:
> From: Klaus Jensen <k.jensen@samsung.com>
> 
> This series reimplements flush, dsm, copy, zone reset and format nvm to
> allow cancellation. I posted an RFC back in March ("hw/block/nvme:
> convert ad-hoc aio tracking to aiocb") and I've applied some feedback
> from Stefan and reimplemented the remaining commands.
> 
> The basic idea is to define custom AIOCBs for these commands. The custom
> AIOCB takes care of issuing all the "nested" AIOs one by one instead of
> blindly sending them off simultaneously without tracking the returned
> aiocbs.
> 
> I've kept the RFC since I'm still new to using the block layer like
> this. I was hoping that Stefan could find some time to look over this -
> this is a huge series, so I don't expect non-nvme folks to spend a large
> amount of time on it, but I would really like feedback on my approach in
> the reimplementation of flush and format. Those commands are special in
> that they may issue AIOs to multiple namespaces and thus to multiple block
> backends. Since this device does not support iothreads, I've opted for
> simply always returning the main loop aio context, but I wonder if this
> is acceptable or not. It might be the case that this should contain an
> assert of some kind, in case someone starts adding iothread support.

This approach looks fine to me. Vladimir mentioned coroutines, which
have simpler code for sequential I/O, but don't support cancellation.
Since cancellation is the point of this series I think sticking to the
aio approach makes sense.

Regarding coroutine cancellation, it's hard to add since there is
already a lot of coroutine code that's not written with cancellation in
mind.

I think I would approach it by adding a .cancel_cb() field to Coroutine
that does nothing by default (because existing code doesn't support
cancellation and we must wait for the operation to complete). Cases that
do support cancellation would install a .cancel_cb() across the yield that
causes the operation the coroutine is waiting on to complete early.
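
Something like this (purely hypothetical, no such field exists today):

    /* in a yield point that supports cancellation */
    qemu_coroutine_self()->cancel_cb = my_cancel_fn;  /* installed across yield */
    qemu_coroutine_self()->cancel_opaque = state;
    qemu_coroutine_yield();
    qemu_coroutine_self()->cancel_cb = NULL;

where cancelling the coroutine would just invoke cancel_cb, which makes the
pending operation complete early.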

An alternative approach is to re-enter the coroutine, but this requires
all yield points in QEMU to check for cancellation. I don't think this
is practical because converting all the code would be hard.

Anyway, the aio approach looks fine.

Stefan