[v2] Live update: cpr-exec

[PATCH V2 00/11] Live update: cpr-exec

Posted by Steve Sistare 1 year, 4 months ago

What?

This patch series adds the live migration cpr-exec mode, which allows
the user to update QEMU with minimal guest pause time, by preserving
guest RAM in place, albeit with new virtual addresses in new QEMU, and
by preserving device file descriptors.

The new user-visible interfaces are:
  * cpr-exec (MigMode migration parameter)
  * cpr-exec-command (migration parameter)
  * anon-alloc (command-line option for -machine)

The user sets the mode parameter before invoking the migrate command.
In this mode, the user issues the migrate command to old QEMU, which
stops the VM and saves state to the migration channels.  Old QEMU then
exec's new QEMU, replacing the original process while retaining its PID.
The user specifies the command to exec new QEMU in the migration parameter
cpr-exec-command.  The command must pass all old QEMU arguments to new
QEMU, plus the -incoming option.  Execution resumes in new QEMU.

Memory-backend objects must have the share=on attribute, but
memory-backend-epc is not supported.  The VM must be started
with the '-machine anon-alloc=memfd' option, which allows anonymous
memory to be transferred in place to the new process.

Why?

This mode has less impact on the guest than any other method of updating
in place.  The pause time is much lower, because devices need not be torn
down and recreated, DMA does not need to be drained and quiesced, and minimal
state is copied to new QEMU.  Further, there are no constraints on the guest.
By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
and suspending plus resuming vfio devices adds multiple seconds to the
guest pause time.  Lastly, there is no loss of connectivity to the guest,
because chardev descriptors remain open and connected.

These benefits all derive from the core design principle of this mode,
which is preserving open descriptors.  This approach is very general and
can be used to support a wide variety of devices that do not have hardware
support for live migration, including but not limited to: vfio, chardev,
vhost, vdpa, and iommufd.  Some devices need new kernel software interfaces
to allow a descriptor to be used in a process that did not originally open it.

In a containerized QEMU environment, cpr-exec reuses an existing QEMU
container and its assigned resources.  By contrast, consider a design in
which a new container is created on the same host as the target of the
CPR operation.  Resources must be reserved for the new container, while
the old container still reserves resources until the operation completes.
Avoiding over commitment requires extra work in the management layer.
This is one reason why a cloud provider may prefer cpr-exec.  A second reason
is that the container may include agents with their own connections to the
outside world, and such connections remain intact if the container is reused.

How?

All memory that is mapped by the guest is preserved in place.  Indeed,
it must be, because it may be the target of DMA requests, which are not
quiesced during cpr-exec.  All such memory must be mmap'able in new QEMU.
This is easy for named memory-backend objects, as long as they are mapped
shared, because they are visible in the file system in both old and new QEMU.
Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
so the memfd's can be sent to new QEMU.  Pages that were locked in memory
for DMA in old QEMU remain locked in new QEMU, because the descriptor of
the device that locked them remains open.

cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
and by sending the unique name and value of each descriptor to new QEMU
via CPR state.

For device descriptors, new QEMU reuses the descriptor when creating the
device, rather than opening it again.  The same holds for chardevs.  For
memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
is created.

CPR state cannot be sent over the normal migration channel, because devices
and backends are created prior to reading the channel, so this mode sends
CPR state over a second migration channel that is not visible to the user.
New QEMU reads the second channel prior to creating devices or backends.

The exec itself is trivial.  After writing to the migration channels, the
migration code calls a new main-loop hook to perform the exec.

Example:

In this example, we simply restart the same version of QEMU, but in
a real scenario one would use a new QEMU binary path in cpr-exec-command.

  # qemu-kvm -monitor stdio -object
  memory-backend-file,id=ram0,size=4G,mem-path=/dev/shm/ram0,share=on
  -m 4G -machine anon-alloc=memfd ...

  QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running
  (qemu) migrate_set_parameter mode cpr-exec
  (qemu) migrate_set_parameter cpr-exec-command qemu-kvm ... -incoming file:vm.state
  (qemu) migrate -d file:vm.state
  (qemu) QEMU 9.1.50 monitor - type 'help' for more information
  (qemu) info status
  VM status: running

This patch series implements a minimal version of cpr-exec.  Additional
series are ready to be posted to deliver the complete vision described
above, including
  * vfio
  * chardev
  * vhost
  * blockers
  * hostmem-memfd
  * migration-test cases

Works in progress include:
  * vdpa
  * iommufd
  * cpr-transfer mode

Changes since V1:
  * Dropped precreate and factory patches.  Added CPR state instead.
  * Dropped patches that refactor ramblock allocation
  * Dropped vmstate_info_void patch (peter)
  * Dropped patch "seccomp: cpr-exec blocker" (Daniel)
  * Redefined memfd-alloc option as anon-alloc
  * No longer preserve ramblock fields, except for fd (peter)
  * Added fd preservation functions in CPR state
  * Hoisted cpr code out of migrate_fd_cleanup (fabiano)
  * Revised migration.json docs (markus)
  * Fixed qtest failures (fabiano)
  * Renamed SAVEVM_FOREACH macros (fabiano)
  * Renamed cpr-exec-args as cpr-exec-command (markus)

The first 6 patches below are foundational and are needed for both cpr-exec
mode and cpr-transfer mode.  The last 5 patches are specific to cpr-exec
and implement the mechanisms for sharing state across exec.

Steve Sistare (11):
  machine: alloc-anon option
  migration: cpr-state
  migration: save cpr mode
  migration: stop vm earlier for cpr
  physmem: preserve ram blocks for cpr
  migration: fix mismatched GPAs during cpr
  oslib: qemu_clear_cloexec
  vl: helper to request exec
  migration: cpr-exec-command parameter
  migration: cpr-exec save and load
  migration: cpr-exec mode

 hmp-commands.hx                |   2 +-
 hw/core/machine.c              |  24 +++++
 include/exec/memory.h          |  12 +++
 include/hw/boards.h            |   1 +
 include/migration/cpr.h        |  35 ++++++
 include/qemu/osdep.h           |   9 ++
 include/sysemu/runstate.h      |   3 +
 migration/cpr-exec.c           | 180 +++++++++++++++++++++++++++++++
 migration/cpr.c                | 238 +++++++++++++++++++++++++++++++++++++++++
 migration/meson.build          |   2 +
 migration/migration-hmp-cmds.c |  25 +++++
 migration/migration.c          |  43 ++++++--
 migration/options.c            |  23 +++-
 migration/ram.c                |  17 +--
 migration/trace-events         |   5 +
 qapi/machine.json              |  14 +++
 qapi/migration.json            |  45 +++++++-
 qemu-options.hx                |  13 +++
 system/memory.c                |  22 +++-
 system/physmem.c               |  61 ++++++++++-
 system/runstate.c              |  29 +++++
 system/trace-events            |   3 +
 system/vl.c                    |   3 +
 util/oslib-posix.c             |   9 ++
 util/oslib-win32.c             |   4 +
 25 files changed, 792 insertions(+), 30 deletions(-)
 create mode 100644 include/migration/cpr.h
 create mode 100644 migration/cpr-exec.c
 create mode 100644 migration/cpr.c

-- 
1.8.3.1

Re: [PATCH V2 00/11] Live update: cpr-exec

Posted by Peter Xu 1 year, 3 months ago

Steve,

On Sun, Jun 30, 2024 at 12:40:23PM -0700, Steve Sistare wrote:
> What?

Thanks for trying out with the cpr-transfer series.  I saw that that series
missed most of the cc list here, so I'm attaching the link here:

https://lore.kernel.org/r/1719776648-435073-1-git-send-email-steven.sistare@oracle.com

I think most of my previous questions for exec() solution still are there,
I'll try to summarize them all in this reply as much as I can.

> 
> This patch series adds the live migration cpr-exec mode, which allows
> the user to update QEMU with minimal guest pause time, by preserving
> guest RAM in place, albeit with new virtual addresses in new QEMU, and
> by preserving device file descriptors.
> 
> The new user-visible interfaces are:
>   * cpr-exec (MigMode migration parameter)
>   * cpr-exec-command (migration parameter)

I really, really hope we can avoid this..

It's super cumbersome to pass in a qemu cmdline in a qemu migration
parameter.. if we can do that with generic live migration ways, I hope we
stick with the clean approach.

>   * anon-alloc (command-line option for -machine)

Igor questioned this, and I second his opinion..  We can leave the
discussion there for this one.

> 
> The user sets the mode parameter before invoking the migrate command.
> In this mode, the user issues the migrate command to old QEMU, which
> stops the VM and saves state to the migration channels.  Old QEMU then
> exec's new QEMU, replacing the original process while retaining its PID.
> The user specifies the command to exec new QEMU in the migration parameter
> cpr-exec-command.  The command must pass all old QEMU arguments to new
> QEMU, plus the -incoming option.  Execution resumes in new QEMU.
> 
> Memory-backend objects must have the share=on attribute, but
> memory-backend-epc is not supported.  The VM must be started
> with the '-machine anon-alloc=memfd' option, which allows anonymous
> memory to be transferred in place to the new process.
> 
> Why?
> 
> This mode has less impact on the guest than any other method of updating
> in place.

So I wonder whether there's comparison between exec() and transfer mode
that you recently proposed.

I'm asking because exec() (besides all the rest of things that I dislike on
it in this approach..) should be simply slower, logically, due to the
serialized operation to (1) tearing down the old mm, (2) reload the new
ELF, then (3) runs through the QEMU init process.

If with a generic migration solution, the dest QEMU can start running (2+3)
concurrently without even need to run (1).

In this whole process, I doubt (2) could be relatively fast, (3) I donno,
maybe it could be slow but I never measured; Paolo may have good idea as I
know he used to work on qboot.

For (1), I also doubt in your test cases it's fast, but it may not always
be fast.  Consider the guest has a huge TBs of shared mem, even if the
memory will be completely shared between src/dst QEMUs, the pgtable won't!
It means if the TBs are mapped in PAGE_SIZE tearing down the src QEMU
pgtable alone can even take time, and that will be accounted in step (1)
and further in exec() request.

All these fuss will be avoided if you use a generic live migration model
like cpr-transfer you proposed.  That's also cleaner.

> The pause time is much lower, because devices need not be torn
> down and recreated, DMA does not need to be drained and quiesced, and minimal
> state is copied to new QEMU.  Further, there are no constraints on the guest.
> By contrast, cpr-reboot mode requires the guest to support S3 suspend-to-ram,
> and suspending plus resuming vfio devices adds multiple seconds to the
> guest pause time.  Lastly, there is no loss of connectivity to the guest,
> because chardev descriptors remain open and connected.

Again, I raised the question on why this would matter, as after all mgmt
app will need to coop with reconnections due to the fact they'll need to
support a generic live migration, in which case reconnection is a must.

So far it doesn't sound like a performance critical path, for example, to
do the mgmt reconnects on the ports.  So this might be an optimization that
most mgmt apps may not care much?

> 
> These benefits all derive from the core design principle of this mode,
> which is preserving open descriptors.  This approach is very general and
> can be used to support a wide variety of devices that do not have hardware
> support for live migration, including but not limited to: vfio, chardev,
> vhost, vdpa, and iommufd.  Some devices need new kernel software interfaces
> to allow a descriptor to be used in a process that did not originally open it.

Yes, I still think this is a great idea.  It just can also be built on top
of something else than exec().

> 
> In a containerized QEMU environment, cpr-exec reuses an existing QEMU
> container and its assigned resources.  By contrast, consider a design in
> which a new container is created on the same host as the target of the
> CPR operation.  Resources must be reserved for the new container, while
> the old container still reserves resources until the operation completes.

Note that if we need to share RAM anyway, the resources consumption should
be minimal, as mem should IMHO be the major concern (except CPU, but CPU
isn't a concern in this scenario) in container world and here the shared
guest mem shouldn't be accounted to the dest container.  So IMHO it's about
the metadata QEMU/KVM needs to do the hypervisor work, it seems to me, and
that should be relatively small.

In that case I don't yet see it a huge improvement, if the dest container
is cheap to initiate.

> Avoiding over commitment requires extra work in the management layer.

So it would be nice to know what needs to be overcommitted here.  I confess
I don't know much on containerized VMs, so maybe the page cache can be a
problem even if shared.  But I hope we can spell that out.  Logically IIUC
memcg shouldn't account those page cache if preallocated, because memcg
accounting should be done at folio allocations, at least, where the page
cache should miss first (so not this case..).

> This is one reason why a cloud provider may prefer cpr-exec.  A second reason
> is that the container may include agents with their own connections to the
> outside world, and such connections remain intact if the container is reused.
> 
> How?
> 
> All memory that is mapped by the guest is preserved in place.  Indeed,
> it must be, because it may be the target of DMA requests, which are not
> quiesced during cpr-exec.  All such memory must be mmap'able in new QEMU.
> This is easy for named memory-backend objects, as long as they are mapped
> shared, because they are visible in the file system in both old and new QEMU.
> Anonymous memory must be allocated using memfd_create rather than MAP_ANON,
> so the memfd's can be sent to new QEMU.  Pages that were locked in memory
> for DMA in old QEMU remain locked in new QEMU, because the descriptor of
> the device that locked them remains open.
> 
> cpr-exec preserves descriptors across exec by clearing the CLOEXEC flag,
> and by sending the unique name and value of each descriptor to new QEMU
> via CPR state.
> 
> For device descriptors, new QEMU reuses the descriptor when creating the
> device, rather than opening it again.  The same holds for chardevs.  For
> memfd descriptors, new QEMU mmap's the preserved memfd when a ramblock
> is created.
> 
> CPR state cannot be sent over the normal migration channel, because devices
> and backends are created prior to reading the channel, so this mode sends
> CPR state over a second migration channel that is not visible to the user.
> New QEMU reads the second channel prior to creating devices or backends.

Oh, maybe this is the reason that cpr-transfer will need a separate uri..

Thanks,

-- 
Peter Xu