[v5] Qemu crashes on VM migration after an handled memory error

[PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error

Posted by “William Roche 2 years, 3 months ago

From: William Roche <william.roche@oracle.com>


Note about ARM specificities:
This code has a small part impacting more specificaly ARM machines,
that's the reason why I added qemu-arm@nongnu.org -- see description.


A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page.
This situation creates a hole in the VM memory address space that the VM kernel
knows about (an unreadable page or set of pages).

But the migration of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

In order to thoroughly correct this problem, the poison information should
follow the migration which represents several difficulties:
- poisoning a page on the destination machine to replicate the source
  poison requires CAP_SYS_ADMIN priviledges, and qemu process may not
  always run as a root process
- the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
- the poison information would require a memory transfer protocol
  enhancement to provide this information
(The current patches don't provide any of that)

But if we rely on the fact that the a running VM kernel is correctly
dealing with memory poison it is informed about: marking the poison page
as inaccessible, we could count on the VM kernel to make sure that
poisoned pages are not used, even after a migration.
In this case, I suggest to treat the poisoned pages as if they were
zero-pages for the migration copy.
This fix also works with underlying large pages, taking into account the
RAMBlock segment "page-size".

Now, it leaves a case that we have to deal with: if a memory error is
reported to qemu but not injected into the running kernel...
As the migration will go from a poisoned page to an all-zero page, if
the VM kernel doesn't prevent the access to this page, a memory read
that would generate a BUS_MCEERR_AR error on the source platform, could
be reading zeros on the destination. This is a memory corruption.

So we have to ensure that all poisoned pages we set to zero are known by
the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
errors are ignored, which means that qemu knows about the poison but the VM
doesn't. For the moment it's only the case for ARM, but could later be
also needed for AMD VMs.
See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/

In order to avoid this possible silent data corruption situation, we should
prevent the migration when we know that a poisoned page is ignored from the VM.

Which is, according to me, the smallest fix we need  to avoid qemu crashes
on migration after an handled memory error, without introducing a possible
corruption situation.

This fix is scripts/checkpatch.pl clean.
Unit test: Migration blocking succesfully tested on ARM -- injected AO error
blocks it. On x86 the same type of error being relayed doesn't block.

v2:
  - adding compressed transfer handling of poisoned pages

v3:
  - Included the Reviewed-by and Tested-by information on first patch
  - added a TODO comment above control_save_page()
    mentioning Zhijian's feedback about RDMA migration failure.

v4:
  - adding a patch to deal with unknown poison tracking (impacting ARM)
    (not using migrate_add_blocker as this is not devices related and
    we want to avoid the interaction with --only-migratable mechanism)

v5:
  - Updating the code to the latest version
  - adding qemu-arm@nongnu.org for a complementary review


William Roche (2):
  migration: skip poisoned memory pages on "ram saving" phase
  migration: prevent migration when a poisoned page is unknown from the
    VM

 accel/kvm/kvm-all.c      | 41 +++++++++++++++++++++++++++++++++++++++-
 accel/stubs/kvm-stub.c   | 10 ++++++++++
 include/sysemu/kvm.h     | 16 ++++++++++++++++
 include/sysemu/kvm_int.h |  3 ++-
 migration/migration.c    |  6 ++++++
 migration/ram-compress.c |  3 ++-
 migration/ram.c          | 24 +++++++++++++++++++++--
 migration/ram.h          |  2 ++
 target/arm/kvm64.c       |  6 +++++-
 target/i386/kvm/kvm.c    |  2 +-
 10 files changed, 106 insertions(+), 7 deletions(-)

-- 
2.39.3

Re: [PATCH v5 0/2] Qemu crashes on VM migration after an handled memory error

Posted by Peter Xu 2 years, 3 months ago

On Mon, Nov 06, 2023 at 10:03:17PM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> 
> Note about ARM specificities:
> This code has a small part impacting more specificaly ARM machines,
> that's the reason why I added qemu-arm@nongnu.org -- see description.
> 
> 
> A Qemu VM can survive a memory error, as qemu can relay the error to the
> VM kernel which could also deal with it -- poisoning/off-lining the impacted
> page.
> This situation creates a hole in the VM memory address space that the VM kernel
> knows about (an unreadable page or set of pages).
> 
> But the migration of this VM (live migration through the network or
> pseudo-migration with the creation of a state file) will crash Qemu when
> it sequentially reads the memory address space and stumbles on the
> existing hole.
> 
> In order to thoroughly correct this problem, the poison information should
> follow the migration which represents several difficulties:
> - poisoning a page on the destination machine to replicate the source
>   poison requires CAP_SYS_ADMIN priviledges, and qemu process may not
>   always run as a root process
> - the destination kernel needs to be configured with CONFIG_MEMORY_FAILURE
> - the poison information would require a memory transfer protocol
>   enhancement to provide this information
> (The current patches don't provide any of that)
> 
> But if we rely on the fact that the a running VM kernel is correctly
> dealing with memory poison it is informed about: marking the poison page
> as inaccessible, we could count on the VM kernel to make sure that
> poisoned pages are not used, even after a migration.
> In this case, I suggest to treat the poisoned pages as if they were
> zero-pages for the migration copy.
> This fix also works with underlying large pages, taking into account the
> RAMBlock segment "page-size".
> 
> Now, it leaves a case that we have to deal with: if a memory error is
> reported to qemu but not injected into the running kernel...
> As the migration will go from a poisoned page to an all-zero page, if
> the VM kernel doesn't prevent the access to this page, a memory read
> that would generate a BUS_MCEERR_AR error on the source platform, could
> be reading zeros on the destination. This is a memory corruption.
> 
> So we have to ensure that all poisoned pages we set to zero are known by
> the running kernel. But we have a problem with platforms where BUS_MCEERR_AO
> errors are ignored, which means that qemu knows about the poison but the VM
> doesn't. For the moment it's only the case for ARM, but could later be
> also needed for AMD VMs.
> See https://lore.kernel.org/all/20230912211824.90952-3-john.allen@amd.com/
> 
> In order to avoid this possible silent data corruption situation, we should
> prevent the migration when we know that a poisoned page is ignored from the VM.
> 
> Which is, according to me, the smallest fix we need  to avoid qemu crashes
> on migration after an handled memory error, without introducing a possible
> corruption situation.
> 
> This fix is scripts/checkpatch.pl clean.
> Unit test: Migration blocking succesfully tested on ARM -- injected AO error
> blocks it. On x86 the same type of error being relayed doesn't block.
> 
> v2:
>   - adding compressed transfer handling of poisoned pages
> 
> v3:
>   - Included the Reviewed-by and Tested-by information on first patch
>   - added a TODO comment above control_save_page()
>     mentioning Zhijian's feedback about RDMA migration failure.
> 
> v4:
>   - adding a patch to deal with unknown poison tracking (impacting ARM)
>     (not using migrate_add_blocker as this is not devices related and
>     we want to avoid the interaction with --only-migratable mechanism)
> 
> v5:
>   - Updating the code to the latest version
>   - adding qemu-arm@nongnu.org for a complementary review
> 
> 
> William Roche (2):
>   migration: skip poisoned memory pages on "ram saving" phase
>   migration: prevent migration when a poisoned page is unknown from the
>     VM

I hope someone from arch-specific can have a quick look at patch 2..

One thing to mention is unfortunately waiting on patch 2 means we'll miss
this release. Actually it is already missed.. softfreeze yesterday [1].  So
it may likely need to wait for 9.0.

[1] https://wiki.qemu.org/Planning/8.2

-- 
Peter Xu

[PATCH v1 0/1] Qemu crashes on VM migration after an handled memory error

Posted by “William Roche 2 years ago

From: William Roche <william.roche@oracle.com>

Problem:
--------
A Qemu VM can survive a memory error, as qemu can relay the error to the
VM kernel which could also deal with it -- poisoning/off-lining the impacted
page. This situation creates a hole in the VM memory address space (an
unreadable page or set of pages).

A migration request of this VM (live migration through the network or
pseudo-migration with the creation of a state file) will crash Qemu when
it sequentially reads the memory address space and stumbles on the
existing hole.

New fix proposal:
-----------------
Let's prevent the migration when we know that there is a poison page in
the VM address space.


History:
--------
My first fix proposal for this crash condition (latest version:
https://lore.kernel.org/all/20231106220319.456765-1-william.roche@oracle.com/ )
relied on a well behaving kernel to guaranty that a known poison page is
not accessed. It introduced an ARM platform specificity.
I haven't received any feedback about the ARM specificity to avoid
a possible memory corruption after a migration transforming a poisoned
page into an all zero page.

I also accept that when a memory error leads to memory poisoning, this
platform functionality has to be honored as long as a physical platform
would provide it.

Peter asked for a complete correction of this problem (transfering
the memory holes information with the migration and recreating these
holes on the destination platform).

In the meantime, this is a very small fix to avoid the current crash
situation reading the poisoned memory pages.  I'm simply preventing
the migration when we know that it would crash, when there is a
poisoned page in the VM address space.

This is a generic protection code, avoiding a crash condition and
reporting the following error message:
"Error: Can't migrate this vm with hardware poisoned memory, please reboot the vm and try again"
instead of crashing the VM.

This fix is scripts/checkpatch.pl clean.
Unit tested on ARM and x86.


William Roche (1):
  migration: prevent migration when VM has poisoned memory

 accel/kvm/kvm-all.c    | 10 ++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   |  6 ++++++
 migration/migration.c  |  7 +++++++
 4 files changed, 28 insertions(+)

-- 
2.39.3

[PATCH v1 1/1] migration: prevent migration when VM has poisoned memory

Posted by “William Roche 2 years ago

From: William Roche <william.roche@oracle.com>

A memory page poisoned from the hypervisor level is no longer readable.
The migration of a VM will crash Qemu when it tries to read the
memory address space and stumbles on the poisoned page with a similar
stack trace:

Program terminated with signal SIGBUS, Bus error.
#0  _mm256_loadu_si256
#1  buffer_zero_avx2
#2  select_accel_fn
#3  buffer_is_zero
#4  save_zero_page
#5  ram_save_target_page_legacy
#6  ram_save_host_page
#7  ram_find_and_save_block
#8  ram_save_iterate
#9  qemu_savevm_state_iterate
#10 migration_iteration_run
#11 migration_thread
#12 qemu_thread_start

To avoid this VM crash during the migration, prevent the migration
when a known hardware poison exists on the VM.

Signed-off-by: William Roche <william.roche@oracle.com>
---
 accel/kvm/kvm-all.c    | 10 ++++++++++
 accel/stubs/kvm-stub.c |  5 +++++
 include/sysemu/kvm.h   |  6 ++++++
 migration/migration.c  |  7 +++++++
 4 files changed, 28 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 49e755ec4a..a8cecd040e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1119,6 +1119,11 @@ int kvm_vm_check_extension(KVMState *s, unsigned int extension)
     return ret;
 }
 
+/*
+ * We track the poisoned pages to be able to:
+ * - replace them on VM reset
+ * - block a migration for a VM with a poisoned page
+ */
 typedef struct HWPoisonPage {
     ram_addr_t ram_addr;
     QLIST_ENTRY(HWPoisonPage) list;
@@ -1152,6 +1157,11 @@ void kvm_hwpoison_page_add(ram_addr_t ram_addr)
     QLIST_INSERT_HEAD(&hwpoison_page_list, page, list);
 }
 
+bool kvm_hwpoisoned_mem(void)
+{
+    return !QLIST_EMPTY(&hwpoison_page_list);
+}
+
 static uint32_t adjust_ioeventfd_endianness(uint32_t val, uint32_t size)
 {
 #if HOST_BIG_ENDIAN != TARGET_BIG_ENDIAN
diff --git a/accel/stubs/kvm-stub.c b/accel/stubs/kvm-stub.c
index 1b37d9a302..ca38172884 100644
--- a/accel/stubs/kvm-stub.c
+++ b/accel/stubs/kvm-stub.c
@@ -124,3 +124,8 @@ uint32_t kvm_dirty_ring_size(void)
 {
     return 0;
 }
+
+bool kvm_hwpoisoned_mem(void)
+{
+    return false;
+}
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index d614878164..fad9a7e8ff 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -538,4 +538,10 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+/**
+ * kvm_hwpoisoned_mem - indicate if there is any hwpoisoned page
+ * reported for the VM.
+ */
+bool kvm_hwpoisoned_mem(void);
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index d5f705ceef..b574e66f7b 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -67,6 +67,7 @@
 #include "options.h"
 #include "sysemu/dirtylimit.h"
 #include "qemu/sockets.h"
+#include "sysemu/kvm.h"
 
 static NotifierList migration_state_notifiers =
     NOTIFIER_LIST_INITIALIZER(migration_state_notifiers);
@@ -1906,6 +1907,12 @@ static bool migrate_prepare(MigrationState *s, bool blk, bool blk_inc,
         return false;
     }
 
+    if (kvm_hwpoisoned_mem()) {
+        error_setg(errp, "Can't migrate this vm with hardware poisoned memory, "
+                   "please reboot the vm and try again");
+        return false;
+    }
+
     if (migration_is_blocked(errp)) {
         return false;
     }
-- 
2.39.3

Re: [PATCH v1 1/1] migration: prevent migration when VM has poisoned memory

Posted by Peter Xu 2 years ago

On Tue, Jan 30, 2024 at 07:06:40PM +0000, “William Roche wrote:
> From: William Roche <william.roche@oracle.com>
> 
> A memory page poisoned from the hypervisor level is no longer readable.
> The migration of a VM will crash Qemu when it tries to read the
> memory address space and stumbles on the poisoned page with a similar
> stack trace:
> 
> Program terminated with signal SIGBUS, Bus error.
> #0  _mm256_loadu_si256
> #1  buffer_zero_avx2
> #2  select_accel_fn
> #3  buffer_is_zero
> #4  save_zero_page
> #5  ram_save_target_page_legacy
> #6  ram_save_host_page
> #7  ram_find_and_save_block
> #8  ram_save_iterate
> #9  qemu_savevm_state_iterate
> #10 migration_iteration_run
> #11 migration_thread
> #12 qemu_thread_start
> 
> To avoid this VM crash during the migration, prevent the migration
> when a known hardware poison exists on the VM.
> 
> Signed-off-by: William Roche <william.roche@oracle.com>

I queued it for now, while it'll always good to get feedback from either
Paolo or anyone else, as the pull won't happen in one week.  If no
objection it'll be included the next migration pull.

Thanks,

-- 
Peter Xu