From: David Hildenbrand <david@redhat.com>
To: qemu-devel@nongnu.org
Cc: David Hildenbrand, "Dr. David Alan Gilbert", Juan Quintela, Peter Xu,
    "Michael S. Tsirkin", Michal Privoznik, Jing Qi
Subject: [PATCH v5 8/8] virtio-mem: Proper support for preallocation with migration
Date: Tue, 17 Jan 2023 12:22:49 +0100
Message-Id: <20230117112249.244096-9-david@redhat.com>
In-Reply-To: <20230117112249.244096-1-david@redhat.com>
References: <20230117112249.244096-1-david@redhat.com>

Ordinary memory preallocation runs when QEMU starts up and creates the
memory backends, before processing the incoming migration stream. With
virtio-mem, we don't know which memory blocks to preallocate before
migration has started. Now that we migrate the virtio-mem bitmap early,
before migrating any RAM content, we can safely preallocate memory for
all plugged memory blocks before any RAM content is migrated.

This is especially relevant for the following cases:

(1) User errors

With hugetlb/files, if we don't have sufficient backend memory available
on the migration destination, we'll crash QEMU (SIGBUS) during RAM
migration when running out of backend memory. Preallocating memory
before actual RAM migration allows for failing gracefully and informing
the user about the setup problem.

(2) Excluded memory ranges during migration

For example, virtio-balloon free page hinting will exclude some pages
from getting migrated. In that case, we won't crash during RAM
migration, but only later, when running the VM on the destination,
which is bad.

To fix this for new QEMU machines that migrate the bitmap early,
preallocate the memory early, before any RAM migration. Warn with old
QEMU machines.

Getting postcopy right is a bit tricky, but we essentially now implement
the same (problematic) preallocation logic as ordinary preallocation:
preallocate memory early and discard it again before precopy starts.
During ordinary preallocation, discarding of RAM happens when postcopy
is advised. As the state (bitmap) is loaded after postcopy was advised
but before postcopy starts listening, we have to immediately discard
the memory we preallocated again ourselves.

Note that for postcopy, nothing (not even hugetlb reservations)
guarantees that backend memory (especially hugetlb pages) is still free
after it was freed once while discarding RAM. Still, allocating that
memory at least once helps catch some basic setup problems.
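To illustrate the range walk that the new virtio_mem_for_each_plugged_range()
helper performs (see the diff below), here is a minimal stand-alone sketch.
It uses a hypothetical single-word bitmap and a made-up block size; find_from()
is a simplified stand-in for QEMU's find_first_bit()/find_next_bit()/
find_next_zero_bit(), not QEMU code:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /*
     * Simplified stand-in for QEMU's bit-search helpers, scanning a single
     * 64-bit word instead of an arbitrary-length bitmap. Returns nbits
     * ("not found"), just like the QEMU helpers do.
     */
    static unsigned long find_from(uint64_t map, unsigned long nbits,
                                   unsigned long start, int want_set)
    {
        for (unsigned long i = start; i < nbits; i++) {
            int bit = (map >> i) & 1;
            if (bit == want_set) {
                return i;
            }
        }
        return nbits;
    }

    int main(void)
    {
        const uint64_t block_size = 2 * 1024 * 1024; /* example block size */
        const unsigned long bitmap_size = 16;        /* 16 tracked blocks */
        uint64_t bitmap = 0xf6;            /* blocks 1-2 and 4-7 plugged */

        /* Same walk as virtio_mem_for_each_plugged_range(): find a run of
         * set bits, handle [offset, offset + size), continue past the run. */
        unsigned long first_bit = find_from(bitmap, bitmap_size, 0, 1);
        while (first_bit < bitmap_size) {
            unsigned long last_bit = find_from(bitmap, bitmap_size,
                                               first_bit + 1, 0) - 1;
            uint64_t offset = first_bit * block_size;
            uint64_t size = (last_bit - first_bit + 1) * block_size;

            printf("plugged range: offset=0x%" PRIx64 " size=0x%" PRIx64 "\n",
                   offset, size);

            first_bit = find_from(bitmap, bitmap_size, last_bit + 2, 1);
        }
        return 0;
    }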
Before this change, trying to restore a VM when insufficient hugetlb
pages are around results in the process crashing due to a "Bus error"
(SIGBUS). With this change, QEMU fails gracefully:

    qemu-system-x86_64: qemu_prealloc_mem: preallocating memory failed: Bad address
    qemu-system-x86_64: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-mem-device-early'
    qemu-system-x86_64: load of migration failed: Cannot allocate memory

And we can even introspect the early migration data, including the
bitmap:

    $ ./scripts/analyze-migration.py -f STATEFILE
    {
    "ram (2)": {
        "section sizes": {
            "0000:00:03.0/mem0": "0x0000000780000000",
            "0000:00:04.0/mem1": "0x0000000780000000",
            "pc.ram": "0x0000000100000000",
            "/rom@etc/acpi/tables": "0x0000000000020000",
            "pc.bios": "0x0000000000040000",
            "0000:00:02.0/e1000.rom": "0x0000000000040000",
            "pc.rom": "0x0000000000020000",
            "/rom@etc/table-loader": "0x0000000000001000",
            "/rom@etc/acpi/rsdp": "0x0000000000001000"
        }
    },
    "0000:00:03.0/virtio-mem-device-early (51)": {
        "tmp": "00 00 00 01 40 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
        "size": "0x0000000040000000",
        "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
    },
    "0000:00:04.0/virtio-mem-device-early (53)": {
        "tmp": "00 00 00 08 c0 00 00 00 00 00 00 07 80 00 00 00 00 00 00 00 00 20 00 00 00 00 00 00",
        "size": "0x00000001fa400000",
        "bitmap": "ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff [...]
    },
    [...]

Reported-by: Jing Qi
Reviewed-by: Dr. David Alan Gilbert
Signed-off-by: David Hildenbrand
Reviewed-by: Juan Quintela
---
 hw/virtio/virtio-mem.c | 87 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 87 insertions(+)

diff --git a/hw/virtio/virtio-mem.c b/hw/virtio/virtio-mem.c
index ca37949df8..957fe77dc0 100644
--- a/hw/virtio/virtio-mem.c
+++ b/hw/virtio/virtio-mem.c
@@ -204,6 +204,30 @@ static int virtio_mem_for_each_unplugged_range(const VirtIOMEM *vmem, void *arg,
     return ret;
 }
 
+static int virtio_mem_for_each_plugged_range(const VirtIOMEM *vmem, void *arg,
+                                             virtio_mem_range_cb cb)
+{
+    unsigned long first_bit, last_bit;
+    uint64_t offset, size;
+    int ret = 0;
+
+    first_bit = find_first_bit(vmem->bitmap, vmem->bitmap_size);
+    while (first_bit < vmem->bitmap_size) {
+        offset = first_bit * vmem->block_size;
+        last_bit = find_next_zero_bit(vmem->bitmap, vmem->bitmap_size,
+                                      first_bit + 1) - 1;
+        size = (last_bit - first_bit + 1) * vmem->block_size;
+
+        ret = cb(vmem, arg, offset, size);
+        if (ret) {
+            break;
+        }
+        first_bit = find_next_bit(vmem->bitmap, vmem->bitmap_size,
+                                  last_bit + 2);
+    }
+    return ret;
+}
+
 /*
  * Adjust the memory section to cover the intersection with the given range.
  *
@@ -938,6 +962,10 @@ static int virtio_mem_post_load(void *opaque, int version_id)
     RamDiscardListener *rdl;
     int ret;
 
+    if (vmem->prealloc && !vmem->early_migration) {
+        warn_report("Proper preallocation with migration requires a newer QEMU machine");
+    }
+
     /*
      * We started out with all memory discarded and our memory region is mapped
      * into an address space. Replay, now that we updated the bitmap.
@@ -957,6 +985,64 @@ static int virtio_mem_post_load(void *opaque, int version_id)
     return virtio_mem_restore_unplugged(vmem);
 }
 
+static int virtio_mem_prealloc_range_cb(const VirtIOMEM *vmem, void *arg,
+                                        uint64_t offset, uint64_t size)
+{
+    void *area = memory_region_get_ram_ptr(&vmem->memdev->mr) + offset;
+    int fd = memory_region_get_fd(&vmem->memdev->mr);
+    Error *local_err = NULL;
+
+    qemu_prealloc_mem(fd, area, size, 1, NULL, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return -ENOMEM;
+    }
+    return 0;
+}
+
+static int virtio_mem_post_load_early(void *opaque, int version_id)
+{
+    VirtIOMEM *vmem = VIRTIO_MEM(opaque);
+    RAMBlock *rb = vmem->memdev->mr.ram_block;
+    int ret;
+
+    if (!vmem->prealloc) {
+        return 0;
+    }
+
+    /*
+     * We restored the bitmap and verified that the basic properties
+     * match on source and destination, so we can go ahead and preallocate
+     * memory for all plugged memory blocks, before actual RAM migration starts
+     * touching this memory.
+     */
+    ret = virtio_mem_for_each_plugged_range(vmem, NULL,
+                                            virtio_mem_prealloc_range_cb);
+    if (ret) {
+        return ret;
+    }
+
+    /*
+     * This is tricky: postcopy wants to start with a clean slate. On
+     * POSTCOPY_INCOMING_ADVISE, postcopy code discards all (ordinarily
+     * preallocated) RAM such that postcopy will work as expected later.
+     *
+     * However, we run after POSTCOPY_INCOMING_ADVISE -- but before actual
+     * RAM migration. So let's discard all memory again. This looks like an
+     * expensive NOP, but actually serves a purpose: we made sure that we
+     * were able to allocate all required backend memory once. We cannot
+     * guarantee that the backend memory we will free will remain free
+     * until we need it during postcopy, but at least we can catch the
+     * obvious setup issues this way.
+     */
+    if (migration_incoming_postcopy_advised()) {
+        if (ram_block_discard_range(rb, 0, qemu_ram_get_used_length(rb))) {
+            return -EBUSY;
+        }
+    }
+    return 0;
+}
+
 typedef struct VirtIOMEMMigSanityChecks {
     VirtIOMEM *parent;
     uint64_t addr;
@@ -1068,6 +1154,7 @@ static const VMStateDescription vmstate_virtio_mem_device_early = {
     .minimum_version_id = 1,
     .version_id = 1,
     .early_setup = true,
+    .post_load = virtio_mem_post_load_early,
     .fields = (VMStateField[]) {
         VMSTATE_WITH_TMP(VirtIOMEM, VirtIOMEMMigSanityChecks,
                          vmstate_virtio_mem_sanity_checks),
-- 
2.39.0
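
As background on the graceful failure shown above: qemu_prealloc_mem() can
populate memory through mechanisms that return an error instead of raising
SIGBUS; MADV_POPULATE_WRITE (Linux 5.14+) is one such mechanism. The sketch
below is a minimal stand-alone illustration of the idea under that
assumption, with prealloc_populate() as a hypothetical simplified analogue,
not QEMU's actual implementation:

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    #ifndef MADV_POPULATE_WRITE
    #define MADV_POPULATE_WRITE 23   /* available since Linux 5.14 */
    #endif

    /*
     * Populate (preallocate) backing memory for [area, area + size) up
     * front. Unlike touching each page by hand, MADV_POPULATE_WRITE
     * reports failures (such as running out of hugetlb pages) via an
     * error return instead of a SIGBUS on first access.
     */
    static int prealloc_populate(void *area, size_t size)
    {
        if (madvise(area, size, MADV_POPULATE_WRITE)) {
            fprintf(stderr, "preallocating memory failed: %s\n",
                    strerror(errno));
            return -ENOMEM;
        }
        return 0;
    }

    int main(void)
    {
        size_t size = 16 * 1024 * 1024;  /* 16 MiB anonymous mapping */
        void *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (area == MAP_FAILED) {
            return 1;
        }
        return prealloc_populate(area, size) ? 1 : 0;
    }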