From: Mina Almasry <almasrymina@google.com>
Date: Thu, 27 May 2021 17:50:29 -0700
Message-Id: <20210528005029.88088-1-almasrymina@google.com>
Subject: [PATCH v4] mm, hugetlb: fix racy resv_huge_pages underflow on UFFDIO_COPY
Cc: Mina Almasry, Axel Rasmussen, Peter Xu, linux-mm@kvack.org, Mike Kravetz,
    Andrew Morton, linux-kernel@vger.kernel.org

On UFFDIO_COPY, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we drop the mutex and return to the caller after
having allocated a page that consumed a reservation. In that window a
fault can come in and double-consume the reservation.

To handle this, we free the allocated page, fix the reservations, and
allocate a temporary hugetlb page to return to the caller instead. When
the caller retries the copy outside the lock, we again check the page
cache, allocate a page that consumes the reservation, and copy the
contents over.

Test: hacked the code locally such that resv_huge_pages underflows
produce a warning and copy_huge_page_from_user() always fails, then:

  ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
  ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the tests run, the
number of free/resv hugepages is correct.
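For reference, the kernel path being fixed is driven from user space by a
plain UFFDIO_COPY ioctl on a userfaultfd-registered hugetlb range. Below is
a minimal, hypothetical sketch of that call (it assumes a 2 MiB default huge
page size and pre-reserved hugepages; the HPAGE_SIZE constant is illustrative
and this is not the selftest invoked above):

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumed default huge page size */

int main(void)
{
	int uffd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
	if (uffd < 0) { perror("userfaultfd"); return 1; }

	struct uffdio_api api = { .api = UFFD_API };
	if (ioctl(uffd, UFFDIO_API, &api)) { perror("UFFDIO_API"); return 1; }

	/* Destination: anonymous hugetlb mapping, left unpopulated. */
	char *dst = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (dst == MAP_FAILED) { perror("mmap dst"); return 1; }

	struct uffdio_register reg = {
		.range = { .start = (unsigned long)dst, .len = HPAGE_SIZE },
		.mode  = UFFDIO_REGISTER_MODE_MISSING,
	};
	if (ioctl(uffd, UFFDIO_REGISTER, &reg)) { perror("UFFDIO_REGISTER"); return 1; }

	/* Source buffer the kernel copies from; hugetlb requires the whole
	 * huge page to be filled by a single UFFDIO_COPY call. */
	char *src = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (src == MAP_FAILED) { perror("mmap src"); return 1; }
	memset(src, 0xab, HPAGE_SIZE);

	struct uffdio_copy copy = {
		.dst = (unsigned long)dst,
		.src = (unsigned long)src,
		.len = HPAGE_SIZE,
		.mode = 0,
	};
	/* If the kernel cannot copy src while holding hugetlb_fault_mutex,
	 * it retries internally after copying outside the lock; that retry
	 * window is where the reservation underflow fixed here occurs. */
	if (ioctl(uffd, UFFDIO_COPY, &copy) && copy.copy != HPAGE_SIZE) {
		perror("UFFDIO_COPY");
		return 1;
	}

	printf("first byte after copy: 0x%02x\n", (unsigned char)dst[0]);
	return 0;
}

Note that the -ENOENT fallback described above happens entirely inside the
kernel; user space only sees UFFDIO_COPY take the slower path.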
Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen
Cc: Peter Xu
Cc: linux-mm@kvack.org
Cc: Mike Kravetz
Cc: Andrew Morton
Cc: linux-kernel@vger.kernel.org
---
 include/linux/migrate.h |  4 ++++
 mm/hugetlb.c            | 48 +++++++++++++++++++++++++++++++++--------
 mm/migrate.c            |  4 ++--
 mm/userfaultfd.c        | 48 +----------------------------------------
 4 files changed, 46 insertions(+), 58 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4bb4e519e3f5..4164c9ddd86e 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -51,6 +51,7 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 				  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+extern void migrate_copy_huge_page(struct page *dst, struct page *src);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -77,6 +78,9 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return -ENOSYS;
 }
 
+static inline void migrate_copy_huge_page(struct page *dst, struct page *src)
+{
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 76e2a6efc165..6072c9f82794 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include <linux/migrate.h>
 #include
 #include
 
@@ -4905,20 +4906,17 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
 
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
 	if (is_continue) {
 		ret = -EFAULT;
 		page = find_lock_page(mapping, idx);
@@ -4947,12 +4945,44 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			/* Free the allocated page which may have
+			 * consumed a reservation.
+			 */
+			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			put_page(page);
+
+			/* Allocate a temporary page to hold the copied
+			 * contents.
+			 */
+			page = alloc_huge_page_vma(h, dst_vma, dst_addr);
+			if (IS_ERR(page)) {
+				ret = -ENOMEM;
+				goto out;
+			}
 			*pagep = page;
-			/* don't free the page */
+			/* Set the outparam pagep and return to the caller to
+			 * copy the contents outside the lock. Don't free the
+			 * page.
+			 */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (vm_shared &&
+		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			put_page(*pagep);
+			ret = -EEXIST;
+			*pagep = NULL;
+			goto out;
+		}
+
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page)) {
+			ret = -ENOMEM;
+			*pagep = NULL;
+			goto out;
+		}
+		migrate_copy_huge_page(page, *pagep);
+		put_page(*pagep);
 		*pagep = NULL;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index b234c3f3acb7..3bfe1f7d127d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -550,7 +550,7 @@ static void __copy_gigantic_page(struct page *dst, struct page *src,
 	}
 }
 
-static void copy_huge_page(struct page *dst, struct page *src)
+void migrate_copy_huge_page(struct page *dst, struct page *src)
 {
 	int i;
 	int nr_pages;
@@ -652,7 +652,7 @@ EXPORT_SYMBOL(migrate_page_states);
 void migrate_page_copy(struct page *newpage, struct page *page)
 {
 	if (PageHuge(page) || PageTransHuge(page))
-		copy_huge_page(newpage, page);
+		migrate_copy_huge_page(newpage, page);
 	else
 		copy_highpage(newpage, page);
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 2d6a3a36f6ce..e13a0492b7ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -346,54 +346,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	mmap_read_unlock(dst_mm);
 out:
-	if (page) {
-		/*
-		 * We encountered an error and are about to free a newly
-		 * allocated huge page.
-		 *
-		 * Reservation handling is very subtle, and is different for
-		 * private and shared mappings. See the routine
-		 * restore_reserve_on_error for details. Unfortunately, we
-		 * can not call restore_reserve_on_error now as it would
-		 * require holding mmap_lock.
-		 *
-		 * If a reservation for the page existed in the reservation
-		 * map of a private mapping, the map was modified to indicate
-		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreRsvCnt flag now so that the global
-		 * reserve count will not be incremented in free_huge_page.
-		 * The reservation map will still indicate the reservation
-		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation.  If no
-		 * reservation existed, it is still safe to clear
-		 * HPageRestoreRsvCnt as no adjustments to reservation counts
-		 * were made during allocation.
-		 *
-		 * The reservation map for shared mappings indicates which
-		 * pages have reservations.  When a huge page is allocated
-		 * for an address with a reservation, no change is made to
-		 * the reserve map.  In this case HPageRestoreRsvCnt will be
-		 * set to indicate that the global reservation count should be
-		 * incremented when the page is freed.  This is the desired
-		 * behavior.  However, when a huge page is allocated for an
-		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreRsvCnt will not be
-		 * set.  When the page is freed, the global reserve count will
-		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page.  In this case, set HPageRestoreRsvCnt
-		 * so that the global reserve count will be incremented to
-		 * match the reservation map entry which was created.
-		 *
-		 * Note that vm_alloc_shared is based on the flags of the vma
-		 * for which the page was originally allocated.  dst_vma could
-		 * be different or NULL on error.
-		 */
-		if (vm_alloc_shared)
-			SetHPageRestoreRsvCnt(page);
-		else
-			ClearHPageRestoreRsvCnt(page);
+	if (page)
 		put_page(page);
-	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);
-- 
2.32.0.rc0.204.g9fa02ecfa5-goog