[PATCH] coredump: Limit what can interrupt coredumps

Eric W. Biederman posted 1 patch 1 week, 3 days ago
Failed in applying to current master (apply log)
fs/coredump.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

[PATCH] coredump: Limit what can interrupt coredumps

Posted by Eric W. Biederman 1 week, 3 days ago

Olivier Langlois has been struggling with coredumps being incompletely written in
processes using io_uring.

Olivier Langlois <olivier@trillion01.com> writes:
> io_uring is a big user of task_work and any event that io_uring made a
> task waiting for that occurs during the core dump generation will
> generate a TIF_NOTIFY_SIGNAL.
>
> Here are the detailed steps of the problem:
> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>    with io_async_wake() as the wakeup function cb from io_arm_poll_handler()
> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>    set_notify_signal()

The coredump code deliberately supports being interrupted by SIGKILL,
and depends upon prepare_signal to filter out all other signals.   Now
that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
in dump_emitted by the coredump code no longer works.

Make the coredump code more robust by explicitly testing for all of
the wakeup conditions the coredump code supports.  This prevents
new wakeup conditions from breaking the coredump code, as well
as fixing the current issue.

The filesystem code that the coredump code uses already limits
itself to only aborting on fatal_signal_pending.  So it should
not develop surprising wake-up reasons either.

v2: Don't remove the now unnecessary code in prepare_signal.

Cc: stable@vger.kernel.org
Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
Reported-by: Olivier Langlois <olivier@trillion01.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
---
 fs/coredump.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/coredump.c b/fs/coredump.c
index 2868e3e171ae..c3d8fc14b993 100644
--- a/fs/coredump.c
+++ b/fs/coredump.c
@@ -519,7 +519,7 @@ static bool dump_interrupted(void)
 	 * but then we need to teach dump_write() to restart and clear
 	 * TIF_SIGPENDING.
 	 */
-	return signal_pending(current);
+	return fatal_signal_pending(current) || freezing(current);
 }
 
 static void wait_for_dump_helpers(struct file *file)
-- 
2.20.1

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Oleg Nesterov 1 week ago
Eric, et al, sorry for delay, I didn't read emails several days.

On 06/10, Eric W. Biederman wrote:
>
> v2: Don't remove the now unnecessary code in prepare_signal.

No, that code is still needed. Otherwise any fatal signal will be
turned into SIGKILL.

> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>  	 * but then we need to teach dump_write() to restart and clear
>  	 * TIF_SIGPENDING.
>  	 */
> -	return signal_pending(current);
> +	return fatal_signal_pending(current) || freezing(current);
>  }


Well yes, this is what the comment says.

But note that there is another reason why dump_interrupted() returns true
if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
perhaps something else...

That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
dumping threads?

Or perhaps we should change __dump_emit() to clear signal_pending() and
restart __kernel_write() if it fails or returns a short write.

Otherwise the change above doesn't look like a full fix to me.

Oleg.

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Eric W. Biederman 5 days, 21 hours ago
Oleg Nesterov <oleg@redhat.com> writes:

>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...
>
> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?
>
> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.
>
> Otherwise the change above doesn't look like a full fix to me.

Agreed.  The coredump to a pipe will still be short.  That needs
something additional.

The problem Olivier Langlois <olivier@trillion01.com> reported was
core dumps coming up short because TIF_NOTIFY_SIGNAL was being
set during a core dump.

We can see this with pipe_write returning -ERESTARTSYS
on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
is true.

Looking further if the thread that is core dumping initiated
any io_uring work then io_ring_exit_work will use task_work_add
to request that thread clean up it's io_uring state.

Perhaps we can put a big comment in dump_emit and if we
get back -ERESTARTSYS run tracework_notify_signal.  I am not
seeing any locks held at that point in the coredump, so it
should be safe.  The coredump is run inside of file_start_write
which is the only potential complication.



The code flow is complicated but it looks like the entire
point of the exercise is to call io_uring_del_task_file
on the originating thread.  I suppose that keeps the
locking of the xarray in io_uring_task simple.


Hmm.   All of this comes from io_uring_release.
How do we get to io_uring_release?  The coredump should
be catching everything in exit_mm before exit_files?

Confused and hopeful someone can explain to me what is going on,
and perhaps simplify it.

Eric

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Olivier Langlois 5 days ago
On Tue, 2021-06-15 at 17:08 -0500, Eric W. Biederman wrote:
> Oleg Nesterov <oleg@redhat.com> writes:
> 
> > > --- a/fs/coredump.c
> > > +++ b/fs/coredump.c
> > > @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
> > >          * but then we need to teach dump_write() to restart and
> > > clear
> > >          * TIF_SIGPENDING.
> > >          */
> > > -       return signal_pending(current);
> > > +       return fatal_signal_pending(current) ||
> > > freezing(current);
> > >  }
> > 
> > 
> > Well yes, this is what the comment says.
> > 
> > But note that there is another reason why dump_interrupted()
> > returns true
> > if signal_pending(), it assumes thagt __dump_emit()-
> > >__kernel_write() may
> > fail anyway if signal_pending() is true. Say, pipe_write(), or iirc
> > nfs,
> > perhaps something else...
> > 
> > That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should
> > clear
> > TIF_NOTIFY_SIGNAL as well and we should change io-uring to not
> > abuse the
> > dumping threads?
> > 
> > Or perhaps we should change __dump_emit() to clear signal_pending()
> > and
> > restart __kernel_write() if it fails or returns a short write.
> > 
> > Otherwise the change above doesn't look like a full fix to me.
> 
> Agreed.  The coredump to a pipe will still be short.  That needs
> something additional.
> 
> The problem Olivier Langlois <olivier@trillion01.com> reported was
> core dumps coming up short because TIF_NOTIFY_SIGNAL was being
> set during a core dump.
> 
> We can see this with pipe_write returning -ERESTARTSYS
> on a full pipe if signal_pending which includes TIF_NOTIFY_SIGNAL
> is true.
> 
Eric,

I redid my test but this time instead of dumping directly into a file,
I did let the coredump be piped to the systemd coredump module and the
coredump generation isn't working as expected when piping.

So your code review conclusions are correct.


Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Eric W. Biederman 4 days, 23 hours ago
Olivier Langlois <olivier@trillion01.com> writes:

> I redid my test but this time instead of dumping directly into a file,
> I did let the coredump be piped to the systemd coredump module and the
> coredump generation isn't working as expected when piping.
>
> So your code review conclusions are correct.

Thank you for confirming that.

Do you know how your test program is using io_uring?

I have been trying to put the pieces together on what io_uring is doing
that stops the coredump.  The fact that it takes a little while before
it kills the coredump is a little puzzling.  The code looks like all of
the io_uring operations should have been canceled before the coredump
starts.

Thanks,
Eric

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Olivier Langlois 2 days, 23 hours ago
On Wed, 2021-06-16 at 15:00 -0500, Eric W. Biederman wrote:
> Olivier Langlois <olivier@trillion01.com> writes:
> 
> > I redid my test but this time instead of dumping directly into a
> > file,
> > I did let the coredump be piped to the systemd coredump module and
> > the
> > coredump generation isn't working as expected when piping.
> > 
> > So your code review conclusions are correct.
> 
> Thank you for confirming that.
> 
> Do you know how your test program is using io_uring?
> 
> I have been trying to put the pieces together on what io_uring is
> doing
> that stops the coredump.  The fact that it takes a little while
> before
> it kills the coredump is a little puzzling.  The code looks like all
> of
> the io_uring operations should have been canceled before the coredump
> starts.
> 
> 
With a very simple setup, I guess that this could easily be
reproducible. Make a TCP connection with a server that is streaming
non-stop data and enter a loop where you keep initiating async
OP_IOURING_READ operations on your TCP fd.

Once you have that, manually sending a SIG_SEGV is a sure fire way to
stumble into the problem. This is how I am testing the patches.

IRL, it is possible to call io_uring_enter() to submit operations and
return from the syscall without waiting on all events to have
completed. Once the process is back in userspace, if it stumble into a
bug that triggers a coredump, any remaining pending I/O operations can
set TIF_SIGNAL_NOTIFY while the coredump is generated.

I have read the part of your previous email where you share the result
of your ongoing investigation. I didn't comment as the definitive
references in io_uring matters are Jens and Pavel but I am going to
share my opinion on the matter.

I think that you did put the finger on the code cleaning up the
io_uring instance in regards to pending operations. It seems to be
io_uring_release() which is probably called from exit_files() which
happens to be after the call to exit_mm().

At first, I did entertain the idea of considering if it could be
possible to duplicate some of the operations performed by
io_uring_release() related to the infamous TIF_SIGNAL_NOTIFY setting
into io_uring_files_cancel() which is called before exit_mm().

but the idea is useless as it is not the other threads of the group
that are causing the TIF_SIGNAL_NOTIFY problem. It is the thread
calling do_coredump() which is done by the signal handing code even
before that thread enters do_exit() and start to be cleaned up. That
thread when it enters do_coredump() is still fully loaded and
operational in terms of io_uring functionality.

I guess that this io_uring cancel all pending operations hook would
have to be called from do_coredump or from get_signal() but if it is
the way to go, I feel that this is a change major enough that wouldn't
dare going there without the blessing of the maintainers in cause....


Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Eric W. Biederman 1 week ago
Oleg Nesterov <oleg@redhat.com> writes:

> Eric, et al, sorry for delay, I didn't read emails several days.
>
> On 06/10, Eric W. Biederman wrote:
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>
> No, that code is still needed. Otherwise any fatal signal will be
> turned into SIGKILL.
>
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>  	 * but then we need to teach dump_write() to restart and clear
>>  	 * TIF_SIGPENDING.
>>  	 */
>> -	return signal_pending(current);
>> +	return fatal_signal_pending(current) || freezing(current);
>>  }
>
>
> Well yes, this is what the comment says.
>
> But note that there is another reason why dump_interrupted() returns true
> if signal_pending(), it assumes thagt __dump_emit()->__kernel_write() may
> fail anyway if signal_pending() is true. Say, pipe_write(), or iirc nfs,
> perhaps something else...

The pipe_write case is a good case to consider.  In general filesystems
are only allowed to stop if fatal_signal_pending.  It is an ancient
unix/posix thing.

> That is why zap_threads() clears TIF_SIGPENDING. Perhaps it should clear
> TIF_NOTIFY_SIGNAL as well and we should change io-uring to not abuse the
> dumping threads?

I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.
I don't know what it has to do with signals.

> Or perhaps we should change __dump_emit() to clear signal_pending() and
> restart __kernel_write() if it fails or returns a short write.

Given that the code needs to handle pipes that seems a reasonable thing
to do.  Note.  All of the blocking of new signals in prepare_signal
is still in place.  So only a SIGKILL can come in set TIF_SIGPENDING.

So I would say that the current fix is correct (and safe to backport).
But still requires some magic in prepare_signal until we do more.

I don't understand the logic with well enough of adding work to
non-io_uring threads with task_work_add to understand why that happens
in the first place.

There are a lot of silly corners here.  Yes please let's keep on
digging.

Eric

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Oleg Nesterov 1 week ago
On 06/14, Eric W. Biederman wrote:
>
> I would very much like some clarity on TIF_NOTIFY_SIGNAL.  At the very
> least it would be nice if it could get renamed TIF_NOTIFY_TASK_WORK.

No, no, no ;)

I think that, for example, freezer should be changed to use
set_notify_signal() rather than fake_signal_wake_up(). Livepatch.
And probably it will have more users.

> I don't understand the logic with well enough of adding work to
> non-io_uring threads with task_work_add to understand why that happens
> in the first place.

Same here.

Oleg.

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Olivier Langlois 1 week, 2 days ago
On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
> 
> Olivier Langlois has been struggling with coredumps being incompletely
> written in
> processes using io_uring.
> 
> Olivier Langlois <olivier@trillion01.com> writes:
> > io_uring is a big user of task_work and any event that io_uring made
> > a
> > task waiting for that occurs during the core dump generation will
> > generate a TIF_NOTIFY_SIGNAL.
> > 
> > Here are the detailed steps of the problem:
> > 1. io_uring calls vfs_poll() to install a task to a file wait queue
> >    with io_async_wake() as the wakeup function cb from
> > io_arm_poll_handler()
> > 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
> > 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
> >    set_notify_signal()
> 
> The coredump code deliberately supports being interrupted by SIGKILL,
> and depends upon prepare_signal to filter out all other signals.   Now
> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
> in dump_emitted by the coredump code no longer works.
> 
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.
> 
> The filesystem code that the coredump code uses already limits
> itself to only aborting on fatal_signal_pending.  So it should
> not develop surprising wake-up reasons either.
> 
> v2: Don't remove the now unnecessary code in prepare_signal.
> 
> Cc: stable@vger.kernel.org
> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
> Reported-by: Olivier Langlois <olivier@trillion01.com>
> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
> ---
>  fs/coredump.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/coredump.c b/fs/coredump.c
> index 2868e3e171ae..c3d8fc14b993 100644
> --- a/fs/coredump.c
> +++ b/fs/coredump.c
> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>          * but then we need to teach dump_write() to restart and clear
>          * TIF_SIGPENDING.
>          */
> -       return signal_pending(current);
> +       return fatal_signal_pending(current) || freezing(current);
>  }
>  
>  static void wait_for_dump_helpers(struct file *file)

Tested-by: Olivier Langlois <olivier@trillion01.com>


Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Jens Axboe 1 week, 2 days ago
On 6/12/21 8:36 AM, Olivier Langlois wrote:
> On Thu, 2021-06-10 at 15:11 -0500, Eric W. Biederman wrote:
>>
>> Olivier Langlois has been struggling with coredumps being incompletely
>> written in
>> processes using io_uring.
>>
>> Olivier Langlois <olivier@trillion01.com> writes:
>>> io_uring is a big user of task_work and any event that io_uring made
>>> a
>>> task waiting for that occurs during the core dump generation will
>>> generate a TIF_NOTIFY_SIGNAL.
>>>
>>> Here are the detailed steps of the problem:
>>> 1. io_uring calls vfs_poll() to install a task to a file wait queue
>>>    with io_async_wake() as the wakeup function cb from
>>> io_arm_poll_handler()
>>> 2. wakeup function ends up calling task_work_add() with TWA_SIGNAL
>>> 3. task_work_add() sets the TIF_NOTIFY_SIGNAL bit by calling
>>>    set_notify_signal()
>>
>> The coredump code deliberately supports being interrupted by SIGKILL,
>> and depends upon prepare_signal to filter out all other signals.   Now
>> that signal_pending includes wake ups for TIF_NOTIFY_SIGNAL this hack
>> in dump_emitted by the coredump code no longer works.
>>
>> Make the coredump code more robust by explicitly testing for all of
>> the wakeup conditions the coredump code supports.  This prevents
>> new wakeup conditions from breaking the coredump code, as well
>> as fixing the current issue.
>>
>> The filesystem code that the coredump code uses already limits
>> itself to only aborting on fatal_signal_pending.  So it should
>> not develop surprising wake-up reasons either.
>>
>> v2: Don't remove the now unnecessary code in prepare_signal.
>>
>> Cc: stable@vger.kernel.org
>> Fixes: 12db8b690010 ("entry: Add support for TIF_NOTIFY_SIGNAL")
>> Reported-by: Olivier Langlois <olivier@trillion01.com>
>> Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
>> ---
>>  fs/coredump.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/coredump.c b/fs/coredump.c
>> index 2868e3e171ae..c3d8fc14b993 100644
>> --- a/fs/coredump.c
>> +++ b/fs/coredump.c
>> @@ -519,7 +519,7 @@ static bool dump_interrupted(void)
>>          * but then we need to teach dump_write() to restart and clear
>>          * TIF_SIGPENDING.
>>          */
>> -       return signal_pending(current);
>> +       return fatal_signal_pending(current) || freezing(current);
>>  }
>>  
>>  static void wait_for_dump_helpers(struct file *file)
> 
> Tested-by: Olivier Langlois <olivier@trillion01.com>

Thanks Olivier and Eric for taking care of this. I've been mostly
offline for more than a week, back at it next week.

-- 
Jens Axboe

Re: [PATCH] coredump: Limit what can interrupt coredumps

Posted by Linus Torvalds 1 week, 3 days ago
On Thu, Jun 10, 2021 at 1:11 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
>
> Make the coredump code more robust by explicitly testing for all of
> the wakeup conditions the coredump code supports.  This prevents
> new wakeup conditions from breaking the coredump code, as well
> as fixing the current issue.

Thanks, applied.

And lots of thanks to Olivier who kept debugging the odd coredump
behavior he saw.

            Linus