[libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination

Posted by John Ferlan 7 years, 6 months ago
From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>

The problem is the incorrect order of qemu driver shutdown and the
shutdown of the netserver threads that serve client requests (which go,
in particular, through the qemu driver).

Net server threads are shut down upon dispose, which is triggered by
the last daemon object unref at the end of the main function. At the
same time, the qemu driver is shut down earlier, in virStateCleanup. As
a result, netserver threads can see an invalid driver object in the
middle of request processing.

Let's move shutting down the netserver threads earlier, into virNetDaemonClose.

Note: the order of the last daemon unref and virStateCleanup was
introduced in 85c3a182 for a valid reason.

Signed-off-by: John Ferlan <jferlan@redhat.com>
---
 src/rpc/virnetdaemon.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
index 8c21414897..33bd8e3b06 100644
--- a/src/rpc/virnetdaemon.c
+++ b/src/rpc/virnetdaemon.c
@@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
     virObjectLock(dmn);
 
     virHashForEach(dmn->servers, daemonServerClose, NULL);
+    virHashRemoveAll(dmn->servers);
 
     virObjectUnlock(dmn);
 }
-- 
2.13.6

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Martin Kletzander 7 years, 6 months ago
On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>
>The problem is incorrect order of qemu driver shutdown and shutdown
>of netserver threads that serve client requests (thru qemu driver
>particularly).
>
>Net server threads are shutdown upon dispose which is triggered
>by last daemon object unref at the end of main function. At the same
>time qemu driver is shutdown earlier in virStateCleanup. As a result
>netserver threads see invalid driver object in the middle of request
>processing.
>
>Let's move shutting down netserver threads earlier to virNetDaemonClose.
>
>Note: order of last daemon unref and virStateCleanup
>is introduced in 85c3a182 for a valid reason.
>

I must say I don't believe that's true.  Reading it back, that patch is
wrong IMHO.  I haven't gone through all of its details and I don't
remember them from when I rewrote part of it, but the way this should
be handled is different.

The way you can clearly identify such issues is when you see one
thread determining the validity of data (a pointer in this case).  That
must never happen.  It means the pointer is used from more places than
the object has references, whereas each holder of a pointer to such an
object should have its own reference.

So the problem is probably somewhere else.

>Signed-off-by: John Ferlan <jferlan@redhat.com>
>---
> src/rpc/virnetdaemon.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>index 8c21414897..33bd8e3b06 100644
>--- a/src/rpc/virnetdaemon.c
>+++ b/src/rpc/virnetdaemon.c
>@@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>     virObjectLock(dmn);
>
>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>+    virHashRemoveAll(dmn->servers);
>

If I get this correctly, you are removing the servers so that their workers get
cleaned up, but that should be done as a separate step, most probably something
daemonServerClose itself should be doing.  This is a workaround, not a proper
fix.  If that's not true, then the previous commit mentioned here should also
be fixed differently.

So this is a clear NACK from my POV.

>     virObjectUnlock(dmn);
> }
>-- 
>2.13.6
>
>--
>libvir-list mailing list
>libvir-list@redhat.com
>https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 6 months ago

On 30.10.2017 19:21, Martin Kletzander wrote:
> On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>
>> The problem is incorrect order of qemu driver shutdown and shutdown
>> of netserver threads that serve client requests (thru qemu driver
>> particularly).
>>
>> Net server threads are shutdown upon dispose which is triggered
>> by last daemon object unref at the end of main function. At the same
>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>> netserver threads see invalid driver object in the middle of request
>> processing.
>>
>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>
>> Note: order of last daemon unref and virStateCleanup
>> is introduced in 85c3a182 for a valid reason.
>>
> 
> I must say I don't believe that's true.  Reading it back, that patch is
> wrong IMHO.  I haven't gone through all the details of it and I don't
> remember them from when I rewrote part of it, but the way this should
> be handled is different.
> 
> The way you can clearly identify such issues is when you see that one
> thread determines the validity of data (pointer in this case).  This
> must never happen.  That means that the pointer is used from more places
> than how many references that object has.  However each of the pointers
> for such object should have their own reference.
> 
> So the problem is probably somewhere else.

If I understand you correctly, we can fix the issue from 85c3a182 in
a different way, e.g. by adding a reference to the daemon for every
driver that uses it for shutdown inhibition. That would require passing
an unref function to the driver init function, or a bit of wild casting
to virObjectPtr for the unref. I'm not sure it is worth it in this case.

Anyway, the initial conditions for the current patch stay the same: the
daemon object is unrefed after the state driver cleanup, so RPC
requests could still deal with invalid state driver objects.

> 
>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>> ---
>> src/rpc/virnetdaemon.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>> index 8c21414897..33bd8e3b06 100644
>> --- a/src/rpc/virnetdaemon.c
>> +++ b/src/rpc/virnetdaemon.c
>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>     virObjectLock(dmn);
>>
>>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>> +    virHashRemoveAll(dmn->servers);
>>
> 
> If I get this correctly, you are removing the servers so that their workers get
> cleaned up, but that should be done in a separate step.  Most probably what
> daemonServerClose should be doing.  This is a workaround, but not a proper fix.

Well, the original patch had a different approach to stopping RPC calls (see [1]),
close to what you suggest. I think in this case it is a matter of taste, because
there are no use cases that help us prefer one way over the other, so I'm fine
with either.

[1] https://www.redhat.com/archives/libvir-list/2017-September/msg00999.html
[2] https://www.redhat.com/archives/libvir-list/2017-October/msg00643.html   (thread discussing [1])

> If that's not true than the previous commit mentioned here should also be fixed
> differently.

Sorry I don't understand it. Can you elaborate on this point?


Nikolay
> 
> So this is a clear NACK from my POV.
> 
>>     virObjectUnlock(dmn);
>> }
>> -- 
>> 2.13.6
>>
>> -- 
>> libvir-list mailing list
>> libvir-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
> 
> 
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
> 

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 6 months ago

On 10/31/2017 02:54 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 30.10.2017 19:21, Martin Kletzander wrote:
>> On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>
>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>> of netserver threads that serve client requests (thru qemu driver
>>> particularly).
>>>
>>> Net server threads are shutdown upon dispose which is triggered
>>> by last daemon object unref at the end of main function. At the same
>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>> netserver threads see invalid driver object in the middle of request
>>> processing.
>>>
>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>
>>> Note: order of last daemon unref and virStateCleanup
>>> is introduced in 85c3a182 for a valid reason.
>>>
>>
>> I must say I don't believe that's true.  Reading it back, that patch is
>> wrong IMHO.  I haven't gone through all the details of it and I don't
>> remember them from when I rewrote part of it, but the way this should
>> be handled is different.
>>
>> The way you can clearly identify such issues is when you see that one
>> thread determines the validity of data (pointer in this case).  This
>> must never happen.  That means that the pointer is used from more places
>> than how many references that object has.  However each of the pointers
>> for such object should have their own reference.
>>
>> So the problem is probably somewhere else.
> 
> If I understand you correctly we can fix issue in 85c3a182 in 
> a differenct way. Like adding reference to daemon for every
> driver that uses it for shutdown inhibition. It will require to
> pass unref function to driver init function or a bit of wild casting
> to virObjectPtr for unref. Not sure it is worth it in this case.

caveat: I'm still in a post KVM Forum travel brain fog...

Perhaps a bit simpler than that... Since the 'inhibitCallback' and
the 'inhibitOpaque' are essentially the mechanism that passes/uses
@dmn, it'd be up to the inhibitCallback to add/remove the reference,
given that the inhibitOpaque is a lockable object anyway...

Thus, virNetDaemonAddShutdownInhibition could use virObjectRef and
virNetDaemonRemoveShutdownInhibition could use virObjectUnref... The
"catch" is that for Shutdown the Unref would need to go after the Unlock.

I believe that then would "replicate" the virObjectRef done in
daemonStateInit and the virObjectUnref done at the end of
daemonRunStateInit with respect to "some outside thread" being able to
use @dmn and not have it be Unref'd for the last time at any point in
time until the "consumer" was done with it.

Moving the Unref to after all the StateCleanup calls works because it
accomplishes the same/similar task, just holding onto @dmn longer: the
original implementation didn't properly reference @dmn when it was
being used by some other consumer/thread. Of course that led to other
problems, which this series is addressing, and I'm not yet clear on the
impact vis-a-vis your StateShutdown series.

> 
> Anyway the initical conditions for current patch stay the same - 
> daemon object will be unrefed after state drivers cleanup and 
> RPC requests could deal with invalid state driver objects.
> 
>>
>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>> ---
>>> src/rpc/virnetdaemon.c | 1 +
>>> 1 file changed, 1 insertion(+)
>>>
>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>> index 8c21414897..33bd8e3b06 100644
>>> --- a/src/rpc/virnetdaemon.c
>>> +++ b/src/rpc/virnetdaemon.c
>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>     virObjectLock(dmn);
>>>
>>>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>>> +    virHashRemoveAll(dmn->servers);
>>>
>>
>> If I get this correctly, you are removing the servers so that their workers get
>> cleaned up, but that should be done in a separate step.  Most probably what
>> daemonServerClose should be doing.  This is a workaround, but not a proper fix.

Are you sure?  daemonServerClose is the callback for virHashForEach,
which means table->iterating would be set, preventing daemonServerClose
from performing a virHashRemoveEntry of the server element from the
table.

So to me this is the right fix.

John

> 
> Well, original patch has a different approach for stopping RPC calls (see [1]) close
> to what you suggest. I think in this case it is matter of taste because there are
> no usecases which help us to prefer one way over another so I'm ok with both of them.
> 
> [1] https://www.redhat.com/archives/libvir-list/2017-September/msg00999.html
> [2] https://www.redhat.com/archives/libvir-list/2017-October/msg00643.html   (thread dicussing [1])
> 
>> If that's not true than the previous commit mentioned here should also be fixed
>> differently.
> 
> Sorry I don't understand it. Can you elaborate on this point?
> 
> 
> Nikolay
>>
>> So this is a clear NACK from my POV.
>>
>>>     virObjectUnlock(dmn);
>>> }
>>> -- 
>>> 2.13.6
>>>
>>> -- 
>>> libvir-list mailing list
>>> libvir-list@redhat.com
>>> https://www.redhat.com/mailman/listinfo/libvir-list
>>
>>
>> --
>> libvir-list mailing list
>> libvir-list@redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list
>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 6 months ago

On 01.11.2017 21:51, John Ferlan wrote:
> 
> 
> On 10/31/2017 02:54 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 30.10.2017 19:21, Martin Kletzander wrote:
>>> On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>
>>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>>> of netserver threads that serve client requests (thru qemu driver
>>>> particularly).
>>>>
>>>> Net server threads are shutdown upon dispose which is triggered
>>>> by last daemon object unref at the end of main function. At the same
>>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>>> netserver threads see invalid driver object in the middle of request
>>>> processing.
>>>>
>>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>>
>>>> Note: order of last daemon unref and virStateCleanup
>>>> is introduced in 85c3a182 for a valid reason.
>>>>
>>>
>>> I must say I don't believe that's true.  Reading it back, that patch is
>>> wrong IMHO.  I haven't gone through all the details of it and I don't
>>> remember them from when I rewrote part of it, but the way this should
>>> be handled is different.
>>>
>>> The way you can clearly identify such issues is when you see that one
>>> thread determines the validity of data (pointer in this case).  This
>>> must never happen.  That means that the pointer is used from more places
>>> than how many references that object has.  However each of the pointers
>>> for such object should have their own reference.
>>>
>>> So the problem is probably somewhere else.
>>
>> If I understand you correctly we can fix issue in 85c3a182 in 
>> a differenct way. Like adding reference to daemon for every
>> driver that uses it for shutdown inhibition. It will require to
>> pass unref function to driver init function or a bit of wild casting
>> to virObjectPtr for unref. Not sure it is worth it in this case.
> 
> caveat: I'm still in a post KVM Forum travel brain fog...
> 
> Perhaps a bit more simple than that... Since the 'inhibitCallback' and
> the 'inhibitOpaque' are essentially the mechanism that would pass/use
> @dmn, then it'd be up to the inhibitCallback to add/remove the reference
> with the *given* that the inhibitOpaque is a lockable object anyway...
> 
> Thus, virNetDaemonAddShutdownInhibition could use virObjectRef and
> virNetDaemonRemoveShutdownInhibition could use virObjectUnref... The
> "catch" is that for Shutdown the Unref would need to go after Unlock.
> 
> I believe that then would "replicate" the virObjectRef done in
> daemonStateInit and the virObjectUnref done at the end of
> daemonRunStateInit with respect to "some outside thread" being able to
> use @dmn and not have it be Unref'd for the last time at any point in
> time until the "consumer" was done with it.
> 
> Moving the Unref to after all the StateCleanup calls were done works
> because it accomplishesd the same/similar task, but just held onto @dmn
> longer because the original implementation didn't properly reference dmn
> when it was being used for some other consumer/thread. Of course that
> led to other problems which this series is addressing and I'm not clear
> yet as to the impact vis-a-vis your StateShutdown series.

Ok, 85c3a182 can be rewritten this way. It is more straightforward to
ref/unref in the consumer thread instead of relying on consumer
structure (in this case 85c3a182 relies on the fact that after
virStateCleanup no hypervisor driver will use the @dmn object).

But we still have the issues addressed by this patch series and the
state shutdown series, because the order in which @dmn and the other
objects are disposed would be the same with 85c3a182 and with the
proposed 85c3a182 replacement.

Nikolay

> 
>>
>> Anyway the initical conditions for current patch stay the same - 
>> daemon object will be unrefed after state drivers cleanup and 
>> RPC requests could deal with invalid state driver objects.
>>
>>>
>>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>>> ---
>>>> src/rpc/virnetdaemon.c | 1 +
>>>> 1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>>> index 8c21414897..33bd8e3b06 100644
>>>> --- a/src/rpc/virnetdaemon.c
>>>> +++ b/src/rpc/virnetdaemon.c
>>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>>     virObjectLock(dmn);
>>>>
>>>>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>>>> +    virHashRemoveAll(dmn->servers);
>>>>
>>>
>>> If I get this correctly, you are removing the servers so that their workers get
>>> cleaned up, but that should be done in a separate step.  Most probably what
>>> daemonServerClose should be doing.  This is a workaround, but not a proper fix.
> 
> Are you sure?  The daemonServerClose is the callback for virHashForEach
> which means the table->iterating would be set thus preventing
> daemonServerClose from performing a virHashRemoveEntry of the server
> element from the list.
> 
> So this is to me the right fix.
> 
> John
> 
>>
>> Well, original patch has a different approach for stopping RPC calls (see [1]) close
>> to what you suggest. I think in this case it is matter of taste because there are
>> no usecases which help us to prefer one way over another so I'm ok with both of them.
>>
>> [1] https://www.redhat.com/archives/libvir-list/2017-September/msg00999.html
>> [2] https://www.redhat.com/archives/libvir-list/2017-October/msg00643.html   (thread dicussing [1])
>>
>>> If that's not true than the previous commit mentioned here should also be fixed
>>> differently.
>>
>> Sorry I don't understand it. Can you elaborate on this point?
>>
>>
>> Nikolay
>>>
>>> So this is a clear NACK from my POV.
>>>
>>>>     virObjectUnlock(dmn);
>>>> }
>>>> -- 
>>>> 2.13.6
>>>>
>>>> -- 
>>>> libvir-list mailing list
>>>> libvir-list@redhat.com
>>>> https://www.redhat.com/mailman/listinfo/libvir-list
>>>
>>>
>>> --
>>> libvir-list mailing list
>>> libvir-list@redhat.com
>>> https://www.redhat.com/mailman/listinfo/libvir-list
>>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Martin Kletzander 7 years, 6 months ago
On Thu, Nov 02, 2017 at 10:49:35AM +0300, Nikolay Shirokovskiy wrote:
>
>
>On 01.11.2017 21:51, John Ferlan wrote:
>>
>>
>> On 10/31/2017 02:54 AM, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 30.10.2017 19:21, Martin Kletzander wrote:
>>>> On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>>>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>>
>>>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>>>> of netserver threads that serve client requests (thru qemu driver
>>>>> particularly).
>>>>>
>>>>> Net server threads are shutdown upon dispose which is triggered
>>>>> by last daemon object unref at the end of main function. At the same
>>>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>>>> netserver threads see invalid driver object in the middle of request
>>>>> processing.
>>>>>
>>>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>>>
>>>>> Note: order of last daemon unref and virStateCleanup
>>>>> is introduced in 85c3a182 for a valid reason.
>>>>>
>>>>
>>>> I must say I don't believe that's true.  Reading it back, that patch is
>>>> wrong IMHO.  I haven't gone through all the details of it and I don't
>>>> remember them from when I rewrote part of it, but the way this should
>>>> be handled is different.
>>>>
>>>> The way you can clearly identify such issues is when you see that one
>>>> thread determines the validity of data (pointer in this case).  This
>>>> must never happen.  That means that the pointer is used from more places
>>>> than how many references that object has.  However each of the pointers
>>>> for such object should have their own reference.
>>>>
>>>> So the problem is probably somewhere else.
>>>
>>> If I understand you correctly we can fix issue in 85c3a182 in
>>> a differenct way. Like adding reference to daemon for every
>>> driver that uses it for shutdown inhibition. It will require to
>>> pass unref function to driver init function or a bit of wild casting
>>> to virObjectPtr for unref. Not sure it is worth it in this case.
>>
>> caveat: I'm still in a post KVM Forum travel brain fog...
>>
>> Perhaps a bit more simple than that... Since the 'inhibitCallback' and
>> the 'inhibitOpaque' are essentially the mechanism that would pass/use
>> @dmn, then it'd be up to the inhibitCallback to add/remove the reference
>> with the *given* that the inhibitOpaque is a lockable object anyway...
>>
>> Thus, virNetDaemonAddShutdownInhibition could use virObjectRef and
>> virNetDaemonRemoveShutdownInhibition could use virObjectUnref... The
>> "catch" is that for Shutdown the Unref would need to go after Unlock.
>>
>> I believe that then would "replicate" the virObjectRef done in
>> daemonStateInit and the virObjectUnref done at the end of
>> daemonRunStateInit with respect to "some outside thread" being able to
>> use @dmn and not have it be Unref'd for the last time at any point in
>> time until the "consumer" was done with it.
>>
>> Moving the Unref to after all the StateCleanup calls were done works
>> because it accomplishesd the same/similar task, but just held onto @dmn
>> longer because the original implementation didn't properly reference dmn
>> when it was being used for some other consumer/thread. Of course that
>> led to other problems which this series is addressing and I'm not clear
>> yet as to the impact vis-a-vis your StateShutdown series.
>
>Ok, 85c3a182 can be rewritten this way. It is more staightforward to
>ref/unref by consumer thread instead of relying on consumer structure
>(in this case 85c3a182 relies upon the fact that after virStateCleanup
>no hypervisor driver would use @dmn object).
>

That's what reference counting is for: each consumer should have its own reference.

>But we still have the issues address by this patch series and state
>shutdown series because the order in which @dmn and other objects
>will be disposed would be same for 85c3a182 and proposed 85c3a182
>replacement.
>

Let me summarize below; I hope I'll touch on all the points.  It's hard
to follow since I'm currently reading three or more threads at the same time :)

>>>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>>>> ---
>>>>> src/rpc/virnetdaemon.c | 1 +
>>>>> 1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>>>> index 8c21414897..33bd8e3b06 100644
>>>>> --- a/src/rpc/virnetdaemon.c
>>>>> +++ b/src/rpc/virnetdaemon.c
>>>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>>>     virObjectLock(dmn);
>>>>>
>>>>>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>>>>> +    virHashRemoveAll(dmn->servers);
>>>>>
>>>>
>>>> If I get this correctly, you are removing the servers so that their workers get
>>>> cleaned up, but that should be done in a separate step.  Most probably what
>>>> daemonServerClose should be doing.  This is a workaround, but not a proper fix.
>>
>> Are you sure?  The daemonServerClose is the callback for virHashForEach
>> which means the table->iterating would be set thus preventing
>> daemonServerClose from performing a virHashRemoveEntry of the server
>> element from the list.
>>

This just makes the window of opportunity (between daemonServerClose()
and the actual removal of the virNetServerPtr from the hash) smaller.
That's why I don't see it as a fix, but rather as a workaround.

What I think should be fixed is the order of what's done with services,
servers, drivers, the daemon, workers, etc.

I read the other threads and I think we're on the same note here; it's
just that there is a lot of confusion and we're all approaching it
differently.

So first let's agree on what should be done when virNetDaemonClose() is
called and the loop in virNetDaemonRun() exits.  My opinion it is:

 1) Stop accepting new calls and services:
    virNetServerServiceToggle(false);

 2) Wait for all the workers to finish:
    First part of virThreadPoolFree()

 3) Kill off clients:
    virNetServerClientClose()

 4) Clean up drivers:
    virStateCleanup()

 5) Clean up services, servers, daemon, etc.
    including the second part of virThreadPoolFree()

The last thing is what should be done in virNetDaemonDispose(), but
unfortunately we (mis)use it for more than that.  Numbers 1-4 should,
IMO, be in virNet{Daemon,Server}Close().

Would this solve your problem?  I mean I'm not against more reference
counting, it should be done properly, but I think the above would do the
trick and also make sense from the naming point of view.

Since both the daemon and the server will probably never be consistent
after calling the Close function, it might make sense to just fold
virNetDaemonClose() into virNetDaemonDispose(), and I'm not against
that.  But the order of the steps should still be as described above,
IMO.

Let me know what you think.  And don't forget to have a nice day!

Martin

P.S.: The order of cleanups in libvirtd's main() is absolutely terrible and
      unreadable.  Also, even though I guess it goes without saying, if proper
      reference counting is in place, the order of the virObjectUnref()s and
      VIR_FREE()s should not matter.  I think it could be changed like this:

diff --git i/daemon/libvirtd.c w/daemon/libvirtd.c
index 589b32192e3d..3cc14ed20987 100644
--- i/daemon/libvirtd.c
+++ w/daemon/libvirtd.c
@@ -1500,14 +1500,15 @@ int main(int argc, char **argv) {

  cleanup:
     virNetlinkEventServiceStopAll();
-    virObjectUnref(remoteProgram);
-    virObjectUnref(lxcProgram);
-    virObjectUnref(qemuProgram);
-    virObjectUnref(adminProgram);
     virNetDaemonClose(dmn);
-    virObjectUnref(srv);
-    virObjectUnref(srvAdm);
+
+    if (driversInitialized) {
+        driversInitialized = false;
+        virStateCleanup();
+    }
+
     virNetlinkShutdown();
+
     if (statuswrite != -1) {
         if (ret != 0) {
             /* Tell parent of daemon what failed */
@@ -1521,6 +1522,14 @@ int main(int argc, char **argv) {
     if (pid_file_fd != -1)
         virPidFileReleasePath(pid_file, pid_file_fd);

+    virObjectUnref(remoteProgram);
+    virObjectUnref(lxcProgram);
+    virObjectUnref(qemuProgram);
+    virObjectUnref(adminProgram);
+    virObjectUnref(srv);
+    virObjectUnref(srvAdm);
+    virObjectUnref(dmn);
+
     VIR_FREE(sock_file);
     VIR_FREE(sock_file_ro);
     VIR_FREE(sock_file_adm);
@@ -1530,13 +1539,5 @@ int main(int argc, char **argv) {

     daemonConfigFree(config);

-    if (driversInitialized) {
-        driversInitialized = false;
-        virStateCleanup();
-    }
-    /* Now that the hypervisor shutdown inhibition functions that use
-     * 'dmn' as a parameter are done, we can finally unref 'dmn' */
-    virObjectUnref(dmn);
-
     return ret;
 }
--
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 6 months ago

On 02.11.2017 19:32, Martin Kletzander wrote:
> On Thu, Nov 02, 2017 at 10:49:35AM +0300, Nikolay Shirokovskiy wrote:
>>
>>
>> On 01.11.2017 21:51, John Ferlan wrote:
>>>
>>>
>>> On 10/31/2017 02:54 AM, Nikolay Shirokovskiy wrote:
>>>>
>>>>
>>>> On 30.10.2017 19:21, Martin Kletzander wrote:
>>>>> On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
>>>>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>>>
>>>>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>>>>> of netserver threads that serve client requests (thru qemu driver
>>>>>> particularly).
>>>>>>
>>>>>> Net server threads are shutdown upon dispose which is triggered
>>>>>> by last daemon object unref at the end of main function. At the same
>>>>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>>>>> netserver threads see invalid driver object in the middle of request
>>>>>> processing.
>>>>>>
>>>>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>>>>
>>>>>> Note: order of last daemon unref and virStateCleanup
>>>>>> is introduced in 85c3a182 for a valid reason.
>>>>>>
>>>>>
>>>>> I must say I don't believe that's true.  Reading it back, that patch is
>>>>> wrong IMHO.  I haven't gone through all the details of it and I don't
>>>>> remember them from when I rewrote part of it, but the way this should
>>>>> be handled is different.
>>>>>
>>>>> The way you can clearly identify such issues is when you see that one
>>>>> thread determines the validity of data (pointer in this case).  This
>>>>> must never happen.  That means that the pointer is used from more places
>>>>> than how many references that object has.  However each of the pointers
>>>>> for such object should have their own reference.
>>>>>
>>>>> So the problem is probably somewhere else.
>>>>
>>>> If I understand you correctly we can fix issue in 85c3a182 in
>>>> a differenct way. Like adding reference to daemon for every
>>>> driver that uses it for shutdown inhibition. It will require to
>>>> pass unref function to driver init function or a bit of wild casting
>>>> to virObjectPtr for unref. Not sure it is worth it in this case.
>>>
>>> caveat: I'm still in a post KVM Forum travel brain fog...
>>>
>>> Perhaps a bit more simple than that... Since the 'inhibitCallback' and
>>> the 'inhibitOpaque' are essentially the mechanism that would pass/use
>>> @dmn, then it'd be up to the inhibitCallback to add/remove the reference
>>> with the *given* that the inhibitOpaque is a lockable object anyway...
>>>
>>> Thus, virNetDaemonAddShutdownInhibition could use virObjectRef and
>>> virNetDaemonRemoveShutdownInhibition could use virObjectUnref... The
>>> "catch" is that for Shutdown the Unref would need to go after Unlock.
>>>
>>> I believe that then would "replicate" the virObjectRef done in
>>> daemonStateInit and the virObjectUnref done at the end of
>>> daemonRunStateInit with respect to "some outside thread" being able to
>>> use @dmn and not have it be Unref'd for the last time at any point in
>>> time until the "consumer" was done with it.
>>>
>>> Moving the Unref to after all the StateCleanup calls were done works
>>> because it accomplishes the same/similar task, but just held onto @dmn
>>> longer because the original implementation didn't properly reference dmn
>>> when it was being used for some other consumer/thread. Of course that
>>> led to other problems which this series is addressing and I'm not clear
>>> yet as to the impact vis-a-vis your StateShutdown series.
>>
>> Ok, 85c3a182 can be rewritten this way. It is more straightforward to
>> ref/unref by consumer thread instead of relying on consumer structure
>> (in this case 85c3a182 relies upon the fact that after virStateCleanup
>> no hypervisor driver would use @dmn object).
>>
> 
> That's what reference counting is for, each consumer should have its reference.
> 
>> But we still have the issues address by this patch series and state
>> shutdown series because the order in which @dmn and other objects
>> will be disposed would be same for 85c3a182 and proposed 85c3a182
>> replacement.
>>
> 
> Let me summarize below, I hope I'll touch on all the points.  It's hard
> to follow since I'm currently reading 3 or more threads at the same time :)
> 
>>>>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>>>>> ---
>>>>>> src/rpc/virnetdaemon.c | 1 +
>>>>>> 1 file changed, 1 insertion(+)
>>>>>>
>>>>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>>>>> index 8c21414897..33bd8e3b06 100644
>>>>>> --- a/src/rpc/virnetdaemon.c
>>>>>> +++ b/src/rpc/virnetdaemon.c
>>>>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>>>>     virObjectLock(dmn);
>>>>>>
>>>>>>     virHashForEach(dmn->servers, daemonServerClose, NULL);
>>>>>> +    virHashRemoveAll(dmn->servers);
>>>>>>
>>>>>
>>>>> If I get this correctly, you are removing the servers so that their workers get
>>>>> cleaned up, but that should be done in a separate step.  Most probably what
>>>>> daemonServerClose should be doing.  This is a workaround, but not a proper fix.
>>>
>>> Are you sure?  The daemonServerClose is the callback for virHashForEach
>>> which means the table->iterating would be set thus preventing
>>> daemonServerClose from performing a virHashRemoveEntry of the server
>>> element from the list.
>>>
> 
> This just makes the window of opportunity (between daemonServerClose()
> and the actual removal of the virNetServerPtr from the hash) smaller.
> That's why I don't see it as a fix, rather as a workaround.
> 
> What I think should be done is the order of what's done with services,
> servers, drivers, daemon, workers, etc.
> 
> I read the other threads and I think we're on the same note here, it's
> just that there is a lot of confusion and we're all approaching it
> differently.
> 
> So first let's agree on what should be done when virNetDaemonClose() is
> called and the loop in virNetDaemonRun() exits.  My opinion it is:
> 
> 1) Stop accepting new calls and services:
>    virNetServerServiceToggle(false);

This is not really necessary, as it has an effect only while the loop is running.

> 
> 2) Wait for all the workers to finish:
>    First part of virThreadPoolFree()

I was thinking of splitting this function into finish/free too.
After finish the pool is not really usable because there are no
threads left in it, so to be safe we would have to introduce a
"finished" state and check it in every thread pool API function.
So this would introduce a bit of complexity only for the sake of
some scheme, IMO.

> 
> 3) Kill off clients:
>    virNetServerClientClose()
> 
> 4) Clean up drivers:
>    virStateCleanup()

I think it is better for the net daemon/servers not to know about
hypervisor drivers directly.

> 
> 5) Clean up services, servers, daemon, etc.
>    including the second part of virThreadPoolFree()
> 
> The last thing is what should be done in virNetDaemonDispose(), but
> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
> IMO in virNet{Daemon,Server}Close().
> 
> Would this solve your problem?  I mean I'm not against more reference
> counting, it should be done properly, but I think the above would do the
> trick and also make sense from the naming point of view.
> 
> Since both the daemon and server will probably not ever be consistent
> after calling the Close function, it might make sense to just include
> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
> But the order of the steps should still be as described above IMO.

This won't work IMO. If we are already in virNetDaemonDispose then
the @dmn object is invalid, but we haven't called virStateCleanup yet
(if it is moved to virNetDaemonClose), and this is the problem 85c3a182 dealt with.

> 
> Let me know what you think.  And don't forget to have a nice day!
> 
> Martin
> 
> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>      unreadable.  Also, even though I guess it goes without saying, if proper
>      reference counting is in place, the order of virObjectUnref()s and
>      VIR_FREE()s should not matter.  I think it could be changed like this:

It makes sense that in the cleanup section we can first do all the cleanup
and then all the disposal. It is only not clear to me whether we can call
virStateCleanup before virNetlinkShutdown. Also, I would keep the comment
for the @dmn unref until proper refcounting for @dmn is implemented.

Nikolay

> 
> diff --git i/daemon/libvirtd.c w/daemon/libvirtd.c
> index 589b32192e3d..3cc14ed20987 100644
> --- i/daemon/libvirtd.c
> +++ w/daemon/libvirtd.c
> @@ -1500,14 +1500,15 @@ int main(int argc, char **argv) {
> 
>  cleanup:
>     virNetlinkEventServiceStopAll();
> -    virObjectUnref(remoteProgram);
> -    virObjectUnref(lxcProgram);
> -    virObjectUnref(qemuProgram);
> -    virObjectUnref(adminProgram);
>     virNetDaemonClose(dmn);
> -    virObjectUnref(srv);
> -    virObjectUnref(srvAdm);
> +
> +    if (driversInitialized) {
> +        driversInitialized = false;
> +        virStateCleanup();
> +    }
> +
>     virNetlinkShutdown();
> +
>     if (statuswrite != -1) {
>         if (ret != 0) {
>             /* Tell parent of daemon what failed */
> @@ -1521,6 +1522,14 @@ int main(int argc, char **argv) {
>     if (pid_file_fd != -1)
>         virPidFileReleasePath(pid_file, pid_file_fd);
> 
> +    virObjectUnref(remoteProgram);
> +    virObjectUnref(lxcProgram);
> +    virObjectUnref(qemuProgram);
> +    virObjectUnref(adminProgram);
> +    virObjectUnref(srv);
> +    virObjectUnref(srvAdm);
> +    virObjectUnref(dmn);
> +
>     VIR_FREE(sock_file);
>     VIR_FREE(sock_file_ro);
>     VIR_FREE(sock_file_adm);
> @@ -1530,13 +1539,5 @@ int main(int argc, char **argv) {
> 
>     daemonConfigFree(config);
> 
> -    if (driversInitialized) {
> -        driversInitialized = false;
> -        virStateCleanup();
> -    }
> -    /* Now that the hypervisor shutdown inhibition functions that use
> -     * 'dmn' as a parameter are done, we can finally unref 'dmn' */
> -    virObjectUnref(dmn);
> -
>     return ret;
> }
> -- 

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Martin Kletzander 7 years, 6 months ago
On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>On 02.11.2017 19:32, Martin Kletzander wrote:
>> This just makes the window of opportunity (between daemonServerClose()
>> and the actual removal of the virNetServerPtr from the hash) smaller.
>> That's why I don't see it as a fix, rather as a workaround.
>>
>> What I think should be done is the order of what's done with services,
>> servers, drivers, daemon, workers, etc.
>>
>> I read the other threads and I think we're on the same note here, it's
>> just that there is a lot of confusion and we're all approaching it
>> differently.
>>
>> So first let's agree on what should be done when virNetDaemonClose() is
>> called and the loop in virNetDaemonRun() exits.  My opinion it is:
>>
>> 1) Stop accepting new calls and services:
>>    virNetServerServiceToggle(false);
>
>This not really necessary as this has effect only if loop is running.
>

Sure, good point.

>>
>> 2) Wait for all the workers to finish:
>>    First part of virThreadPoolFree()
>
>I was thinking of splitting this function to finish/free too.
>After finish it is not really usable because there are no
>more threads in pool anymore so we have to introduce state
>finished and check it in every thread pool api function to
>be safe. So this will introduce a bit of complexity only
>for the sake of some scheme IMO.
>

Yeah, it's similar to the daemon when you are between close and dispose, but for
the daemon we need it.

What function do you think we would need to check this for?  The finish would
end all the threads and since the event loop is not running no new jobs could be
posted.

>>
>> 3) Kill off clients:
>>    virNetServerClientClose()
>>
>> 4) Clean up drivers:
>>    virStateCleanup()
>
>I think it is better for net daemon/servers not to know about
>hypervisor drivers directly.
>

Sure, I was only talking about the libvirtd for the sake of simplicity.

>>
>> 5) Clean up services, servers, daemon, etc.
>>    including the second part of virThreadPoolFree()
>>
>> The last thing is what should be done in virNetDaemonDispose(), but
>> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
>> IMO in virNet{Daemon,Server}Close().
>>
>> Would this solve your problem?  I mean I'm not against more reference
>> counting, it should be done properly, but I think the above would do the
>> trick and also make sense from the naming point of view.
>>
>> Since both the daemon and server will probably not ever be consistent
>> after calling the Close function, it might make sense to just include
>> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
>> But the order of the steps should still be as described above IMO.
>
>This won't work IMO. If we are already in virNetDaemonDispose then
>@dmn object is invalid but we don't yet call virStateCleanup yet
>(if it is moved to virNetDaemonClose) and this is the problem 85c3a182 dealt with.
>

It would solve it if the order above is followed.  Sure, virNetDaemon should not
know about any drivers, but we can simply register a cleanup callback for the
daemon which the Dispose function will call after the thread pools are cleaned
up (just thought of that now).

>>
>> Let me know what you think.  And don't forget to have a nice day!
>>
>> Martin
>>
>> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>>      unreadable.  Also, even though I guess it goes without saying, if proper
>>      reference counting is in place, the order of virObjectUnref()s and
>>      VIR_FREE()s should not matter.  I think it could be changed like this:
>
>It makes sense that in cleanup section we can first make all cleanup and
>then all dispose. It is only not clear to me can we call virStateCleanup
>before virNetlinkShutdown. Also I would keep comment for @dmn unref
>until proper refcount for @dmn is implemented.
>

I don't think virStateCleanup() and virNetlinkShutdown() are dependent.  The
comment can stay, I was just trying to show that it's also a mess.  What I would
like, actually, is to have the daemon running with all the references and
pointers (that the main() function doesn't need anymore) freed, but both this
and what I suggested below are unrelated to what we're trying to fix; I don't
know why I started talking about it.
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 6 months ago

On 03.11.2017 11:42, Martin Kletzander wrote:
> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>> On 02.11.2017 19:32, Martin Kletzander wrote:
>>> This just makes the window of opportunity (between daemonServerClose()
>>> and the actual removal of the virNetServerPtr from the hash) smaller.
>>> That's why I don't see it as a fix, rather as a workaround.
>>>
>>> What I think should be done is the order of what's done with services,
>>> servers, drivers, daemon, workers, etc.
>>>
>>> I read the other threads and I think we're on the same note here, it's
>>> just that there is a lot of confusion and we're all approaching it
>>> differently.
>>>
>>> So first let's agree on what should be done when virNetDaemonClose() is
>>> called and the loop in virNetDaemonRun() exits.  My opinion it is:
>>>
>>> 1) Stop accepting new calls and services:
>>>    virNetServerServiceToggle(false);
>>
>> This not really necessary as this has effect only if loop is running.
>>
> 
> Sure, good point.
> 
>>>
>>> 2) Wait for all the workers to finish:
>>>    First part of virThreadPoolFree()
>>
>> I was thinking of splitting this function to finish/free too.
>> After finish it is not really usable because there are no
>> more threads in pool anymore so we have to introduce state
>> finished and check it in every thread pool api function to
>> be safe. So this will introduce a bit of complexity only
>> for the sake of some scheme IMO.
>>
> 
> Yeah, it's similar to the daemon when you are between close and dispose, but for
> the daemon we need it.
> 
> What function do you think we would need to check this for?  The finish would
> end all the threads and since the event loop is not running no new jobs could be
> posted.

virThreadPoolSendJob is already good in this respect as it checks @quit at the beginning.
It looks like we only need to add the same check to virThreadPoolSetParameters, so
not so much complexity after all.

I can imagine a thread pool function being called from another thread
pool's thread. AFAIK there is no such situation, though, so this is just to be on the safe side.

> 
>>>
>>> 3) Kill off clients:
>>>    virNetServerClientClose()
>>>
>>> 4) Clean up drivers:
>>>    virStateCleanup()
>>
>> I think it is better for net daemon/servers not to know about
>> hypervisor drivers directly.
>>
> 
> Sure, I was only talking about the libvirtd for the sake of simplicity.
> 
>>>
>>> 5) Clean up services, servers, daemon, etc.
>>>    including the second part of virThreadPoolFree()
>>>
>>> The last thing is what should be done in virNetDaemonDispose(), but
>>> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
>>> IMO in virNet{Daemon,Server}Close().
>>>
>>> Would this solve your problem?  I mean I'm not against more reference
>>> counting, it should be done properly, but I think the above would do the
>>> trick and also make sense from the naming point of view.
>>>
>>> Since both the daemon and server will probably not ever be consistent
>>> after calling the Close function, it might make sense to just include
>>> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
>>> But the order of the steps should still be as described above IMO.
>>
>> This won't work IMO. If we are already in virNetDaemonDispose then
>> @dmn object is invalid but we don't yet call virStateCleanup yet
>> (if it is moved to virNetDaemonClose) and this is the problem 85c3a182 dealt with.
>>
> 
> It would solve it if order above is followed.  Sure, virNetDaemon should not

It is ok if virStateCleanup is called from virNetDaemonClose somehow. I was
only against calling virNetDaemonClose from virNetDaemonDispose in this case.

> know about any drivers, but we can simply register a cleanup callback for the
> daemon which the Dispose function will call after the thread pools are cleaned
> up (just thought of that now).

Well, to me that is too generic for this case. We can encode this order
simply as the order of statements:

1. close netservers (which imply finishing rpc thread pool)
2. close hypervisor drivers
3. unref @dmn

Or we can use refcounts and take point 3 out of consideration.
Also, there is a patch series [1] that introduces a step 0 to help step 1
actually finish in the situation when the loop is down.

0. shut down hypervisor drivers.

[1] https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
> 
>>>
>>> Let me know what you think.  And don't forget to have a nice day!
>>>
>>> Martin
>>>
>>> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>>>      unreadable.  Also, even though I guess it goes without saying, if proper
>>>      reference counting is in place, the order of virObjectUnref()s and
>>>      VIR_FREE()s should not matter.  I think it could be changed like this:
>>
>> It makes sense that in cleanup section we can first make all cleanup and
>> then all dispose. It is only not clear to me can we call virStateCleanup
>> before virNetlinkShutdown. Also I would keep comment for @dmn unref
>> until proper refcount for @dmn is implemented.
>>
> 
> I don't think virStateCleanup() and virNetlinkShutdown() are dependent.  The
> comment can stay, I was just trying to show that it's also a mess.  What I would
> like, actually, is to have the daemon running with all the references and
> pointers (that the main() function doesn't need anymore) freed, but both this
> and what I suggested below is unrelated to what we're trying to fix, I don't
> know why I started talking about it.

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 6 months ago

On 11/03/2017 05:20 AM, Nikolay Shirokovskiy wrote:
> 
> 
> On 03.11.2017 11:42, Martin Kletzander wrote:
>> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>>> On 02.11.2017 19:32, Martin Kletzander wrote:
>>>> This just makes the window of opportunity (between daemonServerClose()
>>>> and the actual removal of the virNetServerPtr from the hash) smaller.
>>>> That's why I don't see it as a fix, rather as a workaround.
>>>>
>>>> What I think should be done is the order of what's done with services,
>>>> servers, drivers, daemon, workers, etc.
>>>>
>>>> I read the other threads and I think we're on the same note here, it's
>>>> just that there is a lot of confusion and we're all approaching it
>>>> differently.

Trying to catch up, but this thread has gone beyond where I was
previously. I still have to push the first two patches of this series...

Yes, libvirtd cleanup path is awful and there's quite a few "issues"
that have cropped up because previous patches didn't take care of
refcnt'ing properly and just "added" cleanup "somewhere" in the list of
things to be cleaned up... Fortunately to a degree I think this is still
all edge condition type stuff, but getting it right would be good...

For me the theory has always been to clean up in the inverse order that
things were created; for sure libvirtd doesn't follow that at all.
For libvirtd the shutdown/cleanup logic just seemed too haphazard, and
fixing it never made it to the top of the list of things to think about.

Another "caveat" in libvirtd cleanup logic is that driversInitialized is
set as part of a thread that could be running or have run if the code
enters the cleanup: path as a result of virNetlinkEventServiceStart
failure before we enter the event loop (e.g. virNetDaemonRun).

Still like Erik - I do like Martin's suggestion and would like to see it
as a formal patch - although perhaps with a few tweaks.

IIRC, this particular patch "worked" because removing the dmn->servers
refcnt earlier made sure that we didn't have to wait for
virObjectUnref(dmn) to occur after virStateCleanup.

>>>>
>>>> So first let's agree on what should be done when virNetDaemonClose() is
>>>> called and the loop in virNetDaemonRun() exits.  My opinion it is:
>>>>
>>>> 1) Stop accepting new calls and services:
>>>>    virNetServerServiceToggle(false);
>>>
>>> This not really necessary as this has effect only if loop is running.
>>>
>>
>> Sure, good point.
>>
>>>>
>>>> 2) Wait for all the workers to finish:
>>>>    First part of virThreadPoolFree()
>>>
>>> I was thinking of splitting this function to finish/free too.
>>> After finish it is not really usable because there are no
>>> more threads in pool anymore so we have to introduce state
>>> finished and check it in every thread pool api function to
>>> be safe. So this will introduce a bit of complexity only
>>> for the sake of some scheme IMO.
>>>
>>
>> Yeah, it's similar to the daemon when you are between close and dispose, but for
>> the daemon we need it.
>>
>> What function do you think we would need to check this for?  The finish would
>> end all the threads and since the event loop is not running no new jobs could be
>> posted.
> 
> virThreadPoolSendJob is already good in this respect as it checks @quit at the beginning.
> Looks like we need only to add same check to virThreadPoolSetParameters. So
> not so much complexity. 
> 
> I can imagine thread pool function being called from another
> thread pools thread. AFAIK there are not such situation though. So just to be on a safe side.
> 
>>
>>>>
>>>> 3) Kill off clients:
>>>>    virNetServerClientClose()
>>>>
>>>> 4) Clean up drivers:
>>>>    virStateCleanup()
>>>
>>> I think it is better for net daemon/servers not to know about
>>> hypervisor drivers directly.
>>>
>>
>> Sure, I was only talking about the libvirtd for the sake of simplicity.
>>
>>>>
>>>> 5) Clean up services, servers, daemon, etc.
>>>>    including the second part of virThreadPoolFree()
>>>>
>>>> The last thing is what should be done in virNetDaemonDispose(), but
>>>> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
>>>> IMO in virNet{Daemon,Server}Close().
>>>>
>>>> Would this solve your problem?  I mean I'm not against more reference
>>>> counting, it should be done properly, but I think the above would do the
>>>> trick and also make sense from the naming point of view.
>>>>
>>>> Since both the daemon and server will probably not ever be consistent
>>>> after calling the Close function, it might make sense to just include
>>>> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
>>>> But the order of the steps should still be as described above IMO.
>>>
>>> This won't work IMO. If we are already in virNetDaemonDispose then
>>> @dmn object is invalid but we don't yet call virStateCleanup yet
>>> (if it is moved to virNetDaemonClose) and this is the problem 85c3a182 dealt with.
>>>
>>
>> It would solve it if order above is followed.  Sure, virNetDaemon should not
> 
> It is ok if virStateCleanup is called from virNetDaemonClose somehow. I was
> only against calling virNetDaemonClose from virNetDaemonDispose in this case.
> 
>> know about any drivers, but we can simply register a cleanup callback for the
>> daemon which the Dispose function will call after the thread pools are cleaned
>> up (just thought of that now).
> 
> Well to me it is too generic for this case. We can code this order just
> as order of statements.
> 
> 1. close netservers (which imply finishing rpc thread pool)
> 2. close hypervisor drivers
> 3. unref @dmn
> 
> Or we can use refcounts and take point 3 out of consideration.
> Also there is a patch series [1] that introduces 0 step to help 1 step to finish actually
> in the situation when loop is down.
> 
>> 0. shut down hypervisor drivers.
> 
> [1] https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html

Right, this still exists out there, but it's hard enough to focus on the
issues being discussed here without adding more into the mix.  If we can
get the shutdown order right as Martin has proposed, then address this
afterwards, it'll perhaps make things easier.

>>
>>>>
>>>> Let me know what you think.  And don't forget to have a nice day!
>>>>
>>>> Martin
>>>>
>>>> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>>>>      unreadable.  Also, even though I guess it goes without saying, if proper
>>>>      reference counting is in place, the order of virObjectUnref()s and
>>>>      VIR_FREE()s should not matter.  I think it could be changed like this:
>>>
>>> It makes sense that in cleanup section we can first make all cleanup and
>>> then all dispose. It is only not clear to me can we call virStateCleanup
>>> before virNetlinkShutdown. Also I would keep comment for @dmn unref
>>> until proper refcount for @dmn is implemented.
>>>
>>
>> I don't think virStateCleanup() and virNetlinkShutdown() are dependent.  The
>> comment can stay, I was just trying to show that it's also a mess.  What I would
>> like, actually, is to have the daemon running with all the references and
>> pointers (that the main() function doesn't need anymore) freed, but both this
>> and what I suggested below is unrelated to what we're trying to fix, I don't
>> know why I started talking about it.


So you cut off Martin's patch here, but if one essentially did the inverse
order of startup, they'd get the following (the // comments are my notes
about startup-order oddities):

cleanup:
    virNetDaemonClose(dmn);

    virNetlinkEventServiceStopAll();

    if (driversInitialized) {
        /* Set via daemonRunStateInit in daemonStateInit */
        driversInitialized = false;
        virStateCleanup();
    }

    virObjectUnref(adminProgram);
    virObjectUnref(srvAdm);
    virObjectUnref(qemuProgram);
    virObjectUnref(lxcProgram);
    virObjectUnref(remoteProgram);

    // FWIW: Not exact here, but the order of startup should be
    // changed since srv is added to dmn, so realistically srv
    // depends on dmn existence.
    virObjectUnref(srv);
    virObjectUnref(dmn);

    virNetlinkShutdown();

    // This one is also strange - seems startup should move the claiming
    // to before run_dir setup after forking into background.
    if (pid_file_fd != -1)
        virPidFileReleasePath(pid_file, pid_file_fd);

    VIR_FREE(run_dir);

    if (statuswrite != -1) {
        if (ret != 0) {
            /* Tell parent of daemon what failed */
            char status = ret;
            while (write(statuswrite, &status, 1) == -1 &&
                   errno == EINTR)
                ;
        }
        VIR_FORCE_CLOSE(statuswrite);
    }

    VIR_FREE(sock_file);
    VIR_FREE(sock_file_ro);
    VIR_FREE(sock_file_adm);

    VIR_FREE(pid_file);

    VIR_FREE(remote_config_file);
    daemonConfigFree(config);


Yes, this does differ from Martin's order...

John

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Martin Kletzander 7 years, 6 months ago
On Mon, Nov 06, 2017 at 03:20:13PM -0500, John Ferlan wrote:
>
>
>On 11/03/2017 05:20 AM, Nikolay Shirokovskiy wrote:
>>
>>
>> On 03.11.2017 11:42, Martin Kletzander wrote:
>>> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>>>> On 02.11.2017 19:32, Martin Kletzander wrote:
>>>>> This just makes the window of opportunity (between daemonServerClose()
>>>>> and the actual removal of the virNetServerPtr from the hash) smaller.
>>>>> That's why I don't see it as a fix, rather as a workaround.
>>>>>
>>>>> What I think should be done is the order of what's done with services,
>>>>> servers, drivers, daemon, workers, etc.
>>>>>
>>>>> I read the other threads and I think we're on the same note here, it's
>>>>> just that there is a lot of confusion and we're all approaching it
>>>>> differently.
>
>Trying to catch up, but this thread has gone beyond where I was
>previously. I still have to push the first two patches of this series...
>
>Yes, libvirtd cleanup path is awful and there's quite a few "issues"
>that have cropped up because previous patches didn't take care of
>refcnt'ing properly and just "added" cleanup "somewhere" in the list of
>things to be cleaned up... Fortunately to a degree I think this is still
>all edge condition type stuff, but getting it right would be good...
>
>For me theory has always been to cleanup in the inverse order that
>something was created - for sure libvirtd doesn't follow that at all.
>For libvirtd the shutdown/cleanup logic just seemed too haphazard and
>fixing never made it to the top of the list of things to think about.
>
>Another "caveat" in libvirtd cleanup logic is that driversInitialized is
>set as part of a thread that could be running or have run if the code
>enters the cleanup: path as a result of virNetlinkEventServiceStart
>failure before we enter the event loop (e.g. virNetDaemonRun).
>
>Still like Erik - I do like Martin's suggestion and would like to see it
>as a formal patch - although perhaps with a few tweaks.
>
>IIRC, this particular patch "worked" because by removing the
>dmn->servers refcnt earlier made sure that we didn't have to wait for
>virObjectUnref(dmn) to occur after virStateCleanup.
>

Would it be too daring to ask you guys to send one series that includes all the
related patchsets?  We can then discuss there, it would be easier to follow as
well.  I don't want anyone to do any extra work, but we might save some review
bandwidth by resending those patches in one series.  Or maybe it's just me and
I have a hard time keeping up with everything at the same time =)

>>>>>
>>>>> So first let's agree on what should be done when virNetDaemonClose() is
>>>>> called and the loop in virNetDaemonRun() exits.  My opinion it is:
>>>>>
>>>>> 1) Stop accepting new calls and services:
>>>>>    virNetServerServiceToggle(false);
>>>>
>>>> This not really necessary as this has effect only if loop is running.
>>>>
>>>
>>> Sure, good point.
>>>
>>>>>
>>>>> 2) Wait for all the workers to finish:
>>>>>    First part of virThreadPoolFree()
>>>>
>>>> I was thinking of splitting this function to finish/free too.
>>>> After finish it is not really usable because there are no
>>>> more threads in pool anymore so we have to introduce state
>>>> finished and check it in every thread pool api function to
>>>> be safe. So this will introduce a bit of complexity only
>>>> for the sake of some scheme IMO.
>>>>
>>>
>>> Yeah, it's similar to the daemon when you are between close and dispose, but for
>>> the daemon we need it.
>>>
>>> What function do you think we would need to check this for?  The finish would
>>> end all the threads and since the event loop is not running no new jobs could be
>>> posted.
>>
>> virThreadPoolSendJob is already good in this respect as it checks @quit at the beginning.
>> Looks like we need only to add same check to virThreadPoolSetParameters. So
>> not so much complexity.
>>
>> I can imagine thread pool function being called from another
>> thread pools thread. AFAIK there are not such situation though. So just to be on a safe side.
>>
>>>
>>>>>
>>>>> 3) Kill off clients:
>>>>>    virNetServerClientClose()
>>>>>
>>>>> 4) Clean up drivers:
>>>>>    virStateCleanup()
>>>>
>>>> I think it is better for net daemon/servers not to know about
>>>> hypervisor drivers directly.
>>>>
>>>
>>> Sure, I was only talking about the libvirtd for the sake of simplicity.
>>>
>>>>>
>>>>> 5) Clean up services, servers, daemon, etc.
>>>>>    including the second part of virThreadPoolFree()
>>>>>
>>>>> The last thing is what should be done in virNetDaemonDispose(), but
>>>>> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
>>>>> IMO in virNet{Daemon,Server}Close().
>>>>>
>>>>> Would this solve your problem?  I mean I'm not against more reference
>>>>> counting, it should be done properly, but I think the above would do the
>>>>> trick and also make sense from the naming point of view.
>>>>>
>>>>> Since both the daemon and server will probably not ever be consistent
>>>>> after calling the Close function, it might make sense to just include
>>>>> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
>>>>> But the order of the steps should still be as described above IMO.
>>>>
>>>> This won't work IMO. If we are already in virNetDaemonDispose then
>>>> @dmn object is invalid but we don't yet call virStateCleanup yet
>>>> (if it is moved to virNetDaemonClose) and this is the problem 85c3a182 dealt with.
>>>>
>>>
>>> It would solve it if the order above is followed.  Sure, virNetDaemon should not
>>
>> It is ok if virStateCleanup is called from virNetDaemonClose somehow. I was
>> only against calling virNetDaemonClose from virNetDaemonDispose in this case.
>>
>>> know about any drivers, but we can simply register a cleanup callback for the
>>> daemon which the Dispose function will call after the thread pools are cleaned
>>> up (just thought of that now).
>>
>> Well, to me that is too generic for this case. We can express this order simply
>> as the order of statements:
>>
>> 1. close netservers (which imply finishing rpc thread pool)
>> 2. close hypervisor drivers
>> 3. unref @dmn
>>
>> Or we can use refcounts and take point 3 out of consideration.
>> Also there is a patch series [1] that introduces a step 0 to help step 1 actually finish
>> in the situation when the event loop is down.
>>
>> 0. shut down hypervisor drivers.
>>
>> [1] https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
>
>Right this still exists out there, but it's hard enough to focus on the
>issues being discussed here without adding more into the mix.  If we can
>get the shutdown order right as Martin has proposed and then address this
>afterwards, it'll perhaps make things easier.
>

Or just put it all in a series as hinted above ;)

>>>
>>>>>
>>>>> Let me know what you think.  And don't forget to have a nice day!
>>>>>
>>>>> Martin
>>>>>
>>>>> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>>>>>      unreadable.  Also, even though I guess it goes without saying, if proper
>>>>>      reference counting is in place, the order of virObjectUnref()s and
>>>>>      VIR_FREE()s should not matter.  I think it could be changed like this:
>>>>
>>>> It makes sense that in the cleanup section we can first do all the cleanup and
>>>> then all the disposal. It is just not clear to me whether we can call virStateCleanup
>>>> before virNetlinkShutdown. Also I would keep the comment for the @dmn unref
>>>> until proper refcounting for @dmn is implemented.
>>>>
>>>
>>> I don't think virStateCleanup() and virNetlinkShutdown() are dependent.  The
>>> comment can stay, I was just trying to show that it's also a mess.  What I would
>>> like, actually, is to have the daemon running with all the references and
>>> pointers (that the main() function doesn't need anymore) freed, but both this
>>> and what I suggested below are unrelated to what we're trying to fix; I don't
>>> know why I started talking about it.
>
>
>So you cut off Martin's patch here, but if one essentially did the inverse
>order of startup, they'd get the following (the // comments are my notes
>about startup-order oddities):
>
>cleanup:
>    virNetDaemonClose(dmn);
>
>    virNetlinkEventServiceStopAll();
>
>    if (driversInitialized) {
>        /* Set via daemonRunStateInit in daemonStateInit */
>        driversInitialized = false;
>        virStateCleanup();
>    }
>
>    virObjectUnref(adminProgram);
>    virObjectUnref(srvAdm);
>    virObjectUnref(qemuProgram);
>    virObjectUnref(lxcProgram);
>    virObjectUnref(remoteProgram);
>
>    // FWIW: Not exact here, but the order of startup should be
>    // changed since srv is added to dmn, so realistically srv
>    // depends on dmn existence.
>    virObjectUnref(srv);

We can unref all of the above whenever we please.  @dmn keeps references for
everything it needs; we just have @srv in order to set some things up, so we can
unref it right after we add it to @dmn.  The dependencies here are not the
concern thanks to (hopefully) proper refcnt'ing.

>    virObjectUnref(dmn);
>
>    virNetlinkShutdown();
>
>    // This one is also strange - seems startup should move the claiming
>    // to before run_dir setup after forking into background.
>    if (pid_file_fd != -1)
>        virPidFileReleasePath(pid_file, pid_file_fd);

During startup we should quit immediately on failure, so it should be done before
forking into the background; otherwise init scripts will not know whether starting
the service failed or not.

>
>    VIR_FREE(run_dir);

The frees can go together; again, most of them could be done before virNetDaemonRun,
since all the strings are just used to initialize things and I believe they are
strdup'd everywhere.

>
>    if (statuswrite != -1) {
>        if (ret != 0) {
>            /* Tell parent of daemon what failed */
>            char status = ret;
>            while (write(statuswrite, &status, 1) == -1 &&
>                   errno == EINTR)
>                ;
>        }
>        VIR_FORCE_CLOSE(statuswrite);
>    }
>
>    VIR_FREE(sock_file);
>    VIR_FREE(sock_file_ro);
>    VIR_FREE(sock_file_adm);
>
>    VIR_FREE(pid_file);
>
>    VIR_FREE(remote_config_file);
>    daemonConfigFree(config);
>
>
>Yes, this does differ from Martin's order...
>

Sure, no problem - I just quickly sorted the lines by hand; I only wanted to show
that the readability was bad as well.

>John
--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 6 months ago

On 11/07/2017 04:46 AM, Martin Kletzander wrote:
> On Mon, Nov 06, 2017 at 03:20:13PM -0500, John Ferlan wrote:
>>
>>
>> On 11/03/2017 05:20 AM, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 03.11.2017 11:42, Martin Kletzander wrote:
>>>> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>>>>> On 02.11.2017 19:32, Martin Kletzander wrote:
>>>>>> This just makes the window of opportunity (between
>>>>>> daemonServerClose()
>>>>>> and the actual removal of the virNetServerPtr from the hash) smaller.
>>>>>> That's why I don't see it as a fix, rather as a workaround.
>>>>>>
>>>>>> What I think should be done is the order of what's done with
>>>>>> services,
>>>>>> servers, drivers, daemon, workers, etc.
>>>>>>
>>>>>> I read the other threads and I think we're on the same note here, it's
>>>>>> just that there is a lot of confusion and we're all approaching it
>>>>>> differently.
>>
>> Trying to catch up, but this thread has gone beyond where I was
>> previously. I still have to push the first two patches of this series...
>>
>> Yes, libvirtd cleanup path is awful and there's quite a few "issues"
>> that have cropped up because previous patches didn't take care of
>> refcnt'ing properly and just "added" cleanup "somewhere" in the list of
>> things to be cleaned up... Fortunately to a degree I think this is still
>> all edge condition type stuff, but getting it right would be good...
>>
>> For me theory has always been to cleanup in the inverse order that
>> something was created - for sure libvirtd doesn't follow that at all.
>> For libvirtd the shutdown/cleanup logic just seemed too haphazard and
>> fixing never made it to the top of the list of things to think about.
>>
>> Another "caveat" in libvirtd cleanup logic is that driversInitialized is
>> set as part of a thread that could be running or have run if the code
>> enters the cleanup: path as a result of virNetlinkEventServiceStart
>> failure before we enter the event loop (e.g. virNetDaemonRun).
>>
>> Still like Erik - I do like Martin's suggestion and would like to see it
>> as a formal patch - although perhaps with a few tweaks.
>>
>> IIRC, this particular patch "worked" because by removing the
>> dmn->servers refcnt earlier made sure that we didn't have to wait for
>> virObjectUnref(dmn) to occur after virStateCleanup.
>>
> 
> Would it be too daring to ask you guys to send one series that includes
> all the
> related patchsets?  We can then discuss there, it would be easier to
> follow as
> well.  I don't want anyone to do any extra work, but we might save some
> review
> bandwidth by resending those patches in one series.  Or maybe it's just
> me and
> I have a hard time keeping up with everything at the same time =)
> 

Originally "this" crash fix was mixed in with the other series to add a
shutdown state option, but I asked that they be separate since they were IMO
addressing different problems.

To me this series was attempting to resolve a refcnt problem and a
"latent" free of dmn->servers, which IIRC was still being used after the
state cleanup but before the Unref(dmn) that runs the virHashFree in
virNetDaemonDispose.

The other series seems to be adding a state shutdown step to be run
before state cleanup; however, if we alter the cleanup code in libvirtd
does that make the state shutdown step unnecessary? I don't know and
really didn't want to think about it until we got through thinking about
the cleanup processing order. Still if one considers that state
initialization is really two steps (init and auto start), then splitting
cleanup into shutdown and cleanup seems reasonable, but I don't think
affects whether we have a crash or latent usage of dmn->servers. I could
be wrong, but it's so hard to tell.

[...]

>>
>>
>> So you cut off Martin's patch here, but if one essentially did inverse
>> order of startup they'd get the following: (the // comments are my notes
>> about startup order oddity).
>>
>> cleanup:
>>    virNetDaemonClose(dmn);
>>
>>    virNetlinkEventServiceStopAll();
>>
>>    if (driversInitialized) {
>>        /* Set via daemonRunStateInit in daemonStateInit */
>>        driversInitialized = false;
>>        virStateCleanup();
>>    }
>>
>>    virObjectUnref(adminProgram);
>>    virObjectUnref(srvAdm);
>>    virObjectUnref(qemuProgram);
>>    virObjectUnref(lxcProgram);
>>    virObjectUnref(remoteProgram);
>>
>>    // FWIW: Not exact here, but the order of startup should be
>>    // changed since srv is added to dmn, so realistically srv
>>    // depends on dmn existence.
>>    virObjectUnref(srv);
> 
> We can unref all of the above whenever we please.  @dmn keeps references
> for
> everything it needs, we just have @srv in order to set some things up,
> we can
> unref it right after we add it to @dmn.  The dependencies here are not the
> concern thanks to (hopefully) proper refcnt'ing.

OK - that's a bit more adjustment, but totally possible...

> 
>>    virObjectUnref(dmn);
>>
>>    virNetlinkShutdown();
>>
>>    // This one is also strange - seems startup should move the claiming
>>    // to before run_dir setup after forking into background.
>>    if (pid_file_fd != -1)
>>        virPidFileReleasePath(pid_file, pid_file_fd);
> 
> During start we should quit immediately, so it should be done before
> forking
> into background, otherwise init scripts will not know whether starting the
> service failed or not.
> 

But don't we want to write the child pid into the pid_file and not the
parent? Or more succinctly the grandchild pid...

e.g. the call after the fork is:

   if ((pid_file_fd = virPidFileAcquirePath(pid_file, false, getpid()))
< 0) {

which essentially writes the getpid value into the pid_file... If we did
this before the fork, then we'd be writing the parent pid into the file,
and that parent exits as does its child, leaving the grandchild as the
daemon (or am I reading all that wrong - it's quite possible ;-))
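To illustrate the point, here is a toy double-fork sketch (demoDaemonize and its pipe-based reporting are hypothetical, not libvirt code) showing that the pid the daemon ends up with is the grandchild's, so acquiring the pid file before forking would record the wrong pid:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Toy model of the double fork: only the grandchild survives as the daemon,
 * so only its pid belongs in the pid file.  demoDaemonize() is a hypothetical
 * stand-in; it reports the daemon pid back through @pipefd so the caller can
 * compare it with its own pid. */
pid_t demoDaemonize(int pipefd[2])
{
    pid_t daemon_pid = -1;
    pid_t pid = fork();

    if (pid < 0)
        exit(EXIT_FAILURE);

    if (pid > 0) {                       /* original parent */
        waitpid(pid, NULL, 0);           /* reap the intermediate child */
        if (read(pipefd[0], &daemon_pid, sizeof(daemon_pid)) !=
            sizeof(daemon_pid))
            exit(EXIT_FAILURE);
        return daemon_pid;
    }

    if (setsid() < 0)                    /* intermediate child: new session */
        _exit(EXIT_FAILURE);

    pid = fork();
    if (pid < 0)
        _exit(EXIT_FAILURE);
    if (pid > 0)                         /* intermediate child exits */
        _exit(EXIT_SUCCESS);

    /* grandchild: this is the pid the pid file must record */
    daemon_pid = getpid();
    write(pipefd[1], &daemon_pid, sizeof(daemon_pid));
    _exit(EXIT_SUCCESS);
}
```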

In any case, I don't mind posting something that alters the cleanup:
logic - I wasn't sure if you were going to do that based on a couple of
replies back in this thread...

John

>>
>>    VIR_FREE(run_dir);
> 
> Frees can be together, again, most of them could be freed before
> virNetDaemonRun
> since all the string are just used to initialize stuff and I believe it's
> strdup'd everywhere.
> 
>>
>>    if (statuswrite != -1) {
>>        if (ret != 0) {
>>            /* Tell parent of daemon what failed */
>>            char status = ret;
>>            while (write(statuswrite, &status, 1) == -1 &&
>>                   errno == EINTR)
>>                ;
>>        }
>>        VIR_FORCE_CLOSE(statuswrite);
>>    }
>>
>>    VIR_FREE(sock_file);
>>    VIR_FREE(sock_file_ro);
>>    VIR_FREE(sock_file_adm);
>>
>>    VIR_FREE(pid_file);
>>
>>    VIR_FREE(remote_config_file);
>>    daemonConfigFree(config);
>>
>>
>> Yes, this does differ from Martin's order...
>>
> 
> Sure, no problem, I just quickly sorted the lines by hand, I just wanted
> to show
> that the readability was bad as well.
> 
>> John

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 6 months ago
On Tue, Nov 07, 2017 at 10:28:10AM -0500, John Ferlan wrote:
>
>
> On 11/07/2017 04:46 AM, Martin Kletzander wrote:
> > On Mon, Nov 06, 2017 at 03:20:13PM -0500, John Ferlan wrote:
> >>
> >>
> >> On 11/03/2017 05:20 AM, Nikolay Shirokovskiy wrote:
> >>>
> >>>
> >>> On 03.11.2017 11:42, Martin Kletzander wrote:
> >>>> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
> >>>>> On 02.11.2017 19:32, Martin Kletzander wrote:
> >>>>>> This just makes the window of opportunity (between
> >>>>>> daemonServerClose()
> >>>>>> and the actual removal of the virNetServerPtr from the hash) smaller.
> >>>>>> That's why I don't see it as a fix, rather as a workaround.
> >>>>>>
> >>>>>> What I think should be done is the order of what's done with
> >>>>>> services,
> >>>>>> servers, drivers, daemon, workers, etc.
> >>>>>>
> >>>>>> I read the other threads and I think we're on the same note here, it's
> >>>>>> just that there is a lot of confusion and we're all approaching it
> >>>>>> differently.
> >>
> >> Trying to catch up, but this thread has gone beyond where I was
> >> previously. I still have to push the first two patches of this series...
> >>
> >> Yes, libvirtd cleanup path is awful and there's quite a few "issues"
> >> that have cropped up because previous patches didn't take care of
> >> refcnt'ing properly and just "added" cleanup "somewhere" in the list of
> >> things to be cleaned up... Fortunately to a degree I think this is still
> >> all edge condition type stuff, but getting it right would be good...
> >>
> >> For me theory has always been to cleanup in the inverse order that
> >> something was created - for sure libvirtd doesn't follow that at all.
> >> For libvirtd the shutdown/cleanup logic just seemed too haphazard and
> >> fixing never made it to the top of the list of things to think about.
> >>
> >> Another "caveat" in libvirtd cleanup logic is that driversInitialized is
> >> set as part of a thread that could be running or have run if the code
> >> enters the cleanup: path as a result of virNetlinkEventServiceStart
> >> failure before we enter the event loop (e.g. virNetDaemonRun).
> >>
> >> Still like Erik - I do like Martin's suggestion and would like to see it
> >> as a formal patch - although perhaps with a few tweaks.
> >>
> >> IIRC, this particular patch "worked" because by removing the
> >> dmn->servers refcnt earlier made sure that we didn't have to wait for
> >> virObjectUnref(dmn) to occur after virStateCleanup.
> >>
> >
> > Would it be too daring to ask you guys to send one series that includes
> > all the
> > related patchsets?  We can then discuss there, it would be easier to
> > follow as
> > well.  I don't want anyone to do any extra work, but we might save some
> > review
> > bandwidth by resending those patches in one series.  Or maybe it's just
> > me and
> > I have a hard time keeping up with everything at the same time =)
> >
>
> Originally "this" crash fix was mixed in with the other series to add a
> shutdown state option, but I asked they be separate since they were IMO
> addressing different problems.
>
> To me this series was attempting to resolve a refcnt problem and a
> "latent" free of the dmn->servers that was IIRC attempting to be used
> before the state cleanup and the Unref(dmn) occurred to run the
> virHashFree in virNetDaemonDispose.

Exactly, this was only about refcounting of @dmn; see below for the @driver issue.

>
> The other series seems to be adding a state shutdown step to be run
> before state cleanup; however, if we alter the cleanup code in libvirtd
> does that make the state shutdown step unnecessary? I don't know and

Actually, no, the stateShutdown callback was addressing the issue of running
<driver>StateCleanup while there was still an active worker thread that could
potentially access the @driver object, in which case - SIGSEGV. But as I wrote
in my response, I don't like the idea of having another state driver callback,
as I feel it introduces a bit of ambiguity: just by looking at the callback
names, you can't tell what the difference between stateCleanup and
stateShutdown is. I think this should be handled as part of the final stage of
leaving the event loop, by simply registering an appropriate callback on the
handle; that way, you don't need to define a new state callback which would
most likely be used by only a couple of drivers.

Erik

> really didn't want to think about it until we got through thinking about
> the cleanup processing order. Still if one considers that state
> initialization is really two steps (init and auto start), then splitting
> cleanup into shutdown and cleanup seems reasonable, but I don't think
> affects whether we have a crash or latent usage of dmn->servers. I could
> be wrong, but it's so hard to tell.
>
> [...]

Erik

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Martin Kletzander 7 years, 6 months ago
On Tue, Nov 07, 2017 at 05:06:33PM +0100, Erik Skultety wrote:
>On Tue, Nov 07, 2017 at 10:28:10AM -0500, John Ferlan wrote:
>>
>>
>> On 11/07/2017 04:46 AM, Martin Kletzander wrote:
>> > On Mon, Nov 06, 2017 at 03:20:13PM -0500, John Ferlan wrote:
>> >>
>> >>
>> >> On 11/03/2017 05:20 AM, Nikolay Shirokovskiy wrote:
>> >>>
>> >>>
>> >>> On 03.11.2017 11:42, Martin Kletzander wrote:
>> >>>> On Fri, Nov 03, 2017 at 11:07:36AM +0300, Nikolay Shirokovskiy wrote:
>> >>>>> On 02.11.2017 19:32, Martin Kletzander wrote:
>> >>>>>> This just makes the window of opportunity (between
>> >>>>>> daemonServerClose()
>> >>>>>> and the actual removal of the virNetServerPtr from the hash) smaller.
>> >>>>>> That's why I don't see it as a fix, rather as a workaround.
>> >>>>>>
>> >>>>>> What I think should be done is the order of what's done with
>> >>>>>> services,
>> >>>>>> servers, drivers, daemon, workers, etc.
>> >>>>>>
>> >>>>>> I read the other threads and I think we're on the same note here, it's
>> >>>>>> just that there is a lot of confusion and we're all approaching it
>> >>>>>> differently.
>> >>
>> >> Trying to catch up, but this thread has gone beyond where I was
>> >> previously. I still have to push the first two patches of this series...
>> >>
>> >> Yes, libvirtd cleanup path is awful and there's quite a few "issues"
>> >> that have cropped up because previous patches didn't take care of
>> >> refcnt'ing properly and just "added" cleanup "somewhere" in the list of
>> >> things to be cleaned up... Fortunately to a degree I think this is still
>> >> all edge condition type stuff, but getting it right would be good...
>> >>
>> >> For me theory has always been to cleanup in the inverse order that
>> >> something was created - for sure libvirtd doesn't follow that at all.
>> >> For libvirtd the shutdown/cleanup logic just seemed too haphazard and
>> >> fixing never made it to the top of the list of things to think about.
>> >>
>> >> Another "caveat" in libvirtd cleanup logic is that driversInitialized is
>> >> set as part of a thread that could be running or have run if the code
>> >> enters the cleanup: path as a result of virNetlinkEventServiceStart
>> >> failure before we enter the event loop (e.g. virNetDaemonRun).
>> >>
>> >> Still like Erik - I do like Martin's suggestion and would like to see it
>> >> as a formal patch - although perhaps with a few tweaks.
>> >>
>> >> IIRC, this particular patch "worked" because by removing the
>> >> dmn->servers refcnt earlier made sure that we didn't have to wait for
>> >> virObjectUnref(dmn) to occur after virStateCleanup.
>> >>
>> >
>> > Would it be too daring to ask you guys to send one series that includes
>> > all the
>> > related patchsets?  We can then discuss there, it would be easier to
>> > follow as
>> > well.  I don't want anyone to do any extra work, but we might save some
>> > review
>> > bandwidth by resending those patches in one series.  Or maybe it's just
>> > me and
>> > I have a hard time keeping up with everything at the same time =)
>> >
>>
>> Originally "this" crash fix was mixed in with the other series to add a
>> shutdown state option, but I asked they be separate since they were IMO
>> addressing different problems.
>>
>> To me this series was attempting to resolve a refcnt problem and a
>> "latent" free of the dmn->servers that was IIRC attempting to be used
>> before the state cleanup and the Unref(dmn) occurred to run the
>> virHashFree in virNetDaemonDispose.
>
>Exactly, this was only about refcounting of @dmn, see below for @driver issue.
>

OK, sorry for that then; I only followed up more recently, and honestly I'm
getting pretty confused by how many threads of thought this has, so the context
switches are getting more and more expensive.

>>
>> The other series seems to be adding a state shutdown step to be run
>> before state cleanup; however, if we alter the cleanup code in libvirtd
>> does that make the state shutdown step unnecessary? I don't know and
>
>Actually, no, the stateShutdown callback was addressing the issue of running
><driver>StateCleanup while there was still an active worker thread that could
>potentially access the @driver object, in which case - SIGSEGV. But as I wrote
>in my response, I don't like the idea of having another state driver callback,
>as I feel it introduces a bit of ambiguity: just by looking at the callback
>names, you can't tell what the difference between stateCleanup and
>stateShutdown is. I think this should be handled as part of the final stage of
>leaving the event loop, by simply registering an appropriate callback on the
>handle; that way, you don't need to define a new state callback which would
>most likely be used by only a couple of drivers.
>

If stateShutdown is what makes everything go away, then why not?  The names are
pretty descriptive: shutdown is called when the daemon is shutting down, and
cleanup when you need to clean up after yourself.  Used by only a few drivers?
Well, we can implement it everywhere and make it not optional, but rather
mandatory.  If that is checked on registration then we'll know right away
when we missed something (at runtime, but right after someone tries running
it).

>Erik
>
>> really didn't want to think about it until we got through thinking about
>> the cleanup processing order. Still if one considers that state
>> initialization is really two steps (init and auto start), then splitting
>> cleanup into shutdown and cleanup seems reasonable, but I don't think
>> affects whether we have a crash or latent usage of dmn->servers. I could
>> be wrong, but it's so hard to tell.
>>
>> [...]
>
>Erik
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 6 months ago
On Thu, Nov 02, 2017 at 05:32:35PM +0100, Martin Kletzander wrote:
> On Thu, Nov 02, 2017 at 10:49:35AM +0300, Nikolay Shirokovskiy wrote:
> >
> >
> > On 01.11.2017 21:51, John Ferlan wrote:
> > >
> > >
> > > On 10/31/2017 02:54 AM, Nikolay Shirokovskiy wrote:
> > > >
> > > >
> > > > On 30.10.2017 19:21, Martin Kletzander wrote:
> > > > > On Mon, Oct 30, 2017 at 07:14:39AM -0400, John Ferlan wrote:
> > > > > > From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> > > > > >
> > > > > > The problem is incorrect order of qemu driver shutdown and shutdown
> > > > > > of netserver threads that serve client requests (thru qemu driver
> > > > > > particularly).
> > > > > >
> > > > > > Net server threads are shutdown upon dispose which is triggered
> > > > > > by last daemon object unref at the end of main function. At the same
> > > > > > time qemu driver is shutdown earlier in virStateCleanup. As a result
> > > > > > netserver threads see invalid driver object in the middle of request
> > > > > > processing.
> > > > > >
> > > > > > Let's move shutting down netserver threads earlier to virNetDaemonClose.
> > > > > >
> > > > > > Note: order of last daemon unref and virStateCleanup
> > > > > > is introduced in 85c3a182 for a valid reason.
> > > > > >
> > > > >
> > > > > I must say I don't believe that's true.  Reading it back, that patch is
> > > > > wrong IMHO.  I haven't gone through all the details of it and I don't
> > > > > remember them from when I rewrote part of it, but the way this should
> > > > > be handled is different.
> > > > >
> > > > > The way you can clearly identify such issues is when you see that one
> > > > > thread determines the validity of data (pointer in this case).  This
> > > > > must never happen.  That means that the pointer is used from more places
> > > > > than how many references that object has.  However each of the pointers
> > > > > for such object should have their own reference.
> > > > >
> > > > > So the problem is probably somewhere else.
> > > >
> > > > If I understand you correctly we can fix the issue in 85c3a182 in
> > > > a different way, like adding a reference to the daemon for every
> > > > driver that uses it for shutdown inhibition. It would require passing
> > > > an unref function to the driver init function, or a bit of wild casting
> > > > to virObjectPtr for the unref. Not sure it is worth it in this case.
> > >
> > > caveat: I'm still in a post KVM Forum travel brain fog...
> > >
> > > Perhaps a bit more simple than that... Since the 'inhibitCallback' and
> > > the 'inhibitOpaque' are essentially the mechanism that would pass/use
> > > @dmn, then it'd be up to the inhibitCallback to add/remove the reference
> > > with the *given* that the inhibitOpaque is a lockable object anyway...
> > >
> > > Thus, virNetDaemonAddShutdownInhibition could use virObjectRef and
> > > virNetDaemonRemoveShutdownInhibition could use virObjectUnref... The
> > > "catch" is that for Shutdown the Unref would need to go after Unlock.
> > >
> > > I believe that then would "replicate" the virObjectRef done in
> > > daemonStateInit and the virObjectUnref done at the end of
> > > daemonRunStateInit with respect to "some outside thread" being able to
> > > use @dmn and not have it be Unref'd for the last time at any point in
> > > time until the "consumer" was done with it.
> > >
> > > Moving the Unref to after all the StateCleanup calls were done works
> > > because it accomplished the same/similar task, but just held onto @dmn
> > > longer because the original implementation didn't properly reference dmn
> > > when it was being used for some other consumer/thread. Of course that
> > > led to other problems which this series is addressing and I'm not clear
> > > yet as to the impact vis-a-vis your StateShutdown series.
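The Ref-on-Add / Unref-after-Unlock idea above might be sketched as follows; all names here are hypothetical stand-ins for virNetDaemon and the virObject APIs, which are more involved in reality:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Rough sketch only: demoDaemon stands in for virNetDaemon and the
 * demoObject* helpers for virObjectRef/virObjectUnref. */
typedef struct {
    pthread_mutex_t lock;
    int refs;
    int autoShutdownInhibitions;
} demoDaemon;

demoDaemon *demoDaemonNew(void)
{
    demoDaemon *dmn = calloc(1, sizeof(*dmn));
    pthread_mutex_init(&dmn->lock, NULL);
    dmn->refs = 1;
    return dmn;
}

void demoObjectRef(demoDaemon *dmn)
{
    __atomic_add_fetch(&dmn->refs, 1, __ATOMIC_SEQ_CST);
}

/* Returns the remaining reference count; frees the object on the last unref. */
int demoObjectUnref(demoDaemon *dmn)
{
    int refs = __atomic_sub_fetch(&dmn->refs, 1, __ATOMIC_SEQ_CST);
    if (refs == 0) {
        pthread_mutex_destroy(&dmn->lock);
        free(dmn);
    }
    return refs;
}

/* The inhibiting driver takes its own reference on @dmn... */
void demoDaemonAddShutdownInhibition(demoDaemon *dmn)
{
    pthread_mutex_lock(&dmn->lock);
    dmn->autoShutdownInhibitions++;
    demoObjectRef(dmn);
    pthread_mutex_unlock(&dmn->lock);
}

/* ...and drops it again, but only after unlocking, since this may be
 * the last reference and the unref would free the lock itself. */
void demoDaemonRemoveShutdownInhibition(demoDaemon *dmn)
{
    pthread_mutex_lock(&dmn->lock);
    dmn->autoShutdownInhibitions--;
    pthread_mutex_unlock(&dmn->lock);
    demoObjectUnref(dmn);
}
```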
> >
> > Ok, 85c3a182 can be rewritten this way. It is more straightforward to
> > ref/unref by consumer thread instead of relying on consumer structure
> > (in this case 85c3a182 relies upon the fact that after virStateCleanup
> > no hypervisor driver would use @dmn object).
> >
>
> That's what reference counting is for, each consumer should have its reference.
>
> > But we still have the issues addressed by this patch series and state
> > shutdown series because the order in which @dmn and other objects
> > will be disposed would be same for 85c3a182 and proposed 85c3a182
> > replacement.
> >
>
> Let me summarize below, I hope I'll touch on all the points.  It's hard
> to follow since I'm currently reading 3 or more threads at the same time :)
>
> > > > > > Signed-off-by: John Ferlan <jferlan@redhat.com>
> > > > > > ---
> > > > > > src/rpc/virnetdaemon.c | 1 +
> > > > > > 1 file changed, 1 insertion(+)
> > > > > >
> > > > > > diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
> > > > > > index 8c21414897..33bd8e3b06 100644
> > > > > > --- a/src/rpc/virnetdaemon.c
> > > > > > +++ b/src/rpc/virnetdaemon.c
> > > > > > @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
> > > > > >     virObjectLock(dmn);
> > > > > >
> > > > > >     virHashForEach(dmn->servers, daemonServerClose, NULL);
> > > > > > +    virHashRemoveAll(dmn->servers);
> > > > > >
> > > > >
> > > > > If I get this correctly, you are removing the servers so that their workers get
> > > > > cleaned up, but that should be done in a separate step.  Most probably what
> > > > > daemonServerClose should be doing.  This is a workaround, but not a proper fix.
> > >
> > > Are you sure?  The daemonServerClose is the callback for virHashForEach
> > > which means the table->iterating would be set thus preventing
> > > daemonServerClose from performing a virHashRemoveEntry of the server
> > > element from the list.
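A stripped-down model of why removal from inside the iteration fails (demoHash and its flag are illustrative only; the real virHashTable is keyed and far more general):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stripped-down model of the @iterating guard discussed above. */
typedef struct {
    void *entries[8];
    size_t nentries;
    bool iterating;                  /* set for the duration of ForEach */
} demoHash;

typedef void (*demoHashIterator)(void *payload, void *opaque);

/* Fails while iteration is in progress, like virHashRemoveEntry would. */
int demoHashRemoveEntry(demoHash *tbl, size_t i)
{
    if (tbl->iterating || i >= tbl->nentries)
        return -1;
    tbl->entries[i] = tbl->entries[--tbl->nentries];
    return 0;
}

void demoHashForEach(demoHash *tbl, demoHashIterator iter, void *opaque)
{
    tbl->iterating = true;
    for (size_t i = 0; i < tbl->nentries; i++)
        iter(tbl->entries[i], opaque);
    tbl->iterating = false;
}

/* A daemonServerClose-style callback that tries (and fails) to remove
 * an entry from inside the iteration. */
static int removeFailures;
static void demoCloseCb(void *payload, void *opaque)
{
    demoHash *tbl = opaque;
    (void)payload;
    if (demoHashRemoveEntry(tbl, 0) < 0)
        removeFailures++;
}
```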
> > >
>
> This just makes the window of opportunity (between daemonServerClose()
> and the actual removal of the virNetServerPtr from the hash) smaller.
> That's why I don't see it as a fix, rather as a workaround.
>
> What I think should be done is the order of what's done with services,
> servers, drivers, daemon, workers, etc.
>
> I read the other threads and I think we're on the same note here, it's
> just that there is a lot of confusion and we're all approaching it
> differently.
>
> So first let's agree on what should be done when virNetDaemonClose() is
> called and the loop in virNetDaemonRun() exits.  My opinion it is:
>
> 1) Stop accepting new calls and services:
>    virNetServerServiceToggle(false);
>
> 2) Wait for all the workers to finish:
>    First part of virThreadPoolFree()

This is a very tricky part which might get overly complex to do properly. We
currently don't allow the threads to finish gracefully; IOW the connections
have most likely been closed by the time the worker thread got to respond back
to the client, or the thread simply hung for some other reason, in which case
the daemon wouldn't even finish. So to give the threads a chance to finish
properly, they need to be decommissioned while the event loop is still running,
otherwise there's no way for a worker to e.g. get an event back from QEMU's
monitor, since there's no one to execute the appropriate callback for that
anymore.
However, even if we did this and, let's say, waited for a worker thread to
finish its time-consuming job, you still have the problem I mentioned above of
distinguishing really time-consuming jobs from threads that are hung. So I'm
not really sure whether any attempt to terminate them gracefully is worth the
effort, given the circumstances, but at the same time I understand the possible
outcomes: the daemon might crash (it was terminating anyway), or it terminates
properly but leaves a QEMU guest in an inconsistent state.

As for your idea of splitting virThreadPoolFree: since we most likely have to
stop the threads within the event loop, i.e. broadcast 'quit' as well as wait
for them, I don't see a point in decoupling the function into killing and
joining the threads and then doing a bunch of VIR_FREEs and
vir(Mutex|Cond)Destroys; that's just scraping off some leftovers. I think it
makes sense the way it is now, even though the naming is not ideal - I agree
with that and we should come up with something else. But this is just a matter
of taste and rather unimportant; of course I don't have a strong argument for
why the approach would be bad, because it's not, I just don't see it as worth
the effort.
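For what it's worth, the "first part of virThreadPoolFree" being discussed - broadcast @quit, wake the workers, join them - could be sketched like this, with the resource release deferred to a second step (hypothetical demoPool, not libvirt's implementation):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define DEMO_NWORKERS 4

/* demoPool is a hypothetical simplification of a worker pool, only to
 * show the quit-broadcast-join half of the teardown. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    pthread_t workers[DEMO_NWORKERS];
    bool quit;
    int workersQuit;
} demoPool;

static void *demoWorker(void *opaque)
{
    demoPool *pool = opaque;

    pthread_mutex_lock(&pool->lock);
    while (!pool->quit)                 /* would normally dequeue jobs here */
        pthread_cond_wait(&pool->cond, &pool->lock);
    pool->workersQuit++;                /* worker exits gracefully */
    pthread_mutex_unlock(&pool->lock);
    return NULL;
}

void demoPoolStart(demoPool *pool)
{
    pthread_mutex_init(&pool->lock, NULL);
    pthread_cond_init(&pool->cond, NULL);
    pool->quit = false;
    pool->workersQuit = 0;
    for (int i = 0; i < DEMO_NWORKERS; i++)
        pthread_create(&pool->workers[i], NULL, demoWorker, pool);
}

/* Part one: broadcast @quit and join the workers... */
void demoPoolDrain(demoPool *pool)
{
    pthread_mutex_lock(&pool->lock);
    pool->quit = true;
    pthread_cond_broadcast(&pool->cond);
    pthread_mutex_unlock(&pool->lock);
    for (int i = 0; i < DEMO_NWORKERS; i++)
        pthread_join(pool->workers[i], NULL);
}

/* ...part two, done later: release the leftover resources. */
void demoPoolFree(demoPool *pool)
{
    pthread_cond_destroy(&pool->cond);
    pthread_mutex_destroy(&pool->lock);
}
```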

>
> 3) Kill off clients:
>    virNetServerClientClose()
>
> 4) Clean up drivers:
>    virStateCleanup()
>
> 5) Clean up services, servers, daemon, etc.
>    including the second part of virThreadPoolFree()
>
> The last thing is what should be done in virNetDaemonDispose(), but
> unfortunately we (mis)use it for more than that.  Numbers 1-4 should be,
> IMO in virNet{Daemon,Server}Close().
>
> Would this solve your problem?  I mean I'm not against more reference
> counting, it should be done properly, but I think the above would do the
> trick and also make sense from the naming point of view.
>
> Since both the daemon and server will probably not ever be consistent
> after calling the Close function, it might make sense to just include
> virNetDaemonClose() into virNetDaemonDispose() and I'm not against that.
> But the order of the steps should still be as described above IMO.
>
> Let me know what you think.  And don't forget to have a nice day!
>
> Martin
>
> P.S.: The order of clean ups in libvirtd's main() is absolutely terrible and
>      unreadable.  Also, even though I guess it goes without saying, if proper
>      reference counting is in place, the order of virObjectUnref()s and
>      VIR_FREE()s should not matter.  I think it could be changed like this:
>
> diff --git i/daemon/libvirtd.c w/daemon/libvirtd.c
> index 589b32192e3d..3cc14ed20987 100644
> --- i/daemon/libvirtd.c
> +++ w/daemon/libvirtd.c
> @@ -1500,14 +1500,15 @@ int main(int argc, char **argv) {
>
>  cleanup:
>     virNetlinkEventServiceStopAll();
> -    virObjectUnref(remoteProgram);
> -    virObjectUnref(lxcProgram);
> -    virObjectUnref(qemuProgram);
> -    virObjectUnref(adminProgram);
>     virNetDaemonClose(dmn);
> -    virObjectUnref(srv);
> -    virObjectUnref(srvAdm);
> +
> +    if (driversInitialized) {
> +        driversInitialized = false;
> +        virStateCleanup();
> +    }
> +
>     virNetlinkShutdown();
> +
>     if (statuswrite != -1) {
>         if (ret != 0) {
>             /* Tell parent of daemon what failed */
> @@ -1521,6 +1522,14 @@ int main(int argc, char **argv) {
>     if (pid_file_fd != -1)
>         virPidFileReleasePath(pid_file, pid_file_fd);
>
> +    virObjectUnref(remoteProgram);
> +    virObjectUnref(lxcProgram);
> +    virObjectUnref(qemuProgram);
> +    virObjectUnref(adminProgram);
> +    virObjectUnref(srv);
> +    virObjectUnref(srvAdm);
> +    virObjectUnref(dmn);
> +
>     VIR_FREE(sock_file);
>     VIR_FREE(sock_file_ro);
>     VIR_FREE(sock_file_adm);
> @@ -1530,13 +1539,5 @@ int main(int argc, char **argv) {
>
>     daemonConfigFree(config);
>
> -    if (driversInitialized) {
> -        driversInitialized = false;
> -        virStateCleanup();
> -    }
> -    /* Now that the hypervisor shutdown inhibition functions that use
> -     * 'dmn' as a parameter are done, we can finally unref 'dmn' */
> -    virObjectUnref(dmn);
> -
>     return ret;
> }

ACK to ^this.

Erik


--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list
Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 6 months ago
[...]

> > So first let's agree on what should be done when virNetDaemonClose() is
> > called and the loop in virNetDaemonRun() exits.  My opinion it is:
> >
> > 1) Stop accepting new calls and services:
> >    virNetServerServiceToggle(false);
> >
> > 2) Wait for all the workers to finish:
> >    First part of virThreadPoolFree()
>
> This is a very tricky part which might get overly complex in order to be done
> properly. We currently don't allow the threads finish gracefully, IOW the
> connections have most likely been closed by the time the worker thread got to
> respond back to the client or the thread simply hung for some other reason in
> which case the daemon wouldn't even finish. So to give the threads a chance to
> finish properly, they need to be decommissioned while the event loop is still
> running otherwise there's no way for the worker to e.g. get an event back from
> QEMU's monitor, since there's noone to execute the appropriate callback for
> that anymore.

Actually, I was wrong here: qemuMonitorClose would wake the worker up with an
error, so it wouldn't be stuck indefinitely. However, I'd still prefer the
event loop to decommission the threads and do what Nikolay's qemuStateShutdown
proposal is doing, since the event loop knows about all of its handles, so
when it's finishing, it could simply run an appropriate cleanup callback on
all of the handles, thus closing the qemuMonitor as well.

> However, even if we did this and let's say we waited for a worker thread to
> finish its time-consuming job, you still have the problem I mentioned above
> with distinguishing really long time-consuming jobs from threads that are hung.
> So I'm not really sure whether any attempt to be terminating them gracefully is
> really worth the effort, given the circumstances, but at the same time I
> understand that the outcomes: the daemon might crash (it was terminating anyway)
> or it terminates properly but leaves a QEMU guest in an inconsistent way.
>
> To your idea of splitting virThreadPoolFree, since we most likely have to stop
> the threads within the event loop, thus broadcast 'quit' as well as wait for
> them, I don't see a point in decoupling the function to killing and joining
> threads and then doing a bunch of VIR_FREEs and vir(Mutex|Cond)Destroys, that's
> just scraping of some leftovers, I think it makes sense the way it's now, even
> though the naming is not ideal, I agree with that and we should come up with
> something else. But this is just a matter of taste and very unimportant, of
> course I don't have a strong argument about why the approach would be bad,
> because it's not, I just don't see it worth the effort.
>

Erik

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 5 months ago
Hi, all.

I looked over the thread for this particular patch again and found the resolution is that we:

1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
2. rewrite linked series [2] by introducing event loop closing callback (ok)

But there is no resolution on this patch itself, if I am not mistaken, and
it has not been pushed either.

LINKS

[1] [PATCH 0/5] libvirtd: Adjustments to startup and cleanup processing
https://www.redhat.com/archives/libvir-list/2017-November/msg00293.html

[2] [PATCH 0/4] libvirtd: fix hang on termination in qemu driver
https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html


Nikolay

On 30.10.2017 14:14, John Ferlan wrote:
> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
> 
> The problem is incorrect order of qemu driver shutdown and shutdown
> of netserver threads that serve client requests (thru qemu driver
> particularly).
> 
> Net server threads are shutdown upon dispose which is triggered
> by last daemon object unref at the end of main function. At the same
> time qemu driver is shutdown earlier in virStateCleanup. As a result
> netserver threads see invalid driver object in the middle of request
> processing.
> 
> Let's move shutting down netserver threads earlier to virNetDaemonClose.
> 
> Note: order of last daemon unref and virStateCleanup
> is introduced in 85c3a182 for a valid reason.
> 
> Signed-off-by: John Ferlan <jferlan@redhat.com>
> ---
>  src/rpc/virnetdaemon.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
> index 8c21414897..33bd8e3b06 100644
> --- a/src/rpc/virnetdaemon.c
> +++ b/src/rpc/virnetdaemon.c
> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>      virObjectLock(dmn);
>  
>      virHashForEach(dmn->servers, daemonServerClose, NULL);
> +    virHashRemoveAll(dmn->servers);
>  
>      virObjectUnlock(dmn);
>  }
> 

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 5 months ago
On Thu, Dec 14, 2017 at 02:58:30PM +0300, Nikolay Shirokovskiy wrote:
> Hi, all.
>
> I looked over thread for this particular patch again and found resolution is we:
>
> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
>
> But there is no resolution on this patch itself if I am not mistaken and
> it is not pushed too.

It's been a while and the thread has become fairly large and hard to track;
correct me if I'm wrong, but wouldn't the issue this patch is trying to solve
disappear once we go with a reworked version of [2]?

Erik

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 5 months ago

On 14.12.2017 15:09, Erik Skultety wrote:
> On Thu, Dec 14, 2017 at 02:58:30PM +0300, Nikolay Shirokovskiy wrote:
>> Hi, all.
>>
>> I looked over thread for this particular patch again and found resolution is we:
>>
>> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
>> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
>>
>> But there is no resolution on this patch itself if I am not mistaken and
>> it is not pushed too.
> 
> It's been a while and the thread became fairly large to track, correct me if
> I'm wrong but wouldn't the issue this patch is trying to solve disappear once
> we go with a reworked version of [2]?
> 

This way we can ask the RPC threads serving requests of some driver to finish,
but we cannot be sure they have actually finished if we do not join them
before calling virStateCleanup.

Nikolay

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 5 months ago

On 12/14/2017 06:58 AM, Nikolay Shirokovskiy wrote:
> Hi, all.
> 
> I looked over thread for this particular patch again and found resolution is we:
> 
> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
> 
> But there is no resolution on this patch itself if I am not mistaken and
> it is not pushed too.
> 
> LINKS
> 
> [1] [PATCH 0/5] libvirtd: Adjustments to startup and cleanup processing
> https://www.redhat.com/archives/libvir-list/2017-November/msg00293.html

The cover letter of this essentially states the cleanup is an
alternative based upon the review of "this" series' 3/3 patch (IOW: this
patch).

So I agree w/ Erik - the (last) memory strand that I have left on this
indicates that the virHashRemoveAll isn't necessary in virNetDaemonClose
because we've properly order the cleanup with the new patch 4 from [1].

> 
> [2] [PATCH 0/4] libvirtd: fix hang on termination in qemu driver
> https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
> 

I responded to this series' cover letter with:

https://www.redhat.com/archives/libvir-list/2017-November/msg00455.html

in order to attempt to move the conversations found in "this" series to
that series rather than clouding "this" series with information
belonging to that series.

So the question *now* is - did you find a condition/case with [1]
patches applied, but [2] patches not applied (IOW: current top) where
something isn't being Unref'd properly?

John

> 
> Nikolay
> 
> On 30.10.2017 14:14, John Ferlan wrote:
>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>
>> The problem is incorrect order of qemu driver shutdown and shutdown
>> of netserver threads that serve client requests (thru qemu driver
>> particularly).
>>
>> Net server threads are shutdown upon dispose which is triggered
>> by last daemon object unref at the end of main function. At the same
>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>> netserver threads see invalid driver object in the middle of request
>> processing.
>>
>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>
>> Note: order of last daemon unref and virStateCleanup
>> is introduced in 85c3a182 for a valid reason.
>>
>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>> ---
>>  src/rpc/virnetdaemon.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>> index 8c21414897..33bd8e3b06 100644
>> --- a/src/rpc/virnetdaemon.c
>> +++ b/src/rpc/virnetdaemon.c
>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>      virObjectLock(dmn);
>>  
>>      virHashForEach(dmn->servers, daemonServerClose, NULL);
>> +    virHashRemoveAll(dmn->servers);
>>  
>>      virObjectUnlock(dmn);
>>  }
>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 5 months ago
On 14.12.2017 15:31, John Ferlan wrote:
> 
> 
> On 12/14/2017 06:58 AM, Nikolay Shirokovskiy wrote:
>> Hi, all.
>>
>> I looked over thread for this particular patch again and found resolution is we:
>>
>> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
>> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
>>
>> But there is no resolution on this patch itself if I am not mistaken and
>> it is not pushed too.
>>
>> LINKS
>>
>> [1] [PATCH 0/5] libvirtd: Adjustments to startup and cleanup processing
>> https://www.redhat.com/archives/libvir-list/2017-November/msg00293.html
> 
> The cover letter of this essentially states the cleanup is an
> alternative based upon the review of "this" series' 3/3 patch (IOW: this
> patch).

Sorry, somehow I lost this important notice on my read. I thought [1]
was more about a valuable cleanup unrelated to this patch.

> 
> So I agree w/ Erik - the (last) memory strand that I have left on this
> indicates that the virHashRemoveAll isn't necessary in virNetDaemonClose
> because we've properly order the cleanup with the new patch 4 from [1].

Unfortunately it is not, and it looks like patch 4 has not been pushed either.
Anyway, it can't help finish the RPC threads in the right order relative to
hypervisor driver cleanup.

> 
>>
>> [2] [PATCH 0/4] libvirtd: fix hang on termination in qemu driver
>> https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
>>
> 
> I responded to this series' cover letter with:
> 
> https://www.redhat.com/archives/libvir-list/2017-November/msg00455.html
> 
> in order to attempt to move the conversations found in "this" series to
> that series rather than clouding "this" series with information
> belonging to that series.
> 
> So the question *now* is - did you find a condition/case with [1]
> patches applied, but [2] patches not applied (IOW: current top) where
> something isn't being Unref'd properly?

In a sense yes: virStateCleanup is still called before the RPC threads are
joined, which causes crashes. It is not fixed by [1].

Nikolay

> 
> John
> 
>>
>> Nikolay
>>
>> On 30.10.2017 14:14, John Ferlan wrote:
>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>
>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>> of netserver threads that serve client requests (thru qemu driver
>>> particularly).
>>>
>>> Net server threads are shutdown upon dispose which is triggered
>>> by last daemon object unref at the end of main function. At the same
>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>> netserver threads see invalid driver object in the middle of request
>>> processing.
>>>
>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>
>>> Note: order of last daemon unref and virStateCleanup
>>> is introduced in 85c3a182 for a valid reason.
>>>
>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>> ---
>>>  src/rpc/virnetdaemon.c | 1 +
>>>  1 file changed, 1 insertion(+)
>>>
>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>> index 8c21414897..33bd8e3b06 100644
>>> --- a/src/rpc/virnetdaemon.c
>>> +++ b/src/rpc/virnetdaemon.c
>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>      virObjectLock(dmn);
>>>  
>>>      virHashForEach(dmn->servers, daemonServerClose, NULL);
>>> +    virHashRemoveAll(dmn->servers);
>>>  
>>>      virObjectUnlock(dmn);
>>>  }
>>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 5 months ago

On 12/14/2017 07:43 AM, Nikolay Shirokovskiy wrote:
> On 14.12.2017 15:31, John Ferlan wrote:
>>
>>
>> On 12/14/2017 06:58 AM, Nikolay Shirokovskiy wrote:
>>> Hi, all.
>>>
>>> I looked over thread for this particular patch again and found resolution is we:
>>>
>>> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
>>> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
>>>
>>> But there is no resolution on this patch itself if I am not mistaken and
>>> it is not pushed too.
>>>
>>> LINKS
>>>
>>> [1] [PATCH 0/5] libvirtd: Adjustments to startup and cleanup processing
>>> https://www.redhat.com/archives/libvir-list/2017-November/msg00293.html
>>
>> The cover letter of this essentially states the cleanup is an
>> alternative based upon the review of "this" series' 3/3 patch (IOW: this
>> patch).
> 
> Sorry, somehow I lost this impornant notice on my read. I thought [1]
> was more about valueable but unrelated to this patch cleanup.
> 
>>
>> So I agree w/ Erik - the (last) memory strand that I have left on this
>> indicates that the virHashRemoveAll isn't necessary in virNetDaemonClose
>> because we've properly order the cleanup with the new patch 4 from [1].
> 
> Unfortunately it is not and looks like patch 4 is not pushed too. Anyway
> it can't help finishing RPC threads in rigth order to hypervisor drivers
> cleanup.
> 

Follow the series... Patches 3 & 4 were not accepted, thus patch 5
needed to be adjusted in order to perform the Unref's in the cleanup:
code rather than "inline" as patch 3 & 4 had done.

>>
>>>
>>> [2] [PATCH 0/4] libvirtd: fix hang on termination in qemu driver
>>> https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
>>>
>>
>> I responded to this series' cover letter with:
>>
>> https://www.redhat.com/archives/libvir-list/2017-November/msg00455.html
>>
>> in order to attempt to move the conversations found in "this" series to
>> that series rather than clouding "this" series with information
>> belonging to that series.
>>
>> So the question *now* is - did you find a condition/case with [1]
>> patches applied, but [2] patches not applied (IOW: current top) where
>> something isn't being Unref'd properly?
> 
> In a sense yes - virStateCleanup is still called before RPC threads are joined
> which causes crashes. It is not fixed by [1].
> 
> Nikolay
> 

Now you've lost me. What are the back traces, and how does one
reasonably reproduce?  Are you trying to advocate here for [2] to be
reviewed/accepted?

John

>>
>> John
>>
>>>
>>> Nikolay
>>>
>>> On 30.10.2017 14:14, John Ferlan wrote:
>>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>
>>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>>> of netserver threads that serve client requests (thru qemu driver
>>>> particularly).
>>>>
>>>> Net server threads are shutdown upon dispose which is triggered
>>>> by last daemon object unref at the end of main function. At the same
>>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>>> netserver threads see invalid driver object in the middle of request
>>>> processing.
>>>>
>>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>>
>>>> Note: order of last daemon unref and virStateCleanup
>>>> is introduced in 85c3a182 for a valid reason.
>>>>
>>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>>> ---
>>>>  src/rpc/virnetdaemon.c | 1 +
>>>>  1 file changed, 1 insertion(+)
>>>>
>>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>>> index 8c21414897..33bd8e3b06 100644
>>>> --- a/src/rpc/virnetdaemon.c
>>>> +++ b/src/rpc/virnetdaemon.c
>>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>>      virObjectLock(dmn);
>>>>  
>>>>      virHashForEach(dmn->servers, daemonServerClose, NULL);
>>>> +    virHashRemoveAll(dmn->servers);
>>>>  
>>>>      virObjectUnlock(dmn);
>>>>  }
>>>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 5 months ago

On 14.12.2017 16:09, John Ferlan wrote:
> 
> 
> On 12/14/2017 07:43 AM, Nikolay Shirokovskiy wrote:
>> On 14.12.2017 15:31, John Ferlan wrote:
>>>
>>>
>>> On 12/14/2017 06:58 AM, Nikolay Shirokovskiy wrote:
>>>> Hi, all.
>>>>
>>>> I looked over thread for this particular patch again and found resolution is we:
>>>>
>>>> 1. make a more sane cleanup order in libvirtd's main function (already done by [1]).
>>>> 2. rewrite linked series [2] by introducing event loop closing callback (ok)
>>>>
>>>> But there is no resolution on this patch itself if I am not mistaken and
>>>> it is not pushed too.
>>>>
>>>> LINKS
>>>>
>>>> [1] [PATCH 0/5] libvirtd: Adjustments to startup and cleanup processing
>>>> https://www.redhat.com/archives/libvir-list/2017-November/msg00293.html
>>>
>>> The cover letter of this essentially states the cleanup is an
>>> alternative based upon the review of "this" series' 3/3 patch (IOW: this
>>> patch).
>>
>> Sorry, somehow I lost this impornant notice on my read. I thought [1]
>> was more about valueable but unrelated to this patch cleanup.
>>
>>>
>>> So I agree w/ Erik - the (last) memory strand that I have left on this
>>> indicates that the virHashRemoveAll isn't necessary in virNetDaemonClose
>>> because we've properly order the cleanup with the new patch 4 from [1].
>>
>> Unfortunately it is not and looks like patch 4 is not pushed too. Anyway
>> it can't help finishing RPC threads in rigth order to hypervisor drivers
>> cleanup.
>>
> 
> Follow the series... Patches 3 & 4 were not accepted, thus patch 5
> needed to be adjusted in order to perform the Unref's in the cleanup:
> code rather than "inline" as patch 3 & 4 had done.
> 
>>>
>>>>
>>>> [2] [PATCH 0/4] libvirtd: fix hang on termination in qemu driver
>>>> https://www.redhat.com/archives/libvir-list/2017-October/msg01134.html
>>>>
>>>
>>> I responded to this series' cover letter with:
>>>
>>> https://www.redhat.com/archives/libvir-list/2017-November/msg00455.html
>>>
>>> in order to attempt to move the conversations found in "this" series to
>>> that series rather than clouding "this" series with information
>>> belonging to that series.
>>>
>>> So the question *now* is - did you find a condition/case with [1]
>>> patches applied, but [2] patches not applied (IOW: current top) where
>>> something isn't being Unref'd properly?
>>
>> In a sense yes - virStateCleanup is still called before RPC threads are joined
>> which causes crashes. It is not fixed by [1].
>>
>> Nikolay
>>
> 
> Now you've lost me. What are the back traces? and now does one
> reasonably reproduce?  Are you trying to advocate here for [2] to be
> reviewed/accepted?

Sorry for that. Unfortunately the back trace was only emailed in the first
version of the series [3]. The reproducer is there too.

I'm not here for [2]. It addresses a different issue. However I want to 
make some progress on that series too.

[3] https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html

> 
> John
> 
>>>
>>> John
>>>
>>>>
>>>> Nikolay
>>>>
>>>> On 30.10.2017 14:14, John Ferlan wrote:
>>>>> From: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>>
>>>>> The problem is incorrect order of qemu driver shutdown and shutdown
>>>>> of netserver threads that serve client requests (thru qemu driver
>>>>> particularly).
>>>>>
>>>>> Net server threads are shutdown upon dispose which is triggered
>>>>> by last daemon object unref at the end of main function. At the same
>>>>> time qemu driver is shutdown earlier in virStateCleanup. As a result
>>>>> netserver threads see invalid driver object in the middle of request
>>>>> processing.
>>>>>
>>>>> Let's move shutting down netserver threads earlier to virNetDaemonClose.
>>>>>
>>>>> Note: order of last daemon unref and virStateCleanup
>>>>> is introduced in 85c3a182 for a valid reason.
>>>>>
>>>>> Signed-off-by: John Ferlan <jferlan@redhat.com>
>>>>> ---
>>>>>  src/rpc/virnetdaemon.c | 1 +
>>>>>  1 file changed, 1 insertion(+)
>>>>>
>>>>> diff --git a/src/rpc/virnetdaemon.c b/src/rpc/virnetdaemon.c
>>>>> index 8c21414897..33bd8e3b06 100644
>>>>> --- a/src/rpc/virnetdaemon.c
>>>>> +++ b/src/rpc/virnetdaemon.c
>>>>> @@ -881,6 +881,7 @@ virNetDaemonClose(virNetDaemonPtr dmn)
>>>>>      virObjectLock(dmn);
>>>>>  
>>>>>      virHashForEach(dmn->servers, daemonServerClose, NULL);
>>>>> +    virHashRemoveAll(dmn->servers);
>>>>>  
>>>>>      virObjectUnlock(dmn);
>>>>>  }
>>>>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 4 months ago
[...]

>>
>> Now you've lost me. What are the back traces? and now does one
>> reasonably reproduce?  Are you trying to advocate here for [2] to be
>> reviewed/accepted?
> 
> Sorry for that. Unfortunately back trace was only emailed in the first version
> of series [3]. The reproducer is there too.
> 
> I'm not here for [2]. It addresses a different issue. However I want to 
> make some progress on that series too.
> 
> [3] https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html
> 


Ahh, sigh...  I should have been more "proactive" in ensuring that I had
gone back through the above test when two of the patches I had posted
related to the changes in this code were not ACK'd. Those two patches are:

https://www.redhat.com/archives/libvir-list/2017-November/msg00296.html

and

https://www.redhat.com/archives/libvir-list/2017-November/msg00297.html

What those did was perform the Unref from libvirtd as soon as AddServer or
AddProgram added the object to the respective hash table (and of course
incremented the refcount). The result is that the Unref for each was done
*prior* to virNetDaemonClose and prior to virStateCleanup, which solves the
problem of the dispose running late, after Close.

What jogged my memory on this was an OFTC IRC posting today regarding a
virt-manager crash (see the pastebin from cbosdonnat at
http://paste.opensuse.org/44522138) as a result of incorrect ordering of
the daemon unref and running virStateCleanup.  A copy/paste of the
subsequent conversation:

<cbosdonnat> I'm seeing a crash here when libvirtd is stopping and a
client (namely virt-manager) calls
networkConnectNetworkEventDeregisterAny().... is it OK to just goto
cleanup if the driver is NULL in that function?
<danpb> cbosdonnat: you mean the driver is being torn down in
virStateCleanup ?  and an API is still allowed to be run ?
<cbosdonnat> danpb, looks like that indeed
<cbosdonnat> where would the API call be rejected otherwise?
<danpb> that seems a much more general problem - we should not call
virStateCleanup until we've killed off all the API worker threads
<danpb> i kind of thought we already did that
<danpb> is it perhaps that we're killing off active clients and as part
of that we're cleaning up events they have registered ?
<cbosdonnat> danpb, here is the backtrace of the crash:
http://paste.opensuse.org/44522138
<cbosdonnat> so seems like that indeed
<danpb> oh yeah, we're killing off clients
<danpb> we need to explicitly be able to kill off clients before
shutting down drivers
<danpb> surprised we've not seen this before to be honest
<danpb> i wonder if some refactoring  has introduced this problem recently
<cbosdonnat> are those done in separate threads?
<danpb> i can't rememeber to be honest
<cbosdonnat> I'll chase the order of these then
<cbosdonnat> danpb, the state cleanup happens before the daemon unref...
looks like we'll need to purge the daemon before that
<cbosdonnat> danpb, would that be completely silly to move the client
closing code to virNetServerClose()?
<danpb> that seems pretty sensible to me actually
<danpb> logically they feel related actions
<cbosdonnat> we can keep the unref of them where it is though if it
makes more sense
<cbosdonnat> yep

So yes, the "refactoring" that caused this is commit id '2f3054c22'.

As a test, I altered the code to perform the all the Unref's before the
virStateCleanup and that "resolved" the virt-manager crash; however,
that would seem to conflict with commit id '85c3a1820' which noted that
performing the last Unref(dmn) prior to virStateCleanup would cause
issues for daemonInhibitCallback.

So I moved the Unref(dmn) after virStateCleanup... That of course led to
the virt-manager issue.  So I figured I'd try adding this patch in, and
that then fixed the virt-manager crash (obviously, as we'd be Unref'ing
all those servers we added).

So after all that, if I add that sleep into domstats as shown in the [3]
link above, I don't get a crash, but it also seems to cause SIGTERM to be
ignored, at least the first time through. Running a second SIGTERM does
the proper kill.

So short story made really long, I think the best course of action will
be to add this patch and reorder the Unref()'s (adminProgram thru srv,
but not dmn). It seems to resolve these corner cases, but I'm also open
to other suggestions. Still need to think about it some more too before
posting any patches.


John

[...]

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Cedric Bosdonnat 7 years, 4 months ago
Hi John,

On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
> So I moved the Unref(dmn) after virStateCleanup... That of course led to
> the virt-manager issue.  So I figured I'd give adding this patch in and
> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> all those servers we added).

Seems like I forgot to git-send my patch yesterday evening: here it is:

https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html

That fixes the virt-manager crash at least.
--
Cedric

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 4 months ago
On Thu, Dec 21, 2017 at 09:39:16AM +0100, Cedric Bosdonnat wrote:
> Hi John,
>
> On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
> > So I moved the Unref(dmn) after virStateCleanup... That of course led to
> > the virt-manager issue.  So I figured I'd give adding this patch in and
> > that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> > all those servers we added).
>
> Seems like I forgot to git-send my patch yesterday evening: here it is:
>
> https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html
>
> That fixes the virt-manager crash at least.

Hi, so the patch looks good, at first glance it makes complete sense to close
the client connections when closing the server and unreffing the objects when
disposing of the server object. However, since this and the other related
issues have been bugging us for quite some time I'm not going to say ACK just
yet, I'd like to think about it for a bit - but if someone else already has and
they feel brave enough :P, then I'm not going to block this from going in.

Erik

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 21.12.2017 11:49, Erik Skultety wrote:
> On Thu, Dec 21, 2017 at 09:39:16AM +0100, Cedric Bosdonnat wrote:
>> Hi John,
>>
>> On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
>>> So I moved the Unref(dmn) after virStateCleanup... That of course led to
>>> the virt-manager issue.  So I figured I'd give adding this patch in and
>>> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
>>> all those servers we added).
>>
>> Seems like I forgot to git-send my patch yesterday evening: here it is:
>>
>> https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html
>>
>> That fixes the virt-manager crash at least.
> 
> Hi, so the patch looks good, at first glance it makes complete sense to close
> the client connections when closing the server and unreffing the objects when
> disposing of the server object. However, since this and the other related
> issues have been bugging us for quite some time I'm not going to say ACK just
> yet, I'd like to think about it for a bit - but if someone else already has and
> they feel brave enough :P, then I'm not going to block this to go in.
> 
> Erik
> 

Patch looks good to me too. But still the original "libvirtd: fix crash on termination"
fixes another issue and, if applied, fixes the "virt-manager issue" as well, as John
figured out.

Nikolay

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Erik Skultety 7 years, 4 months ago
On Thu, Dec 21, 2017 at 12:48:44PM +0300, Nikolay Shirokovskiy wrote:
>
>
> On 21.12.2017 11:49, Erik Skultety wrote:
> > On Thu, Dec 21, 2017 at 09:39:16AM +0100, Cedric Bosdonnat wrote:
> >> Hi John,
> >>
> >> On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
> >>> So I moved the Unref(dmn) after virStateCleanup... That of course led to
> >>> the virt-manager issue.  So I figured I'd give adding this patch in and
> >>> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> >>> all those servers we added).
> >>
> >> Seems like I forgot to git-send my patch yesterday evening: here it is:
> >>
> >> https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html
> >>
> >> That fixes the virt-manager crash at least.
> >
> > Hi, so the patch looks good, at first glance it makes complete sense to close
> > the client connections when closing the server and unreffing the objects when
> > disposing of the server object. However, since this and the other related
> > issues have been bugging us for quite some time I'm not going to say ACK just
> > yet, I'd like to think about it for a bit - but if someone else already has and
> > they feel brave enough :P, then I'm not going to block this to go in.
> >
> > Erik
> >
>
> Patch looks good to me too. But still original "libvirtd: fix crash on termination"
> fixes another issue and if applied fixes "virt-manager issue" as well as John
> figured out.

Finally I'm back on track with this (sorry for it taking so long), although
you're right about this, it's not the correct fix, it's just a byproduct of
your patch, however, the whole thing about closing connections and releasing
memory is a bit of a mess. For example, right now, we only dispose of service
objs (but we don't close them, we do that in virNetServerClose), but we both
close and dispose of the client objs. Another thing, we toggle the service to
stop accepting any new connections, but that's irrelevant at that point because
we've closed them already by that time as a product of calling
virNetDaemonClose->virNetServerClose->virNetServerServiceClose - so that
should be removed...I drifted a bit here, anyway, we need to make a clear
distinction when we're releasing memory and when we're shutting down some kind
of service - sometimes both can be done at the same time (see below), however
it's very clear we can't do it here. Your issue is that Close currently only
closes the sockets (this reflects my point of "shutting down a service") but
does nothing with the threadpool connected to the server, thus leaving the
worker threads to continue running and executing APIs the results of which they
shouldn't even be able to return back to the client, since we're shutting down.
Now the thing with threadpool is that both memory release and teardown are
merged into one "object disposal" operation and therefore done as part of
virNetServerDispose. Since I understand a removal from the hash table as a
memory release operation, we should not be calling virHashRemoveAll from
virNetDaemonClose. Now, I see 2 options:

1) split virThreadPoolFree into 2 functions, one which only broadcasts the
"die" message and joins the threads (or waits for them in this case...) and the
other releasing the resources - can't say I'm a fan of this one

2) call virThreadPoolFree from virNetServerClose instead...

None of those approaches is ideal, but I can't seem to think of anything
better at the moment.

I'm open to discuss any other suggestions.
Erik

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 21.12.2017 15:14, Erik Skultety wrote:
> On Thu, Dec 21, 2017 at 12:48:44PM +0300, Nikolay Shirokovskiy wrote:
>>
>>
>> On 21.12.2017 11:49, Erik Skultety wrote:
>>> On Thu, Dec 21, 2017 at 09:39:16AM +0100, Cedric Bosdonnat wrote:
>>>> Hi John,
>>>>
>>>> On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
>>>>> So I moved the Unref(dmn) after virStateCleanup... That of course led to
>>>>> the virt-manager issue.  So I figured I'd give adding this patch in and
>>>>> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
>>>>> all those servers we added).
>>>>
>>>> Seems like I forgot to git-send my patch yesterday evening: here it is:
>>>>
>>>> https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html
>>>>
>>>> That fixes the virt-manager crash at least.
>>>
>>> Hi, so the patch looks good, at first glance it makes complete sense to close
>>> the client connections when closing the server and unreffing the objects when
>>> disposing of the server object. However, since this and the other related
>>> issues have been bugging us for quite some time I'm not going to say ACK just
>>> yet, I'd like to think about it for a bit - but if someone else already has and
>>> they feel brave enough :P, then I'm not going to block this to go in.
>>>
>>> Erik
>>>
>>
>> Patch looks good to me too. But still original "libvirtd: fix crash on termination"
>> fixes another issue and if applied fixes "virt-manager issue" as well as John
>> figured out.
> 
> Finally I'm back on track with this (sorry for it taking so long), although
> you're right about this, it's not the correct fix, it's just a byproduct of
> your patch, however, the whole thing about closing connections and releasing
> memory is a bit of a mess. For example, right now, we only dispose of service
> objs (but we don't close them, we do that in virNetServerClose), but we both
> close and dispose of the client objs. Another thing, we toggle the service to
> stop accepting any new connections, but that's irrelevant at that point because
> we've closed them already by that time as a product of calling
> virNetDaemonClose->virNetServerClose->virNetServerServiceClose - so that
> should be removed...I drifted a bit here, anyway, we need to make a clear
> distinction when we're releasing memory and when we're shutting down some kind
> of service - sometimes both can be done at the same time (see below), however
> it's very clear we can't do it here. Your issue is that Close currently only
> closes the sockets (this reflects my point of "shutting down a service") but
> does nothing with the threadpool connected to the server, thus leaving the
> worker threads to continue running and executing APIs the results of which they
> shouldn't even be able to return back to the client, since we're shutting down.

To me this is not the first-order problem. On daemon shutdown clients will
most probably receive some kind of error anyway.

But we should finish the worker threads before we clean up the drivers.

> Now the thing with threadpool is that both memory release and teardown are
> merged into one "object disposal" operation and therefore done as part of
> virNetServerDispose. Since I understand a removal from the hash table as a
> memory release operation, we should not be calling virHashRemoveAll from
> virNetDaemonClose. Now, I see 2 options:
> 
> 1) split virThreadPoolFree into 2 functions, one which only broadcasts the
> "die" message and joins the threads (or waits for them in this case...) and the
> other releasing the resources - can't say I'm a fan of this one
> 
> 2) call virThreadPoolFree from virNetServerClose instead...
> 
> None of those approaches is ideal, but I can't seem to think off anything
> better at the moment.
> 
> I'm open to discuss any other suggestions.
> Erik
> 

The 2nd option looks good to me. The netserver is finished upon close, so there
are no workers anymore. And we are not going to restart/reuse the netserver
after that and reinitialize the workers (and if we did, we would change the
thread pool as well to wait for the threads in the pool on stop rather than
finishing them, etc.). So it seems quite sane to me to free the thread pool on
close. Why are we not freeing the other internal objects, why are we not
destroying the locks? Well, other threads could still be using this object, so
to keep things simple we keep some basic internal objects like locks, but we
could dispose of the clients and services on close as well.

By the way, we will then need to check whether workers != NULL in
virNetServerGetThreadPoolParameters, for example. (Quite a possible situation:
somebody queries the thread pool parameters of the main server through the
admin server.)

Nikolay

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Daniel P. Berrange 7 years, 4 months ago
On Fri, Dec 22, 2017 at 10:52:45AM +0300, Nikolay Shirokovskiy wrote:
> 
> 
> On 21.12.2017 15:14, Erik Skultety wrote:
> > On Thu, Dec 21, 2017 at 12:48:44PM +0300, Nikolay Shirokovskiy wrote:
> >>
> >>
> >> On 21.12.2017 11:49, Erik Skultety wrote:
> >>> On Thu, Dec 21, 2017 at 09:39:16AM +0100, Cedric Bosdonnat wrote:
> >>>> Hi John,
> >>>>
> >>>> On Wed, 2017-12-20 at 16:52 -0500, John Ferlan wrote:
> >>>>> So I moved the Unref(dmn) after virStateCleanup... That of course led to
> >>>>> the virt-manager issue.  So I figured I'd give adding this patch in and
> >>>>> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> >>>>> all those servers we added).
> >>>>
> >>>> Seems like I forgot to git-send my patch yesterday evening: here it is:
> >>>>
> >>>> https://www.redhat.com/archives/libvir-list/2017-December/msg00739.html
> >>>>
> >>>> That fixes the virt-manager crash at least.
> >>>
> >>> Hi, so the patch looks good, at first glance it makes complete sense to close
> >>> the client connections when closing the server and unreffing the objects when
> >>> disposing of the server object. However, since this and the other related
> >>> issues have been bugging us for quite some time I'm not going to say ACK just
> >>> yet, I'd like to think about it for a bit - but if someone else already has and
> >>> they feel brave enough :P, then I'm not going to block this to go in.
> >>>
> >>> Erik
> >>>
> >>
> >> Patch looks good to me too. But still original "libvirtd: fix crash on termination"
> >> fixes another issue and if applied fixes "virt-manager issue" as well as John
> >> figured out.
> > 
> > Finally I'm back on track with this (sorry for it taking so long), although
> > you're right about this, it's not the correct fix, it's just a byproduct of
> > your patch, however, the whole thing about closing connections and releasing
> > memory is a bit of a mess. For example, right now, we only dispose of service
> > objs (but we don't close them, we do that in virNetServerClose), but we both
> > close and dispose of the client objs. Another thing, we toggle the service to
> > stop accepting any new connections, but that's irrelevant at that point because
> > we've closed them already by that time as a product of calling
> > virNetDaemonClose->virNetServerClose->virNetServerServiceClose - so that
> > should be removed...I drifted a bit here, anyway, we need to make a clear
> > distinction when we're releasing memory and when we're shutting down some kind
> > of service - sometimes both can be done at the same time (see below), however
> > it's very clear we can't do it here. Your issue is that Close currently only
> > closes the sockets (this reflects my point of "shutting down a service") but
> > does nothing with the threadpool connected to the server, thus leaving the
> > worker threads to continue running and executing APIs the results of which they
> > shouldn't even be able to return back to the client, since we're shutting down.
> 
> To me it is not the first order problem. On daemon shutdown clients will anyway
> most probably receive some kind of error. 
> 
> But we should finish worker threads before we cleanup drivers.

Yes, that is absolutely critical. With current situation where we leave clients
connected & worker threads running, while we cleanup drivers, we're pretty much
guaranteed a tonne of bizarre crashes.

For killing clients we should simply make virNetServerClose() tear down all
client connections, at the same time as it tears down listener sockets. There
is no point separating those actions, as they're conceptually related.

> > 1) split virThreadPoolFree into 2 functions, one which only broadcasts the
> > "die" message and joins the threads (or waits for them in this case...) and the
> > other releasing the resources - can't say I'm a fan of this one
> > 2) call virThreadPoolFree from virNetServerClose instead...
> > 
> > None of those approaches is ideal, but I can't seem to think off anything
> > better at the moment.

Whether the workers threads are running or not is actually not the key
issue here.

What causes the crashes is whether the worker threads are processing jobs
or not.  IOW, all we need to do is first purge all queued jobs, and then
wait for the threads to finish any jobs they already had. The threads
themselves can stay running. This is a conceptually useful operation beyond
just the shutdown scenario.

IOW, we should create a virThreadPoolDrain() method to purge all jobs
and wait for idle.


Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 4 months ago
[...]

>>
>> Patch looks good to me too. But still original "libvirtd: fix crash on termination"
>> fixes another issue and if applied fixes "virt-manager issue" as well as John
>> figured out.

Not sure there's enough coffee in the house this morning to make me
"awake enough" for all this stuff, especially just before a holiday
break. But perhaps better now than after the new year... sooo....

> 
> Finally I'm back on track with this (sorry for it taking so long), although
> you're right about this, it's not the correct fix, it's just a byproduct of
> your patch, however, the whole thing about closing connections and releasing
> memory is a bit of a mess. For example, right now, we only dispose of service
> objs (but we don't close them, we do that in virNetServerClose), but we both
> close and dispose of the client objs. Another thing, we toggle the service to
> stop accepting any new connections, but that's irrelevant at that point because
> we've closed them already by that time as a product of calling
> virNetDaemonClose->virNetServerClose->virNetServerServiceClose - so that

virNetServerServiceClose could call virNetServerServiceToggle(,false),
even though it probably doesn't matter at this point.

Makes me wonder why virNetServerUpdateServicesLocked wasn't called in
virNetServerDispose instead of open coding the nservices loop (sigh).

> should be removed...I drifted a bit here, anyway, we need to make a clear
> distinction when we're releasing memory and when we're shutting down some kind
> of service - sometimes both can be done at the same time (see below), however
> it's very clear we can't do it here. Your issue is that Close currently only
> closes the sockets (this reflects my point of "shutting down a service") but
> does nothing with the threadpool connected to the server, thus leaving the
> worker threads to continue running and executing APIs the results of which they
> shouldn't even be able to return back to the client, since we're shutting down.
> Now the thing with threadpool is that both memory release and teardown are
> merged into one "object disposal" operation and therefore done as part of
> virNetServerDispose. Since I understand a removal from the hash table as a
> memory release operation, we should not be calling virHashRemoveAll from
> virNetDaemonClose. Now, I see 2 options:
> 
> 1) split virThreadPoolFree into 2 functions, one which only broadcasts the
> "die" message and joins the threads (or waits for them in this case...) and the
> other releasing the resources - can't say I'm a fan of this one
> 

Kind of a "virThreadPoolStop" type operation...


> 2) call virThreadPoolFree from virNetServerClose instead...
> 
> None of those approaches is ideal, but I can't seem to think off anything
> better at the moment.

I like 2 better, but it doesn't fix the problem of a long-running thread
(such as GetAllDomainStats), it just moves the cheese.  Although I have
a feeling the virStateShutdown series Nikolay is promoting may solve
that issue.

It's really a conundrum w/r/t how much time to spend on this, especially
if the short/long term goal is a shim for libvirtd (e.g. libvirt_qemu),
which will move the cheese even more.

I'm going to move away from this for now, maybe a fresh look will help.
Right now I'm not sure I can focus on this with the other threads I'm
involved in.

John


> 
> I'm open to discuss any other suggestions.
> Erik
> 
> 

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 21.12.2017 17:59, John Ferlan wrote:
> [...]
> 
>>>
>>> Patch looks good to me too. But still original "libvirtd: fix crash on termination"
>>> fixes another issue and if applied fixes "virt-manager issue" as well as John
>>> figured out.
> 
> Not sure there's enough coffee in the house this morning to make me
> "awake enough" for all this stuff, especially just before a holiday
> break. But perhaps better now than after the new year... sooo....
> 
>>
>> Finally I'm back on track with this (sorry for it taking so long), although
>> you're right about this, it's not the correct fix, it's just a byproduct of
>> your patch, however, the whole thing about closing connections and releasing
>> memory is a bit of a mess. For example, right now, we only dispose of service
>> objs (but we don't close them, we do that in virNetServerClose), but we both
>> close and dispose of the client objs. Another thing, we toggle the service to
>> stop accepting any new connections, but that's irrelevant at that point because
>> we've closed them already by that time as a product of calling
>> virNetDaemonClose->virNetServerClose->virNetServerServiceClose - so that
> 
> virNetServerServiceClose could call virNetServerServiceToggle(,false),
> even though it probably doesn't matter at this point.
> 
> Makes me wonder why virNetServerUpdateServicesLocked wasn't called in
> virNetServerDispose instead of open coding the nservices loop (sigh).
> 
>> should be removed...I drifted a bit here, anyway, we need to make a clear
>> distinction when we're releasing memory and when we're shutting down some kind
>> of service - sometimes both can be done at the same time (see below), however
>> it's very clear we can't do it here. Your issue is that Close currently only
>> closes the sockets (this reflects my point of "shutting down a service") but
>> does nothing with the threadpool connected to the server, thus leaving the
>> worker threads to continue running and executing APIs the results of which they
>> shouldn't even be able to return back to the client, since we're shutting down.
>> Now the thing with threadpool is that both memory release and teardown are
>> merged into one "object disposal" operation and therefore done as part of
>> virNetServerDispose. Since I understand a removal from the hash table as a
>> memory release operation, we should not be calling virHashRemoveAll from
>> virNetDaemonClose. Now, I see 2 options:
>>
>> 1) split virThreadPoolFree into 2 functions, one which only broadcasts the
>> "die" message and joins the threads (or waits for them in this case...) and the
>> other releasing the resources - can't say I'm a fan of this one
>>
> 
> Kind of a "virThreadPoolStop" type operation...
> 
> 
>> 2) call virThreadPoolFree from virNetServerClose instead...
>>
>> None of those approaches is ideal, but I can't seem to think off anything
>> better at the moment.
> 
> I like 2 better, but it doesn't fix the problem of a long running thread
> (such as GetAllDomainStats), it just moves the cheese.  Although I have
> a feeling the virStateShutdown series Nikolay is promoting may solve
> that issue.

No, no. The problem will be fixed as I mentioned in a different reply.
It just cannot be solved solely by virHashRemoveAll with the current
unref ordering in libvirtd.c.

> 
> It's really a conundrum w/r/t how much time to spend on this especially
> if the short/long term goal is a shims for libvirtd (e.g. libvirt_qemu)
> which will move the cheese even more.
> 
> I'm going to move away from this for now, maybe a fresh look will help.
> Right now I'm not sure I can focus on this with the other threads I'm
> involved in.
> 
> John
> 
> 
>>
>> I'm open to discuss any other suggestions.
>> Erik
>>
>>

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 21.12.2017 00:52, John Ferlan wrote:
> 
> [...]
> 
>>>
>>> Now you've lost me. What are the back traces? and now does one
>>> reasonably reproduce?  Are you trying to advocate here for [2] to be
>>> reviewed/accepted?
>>
>> Sorry for that. Unfortunately back trace was only emailed in the first version
>> of series [3]. The reproducer is there too.
>>
>> I'm not here for [2]. It addresses a different issue. However I want to 
>> make some progress on that series too.
>>
>> [3] https://www.redhat.com/archives/libvir-list/2017-September/msg01006.html
>>
> 
> 
> Ahh, sigh...  I should have been more "proactive" in ensuring that I had
> gone back through the above test when two of the patches I had posted
> related to the changes in this code were not ACK'd. Those two patches are:
> 
> https://www.redhat.com/archives/libvir-list/2017-November/msg00296.html
> 
> and
> 
> https://www.redhat.com/archives/libvir-list/2017-November/msg00297.html
> 
> What those did was perform the Unref from libvirtd as soon as the
> AddServer or AddProgram added to the respective hash table (and of
> course incr the refcnt). The result is the Unref for each was done
> *prior* to the virNetDaemonClose and prior to the virStateCleanup which
> solves the late run after Close when Dispose runs.
> 
> What jiggled my memory on this was today there was an OFTC IRC posting
> regarding a virt-manager crash (see pastebin from cbosdonnat at
> http://paste.opensuse.org/44522138) as a result of incorrect ordering of
> daemon unref and running virStateCleanup.  A copy/paste of the
> subsequent conversation is:
> 
> <cbosdonnat> I'm seeing a crash here when libvirtd is stopping and a
> client (namely virt-manager) calls
> networkConnectNetworkEventDeregisterAny().... is it OK to just goto
> cleanup if the driver is NULL in that function?
> <danpb> cbosdonnat: you mean the driver is being torn down in
> virStateCleanup ?  and an API is still allowed to be run ?
> <cbosdonnat> danpb, looks like that indeed
> <cbosdonnat> where would the API call be rejected otherwise?
> <danpb> that seems a much more general problem - we should not call
> virStateCleanup until we've killed off all the API worker threads
> <danpb> i kind of thought we already did that
> <danpb> is it perhaps that we're killing off active clients and as part
> of that we're cleaning up events they have registered ?
> <cbosdonnat> danpb, here is the backtrace of the crash:
> http://paste.opensuse.org/44522138
> <cbosdonnat> so seems like that indeed
> <danpb> oh yeah, we're killing off clients
> <danpb> we need to explicitly be able to kill off clients before
> shutting down drivers
> <danpb> surprised we've not seen this before to be honest
> <danpb> i wonder if some refactoring  has introduced this problem recently
> <cbosdonnat> are those done in separate threads?
> <danpb> i can't rememeber to be honest
> <cbosdonnat> I'll chase the order of these then
> <cbosdonnat> danpb, the state cleanup happens before the daemon unref...
> looks like we'll need to purge the daemon before that
> <cbosdonnat> danpb, would that be completely silly to move the client
> closing code to virNetServerClose()?
> <danpb> that seems pretty sensible to me actually
> <danpb> logically they feel related actions
> <cbosdonnat> we can keep the unref of them where it is though if it
> makes more sense
> <cbosdonnat> yep
> 
> So yes, the "refactoring" that caused this is commit id '2f3054c22'.
> 
> As a test, I altered the code to perform the all the Unref's before the
> virStateCleanup and that "resolved" the virt-manager crash; however,
> that would seem to conflict with commit id '85c3a1820' which noted that
> performing the last Unref(dmn) prior to virStateCleanup would cause
> issues for daemonInhibitCallback.
> 
> So I moved the Unref(dmn) after virStateCleanup... That of course led to
> the virt-manager issue.  So I figured I'd give adding this patch in and
> that then fixed the virt-manager crash (obviously as we'd be Unref'ing
> all those servers we added).
> 
> So after all that if I add that sleep into domstats as shown in the [3]
> link from above - I don't get a crash, but it also seems to cause the
> SIGTERM to be ignored at least the first time through. Running a second
> SIGTERM does the proper kill.
> 
> So short story made really long, I think the best course of action will
> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
> but not dmn). It seems to resolve these corner cases, but I'm also open
> to other suggestions. Still need to think about it some more too before
> posting any patches.
> 
> 
Hi.

I haven't grasped the whole picture yet, but I've managed to find out what
triggered the crash. It is not 2f3054c22, where you reordered the unrefs, but
1fd1b766105, which moved event unregistering from netserver client closing to
netserver client disposing. Before 1fd1b766105 we don't have the crash
because the clients simply do not get disposed.

As to fixing the crash with this patch, I think it is a coincidence. I want to
dispose of the netservers early in order to join the rpc threads, and it turns
out that disposing also closes the clients, and this fixes the problem.

Nikolay

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 4 months ago
[...]

>>
>> So short story made really long, I think the best course of action will
>> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
>> but not dmn). It seems to resolve these corner cases, but I'm also open
>> to other suggestions. Still need to think about it some more too before
>> posting any patches.
>>
>>
> Hi.
> 
> I'm not grasp the whole picture yet but I've managed to find out what
> triggered the crash. It is not 2f3054c22 where you reordered unrefs but
> 1fd1b766105 which moves events unregistering from netserver client closing to
> netservec client disposing. Before 1fd1b766105 we don't have crash
> because clients simply do not get disposed.

Oh yeah, that one....  But considering Erik's most recent response in
this overall thread vis-a-vis the separation of "close" vs. "dispose"
and the timing of each w/r/t Unref and Free, I think having the call to
remoteClientFreePrivateCallbacks in remoteClientCloseFunc is perhaps
better than in remoteClientFreeFunc.

>  
> As to fixing the crash with this patch I thinks its is coincidence. I want
> do dispose netservers early to join rpc threads and it turns out that
> disposing also closing clients too and this fixes the problem.
> 
> Nikolay
> 

With Cedric's patch in place, the virt-manager client issue is fixed. So
that's goodness.

If I then add the sleep (or usleep) into qemuConnectGetAllDomainStats as
noted in what started this all, then I can either get libvirtd to crash
dereferencing a NULL driver pointer or (my favorite) hang with two
threads stuck waiting:

(gdb) t a a bt

Thread 5 (Thread 0x7fffe535b700 (LWP 15568)):
#0  0x00007ffff3dc909d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007ffff3dc1e23 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2  0x00007ffff7299a15 in virMutexLock (m=<optimized out>)
    at util/virthread.c:89
#3  0x00007fffc760621e in qemuDriverLock (driver=0x7fffbc190510)
    at qemu/qemu_conf.c:100
#4  virQEMUDriverGetConfig (driver=driver@entry=0x7fffbc190510)
    at qemu/qemu_conf.c:1002
#5  0x00007fffc75dfa89 in qemuDomainObjBeginJobInternal (
    driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
    job=job@entry=QEMU_JOB_QUERY,
asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE)
    at qemu/qemu_domain.c:4690
#6  0x00007fffc75e3b2b in qemuDomainObjBeginJob (
    driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
    job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:4842
#7  0x00007fffc764f744 in qemuConnectGetAllDomainStats
(conn=0x7fffb80009a0,
    doms=<optimized out>, ndoms=<optimized out>, stats=<optimized out>,
    retStats=0x7fffe535aaf0, flags=<optimized out>) at
qemu/qemu_driver.c:20219
#8  0x00007ffff736430a in virDomainListGetStats (doms=0x7fffa8000950,
stats=0,
    retStats=retStats@entry=0x7fffe535aaf0, flags=0) at
libvirt-domain.c:11595
#9  0x000055555557948d in remoteDispatchConnectGetAllDomainStats (
    server=<optimized out>, msg=<optimized out>, ret=0x7fffa80008e0,
    args=0x7fffa80008c0, rerr=0x7fffe535abf0, client=<optimized out>)
    at remote.c:6538
#10 remoteDispatchConnectGetAllDomainStatsHelper (server=<optimized out>,
    client=<optimized out>, msg=<optimized out>, rerr=0x7fffe535abf0,
    args=0x7fffa80008c0, ret=0x7fffa80008e0) at remote_dispatch.h:615
#11 0x00007ffff73bf59c in virNetServerProgramDispatchCall
(msg=0x55555586cdd0,
    client=0x55555586bea0, server=0x55555582ed90, prog=0x555555869190)
    at rpc/virnetserverprogram.c:437
#12 virNetServerProgramDispatch (prog=0x555555869190,
    server=server@entry=0x55555582ed90, client=0x55555586bea0,
    msg=0x55555586cdd0) at rpc/virnetserverprogram.c:307
#13 0x00005555555a9318 in virNetServerProcessMsg (msg=<optimized out>,
    prog=<optimized out>, client=<optimized out>, srv=0x55555582ed90)
    at rpc/virnetserver.c:148
#14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55555582ed90)
    at rpc/virnetserver.c:169
#15 0x00007ffff729a521 in virThreadPoolWorker (
    opaque=opaque@entry=0x55555583aa40) at util/virthreadpool.c:167
#16 0x00007ffff7299898 in virThreadHelper (data=<optimized out>)
    at util/virthread.c:206
#17 0x00007ffff3dbf36d in start_thread () from /lib64/libpthread.so.0
#18 0x00007ffff3af3e1f in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7ffff7ef9d80 (LWP 15561)):
#0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007ffff7299af6 in virCondWait (c=<optimized out>, m=<optimized out>)
    at util/virthread.c:154
#2  0x00007ffff729a760 in virThreadPoolFree (pool=<optimized out>)
    at util/virthreadpool.c:290
#3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
    at rpc/virnetserver.c:767
#4  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
    at util/virobject.c:356
#5  0x00007ffff724f069 in virHashFree (table=<optimized out>)
    at util/virhash.c:318
#6  0x00007ffff73b8295 in virNetDaemonDispose (obj=0x55555582eb10)
    at rpc/virnetdaemon.c:105
#7  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
    at util/virobject.c:356
#8  0x000055555556f2eb in main (argc=<optimized out>, argv=<optimized out>)
    at libvirtd.c:1524
(gdb)


Of course this could be a red herring because sleep/usleep and the
condition handling nature of these jobs could be interfering with one
another.

Still, adding the "virHashRemoveAll(dmn->servers);" into
virNetDaemonClose doesn't help the situation, as I can still either crash
randomly or hang, so I'm less convinced this would really fix anything.
It does change the "nature" of the hung thread stack trace though, as
the second thread is now:

 Thread 1 (Thread 0x7ffff7ef9d80 (LWP 20159)):
#0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
   from /lib64/libpthread.so.0
#1  0x00007ffff7299b06 in virCondWait (c=<optimized out>, m=<optimized out>)
    at util/virthread.c:154
#2  0x00007ffff729a770 in virThreadPoolFree (pool=<optimized out>)
    at util/virthreadpool.c:290
#3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
    at rpc/virnetserver.c:767
#4  0x00007ffff727924b in virObjectUnref (anyobj=<optimized out>)
    at util/virobject.c:356
#5  0x000055555556f2e3 in main (argc=<optimized out>, argv=<optimized out>)
    at libvirtd.c:1523
(gdb)


So we still haven't found the "root cause", but I think Erik is on to
something in the other part of this thread. I'll go there.


John

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 21.12.2017 15:57, John Ferlan wrote:
> [...]
> 
>>>
>>> So short story made really long, I think the best course of action will
>>> be to add this patch and reorder the Unref()'s (adminProgram thru srv,
>>> but not dmn). It seems to resolve these corner cases, but I'm also open
>>> to other suggestions. Still need to think about it some more too before
>>> posting any patches.
>>>
>>>
>> Hi.
>>
>> I haven't grasped the whole picture yet, but I've managed to find out what
>> triggered the crash. It is not 2f3054c22 where you reordered unrefs but
>> 1fd1b766105, which moves event unregistering from netserver client closing to
>> netserver client disposing. Before 1fd1b766105 we didn't have the crash
>> because clients simply did not get disposed.
> 
> Oh yeah, that one....  But considering Erik's most recent response in
> this overall thread vis-a-vis the separation of "close" vs. "dispose"
> and the timing of each w/r/t Unref and Free, I think having the call to
> remoteClientFreePrivateCallbacks in remoteClientCloseFunc is perhaps
> better than in remoteClientFreeFunc.
> 
>>  
>> As to fixing the crash with this patch, I think it is a coincidence. I wanted
>> to dispose netservers early to join rpc threads, and it turns out that
>> disposing also closes clients, and this fixes the problem.
>>
>> Nikolay
>>
> 
> With Cedric's patch in place, the virt-manager client issue is fixed. So
> that's goodness.
> 
> If I then add the sleep (or usleep) into qemuConnectGetAllDomainStats as
> noted in what started this all, then I can either get libvirtd to crash
> dereferencing a NULL driver pointer or (my favorite) hang with two
> threads stuck waiting:
> 
> (gdb) t a a bt
> 
> Thread 5 (Thread 0x7fffe535b700 (LWP 15568)):
> #0  0x00007ffff3dc909d in __lll_lock_wait () from /lib64/libpthread.so.0
> #1  0x00007ffff3dc1e23 in pthread_mutex_lock () from /lib64/libpthread.so.0
> #2  0x00007ffff7299a15 in virMutexLock (m=<optimized out>)
>     at util/virthread.c:89
> #3  0x00007fffc760621e in qemuDriverLock (driver=0x7fffbc190510)
>     at qemu/qemu_conf.c:100
> #4  virQEMUDriverGetConfig (driver=driver@entry=0x7fffbc190510)
>     at qemu/qemu_conf.c:1002
> #5  0x00007fffc75dfa89 in qemuDomainObjBeginJobInternal (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY,
> asyncJob=asyncJob@entry=QEMU_ASYNC_JOB_NONE)
>     at qemu/qemu_domain.c:4690
> #6  0x00007fffc75e3b2b in qemuDomainObjBeginJob (
>     driver=driver@entry=0x7fffbc190510, obj=obj@entry=0x7fffbc3bcd60,
>     job=job@entry=QEMU_JOB_QUERY) at qemu/qemu_domain.c:4842
> #7  0x00007fffc764f744 in qemuConnectGetAllDomainStats
> (conn=0x7fffb80009a0,
>     doms=<optimized out>, ndoms=<optimized out>, stats=<optimized out>,
>     retStats=0x7fffe535aaf0, flags=<optimized out>) at
> qemu/qemu_driver.c:20219
> #8  0x00007ffff736430a in virDomainListGetStats (doms=0x7fffa8000950,
> stats=0,
>     retStats=retStats@entry=0x7fffe535aaf0, flags=0) at
> libvirt-domain.c:11595
> #9  0x000055555557948d in remoteDispatchConnectGetAllDomainStats (
>     server=<optimized out>, msg=<optimized out>, ret=0x7fffa80008e0,
>     args=0x7fffa80008c0, rerr=0x7fffe535abf0, client=<optimized out>)
>     at remote.c:6538
> #10 remoteDispatchConnectGetAllDomainStatsHelper (server=<optimized out>,
>     client=<optimized out>, msg=<optimized out>, rerr=0x7fffe535abf0,
>     args=0x7fffa80008c0, ret=0x7fffa80008e0) at remote_dispatch.h:615
> #11 0x00007ffff73bf59c in virNetServerProgramDispatchCall
> (msg=0x55555586cdd0,
>     client=0x55555586bea0, server=0x55555582ed90, prog=0x555555869190)
>     at rpc/virnetserverprogram.c:437
> #12 virNetServerProgramDispatch (prog=0x555555869190,
>     server=server@entry=0x55555582ed90, client=0x55555586bea0,
>     msg=0x55555586cdd0) at rpc/virnetserverprogram.c:307
> #13 0x00005555555a9318 in virNetServerProcessMsg (msg=<optimized out>,
>     prog=<optimized out>, client=<optimized out>, srv=0x55555582ed90)
>     at rpc/virnetserver.c:148
> #14 virNetServerHandleJob (jobOpaque=<optimized out>, opaque=0x55555582ed90)
>     at rpc/virnetserver.c:169
> #15 0x00007ffff729a521 in virThreadPoolWorker (
>     opaque=opaque@entry=0x55555583aa40) at util/virthreadpool.c:167
> #16 0x00007ffff7299898 in virThreadHelper (data=<optimized out>)
>     at util/virthread.c:206
> #17 0x00007ffff3dbf36d in start_thread () from /lib64/libpthread.so.0
> #18 0x00007ffff3af3e1f in clone () from /lib64/libc.so.6
> 
> Thread 1 (Thread 0x7ffff7ef9d80 (LWP 15561)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/libpthread.so.0
> #1  0x00007ffff7299af6 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a760 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x00007ffff724f069 in virHashFree (table=<optimized out>)
>     at util/virhash.c:318
> #6  0x00007ffff73b8295 in virNetDaemonDispose (obj=0x55555582eb10)
>     at rpc/virnetdaemon.c:105
> #7  0x00007ffff727923b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #8  0x000055555556f2eb in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1524
> (gdb)
> 
> 
> Of course this could be a red herring because sleep/usleep and the
> condition handling nature of these jobs could be interfering with one
> another.
> 
> Still adding the "virHashRemoveAll(dmn->servers);" into
> virNetDaemonClose doesn't help the situation as I can still either crash
> randomly or hang, so I'm less convinced this would really fix anything.
> It does change the "nature" of the hung thread stack trace though, as
> the second thread is now:

virHashRemoveAll is not enough now. Due to the unref reordering, the last ref
to @srv is unrefed after virStateCleanup. So we need to virObjectUnref(srv|srvAdm)
before virStateCleanup. Or we can call virThreadPoolFree from virNetServerClose
(as in the first version of the patch and as Erik suggests) instead of
virHashRemoveAll.

> 
>  Thread 1 (Thread 0x7ffff7ef9d80 (LWP 20159)):
> #0  0x00007ffff3dc590b in pthread_cond_wait@@GLIBC_2.3.2 ()
>    from /lib64/libpthread.so.0
> #1  0x00007ffff7299b06 in virCondWait (c=<optimized out>, m=<optimized out>)
>     at util/virthread.c:154
> #2  0x00007ffff729a770 in virThreadPoolFree (pool=<optimized out>)
>     at util/virthreadpool.c:290
> #3  0x00005555555a8ec2 in virNetServerDispose (obj=0x55555582ed90)
>     at rpc/virnetserver.c:767
> #4  0x00007ffff727924b in virObjectUnref (anyobj=<optimized out>)
>     at util/virobject.c:356
> #5  0x000055555556f2e3 in main (argc=<optimized out>, argv=<optimized out>)
>     at libvirtd.c:1523
> (gdb)
> 
> 
> So we still haven't found the "root cause", but I think Erik is on to
> something in the other part of this thread. I'll go there.
> 
> 
> John
> 

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by John Ferlan 7 years, 4 months ago
[...]

>>
>> Still adding the "virHashRemoveAll(dmn->servers);" into
>> virNetDaemonClose doesn't help the situation as I can still either crash
>> randomly or hang, so I'm less convinced this would really fix anything.
>> It does change the "nature" of the hung thread stack trace though, as
>> the second thread is now:
> 
> virHashRemoveAll is not enough now. Due to the unref reordering, the last ref
> to @srv is unrefed after virStateCleanup. So we need to virObjectUnref(srv|srvAdm)
> before virStateCleanup. Or we can call virThreadPoolFree from virNetServerClose
> (as in the first version of the patch and as Erik suggests) instead of
> virHashRemoveAll.
> 

Patches w/

 1. Long pause before GetAllStats (without using [u]sleep)
 2. Adjustment to call virNetServerServiceToggle in
    virNetServerServiceClose (instead of virNetServerDispose)
 3. Call virHashRemoveAll in virNetDaemonClose
 4. Call virThreadPoolFree in virNetServerClose
 5. Perform Unref (adminProgram, srvAdm, qemuProgram, lxcProgram,
    remoteProgram, and srv) before virNetDaemonClose

Still has the virCondWait's - so as Daniel points out there's quite a
bit more work to be done. Like most Red Hat engineers - I will not be
very active over the next week or so (until the New Year) as it's a
holiday break/vacation for us.

So unless you have the burning desire to put together some patches and
do the work yourself, more thoughts/work will need to wait.

John

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 22.12.2017 17:13, John Ferlan wrote:
> [...]
> 
>>>
>>> Still adding the "virHashRemoveAll(dmn->servers);" into
>>> virNetDaemonClose doesn't help the situation as I can still either crash
>>> randomly or hang, so I'm less convinced this would really fix anything.
>>> It does change the "nature" of the hung thread stack trace though, as
>>> the second thread is now:
>>
>> virHashRemoveAll is not enough now. Due to the unref reordering, the last ref
>> to @srv is unrefed after virStateCleanup. So we need to virObjectUnref(srv|srvAdm)
>> before virStateCleanup. Or we can call virThreadPoolFree from virNetServerClose
>> (as in the first version of the patch and as Erik suggests) instead of
>> virHashRemoveAll.
>>
> 
> Patches w/
> 
>  1. Long pause before GetAllStats (without using [u]sleep)
>  2. Adjustment to call virNetServerServiceToggle in
> virNetServerServiceClose (instead of virNetServerDispose)
>  3. Call virHashRemoveAll in virNetDaemonClose
>  4. Call virThreadPoolFree in virNetServerClose
>  5. Perform Unref (adminProgram, srvAdm, qemuProgram, lxcProgram,
> remoteProgram, and srv) before virNetDaemonClose
> 
> Still has the virCondWait's - so as Daniel points out there's quite a
> bit more work to be done. Like most Red Hat engineers - I will not be
> very active over the next week or so (until the New Year) as it's a
> holiday break/vacation for us.
> 
> So unless you have the burning desire to put together some patches and
> do the work yourself, more thoughts/work will need to wait.
> 
> John
> 

Ok. Happy holidays!

I'm going to check what's going on nevertheless ))

Nikolay

Re: [libvirt] [PATCH v5 3/3] libvirtd: fix crash on termination
Posted by Nikolay Shirokovskiy 7 years, 4 months ago

On 22.12.2017 17:13, John Ferlan wrote:
> [...]
> 
>>>
>>> Still adding the "virHashRemoveAll(dmn->servers);" into
>>> virNetDaemonClose doesn't help the situation as I can still either crash
>>> randomly or hang, so I'm less convinced this would really fix anything.
>>> It does change the "nature" of the hung thread stack trace though, as
>>> the second thread is now:
>>
>> virHashRemoveAll is not enough now. Due to the unref reordering, the last ref
>> to @srv is unrefed after virStateCleanup. So we need to virObjectUnref(srv|srvAdm)
>> before virStateCleanup. Or we can call virThreadPoolFree from virNetServerClose
>> (as in the first version of the patch and as Erik suggests) instead of
>> virHashRemoveAll.
>>
> 
> Patches w/
> 
>  1. Long pause before GetAllStats (without using [u]sleep)
>  2. Adjustment to call virNetServerServiceToggle in
> virNetServerServiceClose (instead of virNetServerDispose)
>  3. Call virHashRemoveAll in virNetDaemonClose
>  4. Call virThreadPoolFree in virNetServerClose
>  5. Perform Unref (adminProgram, srvAdm, qemuProgram, lxcProgram,
> remoteProgram, and srv) before virNetDaemonClose
> 
> Still has the virCondWait's - so as Daniel points out there's quite a
> bit more work to be done. Like most Red Hat engineers - I will not be
> very active over the next week or so (until the New Year) as it's a
> holiday break/vacation for us.
> 
> So unless you have the burning desire to put together some patches and
> do the work yourself, more thoughts/work will need to wait.
> 
> John
> 

I've checked what's going on after applying the patches you described above
(however, it would be enough to apply only 3 (or 4) and part of 5 besides the
pause hunk). I get hangs too, and this kind of hang is fixed by the
second series - '[PATCH 0/4] libvirtd: fix hang on termination in qemu driver'.
That is, there is the following hang backtrace besides the hang in the thread
freeing the thread pool that you already mentioned:

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007ffff7335c58 in virCondWait (c=0x7fffc4000e18, m=0x7fffc4000df0) at util/virthread.c:154
#2  0x00007fffd9605983 in qemuMonitorSend (mon=0x7fffc4000de0, msg=0x7fffe70bd1f0) at qemu/qemu_monitor.c:1067
#3  0x00007fffd961b68f in qemuMonitorJSONCommandWithFd (mon=0x7fffc4000de0, cmd=0x7fffb0005310, scm_fd=-1, 
    reply=0x7fffe70bd2d0) at qemu/qemu_monitor_json.c:300
#4  0x00007fffd961b7c1 in qemuMonitorJSONCommand (mon=0x7fffc4000de0, cmd=0x7fffb0005310, reply=0x7fffe70bd2d0)
    at qemu/qemu_monitor_json.c:330
#5  0x00007fffd9629f0b in qemuMonitorJSONGetObjectListPaths (mon=0x7fffc4000de0, 
    path=0x7fffd96a7c96 "/machine/peripheral", paths=0x7fffe70bd380) at qemu/qemu_monitor_json.c:5715
#6  0x00007fffd962dcc4 in qemuMonitorJSONFindObjectPathByAlias (mon=0x7fffc4000de0, 
    name=0x7fffd969f3cd "virtio-balloon-pci", alias=0x7fffcc1e8d30 "balloon0", path=0x7fffe70bd450)
    at qemu/qemu_monitor_json.c:7235
#7  0x00007fffd962e231 in qemuMonitorJSONFindLinkPath (mon=0x7fffc4000de0, name=0x7fffd969f3cd "virtio-balloon-pci", 
    alias=0x7fffcc1e8d30 "balloon0", path=0x7fffe70bd450) at qemu/qemu_monitor_json.c:7349
#8  0x00007fffd9605bf7 in qemuMonitorInitBalloonObjectPath (mon=0x7fffc4000de0, balloon=0x7fffcc1e8e60)
    at qemu/qemu_monitor.c:1157
#9  0x00007fffd96098d3 in qemuMonitorGetMemoryStats (mon=0x7fffc4000de0, balloon=0x7fffcc1e8e60, 
    stats=0x7fffe70bd5b0, nr_stats=10) at qemu/qemu_monitor.c:2133
#10 0x00007fffd964e70c in qemuDomainMemoryStatsInternal (driver=0x7fffcc1872a0, vm=0x7fffcc2737e0, 
    stats=0x7fffe70bd5b0, nr_stats=10) at qemu/qemu_driver.c:11453
#11 0x00007fffd9667013 in qemuDomainGetStatsBalloon (driver=0x7fffcc1872a0, dom=0x7fffcc2737e0, 
    record=0x7fffb00008c0, maxparams=0x7fffe70bd6b0, privflags=1) at qemu/qemu_driver.c:19478
#12 0x00007fffd9669597 in qemuDomainGetStats (conn=0x7fffb80030e0, dom=0x7fffcc2737e0, stats=127, 
    record=0x7fffe70bd790, flags=1) at qemu/qemu_driver.c:20133
#13 0x00007fffd966997f in qemuConnectGetAllDomainStats (conn=0x7fffb80030e0, doms=0x7fffb0005220, ndoms=1, 
    stats=127, retStats=0x7fffe70bd8e0, flags=0) at qemu/qemu_driver.c:20226
#14 0x00007ffff7424fd7 in virDomainListGetStats (doms=0x7fffb0005220, stats=0, retStats=0x7fffe70bd8e0, flags=0)
    at libvirt-domain.c:11595
#15 0x00005555555ac030 in remoteDispatchConnectGetAllDomainStats (server=0x55555612a3a0, client=0x555556151d10, 
    msg=0x555556152540, rerr=0x7fffe70bda20, args=0x7fffb00036e0, ret=0x7fffb0002d20) at remote.c:6538

I'm writing this not to draw you back into the work, and I do not expect a reply. It is the holidays)
Only to document my research.

Nikolay
