In a non-p2p migration, if the libvirt client gets disconnected from the
source libvirtd after the PERFORM phase is over, the daemon simply resets
the current migration job. However, the VM could be left paused on both
source and destination in that case. When the client reconnects and
queries the migration status, the job has already been blanked out on the
source, so the reconnected client has no clear way of figuring out whether
an unclean migration was previously aborted.

This patch marks a "potentially" incomplete migration as a failed job, so
that a client can inspect previously failed jobs for this VM and take
corrective action as needed.
Signed-off-by: Prerna Saxena <saxenap.ltc@gmail.com>
---
 src/qemu/qemu_domain.c    | 16 ++++++++++++++++
 src/qemu/qemu_domain.h    |  2 ++
 src/qemu/qemu_migration.c |  4 ++--
 3 files changed, 20 insertions(+), 2 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index e8e0313..7c60d17 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
     qemuDomainObjSaveJob(driver, obj);
 }
 
+
+void
+qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
+{
+    qemuDomainObjPrivatePtr priv = obj->privateData;
+    VIR_FREE(priv->job.completed);
+    if (VIR_ALLOC(priv->job.completed) == 0) {
+        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
+        priv->job.completed = priv->job.current;
+    } else {
+        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
+    }
+    qemuDomainObjResetAsyncJob(priv);
+    qemuDomainObjEndJob(driver, obj);
+}
+
 void
 qemuDomainObjReleaseAsyncJob(virDomainObjPtr obj)
 {
diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
index c33af36..6465603 100644
--- a/src/qemu/qemu_domain.h
+++ b/src/qemu/qemu_domain.h
@@ -497,6 +497,8 @@ void qemuDomainObjRestoreJob(virDomainObjPtr obj,
                              struct qemuDomainJobObj *job);
 void qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver,
                                   virDomainObjPtr obj);
+void qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver,
+                               virDomainObjPtr obj);
 void qemuDomainObjReleaseAsyncJob(virDomainObjPtr obj);
 
 qemuMonitorPtr qemuDomainGetMonitor(virDomainObjPtr vm)
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c
index 69eb231..fd8950e 100644
--- a/src/qemu/qemu_migration.c
+++ b/src/qemu/qemu_migration.c
@@ -1911,8 +1911,8 @@ qemuMigrationCleanup(virDomainObjPtr vm,
VIR_WARN("Migration of domain %s finished but we don't know if the"
" domain was successfully started on destination or not",
vm->def->name);
- /* clear the job and let higher levels decide what to do */
- qemuDomainObjDiscardAsyncJob(driver, vm);
+ /* clearly "fail" the job and let higher levels decide what to do */
+ qemuDomainObjFailAsyncJob(driver, vm);
break;
case QEMU_MIGRATION_PHASE_PERFORM3:
--
1.7.1
On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> In a non-p2p migration, if the libvirt client gets disconnected from the
> source libvirtd after the PERFORM phase is over, the daemon simply resets
> the current migration job. However, the VM could be left paused on both
> source and destination in that case. When the client reconnects and
> queries the migration status, the job has already been blanked out on the
> source, so the reconnected client has no clear way of figuring out
> whether an unclean migration was previously aborted.

The virDomainGetState API should return VIR_DOMAIN_PAUSED with
VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?

> This patch marks a "potentially" incomplete migration as a failed job,
> so that a client can

As you say it's potentially incomplete, so marking it as failed is not
completely correct. It's a split brain when the source cannot distinguish
whether the migration was successful or not.

> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> index e8e0313..7c60d17 100644
> --- a/src/qemu/qemu_domain.c
> +++ b/src/qemu/qemu_domain.c
> @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
>      qemuDomainObjSaveJob(driver, obj);
>  }
>  
> +
> +void
> +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> +{
> +    qemuDomainObjPrivatePtr priv = obj->privateData;
> +    VIR_FREE(priv->job.completed);
> +    if (VIR_ALLOC(priv->job.completed) == 0) {
> +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> +        priv->job.completed = priv->job.current;

This will just leak the memory allocated for priv->job.completed by
overwriting the pointer with the one from priv->job.current, ...

> +    } else {
> +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> +    }
> +    qemuDomainObjResetAsyncJob(priv);

which will point to freed memory after this call.

> +    qemuDomainObjEndJob(driver, obj);

And while this is almost certainly (I didn't really check though) not
something you should call from a close callback, you don't save the
changes to the status XML, so once libvirtd restarts, it will think the
domain is still being migrated.

Jirka
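For illustration, a leak-free variant of the helper could copy the job
info instead of aliasing the pointer. This is a minimal sketch only: it
assumes qemuDomainJobInfo holds no owned pointers (so a struct assignment
is a sufficient copy), that the function lives in qemu_domain.c where the
static qemuDomainObjSaveJob() is reachable, and it drops the questionable
qemuDomainObjEndJob() call in favour of mirroring
qemuDomainObjDiscardAsyncJob():

void
qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
{
    qemuDomainObjPrivatePtr priv = obj->privateData;

    VIR_FREE(priv->job.completed);
    if (VIR_ALLOC(priv->job.completed) == 0) {
        /* Copy the struct rather than aliasing the pointer, so that
         * qemuDomainObjResetAsyncJob() can free priv->job.current
         * without leaving job.completed dangling. Assumes the struct
         * contains no pointers that would need a deep copy. */
        *priv->job.completed = *priv->job.current;
        priv->job.completed->type = VIR_DOMAIN_JOB_FAILED;
    } else {
        VIR_WARN("Unable to allocate job.completed for VM %s",
                 obj->def->name);
    }

    /* Mirror qemuDomainObjDiscardAsyncJob(): reset the async job and
     * persist the result to the status XML, so a restarted libvirtd
     * does not believe the domain is still migrating. */
    qemuDomainObjResetAsyncJob(priv);
    qemuDomainObjSaveJob(driver, obj);
}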
Hi Jirka,

On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar@redhat.com> wrote:
> On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > In a non-p2p migration, if the libvirt client gets disconnected from
> > the source libvirtd after the PERFORM phase is over, the daemon simply
> > resets the current migration job. However, the VM could be left paused
> > on both source and destination in that case. When the client
> > reconnects and queries the migration status, the job has already been
> > blanked out on the source, so the reconnected client has no clear way
> > of figuring out whether an unclean migration was previously aborted.
>
> The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?

I understand that a client application should poll the source libvirtd for
the status of the migration job using virDomainGetJobStats(). However, as
you explained above, the cleanup callbacks clear the job info, so a client
would additionally have to poll virDomainGetState() too. Would it not be
cleaner to have a single API reflect the source of truth?

> > This patch marks a "potentially" incomplete migration as a failed job,
> > so that a client can
>
> As you say it's potentially incomplete, so marking it as failed is not
> completely correct. It's a split brain when the source cannot
> distinguish whether the migration was successful or not.

Agreed, it might have run to completion too, as we observed in some cases.
Do you think marking the job status as "UNKNOWN" would articulate the
current state better?

> > diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
> > index e8e0313..7c60d17 100644
> > --- a/src/qemu/qemu_domain.c
> > +++ b/src/qemu/qemu_domain.c
> > @@ -4564,6 +4564,22 @@ qemuDomainObjDiscardAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> >      qemuDomainObjSaveJob(driver, obj);
> >  }
> >  
> > +
> > +void
> > +qemuDomainObjFailAsyncJob(virQEMUDriverPtr driver, virDomainObjPtr obj)
> > +{
> > +    qemuDomainObjPrivatePtr priv = obj->privateData;
> > +    VIR_FREE(priv->job.completed);
> > +    if (VIR_ALLOC(priv->job.completed) == 0) {
> > +        priv->job.current->type = VIR_DOMAIN_JOB_FAILED;
> > +        priv->job.completed = priv->job.current;
>
> This will just leak the memory allocated for priv->job.completed by
> overwriting the pointer with the one from priv->job.current, ...
>
> > +    } else {
> > +        VIR_WARN("Unable to allocate job.completed for VM %s", obj->def->name);
> > +    }
> > +    qemuDomainObjResetAsyncJob(priv);
>
> which will point to freed memory after this call.

Agreed, I will fix this.

> > +    qemuDomainObjEndJob(driver, obj);
>
> And while this is almost certainly (I didn't really check though) not
> something you should call from a close callback, you don't save the
> changes to the status XML, so once libvirtd restarts, it will think the
> domain is still being migrated.

I will add this to the status XML. I am suggesting that strengthening the
job data would be additionally useful: if the daemon has not restarted,
the job information can still give us a fairly accurate status of the
migration. Please let me know if you think this is not useful; I will be
happy to learn the rationale.

Regards,
Prerna
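As an aside, the client-side polling being debated here could look roughly
like the hypothetical sketch below — public API only, error handling
trimmed, and the helper name is made up:

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Hypothetical helper: after reconnecting to the source, ask what it
 * still knows about the last migration job and the domain state. */
static void
check_migration_outcome(virDomainPtr dom)
{
    virTypedParameterPtr params = NULL;
    int nparams = 0;
    int jobtype = VIR_DOMAIN_JOB_NONE;
    int state = 0;
    int reason = 0;

    /* Statistics of the most recently completed job, if any survived. */
    if (virDomainGetJobStats(dom, &jobtype, &params, &nparams,
                             VIR_DOMAIN_JOB_STATS_COMPLETED) == 0) {
        printf("last job type: %d (%d stats fields)\n", jobtype, nparams);
        virTypedParamsFree(params, nparams);
    }

    /* The job info may have been cleared by the cleanup callback, so
     * the domain state is the more reliable signal. */
    if (virDomainGetState(dom, &state, &reason, 0) == 0 &&
        state == VIR_DOMAIN_PAUSED &&
        reason == VIR_DOMAIN_PAUSED_MIGRATION)
        printf("domain left paused by an interrupted migration\n");
}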
On Mon, Jan 29, 2018 at 15:56:29 +0530, Prerna wrote:
> Hi Jirka,
>
> On Thu, Jan 25, 2018 at 8:43 PM, Jiri Denemark <jdenemar@redhat.com> wrote:
> > On Thu, Jan 25, 2018 at 19:51:23 +0530, Prerna Saxena wrote:
> > > In a non-p2p migration, if the libvirt client gets disconnected from
> > > the source libvirtd after the PERFORM phase is over, the daemon
> > > simply resets the current migration job. However, the VM could be
> > > left paused on both source and destination in that case. When the
> > > client reconnects and queries the migration status, the job has
> > > already been blanked out on the source, so the reconnected client
> > > has no clear way of figuring out whether an unclean migration was
> > > previously aborted.
> >
> > The virDomainGetState API should return VIR_DOMAIN_PAUSED with
> > VIR_DOMAIN_PAUSED_MIGRATION reason. Is this not enough?
>
> I understand that a client application should poll the source libvirtd
> for the status of the migration job using virDomainGetJobStats().

Not really, it may poll if it wants to monitor migration progress, but
normally the client would just wait for the migration API to return either
success or failure.

> However, as you explained above, the cleanup callbacks clear the job
> info, so a client would additionally have to poll virDomainGetState()
> too.

Well, even if virDomainGetJobStats with the VIR_DOMAIN_JOB_STATS_COMPLETED
flag were modified to report the job as VIR_DOMAIN_JOB_FAILED, the client
would still need to call virDomainGetState (on both sides in some cases)
to check whether the domain is running or was left in a paused state. So
reporting a failed job via virDomainGetJobStats does not seem to be really
necessary. And it would be a bit confusing too, since the flag is called
*_COMPLETED while the migration in fact did not complete. This confusion
could be fixed by introducing a new flag, but...

> Would it not be cleaner to have a single API reflect the source of
> truth?

Perhaps, but since there already is a way of getting this info, any client
which wants to work with more than just a bleeding-edge libvirt would
still need to implement the existing way. And why would the client bother
using the new API when it can be sure the old way will still be available?
Doing so would make the client even more complicated for no benefit.

But as I said, just seeing that a previous migration job failed is not
enough to recover from a disconnected client which was controlling a
non-p2p migration.

BTW, p2p migration is far less fragile in this respect. If the connection
to a client breaks, migration normally continues without any disruption.
And if the connection between the libvirt daemons fails, both sides will
detect it and abort the migration. Of course, a split brain can still
happen even with p2p migration, but it's not so easy to trigger since the
time frame in which the connection has to break to cause a split brain is
much shorter.

Jirka
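To make the "on both sides" point concrete, a recovery check after a
suspected split brain might probe the domain state on both source and
destination before deciding where to resume. A hypothetical sketch — the
URIs, domain name, and policy are placeholders, not part of the patch:

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Returns 1 if the named domain exists on this connection and is
 * paused because of a migration, 0 otherwise. */
static int
paused_by_migration(virConnectPtr conn, const char *name)
{
    virDomainPtr dom = virDomainLookupByName(conn, name);
    int state = 0, reason = 0, ret = 0;

    if (!dom)
        return 0;
    if (virDomainGetState(dom, &state, &reason, 0) == 0)
        ret = state == VIR_DOMAIN_PAUSED &&
              reason == VIR_DOMAIN_PAUSED_MIGRATION;
    virDomainFree(dom);
    return ret;
}

int
main(void)
{
    /* Placeholder URIs and domain name. */
    virConnectPtr src = virConnectOpen("qemu+ssh://source/system");
    virConnectPtr dst = virConnectOpen("qemu+ssh://destination/system");

    if (src && dst &&
        paused_by_migration(src, "guest1") &&
        paused_by_migration(dst, "guest1"))
        printf("paused on both sides: split brain, resolve manually\n");

    if (src)
        virConnectClose(src);
    if (dst)
        virConnectClose(dst);
    return 0;
}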