Details
Description
There's a race condition in QDBusAbstractInterfacePrivate, between the constructor and initOwnerTracking(), when creating an interface to a service being registered at the same time.
The current code fetches the current owner in the constructor, then sets up a connection from a service watcher to _q_serviceOwnerChanged() in initOwnerTracking(). But depending on timing, the owner change notification could arrive in between the two (ie. after we fetch the current owner but before the connection is made). In this case the owner change notification will be lost, and the interface will keep reporting as invalid permanently.
This problem is particularly visible when the process registering the service and the process creating the interface are started at around the same time. See the attached test program that reproduces the issue:
This test program creates two children processes that are mutually talking to each other via a dbus interface, and restarted continuously. When run as-is, one of the processes gets stuck waiting on the other processes' interface to become valid. It randomly affects one or the other, and can sometimes happen at the first iteration or after a couple successful iterations. By adding a msleep() in one of the processes, everything starts running normally.
While debugging the issue, I've added extra logging to QDBusAbstractInterfacePrivate to prove the race. In a normal scenario, this is what I got:
[AppOne] currentOwner IN CONSTRUCTOR "" [AppOne] currentOwner AFTER CONNECT "" [AppOne] Waiting for AppTwo proxy... [AppOne] QDBusAbstractInterfacePrivate serviceOwnerChanged "com.bluescape.AppTwo" "" ":1.1829" [AppOne] AppTwo proxy valid, sending ping!
Then in the failing case this is what I got:
[AppTwo] currentOwner IN CONSTRUCTOR "" [AppTwo] currentOwner AFTER CONNECT ":1.1833" [AppTwo] Waiting for AppOne proxy... [AppTwo] Waiting for AppOne proxy... [AppTwo] Waiting for AppOne proxy... ...
You can see that by the time the connect() was made in initOwnerTracking(), the owner has been changed, and that change wasn't captured by the service watcher as we never got a signal for it.
The good news is that I already have a fix for it which I'm going push to gerrit, I'll link it to this ticket when it's up.