We have many failures where an integration gets canceled because a machine cannot be acquired for some work item.
These failures waste both CI capacity and developer time because of constant re-staging, which delays important integrations, especially before release time.
Here's a recent failure
My suggestion is that, instead of failing with that error message, we either increase the timeout threshold or put the job back into the queue so it is retried later.
But failing a whole integration just because a machine was not acquired within 2-4 hours is unreasonable.
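To make the requeue option concrete, here is a rough sketch in Python. It is not based on the actual CI code: `WorkItem`, `schedule`, `try_acquire` and `max_attempts` are all hypothetical names, and the real scheduler may be structured quite differently. The idea is only that a work item whose machine acquisition times out goes back to the end of the queue, and the integration is failed only after a bounded number of attempts.

```python
import collections
from dataclasses import dataclass


@dataclass
class WorkItem:
    # Hypothetical stand-in for a CI work item; not the real data model.
    name: str
    attempts: int = 0


def schedule(queue, try_acquire, max_attempts=3):
    """Drain the queue, requeueing items whose machine acquisition timed out.

    try_acquire(item) is an assumed callback that returns True if a machine
    was acquired before the timeout and False otherwise.
    Returns a (done, failed) pair of lists.
    """
    done, failed = [], []
    while queue:
        item = queue.popleft()
        item.attempts += 1
        if try_acquire(item):
            done.append(item)
        elif item.attempts < max_attempts:
            # Acquisition timed out: retry later instead of failing right away.
            queue.append(item)
        else:
            # Give up only after max_attempts timeouts in a row.
            failed.append(item)
    return done, failed
```

With something like this, a single acquisition timeout only delays the affected work item; the rest of the queue keeps moving and the integration is canceled only if the machine stays unavailable across several attempts.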