Final Update - 5/4/2026
The root cause was a burst of new runs that coincided with a jump in VCS and other I/O-bound tasks, which increased task duration across the whole queue. CPU utilization remained moderate throughout, so the CPU-based worker autoscaling never triggered, and the system added no capacity even as the queue fell behind. This had a cascading effect: as task delays grew, they exposed a race condition that caused run transition tasks to fail. Queue overload alone would have caused only a temporary slowdown, but the race condition left some waiting runs permanently stuck with no automatic recovery.
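To illustrate why a CPU-only scaling signal can miss this kind of load, here is a minimal Python sketch. It is not Scalr's autoscaler and does not describe any planned fix; the `PoolMetrics` fields, the function names, and every threshold are hypothetical values chosen for the example. It only shows that an I/O-bound backlog can sit below a CPU threshold while the age of the oldest queued task keeps climbing.

```python
from dataclasses import dataclass


@dataclass
class PoolMetrics:
    """Hypothetical snapshot of a worker pool (names are illustrative)."""
    cpu_utilization: float    # average across workers, 0.0 to 1.0
    oldest_task_age_s: float  # how long the oldest queued task has waited


def cpu_only_should_scale(m: PoolMetrics, cpu_threshold: float = 0.75) -> bool:
    # Scales only when workers are CPU-busy. Workers blocked on I/O
    # look idle to this check even if the queue is falling behind.
    return m.cpu_utilization >= cpu_threshold


def queue_aware_should_scale(
    m: PoolMetrics,
    cpu_threshold: float = 0.75,
    max_task_age_s: float = 30.0,
) -> bool:
    # Also scales when queued tasks wait too long, regardless of CPU.
    return (
        m.cpu_utilization >= cpu_threshold
        or m.oldest_task_age_s >= max_task_age_s
    )


# Assumed numbers for an I/O-bound burst: moderate CPU, long queue wait.
io_bound_burst = PoolMetrics(cpu_utilization=0.40, oldest_task_age_s=120.0)

print(cpu_only_should_scale(io_bound_burst))     # CPU-only signal
print(queue_aware_should_scale(io_bound_burst))  # CPU or queue-wait signal
```

With these assumed numbers, the CPU-only check does not request capacity while the queue-aware check does, which mirrors the gap described above.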
What has been done so far:
Monitoring since these changes shows noticeably healthier queue behavior: fewer tasks overall and faster execution times.
What will be done next:
Update #2 - 5/1/2026
Further investigation points to what triggered the queue slowdowns. The stuck runs on April 28 and 29 coincided with large bursts of new runs being created at the same time as several heavier background jobs were running. Those jobs share a worker pool with run-transition processing, and when both spiked together, the pool fell behind, which delayed the steps that move runs from one stage to the next. The resilience improvement we shipped on April 30 already addresses that delay.
We are now planning changes to separate the heaviest background work from run-transition processing, so a busy moment in one area cannot slow the other. We will share another update as that work progresses.
Update #1 - 4/30/2026
On April 28 at approximately 15:00 UTC and again on April 29 at approximately 15:22 UTC, a load spike affected our internal task queue, which slowed down the mechanism Scalr uses to transition runs between pipeline stages. In some cases, the transition task failed before it could complete, leaving runs stuck in a waiting state with no automatic recovery path. Both times, the queues cleared on their own while our engineering team investigated. We have already shipped an improvement (released April 30) that makes this transition task significantly more resilient to lock contention, greatly reducing the likelihood of runs getting stuck. We are continuing to investigate the underlying cause of the queue spikes and will share further updates as the investigation progresses.
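One common way to make a task more tolerant of lock contention is to retry lock acquisition with backoff instead of failing on the first attempt. The Python sketch below shows that general technique only. The update above does not say how the April 30 improvement works, so this should not be read as Scalr's implementation; `transition_run`, `LockContention`, and all parameter values are hypothetical.

```python
import threading
import time


class LockContention(Exception):
    """Raised when the lock could not be acquired after all attempts."""


def transition_run(lock, do_transition, attempts=5,
                   acquire_timeout_s=0.1, base_delay_s=0.05):
    """Run do_transition() under lock, retrying with exponential backoff.

    Hypothetical helper: a bounded wait per attempt plus a growing pause
    between attempts gives a busy lock holder time to finish.
    """
    for attempt in range(attempts):
        if lock.acquire(timeout=acquire_timeout_s):
            try:
                return do_transition()
            finally:
                lock.release()
        # Lock was busy: back off before the next attempt.
        time.sleep(base_delay_s * (2 ** attempt))
    raise LockContention(f"lock still held after {attempts} attempts")


# Usage: an uncontended lock lets the transition run on the first attempt.
run_lock = threading.Lock()
result = transition_run(run_lock, lambda: "applied")
print(result)
```

If every attempt fails, the helper raises `LockContention`, so the caller can re-queue the task explicitly rather than leave the run waiting with no recovery path.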