Workspace runs stuck in pending approval

Incident Report for scalr.io

Postmortem

Final Update - 5/4/2026

The root cause was that a burst of new runs coincided with a jump in VCS and other I/O-bound tasks, which increased task duration across the queue. CPU utilization remained moderate throughout, so CPU-based worker autoscaling did not trigger and no capacity was added even as the queue fell behind. As task delays grew, they exposed a race condition that caused run transition tasks to fail. The queue overload alone would have caused only a temporary slowdown, but the race condition left some waiting runs permanently stuck with no automatic recovery.
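To illustrate why a CPU-only scaling signal can miss an I/O-bound backlog, here is a minimal sketch. The thresholds, field names, and the lag-aware policy are assumptions made for illustration; the report does not describe Scalr's actual autoscaling rules or say that a lag-based signal was adopted.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time view of a worker pool (illustrative only)."""
    cpu_utilization: float    # average across workers, as a fraction 0..1
    oldest_task_age_s: float  # how long the oldest queued task has waited


def cpu_only_policy(s: QueueSnapshot, cpu_threshold: float = 0.75) -> bool:
    """Scale up only when CPU is hot. The threshold is an assumed value."""
    return s.cpu_utilization >= cpu_threshold


def cpu_or_lag_policy(
    s: QueueSnapshot,
    cpu_threshold: float = 0.75,
    max_task_age_s: float = 60.0,
) -> bool:
    """Scale up when CPU is hot OR queued tasks have waited too long."""
    return (
        s.cpu_utilization >= cpu_threshold
        or s.oldest_task_age_s >= max_task_age_s
    )


# Workers mostly waiting on I/O: moderate CPU, but the queue is falling behind.
io_bound_backlog = QueueSnapshot(cpu_utilization=0.40, oldest_task_age_s=300.0)

print(cpu_only_policy(io_bound_backlog))    # CPU-only signal stays quiet
print(cpu_or_lag_policy(io_bound_backlog))  # lag-aware signal asks for capacity
```

With workers blocked on I/O, CPU stays below the threshold, so only a signal tied to queue delay reacts to the backlog.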

What has been done so far:

  • The improvement released on April 30 made run transitions significantly more stable and resilient to race conditions.
  • Audit log processing, which accounted for roughly a third of all task load at the time of the incident, has been moved to dedicated workers.
  • The I/O-bound workers have been given more resources.

Monitoring since these changes shows noticeably healthier queue behavior - fewer tasks overall and faster execution times.

What will be done next:

  • Moving run transitions to a separate queue and fully removing the underlying race condition, so that even under heavy load runs cannot get stuck.
  • Optimizing run notification tasks that spiked during the incident.
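The queue separation described in the items above can be sketched as a simple routing table. This is only an illustration under assumed names: the task types, queue names, and in-memory deques are hypothetical and do not reflect Scalr's actual task system.

```python
from collections import deque

# Hypothetical mapping from task type to a dedicated queue name.
QUEUE_FOR_TASK = {
    "run_transition": "transitions",
    "audit_log": "audit",
}
DEFAULT_QUEUE = "default"


def route(task_type: str) -> str:
    """Return the queue a task type belongs to, falling back to the shared one."""
    return QUEUE_FOR_TASK.get(task_type, DEFAULT_QUEUE)


def enqueue(queues: dict, task_type: str, payload: str) -> str:
    """Append the task to its queue and return the queue name that was used."""
    name = route(task_type)
    queues.setdefault(name, deque()).append((task_type, payload))
    return name


queues = {}
# A spike of notification tasks lands on the shared queue...
for i in range(1000):
    enqueue(queues, "run_notification", f"event-{i}")
# ...while a run transition goes to its own queue and is first in line there.
enqueue(queues, "run_transition", "run-123")
```

Because each queue would be drained by its own workers, a spike in one task type lengthens only that queue; the run transition here sits behind zero other tasks instead of a thousand.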

Update #2 - 5/1/2026

Further investigation points to what triggered the queue slowdowns. The stuck runs on April 28 and 29 lined up with large bursts of new runs being created at the same time as several heavier background jobs. Those jobs share the same pool of workers, and when they spiked together, the pool fell behind, which delayed the steps that move runs from one stage to the next. The resilience improvement we shipped on April 30 already addresses that delay.

We are now planning changes to separate the heaviest background work from run-transition processing, so a busy moment in one area cannot slow the other. We will share another update as that work progresses.

Update #1 - 4/30/2026

On April 28 at approximately 15:00 UTC and again on April 29 at approximately 15:22 UTC, a load spike affected our internal task queue, which slowed down the mechanism Scalr uses to transition runs between pipeline stages. In some cases, the transition task failed before it could complete, leaving runs stuck in a waiting state with no automatic recovery path. Both times, the queues cleared on their own while our engineering team investigated. We have already shipped an improvement (released April 30) that makes this transition task significantly more resilient to lock contention, greatly reducing the likelihood of runs getting stuck. We are continuing to investigate the underlying cause of the queue spikes and will share further updates as the investigation progresses.
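The report does not describe how the April 30 improvement works. As a generic illustration of making a step more tolerant of lock contention, the sketch below retries with exponential backoff; the `LockBusy` exception, the function names, and the retry parameters are all hypothetical.

```python
import time


class LockBusy(Exception):
    """Hypothetical error raised when the run's lock is held by another task."""


def transition_with_retry(transition, attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Call `transition`, retrying on LockBusy with exponential backoff.

    Re-raises LockBusy after the final attempt so the failure stays visible
    to the caller instead of being silently dropped.
    """
    for attempt in range(attempts):
        try:
            return transition()
        except LockBusy:
            if attempt == attempts - 1:
                raise
            # Wait 0.1 s, 0.2 s, 0.4 s, ... before trying again.
            sleep(base_delay_s * 2 ** attempt)


# Usage: a transition that finds the lock busy twice, then succeeds.
calls = []


def flaky_transition():
    calls.append(1)
    if len(calls) < 3:
        raise LockBusy()
    return "applied"


delays = []  # record the requested delays instead of actually sleeping
result = transition_with_retry(flaky_transition, sleep=delays.append)
```

Retrying narrows the window in which contention leaves a run stuck, but it does not remove the underlying race.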

Posted Apr 30, 2026 - 18:09 UTC

Resolved

This incident has been resolved. An RCA will be posted soon.
Posted Apr 29, 2026 - 18:57 UTC

Monitoring

We are continuing to monitor for any further issues.
Posted Apr 29, 2026 - 17:11 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 29, 2026 - 17:08 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 29, 2026 - 17:04 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Apr 29, 2026 - 15:22 UTC

Investigating

We are currently investigating an issue where some runs are stuck in a pending approval state without the ability to approve them.
Posted Apr 29, 2026 - 15:22 UTC
This incident affected: Scalr Platform.