Runs Taking Longer Than Usual to Execute

Incident Report for scalr.io

Postmortem

Summary

On July 1, Terraform and OpenTofu runs for a number of customers were delayed by as much as 10–25 minutes during the initialization and plan phases. The cause was a capacity limit in the network-backed provider plugin cache, which was exceeded under load. We have mitigated the issue by temporarily removing the shared network storage from the provider download path, and the platform is stable. We are re-architecting how provider plugins are distributed to prevent this class of slowdown from recurring.

What Happened

Before a run can begin planning, it downloads the Terraform/OpenTofu provider plugins it needs (for example, the AWS, GCP, or Azure providers) from a shared cache backed by network storage. A period of high run volume, combined with growth in the total size of the provider cache, drove the aggregate load past the practical limits of that storage. As downloads queued, the init/plan phase of affected runs slowed dramatically, in some cases by 10–25 minutes, and a small number of runs errored or hit their timeout and had to be retried.

The network storage reached its limits due to the large number of connected run nodes and began throttling, slowing down all workloads that relied on it.
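To illustrate why a shared throughput ceiling turns into long waits as more run nodes connect, here is a minimal back-of-the-envelope sketch. It assumes the storage enforces one fixed aggregate throughput limit that is split evenly across concurrent downloads. That is a simplification, and the function name and every number below are hypothetical, not measurements from this incident.

```python
def estimated_download_seconds(
    provider_mb: float,
    storage_limit_mb_s: float,
    concurrent_downloads: int,
) -> float:
    """Estimate one download's duration under an evenly shared throughput cap.

    Hypothetical model: real storage throttling may also depend on IOPS,
    connection counts, or burst credits, none of which are modeled here.
    """
    if concurrent_downloads < 1:
        raise ValueError("concurrent_downloads must be at least 1")
    # Each download gets an equal slice of the fixed aggregate limit.
    per_download_mb_s = storage_limit_mb_s / concurrent_downloads
    return provider_mb / per_download_mb_s


# Illustrative inputs only: a 500 MB provider and a 1000 MB/s aggregate limit.
alone = estimated_download_seconds(500, 1000, 1)
contended = estimated_download_seconds(500, 1000, 100)
```

In this model the wait grows linearly with the number of concurrent downloads, so growth in either run volume or cache size pushes every dependent workload toward the limit at the same time.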

Mitigation

After several unsuccessful attempts to increase the storage throughput, we temporarily removed the shared network storage from the provider cache path. Providers are currently downloaded over the public network for each run. As a result, init times may be slightly longer than with a healthy provider cache, and runs temporarily depend on the availability of public provider registries.

What We're Doing Next

We are improving the plugin cache architecture along two lines: a node-local caching layer that eliminates the shared network storage bottleneck (to be in place before we re-enable the provider cache), and a caching network mirror that removes the direct dependency on public registry availability. Together, these layers ensure that the failure of any one of them is absorbed by the others without service degradation.
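The layered lookup described above can be sketched roughly as follows. This is an illustration of the general fallback pattern only, not our implementation: the function name, the in-memory dictionary standing in for a node-local cache, and the stub fetchers are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

# A fetcher takes a provider identifier and returns the plugin bytes,
# raising OSError (for example ConnectionError) when its source is unavailable.
Fetcher = Callable[[str], bytes]


def fetch_provider(
    provider: str,
    local_cache: Dict[str, bytes],
    sources: List[Tuple[str, Fetcher]],
) -> Tuple[bytes, str]:
    """Return (plugin bytes, name of the layer that served them).

    Hypothetical sketch: check the node-local cache first, then try each
    remote source in order, so one failed layer is absorbed by the next.
    """
    if provider in local_cache:
        return local_cache[provider], "node-local cache"

    errors = []
    for name, fetch in sources:
        try:
            data = fetch(provider)
        except OSError as exc:
            # Record the failure and fall through to the next layer.
            errors.append(f"{name}: {exc}")
            continue
        # Populate the node-local cache so later runs skip the network.
        local_cache[provider] = data
        return data, name

    raise RuntimeError("all provider sources failed: " + "; ".join(errors))


# Stub sources for illustration: the mirror is down, the registry responds.
def broken_mirror(provider: str) -> bytes:
    raise ConnectionError("mirror unreachable")


def public_registry(provider: str) -> bytes:
    return b"plugin-bytes-for-" + provider.encode()


cache: Dict[str, bytes] = {}
layers = [("network mirror", broken_mirror), ("public registry", public_registry)]
first = fetch_provider("hashicorp/aws", cache, layers)
second = fetch_provider("hashicorp/aws", cache, layers)
```

In this sketch the first call falls through the unavailable mirror to the public registry, and the second call is served from the node-local cache without touching the network at all.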

We apologize for the disruption. If you have questions or are still experiencing issues, please contact our support team.

Posted Jul 02, 2026 - 14:27 UTC

Resolved

This incident has been resolved.
Posted Jul 01, 2026 - 21:46 UTC

Update

We are continuing to monitor for any further issues.
Posted Jul 01, 2026 - 21:17 UTC

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Jul 01, 2026 - 21:17 UTC

Update

Customers should see the lagging runs start to improve. Existing runs will resume normally on their own.
Posted Jul 01, 2026 - 20:42 UTC

Update

We are continuing to work on a fix for this issue.
Posted Jul 01, 2026 - 20:39 UTC

Identified

The issue has been identified and a fix is being implemented.
Posted Jul 01, 2026 - 18:47 UTC

Investigating

We are currently investigating this issue.
Posted Jul 01, 2026 - 18:30 UTC
This incident affected: Scalr Platform and Scalr Worker.