Summary
A backend optimization deployed on April 24 inadvertently triggered a database performance issue that caused a platform-wide outage. The change has been fully reverted and the platform is stable. We are addressing the underlying query design before re-attempting the optimization.
What Happened
The optimization changed how policy code is delivered during runs, moving from an inline payload to a download from blob storage. This introduced an additional authorization check on each policy download that was not present before.
That authorization check relied on an existing database query with significant hidden complexity: under normal conditions it goes unnoticed, but at production policy-check volume it joined approximately 2 million rows per call, taking around 25 seconds to complete. The increased call frequency exposed this latency, drove the database to 100% CPU utilization, and exhausted the connection pool, making the platform unavailable.
Resolution
The optimization was fully reverted, restoring normal platform behavior. No data was lost or corrupted.
What We're Doing Next
The authorization query is being redesigned to use a direct indexed lookup, which eliminates the row-scan behavior that caused the spike. The optimization will not be re-released until this redesign is complete and validated.
We apologize for the disruption. If you have questions or are still experiencing issues, please contact our support team.