Current Status
All Systems Operational
Recent Incidents
Delayed notifications
major · May 28, 2026 · resolved May 28
This incident has been resolved.
Increased latency and error rates
none · May 26, 2026 · resolved May 26
## Service Impact

A subset of customers experienced elevated latency in notification delivery.

## Incident Summary

While migrating a subset of our background processing services to Amazon EKS, we encountered an issue with delivery of internal metrics. The issue did not impact performance or availability, but it would have impaired our ability to detect such problems if they occurred. Out of an abundance of caution we decided to revert the migration, and moved those services back to the original infrastructure on AWS Fargate.

When migrating to EKS, we scale down and disable automatic scaling on Fargate. This allows us to quickly migrate back by scaling up Fargate. When we moved the workloads back to Fargate to restore internal metrics, we missed the step to re-enable autoscaling. As a result, the affected services did not have sufficient capacity and could not keep up with incoming work. We re-enabled autoscaling promptly once the problem was discovered, and provisioned extra capacity for customers where a backlog of work had accumulated.

Between 09:17 and 10:17 UTC, a small subset of our customers were impacted. Affected customers experienced a limited outage of notification services lasting between 35 and 58 minutes within this window. The migration is performed in small batches, so not all customers experienced this incident.

## Changes we're making

* We are simplifying the runbook used to roll back migrations in the event of incidents.
* We are adding more verification steps to the migration process.
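As an illustration of the kind of verification step described above, the following is a minimal sketch of a post-rollback check. It is not Buildkite's actual tooling: `ServiceState`, `rollback_problems`, the service name, and the counts are all hypothetical, and a real check would read this state from the platform's API rather than from a hand-built object.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of one service on the original platform."""
    name: str
    desired_count: int
    autoscaling_enabled: bool


def rollback_problems(state: ServiceState, min_count: int) -> list[str]:
    """Return the rollback steps that still look incomplete for this service."""
    problems = []
    if state.desired_count < min_count:
        problems.append(
            f"{state.name}: desired count {state.desired_count} is below {min_count}"
        )
    if not state.autoscaling_enabled:
        problems.append(f"{state.name}: autoscaling is still disabled")
    return problems


# A service that was scaled back up but never had autoscaling re-enabled.
state = ServiceState(name="notifier", desired_count=4, autoscaling_enabled=False)
for problem in rollback_problems(state, min_count=2):
    print(problem)
```

Running a check like this as the last runbook step would flag a rollback that restored capacity but left autoscaling off, which is the gap described in this incident.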
Delayed notifications
major · May 20, 2026 · resolved May 20
## Service Impact

A subset of our customers experienced elevated latency in our notification delivery, build dispatch and metrics services.

## Incident Summary

We are in the process of migrating our underlying compute platform from AWS Fargate to Amazon EKS for our production workloads. We are migrating our services in small batches so we can verify stability as we go.

Between 15:42 and 17:33 our EKS Prometheus server began to need more memory than was available on the host where it was running. This was caused by autoscaling operations that increased the number of pods tracked by Prometheus, which in turn increased the Prometheus server's memory requirement. The host killed the Prometheus server process, which was restarted shortly after by the Kubernetes control plane. In the interim, the metrics used for application autoscaling were unavailable, so the affected services were not being triggered to scale up, resulting in the observed delays. Prometheus exceeded the host's available memory again soon after restarting, which caused the cycle to repeat.

The on-call team followed prepared documentation to shift load on the affected services back to Fargate. The majority of customers saw complete recovery from 16:49. A handful of customers had developed such a large backlog during the period of higher latency that they had to be manually scaled up further. All customers saw full recovery by 17:33.

## Changes we're making

We have already made the following changes to our rollout of EKS for production workloads:

* Upsized the underlying system nodes.
* Set higher requests and limits for the Prometheus server so it can handle more product load.
* Reviewed and set any missing requests and limits for all new EKS resources, ensuring that EKS has all the required information to prevent accidental resource contention.
* Added more observability and monitors for EKS pod and node health to help us identify root causes quickly during future incidents.

We have since migrated all these services back to EKS and observed successful scaling well beyond the limits we encountered during this incident.
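The review of missing requests and limits mentioned above could be approximated with a small audit like the sketch below. It assumes pod specs have already been loaded as plain dictionaries using the standard Kubernetes field names (`containers`, `resources`, `requests`, `limits`); the `missing_resources` helper, the container name, and the resource values are hypothetical and do not reflect the real configuration.

```python
def missing_resources(pod_spec: dict) -> list[str]:
    """List the request/limit fields each container in a pod spec leaves unset."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        for kind in ("requests", "limits"):
            for resource in ("cpu", "memory"):
                if resource not in resources.get(kind, {}):
                    missing.append(f"{container['name']}: {kind}.{resource}")
    return missing


# Illustrative spec only: the name and values are made up for this example.
spec = {
    "containers": [
        {
            "name": "prometheus-server",
            "resources": {
                "requests": {"cpu": "500m", "memory": "4Gi"},
                "limits": {"memory": "8Gi"},
            },
        }
    ]
}
print(missing_resources(spec))  # ['prometheus-server: limits.cpu']
```

Some teams leave CPU limits unset on purpose, so a report like this is better treated as a prompt for review than as a hard failure.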
Delayed Test Engine ingestion processing
minor · May 15, 2026 · resolved May 15
Processing of the backlog is complete.
Error rates increasing
minor · May 13, 2026 · resolved May 13
Additional capacity was added to our Redis caches. This triggered a failover between 15:10 and 15:14 UTC, which caused a spike of errors on the REST and GraphQL APIs. Customers would also have seen some errors in the Buildkite UI during this period. We have been monitoring the situation since then, and error rates have returned to baseline.


