Current Status
All Systems Operational
Components
Recent Incidents
`from_url` uploads degradation
minorDec 19, 2025 · resolved Dec 19
Although Cloudflare's issues persist and the files they distribute take longer to download via `from_url` than others, our systems remain stable. We are closing the incident but will continue to monitor the situation.
Upload API Degradation.
majorDec 15, 2025 · resolved Dec 16
# **Analysis of the December 15, 2025, Upload API Disruption** ## 1. Summary On December 15, 2025, the Uploadcare platform experienced a significant service disruption affecting the Upload API and REST API subsystems. The incident, spanning approximately ten hours, resulted in elevated error rates for file uploads, increased latency across API endpoints, and interruptions of Dashboard operations. ## 2. Root cause The root cause was a resource contention event on Amazon Elastic File System \(EFS\), aggravated by an observability side-effect. A monitoring agent, executing a metadata-heavy recursive file scan, monopolized the I/O capacity of the shared NFS mount points. This I/O saturation created a cascading failure due to a legacy architectural pattern where storage operations occur within active database transactions. Background Processing Workers, taking long time to read or write to the stalled filesystem, held open database transactions for a long time. This saturated the global connection pool \(PgBouncer\), causing a denial of service for Upload API HTTP Workers and other services attempting to run database queries, despite the underlying PostgreSQL engine remaining healthy. ## 3. Architectural context and system design The Uploadcare platform is designed for scale, but this incident exposed specific coupling risks between our storage, database, and monitoring layers. ### 3.1 Ingestion pipeline & worker roles The core of the affected system is the Upload API. It utilizes two distinct worker types: 1. Upload API HTTP workers. These synchronous workers handle incoming HTTP requests \(Direct uploads, Multipart, and `from_url` requests\). They ingest data and schedule tasks. 2. Background processing workers. These asynchronous workers pick up tasks from a queue to perform complex operations \(virus scanning, image validation, property extraction, etc\). Both worker types utilize a shared temporary staging area backed by Amazon EFS to maintain a POSIX-compliant shared state across the distributed fleet. ### 3.2 Observability stack To monitor the staging area's health, a Telegraf sidecar agent was configured to track file backlog depth. This was implemented via a standard Linux `find` command sequence. On a network filesystem \(NFS\), a recursive `find` forces a traversal of the directory tree over the network. As file counts grow, the cost of this operation increases linearly, generating a storm of `READDIRPLUS` and `GETATTR` RPC calls. ## 4. Timeline _All times are listed in Coordinated Universal Time \(UTC\)._ * **17:17:** System begins receiving a massive surge of `from_url` upload requests. This workload is write-heavy. * **17:17 – 17:23:** Upload API HTTP workers successfully ingest files and write them to EFS. However, Background processing workers \(responsible for processing and deleting files\) begin to slow down due to the increasing I/O load. The rate of file creation exceeds the rate of processing/deletion. The file count on the volume jumps from ~40 \(healthy\) to ~7,000. * **17:25:** The "Noisy Neighbor" effect begins. The Telegraf `find` command, attempting to scan 7,000\+ files, fails to complete within its 30-second timeout. We lose visibility of metrics related to the file staging area. * **17:30:** Multiple instances of the `find` command stack up. The resulting metadata storm heavily affects the NFS client's ability to process requests. Background processing workers become very ineffective. * **19:06:** PostgreSQL begins terminating client connections due to `idle-in-transaction` timeouts. This confirms workers are stuck holding connections. * **20:28 – 20:51:** Engineering attempts to horizontally scale the PgBouncers to absorb the held connections. This inadvertently pushes the upper layer PgBouncers toward their OS file descriptor limits. * **21:07:** Upper layer PgBouncer instances hit the Linux file descriptor limit \(`ulimit`\). New connections to this layer begin failing. * **21:28:** File descriptor limits are increased. Background processing workers begin to recover slowly as the pool stabilizes. * **23:30:** Background processing workers successfully process the backlog. The background task queue becomes empty. * **23:41:** Despite the background queue being empty, Upload API HTTP workers continue to exhibit "overload" behavior. * **01:35 \(Dec 16\):** Engineering reconfigures Upload API HTTP workers to connect directly to the upper PgBouncer layer. * **02:48:** EFS metrics reappear. * **02:58:** Engineers manually removed approximately 4,000 "orphaned" files that remained on the disk \(files where the background job failed during the crash and could not self-heal\). File manipulations on EFS become normal. * **02:59:** Database connection pressure vanished. Latency returned to nominal levels. Incident resolved. ## 5. Response Evaluation ### What went well * Proactive alerting. Our monitoring systems successfully triggered alerts for elevated wait times before customer reports of degradation began to surface, giving the team an early head start on triage. * Core database resilience. Despite the severe saturation of the connection poolers \(PgBouncer\), the underlying PostgreSQL database engine demonstrated remarkable stability. Resource utilization \(CPU/Memory\) on the database nodes remained nominal, preventing a total database meltdown. * URL API isolation. The URL API \(file processing and delivery\) remained fully operational throughout the incident. Its architectural isolation from the ingestion pipeline ensured that end-user access to previously uploaded files was never affected. ### What went wrong * Observability became the bottleneck. The monitoring agent \(`telegraf`\), designed to provide visibility, became the primary cause of the outage. The use of an active, heavy command like `find` on a network filesystem was a design flaw that turned a monitoring check into a Denial of Service attack on the NFS client. * Misleading initial signals. The initial flood of `idle-in-transaction` connections led the response team to focus on the database layer. It delayed the investigation into the storage layer. ## 5. Action Items * Eliminate "Noisy Neighbor" observability. Replace active, resource-intensive monitoring checks \(specifically the recursive `find` command\) with passive metrics. Transition to using AWS CloudWatch metrics or other side-channel indicators that do not consume client-side I/O credits or burden the EFS mount. * Enforce connection pool isolation. Reconfigure PgBouncer to implement strict resource bulkheading. Dedicate separate connection pools for the Upload API HTTP workers, REST API HTTP workers, and background workers. This ensures that a resource stall in the background worker fleet cannot starve the customer-facing APIs of database connections. * Decouple storage I/O from database state. Refactor the application logic to execute blocking file storage operations \(reading/writing to EFS\) _outside_ of atomic database transactions. This ensures that storage latency results only in slower worker throughput, rather than holding database connections open \(`idle-in-transaction`\) and exhausting the global pool. * Upgrade incident response tooling. Equip production containers and nodes with essential diagnostic capabilities and establish secure access protocols. This ensures engineers can diagnose kernel-level and network-level bottlenecks during active incidents. ## 6. Conclusion The December 15 disruption was a complex failure mode involving storage physics, monitoring side-effects, and database concurrency. By identifying the contention on the EFS mount \(caused by the monitoring agent\) as the root cause, rather than a hard AWS limit, we have a clear path to resolution. The removal of the invasive monitoring script and the decoupling of transactions will robustly insulate the Uploadcare platform from similar incidents in the future. We apologize for the disruption and appreciate the patience of our customers as we hardened our systems.
URL API Service Disruption
majorOct 20, 2025 · resolved Oct 20
On October 20, 2025, between 07:50 UTC and 09:27 UTC, a portion of our URL API service experienced a major disruption. The service responsible for real-time processing and transformation of uncached files was unavailable, resulting in failed requests for those specific operations. The delivery of already-cached files and all other Uploadcare services, including file uploading and management, were unaffected. The direct root cause of this incident was a major service outage in the Amazon Web Services \(AWS\) `us-east-1` \(N. Virginia\) region, specifically impacting AWS DynamoDB. During the incident, we also identified a critical flaw in our incident communication process: we were unable to access our Atlassian-hosted status page to provide timely updates to our customers. It was affected by AWS services disruption, too. Our key follow-up actions will focus on improving the architectural resilience of our services to reduce single points of failure and establishing a more robust, independent system for incident communications. ## Timeline of events All times are in UTC on October 20, 2025. * **06:49:** Our internal monitoring systems begin to alert on a significant increase in error rates and latency for the URL API service, specifically for requests involving uncached files. Our engineering team begins an immediate investigation. * **07:03**: Our engineering team localizes issue with DynamoDB. * **07:37:** The team attempts to update our public status page \(`status.uploadcare.com`\) to inform customers but is unable to log in to the third-party provider \(Atlassian Statuspage\). * **07:46:** We get to conclusion that the issue is related to DNS of DynamoDB. * **07:51:** AWS posts its first notification acknowledging increased error rates and latencies for multiple services in the `us-east-1` region. Our team correlates our internal alerts with this broader AWS issue. * **08:26:** AWS confirms that the issue is centered on significant error rates for DynamoDB in `us-east-1`, confirming our initial diagnosis of the root cause affecting our service. * **09:01:** AWS reports they have identified a potential root cause related to DNS resolution for the DynamoDB API endpoint and are working on mitigations. * **09:27:** Our internal monitoring shows that error rates for the URL API have returned to normal levels. The service is fully recovered and operational. We declare the incident resolved internally. ## What went well * Our internal monitoring systems detected the service failure immediately, allowing for a rapid start to our investigation. * Our engineering team was able to quickly correlate the internal issue with the external AWS service outage, preventing wasted time on internal diagnostics. ## What went wrong * **Architectural single point of failure:** Our URL API for uncached files had a hard dependency on a single service \(DynamoDB\) within a single AWS region \(`us-east-1`\). The failure of this service led directly to the failure of ours, with no mechanism for graceful degradation or failover. * **Failure of incident communication:** Our designated channel for customer communication, the status page hosted at `status.uploadcare.com`, was inaccessible to our team during the outage. This failure prevented us from providing timely and transparent updates to our users, which is a critical part of our incident response protocol. ## Action items * **Establish a redundant status communication channel:** We will set up a secondary, fully independent channel for incident communication. This will ensure that if our primary status page provider is ever unavailable, we can still communicate effectively with our customers. * **Architect for resilience and decentralization:** Our engineering team will conduct a full architectural review of the URL API service. The primary goal is to re-architect the service to remove DynamoDB in `us-east-1` as a single point of failure. This may involve implementing multi-region failover capabilities or introducing a more resilient data caching layer. * **Audit critical services for single points of failure:** We will expand our review to other critical services across the Uploadcare platform to identify and mitigate other potential single points of failure related to external, regional dependencies. We sincerely apologize for the disruption this incident caused our customers and for our failure to communicate the issue in a timely manner. We are committed to learning from this event and implementing these changes to build a more resilient and reliable platform.
Elevated URL API Errors
majorNov 26, 2024 · resolved Nov 26
## Incident Summary On 26 November 2024, an issue with our URL API was identified, causing a partial outage of the service from 14:31 to 15:00 UTC. ## Timeline * **14:20 UTC** – A CDN configuration change was made to improve the Auto Format feature in the URL API. * **14:31 UTC** – Service degradation began as increased traffic overwhelmed our origin servers due to cache key changes. * **14:37 UTC** – Alert notifications were received by our team. * **14:58 UTC** – The CDN configuration was rolled back. * **15:00 UTC** – The incident was fully resolved, and service was restored. ## Root Cause The issue was caused by a CDN configuration change that altered cache key behavior, significantly increasing traffic to our origin servers. This overwhelmed our autoscaling groups, which hit their scaling limits, resulting in a service outage. ## Impact URL API degradation lasted for approximately 29 minutes. ## Challenges During Resolution This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time. ## Resolution The configuration change was identified as the root cause and rolled back. The service stabilized immediately after the rollback, and full functionality was restored by 15:00 UTC. ## Action Items ### Short-term * Review and improve escalation processes for alert notifications to reduce resolution time. * Adjust staging environment testing to better simulate production traffic patterns for similar changes. ## Long-term * Assess autoscaling limits to ensure sufficient capacity for handling unexpected traffic spikes. We sincerely apologize for the disruption and are committed to learning from this incident to improve the reliability of our services. Thank you for your understanding and continued support.
Unavailability of uploading from 3rd Party Services
majorNov 25, 2024 · resolved Nov 25
## Incident Summary On November 25, 2024, from 16:01 to 19:44 UTC, the “Uploading from 3rd Party Services” feature of our platform was unavailable. This feature allows users to upload files from social networks and storage services like Dropbox, Facebook, Google Drive, and Instagram. ## Timeline * **15:50 UTC** – A server configuration update was initiated to streamline deployments in our Kubernetes environment. * **16:01 UTC** – The outage began as requests to the “Uploading from 3rd Party Services” feature failed to be processed. * **19:00 UTC** – Investigation revealed that the application networking was incorrectly configured. * **19:15 UTC** – A fix was implemented to re-configure the application networking. * **19:44 UTC** – The incident was fully resolved, and service was restored. ## Root Cause The outage was caused by a synchronization issue between repositories during a server configuration update. This misalignment between repositories, compounded by a lack of explicit configuration, led to the service outage. ## Impact For approximately 3 hours and 43 minutes, users were unable to upload files from third-party services. ## Challenges During Resolution This type of issue is difficult to catch in a staging environment because the traffic profile does not reflect production-level demand. Although the alert system worked as intended, delays in responding to the alert contributed to the resolution time. ## Resolution The application configuration was updated and service functionality was fully restored at 19:44 UTC. ## Action Items ### Short-term * Improve feature status monitoring and alerting to detect outages faster. ## Long-term * Improve synchronization processes between repositories to avoid dependency misalignments. We sincerely apologize for the disruption this caused to our users. We take this incident seriously and are committed to implementing the above action items to prevent similar occurrences in the future. Thank you for your patience and continued trust in our platform. If you have any questions or need further details, please don’t hesitate to reach out.
Get alerted when Uploadcare goes down
Alert24 monitors Uploadcare and 3,700+ other cloud and SaaS providers. When an outage is detected, it updates your status page automatically and pages your on-call team. No manual updates at 2 AM.



