Current Status
All Systems Operational
Components
Recent Incidents
WhatsApp messages failing Prods 1 + 2
minor · May 26, 2026 · resolved May 27
This incident has been resolved. Our teams have confirmed recovery and platform stability following mitigation efforts. We will be reaching out independently to affected orgs to provide additional context and follow-up as needed. Thank you for your patience while we worked through this issue.
Chat - Messages Failing & Assistant Issues - Prod-1
minor · May 14, 2026 · resolved May 14
Kustomer has resolved an event affecting Chat on Prod 1 that caused messages to fail and assistants to not follow the configured flow. After careful monitoring, our team has determined that all affected areas are now fully restored. Please reach out to Kustomer support if you have additional questions or concerns.
[Gmail] Email Connectivity & Send/Receive Issues - Prod 1 & Prod 2
minor · May 1, 2026 · resolved May 1
## Summary

On May 1, 2026, customers using the Gmail integration experienced a service event that caused Gmail connections to disappear and the email channel to stop functioning. The issue affected customers across different regions, more specifically:

* EU-based clients, from 2:40 AM ET.
* US-based clients, from 5 AM ET.

The issue was fully resolved by 9:20 AM ET. No customer data was lost during this event. Messages that were delayed during the incident were fully recovered once service was restored. Current system health is stable.

## Root cause

A configuration issue in the infrastructure used by the Gmail integration prevented replacement service tasks from starting correctly during routine task rotation. As running capacity declined over several hours, the Gmail integration became unavailable.

## Timeline

| Time (ET) | Event |
| --- | --- |
| May 1, ~12:30 AM | Service task replacement failures began in production. |
| May 1, ~2:40 AM | Engineering was alerted after service capacity dropped to zero in one production environment. |
| May 1, 2:40–9:20 AM | Teams investigated the failure, identified the configuration problem, prepared a fix, and deployed it. |
| May 1, 9:20 AM | Fix deployed and service restored. |

## Lessons and improvements

* We are auditing related infrastructure configurations to identify and correct similar patterns in other services.
* We are standardizing how service roles are managed so this class of configuration issue is less likely to recur.
* We are improving alerting so teams are notified earlier when running service capacity drops below expected levels, before a full service interruption occurs.
* We are adding additional checks to catch configuration drift and invalid service role references earlier in the deployment lifecycle.
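The alerting improvement described above (notify before running capacity reaches zero) can be illustrated with a simple threshold check. This is a minimal sketch for readers, not Kustomer's alerting implementation: the `capacity_alert` function, the `warn_ratio` threshold, and the task counts are all hypothetical.

```python
from typing import Optional


def capacity_alert(running: int, desired: int, warn_ratio: float = 0.75) -> Optional[str]:
    """Return an alert level for a service, or None if capacity looks healthy.

    running: number of tasks currently running.
    desired: number of tasks the service is expected to run.
    warn_ratio: hypothetical threshold, not a value from the incident report.
    """
    if desired <= 0:
        return None  # nothing is expected to run, so there is nothing to alert on
    if running == 0:
        return "critical"  # full service interruption
    if running / desired < warn_ratio:
        return "warning"  # capacity is declining; notify before it reaches zero
    return None


if __name__ == "__main__":
    # Simulated decline while replacement tasks fail to start during rotation.
    for running in (8, 6, 5, 2, 0):
        print(running, capacity_alert(running, desired=8))
```

With a check like this, the decline from 8 to 5 running tasks would already raise a warning, well before the zero-capacity condition that triggered the alert in this incident.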
[DRAFTS] Internal API Errors (Prod1)
minor · Apr 24, 2026 · resolved Apr 24
## Summary

On April 24, 2026, customers in our prod1 environment experienced a service event that caused elevated errors and latency in messaging-related workflows. The broad cross-customer impact was limited to approximately 41 minutes, from 3:26 PM ET to 4:07 PM ET. During that window, some customers saw failed or delayed messaging operations.

The immediate platform impact was resolved the same day, and overall system health returned to normal. We then completed follow-up mitigation to stop the underlying event source and reduce the risk of recurrence.

## Impact

* Customers in prod1 experienced elevated API errors and latency in messaging-related workflows.
* The broad cross-customer impact lasted about 41 minutes.
* A subset of messaging workflows failed or were delayed during that period.

## Timeline

* **~3:25 PM ET**: A newly enabled automation began processing a large backlog of historical conversations for one tenant.
* **3:26 PM ET**: Elevated errors and latency began affecting shared messaging workflows in prod1.
* **4:01 PM ET**: We published a status update for the production issue.
* **4:07 PM ET**: Broad cross-customer impact ended as the affected services stabilized.
* **~5:17 PM ET**: We disabled the triggering automation configuration for the affected tenant.
* **~5:23 PM ET**: The remaining retry activity stopped.

## Root cause

The event was triggered when a newly enabled automation for one tenant processed a much larger set of eligible conversations than intended. That sudden volume overloaded a shared downstream service and caused elevated errors and timeouts in dependent workflows.

The incident was amplified by missing safeguards in how this automation handled backlog volume and retries. In particular, the system did not sufficiently limit the number of conversations processed at once or prevent the same failed work from being retried too aggressively.
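To illustrate the first missing safeguard named in the root cause (a limit on how many conversations are processed at once), here is a minimal sketch of fixed-size batching combined with a per-tenant token bucket. It is an assumption-laden example, not the actual implementation: `batched`, `TenantThrottle`, and the `rate` and `burst` parameters are hypothetical names and values.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield the backlog in fixed-size batches instead of all at once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


class TenantThrottle:
    """Token bucket per tenant: `rate` items per second, bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock  # injectable so the behavior can be tested deterministically
        self._tokens: Dict[str, float] = {}
        self._updated: Dict[str, float] = {}

    def try_acquire(self, tenant_id: str, cost: int = 1) -> bool:
        """Return True if the tenant may process `cost` more items right now."""
        now = self.clock()
        tokens = self._tokens.get(tenant_id, float(self.burst))
        last = self._updated.get(tenant_id, now)
        # Refill in proportion to elapsed time, never beyond the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        self._updated[tenant_id] = now
        if tokens >= cost:
            self._tokens[tenant_id] = tokens - cost
            return True
        self._tokens[tenant_id] = tokens
        return False
```

Because each tenant has its own bucket, one tenant's backlog can exhaust only that tenant's budget, which keeps a single large job from consuming the capacity of a shared downstream service.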
## Resolution

We restored platform stability during the incident by allowing the affected services to recover under increased capacity, then disabled the triggering automation configuration and cleared the remaining retry backlog. System health is currently normal.

## Preventative actions

We are treating the following preventative actions as a priority bug effort. We expect to complete them by the end of May, in accordance with our SLOs:

* Prevent newly enabled automation settings from processing large historical backlogs unintentionally.
* Add stronger batch limits and tenant-level throttling for this workflow.
* Reduce retry amplification by improving how failed work is tracked and re-queued.
* Improve error handling so rate-limit conditions are classified correctly and handled with the right retry behavior.
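The last two preventative actions (less aggressive retries and correct handling of rate-limit conditions) can be sketched as a single retry-delay decision. This is an illustrative example under stated assumptions, not the platform's retry logic: the `retry_delay` function, the set of retryable HTTP status codes, and the `base`, `cap`, and `max_attempts` values are hypothetical.

```python
import random
from typing import Optional

# Assumption: the downstream service signals overload with these HTTP status codes.
RETRYABLE_STATUS = {429, 502, 503, 504}


def retry_delay(status: int, attempt: int, retry_after: Optional[float] = None,
                base: float = 1.0, cap: float = 60.0, max_attempts: int = 5,
                rng: Optional[random.Random] = None) -> Optional[float]:
    """Return the seconds to wait before retrying, or None to stop retrying.

    attempt: number of failed attempts so far (1 after the first failure).
    retry_after: wait time requested by the server, if it sent one.
    """
    if status not in RETRYABLE_STATUS:
        return None  # e.g. a 400 will not succeed on retry, so do not re-queue it
    if attempt >= max_attempts:
        return None  # give up instead of retrying the same work indefinitely
    if status == 429 and retry_after is not None:
        return max(0.0, retry_after)  # rate limited: honor the server's hint
    if rng is None:
        rng = random.Random()
    # Exponential backoff with full jitter: spread retries out so failed work
    # does not return to the downstream service in synchronized waves.
    upper = min(cap, base * 2 ** (attempt - 1))
    return rng.uniform(0.0, upper)
```

Returning `None` for non-retryable errors and exhausted attempts gives the caller one place to decide between re-queueing and dropping failed work, which is where retry amplification is usually introduced.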
[Knowledge Base] [Custom Domains returning 500 error] [POD(s) AFFECTED]
minor · Apr 14, 2026 · resolved Apr 14
## **Summary**

On April 14, 2026, customers experienced errors accessing the Kustomer Knowledge Base, with all KB pages returning 500 errors. The issue was caused by a dependency upgrade in a recent deployment that introduced a TLS certificate mismatch in internal service communication.

Customer impact began at 9:51 AM ET, when the deployment reached the first production environment, and was fully resolved by 10:38 AM ET, a ~47-minute impact window. Engineers identified the root cause and completed a full rollback across all production environments within 12 minutes of the initial alert.

## **Root Cause**

A recent deployment to the KB service included an upgrade to an internal library that contained a known issue with TLS hostname verification. This caused internal service-to-service requests to fail, resulting in 500 errors for all KB requests.

The affected library version had previously been identified as problematic in a staging environment, but the fix had not been fully applied across all services before this deployment reached production.
## **Timeline**

**Apr 14, 2026**

* 9:51 AM ET: Deployment reached the first production environment; customers began experiencing 500 errors when accessing the Knowledge Base
* 10:26 AM ET: Automated alerting fired; incident response began
* 10:31 AM ET: Engineers identified a recent deployment as the likely cause and began investigating rollback options
* 10:35 AM ET: Root cause confirmed as a problematic internal library version; rollback initiated across all production environments
* 10:37 AM ET: Rollback completed on prod1; KB restored for affected customers
* 10:38–10:39 AM ET: Rollback completed across remaining production environments
* 10:45 AM ET: Full KB functionality confirmed restored for all customers
* 12:09 PM ET: Corrected fix deployed to the Knowledge Base service; remediation completed across all other affected services

## **Lessons/Improvements**

* Implementing stricter controls to prevent pre-release or beta library versions from being deployed to production
* Improving the process for tracking and completing cross-service remediation work when an issue is identified in one service, to ensure all affected services are addressed
* Enhancing our pre-production validation process to improve detection of this class of issue before it reaches production environments
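One way to picture the first improvement above (keeping pre-release or beta library versions out of production) is a check that runs against pinned dependency versions before a deployment is promoted. The sketch below is hypothetical: it assumes versions follow Semantic Versioning, and the `find_prerelease_pins` function and the package names in the usage example are invented for illustration.

```python
import re
from typing import Dict, List

# Semantic Versioning: MAJOR.MINOR.PATCH, optional "-prerelease", optional "+build".
_SEMVER = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-([0-9A-Za-z.-]+))?(?:\+[0-9A-Za-z.-]+)?$"
)


def find_prerelease_pins(pins: Dict[str, str]) -> List[str]:
    """Return the names of dependencies that should block a production deploy.

    A dependency is flagged when its pinned version carries a pre-release tag
    (for example "2.0.0-beta.3") or is not a valid SemVer string at all.
    """
    flagged = []
    for name, version in sorted(pins.items()):
        match = _SEMVER.match(version)
        if match is None or match.group(4) is not None:
            flagged.append(name)
    return flagged


if __name__ == "__main__":
    # Hypothetical pins; none of these package names come from the incident report.
    pins = {
        "http-client": "4.2.0",
        "tls-utils": "2.0.0-beta.3",
        "logger": "1.9.1+build.7",
        "queue": "latest",
    }
    print(find_prerelease_pins(pins))
```

Treating unparseable versions such as `latest` as failures is a deliberate choice in this sketch: a floating pin can resolve to a pre-release without the version string ever showing it.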



