Sardine AI Status Page

AI & Machine Learning · monitored by Alert24

All Systems Operational

Current Status

All Systems Operational

Components

Device APIs
Operational
Issuing API
Operational
Customer APIs
Operational
Dashboard
Operational
Crypto APIs
Operational
External Provider
Operational
Crypto Web
Operational

Recent Incidents

Degraded API Latency

none

May 22, 2026 · resolved May 22

#### Summary

Our cache service experienced intermittent latency issues on May 22, 2026. Service has been fully restored.

#### What Happened

Our cache infrastructure experienced three distinct periods of elevated latency:

* 22:23-22:31 UTC
* 22:46-22:52 UTC
* 23:00-23:05 UTC

The system partially recovered between these periods but experienced cascading failures before full restoration at 23:05 UTC.

#### Why It Happened

The incident began with an unusual spike in cache sets and cache deletes that stressed the caching infrastructure.

#### What We're Doing About It

* Implementing infrastructure improvements to prevent similar incidents resulting from a noisy-neighbor pattern
* Enhancing cache monitoring, alerting, and runbooks
* Working with SMEs from our cloud provider to identify and address any contributing factors

We apologize for the disruption and appreciate your patience.
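For readers who want to quantify the impact, the three windows listed above can be summed with a few lines of Python. This is only a sketch of the arithmetic; the variable and function names are illustrative and not part of the original report.

```python
from datetime import datetime

# The three elevated-latency windows from the report (UTC, May 22, 2026).
WINDOWS_UTC = [("22:23", "22:31"), ("22:46", "22:52"), ("23:00", "23:05")]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Minutes of elevated latency per window, and in total.
degraded = [minutes_between(start, end) for start, end in WINDOWS_UTC]
total_degraded = sum(degraded)

# Minutes from the first onset to full restoration, including partial recoveries.
end_to_end = minutes_between(WINDOWS_UTC[0][0], WINDOWS_UTC[-1][1])

print(degraded, total_degraded, end_to_end)  # [8, 6, 5] 19 42
```

In other words, latency was elevated for 19 minutes in total, spread over a 42-minute span.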

Documentation is temporarily not available

minor

May 20, 2026 · resolved May 20

This has been resolved; documentation is available.

Dashboard instability might occur while loading certain entities

minor

May 12, 2026 · resolved May 13

**Impact:** During the incident window:

* **Customer Intelligence Search** latency was degraded for queries spanning **>30 days** of data.
* **Session Details** and **Customer Details** page loads were slow.
* **Connections Graph** and **Timeline** features were also impacted.

## Executive Summary

As part of infrastructure optimization, our development team performed multiple operations on our search databases to optimize index structure and data storage. This left our warm data cluster inefficiently provisioned and degraded performance. The team ultimately resolved the incident by updating the data cluster configuration. Due to the volume of data, a simple rollback was not possible, which is why the incident lasted so long.

## Incident Details

### What Happened

Our development team performed multiple operations on our search databases to optimize index structure and data storage. Due to a bug in the migration script, we migrated more data than initially anticipated. The destination cluster didn’t have sufficient storage and computing resources assigned.

Latency started rising slowly as more data was migrated. This was initially dismissed as expected, because we were moving older data to separate clusters that are slower but should remain within acceptable bounds.

Two days later, on May 12, as the warm indices filled up and the migration completed, users began reporting that dashboard search was very slow. We then attempted to upsize the cluster, but the upsize could not complete due to high traffic and the large amount of data. The incident was resolved when our team manually reverted some of the operations.

## Timeline

| Time (PT; May 12 unless noted) | Event |
| --- | --- |
| **May 10, 23:38** | Automated data migration operation was initiated; the team was monitoring and didn’t report any issue |
| **May 11, 00:00** | Latency starts climbing. Alerts were triggered but assumed to be expected. |
| **May 12, 6:02 AM** | Support reports dashboard slowness; on-call begins investigation |
| **9:04 AM** | Incident formally created |
| **10:56 AM** | First code fix deployed for customer details + session details |
| **11:18 AM** | Deploy complete, pages still slow |
| **12:25 PM** | Removed search dependency on Customer Profile + Session Details. Page loads improved; Network Graph + Customer search still slow. |
| **1:09 PM** | Root cause identified: indices incorrectly in warm tier; direct hot-tier migration initiated (~10h estimated) |
| **3:05 PM** | Warm tier upsized aggressively; migration still not converging |
| **7:00–7:08 PM** | Search cluster repeatedly auto-cancels in-flight shard recovery; direct migration abandoned |
| **7:19 PM** | Switched to another approach: spinning up a new cluster |
| **7:41 PM** | April indices restored from snapshot; last-30d queries drop to ~15ms |
| **9:03 PM** | February + March index restores complete |
| **10:14 PM** | Replicas added to hot copies; search queue drops to 0. Incident resolved. |

## Action Items

Immediate:

* Manually roll back the problematic resource allocation
* Ensure all node pools have enough resources

Medium Term Process Improvements:

* Runbook and migration process for search database upgrade operations
* Better review process for infra changes
* Runbook for monitoring upgrades and immediate rollback
* Observability in order to know if latency is expected
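The last action item, knowing whether a latency increase is expected, could be approached by agreeing on an explicit latency ceiling before a migration starts, so that an alert above the ceiling is escalated rather than dismissed. The sketch below is one hypothetical way to express that; `LatencyBudget`, `classify_latency`, and all numbers are illustrative assumptions, not Sardine's tooling or thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """Latency ceiling agreed before a maintenance operation (hypothetical)."""

    baseline_p95_ms: float     # normal p95 latency before the operation
    allowed_multiplier: float  # how much slower the operation may make queries

    @property
    def ceiling_ms(self) -> float:
        return self.baseline_p95_ms * self.allowed_multiplier


def classify_latency(observed_p95_ms: float, budget: LatencyBudget) -> str:
    """Return 'normal', 'expected', or 'escalate' for an observed p95 latency."""
    if observed_p95_ms <= budget.baseline_p95_ms:
        return "normal"
    if observed_p95_ms <= budget.ceiling_ms:
        return "expected"  # slower than usual, but inside the agreed bound
    return "escalate"      # beyond the bound: do not dismiss the alert


# Illustrative numbers only: a 200 ms baseline that may at most double.
budget = LatencyBudget(baseline_p95_ms=200.0, allowed_multiplier=2.0)
for observed in (150.0, 350.0, 900.0):
    print(observed, classify_latency(observed, budget))
```

The design choice here is that "expected" has a numeric upper limit written down in advance, so the judgment does not depend on whoever happens to be watching the alert.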

Dashboard latency when accessing certain items

minor

May 12, 2026 · resolved May 12

This incident has been resolved.

Increased latency in PROD US for /v1/customers and /v1/issuing/risks endpoints

major

May 6, 2026 · resolved May 6

**Summary**

On May 6, 2026 from approximately 16:48 to 17:57 UTC, customers using Sardine's `/v1/customers` and `/v1/issuing/risks` APIs experienced elevated latency and degraded responses. We sincerely apologize for the disruption this caused. This document summarizes what happened, why it happened, and the steps we are taking to prevent recurrence.

**What Happened**

During this window, requests to the affected endpoints experienced one of two behaviors:

* Elevated latency with a `SITO` reason code, indicating that Sardine was unable to compute certain risk signals within the expected timeframe. Customers still received rule evaluation results, but with limited signals.
* For approximately 14% of `/v1/customers` traffic, requests returned HTTP 500 errors.

The incident was resolved at approximately 17:57 UTC after our team rerouted database traffic to a healthy replica.

**Why It Happened**

The root cause was an infrastructure failure in our cloud provider's (Google Cloud) database service in the us-central1 region. An internal resource shortage for certain instance types caused a routine automatic update operation on our primary read replica to fail. While the Google Cloud UI and CLI reported the instance as healthy, database instances were not properly handling incoming queries. Our team performed a manual failover to redirect traffic to a healthy replica, which restored service.

**What We're Doing About It**

We are taking the following actions to reduce the likelihood and impact of similar incidents:

* Improved monitoring: We are adding alerts for failed database update operations and for "zombie" database states where an instance appears healthy but is not accepting queries.
* Failover runbook: We are formalizing a documented procedure for read replica failover, including how to identify a failover target, resize a replica, update service configuration, and restart affected services.
* Graceful degradation: We are investigating how to maintain partial service when a read replica is unavailable, rather than surfacing timeouts to customers.
* Application timeout enforcement: We are reviewing and correcting how our services enforce database query timeouts to ensure failures surface quickly rather than hanging.
* Database architecture review: We have scheduled a review with our cloud provider to evaluate high-availability configuration improvements and reduce our exposure to single-replica failure modes.

~~We are also requiring a full root cause analysis from Google Cloud within 3 business days.~~

[EDIT: We have received the RCA from Google. Here is an excerpt from Google’s RCA, lightly edited:

On May 6, 2026, a database instance in the us-central1 region experienced total read unavailability following a series of scale-out operations. The incident was driven by a combination of regional resource exhaustion (stockout) and a logic error in the Managed Instance Group (MIG) downsizing algorithm. The MIG incorrectly prioritized the removal of healthy, running virtual machines (VMs) over non-functional "phantom" instances during automated reconciliation.

## Resolution and mitigation

### Immediate actions

* **Downtime mitigation:** The affected instance was moved to a different machine family (`N2`) with sufficient regional capacity to restore service immediately.
* **Reservations:** Additional capacity reservations were placed for the customer to ensure that stockouts don’t impact existing nodes as they update their instances as part of their N2 to C4A migration and production scale changes.

### Permanent fix

**Algorithm update:** A fix for the MIG’s downsizing algorithm has been developed and verified, and is currently rolling out. This update ensures that non-running instances are always prioritized for deletion over healthy ones when removing nodes.

The global rollout of this fix is scheduled for completion by the end of May 2026, following standard safety and validation procedures.

EDIT END]

We take the reliability of our platform seriously and apologize again for the impact this incident had on your operations. Please reach out to your account team or [[email protected]](mailto:[email protected]) if you have questions.
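The "zombie" state mentioned in the monitoring action item above, where a control plane reports an instance as healthy while it does not answer queries, is usually detected by probing the data path directly. The following is a minimal sketch of that idea, not Sardine's implementation: `probe_replica` and the `run_query` callable are hypothetical, and `run_query` stands in for whatever runs a trivial statement (for example `SELECT 1`) against the replica.

```python
import concurrent.futures
from typing import Callable


def probe_replica(run_query: Callable[[], object], timeout_s: float = 1.0) -> str:
    """Classify a replica by running a trivial query with a hard deadline.

    run_query is a hypothetical callable that executes a cheap statement
    on the replica and returns once the replica has answered.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(run_query)
    try:
        future.result(timeout=timeout_s)
        return "serving"
    except concurrent.futures.TimeoutError:
        return "zombie"    # no answer within the deadline
    except Exception:
        return "erroring"  # answered, but with a failure
    finally:
        # Do not block on a hung query; the worker thread is left to finish.
        pool.shutdown(wait=False)
```

An alert would then fire when the provider's status says healthy while `probe_replica` returns `"zombie"`. Note that the probe does not cancel a hung query; it only stops waiting for it, so a real check would also set a timeout on the database connection itself.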
