Commerce Layer Status Page

E-commerce & Marketplaces · monitored by Alert24

Current Status

All Systems Operational

Components

Commerce API
Operational
Checkout
Operational
Dashboard
Operational
Documentation
Operational
Metrics API
Operational
Stream Hub
Operational

Recent Incidents

Dashboard suffered from third party CDN issue

minor

Nov 18, 2025 · resolved Nov 19

The issue has been resolved.

Incident Report

major

May 23, 2025 · resolved May 23

# **Summary**

On May 23rd, 2025, from `5:50 AM UTC` to `7:03 AM UTC`, following a scheduled deployment, some customers received mismatched data in responses from our services. Recognizing this as a major issue, we initiated the rollback procedure and reverted our services to the previous code version.

# **Leadup**

At `5:50 AM UTC` we completed a deployment that included an upgrade of a third-party provider library. The deployment introduced errors in several sessions.

# **Fault**

Some sessions with tokens requested during the affected timeframe may have experienced issues with add-to-cart functionality and checkout. In particular, adding items to the cart and placing orders was not always possible. In addition, some placed orders were missing approved_at and reference data.

# **Detection**

A few minutes after the release, one of our operators detected an anomaly on the internal metrics dashboard, which was confirmed by our clients' requests to the support team.

# **Root Cause**

The upgrade of a third-party provider library that manages low-level web request wrappers introduced problems in our APIs that were not detected by our tests in the preprod and QA environments. The upgrade was recommended to keep our services current with the latest security patches and compatible with other components.

# **Mitigation and resolution**

A few minutes after the release, once we realized the severity of the issue, we decided to roll back to the previous version of the code in order to restore normal operations immediately and investigate the root cause offline. In addition to the rollback, a partial CDN cache purge was performed to limit how long the incorrect data remained cached.

# **Corrective and Preventative Measures**

We are introducing additional automated tests covering the affected component and other similar patterns.

Fix has been applied - monitoring

none

Oct 1, 2024 · resolved Oct 1

The Dashboard is now fully operational; the test/live toggle that had disappeared is visible and clickable again.

API/Dashboard/Applications back to normal

critical

Apr 9, 2024 · resolved Apr 9

# **Summary**

On April 9th, from 6:45 AM to 8:55 AM UTC, during the scheduled maintenance window, we encountered a database issue. The database version upgrade process failed and the automatic recovery service provided by our cloud infrastructure provider did not activate. We immediately contacted the infrastructure provider's support team. They addressed the problem and there was no resulting data loss.

# **Leadup**

On April 9th, 2024, we began the DB upgrade maintenance at 6:39 AM UTC. At approximately 6:45 AM UTC, the background task provided by the DB cloud service failed, leaving the entire upgrade procedure stuck in an intermediate state.

# **Fault**

During the incident, all Core API calls resulted in errors. These errors were spread across all endpoints and customers, depending on their traffic distribution and type. The read requests cached on our CDN continued to work without errors.

# **Detection**

A few minutes after beginning maintenance, our engineers realized that the automatic upgrade process was not responding. The cloud service provides safeguards during upgrades that automatically restore normal operations if a 5-minute timeout is exceeded. This expected restoration did not occur.

# **Root Cause**

The service provider's support team confirmed that the upgrade process, once initiated, encountered an unexpected issue within their internal procedure, causing the upgrade to stall. The automatic recovery safeguard began as expected five minutes after the incident. However, it could not complete because one of the read replicas was in an inconsistent state. The vendor is still investigating the cause of this issue. Since the database service is fully managed by the provider, our engineers had to wait for the provider's support team to restore the normal level of availability.

# **Mitigation and resolution**

A few minutes before 7:00 AM UTC, our engineers tried to manually interrupt the process, but the option to cancel it was disabled by the vendor.

Our team immediately engaged the service provider's support team, requesting with the highest urgency that they interrupt the process and restore the availability of the DB. The provider's support team started investigating with their own tools and after a few minutes suggested a couple of workarounds, which did not succeed. In parallel, we began our own restoration process from a backup.

At about 8:50 AM UTC, the service provider's support team was able to restore DB availability and our services gradually recovered to normal operation levels. At approximately 8:55 AM UTC, the resolution was complete.

# **Corrective and Preventative Measures**

Given the nature of the cloud infrastructure on which our services are built, not all operational steps are fully under our control. However, we identified improvements that we intend to implement in our processes and procedures, together with our service provider:

* We will directly involve the service provider's representatives in future maintenance operations, from planning through implementation.
* We will update our status page as soon as an issue occurs. We will extend this by sending a proactive alert to organization owners.
* We will schedule the maintenance window during a low-traffic period on our platform.
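The safeguard described in this report (restore normal operations if the upgrade does not finish within a deadline) can be sketched roughly as follows. This is a minimal illustration, not the provider's mechanism: `run_with_recovery`, `upgrade`, `recover`, and the returned labels are hypothetical names, and the deadline is passed in seconds rather than fixed at 5 minutes. It shows why a timeout-based safeguard can itself fail when the recovery step cannot complete, as happened here with the inconsistent read replica.

```python
import threading


def run_with_recovery(upgrade, recover, deadline_seconds):
    """Run upgrade(); if it has not finished by the deadline, call recover().

    Returns "upgraded" if the upgrade finished in time, "recovered" if the
    fallback succeeded, or "stuck" if the fallback itself raised.
    All names here are hypothetical, chosen for illustration only.
    """
    done = threading.Event()

    def worker():
        upgrade()
        done.set()

    # Daemon thread so a stalled upgrade cannot keep the process alive.
    threading.Thread(target=worker, daemon=True).start()

    if done.wait(timeout=deadline_seconds):
        return "upgraded"
    try:
        recover()
    except RuntimeError:
        # The recovery path has its own failure mode, e.g. an
        # inconsistent replica that blocks the rollback.
        return "stuck"
    return "recovered"
```

In the "stuck" case nothing automatic is left to try, which matches the report: manual intervention by the provider was the only remaining path.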

Partial Outage

minor

Mar 26, 2024 · resolved Mar 26

# **Summary**

On March 26th, from 5:45 AM to 7:00 AM UTC, we noted several errors in response to API and Dashboard calls due to a database issue. Our monitoring tool identified the problem and immediately alerted our internal communication system. After investigating, we applied the needed fixes and restarted all backend application services, which restored their operations. Later, between 9:10 AM and 9:40 AM UTC, we experienced degraded database performance as a residual effect of the main issue. This had a lighter impact on services than the initial incident.

# **Leadup**

On March 26th, 2024, at approximately 5:45 AM UTC, an unusual database condition occurred, despite no changes being applied to our infrastructure in the hours preceding the incident.

# **Fault**

During the incident, approximately 30% of all API and Dashboard calls resulted in errors. These errors were spread across all endpoints and customers, depending on their traffic distribution and type.

# **Detection**

Our real-time monitoring system first detected an error around 5:45 AM UTC and quickly sent out alerts via our internal communication channels. System Administrators started investigating the issue shortly thereafter. By 6:30 AM UTC, the Incident Response Team was actively addressing the problem.

# **Root Cause**

We have finished investigating our database infrastructure and codebase. We identified specific timeout parameters that, under certain concurrent conditions, could lead to the issue we experienced. When this issue arises, application transactions might fail to obtain a database connection from the connection pool, resulting in an error. At the same time, the auto-scaling parameters in place were not able to detect this specific temporary glitch, causing other transactions to time out before the database could complete the requested operation.

# **Mitigation and resolution**

From 6:00 AM to 6:45 AM UTC, the Team examined the logs and monitoring systems to identify specific errors and possible immediate emergency actions. Recognizing the issue's impact, they opted for an aggressive approach and restarted the runtime environment. By around 7:00 AM UTC, all API and Dashboard calls were error-free.

The Team continued to work on identifying the root cause to resolve the issue definitively, while monitoring the situation, aware that side effects could still occur. At 9:10 AM UTC, they did observe a degradation in DB performance. In this case they applied a few mitigation measures, such as horizontal scaling and a code hotfix to deploy a specific optimization. Finally, a more suitable set of timeout parameters was applied to the database infrastructure, allowing the team to consider the issue resolved.

# **Corrective and Preventative Measures**

To significantly reduce, or even eliminate, the likelihood of this issue recurring, we have already implemented extra controls in our codebase and set up alerts for the proactive detection and automated resolution of the specific condition. We are also developing additional auto-scaling criteria to add capacity should similar conditions arise in the future from different causes.
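The failure mode named in the root cause (transactions unable to obtain a connection from the pool before a timeout expires) can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not the platform's code: `ConnectionPool`, `PoolTimeout`, `run_transactions`, and all sizes and durations are hypothetical, and the "connections" are placeholder integers rather than real database handles.

```python
import queue
import threading
import time


class PoolTimeout(Exception):
    """Raised when no connection becomes free within the acquire timeout."""


class ConnectionPool:
    """Toy fixed-size pool; connections are placeholder integers."""

    def __init__(self, size, acquire_timeout):
        self._free = queue.Queue()
        for conn_id in range(size):
            self._free.put(conn_id)
        self._acquire_timeout = acquire_timeout

    def acquire(self):
        try:
            return self._free.get(timeout=self._acquire_timeout)
        except queue.Empty:
            raise PoolTimeout("no free connection") from None

    def release(self, conn):
        self._free.put(conn)


def run_transactions(pool_size, acquire_timeout, n_transactions, query_seconds):
    """Run n concurrent transactions; return (succeeded, failed) counts."""
    pool = ConnectionPool(pool_size, acquire_timeout)
    outcomes = []
    lock = threading.Lock()

    def transaction():
        try:
            conn = pool.acquire()
        except PoolTimeout:
            with lock:
                outcomes.append(False)
            return
        try:
            time.sleep(query_seconds)  # stand-in for a slow query
        finally:
            pool.release(conn)
        with lock:
            outcomes.append(True)

    threads = [threading.Thread(target=transaction) for _ in range(n_transactions)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    succeeded = sum(outcomes)
    return succeeded, len(outcomes) - succeeded
```

With a pool of 2, six concurrent transactions, and queries that take longer than the acquire timeout (for example `run_transactions(2, 0.05, 6, 0.3)`), only the two transactions that get a connection immediately succeed. Raising the acquire timeout above the total queueing time lets all six complete, which is the kind of effect a retuned set of timeout parameters aims for.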
