Websolr Status Page

Developer Platforms & Tools · monitored by Alert24

Current Status

All Systems Operational

Components

Region Health: AWS US East (Virginia)
Operational
Region Health: AWS EU West (Ireland)
Operational
Region Health: AWS US West (California)
Operational
Websolr Index Provisioning & Management
Operational
Websolr Dashboard
Operational

Recent Incidents

Websolr Unaffected By XZ Compromise

none

Apr 1, 2024 · resolved Apr 1

Researchers recently discovered a sophisticated attempt to compromise XZ, a compression library that is widely used in Linux-based services across the world. It is suspected that, had the compromise succeeded, state actors or other attackers would have been able to remotely access many Linux-based machines on the Internet. Fortunately, the issue was discovered before the compromised version of the library made its way into mainline channels, so the impact is limited to versions 5.6.0 and 5.6.1. [CISA recommends](https://www.cisa.gov/news-events/alerts/2024/03/29/reported-supply-chain-compromise-affecting-xz-utils-data-compression-library-cve-2024-3094) that users downgrade to XZ Utils 5.4.6 or earlier.

Websolr's system maintenance policy is to use up-to-date stable and LTS versions of software, and this policy means that none of our systems are impacted by the compromise. Out of an abundance of caution, we audited every online server in our fleet and verified that none of them is running the compromised versions of XZ Utils.

Websolr remains committed to the security and integrity of our systems and customers' data. Please direct any additional questions or concerns to [email protected].

Heroku Websolr add-on users experiencing provisioning failures

none

Feb 24, 2023 · resolved Feb 24

This incident has been resolved. Heroku users whose provisioning requests failed will need to resend them. Heroku also removes failed provisioning requests after 24 hours.

Websolr.com is unresponsive

minor

Feb 24, 2022 · resolved Feb 24

This incident has been resolved.

Elevated HTTP 502 Errors in the EU-West Region

major

May 12, 2021 · resolved May 12

At approximately 06:00 UTC on May 12th, users in the EU-West region began to receive HTTP 502 responses to all requests. This lasted until roughly 14:20 UTC.

In preparing this postmortem, we identified two distinct problems: the disruption itself and the length of time it took to resolve. This postmortem considers both.

**Background**

Websolr maintains a global fleet of servers, each of which is monitored by several different systems. These systems can identify problems and alert our operations team when one occurs.

Part of the fleet is dedicated solely to acting as a proxy layer, which routes requests to specific Solr cores, enforces throttling, manages authentication, and more. The proxy layer is a distributed service, partitioned by geographic region. The decentralized nature of the proxy layer makes it resistant to node loss, as state is shared among all nodes in the layer. Thus, if a node fails for some reason, the load balancers simply route traffic to the healthy nodes until the unhealthy node is fixed or replaced. This happens seamlessly, so users never notice.

**Proximate Cause of the Disruption**

Nodes are periodically replaced automatically for a variety of reasons. Last week, one of the proxy layer’s nodes in EU-West-1 was replaced. The replacement came online and began a bootstrapping process to install and configure all of the software needed to run an instance of the proxy service. However, the process failed due to a recent change in pip. This failure prevented the bootstrap process from installing _anything_, including the monitoring services. As a result, no alert was sent to the team, and the failure went unnoticed by users because there was still a healthy node online serving requests.

At 06:00 UTC on May 12th, the healthy node was automatically replaced and failed to bootstrap as well.
This meant that not only did the EU-West-1 region no longer have a functional routing layer, but the monitoring services that otherwise would have indicated a problem were not present.

**Proximate Cause of Delay in Resolution**

Lacking alerts about the problem meant that the issue was not detected until our support team came online several hours later and noticed a massive number of support tickets. At that time, alarms were sounded and the problem was resolved within about 30 minutes.

Up to that point there were a number of signals indicating that something was very wrong, and they should have warranted review. “Signals” in this context refers to departures from the norm so large that a reasonable person would know immediately that there is a problem.

The most obvious signal was an influx of support tickets. We received more tickets in the first few hours of the incident than we normally receive in 2 weeks, across all our products. We also received messages via social media regarding the disruption. Had anyone been monitoring these channels, it would have been obvious that there was a problem warranting investigation. Since our support team is US-based, the messages were arriving around 1:00 AM our time, when everyone was asleep.

Another signal was that our system for monitoring request metrics suddenly stopped receiving anything from the EU-West region. A sudden loss of traffic across an entire region is a clear signal that something is amiss.

A third signal was that the load balancers experienced a massive increase in HTTP 5XX errors, with almost 100% of requests failing. That itself should have raised alarms.

**Root Cause of the Disruption & Delay**

The aforementioned series of events is not the root cause of this particular incident. The root cause was the lack of sufficient alerting (both direct and indirect).
Not only was a vital piece of infrastructure able to fail bootstrapping in a way that raised no alarm, but the failure remained undetected until the start of business hours because we lacked sufficient indirect monitoring.

If the monitoring service hadn’t been dependent on bootstrapping, or if the bootstrapping process had been able to page our team in the event of a problem, this incident would not have occurred. We would have been made aware of the problem last week and would have been able to fix it then, preventing it from recurring on May 12th. And if we’d had sufficient indirect monitoring (identifying signals and thresholds that _imply_ a problem), then the failed bootstrapping process wouldn’t have mattered as much, because we would have been paged to examine the region anyway.

**Resolution**

Once the technical problem was identified, we were able to fix the bootstrapping process with a single line of code. The nodes were re-bootstrapped and the routing layer was repopulated with data. This fix was deployed in several other regions, with US-East-1 being the final region (scheduled for May 18th).

We have also upgraded the proxy service to be managed with our latest deploy tools, which are in heavy use with our hosted Elasticsearch product, Bonsai. Finally, we’re working on implementing some signal detection in our existing systems that would alert us about the high probability of a problem somewhere in our fleet.

If you have any questions or concerns, please let us know at [[email protected]](mailto:[email protected]).
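
As a rough illustration of the indirect monitoring described in this postmortem, the sketch below checks one region's request counts for two of the signals named above: a sudden loss of traffic and a surge in HTTP 5XX responses. It is not Websolr's implementation. The `RegionWindow` shape, the `detect_signals` function, and both thresholds are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class RegionWindow:
    """Request counts for one region over one monitoring window (hypothetical shape)."""
    region: str
    total_requests: int
    error_5xx: int
    baseline_requests: int  # typical request count for a comparable window


# Hypothetical thresholds, chosen only for illustration.
MAX_5XX_RATIO = 0.05          # alert if more than 5% of requests return HTTP 5XX
MIN_TRAFFIC_FRACTION = 0.10   # alert if traffic falls below 10% of the baseline


def detect_signals(window: RegionWindow) -> list[str]:
    """Return human-readable signals that imply a problem in the region."""
    signals = []

    # Signal: traffic has collapsed relative to the baseline.
    if window.baseline_requests > 0:
        fraction = window.total_requests / window.baseline_requests
        if fraction < MIN_TRAFFIC_FRACTION:
            signals.append(f"{window.region}: traffic at {fraction:.0%} of baseline")

    # Signal: a large share of requests is failing with HTTP 5XX.
    if window.total_requests > 0:
        ratio = window.error_5xx / window.total_requests
        if ratio > MAX_5XX_RATIO:
            signals.append(f"{window.region}: {ratio:.0%} of requests returned HTTP 5XX")

    return signals


if __name__ == "__main__":
    # A window in which nearly every request fails, as in the incident above.
    failing = RegionWindow("eu-west-1", total_requests=9_500,
                           error_5xx=9_400, baseline_requests=9_800)
    for signal in detect_signals(failing):
        print(signal)
```

Because a check like this reads load balancer or request metrics and does not rely on an agent installed on each node, it would still fire if a node's bootstrap process failed before any monitoring software was installed. Any non-empty result would be routed to a paging system, which is outside the scope of this sketch.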

Cluster Metrics Unavailable

none

May 26, 2020 · resolved May 26

This incident has been resolved.
