Hosted Mender Status Page

IoT & Hardware · monitored by Alert24

Current Status

All Systems Operational

Components

Hosted Mender US
Operational
Hosted Mender EU
Operational
registry.mender.io
Operational
docs.mender.io
Operational

Recent Incidents

Issues with the Mender Server UI

critical

Apr 8, 2026 · resolved Apr 8

Mender UI was unavailable between 2026-04-08T13:16:15Z and 2026-04-08T13:42:09Z (26m) because a breaking change in the Google Analytics dependency led to a misconfiguration. The issue has been mitigated by temporarily disabling Google Analytics until a fix is deployed.

Degraded performance on hosted Mender

major

Feb 20, 2026 · resolved Feb 20

# **Database Overload from Device Limit Migration Bug**

**Date:** 2026-02-20

**Duration:** ~3 hours 25 minutes (08:15 - 11:40 UTC)

**Severity:** High

## **Executive Summary**

On February 20, 2026, the Hosted Mender platform experienced a critical service outage affecting device authentication and inventory operations. A change deployed as part of Mender v4.2.0-saas.2 failed to uniformly handle data inconsistencies in older tenant configurations. A specific call order of two independent backend endpoints, in combination with scheduled cache invalidation, uncovered a bug that caused a heavy increase in load on the database. The unexpected load was beyond what the system is designed to handle, which resulted in cascading errors and platform-wide degradation.

## **Impact**

* **Duration:** Approximately 3 hours 25 minutes
* **Scope:** Multi-tenant platform-wide degradation
* **Affected Services:** Device authentication, device inventory
* **User Experience:** Unable to accept new devices
* **Business Impact:** Complete halt to device provisioning across the platform (hosted Mender US only) during incident window

## **Root Cause**

In Mender v4.2.0-saas.2 we changed the definition of an “unlimited” device limit from 0 to -1 so that the system can represent limits that allow zero devices. We did this by introducing a database migration that changed existing limits with the value 0 to -1 and cleared the limits cache, so that data would be read fresh from the database after the migration. We were also aware of a known edge case where certain tenants had no limit defined in the database, and we took steps to ensure this scenario was handled consistently after the migration.

This new version of Mender also included an internal endpoint that incorrectly set the cached device limit of a tenant to 0 when a) there was no limit in the cache and b) there was also no limit in the database. This endpoint was overlooked in the steps mentioned above.

When the internal endpoint was called after the cache was invalidated, but before any external endpoint that used device limits, the limit 0 was incorrectly cached for some tenants with a large number of devices and no device limit in the database. When the device authorization reprocessing logic was executed for these devices, the incorrectly cached limit caused a large number of database queries to be executed in order to check whether the limit had been exceeded (a check that is unnecessary when the device limit is “unlimited”). Regardless of the result of the check, a limit of 0 always results in the device not being allowed to authorize with the system, and devices retry continuously in that case. This amplified the issue manyfold until MongoDB resources were completely exhausted.

## **Timeline (All times UTC)**

**2026-02-19**

* **13:16** - Deployed Mender v4.2.0-saas.2 - _Root cause introduced_

**2026-02-20**

* **~08:00** - Devices of the affected tenants started the authorization reprocessing process
* **08:15** - A synthetic test failure alerted the On-call team
* **08:20** - On-call investigated tenant configuration; Admin Panel queries failing with 499/504 due to DB exhaustion
* **08:20** - Identified ongoing device authorization reprocessing consuming all database resources
* **09:25** - Attempted to stop problematic queries
* **10:30** - Discovered blocked queries still holding locks; initiated emergency database scaling
* **10:40** - Database scaled; locks cleared; device acceptance partially restored
* **11:55** - Cache for device-auth disabled
* **11:00** - Added missing limits with value -1 (unlimited) in the database for affected tenants
* **11:40** - Service fully restored

## **What went wrong**

1. **Inadequate test coverage:** The test coverage of the internal endpoint was inadequate, as it didn’t verify that the correct value was used and cached in this scenario.
2. **Inadequate manual testing:** Manual testing was performed, but not with a cache that was explicitly invalidated for this purpose.
3. **Uncontrolled cascade:** The device authorization reprocessing logic had a snowball effect on the platform.

## **Action Items**

* Resolve the issue where limits that are intended to be “unlimited” can be incorrectly cached as 0 by this internal endpoint.
* Update the device authorization reprocessing logic to not execute unnecessary database queries if the limit is 0.
* Review and improve test coverage of the affected endpoints.

## **Conclusions**

We want to sincerely apologize for the service disruption you experienced on February 20, 2026. For over three hours, our platform was unable to process device authentication and inventory operations, preventing you from onboarding new devices and managing your fleet. We are committed to preventing this kind of disruption in the future.
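
The root cause and the first two action items above can be illustrated with a minimal Go sketch. It is not the Mender implementation: `LimitStore`, `MapStore`, `LimitCache`, `ErrNoLimit`, `ResolveLimit`, and `CanAccept` are hypothetical names chosen for illustration. The sketch assumes only what the postmortem states: -1 means “unlimited”, a tenant may have no limit stored in the database, and a limit of 0 can never admit a device.

```go
package main

import "errors"

// Unlimited is the sentinel described in the postmortem: -1 means no device limit.
const Unlimited = -1

// ErrNoLimit is a hypothetical error meaning "this tenant has no limit stored".
var ErrNoLimit = errors.New("no limit stored for tenant")

// LimitStore is a hypothetical stand-in for the database layer.
type LimitStore interface {
	DeviceLimit(tenantID string) (int, error)
}

// MapStore is a toy in-memory LimitStore used to exercise the sketch.
type MapStore map[string]int

func (m MapStore) DeviceLimit(tenantID string) (int, error) {
	limit, ok := m[tenantID]
	if !ok {
		return 0, ErrNoLimit
	}
	return limit, nil
}

// LimitCache is a hypothetical cache keyed by tenant ID.
type LimitCache map[string]int

// ResolveLimit returns the tenant's device limit, consulting the cache first.
// A missing database entry is cached as Unlimited, never as Go's zero value 0,
// which after the migration would mean "no devices allowed".
func ResolveLimit(cache LimitCache, store LimitStore, tenantID string) (int, error) {
	if limit, ok := cache[tenantID]; ok {
		return limit, nil
	}
	limit, err := store.DeviceLimit(tenantID)
	if errors.Is(err, ErrNoLimit) {
		limit = Unlimited
	} else if err != nil {
		return 0, err // unexpected errors are returned and nothing is cached
	}
	cache[tenantID] = limit
	return limit, nil
}

// CanAccept reports whether one more device fits under the limit.
// countDevices is called only when the answer depends on the current count,
// so limits of -1 and 0 never trigger a database query.
func CanAccept(limit int, countDevices func() (int, error)) (bool, error) {
	switch limit {
	case Unlimited:
		return true, nil
	case 0:
		return false, nil
	}
	n, err := countDevices()
	if err != nil {
		return false, err
	}
	return n < limit, nil
}
```

With an empty cache and a tenant that has no stored limit, `ResolveLimit` caches and returns `Unlimited`, and `CanAccept(Unlimited, ...)` returns `true` without calling the counting function. Treating "missing" as a distinct error instead of a zero value is the design choice that keeps the default from silently changing meaning when a sentinel is redefined.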

Rate limits issue for some customers

major

Nov 19, 2025 · resolved Nov 19

This incident has been resolved with a rate limit hotfix. We will schedule a new maintenance window soon to apply the definitive fix.

Issues with webhooks and AWS IoT Integration

minor

Oct 31, 2025 · resolved Nov 3

**Abstract**

On Monday, 27th of October, we released Mender Server v4.1.0-saas.16 to hosted Mender US and EU. Among its many changes was one that indirectly changed the cipher mode used for client-side encryption of secrets in the IoT Manager database. The change replaced the deprecated Cipher Feedback (CFB) cipher mode with Counter (CTR) mode, as suggested by the [Golang documentation](https://pkg.go.dev/crypto/cipher#NewCFBDecrypter).

On October 31st, at about 8PM, we were alerted by multiple tickets opened by customers reporting that webhooks were not working for the AWS IoT integration. The on-call team then opened an [incident](https://mender.statuspage.io/incidents/g019zy922897).

On Monday, 3rd of November, the engineering team acknowledged the issue and identified the root cause. After a brief discussion, we decided to roll back the IoT Manager service and re-encrypt the secrets with the old algorithm for the affected customers.

Two customers, however, had already recreated their webhook configuration, which the Northern Tech team had suggested as a valid workaround, so the rollback disrupted their operation a second time.

We are really sorry for the inconvenience, and we are working to fix this process around the IoT Manager integration.

**Incident Timeline (UTC)**

* 2025-10-27 12AM - Mender Server v4.1.0-saas.16 released on hosted Mender EU and US
* 2025-10-31 8PM - The Customer Engineer team was alerted by multiple tickets regarding the failing IoT integration
* 2025-10-31 8:58PM - This incident was opened
* 2025-11-03 12AM - We reverted the IoT Manager version, decrypted the secrets with the new cipher and re-encrypted them with the old cipher, restoring operation
* 2025-11-03 11AM - Mender Server v4.1.0-saas.17 was released to hosted Mender US and EU, including the commit that reverts the new cipher and restores the old one

**What went wrong**

Multiple failures at multiple levels:

* we lack IoT Manager upgrade tests; for this specific issue, unit and integration tests didn’t catch the problem because they run on fresh data encrypted with the new cipher;
* we lack Synthetic Tests on IoT Manager;
* we suggested a workaround to restore the situation as a first step, but the subsequent rollback to the previous version caused another disruption for some customers.

**Actions we decided to take to prevent this issue in the future**

* Improve the logging and monitoring around the IoT Manager service
  * Introduce a new error log when webhooks are misbehaving, and build metrics and alerts based on the new log to catch issues faster
* Introduce Synthetic Tests to periodically assess the IoT Manager webhook functionality
* Improve error handling by registering unsuccessful attempts to send webhooks
* Register a timestamp on secret update and creation, to make the history of a secret easy to understand
* We still need to replace the outdated cipher; we will plan a non-disruptive update
* Introduce upgrade tests to check that the IoT Manager service works with both the old and the new version.
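
The failure mode described above, and the kind of upgrade test listed in the action items, can be sketched in Go. This is a hedged illustration and not the IoT Manager code: the `Secret` record, the `Scheme` field, and the scheme names are hypothetical, and the sketch assumes AES as the block cipher. It relies only on the standard `crypto/cipher` functions `NewCFBEncrypter`, `NewCFBDecrypter`, and `NewCTR`. Both modes are unauthenticated stream modes, so decrypting with the wrong mode produces garbage instead of an error, which is why tests that only encrypt fresh data with the new mode cannot detect the problem.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/rand"
	"fmt"
)

// Scheme names are hypothetical labels, not taken from the IoT Manager code.
const (
	SchemeCFB = "aes-cfb" // legacy mode
	SchemeCTR = "aes-ctr" // replacement mode
)

// Secret is a hypothetical stored record. An empty Scheme marks a record
// written before the field existed and is treated as legacy CFB.
type Secret struct {
	Scheme string
	IV     []byte
	Data   []byte
}

// newStream builds the cipher stream that matches the record's scheme.
func newStream(key []byte, scheme string, iv []byte, decrypt bool) (cipher.Stream, error) {
	block, err := aes.NewCipher(key)
	if err != nil {
		return nil, err
	}
	if len(iv) != aes.BlockSize {
		return nil, fmt.Errorf("iv must be %d bytes, got %d", aes.BlockSize, len(iv))
	}
	switch scheme {
	case "", SchemeCFB:
		if decrypt {
			return cipher.NewCFBDecrypter(block, iv), nil
		}
		return cipher.NewCFBEncrypter(block, iv), nil
	case SchemeCTR:
		return cipher.NewCTR(block, iv), nil // CTR uses the same stream both ways
	default:
		return nil, fmt.Errorf("unknown cipher scheme %q", scheme)
	}
}

// Encrypt encrypts plaintext under the given scheme with a random IV.
func Encrypt(key []byte, scheme string, plaintext []byte) (Secret, error) {
	iv := make([]byte, aes.BlockSize)
	if _, err := rand.Read(iv); err != nil {
		return Secret{}, err
	}
	s, err := newStream(key, scheme, iv, false)
	if err != nil {
		return Secret{}, err
	}
	data := make([]byte, len(plaintext))
	s.XORKeyStream(data, plaintext)
	return Secret{Scheme: scheme, IV: iv, Data: data}, nil
}

// Decrypt picks the mode recorded on the secret, so records written by the
// old and the new version stay readable side by side.
func Decrypt(key []byte, sec Secret) ([]byte, error) {
	s, err := newStream(key, sec.Scheme, sec.IV, true)
	if err != nil {
		return nil, err
	}
	out := make([]byte, len(sec.Data))
	s.XORKeyStream(out, sec.Data)
	return out, nil
}
```

An upgrade test in this style would encrypt a secret with the legacy scheme, then assert that the new code path still recovers the plaintext. Recording the scheme next to the ciphertext is one possible way to plan the non-disruptive cipher replacement mentioned above; it is an assumption of this sketch, not a description of the planned fix.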

API issues with Mender Server

critical

Oct 22, 2025 · resolved Oct 22

**Date:** October 22, 2025

**Duration:** 78 minutes (08:10 - 09:28 UTC)

**Severity:** Major service disruption

**Executive Summary**

A database migration in release v4.1.0-saas.16 caused a complete failure of the Device Authentication service across US and EU hosted Mender clusters. The migration incorrectly deleted a critical uniqueness constraint during online operations, leading to database corruption that prevented service recovery. We restored service by performing a point-in-time database rollback, resulting in 78 minutes of data loss.

**Customer Impact**: Device authentication was unavailable for 78 minutes. New device enrollments were blocked, and existing device operations may have been disrupted during this period.

**Root cause**

The new version contained a database migration to 2.0.1 for the Device Auth database. The migration was designed to replace a uniqueness constraint on device authentication records, but it executed the deletion and the recreation as separate operations. During the online migration, the window between index deletion and recreation allowed duplicate device entries to be created, corrupting the database state and preventing both completion of the forward migration and rollback. For this reason, the only viable solution was to roll back both the Mender Server version and the database.

**Resolution and recovery**

With duplicate records preventing normal rollback procedures, we performed a point-in-time database restore to 08:10 UTC, a safe timestamp before the migration executed. This restored database integrity but resulted in permanent loss of all data created between 08:10 and 09:28.

**Incident timeline (UTC)**

* 08:35 AM - the new v4.1.0-saas.16 version was published and both hosted Mender US and EU started the automated upgrade
* 08:40 AM - the upgrade failed and rolled back automatically to v4.1.0-saas.15, because the deviceauth service wasn’t able to complete the migration job.
* 08:42 AM - the On-call team acknowledged a possible issue with the upgrade; in the meantime, the deviceauth service and MongoDB were at 100% load because of the missing index.
* 09:16 AM - we decided to restore the MongoDB database to the point in time with timestamp 08:10:00 AM, and the restoration process started.
* 09:28 AM - the MongoDB restoration process finished.

**What went wrong**

* **Migration Strategy**: The migration required an offline window or an atomic operation strategy, but this requirement was not identified during development or code review.
* **Testing Gaps**: Pre-release testing did not simulate high-concurrency writes during the migration, so it failed to trigger the race condition found in production.
* **Data loss**: We failed to export a snapshot of the corrupted state before the point-in-time retention window expired.

**Action Items**

* **Enhance Load Testing**: Pre-release tests do not simulate the production environment well enough to catch this kind of issue at an early stage. We are planning to run load testing and chaos testing more often and more extensively to mitigate this risk.
* **Update the rollback playbook**: Mandate that a snapshot of the "corrupted" database state be taken immediately following a destructive point-in-time recovery, to preserve the data and allow it to be recovered if necessary.

We sincerely apologize for the disruption to your operations and, specifically, for the data loss that occurred during the recovery window.
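
The race described in the root cause can be reproduced with a toy in-memory model in Go. This is a sketch under stated assumptions, not the Device Auth migration and not MongoDB driver code: `Store`, `Doc`, `KeyFunc`, `MigrateDropFirst`, and `MigrateCreateFirst` are hypothetical, and the `during` callback stands in for writes that arrive while the migration is running. The create-first ordering additionally assumes that the replacement constraint can coexist with the old one, which the postmortem does not confirm for the real index.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDuplicate is returned when a unique index would be violated.
var ErrDuplicate = errors.New("duplicate key")

// Doc is a hypothetical device authentication record.
type Doc struct {
	TenantID string
	Identity string
}

// KeyFunc extracts the value a unique index is enforced on.
type KeyFunc func(Doc) string

// Store is a toy collection with named unique indexes.
type Store struct {
	docs    []Doc
	indexes map[string]KeyFunc
}

func NewStore() *Store { return &Store{indexes: map[string]KeyFunc{}} }

// CreateUniqueIndex fails if existing documents already contain duplicates.
func (s *Store) CreateUniqueIndex(name string, key KeyFunc) error {
	seen := map[string]bool{}
	for _, d := range s.docs {
		k := key(d)
		if seen[k] {
			return fmt.Errorf("create index %s: %w", name, ErrDuplicate)
		}
		seen[k] = true
	}
	s.indexes[name] = key
	return nil
}

func (s *Store) DropIndex(name string) { delete(s.indexes, name) }

// Insert rejects the document if any active unique index is violated.
func (s *Store) Insert(d Doc) error {
	for name, key := range s.indexes {
		for _, existing := range s.docs {
			if key(existing) == key(d) {
				return fmt.Errorf("insert violates %s: %w", name, ErrDuplicate)
			}
		}
	}
	s.docs = append(s.docs, d)
	return nil
}

// MigrateDropFirst mirrors the failing order: between the two steps no
// uniqueness constraint is active, so concurrent writes can add duplicates.
func MigrateDropFirst(s *Store, oldName, newName string, key KeyFunc, during func()) error {
	s.DropIndex(oldName)
	during()
	return s.CreateUniqueIndex(newName, key)
}

// MigrateCreateFirst keeps at least one constraint active at all times.
func MigrateCreateFirst(s *Store, oldName, newName string, key KeyFunc, during func()) error {
	if err := s.CreateUniqueIndex(newName, key); err != nil {
		return err
	}
	during()
	s.DropIndex(oldName)
	return nil
}
```

When `during` inserts a record that already exists, `MigrateDropFirst` accepts the duplicate and then fails to create the new index, leaving the store with no constraint and no clean path forward or backward. `MigrateCreateFirst` rejects the same insert and completes. A pre-release test that runs writes inside the migration window, as `during` does here, is one way to exercise the high-concurrency case named under Testing Gaps.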
