In Case of Emergency, Break Glass!

Enabling better disaster recovery for when things go sideways.

Somesh Koli

Engineer

This is the final post in our Launch Week series on Change Controls. Check out our other feature announcements for Approvals and Drift Detection.

When working with software, things go wrong all the time. Addressing a critical issue usually requires extra privileges and capabilities that let an engineer run diagnostics manually. This week we’re launching Break Glass Actions to help software vendors handle these situations and resolve them smoothly.

What Is Break Glass?

Used in emergency situations, Break Glass gives users extra tools and capabilities, beyond what they normally need, so they can mitigate an incident. (Warning! Break glass with caution! Unauthorized glass breaking is a punishable offense. Consequences may include the erosion of customer trust, not to mention the broken glass all over the floor.)

Consider the following situation: an e-commerce company uses an external SaaS order processing system to manage orders, process payments, and track inventory. At peak traffic during the company's holiday sale, the SaaS provider’s payment integration latency spiked from 2–3 seconds to 20–30 seconds. This caused the internal order queue to balloon from 2,000 to over 250,000 messages. The shared, central inventory database was then overwhelmed by concurrent updates, leading to deadlocks and time-outs. The SaaS provider's auto-scaling made things worse: each new worker only added load to the shared database, creating a positive feedback loop that degraded performance for all clients.

As a result, the entire SaaS pipeline froze. The e-commerce company's frontend accepted new orders, but the SaaS system couldn't confirm them, leading to an explosion of "payment pending" support tickets and a cascading failure across multiple clients.
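The backlog growth follows from simple queueing arithmetic. By Little's Law, a system that keeps a fixed number of payment calls in flight completes them at a rate of concurrency divided by latency, so a tenfold jump in latency cuts throughput tenfold while orders keep arriving at the same pace. The sketch below makes that concrete; the concurrency (500) and arrival rate (200 orders per second) are hypothetical values chosen for illustration, not figures from the incident.

```python
def throughput(concurrency: int, latency_s: float) -> float:
    """Little's Law: completions per second = requests in flight / latency."""
    return concurrency / latency_s


def backlog_after(seconds: float, arrival_rate: float, service_rate: float,
                  initial_backlog: int = 0) -> float:
    """Queue depth after `seconds` at constant rates (a queue never goes below zero)."""
    growth = (arrival_rate - service_rate) * seconds
    return max(0.0, initial_backlog + growth)


# Hypothetical parameters, for illustration only.
CONCURRENCY = 500       # payment calls in flight at once
ARRIVAL_RATE = 200.0    # new orders per second at peak

normal = throughput(CONCURRENCY, 2.5)     # keeps pace with arrivals
degraded = throughput(CONCURRENCY, 25.0)  # falls far behind

print(f"normal: {normal:.0f}/s, degraded: {degraded:.0f}/s")
# Starting from a 2,000-message queue, how deep is it after 20 minutes?
print(f"backlog after 20 min: {backlog_after(20 * 60, ARRIVAL_RATE, degraded, 2000):,.0f}")
```

Adding workers would raise concurrency on paper, but in this scenario the extra workers only contend for the same inventory database, which is why auto-scaling backfired instead of draining the queue.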

The system cannot self-heal because:

  • Auto-scaling exacerbates the central database bottleneck.
  • The massive order backlog risks data corruption (double inventory deduction or double customer charging) and violates shipping SLAs.
  • Automatically canceling all affected orders is too costly, risking millions in remediation and a major PR disaster.

The system is now stalled, unable to recover without risking data integrity or significant financial loss.

Under such unexpected circumstances, engineers intervene manually: freezing ingestion, scaling up databases, and flushing overloaded connection pools. These operations require elevated permissions and an environment that allows engineers to carry them out.
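As one illustration of what "freezing ingestion" can mean, the sketch below shows an application-level gate that stops new work from entering the queue so the existing backlog can drain. The class and method names are hypothetical; in a real system the switch might live in a feature-flag service or a load balancer rule rather than in process memory.

```python
import threading


class IngestionGate:
    """Hypothetical kill switch: while frozen, new orders are rejected instead of queued."""

    def __init__(self) -> None:
        self._frozen = threading.Event()
        self._lock = threading.Lock()
        self._queue: list[str] = []

    def freeze(self) -> None:
        self._frozen.set()

    def unfreeze(self) -> None:
        self._frozen.clear()

    def submit(self, order_id: str) -> bool:
        """Enqueue an order; return False (caller should retry later) while frozen."""
        if self._frozen.is_set():
            return False
        with self._lock:
            self._queue.append(order_id)
        return True

    def depth(self) -> int:
        with self._lock:
            return len(self._queue)


gate = IngestionGate()
accepted_before = gate.submit("order-1")
gate.freeze()
accepted_during = gate.submit("order-2")
print(accepted_before, accepted_during, gate.depth())  # True False 1
```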

Breaking Glass in BYOC Environments

When onboarding a SaaS tool, a customer expects a seamless experience that requires as little management as possible on their end; responsibility for running the tool is outsourced to the SaaS company. Customers generally don’t concern themselves with what’s happening “under the hood” of the SaaS they’re using.

As the vendor of a SaaS product, you always have maximum control over your software and the underlying infrastructure, which allows you to confidently deliver on your SLA.

But when it comes to BYOC (Bring Your Own Cloud), the line of responsibility between customer and vendor blurs. When the software is deployed into a customer’s cloud, the vendor has only limited access to it.

Breaking Glass With Nuon

Nuon provides software vendors with the platform and tools they need to deploy their software into customer environments. A crucial part of this is the ability to break glass seamlessly in case of emergency.

How to Break Glass With Nuon

Break Glass Process

Nuon simplifies break-glass access in a BYOC environment by integrating necessary roles and permissions directly into the deployment and update process, while ensuring the customer maintains strict control.

1. Define Break Glass Roles in App Configuration

The first step is for the software vendor to define the specific, time-limited, and least-privileged AWS IAM roles required for emergency operations directly within the Nuon Application Configuration.

This typically involves:

  • Identifying necessary actions: Determine the exact capabilities needed (e.g., scale database, flush queues, read sensitive logs).
  • Defining policies: Create granular policies corresponding to these actions.
  • Mapping roles: Associate these policies with specific “Break Glass” IAM roles (e.g., bucket-operations-break-glass).

These definitions are bundled with the application when Nuon creates the install stack.
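The post doesn't show Nuon's configuration syntax, so the following is not Nuon's format. It is a Python sketch of what the three steps above produce: a narrowly scoped IAM policy document mapped to a named role with a short session limit. The bucket ARN and the chosen S3 actions are hypothetical placeholders; the role name reuses the example above.

```python
import json

# Placeholder values for illustration only.
ROLE_NAME = "bucket-operations-break-glass"
BUCKET_ARN = "arn:aws:s3:::example-app-data"
MAX_SESSION_SECONDS = 3600  # one hour, the smallest MaxSessionDuration IAM allows


def least_privilege_policy(actions: list[str], resources: list[str]) -> dict:
    """Build an IAM policy document allowing only the listed actions on the listed resources."""
    if not actions or not resources:
        raise ValueError("a break-glass policy needs at least one action and one resource")
    # Reject blanket grants such as "*" or "s3:*"; emergency roles should stay narrow.
    broad = [a for a in actions if a == "*" or a.endswith(":*")]
    if broad:
        raise ValueError(f"wildcard actions are not least-privilege: {broad}")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": sorted(actions), "Resource": list(resources)}
        ],
    }


def break_glass_role(name: str, policy: dict) -> dict:
    """Pair a role name with its policy and a short maximum session length."""
    return {"RoleName": name, "MaxSessionDuration": MAX_SESSION_SECONDS, "Policy": policy}


role = break_glass_role(
    ROLE_NAME,
    least_privilege_policy(
        actions=["s3:ListBucket", "s3:GetObject", "s3:DeleteObject"],
        resources=[BUCKET_ARN, f"{BUCKET_ARN}/*"],
    ),
)
print(json.dumps(role, indent=2))
```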

2. Customer Approval via Install Stack

When a vendor deploys or updates software using Nuon, the platform generates an install stack suited to the customer's cloud (on AWS, a CloudFormation stack). This stack includes the Break Glass roles as defined by the vendor.

The customer's responsibility is to explicitly enable these roles during the stack execution. By doing so, the customer:

  • Controls role creation: The customer's cloud environment creates the roles, ensuring they are valid and adhere to internal governance.
  • Audits capabilities: The customer can review the exact permissions associated with the roles before deployment.
  • Establishes trust: This mechanism ensures the vendor cannot activate elevated permissions without the customer's upfront, explicit approval.
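The opt-in can be expressed directly in the stack. As a hedged sketch, not the template Nuon actually generates, the Python below builds a CloudFormation template in which the break-glass role is created only when a hypothetical `EnableBreakGlass` parameter is set to `true`. It defaults to `false`, so nothing is created without the customer's explicit choice, and the role's permissions sit inline in the template where the customer can review them. The vendor account ID is AWS's documentation placeholder.

```python
import json


def install_stack_template(role_name: str, policy: dict, trusted_principal_arn: str) -> dict:
    """Return a CloudFormation template whose break-glass role exists only if the customer opts in."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Parameters": {
            "EnableBreakGlass": {  # hypothetical parameter name
                "Type": "String",
                "AllowedValues": ["true", "false"],
                "Default": "false",  # off unless the customer explicitly enables it
                "Description": "Create the vendor's break-glass role in this account.",
            }
        },
        "Conditions": {
            "BreakGlassEnabled": {"Fn::Equals": [{"Ref": "EnableBreakGlass"}, "true"]}
        },
        "Resources": {
            "BreakGlassRole": {
                "Type": "AWS::IAM::Role",
                "Condition": "BreakGlassEnabled",  # resource is skipped when the condition is false
                "Properties": {
                    "RoleName": role_name,
                    "MaxSessionDuration": 3600,
                    "AssumeRolePolicyDocument": {
                        "Version": "2012-10-17",
                        "Statement": [{
                            "Effect": "Allow",
                            "Principal": {"AWS": trusted_principal_arn},
                            "Action": "sts:AssumeRole",
                        }],
                    },
                    "Policies": [{"PolicyName": f"{role_name}-policy", "PolicyDocument": policy}],
                },
            }
        },
    }


example_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": ["s3:GetObject"],
                   "Resource": ["arn:aws:s3:::example-app-data/*"]}],
}
template = install_stack_template(
    "bucket-operations-break-glass", example_policy, "arn:aws:iam::111122223333:root"
)
print(json.dumps(template, indent=2))
```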

3. Utilizing Roles for Emergency Actions and Workflows

Once the CloudFormation stack is successfully deployed and the Break Glass roles exist in the customer's cloud, the vendor can use Nuon to invoke emergency operations, such as:

  • Actions: These are defined, single-step operations that utilize the elevated permissions (e.g., a function to manually scale an overloaded database instance).
  • Action Workflows: For complex mitigation scenarios (like the e-commerce example), vendors can create multi-step workflows that coordinate several actions.
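A multi-step workflow can be modeled as an ordered list of single actions. The sketch below, using hypothetical `Step` and `run_workflow` names, runs the mitigation steps from the e-commerce example in order and stops at the first failure so later steps never act on an unexpected state. Stopping early is a design choice of this sketch, not documented Nuon behavior; here the database step is simulated to time out.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Step:
    name: str
    run: Callable[[], None]  # one break-glass action; raises on failure


@dataclass
class WorkflowResult:
    completed: list[str] = field(default_factory=list)
    failed: Optional[str] = None


def run_workflow(steps: list[Step]) -> WorkflowResult:
    """Run steps in order; stop at the first failure and record which step failed."""
    result = WorkflowResult()
    for step in steps:
        try:
            step.run()
        except Exception as exc:
            result.failed = f"{step.name}: {exc}"
            break
        result.completed.append(step.name)
    return result


log: list[str] = []


def scale_database() -> None:
    # Stand-in for a real action; this one simulates a timeout.
    raise RuntimeError("scale-up timed out")


steps = [
    Step("freeze-ingestion", lambda: log.append("ingestion frozen")),
    Step("scale-database", scale_database),
    Step("flush-connection-pools", lambda: log.append("pools flushed")),
]
result = run_workflow(steps)
print(result.completed, result.failed)
```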

Nuon acts as a secure intermediary that enables controlled, manual intervention in a BYOC environment by allowing you to temporarily assume the customer-approved Break Glass role to execute the required actions.
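On AWS, temporarily assuming a role means calling STS `AssumeRole`, which returns credentials that expire on their own. The sketch below wraps boto3's `assume_role` call and takes the client as a parameter so it can run against a stand-in (`FakeSts`, hypothetical) instead of real AWS credentials. The 900-second duration is the STS minimum, and putting the operator's name in the session name is an assumption about good practice, not a description of how Nuon names its sessions.

```python
import re
from datetime import datetime, timezone
from typing import Any, Protocol

ROLE_ARN = "arn:aws:iam::111122223333:role/bucket-operations-break-glass"  # placeholder


class StsClient(Protocol):
    """The subset of a boto3 STS client this sketch relies on."""

    def assume_role(self, **kwargs: Any) -> dict: ...


def assume_break_glass_role(sts: StsClient, role_arn: str, operator: str,
                            duration_seconds: int = 900) -> dict:
    """Trade the caller's identity for short-lived credentials on the break-glass role."""
    session_name = f"break-glass-{operator}"
    # STS session names: 2-64 characters from ASCII letters, digits, and +=,.@_-
    if not re.fullmatch(r"[\w+=,.@-]{2,64}", session_name, re.ASCII):
        raise ValueError(f"invalid role session name: {session_name!r}")
    response = sts.assume_role(
        RoleArn=role_arn,
        RoleSessionName=session_name,      # appears in CloudTrail, identifying the operator
        DurationSeconds=duration_seconds,  # STS accepts 900 seconds at minimum
    )
    creds = response["Credentials"]
    return {
        # The first three keys match boto3.Session's keyword arguments.
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
        "expires_at": creds["Expiration"],
    }


class FakeSts:
    """Stand-in for boto3.client("sts") so the sketch runs without AWS access."""

    def assume_role(self, **kwargs: Any) -> dict:
        self.last_call = kwargs
        return {"Credentials": {
            "AccessKeyId": "AKIAEXAMPLE",
            "SecretAccessKey": "example-secret",
            "SessionToken": "example-token",
            "Expiration": datetime(2030, 1, 1, tzinfo=timezone.utc),
        }}


# With real AWS access: assume_break_glass_role(boto3.client("sts"), ROLE_ARN, "alice")
sts = FakeSts()
creds = assume_break_glass_role(sts, ROLE_ARN, operator="alice")
print(sts.last_call["RoleSessionName"], creds["expires_at"])
```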

4. Post-Incident Review and Audit

Every time a break-glass action is initiated, Nuon provides detailed logging and audit trails, clearly documenting:

  • Who initiated the action.
  • When the action was performed.
  • What the action did.
  • Which specific role was assumed.

This ensures accountability and facilitates post-incident reviews with the customer.
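The post doesn't specify the format of Nuon's audit trail. As a hypothetical illustration, the sketch below captures the four facts listed above (who, when, what, and which role) as one append-only JSON line per action, a shape that is easy to filter and share during a post-incident review. Field names and values are placeholders.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class BreakGlassAuditEvent:
    initiated_by: str  # who initiated the action
    performed_at: str  # when, as an ISO 8601 UTC timestamp
    action: str        # what the action did
    role_arn: str      # which role was assumed


def record(event: BreakGlassAuditEvent, sink: list[str]) -> None:
    """Append the event as one JSON line; entries are only ever added, never edited."""
    sink.append(json.dumps(asdict(event), sort_keys=True))


audit_log: list[str] = []
record(
    BreakGlassAuditEvent(
        initiated_by="alice@vendor.example",
        performed_at=datetime(2024, 11, 29, 14, 5, tzinfo=timezone.utc).isoformat(),
        action="deleted stale export objects in example-app-data",
        role_arn="arn:aws:iam::111122223333:role/bucket-operations-break-glass",
    ),
    audit_log,
)
print(audit_log[0])
```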

Excited to Try It Out?

Check out our documentation and sample-apps or reach out to our team for guidance on migrating your existing deployments.

Have questions or feedback about Break Glass Actions? We'd love to hear from you in our community Slack or through our support channels.
