At Deliveroo, we’re always refining how we scale - especially when it comes to managing compute costs in the cloud. After optimising our Amazon ECS workloads with Reserved Instances and Savings Plans, we saw an opportunity to push further using EC2 Spot Instances, which offer up to 90% savings compared to On-Demand prices. But Spot comes with challenges: availability can fluctuate, and instances can be terminated with just a two-minute warning.

To unlock these savings without compromising service stability, we had to engineer a robust solution across infrastructure, workload qualification, and automation.

The Challenge: Balancing Cost and Reliability

Our ECS infrastructure initially relied entirely on On-Demand EC2 instances, provisioned through Auto Scaling Groups (ASGs) connected to ECS Capacity Providers. While reliable, this approach didn’t take advantage of AWS’s surplus compute capacity.
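
For reference, attaching an ASG to an ECS cluster as a capacity provider amounts to a couple of API calls. The sketch below is illustrative boto3 with hypothetical names, not our actual provisioning code, which lives in infrastructure-as-code:

    import boto3

    ecs = boto3.client("ecs")

    # Illustrative only: register an existing On-Demand ASG as an ECS capacity provider.
    ecs.create_capacity_provider(
        name="on-demand-capacity-provider",  # hypothetical name
        autoScalingGroupProvider={
            "autoScalingGroupArn": "arn:aws:autoscaling:...",  # ARN of the On-Demand ASG
            "managedScaling": {"status": "ENABLED", "targetCapacity": 100},
            "managedTerminationProtection": "ENABLED",
        },
    )

    # Make the provider available to the cluster and use it by default.
    ecs.put_cluster_capacity_providers(
        cluster="my-cluster",  # hypothetical cluster name
        capacityProviders=["on-demand-capacity-provider"],
        defaultCapacityProviderStrategy=[
            {"capacityProvider": "on-demand-capacity-provider", "weight": 1}
        ],
    )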

We aimed to layer Spot Instances into our clusters, but selectively. Our goal was clear: route only eligible workloads to Spot capacity while ensuring no service degradation during unexpected terminations.

Spot Instances: Power and Pitfalls

Spot Instances provide dramatic cost reductions but introduce several operational caveats:

  • Ephemeral by nature: AWS can terminate them at any time with a two-minute warning.
  • Capacity variability: Availability depends on AWS’s excess capacity in each AZ and can shift unpredictably.
  • Scaling limitations: Auto Scaling may fail if the desired instance types are not currently available.

To avoid introducing fragility into our stack, we established technical eligibility criteria that workloads must meet before being scheduled on Spot.
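
As a rough illustration of what interruption tolerance looks like inside a container (a sketch, not our production code): when ECS stops a task it sends SIGTERM, waits up to stopTimeout, and then sends SIGKILL, so the process must drain and exit promptly on SIGTERM.

    import signal
    import sys
    import time

    shutting_down = False

    def handle_sigterm(signum, frame):
        # ECS sends SIGTERM when the task is being stopped (e.g. a Spot reclaim).
        global shutting_down
        shutting_down = True

    signal.signal(signal.SIGTERM, handle_sigterm)

    while not shutting_down:
        # Process work in small units so a stop request is never blocked for long.
        time.sleep(1)

    # Flush buffers and close connections well within the stopTimeout budget,
    # exiting before ECS escalates to SIGKILL.
    sys.exit(0)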

Defining Spot Eligibility

We formalised the following criteria to assess whether a workload could safely tolerate Spot interruptions:

  • Fast Shutdown Support
    • Constraint: stopTimeout must be < 120 seconds in the container definition.
    • Reason: Ensures ECS can gracefully stop the task within AWS’s two-minute termination window.
  • Resilience via Redundancy
    • Constraint: ECS service must have a desiredCount ≥ 2.
    • Reason: Enables task-level redundancy, preventing service outage from single-instance terminations.
  • Statelessness
    • Constraint: No use of local storage or persistent volumes.
    • Reason: Stateless services can be safely terminated without risk of data loss.
  • Load Balancer Deregistration < 120s
    • Constraint: ELB target deregistration delay < 120 seconds.
    • Reason: Ensures that tasks are removed from load balancers before forced shutdown, preventing traffic loss.

These criteria allowed us to systematically identify workloads that can tolerate interruptions without user impact.
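
As a sketch of how these checks can be automated against the ECS and ELB APIs (illustrative boto3 code, not our exact production logic):

    import boto3

    ecs = boto3.client("ecs")
    elbv2 = boto3.client("elbv2")

    def is_spot_eligible(cluster: str, service_name: str) -> bool:
        """Rough sketch of the eligibility rules above; simplified, not production logic."""
        service = ecs.describe_services(cluster=cluster, services=[service_name])["services"][0]

        # Resilience via redundancy: at least two tasks.
        if service["desiredCount"] < 2:
            return False

        task_def = ecs.describe_task_definition(
            taskDefinition=service["taskDefinition"]
        )["taskDefinition"]

        # Fast shutdown: every container must stop well inside the two-minute warning.
        # An unset stopTimeout is treated as the 30-second ECS default.
        for container in task_def["containerDefinitions"]:
            if container.get("stopTimeout", 30) >= 120:
                return False

        # Statelessness: reject task definitions that declare volumes (a simple proxy).
        if task_def.get("volumes"):
            return False

        # Load balancer deregistration must also finish inside the two-minute window.
        for lb in service.get("loadBalancers", []):
            attrs = elbv2.describe_target_group_attributes(
                TargetGroupArn=lb["targetGroupArn"]
            )["Attributes"]
            delay = next(
                int(a["Value"]) for a in attrs
                if a["Key"] == "deregistration_delay.timeout_seconds"
            )
            if delay >= 120:
                return False

        return True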

Introducing the Spot Placement Score (SPS) Metric

To make smarter placement decisions, we introduced a Spot Placement Score (SPS) metric. This metric reflects how likely we are to acquire Spot Instances for a specific configuration and is based on:

  • The instance families and sizes defined in our Spot ASGs
  • The maximum capacity targeted by those ASGs
  • Regional and zone-level availability of excess Spot capacity

We derive this score using AWS’s Spot Placement Score API, which provides insight into AWS’s internal view of instance availability. We calculate a custom SPS for each cluster configuration and emit it as a metric to track Spot capacity health over time.

  • A high SPS means Spot capacity is readily available and interruptions are unlikely.
  • A low SPS indicates increased risk of interruption, guiding us to scale back Spot usage.
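
The score itself comes from the EC2 GetSpotPlacementScores API. A minimal sketch of deriving and publishing it for one cluster configuration (the region, namespace, and names are illustrative):

    import boto3

    ec2 = boto3.client("ec2")
    cloudwatch = boto3.client("cloudwatch")

    def emit_spot_placement_score(cluster: str, instance_types: list[str], target_capacity: int) -> int:
        """Illustrative sketch: compute an SPS and publish it as a CloudWatch metric."""
        response = ec2.get_spot_placement_scores(
            InstanceTypes=instance_types,      # instance families/sizes from the Spot ASGs
            TargetCapacity=target_capacity,    # maximum capacity targeted by those ASGs
            SingleAvailabilityZone=False,
            RegionNames=["eu-west-1"],         # illustrative region
        )
        # AWS returns a 1-10 score per region/AZ; take the best available placement.
        score = max(s["Score"] for s in response["SpotPlacementScores"])

        cloudwatch.put_metric_data(
            Namespace="Spot/PlacementScore",   # hypothetical namespace
            MetricData=[{
                "MetricName": "SpotPlacementScore",
                "Dimensions": [{"Name": "Cluster", "Value": cluster}],
                "Value": score,
            }],
        )
        return score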

Automating Spot Adjustments with Fine-Grained Controls

To manage Spot usage dynamically, we built automation around this score that responds to SPS changes in near real time. Our system enables:

  • Proactive de-risking: Reduce Spot allocation when SPS drops.
  • Critical period protections: Disable Spot entirely during high-sensitivity windows like product launches.
  • Quick recovery: Re-enable Spot when capacity signals stabilise.

This is achieved via a global “Spot percentage” setting in ECS Capacity Providers, which controls the ratio of tasks placed on Spot versus On-Demand (e.g. 70/30). Adjusting this percentage allows us to dial in or out of Spot with minimal effort.
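
The policy that maps SPS to a Spot percentage is deliberately simple. A sketch of the kind of thresholds involved (the actual values are per-cluster configuration, not these numbers):

    def target_spot_percentage(sps: int, critical_window: bool) -> int:
        """Illustrative policy only: map the placement score to a Spot share."""
        if critical_window:   # high-sensitivity window, e.g. a product launch
            return 0
        if sps >= 7:          # capacity readily available, interruptions unlikely
            return 30
        if sps >= 4:          # elevated interruption risk: scale back
            return 10
        return 0              # low score: fall back to On-Demand only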

Scalable Architecture: Modular Automation

To scale this across environments, we built a resilient, self-healing system to continuously evaluate and apply Spot configurations. Key components include:

  • Orchestrator (Step Function) - Coordinates the entire workflow:
    • Identifies eligible services
    • Updates capacity provider strategies
    • Validates post-deployment health
    • Retries updates for unstable services
  • Determiner (Lambda) - Applies the eligibility criteria and detects drift between current and desired configurations.
  • Updater (Lambda) - Applies the updated strategy only if needed, ensuring idempotent and safe deployments.
  • Stability Checker (Lambda) - Validates that each service has stabilised (all desired tasks are running, none pending), as sketched below.
  • Scheduler (EventBridge) - Triggers the orchestrator on a configurable cadence to maintain up-to-date configurations.
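
As an illustration, the Stability Checker can be close to a single DescribeServices call (a sketch, with an assumed event shape):

    import boto3

    ecs = boto3.client("ecs")

    def lambda_handler(event, context):
        """Sketch only: a service is stable when all desired tasks are running,
        nothing is pending, and a single deployment is active."""
        service = ecs.describe_services(
            cluster=event["cluster"], services=[event["service"]]
        )["services"][0]

        stable = (
            service["runningCount"] == service["desiredCount"]
            and service["pendingCount"] == 0
            and len(service["deployments"]) == 1
        )
        return {"service": event["service"], "stable": stable}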

We leverage ECS capacity provider strategies to control Spot usage per service. A strategy assigns weights that determine how tasks are distributed across capacity providers such as On-Demand and Spot (for example, 70% On-Demand and 30% Spot), and ECS uses those weights to decide where tasks should be placed.
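
Applying a strategy to a service is a single UpdateService call. A sketch with hypothetical provider and service names:

    import boto3

    ecs = boto3.client("ecs")

    # Illustrative 70/30 On-Demand/Spot split; all names are hypothetical.
    ecs.update_service(
        cluster="my-cluster",
        service="my-service",
        capacityProviderStrategy=[
            {"capacityProvider": "on-demand-capacity-provider", "weight": 7, "base": 2},
            {"capacityProvider": "spot-capacity-provider", "weight": 3},
        ],
        forceNewDeployment=True,  # roll tasks so the new strategy takes effect
    )

The base field keeps a minimum number of tasks on that provider regardless of the weights, which pairs naturally with the desiredCount ≥ 2 redundancy rule.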

Per-Service Granularity

Rather than enforce a blanket rule across all services, we opted for per-service strategy definitions. This enables us to:

  • Tailor strategies to each service’s reliability and behavior
  • Perform safe, gradual rollouts
  • Fine-tune Spot usage at the workload level as conditions evolve

This granularity gives teams confidence in adoption without compromising uptime.

Real-World Impact

Even with a cautious 5-10% Spot adoption per cluster, we’ve achieved meaningful compute cost reductions without any observable impact on service availability or latency.

As our internal confidence, observability, and tooling mature, we’ll progressively increase this percentage and expand Spot adoption without sacrificing safety.

Key Takeaways

  • Spot requires discipline: Savings are substantial, but only interruption-tolerant workloads should be placed on Spot.
  • Hard rules scale better than heuristics: Objective constraints like shutdown time and statelessness let us scale safely and avoid human judgment errors.
  • Automation is non-negotiable: Managing Spot at scale without orchestration is error-prone and fragile.
  • Validation gates protect uptime: Post-deployment checks ensure services remain healthy after every update.
  • Predictive scoring unlocks confidence: SPS helps us stay ahead of capacity fluctuations and plan accordingly.

If you’re managing a large ECS footprint, Spot Instances can be a major lever for compute cost optimisation - but only with the right guardrails in place. With objective eligibility criteria, an automation framework, and real-time placement scoring, Spot can become a safe, reliable, and cost-effective part of your ECS strategy.

We’re hiring in Tech. If you are interested in a position at Deliveroo, please visit deliveroo.com/careers.
