AWS ECS Deep Dive Documentation
A comprehensive guide to Amazon Elastic Container Service — architecture, components, networking, scheduling, scaling, and operational best practices. Source material: Joud W. Awad — AWS ECS Deep Dive (Medium, Jan 2025), supplemented with current AWS documentation.
Table of Contents
- What is Amazon ECS?
- The Three Layers of ECS
- ECS Application Lifecycle
- Capacity Options
- Networking
- Receiving Inbound Traffic from the Internet
- Connecting ECS to AWS Services from Inside Your VPC
- Service-to-Service Communication
- Monitoring
- ECS Clusters
- Container Instance States: Draining vs Deregistering
- The ECS Container Agent
- Task Definitions
- IAM Roles for ECS
- Task Networking Modes (EC2 launch type)
- Storage Options
- Task Scheduling and Placement
- The ECS Task Lifecycle
- Standalone Tasks
- ECS Services
- Load Balancing
- Auto Scaling
- Task Scale-In Protection
- Quick Reference Cheat Sheet
- Recent Updates (2025)
What is Amazon ECS?
Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that helps you deploy, manage, and scale containerized applications. As a fully managed service, it ships with AWS configuration and operational best practices built in.
Key integrations:
- AWS-native tools: Amazon ECR, IAM, CloudWatch, ALB/NLB, Auto Scaling, EventBridge
- Third-party tooling: Docker, GitHub Actions, Terraform

You can run and scale container workloads across AWS Regions in the cloud and on-premises — without managing a control plane.
The Three Layers of ECS
ECS architecture is divided into three logical layers:
| Layer | Description |
| --- | --- |
| Capacity | The infrastructure where your containers run (EC2, Fargate, on-prem). |
| Controller | The software that deploys and manages your applications running on the containers. |
| Provisioning | The tools that you use to interface with the scheduler to deploy and manage applications. |
ECS Application Lifecycle
The high-level lifecycle of moving an application onto ECS:
1. Architect for containers — A container is a standardized unit of software that holds everything your app needs to run: code, runtime, system tools, and libraries.
2. Build an image — Containers are created from a read-only template called an image. Images are typically built from a `Dockerfile`.
3. Store the image in a registry — e.g., Amazon ECR.
4. Create a task definition — A JSON blueprint describing your application: which image, ports, volumes, env vars, CPU/memory, etc.
5. Deploy as a Task or a Service onto a Cluster.
6. Monitor — using CloudWatch, Container Insights, or Runtime Monitoring.
Core vocabulary
- Cluster — a logical grouping of tasks/services running on registered capacity infrastructure.
- Task definition — the JSON blueprint of one or more containers that form your app.
- Task — an instantiation of a task definition within a cluster. Can run standalone or as part of a service.
- Service — runs and maintains a desired number of tasks simultaneously. If a task fails, the service scheduler launches a replacement.
- Container agent — runs on each container instance; reports running tasks and resource utilization back to ECS, and starts/stops tasks on request.
Capacity Options
Capacity is the infrastructure where your containers run. You set it at the cluster default level when creating the cluster, and can override it at the task definition / launch type level.
🆕 2025 Update — Capacity Providers are now the recommended interface. AWS now recommends using capacity providers to launch tasks rather than directly specifying a launch type. Use the launch type only in `requiresCompatibilities` on the task definition for compatibility validation. Capacity providers offer better resource control, smoother transitions between compute types, and are required for the newer ECS Managed Instances option below.
Fargate (Serverless)
A serverless, pay-as-you-go compute engine. You don’t manage servers, plan capacity, or isolate workloads — you just pick exact CPU and memory.
Best for:
- Large workloads that need low operational overhead
- Small workloads with occasional bursts
- Tiny workloads
- Batch workloads

Capacity providers:
- `FARGATE` — on-demand
- `FARGATE_SPOT` — discounted spare capacity, interruption-tolerant workloads only (2-minute warning when AWS reclaims capacity)
EC2 (self-managed)
You manage the EC2 instances backing the cluster via Auto Scaling Groups. Best for large workloads that must be price-optimized, or when you need full control over the host (custom AMIs, kernel-level tools, etc.). When designing services on EC2, group containers by purpose — e.g., a frontend service and its log-streaming sidecar belong in the same task definition; a backend API and a data store belong in separate task definitions.
ECS Managed Instances (launched Sept 2025)
A fully managed EC2 compute option that combines Fargate’s hands-off operations with EC2’s flexibility. You define task requirements (vCPU, memory, CPU architecture) and ECS automatically provisions, configures, and operates the optimal EC2 instances in your account using AWS-controlled access.
Key features:
- Attribute-based instance selection — specify ranges (e.g., 8–16 vCPU), CPU manufacturers, accelerator types, GPU support, network-optimized or burstable families
- Continuous task placement optimization — ECS bin-packs tasks across instances and drains underutilized ones automatically
- Automatic AZ spreading — tasks distributed across AZs first, then bin-packed
- Automatic security patching every 14 days (configurable to weekly maintenance windows via EC2 event windows)
- Spot support (added Dec 2025) — set `capacityOptionType: spot` for up to 90% discount on fault-tolerant workloads
- Tag propagation — tags flow from capacity provider to instances, ENIs, volumes, etc.
Best for: GPU/ML inference, high-network workloads, capacity reservation needs, eBPF-based observability tools that need privileged host access — anything that needs more than Fargate offers but where you’d rather not run an Auto Scaling Group yourself. You’re billed for the management overhead plus underlying EC2 costs.
External (ECS Anywhere)
Register an on-premises server or VM to your ECS cluster using the EXTERNAL launch type. Best for outbound or data-processing workloads — there’s no Elastic Load Balancing support for external instances, which makes inbound-heavy workloads less efficient. The on-prem server runs both the ECS agent and the SSM agent.
Capacity option comparison
| Option | You manage | AWS manages | Best for |
| --- | --- | --- | --- |
| Fargate | Task definition only | Everything else | Default; low ops overhead |
| Fargate Spot | Task definition + interruption handling | Everything else | Fault-tolerant workloads at discount |
| ECS Managed Instances | Task requirements | Instance lifecycle, patching, optimization | EC2 flexibility without ASG management |
| EC2 (self-managed) | ASG, AMIs, scaling, patching | ECS scheduling | Maximum control, custom AMIs |
| ECS Anywhere | On-prem hardware + agents | ECS scheduling | Hybrid / on-prem workloads |
Networking
AWS resources live in subnets. ECS tasks run inside the subnet you specify (at cluster, service, or task level depending on launch type).
Subnet types
| Connectivity option | When to use |
| --- | --- |
| Public subnet + Internet Gateway | Public apps that need high bandwidth or low latency — video streaming, gaming. |
| Private subnet + NAT Gateway | Apps that must be protected from direct external access — payment processing, user data stores. |
| AWS PrivateLink | Private connectivity between VPCs, AWS services, and on-prem networks without exposing traffic to the public internet. |
Tip: NAT gateways charge per hour AND per GB of data processed. For HA, run one NAT gateway per Availability Zone — which can get expensive for small workloads.
Receiving Inbound Traffic from the Internet
For public services, place a scalable input layer between the internet and your application. The three main options:
Application Load Balancer (ALB) — OSI Layer 7
Best for HTTP/HTTPS services and REST APIs.
Strengths:
- SSL/TLS termination — manages certs and offloads SSL from the app
- Advanced routing — host- and path-based routing for microservices
- gRPC, WebSockets, and HTTP/2 support
- Security — HTTP desync mitigations, AWS WAF integration (SQLi, XSS protection)
Network Load Balancer (NLB) — OSI Layer 4
Best for non-HTTP protocols or where end-to-end encryption is required.
Strengths:
- End-to-end encryption — operates at L4 without reading packet contents
- TLS termination — optionally offload TLS
- UDP support — and other non-TCP protocols
Amazon API Gateway (HTTP API)
Best for HTTP apps with sudden bursts or low overall traffic volumes.
Pricing model differs: ALB/NLB charge an hourly fee to keep the LB available; API Gateway charges per request.
- Low traffic / spiky traffic → API Gateway is cheaper
- High sustained traffic → ALB/NLB is cheaper per request

API Gateway connects into a private VPC subnet via a VPC link and discovers private IPs via AWS Cloud Map records managed by ECS Service Discovery. It also adds capabilities like client authorization, usage tiers, request/response transformation, edge/regional/private endpoints, and response caching.
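The crossover between the two pricing models can be sketched numerically. The rates below are illustrative placeholders, not current AWS prices, and the helper names are ours:

```python
# Illustrative cost crossover: hourly-priced ALB vs per-request API Gateway.
# Both rates are placeholder assumptions for the sketch, not real AWS pricing.
ALB_HOURLY_USD = 0.0225          # assumed base hourly charge (ignores LCUs)
APIGW_PER_MILLION_USD = 1.00     # assumed HTTP API per-million-request rate
HOURS_PER_MONTH = 730

def monthly_alb_cost() -> float:
    # The ALB costs the same whether it serves 1 request or 1 billion.
    return ALB_HOURLY_USD * HOURS_PER_MONTH

def monthly_apigw_cost(requests_per_month: int) -> float:
    # API Gateway scales linearly with request volume.
    return APIGW_PER_MILLION_USD * requests_per_month / 1_000_000

def cheaper_option(requests_per_month: int) -> str:
    if monthly_apigw_cost(requests_per_month) < monthly_alb_cost():
        return "api-gateway"
    return "alb"
```

Plug in your Region's actual rates and expected request volume to find the break-even point for your workload.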
Connecting ECS to AWS Services from Inside Your VPC
The ECS container agent must talk to the ECS control plane. If using ECR, hosts must reach the ECR endpoint and S3 (where image layers live).
Option 1: NAT Gateway
The easiest path — but with downsides:
- No granular destination control — you can't restrict NAT-gateway-bound traffic to specific AWS services without disrupting all VPC outbound traffic.
- Per-GB charges for NAT data processing — pulling large S3 objects, heavy DynamoDB reads, or ECR image pulls all cost money.
- Bandwidth caps at 5 Gbps (auto-scaling up to 45 Gbps); divide workloads across subnets with their own NAT gateways for very high-bandwidth apps.
Option 2: AWS PrivateLink (VPC Endpoints)
Provides private connectivity between your VPC and supported AWS services without traversing the public internet. PrivateLink provisions ENIs inside your subnet, and VPC routing sends traffic for the service hostname through the ENI directly to the AWS service.
Benefits: no IGW, no NAT, no public IPs needed. Traffic never leaves the AWS network.
Service-to-Service Communication
Once you’re running multiple ECS services in a VPC, they need to find and talk to each other. There are three main approaches:
1. ECS Service Connect (recommended)
Service Connect provides ECS-managed configuration for service discovery, connectivity, and traffic monitoring. Apps use short names and standard ports to connect to services in the same cluster, across clusters, and even across VPCs in the same Region. ECS handles all parts of service discovery — name registration, dynamic per-task entries, and a sidecar agent in each task that resolves names. Your app uses standard DNS lookups, so no code changes are needed if you already do that.
Why it’s recommended over plain Service Discovery:
- Faster failover — doesn't rely on DNS TTL caching
- Built-in resilience — automatic load balancing, automatic retries (e.g., on 503), connection draining, network-level health checks
- Standardized metrics and logs — observability baked in
- Changes only happen during deployments — config is part of the service/task definition; updates are tied to the deployment lifecycle, avoiding DNS propagation delays
2. ECS Service Discovery (AWS Cloud Map)
Direct service-to-service communication using DNS. ECS syncs the list of running tasks to Cloud Map, which maintains a DNS hostname resolving to internal task IPs.
Pros:
- Lowest latency — traffic goes container → container directly
- Simple architecture, no extra components

Cons:
- Your app must implement retry logic and gracefully handle stale DNS records (TTL caching can return IPs of containers that no longer exist)
- Not as resilient or observable as Service Connect
📝 When Service Discovery still wins: If you're using Blue/Green deployments with CodeDeploy, Service Connect was historically incompatible (the CodeDeploy `DeploymentController` type wasn't supported). Native ECS Blue/Green is now supported with Service Connect, but verify your deployment controller compatibility before choosing.
3. Internal Load Balancer
An ALB or NLB deployed entirely inside your VPC. ServiceA opens connections to the LB; the LB opens connections to ServiceB tasks.
Pros:
- Centralized management of connections
- Automatic health checks remove bad targets
- Apps don't need to track downstream container counts

Cons:
- Cost — load balancers need redundant resources per AZ
- Mitigation: share one ALB across multiple services using path-based routing (e.g., `/api/user/*` → user service, `/api/order/*` → order service)
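The path-based sharing pattern can be pictured as a tiny prefix router. A minimal sketch mirroring the example paths above; the rule table and service names are hypothetical:

```python
# Sketch of ALB-style path-based routing shared across several ECS services.
# Rules are evaluated in order; first matching prefix wins, like ALB rule
# priorities. Target-group names here are made up for illustration.
RULES = [
    ("/api/user/", "user-service"),
    ("/api/order/", "order-service"),
]
DEFAULT_ACTION = "default-service"  # an ALB's default rule / fixed response

def route(path: str) -> str:
    for prefix, target_group in RULES:
        if path.startswith(prefix):
            return target_group
    return DEFAULT_ACTION
```

One shared ALB with a rule per path prefix keeps the per-AZ fixed costs down while each service still gets its own target group and health checks.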
4. Amazon VPC Lattice (modern option)
A managed application networking service. By associating ECS services with a VPC Lattice target group, ECS auto-registers tasks as IP targets. Useful for connecting, observing, and securing apps across compute services, VPCs, and accounts without code changes.
Monitoring
Before standing up a cluster, build a monitoring plan that answers:
- What are your monitoring goals?
- What resources will you monitor?
- How often?
- Which tools?
- Who runs the monitoring?
- Who gets paged when things break?
Minimum baseline metrics
- CPU and memory reservation + utilization at the cluster level
- CPU and memory utilization at the service level

Available metrics depend on launch type:
- Fargate: CPU and memory utilization metrics provided automatically per service
- EC2: You also need to monitor the EC2 instances themselves; cluster/service/task-level reservation and utilization metrics are also available
Tooling options
- CloudWatch Alarms — threshold-based alerts; can also drive Auto Scaling for Fargate services
- CloudWatch Logs — capture container stdout/stderr by setting the `awslogs` log driver in the task definition
- CloudWatch Events / EventBridge — match events and route them to targets for automated response
- Container Insights — collects, aggregates, and summarizes performance metrics and logs for containerized workloads using structured JSON performance log events; CloudWatch creates aggregated metrics at cluster/service/task level (with Enhanced Observability mode adding container-level detail)
- CloudTrail — logs API actions; ship to CloudWatch Logs for real-time monitoring
- Runtime Monitoring — uses a GuardDuty security agent for runtime visibility (file access, process execution, network connections)
Warning: Container Insights metrics only reflect resources with running tasks in the time range. A service with `desiredCount > 0` but no `RUNNING` tasks emits no metrics.
Automating responses with EventBridge
ECS emits these event types to EventBridge in near real time:
- Container instance state change
- Task state change
- Service action
- Service deployment state change

You write rules matching events of interest and trigger automated responses (Lambda, SNS, Step Functions, etc.).
Container Health Checks
Defined in the task definition, run inside the container, evaluate exit codes:
| Parameter | Meaning |
| --- | --- |
| `command` | The command run inside the container (e.g., `curl localhost:80`) |
| `interval` | Seconds between checks |
| `timeout` | Seconds to wait before marking a check failed |
| `retries` | Failed checks before the container is marked unhealthy |
| `startPeriod` | Optional grace period during bootstrapping |
"healthCheck": {
  "command": ["CMD-SHELL", "curl -f -m 1.00 http://localhost:80 || exit 1"],
  "interval": 10,
  "timeout": 2,
  "retries": 3,
  "startPeriod": 10
}
Possible statuses: HEALTHY, UNHEALTHY, UNKNOWN.
Task health rollup rules (evaluated in order):
1. If any essential container is `UNHEALTHY` → task is `UNHEALTHY`
2. If any essential container is `UNKNOWN` → task is `UNKNOWN`
3. If all essential containers are `HEALTHY` → task is `HEALTHY`
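The three rollup rules above translate directly into code. A minimal sketch (the function name is ours):

```python
def task_health(essential_statuses: list[str]) -> str:
    """Roll up essential-container health-check statuses into a task status,
    applying the rules in the order ECS evaluates them: any UNHEALTHY wins,
    then any UNKNOWN, otherwise the task is HEALTHY."""
    if "UNHEALTHY" in essential_statuses:
        return "UNHEALTHY"
    if "UNKNOWN" in essential_statuses:
        return "UNKNOWN"
    return "HEALTHY"
```

Note that non-essential containers don't participate in the rollup at all; a failing sidecar marked non-essential leaves the task `HEALTHY`.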
Tip: ECS does not monitor Docker `HEALTHCHECK` directives embedded in images unless they're declared in the container definition. Container definition health checks override image-embedded ones.
ECS Exec
Connect into running containers without SSH or open ports:
- Direct container access — run commands or open shells in EC2- or Fargate-based tasks
- Enhanced security — no SSH keys, no extra inbound ports
- Auditing — ECS Exec sessions can log to CloudWatch Logs or S3, and CloudTrail records who connected and when

ECS Exec uses AWS Systems Manager Session Manager for the connection and IAM policies for authorization. The SSM agent binaries are bind-mounted into the container, and the ECS/Fargate agent starts the SSM core agent alongside your application.
ECS Clusters
A cluster is a logical grouping of tasks and services. It contains:
- The infrastructure capacity provider(s)
- The network (VPC and subnets)
- An optional namespace — used for Service Connect
- A monitoring option (e.g., Container Insights)
Capacity providers — Fargate
When using Fargate, you don’t create or manage capacity. You associate one or both of these pre-defined providers with the cluster:
- `FARGATE`
- `FARGATE_SPOT`

When Spot tasks are reclaimed, ECS sends a task state change event to EventBridge with the stopped reason describing the interruption.
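You can react to those reclaim events in an EventBridge target such as a Lambda function. A minimal Python sketch: the `source` and `detail-type` values follow the documented ECS "Task State Change" event shape, but the exact stopped-reason text is an assumption worth verifying against current AWS docs:

```python
def is_spot_interruption(event: dict) -> bool:
    """Heuristic check for a Fargate Spot reclaim delivered via EventBridge.

    Matches the ECS 'Task State Change' event envelope and then inspects the
    stopped reason. The reason string checked here is an assumption based on
    AWS documentation and may change; verify before relying on it.
    """
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ecs"
        and event.get("detail-type") == "ECS Task State Change"
        and detail.get("stoppedReason", "").startswith("Your Spot Task was interrupted")
    )
```

A rule scoped to `source: aws.ecs` and `detail-type: ECS Task State Change` keeps the Lambda from being invoked for unrelated events; the function then filters down to Spot reclaims.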
Capacity providers — EC2
Use Auto Scaling Groups (ASGs) to manage the EC2 instances registered to the cluster. ECS can manage scale-in/scale-out via managed scaling, or you can manage it yourself.
🔹 Best practice: Create a new, empty ASG for the capacity provider. Reusing an existing ASG with already-registered instances can cause registration mismatches with the capacity provider.
🔹 Enable managed instance draining (on by default) for graceful EC2 termination during scale-in.
Capacity provider strategy
Distributes tasks across providers using two parameters:
- `base` — minimum number of tasks that must run on a specific provider. Only one provider in a strategy may define a base.
- `weight` — relative percentage split. With `capacityProviderA=1` and `capacityProviderB=4`, every 1 task on A is matched by 4 on B. Example: 75% Fargate / 25% Fargate Spot → weight `3` on `FARGATE`, weight `1` on `FARGATE_SPOT`.
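The base-then-weight behavior can be sketched in a few lines. This is a simplified model of the documented rules, not the real scheduler (whose tie-breaking may differ), and the function name is ours:

```python
def split_tasks(total: int, strategy: list[tuple[str, int, int]]) -> dict[str, int]:
    """Distribute `total` tasks per a capacity provider strategy.

    `strategy` entries are (provider, base, weight). The base counts are
    satisfied first; the remainder is handed out one task at a time to the
    provider currently furthest below its weighted share.
    """
    counts = {name: base for name, base, _ in strategy}
    weights = {name: weight for name, _, weight in strategy}
    placed = {name: 0 for name, _, _ in strategy}
    for _ in range(total - sum(counts.values())):
        # The provider with the largest deficit relative to its weight wins.
        name = min(
            (n for n in weights if weights[n] > 0),
            key=lambda n: placed[n] / weights[n],
        )
        placed[name] += 1
        counts[name] += 1
    return counts
```

With weights 3 and 1 the split converges to the 75/25 ratio from the example above.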
Container Instance States: Draining vs Deregistering
These two operations are commonly confused.
Draining
Transitioning an instance to DRAINING prevents new tasks from being scheduled and safely removes running tasks. Used during system updates, scale-in, or maintenance.
For Services:
- Pending tasks are stopped immediately. The scheduler launches replacements if cluster capacity allows.
- Running tasks are transitioned to `STOPPED`. The scheduler replaces them based on the deployment configuration:
  - `minimumHealthyPercent` — lower bound on healthy task count. With `desiredCount=4` and `minimumHealthyPercent=50%`, at least 2 tasks must remain healthy; the scheduler can stop up to 2 tasks before launching replacements.
  - `maximumPercent` — upper bound. With `desiredCount=4` and `maximumPercent=200%`, up to 8 tasks can run concurrently during replacement, enabling faster blue/green-style replacement.
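The two percentages translate into a task-count window. A small sketch, assuming the lower bound rounds up and the upper bound rounds down (the names are ours):

```python
import math

def replacement_bounds(desired: int, min_healthy_pct: int, max_pct: int) -> tuple[int, int]:
    """Task-count window the service scheduler honors during replacement:
    minimumHealthyPercent gives the lower bound (rounded up), maximumPercent
    gives the upper bound (rounded down). Sketch of the documented behavior."""
    lower = math.ceil(desired * min_healthy_pct / 100)
    upper = desired * max_pct // 100
    return lower, upper
```

For `desiredCount=4`, 50% / 200% yields the window (2, 8) used in the example above: never fewer than 2 healthy tasks, never more than 8 running.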
For Standalone tasks: Pending and running tasks are unaffected. You must wait for them to finish or stop them manually. The instance stays in DRAINING until completion or reactivation.
A draining instance returns to ACTIVE when you flip its state back. Until then it remains in DRAINING.
Deregistering (EC2-only)
Deregistering removes the EC2 instance from the cluster. It becomes unavailable for new tasks.
Key gotchas:
- Running tasks become orphaned — they keep running, but ECS no longer manages them. Service tasks are replaced on other instances by the scheduler.
- The EC2 instance is NOT terminated — you must terminate it manually to stop being billed.
- ASGs / CloudFormation: update the ASG or stack to remove the instance — otherwise the ASG will replace it with a fresh one.
Comparison
| Aspect | Draining | Deregistering |
| --- | --- | --- |
| Purpose | Temporarily stop scheduling new tasks; gracefully remove running ones | Permanently remove the instance from the cluster |
| Reversible | Yes (back to ACTIVE) | No |
| Effect on running tasks | Stopped/replaced (services); untouched (standalone) | Orphaned (continue running, unmanaged) |
| EC2 instance fate | Continues to exist | Continues to exist (must be terminated separately) |
| Applies to | Both EC2 and Fargate | EC2 only |
The ECS Container Agent
A process that runs on every container instance registered to the cluster. It facilitates communication between the instance and ECS.
State transitions:
- Successful registration → instance status `ACTIVE`, agent connection `TRUE` → can accept run-task requests.
- Stop (not terminate) the instance → status stays `ACTIVE` but the agent connection drops to `FALSE` within minutes; running tasks stop.
- Restart the instance → the agent reconnects and the instance can run tasks again.
- Set the state to `DRAINING` → no new tasks placed; service tasks evicted if possible.
- Deregister or terminate → status becomes `INACTIVE` immediately; the instance is still describable for 1 hour after termination, then gone.
Tip: Always run the latest ECS agent version when possible — each version adds features and bug fixes.
Task Definitions
A task definition is the JSON blueprint for your app — it defines containers, ports, env vars, volumes, IAM roles, networking mode, logging, and resource sizes.
Task definition states
| State | Meaning |
| --- | --- |
| `ACTIVE` | Registered and usable for running tasks or creating services. |
| `INACTIVE` | Deregistered. Existing tasks/services are unaffected, but no new tasks/services can be created from it. Still retrievable via `DescribeTaskDefinition`. |
| `DELETE_IN_PROGRESS` | Submitted for deletion. ECS verifies no active tasks/deployments reference it before permanently deleting. |
Best practices for container images
- Make images self-contained — bundle all dependencies as static files inside the image.
- One process per container — avoid the "fat container" anti-pattern.
- Handle `SIGTERM` gracefully — when ECS stops a task, it sends `SIGTERM`, then `SIGKILL` after the stop timeout. Apps that ignore `SIGTERM` force the full wait. Your `SIGTERM` handler should:
  - Stop accepting new work
  - Finish in-flight work, OR
  - Persist unfinished work to external storage if it can't complete in time
- Log to `stdout` / `stderr` — decouples log handling from app code; lets infra adjust log routing without redeploying.
- Use tags to version images — build per-release rather than per-commit, and treat image tags as immutable release markers.
Task sizes (CPU / memory)
CPU is measured in units: 1024 units = 1 full vCPU. Memory is specified in MiB.
- Reservation — guaranteed minimum. The scheduler won't place a task on an instance that can't fulfill the reservation.
- Limit — hard ceiling. Exceeding CPU → throttled. Exceeding memory → container killed.
- Bursting — using more than the reservation (up to the limit) when capacity allows.
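A quick sketch of the unit math and of what exceeding each limit means (helper names are ours, and the enforcement model is simplified):

```python
def vcpu_to_units(vcpu: float) -> int:
    """ECS expresses CPU in units: 1024 units == 1 full vCPU."""
    return int(vcpu * 1024)

def over_limit_outcome(cpu_used: int, cpu_limit: int,
                       mem_used: int, mem_limit: int) -> tuple[int, bool]:
    """Simplified model of limit enforcement: CPU beyond the limit is
    throttled down to the limit (the container keeps running), while memory
    beyond the limit gets the container killed (OOM)."""
    effective_cpu = min(cpu_used, cpu_limit)  # throttled, never killed
    oom_killed = mem_used > mem_limit         # hard ceiling
    return effective_cpu, oom_killed
```

So a ¼-vCPU reservation is 256 units, and the asymmetry between the two limits is why memory headroom deserves more caution than CPU headroom.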
Stateless apps (behind an LB):
- Determine memory consumption empirically via `ps`, `top`, or Container Insights.
- For CPU: smaller reservations (e.g., 256 units / ¼ vCPU) → fine-grained and cheaper, but slower to scale on spikes. Larger reservations → faster spike response, more expensive.

Singleton / non-horizontal apps (workers, DB servers):
- Pick CPU/memory based on load testing for your SLO. ECS guarantees placement on a host with adequate capacity.
📚 For full task definition parameter reference, consult the official AWS docs — parameters change frequently.
IAM Roles for ECS
ECS uses several distinct IAM roles depending on launch type and features:
| Role | Purpose |
| --- | --- |
| Task Execution Role | Used by the ECS agent and Fargate agent to pull images from ECR, ship logs to CloudWatch, fetch secrets from Secrets Manager / SSM Parameter Store. |
| Task Role | Used by the application code inside the container to call AWS APIs (e.g., S3, DynamoDB). |
| Service-Linked Role for ECS | Allows ECS itself to call other AWS services on your behalf (auto-created). |
| Container Instance IAM Role (EC2 only) | Lets the EC2 host register with the cluster, send telemetry, pull images. Attached to the EC2 instance profile. |
| EventBridge / Auto Scaling roles | Required for scheduled tasks and Application Auto Scaling. |
Task Networking Modes (EC2 launch type)
Defined in the task definition. Each mode has trade-offs.
awsvpc (recommended)
Gives each task the same networking properties as an EC2 instance — its own ENI with a private IP (and IPv6 if dual-stack).
- Granular security — security groups per task, VPC Flow Logs, etc.
- Simpler networking — no port collisions; containers in the same task share `localhost`.
- Each task can have only one ENI.
host
Container networking is tied directly to the EC2 host. The container listens on the host’s IP and port.
- Significant drawback: only one instance of a task per host (port collision); no port remapping.
- Not recommended.
bridge
A virtual network bridge between host and container, with port mappings (static or dynamic).
- Static mapping — explicitly map host port → container port. Same one-instance-per-host limitation as `host` mode.
- Dynamic mapping — Docker assigns a random ephemeral port on the host. Multiple instances per host become possible.
- Drawback: dynamic ports make it hard to lock down service-to-service security groups (you'd need to open broad port ranges).
| Mode | Multiple tasks per host? | Per-task SG? | Best for |
| --- | --- | --- | --- |
| `awsvpc` | ✅ | ✅ | Default choice |
| `host` | ❌ | ❌ | Rare cases needing host-level networking |
| `bridge` (static) | ❌ | ❌ | Legacy or specific port needs |
| `bridge` (dynamic) | ✅ | ❌ | Multiple instances when `awsvpc` isn't usable |
Storage Options
ECS supports several volume types for tasks. The right choice depends on persistence, sharing, and performance needs:
- Bind mounts — host filesystem mount; ephemeral
- Docker volumes — managed by Docker on the host
- Amazon EFS — shared, persistent, multi-AZ; great for shared state and Fargate
- Amazon FSx for Windows File Server — Windows containers with SMB
- Amazon EBS — block storage, EC2 launch type
- Ephemeral storage — temporary, scoped to the task lifetime

For full parameter reference, see the AWS task definition docs.
Task Scheduling and Placement
ECS provides flexible scheduling: a service scheduler for long-running apps, and standalone or scheduled tasks for batch / single-run jobs.
Placement components
- Task placement strategy — algorithm for picking instances (and picking tasks to terminate), e.g., random, spread, binpack.
- Task group — a logical group of related tasks (e.g., all DB tasks).
- Task placement constraint — rules an instance must meet to host a task. Unmet constraints leave the task in `PENDING`.
EC2 launch type — placement algorithm
When placing a task, ECS:
1. Identifies instances satisfying CPU/GPU/memory/port requirements
2. Filters by placement constraints
3. Filters by placement strategies
4. Selects the best instance
Defaults:
- For tasks running as part of a service: `spread` across `attribute:ecs.availability-zone`
- For standalone tasks: no default constraint
Fargate launch type
Warning: Placement strategies and constraints are not supported on Fargate. Fargate makes a best-effort spread across AZs. With both Fargate and Fargate Spot in the strategy, spread is independent per provider.
Strategy types
| Strategy | What it does | Common fields |
| --- | --- | --- |
| `random` | Places tasks on instances at random | — |
| `spread` | Distributes tasks evenly across a dimension | `instanceId`, `attribute:ecs.availability-zone` |
| `binpack` | Packs tasks onto the fewest instances based on least available CPU or memory | `cpu`, `memory` |
Composing strategies
You can chain multiple strategies — first spread across AZs, then across instances within each AZ:
"placementStrategy": [
{ "field": "attribute:ecs.availability-zone", "type": "spread" },
{ "field": "instanceId", "type": "spread" }
]
Task groups
All tasks with the same task group name are considered a set when applying spread. Task groups can also serve as placement constraints via memberOf.
Defaults:
- Standalone tasks → task definition family name (e.g., `family:my-task-definition`)
- Service tasks → service name (cannot be changed)
Constraint types
| Type | Description |
| --- | --- |
| `distinctInstance` | Tasks must run on different instances |
| `memberOf` | Tasks placed only on instances matching a cluster query language expression |
The ECS Task Lifecycle
A task moves through these states from launch to termination:
| State | Description |
| --- | --- |
| `PROVISIONING` | ECS sets up prerequisites — e.g., creating the ENI for `awsvpc` mode. |
| `PENDING` | Waiting on the container agent — typically waiting for capacity. |
| `ACTIVATING` | Pulling images, creating containers, configuring networking, registering target groups, configuring service discovery. |
| `RUNNING` | Task is up and serving. |
| `DEACTIVATING` | Performing teardown prep — e.g., deregistering from LB target groups. |
| `STOPPING` | Agent sends `SIGTERM`, waits `StopTimeout`, then `SIGKILL`. |
| `DEPROVISIONING` | Detaching/deleting ENIs, etc. |
| `STOPPED` | Task fully stopped. |
| `DELETED` | Internal transition; visible only via `describe-tasks`, not the console. |
ECS tracks both lastStatus (current) and desiredStatus (target) for every task.
Standalone Tasks
Use when the app does some work and stops — e.g., a batch job. Triggered via console, AWS CLI, API/SDK, or EventBridge Scheduler.
When launched, a task starts in PROVISIONING, ECS finds capacity (using launch type or capacity provider strategy), then moves through the lifecycle. With managed scaling capacity providers, capacity-shortage tasks remain in PROVISIONING instead of failing immediately.
Optimizing task launch time
- Cache images + binpack — set the ECS agent's image pull behavior to `prefer-cached` (EC2 only) and use the `binpack` strategy to consolidate tasks onto fewer instances. This helps Windows workloads with large images. Enable ENI trunking for more concurrent `awsvpc` tasks per instance.
- Pick the right network mode — `awsvpc` adds ENI provisioning latency; `bridge` is faster if per-task security groups aren't needed.
- Monitor the launch lifecycle — use the task metadata endpoint to capture `ContainerStartTime` → readiness, then trim image size and bootstrap overhead.
- Right-size instances — match CPU/memory to actual reservations (e.g., an `m5.large` with 2 vCPU / 8 GB hosts 4 tasks at 0.5 vCPU / 2 GB each cleanly).
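The right-sizing arithmetic from the last bullet can be made explicit. A sketch that ignores agent/OS overhead, so real capacity is slightly lower:

```python
def tasks_per_instance(inst_vcpu: float, inst_mem_gb: float,
                       task_vcpu: float, task_mem_gb: float) -> int:
    """How many identical tasks fit on one instance: bounded by whichever
    resource (CPU or memory) runs out first. Overhead from the ECS agent
    and OS is ignored for simplicity."""
    by_cpu = int(inst_vcpu // task_vcpu)
    by_mem = int(inst_mem_gb // task_mem_gb)
    return min(by_cpu, by_mem)
```

An m5.large (2 vCPU / 8 GB) with 0.5-vCPU / 2-GB tasks packs exactly 4 tasks, leaving no stranded capacity on either dimension.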
EventBridge Scheduler
Serverless scheduler — works independently of EventBridge buses/rules with broader API targets. Supports:
- Rate-based schedules (e.g., every 5 minutes)
- Cron-based schedules
- One-time schedules
ECS Services
A service runs and maintains a desired number of task instances simultaneously. If a task fails, the service scheduler launches a replacement.
Scheduling strategies
| Strategy | Behavior | Fargate? |
| --- | --- | --- |
| `REPLICA` | Maintains the desired number of tasks across the cluster (default: spread across AZs). | ✅ |
| `DAEMON` | Runs exactly one task per active container instance that meets placement constraints. | ❌ Not supported |
REPLICA strategy
- You define `desiredCount`
- Tasks can sit behind a load balancer
- Customize placement with strategies/constraints
- Health monitoring via container health checks or LB target group health checks
- Failure throttling — if tasks repeatedly fail to enter `RUNNING`, the scheduler slows down launch attempts and emits service events to prevent resource waste
Tip: Use AZ Rebalancing with REPLICA to keep tasks evenly distributed across AZs.
Unhealthy task replacement flow:
1. The service marks a task as unhealthy
2. The scheduler starts a replacement
3. If the replacement is `HEALTHY` → the original unhealthy task is stopped
4. If the replacement is `UNHEALTHY` → the scheduler stops one of the unhealthy tasks at random to keep the total count near `desiredCount`
5. If `maximumPercent` blocks starting first, the scheduler stops one unhealthy task to free capacity, then launches the replacement
6. This repeats until all unhealthy tasks are replaced; if there's an excess, healthy tasks are stopped at random down to `desiredCount`
DAEMON strategy
- Exactly one task per active container instance
- ECS reserves CPU, memory, and network interfaces for daemon tasks
- Daemon tasks have priority — they launch first and stop last in clusters that mix daemon + replica services
- Ideal for logging/monitoring agents that should run on every host
Availability Zone Rebalancing
After a service reaches steady state, ECS continuously monitors task counts per AZ. If imbalanced:
-
ECS launches new tasks in under-utilized AZs
-
Once new tasks confirm
HEALTHY, ECS stops tasks in over-utilized AZs
Supported:
-
Both Fargate and EC2 launch types (Fargate auto-redistributes; EC2 rebalances across existing instances based on placement strategies, but won’t provision new instances)
-
REPLICA scheduling strategy with AZ-spread or no placement strategy
Not supported with:
-
DAEMON strategy
-
EXTERNAL launch type (ECS Anywhere) -
maximumPercent = 100% -
Classic Load Balancer
-
attribute:ecs.availability-zone as a placement constraint
Load Balancing
ECS services on Fargate support ALB, NLB, and Gateway Load Balancer. Use ALB unless you specifically need NLB or GWLB features.
Optimizing health check parameters
Two key parameters control deployment speed:
-
HealthCheckIntervalSeconds — time between checks (default 30s) -
HealthyThresholdCount — consecutive passes required to mark a target healthy (default 5). With the defaults, a newly registered container can take up to 2m30s (30s × 5) to be marked healthy.
🔧 If your service starts in under 10s, set interval to 5s and threshold to 2 — total ~10s. Major deployment speedup.
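As a rough model (assuming the worst case, where the first check fires a full interval after registration), time-to-healthy is approximately the product of the two parameters:

```javascript
// Rough worst-case time (seconds) for a fresh target to pass enough
// consecutive health checks to be marked healthy.
function timeToHealthySeconds(intervalSeconds, healthyThreshold) {
  return intervalSeconds * healthyThreshold;
}

console.log(timeToHealthySeconds(30, 5)); // 150 → defaults: 2m30s
console.log(timeToHealthySeconds(5, 2));  // 10  → tuned for fast-starting services
```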
Optimizing connection draining
Clients keep connections alive for reuse; the LB checks if clients have closed the connection before stopping a target.
-
deregistration_delay.timeout_seconds — how long the LB waits before forcing the target into UNUSED. Default: 300s. -
For services with sub-1s response time, set this to 5s.
SIGTERM responsiveness
-
ECS_CONTAINER_STOP_TIMEOUT — time between SIGTERM and SIGKILL. Default: 30s. -
For fast-shutdown apps, lower to 2s, and trap
SIGTERM in your code:
```javascript
process.on('SIGTERM', function () {
  // Stop accepting new connections; let in-flight requests finish,
  // then the process exits on its own before the SIGKILL deadline.
  server.close();
});
```
This stops accepting new requests, finishes in-flight ones, and exits cleanly — often well under the stop timeout, avoiding SIGKILL.
🆕 Dec 2025 — Custom stop signals on Fargate. Fargate now honors the STOPSIGNAL instruction from OCI-compliant container images. If your app uses SIGQUIT (e.g., Nginx graceful shutdown) or SIGINT instead of the default SIGTERM, the ECS agent reads the image’s STOPSIGNAL and sends the appropriate signal during task termination. Available at no additional cost in all regions.
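For example, an image whose server shuts down gracefully on SIGQUIT (as Nginx does) can declare this in its Dockerfile — a minimal sketch (tag is illustrative):

```dockerfile
FROM nginx:1.27
# Fargate (Dec 2025+) reads this and sends SIGQUIT instead of SIGTERM at stop
STOPSIGNAL SIGQUIT
```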
Auto Scaling
ECS leverages Application Auto Scaling to dynamically adjust the desired task count.
Four scaling types
| Type | How it works | When to use |
| Target Tracking (recommended default) | Maintains a target value for a metric (e.g., CPU at 50%) — like a thermostat. | Most workloads; metrics that scale linearly with capacity. |
| Step Scaling | Step adjustments based on CloudWatch alarm breach magnitude. | When you need different scaling magnitudes for different alarm severities. |
| Scheduled Scaling | Scale up/down at specific times. | Predictable daily/weekly traffic patterns. |
| Predictive Scaling | ML analyzes historical load to detect patterns and pre-scale. | Workloads with strong daily/weekly seasonality. |
Tip: Default to target tracking on metrics like average CPU utilization or request count per target — they decrease when capacity grows, which lets ECS follow demand cleanly.
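As a mental model only (the real Application Auto Scaling algorithm also involves CloudWatch alarm math, cooldowns, and deliberately conservative scale-in), target tracking behaves roughly like proportional scaling toward the target:

```javascript
// Simplified thermostat model: adjust task count proportionally so the
// average metric returns to its target. NOT the actual AWS algorithm.
function targetTrackingDesired(currentTasks, currentMetric, targetMetric) {
  return Math.ceil(currentTasks * (currentMetric / targetMetric));
}

console.log(targetTrackingDesired(10, 75, 50)); // 15 — scale out under load
console.log(targetTrackingDesired(10, 25, 50)); // 5  — scale in when idle
```

This also shows why the metric must decrease as capacity grows: if it didn't, the proportional adjustment would never converge.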
Task Scale-In Protection
Protects mission-critical tasks from being terminated during scale-in events from auto-scaling or deployments.
Use cases
-
Async job processing — video transcoding, data jobs that run for hours
-
Game servers — ECS tasks hosting active sessions; restart latency is expensive
-
Mid-deployment — protect tasks doing expensive work mid-rollout
Configuration
-
Set protectionEnabled to true
-
Default protection: 2 hours
-
Customize via
expiresInMinutes: minimum 1 minute, maximum 2880 minutes (48 hours) -
After work finishes, set protectionEnabled = false to allow normal termination
Setting protection — two mechanisms
1. ECS container agent endpoint (self-determining tasks). For queue-based or job-processing workloads — from inside the container, hit:
PUT $ECS_AGENT_URI/task-protection/v1/state
Set ProtectionEnabled when consuming an SQS message, clear it when work finishes. Recommended for workloads where the task itself knows when it’s busy.
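A minimal sketch of this agent-endpoint flow (assumes Node 18+ for built-in fetch; ECS_AGENT_URI is injected by ECS inside a running task, and the 1–2880 minute range check mirrors the limits above):

```javascript
// Build the task-protection request body, enforcing the documented
// 1 minute – 2880 minute (48 h) range for ExpiresInMinutes.
function buildProtectionPayload(enabled, expiresInMinutes) {
  if (enabled && expiresInMinutes !== undefined &&
      (expiresInMinutes < 1 || expiresInMinutes > 2880)) {
    throw new RangeError('ExpiresInMinutes must be between 1 and 2880');
  }
  const body = { ProtectionEnabled: enabled };
  if (expiresInMinutes !== undefined) body.ExpiresInMinutes = expiresInMinutes;
  return body;
}

// From inside the container: PUT the payload to the agent endpoint.
// (Only works inside an ECS task, where ECS_AGENT_URI is set.)
async function setTaskProtection(enabled, expiresInMinutes) {
  const res = await fetch(`${process.env.ECS_AGENT_URI}/task-protection/v1/state`, {
    method: 'PUT',
    body: JSON.stringify(buildProtectionPayload(enabled, expiresInMinutes)),
  });
  return res.json();
}
```

A worker would call `setTaskProtection(true, 180)` when it picks up a long job and `setTaskProtection(false)` when the job completes.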
2. ECS API (externally-tracked tasks)
Use UpdateTaskProtection to mark tasks protected and GetTaskProtection to query status. Good when an external service tracks task lifecycle — e.g., a game-server controller marking tasks when users log in, clearing when they log out.
You can combine both — agent endpoint to set protection from inside, API to clear from an external controller.
Quick Reference Cheat Sheet
Launch type decision
-
Need lowest ops overhead? → Fargate
-
Cost-sensitive at scale? → EC2
-
Burst / batch / tiny? → Fargate (or Fargate Spot for interruption-tolerant)
-
On-prem requirement? → ECS Anywhere
Networking decision (EC2 mode)
-
Default choice →
awsvpc -
Multi-task per host without per-task SGs →
bridge with dynamic ports -
Avoid →
host
Inbound entry point
-
HTTP / REST API → ALB
-
TCP/UDP / end-to-end TLS → NLB
-
Spiky / low-volume HTTP → API Gateway
Service-to-service
-
Default → Service Connect
-
Lowest latency, simple setup → Service Discovery (with retry logic in app)
-
Need centralized routing for many services → Internal ALB with path-based routing
Auto-scaling
-
Default → Target tracking on CPU/memory or RPS
-
Predictable schedule → Scheduled scaling
-
Seasonal patterns → Predictive scaling
Health check tuning (fast-starting services)
-
HealthCheckIntervalSeconds: 5 -
HealthyThresholdCount: 2 -
deregistration_delay.timeout_seconds: 5 -
ECS_CONTAINER_STOP_TIMEOUT: 2 -
App must trap
SIGTERM
Recent Updates (2025)
ECS shipped several material updates in 2025 that go beyond the original article. Highlights:
ECS Express Mode (Nov 2025, re:Invent)
A new feature for rapidly launching containerized web apps and APIs. Provide three inputs — container image, task execution role, infrastructure role — and ECS auto-provisions:
-
A Fargate-based ECS service
-
An ALB with HTTPS on port 443 and SSL/TLS termination
-
Auto-scaling policies
-
CloudWatch monitoring and alarms
-
Security groups with least-privilege rules
-
A unique URL on
.ecs.<region>.on.aws
Key details:
-
Fargate-only; no Blue/Green deployment support
-
Up to 25 Express Mode services can share a single ALB (intelligent rule-based routing)
-
Updates use canary deployment by default
-
Resources stay in your account — fully accessible and modifiable
-
No additional charges; pay only for the underlying AWS resources
-
Available in all AWS Regions where ECS and Fargate are supported
-
Manageable via Console, CLI, SDK, CloudFormation, CDK, and Terraform
ECS Managed Instances (Sept 2025 launch, GA Oct 2025, Spot support Dec 2025)
Covered in the Capacity Options section above. The TL;DR: a fully managed EC2 capacity provider that handles instance provisioning, patching every 14 days, and continuous task placement optimization — combining Fargate’s hands-off feel with EC2’s flexibility (GPUs, custom instance types, etc.).
Custom container stop signals on Fargate (Dec 2025)
Fargate now reads STOPSIGNAL from OCI-compliant images and sends the appropriate signal (SIGQUIT, SIGINT, etc.) during task termination, instead of always sending SIGTERM. Useful for apps like Nginx where graceful shutdown uses a non-default signal.
IPv6-only workloads (re:Invent 2025)
ECS now supports running containerized applications in IPv6-only environments without IPv4 dependencies, while maintaining compatibility with existing apps and AWS services. Helps address IPv4 exhaustion, simplifies network architecture, and meets IPv6 compliance requirements.
SOCI Parallel Pull Mode (re:Invent 2025)
A new image pull strategy for faster container starts, with configurable parallelization for both download and unpacking phases. AWS measured ~60% faster pulls on a 10 GB Deep Learning Container image — particularly valuable for AI/ML workloads with large images.
VPC Lattice integration
ECS services can now be associated with VPC Lattice target groups as IP targets. ECS auto-registers tasks to the target group on launch, enabling cross-VPC, cross-account application networking with built-in observability and security — without code changes.
Native Blue/Green with Service Connect
Earlier in 2025, ECS-native Blue/Green deployments added support for Service Connect alongside load balancers, removing one of the historical reasons to choose Service Discovery over Service Connect (the CodeDeploy DeploymentController incompatibility).
Capacity providers preferred over launch types
AWS now explicitly recommends using capacity providers as the primary mechanism for launching tasks. Use launch type only on the task definition’s requiresCompatibilities field for compatibility validation. Capacity providers offer better resource control, allow seamless transitions between compute types, and are required for ECS Managed Instances.
Copilot CLI sunset
The AWS Copilot CLI reaches end of support on June 12, 2026. It remains available as an open-source GitHub project but will no longer receive new features or security updates from AWS. ECS Express Mode is positioned as part of the simplification story going forward.
Conclusion
ECS provides a deeply integrated way to run containers on AWS — clusters, task definitions, tasks, and services compose to deliver scalable, resilient, observable container workloads. The big leverage points:
-
Use capacity providers, not launch types directly, for new workloads
-
Pick the right capacity option — Fargate for low ops overhead, ECS Managed Instances for EC2 flexibility without ASG management, self-managed EC2 for full control
-
Try ECS Express Mode for simple web apps and APIs — three inputs, production-ready stack
-
Use
awsvpc networking unless you have a specific reason not to -
Prefer Service Connect for service-to-service communication
-
Tune health checks, deregistration delay, and
SIGTERM handling for fast deployments -
Use target tracking auto-scaling on metrics that respond linearly to capacity
-
Spread across AZs with placement strategies and AZ Rebalancing for resilience
-
Protect long-running tasks from scale-in with task scale-in protection
References
-
Joud W. Awad — AWS ECS Deep Dive (Medium, Jan 2025): https://joudwawad.medium.com/aws-ecs-deep-dive-c8f773af0bf6
-
AWS — ECS Developer Guide
-
AWS — Service Connect