
AWS ECS Deep Dive Documentation

A comprehensive guide to Amazon Elastic Container Service — architecture, components, networking, scheduling, scaling, and operational best practices. Source material: Joud W. Awad — AWS ECS Deep Dive (Medium, Jan 2025), supplemented with current AWS documentation.


Table of Contents

  1. What is Amazon ECS?

  2. The Three Layers of ECS

  3. ECS Application Lifecycle

  4. Capacity Options

  5. Networking

  6. Receiving Inbound Traffic from the Internet

  7. Connecting ECS to AWS Services from Inside Your VPC

  8. Service-to-Service Communication

  9. Monitoring

  10. ECS Clusters

  11. Container Instance States: Draining vs Deregistering

  12. The ECS Container Agent

  13. Task Definitions

  14. IAM Roles for ECS

  15. Task Networking Modes (EC2 launch type)

  16. Storage Options

  17. Task Scheduling and Placement

  18. The ECS Task Lifecycle

  19. Standalone Tasks

  20. ECS Services

  21. Load Balancing

  22. Auto Scaling

  23. Task Scale-In Protection

  24. Quick Reference Cheat Sheet

  25. Recent Updates (2025)


What is Amazon ECS?

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that helps you deploy, manage, and scale containerized applications. As a fully managed service, it ships with AWS configuration and operational best practices built in.

Key integrations:

  • AWS-native tools: Amazon ECR, IAM, CloudWatch, ALB/NLB, Auto Scaling, EventBridge

  • Third-party tooling: Docker, GitHub Actions, Terraform

You can run and scale container workloads across AWS Regions in the cloud and on-premises — without managing a control plane.


The Three Layers of ECS

ECS architecture is divided into three logical layers:

Layer | Description
Capacity | The infrastructure where your containers run (EC2, Fargate, on-prem).
Controller | The software that deploys and manages your applications running on the containers.
Provisioning | The tools that you use to interface with the scheduler to deploy and manage applications.

ECS Application Lifecycle

The high-level lifecycle of moving an application onto ECS:

  1. Architect for containers — A container is a standardized unit of software that holds everything your app needs to run: code, runtime, system tools, and libraries.

  2. Build an image — Containers are created from a read-only template called an image. Images are typically built from a Dockerfile.

  3. Store the image in a registry — e.g., Amazon ECR.

  4. Create a task definition — A JSON blueprint describing your application: which image, ports, volumes, env vars, CPU/memory, etc.

  5. Deploy as a Task or a Service onto a Cluster.

  6. Monitor — using CloudWatch, Container Insights, or Runtime Monitoring.

Core vocabulary

  • Cluster — a logical grouping of tasks/services running on registered capacity infrastructure.

  • Task definition — the JSON blueprint of one or more containers that form your app.

  • Task — an instantiation of a task definition within a cluster. Can run standalone or as part of a service.

  • Service — runs and maintains a desired number of tasks simultaneously. If a task fails, the service scheduler launches a replacement.

  • Container agent — runs on each container instance; reports running tasks and resource utilization back to ECS, and starts/stops tasks on request.


Capacity Options

Capacity is the infrastructure where your containers run. You set a default capacity provider strategy when creating the cluster, and you can override it for each service or task you launch.

🆕 2025 Update — Capacity Providers are now the recommended interface. AWS now recommends using capacity providers to launch tasks rather than directly specifying a launch type. Use the launch type only in requiresCompatibilities on the task definition for compatibility validation. Capacity providers offer better resource control, smoother transitions between compute types, and are required for the newer ECS Managed Instances option below.

Fargate (Serverless)

A serverless, pay-as-you-go compute engine. You don’t manage servers, plan capacity, or isolate workloads — you just pick exact CPU and memory.

Best for:

  • Large workloads that need low operational overhead

  • Small workloads with occasional bursts

  • Tiny workloads

  • Batch workloads

Capacity providers:

  • FARGATE — on-demand

  • FARGATE_SPOT — discounted spare capacity, interruption-tolerant only (2-minute warning when AWS reclaims capacity)

EC2 (self-managed)

You manage the EC2 instances backing the cluster via Auto Scaling Groups. Best for large workloads that must be price-optimized, or when you need full control over the host (custom AMIs, kernel-level tools, etc.). When designing services on EC2, group containers by purpose — e.g., a frontend service and its log-streaming sidecar belong in the same task definition; a backend API and a data store belong in separate task definitions.

ECS Managed Instances (launched Sept 2025)

A fully managed EC2 compute option that combines Fargate’s hands-off operations with EC2’s flexibility. You define task requirements (vCPU, memory, CPU architecture) and ECS automatically provisions, configures, and operates the optimal EC2 instances in your account using AWS-controlled access.

Key features:

  • Attribute-based instance selection — specify ranges (e.g., 8–16 vCPU), CPU manufacturers, accelerator types, GPU support, network-optimized or burstable families

  • Continuous task placement optimization — ECS bin-packs tasks across instances and drains underutilized ones automatically

  • Automatic AZ spreading — tasks distributed across AZs first, then bin-packed

  • Automatic security patching every 14 days (configurable to weekly maintenance windows via EC2 event windows)

  • Spot support (added Dec 2025) — set capacityOptionType: spot for up to 90% discount on fault-tolerant workloads

  • Tag propagation — tags flow from capacity provider to instances, ENIs, volumes, etc.

Best for: GPU/ML inference, high-network workloads, capacity reservation needs, eBPF-based observability tools that need privileged host access — anything that needs more than Fargate offers but where you’d rather not run an Auto Scaling Group yourself. You’re billed for the management overhead plus underlying EC2 costs.

External (ECS Anywhere)

Register an on-premises server or VM to your ECS cluster using the EXTERNAL launch type. Best for outbound or data-processing workloads — there’s no Elastic Load Balancing support for external instances, which makes inbound-heavy workloads less efficient. The on-prem server runs both the ECS agent and the SSM agent.

Capacity option comparison

Option | You manage | AWS manages | Best for
Fargate | Task definition only | Everything else | Default; low ops overhead
Fargate Spot | Task definition + interruption handling | Everything else | Fault-tolerant workloads at discount
ECS Managed Instances | Task requirements | Instance lifecycle, patching, optimization | EC2 flexibility without ASG management
EC2 (self-managed) | ASG, AMIs, scaling, patching | ECS scheduling | Maximum control, custom AMIs
ECS Anywhere | On-prem hardware + agents | ECS scheduling | Hybrid / on-prem workloads

Networking

AWS resources live in subnets. ECS tasks run inside the subnet you specify (at cluster, service, or task level depending on launch type).

Subnet types

Connectivity option | When to use
Public subnet + Internet Gateway | Public apps that need high bandwidth or low latency — video streaming, gaming.
Private subnet + NAT Gateway | Apps that must be protected from direct external access — payment processing, user data stores.
AWS PrivateLink | Private connectivity between VPCs, AWS services, and on-prem networks without exposing traffic to the public internet.

Tip: NAT gateways charge per hour AND per GB of data processed. For HA, you should run one NAT gateway per Availability Zone — which can get expensive for small workloads.


Receiving Inbound Traffic from the Internet

For public services, place a scalable input layer between the internet and your application. The three main options:

Application Load Balancer (ALB) — OSI Layer 7

Best for HTTP/HTTPS services and REST APIs.

Strengths:

  • SSL/TLS termination — manages certs and offloads SSL from the app

  • Advanced routing — host- and path-based routing for microservices

  • gRPC, WebSockets, HTTP/2 support

  • Security — HTTP de-sync mitigations, AWS WAF integration (SQLi, XSS protection)

Network Load Balancer (NLB) — OSI Layer 4

Best for non-HTTP protocols or where end-to-end encryption is required.

Strengths:

  • End-to-end encryption — operates at L4 without reading packet contents

  • TLS termination — optionally offload TLS

  • UDP support — and other non-TCP protocols

Amazon API Gateway (HTTP API)

Best for HTTP apps with sudden bursts or low overall traffic volumes.

Pricing model differs: ALB/NLB charge an hourly fee to keep the LB available; API Gateway charges per request.

  • Low traffic / spiky traffic → API Gateway is cheaper

  • High sustained traffic → ALB/NLB is cheaper per request

API Gateway connects into a private VPC subnet via a VPC link and discovers private task IPs via AWS Cloud Map records managed by ECS Service Discovery. It also adds capabilities like client authorization, usage tiers, request/response transformation, edge/regional/private endpoints, and response caching.


Connecting ECS to AWS Services from Inside Your VPC

The ECS container agent must talk to the ECS control plane. If using ECR, hosts must reach the ECR endpoint and S3 (where image layers live).

Option 1: NAT Gateway

The easiest path — but with downsides:

  • No granular destination control — you can’t restrict NAT-gateway-bound traffic to specific AWS services without disrupting all VPC outbound traffic.

  • Per-GB charges for NAT data processing — pulling large S3 objects, heavy DynamoDB reads, or ECR image pulls all cost money.

  • Bandwidth caps at 5 Gbps (auto-scaling up to 45 Gbps); divide workloads across subnets with their own NAT gateways for very high-bandwidth apps.

Option 2: AWS PrivateLink (VPC endpoints)

Provides private connectivity between your VPC and supported AWS services without traversing the public internet. PrivateLink provisions ENIs inside your subnet, and VPC routing sends traffic for the service hostname through the ENI directly to the AWS service.

Benefits: no IGW, no NAT, no public IPs needed. Traffic never leaves the AWS network.
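As a rough illustration, the boto3 sketch below creates the interface endpoints an ECS workload typically needs for ECR pulls and CloudWatch logging, plus the S3 gateway endpoint that ECR image layers require. All IDs are placeholders, and the exact endpoint list depends on launch type (EC2 container instances also need the ecs, ecs-agent, and ecs-telemetry endpoints).

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VPC_ID = "vpc-0123456789abcdef0"           # placeholder
SUBNET_IDS = ["subnet-aaa", "subnet-bbb"]  # one per AZ, placeholders
SG_IDS = ["sg-0123456789abcdef0"]          # must allow 443 from the tasks
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]

# Interface endpoints (PrivateLink ENIs placed in your subnets)
for service in ("ecr.api", "ecr.dkr", "logs"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=SG_IDS,
        PrivateDnsEnabled=True,  # lets the default service hostname resolve to the ENI
    )

# Gateway endpoint for S3 — ECR stores image layers in S3
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=ROUTE_TABLE_IDS,
)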


Service-to-Service Communication

Once you’re running multiple ECS services in a VPC, they need to find and talk to each other. There are four main approaches:

1. ECS Service Connect (recommended)

Service Connect provides ECS-managed configuration for service discovery, connectivity, and traffic monitoring. Apps use short names and standard ports to connect to services in the same cluster, across clusters, and even across VPCs in the same Region. ECS handles all parts of service discovery — name registration, dynamic per-task entries, and a sidecar agent in each task that resolves names. Your app uses standard DNS lookups, so no code changes are needed if you already do that.

Why it’s recommended over plain Service Discovery:

  • Faster failover — doesn’t rely on DNS TTL caching

  • Built-in resilience — automatic load balancing, automatic retries (e.g., on 503), connection draining, network-level health checks

  • Standardized metrics and logs — observability baked in

  • Changes only happen during deployments — config is part of the service/task definition; updates are tied to the deployment lifecycle, avoiding DNS propagation delays
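A minimal boto3 sketch of turning Service Connect on when creating a service. The cluster name, namespace, service name, and network IDs are placeholders, and the portName must match a named port mapping in the task definition.

import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="prod",                      # placeholder cluster name
    serviceName="orders",
    taskDefinition="orders:3",
    desiredCount=2,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    serviceConnectConfiguration={
        "enabled": True,
        "namespace": "internal",         # Cloud Map namespace associated with the cluster
        "services": [
            {
                # portName must match a named portMapping in the task definition
                "portName": "http",
                "clientAliases": [{"port": 80, "dnsName": "orders"}],
            }
        ],
    },
)

Clients in the same namespace can then reach this service at http://orders:80 with a plain DNS lookup.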

2. ECS Service Discovery (AWS Cloud Map)

Direct service-to-service communication using DNS. ECS syncs the list of running tasks to Cloud Map, which maintains a DNS hostname resolving to internal task IPs.

Pros:

  • Lowest latency — traffic goes container → container directly

  • Simple architecture, no extra components

Cons:

  • Your app must implement retry logic and gracefully handle stale DNS records (TTL caching can return IPs of containers that no longer exist)

  • Not as resilient or observable as Service Connect

📝 When Service Discovery still wins: If you’re using Blue/Green deployments with CodeDeploy, Service Connect was historically incompatible (CodeDeploy DeploymentController type wasn’t supported). Native ECS Blue/Green is now supported with Service Connect, but verify your deployment controller compatibility before choosing.

3. Internal Load Balancer

An ALB or NLB deployed entirely inside your VPC. ServiceA opens connections to the LB; the LB opens connections to ServiceB tasks.

Pros:

  • Centralized management of connections

  • Automatic health checks remove bad targets

  • Apps don’t need to track downstream container counts

Cons:

  • Cost — load balancers need redundant resources per AZ

  • Mitigation: share one ALB across multiple services using path-based routing (e.g., /api/user/* → user service, /api/order/* → order service)
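A hedged sketch of that mitigation using boto3: two path-based rules on one internal ALB listener, each forwarding to a per-service target group. The listener and target group ARNs are placeholders.

import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/internal-alb/..."  # placeholder

# Route /api/user/* to the user service and /api/order/* to the order service
for priority, path, tg_arn in [
    (10, "/api/user/*", "arn:aws:elasticloadbalancing:...:targetgroup/user-svc/..."),
    (20, "/api/order/*", "arn:aws:elasticloadbalancing:...:targetgroup/order-svc/..."),
]:
    elbv2.create_rule(
        ListenerArn=LISTENER_ARN,
        Priority=priority,
        Conditions=[{"Field": "path-pattern", "Values": [path]}],
        Actions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
    )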

4. Amazon VPC Lattice (modern option)

A managed application networking service. By associating ECS services with a VPC Lattice target group, ECS auto-registers tasks as IP targets. Useful for connecting, observing, and securing apps across compute services, VPCs, and accounts without code changes.


Monitoring

Before standing up a cluster, build a monitoring plan that answers:

  • What are your monitoring goals?

  • What resources will you monitor?

  • How often?

  • Which tools?

  • Who runs the monitoring?

  • Who gets paged when things break?

Minimum baseline metrics

  • CPU and memory reservation + utilization at the cluster level

  • CPU and memory utilization at the service level

Available metrics depend on launch type:

  • Fargate: CPU and memory utilization metrics provided automatically per service

  • EC2: You also need to monitor the EC2 instances themselves; cluster/service/task-level reservation and utilization metrics are also available

Tooling options

  • CloudWatch Alarms — threshold-based alerts; can also drive Auto Scaling for Fargate services

  • CloudWatch Logs — capture container stdout/stderr by setting the awslogs log driver in the task definition

  • CloudWatch Events / EventBridge — match events and route them to targets for automated response

  • Container Insights — collects, aggregates, and summarizes performance metrics and logs for containerized workloads using structured JSON performance log events; CloudWatch creates aggregated metrics at cluster/service/task level (with Enhanced Observability mode adding container-level detail)

  • CloudTrail — log API actions; ship to CloudWatch Logs for real-time monitoring

  • Runtime Monitoring — uses a GuardDuty security agent for runtime visibility (file access, process execution, network connections)

Warning: Container Insights metrics only reflect resources with running tasks in the time range. A service with desiredCount > 0 but no RUNNING tasks emits no metrics.

Automating responses with EventBridge

ECS emits these event types to EventBridge in near real time:

  • Container instance state change

  • Task state change

  • Service action

  • Service deployment state change

You write rules matching events of interest and trigger automated responses (Lambda, SNS, Step Functions, etc.), as in the sketch below.
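For example, a boto3 sketch of a rule that matches "ECS Task State Change" events for stopped tasks in one cluster and routes them to an SNS topic; the cluster ARN and topic ARN are placeholders, and the target could just as well be Lambda or Step Functions.

import json
import boto3

events = boto3.client("events")

# Match tasks in the "prod" cluster that reach STOPPED
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/prod"],  # placeholder
        "lastStatus": ["STOPPED"],
    },
}

events.put_rule(
    Name="ecs-task-stopped",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Route matching events to an SNS topic for notification / automated response
events.put_targets(
    Rule="ecs-task-stopped",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:ecs-alerts"}],
)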

Container Health Checks

Defined in the task definition, run inside the container, evaluate exit codes:

Parameter | Meaning
command | The command run inside the container (e.g., curl localhost:80)
interval | Seconds between checks
timeout | Seconds to wait before marking a check failed
retries | Failed checks before container is marked unhealthy
startPeriod | Optional grace period during bootstrapping
health_check = {
  command     = ["CMD-SHELL", "curl -f -m 1.00 http://localhost:80 || exit 1"]
  timeout     = 2
  retries     = 3
  interval    = 10
  startPeriod = 10
}

Possible statuses: HEALTHY, UNHEALTHY, UNKNOWN.

Task health rollup rules (evaluated in order):

  1. If any essential container is UNHEALTHY → task is UNHEALTHY

  2. If any essential container is UNKNOWN → task is UNKNOWN

  3. If all essential containers are HEALTHY → task is HEALTHY

Tip: ECS does not monitor Docker HEALTHCHECK directives embedded in images unless they’re declared in the container definition. Container definition health checks override image-embedded ones.

ECS Exec

Connect into running containers without SSH or open ports:

  • Direct container access — run commands or open shells in EC2- or Fargate-based tasks

  • Enhanced security — no SSH keys, no extra inbound ports

  • Auditing — ECS Exec sessions can log to CloudWatch Logs or S3, and CloudTrail records who connected and when

ECS Exec uses AWS Systems Manager Session Manager for the connection and IAM policies for authorization. The SSM agent binaries are bind-mounted into the container, and the ECS/Fargate agent starts the SSM core agent alongside your application.
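A sketch of the API side using boto3, assuming the service was created with enableExecuteCommand=True and the task role has the required ssmmessages permissions. In practice the interactive shell is usually opened with the aws ecs execute-command CLI, which drives the Session Manager plugin for you.

import boto3

ecs = boto3.client("ecs")

# The service (or run_task call) must have enableExecuteCommand=True,
# and the task role needs the ssmmessages permissions ECS Exec requires.
response = ecs.execute_command(
    cluster="prod",                # placeholder
    task="arn:aws:ecs:us-east-1:123456789012:task/prod/0123456789abcdef0",  # placeholder
    container="app",
    interactive=True,
    command="/bin/sh",
)

# The response contains Session Manager session details; an interactive shell
# is normally established by the CLI rather than directly from this response.
print(response["session"]["sessionId"])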


ECS Clusters

A cluster is a logical grouping of tasks and services. It contains:

  • The infrastructure capacity provider(s)

  • The network (VPC and subnets)

  • An optional namespace — used for Service Connect

  • A monitoring option (e.g., Container Insights)

Capacity providers — Fargate

When using Fargate, you don’t create or manage capacity. You associate one or both of these pre-defined providers with the cluster:

  • FARGATE

  • FARGATE_SPOT

When Spot tasks are reclaimed, ECS sends a task state change event to EventBridge with the stopped reason describing the interruption.

Capacity providers — EC2

Use Auto Scaling Groups (ASGs) to manage the EC2 instances registered to the cluster. ECS can manage scale-in/scale-out via managed scaling, or you can manage it yourself.

🔹 Best practice: Create a new, empty ASG for the capacity provider. Reusing an existing ASG with already-registered instances can cause registration mismatches with the capacity provider.

🔹 Enable managed instance draining (on by default) for graceful EC2 termination during scale-in.

Capacity provider strategy

Distributes tasks across providers using two parameters:

  • base — minimum number of tasks that must run on a specific provider. Only one provider in a strategy may define a base.

  • weight — relative percentage split. With capacityProviderA=1 and capacityProviderB=4, every 1 task on A is matched by 4 on B. Example: 75% Fargate / 25% Fargate Spot → weight 3 on FARGATE, weight 1 on FARGATE_SPOT.
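A minimal boto3 sketch of that 75/25 split on a Fargate service, with a base of 2 tasks pinned to on-demand capacity; the cluster, service, and network identifiers are placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="prod",
    serviceName="web",
    taskDefinition="web:12",
    desiredCount=8,
    # 75% FARGATE / 25% FARGATE_SPOT, with at least 2 tasks always on on-demand
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 2, "weight": 3},
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)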


Container Instance States: Draining vs Deregistering

These two operations are commonly confused.

Draining

Transitioning an instance to DRAINING prevents new tasks from being scheduled and safely removes running tasks. Used during system updates, scale-in, or maintenance.

For Services:

  • Pending tasks are stopped immediately. The scheduler launches replacements if cluster capacity allows.

  • Running tasks are transitioned to STOPPED. The scheduler replaces them based on the deployment configuration:

    • minimumHealthyPercent — lower bound on healthy task count. With desiredCount=4 and minimumHealthyPercent=50%, at least 2 tasks must remain healthy. The scheduler can stop up to 2 tasks before launching replacements.
    • maximumPercent — upper bound. With desiredCount=4 and maximumPercent=200%, up to 8 tasks can run concurrently during replacement, enabling faster blue/green-style replacement.

For Standalone tasks: Pending and running tasks are unaffected. You must wait for them to finish or stop them manually. The instance remains in DRAINING until they complete or you set its state back to ACTIVE.
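A small boto3 sketch of the drain-then-reactivate flow around a maintenance window; the container instance ARN is a placeholder.

import boto3

ecs = boto3.client("ecs")

instance_arns = [
    "arn:aws:ecs:us-east-1:123456789012:container-instance/prod/0123456789abcdef0",  # placeholder
]

# Stop new placements and let the service scheduler replace running tasks elsewhere
ecs.update_container_instances_state(
    cluster="prod",
    containerInstances=instance_arns,
    status="DRAINING",
)

# ... perform maintenance, then return the instance to service
ecs.update_container_instances_state(
    cluster="prod",
    containerInstances=instance_arns,
    status="ACTIVE",
)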

Deregistering (EC2-only)

Deregistering removes the EC2 instance from the cluster. It becomes unavailable for new tasks.

Key gotchas:

  • Running tasks become orphaned — they keep running but ECS no longer manages them. Service tasks are replaced on other instances by the scheduler.

  • The EC2 instance is NOT terminated — you must terminate it manually to stop being billed.

  • ASGs / CloudFormation: Update the ASG or stack to remove the instance — otherwise the ASG will replace it with a fresh one.

Comparison

Aspect | Draining | Deregistering
Purpose | Temporarily stop scheduling new tasks; gracefully remove running ones | Permanently remove instance from cluster
Reversible | Yes (back to ACTIVE) | No
Effect on running tasks | Stopped/replaced (services); untouched (standalone) | Orphaned (continue running, unmanaged)
EC2 instance fate | Continues to exist | Continues to exist (must be terminated separately)
Applies to | Both EC2 and Fargate | EC2 only

The ECS Container Agent

A process that runs on every container instance registered to the cluster. It facilitates communication between the instance and ECS.

State transitions:

  • Successful registration → instance status ACTIVE, agent connection TRUE → can accept run-task requests.

  • Stop (not terminate) the instance → status stays ACTIVE but agent connection drops to FALSE within minutes; running tasks stop.

  • Restart the instance → agent reconnects, instance can run tasks again.

  • Set state to DRAINING → no new tasks placed; service tasks evicted if possible.

  • Deregister or terminate → status becomes INACTIVE immediately; instance still describable for 1 hour after termination, then gone.

Tip: Always run the latest ECS agent version when possible — each version adds features and bug fixes.


Task Definitions

A task definition is the JSON blueprint for your app — it defines containers, ports, env vars, volumes, IAM roles, networking mode, logging, and resource sizes.

Task definition states

State | Meaning
ACTIVE | Registered and usable for running tasks or creating services.
INACTIVE | Deregistered. Existing tasks/services unaffected, but no new tasks/services can be created from it. Still retrievable via DescribeTaskDefinition.
DELETE_IN_PROGRESS | Submitted for deletion. ECS verifies no active tasks/deployments reference it before permanently deleting.

Best practices for container images

  • Make images self-contained — bundle all dependencies as static files inside the image.

  • One process per container — avoid the “fat container” anti-pattern.

  • Handle SIGTERM gracefully — when ECS stops a task, it sends SIGTERM, then SIGKILL after the stop timeout. Apps that ignore SIGTERM force the wait. Your SIGTERM handler should:

    • Stop accepting new work
    • Finish in-flight work, OR
    • Persist unfinished work to external storage if it can’t complete in time
  • Log to stdout/stderr — decouples log handling from app code; lets infra adjust log routing without redeploying.

  • Use tags to version images — don’t build per-commit, but build per-release. Treat image tags as immutable release markers.

Task sizes (CPU / memory)

CPU is measured in units: 1024 units = 1 full vCPU. Memory is measured in MiB.

  • Reservation — guaranteed minimum. Scheduler won’t place a task on an instance that can’t fulfill the reservation.

  • Limit — hard ceiling. Exceeding CPU → throttled. Exceeding memory → container killed.

  • Bursting — using more than the reservation (up to the limit) when capacity allows.

Stateless apps (behind an LB):

  • Determine memory consumption empirically via ps, top, or Container Insights.

  • For CPU: smaller reservations (e.g., 256 units / ¼ vCPU) → fine-grained, cheaper, but slower to scale on spikes. Larger reservations → faster spike response, more expensive.

Singleton / non-horizontal apps (workers, DB servers):

  • Pick CPU/memory based on load testing for your SLO. ECS guarantees placement on a host with adequate capacity.
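A hedged boto3 sketch showing reservation versus limit at the container level for an EC2-backed task definition; the image URI and numbers are illustrative only.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="worker",
    requiresCompatibilities=["EC2"],
    networkMode="awsvpc",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:1.4.0",  # placeholder
            "cpu": 256,                 # reservation: 1/4 vCPU guaranteed
            "memoryReservation": 512,   # soft limit (MiB): guaranteed minimum
            "memory": 1024,             # hard limit (MiB): exceeding it kills the container
            "essential": True,
        }
    ],
)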

📚 For full task definition parameter reference, consult the official AWS docs — parameters change frequently.


IAM Roles for ECS

ECS uses several distinct IAM roles depending on launch type and features:

Role | Purpose
Task Execution Role | Used by the ECS agent and Fargate agent to pull images from ECR, ship logs to CloudWatch, fetch secrets from Secrets Manager / SSM Parameter Store.
Task Role | Used by the application code inside the container to call AWS APIs (e.g., S3, DynamoDB).
Service-Linked Role for ECS | Allows ECS itself to call other AWS services on your behalf (auto-created).
Container Instance IAM Role (EC2 only) | Lets the EC2 host register with the cluster, send telemetry, pull images. Attached to the EC2 instance profile.
EventBridge / Auto Scaling roles | Required for scheduled tasks and Application Auto Scaling.

Task Networking Modes (EC2 launch type)

Defined in the task definition. Each mode has trade-offs.

awsvpc

Gives each task the same networking properties as an EC2 instance — its own ENI with a private IP (and IPv6 if dual-stack).

  • Granular security — security groups per task, VPC Flow Logs, etc.

  • Simpler networking — no port collisions; containers in the same task share localhost.

  • Each task can only have one ENI.

host

Container networking is tied directly to the EC2 host. The container listens on the host’s IP and port.

  • Significant drawback: only one instance of a task per host (port collision); no port remapping.

  • Not recommended.

bridge

A virtual network bridge between host and container, with port mappings (static or dynamic).

  • Static mapping — explicitly map host port → container port. Same one-instance-per-host limitation as host.

  • Dynamic mapping — Docker assigns a random ephemeral port on the host. Multiple instances per host become possible.

  • Drawback: dynamic ports make it hard to lock down service-to-service security groups (you’d need to open broad port ranges).

Mode | Multiple tasks per host? | Per-task SG? | Best for
awsvpc | Yes (up to the instance ENI limit) | Yes | Default choice
host | No (one per port) | No | Rare cases needing host-level networking
bridge (static) | No (one per port) | No | Legacy or specific port needs
bridge (dynamic) | Yes | No | Multiple instances when awsvpc isn’t usable
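A minimal boto3 sketch of launching a task in awsvpc mode on EC2 capacity, giving the task its own subnets and security group; all identifiers are placeholders, and the referenced task definition must declare networkMode awsvpc.

import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="prod",
    taskDefinition="worker:7",
    count=1,
    launchType="EC2",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa"],                   # the task ENI is created here
            "securityGroups": ["sg-0123456789abcdef0"],  # per-task security group
            "assignPublicIp": "DISABLED",
        }
    },
)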

Storage Options

ECS supports several volume types for tasks. The right choice depends on persistence, sharing, and performance needs:

  • Bind mounts — host filesystem mount; ephemeral

  • Docker volumes — managed by Docker on the host

  • Amazon EFS — shared, persistent, multi-AZ; great for shared state and Fargate

  • Amazon FSx for Windows File Server — Windows containers with SMB

  • Amazon EBS — block storage, EC2 launch type

  • Ephemeral storage — temporary, scoped to the task lifetime

For full parameter reference, see the AWS task definition docs.
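As an example of the EFS option, a hedged boto3 sketch of a Fargate task definition that mounts a shared EFS file system with transit encryption; the file system ID, role ARN, and image are placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="shared-state-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    volumes=[
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-0123456789abcdef0",  # placeholder
                "rootDirectory": "/",
                "transitEncryption": "ENABLED",
            },
        }
    ],
    containerDefinitions=[
        {
            "name": "app",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "essential": True,
            "mountPoints": [
                {"sourceVolume": "shared-data", "containerPath": "/mnt/shared"}
            ],
        }
    ],
)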


Task Scheduling and Placement

ECS provides flexible scheduling: a service scheduler for long-running apps, and standalone or scheduled tasks for batch / single-run jobs.

Placement components

  • Task placement strategy — algorithm for picking instances (and picking tasks to terminate). E.g., random, spread, binpack.

  • Task group — a logical group of related tasks (e.g., all DB tasks).

  • Task placement constraint — rules an instance must meet to host a task. Unmet constraints leave the task in PENDING.

EC2 launch type — placement algorithm

When placing a task, ECS:

  1. Identifies instances satisfying CPU/GPU/memory/port requirements

  2. Filters by placement constraints

  3. Filters by placement strategies

  4. Selects the best instance

Defaults:

  • For tasks running as part of a service: spread across attribute:ecs.availability-zone

  • For standalone tasks: no default constraint

Fargate launch type

Warning: Placement strategies and constraints are not supported on Fargate. Fargate makes a best-effort spread across AZs. With both Fargate and Fargate Spot in the strategy, spread is independent per provider.

Strategy types

Strategy | What it does | Common fields
random | Places tasks on instances at random | —
spread | Distributes tasks evenly across a dimension | instanceId, attribute:ecs.availability-zone
binpack | Packs tasks onto fewest instances based on least available CPU or memory | cpu, memory

Composing strategies

You can chain multiple strategies — first spread across AZs, then across instances within each AZ:

"placementStrategy": [
{ "field": "attribute:ecs.availability-zone", "type": "spread" },
{ "field": "instanceId", "type": "spread" }
]

Task groups

All tasks with the same task group name are considered a set when applying spread. Task groups can also serve as placement constraints via memberOf.

Defaults:

  • Standalone tasks → task definition family name (e.g., family:my-task-definition)

  • Service tasks → service name (cannot be changed)

Constraint types

Type | Description
distinctInstance | Tasks must run on different instances
memberOf | Tasks placed only on instances matching a cluster query language expression
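A small boto3 sketch combining a memberOf constraint (cluster query language) with a binpack strategy on run_task; the expression and names are illustrative.

import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="prod",
    taskDefinition="batch-job:4",
    count=2,
    launchType="EC2",
    # Only place on t3-family instances (cluster query language expression)
    placementConstraints=[
        {"type": "memberOf", "expression": "attribute:ecs.instance-type =~ t3.*"}
    ],
    # Pack onto as few instances as possible, judged by remaining memory
    placementStrategy=[{"type": "binpack", "field": "memory"}],
)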

The ECS Task Lifecycle

A task moves through these states from launch to termination:

State | Description
PROVISIONING | ECS sets up prerequisites — e.g., creating the ENI for awsvpc mode.
PENDING | Waiting on the container agent — typically waiting for capacity.
ACTIVATING | Pulling images, creating containers, configuring networking, registering target groups, configuring service discovery.
RUNNING | Task is up and serving.
DEACTIVATING | Performing teardown prep — e.g., deregistering from LB target groups.
STOPPING | Agent sends SIGTERM, waits StopTimeout, then SIGKILL.
DEPROVISIONING | Detaching/deleting ENIs, etc.
STOPPED | Task fully stopped.
DELETED | Internal transition; visible only via describe-tasks, not the console.

ECS tracks both lastStatus (current) and desiredStatus (target) for every task.


Standalone Tasks

Use when the app does some work and stops — e.g., a batch job. Triggered via console, AWS CLI, API/SDK, or EventBridge Scheduler. When launched, a task starts in PROVISIONING, ECS finds capacity (using the launch type or capacity provider strategy), then the task moves through the lifecycle. With capacity providers that use managed scaling, tasks that can't find capacity stay in PROVISIONING instead of failing immediately.

Optimizing task launch time

  • Cache images + binpack — set the ECS agent’s image pull behavior to prefer-cached (EC2 only) and use the binpack strategy to consolidate tasks onto fewer instances. Helps Windows workloads with large images. Enable ENI trunking for more concurrent awsvpc tasks per instance.

  • Pick the right network mode — awsvpc adds ENI provisioning latency; bridge is faster if per-task security groups aren’t needed.

  • Monitor launch lifecycle — use the Task metadata endpoint to capture ContainerStartTime → readiness, then trim image size and bootstrap overhead.

  • Right-size instances — match CPU/memory to actual reservations (e.g., an m5.large with 2 vCPU/8 GB hosts 4 tasks at 0.5 vCPU/2 GB each cleanly).

EventBridge Scheduler

Serverless scheduler — works independently of EventBridge buses/rules with broader API targets. Supports:

  • Rate-based schedules (e.g., every 5 minutes)

  • Cron-based schedules

  • One-time schedules
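A hedged boto3 sketch of a cron schedule that launches a standalone task nightly; the cluster, task definition, and role ARNs are placeholders, and the role must allow ecs:RunTask (plus iam:PassRole for the task's roles).

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="nightly-report",
    ScheduleExpression="cron(0 2 * * ? *)",   # every day at 02:00 UTC
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/prod",        # the cluster to run in
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-run-task",  # must allow ecs:RunTask
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/report:9",
            "TaskCount": 1,
        },
    },
)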


ECS Services

A service runs and maintains a desired number of task instances simultaneously. If a task fails, the service scheduler launches a replacement.

Scheduling strategies

Strategy | Behavior | Fargate?
REPLICA | Maintains the desired number of tasks across the cluster (default: spread across AZs). | ✅ Supported
DAEMON | Runs exactly one task per active container instance that meets placement constraints. | ❌ Not supported

REPLICA strategy

  • You define desiredCount

  • Tasks can sit behind a load balancer

  • Customize placement with strategies/constraints

  • Health monitoring via container health checks or LB target group health checks

  • Failure throttling — if tasks repeatedly fail to enter RUNNING, the scheduler slows down launch attempts and emits service events to prevent resource waste

Tip: Use AZ Rebalancing with REPLICA to keep tasks evenly distributed across AZs.

Unhealthy task replacement flow:

  1. Service marks a task as unhealthy

  2. Scheduler starts a replacement

  3. If the replacement is HEALTHY → original unhealthy task stopped

  4. If the replacement is UNHEALTHY → scheduler stops one of the unhealthy tasks at random to keep total count near desiredCount

  5. If maximumPercent blocks starting first, scheduler stops one unhealthy task to free capacity, then launches replacement

  6. Repeats until all unhealthy tasks are replaced; if there’s an excess, healthy tasks are stopped at random down to desiredCount

DAEMON strategy

  • Exactly one task per active container instance

  • ECS reserves CPU, memory, and network interfaces for daemon tasks

  • Daemon tasks have priority — they launch first and stop last in clusters that mix daemon + replica services

  • Ideal for logging/monitoring agents that should run on every host

Availability Zone Rebalancing

After a service reaches steady state, ECS continuously monitors task counts per AZ. If imbalanced:

  1. ECS launches new tasks in under-utilized AZs

  2. Once new tasks confirm HEALTHY, ECS stops tasks in over-utilized AZs

Supported:

  • Both Fargate and EC2 launch types (Fargate auto-redistributes; EC2 rebalances across existing instances based on placement strategies, but won’t provision new instances)

  • REPLICA scheduling strategy with AZ-spread or no placement strategy

Not supported with:

  • DAEMON strategy

  • EXTERNAL launch type (ECS Anywhere)

  • maximumPercent = 100%

  • Classic Load Balancer

  • attribute:ecs.availability-zone as a placement constraint


Load Balancing

ECS services on Fargate support ALB, NLB, and Gateway Load Balancer. Use ALB unless you specifically need NLB or GWLB features.

Optimizing health check parameters

Two key parameters control deployment speed:

  • HealthCheckIntervalSeconds — time between checks (default 30s)

  • HealthyThresholdCount — consecutive passes to mark healthy (default 5)

With defaults, a container takes up to 2m30s to be marked healthy.

🔧 If your service starts in under 10s, set interval to 5s and threshold to 2 — total ~10s. Major deployment speedup.

Optimizing connection draining

Clients keep connections alive for reuse; the LB checks if clients have closed the connection before stopping a target.

  • deregistration_delay.timeout_seconds — how long the LB waits before forcing the target into UNUSED. Default: 300s.

  • For services with sub-1s response time, set this to 5s.
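Applying those values with boto3 might look like the sketch below; the target group ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/web/0123456789abcdef"  # placeholder

# Faster health checks: 2 consecutive passes, 5 seconds apart (~10s to healthy)
elbv2.modify_target_group(
    TargetGroupArn=TG_ARN,
    HealthCheckIntervalSeconds=5,
    HealthyThresholdCount=2,
)

# Shorter connection draining for services with sub-second response times
elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "5"}],
)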

SIGTERM responsiveness

  • ECS_CONTAINER_STOP_TIMEOUT — time between SIGTERM and SIGKILL. Default: 30s.

  • For fast-shutdown apps, lower to 2s, and trap SIGTERM in your code:

process.on('SIGTERM', function() {
  server.close();
});

This stops accepting new requests, finishes in-flight ones, and exits cleanly — often well under the stop timeout, avoiding SIGKILL.

🆕 Dec 2025 — Custom stop signals on Fargate. Fargate now honors the STOPSIGNAL instruction from OCI-compliant container images. If your app uses SIGQUIT (e.g., Nginx graceful shutdown) or SIGINT instead of the default SIGTERM, the ECS agent reads the image’s STOPSIGNAL and sends the appropriate signal during task termination. Available at no additional cost in all regions.


Auto Scaling

ECS leverages Application Auto Scaling to dynamically adjust the desired task count.

Four scaling types

Type | How it works | When to use
Target Tracking (recommended default) | Maintains a target value for a metric (e.g., CPU at 50%) — like a thermostat. | Most workloads; metrics that scale linearly with capacity.
Step Scaling | Step adjustments based on CloudWatch alarm breach magnitude. | When you need different scaling magnitudes for different alarm severities.
Scheduled Scaling | Scale up/down at specific times. | Predictable daily/weekly traffic patterns.
Predictive Scaling | ML analyzes historical load to detect patterns and pre-scale. | Workloads with strong daily/weekly seasonality.

Tip: Default to target tracking on metrics like average CPU utilization or request count per target — they decrease when capacity grows, which lets ECS follow demand cleanly.
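A minimal boto3 sketch of target tracking on average CPU for an ECS service, using Application Auto Scaling; the cluster/service names and capacity bounds are placeholders.

import boto3

aas = boto3.client("application-autoscaling")

RESOURCE_ID = "service/prod/web"   # service/<cluster>/<service-name>, placeholder

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="cpu-50-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)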


Task Scale-In Protection

Protects mission-critical tasks from being terminated during scale-in events from auto-scaling or deployments.

Use cases

  • Async job processing — video transcoding, data jobs that run for hours

  • Game servers — ECS tasks hosting active sessions; restart latency is expensive

  • Mid-deployment — protect tasks doing expensive work mid-rollout

Configuration

  • Set protectionEnabled to true

  • Default protection: 2 hours

  • Customize via expiresInMinutes: minimum 1 minute, maximum 2880 minutes (48 hours)

  • After work finishes, set protectionEnabled = false to allow normal termination

Setting protection — two mechanisms

1. ECS container agent endpoint (self-determining tasks)

For queue-based or job-processing workloads. From inside the container, hit:

$ECS_AGENT_URI/task-protection/v1/state

Set ProtectionEnabled when consuming an SQS message, clear it when work finishes. Recommended for workloads where the task itself knows when it’s busy.

2. ECS API (externally-tracked tasks)

Use UpdateTaskProtection to mark tasks protected and GetTaskProtection to query status. Good when an external service tracks task lifecycle — e.g., a game-server controller marking tasks when users log in, clearing when they log out. You can combine both — agent endpoint to set protection from inside, API to clear from an external controller. Both mechanisms are sketched below.
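A hedged sketch of both mechanisms in Python: the in-task call uses the agent endpoint above (shown here with the requests library), and the external call uses the ECS API via boto3. The task ARN, cluster name, and timings are placeholders.

import os
import boto3
import requests

# 1) From inside the task: toggle protection via the container agent endpoint
agent_uri = os.environ["ECS_AGENT_URI"]
requests.put(
    f"{agent_uri}/task-protection/v1/state",
    json={"ProtectionEnabled": True, "ExpiresInMinutes": 60},
)
# ... process the SQS message / long-running job ...
requests.put(
    f"{agent_uri}/task-protection/v1/state",
    json={"ProtectionEnabled": False},
)

# 2) From an external controller: the ECS API equivalents
ecs = boto3.client("ecs")
task_arn = "arn:aws:ecs:us-east-1:123456789012:task/prod/0123456789abcdef0"  # placeholder

ecs.update_task_protection(
    cluster="prod",
    tasks=[task_arn],
    protectionEnabled=True,
    expiresInMinutes=60,
)
print(ecs.get_task_protection(cluster="prod", tasks=[task_arn])["protectedTasks"])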


Quick Reference Cheat Sheet

Launch type decision

  • Need lowest ops overhead? → Fargate

  • Cost-sensitive at scale? → EC2

  • Burst / batch / tiny? → Fargate (or Fargate Spot for interruption-tolerant)

  • On-prem requirement? → ECS Anywhere

Networking decision (EC2 mode)

  • Default choice → awsvpc

  • Multi-task per host without per-task SGs → bridge with dynamic ports

  • Avoid → host

Inbound entry point

  • HTTP / REST API → ALB

  • TCP/UDP / end-to-end TLS → NLB

  • Spiky / low-volume HTTP → API Gateway

Service-to-service

  • Default → Service Connect

  • Lowest latency, simple setup → Service Discovery (with retry logic in app)

  • Need centralized routing for many services → Internal ALB with path-based routing

Auto-scaling

  • Default → Target tracking on CPU/memory or RPS

  • Predictable schedule → Scheduled scaling

  • Seasonal patterns → Predictive scaling

Health check tuning (fast-starting services)

  • HealthCheckIntervalSeconds: 5

  • HealthyThresholdCount: 2

  • deregistration_delay.timeout_seconds: 5

  • ECS_CONTAINER_STOP_TIMEOUT: 2

  • App must trap SIGTERM

Recent Updates (2025)

ECS shipped several material updates in 2025 that go beyond the original article. Highlights:

ECS Express Mode (Nov 2025, re:Invent)

A new feature for rapidly launching containerized web apps and APIs. Provide three inputs — container image, task execution role, infrastructure role — and ECS auto-provisions:

  • A Fargate-based ECS service

  • An ALB with HTTPS on port 443 and SSL/TLS termination

  • Auto-scaling policies

  • CloudWatch monitoring and alarms

  • Security groups with least-privilege rules

  • A unique URL on .ecs.<region>.on.aws

Key details:

  • Fargate-only; no Blue/Green deployment support

  • Up to 25 Express Mode services can share a single ALB (intelligent rule-based routing)

  • Updates use canary deployment by default

  • Resources stay in your account — fully accessible and modifiable

  • No additional charges; pay only for the underlying AWS resources

  • Available in all AWS Regions where ECS and Fargate are supported

  • Manageable via Console, CLI, SDK, CloudFormation, CDK, and Terraform

ECS Managed Instances (Sept 2025 launch, GA Oct 2025, Spot support Dec 2025)

Covered in the Capacity Options section above. The TL;DR: a fully managed EC2 capacity provider that handles instance provisioning, patching every 14 days, and continuous task placement optimization — combining Fargate’s hands-off feel with EC2’s flexibility (GPUs, custom instance types, etc.).

Custom container stop signals on Fargate (Dec 2025)

Fargate now reads STOPSIGNAL from OCI-compliant images and sends the appropriate signal (SIGQUIT, SIGINT, etc.) during task termination, instead of always sending SIGTERM. Useful for apps like Nginx where graceful shutdown uses a non-default signal.

IPv6-only workloads (re:Invent 2025)

ECS now supports running containerized applications in IPv6-only environments without IPv4 dependencies, while maintaining compatibility with existing apps and AWS services. Helps address IPv4 exhaustion, simplifies network architecture, and meets IPv6 compliance requirements.

SOCI Parallel Pull Mode (re:Invent 2025)

A new image pull strategy for faster container starts, with configurable parallelization for both download and unpacking phases. AWS measured ~60% faster pulls on a 10 GB Deep Learning Container image — particularly valuable for AI/ML workloads with large images.

VPC Lattice integration

ECS services can now be associated with VPC Lattice target groups as IP targets. ECS auto-registers tasks to the target group on launch, enabling cross-VPC, cross-account application networking with built-in observability and security — without code changes.

Native Blue/Green with Service Connect

Earlier in 2025, ECS-native Blue/Green deployments added support for Service Connect alongside load balancers, removing one of the historical reasons to choose Service Discovery over Service Connect (the CodeDeploy DeploymentController incompatibility).

Capacity providers preferred over launch types

AWS now explicitly recommends using capacity providers as the primary mechanism for launching tasks. Use launch type only on the task definition’s requiresCompatibilities field for compatibility validation. Capacity providers offer better resource control, allow seamless transitions between compute types, and are required for ECS Managed Instances.

Copilot CLI sunset

The AWS Copilot CLI reaches end of support on June 12, 2026. It remains available as an open-source GitHub project but will no longer receive new features or security updates from AWS. ECS Express Mode is positioned as part of the simplification story going forward.


Conclusion

ECS provides a deeply integrated way to run containers on AWS — clusters, task definitions, tasks, and services compose to deliver scalable, resilient, observable container workloads. The big leverage points:

  • Use capacity providers, not launch types directly, for new workloads

  • Pick the right capacity option — Fargate for low ops overhead, ECS Managed Instances for EC2 flexibility without ASG management, self-managed EC2 for full control

  • Try ECS Express Mode for simple web apps and APIs — three inputs, production-ready stack

  • Use awsvpc networking unless you have a specific reason not to

  • Prefer Service Connect for service-to-service communication

  • Tune health checks, deregistration delay, and SIGTERM handling for fast deployments

  • Use target tracking auto-scaling on metrics that respond linearly to capacity

  • Spread across AZs with placement strategies and AZ Rebalancing for resilience

  • Protect long-running tasks from scale-in with task scale-in protection

