
AWS ECS Deep Dive Documentation

A comprehensive guide to Amazon Elastic Container Service — architecture, components, networking, scheduling, scaling, and operational best practices. Source material: Joud W. Awad — AWS ECS Deep Dive (Medium, Jan 2025), supplemented with current AWS documentation.


Table of Contents

  1. What is Amazon ECS?

  2. The Three Layers of ECS

  3. ECS Application Lifecycle

  4. Capacity Options

  5. Networking

  6. Receiving Inbound Traffic from the Internet

  7. Connecting ECS to AWS Services from Inside Your VPC

  8. Service-to-Service Communication

  9. Monitoring

  10. ECS Clusters

  11. Container Instance States: Draining vs Deregistering

  12. The ECS Container Agent

  13. Task Definitions

  14. IAM Roles for ECS

  15. Task Networking Modes (EC2 launch type)

  16. Storage Options

  17. Task Scheduling and Placement

  18. The ECS Task Lifecycle

  19. Standalone Tasks

  20. ECS Services

  21. Load Balancing

  22. Auto Scaling

  23. Task Scale-In Protection

  24. Quick Reference Cheat Sheet

  25. Recent Updates (2025)


What is Amazon ECS?

Amazon Elastic Container Service (Amazon ECS) is a fully managed container orchestration service that helps you deploy, manage, and scale containerized applications. As a fully managed service, it ships with AWS configuration and operational best practices built in.

Key integrations:

  • AWS-native tools: Amazon ECR, IAM, CloudWatch, ALB/NLB, Auto Scaling, EventBridge

  • Third-party tooling: Docker, GitHub Actions, Terraform

You can run and scale container workloads across AWS Regions in the cloud and on-premises — without managing a control plane.


The Three Layers of ECS

ECS architecture is divided into three logical layers:

Layer | Description
Capacity | The infrastructure where your containers run (EC2, Fargate, on-prem).
Controller | The software that deploys and manages your applications running on the containers.
Provisioning | The tools that you use to interface with the scheduler to deploy and manage applications.

ECS Application Lifecycle

The high-level lifecycle of moving an application onto ECS:

  1. Architect for containers — A container is a standardized unit of software that holds everything your app needs to run: code, runtime, system tools, and libraries.

  2. Build an image — Containers are created from a read-only template called an image. Images are typically built from a Dockerfile.

  3. Store the image in a registry — e.g., Amazon ECR.

  4. Create a task definition — A JSON blueprint describing your application: which image, ports, volumes, env vars, CPU/memory, etc.

  5. Deploy as a Task or a Service onto a Cluster.

  6. Monitor — using CloudWatch, Container Insights, or Runtime Monitoring.

Core vocabulary

  • Cluster — a logical grouping of tasks/services running on registered capacity infrastructure.

  • Task definition — the JSON blueprint of one or more containers that form your app.

  • Task — an instantiation of a task definition within a cluster. Can run standalone or as part of a service.

  • Service — runs and maintains a desired number of tasks simultaneously. If a task fails, the service scheduler launches a replacement.

  • Container agent — runs on each container instance; reports running tasks and resource utilization back to ECS, and starts/stops tasks on request.


Capacity Options

Capacity is the infrastructure where your containers run. You set a default capacity provider strategy when creating the cluster, and you can override it for each service or task you launch.

🆕 2025 Update — Capacity Providers are now the recommended interface. AWS now recommends using capacity providers to launch tasks rather than directly specifying a launch type. Use the launch type only in requiresCompatibilities on the task definition for compatibility validation. Capacity providers offer better resource control, smoother transitions between compute types, and are required for the newer ECS Managed Instances option below.

Fargate (Serverless)

A serverless, pay-as-you-go compute engine. You don’t manage servers, plan capacity, or isolate workloads — you just pick exact CPU and memory.

Best for:

  • Large workloads that need low operational overhead

  • Small workloads with occasional bursts

  • Tiny workloads

  • Batch workloads

Capacity providers:

  • FARGATE — on-demand

  • FARGATE_SPOT — discounted spare capacity, interruption-tolerant only (2-minute warning when AWS reclaims capacity)

EC2 (self-managed)

You manage the EC2 instances backing the cluster via Auto Scaling Groups. Best for large workloads that must be price-optimized, or when you need full control over the host (custom AMIs, kernel-level tools, etc.). When designing services on EC2, group containers by purpose — e.g., a frontend service and its log-streaming sidecar belong in the same task definition; a backend API and a data store belong in separate task definitions.

ECS Managed Instances (launched Sept 2025)

A fully managed EC2 compute option that combines Fargate’s hands-off operations with EC2’s flexibility. You define task requirements (vCPU, memory, CPU architecture) and ECS automatically provisions, configures, and operates the optimal EC2 instances in your account using AWS-controlled access.

Key features:

  • Attribute-based instance selection — specify ranges (e.g., 8–16 vCPU), CPU manufacturers, accelerator types, GPU support, network-optimized or burstable families

  • Continuous task placement optimization — ECS bin-packs tasks across instances and drains underutilized ones automatically

  • Automatic AZ spreading — tasks distributed across AZs first, then bin-packed

  • Automatic security patching every 14 days (configurable to weekly maintenance windows via EC2 event windows)

  • Spot support (added Dec 2025) — set capacityOptionType: spot for up to 90% discount on fault-tolerant workloads

  • Tag propagation — tags flow from capacity provider to instances, ENIs, volumes, etc.

Best for: GPU/ML inference, high-network workloads, capacity reservation needs, eBPF-based observability tools that need privileged host access — anything that needs more than Fargate offers but where you’d rather not run an Auto Scaling Group yourself. You’re billed for the management overhead plus underlying EC2 costs.

External (ECS Anywhere)

Register an on-premises server or VM to your ECS cluster using the EXTERNAL launch type. Best for outbound or data-processing workloads — there’s no Elastic Load Balancing support for external instances, which makes inbound-heavy workloads less efficient. The on-prem server runs both the ECS agent and the SSM agent.

Capacity option comparison

Option | You manage | AWS manages | Best for
Fargate | Task definition only | Everything else | Default; low ops overhead
Fargate Spot | Task definition + interruption handling | Everything else | Fault-tolerant workloads at discount
ECS Managed Instances | Task requirements | Instance lifecycle, patching, optimization | EC2 flexibility without ASG management
EC2 (self-managed) | ASG, AMIs, scaling, patching | ECS scheduling | Maximum control, custom AMIs
ECS Anywhere | On-prem hardware + agents | ECS scheduling | Hybrid / on-prem workloads

Networking

AWS resources live in subnets. ECS tasks run inside the subnet you specify (at cluster, service, or task level depending on launch type).

Subnet types

Connectivity option | When to use
Public subnet + Internet Gateway | Public apps that need high bandwidth or low latency — video streaming, gaming.
Private subnet + NAT Gateway | Apps that must be protected from direct external access — payment processing, user data stores.
AWS PrivateLink | Private connectivity between VPCs, AWS services, and on-prem networks without exposing traffic to the public internet.

Tip: NAT gateways charge per hour AND per GB of data processed. For HA, you should run one NAT gateway per Availability Zone — which can get expensive for small workloads.


Receiving Inbound Traffic from the Internet

For public services, place a scalable input layer between the internet and your application. The three main options:

Application Load Balancer (ALB) — OSI Layer 7

Best for HTTP/HTTPS services and REST APIs.

Strengths:

  • SSL/TLS termination — manages certs and offloads SSL from the app

  • Advanced routing — host- and path-based routing for microservices

  • gRPC, WebSockets, HTTP/2 support

  • Security — HTTP de-sync mitigations, AWS WAF integration (SQLi, XSS protection)

Network Load Balancer (NLB) — OSI Layer 4

Best for non-HTTP protocols or where end-to-end encryption is required.

Strengths:

  • End-to-end encryption — operates at L4 without reading packet contents

  • TLS termination — optionally offload TLS

  • UDP support — and other non-TCP protocols

Amazon API Gateway (HTTP API)

Best for HTTP apps with sudden bursts or low overall traffic volumes.

Pricing model differs: ALB/NLB charge an hourly fee to keep the LB available; API Gateway charges per request.

  • Low traffic / spiky traffic → API Gateway is cheaper

  • High sustained traffic → ALB/NLB is cheaper per request

API Gateway connects into a private VPC subnet via a VPC link and discovers private task IPs via AWS Cloud Map records managed by ECS Service Discovery. It also adds capabilities like client authorization, usage tiers, request/response transformation, edge/regional/private endpoints, and response caching.


Connecting ECS to AWS Services from Inside Your VPC

The ECS container agent must talk to the ECS control plane. If using ECR, hosts must reach the ECR endpoint and S3 (where image layers live).

Option 1: NAT Gateway

The easiest path — but with downsides:

  • No granular destination control — you can’t restrict NAT-gateway-bound traffic to specific AWS services without disrupting all VPC outbound traffic.

  • Per-GB charges for NAT data processing — pulling large S3 objects, heavy DynamoDB reads, or ECR image pulls all cost money.

  • Bandwidth caps at 5 Gbps (auto-scaling up to 45 Gbps); divide workloads across subnets with their own NAT gateways for very high-bandwidth apps.

Option 2: AWS PrivateLink (VPC endpoints)

Provides private connectivity between your VPC and supported AWS services without traversing the public internet. PrivateLink provisions ENIs inside your subnet, and VPC routing sends traffic for the service hostname through the ENI directly to the AWS service.

Benefits: no IGW, no NAT, no public IPs needed. Traffic never leaves the AWS network.
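As a rough illustration, the boto3 sketch below creates the interface endpoints an ECS workload typically needs for ECR pulls and CloudWatch logging, plus the S3 gateway endpoint that ECR image layers require. All IDs are placeholders, and the exact endpoint list depends on launch type (EC2 container instances also need the ecs, ecs-agent, and ecs-telemetry endpoints).

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VPC_ID = "vpc-0123456789abcdef0"           # placeholder
SUBNET_IDS = ["subnet-aaa", "subnet-bbb"]  # one per AZ, placeholders
SG_IDS = ["sg-0123456789abcdef0"]          # must allow 443 from the tasks
ROUTE_TABLE_IDS = ["rtb-0123456789abcdef0"]

# Interface endpoints (PrivateLink ENIs placed in your subnets)
for service in ("ecr.api", "ecr.dkr", "logs"):
    ec2.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=VPC_ID,
        ServiceName=f"com.amazonaws.us-east-1.{service}",
        SubnetIds=SUBNET_IDS,
        SecurityGroupIds=SG_IDS,
        PrivateDnsEnabled=True,  # lets the default service hostname resolve to the ENI
    )

# Gateway endpoint for S3 — ECR stores image layers in S3
ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId=VPC_ID,
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=ROUTE_TABLE_IDS,
)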


Service-to-Service Communication

Once you’re running multiple ECS services in a VPC, they need to find and talk to each other. There are four main approaches:

1. ECS Service Connect (recommended)

Service Connect provides ECS-managed configuration for service discovery, connectivity, and traffic monitoring. Apps use short names and standard ports to connect to services in the same cluster, across clusters, and even across VPCs in the same Region. ECS handles all parts of service discovery — name registration, dynamic per-task entries, and a sidecar agent in each task that resolves names. Your app uses standard DNS lookups, so no code changes are needed if you already do that.

Why it’s recommended over plain Service Discovery:

  • Faster failover — doesn’t rely on DNS TTL caching

  • Built-in resilience — automatic load balancing, automatic retries (e.g., on 503), connection draining, network-level health checks

  • Standardized metrics and logs — observability baked in

  • Changes only happen during deployments — config is part of the service/task definition; updates are tied to the deployment lifecycle, avoiding DNS propagation delays
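A minimal boto3 sketch of turning Service Connect on when creating a service. The cluster name, namespace, service name, and network IDs are placeholders, and the portName must match a named port mapping in the task definition.

import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="prod",                      # placeholder cluster name
    serviceName="orders",
    taskDefinition="orders:3",
    desiredCount=2,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
    serviceConnectConfiguration={
        "enabled": True,
        "namespace": "internal",         # Cloud Map namespace associated with the cluster
        "services": [
            {
                # portName must match a named portMapping in the task definition
                "portName": "http",
                "clientAliases": [{"port": 80, "dnsName": "orders"}],
            }
        ],
    },
)

Clients in the same namespace can then reach this service at http://orders:80 with a plain DNS lookup.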

2. ECS Service Discovery (AWS Cloud Map)

Direct service-to-service communication using DNS. ECS syncs the list of running tasks to Cloud Map, which maintains a DNS hostname resolving to internal task IPs.

Pros:

  • Lowest latency — traffic goes container → container directly

  • Simple architecture, no extra components

Cons:

  • Your app must implement retry logic and gracefully handle stale DNS records (TTL caching can return IPs of containers that no longer exist)

  • Not as resilient or observable as Service Connect

📝 When Service Discovery still wins: If you’re using Blue/Green deployments with CodeDeploy, Service Connect was historically incompatible (CodeDeploy DeploymentController type wasn’t supported). Native ECS Blue/Green is now supported with Service Connect, but verify your deployment controller compatibility before choosing.

3. Internal Load Balancer

An ALB or NLB deployed entirely inside your VPC. ServiceA opens connections to the LB; the LB opens connections to ServiceB tasks.

Pros:

  • Centralized management of connections

  • Automatic health checks remove bad targets

  • Apps don’t need to track downstream container counts

Cons:

  • Cost — load balancers need redundant resources per AZ

  • Mitigation: share one ALB across multiple services using path-based routing (e.g., /api/user/* → user service, /api/order/* → order service)
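A hedged sketch of that mitigation using boto3: two path-based rules on one internal ALB listener, each forwarding to a per-service target group. The listener and target group ARNs are placeholders.

import boto3

elbv2 = boto3.client("elbv2")

LISTENER_ARN = "arn:aws:elasticloadbalancing:...:listener/app/internal-alb/..."  # placeholder

# Route /api/user/* to the user service and /api/order/* to the order service
for priority, path, tg_arn in [
    (10, "/api/user/*", "arn:aws:elasticloadbalancing:...:targetgroup/user-svc/..."),
    (20, "/api/order/*", "arn:aws:elasticloadbalancing:...:targetgroup/order-svc/..."),
]:
    elbv2.create_rule(
        ListenerArn=LISTENER_ARN,
        Priority=priority,
        Conditions=[{"Field": "path-pattern", "Values": [path]}],
        Actions=[{"Type": "forward", "TargetGroupArn": tg_arn}],
    )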

4. Amazon VPC Lattice (modern option)

A managed application networking service. By associating ECS services with a VPC Lattice target group, ECS auto-registers tasks as IP targets. Useful for connecting, observing, and securing apps across compute services, VPCs, and accounts without code changes.


Monitoring

Before standing up a cluster, build a monitoring plan that answers:

  • What are your monitoring goals?

  • What resources will you monitor?

  • How often?

  • Which tools?

  • Who runs the monitoring?

  • Who gets paged when things break?

Minimum baseline metrics

  • CPU and memory reservation + utilization at the cluster level

  • CPU and memory utilization at the service level

Available metrics depend on launch type:

  • Fargate: CPU and memory utilization metrics provided automatically per service

  • EC2: You also need to monitor the EC2 instances themselves; cluster/service/task-level reservation and utilization metrics are also available

Tooling options

  • CloudWatch Alarms — threshold-based alerts; can also drive Auto Scaling for Fargate services

  • CloudWatch Logs — capture container stdout/stderr by setting the awslogs log driver in the task definition

  • CloudWatch Events / EventBridge — match events and route them to targets for automated response

  • Container Insights — collects, aggregates, and summarizes performance metrics and logs for containerized workloads using structured JSON performance log events; CloudWatch creates aggregated metrics at cluster/service/task level (with Enhanced Observability mode adding container-level detail)

  • CloudTrail — log API actions; ship to CloudWatch Logs for real-time monitoring

  • Runtime Monitoring — uses a GuardDuty security agent for runtime visibility (file access, process execution, network connections)

Warning: Container Insights metrics only reflect resources with running tasks in the time range. A service with desiredCount > 0 but no RUNNING tasks emits no metrics.

Automating responses with EventBridge

ECS emits these event types to EventBridge in near real time:

  • Container instance state change

  • Task state change

  • Service action

  • Service deployment state change

You write rules matching events of interest and trigger automated responses (Lambda, SNS, Step Functions, etc.), as in the sketch below.
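For example, a boto3 sketch of a rule that matches "ECS Task State Change" events for stopped tasks in one cluster and routes them to an SNS topic; the cluster ARN and topic ARN are placeholders, and the target could just as well be Lambda or Step Functions.

import json
import boto3

events = boto3.client("events")

# Match tasks in the "prod" cluster that reach STOPPED
pattern = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {
        "clusterArn": ["arn:aws:ecs:us-east-1:123456789012:cluster/prod"],  # placeholder
        "lastStatus": ["STOPPED"],
    },
}

events.put_rule(
    Name="ecs-task-stopped",
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)

# Route matching events to an SNS topic for notification / automated response
events.put_targets(
    Rule="ecs-task-stopped",
    Targets=[{"Id": "notify", "Arn": "arn:aws:sns:us-east-1:123456789012:ecs-alerts"}],
)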

Container Health Checks

Defined in the task definition, run inside the container, evaluate exit codes:

Parameter | Meaning
command | The command run inside the container (e.g., curl localhost:80)
interval | Seconds between checks
timeout | Seconds to wait before marking a check failed
retries | Failed checks before container is marked unhealthy
startPeriod | Optional grace period during bootstrapping
health_check = {
  command     = ["CMD-SHELL", "curl -f -m 1.00 http://localhost:80 || exit 1"]
  timeout     = 2
  retries     = 3
  interval    = 10
  startPeriod = 10
}

Possible statuses: HEALTHY, UNHEALTHY, UNKNOWN.

Task health rollup rules (evaluated in order):

  1. If any essential container is UNHEALTHY → task is UNHEALTHY

  2. If any essential container is UNKNOWN → task is UNKNOWN

  3. If all essential containers are HEALTHY → task is HEALTHY

Tip: ECS does not monitor Docker HEALTHCHECK directives embedded in images unless they’re declared in the container definition. Container definition health checks override image-embedded ones.

ECS Exec

Connect into running containers without SSH or open ports:

  • Direct container access — run commands or open shells in EC2- or Fargate-based tasks

  • Enhanced security — no SSH keys, no extra inbound ports

  • Auditing — ECS Exec sessions can log to CloudWatch Logs or S3, and CloudTrail records who connected and when

ECS Exec uses AWS Systems Manager Session Manager for the connection and IAM policies for authorization. The SSM agent binaries are bind-mounted into the container, and the ECS/Fargate agent starts the SSM core agent alongside your application.
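A sketch of the API side using boto3, assuming the service was created with enableExecuteCommand=True and the task role has the required ssmmessages permissions. In practice the interactive shell is usually opened with the aws ecs execute-command CLI, which drives the Session Manager plugin for you.

import boto3

ecs = boto3.client("ecs")

# The service (or run_task call) must have enableExecuteCommand=True,
# and the task role needs the ssmmessages permissions ECS Exec requires.
response = ecs.execute_command(
    cluster="prod",                # placeholder
    task="arn:aws:ecs:us-east-1:123456789012:task/prod/0123456789abcdef0",  # placeholder
    container="app",
    interactive=True,
    command="/bin/sh",
)

# The response contains Session Manager session details; an interactive shell
# is normally established by the CLI rather than directly from this response.
print(response["session"]["sessionId"])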


ECS Clusters

A cluster is a logical grouping of tasks and services. It contains:

  • The infrastructure capacity provider(s)

  • The network (VPC and subnets)

  • An optional namespace — used for Service Connect

  • A monitoring option (e.g., Container Insights)

Capacity providers — Fargate

When using Fargate, you don’t create or manage capacity. You associate one or both of these pre-defined providers with the cluster:

  • FARGATE

  • FARGATE_SPOT

When Spot tasks are reclaimed, ECS sends a task state change event to EventBridge with the stopped reason describing the interruption.

Capacity providers — EC2

Use Auto Scaling Groups (ASGs) to manage the EC2 instances registered to the cluster. ECS can manage scale-in/scale-out via managed scaling, or you can manage it yourself.

🔹 Best practice: Create a new, empty ASG for the capacity provider. Reusing an existing ASG with already-registered instances can cause registration mismatches with the capacity provider.

🔹 Enable managed instance draining (on by default) for graceful EC2 termination during scale-in.

Capacity provider strategy

Distributes tasks across providers using two parameters:

  • base — minimum number of tasks that must run on a specific provider. Only one provider in a strategy may define a base.

  • weight — relative percentage split. With capacityProviderA=1 and capacityProviderB=4, every 1 task on A is matched by 4 on B. Example: 75% Fargate / 25% Fargate Spot → weight 3 on FARGATE, weight 1 on FARGATE_SPOT.
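A minimal boto3 sketch of that 75/25 split on a Fargate service, with a base of 2 tasks pinned to on-demand capacity; the cluster, service, and network identifiers are placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.create_service(
    cluster="prod",
    serviceName="web",
    taskDefinition="web:12",
    desiredCount=8,
    # 75% FARGATE / 25% FARGATE_SPOT, with at least 2 tasks always on on-demand
    capacityProviderStrategy=[
        {"capacityProvider": "FARGATE", "base": 2, "weight": 3},
        {"capacityProvider": "FARGATE_SPOT", "weight": 1},
    ],
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa", "subnet-bbb"],
            "securityGroups": ["sg-0123456789abcdef0"],
            "assignPublicIp": "DISABLED",
        }
    },
)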


Container Instance States: Draining vs Deregistering

These two operations are commonly confused.

Draining

Transitioning an instance to DRAINING prevents new tasks from being scheduled and safely removes running tasks. Used during system updates, scale-in, or maintenance.

For Services:

  • Pending tasks are stopped immediately. The scheduler launches replacements if cluster capacity allows.

  • Running tasks are transitioned to STOPPED. The scheduler replaces them based on the deployment configuration:

    • minimumHealthyPercent — lower bound on healthy task count. With desiredCount=4 and minimumHealthyPercent=50%, at least 2 tasks must remain healthy. The scheduler can stop up to 2 tasks before launching replacements.
    • maximumPercent — upper bound. With desiredCount=4 and maximumPercent=200%, up to 8 tasks can run concurrently during replacement, enabling faster blue/green-style replacement.

For Standalone tasks: Pending and running tasks are unaffected. You must wait for them to finish or stop them manually. The instance remains in DRAINING until they complete or you set its state back to ACTIVE.
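A small boto3 sketch of the drain-then-reactivate flow around a maintenance window; the container instance ARN is a placeholder.

import boto3

ecs = boto3.client("ecs")

instance_arns = [
    "arn:aws:ecs:us-east-1:123456789012:container-instance/prod/0123456789abcdef0",  # placeholder
]

# Stop new placements and let the service scheduler replace running tasks elsewhere
ecs.update_container_instances_state(
    cluster="prod",
    containerInstances=instance_arns,
    status="DRAINING",
)

# ... perform maintenance, then return the instance to service
ecs.update_container_instances_state(
    cluster="prod",
    containerInstances=instance_arns,
    status="ACTIVE",
)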

Deregistering (EC2-only)

Deregistering removes the EC2 instance from the cluster. It becomes unavailable for new tasks.

Key gotchas:

  • Running tasks become orphaned — they keep running but ECS no longer manages them. Service tasks are replaced on other instances by the scheduler.

  • The EC2 instance is NOT terminated — you must terminate it manually to stop being billed.

  • ASGs / CloudFormation: Update the ASG or stack to remove the instance — otherwise the ASG will replace it with a fresh one.

Comparison

Aspect | Draining | Deregistering
Purpose | Temporarily stop scheduling new tasks; gracefully remove running ones | Permanently remove instance from cluster
Reversible | Yes (back to ACTIVE) | No
Effect on running tasks | Stopped/replaced (services); untouched (standalone) | Orphaned (continue running, unmanaged)
EC2 instance fate | Continues to exist | Continues to exist (must be terminated separately)
Applies to | Both EC2 and Fargate | EC2 only

The ECS Container Agent

A process that runs on every container instance registered to the cluster. It facilitates communication between the instance and ECS.

State transitions:

  • Successful registration → instance status ACTIVE, agent connection TRUE → can accept run-task requests.

  • Stop (not terminate) the instance → status stays ACTIVE but agent connection drops to FALSE within minutes; running tasks stop.

  • Restart the instance → agent reconnects, instance can run tasks again.

  • Set state to DRAINING → no new tasks placed; service tasks evicted if possible.

  • Deregister or terminate → status becomes INACTIVE immediately; instance still describable for 1 hour after termination, then gone.

Tip: Always run the latest ECS agent version when possible — each version adds features and bug fixes.


Task Definitions

A task definition is the JSON blueprint for your app — it defines containers, ports, env vars, volumes, IAM roles, networking mode, logging, and resource sizes.

Task definition states

State | Meaning
ACTIVE | Registered and usable for running tasks or creating services.
INACTIVE | Deregistered. Existing tasks/services unaffected, but no new tasks/services can be created from it. Still retrievable via DescribeTaskDefinition.
DELETE_IN_PROGRESS | Submitted for deletion. ECS verifies no active tasks/deployments reference it before permanently deleting.

Best practices for container images

  • Make images self-contained — bundle all dependencies as static files inside the image.

  • One process per container — avoid the “fat container” anti-pattern.

  • Handle SIGTERM gracefully — when ECS stops a task, it sends SIGTERM, then SIGKILL after the stop timeout. Apps that ignore SIGTERM force the wait. Your SIGTERM handler should:

    • Stop accepting new work
    • Finish in-flight work, OR
    • Persist unfinished work to external storage if it can’t complete in time
  • Log to stdout/stderr — decouples log handling from app code; lets infra adjust log routing without redeploying.

  • Use tags to version images — don’t build per-commit, but build per-release. Treat image tags as immutable release markers.

Task sizes (CPU / memory)

CPU is measured in units: 1024 units = 1 full vCPU. Memory is measured in MiB.

  • Reservation — guaranteed minimum. Scheduler won’t place a task on an instance that can’t fulfill the reservation.

  • Limit — hard ceiling. Exceeding CPU → throttled. Exceeding memory → container killed.

  • Bursting — using more than the reservation (up to the limit) when capacity allows.

Stateless apps (behind an LB):

  • Determine memory consumption empirically via ps, top, or Container Insights.

  • For CPU: smaller reservations (e.g., 256 units / ¼ vCPU) → fine-grained, cheaper, but slower to scale on spikes. Larger reservations → faster spike response, more expensive.

Singleton / non-horizontal apps (workers, DB servers):

  • Pick CPU/memory based on load testing for your SLO. ECS guarantees placement on a host with adequate capacity.
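A hedged boto3 sketch showing reservation versus limit at the container level for an EC2-backed task definition; the image URI and numbers are illustrative only.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="worker",
    requiresCompatibilities=["EC2"],
    networkMode="awsvpc",
    containerDefinitions=[
        {
            "name": "app",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/worker:1.4.0",  # placeholder
            "cpu": 256,                 # reservation: 1/4 vCPU guaranteed
            "memoryReservation": 512,   # soft limit (MiB): guaranteed minimum
            "memory": 1024,             # hard limit (MiB): exceeding it kills the container
            "essential": True,
        }
    ],
)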

📚 For full task definition parameter reference, consult the official AWS docs — parameters change frequently.


IAM Roles for ECS

ECS uses several distinct IAM roles depending on launch type and features:

Role | Purpose
Task Execution Role | Used by the ECS agent and Fargate agent to pull images from ECR, ship logs to CloudWatch, fetch secrets from Secrets Manager / SSM Parameter Store.
Task Role | Used by the application code inside the container to call AWS APIs (e.g., S3, DynamoDB).
Service-Linked Role for ECS | Allows ECS itself to call other AWS services on your behalf (auto-created).
Container Instance IAM Role (EC2 only) | Lets the EC2 host register with the cluster, send telemetry, pull images. Attached to the EC2 instance profile.
EventBridge / Auto Scaling roles | Required for scheduled tasks and Application Auto Scaling.

Task Networking Modes (EC2 launch type)

Defined in the task definition. Each mode has trade-offs.

awsvpc

Gives each task the same networking properties as an EC2 instance — its own ENI with a private IP (and IPv6 if dual-stack).

  • Granular security — security groups per task, VPC Flow Logs, etc.

  • Simpler networking — no port collisions; containers in the same task share localhost.

  • Each task can only have one ENI.

host

Container networking is tied directly to the EC2 host. The container listens on the host’s IP and port.

  • Significant drawback: only one instance of a task per host (port collision); no port remapping.

  • Not recommended.

bridge

A virtual network bridge between host and container, with port mappings (static or dynamic).

  • Static mapping — explicitly map host port → container port. Same one-instance-per-host limitation as host.

  • Dynamic mapping — Docker assigns a random ephemeral port on the host. Multiple instances per host become possible.

  • Drawback: dynamic ports make it hard to lock down service-to-service security groups (you’d need to open broad port ranges).

Mode | Multiple tasks per host? | Per-task SG? | Best for
awsvpc | Yes (up to the instance ENI limit) | Yes | Default choice
host | No (one per port) | No | Rare cases needing host-level networking
bridge (static) | No (one per port) | No | Legacy or specific port needs
bridge (dynamic) | Yes | No | Multiple instances when awsvpc isn’t usable
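A minimal boto3 sketch of launching a task in awsvpc mode on EC2 capacity, giving the task its own subnets and security group; all identifiers are placeholders, and the referenced task definition must declare networkMode awsvpc.

import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="prod",
    taskDefinition="worker:7",
    count=1,
    launchType="EC2",
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-aaa"],                   # the task ENI is created here
            "securityGroups": ["sg-0123456789abcdef0"],  # per-task security group
            "assignPublicIp": "DISABLED",
        }
    },
)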

Storage Options

ECS supports several volume types for tasks. The right choice depends on persistence, sharing, and performance needs:

  • Bind mounts — host filesystem mount; ephemeral

  • Docker volumes — managed by Docker on the host

  • Amazon EFS — shared, persistent, multi-AZ; great for shared state and Fargate

  • Amazon FSx for Windows File Server — Windows containers with SMB

  • Amazon EBS — block storage, EC2 launch type

  • Ephemeral storage — temporary, scoped to the task lifetime

For full parameter reference, see the AWS task definition docs.
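As an example of the EFS option, a hedged boto3 sketch of a Fargate task definition that mounts a shared EFS file system with transit encryption; the file system ID, role ARN, and image are placeholders.

import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="shared-state-app",
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="256",
    memory="512",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    volumes=[
        {
            "name": "shared-data",
            "efsVolumeConfiguration": {
                "fileSystemId": "fs-0123456789abcdef0",  # placeholder
                "rootDirectory": "/",
                "transitEncryption": "ENABLED",
            },
        }
    ],
    containerDefinitions=[
        {
            "name": "app",
            "image": "public.ecr.aws/nginx/nginx:latest",
            "essential": True,
            "mountPoints": [
                {"sourceVolume": "shared-data", "containerPath": "/mnt/shared"}
            ],
        }
    ],
)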


Task Scheduling and Placement

ECS provides flexible scheduling: a service scheduler for long-running apps, and standalone or scheduled tasks for batch / single-run jobs.

Placement components

  • Task placement strategy — algorithm for picking instances (and picking tasks to terminate). E.g., random, spread, binpack.

  • Task group — a logical group of related tasks (e.g., all DB tasks).

  • Task placement constraint — rules an instance must meet to host a task. Unmet constraints leave the task in PENDING.

EC2 launch type — placement algorithm

When placing a task, ECS:

  1. Identifies instances satisfying CPU/GPU/memory/port requirements

  2. Filters by placement constraints

  3. Filters by placement strategies

  4. Selects the best instance

Defaults:

  • For tasks running as part of a service: spread across attribute:ecs.availability-zone

  • For standalone tasks: no default constraint

Fargate launch type

Warning: Placement strategies and constraints are not supported on Fargate. Fargate makes a best-effort spread across AZs. With both Fargate and Fargate Spot in the strategy, spread is independent per provider.

Strategy types

Strategy | What it does | Common fields
random | Places tasks on instances at random | —
spread | Distributes tasks evenly across a dimension | instanceId, attribute:ecs.availability-zone
binpack | Packs tasks onto fewest instances based on least available CPU or memory | cpu, memory

Composing strategies

You can chain multiple strategies — first spread across AZs, then across instances within each AZ:

"placementStrategy": [
{ "field": "attribute:ecs.availability-zone", "type": "spread" },
{ "field": "instanceId", "type": "spread" }
]

Task groups

All tasks with the same task group name are considered a set when applying spread. Task groups can also serve as placement constraints via memberOf.

Defaults:

  • Standalone tasks → task definition family name (e.g., family:my-task-definition)

  • Service tasks → service name (cannot be changed)

Constraint types

Type | Description
distinctInstance | Tasks must run on different instances
memberOf | Tasks placed only on instances matching a cluster query language expression
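A small boto3 sketch combining a memberOf constraint (cluster query language) with a binpack strategy on run_task; the expression and names are illustrative.

import boto3

ecs = boto3.client("ecs")

ecs.run_task(
    cluster="prod",
    taskDefinition="batch-job:4",
    count=2,
    launchType="EC2",
    # Only place on t3-family instances (cluster query language expression)
    placementConstraints=[
        {"type": "memberOf", "expression": "attribute:ecs.instance-type =~ t3.*"}
    ],
    # Pack onto as few instances as possible, judged by remaining memory
    placementStrategy=[{"type": "binpack", "field": "memory"}],
)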

The ECS Task Lifecycle

A task moves through these states from launch to termination:

State | Description
PROVISIONING | ECS sets up prerequisites — e.g., creating the ENI for awsvpc mode.
PENDING | Waiting on the container agent — typically waiting for capacity.
ACTIVATING | Pulling images, creating containers, configuring networking, registering target groups, configuring service discovery.
RUNNING | Task is up and serving.
DEACTIVATING | Performing teardown prep — e.g., deregistering from LB target groups.
STOPPING | Agent sends SIGTERM, waits StopTimeout, then SIGKILL.
DEPROVISIONING | Detaching/deleting ENIs, etc.
STOPPED | Task fully stopped.
DELETED | Internal transition; visible only via describe-tasks, not the console.

ECS tracks both lastStatus (current) and desiredStatus (target) for every task.


Standalone Tasks

Use when the app does some work and stops — e.g., a batch job. Triggered via console, AWS CLI, API/SDK, or EventBridge Scheduler. When launched, a task starts in PROVISIONING, ECS finds capacity (using the launch type or capacity provider strategy), then the task moves through the lifecycle. With capacity providers that use managed scaling, tasks that can't find capacity stay in PROVISIONING instead of failing immediately.

Optimizing task launch time

  • Cache images + binpack — set the ECS agent’s image pull behavior to prefer-cached (EC2 only) and use the binpack strategy to consolidate tasks onto fewer instances. Helps Windows workloads with large images. Enable ENI trunking for more concurrent awsvpc tasks per instance.

  • Pick the right network mode — awsvpc adds ENI provisioning latency; bridge is faster if per-task security groups aren’t needed.

  • Monitor launch lifecycle — use the Task metadata endpoint to capture ContainerStartTime → readiness, then trim image size and bootstrap overhead.

  • Right-size instances — match CPU/memory to actual reservations (e.g., an m5.large with 2 vCPU/8 GB hosts 4 tasks at 0.5 vCPU/2 GB each cleanly).

EventBridge Scheduler

Serverless scheduler — works independently of EventBridge buses/rules with broader API targets. Supports:

  • Rate-based schedules (e.g., every 5 minutes)

  • Cron-based schedules

  • One-time schedules
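A hedged boto3 sketch of a cron schedule that launches a standalone task nightly; the cluster, task definition, and role ARNs are placeholders, and the role must allow ecs:RunTask (plus iam:PassRole for the task's roles).

import boto3

scheduler = boto3.client("scheduler")

scheduler.create_schedule(
    Name="nightly-report",
    ScheduleExpression="cron(0 2 * * ? *)",   # every day at 02:00 UTC
    FlexibleTimeWindow={"Mode": "OFF"},
    Target={
        "Arn": "arn:aws:ecs:us-east-1:123456789012:cluster/prod",        # the cluster to run in
        "RoleArn": "arn:aws:iam::123456789012:role/scheduler-run-task",  # must allow ecs:RunTask
        "EcsParameters": {
            "TaskDefinitionArn": "arn:aws:ecs:us-east-1:123456789012:task-definition/report:9",
            "TaskCount": 1,
        },
    },
)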


ECS Services

A service runs and maintains a desired number of task instances simultaneously. If a task fails, the service scheduler launches a replacement.

Scheduling strategies

Strategy | Behavior | Fargate?
REPLICA | Maintains the desired number of tasks across the cluster (default: spread across AZs). | ✅ Supported
DAEMON | Runs exactly one task per active container instance that meets placement constraints. | ❌ Not supported

REPLICA strategy

  • You define desiredCount

  • Tasks can sit behind a load balancer

  • Customize placement with strategies/constraints

  • Health monitoring via container health checks or LB target group health checks

  • Failure throttling — if tasks repeatedly fail to enter RUNNING, the scheduler slows down launch attempts and emits service events to prevent resource waste

Tip: Use AZ Rebalancing with REPLICA to keep tasks evenly distributed across AZs.

Unhealthy task replacement flow:

  1. Service marks a task as unhealthy

  2. Scheduler starts a replacement

  3. If the replacement is HEALTHY → original unhealthy task stopped

  4. If the replacement is UNHEALTHY → scheduler stops one of the unhealthy tasks at random to keep total count near desiredCount

  5. If maximumPercent blocks starting first, scheduler stops one unhealthy task to free capacity, then launches replacement

  6. Repeats until all unhealthy tasks are replaced; if there’s an excess, healthy tasks are stopped at random down to desiredCount

DAEMON strategy

  • Exactly one task per active container instance

  • ECS reserves CPU, memory, and network interfaces for daemon tasks

  • Daemon tasks have priority — they launch first and stop last in clusters that mix daemon + replica services

  • Ideal for logging/monitoring agents that should run on every host

Availability Zone Rebalancing

After a service reaches steady state, ECS continuously monitors task counts per AZ. If imbalanced:

  1. ECS launches new tasks in under-utilized AZs

  2. Once new tasks confirm HEALTHY, ECS stops tasks in over-utilized AZs

Supported:

  • Both Fargate and EC2 launch types (Fargate auto-redistributes; EC2 rebalances across existing instances based on placement strategies, but won’t provision new instances)

  • REPLICA scheduling strategy with AZ-spread or no placement strategy

Not supported with:

  • DAEMON strategy

  • EXTERNAL launch type (ECS Anywhere)

  • maximumPercent = 100%

  • Classic Load Balancer

  • attribute:ecs.availability-zone as a placement constraint


Load Balancing

ECS services on Fargate support ALB, NLB, and Gateway Load Balancer. Use ALB unless you specifically need NLB or GWLB features.

Optimizing health check parameters

Two key parameters control deployment speed:

  • HealthCheckIntervalSeconds — time between checks (default 30s)

  • HealthyThresholdCount — consecutive passes to mark healthy (default 5)

With defaults, a container takes up to 2m30s to be marked healthy.

🔧 If your service starts in under 10s, set interval to 5s and threshold to 2 — total ~10s. Major deployment speedup.

Optimizing connection draining

Clients keep connections alive for reuse; the LB checks if clients have closed the connection before stopping a target.

  • deregistration_delay.timeout_seconds — how long the LB waits before forcing the target into UNUSED. Default: 300s.

  • For services with sub-1s response time, set this to 5s.
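Applying those values with boto3 might look like the sketch below; the target group ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

TG_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/web/0123456789abcdef"  # placeholder

# Faster health checks: 2 consecutive passes, 5 seconds apart (~10s to healthy)
elbv2.modify_target_group(
    TargetGroupArn=TG_ARN,
    HealthCheckIntervalSeconds=5,
    HealthyThresholdCount=2,
)

# Shorter connection draining for services with sub-second response times
elbv2.modify_target_group_attributes(
    TargetGroupArn=TG_ARN,
    Attributes=[{"Key": "deregistration_delay.timeout_seconds", "Value": "5"}],
)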

SIGTERM responsiveness

  • ECS_CONTAINER_STOP_TIMEOUT — time between SIGTERM and SIGKILL. Default: 30s.

  • For fast-shutdown apps, lower to 2s, and trap SIGTERM in your code:

process.on('SIGTERM', function() {
  server.close();
});

This stops accepting new requests, finishes in-flight ones, and exits cleanly — often well under the stop timeout, avoiding SIGKILL.

🆕 Dec 2025 — Custom stop signals on Fargate. Fargate now honors the STOPSIGNAL instruction from OCI-compliant container images. If your app uses SIGQUIT (e.g., Nginx graceful shutdown) or SIGINT instead of the default SIGTERM, the ECS agent reads the image’s STOPSIGNAL and sends the appropriate signal during task termination. Available at no additional cost in all regions.


Auto Scaling

ECS leverages Application Auto Scaling to dynamically adjust the desired task count.

Four scaling types

Type | How it works | When to use
Target Tracking (recommended default) | Maintains a target value for a metric (e.g., CPU at 50%) — like a thermostat. | Most workloads; metrics that scale linearly with capacity.
Step Scaling | Step adjustments based on CloudWatch alarm breach magnitude. | When you need different scaling magnitudes for different alarm severities.
Scheduled Scaling | Scale up/down at specific times. | Predictable daily/weekly traffic patterns.
Predictive Scaling | ML analyzes historical load to detect patterns and pre-scale. | Workloads with strong daily/weekly seasonality.

Tip: Default to target tracking on metrics like average CPU utilization or request count per target — they decrease when capacity grows, which lets ECS follow demand cleanly.
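A minimal boto3 sketch of target tracking on average CPU for an ECS service, using Application Auto Scaling; the cluster/service names and capacity bounds are placeholders.

import boto3

aas = boto3.client("application-autoscaling")

RESOURCE_ID = "service/prod/web"   # service/<cluster>/<service-name>, placeholder

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="cpu-50-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 120,
    },
)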


Task Scale-In Protection

Protects mission-critical tasks from being terminated during scale-in events from auto-scaling or deployments.

Use cases

  • Async job processing — video transcoding, data jobs that run for hours

  • Game servers — ECS tasks hosting active sessions; restart latency is expensive

  • Mid-deployment — protect tasks doing expensive work mid-rollout

Configuration

  • Set protectionEnabled to true

  • Default protection: 2 hours

  • Customize via expiresInMinutes: minimum 1 minute, maximum 2880 minutes (48 hours)

  • After work finishes, set protectionEnabled = false to allow normal termination

Setting protection — two mechanisms

1. ECS container agent endpoint (self-determining tasks)

For queue-based or job-processing workloads. From inside the container, hit:

$ECS_AGENT_URI/task-protection/v1/state

Set ProtectionEnabled when consuming an SQS message, clear it when work finishes. Recommended for workloads where the task itself knows when it’s busy.

2. ECS API (externally-tracked tasks)

Use UpdateTaskProtection to mark tasks protected and GetTaskProtection to query status. Good when an external service tracks task lifecycle — e.g., a game-server controller marking tasks when users log in, clearing when they log out. You can combine both — agent endpoint to set protection from inside, API to clear from an external controller. Both mechanisms are sketched below.
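A hedged sketch of both mechanisms in Python: the in-task call uses the agent endpoint above (shown here with the requests library), and the external call uses the ECS API via boto3. The task ARN, cluster name, and timings are placeholders.

import os
import boto3
import requests

# 1) From inside the task: toggle protection via the container agent endpoint
agent_uri = os.environ["ECS_AGENT_URI"]
requests.put(
    f"{agent_uri}/task-protection/v1/state",
    json={"ProtectionEnabled": True, "ExpiresInMinutes": 60},
)
# ... process the SQS message / long-running job ...
requests.put(
    f"{agent_uri}/task-protection/v1/state",
    json={"ProtectionEnabled": False},
)

# 2) From an external controller: the ECS API equivalents
ecs = boto3.client("ecs")
task_arn = "arn:aws:ecs:us-east-1:123456789012:task/prod/0123456789abcdef0"  # placeholder

ecs.update_task_protection(
    cluster="prod",
    tasks=[task_arn],
    protectionEnabled=True,
    expiresInMinutes=60,
)
print(ecs.get_task_protection(cluster="prod", tasks=[task_arn])["protectedTasks"])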


Quick Reference Cheat Sheet

Launch type decision

  • Need lowest ops overhead? → Fargate

  • Cost-sensitive at scale? → EC2

  • Burst / batch / tiny? → Fargate (or Fargate Spot for interruption-tolerant)

  • On-prem requirement? → ECS Anywhere

Networking decision (EC2 mode)

  • Default choice → awsvpc

  • Multi-task per host without per-task SGs → bridge with dynamic ports

  • Avoid → host

Inbound entry point

  • HTTP / REST API → ALB

  • TCP/UDP / end-to-end TLS → NLB

  • Spiky / low-volume HTTP → API Gateway

Service-to-service

  • Default → Service Connect

  • Lowest latency, simple setup → Service Discovery (with retry logic in app)

  • Need centralized routing for many services → Internal ALB with path-based routing

Auto-scaling

  • Default → Target tracking on CPU/memory or RPS

  • Predictable schedule → Scheduled scaling

  • Seasonal patterns → Predictive scaling

Health check tuning (fast-starting services)

  • HealthCheckIntervalSeconds: 5

  • HealthyThresholdCount: 2

  • deregistration_delay.timeout_seconds: 5

  • ECS_CONTAINER_STOP_TIMEOUT: 2

  • App must trap SIGTERM

Recent Updates (2025)

ECS shipped several material updates in 2025 that go beyond the original article. Highlights:

ECS Express Mode (Nov 2025, re:Invent)

A new feature for rapidly launching containerized web apps and APIs. Provide three inputs — container image, task execution role, infrastructure role — and ECS auto-provisions:

  • A Fargate-based ECS service

  • An ALB with HTTPS on port 443 and SSL/TLS termination

  • Auto-scaling policies

  • CloudWatch monitoring and alarms

  • Security groups with least-privilege rules

  • A unique URL on .ecs.<region>.on.aws

Key details:

  • Fargate-only; no Blue/Green deployment support

  • Up to 25 Express Mode services can share a single ALB (intelligent rule-based routing)

  • Updates use canary deployment by default

  • Resources stay in your account — fully accessible and modifiable

  • No additional charges; pay only for the underlying AWS resources

  • Available in all AWS Regions where ECS and Fargate are supported

  • Manageable via Console, CLI, SDK, CloudFormation, CDK, and Terraform

ECS Managed Instances (Sept 2025 launch, GA Oct 2025, Spot support Dec 2025)

Covered in the Capacity Options section above. The TL;DR: a fully managed EC2 capacity provider that handles instance provisioning, patching every 14 days, and continuous task placement optimization — combining Fargate’s hands-off feel with EC2’s flexibility (GPUs, custom instance types, etc.).

Custom container stop signals on Fargate (Dec 2025)

Fargate now reads STOPSIGNAL from OCI-compliant images and sends the appropriate signal (SIGQUIT, SIGINT, etc.) during task termination, instead of always sending SIGTERM. Useful for apps like Nginx where graceful shutdown uses a non-default signal.

IPv6-only workloads (re:Invent 2025)

ECS now supports running containerized applications in IPv6-only environments without IPv4 dependencies, while maintaining compatibility with existing apps and AWS services. Helps address IPv4 exhaustion, simplifies network architecture, and meets IPv6 compliance requirements.

SOCI Parallel Pull Mode (re:Invent 2025)

A new image pull strategy for faster container starts, with configurable parallelization for both download and unpacking phases. AWS measured ~60% faster pulls on a 10 GB Deep Learning Container image — particularly valuable for AI/ML workloads with large images.

VPC Lattice integration

ECS services can now be associated with VPC Lattice target groups as IP targets. ECS auto-registers tasks to the target group on launch, enabling cross-VPC, cross-account application networking with built-in observability and security — without code changes.

Native Blue/Green with Service Connect

Earlier in 2025, ECS-native Blue/Green deployments added support for Service Connect alongside load balancers, removing one of the historical reasons to choose Service Discovery over Service Connect (the CodeDeploy DeploymentController incompatibility).

Capacity providers preferred over launch types

AWS now explicitly recommends using capacity providers as the primary mechanism for launching tasks. Use launch type only on the task definition’s requiresCompatibilities field for compatibility validation. Capacity providers offer better resource control, allow seamless transitions between compute types, and are required for ECS Managed Instances.

Copilot CLI sunset

The AWS Copilot CLI reaches end of support on June 12, 2026. It remains available as an open-source GitHub project but will no longer receive new features or security updates from AWS. ECS Express Mode is positioned as part of the simplification story going forward.


Conclusion

ECS provides a deeply integrated way to run containers on AWS — clusters, task definitions, tasks, and services compose to deliver scalable, resilient, observable container workloads. The big leverage points:

  • Use capacity providers, not launch types directly, for new workloads

  • Pick the right capacity option — Fargate for low ops overhead, ECS Managed Instances for EC2 flexibility without ASG management, self-managed EC2 for full control

  • Try ECS Express Mode for simple web apps and APIs — three inputs, production-ready stack

  • Use awsvpc networking unless you have a specific reason not to

  • Prefer Service Connect for service-to-service communication

  • Tune health checks, deregistration delay, and SIGTERM handling for fast deployments

  • Use target tracking auto-scaling on metrics that respond linearly to capacity

  • Spread across AZs with placement strategies and AZ Rebalancing for resilience

  • Protect long-running tasks from scale-in with task scale-in protection

