Orchestrate ECS Tasks with Step Functions
In some cases it is not possible to perform certain operations within a Lambda function, due to timeouts or system dependencies that are difficult to install and manage.
In those cases, running a bash script in a suitable Docker image does the job very well. This especially applies to DevOps tasks, such as dumping a database or converting a series of JPEG/PNG images.
Let's start with the infrastructure configuration: a few resources are needed to run a Docker container in the AWS world.
Docker repository
You can use a public repository like Docker Hub or a public ECR repository.
ECRRepository:
  Type: AWS::ECR::PublicRepository
  Properties:
    RepositoryName: !Sub "${ProjectName}/${ImageName}" # example/my-image
For private repositories you will need to add the necessary permissions to the repository to be able to pull and run Docker images from your account:
ECRRepository:
  Type: AWS::ECR::Repository
  Properties:
    RepositoryName: !Sub "${ProjectName}/${ImageName}"
    ImageScanningConfiguration:
      ScanOnPush: true
    RepositoryPolicyText:
      Version: "2012-10-17"
      Statement:
        - Effect: "Allow"
          Principal:
            AWS: !Sub "arn:aws:iam::${AWS::AccountId}:root"
          Action:
            - "ecr:BatchCheckLayerAvailability"
            - "ecr:BatchGetImage"
            - "ecr:DescribeImages"
            - "ecr:DescribeRepositories"
            - "ecr:GetDownloadUrlForLayer"
ECS Cluster
The simplest way to run an ECS task is to use Fargate, so you don't have to configure EC2 instances or worry about the underlying infrastructure.
EcsCluster:
  Type: 'AWS::ECS::Cluster'
  Properties:
    ClusterName: !Ref AWS::StackName
    CapacityProviders:
      - FARGATE_SPOT
    DefaultCapacityProviderStrategy:
      - CapacityProvider: FARGATE_SPOT
        Weight: 1
Using the FARGATE_SPOT capacity provider also reduces the execution price by running the task on spare (Spot) Fargate capacity.
A fundamental part of running the Docker container is the task definition. It describes how the task will be executed: with which permissions, which environment variables, where to log the output, and how much CPU and memory to allocate to the execution.
Logging
To save the output of the Docker container you will need to declare a CloudWatch log group:
EcsTaskLogGroup:
  Type: AWS::Logs::LogGroup
  Properties:
    LogGroupName: !Sub "/aws/ecs/${AWS::StackName}/my-task"
    RetentionInDays: 30
Remember to set a retention period to avoid accumulating logs forever and increasing CloudWatch costs.
Permissions
We will declare two roles:
- Task execution role: grants the ECS container and Fargate agents permission to make AWS API calls on your behalf, for example to send the output to CloudWatch Logs or to retrieve SSM secrets to inject into environment variables.
- Task role: the role assumed by the containers running in the task; the permissions required depend on which AWS APIs will be called during the container execution.
TaskExecutionRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub "${AWS::StackName}-task-execution-role"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - ecs-tasks.amazonaws.com
          Action:
            - 'sts:AssumeRole'
    Policies:
      - PolicyName: logging
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Action:
                - logs:CreateLogStream
                - logs:PutLogEvents
              Resource: !GetAtt EcsTaskLogGroup.Arn
      - PolicyName: secrets
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Action:
                - "ssm:GetParameters"
              Resource:
                - !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter/${AWS::StackName}/*"
TaskRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: !Sub "${AWS::StackName}-task-role"
    AssumeRolePolicyDocument:
      Version: "2012-10-17"
      Statement:
        - Effect: Allow
          Principal:
            Service:
              - ecs-tasks.amazonaws.com
          Action:
            - 'sts:AssumeRole'
    Policies:
      - PolicyName: !Ref AWS::StackName
        PolicyDocument:
          Version: "2012-10-17"
          Statement:
            - Effect: "Allow"
              Action:
                - states:SendTaskSuccess
                - states:SendTaskFailure
              Resource:
                # this avoids a circular dependency between the step function and the task definition
                - !Sub "arn:aws:states:${AWS::Region}:${AWS::AccountId}:stateMachine:${AWS::StackName}-*"
In our case the Docker container will have to communicate back to the Step Functions state machine using the SendTaskSuccess API (in case of successful execution) or the SendTaskFailure API (in case of errors).
Task definition
Let's create the task definition now: we will specify execution through Fargate and configure its resources and environment variables.
EcsTaskDefinition:
  Type: AWS::ECS::TaskDefinition
  Properties:
    Family: !Ref AWS::StackName
    Cpu: 256 # The number of CPU units used by the task, 256 = 0.25 vCPU
    Memory: 512 # The amount (in MiB) of memory used by the task
    NetworkMode: awsvpc
    ContainerDefinitions:
      - Name: "my-container"
        Image: !Ref DockerImage
        Essential: true
        Secrets:
          - Name: DB_USER
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${DbUsernameParameter}"
          - Name: DB_PASS
            ValueFrom: !Sub "arn:aws:ssm:${AWS::Region}:${AWS::AccountId}:parameter${DbPasswordParameter}"
        LogConfiguration:
          LogDriver: "awslogs"
          Options:
            awslogs-group: !Ref EcsTaskLogGroup
            awslogs-region: !Ref AWS::Region
            awslogs-stream-prefix: "ecs"
    ExecutionRoleArn: !GetAtt TaskExecutionRole.Arn
    TaskRoleArn: !GetAtt TaskRole.Arn
    RequiresCompatibilities:
      - FARGATE
Pay attention to the CPU and memory values: Fargate only supports specific combinations, so check the documentation carefully.
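For reference, here is a small cheat sheet of commonly valid Fargate CPU/memory pairs (CPU units and MiB); always verify against the current AWS documentation:
# Cpu: 256  -> Memory: 512, 1024 or 2048
# Cpu: 512  -> Memory: 1024 up to 4096, in 1024 increments
# Cpu: 1024 -> Memory: 2048 up to 8192, in 1024 increments
# Cpu: 2048 -> Memory: 4096 up to 16384, in 1024 increments
# Cpu: 4096 -> Memory: 8192 up to 30720, in 1024 increments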
A very convenient feature is secrets: values can be loaded from SSM automatically, without hard-coding them inside the template definition.
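As an illustration, the two parameters referenced by the Secrets section could be created with the AWS CLI (the parameter names below are hypothetical; they must match the values passed as the DbUsernameParameter and DbPasswordParameter template parameters):
aws ssm put-parameter --name "/my-stack/db-username" --type SecureString --value "admin"
aws ssm put-parameter --name "/my-stack/db-password" --type SecureString --value "changeme"
Note that if the parameters are encrypted with a customer managed KMS key, the task execution role also needs kms:Decrypt.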
Step Functions task
The StateMachine resource is defined in the "template.yaml", referring to its own definition file via the "DefinitionUri" property. The content of the definition file is parsed, and every "DefinitionSubstitutions" key is replaced with the corresponding value.
The StateMachine needs policies to access other AWS services, just like a Lambda function does. For ECS, the "ecs:RunTask" action is required, and passing the task's roles must also be allowed via "iam:PassRole".
Resources:
  StateMachine:
    Type: AWS::Serverless::StateMachine
    Properties:
      Name: !Ref AWS::StackName
      DefinitionUri: execute.asl.yaml
      DefinitionSubstitutions:
        ClusterArn: !GetAtt EcsCluster.Arn
        TaskDefinitionArn: !Ref EcsTaskDefinition
        VpcSubnet1: !Select [ 0, !Ref VpcSubnets ]
        VpcSubnet2: !Select [ 1, !Ref VpcSubnets ]
        VpcSubnet3: !Select [ 2, !Ref VpcSubnets ]
      Policies:
        - Statement:
            - Effect: "Allow"
              Action:
                - ecs:RunTask
              Resource:
                - !Ref EcsTaskDefinition
            - Effect: Allow
              Action:
                - iam:PassRole
              Resource:
                - !GetAtt TaskRole.Arn
                - !GetAtt TaskExecutionRole.Arn
Step Functions has a direct integration with ECS: to execute a new task, some configuration is required, such as the ECS cluster ARN, the task definition ARN, and at least one VPC subnet ID for the network configuration.
The Step Functions and ECS integration also supports an asynchronous mode using a task token. The state machine reads the token from "$$.Task.Token" and passes it to the task execution through the TASK_TOKEN environment variable.
StartAt: ExecuteTask
States:
  ExecuteTask:
    Type: Task
    Resource: 'arn:aws:states:::ecs:runTask.waitForTaskToken'
    TimeoutSeconds: 300
    Parameters:
      Cluster: '${ClusterArn}'
      TaskDefinition: '${TaskDefinitionArn}'
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets:
            - '${VpcSubnet1}'
            - '${VpcSubnet2}'
            - '${VpcSubnet3}'
          AssignPublicIp: ENABLED
      Overrides:
        ContainerOverrides:
          - Name: my-container # must match the container name in the task definition
            Command:
              - "example"
            Environment:
              - Name: TASK_TOKEN
                Value.$: $$.Task.Token
    Catch:
      - ErrorEquals:
          - ScriptFailed
        Next: Fail
    ResultPath: $.Output
    Next: Success
  Fail:
    Type: Fail
    Cause: An error occurred during script execution.
    Error: ScriptFailed
  Success:
    Type: Succeed
    InputPath: $.Output
Specific errors (like "ScriptFailed") can be raised from the script execution and handled in the Step Functions flow.
Entrypoint and task execution
The entrypoint script should exit on every error; this simplifies the Step Functions task handling.
#!/bin/bash
set -e
Using a "trap" it is possible to catch the script's exit and execute a bash function that checks the exit code. If the script exits with a code different from 0, the state machine's task will fail:
trap 'catch $?' EXIT
catch() {
  if [ "$1" != "0" ]; then
    aws stepfunctions send-task-failure --error "ScriptFailed" --cause "$1" --task-token "$TASK_TOKEN"
  fi
}
If the script reaches the end of the file without errors, the state machine's task will continue its execution:
OUTPUT="{ \"Example\": \"Test\" }"
aws stepfunctions send-task-success --task-output "$OUTPUT" --task-token "$TASK_TOKEN"
A JSON payload can be passed as the task execution result.
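Given the ResultPath and InputPath used in the state machine definition above, the JSON sent with send-task-success becomes the final output of the execution, in this example:
{ "Example": "Test" }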
Docker image build and deploy
As a base image you can use, for example, the official AWS CLI one, so that a large set of features for communicating with the AWS APIs is already available.
Let's define a very simple Dockerfile that takes our script and adds it as the entrypoint of the image:
FROM public.ecr.aws/aws-cli/aws-cli
USER root
RUN yum install -y bash jq curl
RUN yum clean all
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT [ "/entrypoint.sh" ]
WORKDIR /tmp
If we are using a private ECR repository we will first have to log in:
aws --profile <your aws profile> ecr get-login-password | docker login --username AWS --password-stdin "<account id>.dkr.ecr.<region>.amazonaws.com"
Now build and deploy the Docker image:
docker build -t <account id>.dkr.ecr.<region>.amazonaws.com/my-repository:tag .
docker push <account id>.dkr.ecr.<region>.amazonaws.com/my-repository:tag
Now everything is ready to start the state machine and check its execution.
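For example, assuming the stack name used above, an execution can be started and inspected with the AWS CLI:
aws stepfunctions start-execution \
  --state-machine-arn "arn:aws:states:<region>:<account id>:stateMachine:<stack name>" \
  --input '{}'
aws stepfunctions describe-execution --execution-arn "<execution arn from the previous output>"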
For a concrete example, see the repository linked at the bottom of the article, which implements a MySQL dump execution.
Credits: Cloudcraft.
Repository: bitbull-serverless/rds-mysql-dump
