Long running jobs in AWS Lambda

Using AWS Lambda to trigger long running jobs.

2021-08-24

AWS Lambda has proven to be a powerful piece of infrastructure, capable of taking parts of our apps that were tied to a long-running app server and turning them into scalable, event-driven, service-oriented apps. Unfortunately, it has an execution time limit of only 15 minutes, which makes it poorly suited for time-intensive workloads. There is, however, a straightforward serverless approach to running these kinds of jobs alongside Lambda using AWS Elastic Container Service (ECS).

ECS

In many configurations, you would use ECS to deploy and scale a long-running dockerized app server or a job server that pulls from SQS. For our use case, we are going to create a container that exits after completion.

ECS has several parts that we will need to set up before we can trigger a job.

Docker Container

We will need to create a Docker container around the job we want to run. Since the job will be specific to each use case, we will use a very generic Dockerfile for the sake of this article.

Our job will be encapsulated in a shell script, hello-world.sh.

#!/bin/sh
echo $HELLO_WORLD

And a Dockerfile:

FROM alpine

COPY hello-world.sh /usr/src/hello-world.sh

# make sure the script is executable inside the image
RUN chmod +x /usr/src/hello-world.sh

ENTRYPOINT /usr/src/hello-world.sh

To pass data into our job, we will use environment variables. That is why hello-world.sh echoes the $HELLO_WORLD environment variable: it is a minimal demonstration of how data gets into our jobs.

Once we have our Docker image ready, we will want to build it and push it to AWS Elastic Container Registry (ECR): https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html
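
As a rough sketch, assuming the us-east-1 region and the account ID and repository name used in the task definition later in this article, the build-and-push flow looks something like this:

# Authenticate the Docker CLI against your ECR registry
aws ecr get-login-password --region us-east-1 | \
    docker login --username AWS --password-stdin 123456786.dkr.ecr.us-east-1.amazonaws.com

# Build the image and tag it for ECR
docker build -t hello-world .
docker tag hello-world:latest 123456786.dkr.ecr.us-east-1.amazonaws.com/hello-world:1.0.0

# Push it up
docker push 123456786.dkr.ecr.us-east-1.amazonaws.com/hello-world:1.0.0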

Task Definition

In ECS, a task definition provides the configuration ECS needs to know which Docker image to run and how to run it. Inside our task definition, we will provide information such as how much memory and how many vCPUs to use, which container to run, and which environment variables to set for every run.

At this stage, we will also need to define two IAM roles: one for ECS to pull our Docker image and execute our job, and another to grant our job permissions to any other AWS resources it requires.

Continuing with our hello world example, the following CloudFormation config will set up our IAM roles and the task definition for our job.

HelloWorldExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
        RoleName: HelloWorldExecutionRole
        AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                    Service: 'ecs-tasks.amazonaws.com'
                Action: 'sts:AssumeRole'
        ManagedPolicyArns:
          - arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

HelloWorldTaskRole:
    Type: 'AWS::IAM::Role'
    Properties:
        RoleName: HelloWorldTaskRole
        AssumeRolePolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Principal:
                    Service: 'ecs-tasks.amazonaws.com'
                Action: 'sts:AssumeRole'
        Policies:
          - PolicyName: IAMPolicyHelloWorld
            PolicyDocument:
                Version: '2012-10-17'
                Statement:
                  - Effect: Allow
                    Action:
                      - "s3:PutObject"
                    Resource:
                      - arn:aws:s3:::hello-world-bucket/*

HelloWorldTaskDefinition:
    Type: 'AWS::ECS::TaskDefinition'
    DependsOn: HelloWorldExecutionRole
    Properties:
        ContainerDefinitions:
          - Image: 123456786.dkr.ecr.us-east-1.amazonaws.com/hello-world:1.0.0
            Name: hello-world
        Cpu: 256
        Memory: 512
        RequiresCompatibilities: 
          - FARGATE
        NetworkMode: awsvpc
        ExecutionRoleArn: {"Ref": "HelloWorldExecutionRole"}
        TaskRoleArn: {"Ref": "HelloWorldTaskRole"}

There are a few important pieces to note here.

Both our execution role and our task role grant sts:AssumeRole to ecs-tasks.amazonaws.com. This allows ECS to assume the role with the permissions necessary to pull containers and execute ECS tasks, and it allows the task itself to assume the role carrying the policy we provided, granting it access to the resources we specified. In this case, we gave it permission to put objects into an S3 bucket.

Our task definition points to our image URL in ECR and declares the FARGATE compatibility, which is required for this pattern to work. Because we are using Fargate, we are also required to use NetworkMode: awsvpc.

Other configuration options that are useful here, but omitted for brevity, are default environment variables and logging. Both can be found in the CloudFormation documentation, and a sketch of both is shown after the link below.

https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecs-taskdefinition.html
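
As a rough sketch, both options live on the container definition. The log group name and values below are assumptions for illustration:

ContainerDefinitions:
  - Image: 123456786.dkr.ecr.us-east-1.amazonaws.com/hello-world:1.0.0
    Name: hello-world
    # default environment variables, applied to every run unless overridden
    Environment:
      - Name: HELLO_WORLD
        Value: 'hello world'
    # send the container's stdout/stderr to CloudWatch Logs
    # (assumes the log group /ecs/hello-world already exists)
    LogConfiguration:
        LogDriver: awslogs
        Options:
            awslogs-group: /ecs/hello-world
            awslogs-region: us-east-1
            awslogs-stream-prefix: hello-world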

Service

In ECS, a service is used to control starting, stopping, and monitoring tasks. We are not going to be using the service we create to do any of this, but it is a required step for setting up ECS.

HelloWorldService:
    Type: 'AWS::ECS::Service'
    DependsOn: HelloWorldTaskDefinition
    Properties:
        Cluster: {"Ref": "HelloWorldCluster"}
        LaunchType: FARGATE
        ServiceName: hello-world
        DesiredCount: 0
        NetworkConfiguration:
            AwsvpcConfiguration:
                Subnets: # your vpc subnet ids
        TaskDefinition: {"Ref": "HelloWorldTaskDefinition"}

You will notice that our DesiredCount is zero. Because we don't want this service to try to keep any number of tasks running indefinitely, we set it to zero. You will also need to fill in the Subnets field with your VPC's subnet IDs.

Cluster

An ECS cluster is where our jobs will run. A single cluster can contain many services and many task definitions, but in this example we will have just the one cluster, service, and task definition. You will notice that the service CloudFormation template above references the cluster defined in the template below.

HelloWorldCluster:
    Type: 'AWS::ECS::Cluster'
    Properties:
        ClusterName: hello-world

With this, we have all the infrastructure we need to run our job.

Lambda

Now we will want to write a Lambda function that, when invoked, will run our job. We can do so using the aws-sdk.

const AWS = require('aws-sdk');
const ecs = new AWS.ECS();

// fill these in for your environment
const CLUSTER_NAME = ''; // the cluster name or ARN
const TASK_DEFINITION_NAME = ''; // the task definition name or ARN
const VPC_SUBNETS = []; // your VPC subnet ids

const handler = async (event) => {
    const { tasks } = await ecs.runTask({
        cluster: CLUSTER_NAME,
        taskDefinition: TASK_DEFINITION_NAME,
        networkConfiguration: {
            awsvpcConfiguration: {
                subnets: VPC_SUBNETS,
                // assignPublicIp: 'ENABLED' may be needed for tasks in
                // public subnets to reach ECR
            },
        },
        launchType: 'FARGATE',
        overrides: {
            containerOverrides: [{
                name: 'hello-world',
                // environment overrides are a list of name/value pairs
                environment: [{
                    name: 'HELLO_WORLD',
                    value: event.helloWorld,
                }],
            }],
        },
    }).promise();
    return tasks;
};

module.exports = { handler };

This function uses the ECS class on the aws-sdk to run the task. To successfully run the task, we need to know the cluster's name or ARN, the task definition's name or ARN, and the subnets of your account's VPC. We are also able to pass environment variables into the task via the overrides parameter. These overrides are specified per container, so we must include the name of the container they apply to. Invoking the function is then just a matter of passing in the data the job needs, as sketched below.
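
For example, a quick test invocation through the AWS CLI might look like this; the function name hello-world-runner is an assumption for illustration (on AWS CLI v2 you may also need --cli-binary-format raw-in-base64-out):

aws lambda invoke \
    --function-name hello-world-runner \
    --payload '{"helloWorld": "hello from Lambda"}' \
    response.json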

You will need to make sure this function's IAM role has permission to start tasks in ECS via ecs:RunTask, as well as iam:PassRole on the execution and task roles, since RunTask passes those roles to the task it starts.
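
A minimal sketch of that policy, in the same CloudFormation style as the rest of this article and assuming the roles and task definition defined above, might look like:

HelloWorldLambdaPolicy:
    Type: 'AWS::IAM::Policy'
    Properties:
        PolicyName: HelloWorldLambdaPolicy
        Roles:
          - # your Lambda function's execution role
        PolicyDocument:
            Version: '2012-10-17'
            Statement:
              # Ref on a task definition resolves to its ARN
              - Effect: Allow
                Action:
                  - 'ecs:RunTask'
                Resource:
                  - {"Ref": "HelloWorldTaskDefinition"}
              # RunTask passes the execution and task roles to the started task
              - Effect: Allow
                Action:
                  - 'iam:PassRole'
                Resource:
                  - {"Fn::GetAtt": ["HelloWorldExecutionRole", "Arn"]}
                  - {"Fn::GetAtt": ["HelloWorldTaskRole", "Arn"]}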

Profit

This approach has proven to be highly scalable and effective at running long jobs in a serverless application. I've used this setup to perform workloads ranging from report building to electric grid power flow simulations. And because a single cluster can host many services and task definitions, you can deploy many kinds of jobs to one cluster and run all of their tasks there.