Orchestrating Lambdas with Step Functions

Step Functions let you wire Lambdas into workflows with retries, branching, and parallelism, but you do not always need them. Here is an honest guide to when a state machine earns its keep, then a real parallel pipeline built with the modern JSONata syntax, deployed with SAM and tested locally.

June 3, 2026·10 min read

Across part one and part two we built a containerized thumbnailer and gave it an HTTP front door. One function behind one endpoint goes a long way. This finale is about the moment it stops being enough, when the work becomes several steps with retries, branching, and things running in parallel, and whether AWS Step Functions are the right answer. Sometimes they are not, and I will be honest about that before we build one.

When do you actually need a state machine?

You can get remarkably far without orchestration. A single Lambda, a Lambda calling another with the SDK, or API Gateway routing to a few functions covers most applications. Step Functions earn their place when you have a genuine multi-step workflow and you would rather not hand-write the glue: explicit steps, automatic retries with backoff on each one, parallel branches, conditional paths, waits, and a visual record of every execution.

Here is the tell. If you catch yourself writing a Lambda whose entire job is to call three other Lambdas in order, catch each one's errors, retry the flaky ones, and stitch the results together, that orchestration code is exactly what a state machine replaces, and the state machine version is easier to see and to change. But if you have one function, or two calls in a row that rarely fail, you do not need this. Reaching for Step Functions there just adds a moving part.
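
To make that tell concrete, here is a rough sketch of the glue code I mean. It is not from the series: the three step functions are hypothetical local stand-ins for Lambda invocations, and the retry helper is a deliberately simple one.

```python
import time
from typing import Any, Callable, List


def call_with_retry(step: Callable[[Any], Any], payload: Any,
                    attempts: int = 3, base_delay: float = 0.0) -> Any:
    """Call one step, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return step(payload)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts, so the caller has to deal with it
            time.sleep(base_delay * (2 ** attempt))


def orchestrate(event: dict, steps: List[Callable[[Any], Any]]) -> Any:
    """Run the steps in order, feeding each result into the next one."""
    result: Any = event
    for step in steps:
        result = call_with_retry(step, result)
    return result


# Hypothetical stand-ins for three separate Lambda invocations.
def validate(event: dict) -> dict:
    if "image_url" not in event:
        raise ValueError("missing image_url")
    return event


calls = {"resize": 0}


def flaky_resize(event: dict) -> dict:
    calls["resize"] += 1
    if calls["resize"] < 2:
        raise RuntimeError("transient failure")  # fails once, then succeeds
    return {**event, "width": 120}


def summarize(event: dict) -> dict:
    return {"generated": 1, "sizes": [event["width"]]}


result = orchestrate({"image_url": "https://example.com/cat.jpg"},
                     [validate, flaky_resize, summarize])
print(result)
```

Every line of call_with_retry and orchestrate is plumbing rather than business logic, and that plumbing is what a state machine lets you declare instead of write.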

Our thumbnailer is a good candidate for one specific reason: we want several sizes generated in parallel, each with its own retry, and a tidy summary at the end. That is orchestration, so let us wire it up.

Standard or Express?

Step Functions come in two workflow types. Standard runs for up to a year, guarantees exactly-once execution, keeps a full visual history, and bills per state transition. Express runs for up to five minutes, is at-least-once when started asynchronously (synchronous Express executions are at-most-once), and bills per request plus duration, which makes it cheap at very high volume (think streaming or event ingestion).

For an image pipeline that runs occasionally and where you want to inspect each run, Standard is the right call, and the execution history alone is worth it the first time something misbehaves.

The pipeline

The shape we want: take an input image, confirm it is valid, fan out to generate small, medium, and large versions at the same time, then summarize. Each size is a call to the thumbnailer Lambda from part one, and they run concurrently.

One thing to design around up front: Step Functions cap the data passed between states at 256 KB. Shipping three full base64 PNGs through the workflow would blow past that fast, so a real pipeline writes the images to S3 inside the function and passes back keys or URLs, not the bytes. To keep the example readable we will just pass back each thumbnail's dimensions.
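
A quick way to feel that limit is to measure a payload before it ever reaches the workflow. The sketch below is an approximation, assuming the size that matters is the UTF-8 length of the serialized JSON. The 300 KB image and the S3 key are made up for illustration.

```python
import base64
import json

MAX_STATE_PAYLOAD_BYTES = 256 * 1024  # the 256 KB cap mentioned above


def payload_size(payload: dict) -> int:
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_in_state(payload: dict) -> bool:
    """True if the serialized payload stays within the cap."""
    return payload_size(payload) <= MAX_STATE_PAYLOAD_BYTES


# A made-up 300 KB image. Base64 inflates it by roughly a third.
fake_image = bytes(300 * 1024)
inline = {"width": 1024,
          "image_b64": base64.b64encode(fake_image).decode("ascii")}

# Passing a reference instead (hypothetical S3 key) stays tiny.
by_reference = {"width": 1024, "key": "thumbs/cat-1024.png"}

print(fits_in_state(inline), fits_in_state(by_reference))
```

A single inlined image is already over the cap here, while the by-reference payload is a few dozen bytes no matter how large the image gets.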

The state machine, in JSONata

Workflows are written in the Amazon States Language (ASL), which is JSON. Since late 2024 you can opt into JSONata as the query language, and it is a big simplification over the old JSONPath fields. You set it once at the top, then transform data with {% ... %} expressions and a reserved $states variable. Save this as statemachine/pipeline.asl.json:

JSON
{
  "Comment": "Generate thumbnails in parallel",
  "QueryLanguage": "JSONata",
  "StartAt": "CheckInput",
  "States": {
    "CheckInput": {
      "Type": "Choice",
      "Choices": [
        { "Condition": "{% $exists($states.input.image_url) %}", "Next": "GenerateSizes" }
      ],
      "Default": "MissingImage"
    },
    "GenerateSizes": {
      "Type": "Map",
      "Items": "{% [120, 480, 1024].{'width': $, 'image_url': $states.input.image_url} %}",
      "ItemProcessor": {
        "ProcessorConfig": { "Mode": "INLINE" },
        "StartAt": "ResizeOne",
        "States": {
          "ResizeOne": {
            "Type": "Task",
            "Resource": "arn:aws:states:::lambda:invoke",
            "Arguments": {
              "FunctionName": "${ThumbnailerArn}",
              "Payload": "{% $states.input %}"
            },
            "Retry": [
              {
                "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
                "IntervalSeconds": 1,
                "MaxAttempts": 3,
                "BackoffRate": 2,
                "JitterStrategy": "FULL"
              }
            ],
            "Output": "{% {'width': $states.result.Payload.width, 'height': $states.result.Payload.height} %}",
            "End": true
          }
        }
      },
      "Next": "Summarize"
    },
    "Summarize": {
      "Type": "Pass",
      "Output": "{% {'generated': $count($states.input), 'sizes': $states.input} %}",
      "End": true
    },
    "MissingImage": {
      "Type": "Fail",
      "Error": "MissingImage",
      "Cause": "Provide image_url in the input"
    }
  }
}

There is a lot in there, so let me unpack the pieces.

Map, Choice, and the JSONata bits

The Choice state branches on a condition. In JSONata mode, that is a single Condition expression that returns a boolean: {% $exists($states.input.image_url) %} checks the input actually has an image before we do any work, and falls through to MissingImage otherwise.

GenerateSizes is a Map state, which runs its inner steps once per item, concurrently. The clever part is Items: the JSONata expression [120, 480, 1024].{'width': $, 'image_url': $states.input.image_url} turns the list of widths into a list of payloads, one per size, each carrying the original image URL. Each parallel iteration then runs ResizeOne.
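
If the JSONata mapping syntax is new to you, it may help to see the same transformation as a Python list comprehension. This is only an equivalent for reading purposes. Nothing like it runs in the workflow, and build_items is a name I made up.

```python
def build_items(state_input: dict, widths=(120, 480, 1024)) -> list:
    """Python equivalent of the Items expression on GenerateSizes.

    In the JSONata version, `$` is the current width and
    `$states.input.image_url` is read from the state's input.
    """
    return [
        {"width": width, "image_url": state_input["image_url"]}
        for width in widths
    ]


items = build_items({"image_url": "https://example.com/cat.jpg"})
for item in items:
    print(item)
```

Each dictionary in that list becomes the input of one Map iteration, which is what ResizeOne sees as $states.input.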

ResizeOne is the Task that calls Lambda. A few things to notice:

  • "Resource": "arn:aws:states:::lambda:invoke" is the optimized Lambda integration. The function's own return value comes back nested under $states.result.Payload, which is why the Output reaches in for Payload.width.
  • Arguments is what you send (the JSONPath world called this Parameters). Output is what the state passes on. Those two fields replace the five fiddly JSONPath fields, and there is no more .$ suffix to remember.
  • ${ThumbnailerArn} is not JSONata. It is a SAM substitution that gets replaced with the real function ARN at deploy time, which we set up next. JSONata runs at execution time inside {% %}; the ${...} token is resolved earlier, when the template is deployed. They sit side by side happily.

Finally, Summarize is a Pass state that just shapes the output: $count($states.input) counts the array the Map produced.
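
The Payload nesting is the part that tends to trip people up, so here is a small Python model of the data flow from ResizeOne through Summarize. Treat it as a sketch under assumptions: the handler is a hypothetical stand-in for the part-one thumbnailer (it pretends every source image is 4:3 rather than opening a file), and fake_invoke only mimics the fields of the invoke result that the workflow reads.

```python
def handler(event, context):
    """Hypothetical thumbnailer: takes one Map item, returns dimensions."""
    width = event["width"]
    # Assumption for illustration: a 4:3 source, so no image is loaded.
    height = width * 3 // 4
    return {"width": width, "height": height}


def fake_invoke(event):
    """Mimic the lambda:invoke result: the return value sits under Payload."""
    return {"StatusCode": 200, "Payload": handler(event, None)}


def resize_one_output(result):
    """Python equivalent of the Output expression on ResizeOne."""
    return {"width": result["Payload"]["width"],
            "height": result["Payload"]["height"]}


def summarize(map_results):
    """Python equivalent of the Output expression on Summarize."""
    return {"generated": len(map_results), "sizes": map_results}


items = [{"width": w, "image_url": "https://example.com/cat.jpg"}
         for w in (120, 480, 1024)]
sizes = [resize_one_output(fake_invoke(item)) for item in items]
print(summarize(sizes))
```

Skipping resize_one_output would leave the status code and the rest of the invoke envelope in every Map result, which is exactly the clutter the Output field trims away.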

Retries and catches

Notice the Retry block on ResizeOne. Each Map iteration retries Lambda throttling and transient service errors on its own, backing off exponentially with jitter, before giving up. That per-step resilience is one of the strongest reasons to use a state machine instead of hand-rolled orchestration.
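
To see what those numbers mean in practice, this sketch computes the wait before each retry for IntervalSeconds 1, BackoffRate 2, and MaxAttempts 3. It is my own model of the schedule, not service code, and it assumes FULL jitter picks a random delay between zero and the computed backoff interval.

```python
import random


def retry_delays(interval_seconds=1, max_attempts=3, backoff_rate=2.0,
                 jitter=False, rng=None):
    """Return the wait in seconds before each retry.

    Without jitter the delays grow geometrically. With jitter, each delay
    is drawn uniformly between 0 and that geometric ceiling (an assumption
    about how the FULL strategy behaves).
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        ceiling = interval_seconds * backoff_rate ** attempt
        delays.append(rng.uniform(0, ceiling) if jitter else ceiling)
    return delays


print(retry_delays())             # ceilings: [1.0, 2.0, 4.0]
print(retry_delays(jitter=True))  # random values under those ceilings
```

MaxAttempts counts retries, so a task that keeps failing runs four times in total, and the worst-case wait across the three retries is seven seconds. Jitter matters here because the three Map iterations would otherwise retry a throttled function at the same instants.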

For failures you want to handle rather than retry, add a Catch that routes to another state:

JSON
"Catch": [
  { "ErrorEquals": ["States.ALL"], "Next": "ReportFailure" }
]

Retry and Catch work together: the state retries while it can, and if it still fails, the catch sends execution down a recovery path of your choosing instead of failing the whole workflow.

Deploy it with SAM

We can extend the same template.yaml from part two. The state machine references the external ASL file, injects the function ARN through DefinitionSubstitutions, and is granted permission to invoke the function with a SAM policy template:

YAML
  ThumbnailPipeline:
    Type: AWS::Serverless::StateMachine
    Properties:
      Type: STANDARD
      DefinitionUri: statemachine/pipeline.asl.json
      DefinitionSubstitutions:
        ThumbnailerArn: !GetAtt Thumbnailer.Arn
      Policies:
        - LambdaInvokePolicy:
            FunctionName: !Ref Thumbnailer

DefinitionSubstitutions is where ${ThumbnailerArn} from the ASL gets its value, here the real ARN of the Thumbnailer function defined earlier in the template. LambdaInvokePolicy is a SAM shorthand that attaches the right IAM permission to the state machine's role so it can actually call the function. sam build && sam deploy ships it alongside the Lambda.
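
The ordering of those two template languages is easier to trust once you see it. The sketch below is an illustration of the idea, not how SAM or CloudFormation actually implement substitution: it replaces ${Name} tokens in the definition text and leaves the JSONata expressions alone. The ARN value is invented.

```python
import json
import re


def substitute(definition_text: str, values: dict) -> str:
    """Replace ${Name} tokens that have a value; leave everything else."""
    def replace_token(match):
        name = match.group(1)
        return values.get(name, match.group(0))
    return re.sub(r"\$\{(\w+)\}", replace_token, definition_text)


# A fragment of the ResizeOne Arguments from the definition above.
snippet = ('{"FunctionName": "${ThumbnailerArn}", '
           '"Payload": "{% $states.input %}"}')

# Invented ARN, standing in for what !GetAtt Thumbnailer.Arn resolves to.
fake_arn = "arn:aws:lambda:us-east-1:111122223333:function:thumbnailer"

resolved = json.loads(substitute(snippet, {"ThumbnailerArn": fake_arn}))
print(resolved["FunctionName"])
print(resolved["Payload"])
```

After substitution the definition is ordinary ASL with a literal ARN in it, and the {% %} expression is still there, waiting to be evaluated at execution time.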

Test it locally

You can exercise a state machine without deploying the whole thing. AWS publishes Step Functions Local as a Docker image:

BASH
docker run -p 8083:8083 amazon/aws-stepfunctions-local

You then point the CLI at it with --endpoint-url http://localhost:8083 to create and start executions, and you can mock the Lambda calls so nothing touches AWS. For checking a single state's logic, the newer TestState API is even handier, since it runs one state with an input you provide:

BASH
aws stepfunctions test-state \
  --definition file://one-state.json \
  --role-arn arn:aws:iam::111122223333:role/StepFunctionsRole \
  --input '{"image_url": "https://example.com/cat.jpg"}'

And the individual functions are still just Lambdas, so sam local invoke from part two tests each one in isolation. Between unit-testing states, mocking the orchestration locally, and invoking the real functions, you can build a fair amount of confidence before deploying.
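
If you would rather drive executions from Python than from the CLI, the same idea can be sketched with a small helper. This is an assumption-laden example: start_pipeline and FakeStepFunctions are names I made up, the state machine ARN is invented, and the fake client stands in for a real one so the snippet runs without boto3 or any endpoint.

```python
import json


def start_pipeline(client, state_machine_arn: str, image_url: str) -> str:
    """Start one execution with the input the workflow expects."""
    response = client.start_execution(
        stateMachineArn=state_machine_arn,
        input=json.dumps({"image_url": image_url}),
    )
    return response["executionArn"]


class FakeStepFunctions:
    """Test double that records calls instead of contacting an endpoint."""

    def __init__(self):
        self.calls = []

    def start_execution(self, **kwargs):
        self.calls.append(kwargs)
        arn = kwargs["stateMachineArn"].replace(":stateMachine:", ":execution:")
        return {"executionArn": arn + ":run-1"}


# With boto3 installed, a real client for the local container would be:
#   boto3.client("stepfunctions", endpoint_url="http://localhost:8083")
client = FakeStepFunctions()
execution_arn = start_pipeline(
    client,
    "arn:aws:states:us-east-1:111122223333:stateMachine:ThumbnailPipeline",
    "https://example.com/cat.jpg",
)
print(execution_arn)
```

Passing the client in, rather than creating it inside the function, is what lets the same helper target the fake, the local container, or the deployed service.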

That closes the series. We started with a single Python function in a Docker image, gave it a public HTTP endpoint and called it from Laravel, and finished by orchestrating it into a parallel pipeline with retries and branching. The throughline is that AWS gives you a ladder of options (direct invocation, function URLs, API Gateway, and Step Functions), and the skill is matching the rung to the job rather than always reaching for the top one. Start with the simplest thing that works, and climb only when the problem makes you.