Modern data pipelines are often a tangled mess of scripts, cron jobs, and monolithic applications that are brittle and difficult to scale. The dream is an automated, event-driven system that just works. The serverless paradigm, championed by services like AWS Lambda, gets us part of the way there. But what about the most complex part: the data transformation logic itself?
This is where you can hit a wall. Embedding complex rules for data mapping, cleansing, and format conversion directly into your Lambda functions creates tight coupling, making updates a chore and reuse nearly impossible.
In this guide, you'll learn how to build a powerful, event-driven, and highly scalable serverless data processing pipeline. We'll connect the dots between AWS S3, AWS Lambda, and transform.do to create a system where your infrastructure is lean and your transformation logic is clean, version-controlled, and managed as a simple service.
Before we dive into the code, let's look at the architecture we're building. The flow is simple: a raw JSON file lands in a source S3 bucket, the upload event triggers a Lambda function, the Lambda sends the file's contents to a transform.do agent over its API, and the transformed result is written to a destination S3 bucket.
This approach separates your concerns cleanly. AWS handles the event-driven infrastructure, while transform.do handles the specialized task of data transformation.
The heart of our pipeline is the transformation logic. By defining this on transform.do first, we create a stable API endpoint that our Lambda can call. This is the essence of "ETL as Code."
// Example transform.do agent configuration
const transformations = {
  targetFormat: "csv", // We want to convert from JSON to CSV
  rules: [
    // Rule 1: Rename fields to be more database-friendly
    { rename: { "user_id": "ID", "first_name": "FirstName", "last_name": "LastName", "join_date": "MemberSince" } },
    // Rule 2: Convert the timestamp to a simple date format
    { convert: { "MemberSince": "date('YYYY-MM-DD')" } },
    // Rule 3: Add a new, derived field for a full name
    { addField: { "FullName": "{{FirstName}} {{LastName}}" } },
    // Rule 4: Ensure only specific columns are in the final output
    { select: ["ID", "FullName", "MemberSince"] }
  ]
};
Once you save this agent, transform.do provides you with a unique and secure API endpoint. Copy this endpoint URL and your API key—you'll need them for your Lambda function.
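Before wiring the agent into Lambda, it's worth sanity-checking it from your local machine. Here is a minimal sketch (run with Node.js after npm install axios); the endpoint URL and API key are placeholders you replace with your own, and the request and response shapes match what our Lambda will use below:
// check-agent.mjs — quick local sanity check of the transform.do agent
import axios from "axios";

const TRANSFORM_DO_ENDPOINT = "YOUR_ENDPOINT_URL"; // paste the endpoint transform.do gave you
const TRANSFORM_DO_API_KEY = "YOUR_API_KEY";       // paste your API key

// A single sample record in the raw input shape
const sample = [
  { user_id: 101, first_name: "Jane", last_name: "Doe", join_date: "2023-01-15T10:00:00Z" }
];

const response = await axios.post(
  TRANSFORM_DO_ENDPOINT,
  { source: sample }, // the payload carries the source data
  { headers: { Authorization: `Bearer ${TRANSFORM_DO_API_KEY}`, "Content-Type": "application/json" } }
);

// The transformed output is returned in the 'data' field
console.log(response.data.data);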
We need two S3 buckets to act as the start and end points of our pipeline.
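You can create them in the S3 console, or script it. Here is a minimal sketch with the AWS SDK for JavaScript v3; the bucket names are examples that match the IAM policy in the next step, and they must be globally unique (outside us-east-1 you would also pass a LocationConstraint):
// create-buckets.mjs — optional: create the source and destination buckets from a script
import { S3Client, CreateBucketCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

// Example names; replace with your own globally unique bucket names.
await s3.send(new CreateBucketCommand({ Bucket: "my-app-raw-data-source" }));
await s3.send(new CreateBucketCommand({ Bucket: "my-app-processed-data-dest" }));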
Your Lambda function needs permission to interact with other AWS services. We'll create a specific role for it.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::my-app-raw-data-source/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::my-app-processed-data-dest/*"
    }
  ]
}
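The permissions above cover S3 access. The role also needs a trust policy that lets the Lambda service assume it (plus the AWS-managed AWSLambdaBasicExecutionRole policy if you want CloudWatch logging). A standard Lambda trust policy looks like this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }
  ]
}
If you create the role through the Lambda console's "Create a new role" option, this trust policy is added for you automatically.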
This is our orchestrator. The code is surprisingly simple because all the heavy lifting is offloaded to transform.do.
Now, replace the default index.mjs code with the following:
import { S3Client, GetObjectCommand, PutObjectCommand } from "@aws-sdk/client-s3";
import axios from "axios";

// Initialize S3 client
const s3Client = new S3Client({});

// Retrieve secrets from environment variables
const TRANSFORM_DO_ENDPOINT = process.env.TRANSFORM_DO_ENDPOINT;
const TRANSFORM_DO_API_KEY = process.env.TRANSFORM_DO_API_KEY;
const DESTINATION_BUCKET = process.env.DESTINATION_BUCKET;

// Helper to stream data from S3
const streamToString = (stream) =>
  new Promise((resolve, reject) => {
    const chunks = [];
    stream.on("data", (chunk) => chunks.push(chunk));
    stream.on("error", reject);
    stream.on("end", () => resolve(Buffer.concat(chunks).toString("utf8")));
  });

export const handler = async (event) => {
  try {
    // 1. Get the bucket and key from the trigger event
    const sourceBucket = event.Records[0].s3.bucket.name;
    const sourceKey = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    console.log(`New file detected: ${sourceKey} in bucket ${sourceBucket}`);

    // 2. Read the raw file from the source S3 bucket
    const getObjectParams = { Bucket: sourceBucket, Key: sourceKey };
    const { Body: sourceStream } = await s3Client.send(new GetObjectCommand(getObjectParams));
    const sourceDataString = await streamToString(sourceStream);
    const sourceData = JSON.parse(sourceDataString);

    // 3. Delegate the transformation to transform.do
    console.log("Sending data to transform.do agent...");
    const response = await axios.post(
      TRANSFORM_DO_ENDPOINT,
      { source: sourceData }, // The payload must contain the source data
      {
        headers: {
          'Authorization': `Bearer ${TRANSFORM_DO_API_KEY}`,
          'Content-Type': 'application/json'
        }
      }
    );

    // transform.do returns the transformed data in the 'data' field
    const transformedData = response.data.data;
    console.log("Transformation successful. Received transformed data.");

    // 4. Write the new, transformed data to the destination bucket
    // The output will be a CSV string as defined in our agent
    const destinationKey = sourceKey.replace('.json', '.csv');
    const putObjectParams = {
      Bucket: DESTINATION_BUCKET,
      Key: destinationKey,
      Body: transformedData,
      ContentType: 'text/csv'
    };
    await s3Client.send(new PutObjectCommand(putObjectParams));

    console.log(`Successfully processed and saved to ${destinationKey} in ${DESTINATION_BUCKET}`);
    return { statusCode: 200, body: 'Transformation complete.' };
  } catch (error) {
    console.error("Error during transformation pipeline:", error);
    // For production, add more robust error handling/notifications
    throw error;
  }
};
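Once you've installed axios locally (covered next), you can smoke-test the handler before deploying by invoking it with a hand-built event. A minimal sketch, assuming local AWS credentials, the three environment variables set in your shell, and a users.json object already uploaded to the source bucket; the event contains only the fields the handler actually reads:
// local-test.mjs — invoke the handler locally with a hand-built S3 event
import { handler } from "./index.mjs";

const fakeEvent = {
  Records: [
    {
      s3: {
        bucket: { name: "my-app-raw-data-source" },
        object: { key: "users.json" }
      }
    }
  ]
};

await handler(fakeEvent);
console.log("Local invocation finished.");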
Before you can use this, you need to add axios. Create a package.json file on your local machine with { "dependencies": { "axios": "^1.6.0" } }, run npm install, then zip the resulting node_modules folder and your index.mjs file together. Upload this .zip file under the Code source section of your Lambda.
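As an aside: if you deploy on the nodejs18.x runtime or newer, you could skip the axios dependency (and the zip upload) entirely and use the runtime's built-in fetch. A sketch of the equivalent call for step 3 of the handler:
// Alternative to axios on nodejs18.x+ runtimes: the global fetch API
const response = await fetch(TRANSFORM_DO_ENDPOINT, {
  method: "POST",
  headers: {
    "Authorization": `Bearer ${TRANSFORM_DO_API_KEY}`,
    "Content-Type": "application/json"
  },
  body: JSON.stringify({ source: sourceData })
});

if (!response.ok) {
  throw new Error(`transform.do request failed with status ${response.status}`);
}

// As before, the transformed output is returned in the 'data' field
const transformedData = (await response.json()).data;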
The final steps are to connect everything. In the Lambda console, add an S3 trigger on your source bucket for object-created events (optionally filtered to the .json suffix), and set the three environment variables the code reads: TRANSFORM_DO_ENDPOINT, TRANSFORM_DO_API_KEY, and DESTINATION_BUCKET. If you'd rather script the trigger, see the sketch below.
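Here is a minimal sketch of scripting the trigger with the SDK. The function ARN is a placeholder for your own, and you'd also need to grant S3 permission to invoke the function (for example via aws lambda add-permission); the console does both steps for you automatically.
// configure-trigger.mjs — optional: wire the source bucket to the Lambda from a script
import { S3Client, PutBucketNotificationConfigurationCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

await s3.send(new PutBucketNotificationConfigurationCommand({
  Bucket: "my-app-raw-data-source",
  NotificationConfiguration: {
    LambdaFunctionConfigurations: [
      {
        // Placeholder ARN — replace with your function's ARN.
        LambdaFunctionArn: "arn:aws:lambda:us-east-1:123456789012:function:YOUR_FUNCTION_NAME",
        Events: ["s3:ObjectCreated:*"],
        // Only react to JSON uploads.
        Filter: { Key: { FilterRules: [{ Name: "suffix", Value: ".json" }] } }
      }
    ]
  }
}));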
You're all set! To test the entire flow, save the following records as users.json and upload the file to your source bucket:
[
{ "user_id": 101, "first_name": "Jane", "last_name": "Doe", "join_date": "2023-01-15T10:00:00Z" },
{ "user_id": 102, "first_name": "John", "last_name": "Smith", "join_date": "2023-02-20T12:30:00Z" }
]
Within a few seconds, a users.csv file should appear in your destination bucket. Download and open it; its content should be:
ID,FullName,MemberSince
101,"Jane Doe",2023-01-15
102,"John Smith",2023-02-20
It worked! You've successfully built a fully automated, serverless ETL pipeline.
By integrating transform.do with AWS Lambda, you've built more than just a pipeline; you've created a maintainable and scalable data processing foundation.
Ready to stop wrestling with ETL scripts and start building intelligent data workflows? Sign up for transform.do and turn your most complex data challenges into simple, powerful API calls.