How I Queried Millions of Rows in Under a Second with AWS Lambda and DuckDB

Many data engineers face the challenge of processing massive datasets quickly. In my recent project, I used AWS Lambda and DuckDB to query over 5.7 million records in under a second. Here’s how I did it.

Anthony Rodrigues

Have you ever faced the challenge of scanning millions of rows of data and needing the results in a few seconds? Recently, I worked on a project that required precisely that. This exciting project pushed me to find an efficient, scalable, and cost-effective solution built around tools like AWS Lambda and DuckDB.

The project involved scanning data generated weekly, following the WORM (Write Once Read Many) approach. This data consisted of everything received for a given entity type over the past few months, grouped by month and week. Storing historical data efficiently was crucial due to its growing size.

The key requirements were:

  • Speed: Return results in just a few seconds.
  • Scalability: Handle increasing data volumes without performance loss.
  • Cost-Effectiveness: Keep operational costs low.
  • Accessibility: Make data available via a secure API.
  • Ease of Maintenance: Guarantee a solution that's easy to set up, validate, maintain, and monitor.

At first glance, it might seem straightforward. However, things get more complex when you factor in the requirements above.

Choosing the Right Architecture

The first decision was whether to manage the services ourselves or go serverless.


Given our needs for scalability and ease of setup, serverless architecture was the clear winner. With serverless, there’s no need to provision or manage underlying computing resources — you create the service, start using it, and pay as you go.

With the serverless architecture chosen, the next step was selecting the right AWS services to build our solution.

API Management with API Gateway

To make the data accessible via an API, we chose AWS API Gateway. It’s a fantastic tool for creating, publishing, maintaining, monitoring, and securing APIs at any scale.

API Gateway supports RESTful, WebSocket, and HTTP APIs and integrates seamlessly with other AWS services. It checked all our boxes:

  • Scalable
  • Cost-Effective
  • Easy to Set Up

Processing Data with AWS Lambda

We selected AWS Lambda to process the data. It’s a serverless computing service that allows you to upload and run your code without provisioning or managing servers.

Lambda scales automatically with demand and integrates well with AWS services like API Gateway and CloudWatch.

Plus, you only pay for what you use, making it perfect for a cost-effective solution.
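
To make this concrete, here is a minimal sketch of what a Python handler behind an API Gateway proxy integration could look like. The event and response shapes follow the standard proxy format, but the query parameters and the run_query helper are placeholders for illustration, not the actual code from this project.

```python
import json


def run_query(entity, month):
    # Placeholder: the real lookup runs a DuckDB query over Parquet files in S3
    # (sketched later in this post).
    return []


def handler(event, context):
    # API Gateway's proxy integration passes query string parameters in the event.
    params = event.get("queryStringParameters") or {}
    entity = params.get("entity")  # illustrative parameter names
    month = params.get("month")

    if not entity:
        return {
            "statusCode": 400,
            "body": json.dumps({"error": "missing 'entity' parameter"}),
        }

    rows = run_query(entity, month)

    # API Gateway expects a statusCode and a string body in the proxy response.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"count": len(rows), "rows": rows}),
    }
```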

Overcoming Lambda’s Limitations with Docker and ECR

While Lambda is powerful, it has a limitation: the code package size (including dependencies) is limited to 250MB unzipped. This can be a pain point when your application has many dependencies. We used Docker and Amazon Elastic Container Registry (ECR) to overcome this.

  • Docker allows us to package the Lambda function as a Docker image, increasing the size limit to 10GB.
  • ECR is a fully managed container registry that stores our Docker images, making it easy to pull them from Lambda and keep track of versions.

Flexible Storage with S3

The next challenge was storage. We had three main serverless options:

  1. Amazon Aurora
  2. DynamoDB
  3. Amazon S3

We chose Amazon S3 with Parquet files because:

  • Cost-Effective Storage: Ideal for storing large amounts of historical data.
  • Scalability: No limit on the total amount of data stored.
  • Flexibility: Supports data partitioning and various file formats.
  • Low Latency: Provides quick access to stored data.
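
As a rough sketch, writing the weekly data as Parquet partitioned by month and week could look like this with pyarrow. The bucket, prefix, and column names below are just illustrative:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative records: the "month" and "week" columns drive the partition layout.
table = pa.table({
    "entity_id": [1, 2, 3],
    "payload": ["a", "b", "c"],
    "month": ["2024-05", "2024-05", "2024-06"],
    "week": [18, 19, 22],
})

# Writes files under s3://my-data-bucket/entities/month=.../week=.../
# (hypothetical bucket and prefix; AWS credentials come from the environment).
pq.write_to_dataset(
    table,
    root_path="s3://my-data-bucket/entities",
    partition_cols=["month", "week"],
)
```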

But how would we query data stored in S3 efficiently? Enter DuckDB.

Introducing DuckDB

DuckDB is a high-performance analytical database system that allows you to run SQL queries directly on data files like Parquet, CSV, and JSON without loading them entirely into memory. This makes it ideal for running queries in a limited memory environment like Lambda.
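
As a rough sketch, querying the partitioned Parquet files directly from S3 with DuckDB's Python API could look like this. The bucket, columns, and filter are illustrative, and the credential setup simply forwards the Lambda role's temporary credentials (exposed as environment variables) to DuckDB's httpfs extension:

```python
import os
import duckdb

# In-memory database: nothing needs to be persisted inside the Lambda container.
con = duckdb.connect(database=":memory:")

# httpfs lets DuckDB read Parquet files straight from S3.
con.execute("INSTALL httpfs; LOAD httpfs;")

# Lambda exposes the function role's temporary credentials as environment variables.
con.execute(f"SET s3_region='{os.environ.get('AWS_REGION', 'eu-west-1')}';")
con.execute(f"SET s3_access_key_id='{os.environ['AWS_ACCESS_KEY_ID']}';")
con.execute(f"SET s3_secret_access_key='{os.environ['AWS_SECRET_ACCESS_KEY']}';")
con.execute(f"SET s3_session_token='{os.environ['AWS_SESSION_TOKEN']}';")

# Bucket, prefix, and column names are illustrative; hive_partitioning picks up
# the month=/week= directories from the layout sketched above.
rows = con.execute(
    """
    SELECT month, week, COUNT(*) AS records
    FROM read_parquet('s3://my-data-bucket/entities/*/*/*.parquet', hive_partitioning=true)
    WHERE entity_id = ?
    GROUP BY month, week
    """,
    [42],
).fetchall()
```

Because DuckDB pushes filters and column projections down to the Parquet files, it only reads the data a query actually needs, which is what keeps scans over millions of rows fast within Lambda's memory limits.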

Keeping Lambda Warm with EventBridge Scheduler

Downloading large files from S3 can introduce latency. To mitigate this, we used AWS EventBridge Scheduler to keep our Lambda function “warm.” By invoking the function regularly, we kept the execution environment active, reducing cold start times and keeping the large files cached in memory.
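
A common way to wire this up (shown here as a rough sketch, not our exact code) is to have the scheduled event carry a small marker payload and to keep expensive objects, like the DuckDB connection, in module scope so warm invocations reuse them:

```python
import duckdb

# Module-level state survives across invocations of a warm execution environment,
# so the connection (and anything DuckDB has cached) is reused instead of rebuilt.
_con = None


def get_connection():
    global _con
    if _con is None:
        _con = duckdb.connect(database=":memory:")
        _con.execute("INSTALL httpfs; LOAD httpfs;")
    return _con


def handler(event, context):
    # The EventBridge Scheduler rule sends a marker payload (the key name here
    # is made up); real API Gateway requests never include it.
    if isinstance(event, dict) and event.get("warmup"):
        get_connection()  # touch the connection so the environment stays primed
        return {"statusCode": 200, "body": "warm"}

    con = get_connection()
    # ... run the real query here, as in the previous snippet ...
    return {"statusCode": 200, "body": "ok"}
```

EventBridge Scheduler lets you define the JSON payload it sends to the Lambda target, so the warmup key is just a convention; the important part is returning early before doing any real work.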

Deploying with Terraform

To deploy this architecture, we used Terraform, which allows you to define and provision your infrastructure as code. This ensures that environments are consistent, version-controlled, and easy to replicate.

Putting all of these services together, the final architecture is a simple, easy-to-set-up, cost-effective service that:

  • Reads a file with over 5.7 million records.
  • Responds in less than one second.
  • Costs less than $5 a month.
  • Meets all our requirements: secure, scalable, and cost-effective.

That's it. I hope you enjoyed it.

In upcoming posts, I’ll explore each component in more detail and show how we structured this architecture to perform well.
