Reprocessing 25M Records Without Losing My Mind
Sometimes the “simple” tasks are the ones that stretch your creativity the most.
I recently had to reprocess historical data from 2023 — about 25 million records — and push it all back into an S3 bucket. Sounds straightforward, right? Well, here’s how it actually played out.
Step 1: From Excel to CSV (aka Herding 25M IDs)
The data came to me in multiple Excel sheets — not exactly the most processing-friendly format. I wrote a quick JavaScript script to merge all those IDs into a single CSV file.
While I was at it, I added a new column called processed, defaulting to false. This would act as a tracker, so I’d know which records were already handled and which ones were still pending.
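For anyone curious, a minimal version of that merge script looks something like this. It’s a sketch rather than the exact script I ran: the file names and the id column header are assumptions, and it leans on the SheetJS xlsx package.

```javascript
// Sketch: merge IDs from several Excel workbooks into one CSV with a
// "processed" column. Assumes the SheetJS "xlsx" package and an "id"
// column in every sheet; the file names are placeholders.
const fs = require("fs");
const XLSX = require("xlsx");

const files = ["export-2023-part1.xlsx", "export-2023-part2.xlsx"]; // placeholders

const lines = ["id,processed"]; // CSV header

for (const file of files) {
  const workbook = XLSX.readFile(file);
  for (const sheetName of workbook.SheetNames) {
    // Each sheet becomes an array of objects keyed by its header row.
    const rows = XLSX.utils.sheet_to_json(workbook.Sheets[sheetName]);
    for (const row of rows) {
      lines.push(`${row.id},false`); // every record starts out unprocessed
    }
  }
}

fs.writeFileSync("records-2023.csv", lines.join("\n"));
console.log(`Wrote ${lines.length - 1} records`);
```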
Step 2: Importing Into DynamoDB
With the CSV ready, I used DynamoDB’s “Import from S3” feature to bulk-load the data. This is a lifesaver when you’re dealing with millions of rows — no need to write custom loaders or deal with batching yourself.
I defined the table’s primary key as id as part of the import setup, so that part was straightforward.
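You can kick the import off from the console wizard or from code. A rough sketch using the AWS SDK v3 ImportTable API is below; the bucket, prefix, and table names are placeholders, and keep in mind that Import from S3 creates the table as part of the import.

```javascript
// Sketch: start a DynamoDB "Import from S3" job with the AWS SDK v3.
// Bucket, prefix, and table names are placeholders. The import creates
// the table, which is where the "id" primary key gets defined.
const { DynamoDBClient, ImportTableCommand } = require("@aws-sdk/client-dynamodb");

const client = new DynamoDBClient({});

async function startImport() {
  const res = await client.send(new ImportTableCommand({
    S3BucketSource: {
      S3Bucket: "my-import-bucket",            // placeholder
      S3KeyPrefix: "imports/records-2023.csv", // placeholder
    },
    InputFormat: "CSV",
    TableCreationParameters: {
      TableName: "records2023",
      AttributeDefinitions: [{ AttributeName: "id", AttributeType: "S" }],
      KeySchema: [{ AttributeName: "id", KeyType: "HASH" }],
      BillingMode: "PAY_PER_REQUEST",
    },
  }));
  console.log("Import started:", res.ImportTableDescription?.ImportArn);
}

startImport().catch(console.error);
```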
But there was a challenge: how do I efficiently pull out only the unprocessed rows without scanning all 25M records every time?
Step 3: The GSI Trick
Scanning through millions of items on the processed flag would have been painfully slow and costly. The fix?
I created a Global Secondary Index (GSI) with the processed column as the partition key.
Now I could query only the rows where processed = false — fast, efficient, and cost-friendly.
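Adding the index is a one-off call (or a few clicks in the console). Here’s a sketch; the index name is a placeholder, and processed is typed as a String because DynamoDB key attributes can’t be booleans, and the CSV import stores the flag as the string "false" anyway.

```javascript
// Sketch: add a GSI keyed on the "processed" flag to the imported table.
// The index name is a placeholder. Key attributes must be String, Number,
// or Binary, and the CSV import stores the flag as the string "false".
const { DynamoDBClient, UpdateTableCommand } = require("@aws-sdk/client-dynamodb");

const client = new DynamoDBClient({});

async function addProcessedIndex() {
  await client.send(new UpdateTableCommand({
    TableName: "records2023",              // placeholder
    AttributeDefinitions: [{ AttributeName: "processed", AttributeType: "S" }],
    GlobalSecondaryIndexUpdates: [{
      Create: {
        IndexName: "processed-index",      // placeholder
        KeySchema: [{ AttributeName: "processed", KeyType: "HASH" }],
        Projection: { ProjectionType: "ALL" }, // project full items so queries return everything needed
      },
    }],
  }));
}

addProcessedIndex().catch(console.error);
```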
Step 4: EventBridge for Automation
Next came the processing loop. I set up an EventBridge rule on a 30-minute schedule that triggers a Lambda function to:
- Query the GSI to fetch items where processed = false.
- Process them (max 5,000 items per run so Lambda doesn’t hit its 15-minute timeout).
- Push results into the S3 bucket.
- Update those rows in DynamoDB to mark processed = true.
If a run fails, the rows simply stay false and get picked up on the next run. That’s fine: reprocessing isn’t a problem, since writing to the same S3 key just overwrites the existing object instead of creating a duplicate.
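Put together, the Lambda behind that schedule looks roughly like the sketch below. The table, index, and bucket names are placeholders, and processRecord() is a hypothetical stand-in for the actual transformation logic.

```javascript
// Sketch of the scheduled Lambda: query the GSI for unprocessed items,
// reprocess each one, write the result to S3, then flip the tracker flag.
// Table, index, and bucket names are placeholders; processRecord() is a
// stand-in for the real transformation.
const { DynamoDBClient } = require("@aws-sdk/client-dynamodb");
const {
  DynamoDBDocumentClient,
  QueryCommand,
  UpdateCommand,
} = require("@aws-sdk/lib-dynamodb");
const { S3Client, PutObjectCommand } = require("@aws-sdk/client-s3");

const ddb = DynamoDBDocumentClient.from(new DynamoDBClient({}));
const s3 = new S3Client({});

const TABLE = "records2023";          // placeholder
const INDEX = "processed-index";      // placeholder
const BUCKET = "reprocessed-output";  // placeholder
const BATCH_SIZE = 5000;              // keeps each run well inside the 15-minute Lambda limit

async function processRecord(item) {
  // Placeholder for the actual reprocessing logic.
  return item;
}

exports.handler = async () => {
  // 1. Query the GSI for a batch of unprocessed items, paginating as needed.
  const items = [];
  let lastKey;
  do {
    const page = await ddb.send(new QueryCommand({
      TableName: TABLE,
      IndexName: INDEX,
      KeyConditionExpression: "#p = :p",
      ExpressionAttributeNames: { "#p": "processed" },
      ExpressionAttributeValues: { ":p": "false" }, // the flag is stored as a string
      Limit: BATCH_SIZE - items.length,
      ExclusiveStartKey: lastKey,
    }));
    items.push(...(page.Items ?? []));
    lastKey = page.LastEvaluatedKey;
  } while (lastKey && items.length < BATCH_SIZE);

  for (const item of items) {
    // 2. Reprocess the record.
    const result = await processRecord(item);

    // 3. Write the result to S3; a fixed key per record means retries overwrite, not duplicate.
    await s3.send(new PutObjectCommand({
      Bucket: BUCKET,
      Key: `reprocessed/2023/${item.id}.json`,
      Body: JSON.stringify(result),
    }));

    // 4. Mark the record as handled so the next run skips it.
    await ddb.send(new UpdateCommand({
      TableName: TABLE,
      Key: { id: item.id },
      UpdateExpression: "SET #p = :done",
      ExpressionAttributeNames: { "#p": "processed" },
      ExpressionAttributeValues: { ":done": "true" },
    }));
  }

  return { processed: items.length };
};
```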
Final Thoughts
It wasn’t glamorous, but it worked:
- Historical data reprocessed ✅
- Ingested into S3 ✅
- DynamoDB cleaned and tracked ✅
- Fully automated ✅
- Safe batching with 5k per Lambda execution ✅
Over to You
This setup worked for me, but I’m curious:
Do you know a better way to handle 25M+ record reprocessing without the DynamoDB + GSI + EventBridge + Lambda setup?
If yes, I’d love to hear it. Share your ideas with me!