Lessons Learned from 1TB DynamoDB Import

At camelcamelcamel we have been tracking price points of Amazon products for the past 15 years. All of this data was saved in one big MySQL table that is now over 1TB in size. We decided to move it to DynamoDB to save on costs, get better performance, and reduce maintenance complexity. A composite key with the timestamp as the sort key fits our use case perfectly, and all of our queries now consistently finish in less than 20ms. One of the slowest queries we had on MySQL was fetching a specific product by date. In DynamoDB it is consistently fast with the new composite key of product id as the partition key and timestamp as the sort key, and those queries no longer affect the entire system.
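
For illustration, here is a minimal sketch of that kind of query with the Go SDK (aws-sdk-go-v2); the table and attribute names (prices, product_id, ts) are made up for this example and are not our actual schema:

    // Hypothetical sketch: fetch one product's price points for a date range.
    package main

    import (
        "context"
        "fmt"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/dynamodb"
        "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.TODO())
        if err != nil {
            log.Fatal(err)
        }
        client := dynamodb.NewFromConfig(cfg)

        // Partition key: product id. Sort key: timestamp. The BETWEEN condition
        // on the sort key is what makes "product by date range" cheap compared
        // to the old MySQL query.
        out, err := client.Query(context.TODO(), &dynamodb.QueryInput{
            TableName:              aws.String("prices"),
            KeyConditionExpression: aws.String("product_id = :p AND ts BETWEEN :from AND :to"),
            ExpressionAttributeValues: map[string]types.AttributeValue{
                ":p":    &types.AttributeValueMemberS{Value: "B000EXAMPLE"},
                ":from": &types.AttributeValueMemberN{Value: "1640995200"},
                ":to":   &types.AttributeValueMemberN{Value: "1672531200"},
            },
        })
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println("items:", len(out.Items))
    }

Because the date condition is expressed on the sort key itself, DynamoDB only reads the matching range instead of scanning the product's whole history.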

The move from MySQL to DynamoDB took a couple of months, and we learned a few important lessons along the way. Below is a summary of those lessons.

Writing batches doesn’t save money

DynamoDB charges for writes in Write Capacity Units (WCUs). The documentation states:

Each API call to write data to your table is a write request. For items up to 1 KB in size, one WCU can perform one standard write request per second. Items larger than 1 KB require additional WCUs.

https://aws.amazon.com/dynamodb/pricing/on-demand/

We understood this to mean that using BatchWriteItem, which is a single API call, would only use 1 WCU for each 1KB we write. As our items are very small, averaging 34 bytes each, a full batch of the maximum 25 items would only be around 850 bytes. That should cost us only 1 WCU. But digging deeper shows that's not the case.

BatchWriteItem—Writes up to 25 items to one or more tables. DynamoDB processes each item in the batch as an individual PutItem or DeleteItem request (updates are not supported). So DynamoDB first rounds up the size of each item to the next 1 KB boundary, and then calculates the total size. The result is not necessarily the same as the total size of all the items. For example, if BatchWriteItem writes a 500-byte item and a 3.5 KB item, DynamoDB calculates the size as 5 KB (1 KB + 4 KB), not 4 KB (500 bytes + 3.5 KB).

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ProvisionedThroughput.html#ItemSizeCalculations.Writes

Each item in the batch is written separately, as if it were its own API call. So even though a full batch of our items is technically less than 1KB, we still get charged 25 WCUs for it.
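
As a rough sketch of our understanding of the billing (not an official formula), each item is rounded up to 1 KB on its own before the WCUs are summed:

    // Our understanding of how a BatchWriteItem call is billed: every item is
    // rounded up to the next 1 KB on its own, and only then are the WCUs summed.
    package main

    import "fmt"

    func wcuForBatch(itemSizes []int) int {
        total := 0
        for _, size := range itemSizes {
            total += (size + 1023) / 1024 // each item rounded up to a full 1 KB
        }
        return total
    }

    func main() {
        // A full batch of 25 of our ~34-byte items is only ~850 bytes of data,
        // but it is still billed as 25 WCUs, not 1.
        batch := make([]int, 25)
        for i := range batch {
            batch[i] = 34
        }
        fmt.Println(wcuForBatch(batch)) // 25
    }

For items as small as ours, BatchWriteItem is therefore a latency and request-count optimization, not a cost one.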

Storage has hidden fees

The table size reported by DynamoDB is not the size used for billing. Both the API and the console page display the total size of all items in the table, but billing also includes indexes and a hidden overhead of at least 100 bytes per item, depending on your settings. For us that meant paying 300% more than we expected for storage. Our items are 34 bytes on average, but DynamoDB bills us for (34 + 100) * 2 bytes per item due to the overhead and one GSI. We expected to pay for GSI storage, but the overhead was a surprise.

Even this calculation is not perfect. It seems like the GSI might use a little less than 100 bytes per item. But overall, (avg_item_size + 100) * (1 + gsi_count) is the closest we got to a number matching our bills.

For storage billing purposes, each item includes a per-item storage overhead that depends on the features you have enabled.

  • All items in DynamoDB require 100 bytes of storage overhead for indexing.
  • Some DynamoDB features (global tables, transactions, change data capture for Kinesis Data Streams with DynamoDB) require additional storage overhead to account for system-created attributes resulting from enabling those features. For example, global tables requires an additional 48 bytes of storage overhead.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/CapacityUnitCalculations.html
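
Putting the two together, here is the hedged estimate we ended up using; the flat 100 bytes per item and per GSI copy is our approximation that matched our bills, not a published billing formula:

    // Rough storage estimate: each item, and each GSI copy of it, carries
    // roughly 100 bytes of overhead on top of its own size.
    package main

    import "fmt"

    func estimatedBillableBytes(avgItemSize, itemCount, gsiCount int64) int64 {
        const overheadPerItem = 100 // approximate per-item overhead
        return (avgItemSize + overheadPerItem) * itemCount * (1 + gsiCount)
    }

    func main() {
        // Our case: 34-byte items with one GSI -> (34 + 100) * 2 = 268 billable
        // bytes per item, roughly 4x the 68 bytes we had budgeted for when we
        // only accounted for the GSI copy.
        fmt.Println(estimatedBillableBytes(34, 1, 1)) // 268
    }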

Worse yet, the AWS Pricing Calculator asks you for the item size but doesn’t actually use it to calculate this overhead. If you just went by the calculator, you might be in for a surprise on your DynamoDB storage bill.

Our account manager helped us open an internal feature request to make the storage billable size visible and to make the price calculator consider the overhead.

Financial commitment really pays

Provisioned capacity is about 6-7 times cheaper than on-demand capacity. Reserved capacity is about 5 times cheaper than provisioned capacity. That means reserved capacity is up to 35 times cheaper than on-demand. AWS apparently really wants you to commit, and offers massive discounts for it.

For example, assuming consistent usage and 100 WCU:

  • On-demand capacity costs around $0.45 per hour (100*60*60/1000000*1.25)
  • Provisioned capacity costs $0.065 per hour
  • Reserved capacity costs $0.0128 per hour

This assumes that you can predict usage well enough to choose the right provisioned or reserved capacity, which can be hard. Auto-scaling does help, but it's not perfect.
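
For reference, the numbers in the list above come from this kind of arithmetic; the per-WCU rates below are simply the hourly figures above divided by 100 and will vary by region:

    // The arithmetic behind the list above: 100 writes/sec for an hour at $1.25
    // per million on-demand write request units, versus the hourly WCU rates
    // implied by the provisioned and reserved figures.
    package main

    import "fmt"

    func main() {
        const wcu = 100.0

        onDemand := wcu * 60 * 60 / 1_000_000 * 1.25 // $0.45 per hour
        provisioned := wcu * 0.00065                 // $0.065 per hour
        reserved := wcu * 0.000128                   // $0.0128 per hour (effective)

        fmt.Printf("on-demand $%.4f, provisioned $%.4f, reserved $%.4f per hour\n",
            onDemand, provisioned, reserved)
    }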

Import from S3 is very cheap

The new import option from S3 is much cheaper than even reserved capacity, as long as your items are not very big. It costs $0.15 per GB. We can fit around 32 million items in 1GB; writing them with on-demand capacity would cost about $40, and based on the math in the previous section, about $2 with reserved capacity. The gap narrows with bigger items, but our items are tiny. On-demand writes for 1,000,000 items of 1KB totaling 1GB would be $1.25, provisioned would cost about $0.17 per GB based on our math above, and reserved capacity would then cost about $0.03 per GB.
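
Roughly, the per-GB comparison against paying for writes looks like this (a sketch using the same $1.25 per million on-demand write price as above; actual prices vary by region):

    // Per-GB cost of getting data in: the flat S3 import fee versus paying
    // for on-demand writes.
    package main

    import "fmt"

    func main() {
        const s3ImportPerGB = 0.15 // flat import fee per GB

        writeCostPerGB := func(itemSizeBytes int, pricePerMillionWrites float64) float64 {
            itemsPerGB := float64(1<<30) / float64(itemSizeBytes)
            return itemsPerGB / 1_000_000 * pricePerMillionWrites
        }

        // Our tiny 34-byte items: ~32 million per GB.
        fmt.Printf("34B items: import $%.2f vs on-demand $%.2f per GB\n",
            s3ImportPerGB, writeCostPerGB(34, 1.25))
        // 1KB items: ~1 million per GB.
        fmt.Printf("1KB items: import $%.2f vs on-demand $%.2f per GB\n",
            s3ImportPerGB, writeCostPerGB(1024, 1.25))
    }

With 34-byte items the flat fee wins by a wide margin; with 1KB items, reserved capacity at about $0.03 per GB would actually come out cheaper than the import fee.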

After our initial import we realized we needed a GSI. AWS support explained that adding a GSI to an existing table uses an internal process called backfilling. It uses internal read capacity that doesn't consume RCU, but it does consume index WCU. For us that meant roughly doubling our initial import cost. With the new import feature, it made much more sense to export the table to S3 and then import it back again with the GSI defined. It was literally 40 times cheaper.

Importing was also faster than manually writing all those items to a new table. However, it did have a weird overhead issue where every import took about 12 hours no matter what the data size was: 5GB and 500GB tables both took about 12 hours to import. AWS support mentioned it's probably due to our small item size.

Multi-table design has its benefits

When reading about DynamoDB, you will see single-table design mentioned a lot. But there are benefits to splitting data into multiple tables too. In our case, we wanted to make maintenance easier, so we split our table by locale. This way we can easily give the US table more capacity, as it's the most heavily used locale. Backups, restores, and removals are also easier to reason about this way for our use case.

But the big one for me was being able to import our existing data up to 8 times faster. Each table is limited to 80,000 WCU, so with a table per locale our combined limit is now 640,000 WCU. Both of these numbers require a service quota increase request, which was quickly approved. Due to our current data patterns (and some poor JOIN choices I've made), we were only able to import at about 250,000 items per second. Overall, using multiple tables meant the import was done in 3 days instead of around 10.

Auto-scaling can take a while

It felt implied that scaling is instantaneous, but we have seen cases where scaling took up to an hour. This mostly happened on big jumps, like going from 10,000 to 40,000 WCU, and it wasn't very consistent. AWS support confirmed it wasn't a bug or a temporary issue, just something that can happen when you make big jumps.

As a bonus, scaling down is limited to roughly once an hour.

There is a default quota on the number of provisioned capacity decreases you can perform on your DynamoDB table per day. A day is defined according to Universal Time Coordinated (UTC). On a given day, you can start by performing up to four decreases within one hour as long as you have not performed any other decreases yet during that day. Subsequently, you can perform one additional decrease per hour as long as there was no decreases in the preceding hour. This effectively brings the maximum number of decreases in a day to 27 times (4 decreases in the first hour, and 1 decrease for each of the subsequent 1-hour windows in a day).

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ServiceQuotas.html#limits-decreasing-provisioned-throughput

LSI can only be created at the time the table is created

We only considered secondary indexes after our initial import and didn't realize that a Local Secondary Index cannot be added to an existing table. LSIs have to be created along with the table. We had to settle for a Global Secondary Index.

To create one or more local secondary indexes on a table, use the LocalSecondaryIndexes parameter of the CreateTable operation. Local secondary indexes on a table are created when the table is created. When you delete a table, any local secondary indexes on that table are also deleted.

https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/LSI.html#LSI.Creating

The new import from S3 feature doesn’t support LSI yet.

Lambda is not required

A lot of articles about moving data quickly into DynamoDB talk about using other AWS services like Lambda, Kinesis, S3 and SQS. But keeping it simple works too. I ended up writing an importer in Go and was able to saturate 8 tables with one EC2 instance. It had a few goroutines reading from MySQL, a queue, and a few goroutines writing to DynamoDB. Having everything in one place made the process easier to debug and reason about. It was also easy to add a rate limiter and avoid throttling by DynamoDB. It was definitely easier than reasoning about Lambda's reserved concurrency and the number of items it may or may not pick from the queue to write to DynamoDB.
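
The real importer is more involved, but the shape was roughly this. A simplified, hypothetical sketch follows: the table and column names are made up, and error handling plus retries of UnprocessedItems are omitted.

    // Simplified sketch of the importer: a reader streams rows from MySQL into
    // a channel, several writers drain it into DynamoDB in batches of 25, and a
    // shared rate limiter keeps the process under the table's provisioned WCU.
    package main

    import (
        "context"
        "database/sql"
        "log"
        "strconv"
        "sync"

        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/dynamodb"
        "github.com/aws/aws-sdk-go-v2/service/dynamodb/types"
        _ "github.com/go-sql-driver/mysql"
        "golang.org/x/time/rate"
    )

    type pricePoint struct {
        ProductID string
        Timestamp int64
        Price     int64
    }

    func main() {
        ctx := context.Background()

        db, err := sql.Open("mysql", "user:pass@tcp(127.0.0.1:3306)/ccc")
        if err != nil {
            log.Fatal(err)
        }
        cfg, err := config.LoadDefaultConfig(ctx)
        if err != nil {
            log.Fatal(err)
        }
        ddb := dynamodb.NewFromConfig(cfg)

        queue := make(chan pricePoint, 100_000)              // in-memory queue between reader and writers
        limiter := rate.NewLimiter(rate.Limit(70_000), 1000) // stay below the table's WCU limit

        // Reader: stream rows out of MySQL into the queue.
        go func() {
            defer close(queue)
            rows, err := db.QueryContext(ctx, "SELECT product_id, ts, price FROM prices")
            if err != nil {
                log.Fatal(err)
            }
            defer rows.Close()
            for rows.Next() {
                var p pricePoint
                if err := rows.Scan(&p.ProductID, &p.Timestamp, &p.Price); err != nil {
                    log.Fatal(err)
                }
                queue <- p
            }
        }()

        // Writers: batch 25 items at a time and write them to DynamoDB.
        var wg sync.WaitGroup
        for i := 0; i < 32; i++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                batch := make([]types.WriteRequest, 0, 25)
                flush := func() {
                    if len(batch) == 0 {
                        return
                    }
                    _ = limiter.WaitN(ctx, len(batch)) // one WCU per small item
                    _, err := ddb.BatchWriteItem(ctx, &dynamodb.BatchWriteItemInput{
                        RequestItems: map[string][]types.WriteRequest{"prices_us": batch},
                    })
                    if err != nil {
                        log.Println("batch write failed:", err)
                    }
                    batch = batch[:0]
                }
                for p := range queue {
                    batch = append(batch, types.WriteRequest{PutRequest: &types.PutRequest{
                        Item: map[string]types.AttributeValue{
                            "product_id": &types.AttributeValueMemberS{Value: p.ProductID},
                            "ts":         &types.AttributeValueMemberN{Value: strconv.FormatInt(p.Timestamp, 10)},
                            "price":      &types.AttributeValueMemberN{Value: strconv.FormatInt(p.Price, 10)},
                        },
                    }})
                    if len(batch) == 25 {
                        flush()
                    }
                }
                flush()
            }()
        }
        wg.Wait()
    }

The shared rate limiter was the important part for us: sizing WaitN to the batch let one process sit just under a table's provisioned WCU without tripping throttling.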

For the instance I chose a c6gn.12xlarge for its 75 Gbit/s networking, but in the end we didn't actually need more than 25 Gbit/s. Our data points were very small, so mostly we were waiting for DynamoDB. The large amount of RAM helped keep big in-memory queues to saturate all tables, despite the source data not being evenly distributed.

If I had to do it all over again, I would only use this code to export the data from MySQL to S3 and then import it using DynamoDB S3 Import. Sadly, this new feature came out a few days after we finished our initial, expensive import.

VPC Gateway Endpoint vs NAT

A VPC Gateway Endpoint for DynamoDB is free and can save a lot of money on NAT Gateway traffic. Our NAT Gateway spending doubled after deploying DynamoDB. Our RDS instance was sitting in a private subnet, so all of our instances could connect to it directly. DynamoDB, however, meant our private-subnet instances had to go through the NAT Gateway. Adding an endpoint for DynamoDB was just a single line of code in CDK, as sketched below. It reduced our bill back to normal, as it allowed our instances to once again reach their data without routing through NAT.
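
For what it's worth, here is a minimal sketch of what that looks like with CDK for Go; the construct IDs and the VPC definition are illustrative rather than our actual stack:

    // Minimal CDK-for-Go sketch: add free gateway endpoints for DynamoDB and S3
    // so private-subnet traffic to them no longer goes through the NAT Gateway.
    package main

    import (
        "github.com/aws/aws-cdk-go/awscdk/v2"
        "github.com/aws/aws-cdk-go/awscdk/v2/awsec2"
        "github.com/aws/jsii-runtime-go"
    )

    func main() {
        app := awscdk.NewApp(nil)
        stack := awscdk.NewStack(app, jsii.String("DataStack"), nil)

        vpc := awsec2.NewVpc(stack, jsii.String("Vpc"), &awsec2.VpcProps{})

        // The one line that matters: route DynamoDB traffic through a gateway
        // endpoint instead of the NAT Gateway. S3 gets the same treatment.
        vpc.AddGatewayEndpoint(jsii.String("DynamoDb"), &awsec2.GatewayVpcEndpointOptions{
            Service: awsec2.GatewayVpcEndpointAwsService_DYNAMODB(),
        })
        vpc.AddGatewayEndpoint(jsii.String("S3"), &awsec2.GatewayVpcEndpointOptions{
            Service: awsec2.GatewayVpcEndpointAwsService_S3(),
        })

        app.Synth(nil)
    }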

Both the DynamoDB and S3 endpoints are free and very easy to set up. I am definitely going to enable them from now on, regardless of which services might end up being used.
