Using Athena Partitioning and Gzip to save money while increasing performance

Amazon Athena is a new serverless query service which is great for accessing the data you have stored in S3. Athena is running Presto under the hood. By parallelizing the query, Athena provides a huge amount of performance over extremely large datasets stored in S3.

Athena charges by TB scanned. At $5 a TB, it simply is one of the best deals out there when it comes to databases. You also have to pay for S3 access charges, but given how far S3 prices have dropped since it was first launched, odds are you won’t even have to worry about it.

The best part I love about Athena is the simplicity of getting data in. Want more data to analyze? Put it in your S3 bucket and boom, you can query it. No more database imports that take forever and fail halfway through.

So what is partitioning and how do you save money with it? Partitioning works by allowing Athena to ignore entire subsets of your data when querying for a result. Simply declare the partition in your table definition, or use a Glue Crawler to automatically define it for you.

Partitions *must* be in the format of /columnName=Value (yes, include the ‘=’ !)

Sample bucket/paths:

Data_bucket/year=2017/<data files here>

Data_bucket/year=2018/<data files here>

If you run a query that has “… where year = 2018“, Athena will know not to even bother looking in the /year=2017 path. This can result in huge performance gains and cost reductions.

Enter Gzip. Remember that Athena charges by TB scanned. If your files are zipped, Athena uses the compressed file size when determining how much data it has scanned. This means if you zip a file and it reduces the file size by 75%, you have just reduced your Athena (and S3) charges by 75% as well.

For example:

Datafrom2017.csv is 100mb

Datafrom2018.csv is 100mb

If I run the query “… where year=2018”, Athena will only charge for scanning 100mb of data.

I can take this a step further and gzip “Datafrom2018.csv” down to 20mb. Now, if I run the query “ …. Where year = 2018”, Athena will only charge me for scanning 20mb of data.

By using some basic partitioning and gzip, we’ve reduced our query costs and the amount of data Athena has to scan by 90%!