Athena Provisioned Capacity Review

Was it worth the wait?

Jul 20, 2023


Athena released provisioned capacity in April 2023. This has been one of the most asked-for features ever since Athena launched in 2016. In this post, we'll go over what provisioned capacity is, why it matters, and how it holds up in practice.

As a refresher, Athena is a serverless analytics offering by AWS that "provides a simplified, flexible way to analyze petabytes of data where it lives". Under the hood, Athena runs on Presto/Trino.

Athena is purely serverless - it has zero setup and you only get billed when you query. Billing for Athena is based on the data scanned and is charged at $5/TB or $0.005/GB (for the keen-eyed, this is the same price as CloudWatch Logs Insights).
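To make the on-demand pricing concrete, here is a minimal sketch of the arithmetic. It assumes a decimal terabyte (so that $5/TB equals $0.005/GB, as quoted above) and ignores the S3 and Glue request fees; the function name is my own.

```python
PRICE_PER_TB_USD = 5.00      # on-demand rate quoted in this post
BYTES_PER_GB = 1000 ** 3     # assumption: decimal units, so $5/TB == $0.005/GB
BYTES_PER_TB = 1000 ** 4


def on_demand_cost_usd(bytes_scanned: int) -> float:
    """Estimated on-demand charge for a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


if __name__ == "__main__":
    for gb in (1, 200, 1000):
        cost = on_demand_cost_usd(gb * BYTES_PER_GB)
        print(f"{gb:>5} GB scanned -> ${cost:.3f}")
```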

Note that there are additional fees due to GET/PUT calls to S3 and AWS Glue - these costs are generally negligible and don't meaningfully impact the query cost (though they'll frustrate your accountant).

Athena runs on shared infrastructure and has the concept of a queue time - depending on the load on the Athena fleet in a particular region, you might have to wait before your query gets to run. Most of the time, queue time is under 200ms but for the p99 case, this can go past a minute.

Query execution itself is quite fast, especially on the latest Athena engine. On a well-tuned dataset, Athena can scan 7+ GB/s and finish in single-digit seconds for simple queries, making it suitable for real-time analysis.

While there are extensive documentation and controls for tuning both your dataset and your queries for optimal performance, there has not been, until this year, a way of directly controlling your time in the queue.

Athena Provisioned Capacity

Athena provisioned capacity gives customers a way of reserving capacity on the Athena fleet for exclusive use. By reserving sufficient capacity, you can eliminate queuing and also have predictable billing.

When you provision capacity, you request it in Data Processing Units (DPUs). A single DPU is equivalent to 4 vCPUs and 16GB of memory. It is charged at $0.30 per DPU per hour.

You have to specify at least 24 DPUs when making a reservation and you need to book it for at least 8h. At $0.30 per DPU per hour, this comes out to $7.20/hour or $57.60 minimum per use. Once you make a reservation, you are no longer charged for the amount of data scanned but purely for the DPUs that you've reserved.

After 8h of use, DPUs are charged on a per-minute basis. You can cancel a reservation at any time but if you cancel early, you will still be charged for a minimum of 8h of use.
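The billing rules above can be sketched in a few lines. This is my own illustration using only the prices quoted in this post; the function names are hypothetical and the per-minute rounding is an assumption. Note that the $7.20/hour minimum divided by $0.30 per DPU-hour works out to 24 DPUs.

```python
DPU_HOURLY_USD = 0.30         # per DPU per hour, from this post
MIN_BILLED_MINUTES = 8 * 60   # 8h minimum, from this post
ON_DEMAND_USD_PER_TB = 5.00   # on-demand scan price, from this post


def reservation_cost_usd(dpus: int, minutes_active: int) -> float:
    """Charge for a reservation: per-minute billing with an 8h floor."""
    billed_minutes = max(minutes_active, MIN_BILLED_MINUTES)
    return dpus * DPU_HOURLY_USD * billed_minutes / 60


def break_even_tb_per_hour(dpus: int) -> float:
    """TB scanned per hour at which on-demand would cost as much as the reservation."""
    return dpus * DPU_HOURLY_USD / ON_DEMAND_USD_PER_TB


if __name__ == "__main__":
    print(f"24 DPUs, cancelled after 1h: ${reservation_cost_usd(24, 60):.2f}")
    print(f"24 DPUs, kept for 10h:       ${reservation_cost_usd(24, 600):.2f}")
    print(f"break-even scan rate:        {break_even_tb_per_hour(24):.2f} TB/hour")
```

In other words, at the smallest reservation size you would need to scan roughly 1.44TB every hour, for the whole reservation, before provisioned capacity becomes cheaper than on-demand.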

Making a Reservation

Making a reservation is simple and can be done either via the console or the API. Note that a reservation can fail if the fleet does not have enough capacity, and it can take up to 30 minutes to be ready. That said, in testing, reservations were successful in under a minute.

Once a reservation is made, you need to associate it with one or more workgroups. A workgroup is an Athena concept that encapsulates the Athena engine version, capabilities, and limits on querying for users of the given workgroup.

After completing the association, any queries you make via a workgroup that has a capacity reservation will use the dedicated capacity instead of the on-demand capacity of the Athena fleet.
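For the API route, the flow looks roughly like the sketch below. It assumes you pass in a boto3 Athena client (for example `boto3.client("athena")`) and relies on its `create_capacity_reservation`, `get_capacity_reservation`, and `put_capacity_assignment_configuration` calls. The helper name, polling interval, and error handling are my own choices, so treat it as a starting point and check the boto3 docs before relying on it.

```python
import time


def reserve_and_assign(athena, name, target_dpus, workgroups,
                       poll_seconds=15.0, timeout_seconds=1800.0):
    """Create a reservation, wait until it is ACTIVE, then attach workgroups.

    `athena` is expected to be a boto3 Athena client. The default timeout
    mirrors the "up to 30 minutes" figure mentioned above.
    """
    athena.create_capacity_reservation(Name=name, TargetDpus=target_dpus)

    deadline = time.monotonic() + timeout_seconds
    while True:
        resp = athena.get_capacity_reservation(Name=name)
        status = resp["CapacityReservation"]["Status"]
        if status == "ACTIVE":
            break
        if status in ("FAILED", "CANCELLING", "CANCELLED"):
            raise RuntimeError(f"reservation {name!r} ended in status {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"reservation {name!r} still {status} after timeout")
        time.sleep(poll_seconds)

    # Queries in these workgroups now run on the reserved DPUs.
    athena.put_capacity_assignment_configuration(
        CapacityReservationName=name,
        CapacityAssignments=[{"WorkGroupNames": list(workgroups)}],
    )
    return status
```

Usage would be along the lines of `reserve_and_assign(boto3.client("athena"), "my-reservation", 24, ["my-workgroup"])`, where both names are placeholders.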

Using Provisioned Capacity

The good news is that provisioned capacity does eliminate unexpected queue spikes - because the capacity is dedicated, your queries are guaranteed to execute (as long as the reservation has DPUs free).

This comes, however, with the caveat that average queue time gets worse.

Comparing observed queue times in an on-demand workgroup against a provisioned workgroup, the average time spent in the queue is ~158ms for on-demand and ~331ms for provisioned (more than 2x). An extra ~170ms per query is generally a worthwhile trade against the variance of an additional 6000ms during a queue spike. Nevertheless, if you're running lots of short queries, this is something to be aware of.
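If you want to run the same comparison on your own workgroups, one way is to read `QueryQueueTimeInMillis` from the statistics that `get_query_execution` returns for each query. The sketch below is not the exact script behind the numbers above; the helper names are hypothetical and the p99 uses a simple nearest-rank rule.

```python
import statistics


def queue_times_ms(athena, query_execution_ids):
    """Collect queue times (ms) for finished queries via a boto3 Athena client."""
    times = []
    for qid in query_execution_ids:
        resp = athena.get_query_execution(QueryExecutionId=qid)
        stats = resp["QueryExecution"].get("Statistics", {})
        if "QueryQueueTimeInMillis" in stats:
            times.append(stats["QueryQueueTimeInMillis"])
    return times


def summarize(times_ms):
    """Mean, nearest-rank p99, and max of a list of queue times."""
    if not times_ms:
        raise ValueError("no queue times to summarize")
    ordered = sorted(times_ms)
    rank = (99 * len(ordered) + 99) // 100   # integer ceil(0.99 * n)
    return {
        "mean": statistics.fmean(ordered),
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }
```

Looking at the mean alongside the p99 and max is what makes the tradeoff visible: a single multi-second spike barely moves the average but dominates the tail.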

What is not great about provisioned capacity is that your time in the queue can grow unbounded (for as long as you have an active reservation). This happens if you do not have enough DPUs provisioned. Any workgroup that uses provisioned capacity is limited to only using provisioned capacity as long as the reservation is in place. This means if your in-flight queries are consuming all the DPUs, any additional queries will be queued until capacity becomes available.
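To see why the wait can grow without bound, here is a toy model. It is not how Athena actually schedules work: it assumes each query occupies one fixed "slot" of the reservation, that queries arrive at a steady rate, and that they all take the same time to run. All names are made up for illustration.

```python
import heapq


def simulate_waits(num_queries, arrival_gap_s, run_time_s, slots):
    """Queue wait (seconds) per query when only `slots` queries can run at once.

    Query i arrives at i * arrival_gap_s and runs for run_time_s once a slot
    is free. Waiting queries are served first-in, first-out.
    """
    free_at = [0] * slots          # min-heap: when each slot next becomes free
    heapq.heapify(free_at)
    waits = []
    for i in range(num_queries):
        arrival = i * arrival_gap_s
        start = max(arrival, heapq.heappop(free_at))
        heapq.heappush(free_at, start + run_time_s)
        waits.append(start - arrival)
    return waits


if __name__ == "__main__":
    # One query per second, 4s each: 2 slots fall behind, 4 slots keep up.
    print("2 slots:", simulate_waits(5, 1, 4, 2))
    print("4 slots:", simulate_waits(5, 1, 4, 4))
```

With too few slots, each new query waits longer than the one before it, and nothing caps that growth short of cancelling queries or adding capacity.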

AWS Lambda is another serverless offering with a provisioned capacity model. Unlike the Athena model, Lambda's provisioned concurrency allocates pre-initialized environments based on the reservation and falls back to on-demand when it is exhausted. It's an approach that I wish Athena would have followed - falling back to on-demand is a much better user experience than putting a hard stop on queries.

Effectively, provisioned capacity turns Athena from a serverless analytics engine to managed infrastructure that you have to plan capacity around. What makes this worse is the lack of visibility when it comes to mapping DPUs to workloads.

While AWS has a table that provides guidance on how many DPUs to reserve based on the expected number of concurrent queries, this is not enough information to do capacity planning.

Table: Athena provisioned capacity recommendations

This is because DPU requirements vary based on data size, storage format, query construction, and other factors, not just concurrency. As an example, the common-crawl dataset has partitions that exceed 200GB. Running the following select statement requires a minimum of 124 DPUs (this translates to 1984GB of RAM - why is this much RAM needed to scan 200GB of data? I can't tell you why and neither can Athena 🤷‍♂️).

SELECT url FROM ccindex
WHERE crawl = 'CC-MAIN-2023-06'
  AND subset = 'warc'
  AND content_digest = '3WCXFWELSS4SO3MLCODIDHS2KLT4SHPZ'
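For a sense of scale, here is the arithmetic behind that example, using only figures from this post. The on-demand comparison assumes the query scans about 200GB (one partition), which is my simplification.

```python
DPU_MEMORY_GB = 16            # memory per DPU, from this post
DPU_HOURLY_USD = 0.30         # per DPU per hour, from this post
MIN_BILLED_HOURS = 8          # minimum billed duration, from this post
ON_DEMAND_USD_PER_GB = 0.005  # on-demand scan price, from this post

dpus = 124
scanned_gb = 200              # assumption: the query scans roughly one partition

memory_gb = dpus * DPU_MEMORY_GB                 # 1984 GB of RAM
hourly_usd = dpus * DPU_HOURLY_USD               # cost per hour of the reservation
minimum_usd = hourly_usd * MIN_BILLED_HOURS      # smallest possible bill
on_demand_usd = scanned_gb * ON_DEMAND_USD_PER_GB

print(f"memory reserved:    {memory_gb} GB")
print(f"reservation:        ${hourly_usd:.2f}/hour, ${minimum_usd:.2f} minimum")
print(f"on-demand estimate: ${on_demand_usd:.2f} per run")
```

That is roughly $297.60 committed up front versus about a dollar per run on-demand, so a reservation sized for this query only pays off if you run it (or similar queries) a lot.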

Right-sizing how many DPUs you need is a manual guess and check - the current AWS recommendation is to reduce DPUs and see if that causes query failures or increases queue times.

Final Thoughts

I like Athena. It is one of my favorite AWS services and the first thing I reach for when needing to analyze data in S3. The team has done a tremendous job in boosting engine performance and building out a wide feature set. It even works across regions, accounts, and KMS-encrypted objects. The issue of unpredictable queue spikes is one of the few remaining reservations I have about the service.

Provisioned capacity is the first direct solution to this problem. Unfortunately, by using it, you also give up most of the benefits of Athena being serverless. With provisioned capacity, you now have to plan for capacity upfront, you pay even when you're not using the service, and you hit a hard stop if you exceed the set capacity.

I do not doubt that there exists a set of customers that will be well served by this feature - mainly those with consistent workloads that run for eight hours or more and who place a premium on predictable pricing. Everyone else is probably better served by continuing with on-demand.

Bit by Bit
