
TL;DR: A data provider struggling with massive COUNT(DISTINCT) queries over 6.5TB of data found relief by implementing HyperLogLog (HLL). Working together, we achieved:
- Dramatic cost reduction: Slashed BigQuery slot usage from 2,000 on-demand slots to just 135 slots, eliminating the need for purchasing additional reserved slots
- Lightning-fast performance: Query times plummeted from hours to just 7 seconds
- Efficient scaling: Reduced data scanning from 6.5TB to 16.25GB per query while maintaining accuracy
- Sustainable growth: New HLL-based approach continues to scale efficiently as data volumes increase
Understanding unique customer behavior is crucial for companies in e-commerce, digital marketing, and data services. One customer I worked with, a large data provider, tracks unique user interactions over 30-day windows to power their marketing campaigns and product recommendations.
However, as their data grew to billions of records, the COUNT(DISTINCT) queries they ran to measure unique user behavior began taking hours to complete, growing ever more costly, and occasionally failing altogether.
What started as a straightforward analytics task transformed into a significant technical challenge, threatening to impact their decision-making capabilities.
In this post, I’ll break down:
- Why the COUNT(DISTINCT) function becomes computationally expensive at scale
- How this impacts query performance and costs (our client was scanning 6.5TB of data for each query!)
- How we solved this using HyperLogLog (HLL), reducing query times from hours to seconds and cutting resource usage by 93%

The problem with COUNT(DISTINCT) at scale
COUNT(DISTINCT) counts the number of unique values in a specified column. At scale, however, performance suffers because every run has to process the full dataset.
Let’s take a simple example:
- Day 1 data
User A visited
User B visited
User C visited
COUNT DISTINCT = 3 users
- Day 2 data
User B visited
User C visited
User D visited
COUNT DISTINCT = 3 users
If you added these daily counts (3+3=6) to get the total unique users over two days, you’d be incorrect because User B and User C would be counted twice.
While many aggregate functions — like SUM — can be further aggregated, COUNT DISTINCT cannot. To get the correct total number of distinct values, we would need to combine both days’ data and then apply COUNT DISTINCT to the entire combined set.
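The pitfall is easy to demonstrate with plain Python sets (the user IDs here are illustrative):

```python
# Daily distinct counts cannot be summed: users B and C visited on
# both days, so 3 + 3 double-counts them.
day1 = {"A", "B", "C"}
day2 = {"B", "C", "D"}

summed_daily_counts = len(day1) + len(day2)  # 6 (wrong)
true_distinct = len(day1 | day2)             # 4 (correct)
```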
Case study: When 6.5TB COUNT(DISTINCT) queries became unsustainable
In our customer’s case, they had no choice but to scan all the raw data every day over a rolling 30-day window. That meant querying a massive table with 18,570,335,647 records (after partition filtering), roughly a 6.5TB scan each time!
This created a bottleneck for their operations:
- High slot consumption: They quickly used up 2,000 slots, hitting BigQuery’s slot quota for on-demand jobs and causing the job to fail.
- Long query execution times: If the job didn’t fail, it took hours to run. This meant their data wasn’t available in a timely manner for their decision-makers.
- Expensive queries: Recalculating COUNT(DISTINCT) every day and scanning billions of records again and again naturally came with an expensive price tag.
As a result, it became clear that continuing with the COUNT DISTINCT approach was not feasible.
How BigQuery HLL approximates unique counts
This is where HyperLogLog (HLL) enters the picture. HLL is a probabilistic algorithm that allows you to estimate the number of distinct elements in a dataset. Instead of recalculating the exact number of distinct elements, HLL offers a highly accurate approximation that uses significantly fewer resources.
BigQuery provides this functionality through the APPROX_COUNT_DISTINCT function, which is based on HLL. This approximation was accurate enough for the client’s needs, and the trade-off in precision was outweighed by the substantial improvements in speed and cost.
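To build intuition for how HLL works under the hood, here is a minimal, illustrative implementation in pure Python. This is a toy sketch of the core idea only; BigQuery’s HLL++ implementation adds bias correction, small/large-range corrections, and sparse encodings, and the function names below are ours, not BigQuery’s:

```python
import hashlib

M = 256  # number of buckets (2^8); relative error is roughly 1.04 / sqrt(M)

def _hash64(value):
    """Deterministic 64-bit hash of a value."""
    digest = hashlib.sha256(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def init_sketch(values):
    """Build a sketch: per bucket, the longest run of trailing zeros seen."""
    sketch = [0] * M
    for v in values:
        h = _hash64(v)
        bucket = h & (M - 1)   # low 8 bits choose a bucket
        rest = h >> 8
        rank = 1               # position of the lowest set bit in the rest
        while rest & 1 == 0 and rank < 56:
            rest >>= 1
            rank += 1
        sketch[bucket] = max(sketch[bucket], rank)
    return sketch

def merge(a, b):
    """Sketches combine losslessly: take the per-bucket maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def estimate(sketch):
    """Raw HLL estimate (no small/large-range corrections)."""
    alpha = 0.7213 / (1 + 1.079 / M)
    z = sum(2.0 ** -r for r in sketch)
    return alpha * M * M / z
```

The key property is that `merge(init_sketch(day1), init_sketch(day2))` is identical to building one sketch over the combined data: a per-bucket max loses nothing, which is exactly why daily sketches can later be combined without double-counting.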
Quick wins: Implementing HLL through APPROX_COUNT_DISTINCT
Standard COUNT(DISTINCT) query
Let’s look at how a typical COUNT(DISTINCT) query over 30 days might look:
SELECT
  COUNT(DISTINCT user_id) AS unique_users
FROM
  project.dataset.user_interactions
WHERE
  event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE();
This query calculates distinct users over the last 30 days, but each new run discards the results of the previous run and rescans everything, which is slow and expensive.
Optimized HLL-based query using APPROX_COUNT_DISTINCT
By using HLL through the APPROX_COUNT_DISTINCT function in BigQuery, we can get our results much faster and cheaper:
SELECT
  APPROX_COUNT_DISTINCT(user_id) AS approx_unique_users
FROM
  project.dataset.user_interactions
WHERE
  event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE();
However, even with APPROX_COUNT_DISTINCT, we’re still scanning 30 days of data for every query. This is where HLL sketches come in — they allow us to pre-calculate and store daily summaries that we can efficiently combine later.
BigQuery HLL Sketches: Advanced daily aggregation method
HLL sketches are compact data structures that use hashing to summarize information. Instead of keeping a list of every single user visit (which can be millions of records), you store a compressed representation only a few kilobytes in size — the sketch — that can later be used to estimate distinct counts over multiple time periods.
In our customer’s case, we helped them create daily HLL sketches:
- Each day, the system generated an HLL sketch summarizing the unique users, grouped by the dimensions the customer needed.
- These sketches were then aggregated to calculate distinct users over any time range, such as the past 30 days, without recalculating everything from scratch.
This aggregation is the key benefit: HLL sketches can be combined efficiently over time periods without introducing errors from double-counting or requiring a full data scan. It enabled our customer to query unique user counts over 30 days, using a fraction of the resources and time.
Example Query: Creating daily HLL sketches
Here’s how you might store daily HLL sketches in BigQuery:
CREATE OR REPLACE TABLE project.dataset.daily_sketches AS
SELECT
  event_date,
  HLL_COUNT.INIT(user_id) AS hll_sketch
FROM
  project.dataset.user_interactions
GROUP BY
  event_date;
In this query, we create a daily table of HLL sketches, one for each day, summarizing the distinct users in the user_interactions table.
Aggregating HLL sketches over 30-day windows
SELECT
  HLL_COUNT.MERGE(hll_sketch) AS approx_unique_users
FROM
  project.dataset.daily_sketches
WHERE
  event_date BETWEEN DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY) AND CURRENT_DATE();
Here, we use HLL_COUNT.MERGE to combine the daily HLL sketches, allowing us to estimate the distinct user count across the entire 30-day period without recalculating from scratch.
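The two-step pipeline — reduce each day’s raw events to a summary once, then merge only the small daily summaries for any window — can be sketched in Python. Here plain sets stand in for HLL sketches (sets are also mergeable, just not fixed-size), and the dates and user IDs are illustrative:

```python
from datetime import date, timedelta

# Simulated raw events: date -> list of user_ids seen that day.
raw_events = {
    date(2024, 1, 1) + timedelta(days=d): [f"user{(d * 7 + i) % 50}" for i in range(30)]
    for d in range(60)
}

# Step 1: once per day, reduce that day's raw events to a summary
# (in BigQuery, the HLL_COUNT.INIT step).
daily_summaries = {d: set(users) for d, users in raw_events.items()}

# Step 2: for any window, merge only the small daily summaries
# (the HLL_COUNT.MERGE step) -- no rescan of the raw events needed.
def distinct_over_window(end_date, days=30):
    window = set()
    for offset in range(days):
        window |= daily_summaries.get(end_date - timedelta(days=offset), set())
    return len(window)

print(distinct_over_window(date(2024, 2, 29)))
```

Each day’s raw data is read exactly once, when its summary is built; the windowed query then touches only 30 small summaries instead of 30 days of raw records.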
Real-world BigQuery HLL results

Slot utilization before and after using HLL sketches (the 2K figure represents the maximum allocation; actual demand before HLL was higher still).
Performance improvements
The shift to HLL sketches offered a major advantage for our customer:
- Reduced query costs: Initially, the query consumed all 2,000 slots allocated under the on-demand pricing model. The client’s alternative was to purchase additional reserved slots (BigQuery Editions), which would have significantly increased their costs. After implementing HLL sketches and creating an aggregate table, slot consumption dropped to just 135, drastically reducing both query time and cost.
- Faster queries: Query times dropped from hours to roughly 7 seconds, enabling the client to generate reports faster and meet real-time decision-making needs.
- Scalability: As the dataset grew, the HLL-based approach continued to scale without needing full-table scans or recalculations, keeping costs low.
- Accurate enough: While HLL returns an approximation of unique visitor counts, the client found its configurable accuracy more than sufficient for their reporting needs, with no noticeable impact on business outcomes.
Cost reduction
The results were substantial:
- Performance (running time) improved by over 99%.
- Slots: Query slots consumption was reduced from over 2,000 slots to just 135.
- Time: Query times for distinct counts over 30 days dropped from several hours to just seconds.
- Data scanned: Reduced from 6.5TB per query to 16.25GB.
- Cost savings: The customer reduced both the slots and the data scanned per query, and after the implementation there was no need to purchase additional slots.
- Improved data-driven decision-making: With faster access to key insights, the client could react to user behavior and adjust their marketing strategies much more effectively.
For companies facing performance bottlenecks counting unique values with COUNT(DISTINCT) queries in BigQuery — especially over large datasets and extended time periods — HLL sketches provide a highly efficient solution. By switching to approximated distinct counts and storing daily sketches, you can significantly reduce query costs and boost performance by over 99%, as seen in this real-world example.
If your business relies on timely and cost-effective distinct count reporting, consider adopting HLL and see similar improvements in your data workflows.
If you’re looking to optimize your cloud usage, cost, and — most importantly — time, get in touch with us and let’s explore how we can transform your data challenges into opportunities for growth.