DoiT Cloud Intelligence™

Utilizing ClickHouse to Reduce Costs from Your BigQuery and Looker Usage Part 1

By Sayle Matthews · Jun 30, 2024 · 9 min read

Introduction

This is a multi-part series, broken into the logical stages of setting this process up. Because the complexity of the ecosystem and technology introduces a lot of "if-then-else" conditions, I have split the article so that each of these lives in its own part rather than convoluting the whole. This first part covers most of the theory; the second part will focus on a basic implementation.

Out of the gate, if you are one of the uninitiated: ClickHouse is a data warehouse that competes with BigQuery. Its claim to fame is being a highly performant datastore that can execute operations significantly faster than other data warehouses out there. This makes it ideal for storing and serving data to workloads that query it constantly, such as BI tools like Looker.

This article was written with the technical assistance of our partner Aiven, a leading provider of DBaaS offerings that proudly hosts a managed ClickHouse service. For completeness, I am also including details from ClickHouse, which hosts its own managed service.

Problem Statement

BigQuery is a great platform for large-scale analysis and ML workloads, and it is often tied to BI tools such as Looker for visualization and reporting duties. Unfortunately, queries for this type of work often hit performance issues due to resource constraints (sometimes constraints imposed deliberately to control cost). Then there is the very large (and sometimes angry) gorilla in the room: the per-query cost that BigQuery charges.

On top of this, after the recent per-query price increases, BigQuery has gotten more expensive, especially when tied to a tool that constantly queries data. Many customers are therefore looking for ways to reduce their workload costs and discovering something DoiT International Customer Reliability Engineers have preached for years: BI tools are one of the largest, if not the largest, contributors to BigQuery costs, and one of the biggest targets for cost optimization.

In my presentation at Google Next 2023 (the recording was unfortunately removed, but the slides are still available), I proposed a cost-optimization approach that numerous customers of ours have used to great effect: migrating certain expensive workloads, particularly compute-heavy ones, off of BigQuery onto other tools that are cheaper and better suited for the purpose.

In direct relation to that idea, I am proposing in this article an alternative method of data presentation by utilizing the ClickHouse data warehouse as a “caching and serving” layer between BigQuery and BI tools, such as Looker. This means reporting and visualization workloads can be moved off of BigQuery onto a cheaper and more performant platform.

Important note: ClickHouse is just the example here; other tools and databases could fill this role as needed. I use ClickHouse because it works amazingly well for serving data very quickly to visualization tools, and I see it often enough to justify this article.

What is ClickHouse?

Lately, you might have seen quite a bit of advertising on LinkedIn and other sites stating that ClickHouse is cheaper or faster than BigQuery. The great minds behind that campaign are capitalizing on the same foundation as the concept I am proposing in this article. Our data architects here at DoiT International have seen at scale how the pricing changes are affecting a very large subsection of GCP customers, and have come up with creative ways to help them save money.

ClickHouse is a data warehouse in a "similar" vein to BigQuery, but with some major architectural differences. Rather than BigQuery's fully serverless model, ClickHouse is built around user-selected table engines that tailor storage and query behavior to the workload, which can massively increase performance. Thanks to this design, it can be significantly more performant for many tasks than BigQuery or other data warehousing solutions on the market.

This quote by PostHog sums it up very nicely:

“The performance difference between BigQuery and ClickHouse can be immense. BigQuery can take dozens of seconds to execute a query. ClickHouse, if tuned correctly, can execute the same query on terabytes of data with sub-second performance.”

Why? Cost Savings!

“Elementary, my dear Watson for cost savings” –Sherlock Holmes (with a modern twist for cloud)

The answer is cost savings! Well, to be more realistic: potential cost savings.

This works because you are querying a data warehouse that doesn't bill for query usage: it charges a flat rate, so you can query as much as the provisioned resources allow without being billed by use.

In this article I will be using Aiven's managed ClickHouse offering for pricing, since it is the service I tested this article with; Aiven has made it very easy to hook up ClickHouse, and their other offerings, to your GCP environment.

Aiven's plans start at roughly $500/month for the startup tier and roughly $2,000/month for the business tier, which gives us the price points for evaluating whether this is feasible for your environment.

There are a few break-even points for this solution to be feasible from a cost perspective.

Note: if you are spending less than $500/month on BigQuery for your visualization tool, there is a good chance this will not give you any cost reductions; on the other side of the coin, there is a very good chance it will still give you massive performance improvements.

Keeping in mind the $500 and $2,000 price points from Aiven, these correspond to roughly 81 TiB (80 TiB + 1 TiB of free tier) and 321 TiB (320 TiB + 1 TiB of free tier) of data processed per month under BigQuery's on-demand pricing model at $6.25/TiB. You can also run this query [1] against the project that runs your BI tool's jobs.
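As a sanity check on those numbers, the break-even math is simple enough to script. A minimal sketch, assuming the $6.25/TiB on-demand rate and the 1 TiB/month free tier (both worth verifying against current pricing for your region):

```python
# Break-even sketch for BigQuery on-demand vs. a flat-rate ClickHouse plan.
# Assumed: $6.25 per TiB scanned (US multi-region on-demand rate) and the
# 1 TiB/month free tier -- check your region's actual pricing.
ON_DEMAND_USD_PER_TIB = 6.25
FREE_TIER_TIB = 1.0

def breakeven_tib(plan_usd_per_month: float) -> float:
    """TiB scanned per month at which on-demand spend equals the flat plan cost."""
    return plan_usd_per_month / ON_DEMAND_USD_PER_TIB + FREE_TIER_TIB

print(breakeven_tib(500))   # 81.0  -> ~81 TiB for the ~$500/month startup tier
print(breakeven_tib(2000))  # 321.0 -> ~321 TiB for the ~$2,000/month business tier
```

If your BI tool scans less than the break-even volume per month, the flat-rate plan costs more than on-demand; above it, the plan wins on price alone (before counting any performance benefit).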

For comparative reference, ClickHouse Cloud offers plans with an autoscaling component and slightly larger instances at very comparable prices. Depending on your sizing and development/production needs, it might be the cheaper alternative.

For BigQuery Editions there isn't a straightforward break-even value, because Editions uses an autoscaler, which is essentially a sliding scale on pricing. The easiest method here is to run the above-linked query [1] in the project where Looker runs its queries, which will calculate the approximate Looker costs.

Note on this query:

You may need to adjust the regex, or replace it entirely with your Looker service account name, to get accurate costs, particularly if you are using a non-standard Looker service account or multiple accounts.

Once you have calculated that value, you can determine whether it lands above or below the $500 and $2,000 thresholds mentioned above. And if performance is a concern, the extra cost might well be worth not waiting 20 seconds for a Looker dashboard to display some realtime data.
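For illustration only, a cost-attribution query in the same spirit as the linked one might look like the sketch below. This is not the linked query: the `region-us` qualifier and the `looker` regex are placeholders you would adjust to your own region and service account.

```python
# Hypothetical sketch of a Looker cost-attribution query -- NOT the query
# linked above. Adjust the region qualifier and the service-account regex.
LOOKER_COST_SQL = """
SELECT
  SUM(total_bytes_billed) / POW(1024, 4)        AS tib_billed,
  SUM(total_bytes_billed) / POW(1024, 4) * 6.25 AS approx_on_demand_usd
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
  AND REGEXP_CONTAINS(user_email, r'looker')
"""

def exceeds_plan(monthly_usd: float, plan_usd: float = 500.0) -> bool:
    """True when the BI tool's monthly BigQuery spend already exceeds a plan tier."""
    return monthly_usd > plan_usd
```

Run the SQL in the BigQuery console against the project that executes your Looker jobs, then compare the resulting monthly figure to the plan tiers.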

The Evil Genius Plan

The plan is to "replicate" data between BigQuery and ClickHouse, then connect any heavy-querying BI tools (or any other heavy-querying tools, for that matter) to ClickHouse. In theory, this removes the per-query costs associated with BigQuery and reduces your costs significantly, since you are now using a fixed-price asset for your querying.

This may also help if you need realtime or much faster querying capability, because ClickHouse, when tuned properly, can be a MUCH more performant query engine.

Preemptive Homework

There are always first steps when doing something that will potentially save you a lot of money and this process is no exception.

The first thing to determine is how much data, and which tables, your BI tool reads from BigQuery. That is a very generic statement and not a trivial problem to solve in BigQuery, but as in the past, I am providing a query to help with this here [2]. Note that this query could potentially cost a lot of money, so check the estimate in the BigQuery UI before running it.
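The linked query [2] is the one to use, but the general approach can be sketched: BigQuery's `INFORMATION_SCHEMA` job metadata exposes a `referenced_tables` array that can be unnested to rank the tables your BI tool actually touches. The region qualifier below is a placeholder.

```python
# Hypothetical sketch (not the linked query [2]): rank tables by bytes billed
# over the last 30 days using the referenced_tables array in job metadata.
TABLE_USAGE_SQL = """
SELECT
  rt.project_id,
  rt.dataset_id,
  rt.table_id,
  COUNT(*)                                AS query_count,
  SUM(total_bytes_billed) / POW(1024, 4)  AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS rt
WHERE creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
GROUP BY 1, 2, 3
ORDER BY tib_billed DESC
"""
```

The tables at the top of this ranking are the first candidates to replicate into ClickHouse.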

The second item on the list is the more detailed of the two: perform discovery on your ingestion process. You will need to understand what "makes it tick" and how data is currently ingested into BigQuery. The reason is that we will later modify this pipeline to "split" the data between BigQuery and the newly created ClickHouse instance so that it propagates to both.

Choosing a ClickHouse Instance Size

To close out this first part, I want to give guidance on sizing and creating the ClickHouse instance that will be used throughout the rest of this series.

Having done multiple BigQuery-to-(some other database system) migrations, I am often asked how to choose the correct size for the target system. The unfortunate answer I have discovered is that there really isn't a bulletproof method of doing this.

I have gone through exercises looking at the data pulled from BigQuery by queries, the volume of cache hits, and the slot usage, but the truth is that BigQuery is such a different sort of system from most other databases that we can't really build an apples-to-apples comparison. These exercises suggest it could be done in principle, but BigQuery does not emit some of the needed metrics, such as unique data queried or CPU/memory used by slots.

That said, those exercises have taught me one thing: start with at least 8 GB of RAM for a database you do basic querying from on a regular basis, and treat 16 GB as the minimum for a true production visualization database with more than 10 users running heavy computational queries throughout the day. As query complexity goes up, adding more CPUs is beneficial, but a dual-CPU machine is a good starting point for almost any workload, since memory matters far more than CPU for most querying.
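To make that heuristic concrete (and it is only a heuristic, not a sizing formula), the starting points above can be sketched as:

```python
def starting_ram_gb(heavy_query_users: int) -> int:
    """Starting-point RAM from the guidance above: 8 GB for regular basic
    querying, 16 GB minimum once >10 users run heavy queries all day."""
    return 16 if heavy_query_users > 10 else 8

# A dual-CPU machine is a reasonable baseline regardless; add CPUs as query
# complexity grows, since memory matters more than CPU for most querying.
```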

Using that guidance, I would start a proof of concept with a very small instance, such as the hobbyist tier from Aiven or the development instance from ClickHouse, get things up and running, and then rightsize upward as needed.

Starting a ClickHouse Instance with Aiven

Instead of doing a full walkthrough that will inevitably be out of date when the UI changes, I am going to link directly to the Aiven documentation.

The first step is to create an account with Aiven as per here.

The second step is to create an Aiven VPC, per here, and then peer it with your GCP VPC as per here.

The next step is to create the ClickHouse service per here. This may take a few minutes, so grab a cup of coffee or tea while waiting.

Verify that it works using either the Docker container or the CLI tool inside your peered VPC and then you should be ready to start using it.
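As one more way to sanity-check connectivity, ClickHouse also exposes an HTTP query interface you can hit from inside the peered VPC. This is a sketch: the host and port are placeholders, and Aiven assigns its own HTTPS port, so copy the real values (and credentials) from your service's overview page in the Aiven console.

```python
# Minimal connectivity check via ClickHouse's HTTP interface. The host and
# port below are placeholders -- copy the real values from the Aiven console.
import urllib.parse
import urllib.request

def query_url(host: str, port: int, query: str = "SELECT 1") -> str:
    """Build an HTTPS URL for ClickHouse's HTTP query interface."""
    return f"https://{host}:{port}/?query={urllib.parse.quote(query)}"

# Uncomment inside the peered VPC (adding basic-auth credentials as needed):
# print(urllib.request.urlopen(query_url("my-service.aivencloud.com", 8443)).read())
```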

Starting a ClickHouse Instance with ClickHouse.com

As above, I am just going to link to the vendor documentation, since anything I write here will inevitably be out of date when something changes in the future.

Here is the vendor’s quickstart guide on how to create an instance.

I would recommend setting up GCP Private Service Connect to your ClickHouse instance per here. This ensures the highest level of security and the least amount of configuration on utilizing your instance.

Moving On

This part was just a basic introduction to the plan; in the next part I will cover getting your data into ClickHouse and setting up the replication between it and BigQuery.