DoiT Cloud Intelligence™

Utilizing ClickHouse to Reduce Costs from Your BigQuery and Looker Usage Part 2

By Sayle Matthews | Jun 30, 2024 | 16 min read

Continuing On

In the previous section, we covered what this plan entails and how to create a basic ClickHouse service using Aiven’s or ClickHouse’s DBaaS offering. In this section, we will start getting the data into ClickHouse and setting up replication between it and BigQuery.

GitHub Repository

Throughout this article, I will reference a Cloud Function and a BigQuery job. The source code for these two artifacts is located here in this GitHub repository.

Note that this code is deliberately simplistic and built as a learning piece, so there is a very high chance it will need to be customized for real-world usage in your scenario.

Getting Data Out of BigQuery to ClickHouse

There are currently no automated methods of doing CDC or streaming data out of BigQuery into an arbitrary destination. This means data must either be captured before it is inserted into BigQuery or extracted after it has landed. The lack of any official method makes this a very in-depth subject, so I will cover only batch loads in this article.

Those approaches are discussed in the next section, but first, we need to get the initial dataset out of BigQuery and into ClickHouse to create our baseline.

BigQuery has an export function that is the easiest way to get data out of it and into somewhere else. It has two key disadvantages, though: it can only export a single table at a time, and it can only target GCS for storing the data.

Because of the one-table-at-a-time limitation, now is a good time to determine whether all tables need to be replicated into ClickHouse, or which subset of tables are good candidates for usage by Looker.

A quick way to get started is to run a query against your dataset that lists out all of its tables. Additionally, if you are unsure which tables are used often, you can query for the count of queries that have hit each table in the dataset. Note that this usage query can cost a significant amount of money, so make sure you check the cost estimate in the UI and adjust the number of days it scans before running it.
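As a hedged sketch of what such a usage query could look like, the helper below assembles a query against BigQuery's INFORMATION_SCHEMA.JOBS view, which records referenced tables per job. The region qualifier, project, and day window are assumptions about your setup, not the article's original query:

```python
# Sketch: build a SQL string that counts recent queries per table in a dataset.
# INFORMATION_SCHEMA.JOBS is region-scoped, so the region qualifier is required;
# "region-us" here is an assumption. Run the result via the bq CLI or client library.

def build_table_usage_query(project: str, dataset: str,
                            region: str = "region-us", days: int = 7) -> str:
    """Return SQL counting how often each table in `dataset` was queried recently."""
    return f"""
SELECT t.table_id, COUNT(*) AS query_count
FROM `{project}.{region}.INFORMATION_SCHEMA.JOBS` AS j,
     UNNEST(j.referenced_tables) AS t
WHERE j.creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL {days} DAY)
  AND t.dataset_id = '{dataset}'
GROUP BY t.table_id
ORDER BY query_count DESC
""".strip()

print(build_table_usage_query("my-project", "my_dataset", days=7))
```

Shrinking the `days` window is the main lever for keeping the JOBS scan (and its cost) small.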

When doing the actual export, the best approach is to use the bq CLI command and grab a whole table. I recommend this because the “EXPORT DATA…” SQL statement incurs processing/scan charges for the amount of data exported (or slot charges if using Editions), whereas the CLI command, or the equivalent API call, simply dumps the whole table without additional charges.

Now, if you want only part of a table, which is normally a set of its partitions, there is a trick using the bq cp command: copy a partition to a new table that can then be loaded directly into ClickHouse. Unfortunately, this command doesn’t work with wildcards or more than one partition at a time, so you must run it once per partition. This is relatively easy to script, but here is the single-partition command:

bq cp --append_table=true '<source_project_id>:<source_dataset>.<source_table>$<source_partition_name>' '<target_project_id>:<target_dataset>.<target_table>'
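Since the copy must run once per partition, the scripting mentioned above can be sketched as follows; the table names and daily (YYYYMMDD) partitioning scheme are assumptions for illustration:

```python
# Hypothetical sketch: emit one `bq cp` command per daily partition, since
# `bq cp` cannot copy multiple partitions at once. Adjust the decorator
# format if your table is partitioned hourly or monthly.
from datetime import date, timedelta

def partition_ids(start: date, end: date) -> list:
    """Daily partition decorators (YYYYMMDD) from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).strftime("%Y%m%d") for i in range(days + 1)]

def bq_cp_commands(src: str, dst: str, start: date, end: date) -> list:
    """One append-mode copy command per partition in the date range."""
    return [
        f"bq cp --append_table=true '{src}${pid}' '{dst}'"
        for pid in partition_ids(start, end)
    ]

for cmd in bq_cp_commands(
    "my-project:my_dataset.events",          # assumed source table
    "my-project:my_dataset.events_export",   # assumed staging table
    date(2024, 6, 1),
    date(2024, 6, 3),
):
    print(cmd)
```

Each printed line can then be run in a shell, or piped straight to one.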

If the table isn’t partitioned (and isn’t massive), I recommend loading the whole table into ClickHouse and pruning it there using SQL, so you don’t incur processing charges on BigQuery.

Once you are ready to do the initial export, let’s carry on.

I am going to dump the data out in the Parquet format since, as a columnar format, it is the closest match to BigQuery’s internal storage and is easily loaded into ClickHouse with the column types intact.

The command to export a table from BigQuery is:

bq extract --destination_format=PARQUET \
<project_id>:<dataset>.<table_name> \
gs://<bucket_name>/<path_and_filename>

(Note: if you are putting the files into a sub-folder, make sure it exists in your bucket first, or you will get a rather obtuse error message about positional arguments.)

Once a table has been exported to GCS then it’s time to load it into ClickHouse and thus we continue our journey.

Loading the Initial Data into ClickHouse

When loading into ClickHouse, the official SQL query to run is this (at the time of writing):

CREATE TABLE <table_name>
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM s3Cluster('default',
  'https://storage.googleapis.com/<bucket_name>/<path>/*.parquet');

Note: there is a bug in some versions of ClickHouse where the above query will fail but still create the table. To work around it, run INSERT INTO <table_name> followed by the same SELECT ... FROM s3Cluster(...) statement from above, and the insert will succeed.

The recommended and more secure way to authenticate is to use an HMAC key. Just make sure you first grant your service account access to the bucket, as skipping this step is a common mistake. These keys are also used by some of the replication methods later in this article, so it’s best to generate them now and have access to the proper buckets already in place.

When using these keys use this SQL instead in ClickHouse:

CREATE TABLE <table_name>
ENGINE = MergeTree
ORDER BY tuple() AS
SELECT *
FROM s3Cluster('default',
  'https://storage.googleapis.com/<bucket_name>/<path>/*.parquet',
  '<hmac_access_key>',
  '<hmac_secret>');

In these queries, the asterisk (*) in the filename grabs all files in that directory with a .parquet extension. Also, make sure you don’t mix up a service account key with an HMAC key; they are completely different things, and HMAC keys live only inside of GCS, not in GCP’s IAM service. Finally, if your files don’t have an extension, you will need to pass 'Parquet' as the format argument to the s3Cluster call or it may throw an error (a common issue when querying from GCS with HMAC keys).

Now this part is where it gets a little crazy and where ClickHouse shines. The line with ENGINE in it is one of the keys to performance. These are called table engines, and they are the subject of at least one whole ClickHouse training series, so I am not going to cover them here. Instead, here is a link to the official ClickHouse documentation on them.

When loading the data, you must choose the correct engine based upon your usage. The MergeTree family of engines are great general-purpose storage engines, with some specialized members, but I HIGHLY recommend reviewing the other engines out there and matching your usage to the correct one. This is one of, if not the, key to performance success with ClickHouse! It cannot be stressed enough.

With that said, there is one last important step after loading the initial data into ClickHouse: deleting the temporary files from your GCS bucket. Too many people forget this and keep paying for the storage until an audit surfaces it later, so don’t fall into this trap; handle it now to save some money. If you are unsure whether you will need the files for a reload later, set up a lifecycle rule to delete them after 30 days (or more) so you don’t forget.

Patterns of Replication

Next up is getting the replication of data going, which has different methods depending on how you currently ingest data into BigQuery. In this article, I will cover batch ingestions. Streaming will be covered in a planned future article, as it’s a much more complex scenario.

The other big issue is that there is no one-size-fits-all approach; it is all very dependent upon how your data pipeline currently works. I will propose the most common patterns we see here at DoiT and let you, as the actual implementor, decide the best strategy for your organization.

Replication Prerequisites

There are some prerequisites shared by all of these methods of replicating data between the two systems. Thankfully, most datasets already have them built in, so this is generally not something to worry about, but in full disclosure mode, I have to mention them.

Prerequisite 1: Ingestion time or “incremental markers”

Being able to determine where replication started and/or stopped is key to preventing duplicate inserts and the bad-data scenarios they cause. The marker can be an ingestion time or a unique “primary key” of an incremental nature. In BigQuery, this often aligns with a partition boundary.
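As a toy illustration of the incremental-marker idea, assuming hypothetical rows carrying an ingestion_time field:

```python
# Illustrative sketch of a high-water-mark check: only rows strictly newer
# than the marker are replicated, so an already-sent batch is never re-sent.
# The row shape and field names here are hypothetical.
from datetime import datetime

def rows_since(rows: list, high_water_mark: datetime) -> list:
    """Rows with ingestion_time strictly after the marker."""
    return [r for r in rows if r["ingestion_time"] > high_water_mark]

def next_marker(rows: list, current: datetime) -> datetime:
    """Advance the marker to the newest ingestion_time seen in this batch."""
    times = [r["ingestion_time"] for r in rows]
    return max(times) if times else current
```

After each batch, the new marker is persisted and used as the lower bound for the next run.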

Prerequisite 2: Determine the source of your data

This can be summed up as finding out where your data comes from before it is loaded into BigQuery. Often this is GCS or S3 for batch-type data; it could also be streamed directly into BigQuery via a Pub/Sub subscription, or arrive from a Dataflow/Beam transformation. Knowing where the data comes from is key to choosing the best replication method.

Prerequisite 3: Determining how quickly data is needed

This is a question of whether you need “realtime” or “eventual” data available to query in ClickHouse from your BI tool. If you need it within a few minutes, let’s call that “realtime”; if updates every 30 minutes or more are acceptable, let’s call that “eventual” for the purpose of choosing a strategy.

Setting up Batch Replication

Batch replication moves data into ClickHouse on an interval, either from the original source or from BigQuery itself. These strategies are the easiest to implement and are used when you have the data in files or another batch mechanism before loading it into BigQuery.

The other alternative is to pull the data from BigQuery into GCS then load it into ClickHouse from there.

The first method is to load data stored in GCS straight into ClickHouse using a Cloud Function. This only works if the data is already in GCS in a format that ClickHouse supports loading from.

This method is very easy: it is a straightforward load job, much like the load jobs most customers already run against BigQuery.

The second method is if the data is already stored in BigQuery and needs to be replicated out of it into ClickHouse for usage. This method isn’t as easy and has a few more moving parts and costs so it’s not the preferred method for this reason.

Note that if you have data in another datastore, such as a MySQL or PostgreSQL database that either acts as an intermediate step on the way to BigQuery or is replicated up to BigQuery, then the second method is the way to go. I will call out the non-BigQuery datastore scenarios where relevant below.

Setting up Batch Replication with Data in GCS Before Loading into BigQuery

This method is by far the easiest and is just a few steps with minimal setup.

In a nutshell, this just takes your data files from GCS and loads them into ClickHouse. It isn’t always quite that easy (as always), but it’s pretty close, and I am providing most of the code for it already.

The easiest method I have found so far is to set a trigger on GCS for files being finalized in your bucket, which calls a Cloud Function that loads the file into ClickHouse. The process for doing this is documented here; use the source code here in your deployment, but make sure you read the documentation first so the prerequisites are met (namely creating the secret to hold all of your connection information).

Now there is a caveat: you probably want to load only files of a certain type, and perhaps load a file into a different table based on its path or filename. You will need to add this logic for your specific use case; I have documented in the source file where these modifications belong.
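A minimal sketch of that routing logic might look like the following; the “table name as top-level folder” convention and the .parquet filter are assumptions for illustration, not what the repository implements:

```python
# Hypothetical routing for the GCS-triggered Cloud Function: only .parquet
# files under a table-named top-level folder are loaded; everything else
# (temp files, control files, loose objects) is skipped.
from typing import Optional

def target_table(object_name: str) -> Optional[str]:
    """Map 'orders/2024/06/30/part-000.parquet' to 'orders'; return None to skip."""
    if not object_name.endswith(".parquet"):
        return None  # ignore non-parquet objects
    parts = object_name.split("/")
    if len(parts) < 2:
        return None  # require a table-named top-level folder
    return parts[0]
```

In the Cloud Function, a None result would simply end the invocation without loading anything.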

Setting up Batch Replication with Data Already in BigQuery

Next up on replication strategies is batch replication from BigQuery to ClickHouse on an interval. These methods are great if you are loading data straight into BigQuery from somewhere else and aren’t able to capture it before it is loaded. Often this is the case with streaming inserts, BigQuery Storage API loads, or data loaded from external tools such as Stitch or Fivetran.

Costs note: this method has to query the data on an interval, so it will incur some charges. Keep this in mind, and make sure you partition and/or cluster the underlying table appropriately to minimize them. Depending on how much data is ingested, you may also prefer on-demand billing over Editions for these queries. Often, creating a separate project that runs these exports against the data in the main project is the preferred setup, as it keeps a separation of workloads and lets you use on-demand billing, or a separate reservation on the cheaper Standard Edition, if using Editions.

The overall outline of this method: a scheduled query runs on an interval inside BigQuery and exports the new data to a GCS bucket. When the file finishes writing, a Cloud Function triggers and loads the data into ClickHouse.

This method can be pricey across querying, storage, and Cloud Function invocation volume, but it gets the data loaded at a very predictable interval. If you are unable to capture the data before BigQuery, this might be the best way to do it, and it will probably still save money on querying costs.

To reduce the cost, it’s generally best to align your data exports with a partition, so the partition can be exported directly to GCS (see the bq cp command above). For instance, if you partition hourly, schedule the process so it grabs an entire hour’s worth of data at a time.
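To make that alignment concrete, here is a small helper sketch that computes the previous hour’s partition decorator for a job running a few minutes past the hour; the YYYYMMDDHH decorator follows BigQuery’s hourly partition naming, while the table name is a placeholder:

```python
# Sketch: given "now", compute the decorator for the previous full hour so a
# scheduled export (run at :05) grabs exactly one complete hourly partition.
from datetime import datetime, timedelta

def previous_hour_partition(table: str, now: datetime) -> str:
    """Hourly partition decorator, e.g. 'ds.t$2024063014' for the hour before now."""
    prev = now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)
    return f"{table}${prev.strftime('%Y%m%d%H')}"
```

The returned string can be dropped straight into a bq cp or extract command.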

So let’s dive in and show how this thing works!

The Working Example

For the sake of this example, I have created a very basic table with the following schema inside of BigQuery:

I have it partitioned on transaction_time for the sake of faster querying in this case.

Data is loaded into this table every hour and contains all the data for that hour, meaning nothing arrives late to be included in a later load. This is for simplicity’s sake and would very rarely happen in the real world.

For simplicity’s sake (and as a very good practice), when the load is completed, a message is pushed to a Pub/Sub topic. A subscription attached to that topic triggers a Cloud Function that does the work for you, instead of needing to set up a small pipeline. As an alternative, a Cloud Function can be triggered when the file finishes writing to load it automatically, which would make this closer to realtime than loading on an interval.
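As a hedged sketch of what such a completion message could look like, the payload fields below are hypothetical, not the article’s actual schema:

```python
# Hypothetical load-completion message the hourly job could publish to
# Pub/Sub; the downstream Cloud Function parses it to know what to load.
import json
from datetime import datetime

def load_complete_message(table: str, hour: datetime, gcs_uri: str) -> bytes:
    """Serialize a completion event; Pub/Sub message bodies are bytes."""
    payload = {
        "event": "load_complete",
        "table": table,
        "hour": hour.strftime("%Y-%m-%dT%H:00:00Z"),
        "uri": gcs_uri,
    }
    return json.dumps(payload).encode("utf-8")
```

The actual publish call would use the google-cloud-pubsub client with this as the message data.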

Here is the link to the Cloud Function that does this load along with a readme file explaining its usage.

Note that in it, I have put the credentials for ClickHouse in a Secret Manager secret and am exposing those values as environment variables to the Cloud Function. This is the easiest and one of the most secure ways to access your credentials in a Cloud Function. I have documented all of this in the README.md file for the associated Python code.

The first step is to create a Cloud Function. Note that the finalized event on a GCS bucket triggers it. Your setup of the Cloud Function should look something like this (note the warning on the service account as well; make sure you grant those permissions):

Now, before hitting save, open the “Runtime, build, connections and security settings” dropdown and scroll over to Security and Image Repo. At the top is a “Secrets” section where you will need to add a secret for each of the needed values, as below. Make sure you select “Exposed as environment variable” in the dropdown for each one for this Cloud Function to work.

You will need to do this for each of the secrets the Cloud Function needs (host, port, secure, username, and password). Port and secure have default values of 9019 and True respectively, so those secrets aren’t required if the defaults suit you. Also note the warning on the service account that I left in; make sure the service account has read access to Secret Manager before doing this.
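As a sketch of how the Cloud Function side might read those values, using the defaults described above (9019 and True); the environment variable names here are assumptions, so check the repository’s README for the real ones:

```python
# Hypothetical config reader for the Secret-Manager-backed environment
# variables; host/username/password are required, port and secure fall
# back to the defaults described in the article.
import os

def clickhouse_config(env=os.environ) -> dict:
    """Assemble ClickHouse connection settings from environment variables."""
    return {
        "host": env["CLICKHOUSE_HOST"],                       # required
        "port": int(env.get("CLICKHOUSE_PORT", "9019")),      # default 9019
        "secure": env.get("CLICKHOUSE_SECURE", "True") == "True",
        "username": env["CLICKHOUSE_USERNAME"],               # required
        "password": env["CLICKHOUSE_PASSWORD"],               # required
    }
```

Passing a plain dict as `env` also makes this trivial to unit test without touching real secrets.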

The next step in this process is to set up this query as a scheduled query in BigQuery (here is the official documentation on how to do this, for those who have never done it before), scheduled to run 5 minutes after the hour, every hour.

Once this is done, wait until your designated time and then check your GCS bucket for a newly created directory containing the files. If it exists, the scheduled query worked; you can then check your ClickHouse table to confirm the Cloud Function loaded the data.

Using this Example in the Real World

Now, this example is VERY simple and idealized for simplicity’s sake. In the real world, we know data will never be loaded perfectly within hour boundaries and on time with nothing arriving late. I have purposefully left these conditions out because ClickHouse has a good answer to them: ReplacingMergeTree and CollapsingMergeTree.

The ClickHouse team wrote a great article on this here which explains how to utilize these for updates and deletes. Depending upon the data usage patterns one can be chosen over the other.

Additionally, there is a good chance you will want to add functionality such as notifications, triggering ETL/ELT jobs, etc., so I recommend adding code at the end of the Cloud Function execution that puts a message into Pub/Sub to trigger any downstream work.

Hooking up Looker to ClickHouse

The last thing to do is to convert your Looker connections over to utilize ClickHouse instead.

It can sometimes be hard to find, but it is located at Database->Connections in the left panel of Looker’s Admin mode.

Admin->Database->Connections

Next, just hit add and fill out the information for your host on the connections page for your ClickHouse instance.

Upon clicking connections and selecting a Dialect of ClickHouse the following will appear:

Looker new ClickHouse connection view

Once this has loaded, copy and paste in the information from your instance, such as the below from the Aiven console:

Instance connection information from Aiven’s ClickHouse console

The information you will be looking for is generally for the JDBC connector or HTTPS access on ClickHouse.

Once you have completed that use the test button in Looker to ensure that it works.

Lastly, and this is very important for security, there is an IP whitelisting feature in your ClickHouse instance. This list defines the only IPs that are allowed to connect to your instance. The instructions here from Google list the IP addresses Looker uses, which you can whitelist so that only connections from the Looker IPs will be able to reach your ClickHouse instance.

Allowed IP address view from Aiven console

This pattern should get you started with a ClickHouse setup you can reference from Looker instead of BigQuery, potentially saving a lot of money on your querying.

Since this is an elementary example it probably won’t be an out-of-the-box solution for all of your data loading needs, but it should be viewed as the first stepping stone across the river on your journey to saving money by “caching” the data in ClickHouse versus paying per query on BigQuery.