
Dataplex Data Catalog API Default Activation on March 4th 2024

By Matthew · Mar 5, 2024 · 13 min read

Introduction

Google Cloud’s recent announcement regarding enabling the Dataplex API by default for all BigQuery customers from March 4th 2024 brings exciting possibilities for enhanced analytics capabilities.

BUT, this change also comes with potentially dangerous cost implications! Especially given the nature of the Dataplex API itself and supplementary APIs such as the Data Lineage and Data Catalog APIs.

In this blog post, we will delve into the details of this new change, its impact on costs, and strategies to mitigate these effects, as originally described by me and my fellow Google Data Practice Lead at DoiT, Sayle Matthews, at our webinar on February 29th and in this LinkedIn post.

Understanding the Change itself:

In an email communication from the Google Cloud Team distributed to all Google BigQuery users in mid-February, titled “[Action Advised] Review the updates to BigQuery to confirm if you need to take action on your projects”, it was revealed that starting March 4, 2024, the Dataplex API will be automatically enabled for all active BigQuery projects and activated by default for new Google Cloud projects.

This move is part of the broader BigQuery Studio launch (link), aimed at seamlessly providing users with additional analytics platform capabilities.

The email also highlights other APIs that will be automatically enabled as part of the upcoming changes on March 4th.

Of the services listed in that email, the enablement of the Dataplex service itself, together with its associated APIs such as the Data Catalog and Data Lineage APIs, is the largest cause for concern.

For the purposes of this blog post, it is worth noting that, as per Sayle Matthews’ LinkedIn post, Google rolled out a major new release of Dataplex in the week commencing February 26th that addressed a number of these concerns around auto-enabling features that carry a cost, so this change now luckily impacts fewer potential users.

Implications of Dataplex on GCP Costs:

While the introduction of Dataplex offers the potential of advanced data governance and management features for GCP customers, it’s crucial to be aware of the potential cost implications associated with its automatic enablement from March 4th.

Dataplex introduces functionalities and hidden processes related to various Google Cloud data services and APIs, such as Google BigQuery, Cloud Dataflow and Cloud Pub/Sub, which could lead to increased costs if you do not manage this change effectively within your GCP Projects!

As per Dataplex’s main pricing page, Dataplex itself charges through 4 main SKUs that are billed differently, namely:

  • Dataplex processing (standard and premium)
  • Dataplex shuffle storage
  • Data Catalog API calls
  • Data Catalog metadata storage

The most significant of these SKUs for your Dataplex costs is the Processing SKU, which at the time of writing is charged in DCU-hours ($0.060 per DCU-hour for Standard Dataplex processing and $0.089 for Premium Dataplex processing, though these charges may differ by region).
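As a hypothetical back-of-the-envelope example (the DCU consumption here is assumed purely for illustration, not taken from any real workload): a lake whose exploration workloads average 2 DCUs around the clock would consume roughly 2 × 24 × 30 = 1,440 DCU-hours per month, which works out to about 1,440 × $0.060 ≈ $86 per month at the Standard rate (or ≈ $128 at the Premium rate).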

The Dataplex pricing page includes a table breaking down how Dataplex processing is charged in practice.

As that table shows, Dataplex’s processing SKU can perform a fair few functions on your GCP data sources if allowed to run freely (and this is before we get to the associated Data Lineage and Data Catalog APIs!). But how does the Dataplex Processing SKU actually work?

The answer lies in the creation of Dataplex Lakes and Zones. In Dataplex, a Lake is effectively a data mesh domain that contains a business unit’s data artifacts. Lakes can in turn contain Zones, Assets and Entities to manage your data according to your own business and data structures within Google BigQuery, GCS etc.

When you create a Dataplex Lake (link), after specifying your Lake’s name, region and any labeling policies you wish to apply to it, you also need to link the Lake to a Dataproc Metastore.
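For reference, a Lake can also be created from the command line. The below is a minimal sketch only, assuming the gcloud Dataplex commands available at the time of writing; the project, region and Metastore service path are placeholders you would replace with your own:

gcloud dataplex lakes create my-lake \
  --project=my-project \
  --location=us-central1 \
  --display-name="My Lake" \
  --metastore-service=projects/my-project/locations/us-central1/services/my-metastore

The --metastore-service flag is what links the Lake to the Dataproc Metastore discussed next.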

This is a big contributor to the overall cost of Dataplex processing: the Metastore is used by Dataplex to manage the data exploration workbench processing, and it also adds to your Dataproc costs via the cost of running the Metastore itself. The Metastore is available in different tiers (Developer and Enterprise tiers for Metastore v1; Enterprise and Enterprise Plus tiers for v2), which are resourced and priced differently according to the specific needs of your Dataproc Metastore, as detailed here.

Once your Dataplex Lake is created, the Dataplex API will then start to process & scan your data as per the data exploration processes highlighted previously. At this stage you can expect to see the processing SKUs appear within your Google Cloud Billing figures and this is where costs may rise!

A key point here, though: while Dataplex itself with a Dataproc Metastore will perform some costly processes, other services under the Dataplex umbrella, such as Data Lineage, Data Profiling, Data Catalog and Data Quality, do require turning on before they incur costs. This blog will cover the Lineage and Catalog APIs in particular, as those are typically where we see the biggest cost implications for our customers.

Regarding the Data Lineage API: as its name suggests, it is used to scan all your data processes defined via Dataplex and will track any operations that led to the creation of your assets, for example any BigQuery jobs, Dataflow pipelines, or Pub/Sub or Dataproc operations that modified those assets. While this may sound benign, each of these operations can generate a vast number of API calls, which ties directly into the pricing of the Data Lineage API, as explained below.

The Data Catalog API like the Data Lineage API is another separate API that must be enabled. It works directly with Data Lineage to store metadata related to your operations and also works with most Dataplex umbrella services to generate API calls from your operations associated with Dataplex and the Data Lineage API in particular.

Google’s Dataplex pricing pages detail the breakdown of your Data Lineage and Data Catalog costs in particular.

As mentioned, both APIs are billed based on API calls and depending on the scalability of your various Data services used in Google Cloud, these can easily ramp up into the millions of calls, making this a significant charge when the two APIs are enabled.

In addition, Data Catalog also uses the Dataproc metastore to store metadata, which as per Dataplex’s use of the same will cost more depending on the version of the Dataproc metastore you are using, as well as the volume of metadata stored of course.

Estimating the overall Cost Impact of Dataplex

One of the main questions our customers ask us around these Dataplex changes is — “How can we estimate how much all these Dataplex elements will cost us?” — unfortunately the quick answer to this is that there is no easy way to estimate this.

From speaking with Google directly on this, our Engineers have found that there is no way to link a given process with a specific Dataplex API call, meaning that estimating this element of your costs is virtually impossible.

The Processing cost element is also hard to estimate, though it is documented that these costs scale linearly with the amount of data associated with your Dataplex Lakes.

While this isn’t the nice and easy answer you’d hope for here, we can certainly mitigate ALL of these cost elements with good management of your Dataplex APIs and Services!

Mitigating the Impact of these additional costs — your options:

Disable Dataplex and its associated APIs Altogether:

Users who are not currently leveraging the advanced features of Dataplex or do not have immediate plans for implementing its particular data governance features may consider disabling Dataplex and its APIs altogether.

An easy way to achieve this is through the Google Cloud Console: via the aptly named “APIs and Services” page you can enable or disable any of your APIs.

When you go to that page in the Google Cloud Console, you will see a list of all your enabled APIs and Services. From there, simply search for the specific API you wish to disable, i.e. the Cloud Dataplex, Data Catalog or Data Lineage APIs, and click “Disable API”, thus disabling the API and all costs or operations it would incur.

Alternatively, a number of our customers at the associated webinar asked whether this process can be done programmatically via command-line scripts. The answer is yes: the extract below can be run against a specific API within your Google Cloud Projects to disable it programmatically:

gcloud services disable API_SERVICE_NAME --project PROJECT_ID

The above command will not work if the API has been used in the last 30 days, so in that case you would need to add the --force option after the API service name to disable the Dataplex APIs successfully.
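For example, to disable the three APIs discussed in this post within one project (these are, at the time of writing, the standard Google API service names for Dataplex, Data Catalog and Data Lineage):

gcloud services disable dataplex.googleapis.com --force --project PROJECT_ID
gcloud services disable datacatalog.googleapis.com --force --project PROJECT_ID
gcloud services disable datalineage.googleapis.com --force --project PROJECT_ID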

A similar script can also be run across your GCP Organization, should you need to run this process across multiple projects within your GCP Org; an example is below:

API_TO_DISABLE=API_SERVICE_NAME

for PROJECT_ID in $(gcloud projects list --filter='parent.id=ORG_ID' --format='value(projectId)')
do
  echo "Processing $PROJECT_ID"
  # Disable the API for the project
  echo "Disabling $API_TO_DISABLE in $PROJECT_ID"
  gcloud services disable $API_TO_DISABLE --project $PROJECT_ID
done

Handily, in addition to this, in their original email sent on February 16th, Google provided a form link that automatically prevents the APIs due to be activated from March 4th 2024 from being enabled in your projects. So if you, or anyone with the relevant access in your organization, have already followed that link, then all the relevant APIs should already be disabled.

For the purposes of this blog post, however, we would stress that you should only disable APIs and Services in Google Cloud if you are certain they are not being used by any of your business processes. For that reason, the above scripts should only be run against specific Dataplex-related APIs, and only if you are certain you are not using the Dataplex Data Lineage or Data Catalog style processes yourselves.

Ensure Selective Dataplex Feature Activation:

For users who wish to maintain the balance between enhanced data governance and cost-effectiveness by using Dataplex, selectively activating specific Dataplex features might be a prudent strategy.

By carefully evaluating the specific Dataplex functionalities and APIs needed for their projects, users can enable only those features that align with their requirements. We urge Google Cloud Users to think carefully about what they need to use Dataplex for and if they do decide to use it, to tread carefully with the features and APIs enabled.

We also advise Google Cloud users that do wish to use the Dataplex service to create a single “admin” GCP project where all your Dataplex functionality lives and runs to minimize costs and complexity for your business following this change. This will serve to limit the impact of the various API calls, Processing and Metadata Storage charges that may occur as a result of enabling Dataplex and its features across multiple projects in your Google Cloud Organization.
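As a minimal sketch of this pattern (the admin project ID below is a placeholder), you would enable the Dataplex API only in that dedicated project and leave it disabled everywhere else:

# Enable Dataplex only in the dedicated "admin" project
gcloud services enable dataplex.googleapis.com --project dataplex-admin-project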

Ensure Constant Monitoring and Optimization:

Constant monitoring of your Dataplex-related costs is essential. Google Cloud provides detailed billing and cost management tools that can help users keep track of their expenses. Examples include the Cloud Billing, Cloud Monitoring and Cloud Logging services, which allow you to track your direct spend by SKU, view your usage via easy-to-read dashboards, and inspect individual log entries related to the Dataplex service and the aforementioned APIs.
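If you export your billing data to BigQuery, a query along the lines below can break your Dataplex spend down by project and SKU. This is a sketch only: it assumes the standard billing export is enabled and that the service is labelled “Dataplex” in that export, and the dataset and table names are placeholders for your own export table:

bq query --use_legacy_sql=false '
SELECT
  project.id AS project_id,
  sku.description AS sku,
  ROUND(SUM(cost), 2) AS total_cost
FROM `my-project.billing_dataset.gcp_billing_export_v1_XXXXXX`
WHERE service.description = "Dataplex"
GROUP BY project_id, sku
ORDER BY total_cost DESC'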

Regularly reviewing these reports and service outputs allows Google Cloud users to identify any unexpected spikes in costs and take timely actions to optimize their usage.

Users can also set alerts via Cloud Monitoring alerting policies to proactively track breaches of specific usage metrics they wish to watch, such as those of the Dataplex service and its connected resources.
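A related, simpler guardrail is a Cloud Billing budget with threshold notifications, which can also be created from the command line. The below is a sketch assuming the gcloud billing budgets commands available at the time of writing; the billing account ID and amount are placeholders, and budgets can additionally be scoped to specific services such as Dataplex:

gcloud billing budgets create \
  --billing-account=XXXXXX-XXXXXX-XXXXXX \
  --display-name="Dataplex cost watch" \
  --budget-amount=100USD \
  --threshold-rule=percent=0.5 \
  --threshold-rule=percent=0.9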

Conclusion:

The forthcoming default activation of the Dataplex API in Google BigQuery projects presents both opportunities and challenges for Google Cloud customers.

By understanding the implications and implementing proactive cost management strategies, users can continue to harness the power of BigQuery and Dataplex’s evolving capabilities while ensuring cost-effective operations.

The recommendations provided in this blog post, along with insights from DoiT’s industry experts, can position Google BigQuery users to navigate this change successfully and optimize their Google Cloud usage and spend following this impending change.

To close this blog post, I would like to add some final points about us at DoiT International. Assisting our customers with analysis and long-term planning after widely impactful changes such as this one is one of the major things we do as a business.

So if you aren’t a DoiT customer yet, we urge you to take a look at our pages here. We have an extremely skilled and experienced team at DoiT, we always help our customers make the most of their Google Cloud (as well as AWS and Azure) resources, and we continually put together blogs and webinars such as this one to guide our customers through these changes as they occur.

Thanks for reading, and hopefully this post, our wider content and our experienced team will help you navigate this change to BigQuery and Dataplex features in the near future!

Additional FAQs

As well as the LinkedIn article from DoiT’s own Sayle Matthews here, which sheds light on the broader implications of this change to BigQuery from March 4th onwards, Sayle and I also hosted a webinar on the same topic on Thursday February 29th 2024 (linked here), where we were asked a number of additional questions. Our responses to a selection of those are below for our readers’ reference (you can hear our responses to all of them via the webinar link itself):

Q1: What would be the best (least painful) way to ‘scan’ for these enabled APIs over 100+ GCP projects?

Answer: See the supplied script in the main blog content above, plus the scan sketch below. We would re-stress here, though, that disabling APIs all in one go can be dangerous, so we urge you to selectively disable only the specific APIs that you are sure are not needed, depending on your organization’s needs.
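As a sketch of such a scan (using the same placeholder ORG_ID convention as the script in the main post), the loop below reports which projects in an organization have the Dataplex API enabled, without disabling anything:

for PROJECT_ID in $(gcloud projects list --filter='parent.id=ORG_ID' --format='value(projectId)')
do
  # Print the project only if the Dataplex API is enabled in it
  ENABLED=$(gcloud services list --enabled --project $PROJECT_ID --filter='config.name=dataplex.googleapis.com' --format='value(config.name)')
  if [ -n "$ENABLED" ]; then
    echo "$PROJECT_ID has dataplex.googleapis.com enabled"
  fi
done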

Q2: Do you know the time on March 4th when we must disable these APIs?

Answer: While Google do not specify a time of day when these changes will come into effect, we urge Google Cloud users to disable any APIs they do not need as soon as they can on or after March 4th to reduce any unforeseen impact on their Google Cloud bills. While many customers will be unaffected by this change, we do not want our customers to receive any nasty billing surprises following it.

Q3: If I am not using any of the services Google is automatically enabling here, would disabling the APIs after the 4th keep costs as they are? (knowing that we mainly use BigQuery)

Answer: That is correct; after the recent change, everything is thankfully on an “opt-in” model. We would suggest, though, that you make sure you have turned off the Data Lineage, Data Catalog and Data Discovery APIs beforehand to ensure that none of these processes starts automatically charging.

Q4: Can we use the DoiT Support console to view/track these Dataplex costs?

Answer: Yes, if you filter on the “Dataplex” Service it will list them all. When our Engineers are troubleshooting, we usually filter on Service for Dataplex and then group by SKU and project; that will show the costs per project and per SKU in the DoiT Console.