DoiT Cloud Intelligence™

Carefully calculating Google Cloud Storage Buckets size with Cloud Functions & Pub/Sub

By Dror Levy · Jan 24, 2018 · 5 min read

An inexpensive way to easily analyze all of your Google Cloud Storage buckets sizes with just a few clicks using Pub/Sub, Cloud Functions and BigQuery.

If you have buckets with a massive number of objects and you want to know their size, you might have noticed that the Stackdriver Monitoring Cloud Storage dashboard shows estimates only and doesn’t provide an accurate bucket size.

In fact, these estimates are sometimes very far from the actual bucket sizes.

gcs-calc-image-1

The size of the marked bucket according to Access Logs & Storage Logs is:

"bucket","storage_byte_hours"
"******playground","1833814969766862"

1833814969766862 / 24 = 76408957073619 bytes (~69 TB)
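The arithmetic above can be checked in a couple of lines of Python (the numbers come from the sample log line; storage_byte_hours is the number of byte-hours consumed over a 24-hour day, so dividing by 24 gives the average bytes stored that day):

```python
storage_byte_hours = 1833814969766862

avg_bytes = storage_byte_hours // 24  # average bytes stored over the day
avg_tib = avg_bytes / 1024**4         # convert bytes to TiB

print(avg_bytes)          # 76408957073619
print(round(avg_tib, 1))  # 69.5
```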

What are your options for getting a more accurate picture of your bucket sizes?

You could use the gsutil du command to get the total space used by all of the objects in a bucket. However, if you have ever run gsutil du on a large bucket with hundreds of thousands or millions of objects, you know it takes a long time: gsutil du calculates space usage by making bucket listing requests, which is slow and can also be pretty expensive. The storage.objects.list operation is a Class A operation, i.e. the most expensive operation Cloud Storage has to offer.

Another way to get a daily report of your bucket’s statistics is through Access Logs & Storage Logs for Cloud Storage. Cloud Storage offers access logs and storage logs in the form of CSV files that you can download and view. Access logs provide information about all of the requests made on a specified bucket and are created hourly, while the daily storage logs provide information about the storage consumption of that bucket for the last day.
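The storage log itself is a tiny two-column CSV, so parsing it needs nothing beyond the standard library. A minimal sketch (the column names match the sample shown earlier; the bucket name is made up):

```python
import csv
import io

def parse_storage_log(csv_text):
    """Parse a Cloud Storage storage log (CSV with bucket and
    storage_byte_hours columns) into (bucket, average_bytes) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        byte_hours = int(row["storage_byte_hours"])
        rows.append((row["bucket"], byte_hours // 24))  # byte-hours / 24h = avg bytes
    return rows

sample = '"bucket","storage_byte_hours"\n"my-playground","1833814969766862"\n'
print(parse_storage_log(sample))
# [('my-playground', 76408957073619)]
```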

Usually, you want to analyze your buckets’ storage size daily. Running gsutil du is out of the question, and currently Stackdriver Monitoring can be somewhat inaccurate for this task.

We will use the Storage Logs and a Cloud Function (CF) triggered by a Pub/Sub topic to automatically load the storage consumption log into BigQuery, where you can query the bucket size and visualize it. Storage logs that were loaded will then be moved to another bucket. Logs that were not loaded successfully or other objects will be moved to an “errors” bucket.

gcs-calc-image-2

  1. Create a bucket in which you’re going to store the storage & access logs for other buckets. Switch to your designated project and create it:
gcloud config set project PROJECT-ID
gsutil mb gs://LOGS-BUCKET
  2. We would also like to separate storage logs from usage logs, and to have a bucket for logs that failed to load into BigQuery. The logs will be moved from gs://LOGS-BUCKET to the appropriate bucket after they have been handled by the CF.
gsutil mb gs://PROCESSED-LOGS-BUCKET
gsutil mb gs://ERRORS-LOGS-BUCKET
gsutil mb gs://USAGE-LOGS-BUCKET

Note: You can also use different folders and use only one bucket, but this will require you to modify the CF code a bit.

  3. Allow Google’s Cloud Storage Analytics service account to write to our new bucket:
gsutil acl ch -g cloud-storage-analytics@google.com:W gs://LOGS-BUCKET
  4. Enable bucket notifications to Pub/Sub. The following command will create a notification configuration for gs://LOGS-BUCKET, meaning that every time an object is created, changed, deleted or archived, a message with the relevant information will be pushed to a Pub/Sub topic. Since we only care about when objects are created (i.e. when the log objects are created), we will watch only the OBJECT_FINALIZE event.
gsutil notification create -e OBJECT_FINALIZE -f none gs://LOGS-BUCKET
  • The -f flag specifies the payload information for the message. The available options are either ‘json’ or ‘none’. We are not using any bucket metadata information in our CF so we chose ‘none’.
  • If you check the Pub/Sub page in the developer console, you will see that the command above created a topic named projects/PROJECT-ID/topics/LOGS-BUCKET. Messages for new objects will be published to this topic.
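Even with -f none, each notification still carries message attributes such as eventType, bucketId and objectId, which is all the function needs. A sketch of the kind of filtering a handler might do (the function name is illustrative, not from the repo):

```python
def should_process(attributes):
    """Return True only for newly created objects (OBJECT_FINALIZE events).

    `attributes` is the Pub/Sub message attributes dict; even with a
    payload format of 'none', Cloud Storage notifications still set
    attributes like eventType, bucketId and objectId.
    """
    return attributes.get("eventType") == "OBJECT_FINALIZE"

print(should_process({"eventType": "OBJECT_FINALIZE", "objectId": "some-log"}))  # True
print(should_process({"eventType": "OBJECT_DELETE", "objectId": "some-log"}))    # False
```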
  5. Create a new BigQuery dataset and table:
bq mk MY_DATASET
bq mk --schema project_id:string,bucket:string,storage_byte_hours:integer,bytes:integer,date:date,update_time:timestamp,filename:string -t MY_DATASET.MY_TABLE
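For reference, a row the function might stream into this table could be assembled like this (a sketch only; the field names follow the schema above, and the fixed timestamp is just for illustration):

```python
import datetime

def build_row(project_id, bucket, storage_byte_hours, filename):
    """Build a dict matching the MY_DATASET.MY_TABLE schema above."""
    now = datetime.datetime(2018, 1, 24, 12, 0, 0)  # use datetime.datetime.utcnow() in practice
    return {
        "project_id": project_id,
        "bucket": bucket,
        "storage_byte_hours": storage_byte_hours,
        "bytes": storage_byte_hours // 24,  # average bytes stored that day
        "date": now.date().isoformat(),
        "update_time": now.isoformat(),
        "filename": filename,
    }

row = build_row("my-storage-project", "my-enormous-bucket", 1833814969766862,
                "PROJECT_my-storage-project_BUCKET_my-enormous-bucket_storage_2018_01_24")
print(row["bytes"])  # 76408957073619
```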
  6. Create a staging bucket for the function we are about to deploy:
gsutil mb gs://CF-STAGING-BUCKET
  7. Clone the Cloud Function code:
git clone https://github.com/doitintl/gcs-stats.git
cd gcs-stats

Important: Open config.js with any text editor and edit the values according to the previous steps.

  8. After editing the config file, deploy the function:
gcloud beta functions deploy gcs-stats \
  --project PROJECT-ID \
  --entry-point gcsStatsHandler \
  --stage-bucket gs://CF-STAGING-BUCKET \
  --source . \
  --trigger-topic LOGS-BUCKET \
  --memory 128MB \
  --timeout 10s
  9. Finally, we can enable access & storage logs on any bucket we want to analyze. The target bucket for the logs is, of course, the bucket we created in the first step: gs://LOGS-BUCKET.

We would like to have a column for the project ID of each bucket. Since the logs don’t include this information, we will use a custom prefix for the log file names. The prefix we chose is PROJECT_[ANALYZED-PROJECT-ID]_BUCKET_[ANALYZED-BUCKET].

Important: if you already have logging enabled on the bucket, executing the following command will overwrite the existing logging configuration.

gsutil logging set on -b gs://LOGS-BUCKET -o PROJECT_[ANALYZED-PROJECT-ID]_BUCKET_[ANALYZED-BUCKET] gs://ANALYZED-BUCKET

For example, if you have a bucket called gs://my-enormous-bucket and it resides in the project my-storage-project, the prefix should be PROJECT_my-storage-project_BUCKET_my-enormous-bucket. If you decide not to use the custom prefix, the log objects will have the default name prefix, and the project_id column in the BigQuery table will contain null values.
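Recovering the project ID and bucket from a log object’s name is a small regex exercise. A sketch, assuming the log-object suffix follows Cloud Storage’s _storage_/_usage_ naming convention and that neither name contains the literal substring _BUCKET_:

```python
import re

# Matches the custom prefix described above, e.g.
# PROJECT_my-storage-project_BUCKET_my-enormous-bucket_storage_2018_01_24_...
PREFIX_RE = re.compile(r"^PROJECT_(?P<project>.+?)_BUCKET_(?P<bucket>.+?)_(storage|usage)_")

def parse_log_name(object_name):
    """Extract (project_id, bucket) from a log object name, or (None, None)
    when the object uses the default prefix instead of the custom one."""
    m = PREFIX_RE.match(object_name)
    if not m:
        return None, None
    return m.group("project"), m.group("bucket")

print(parse_log_name(
    "PROJECT_my-storage-project_BUCKET_my-enormous-bucket_storage_2018_01_24_08_00_00_0_v0"))
# ('my-storage-project', 'my-enormous-bucket')
```

Objects with the default prefix fall through to (None, None), which is exactly why the project_id column ends up null without the custom prefix.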

  10. Enable logs for any other bucket you want to monitor, as explained in the previous step.

When the storage logs are written each day, a message is pushed to the Pub/Sub topic, and these messages trigger the Cloud Function. If the new object is a storage log, it will be read and inserted into the BigQuery table and then moved to the gs://PROCESSED-LOGS-BUCKET bucket. If it is an access log, it will be moved to gs://USAGE-LOGS-BUCKET. Any other object will remain in the bucket. If an error occurred while loading the log, it will be moved to gs://ERRORS-LOGS-BUCKET.
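The routing described above can be summarised in a few lines. A sketch only — detecting the log type by filename substring is an assumption about the function’s logic, and the return values are short labels for the three destination buckets created in step 2:

```python
def route_log(object_name, load_ok=True):
    """Decide where a log object ends up after the function handles it.

    Returns "processed", "errors" or "usage" (labels for the three
    destination buckets created earlier), or None when the object
    should stay in the logs bucket.
    """
    if "_storage_" in object_name:
        return "processed" if load_ok else "errors"  # storage log: loaded or failed
    if "_usage_" in object_name:
        return "usage"                               # access (usage) log
    return None                                      # not a log: leave it in place

print(route_log("PROJECT_p_BUCKET_b_storage_2018_01_24"))                 # processed
print(route_log("PROJECT_p_BUCKET_b_storage_2018_01_24", load_ok=False))  # errors
print(route_log("PROJECT_p_BUCKET_b_usage_2018_01_24"))                   # usage
print(route_log("random-object"))                                         # None
```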

You might want to connect the BigQuery table to Google Data Studio and create yourself a dashboard to easily visualize the data.

gcs-calc-image-3