Automate BigQuery Dataset & Table Size Reporting using BigQuery or Google Sheets
UPDATE JAN 18, 2021: Updated the process and repo code so that you only need to specify your GCP Project IDs, instead of needing to manually enter all your projects AND dataset IDs. So…it’s even easier now!
I wanted to see if it was possible to easily automate the process for tracking how my different BigQuery datasets and tables changed over time. At some point, I’ll also add in the ability to get email or Google Chat notifications when a threshold has been breached (table increased/decreased too much in between checks), but that’s for another day. Today, we’re just collecting information about our tables for easy reporting and visualization.
There’s obviously a bunch of different ways to do this, including scheduled queries in BigQuery itself, but that doesn’t handle pushing the data into a Google Sheet automatically either. Plus… I wanted to see if this was possible in Apps Script, because now I can repurpose the query function for other things in the future too.
Here are all the things we’ll use:
- BigQuery Datasets/Tables — to check their size, across multiple projects
- Apps Script — to handle the code and schedule the checks to run automatically
- BigQuery table — to store what we collected, if we don’t want to use Sheets
- Google Sheet — to store what we collected, if we don’t want to use BQ
Let’s get started.
Create an Apps Script project
- Go to script.google.com and create a new project
- Because there’s no Git integration with Apps Script, you’ll need to copy the file contents over manually
- Copy the contents of the files from the GitHub repository shared here: https://github.com/usaussie/appscript-bigquery-reporter
- You should end up with 3 files — Jobs.gs, Helpers.gs, and Config.gs (renamed from the sample provided)
Update the project to use your BigQuery / Sheet info
- Update the Config.gs file to point at your own BigQuery project(s) and datasets that you want to track, then provide a BigQuery table to store the results, and finally a Google Sheet URL.
// BigQuery Project ID - all lowercase
const BQ_PROJECT_ID = 'your-gcp-project-id';
// Create a dataset in the BigQuery UI (https://bigquery.cloud.google.com)
const BQ_DATASET_ID = 'your-gcp-dataset-id';
// Table ID to store the collected stats
const BQ_TABLE_ID = 'your-gcp-table-id';
// for storing the metrics in google sheets too
const SHEET_URL = "https://docs.google.com/spreadsheets/d/your-google-sheet/edit";
const MAIN_SHEET_TAB_NAME = "Sheet1";
// these are the projects & datasets that you want to track
function project_dataset_list() {
return [
['gcp-project-id-a', 'dataset_one'],
['gcp-project-id-a', 'dataset_two'],
['gcp-project-id-b', 'dataset_one'],
['gcp-project-id-c', 'dataset_one']
];
}
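The repo’s Helpers.gs handles the actual metadata lookup, so you don’t need to write any of this yourself. For illustration only, here’s a rough sketch of the kind of query involved: BigQuery exposes per-table size and row counts through each dataset’s `__TABLES__` meta-table, and a small helper (the function name here is my own, not from the repo) can build one such query per project/dataset pair:

```javascript
// Hypothetical helper (not from the repo): builds a metadata query that
// returns the size and row count for every table in one project/dataset pair.
// BigQuery exposes these stats via the dataset's __TABLES__ meta-table.
function buildStatsQuery(projectId, datasetId) {
  return 'SELECT project_id, dataset_id, table_id, row_count, size_bytes ' +
         'FROM `' + projectId + '.' + datasetId + '.__TABLES__`';
}

// Example: build one query per entry, mirroring project_dataset_list()
const pairs = [
  ['gcp-project-id-a', 'dataset_one'],
  ['gcp-project-id-b', 'dataset_one']
];
const queries = pairs.map(function (p) { return buildStatsQuery(p[0], p[1]); });
```

In Apps Script, each of those query strings would then be run through the BigQuery advanced service; again, the repo code is the source of truth for how that’s wired up.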
- That’s it. That’s all you have to configure yourself, and it’s not really even coding; you’re just pointing at IDs you already manage.
- If you don’t already have a BQ table to store the results, don’t worry, you can create that from this project too.
Enough setup, let’s do this…
- First, let’s create our BQ table — open the Jobs.gs file, and run the create_tables_one_time() function and the set_sheet_headers() function once.
- After the functions have finished executing, you can check either your BigQuery table or your Google Sheet to see the results.
- You should now see rows that contain your project ID, dataset ID, table ID, and then the table size (bytes) and number of rows.
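The rows described above map to a fixed column order. As a sketch (this helper is hypothetical, not from the repo), shaping one metadata result into a sheet row might look like this, with a timestamp added so you can trend the values over time:

```javascript
// Hypothetical sketch (not from the repo): shape one metadata result into
// the column order described above, plus a timestamp for trending.
function toSheetRow(stat, checkedAt) {
  return [
    stat.project_id,
    stat.dataset_id,
    stat.table_id,
    stat.size_bytes,
    stat.row_count,
    checkedAt
  ];
}

const row = toSheetRow(
  {
    project_id: 'gcp-project-id-a',
    dataset_id: 'dataset_one',
    table_id: 'events',
    size_bytes: 1048576,
    row_count: 4200
  },
  '2021-01-18T00:00:00Z'
);
// In Apps Script, a row like this could then be appended with:
// SpreadsheetApp.openByUrl(SHEET_URL)
//     .getSheetByName(MAIN_SHEET_TAB_NAME).appendRow(row);
```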
Schedule this to run automatically
- Click on the Triggers section in your Apps Script project, and follow the prompts to run the job_get_bq_stats() function daily (or hourly, depending on your needs).
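If you’d rather not click through the UI, the same time-driven trigger can be created in code with the Apps Script ScriptApp service. This is a sketch, not part of the repo; you’d run it once from the editor:

```javascript
// Sketch: create the daily trigger programmatically instead of via the
// Triggers UI. Run this once from the Apps Script editor.
function installDailyTrigger() {
  ScriptApp.newTrigger('job_get_bq_stats')
      .timeBased()
      .everyDays(1)
      .atHour(3) // runs around 3am in the script's time zone
      .create();
}
```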
That’s it!
The next thing I’d like to add would be alerting when datasets grow or change “too” much between checks.
Note, you can obviously connect this to a Google Data Studio dashboard to chart your table data over time, and to identify outliers etc.
Finally, here’s the link to the Git repo for the code again: https://github.com/usaussie/appscript-bigquery-reporter. Feel free to use, enhance, or augment it for your own needs.