In this article we will cover key factors in reducing costs over BigQuery from writing queries properly and building data architecture to support one goal – reducing costs.
Queries in BQ Cost Money
Pricing in BQ is in a pay as you go model : $5 per 1TB scanned when executing your query
Rule of thumb – in most of the cases, if your query costs more than $1 you are apparently doing something wrong.
Be in control of your Cost – This tool can help you dramatically with that :
- BigQuery Mate – (free chrome extension)
Do’s and Don'ts
- Try to avoid using “Select *”,
This way you avoid paying for data scans of columns that you don’t really need in your query result.
- Using the “Limit” clause on a query will not affect the amount of data scanned.
- Use the “Table Preview” option in order to see few lines of the table (instead of using limit) – This is Free
- There is also a more extreme protection level under the Query settings – Advanced options section : “Maximum bytes billed”
Using this will limit the bytes billed for this query. If this query will have bytes billed beyond this limit, the query will be failed (without incurring a charge).
- use the default table expiration time to remove the data when it’s no longer needed. – This will save Storage Costs.
- Use streaming inserts only if your data must be immediately available to query from. In other words – Don’t use streaming inserts unless you have to ! It costs money.
- Quotas – You can define a hard limit of data scan of a daily level or even a project level – but not on a user level. That’s a good practice to set a daily budget that will keep you in the safe zone in terms of cost.
Performance (& Cost) Key Factors
- Partition your tables – the significantly big ones !
This will improve both query performance and cost.
- There are several ways to partition tables – the most common and most of the time also most cost and performance effective is the time-partitioned tables method.
- De-normalise when possible. – Bigquery by its nature is designed for one huge table. Implementing a normalised table relations will result with many joins and overhead in terms of performance and cost.
- Use “Order By” only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
- Avoid self-joins. Use a window function instead.
- Same thing applies on Cross-Joins
Cost reducing Data Architecture Guidelines
- As mentioned at the top of the article – The key principle that guides us when designing a cost effective architecture is to reduce the amount of data scanned by the end user.
- In most cases the end user might be the analyst or a business user using a BI platform.
- We want to reduce the amount of data scanned both in terms of cost and lets not forget performance.
You don’t want your CEO to wait for a dashboard more than a few seconds until it’s presented with fresh data.
- The way of reducing the data is by a simple layering method that starts with the raw data – ingestion layer and ends with a very small portion of it designed and grouped specifically for the BI tool of your choice – The Presentation layer
- In between those layers you should do all the cleansing / transforming /joining /aggregating in order to support the top layer to be optimised for fast and cheap queries for your BI visualisations.
- We strongly recommend decoupling your compute and storage.
A common solution to make sure of that is by using Airflow to orchestrate all of your data pipelines.
- The data operations on each level can be executed by a tool of your choice. In order to simplify the process, a good practice is to start with implementing it solely with internal views.
- This is how it looks like eventually :