Google BigQuery Best Practices
Following the right standards — Make it habitual
BigQuery is a powerful cloud data warehouse engine, and Google does not prescribe a single strict set of practices to follow. However, I have curated the practices I have found through deep, hands-on work that are suitable for query standards, performance, and working as a team.
Dataform
Ultimately, all of our BigQuery work needs to be performed in an IDE, and that IDE is Dataform.
Dataform was a start-up company that was later acquired by Google. Until then Google did not have a proper IDE (in the way Visual Studio serves other development and deployment activities); everything was managed and developed directly in the BigQuery console. Dataform solved that problem, and it offers much more than just an IDE.
Dataform is mainly used to define, develop, test, schedule and version-control your SQL, which makes promoting code from pre-production to production easy.
If you are using BigQuery directly in the web console, I would strongly suggest moving your workloads to Dataform.
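For illustration, here is a minimal sketch of what a Dataform SQLX definition could look like (the dataset, table and column names are invented for this example, and payments is assumed to be declared elsewhere in the project):

config {
    type: "table",
    schema: "analytics",
    description: "Daily paid amount per account",
    tags: ["daily"]
}

SELECT
    account_id
    , DATE(paid_at) AS paid_date
    , SUM(amount) AS amount_paid_gbp
FROM ${ref("payments")}
GROUP BY account_id, paid_date

Dataform compiles this into standard SQL, tracks the dependency created by ref(), and keeps the definition under version control so it can be tested and scheduled.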
Broadcast joins
When joining a large table to a small table, BigQuery creates a broadcast join where the small table is sent to each slot processing the large table.
Even though the SQL query optimizer can determine which table should be on which side of the join, it is recommended to order joined tables appropriately. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.
-- Good
SELECT source.*
FROM source
LEFT JOIN other_source
    ON source.account_id = other_source.account_id
WHERE ...

-- Bad
SELECT *
FROM source a
LEFT JOIN other_source b
    ON a.account_id = b.account_id
WHERE ..
Units
When naming columns with units the following defaults should be used:
- Currency — GBP
- Energy quantity — kWh
The column name should end with the unit as a suffix (_unit), so a financial column name may be amount_paid_gbp. This is important to ensure full traceability and understanding in the future.
-- Good
SELECT SUM(amount) AS amount_paid_gbp

-- Bad
SELECT SUM(amount) AS amount_paid
Do not Create external tables (directly from Drive)
Avoid creating tables directly from external sources, e.g. from Google Drive.
Creating an external table is not advised: it makes the query plan heavier, and access-level issues can occur, such as:
bigquery error: Access Denied: BigQuery: Permission denied while getting Drive credentials.
When a table is created directly from Google Drive it is an external table, which means every query reads the Drive sheet directly.
If the circumstances of the file change (its schema, its access permissions, or the file being deleted), then any query that joins to it will fail completely, causing the Dataform schedule to error and fail, which in turn affects other development.
Solution: if you need to check something quickly from a sheet, either upload it directly from your local machine or upload it to Google Cloud Storage and create a native table from there. Make sure to drop the table once it is no longer used.
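As an illustration, one way to load such a file from Cloud Storage into a native table is the LOAD DATA statement (the bucket, dataset and table names below are placeholders):

-- Load a CSV exported from the sheet into a native table
LOAD DATA INTO my_dataset.quick_check_sheet
FROM FILES (
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://my-bucket/exports/quick_check.csv']
);

-- Drop it again once the check is done
DROP TABLE IF EXISTS my_dataset.quick_check_sheet;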
Partitions
Partitioning is a must-use feature, especially when you are dealing with larger datasets. As you know, BigQuery pricing is either pay-per-query (on-demand) or flat-rate (reserved slots).
Partitioning matters most when you are on pay-per-query pricing, because each query is billed on the bytes it scans.
Create tables with partitions. If you do not have a specific field to use as the partition key, partition by ingestion time; queries against partitioned tables can then retrieve the data with far fewer table scans, provided they filter on the partition column.
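For example, a sketch of a column-partitioned table, an ingestion-time partitioned table, and a query that prunes partitions (all names are illustrative):

-- Partition on a date/timestamp column when a suitable one exists
CREATE TABLE my_dataset.payments
(
    account_id STRING,
    amount_paid_gbp NUMERIC,
    paid_at TIMESTAMP
)
PARTITION BY DATE(paid_at);

-- Otherwise fall back to ingestion-time partitioning
CREATE TABLE my_dataset.raw_events
(
    payload STRING
)
PARTITION BY _PARTITIONDATE;

-- Filtering on the partition column limits the bytes scanned
SELECT
    account_id
    , amount_paid_gbp
FROM my_dataset.payments
WHERE DATE(paid_at) >= '2021-10-01';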
Policy Tags & Row level Security
Applying policy tags enables you to label each and every field with tags such as 'Confidential', 'Internal', 'Public' and many more.
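Row-level security is applied separately through row access policies. A minimal sketch, assuming a hypothetical table, column and group (none of these names come from a real project):

CREATE ROW ACCESS POLICY uk_rows_only
ON my_dataset.customer_accounts
GRANT TO ('group:uk-analysts@example.com')
FILTER USING (region = 'UK');

Members of the granted group will then only see rows where region = 'UK' when they query the table.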
Field Naming and Naming Conventions
- New field and table names should be in lower case.
- Ambiguous fields such as id, name, or type should always be prefixed by what they refer to.
- Use AS to alias fields. When 10 or more fields exist in a SELECT block, make sure each AS is aligned.
- Use commas before each field name rather than at the end. This allows for the quicker commenting out/removal of individual fields.
-- Good
SELECT
    id AS account_id
    , name AS account_name
    , type AS account_type

-- Bad
SELECT id, name, type, ...
- Field names should be snake-cased, not hyphenated, camel-cased or fully upper-cased
-- Good
SELECT dvcecreatedtstamp AS device_created_timestamp
FROM table

-- Bad
SELECT dvcecreatedtstamp AS DeviceCreatedTimestamp
FROM table
- Boolean field names (fields which have values true or false) should start with has_, is_, or does_
-- Good
SELECT
    deleted AS is_deleted
    , sla AS has_sla
FROM table

-- Bad
SELECT deleted, sla
FROM table
Dates
- Timestamps should end with _at, e.g. deal_closed_at, and should always be in UTC
- Dates should end with _date, e.g. deal_closed_date, and should always be in UTC
- Months should be indicated as such and should always be truncated to a date format, e.g. deal_closed_month
- Always avoid key words like date or month as a column name
- Prefer the explicit date function over date_part, but prefer date_part over extract, e.g. DAYOFWEEK(created_at) > DATE_PART(dayofweek, 'created_at') > EXTRACT(dow FROM created_at)
- Note that selecting a date's part is different from truncating the date. date_trunc('month', created_at) will produce the calendar month ('2019-01-01' for '2019-01-25') while SELECT date_part('month', '2019-01-25'::date) will produce the number 1
- Be careful using DATEDIFF, as the results are often non-intuitive (see the BigQuery note after this list).
- For example, SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') returns 1 even though the timestamps are different by one millisecond.
- Similarly, SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999') returns 0 even though the timestamps are nearly an entire day apart.
- Using the appropriate interval with the DATEDIFF function will ensure you are getting the right results. For example, DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') will provide a 1 day interval and DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') will provide a 1 millisecond interval.
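Note that the DATEDIFF examples above use Snowflake-style syntax. In BigQuery the equivalents are DATE_DIFF and TIMESTAMP_DIFF, and a similar boundary trap exists; a quick sketch:

-- DATE_DIFF counts date boundaries crossed, so this returns 1
SELECT DATE_DIFF(DATE '2001-12-02', DATE '2001-12-01', DAY);

-- TIMESTAMP_DIFF counts whole elapsed intervals, so this returns 0
-- even though the timestamps sit either side of midnight
SELECT TIMESTAMP_DIFF(TIMESTAMP '2001-12-02 00:00:00',
                      TIMESTAMP '2001-12-01 23:59:59.999', DAY);

-- Use MILLISECOND when millisecond precision matters (this returns 1)
SELECT TIMESTAMP_DIFF(TIMESTAMP '2001-12-02 00:00:00',
                      TIMESTAMP '2001-12-01 23:59:59.999', MILLISECOND);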
Use CTEs (Common Table Expressions) instead of subqueries
- A CTE, defined by a WITH statement, is a temporary, named data set to be used as a part of a query. It is stored in memory and only exists during the execution of that query.
- A subquery is a nested query, which passes its results to be used by the outer query.
- CTEs are preferred due to their performance, readability and re-usability.
- CTEs should be placed at the top of the query.
- CTE names should be as concise as possible while still being clear as to what the CTE data contains.
- CTEs with complex/notable logic should be commented.
- Leave empty rows above and below each CTE block.
Example formatting:
WITH events AS (
    -- think of these select statements as your import statements
    ...
),

filtered_events AS (
    -- CTE comments go here
    ...
)

SELECT *  -- you should always aim to "select * from final" for your last model
FROM filtered_events
General
- Indents for the main part of a query should be 4 spaces (i.e. when in a CTE)
- Further indents should be two spaces (except for predicates, which should line up with the WHERE keyword)
- No tabs should be used — only spaces.
- Beware, BigQuery dataset and table names are case-sensitive, so always prefer snake_case for all of your datasets, tables, columns, views, functions, etc.
- When SELECTing, always give each column its own row, with the exception of SELECT *, which can be on a single row if the columns are unknown.
- Be cautious when using SELECT *; always add LIMIT 100 at the end of the statement to keep results small (note that LIMIT reduces the rows returned, not the bytes scanned, so also restrict the columns and partitions you query to avoid heavy query cost).
- Prefer to use "preview" in the web console instead of running a SELECT query; previewing does not cost anything because no query is actually run.
-- Good
SELECT
    id AS account_id
    , name AS account_name
    ...
FROM table
LIMIT 100

-- Okay (if column names are unknown)
SELECT *
FROM table
LIMIT 100

-- Bad
SELECT * FROM table ..
- DISTINCT should be included on the same row as SELECT
- The AS keyword should be used when projecting a field. This isn't needed for aliasing a table.
- Ordering and grouping by field name is preferred over number (e.g. avoid GROUP BY 1, 2). Grouping by numbers is brittle.
- Prefer WHERE to HAVING when either would suffice
- Prefer UNION ALL to UNION. This is because a UNION could indicate an upstream data integrity issue that is better solved elsewhere.
- Prefer != to <>. This is because != is more common in other programming languages and reads like "not equal", which is how we're more likely to speak.
- Consider performance. Understand the difference between LIKE vs ILIKE, IS vs =, and NOT vs ! vs <>, and use them as appropriate.
- Prefer lower(column) LIKE '%match%' to column ILIKE '%Match%'. This lowers the chance of stray capital letters leading to an unexpected result.
- Familiarize yourself with the DRY principle (Don't Repeat Yourself). If you type the same line twice, it needs to be maintained in two places.
- When de-duping data by topic version, make sure you dedupe by the version number rather than the topic string (e.g. energy_contracts_contract_v4). This is in case the version number goes above 9, as deduping on the string field would prioritise topic_v9 over topic_v10 (see the sketch below).
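A minimal sketch of deduping on the numeric version, assuming a hypothetical contracts table with a topic column such as energy_contracts_contract_v10 and a contract_id key:

SELECT *
FROM my_dataset.contracts
WHERE TRUE  -- QUALIFY must be accompanied by WHERE, GROUP BY or HAVING
-- Extract the trailing version number and keep only the highest version per contract
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY contract_id
    ORDER BY CAST(REGEXP_EXTRACT(topic, r'_v(\d+)$') AS INT64) DESC
) = 1;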
Functions
- Function names should be in lower case, and keywords should all be capitalized
- Prefer IF to a single-line CASE WHEN statement
- Prefer IF to selecting a Boolean statement such as (amount < 10) AS is_less_than_ten
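For illustration, following the two rules above (the column and table names are made up):

-- Good
SELECT IF(amount < 10, TRUE, FALSE) AS is_less_than_ten
FROM payments

-- Bad
SELECT CASE WHEN amount < 10 THEN TRUE ELSE FALSE END AS is_less_than_ten
FROM payments

-- Bad
SELECT (amount < 10) AS is_less_than_ten
FROM payments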
JOINS
- Be explicit when joining, e.g. use LEFT JOIN instead of JOIN. (Default joins are INNER)
- Prefix the table name to a column when joining, otherwise omit it
- When table names are long, use sensible aliases which are:
  - Short, but not too short
  - Obvious as to what data source they refer to
- When table names are short, use them directly instead of creating an alias
- Specify the order of a join with the FROM table first and JOIN table second
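For example, a join following the rules above (table and column names are illustrative):

SELECT
    accounts.account_id
    , accounts.account_name
    , billing.amount_paid_gbp
FROM accounts
LEFT JOIN billing_transactions billing
    ON accounts.account_id = billing.account_id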
Comments
Comments should be provided on all logic for WHERE and HAVING clauses, joins and calculated fields.
When making single-line comments in a model, use the # syntax.
When making multi-line comments in a model, use the /* */ syntax.
Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long.
When creating a view, script, job etc. add a description at the top of the code to include the following elements:
/*
Author: Vis
CreatedOn: 27/10/2021
Type: user-defined function
Purpose: converts the date string into proper date format
*/
More information on the coding principles can be found in the blog below:
How to Handle Data in Cloud. Following the right principles &…
If you like the above, you may also like my other blogs
1.) How to build Modern Cloud Data warehouse
2.) What it takes to Run Terabytes/Petabytes of Data in Cloud? What should you Ask..?
3.) How to Handle Data in Cloud. Following the right principles &…
4.) Why should you consider Data Lake in your Cloud Architecture..?
Conclusion
Try to take all of the best practices shared above, add them to your existing BigQuery practices, and over time make them habitual.
I will cover more on these related concepts and bring in best practices in my future blogs.
#GoogleBigquery #AzureSynapseInfluencer #BigQuery
If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.
Explore ~Learn ~Do