Google BigQuery Best Practices

Following the right standards — Make it habitual

Vis - The Data Enthusiast
8 min read · Aug 8, 2023

BigQuery is a super-powerful cloud data warehouse engine. Google does not prescribe a strict set of practices to follow; however, I have curated here everything I have learned from working with it deeply, covering query standards, performance, and usage as a team.

Dataform

Ideally, all of our BigQuery development should be performed in an IDE, and the managed IDE for BigQuery is Dataform.

Dataform was a start-up that Google later acquired. Until then, Google did not have a proper IDE (in the way Visual Studio serves other developers) for BigQuery development and deployment activities; everything was managed and developed directly in the BigQuery portal. Dataform solved that problem, and it offers much more than just an IDE.

Dataform in BigQuery

Dataform is mainly used to define, develop, test, schedule, and version-control your SQL workflows, which makes promoting code from pre-production to production easy.

If you are using BigQuery directly in the web console, I would strongly suggest moving your workloads to Dataform.
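To give a flavour, here is a minimal sketch of a Dataform definition (a .sqlx file); the schema and table names are illustrative, not from any real project:

config {
  type: "table",
  schema: "staging",
  description: "Staging model for raw account records"
}

SELECT
    account_id
    , account_name
FROM ${ref("raw_accounts")} -- ref() is how Dataform builds its dependency graph

Because ref() declares dependencies explicitly, Dataform can compile, test, and schedule the whole graph in the right order.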

Broadcast joins

When joining a large table to a small table, BigQuery creates a broadcast join where the small table is sent to each slot processing the large table.

Even though the SQL query optimizer can determine which table should be on which side of the join, it is recommended to order joined tables appropriately. The best practice is to place the largest table first, followed by the smallest, and then by decreasing size.

-- Good
SELECT source.*
FROM source
LEFT JOIN other_source
  ON source.account_id = other_source.account_id
WHERE ...

-- Bad
SELECT *
FROM source a
LEFT JOIN other_source b
  ON a.account_id = b.account_id
WHERE ...

Units

When naming columns with units the following defaults should be used:

  • Currency — GBP
  • Energy quantity — kWh

The column name should end with the unit as a _unit suffix, so a financial column name may be amount_paid_gbp. This is important to ensure full traceability and understanding in the future.

-- Good
SELECT SUM(amount) AS amount_paid_gbp

-- Bad
SELECT SUM(amount) AS amount_paid

Do not create external tables (directly from Drive)

Avoid creating tables directly from external sources, e.g. from Google Drive.

Creating an external table is not advised: it makes the query plan heavier, and access-level issues can occur, such as:

bigquery error: Access Denied: BigQuery: Permission denied while getting Drive credentials.

When a table is created directly from Google Drive it is an external table, which means every query reads the Drive sheet directly.

If the file's circumstances change (its schema, its access, or even whether it still exists), any query that joins against it will fail completely, causing the Dataform schedule to error out and blocking other development.

Solution: if you need to check something quickly from Sheets, either upload the file directly from your local machine or upload it to Google Cloud Storage and create a native table from there. Make sure to drop the table when it is no longer used.
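A hedged sketch of that workflow (the bucket, dataset, and table names are illustrative):

-- Load a CSV exported to Cloud Storage into a native table
LOAD DATA INTO my_dataset.quick_check
FROM FILES (
    format = 'CSV',
    skip_leading_rows = 1,
    uris = ['gs://my-bucket/sheet_export.csv']
);

-- Drop it once you are done
DROP TABLE my_dataset.quick_check;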

Partitions

Partitioning is a must-use element, especially when you are dealing with larger data sets. As you know, BigQuery pricing is either pay-per-query or flat-rate (reserved slots).

This matters most when you are on pay-per-query pricing, so consider the following.

Create tables with partitions: if you do not have a specific field to use as the partition key, partition by ingestion datetime. Partitioned tables can then retrieve data with far fewer bytes scanned.
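A minimal sketch of both options (all names are illustrative):

-- Partition on a date derived from a timestamp column
CREATE TABLE my_dataset.events
PARTITION BY DATE(event_timestamp)
AS
SELECT * FROM my_dataset.raw_events;

-- No suitable column? Partition by ingestion time instead
CREATE TABLE my_dataset.raw_feed (event_id STRING, payload STRING)
PARTITION BY _PARTITIONDATE;

-- Filtering on the partition column keeps the table scan small
SELECT event_id
FROM my_dataset.events
WHERE DATE(event_timestamp) = '2023-08-01';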

Policy Tags & Row level Security

Applying policy tags enables you to label each and every field with tags such as 'Confidential', 'Internal', 'Public', and many more.

Policy Tag
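Row-level security complements policy tags by controlling which rows a user can see. A hedged sketch (the group, table, and filter are illustrative):

CREATE ROW ACCESS POLICY uk_analysts_only
ON my_dataset.accounts
GRANT TO ('group:uk-analysts@example.com')
FILTER USING (region = 'UK');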

Field Naming and Naming Conventions

  • New field and table names should be in lower case.
  • Ambiguous fields such as id, name, or type should always be prefixed by what they refer to.
  • Use ‘AS’ to alias fields. When 10 or more fields exist in a SELECT block, make sure each AS is aligned.
  • Use commas before each field name rather than at the end. This allows for the quicker commenting out/removal of individual fields.

-- Good
SELECT id AS account_id
     , name AS account_name
     , type AS account_type

-- Bad
SELECT id, name, type, ...

  • Field names should be snake_cased, not hyphenated, camelCased, or fully upper-cased

-- Good
SELECT dvcecreatedtstamp AS device_created_timestamp
FROM table

-- Bad
SELECT dvcecreatedtstamp AS DeviceCreatedTimestamp
FROM table

  • Boolean field names (fields which have values true or false) should start with has_, is_, or does_

-- Good
SELECT deleted AS is_deleted
     , sla AS has_sla
FROM table

-- Bad
SELECT deleted, sla
FROM table

Dates

  • Timestamps should end with _at, e.g. deal_closed_at, and should always be in UTC
  • Dates should end with _date, e.g. deal_closed_date, and should always be in UTC
  • Months should be indicated as such and should always be truncated to a date format, e.g. deal_closed_month
  • Always avoid keywords like date or month as a column name
  • Prefer the explicit date function over date_part, but prefer date_part over extract, e.g. DAYOFWEEK(created_at) > DATE_PART('dayofweek', created_at) > EXTRACT(dow FROM created_at)
  • Note that selecting a date’s part is different from truncating the date. date_trunc('month', created_at) will produce the calendar month ('2019-01-01' for '2019-01-25') while SELECT date_part('month', '2019-01-25'::date) will produce the number 1
  • Be careful using DATEDIFF, as the results are often non-intuitive.
  • For example, SELECT DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') returns 1 even though the timestamps differ by only one millisecond.
  • Similarly, SELECT DATEDIFF('days', '2001-12-01 00:00:00.001', '2001-12-01 23:59:59.999') returns 0 even though the timestamps are nearly an entire day apart.
  • Using the appropriate interval with the DATEDIFF function will ensure you get the right results: DATEDIFF('days', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') measures a 1 day boundary crossing, while DATEDIFF('ms', '2001-12-01 23:59:59.999', '2001-12-02 00:00:00.000') measures the true 1 millisecond interval.
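Note that the DATEDIFF and date_part signatures above follow other warehouses; in BigQuery the equivalents are DATE_DIFF, TIMESTAMP_DIFF, DATE_TRUNC, and EXTRACT. A minimal sketch of the same gotcha in BigQuery syntax:

SELECT DATE_DIFF(DATE '2001-12-02', DATE '2001-12-01', DAY) AS day_boundaries -- 1
     , TIMESTAMP_DIFF(TIMESTAMP '2001-12-02 00:00:00.000',
                      TIMESTAMP '2001-12-01 23:59:59.999', DAY) AS whole_days -- 0
     , TIMESTAMP_DIFF(TIMESTAMP '2001-12-02 00:00:00.000',
                      TIMESTAMP '2001-12-01 23:59:59.999', MILLISECOND) AS ms_apart -- 1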

Use CTEs (Common Table Expressions) instead of subqueries

  • A CTE, defined by a WITH clause, is a temporary, named result set used as part of a query; it exists only during the execution of that query.
  • A subquery is a nested query, which passes its results to be used by the outer query.
  • CTEs are preferred due to their performance, readability and re-usability.
  • CTEs should be placed at the top of the query.
  • CTE names should be as concise as possible while still being clear as to what the CTE data contains.
  • CTEs with complex/notable logic should be commented.
  • Leave empty rows above and below each CTE block.

Example formatting:

WITH events AS (
    -- think of these select statements as your import statements
    ...

), filtered_events AS (
    -- CTE comments go here
    ...

)

SELECT *  -- you should always aim to "select * from final" for your last model
FROM filtered_events

General

  • Indents for the main part of a query should be 4 spaces (i.e. when in a CTE)
  • Further indents should be two spaces (except for predicates, which should line up with the WHERE keyword)
  • No tabs should be used — only spaces.
  • Beware: BigQuery is case sensitive, so always prefer snake_case for all of your dataset, table, column, view, and function names.
  • When SELECTing, always give each column its own row, with the exception of SELECT *, which can be on a single row if the columns are unknown.
  • Be cautious with exploratory SELECTs: always add LIMIT 100 at the end of the statement to keep result sets small (note that LIMIT by itself does not reduce the bytes scanned, so it will not cut query cost on its own).
  • Prefer the table "Preview" in the web console to running a SELECT query; a preview costs nothing, since no query is actually run.

-- Good
SELECT id AS account_id
     , name AS account_name
     ...
FROM table
LIMIT 100

-- Okay (if column names are unknown)
SELECT *
FROM table
LIMIT 100

-- Bad
SELECT *
FROM table

DISTINCT should be included on the same row as SELECT

The AS keyword should be used when projecting a field. This isn’t needed for aliasing a table.

Ordering and grouping by field name is preferred over grouping by number (e.g. avoid GROUP BY 1, 2); grouping by numbers is brittle.

Prefer WHERE to HAVING when either would suffice (see the sketch below).
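A small sketch covering the last two points (table and column names are illustrative):

-- Good: filter in WHERE, group by field name
SELECT account_type
     , COUNT(*) AS account_count
FROM accounts
WHERE is_active
GROUP BY account_type

-- Bad: ordinal grouping, filtering late in HAVING
SELECT account_type
     , COUNT(*) AS account_count
FROM accounts
GROUP BY 1
HAVING account_type = 'business'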

Prefer UNION ALL to UNION. This is because a UNION could indicate upstream data integrity issues that are better solved elsewhere.

Prefer != to <>. This is because != is more common in other programming languages and reads like "not equal" which is how we're more likely to speak.

Consider performance. Understand the difference between LIKE and ILIKE, IS and =, and NOT, !, and <>, and use each as appropriate.

Prefer lower(column) LIKE '%match%' to column ILIKE '%Match%'. This lowers the chance of stray capital letters leading to an unexpected result.

Familiarize yourself with the DRY principle (Don't Repeat Yourself). If you type the same line twice, it will need to be maintained in two places.

When de-duping data by topic version, make sure you dedupe by the version number rather than the topic string (e.g. energy_contracts_contract_v4). This is in case the version number goes above 9, since deduping on the string field would prioritise topic_v9 over topic_v10.
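A hedged sketch of deduping on the numeric version (the topic and key names are illustrative):

SELECT * EXCEPT (rn)
FROM (
    SELECT *
         , ROW_NUMBER() OVER (
               PARTITION BY contract_id
               ORDER BY CAST(REGEXP_EXTRACT(topic, r'_v(\d+)$') AS INT64) DESC
           ) AS rn
    FROM energy_contracts
)
WHERE rn = 1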

Functions

Function names and keywords should all be capitalized.

Prefer IF to a single line CASE WHEN statement

Prefer IF to selecting a bare Boolean expression such as (amount < 10) AS is_less_than_ten.
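A minimal illustration of both preferences (table and column names are illustrative):

-- Good
SELECT IF(amount < 10, TRUE, FALSE) AS is_less_than_ten
FROM payments

-- Bad: single-line CASE WHEN
SELECT CASE WHEN amount < 10 THEN TRUE ELSE FALSE END AS is_less_than_ten
FROM payments

-- Bad: bare Boolean expression
SELECT (amount < 10) AS is_less_than_ten
FROM payments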

JOINS

  • Be explicit when joining, e.g. use LEFT JOIN instead of JOIN (a bare JOIN defaults to INNER).
  • Prefix the table name to a column when joining; otherwise omit it.
  • When table names are long, use sensible aliases which are:
  • Short, but not too short
  • Obvious as to what data source they refer to
  • When table names are short, use them directly instead of creating an alias.
  • Specify the order of a join with the FROM table first and the JOIN table second, as shown in the sketch below.
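A short sketch pulling these rules together (table names are illustrative):

-- Good
SELECT billing.statement_id
     , accounts.account_name
FROM customer_billing_statements AS billing
LEFT JOIN customer_accounts AS accounts
  ON billing.account_id = accounts.account_id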

Comments

Comments should be provided for all logic in WHERE and HAVING clauses, joins, and calculated fields.

When making single line comments in a model use the # syntax.

When making multi-line comments in a model use the /* */ syntax.

Respect the character line limit when making comments. Move to a new line or to the model documentation if the comment is too long.

When creating a view, script, job etc. add a description at the top of the code to include the following elements:

/*
Author: Vis
CreatedOn: 27/10/2021
Type: user-defined function
Purpose: converts the date string into a proper date format
*/

More information on the coding principles can be found in the blog below:

How to Handle Data in Cloud. Following the right principles &…

If you like the above, you may also like my other blogs

1.) How to build Modern Cloud Data warehouse

2.) What it takes to Run Terabytes/Petabytes of Data in Cloud? What should you Ask..?

3.) How to Handle Data in Cloud. Following the right principles &…

4.) Why should you consider Data Lake in your Cloud Architecture..?

Conclusion

Try adopting all of the best practices shared above, add them to your existing BigQuery practice, and let them become habitual.

I will cover more on these related concepts and bring in best practices in my future blogs.

#GoogleBigquery #AzureSynapseInfluencer #BigQuery

If you find this useful, please give a clap and share. If you have any queries, just comment. Thanks for reading.

Explore ~Learn ~Do
