BigQuery is Google’s database that enables interactive analysis of large amounts of data in a short time and at a low cost. With BigQuery, you can query terabytes of data without the need to manage any infrastructure or database settings. BigQuery uses SQL and can take advantage of the pay-as-you-go model. Therefore, this tool allows you to focus on analyzing the data in order to find the meaningful information.
In this series of articles, we will look at some of BigQuery’s advanced features. In particular, we will explore the WITH clause with respect to permanent tables in terms of performance and readability, and then present the GIS functions that are natively supported. After that, we’ll present the use of analytic window functions, which allow you to split data into subsets on which to perform SQL. We will also discuss functions for creating models for classifying data that can be generated and saved directly in BigQuery. Finally, we will conclude with the best strategies for optimizing query performance.
The WITH clause contains one or more common table expressions (CTEs). Each CTE binds the results of a subquery to a table name, which can be used elsewhere in the same query expression.
The syntax is as follows:
WITH cte[, ...]
BigQuery does not materialize the results of non-recursive CTEs within the WITH clause. If a non-recursive CTE is referenced in multiple places in a query, then the CTE is executed once for each reference. Using a WITH clause is a great way to help break up a complex query.
Let’s look at some examples to understand how it works and the benefits it introduces.
Suppose the analytics team exported Google Analytics logs for an e-commerce site into BigQuery and created a new table with all the raw session data from e-commerce visitors. An example dataset is available in Google Cloud here.
We want to identify the five products with the most views (product_views) by all visitors, including people who viewed the same product more than once. In Google Analytics, a visitor can view a product during the following interaction types: ‘page’, ‘screenview’, ‘event’, ‘transaction’, ‘item’, ‘social’, ‘exception’, ‘timing’. In order to simplify the analysis, we consider only the interactions of type ‘page’.
The query, reported below, counts the number of times a product (v2ProductName) has been viewed (product_views), orders the result based on decreasing values of views and limits the display to the first 5 results.
SELECT COUNT(*) AS product_views, (v2ProductName) AS ProductName FROM `data-to-insights.ecommerce.all_sessions` WHERE type = 'PAGE' GROUP BY v2ProductName ORDER BY product_views DESC LIMIT 5;
This query does not use the WITH construct since the required calculation can be solved by a simple aggregation function (COUNT) applied to each group defined by the GROUP BY clause. However, the result obtained in this way is not entirely true. In fact, it considers the views made by the same user as distinct. To eliminate this error of evaluation it is necessary to rewrite the previous query as follows.
WITH unique_product_views_by_person AS ( -- find each unique product viewed by each visitor SELECT fullVisitorId, (v2ProductName) AS ProductName FROM `data-to-insights.ecommerce.all_sessions` WHERE type = 'PAGE' GROUP BY fullVisitorId, v2ProductName ) -- aggregate the top viewed products and sort them SELECT COUNT(*) AS unique_view_count, ProductName FROM unique_product_views_by_person GROUP BY ProductName ORDER BY unique_view_count DESC LIMIT 5;
In this case, we are going to define a unique_product_views_by_person CTE that creates a temporary table containing only the fullVisitorId and v2ProductName (renamed ProductName using the AS clause) data pair. The result of the CTE is used to calculate the number of unique visits to each product.
The WITH clause, therefore, is to split the problem into simpler sub-problems while also avoiding calculation errors.
analyses thanks also to the huge amount of data made publicly available. Using the bigquery-public-data.covid19_open_data dataset, we simply want to answer the following question: “How many states in the US had more than 100 deaths on April 10, 2020?”
We can use the following query to answer this question.
WITH per_state AS ( SELECT subregion1_name as state, SUM(cumulative_deceased) as total_cases FROM `bigquery-public-data.covid19_open_data.covid19_open_data` WHERE date='2020-04-10' and country_name="United States of America" and subregion1_name IS NOT NULL GROUP BY subregion1_name ) SELECT count(*) as count_of_states FROM per_state WHERE total_cases > 100;
The per_state CTE calculates the total number of deaths (total_cases) for each subregion belonging to the United States of America and whose value is not NULL. Its result is used to enforce the constraint on the minimum threshold of 100 deaths and count the number of states that have exceeded this threshold.
As seen above, the WITH clause is used to compute an intermediate result used in one of the subsequent queries to simplify problem solving. Otherwise, a table could be used to save the query result. We briefly discuss the pros and cons of each approach.
Potentially, queries included in the WITH clause could be pasted inside the FROM clause of another query, as is done in Oracle for defining table functions. However, this approach produces a query that is very difficult to read, even though the result is the same. With the WITH clause it is also possible to have more than one subquery that can be concatenated together. This produces as many intermediate results as are needed to answer the query.
But why keep the preprocessing query in the WITH clause and not store its result in a new permanent table? Good question!
Of course, it is absolutely possible to store the results of the preprocessing query in a table. They’ll be much faster to query later, because the preprocessing and any joins involved have already been done. Also, you can potentially share that table with other users.
The disadvantage of this approach is that users usually only need a subset of the data by continually adding WHERE clauses that would affect performance. Therefore, it might be much faster to use the WITH clause. But how can the WITH clause approach be more efficient than a result saved in an indexed table? The answer is that BigQuery is smart enough to do what is called push-down. Even if the query with the WITH clause does not have a WHERE filter, BigQuery applies the WHERE clause filter of the next query as part of the WITH statement. This optimizes execution.
Finally, with the permanent table, you lose the various calculations done in the various steps. This means that you have to destroy it and recreate it every time you need to make a change to the starting query.