Having seen various constructs of the SQL language, we have almost all the tools to write both simple and also very complex queries. We have, in fact, started from the basic structure of a query to filter data according to some conditions (see SELECT: simple query structure), to move on to join information saved in different tables and define groups of data to calculate some statistics (see SELECT: query with JOIN and GROUP BY). Let’s also not forget inclusion operators ( SELECT: IN operator ), exclusion ( SELECT: NOT IN operator and tuple constructor ) and set operators ( SELECT: set operators ). Finally, we introduced the concept of derived tables ( SQL: derived tables ) to save a temporary result and use it as a new table in our database. This very last construct allowed us to increase the expressive and computational power of our queries. Nevertheless, queries using this construct are mostly difficult to read in addition to other limitations already discussed. To overcome these drawbacks, Common Table Expressions or more briefly CTEs were introduced.
In this article we will address, therefore, CTEs by first describing the syntax and its features, and then understanding their operation and benefits through real examples. As in the previous articles, we will use the scheme related to movie screenings in cinemas. We reproduce it below for those who have not read the other articles.
FILM (CodF, Title, Release_Date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, RoomNumber, Capacity)
PROJECTION (CodC, RoomNumber, Date, StartTime, EndTime, CodF)
As a reminder, the notation used to describe the schema uses underlining to indicate the fields that make up the primary key and an asterisk to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.
Similar to derived tables, CTEs allow the result of a query to be saved temporarily and then reused within another query as if it were a table. To define them, the WITH construct is used as below:
As can be seen from the figure, after the WITH command you need to assign a name to each query that you are going to write as a CTE. You can optionally specify the field names. If they are not specified, those defined in the SELECT clause of the query associated with the CTE will be used. Of course, an arbitrary number of CTEs can be defined. The CTEs thus defined can then be used in the main query as simple tables along with those already in the database.
Advantages and disadvantages of CTEs
As with derived tables and nested queries, CTEs allow the problem to be divided into subproblems. This advantage has already allowed us in previous articles to solve very complex queries incrementally. In fact, we can write even very complex SQL queries to have them return tables with an a-hoc schema for our needs.
The main advantage of CTEs, however, is that these “new” tables can be anywhere in the query thus overcoming the limitations of the other constructs. In fact, a CTE can be used not only in the main query, but also by the other CTEs a valla its definition. This aspect allows us to compute, for example, different levels of aggregation on the data without having to resort to dedicated data warehouse constructs. Not only that! We will see in another article how to use cross-referencing between CTEs to perform recursion.
CTEs also have other advantages over their “sisters,” derived tables. The first advantage is readability. Because the queries associated with each CTE are defined at the beginning and not within FROM clauses the code is cleaner and neater. We will notice this later through the examples. Another advantage is the reduction of errors. By having only one point at which a CTE is defined, regardless of how many times it is subsequently used, it is easier to debug the individual CTE and update its definition without running into copy errors at multiple points in the query.
CTEs also have a disadvantage, albeit a negligible one in our opinion: their contents are not saved to disk. For this reason, it will be necessary to redefine common CTEs for each frequent query we write.
Examples of the use of CTEs
To better understand the use of CTEs and the enormous advantages they introduce let us analyze the same examples we saw in article XXX. We will also report for each example the solution with derived tables to show the differences at the syntactic level.
Double aggregation function
Let us start with the example that requires us to find the maximum average capacity of movie theaters. To do this we need to divide the problem into two tasks:
- find the average capacity of each cinema
- calculate the maximum value of the previously calculated averages
The solution with derived tables required defining a query in the FROM clause that calculated the average capacity for each cinema. Then it was sufficient to calculate the maximum on the aggregate value. The solution was as follows:
SELECT MAX(AvgCapacityCinema) FROM ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC) AS AVERAGE;
As already discussed the solution is not easy to understand by reading only the SQL. Let’s try to rewrite it using CTEs!
First we need to define our CTE that will correspond to the derived table query.
WITH AVERAGE AS ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC)
As you can see already at first glance the SQL is much more readable and clean. We can, in fact immediately understand what we are calculating without having to worry about how this result will be used.
At this point we only need to use the CTE in the main query as if it were a simple table in our database. The final query will then be
WITH AVERAGE AS ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC) SELECT MAX(AvgCapacityCinema) FROM AVERAGE;
The result will be the same as that obtained with the derived table, but when writing and also validating the syntax it is very simple. The division into sub-problems is equivalent to writing as many CTEs as necessary.
Computation of two-level aggregates
Let us now take the example that calculates for each city, the average maximum capacity of cinemas. The steps to follow will be:
- find the average capacity of each cinema
- group the cinemas by city and calculate the maximum value of the averages
We can simply reuse the CTE from the previous example and then perform a join with the CINEMA table in the main query.
WITH AVERAGE AS ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC) SELECT City, MAX(AvgCapacityCinema) FROM CINEMA C, AVERAGE A WHERE C.CodC=A.CodC GROUP BY City;
However, we also have another solution! Within the CTE we not only select the code of the cinema and the average capacity of its theaters, but we also report the city of the cinema. This way the main query will only have to perform a grouping on a single table, delegating the join to the CTE. Be careful, however, to include the City field in the GROUP BY of the CTE as well. The final query will look as follows:
WITH AVERAGE AS ( SELECT R.CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM R, CINEMA C WHERE C.CodC=R.CodC GROUP BY R.CodC) SELECT City, MAX(AvgCapacityCinema) FROM AVERAGE GROUP BY City;
Using multiple CTEs
Let us now face the more complex scenario. We want to calculate for each genre of movies, the number of movies that were shown in the month of June 2023 equal to or more than twice as many times as in the previous month. Let us decompose the problem as usual. The steps to be taken are:
- calculate the number of screenings for each film in the month of June 2023
- calculate the number of screenings for each film in the month of May 2023
- compare the values and select only those films that meet the required constraint
- group by genre and calculate the number of items belonging to each group
If you remember the solution with the derived tables was very difficult both to write and to read (provided you could finish writing it correctly!). We bring it to you below for simplicity.
SELECT Genre, COUNT(*) FROM FILM WHERE CodF IN ( SELECT June.CodF FROM (SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF) AS June (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF) AS May WHERE June.CodF=May.CodF AND NFilmJune >= 2* NFilmMay) GROUP BY Genre;
Let us now look at addressing the same problem with CTEs. As mentioned earlier, it will be sufficient to enter a CTE for each subproblem we need to solve. We start, then, with the calculation of the number of screenings for each film in the months of interest. The corresponding CTEs will be:
WITH June AS ( SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF), May AS (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF)
Now we can identify the movies that meet the required constraint. To do this, we use a CTE, called FILM_BEST, which exploits those just defined. We report only its definition below
WITH …, FILM_BEST AS (SELECT J.CodF FROM June J, May M WHERE J.CodF=M.CodF AND NFilmJune >= 2* NFilmMay)
Now all we have to do is retrieve the information of the identified movies, group them by genre, and calculate how many movies belong to each group. As with the solution with the derived tables, we can retrieve the movie information in two ways:
- by performing a join with the FILM table
- using the IN operator to filter the FILM table
Let us first see the solution with the join in the main query. The final solution will be
WITH June AS ( SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF), May AS (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF) FILM_BEST AS (SELECT J.CodF FROM June J, May M WHERE J.CodF=M.CodF AND NFilmJune >= 2* NFilmMay) SELECT Genre, COUNT(*) FROM FILM, FILM_BEST FB WHERE F.CodF=FB.CodF GROUP BY Genre;
Obviously they could use the solution with the IN that will result:
WITH June AS ( SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF), May AS (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF) FILM_BEST AS (SELECT J.CodF FROM June J, May M WHERE J.CodF=M.CodF AND NFilmJune >= 2* NFilmMay) SELECT Genre, COUNT(*) FROM FILM WHERE CodF IN (SELECT CodF FROM FILM_BEST) GROUP BY Genre;
Both solutions turn out to be very readable, as well as easy to write!