SQL: Common Table Expression

Expressing complex queries in SQL language can be really difficult. Dividing the problem into mini-queries is the best strategy to get the desired results quickly without running into errors. Common Table Expressions, also called CTEs, allow the queries of our interest to be defined once to be reused multiple times when needed. Their clean syntax increases the readability of SQL scripts, allows us to debug written code and to create very complex logic to filter and extract data. Through some examples we will discover their potential.

Share

Reading time: 6 minutes

Having seen various constructs of the SQL language, we have almost all the tools to write both simple and also very complex queries. We have, in fact, started from the basic structure of a query to filter data according to some conditions (see SELECT: simple query structure), to move on to join information saved in different tables and define groups of data to calculate some statistics (see SELECT: query with JOIN and GROUP BY). Let’s also not forget inclusion operators ( SELECT: IN operator ), exclusion ( SELECT: NOT IN operator and tuple constructor ) and set operators ( SELECT: set operators ). Finally, we introduced the concept of derived tables ( SQL: derived tables ) to save a temporary result and use it as a new table in our database. This very last construct allowed us to increase the expressive and computational power of our queries. Nevertheless, queries using this construct are mostly difficult to read in addition to other limitations already discussed. To overcome these drawbacks, Common Table Expressions or more briefly CTEs were introduced.

In this article we will address, therefore, CTEs by first describing the syntax and its features, and then understanding their operation and benefits through real examples. As in the previous articles, we will use the scheme related to movie screenings in cinemas. We reproduce it below for those who have not read the other articles.

FILM (CodF, Title, Release_Date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, RoomNumber, Capacity)
PROJECTION (CodC, RoomNumber, Date, StartTime, EndTime, CodF)

As a reminder, the notation used to describe the schema uses underlining to indicate the fields that make up the primary key and an asterisk to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.

CTE Definition

Similar to derived tables, CTEs allow the result of a query to be saved temporarily and then reused within another query as if it were a table. To define them, the WITH construct is used as below:

As can be seen from the figure, after the WITH command you need to assign a name to each query that you are going to write as a CTE. You can optionally specify the field names. If they are not specified, those defined in the SELECT clause of the query associated with the CTE will be used. Of course, an arbitrary number of CTEs can be defined. The CTEs thus defined can then be used in the main query as simple tables along with those already in the database.

Advantages and disadvantages of CTEs

As with derived tables and nested queries, CTEs allow the problem to be divided into subproblems. This advantage has already allowed us in previous articles to solve very complex queries incrementally. In fact, we can write even very complex SQL queries to have them return tables with an a-hoc schema for our needs.

The main advantage of CTEs, however, is that these “new” tables can be anywhere in the query thus overcoming the limitations of the other constructs. In fact, a CTE can be used not only in the main query, but also by the other CTEs a valla its definition. This aspect allows us to compute, for example, different levels of aggregation on the data without having to resort to dedicated data warehouse constructs. Not only that! We will see in another article how to use cross-referencing between CTEs to perform recursion.

CTEs also have other advantages over their “sisters,” derived tables. The first advantage is readability. Because the queries associated with each CTE are defined at the beginning and not within FROM clauses the code is cleaner and neater. We will notice this later through the examples. Another advantage is the reduction of errors. By having only one point at which a CTE is defined, regardless of how many times it is subsequently used, it is easier to debug the individual CTE and update its definition without running into copy errors at multiple points in the query.

CTEs also have a disadvantage, albeit a negligible one in our opinion: their contents are not saved to disk. For this reason, it will be necessary to redefine common CTEs for each frequent query we write.

Examples of the use of CTEs

To better understand the use of CTEs and the enormous advantages they introduce let us analyze the same examples we saw in article XXX. We will also report for each example the solution with derived tables to show the differences at the syntactic level.

Double aggregation function

Let us start with the example that requires us to find the maximum average capacity of movie theaters. To do this we need to divide the problem into two tasks:

  1. find the average capacity of each cinema
  2. calculate the maximum value of the previously calculated averages

The solution with derived tables required defining a query in the FROM clause that calculated the average capacity for each cinema. Then it was sufficient to calculate the maximum on the aggregate value. The solution was as follows:

				
					SELECT MAX(AvgCapacityCinema)
FROM ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema
FROM ROOM
GROUP BY CodC) AS AVERAGE;
				
			

As already discussed the solution is not easy to understand by reading only the SQL. Let’s try to rewrite it using CTEs!

First we need to define our CTE that will correspond to the derived table query.

				
					WITH 
AVERAGE AS (
    SELECT CodC, AVG(Capacity) AS AvgCapacityCinema
    FROM ROOM
    GROUP BY CodC)
				
			

As you can see already at first glance the SQL is much more readable and clean. We can, in fact immediately understand what we are calculating without having to worry about how this result will be used.

At this point we only need to use the CTE in the main query as if it were a simple table in our database. The final query will then be

				
					WITH 
AVERAGE AS (
    SELECT CodC, AVG(Capacity) AS AvgCapacityCinema
    FROM ROOM
    GROUP BY CodC)

SELECT MAX(AvgCapacityCinema)
FROM AVERAGE;

				
			

The result will be the same as that obtained with the derived table, but when writing and also validating the syntax it is very simple. The division into sub-problems is equivalent to writing as many CTEs as necessary.

Computation of two-level aggregates

Let us now take the example that calculates for each city, the average maximum capacity of cinemas. The steps to follow will be:

  1. find the average capacity of each cinema
  2. group the cinemas by city and calculate the maximum value of the averages

We can simply reuse the CTE from the previous example and then perform a join with the CINEMA table in the main query.

				
					WITH 
AVERAGE AS (
    SELECT CodC, AVG(Capacity) AS AvgCapacityCinema
    FROM ROOM
    GROUP BY CodC)

SELECT City, MAX(AvgCapacityCinema)
FROM CINEMA C, AVERAGE A
WHERE C.CodC=A.CodC
GROUP BY City;

				
			

However, we also have another solution! Within the CTE we not only select the code of the cinema and the average capacity of its theaters, but we also report the city of the cinema. This way the main query will only have to perform a grouping on a single table, delegating the join to the CTE. Be careful, however, to include the City field in the GROUP BY of the CTE as well. The final query will look as follows:

				
					WITH 
AVERAGE AS (
    SELECT R.CodC, AVG(Capacity) AS AvgCapacityCinema
    FROM ROOM R, CINEMA C
    WHERE C.CodC=R.CodC
    GROUP BY R.CodC)


SELECT City, MAX(AvgCapacityCinema)
FROM AVERAGE
GROUP BY City;


				
			

Using multiple CTEs

Let us now face the more complex scenario. We want to calculate for each genre of movies, the number of movies that were shown in the month of June 2023 equal to or more than twice as many times as in the previous month. Let us decompose the problem as usual. The steps to be taken are:

  1. calculate the number of screenings for each film in the month of June 2023
  2. calculate the number of screenings for each film in the month of May 2023
  3. compare the values and select only those films that meet the required constraint
  4. group by genre and calculate the number of items belonging to each group

If you remember the solution with the derived tables was very difficult both to write and to read (provided you could finish writing it correctly!). We bring it to you below for simplicity.

				
					SELECT Genre, COUNT(*)
FROM FILM
WHERE CodF IN (
    SELECT June.CodF
    FROM (SELECT CodF, COUNT(*) NFilmJune
        FROM PROJECTION
        WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’
        GROUP BY CodF) AS June
        (SELECT CodF, COUNT(*) NFilmMay
        FROM PROJECTION
        WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’
        GROUP BY CodF) AS May
    WHERE June.CodF=May.CodF
    AND NFilmJune >= 2* NFilmMay)
GROUP BY Genre;
				
			

Let us now look at addressing the same problem with CTEs. As mentioned earlier, it will be sufficient to enter a CTE for each subproblem we need to solve. We start, then, with the calculation of the number of screenings for each film in the months of interest. The corresponding CTEs will be:

				
					WITH 
June AS (
    SELECT CodF, COUNT(*) NFilmJune
    FROM PROJECTION
    WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’
    GROUP BY CodF),
May AS 
    (SELECT CodF, COUNT(*) NFilmMay
    FROM PROJECTION
    WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’
    GROUP BY CodF)

				
			

Now we can identify the movies that meet the required constraint. To do this, we use a CTE, called FILM_BEST, which exploits those just defined. We report only its definition below

				
					WITH 
…,
FILM_BEST AS 
    (SELECT J.CodF
    FROM June J, May M
    WHERE J.CodF=M.CodF
    AND NFilmJune >= 2* NFilmMay)

				
			

Now all we have to do is retrieve the information of the identified movies, group them by genre, and calculate how many movies belong to each group. As with the solution with the derived tables, we can retrieve the movie information in two ways:

  • by performing a join with the FILM table
  • using the IN operator to filter the FILM table

Let us first see the solution with the join in the main query. The final solution will be

				
					WITH 
June AS (
    SELECT CodF, COUNT(*) NFilmJune
    FROM PROJECTION
    WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’
    GROUP BY CodF),
May AS 
    (SELECT CodF, COUNT(*) NFilmMay
    FROM PROJECTION
    WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’
    GROUP BY CodF)
FILM_BEST AS 
    (SELECT J.CodF
    FROM June J, May M
    WHERE J.CodF=M.CodF
    AND NFilmJune >= 2* NFilmMay)


SELECT Genre, COUNT(*)
FROM FILM, FILM_BEST FB
WHERE F.CodF=FB.CodF
GROUP BY Genre;


				
			

Obviously they could use the solution with the IN that will result:

				
					WITH 
June AS (
    SELECT CodF, COUNT(*) NFilmJune
    FROM PROJECTION
    WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’
    GROUP BY CodF),
May AS 
    (SELECT CodF, COUNT(*) NFilmMay
    FROM PROJECTION
    WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’
    GROUP BY CodF)
FILM_BEST AS 
    (SELECT J.CodF
    FROM June J, May M
    WHERE J.CodF=M.CodF
    AND NFilmJune >= 2* NFilmMay)


SELECT Genre, COUNT(*)
FROM FILM
WHERE CodF IN (SELECT CodF 
    FROM FILM_BEST)
GROUP BY Genre;
				
			

Both solutions turn out to be very readable, as well as easy to write!

More To Explore

Artificial intelligence

Gradio: web applications in python for AI [part1]

Writing web applications for our machine learning and/or artificial intelligence models can take a lot of time and skills that we do not possess. To streamline and speed up this task we are helped by Gradio, a Python library designed to create web applications with just a few lines of code. Let’s discover its basic functionality with some examples.

Artificial intelligence

AI: the best prompt techniques for leveraging LLMs

Prompt techniques are the basis for the use of LLMs. There are several studies and guidelines for obtaining the best results from these models. Let us analyze some of them to extract the basic principles that will allow us to obtain the desired answers according to our task.

Leave a Reply

Your email address will not be published. Required fields are marked *

Design with MongoDB

Design with MongoDB!!!

Buy the new book that will help you to use MongoDB correctly for your applications. Available now on Amazon!