All the operators seen in the previous articles allow us to operate in a more or less complex way on the data. In fact, we have seen how to filter data (see SELECT: simple query structure), join data from multiple tables by means of join conditions (see SELECT: query with JOIN and GROUP BY), define groups and filter them (see SELECT: query with JOIN and GROUP BY), include and/or exclude results (see SELECT: IN operator, SELECT: NOT IN operator and tuple constructor), and finally operate on homogeneous sets (see SELECT: set operators). In many cases we have suggested dividing the problem into sub-problems in order to simplify the formulation and structure of the query. This approach always turns out to be successful! At this point, however, some minor drawbacks arise. If the sub-queries we write are very similar to each other we have to repeat the statement several times risking errors. Also, sometimes we need to use an even complex result, perhaps with aggregations, to filter other tables and/or temporary results. How do we overcome these limitations? Derived tables and Common Table Expression (CTE) come to our rescue for this problem.
In this article we will discover derived tables that help to better structure our SQL statements, but more importantly, to increase the ability to analyze our data. In another article we will address the topic of CTEs in detail. As in previous articles, we will use the schema related to movie showings in theaters. We reproduce it below for those who have not read the other articles.
FILM (CodF, Title, Release_Date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, RoomNumber, Capacity)
PROJECTION (CodC, RoomNumber, Date, StartTime, EndTime, CodF)
As a reminder, the notation used to describe the schema uses underlining to indicate the fields that make up the primary key and an asterisk to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.
Definition of derived tables
As mentioned earlier, in some scenarios we would like to save the result of even a complex query to filter data from other tables. In other cases we might need the results of two queries to compare. Of course, we could write a program in our preferred language in which we go in and submit the queries we are interested in and then manipulate the results to our liking. This solution, while viable, is not efficient. In fact, the total time of all operations would be affected by both the network latency for each query to the database as well as, of course, our ability to write efficient code to properly manipulate the data.
Can we ask the database to perform all the queries itself and combine them appropriately? Of course we can. We can use a first solution called a derived table. But what are derived tables? They are nothing more than temporary tables defined by SELECT statements within the FROM clause. Once defined these can be used like all other tables within the database.
Advantages and disadvantages of derived tables
The main advantage is the ability to divide the problem into sub-problems as addressed already with nested queries. Unlike nested queries whose results can be used only with IN, NOT IN and possibly equality/inequality operators (if they return only one result), derived tables can be used as real tables on which to perform join, filtering and grouping operations.
Another advantage is the ability to write complex SQL statements returning even a very large data structure. Thus, we are not constrained to select some attribute or an aggregation function, but we can define a schema that includes several pieces of information. This allows us, for example, to compute statistics by aggregating data at different levels of granularity.
Of course, there are downsides. Derived tables are temporary tables that are “created” during the execution of the main query. This implies that their contents are not saved to the database and we have to redefine them every time we need them. Another downside is that they cannot also use the same definition in nested queries, but I am forced to redefine the derived table in the FROM clause. This can be both tedious when writing, but more importantly, risky as it can lead to errors. Finally, the final statement is sometimes very difficult to read.
Examples of using derived tables
So far we have only talked on a theoretical level about derived tables. Let’s see in concrete terms how they are defined and how they can help us.
Double aggregation function
Suppose we want to find the maximum average capacity of movie theaters. To do this we need to divide the problem into two tasks:
- find the average capacity of each cinema
- compute the maximum value of the previously calculated averages
The first problem is solved by simply grouping by theater code the ROOM table and averaging the Capacity attribute. The resulting SQL query will be:
SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC;
At this point we would like to compute the maximum AvgCapacityCinema measure. Over the years I have seen many people try to write something like this hoping the database would be smart enough to understand what we would like to do.
SELECT MAX(AVG(Capacity)) AS MaxAvgCapacity FROM SALA GROUP BY CodC;
To overcome this limitation, we must use the query from step 1 as a derived table, that is, we must insert it into the FROM clause of another query. At this point we can apply the MAX aggregation function on the AvgCapacityCinema attribute. The correct query is as follows.
SELECT MAX(AvgCapacityCinema) FROM ( SELECT CodC, AVG(Capacity) AS AvgCapacityCinema FROM ROOM GROUP BY CodC) AS AVERAGE;
This is a simple example to understand how to define a derived table and use it to calculate a double aggregate.
Computation of two-level aggregates
Let us now see how to use the same derived table from the previous example along with other tables. Suppose we want to calculate for each city, the average maximum capacity of cinemas. The request is very similar to the previous one, but, in this case, the maximum must be calculated based on the cinema city. Therefore, the steps to follow will be:
- find the average capacity of each cinema
- group cinemas by city and calculate the maximum value of averages
We have already solved the first step previously, so let us see how to solve the second. The derived table returns us only the cinema code. To derive the city we need to use the CINEMA table and put it in join with the derived table AVERAGE. Then we could define groups for each city and calculate the maximum of the attribute AvgCapacityCinema. The solution will be:
SELECT City, MAX(AvgCapacityCinema) FROM CINEMA C, ( SELECT CodC, AVG(Capienza) AS AvgCapacityCinema FROM SALA GROUP BY CodC) AS AVERAGE WHERE C.CodC=AVERAGE.CodC GROUP BY City;
Using multiple derived tables
Let us now face a much more complex scenario. We want to calculate for each genre of films, the number of films that were shown in the month of June 2023 equal to or more than twice as many times as in the previous month. Let us decompose the problem as usual. The steps to be taken are:
- calculate the number of projections for each film in the month of June 2023
- calculate the number of projections for each film in the month of May 2023
- compare the values and select only those films that meet the required constraint
- group by genre and calculate the number of items belonging to each group
The first two steps are very similar to each other except for filtering on date. In fact, it is necessary to filter the PROJECTION table for the desired date range, group by movie code, and finally calculate the number of tuples in each group. The query for June 2023 will be:
SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF;
Instead, the SQL statement for May 2023 will be:
SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF;
At this point we compare the results obtained to identify the movie codes that meet the required constraint. We use the previous queries as derived tables and merge them via the movie code. Finally, we filter the results by requiring that the NFilmJune attribute be greater than or equal to twice the value of the NFilmMay attribute. The SQL query will be:
SELECT June.CodF FROM (SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF) AS June (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF) AS May WHERE June.CodF=May.CodF AND NFilmJune >= 2* NFilmMay;
Now all we have to do is retrieve the information of the identified movies, group them by genre, and calculate how many movies belong to each group. To retrieve the information of the movies we can perform this operation in two ways:
- performing a join with the FILM table
- using the IN operator to filter the FILM table
We opt to use the IN operator, but both solutions are equivalent. Finally, in the outer query we are going to group by Genre and in the SELECT clause we will also insert the COUNT aggregation function. The final query will be:
SELECT Genre, COUNT(*) FROM FILM WHERE CodF IN ( SELECT June.CodF FROM (SELECT CodF, COUNT(*) NFilmJune FROM PROJECTION WHERE Date >= ‘2023-06-01’ AND Date <= ‘2023-06-30’ GROUP BY CodF) AS June (SELECT CodF, COUNT(*) NFilmMay FROM PROJECTION WHERE Date >= ‘2023-05-01’ AND Date <= ‘2023-05-31’ GROUP BY CodF) AS May WHERE June.CodF=May.CodF AND NFilmJune >= 2* NFilmMay) GROUP BY Genre;
As can be seen, the query is very structured and difficult to read. Nevertheless, it performs both a comparison of the results of two different queries and a double level of aggregation.