The SQL language makes it possible to extract a lot of information from relational databases. Writing queries, however, is not always simple. As seen in the previous articles, SELECT: simple query structure and SELECT: query with JOIN and GROUP BY, there are several clauses to define so that both the syntax and the extracted data are correct. But how do you do this when the query is very complex? The key is to divide the problem into smaller parts so that you can solve them more easily. After that, you put the various solutions together to compose the final answer. To make an analogy it is like with LEGO bricks, you build one brick at a time to get to create even very complex sets, as well as beautiful to look at.
If the bricks, in the SQL language, are simple queries using the syntax seen so far, what are the statements to build the final query. There are four types of strategies for joining different queries:
- Nested queries
- Set Operators
- Derived tables
- Common Table Expression (CTE)
Also in this article we will use the outline from the other articles regarding movie screenings in theaters. We report it below for those who have not read the other articles.
FILM (CodF, Title, Release_date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, NumberRoom, Capacity)
PROJECTION (CodC, NumberRoom, Date, StartTime, EndTime, CodF)
As a reminder, the notation used to describe the schema uses underscores to indicate the fields that make up the primary key and asterisks to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.
In this article we will discuss nested queries and in particular the IN operator. But before we understand how this operator works, it is necessary to define what nested queries are. They are nothing more than a SELECT statement contained within another query. This allows a complex problem to be broken down into simpler subproblems and use the results of one query as input to another. Nested queries can be inserted as a predicate in the WHERE and HAVING clauses, but also instead of a table in the FROM clause.
The simplest example is to use a nested query in the WHERE clause. Suppose we want to find all the rooms with the maximum capacity within our database. The problem must necessarily be divided into 2 parts:
- Find the maximum capacity within the ROOM table.
- Filter the rooms that have a capacity equal to the maximum calculated in the previous step
The first point is solved by the use of the MAX aggregation function on the entire ROOM table, as shown below
SELECT MAX(Capacity) FROM ROOM;
The second problem at this point is solved by imposing in the WHERE clause that the capacity must be equal to the query in the previous point. The statement will then be:
SELECT * FROM ROOM WHERE Capacity=(SELECT MAX(Capacity) FROM ROOM);
So we can solve some problems, but we still have limitations in some scenarios. Let’s see how we can achieve even more data manipulation capabilities with the SQL language.
Suppose we want to calculate for each film the number of screenings that were held in cinemas in Turin. The SQL query, using the syntax seen in the other articles involves:
- use the FILM, PROJECTION and CINEMA tables (FROM clause),
- impose appropriate join conditions in addition to the filter on the city (WHERE clause),
- group the results by movie code and title (GROUP BY clause)
- finally, display the title and projection number using the COUNT aggregation function (SELECT clause)
The resulting query will be:
SELECT Title, COUNT(*) AS NumProjections FROM FILM F, PROJECTION P, CINEMA C WHERE F.CodF=P.CodF AND P.CodC=C.CodC AND City = ‘Turin’ GROUP BY CodF, Title;
In this case the query is not complex, but it may not be immediately so intuitive to write the solution just seen. How can we proceed to simplify the problem? One possibility is to divide the problem into two parts:
- find cinemas in Turin
- compute the number of screenings of each film shown in the theaters in the previous point
The first query is very simple! It only requires us to filter, using a WHERE clause the records in the CINEMA table. We will only need to locate the codes since we do not need any other information. This results in the following SQL code
SELECT CodC FROM CINEMA WHERE City=’Turin’;
The second point is to calculate the screenings of these cinemas. Let’s simplify further! We calculate only the projections of each film. From the projections table we simply group by CodF and then count the number of records with the same CodF value. Since the codes, alas, are not so explanatory we also need to use the FILM table to retrieve the title. The query will result as follows.
SELECT Title, COUNT(*) AS NumProjections FROM FILM F, PROJECTION P WHERE F.CodF=P.CodF GROUP BY CodF, Title;
Let’s merge the queries
Now we just have to put the two solutions together. How? In theory, it would be enough for us to say that all records in the PROJECTION table that take a CodC value equal to one of the values returned by the first query should be considered in the final calculation. But in practice?
Many might think that you run the first query, I store the result in a variable that I then use in the second query. It could work, but it would require submitting two separate queries to the database resulting in increased response time even solely due to network latency. Also, this approach would only work within a programming language and in any case would always involve the problem of using a list of values to filter the records of interest.
So? No problem! There is an operator that indicates the inclusion of a value within a set: the IN operator. This operator can be used in the WHERE clause and its syntax is as follows:
AttributeName IN (NestedQuery)
where the attribute name must be compatible with the attribute returned by the nested query. The nested query can also be replaced with a list of values, if necessary. This last option, however, is strongly discouraged. In fact, it is used only in cases where these values are derived from external interactions with DB data, such as when a user chooses multiple options in a form. Having a query nested using the IN operator allows the DBMS to optimize query execution and make it faster and more efficient.
Returning to our example, it will be sufficient to impose that the CodCs in the PROJECTION table must belong to those returned by the first query. The query will, therefore, be as follows:
SELECT Title, COUNT(*) AS NumProjections FROM FILM F, PROJECTION P WHERE F.CodF=P.CodF AND CodC IN ( SELECT CodC FROM CINEMA WHERE City = ‘Turin’ ) GROUP BY CodF, Title;
Of course, much more complex examples can be given, but what there is to remember is that you only need to divide the problem to find a much easier solution to write!
JOIN instead of the IN operator
As seen in the previous example, we solved the problem with two different approaches. The first was to join all the tables we needed and then do the appropriate filtering and grouping. In the second approach we divided the problem into two pieces, solved both and then combined them together with the IN operator. At this point the question arises: which of the two solutions is better? In theory, neither!!! They are equivalent.
In fact, the IN operator can also be expressed by a join in many cases. It is sufficient to
- insert the tables referenced in the FROM clause of the main query into the FROM clause of the nested query,
- in the WHERE the appropriate join conditions (equality on the attribute used by the IN operator),
- finally, add any filtering predicates added in the WHERE clause of the nested query
So, is the IN operator used only when you are a beginner? No! For one thing, the IN operator reduces the demand for resources by the DBMS. In fact, all the resources used by the nested query will not be able to be accessed directly by the main query, which results in a release of memory and sometimes faster execution speed. This implies that, if the information contained in a table is not to be displayed, but only used for filtering, the IN operator can be used without problems. Otherwise, join must be used so as not to unnecessarily instantiate the same table multiple times.
In some cases you cannot translate IN with a join. For example, think of having both a group by and having clause in the nested query. In this case, you cannot possibly carry over the same constraints into the outer query without altering the end result.
An example of this scenario could be a variant of the previous query: for each movie, calculate the number of screenings that took place in theaters with an average capacity greater than 100. Let us split the problem again:
- find cinemas with an average capacity of more than 100
- compute the number of screenings of each film shown in the theaters in the previous point
The second query we have already solved previously. The first, on the other hand, is to calculate for each theater the average capacity of the theaters and filter out the theaters whose value is less than 100. The query will result as follows:
SELELCT CodC FROM CINEMA C, ROOM R WHERE C.CodC=R.CodC GROUP BY CodC HAVING AVG(Capacity)>100;
Therefore, the final solution will be:
SELECT Title, COUNT(*) AS NumProjections FROM FILM F, PROJECTION P WHERE F.CodF=P.CodF AND CodC IN (SELELCT CodC FROM CINEMA C, ROOM R WHERE C.CodC=R.CodC GROUP BY CodC HAVING AVG(Capacity)>100) GROUP BY CodF, Title;
As you can well understand, it is not possible to use a join instead of the IN operator since we could not handle the GROUP BY and HAVING clauses correctly and simultaneously.