SELECT: IN operator

In some cases writing SQL queries can be really complex. The divide and conquer strategy, always turns out to be successful because it allows us to merge the results of two or more queries. We have, however, the need to use special constructs. By means of nested queries and in particular of the IN operator this operation is simple and intuitive.

Reading time: 6 minutes

The SQL language makes it possible to extract a lot of information from relational databases. Writing queries, however, is not always simple. As seen in the previous articles, SELECT: simple query structure and SELECT: query with JOIN and GROUP BY, there are several clauses to define so that both the syntax and the extracted data are correct. But how do you do this when the query is very complex? The key is to divide the problem into smaller parts so that you can solve them more easily. After that, you put the various solutions together to compose the final answer. To make an analogy it is like with LEGO bricks, you build one brick at a time to get to create even very complex sets, as well as beautiful to look at.

If the bricks, in the SQL language, are simple queries using the syntax seen so far, what are the statements to build the final query. There are four types of strategies for joining different queries:

Nested queries
Set Operators
Derived tables
Common Table Expression (CTE)

Also in this article we will use the outline from the other articles regarding movie screenings in theaters. We report it below for those who have not read the other articles.

FILM (CodF, Title, Release_date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, NumberRoom, Capacity)
PROJECTION (CodC, NumberRoom, Date, StartTime, EndTime, CodF)

As a reminder, the notation used to describe the schema uses underscores to indicate the fields that make up the primary key and asterisks to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.

Nested queries

In this article we will discuss nested queries and in particular the IN operator. But before we understand how this operator works, it is necessary to define what nested queries are. They are nothing more than a SELECT statement contained within another query. This allows a complex problem to be broken down into simpler subproblems and use the results of one query as input to another. Nested queries can be inserted as a predicate in the WHERE and HAVING clauses, but also instead of a table in the FROM clause.

The simplest example is to use a nested query in the WHERE clause. Suppose we want to find all the rooms with the maximum capacity within our database. The problem must necessarily be divided into 2 parts:

Find the maximum capacity within the ROOM table.
Filter the rooms that have a capacity equal to the maximum calculated in the previous step

The first point is solved by the use of the MAX aggregation function on the entire ROOM table, as shown below

				
					SELECT MAX(Capacity)
FROM ROOM;

The second problem at this point is solved by imposing in the WHERE clause that the capacity must be equal to the query in the previous point. The statement will then be:

				
					SELECT *
FROM ROOM
WHERE Capacity=(SELECT MAX(Capacity)
                FROM ROOM);

Attention

Operators such as equality, inequality, and comparison can be used only if it is known a priori that the result of the nested SELECT is always a single value.

So we can solve some problems, but we still have limitations in some scenarios. Let’s see how we can achieve even more data manipulation capabilities with the SQL language.

IN operator

Suppose we want to calculate for each film the number of screenings that were held in cinemas in Turin. The SQL query, using the syntax seen in the other articles involves:

use the FILM, PROJECTION and CINEMA tables (FROM clause),
impose appropriate join conditions in addition to the filter on the city (WHERE clause),
group the results by movie code and title (GROUP BY clause)
finally, display the title and projection number using the COUNT aggregation function (SELECT clause)

The resulting query will be:

				
					SELECT Title, COUNT(*) AS NumProjections
FROM FILM F, PROJECTION P, CINEMA C
WHERE F.CodF=P.CodF
AND P.CodC=C.CodC
AND City = ‘Turin’
GROUP BY CodF, Title;

In this case the query is not complex, but it may not be immediately so intuitive to write the solution just seen. How can we proceed to simplify the problem? One possibility is to divide the problem into two parts:

find cinemas in Turin
compute the number of screenings of each film shown in the theaters in the previous point

First query

The first query is very simple! It only requires us to filter, using a WHERE clause the records in the CINEMA table. We will only need to locate the codes since we do not need any other information. This results in the following SQL code

				
					SELECT CodC
FROM CINEMA
WHERE City=’Turin’;

Second query

The second point is to calculate the screenings of these cinemas. Let’s simplify further! We calculate only the projections of each film. From the projections table we simply group by CodF and then count the number of records with the same CodF value. Since the codes, alas, are not so explanatory we also need to use the FILM table to retrieve the title. The query will result as follows.

				
					SELECT Title, COUNT(*) AS NumProjections
FROM FILM F, PROJECTION P
WHERE F.CodF=P.CodF
GROUP BY CodF, Title;

Let’s merge the queries

Now we just have to put the two solutions together. How? In theory, it would be enough for us to say that all records in the PROJECTION table that take a CodC value equal to one of the values returned by the first query should be considered in the final calculation. But in practice?

Many might think that you run the first query, I store the result in a variable that I then use in the second query. It could work, but it would require submitting two separate queries to the database resulting in increased response time even solely due to network latency. Also, this approach would only work within a programming language and in any case would always involve the problem of using a list of values to filter the records of interest.

So? No problem! There is an operator that indicates the inclusion of a value within a set: the IN operator. This operator can be used in the WHERE clause and its syntax is as follows:

				
					AttributeName IN (NestedQuery)

where the attribute name must be compatible with the attribute returned by the nested query. The nested query can also be replaced with a list of values, if necessary. This last option, however, is strongly discouraged. In fact, it is used only in cases where these values are derived from external interactions with DB data, such as when a user chooses multiple options in a form. Having a query nested using the IN operator allows the DBMS to optimize query execution and make it faster and more efficient.

Returning to our example, it will be sufficient to impose that the CodCs in the PROJECTION table must belong to those returned by the first query. The query will, therefore, be as follows:

				
					SELECT Title, COUNT(*) AS NumProjections
FROM FILM F, PROJECTION P
WHERE F.CodF=P.CodF
AND CodC IN (
	SELECT CodC
	FROM CINEMA
	WHERE City = ‘Turin’
)
GROUP BY CodF, Title;

Of course, much more complex examples can be given, but what there is to remember is that you only need to divide the problem to find a much easier solution to write!

JOIN instead of the IN operator

As seen in the previous example, we solved the problem with two different approaches. The first was to join all the tables we needed and then do the appropriate filtering and grouping. In the second approach we divided the problem into two pieces, solved both and then combined them together with the IN operator. At this point the question arises: which of the two solutions is better? In theory, neither!!! They are equivalent.

In fact, the IN operator can also be expressed by a join in many cases. It is sufficient to

insert the tables referenced in the FROM clause of the main query into the FROM clause of the nested query,
in the WHERE the appropriate join conditions (equality on the attribute used by the IN operator),
finally, add any filtering predicates added in the WHERE clause of the nested query

So, is the IN operator used only when you are a beginner? No! For one thing, the IN operator reduces the demand for resources by the DBMS. In fact, all the resources used by the nested query will not be able to be accessed directly by the main query, which results in a release of memory and sometimes faster execution speed. This implies that, if the information contained in a table is not to be displayed, but only used for filtering, the IN operator can be used without problems. Otherwise, join must be used so as not to unnecessarily instantiate the same table multiple times.

In some cases you cannot translate IN with a join. For example, think of having both a group by and having clause in the nested query. In this case, you cannot possibly carry over the same constraints into the outer query without altering the end result.

An example of this scenario could be a variant of the previous query: for each movie, calculate the number of screenings that took place in theaters with an average capacity greater than 100. Let us split the problem again:

find cinemas with an average capacity of more than 100
compute the number of screenings of each film shown in the theaters in the previous point

The second query we have already solved previously. The first, on the other hand, is to calculate for each theater the average capacity of the theaters and filter out the theaters whose value is less than 100. The query will result as follows:

				
					SELELCT CodC
FROM CINEMA C, ROOM R
WHERE C.CodC=R.CodC
GROUP BY CodC
HAVING AVG(Capacity)>100;

Therefore, the final solution will be:

				
					SELECT Title, COUNT(*) AS NumProjections
FROM FILM F, PROJECTION P
WHERE F.CodF=P.CodF
AND CodC IN (SELELCT CodC
    FROM CINEMA C, ROOM R
    WHERE C.CodC=R.CodC
    GROUP BY CodC
    HAVING AVG(Capacity)>100)
GROUP BY CodF, Title;

As you can well understand, it is not possible to use a join instead of the IN operator since we could not handle the GROUP BY and HAVING clauses correctly and simultaneously.

More To Explore

Artificial intelligence

Gradio: web applications in python for AI [part2]

Gradio is a python library that allows us to create web applications quickly and intuitively for our machine learning and AI models. Our applications always require user interaction and layout customization. Let us find out, through examples, how to improve our applications.

Alessandro Fiori 22 April 2024

Artificial intelligence

Gradio: web applications in python for AI [part1]

Writing web applications for our machine learning and/or artificial intelligence models can take a lot of time and skills that we do not possess. To streamline and speed up this task we are helped by Gradio, a Python library designed to create web applications with just a few lines of code. Let’s discover its basic functionality with some examples.

Alessandro Fiori 8 April 2024

2 Responses

Pingback: SQL: derived tables - Flowygo
Pingback: SQL: correlation - Flowygo