SQL: correlation

In the SQL language, it is possible to condition the execution of a query on the values of the tuples being parsed. This behavior is defined by correlation, that is, conditions in the WHERE clause of a nested query that relate to attributes of the tables in the outer query. Through some examples we will discover which operators use this construct, how to appropriately define correlations, and its alternatives.

Reading time: 5 minutes

In previous articles we have seen many constructs of the SQL language. In particular, with the use of derived tables and their evolution, namely Common Table Expression (CTE) we have been able to write very complex queries. These constructs have been present for years now in relational databases, both commercial and open source. However, before their advent, queries were written differently when a query could be answered by a single query. In fact, In some cases it may be necessary to bind the execution of a nested query to the value of one or more attributes in a more external query. This binding, called correlation, is expressed by one or more correlation conditions that are specified in the WHERE clause of the nested query that requires it. The predicate will, therefore. bind some attributes of the tables entered in the FROM clause of the nested query with some attributes of the tables defined in the FROM clause of the outermost queries. Put like that it sounds very complex, but it is actually easier done than said.

In this article we will look at some examples of correlation both required by some SQL language operators and for other scenarios. As in previous articles, we will use the pattern related to movie showings in theaters. We reproduce it below for those who have not read the other articles.

FILM (CodF, Title, Release_Date, Genre, DurationMinutes)
CINEMA (CodC, Name, Address, City, Website*)
ROOM (CodC, RoomNumber, Capacity)
PROJECTION (CodC, RoomNumber, Date, StartTime, EndTime, CodF)

As a reminder, the notation used to describe the schema uses underlining to indicate the fields that make up the primary key and an asterisk to indicate the possibility of assigning NULL values to the field. Finally, for simplicity, foreign keys have the same name as the primary keys to which they refer. We then proceed to look at the various operators.

EXISTS and NOT EXISTS

The operators that use correlation for their operation are:

EXISTS
NOT EXISTS

These two operators are “half-brothers” of the IN and NOT IN operators seen in SELECT: IN operator and SELECT: NOT IN operator and tuple constructor articles, respectively. However, both their syntax and behavior differ.

In fact, the EXISTS and NOT EXISTS operators admit as a parameter a nested query whose execution will be bound to the outermost query by one or more correlation conditions. The EXISTS operator returns the true value only if the nested query yields a nonempty set (i.e., returns at least one record). Otherwise, it returns the value false if the nested query returns the empty set (i.e., returns no records). The NOT EXISTS operator behaves in a dual way. Finally, the SELECT clause of the nested query can include any attribute but is irrelevant to the result. Let us look at an example to understand the link between the various operators.

Suppose we want to find the titles of movies that have never been shown in cinemas in Turin. As usual, we divide the problem into two steps

Find the codes of movies that have been shown at least once in a cinema in Turin
Display the titles of movies that do not belong to the set found previously

The first part can be solved with the following query

				
					SELECT CodF
FROM PROJECTION P, CINEMA C
WHERE P.CodC=C.CodC
AND City=’Turin’;

At this point we should exclude these codes from the final result. The most natural thing to do is to use the NOT IN operator on the CodF attribute. We would then obtain the following query:

				
					SELECT Title
FROM FILM 
WHERE CodF NOT IN (
    SELECT CodF
    FROM PROJECTION P, CINEMA C
    WHERE P.CodC=C.CodC
    AND City=’Turin’);

One could reason differently, however. For each film, we look to see if there is a screening in cinemas in the city of Turin. If it exists, that movie should be excluded from the result, otherwise it should be displayed. This is exactly the logic of NOT EXISTS. We then rewrite the query as follows.

				
					SELECT Title
FROM FILM F
WHERE NOT EXISTS (
    SELECT *
    FROM PROJECTION P, CINEMA C
    WHERE P.CodC=C.CodC
    AND City=’Turin’
    AND P.CodF=F.CodF);

The last condition in the nested query determines the correlation between its execution and the external query. This means that as the FILM table is scanned, a query is generated whose conditions in the WHERE clause are updated according to the value read in the CodF field. Obviously, the database does not perform this procedure because it would be too resource- and run-time-intensive, but that is the underlying logic.

As can be seen from this example it is possible to express the same query with NOT In than with NOT EXISTS. But then why might the results differ? First we must specify that in most cases writing with an operator or its “half-brother” is irrelevant to the result. However, if the attribute on which correlation is performed (i.e., the comparison attribute for inclusion/exclusion) may have NULL values then the set of filtered records will not be equivalent. The EXISTS and NOT EXISTS operators also handle NULL values, while their “half-brothers” do not.

Correlation among queries

Correlation can also be used without the use of the EXISTS and NOT EXISTS operators. In fact, we may need to bind the result of nested query to what we are analyzing in the outer query. Before we look at a concrete example, some clarifications need to be made:

Correlation conditions cannot be expressed in queries at the same level of nesting.
Correlation can only be expressed between a query of a given nesting level and the attributes of the tables of the query of the immediately higher nesting level. Therefore, correlation cannot be used in an external query or “skipped” nesting levels.

Suppose we want to find theaters that have a capacity greater than the average capacity of all movie theaters in the same city. To summarize, if movie theaters in Turin have an average capacity of 50, we want to find the theaters that have a value for the capacity attribute greater than 50. Obviously, as we change cities, the average value will change accordingly.

Let’s start writing the query that calculates the average capacity of the cinema.

				
					SELECT AVG(Capacity)
FROM ROOM R, CINEMA C
WHERE R.CodC=C.CodC

This query, however, returns us a value calculated on all halls. How do we limit its range to the city to which a hall belongs? We need to use correlation! In the external query we are going to scan the HALL table by requiring that the capacity of the individual hall is greater than the average calculated by the previous query to which, however, we will add the condition that the city is equal to the one we are externally analyzing. The end result will be:

				
					SELECT CodC, RoomNumber
FROM ROOM R1, CINEMA C1
WHERE R1.CodC=C1.CodC
AND Cacity > (SELECT AVG(Capacity)
	FROM ROOM R2, CINEMA C2
    WHERE R2.CodC=C2.CodC
    AND C1.City=C2.City);

The use of aliases helps us to distinguish various instances of the same table and to express correlation correctly.

This example highlights to us the difficulty of writing queries of this kind in a straightforward way. Are there simpler alternatives? Of course there are: derived tables and CTEs. We can, in fact, write the same query using what we saw in articles XXX and YYY. We just need to divide the problem differently:

For each city calculate the average capacity
Add the average capacity value to each city
Filter the rooms by capacity

We use CTEs for simplicity in reading the solution as well. The first query will then be:

				
					WITH 
AVERAGE_CITY AS
    (SELECT City, AVG(Capacity) AS AvgCapacity
    FROM ROOM R, CINEMA C
    WHERE R.CodC=C.CodC
    GROUP BY City)

At this point, to answer the other two steps of the problem, simply join the CTE just defined with the CINEMA table and then, through another join, retrieve the theaters in each theater and compare the capacity with the average capacity. The final query will be:

				
					WITH 
AVERAGE_CITY AS
    (SELECT City, AVG(Capacity) AS AvgCapacity
    FROM ROOM R, CINEMA C
    WHERE R.CodC=C.CodC
    GROUP BY City)

SELECT CodC, RoomNumber
FROM ROOM R, CINEMA C, AVERAGE_CITY A
WHERE R.CodC=C.CodC
AND A.City=C.City
AND Capacity > AvgCapacity;

As can be seen, the use of CTE and derived tables is much more intuitive than correlation and places no constraints on the levels of nesting we can use.

More To Explore

Artificial intelligence

Gradio: web applications in python for AI [part2]

Gradio is a python library that allows us to create web applications quickly and intuitively for our machine learning and AI models. Our applications always require user interaction and layout customization. Let us find out, through examples, how to improve our applications.

Alessandro Fiori 22 April 2024

Artificial intelligence

Gradio: web applications in python for AI [part1]

Writing web applications for our machine learning and/or artificial intelligence models can take a lot of time and skills that we do not possess. To streamline and speed up this task we are helped by Gradio, a Python library designed to create web applications with just a few lines of code. Let’s discover its basic functionality with some examples.

Alessandro Fiori 8 April 2024

SQL: correlation

Share

EXISTS and NOT EXISTS

Correlation among queries

More To Explore

Gradio: web applications in python for AI [part2]

Gradio: web applications in python for AI [part1]

Leave a Reply Cancel reply

Design with MongoDB!!!