Neo4j: a guide to using a graph database

Neo4j is one of the leading graph databases. With Cypher queries you can manage a graph representation of your data and discover interesting correlations within it. Let's find out how to use it with some simple query examples and a few advanced use cases.

The world of NoSQL databases is a vast one. In the book Designing with MongoDB, we covered how to model a document database to get the most out of it. However, document databases such as MongoDB are not the only option. Among the other types, graph databases are particularly interesting. They lend themselves well to modeling complex social networks or tracking highly interconnected data. Using graph theory, it is then possible to perform analyses that detect anomalous behavior or simply recommend an item of interest. In this article we are going to discover Neo4j, one of the most widely used graph databases.

Before you start using Neo4j, remember that this type of database models data as a graph, whose structure is based on:

  • Nodes: represent records/data. You can add zero or more labels to a node.
  • Relationships: represent the connections between nodes. Each relationship always has a direction.
  • Properties: represent named values and can be associated with both nodes and relationships.

Let’s start with the installation.

Installation

This tutorial is based on the desktop version of Neo4j installed on Ubuntu 20.04. First you need to go to the official Neo4j website. Depending on your operating system, the instructions may differ from those illustrated below.

If you are on Linux, once you have entered your data in the appropriate form, the download of the application will begin. Before launching the installation, however, you must make the file executable with the following command:

chmod +x FILE_NAME 

Executing the file opens a window asking you to accept the user license.

Once you have accepted, you will need to enter your data. If you downloaded the file from the site, and have therefore already registered, you just have to enter the same details again. In addition, you will have to paste the key that was generated during registration into the appropriate field.

If all fields have been correctly entered, you can start the installation by clicking on the “Activate” button. At this point the software will carry out a series of steps to verify that the system requirements are met and thus proceed with the installation of the various components. In case of errors or available updates, special windows will be displayed.

When the system is ready you’ll see the Neo4j dashboard. You’ll notice that an example project has already been loaded. We’re going to use this example to start understanding the basic concepts of Neo4j.

CYPHER

In the example project, two files are listed. We are going to use the file about-movie.neo4j-browser-guide. Moving the mouse over it displays an Open button that launches a new window connected to our database. The guide is a tutorial that introduces both the graphical interface and Cypher queries.

The database, called Movie Graph, contains the actors and directors of some movies. In particular, there are two types of nodes: Person and Movie. These nodes are connected by relationships that represent, for example, whether a person acted in a movie or directed it. There are also social relationships, such as whether a person follows another on social networks. To see all the types of nodes and relationships, as well as their properties, just click on the database icon in the upper left corner.
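The same overview can also be obtained with queries. As a minimal sketch, assuming a Neo4j version in which the built-in procedures db.labels(), db.relationshipTypes() and db.schema.visualization() are available, each of the following statements can be run on its own in the query editor:

```cypher
// List every node label defined in the database (for example Person and Movie).
CALL db.labels();

// List every relationship type (for example ACTED_IN).
CALL db.relationshipTypes();

// Draw a small graph showing which labels are connected by which relationship types.
CALL db.schema.visualization();
```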

The Cypher language is used to query data in Neo4j. It’s a highly optimized language to find the nodes of interest and navigate the relationships between them. When writing a query in Cypher, one must remember that there are no tables on which join operations must be performed, only nodes and relationships. The idea must therefore be to identify the nodes of interest and from these navigate the available relations. Let’s discover the most used queries.

MATCH

The MATCH clause is used to select nodes or relationships that meet certain criteria. It is possible to filter objects based on their labels or the properties associated with them. As a first example, we want to display the node of the actor Tom Hanks. To make the query more efficient we select only nodes of type Person (a:Person) and require the name property to be equal to Tom Hanks ({name:'Tom Hanks'}).

To display the data, you must always add a RETURN clause listing the objects you want to see. The result is shown as a graph by default; it can also be displayed as a table, text or code.
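Putting the label and the property filter together gives a query like the one below. It is a reconstruction from the description above, written to be consistent with the other queries in this article:

```cypher
// Select the Person node whose name property is exactly 'Tom Hanks'.
MATCH (a:Person {name:'Tom Hanks'})
RETURN a
```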

In this case only the selected node is displayed. We will see later that the visualization can be much more complex depending on the query.

MATCH also allows us to identify nodes that have specific relationships with other nodes. For example, we can display all the movies in which Tom Hanks has acted. We start by selecting the node of type Person with the name property equal to Tom Hanks, as in the previous example. We then require the database to traverse only the outgoing relationships of type ACTED_IN from the selected node. For the sake of readability, but also efficiency, we specify that the linked node must be of type Movie. The resulting query is:

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m:Movie) RETURN a,m 

Entering both the source and destination nodes in the RETURN will show the resulting graph. By clicking on a node of your choice it is also possible to expand the graph, collapse its relations or remove the node from the view.

In the RETURN clause you can also specify only the properties of interest of the returned nodes. Suppose we want to display the names of the actors who acted together with Tom Hanks. Starting from the structure of the previous query, we add a further relationship between the Movie node and a new Person node. To display the name, it is sufficient to refer to the property using dot notation. The resulting query is as follows:

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(c) RETURN c.name 

In this case you won’t see the resulting graph, but only a table whose columns are the required properties while the rows are the returned values.
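One detail worth knowing: an actor who shared several movies with Tom Hanks appears once per shared movie in that table. A possible variant, sketched here with the standard DISTINCT and ORDER BY keywords (the alias coActor is chosen for illustration), returns each name only once and sorts the list:

```cypher
MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m:Movie)<-[:ACTED_IN]-(c:Person)
RETURN DISTINCT c.name AS coActor   // one row per co-actor, however many movies they share
ORDER BY coActor
```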

Similarly, you can visualize the types of relationships that connect two nodes and the structure of the node itself. Suppose we want to visualize the roles of the people who acted, directed and reviewed the movie Cloud Atlas. Starting from a generic node of type Person we will set up a non-directional relationship with the film of interest. In the RETURN we will select the name of the node Person, the type of relationship and finally the information contained in the relationship. The corresponding query is as follows:

MATCH (people:Person)-[relatedTo]-(:Movie {title:'Cloud Atlas'}) 
RETURN people.name, type(relatedTo), relatedTo 

The result is shown as a table by default. In this case, the data associated with each relationship is shown in JSON format.

WHERE

In the previous example we looked for a perfect match. In many cases it is necessary to impose looser constraints. This can be achieved through the WHERE clause. The predicates can be very complex and include boolean conditions (AND, OR and NOT), ranges of values and/or partial string matches. For example, the query to select all actors whose name begins with Tom will have the following syntax:

MATCH (a:Person) WHERE a.name STARTS WITH 'Tom' RETURN a 

If, on the other hand, we wanted all the movies released in the 1990s, the query would be:

MATCH (a:Movie) WHERE a.released >= 1990 AND a.released < 2000 RETURN a 
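Predicates can also be combined. The sketch below is an illustrative variant, not taken from the built-in guide: it keeps the people whose name contains 'Tom' and who acted in a movie released in the 1990s, writing the range as a chained comparison.

```cypher
MATCH (p:Person)-[:ACTED_IN]->(m:Movie)
WHERE p.name CONTAINS 'Tom'
  AND 1990 <= m.released < 2000   // chained comparison: same as two conditions joined by AND
RETURN p.name, m.title, m.released
ORDER BY m.released
```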

CREATE

The CREATE statement is used to insert data into our database. As with the MATCH clause, patterns are used to associate labels and properties with nodes. For example to create a person node we can write the following query:

CREATE (a:Person {name:'Brie Larson', born:1989}) RETURN a 

This statement creates a node of type Person with the property name equal to Brie Larson and a year of birth (born property) equal to 1989. The RETURN clause is used only to display the newly created node. In a similar way we can create a node of type Movie.

CREATE (a:Movie {title:'Captain Marvel', released:2019, 
tagline:'Everything begins with a (her)o.'}) RETURN a 

Attention

The queries above do not check whether a node with the same data already exists. Therefore, care must be taken to avoid duplicating data, either in the application code or by creating uniqueness constraints within the database.
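A uniqueness constraint is one way to let the database enforce this. The statement below is a sketch that assumes the FOR ... REQUIRE syntax of Neo4j 4.4 or later (older releases use a different form) and a hypothetical constraint name, person_name_unique. It also assumes that no two Person nodes already share a name; otherwise creating the constraint fails.

```cypher
// Reject any attempt to create a second Person node with an existing name.
// person_name_unique is a made-up name used only for this example.
CREATE CONSTRAINT person_name_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.name IS UNIQUE
```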

DELETE

The DELETE command allows you to remove nodes that meet certain requirements. The command only allows you to remove nodes that have no relationships. If you want to remove the relationships in which they are involved, you must add the DETACH clause. For example, if we want to remove the movie Captain Marvel, and all relationships connected to it, we must select it using a MATCH and then delete the relationships and the node itself. The query to perform this operation will be:

MATCH (a:Movie {title:'Captain Marvel'}) DETACH DELETE a 
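DELETE also works on relationships. As a sketch, assuming that Brie Larson and Captain Marvel nodes exist and are linked by an ACTED_IN relationship, the following query removes only that relationship and leaves both nodes in place. If nothing matches the pattern, nothing is deleted.

```cypher
// Bind the relationship to the variable r and delete it; the two nodes are untouched.
MATCH (:Person {name:'Brie Larson'})-[r:ACTED_IN]->(:Movie {title:'Captain Marvel'})
DELETE r
```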

MERGE

To avoid duplicates, the MERGE statement can be used. This command first searches for a node that satisfies the given pattern; if none exists, it creates one. The ON CREATE clause defines the properties to set when the node is created, while the ON MATCH clause defines the properties to set when the node already exists.

Suppose, for example, that we want to create the node for the actress Brie Larson. If she is not yet present in the graph, we set her year of birth to 1989. Otherwise, we increase the value of the stars property. Since stars may not have been set yet, we can use the COALESCE function, as shown in the following query: it returns the first non-null value in its list of arguments.

MERGE (a:Person {name:'Brie Larson'}) 
ON CREATE SET a.born = 1989 
ON MATCH SET a.stars = COALESCE(a.stars, 0) + 1
RETURN a 
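The behaviour of COALESCE can be checked in isolation. This small query needs no data: it returns the first argument that is not null, so a stored 0 is kept rather than skipped. The column aliases are chosen for illustration.

```cypher
RETURN coalesce(null, 0)       AS missingStars,   // 0: null is skipped
       coalesce(null, null, 3) AS firstNonNull,   // 3
       coalesce(0, 5)          AS zeroIsKept      // 0: zero is a real value, not null
```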

The MERGE statement can also be used to create relationships between two existing nodes while preventing duplicates. Suppose we want to insert the role of Brie Larson in the movie Captain Marvel. With the MATCH clause we select the corresponding Person and Movie nodes, while with the MERGE clause we create the ACTED_IN relationship if and only if it has not already been defined. The final query is:

MATCH (a:Person {name:'Brie Larson'}), (b:Movie {title:'Captain Marvel'})
MERGE (a)-[r:ACTED_IN]->(b) SET r.roles = ['Carol Danvers']
RETURN a,r,b 
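Note that the SET in this query runs on every execution and overwrites roles each time. If the roles should be written only when the relationship is first created, one possible variant uses ON CREATE SET, as in this sketch:

```cypher
MATCH (a:Person {name:'Brie Larson'}), (b:Movie {title:'Captain Marvel'})
MERGE (a)-[r:ACTED_IN]->(b)
  ON CREATE SET r.roles = ['Carol Danvers']   // skipped when the relationship already exists
RETURN a, r, b
```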

Advanced use cases

Cypher makes it possible to write very complex queries. The query language, for example, also supports list comprehensions. These can be useful to extend the list currently stored in a property.

For example, if we wanted to add a new value to an actor's list of roles, we would need to ensure that it is not already present. A list comprehension solves this problem: it filters the value out of the current list before appending it again. An example of such a query is shown below.

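MATCH (a:Person {name:'Brie Larson'}), (b:Movie {title:'Captain Marvel'})
MERGE (a)-[r:ACTED_IN]->(b)
// coalesce guards against a missing roles property, e.g. on a relationship MERGE has just created
SET r.roles = [x IN coalesce(r.roles, []) WHERE x <> 'Captain Marvel'] + ['Captain Marvel']
RETURN a,r,b 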
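To see what the list comprehension does without touching the database, it can be evaluated on a literal list. In this sketch the input list is made up for illustration:

```cypher
WITH ['Carol Danvers', 'Captain Marvel'] AS roles
// Drop any existing 'Captain Marvel' entry, then append it exactly once.
RETURN [x IN roles WHERE x <> 'Captain Marvel'] + ['Captain Marvel'] AS updated
```

The returned list still contains 'Carol Danvers' and 'Captain Marvel', each exactly once.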

A graph representation of the data also allows you to perform complex analyses that would be very difficult to express in a relational database. An example is finding the people who are linked, even indirectly, to a specific person. In sociology this kind of analysis is very useful and is the basis for evaluating degrees of separation. In our example we want to find all the people linked to Kevin Bacon up to the fourth degree. By setting a limit on the number of relationships that can be traversed (here from 1 to 4), the result is returned in a few seconds.

MATCH (bacon:Person {name:"Kevin Bacon"})-[*1..4]-(hollywood)
RETURN DISTINCT hollywood 
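To also see how far each person is from Kevin Bacon, a possible variant keeps the length of the shortest matching path. This is a sketch: hops counts relationships, so two actors who share a movie are 2 hops apart (Person, Movie, Person), and people who share a name are grouped into one row.

```cypher
MATCH p = (bacon:Person {name:'Kevin Bacon'})-[*1..4]-(other:Person)
WHERE other <> bacon                      // ignore paths that loop back to Kevin Bacon
RETURN other.name AS person,
       min(length(p)) AS hops             // shortest of all matching paths to that person
ORDER BY hops, person
```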

Neo4j also offers some graph-theory algorithms. For example, it is possible to find the shortest path between two nodes. Suppose we want to visualize how Kevin Bacon could know Al Pacino, directly or indirectly. With the shortestPath function we delegate the calculation to Neo4j instead of computing it in our application. The query for our example is the following.

MATCH p=shortestPath(
              (bacon:Person {name:"Kevin Bacon"})-[*]-(a:Person {name:'Al Pacino'})
            )
RETURN p 
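When only the chain of names matters, the path can be unpacked with the nodes() and length() functions. The sketch below caps the search at 6 hops (an arbitrary limit chosen for illustration) and uses coalesce so that Person nodes show their name and Movie nodes their title:

```cypher
MATCH p = shortestPath(
  (bacon:Person {name:'Kevin Bacon'})-[*..6]-(a:Person {name:'Al Pacino'})
)
// Turn the path into a readable list: person names and movie titles in order.
RETURN [n IN nodes(p) | coalesce(n.name, n.title)] AS steps,
       length(p) AS hops
```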

The result is displayed as a graph containing only the nodes and relationships along that path.

Another advanced use of Cypher is the recommendation of potential nodes of interest. This is very useful in social relationships in general, but also in work relationships. A basic recommendation approach is to find connections from immediate neighbors that are themselves well connected. 

For example, we might identify actors who could work well with Tom Hanks because they have already worked with some of his co-actors. We can also assign a weight based on the number of paths returned for each actor. The resulting query is the following.

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
      (coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(cocoActors)
WHERE NOT (a)-[:ACTED_IN]->()<-[:ACTED_IN]-(cocoActors) AND a <> cocoActors
RETURN cocoActors.name AS Recommended, count(*) AS Strength ORDER BY Strength DESC 

Similarly, we can identify the people who could introduce Tom Cruise to Tom Hanks because they have worked with both of them. A possible solution to the problem is the following.

MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
(coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(other:Person {name:'Tom Cruise'})
RETURN a, m, coActors, m2, other 
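If a table of candidates is preferable to a graph, one possible variant returns each co-actor once, together with the movies that link them to the two actors. This is a sketch using the standard collect() aggregation; the column aliases are chosen for illustration.

```cypher
MATCH (a:Person {name:'Tom Hanks'})-[:ACTED_IN]->(m)<-[:ACTED_IN]-(coActors),
      (coActors)-[:ACTED_IN]->(m2)<-[:ACTED_IN]-(other:Person {name:'Tom Cruise'})
RETURN coActors.name AS introducer,
       collect(DISTINCT m.title)  AS moviesWithHanks,   // movies shared with Tom Hanks
       collect(DISTINCT m2.title) AS moviesWithCruise   // movies shared with Tom Cruise
```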

These are just some of the possible applications of Neo4j. You can find more case studies and other documentation on the official website.
