A Journey with a Graph Database
What is a graph database, you may ask?
If I say “table” and you immediately assume that I’m talking about databases, there is a good chance that you work in IT. This is not surprising because, like most developers, you probably took your baby steps with a relational database.
Then one day, you find yourself looking at other kinds of databases, maybe because an RDBMS is not a good fit anymore, or maybe because you’re just curious. Whatever the reason may be, I hope you turned to graph databases.
Before going straight into the concept of graph databases, a short introduction to graphs doesn’t hurt.
Graphs are an abstract representation of data with objects (called nodes or vertices) and the relationships between these objects (called lines or edges). A graph is what naturally comes to mind when we think about drawing a model.
A graph database is basically just a way to store data as nodes and relationships.
Nodes represent your objects (often nouns), while relationships represent how they interact and relate to each other (often verbs).
The small graph below is a simple way to represent people and the departments they belong to.
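To make the idea of nodes and relationships concrete, here is a minimal sketch in plain Python. It is not how a graph database stores data internally; the `Node` and `Relationship` classes, the names, and the `BELONGS_TO` relationship type are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    label: str  # kind of object (a noun), e.g. "Person" or "Department"
    name: str


@dataclass(frozen=True)
class Relationship:
    start: Node
    rel_type: str  # how the nodes relate (a verb), e.g. "BELONGS_TO"
    end: Node


# Hypothetical people and department, for illustration only.
alice = Node("Person", "Alice")
bob = Node("Person", "Bob")
engineering = Node("Department", "Engineering")

relationships = [
    Relationship(alice, "BELONGS_TO", engineering),
    Relationship(bob, "BELONGS_TO", engineering),
]


def members_of(department, rels):
    """Return the names of the people who belong to the given department."""
    return sorted(
        r.start.name
        for r in rels
        if r.rel_type == "BELONGS_TO" and r.end == department
    )


print(members_of(engineering, relationships))
```

Answering “who belongs to this department?” is just a matter of following the relationships that point to the department node.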
Graph databases are not really new. In fact, they are commonly used by some big companies that deal with heavily interconnected data, especially when there is a social component.
Because a picture is worth a thousand words, let me give you a quick example comparing a graph model with a relational database.
Screenshots taken from https://neo4j.com/developer/relational-to-graph-modeling
As you may notice, a graph data model is more “friendly” because most people naturally tend to represent an example of a data structure by drawing nodes and arrows to connect them.
This goes without saying, but when you design your data structure at the beginning of your project, this will be a huge time saver because you won’t need to reformat your model to fit an RDBMS data structure.
It’s not you, it’s me
I have worked with relational databases for a decade now. They are the common way to store data and interact with it.
I have nothing against RDBMS. In fact, it’s still what I use most of the time, when it’s the best choice for my project.
Sometimes, though, it’s not really the best option and we still use an RDBMS, because in the end it will always work. And you know what they say: if it ain’t broke, don’t fix it!
While working on a project, you don’t just make it work, you probably spend some time thinking about the best tools you can use to build a strong base for your project. This includes choosing tools that will scale with your project and won’t work against you when things go south.
When we started drawing our data structure, we quickly noticed how interconnected it was. In SQL, that translates to a lot of table joins.
One of the key concepts of our app is a recommendation system powerful enough to be able to generate recommendations based on multiple factors.
So, in a nutshell, our app has highly interconnected data, needs a recommendation system, and, on top of that, needs to scale easily with only a minor impact on performance.
I’m not saying that RDBMS can’t do the job, just that it’s probably not the best option for this task, especially when it comes to the recommendation system.
Why a graph database was the answer
Our app Dayvelop was designed as a solution for developers to keep track of news related to their favorite technologies and languages by subscribing to information feeds and participating in simple quizzes.
That means a typical user can subscribe to multiple information sources, and after a few months we can have users with thousands or even millions of opened articles. The same goes for the quizzes.
If you extend this to thousands of users, queries that join several of these large tables won’t perform that well with an RDBMS.
So, at this point, we have two choices: make it work with an RDBMS whatever the cost, or try to solve this with a graph database.
A graph database was built to deal with highly interconnected data as a core feature. Creating a recommendation system that can grow with the data volume generated by users is a piece of cake.
Here is an oversimplified version of our data model:
In red, we have user nodes and in blue, articles read by these users.
A recommendation engine can be used here to create recommendations based on user reading lists.
User 56651 may be interested in “articles read by users who also read one or more articles from my reading list”.
If we used an RDBMS, we would probably have a table for users, another for articles, and a join table to keep track of each user’s reading list. Using a structure like this one to answer the question above can quickly give you a headache, especially if you have millions of users and even more articles.
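To give an idea of what the relational version could look like, here is a rough sketch using Python’s built-in `sqlite3` module. The table names, column names, and sample rows are assumptions made for this example, not Dayvelop’s actual schema.

```python
import sqlite3

# Hypothetical schema: users, articles, and a join table for reading lists.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE reads (
        user_id INTEGER REFERENCES users(id),
        article_id INTEGER REFERENCES articles(id),
        PRIMARY KEY (user_id, article_id)
    );
    INSERT INTO users VALUES (1), (2), (3);
    INSERT INTO articles VALUES
        (10, 'Intro to Cypher'), (11, 'Go generics'), (12, 'Rust lifetimes');
    INSERT INTO reads VALUES (1, 10), (2, 10), (2, 11), (3, 12);
""")

# "Articles read by users who also read one or more articles from my list":
# the join table is used three times, plus a subquery to exclude my own reads.
RECOMMEND_SQL = """
    SELECT DISTINCT a.title
    FROM reads AS mine
    JOIN reads AS shared
      ON shared.article_id = mine.article_id
     AND shared.user_id <> mine.user_id
    JOIN reads AS theirs ON theirs.user_id = shared.user_id
    JOIN articles AS a ON a.id = theirs.article_id
    WHERE mine.user_id = ?
      AND theirs.article_id NOT IN (
          SELECT article_id FROM reads WHERE user_id = ?
      )
    ORDER BY a.title
"""


def recommend(user_id):
    """Return titles the user has not read but similar readers have."""
    return [row[0] for row in conn.execute(RECOMMEND_SQL, (user_id, user_id))]


print(recommend(1))
```

Even on this tiny schema, the question needs three passes over the join table, which is the part that tends to hurt as the reading lists grow.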
Let’s explore how to do this with a graph database.
For our Dayvelop app, we used the Neo4j database. Neo4j uses a query language called Cypher to query the database.
You can take a look at the official documentation to get more information: https://neo4j.com/developer/cypher
Let’s go back to our recommendation engine and explore the query needed to find an article to recommend to a user based on their current reading list.
Of course, this version is pretty simple. In a real recommendation engine, you may consider ordering the results to get the most relevant article first. There are different ways of doing this, and although we won’t explore these options here, you could return the article that has the most reads first, or favor the users who share the most reads with the current user, or use any other criterion that may be interesting for your users.
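The recommendation rule, together with one of the ordering ideas mentioned above, can be sketched in plain Python. This is only an in-memory illustration of the traversal, not the Cypher query we used; apart from user 56651, the user ids, article ids, and reading lists are hypothetical.

```python
from collections import Counter

# Hypothetical data: user id -> set of article ids the user has read.
reads = {
    "56651": {"a1", "a2"},
    "u2": {"a1", "a3", "a4"},
    "u3": {"a2", "a3"},
    "u4": {"a5"},
}


def recommend(user_id, reads):
    """Recommend unread articles, most-read among similar users first."""
    mine = reads[user_id]
    scores = Counter()
    for other, theirs in reads.items():
        # Only follow users who share at least one article with the current user.
        if other == user_id or not (mine & theirs):
            continue
        for article in theirs - mine:
            scores[article] += 1
    # Highest score first; ties are broken by article id to keep the order stable.
    return [a for a, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))]


print(recommend("56651", reads))
```

The traversal goes from the user to their articles, from those articles to other readers, and from those readers to articles the user has not read yet, which is the same path you would describe in a graph query.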
How to go from relational databases to graph databases
Modeling your data with a graph database is pretty straightforward, but you may still need to take some time to think about how your data will be structured.
In Dayvelop’s dashboard, users have different counters, like the number of new articles by source, the total number of reads, etc.
In the example above, to get the total reads for the user 44552, a simple query like this one will do:
This will work just fine, but as your database grows, counting relationships on every request (especially for counters that you display all the time) is not very efficient and you will experience some latency. A popular and simple way to avoid this is to maintain your own counters in separate nodes.
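The trade-off between counting on read and maintaining a counter on write can be sketched like this. It is a plain Python stand-in, not Neo4j code: the `ReadingGraph` class and its methods are hypothetical, and the `read_counts` dictionary plays the role of the separate counter nodes.

```python
class ReadingGraph:
    """In-memory stand-in for READ relationships plus a maintained counter."""

    def __init__(self):
        self.reads = set()      # (user_id, article_id) pairs
        self.read_counts = {}   # user_id -> number of articles read

    def mark_read(self, user_id, article_id):
        """Record a read and update the counter in the same operation."""
        if (user_id, article_id) in self.reads:
            return  # already recorded: do not count the same read twice
        self.reads.add((user_id, article_id))
        self.read_counts[user_id] = self.read_counts.get(user_id, 0) + 1

    def total_reads_by_counting(self, user_id):
        """Count on read: the cost grows with the number of relationships."""
        return sum(1 for user, _ in self.reads if user == user_id)

    def total_reads(self, user_id):
        """Read the maintained counter: a single lookup."""
        return self.read_counts.get(user_id, 0)


graph = ReadingGraph()
graph.mark_read("44552", "a1")
graph.mark_read("44552", "a2")
graph.mark_read("44552", "a2")  # duplicate, ignored
print(graph.total_reads("44552"), graph.total_reads_by_counting("44552"))
```

The price of the fast lookup is that every write must also update the counter, so both have to happen together (in a real database, inside the same transaction) or the counter will drift.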
Another thing to keep in mind when using Cypher is to avoid Cartesian products as much as possible. Simple and innocent-looking queries can load millions or billions of nodes depending on the size of your database.
For example, you may be tempted to find two users with this query:
This query will give you this result if you have only 2 users in your database:
This is the result for 3 users in your database:
As you can see, we already have 27 records with just 3 users in our database.
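To see why the number of records grows so fast, here is a small Python sketch of what a Cartesian product does. It only illustrates the arithmetic and is not a Cypher query; the helper name and user ids are made up. If a query contains several patterns that are not connected to each other, every match of one pattern is combined with every match of the others.

```python
from itertools import product


def cartesian_rows(users, patterns):
    """All combinations produced when `patterns` unconnected patterns each match any user."""
    return list(product(users, repeat=patterns))


users = ["u1", "u2", "u3"]  # hypothetical user ids
for patterns in (1, 2, 3):
    print(patterns, "pattern(s):", len(cartesian_rows(users, patterns)), "rows")
```

With n users and k unconnected patterns, the result has n to the power of k rows, so a database with a million users is far beyond what you want a dashboard query to touch.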
The good news is that Neo4j has a desktop browser to test your queries and you will get a warning if you include some heavy loads like this one:
You also have the profiler option to analyze your queries and optimize them.
For example, profiling the query above will reveal the Cartesian product and how heavy it is.
More generally, using the profiler throughout your project is a good idea to understand your queries and identify their costs.
With these tools and the pretty good documentation of Neo4j, it’s quite easy to design your data model as a graph or even go from a relational database to a graph one.
Another thing to keep in mind is that a graph database can work hand in hand with your pre-existing solution for data storage.
For another project that we worked on recently, we already had a NoSQL database to store our data in, but one of the main features we needed to build was a recommendation engine. So, we just exported all the data that the recommendation engine needed into a Neo4j database and kept it synchronized.
The good, the bad, and the ugly
So why not just use a graph database all the time? I mean, after all, everything you can do with RDBMS or a NoSQL database, you can do it (even better) with a graph database.
The answer is “actually you can but…”
The truth is that the more you use a graph database, the more it will become the obvious choice for every project — at least that’s what happened to me.
Now I see everything as a graph…
But let’s get out of the matrix for a minute and explore some issues that we faced with Neo4j at the beginning of our project.
Please keep in mind that these issues are specific to Neo4j, at least at the time of writing this article.
Chances are that if you migrate from another type of database, you’ll need to import data into Neo4j.
In some cases, you even need to import data for new projects like referential data or any other kind of data that your project might need.
Neo4j lets you import CSV files into your database using a combination of LOAD CSV, MERGE, and CREATE. This works just fine and can easily handle small CSV files.
For bigger files of more than 20 MB, this option is very slow, especially if you insert your data into a non-empty database.
Luckily, there is a way to load your data much faster, but it requires some transformations. In our case, we used the Neo4j admin tool, which is designed for managing your Neo4j instance.
This command-line tool has an option to import structured files. It is pretty fast, but it only works on an empty database, which is not always convenient.
So basically, we had to adapt our CSV files and create others to store the relationships.
For a simple model like the one below, we need at least two files:
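As an illustration of that transformation, here is a minimal Python sketch that splits a flat list of reads into one nodes file and one relationships file. The model (users reading articles), the ids, and the `READ` type are assumptions for the example. The header columns (`id:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`) follow my understanding of the header format expected by the Neo4j admin import, so check them against the documentation for your Neo4j version.

```python
import csv
import io


def to_import_files(reads):
    """Split (user_id, article_id) pairs into nodes CSV text and relationships CSV text."""
    nodes = io.StringIO()
    rels = io.StringIO()
    node_writer = csv.writer(nodes, lineterminator="\n")
    rel_writer = csv.writer(rels, lineterminator="\n")

    # Header rows (assumed import format, see the note above).
    node_writer.writerow(["id:ID", ":LABEL"])
    rel_writer.writerow([":START_ID", ":END_ID", ":TYPE"])

    seen = set()
    for user_id, article_id in reads:
        # Write each node once, even if it appears in many reads.
        for node_id, label in ((user_id, "User"), (article_id, "Article")):
            if node_id not in seen:
                seen.add(node_id)
                node_writer.writerow([node_id, label])
        rel_writer.writerow([user_id, article_id, "READ"])

    return nodes.getvalue(), rels.getvalue()


# Hypothetical flat export: one row per read.
nodes_csv, rels_csv = to_import_files([("u1", "a1"), ("u2", "a1")])
print(nodes_csv)
print(rels_csv)
```

In a real export you would write these strings to files and probably keep more properties per node, but the split between nodes and relationships stays the same.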
So finally, after some changes, we were able to upload our data in a reasonable timeframe.
Is Neo4j the only option here?
Of course not. We decided to go with Neo4j as our primary database for Dayvelop, but there are many other graph databases out there.
We found that this one was particularly easy to start with, and the documentation is pretty good.
But if you’re interested, you can also check out Dgraph. It’s a horizontally scalable and distributed GraphQL database with a graph backend. It’s open-source and written entirely in Go.
https://dgraph.io/dgraph
By the way, what exactly is Dayvelop?
During this pandemic, I tried, like many developers out there, to improve my skills by reading and exploring new technologies and tools.
After some research, I still hadn’t found a suitable solution specially designed for developers to stay updated about their favorite skills and tools.
With a friend, I decided to create Dayvelop, a web application to follow your favorite sources of information. We also added some quizzes to test your knowledge from time to time.
If something like that might interest you, consider joining us at https://dayvelop.app/
Until then, have a good one! Cheers :)