Neo4j Graphical Database
For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation. For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
For my database of choosing, I had selected nycflights13 as this has countless examples of possibilities for nodes and its relationships. Let’s obtain the data from the nycflights13 package.
if (!require(nycflights13)) install.packages('nycflights13')
library(nycflights13)
flights.airlines <- airlines
flights.airports <- airports
flights.flights <- flights
flights.planes <- planes
flights.weather <- weather
Great! We had successfully obtained all 5 relational dataframes from the package. Now, we will create .csv files and download them for use in neo4j. As of note, it is important that we download these files into the correct folder. The way neo4j works is that it uses a local database folder. So for instance, for my particular laptop (which happens to be a Macbook), the folder that I would load these files into would be: /Users/Joel/Documents/Neo4j/default.graphdb/import. For reproducibility, it is important to set the working directory to your default folder.
# Import data into correct folder. Below would be an example code to set the working directory to the appropriate folder.
# setwd("/Users/Joel/Documents/Neo4j/default.graphdb/import")
# Now time to write the dataframes into csv files and load them into the local directory
if (!file.exists("airlines.csv"))
write.csv(flights.airlines, file = "airlines.csv")
if (!file.exists("airports.csv"))
write.csv(flights.airports, file = "airports.csv")
if (!file.exists("flights.csv"))
# From prior experimentation (and at least, reasons unknown at this time, but will need
# further investigation), a significant amount of computer resources were used when I had
# utilized all ~33k flights information and I was unsucessful in creating the MERGE
# relationships. Therefore, I had taken a sample of 5000 flights and used this instead.
# I'll need to look into more efficient coding queries in Cypher in the future.
flights.flights <- flights.flights[sample(nrow(flights.flights), 5000),]
write.csv(flights.flights, file = "flights.csv")
if (!file.exists("planes.csv"))
write.csv(flights.planes, file = "planes.csv")
if (!file.exists("weather.csv"))
write.csv(flights.weather, file = "weather.csv")
Now everything is downloaded. We will take these files and use them in Neo4j.
Given that this is R, and that the neo4j maintains its own language (Cypher), I will be typing the commands below. The initial step is to create each individual nodes for each relational table. Each node will have labels associated with it.
neo4j Commands for Loading CSVs
LOAD CSV WITH HEADERS FROM “file:///airlines.csv” AS airlines CREATE (a: Airlines {Carrier: airlines.carrier, Name: airlines.name})
LOAD CSV WITH HEADERS FROM “file:///airports.csv” AS airports CREATE (a: Airports {Label: airports.faa, Name: airports.name})
LOAD CSV WITH HEADERS FROM “file:///flights.csv” AS flights CREATE (a: Flights {Number: flights.flight, Origin: flights.origin, Dest: flights.dest, Carrier: flights.carrier, Tailnum: flights.tailnum, year: flights.year, month: flights.month, day: flights.day, dep_time: flights.dep_time, arr_time: flights.arr_time})
LOAD CSV WITH HEADERS FROM “file:///planes.csv” AS planes CREATE (a: Planes {Tailnum: planes.tailnum, Year: planes.year, Manufacturer: planes.manufacturer, Engines: planes.engines, Seats: planes.seats})
LOAD CSV WITH HEADERS FROM “file:///weather.csv” AS weather CREATE (a: Weather {Loc: weather.origin, year: weather.year, month: weather.month, day: weather.day, temp: weather.temp})
Creating Relationships
Now that we have uploaded all of the relational datasets as nodes with its appropriate labels into neo4J, it’s time to take advantage of what neo4j does best (which is creating relationships). With graphical databases, a lot of insight can be derived from its relationships. What we will do here is create the most obvious relationship in these datasets: the actual flight that connects between the airports. We’ll need to to define the relationships as either “Depart” or “Arrive” depending on where the flight is going.
LOAD CSV WITH HEADERS FROM “file:///flights.csv” AS flights MATCH (a:Flights {Number: flights.flight, year:flights.year, month: flights.month, day: flights.day, dep_time: flights.dep_time, Carrier: flights.carrier}) MATCH (b:Airports {Label: flights.origin}) MERGE (a)-[r:Departs {dep_time: flights.dep_time}]->(b)
LOAD CSV WITH HEADERS FROM “file:///flights.csv” AS flights MATCH (a:Flights {Number: flights.flight, year:flights.year, month: flights.month, day: flights.day, dep_time: flights.dep_time, Carrier: flights.carrier}) MATCH (b:Airports {Label: flights.dest}) MERGE (a)-[r:Arrives {arr_time: flights.arr_time}]->(b)
As of note, this function noted below :config initialNodeDisplay
sets the limit on how many nodes are displayed at once in the browser. Here, I had set it arbitrarily to 1000.
:config initialNodeDisplay: 1000
This next step is the MATCH function, where I query for the requested nodes, which is essentially, the airports and its connecting flights.
MATCH (a: Airports) -[r]- (b: Flights) -[z]- (c: Airports) RETURN a,b,c
As you can see below, this is what it looks like (for 1000 nodes).
I will say that this is quite beautiful! However, this was really taxing on my computer (and this is despite what was listed in the pros section for improved performance over relational databases!). Given the taxing nature, I had at this time deferred on attempting to create more relationships with the other databases such as planes
, weather
, and airlines
. Perhaps, again, once I learn how to query with improve efficiency, I may approach this topic again in the future.
Pros vs. Cons of Graphical Databases
So what are the pros and cons of having graphical databases. A source that I had referred to for this is listed below:
Source: Graph Databases by Ian Robinson, Jim Webber, and Emil Eifrem
These authors of the book argue for the strengths of graphical databases. Given that humans are visual, they believe that this is more natural for people to store information in this type of format. In addition, they provide 3 reasons why they believe graphical databases are better.
Performance: “In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph. As a result, the execution time for each query is proportional only to the size of the part of the graph traversed to satisfy that query, rather than the size of the overall graph.” In other words, they had noted improved speed and efficiency (despite the contrary of what I had encountered.)
Flexibility: “Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality. These things have generally positive implications for developer productivity and project risk. Because of the graph model’s flexibility, we don’t have to model our domain in exhaustive detail ahead of time—a practice that is all but foolhardy in the face of changing business requirements.” They noted that neo4j is a much more flexible database system.
Agility: “We want to be able to evolve our data model in step with the rest of our application, using a technology aligned with today’s incremental and iterative software delivery practices. Modern graph databases equip us to perform frictionless development and graceful systems maintenance.” They also state that graphical databases are easier to maintain.
While neo4j and other graphical databases can certainly flaunt some pros, there needs to be consideration for some cons. Below are some examples:
https://www.quora.com/What-are-the-pros-and-cons-of-using-a-graph-database https://groups.google.com/forum/#!topic/neo4j/mts6H9Py-2I https://www.slideshare.net/maxdemarzi/introduction-to-graph-databases-12735789
This is newer technology, so there is less time for the technology to mature.
Given that this is new, a new language i.e. Cypher needs to be learned and requires to rewire your brain.
If the databases do not rely on relationships, then it may not be necessary to use a graphical database such as neo4j.
Ultimately, graphical databases is another tool for the data scientist to use, and it’s certainly something that I would like to explore more.
RNeo4j
I wanted to spend a little bit of time on a package called RNeo4j
. The sources that I had used in regards to the package are:
This was written by Nicole White, who works for neo4j. This was a driver that allowed a connection between the R world with the Neo4j. This was an interesting package because it allowed the R user to interface with neo4j in the comforts of his or her Rstudio. The videos provide a tremendous amount of information on its use. In this particular example, I will be attempting to do a query search for flights between Newark and Atlanta.
library(RNeo4j)
# For reproducibility, you will need to replace the username and password with your own username and password.
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "neo4jCUNY")
# We will look for flights that flew from Newark to Atlanta.
query = "MATCH (a: Airports) -[r]- (b: Flights) -[z]- (c: Airports) WHERE a.Label = 'EWR' AND c.Label = 'ATL' RETURN b.Tailnum AS TailNumber_Flights"
# The cypher function retrieves Cypher query results as a data frame.
cypher(graph, query)
## TailNumber_Flights
## 1 N14905
## 2 N607LR
## 3 N326NB
## 4 N320NB
## 5 N751EV
## 6 N134EV
## 7 N13123
## 8 N301DQ
## 9 N326NB
## 10 N328NW
## 11 N977AT
## 12 N304DQ
## 13 N332NB
## 14 N16981
## 15 N317NB
## 16 N538CA
## 17 N977AT
## 18 N935AT
## 19 N11109
## 20 N309DE
## 21 N3744F
## 22 N787NC
## 23 N310DE
## 24 N321NB
## 25 N748EV
## 26 N305DQ
## 27 N935AT
## 28 N308DE
## 29 N305DQ
## 30 N760EV
## 31 N308DE
## 32 N302DQ
## 33 N368NB
## 34 N365NB
## 35 N17196
## 36 N351NW
## 37 N327NB
## 38 N11548
## 39 N371NB
## 40 N325NB
## 41 N12175
## 42 N989AT
## 43 N767NC
## 44 N16149
## 45 N943DL
## 46 N305DQ
## 47 N19554
## 48 N48901
## 49 N933AT
## 50 N371NB
## 51 N615QX
## 52 N360NB
## 53 N538CA
## 54 N16987
## 55 N13913
## 56 N341NB
## 57 N349NB
## 58 N354NB
## 59 N12922
## 60 N14558
## 61 N357NB
## 62 N767NC
## 63 N347NB
## 64 N22909
## 65 N38268
## 66 N328NB
## 67 N316US
## 68 N990AT
## 69 N302DQ
## 70 N15910
## 71 N368NB
## 72 N14916
## 73 N310DE
## 74 N397DA
## 75 N309DE
## 76 N309DE
## 77 N302DQ
## 78 N309DE
## 79 N979AT
## 80 N770NC
## 81 N779NC
Certainly much more could be written about this, but this is something that I will need to spend more time with.