For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation. For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
This rmd,sql code and all related input data can be found on my github.
Run these two applications first in order to install ‘RNeo4j’ package in R
#library(devtools)
#install_git("https://github.com/nicolewhite/RNeo4j")
library(RNeo4j)
library(knitr)
airports <- read.csv("https://raw.githubusercontent.com/omerozeren/DATA607/master/HMW_12/neo4j-airport-csv-raw.csv", stringsAsFactors=F)
flights <- read.csv("https://raw.githubusercontent.com/omerozeren/DATA607/master/HMW_12/neo4j-flight-csv-raw.csv", stringsAsFactors=F)
kable(head(flights,10))
flight | airline | depart | arrive | capacity |
---|---|---|---|---|
23 | Delta | ATL | DTW | 160 |
24 | Delta | DTW | ATL | 160 |
27 | Delta | DTW | BOS | 160 |
28 | Delta | BOS | DTW | 160 |
35 | Delta | DTW | PIT | 128 |
36 | Delta | PIT | DTW | 128 |
37 | Delta | ATL | BOS | 160 |
38 | Delta | BOS | ATL | 160 |
43 | Delta | ATL | PIT | 160 |
44 | Delta | PIT | ATL | 160 |
flight_nodes <- startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "omerozeren")
query <- "MERGE (:Airport {label:{label_name}, city:{city_name}, state:{state_name}})"
# Create a transaction object
t <- newTransaction(flight_nodes)
# Append the cypher queries to the transaction object
for (i in 1:nrow(airports)){
appendCypher(t,
query,
label_name = airports[i,]$label,
city_name = airports[i,]$city,
state_name = airports[i,]$state)
}
# Execute the transaction
commit(t)
query <- "
MERGE (f:Flight {flight:{flnm}, airline:{alnm}, capacity:{cpct}})
WITH f
MATCH (a:Airport {label:{depart}}), (c:Airport {label:{arrive}})
CREATE (a) -[:DEPART {takeoff:{tkof}}]-> (f) -[:ARRIVE {landing:{ldng}}]-> (c)
"
t <- newTransaction(flight_nodes)
for (i in 1:nrow(flights)){
appendCypher(t,
query,
flnm = flights[i,]$flight,
alnm = flights[i,]$airline,
cpct = flights[i,]$capacity,
tkof = flights[i,]$takeoff,
ldng = flights[i,]$landing,
depart = flights[i,]$depart ,
arrive = flights[i,]$arrive)
}
commit(t)
knitr::include_graphics("https://raw.githubusercontent.com/omerozeren/DATA607/master/HMW_12/flight_nodes.png")
So what are the pros and cons of having graphical databases. A source that I had referred to for this is listed below:
References : https://scraperwiki.com/2015/01/book-review-graph-databases-by-ian-robinson-jim-webber-and-emil-eifrem/
Three reasons why they(authors) believe graphical databases are better.
Performance: “In contrast to relational databases, where join-intensive query performance deteriorates as the dataset gets bigger, with a graph database performance tends to remain relatively constant, even as the dataset grows. This is because queries are localized to a portion of the graph. As a result, the execution time for each query is proportional only to the size of the part of the graph traversed to satisfy that query, rather than the size of the overall graph.” In other words, they had noted improved speed and efficiency (despite the contrary of what I had encountered.)
Flexibility: “Graphs are naturally additive, meaning we can add new kinds of relationships, new nodes, new labels, and new subgraphs to an existing structure without disturbing existing queries and application functionality. These things have generally positive implications for developer productivity and project risk. Because of the graph model’s flexibility, we don’t have to model our domain in exhaustive detail ahead of time-a practice that is all but foolhardy in the face of changing business requirements.” They noted that neo4j is a much more flexible database system.
Agility: “We want to be able to evolve our data model in step with the rest of our application, using a technology aligned with today’s incremental and iterative software delivery practices. Modern graph databases equip us to perform frictionless development and graceful systems maintenance.” They also state that graphical databases are easier to maintain.
This is newer technology, so there is less time for the technology to mature.
Given that this is new, a new language i.e. Cypher needs to be learned and requires to rewire your brain.
If the databases do not rely on relationships, then it may not be necessary to use a graphical database such as neo4j.