For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
First we load RMySQL and strings packages using the following code.
# loading package
# install.packages("RMySQL")
# install.packages("stringr")
library(RMySQL)
## Loading required package: DBI
library(stringr)
Then we connect to the database from R.
con <- dbConnect(MySQL(), user='root', password='root', dbname='Flights', host='localhost')
We run a query to create the dataframes which now contain flights database tables.
airlines <- dbGetQuery(con, "SELECT * FROM airlines;")
airports <- dbGetQuery(con, "SELECT * FROM airports;")
flights <- dbGetQuery(con, "SELECT * FROM flights;")
planes <- dbGetQuery(con, "SELECT * FROM planes;")
weather <- dbGetQuery(con, "SELECT * FROM weather;")
We can now disconnect the database and visually inspect our data.
# Disconecting from Database
dbDisconnect(con)
## [1] TRUE
head(airlines)
## carrier name
## 1 9E Endeavor Air Inc.\r
## 2 AA American Airlines Inc.\r
## 3 AS Alaska Airlines Inc.\r
## 4 B6 JetBlue Airways\r
## 5 DL Delta Air Lines Inc.\r
## 6 EV ExpressJet Airlines Inc.\r
#Looks lie there are "/r" present in the airline names - let's get rid of those:
airlines$name<-str_replace(airlines$name,"\\r","")
head(airlines)
## carrier name
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
head(airports)
## faa name lat lon alt tz dst
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
## 4 06N Randall Airport 41.43191 -74.39156 523 -5 A
## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -4 A
## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4 A
head(flights)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 6 54
## 6 1696 EWR ORD 150 719 6 54
head(planes)
## tailnum year type manufacturer model engines
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## 6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## seats speed engine
## 1 55 NA Turbo-fan
## 2 182 NA Turbo-fan
## 3 182 NA Turbo-fan
## 4 182 NA Turbo-fan
## 5 55 NA Turbo-fan
## 6 182 NA Turbo-fan
head(weather)
## origin year month day hour temp dewp humid wind_dir wind_speed
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## 5 EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014
## 6 EWR 2013 1 1 6 39.02 26.06 59.37 270 10.35702
## wind_gust precip pressure visib
## 1 11.91865 0 1013.9 10
## 2 15.89154 0 1013.0 10
## 3 14.56724 0 1012.6 10
## 4 15.89154 0 1012.7 10
## 5 17.21583 0 1012.8 10
## 6 11.91865 0 1012.0 10
Now let’s load required package and export the data to Neo4j database.
# using base code from the following package help page https://nicolewhite.github.io/2014/05/30/demo-of-rneo4j-part1.html"
#install.packages("devtools")
#devtools::install_github("nicolewhite/RNeo4j")
library(RNeo4j)
graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "graph")
clear(graph, input = FALSE)
Let’s load the tables - I am omititng some of the data just to keep it simple. Same code can be expanded to import all the columns and all tables.
for ( i in 1: nrow(airports) ) {
createNode (graph, "airports",
faa = airports[[i,"faa" ]],
airportname = airports[[i,"name"]],
dst = airports[[i,"dst"]]
)
}
for ( j in 1: nrow(airlines) ) {
createNode (graph, "airlines",
carrier = airlines[[j,"carrier" ]],
airlinename = airlines[[j,"name"]]
)
}
I am going to limit the number of rows I import from flights since my computer can’t handle so much data and freezes when I try to import all rows.
#replacing for ( k in 1: nrow(flights) ) { WITH
for ( k in 1: 2000 ) {
createNode (graph, "flights",
carrier = flights[[k,"carrier" ]],
tailnum = flights[[k,"tailnum"]],
flight = flights[[k,"flight" ]],
origin = flights[[k,"origin" ]],
dest = flights[[k,"dest" ]]
)
}
To make sure that it worked - let’s create a couple of queries to illustrate how we are using cypher on Neo4j data:
queryresult <- cypher(graph, "MATCH (a:flights), (b:airports) where a.dest = b.faa RETURN a.dest, a.carrier, a.tailnum, a.flight, a.origin, b.airportname, b. dst ")
head(queryresult)
## a.dest a.carrier a.tailnum a.flight a.origin
## 1 ATL DL N668DN 461 LGA
## 2 ATL DL N3739P 1743 JFK
## 3 ATL MQ N542MQ 4650 LGA
## 4 ATL DL N6703D 1547 LGA
## 5 ATL DL N326NB 575 EWR
## 6 ATL DL N935DL 2047 LGA
## b.airportname b. dst
## 1 Hartsfield Jackson Atlanta Intl A
## 2 Hartsfield Jackson Atlanta Intl A
## 3 Hartsfield Jackson Atlanta Intl A
## 4 Hartsfield Jackson Atlanta Intl A
## 5 Hartsfield Jackson Atlanta Intl A
## 6 Hartsfield Jackson Atlanta Intl A
queryresult2 <- cypher(graph, "MATCH (a:flights), (b:airlines) where a.carrier = b.carrier RETURN a.dest, a.carrier, a.tailnum, a.flight, a.origin, b.airlinename")
head(queryresult2)
## a.dest a.carrier a.tailnum a.flight a.origin b.airlinename
## 1 MIA AA N619AA 1141 JFK American Airlines Inc.
## 2 FLL B6 N516JB 507 EWR JetBlue Airways
## 3 PBI B6 N793JB 49 JFK JetBlue Airways
## 4 IAH UA N14228 1545 EWR United Air Lines Inc.
## 5 ATL DL N668DN 461 LGA Delta Air Lines Inc.
## 6 MCO B6 N593JB 79 JFK JetBlue Airways
Each type of database has it’s own advantages so it will depend on the type of data you are working with to determine what’s a better option for your needs.
There are several advantages of storing the data in a relational database vs. your NoSQL database:
Relational Databases like SQL are compatible with all major platforms
Easy to replicate
Works better for highly structured type of data
Here are the disadvantages of realtional databases:
Dynamic schema, meaning that you have the flexibility to change your data schema without modifying any of your existing data.
Scalability, graph databases are horizontally scalable
Manageability, noSQL databases are easier to manage, doesn’t require and administrator
Speed, you can get lightining speed results for simple queries.
Flexibility, you can add new columns or fields on graph DB without affecting existing rows
Source: https://medium.com/xplenty-blog/the-sql-vs-nosql-difference-mysql-vs-mongodb-32c9980e67b2