Week 13 assignment - NoSQL migration

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Loading data from the nycflights13 package to a MySQL db

library(nycflights13)
library(RNeo4j)
library(RMySQL)

## Loading required package: DBI

library(RSQLite)

## 
## Attaching package: 'RSQLite'

## The following object is masked from 'package:RMySQL':
## 
##     isIdCurrent

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

#Exploring the data

data(package="nycflights13")

#The package comes with 5 tables, namely airlines, airports, flights, planes and weather

write.csv(airlines, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airlines.csv", row.names=F)
write.csv(airports, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airports.csv", row.names=F)
write.csv(flights, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/flights.csv", row.names=F)
write.csv(planes, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/planes.csv", row.names=F)
write.csv(weather, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/weather.csv", row.names=F)

#Create an empty database
db <- dbConnect(SQLite(), dbname="airportsdb")

#Populate the db with tables
if(!dbExistsTable(db,"airlines")) {
dbWriteTable(conn=db, name="airlines", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airlines.csv", row.names=FALSE, header=TRUE)}

if(!dbExistsTable(db,"airports")) {
dbWriteTable(conn=db, name="airports", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airports.csv", row.names=FALSE, header=TRUE)}

if(!dbExistsTable(db,"flights")) {
dbWriteTable(conn=db, name="flights", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/flights.csv", row.names=FALSE, header=TRUE)}

if(!dbExistsTable(db,"planes")) {
dbWriteTable(conn=db, name="planes", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/planes.csv", row.names=FALSE, header=TRUE)}

if(!dbExistsTable(db,"weather")) {
dbWriteTable(conn=db, name="weather", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/weather.csv", row.names=FALSE, header=TRUE)}


dbListTables(db)

## [1] "airlines" "airports" "flights"  "planes"   "weather"

# Test by fetching all query results from airlines into a data frame or reading the airports table:
airlinesdf<- dbGetQuery(db, "SELECT * FROM airlines")

dbReadTable(db,name="airports" ) %>% head()

faa	name	lat	lon	alt	tz	dst	tzone
“04G”	“Lansdowne Airport”	41.13047	-80.61958	1044	-5	“A”	“America/New_York”
“06A”	“Moton Field Municipal Airport”	32.46057	-85.68003	264	-6	“A”	“America/Chicago”
“06C”	“Schaumburg Regional”	41.98934	-88.10124	801	-6	“A”	“America/Chicago”
“06N”	“Randall Airport”	41.43191	-74.39156	523	-5	“A”	“America/New_York”
“09J”	“Jekyll Island Airport”	31.07447	-81.42778	11	-5	“A”	“America/New_York”
“0A9”	“Elizabethton Municipal Airport”	36.37122	-82.17342	1593	-5	“A”	“America/New_York”

dbDisconnect(db)

Migrating to a GraphDB (Neo4J)

Importing flights and airports as nodes, arrivals and departures as relationships, after manually copying the files into the import folder of Neo4J

graph = startGraph("http://localhost:7474/db/data/", username="neo4j", password="password")

#Delete all existing nodes and edges 

cypher(graph,"match (a) optional match (a)-[r]-() delete a,r")

Batch import of airports, flights and airlines csv files through the Neo4J browser

I amended the configuration file to point to the db/data directory to load the csv file locally (see issue with security restriction explained at http://neo4j.com/docs/developer-manual/current/cypher/clauses/load-csv/ ) and it worked in Neo4J, but not in RNeo4J. The same is true for accessing the csv files that I previously uploaded to my W13 Github Repsitory- ex. https://raw.githubusercontent.com/ambra1982/W13/master/airports.csv

The following code worked like a charm in Neo4J browser: Loading the airports csv file However, the same code returned an error in RNeo4J:

airportsf<- “https://raw.githubusercontent.com/ambra1982/W13/master/airports.csv” query= “load csv with headers from {filepath} as airports create (a1:Airport {label: airports.faa, name: airports.name, lat: airports.lat, lon:airports.lon, zone: airports.tzone})” cypher(graph, query, filepath=paste0(“‘“,airportsf,”’” ))

Neo4J kept crashing when trying to load all the flights table. Therefore I am going to subset the flights dataset and write the new csv file to the default directory. I am going to select all flights on May 1 (964 observations vs. original 336776). The relevant airports will amount to 84 (vs. 1458 in the original dataset).

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

flightsMay<- filter(flights, month == 5)
flightsMay1<- filter(flightsMay, day == 1)

# I will also subset the airport table to dest and origin found in the flights dataset as amended

airportsMay1 <- airports[airports$faa %in% flightsMay1$dest | airports$faa %in% flightsMay1$origin, ]

str(airportsMay1)

## Classes 'tbl_df', 'tbl' and 'data.frame':    84 obs. of  8 variables:
##  $ faa  : chr  "ABQ" "ALB" "ATL" "AUS" ...
##  $ name : chr  "Albuquerque International Sunport" "Albany Intl" "Hartsfield Jackson Atlanta Intl" "Austin Bergstrom Intl" ...
##  $ lat  : num  35 42.7 33.6 30.2 35.4 ...
##  $ lon  : num  -106.6 -73.8 -84.4 -97.7 -82.5 ...
##  $ alt  : int  5355 285 1026 542 2165 173 192 644 599 19 ...
##  $ tz   : num  -7 -5 -5 -6 -5 -5 -5 -6 -6 -5 ...
##  $ dst  : chr  "A" "A" "A" "A" ...
##  $ tzone: chr  "America/Denver" "America/New_York" "America/New_York" "America/Chicago" ...

write.csv(flightsMay1, "C:/Users/Patrizia/Documents/Neo4j/db4/data/flights.csv", row.names=F)
write.csv(airportsMay1, "C:/Users/Patrizia/Documents/Neo4j/db4/data/airports.csv", row.names=F)

Still using the browser, I loaded the flights and airports as amended as nodes and added number and carrier as properties:

Loading the flights csv file

Loading the airports csv file

Then I created the relationships in the browser, using load csv and attaching the labels of “Arrive” and “Depart” :

Creating the arrivals edges

Data Visualization

Return all flights from JFK to LAX:

JFK to LAX flights

Conclusions

Neo4J boasts the clear advantage of making it easy to explore and visualize relationships in datasets. However, compared to MySQL, it took significantly longer to load files from local directories and it did not seem to be able to handle a 32MB file. Additionally, the browser does not seem to resize when returning hundreds of nodes. Finally, there are multiple R packages to work with MySQL databases, while I could only find RNeo4J for graphdb, and I suspect that its manual has not been updated to reflect changes introduced with Neo4J 3.1.

```

NOSQL DataBase

Ambra

25 aprile 2017