For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.
For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
library(nycflights13)
library(RNeo4j)
library(RMySQL)
## Loading required package: DBI
library(RSQLite)
##
## Attaching package: 'RSQLite'
## The following object is masked from 'package:RMySQL':
##
## isIdCurrent
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
#Exploring the data
data(package="nycflights13")
#The package comes with 5 tables, namely airlines, airports, flights, planes and weather
write.csv(airlines, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airlines.csv", row.names=F)
write.csv(airports, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airports.csv", row.names=F)
write.csv(flights, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/flights.csv", row.names=F)
write.csv(planes, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/planes.csv", row.names=F)
write.csv(weather, "C:/Users/Patrizia/Desktop/AmbraMSDA/W13/weather.csv", row.names=F)
#Create an empty database
db <- dbConnect(SQLite(), dbname="airportsdb")
#Populate the db with tables
if(!dbExistsTable(db,"airlines")) {
dbWriteTable(conn=db, name="airlines", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airlines.csv", row.names=FALSE, header=TRUE)}
if(!dbExistsTable(db,"airports")) {
dbWriteTable(conn=db, name="airports", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/airports.csv", row.names=FALSE, header=TRUE)}
if(!dbExistsTable(db,"flights")) {
dbWriteTable(conn=db, name="flights", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/flights.csv", row.names=FALSE, header=TRUE)}
if(!dbExistsTable(db,"planes")) {
dbWriteTable(conn=db, name="planes", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/planes.csv", row.names=FALSE, header=TRUE)}
if(!dbExistsTable(db,"weather")) {
dbWriteTable(conn=db, name="weather", value="C:/Users/Patrizia/Desktop/AmbraMSDA/W13/weather.csv", row.names=FALSE, header=TRUE)}
dbListTables(db)
## [1] "airlines" "airports" "flights" "planes" "weather"
# Test by fetching all query results from airlines into a data frame or reading the airports table:
airlinesdf<- dbGetQuery(db, "SELECT * FROM airlines")
dbReadTable(db,name="airports" ) %>% head()
faa | name | lat | lon | alt | tz | dst | tzone |
---|---|---|---|---|---|---|---|
“04G” | “Lansdowne Airport” | 41.13047 | -80.61958 | 1044 | -5 | “A” | “America/New_York” |
“06A” | “Moton Field Municipal Airport” | 32.46057 | -85.68003 | 264 | -6 | “A” | “America/Chicago” |
“06C” | “Schaumburg Regional” | 41.98934 | -88.10124 | 801 | -6 | “A” | “America/Chicago” |
“06N” | “Randall Airport” | 41.43191 | -74.39156 | 523 | -5 | “A” | “America/New_York” |
“09J” | “Jekyll Island Airport” | 31.07447 | -81.42778 | 11 | -5 | “A” | “America/New_York” |
“0A9” | “Elizabethton Municipal Airport” | 36.37122 | -82.17342 | 1593 | -5 | “A” | “America/New_York” |
dbDisconnect(db)
Importing flights and airports as nodes, arrivals and departures as relationships, after manually copying the files into the import folder of Neo4J
graph = startGraph("http://localhost:7474/db/data/", username="neo4j", password="password")
#Delete all existing nodes and edges
cypher(graph,"match (a) optional match (a)-[r]-() delete a,r")
I amended the configuration file to point to the db/data directory to load the csv file locally (see issue with security restriction explained at http://neo4j.com/docs/developer-manual/current/cypher/clauses/load-csv/ ) and it worked in Neo4J, but not in RNeo4J. The same is true for accessing the csv files that I previously uploaded to my W13 Github Repsitory- ex. https://raw.githubusercontent.com/ambra1982/W13/master/airports.csv
The following code worked like a charm in Neo4J browser: However, the same code returned an error in RNeo4J:
airportsf<- “https://raw.githubusercontent.com/ambra1982/W13/master/airports.csv” query= “load csv with headers from {filepath} as airports create (a1:Airport {label: airports.faa, name: airports.name, lat: airports.lat, lon:airports.lon, zone: airports.tzone})” cypher(graph, query, filepath=paste0(“‘“,airportsf,”’” ))
Neo4J kept crashing when trying to load all the flights table. Therefore I am going to subset the flights dataset and write the new csv file to the default directory. I am going to select all flights on May 1 (964 observations vs. original 336776). The relevant airports will amount to 84 (vs. 1458 in the original dataset).
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
flightsMay<- filter(flights, month == 5)
flightsMay1<- filter(flightsMay, day == 1)
# I will also subset the airport table to dest and origin found in the flights dataset as amended
airportsMay1 <- airports[airports$faa %in% flightsMay1$dest | airports$faa %in% flightsMay1$origin, ]
str(airportsMay1)
## Classes 'tbl_df', 'tbl' and 'data.frame': 84 obs. of 8 variables:
## $ faa : chr "ABQ" "ALB" "ATL" "AUS" ...
## $ name : chr "Albuquerque International Sunport" "Albany Intl" "Hartsfield Jackson Atlanta Intl" "Austin Bergstrom Intl" ...
## $ lat : num 35 42.7 33.6 30.2 35.4 ...
## $ lon : num -106.6 -73.8 -84.4 -97.7 -82.5 ...
## $ alt : int 5355 285 1026 542 2165 173 192 644 599 19 ...
## $ tz : num -7 -5 -5 -6 -5 -5 -5 -6 -6 -5 ...
## $ dst : chr "A" "A" "A" "A" ...
## $ tzone: chr "America/Denver" "America/New_York" "America/New_York" "America/Chicago" ...
write.csv(flightsMay1, "C:/Users/Patrizia/Documents/Neo4j/db4/data/flights.csv", row.names=F)
write.csv(airportsMay1, "C:/Users/Patrizia/Documents/Neo4j/db4/data/airports.csv", row.names=F)
Still using the browser, I loaded the flights and airports as amended as nodes and added number and carrier as properties:
Loading the flights csv file
Loading the airports csv file
Then I created the relationships in the browser, using load csv and attaching the labels of “Arrive” and “Depart” :
Creating the arrivals edges
Return all flights from JFK to LAX:
JFK to LAX flights
Neo4J boasts the clear advantage of making it easy to explore and visualize relationships in datasets. However, compared to MySQL, it took significantly longer to load files from local directories and it did not seem to be able to handle a 32MB file. Additionally, the browser does not seem to resize when returning hundreds of nodes. Finally, there are multiple R packages to work with MySQL databases, while I could only find RNeo4J for graphdb, and I suspect that its manual has not been updated to reflect changes introduced with Neo4J 3.1.
```