The objective of this project is simple: take information from a relational database and migrate it to a NoSQL database of your own choosing. I’ve chosen the flights database from R and Neo4j as our NoSQL database. The decision to create an RMarkdown file even though it is not necessary is to make the process reproducible and transparent.
Since the flight database is found via the nycflights13 package, it should first be loaded into R along with the other packages we’ll use in this project:
#install.packages('nycflights13')
pacman::p_load(knitr)
#this package could be loaded with pacman but I seperated it out for simplicity:
library(nycflights13)
The package comes with a couple of datasets, of which we’ll use: flights, weather, airlines, and airports. The only dataset we’re leaving out is the planes dataset.
kable(head(flights))
year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 2 | 830 | 11 | UA | N14228 | 1545 | EWR | IAH | 227 | 1400 | 5 | 17 |
2013 | 1 | 1 | 533 | 4 | 850 | 20 | UA | N24211 | 1714 | LGA | IAH | 227 | 1416 | 5 | 33 |
2013 | 1 | 1 | 542 | 2 | 923 | 33 | AA | N619AA | 1141 | JFK | MIA | 160 | 1089 | 5 | 42 |
2013 | 1 | 1 | 544 | -1 | 1004 | -18 | B6 | N804JB | 725 | JFK | BQN | 183 | 1576 | 5 | 44 |
2013 | 1 | 1 | 554 | -6 | 812 | -25 | DL | N668DN | 461 | LGA | ATL | 116 | 762 | 5 | 54 |
2013 | 1 | 1 | 554 | -4 | 740 | 12 | UA | N39463 | 1696 | EWR | ORD | 150 | 719 | 5 | 54 |
kable(head(weather))
origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
EWR | 2013 | 1 | 1 | 0 | 37.04 | 21.92 | 53.97 | 230 | 10.35702 | 11.91865 | 0 | 1013.9 | 10 |
EWR | 2013 | 1 | 1 | 1 | 37.04 | 21.92 | 53.97 | 230 | 13.80936 | 15.89154 | 0 | 1013.0 | 10 |
EWR | 2013 | 1 | 1 | 2 | 37.94 | 21.92 | 52.09 | 230 | 12.65858 | 14.56724 | 0 | 1012.6 | 10 |
EWR | 2013 | 1 | 1 | 3 | 37.94 | 23.00 | 54.51 | 230 | 13.80936 | 15.89154 | 0 | 1012.7 | 10 |
EWR | 2013 | 1 | 1 | 4 | 37.94 | 24.08 | 57.04 | 240 | 14.96014 | 17.21583 | 0 | 1012.8 | 10 |
EWR | 2013 | 1 | 1 | 6 | 39.02 | 26.06 | 59.37 | 270 | 10.35702 | 11.91865 | 0 | 1012.0 | 10 |
kable(head(airlines))
carrier | name |
---|---|
9E | Endeavor Air Inc. |
AA | American Airlines Inc. |
AS | Alaska Airlines Inc. |
B6 | JetBlue Airways |
DL | Delta Air Lines Inc. |
EV | ExpressJet Airlines Inc. |
kable(head(airports))
faa | name | lat | lon | alt | tz | dst |
---|---|---|---|---|---|---|
04G | Lansdowne Airport | 41.13047 | -80.61958 | 1044 | -5 | A |
06A | Moton Field Municipal Airport | 32.46057 | -85.68003 | 264 | -5 | A |
06C | Schaumburg Regional | 41.98934 | -88.10124 | 801 | -6 | A |
06N | Randall Airport | 41.43191 | -74.39156 | 523 | -5 | A |
09J | Jekyll Island Airport | 31.07447 | -81.42778 | 11 | -4 | A |
0A9 | Elizabethton Municipal Airport | 36.37122 | -82.17342 | 1593 | -4 | A |
Since we want to load the datasets into a relational database first, we’ll export them all in R to a new folder in the C:/ called data.
write.csv(flights, "c:/data/flights.csv", row.names=FALSE)
write.csv(weather, "c:/data/weather.csv", row.names=FALSE)
write.csv(airlines, "c:/data/airlines.csv", row.names=FALSE)
write.csv(airports, "c:/data/airports.csv", row.names=FALSE)
Now that we’ve exported the data into a local file, we’ll switch to MySQL to load the datasets into a regional database.
Simply enough, the .CSV files were imported using the MySQL Import CSV wizard. I first created a new scheme (titled project4), right clicked the scheme tables, and used the table data export line. Instructions can be found on the MySQL website.
Note: This process took an extremely long time (one whole day: 24 hours). It could be due to the size of the tables, the slow processor speed and slow HDD on my computer, the speed of MySQL’s importing wizard, or any other reason, but going forward this will need to be taken into consideration and alternatives should be evaluated.
At this stage, it is possible (rather simple too) to manipulate the data as you’d see fit
Very simply, the cSV files are then exported back into a CSV file.
After slicing and dicing the databases in MySQL and exporting it, we now move to importing the CSV files into
Using the code below, the CSV files were exported from MySQL and imported into Neo4j. This processes was extremely quick in comparison to the MySQL import.
#airports data
LOAD CSV WITH HEADERS FROM "file:///data//airports.csv" AS row
CREATE (n:airports)
SET n = row,
n.faa = toFloat(row.faa),
n.name = toFloat(row.name), n.lat = toInt(row.lat),
n.lon = toInt(row.lon),
n.alt = toInt(row.alt),
n.tz = toInt(row.tz),
n.dst = toFloat(row.dst)
#airlines data
LOAD CSV WITH HEADERS FROM "file:///data//airlines.csv" AS row
CREATE (n:airlines)
SET n = row,
n.carrier = toFloat(row.carrier),
n.name = toFloat(row.name)
#weather data
LOAD CSV WITH HEADERS FROM "file:///data//weather.csv" AS row
CREATE (n:weather)
SET n = row,
n.origin = toFloat(row.origin),
n.year = toInt(row.year),
n.month = toInt(row.month),
n.day = toInt(row.day),
n.hour = toInt(row.hour),
n.temp = toInt(row.temp),
n.dewp = toInt(row.dewp),
n.humid = toInt(row.humid),
n.wind_speed = toInt(row.wind_speed),
n.wind_gust = toInt(row.wind_gust),
n.precip = toInt(row.precip),
n.pressure = toInt(row.pressure),
n.visib = toInt(row.visib)
#flights data
#since there is a lot of data, periodic commit must be used
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///data//flights.csv" AS row
CREATE (n:flights)
SET n = row,
n.year = toInt(row.year),
n.month = toInt(row.month),
n.day = toInt(row.day),
n.dep_time = toInt(row.dep_time),
n.dep_delay = toInt(row.dep_delay),
n.arr_time = toInt(row.arr_time),
n.arr_delay = toInt(row.arr_delay),
n.carrier = toFloat(row.carrier),
n.tailnum = toFloat(row.tailnum),
n.flight = toInt(row.flight),
n.origin = toFloat(row.origin),
n.dest = toFloat(row.dest),
n.air_time = toInt(row.air_time),
n.distance = toInt(row.distance),
n.hour = toInt(row.hour),
n.minute = toInt(row.minute)
While this was a slow way of migrating data from MySQL to Neo4j. There is certainly a quicker way to do this directly from R or even connecting Neo4j to MySQL, but the long way will suffice for now.
The obvious pro of using a NoSQL Graph Database is the ability to connect disperate relationships. While SQL can join tables and make simple connections between them, finding a connection between tables far and away is much more difficult and requires a lot of syntax. A graph database makes this process incredibly simple and streamlined.
One of the advantages of SQL over NoSQL is that SQL handles incredibly large amounts of data very well. This is extremely valuable with “stagnant” data, data that does not change often and does not require real-time analysis. In business context, it stores a vast amount of history perfect for finding trends and doing deep analysis on historical data. Dark data is also incredibly valuable information that can be extracted from SQL databases, since the information is more or less hidden.