NoSQL Migration

Elina Azrilyan

20 November 2018

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Step 1. Pulling data from SQL database into R.

Loading Packages.

First we load RMySQL and strings packages using the following code.

# loading package
# install.packages("RMySQL")
# install.packages("stringr")
library(RMySQL)
## Loading required package: DBI
library(stringr)
Establish connection.

Then we connect to the database from R.

con <- dbConnect(MySQL(), user='root', password='root', dbname='Flights', host='localhost')
Import Data.

We run a query to create the dataframes which now contain flights database tables.

airlines <- dbGetQuery(con, "SELECT * FROM airlines;")
airports <- dbGetQuery(con, "SELECT * FROM airports;")
flights <- dbGetQuery(con, "SELECT * FROM flights;")
planes <- dbGetQuery(con, "SELECT * FROM planes;")
weather <- dbGetQuery(con, "SELECT * FROM weather;")

We can now disconnect the database and visually inspect our data.

# Disconecting from Database
dbDisconnect(con)
## [1] TRUE
head(airlines)
##   carrier                       name
## 1      9E        Endeavor Air Inc.\r
## 2      AA   American Airlines Inc.\r
## 3      AS     Alaska Airlines Inc.\r
## 4      B6          JetBlue Airways\r
## 5      DL     Delta Air Lines Inc.\r
## 6      EV ExpressJet Airlines Inc.\r
#Looks lie there are "/r" present in the airline names - let's get rid of those:
airlines$name<-str_replace(airlines$name,"\\r","")

head(airlines)
##   carrier                     name
## 1      9E        Endeavor Air Inc.
## 2      AA   American Airlines Inc.
## 3      AS     Alaska Airlines Inc.
## 4      B6          JetBlue Airways
## 5      DL     Delta Air Lines Inc.
## 6      EV ExpressJet Airlines Inc.
head(airports)
##   faa                           name      lat       lon  alt tz dst
## 1 04G              Lansdowne Airport 41.13047 -80.61958 1044 -5   A
## 2 06A  Moton Field Municipal Airport 32.46057 -85.68003  264 -5   A
## 3 06C            Schaumburg Regional 41.98934 -88.10124  801 -6   A
## 4 06N                Randall Airport 41.43191 -74.39156  523 -5   A
## 5 09J          Jekyll Island Airport 31.07447 -81.42778   11 -4   A
## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4   A
head(flights)
##   year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013     1   1      517         2      830        11      UA  N14228
## 2 2013     1   1      533         4      850        20      UA  N24211
## 3 2013     1   1      542         2      923        33      AA  N619AA
## 4 2013     1   1      544        -1     1004       -18      B6  N804JB
## 5 2013     1   1      554        -6      812       -25      DL  N668DN
## 6 2013     1   1      554        -4      740        12      UA  N39463
##   flight origin dest air_time distance hour minute
## 1   1545    EWR  IAH      227     1400    5     17
## 2   1714    LGA  IAH      227     1416    5     33
## 3   1141    JFK  MIA      160     1089    5     42
## 4    725    JFK  BQN      183     1576    5     44
## 5    461    LGA  ATL      116      762    6     54
## 6   1696    EWR  ORD      150      719    6     54
head(planes)
##   tailnum year                    type     manufacturer     model engines
## 1  N10156 2004 Fixed wing multi engine          EMBRAER EMB-145XR       2
## 2  N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 3  N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 4  N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
## 5  N10575 2002 Fixed wing multi engine          EMBRAER EMB-145LR       2
## 6  N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE  A320-214       2
##   seats speed    engine
## 1    55    NA Turbo-fan
## 2   182    NA Turbo-fan
## 3   182    NA Turbo-fan
## 4   182    NA Turbo-fan
## 5    55    NA Turbo-fan
## 6   182    NA Turbo-fan
head(weather)
##   origin year month day hour  temp  dewp humid wind_dir wind_speed
## 1    EWR 2013     1   1    0 37.04 21.92 53.97      230   10.35702
## 2    EWR 2013     1   1    1 37.04 21.92 53.97      230   13.80936
## 3    EWR 2013     1   1    2 37.94 21.92 52.09      230   12.65858
## 4    EWR 2013     1   1    3 37.94 23.00 54.51      230   13.80936
## 5    EWR 2013     1   1    4 37.94 24.08 57.04      240   14.96014
## 6    EWR 2013     1   1    6 39.02 26.06 59.37      270   10.35702
##   wind_gust precip pressure visib
## 1  11.91865      0   1013.9    10
## 2  15.89154      0   1013.0    10
## 3  14.56724      0   1012.6    10
## 4  15.89154      0   1012.7    10
## 5  17.21583      0   1012.8    10
## 6  11.91865      0   1012.0    10

Step 2. Exporting Data

Now let’s load required package and export the data to Neo4j database.

# using base code from the following package help page  https://nicolewhite.github.io/2014/05/30/demo-of-rneo4j-part1.html"

#install.packages("devtools")
#devtools::install_github("nicolewhite/RNeo4j")
library(RNeo4j)

graph = startGraph("http://localhost:7474/db/data/", username = "neo4j", password = "graph")
clear(graph, input = FALSE)

Let’s load the tables - I am omititng some of the data just to keep it simple. Same code can be expanded to import all the columns and all tables.

for ( i in 1: nrow(airports) ) {
createNode (graph, "airports", 
            faa = airports[[i,"faa" ]],
            airportname = airports[[i,"name"]], 
            dst = airports[[i,"dst"]]  
            )
}
for ( j in 1: nrow(airlines) ) {
createNode (graph, "airlines", 
            carrier = airlines[[j,"carrier" ]],
            airlinename = airlines[[j,"name"]]
            )
}

I am going to limit the number of rows I import from flights since my computer can’t handle so much data and freezes when I try to import all rows.

#replacing for ( k in 1: nrow(flights) ) {   WITH

for ( k in 1: 2000 ) {
createNode (graph, "flights", 
            carrier = flights[[k,"carrier" ]],
            tailnum = flights[[k,"tailnum"]],
            flight = flights[[k,"flight" ]],
            origin = flights[[k,"origin" ]],
            dest = flights[[k,"dest" ]]
            )
}

Step 3. Validation.

To make sure that it worked - let’s create a couple of queries to illustrate how we are using cypher on Neo4j data:

queryresult <- cypher(graph, "MATCH (a:flights), (b:airports) where a.dest = b.faa RETURN a.dest, a.carrier, a.tailnum, a.flight, a.origin, b.airportname, b. dst ")
head(queryresult)
##   a.dest a.carrier a.tailnum a.flight a.origin
## 1    ATL        DL    N668DN      461      LGA
## 2    ATL        DL    N3739P     1743      JFK
## 3    ATL        MQ    N542MQ     4650      LGA
## 4    ATL        DL    N6703D     1547      LGA
## 5    ATL        DL    N326NB      575      EWR
## 6    ATL        DL    N935DL     2047      LGA
##                     b.airportname b. dst
## 1 Hartsfield Jackson Atlanta Intl      A
## 2 Hartsfield Jackson Atlanta Intl      A
## 3 Hartsfield Jackson Atlanta Intl      A
## 4 Hartsfield Jackson Atlanta Intl      A
## 5 Hartsfield Jackson Atlanta Intl      A
## 6 Hartsfield Jackson Atlanta Intl      A
queryresult2 <- cypher(graph, "MATCH (a:flights), (b:airlines) where a.carrier = b.carrier RETURN a.dest, a.carrier, a.tailnum, a.flight, a.origin, b.airlinename")
head(queryresult2)
##   a.dest a.carrier a.tailnum a.flight a.origin          b.airlinename
## 1    MIA        AA    N619AA     1141      JFK American Airlines Inc.
## 2    FLL        B6    N516JB      507      EWR        JetBlue Airways
## 3    PBI        B6    N793JB       49      JFK        JetBlue Airways
## 4    IAH        UA    N14228     1545      EWR  United Air Lines Inc.
## 5    ATL        DL    N668DN      461      LGA   Delta Air Lines Inc.
## 6    MCO        B6    N593JB       79      JFK        JetBlue Airways

Step 4. Advantages and Disadvantages.

Each type of database has it’s own advantages so it will depend on the type of data you are working with to determine what’s a better option for your needs.

There are several advantages of storing the data in a relational database vs. your NoSQL database:

  • Relational Databases like SQL are compatible with all major platforms

  • Easy to replicate

  • Works better for highly structured type of data

Here are the disadvantages of realtional databases:

  • Dynamic schema, meaning that you have the flexibility to change your data schema without modifying any of your existing data.

  • Scalability, graph databases are horizontally scalable

  • Manageability, noSQL databases are easier to manage, doesn’t require and administrator

  • Speed, you can get lightining speed results for simple queries.

  • Flexibility, you can add new columns or fields on graph DB without affecting existing rows

Source: https://medium.com/xplenty-blog/the-sql-vs-nosql-difference-mysql-vs-mongodb-32c9980e67b2