1. For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation. For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

Envirnment Set up

library(DBI)
library(RMySQL)
library(tidyverse)
library(knitr)

Data Acquisition (flight database:nycflights13)

library(nycflights13)
dim(nycflights13::airlines)
## [1] 16  2
dim(nycflights13::airports)
## [1] 1458    8
head(nycflights13::airports)
## # A tibble: 6 x 8
##   faa   name                             lat   lon   alt    tz dst   tzone
##   <chr> <chr>                          <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G   Lansdowne Airport               41.1 -80.6  1044   -5. A     Amer~
## 2 06A   Moton Field Municipal Airport   32.5 -85.7   264   -6. A     Amer~
## 3 06C   Schaumburg Regional             42.0 -88.1   801   -6. A     Amer~
## 4 06N   Randall Airport                 41.4 -74.4   523   -5. A     Amer~
## 5 09J   Jekyll Island Airport           31.1 -81.4    11   -5. A     Amer~
## 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593   -5. A     Amer~

Database Interaction (RDMBS: MySQL)

Once the connection is established, view the tables.

mySQLdb =dbConnect(MySQL(),user='DATA607User',password=mypassword,dbname='CUNY_DATA607',host='localhost')
dbListTables(mySQLdb)
## [1] "airlines" "airports" "movie"    "rating"   "reviewer"

Insert (Drop if exist) tables from MYSQL database for airliens and airports

dbRemoveTable(mySQLdb, name='airlines')
## [1] TRUE
dbWriteTable(mySQLdb, name='airlines', value=nycflights13::airlines)
## [1] TRUE
dbRemoveTable(mySQLdb, name='airports')
## [1] TRUE
dbWriteTable(mySQLdb, name='airports', value=nycflights13::airports)
## [1] TRUE
dbListTables(mySQLdb)
## [1] "airlines" "airports" "movie"    "rating"   "reviewer"

Fetching the records frm airlines table

airlines_rs = dbSendQuery(mySQLdb, 'select * from airlines')
airlines.df <- fetch(airlines_rs)
kable(head(airlines.df))
row_names carrier name
1 9E Endeavor Air Inc.
2 AA American Airlines Inc.
3 AS Alaska Airlines Inc.
4 B6 JetBlue Airways
5 DL Delta Air Lines Inc.
6 EV ExpressJet Airlines Inc.
airports_rs = dbSendQuery(mySQLdb, 'select * from airports')
airports.df <- fetch(airports_rs,n = -1)
kable(head(airports.df))
row_names faa name lat lon alt tz dst tzone
1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A America/New_York
2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -6 A America/Chicago
3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A America/Chicago
4 06N Randall Airport 41.43191 -74.39156 523 -5 A America/New_York
5 09J Jekyll Island Airport 31.07447 -81.42778 11 -5 A America/New_York
6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -5 A America/New_York
nrow(airports.df)
## [1] 1458

Database Migration: Mongodb NoSQL database set up and data migration from MySQL to Mongodb

Created a developer account on https://mlab.com and also created a database mdbdata607

library(mongolite)
## Warning: package 'mongolite' was built under R version 3.4.4

Mongodb is a document oriented NoSQL database which stores values in key-value pair (json objects) and stored in their respective collection (collection is similar to Table of RDBMS)

#mongo db user name and password hidden in RMarkdown
#mdbUrl <- 'mongodb://<username>:<password>@ds041603.mlab.com:41603/mdbdata607'
airlineCollection <- mongo(collection = 'airlines',url = mdbUrl )
airlineCollection$drop()
airlineCollection$insert(airlines.df)
## List of 5
##  $ nInserted  : num 16
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
airportCollection <- mongo(collection = 'airports',url = mdbUrl )
airportCollection$drop()
airportCollection$insert(airports.df)
## List of 5
##  $ nInserted  : num 1458
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()


kable(head(airportCollection$find()))
row_names faa name lat lon alt tz dst tzone
1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A America/New_York
2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -6 A America/Chicago
3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A America/Chicago
4 06N Randall Airport 41.43191 -74.39156 523 -5 A America/New_York
5 09J Jekyll Island Airport 31.07447 -81.42778 11 -5 A America/New_York
6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -5 A America/New_York

Comparing MySQL and Mongo DB dataframes (TRUE indicates successful Migration)

mdb.airlines.df <- airlineCollection$find()

mdb.airlines.df== airlines.df
##    row_names carrier name
## 1       TRUE    TRUE TRUE
## 2       TRUE    TRUE TRUE
## 3       TRUE    TRUE TRUE
## 4       TRUE    TRUE TRUE
## 5       TRUE    TRUE TRUE
## 6       TRUE    TRUE TRUE
## 7       TRUE    TRUE TRUE
## 8       TRUE    TRUE TRUE
## 9       TRUE    TRUE TRUE
## 10      TRUE    TRUE TRUE
## 11      TRUE    TRUE TRUE
## 12      TRUE    TRUE TRUE
## 13      TRUE    TRUE TRUE
## 14      TRUE    TRUE TRUE
## 15      TRUE    TRUE TRUE
## 16      TRUE    TRUE TRUE

RDBMS Vs NoSQL

MongoDb is a Document database which stores information in pair (key-value). Each key with a complex data structure known as a document. Documents can contain many different key-value pairs, or key-array pairs, or even nested documents. It is very useful if the use case document centric.

SQL databases on the other hand are most commonly used. It is very efficient for storing structured data and where organizing data in realtional manner is recommended.

SQL databases are by design ACID (Atomicity Consistency Isolation Durability) compliant whereas NoSQL databases are not. So, selection of database is really driven by the type of use case and whether the system needs ACID capability or not.