For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation. For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing. Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
For this assignment, I have decided to use the MySQL as my RDBMS, and MongoDB as my NoSQL database. Also, the flights database used earlier in Data 607 assignments has been used here.
The first step is to load the MySQL database with the Flights database. The sql script to load the Flights database is present here: https://raw.githubusercontent.com/deepakmongia/Fall2018/master/Data-607/flights.sql The above script needs to be run before the below.
Step-1 : Loading the required R packages:
library(mongolite)
library(RMySQL)
## Loading required package: DBI
Step-2 : Connecting the MySQL database:
mydb = dbConnect(MySQL(), user='root', password='Deepak1234#', dbname='flights', host='localhost')
summary(mydb)
## <MySQLConnection:0,0>
## User: root
## Host: localhost
## Dbname: flights
## Connection type: localhost via TCP/IP
##
## Results:
Now assuming that the Flights database has been loaded on to MySQL, we will use R to pull the data into R from MySQL.
Step-3 : Pulling data from MySQL
airlines.df <- dbGetQuery(mydb, "
select * from airlines")
head(airlines.df)
## carrier name
## 1 9E Endeavor Air Inc.\r
## 2 AA American Airlines Inc.\r
## 3 AS Alaska Airlines Inc.\r
## 4 B6 JetBlue Airways\r
## 5 DL Delta Air Lines Inc.\r
## 6 EV ExpressJet Airlines Inc.\r
airports.df <- dbGetQuery(mydb, "
select * from airports")
head(airports.df)
## faa name lat lon alt tz dst
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
## 4 06N Randall Airport 41.43191 -74.39156 523 -5 A
## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -4 A
## 6 0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4 A
flights.df <- dbGetQuery(mydb, "
select * from flights")
head(flights.df)
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 6 54
## 6 1696 EWR ORD 150 719 6 54
planes.df <- dbGetQuery(mydb, "
select * from planes")
head(planes.df)
## tailnum year type manufacturer model engines
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## 6 N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## seats speed engine
## 1 55 NA Turbo-fan
## 2 182 NA Turbo-fan
## 3 182 NA Turbo-fan
## 4 182 NA Turbo-fan
## 5 55 NA Turbo-fan
## 6 182 NA Turbo-fan
weather.df <- dbGetQuery(mydb, "
select * from weather")
head(weather.df)
## origin year month day hour temp dewp humid wind_dir wind_speed
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## 5 EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014
## 6 EWR 2013 1 1 6 39.02 26.06 59.37 270 10.35702
## wind_gust precip pressure visib
## 1 11.91865 0 1013.9 10
## 2 15.89154 0 1013.0 10
## 3 14.56724 0 1012.6 10
## 4 15.89154 0 1012.7 10
## 5 17.21583 0 1012.8 10
## 6 11.91865 0 1012.0 10
Step-4 : Inserting in the same structure format into MongoDB:
mongo_airlines <- mongo(collection = "airlines", db = "flights")
mongo_airlines$insert(airlines.df)
## List of 5
## $ nInserted : num 16
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_airports <- mongo(collection = "airports", db = "flights")
mongo_airports$insert(airports.df)
## List of 5
## $ nInserted : num 1397
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_flights <- mongo(collection = "flights", db = "flights")
mongo_flights$insert(flights.df)
## List of 5
## $ nInserted : num 336776
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_planes <- mongo(collection = "planes", db = "flights")
mongo_planes$insert(planes.df)
## List of 5
## $ nInserted : num 3322
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
mongo_weather <- mongo(collection = "weather", db = "flights")
mongo_weather$insert(weather.df)
## List of 5
## $ nInserted : num 8719
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Even though we inserted here in the exact same structure as the MySQL into MongoDB, but MongoDB or any other document database is more effective with saving documents in the form of embedded document structures like JSON.
Comparison of RDMBMS and NoSQL databases:
RDBMS:
Advantages:
Usage of uniform language across all the RDBMSes - SQL
ACID properties - Atomicity, Consistency, Isolation, Durability
Easy to use and maintain
Disadvantages:
Does not work really well and becomes difficult to manage as the DB size grows
Unable to handle unstructured data
NoSQL databases:
Advantages:
Can handle unstructured data very well
Easily scalable as it is schema free
The data is stored in same structure as it is presented
Disadvantages:
ACID properties are given away
data redundancy may happen