NoSQL Data Stores

Take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.

Migration process needs to be reproducible. R code is encouraged, but not required. Briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

#install.packages("RMySQL")
#install.packages("rmongodb")
#install.packages("mongolite")
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(mongolite)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)

Source data store: Flights database in MySQL RDBMS

Connect to MySQL, extract and load 3 tables into R data frames

flightsdb <- dbConnect(RMySQL::MySQL(), user='root', password='Pass@word1', dbname='flights', host='localhost')
dbListTables(flightsdb)
## [1] "airlines" "airports" "flights"  "planes"   "weather"
q_airl <- dbSendQuery(flightsdb, "select * from airlines")
airlines_df <- fetch(q_airl)
dbClearResult(dbListResults(flightsdb)[[1]])
## [1] TRUE
q_airp <- dbSendQuery(flightsdb, "select * from airports limit 500")
airports_df <- fetch(q_airp)
dbClearResult(dbListResults(flightsdb)[[1]])
## [1] TRUE
q_flight <- dbSendQuery(flightsdb, "select * from flights limit 500")
flights_df <- fetch(q_flight)

# Manage connections and resultsets
# dbListConnections( dbDriver( drv = "MySQL"))
# dbClearResult(dbListResults(flightsdb))

Target data store : MongoDB Community Edition (Local instance installation)

Insert data into MongoDB local instance. Create connection, database and collections

Airlines

mgo_airlines = mongo(collection = "airlines", db = "flights")
mgo_airlines$insert(airlines_df)
## List of 5
##  $ nInserted  : num 16
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Airports

mgo_airports = mongo(collection = "airports", db = "flights")
mgo_airports$insert(airports_df)
## List of 5
##  $ nInserted  : num 500
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Flights

mgo_flights = mongo(collection = "flights", db = "flights")
mgo_flights$insert(flights_df)
## List of 5
##  $ nInserted  : num 500
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

Querying data from the flights collection

mgo_flights$iterate()$one() # first record in the collection
## $year
## [1] 2013
## 
## $month
## [1] 1
## 
## $day
## [1] 1
## 
## $dep_time
## [1] 517
## 
## $dep_delay
## [1] 2
## 
## $arr_time
## [1] 830
## 
## $arr_delay
## [1] 11
## 
## $carrier
## [1] "UA"
## 
## $tailnum
## [1] "N14228"
## 
## $flight
## [1] 1545
## 
## $origin
## [1] "EWR"
## 
## $dest
## [1] "IAH"
## 
## $air_time
## [1] 227
## 
## $distance
## [1] 1400
## 
## $hour
## [1] 5
## 
## $minute
## [1] 17
mgo_flights$find(paste('{"arr_delay" :{"$gt" : 80}}'))  %>% kable() # flights with an arrival delay > 80mins
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
2013 1 1 811 101 1047 137 MQ N531MQ 4576 LGA CLT 118 544 8 11
2013 1 1 848 853 1001 851 MQ N942MQ 3944 JFK BWI 41 184 8 48
2013 1 1 957 144 1056 123 UA N534UA 856 EWR BOS 37 200 10 57
2013 1 1 1114 134 1447 145 UA N76502 1086 LGA IAH 248 1416 11 14
2013 1 1 1255 55 1451 81 MQ N518MQ 4601 LGA BNA 139 764 13 55
2013 1 1 1301 71 1518 93 MQ N542MQ 4646 LGA MSP 170 1020 13 1
2013 1 1 1400 70 1645 103 EV N748EV 4869 LGA MEM 178 963 14 0
2013 1 1 1402 39 1650 84 EV N13123 4516 EWR MEM 183 946 14 2
2013 1 1 1428 59 1803 83 B6 N635JB 355 JFK BUR 371 2465 14 28
2013 1 1 1505 115 1638 127 EV N17984 4497 EWR RIC 63 277 15 5
2013 1 1 1525 105 1831 125 B6 N231JB 525 EWR MCO 152 937 15 25
2013 1 1 1540 122 2020 115 B6 N570JB 705 JFK SJU 193 1598 15 40
2013 1 1 1548 88 1751 91 MQ N6EAMQ 4588 LGA MSP 167 1020 15 48
2013 1 1 1549 64 1912 136 EV N21197 4181 EWR MCI 234 1092 15 49

Join flights and airlines collections

mgo_flights$aggregate('[
                          {"$match" : {"arr_delay" :{"$gt" : 120}}},
                          {
                           "$project":
                           {
                            "_id" :0
                           }
                          },
                          {
                            "$lookup":
                            {
                            "from": "airlines",
                            "localField": "carrier",
                            "foreignField": "carrier",
                            "as": "airline_name"
                            }
                          },
                          {"$unwind" : "$airline_name"},
                          {
                           "$project":
                           {
                            "airline_name._id" :0, "airline_name.carrier":0
                           }
                          }
                      ]') %>% kable()
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute airline_name
2013 1 1 811 101 1047 137 MQ N531MQ 4576 LGA CLT 118 544 8 11 Envoy Air
2013 1 1 848 853 1001 851 MQ N942MQ 3944 JFK BWI 41 184 8 48 Envoy Air
2013 1 1 957 144 1056 123 UA N534UA 856 EWR BOS 37 200 10 57 United Air Lines Inc.
2013 1 1 1114 134 1447 145 UA N76502 1086 LGA IAH 248 1416 11 14 United Air Lines Inc.
2013 1 1 1505 115 1638 127 EV N17984 4497 EWR RIC 63 277 15 5 ExpressJet Airlines Inc.
2013 1 1 1525 105 1831 125 B6 N231JB 525 EWR MCO 152 937 15 25 JetBlue Airways
2013 1 1 1549 64 1912 136 EV N21197 4181 EWR MCI 234 1092 15 49 ExpressJet Airlines Inc.

Bar plot of number of fligts by carrier

mgo_flights$aggregate('[{"$group":{"_id":"$carrier", "Count": {"$sum":1}}}]') %>% arrange(desc(Count))%>% 
  ggplot(aes(x=reorder(`_id`,Count),y=Count))+geom_bar(stat="identity",color='skyblue')+coord_flip()+xlab("Carriers")+ theme_bw()

NoSQL DB vs Relational DB

NoSQL Data stores provide the following main advantages:

1) Data models - Flexible and multi-model types: Document, Key-value, Wide-columns and graph. No predefined schemas required which allow for large volumes of rapidly changing structured, semi-structured and unstructured data

2) Performance and Scaling - Distributed scale-out architecture with auto-sharding and natively replicated. Horizontally scaled by adding capacity over commodity servers (instead of expensive, monolithic architectures)

3) Development - Object-oriented programming model, quick schema iterations and frequest code pushes

NoSQL Data stores provide the following main disadvantages:

1) Data redundancy - Most of the NoSQL do not require or adopt data normalization

2) Data Consistency - Poor support for ACID transactions, most NoSQL DBs rely on principles of eventual consistenty (BASE)

3) Maturity & Security - Lack of advanced security and data manipulation features compared to relational data stores