NoSQL Data Stores
Take information from a relational database and migrate it to a NoSQL database of your own choosing. For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.
For the NoSQL database, you may use MongoDB (which we introduced in week 7), Neo4j, or another NoSQL database of your choosing.
Migration process needs to be reproducible. R code is encouraged, but not required. Briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.
#install.packages("RMySQL")
#install.packages("rmongodb")
#install.packages("mongolite")
library(RMySQL)
## Loading required package: DBI
library(DBI)
library(mongolite)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
Source data store: Flights database in MySQL RDBMS
Connect to MySQL, extract and load 3 tables into R data frames
flightsdb <- dbConnect(RMySQL::MySQL(), user='root', password='Pass@word1', dbname='flights', host='localhost')
dbListTables(flightsdb)
## [1] "airlines" "airports" "flights" "planes" "weather"
q_airl <- dbSendQuery(flightsdb, "select * from airlines")
airlines_df <- fetch(q_airl)
dbClearResult(dbListResults(flightsdb)[[1]])
## [1] TRUE
q_airp <- dbSendQuery(flightsdb, "select * from airports limit 500")
airports_df <- fetch(q_airp)
dbClearResult(dbListResults(flightsdb)[[1]])
## [1] TRUE
q_flight <- dbSendQuery(flightsdb, "select * from flights limit 500")
flights_df <- fetch(q_flight)
# Manage connections and resultsets
# dbListConnections( dbDriver( drv = "MySQL"))
# dbClearResult(dbListResults(flightsdb))
Insert data into MongoDB local instance. Create connection, database and collections
Airlines
mgo_airlines = mongo(collection = "airlines", db = "flights")
mgo_airlines$insert(airlines_df)
## List of 5
## $ nInserted : num 16
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Airports
mgo_airports = mongo(collection = "airports", db = "flights")
mgo_airports$insert(airports_df)
## List of 5
## $ nInserted : num 500
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Flights
mgo_flights = mongo(collection = "flights", db = "flights")
mgo_flights$insert(flights_df)
## List of 5
## $ nInserted : num 500
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
Querying data from the flights collection
mgo_flights$iterate()$one() # first record in the collection
## $year
## [1] 2013
##
## $month
## [1] 1
##
## $day
## [1] 1
##
## $dep_time
## [1] 517
##
## $dep_delay
## [1] 2
##
## $arr_time
## [1] 830
##
## $arr_delay
## [1] 11
##
## $carrier
## [1] "UA"
##
## $tailnum
## [1] "N14228"
##
## $flight
## [1] 1545
##
## $origin
## [1] "EWR"
##
## $dest
## [1] "IAH"
##
## $air_time
## [1] 227
##
## $distance
## [1] 1400
##
## $hour
## [1] 5
##
## $minute
## [1] 17
mgo_flights$find(paste('{"arr_delay" :{"$gt" : 80}}')) %>% kable() # flights with an arrival delay > 80mins
2013 |
1 |
1 |
811 |
101 |
1047 |
137 |
MQ |
N531MQ |
4576 |
LGA |
CLT |
118 |
544 |
8 |
11 |
2013 |
1 |
1 |
848 |
853 |
1001 |
851 |
MQ |
N942MQ |
3944 |
JFK |
BWI |
41 |
184 |
8 |
48 |
2013 |
1 |
1 |
957 |
144 |
1056 |
123 |
UA |
N534UA |
856 |
EWR |
BOS |
37 |
200 |
10 |
57 |
2013 |
1 |
1 |
1114 |
134 |
1447 |
145 |
UA |
N76502 |
1086 |
LGA |
IAH |
248 |
1416 |
11 |
14 |
2013 |
1 |
1 |
1255 |
55 |
1451 |
81 |
MQ |
N518MQ |
4601 |
LGA |
BNA |
139 |
764 |
13 |
55 |
2013 |
1 |
1 |
1301 |
71 |
1518 |
93 |
MQ |
N542MQ |
4646 |
LGA |
MSP |
170 |
1020 |
13 |
1 |
2013 |
1 |
1 |
1400 |
70 |
1645 |
103 |
EV |
N748EV |
4869 |
LGA |
MEM |
178 |
963 |
14 |
0 |
2013 |
1 |
1 |
1402 |
39 |
1650 |
84 |
EV |
N13123 |
4516 |
EWR |
MEM |
183 |
946 |
14 |
2 |
2013 |
1 |
1 |
1428 |
59 |
1803 |
83 |
B6 |
N635JB |
355 |
JFK |
BUR |
371 |
2465 |
14 |
28 |
2013 |
1 |
1 |
1505 |
115 |
1638 |
127 |
EV |
N17984 |
4497 |
EWR |
RIC |
63 |
277 |
15 |
5 |
2013 |
1 |
1 |
1525 |
105 |
1831 |
125 |
B6 |
N231JB |
525 |
EWR |
MCO |
152 |
937 |
15 |
25 |
2013 |
1 |
1 |
1540 |
122 |
2020 |
115 |
B6 |
N570JB |
705 |
JFK |
SJU |
193 |
1598 |
15 |
40 |
2013 |
1 |
1 |
1548 |
88 |
1751 |
91 |
MQ |
N6EAMQ |
4588 |
LGA |
MSP |
167 |
1020 |
15 |
48 |
2013 |
1 |
1 |
1549 |
64 |
1912 |
136 |
EV |
N21197 |
4181 |
EWR |
MCI |
234 |
1092 |
15 |
49 |
Join flights and airlines collections
mgo_flights$aggregate('[
{"$match" : {"arr_delay" :{"$gt" : 120}}},
{
"$project":
{
"_id" :0
}
},
{
"$lookup":
{
"from": "airlines",
"localField": "carrier",
"foreignField": "carrier",
"as": "airline_name"
}
},
{"$unwind" : "$airline_name"},
{
"$project":
{
"airline_name._id" :0, "airline_name.carrier":0
}
}
]') %>% kable()
2013 |
1 |
1 |
811 |
101 |
1047 |
137 |
MQ |
N531MQ |
4576 |
LGA |
CLT |
118 |
544 |
8 |
11 |
Envoy Air |
2013 |
1 |
1 |
848 |
853 |
1001 |
851 |
MQ |
N942MQ |
3944 |
JFK |
BWI |
41 |
184 |
8 |
48 |
Envoy Air |
2013 |
1 |
1 |
957 |
144 |
1056 |
123 |
UA |
N534UA |
856 |
EWR |
BOS |
37 |
200 |
10 |
57 |
United Air Lines Inc. |
2013 |
1 |
1 |
1114 |
134 |
1447 |
145 |
UA |
N76502 |
1086 |
LGA |
IAH |
248 |
1416 |
11 |
14 |
United Air Lines Inc. |
2013 |
1 |
1 |
1505 |
115 |
1638 |
127 |
EV |
N17984 |
4497 |
EWR |
RIC |
63 |
277 |
15 |
5 |
ExpressJet Airlines Inc. |
2013 |
1 |
1 |
1525 |
105 |
1831 |
125 |
B6 |
N231JB |
525 |
EWR |
MCO |
152 |
937 |
15 |
25 |
JetBlue Airways |
2013 |
1 |
1 |
1549 |
64 |
1912 |
136 |
EV |
N21197 |
4181 |
EWR |
MCI |
234 |
1092 |
15 |
49 |
ExpressJet Airlines Inc. |
Bar plot of number of fligts by carrier
mgo_flights$aggregate('[{"$group":{"_id":"$carrier", "Count": {"$sum":1}}}]') %>% arrange(desc(Count))%>%
ggplot(aes(x=reorder(`_id`,Count),y=Count))+geom_bar(stat="identity",color='skyblue')+coord_flip()+xlab("Carriers")+ theme_bw()

NoSQL DB vs Relational DB
NoSQL Data stores provide the following main advantages:
1) Data models - Flexible and multi-model types: Document, Key-value, Wide-columns and graph. No predefined schemas required which allow for large volumes of rapidly changing structured, semi-structured and unstructured data
2) Performance and Scaling - Distributed scale-out architecture with auto-sharding and natively replicated. Horizontally scaled by adding capacity over commodity servers (instead of expensive, monolithic architectures)
3) Development - Object-oriented programming model, quick schema iterations and frequest code pushes
NoSQL Data stores provide the following main disadvantages:
1) Data redundancy - Most of the NoSQL do not require or adopt data normalization
2) Data Consistency - Poor support for ACID transactions, most NoSQL DBs rely on principles of eventual consistenty (BASE)
3) Maturity & Security - Lack of advanced security and data manipulation features compared to relational data stores