Task

For Project 4, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB, Neo4j (which we introduce in Week 12), or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

You may work on a small team on this project. Due before our Meetup on Thursday November 17th.

Solution

Set up MongoDB

I chose to go with MongoDB. To install and set up mongoDB, I followed the instructions located here: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows/

This was also a very useful website: http://www.tutorialspoint.com/mongodb/mongodb_quick_guide.htm

For understanding RMongo, I relied heavily on the documentation: https://cran.r-project.org/web/packages/RMongo/RMongo.pdf

Relational Database: MySQL Flights Data

As I already have the flights database stored in MySQL, I will use it. There are 5 tables in the flights database: airlines, airports, flights, planes, and weather. I will migrate each of these into MongoDB using R code.

First, load the RMySQL library.

library('RMySQL')

Then connect to the MySQL database (Note: password has been changed to “password”) and change to the “flights” database.

mydb <- dbConnect(MySQL(), user='root', password='password', host='localhost')
dbSendQuery(mydb, "USE flights")

Then pull in the flights tables dataframes.

airlines<-dbGetQuery(mydb,"SELECT * FROM airlines;")
airlines$name<-str_replace(airlines$name,"\\r","") #get rid of returns in data

airports<-dbGetQuery(mydb,"SELECT * FROM airports;")
flights<-dbGetQuery(mydb,"SELECT * FROM flights;")
planes<-dbGetQuery(mydb,"SELECT * FROM planes;")
weather<-dbGetQuery(mydb,"SELECT * FROM weather;")

As we no longer need to connect to MySQL, disconnect the connection and clean up.

dbDisconnect(mydb)
mydb<-NA

#detach RMySQL to prevent masking of functions below
detach("package:RMySQL", unload=TRUE)

MongoDB

Now that we have the data in R, we need to put it into MongoDB. First, we need to make sure the server is running. In the default installation, the application is kept in C:\Program Files\MongoDB\Server\3.2\bin. To start the server, we run the mongod application. Then we need to create a flights database, so we create that by starting mongo.exe and typing in use flights. Now we can begin inserting the flights tables as collections into the flights db in MongoDB.

library(RMongo)
library(jsonlite)
library(stringr)

#start server running before trying code
#connect to db
mongo <- mongoDbConnect("flights", "localhost", 27017)


#insert tables as collections
#insert each row from df as a mongoDB document

#create function for inputing data
insert_json<-function(connection,db,df){
  
  #convert df row to json mongodb document
  df_json<-toJSON(df)
  
  #clean up
  df_json<-sub("[","",df_json, fixed = TRUE)
  df_json<-sub("]","",df_json, fixed = TRUE)
  df_json<-as.character(df_json)
  
  #insert document
  dbInsertDocument(connection, db, df_json)
  
}

#call the function for each dataframe/collection to insert rows/documents
#airlines
for(i in 1:nrow(airlines)) {
  insert_json(mongo,"airlines",airlines[i,])
}

#airports
for(i in 1:nrow(airports)) {
  insert_json(mongo,"airports",airports[i,])
}

#flights
for(i in 1:nrow(flights)) {
  insert_json(mongo,"flights",flights[i,])
  print(i) #monitor progress. Took 20 minutes on my machine
}

#planes
for(i in 1:nrow(planes)) {
  insert_json(mongo,"planes",planes[i,])
}

#weather
for(i in 1:nrow(weather)) {
  insert_json(mongo,"weather",weather[i,])
}

#close the connection
dbDisconnect(mongo)

Exploratory

We should confirm that the data was correctly input and that we can query the collections (i.e., tables) in mongoDB. Here are some example queries:

library(RMongo)
library(knitr)

#connect to db
mongo <- mongoDbConnect("flights", "localhost", 27017)

#show results from query
kable(dbGetQuery(mongo, "airlines",'{"carrier": "9E"}')) #find carrier with code "9E"
carrier name X_id
9 Endeavor Air Inc. 582644a9a3a1d43cc2888245
kable(dbGetQuery(mongo, "airports",'{"lat": {$gt:70}}')) #find all airports with latitude above 70
dst tz faa name alt lon X_id lat
A -9 AIN Wainwright Airport 41 -159.9947 582645aaa3a1d43cc28882b8 70.6381
A -9 ATK Atqasuk Edward Burnell Sr Memorial Airport 96 -157.4360 582645aaa3a1d43cc28882e2 70.4673
A -9 BRW Wiley Post Will Rogers Mem 44 -156.7660 582645aaa3a1d43cc288832f 71.2854
A -9 BTI Barter Island Lrrs 2 -143.5819 582645aaa3a1d43cc2888331 70.1340
A -5 EEN Dillant Hopkins Airport 149 42.8983 582645aaa3a1d43cc28883df 72.2708
A -9 K03 Wainwright As 35 -159.8604 582645aba3a1d43cc28884f8 70.6134
A -9 NUI Nuiqsut Airport 38 -151.0056 582645aba3a1d43cc28885fe 70.2100
A -9 SCC Deadhorse 64 -148.4652 582645aca3a1d43cc28886ce 70.1947
A -9 UUK Ugnu-Kuparuk Airport 67 -149.5975 582645aca3a1d43cc2888770 70.3308
kable(dbGetQuery(mongo, "flights",'{"distance": {$lt:30}}')) #find all flights with distance less than 30 miles
flight carrier month distance year origin X_id dest day
1632 US 7 17 2013 EWR 58264d49a3a1d43cc28e58ff LGA 27

I first thought the above result from flights was a transformation issue, since many fields from the query were not being returned. Actually, many of those fields are NULL in the flights dataframe, so they were not put into MongoDB, and hence, not returned. MongoDB has no need to store NULL values. So the query does return correctly from MongoDB.

Here is another example from flights to show all of the fields. Two other queries from planes and weather are also shown.

kable(dbGetQuery(mongo, "flights",'{$and:[{"distance":80},{"dep_delay":{$lte:-5}}]}')) #find all 80 mile flights that departed 5 minutes early or earlier
flight arr_delay distance year tailnum dep_time origin dep_delay dest minute carrier month hour arr_time X_id air_time day
4193 -25 80 2013 N14902 1155 EWR -5 PHL 55 EV 1 12 1241 5826482fa3a1d43cc28a3253 29 5
4619 -12 80 2013 N33182 2124 EWR -5 PHL 24 EV 1 21 2212 58264838a3a1d43cc28a3ac9 25 7
4619 -18 80 2013 N16976 2122 EWR -8 PHL 22 EV 1 21 2207 58264854a3a1d43cc28a5622 26 15
4619 -8 80 2013 N19554 2123 EWR -6 PHL 23 EV 1 21 2216 58264869a3a1d43cc28a6a21 32 21
4616 -20 80 2013 N18120 1610 EWR -7 PHL 10 EV 2 16 1702 58264a15a3a1d43cc28bdb4c 33 2
4619 -14 80 2013 N12921 2123 EWR -7 PHL 23 EV 2 21 2211 58264a46a3a1d43cc28bfe25 21 12
4457 -18 80 2013 N12569 1942 EWR -8 PHL 42 EV 3 19 2026 58264b39a3a1d43cc28ca423 24 30
kable(dbGetQuery(mongo, "planes",'{$and:[{"year": {$lt:1970}},{"engines":4}]}')) #find all planes that are older than 1970 and have 4 engines
tailnum year engine engines model X_id type seats speed manufacturer
N381AA 1956 Reciprocating 4 DC-7BF 58264f32a3a1d43cc28f4aab Fixed wing multi engine 102 232 DOUGLAS
kable(dbGetQuery(mongo, "weather",'{$or:[{"wind_gusts": {$gt:1000}},{"wind_speed":{$gt:1000}}]}')) #find all flights in which wind gusts or wind speed were over 1000
temp year origin visib humid wind_dir pressure wind_gust precip month hour dewp wind_speed X_id day
39.02 2013 EWR 10 61.63 260 1008.3 1206.432 0 2 8 26.96 1048.361 58264f50a3a1d43cc28f578f 12
#close the connection
dbDisconnect(mongo)

The data was successfully moved and can be queried.

Advantages/Disadvantages of SQL vs. NOSQL

The main advantage of using NoSQL is that the schema doesn’t have to change and the data types do not have to be consistent. In so doing, it is easy to scale. Also, one can avoid large sparse tables since we are always using key-value pairs where the value is not NULL. It also appears that one can avoid the need for joins if properly set up, since information can be nested in the JSON style of documents. Finally, the data can be “sharded”, that is, split onto many machines to increase read and write speeds and to solve data storage limitations. Thus, it can serve as a “big data” option.

A disadvantage that I see is that it is harder to view the data in its entirety. We are so used to seeing data in a tabular form. We cannot directly see what the table looks like in its column names, typical values, etc. Also, traditional methods of analysis rely on a tabular structure for input, so the data would have to be transformed before use in these methods. Finally, there is a lot of upfront work to convert data from the traditional tabular format into the key-value pair format. For example, to truly move the data from the flights database, I would need to do a massive join on the data to relate it inside the mongoDB database. Or I’d have to do a lot of work to translate it into JSON nested formats before inserting it into mongoDB. I suppose I would need to do more research on how people do this in practice.

For me, this assignment was particular challenging because the data transformation from tabular data frames to JSON collections took quite a while to work bug free due to syntax and formating challenges. Furthermore, it took a while to figure out the mapping of relational database terms to mongoDB terms (e.g., tables -> collection, row -> document). It would be nice if similar concepts could use similar names across all data storage systems.

Overall, I think the challenges I faced are more a reflection on my unfamiliarty with this style of data storage. With time and practice, I am sure these NoSQL concepts and methods of data storage and retrieval would become as intuitive and easy as the traditional SQL/tabular methods I am used to.