Project 4: NoSQL migration

Task

For Project 4, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.

For the relational database, you might use the flights database, the tb database, the “data skills” database your team created for Project 3, or another database of your own choosing or creation.

For the NoSQL database, you may use MongoDB, Neo4j (which we introduce in Week 12), or another NoSQL database of your choosing.

Your migration process needs to be reproducible. R code is encouraged, but not required. You should also briefly describe the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database.

You may work on a small team on this project. Due before our Meetup on Thursday November 17th.

Solution

Set up MongoDB

I chose to go with MongoDB. To install and set up mongoDB, I followed the instructions located here: https://docs.mongodb.com/manual/tutorial/install-mongodb-on-windows/

This was also a very useful website: http://www.tutorialspoint.com/mongodb/mongodb_quick_guide.htm

For understanding RMongo, I relied heavily on the documentation: https://cran.r-project.org/web/packages/RMongo/RMongo.pdf

Relational Database: MySQL Flights Data

As I already have the flights database stored in MySQL, I will use it. There are 5 tables in the flights database: airlines, airports, flights, planes, and weather. I will migrate each of these into MongoDB using R code.

First, load the RMySQL library.

library('RMySQL')

Then connect to the MySQL database (Note: password has been changed to “password”) and change to the “flights” database.

mydb <- dbConnect(MySQL(), user='root', password='password', host='localhost')
dbSendQuery(mydb, "USE flights")

Then pull in the flights tables dataframes.

airlines<-dbGetQuery(mydb,"SELECT * FROM airlines;")
airlines$name<-str_replace(airlines$name,"\\r","") #get rid of returns in data

airports<-dbGetQuery(mydb,"SELECT * FROM airports;")
flights<-dbGetQuery(mydb,"SELECT * FROM flights;")
planes<-dbGetQuery(mydb,"SELECT * FROM planes;")
weather<-dbGetQuery(mydb,"SELECT * FROM weather;")

As we no longer need to connect to MySQL, disconnect the connection and clean up.

dbDisconnect(mydb)
mydb<-NA

#detach RMySQL to prevent masking of functions below
detach("package:RMySQL", unload=TRUE)

MongoDB

Now that we have the data in R, we need to put it into MongoDB. First, we need to make sure the server is running. In the default installation, the application is kept in C:\Program Files\MongoDB\Server\3.2\bin. To start the server, we run the mongod application. Then we need to create a flights database, so we create that by starting mongo.exe and typing in use flights. Now we can begin inserting the flights tables as collections into the flights db in MongoDB.

library(RMongo)
library(jsonlite)
library(stringr)

#start server running before trying code
#connect to db
mongo <- mongoDbConnect("flights", "localhost", 27017)


#insert tables as collections
#insert each row from df as a mongoDB document

#create function for inputing data
insert_json<-function(connection,db,df){
  
  #convert df row to json mongodb document
  df_json<-toJSON(df)
  
  #clean up
  df_json<-sub("[","",df_json, fixed = TRUE)
  df_json<-sub("]","",df_json, fixed = TRUE)
  df_json<-as.character(df_json)
  
  #insert document
  dbInsertDocument(connection, db, df_json)
  
}

#call the function for each dataframe/collection to insert rows/documents
#airlines
for(i in 1:nrow(airlines)) {
  insert_json(mongo,"airlines",airlines[i,])
}

#airports
for(i in 1:nrow(airports)) {
  insert_json(mongo,"airports",airports[i,])
}

#flights
for(i in 1:nrow(flights)) {
  insert_json(mongo,"flights",flights[i,])
  print(i) #monitor progress. Took 20 minutes on my machine
}

#planes
for(i in 1:nrow(planes)) {
  insert_json(mongo,"planes",planes[i,])
}

#weather
for(i in 1:nrow(weather)) {
  insert_json(mongo,"weather",weather[i,])
}

#close the connection
dbDisconnect(mongo)

Exploratory

We should confirm that the data was correctly input and that we can query the collections (i.e., tables) in mongoDB. Here are some example queries:

library(RMongo)
library(knitr)

#connect to db
mongo <- mongoDbConnect("flights", "localhost", 27017)

#show results from query
kable(dbGetQuery(mongo, "airlines",'{"carrier": "9E"}')) #find carrier with code "9E"

carrier	name	X_id
9	Endeavor Air Inc.	582644a9a3a1d43cc2888245

kable(dbGetQuery(mongo, "airports",'{"lat": {$gt:70}}')) #find all airports with latitude above 70

dst	tz	faa	name	alt	lon	X_id	lat
A	-9	AIN	Wainwright Airport	41	-159.9947	582645aaa3a1d43cc28882b8	70.6381
A	-9	ATK	Atqasuk Edward Burnell Sr Memorial Airport	96	-157.4360	582645aaa3a1d43cc28882e2	70.4673
A	-9	BRW	Wiley Post Will Rogers Mem	44	-156.7660	582645aaa3a1d43cc288832f	71.2854
A	-9	BTI	Barter Island Lrrs	2	-143.5819	582645aaa3a1d43cc2888331	70.1340
A	-5	EEN	Dillant Hopkins Airport	149	42.8983	582645aaa3a1d43cc28883df	72.2708
A	-9	K03	Wainwright As	35	-159.8604	582645aba3a1d43cc28884f8	70.6134
A	-9	NUI	Nuiqsut Airport	38	-151.0056	582645aba3a1d43cc28885fe	70.2100
A	-9	SCC	Deadhorse	64	-148.4652	582645aca3a1d43cc28886ce	70.1947
A	-9	UUK	Ugnu-Kuparuk Airport	67	-149.5975	582645aca3a1d43cc2888770	70.3308

kable(dbGetQuery(mongo, "flights",'{"distance": {$lt:30}}')) #find all flights with distance less than 30 miles

flight	carrier	month	distance	year	origin	X_id	dest	day
1632	US	7	17	2013	EWR	58264d49a3a1d43cc28e58ff	LGA	27

I first thought the above result from flights was a transformation issue, since many fields from the query were not being returned. Actually, many of those fields are NULL in the flights dataframe, so they were not put into MongoDB, and hence, not returned. MongoDB has no need to store NULL values. So the query does return correctly from MongoDB.

Here is another example from flights to show all of the fields. Two other queries from planes and weather are also shown.

kable(dbGetQuery(mongo, "flights",'{$and:[{"distance":80},{"dep_delay":{$lte:-5}}]}')) #find all 80 mile flights that departed 5 minutes early or earlier

flight	arr_delay	distance	year	tailnum	dep_time	origin	dep_delay	dest	minute	carrier	month	hour	arr_time	X_id	air_time	day
4193	-25	80	2013	N14902	1155	EWR	-5	PHL	55	EV	1	12	1241	5826482fa3a1d43cc28a3253	29	5
4619	-12	80	2013	N33182	2124	EWR	-5	PHL	24	EV	1	21	2212	58264838a3a1d43cc28a3ac9	25	7
4619	-18	80	2013	N16976	2122	EWR	-8	PHL	22	EV	1	21	2207	58264854a3a1d43cc28a5622	26	15
4619	-8	80	2013	N19554	2123	EWR	-6	PHL	23	EV	1	21	2216	58264869a3a1d43cc28a6a21	32	21
4616	-20	80	2013	N18120	1610	EWR	-7	PHL	10	EV	2	16	1702	58264a15a3a1d43cc28bdb4c	33	2
4619	-14	80	2013	N12921	2123	EWR	-7	PHL	23	EV	2	21	2211	58264a46a3a1d43cc28bfe25	21	12
4457	-18	80	2013	N12569	1942	EWR	-8	PHL	42	EV	3	19	2026	58264b39a3a1d43cc28ca423	24	30

kable(dbGetQuery(mongo, "planes",'{$and:[{"year": {$lt:1970}},{"engines":4}]}')) #find all planes that are older than 1970 and have 4 engines

tailnum	year	engine	engines	model	X_id	type	seats	speed	manufacturer
N381AA	1956	Reciprocating	4	DC-7BF	58264f32a3a1d43cc28f4aab	Fixed wing multi engine	102	232	DOUGLAS

kable(dbGetQuery(mongo, "weather",'{$or:[{"wind_gusts": {$gt:1000}},{"wind_speed":{$gt:1000}}]}')) #find all flights in which wind gusts or wind speed were over 1000

temp	year	origin	visib	humid	wind_dir	pressure	wind_gust	precip	month	hour	dewp	wind_speed	X_id	day
39.02	2013	EWR	10	61.63	260	1008.3	1206.432	0	2	8	26.96	1048.361	58264f50a3a1d43cc28f578f	12

#close the connection
dbDisconnect(mongo)

The data was successfully moved and can be queried.

Advantages/Disadvantages of SQL vs. NOSQL

The main advantage of using NoSQL is that the schema doesn’t have to change and the data types do not have to be consistent. In so doing, it is easy to scale. Also, one can avoid large sparse tables since we are always using key-value pairs where the value is not NULL. It also appears that one can avoid the need for joins if properly set up, since information can be nested in the JSON style of documents. Finally, the data can be “sharded”, that is, split onto many machines to increase read and write speeds and to solve data storage limitations. Thus, it can serve as a “big data” option.

A disadvantage that I see is that it is harder to view the data in its entirety. We are so used to seeing data in a tabular form. We cannot directly see what the table looks like in its column names, typical values, etc. Also, traditional methods of analysis rely on a tabular structure for input, so the data would have to be transformed before use in these methods. Finally, there is a lot of upfront work to convert data from the traditional tabular format into the key-value pair format. For example, to truly move the data from the flights database, I would need to do a massive join on the data to relate it inside the mongoDB database. Or I’d have to do a lot of work to translate it into JSON nested formats before inserting it into mongoDB. I suppose I would need to do more research on how people do this in practice.

For me, this assignment was particular challenging because the data transformation from tabular data frames to JSON collections took quite a while to work bug free due to syntax and formating challenges. Furthermore, it took a while to figure out the mapping of relational database terms to mongoDB terms (e.g., tables -> collection, row -> document). It would be nice if similar concepts could use similar names across all data storage systems.

Overall, I think the challenges I faced are more a reflection on my unfamiliarty with this style of data storage. With time and practice, I am sure these NoSQL concepts and methods of data storage and retrieval would become as intuitive and easy as the traditional SQL/tabular methods I am used to.