Credits

Thanks to guidance from Valerie Briot, Christophe Hunt, Antonio Bayquen, Dan Brooks, R-bloggers (http://www.r-bloggers.com/mysql-and-r/), and Andy Catlin’s MongoDB instructional video.

Intro

The magic of communicating software! In this fascinating assignment we pull data from an existing SQL database, import into R, and recreate in a NoSQL database, Mongo DB.

What is Mongo DB?

From the MongoDB website:

“MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. MongoDB obviates the need for an Object Relational Mapping (ORM) to facilitate development.”

“MongoDB stores data in the form of documents, which are JSON-like field and value pairs. Documents are analogous to structures in programming languages that associate keys with values (e.g. dictionaries, hashes, maps, and associative arrays). Formally, MongoDB documents are BSON documents. BSON is a binary representation of JSON with additional type information. In the documents, the value of a field can be any of the BSON data types, including other documents, arrays, and arrays of documents.”

Let’s Begin!

Here are the packages which give us warning messages for any number of reasons.

suppressMessages(suppressWarnings(library(RMySQL)))
suppressMessages(suppressWarnings(library(rmongodb)))
suppressMessages(suppressWarnings(library(RJSONIO)))
suppressMessages(suppressWarnings(library(rjson)))
suppressMessages(suppressWarnings(library(RCurl)))
suppressMessages(suppressWarnings(library(plyr)))
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(mongolite)))

Now we connect to our MySQL server, which must already be running on the machine before attempting this step.

driver <- dbDriver("MySQL")
sqlconnect <- dbConnect(driver, user="root", password = "mysql11", dbname = "flights", host = "localhost")

Let’s check out what tables exist in our SQL database. Does this look familiar from last year? It should! Good memories.

dbListTables(sqlconnect)
##  [1] "airlines"       "airports"       "flights"        "planes"        
##  [5] "reviews"        "tblgroups"      "tblrooms"       "tblroomsgroups"
##  [9] "tblusers"       "tblusersgroups" "videos"         "weather"

Now we can run some “queries” where we select everything to see if it looks good, and check the number of rows.

query <- "select * from flights;"
flights <- dbGetQuery(sqlconnect, query)
nrow(flights)
## [1] 336776
kable(head(flights))
year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight origin dest air_time distance hour minute
2013 1 1 517 2 830 11 UA N14228 1545 EWR IAH 227 1400 5 17
2013 1 1 533 4 850 20 UA N24211 1714 LGA IAH 227 1416 5 33
2013 1 1 542 2 923 33 AA N619AA 1141 JFK MIA 160 1089 5 42
2013 1 1 544 -1 1004 -18 B6 N804JB 725 JFK BQN 183 1576 5 44
2013 1 1 554 -6 812 -25 DL N668DN 461 LGA ATL 116 762 6 54
2013 1 1 554 -4 740 12 UA N39463 1696 EWR ORD 150 719 6 54
query <- "select * from airlines;"
airlines <- dbGetQuery(sqlconnect, query)
nrow(airlines)
## [1] 16
kable(head(airlines))
carrier name
9E Endeavor Air Inc.
AA American Airlines Inc.
AS Alaska Airlines Inc.
B6 JetBlue Airways
DL Delta Air Lines Inc.
EV ExpressJet Airlines Inc.
query <- "select * from weather;"
weather <- dbGetQuery(sqlconnect, query)
nrow(weather)
## [1] 8719
kable(head(weather))
origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib
EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702 11.91865 0 1013.9 10
EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936 15.89154 0 1013.0 10
EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858 14.56724 0 1012.6 10
EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936 15.89154 0 1012.7 10
EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014 17.21583 0 1012.8 10
EWR 2013 1 1 6 39.02 26.06 59.37 270 10.35702 11.91865 0 1012.0 10
query <- "select * from planes;"
planes <- dbGetQuery(sqlconnect, query)
nrow(planes)
## [1] 3322
kable(head(planes))
tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
query <- "select * from airports;"
airports <- dbGetQuery(sqlconnect, query)
nrow(airports)
## [1] 1397
kable(head(airports))
faa name lat lon alt tz dst
04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
06N Randall Airport 41.43191 -74.39156 523 -5 A
09J Jekyll Island Airport 31.07447 -81.42778 11 -4 A
0A9 Elizabethton Municipal Airport 36.37122 -82.17342 1593 -4 A

Looks good. Now we disconnect from the server.

dbDisconnect(sqlconnect) 
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE

Mongo time!

First we make sure the Mongo service is running before starting this step.

Then we create a collection named flights and insert our previously created flights dataset. The mongolite package works well here. Hold tight because this can take a couple minutes.

m <- mongo(collection = "flights")
if (m$count()>0)(m$drop())
m$insert(flights)
nrow(flights)

Finally, we check the row count. It looks like this matches the row count from our SQL import.

Reflections

What are the advantages and disadvantages of storing the data in a relational database vs. your NoSQL database?

While I did not try Neo4j for this project, MongoDB was difficult to learn with the command prompt interface. It was not very user friendly. I made some syntax errors because I was not experienced in command prompt. It also seemed that I had to run the R code during the connection attempt for the connection to succeed. The command prompt also does not seem to allow copying and pasting. Therefore, the interface and learning curve here are disadvantages.

From a structural standpoint, the queries run through MongoDB seem to take longer than SQL and are also less intuitive. The advantage would be that SQL would not be needed to run database queries if everything can be done through R.