Thanks to guidance from Valerie Briot, Christophe Hunt, Antonio Bayquen, Dan Brooks, R-bloggers (http://www.r-bloggers.com/mysql-and-r/), and Andy Catlin’s MongoDB instructional video.
The magic of communicating software! In this fascinating assignment we pull data from an existing SQL database, import into R, and recreate in a NoSQL database, Mongo DB.
From the MongoDB website:
“MongoDB is an open-source document database that provides high performance, high availability, and automatic scaling. MongoDB obviates the need for an Object Relational Mapping (ORM) to facilitate development.”
“MongoDB stores data in the form of documents, which are JSON-like field and value pairs. Documents are analogous to structures in programming languages that associate keys with values (e.g. dictionaries, hashes, maps, and associative arrays). Formally, MongoDB documents are BSON documents. BSON is a binary representation of JSON with additional type information. In the documents, the value of a field can be any of the BSON data types, including other documents, arrays, and arrays of documents.”
suppressMessages(suppressWarnings(library(RMySQL)))
suppressMessages(suppressWarnings(library(rmongodb)))
suppressMessages(suppressWarnings(library(RJSONIO)))
suppressMessages(suppressWarnings(library(rjson)))
suppressMessages(suppressWarnings(library(RCurl)))
suppressMessages(suppressWarnings(library(plyr)))
suppressMessages(suppressWarnings(library(knitr)))
suppressMessages(suppressWarnings(library(mongolite)))
driver <- dbDriver("MySQL")
sqlconnect <- dbConnect(driver, user="root", password = "mysql11", dbname = "flights", host = "localhost")
dbListTables(sqlconnect)
## [1] "airlines" "airports" "flights" "planes"
## [5] "reviews" "tblgroups" "tblrooms" "tblroomsgroups"
## [9] "tblusers" "tblusersgroups" "videos" "weather"
query <- "select * from flights;"
flights <- dbGetQuery(sqlconnect, query)
nrow(flights)
## [1] 336776
kable(head(flights))
| year | month | day | dep_time | dep_delay | arr_time | arr_delay | carrier | tailnum | flight | origin | dest | air_time | distance | hour | minute |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 2 | 830 | 11 | UA | N14228 | 1545 | EWR | IAH | 227 | 1400 | 5 | 17 |
| 2013 | 1 | 1 | 533 | 4 | 850 | 20 | UA | N24211 | 1714 | LGA | IAH | 227 | 1416 | 5 | 33 |
| 2013 | 1 | 1 | 542 | 2 | 923 | 33 | AA | N619AA | 1141 | JFK | MIA | 160 | 1089 | 5 | 42 |
| 2013 | 1 | 1 | 544 | -1 | 1004 | -18 | B6 | N804JB | 725 | JFK | BQN | 183 | 1576 | 5 | 44 |
| 2013 | 1 | 1 | 554 | -6 | 812 | -25 | DL | N668DN | 461 | LGA | ATL | 116 | 762 | 6 | 54 |
| 2013 | 1 | 1 | 554 | -4 | 740 | 12 | UA | N39463 | 1696 | EWR | ORD | 150 | 719 | 6 | 54 |
query <- "select * from airlines;"
airlines <- dbGetQuery(sqlconnect, query)
nrow(airlines)
## [1] 16
kable(head(airlines))
| carrier | name |
|---|---|
| 9E | Endeavor Air Inc. |
| AA | American Airlines Inc. |
| AS | Alaska Airlines Inc. |
| B6 | JetBlue Airways |
| DL | Delta Air Lines Inc. |
| EV | ExpressJet Airlines Inc. |
query <- "select * from weather;"
weather <- dbGetQuery(sqlconnect, query)
nrow(weather)
## [1] 8719
kable(head(weather))
| origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| EWR | 2013 | 1 | 1 | 0 | 37.04 | 21.92 | 53.97 | 230 | 10.35702 | 11.91865 | 0 | 1013.9 | 10 |
| EWR | 2013 | 1 | 1 | 1 | 37.04 | 21.92 | 53.97 | 230 | 13.80936 | 15.89154 | 0 | 1013.0 | 10 |
| EWR | 2013 | 1 | 1 | 2 | 37.94 | 21.92 | 52.09 | 230 | 12.65858 | 14.56724 | 0 | 1012.6 | 10 |
| EWR | 2013 | 1 | 1 | 3 | 37.94 | 23.00 | 54.51 | 230 | 13.80936 | 15.89154 | 0 | 1012.7 | 10 |
| EWR | 2013 | 1 | 1 | 4 | 37.94 | 24.08 | 57.04 | 240 | 14.96014 | 17.21583 | 0 | 1012.8 | 10 |
| EWR | 2013 | 1 | 1 | 6 | 39.02 | 26.06 | 59.37 | 270 | 10.35702 | 11.91865 | 0 | 1012.0 | 10 |
query <- "select * from planes;"
planes <- dbGetQuery(sqlconnect, query)
nrow(planes)
## [1] 3322
kable(head(planes))
| tailnum | year | type | manufacturer | model | engines | seats | speed | engine |
|---|---|---|---|---|---|---|---|---|
| N10156 | 2004 | Fixed wing multi engine | EMBRAER | EMB-145XR | 2 | 55 | NA | Turbo-fan |
| N102UW | 1998 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N103US | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N104UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
| N10575 | 2002 | Fixed wing multi engine | EMBRAER | EMB-145LR | 2 | 55 | NA | Turbo-fan |
| N105UW | 1999 | Fixed wing multi engine | AIRBUS INDUSTRIE | A320-214 | 2 | 182 | NA | Turbo-fan |
query <- "select * from airports;"
airports <- dbGetQuery(sqlconnect, query)
nrow(airports)
## [1] 1397
kable(head(airports))
| faa | name | lat | lon | alt | tz | dst |
|---|---|---|---|---|---|---|
| 04G | Lansdowne Airport | 41.13047 | -80.61958 | 1044 | -5 | A |
| 06A | Moton Field Municipal Airport | 32.46057 | -85.68003 | 264 | -5 | A |
| 06C | Schaumburg Regional | 41.98934 | -88.10124 | 801 | -6 | A |
| 06N | Randall Airport | 41.43191 | -74.39156 | 523 | -5 | A |
| 09J | Jekyll Island Airport | 31.07447 | -81.42778 | 11 | -4 | A |
| 0A9 | Elizabethton Municipal Airport | 36.37122 | -82.17342 | 1593 | -4 | A |
dbDisconnect(sqlconnect)
## [1] TRUE
dbUnloadDriver(driver)
## [1] TRUE
m <- mongo(collection = "flights")
if (m$count()>0)(m$drop())
m$insert(flights)
nrow(flights)
While I did not try Neo4j for this project, MongoDB was difficult to learn with the command prompt interface. It was not very user friendly. I made some syntax errors because I was not experienced in command prompt. It also seemed that I had to run the R code during the connection attempt for the connection to succeed. The command prompt also does not seem to allow copying and pasting. Therefore, the interface and learning curve here are disadvantages.
From a structural standpoint, the queries run through MongoDB seem to take longer than SQL and are also less intuitive. The advantage would be that SQL would not be needed to run database queries if everything can be done through R.