1 Setting up R Environment

Of the few choices for MySQL and MongoDB configurations, I arbitrarily used RMySQL and RMongo. There are alternatives available that may be more suitable to your style and environment.

1.1 Libraries

#for data connection
require(RMySQL)
require(RMongo)

#for analysis
require(dplyr)
require(reshape2)

1.2 Creating a MySQL connection

Ensure that you check your MySQL connection for your unique username, password, and host. The following code opens the flights database and outputs a list of each of the tables contained in the flights database.

#Set up working directory using the RMySQL library
con <- dbConnect(MySQL(), 
    user = 'root',
    password = mySQLPass, #fill in per user
    host = 'localhost',
    dbname='flights')

#Get the tables array
tablesList<-dbListTables(con)
dbListTables(con)
## [1] "airlines" "airports" "flights"  "planes"   "weather"

2 Setting up MongoDB Environment

Please download the latest MongoDB Community Edition to your machine from mongodb.org/downloads. I am running Windows 10 x64 bit.

Once downloaded, please follow the instructions on mongodb.org/manual/tutorial/install-mongodb-on-windows/. You must ensure that your PATH variable, database file structure, and config file are customized to your working environment before proceeding!

3 Data preprocessing

3.1 R to CSV

It is relatively straightforward, creating csv files from a mySQL database. The following loop will take all of the tables in the database with which you connect. This allows for easier MongoDB consumption.

I have set my working directory to C:/Data/flights and the csv will be output to this location.

i<-1
for (i in 1:length(tablesList)){
   j <- dbReadTable(con, tablesList[i])
   write.csv(j, file = paste(tablesList[i],".csv"), row.names = FALSE)
}

3.2 CSV to Mongo

It is relatively straightforward to create csv files from a MySQL database and transfer this data into MongoDB. The following loop will take all of the tables in the database with which you connect. This allows for easier MongoDB consumption.

Open command line and cd to the path where your mongo.exe file is contained.

Run the following commands to import your .csv files as 5 collections into the MongoDB flights database –db. The collections will be contained in the directory that you should have set up in your configuration file in the websites’ instructions.

mongoimport --db flightsData --collection airlines --type csv --file "c:\data\flights\airlines .csv" --headerline
mongoimport --db flightsData --collection airports --type csv --file "c:\data\flights\airports .csv" --headerline
mongoimport --db flightsData --collection flights --type csv --file "c:\data\flights\flights .csv" --headerline
mongoimport --db flightsData --collection planes --type csv --file "c:\data\flights\planes .csv" --headerline
mongoimport --db flightsData --collection weather --type csv --file "c:\data\flights\weather .csv" --headerline

If you wish to test the load, you can enter the Mongo Shell. Keeping in the same directory as listed directly above, type:

C:\Program Files\MongoDB\Server\3.2\bin>mongo
...
...
...
>use flights
>show collection

4 Querying Mongo

4.1 MongoDB Connection

conn_mongo  <- mongoDbConnect("flightsData")
airlines_mongo <- dbGetQuery(conn_mongo, "airlines", "",0,16)
airports_mongo <- dbGetQuery(conn_mongo, "airports", "",0,1397)
flights_mongo <- dbGetQuery(conn_mongo, "flights", "",0,336776)
planes_mongo <- dbGetQuery(conn_mongo, "planes", "",0,3322)
weather_mongo <- dbGetQuery(conn_mongo, "weather", "",0,8719)

4.2 Mongo Query Examples

4.2.1 XID Mongo Unique Identifier

xid<-dbGetQuery(conn_mongo, 'airports', '{"name": "Jekyll Island Airport"}')
xid$X_id
## [1] "5717f71eea7e80e167b74cee"

4.2.2 Frequency Barplot Example

Idea borrowed from https://github.com/renkun-ken/pipeR-tutorial/blob/master/Examples/dplyr.Rmd. The data can be treated as if it were any other R list.

flights_mongo %>% 
  mutate(speed = distance / air_time) %>%
  group_by(carrier) %>%
  summarize(mean_speed = mean((distance/air_time)*60,na.rm = TRUE)) %>%
  arrange(desc(mean_speed)) %>%
  with(barplot(mean_speed,names.arg = carrier,
    main = "Average flight speed", xlab="Airline", ylab="Miles per hour"))

5 Discussion : What are the advantages and disadvantages of a NoSQL database?

NoSQL is an alternative to the more commonly learned SQL data database. Without prior knowledge either, a NoSQL database offers a relatively quicker and more painless configuration. In the event where one is considering a “data first” or “coding first” scenario, it might also be worth pairing these options with their respective counterparts SQL, and NOSQL.

When a project’s requirements are nebulous or the nature of the coding more experimental, it might be worth considering a NoSQL approach. Alternatively, SQL grants a veritable structure to data that NoSQL is not equipped with handling. This includes, but is not limited to, foreign keys and robust transactions.

Flat, relational data is the standard for SQL. If the database designer can visualize and implement a near, if not perfect schema for an application in the relational data structure, SQL is the only option. NoSQL, however, is more geared collections which can be expanded into hierarchies. Hierarchical data connects well with certain libraries and molds, while relational data is better suited to other situations.

There are performance and technical issues with every potential application. Before any decision to choose one data storage method or the other, it is highly recommended that both are well understood and learned by the developer. That said, without a solid grasp of either option, you should certainly consult with online anecdotes discussing similar situations. Good luck.