This document pertains to load data from MySql Database into MongoDB. When possible this will be done using R code. The following packages will be used:
RMySQL rmongodb rjson
The mongod executable should be running in the background.
## Warning: package 'RMySQL' was built under R version 3.2.4
## Loading required package: DBI
## Warning: package 'rmongodb' was built under R version 3.2.5
For the purpose of this exercise, we will use the flights data base already loaded into MySQL on localHost. We will assume that similar environment is available.
First, we will connect to MySQL database and retrieve the data using the RMySQL package. The connection parameters should be changed if the environment available is different.
#Connect to MySQL database
flights_db_mysql <- dbConnect(MySQL(), user="root", password="admin",
dbname="flights", host="localhost")
Then we will run queries to retrieve the data and store it. For the flights table, the data could not be loaded using the method from rmongodb and the mongoimport function will be used. To this effect a .csv file will be created. The path where this file is stored should be udpated to reproduce the results.
# .csv path for flights table
path_csv <- "C:/Users/vbrio/Documents/Cuny/DATA_607/project4/flights.csv"
airlines <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.airlines;")
airports <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.airports;")
weather <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.weather;")
planes <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.planes;")
flights <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.flights;")
write.csv(flights, file = path_csv)
Checking some elements of the data. Also, the .csv file is created with a extra column containing the record number. This column should be removed from the .csv file.
class(airports)
## [1] "data.frame"
class(airlines)
## [1] "data.frame"
class(weather)
## [1] "data.frame"
class(planes)
## [1] "data.frame"
class(flights)
## [1] "data.frame"
airlines[1:5, ]
## carrier name
## 1 9E Endeavor Air Inc.\r
## 2 AA American Airlines Inc.\r
## 3 AS Alaska Airlines Inc.\r
## 4 B6 JetBlue Airways\r
## 5 DL Delta Air Lines Inc.\r
airports[1:5, ]
## faa name lat lon alt tz dst
## 1 04G Lansdowne Airport 41.13047 -80.61958 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.46057 -85.68003 264 -5 A
## 3 06C Schaumburg Regional 41.98934 -88.10124 801 -6 A
## 4 06N Randall Airport 41.43191 -74.39156 523 -5 A
## 5 09J Jekyll Island Airport 31.07447 -81.42778 11 -4 A
weather[1:5, ]
## origin year month day hour temp dewp humid wind_dir wind_speed
## 1 EWR 2013 1 1 0 37.04 21.92 53.97 230 10.35702
## 2 EWR 2013 1 1 1 37.04 21.92 53.97 230 13.80936
## 3 EWR 2013 1 1 2 37.94 21.92 52.09 230 12.65858
## 4 EWR 2013 1 1 3 37.94 23.00 54.51 230 13.80936
## 5 EWR 2013 1 1 4 37.94 24.08 57.04 240 14.96014
## wind_gust precip pressure visib
## 1 11.91865 0 1013.9 10
## 2 15.89154 0 1013.0 10
## 3 14.56724 0 1012.6 10
## 4 15.89154 0 1012.7 10
## 5 17.21583 0 1012.8 10
planes[1:5, ]
## tailnum year type manufacturer model engines
## 1 N10156 2004 Fixed wing multi engine EMBRAER EMB-145XR 2
## 2 N102UW 1998 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 3 N103US 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 4 N104UW 1999 Fixed wing multi engine AIRBUS INDUSTRIE A320-214 2
## 5 N10575 2002 Fixed wing multi engine EMBRAER EMB-145LR 2
## seats speed engine
## 1 55 NA Turbo-fan
## 2 182 NA Turbo-fan
## 3 182 NA Turbo-fan
## 4 182 NA Turbo-fan
## 5 55 NA Turbo-fan
flights[1:5, ]
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## flight origin dest air_time distance hour minute
## 1 1545 EWR IAH 227 1400 5 17
## 2 1714 LGA IAH 227 1416 5 33
## 3 1141 JFK MIA 160 1089 5 42
## 4 725 JFK BQN 183 1576 5 44
## 5 461 LGA ATL 116 762 6 54
nrow(airlines)
## [1] 16
nrow(airports)
## [1] 1397
nrow(weather)
## [1] 8719
nrow(planes)
## [1] 3322
nrow(flights)
## [1] 336776
We will now connect to Mongodb database and we will load the data from the airlines, airports, weather, and planes tables. For the flight table the data will be imported from mongo shell by the mongoimport command. We were encoutering memory problem with the method outline below.
######################################################
#Connect to Mongodb database
# connect to mongodb running in background on localhost
# Please note, flights_m, was created using mongo shell
#######################################################
mongo <- mongo.create()
mongo.is.connected(mongo)
## [1] TRUE
db <- "flights_m"
mairlines <- "flights_m.airlines"
mairports <- "flights_m.airports"
mweather <- "flights_m.weather"
mplanes <- "flights_m.planes"
mflights <-"flights_m.flights"
mongo.get.database.collections(mongo, db)
## character(0)
The data must be converted to bjson format prior to being inserted into Mongodb database.
# convert airports l to bson format
airlines_lbson <- lapply(split(airlines, 1:nrow(airlines)), function(x) mongo.bson.from.JSON(toJSON(x)))
airports_lbson <- lapply(split(airports, 1:nrow(airports)), function(x) mongo.bson.from.JSON(toJSON(x)))
weather_lbson <- lapply(split(weather, 1:nrow(weather)), function(x) mongo.bson.from.JSON(toJSON(x)))
planes_lbson <- lapply(split(planes, 1:nrow(planes)), function(x) mongo.bson.from.JSON(toJSON(x)))
#flights_lbson <- lapply(split(flights, 1:nrow(flights)), function(x) mongo.bson.from.JSON(toJSON(x)))
# flights table is not being loaded. Will write to .csv and load with dbmongo import.
We now inser the data into mongodb, the the collection already exists, we will remove it as to not duplicate the data.
if(mongo.count(mongo,mairlines) != 0){
mongo.remove(mongo, mairlines, criteria = mongo.bson.empty())
}
## [1] TRUE
mongo.insert.batch(mongo, mairlines, airlines_lbson)
## [1] TRUE
if(mongo.count(mongo,mairports) != 0){
mongo.remove(mongo, mairports, criteria = mongo.bson.empty())
}
## [1] TRUE
mongo.insert.batch(mongo, mairports, airports_lbson)
## [1] TRUE
if(mongo.count(mongo,mweather) != 0){
mongo.remove(mongo, mweather, criteria = mongo.bson.empty())
}
## [1] TRUE
mongo.insert.batch(mongo, mweather, weather_lbson)
## [1] TRUE
if(mongo.count(mongo,mplanes) != 0){
mongo.remove(mongo, mplanes, criteria = mongo.bson.empty())
}
## [1] TRUE
mongo.insert.batch(mongo, mplanes, planes_lbson)
## [1] TRUE
For the flights table, the data will be inserted directly using the mongoimport command.
https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/import_command.PNG
The results we objtained:
https://raw.githubusercontent.com/vbriot28/datascienceCUNY_607/master/import_commnad_result.PNG
We will now compare the count between the set of records.
mongo.count(mongo, mairlines)
## [1] 16
mongo.count(mongo, mairports)
## [1] 1397
mongo.count(mongo, mweather)
## [1] 8719
mongo.count(mongo, mplanes)
## [1] 3322
mongo.count(mongo, mflights)
## [1] 336776
mysql_count <- c(nrow(airlines), nrow(airports), nrow(weather), nrow(planes), nrow(flights))
mongodb_count <- c(mongo.count(mongo, mairlines), mongo.count(mongo, mairports), mongo.count(mongo, mweather), mongo.count(mongo, mplanes),
mongo.count(mongo, mflights))
mysql_count == mongodb_count
## [1] TRUE TRUE TRUE TRUE TRUE
We will now run a query on one of the table using both SQL and Mongodb. We will present the data into data frame. The query will select the record from the “weather” table for which origin = “JFK”.
mongo_l <- mongo.find.all(mongo, mweather, '{"origin" : "JFK"}')
class(mongo_l)
## [1] "list"
mongo_df<-as.data.frame(do.call(rbind.data.frame, mongo_l))
class(mongo_df)
## [1] "data.frame"
sql_df <- dbGetQuery(flights_db_mysql, "SELECT * FROM flights.weather where origin = 'JFK';")
sql_df
## origin year month day hour temp dewp humid wind_dir wind_speed
## 1 JFK 2013 2 18 4 17.96 -0.94 42.69 290 29.92028
## 2 JFK 2013 2 20 19 32.00 8.06 36.03 280 26.46794
## 3 JFK 2013 7 2 11 71.60 69.80 94.06 180 11.50780
## 4 JFK 2013 7 2 13 71.60 69.80 94.06 190 10.35702
## 5 JFK 2013 7 31 6 71.06 55.04 56.93 320 8.05546
## 6 JFK 2013 9 2 20 75.20 73.40 94.14 200 4.60312
## 7 JFK 2013 10 23 10 48.92 39.02 68.51 60 4.60312
## 8 JFK 2013 10 23 11 48.92 39.02 68.51 40 4.60312
## 9 JFK 2013 12 17 5 26.96 10.94 50.34 40 4.60312
## wind_gust precip pressure visib
## 1 34.431660 0 1016.2 10
## 2 30.458776 0 1011.2 10
## 3 13.242946 0 NA 0
## 4 11.918651 0 NA 1
## 5 9.270062 0 1020.4 10
## 6 5.297178 0 NA 4
## 7 5.297178 0 1008.1 10
## 8 5.297178 0 1008.5 10
## 9 5.297178 0 1023.9 10
dim(mongo_df)
## [1] 9 15
dim(sql_df)
## [1] 9 14
mongo_df[,2:15] == sql_df
## origin year month day hour temp dewp humid wind_dir wind_speed
## 2 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 21 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 3 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## 9 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## wind_gust precip pressure visib
## 2 TRUE TRUE TRUE TRUE
## 21 TRUE TRUE TRUE TRUE
## 3 TRUE TRUE NA TRUE
## 4 TRUE TRUE NA TRUE
## 5 TRUE TRUE TRUE TRUE
## 6 TRUE TRUE NA TRUE
## 7 TRUE TRUE TRUE TRUE
## 8 TRUE TRUE TRUE TRUE
## 9 TRUE TRUE TRUE TRUE
mongo_df$pressure
## [1] 1016.2 1011.2 NA NA 1020.4 NA 1008.1 1008.5 1023.9
## Levels: 1008.1 1008.5 1011.2 1016.2 1020.4 1023.9 NA
sql_df$pressure
## [1] 1016.2 1011.2 NA NA 1020.4 NA 1008.1 1008.5 1023.9
The ‘NA’ results we obtain in the dataframe to dataframe comparison are due to NA value in the pressure column. These values are ‘NA’ in both data sets, the one extracting from sql and the one extracting from mongodb.
In conclusion, with the rmongodb package, db opertion can be executed from R directly. However, the advantage of using Mongodb as a data base is to have a dynamic schema data base that can store large amount of data. We would anticipate that any work in R would be done in subset of original data set. The most efficient way of loading the data would be using the command provided by the database package. We would expect that a subset of the data would be loaded in R for transformation and analysis.
An interest video on rmongodb can be found here…
https://www.youtube.com/watch?v=GWZdFFYrR4I