In this assignment, we migrate the nycflights13 database from MySQL to MongoDB. The nycflights13 database includes 5 tables:
airlinesairportsflightsplanesweather.We will use the RMySQL package to load the data from the MySQL database and then the mongolite package to save the data into a MongoDB database.
library(tidyverse)
library(RMySQL)
library(mongolite)
First we load the data from the nycflights13 database in MySQL. We use the RMySQL package to connect to the database and to pass the SQL query to load the data. Note that the MySQL instance here is running on ‘localhost’ with my user credentials.
# connect to MySQL and get list of tables in the flights database
flights_db <- dbConnect(MySQL(), user=usr, password=pwd, dbname='flights', host='localhost')
tables <- dbListTables(flights_db)
tables
## [1] "airlines" "airports" "flights" "planes" "weather"
# query to load the data from each table
query <- sprintf("SELECT * FROM %s", tables)
query
## [1] "SELECT * FROM airlines" "SELECT * FROM airports"
## [3] "SELECT * FROM flights" "SELECT * FROM planes"
## [5] "SELECT * FROM weather"
# initialize temp tables
temp1 <- rep(list(NA), length(tables))
# load tables into temp dataframes
for (j in 1:length(tables)) {
temp1[[j]] <- as_tibble(dbGetQuery(flights_db, query[j]))
}
temp1
## [[1]]
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E "Endeavor Air Inc.\r"
## 2 AA "American Airlines Inc.\r"
## 3 AS "Alaska Airlines Inc.\r"
## 4 B6 "JetBlue Airways\r"
## 5 DL "Delta Air Lines Inc.\r"
## 6 EV "ExpressJet Airlines Inc.\r"
## 7 F9 "Frontier Airlines Inc.\r"
## 8 FL "AirTran Airways Corporation\r"
## 9 HA "Hawaiian Airlines Inc.\r"
## 10 MQ "Envoy Air\r"
## 11 OO "SkyWest Airlines Inc.\r"
## 12 UA "United Air Lines Inc.\r"
## 13 US "US Airways Inc.\r"
## 14 VX "Virgin America\r"
## 15 WN "Southwest Airlines Co.\r"
## 16 YV "Mesa Airlines Inc.\r"
##
## [[2]]
## # A tibble: 1,397 x 7
## faa name lat lon alt tz dst
## <chr> <chr> <dbl> <dbl> <int> <int> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -5 A
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
## 4 06N Randall Airport 41.4 -74.4 523 -5 A
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -4 A
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -4 A
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A
## # ... with 1,387 more rows
##
## [[3]]
## # A tibble: 336,776 x 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## # ... with 336,766 more rows, and 7 more variables: flight <int>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>
##
## [[4]]
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wi~ EMBRAER EMB-1~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 NA Turbo~
## # ... with 3,312 more rows
##
## [[5]]
## # A tibble: 8,719 x 14
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 EWR 2013 1 1 0 37.0 21.9 54.0 230 10.4
## 2 EWR 2013 1 1 1 37.0 21.9 54.0 230 13.8
## 3 EWR 2013 1 1 2 37.9 21.9 52.1 230 12.7
## 4 EWR 2013 1 1 3 37.9 23 54.5 230 13.8
## 5 EWR 2013 1 1 4 37.9 24.1 57.0 240 15.0
## 6 EWR 2013 1 1 6 39.0 26.1 59.4 270 10.4
## 7 EWR 2013 1 1 7 39.0 27.0 61.6 250 8.06
## 8 EWR 2013 1 1 8 39.0 28.0 64.4 240 11.5
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 250 12.7
## 10 EWR 2013 1 1 10 39.0 28.0 64.4 260 12.7
## # ... with 8,709 more rows, and 4 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
# disconnect
dbDisconnect(flights_db)
## [1] TRUE
Now we have the 5 tables of nycflights13 loaded as a list of dataframes in temp1.
Next we migrate the data in temp1 into a MongoDB database. We use the mongolite package to connect to MongoDB and to save each dataframe as a separate collection in MongoDB.
# initialize temp tables
temp2 <- rep(list(NA), length(tables))
# loop through each table and save into each collection
for (j in 1:length(tables)) {
# connect to collection in mongodb
temp2[[j]] <- mongo(collection = tables[j], db = "nycflights13")
# drop collection if already exists
if (temp2[[j]]$count() > 0) {
temp2[[j]]$drop()
}
# insert data
temp2[[j]]$insert(temp1[[j]])
# stop if row counts are not identical
stopifnot(temp2[[j]]$count() == nrow(temp1[[j]]))
}
Finally we confirm that all the data loaded correctly into MongoDB, by first inspecting the collections and then checking that they are identical to the original MySQL tables. Let’s inspect the data in each collection.
# get the mongo collections
airlines <- mongo(collection = "airlines", db = "nycflights13")
airports <- mongo(collection = "airports", db = "nycflights13")
flights <- mongo(collection = "flights", db = "nycflights13")
planes <- mongo(collection = "planes", db = "nycflights13")
weather <- mongo(collection = "weather", db = "nycflights13")
# inspect the data
( airlines_data <- as_tibble(airlines$find()) )
## # A tibble: 16 x 2
## carrier name
## * <chr> <chr>
## 1 9E "Endeavor Air Inc.\r"
## 2 AA "American Airlines Inc.\r"
## 3 AS "Alaska Airlines Inc.\r"
## 4 B6 "JetBlue Airways\r"
## 5 DL "Delta Air Lines Inc.\r"
## 6 EV "ExpressJet Airlines Inc.\r"
## 7 F9 "Frontier Airlines Inc.\r"
## 8 FL "AirTran Airways Corporation\r"
## 9 HA "Hawaiian Airlines Inc.\r"
## 10 MQ "Envoy Air\r"
## 11 OO "SkyWest Airlines Inc.\r"
## 12 UA "United Air Lines Inc.\r"
## 13 US "US Airways Inc.\r"
## 14 VX "Virgin America\r"
## 15 WN "Southwest Airlines Co.\r"
## 16 YV "Mesa Airlines Inc.\r"
( airports_data <- as_tibble(airports$find()) )
## # A tibble: 1,397 x 7
## faa name lat lon alt tz dst
## * <chr> <chr> <dbl> <dbl> <int> <int> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -5 A
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A
## 4 06N Randall Airport 41.4 -74.4 523 -5 A
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -4 A
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -4 A
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A
## # ... with 1,387 more rows
( flights_data <- as_tibble(flights$find()) )
## # A tibble: 336,776 x 16
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum
## * <int> <int> <int> <int> <int> <int> <int> <chr> <chr>
## 1 2013 1 1 517 2 830 11 UA N14228
## 2 2013 1 1 533 4 850 20 UA N24211
## 3 2013 1 1 542 2 923 33 AA N619AA
## 4 2013 1 1 544 -1 1004 -18 B6 N804JB
## 5 2013 1 1 554 -6 812 -25 DL N668DN
## 6 2013 1 1 554 -4 740 12 UA N39463
## 7 2013 1 1 555 -5 913 19 B6 N516JB
## 8 2013 1 1 557 -3 709 -14 EV N829AS
## 9 2013 1 1 557 -3 838 -8 B6 N593JB
## 10 2013 1 1 558 -2 753 8 AA N3ALAA
## # ... with 336,766 more rows, and 7 more variables: flight <int>,
## # origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## # minute <int>
( planes_data <- as_tibble(planes$find()) )
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats engine speed
## * <chr> <int> <chr> <chr> <chr> <int> <int> <chr> <int>
## 1 N10156 2004 Fixed wi~ EMBRAER EMB-1~ 2 55 Turbo~ NA
## 2 N102UW 1998 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 3 N103US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 4 N104UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 5 N10575 2002 Fixed wi~ EMBRAER EMB-1~ 2 55 Turbo~ NA
## 6 N105UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 7 N107US 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 8 N108UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 9 N109UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## 10 N110UW 1999 Fixed wi~ AIRBUS INDUS~ A320-~ 2 182 Turbo~ NA
## # ... with 3,312 more rows
( weather_data <- as_tibble(weather$find()) )
## # A tibble: 8,719 x 14
## origin year month day hour temp dewp humid wind_dir wind_speed
## * <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 EWR 2013 1 1 0 37.0 21.9 54.0 230 10.4
## 2 EWR 2013 1 1 1 37.0 21.9 54.0 230 13.8
## 3 EWR 2013 1 1 2 37.9 21.9 52.1 230 12.7
## 4 EWR 2013 1 1 3 37.9 23 54.5 230 13.8
## 5 EWR 2013 1 1 4 37.9 24.1 57.0 240 15.0
## 6 EWR 2013 1 1 6 39.0 26.1 59.4 270 10.4
## 7 EWR 2013 1 1 7 39.0 27.0 61.6 250 8.06
## 8 EWR 2013 1 1 8 39.0 28.0 64.4 240 11.5
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 250 12.7
## 10 EWR 2013 1 1 10 39.0 28.0 64.4 260 12.7
## # ... with 8,709 more rows, and 4 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
Next let’s confirm that the data in each MySQL table is identical to the data in each MongoDB collection.
# confirm that collections are identical to the mysql tables
all.equal(temp1[[1]], airlines_data)
## [1] TRUE
all.equal(temp1[[2]], airports_data)
## [1] "Rows in x but not y: 597, 997, 998, 452, 72, 17. Rows in y but not x: 452, 998, 72, 597, 17, 997. "
all.equal(temp1[[3]], flights_data)
## [1] TRUE
all.equal(temp1[[4]], planes_data)
## [1] TRUE
all.equal(temp1[[5]], weather_data)
## [1] "Rows in x but not y: 3438, 5041, 3476, 6781, 4931, 3039, 3896, 5374, 8220, 6961, 145[...]. Rows in y but not x: 7678, 6592, 5456, 6678, 1930, 6625, 2638, 2203, 412, 243, 1043[...]. "
It appears that the data are identical between MySQL and MongoDB for airlines, flights, and planes, but not for airports and weather. For instance, in the airports data, the error message indicates that rows 17, 72, and 452 are not the same. Furthermore, in the weather data, the rows that are not the same varies each time we run the comparison. The last time I ran this, rows 411, 589, and 808 were not the same. Let’s take a closer look at these rows.
# look at some problem rows
length(temp1[[2]]) == length(airports_data)
## [1] TRUE
check1 <- c(17, 72, 452)
temp1[[2]][check1, ]
## # A tibble: 3 x 7
## faa name lat lon alt tz dst
## <chr> <chr> <dbl> <dbl> <int> <int> <chr>
## 1 1C9 Frazier Lake Airpark 54.0 -125. 152 -8 A
## 2 A39 Phoenix Regional Airport 33.0 -112. 1300 -7 A
## 3 FFZ Mesa Falcon Field 33.5 -112. 1394 -7 A
airports_data[check1, ]
## # A tibble: 3 x 7
## faa name lat lon alt tz dst
## <chr> <chr> <dbl> <dbl> <int> <int> <chr>
## 1 1C9 Frazier Lake Airpark 54.0 -125. 152 -8 A
## 2 A39 Phoenix Regional Airport 33.0 -112. 1300 -7 A
## 3 FFZ Mesa Falcon Field 33.5 -112. 1394 -7 A
temp1[[2]][check1, ] == airports_data[check1, ]
## faa name lat lon alt tz dst
## [1,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [2,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [3,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
# count up number of problem rows
sum1 <- 0
for (k in 1:nrow(airports_data)) {
sum1 <- sum1 + (temp1[[2]][k, ] != airports_data[k, ])
}
sum1
## faa name lat lon alt tz dst
## [1,] 0 0 6 6 0 0 0
# look at some problem rows
length(temp1[[5]]) == length(weather_data)
## [1] TRUE
check2 <- c(411, 589, 808)
temp1[[5]][check2, ]
## # A tibble: 3 x 14
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 EWR 2013 1 18 3 37.0 23 56.5 360 9.21
## 2 EWR 2013 1 25 13 16.0 3.02 56.0 300 6.90
## 3 EWR 2013 2 3 16 28.0 21.0 74.6 350 8.06
## # ... with 4 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>
weather_data[check2, ]
## # A tibble: 3 x 14
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl>
## 1 EWR 2013 1 18 3 37.0 23 56.5 360 9.21
## 2 EWR 2013 1 25 13 16.0 3.02 56.0 300 6.90
## 3 EWR 2013 2 3 16 28.0 21.0 74.6 350 8.06
## # ... with 4 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>
temp1[[5]][check2, ] == weather_data[check2, ]
## origin year month day hour temp dewp humid wind_dir wind_speed
## [1,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [2,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## wind_gust precip pressure visib
## [1,] FALSE TRUE TRUE TRUE
## [2,] FALSE TRUE TRUE TRUE
## [3,] FALSE TRUE TRUE TRUE
# count up number of problem rows
sum2 <- 0
for (k in 1:nrow(weather_data)) {
sum2 <- sum2 + (temp1[[5]][k, ] != weather_data[k, ])
}
sum2
## origin year month day hour temp dewp humid wind_dir wind_speed
## [1,] 0 0 NA NA NA NA NA NA 0 NA
## wind_gust precip pressure visib
## [1,] NA 0 NA 0
Hmmm … this is curious. It appears there are 6 rows (out of 1,397) in the airports data that have discrepancies; the explanation may relate to rounding of the latitude and longitude data (of type double). On the other hand, discrepancies in the weather data may relate to null or NA data, as well as the rounding issue. This will take more time to investigate.
Comparing MongoDB and MySQL, it is apparent that they each have their own advantages and disadvantages.
Some advantages of MongoDB / NoSQL databases include:
Some advantages of MySQL / relational databases include:
It would be good to investigate why the MySQL and MongoDB versions of the airports and weather data are not identical. Possible culprits include:
lat and lon data in airportsweather