1 Introduction

In this assignment, we migrate the nycflights13 database from MySQL to MongoDB. The nycflights13 database includes 5 tables:

  • airlines
  • airports
  • flights
  • planes
  • weather.

We will use the RMySQL package to load the data from the MySQL database and then the mongolite package to save the data into a MongoDB database.

library(tidyverse)
library(RMySQL)
library(mongolite)

2 Loading the data from MySQL

First we load the data from the nycflights13 database in MySQL. We use the RMySQL package to connect to the database and to pass the SQL query to load the data. Note that the MySQL instance here is running on ‘localhost’ with my user credentials.

# connect to MySQL and get list of tables in the flights database
flights_db <- dbConnect(MySQL(), user=usr, password=pwd, dbname='flights', host='localhost')
tables <- dbListTables(flights_db)
tables
## [1] "airlines" "airports" "flights"  "planes"   "weather"
# query to load the data from each table
query <- sprintf("SELECT * FROM %s", tables)
query
## [1] "SELECT * FROM airlines" "SELECT * FROM airports"
## [3] "SELECT * FROM flights"  "SELECT * FROM planes"  
## [5] "SELECT * FROM weather"
# initialize temp tables
temp1 <- rep(list(NA), length(tables))            

# load tables into temp dataframes
for (j in 1:length(tables)) {
    temp1[[j]] <- as_tibble(dbGetQuery(flights_db, query[j]))
}

temp1
## [[1]]
## # A tibble: 16 x 2
##    carrier name                           
##    <chr>   <chr>                          
##  1 9E      "Endeavor Air Inc.\r"          
##  2 AA      "American Airlines Inc.\r"     
##  3 AS      "Alaska Airlines Inc.\r"       
##  4 B6      "JetBlue Airways\r"            
##  5 DL      "Delta Air Lines Inc.\r"       
##  6 EV      "ExpressJet Airlines Inc.\r"   
##  7 F9      "Frontier Airlines Inc.\r"     
##  8 FL      "AirTran Airways Corporation\r"
##  9 HA      "Hawaiian Airlines Inc.\r"     
## 10 MQ      "Envoy Air\r"                  
## 11 OO      "SkyWest Airlines Inc.\r"      
## 12 UA      "United Air Lines Inc.\r"      
## 13 US      "US Airways Inc.\r"            
## 14 VX      "Virgin America\r"             
## 15 WN      "Southwest Airlines Co.\r"     
## 16 YV      "Mesa Airlines Inc.\r"         
## 
## [[2]]
## # A tibble: 1,397 x 7
##    faa   name                             lat    lon   alt    tz dst  
##    <chr> <chr>                          <dbl>  <dbl> <int> <int> <chr>
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A    
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -5 A    
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A    
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A    
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -4 A    
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -4 A    
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A    
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A    
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U    
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A    
## # ... with 1,387 more rows
## 
## [[3]]
## # A tibble: 336,776 x 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##    <int> <int> <int>    <int>     <int>    <int>     <int> <chr>   <chr>  
##  1  2013     1     1      517         2      830        11 UA      N14228 
##  2  2013     1     1      533         4      850        20 UA      N24211 
##  3  2013     1     1      542         2      923        33 AA      N619AA 
##  4  2013     1     1      544        -1     1004       -18 B6      N804JB 
##  5  2013     1     1      554        -6      812       -25 DL      N668DN 
##  6  2013     1     1      554        -4      740        12 UA      N39463 
##  7  2013     1     1      555        -5      913        19 B6      N516JB 
##  8  2013     1     1      557        -3      709       -14 EV      N829AS 
##  9  2013     1     1      557        -3      838        -8 B6      N593JB 
## 10  2013     1     1      558        -2      753         8 AA      N3ALAA 
## # ... with 336,766 more rows, and 7 more variables: flight <int>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>
## 
## [[4]]
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats speed engine
##    <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wi~ EMBRAER       EMB-1~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182    NA Turbo~
## # ... with 3,312 more rows
## 
## [[5]]
## # A tibble: 8,719 x 14
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <int>      <dbl>
##  1 EWR     2013     1     1     0  37.0  21.9  54.0      230      10.4 
##  2 EWR     2013     1     1     1  37.0  21.9  54.0      230      13.8 
##  3 EWR     2013     1     1     2  37.9  21.9  52.1      230      12.7 
##  4 EWR     2013     1     1     3  37.9  23    54.5      230      13.8 
##  5 EWR     2013     1     1     4  37.9  24.1  57.0      240      15.0 
##  6 EWR     2013     1     1     6  39.0  26.1  59.4      270      10.4 
##  7 EWR     2013     1     1     7  39.0  27.0  61.6      250       8.06
##  8 EWR     2013     1     1     8  39.0  28.0  64.4      240      11.5 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      250      12.7 
## 10 EWR     2013     1     1    10  39.0  28.0  64.4      260      12.7 
## # ... with 8,709 more rows, and 4 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>
# disconnect 
dbDisconnect(flights_db)
## [1] TRUE

Now we have the 5 tables of nycflights13 loaded as a list of dataframes in temp1.

3 Migrating the data to MongoDB

3.1 Save tables as collections

Next we migrate the data in temp1 into a MongoDB database. We use the mongolite package to connect to MongoDB and to save each dataframe as a separate collection in MongoDB.

# initialize temp tables
temp2 <- rep(list(NA), length(tables))            

# loop through each table and save into each collection
for (j in 1:length(tables)) {
    # connect to collection in mongodb 
    temp2[[j]] <- mongo(collection = tables[j], db = "nycflights13")

    # drop collection if already exists
    if (temp2[[j]]$count() > 0) {
        temp2[[j]]$drop() 
    }

    # insert data
    temp2[[j]]$insert(temp1[[j]])

    # stop if row counts are not identical
    stopifnot(temp2[[j]]$count() == nrow(temp1[[j]]))
}

3.2 Validate the data

Finally we confirm that all the data loaded correctly into MongoDB, by first inspecting the collections and then checking that they are identical to the original MySQL tables. Let’s inspect the data in each collection.

# get the mongo collections
airlines <- mongo(collection = "airlines", db = "nycflights13")
airports <- mongo(collection = "airports", db = "nycflights13")
flights <- mongo(collection = "flights", db = "nycflights13")
planes <- mongo(collection = "planes", db = "nycflights13")
weather <- mongo(collection = "weather", db = "nycflights13")

# inspect the data
( airlines_data <- as_tibble(airlines$find()) )
## # A tibble: 16 x 2
##    carrier name                           
##  * <chr>   <chr>                          
##  1 9E      "Endeavor Air Inc.\r"          
##  2 AA      "American Airlines Inc.\r"     
##  3 AS      "Alaska Airlines Inc.\r"       
##  4 B6      "JetBlue Airways\r"            
##  5 DL      "Delta Air Lines Inc.\r"       
##  6 EV      "ExpressJet Airlines Inc.\r"   
##  7 F9      "Frontier Airlines Inc.\r"     
##  8 FL      "AirTran Airways Corporation\r"
##  9 HA      "Hawaiian Airlines Inc.\r"     
## 10 MQ      "Envoy Air\r"                  
## 11 OO      "SkyWest Airlines Inc.\r"      
## 12 UA      "United Air Lines Inc.\r"      
## 13 US      "US Airways Inc.\r"            
## 14 VX      "Virgin America\r"             
## 15 WN      "Southwest Airlines Co.\r"     
## 16 YV      "Mesa Airlines Inc.\r"
( airports_data <- as_tibble(airports$find()) )
## # A tibble: 1,397 x 7
##    faa   name                             lat    lon   alt    tz dst  
##  * <chr> <chr>                          <dbl>  <dbl> <int> <int> <chr>
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A    
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -5 A    
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A    
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A    
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -4 A    
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -4 A    
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A    
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A    
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U    
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A    
## # ... with 1,387 more rows
( flights_data <- as_tibble(flights$find()) )
## # A tibble: 336,776 x 16
##     year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
##  * <int> <int> <int>    <int>     <int>    <int>     <int> <chr>   <chr>  
##  1  2013     1     1      517         2      830        11 UA      N14228 
##  2  2013     1     1      533         4      850        20 UA      N24211 
##  3  2013     1     1      542         2      923        33 AA      N619AA 
##  4  2013     1     1      544        -1     1004       -18 B6      N804JB 
##  5  2013     1     1      554        -6      812       -25 DL      N668DN 
##  6  2013     1     1      554        -4      740        12 UA      N39463 
##  7  2013     1     1      555        -5      913        19 B6      N516JB 
##  8  2013     1     1      557        -3      709       -14 EV      N829AS 
##  9  2013     1     1      557        -3      838        -8 B6      N593JB 
## 10  2013     1     1      558        -2      753         8 AA      N3ALAA 
## # ... with 336,766 more rows, and 7 more variables: flight <int>,
## #   origin <chr>, dest <chr>, air_time <int>, distance <int>, hour <int>,
## #   minute <int>
( planes_data <- as_tibble(planes$find()) )
## # A tibble: 3,322 x 9
##    tailnum  year type      manufacturer  model  engines seats engine speed
##  * <chr>   <int> <chr>     <chr>         <chr>    <int> <int> <chr>  <int>
##  1 N10156   2004 Fixed wi~ EMBRAER       EMB-1~       2    55 Turbo~    NA
##  2 N102UW   1998 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  3 N103US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  4 N104UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  5 N10575   2002 Fixed wi~ EMBRAER       EMB-1~       2    55 Turbo~    NA
##  6 N105UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  7 N107US   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  8 N108UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
##  9 N109UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
## 10 N110UW   1999 Fixed wi~ AIRBUS INDUS~ A320-~       2   182 Turbo~    NA
## # ... with 3,312 more rows
( weather_data <- as_tibble(weather$find()) )
## # A tibble: 8,719 x 14
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##  * <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <int>      <dbl>
##  1 EWR     2013     1     1     0  37.0  21.9  54.0      230      10.4 
##  2 EWR     2013     1     1     1  37.0  21.9  54.0      230      13.8 
##  3 EWR     2013     1     1     2  37.9  21.9  52.1      230      12.7 
##  4 EWR     2013     1     1     3  37.9  23    54.5      230      13.8 
##  5 EWR     2013     1     1     4  37.9  24.1  57.0      240      15.0 
##  6 EWR     2013     1     1     6  39.0  26.1  59.4      270      10.4 
##  7 EWR     2013     1     1     7  39.0  27.0  61.6      250       8.06
##  8 EWR     2013     1     1     8  39.0  28.0  64.4      240      11.5 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      250      12.7 
## 10 EWR     2013     1     1    10  39.0  28.0  64.4      260      12.7 
## # ... with 8,709 more rows, and 4 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>

Next let’s confirm that the data in each MySQL table is identical to the data in each MongoDB collection.

# confirm that collections are identical to the mysql tables
all.equal(temp1[[1]], airlines_data)
## [1] TRUE
all.equal(temp1[[2]], airports_data)
## [1] "Rows in x but not y: 597, 997, 998, 452, 72, 17. Rows in y but not x: 452, 998, 72, 597, 17, 997. "
all.equal(temp1[[3]], flights_data)
## [1] TRUE
all.equal(temp1[[4]], planes_data)
## [1] TRUE
all.equal(temp1[[5]], weather_data)
## [1] "Rows in x but not y: 3438, 5041, 3476, 6781, 4931, 3039, 3896, 5374, 8220, 6961, 145[...]. Rows in y but not x: 7678, 6592, 5456, 6678, 1930, 6625, 2638, 2203, 412, 243, 1043[...]. "

It appears that the data are identical between MySQL and MongoDB for airlines, flights, and planes, but not for airports and weather. For instance, in the airports data, the error message indicates that rows 17, 72, and 452 are not the same. Furthermore, in the weather data, the rows that are not the same varies each time we run the comparison. The last time I ran this, rows 411, 589, and 808 were not the same. Let’s take a closer look at these rows.

# look at some problem rows
length(temp1[[2]]) == length(airports_data)
## [1] TRUE
check1 <- c(17, 72, 452)
temp1[[2]][check1, ]
## # A tibble: 3 x 7
##   faa   name                       lat   lon   alt    tz dst  
##   <chr> <chr>                    <dbl> <dbl> <int> <int> <chr>
## 1 1C9   Frazier Lake Airpark      54.0 -125.   152    -8 A    
## 2 A39   Phoenix Regional Airport  33.0 -112.  1300    -7 A    
## 3 FFZ   Mesa Falcon Field         33.5 -112.  1394    -7 A
airports_data[check1, ]
## # A tibble: 3 x 7
##   faa   name                       lat   lon   alt    tz dst  
##   <chr> <chr>                    <dbl> <dbl> <int> <int> <chr>
## 1 1C9   Frazier Lake Airpark      54.0 -125.   152    -8 A    
## 2 A39   Phoenix Regional Airport  33.0 -112.  1300    -7 A    
## 3 FFZ   Mesa Falcon Field         33.5 -112.  1394    -7 A
temp1[[2]][check1, ] == airports_data[check1, ]
##       faa name   lat   lon  alt   tz  dst
## [1,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [2,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
## [3,] TRUE TRUE FALSE FALSE TRUE TRUE TRUE
# count up number of problem rows
sum1 <- 0
for (k in 1:nrow(airports_data)) {
    sum1 <- sum1 + (temp1[[2]][k, ] != airports_data[k, ])
}
sum1
##      faa name lat lon alt tz dst
## [1,]   0    0   6   6   0  0   0
# look at some problem rows
length(temp1[[5]]) == length(weather_data)
## [1] TRUE
check2 <- c(411, 589, 808) 
temp1[[5]][check2, ]
## # A tibble: 3 x 14
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <int>      <dbl>
## 1 EWR     2013     1    18     3  37.0 23     56.5      360       9.21
## 2 EWR     2013     1    25    13  16.0  3.02  56.0      300       6.90
## 3 EWR     2013     2     3    16  28.0 21.0   74.6      350       8.06
## # ... with 4 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>
weather_data[check2, ]
## # A tibble: 3 x 14
##   origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##   <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <int>      <dbl>
## 1 EWR     2013     1    18     3  37.0 23     56.5      360       9.21
## 2 EWR     2013     1    25    13  16.0  3.02  56.0      300       6.90
## 3 EWR     2013     2     3    16  28.0 21.0   74.6      350       8.06
## # ... with 4 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>
temp1[[5]][check2, ] == weather_data[check2, ]
##      origin year month  day hour temp dewp humid wind_dir wind_speed
## [1,]   TRUE TRUE  TRUE TRUE TRUE TRUE TRUE  TRUE     TRUE       TRUE
## [2,]   TRUE TRUE  TRUE TRUE TRUE TRUE TRUE  TRUE     TRUE       TRUE
## [3,]   TRUE TRUE  TRUE TRUE TRUE TRUE TRUE  TRUE     TRUE       TRUE
##      wind_gust precip pressure visib
## [1,]     FALSE   TRUE     TRUE  TRUE
## [2,]     FALSE   TRUE     TRUE  TRUE
## [3,]     FALSE   TRUE     TRUE  TRUE
# count up number of problem rows
sum2 <- 0
for (k in 1:nrow(weather_data)) {
    sum2 <- sum2 + (temp1[[5]][k, ] != weather_data[k, ])
}
sum2
##      origin year month day hour temp dewp humid wind_dir wind_speed
## [1,]      0    0    NA  NA   NA   NA   NA    NA        0         NA
##      wind_gust precip pressure visib
## [1,]        NA      0       NA     0

Hmmm … this is curious. It appears there are 6 rows (out of 1,397) in the airports data that have discrepancies; the explanation may relate to rounding of the latitude and longitude data (of type double). On the other hand, discrepancies in the weather data may relate to null or NA data, as well as the rounding issue. This will take more time to investigate.

4 Conclusion

4.1 Advantages & disadvantages of MongoDB

Comparing MongoDB and MySQL, it is apparent that they each have their own advantages and disadvantages.

  • Some advantages of MongoDB / NoSQL databases include:

    • Flexibility in data structure: NoSQL databases like MongoDB offer more flexibility in terms of non-structured or semi-structured data, such as text and document data; data of different types can be stored together.
    • Flexibility in data relationships: NoSQL databases don’t require a rigid schema that defines the relationships among the data; relationships embedded in the data can change over time.
    • Scalability, speed, and availability: Although it is of no consequence for this assignment, the literature seems to indicate that MongoDB and other NoSQL databases offer greater scalability, speed, and availability through distributed computing environments, particularly for extremely large databases.
  • Some advantages of MySQL / relational databases include:

    • Familiarity: Given their longer history, SQL databases and languages are established in computer science and business, so can be understood easily by a larger audience.
    • Data validation: Because of the uniformity and consistency enforced by the database schema, SQL databases provide various layers of data validation for structured data (e.g., each column should have the same data type, each row should have the same variable set).
    • Complex joins: SQL databases support complex joins in concise JOIN statements, whereas similar joins in no-SQL environments may require more customized coding by the user, according to the literature.
    • Efficient data structure: SQL databases may have more efficient data storage that avoids redundancy; for instance, in MongoDB, the key name for each key-value pair is saved for each document. For extremely large databases, this redundancy may lead to some inefficiency in performance, according to the literature.

4.2 For further work

It would be good to investigate why the MySQL and MongoDB versions of the airports and weather data are not identical. Possible culprits include:

  • Rounding of numeric data: this is likely the case for the lat and lon data in airports
  • Treatment of null / NA values: this, along with the rounding issue, may explain the discrepancies in weather
  • Non-ordered document insertion: according to the literature, document inserts may be non-sequential when importing large amounts of data into MongoDB collections. I’m not sure if that’s the case here, but it may be a contributing factor.