CUNY SPS DATA-607 Week 12 Assignment

Description

The assignment is to migrate data from a MySQL database to a NoSQL database. I will be using the flights database used in the Hands on Lab 2-1.

Hypothetical Scenario

For this assignment I will pretend I am a data scientist at a ThereOnTime. ThereOnTime is start-up that evaluates airline performance. Users enter in their flight route. ThereOnTime recommends which airline will get the user there on time. ThereOnTime is growing in popularity and needs to scale up their system. As their data scientist, I recommend they migrate from MySQL to MongoDB. For this scenario I will pretend all flights are in the datasets, not only the NYC ones.

Hypothetical Scenario Design

For the End User

ThereOnTime users want a recommendation on which airline that is going to get them from point A to point B on time. They will have varying exclusion criteria (i.e. they refuse to take on Airline X, they won’t fly on puddle jumpers). The way to help them is through the best system design at ThereOnTime.

For the ThereOnTime System

The goal is to provide quick and relevant recommendations. They fit the customer’s unique preferences and criteria. ThereOnTime meets these requirement by allowing filtering of evaluations. The ThereOnTime system looks at weather as a factor to inform their evaluation. They also have a recommender system for alternative routes. The airlines are the principle object of evaluation. The airports, and planes are filtering criteria. This should inform the NoSQL schema design.

Extraction from MySQL Database

I will start by connecting to the MySQL database. I will make use of a MySQL Settings file which hold authentication credentials. It is not included in the GitHub repo for obvious reasons.

library(DBI)
source('hidden/MySQL Settings.R')
conn <- dbConnect(RMySQL::MySQL(), user=user, password=password, dbname='flights', host=host)

I will extract all the tables into their own data frames.

airlines <- dbReadTable(conn, "airlines")
airports <- dbReadTable(conn, "airports")
flights <- dbReadTable(conn, "flights")
planes <- dbReadTable(conn, "planes") 

dbDisconnect(conn)

Data Transformation

I will create the flights data. Since I don’t want to loose any records I will use dplyr’s left_join and right_join in the data pipeline.

library(dplyr)

# Merge together the data frames
data <- flights %>%
  mutate(on_time = ifelse(arr_delay > 0, 0 ,1)) %>%
  merge(airlines) %>%
  rename(carrier_name = name) %>%
  rowwise() %>%
  mutate(carrier_name = gsub("[\r\n]", "", carrier_name)) %>%
  ungroup() %>%
  left_join(planes)

# Hat tip to https://stackoverflow.com/questions/33741357/how-do-i-add-a-prefix-to-several-variable-names-using-dplyr for rename all columns

# Add in the origin airport data
data <- airports %>%
  rename_(.dots = setNames(names(.), paste0('origin_', names(.)))) %>%
  rename(origin = origin_faa) %>%
  right_join(data)

# Add in the destination airport data
data <- airports %>%
  rename_(.dots = setNames(names(.), paste0('dest_', names(.)))) %>%
  rename(dest = dest_faa) %>%
  right_join(data)

Load Data into MongoDB

Now to load the data into a MongoDB. First we need a connection:

library(mongolite)
mdb = mongo(collection = "data", db = "flights")

Now we can load the data into the MongoDB.

mdb$insert(data)
## List of 5
##  $ nInserted  : num 336776
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()

The migration from MySQL to MongoDB is complete.

Database Test Run

To test out the new database I will run a test. Let’s say the user wants to fly to Rochester NY from the NYC area. Let’s suppose they don’t care about which airport they leave from. Which airline and airport would be most likely to get them there on time?

results <- mdb$find('{"dest": "ROC"}') %>%
  group_by(carrier_name, origin_name) %>%
  summarise(on_time_flights = sum(on_time, na.rm=TRUE), total_flights = n()) %>%
  mutate(share_on_time = on_time_flights / total_flights) %>%
  arrange(-share_on_time)
Table 1. Flights from NYC Area to Rochester NY
carrier_name origin_name on_time_flights total_flights share_on_time
Endeavor Air Inc. La Guardia 6 8 0.7500000
Endeavor Air Inc. John F Kennedy Intl 176 273 0.6446886
ExpressJet Airlines Inc. La Guardia 139 216 0.6435185
JetBlue Airways John F Kennedy Intl 836 1406 0.5945946
ExpressJet Airlines Inc. Newark Liberty Intl 256 513 0.4990253

It looks like Endeavor Air Inc out of LaGuardia airport is the best bet.

NoSQL Advantages and Disadvantages

One major difference between a NoSQL and relational database is the schemaless architecture. This is an advantage because it gives a lot of flexibility. It is nice to be able to add or drop data elements as needs change.

It is also a disadvantage because it is very foreign way of thinking about data bases. Determining if you should embed or reference a document is something I don’t have a handle on yet.

Mike Silva

12 November 2018