Problem Statement

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.


Assignment 12 - MySQL, MongoDB & R


Loading libraries

library("DT")
library(dplyr)
library(plotrix)
library(data.table)
library(RMySQL)
library(DBI)
library(rstudioapi)
require (ggplot2)
library(DBI)

Extraction from MySQL Database

# connecting to the flights database in MySQL

# This line does not prompt for a username and password
con <- dbConnect(MySQL(), user = "root", password = "Chicke0n", port = 3306, dbname = "flights", host = "localhost")

# This line prompts for a username and password instead
#con <- dbConnect(MySQL(), user = showPrompt('Username', "Username", default = "root"), password = askForPassword(), port = 3306, dbname = "flights", host = "localhost")
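
Before reading any tables, it can help to confirm that the connection actually works; a minimal sketch using standard DBI calls against the con object created above:

# sanity-check the connection by listing the tables in the flights schema
dbListTables(con)
# optionally inspect one table's columns before pulling it into R
dbListFields(con, "flights")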

# extract all the tables from the flights database into their own data frames

Converting all database extracts to data frames

airlines <- dbReadTable(con, "airlines")
airports <- dbReadTable(con, "airports")
flights <- dbReadTable(con, "flights")
planes <- dbReadTable(con, "planes")
weather <- dbReadTable(con, "weather")
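
As an alternative to copying whole tables, the filter could be pushed down to MySQL so less data crosses the wire; a small sketch using dbGetQuery (the WHERE clause here just mirrors the early-arrival/early-departure filter applied later and is not part of the original workflow):

# let MySQL do the filtering instead of pulling the full flights table into R
early_flights <- dbGetQuery(con,
    "SELECT * FROM flights WHERE arr_delay < 0 OR dep_delay < 0")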

Viewing the extracted tables

# show the tables retrieved from MySQL
View(airlines)
View(airports)
View(flights)
View(planes)
View(weather)
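
View() opens an interactive viewer and will not render in a knitted report; glimpse() or str() gives a compact, knit-friendly summary of the same data frames (a small sketch):

# compact summaries that show up in the rendered document
glimpse(flights)
str(planes)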

Filtering only on early departure and early arrival flights

data <- flights %>% 
    filter(arr_delay < 0 | dep_delay < 0) %>% # taking only early arrivals & departures
    merge(airlines) %>% 
    rowwise() %>% 
    mutate(carrier_name = gsub("[\r\n]", "", name)) # strip carriage returns from the airline name
# drop the original name column, which contained the carriage returns
drop <- c('name')
data1 <- data[, !(names(data) %in% drop)]

head(data1, n=1)    
## Source: local data frame [1 x 17]
## Groups: <by row>
## 
## # A tibble: 1 x 17
##   carrier  year month   day dep_time dep_delay arr_time arr_delay tailnum
##   <chr>   <int> <int> <int>    <int>     <int>    <int>     <int> <chr>  
## 1 9E       2013    11    18      815         0      943        -9 N929XJ 
## # ... with 8 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   carrier_name <chr>
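
The same wrangling can be written with an explicit dplyr join key and a vectorised gsub, which avoids the rowwise() step; a sketch assuming the same flights and airlines data frames (data1_alt is just an illustrative name):

# equivalent result with an explicit join key; gsub is vectorised, so rowwise() is unnecessary
data1_alt <- flights %>%
    filter(arr_delay < 0 | dep_delay < 0) %>%
    inner_join(airlines, by = "carrier") %>%
    mutate(carrier_name = gsub("[\r\n]", "", name)) %>%
    select(-name)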

Left Join flights to airports

# copy the faa airport code into a dest column so it matches the join key in flights
drop <- c('faa')

airports1 <- airports %>% mutate(dest = faa)

airports1 <- airports1[, !(names(airports1) %in% drop)]

data2 <- data1 %>% 
    left_join(airports1)
## Joining, by = "dest"
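
The faa-to-dest renaming and the join can also be expressed with dplyr::rename, which keeps the join key explicit; a short sketch over the same tables (data2_alt is an illustrative name):

# rename the key column on the fly and state the join key explicitly
data2_alt <- data1 %>%
    left_join(rename(airports, dest = faa), by = "dest")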

Inner join flights to planes & weather

data3 <- data2 %>% 
    inner_join(planes)
## Joining, by = c("year", "tailnum")
data_final <- data3 %>% # Getting only the intersections of all the various tables
    inner_join(weather)
## Joining, by = c("year", "month", "day", "origin", "hour")
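
Note that the default key c("year", "tailnum") matches the flight year against the planes table's year column, which in the usual flights schema records the year of manufacture, so many rows are silently dropped; a hedged sketch joining on tailnum only (assuming that is the intended key):

# join planes by tailnum only; keep planes$year (year built) under a separate name
data3_alt <- data2 %>%
    inner_join(planes, by = "tailnum", suffix = c("", "_built"))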

Load Data into MongoDB (get a connection first)

The following steps worked:

  1. The key was to get MongoDB installed (whether a cloud-based or an on-prem version)

  2. Followed these steps to install MongoDB on Windows:

https://www.guru99.com/installation-configuration-mongodb.html

Starting the MongoDB server

  1. Watch the following YouTube video: https://www.youtube.com/watch?v=JBEKJflNV2g. This video taught me how to complete this project step by step. The key was to have the MongoDB server installed and running (if you have an on-prem version). Once that is done, you can complete the last step of migrating your curated tables from MySQL to MongoDB.
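
Whether the server is local or cloud-hosted, mongolite connects the same way through a connection URL; a minimal sketch using the standard local host and port (these defaults are assumptions about the install, not something specified in the assignment):

library(mongolite)
# explicit connection string; a cloud cluster would use its own mongodb:// or mongodb+srv:// URL
mdb <- mongo(collection = "data_final", db = "flights",
             url = "mongodb://localhost:27017")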



library(mongolite)
#start server running before trying code
#connect to db
mdb = mongo(collection = "data_final", db = "flights")

mdb$insert(data_final)
## List of 5
##  $ nInserted  : num 1327
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
#mdb$remove('{}')
mdb$count()
## [1] 7962
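
The find('{}') call below pulls the entire collection back into R; mongolite can also filter and project on the server, as in this sketch (the field list just mirrors columns inserted above):

# let MongoDB filter and project; only early arrivals and three fields come back
early_docs <- mdb$find(
    query  = '{"arr_delay": {"$lt": 0}}',
    fields = '{"carrier_name": true, "arr_delay": true, "dest": true, "_id": false}'
)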
# for each carrier, keep the 10 rows with the most negative (earliest) arrival delays
results <- mdb$find('{}') %>% 
    group_by(carrier_name, name) %>% 
    top_n(-10, arr_delay)

head(results,n=1)
## # A tibble: 1 x 38
## # Groups:   carrier_name, name [1]
##   carrier  year month   day dep_time dep_delay arr_time arr_delay tailnum
##   <chr>   <int> <int> <int>    <int>     <int>    <int>     <int> <chr>  
## 1 9E       2013    11    15     1131        -4     1320       -18 N295PQ 
## # ... with 29 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   carrier_name <chr>, name <chr>, lat <dbl>, lon <dbl>, alt <int>,
## #   tz <int>, dst <chr>, type <chr>, manufacturer <chr>, model <chr>,
## #   engines <int>, seats <int>, engine <chr>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <int>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>
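
The per-carrier totals behind the plot in the next section could also be computed inside MongoDB with an aggregation pipeline; a sketch using mdb$aggregate (the output column names are assumptions):

# sum the (negative) arrival delays per carrier on the server; most negative = earliest overall
totals <- mdb$aggregate('[
    {"$group": {"_id": "$carrier_name", "total_arr_delay": {"$sum": "$arr_delay"}}},
    {"$sort": {"total_arr_delay": 1}}
]')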

Database Test Run

Which airlines had the best arrival times in 2013, i.e., which were the best of the best on-time performers? The more negative the arrival delay, the earlier the flight arrived.

# bar plot of cumulative early-arrival (negative delay) minutes by carrier
ggplot(data=results, aes(x=carrier_name, y=arr_delay)) +
    geom_bar(colour="black", fill="#DD8888", width=.8, stat="identity") + 
    guides(fill=FALSE) +
    xlab("Airlines") + ylab("Early Arrival Scores") + coord_flip() +
    ggtitle("Airlines with Most Cumulative Early Arrival Occurrences")


Summary Results:

Since we filtered only on early-arrival and early-departure flights, the chart shows only the best-performing airlines of 2013.

Of those early-arrival airlines, United has the largest number of cumulative early arrivals, followed by Alaska Air and JetBlue Airways.
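
The ranking described above can be checked directly by summarising the cumulative early-arrival minutes per carrier; a small sketch over the same results data frame used for the plot:

# total early-arrival minutes per carrier; the most negative total arrived earliest overall
results %>%
    group_by(carrier_name) %>%
    summarise(total_early = sum(arr_delay)) %>%
    arrange(total_early)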


Observation on the differences between traditional SQL and NoSQL:

The flexible, schema-less design of NoSQL is an advantage because it gives a lot of flexibility. It is nice to be able to add or drop data elements as needs change.

On the other hand, NoSQL can be an adjustment for engineers who grew up handling data structures the structured way. There is new syntax to be learned, and that could be a hindrance for the old-school data people.
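
As a small illustration of that flexibility, documents with different fields can be inserted into one collection without changing any schema; a minimal sketch that uses a hypothetical scratch collection (flights_scratch) so the migrated data is left untouched:

# two documents with different fields coexist in the same collection
scratch <- mongo(collection = "flights_scratch", db = "flights")
scratch$insert(data.frame(carrier = "9E", note = "a field added later"))
scratch$insert(data.frame(carrier = "UA", seats = 200))
scratch$find('{}')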