Problem Statement

For this assignment, you should take information from a relational database and migrate it to a NoSQL database of your own choosing.


Assignment 12 - MySQL, MongoDB & R


Loading libraries

library("DT")
library(dplyr)
library(plotrix)
library(data.table)
library(RMySQL)
library(DBI)
library(rstudioapi)
require (ggplot2)
library(DBI)

Extraction from MySQL Database

# connecting to the flights database in MySQL

# This line does not prompt for a username and password
con <- dbConnect(MySQL(), user = "root", password = "Chicke0n", port = 3306, dbname = "flights", host = "localhost")

# This line prompts for a username and password instead
#con <- dbConnect(MySQL(), user = showPrompt('Username', "Username", default = "root"), password = askForPassword(), port = 3306, dbname = "flights", host = "localhost")
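
Before reading any tables, it can help to confirm that the connection actually works; a minimal sketch using standard DBI calls against the con object created above:

# sanity-check the connection by listing the tables in the flights schema
dbListTables(con)
# optionally inspect one table's columns before pulling it into R
dbListFields(con, "flights")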

# extract all the tables from the flights database into their own data frames

Converting all database extracts to data frames

airlines <- dbReadTable(con, "airlines")
airports <- dbReadTable(con, "airports")
flights <- dbReadTable(con, "flights")
planes <- dbReadTable(con, "planes")
weather <- dbReadTable(con, "weather")
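
As an alternative to copying whole tables, the filter could be pushed down to MySQL so less data crosses the wire; a small sketch using dbGetQuery (the WHERE clause here just mirrors the early-arrival/early-departure filter applied later and is not part of the original workflow):

# let MySQL do the filtering instead of pulling the full flights table into R
early_flights <- dbGetQuery(con,
    "SELECT * FROM flights WHERE arr_delay < 0 OR dep_delay < 0")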

Viewing the extracted tables

# show the tables retrieved from MySQL
View(airlines)
View(airports)
View(flights)
View(planes)
View(weather)
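
View() opens an interactive viewer and will not render in a knitted report; glimpse() or str() gives a compact, knit-friendly summary of the same data frames (a small sketch):

# compact summaries that show up in the rendered document
glimpse(flights)
str(planes)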

Filtering only on early departure and early arrival flights

data <- flights %>% 
    filter(arr_delay < 0 | dep_delay < 0) %>% # taking only early arrivals & departures
    merge(airlines) %>% 
    rowwise() %>% 
    mutate(carrier_name = gsub("[\r\n]", "", name)) # strip carriage returns from the airline name
# drop the original name column, which contained the carriage returns
drop <- c('name')
data1 <- data[, !(names(data) %in% drop)]

head(data1, n=1)    
## Source: local data frame [1 x 17]
## Groups: <by row>
## 
## # A tibble: 1 x 17
##   carrier  year month   day dep_time dep_delay arr_time arr_delay tailnum
##   <chr>   <int> <int> <int>    <int>     <int>    <int>     <int> <chr>  
## 1 9E       2013    11    18      815         0      943        -9 N929XJ 
## # ... with 8 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   carrier_name <chr>
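
The same wrangling can be written with an explicit dplyr join key and a vectorised gsub, which avoids the rowwise() step; a sketch assuming the same flights and airlines data frames (data1_alt is just an illustrative name):

# equivalent result with an explicit join key; gsub is vectorised, so rowwise() is unnecessary
data1_alt <- flights %>%
    filter(arr_delay < 0 | dep_delay < 0) %>%
    inner_join(airlines, by = "carrier") %>%
    mutate(carrier_name = gsub("[\r\n]", "", name)) %>%
    select(-name)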

Left Join flights to airports

# copy the faa airport code into a dest column so it matches the join key in flights
drop <- c('faa')

airports1 <- airports %>% mutate(dest = faa)

airports1 <- airports1[, !(names(airports1) %in% drop)]

data2 <- data1 %>% 
    left_join(airports1)
## Joining, by = "dest"
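
The faa-to-dest renaming and the join can also be expressed with dplyr::rename, which keeps the join key explicit; a short sketch over the same tables (data2_alt is an illustrative name):

# rename the key column on the fly and state the join key explicitly
data2_alt <- data1 %>%
    left_join(rename(airports, dest = faa), by = "dest")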

Inner join flights to planes & weather

data3 <- data2 %>% 
    inner_join(planes)
## Joining, by = c("year", "tailnum")
data_final <- data3 %>% # Getting only the intersections of all the various tables
    inner_join(weather)
## Joining, by = c("year", "month", "day", "origin", "hour")
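
Note that the default key c("year", "tailnum") matches the flight year against the planes table's year column, which in the usual flights schema records the year of manufacture, so many rows are silently dropped; a hedged sketch joining on tailnum only (assuming that is the intended key):

# join planes by tailnum only; keep planes$year (year built) under a separate name
data3_alt <- data2 %>%
    inner_join(planes, by = "tailnum", suffix = c("", "_built"))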

Load Data into MongoDB (get a connection first)

The following steps worked:

  1. The key was to get MongoDB installed (whether a cloud-based or an on-prem version)

  2. Followed these steps to install MongoDB on Windows:

https://www.guru99.com/installation-configuration-mongodb.html

Starting the MongoDB server

  1. Watch the following YouTube video: https://www.youtube.com/watch?v=JBEKJflNV2g. This video taught me how to complete this project step by step. The key was to have the MongoDB server installed and running (if you have an on-prem version). Once that is done, you can complete the last step of migrating your curated tables from MySQL to MongoDB.
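
Whether the server is local or cloud-hosted, mongolite connects the same way through a connection URL; a minimal sketch using the standard local host and port (these defaults are assumptions about the install, not something specified in the assignment):

library(mongolite)
# explicit connection string; a cloud cluster would use its own mongodb:// or mongodb+srv:// URL
mdb <- mongo(collection = "data_final", db = "flights",
             url = "mongodb://localhost:27017")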



library(mongolite)
#start server running before trying code
#connect to db
mdb = mongo(collection = "data_final", db = "flights")

mdb$insert(data_final)
## List of 5
##  $ nInserted  : num 1327
##  $ nMatched   : num 0
##  $ nRemoved   : num 0
##  $ nUpserted  : num 0
##  $ writeErrors: list()
#mdb$remove('{}')
mdb$count()
## [1] 7962
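
The find('{}') call below pulls the entire collection back into R; mongolite can also filter and project on the server, as in this sketch (the field list just mirrors columns inserted above):

# let MongoDB filter and project; only early arrivals and three fields come back
early_docs <- mdb$find(
    query  = '{"arr_delay": {"$lt": 0}}',
    fields = '{"carrier_name": true, "arr_delay": true, "dest": true, "_id": false}'
)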
# for each carrier, keep the 10 rows with the most negative (earliest) arrival delays
results <- mdb$find('{}') %>% 
    group_by(carrier_name, name) %>% 
    top_n(-10, arr_delay)

head(results,n=1)
## # A tibble: 1 x 38
## # Groups:   carrier_name, name [1]
##   carrier  year month   day dep_time dep_delay arr_time arr_delay tailnum
##   <chr>   <int> <int> <int>    <int>     <int>    <int>     <int> <chr>  
## 1 9E       2013    11    15     1131        -4     1320       -18 N295PQ 
## # ... with 29 more variables: flight <int>, origin <chr>, dest <chr>,
## #   air_time <int>, distance <int>, hour <int>, minute <int>,
## #   carrier_name <chr>, name <chr>, lat <dbl>, lon <dbl>, alt <int>,
## #   tz <int>, dst <chr>, type <chr>, manufacturer <chr>, model <chr>,
## #   engines <int>, seats <int>, engine <chr>, temp <dbl>, dewp <dbl>,
## #   humid <dbl>, wind_dir <int>, wind_speed <dbl>, wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>
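
The per-carrier totals behind the plot in the next section could also be computed inside MongoDB with an aggregation pipeline; a sketch using mdb$aggregate (the output column names are assumptions):

# sum the (negative) arrival delays per carrier on the server; most negative = earliest overall
totals <- mdb$aggregate('[
    {"$group": {"_id": "$carrier_name", "total_arr_delay": {"$sum": "$arr_delay"}}},
    {"$sort": {"total_arr_delay": 1}}
]')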

Database Test Run

Which airlines had the best arrival times in 2013, i.e., which were the best of the best on-time performers? The more negative the arrival delay, the earlier the flight arrived.

# bar plot of cumulative early-arrival (negative delay) minutes by carrier
ggplot(data=results, aes(x=carrier_name, y=arr_delay)) +
    geom_bar(colour="black", fill="#DD8888", width=.8, stat="identity") + 
    guides(fill=FALSE) +
    xlab("Airlines") + ylab("Early Arrival Scores") + coord_flip() +
    ggtitle("Airlines with Most Cumulative Early Arrival Occurrences")


Summary Results:

Since we filtered only on early-arrival and early-departure flights, the chart shows only the best-performing airlines of 2013.

Of those early-arrival airlines, United has the largest number of cumulative early arrivals, followed by Alaska Air and JetBlue Airways.
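
The ranking described above can be checked directly by summarising the cumulative early-arrival minutes per carrier; a small sketch over the same results data frame used for the plot:

# total early-arrival minutes per carrier; the most negative total arrived earliest overall
results %>%
    group_by(carrier_name) %>%
    summarise(total_early = sum(arr_delay)) %>%
    arrange(total_early)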


Observation on the differences between traditional SQL and NoSQL:

The flexible, schema-less design of NoSQL is an advantage because it gives a lot of flexibility. It is nice to be able to add or drop data elements as needs change.

On the other hand, NoSQL can be an adjustment for engineers who grew up handling data structures the structured way. There is new syntax to be learned, and that could be a hindrance for the old-school data people.
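
As a small illustration of that flexibility, documents with different fields can be inserted into one collection without changing any schema; a minimal sketch that uses a hypothetical scratch collection (flights_scratch) so the migrated data is left untouched:

# two documents with different fields coexist in the same collection
scratch <- mongo(collection = "flights_scratch", db = "flights")
scratch$insert(data.frame(carrier = "9E", note = "a field added later"))
scratch$insert(data.frame(carrier = "UA", seats = 200))
scratch$find('{}')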