For this assignment, you should take information from a relational database and migrate it to a NoSQL
database of your own choosing.
For the relational database, I chose MySQL flights database
For the NoSQL database, I use MongoDB
library("DT")
library(dplyr)
library(plotrix)
library(data.table)
library(RMySQL)
library(DBI)
library(rstudioapi)
require (ggplot2)
library(DBI)
# connecting to flights database in MYSQL
# This line do not ask for user and password
con <- dbConnect(MySQL(), user ="root", password = "Chicke0n", port = 3306, dbname = "flights", host = "localhost")
# This line ask for user name and password
#con <- dbConnect(MySQL(), user = showPrompt('Username',"Username",default="root"), password = #askForPassword(), port = 3306, dbname = "flights",host = "localhost")
#extract all the tables into their own data frames.from flights database
airlines <- dbReadTable(con, "airlines")
airports <- dbReadTable(con, "airports")
flights <- dbReadTable(con, "flights")
planes <- dbReadTable(con, "planes")
weather <- dbReadTable(con, "weather")
Data Transformation
View the all the related flights tables
# show the tables retreived from MYSQL
View(airlines)
View(airports)
View(flights)
View(planes)
View(weather)
data <- flights %>%
filter( arr_delay <0 | dep_delay <0) %>% # taking only early arrivals & departures
merge(airlines) %>%
rowwise() %>%
mutate(carrier_name = gsub("[\r\n]", "", name))
# drop the columns with return carriage
drop <- c('name')
data1 <- data[,!(names(data) %in% drop)]
head(data1, n=1)
## Source: local data frame [1 x 17]
## Groups: <by row>
##
## # A tibble: 1 x 17
## carrier year month day dep_time dep_delay arr_time arr_delay tailnum
## <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
## 1 9E 2013 11 18 815 0 943 -9 N929XJ
## # ... with 8 more variables: flight <int>, origin <chr>, dest <chr>,
## # air_time <int>, distance <int>, hour <int>, minute <int>,
## # carrier_name <chr>
drop <- c('faa')
airports1 <- airports %>% mutate(dest = faa)
airports1 <- airports1[,!(names(airports1) %in% drop)]
data2 <- data1 %>%
left_join(airports1)
## Joining, by = "dest"
data3 <- data2 %>%
inner_join(planes)
## Joining, by = c("year", "tailnum")
data_final <- data3 %>% # Getting only the intersections of all the various tables
inner_join(weather)
## Joining, by = c("year", "month", "day", "origin", "hour")
worked:
The key was to get MongoDB installed (whether via cloud based on an on-prem version)
Followed these steps in installing MongoDB on Windows;
https://www.guru99.com/installation-configuration-mongodb.html
https://www.youtube.com/watch?v=JBEKJflNV2g. This video taught me how to complete this project step by
step. The key was to have MongoDB server installed and running (if you have an on-prem version). Once
that is done, you can then complete the last step of migrating your curated tables from Mysql to Mongodb
Inserting flights database that I had “filtered” to my liking
Making sure all tables were inserted properly
library(mongolite)
#start server running before trying code
#connect to db
mdb = mongo(collection = "data_final", db = "flights")
mdb$insert(data_final)
## List of 5
## $ nInserted : num 1327
## $ nMatched : num 0
## $ nRemoved : num 0
## $ nUpserted : num 0
## $ writeErrors: list()
#mdb$remove('{}')
mdb$count()
## [1] 7962
results <- mdb$find('{}') %>%
group_by(carrier_name, name) %>%
top_n(-10,arr_delay)
head(results,n=1)
## # A tibble: 1 x 38
## # Groups: carrier_name, name [1]
## carrier year month day dep_time dep_delay arr_time arr_delay tailnum
## <chr> <int> <int> <int> <int> <int> <int> <int> <chr>
## 1 9E 2013 11 15 1131 -4 1320 -18 N295PQ
## # ... with 29 more variables: flight <int>, origin <chr>, dest <chr>,
## # air_time <int>, distance <int>, hour <int>, minute <int>,
## # carrier_name <chr>, name <chr>, lat <dbl>, lon <dbl>, alt <int>,
## # tz <int>, dst <chr>, type <chr>, manufacturer <chr>, model <chr>,
## # engines <int>, seats <int>, engine <chr>, temp <dbl>, dewp <dbl>,
## # humid <dbl>, wind_dir <int>, wind_speed <dbl>, wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>
best arrival time in the past year of 2013, i.e., the best of the best on-time performers?
earlier the flight arrived
#bar plot of Total raw ratings
ggplot(data=results, aes(x=carrier_name, y=arr_delay)) + geom_bar(stat="identity" )+
geom_bar(colour="black", fill="#DD8888", width=.8, stat="identity" ) +
guides(fill=FALSE) +
xlab("Airlines") + ylab("Early Arrival Scores") + coord_flip()+
ggtitle("Airlines with Most Early Cummulative Arrival Occurances")
Summary Results:
Since we filtered only on early arrivals and early depature flights, you are seeing only the best
performance airlines in 2013.
Of the early arrival airlines, United has the most number of cummalative early arrivals followed by
Alaska Air and Jet Blue Airways
Observation on the differences between traditional SQL and NOSQL:
an advantage because it gives a lot of flexibility. It is nice to be able to add or drop data elements
as needs change.
engineers who grew up handling data structures the structured way. There’s new syntax to be learn and
that could be a hinderence for the old school data people.