MYSQL Database integration with R

The MYSQL database table created for this assignment contains arrival delays for two airlines across five destinations

Additional packages installed

Please note below additional packages installed. The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats. Package ‘conflicted’ solved an issue with the rename func as it pertains to knit/html.

install.packages(“formattable”) install.packages(“conflicted”)

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(RMySQL)
## Loading required package: DBI
library(formattable)
##DBconnect was utilized in this assignment to authenticate/pull in the flight delay information into R
my.database = dbConnect(MySQL(), user='root', password = 'Password1', dbname='flightsandstatus', host='localhost')
dbListTables(my.database)
## [1] "airlines_info1"

Using fetch function we are able to select all the row from our flightsandstatus database [table = airlines_info1]

airlines_status <- fetch(dbSendQuery(my.database, "SELECT * FROM airlines_info1 ORDER BY ID"))
dim(airlines_status)
## [1] 4 8
colnames(airlines_status)
## [1] "ID"                  "Airline"             "Status"             
## [4] "cities_losangeles"   "cities_phoenix"      "cities_sandiego"    
## [7] "cities_sanfrancisco" "cities_seattle"
summary(airlines_status)
##        ID         Airline             Status          cities_losangeles
##  Min.   :1.00   Length:4           Length:4           Min.   : 62.0    
##  1st Qu.:1.75   Class :character   Class :character   1st Qu.:103.2    
##  Median :2.50   Mode  :character   Mode  :character   Median :307.0    
##  Mean   :2.50                                         Mean   :342.5    
##  3rd Qu.:3.25                                         3rd Qu.:546.2    
##  Max.   :4.00                                         Max.   :694.0    
##  cities_phoenix   cities_sandiego  cities_sanfrancisco cities_seattle
##  Min.   :  12.0   Min.   : 20.00   Min.   :102.0       Min.   :  61  
##  1st Qu.: 168.8   1st Qu.: 53.75   1st Qu.:122.2       1st Qu.: 166  
##  Median : 318.0   Median :138.50   Median :224.5       Median : 253  
##  Mean   :1372.0   Mean   :170.00   Mean   :263.5       Mean   : 602  
##  3rd Qu.:1521.2   3rd Qu.:254.75   3rd Qu.:365.8       3rd Qu.: 689  
##  Max.   :4840.0   Max.   :383.00   Max.   :503.0       Max.   :1841

Experimented with the tibble func below for the first time. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.

airline_statusinfo <- as_tibble(airlines_status)
airline_statusinfo
## # A tibble: 4 x 8
##      ID Airline Status cities_losangel~ cities_phoenix cities_sandiego
##   <int> <chr>   <chr>             <int>          <int>           <int>
## 1     1 ALASKA  on ti~              497            221             212
## 2     2 ALASKA  delay~               62             12              20
## 3     3 AM WEST on ti~              694           4840             383
## 4     4 AM WEST delay~              117            415              65
## # ... with 2 more variables: cities_sanfrancisco <int>, cities_seattle <int>

Note the use of the conflicted library below was helpful in tackling a difficult issue of knitting to html with the rename func used in our R code. Similarly from assignment 2, it is ideal for us to rename often time certain data which is not easy to understand/or to better the appearance of certain data such as columns. Also note the dplyr::rename being utilized in this scenario.

library(conflicted)
library(dplyr)

airlinestatusinfo9 <- dplyr::rename(airline_statusinfo,"Los Angeles"="cities_losangeles",
          "Phoenix" = "cities_phoenix",
          "San Diego" = "cities_sandiego",
          "San Francisco" = "cities_sanfrancisco",
          "Seattle" = "cities_seattle"
)

print(airlinestatusinfo9)
## # A tibble: 4 x 8
##      ID Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <int> <chr>   <chr>          <int>   <int>       <int>           <int>   <int>
## 1     1 ALASKA  on ti~           497     221         212             503    1841
## 2     2 ALASKA  delay~            62      12          20             102     305
## 3     3 AM WEST on ti~           694    4840         383             320     201
## 4     4 AM WEST delay~           117     415          65             129      61

Data Tidying

Note below we are using the gather, select and spread func to tidy/better organize the flights info to later on be able to manipulate and do calculations. We were encouraged to use a ‘wide’ structure so that we are later able to transform it.

airline_statusinfo8 <- airlinestatusinfo9 %>% 
  gather(City, count, -ID, -Airline, -Status) %>% 
  select(Airline, Status, City, count) %>% 
  spread(Status, count)

  
print(airline_statusinfo8)
## # A tibble: 10 x 4
##    Airline City          delayed `on time`
##    <chr>   <chr>           <int>     <int>
##  1 ALASKA  Los Angeles        62       497
##  2 ALASKA  Phoenix            12       221
##  3 ALASKA  San Diego          20       212
##  4 ALASKA  San Francisco     102       503
##  5 ALASKA  Seattle           305      1841
##  6 AM WEST Los Angeles       117       694
##  7 AM WEST Phoenix           415      4840
##  8 AM WEST San Diego          65       383
##  9 AM WEST San Francisco     129       320
## 10 AM WEST Seattle            61       201

The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats.

formattable(airline_statusinfo8)
Airline City delayed on time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

Data Analysis

Perform analysis to compare the arrival delays for the two airlines.

airline_statusinfo7 <- airline_statusinfo8 %>%
  dplyr::mutate(
         ontime_rate = `on time` / (delayed + `on time`),
         delay_rate = delayed / (delayed + `on time`),
         flights = delayed + `on time`,
         percent_on_time = round(`on time`/flights*100, 2),
         
         
  )
formattable(airline_statusinfo7)
Airline City delayed on time ontime_rate delay_rate flights percent_on_time
ALASKA Los Angeles 62 497 0.8890877 0.11091234 559 88.91
ALASKA Phoenix 12 221 0.9484979 0.05150215 233 94.85
ALASKA San Diego 20 212 0.9137931 0.08620690 232 91.38
ALASKA San Francisco 102 503 0.8314050 0.16859504 605 83.14
ALASKA Seattle 305 1841 0.8578751 0.14212488 2146 85.79
AM WEST Los Angeles 117 694 0.8557337 0.14426634 811 85.57
AM WEST Phoenix 415 4840 0.9210276 0.07897241 5255 92.10
AM WEST San Diego 65 383 0.8549107 0.14508929 448 85.49
AM WEST San Francisco 129 320 0.7126949 0.28730512 449 71.27
AM WEST Seattle 61 201 0.7671756 0.23282443 262 76.72

Using ggplot2 to gather the delay information based on airline

library(ggplot2)
Alaska_delays<-dplyr::filter(airline_statusinfo7, Airline=="ALASKA")
Alaska_delays
## # A tibble: 5 x 8
##   Airline City  delayed `on time` ontime_rate delay_rate flights percent_on_time
##   <chr>   <chr>   <int>     <int>       <dbl>      <dbl>   <int>           <dbl>
## 1 ALASKA  Los ~      62       497       0.889     0.111      559            88.9
## 2 ALASKA  Phoe~      12       221       0.948     0.0515     233            94.8
## 3 ALASKA  San ~      20       212       0.914     0.0862     232            91.4
## 4 ALASKA  San ~     102       503       0.831     0.169      605            83.1
## 5 ALASKA  Seat~     305      1841       0.858     0.142     2146            85.8
AMWEST_delays<-dplyr::filter(airline_statusinfo7, Airline=="AM WEST")
AMWEST_delays
## # A tibble: 5 x 8
##   Airline City  delayed `on time` ontime_rate delay_rate flights percent_on_time
##   <chr>   <chr>   <int>     <int>       <dbl>      <dbl>   <int>           <dbl>
## 1 AM WEST Los ~     117       694       0.856     0.144      811            85.6
## 2 AM WEST Phoe~     415      4840       0.921     0.0790    5255            92.1
## 3 AM WEST San ~      65       383       0.855     0.145      448            85.5
## 4 AM WEST San ~     129       320       0.713     0.287      449            71.3
## 5 AM WEST Seat~      61       201       0.767     0.233      262            76.7
compare_data <- airline_statusinfo7 %>% 
  group_by(Airline) %>%
  dplyr::summarize(OnTime = sum(`on time`), Delayed = sum(delayed), 
            PercentOnTime = round(OnTime/sum(OnTime,Delayed),2))%>%
  select(Airline, PercentOnTime)
## `summarise()` ungrouping output (override with `.groups` argument)
compare_data %>% dplyr::arrange(desc(PercentOnTime))
## # A tibble: 2 x 2
##   Airline PercentOnTime
##   <chr>           <dbl>
## 1 AM WEST          0.89
## 2 ALASKA           0.87
 ggplot(airline_statusinfo7, aes(x = Airline, y=delay_rate, fill = City)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Delays") 

 ggplot(airline_statusinfo7, aes(x = Airline, y=ontime_rate, fill = City)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("On time rate") 

In conclusion

in conclusion we can witness on the chart how AM WEST has slightly higher delay percentage than Alaska. Also we see how ontime rate is very similar/close for both airlines.