Week 5 Assignment - Tidying and Transforming Data

MYSQL Database integration with R

The MYSQL database table created for this assignment contains arrival delays for two airlines across five destinations

Additional packages installed

Please note below additional packages installed. The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats. Package ‘conflicted’ solved an issue with the rename func as it pertains to knit/html.

install.packages(“formattable”) install.packages(“conflicted”)

library(tidyverse)

## -- Attaching packages ------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ---------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(RMySQL)

## Loading required package: DBI

library(formattable)

##DBconnect was utilized in this assignment to authenticate/pull in the flight delay information into R
my.database = dbConnect(MySQL(), user='root', password = 'Password1', dbname='flightsandstatus', host='localhost')
dbListTables(my.database)

## [1] "airlines_info1"

Using fetch function we are able to select all the row from our flightsandstatus database [table = airlines_info1]

airlines_status <- fetch(dbSendQuery(my.database, "SELECT * FROM airlines_info1 ORDER BY ID"))
dim(airlines_status)

## [1] 4 8

colnames(airlines_status)

## [1] "ID"                  "Airline"             "Status"             
## [4] "cities_losangeles"   "cities_phoenix"      "cities_sandiego"    
## [7] "cities_sanfrancisco" "cities_seattle"

summary(airlines_status)

##        ID         Airline             Status          cities_losangeles
##  Min.   :1.00   Length:4           Length:4           Min.   : 62.0    
##  1st Qu.:1.75   Class :character   Class :character   1st Qu.:103.2    
##  Median :2.50   Mode  :character   Mode  :character   Median :307.0    
##  Mean   :2.50                                         Mean   :342.5    
##  3rd Qu.:3.25                                         3rd Qu.:546.2    
##  Max.   :4.00                                         Max.   :694.0    
##  cities_phoenix   cities_sandiego  cities_sanfrancisco cities_seattle
##  Min.   :  12.0   Min.   : 20.00   Min.   :102.0       Min.   :  61  
##  1st Qu.: 168.8   1st Qu.: 53.75   1st Qu.:122.2       1st Qu.: 166  
##  Median : 318.0   Median :138.50   Median :224.5       Median : 253  
##  Mean   :1372.0   Mean   :170.00   Mean   :263.5       Mean   : 602  
##  3rd Qu.:1521.2   3rd Qu.:254.75   3rd Qu.:365.8       3rd Qu.: 689  
##  Max.   :4840.0   Max.   :383.00   Max.   :503.0       Max.   :1841

Experimented with the tibble func below for the first time. Tibbles have a refined print method that shows only the first 10 rows, and all the columns that fit on screen. This makes it much easier to work with large data.

airline_statusinfo <- as_tibble(airlines_status)
airline_statusinfo

## # A tibble: 4 x 8
##      ID Airline Status cities_losangel~ cities_phoenix cities_sandiego
##   <int> <chr>   <chr>             <int>          <int>           <int>
## 1     1 ALASKA  on ti~              497            221             212
## 2     2 ALASKA  delay~               62             12              20
## 3     3 AM WEST on ti~              694           4840             383
## 4     4 AM WEST delay~              117            415              65
## # ... with 2 more variables: cities_sanfrancisco <int>, cities_seattle <int>

Note the use of the conflicted library below was helpful in tackling a difficult issue of knitting to html with the rename func used in our R code. Similarly from assignment 2, it is ideal for us to rename often time certain data which is not easy to understand/or to better the appearance of certain data such as columns. Also note the dplyr::rename being utilized in this scenario.

library(conflicted)
library(dplyr)

airlinestatusinfo9 <- dplyr::rename(airline_statusinfo,"Los Angeles"="cities_losangeles",
          "Phoenix" = "cities_phoenix",
          "San Diego" = "cities_sandiego",
          "San Francisco" = "cities_sanfrancisco",
          "Seattle" = "cities_seattle"
)

print(airlinestatusinfo9)

## # A tibble: 4 x 8
##      ID Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
##   <int> <chr>   <chr>          <int>   <int>       <int>           <int>   <int>
## 1     1 ALASKA  on ti~           497     221         212             503    1841
## 2     2 ALASKA  delay~            62      12          20             102     305
## 3     3 AM WEST on ti~           694    4840         383             320     201
## 4     4 AM WEST delay~           117     415          65             129      61

Data Tidying

Note below we are using the gather, select and spread func to tidy/better organize the flights info to later on be able to manipulate and do calculations. We were encouraged to use a ‘wide’ structure so that we are later able to transform it.

airline_statusinfo8 <- airlinestatusinfo9 %>% 
  gather(City, count, -ID, -Airline, -Status) %>% 
  select(Airline, Status, City, count) %>% 
  spread(Status, count)

  
print(airline_statusinfo8)

## # A tibble: 10 x 4
##    Airline City          delayed `on time`
##    <chr>   <chr>           <int>     <int>
##  1 ALASKA  Los Angeles        62       497
##  2 ALASKA  Phoenix            12       221
##  3 ALASKA  San Diego          20       212
##  4 ALASKA  San Francisco     102       503
##  5 ALASKA  Seattle           305      1841
##  6 AM WEST Los Angeles       117       694
##  7 AM WEST Phoenix           415      4840
##  8 AM WEST San Diego          65       383
##  9 AM WEST San Francisco     129       320
## 10 AM WEST Seattle            61       201

The formattable package is used to transform vectors and data frames into more readable and impactful tabular formats.

formattable(airline_statusinfo8)

Airline	City	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

Data Analysis

Perform analysis to compare the arrival delays for the two airlines.

airline_statusinfo7 <- airline_statusinfo8 %>%
  dplyr::mutate(
         ontime_rate = `on time` / (delayed + `on time`),
         delay_rate = delayed / (delayed + `on time`),
         flights = delayed + `on time`,
         percent_on_time = round(`on time`/flights*100, 2),
         
         
  )
formattable(airline_statusinfo7)

Airline	City	delayed	on time	ontime_rate	delay_rate	flights	percent_on_time
ALASKA	Los Angeles	62	497	0.8890877	0.11091234	559	88.91
ALASKA	Phoenix	12	221	0.9484979	0.05150215	233	94.85
ALASKA	San Diego	20	212	0.9137931	0.08620690	232	91.38
ALASKA	San Francisco	102	503	0.8314050	0.16859504	605	83.14
ALASKA	Seattle	305	1841	0.8578751	0.14212488	2146	85.79
AM WEST	Los Angeles	117	694	0.8557337	0.14426634	811	85.57
AM WEST	Phoenix	415	4840	0.9210276	0.07897241	5255	92.10
AM WEST	San Diego	65	383	0.8549107	0.14508929	448	85.49
AM WEST	San Francisco	129	320	0.7126949	0.28730512	449	71.27
AM WEST	Seattle	61	201	0.7671756	0.23282443	262	76.72

Using ggplot2 to gather the delay information based on airline

library(ggplot2)
Alaska_delays<-dplyr::filter(airline_statusinfo7, Airline=="ALASKA")
Alaska_delays

## # A tibble: 5 x 8
##   Airline City  delayed `on time` ontime_rate delay_rate flights percent_on_time
##   <chr>   <chr>   <int>     <int>       <dbl>      <dbl>   <int>           <dbl>
## 1 ALASKA  Los ~      62       497       0.889     0.111      559            88.9
## 2 ALASKA  Phoe~      12       221       0.948     0.0515     233            94.8
## 3 ALASKA  San ~      20       212       0.914     0.0862     232            91.4
## 4 ALASKA  San ~     102       503       0.831     0.169      605            83.1
## 5 ALASKA  Seat~     305      1841       0.858     0.142     2146            85.8

AMWEST_delays<-dplyr::filter(airline_statusinfo7, Airline=="AM WEST")
AMWEST_delays

## # A tibble: 5 x 8
##   Airline City  delayed `on time` ontime_rate delay_rate flights percent_on_time
##   <chr>   <chr>   <int>     <int>       <dbl>      <dbl>   <int>           <dbl>
## 1 AM WEST Los ~     117       694       0.856     0.144      811            85.6
## 2 AM WEST Phoe~     415      4840       0.921     0.0790    5255            92.1
## 3 AM WEST San ~      65       383       0.855     0.145      448            85.5
## 4 AM WEST San ~     129       320       0.713     0.287      449            71.3
## 5 AM WEST Seat~      61       201       0.767     0.233      262            76.7

compare_data <- airline_statusinfo7 %>% 
  group_by(Airline) %>%
  dplyr::summarize(OnTime = sum(`on time`), Delayed = sum(delayed), 
            PercentOnTime = round(OnTime/sum(OnTime,Delayed),2))%>%
  select(Airline, PercentOnTime)

## `summarise()` ungrouping output (override with `.groups` argument)

compare_data %>% dplyr::arrange(desc(PercentOnTime))

## # A tibble: 2 x 2
##   Airline PercentOnTime
##   <chr>           <dbl>
## 1 AM WEST          0.89
## 2 ALASKA           0.87

 ggplot(airline_statusinfo7, aes(x = Airline, y=delay_rate, fill = City)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("Delays")

 ggplot(airline_statusinfo7, aes(x = Airline, y=ontime_rate, fill = City)) +  
          geom_bar(stat="identity",position="dodge") + 
          xlab("Airlines") + ylab("On time rate")

In conclusion

in conclusion we can witness on the chart how AM WEST has slightly higher delay percentage than Alaska. Also we see how ontime rate is very similar/close for both airlines.