OVERVIEW

For this assignment, I created a wide table, with the arrival data for AM West and Alaska, in Postgresql for further transformation and analysis.

DATA TRANSFORMATION

1. Load Libraries

library(tidyverse)
## -- Attaching packages ---------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts ------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(DBI)

2. Connecting to Postgresql

Results are hidden as it contains db access info.

3. Listing the table in the db

dbListTables(con)
## [1] "arrivals"

4. Extracting the data from the db

Data <- dbGetQuery(con,"Select * from Arrivals")

5. Putting the data into a dataframe

Df <- data.frame(Data)

6. Converting data from wide to long

df <- Df %>%
  gather(Destination,flights,
         Los.Angeles,
        Phoenix,
        San.Diego,
        San.Francisco,
        Seattle)
df
##    Airline  Delays   Destination flights
## 1   Alaska On-Time   Los.Angeles     497
## 2   Alaska Delayed   Los.Angeles      62
## 3  AM West On time   Los.Angeles     694
## 4  AM West Delayed   Los.Angeles     117
## 5   Alaska On-Time       Phoenix     221
## 6   Alaska Delayed       Phoenix      12
## 7  AM West On time       Phoenix    4840
## 8  AM West Delayed       Phoenix     415
## 9   Alaska On-Time     San.Diego     212
## 10  Alaska Delayed     San.Diego      20
## 11 AM West On time     San.Diego     383
## 12 AM West Delayed     San.Diego      65
## 13  Alaska On-Time San.Francisco     503
## 14  Alaska Delayed San.Francisco     102
## 15 AM West On time San.Francisco     320
## 16 AM West Delayed San.Francisco     129
## 17  Alaska On-Time       Seattle    1841
## 18  Alaska Delayed       Seattle     305
## 19 AM West On time       Seattle     201
## 20 AM West Delayed       Seattle      61

ANALYSIS

On Average, which airline has had more on-time flights?

OT <- df %>%
  group_by(Airline) %>%
  filter(Delays == 'On-Time'|Delays == "On time" ) %>%
  summarize(average_ot=mean(flights))
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(OT)+
  geom_col(mapping=aes(x=Airline,y=average_ot,fill=Airline))+
ggtitle("On Time Flights by Airline")

AM West seems to have more on-time arrivals than Alaska.

Which airline has more on-time arrivals per city?

OTCity <- df %>%
  group_by(Airline,Destination) %>%
  filter(Delays == 'On-Time'|Delays == "On time" ) %>%
  summarize(flights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
ggplot(OTCity)+
  geom_col(mapping=aes(x=Destination,y=flights,fill=Airline))+
ggtitle("On Time Flights by Airline Per City")

Looks like AM West has had more on-time arrivals than Alaska in each city as well,except for in SF and Seattle.

PARADOX

Does AM West have more delays as well?

 df %>%
  group_by(Airline) %>%
  filter(Delays == 'Delayed' ) %>%
  summarize(average_ot=mean(flights))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Airline average_ot
##   <chr>        <dbl>
## 1 Alaska        100.
## 2 AM West       157.

It looks like AM West has more delays as well overall. What about per city:

DelayCity <- df %>%
  group_by(Airline,Destination) %>%
  filter(Delays == 'Delayed' ) %>%
  summarize(flights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
arrange(DelayCity,flights)
## # A tibble: 10 x 3
## # Groups:   Airline [2]
##    Airline Destination   flights
##    <chr>   <chr>           <int>
##  1 Alaska  Phoenix            12
##  2 Alaska  San.Diego          20
##  3 AM West Seattle            61
##  4 Alaska  Los.Angeles        62
##  5 AM West San.Diego          65
##  6 Alaska  San.Francisco     102
##  7 AM West Los.Angeles       117
##  8 AM West San.Francisco     129
##  9 Alaska  Seattle           305
## 10 AM West Phoenix           415

Seems like AM West has more delays per city as well.

Explanation

AM West offers more flights than Alaska, which may explain why it has more of both on-time arrivals and delayed arrivals.

df %>%
  group_by(Airline) %>%
  summarize(ot_flights=sum(flights))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Airline ot_flights
##   <chr>        <int>
## 1 Alaska        3775
## 2 AM West       7225