Get Data and Tidy Up

Below code brings in a “wide” dataset from a CSV file in Github, drops row if there’s an ‘na’ in the second column, and fills empty cells in column 1 with prior values.

flights <- read_csv(curl("https://raw.githubusercontent.com/sjv1030/Data607-hw/master/flights.csv"))
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
##   X1 = col_character(),
##   X2 = col_character(),
##   `Los Angeles` = col_integer(),
##   Phoenix = col_integer(),
##   `San Diego` = col_integer(),
##   `San Francisco` = col_integer(),
##   Seattle = col_integer()
## )
head(flights)
## # A tibble: 5 x 7
##        X1      X2 `Los Angeles` Phoenix `San Diego` `San Francisco`
##     <chr>   <chr>         <int>   <int>       <int>           <int>
## 1  Alaska on time           497     221         212             503
## 2    <NA> delayed            62      12          20             102
## 3    <NA>    <NA>            NA      NA          NA              NA
## 4 AM West on time           694    4840         383             320
## 5    <NA> delayed           117     415          65             129
## # ... with 1 more variables: Seattle <int>
flights <- flights %>% drop_na(X2) %>% fill(X1)

Below code “gathers” the data by city and then spreads it by status (“delayed”" or “on_time”“). Mutate adds a total column (”tot“”) that’s used to calculate delayed flights as a percent of total flights (column “pct”). All of this converts the data set from a “wide” format to a “long” format.

df <- flights %>% gather('city','n',3:7) %>% spread(X2,n)
names(df) <- c('airline','city','delayed','on_time')

df <- df %>% mutate(tot = delayed + on_time)
df <- df %>% mutate(pct = round(delayed/tot,4) * 100)

head(df,10)
## # A tibble: 10 x 6
##    airline          city delayed on_time   tot   pct
##      <chr>         <chr>   <int>   <int> <int> <dbl>
##  1  Alaska   Los Angeles      62     497   559 11.09
##  2  Alaska       Phoenix      12     221   233  5.15
##  3  Alaska     San Diego      20     212   232  8.62
##  4  Alaska San Francisco     102     503   605 16.86
##  5  Alaska       Seattle     305    1841  2146 14.21
##  6 AM West   Los Angeles     117     694   811 14.43
##  7 AM West       Phoenix     415    4840  5255  7.90
##  8 AM West     San Diego      65     383   448 14.51
##  9 AM West San Francisco     129     320   449 28.73
## 10 AM West       Seattle      61     201   262 23.28

Interesting Factoids

Which airline had the largest percent of delays and to which city?

df[which.max(df$pct),]
## # A tibble: 1 x 6
##   airline          city delayed on_time   tot   pct
##     <chr>         <chr>   <int>   <int> <int> <dbl>
## 1 AM West San Francisco     129     320   449 28.73

Which airline had the lowest percent of delays and to which city?

df[which.min(df$pct),]
## # A tibble: 1 x 6
##   airline    city delayed on_time   tot   pct
##     <chr>   <chr>   <int>   <int> <int> <dbl>
## 1  Alaska Phoenix      12     221   233  5.15

Show average percent of delays by airline:

f3 <- df %>% group_by(airline) %>% summarise(avg=mean(pct)) %>% arrange(desc(avg))
ggplot(f3,aes(airline,avg))+ggtitle('Average Percent of Delays by Airline')+
        geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')

Show max percent of delays by airline:

f4 <- df %>% group_by(airline) %>% summarise(high=max(pct)) %>% arrange(desc(high))
ggplot(f4,aes(airline,high))+ggtitle('Max Percent of Delays by Airline')+
        geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')

Show max percent of delays by city:

f5 <- df %>% group_by(city) %>% summarise(high=max(pct)) %>% arrange(desc(high))
ggplot(f5,aes(city,high))+ggtitle('Max Percent of Delays by City')+
        geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')

Given the data, we can conclude that Phoenix is the best airport to have a layover in and San Fran is the worst. Additionally, it looks like Alaska airlines is a lot more reliable than AM West.