Below code brings in a “wide” dataset from a CSV file in Github, drops row if there’s an ‘na’ in the second column, and fills empty cells in column 1 with prior values.
flights <- read_csv(curl("https://raw.githubusercontent.com/sjv1030/Data607-hw/master/flights.csv"))
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_integer(),
## Phoenix = col_integer(),
## `San Diego` = col_integer(),
## `San Francisco` = col_integer(),
## Seattle = col_integer()
## )
head(flights)
## # A tibble: 5 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco`
## <chr> <chr> <int> <int> <int> <int>
## 1 Alaska on time 497 221 212 503
## 2 <NA> delayed 62 12 20 102
## 3 <NA> <NA> NA NA NA NA
## 4 AM West on time 694 4840 383 320
## 5 <NA> delayed 117 415 65 129
## # ... with 1 more variables: Seattle <int>
flights <- flights %>% drop_na(X2) %>% fill(X1)
Below code “gathers” the data by city and then spreads it by status (“delayed”" or “on_time”“). Mutate adds a total column (”tot“”) that’s used to calculate delayed flights as a percent of total flights (column “pct”). All of this converts the data set from a “wide” format to a “long” format.
df <- flights %>% gather('city','n',3:7) %>% spread(X2,n)
names(df) <- c('airline','city','delayed','on_time')
df <- df %>% mutate(tot = delayed + on_time)
df <- df %>% mutate(pct = round(delayed/tot,4) * 100)
head(df,10)
## # A tibble: 10 x 6
## airline city delayed on_time tot pct
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Alaska Los Angeles 62 497 559 11.09
## 2 Alaska Phoenix 12 221 233 5.15
## 3 Alaska San Diego 20 212 232 8.62
## 4 Alaska San Francisco 102 503 605 16.86
## 5 Alaska Seattle 305 1841 2146 14.21
## 6 AM West Los Angeles 117 694 811 14.43
## 7 AM West Phoenix 415 4840 5255 7.90
## 8 AM West San Diego 65 383 448 14.51
## 9 AM West San Francisco 129 320 449 28.73
## 10 AM West Seattle 61 201 262 23.28
Which airline had the largest percent of delays and to which city?
df[which.max(df$pct),]
## # A tibble: 1 x 6
## airline city delayed on_time tot pct
## <chr> <chr> <int> <int> <int> <dbl>
## 1 AM West San Francisco 129 320 449 28.73
Which airline had the lowest percent of delays and to which city?
df[which.min(df$pct),]
## # A tibble: 1 x 6
## airline city delayed on_time tot pct
## <chr> <chr> <int> <int> <int> <dbl>
## 1 Alaska Phoenix 12 221 233 5.15
Show average percent of delays by airline:
f3 <- df %>% group_by(airline) %>% summarise(avg=mean(pct)) %>% arrange(desc(avg))
ggplot(f3,aes(airline,avg))+ggtitle('Average Percent of Delays by Airline')+
geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')
Show max percent of delays by airline:
f4 <- df %>% group_by(airline) %>% summarise(high=max(pct)) %>% arrange(desc(high))
ggplot(f4,aes(airline,high))+ggtitle('Max Percent of Delays by Airline')+
geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')
Show max percent of delays by city:
f5 <- df %>% group_by(city) %>% summarise(high=max(pct)) %>% arrange(desc(high))
ggplot(f5,aes(city,high))+ggtitle('Max Percent of Delays by City')+
geom_col(fill="#0072B2",colour="black")+theme_minimal()+xlab('')+ylab('')
Given the data, we can conclude that Phoenix is the best airport to have a layover in and San Fran is the worst. Additionally, it looks like Alaska airlines is a lot more reliable than AM West.