Data Shaping using Tidyr

In this section, I chose to use both gather and spread to take my desired table output. To make the data easier to manipulate, I duplicated the airline name in the 2nd and 5th row as the tidyr gather/spread libraries made it easier to reshape the data. Based on this data, I also wanted to answer which airline experiences less delays at a general level as well as a per city level.

csv_url = read.csv(file = "https://raw.githubusercontent.com/dapolloxp/data607/master/assignment5/flight-comp.csv", header = TRUE)

flights.df <- data.frame(csv_url)
# Flight data has a row with NAs. Let's remove this
head(flights.df)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST On time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

# Let's create a function that checks each row for na's. This would be helpful for thousands of rows

# Let's filter out the row with na's

filtered.flights <- drop_na(flights.df)
filtered.flights[2,1] = filtered.flights[1,1]
filtered.flights[4,1] = filtered.flights[3,1]

#
gathered.flights <- gather(filtered.flights , "City", "n", 3:7)

names(gathered.flights)[1] <- "Airline"
names(gathered.flights)[2] <- "Status"

tidy.table <- spread(gathered.flights, "Status", "n")

Data Grouping using dplyr

In this section, I wanted to take the average delay by airlines. On the surface, it appears that AM West has a higher average, 157 vs 100. However, looking at these numbers alone does not tell the full story. We need to look at the ratios of delayed vs ontime.

# Let's compare the average delayed flights between the two airlines
# based on the average delay by airline, it looks like AM West is delayed many more times than Alaska
delayed.by.airline <- tidy.table %>% group_by(Airline) %>% summarise(Avgdelayed = format(mean(delayed), 4)) %>% arrange(desc(Airline))
# Looing at the avgerage ontime by airlines, it appears AM West has many instances where it is ontime more than Alaska. This can be due to AM West having more flights as a whole. Let's compare the ratios.
ontime.by.airline <- tidy.table %>% group_by(Airline) %>% summarise(AvgOntime = format(mean(`On time`), 4)) %>% arrange(desc(Airline))
# change char to double
ontime.by.airline$AvgOntime <- as.numeric(ontime.by.airline$AvgOntime)
delayed.by.airline$Avgdelayed <- as.numeric(delayed.by.airline$Avgdelayed)
bind_cols(ontime.by.airline, delayed.by.airline[,2])

## # A tibble: 2 x 3
##   Airline AvgOntime Avgdelayed
##   <fct>       <dbl>      <dbl>
## 1 AM WEST     1288.       157.
## 2 ALASKA       655.       100.

Flight Analysis

when taking the ratio of delayed vs the total for each airline, we can see that AM West actually experiences a much lower rate of delays vs Alaska: 11% vs 13%, even though AM West has much more volume.

# Let's check the percentage of delayed flights
df<- tidy.table %>% group_by(Airline) %>% summarise(Delayed=sum(delayed),OnTime=sum(`On time`)) %>% mutate(PercentDelayed=round(Delayed/(Delayed+OnTime),2))
df <- df %>% mutate(Total=Delayed+OnTime)
df

## # A tibble: 2 x 5
##   Airline Delayed OnTime PercentDelayed Total
##   <fct>     <int>  <int>          <dbl> <int>
## 1 ALASKA      501   3274           0.13  3775
## 2 AM WEST     787   6438           0.11  7225

paste0("AM West's percentage delayed flights were: ", round(df %>% filter(Airline == 'AM WEST') %>% select(PercentDelayed),2), "%")

## [1] "AM West's percentage delayed flights were: 0.11%"

paste0("Alaska's percentage delayed flights are: ", round(df %>% filter(Airline == 'ALASKA') %>% select(PercentDelayed),2), "%")

## [1] "Alaska's percentage delayed flights are: 0.13%"

Comparing Airlines by City

In this section, I wanted to see how many flights each airline has per city. I chose to graph this as it can visually show which cities have heavier volume and a higher count of delays.

# Let's see how each airline flys by city
ggplot(data=tidy.table, aes(x=Airline, y=`On time` + delayed)) + geom_bar(stat = "identity", aes(fill=City), position ="dodge") + labs(x= "Airline by City", y="Total Flights by City", title = "Airline Total Flights by City")

ggplot(data=tidy.table, aes(x=Airline, y=delayed)) + geom_bar(stat = "identity", aes(fill=City), position ="dodge") + labs(x= "Airline by City", y=" Delayed Flights by City", title = "Airline Total Flight Delays by City")

Within this chart, we can see how each airline compares to each other by city. Overall, the numbers are relatively close for Seattle and San Franciso, but there are gaps in Los Angeles and San Diego. However, even though AM West has a lower percentage delay overall, it still has a higher delay in several cities when compared to Alaska airlines.

tidy.table %>% mutate(Total=delayed +`On time`) %>% mutate(PercentDelayed=round(delayed/Total,2))

##    Airline          City delayed On time Total PercentDelayed
## 1   ALASKA   Los.Angeles      62     497   559           0.11
## 2   ALASKA       Phoenix      12     221   233           0.05
## 3   ALASKA     San.Diego      20     212   232           0.09
## 4   ALASKA San.Francisco     102     503   605           0.17
## 5   ALASKA       Seattle     305    1841  2146           0.14
## 6  AM WEST   Los.Angeles     117     694   811           0.14
## 7  AM WEST       Phoenix     415    4840  5255           0.08
## 8  AM WEST     San.Diego      65     383   448           0.15
## 9  AM WEST San.Francisco     129     320   449           0.29
## 10 AM WEST       Seattle      61     201   262           0.23

#tidy.table %>% group_by(City) %>% summarise(Avgdelayed = format(mean(delayed), 4)) %>% arrange(desc(Avgdelayed))

Conclusion

Analyzing data at an aggregate level does not always tell the complete story. It is important to break data into much more granular levels to uncover hidden information that may not be obvious within a summary level. At a wholistic level, AM West has a slightly lower delay rate vs Alaska, however, AM West also has much more volume, 7225 vs 3775, almost double that of Alaska. However, Alaska experiences a much lower delay rate per city vs AM West. The question remains however, if Alaska had the same volume as AM West, would the numbers be closer?

Assignment 5 - Using Tidyr and Dplyr

David Apolinar

3/2/2019

Data Shaping using Tidyr

Data Grouping using dplyr

Flight Analysis

Comparing Airlines by City

Conclusion