DATA 607: HW Week 5

1: Create Data File

I used a .csv (created in Excel, then saved as a .csv).

2a: Read the Information Into R

The data is shown as below. I only added two column names where they were missing (“Airline” and “Status”). I also removed the blank row in between the two airline groupings, and repeat the missing values in “Airline”.

  data<-read.csv("C:/Users/Andy/Desktop/Personal/Learning/CUNY/DATA607/Assignment_Week5_data.csv")
  data

##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2  Alaska Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4 AM West Delayed         117     415        65           129      61

2b: Use `tidyr` and `dplyr` to tidy and transform the data

Now that I have the data, I pipe the data to gather the city columns into one column (“City”). Then I reorder the columns before I spread the data to make “On Time” and “Delayed” into separate columns populated by the “Counts” data. Then I change the column name “On Time” to “OnTime”.

Finally, I use mutate to calculate the Total number of flights per route as well as the Delayed Percentage per route for each airline. Then I assign the transformed data to the “final” variable.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#tidy the data
tidy_data<-data %>%
  gather(City,Counts,3:7) %>%
  select(Airline,City,Status,Counts) %>%
  spread(Status,Counts)

#change name
names(tidy_data)[4]<-"OnTime"

#transform the data for analysis
transform_data<-tidy_data %>%
  mutate(Total = Delayed + OnTime) %>%
  mutate(Delay_Perc = Delayed/Total)

#set as final
final<-transform_data
final

##    Airline          City Delayed OnTime Total Delay_Perc
## 1   Alaska   Los.Angeles      62    497   559 0.11091234
## 2   Alaska       Phoenix      12    221   233 0.05150215
## 3   Alaska     San.Diego      20    212   232 0.08620690
## 4   Alaska San.Francisco     102    503   605 0.16859504
## 5   Alaska       Seattle     305   1841  2146 0.14212488
## 6  AM West   Los.Angeles     117    694   811 0.14426634
## 7  AM West       Phoenix     415   4840  5255 0.07897241
## 8  AM West     San.Diego      65    383   448 0.14508929
## 9  AM West San.Francisco     129    320   449 0.28730512
## 10 AM West       Seattle      61    201   262 0.23282443

3: Analysis to Compare the Arrival Delays of the Two Airlines

I suspect a Simpson’s paradox here…

Let’s compare total delay percentages between the two airlines. I use the group_by function to group by Airline before summarizing the data by the Delayed Total Percentage, calculated by summing up the Delayed column and dividing that by the sum of the Total column.

final %>%
  group_by(Airline) %>%
  summarise(DelayedTotal_Perc=sum(Delayed) / sum(Total))

## # A tibble: 2 × 2
##   Airline DelayedTotal_Perc
##    <fctr>             <dbl>
## 1  Alaska         0.1327152
## 2 AM West         0.1089273

We see that Alaska is delayed 13.3% of the time while AM West is delayed almost 10.8% of the time for all flights combined. Now let’s compare each route individually.

final %>%
  select(Airline,City,Delay_Perc) %>%
  spread(Airline,Delay_Perc)

##            City     Alaska    AM West
## 1   Los.Angeles 0.11091234 0.14426634
## 2       Phoenix 0.05150215 0.07897241
## 3     San.Diego 0.08620690 0.14508929
## 4 San.Francisco 0.16859504 0.28730512
## 5       Seattle 0.14212488 0.23282443

Since each route’s delayed percentage has already been calculated, I reorder and limit the data using the select function to the columns that I need (Airline, City, and Delay Percentage). Then I use the spread function to move the Airline values to the columns, which are populated by the Delay Percentage values according to the City.

When looking at each route individually, we see that Alaska always has fewer delays as a percentage than AM West. So as I suspected, this is Simpson’s paradox. When looking at the aggregated delays, Alaska has a greater ratio of delays. However, when looking at each route individually, AM West has a greater ratio of delays for each route.

In short, we have this paradox. So if asked, “which airline has more delays?”, we’d have to clarify the question. Are we concerned with each route individually? If so, AM West has more delays as a percentage per route. Are we concerned with all flights in total, regardless of route? If so, Alaska has more delays as a percentage of total.

Let’s look at the delay percentages and standard deviations.

summary(filter(final, Airline=="Alaska")$Delay_Perc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.05150 0.08621 0.11090 0.11190 0.14210 0.16860

sd(filter(final, Airline=="Alaska")$Delay_Perc)

## [1] 0.04592624

summary(filter(final, Airline=="AM West")$Delay_Perc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.07897 0.14430 0.14510 0.17770 0.23280 0.28730

sd(filter(final, Airline=="AM West")$Delay_Perc)

## [1] 0.08212854

We can see that by this analysis, looking at the route percentages, AM West looks a lot worse. The minimum route delay percentage is 7% compared to 5%, the mean is 18% compared to 11%, and the maximum is 29% compared to 17%. Alaska has a delay percentage standard deviation of 5% while AM West has a standard deviation of 8%. So I suppose if one is looking for a more consistent experience, Alaska is preferrable to AM West.

Who does the most flights? In total, AM West does much more. 7225 compared to 3775.

final %>%
  group_by(Airline) %>%
  summarise(Total=sum(Total))

## # A tibble: 2 × 2
##   Airline Total
##    <fctr> <int>
## 1  Alaska  3775
## 2 AM West  7225

By route, however, Alaska does have more flights to Seattle and San Francisco. The other routes have more by AM West. Phoenix has a lot more.

final %>%
  select(Airline,City,Total) %>%
  spread(Airline,Total)

##            City Alaska AM West
## 1   Los.Angeles    559     811
## 2       Phoenix    233    5255
## 3     San.Diego    232     448
## 4 San.Francisco    605     449
## 5       Seattle   2146     262

Conclusion

It’s hard to say which airline is “better” in terms of delays. Instead, we should be more precise: “better” for what? Overall, arguably AM West. Each route? Alaska. Consistency among routes? Alaska. So I suppose I’d lean more towards Alaska, but neither choice is overwhelming.

DATA 607: HW Week 5

Andrew Carson

September 27, 2016

1: Create Data File

2a: Read the Information Into R

2b: Use `tidyr` and `dplyr` to tidy and transform the data

3: Analysis to Compare the Arrival Delays of the Two Airlines

Conclusion

DATA 607: HW Week 5

Andrew Carson

September 27, 2016

1: Create Data File

2a: Read the Information Into R

2b: Use tidyr and dplyr to tidy and transform the data

3: Analysis to Compare the Arrival Delays of the Two Airlines

Conclusion

2b: Use `tidyr` and `dplyr` to tidy and transform the data