week 4 Assignment

By Brian Weinfeld

March 02, 2018

Tidying Up

There are a number of transformations that needed to be performed on the original csv file in order to make it tidy. I performed all of these transformations in a row with the pipe (%>%) operator. Below I have seperated the transformations for clarity. The complete, working, original code is below.

wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv', 
                      head=TRUE, stringsAsFactors=FALSE, na.strings='')
kable(wide.data)

X	X.1	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
NA	NA	NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

The original data once read into R has a number of alterations that need to be made. First, I renamed the first two columns to better identify the data and then I filtered out the empty row.

wide.data %<>% rename(company=X, status=X.1) %>%
  filter(status!='')
kable(wide.data)

company	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
NA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
NA	delayed	117	415	65	129	61

There is a helpful method in the ‘zoo’ library named na.locf() that fills in N/A elements in a row by copying forward the previous non-N/A element. This can be used to fill in the missing company names.

wide.data %<>% na.locf()
kable(wide.data)

company	status	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
ALASKA	delayed	62	12	20	102	305
AM WEST	on time	694	4840	383	320	201
AM WEST	delayed	117	415	65	129	61

Next I used gather to move the 5 locations from column headers to rows.

wide.data %<>%  gather('dest', 'count', 3:7)
kable(wide.data)

company	status	dest	count
ALASKA	on time	Los.Angeles	497
ALASKA	delayed	Los.Angeles	62
AM WEST	on time	Los.Angeles	694
AM WEST	delayed	Los.Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San.Diego	212
ALASKA	delayed	San.Diego	20
AM WEST	on time	San.Diego	383
AM WEST	delayed	San.Diego	65
ALASKA	on time	San.Francisco	503
ALASKA	delayed	San.Francisco	102
AM WEST	on time	San.Francisco	320
AM WEST	delayed	San.Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

Then I used spread to move the on time and delayed rows into columns.

wide.data %<>% spread('status', 'count')
kable(wide.data)

company	dest	delayed	on time
ALASKA	Los.Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San.Diego	20	212
ALASKA	San.Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los.Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San.Diego	65	383
AM WEST	San.Francisco	129	320
AM WEST	Seattle	61	201

At this point the data is now tidy, however there are still a number of modifications I would like to make on the data. I replaced all the periods (.) with spaces in order to better format the location names and then I converted all the elements to their appropriate types. Finally, I converted the whole structure into a tibble() to aid in my analysis.

wide.data %<>% map(~str_replace_all(., '\\.', ' ') %>% 
                    type.convert(.)
                  ) %>%
  as.tibble() 
kable(wide.data)

company	dest	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

The original piped code is written below.

wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv', 
                      head=TRUE, stringsAsFactors=FALSE, na.strings='') %>%
  rename(company=X, status=X.1) %>%
  filter(status != '') %>%
  na.locf() %>%
  gather('dest', 'count', 3:7) %>%
  spread('status', 'count') %>%
  map(~str_replace_all(., '\\.', ' ') %>% 
        type.convert(.)
      ) %>%
  as.tibble()

Analysis

I began by comparing the rate of on time arrivals between the two companies for each of their 5 locations.

wide.data %>%
  mutate(prop=`on time`/(`on time` + delayed)) %>%
  ggplot() + 
  geom_histogram(aes(x=dest, y=prop, fill=company), stat='identity', position='dodge') + 
  labs(y='Proportion of On Time Flights', x='Destination', fill='Company', 
       title='Comparison of Rate of on Time Arrivals') +
  coord_flip() +
  guides(fill=guide_legend(reverse=TRUE))

The graph provides clear evidence that ALASKA has a higher proportion of on time arrivals across the board when compared to AM WEST. This would seem to indicate that ALASKA is the company with the better record. However, something odd happens when the total proportion of on time arrivals is compared.

wide.data %>%
  group_by(company) %>%
  summarise(propotion=sum(`on time`) / (sum(delayed) + sum(`on time`)))

## # A tibble: 2 x 2
##   company propotion
##   <fct>       <dbl>
## 1 ALASKA      0.867
## 2 AM WEST     0.891

This summary shows that AM WEST has a higher overall proportion of on time arrivals across all of their flights. This seems contradictory. After all, how could ALASKA have better rates in every head-to-head comparison but still have a worse overall on time rate? The answer lies in the count.

wide.data %>%
  count(company, wt=delayed + `on time`)

## # A tibble: 2 x 2
##   company     n
##   <fct>   <int>
## 1 ALASKA   3775
## 2 AM WEST  7225

It turns out that there is a wide discrepency in the number of flights offered by ALASKA and AM WEST. This is a well examined statistical phenomena where grouping the elements and examining them seperately can result in different answers. This is a result of the vastly different sizes of each one of the counts.

The data is summarized below.

wide.data %>%
  group_by(company) %>%
  summarise(delayed=sum(delayed),
            `on time`=sum(`on time`),
            total=sum(delayed) + sum(`on time`),
            propotion=sum(`on time`) / (sum(delayed) + sum(`on time`))
  )

## # A tibble: 2 x 5
##   company delayed `on time` total propotion
##   <fct>     <int>     <int> <int>     <dbl>
## 1 ALASKA      501      3274  3775     0.867
## 2 AM WEST     787      6438  7225     0.891

In conclusion, ALASKA offered a higher proportion of on time flights for each of the observed destinations when compared to AM WEST. However, the number of flights to each location varies wildly and may affect which company should be selected.