week 4 Assignment

By Brian Weinfeld

March 02, 2018

There are a number of transformations that needed to be performed on the original csv file in order to make it tidy. I performed all of these transformations in a row with the pipe (%>%) operator. Below I have seperated the transformations for clarity. The complete, working, original code is below.

wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv', 
                      head=TRUE, stringsAsFactors=FALSE, na.strings='')
kable(wide.data)
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

The original data once read into R has a number of alterations that need to be made. First, I renamed the first two columns to better identify the data and then I filtered out the empty row.

wide.data %<>% rename(company=X, status=X.1) %>%
  filter(status!='')
kable(wide.data)
company status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61

There is a helpful method in the ‘zoo’ library named na.locf() that fills in N/A elements in a row by copying forward the previous non-N/A element. This can be used to fill in the missing company names.

wide.data %<>% na.locf()
kable(wide.data)
company status Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4840 383 320 201
AM WEST delayed 117 415 65 129 61

Next I used gather to move the 5 locations from column headers to rows.

wide.data %<>%  gather('dest', 'count', 3:7)
kable(wide.data)
company status dest count
ALASKA on time Los.Angeles 497
ALASKA delayed Los.Angeles 62
AM WEST on time Los.Angeles 694
AM WEST delayed Los.Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San.Diego 212
ALASKA delayed San.Diego 20
AM WEST on time San.Diego 383
AM WEST delayed San.Diego 65
ALASKA on time San.Francisco 503
ALASKA delayed San.Francisco 102
AM WEST on time San.Francisco 320
AM WEST delayed San.Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

Then I used spread to move the on time and delayed rows into columns.

wide.data %<>% spread('status', 'count')
kable(wide.data)
company dest delayed on time
ALASKA Los.Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San.Diego 20 212
ALASKA San.Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los.Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San.Diego 65 383
AM WEST San.Francisco 129 320
AM WEST Seattle 61 201

At this point the data is now tidy, however there are still a number of modifications I would like to make on the data. I replaced all the periods (.) with spaces in order to better format the location names and then I converted all the elements to their appropriate types. Finally, I converted the whole structure into a tibble() to aid in my analysis.

wide.data %<>% map(~str_replace_all(., '\\.', ' ') %>% 
                    type.convert(.)
                  ) %>%
  as.tibble() 
kable(wide.data)
company dest delayed on time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

The original piped code is written below.

wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv', 
                      head=TRUE, stringsAsFactors=FALSE, na.strings='') %>%
  rename(company=X, status=X.1) %>%
  filter(status != '') %>%
  na.locf() %>%
  gather('dest', 'count', 3:7) %>%
  spread('status', 'count') %>%
  map(~str_replace_all(., '\\.', ' ') %>% 
        type.convert(.)
      ) %>%
  as.tibble() 

I began by comparing the rate of on time arrivals between the two companies for each of their 5 locations.

wide.data %>%
  mutate(prop=`on time`/(`on time` + delayed)) %>%
  ggplot() + 
  geom_histogram(aes(x=dest, y=prop, fill=company), stat='identity', position='dodge') + 
  labs(y='Proportion of On Time Flights', x='Destination', fill='Company', 
       title='Comparison of Rate of on Time Arrivals') +
  coord_flip() +
  guides(fill=guide_legend(reverse=TRUE))

The graph provides clear evidence that ALASKA has a higher proportion of on time arrivals across the board when compared to AM WEST. This would seem to indicate that ALASKA is the company with the better record. However, something odd happens when the total proportion of on time arrivals is compared.

wide.data %>%
  group_by(company) %>%
  summarise(propotion=sum(`on time`) / (sum(delayed) + sum(`on time`)))
## # A tibble: 2 x 2
##   company propotion
##   <fct>       <dbl>
## 1 ALASKA      0.867
## 2 AM WEST     0.891

This summary shows that AM WEST has a higher overall proportion of on time arrivals across all of their flights. This seems contradictory. After all, how could ALASKA have better rates in every head-to-head comparison but still have a worse overall on time rate? The answer lies in the count.

wide.data %>%
  count(company, wt=delayed + `on time`)
## # A tibble: 2 x 2
##   company     n
##   <fct>   <int>
## 1 ALASKA   3775
## 2 AM WEST  7225

It turns out that there is a wide discrepency in the number of flights offered by ALASKA and AM WEST. This is a well examined statistical phenomena where grouping the elements and examining them seperately can result in different answers. This is a result of the vastly different sizes of each one of the counts.

The data is summarized below.

wide.data %>%
  group_by(company) %>%
  summarise(delayed=sum(delayed),
            `on time`=sum(`on time`),
            total=sum(delayed) + sum(`on time`),
            propotion=sum(`on time`) / (sum(delayed) + sum(`on time`))
  )
## # A tibble: 2 x 5
##   company delayed `on time` total propotion
##   <fct>     <int>     <int> <int>     <dbl>
## 1 ALASKA      501      3274  3775     0.867
## 2 AM WEST     787      6438  7225     0.891

In conclusion, ALASKA offered a higher proportion of on time flights for each of the observed destinations when compared to AM WEST. However, the number of flights to each location varies wildly and may affect which company should be selected.