There are a number of transformations that needed to be performed on the original csv file in order to make it tidy. I performed all of these transformations in a row with the pipe (%>%) operator. Below I have seperated the transformations for clarity. The complete, working, original code is below.
wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv',
head=TRUE, stringsAsFactors=FALSE, na.strings='')
kable(wide.data)
| X | X.1 | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
The original data once read into R has a number of alterations that need to be made. First, I renamed the first two columns to better identify the data and then I filtered out the empty row.
wide.data %<>% rename(company=X, status=X.1) %>%
filter(status!='')
kable(wide.data)
| company | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
There is a helpful method in the ‘zoo’ library named na.locf() that fills in N/A elements in a row by copying forward the previous non-N/A element. This can be used to fill in the missing company names.
wide.data %<>% na.locf()
kable(wide.data)
| company | status | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Next I used gather to move the 5 locations from column headers to rows.
wide.data %<>% gather('dest', 'count', 3:7)
kable(wide.data)
| company | status | dest | count |
|---|---|---|---|
| ALASKA | on time | Los.Angeles | 497 |
| ALASKA | delayed | Los.Angeles | 62 |
| AM WEST | on time | Los.Angeles | 694 |
| AM WEST | delayed | Los.Angeles | 117 |
| ALASKA | on time | Phoenix | 221 |
| ALASKA | delayed | Phoenix | 12 |
| AM WEST | on time | Phoenix | 4840 |
| AM WEST | delayed | Phoenix | 415 |
| ALASKA | on time | San.Diego | 212 |
| ALASKA | delayed | San.Diego | 20 |
| AM WEST | on time | San.Diego | 383 |
| AM WEST | delayed | San.Diego | 65 |
| ALASKA | on time | San.Francisco | 503 |
| ALASKA | delayed | San.Francisco | 102 |
| AM WEST | on time | San.Francisco | 320 |
| AM WEST | delayed | San.Francisco | 129 |
| ALASKA | on time | Seattle | 1841 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | on time | Seattle | 201 |
| AM WEST | delayed | Seattle | 61 |
Then I used spread to move the on time and delayed rows into columns.
wide.data %<>% spread('status', 'count')
kable(wide.data)
| company | dest | delayed | on time |
|---|---|---|---|
| ALASKA | Los.Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San.Diego | 20 | 212 |
| ALASKA | San.Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los.Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San.Diego | 65 | 383 |
| AM WEST | San.Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
At this point the data is now tidy, however there are still a number of modifications I would like to make on the data. I replaced all the periods (.) with spaces in order to better format the location names and then I converted all the elements to their appropriate types. Finally, I converted the whole structure into a tibble() to aid in my analysis.
wide.data %<>% map(~str_replace_all(., '\\.', ' ') %>%
type.convert(.)
) %>%
as.tibble()
kable(wide.data)
| company | dest | delayed | on time |
|---|---|---|---|
| ALASKA | Los Angeles | 62 | 497 |
| ALASKA | Phoenix | 12 | 221 |
| ALASKA | San Diego | 20 | 212 |
| ALASKA | San Francisco | 102 | 503 |
| ALASKA | Seattle | 305 | 1841 |
| AM WEST | Los Angeles | 117 | 694 |
| AM WEST | Phoenix | 415 | 4840 |
| AM WEST | San Diego | 65 | 383 |
| AM WEST | San Francisco | 129 | 320 |
| AM WEST | Seattle | 61 | 201 |
The original piped code is written below.
wide.data <- read.csv('https://raw.githubusercontent.com/brian-cuny/607assignment4/master/week4Assignment.csv',
head=TRUE, stringsAsFactors=FALSE, na.strings='') %>%
rename(company=X, status=X.1) %>%
filter(status != '') %>%
na.locf() %>%
gather('dest', 'count', 3:7) %>%
spread('status', 'count') %>%
map(~str_replace_all(., '\\.', ' ') %>%
type.convert(.)
) %>%
as.tibble()
I began by comparing the rate of on time arrivals between the two companies for each of their 5 locations.
wide.data %>%
mutate(prop=`on time`/(`on time` + delayed)) %>%
ggplot() +
geom_histogram(aes(x=dest, y=prop, fill=company), stat='identity', position='dodge') +
labs(y='Proportion of On Time Flights', x='Destination', fill='Company',
title='Comparison of Rate of on Time Arrivals') +
coord_flip() +
guides(fill=guide_legend(reverse=TRUE))
The graph provides clear evidence that ALASKA has a higher proportion of on time arrivals across the board when compared to AM WEST. This would seem to indicate that ALASKA is the company with the better record. However, something odd happens when the total proportion of on time arrivals is compared.
wide.data %>%
group_by(company) %>%
summarise(propotion=sum(`on time`) / (sum(delayed) + sum(`on time`)))
## # A tibble: 2 x 2
## company propotion
## <fct> <dbl>
## 1 ALASKA 0.867
## 2 AM WEST 0.891
This summary shows that AM WEST has a higher overall proportion of on time arrivals across all of their flights. This seems contradictory. After all, how could ALASKA have better rates in every head-to-head comparison but still have a worse overall on time rate? The answer lies in the count.
wide.data %>%
count(company, wt=delayed + `on time`)
## # A tibble: 2 x 2
## company n
## <fct> <int>
## 1 ALASKA 3775
## 2 AM WEST 7225
It turns out that there is a wide discrepency in the number of flights offered by ALASKA and AM WEST. This is a well examined statistical phenomena where grouping the elements and examining them seperately can result in different answers. This is a result of the vastly different sizes of each one of the counts.
The data is summarized below.
wide.data %>%
group_by(company) %>%
summarise(delayed=sum(delayed),
`on time`=sum(`on time`),
total=sum(delayed) + sum(`on time`),
propotion=sum(`on time`) / (sum(delayed) + sum(`on time`))
)
## # A tibble: 2 x 5
## company delayed `on time` total propotion
## <fct> <int> <int> <int> <dbl>
## 1 ALASKA 501 3274 3775 0.867
## 2 AM WEST 787 6438 7225 0.891
In conclusion, ALASKA offered a higher proportion of on time flights for each of the observed destinations when compared to AM WEST. However, the number of flights to each location varies wildly and may affect which company should be selected.