In this assignment, the following two packages will be used to tidy up, manipulate and transform the data.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Data source
The untidy, wide format data set was generated using Excel and saved to csv called week6_data.csv. The CSV file used in this project is posted to Github in this link
Importing the csv datasource
Importing the week6_data.csv file from my working directory. Of note is that the “Cities” variable values were originally imported as factors. To prevent this, I set the parameter stringAsFactors = FALSE.
mydata = read.csv ("week6_data.csv", stringsAsFactors = FALSE)
Taking a peak at the imported data
After creating the data set in Excel, I want to see if how it displays in R. Additionally, I want to compare it to the original chart data to determine if they are equivalent. It looks like they are equivalent.
Original Chart
Recreated Chart
mydata
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Reviewing the structure of the data
At this point, I wanted to see the structure of the data frame mydata to determine the data types of the variables. I also want to make sure that the character variables were not imported as factor and that my analysis variables are numerical. So far so good… it looks like I can move on to the next stage.
str(mydata)
## 'data.frame': 5 obs. of 7 variables:
## $ X : chr "ALASKA" "" "" "AM WEST" ...
## $ X.1 : chr "on time" "delayed" "" "on time" ...
## $ Los.Angeles : int 497 62 NA 694 117
## $ Phoenix : int 221 12 NA 4840 415
## $ San.Diego : int 212 20 NA 383 65
## $ San.Francisco: int 503 102 NA 320 129
## $ Seattle : int 1841 305 NA 201 61
Removing columns and rows
In the original and imported data set, I noticed that row #3 was full of NAs and therefore not usable. Additionally, I also noticed that the airline names were not carried through all the rows. At this point, I decided that I will remove the rows with NAs and the Airline column or column X.
I could have done several procedures to accomplish this, but I decided to use negative subsetting in a single step.
mydata <- mydata [-3,-1]
mydata
## X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 4 on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
Creating a new column for Airlines
In the prior step I removed the NAs and column X. To replace the values in column X, I created a character vector with the names of the airlines in the order they appear. The new vector will be called airlines with a lowercase a.
airlines <- c("Alaska", "Alaska", "AM WEST", "AM WEST")
airlines
## [1] "Alaska" "Alaska" "AM WEST" "AM WEST"
Adding Airlines column and changing the first column name
In this step, I append the Airlines vector I created earlier to the data frame.
mydata[,"Airlines"] <- airlines
mydata
## X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle Airlines
## 1 on time 497 221 212 503 1841 Alaska
## 2 delayed 62 12 20 102 305 Alaska
## 4 on time 694 4840 383 320 201 AM WEST
## 5 delayed 117 415 65 129 61 AM WEST
str(mydata)
## 'data.frame': 4 obs. of 7 variables:
## $ X.1 : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
## $ Airlines : chr "Alaska" "Alaska" "AM WEST" "AM WEST"
In this step, I modify the column name for the first variable…I did not like the default name of X.1.
colnames(mydata)[1] <- "Arrival"
mydata
## Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle Airlines
## 1 on time 497 221 212 503 1841 Alaska
## 2 delayed 62 12 20 102 305 Alaska
## 4 on time 694 4840 383 320 201 AM WEST
## 5 delayed 117 415 65 129 61 AM WEST
str(mydata)
## 'data.frame': 4 obs. of 7 variables:
## $ Arrival : chr "on time" "delayed" "on time" "delayed"
## $ Los.Angeles : int 497 62 694 117
## $ Phoenix : int 221 12 4840 415
## $ San.Diego : int 212 20 383 65
## $ San.Francisco: int 503 102 320 129
## $ Seattle : int 1841 305 201 61
## $ Airlines : chr "Alaska" "Alaska" "AM WEST" "AM WEST"
At this point, I am done with my basic data cleanup steps and can then move on to make this data tidy.
The goal of this section is to transform the mydata data set into a tidy-compliant data set that plays well with R.
Characteristic of Tidy Data
Coverting the data from a wide to a long format
In the original data set structure, the cities were across several columns. This made our data set wide and not ideal for analysis in R. In the tidy framework, this data set fails characteristic #1. The purpose of this manipulation, using the gather function is to “gather”" the cities from column values to row values.
mydata2 <- gather(mydata,"Cities", "n", 2:6, convert = TRUE)
head(mydata2)
## Arrival Airlines Cities n
## 1 on time Alaska Los.Angeles 497
## 2 delayed Alaska Los.Angeles 62
## 3 on time AM WEST Los.Angeles 694
## 4 delayed AM WEST Los.Angeles 117
## 5 on time Alaska Phoenix 221
## 6 delayed Alaska Phoenix 12
str(mydata2)
## 'data.frame': 20 obs. of 4 variables:
## $ Arrival : chr "on time" "delayed" "on time" "delayed" ...
## $ Airlines: chr "Alaska" "Alaska" "AM WEST" "AM WEST" ...
## $ Cities : chr "Los.Angeles" "Los.Angeles" "Los.Angeles" "Los.Angeles" ...
## $ n : int 497 62 694 117 221 12 4840 415 212 20 ...
Setting up the dataset for further arrival analysis
In the current state, the data does not meet all the requirements for a tidy data set - it fails characteristic #2. We still have multiple rows for the same observation. In this case, this is caused by the Arrival variable. To make this data set tidy and to prep it for further analysis, the unique values in this variable need to become variables.
This is accomplished by the spread function below. This function takes the unique values of the Arrival variable and coverts them into columns. It also re-arranges the analysis variable.
mydata3 <- spread(mydata2, Arrival, n )
head(mydata3)
## Airlines Cities delayed on time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
str(mydata3)
## 'data.frame': 10 obs. of 4 variables:
## $ Airlines: chr "Alaska" "Alaska" "Alaska" "Alaska" ...
## $ Cities : chr "Los.Angeles" "Phoenix" "San.Diego" "San.Francisco" ...
## $ delayed : int 62 12 20 102 305 117 415 65 129 61
## $ on time : int 497 221 212 503 1841 694 4840 383 320 201
In this section, I will perform several analysis of the data set to compare the arrival of both airlines. The goal is to determine which airline performed better on the On Time metric.
As a warm-up, I will calculate the total mean for the delayed variable across both airlines. In this example, I am using the summarize function to summarize the total mean.
tot_mean <- mydata3 %>% summarise(mean_delay=mean(delayed))
tot_mean
## mean_delay
## 1 128.8
Renaming the On Time column
I renamed the On Time column to On_Time to prevent an error that were caused by the time keyword.
colnames(mydata3)[4] <- "On_time"
mydata3
## Airlines Cities delayed On_time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
Okay, enough warm-ups. I now need to perform analysis on the arrival delays for the two airlines and reach a conclusion. For this assignment, I am assuming we need to determine which airline performed better in terms of delays.
Simple Comparison
In comparing delays between these two airlines, I want to know which airline has the highest absolute number of delays. For this analysis, I will group the airlines and then summarize an variable called sum_delayed. This variable is the sum of the delays.
In next step, I will use the arrange function to display the data in descending order using the previously created sum_delayed variable.
mydata5 <- mydata3 %>%
group_by(Airlines) %>%
summarize( sum_delayed = sum(delayed))
arrange(mydata5, desc(sum_delayed))
## Source: local data frame [2 x 2]
##
## Airlines sum_delayed
## (chr) (int)
## 1 AM WEST 787
## 2 Alaska 501
The result of this analysis is that the AM WEST airline has 787 delays vs. 501 for the Alaska airline.
Does this imply that the Alaska airline performed better than AM WEST? Let’s take a look at a few summary statistics for those two airlines.
Summary Statistics
Although the sample is very small, I decided to run a few summary statistics on the grouped airline variables. The grouping of the airline’s variable was done using the group_by function and the summarizing was done by the using the summarize function. The following variables are defined:
mydata6 <- mydata3 %>%
group_by(Airlines) %>%
summarize( max_delayed = max(delayed), mean_delayed = mean(delayed), sd_delayed = sd(delayed))
arrange(mydata6, desc(max_delayed))
## Source: local data frame [2 x 4]
##
## Airlines max_delayed mean_delayed sd_delayed
## (chr) (int) (dbl) (dbl)
## 1 AM WEST 415 157.4 147.1625
## 2 Alaska 305 100.2 120.0175
Base on these summary statistics, I can tell the following about the two airlines:
So, does this mean that the Alaska airline has better On Time performance than the AM WEST airline? Not so fast! we are not normalizing the comparisons so it would not be fair to assume that Alaska has better performance than AM WEST. Although I will not do a full normalization of the variables, I will compare their ratios:
Total delayed flights / Total Flights
This ratio will allow me to better compare the performance of each airline. I will accomplish this in the following steps:
mydata4 <- mydata3 %>%
group_by(Airlines) %>%
summarize(sum_delayed = sum(delayed), sum_On_time = sum(On_time), Total_flights = (sum(delayed) + sum(On_time)))
mydata4 <- mutate(mydata4, ratio = round((sum_delayed / (sum_On_time + sum_delayed)) * 100,2))
arrange(mydata4, desc(ratio))
## Source: local data frame [2 x 5]
##
## Airlines sum_delayed sum_On_time Total_flights ratio
## (chr) (int) (int) (int) (dbl)
## 1 Alaska 501 3274 3775 13.27
## 2 AM WEST 787 6438 7225 10.89
For this assignment, the performance metric was somewhat ambiguous. We could have used any of the measures in the above sections to determine which airline performed better on the delay criteria.
The AM WEST airline showed higher absolute delays as well as higher values in the statistical measures. On the other hand, if I was evaluating the performance of these two airlines, in terms of delays, I would compare the ratio of delays between the two. This would ensure I am comparing “apples-to-apples”.
Based on this measure, I can conclude that the Alaska airline had a greater number of delays 13.27% vs. 10.80% for the AM WEST airline.