1 Packages

In this assignment, the following two packages will be used to tidy up, manipulate and transform the data.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2 CSV Data

Data source

The untidy, wide format data set was generated using Excel and saved to csv called week6_data.csv. The CSV file used in this project is posted to Github in this link

Importing the csv datasource

Importing the week6_data.csv file from my working directory. Of note is that the “Cities” variable values were originally imported as factors. To prevent this, I set the parameter stringAsFactors = FALSE.

mydata = read.csv ("week6_data.csv", stringsAsFactors = FALSE)

3 Inspecting

Taking a peak at the imported data

After creating the data set in Excel, I want to see if how it displays in R. Additionally, I want to compare it to the original chart data to determine if they are equivalent. It looks like they are equivalent.

Original Chart

Recreated Chart

mydata

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Reviewing the structure of the data

At this point, I wanted to see the structure of the data frame mydata to determine the data types of the variables. I also want to make sure that the character variables were not imported as factor and that my analysis variables are numerical. So far so good… it looks like I can move on to the next stage.

str(mydata)

## 'data.frame':    5 obs. of  7 variables:
##  $ X            : chr  "ALASKA" "" "" "AM WEST" ...
##  $ X.1          : chr  "on time" "delayed" "" "on time" ...
##  $ Los.Angeles  : int  497 62 NA 694 117
##  $ Phoenix      : int  221 12 NA 4840 415
##  $ San.Diego    : int  212 20 NA 383 65
##  $ San.Francisco: int  503 102 NA 320 129
##  $ Seattle      : int  1841 305 NA 201 61

4 Subsetting

Removing columns and rows

In the original and imported data set, I noticed that row #3 was full of NAs and therefore not usable. Additionally, I also noticed that the airline names were not carried through all the rows. At this point, I decided that I will remove the rows with NAs and the Airline column or column X.

I could have done several procedures to accomplish this, but I decided to use negative subsetting in a single step.

mydata <- mydata [-3,-1]

mydata

##       X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 on time         497     221       212           503    1841
## 2 delayed          62      12        20           102     305
## 4 on time         694    4840       383           320     201
## 5 delayed         117     415        65           129      61

Creating a new column for Airlines

In the prior step I removed the NAs and column X. To replace the values in column X, I created a character vector with the names of the airlines in the order they appear. The new vector will be called airlines with a lowercase a.

airlines <- c("Alaska", "Alaska", "AM WEST", "AM WEST")

airlines

## [1] "Alaska"  "Alaska"  "AM WEST" "AM WEST"

Adding Airlines column and changing the first column name

In this step, I append the Airlines vector I created earlier to the data frame.

mydata[,"Airlines"] <- airlines

mydata

##       X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle Airlines
## 1 on time         497     221       212           503    1841   Alaska
## 2 delayed          62      12        20           102     305   Alaska
## 4 on time         694    4840       383           320     201  AM WEST
## 5 delayed         117     415        65           129      61  AM WEST

str(mydata)

## 'data.frame':    4 obs. of  7 variables:
##  $ X.1          : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61
##  $ Airlines     : chr  "Alaska" "Alaska" "AM WEST" "AM WEST"

In this step, I modify the column name for the first variable…I did not like the default name of X.1.

colnames(mydata)[1] <- "Arrival"

mydata

##   Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle Airlines
## 1 on time         497     221       212           503    1841   Alaska
## 2 delayed          62      12        20           102     305   Alaska
## 4 on time         694    4840       383           320     201  AM WEST
## 5 delayed         117     415        65           129      61  AM WEST

str(mydata)

## 'data.frame':    4 obs. of  7 variables:
##  $ Arrival      : chr  "on time" "delayed" "on time" "delayed"
##  $ Los.Angeles  : int  497 62 694 117
##  $ Phoenix      : int  221 12 4840 415
##  $ San.Diego    : int  212 20 383 65
##  $ San.Francisco: int  503 102 320 129
##  $ Seattle      : int  1841 305 201 61
##  $ Airlines     : chr  "Alaska" "Alaska" "AM WEST" "AM WEST"

At this point, I am done with my basic data cleanup steps and can then move on to make this data tidy.

5 Tidying Up

The goal of this section is to transform the mydata data set into a tidy-compliant data set that plays well with R.

Characteristic of Tidy Data

Each variable is in its own column.
Each observation is saved on its own row.
Each type of observation is stored in a single table.

Coverting the data from a wide to a long format

In the original data set structure, the cities were across several columns. This made our data set wide and not ideal for analysis in R. In the tidy framework, this data set fails characteristic #1. The purpose of this manipulation, using the gather function is to “gather”" the cities from column values to row values.

mydata2 <- gather(mydata,"Cities", "n", 2:6, convert = TRUE)

head(mydata2)

##   Arrival Airlines      Cities   n
## 1 on time   Alaska Los.Angeles 497
## 2 delayed   Alaska Los.Angeles  62
## 3 on time  AM WEST Los.Angeles 694
## 4 delayed  AM WEST Los.Angeles 117
## 5 on time   Alaska     Phoenix 221
## 6 delayed   Alaska     Phoenix  12

str(mydata2)

## 'data.frame':    20 obs. of  4 variables:
##  $ Arrival : chr  "on time" "delayed" "on time" "delayed" ...
##  $ Airlines: chr  "Alaska" "Alaska" "AM WEST" "AM WEST" ...
##  $ Cities  : chr  "Los.Angeles" "Los.Angeles" "Los.Angeles" "Los.Angeles" ...
##  $ n       : int  497 62 694 117 221 12 4840 415 212 20 ...

Setting up the dataset for further arrival analysis

In the current state, the data does not meet all the requirements for a tidy data set - it fails characteristic #2. We still have multiple rows for the same observation. In this case, this is caused by the Arrival variable. To make this data set tidy and to prep it for further analysis, the unique values in this variable need to become variables.

This is accomplished by the spread function below. This function takes the unique values of the Arrival variable and coverts them into columns. It also re-arranges the analysis variable.

mydata3 <- spread(mydata2, Arrival, n )

head(mydata3)

##   Airlines        Cities delayed on time
## 1   Alaska   Los.Angeles      62     497
## 2   Alaska       Phoenix      12     221
## 3   Alaska     San.Diego      20     212
## 4   Alaska San.Francisco     102     503
## 5   Alaska       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694

str(mydata3)

## 'data.frame':    10 obs. of  4 variables:
##  $ Airlines: chr  "Alaska" "Alaska" "Alaska" "Alaska" ...
##  $ Cities  : chr  "Los.Angeles" "Phoenix" "San.Diego" "San.Francisco" ...
##  $ delayed : int  62 12 20 102 305 117 415 65 129 61
##  $ on time : int  497 221 212 503 1841 694 4840 383 320 201

6 Analysis

In this section, I will perform several analysis of the data set to compare the arrival of both airlines. The goal is to determine which airline performed better on the On Time metric.

As a warm-up, I will calculate the total mean for the delayed variable across both airlines. In this example, I am using the summarize function to summarize the total mean.

tot_mean <- mydata3  %>% summarise(mean_delay=mean(delayed))

tot_mean

##   mean_delay
## 1      128.8

Renaming the On Time column

I renamed the On Time column to On_Time to prevent an error that were caused by the time keyword.

colnames(mydata3)[4] <- "On_time"

mydata3

##    Airlines        Cities delayed On_time
## 1    Alaska   Los.Angeles      62     497
## 2    Alaska       Phoenix      12     221
## 3    Alaska     San.Diego      20     212
## 4    Alaska San.Francisco     102     503
## 5    Alaska       Seattle     305    1841
## 6   AM WEST   Los.Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San.Diego      65     383
## 9   AM WEST San.Francisco     129     320
## 10  AM WEST       Seattle      61     201

Okay, enough warm-ups. I now need to perform analysis on the arrival delays for the two airlines and reach a conclusion. For this assignment, I am assuming we need to determine which airline performed better in terms of delays.

Simple Comparison

In comparing delays between these two airlines, I want to know which airline has the highest absolute number of delays. For this analysis, I will group the airlines and then summarize an variable called sum_delayed. This variable is the sum of the delays.

In next step, I will use the arrange function to display the data in descending order using the previously created sum_delayed variable.

mydata5 <- mydata3  %>%  
  group_by(Airlines)  %>% 
  summarize( sum_delayed = sum(delayed))

arrange(mydata5, desc(sum_delayed))

## Source: local data frame [2 x 2]
## 
##   Airlines sum_delayed
##      (chr)       (int)
## 1  AM WEST         787
## 2   Alaska         501

The result of this analysis is that the AM WEST airline has 787 delays vs. 501 for the Alaska airline.

Does this imply that the Alaska airline performed better than AM WEST? Let’s take a look at a few summary statistics for those two airlines.

Summary Statistics

Although the sample is very small, I decided to run a few summary statistics on the grouped airline variables. The grouping of the airline’s variable was done using the group_by function and the summarizing was done by the using the summarize function. The following variables are defined:

max_delayed = maximum delayed value.
mean_delayed = mean delayed value.
sd_delayed = standard deviation of delayed values.

mydata6 <- mydata3  %>%  
  group_by(Airlines)  %>% 
  summarize( max_delayed = max(delayed), mean_delayed = mean(delayed), sd_delayed = sd(delayed))


arrange(mydata6, desc(max_delayed))

## Source: local data frame [2 x 4]
## 
##   Airlines max_delayed mean_delayed sd_delayed
##      (chr)       (int)        (dbl)      (dbl)
## 1  AM WEST         415        157.4   147.1625
## 2   Alaska         305        100.2   120.0175

Base on these summary statistics, I can tell the following about the two airlines:

AM WEST has higher max_delayed at 415 vs. 305 for Alaska.
AM WEST has higher mean_delayed at 157 vs. 100 for Alaska.
AM WEST has higher sd_delayed at 147 vs. 120 for Alaska.

So, does this mean that the Alaska airline has better On Time performance than the AM WEST airline? Not so fast! we are not normalizing the comparisons so it would not be fair to assume that Alaska has better performance than AM WEST. Although I will not do a full normalization of the variables, I will compare their ratios:

Total delayed flights / Total Flights

This ratio will allow me to better compare the performance of each airline. I will accomplish this in the following steps:

Will use the Group_by and summarize functions to perform grouped calculations.
Will use the mudate function to further manipulate the new variable called ratio.
Will use the arrange function to display the data ratio variable in descending order.

mydata4 <- mydata3  %>%  
  group_by(Airlines)  %>% 
  summarize(sum_delayed = sum(delayed), sum_On_time = sum(On_time), Total_flights = (sum(delayed) + sum(On_time)))

mydata4 <- mutate(mydata4, ratio = round((sum_delayed / (sum_On_time + sum_delayed)) * 100,2)) 

arrange(mydata4, desc(ratio))

## Source: local data frame [2 x 5]
## 
##   Airlines sum_delayed sum_On_time Total_flights ratio
##      (chr)       (int)       (int)         (int) (dbl)
## 1   Alaska         501        3274          3775 13.27
## 2  AM WEST         787        6438          7225 10.89

7 Conclusion

For this assignment, the performance metric was somewhat ambiguous. We could have used any of the measures in the above sections to determine which airline performed better on the delay criteria.

The AM WEST airline showed higher absolute delays as well as higher values in the statistical measures. On the other hand, if I was evaluating the performance of these two airlines, in terms of delays, I would compare the ratio of delays between the two. This would ensure I am comparing “apples-to-apples”.

Based on this measure, I can conclude that the Alaska airline had a greater number of delays 13.27% vs. 10.80% for the AM WEST airline.

week6_Assignment

Diego Diaz

October 3, 2015