MSDS Spring 2018

DATA 607 Data Aquisition and Management

Jiadi Li

Week 5 Assignment:Tidying and Transforming Data



1.Create a .CSV file that includes all of the informations above. Use a “wide” structure similar to how the information appears above, so that tidying and transformings as described below can be practiced.

2.Read the information from .CSV file into R.

airlines.raw <- read.csv("https://raw.githubusercontent.com/xiaoxiaogao-DD/DATA607_Assignment5/master/airlines.csv",sep = ",")
airlines.raw
##       ï..       X Los.Angeles Phoenix San.Diego San..Francisco Seattle
## 1  ALASKA on time         497    221        212            503  1,841 
## 2         delayed          62     12         20            102    305 
## 3                          NA                NA             NA        
## 4 AM WEST on time         694  4,840        383            320    201 
## 5         delayed         117    415         65            129     61

Use \(tidyr\) and \(dplyr\) as needed to tidy and transform the data.

colnames(airlines.raw)[1:2] <- c("airline","status") #add column names for columns with no title
airlines.raw[[1]][2] <- airlines.raw[[1]][1] #fill empty cells with appropriate values based on the overall structure of the table
airlines.raw[[1]][5] <- airlines.raw[[1]][4] #same as above
airlines.raw
##   airline  status Los.Angeles Phoenix San.Diego San..Francisco Seattle
## 1  ALASKA on time         497    221        212            503  1,841 
## 2  ALASKA delayed          62     12         20            102    305 
## 3                          NA                NA             NA        
## 4 AM WEST on time         694  4,840        383            320    201 
## 5 AM WEST delayed         117    415         65            129     61
library(tidyr)

airlines <- gather(airlines.raw,city,number.airlines,Los.Angeles:Seattle) #most important step: restructure the dataset so that it can be considered as tidy data
## Warning: attributes are not identical across measure variables;
## they will be dropped
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#airlines <- subset(airlines,status!="")
airlines <- filter(airlines,status!="") #delete rows with no value (previously left blank for aesthetic reason), this step can be done with/without dplyr

i <- 1
while(i <= length(airlines$city)) {
  airlines$city[i] <- gsub("\\."," ",airlines$city[i]) #remove ","(comma) between number.airlines so that as.numeric function can be applied
  airlines$number.airlines[i] <- gsub(",","",airlines$number.airlines[i]) #remove the "."(dots) between city names for every row
  i <- i + 1
} 

airlines$number.airlines <- as.numeric(airlines$number.airlines) #convert all number.airlines data into numeric data so that further calculation can be performed

airlines
##    airline  status           city number.airlines
## 1   ALASKA on time    Los Angeles             497
## 2   ALASKA delayed    Los Angeles              62
## 3  AM WEST on time    Los Angeles             694
## 4  AM WEST delayed    Los Angeles             117
## 5   ALASKA on time        Phoenix             221
## 6   ALASKA delayed        Phoenix              12
## 7  AM WEST on time        Phoenix            4840
## 8  AM WEST delayed        Phoenix             415
## 9   ALASKA on time      San Diego             212
## 10  ALASKA delayed      San Diego              20
## 11 AM WEST on time      San Diego             383
## 12 AM WEST delayed      San Diego              65
## 13  ALASKA on time San  Francisco             503
## 14  ALASKA delayed San  Francisco             102
## 15 AM WEST on time San  Francisco             320
## 16 AM WEST delayed San  Francisco             129
## 17  ALASKA on time        Seattle            1841
## 18  ALASKA delayed        Seattle             305
## 19 AM WEST on time        Seattle             201
## 20 AM WEST delayed        Seattle              61

3.Perform analysis to compare the arrival delays for the two airlines.

airlines.analysis <- airlines %>% group_by(airline,city) %>% summarise(total.airlines = sum(number.airlines)) #compute the sum of number.airlines for the same airport and city

i <- 1
rate.status <- vector()

while(i <= length(airlines$number.airlines)){ #compute the percentage of on time/ delay for each airport and city and store the data along with the dataframe
  ap <- airlines$airline[i]
  ct <- airlines$city[i]
  rate.status[i] <- round(airlines$number.airlines[i]/airlines.analysis$total.airlines[airlines.analysis$airline==ap & airlines.analysis$city == ct],3)
  i <- i + 1 
}

airlines <- cbind(airlines,rate.status)
airlines
##    airline  status           city number.airlines rate.status
## 1   ALASKA on time    Los Angeles             497       0.889
## 2   ALASKA delayed    Los Angeles              62       0.111
## 3  AM WEST on time    Los Angeles             694       0.856
## 4  AM WEST delayed    Los Angeles             117       0.144
## 5   ALASKA on time        Phoenix             221       0.948
## 6   ALASKA delayed        Phoenix              12       0.052
## 7  AM WEST on time        Phoenix            4840       0.921
## 8  AM WEST delayed        Phoenix             415       0.079
## 9   ALASKA on time      San Diego             212       0.914
## 10  ALASKA delayed      San Diego              20       0.086
## 11 AM WEST on time      San Diego             383       0.855
## 12 AM WEST delayed      San Diego              65       0.145
## 13  ALASKA on time San  Francisco             503       0.831
## 14  ALASKA delayed San  Francisco             102       0.169
## 15 AM WEST on time San  Francisco             320       0.713
## 16 AM WEST delayed San  Francisco             129       0.287
## 17  ALASKA on time        Seattle            1841       0.858
## 18  ALASKA delayed        Seattle             305       0.142
## 19 AM WEST on time        Seattle             201       0.767
## 20 AM WEST delayed        Seattle              61       0.233
airlines.delay <- filter(airlines,status=="delayed")
airlines.delay
##    airline  status           city number.airlines rate.status
## 1   ALASKA delayed    Los Angeles              62       0.111
## 2  AM WEST delayed    Los Angeles             117       0.144
## 3   ALASKA delayed        Phoenix              12       0.052
## 4  AM WEST delayed        Phoenix             415       0.079
## 5   ALASKA delayed      San Diego              20       0.086
## 6  AM WEST delayed      San Diego              65       0.145
## 7   ALASKA delayed San  Francisco             102       0.169
## 8  AM WEST delayed San  Francisco             129       0.287
## 9   ALASKA delayed        Seattle             305       0.142
## 10 AM WEST delayed        Seattle              61       0.233
library(ggplot2)
ggplot(data=airlines.delay, aes(x=city,y=rate.status,fill=airline)) +
    geom_bar(stat="identity", position=position_dodge()) +
    ggtitle("Delay Rate by Airline by City") +
    xlab("city") + 
    ylab("rate") #create a ggplot compare delay rate for each airline and each city

4.Narrative descriptions of data cleanup work, analysis, and conclusions.

Data cleanup processes are commented above.

Analysis:
Based on the barplot, AM West has higher delay rate across all cities. For the same airline, San Francisco and Seattle has obviously higher delay rate.

Conclusion:
Regardless of the destination (city), in order to arrive on time (or earlier), one should choose ALASKA.
For further exploration, it would be interesting to take price and other strategy related factors into consideration. Given a higher delay rate across board, AM WEST might using a low cost strategy.
On the other hand, standard of delay rate within the whole industry is even more important. While AM WEST has higher delay rate, both AM WEST and ALASKA might perform better or worse than other competitors in the industry.
For cities like San Francisco and Seattle, we can also look closer into data of their weather and traffic.