IS 607 - Week 6 Assignment

Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

u<-c("","","Los Angeles","Phoenix","San Diego","San Francisco","Seattle")
t<-c("Alaska","on time","497","221","212","503","1841")
i<-c("","delayed","62","12","20","102","305")
d<-c("AM WEST","on time","694","4840","383","320","201")
y<-c("","delayed","117","415","65","129","61")

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
untidy<-data.frame(rbind(u, t,i, d,y),row.names=NULL ,stringsAsFactors=FALSE)
print(untidy)
##        X1      X2          X3      X4        X5            X6      X7
## 1                 Los Angeles Phoenix San Diego San Francisco Seattle
## 2  Alaska on time         497     221       212           503    1841
## 3         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
# Write CSV in R
write.table(untidy, file = "asignment6_untidy.csv", row.names=FALSE, col.names=FALSE, sep=",") 
  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
# Read CSV into R
MyData <- read.csv('asignment6_untidy.csv',stringsAsFactors=FALSE)

print(MyData) #It is showing that Column headers are values, not variable names. Also, ariables are stored in both rows and columns.
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61
#We will assign the column's names for our preprocessing
colnames(MyData)[1] <- "Airline"
colnames(MyData)[2] <- "Status"

#We are assigning the values Alask and AM West to the two empty columns 
MyData[2,1]<-'Alaska'
MyData[4,1]<-'AM WEST'
MyData
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

In tidy data:

  • observations are in rows
  • variables are in columns

In order to have a tidy dataset, we are going to transform city into one variable(column), then we are going to transform delayed and on time into two variables.

library(tidyr)
library(dplyr)
MyData<-MyData %>% gather(City, Occurence, 3:7) 
head(MyData)
##   Airline  Status        City Occurence
## 1  Alaska on time Los.Angeles       497
## 2  Alaska delayed Los.Angeles        62
## 3 AM WEST on time Los.Angeles       694
## 4 AM WEST delayed Los.Angeles       117
## 5  Alaska on time     Phoenix       221
## 6  Alaska delayed     Phoenix        12
MyData <- spread(MyData, Status, Occurence)
head(MyData)
##   Airline          City delayed on time
## 1  Alaska   Los.Angeles      62     497
## 2  Alaska       Phoenix      12     221
## 3  Alaska     San.Diego      20     212
## 4  Alaska San.Francisco     102     503
## 5  Alaska       Seattle     305    1841
## 6 AM WEST   Los.Angeles     117     694
#We are now going to remove the extra character in the city names
MyData$City<-gsub("\\."," ",MyData$City)
  1. Perform analysis to compare the arrival delays for the two airlines.
MyData %>% 
  group_by(City, Airline) %>% 
  summarise(delay_percent = 100 * delayed/(delayed+`on time`)) %>% 
  arrange(City, Airline)
## Source: local data frame [10 x 3]
## Groups: City [5]
## 
##             City Airline delay_percent
##            (chr)   (chr)         (dbl)
## 1    Los Angeles  Alaska     11.091234
## 2    Los Angeles AM WEST     14.426634
## 3        Phoenix  Alaska      5.150215
## 4        Phoenix AM WEST      7.897241
## 5      San Diego  Alaska      8.620690
## 6      San Diego AM WEST     14.508929
## 7  San Francisco  Alaska     16.859504
## 8  San Francisco AM WEST     28.730512
## 9        Seattle  Alaska     14.212488
## 10       Seattle AM WEST     23.282443