GOAL 1: Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Reading CSV file and printing data
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.2.3
flightData<-read.csv("https://raw.githubusercontent.com/chirag-vithlani/607/master/week6/week6.csv", header= TRUE,sep=",",na.strings = "?",stringsAsFactors=FALSE)
flightData
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 Delayed 117 415 65 129 61
Missing headers are names as X and X.1, so rename headers
names(flightData)[names(flightData) == "X"] <- "Airline"
names(flightData)[names(flightData) == "X.1"] <- "Arrival"
Repeating airline name where it is missing ( this is hard coded, but this step should be done programmatically )
flightData[2, 1] <- "ALASKA"
flightData[4, 1] <- "AM West"
flightData
## Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 ALASKA Delayed 62 12 20 102 305
## 3 AM West On Time 694 4840 383 320 201
## 4 AM West Delayed 117 415 65 129 61
Using gather() function. gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.
tidy <- gather(flightData, "City", "Count", 3:7)
head(tidy)
## Airline Arrival City Count
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA Delayed Los.Angeles 62
## 3 AM West On Time Los.Angeles 694
## 4 AM West Delayed Los.Angeles 117
## 5 ALASKA On Time Phoenix 221
## 6 ALASKA Delayed Phoenix 12
Using spread() function. spread() takes two columns (a key-value pair) and spreads them in to multiple columns, making “long” data wider
#tidy$Count <- as.numeric(gsub(",", "", tidy$Count))
tidy <- spread(tidy, "Arrival", Count)
tidy
## Airline City Delayed On Time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM West Los.Angeles 117 694
## 7 AM West Phoenix 415 4840
## 8 AM West San.Diego 65 383
## 9 AM West San.Francisco 129 320
## 10 AM West Seattle 61 201
select(): focus on a subset of variables filter(): focus on a subset of rows mutate(): add new columns summarise(): reduce each group to a smaller number of summary statistics arrange(): re-order the rows
Choosing columns: select
suppressMessages(library(dplyr))
head(select(tidy, Airline))
## Airline
## 1 ALASKA
## 2 ALASKA
## 3 ALASKA
## 4 ALASKA
## 5 ALASKA
## 6 AM West
or use minus to hide that column
head(select(tidy, -Airline))
## City Delayed On Time
## 1 Los.Angeles 62 497
## 2 Phoenix 12 221
## 3 San.Diego 20 212
## 4 San.Francisco 102 503
## 5 Seattle 305 1841
## 6 Los.Angeles 117 694
Using filter
filter(tidy,Delayed==62)
## Airline City Delayed On Time
## 1 ALASKA Los.Angeles 62 497
Using mutate : Add new variables
tidy=mutate(tidy,Total = Delayed + `On Time`)
head(tidy)
## Airline City Delayed On Time Total
## 1 ALASKA Los.Angeles 62 497 559
## 2 ALASKA Phoenix 12 221 233
## 3 ALASKA San.Diego 20 212 232
## 4 ALASKA San.Francisco 102 503 605
## 5 ALASKA Seattle 305 1841 2146
## 6 AM West Los.Angeles 117 694 811
As we can see for almost for all cities AM West is getting delayed more number of time.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
tidy <- mutate(tidy, Total = Delayed + `On Time`, PercentDelayed = Delayed / Total * 100)
tidy <- arrange(tidy, City, PercentDelayed)
ggplot(tidy,aes(x=City,y=PercentDelayed,fill=factor(Airline)))+
geom_bar(stat="identity",position="dodge")
tidy
## Airline City Delayed On Time Total PercentDelayed
## 1 ALASKA Los.Angeles 62 497 559 11.091234
## 2 AM West Los.Angeles 117 694 811 14.426634
## 3 ALASKA Phoenix 12 221 233 5.150215
## 4 AM West Phoenix 415 4840 5255 7.897241
## 5 ALASKA San.Diego 20 212 232 8.620690
## 6 AM West San.Diego 65 383 448 14.508929
## 7 ALASKA San.Francisco 102 503 605 16.859504
## 8 AM West San.Francisco 129 320 449 28.730512
## 9 ALASKA Seattle 305 1841 2146 14.212488
## 10 AM West Seattle 61 201 262 23.282443
Using summarise : Reduce variables to values
Overall delay : Alaska airline is delayed by 11% and AM West delayed by 18%.
delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
delays
## Source: local data frame [2 x 2]
##
## Airline MeanPercent
## (chr) (dbl)
## 1 ALASKA 11
## 2 AM West 18