Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
u<-c("","","Los Angeles","Phoenix","San Diego","San Francisco","Seattle")
t<-c("Alaska","on time","497","221","212","503","1841")
i<-c("","delayed","62","12","20","102","305")
d<-c("AM WEST","on time","694","4840","383","320","201")
y<-c("","delayed","117","415","65","129","61")
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
untidy<-data.frame(rbind(u, t,i, d,y),row.names=NULL ,stringsAsFactors=FALSE)
print(untidy)
## X1 X2 X3 X4 X5 X6 X7
## 1 Los Angeles Phoenix San Diego San Francisco Seattle
## 2 Alaska on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
# Write CSV in R
write.table(untidy, file = "asignment6_untidy.csv", row.names=FALSE, col.names=FALSE, sep=",")
# Read CSV into R
MyData <- read.csv('asignment6_untidy.csv',stringsAsFactors=FALSE)
print(MyData) #It is showing that Column headers are values, not variable names. Also, ariables are stored in both rows and columns.
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
#We will assign the column's names for our preprocessing
colnames(MyData)[1] <- "Airline"
colnames(MyData)[2] <- "Status"
#We are assigning the values Alask and AM West to the two empty columns
MyData[2,1]<-'Alaska'
MyData[4,1]<-'AM WEST'
MyData
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
In order to have a tidy dataset, we are going to transform city into one variable(column), then we are going to transform delayed and on time into two variables.
library(tidyr)
library(dplyr)
MyData<-MyData %>% gather(City, Occurence, 3:7)
head(MyData)
## Airline Status City Occurence
## 1 Alaska on time Los.Angeles 497
## 2 Alaska delayed Los.Angeles 62
## 3 AM WEST on time Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
MyData <- spread(MyData, Status, Occurence)
head(MyData)
## Airline City delayed on time
## 1 Alaska Los.Angeles 62 497
## 2 Alaska Phoenix 12 221
## 3 Alaska San.Diego 20 212
## 4 Alaska San.Francisco 102 503
## 5 Alaska Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
#We are now going to remove the extra character in the city names
MyData$City<-gsub("\\."," ",MyData$City)
MyData %>%
group_by(City, Airline) %>%
summarise(delay_percent = 100 * delayed/(delayed+`on time`)) %>%
arrange(City, Airline)
## Source: local data frame [10 x 3]
## Groups: City [5]
##
## City Airline delay_percent
## (chr) (chr) (dbl)
## 1 Los Angeles Alaska 11.091234
## 2 Los Angeles AM WEST 14.426634
## 3 Phoenix Alaska 5.150215
## 4 Phoenix AM WEST 7.897241
## 5 San Diego Alaska 8.620690
## 6 San Diego AM WEST 14.508929
## 7 San Francisco Alaska 16.859504
## 8 San Francisco AM WEST 28.730512
## 9 Seattle Alaska 14.212488
## 10 Seattle AM WEST 23.282443