Details and Goal of Assignment:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
- Perform analysis to compare the arrival delays for the two airlines.
#Let's begin by first checking if the package is installed, and installing packages if not
if(!require('tidyr')) {
install.packages('tidyr')
library(tidyr)
}
## Loading required package: tidyr
if(!require('dplyr')) {
install.packages('dplyr')
library(tidyr)
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
delay_csv <- rbind(c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c("ALASKA", "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c("AM WEST", "Delayed", 117, 415, 65, 129, 61))
delay_csv
## [,1] [,2] [,3] [,4] [,5]
## [1,] "Airline" "Status" "Los Angeles" "Phoenix" "San Diego"
## [2,] "ALASKA" "On Time" "497" "221" "212"
## [3,] "ALASKA" "Delayed" "62" "12" "20"
## [4,] "AM WEST" "On Time" "694" "4840" "383"
## [5,] "AM WEST" "Delayed" "117" "415" "65"
## [,6] [,7]
## [1,] "San Francisco" "Seattle"
## [2,] "503" "1841"
## [3,] "102" "305"
## [4,] "320" "201"
## [5,] "129" "61"
write.table(delay_csv, file = "Airline.csv", sep = ",", col.names=F, row.names=F)
#Next, let's read data from the file
delay_csv <- read.csv(paste0("Airline.csv"), stringsAsFactors = F);
delay_csv[,2] <- sapply(delay_csv[,2], str_replace, " ", "_")
delay_csv
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On_Time 497 221 212 503 1841
## 2 ALASKA Delayed 62 12 20 102 305
## 3 AM WEST On_Time 694 4840 383 320 201
## 4 AM WEST Delayed 117 415 65 129 61
#Gather function can be used to move data from the wide format to the tall format
delay_csv2 <- gather(delay_csv, "City", "Count", 3:7)
delay_csv2
## Airline Status City Count
## 1 ALASKA On_Time Los.Angeles 497
## 2 ALASKA Delayed Los.Angeles 62
## 3 AM WEST On_Time Los.Angeles 694
## 4 AM WEST Delayed Los.Angeles 117
## 5 ALASKA On_Time Phoenix 221
## 6 ALASKA Delayed Phoenix 12
## 7 AM WEST On_Time Phoenix 4840
## 8 AM WEST Delayed Phoenix 415
## 9 ALASKA On_Time San.Diego 212
## 10 ALASKA Delayed San.Diego 20
## 11 AM WEST On_Time San.Diego 383
## 12 AM WEST Delayed San.Diego 65
## 13 ALASKA On_Time San.Francisco 503
## 14 ALASKA Delayed San.Francisco 102
## 15 AM WEST On_Time San.Francisco 320
## 16 AM WEST Delayed San.Francisco 129
## 17 ALASKA On_Time Seattle 1841
## 18 ALASKA Delayed Seattle 305
## 19 AM WEST On_Time Seattle 201
## 20 AM WEST Delayed Seattle 61
#In order to move the status into its own variable, use the spread function
delay_csv2 <- spread(delay_csv2, Status, Count)
delay_csv2
## Airline City Delayed On_Time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
#Analyzing delays for both airlines, by airlines:
delay_csv2 %>% group_by(Airline) %>%
summarise(
TotDelay=sum(Delayed),TotOn_Time=sum(On_Time),
PctDelayed=(TotDelay/(TotDelay+TotOn_Time)*100),
PctOnTime=(TotOn_Time/(TotDelay+TotOn_Time)*100)
)
## # A tibble: 2 x 5
## Airline TotDelay TotOn_Time PctDelayed PctOnTime
## <chr> <int> <int> <dbl> <dbl>
## 1 ALASKA 501 3274 13.3 86.7
## 2 AM WEST 787 6438 10.9 89.1
#Analyzing delays for both airlines, by city:
delay_csv2 %>% group_by(City) %>%
summarise(
TotDelay=sum(Delayed),TotOn_Time=sum(On_Time),
PctDelayed=(TotDelay/(TotDelay+TotOn_Time)*100),
PctOnTime=(TotOn_Time/(TotDelay+TotOn_Time)*100)
)
## # A tibble: 5 x 5
## City TotDelay TotOn_Time PctDelayed PctOnTime
## <chr> <int> <int> <dbl> <dbl>
## 1 Los.Angeles 179 1191 13.1 86.9
## 2 Phoenix 427 5061 7.78 92.2
## 3 San.Diego 85 595 12.5 87.5
## 4 San.Francisco 231 823 21.9 78.1
## 5 Seattle 366 2042 15.2 84.8
Overall, AM West airline has better on time performance compared to ALASKA airline
Of the cities compared, Phoenix seems to be the best on time while San Francisco shows a higher amount of delays