Assignment:-
The chart above describes arrival delays for two airlines across five destinations. Your task is to:
(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
(2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
(3) Perform analysis to compare the arrival delays for the two airlines
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below
data <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c(NA, "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c(NA, "Delayed", 117, 415, 65, 129, 61))
write.table(data, file = "/Users/ashishsm1986/git/Cuny-Assignments/week5-assignment/Assignment5.csv", sep = ",", col.names=F, row.names=F) #Populate the data in a .csv file
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
wide_data <- read.csv(paste0("/Users/ashishsm1986/git/Cuny-Assignments/week5-assignment/Assignment5.csv"), stringsAsFactors = F); wide_data
## NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 AM WEST On Time 694 4840 383 320 201
## 4 <NA> Delayed 117 415 65 129 61
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
long_data <- wide_data %>% #Data is converted into a “long” structure using tidyr and dplyr.
mutate(AL_Airline = NA., AM_Airline=lag(NA.)) %>% #Mutate function is used to duplicate the first column twice
mutate(Airline = coalesce(AL_Airline,AM_Airline), Status = NA..1) %>% #Combine the new columns into one column without null values.
gather("Destination", "Flight", 3:7) %>% #Transform the data from its crosstab appearance to a more normalized format.
select(Airline:Flight) %>% #Query the columns we want in the final structure
arrange(Airline, desc(Status), Destination); #Split the data into the two groups that will be compared.
long_data
## Airline Status Destination Flight
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San.Diego 212
## 4 ALASKA On Time San.Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los.Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San.Diego 20
## 9 ALASKA Delayed San.Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los.Angeles 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San.Diego 383
## 14 AM WEST On Time San.Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los.Angeles 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San.Diego 65
## 19 AM WEST Delayed San.Francisco 129
## 20 AM WEST Delayed Seattle 61
- Perform analysis to compare the arrival delays for the two airlines.
Performance <- long_data %>%
group_by(Airline) %>%
mutate(Arrival_Tot = sum(Flight)) %>%
group_by(Airline, Status) %>%
mutate(Sum_Tot = sum(Flight), Perf_percntage = Sum_Tot / Arrival_Tot) %>% group_by(Airline, Destination) %>%
mutate(D_Tot = sum(Flight), D_Perf = Flight / D_Tot)
# Get Airlines Overall Performance
data.frame(Performance[c(1,10,11,20), c(1,2,7)])
## Airline Status Perf_percntage
## 1 ALASKA On Time 0.8672848
## 2 ALASKA Delayed 0.1327152
## 3 AM WEST On Time 0.8910727
## 4 AM WEST Delayed 0.1089273
Analysys Results:-
Alaska Airlines were on-time 86.72% of the time and delayed 13.28% of the times.
AM West Airlines were on-time 89.11% of the time and delayed 10.89% of the times.