Assignment 5
Create CSV
csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
c(NA, "Delayed", 62, 12, 20, 102, 305),
c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
c(NA, "Delayed", 117, 415, 65, 129, 61))
write.table(csv, file = "arrivaltimes.csv", sep = ",", col.names=F, row.names=F)Read CSV into R
originalarrival<-read.csv("https://raw.githubusercontent.com/IsARam/DATA607/master/arrivaltimes.csv?_sm_au_=iVVSSF6Sk3kVrtvj")
originalarrival## NA. NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 <NA> Delayed 62 12 20 102 305
## 3 AM WEST On Time 694 4840 383 320 201
## 4 <NA> Delayed 117 415 65 129 61
Transformation Of Data
Load Libraries
library(tidyr)
library(dplyr)##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)Tidy Data
I created another table and modified it by using the following formulas: The mutate formula (dplyr) adds new variables and preserves existing. In the case of the above this formula was used to duplicate the first column. The coalesce (dplyr) formula finds the first non-missing value at each position.In the case of the above this was used to combine the first and second columns into one. The gather formuala (tidyr) takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. The select formula (dplyr) keeps only the variables you mention. The arrange formula (dplyr) sort a variable in descending order.
arrival<- originalarrival %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>%
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1) %>%
gather("Destination", "Flights", 3:7) %>%
select(Airline:Flights) %>%
arrange(Airline, desc(Status), Destination); arrival## Airline Status Destination Flights
## 1 ALASKA On Time Los.Angeles 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San.Diego 212
## 4 ALASKA On Time San.Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los.Angeles 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San.Diego 20
## 9 ALASKA Delayed San.Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los.Angeles 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San.Diego 383
## 14 AM WEST On Time San.Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los.Angeles 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San.Diego 65
## 19 AM WEST Delayed San.Francisco 129
## 20 AM WEST Delayed Seattle 61
Analysis
Filter Status
When analyzing the comparison of delays for the two airlines I created a new table that filtered Delayed status and summarized Delayed flights.
The group_by formula (dplyr) takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. The summarise formula (dplyr) is typically used on grouped data created by group_by().
FlightsDelayed <- arrival %>% group_by(Airline,Destination) %>% filter(Status == 'Delayed') %>%summarise(Delayed = sum(Flights))
FlightsDelayed## # A tibble: 10 x 3
## # Groups: Airline [?]
## Airline Destination Delayed
## <fct> <chr> <int>
## 1 ALASKA Los.Angeles 62
## 2 ALASKA Phoenix 12
## 3 ALASKA San.Diego 20
## 4 ALASKA San.Francisco 102
## 5 ALASKA Seattle 305
## 6 AM WEST Los.Angeles 117
## 7 AM WEST Phoenix 415
## 8 AM WEST San.Diego 65
## 9 AM WEST San.Francisco 129
## 10 AM WEST Seattle 61
The group_by (dplyr) formula takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.
Calculate Total Flights
When analyzing the comparison of delays for the two airlines I created a new table that summarized the total flights.
The group_by formula (dplyr) takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. The summarise formula (dplyr) is typically used on grouped data created by group_by().
TotalFlights <- arrival %>% group_by(Airline,Destination) %>% summarise(Total = sum(Flights))
TotalFlights## # A tibble: 10 x 3
## # Groups: Airline [?]
## Airline Destination Total
## <fct> <chr> <int>
## 1 ALASKA Los.Angeles 559
## 2 ALASKA Phoenix 233
## 3 ALASKA San.Diego 232
## 4 ALASKA San.Francisco 605
## 5 ALASKA Seattle 2146
## 6 AM WEST Los.Angeles 811
## 7 AM WEST Phoenix 5255
## 8 AM WEST San.Diego 448
## 9 AM WEST San.Francisco 449
## 10 AM WEST Seattle 262
Combine and Mutate
I created a new table which combined the tables above.
The cbind formula (base) take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively. The mutate formula (dplyr) adds new variables and preserves existing. In the case of the above this formula was used to duplicate the first column.
DelaySummary<- cbind(FlightsDelayed, Total=TotalFlights$Total)
DelaySummary<- DelaySummary %>% mutate(DelayPercentage = Delayed/Total)
DelaySummary## # A tibble: 10 x 5
## # Groups: Airline [2]
## Airline Destination Delayed Total DelayPercentage
## <fct> <chr> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 559 0.111
## 2 ALASKA Phoenix 12 233 0.0515
## 3 ALASKA San.Diego 20 232 0.0862
## 4 ALASKA San.Francisco 102 605 0.169
## 5 ALASKA Seattle 305 2146 0.142
## 6 AM WEST Los.Angeles 117 811 0.144
## 7 AM WEST Phoenix 415 5255 0.0790
## 8 AM WEST San.Diego 65 448 0.145
## 9 AM WEST San.Francisco 129 449 0.287
## 10 AM WEST Seattle 61 262 0.233
High Level Summary
I created a high level summary without Destination.
FlightsDelayedHL <- arrival %>% group_by(Airline) %>% filter(Status == 'Delayed') %>%summarise(Delayed = sum(Flights))
TotalFlightsHL <- arrival %>% group_by(Airline) %>% summarise(Total = sum(Flights))
DelaySummaryHL<-cbind(FlightsDelayedHL,Total=TotalFlightsHL$Total)
DelaySummaryHL<- DelaySummaryHL %>% mutate(HLDelayPercentage = DelaySummaryHL$Delayed/DelaySummaryHL$Total)
DelaySummaryHL## Airline Delayed Total HLDelayPercentage
## 1 ALASKA 501 3775 0.1327152
## 2 AM WEST 787 7225 0.1089273
Alaska Airlines has a higher delay percentage over AM West however Alaska Airlines had fewer Total Flights.
Visuals
I used geom_point as the visual.
The point geom is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables. It can be used to compare one continuous and one categorical variable, or two categorical variables which I thought was fitting in this case.
ggplot(DelaySummary, aes(Destination, Delayed)) + geom_point(aes(color=Airline),size=8,alpha=1/2)