Following are the goal’s for this assignment:
-Transform data between wide and long formats using tidyr package
-Change shapes of data frames using dplyr package
-Perform data transformations to support downstream data analysis
In this instance, used the assignment’s PDF document as the data source to create a csv dataset using Excel. This dataset stored on GitHub.
theURL = "https://raw.githubusercontent.com/CUNYSPS-RickRN/Team607-1/master/Wk05_airline_arrival_status.csv"
arrivals_df <- read_csv(theURL, col_names = TRUE)
## Parsed with column specification:
## cols(
## Airline = col_character(),
## Status = col_character(),
## LAX = col_double(),
## PHX = col_double(),
## SD = col_double(),
## SFO = col_double(),
## SEA = col_double()
## )
## # A tibble: 5 x 7
## Airline Status LAX PHX SD SFO SEA
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alaska on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Following three rules makes a dataset tidy: variables are in columns, observations are in rows, and values are in cells.
In this step, 3 steps performed to tidy data: - remove “blank line” by selecting status = on time or delayed - fill missing Airline data from previous row - pivot_longer to create observations of each origination/destination pair
Before and after views of dataset of shown
# Data transformation into tidy data form
arrivals_pivot_df <- arrivals_df %>%
subset(Status == ("on time") | Status == ("delayed")) %>%
fill(Airline) %>%
pivot_longer(c("LAX", "PHX", "SD", "SFO", "SEA"),
names_to = "dest", values_to = "numflights")
arrivals_pivot_df
## # A tibble: 20 x 4
## Airline Status dest numflights
## <chr> <chr> <chr> <dbl>
## 1 Alaska on time LAX 497
## 2 Alaska on time PHX 221
## 3 Alaska on time SD 212
## 4 Alaska on time SFO 503
## 5 Alaska on time SEA 1841
## 6 Alaska delayed LAX 62
## 7 Alaska delayed PHX 12
## 8 Alaska delayed SD 20
## 9 Alaska delayed SFO 102
## 10 Alaska delayed SEA 305
## 11 AM WEST on time LAX 694
## 12 AM WEST on time PHX 4840
## 13 AM WEST on time SD 383
## 14 AM WEST on time SFO 320
## 15 AM WEST on time SEA 201
## 16 AM WEST delayed LAX 117
## 17 AM WEST delayed PHX 415
## 18 AM WEST delayed SD 65
## 19 AM WEST delayed SFO 129
## 20 AM WEST delayed SEA 61
This analysis compares the per-city on-time performance for both airlines.
The first pair of graphs by airline shows the number of on-time and delayed flights by airport. Different shapes are used to distinguish between on-time and delayed status. AM West Airlines has a high concentration of flights at the Phoenix airport while Alaska Airlines has a high concentration of flights at the Seattle airport.
The third graph shows the proportion of each airline’s on-time and delayed flights by airport. One noticeable observation of the Phoenix airport is AM West Airlines has a substantially higher proportion of on-time flights when compared with Alaska Airlines and when compared with all other airports in the sample. In the Seattle airport Alaska Airlines has a substantially higher proportion of on-time flights when compared with AM West Airlines and when compared with all other airports in the sample.
#
# plot by Airline dest/status num flights info
ggplot(data = arrivals_pivot_df) +
geom_point(mapping = aes(x = dest, y = numflights, color= Airline, shape = Status)) +
facet_wrap(~ Airline)
# create new var prop_totalflights and calc prop of on-time by destination
flight_props2 <- arrivals_pivot_df %>%
group_by(dest) %>%
# filter(Status == "on time") %>%
mutate(prop_totflights = numflights / sum(numflights))
# Compare per-city on-time performance for both airlines
ggplot(data = flight_props2) +
geom_point(mapping = aes(x = dest, y = prop_totflights, color = Airline, shape = Status))
AM West Airlines had a better performance of having approximately 89% of flights on-time while Alaska Airlines had approximately 87% of its flights on-time.
# calculate numerator
flight_props_n <- flight_props2 %>%
group_by(Airline) %>%
filter(Status == "on time", Airline == "Alaska") %>%
summarize(numflights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
# calculate denominator
flight_props_d <- flight_props2 %>%
group_by(Airline) %>%
filter(Airline == "Alaska") %>%
summarize(numflights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
OT_Alaska <- sum(flight_props_n$numflights) /sum(flight_props_d$numflights)
cat("Alaska Airlines had ", OT_Alaska, " of flights on-time.\n ",
"Number of on-time flights=",sum(flight_props_n$numflights),
" out of ",sum(flight_props_d$numflights))
## Alaska Airlines had 0.8672848 of flights on-time.
## Number of on-time flights= 3274 out of 3775
# Calculate numerator
flight_props_n <- flight_props2 %>%
group_by(Airline) %>%
filter(Status == "on time", Airline == "AM WEST") %>%
summarize(numflights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
# Calculate denominator
flight_props_d <- flight_props2 %>%
group_by(Airline) %>%
filter(Airline == "AM WEST") %>%
summarize(numflights)
## `summarise()` regrouping output by 'Airline' (override with `.groups` argument)
OT_AMWEST <- sum(flight_props_n$numflights) /sum(flight_props_d$numflights)
cat("AM West Airlines had ", OT_AMWEST, " of flights on-time.\n ",
"Number of on-time flights=",sum(flight_props_n$numflights),
" out of ",sum(flight_props_d$numflights))
## AM West Airlines had 0.8910727 of flights on-time.
## Number of on-time flights= 6438 out of 7225
Airlines with hub airports defined as those with high concentration of flights appear to have a higher number of on-time flights. Phoenix airport appears to be a hub for AM West Airlines while Seattle airport appears to be a hub for Alaska Airlines.
AM West Airlines had a better performance of having approximately 89% of flights on-time while Alaska Airlines had approximately 87% of its flights on-time. AM West airlines had more than twice the number of flights than Alaska airlines.