Create CSV file or MySQL DB that includes all of required info.
# As we created CSV file that mirrors the briefing doc, we'll read data from that CSV file into R.
flightstatus.sourcefile <- ("G:/OneDrive/Learning/_CUNY_SPS_MSDS/2018_Spring/DATA 607/Week 5/Assignment/Numbersense.csv")
fs <- read.csv(flightstatus.sourcefile, header = T, sep= ",", stringsAsFactors = F)Prep the table for tidying.
# We add meaningful column names where missing in the source table, using rename before we get to tidying.
fs = fs %>% rename("carrier" = "X", "status" = "X.1")
# Attempted to use filter to remove the spacing row without success, so changed tacks calling na.omit on the status column. Status column is best bet because:
# 1) It constitutes a variable
# 2) Carrier values for several rows are blank as a function of the source table, so they don't constitute a good test
# 3) There's a possibility that city columns might not be populated, and that those NAs actually constitued 0s
fs <- na.omit(fs, cols = status)
# We remove space in ontime status.
fs$status <- str_replace_all(fs$status, fixed(" "), "")
# We fill in blank carrier values - next time will take generalizable approach using conditional logic, but for now we'll do a local fix.
fs$carrier[2] <- c("ALASKA")
fs$carrier[4] <- c("AM WEST")Use a wide structure to practice tidying and transformation.
# First, we effect a wide structure using gather.
fs.tidy <- gather(fs, "city", "n", 3:7)
tbl_df(fs.tidy)## # A tibble: 20 x 4
## carrier status city n
## <chr> <chr> <chr> <int>
## 1 ALASKA ontime Los.Angeles 497
## 2 ALASKA delayed Los.Angeles 62
## 3 AM WEST ontime Los.Angeles 694
## 4 AM WEST delayed Los.Angeles 117
## 5 ALASKA ontime Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST ontime Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA ontime San.Diego 212
## 10 ALASKA delayed San.Diego 20
## 11 AM WEST ontime San.Diego 383
## 12 AM WEST delayed San.Diego 65
## 13 ALASKA ontime San.Francisco 503
## 14 ALASKA delayed San.Francisco 102
## 15 AM WEST ontime San.Francisco 320
## 16 AM WEST delayed San.Francisco 129
## 17 ALASKA ontime Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST ontime Seattle 201
## 20 AM WEST delayed Seattle 61
Experimenting a little with tidy-in / tidy-out.
# Spread can reverse the tidying operation, returning the previous data frame.
spread(fs.tidy, city, n)## carrier status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA delayed 62 12 20 102 305
## 2 ALASKA ontime 497 221 212 503 1841
## 3 AM WEST delayed 117 415 65 129 61
## 4 AM WEST ontime 694 4840 383 320 201
# And it can also cut the data for us in different ways
spread(fs.tidy, status, n)## carrier city delayed ontime
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
spread(fs.tidy, carrier, n)## status city ALASKA AM WEST
## 1 delayed Los.Angeles 62 117
## 2 delayed Phoenix 12 415
## 3 delayed San.Diego 20 65
## 4 delayed San.Francisco 102 129
## 5 delayed Seattle 305 61
## 6 ontime Los.Angeles 497 694
## 7 ontime Phoenix 221 4840
## 8 ontime San.Diego 212 383
## 9 ontime San.Francisco 503 320
## 10 ontime Seattle 1841 201
Perform analysis to compare arrival delays for two airlines.
# First, we'll calculcate total flights by city, irrespective of carrier. This will give us a lay of the land.
fs.tidy %>% group_by(city) %>%
summarise(total = sum(n)) %>%
arrange(desc(total)) -> flightsbycity
# A bar chart will serve nicely to visualize this.
ggplot(flightsbycity, aes(x = reorder(city, -total), y = total)) +
geom_bar(stat = "identity") +
labs(x = NULL, y = "flights") +
scale_y_continuous(breaks = seq(0,6000,1000))# Phoenix sees the highest volume overall between these two carriers. The remaining four cities together have the same volume as Phoenix.# Next, we'll look at carriers, diving into overall on-time performance by carrier. We use mutate on the tidily vectorized columns, calculating total flights (total) and on-time performance (OTperf).
fs.tidy %>% spread(status, n) %>%
group_by(carrier) %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>%
select(carrier, OTperf) %>%
arrange(desc(OTperf))## # A tibble: 2 x 2
## carrier OTperf
## <chr> <dbl>
## 1 AM WEST 0.891
## 2 ALASKA 0.867
# With the five cities pooled, the carriers are pretty close in terms of on-time performance. AM West is slightly ahead of Alaska. This is consistent with industry benchmarks - few carriers see better performance than 90% nationwide, and many are in the 80% decile.# Next, we'll look at on-time performance by city, irrespective of carrier. This will reveal how cities compare with each other.
fs.tidy %>% spread(status, n) %>%
group_by(city) %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>%
select(city, OTperf) %>%
arrange(desc(OTperf)) -> OTPbycity
# Again, a bar chart will serve nicely to visualize this.
ggplot(OTPbycity, aes(x = reorder(city, -OTperf), y = OTperf)) +
geom_bar(stat = "identity") +
labs(x = NULL, y = "% departed on-time") # scale_y_continuous(breaks = seq(0,6000,1000))
# We might think that higher volume would correlate with poorer on-time performance (i.e. due to logistical complexity or delay cascades); however, the highest trafficked city for these two carriers also has the highest ontime performance. That said, this data does not seem representative of broader industry patterns - SFO served 55 million passengers in 2017, while Phoenix served 43 million. Between the two carriers, SFO did poorly in terms of on-time departures. # Given SFO's on-time perforamnce is poor, we'd like to understand what's driving this. Alaska is less on-time across the five cities - does SFO's performance have something to do with Alaska? We'll total traffic and on-time performance by carrier to find out.
fs.tidy %>% spread(status, n) %>%
filter(city %in% c("San.Francisco")) %>%
group_by(carrier) %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>%
select(carrier, total, OTperf) %>%
arrange(desc(OTperf))## # A tibble: 2 x 3
## carrier total OTperf
## <chr> <int> <dbl>
## 1 ALASKA 605 0.831
## 2 AM WEST 449 0.713
# While Alaska has higher traffic through SFO, it's on-time performance is actually the higher of the two carriers. AM West has lower throughput but much lower on-time performance.# Finally, let's look at the on-time perfomance of each carrier by city. First, Alaska:
fs.tidy %>% spread(status, n) %>%
group_by(city) %>%
filter(carrier == "ALASKA") %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>%
select(city, total, OTperf) %>%
arrange(desc(OTperf)) -> ALASKA.OTPbycity
ALASKA.OTPbycity## # A tibble: 5 x 3
## city total OTperf
## <chr> <int> <dbl>
## 1 Phoenix 233 0.948
## 2 San.Diego 232 0.914
## 3 Los.Angeles 559 0.889
## 4 Seattle 2146 0.858
## 5 San.Francisco 605 0.831
# For Alaska, Seattle (a high volume city) and San Francisco (in the middle of the pack) are the poorest performers. As Seattle is a high volume hub for Alaska, Seattle's lower performance drags down Alaska's average.# Finally, let's look at the on-time perfomance of each carrier by city. First, Alaska:
fs.tidy %>% spread(status, n) %>%
group_by(city) %>%
filter(carrier == "AM WEST") %>%
summarise(delayed = sum(delayed), ontime = sum(ontime)) %>%
mutate(total = delayed + ontime, OTperf = round((ontime / total), 3)) %>%
select(city, total, OTperf) %>%
arrange(desc(OTperf)) -> AMWESTA.OTPbycity
AMWESTA.OTPbycity## # A tibble: 5 x 3
## city total OTperf
## <chr> <int> <dbl>
## 1 Phoenix 5255 0.921
## 2 Los.Angeles 811 0.856
## 3 San.Diego 448 0.855
## 4 Seattle 262 0.767
## 5 San.Francisco 449 0.713
# For AM West, Seattle (the lowest volume city) and San Francisco (middle of the pack) are the poorest performers. As Phoenix is responsible for the highest flight volume for AM West, Phoenix's higher performance elevates AM West's average in spite of lower perforamnce at other cities.Code should be in R Markdow file posted to rpubs.com Include narrative descriptions of data cleanup work, analysis, and conclusions