Data607_Assignment05

Read in data
Update column names
Remove blank rows
Cascade company names
Convert to long format
Clean columns
Return to wide format with ontime/delayed as columns
Derive percent delayed, percent on time columns
Visualize

Github for homework: https://github.com/rweberc/Data607_Assignment5

Read in data

schedDf <- read.csv("https://raw.githubusercontent.com/rweberc/Data607_Assignment5/master/PlaneTimes.csv",
                    stringsAsFactors = FALSE)
schedDf

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Update column names

colnames(schedDf)[1:2] <- c("company", "timeType")
schedDf

##   company timeType Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2          delayed          62      12        20           102     305
## 3                           NA      NA        NA            NA      NA
## 4 AM WEST  on time         694    4840       383           320     201
## 5          delayed         117     415        65           129      61

Remove blank rows

schedDf <- schedDf %>% filter(timeType != "")
schedDf

##   company timeType Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2          delayed          62      12        20           102     305
## 3 AM WEST  on time         694    4840       383           320     201
## 4          delayed         117     415        65           129      61

Cascade company names

schedDf$company[schedDf$company == ""] <- NA
schedDf <- schedDf %>% fill(company)
schedDf

##   company timeType Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA  on time         497     221       212           503    1841
## 2  ALASKA  delayed          62      12        20           102     305
## 3 AM WEST  on time         694    4840       383           320     201
## 4 AM WEST  delayed         117     415        65           129      61

Convert to long format

schedDf <- gather(schedDf, "destination", "occurrences", 3:length(schedDf))

Clean columns

schedDf$occurrences <- as.numeric(schedDf$occurrences)

schedDf$timeType[schedDf$timeType == "on time"] <- "on.time"

schedDf$destination <- str_replace(schedDf$destination, "\\.", " ")

Return to wide format with ontime/delayed as columns

Create two dataframes: one with overall values (compDf), another that preserves the destination information (compDestDf)

compDf <- schedDf %>% 
  group_by(company, timeType) %>% 
  summarize(occurrences = sum(occurrences))

compDestDf <- schedDf %>% spread(timeType, occurrences)
compDf <- compDf %>% spread(timeType, occurrences)

compDestDf

##    company   destination delayed on.time
## 1   ALASKA   Los Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

compDf

## # A tibble: 2 x 3
## # Groups:   company [2]
##   company delayed on.time
##   <chr>     <dbl>   <dbl>
## 1 ALASKA      501    3274
## 2 AM WEST     787    6438

Derive percent delayed, percent on time columns

compDf$percentDelayed <- round(compDf$delayed/(compDf$delayed + compDf$on.time), 2) 
compDestDf$percentDelayed <- round(compDestDf$delayed/(compDestDf$delayed + compDestDf$on.time), 2) 

compDestDf

##    company   destination delayed on.time percentDelayed
## 1   ALASKA   Los Angeles      62     497           0.11
## 2   ALASKA       Phoenix      12     221           0.05
## 3   ALASKA     San Diego      20     212           0.09
## 4   ALASKA San Francisco     102     503           0.17
## 5   ALASKA       Seattle     305    1841           0.14
## 6  AM WEST   Los Angeles     117     694           0.14
## 7  AM WEST       Phoenix     415    4840           0.08
## 8  AM WEST     San Diego      65     383           0.15
## 9  AM WEST San Francisco     129     320           0.29
## 10 AM WEST       Seattle      61     201           0.23

compDf

## # A tibble: 2 x 4
## # Groups:   company [2]
##   company delayed on.time percentDelayed
##   <chr>     <dbl>   <dbl>          <dbl>
## 1 ALASKA      501    3274          0.130
## 2 AM WEST     787    6438          0.110

Visualize

ggplot(compDf, aes(x=company, y=percentDelayed)) + 
  geom_bar(stat = "identity")  +
  labs(title = "Overall Percent of Delayed Flights", x = "Company", y = "Percent Delayed Flights") +
  theme(plot.title = element_text(hjust = 0.5))

From the above, we can see that even though AM WEST has a slightly lower percentage of delayed flights than ALASKA; however…

ggplot(compDestDf, aes(x=company, y=percentDelayed)) + 
  geom_bar(aes(fill = destination), position = "dodge", stat = "identity") +
  labs(title = "Delayed Flights By Destination", x = "Company", y = "Percent Delayed Flights") +
  theme(plot.title = element_text(hjust = 0.5))

Looking at individual destinations, we can see that trips to Phoenix have the smallest percentage of delayed flights for both companies, and AM WEST happens to have the majority of their arrivals to this location. AM WEST actually has a much greater percentage of delayed flights to all locations.