Rpubs link: http://rpubs.com/jefflittlejohn/Data_607_Week__5 Github: https://github.com/littlejohnjeff/DATA607_Fall2018/blob/master/Data%20607%20-%20Week%205%20Assignment%20-%20Tidying%20and%20Transforming%20Data%20-%20Jeff%20Littlejohn.Rmd
The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository.??? The URL for your rpubs.com web page.
Let’s start.
#load libraries
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
Load our data.
#Reading csv file into dataframe called flight_info.
flight_info <- tbl_df(read.csv("https://raw.githubusercontent.com/littlejohnjeff/DATA607_Fall2018/master/Numbersense.csv", header = FALSE, sep = ",", stringsAsFactors = FALSE))
flight_info
## # A tibble: 6 x 7
## V1 V2 V3 V4 V5 V6 V7
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "" "" Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 "" delayed 62 12 20 102 305
## 4 "" "" "" "" "" "" ""
## 5 AM WEST on time 694 4840 383 320 201
## 6 "" delayed 117 415 65 129 61
Clean up the dataframe to prepare for fancier operations.
#add column names
flight_col_names <- c("Airline","Arrival_Status",flight_info[1,3:7])
names(flight_info) <- flight_col_names
#add airline name to blank fields in first column
flight_info[3,1] <- flight_info[2,1]
flight_info[6,1] <- flight_info[5,1]
#remove empty 3rd row
flight_info <- flight_info[!apply(flight_info == "", 1, all),]
#remove first row that has city names
flight_info <- flight_info[2:5,]
flight_info
## # A tibble: 4 x 7
## Airline Arrival_Status `Los Angeles` Phoenix `San Diego` `San Francisco`
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ALASKA on time 497 221 212 503
## 2 ALASKA delayed 62 12 20 102
## 3 AM WEST on time 694 4840 383 320
## 4 AM WEST delayed 117 415 65 129
## # ... with 1 more variable: Seattle <chr>
Now let’s use gather to unpivot the city names into one (tidy!) column.
#first two arguments in gather give names to the columns that are gather/unpivoted
#the minus sign tells gather to leave Airline and Arrival_Status as is
flight_info <- flight_info %>%
gather(City,Flight_Count,-Airline,-Arrival_Status)
We still need to do one more thing before starting analysis - make sure the Flight_Count field is numeric.
#Convert Flight_Count to numeric field to enable analysis
flight_info <- transform(flight_info, Flight_Count = as.numeric(Flight_Count))
Which airline is more reliable? Let’s create some subsets to facilitate analyis.
#delayed flights
delay_counts <- subset(flight_info, flight_info$Arrival_Status == "delayed")
#Alaska flights - total, on time, delayed
Alaska_flight_info <- subset(flight_info, flight_info$Airline == "ALASKA")
Alaska_ontime_counts <- subset(flight_info, flight_info$Airline == "ALASKA" & flight_info$Arrival_Status == "on time")
Alaska_delay_counts <- subset(flight_info, flight_info$Airline == "ALASKA" & flight_info$Arrival_Status == "delayed")
#Am West flights
Amwest_flight_info <- subset(flight_info, flight_info$Airline == "AM WEST")
Amwest_ontime_counts <- subset(flight_info, flight_info$Airline == "AM WEST" & flight_info$Arrival_Status == "on time")
Amwest_delay_counts <- subset(flight_info, flight_info$Airline == "AM WEST" & flight_info$Arrival_Status == "delayed")
#One check to verify it worked
head(Amwest_delay_counts)
## Airline Arrival_Status City Flight_Count
## 4 AM WEST delayed Los Angeles 117
## 8 AM WEST delayed Phoenix 415
## 12 AM WEST delayed San Diego 65
## 16 AM WEST delayed San Francisco 129
## 20 AM WEST delayed Seattle 61
Which airline has a higher percentage of delayed flights?
#Alaska delays
Alaska_delay_sum <- sum(Alaska_delay_counts$Flight_Count)
Alaska_flight_sum <- sum(Alaska_flight_info$Flight_Count)
Alaska_delay_rate <- Alaska_delay_sum/Alaska_flight_sum
print("Alaska delay count:")
## [1] "Alaska delay count:"
Alaska_delay_sum
## [1] 501
print("Alaska delay rate:")
## [1] "Alaska delay rate:"
Alaska_delay_rate
## [1] 0.1327152
#Amwest delays
Amwest_delay_sum <- sum(Amwest_delay_counts$Flight_Count)
Amwest_flight_sum <- sum(Amwest_flight_info$Flight_Count)
Amwest_delay_rate <- Amwest_delay_sum/Amwest_flight_sum
print("Amwest delay count:")
## [1] "Amwest delay count:"
Amwest_delay_sum
## [1] 787
print("Amwest delay rate:")
## [1] "Amwest delay rate:"
Amwest_delay_rate
## [1] 0.1089273
We see that although Amwest has more delays, they actually have a lower rate of delays per flight. Let’s plot the delay counts to see how this looks.
ggplot(flight_info,aes(flight_info$Airline,flight_info$Flight_Count, colour = flight_info$Arrival_Status)) + geom_point()
Amwest has one big on-time outlier, as does Alaska. Let’s check to see if they are both the same city. If not, instead of choosing one airline over the other based on overall delay rate, we might be selective and choose an airline based on which city from which we’re flying.
ggplot(delay_counts,aes(delay_counts$City,delay_counts$Flight_Count, colour = delay_counts$Airline)) + geom_point()
We see that Phoenix is home to the largest number of Amwest delays, and Seattle is home to the biggest portion of Alaska delays, but we can’t determine if that’s due to a large volume of total flights. Let’s get messy with ggplot.
ggplot(flight_info,aes(flight_info$City,flight_info$Flight_Count, colour = flight_info$Airline,flight_info$Arrival_Status)) + geom_text(aes(label=paste0(flight_info$Arrival_Status,flight_info$Airline)))
Based on this more detailed analysis, independent of price and schedule convenience, it looks like we should fly Alaska out of Seattle and San Francisco and Amwest everywhere else.