The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Lets import the data from a .CSV file hosted in GitHub.
flight.data <- as_tibble(read.csv("https://raw.githubusercontent.com/JMawyin/MSDS2019-607/master/HW4FlightData.csv", na = "NULL",stringsAsFactors=FALSE))
str(flight.data)
## Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 7 variables:
## $ X : chr "ALASKA" "" "" "AM WEST" ...
## $ X.1 : chr "On time" "Delayed" "" "On time" ...
## $ Los.Angeles : int 497 62 NA 694 117
## $ Phoenix : int 221 12 NA 4840 415
## $ San.Diego : int 212 20 NA 383 65
## $ San.Francisco: int 503 102 NA 320 129
## $ Seattle : int 1841 305 NA 201 61
Then lets rename our columns and remove the empty rows with no data.
##Rename column names
colnames(flight.data) <- c("Airline", "Status", "Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle")
##To remove rows with both NAs and empty
flight.data <- flight.data[!apply(is.na(flight.data) | flight.data == "", 1, all),]
flight.data
We can use the loop below to complete the missing airline name entries in the Airline column.
for (row in 2:length(flight.data$Airline)){ # 2 so you don't affect column names
if(flight.data$Airline[row] == "") { # if its empty...
flight.data$Airline[row] = flight.data$Airline[row-1] # ...replace with previous row's value
}
}
flight.data
Now we can use the Gather function to arrange our data into long form and facilitate later analysis.
l.flight.data <- flight.data %>% gather(City, Count, -Airline, -Status)
head(l.flight.data, 5)
Filtering columns by Flight Status (“On time” or “Delayed”) and renaming by airport entries to signify count of arrivals by Flight Status (“OT”, “DL”)
OT <- filter(l.flight.data, Status == "On time")
DL <- filter(l.flight.data, Status == "Delayed")
First, lets calculate the total number of flights from the on-time and delayed flights count.
Total.Flights <- OT[,4]+DL[,4]
colnames(Total.Flights) <- c("Total Flights")
With the total number of flights we can calculate the percentage of on-time and delayed flights.
percent.OT <- (100*OT[,4]/Total.Flights) %>% round(digits = 0)
percent.DL <- (100*DL[,4]/Total.Flights) %>% round(digits = 0)
Lets bind together all the data of interest under the columns “Airline”, “City”, “Flights_On_Time” and “Flights_Delayed”.
flight.analysis <- cbind(OT[,1], OT[,3], percent.OT, percent.DL )
colnames(flight.analysis) <- c("Airline", "City", "Flights_On_Time", "Flights_Delayed")
We can do a simple comparisson showing that the flights from Alaska airlines arriving in Phoneniz have the hightest percentage of flights arriving on-time.
arrange(flight.analysis, desc(Flights_On_Time))
We can also show that the flights from Alaska airlines have an overall higher percentage (89%) of flights arriving on-time compared to AM West airlines (82%).
AL.DATA <- filter(flight.analysis, Airline == "ALASKA")
mean(AL.DATA$Flights_On_Time) %>% round(digits = 0)
## [1] 89
AM.DATA <- filter(flight.analysis, Airline == "AM WEST")
mean(AM.DATA$Flights_On_Time) %>% round(digits = 0)
## [1] 82