This R Markdown document describes the Flight Status Compare and Analysis exercise. The data is reshaped using different packages and tidied for further analysis.
knitr::opts_chunk$set(message = FALSE, echo = TRUE)
# Library for string manipulation/regex operations
library(stringr)
# Library for data display in tabular format
library(DT)
# Library to read text file
library(RCurl)
# Library to melt (to long format) and cast (to wide format) data
library(reshape)
# Library to gather (to long format) and spread (to wide format) data, to tidy data
library(tidyr)
# Library to filter, transform data
library(dplyr)
# Library to plot
library(ggplot2)
library(knitr)
Loading the flight status data by reading the CSV file from its GitHub location, with header set to TRUE
flightstatus.giturl <- "https://raw.githubusercontent.com/DataDriven-MSDA/DATA607/master/Week5A/flightstatus.csv"
flightstatus.gitdata <- getURL(flightstatus.giturl)
# read.csv() already defaults to sep = "," and dec = ".", which matches a comma-separated file
flightdata <- read.csv(text = flightstatus.gitdata, header = TRUE, stringsAsFactors = FALSE)
datatable(flightdata)
Melting the data from wide format to long format using melt() from the reshape package
melted.flightdatasdf <- melt(flightdata, id = c("Airline", "Status"))
colnames(melted.flightdatasdf) <- c("Airline", "Status", "City", "#Flights")
melted.flightdatasdf <- melted.flightdatasdf %>% mutate(City = gsub("\\.", " ", City))
## Warning in mutate_impl(.data, dots): '.Random.seed' is not an integer
## vector but of type 'NULL', so ignored
datatable(melted.flightdatasdf)
Casting the data from long format to wide format using cast() from the reshape package
newflightdata <- cast(melted.flightdatasdf, Airline + City ~ Status)
colnames(newflightdata) <- c("Airline", "City", "Delayed", "OnTime")
datatable(newflightdata)
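Once the value column has been renamed, cast() can no longer rely on the default column name `value`. The following is a minimal sketch of the same melt/cast round trip with the value column named explicitly. It assumes a small made-up table: `toy_wide`, its airline labels, and its counts are hypothetical and are not taken from the flight data.

```r
library(reshape)

# Hypothetical toy table in the same wide layout as flightdata (counts are made up)
toy_wide <- data.frame(
  Airline  = c("A", "A", "B", "B"),
  Status   = c("delayed", "on time", "delayed", "on time"),
  City.One = c(1, 9, 2, 8),
  City.Two = c(3, 7, 4, 6),
  stringsAsFactors = FALSE
)

# Wide to long: the city columns become rows; the measured values land in "value"
toy_long <- melt(toy_wide, id.vars = c("Airline", "Status"), variable_name = "City")

# Long to wide: one column per Status, naming the value column explicitly
toy_cast <- cast(toy_long, Airline + City ~ Status, value = "value")
```

Passing `value` explicitly keeps the cast step independent of how the columns were renamed earlier.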
Using gather() to convert to long format
tidylongflightdata <- flightdata %>% gather(City, NumFlights, Los.Angeles:Seattle,
na.rm = FALSE)
tidylongflightdata <- tidylongflightdata %>% mutate(City = gsub("\\.", " ", City))
datatable(tidylongflightdata)
Using spread() to convert to wide format
tidywideflightdata <- tidylongflightdata %>% spread(Status, NumFlights)
colnames(tidywideflightdata) <- c("Airline", "City", "Delayed", "OnTime")
datatable(tidywideflightdata)
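In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below shows what the equivalent reshaping might look like. It runs on a small hypothetical table (`toy_wide` and its counts are made up for illustration) rather than on the flight data itself.

```r
library(tidyr)

# Hypothetical toy table in the same wide layout as flightdata (counts are made up)
toy_wide <- data.frame(
  Airline  = c("A", "A", "B", "B"),
  Status   = c("delayed", "on time", "delayed", "on time"),
  City.One = c(1, 9, 2, 8),
  City.Two = c(3, 7, 4, 6),
  stringsAsFactors = FALSE
)

# Wide to long: one row per Airline/Status/City combination
toy_long <- pivot_longer(toy_wide,
                         cols = c(City.One, City.Two),
                         names_to = "City",
                         values_to = "NumFlights")

# Long to wide: one column per Status value
toy_status_wide <- pivot_wider(toy_long,
                               names_from = Status,
                               values_from = NumFlights)
```

The resulting status columns keep their original labels ("delayed", "on time"), so a rename step like the colnames() call above would still be needed to get syntactic names.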
Using the dplyr package to add on-time and delayed probability columns and to arrange the rows by destination city
ontimeprob_tidwideflightdata <- tidywideflightdata %>% mutate(OnTimePct = round(OnTime/(Delayed +
OnTime), 3), DelayedPct = round(Delayed/(Delayed + OnTime), 3)) %>% arrange(City)
datatable(ontimeprob_tidwideflightdata)
Plotting the on-time probability by destination city for both airlines. Although Alaska Airlines may appear less efficient at first glance, the plot shows that it is more reliable: its on-time probability is higher than that of AM West.
ggplot(ontimeprob_tidwideflightdata, aes(x = City, y = OnTimePct, fill = Airline)) +
xlab("Destination City") + ylab("On-Time Probability") + ggtitle("Airline On-Time Probability Compare") +
geom_bar(stat = "identity", position = position_dodge()) + scale_fill_manual(values = c("yellow",
"red"))
The plot shows that, although Alaska Airlines’ raw on-time flight figures make it appear less reliable than AM West, the ratio of on-time flights to total flights shows that Alaska’s on-time probability is better than AM West’s. This kind of reversal, where a pattern that seems obvious in aggregated figures disappears or flips once the data is broken down by group, is called Simpson’s Paradox.
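To make the reversal concrete, here is a minimal sketch on hypothetical numbers. The airlines "X" and "Y", the cities "P" and "Q", and all counts are invented for illustration and are not the flight data. Airline X has the better on-time share in each city, yet Y has the better share once the cities are pooled, because most of Y's flights go to the city where delays are rare.

```r
library(dplyr)

# Hypothetical counts, chosen only to illustrate the paradox
toy <- data.frame(
  Airline = c("X", "X", "Y", "Y"),
  City    = c("P", "Q", "P", "Q"),
  OnTime  = c(90, 700, 850, 60),
  Delayed = c(10, 300, 150, 40),
  stringsAsFactors = FALSE
)

# Per-city on-time share: X leads in both cities (0.90 vs 0.85, 0.70 vs 0.60)
by_city <- toy %>%
  mutate(OnTimePct = OnTime / (OnTime + Delayed))

# Pooled share per airline: sum the counts first, then divide
# X: 790 / 1100, Y: 910 / 1100, so Y leads overall
overall <- toy %>%
  group_by(Airline) %>%
  summarise(OnTime = sum(OnTime), Delayed = sum(Delayed)) %>%
  mutate(OnTimePct = OnTime / (OnTime + Delayed))
```

Applying the same group_by()/summarise() step to the tidied flight table would give the pooled on-time share per airline for comparison with the per-city plot.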