Introduction - Week 5 Assignemnt

The task for this week’s assignment is to:

Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below,
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
- The URL to the .Rmd file in your GitHub repository.
- The URL for your rpubs.com web page.

Load Packages

Load the packages necessary for this assignment.

library(tidyverse)
library(RCurl)
library(knitr)
library(ggrepel)
library(cowplot)

Part 1 - Create a .csv file

The first step is to recreate the table from the assignment rubric.

Part 2 - Load file and tranform data

This next step is asking us to import the .csv file created above and preform different functions to transform the data and get it ready for analysis.

Loading the data

This first step is loading the data

data <-
  getURL("https://raw.githubusercontent.com/sbiguzzi/data607assignment5/master/flightdata.csv")
flightdata <- read.csv(text = data)

Transforming data

We notice that the data isn’t set up great for analysis. The first thing to do is rename all the columns for readability. Second, there is a full empty row that needs to be removed. Third, the company name is only available for one of the rows that belongs to the airline. This needs to be fixed. And finally, we want to change the shape of the data from wide to long as it will be easier to analyze.

Renaming columns

#Remove periods in column names
names(flightdata) <- gsub("\\.","_",names(flightdata))
#Rename first two columns
flightdata <- rename(flightdata,c("Airline" = "X_U_FEFF_", "Flight_Status" = "X"))

Remove third row

#Remove third now
flightdata <- flightdata[-c(3),]
#Resetting index
rownames(flightdata) <- NULL

Fill airline company name

#Convert blanks to na in Airline Company
flightdata$Airline <- na_if(flightdata$Airline,"")
#Fill company names to appropriate rows
flightdata <- fill(flightdata,Airline,.direction = "down")

Convert data from wide to long

#Make data long format
flightdata_long <- flightdata %>%
  gather(Airport,Flight_Counts,Los_Angeles:Seattle, factor_key = TRUE)
#Replace underscore with space
flightdata_long$Airport <- gsub("\\_"," ",flightdata_long$Airport)
#Transpose on-time and delayed
flightdata_long <- flightdata_long %>%
  spread(Flight_Status,Flight_Counts)

Part 3 - Analysis

Now that the data is in the long format we are able to run some analyses. First we want to compare the overall on-time performance for both airlines. Then compare the per-city on-time performance for both airlines.

Compare the overall on-time performance for both airlines

To understand the overall on time performance of each airline we create a column total_on_time that groups by airline and divides the sum of the on time flights by the sum of the on time and delayed flights.

Using the new column we create Fig 1. From Fig 1, we see that AM WEST has a better overall on time percent for it’s flights than ALASKA.

Compare the per city on-time performance by airline

First we have to create a new column called on_time_per, which is the number of on time flights divided by the total flights to each airport. Keeping in mind to group by airline company as well.

We can use the new column to create Fig 2. After examining Fig 2 we see that AM WEST in reality has less on time flights at every airport in the data, which contradicts Fig 1.

Part 4 - Conclusion: The Paradox

The paradoxical conclusion here is that AM WEST is both the airline with the most on time flights and the airline with the least on time flights.

To understand this paradox we can look at the scatter plot in Fig 3 and the box plot in Fig 4. We see that both airlines have an outlier in their data, for ALASKA it’s the flights to Seattle and for AM WEST it’s the flights to Phoenix. When we split the data by airport this outlier is not an issue as we are grouping the on time performance by both airline and airport as seen in Fig 2. However, when we only group by airlines and we exclude airports, the additional 4,840 on time Phoenix flights skews the percent of total on time flights for AM WEST more than the 1,841 on time Seattle flights skews the percent for on time flights for ALASKA.

The reason we see AM WEST have 89% on time total flights compared to ALASKA’s 87% on time is because of these outliers. In fact if we look at the median of the box plot in Fig 4, we see that AM WEST’s on time flights (383) is lower than ALASKA’s (497) flights. The recommendation here is to look deeper into the data than just base summaries as you may find that your initial conclusion might be skewed by outliers. Sometimes it is worth looking deeper into the data than just base percent calculations or else you may end up with paradoxical conclusions like the one above.

Week 5 Assignment - Tidying and Transforming Data

Stefano Biguzzi

9/22/2020