The task for this week’s assignment is to:
Load the packages necessary for this assignment.
The first step is to recreate the table from the assignment rubric.
This next step is asking us to import the .csv file created above and preform different functions to transform the data and get it ready for analysis.
This first step is loading the data
We notice that the data isn’t set up great for analysis. The first thing to do is rename all the columns for readability. Second, there is a full empty row that needs to be removed. Third, the company name is only available for one of the rows that belongs to the airline. This needs to be fixed. And finally, we want to change the shape of the data from wide to long as it will be easier to analyze.
#Remove periods in column names
names(flightdata) <- gsub("\\.","_",names(flightdata))
#Rename first two columns
flightdata <- rename(flightdata,c("Airline" = "X_U_FEFF_", "Flight_Status" = "X"))#Convert blanks to na in Airline Company
flightdata$Airline <- na_if(flightdata$Airline,"")
#Fill company names to appropriate rows
flightdata <- fill(flightdata,Airline,.direction = "down")#Make data long format
flightdata_long <- flightdata %>%
gather(Airport,Flight_Counts,Los_Angeles:Seattle, factor_key = TRUE)
#Replace underscore with space
flightdata_long$Airport <- gsub("\\_"," ",flightdata_long$Airport)
#Transpose on-time and delayed
flightdata_long <- flightdata_long %>%
spread(Flight_Status,Flight_Counts)Now that the data is in the long format we are able to run some analyses. First we want to compare the overall on-time performance for both airlines. Then compare the per-city on-time performance for both airlines.
To understand the overall on time performance of each airline we create a column total_on_time that groups by airline and divides the sum of the on time flights by the sum of the on time and delayed flights.
Using the new column we create Fig 1. From Fig 1, we see that AM WEST has a better overall on time percent for it’s flights than ALASKA.
First we have to create a new column called on_time_per, which is the number of on time flights divided by the total flights to each airport. Keeping in mind to group by airline company as well.
We can use the new column to create Fig 2. After examining Fig 2 we see that AM WEST in reality has less on time flights at every airport in the data, which contradicts Fig 1.
The paradoxical conclusion here is that AM WEST is both the airline with the most on time flights and the airline with the least on time flights.
To understand this paradox we can look at the scatter plot in Fig 3 and the box plot in Fig 4. We see that both airlines have an outlier in their data, for ALASKA it’s the flights to Seattle and for AM WEST it’s the flights to Phoenix. When we split the data by airport this outlier is not an issue as we are grouping the on time performance by both airline and airport as seen in Fig 2. However, when we only group by airlines and we exclude airports, the additional 4,840 on time Phoenix flights skews the percent of total on time flights for AM WEST more than the 1,841 on time Seattle flights skews the percent for on time flights for ALASKA.
The reason we see AM WEST have 89% on time total flights compared to ALASKA’s 87% on time is because of these outliers. In fact if we look at the median of the box plot in Fig 4, we see that AM WEST’s on time flights (383) is lower than ALASKA’s (497) flights. The recommendation here is to look deeper into the data than just base summaries as you may find that your initial conclusion might be skewed by outliers. Sometimes it is worth looking deeper into the data than just base percent calculations or else you may end up with paradoxical conclusions like the one above.