DATA 607, Week 5 – Tidying and Transforming Data

In this assignment, I practiced tidying and transforming data for downstream analysis.

The assignment included the following image of a flight arrivals table for two airlines:

I created a .CSV file from this table, read the file into R, transformed and tidied the data, and performed an analysis.

I used the following packages:

tidyr – to gather and separate the data
dplyr – to group and summarize the data
knitr and kableExtra – to create responsive HTML tables

1. Create and read in the CSV.

To begin, I entered the data in a CSV file exactly as it appeared in the image. Then, I saved the file to a folder on my desktop and read it into R.

flights <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Assignments/Assignment 05/flightdelays.csv",sep = ",", fill = TRUE, header = TRUE)

flights %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

ï..	X	Los.Angeles	Phoenix	San.Diego	San.Francisco	Seattle
ALASKA	on time	497	221	212	503	1841
	delayed	62	12	20	102	305
		NA	NA	NA	NA	NA
AM WEST	on time	694	4840	383	320	201
	delayed	117	415	65	129	61

2. Prepare the data.

I did a few preliminary steps to prepare the data for reshaping. First, I removed the third row of the dataset, which was empty.

flights <- flights[ -3, ]

I renamed the columns to be more descriptive.

names(flights) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

I filled in the empty rows of the Airline variable.

flights$Airline <- c("ALASKA", "ALASKA", 
                     "AM WEST", "AM WEST")

flights %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Airline	Status	Los Angeles	Phoenix	San Diego	San Francisco	Seattle
1	ALASKA	on time	497	221	212	503	1841
2	ALASKA	delayed	62	12	20	102	305
4	AM WEST	on time	694	4840	383	320	201
5	AM WEST	delayed	117	415	65	129	61

3. Gather the data.

Next, I reshaped the data from a wide format to a long format. I assumed that the numbers in the original table referred to number of flights.

I used the gather function to collapse Destination and number of flights (N) into 20 rows.

f_gathered <- gather(flights, "Destination", "N", 3:7)

f_gathered %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	Status	Destination	N
ALASKA	on time	Los Angeles	497
ALASKA	delayed	Los Angeles	62
AM WEST	on time	Los Angeles	694
AM WEST	delayed	Los Angeles	117
ALASKA	on time	Phoenix	221
ALASKA	delayed	Phoenix	12
AM WEST	on time	Phoenix	4840
AM WEST	delayed	Phoenix	415
ALASKA	on time	San Diego	212
ALASKA	delayed	San Diego	20
AM WEST	on time	San Diego	383
AM WEST	delayed	San Diego	65
ALASKA	on time	San Francisco	503
ALASKA	delayed	San Francisco	102
AM WEST	on time	San Francisco	320
AM WEST	delayed	San Francisco	129
ALASKA	on time	Seattle	1841
ALASKA	delayed	Seattle	305
AM WEST	on time	Seattle	201
AM WEST	delayed	Seattle	61

4. Spread the data.

I noticed that the Status and N columns could be further tidied. I spread the columns into two:

delayed – the number of flights that arrived with a delay at that destination and airline.
on time – the number of flights that arrived on time at that destination and airline.

f_spread <- spread(f_gathered, "Status", "N")

f_spread %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	Destination	delayed	on time
ALASKA	Los Angeles	62	497
ALASKA	Phoenix	12	221
ALASKA	San Diego	20	212
ALASKA	San Francisco	102	503
ALASKA	Seattle	305	1841
AM WEST	Los Angeles	117	694
AM WEST	Phoenix	415	4840
AM WEST	San Diego	65	383
AM WEST	San Francisco	129	320
AM WEST	Seattle	61	201

5. Add calculated columns.

With the data now tidy, I could begin calculating new variables to explore the results.

Using the mutate function in dplyr, I added three columns to the dataframe:

total – the total number of flights observed.
% delayed – the percentage of flights that arrived with a delay at that destination and airline.
% on time – the percentage of flights that arrived on time at that destination and airline.

f_mutated <- mutate(f_spread, "total" = f_spread$delayed + f_spread$`on time`)

## Warning: package 'bindrcpp' was built under R version 3.4.3

f_mutated <- mutate(f_mutated, "% delayed" = round(( f_mutated$delayed / f_mutated$total ) * 100, 2), 
                               "% on time" = round(( f_mutated$`on time` / f_mutated$total ) * 100, 2) )

f_mutated %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	Destination	delayed	on time	total	% delayed	% on time
ALASKA	Los Angeles	62	497	559	11.09	88.91
ALASKA	Phoenix	12	221	233	5.15	94.85
ALASKA	San Diego	20	212	232	8.62	91.38
ALASKA	San Francisco	102	503	605	16.86	83.14
ALASKA	Seattle	305	1841	2146	14.21	85.79
AM WEST	Los Angeles	117	694	811	14.43	85.57
AM WEST	Phoenix	415	4840	5255	7.90	92.10
AM WEST	San Diego	65	383	448	14.51	85.49
AM WEST	San Francisco	129	320	449	28.73	71.27
AM WEST	Seattle	61	201	262	23.28	76.72

6. Group and summarize the data.

Finally, I decided to calculate the percentage of flights from each airline that arrived on time and with a delay.

I grouped the data by airline, and used the summarise function to calculate the sum of on-time and delayed flights. Then, I added more columns using mutate to determine the percentages.

f_grouped <- group_by(f_mutated, Airline)

f_sum <- f_grouped %>% 
  summarise(delayed_sum = sum(`delayed`), 
            ontime_sum = sum(`on time`))

f_sum <- mutate(f_sum, 
                "total" = f_sum$delayed_sum + f_sum$ontime_sum)

f_sum <- mutate(f_sum, 
                "% delayed" = round(( f_sum$`delayed_sum` / f_sum$total ) * 100, 2),
                "% on time" = round(( f_sum$`ontime_sum` / f_sum$total ) * 100, 2) )

names(f_sum) <- c("Airline", "# Flights Delayed", "# Flights On Time", "Total # Flights", "% Flights Delayed", "% Flights on Time")

f_sum %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Airline	# Flights Delayed	# Flights On Time	Total # Flights	% Flights Delayed	% Flights on Time
ALASKA	501	3274	3775	13.27	86.73
AM WEST	787	6438	7225	10.89	89.11

7. Analysis.

Overall, I see that ALASKA has a higher percentage of flights that arrive with a delay compared to AM WEST. However, AM WEST has about 4,000 more flights recorded in the database than ALASKA.

From the summary table by destination in Step 5, I see that both airlines are delayed most often when the destination is San Francisco. This could lead to the conclusion that something about the San Francisco airport causes more arrival delays than other airports.

However, it’s important to note that the airlines don’t go to each destination equally – for example, more ALASKA planes go to Seattle than AM WEST, and more AM WEST planes go to Phoenix than ALASKA. This unequal distribution could affect the results. We also don’t have data about the individual airports, like how busy each they usually get, or the efficiency of their air traffic controllers.