In this assignment, I practiced tidying and transforming data for downstream analysis.

The assignment included the following image of a flight arrivals table for two airlines:

I created a .CSV file from this table, read the file into R, transformed and tidied the data, and performed an analysis.

I used the following packages:


1. Create and read in the CSV.

To begin, I entered the data in a CSV file exactly as it appeared in the image. Then, I saved the file to a folder on my desktop and read it into R.

flights <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Assignments/Assignment 05/flightdelays.csv",sep = ",", fill = TRUE, header = TRUE)

flights %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
ï.. X Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1841
delayed 62 12 20 102 305
NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
delayed 117 415 65 129 61

2. Prepare the data.

I did a few preliminary steps to prepare the data for reshaping. First, I removed the third row of the dataset, which was empty.

flights <- flights[ -3, ]

I renamed the columns to be more descriptive.

names(flights) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")

I filled in the empty rows of the Airline variable.

flights$Airline <- c("ALASKA", "ALASKA", 
                     "AM WEST", "AM WEST")

flights %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 61

3. Gather the data.

Next, I reshaped the data from a wide format to a long format. I assumed that the numbers in the original table referred to number of flights.

I used the gather function to collapse Destination and number of flights (N) into 20 rows.

f_gathered <- gather(flights, "Destination", "N", 3:7)

f_gathered %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline Status Destination N
ALASKA on time Los Angeles 497
ALASKA delayed Los Angeles 62
AM WEST on time Los Angeles 694
AM WEST delayed Los Angeles 117
ALASKA on time Phoenix 221
ALASKA delayed Phoenix 12
AM WEST on time Phoenix 4840
AM WEST delayed Phoenix 415
ALASKA on time San Diego 212
ALASKA delayed San Diego 20
AM WEST on time San Diego 383
AM WEST delayed San Diego 65
ALASKA on time San Francisco 503
ALASKA delayed San Francisco 102
AM WEST on time San Francisco 320
AM WEST delayed San Francisco 129
ALASKA on time Seattle 1841
ALASKA delayed Seattle 305
AM WEST on time Seattle 201
AM WEST delayed Seattle 61

4. Spread the data.

I noticed that the Status and N columns could be further tidied. I spread the columns into two:

f_spread <- spread(f_gathered, "Status", "N")

f_spread %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline Destination delayed on time
ALASKA Los Angeles 62 497
ALASKA Phoenix 12 221
ALASKA San Diego 20 212
ALASKA San Francisco 102 503
ALASKA Seattle 305 1841
AM WEST Los Angeles 117 694
AM WEST Phoenix 415 4840
AM WEST San Diego 65 383
AM WEST San Francisco 129 320
AM WEST Seattle 61 201

5. Add calculated columns.

With the data now tidy, I could begin calculating new variables to explore the results.

Using the mutate function in dplyr, I added three columns to the dataframe:

f_mutated <- mutate(f_spread, "total" = f_spread$delayed + f_spread$`on time`)
## Warning: package 'bindrcpp' was built under R version 3.4.3
f_mutated <- mutate(f_mutated, "% delayed" = round(( f_mutated$delayed / f_mutated$total ) * 100, 2), 
                               "% on time" = round(( f_mutated$`on time` / f_mutated$total ) * 100, 2) )

f_mutated %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline Destination delayed on time total % delayed % on time
ALASKA Los Angeles 62 497 559 11.09 88.91
ALASKA Phoenix 12 221 233 5.15 94.85
ALASKA San Diego 20 212 232 8.62 91.38
ALASKA San Francisco 102 503 605 16.86 83.14
ALASKA Seattle 305 1841 2146 14.21 85.79
AM WEST Los Angeles 117 694 811 14.43 85.57
AM WEST Phoenix 415 4840 5255 7.90 92.10
AM WEST San Diego 65 383 448 14.51 85.49
AM WEST San Francisco 129 320 449 28.73 71.27
AM WEST Seattle 61 201 262 23.28 76.72

6. Group and summarize the data.

Finally, I decided to calculate the percentage of flights from each airline that arrived on time and with a delay.

I grouped the data by airline, and used the summarise function to calculate the sum of on-time and delayed flights. Then, I added more columns using mutate to determine the percentages.

f_grouped <- group_by(f_mutated, Airline)

f_sum <- f_grouped %>% 
  summarise(delayed_sum = sum(`delayed`), 
            ontime_sum = sum(`on time`))

f_sum <- mutate(f_sum, 
                "total" = f_sum$delayed_sum + f_sum$ontime_sum)

f_sum <- mutate(f_sum, 
                "% delayed" = round(( f_sum$`delayed_sum` / f_sum$total ) * 100, 2),
                "% on time" = round(( f_sum$`ontime_sum` / f_sum$total ) * 100, 2) )

names(f_sum) <- c("Airline", "# Flights Delayed", "# Flights On Time", "Total # Flights", "% Flights Delayed", "% Flights on Time")

f_sum %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline # Flights Delayed # Flights On Time Total # Flights % Flights Delayed % Flights on Time
ALASKA 501 3274 3775 13.27 86.73
AM WEST 787 6438 7225 10.89 89.11

7. Analysis.

Overall, I see that ALASKA has a higher percentage of flights that arrive with a delay compared to AM WEST. However, AM WEST has about 4,000 more flights recorded in the database than ALASKA.

From the summary table by destination in Step 5, I see that both airlines are delayed most often when the destination is San Francisco. This could lead to the conclusion that something about the San Francisco airport causes more arrival delays than other airports.

However, it’s important to note that the airlines don’t go to each destination equally – for example, more ALASKA planes go to Seattle than AM WEST, and more AM WEST planes go to Phoenix than ALASKA. This unequal distribution could affect the results. We also don’t have data about the individual airports, like how busy each they usually get, or the efficiency of their air traffic controllers.