In this assignment, I practiced tidying and transforming data for downstream analysis.
The assignment included the following image of a flight arrivals table for two airlines:
I created a .CSV file from this table, read the file into R, transformed and tidied the data, and performed an analysis.
I used the following packages:
To begin, I entered the data in a CSV file exactly as it appeared in the image. Then, I saved the file to a folder on my desktop and read it into R.
flights <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Assignments/Assignment 05/flightdelays.csv",sep = ",", fill = TRUE, header = TRUE)
flights %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
ï.. | X | Los.Angeles | Phoenix | San.Diego | San.Francisco | Seattle |
---|---|---|---|---|---|---|
ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
delayed | 62 | 12 | 20 | 102 | 305 | |
NA | NA | NA | NA | NA | ||
AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
delayed | 117 | 415 | 65 | 129 | 61 |
I did a few preliminary steps to prepare the data for reshaping. First, I removed the third row of the dataset, which was empty.
flights <- flights[ -3, ]
I renamed the columns to be more descriptive.
names(flights) <- c("Airline", "Status", "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle")
I filled in the empty rows of the Airline
variable.
flights$Airline <- c("ALASKA", "ALASKA",
"AM WEST", "AM WEST")
flights %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline | Status | Los Angeles | Phoenix | San Diego | San Francisco | Seattle | |
---|---|---|---|---|---|---|---|
1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
5 | AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Next, I reshaped the data from a wide format to a long format. I assumed that the numbers in the original table referred to number of flights.
I used the gather
function to collapse Destination
and number of flights (N
) into 20 rows.
f_gathered <- gather(flights, "Destination", "N", 3:7)
f_gathered %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline | Status | Destination | N |
---|---|---|---|
ALASKA | on time | Los Angeles | 497 |
ALASKA | delayed | Los Angeles | 62 |
AM WEST | on time | Los Angeles | 694 |
AM WEST | delayed | Los Angeles | 117 |
ALASKA | on time | Phoenix | 221 |
ALASKA | delayed | Phoenix | 12 |
AM WEST | on time | Phoenix | 4840 |
AM WEST | delayed | Phoenix | 415 |
ALASKA | on time | San Diego | 212 |
ALASKA | delayed | San Diego | 20 |
AM WEST | on time | San Diego | 383 |
AM WEST | delayed | San Diego | 65 |
ALASKA | on time | San Francisco | 503 |
ALASKA | delayed | San Francisco | 102 |
AM WEST | on time | San Francisco | 320 |
AM WEST | delayed | San Francisco | 129 |
ALASKA | on time | Seattle | 1841 |
ALASKA | delayed | Seattle | 305 |
AM WEST | on time | Seattle | 201 |
AM WEST | delayed | Seattle | 61 |
I noticed that the Status
and N
columns could be further tidied. I spread the columns into two:
delayed
– the number of flights that arrived with a delay at that destination and airline.on time
– the number of flights that arrived on time at that destination and airline.f_spread <- spread(f_gathered, "Status", "N")
f_spread %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline | Destination | delayed | on time |
---|---|---|---|
ALASKA | Los Angeles | 62 | 497 |
ALASKA | Phoenix | 12 | 221 |
ALASKA | San Diego | 20 | 212 |
ALASKA | San Francisco | 102 | 503 |
ALASKA | Seattle | 305 | 1841 |
AM WEST | Los Angeles | 117 | 694 |
AM WEST | Phoenix | 415 | 4840 |
AM WEST | San Diego | 65 | 383 |
AM WEST | San Francisco | 129 | 320 |
AM WEST | Seattle | 61 | 201 |
With the data now tidy, I could begin calculating new variables to explore the results.
Using the mutate
function in dplyr
, I added three columns to the dataframe:
total
– the total number of flights observed.% delayed
– the percentage of flights that arrived with a delay at that destination and airline.% on time
– the percentage of flights that arrived on time at that destination and airline.f_mutated <- mutate(f_spread, "total" = f_spread$delayed + f_spread$`on time`)
## Warning: package 'bindrcpp' was built under R version 3.4.3
f_mutated <- mutate(f_mutated, "% delayed" = round(( f_mutated$delayed / f_mutated$total ) * 100, 2),
"% on time" = round(( f_mutated$`on time` / f_mutated$total ) * 100, 2) )
f_mutated %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline | Destination | delayed | on time | total | % delayed | % on time |
---|---|---|---|---|---|---|
ALASKA | Los Angeles | 62 | 497 | 559 | 11.09 | 88.91 |
ALASKA | Phoenix | 12 | 221 | 233 | 5.15 | 94.85 |
ALASKA | San Diego | 20 | 212 | 232 | 8.62 | 91.38 |
ALASKA | San Francisco | 102 | 503 | 605 | 16.86 | 83.14 |
ALASKA | Seattle | 305 | 1841 | 2146 | 14.21 | 85.79 |
AM WEST | Los Angeles | 117 | 694 | 811 | 14.43 | 85.57 |
AM WEST | Phoenix | 415 | 4840 | 5255 | 7.90 | 92.10 |
AM WEST | San Diego | 65 | 383 | 448 | 14.51 | 85.49 |
AM WEST | San Francisco | 129 | 320 | 449 | 28.73 | 71.27 |
AM WEST | Seattle | 61 | 201 | 262 | 23.28 | 76.72 |
Finally, I decided to calculate the percentage of flights from each airline that arrived on time and with a delay.
I grouped the data by airline, and used the summarise
function to calculate the sum of on-time and delayed flights. Then, I added more columns using mutate
to determine the percentages.
f_grouped <- group_by(f_mutated, Airline)
f_sum <- f_grouped %>%
summarise(delayed_sum = sum(`delayed`),
ontime_sum = sum(`on time`))
f_sum <- mutate(f_sum,
"total" = f_sum$delayed_sum + f_sum$ontime_sum)
f_sum <- mutate(f_sum,
"% delayed" = round(( f_sum$`delayed_sum` / f_sum$total ) * 100, 2),
"% on time" = round(( f_sum$`ontime_sum` / f_sum$total ) * 100, 2) )
names(f_sum) <- c("Airline", "# Flights Delayed", "# Flights On Time", "Total # Flights", "% Flights Delayed", "% Flights on Time")
f_sum %>%
kable("html") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Airline | # Flights Delayed | # Flights On Time | Total # Flights | % Flights Delayed | % Flights on Time |
---|---|---|---|---|---|
ALASKA | 501 | 3274 | 3775 | 13.27 | 86.73 |
AM WEST | 787 | 6438 | 7225 | 10.89 | 89.11 |
Overall, I see that ALASKA has a higher percentage of flights that arrive with a delay compared to AM WEST. However, AM WEST has about 4,000 more flights recorded in the database than ALASKA.
From the summary table by destination in Step 5, I see that both airlines are delayed most often when the destination is San Francisco. This could lead to the conclusion that something about the San Francisco airport causes more arrival delays than other airports.
However, it’s important to note that the airlines don’t go to each destination equally – for example, more ALASKA planes go to Seattle than AM WEST, and more AM WEST planes go to Phoenix than ALASKA. This unequal distribution could affect the results. We also don’t have data about the individual airports, like how busy each they usually get, or the efficiency of their air traffic controllers.