DATA 607 Data Acquisition & Management

Week 4 Assignment

Silma Khan SPRING 2025

For this assignment, we are given a dataset from Numbersense, Kaiser Fung McGraw Hill, 2013 and asked to: 1. Create a CSV in wide format 2. Read the CSV into R and tidy and transform the data using tidyr and dplyr 3. Perform analysis to compare the arrival delays for the two airlines

For this assignment, I deiced to create the CSV file right into R, by creating a dataframe for it and then writing it out to a CSV file:

flight_status <- data.frame(
  Airline        = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Flight.Status  = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles    = c(497, 62, 694, 117),
  Phoenix        = c(221, 12, 4840, 415),
  San.Diego      = c(212, 20, 383, 65),
  San.Francisco  = c(503, 102, 320, 129),
  Seattle        = c(1841, 305, 201, 61)
)

flight_status
##   Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA       on time         497     221       212           503    1841
## 2  ALASKA       delayed          62      12        20           102     305
## 3 AM WEST       on time         694    4840       383           320     201
## 4 AM WEST       delayed         117     415        65           129      61

Now, we need to write the data frame to a CSV file:

write.csv(flight_status, "flight_status.csv", row.names = FALSE)

Now to make sure the file exists, I can use the file.exists() function:

if (file.exists("flight_status.csv")) {
  print("flight_status.csv exists")
} else {
  print("flight_status.csv does NOT exist")
}
## [1] "flight_status.csv exists"

Now with completing the creation of the CSV file, we can now move onto step 2, which is to tidy and transform the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
flights_wide <- read.csv("flight_status.csv", stringsAsFactors = FALSE)
head(flights_wide)
##   Airline Flight.Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA       on time         497     221       212           503    1841
## 2  ALASKA       delayed          62      12        20           102     305
## 3 AM WEST       on time         694    4840       383           320     201
## 4 AM WEST       delayed         117     415        65           129      61

With this data being in wide structure, for the tidying portion, it is important to convert the data from a wide to a long format because by allowing each variable to form a column and each observation to form a row, it allows us to work with the data easier

flights_long <- flights_wide %>%
  pivot_longer(
    cols      = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
    names_to  = "Destination",
    values_to = "Count"
  )

head(flights_long)
## # A tibble: 6 × 4
##   Airline Flight.Status Destination   Count
##   <chr>   <chr>         <chr>         <int>
## 1 ALASKA  on time       Los.Angeles     497
## 2 ALASKA  on time       Phoenix         221
## 3 ALASKA  on time       San.Diego       212
## 4 ALASKA  on time       San.Francisco   503
## 5 ALASKA  on time       Seattle        1841
## 6 ALASKA  delayed       Los.Angeles      62

After transforming the data to long form, each row how represents a unique combination of Airline, Flight.Status. and Destination

With doing this, we can now move onto step 3, which is performing simple analysis

For the analysis portion, I will be comparing:

  1. the total flights each airline operates that is broken down by their arrival status
  2. and compare the percentage of delayed flights by destination for each airline
flight_sum <- flights_long %>%
  group_by(Airline, Flight.Status) %>%
  summarize(TotalFlights = sum(Count), .groups = "drop")

flight_sum
## # A tibble: 4 × 3
##   Airline Flight.Status TotalFlights
##   <chr>   <chr>                <int>
## 1 ALASKA  delayed                501
## 2 ALASKA  on time               3274
## 3 AM WEST delayed                787
## 4 AM WEST on time               6438
ggplot(flight_sum, aes(x = Airline, y = TotalFlights, fill = Flight.Status)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Total Flights by Airline and Flight Status",
       x = "Airline",
       y = "Number of Flights") +
  theme_minimal()

By doing simple analysis to compare the two Airline’s flight status’ we are able to see that both airlines have much more on time flight arrivals compared to their delayed flight arrivals

Now moving onto the next analysis which is to compare the percentage of delayed flights by destination for each airline

percentage_delayed <- flights_long %>%
  group_by(Airline, Destination, Flight.Status) %>%
  summarize(Total = sum(Count), .groups = "drop") %>%
  pivot_wider(names_from = Flight.Status, values_from = Total) %>%
  mutate(
    TotalFlights   = `on time` + delayed,
    PercentDelayed = round(delayed / TotalFlights * 100, 2)
  )

percentage_delayed
## # A tibble: 10 × 6
##    Airline Destination   delayed `on time` TotalFlights PercentDelayed
##    <chr>   <chr>           <int>     <int>        <int>          <dbl>
##  1 ALASKA  Los.Angeles        62       497          559          11.1 
##  2 ALASKA  Phoenix            12       221          233           5.15
##  3 ALASKA  San.Diego          20       212          232           8.62
##  4 ALASKA  San.Francisco     102       503          605          16.9 
##  5 ALASKA  Seattle           305      1841         2146          14.2 
##  6 AM WEST Los.Angeles       117       694          811          14.4 
##  7 AM WEST Phoenix           415      4840         5255           7.9 
##  8 AM WEST San.Diego          65       383          448          14.5 
##  9 AM WEST San.Francisco     129       320          449          28.7 
## 10 AM WEST Seattle            61       201          262          23.3
ggplot(percentage_delayed, aes(x = Destination, y = PercentDelayed, fill = Airline)) +
  geom_col(position = "dodge") +
  labs(title = "Percentage of Delayed Flights by Destination",
       x = "Destination",
       y = "Percent Delayed (%)") +
  theme_minimal()

Using this analysis we can quickly identify which airline has a larger delay rate on specific routes in comparison to the others. Here we can see that the AM WEST has a larger delay rate than ALASKA, so safe to say we should avoid that airline when deciding to fly out

Conclusion

By creating a CSV file in wide version and tidying and transforming the data we were able to analyze the data and get a better understanding of the flights and Airline. By analyzing the total number of flights for both on time and delayed, we can see that both airline have many more on time arrivals compared to delayed, however, when taking a closer look at the overall delayed percentage of flights, we can see that AM WEST has a higher percentage rate of delayed flights compared to ALASKA airlines, so we can avoid that airline if we are worried about our flights being delayed