library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

There are three interrelated rules that make a dataset tidy:

Each variable is a column; each column is a variable. Each observation is a row; each row is an observation. Each value is a cell; each cell is a single value.

Working With Un-Tidy Data

Import The Data

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
#Loading the CSV
flights_csv <- read.csv("https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Untidy_Flight_Table.csv")
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## 'https://raw.githubusercontent.com/william-forero/Data-607/refs/heads/main/Untidy_Flight_Table.csv'
head(flights_csv)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3 AM West On Time         694    4840       383           320     201
## 4         Delayed         117     415        65           129      61
  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

Prepare the dataframe for pivoting

There are some blank values in the original csv headers and rows. I first want to label those

# Renaming the columns
flights_csv <- flights_csv |>
  rename(
    Airline = X,
    Flight_Status = X.1
  )

head(flights_csv)
##   Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska       On Time         497     221       212           503    1841
## 2               Delayed          62      12        20           102     305
## 3 AM West       On Time         694    4840       383           320     201
## 4               Delayed         117     415        65           129      61

There were also blank values in the Airline column. These values are not truly missing, we can infer which airline the corresponding Delayed values belong to.

flights_csv[2,"Airline"] <- "Alaska"
flights_csv[4,"Airline"] <- "AM West"

head(flights_csv)
##   Airline Flight_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska       On Time         497     221       212           503    1841
## 2  Alaska       Delayed          62      12        20           102     305
## 3 AM West       On Time         694    4840       383           320     201
## 4 AM West       Delayed         117     415        65           129      61

Pivoting the data

The different destinations can be considered a variable “Destination” and placed into one column. The resulting dataframe is tidy in that each variable is a column, each row is an observation and each cell is a single value.

flights_tidy <- flights_csv |>
  pivot_longer(
    cols = !(Airline:Flight_Status), 
    names_to = "Destination", 
    values_to = "Count"
  )

head(flights_tidy)
## # A tibble: 6 × 4
##   Airline Flight_Status Destination   Count
##   <chr>   <chr>         <chr>         <int>
## 1 Alaska  On Time       Los.Angeles     497
## 2 Alaska  On Time       Phoenix         221
## 3 Alaska  On Time       San.Diego       212
## 4 Alaska  On Time       San.Francisco   503
## 5 Alaska  On Time       Seattle        1841
## 6 Alaska  Delayed       Los.Angeles      62

If I want to treat the flight status of on time and delayed as separate variables I can make it wider, this is something I may want to use later on to view some data easier. I’ll save it as a new dataframe

flights_wide <- flights_tidy |> 
  pivot_wider(
    names_from = Flight_Status,
    values_from = Count
  ) |>
  rename(
    On_Time = `On Time`
    )

head(flights_wide)
## # A tibble: 6 × 4
##   Airline Destination   On_Time Delayed
##   <chr>   <chr>           <int>   <int>
## 1 Alaska  Los.Angeles       497      62
## 2 Alaska  Phoenix           221      12
## 3 Alaska  San.Diego         212      20
## 4 Alaska  San.Francisco     503     102
## 5 Alaska  Seattle          1841     305
## 6 AM West Los.Angeles       694     117

Analysis of Tidy Data

  1. Perform analysis to compare the arrival delays for the two airlines.

Visualizing the data

Below we can see some general trends of delayed and on time flights with relative frequency across the two airlines and different destinations. We can see that generally flights to San Francisco have relatively more delayed flights and flights to Phoenix have less delayed flights.

ggplot(flights_tidy, 
       aes(x = Destination, 
           y = Count, 
           fill = Flight_Status)) +
  geom_bar(stat = 'identity', 
           position = 'fill') +
  facet_wrap(~Airline, 
             ncol=1)

We can get a better idea by computing the percentage of delayed flights. It may be visually easier to use the wider dataframe from earlier.

flights_wide <- flights_wide |>
  mutate(Perc_Delayed = round((Delayed / (On_Time + Delayed)) * 100, 2))

print(flights_wide)
## # A tibble: 10 × 5
##    Airline Destination   On_Time Delayed Perc_Delayed
##    <chr>   <chr>           <int>   <int>        <dbl>
##  1 Alaska  Los.Angeles       497      62        11.1 
##  2 Alaska  Phoenix           221      12         5.15
##  3 Alaska  San.Diego         212      20         8.62
##  4 Alaska  San.Francisco     503     102        16.9 
##  5 Alaska  Seattle          1841     305        14.2 
##  6 AM West Los.Angeles       694     117        14.4 
##  7 AM West Phoenix          4840     415         7.9 
##  8 AM West San.Diego         383      65        14.5 
##  9 AM West San.Francisco     320     129        28.7 
## 10 AM West Seattle           201      61        23.3

From this we can see that the destination with the highest percentage of delays is San Francisco for both airlines. However, between the two AM West has a higher percentage of delayed flights to San Francisco.

The destination with the lowest percentage of delays is also the same for both airlines, which is Phoneix but AM West again has the higher percentage of delayed flights to Phoenix between the two.

We can try to view the delays by airline as a whole

airline_summary <- flights_tidy |>
  group_by(Airline) |>
  summarise(
    Nbr_OnTime = sum(Count[Flight_Status == "On Time"]),
    Nbr_Delayed = sum(Count[Flight_Status == "Delayed"]),
    Total_Flights = sum(Count)
  ) |>
  mutate(Perc_Delayed = round((Nbr_Delayed / Total_Flights) * 100,2))
print(airline_summary)
## # A tibble: 2 × 5
##   Airline Nbr_OnTime Nbr_Delayed Total_Flights Perc_Delayed
##   <chr>        <int>       <int>         <int>        <dbl>
## 1 AM West       6438         787          7225         10.9
## 2 Alaska        3274         501          3775         13.3

Alaska airlines has a slightly higher overall percentage of delayed flights, but a lower overall number of flights in total.

  1. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.