AssignmentWeek05

Introduction

This R Markdown takes data from a Blackboard assignment that includes an image of a table. This data is manually built as a data frame, put into a .csv file, and read back into the project. After this, the data needs to be tidied, as the table has each column as different observations, rather than aspects of a single observation.
The table has 5 states as columns, with rows that show airline (2 options) and status (2 options). The frequency of each state, status, and destination state are populated in the cross-sections. In total, there are 7 columns. Tidy data will have 4 columns: airline, status, destination state, and frequency total.

Load required packages

#load tidyverse package
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

1) Create a .csv file with the provided data

To create this .csv file, the data must be manually brought into R. Again, the reference for this data is in the DATA 607 Blackboard assignment file. Row names are set to F here to avoid the creation of the row names, as they will be added as a default when reading the file back into R. The extra column would be redundant.

# create the original data set as a data frame
  # columns:
airline = c("ALASKA", NA, NA, "AMWEST",NA)
status = c("on time", "delayed", NA, "on time", "delayed")
LosAngeles = c(497, 62, NA,  694, 117)
Phoenix = c(221, 12, NA, 4840, 415)
SanDiego = c(212, 20, NA, 383, 65)
SanFransisco = c(503, 102, NA, 320, 65)
Seattle= c(1841, 305, NA, 201, 61)
  # data frame assembly:
df <- data.frame(airline, status, 
                  LosAngeles, Phoenix, SanDiego, SanFransisco, Seattle)

# show the table format
df

##   airline  status LosAngeles Phoenix SanDiego SanFransisco Seattle
## 1  ALASKA on time        497     221      212          503    1841
## 2    <NA> delayed         62      12       20          102     305
## 3    <NA>    <NA>         NA      NA       NA           NA      NA
## 4  AMWEST on time        694    4840      383          320     201
## 5    <NA> delayed        117     415       65           65      61

# create csv file to working directory
write.csv(df, file = "airline_comparison_data.csv", row.names = F)

2) Read the data into R, tidy, and transform

The table includes no value for airline on each second line. To put the name there, fill(airline) is used, which takes the value from the row above and copies it to the row with null data in the airline column. The null row is then dropped. The tidying happens with a pivot_longer function. The columns to pivot longer are the 5 state destinations. See below at the tidied data head that checks the data was properly transformed.

# read into R
airline_df <- read.csv("airline_comparison_data.csv")

# add airline names in null rows
airline_df <- airline_df |> 
                fill(airline)

# drop the row that does not have data
airline_df <- na.omit(airline_df)

# tidy and transform
airline_df <- airline_df |>
                pivot_longer(
                  cols = !c(airline, status),
                  names_to = "destination",
                  values_to = "count"
                )

# check if the data is tidy - should have the 4 columns mentioned in the intro
head(airline_df)

## # A tibble: 6 × 4
##   airline status  destination  count
##   <chr>   <chr>   <chr>        <int>
## 1 ALASKA  on time LosAngeles     497
## 2 ALASKA  on time Phoenix        221
## 3 ALASKA  on time SanDiego       212
## 4 ALASKA  on time SanFransisco   503
## 5 ALASKA  on time Seattle       1841
## 6 ALASKA  delayed LosAngeles      62

3) Comparative analysis

The below graph takes a look at different aspects of the airline delay frequencies to help the analysis.

# isolate arrival delays from on time
delay_only <- filter(airline_df, status == "delayed")
ct_helper <- filter(airline_df, status == "on time")

# get the ratio of delayed to total flights for each airline and destination
delay_ratio <- airline_df |> 
            filter(status == "on time") |>
            mutate(
              ratio = round((delay_only$count / (ct_helper$count + delay_only$count)), 2)
            )


# graph the delays and totals on one bar chart
ggplot() +
  geom_col(data = airline_df, 
           aes(x = destination, y = count, fill = airline, alpha = 0.2),
           position = "dodge") +
  geom_col(data = delay_only, 
           aes(x = destination, y = count, fill = airline),
           position = "dodge") +
  geom_text(data = delay_ratio, 
            aes(x = destination, y = count, label = ratio, group = airline),
            vjust = -0.5, 
            position = position_dodge(width = .9))

Graph explanation: The figure above shows how the ratio of delayed flights per airline and destination to total flights for that airline and destination. For instance, the first label is 0.11, which tells us that of all ALASKA airline’s flights to Los Angeles, 11% of them were delayed. The less opaque bars show the total flights and the more opaque bars show the delayed flights.

Analysis: In the delay ratios, it can be seen that there is not an instance of ALASKA airlines having a higher ratio of delayed flights than AMWEST airlines. The range of AMWEST delays are 8% to 23%. For ALASKA, that range is 5% to 17%. With these metrics, we can confidently say that any randomly chosen flight on AMWEST is more likely or as likely to be delayed than the same destination on an ALASKA flight.