Assignment: Tidying and Transforming Data

Introduction

For this assignment, we’re given the below data and tasked with reading the file into R as a .CSV, preparing the data and performing a basic analysis comparing delay rates for two airlines:

Loading the Data

First we load packages and read the data into R:

library(tidyverse)
library(ggplot2)
library(gridExtra)

url <- "https://raw.githubusercontent.com/josh1den/DATA-607/Projects/HW6/flights.csv"
raw_data <- read.csv(url, stringsAsFactors=FALSE)

raw_data
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Cleaning the Data

# remove na row
# rename first two columns
# convert blank airline rows to NA in order to fill
# pivot data on cities
# pivot data on arrival
# create total_flights, delay_pct column
# replace periods in cities with spaces

df <- na.omit(raw_data) |>
  rename(Airline =  X, Arrival = X.1) |>
  mutate(Airline = na_if(Airline, "")) |>
  fill(Airline, .direction = "down") |>
  pivot_longer(3:7, names_to = "City", values_to = "Total") |>
  pivot_wider(names_from = Arrival, values_from = Total) |>
  rename(on_time = 'on time') |>
  mutate(total_flights = on_time + delayed,
         delay_pct = round(delayed/(on_time + delayed) * 100,2),
         City = gsub("\\.", " ", City))

df
## # A tibble: 10 × 6
##    Airline City          on_time delayed total_flights delay_pct
##    <chr>   <chr>           <int>   <int>         <int>     <dbl>
##  1 Alaska  Los Angeles       497      62           559     11.1 
##  2 Alaska  Phoenix           221      12           233      5.15
##  3 Alaska  San Diego         212      20           232      8.62
##  4 Alaska  San Francisco     503     102           605     16.9 
##  5 Alaska  Seattle          1841     305          2146     14.2 
##  6 AM WEST Los Angeles       694     117           811     14.4 
##  7 AM WEST Phoenix          4840     415          5255      7.9 
##  8 AM WEST San Diego         383      65           448     14.5 
##  9 AM WEST San Francisco     320     129           449     28.7 
## 10 AM WEST Seattle           201      61           262     23.3

Preparing the Data for Analysis

total_flights <- df |>
  select(1, 3, 4) |>
  group_by(Airline) |>
  summarise(on_time = sum(on_time), delayed = sum(delayed), 
            total_flights = sum(on_time) + sum(delayed),
            delay_pct = round(sum(delayed)/
                                (sum(on_time) + sum(delayed)) * 100, 2))

total_flights
## # A tibble: 2 × 5
##   Airline on_time delayed total_flights delay_pct
##   <chr>     <int>   <int>         <int>     <dbl>
## 1 Alaska     3274     501          3775      13.3
## 2 AM WEST    6438     787          7225      10.9

Visualizing the Data

# avg delay
b1 <- ggplot(total_flights, aes(x=Airline, y=delay_pct, fill=Airline)) +
        geom_bar(stat="identity") +
        labs(title = "Average Delay") +
        scale_x_discrete(guide = guide_axis(angle = 45))

# total flights
b2 <- ggplot(total_flights, aes(x=Airline, y=total_flights, fill=Airline)) +
        geom_bar(stat="identity") +
        labs(title = "Total Flights") +
        scale_x_discrete(guide = guide_axis(angle = 45))

grid.arrange(b1, b2, ncol=2)

As we can see here, Alaska Air has nearly half as many flights as America West and a 2.5% higher delay percentage, but Alaska Air has lower percentages of delay in every city. Let’s take a closer look

f1 <- ggplot(df, aes(x=Airline, y=delay_pct, fill=Airline)) +
        geom_bar(stat="identity") +
        labs(title = "Delay Pct") +
        facet_wrap(~ City)

f2 <- ggplot(df, aes(x=Airline, y=total_flights, fill=Airline)) +
        geom_bar(stat="identity") +
        labs(title = "Total Flights") +
        facet_wrap(~ City)

grid.arrange(f1, f2)

As we can see here, although Alaska Air has lower delay percentages in each city than America West, the Phoenix market is weighing the overall total, as Phoenix holds each airlines lowest delay rate and America West dramatically outbalances Alaska in terms of total number of flights from Phoenix.