Week 5 Assignment - Fox

Author

Amanda Fox

Published

February 21, 2024

Introduction

For this assignment, we were asked to create a small dataset as a .csv file, load it to R, and then clean, tidy, and analyze it.

I used Excel to create the dataset using the example provided, which described arrival statistics for two airlines across five U.S. cities. The data had several issues to be cleaned and tidied: a blank row, missing values, values as column names (cities), and variables in rows (on time, delayed). I saved this dataset as a .csv file to Github.

Importing the Data

In R, I began by loading the libraries and reading in the text file from Github:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(RColorBrewer)
  
df <- read_csv('https://raw.githubusercontent.com/AmandaSFox/DATA607/main/week%205/Week_5_Data.csv', show_col_types = FALSE)
df
# A tibble: 5 × 7
  airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
  <chr>   <chr>                     <dbl>   <dbl>       <dbl>           <dbl>
1 Alaska  on time                     497     221         212             503
2 <NA>    delayed                      62      12          20             102
3 <NA>    <NA>                         NA      NA          NA              NA
4 Am West on time                     694    4840         383             320
5 <NA>    delayed                     117     415          65             129
# ℹ 1 more variable: Seattle <dbl>

Filter: Delete Blank Row

To clean the file, I first used the filter function to remove the blank row by filtering out rows with NA in the Los Angeles column. This was a very small dataset, so I could see that this filter would identify all blank rows and not inadvertently delete a row with values in other columns.

In the case of a larger dataset or if I were creating a repeatable process where I could not be sure that a blank in the Los Angeles field would always indicate a blank row, I would have written this to validate that all fields were NA before deleting the row, but that wasn’t necessary here.

df <- df %>% 
  filter(!is.na(`Los Angeles`))
df
# A tibble: 4 × 7
  airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
  <chr>   <chr>                     <dbl>   <dbl>       <dbl>           <dbl>
1 Alaska  on time                     497     221         212             503
2 <NA>    delayed                      62      12          20             102
3 Am West on time                     694    4840         383             320
4 <NA>    delayed                     117     415          65             129
# ℹ 1 more variable: Seattle <dbl>

Fill: Fill In Missing Airline Names

Then I filled in the missing airline name values using the fill function. This is a quick and easy way to handle variables where repeating values are suppressed.

df <- df %>% 
  fill(airline)
df
# A tibble: 4 × 7
  airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
  <chr>   <chr>                     <dbl>   <dbl>       <dbl>           <dbl>
1 Alaska  on time                     497     221         212             503
2 Alaska  delayed                      62      12          20             102
3 Am West on time                     694    4840         383             320
4 Am West delayed                     117     415          65             129
# ℹ 1 more variable: Seattle <dbl>

Melt: Create New City Variable

One untidy feature of this dataset was the use of values (cities) as column names. This needed to be unpivoted or melted to create the variable “city,” and I used the pivot_longer function in this case to do it.

df <- df %>% 
  pivot_longer(cols = c("Los Angeles","Phoenix","San Diego","San Francisco","Seattle"),
               names_to = "city",
               values_to = "count")  

df
# A tibble: 20 × 4
   airline `on time/delayed` city          count
   <chr>   <chr>             <chr>         <dbl>
 1 Alaska  on time           Los Angeles     497
 2 Alaska  on time           Phoenix         221
 3 Alaska  on time           San Diego       212
 4 Alaska  on time           San Francisco   503
 5 Alaska  on time           Seattle        1841
 6 Alaska  delayed           Los Angeles      62
 7 Alaska  delayed           Phoenix          12
 8 Alaska  delayed           San Diego        20
 9 Alaska  delayed           San Francisco   102
10 Alaska  delayed           Seattle         305
11 Am West on time           Los Angeles     694
12 Am West on time           Phoenix        4840
13 Am West on time           San Diego       383
14 Am West on time           San Francisco   320
15 Am West on time           Seattle         201
16 Am West delayed           Los Angeles     117
17 Am West delayed           Phoenix         415
18 Am West delayed           San Diego        65
19 Am West delayed           San Francisco   129
20 Am West delayed           Seattle          61

Cast: ‘On Time/Delayed’ Variables

To complete tidying, I pivoted out or cast the on time/delayed values to variables or columns.

This resulted in a tidy dataframe, with one column per variable (fixed: city, airline; measurable: on time, delayed), and one row per observation (city/airline pairs).

df_tidy <- df %>% 
  pivot_wider(names_from = 'on time/delayed',values_from = 'count') %>% 
  dplyr::rename(on_time =`on time`) %>% 
  mutate(total_count = on_time + delayed) %>% 
  mutate(pct_on_time = on_time / total_count)

df_tidy
# A tibble: 10 × 6
   airline city          on_time delayed total_count pct_on_time
   <chr>   <chr>           <dbl>   <dbl>       <dbl>       <dbl>
 1 Alaska  Los Angeles       497      62         559       0.889
 2 Alaska  Phoenix           221      12         233       0.948
 3 Alaska  San Diego         212      20         232       0.914
 4 Alaska  San Francisco     503     102         605       0.831
 5 Alaska  Seattle          1841     305        2146       0.858
 6 Am West Los Angeles       694     117         811       0.856
 7 Am West Phoenix          4840     415        5255       0.921
 8 Am West San Diego         383      65         448       0.855
 9 Am West San Francisco     320     129         449       0.713
10 Am West Seattle           201      61         262       0.767

Analysis: Arrival Delays

Finally, I used this new tidy dataframe to compare arrival delays for these two airlines, both in aggregate and by city.

In aggregate across these five cities, Am West had significantly more flights than Alaska (7,225 vs. 3,775) and a higher overall on-time percentage (89.1% vs. 86.7%).

#----------Find Overall % On Time by Airline

df_tidy_airline <- df_tidy %>% 
  group_by(airline) %>% 
  summarise(tot_on_time = sum(on_time),
            tot_count = sum(total_count)) %>% 
  mutate(tot_pct_on_time = tot_on_time/tot_count)

df_tidy_airline
# A tibble: 2 × 4
  airline tot_on_time tot_count tot_pct_on_time
  <chr>         <dbl>     <dbl>           <dbl>
1 Alaska         3274      3775           0.867
2 Am West        6438      7225           0.891
#----------Plot Overall % On Time by Airline

ggplot(df, aes(x = airline, y = count, fill =`on time/delayed`)) + 
    geom_bar(position="fill", stat="identity") +
    scale_fill_brewer(palette = "Pastel1") +
    ylab("% Flights") +
    xlab("Airline") +
    ggtitle("Overall On Time % by Airline") +
    labs(fill = "Flight Status") +
    scale_y_continuous(labels = scales::percent)

However, in every individual city, Alaska’s on-time percentage was higher, despite having the lower overall on-time percentage:

# % On Time by Status and Airline and City

ggplot(df, aes(x = airline, y = count, fill =`on time/delayed`)) + 
    geom_bar(position="fill", stat="identity") +
    scale_fill_brewer(palette = "Pastel1") +
    facet_wrap(~city) +
    ylab("% Flights") +
    xlab("Airline") +
    ggtitle("On Time % by Airline and City") +
    labs(fill = "Flight Status") +
    scale_y_continuous(labels = scales::percent) 

This is the “Simpson Paradox,” a phenomenon whereby a relationship between two variables that is seen in groups of data reverses when those groups are combined.

In this case, Am West had a higher aggregate on-time percentage despite having lower rates in individual cities because its flights were heavily concentrated in Phoenix, where on-time percentages were very high. Conversely, Alaska flights were concentrated in Seattle, where on-time rates are low. Looking at percentages without understanding relative volume and how it is distributed could lead a consumer to underestimate the chance of being late on any given Am West flight vs. an Alaska one from the same airport.

# Total weighted avg on-time rate by city

df_tidy_city <- df_tidy %>% 
  group_by(city) %>% 
  summarise(city_on_time = sum(on_time),
            city_total = sum(total_count)) %>% 
  mutate(city_pct_on_time = city_on_time/city_total)

# On-time % by city: by airline and in aggregate 

df_tidy_city_airline <- df_tidy %>% 
  select(city,airline,total_count,pct_on_time) %>% 
  arrange(city) %>% 
  left_join(df_tidy_city, join_by(x$city == y$city)) %>% 
  rename(airline_pct_on_time = pct_on_time,
         airline_total_flights = total_count,
         total_on_time_flights = city_on_time,
         total_flights = city_total,
         total_pct_on_time = city_pct_on_time
         ) 

df_tidy_city_airline <- df_tidy_city_airline[,-3] 
df_tidy_city_airline <- df_tidy_city_airline[,-4:-5] 

df_tidy_city_airline
# A tibble: 10 × 4
   city          airline airline_pct_on_time total_pct_on_time
   <chr>         <chr>                 <dbl>             <dbl>
 1 Los Angeles   Alaska                0.889             0.869
 2 Los Angeles   Am West               0.856             0.869
 3 Phoenix       Alaska                0.948             0.922
 4 Phoenix       Am West               0.921             0.922
 5 San Diego     Alaska                0.914             0.875
 6 San Diego     Am West               0.855             0.875
 7 San Francisco Alaska                0.831             0.781
 8 San Francisco Am West               0.713             0.781
 9 Seattle       Alaska                0.858             0.848
10 Seattle       Am West               0.767             0.848
ggplot(df_tidy,aes(x = airline, y = total_count, fill= city)) +
  geom_bar(position="stack", stat="identity") +
  scale_fill_brewer(palette = "Pastel1") +
  ylab("Total Flights") +
  xlab("Airline") +
  ggtitle("Total Flights by City: Alaska vs. Am West") +
  labs(fill = "City")