For this assignment, we were asked to create a small dataset as a .csv file, load it to R, and then clean, tidy, and analyze it.
I used Excel to create the dataset using the example provided, which described arrival statistics for two airlines across five U.S. cities. The data had several issues to be cleaned and tidied: a blank row, missing values, values as column names (cities), and variables in rows (on time, delayed). I saved this dataset as a .csv file to Github.
Importing the Data
In R, I began by loading the libraries and reading in the text file from Github:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# A tibble: 5 × 7
airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alaska on time 497 221 212 503
2 <NA> delayed 62 12 20 102
3 <NA> <NA> NA NA NA NA
4 Am West on time 694 4840 383 320
5 <NA> delayed 117 415 65 129
# ℹ 1 more variable: Seattle <dbl>
Filter: Delete Blank Row
To clean the file, I first used the filter function to remove the blank row by filtering out rows with NA in the Los Angeles column. This was a very small dataset, so I could see that this filter would identify all blank rows and not inadvertently delete a row with values in other columns.
In the case of a larger dataset or if I were creating a repeatable process where I could not be sure that a blank in the Los Angeles field would always indicate a blank row, I would have written this to validate that all fields were NA before deleting the row, but that wasn’t necessary here.
df <- df %>%filter(!is.na(`Los Angeles`))df
# A tibble: 4 × 7
airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alaska on time 497 221 212 503
2 <NA> delayed 62 12 20 102
3 Am West on time 694 4840 383 320
4 <NA> delayed 117 415 65 129
# ℹ 1 more variable: Seattle <dbl>
Fill: Fill In Missing Airline Names
Then I filled in the missing airline name values using the fill function. This is a quick and easy way to handle variables where repeating values are suppressed.
df <- df %>%fill(airline)df
# A tibble: 4 × 7
airline `on time/delayed` `Los Angeles` Phoenix `San Diego` `San Francisco`
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alaska on time 497 221 212 503
2 Alaska delayed 62 12 20 102
3 Am West on time 694 4840 383 320
4 Am West delayed 117 415 65 129
# ℹ 1 more variable: Seattle <dbl>
Melt: Create New City Variable
One untidy feature of this dataset was the use of values (cities) as column names. This needed to be unpivoted or melted to create the variable “city,” and I used the pivot_longer function in this case to do it.
# A tibble: 20 × 4
airline `on time/delayed` city count
<chr> <chr> <chr> <dbl>
1 Alaska on time Los Angeles 497
2 Alaska on time Phoenix 221
3 Alaska on time San Diego 212
4 Alaska on time San Francisco 503
5 Alaska on time Seattle 1841
6 Alaska delayed Los Angeles 62
7 Alaska delayed Phoenix 12
8 Alaska delayed San Diego 20
9 Alaska delayed San Francisco 102
10 Alaska delayed Seattle 305
11 Am West on time Los Angeles 694
12 Am West on time Phoenix 4840
13 Am West on time San Diego 383
14 Am West on time San Francisco 320
15 Am West on time Seattle 201
16 Am West delayed Los Angeles 117
17 Am West delayed Phoenix 415
18 Am West delayed San Diego 65
19 Am West delayed San Francisco 129
20 Am West delayed Seattle 61
Cast: ‘On Time/Delayed’ Variables
To complete tidying, I pivoted out or cast the on time/delayed values to variables or columns.
This resulted in a tidy dataframe, with one column per variable (fixed: city, airline; measurable: on time, delayed), and one row per observation (city/airline pairs).
# A tibble: 10 × 6
airline city on_time delayed total_count pct_on_time
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Alaska Los Angeles 497 62 559 0.889
2 Alaska Phoenix 221 12 233 0.948
3 Alaska San Diego 212 20 232 0.914
4 Alaska San Francisco 503 102 605 0.831
5 Alaska Seattle 1841 305 2146 0.858
6 Am West Los Angeles 694 117 811 0.856
7 Am West Phoenix 4840 415 5255 0.921
8 Am West San Diego 383 65 448 0.855
9 Am West San Francisco 320 129 449 0.713
10 Am West Seattle 201 61 262 0.767
Analysis: Arrival Delays
Finally, I used this new tidy dataframe to compare arrival delays for these two airlines, both in aggregate and by city.
In aggregate across these five cities, Am West had significantly more flights than Alaska (7,225 vs. 3,775) and a higher overall on-time percentage (89.1% vs. 86.7%).
#----------Find Overall % On Time by Airlinedf_tidy_airline <- df_tidy %>%group_by(airline) %>%summarise(tot_on_time =sum(on_time),tot_count =sum(total_count)) %>%mutate(tot_pct_on_time = tot_on_time/tot_count)df_tidy_airline
# A tibble: 2 × 4
airline tot_on_time tot_count tot_pct_on_time
<chr> <dbl> <dbl> <dbl>
1 Alaska 3274 3775 0.867
2 Am West 6438 7225 0.891
#----------Plot Overall % On Time by Airlineggplot(df, aes(x = airline, y = count, fill =`on time/delayed`)) +geom_bar(position="fill", stat="identity") +scale_fill_brewer(palette ="Pastel1") +ylab("% Flights") +xlab("Airline") +ggtitle("Overall On Time % by Airline") +labs(fill ="Flight Status") +scale_y_continuous(labels = scales::percent)
However, in every individual city, Alaska’s on-time percentage was higher, despite having the lower overall on-time percentage:
# % On Time by Status and Airline and Cityggplot(df, aes(x = airline, y = count, fill =`on time/delayed`)) +geom_bar(position="fill", stat="identity") +scale_fill_brewer(palette ="Pastel1") +facet_wrap(~city) +ylab("% Flights") +xlab("Airline") +ggtitle("On Time % by Airline and City") +labs(fill ="Flight Status") +scale_y_continuous(labels = scales::percent)
This is the “Simpson Paradox,” a phenomenon whereby a relationship between two variables that is seen in groups of data reverses when those groups are combined.
In this case, Am West had a higher aggregate on-time percentage despite having lower rates in individual cities because its flights were heavily concentrated in Phoenix, where on-time percentages were very high. Conversely, Alaska flights were concentrated in Seattle, where on-time rates are low. Looking at percentages without understanding relative volume and how it is distributed could lead a consumer to underestimate the chance of being late on any given Am West flight vs. an Alaska one from the same airport.
# Total weighted avg on-time rate by citydf_tidy_city <- df_tidy %>%group_by(city) %>%summarise(city_on_time =sum(on_time),city_total =sum(total_count)) %>%mutate(city_pct_on_time = city_on_time/city_total)# On-time % by city: by airline and in aggregate df_tidy_city_airline <- df_tidy %>%select(city,airline,total_count,pct_on_time) %>%arrange(city) %>%left_join(df_tidy_city, join_by(x$city == y$city)) %>%rename(airline_pct_on_time = pct_on_time,airline_total_flights = total_count,total_on_time_flights = city_on_time,total_flights = city_total,total_pct_on_time = city_pct_on_time ) df_tidy_city_airline <- df_tidy_city_airline[,-3] df_tidy_city_airline <- df_tidy_city_airline[,-4:-5] df_tidy_city_airline
# A tibble: 10 × 4
city airline airline_pct_on_time total_pct_on_time
<chr> <chr> <dbl> <dbl>
1 Los Angeles Alaska 0.889 0.869
2 Los Angeles Am West 0.856 0.869
3 Phoenix Alaska 0.948 0.922
4 Phoenix Am West 0.921 0.922
5 San Diego Alaska 0.914 0.875
6 San Diego Am West 0.855 0.875
7 San Francisco Alaska 0.831 0.781
8 San Francisco Am West 0.713 0.781
9 Seattle Alaska 0.858 0.848
10 Seattle Am West 0.767 0.848
ggplot(df_tidy,aes(x = airline, y = total_count, fill= city)) +geom_bar(position="stack", stat="identity") +scale_fill_brewer(palette ="Pastel1") +ylab("Total Flights") +xlab("Airline") +ggtitle("Total Flights by City: Alaska vs. Am West") +labs(fill ="City")