Overview

In this assignment, I go over untidy data concerning two airlines on-time data for multiple cities, tidy it, and then analyze it using tidyr and dplyr.

Data Retrieval

I added na.strings parameter so that I can use is.na to find missing data and empty rows.

airlines <- read.csv("https://raw.githubusercontent.com/Kadaeux/DATA607Ass4/master/ass4.csv", na.strings=c("","NA"))
head(airlines)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Data Tidying

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.

  2. Each observation is a row; each row is an observation.

  3. Each value is a cell; each cell is a single value.

Our dataset has a lot of tidying to meet these standards. Our variables are Airline, City, Delayed Count, and On-Time Count. Each row should be an observation of all of these variables. Example of ideal observation for this data:

Airline City Delayed_Count On_Time_Count
ALASKA Log.Angeles 62 497

So, we’ll need to parse and tidy up this old dataframe and transform it into our expected tidy dataframe.

names(airlines)[1] <- "Airline"
names(airlines)[2] <- "On_Time_Status"
head(airlines)
##   Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA        on time         497     221       212           503    1841
## 2    <NA>        delayed          62      12        20           102     305
## 3    <NA>           <NA>          NA      NA        NA            NA      NA
## 4 AM WEST        on time         694    4840       383           320     201
## 5    <NA>        delayed         117     415        65           129      61

Then we’ll remove the rows with ALL NA.

airlines <- airlines[rowSums(is.na(airlines)) != ncol(airlines),]
head(airlines)
##   Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA        on time         497     221       212           503    1841
## 2    <NA>        delayed          62      12        20           102     305
## 4 AM WEST        on time         694    4840       383           320     201
## 5    <NA>        delayed         117     415        65           129      61

We are missing the airline name for delayed departures, as the original dataset had it implicit based on previous row. Let’s add that in using fill, specifying to use previous row’s value.

airlines <- airlines %>% fill(Airline, .direction = "down")
head(airlines)
##   Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA        on time         497     221       212           503    1841
## 2  ALASKA        delayed          62      12        20           102     305
## 4 AM WEST        on time         694    4840       383           320     201
## 5 AM WEST        delayed         117     415        65           129      61

Looking better already! Now let’s gather these non-variable columns representing cities into a new City column.

airlines <- airlines %>% pivot_longer(cols=colnames(airlines[3:length(airlines)]),names_to = "City", values_to = "Flight_Count")
head(airlines)
## # A tibble: 6 × 4
##   Airline On_Time_Status City          Flight_Count
##   <chr>   <chr>          <chr>                <int>
## 1 ALASKA  on time        Los.Angeles            497
## 2 ALASKA  on time        Phoenix                221
## 3 ALASKA  on time        San.Diego              212
## 4 ALASKA  on time        San.Francisco          503
## 5 ALASKA  on time        Seattle               1841
## 6 ALASKA  delayed        Los.Angeles             62

Almost there… we need to spread the on_time_status column values (On Time and Delayed) to their own columns, containing the values from Flight_Count.

airlines <- airlines %>% spread(key = On_Time_Status, value = Flight_Count)
head(airlines)
## # A tibble: 6 × 4
##   Airline City          delayed `on time`
##   <chr>   <chr>           <int>     <int>
## 1 ALASKA  Los.Angeles        62       497
## 2 ALASKA  Phoenix            12       221
## 3 ALASKA  San.Diego          20       212
## 4 ALASKA  San.Francisco     102       503
## 5 ALASKA  Seattle           305      1841
## 6 AM WEST Los.Angeles       117       694

Data Analysis

Now that we’ve got a nice little tidy dataset, we can go about using it for analyzing these airlines’ performance.

airlinesTotals <- setNames(aggregate(cbind(airlines$delayed,airlines$`on time`),by=list(Airline=airlines$Airline),FUN=sum), c("Airline","delayed","on time"))

head(airlinesTotals)
##   Airline delayed on time
## 1  ALASKA     501    3274
## 2 AM WEST     787    6438

We can also do this using dplyr.

airlinesTotals <- airlines %>% group_by(Airline) %>% summarise(across(delayed:`on time`, sum))
head(airlinesTotals)
## # A tibble: 2 × 3
##   Airline delayed `on time`
##   <chr>     <int>     <int>
## 1 ALASKA      501      3274
## 2 AM WEST     787      6438

However, we should probably care about getting numbers like what percentage of flights were on time. Luckily, dplyr makes this easy using the same function we used to sum our delayed and on time flight counts.

airlinesTotals <- airlines %>% group_by(Airline) %>% summarise(across(delayed:`on time`, sum),
                                                                 On_Time_Percentage = (`on time`/(`on time`+delayed)))
head(airlinesTotals)
## # A tibble: 2 × 4
##   Airline delayed `on time` On_Time_Percentage
##   <chr>     <int>     <int>              <dbl>
## 1 ALASKA      501      3274              0.867
## 2 AM WEST     787      6438              0.891

We can see here that AM West has a better on time percentage. If we wanted to analyze it by city, we can do the same function and simply group by Airline + city.

airlinesTotals <- airlines %>% group_by(Airline,City) %>% summarise(across(delayed:`on time`, sum),
                                                                 On_Time_Percentage = (`on time`/(`on time`+delayed)))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
head(airlinesTotals)
## # A tibble: 6 × 5
## # Groups:   Airline [2]
##   Airline City          delayed `on time` On_Time_Percentage
##   <chr>   <chr>           <int>     <int>              <dbl>
## 1 ALASKA  Los.Angeles        62       497              0.889
## 2 ALASKA  Phoenix            12       221              0.948
## 3 ALASKA  San.Diego          20       212              0.914
## 4 ALASKA  San.Francisco     102       503              0.831
## 5 ALASKA  Seattle           305      1841              0.858
## 6 AM WEST Los.Angeles       117       694              0.856

Here we can see that while AM West has a better on time percentage as a whole, it tends to vary based on destination(?) city.