In this assignment, I go over untidy data concerning two airlines on-time data for multiple cities, tidy it, and then analyze it using tidyr and dplyr.
I added na.strings parameter so that I can use is.na to find missing data and empty rows.
airlines <- read.csv("https://raw.githubusercontent.com/Kadaeux/DATA607Ass4/master/ass4.csv", na.strings=c("","NA"))
head(airlines)## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
There are three interrelated rules that make a dataset tidy:
Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.
Our dataset has a lot of tidying to meet these standards. Our variables are Airline, City, Delayed Count, and On-Time Count. Each row should be an observation of all of these variables. Example of ideal observation for this data:
| Airline | City | Delayed_Count | On_Time_Count |
|---|---|---|---|
| ALASKA | Log.Angeles | 62 | 497 |
So, we’ll need to parse and tidy up this old dataframe and transform it into our expected tidy dataframe.
## Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Then we’ll remove the rows with ALL NA.
## Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
We are missing the airline name for delayed departures, as the original dataset had it implicit based on previous row. Let’s add that in using fill, specifying to use previous row’s value.
## Airline On_Time_Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Looking better already! Now let’s gather these non-variable columns representing cities into a new City column.
airlines <- airlines %>% pivot_longer(cols=colnames(airlines[3:length(airlines)]),names_to = "City", values_to = "Flight_Count")
head(airlines)## # A tibble: 6 × 4
## Airline On_Time_Status City Flight_Count
## <chr> <chr> <chr> <int>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
Almost there… we need to spread the on_time_status column values (On Time and Delayed) to their own columns, containing the values from Flight_Count.
## # A tibble: 6 × 4
## Airline City delayed `on time`
## <chr> <chr> <int> <int>
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
Now that we’ve got a nice little tidy dataset, we can go about using it for analyzing these airlines’ performance.
airlinesTotals <- setNames(aggregate(cbind(airlines$delayed,airlines$`on time`),by=list(Airline=airlines$Airline),FUN=sum), c("Airline","delayed","on time"))
head(airlinesTotals)## Airline delayed on time
## 1 ALASKA 501 3274
## 2 AM WEST 787 6438
We can also do this using dplyr.
airlinesTotals <- airlines %>% group_by(Airline) %>% summarise(across(delayed:`on time`, sum))
head(airlinesTotals)## # A tibble: 2 × 3
## Airline delayed `on time`
## <chr> <int> <int>
## 1 ALASKA 501 3274
## 2 AM WEST 787 6438
However, we should probably care about getting numbers like what percentage of flights were on time. Luckily, dplyr makes this easy using the same function we used to sum our delayed and on time flight counts.
airlinesTotals <- airlines %>% group_by(Airline) %>% summarise(across(delayed:`on time`, sum),
On_Time_Percentage = (`on time`/(`on time`+delayed)))
head(airlinesTotals)## # A tibble: 2 × 4
## Airline delayed `on time` On_Time_Percentage
## <chr> <int> <int> <dbl>
## 1 ALASKA 501 3274 0.867
## 2 AM WEST 787 6438 0.891
We can see here that AM West has a better on time percentage. If we wanted to analyze it by city, we can do the same function and simply group by Airline + city.
airlinesTotals <- airlines %>% group_by(Airline,City) %>% summarise(across(delayed:`on time`, sum),
On_Time_Percentage = (`on time`/(`on time`+delayed)))## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 5
## # Groups: Airline [2]
## Airline City delayed `on time` On_Time_Percentage
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 62 497 0.889
## 2 ALASKA Phoenix 12 221 0.948
## 3 ALASKA San.Diego 20 212 0.914
## 4 ALASKA San.Francisco 102 503 0.831
## 5 ALASKA Seattle 305 1841 0.858
## 6 AM WEST Los.Angeles 117 694 0.856
Here we can see that while AM West has a better on time percentage as a whole, it tends to vary based on destination(?) city.