Data 607 Assignment 5

Load packages

library(tidyverse)
library(sqldf)

Overview / Introduction

For this assignment, we are given a table in an untidy structure. This table contains information on Alaska and AmWest airline flights to Los Angeles, Phoenix, San Diego, San Francisco, and Seattle. Specifically, it contains information on the number of delayed and on-time flights to each of these destinations. This untidy data was created as a CSV file and then transformed into a more easily manipulated structure. Then, analysis was performed to compare the arrival delays. This analysis was done in the form of two bar graphs, used for comparing the arrival delays between the two airlines.

Create a CSV file that includes all of the information

To mimic the wide structure of the data, I included all of the destinations as columns for the table. The timing of the flight (e.g. delayed or on time) and the airline were also included as columns

drop table if exists airlines

create table airlines (
airline nvarchar(50),
timing nvarchar(50), 
Los_Angeles int,
Phoenix int,
San_Diego int,
San_Francisco int,
Seattle int)

insert into airlines 
values ('Alaska','on_time', 497, 221, 212, 503, 1841),
('','delayed', 62, 12, 20, 102, 305),
(NULL,NULL,NULL,NULL,NULL,NULL,NULL),
('AmWest','on_time', 694, 4840, 383, 320, 201),
('','delayed', 117, 415, 65, 129, 61)

Read information into R and tidy/transform the data

To tidy the data, I used the pivot_longer function to turn all the destinations into rows. This created two columns, Destination and CountFlights.

airlines <- read.csv("C:\\Users\\Kim\\Documents\\Data607\\airlines.csv", header = TRUE, sep = ",")

airlines <- airlines[-3,]

airlines[2,"airline"] <- "Alaska"
airlines[4,"airline"] <- "AmWest"


airlines_pivot <- airlines %>% 
  pivot_longer(c(Los_Angeles,Phoenix,San_Diego,San_Francisco,Seattle),
               names_to = "Destination", values_to = "CountFlights")
airlines_pivot$CountFlights <- as.integer(airlines_pivot$CountFlights)
airlines_pivot

## # A tibble: 20 × 4
##    airline timing  Destination   CountFlights
##    <chr>   <chr>   <chr>                <int>
##  1 Alaska  on_time Los_Angeles            497
##  2 Alaska  on_time Phoenix                221
##  3 Alaska  on_time San_Diego              212
##  4 Alaska  on_time San_Francisco          503
##  5 Alaska  on_time Seattle               1841
##  6 Alaska  delayed Los_Angeles             62
##  7 Alaska  delayed Phoenix                 12
##  8 Alaska  delayed San_Diego               20
##  9 Alaska  delayed San_Francisco          102
## 10 Alaska  delayed Seattle                305
## 11 AmWest  on_time Los_Angeles            694
## 12 AmWest  on_time Phoenix               4840
## 13 AmWest  on_time San_Diego              383
## 14 AmWest  on_time San_Francisco          320
## 15 AmWest  on_time Seattle                201
## 16 AmWest  delayed Los_Angeles            117
## 17 AmWest  delayed Phoenix                415
## 18 AmWest  delayed San_Diego               65
## 19 AmWest  delayed San_Francisco          129
## 20 AmWest  delayed Seattle                 61

Perform Analysis to compare arrival delays for the two airlines

From the following bar graph, AmWest appears to have significantly more delays than Alaska in Phoenix. Amwest also has more delays in Los Angeles, San Diego, and San Francisco, although the difference is less staggering. Alaska has more delays than Amwest for flights to Seattle.

airlines_pivot %>% filter(timing == "delayed") %>% 
  ggplot(aes(fill = airline, y = CountFlights, x = Destination)) + 
  geom_bar(position = "dodge", stat = "identity") +
  labs(y = "Flights", title = "Number of Delayed Flights")

However, this may be misleading as Amwest and Alaska do not have the same number of flights to each destination. To standardize flight delays, we can use percentage delay as a better metric. First, I split the dataset based on the timing of the flight (e.g. on time or delayed). Then, I joined the dataset based on destination and airline so that each row contains airline, destination, count of delayed flights, and count of on time flights. The timing columns were dropped as they were no longer needed. From there, percentage can be calculated and graphed. We can see now that Amwest has a greater percentage of delayed flights at all of the destinations.

airlines_delayed <- airlines_pivot %>% filter(timing == "delayed") 
airlines_on_time <- airlines_pivot %>% filter(timing == "on_time")
airlines_delayed_on_time <- inner_join(airlines_delayed, airlines_on_time, by = c("airline","Destination")) %>% 
  rename(delayedFlights = CountFlights.x, on_timeFlights = CountFlights.y) %>% select(-timing.x, -timing.y) %>%
  mutate(total_flights = delayedFlights + on_timeFlights) %>%
  mutate(percent_delayed = 100 * delayedFlights / total_flights, percent_on_time = 100 * on_timeFlights / total_flights)
airlines_delayed_on_time

## # A tibble: 10 × 7
##    airline Destination   delayedFlights on_timeFlights total_flights
##    <chr>   <chr>                  <int>          <int>         <int>
##  1 Alaska  Los_Angeles               62            497           559
##  2 Alaska  Phoenix                   12            221           233
##  3 Alaska  San_Diego                 20            212           232
##  4 Alaska  San_Francisco            102            503           605
##  5 Alaska  Seattle                  305           1841          2146
##  6 AmWest  Los_Angeles              117            694           811
##  7 AmWest  Phoenix                  415           4840          5255
##  8 AmWest  San_Diego                 65            383           448
##  9 AmWest  San_Francisco            129            320           449
## 10 AmWest  Seattle                   61            201           262
## # ℹ 2 more variables: percent_delayed <dbl>, percent_on_time <dbl>

airlines_delayed_on_time  %>% 
  ggplot(aes(fill = airline, y = percent_delayed, x = Destination)) + 
  geom_bar(position = "dodge", stat = "identity") +
  labs(y = "Delayed Percentage", title = "Percentage of Delayed Flights")

Conclusion

This assignment introduced me to the pivot_longer function, which was incredibly useful in transforming badly structured data into a more usable format. In an earlier assignment with the movie ratings, I had re-done the structure of my table in SQL after my first attempt. I had essentially gone back and restructured the table manually in SQL when inserting the values. However, with pivot_longer, I would have been able to do this in R with less effort. After using the pivot_longer function, it was easy to create a better table structure using table joins, and this resulted in the final table structure with Airline, Destination, Delayed Flights, On Time Flights, Total Flights, Delayed Flight Percentage, and On Time Flight Percentage as the table columns.