Assignment – Tidying and Transforming Data

We are given a .csv file containing disjointed data for two airlines - “Alaska” and “AM WEST”, the five airports that they operate out of, and their respective arrival and departure delays. Use ‘tidyr’ and ‘dplyr’ to tidy and transform the data. Perform an analysis to compare the arrival delays for the two airlines.

Read data from my github repository

airlines <- read.csv(file = "https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/Airlines.csv", header = TRUE, sep = ",", na.strings = "")

#display data
kable(airlines, format = "markdown")
X X.1 Los..Angeles Phoenix San..Diego San..Francisco Seattle
ALASKA on time 497 221 212 503 1841
NA delayed 62 12 20 102 305
NA NA NA NA NA NA NA
AM WEST on time 694 4840 383 320 201
NA delayed 117 415 65 129 61
#rename airlines to preserve original file
t_airlines <- airlines

Let the tidying begin…

#rename columns 1 and 2 
names(t_airlines)[1] <- "Airlines"
names(t_airlines)[2] <- "Arrival_Status"

#Add airline names to blank column adjacent to delayed status 
t_airlines[2,1] <- "ALASKA"
t_airlines[5,1] <- "AM WEST"

#remove NA rows
t_airlines <- t_airlines[!apply(is.na(t_airlines[1:7]),1,all), ]

kable(t_airlines, format = "markdown")
Airlines Arrival_Status Los..Angeles Phoenix San..Diego San..Francisco Seattle
1 ALASKA on time 497 221 212 503 1841
2 ALASKA delayed 62 12 20 102 305
4 AM WEST on time 694 4840 383 320 201
5 AM WEST delayed 117 415 65 129 61

Transform data from wide to long; filter for delayed flights only

#transform from wide to long

t_airlines2 <- t_airlines  |>
  pivot_longer(cols = 3:7, 
               names_to = "City",
               values_to = "Count") |>
  filter(Arrival_Status =="delayed")            #because we're only interested in delayed flights for our analysis

kable(t_airlines2,format = "markdown")
Airlines Arrival_Status City Count
ALASKA delayed Los..Angeles 62
ALASKA delayed Phoenix 12
ALASKA delayed San..Diego 20
ALASKA delayed San..Francisco 102
ALASKA delayed Seattle 305
AM WEST delayed Los..Angeles 117
AM WEST delayed Phoenix 415
AM WEST delayed San..Diego 65
AM WEST delayed San..Francisco 129
AM WEST delayed Seattle 61

Lets see the performance of these airlines by city

flightsbycity <- t_airlines2 |> select(Airlines, City, Count) %>% 
      group_by(Airlines, City) %>% 
      summarise(total = sum(Count)) %>% 
      mutate(percent           = round(total/sum(total) * 100, 2),
             percent_formatted = paste0(round(total/sum(total) * 100, 1), "%"),
             total_flights     = sum(total)) 
## `summarise()` has grouped output by 'Airlines'. You can override using the
## `.groups` argument.
flights_city <- as.data.frame(flightsbycity)

kable(flights_city, format = "markdown")
Airlines City total percent percent_formatted total_flights
ALASKA Los..Angeles 62 12.38 12.4% 501
ALASKA Phoenix 12 2.40 2.4% 501
ALASKA San..Diego 20 3.99 4% 501
ALASKA San..Francisco 102 20.36 20.4% 501
ALASKA Seattle 305 60.88 60.9% 501
AM WEST Los..Angeles 117 14.87 14.9% 787
AM WEST Phoenix 415 52.73 52.7% 787
AM WEST San..Diego 65 8.26 8.3% 787
AM WEST San..Francisco 129 16.39 16.4% 787
AM WEST Seattle 61 7.75 7.8% 787

Graphical presentation of airline performance by city

flights_city %>% 
ggplot( aes(x=City, y=percent, fill=Airlines)) +
    geom_bar(stat="identity", position=position_dodge(), colour="navy") +
    geom_text(aes(label=percent_formatted), vjust=.5, hjust=1,position= position_dodge(width=0.9),  color="navy") +
        ggtitle("Percentage of Flights Delayed by Airline by City") +
    xlab("Destination") + ylab("Percentage of Flights Delayed") +
        coord_flip() 

It’s interesting to see that Alaska Airlines experienced the highest delays at Seattle while AM West had a higher number of delays at Phoenix. Both airlines experienced relatively close arrival delays at Los Angeles and San Francisco.

Additional tidying to compare arrival delays between the two airlines

stats_t_airlines2 <- t_airlines2 %>% dplyr::group_by(Airlines) %>% 
  dplyr::summarise(total=sum(Count),
                   max=max(Count), 
                   min=min(Count),
                   avg=mean(Count)
  )

stats_t_airlines2
## # A tibble: 2 x 5
##   Airlines total   max   min   avg
##   <chr>    <int> <int> <int> <dbl>
## 1 ALASKA     501   305    12  100.
## 2 AM WEST    787   415    61  157.
airlines_comp <- as.data.frame(stats_t_airlines2)

kable(airlines_comp, format = "markdown")
Airlines total max min avg
ALASKA 501 305 12 100.2
AM WEST 787 415 61 157.4

From our analysis, we can conclude that Alaska Airlines has a better on-time performance than AM West Airlines over all cities. Alaska Airlines average number of delayed flights was 100 compared to AM West Airlines’ average number of delayed flights of 157.

Graphical representation of Airlines Delay

airlines_comp |> 
    ggplot( aes(x=Airlines, y=total, fill=Airlines)) +
    geom_bar(stat="identity", position=position_dodge(), colour="navy", width = 0.5) +
    ggtitle("Flights Delayed by Airline") +
    xlab("Airlines") + ylab("Number of Flights Delayed") 

Again we can see clearly that Alaska Airlines had fewer delayed flights than AM West.