Assignment – Tidying and Transforming Data
We are given a .csv file containing disjointed data for two airlines - “Alaska” and “AM WEST”, the five airports that they operate out of, and their respective arrival and departure delays. Use ‘tidyr’ and ‘dplyr’ to tidy and transform the data. Perform an analysis to compare the arrival delays for the two airlines.
Read data from my github repository
airlines <- read.csv(file = "https://raw.githubusercontent.com/carolc57/Data607-Fall23/main/Airlines.csv", header = TRUE, sep = ",", na.strings = "")
#display data
kable(airlines, format = "markdown")
| X | X.1 | Los..Angeles | Phoenix | San..Diego | San..Francisco | Seattle |
|---|---|---|---|---|---|---|
| ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| NA | delayed | 62 | 12 | 20 | 102 | 305 |
| NA | NA | NA | NA | NA | NA | NA |
| AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| NA | delayed | 117 | 415 | 65 | 129 | 61 |
#rename airlines to preserve original file
t_airlines <- airlines
Let the tidying begin…
#rename columns 1 and 2
names(t_airlines)[1] <- "Airlines"
names(t_airlines)[2] <- "Arrival_Status"
#Add airline names to blank column adjacent to delayed status
t_airlines[2,1] <- "ALASKA"
t_airlines[5,1] <- "AM WEST"
#remove NA rows
t_airlines <- t_airlines[!apply(is.na(t_airlines[1:7]),1,all), ]
kable(t_airlines, format = "markdown")
| Airlines | Arrival_Status | Los..Angeles | Phoenix | San..Diego | San..Francisco | Seattle | |
|---|---|---|---|---|---|---|---|
| 1 | ALASKA | on time | 497 | 221 | 212 | 503 | 1841 |
| 2 | ALASKA | delayed | 62 | 12 | 20 | 102 | 305 |
| 4 | AM WEST | on time | 694 | 4840 | 383 | 320 | 201 |
| 5 | AM WEST | delayed | 117 | 415 | 65 | 129 | 61 |
Transform data from wide to long; filter for delayed flights only
#transform from wide to long
t_airlines2 <- t_airlines |>
pivot_longer(cols = 3:7,
names_to = "City",
values_to = "Count") |>
filter(Arrival_Status =="delayed") #because we're only interested in delayed flights for our analysis
kable(t_airlines2,format = "markdown")
| Airlines | Arrival_Status | City | Count |
|---|---|---|---|
| ALASKA | delayed | Los..Angeles | 62 |
| ALASKA | delayed | Phoenix | 12 |
| ALASKA | delayed | San..Diego | 20 |
| ALASKA | delayed | San..Francisco | 102 |
| ALASKA | delayed | Seattle | 305 |
| AM WEST | delayed | Los..Angeles | 117 |
| AM WEST | delayed | Phoenix | 415 |
| AM WEST | delayed | San..Diego | 65 |
| AM WEST | delayed | San..Francisco | 129 |
| AM WEST | delayed | Seattle | 61 |
Lets see the performance of these airlines by city
flightsbycity <- t_airlines2 |> select(Airlines, City, Count) %>%
group_by(Airlines, City) %>%
summarise(total = sum(Count)) %>%
mutate(percent = round(total/sum(total) * 100, 2),
percent_formatted = paste0(round(total/sum(total) * 100, 1), "%"),
total_flights = sum(total))
## `summarise()` has grouped output by 'Airlines'. You can override using the
## `.groups` argument.
flights_city <- as.data.frame(flightsbycity)
kable(flights_city, format = "markdown")
| Airlines | City | total | percent | percent_formatted | total_flights |
|---|---|---|---|---|---|
| ALASKA | Los..Angeles | 62 | 12.38 | 12.4% | 501 |
| ALASKA | Phoenix | 12 | 2.40 | 2.4% | 501 |
| ALASKA | San..Diego | 20 | 3.99 | 4% | 501 |
| ALASKA | San..Francisco | 102 | 20.36 | 20.4% | 501 |
| ALASKA | Seattle | 305 | 60.88 | 60.9% | 501 |
| AM WEST | Los..Angeles | 117 | 14.87 | 14.9% | 787 |
| AM WEST | Phoenix | 415 | 52.73 | 52.7% | 787 |
| AM WEST | San..Diego | 65 | 8.26 | 8.3% | 787 |
| AM WEST | San..Francisco | 129 | 16.39 | 16.4% | 787 |
| AM WEST | Seattle | 61 | 7.75 | 7.8% | 787 |
Graphical presentation of airline performance by city
flights_city %>%
ggplot( aes(x=City, y=percent, fill=Airlines)) +
geom_bar(stat="identity", position=position_dodge(), colour="navy") +
geom_text(aes(label=percent_formatted), vjust=.5, hjust=1,position= position_dodge(width=0.9), color="navy") +
ggtitle("Percentage of Flights Delayed by Airline by City") +
xlab("Destination") + ylab("Percentage of Flights Delayed") +
coord_flip()
It’s interesting to see that Alaska Airlines experienced the highest
delays at Seattle while AM West had a higher number of delays at
Phoenix. Both airlines experienced relatively close arrival delays at
Los Angeles and San Francisco.
Additional tidying to compare arrival delays between the two airlines
stats_t_airlines2 <- t_airlines2 %>% dplyr::group_by(Airlines) %>%
dplyr::summarise(total=sum(Count),
max=max(Count),
min=min(Count),
avg=mean(Count)
)
stats_t_airlines2
## # A tibble: 2 x 5
## Airlines total max min avg
## <chr> <int> <int> <int> <dbl>
## 1 ALASKA 501 305 12 100.
## 2 AM WEST 787 415 61 157.
airlines_comp <- as.data.frame(stats_t_airlines2)
kable(airlines_comp, format = "markdown")
| Airlines | total | max | min | avg |
|---|---|---|---|---|
| ALASKA | 501 | 305 | 12 | 100.2 |
| AM WEST | 787 | 415 | 61 | 157.4 |
From our analysis, we can conclude that Alaska Airlines has a better on-time performance than AM West Airlines over all cities. Alaska Airlines average number of delayed flights was 100 compared to AM West Airlines’ average number of delayed flights of 157.
Graphical representation of Airlines Delay
airlines_comp |>
ggplot( aes(x=Airlines, y=total, fill=Airlines)) +
geom_bar(stat="identity", position=position_dodge(), colour="navy", width = 0.5) +
ggtitle("Flights Delayed by Airline") +
xlab("Airlines") + ylab("Number of Flights Delayed")
Again we can see clearly that Alaska Airlines had fewer delayed flights
than AM West.