The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.1
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The next chunk load the data from github with na.strings arguments that interprets the character vector as NA values. Then, we assign new value for row 2 column 1 in the dataframe df to “ALASKA” and the apply the same to row 5 column 1 to “AM West”.
df <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/Numbersenese_kaiserfung_data.csv",
na.strings = c("", "NA"))
df[2, 1] <- "ALASKA"
df[5, 1] <- "AM West"
df
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 61 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM West on time 694 4840 383 320 201
## 5 AM West delayed 117 415 65 129 61
Filtering out the rows from df where all values are missing. rowSums(is.na(.)) creates a vector of the missing values in each row. If at least one column on a given row has a missing value then that row is filtered out.
df_filter <- df %>%
filter(rowSums(is.na(.)) < ncol(.)) %>%
rename(Airline = "X", Status = "X.1")
df_filter
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 61 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 AM West delayed 117 415 65 129 61
Using the pivot_longer() function, we can convert the data from wide format to long format. Thus tidying the data. The chunk below takes in the df_filter dataframe then pipes into the pivot_longer() function: where we selected the city columns that will be pivoted. Then, names_to = “City” will be a new column that will hold the old column names. values_to = “Count”, where “Count” will be a new column that will hold the values of the old columns.
-cols specifies the columns that will be pivoted
-names_to specifies the name of the column that will hold the name/s of
the old column name/s
-values_to specifies the name of the column that will hold the values
of the old column/s
df_long <- df_filter %>%
pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
names_to = "City", values_to = "Count")
head(df_long)
## # A tibble: 6 × 4
## Airline Status City Count
## <chr> <chr> <chr> <int>
## 1 ALASKA "on time " Los.Angeles 497
## 2 ALASKA "on time " Phoenix 221
## 3 ALASKA "on time " San.Diego 212
## 4 ALASKA "on time " San.Francisco 503
## 5 ALASKA "on time " Seattle 1841
## 6 ALASKA "delayed" Los.Angeles 61
Here, creating a new datafram called df_summary where it takes the data from df_long, then using the group_by() to group by Airline and City then pipes into the summarise() function where it calculate the total number of delays, total number of flights, and the percentage of delays.
df_summary <- df_long %>%
group_by(Airline, City) %>%
summarise(total_delays = sum(Count[Status == "delayed"]), total_flights = sum(Count), pct_delays = round(total_delays/total_flights,
2))
## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.
df_summary
## # A tibble: 10 × 5
## # Groups: Airline [2]
## Airline City total_delays total_flights pct_delays
## <chr> <chr> <int> <int> <dbl>
## 1 ALASKA Los.Angeles 61 558 0.11
## 2 ALASKA Phoenix 12 233 0.05
## 3 ALASKA San.Diego 20 232 0.09
## 4 ALASKA San.Francisco 102 605 0.17
## 5 ALASKA Seattle 305 2146 0.14
## 6 AM West Los.Angeles 117 811 0.14
## 7 AM West Phoenix 415 5255 0.08
## 8 AM West San.Diego 65 448 0.15
## 9 AM West San.Francisco 129 449 0.29
## 10 AM West Seattle 61 262 0.23
ggplot(df_summary, aes(x = City, y = pct_delays, fill = Airline)) + geom_bar(stat = "identity",
position = "dodge") + geom_text(aes(label = pct_delays), position = position_dodge(width = 1),
vjust = -0.5) + labs(x = "Airline", y = "Delay rate", fill = "Airline") + theme_bw()
From the given data and shown on the graph, it can be observed that AM West Airlines had a higher rate of arrival delays in these five cities relative to Alaska Airlines. The difference in delays rate is at least 3% which is indicative of a performance disparity between the two airlines in this closed experiment. Thus, the data suggests that traveling to these five cities, Alaska Airlines might be the better choice since it has a lower chance of a delayed arrival.