Assignment

Assignment Description

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. (3) Perform analysis to compare the arrival delays for the two airlines. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.

Import packages

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Load the Data

The next chunk load the data from github with na.strings arguments that interprets the character vector as NA values. Then, we assign new value for row 2 column 1 in the dataframe df to “ALASKA” and the apply the same to row 5 column 1 to “AM West”.

df <- read.csv("https://raw.githubusercontent.com/Nick-Climaco/Rdataset/main/Numbersenese_kaiserfung_data.csv",
    na.strings = c("", "NA"))
df[2, 1] <- "ALASKA"
df[5, 1] <- "AM West"
df

##         X      X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time          497     221       212           503    1841
## 2  ALASKA  delayed          61      12        20           102     305
## 3    <NA>     <NA>          NA      NA        NA            NA      NA
## 4 AM West  on time         694    4840       383           320     201
## 5 AM West  delayed         117     415        65           129      61

Tidy and Clean

Filtering out the rows from df where all values are missing. rowSums(is.na(.)) creates a vector of the missing values in each row. If at least one column on a given row has a missing value then that row is filtered out.

df_filter <- df %>%
    filter(rowSums(is.na(.)) < ncol(.)) %>%
    rename(Airline = "X", Status = "X.1")
df_filter

##   Airline   Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time          497     221       212           503    1841
## 2  ALASKA  delayed          61      12        20           102     305
## 3 AM West  on time         694    4840       383           320     201
## 4 AM West  delayed         117     415        65           129      61

Convert to long format using pivot_longer()

Using the pivot_longer() function, we can convert the data from wide format to long format. Thus tidying the data. The chunk below takes in the df_filter dataframe then pipes into the pivot_longer() function: where we selected the city columns that will be pivoted. Then, names_to = “City” will be a new column that will hold the old column names. values_to = “Count”, where “Count” will be a new column that will hold the values of the old columns.

-cols specifies the columns that will be pivoted 
-names_to specifies the name of the column that will hold the name/s of
the old column name/s
-values_to specifies the name of the column that will hold the values
of the old column/s

df_long <- df_filter %>%
    pivot_longer(cols = c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"),
        names_to = "City", values_to = "Count")

head(df_long)

## # A tibble: 6 × 4
##   Airline Status     City          Count
##   <chr>   <chr>      <chr>         <int>
## 1 ALASKA  "on time " Los.Angeles     497
## 2 ALASKA  "on time " Phoenix         221
## 3 ALASKA  "on time " San.Diego       212
## 4 ALASKA  "on time " San.Francisco   503
## 5 ALASKA  "on time " Seattle        1841
## 6 ALASKA  "delayed"  Los.Angeles      61

Analysis : for arrival delays for Alaska and AM West airlines

Here, creating a new datafram called df_summary where it takes the data from df_long, then using the group_by() to group by Airline and City then pipes into the summarise() function where it calculate the total number of delays, total number of flights, and the percentage of delays.

df_summary <- df_long %>%
    group_by(Airline, City) %>%
    summarise(total_delays = sum(Count[Status == "delayed"]), total_flights = sum(Count), pct_delays = round(total_delays/total_flights,
        2))

## `summarise()` has grouped output by 'Airline'. You can override using the
## `.groups` argument.

df_summary

## # A tibble: 10 × 5
## # Groups:   Airline [2]
##    Airline City          total_delays total_flights pct_delays
##    <chr>   <chr>                <int>         <int>      <dbl>
##  1 ALASKA  Los.Angeles             61           558       0.11
##  2 ALASKA  Phoenix                 12           233       0.05
##  3 ALASKA  San.Diego               20           232       0.09
##  4 ALASKA  San.Francisco          102           605       0.17
##  5 ALASKA  Seattle                305          2146       0.14
##  6 AM West Los.Angeles            117           811       0.14
##  7 AM West Phoenix                415          5255       0.08
##  8 AM West San.Diego               65           448       0.15
##  9 AM West San.Francisco          129           449       0.29
## 10 AM West Seattle                 61           262       0.23

Visualization

ggplot(df_summary, aes(x = City, y = pct_delays, fill = Airline)) + geom_bar(stat = "identity",
    position = "dodge") + geom_text(aes(label = pct_delays), position = position_dodge(width = 1),
    vjust = -0.5) + labs(x = "Airline", y = "Delay rate", fill = "Airline") + theme_bw()

Conclusion

From the given data and shown on the graph, it can be observed that AM West Airlines had a higher rate of arrival delays in these five cities relative to Alaska Airlines. The difference in delays rate is at least 3% which is indicative of a performance disparity between the two airlines in this closed experiment. Thus, the data suggests that traveling to these five cities, Alaska Airlines might be the better choice since it has a lower chance of a delayed arrival.

Assignment_5

Nick Climaco

2023-02-20