This assignment will import uncleaned data from a csv. file. My task is to tidy and transform data as described below. (1) Read the information from a .CSV file into R, and use tidyr and dplyr as needed to tidy and transform the data. (2) Perform analysis to compare the arrival delays for the two airlines
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
url = r"(https://raw.githubusercontent.com/yinaS1234/data-607/main/data%20607%20lab%205/lab5data.csv)"
suppressMessages(
df <- read_csv(url, skip_empty_rows = TRUE, show_col_types = FALSE)[-3,]
)
names(df)[1:2] <- c("Airline", "Status")
df
## # A tibble: 4 × 7
## Airline Status `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
fixing NA transform data, rename
for(x in seq(from=2, to=nrow(df), by=2))
{
df[x, 1] = df[x-1, 1]
}
library(tidyr)
df_unpivot = df %>%
gather(key="City", value="Count", c("Los Angeles","Phoenix", "San Diego", "San Francisco", "Seattle"))
df_unpivot
## # A tibble: 20 × 4
## Airline Status City Count
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time San Francisco 503
## 14 ALASKA delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
library(dplyr)
df_unpivot_2 = df_unpivot %>%
spread(key="Status", value="Count")
df_unpivot_2 = df_unpivot_2 %>%
rename(on_time = `on time`)
df_unpivot_2 = df_unpivot_2 %>%
mutate(otp = on_time/(on_time + delayed)) %>%
arrange(desc(otp))
df_unpivot_2
## # A tibble: 10 × 5
## Airline City delayed on_time otp
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 ALASKA Phoenix 12 221 0.948
## 2 AM WEST Phoenix 415 4840 0.921
## 3 ALASKA San Diego 20 212 0.914
## 4 ALASKA Los Angeles 62 497 0.889
## 5 ALASKA Seattle 305 1841 0.858
## 6 AM WEST Los Angeles 117 694 0.856
## 7 AM WEST San Diego 65 383 0.855
## 8 ALASKA San Francisco 102 503 0.831
## 9 AM WEST Seattle 61 201 0.767
## 10 AM WEST San Francisco 129 320 0.713
df_unpivot_2 %>%
select(Airline, delayed, on_time) %>%
group_by(Airline) %>%
summarise(delayed = sum(delayed), on_time = sum(on_time), otp = sum(on_time) / (sum(on_time) + sum(delayed))) %>%
arrange(desc(otp))
## # A tibble: 2 × 4
## Airline delayed on_time otp
## <chr> <dbl> <dbl> <dbl>
## 1 AM WEST 787 6438 0.891
## 2 ALASKA 501 3274 0.867
df_unpivot_2 %>%
filter(City != "Phoenix") %>%
select(Airline, delayed, on_time) %>%
group_by(Airline) %>%
summarise(delay = sum(delayed), on_time = sum(on_time), otp = sum(on_time) / (sum(on_time) + sum(delayed))) %>%
arrange(desc(otp))
## # A tibble: 2 × 4
## Airline delay on_time otp
## <chr> <dbl> <dbl> <dbl>
## 1 ALASKA 489 3053 0.862
## 2 AM WEST 372 1598 0.811
Based on the data provided,ALASKA consistently outperforms AM WEST. For flying to Phoenix specifically, I would recommend AM WEST as the on time performance is on par with ALASKA and have more flights available.