library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
1. Creating the CSV, exporting it to Github
initial_csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Fransisco", "Seattle"),
c("ALASKA", "on time", 497, 221, 212, 503, 1841),
c(NA, "delayed", 62, 12, 20, 102, 305),
c("AM WEST", "on time", 694, 4840, 383, 320, 201),
c(NA, "delayed", 117, 415, 65, 129, 61))
write.table(initial_csv, file = "Initial_csv_Data", sep = ",", col.names = F, row.names = F)
2. Here I will transform the data using tidyr and dplyr. I will start this by going from wide to long data. Note, I do not need to load any extra packages because tidyr and dplyr are located within tidyverse. Importing CSV from GitHub into a usable table
csv_week5 <- read.csv("https://raw.githubusercontent.com/jonburns2454/DATA607/main/Initial_csv_Data")
Start by renaming the first two columns
csv_week5 <- csv_week5 %>%
rename("Airline" = 1, "Status" = 2)
Dealing with NAs | This replaces the NA figs under ‘Airline’ with their respective airline classifications to avoid issues throuhout the rest of the problem
csv_week5 <- csv_week5 %>%
fill(Airline, .direction = c("down"))
shift to long data format This takes the data and utilized the ‘pivot_longer’ funtion in tidyR to change the wide data into ‘Arrival.City’ and ‘Number.Of.Flights’. After this I selected the entire df using Airline:Number.Of.Flights and then arranged these figures by status in decending order by the airline. The next funtion this uses is spread, which separates number of flights into ontime and delayed. Lastly, the rename function changes the finiky ‘on time’ to on.time, to make analysis easier.
long_airline_data <- csv_week5 %>%
pivot_longer(cols = 3:7, names_to = "Arrival.City",
values_to = "Number.Of.Flights") %>%
select(Airline:Number.Of.Flights) %>%
arrange(Airline, desc(Status), 'Arrival City') %>%
spread(Status, Number.Of.Flights) %>%
rename(on.time = 'on time')
data.table::data.table(long_airline_data)
## Airline Arrival.City delayed on.time
## 1: ALASKA Los.Angeles 62 497
## 2: ALASKA Phoenix 12 221
## 3: ALASKA San.Diego 20 212
## 4: ALASKA San.Fransisco 102 503
## 5: ALASKA Seattle 305 1841
## 6: AM WEST Los.Angeles 117 694
## 7: AM WEST Phoenix 415 4840
## 8: AM WEST San.Diego 65 383
## 9: AM WEST San.Fransisco 129 320
## 10: AM WEST Seattle 61 201
3. Comparing general mean of “ontime flights” and “delayed flights” bewteen the two
On_time_data <- long_airline_data %>%
group_by(Airline) %>%
summarize(mean_ontime = mean(on.time))
delayed_data <- long_airline_data %>%
group_by(Airline) %>%
summarize(mean_delayed = mean(delayed))
Plotting and comparing on.time flights to delayed flights at the two airports
On time
ggplot(On_time_data, aes(x = Airline, y = mean_ontime, color = Airline)) +
geom_bar(stat = "identity")
Delayed
ggplot(delayed_data, aes(x = Airline, y = mean_delayed, color = Airline)) +
geom_bar(stat = "identity")
Conclusion The data shows that AM West has both more On time flights and delayed arrivals than compared with ALASKA. One thing to note however, is that AM west appears to run many more flights than Alaska but the difference between the mean delayed flight of the airlines is fairly small in comparison to how many more flights AM west runs.