The assignment asks us to create a dataset in wide format (mirroring a given chart) and then conduct a comparative analysis.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
A data frame was created containing counts of on-time and delayed arriving flights for two airlines, Alaska and AM West. The data frame has 7 columns: Airline, Status, and one column for each of five destination cities (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle). The Airline column specifies which airline the flights belong to, the Status column indicates whether the flights were on time or delayed, and each city column gives the number of flights with that status arriving in that city. After creating the data frame, the code writes it to a CSV file called “airline-destination.csv”.
# Build the wide table. data.frame() uses check.names = TRUE by default,
# so a name such as "Los Angeles" becomes Los.Angeles in the result.
data <- data.frame(
  "Airline" = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  "Status" = c("on time", "delayed", "on time", "delayed"),
  "Los Angeles" = c(497, 62, 694, 117),
  "Phoenix" = c(221, 12, 4840, 415),
  "San Diego" = c(212, 20, 383, 65),
  "San Francisco" = c(503, 102, 320, 129),
  "Seattle" = c(1841, 305, 201, 61)
)
data
## Airline Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
write_csv(data, "airline-destination.csv")
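To confirm that the file was written as intended, it can be read back and compared with the original table. The following is a minimal sketch and not part of the original workflow: the names `flights_wide`, `csv_path`, and `flights_back` are illustrative, and the file is written to a temporary directory (an assumption made here so the sketch does not overwrite anything).

```r
library(readr)

# Rebuild the wide table so this sketch runs on its own
# (same values as the data frame created above).
flights_wide <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

# Hypothetical location: a temporary directory instead of the working directory.
csv_path <- file.path(tempdir(), "airline-destination.csv")
write_csv(flights_wide, csv_path)

# Read the file back without printing the column specification.
flights_back <- read_csv(csv_path, show_col_types = FALSE)

# Check that the column names and the numeric city counts survived the round trip.
same_names <- identical(names(flights_back), names(flights_wide))
same_values <- all(
  as.matrix(flights_back[, -(1:2)]) == as.matrix(flights_wide[, -(1:2)])
)
same_names && same_values
```

If both checks hold, the CSV on disk carries the same seven columns and the same counts as the in-memory data frame.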
The data frame is reshaped from wide to long format, so that each row holds a single value: the flight count for one combination of airline, status, and city.
data_long <- pivot_longer(data, cols = -c(Airline, Status), names_to = "City", values_to = "Value")
data_long
## # A tibble: 20 × 4
## Airline Status City Value
## <chr> <chr> <chr> <dbl>
## 1 ALASKA on time Los.Angeles 497
## 2 ALASKA on time Phoenix 221
## 3 ALASKA on time San.Diego 212
## 4 ALASKA on time San.Francisco 503
## 5 ALASKA on time Seattle 1841
## 6 ALASKA delayed Los.Angeles 62
## 7 ALASKA delayed Phoenix 12
## 8 ALASKA delayed San.Diego 20
## 9 ALASKA delayed San.Francisco 102
## 10 ALASKA delayed Seattle 305
## 11 AM WEST on time Los.Angeles 694
## 12 AM WEST on time Phoenix 4840
## 13 AM WEST on time San.Diego 383
## 14 AM WEST on time San.Francisco 320
## 15 AM WEST on time Seattle 201
## 16 AM WEST delayed Los.Angeles 117
## 17 AM WEST delayed Phoenix 415
## 18 AM WEST delayed San.Diego 65
## 19 AM WEST delayed San.Francisco 129
## 20 AM WEST delayed Seattle 61
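The City values in the long table keep the dots that `data.frame()` put into the column names (for example `Los.Angeles`). One possible way to restore readable city names is sketched below; the names `flights_wide` and `flights_long` are illustrative, and using `stringr::str_replace_all()` is just one option chosen here.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Rebuild the wide table so this sketch runs on its own.
flights_wide <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
)

flights_long <- flights_wide %>%
  pivot_longer(cols = -c(Airline, Status), names_to = "City", values_to = "Value") %>%
  # fixed(".") matches a literal dot instead of the regex "any character".
  mutate(City = str_replace_all(City, fixed("."), " "))

unique(flights_long$City)
```

Cleaning the labels at this stage means any later table or chart legend shows "Los Angeles" and not "Los.Angeles".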
## Perform analysis to compare the arrival delays for the two airlines
The long-format data frame was filtered to keep only the delayed flights and then used to create a bar chart comparing the number of delayed flights for each airline.
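# Map fill to City so that position = "dodge" draws one bar per city;
# with fill = Status, all delayed rows share one group and the bars overlap.
data_long %>%
  filter(Status == "delayed") %>%
  ggplot(aes(x = Airline, y = Value, fill = City)) +
  geom_bar(position = "dodge", stat = "identity")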
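Raw delayed counts depend on how many flights each airline operates into each city, so a comparison of delay rates may be more informative than a comparison of counts. The sketch below is one way to compute them and is not the original author's analysis; the names `flights_long`, `rates_by_city`, `rates_overall`, and `delay_rate` are illustrative.

```r
library(dplyr)
library(tidyr)

# Rebuild the long table so this sketch runs on its own.
flights_long <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  Status = c("on time", "delayed", "on time", "delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  San.Diego = c(212, 20, 383, 65),
  San.Francisco = c(503, 102, 320, 129),
  Seattle = c(1841, 305, 201, 61)
) %>%
  pivot_longer(cols = -c(Airline, Status), names_to = "City", values_to = "Value")

# Delay rate per airline and city: delayed / (delayed + on time).
rates_by_city <- flights_long %>%
  pivot_wider(names_from = Status, values_from = Value) %>%
  mutate(delay_rate = delayed / (delayed + `on time`))

# Delay rate per airline across all five cities.
rates_overall <- flights_long %>%
  group_by(Airline, Status) %>%
  summarise(Flights = sum(Value), .groups = "drop") %>%
  pivot_wider(names_from = Status, values_from = Flights) %>%
  mutate(delay_rate = delayed / (delayed + `on time`))

rates_by_city
rates_overall
```

`rates_by_city` has one row per airline and city, and `rates_overall` has one row per airline, so the two tables can be read side by side to see whether the per-city picture agrees with the aggregate one.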
## Conclusion/Results