Introduction

The assignment is to create a dataset in wide format (reflecting a given chart) and then conduct a comparison analysis.

Import

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Creating the Data Frame

A data frame was created containing arrival counts for two airlines, Alaska and AM West. The data frame has 7 columns: Airline, Status, and one column for each of five destination cities (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle). The Airline column specifies which airline the flights belong to, the Status column indicates whether the flights were on time or delayed, and each city column holds the number of flights with that status arriving at that city. After creating the data frame, the code writes it to a CSV file called “airline-destination.csv”.

data <- data.frame(
  "Airline" = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  "Status" = c("on time", "delayed", "on time", "delayed"),
  "Los Angeles" = c(497, 62, 694, 117),
  "Phoenix" = c(221, 12, 4840, 415),
  "San Diego" = c(212, 20, 383, 65),
  "San Francisco" = c(503, 102, 320, 129),
  "Seattle" = c(1841, 305, 201, 61)
)
data 
##   Airline  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61
write_csv(data, "airline-destination.csv")
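
The periods in names such as Los.Angeles come from data.frame(), which by default converts column names into syntactically valid R names. As a minimal sketch, assuming the original spellings are wanted, the same table can be built with check.names = FALSE. The object name data_spaces is hypothetical and is not used elsewhere in this report.

```r
# Sketch: the same wide table, keeping the spaces in the city column names.
# `data_spaces` is a hypothetical name chosen for this illustration.
data_spaces <- data.frame(
  "Airline" = c("ALASKA", "ALASKA", "AM WEST", "AM WEST"),
  "Status" = c("on time", "delayed", "on time", "delayed"),
  "Los Angeles" = c(497, 62, 694, 117),
  "Phoenix" = c(221, 12, 4840, 415),
  "San Diego" = c(212, 20, 383, 65),
  "San Francisco" = c(503, 102, 320, 129),
  "Seattle" = c(1841, 305, 201, 61),
  check.names = FALSE  # do not rewrite "Los Angeles" as "Los.Angeles"
)

# Inspect the resulting column names
names(data_spaces)
```

Columns whose names contain spaces have to be wrapped in backticks when referenced in code, for example `Los Angeles`.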

Reshaping wide -> long format

The data frame is reshaped from wide to long format, so that each row holds the flight count for one combination of airline, status, and city.

data_long <- pivot_longer(data, cols = -c(Airline, Status), names_to = "City", values_to = "Value")
data_long
## # A tibble: 20 × 4
##    Airline Status  City          Value
##    <chr>   <chr>   <chr>         <dbl>
##  1 ALASKA  on time Los.Angeles     497
##  2 ALASKA  on time Phoenix         221
##  3 ALASKA  on time San.Diego       212
##  4 ALASKA  on time San.Francisco   503
##  5 ALASKA  on time Seattle        1841
##  6 ALASKA  delayed Los.Angeles      62
##  7 ALASKA  delayed Phoenix          12
##  8 ALASKA  delayed San.Diego        20
##  9 ALASKA  delayed San.Francisco   102
## 10 ALASKA  delayed Seattle         305
## 11 AM WEST on time Los.Angeles     694
## 12 AM WEST on time Phoenix        4840
## 13 AM WEST on time San.Diego       383
## 14 AM WEST on time San.Francisco   320
## 15 AM WEST on time Seattle         201
## 16 AM WEST delayed Los.Angeles     117
## 17 AM WEST delayed Phoenix         415
## 18 AM WEST delayed San.Diego        65
## 19 AM WEST delayed San.Francisco   129
## 20 AM WEST delayed Seattle          61
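
The City values above still carry the periods introduced when the wide data frame was created. One possible clean-up, shown here as a sketch on a few rows copied from the output above, replaces each period with a space. The names data_long_sample and data_long_clean are hypothetical.

```r
library(dplyr)

# A few rows in the same shape as data_long (values copied from the output above).
# `data_long_sample` is a hypothetical stand-in used only for this illustration.
data_long_sample <- tibble(
  Airline = c("ALASKA", "ALASKA", "AM WEST"),
  Status = c("on time", "delayed", "on time"),
  City = c("Los.Angeles", "San.Diego", "San.Francisco"),
  Value = c(497, 20, 320)
)

# Replace every literal "." in City with a space; fixed = TRUE turns off
# regular-expression matching, so "." is not treated as a wildcard.
data_long_clean <- data_long_sample %>%
  mutate(City = gsub(".", " ", City, fixed = TRUE))

data_long_clean
```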

Perform analysis to compare the arrival delays for the two airlines

The long-format data frame was then filtered to include only delayed flights and used to create a bar chart comparing delayed arrivals for the two airlines.

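data_long %>%
  # Keep only the delayed counts
  filter(Status == "delayed") %>%
  # Fill by City so that dodging draws one bar per city instead of
  # drawing all five bars for an airline over one another
  ggplot(aes(x = Airline, y = Value, fill = City)) +
  geom_col(position = "dodge")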
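
Raw delayed counts depend on how many flights each airline operates on each route, so a possible complement to the chart is the share of flights that were delayed. The sketch below is an assumption about how that could be computed, not part of the original analysis. It rebuilds the long-format data from the counts shown earlier so that it runs on its own, and the names delay_rates and overall_rates are hypothetical.

```r
library(dplyr)
library(tidyr)

# Rebuild the long-format data from the counts shown earlier
data_long <- tibble(
  Airline = rep(c("ALASKA", "AM WEST"), each = 10),
  Status = rep(rep(c("on time", "delayed"), each = 5), times = 2),
  City = rep(c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), times = 4),
  Value = c(497, 221, 212, 503, 1841,
            62, 12, 20, 102, 305,
            694, 4840, 383, 320, 201,
            117, 415, 65, 129, 61)
)

# One row per airline and city, with on-time and delayed counts side by side
delay_rates <- data_long %>%
  pivot_wider(names_from = Status, values_from = Value) %>%
  rename(on_time = `on time`) %>%
  mutate(delay_rate = delayed / (on_time + delayed))

# Delay rate per airline across all five cities combined
overall_rates <- delay_rates %>%
  group_by(Airline) %>%
  summarise(delay_rate = sum(delayed) / sum(on_time + delayed))

delay_rates
overall_rates
```

Looking at both the per-city and the overall rates is a deliberate choice here, since the two airlines fly very different numbers of flights to each city.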

Conclusion / Results