Overview:

Assignment 3 asks me to create a .csv file that includes specific information regarding two airlines (Alaska and AM West) across five destinations (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle). I will read the file into R and perform analysis to compare arrival delays for the two airlines.

I created a .csv file from a google sheet document, saved it in my github repository, copied the url of the raw file, and pasted it into R.

#Load the .csv file into R and save it
dataF <- read.csv('https://raw.githubusercontent.com/evelynbartley/Data-607/main/Untitled%20spreadsheet%20-%20Sheet1.csv', header = TRUE)
#fill in empty cells
dataF$X = c("Alaska", "Alaska", "AM West", "AM West")
#replace periods with a space for aesthetic purposes
dataF <- dataF |>
  rename(
    "Los Angeles" = "Los.Angeles",
    "San Diego" = "San.Diego",
    "San Francisco" = "San.Francisco"
  )
dataF
##         X     X.1 Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61

1. TIDY!

I want to make every column a variable, and every variable a column every observation a row and every row an observation, and every value a cell and every cell a value.

This dataset has 4 variables: Airline, Arrival Status, Destination and Count

Airline has two possible values: Alaska or AM West

Arrival Status has two possible values: on time or delayed

Destination has five possible values: Los Angeles, Phoenix, San Diego, San Francisco, or Seattle

Counts is a quantitative value

#Use pivot_longer() to combine the five destination columns into one column that specifies destination, and put those values into a frequency column that specifies the frequency of the observation.
dataF <- dataF |>
  pivot_longer('Los Angeles':'Seattle', names_to = "Destination", values_to = "Counts")
head(dataF)
## # A tibble: 6 × 4
##   X      X.1     Destination   Counts
##   <chr>  <chr>   <chr>          <int>
## 1 Alaska on time Los Angeles      497
## 2 Alaska on time Phoenix          221
## 3 Alaska on time San Diego        212
## 4 Alaska on time San Francisco    503
## 5 Alaska on time Seattle         1841
## 6 Alaska delayed Los Angeles       62

I want to give each column a descriptive name.

dataF <- dataF |>
  rename(
    "Airline" = "X",
    "Status" = "X.1"
  )
head(dataF)
## # A tibble: 6 × 4
##   Airline Status  Destination   Counts
##   <chr>   <chr>   <chr>          <int>
## 1 Alaska  on time Los Angeles      497
## 2 Alaska  on time Phoenix          221
## 3 Alaska  on time San Diego        212
## 4 Alaska  on time San Francisco    503
## 5 Alaska  on time Seattle         1841
## 6 Alaska  delayed Los Angeles       62

Now I have a tidy data frame since each variable forms a column, each observation forms a row, and each value has its own cell.

2. Analysis

I want to compare the arrival delays for each airline. Let’s start by seeing some summary statistics.

#Create a subset of the dataframe that only includes information from Alaska airlines
AlaskaData <- dataF[which(dataF$Airline=='Alaska'), ]
AlaskaData
## # A tibble: 10 × 4
##    Airline Status  Destination   Counts
##    <chr>   <chr>   <chr>          <int>
##  1 Alaska  on time Los Angeles      497
##  2 Alaska  on time Phoenix          221
##  3 Alaska  on time San Diego        212
##  4 Alaska  on time San Francisco    503
##  5 Alaska  on time Seattle         1841
##  6 Alaska  delayed Los Angeles       62
##  7 Alaska  delayed Phoenix           12
##  8 Alaska  delayed San Diego         20
##  9 Alaska  delayed San Francisco    102
## 10 Alaska  delayed Seattle          305
#Create a subset of the dataframe that only includes information from Alaska airlines
AMWestData <- dataF[which(dataF$Airline=='AM West'), ]
AMWestData
## # A tibble: 10 × 4
##    Airline Status  Destination   Counts
##    <chr>   <chr>   <chr>          <int>
##  1 AM West on time Los Angeles      694
##  2 AM West on time Phoenix         4840
##  3 AM West on time San Diego        383
##  4 AM West on time San Francisco    320
##  5 AM West on time Seattle          201
##  6 AM West delayed Los Angeles      117
##  7 AM West delayed Phoenix          415
##  8 AM West delayed San Diego         65
##  9 AM West delayed San Francisco    129
## 10 AM West delayed Seattle           61
#get the total number of planes from each airline
TotalAlaska <- summarise(AlaskaData, sum(Counts))
TotalAlaska
## # A tibble: 1 × 1
##   `sum(Counts)`
##           <int>
## 1          3775
TotalAMWest <- summarise(AMWestData, sum(Counts))
TotalAMWest
## # A tibble: 1 × 1
##   `sum(Counts)`
##           <int>
## 1          7225

In total, Alaska airlines had 3,775 flights and AM West had 7,225 flights.

Now let’s group the data frame by arrival status.

delay <- group_by(dataF, Status)
summarise(delay, 
          sum(Counts))
## # A tibble: 2 × 2
##   Status  `sum(Counts)`
##   <chr>           <int>
## 1 delayed          1288
## 2 on time          9712

In total, 1,288 flights were delayed and 9,712 were on time.

I want to see the proportion of delayed/on time flights based on what airline they are from.

propStatusAirline <- ggplot(dataF, mapping = aes(x = Airline, y = Counts, fill = Status)) + geom_col(position = "fill") 
propStatusAirline

Based on this stacked bar chart that shows proportions, there is not a huge difference in the proportion of delayed flights from Alaska vs from AM West.

Now I want to see if there’s a big difference in proportion of delayed flights based on their destination for Alaska flights.

propAlaskaDestination <- ggplot(AlaskaData, mapping = aes(x = Destination, y = Counts, fill = Status)) + geom_col(position = "fill")
propAlaskaDestination

I dont like how the names of the destinations are hard to read so I want to make this proportional bar plot horizontal!

propAlaskaDestinationH <- ggplot(AlaskaData, mapping = aes(x = Destination, y = Counts, fill = Status)) + geom_col(position = "fill") + coord_flip()
propAlaskaDestinationH

Much better!

Now lets do the same for the AM West flights. I want to see if there’s a big difference in proportion of delayed flights based on their destination for AM West flights.

propAMWestDestination <- ggplot(AMWestData, mapping = aes(x = Destination, y = Counts, fill = Status)) + geom_col(position = "fill") + coord_flip()
propAMWestDestination

Conclusions

Based on this analysis, there is not a significant difference between arrival delays for Alaska and AM West. From this data specifically, I found that the greatest proportion of delayed flights were from AM West airlines going to San Francisco. I also found that the greatest proportion of on time flights were from Alaska airlines going to Phoenix.