For assignment 4, a .csv file was transformed from a wide structure data frame to a long structure data frame. The data was then analysis to compare the arrival delays for the two airlines listed.

Load libraries.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ stringr 1.4.1
## ✔ tidyr   1.2.0     ✔ forcats 0.5.2
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

The .csv file was created in an excel where it was uploaded into a GitHub repository and information was read. Also, any blank celled information in the data frame was replaced by a NA value.

flights_wide = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607-Assignment-4/main/flights_wide.csv',
                        na.strings=c("","NA"))

print(flights_wide)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

First to reshape the data, the values of the two airlines that where being traveled from (Alaska and AM West) needed to be filled into NA values. There, the fill function was used. Then all other NA values where dropped with drop_na(). Most of the reshaping of the original data set came from pivot_longer and pivot_wider. Pivot_longer here changed several columns into a single row known as traveling_to. Pivot_wider took the values “on time” and “delayed” and created their own columns with the values from “time”. Also, one of the columns names was renamed just to make the information more understandable.

flights_long <- flights_wide %>% 
  fill("X") %>%
  drop_na() %>%
  pivot_longer(cols=c('Los.Angeles', 'Phoenix', 'San.Diego', 'San.Francisco', 'Seattle'),
                    names_to='traveling_to' ,values_to = "time") %>%
  pivot_wider(names_from = 'X.1', values_from = 'time')

names(flights_long)[names(flights_long)=="X"] <- "traveling_from"

print(flights_long)
## # A tibble: 10 × 4
##    traveling_from traveling_to  `on time` delayed
##    <chr>          <chr>             <int>   <int>
##  1 ALASKA         Los.Angeles         497      62
##  2 ALASKA         Phoenix             221      12
##  3 ALASKA         San.Diego           212      20
##  4 ALASKA         San.Francisco       503     102
##  5 ALASKA         Seattle            1841     305
##  6 AM WEST        Los.Angeles         694     117
##  7 AM WEST        Phoenix            4840     415
##  8 AM WEST        San.Diego           383      65
##  9 AM WEST        San.Francisco       320     129
## 10 AM WEST        Seattle             201      61

The analysis I performed was to determine which airline had the least amount of delay times. This was determined by filtering the data by the two different airlines then summaries over data. It was determined that the Alaska Airlines had the least amount of delay time due to its mean and median value being the lower then AM West values.

alaska_flights <- flights_long %>%
  filter(traveling_from == "ALASKA")

summary(alaska_flights$delayed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    12.0    20.0    62.0   100.2   102.0   305.0
am_west_flights <- flights_long %>%
  filter(traveling_from == "AM WEST") 

summary(am_west_flights$delayed)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    61.0    65.0   117.0   157.4   129.0   415.0