For assignment 4, a .csv file was transformed from a wide structure data frame to a long structure data frame. The data was then analysis to compare the arrival delays for the two airlines listed.
Load libraries.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ stringr 1.4.1
## ✔ tidyr 1.2.0 ✔ forcats 0.5.2
## ✔ readr 2.1.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
The .csv file was created in an excel where it was uploaded into a GitHub repository and information was read. Also, any blank celled information in the data frame was replaced by a NA value.
flights_wide = read.csv('https://raw.githubusercontent.com/melbow2424/Data-607-Assignment-4/main/flights_wide.csv',
na.strings=c("","NA"))
print(flights_wide)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
First to reshape the data, the values of the two airlines that where being traveled from (Alaska and AM West) needed to be filled into NA values. There, the fill function was used. Then all other NA values where dropped with drop_na(). Most of the reshaping of the original data set came from pivot_longer and pivot_wider. Pivot_longer here changed several columns into a single row known as traveling_to. Pivot_wider took the values “on time” and “delayed” and created their own columns with the values from “time”. Also, one of the columns names was renamed just to make the information more understandable.
flights_long <- flights_wide %>%
fill("X") %>%
drop_na() %>%
pivot_longer(cols=c('Los.Angeles', 'Phoenix', 'San.Diego', 'San.Francisco', 'Seattle'),
names_to='traveling_to' ,values_to = "time") %>%
pivot_wider(names_from = 'X.1', values_from = 'time')
names(flights_long)[names(flights_long)=="X"] <- "traveling_from"
print(flights_long)
## # A tibble: 10 × 4
## traveling_from traveling_to `on time` delayed
## <chr> <chr> <int> <int>
## 1 ALASKA Los.Angeles 497 62
## 2 ALASKA Phoenix 221 12
## 3 ALASKA San.Diego 212 20
## 4 ALASKA San.Francisco 503 102
## 5 ALASKA Seattle 1841 305
## 6 AM WEST Los.Angeles 694 117
## 7 AM WEST Phoenix 4840 415
## 8 AM WEST San.Diego 383 65
## 9 AM WEST San.Francisco 320 129
## 10 AM WEST Seattle 201 61
The analysis I performed was to determine which airline had the least amount of delay times. This was determined by filtering the data by the two different airlines then summaries over data. It was determined that the Alaska Airlines had the least amount of delay time due to its mean and median value being the lower then AM West values.
alaska_flights <- flights_long %>%
filter(traveling_from == "ALASKA")
summary(alaska_flights$delayed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.0 20.0 62.0 100.2 102.0 305.0
am_west_flights <- flights_long %>%
filter(traveling_from == "AM WEST")
summary(am_west_flights$delayed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 61.0 65.0 117.0 157.4 129.0 415.0