Assignment 4 Task:
flight <- read.csv("https://raw.githubusercontent.com/Wilchau/607Lab4/main/Data%20607%20Assignment%204.csv")
#Step 2 load up the libraries, and overview the dataset
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
head(flight)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
glimpse(flight)
## Rows: 5
## Columns: 7
## $ X <chr> "Alaska", "", "", "AM West", ""
## $ X.1 <chr> "on time", "delayed", "", "on time", "delayed"
## $ Los.Angeles <int> 497, 62, NA, 694, 117
## $ Phoenix <int> 221, 12, NA, 4840, 415
## $ San.Diego <int> 212, 20, NA, 383, 65
## $ San.Francisco <int> 503, 102, NA, 320, 129
## $ Seattle <int> 1841, 305, NA, 201, 61
Looking through glimpse, and head We see that there’s 7 columns and 5 rows. There is a few empty spaces, NA. I will begin by renaming or filling the columns, and then filling up the NA spots or empty spaces.
flight <- flight %>%
rename("Airline" = 1, "Status" = 2, "Los Angeles" = 3, "San Diego" = 5, "San Francisco" = 6, "Seattle" = 7)
flight
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM West on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
After renaming the columns, I will remove empty rows using %>% drop_na
flight <- flight %>% drop_na
flight
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 delayed 117 415 65 129 61
I still see 2 empty spaces on Airline row 2,4. I will fill it with NA. Once the NA is replaced, I will use Tidy data to convert NA with the data above their respective column
flight[flight == ""] = NA
head(flight)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 <NA> delayed 117 415 65 129 61
flight <- flight %>%
fill(Airline, .direction = c("down"))
head(flight)
## Airline Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Alaska on time 497 221 212 503 1841
## 2 Alaska delayed 62 12 20 102 305
## 3 AM West on time 694 4840 383 320 201
## 4 AM West delayed 117 415 65 129 61
I have cleaned up the data set to be more organized and have respected airline with the status. I will then convert this dataset to a wide range. This should be the wide format. I also created a datafram flight_organized separate from flight to provide context of the Airlines -> Status -> City -> Flight #.
flight_organized <- flight %>% gather("city","Flight#", 3:7)
flight_organized
## Airline Status city Flight#
## 1 Alaska on time Los Angeles 497
## 2 Alaska delayed Los Angeles 62
## 3 AM West on time Los Angeles 694
## 4 AM West delayed Los Angeles 117
## 5 Alaska on time Phoenix 221
## 6 Alaska delayed Phoenix 12
## 7 AM West on time Phoenix 4840
## 8 AM West delayed Phoenix 415
## 9 Alaska on time San Diego 212
## 10 Alaska delayed San Diego 20
## 11 AM West on time San Diego 383
## 12 AM West delayed San Diego 65
## 13 Alaska on time San Francisco 503
## 14 Alaska delayed San Francisco 102
## 15 AM West on time San Francisco 320
## 16 AM West delayed San Francisco 129
## 17 Alaska on time Seattle 1841
## 18 Alaska delayed Seattle 305
## 19 AM West on time Seattle 201
## 20 AM West delayed Seattle 61
#Analysis on Flight delays between Alaska and AM West In order to find the analysis of delay flights, we will seprate the delay flights vs on time.
delay_flights <- flight_organized %>%
filter(flight_organized$Status == "delayed")
delay_flights
## Airline Status city Flight#
## 1 Alaska delayed Los Angeles 62
## 2 AM West delayed Los Angeles 117
## 3 Alaska delayed Phoenix 12
## 4 AM West delayed Phoenix 415
## 5 Alaska delayed San Diego 20
## 6 AM West delayed San Diego 65
## 7 Alaska delayed San Francisco 102
## 8 AM West delayed San Francisco 129
## 9 Alaska delayed Seattle 305
## 10 AM West delayed Seattle 61
I have isolated the delayed flights, and will now begin analysis
dplyr::glimpse(delay_flights)
## Rows: 10
## Columns: 4
## $ Airline <chr> "Alaska", "AM West", "Alaska", "AM West", "Alaska", "AM West…
## $ Status <chr> "delayed", "delayed", "delayed", "delayed", "delayed", "dela…
## $ city <chr> "Los Angeles", "Los Angeles", "Phoenix", "Phoenix", "San Die…
## $ `Flight#` <int> 62, 117, 12, 415, 20, 65, 102, 129, 305, 61
dplyr::select(delay_flights, "Airline","city", "Flight#")
## Airline city Flight#
## 1 Alaska Los Angeles 62
## 2 AM West Los Angeles 117
## 3 Alaska Phoenix 12
## 4 AM West Phoenix 415
## 5 Alaska San Diego 20
## 6 AM West San Diego 65
## 7 Alaska San Francisco 102
## 8 AM West San Francisco 129
## 9 Alaska Seattle 305
## 10 AM West Seattle 61
Using dplyr, I was able to select the Airlines and the FLight# respective to the Airlines. This gives us a concentrated insights on delayed flights. We can see that this gives us AM West with the heaviest delay at 415 at Phoenix, while Alaska only has 305 heaviest delay at Seattle. Another thing to focus on is that Alaska lowest delay is at 12 while AM West lowest delay is at Seattle. We can see that Alaska heaviest’s delay is AM west lightest delay, and while AM West heaviest’s delay is Alaska’s lightest delay.