Assignment 4 Task:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: he URL to the .Rmd file in your GitHub repository. #Step 1 Create CSV from raw file from github. I took the dataset from assignment 4 and converted it into a CSV, uploaded into Github and read the raw file from it on R.
flight <- read.csv("https://raw.githubusercontent.com/Wilchau/607Lab4/main/Data%20607%20Assignment%204.csv")

#Step 2 load up the libraries, and overview the dataset

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
head(flight)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
glimpse(flight)
## Rows: 5
## Columns: 7
## $ X             <chr> "Alaska", "", "", "AM West", ""
## $ X.1           <chr> "on time", "delayed", "", "on time", "delayed"
## $ Los.Angeles   <int> 497, 62, NA, 694, 117
## $ Phoenix       <int> 221, 12, NA, 4840, 415
## $ San.Diego     <int> 212, 20, NA, 383, 65
## $ San.Francisco <int> 503, 102, NA, 320, 129
## $ Seattle       <int> 1841, 305, NA, 201, 61

Looking through glimpse, and head We see that there’s 7 columns and 5 rows. There is a few empty spaces, NA. I will begin by renaming or filling the columns, and then filling up the NA spots or empty spaces.

flight <- flight %>%
rename("Airline" = 1, "Status" = 2, "Los Angeles" = 3, "San Diego" = 5, "San Francisco" = 6, "Seattle" = 7)
flight
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

After renaming the columns, I will remove empty rows using %>% drop_na

flight <- flight %>% drop_na
flight
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4         delayed         117     415        65           129      61

I still see 2 empty spaces on Airline row 2,4. I will fill it with NA. Once the NA is replaced, I will use Tidy data to convert NA with the data above their respective column

flight[flight == ""] = NA
head(flight)
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4    <NA> delayed         117     415        65           129      61
flight <- flight %>% 
fill(Airline, .direction = c("down"))
head(flight)
##   Airline  Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1  Alaska on time         497     221       212           503    1841
## 2  Alaska delayed          62      12        20           102     305
## 3 AM West on time         694    4840       383           320     201
## 4 AM West delayed         117     415        65           129      61

I have cleaned up the data set to be more organized and have respected airline with the status. I will then convert this dataset to a wide range. This should be the wide format. I also created a datafram flight_organized separate from flight to provide context of the Airlines -> Status -> City -> Flight #.

flight_organized <- flight %>% gather("city","Flight#", 3:7)
flight_organized
##    Airline  Status          city Flight#
## 1   Alaska on time   Los Angeles     497
## 2   Alaska delayed   Los Angeles      62
## 3  AM West on time   Los Angeles     694
## 4  AM West delayed   Los Angeles     117
## 5   Alaska on time       Phoenix     221
## 6   Alaska delayed       Phoenix      12
## 7  AM West on time       Phoenix    4840
## 8  AM West delayed       Phoenix     415
## 9   Alaska on time     San Diego     212
## 10  Alaska delayed     San Diego      20
## 11 AM West on time     San Diego     383
## 12 AM West delayed     San Diego      65
## 13  Alaska on time San Francisco     503
## 14  Alaska delayed San Francisco     102
## 15 AM West on time San Francisco     320
## 16 AM West delayed San Francisco     129
## 17  Alaska on time       Seattle    1841
## 18  Alaska delayed       Seattle     305
## 19 AM West on time       Seattle     201
## 20 AM West delayed       Seattle      61

#Analysis on Flight delays between Alaska and AM West In order to find the analysis of delay flights, we will seprate the delay flights vs on time.

delay_flights <- flight_organized %>%
                    filter(flight_organized$Status == "delayed")
delay_flights
##    Airline  Status          city Flight#
## 1   Alaska delayed   Los Angeles      62
## 2  AM West delayed   Los Angeles     117
## 3   Alaska delayed       Phoenix      12
## 4  AM West delayed       Phoenix     415
## 5   Alaska delayed     San Diego      20
## 6  AM West delayed     San Diego      65
## 7   Alaska delayed San Francisco     102
## 8  AM West delayed San Francisco     129
## 9   Alaska delayed       Seattle     305
## 10 AM West delayed       Seattle      61

I have isolated the delayed flights, and will now begin analysis

dplyr::glimpse(delay_flights)
## Rows: 10
## Columns: 4
## $ Airline   <chr> "Alaska", "AM West", "Alaska", "AM West", "Alaska", "AM West…
## $ Status    <chr> "delayed", "delayed", "delayed", "delayed", "delayed", "dela…
## $ city      <chr> "Los Angeles", "Los Angeles", "Phoenix", "Phoenix", "San Die…
## $ `Flight#` <int> 62, 117, 12, 415, 20, 65, 102, 129, 305, 61
dplyr::select(delay_flights, "Airline","city", "Flight#")
##    Airline          city Flight#
## 1   Alaska   Los Angeles      62
## 2  AM West   Los Angeles     117
## 3   Alaska       Phoenix      12
## 4  AM West       Phoenix     415
## 5   Alaska     San Diego      20
## 6  AM West     San Diego      65
## 7   Alaska San Francisco     102
## 8  AM West San Francisco     129
## 9   Alaska       Seattle     305
## 10 AM West       Seattle      61

Conclusion

Using dplyr, I was able to select the Airlines and the FLight# respective to the Airlines. This gives us a concentrated insights on delayed flights. We can see that this gives us AM West with the heaviest delay at 415 at Phoenix, while Alaska only has 305 heaviest delay at Seattle. Another thing to focus on is that Alaska lowest delay is at 12 while AM West lowest delay is at Seattle. We can see that Alaska heaviest’s delay is AM west lightest delay, and while AM West heaviest’s delay is Alaska’s lightest delay.