Data 607 Week 5

Add the libraries required to do this assignment

For this assignment we need tidyr and dplyr

require(tidyr)

## Loading required package: tidyr

require(dplyr)

## Loading required package: dplyr

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

require(tidyverse)

## Loading required package: tidyverse

## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --

## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

(1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

This DataSet was created in Excel and uploaded into my GitHub

airportdata <-   read.csv("https://raw.githubusercontent.com/Luz917/data607Week5/master/airport%20data.csv",stringsAsFactors = FALSE)
str(airportdata)

## 'data.frame':    5 obs. of  7 variables:
##  $ X            : chr  "ALASKA" "" "" "AM WEST" ...
##  $ X.1          : chr  "On Time" "Delayed" "" "On Time" ...
##  $ Los.Angelos  : int  497 62 NA 694 117
##  $ Phoenix      : int  221 12 NA 4840 415
##  $ San.Diego    : int  212 20 NA 383 65
##  $ San.Francisco: int  503 102 NA 320 129
##  $ Seattle      : int  1841 305 NA 201 61

summary(airportdata)

##       X                 X.1             Los.Angelos       Phoenix      
##  Length:5           Length:5           Min.   : 62.0   Min.   :  12.0  
##  Class :character   Class :character   1st Qu.:103.2   1st Qu.: 168.8  
##  Mode  :character   Mode  :character   Median :307.0   Median : 318.0  
##                                        Mean   :342.5   Mean   :1372.0  
##                                        3rd Qu.:546.2   3rd Qu.:1521.2  
##                                        Max.   :694.0   Max.   :4840.0  
##                                        NA's   :1       NA's   :1       
##    San.Diego      San.Francisco      Seattle    
##  Min.   : 20.00   Min.   :102.0   Min.   :  61  
##  1st Qu.: 53.75   1st Qu.:122.2   1st Qu.: 166  
##  Median :138.50   Median :224.5   Median : 253  
##  Mean   :170.00   Mean   :263.5   Mean   : 602  
##  3rd Qu.:254.75   3rd Qu.:365.8   3rd Qu.: 689  
##  Max.   :383.00   Max.   :503.0   Max.   :1841  
##  NA's   :1        NA's   :1       NA's   :1

airportdata

##         X     X.1 Los.Angelos Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61

Prepare the Table

Before we begin to use tidyr and dplyr we need to do a feew thingsa to set up the table. -Name all of the Columns -Fill in the empty spaces in the first column with the name of each airport -Then remove the third row

colnames(airportdata)<-c("Airport", "FlightStatus","Los.Angelos","Phoenix","San.Diego","San.Francisco","Seattle")
airportdata [2,1]<-"ALASKA"
airportdata [5,1]<-"AM WEST"
airportdata = airportdata[-c(3) , ]
airportdata

##   Airport FlightStatus Los.Angelos Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA      On Time         497     221       212           503    1841
## 2  ALASKA      Delayed          62      12        20           102     305
## 4 AM WEST      On Time         694    4840       383           320     201
## 5 AM WEST      Delayed         117     415        65           129      61

Tidyr

Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

We begin the transformation, we begin by putting states into a column by using pivot_longer. (source is on the bottom)

airportdata <- pivot_longer(airportdata, cols = -c("Airport","FlightStatus"), names_to = "States")
airportdata

## # A tibble: 20 x 4
##    Airport FlightStatus States        value
##    <chr>   <chr>        <chr>         <int>
##  1 ALASKA  On Time      Los.Angelos     497
##  2 ALASKA  On Time      Phoenix         221
##  3 ALASKA  On Time      San.Diego       212
##  4 ALASKA  On Time      San.Francisco   503
##  5 ALASKA  On Time      Seattle        1841
##  6 ALASKA  Delayed      Los.Angelos      62
##  7 ALASKA  Delayed      Phoenix          12
##  8 ALASKA  Delayed      San.Diego        20
##  9 ALASKA  Delayed      San.Francisco   102
## 10 ALASKA  Delayed      Seattle         305
## 11 AM WEST On Time      Los.Angelos     694
## 12 AM WEST On Time      Phoenix        4840
## 13 AM WEST On Time      San.Diego       383
## 14 AM WEST On Time      San.Francisco   320
## 15 AM WEST On Time      Seattle         201
## 16 AM WEST Delayed      Los.Angelos     117
## 17 AM WEST Delayed      Phoenix         415
## 18 AM WEST Delayed      San.Diego        65
## 19 AM WEST Delayed      San.Francisco   129
## 20 AM WEST Delayed      Seattle          61

As you can see we end up with 20 rows

Next we use pivit_wider so we can see On Time and Delayed Status by Airport and City

airportdata %>%
  pivot_wider(names_from = FlightStatus, values_from = value) %>%
  arrange(States)

## # A tibble: 10 x 4
##    Airport States        `On Time` Delayed
##    <chr>   <chr>             <int>   <int>
##  1 ALASKA  Los.Angelos         497      62
##  2 AM WEST Los.Angelos         694     117
##  3 ALASKA  Phoenix             221      12
##  4 AM WEST Phoenix            4840     415
##  5 ALASKA  San.Diego           212      20
##  6 AM WEST San.Diego           383      65
##  7 ALASKA  San.Francisco       503     102
##  8 AM WEST San.Francisco       320     129
##  9 ALASKA  Seattle            1841     305
## 10 AM WEST Seattle             201      61

Now there are 10 Rows

Data Analysis with dplyr

Perform analysis to compare the arrival delays for the two airlines.

Sum of Flights by Airport

airportdata %>%
  group_by(Airport)%>%
  summarise(AirportSum= sum(value))

## # A tibble: 2 x 2
##   Airport AirportSum
##   <chr>        <int>
## 1 ALASKA        3775
## 2 AM WEST       7225

AM West has more flights then Alaska

Sum of Flights By State

airportdata %>%
  group_by(States) %>%
  summarise(SumFlights= sum(value))

## # A tibble: 5 x 2
##   States        SumFlights
##   <chr>              <int>
## 1 Los.Angelos         1370
## 2 Phoenix             5488
## 3 San.Diego            680
## 4 San.Francisco       1054
## 5 Seattle             2408

Phoenix has the most flights by state

Sum of Flight Status by Airport

airportdata %>%
  group_by(Airport,FlightStatus) %>%
  summarise(SumofFlights= sum(value))

## # A tibble: 4 x 3
## # Groups:   Airport [2]
##   Airport FlightStatus SumofFlights
##   <chr>   <chr>               <int>
## 1 ALASKA  Delayed               501
## 2 ALASKA  On Time              3274
## 3 AM WEST Delayed               787
## 4 AM WEST On Time              6438

AM West has the highest delay rate, but they also have a high number of On Time flights

Average Number of Flights

airportdata %>%
  group_by(Airport,FlightStatus) %>%
  summarise(AvgFlights= mean(value))

## # A tibble: 4 x 3
## # Groups:   Airport [2]
##   Airport FlightStatus AvgFlights
##   <chr>   <chr>             <dbl>
## 1 ALASKA  Delayed            100.
## 2 ALASKA  On Time            655.
## 3 AM WEST Delayed            157.
## 4 AM WEST On Time           1288.

AM West has the highest average number of Flights

Max Number of Flights by State

airportdata %>%
  group_by(Airport,States) %>%
  summarise(MaxFlights= max(value))

## # A tibble: 10 x 3
## # Groups:   Airport [2]
##    Airport States        MaxFlights
##    <chr>   <chr>              <int>
##  1 ALASKA  Los.Angelos          497
##  2 ALASKA  Phoenix              221
##  3 ALASKA  San.Diego            212
##  4 ALASKA  San.Francisco        503
##  5 ALASKA  Seattle             1841
##  6 AM WEST Los.Angelos          694
##  7 AM WEST Phoenix             4840
##  8 AM WEST San.Diego            383
##  9 AM WEST San.Francisco        320
## 10 AM WEST Seattle              201

Phoenix has the Max Number of Flights

Max Number of Flights by Airport

airportdata %>%
  group_by(Airport) %>%
  summarise(MaxFlights= max(value))

## # A tibble: 2 x 2
##   Airport MaxFlights
##   <chr>        <int>
## 1 ALASKA        1841
## 2 AM WEST       4840

AM West has the Max Number of Flights

Percent Rank

airportdata %>%
  mutate(Flights= percent_rank(value)) %>%
  arrange(States)

## # A tibble: 20 x 5
##    Airport FlightStatus States        value Flights
##    <chr>   <chr>        <chr>         <int>   <dbl>
##  1 ALASKA  On Time      Los.Angelos     497  0.789 
##  2 ALASKA  Delayed      Los.Angelos      62  0.158 
##  3 AM WEST On Time      Los.Angelos     694  0.895 
##  4 AM WEST Delayed      Los.Angelos     117  0.316 
##  5 ALASKA  On Time      Phoenix         221  0.526 
##  6 ALASKA  Delayed      Phoenix          12  0     
##  7 AM WEST On Time      Phoenix        4840  1.000 
##  8 AM WEST Delayed      Phoenix         415  0.737 
##  9 ALASKA  On Time      San.Diego       212  0.474 
## 10 ALASKA  Delayed      San.Diego        20  0.0526
## 11 AM WEST On Time      San.Diego       383  0.684 
## 12 AM WEST Delayed      San.Diego        65  0.211 
## 13 ALASKA  On Time      San.Francisco   503  0.842 
## 14 ALASKA  Delayed      San.Francisco   102  0.263 
## 15 AM WEST On Time      San.Francisco   320  0.632 
## 16 AM WEST Delayed      San.Francisco   129  0.368 
## 17 ALASKA  On Time      Seattle        1841  0.947 
## 18 ALASKA  Delayed      Seattle         305  0.579 
## 19 AM WEST On Time      Seattle         201  0.421 
## 20 AM WEST Delayed      Seattle          61  0.105

Seattle had the highest Percent Rank

Conclusion

We can tell by the Analysis that AM West gets more flights in comparison to Alaska. Phoenix has the highest number of flights out of all of the States.

Sources

How to reshape a dataframe from wide to long or long to wide format using tidyr - pivot_longer() pivot_wider() https://www.programmingwithr.com/how-to-reshape-a-dataframe-from-wide-to-long-or-long-to-wide-format/

2.Hands-on dplyr tutorial for faster data manipulation in R https://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/

3.How do I delete rows in a data frame? https://stackoverflow.com/questions/12328056/how-do-i-delete-rows-in-a-data-frame