For this assignment we need tidyr and dplyr
require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(tidyverse)
## Loading required package: tidyverse
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1 v purrr 0.3.2
## v tibble 2.1.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
This DataSet was created in Excel and uploaded into my GitHub
airportdata <- read.csv("https://raw.githubusercontent.com/Luz917/data607Week5/master/airport%20data.csv",stringsAsFactors = FALSE)
str(airportdata)
## 'data.frame': 5 obs. of 7 variables:
## $ X : chr "ALASKA" "" "" "AM WEST" ...
## $ X.1 : chr "On Time" "Delayed" "" "On Time" ...
## $ Los.Angelos : int 497 62 NA 694 117
## $ Phoenix : int 221 12 NA 4840 415
## $ San.Diego : int 212 20 NA 383 65
## $ San.Francisco: int 503 102 NA 320 129
## $ Seattle : int 1841 305 NA 201 61
summary(airportdata)
## X X.1 Los.Angelos Phoenix
## Length:5 Length:5 Min. : 62.0 Min. : 12.0
## Class :character Class :character 1st Qu.:103.2 1st Qu.: 168.8
## Mode :character Mode :character Median :307.0 Median : 318.0
## Mean :342.5 Mean :1372.0
## 3rd Qu.:546.2 3rd Qu.:1521.2
## Max. :694.0 Max. :4840.0
## NA's :1 NA's :1
## San.Diego San.Francisco Seattle
## Min. : 20.00 Min. :102.0 Min. : 61
## 1st Qu.: 53.75 1st Qu.:122.2 1st Qu.: 166
## Median :138.50 Median :224.5 Median : 253
## Mean :170.00 Mean :263.5 Mean : 602
## 3rd Qu.:254.75 3rd Qu.:365.8 3rd Qu.: 689
## Max. :383.00 Max. :503.0 Max. :1841
## NA's :1 NA's :1 NA's :1
airportdata
## X X.1 Los.Angelos Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 Delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST On Time 694 4840 383 320 201
## 5 Delayed 117 415 65 129 61
Before we begin to use tidyr and dplyr we need to do a feew thingsa to set up the table. -Name all of the Columns -Fill in the empty spaces in the first column with the name of each airport -Then remove the third row
colnames(airportdata)<-c("Airport", "FlightStatus","Los.Angelos","Phoenix","San.Diego","San.Francisco","Seattle")
airportdata [2,1]<-"ALASKA"
airportdata [5,1]<-"AM WEST"
airportdata = airportdata[-c(3) , ]
airportdata
## Airport FlightStatus Los.Angelos Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA On Time 497 221 212 503 1841
## 2 ALASKA Delayed 62 12 20 102 305
## 4 AM WEST On Time 694 4840 383 320 201
## 5 AM WEST Delayed 117 415 65 129 61
We begin the transformation, we begin by putting states into a column by using pivot_longer. (source is on the bottom)
airportdata <- pivot_longer(airportdata, cols = -c("Airport","FlightStatus"), names_to = "States")
airportdata
## # A tibble: 20 x 4
## Airport FlightStatus States value
## <chr> <chr> <chr> <int>
## 1 ALASKA On Time Los.Angelos 497
## 2 ALASKA On Time Phoenix 221
## 3 ALASKA On Time San.Diego 212
## 4 ALASKA On Time San.Francisco 503
## 5 ALASKA On Time Seattle 1841
## 6 ALASKA Delayed Los.Angelos 62
## 7 ALASKA Delayed Phoenix 12
## 8 ALASKA Delayed San.Diego 20
## 9 ALASKA Delayed San.Francisco 102
## 10 ALASKA Delayed Seattle 305
## 11 AM WEST On Time Los.Angelos 694
## 12 AM WEST On Time Phoenix 4840
## 13 AM WEST On Time San.Diego 383
## 14 AM WEST On Time San.Francisco 320
## 15 AM WEST On Time Seattle 201
## 16 AM WEST Delayed Los.Angelos 117
## 17 AM WEST Delayed Phoenix 415
## 18 AM WEST Delayed San.Diego 65
## 19 AM WEST Delayed San.Francisco 129
## 20 AM WEST Delayed Seattle 61
As you can see we end up with 20 rows
Next we use pivit_wider so we can see On Time and Delayed Status by Airport and City
airportdata %>%
pivot_wider(names_from = FlightStatus, values_from = value) %>%
arrange(States)
## # A tibble: 10 x 4
## Airport States `On Time` Delayed
## <chr> <chr> <int> <int>
## 1 ALASKA Los.Angelos 497 62
## 2 AM WEST Los.Angelos 694 117
## 3 ALASKA Phoenix 221 12
## 4 AM WEST Phoenix 4840 415
## 5 ALASKA San.Diego 212 20
## 6 AM WEST San.Diego 383 65
## 7 ALASKA San.Francisco 503 102
## 8 AM WEST San.Francisco 320 129
## 9 ALASKA Seattle 1841 305
## 10 AM WEST Seattle 201 61
Now there are 10 Rows
airportdata %>%
group_by(Airport)%>%
summarise(AirportSum= sum(value))
## # A tibble: 2 x 2
## Airport AirportSum
## <chr> <int>
## 1 ALASKA 3775
## 2 AM WEST 7225
AM West has more flights then Alaska
airportdata %>%
group_by(States) %>%
summarise(SumFlights= sum(value))
## # A tibble: 5 x 2
## States SumFlights
## <chr> <int>
## 1 Los.Angelos 1370
## 2 Phoenix 5488
## 3 San.Diego 680
## 4 San.Francisco 1054
## 5 Seattle 2408
Phoenix has the most flights by state
airportdata %>%
group_by(Airport,FlightStatus) %>%
summarise(SumofFlights= sum(value))
## # A tibble: 4 x 3
## # Groups: Airport [2]
## Airport FlightStatus SumofFlights
## <chr> <chr> <int>
## 1 ALASKA Delayed 501
## 2 ALASKA On Time 3274
## 3 AM WEST Delayed 787
## 4 AM WEST On Time 6438
AM West has the highest delay rate, but they also have a high number of On Time flights
airportdata %>%
group_by(Airport,FlightStatus) %>%
summarise(AvgFlights= mean(value))
## # A tibble: 4 x 3
## # Groups: Airport [2]
## Airport FlightStatus AvgFlights
## <chr> <chr> <dbl>
## 1 ALASKA Delayed 100.
## 2 ALASKA On Time 655.
## 3 AM WEST Delayed 157.
## 4 AM WEST On Time 1288.
AM West has the highest average number of Flights
airportdata %>%
group_by(Airport,States) %>%
summarise(MaxFlights= max(value))
## # A tibble: 10 x 3
## # Groups: Airport [2]
## Airport States MaxFlights
## <chr> <chr> <int>
## 1 ALASKA Los.Angelos 497
## 2 ALASKA Phoenix 221
## 3 ALASKA San.Diego 212
## 4 ALASKA San.Francisco 503
## 5 ALASKA Seattle 1841
## 6 AM WEST Los.Angelos 694
## 7 AM WEST Phoenix 4840
## 8 AM WEST San.Diego 383
## 9 AM WEST San.Francisco 320
## 10 AM WEST Seattle 201
Phoenix has the Max Number of Flights
airportdata %>%
group_by(Airport) %>%
summarise(MaxFlights= max(value))
## # A tibble: 2 x 2
## Airport MaxFlights
## <chr> <int>
## 1 ALASKA 1841
## 2 AM WEST 4840
AM West has the Max Number of Flights
airportdata %>%
mutate(Flights= percent_rank(value)) %>%
arrange(States)
## # A tibble: 20 x 5
## Airport FlightStatus States value Flights
## <chr> <chr> <chr> <int> <dbl>
## 1 ALASKA On Time Los.Angelos 497 0.789
## 2 ALASKA Delayed Los.Angelos 62 0.158
## 3 AM WEST On Time Los.Angelos 694 0.895
## 4 AM WEST Delayed Los.Angelos 117 0.316
## 5 ALASKA On Time Phoenix 221 0.526
## 6 ALASKA Delayed Phoenix 12 0
## 7 AM WEST On Time Phoenix 4840 1.000
## 8 AM WEST Delayed Phoenix 415 0.737
## 9 ALASKA On Time San.Diego 212 0.474
## 10 ALASKA Delayed San.Diego 20 0.0526
## 11 AM WEST On Time San.Diego 383 0.684
## 12 AM WEST Delayed San.Diego 65 0.211
## 13 ALASKA On Time San.Francisco 503 0.842
## 14 ALASKA Delayed San.Francisco 102 0.263
## 15 AM WEST On Time San.Francisco 320 0.632
## 16 AM WEST Delayed San.Francisco 129 0.368
## 17 ALASKA On Time Seattle 1841 0.947
## 18 ALASKA Delayed Seattle 305 0.579
## 19 AM WEST On Time Seattle 201 0.421
## 20 AM WEST Delayed Seattle 61 0.105
Seattle had the highest Percent Rank
We can tell by the Analysis that AM West gets more flights in comparison to Alaska. Phoenix has the highest number of flights out of all of the States.
2.Hands-on dplyr tutorial for faster data manipulation in R https://www.dataschool.io/dplyr-tutorial-for-faster-data-manipulation-in-r/
3.How do I delete rows in a data frame? https://stackoverflow.com/questions/12328056/how-do-i-delete-rows-in-a-data-frame