The goal of this assignment is to use R to tidy (reshape) data, using “base R” functionality and Hadley Wickham’s tidyr and dplyr packages. The folowing data that was used in this assignment describes arrival delays for two airlines across five destinations.
| Los.Angeles | Phoenix | San.Diego | San.Franciso | Seattle | ||
|---|---|---|---|---|---|---|
| ALASKA | On time | 497 | 221 | 212 | 503 | 1841 |
| Delayed | 62 | 12 | 20 | 102 | 305 | |
| AM WEST | On Time | 694 | 4840 | 383 | 320 | 201 |
| Delayed | 117 | 415 | 65 | 129 | 61 |
Data Source : Numbersense, Kaiser Fung, McGraw Hill, 2013
library("tidyr")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
We will load the data from a csv file hosted in github:
airlines <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week5/airlines.csv", stringsAsFactors=FALSE, sep = ',')
head(airlines)
Let’s change the names of X, X.1 to something meaningful
airlinesDF <- airlines %>% rename(airline = X, status = X.1, los_angeles = Los.Angeles, phoenix = Phoenix, san_diego = San.Diego, san_franciso = San.Franciso, seattle = Seattle)
#Add missing values
airlinesDF[2, "airline"] <- "ALASKA"
airlinesDF[5, "airline"] <- "AM WEST"
rownames(airlinesDF) <- NULL #reset our index
dplyr::tbl_df(airlinesDF)
dplyr::glimpse(airlinesDF)
## Observations: 5
## Variables: 7
## $ airline <chr> " ALASKA ", "ALASKA", " ", " AM WEST ", "AM WES…
## $ status <chr> "On time", "Delayed", " ", "On Time", "Delayed"
## $ los_angeles <int> 497, 62, NA, 694, 117
## $ phoenix <int> 221, 12, NA, 4840, 415
## $ san_diego <int> 212, 20, NA, 383, 65
## $ san_franciso <int> 503, 102, NA, 320, 129
## $ seattle <int> 1841, 305, NA, 201, 61
airlineGather <- tidyr::gather(airlinesDF, "Destinations","Flights", 3:7)
airlineGather = airlineGather %>% na.omit()
head(airlineGather)
We can use filter() function to choose rows/cases where conditions are true. for example we can filter by Destinations
filter(airlineGather, Destinations == "phoenix")
Let’s compare the arrival delays, by calculating the total number of flights by each carrier (airlines) to different destinations.
flightsSum <- airlineGather %>% group_by(airline) %>%
summarize(flightsSum = sum(Flights))
flightsSum
Let’s count the number of On time flights for each airline
OntimeFlights <- airlineGather %>% group_by(airline) %>%
filter(status == " Delayed ") %>%
summarize(OntimeFlights = sum(Flights))
OntimeFlights
For AM airlines which city had most number of Delays:
AMWESTontime <- filter(airlineGather, status==" Delayed ")
Ontime.rank <- AMWESTontime %>%
arrange(desc(AMWESTontime$Flights))
Ontime.rank
library(ggplot2)
ggplot(data = AMWESTontime, aes(x = Destinations, y = Flights, fill = airline)) + geom_bar(stat="identity", position="dodge") + ggtitle("Comparing Delay by Destinations & Airline") + ylab("Delay flight Count")
From the above graph, we can say that AMWEST has most delay rate across 5 cities as compared to Alaska.