Objective of the Assignment:

The goal of this assignment is to use R to tidy (reshape) data, using “base R” functionality and Hadley Wickham’s tidyr and dplyr packages. The folowing data that was used in this assignment describes arrival delays for two airlines across five destinations.

Los.Angeles Phoenix San.Diego San.Franciso Seattle
ALASKA On time 497 221 212 503 1841
Delayed 62 12 20 102 305
AM WEST On Time 694 4840 383 320 201
Delayed 117 415 65 129 61

Data Source : Numbersense, Kaiser Fung, McGraw Hill, 2013

Load the required Libraries

library("tidyr")
library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

We will load the data from a csv file hosted in github:

airlines <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week5/airlines.csv", stringsAsFactors=FALSE, sep = ',')
head(airlines)

Let’s change the names of X, X.1 to something meaningful

airlinesDF <- airlines %>% rename(airline = X, status = X.1, los_angeles = Los.Angeles, phoenix = Phoenix, san_diego = San.Diego, san_franciso = San.Franciso, seattle = Seattle)

#Add missing values
airlinesDF[2, "airline"] <- "ALASKA"
airlinesDF[5, "airline"] <- "AM WEST"

rownames(airlinesDF) <- NULL #reset our index

Data Wrangling with dplyr and tidyr

Converts data to tbl class. tbl’s are easier to examine than data frames.

dplyr::tbl_df(airlinesDF)

Information dense summary of tbl data.

dplyr::glimpse(airlinesDF)
## Observations: 5
## Variables: 7
## $ airline      <chr> " ALASKA  ", "ALASKA", "         ", " AM WEST ", "AM WES…
## $ status       <chr> "On time", "Delayed", "         ", "On Time", "Delayed"
## $ los_angeles  <int> 497, 62, NA, 694, 117
## $ phoenix      <int> 221, 12, NA, 4840, 415
## $ san_diego    <int> 212, 20, NA, 383, 65
## $ san_franciso <int> 503, 102, NA, 320, 129
## $ seattle      <int> 1841, 305, NA, 201, 61

Gather columns into rows and remove empty rows

airlineGather <- tidyr::gather(airlinesDF, "Destinations","Flights", 3:7)
airlineGather = airlineGather %>% na.omit()
head(airlineGather)

We can use filter() function to choose rows/cases where conditions are true. for example we can filter by Destinations

filter(airlineGather, Destinations == "phoenix")

Let’s compare the arrival delays, by calculating the total number of flights by each carrier (airlines) to different destinations.

flightsSum <- airlineGather %>% group_by(airline) %>% 
      summarize(flightsSum = sum(Flights))
flightsSum

Let’s count the number of On time flights for each airline

OntimeFlights <- airlineGather %>% group_by(airline) %>% 
  filter(status == " Delayed ") %>% 
  summarize(OntimeFlights = sum(Flights))
OntimeFlights

For AM airlines which city had most number of Delays:

AMWESTontime <- filter(airlineGather, status==" Delayed ")
Ontime.rank <- AMWESTontime %>%
                      arrange(desc(AMWESTontime$Flights))
Ontime.rank
library(ggplot2)
ggplot(data = AMWESTontime, aes(x = Destinations, y = Flights, fill = airline)) + geom_bar(stat="identity", position="dodge") + ggtitle("Comparing Delay by Destinations & Airline") + ylab("Delay flight Count")

From the above graph, we can say that AMWEST has most delay rate across 5 cities as compared to Alaska.