DATA607 | Week 5

Objective of the Assignment:

The goal of this assignment is to use R to tidy (reshape) data, using “base R” functionality and Hadley Wickham’s tidyr and dplyr packages. The folowing data that was used in this assignment describes arrival delays for two airlines across five destinations.

		Los.Angeles	Phoenix	San.Diego	San.Franciso	Seattle
ALASKA	On time	497	221	212	503	1841
	Delayed	62	12	20	102	305

AM WEST	On Time	694	4840	383	320	201
	Delayed	117	415	65	129	61

Data Source : Numbersense, Kaiser Fung, McGraw Hill, 2013

Load the required Libraries

library("tidyr")
library("dplyr")

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(stringr)

We will load the data from a csv file hosted in github:

airlines <- read.csv("https://raw.githubusercontent.com/aaitelmouden/DATA607S2020/master/Week5/airlines.csv", stringsAsFactors=FALSE, sep = ',')
head(airlines)

Let’s change the names of X, X.1 to something meaningful

airlinesDF <- airlines %>% rename(airline = X, status = X.1, los_angeles = Los.Angeles, phoenix = Phoenix, san_diego = San.Diego, san_franciso = San.Franciso, seattle = Seattle)

#Add missing values
airlinesDF[2, "airline"] <- "ALASKA"
airlinesDF[5, "airline"] <- "AM WEST"

rownames(airlinesDF) <- NULL #reset our index

Data Wrangling with dplyr and tidyr

Converts data to tbl class. tbl’s are easier to examine than data frames.

dplyr::tbl_df(airlinesDF)

Information dense summary of tbl data.

dplyr::glimpse(airlinesDF)

## Observations: 5
## Variables: 7
## $ airline      <chr> " ALASKA  ", "ALASKA", "         ", " AM WEST ", "AM WES…
## $ status       <chr> "On time", "Delayed", "         ", "On Time", "Delayed"
## $ los_angeles  <int> 497, 62, NA, 694, 117
## $ phoenix      <int> 221, 12, NA, 4840, 415
## $ san_diego    <int> 212, 20, NA, 383, 65
## $ san_franciso <int> 503, 102, NA, 320, 129
## $ seattle      <int> 1841, 305, NA, 201, 61

Gather columns into rows and remove empty rows

airlineGather <- tidyr::gather(airlinesDF, "Destinations","Flights", 3:7)
airlineGather = airlineGather %>% na.omit()
head(airlineGather)

We can use filter() function to choose rows/cases where conditions are true. for example we can filter by Destinations

filter(airlineGather, Destinations == "phoenix")

Let’s compare the arrival delays, by calculating the total number of flights by each carrier (airlines) to different destinations.

flightsSum <- airlineGather %>% group_by(airline) %>% 
      summarize(flightsSum = sum(Flights))
flightsSum

Let’s count the number of On time flights for each airline

OntimeFlights <- airlineGather %>% group_by(airline) %>% 
  filter(status == " Delayed ") %>% 
  summarize(OntimeFlights = sum(Flights))
OntimeFlights

For AM airlines which city had most number of Delays:

AMWESTontime <- filter(airlineGather, status==" Delayed ")
Ontime.rank <- AMWESTontime %>%
                      arrange(desc(AMWESTontime$Flights))
Ontime.rank

library(ggplot2)
ggplot(data = AMWESTontime, aes(x = Destinations, y = Flights, fill = airline)) + geom_bar(stat="identity", position="dodge") + ggtitle("Comparing Delay by Destinations & Airline") + ylab("Delay flight Count")

From the above graph, we can say that AMWEST has most delay rate across 5 cities as compared to Alaska.