Assignment 5

Create CSV
Read CSV into R
Transformation Of Data
- Load Libraries
- Tidy Data
Analysis
Visuals

Create CSV

csv <- rbind(c(NA, NA, "Los Angeles", "Phoenix", "San Diego", "San Francisco", "Seattle"),
             c("ALASKA", "On Time", 497, 221, 212, 503, 1841),
             c(NA, "Delayed", 62, 12, 20, 102, 305),
             c("AM WEST", "On Time", 694, 4840, 383, 320, 201),
             c(NA, "Delayed", 117, 415, 65, 129, 61))

write.table(csv, file = "arrivaltimes.csv", sep = ",", col.names=F, row.names=F)

Read CSV into R

originalarrival<-read.csv("https://raw.githubusercontent.com/IsARam/DATA607/master/arrivaltimes.csv?_sm_au_=iVVSSF6Sk3kVrtvj")
originalarrival

##       NA.   NA..1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2    <NA> Delayed          62      12        20           102     305
## 3 AM WEST On Time         694    4840       383           320     201
## 4    <NA> Delayed         117     415        65           129      61

Transformation Of Data

Load Libraries

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

Tidy Data

I created another table and modified it by using the following formulas: The mutate formula (dplyr) adds new variables and preserves existing. In the case of the above this formula was used to duplicate the first column. The coalesce (dplyr) formula finds the first non-missing value at each position.In the case of the above this was used to combine the first and second columns into one. The gather formuala (tidyr) takes multiple columns and collapses into key-value pairs, duplicating all other columns as needed. The select formula (dplyr) keeps only the variables you mention. The arrange formula (dplyr) sort a variable in descending order.

arrival<- originalarrival %>%
mutate(Airlines1 = NA., Airlines2=lag(NA.)) %>% 
mutate(Airline = coalesce(Airlines1,Airlines2), Status = NA..1)  %>%
gather("Destination", "Flights", 3:7) %>% 
select(Airline:Flights) %>% 
arrange(Airline, desc(Status), Destination); arrival

##    Airline  Status   Destination Flights
## 1   ALASKA On Time   Los.Angeles     497
## 2   ALASKA On Time       Phoenix     221
## 3   ALASKA On Time     San.Diego     212
## 4   ALASKA On Time San.Francisco     503
## 5   ALASKA On Time       Seattle    1841
## 6   ALASKA Delayed   Los.Angeles      62
## 7   ALASKA Delayed       Phoenix      12
## 8   ALASKA Delayed     San.Diego      20
## 9   ALASKA Delayed San.Francisco     102
## 10  ALASKA Delayed       Seattle     305
## 11 AM WEST On Time   Los.Angeles     694
## 12 AM WEST On Time       Phoenix    4840
## 13 AM WEST On Time     San.Diego     383
## 14 AM WEST On Time San.Francisco     320
## 15 AM WEST On Time       Seattle     201
## 16 AM WEST Delayed   Los.Angeles     117
## 17 AM WEST Delayed       Phoenix     415
## 18 AM WEST Delayed     San.Diego      65
## 19 AM WEST Delayed San.Francisco     129
## 20 AM WEST Delayed       Seattle      61

Analysis

Filter Status

When analyzing the comparison of delays for the two airlines I created a new table that filtered Delayed status and summarized Delayed flights.

The group_by formula (dplyr) takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”. The summarise formula (dplyr) is typically used on grouped data created by group_by().

FlightsDelayed <- arrival %>% group_by(Airline,Destination) %>% filter(Status == 'Delayed') %>%summarise(Delayed = sum(Flights))
FlightsDelayed

## # A tibble: 10 x 3
## # Groups:   Airline [?]
##    Airline Destination   Delayed
##    <fct>   <chr>           <int>
##  1 ALASKA  Los.Angeles        62
##  2 ALASKA  Phoenix            12
##  3 ALASKA  San.Diego          20
##  4 ALASKA  San.Francisco     102
##  5 ALASKA  Seattle           305
##  6 AM WEST Los.Angeles       117
##  7 AM WEST Phoenix           415
##  8 AM WEST San.Diego          65
##  9 AM WEST San.Francisco     129
## 10 AM WEST Seattle            61

The group_by (dplyr) formula takes an existing tbl and converts it into a grouped tbl where operations are performed “by group”.

Calculate Total Flights

When analyzing the comparison of delays for the two airlines I created a new table that summarized the total flights.

TotalFlights <- arrival %>% group_by(Airline,Destination) %>% summarise(Total = sum(Flights))
TotalFlights

## # A tibble: 10 x 3
## # Groups:   Airline [?]
##    Airline Destination   Total
##    <fct>   <chr>         <int>
##  1 ALASKA  Los.Angeles     559
##  2 ALASKA  Phoenix         233
##  3 ALASKA  San.Diego       232
##  4 ALASKA  San.Francisco   605
##  5 ALASKA  Seattle        2146
##  6 AM WEST Los.Angeles     811
##  7 AM WEST Phoenix        5255
##  8 AM WEST San.Diego       448
##  9 AM WEST San.Francisco   449
## 10 AM WEST Seattle         262

Combine and Mutate

I created a new table which combined the tables above.

The cbind formula (base) take a sequence of vector, matrix or data-frame arguments and combine by columns or rows, respectively. The mutate formula (dplyr) adds new variables and preserves existing. In the case of the above this formula was used to duplicate the first column.

DelaySummary<- cbind(FlightsDelayed, Total=TotalFlights$Total)
DelaySummary<- DelaySummary %>% mutate(DelayPercentage = Delayed/Total)
DelaySummary

## # A tibble: 10 x 5
## # Groups:   Airline [2]
##    Airline Destination   Delayed Total DelayPercentage
##    <fct>   <chr>           <int> <int>           <dbl>
##  1 ALASKA  Los.Angeles        62   559          0.111 
##  2 ALASKA  Phoenix            12   233          0.0515
##  3 ALASKA  San.Diego          20   232          0.0862
##  4 ALASKA  San.Francisco     102   605          0.169 
##  5 ALASKA  Seattle           305  2146          0.142 
##  6 AM WEST Los.Angeles       117   811          0.144 
##  7 AM WEST Phoenix           415  5255          0.0790
##  8 AM WEST San.Diego          65   448          0.145 
##  9 AM WEST San.Francisco     129   449          0.287 
## 10 AM WEST Seattle            61   262          0.233

High Level Summary

I created a high level summary without Destination.

FlightsDelayedHL <- arrival %>% group_by(Airline) %>% filter(Status == 'Delayed') %>%summarise(Delayed = sum(Flights))
TotalFlightsHL <- arrival %>% group_by(Airline) %>% summarise(Total = sum(Flights))
DelaySummaryHL<-cbind(FlightsDelayedHL,Total=TotalFlightsHL$Total)
DelaySummaryHL<- DelaySummaryHL %>% mutate(HLDelayPercentage = DelaySummaryHL$Delayed/DelaySummaryHL$Total)
DelaySummaryHL

##   Airline Delayed Total HLDelayPercentage
## 1  ALASKA     501  3775         0.1327152
## 2 AM WEST     787  7225         0.1089273

Alaska Airlines has a higher delay percentage over AM West however Alaska Airlines had fewer Total Flights.

Visuals

I used geom_point as the visual.

The point geom is used to create scatterplots. The scatterplot is most useful for displaying the relationship between two continuous variables. It can be used to compare one continuous and one categorical variable, or two categorical variables which I thought was fitting in this case.

ggplot(DelaySummary, aes(Destination, Delayed)) + geom_point(aes(color=Airline),size=8,alpha=1/2)