First, to make our lives easy, we will use three packages from the tidyverse: readr, tidyr, and dplyr. Then, we will read the data in with read_csv and fill in the null spots with the correct data using fill().
require(readr)
## Loading required package: readr
require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read_csv('flights.csv')
## Warning: Missing column names filled in: 'X1' [1], 'X2' [2]
## Parsed with column specification:
## cols(
## X1 = col_character(),
## X2 = col_character(),
## `Los Angeles` = col_integer(),
## Phoenix = col_integer(),
## `San Diego` = col_integer(),
## `San Francisco` = col_integer(),
## Seattle = col_integer()
## )
data <- fill(data, X1)
data
## # A tibble: 4 x 7
## X1 X2 `Los Angeles` Phoenix `San Diego` `San Francisco` Seattle
## <chr> <chr> <int> <int> <int> <int> <int>
## 1 ALASKA ontime 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 201 305
## 3 AMWEST ontime 694 4840 383 320 201
## 4 AMWEST delayed 117 415 65 129 61
Next we crea data frames for each airline and give the columns names.
alaska <- data.frame(t(filter(data, X1 == "ALASKA")[3:7]))
amwest <- data.frame(t(filter(data, X1 == "AMWEST")[3:7]))
colnames(alaska) <- c( 'Ontime', 'Delayed')
colnames(amwest) <- c( 'Ontime', 'Delayed')
Then, we add a column that shows the delay rate for each city and airline.
alaska <- mutate(alaska, Rates = Delayed/(Ontime+Delayed), Totals = Ontime+Delayed)
amwest <- mutate(amwest, Rates = Delayed/(Ontime+Delayed), Totals = Ontime+Delayed)
alaska
## Ontime Delayed Rates Totals
## 1 497 62 0.11091234 559
## 2 221 12 0.05150215 233
## 3 212 20 0.08620690 232
## 4 503 201 0.28551136 704
## 5 1841 305 0.14212488 2146
Next, we bind the rates columns from each airline and give the new data frame column names.
rates <- data.frame(cbind(alaska$Rates, amwest$Rates))
colnames(rates) <- c("Alaska", "Amwest")
mean(rates$Alaska)
## [1] 0.1352515
mean(rates$Amwest)
## [1] 0.1776915
sd(rates$Alaska)
## [1] 0.09032856
sd(rates$Amwest)
## [1] 0.08212854
median(rates$Alaska)
## [1] 0.1109123
median(rates$Amwest)
## [1] 0.1450893
This shows us that at the average city, Amwest is more likely to be delayed than Alaska but Alaska has more variance between cities. However, we can see that Alaska had an average of .15 delays and Amwest had .10 when we look at the aggregate data.
sum(alaska$Delayed)/sum(alaska$Totals)
## [1] 0.1548787
sum(amwest$Delayed)/sum(amwest$Totals)
## [1] 0.1089273
This makes sense when we see that both sets of data are skewed.
summary(rates)
## Alaska Amwest
## Min. :0.05150 Min. :0.07897
## 1st Qu.:0.08621 1st Qu.:0.14427
## Median :0.11091 Median :0.14509
## Mean :0.13525 Mean :0.17769
## 3rd Qu.:0.14212 3rd Qu.:0.23282
## Max. :0.28551 Max. :0.28731