First, load the packages.
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The .csv data is loaded.
theURL <- "https://raw.githubusercontent.com/Tyllis/Data607/master/airlinedata.csv"
aldata <- read.csv(theURL)
Let’s see what it looks like.
aldata
## ï.. X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 delayed 62 12 20 102 305
## 3 NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 delayed 117 415 65 129 61
The .csv is loaded into a data.frame object. There are a couple of initial issues:
These two issues need to be dealt with first.
The NA lines can be removed simply using the filter function.
aldata <- filter(aldata, X != "")
First we give the airline column a name.
names(aldata)[1] <- c("Airline")
Now we can copy the airline names onto the “delayed” row.
numrow <- dim(aldata)[1]
aldata[seq(2, numrow, 2), "Airline"] <- aldata[seq(1, numrow, 2), "Airline"]
aldata
## Airline X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AM WEST on time 694 4840 383 320 201
## 4 AM WEST delayed 117 415 65 129 61
The data.frame now looks more tidy than we started.
Upon inspection, this data has 4 variables:
The city names can be stacked using the gather function.
The number of on-time or delayed flights can be spread into two columns using spread function.
aldata <- aldata %>%
gather(City, timliness, Los.Angeles:Seattle) %>%
spread(X, timliness)
aldata
## Airline City delayed on time
## 1 ALASKA Los.Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San.Diego 20 212
## 4 ALASKA San.Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los.Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San.Diego 65 383
## 9 AM WEST San.Francisco 129 320
## 10 AM WEST Seattle 61 201
It is now in its tidy form. Each column is a variable, and each row is a observation.
I am interested to see which airline has more delays in these cities.
aldata %>%
select(Airline, delayed) %>%
group_by(Airline) %>%
summarise(total_delays = sum(delayed))
## # A tibble: 2 x 2
## Airline total_delays
## <fctr> <int>
## 1 ALASKA 501
## 2 AM WEST 787
So Am West has more delays.
Let’s look at the percentage or proportion of delays.
But first we need to change the column name “on time” to “on_time”, so it doesn’t throw an error when we use dplyr functions.
names(aldata)[4] <- "on_time"
aldata %>%
select(Airline, delayed, on_time) %>%
group_by(Airline) %>%
summarise(delay_proportion = sum(delayed)/(sum(delayed+on_time)))
## # A tibble: 2 x 2
## Airline delay_proportion
## <fctr> <dbl>
## 1 ALASKA 0.1327152
## 2 AM WEST 0.1089273
Percentage wise, Alaska Airline has more delay.
Next, I am interested to see which city has the most delay by percentage.
aldata %>%
group_by(City) %>%
summarise(delay_proportion = sum(delayed)/(sum(delayed+on_time)))
## # A tibble: 5 x 2
## City delay_proportion
## <chr> <dbl>
## 1 Los.Angeles 0.13065693
## 2 Phoenix 0.07780612
## 3 San.Diego 0.12500000
## 4 San.Francisco 0.21916509
## 5 Seattle 0.15199336
Therefore, San Francisco has 21.92% of its Alaska and Am West flights delayed, the highest among the cities.
Lastly, I would like to see, for each airline, which city has the most delay.
aldata %>%
mutate(dp = delayed/(delayed+on_time)) %>%
group_by(Airline) %>%
filter(dp == max(dp))
## # A tibble: 2 x 5
## # Groups: Airline [2]
## Airline City delayed on_time dp
## <fctr> <chr> <int> <int> <dbl>
## 1 ALASKA San.Francisco 102 503 0.1685950
## 2 AM WEST San.Francisco 129 320 0.2873051
Again, San Francisco tops the flight delay chart for both airlines.