Load Packages and Data

First, load the packages.

library(tidyr)
## Warning: package 'tidyr' was built under R version 3.4.2
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The .csv data is loaded.

theURL <- "https://raw.githubusercontent.com/Tyllis/Data607/master/airlinedata.csv"
aldata <- read.csv(theURL) 

Let’s see what it looks like.

aldata
##       ï..       X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

Transforming Data Set

The .csv is loaded into a data.frame object. There are a couple of initial issues:

  • The csv data has a blank line in between the airlines. These blank line is imported as a line filled with “NA”.
  • The airline names are only showing in the “on time” row. They are omitted in the “delayed” row.

These two issues need to be dealt with first.

Removing “NA” line

The NA lines can be removed simply using the filter function.

aldata <- filter(aldata, X != "")

Copy Airline Names

First we give the airline column a name.

names(aldata)[1] <- c("Airline")

Now we can copy the airline names onto the “delayed” row.

numrow <- dim(aldata)[1]
aldata[seq(2, numrow, 2), "Airline"] <- aldata[seq(1, numrow, 2), "Airline"]
aldata
##   Airline       X Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 3 AM WEST on time         694    4840       383           320     201
## 4 AM WEST delayed         117     415        65           129      61

The data.frame now looks more tidy than we started.

Cleaning Data

Upon inspection, this data has 4 variables:

  • airline name
  • city name
  • number of on time flights
  • number of delayed flights

The city names can be stacked using the gather function.

The number of on-time or delayed flights can be spread into two columns using spread function.

aldata <- aldata %>% 
            gather(City, timliness, Los.Angeles:Seattle) %>% 
            spread(X, timliness)
aldata
##    Airline          City delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

It is now in its tidy form. Each column is a variable, and each row is a observation.

Analyzing Data Set

I am interested to see which airline has more delays in these cities.

aldata %>% 
  select(Airline, delayed) %>% 
  group_by(Airline) %>% 
  summarise(total_delays = sum(delayed))
## # A tibble: 2 x 2
##   Airline total_delays
##    <fctr>        <int>
## 1  ALASKA          501
## 2 AM WEST          787

So Am West has more delays.

Let’s look at the percentage or proportion of delays.

But first we need to change the column name “on time” to “on_time”, so it doesn’t throw an error when we use dplyr functions.

names(aldata)[4] <- "on_time"
aldata %>% 
  select(Airline, delayed, on_time) %>% 
  group_by(Airline) %>% 
  summarise(delay_proportion = sum(delayed)/(sum(delayed+on_time)))
## # A tibble: 2 x 2
##   Airline delay_proportion
##    <fctr>            <dbl>
## 1  ALASKA        0.1327152
## 2 AM WEST        0.1089273

Percentage wise, Alaska Airline has more delay.

Next, I am interested to see which city has the most delay by percentage.

aldata %>% 
  group_by(City) %>% 
  summarise(delay_proportion = sum(delayed)/(sum(delayed+on_time)))
## # A tibble: 5 x 2
##            City delay_proportion
##           <chr>            <dbl>
## 1   Los.Angeles       0.13065693
## 2       Phoenix       0.07780612
## 3     San.Diego       0.12500000
## 4 San.Francisco       0.21916509
## 5       Seattle       0.15199336

Therefore, San Francisco has 21.92% of its Alaska and Am West flights delayed, the highest among the cities.

Lastly, I would like to see, for each airline, which city has the most delay.

aldata %>% 
  mutate(dp = delayed/(delayed+on_time)) %>% 
  group_by(Airline) %>% 
  filter(dp == max(dp))
## # A tibble: 2 x 5
## # Groups:   Airline [2]
##   Airline          City delayed on_time        dp
##    <fctr>         <chr>   <int>   <int>     <dbl>
## 1  ALASKA San.Francisco     102     503 0.1685950
## 2 AM WEST San.Francisco     129     320 0.2873051

Again, San Francisco tops the flight delay chart for both airlines.