Overview

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.
    • You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
  2. Read the information from your .CSV file into\(\ R\), and use tidyr and dplyr as needed to tidy and transform your data.
  3. Perform analysis to compare the arrival delays for the two airlines.
  4. Your code should be in an\(\ R \ Markdown\) file, posted to RPUBS, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
    • The URL to the .Rmd file in your GitHub repository. and
    • The URL for your rpubs.com web page.

Import Data

Data is imported from the .csv file located in my github Week 5 folder. Function read.csv from the utils package is used.

arrivals_delays <- read.csv(url, as.is = TRUE)
arrivals_delays
X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
ALASKA on time 497 221 212 503 1,841
delayed 62 12 20 102 305
NA NA NA
AM WEST on time 694 4,840 383 320 201
delayed 117 415 65 129 61
NOTE:
The present table reflects the NULL cells and unlabeled columns in the .csv file

Cleanup

The above data set has a few problems that need correcting including:

  • Removal of row comprised of empty cells () and NA
  • Relabeling of column names for functional use
  • Inserting logical data, for blank cells in column 1.

NA Removal

The only row requiring removal is row 3. This is done using brackets [ ] in conjunction with - symbol to specify removal of the selected row.

(arrivals_delays <- arrivals_delays[-3,])
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Relabeling of colnames

Since all but two columns are labeled correctly, a list of names is best to insert into the data frame.

df_names <-
  c("Airline", "Arrival", "Los Angeles", "Phoenix",
    "San Diego", "San Francisco", "Seattle")

colnames(arrivals_delays) <- df_names
##   Airline Arrival Los Angeles Phoenix San Diego San Francisco Seattle
## 1  ALASKA on time         497     221       212           503   1,841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694   4,840       383           320     201
## 5         delayed         117     415        65           129      61

Inserting Data

The most logical data to place inside the empty cells in column one, is the Airline being referenced. Rather than manually type in the name I chose to reference the Airline name in the row before the blank one. Had this been a larger data set a while loop may have been a helpful tool.

arrivals_delays[2,1]<-arrivals_delays[1,1]
arrivals_delays[4,1]<-arrivals_delays[3,1]
arrivals_delays
Airline Arrival Los Angeles Phoenix San Diego San Francisco Seattle
ALASKA on time 497 221 212 503 1,841
ALASKA delayed 62 12 20 102 305
AM WEST on time 694 4,840 383 320 201
AM WEST delayed 117 415 65 129 61

Transforming columns and rows

Gather()

Using gather() function along with pipe %>% I reorganize the data so the counts and cities are accounted for on two columns.

  • key is the name of the new key or column holding the previous column names
  • value is the name of the column holding the values associated with said key values.
(arrivals_delays_steps<-arrivals_delays %>%
  gather(key = "City",value = "Count", 3:7))
##    Airline Arrival          City Count
## 1   ALASKA on time   Los Angeles   497
## 2   ALASKA delayed   Los Angeles    62
## 3  AM WEST on time   Los Angeles   694
## 4  AM WEST delayed   Los Angeles   117
## 5   ALASKA on time       Phoenix   221
## 6   ALASKA delayed       Phoenix    12
## 7  AM WEST on time       Phoenix 4,840
## 8  AM WEST delayed       Phoenix   415
## 9   ALASKA on time     San Diego   212
## 10  ALASKA delayed     San Diego    20
## 11 AM WEST on time     San Diego   383
## 12 AM WEST delayed     San Diego    65
## 13  ALASKA on time San Francisco   503
## 14  ALASKA delayed San Francisco   102
## 15 AM WEST on time San Francisco   320
## 16 AM WEST delayed San Francisco   129
## 17  ALASKA on time       Seattle 1,841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61

Spread()

Using spread() function

  • key is the column name or positions we are referencing.
  • values are still the values affiliated with the key, but graphically the reference is lateral. Technically the shared relationship for these options is a single row.
  • Columns \(\ 2 \ \&\ 4\) are referenced. e.g.1
(arrivals_delays_steps<-arrivals_delays_steps %>%
  spread(key = 2,value = 4))
##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San Diego      20     212
## 4   ALASKA San Francisco     102     503
## 5   ALASKA       Seattle     305   1,841
## 6  AM WEST   Los Angeles     117     694
## 7  AM WEST       Phoenix     415   4,840
## 8  AM WEST     San Diego      65     383
## 9  AM WEST San Francisco     129     320
## 10 AM WEST       Seattle      61     201

Arrange()

Using the arrange() function I sort the data by city

arrivals_delays_steps %>%
  arrange(City)
##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2  AM WEST   Los Angeles     117     694
## 3   ALASKA       Phoenix      12     221
## 4  AM WEST       Phoenix     415   4,840
## 5   ALASKA     San Diego      20     212
## 6  AM WEST     San Diego      65     383
## 7   ALASKA San Francisco     102     503
## 8  AM WEST San Francisco     129     320
## 9   ALASKA       Seattle     305   1,841
## 10 AM WEST       Seattle      61     201

Single command

Alternatively all step can occur at once

(arrivals_delays<-arrivals_delays %>%
  gather(key = "City",value = "Count", 3:7)%>%
    spread(key = 2,value = 4) %>%
        arrange(City))
##    Airline          City delayed on time
## 1   ALASKA   Los Angeles      62     497
## 2  AM WEST   Los Angeles     117     694
## 3   ALASKA       Phoenix      12     221
## 4  AM WEST       Phoenix     415   4,840
## 5   ALASKA     San Diego      20     212
## 6  AM WEST     San Diego      65     383
## 7   ALASKA San Francisco     102     503
## 8  AM WEST San Francisco     129     320
## 9   ALASKA       Seattle     305   1,841
## 10 AM WEST       Seattle      61     201

Calculations

The Calculations are as follows:

  • Total Flights = delayed + on time values by row.
  • The for loop traverses each row, calculating Total Flights and storing it in the corresponding row.
  • Delay Rate and On Time Rate are calculated by dividing entire columns:
    • delayed/Total Flights
    • on time/Total Flights
# for loop to create totals
for(row in 1:nrow(arrivals_delays))
  arrivals_delays$`Total Flights`[row]<-
  as.numeric(arrivals_delays[row,3])+
  as.numeric(gsub(",","",arrivals_delays[row,4]))

#outside of for loop calculations for rates
  arrivals_delays$`Delay Rate`<-
    round( as.numeric( arrivals_delays$delayed ) / arrivals_delays$`Total Flights`,2 ) 

#arrival rates calculations
  arrivals_delays$`On Time Rate`<-
    round( as.numeric( gsub(",","",arrivals_delays$`on time` )) / arrivals_delays$`Total Flights`,2 ) 
arrivals_delays
Airline City delayed on time Total Flights Delay Rate On Time Rate
ALASKA Los Angeles 62 497 559 0.11 0.89
AM WEST Los Angeles 117 694 811 0.14 0.86
ALASKA Phoenix 12 221 233 0.05 0.95
AM WEST Phoenix 415 4,840 5255 0.08 0.92
ALASKA San Diego 20 212 232 0.09 0.91
AM WEST San Diego 65 383 448 0.15 0.85
ALASKA San Francisco 102 503 605 0.17 0.83
AM WEST San Francisco 129 320 449 0.29 0.71
ALASKA Seattle 305 1,841 2146 0.14 0.86
AM WEST Seattle 61 201 262 0.23 0.77

Conclusion

The summary below and the corresponding line graphs show, AM West varies greatly on its delay and on time rate by city, specifically regarding San Francisco. ALASKA airlines, however, fluctuates less and is consistently on time more with less delays on all major cities compared to AM WEST.

## # A tibble: 2 x 5
##   Airline `Avg Delay` `Avg Delay Rate` `On Time Avg` `On Time Avg Rate`
## * <chr>         <dbl>            <dbl>         <dbl>              <dbl>
## 1 ALASKA         100.            0.112          655.              0.888
## 2 AM WEST        157.            0.178         1288.              0.822


  1. arrivals_delays[2] and arrivals_delays[4] for columns 2 & 4 respectively:↩︎