Elina Azrilyan

Week 5 Assignment

24 Sep 2018

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

I have started by creating a .csv file with the arrival delay information provided in the assignment. I uploaded my .csv file to github to read it from an online location. The data includes blanks and it is challenging to analyze it in this format.

#install.packages("tidyverse")
#install.packages("dplyr")
require(dplyr)
## Warning: package 'dplyr' was built under R version 3.5.1
require(tidyr)

AirData <- read.csv(file="https://raw.githubusercontent.com/che10vek/Data-607-Assignments/master/Assignment%205%20Data.csv", header=TRUE, sep=",")
head(AirData)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61
## 6                          NA      NA        NA            NA      NA
  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

Now that we have read in our data - we need to clean up and transform it to make it possible to do analysis.

This code will get read of NAs:

newairdata <- subset(AirData, AirData$X.1 != "") 
head(newairdata)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2         delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5         delayed         117     415        65           129      61

I struggled with filling the blank cells with the value from the cell above quiet a bit. I ended up finding a function on the following webpage and adopting it a bit for my data: https://stackoverflow.com/questions/10554741/fill-in-data-frame-with-values-from-rows-above

f4 <- function(x, blank = "") {
# Find the values
if (is.function(blank)) {
  isnotblank <- !blank(x)
  } else {
    isnotblank <- x != blank
  }
# Fill down
x[which(isnotblank)][cumsum(isnotblank)]
}

newairdata$X <- f4(newairdata$X)
head (newairdata)
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

The following step converts this “wide”" data set into “long” data.

air_data_long <- newairdata %>% gather(City, Count, Los.Angeles:Seattle)
air_data_long
##          X     X.1          City Count
## 1   ALASKA on time   Los.Angeles   497
## 2   ALASKA delayed   Los.Angeles    62
## 3  AM WEST on time   Los.Angeles   694
## 4  AM WEST delayed   Los.Angeles   117
## 5   ALASKA on time       Phoenix   221
## 6   ALASKA delayed       Phoenix    12
## 7  AM WEST on time       Phoenix  4840
## 8  AM WEST delayed       Phoenix   415
## 9   ALASKA on time     San.Diego   212
## 10  ALASKA delayed     San.Diego    20
## 11 AM WEST on time     San.Diego   383
## 12 AM WEST delayed     San.Diego    65
## 13  ALASKA on time San.Francisco   503
## 14  ALASKA delayed San.Francisco   102
## 15 AM WEST on time San.Francisco   320
## 16 AM WEST delayed San.Francisco   129
## 17  ALASKA on time       Seattle  1841
## 18  ALASKA delayed       Seattle   305
## 19 AM WEST on time       Seattle   201
## 20 AM WEST delayed       Seattle    61
  1. Perform analysis to compare the arrival delays for the two airlines.

I have manipulated the data further to get the “on time” and “delayed” data into columns.

air_data_long2 <- air_data_long %>% spread(X.1, Count)
air_data_long2
##          X          City delayed on time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM WEST   Los.Angeles     117     694
## 7  AM WEST       Phoenix     415    4840
## 8  AM WEST     San.Diego      65     383
## 9  AM WEST San.Francisco     129     320
## 10 AM WEST       Seattle      61     201

Let’s add a column with includes percentages to give us a beter understanding of delays for each city.

air_data_long2 <- mutate(air_data_long2,
       PercentageDelayed = round(air_data_long2$delayed/(air_data_long2$delayed + air_data_long2$'on time')*100,2)
 )
air_data_long2 <- mutate(air_data_long2,
       PercentageOnTime = round(air_data_long2$'on time'/(air_data_long2$delayed+air_data_long2$'on time')*100,2)
)
air_data_long2
##          X          City delayed on time PercentageDelayed
## 1   ALASKA   Los.Angeles      62     497             11.09
## 2   ALASKA       Phoenix      12     221              5.15
## 3   ALASKA     San.Diego      20     212              8.62
## 4   ALASKA San.Francisco     102     503             16.86
## 5   ALASKA       Seattle     305    1841             14.21
## 6  AM WEST   Los.Angeles     117     694             14.43
## 7  AM WEST       Phoenix     415    4840              7.90
## 8  AM WEST     San.Diego      65     383             14.51
## 9  AM WEST San.Francisco     129     320             28.73
## 10 AM WEST       Seattle      61     201             23.28
##    PercentageOnTime
## 1             88.91
## 2             94.85
## 3             91.38
## 4             83.14
## 5             85.79
## 6             85.57
## 7             92.10
## 8             85.49
## 9             71.27
## 10            76.72

Now let’s look at the data graphically. We do a boxplot and looks at Percentage of delayed flights for both airlines. We can see that AM West has much higher percentage delayed.

plot(air_data_long2$X, air_data_long2$PercentageDelayed)

We can also look at Percent on time and that shows us the same result. Alaska has a much higher percent of on time arrivals.

plot(air_data_long2$X, air_data_long2$PercentageOnTime)

  1. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your GitHub repository. and  The URL for your rpubs.com web page.