Shana Green

DATA 607 - Homework 5

Due Date: 9/26/2020

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  2. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  3. Perform analysis to compare the arrival delays for the two airlines.

  4. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

The URL to the .Rmd file in your GitHub repository and the URL for your rpubs.com web page.

library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
# Upload CSV file

airlines <- read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-HW-5/master/numbersense.csv", header=FALSE, sep=",", stringsAsFactors=FALSE)
                    
airlines
##        V1      V2          V3      V4        V5            V6      V7
## 1                 Los Angeles Phoenix San Diego San Francisco Seattle
## 2  ALASKA on time         497     221       212           503    1841
## 3         delayed          62      12        20           102     305
## 4                                                                    
## 5 AM WEST on time         694    4840       383           320     201
## 6         delayed         117     415        65           129      61

As we can see here, we need to do some data cleaning before doing our analysis. The CSV file is missing two column names.

# Rename header columns

airlines$V1[1] <- "Airlines"
airlines$V2[1] <- "Flight Status"
airlines
##         V1            V2          V3      V4        V5            V6      V7
## 1 Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2   ALASKA       on time         497     221       212           503    1841
## 3                delayed          62      12        20           102     305
## 4                                                                           
## 5  AM WEST       on time         694    4840       383           320     201
## 6                delayed         117     415        65           129      61
# Renaming row header

names(airlines) <- airlines[1,]
airlines
##   Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2   ALASKA       on time         497     221       212           503    1841
## 3                delayed          62      12        20           102     305
## 4                                                                           
## 5  AM WEST       on time         694    4840       383           320     201
## 6                delayed         117     415        65           129      61

Let’s delete Row 1 and 4

airlines <- airlines[-c(1,4),]
airlines
##   Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2   ALASKA       on time         497     221       212           503    1841
## 3                delayed          62      12        20           102     305
## 5  AM WEST       on time         694    4840       383           320     201
## 6                delayed         117     415        65           129      61

After doing this, I noticed empty spaces for ALASKA and AM WEST. I will proceed to add in data in given data set.

airlines$Airlines[2] = 'ALASKA'
airlines$Airlines[4] = 'AM WEST'
airlines
##   Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2   ALASKA       on time         497     221       212           503    1841
## 3   ALASKA       delayed          62      12        20           102     305
## 5  AM WEST       on time         694    4840       383           320     201
## 6  AM WEST       delayed         117     415        65           129      61

I will use the gather function to gather columns into rows. Doing this creates long data instead of wide data here.

airlines <-gather(airlines, city, number, 3:7, factor_key = TRUE)
airlines
##    Airlines Flight Status          city number
## 1    ALASKA       on time   Los Angeles    497
## 2    ALASKA       delayed   Los Angeles     62
## 3   AM WEST       on time   Los Angeles    694
## 4   AM WEST       delayed   Los Angeles    117
## 5    ALASKA       on time       Phoenix    221
## 6    ALASKA       delayed       Phoenix     12
## 7   AM WEST       on time       Phoenix   4840
## 8   AM WEST       delayed       Phoenix    415
## 9    ALASKA       on time     San Diego    212
## 10   ALASKA       delayed     San Diego     20
## 11  AM WEST       on time     San Diego    383
## 12  AM WEST       delayed     San Diego     65
## 13   ALASKA       on time San Francisco    503
## 14   ALASKA       delayed San Francisco    102
## 15  AM WEST       on time San Francisco    320
## 16  AM WEST       delayed San Francisco    129
## 17   ALASKA       on time       Seattle   1841
## 18   ALASKA       delayed       Seattle    305
## 19  AM WEST       on time       Seattle    201
## 20  AM WEST       delayed       Seattle     61
# I transposed columns 3 to 7 (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle) here and displayed their listed numbers

I will do the same using the spread function, but instead display the Flight Status.

airlines <-  airlines %>%
                 spread("Flight Status", `number`)
airlines
##    Airlines          city delayed on time
## 1    ALASKA   Los Angeles      62     497
## 2    ALASKA       Phoenix      12     221
## 3    ALASKA     San Diego      20     212
## 4    ALASKA San Francisco     102     503
## 5    ALASKA       Seattle     305    1841
## 6   AM WEST   Los Angeles     117     694
## 7   AM WEST       Phoenix     415    4840
## 8   AM WEST     San Diego      65     383
## 9   AM WEST San Francisco     129     320
## 10  AM WEST       Seattle      61     201

Although the delayed and on time column displays numeric value, I noticed that both columns were listed in chr.

# Convert from chr to numeric

airlines$delayed <- as.numeric(as.character(airlines$delayed))
airlines$`on time`<- as.numeric(as.character(airlines$'on time'))

Using the dplyr package, I will add a total variable to add the total number of delayed and on time flights, in addition to the delayed percentage.

airlines <- airlines %>% 
  mutate(total = delayed + `on time`, delayed.percentage = (delayed/total)*100) 

airlines
##    Airlines          city delayed on time total delayed.percentage
## 1    ALASKA   Los Angeles      62     497   559          11.091234
## 2    ALASKA       Phoenix      12     221   233           5.150215
## 3    ALASKA     San Diego      20     212   232           8.620690
## 4    ALASKA San Francisco     102     503   605          16.859504
## 5    ALASKA       Seattle     305    1841  2146          14.212488
## 6   AM WEST   Los Angeles     117     694   811          14.426634
## 7   AM WEST       Phoenix     415    4840  5255           7.897241
## 8   AM WEST     San Diego      65     383   448          14.508929
## 9   AM WEST San Francisco     129     320   449          28.730512
## 10  AM WEST       Seattle      61     201   262          23.282443
ggplot(airlines, aes(x = Airlines, y = delayed.percentage, fill = city)) +  
          geom_bar(stat="identity", position="dodge") + 
          xlab("Airlines") + ylab("Delays") 

After analyzing the data, we can see here that flights from San Francisco from both airlines, experience the most delays. Phoenix, on the other hand, has the least number of delayed flights from both airlines.

Github link here

Rpubs link here