607 Week 6 Assignment

Chirag Vithalani

March 4, 2016


  • GOAL 1: Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  • tidyr provides three main functions for tidying your messy data: gather(), separate() and spread().
  • Reading CSV file and printing data

    library(tidyr)
    ## Warning: package 'tidyr' was built under R version 3.2.3
    flightData<-read.csv("https://raw.githubusercontent.com/chirag-vithlani/607/master/week6/week6.csv", header= TRUE,sep=",",na.strings = "?",stringsAsFactors=FALSE)
    flightData
    ##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
    ## 1  ALASKA On Time         497     221       212           503    1841
    ## 2         Delayed          62      12        20           102     305
    ## 3 AM West On Time         694    4840       383           320     201
    ## 4         Delayed         117     415        65           129      61
  • Missing headers are names as X and X.1, so rename headers

    names(flightData)[names(flightData) == "X"] <- "Airline"
    names(flightData)[names(flightData) == "X.1"] <- "Arrival"
  • Repeating airline name where it is missing ( this is hard coded, but this step should be done programmatically )

    flightData[2, 1] <- "ALASKA"
    flightData[4, 1] <- "AM West"
    flightData
    ##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
    ## 1  ALASKA On Time         497     221       212           503    1841
    ## 2  ALASKA Delayed          62      12        20           102     305
    ## 3 AM West On Time         694    4840       383           320     201
    ## 4 AM West Delayed         117     415        65           129      61
  • Using gather() function. gather() takes multiple columns, and gathers them into key-value pairs: it makes “wide” data longer.

    tidy <- gather(flightData, "City", "Count", 3:7) 
    head(tidy)
    ##   Airline Arrival        City Count
    ## 1  ALASKA On Time Los.Angeles   497
    ## 2  ALASKA Delayed Los.Angeles    62
    ## 3 AM West On Time Los.Angeles   694
    ## 4 AM West Delayed Los.Angeles   117
    ## 5  ALASKA On Time     Phoenix   221
    ## 6  ALASKA Delayed     Phoenix    12
  • Using spread() function. spread() takes two columns (a key-value pair) and spreads them in to multiple columns, making “long” data wider

    #tidy$Count <- as.numeric(gsub(",", "", tidy$Count))
    tidy <- spread(tidy, "Arrival", Count)
    tidy
    ##    Airline          City Delayed On Time
    ## 1   ALASKA   Los.Angeles      62     497
    ## 2   ALASKA       Phoenix      12     221
    ## 3   ALASKA     San.Diego      20     212
    ## 4   ALASKA San.Francisco     102     503
    ## 5   ALASKA       Seattle     305    1841
    ## 6  AM West   Los.Angeles     117     694
    ## 7  AM West       Phoenix     415    4840
    ## 8  AM West     San.Diego      65     383
    ## 9  AM West San.Francisco     129     320
    ## 10 AM West       Seattle      61     201
  • dplyr implements the following verbs useful for data manipulation:
    select(): focus on a subset of variables
    filter(): focus on a subset of rows
    mutate(): add new columns
    summarise(): reduce each group to a smaller number of summary statistics
    arrange(): re-order the rows
    

    Choosing columns: select

    suppressMessages(library(dplyr))
    head(select(tidy, Airline))
    ##   Airline
    ## 1  ALASKA
    ## 2  ALASKA
    ## 3  ALASKA
    ## 4  ALASKA
    ## 5  ALASKA
    ## 6 AM West

    or use minus to hide that column

    head(select(tidy, -Airline))
    ##            City Delayed On Time
    ## 1   Los.Angeles      62     497
    ## 2       Phoenix      12     221
    ## 3     San.Diego      20     212
    ## 4 San.Francisco     102     503
    ## 5       Seattle     305    1841
    ## 6   Los.Angeles     117     694
  • Using filter

    filter(tidy,Delayed==62)
    ##   Airline        City Delayed On Time
    ## 1  ALASKA Los.Angeles      62     497
  • Using mutate : Add new variables

    tidy=mutate(tidy,Total = Delayed + `On Time`)
    head(tidy)
    ##   Airline          City Delayed On Time Total
    ## 1  ALASKA   Los.Angeles      62     497   559
    ## 2  ALASKA       Phoenix      12     221   233
    ## 3  ALASKA     San.Diego      20     212   232
    ## 4  ALASKA San.Francisco     102     503   605
    ## 5  ALASKA       Seattle     305    1841  2146
    ## 6 AM West   Los.Angeles     117     694   811
  • GOAL 2: Perform analysis to compare the arrival delays for the two airlines.
  • As we can see for almost for all cities AM West is getting delayed more number of time.

    library(ggplot2)
    ## Warning: package 'ggplot2' was built under R version 3.2.3
    tidy <- mutate(tidy, Total = Delayed + `On Time`, PercentDelayed = Delayed / Total * 100)
    tidy <- arrange(tidy, City, PercentDelayed)
    
    ggplot(tidy,aes(x=City,y=PercentDelayed,fill=factor(Airline)))+
        geom_bar(stat="identity",position="dodge")

    tidy
    ##    Airline          City Delayed On Time Total PercentDelayed
    ## 1   ALASKA   Los.Angeles      62     497   559      11.091234
    ## 2  AM West   Los.Angeles     117     694   811      14.426634
    ## 3   ALASKA       Phoenix      12     221   233       5.150215
    ## 4  AM West       Phoenix     415    4840  5255       7.897241
    ## 5   ALASKA     San.Diego      20     212   232       8.620690
    ## 6  AM West     San.Diego      65     383   448      14.508929
    ## 7   ALASKA San.Francisco     102     503   605      16.859504
    ## 8  AM West San.Francisco     129     320   449      28.730512
    ## 9   ALASKA       Seattle     305    1841  2146      14.212488
    ## 10 AM West       Seattle      61     201   262      23.282443
  • Using summarise : Reduce variables to values

    Overall delay : Alaska airline is delayed by 11% and AM West delayed by 18%.

    delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
    delays
    ## Source: local data frame [2 x 2]
    ## 
    ##   Airline MeanPercent
    ##     (chr)       (dbl)
    ## 1  ALASKA          11
    ## 2 AM West          18