R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

## Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
## Two tidyr functions do the tidying here: gather() and spread()

# Read CSV file 

library(tidyr)

flightData <- read.csv("https://raw.githubusercontent.com/vijay564/R-Maincode/master/week5.csv", header = TRUE, sep = ",", na.strings = "?", stringsAsFactors = FALSE)
flightData
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61
# Remove the blank row (row 3), keeping rows 1, 2, 4 and 5
flightData <- flightData[c(1,2,4,5), ]
flightData
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61
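The row indices above are hard-coded, so they only fit this exact file. A minimal sketch of dropping the blank row by condition instead, assuming a blank row is one where every count column is NA. The data frame `raw` is a hypothetical stand-in built from part of the output printed above, not the CSV itself.

```r
# Hypothetical stand-in for part of the CSV contents shown above
raw <- data.frame(
  X = c("ALASKA", "", "", "AM West", ""),
  X.1 = c("On Time", "Delayed", "", "On Time", "Delayed"),
  Los.Angeles = c(497, 62, NA, 694, 117),
  Phoenix = c(221, 12, NA, 4840, 415),
  stringsAsFactors = FALSE
)

# Columns that hold counts (everything except the two label columns)
count_cols <- setdiff(names(raw), c("X", "X.1"))

# Keep rows that have at least one non-missing count
keep <- rowSums(!is.na(raw[count_cols])) > 0
cleaned <- raw[keep, ]
cleaned
```

This keeps working if the blank row moves or if more blank rows appear, as long as real rows always carry at least one count.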
# Name the two columns whose headers are missing in the CSV
names(flightData)[names(flightData) == "X"] <- "Airline"
names(flightData)[names(flightData) == "X.1"] <- "Arrival"

# Fill in the airline name on the rows where it is blank
flightData[2, 1] <- "ALASKA"
flightData[4, 1] <- "AM West"
flightData
##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2  ALASKA Delayed          62      12        20           102     305
## 4 AM West On Time         694    4840       383           320     201
## 5 AM West Delayed         117     415        65           129      61
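Assigning the names cell by cell is fine for four rows. For larger files, tidyr's `fill()` can carry the last seen value downward. The sketch below assumes the blanks are empty strings, as in the output above, and uses a hypothetical stand-in data frame `flights` with a single city column.

```r
library(tidyr)

# Hypothetical stand-in for the data after the blank row is removed
flights <- data.frame(
  Airline = c("ALASKA", "", "AM West", ""),
  Arrival = c("On Time", "Delayed", "On Time", "Delayed"),
  Seattle = c(1841, 305, 201, 61),
  stringsAsFactors = FALSE
)

# fill() only fills NA, so mark empty strings as missing first
flights$Airline[flights$Airline == ""] <- NA

# Carry the last non-missing airline name down into the rows below it
flights <- fill(flights, Airline, .direction = "down")
flights
```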
# Use gather() to collect multiple columns into key-value pairs. It makes “wide” data longer.
tidy <- gather(flightData, "City", "Count", 3:7) 
head(tidy)
##   Airline Arrival        City Count
## 1  ALASKA On Time Los.Angeles   497
## 2  ALASKA Delayed Los.Angeles    62
## 3 AM West On Time Los.Angeles   694
## 4 AM West Delayed Los.Angeles   117
## 5  ALASKA On Time     Phoenix   221
## 6  ALASKA Delayed     Phoenix    12
# Use spread() to spread a key-value pair across multiple columns. It makes “long” data wider.
tidy <- spread(tidy, "Arrival", Count)
tidy
##    Airline          City Delayed On Time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM West   Los.Angeles     117     694
## 7  AM West       Phoenix     415    4840
## 8  AM West     San.Diego      65     383
## 9  AM West San.Francisco     129     320
## 10 AM West       Seattle      61     201
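`gather()` and `spread()` still work, but tidyr's documentation marks them as superseded by `pivot_longer()` and `pivot_wider()`. A rough equivalent of the two reshaping steps above, assuming tidyr 1.0.0 or later. The data frame `flights` is a hypothetical two-city stand-in for the cleaned data.

```r
library(tidyr)

# Hypothetical two-city stand-in for the cleaned flight data
flights <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM West", "AM West"),
  Arrival = c("On Time", "Delayed", "On Time", "Delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  stringsAsFactors = FALSE
)

# Wide to long: one row per airline, arrival status, and city
long <- pivot_longer(flights, cols = c(Los.Angeles, Phoenix),
                     names_to = "City", values_to = "Count")

# Long to wide: one column per arrival status
wide <- pivot_wider(long, names_from = Arrival, values_from = Count)
wide
```

Both functions return a tibble, so the printed result looks slightly different from the data frame output above.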
# Use select() to keep a subset of columns, or prefix a column name with minus to drop it
suppressMessages(library(dplyr))
head(select(tidy, Airline))
##   Airline
## 1  ALASKA
## 2  ALASKA
## 3  ALASKA
## 4  ALASKA
## 5  ALASKA
## 6 AM West
head(select(tidy, -Airline))
##            City Delayed On Time
## 1   Los.Angeles      62     497
## 2       Phoenix      12     221
## 3     San.Diego      20     212
## 4 San.Francisco     102     503
## 5       Seattle     305    1841
## 6   Los.Angeles     117     694
# Use filter() to keep only the rows that match a condition
filter(tidy,Delayed==62)
##   Airline        City Delayed On Time
## 1  ALASKA Los.Angeles      62     497
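Matching one exact count is mostly useful as a demonstration. A short sketch of a more typical filter that combines a comparison with a match on the airline. The data frame `flights` is a hypothetical stand-in holding a few rows from the table above.

```r
suppressMessages(library(dplyr))

# Hypothetical stand-in holding a few rows of the wide table above
flights <- data.frame(
  Airline = c("ALASKA", "ALASKA", "ALASKA", "AM West", "AM West"),
  City = c("Los.Angeles", "San.Francisco", "Seattle", "Los.Angeles", "Phoenix"),
  Delayed = c(62, 102, 305, 117, 415),
  stringsAsFactors = FALSE
)

# Comma-separated conditions are combined with AND
busy_alaska <- filter(flights, Airline == "ALASKA", Delayed > 100)
busy_alaska
```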
# Use mutate() to add new columns
tidy <- mutate(tidy, Total = Delayed + `On Time`)
head(tidy)
##   Airline          City Delayed On Time Total
## 1  ALASKA   Los.Angeles      62     497   559
## 2  ALASKA       Phoenix      12     221   233
## 3  ALASKA     San.Diego      20     212   232
## 4  ALASKA San.Francisco     102     503   605
## 5  ALASKA       Seattle     305    1841  2146
## 6 AM West   Los.Angeles     117     694   811
## Compare arrival delays for the two airlines
# AM West has a higher percentage of delayed flights than ALASKA in every city
library(ggplot2)


tidy <- mutate(tidy, Total = Delayed + `On Time`, PercentDelayed = Delayed / Total * 100)
tidy <- arrange(tidy, City, PercentDelayed)

ggplot(tidy,aes(x=City,y=PercentDelayed,fill=factor(Airline)))+
    geom_bar(stat="identity",position="dodge")

tidy
##    Airline          City Delayed On Time Total PercentDelayed
## 1   ALASKA   Los.Angeles      62     497   559      11.091234
## 2  AM West   Los.Angeles     117     694   811      14.426634
## 3   ALASKA       Phoenix      12     221   233       5.150215
## 4  AM West       Phoenix     415    4840  5255       7.897241
## 5   ALASKA     San.Diego      20     212   232       8.620690
## 6  AM West     San.Diego      65     383   448      14.508929
## 7   ALASKA San.Francisco     102     503   605      16.859504
## 8  AM West San.Francisco     129     320   449      28.730512
## 9   ALASKA       Seattle     305    1841  2146      14.212488
## 10 AM West       Seattle      61     201   262      23.282443
# Use summarise() to reduce each group to a small number of summary statistics
# Averaged across the five cities (unweighted), ALASKA's delay rate is about 11% and AM West's is about 18%
delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
delays
## # A tibble: 2 x 2
##   Airline MeanPercent
##   <chr>         <dbl>
## 1 ALASKA           11
## 2 AM West          18
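The mean above weights every city equally, regardless of how many flights it has. A minimal sketch that puts a pooled rate (all delayed flights divided by all flights) next to it, using the `Delayed` and `Total` values printed in the table above. The data frame `flights` and the column names `MeanCityPercent` and `PooledPercent` are hypothetical.

```r
suppressMessages(library(dplyr))

# Delayed and total flights per airline and city, copied from the table above
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM West"), each = 5),
  Delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  Total   = c(559, 233, 232, 605, 2146, 811, 5255, 448, 449, 262),
  stringsAsFactors = FALSE
)

rates <- flights %>%
  group_by(Airline) %>%
  summarise(
    # Unweighted mean of the per-city percentages (what the code above reports)
    MeanCityPercent = mean(Delayed / Total * 100),
    # Pooled rate: all delayed flights divided by all flights
    PooledPercent = sum(Delayed) / sum(Total) * 100
  )
rates
```

With these figures the pooled rate is roughly 13% for ALASKA (501 of 3775 flights) and 11% for AM West (787 of 7225), the reverse of the per-city ordering. Most AM West flights go to Phoenix, where delays are least common, so which summary is appropriate depends on the question being asked.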
