## Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
## Uses two main tidyr functions for tidying data: gather() and spread()
# Read CSV file
library(tidyr)
flightData <- read.csv("https://raw.githubusercontent.com/vijay564/R-Maincode/master/week5.csv", header = TRUE, sep = ",", na.strings = "?", stringsAsFactors = FALSE)
flightData
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 3                          NA      NA        NA            NA      NA
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61
# Eliminate Blank Row
flightData <- flightData[c(1,2,4,5), ]
flightData
##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2         Delayed          62      12        20           102     305
## 4 AM West On Time         694    4840       383           320     201
## 5         Delayed         117     415        65           129      61
# Rename missing Headers
names(flightData)[names(flightData) == "X"] <- "Airline"
names(flightData)[names(flightData) == "X.1"] <- "Arrival"
# Fill in the missing Airline names
flightData[2, 1] <- "ALASKA"
flightData[4, 1] <- "AM West"
flightData
##   Airline Arrival Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA On Time         497     221       212           503    1841
## 2  ALASKA Delayed          62      12        20           102     305
## 4 AM West On Time         694    4840       383           320     201
## 5 AM West Delayed         117     415        65           129      61
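The cleanup above relies on hard-coded row and cell positions. As a minimal sketch of a position-free alternative, the same steps can be written with dplyr and tidyr. The `raw` data frame is a hypothetical hand-typed stand-in for the first two city columns of the CSV, and it assumes that blank text cells are read as empty strings.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for the raw CSV (two city columns only).
# Assumption: blank text cells arrive as "" and blank numeric cells as NA.
raw <- data.frame(
  X = c("ALASKA", "", "", "AM West", ""),
  X.1 = c("On Time", "Delayed", "", "On Time", "Delayed"),
  Los.Angeles = c(497, 62, NA, 694, 117),
  Phoenix = c(221, 12, NA, 4840, 415),
  stringsAsFactors = FALSE
)

cleaned <- raw %>%
  filter(!is.na(Los.Angeles)) %>%           # drop the blank separator row
  rename(Airline = X, Arrival = X.1) %>%    # name the unnamed columns
  mutate(Airline = na_if(Airline, "")) %>%  # turn "" into NA so fill() can see it
  fill(Airline)                             # carry the last airline name downward

cleaned
```

Because nothing here refers to row numbers, the same pipeline should keep working if more airlines are appended to the file in the same layout.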
# Use gather() to take multiple columns and gather them into key-value pairs. It makes “wide” data longer.
tidy <- gather(flightData, "City", "Count", 3:7)
head(tidy)
##   Airline Arrival        City Count
## 1  ALASKA On Time Los.Angeles   497
## 2  ALASKA Delayed Los.Angeles    62
## 3 AM West On Time Los.Angeles   694
## 4 AM West Delayed Los.Angeles   117
## 5  ALASKA On Time     Phoenix   221
## 6  ALASKA Delayed     Phoenix    12
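In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). The following is a small sketch of the same wide-to-long reshape, using a hypothetical `wide` data frame that holds only two of the five city columns.

```r
library(tidyr)

# Hypothetical two-city subset of the cleaned flight data
wide <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM West", "AM West"),
  Arrival = c("On Time", "Delayed", "On Time", "Delayed"),
  Los.Angeles = c(497, 62, 694, 117),
  Phoenix = c(221, 12, 4840, 415),
  stringsAsFactors = FALSE
)

# Stack the city columns into a City/Count pair of columns
long <- pivot_longer(
  wide,
  cols = Los.Angeles:Phoenix,
  names_to = "City",
  values_to = "Count"
)

long
```

The result holds the same eight Airline/Arrival/City/Count combinations that gather() would produce, but pivot_longer() orders them row by row rather than column by column.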
# Use spread() to spread a key-value pair across multiple columns. It makes “long” data wider.
tidy <- spread(tidy, Arrival, Count)
tidy
##    Airline          City Delayed On Time
## 1   ALASKA   Los.Angeles      62     497
## 2   ALASKA       Phoenix      12     221
## 3   ALASKA     San.Diego      20     212
## 4   ALASKA San.Francisco     102     503
## 5   ALASKA       Seattle     305    1841
## 6  AM West   Los.Angeles     117     694
## 7  AM West       Phoenix     415    4840
## 8  AM West     San.Diego      65     383
## 9  AM West San.Francisco     129     320
## 10 AM West       Seattle      61     201
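The counterpart of spread() in tidyr 1.0.0 and later is pivot_wider(). This sketch applies it to a hypothetical `long` data frame containing only the Los.Angeles rows.

```r
library(tidyr)

# Hypothetical single-city subset of the long-format data
long <- data.frame(
  Airline = c("ALASKA", "ALASKA", "AM West", "AM West"),
  Arrival = c("On Time", "Delayed", "On Time", "Delayed"),
  City = "Los.Angeles",
  Count = c(497, 62, 694, 117),
  stringsAsFactors = FALSE
)

# One new column per distinct Arrival value, filled from Count
wide <- pivot_wider(long, names_from = Arrival, values_from = Count)

wide
```

The new columns appear in the order the Arrival values are first seen (`On Time`, then `Delayed`), whereas spread() sorts them alphabetically.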
# Use select() to keep a subset of variables, or prefix a name with a minus sign to drop it
suppressMessages(library(dplyr))
head(select(tidy, Airline))
##   Airline
## 1  ALASKA
## 2  ALASKA
## 3  ALASKA
## 4  ALASKA
## 5  ALASKA
## 6 AM West
head(select(tidy, -Airline))
##            City Delayed On Time
## 1   Los.Angeles      62     497
## 2       Phoenix      12     221
## 3     San.Diego      20     212
## 4 San.Francisco     102     503
## 5       Seattle     305    1841
## 6   Los.Angeles     117     694
# Use filter() to keep only the rows that match a condition
filter(tidy, Delayed == 62)
##   Airline        City Delayed On Time
## 1  ALASKA Los.Angeles      62     497
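filter() also accepts several conditions at once, which are combined with a logical AND. A brief sketch, using a hypothetical `delays_by_city` data frame built from the Delayed counts shown earlier:

```r
library(dplyr)

# Hypothetical copy of the Airline, City and Delayed columns
delays_by_city <- data.frame(
  Airline = rep(c("ALASKA", "AM West"), each = 5),
  City = rep(c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), times = 2),
  Delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  stringsAsFactors = FALSE
)

# Keep AM West rows with more than 100 delayed flights
busy <- filter(delays_by_city, Airline == "AM West", Delayed > 100)

busy
```

With these counts, the rows kept are AM West in Los.Angeles, Phoenix and San.Francisco.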
# Use mutate() to add new columns
tidy <- mutate(tidy, Total = Delayed + `On Time`)
head(tidy)
##   Airline          City Delayed On Time Total
## 1  ALASKA   Los.Angeles      62     497   559
## 2  ALASKA       Phoenix      12     221   233
## 3  ALASKA     San.Diego      20     212   232
## 4  ALASKA San.Francisco     102     503   605
## 5  ALASKA       Seattle     305    1841  2146
## 6 AM West   Los.Angeles     117     694   811
## Compare arrival delays for airlines
# AM West has a higher percentage of delayed flights than ALASKA in every one of the five cities
library(ggplot2)
tidy <- mutate(tidy, Total = Delayed + `On Time`, PercentDelayed = Delayed / Total * 100)
tidy <- arrange(tidy, City, PercentDelayed)
ggplot(tidy, aes(x = City, y = PercentDelayed, fill = factor(Airline))) +
  geom_bar(stat = "identity", position = "dodge")
tidy
##    Airline          City Delayed On Time Total PercentDelayed
## 1   ALASKA   Los.Angeles      62     497   559      11.091234
## 2  AM West   Los.Angeles     117     694   811      14.426634
## 3   ALASKA       Phoenix      12     221   233       5.150215
## 4  AM West       Phoenix     415    4840  5255       7.897241
## 5   ALASKA     San.Diego      20     212   232       8.620690
## 6  AM West     San.Diego      65     383   448      14.508929
## 7   ALASKA San.Francisco     102     503   605      16.859504
## 8  AM West San.Francisco     129     320   449      28.730512
## 9   ALASKA       Seattle     305    1841  2146      14.212488
## 10 AM West       Seattle      61     201   262      23.282443
# Use summarise(), which reduces each group to a smaller number of summary statistics
# Averaged across the five cities (an unweighted mean of the per-city percentages), ALASKA is delayed about 11% of the time and AM West about 18%
delays <- tidy %>% group_by(Airline) %>% summarise(MeanPercent = round(mean(PercentDelayed), 0))
delays
## # A tibble: 2 x 2
##   Airline MeanPercent
##   <chr>         <dbl>
## 1 ALASKA           11
## 2 AM West          18
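The mean above gives every city equal weight regardless of how many flights it handles. As a hedged cross-check, the sketch below instead pools all flights per airline before dividing. The `flights` data frame and its `OnTime` column name are hypothetical, with counts copied from the tables above.

```r
library(dplyr)

# Hypothetical copy of the per-city counts (OnTime stands in for `On Time`)
flights <- data.frame(
  Airline = rep(c("ALASKA", "AM West"), each = 5),
  City = rep(c("Los.Angeles", "Phoenix", "San.Diego", "San.Francisco", "Seattle"), times = 2),
  Delayed = c(62, 12, 20, 102, 305, 117, 415, 65, 129, 61),
  OnTime = c(497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201),
  stringsAsFactors = FALSE
)

# Pool flights per airline, then compute one delay percentage per airline
overall <- flights %>%
  group_by(Airline) %>%
  summarise(
    TotalDelayed = sum(Delayed),
    TotalFlights = sum(Delayed + OnTime),
    OverallPercent = round(TotalDelayed / TotalFlights * 100, 1)
  )

overall
```

On these counts the pooled rate is about 13.3% for ALASKA (501 of 3775 flights) and about 10.9% for AM West (787 of 7225). The ranking flips relative to the per-city comparison because most AM West flights (5255 of 7225) land in Phoenix, where delays are lowest, so both views are worth reporting side by side.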