library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Read the data.
dt <- read.csv("Numbersense.csv", stringsAsFactors = FALSE, na.strings = "")
head(dt)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 <NA> delayed 62 12 20 102 305
## 3 <NA> <NA> NA NA NA NA NA
## 4 AM WEST on time 694 4840 383 320 201
## 5 <NA> delayed 117 415 65 129 61
Fill the airline columns
dt <- dt %>%
fill( X)
Remove empty rows
dt <- na.omit(dt)
head(dt)
## X X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Correct airlines column names
colnames(dt)[1:2] <- c("Airlines","Status")
head(dt)
## Airlines Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 4 AM WEST on time 694 4840 383 320 201
## 5 AM WEST delayed 117 415 65 129 61
Transform wide to long format
dt <- dt %>%
gather(Destination, count, 3:7)
Seperate status column
dt <- dt %>%
spread(Status, count)
glimpse(dt)
## Observations: 10
## Variables: 4
## $ Airlines <chr> "ALASKA", "ALASKA", "ALASKA", "ALASKA", "ALASKA", ...
## $ Destination <chr> "Los.Angeles", "Phoenix", "San.Diego", "San.Franci...
## $ delayed <int> 62, 12, 20, 102, 305, 117, 415, 65, 129, 61
## $ `on time` <int> 497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201
Airlines, Destination are string variable, delayed and on time are integer.
dt %>%
group_by(Airlines)%>%
summarise(mean = mean(delayed))
## # A tibble: 2 x 2
## Airlines mean
## <chr> <dbl>
## 1 ALASKA 100.
## 2 AM WEST 157.
Mean delay of Alaska is 100 and AM WEST is 157. ALASKA airlines has less delay. Add total column
dt$total <- dt$delayed+dt$`on time`
dt$percent_del <- (dt$delayed/dt$total)*100
ggplot(data = dt, aes(Destination, percent_del, fill = Airlines))+geom_bar(stat = "identity", position = "dodge")+theme_classic()+ggtitle("Percent Delayed")+ylab("Percent")+theme(plot.title = element_text(hjust = 0.5))
San Francisco had the most delayed followed by Seattle where Phoenix had the least delayed. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission: The URL to the .Rmd file in your Git Hub repository. and The URL for your rpubs.com web page.