DATA 607 Assignment – Tidying and Transforming Data

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

The chart above describes arrival delays for two airlines across five destinations. Your task is to: (1) Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below. (2) Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. Read the data.

dt <- read.csv("Numbersense.csv", stringsAsFactors = FALSE, na.strings = "")
head(dt)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2    <NA> delayed          62      12        20           102     305
## 3    <NA>    <NA>          NA      NA        NA            NA      NA
## 4 AM WEST on time         694    4840       383           320     201
## 5    <NA> delayed         117     415        65           129      61

Fill the airline columns

dt <- dt %>%
    fill( X)

Remove empty rows

dt <- na.omit(dt)
head(dt)

##         X     X.1 Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1  ALASKA on time         497     221       212           503    1841
## 2  ALASKA delayed          62      12        20           102     305
## 4 AM WEST on time         694    4840       383           320     201
## 5 AM WEST delayed         117     415        65           129      61

Correct airlines column names

colnames(dt)[1:2] <- c("Airlines","Status")
head(dt)

##   Airlines  Status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1   ALASKA on time         497     221       212           503    1841
## 2   ALASKA delayed          62      12        20           102     305
## 4  AM WEST on time         694    4840       383           320     201
## 5  AM WEST delayed         117     415        65           129      61

Transform wide to long format

dt <- dt %>%
    gather(Destination, count, 3:7)

Seperate status column

dt <- dt %>%
    spread(Status, count)

Perform analysis to compare the arrival delays for the two airlines.

glimpse(dt)

## Observations: 10
## Variables: 4
## $ Airlines    <chr> "ALASKA", "ALASKA", "ALASKA", "ALASKA", "ALASKA", ...
## $ Destination <chr> "Los.Angeles", "Phoenix", "San.Diego", "San.Franci...
## $ delayed     <int> 62, 12, 20, 102, 305, 117, 415, 65, 129, 61
## $ `on time`   <int> 497, 221, 212, 503, 1841, 694, 4840, 383, 320, 201

Airlines, Destination are string variable, delayed and on time are integer.

dt %>%
    group_by(Airlines)%>%
    summarise(mean = mean(delayed))

## # A tibble: 2 x 2
##   Airlines  mean
##   <chr>    <dbl>
## 1 ALASKA    100.
## 2 AM WEST   157.

Mean delay of Alaska is 100 and AM WEST is 157. ALASKA airlines has less delay. Add total column

dt$total <- dt$delayed+dt$`on time`
dt$percent_del <- (dt$delayed/dt$total)*100

ggplot(data = dt, aes(Destination, percent_del, fill = Airlines))+geom_bar(stat = "identity", position = "dodge")+theme_classic()+ggtitle("Percent Delayed")+ylab("Percent")+theme(plot.title = element_text(hjust = 0.5))

San Francisco had the most delayed followed by Seattle where Phoenix had the least delayed. (4) Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:  The URL to the .Rmd file in your Git Hub repository. and  The URL for your rpubs.com web page.

DATA 607 Assignment – Tidying and Transforming Data

Yohannes Deboch

March 3, 2019