The chart above describes arrival delays for two airlines across five destinations. Your task is to:
Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above. You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository and the URL for your rpubs.com web page.
library(knitr)
library(stringr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
# Upload CSV file
airlines <- read.csv("https://raw.githubusercontent.com/sagreen131/DATA-607-HW-5/master/numbersense.csv", header=FALSE, sep=",", stringsAsFactors=FALSE)
airlines
## V1 V2 V3 V4 V5 V6 V7
## 1 Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
As we can see here, we need to do some data cleaning before doing our analysis. The CSV file is missing two column names.
# Rename header columns
airlines$V1[1] <- "Airlines"
airlines$V2[1] <- "Flight Status"
airlines
## V1 V2 V3 V4 V5 V6 V7
## 1 Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
# Renaming row header
names(airlines) <- airlines[1,]
airlines
## Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 1 Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 4
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
Let’s delete Row 1 and 4
airlines <- airlines[-c(1,4),]
airlines
## Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 delayed 62 12 20 102 305
## 5 AM WEST on time 694 4840 383 320 201
## 6 delayed 117 415 65 129 61
After doing this, I noticed empty spaces for ALASKA and AM WEST. I will proceed to add in data in given data set.
airlines$Airlines[2] = 'ALASKA'
airlines$Airlines[4] = 'AM WEST'
airlines
## Airlines Flight Status Los Angeles Phoenix San Diego San Francisco Seattle
## 2 ALASKA on time 497 221 212 503 1841
## 3 ALASKA delayed 62 12 20 102 305
## 5 AM WEST on time 694 4840 383 320 201
## 6 AM WEST delayed 117 415 65 129 61
I will use the gather function to gather columns into rows. Doing this creates long data instead of wide data here.
airlines <-gather(airlines, city, number, 3:7, factor_key = TRUE)
airlines
## Airlines Flight Status city number
## 1 ALASKA on time Los Angeles 497
## 2 ALASKA delayed Los Angeles 62
## 3 AM WEST on time Los Angeles 694
## 4 AM WEST delayed Los Angeles 117
## 5 ALASKA on time Phoenix 221
## 6 ALASKA delayed Phoenix 12
## 7 AM WEST on time Phoenix 4840
## 8 AM WEST delayed Phoenix 415
## 9 ALASKA on time San Diego 212
## 10 ALASKA delayed San Diego 20
## 11 AM WEST on time San Diego 383
## 12 AM WEST delayed San Diego 65
## 13 ALASKA on time San Francisco 503
## 14 ALASKA delayed San Francisco 102
## 15 AM WEST on time San Francisco 320
## 16 AM WEST delayed San Francisco 129
## 17 ALASKA on time Seattle 1841
## 18 ALASKA delayed Seattle 305
## 19 AM WEST on time Seattle 201
## 20 AM WEST delayed Seattle 61
# I transposed columns 3 to 7 (Los Angeles, Phoenix, San Diego, San Francisco, and Seattle) here and displayed their listed numbers
I will do the same using the spread function, but instead display the Flight Status.
airlines <- airlines %>%
spread("Flight Status", `number`)
airlines
## Airlines city delayed on time
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AM WEST Los Angeles 117 694
## 7 AM WEST Phoenix 415 4840
## 8 AM WEST San Diego 65 383
## 9 AM WEST San Francisco 129 320
## 10 AM WEST Seattle 61 201
Although the delayed and on time column displays numeric value, I noticed that both columns were listed in chr.
# Convert from chr to numeric
airlines$delayed <- as.numeric(as.character(airlines$delayed))
airlines$`on time`<- as.numeric(as.character(airlines$'on time'))
Using the dplyr package, I will add a total variable to add the total number of delayed and on time flights, in addition to the delayed percentage.
airlines <- airlines %>%
mutate(total = delayed + `on time`, delayed.percentage = (delayed/total)*100)
airlines
## Airlines city delayed on time total delayed.percentage
## 1 ALASKA Los Angeles 62 497 559 11.091234
## 2 ALASKA Phoenix 12 221 233 5.150215
## 3 ALASKA San Diego 20 212 232 8.620690
## 4 ALASKA San Francisco 102 503 605 16.859504
## 5 ALASKA Seattle 305 1841 2146 14.212488
## 6 AM WEST Los Angeles 117 694 811 14.426634
## 7 AM WEST Phoenix 415 4840 5255 7.897241
## 8 AM WEST San Diego 65 383 448 14.508929
## 9 AM WEST San Francisco 129 320 449 28.730512
## 10 AM WEST Seattle 61 201 262 23.282443
ggplot(airlines, aes(x = Airlines, y = delayed.percentage, fill = city)) +
geom_bar(stat="identity", position="dodge") +
xlab("Airlines") + ylab("Delays")
After analyzing the data, we can see here that flights from San Francisco from both airlines, experience the most delays. Phoenix, on the other hand, has the least number of delayed flights from both airlines.
Github link here
Rpubs link here