The chart above describes arrival delays for two airlines across five destinations. Your task is to:
You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.
Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.
Perform analysis to compare the arrival delays for the two airlines.
Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:
The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
flights <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA607W5/master/flights.csv",
sep = ",", header = TRUE)
flights
## ï..airline status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1 ALASKA on time 497 221 212 503 1841
## 2 ALASKA delayed 62 12 20 102 305
## 3 AMWEST on time 694 4840 383 320 201
## 4 AMWEST delayed 117 415 65 129 61
Data Cleaning and going Tidy.
At this point the initial data is tidy. Each variable (airline, city, delayed, ontime) is in its own column and each observation is in its own row.
# Tidy up data, so that every variable is in a column and every observation is in a row
flights <- rename(flights, airline = ï..airline)
flights <- flights %>%
gather(city, number, 3:length(flights)) %>% # Transfer 'city' variable into a column
spread(status, number) # Split 'delayed'/'on time' variables into 2 columns
# Adjust city names (get rid of a period)
flights$city <- str_replace_all(flights$city, "\\.", " ")
# Remove space in the 'on time' variable name
flights <- rename(flights, ontime = `on time`)
flights
## airline city delayed ontime
## 1 ALASKA Los Angeles 62 497
## 2 ALASKA Phoenix 12 221
## 3 ALASKA San Diego 20 212
## 4 ALASKA San Francisco 102 503
## 5 ALASKA Seattle 305 1841
## 6 AMWEST Los Angeles 117 694
## 7 AMWEST Phoenix 415 4840
## 8 AMWEST San Diego 65 383
## 9 AMWEST San Francisco 129 320
## 10 AMWEST Seattle 61 201
DATA ANALYSIS
Getting the ratio of delayed flights for all airlines and cities. We’ll sort the data by delay starting with the worst airline/city.
flights <- flights %>%
mutate(flights = delayed + ontime, delayed_ratio = delayed / flights) %>%
arrange(desc(delayed_ratio))
flights
## airline city delayed ontime flights delayed_ratio
## 1 AMWEST San Francisco 129 320 449 0.28730512
## 2 AMWEST Seattle 61 201 262 0.23282443
## 3 ALASKA San Francisco 102 503 605 0.16859504
## 4 AMWEST San Diego 65 383 448 0.14508929
## 5 AMWEST Los Angeles 117 694 811 0.14426634
## 6 ALASKA Seattle 305 1841 2146 0.14212488
## 7 ALASKA Los Angeles 62 497 559 0.11091234
## 8 ALASKA San Diego 20 212 232 0.08620690
## 9 AMWEST Phoenix 415 4840 5255 0.07897241
## 10 ALASKA Phoenix 12 221 233 0.05150215
Let us look compare how delays compare against each other using various criteria.
# Average delay across all cities and airlines
flights %>% summarise(mean(delayed_ratio))
## mean(delayed_ratio)
## 1 0.1447799
delay_city <- flights %>%
group_by(city) %>%
summarise(avg_delay = mean(delayed_ratio)) %>%
arrange(desc(avg_delay))
delay_city
## # A tibble: 5 × 2
## city avg_delay
## <chr> <dbl>
## 1 San Francisco 0.22795008
## 2 Seattle 0.18747466
## 3 Los Angeles 0.12758934
## 4 San Diego 0.11564809
## 5 Phoenix 0.06523728
delay_airline <- flights %>%
group_by(airline) %>%
summarise(avg_delay = mean(delayed_ratio)) %>%
arrange(desc(avg_delay))
delay_airline
## # A tibble: 2 × 2
## airline avg_delay
## <fctr> <dbl>
## 1 AMWEST 0.1776915
## 2 ALASKA 0.1118683
Map delays per city and per airline.
library(ggplot2)
ggplot(flights, aes(x = city, y = delayed_ratio, group = airline, color = airline, shape = airline)) +
geom_point(aes(shape = airline)) +
geom_line() +
labs(x = "City", y = "Delay") +
theme(legend.title=element_blank())
Is important to note how the trends of delay matches the airport , knowing more data will help us determine why Alaska has bigger delay periods and then will be able to make corrective actions.