Assignment - Tidying and Transforming Data

The chart above describes arrival delays for two airlines across five destinations. Your task is to:

  1. Create a .CSV file (or optionally, a MySQL database!) that includes all of the information above.

You’re encouraged to use a “wide” structure similar to how the information appears above, so that you can practice tidying and transformations as described below.

  1. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data.

  2. Perform analysis to compare the arrival delays for the two airlines.

  3. Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions. Please include in your homework submission:

The URL to the .Rmd file in your GitHub repository. and The URL for your rpubs.com web page.

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)

flights <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA607W5/master/flights.csv", 
                    sep = ",", header = TRUE)
flights
##   ï..airline  status Los.Angeles Phoenix San.Diego San.Francisco Seattle
## 1     ALASKA on time         497     221       212           503    1841
## 2     ALASKA delayed          62      12        20           102     305
## 3     AMWEST on time         694    4840       383           320     201
## 4     AMWEST delayed         117     415        65           129      61

Data Cleaning and going Tidy.

At this point the initial data is tidy. Each variable (airline, city, delayed, ontime) is in its own column and each observation is in its own row.

# Tidy up data, so that every variable is in a column and every observation is in a row

flights <- rename(flights, airline = ï..airline)
flights <- flights %>% 
  gather(city, number, 3:length(flights)) %>%   # Transfer 'city' variable into a column
  spread(status, number)                         # Split 'delayed'/'on time' variables into 2 columns

# Adjust city names (get rid of a period)
flights$city <- str_replace_all(flights$city, "\\.", " ")

# Remove space in the 'on time' variable name
flights <- rename(flights, ontime = `on time`)
flights
##    airline          city delayed ontime
## 1   ALASKA   Los Angeles      62    497
## 2   ALASKA       Phoenix      12    221
## 3   ALASKA     San Diego      20    212
## 4   ALASKA San Francisco     102    503
## 5   ALASKA       Seattle     305   1841
## 6   AMWEST   Los Angeles     117    694
## 7   AMWEST       Phoenix     415   4840
## 8   AMWEST     San Diego      65    383
## 9   AMWEST San Francisco     129    320
## 10  AMWEST       Seattle      61    201

DATA ANALYSIS

Getting the ratio of delayed flights for all airlines and cities. We’ll sort the data by delay starting with the worst airline/city.

flights <- flights %>% 
  mutate(flights = delayed + ontime, delayed_ratio = delayed / flights) %>% 
  arrange(desc(delayed_ratio))
flights
##    airline          city delayed ontime flights delayed_ratio
## 1   AMWEST San Francisco     129    320     449    0.28730512
## 2   AMWEST       Seattle      61    201     262    0.23282443
## 3   ALASKA San Francisco     102    503     605    0.16859504
## 4   AMWEST     San Diego      65    383     448    0.14508929
## 5   AMWEST   Los Angeles     117    694     811    0.14426634
## 6   ALASKA       Seattle     305   1841    2146    0.14212488
## 7   ALASKA   Los Angeles      62    497     559    0.11091234
## 8   ALASKA     San Diego      20    212     232    0.08620690
## 9   AMWEST       Phoenix     415   4840    5255    0.07897241
## 10  ALASKA       Phoenix      12    221     233    0.05150215

Let us look compare how delays compare against each other using various criteria.

# Average delay across all cities and airlines
flights %>% summarise(mean(delayed_ratio))
##   mean(delayed_ratio)
## 1           0.1447799
delay_city <- flights %>% 
  group_by(city) %>% 
  summarise(avg_delay = mean(delayed_ratio)) %>% 
  arrange(desc(avg_delay))
delay_city
## # A tibble: 5 × 2
##            city  avg_delay
##           <chr>      <dbl>
## 1 San Francisco 0.22795008
## 2       Seattle 0.18747466
## 3   Los Angeles 0.12758934
## 4     San Diego 0.11564809
## 5       Phoenix 0.06523728
delay_airline <- flights %>% 
  group_by(airline) %>% 
  summarise(avg_delay = mean(delayed_ratio)) %>% 
  arrange(desc(avg_delay))
delay_airline
## # A tibble: 2 × 2
##   airline avg_delay
##    <fctr>     <dbl>
## 1  AMWEST 0.1776915
## 2  ALASKA 0.1118683

Map delays per city and per airline.

library(ggplot2)

ggplot(flights, aes(x = city, y = delayed_ratio, group = airline, color = airline, shape = airline)) + 
  geom_point(aes(shape = airline)) + 
  geom_line() + 
  labs(x = "City", y = "Delay") + 
  theme(legend.title=element_blank())

Is important to note how the trends of delay matches the airport , knowing more data will help us determine why Alaska has bigger delay periods and then will be able to make corrective actions.