NYC Flights13 Homework

Exploration and analysis of departure and arrival delays by airline, using the nycflights13 data.

Install the nycflights13, tidyverse, and dplyr packages.

# install.packages("nycflights13")
# install.packages("tidyverse")
# install.packages("dplyr")

# load packages
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(nycflights13)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.4     v purrr   0.3.4
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   2.0.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# explore data from NYC flights13 package.
str(flights)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
data("flights")
#view(flights)

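Subset flights by month and day. Note that the index below does not actually filter anything: & treats every nonzero month and day as TRUE, so all 336,776 rows are returned. To select a specific date, each column has to be compared against a value, for example flights[flights$month == 1 & flights$day == 1, ].

# this index is TRUE for every row, so the full table comes back
flights[flights$month & flights$day,]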
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
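
To keep only the rows for one particular day, the conditions need to compare each column against a value. The following is a minimal sketch on a small made-up data frame (toy_flights and its values are hypothetical, chosen so the result is easy to check by eye); the same filter() call should apply to flights unchanged.

```r
library(dplyr)

# hypothetical toy data standing in for flights
toy_flights <- data.frame(
  month     = c(1, 1, 2, 12),
  day       = c(1, 2, 1, 25),
  dep_delay = c(2, -1, 15, 40)
)

# month & day is TRUE for every row, because nonzero numbers count as TRUE
all_rows <- toy_flights[toy_flights$month & toy_flights$day, ]

# comparing against values keeps only January 1
jan1 <- toy_flights %>% filter(month == 1, day == 1)

nrow(all_rows)  # 4
nrow(jan1)      # 1
```

Conditions separated by commas inside filter() are combined with a logical AND, so the call reads as "month is 1 and day is 1".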
str(airlines)
## tibble [16 x 2] (S3: tbl_df/tbl/data.frame)
##  $ carrier: chr [1:16] "9E" "AA" "AS" "B6" ...
##  $ name   : chr [1:16] "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...
data("airlines")
#view(airlines)

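Create a new variable using the mutate() function. Because distance is in miles and air_time is in minutes, multiplying by 60 gives speed in miles per hour.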

# create a speed variable with mutate in dplyr.
flights %>%
  select(distance,air_time) %>%
  mutate(speed = distance/air_time*60)
## # A tibble: 336,776 x 3
##    distance air_time speed
##       <dbl>    <dbl> <dbl>
##  1     1400      227  370.
##  2     1416      227  374.
##  3     1089      160  408.
##  4     1576      183  517.
##  5      762      116  394.
##  6      719      150  288.
##  7     1065      158  404.
##  8      229       53  259.
##  9      944      140  405.
## 10      733      138  319.
## # ... with 336,766 more rows
# store the new "speed" variable in flights.

flights <- flights %>% mutate(speed=distance/air_time*60)
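
As a quick check of the speed formula, the arithmetic for the first row of the output above (1400 miles in 227 minutes) can be worked by hand. This is only a sketch; the intermediate names miles_per_minute and speed_mph are introduced here for illustration.

```r
# first row of flights: 1400 miles flown in 227 minutes
distance <- 1400
air_time <- 227

# / and * have equal precedence and evaluate left to right,
# so distance/air_time*60 is (distance/air_time)*60
miles_per_minute <- distance / air_time
speed_mph <- miles_per_minute * 60

round(speed_mph)  # 370

# equivalent form: divide by the air time expressed in hours
all.equal(speed_mph, distance / (air_time / 60))  # TRUE
```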

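Create a casestudy object for later use. It keeps only flights that were not early on either departure or arrival, and sums the mean arrival delay and mean departure delay for each carrier and day.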

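# keep carrier, month, day and both delay columns; restrict to flights
# with non-negative departure and arrival delays
casestudy <- flights %>%
  select(carrier, month, day, arr_delay, dep_delay) %>%
  filter(arr_delay >= 0, dep_delay >= 0) %>%
  group_by(carrier, month, day) %>%
  # avg_delay is the mean arrival delay plus the mean departure delay
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE) +
              mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(avg_delay))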
## `summarise()` has grouped output by 'carrier', 'month'. You can override using the `.groups` argument.
head(casestudy, 9)
## # A tibble: 9 x 4
##   carrier month   day avg_delay
##   <chr>   <int> <int>     <dbl>
## 1 HA          1     9     2573 
## 2 F9          2    10     1687 
## 3 F9          3     8      840 
## 4 YV         10    22      768 
## 5 FL          7    10      754 
## 6 VX         11    24      631 
## 7 F9          5    23      612.
## 8 FL          9     2      587.
## 9 F9          2    24      560
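
The carrier column holds two-letter codes, while the airlines table maps each code to a full name. A minimal sketch of attaching the names with left_join() follows. The two data frames are small stand-ins (top_delays copies the first two rows of the output above, and carrier_names is a hypothetical two-row excerpt of airlines); on the real objects the equivalent call would presumably be left_join(casestudy, airlines, by = "carrier").

```r
library(dplyr)

# first two rows of the casestudy output above
top_delays <- data.frame(
  carrier   = c("HA", "F9"),
  month     = c(1, 2),
  day       = c(9, 10),
  avg_delay = c(2573, 1687)
)

# hypothetical two-row excerpt standing in for the airlines table
carrier_names <- data.frame(
  carrier = c("HA", "F9"),
  name    = c("Hawaiian Airlines Inc.", "Frontier Airlines Inc.")
)

# keep every row of top_delays and add the matching airline name
named <- top_delays %>% left_join(carrier_names, by = "carrier")

named$name[1]  # "Hawaiian Airlines Inc."
```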

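Create the ‘date’ variable by evaluating ISOdate() inside ‘with’, which lets the columns month and day be referred to by name. Then plot the average delay by date, colored by carrier.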

casestudy$date <- with(casestudy, ISOdate(year = 2013, month, day))

# plot with legend
g <- ggplot(casestudy, aes(x = date, y = avg_delay)) +
  ggtitle("Delayed Airlines by Date")
g +
  geom_point(aes(color = carrier)) +
  xlab("Date (month)") +
  ylab("Average Delay (mins)")
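
To make the date construction above more concrete, here is a small sketch of what with() and ISOdate() return. The data frame toy is a hypothetical two-row stand-in for casestudy. ISOdate() is a base R function that defaults to hour 12 in the GMT time zone, so the result is a date-time rather than a plain date.

```r
# hypothetical two-row stand-in for casestudy
toy <- data.frame(month = c(1, 2), day = c(9, 10))

# inside with(), month and day refer to the columns of toy
toy$date <- with(toy, ISOdate(year = 2013, month, day))

class(toy$date)[1]                  # "POSIXct"
format(toy$date, "%Y-%m-%d %H:%M")  # "2013-01-09 12:00" "2013-02-10 12:00"

# as.Date() drops the time of day if only the calendar date is needed
as.Date(toy$date)
```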

I was interested in analysing which airlines were the most delayed when flying out of NYC in 2013. I also wanted to see which month was the worst to fly out.

The data provided cover all flights that departed New York City airports in 2013.

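Is there a particular airline with an especially high delay that should be avoided? Yes: the largest value in the data belongs to Hawaiian Airlines Inc. (HA) on January 9, 2013, with a combined average arrival and departure delay of 2,573 minutes. Note that this is the worst single carrier and day, not a monthly average.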

I am not sure what caused this delay. It could be linked to weather conditions in Hawaii in January 2013.
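
The question of the worst month to fly out is not computed above. One way it could be approached is to average the departure delay per month and sort the result. The sketch below uses a small made-up data frame (toy_flights and its values are hypothetical) so the answer can be checked by hand; the same pipeline should carry over to flights.

```r
library(dplyr)

# hypothetical toy data: month and departure delay in minutes
toy_flights <- data.frame(
  month     = c(1, 1, 6, 6, 7),
  dep_delay = c(10, NA, 30, 50, 20)
)

monthly <- toy_flights %>%
  group_by(month) %>%
  # na.rm = TRUE skips flights with a missing delay (e.g. cancelled flights)
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_dep_delay))

monthly$month[1]          # 6
monthly$avg_dep_delay[1]  # 40
```

Unlike casestudy, this sketch does not drop early departures first, so negative delays pull the monthly average down. Whether to filter them out depends on the question being asked.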