NYC Flights: Exploring and Analyzing Departure Delays by Airline
Install the nycflights13, tidyverse, and dplyr packages.
# install.packages("nycflights13")
# install.packages("tidyverse")
# install.packages("dplyr")
# load packages
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(nycflights13)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.4 v purrr 0.3.4
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Explore the data from the nycflights13 package.
str(flights)
## tibble [336,776 x 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
data("flights")
#view(flights)
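Filter flights by month and day. Note that the condition used below, flights$month & flights$day, is TRUE for every row because all month and day values are nonzero, so it returns all 336,776 rows rather than a subset. Selecting a specific date requires an explicit comparison such as month == 1 & day == 1.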
flights[flights$month & flights$day,]
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
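The indexing expression above keeps every row, because `&` coerces any nonzero number to TRUE. The following is a minimal sketch of the difference between that expression and an explicit comparison. It uses a small made-up data frame, `toy_flights`, which is hypothetical and only borrows the column names of `flights`.

```r
library(dplyr)

# Hypothetical toy rows standing in for `flights` (same column names, invented values).
toy_flights <- data.frame(
  month     = c(1, 1, 2, 12),
  day       = c(1, 2, 1, 25),
  dep_delay = c(2, -4, 15, NA)
)

# `month & day` treats every nonzero value as TRUE, so no row is dropped.
all_rows <- toy_flights[toy_flights$month & toy_flights$day, ]

# An explicit comparison keeps only the rows for 1 January.
jan_first <- toy_flights %>%
  filter(month == 1, day == 1)

nrow(all_rows)
nrow(jan_first)
```

On the real data, the same pattern would be `filter(flights, month == 1, day == 1)`.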
str(airlines)
## tibble [16 x 2] (S3: tbl_df/tbl/data.frame)
## $ carrier: chr [1:16] "9E" "AA" "AS" "B6" ...
## $ name : chr [1:16] "Endeavor Air Inc." "American Airlines Inc." "Alaska Airlines Inc." "JetBlue Airways" ...
data("airlines")
view(airlines)
Create a new variable using the mutate() function.
# Create a speed variable (miles per hour) with mutate() from dplyr;
# distance is in miles and air_time is in minutes.
flights %>%
  select(distance, air_time) %>%
  mutate(speed = distance / air_time * 60)
## # A tibble: 336,776 x 3
## distance air_time speed
## <dbl> <dbl> <dbl>
## 1 1400 227 370.
## 2 1416 227 374.
## 3 1089 160 408.
## 4 1576 183 517.
## 5 762 116 394.
## 6 719 150 288.
## 7 1065 158 404.
## 8 229 53 259.
## 9 944 140 405.
## 10 733 138 319.
## # ... with 336,766 more rows
# Store the new "speed" variable in flights.
flights <- flights %>% mutate(speed = distance / air_time * 60)
Create a casestudy object for later use.
# Keep flights with non-negative arrival and departure delays, then
# summarise them per carrier and day.
casestudy <- flights %>%
  select(carrier, month, day, arr_delay, dep_delay) %>%
  filter(arr_delay >= 0, dep_delay >= 0) %>%
  group_by(carrier, month, day) %>%
  # avg_delay is the sum of the mean arrival and mean departure delays (minutes)
  summarise(avg_delay = mean(arr_delay, na.rm = TRUE) +
              mean(dep_delay, na.rm = TRUE)) %>%
  ungroup() %>%
  arrange(desc(avg_delay))
## `summarise()` has grouped output by 'carrier', 'month'. You can override using the `.groups` argument.
head(casestudy, 9)
## # A tibble: 9 x 4
## carrier month day avg_delay
## <chr> <int> <int> <dbl>
## 1 HA 1 9 2573
## 2 F9 2 10 1687
## 3 F9 3 8 840
## 4 YV 10 22 768
## 5 FL 7 10 754
## 6 VX 11 24 631
## 7 F9 5 23 612.
## 8 FL 9 2 587.
## 9 F9 2 24 560
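Each row of casestudy describes one carrier on one day, so a large avg_delay may rest on very few flights. The sketch below repeats the same summary on a small hypothetical data frame (`toy`, with invented values) and adds a flight count per group. It also passes `.groups = "drop"`, which returns an ungrouped result and avoids the grouping message shown above. The column name `n_flights` is an assumption introduced for illustration.

```r
library(dplyr)

# Hypothetical toy data with the same column names as `flights` (invented values).
toy <- data.frame(
  carrier   = c("AA", "AA", "HA", "UA"),
  month     = c(1, 1, 1, 1),
  day       = c(9, 9, 9, 9),
  arr_delay = c(10, 30, 1000, -5),
  dep_delay = c(20, 40, 900, 3)
)

toy_summary <- toy %>%
  # Same filter as casestudy: drop flights that arrived or departed early.
  filter(arr_delay >= 0, dep_delay >= 0) %>%
  group_by(carrier, month, day) %>%
  summarise(
    n_flights = n(),                                # flights behind each value
    avg_delay = mean(arr_delay) + mean(dep_delay),  # sum of the two means, as above
    .groups = "drop"                                # return an ungrouped tibble
  ) %>%
  arrange(desc(avg_delay))

toy_summary
```

In this toy example the top row comes from a single flight, which is the kind of case the extra count column makes visible.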
Create the 'date' variable by evaluating ISOdate() inside with(), so that the columns of casestudy can be referenced directly.
casestudy$date <- with(casestudy, ISOdate(year = 2013, month, day))
# plot with legend
g <- ggplot(casestudy, aes(x = date, y = avg_delay)) +
  ggtitle("Delayed Airlines by Date")
g +
  geom_point(aes(color = carrier)) +
  xlab("Date (month)") +
  ylab("Average Delay (mins)")

I wanted to build a dataset to analyze which airlines were the most delayed when flying out of NYC in 2013. I also wanted to see which month was the worst for flying out.
The data provided cover all flights that departed New York City airports in 2013.
Is there a particular airline with an especially high delay that should be avoided? Yes: the largest value in casestudy belongs to Hawaiian Airlines Inc. (HA) on January 9, 2013, with a combined average delay of 2,573 minutes.
I am not sure what caused this airline's delay. It could be linked to weather conditions in Hawaii in January 2013.
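The questions above concern whole airlines and whole months, whereas casestudy ranks individual carrier-days. One possible way to address them more directly is to average the departure delay per carrier and per month. The sketch below does this on hypothetical data: `toy` and `toy_airlines` are invented stand-ins for `flights` and `airlines`, and `mean_dep_delay` and `n_flights` are names introduced for illustration.

```r
library(dplyr)

# Hypothetical stand-ins for `flights` and `airlines` (invented values).
toy <- data.frame(
  carrier   = c("AA", "AA", "HA", "HA", "UA", "UA"),
  month     = c(1, 7, 1, 7, 1, 7),
  dep_delay = c(10, 30, 5, NA, 20, 60)
)
toy_airlines <- data.frame(
  carrier = c("AA", "HA", "UA"),
  name    = c("American Airlines Inc.", "Hawaiian Airlines Inc.",
              "United Air Lines Inc.")
)

# Mean departure delay per carrier, with the full airline name attached.
by_carrier <- toy %>%
  group_by(carrier) %>%
  summarise(
    mean_dep_delay = mean(dep_delay, na.rm = TRUE),  # ignore missing delays
    n_flights      = n(),
    .groups = "drop"
  ) %>%
  left_join(toy_airlines, by = "carrier") %>%
  arrange(desc(mean_dep_delay))

# Mean departure delay per month, worst month first.
by_month <- toy %>%
  group_by(month) %>%
  summarise(mean_dep_delay = mean(dep_delay, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(mean_dep_delay))

by_carrier
by_month
```

Replacing `toy` with `flights` and `toy_airlines` with `airlines` should apply the same steps to the real data. The resulting ranking may differ from the single-day ranking in casestudy.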