---
title: "Flight Delays and Cancellations"
author: "Viktor Chhun"
date: "`r Sys.Date()`"
output: openintro::lab_report
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(message=FALSE, warning=FALSE)
```

```{r load-packages, message=FALSE}
library(tidyverse)
library(openintro)
library(nycflights13)
library(DT)
```

## Introduction

This report explores patterns of flight delays and cancellations in the _flights_ dataset from the nycflights13 R package. It investigates dates and times during 2013 when a large percentage of flights were either canceled or delayed by more than one hour. After identifying these problematic periods, the report compares them with historical New York weather records to see whether weather events explain the disruptions.

The _flights_ dataset contains 336,776 flights departing from New York City (JFK, LGA, or EWR) in 2013 and has 19 variables for each flight. The relevant ones are: year, month, day (date of departure); dep_time and arr_time (actual departure and arrival times); sched_dep_time and sched_arr_time (scheduled departure and arrival times); hour (hour of the scheduled departure); dep_delay and arr_delay (departure and arrival delays in minutes, where negative values represent early departures or arrivals); carrier (two-letter airline abbreviation); flight (flight number); origin (departure airport); dest (destination airport); air_time (time spent in the air, in minutes); and distance (distance between airports, in miles).

## Identifying Problematic Dates

This section finds the days in 2013 on which more than 35% of flights were canceled or delayed by more than one hour. A flight with a missing dep_delay is treated as canceled.

```{r}
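# Flag each flight as disrupted: canceled (missing dep_delay) or more than 60 minutes late
flights_delayed_35 <- flights %>%
  mutate(del_can_flights = is.na(dep_delay) | dep_delay > 60) %>%
  group_by(year, month, day) %>%
  # The mean of the logical flag is the daily share of disrupted flights
  summarize(perc_del_can_flights = mean(del_can_flights, na.rm = TRUE), .groups = "keep") %>%
  # Keep only the days on which more than 35% of flights were disrupted
  filter(perc_del_can_flights > 0.35)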

datatable(flights_delayed_35, options = list(scrollX = TRUE))
```
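
To make the disruption flag concrete, the following minimal sketch applies the same logic to a made-up five-flight data frame. The `toy_flights` and `toy_daily` names and all values are illustrative assumptions, not rows from _flights_. It shows that a missing dep_delay is counted as disrupted, because `TRUE | NA` evaluates to `TRUE` in R.

```{r}
library(dplyr)

# Hypothetical toy data: two days, with NA marking a canceled flight
toy_flights <- tibble(
  day       = c(1, 1, 1, 2, 2),
  dep_delay = c(-3, 75, NA, 10, 61)
)

toy_daily <- toy_flights %>%
  # TRUE when the flight never departed (NA) or left more than 60 minutes late
  mutate(del_can_flights = is.na(dep_delay) | dep_delay > 60) %>%
  group_by(day) %>%
  # The mean of a logical vector is the proportion of TRUE values
  summarize(perc_del_can_flights = mean(del_can_flights), .groups = "drop")

toy_daily
```

In this toy data, two of the three flights on day 1 and one of the two flights on day 2 are flagged. Because the flag itself is never missing, the `na.rm = TRUE` argument in the chunk above is harmless but not strictly needed.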

## Weather Context of New York 2013

In February, Winter Storm Nemo brought blizzard conditions to the Northeast and affected flights on the 8th and 9th. Similarly, a snowstorm on March 8 brought about a couple of feet of snow along with very low visibility. The weather in July, September, and December was less extreme: there was only rain and fog on the days when flights were delayed or canceled. Based on these findings, it makes sense that many flights on these dates were either canceled or delayed by more than one hour.

## Flight Disruptions by Time of Day

This section groups flights by their scheduled departure hour and computes the percentage of flights in each hour that were canceled or delayed by more than one hour.

```{r}
# Share of canceled or long-delayed flights for each scheduled departure hour
flights_by_hour <- flights %>%
  mutate(del_can_flights = is.na(dep_delay) | dep_delay > 60) %>%
  group_by(hour) %>%
  summarize(perc_del_can_flights = mean(del_can_flights, na.rm = TRUE))

datatable(flights_by_hour, options = list(scrollX = TRUE))
```

## Visualization of Flight Disruptions by Time of Day

```{r}
ggplot(data = flights_by_hour, aes(x = hour, y = perc_del_can_flights*100)) + 
  geom_line() + 
  geom_point() + 
  labs(
    title = "Percentage of Flight Cancellations/Long Delays by Hour of Day",
    x = "Scheduled Departure Hour",
    y = "Percent Canceled/Long Delayed Flights"
  )
```

At a glance, the plot is dominated by an outlier at 100% of flights being canceled or delayed. This point falls at 1 in the morning: only one flight was scheduled to depart at that hour, and it was canceled, which produces the 100%. Setting the outlier aside, the percentages are concentrated toward the later hours of the day, and later flights (5-9pm) appear more likely to be canceled or delayed by more than one hour than morning flights (5-10am). This pattern makes sense because delays from earlier flights accumulate into later ones, and the weather can change throughout the day, creating conditions unsuitable for on-time departure.
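
Since the 100% point rests on a single flight, it can help to show how many flights stand behind each hourly percentage. The sketch below is one possible way to do that. The helper `disruption_by_hour()`, the `n_flights` column, and the toy data are hypothetical and not part of the original analysis.

```{r}
library(dplyr)

# Hypothetical helper: disruption rate and flight count per scheduled hour
disruption_by_hour <- function(df) {
  df %>%
    mutate(del_can_flights = is.na(dep_delay) | dep_delay > 60) %>%
    group_by(hour) %>%
    summarize(
      n_flights = n(),                                # flights scheduled in this hour
      perc_del_can_flights = mean(del_can_flights),   # share of disrupted flights
      .groups = "drop"
    )
}

# Made-up data: hour 1 has a single canceled flight, like the outlier above
toy_flights <- tibble(
  hour      = c(1, 6, 6, 6, 18, 18),
  dep_delay = c(NA, -2, 0, 5, 90, 12)
)

toy_by_hour <- disruption_by_hour(toy_flights)
toy_by_hour
```

Applied to the real data, `disruption_by_hour(flights)` would add a flight count next to each percentage, so hours with very few flights could be labeled or filtered out before plotting.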

## On-Time Flights on Problematic Dates

```{r}
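on_time_flights <- flights %>%
  group_by(year, month, day) %>%
  # Grouped mutate: attach each day's disruption rate to every flight on that day
  mutate(perc_del_can_flights = mean(is.na(dep_delay) | dep_delay > 60, na.rm = TRUE)) %>%
  # Keep flights on problematic days that departed early or exactly on time
  filter(perc_del_can_flights > 0.35 & !is.na(dep_delay) & dep_delay <= 0)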

datatable(on_time_flights, options = list(scrollX = TRUE))
```

This data set shows only the flights that departed early or on time on the problematic dates found in the first section.
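
The chunk above relies on a grouped `mutate()`: unlike `summarize()`, it keeps one row per flight and attaches the day's disruption rate to each row, so a single `filter()` can combine a day-level condition with a flight-level one. A minimal sketch on made-up data follows; the values and the `toy_` names are assumptions for illustration only.

```{r}
library(dplyr)

# Hypothetical toy data: day 1 is heavily disrupted, day 2 is not
toy_flights <- tibble(
  day       = c(1, 1, 1, 2, 2, 2, 2),
  dep_delay = c(NA, 120, -4, 0, 3, -1, 95)
)

toy_on_time <- toy_flights %>%
  group_by(day) %>%
  # Every row of a day receives that day's share of disrupted flights
  mutate(perc_del_can_flights = mean(is.na(dep_delay) | dep_delay > 60)) %>%
  # Day-level condition first, then the two flight-level conditions
  filter(perc_del_can_flights > 0.35, !is.na(dep_delay), dep_delay <= 0) %>%
  ungroup()

toy_on_time
```

Only the flight that left 4 minutes early on day 1 remains: day 2 has on-time flights too, but its disruption rate of 25% is below the 35% threshold.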

## Average Scheduled Departure Hour of On-Time Flights on Problematic Dates

```{r}
# Average scheduled departure hour of the on-time flights, per problematic date
avg_dep_hour <- on_time_flights %>%
  group_by(year, month, day) %>%
  summarize(avg_sched_hour = mean(hour, na.rm = TRUE), .groups = "keep")

datatable(avg_dep_hour, options = list(scrollX = TRUE))
```

Looking at this table, we can see that the average scheduled departure hour (ASDH) of on-time flights was before noon on every problematic date but one. This supports the guess that flights on problematic dates were more likely to depart early or on time if they were scheduled in the morning. The only date with an ASDH in the afternoon was February 9, 2013, when on-time flights were scheduled to depart at hour 16 on average, mainly because of the weather on that date. As stated above, Winter Storm Nemo carried blizzards up the northeastern United States and into Canada on February 8 and 9. The storm caused later flights on the 8th to be canceled, and morning flights on the 9th were canceled until enough snow had been cleared and the storm had passed during the day.
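
One way to put the ASDH in context is to compare it with the average scheduled hour of all flights on the same date, since a morning average among on-time flights is most informative when the overall schedule is later. The sketch below illustrates that comparison on made-up data; the column names `avg_hour_all` and `avg_hour_on_time` and every value are assumptions, and this comparison is not part of the original analysis.

```{r}
library(dplyr)

# Hypothetical toy data: both days have the same overall schedule
toy_flights <- tibble(
  day       = c(1, 1, 1, 1, 2, 2, 2, 2),
  hour      = c(6, 8, 17, 20, 7, 9, 16, 19),
  dep_delay = c(-2, 0, 95, NA, -5, 70, NA, -1)
)

toy_compare <- toy_flights %>%
  group_by(day) %>%
  summarize(
    # Baseline: average scheduled hour of every flight that day
    avg_hour_all = mean(hour),
    # Average scheduled hour of flights that left early or on time
    avg_hour_on_time = mean(hour[!is.na(dep_delay) & dep_delay <= 0]),
    .groups = "drop"
  )

toy_compare
```

In this toy data, the on-time flights of day 1 are scheduled much earlier than the day's baseline, while those of day 2 are not, which is the kind of contrast the ASDH table alone cannot show.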

## Conclusion

In summary, our findings suggest that major weather events and flight cancellations or long delays in 2013 are related. After researching the weather on the problematic dates, we can see that heavy snowstorms or rain were the main causes of the disrupted departures. We can also conclude that earlier flights carry less risk of delay or cancellation than later flights, likely because there is less time for the weather to worsen and for earlier delays to accumulate.

...