NYC Flights Homework

Load the libraries

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights23)
data(flights)

Clean and Filter the Data

flights1 <- flights |>
  filter(!is.na(dep_delay),
         !is.na(arr_delay),
         !is.na(time_hour),
         time_hour == "2023-01-01 06:00:00")

flights1
# A tibble: 53 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2023     1     1     1051           1100        -9     1411           1424
 2  2023     1     1     1054           1100        -6     1223           1248
 3  2023     1     1     1057           1100        -3     1351           1403
 4  2023     1     1     1057           1100        -3     1539           1553
 5  2023     1     1     1101           1109        -8     1235           1312
 6  2023     1     1     1102           1100         2     1313           1318
 7  2023     1     1     1104           1110        -6     1241           1304
 8  2023     1     1     1105           1110        -5     1402           1432
 9  2023     1     1     1107           1110        -3     1321           1345
10  2023     1     1     1107           1110        -3     1346           1342
# ℹ 43 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
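
A note on the filter: the rows printed above have sched_dep_time values around 1100 rather than 0600. One possible explanation (an assumption, not something the output confirms) is that the character timestamp was parsed in a different time zone from the one time_hour is stored in. The sketch below parses the timestamp in an explicit time zone before comparing. The helper name flights_in_hour and the default zone "America/New_York" are assumptions made for illustration.

```r
library(dplyr)

# Hypothetical helper: keep flights whose time_hour equals a given local hour.
# Assumption: time_hour is a POSIXct column, and `tz` is the zone in which
# the `start` string should be interpreted.
flights_in_hour <- function(df, start, tz = "America/New_York") {
  # Parse the string in an explicit time zone instead of the session default
  target <- as.POSIXct(start, tz = tz)
  df |>
    filter(!is.na(dep_delay),
           !is.na(arr_delay),
           time_hour == target)
}

# Usage with the homework data (not run here):
# flights1 <- flights_in_hour(flights, "2023-01-01 06:00:00")
# range(flights1$sched_dep_time)  # check the scheduled times fall in the intended hour
```

Checking range(sched_dep_time) after filtering is a quick way to confirm that the subset covers the hour named in the plot caption.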

Visualisation: Scatterplot

plot <- flights1 |>
  ggplot(aes(x = dep_delay, y = arr_delay)) +
  # color shows whether each flight arrived late (arr_delay > 0) or not
  geom_point(aes(color = factor(arr_delay > 0)),
             alpha = 0.6,
             na.rm = TRUE) +
  # rename the legend title and labels (FALSE = "Early", TRUE = "Delayed")
  scale_color_discrete(name = "Arrival Delays",
                       labels = c("Early", "Delayed")) +
  # regression line without the confidence band;
  # linewidth replaces size, which is deprecated for lines since ggplot2 3.4.0
  geom_smooth(method = "lm",
              se = FALSE,
              linewidth = 1,
              color = "green") +
  labs(x = "Departure Delay (minutes)",
       y = "Arrival Delay (minutes)",
       color = "Arrival Delays",
       title = "Relationship Between Departure and Arrival Delays",
       caption = "Time: 2023-01-01 06:00-07:00        Source: nycflights23 dataset") +
  # apply theme_minimal() before theme() so the title adjustment is kept
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
plot
`geom_smooth()` using formula = 'y ~ x'

Data Analysis

The scatterplot above shows the relationship between departure and arrival delays for flights between 06:00 and 07:00 on January 1, 2023. The x-axis represents departure delay in minutes, and the y-axis represents arrival delay in minutes. The green regression line shows the overall trend in this subset. Its upward slope indicates a positive relationship between departure delay and arrival delay: in general, flights that depart late tend to arrive late as well.
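
To put numbers on the trend described above, the same linear model that geom_smooth(method = "lm") draws can be fitted directly with lm(). This is a minimal sketch; the helper name delay_fit is hypothetical, and it assumes the data frame has no missing values in dep_delay or arr_delay, as is the case after the cleaning step.

```r
# Hypothetical helper: summarise the linear fit behind the regression line.
# Assumes df has numeric columns dep_delay and arr_delay with no NA values.
delay_fit <- function(df) {
  fit <- lm(arr_delay ~ dep_delay, data = df)  # same formula as the plot: y ~ x
  c(
    intercept   = unname(coef(fit)[1]),  # predicted arrival delay at zero departure delay
    slope       = unname(coef(fit)[2]),  # change in arrival delay per minute of departure delay
    correlation = cor(df$dep_delay, df$arr_delay)
  )
}

# Usage with the homework data (not run here):
# delay_fit(flights1)
```

A positive slope and a correlation well above zero would support the reading of the plot; the values themselves depend on the subset and are not reported here.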

Additionally, most delays fall between about -20 and 20 minutes, indicating that most flights between 06:00 and 07:00 that day had only minor delays or ran slightly early. However, three outliers were delayed by more than an hour. Finally, the pink dots represent flights that arrived on time or early, while the blue dots represent flights that arrived late. Most flights in this hour arrived on time or early.
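
The counts behind these statements can be checked with a short summary. This is a sketch under stated assumptions: the helper name delay_counts is hypothetical, the 20-minute and 60-minute cut-offs mirror the ones used in the text, and a "long delay" is taken to mean more than an hour on either the departure or the arrival side, which the text does not specify.

```r
library(dplyr)

# Hypothetical helper: count flights by arrival status and flag long delays.
# Assumes dep_delay and arr_delay contain no NA values (true after cleaning).
delay_counts <- function(df, long_delay = 60) {
  df |>
    summarise(
      flights          = n(),
      on_time_or_early = sum(arr_delay <= 0),  # pink points in the plot
      delayed          = sum(arr_delay > 0),   # blue points in the plot
      within_20_min    = sum(abs(arr_delay) <= 20 & abs(dep_delay) <= 20),
      long_delays      = sum(arr_delay > long_delay | dep_delay > long_delay)
    )
}

# Usage with the homework data (not run here):
# delay_counts(flights1)
```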

One interesting part of this visualization is the regression line, which shows a clear positive relationship between departure delays and arrival delays. It suggests that departure delays carry over into arrival delays: once a flight leaves very late, it is difficult to recover the lost time.

Citation

All code is based on Data 101 class notes and previous Data 101 homework.