nycflights13 Data Visualization Assignment

Author

Olivia Yuengling

The nycflights13 dataset

# loading the required packages with the nycflights13 dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(ggplot2)
flights_nona <- flights |>
  filter(!is.na(arr_delay))  
# remove rows where arr_delay is NA (cleaning data)

flights_nona6 <- flights_nona |> 
  filter(month == 6)
# filter the data to only the flights in June

flights_nona6 <- flights_nona6 |>
  select(year, month, day, arr_delay, tailnum)
head(flights_nona6)
# A tibble: 6 × 5
   year month   day arr_delay tailnum
  <int> <int> <int>     <dbl> <chr>  
1  2013     6     1        -9 N618JB 
2  2013     6     1       -16 N538UW 
3  2013     6     1       -45 N35407 
4  2013     6     1       -29 N27724 
5  2013     6     1         3 N806JB 
6  2013     6     1        -8 N5EAAA 
# specify the columns needed for the visualization from flights_nona6

Data Cleaning (Removing NA’s and Filtering to June)

weather_nona <- weather |>
  filter(!is.na(visib))
# remove rows where visib is NA (cleaning data)

weather_nona6 <- weather_nona |>
  filter(month == 6)
# filter the data to only the weather in June

weather_nona6 <- weather_nona6 |>
  select(year, month, day, hour, visib)
head(weather_nona6)
# A tibble: 6 × 5
   year month   day  hour visib
  <int> <int> <int> <int> <dbl>
1  2013     6     1     0    10
2  2013     6     1     1    10
3  2013     6     1     2    10
4  2013     6     1     3    10
5  2013     6     1     4    10
6  2013     6     1     5    10
# specify the columns needed for the visualization from weather_nona6
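
The flights and weather tables go through the same three steps above: drop missing values, keep June, and select a few columns. As a minimal sketch, those steps could be wrapped in one function. The helper name `june_subset` and the toy data are hypothetical and are not part of the original analysis.

```r
library(dplyr)

# Hypothetical helper: keep June rows with a non-missing value column,
# then keep the date columns, the value column, and any extra columns
june_subset <- function(data, value_col, ...) {
  data |>
    filter(!is.na({{ value_col }}), month == 6) |>
    select(year, month, day, {{ value_col }}, ...)
}

# Toy stand-in for the flights table (made-up values, for illustration only)
toy_flights <- tibble(
  year = 2013L,
  month = c(5L, 6L, 6L, 6L),
  day = c(31L, 1L, 1L, 2L),
  arr_delay = c(4, -9, NA, 12),
  tailnum = c("T1", "T2", "T3", "T4")
)

june <- june_subset(toy_flights, arr_delay, tailnum)
```

With the real data, the equivalent calls would be `june_subset(flights, arr_delay, tailnum)` and `june_subset(weather, visib, hour)`, although the column order would differ slightly from the tables shown above.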

Grouping Information & Calculating Means

by_day_flights <- flights_nona6 |>
  group_by(day) |>  # group the flights by each day in June
  summarise(count = n(),   # number of flights for each day
            delay = mean(arr_delay),   # mean arrival delay for each day
            month   # not a summary: returns one row per flight, which triggers the warning below
            )
Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
dplyr 1.1.0.
ℹ Please use `reframe()` instead.
ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
  always returns an ungrouped data frame and adjust accordingly.
`summarise()` has grouped output by 'day'. You can override using the `.groups`
argument.
head(by_day_flights)
# A tibble: 6 × 4
# Groups:   day [1]
    day count delay month
  <int> <int> <dbl> <int>
1     1   749 -11.9     6
2     1   749 -11.9     6
3     1   749 -11.9     6
4     1   749 -11.9     6
5     1   749 -11.9     6
6     1   749 -11.9     6
delay <- filter(by_day_flights, count > 20) 
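
The warning above appears because `month` is passed to `summarise()` without being reduced to a single value, so every group returns one row per flight and day 1 is repeated in the output. A minimal sketch of one way to avoid this, assuming the goal is one row per day, is to summarise `month` as well, for example with `first()`. The toy data below is made up for illustration.

```r
library(dplyr)

# Toy stand-in for flights_nona6 (made-up values, for illustration only)
toy <- tibble(
  month = 6L,
  day = c(1L, 1L, 1L, 2L, 2L),
  arr_delay = c(-9, -16, 3, 20, 10)
)

by_day <- toy |>
  group_by(day) |>
  summarise(
    count = n(),               # number of flights for each day
    delay = mean(arr_delay),   # mean arrival delay for each day
    month = first(month)       # one value per group instead of one per flight
  )
```

Because every row here is from June, dropping `month` from the `summarise()` call would work just as well.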

by_day_weather <- weather_nona6 |>
  group_by(day) |> # group the weather by each day in June
  summarise(count = n(),
            vision = mean(visib) # find the mean visibility for each day in June
            )
head(by_day_weather)
# A tibble: 6 × 3
    day count vision
  <int> <int>  <dbl>
1     1    72   9.99
2     2    72   9.69
3     3    72   8.88
4     4    72  10   
5     5    72  10   
6     6    72   9.75

Merging by_day_flights & by_day_weather

junenycflights <- merge(by_day_flights, by_day_weather, by = "day")
# merge the daily flight and weather summaries by day
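
Both daily summaries contain a column named `count`, so `merge()` keeps both and tells them apart with its default `.x` and `.y` suffixes. The sketch below shows that behavior on made-up daily values and how the `suffixes` argument could give the two columns clearer names. The data frame names are hypothetical.

```r
# Toy daily summaries (made-up values); both carry a `count` column
flights_daily <- data.frame(
  day = 1:3,
  count = c(700L, 720L, 690L),
  delay = c(-12, 5, 30)
)
weather_daily <- data.frame(
  day = 1:3,
  count = c(72L, 72L, 72L),
  vision = c(10, 9.7, 8.9)
)

# Default behavior: the shared column becomes count.x and count.y
default_merge <- merge(flights_daily, weather_daily, by = "day")

# Optional: label the shared column by its source table
labelled_merge <- merge(flights_daily, weather_daily, by = "day",
                        suffixes = c("_flights", "_weather"))
```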

Creating the Visualization

graph <- ggplot(junenycflights, aes(vision, delay, color = vision)) +
  geom_point(aes(size = vision), alpha = .1) +
  scale_color_gradient(low = "navy", high = "lightpink") +
  scale_size_area() +
  theme_bw() +
  labs(x = "Mean Daily Visibility (Miles)",
       y = "Mean Arrival Delay (Minutes)",
       caption = "Source: nycflights13 package (flights and weather tables)",
       title = "Arrival Delays and Visibility for Flights in June 2013")
graph

The graph shows an extremely slight correlation between visibility and average arrival delay during June 2013. The x-axis shows the mean visibility for each day, measured in miles up to 10. The y-axis shows the mean number of minutes by which flights arrived late on that day. Each point is one day: the lighter points are days with better visibility, and the darker points are days with lower visibility.

As the graph shows, most of the points sit on the right side, which means that most days had good visibility. A few outliers on the left side of the graph are days with poor visibility.

So let's answer the big question: does weather visibility affect whether a flight is delayed? Sort of. As we can see, the days with the latest arrivals had slightly lower visibility than the days on which flights arrived on time or only slightly late. In addition, the days on which flights arrived on time had a visibility of 10 out of 10, which suggests that flights in better visibility are more likely to arrive on time.

But because the visual results are not significant, we cannot conclude that the visibility of the weather directly determines whether a flight arrives late. We can conclude, however, that there is enough information to suggest that visibility is a contributing factor in whether a flight's arrival is delayed.
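
One way to put a number on the "slight correlation" described above is a Pearson correlation between the daily mean visibility and the daily mean delay. The sketch below is not part of the original analysis. The helper name `visibility_delay_cor` is hypothetical, the toy values are made up, and the function assumes a data frame with `day`, `vision`, and `delay` columns. It keeps one row per day first, because the merged table can repeat a day many times.

```r
# Hypothetical helper: Pearson correlation between daily mean visibility
# and daily mean arrival delay, computed on one row per day
visibility_delay_cor <- function(daily) {
  one_per_day <- daily[!duplicated(daily$day), ]
  cor(one_per_day$vision, one_per_day$delay)
}

# Toy data (made-up values); day 1 is duplicated on purpose
toy_daily <- data.frame(
  day = c(1, 1, 2, 3, 4),
  vision = c(10, 10, 9.5, 8, 6),
  delay = c(-5, -5, 0, 12, 30)
)

r <- visibility_delay_cor(toy_daily)
```

Calling `visibility_delay_cor(junenycflights)` would give the value for the real data. A result close to 0 would agree with the weak visual pattern, while a clearly negative value would mean that lower visibility goes with longer delays.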