Per the assignment, I loaded the packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
and the data we need
data(nycflights)
and then we viewed the variable names
names(nycflights)
## [1] "year" "month" "day" "dep_time" "dep_delay" "arr_time"
## [7] "arr_delay" "carrier" "tailnum" "flight" "origin" "dest"
## [13] "air_time" "distance" "hour" "minute"
and, to get a better handle on the data, used the glimpse() function
glimpse(nycflights)
## Rows: 32,735
## Columns: 16
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 6, 5, 12, 5, 7, 1, 12, 8, 9, 4, 6, 11, 4, 3, 10, 1, 2, 8, 10…
## $ day <int> 30, 7, 8, 14, 21, 1, 9, 13, 26, 30, 17, 22, 26, 25, 21, 23, …
## $ dep_time <int> 940, 1657, 859, 1841, 1102, 1817, 1259, 1920, 725, 1323, 940…
## $ dep_delay <dbl> 15, -3, -1, -4, -3, -3, 14, 85, -10, 62, 5, 5, -2, 115, -4, …
## $ arr_time <int> 1216, 2104, 1238, 2122, 1230, 2008, 1617, 2032, 1027, 1549, …
## $ arr_delay <dbl> -4, 10, 11, -34, -8, 3, 22, 71, -8, 60, -4, -2, 22, 91, -6, …
## $ carrier <chr> "VX", "DL", "DL", "DL", "9E", "AA", "WN", "B6", "AA", "EV", …
## $ tailnum <chr> "N626VA", "N3760C", "N712TW", "N914DL", "N823AY", "N3AXAA", …
## $ flight <int> 407, 329, 422, 2391, 3652, 353, 1428, 1407, 2279, 4162, 20, …
## $ origin <chr> "JFK", "JFK", "JFK", "JFK", "LGA", "LGA", "EWR", "JFK", "LGA…
## $ dest <chr> "LAX", "SJU", "LAX", "TPA", "ORF", "ORD", "HOU", "IAD", "MIA…
## $ air_time <dbl> 313, 216, 376, 135, 50, 138, 240, 48, 148, 110, 50, 161, 87,…
## $ distance <dbl> 2475, 1598, 2475, 1005, 296, 733, 1411, 228, 1096, 820, 264,…
## $ hour <dbl> 9, 16, 8, 18, 11, 18, 12, 19, 7, 13, 9, 13, 8, 20, 12, 20, 6…
## $ minute <dbl> 40, 57, 59, 41, 2, 17, 59, 20, 25, 23, 40, 20, 9, 54, 17, 24…
And from here, we look at three different histograms of departure delay information:
# Create a histogram of departure delays (default binwidth)
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram() +
labs(title = "Histogram of Departure Delays (Default Binwidth)",
x = "Departure Delay (minutes)",
y = "Frequency") +
theme_minimal()
and then with a bin width of 15
# Create a histogram of departure delays with a binwidth of 15
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 15) +
labs(title = "Histogram of Departure Delays (15-minute Binwidth)",
x = "Departure Delay (minutes)",
y = "Frequency") +
theme_minimal()
and lastly with a bin width of 150
# Create a histogram of departure delays with a binwidth of 150
ggplot(data = nycflights, aes(x = dep_delay)) +
geom_histogram(binwidth = 150) +
labs(title = "Histogram of Departure Delays (150-minute Binwidth)",
x = "Departure Delay (minutes)",
y = "Frequency") +
theme_minimal()
Look carefully at these three histograms. How do they compare? Are features revealed in one that are obscured in another?
The first thing to establish is what the histograms are charting. Each bar spans a range of departure delay (in minutes), and its height is the number of flights whose delay falls inside that range. For the first plot we don’t specify a binwidth; geom_histogram() defaults to 30 bins, so the bin width is whatever splits the full range of dep_delay into 30 equal pieces (ggplot2 even warns you to pick a better value). The second histogram uses 15-minute bins, while the third uses 150-minute bins.
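As a quick check on that default (a small sketch of my own, not part of the assignment), the implied bin width under the 30-bin default is roughly the range of dep_delay divided by 30:
# Rough bin width implied by geom_histogram()'s default of bins = 30
delay_range <- range(nycflights$dep_delay, na.rm = TRUE)
diff(delay_range) / 30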
This tells us that the most “accurate” view, from a down-to-the-minute perspective, is the second histogram. The wider we make the bars, the less specific our information becomes. The third histogram looks great at first glance, suggesting that most flights fall in the bar for “0 minutes delayed.” That holds only until you notice that this bar covers flights that are anywhere from 0 to 150 minutes late. It is hard to imagine most people accepting that a flight over two hours late is “basically on time,” which is what that chart misleadingly seems to indicate.
The second chart gives us a much better look, informing the viewer that most flights are within a 15-minute window of zero delay, which is likely what most people would accept as “on time.”
## LAX Flights
We were given the following code snippets:
# Filter for flights going to LAX
lax_flights <- nycflights %>%
filter(dest == "LAX")
# Create a histogram of departure delays for LAX flights
ggplot(data = lax_flights, aes(x = dep_delay)) +
geom_histogram() +
labs(title = "Histogram of Departure Delays for LAX Flights",
x = "Departure Delay (minutes)",
y = "Count") +
theme_minimal()
From this starting point, we were asked to write a new snippet of code that filters the same data for flights to SFO in February:
# Filter for flights going to SFO in February
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
# Create a histogram of departure delays for SFO flights in February
ggplot(data = sfo_feb_flights, aes(x = dep_delay)) +
geom_histogram() +
labs(title = "Histogram of Departure Delays for SFO Flights in February",
x = "Departure Delay (minutes)",
y = "Count") +
theme_minimal()
Create a new data frame that includes flights headed to SFO in February, and save this data frame as sfo_feb_flights. How many flights meet these criteria?
# Filter flights headed to SFO in February
sfo_feb_flights <- nycflights %>%
filter(dest == "SFO", month == 2)
# Count the number of flights that meet these criteria
num_flights <- nrow(sfo_feb_flights)
num_flights
## [1] 68
Describe the distribution of the arrival delays of these flights using a histogram and appropriate summary statistics. Hint: The summary statistics you use should depend on the shape of the distribution.
First we’ll adapt the SFO histogram we already prepared to show arrival delays instead, and then generate some summary statistics:
# Create a histogram of arrival delays
ggplot(sfo_feb_flights, aes(x = arr_delay)) +
geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
labs(title = "Histogram of Arrival Delays for SFO Flights in February",
x = "Arrival Delay (minutes)",
y = "Frequency") +
theme_minimal()
# Calculate summary statistics for arrival delays (ignoring NA values)
arrival_summary <- sfo_feb_flights %>%
filter(!is.na(arr_delay)) %>% # Exclude missing values
summarise(
n = n(),
mean_delay = mean(arr_delay),
median_delay = median(arr_delay),
sd_delay = sd(arr_delay),
min_delay = min(arr_delay),
max_delay = max(arr_delay)
)
arrival_summary
## # A tibble: 1 × 6
## n mean_delay median_delay sd_delay min_delay max_delay
## <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 68 -4.5 -11 36.3 -66 196
This histogram tells us that, in general, most flights are arriving on time or even a little early. The distribution is right-skewed: a handful of large delays (the maximum is 196 minutes) pull the mean (-4.5) well above the median (-11), so the median and IQR are the more appropriate summary statistics here. Again, this is from NYC to SFO. As a West Coast resident, I find this interesting, since the jet stream is not usually in our favor for east-to-west travel. My guess is that airlines pad schedules for the worst-case jet stream, so flights arrive with headroom to turn around for future flights. But that’s a guess not backed up by anything other than anecdotal observation.
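To put a number on the “mostly on time or early” claim (a quick sketch of my own, not part of the assignment), we can compute the share of these flights that arrived at or before their scheduled time:
# Proportion of February SFO flights arriving on time or early (arr_delay <= 0)
sfo_feb_flights %>%
filter(!is.na(arr_delay)) %>%
summarise(prop_on_time_or_early = mean(arr_delay <= 0))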
Calculate the median and interquartile range for arr_delay of flights in the sfo_feb_flights data frame, grouped by carrier. Which carrier has the most variable arrival delays?
library(dplyr)
carrier_summary <- sfo_feb_flights %>%
group_by(carrier) %>%
summarise(
median_arr_delay = median(arr_delay, na.rm = TRUE),
iqr_arr_delay = IQR(arr_delay, na.rm = TRUE),
n_flights = n()
)
carrier_summary
## # A tibble: 5 × 4
## carrier median_arr_delay iqr_arr_delay n_flights
## <chr> <dbl> <dbl> <int>
## 1 AA 5 17.5 10
## 2 B6 -10.5 12.2 6
## 3 DL -15 22 19
## 4 UA -10 22 21
## 5 VX -22.5 21.2 12
Here, the interquartile range (IQR) tells us which carrier has the most spread in its arrival delays. The smaller the IQR, the more tightly the middle 50% of delays cluster around the median; the larger the IQR, the more variable the arrival times. By that measure, in the month of February, Delta and United had the most variable arrival delays, tied at an IQR of 22 minutes.
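If we’d rather have the code pick out the winner than read it off the table (a small sketch beyond the assignment), we can keep only the carrier(s) whose IQR equals the maximum:
# Identify the carrier(s) with the most variable arrival delays
carrier_summary %>%
filter(iqr_arr_delay == max(iqr_arr_delay))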
Suppose you really dislike departure delays and you want to schedule your travel in a month that minimizes your potential departure delay leaving NYC. One option is to choose the month with the lowest mean departure delay. Another option is to choose the month with the lowest median departure delay. What are the pros and cons of these two choices?
First, here’s the code provided in the assignment since we’ll be referencing it:
# Calculate the mean departure delay by month and arrange in descending order
monthly_delays <- nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay, na.rm = TRUE)) %>%
arrange(desc(mean_dd))
# Display the result
monthly_delays
## # A tibble: 12 × 2
## month mean_dd
## <int> <dbl>
## 1 7 20.8
## 2 6 20.4
## 3 12 17.4
## 4 4 14.6
## 5 3 13.5
## 6 5 13.3
## 7 8 12.6
## 8 2 10.7
## 9 1 10.2
## 10 9 6.87
## 11 11 6.10
## 12 10 5.88
# Calculate the median departure delay by month and arrange in descending order
monthly_delays <- nycflights %>%
group_by(month) %>%
summarise(median_dd = median(dep_delay, na.rm = TRUE)) %>%
arrange(desc(median_dd))
# Display the result
monthly_delays
## # A tibble: 12 × 2
## month median_dd
## <int> <dbl>
## 1 12 1
## 2 6 0
## 3 7 0
## 4 3 -1
## 5 5 -1
## 6 8 -1
## 7 1 -2
## 8 2 -2
## 9 4 -2
## 10 11 -2
## 11 9 -3
## 12 10 -3
Before we pick the best month to travel, we need to understand what each statistic measures. The mean takes every delay into account, so extreme values pull it around: a few flights that were very late can make a whole month look as though “flights are not on time.” The median is the middle value, so outliers barely influence it; it describes what a typical flight experiences but says nothing about how bad the bad days get. In short, if you want the best odds of a typical, low-impact departure, choose a month with a low median departure delay. If, on the other hand, you want to reduce exposure to the occasional very long delay, look for the lowest mean departure delay. As it happens, September and October sit at the bottom of both tables. Meaning: if you want a smooth trip out of NYC, fly in September or October.
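To see both statistics side by side (a convenience sketch, not part of the assignment code), we can compute them in a single summarise() call:
# Mean and median departure delay by month, shown together
nycflights %>%
group_by(month) %>%
summarise(mean_dd = mean(dep_delay, na.rm = TRUE),
median_dd = median(dep_delay, na.rm = TRUE)) %>%
arrange(mean_dd)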
If you were selecting an airport simply based on on-time departure percentage, which NYC airport would you choose to fly out of?
Following the assignment, we’re going to use mutate() to add a column that classifies each flight as either “on time” or “delayed”
# Add a new column 'dep_type' based on departure delay
nycflights <- nycflights %>%
mutate(dep_type = ifelse(dep_delay < 5, "on time", "delayed"))
# View the first few rows to verify the new column
head(nycflights)
## # A tibble: 6 × 17
## year month day dep_time dep_delay arr_time arr_delay carrier tailnum flight
## <int> <int> <int> <int> <dbl> <int> <dbl> <chr> <chr> <int>
## 1 2013 6 30 940 15 1216 -4 VX N626VA 407
## 2 2013 5 7 1657 -3 2104 10 DL N3760C 329
## 3 2013 12 8 859 -1 1238 11 DL N712TW 422
## 4 2013 5 14 1841 -4 2122 -34 DL N914DL 2391
## 5 2013 7 21 1102 -3 1230 -8 9E N823AY 3652
## 6 2013 1 1 1817 -3 2008 3 AA N3AXAA 353
## # ℹ 7 more variables: origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, dep_type <chr>
Now that dep_type has been added to the data, we group by origin airport and compute the proportion of flights from each that departed on time:
ot_dep_rate_by_origin <- nycflights %>%
group_by(origin) %>%
summarise(ot_dep_rate = sum(dep_type == "on time") / n()) %>%
arrange(desc(ot_dep_rate))
ot_dep_rate_by_origin
## # A tibble: 3 × 2
## origin ot_dep_rate
## <chr> <dbl>
## 1 LGA 0.728
## 2 JFK 0.694
## 3 EWR 0.637
And then we can display the information visually, per the assignment:
# Create a bar plot with origin on the x-axis and fill by departure type
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar() +
labs(title = "Flight Counts by Origin and Departure Type",
x = "Origin Airport",
y = "Count of Flights") +
theme_minimal()
Going strictly by “on time” percentage, it looks like you should travel through LGA, since about 73% of its flights fall within the on-time parameter we set.
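The stacked counts above make LGA’s advantage a little hard to read because the three airports handle different numbers of flights. One way to make the rates directly comparable (a small sketch beyond what the assignment asked for) is to scale each bar to proportions with position = "fill":
# Same plot, but with each bar scaled to proportions so on-time rates are comparable
ggplot(data = nycflights, aes(x = origin, fill = dep_type)) +
geom_bar(position = "fill") +
labs(title = "Proportion of On-Time vs. Delayed Departures by Origin",
x = "Origin Airport",
y = "Proportion of Flights") +
theme_minimal()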
Mutate the data frame so that it includes a new variable that contains the average speed, avg_speed traveled by the plane for each flight (in mph). Hint: Average speed can be calculated as distance divided by number of hours of travel, and note that air_time is given in minutes.
The dataset provides air_time in minutes, but we want hours, since most of us think of speed in miles per hour. So we’ll compute avg_speed = distance / (air_time / 60):
# Calculate the average speed in mph for each flight
nycflights <- nycflights %>%
mutate(avg_speed = distance / (air_time / 60))
# Display the first few rows to verify the new variable
head(nycflights %>% select(distance, air_time, avg_speed))
## # A tibble: 6 × 3
## distance air_time avg_speed
## <dbl> <dbl> <dbl>
## 1 2475 313 474.
## 2 1598 216 444.
## 3 2475 376 395.
## 4 1005 135 447.
## 5 296 50 355.
## 6 733 138 319.
Make a scatterplot of avg_speed vs. distance. Describe the relationship between average speed and distance. Hint: Use geom_point().
# Create a scatterplot of average speed vs. distance
ggplot(nycflights, aes(x = distance, y = avg_speed)) +
geom_point(alpha = 0.5, color = "steelblue") +
labs(title = "Scatterplot of Average Speed vs. Distance",
x = "Distance (miles)",
y = "Average Speed (mph)") +
theme_minimal()
Our scatterplot shows that speed is more variable the shorter the flight. For flights of 500 miles or less, average speed can land anywhere between roughly 100 and 400 mph. Since I don’t think a commercial airliner cruises at 100 mph, I am going to assume that the recorded time effectively includes a fixed overhead for taxiing, takeoff, and landing. Meaning: every flight “pays” a startup cost, and the shorter the flight, the bigger the impact of that cost on its average speed. As flights get longer, average speed flattens out; based on our scatterplot, the startup cost is largely washed out by about 1,500 miles of travel, with most long flights hovering in the 425-450 mph range. So, purely on flight-time efficiency, we should be flying 1,500-mile trips rather than anything shorter than 500 miles.
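To sanity-check that read of the plot (a quick sketch of my own, not required by the assignment), we can compare the spread of avg_speed for short versus long flights:
# Compare average-speed spread for short (<= 500 mi) vs. long (>= 1500 mi) flights
nycflights %>%
mutate(length = case_when(
distance <= 500 ~ "short (<= 500 mi)",
distance >= 1500 ~ "long (>= 1500 mi)",
TRUE ~ "medium"
)) %>%
group_by(length) %>%
summarise(min_speed = min(avg_speed, na.rm = TRUE),
median_speed = median(avg_speed, na.rm = TRUE),
max_speed = max(avg_speed, na.rm = TRUE))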
Replicate the following plot. Hint: The data frame plotted only contains flights from American Airlines, Delta Airlines, and United Airlines, and the points are colored by carrier. Once you replicate the plot, determine (roughly) what the cutoff point is for departure delays where you can still expect to get to your destination on time.
# Filter for flights from AA, DL, and UA
selected_flights <- nycflights %>%
filter(carrier %in% c("AA", "DL", "UA"))
# Create a scatterplot of dep_delay vs. arr_delay with specified colors for each carrier
ggplot(selected_flights, aes(x = dep_delay, y = arr_delay, color = carrier)) +
geom_point(alpha = 0.5) +
labs(title = "Departure vs. Arrival Delays for AA, DL, and UA",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)") +
scale_color_manual(values = c("AA" = "red", "DL" = "green", "UA" = "blue")) +
xlim(-30, 800) +
ylim(-30, 800) +
theme_minimal()
Not a scientific view of the data, but it appears that roughly 50 minutes is where a departure delay starts to reliably translate into an arrival delay; below that, many flights still make up the time in the air. Visually, I really want to drop a 40-minute-wide circle centered on the +5 minute departure line; it looks like it would capture a huge percentage of the flights.
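One way to put a rough number on that cutoff (a small sketch of my own, not part of the provided code) is to find the largest departure delay among these flights that still arrived on time or early:
# Largest departure delay among AA/DL/UA flights that still arrived on time (arr_delay <= 0)
selected_flights %>%
filter(arr_delay <= 0) %>%
summarise(max_dep_delay_still_on_time = max(dep_delay, na.rm = TRUE))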
Wow. This was really in depth, and a lot of fun. I appreciated how much I learned and how much R is starting to feel like an extension of logic: whatever I want to do with data, most of it can be accomplished quickly in R. dplyr is also great; it sped up a lot of this process. My favorite part is plotting data and watching the manipulations we’ve made come to life right before my eyes.