Homework Assignment: Analyzing NYC Flight Data

This homework assignment uses the flights dataset from the nycflights13 package, which contains real-world data on over 336,000 flights departing from New York City airports (JFK, LGA, EWR) in 2013. The dataset includes variables such as departure and arrival times (with date components), airline carrier (categorical), origin and destination airports (categorical), delays (with missing values for cancelled flights), distance, and more. It is sourced from the US Bureau of Transportation Statistics.

Objectives

This assignment reinforces the Week 4 topics:

  • Parsing and manipulating dates/times using lubridate.
  • Creating and analyzing time series with zoo.
  • Working with factors, inspecting levels, and recoding them.
  • Identifying and handling missing data (e.g., removal, imputation).

All questions (except the final reflection) require you to write and run R code. Submit the URL of your RPubs page. Comment your code and include key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.

Setup

Install and load the necessary packages if not already done:

#install.packages(c("nycflights13", "dplyr", "lubridate", "zoo", "forcats"))  # If needed
library(nycflights13)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(forcats)  # For factor recoding; base R alternatives are acceptable
data(flights)  # Load the dataset

Explore the data briefly with str(flights) and head(flights) to understand the structure. Note: Dates are in separate year, month, day columns; times are in dep_time and arr_time (as integers like 517 for 5:17 AM).

#Explore your data here
flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Question 1: Creating Dates with lubridate

Create a column dep_datetime by combining year, month, day, and dep_time into a POSIXct datetime using lubridate. (Hint: Use make_datetime() with year, month, and day, and derive the hour and minute from dep_time by integer division and remainder, e.g., hour = dep_time %/% 100, min = dep_time %% 100.)

Show the first 5 rows of flights with dep_datetime.

Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.

library(lubridate)

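# Note: this call uses the scheduled-departure `hour` column with minutes fixed at 0,
# so dep_datetime is truncated to the scheduled hour rather than built from dep_time.
flights <- flights |>
  mutate(dep_datetime = make_datetime(year, month, day, hour, min = 0, sec = 0))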


head(flights$dep_datetime, 5)
## [1] "2013-01-01 05:00:00 UTC" "2013-01-01 05:00:00 UTC"
## [3] "2013-01-01 05:00:00 UTC" "2013-01-01 05:00:00 UTC"
## [5] "2013-01-01 06:00:00 UTC"
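
The integer-division approach from the hint can be sketched on a few toy rows. This is a minimal illustration under stated assumptions, not the submitted solution: the `toy` tibble and the helper columns `dep_hour` and `dep_minute` are hypothetical, and the dep_time values copy the first rows printed in the Setup section, plus one missing value to show that a missing dep_time should give a missing datetime.

```r
library(dplyr)
library(lubridate)

# Hypothetical toy rows mirroring the first flights shown in Setup;
# the last row has a missing dep_time, as cancelled flights do.
toy <- tibble(
  year = 2013L, month = 1L, day = 1L,
  dep_time = c(517L, 533L, 542L, NA)
)

toy <- toy |>
  mutate(
    dep_hour     = dep_time %/% 100,  # 517 -> 5
    dep_minute   = dep_time %% 100,   # 517 -> 17
    dep_datetime = make_datetime(year, month, day, dep_hour, dep_minute)
  )

toy |> select(dep_time, dep_datetime)
```

Built this way, dep_datetime keeps the actual departure minute, and rows without a recorded departure need explicit handling before any time-based analysis.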

Question 2: Simple Date Manipulations with lubridate

Using dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.

Output: The table of flight counts by weekday.

library(lubridate)

# Label each flight with its day of the week, then count flights per weekday
flights <- flights |>
  dplyr::mutate(weekday = lubridate::wday(dep_datetime, label = TRUE))

table(flights$weekday)
## 
##   Sun   Mon   Tue   Wed   Thu   Fri   Sat 
## 46357 50690 50422 50060 50219 50308 38720
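
One detail worth checking when counting by weekday: wday() returns NA for a missing datetime, and table() drops NA values unless told otherwise. The sketch below uses a hypothetical three-element vector `dt` (not the flights data) to show the difference; the label text itself depends on the session locale.

```r
library(lubridate)

# Hypothetical datetimes with one missing value.
dt <- as.POSIXct(c("2013-01-01 05:17", "2013-01-02 06:00", NA), tz = "UTC")

# Weekday as an ordered factor with seven levels.
wd <- wday(dt, label = TRUE)

table(wd)                    # the NA entry is dropped by default
table(wd, useNA = "ifany")   # adds an <NA> count when any value is missing
```

Comparing the two totals against nrow() is a quick way to confirm that no flights were silently excluded from a weekday count.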

Question 3: Time Series with zoo

Filter for flights from JFK (origin == “JFK”) and create a zoo time series of departure delays (dep_delay) by dep_datetime. Plot the time series (use plot()). (Hint: Use a subset to avoid memory issues, e.g., first 1000 JFK flights.)

Output: The time series plot.

library(dplyr)
library(zoo)
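# First 1,000 JFK departures in time order, after dropping rows with missing values
jfk1000 <- flights |>
  filter(origin == "JFK") |>
  arrange(dep_datetime) |>
  select(dep_datetime, dep_delay) |>
  filter(!is.na(dep_datetime), !is.na(dep_delay)) |>
  slice_head(n = 1000)
# Build a zoo series of delays indexed by departure time
ts_dep <- zoo(jfk1000$dep_delay, jfk1000$dep_datetime)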
## Warning in zoo(jfk1000$dep_delay, jfk1000$dep_datetime): some methods for "zoo"
## objects do not work if the index entries in 'order.by' are not unique
plot(ts_dep,
     main = "JFK departure - first 1,000 flights",
     ylab = "departure delay (minutes)",
     xlab = "Departure time (JFK)")
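
The zoo warning above appears because several flights share the same index value. One possible way to avoid it, sketched here on hypothetical toy data rather than the flights table, is to average the delays that share a timestamp before building the series. The names `toy`, `per_time`, `mean_delay`, and `ts_unique` are illustrative assumptions.

```r
library(dplyr)
library(zoo)

# Hypothetical delays (minutes); two flights share the 05:00 timestamp.
toy <- tibble(
  dep_datetime = as.POSIXct(c("2013-01-01 05:00", "2013-01-01 05:00",
                              "2013-01-01 06:00", "2013-01-02 05:00"),
                            tz = "UTC"),
  dep_delay = c(2, 4, -6, 10)
)

# Collapse to one row per timestamp so the zoo index is unique.
per_time <- toy |>
  group_by(dep_datetime) |>
  summarise(mean_delay = mean(dep_delay), .groups = "drop")

ts_unique <- zoo(per_time$mean_delay, per_time$dep_datetime)

plot(ts_unique,
     ylab = "Mean departure delay (minutes)",
     xlab = "Departure time")
```

Averaging changes what each point means (a mean per timestamp instead of a single flight), so whether it is appropriate depends on the question being asked.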

Question 4: Working with Factors

Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().

Output: The levels, frequency table, and bar plot.

# Explicit levels fix the display order; the default would be alphabetical (EWR, JFK, LGA)
flights <- flights |>
  mutate(origin_factor = factor(origin, levels = c("JFK","LGA","EWR")))

levels(flights$origin_factor)
## [1] "JFK" "LGA" "EWR"
table(flights$origin_factor)
## 
##    JFK    LGA    EWR 
## 111279 104662 120835
barplot(table(flights$origin_factor),
        main = "Flights by Origin Airport",
        xlab = "Airport", ylab = "Number of Flights")

Question 5: Recoding Factors

Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.

Output: The new levels and bar plot.

library(forcats)

flights <- flights |>
  mutate(origin_recoded = fct_recode(
    origin_factor,
    Kennedy   = "JFK",
    LaGuardia = "LGA",
    Newark    = "EWR"
  ))

levels(flights$origin_recoded)
## [1] "Kennedy"   "LaGuardia" "Newark"
barplot(table(flights$origin_recoded),
        main = "Flights by Origin (Recoded)",
        xlab = "Airport", ylab = "Number of Flights")
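
Since the question also allows base R, the same recoding can be sketched without forcats by passing labels to factor(). The `origin` vector below is a hypothetical four-element sample, not the full column.

```r
# Hypothetical sample of origin codes.
origin <- c("JFK", "LGA", "EWR", "JFK")

# factor() maps each level to the label in the same position.
origin_recoded <- factor(
  origin,
  levels = c("JFK", "LGA", "EWR"),
  labels = c("Kennedy", "LaGuardia", "Newark")
)

levels(origin_recoded)
table(origin_recoded)
```

The positional pairing of levels and labels is easy to get wrong when the vectors are long, which is one reason the named form of fct_recode() can be easier to read.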

Question 6: Handling Missing Data

Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).

Output: NA counts, and the frequency table for imputed delays.

# NA counts (only for the two columns of interest)
colSums(is.na(flights[, c("dep_delay", "arr_delay")]))
## dep_delay arr_delay 
##      8255      9430
# Impute dep_delay NAs with 0 in a new column
flights <- flights |>
  mutate(dep_delay_imputed = ifelse(is.na(dep_delay), 0, dep_delay))

# Frequency table for imputed delays between -20 and 20 minutes
tab_imputed <- flights |>
  filter(dep_delay_imputed >= -20, dep_delay_imputed <= 20) |>
  pull(dep_delay_imputed) |>
  table()

tab_imputed
## 
##   -20   -19   -18   -17   -16   -15   -14   -13   -12   -11   -10    -9    -8 
##    37    19    81   110   162   408   498   901  1594  2727  5891  7875 11791 
##    -7    -6    -5    -4    -3    -2    -1     0     1     2     3     4     5 
## 16752 20701 24821 24619 24218 21516 18813 24769  8050  6233  5450  4807  4447 
##     6     7     8     9    10    11    12    13    14    15    16    17    18 
##  3789  3520  3381  3062  2859  2756  2494  2414  2256  2140  2085  1873  1749 
##    19    20 
##  1730  1704
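
To see how zero-imputation can shift a summary statistic, the sketch below compares the mean delay with and without imputation on a small hypothetical vector. The values are made up for illustration; dplyr::coalesce() is used here as an alternative to the ifelse() call above.

```r
library(dplyr)

# Hypothetical delays in minutes; NA stands for a cancelled flight.
dep_delay <- c(30, -5, NA, 45, NA, 10)

# coalesce() replaces each NA with the first non-missing alternative, here 0.
dep_delay_imputed <- coalesce(dep_delay, 0)

mean(dep_delay, na.rm = TRUE)  # mean over flights that actually departed
mean(dep_delay_imputed)        # mean after treating cancellations as on time
```

In this toy case the imputed mean is lower because the added zeros pull the average toward zero, which is the effect the reflection question asks about.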

Question 7: Reflection (No Coding)

Reflect on the assignment: What was easy or hard about working with flight dates or missing data? How might assuming zero delay for missing values (Question 6) affect conclusions about flight punctuality? What did you learn about NYC flights in 2013? (150-200 words)

Working with the flight dates was straightforward once I split dep_time into hour and minute using integer division and fed them to lubridate::make_datetime(). Adding weekday labels with wday() and plotting a small zoo series made temporal patterns easy to see: delays bunched in the late afternoon and evening, with more dispersion at JFK than at LGA or EWR. The hard part was the data hygiene around missing values and row limits. I initially struggled to cap the dataset at exactly 1,000 JFK flights because I was slicing too early, before ordering by dep_datetime and removing NAs, which produced inconsistent counts and plots. With guidance from Prof. Mais and a tailored reference on zoo/time-series workflows, I learned the correct pipeline: filter → sort → drop NAs → slice_head(n = 1000), which yields the first 1,000 valid, time-ordered JFK departures for a stable series. Treating missing dep_delay as zero in Question 6 was convenient for a quick table, but it almost certainly overstates punctuality (cancellations cluster with bad weather and peak periods), shrinking the variance and biasing averages downward. Overall, the 2013 NYC flights show skewed delays, afternoon congestion, and clear airport differences; these insights depend on careful date handling and explicit treatment of missingness.