Homework Assignment: Analyzing NYC Flight Data

This homework assignment uses the flights dataset from the nycflights13 package, which contains real-world data on over 336,000 flights departing from New York City airports (JFK, LGA, EWR) in 2013. The dataset includes variables such as departure and arrival times (with date components), airline carrier (categorical), origin and destination airports (categorical), delays (with missing values for cancelled flights), distance, and more. It is sourced from the US Bureau of Transportation Statistics.

Objectives

This assignment reinforces the Week 4 topics:

  • Parsing and manipulating dates/times using lubridate.
  • Creating and analyzing time series with zoo.
  • Working with factors, inspecting levels, and recoding them.
  • Identifying and handling missing data (e.g., removal, imputation).

All questions (except the final reflection) require you to write and run R code. Publish your work to RPubs and submit the URL. Comment your code and include key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.

Setup

Install and load the necessary packages if not already done:

#install.packages(c("nycflights13", "dplyr", "lubridate", "zoo", "forcats"))  # If needed
library(nycflights13)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(forcats)  # For factor recoding; base R alternatives are acceptable
data(flights)  # Load the dataset

Explore the data briefly with str(flights), head(flights), and summary(flights) to understand its structure. Note: Dates are stored in separate year, month, and day columns; times are stored in dep_time and arr_time as integers in HHMM form (e.g., 517 for 5:17 AM).

#Explore your data here
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
##  $ year          : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr [1:336776] "UA" "UA" "AA" "B6" ...
##  $ flight        : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num [1:336776] 1400 1416 1089 1576 762 ...
##  $ hour          : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
## 1  2013     1     1      517            515         2      830            819
## 2  2013     1     1      533            529         4      850            830
## 3  2013     1     1      542            540         2      923            850
## 4  2013     1     1      544            545        -1     1004           1022
## 5  2013     1     1      554            600        -6      812            837
## 6  2013     1     1      554            558        -4      740            728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>
summary(flights)
##       year          month             day           dep_time    sched_dep_time
##  Min.   :2013   Min.   : 1.000   Min.   : 1.00   Min.   :   1   Min.   : 106  
##  1st Qu.:2013   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.: 907   1st Qu.: 906  
##  Median :2013   Median : 7.000   Median :16.00   Median :1401   Median :1359  
##  Mean   :2013   Mean   : 6.549   Mean   :15.71   Mean   :1349   Mean   :1344  
##  3rd Qu.:2013   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:1744   3rd Qu.:1729  
##  Max.   :2013   Max.   :12.000   Max.   :31.00   Max.   :2400   Max.   :2359  
##                                                  NA's   :8255                 
##    dep_delay          arr_time    sched_arr_time   arr_delay       
##  Min.   : -43.00   Min.   :   1   Min.   :   1   Min.   : -86.000  
##  1st Qu.:  -5.00   1st Qu.:1104   1st Qu.:1124   1st Qu.: -17.000  
##  Median :  -2.00   Median :1535   Median :1556   Median :  -5.000  
##  Mean   :  12.64   Mean   :1502   Mean   :1536   Mean   :   6.895  
##  3rd Qu.:  11.00   3rd Qu.:1940   3rd Qu.:1945   3rd Qu.:  14.000  
##  Max.   :1301.00   Max.   :2400   Max.   :2359   Max.   :1272.000  
##  NA's   :8255      NA's   :8713                  NA's   :9430      
##    carrier              flight       tailnum             origin         
##  Length:336776      Min.   :   1   Length:336776      Length:336776     
##  Class :character   1st Qu.: 553   Class :character   Class :character  
##  Mode  :character   Median :1496   Mode  :character   Mode  :character  
##                     Mean   :1972                                        
##                     3rd Qu.:3465                                        
##                     Max.   :8500                                        
##                                                                         
##      dest              air_time        distance         hour      
##  Length:336776      Min.   : 20.0   Min.   :  17   Min.   : 1.00  
##  Class :character   1st Qu.: 82.0   1st Qu.: 502   1st Qu.: 9.00  
##  Mode  :character   Median :129.0   Median : 872   Median :13.00  
##                     Mean   :150.7   Mean   :1040   Mean   :13.18  
##                     3rd Qu.:192.0   3rd Qu.:1389   3rd Qu.:17.00  
##                     Max.   :695.0   Max.   :4983   Max.   :23.00  
##                     NA's   :9430                                  
##      minute        time_hour                  
##  Min.   : 0.00   Min.   :2013-01-01 05:00:00  
##  1st Qu.: 8.00   1st Qu.:2013-04-04 13:00:00  
##  Median :29.00   Median :2013-07-03 10:00:00  
##  Mean   :26.23   Mean   :2013-07-03 05:22:54  
##  3rd Qu.:44.00   3rd Qu.:2013-10-01 07:00:00  
##  Max.   :59.00   Max.   :2013-12-31 23:00:00  
## 

Question 1: Creating Dates with lubridate

Create a column dep_datetime by combining year, month, day, and dep_time into a POSIXct datetime using lubridate. (Hint: Use make_datetime() with year, month, and day; derive the hour and minute from dep_time with integer division and modulo, e.g., hour = dep_time %/% 100, min = dep_time %% 100.)

Show the first 5 rows of flights with dep_datetime.

Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.

# dep_time is an integer in HHMM form (e.g., 517 = 5:17 AM).
# Integer division and modulo split it into hour and minute.
# NA propagates through both operators, so no ifelse() guard is needed.
flights_q1 <- flights %>%
  mutate(
    dep_hour = dep_time %/% 100,
    dep_min  = dep_time %% 100,
    dep_datetime = make_datetime(year = year, month = month, day = day, hour = dep_hour, min = dep_min)
  )
flights_q1 %>%
  select(year, month, day, dep_time, dep_datetime) %>%
  slice_head(n = 5)
## # A tibble: 5 × 5
##    year month   day dep_time dep_datetime       
##   <int> <int> <int>    <int> <dttm>             
## 1  2013     1     1      517 2013-01-01 05:17:00
## 2  2013     1     1      533 2013-01-01 05:33:00
## 3  2013     1     1      542 2013-01-01 05:42:00
## 4  2013     1     1      544 2013-01-01 05:44:00
## 5  2013     1     1      554 2013-01-01 05:54:00
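
The HHMM arithmetic from the hint can be checked on a handful of values before applying it to the full dataset. The sketch below uses only base R; the vector hhmm and its values are made up for illustration (2400 is included because it appears as the maximum dep_time in the summary above). It also shows that NA passes through %/% and %% unchanged.

```r
# Hypothetical sample of dep_time-style integers (HHMM), including a missing value
hhmm <- c(517L, 1004L, 2400L, NA)

hour   <- hhmm %/% 100  # integer division keeps the leading digits (hours)
minute <- hhmm %% 100   # modulo keeps the last two digits (minutes)

# Side-by-side view of the decomposition
data.frame(hhmm, hour, minute)
```

A value of 2400 yields hour 24 and minute 0, so it may be worth inspecting how such rows end up in dep_datetime.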

Question 2: Simple Date Manipulations with lubridate

Using dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.

Output: The table of flight counts by weekday.

flights_q2 <- flights_q1 %>%
  mutate(weekday = wday(dep_datetime, label = TRUE, abbr = TRUE))
weekday_table <- table(flights_q2$weekday, useNA = "ifany")
weekday_table
## 
##   Sun   Mon   Tue   Wed   Thu   Fri   Sat  <NA> 
## 45643 49468 49273 48858 48654 48703 37922  8255
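
The <NA> count (8255) matches the number of missing dep_time values in the summary above, since a flight without a departure time has no dep_datetime. As a cross-check on the weekday labels, here is a minimal base R sketch that derives weekdays without lubridate. The dates and the day_labels vector are illustrative assumptions, not part of the assignment.

```r
# Illustrative dates: 2013-01-01 was a Tuesday, 2013-01-05 a Saturday, 2013-01-06 a Sunday
dates <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06", NA))

# as.POSIXlt()$wday numbers days 0 (Sunday) to 6 (Saturday), independent of locale
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")
weekday <- factor(day_labels[as.POSIXlt(dates)$wday + 1], levels = day_labels)

# useNA = "ifany" keeps missing dates visible in the count
table(weekday, useNA = "ifany")
```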

Question 3: Time Series with zoo

Filter for flights from JFK (origin == "JFK") and create a zoo time series of departure delays (dep_delay) indexed by dep_datetime. Plot the time series with plot(). (Hint: Use a subset to avoid memory issues, e.g., the first 1000 JFK flights via slice_head().)

Output: The time series plot.

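# Keep JFK departures with a valid timestamp, sort by time, and take the first 1000
jfk_flights <- flights_q2 %>%
  filter(origin == "JFK", !is.na(dep_datetime)) %>%
  arrange(dep_datetime) %>%
  slice_head(n = 1000) %>%
  select(dep_datetime, dep_delay)
# Build the zoo series: delays as values, departure datetimes as the index
jfk_zoo <- zoo(jfk_flights$dep_delay, order.by = jfk_flights$dep_datetime)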
## Warning in zoo(jfk_flights$dep_delay, order.by = jfk_flights$dep_datetime):
## some methods for "zoo" objects do not work if the index entries in 'order.by'
## are not unique
plot(jfk_zoo, main = "JFK flights: dep_delay (first 1000)", xlab = "Departure datetime", ylab = "Departure delay (minutes)")
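
The zoo warning arises because several flights can share the same departure minute, so the index is not unique. One possible way to address it, sketched here on made-up data, is to collapse duplicates to one value per timestamp before building the series. Averaging is an assumption; the assignment does not prescribe it. The zoo call is guarded so the sketch still runs where zoo is not installed.

```r
# Made-up example: two flights share the 05:40 timestamp
toy <- data.frame(
  dep_datetime = as.POSIXct(c("2013-01-01 05:40:00", "2013-01-01 05:40:00",
                              "2013-01-01 05:58:00"), tz = "UTC"),
  dep_delay = c(2, 6, -4)
)

# Collapse to one row per timestamp by averaging the delays (an assumed choice)
agg <- aggregate(dep_delay ~ dep_datetime, data = toy, FUN = mean)
agg <- agg[order(agg$dep_datetime), ]

# 0 means no duplicated index entries remain
anyDuplicated(agg$dep_datetime)

# Build the series only if zoo is available
if (requireNamespace("zoo", quietly = TRUE)) {
  toy_zoo <- zoo::zoo(agg$dep_delay, order.by = agg$dep_datetime)
  print(toy_zoo)
}
```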

Question 4: Working with Factors

Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().

Output: The levels, frequency table, and bar plot.

flights_q4 <- flights %>%
  mutate(origin_factor = factor(origin, levels = c("JFK", "LGA", "EWR")))
origin_levels <- levels(flights_q4$origin_factor)
origin_levels
## [1] "JFK" "LGA" "EWR"
origin_freq <- table(flights_q4$origin_factor)
origin_freq
## 
##    JFK    LGA    EWR 
## 111279 104662 120835
barplot(origin_freq, main = "Flights by Origin Airport (2013)", ylab = "Number of flights", xlab = "Origin", ylim = c(0, max(origin_freq) * 1.1))
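
The explicit levels argument matters because factor() otherwise sorts the levels alphabetically, which would put EWR first. A small sketch on a made-up vector (not the flights data) shows the difference.

```r
# Made-up sample of origin codes
origin <- c("EWR", "LGA", "JFK", "JFK")

default_order  <- factor(origin)                                   # alphabetical levels
explicit_order <- factor(origin, levels = c("JFK", "LGA", "EWR"))  # chosen order

levels(default_order)
levels(explicit_order)

# table() follows the level order, not the order of appearance
table(explicit_order)
```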

Question 5: Recoding Factors

Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.

Output: The new levels and bar plot.

flights_q5 <- flights_q4 %>%
  mutate(origin_recoded = fct_recode(origin_factor, "Kennedy" = "JFK", "LaGuardia" = "LGA", "Newark" = "EWR"))
levels(flights_q5$origin_recoded)
## [1] "Kennedy"   "LaGuardia" "Newark"
recoded_freq <- table(flights_q5$origin_recoded)
recoded_freq
## 
##   Kennedy LaGuardia    Newark 
##    111279    104662    120835
barplot(recoded_freq, main = "Flights by Origin (Recoded Names)", ylab = "Number of flights", xlab = "Origin (full name)", las = 1)
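
Since the question also allows base R, here is a sketch of one base R route: assigning new labels to the existing levels. It relies on the level order being JFK, LGA, EWR, as set in Question 4; the toy vector is illustrative.

```r
# Toy factor with the same level order as origin_factor in Question 4
origin_factor <- factor(c("EWR", "JFK", "LGA", "JFK"),
                        levels = c("JFK", "LGA", "EWR"))

# Copy first so the original factor is left untouched
origin_recoded <- origin_factor
# Labels are matched to levels by position, so the order must line up
levels(origin_recoded) <- c("Kennedy", "LaGuardia", "Newark")

# Cross-tabulation confirms each code maps to the intended name
table(origin_factor, origin_recoded)
```

Unlike fct_recode(), this approach does not name the old level next to the new one, so a mismatch in order would silently mislabel the airports.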

Question 6: Handling Missing Data

Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).

Output: NA counts, and the frequency table for imputed delays.

na_counts <- colSums(is.na(flights[, c("dep_delay", "arr_delay")]))
na_counts
## dep_delay arr_delay 
##      8255      9430
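# Replace missing departure delays with 0 (a strong assumption; see Question 7)
flights_q6 <- flights %>%
  mutate(dep_delay_imputed = ifelse(is.na(dep_delay), 0, dep_delay))
# Count flights at each delay value from -20 to 20 minutes
# (no NA check is needed here because the imputed column has no missing values)
freq_range <- flights_q6 %>%
  filter(dep_delay_imputed >= -20, dep_delay_imputed <= 20) %>%
  count(dep_delay_imputed) %>%
  arrange(dep_delay_imputed)
# Show the first 20 rows of the frequency table
head(freq_range, 20)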
## # A tibble: 20 × 2
##    dep_delay_imputed     n
##                <dbl> <int>
##  1               -20    37
##  2               -19    19
##  3               -18    81
##  4               -17   110
##  5               -16   162
##  6               -15   408
##  7               -14   498
##  8               -13   901
##  9               -12  1594
## 10               -11  2727
## 11               -10  5891
## 12                -9  7875
## 13                -8 11791
## 14                -7 16752
## 15                -6 20701
## 16                -5 24821
## 17                -4 24619
## 18                -3 24218
## 19                -2 21516
## 20                -1 18813
# Optionally show as table:
# table(flights_q6$dep_delay_imputed[flights_q6$dep_delay_imputed >= -20 & flights_q6$dep_delay_imputed <= 20])
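
To see why the zero-imputation assumption matters, it helps to compare summary statistics with and without it. The numbers below are invented for illustration and are not drawn from the flights data; the 15-minute on-time threshold is likewise an assumption for this sketch.

```r
# Invented delays (minutes); NA stands for a flight with no recorded departure
dep_delay <- c(10, 30, -5, 45, NA, NA)

# Option 1: drop missing values
mean_dropped <- mean(dep_delay, na.rm = TRUE)

# Option 2: impute 0, as in this question
dep_delay_imputed <- ifelse(is.na(dep_delay), 0, dep_delay)
mean_imputed <- mean(dep_delay_imputed)

# Share of flights within 15 minutes of schedule under each option
on_time_dropped <- mean(abs(dep_delay) <= 15, na.rm = TRUE)
on_time_imputed <- mean(abs(dep_delay_imputed) <= 15)

c(mean_dropped = mean_dropped, mean_imputed = mean_imputed)
c(on_time_dropped = on_time_dropped, on_time_imputed = on_time_imputed)
```

In this toy case the imputed mean is lower and the on-time share is higher, which is the direction of bias that Question 7 asks about.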

Question 7: Reflection (No Coding)

Reflect on the assignment: What was easy or hard about working with flight dates or missing data? How might assuming zero delay for missing values (Question 6) affect conclusions about flight punctuality? What did you learn about NYC flights in 2013? (150-200 words)

Working with the NYC flight data was kind of fun but also tricky at times. It was easy to combine the year, month, day, and time into one date column using lubridate and to see which day of the week flights happened. That part was cool because it made the data easier to understand. The harder part was dealing with missing values, especially for departure delays. I had to think about what it means to replace missing delays with zero.

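Assuming zero for the 8,255 missing departure delays could make it look like flights were on time more often than they really were, because cancelled flights get counted as perfectly punctual and the average delay gets pulled down. This might make the airport or airline seem better at being punctual than it actually was.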

From this assignment, I learned that departure delays in 2013 ranged from 43 minutes early to 1,301 minutes late, and that EWR had the most flights (120,835), followed by JFK (111,279) and LGA (104,662). Overall, I realized that handling dates and missing data correctly is really important if you want to understand flight patterns accurately.