This homework assignment uses the flights dataset from
the nycflights13 package, which contains real-world data on
over 336,000 flights departing from New York City airports (JFK, LGA,
EWR) in 2013. The dataset includes variables such as departure and
arrival times (with date components), airline carrier (categorical),
origin and destination airports (categorical), delays (with missing
values for cancelled flights), distance, and more. It is sourced from
the US Bureau of Transportation Statistics.
This assignment reinforces the Week 4 topics, including lubridate and zoo.
All questions (except the final reflection) require you to write and run R code to solve them. Submit the URL of your RPubs page. Make sure to comment your code and include key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.
Install and load the necessary packages if not already done:
#install.packages(c("nycflights13", "dplyr", "lubridate", "zoo", "forcats")) # If needed
library(nycflights13)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats) # For factor recoding; base R alternatives are acceptable
data(flights) # Load the dataset
Explore the data briefly with str(flights) and
head(flights) to understand the structure. Note: Dates are
in separate year, month, day
columns; times are in dep_time and arr_time
(as integers like 517 for 5:17 AM).
#Explore your data here
summary(flights)
## year month day dep_time sched_dep_time
## Min. :2013 Min. : 1.000 Min. : 1.00 Min. : 1 Min. : 106
## 1st Qu.:2013 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.: 907 1st Qu.: 906
## Median :2013 Median : 7.000 Median :16.00 Median :1401 Median :1359
## Mean :2013 Mean : 6.549 Mean :15.71 Mean :1349 Mean :1344
## 3rd Qu.:2013 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:1744 3rd Qu.:1729
## Max. :2013 Max. :12.000 Max. :31.00 Max. :2400 Max. :2359
## NA's :8255
## dep_delay arr_time sched_arr_time arr_delay
## Min. : -43.00 Min. : 1 Min. : 1 Min. : -86.000
## 1st Qu.: -5.00 1st Qu.:1104 1st Qu.:1124 1st Qu.: -17.000
## Median : -2.00 Median :1535 Median :1556 Median : -5.000
## Mean : 12.64 Mean :1502 Mean :1536 Mean : 6.895
## 3rd Qu.: 11.00 3rd Qu.:1940 3rd Qu.:1945 3rd Qu.: 14.000
## Max. :1301.00 Max. :2400 Max. :2359 Max. :1272.000
## NA's :8255 NA's :8713 NA's :9430
## carrier flight tailnum origin
## Length:336776 Min. : 1 Length:336776 Length:336776
## Class :character 1st Qu.: 553 Class :character Class :character
## Mode :character Median :1496 Mode :character Mode :character
## Mean :1972
## 3rd Qu.:3465
## Max. :8500
##
## dest air_time distance hour
## Length:336776 Min. : 20.0 Min. : 17 Min. : 1.00
## Class :character 1st Qu.: 82.0 1st Qu.: 502 1st Qu.: 9.00
## Mode :character Median :129.0 Median : 872 Median :13.00
## Mean :150.7 Mean :1040 Mean :13.18
## 3rd Qu.:192.0 3rd Qu.:1389 3rd Qu.:17.00
## Max. :695.0 Max. :4983 Max. :23.00
## NA's :9430
## minute time_hour
## Min. : 0.00 Min. :2013-01-01 05:00:00
## 1st Qu.: 8.00 1st Qu.:2013-04-04 13:00:00
## Median :29.00 Median :2013-07-03 10:00:00
## Mean :26.23 Mean :2013-07-03 05:22:54
## 3rd Qu.:44.00 3rd Qu.:2013-10-01 07:00:00
## Max. :59.00 Max. :2013-12-31 23:00:00
##
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Create a column dep_datetime by combining year, month, day, and
dep_time into a POSIXct datetime using lubridate. (Hint: Use the
make_datetime() function with year, month, and day; for the hour and
minute, split dep_time with integer division and remainder, e.g.,
hour = dep_time %/% 100, min = dep_time %% 100.)
Show the first 5 rows of flights with dep_datetime.
Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.
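# Create a column dep_datetime
flights <- flights %>%
  # Drop rows with missing dep_time (cancelled flights). make_datetime() would
  # return NA for these rows rather than fail, and overwriting flights here removes
  # them from every later question, including the NA counts in Question 6.
  filter(!is.na(dep_time)) %>%
  mutate(dep_datetime = make_datetime(year, month, day, hour = dep_time %/% 100, min = dep_time %% 100))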
# Show first 5 rows
flights %>%
  select(year, month, day, dep_time, dep_datetime) %>%
  head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
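As a minimal sketch of the arithmetic behind the hint, the toy vector below (hypothetical values, not rows taken from flights) splits HHMM integers into hours and minutes before calling make_datetime(). It assumes all three times fall on 2013-01-01 and relies on the function's default UTC time zone.

```r
library(lubridate)

# Hypothetical clock times stored as HHMM integers, like flights$dep_time
dep_time <- c(517L, 1401L, 5L)

hours   <- dep_time %/% 100  # integer division keeps the hour part: 5, 14, 0
minutes <- dep_time %% 100   # remainder keeps the minute part: 17, 1, 5

# Scalar year/month/day are recycled across the three times
dep_datetime <- make_datetime(2013, 1, 1, hour = hours, min = minutes)
format(dep_datetime, "%Y-%m-%d %H:%M")
```

The summary above shows a maximum dep_time of 2400, which this arithmetic turns into hour 24 and minute 0. make_datetime() should roll that over to midnight of the following day, so those rows are worth checking by hand.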
Using dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.
Output: The table of flight counts by weekday.
# Add weekday column
flights <- flights %>%
  mutate(weekday = wday(dep_datetime, label = TRUE))
# Frequency table
table(flights$weekday)
##
## Sun Mon Tue Wed Thu Fri Sat
## 45643 49468 49273 48858 48654 48703 37922
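The table above starts the week on Sunday, which is the default for wday(). The sketch below, using three hypothetical dates rather than the flights data, shows how the week_start argument could move Monday to the front. The printed labels depend on the session locale.

```r
library(lubridate)

# Hypothetical dates: a Tuesday, a Saturday, and a Sunday in January 2013
d <- as.Date(c("2013-01-01", "2013-01-05", "2013-01-06"))

# week_start = 1 makes Monday the first level of the ordered factor
wd <- wday(d, label = TRUE, week_start = 1)
levels(wd)
table(wd)

# Without label = TRUE, wday() returns the position within the week
wd_num <- wday(d, week_start = 1)
```

Because the result is an ordered factor, table() and barplot() keep the days in calendar order instead of sorting them alphabetically.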
Filter for flights from JFK (origin == "JFK") and create a zoo time series of departure delays (dep_delay) indexed by dep_datetime. Plot the time series (use plot()). (Hint: Use a subset to avoid memory issues, e.g., the first 1000 JFK flights.)
Output: The time series plot.
# Filter for JFK flights
jfk_data <- flights[flights$origin == "JFK", ]
# Subset the first 1000 rows to avoid memory issues
jfk_subset <- jfk_data[1:1000, ]
# Create zoo object
jfk_ts <- zoo(jfk_subset$dep_delay, jfk_subset$dep_datetime)
## Warning in zoo(jfk_subset$dep_delay, jfk_subset$dep_datetime): some methods for
## "zoo" objects do not work if the index entries in 'order.by' are not unique
# Plot
plot(jfk_ts,
     main = "Departure Delays at JFK (First 1000 Flights)",
     ylab = "Delay (minutes)",
     xlab = "Date/Time",
     col = "blue")
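The zoo warning appears because several flights depart in the same minute, so the index is not unique. One possible way to handle this, sketched here on hypothetical toy data, is to average the delays that share a timestamp with aggregate(). Using the mean is an assumption; a median or maximum could suit the question equally well.

```r
library(zoo)

# Hypothetical toy data: the first two flights share a departure minute
times <- as.POSIXct(c("2013-01-01 05:40:00", "2013-01-01 05:40:00",
                      "2013-01-01 06:00:00"), tz = "UTC")
delays <- c(2, 10, -3)

# zoo() warns about the duplicated index; silence it for this sketch
z_dup <- suppressWarnings(zoo(delays, times))

# identity keeps each timestamp as its own group, and mean averages
# the delays that fall into the same group
z_unique <- aggregate(z_dup, identity, mean)
z_unique
```

The same call could be applied to jfk_ts before plotting, giving one point per distinct departure time.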
Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().
Output: The levels, frequency table, and bar plot.
# Convert to factor
flights <- flights %>%
  mutate(origin_factor = factor(origin))
# Levels and Table
levels(flights$origin_factor)
## [1] "EWR" "JFK" "LGA"
table(flights$origin_factor)
##
## EWR JFK LGA
## 117596 109416 101509
# Bar plot
barplot(table(flights$origin_factor),
        main = "Flights per Origin Airport",
        col = "green",
        ylab = "Number of Flights")
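factor() sorts levels alphabetically by default, which is why EWR comes first in the output above. The sketch below uses a hypothetical six-element vector to show how an explicit levels argument could control the order of the table and the bars.

```r
# Hypothetical origins, not rows from flights
origin <- c("JFK", "LGA", "EWR", "JFK", "EWR", "EWR")

f_default <- factor(origin)                                  # alphabetical levels
f_custom  <- factor(origin, levels = c("JFK", "LGA", "EWR")) # explicit order

levels(f_default)
table(f_custom)
```

The counts are the same either way; only the order in which table() and barplot() present them changes.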
Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.
Output: The new levels and bar plot.
# Recode levels
flights <- flights %>%
  mutate(origin_recoded = fct_recode(origin_factor,
                                     "Newark" = "EWR",
                                     "Kennedy" = "JFK",
                                     "LaGuardia" = "LGA"))
# Show new levels and plot
levels(flights$origin_recoded)
## [1] "Newark" "Kennedy" "LaGuardia"
barplot(table(flights$origin_recoded),
        main = "NYC Flights by Airport",
        col = "red")
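The question also allows a base R solution. A minimal sketch of that alternative, on a hypothetical four-element factor, replaces the levels through a named lookup vector. The name lookup is illustrative and not part of the assignment.

```r
# Hypothetical factor standing in for flights$origin_factor
origin_factor <- factor(c("JFK", "LGA", "EWR", "JFK"))

# Map each airport code to its full name
lookup <- c(EWR = "Newark", JFK = "Kennedy", LGA = "LaGuardia")

# Replace the levels in place; unname() drops the codes used as names
origin_recoded <- origin_factor
levels(origin_recoded) <- unname(lookup[levels(origin_recoded)])

levels(origin_recoded)
```

Indexing lookup by the existing levels keeps each new name matched to its code regardless of the order in which the lookup vector is written.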
Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).
Output: NA counts, and the frequency table for imputed delays.
# Count missing values per column
colSums(is.na(flights))
## year month day dep_time sched_dep_time
## 0 0 0 0 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 0 458 0 1175 0
## flight tailnum origin dest air_time
## 0 0 0 0 1175
## distance hour minute time_hour dep_datetime
## 0 0 0 0 0
## weekday origin_factor origin_recoded
## 0 0 0
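# Impute missing dep_delay values with 0: if is.na() is TRUE use 0, otherwise keep the original value.
# Note: no NAs remain in dep_delay here because Question 1 dropped rows with a missing dep_time.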
flights$dep_delay_imputed <- ifelse(is.na(flights$dep_delay), 0, flights$dep_delay)
# Create a frequency table for delays between -20 and 20 minutes
# Using bracket indexing to subset the data
delay_subset <- flights[flights$dep_delay_imputed >= -20 & flights$dep_delay_imputed <= 20, ]
table(delay_subset$dep_delay_imputed)
##
## -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8
## 37 19 81 110 162 408 498 901 1594 2727 5891 7875 11791
## -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16752 20701 24821 24619 24218 21516 18813 16514 8050 6233 5450 4807 4447
## 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3789 3520 3381 3062 2859 2756 2494 2414 2256 2140 2085 1873 1749
## 19 20
## 1730 1704
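To see how zero imputation can shift a summary, the sketch below compares means on a hypothetical five-flight data frame and keeps a separate flag for the missing rows. It assumes, as the question does, that a missing dep_delay means the flight was cancelled; the column name cancelled is illustrative.

```r
# Hypothetical delays in minutes; NA stands for a cancelled flight
toy <- data.frame(dep_delay = c(-3, NA, 12, NA, 0))

# Keep track of which rows were missing before imputing
toy$cancelled <- is.na(toy$dep_delay)
toy$dep_delay_imputed <- ifelse(toy$cancelled, 0, toy$dep_delay)

mean_observed <- mean(toy$dep_delay, na.rm = TRUE)  # flown flights only: 3
mean_imputed  <- mean(toy$dep_delay_imputed)        # cancellations as 0: 1.8
share_cancelled <- mean(toy$cancelled)              # 0.4
```

With the flag in place, the cancellation rate can be reported alongside the average delay instead of being absorbed into it.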
Reflect on the assignment: What was easy or hard about working with flight dates or missing data? How might assuming zero delay for missing values (Question 6) affect conclusions about flight punctuality? What did you learn about NYC flights in 2013? (150-200 words)
Converting dep_time was the most challenging part. Because R sees 517 as a number rather than a time, splitting it with the base R operators %/% and %% before passing the pieces to lubridate's make_datetime() was a necessary trick. Treating missing delays as 0 is a risky assumption. In this dataset, NA usually means the flight was cancelled. By marking it as 0 (on time), we artificially improve the airline’s performance metrics. If we were studying airport efficiency, we would be ignoring the worst-case scenario: a cancellation. It would be better to flag these flights as “Cancelled” or track them in a separate column. I learned that NYC flight traffic is distributed fairly evenly across the three major airports and that Newark (EWR) handles a surprisingly high volume (117,596 departures versus 109,416 at JFK and 101,509 at LaGuardia, after dropping flights with no recorded departure time).