This homework assignment uses the flights dataset from
the nycflights13 package, which contains real-world data on
over 336,000 flights departing from New York City airports (JFK, LGA,
EWR) in 2013. The dataset includes variables such as departure and
arrival times (with date components), airline carrier (categorical),
origin and destination airports (categorical), delays (with missing
values for cancelled flights), distance, and more. It is sourced from
the US Bureau of Transportation Statistics.
This assignment reinforces the Week 4 topics, including lubridate and zoo.
All questions (except the final reflection) require you to write and run R code. Submit the URL of your RPubs page. Comment your code and include key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.
Install and load the necessary packages if not already done:
#install.packages(c("nycflights13", "dplyr", "lubridate", "zoo", "forcats")) # If needed
library(nycflights13)
## Warning: package 'nycflights13' was built under R version 4.5.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
## Warning: package 'lubridate' was built under R version 4.5.3
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
## Warning: package 'zoo' was built under R version 4.5.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats) # For factor recoding; base R alternatives are acceptable
## Warning: package 'forcats' was built under R version 4.5.3
data(flights) # Load the dataset
Explore the data briefly with str(flights) and
head(flights) to understand the structure. Note: Dates are
in separate year, month, day
columns; times are in dep_time and arr_time
(as integers like 517 for 5:17 AM).
# Explore the structure of the flights dataset
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
# Preview the first few rows
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Create a column dep_datetime by combining year, month, day, and
dep_time into a POSIXct datetime using lubridate. (Hint: Use the
make_datetime() function to combine year, month, and day; for
hour and min, use integer division and modulo, e.g., hour = dep_time %/% 100, min =
dep_time %% 100.)
Show the first 5 rows of flights with dep_datetime.
Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.
# Use make_datetime() to build a proper POSIXct column from the integer dep_time.
# dep_time stores time as an integer like 517 meaning 5:17 AM.
# Integer division (%/% 100) extracts the hour, modulo (%% 100) extracts the minute.
flights <- flights %>%
  mutate(dep_datetime = make_datetime(year, month, day,
                                      hour = dep_time %/% 100,
                                      min = dep_time %% 100))
# Show the first 5 rows with the relevant columns
flights %>%
  select(year, month, day, dep_time, dep_datetime) %>%
  head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
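The hour and minute arithmetic from the hint can be checked by hand on a few values. The sketch below uses base R only; the dep_time values and the names dep_hour and dep_min are illustrative assumptions, not part of the assignment.

```r
# Illustrative dep_time values in HHMM integer form; NA stands for a missing time
dep_time <- c(517L, 1004L, 2359L, NA)

# Integer division by 100 drops the last two digits and leaves the hour
dep_hour <- dep_time %/% 100L

# Modulo 100 keeps only the last two digits, which are the minutes
dep_min <- dep_time %% 100L

# Side-by-side view; a missing dep_time stays NA in both derived columns
data.frame(dep_time, dep_hour, dep_min)
```

Because NA propagates through both operators, flights without a recorded dep_time end up with a missing dep_datetime instead of a misleading timestamp.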
Using dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using lubridate's wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.
Output: The table of flight counts by weekday.
# wday() extracts the day of the week from a datetime.
# label = TRUE returns an ordered factor with abbreviated day names (Sun, Mon, ..., Sat).
flights <- flights %>%
  mutate(weekday = wday(dep_datetime, label = TRUE))
# Show flight counts broken down by weekday
table(flights$weekday)
##
## Sun Mon Tue Wed Thu Fri Sat
## 45643 49468 49273 48858 48654 48703 37922
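The weekday counts are easier to compare as percentages. The sketch below re-enters the counts from the table above by hand, so it runs without the dataset; the names weekday_counts and weekday_share are assumptions for illustration. The counts sum to 328,521 rather than 336,776 because table() drops flights whose weekday is NA, and the gap of 8,255 matches the number of missing dep_delay values reported in Question 6.

```r
# Weekday counts copied from the table() output above
weekday_counts <- c(Sun = 45643, Mon = 49468, Tue = 49273, Wed = 48858,
                    Thu = 48654, Fri = 48703, Sat = 37922)

# Flights that have a weekday (those with a missing dep_datetime are excluded)
sum(weekday_counts)

# Share of flights per weekday, in percent, rounded to one decimal place
weekday_share <- round(100 * prop.table(weekday_counts), 1)
weekday_share

# Weekday with the fewest flights
names(which.min(weekday_counts))
```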
Filter for flights from JFK (origin == “JFK”) and create a zoo time series of departure delays (dep_delay) by dep_datetime. Plot the time series (use plot()). (Hint: Use a subset to avoid memory issues, e.g., first 1000 JFK flights.)
Output: The time series plot.
# Filter to only JFK flights and take the first 1000 to keep the plot readable
jfk_flights <- flights %>%
  filter(origin == "JFK") %>%
  head(1000)
# Create a zoo time series object: values = dep_delay, index = dep_datetime
jfk_zoo <- zoo(jfk_flights$dep_delay, order.by = jfk_flights$dep_datetime)
## Warning in zoo(jfk_flights$dep_delay, order.by = jfk_flights$dep_datetime):
## some methods for "zoo" objects do not work if the index entries in 'order.by'
## are not unique
# Plot the time series of departure delays for JFK flights
plot(jfk_zoo,
     main = "Departure Delays for JFK Flights (First 1000)",
     xlab = "Departure DateTime",
     ylab = "Departure Delay (minutes)",
     col = "steelblue")
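The warning from zoo() appears because several flights share the same departure minute, so the index is not unique. One possible way to avoid it, not required by the assignment, is to average the delays that share a timestamp before building the series. This is a minimal base R sketch on made-up rows; the toy data frame and the name unique_delays are assumptions for illustration.

```r
# Toy data with one repeated timestamp (values are made up)
toy <- data.frame(
  dep_datetime = as.POSIXct(c("2013-01-01 05:40:00",
                              "2013-01-01 05:40:00",
                              "2013-01-01 06:00:00"), tz = "UTC"),
  dep_delay = c(2, 6, -3)
)

# Collapse rows that share a timestamp to their mean delay
unique_delays <- aggregate(dep_delay ~ dep_datetime, data = toy, FUN = mean)
unique_delays

# The index is now unique, so it could be passed to zoo() without the warning:
# zoo::zoo(unique_delays$dep_delay, order.by = unique_delays$dep_datetime)
```

Averaging is only one choice; keeping the maximum delay per minute or adding a small offset to duplicated times would answer different questions.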
Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().
Output: The levels, frequency table, and bar plot.
# Convert the character origin column to a factor so R treats it as categorical
flights <- flights %>%
  mutate(origin_factor = factor(origin))
# Inspect the factor levels (alphabetical by default: EWR, JFK, LGA)
levels(flights$origin_factor)
## [1] "EWR" "JFK" "LGA"
# Frequency count of flights at each airport
table(flights$origin_factor)
##
## EWR JFK LGA
## 120835 111279 104662
# Bar plot to visualize the distribution of flights across the three NYC airports
barplot(table(flights$origin_factor),
        main = "Number of Flights by NYC Airport",
        xlab = "Airport",
        ylab = "Number of Flights",
        col = c("tomato", "steelblue", "seagreen"),
        ylim = c(0, 140000))
Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.
Output: The new levels and bar plot.
# fct_recode() replaces factor level labels while keeping the underlying structure.
# The syntax is fct_recode(factor, "new_name" = "old_name").
flights <- flights %>%
  mutate(origin_recoded = fct_recode(origin_factor,
                                     "Kennedy" = "JFK",
                                     "LaGuardia" = "LGA",
                                     "Newark" = "EWR"))
# Confirm the new levels now show full airport names
levels(flights$origin_recoded)
## [1] "Newark" "Kennedy" "LaGuardia"
# Bar plot with the full airport names for a more readable presentation
barplot(table(flights$origin_recoded),
        main = "Number of Flights by Airport (Full Names)",
        xlab = "Airport",
        ylab = "Number of Flights",
        col = c("tomato", "steelblue", "seagreen"),
        ylim = c(0, 140000))
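The question also allows a base R solution. A minimal sketch of that alternative is shown below on a short made-up vector; the toy origin values are assumptions for illustration. It relies on factor() pairing each entry of levels with the entry of labels in the same position.

```r
# Short made-up vector of airport codes
origin <- c("EWR", "JFK", "LGA", "JFK")

# levels lists the existing codes; labels gives the replacement name for each,
# matched by position
origin_recoded <- factor(origin,
                         levels = c("EWR", "JFK", "LGA"),
                         labels = c("Newark", "Kennedy", "LaGuardia"))

levels(origin_recoded)
table(origin_recoded)
```

The level order follows the levels argument, which gives the same Newark, Kennedy, LaGuardia order that fct_recode() produced above.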
Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).
Output: NA counts, and the frequency table for imputed delays.
# Count NA values in dep_delay and arr_delay.
# Missing dep_delay values correspond to cancelled flights; arr_delay has more NAs
# because some flights that departed have no recorded arrival delay.
colSums(is.na(flights[, c("dep_delay", "arr_delay")]))
## dep_delay arr_delay
## 8255 9430
# Impute missing dep_delay with 0, treating cancellations as zero-delay.
# if_else() keeps the original value when it exists; otherwise fills with 0.
flights <- flights %>%
  mutate(dep_delay_imputed = if_else(is.na(dep_delay), 0, dep_delay))
# Filter to delays between -20 and 20 minutes to focus on near-on-time flights,
# then build a frequency table of exact delay values in that window.
flights %>%
  filter(dep_delay_imputed >= -20 & dep_delay_imputed <= 20) %>%
  pull(dep_delay_imputed) %>%
  table()
## .
## -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8
## 37 19 81 110 162 408 498 901 1594 2727 5891 7875 11791
## -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16752 20701 24821 24619 24218 21516 18813 24769 8050 6233 5450 4807 4447
## 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3789 3520 3381 3062 2859 2756 2494 2414 2256 2140 2085 1873 1749
## 19 20
## 1730 1704
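Imputing 0 changes summary statistics, which a small made-up example makes visible. The sketch below compares the mean delay and the on-time share with and without imputation, and keeps a separate cancelled flag as one possible alternative. The toy values and all variable names are assumptions for illustration, not results from the flights data.

```r
# Made-up delays in minutes; NA marks a hypothetical cancelled flight
dep_delay <- c(10, -5, NA, 30, NA, 5)

# Keep cancellations as their own flag instead of hiding them in the delays
cancelled <- is.na(dep_delay)

# Same imputation rule as above: a missing delay becomes 0
dep_delay_imputed <- ifelse(cancelled, 0, dep_delay)

# Mean delay: excluding cancellations vs. counting them as 0
mean_excluding <- mean(dep_delay, na.rm = TRUE)
mean_imputed <- mean(dep_delay_imputed)

# Share of flights leaving on time or early (delay <= 0)
on_time_excluding <- mean(dep_delay <= 0, na.rm = TRUE)
on_time_imputed <- mean(dep_delay_imputed <= 0)

c(mean_excluding = mean_excluding, mean_imputed = mean_imputed,
  on_time_excluding = on_time_excluding, on_time_imputed = on_time_imputed)
```

In this toy case the imputed version has a lower mean delay and a higher on-time share, because each cancelled flight is counted as a departure with zero delay.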
Working with the flight date information was relatively easy once I
learned that dep_time is stored as a simple
integer representing the time of day (e.g., 517 means 5:17
AM) rather than as a string. Using make_datetime() along with integer
division and the modulus operator to extract hours and minutes felt
somewhat indirect at first, but I quickly understood the logic. The
most challenging part was creating the
zoo time series in Question 3; I had to spend extra time
thinking about what to use as the index. Because I limited the
number of JFK flights to the first 1,000, the plot was much more readable
than it would have been had I plotted all JFK flights at once.
I found the missing data in Question 6 interesting. Using 0 as the delay for all cancelled flights is a choice that could affect the final findings in a meaningful way. Since cancelled flights did not depart at all, treating them as having departed on time by imputing 0 would artificially inflate on-time performance for airlines and airports. A better approach might be to keep those records as NA and exclude them from summary statistics, or to classify cancelled flights in a separate category.
The data clearly show that New York City airports are very busy: JFK, LGA, and EWR together handled more than 336,000 departures in 2013. The distribution of flights across days in Question 2 shows that Saturday has the fewest flights, plausibly because business travel is light on Saturdays.