This homework assignment uses the flights dataset from
the nycflights13 package, which contains real-world data on
over 336,000 flights departing from New York City airports (JFK, LGA,
EWR) in 2013. The dataset includes variables such as departure and
arrival times (with date components), airline carrier (categorical),
origin and destination airports (categorical), delays (with missing
values for cancelled flights), distance, and more. It is sourced from
the US Bureau of Transportation Statistics.
This assignment reinforces the Week 4 topics, including working with dates in lubridate and time series in zoo.
All questions (except the final reflection) require you to write and run R code. Submit the URL of your RPubs page. Comment your code and include key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.
Install and load the necessary packages if not already done:
#install.packages(c("nycflights13", "dplyr", "lubridate", "zoo", "forcats")) # If needed
library(nycflights13)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(zoo)
## Warning: package 'zoo' was built under R version 4.5.2
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats) # For factor recoding; base R alternatives are acceptable
data(flights) # Load the dataset
Explore the data briefly with str(flights) and
head(flights) to understand the structure. Note: Dates are
in separate year, month, day
columns; times are in dep_time and arr_time
(as integers like 517 for 5:17 AM).
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
Question 1 (lubridate): Create a column dep_datetime by combining year, month, day, and
dep_time into a POSIXct datetime using lubridate. (Hint: Use the
make_datetime() function to combine year, month, and day; for the
hour and minute use integer division and the remainder, e.g.,
hour = dep_time %/% 100, min = dep_time %% 100.)
Show the first 5 rows of flights with dep_datetime.
Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.
flights <- flights |>
  mutate(dep_datetime = make_datetime(year, month, day,
                                      hour = dep_time %/% 100, # integer division gives the hour
                                      min = dep_time %% 100))  # remainder gives the minute
flights |>
  select(year, month, day, dep_time, dep_datetime) |>
  head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
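The arithmetic in the hint can be checked on a few hand-picked values before applying it to the full table. The following is a minimal sketch in base R; the vector dep_time_example is a made-up sample, not data taken from flights.

```r
# Hypothetical sample of HHMM-style integers (not taken from flights)
dep_time_example <- c(517L, 533L, 1004L, 2400L, NA)

# Integer division by 100 drops the last two digits, leaving the hour
hour_part <- dep_time_example %/% 100

# The remainder after dividing by 100 keeps the last two digits, the minute
minute_part <- dep_time_example %% 100

print(data.frame(dep_time_example, hour_part, minute_part))
```

In this sketch a value of 2400 gives hour 24 and minute 0, which make_datetime() would roll over to midnight of the following day, and a missing dep_time stays NA in both parts.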
Question 2 (lubridate): Using dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.
Output: The table of flight counts by weekday.
flights <- flights |>
  mutate(weekday = wday(dep_datetime, label = TRUE))
table(flights$weekday)
##
## Sun Mon Tue Wed Thu Fri Sat
## 43926 44492 48107 48959 49192 48982 44863
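By default table() drops NA, so flights with a missing dep_datetime do not appear in the weekday counts. A small sketch, assuming made-up datetimes rather than rows from flights, shows how useNA = "ifany" makes those cases visible:

```r
library(lubridate)

# Hypothetical datetimes; the last one is missing, as for a cancelled flight
example_dt <- make_datetime(2013, 1, c(1, 2, 6, NA), hour = 9, min = 0)

# Numeric weekday: with lubridate's default week start, 1 = Sunday
weekday_num <- wday(example_dt)

# Labelled weekday (the label text depends on the session locale)
weekday_lab <- wday(example_dt, label = TRUE)

# useNA = "ifany" adds a column that counts the missing values
print(table(weekday_lab, useNA = "ifany"))
```

January 1, 2013 was a Tuesday, so the first numeric value is 3 under the default Sunday-first numbering.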
Question 3: Filter for flights from JFK (origin == "JFK") and create a zoo time series of departure delays (dep_delay) indexed by dep_datetime. Plot the time series (use plot()). (Hint: Use a subset to avoid memory issues, e.g., the first 1000 JFK flights using slice_head().)
Output: The time series plot.
flights_jfk <- flights |>
  filter(origin == "JFK") |>
  slice_head(n = 1000)
flights_jfk_ts <- zoo(flights_jfk$dep_delay, flights_jfk$dep_datetime)
## Warning in zoo(flights_jfk$dep_delay, flights_jfk$dep_datetime): some methods
## for "zoo" objects do not work if the index entries in 'order.by' are not unique
plot(flights_jfk_ts, main = "Departure Delays at JFK Airport", ylab = "Delay (minutes)", xlab = "Date")
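The zoo warning appears because several flights can share the same departure minute, so the index is not unique. One possible way around it, sketched here on made-up data (the data frame example and its columns when and delay are hypothetical), is to average the delays per timestamp before building the series:

```r
library(zoo)

# Hypothetical delays; two flights share the same departure minute
example <- data.frame(
  when  = as.POSIXct(c("2013-01-01 05:42:00", "2013-01-01 05:42:00",
                       "2013-01-01 05:44:00"), tz = "UTC"),
  delay = c(2, 6, -1)
)

# Average the delays that share a timestamp so each index value is unique
# (the formula interface also drops rows with missing values by default)
per_minute <- aggregate(delay ~ when, data = example, FUN = mean)

# Build the series from the de-duplicated table
delay_ts <- zoo(per_minute$delay, order.by = per_minute$when)
print(delay_ts)
```

Averaging is only one choice; keeping the maximum delay or the first flight per minute would also give a unique index, and which is appropriate depends on the question being asked.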
Question 4: Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().
Output: The levels, frequency table, and bar plot.
flights <- flights |>
  mutate(origin_factor = factor(origin))
levels(flights$origin_factor)
## [1] "EWR" "JFK" "LGA"
table(flights$origin_factor)
##
## EWR JFK LGA
## 120835 111279 104662
barplot(table(flights$origin_factor))
Question 5: Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.
Output: The new levels and bar plot.
library(forcats)
flights <- flights |>
  mutate(origin_recoded = fct_recode(origin_factor,
                                     "Kennedy" = "JFK",   # fct_recode() takes new = old
                                     "LaGuardia" = "LGA",
                                     "Newark" = "EWR"))
levels(flights$origin_recoded)
## [1] "Newark"    "Kennedy"   "LaGuardia"
barplot(table(flights$origin_recoded))
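The question also allows a base R solution. A minimal sketch, using a short made-up origin vector and a hypothetical lookup vector named airport_names, replaces the levels through a named lookup:

```r
# Hypothetical sample of airport codes (not taken from flights)
origin_example <- c("EWR", "LGA", "JFK", "JFK")
origin_factor_example <- factor(origin_example)

# Lookup from old code to full name
airport_names <- c(JFK = "Kennedy", LGA = "LaGuardia", EWR = "Newark")

# Replace each level by its full name, keeping the existing level order
origin_recoded_example <- origin_factor_example
levels(origin_recoded_example) <- unname(airport_names[levels(origin_factor_example)])

print(levels(origin_recoded_example))
print(table(origin_recoded_example))
```

Indexing the lookup by the current levels keeps the mapping correct regardless of the order in which the names are listed.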
Question 6: Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).
Output: NA counts, and the frequency table for imputed delays.
colSums(is.na(flights))
## year month day dep_time sched_dep_time
## 0 0 0 8255 0
## dep_delay arr_time sched_arr_time arr_delay carrier
## 8255 8713 0 9430 0
## flight tailnum origin dest air_time
## 0 2512 0 0 9430
## distance hour minute time_hour dep_datetime
## 0 0 0 0 8255
## weekday origin_factor origin_recoded
## 8255 0 0
flights_mean <- flights |>
  mutate(dep_delay_imputed = ifelse(is.na(dep_delay), 0, dep_delay))
flights_mean |>
  filter(dep_delay_imputed >= -20 & dep_delay_imputed <= 20)
## # A tibble: 275,102 × 24
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 275,092 more rows
## # ℹ 16 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_datetime <dttm>,
## # weekday <ord>, origin_factor <fct>, origin_recoded <fct>,
## # dep_delay_imputed <dbl>
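The filtered tibble above lists the matching rows; the frequency table the question asks for needs one more step. A minimal sketch on a made-up delay vector (dep_delay_example is hypothetical, not data from flights) shows the idea with table():

```r
# Hypothetical delays in minutes; NA marks a flight with no recorded departure
dep_delay_example <- c(2, 4, NA, -1, 35, NA, 2, -25)

# Impute missing delays with 0, as the question specifies
imputed_example <- ifelse(is.na(dep_delay_example), 0, dep_delay_example)

# Keep only delays between -20 and 20 minutes
in_range <- imputed_example >= -20 & imputed_example <= 20

# Count how often each remaining delay value occurs
freq_example <- table(imputed_example[in_range])
print(freq_example)
```

On the real data the same result could be obtained by applying table() to the dep_delay_imputed column of the filtered tibble.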
Question 7 (reflection): Reflect on the assignment: What was easy or hard about working with flight dates or missing data? How might assuming zero delay for missing values (Question 6) affect conclusions about flight punctuality? What did you learn about NYC flights in 2013? (150-200 words)
This assignment was hard for me, especially working with the flight dates and missing data. The factor recoding and the frequency tables were the hardest parts. Filtering and renaming the variables felt tricky because of the mutate() function. mutate() was, and still is, the hardest function for me to wrap my head around, and relying on it in homework and future projects makes it hard for me to complete them well. One thing that was easier was Question 3, where we had to build a time series and plot it. It made more sense to me, the coding was simpler, and I could interpret what the code was actually doing. If we assume zero delay for missing values, flight punctuality appears better than it actually is. This biases the results and can lead to inaccurate conclusions, because the flights behind those missing values (for example, cancelled flights) did not necessarily leave on time. From this homework I learned that delays are very common at these major airports, with many departures running 2-5 minutes behind schedule. Something I would like to look into further is how delays vary by month, for example whether winter months have more delays than summer months.
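The effect of the zero-delay assumption can be illustrated with a small made-up example. The numbers below are hypothetical and only show the direction of the bias, not its size in the flights data:

```r
# Hypothetical delays in minutes; NA marks flights with no recorded departure
delays_example <- c(10, 30, NA, NA, 20)

# Average over the flights that actually departed
mean_observed <- mean(delays_example, na.rm = TRUE)

# Average after treating every missing value as an on-time departure
delays_zero_filled <- ifelse(is.na(delays_example), 0, delays_example)
mean_zero_filled <- mean(delays_zero_filled)

print(c(observed = mean_observed, zero_filled = mean_zero_filled))
```

Here the average drops from 20 to 12 minutes once the two missing values are counted as on time, which is the sense in which the imputation makes punctuality look better.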