This homework assignment uses the flights dataset from
the nycflights13 package, which contains real-world data on
over 336,000 flights departing from New York City airports (JFK, LGA,
EWR) in 2013. The dataset includes variables such as departure and
arrival times (with date components), airline carrier (categorical),
origin and destination airports (categorical), delays (with missing
values for cancelled flights), distance, and more. It is sourced from
the US Bureau of Transportation Statistics.
This assignment reinforces the Week 4 topics:
lubridate.zoo.All questions (except the final reflection) require you to write and run R code to solve them. Submit your URL for your RPubs. Make sure to comment your code, along with key outputs (e.g., summaries, plots, or tables). Use the provided setup code to load the data.
Install and load the necessary packages if not already done:
#install.packages(c("tidyverse", nycflights13", "dplyr", "lubridate", "zoo", "forcats"))
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.5.2
## ✔ ggplot2 4.0.0 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(nycflights13)
library(dplyr)
library(lubridate)
library(zoo)
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(forcats) # For factor recoding; base R alternatives are acceptable
data(flights) # Load the dataset
Explore the data briefly with str(flights) and
head(flights) to understand the structure. Note: Dates are
in separate year, month, day
columns; times are istr(flights)n dep_time and
arr_time (as integers like 517 for 5:17 AM).
#Explore your data here
str(flights)
## tibble [336,776 × 19] (S3: tbl_df/tbl/data.frame)
## $ year : int [1:336776] 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int [1:336776] 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int [1:336776] 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int [1:336776] 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num [1:336776] 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int [1:336776] 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int [1:336776] 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num [1:336776] 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr [1:336776] "UA" "UA" "AA" "B6" ...
## $ flight : int [1:336776] 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr [1:336776] "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr [1:336776] "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr [1:336776] "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num [1:336776] 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num [1:336776] 1400 1416 1089 1576 762 ...
## $ hour : num [1:336776] 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num [1:336776] 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct[1:336776], format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
head(flights)
## # A tibble: 6 × 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>
lubridateCreate a column dep_datetime by combining year, month, day, and
dep_time into a POSIXct datetime using lubridate. (Hint: Use
make_datetime function to combine: year, month, day, for
hour and min use division, e.g., hour = dep_time %/% 100, min = dep_time
%% 100.)
Show the first 5 rows of flights with dep_datetime.
Output: First 5 rows showing year, month, day, dep_time, and dep_datetime.
flights2 <- flights |>
mutate (dep_datetime = make_datetime(year, month, day, hour=dep_time %/% 100, min=dep_time %% 100))
select(flights2,year,month,day,dep_time,dep_datetime) |>
head(5)
## # A tibble: 5 × 5
## year month day dep_time dep_datetime
## <int> <int> <int> <int> <dttm>
## 1 2013 1 1 517 2013-01-01 05:17:00
## 2 2013 1 1 533 2013-01-01 05:33:00
## 3 2013 1 1 542 2013-01-01 05:42:00
## 4 2013 1 1 544 2013-01-01 05:44:00
## 5 2013 1 1 554 2013-01-01 05:54:00
lubridateUsing dep_datetime from Question 1, create a column weekday with the day of the week (e.g., “Mon”) using wday(dep_datetime, label = TRUE). Use table() to show how many flights occur on each weekday.
Output: The table of flight counts by weekday.
flights_table <- flights2 |>
mutate ( weekday = wday(dep_datetime, label = TRUE))
table(flights_table$weekday)
##
## Sun Mon Tue Wed Thu Fri Sat
## 45643 49468 49273 48858 48654 48703 37922
Filter for flights from JFK (origin == “JFK”) and create a zoo time series of departure delays (dep_delay) by dep_datetime. Plot the time series (use plot()). (Hint: Use a subset to avoid memory issues, e.g., first 1000 JFK flights using `slice_head().)
Output: The time series plot.
flights3 <- flights2 |>
filter (origin == "JFK") |>
slice_head(n = 1000) # flights3 contains just the first 1000 rows of the flights from JFK
zoo (flights3$dep_delay, flights3$dep_datetime) |> # create the zoo time series and plot it
plot(main = "Departure Delays", xlab = "Date", ylab = "Delay (min)", type = "p", col="blue", pch=20, cex=0.8) |>
points(col = "red", pch = 20, cex = 0.4)
## Warning in zoo(flights3$dep_delay, flights3$dep_datetime): some methods for
## "zoo" objects do not work if the index entries in 'order.by' are not unique
Convert the origin column (airports: “JFK”, “LGA”, “EWR”) to a factor called origin_factor. Show the factor levels with levels() and create a frequency table with table(). Make a bar plot of flights by airport using barplot().
Output: The levels, frequency table, and bar plot.
flights2$origin_factor = factor(flights2$origin) # convert to factor for table to work
levels(flights2$origin_factor) # show levels
## [1] "EWR" "JFK" "LGA"
flights_by_airport <-table(flights2$origin_factor) # create table
flights_by_airport # show it
##
## EWR JFK LGA
## 120835 111279 104662
barplot(flights_by_airport, # create barplot
xlab = "Airport",
ylab = "No. Flights",
main = "Flights by Airport",
col = c("red", "blue", "green")
)
Recode origin_factor from Question 4 into a new column origin_recoded with full names: “JFK” to “Kennedy”, “LGA” to “LaGuardia”, “EWR” to “Newark” using fct_recode() or base R. Create a bar plot of the recoded factor.
Output: The new levels and bar plot.
# recode using base-R
flights2$origin_recoded [flights2$origin_factor=="JFK"] <- "Kennedy"
## Warning: Unknown or uninitialised column: `origin_recoded`.
flights2$origin_recoded [flights2$origin_factor=="LGA"] <- "LaGuardia"
flights2$origin_recoded [flights2$origin_factor=="EWR"] <- "Newark"
flights2$origin_recoded = factor(flights2$origin_recoded) # convert to factor again for table to work
table(flights2$origin_recoded) |>
barplot(
xlab = "Airport",
ylab = "No. Flights",
main = "Flights by Airport",
col = c("red", "blue", "green")
)
# try forcats fctrecode: have to use mutate to generate new column origin_recoded2. Yes it works since origin_recoded2=origin_recoded
mutate ( flights2, origin_recoded2 = fct_recode (origin_factor, "Kennedy" = "JFK", "LaGuardia"="LGA", "Newark"="EWR"))
## # A tibble: 336,776 × 23
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ℹ 336,766 more rows
## # ℹ 15 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## # hour <dbl>, minute <dbl>, time_hour <dttm>, dep_datetime <dttm>,
## # origin_factor <fct>, origin_recoded <fct>, origin_recoded2 <fct>
Count missing values in dep_delay and arr_delay using colSums(is.na(flights)). Impute missing dep_delay values with 0 (assuming no delay for cancelled flights) in a new column dep_delay_imputed. Create a frequency table of dep_delay_imputed for delays between -20 and 20 minutes (use filter() to subset).
Output: NA counts, and the frequency table for imputed delays.
colSums(is.na(flights2)) [c("dep_delay","arr_delay")] # count the NAs in the 2 columns
## dep_delay arr_delay
## 8255 9430
flights2$dep_delay_imputed <- ifelse(is.na(flights2$dep_delay), 0, flights2$dep_delay) # base-R create dep_delay_imputed with value 0 for NA dep_delay
colSums(is.na(flights2))[c("dep_delay_imputed","arr_delay")] # check that all missing dep_delays were imputed; NA count should be zero
## dep_delay_imputed arr_delay
## 0 9430
flights_imputed <- filter (flights2, dep_delay_imputed >= -20 & dep_delay_imputed <= 20) # filter the -20 to 20 range
freq_table <- table(flights_imputed$dep_delay_imputed) # produce the table of dep_delay_imputed
freq_table # show it
##
## -20 -19 -18 -17 -16 -15 -14 -13 -12 -11 -10 -9 -8
## 37 19 81 110 162 408 498 901 1594 2727 5891 7875 11791
## -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5
## 16752 20701 24821 24619 24218 21516 18813 24769 8050 6233 5450 4807 4447
## 6 7 8 9 10 11 12 13 14 15 16 17 18
## 3789 3520 3381 3062 2859 2756 2494 2414 2256 2140 2085 1873 1749
## 19 20
## 1730 1704
as.data.frame(freq_table) # for clarity, make it a vertical table, and also show the barplot
## Var1 Freq
## 1 -20 37
## 2 -19 19
## 3 -18 81
## 4 -17 110
## 5 -16 162
## 6 -15 408
## 7 -14 498
## 8 -13 901
## 9 -12 1594
## 10 -11 2727
## 11 -10 5891
## 12 -9 7875
## 13 -8 11791
## 14 -7 16752
## 15 -6 20701
## 16 -5 24821
## 17 -4 24619
## 18 -3 24218
## 19 -2 21516
## 20 -1 18813
## 21 0 24769
## 22 1 8050
## 23 2 6233
## 24 3 5450
## 25 4 4807
## 26 5 4447
## 27 6 3789
## 28 7 3520
## 29 8 3381
## 30 9 3062
## 31 10 2859
## 32 11 2756
## 33 12 2494
## 34 13 2414
## 35 14 2256
## 36 15 2140
## 37 16 2085
## 38 17 1873
## 39 18 1749
## 40 19 1730
## 41 20 1704
barplot (freq_table, col='lightblue', main='Flight delays', xlab='Delay in min', ylab='No. flights')
Reflect on the assignment: What was easy or hard about working with flight dates or missing data? How might assuming zero delay for missing values (Question 6) affect conclusions about flight punctuality? What did you learn about NYC flights in 2013? (150-200 words)
The assignment was harder than previus homeworks definitely. Working with dates with lubridate and zoo was not difficult, for the simple conversions we had to do. However, it took trial and error to get table to work correctly for me. Conversion of the table variable to factor is essential. Also, recoding in base R gave me trouble –produced warnings when creating new columns; rerunning the line gave error. Forcats fct_recode was more robust. Handling missing values was fine, I had to use ifelse() to create the new column. As to your question: assuming zero delay for missing values may affect the true statistics of flight delays, as long as the missing values represent flights that never took off. As the barplot in Q6 shows, zeros are outside the Gaussian distribution. In a normal distribution, there should be around 13000 flights with zero delay, not over 25000. Punctuality is overstated. How to correct this? One can appropriately filter out the flights that never took off before calculating the statistics, and that might correct the observed distribution. Looking at the flights dataset, one can identify those flights because they had NAs for departure time, departure delay, arrival time and arrival delay. There may be cases where the flights actually took off and for some reason the departure time (and departure delay) were not recorded, but the arrival time and delay were. I did not search for those cases. But it they exist, then it is appropriate to assign logical values to the departure delay of those flights that show arrival time. I would look at the arrival delay; commonly not all flights make up the lost time in short domestic flights and thus typically the departure delay is to a first approximation equal to the arrival delay. I think that such first approximation is better than assigning zero to the missing departure delay. Second question of what I learned about the 2013 NYC flights: the 3 airports are very congested, particularly EWR, almost every day; Saturdays seem a little better than the other days. In terms of delays, just a visual examination of the scatterplot shows that delays are smaller in the AM than in the evening, so the message is to fly early to minimize wait time at the airport. There are gaps between around midnight and 4 am. No flights taking off at those times. A visual examination of the delay distribution shows a mean around -2 to -5 min, which tells us that these major airports had an excellent record of timely departure despite the long tails to -14 and +20 minutes.