## Introduction
In this vignette, we will use the lubridate package from the Tidyverse
to parse and manipulate date and time data. The data come from the
flights dataset in the nycflights13 package. We will calculate departure
delays and visualize the results.
### Load Data and Perform Analysis
First, let's install and load the necessary libraries.
Next, we load the flights dataset and take a glimpse at its structure.
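The setup described above might look like the following sketch. It assumes the tidyverse and nycflights13 packages are already installed (the commented-out install line is only a reminder), and it loads lubridate explicitly in case an older tidyverse release does not attach it.

```r
# One-time setup, if the packages are not installed yet:
# install.packages(c("tidyverse", "nycflights13"))

library(tidyverse)     # attaches dplyr, ggplot2, tidyr, and (from 2.0.0) lubridate
library(lubridate)     # loaded explicitly for older tidyverse versions
library(nycflights13)  # provides the flights data frame

# Print one line per column: name, type, and the first few values
glimpse(flights)
```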
```
## Rows: 336,776
## Columns: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…
```
With the structure of the dataset in view, we will use functions in the
lubridate package to convert the date and time columns into proper date
and time objects.
Step 1: Preprocess Arrival Times
With this step, the year, month, and day columns are combined into a
single date column (flight_date), and the actual and scheduled arrival
times are each converted into a time column for easier reading and
analysis.
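A minimal sketch of this preprocessing step follows. The new column names are taken from the output, but the exact calls are assumptions: we use lubridate's make_date() for the date, hms::hms() for the time-of-day columns (suggested by the `<time>` type in the output), stringr's str_pad() for the zero-padded strings, and we assume arrival_delay_minutes is simply arr_delay stored as numeric. The object name flights_clean is hypothetical.

```r
library(dplyr)
library(lubridate)
library(stringr)
library(nycflights13)

flights_clean <- flights %>%
  mutate(
    # Combine the three integer columns into one Date column
    flight_date = make_date(year, month, day),
    # arr_time is an integer in HHMM form (e.g. 830); split it into hours and minutes
    actual_arrival_time = hms::hms(hours = arr_time %/% 100, minutes = arr_time %% 100),
    sched_arrival_time  = hms::hms(hours = sched_arr_time %/% 100,
                                   minutes = sched_arr_time %% 100),
    # Assumption: the delay column is a numeric copy of arr_delay
    arrival_delay_minutes = as.numeric(arr_delay),
    # Zero-pad the original HHMM values to four characters (830 becomes "0830")
    arr_time       = str_pad(as.character(arr_time), width = 4, pad = "0"),
    sched_arr_time = str_pad(as.character(sched_arr_time), width = 4, pad = "0")
  )

glimpse(flights_clean)
```

Building the times from integer arithmetic rather than by parsing strings keeps missing arrival times as NA and avoids ambiguity with values such as 2400.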
```
## Rows: 336,776
## Columns: 23
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558…
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600…
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2,…
## $ arr_time <chr> "0830", "0850", "0923", "1004", "0812", "0740", …
## $ sched_arr_time <chr> "0819", "0830", "0850", "1022", "0837", "0728", …
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3…
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", …
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79,…
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN"…
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR",…
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL",…
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138,…
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944…
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, …
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-…
## $ flight_date <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01,…
## $ actual_arrival_time <time> 08:30:00, 08:50:00, 09:23:00, 10:04:00, 08:12:0…
## $ sched_arrival_time <time> 08:19:00, 08:30:00, 08:50:00, 10:22:00, 08:37:0…
## $ arrival_delay_minutes <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3…
```
Next, we will store the departure delay values as a numeric column of
minutes for easier analysis. Then we will summarize the average
departure delay for each day of the week and create a visualization to
present the findings.
We will use the wday() function from the lubridate package to determine
the day of the week on which each flight departed.
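The weekday summary could be sketched as below. The names delay_by_day, avg_departure_delay, and weekday_plot are hypothetical, and the choice of a bar chart is an assumption. Passing label = TRUE to wday() returns an ordered factor of weekday names instead of numbers, so the bars appear in calendar order.

```r
library(dplyr)
library(ggplot2)
library(lubridate)
library(nycflights13)

delay_by_day <- flights %>%
  mutate(
    flight_date = make_date(year, month, day),
    departure_delay_minutes = as.numeric(dep_delay),
    # Ordered factor of weekday labels (Sunday first by default)
    day_of_week = wday(flight_date, label = TRUE)
  ) %>%
  group_by(day_of_week) %>%
  # na.rm = TRUE drops flights with no recorded departure delay
  summarise(
    avg_departure_delay = mean(departure_delay_minutes, na.rm = TRUE),
    .groups = "drop"
  )

weekday_plot <- ggplot(delay_by_day, aes(x = day_of_week, y = avg_departure_delay)) +
  geom_col() +
  labs(
    x = "Day of the week",
    y = "Average departure delay (minutes)",
    title = "Average departure delay by day of the week"
  )

weekday_plot
```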
Analyze Delays by Carrier and Day of the Week

## Explore Delay Distribution by Origin Airport
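Before plotting, it helps to check the delay column for missing or non-finite values. The sketch below shows one way such checks and a per-airport boxplot could be written. The names delay_checks and origin_boxplot are hypothetical, and the y-axis limits are illustrative assumptions, not the limits used for the counts reported in this section.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Count missing, NaN, and infinite departure delays
delay_checks <- flights %>%
  summarise(
    na_count  = sum(is.na(dep_delay)),
    nan_count = sum(is.nan(dep_delay)),
    inf_count = sum(is.infinite(dep_delay))
  )
delay_checks

# Five-number summary plus mean and NA count
summary(flights$dep_delay)

# Boxplot of departure delay per origin airport.
# coord_cartesian() zooms without dropping observations;
# the limits below are hypothetical.
origin_boxplot <- flights %>%
  filter(!is.na(dep_delay)) %>%
  ggplot(aes(x = origin, y = dep_delay)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(-30, 120)) +
  labs(x = "Origin airport", y = "Departure delay (minutes)")

origin_boxplot
```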
```
## # A tibble: 1 × 3
## na_count nan_count inf_count
## <int> <int> <int>
## 1 8255 0 0
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -43.00 -5.00 -2.00 12.64 11.00 1301.00 8255
## # A tibble: 1 × 2
## below_limit above_limit
## <int> <int>
## 1 41 13346
```

```
## [1] 5 6 7 8 18 9 10 11 12 13 14 15 16 17 19 20 21 22 23 1
```
Investigate Hourly Delay Patterns
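One way to look at hourly patterns is to average the departure delay by scheduled departure hour and draw a line graph. This is a sketch under assumptions: hourly_delay, n_flights, and hourly_plot are hypothetical names, and the hour column is taken to be the scheduled departure hour.

```r
library(dplyr)
library(ggplot2)
library(nycflights13)

# Scheduled departure hours present in the data
unique(flights$hour)

hourly_delay <- flights %>%
  filter(!is.na(dep_delay)) %>%   # keep only flights with a recorded delay
  group_by(hour) %>%
  summarise(
    avg_departure_delay = mean(dep_delay),
    n_flights = n(),
    .groups = "drop"
  )

hourly_plot <- ggplot(hourly_delay, aes(x = hour, y = avg_departure_delay)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Scheduled departure hour",
    y = "Average departure delay (minutes)"
  )

hourly_plot
```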

To compare departure and arrival delays, we reshape the data into long
format, so that each flight contributes one row per delay type.
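The reshaping can be sketched with tidyr's pivot_longer(). The column names match the output, while delays_long is a hypothetical object name and the numeric delay columns are assumed to be plain copies of dep_delay and arr_delay.

```r
library(dplyr)
library(tidyr)
library(lubridate)
library(nycflights13)

delays_long <- flights %>%
  mutate(
    day_of_week = wday(make_date(year, month, day), label = TRUE),
    departure_delay_minutes = as.numeric(dep_delay),
    arrival_delay_minutes   = as.numeric(arr_delay)
  ) %>%
  select(day_of_week, departure_delay_minutes, arrival_delay_minutes) %>%
  # Stack the two delay columns: one row per flight and delay type
  pivot_longer(
    cols = c(departure_delay_minutes, arrival_delay_minutes),
    names_to = "delay_type",
    values_to = "delay_minutes"
  )

delays_long
```

With both delay types in one column, delay_type can be mapped to an aesthetic such as fill or colour to show the two side by side in a single plot.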
```
## # A tibble: 673,552 × 3
## day_of_week delay_type delay_minutes
## <ord> <chr> <dbl>
## 1 Tue departure_delay_minutes 2
## 2 Tue arrival_delay_minutes 11
## 3 Tue departure_delay_minutes 4
## 4 Tue arrival_delay_minutes 20
## 5 Tue departure_delay_minutes 2
## 6 Tue arrival_delay_minutes 33
## 7 Tue departure_delay_minutes -1
## 8 Tue arrival_delay_minutes -18
## 9 Tue departure_delay_minutes -6
## 10 Tue arrival_delay_minutes -25
## # ℹ 673,542 more rows
```

## Conclusion
In this analysis, we demonstrated how the Tidyverse, particularly the
lubridate, dplyr, and ggplot2 packages, can be used to efficiently
manipulate, clean, and visualize flight delay data. By transforming raw
date and time information into structured formats, we were able to
explore patterns in both departure and arrival delays across different
days of the week, airports, and hours of the day. Using visualizations
like bar charts, boxplots, and line graphs, we uncovered meaningful
trends, such as average delays that fluctuate by weekday and departure
hour.
This workflow highlights the power of tidy data principles: with a
consistent and organized dataset, we can gain valuable insights quickly
and clearly. The tools and techniques used here can be easily extended
to other real-world datasets involving dates, times, and delay
analysis.
