library(tidyverse)
library(openintro)

## Introduction

In this vignette, we will use the lubridate package from the Tidyverse to parse and manipulate date and time data. The data come from the flights dataset in the nycflights13 package. We will calculate departure delays and visualize the results.

### Load Data and Perform Analysis

First, let's install and load the necessary libraries.

Next, let's load the flights dataset and take a glimpse at its columns.
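A minimal sketch of this step, assuming the nycflights13 and dplyr packages are installed and that the summary that follows came from dplyr's glimpse function:

```r
# Assumption: nycflights13 and dplyr are already installed.
library(dplyr)
library(nycflights13)

# flights is a tibble shipped with nycflights13; each row is one flight
# that departed a New York City airport in 2013.
glimpse(flights)
```

glimpse prints one line per column with its type and first few values, which is handy for wide tables such as this one.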

## Rows: 336,776
## Columns: 19
## $ year           <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2…
## $ month          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ dep_time       <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558, 558, …
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600, 600, …
## $ dep_delay      <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2, -2, -1…
## $ arr_time       <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 753, 849,…
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 745, 851,…
## $ arr_delay      <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3, 7, -1…
## $ carrier        <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", "B6", "…
## $ flight         <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79, 301, 4…
## $ tailnum        <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN", "N394…
## $ origin         <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR", "LGA",…
## $ dest           <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL", "IAD",…
## $ air_time       <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138, 149, 1…
## $ distance       <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944, 733, …
## $ hour           <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 6, 6, 6…
## $ minute         <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0, 59, 0…
## $ time_hour      <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-01-01 0…

Having glimpsed the dataset, we will use functions from the lubridate package to convert the date and time columns into proper date and time objects.

### Step 1: Preprocess Arrival Times

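With this step, the year, month, and day columns are combined into a single flight_date column, and the actual and scheduled arrival times are converted into time columns (actual_arrival_time and sched_arrival_time), which makes the data easier to read and analyze.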
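The code for this step is not shown, so the following is only a sketch of one way to build these columns. It uses a three-row stand-in (sample_flights, a hypothetical name) whose values are copied from the first rows of the glimpse. It also splits the integer clock times with arithmetic, whereas the output suggests the original zero-padded them into strings such as "0830".

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for flights; values taken from the first three rows.
sample_flights <- tibble(
  year = 2013L, month = 1L, day = 1L,
  arr_time = c(830L, 850L, 923L),
  sched_arr_time = c(819L, 830L, 850L),
  arr_delay = c(11, 20, 33)
)

arrivals <- sample_flights %>%
  mutate(
    # Combine the three date parts into a single Date column.
    flight_date = make_date(year, month, day),
    # Clock times are stored as HHMM integers: %/% 100 gives the hour,
    # %% 100 gives the minute. hms::hms() builds a time-of-day value.
    actual_arrival_time = hms::hms(
      hours = arr_time %/% 100, minutes = arr_time %% 100
    ),
    sched_arrival_time = hms::hms(
      hours = sched_arr_time %/% 100, minutes = sched_arr_time %% 100
    ),
    # arr_delay is already in minutes; copy it under a descriptive name.
    arrival_delay_minutes = arr_delay
  )

glimpse(arrivals)
```

The call is written as hms::hms because lubridate exports a different function that is also named hms.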

## Rows: 336,776
## Columns: 23
## $ year                  <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ month                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ day                   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ dep_time              <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 558…
## $ sched_dep_time        <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 600…
## $ dep_delay             <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2,…
## $ arr_time              <chr> "0830", "0850", "0923", "1004", "0812", "0740", …
## $ sched_arr_time        <chr> "0819", "0830", "0850", "1022", "0837", "0728", …
## $ arr_delay             <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3…
## $ carrier               <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV", …
## $ flight                <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79,…
## $ tailnum               <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN"…
## $ origin                <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR",…
## $ dest                  <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL",…
## $ air_time              <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138,…
## $ distance              <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 944…
## $ hour                  <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, …
## $ minute                <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ time_hour             <dttm> 2013-01-01 05:00:00, 2013-01-01 05:00:00, 2013-…
## $ flight_date           <date> 2013-01-01, 2013-01-01, 2013-01-01, 2013-01-01,…
## $ actual_arrival_time   <time> 08:30:00, 08:50:00, 09:23:00, 10:04:00, 08:12:0…
## $ sched_arrival_time    <time> 08:19:00, 08:30:00, 08:50:00, 10:22:00, 08:37:0…
## $ arrival_delay_minutes <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -3…

Next, we will convert the departure delay values into a numeric format for easier analysis. Then we will summarize the average departure delay for each day of the week and create a visualization to present the findings.

We will use the wday function from the lubridate package to determine which day of the week each departure date fell on.
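A sketch of how the weekday summary and plot might be produced, assuming nycflights13 is installed. The names delay_by_weekday and avg_dep_delay are illustrative, and the bar chart is one plausible choice rather than the exact figure from the original analysis.

```r
library(dplyr)
library(lubridate)
library(ggplot2)
library(nycflights13)

delay_by_weekday <- flights %>%
  mutate(
    flight_date = make_date(year, month, day),
    # label = TRUE returns an ordered factor of weekday abbreviations
    # instead of the numbers 1 to 7.
    day_of_week = wday(flight_date, label = TRUE)
  ) %>%
  group_by(day_of_week) %>%
  # Cancelled flights have a missing dep_delay, so drop NAs in the mean.
  summarise(avg_dep_delay = mean(dep_delay, na.rm = TRUE))

ggplot(delay_by_weekday, aes(x = day_of_week, y = avg_dep_delay)) +
  geom_col() +
  labs(x = "Day of week", y = "Average departure delay (minutes)")
```

Because wday returns an ordered factor when label = TRUE, ggplot2 keeps the bars in calendar order rather than sorting them alphabetically.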

### Analyze Delays by Carrier and Day of the Week

## Explore Delay Distribution by Origin Airport

## # A tibble: 1 × 3
##   na_count nan_count inf_count
##      <int>     <int>     <int>
## 1     8255         0         0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -43.00   -5.00   -2.00   12.64   11.00 1301.00    8255
## # A tibble: 1 × 2
##   below_limit above_limit
##         <int>       <int>
## 1          41       13346
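The first table above counts missing and non-finite values. The 8,255 NA's in the summary line suggest the check was run on dep_delay; under that assumption, it might be written as follows (delay_checks is an illustrative name). The limits behind the below_limit and above_limit counts are not stated in the text, so they are not reproduced here.

```r
library(dplyr)
library(nycflights13)

delay_checks <- flights %>%
  summarise(
    na_count  = sum(is.na(dep_delay)),       # is.na() is also TRUE for NaN
    nan_count = sum(is.nan(dep_delay)),
    inf_count = sum(is.infinite(dep_delay))
  )
delay_checks

# Five-number summary plus the mean and the NA count.
summary(flights$dep_delay)
```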

##  [1]  5  6  7  8 18  9 10 11 12 13 14 15 16 17 19 20 21 22 23  1

### Investigate Hourly Delay Patterns

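To compare departure and arrival delays, we reshape the data into long format, so that each flight contributes one row for its departure delay and one row for its arrival delay.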
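A long table with delay_type and delay_minutes columns can be produced with tidyr's pivot_longer. The following is a minimal sketch on a three-row stand-in (sample_flights and delays_long are hypothetical names) with delay values copied from the first rows of the data. The column names departure_delay_minutes and arrival_delay_minutes are inferred from the delay_type values in the output.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in; delay values match the first three flights.
sample_flights <- tibble(
  flight_date = as.Date("2013-01-01"),
  departure_delay_minutes = c(2, 4, 2),
  arrival_delay_minutes = c(11, 20, 33)
)

delays_long <- sample_flights %>%
  mutate(day_of_week = wday(flight_date, label = TRUE)) %>%
  select(day_of_week, departure_delay_minutes, arrival_delay_minutes) %>%
  # Stack the two delay columns: the old column name goes into delay_type
  # and its value into delay_minutes, doubling the number of rows.
  pivot_longer(
    cols = c(departure_delay_minutes, arrival_delay_minutes),
    names_to = "delay_type",
    values_to = "delay_minutes"
  )

delays_long
```

With both delay types in a single column, delay_type can be mapped to a fill or facet in ggplot2 so the two distributions appear side by side.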

## # A tibble: 673,552 × 3
##    day_of_week delay_type              delay_minutes
##    <ord>       <chr>                           <dbl>
##  1 Tue         departure_delay_minutes             2
##  2 Tue         arrival_delay_minutes              11
##  3 Tue         departure_delay_minutes             4
##  4 Tue         arrival_delay_minutes              20
##  5 Tue         departure_delay_minutes             2
##  6 Tue         arrival_delay_minutes              33
##  7 Tue         departure_delay_minutes            -1
##  8 Tue         arrival_delay_minutes             -18
##  9 Tue         departure_delay_minutes            -6
## 10 Tue         arrival_delay_minutes             -25
## # ℹ 673,542 more rows

## Conclusion

In this analysis, we demonstrated how the Tidyverse, particularly the lubridate, dplyr, and ggplot2 packages, can be used to efficiently manipulate, clean, and visualize flight delay data. By transforming raw date and time information into structured formats, we were able to explore patterns in both departure and arrival delays across different days of the week, airports, and hours of the day. Using visualizations such as bar charts, boxplots, and line graphs, we uncovered meaningful trends, for example average delays that fluctuate by weekday and departure hour.

This workflow highlights the power of tidy data principles: with a consistent and organized dataset, we can gain valuable insights quickly and clearly. The tools and techniques used here can be easily extended to other real-world datasets involving dates, times, and delay analysis.

