Primary objective of the report this data dive is to analyse traffic pattern over time. Helping us to understand how time series analysis works and also recognize patterns such as seasonality over time.

1. Reading Data

Reading the data and performing minor adjustments to remove inappropriate outliers and make the data easy to work with.

library(readr)
library(tidyverse)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
library(tidyr)
week2=read_csv("C:/Users/rajas/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
week2<- week2|>
  mutate(temp=(((temp-273)*9/5))+32)
week2 <- week2 |>
  mutate(date_time = ymd_hms(date_time),
         hour = hour(date_time),
         year = year(date_time))
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$Weekday<-weekdays(as.Date(week2$date_time))
week2$Weekday<-factor(week2$Weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
data_df<-week2
data_df$high_traffic <- ifelse(data_df$traffic_volume >= 4000, 1, 0)
traffic_data <- data_df |>
  select(date_time,hour, day,month,year, traffic_volume) |>
  as_tibble()

2. Data Cleaning

Data uniformity is very essential in time-series analysis. In our dataset, we will check if we have continuous data and take actions to mitigate if we do not. Lets plot the traffic volume across the years to check for any gaps.

ggplot(traffic_data, aes(x = date_time, y = traffic_volume, color = as.factor(year))) +
  geom_line(alpha = 0.5) +
  labs(
    title = "Traffic Volume Over Time by Year",
    x = "Date",
    y = "Traffic Volume",
    color = "Year"
  ) +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

We can clearly see from the above plot, that the data for 2014 and 2015 are missing. We need exclude these from the analysis to get accurate interpretations. We also see that considering every hour of the day makes the data unreadable. So we will consider hours of the day that are representable and also get a average of the total traffic across a day to make it even more readable.

traffic_data<-traffic_data[(traffic_data$hour >= 7)&(traffic_data$hour <= 20),]# Filter to only include hours between 7 AM and 8 PM
traffic_data<-traffic_data[(traffic_data$year != "14")&(traffic_data$year != "15"),]

3. Time-series Visualizations

Lets plot the above data

daily_data <- traffic_data %>%
  mutate(date = as_date(date_time)) %>%
  group_by(year, month,day,date) %>%
  summarise(daily_avg_volume = mean(traffic_volume), .groups = "drop")


# Split data by year into a named list
data_by_year <- split(daily_data, daily_data$year)

# Create plots: each year with lines for each month
plots <- map(data_by_year, function(df) {
  ggplot(df, aes(x = date, y = daily_avg_volume, color = month)) +
    geom_line(size=1.8) +
    labs(
      title = paste("Daily Avg Traffic Volume (7 AM to 8 PM) -", unique(df$year)),
      x = "Date",
      y = "Avg Traffic Volume",
      color = "Month"
    ) +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) +
    scale_color_brewer(palette = "Set3")  # Nice color palette for months
})
plots[["12"]]

plots[["13"]]

plots[["16"]]

plots[["17"]]

plots[["18"]]

We can clearly see multiple trends appearing across the years. We can see hourly fluctuations accounting for Peak and non-Peak hours of the day, weekly fluctuations showing the weekday- weekend seasonality. We also multiple steep drops rises across the years which will be either external factors affecting the traffic like public transport shut down/maintenance or holidays. There is very little monthly patterns that we observe across the years, atleast not as apparent as other tie steps.

Lets plot the data across a span of 31 days to check how data changes daily across 12 months. We will check one year to limit the total number of graphs.

monthly_day_avg <- traffic_data %>%
  group_by(year, month, day) %>%
  summarise(avg_volume = mean(traffic_volume), .groups = "drop")

# Remove day 31 for months that don’t have it (optional, to avoid NA lines)
monthly_day_avg <- monthly_day_avg %>%
  filter(!(day == 31 & month %in% c("Feb", "Apr", "Jun", "Sep", "Nov")))

# Split data by year
data_by_year <- split(monthly_day_avg, monthly_day_avg$year)

# Generate a plot per year: x = day of month, line = month
plots <- map(data_by_year, function(df) {
  ggplot(df, aes(x = day, y = avg_volume, color = month)) +
    geom_line(size = 1) +
    scale_x_continuous(breaks = 1:31) +
    scale_color_brewer(palette = "Paired") +
    labs(
      title = paste("Avg Traffic Volume by Day of Month (7 AM to 8 PM) - 20",unique(df$year)),
      x = "Day of Month",
      y = "Avg Traffic Volume",
      color = "Month"
    ) +
        theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))
})
plots[["18"]]

Monthly plots shows a varied levels of patterns across day s in a month. Most noticeable pattern is the Weekday-Weekend effect. People going to work during weekdays and weekends staying at home. We also see some steep drops like April 14-15. This is interesting since there are no national or State holidays indicating this might have been a logging error.

4. Trend identification with Regression modelling

We are going to run linear regression models to identify trends and how they change across years. Since we have multiple years, we are going to initially run the model for one year, interpret it and then analyse the summary across years.

daily_data <- traffic_data |>
  mutate(date = as_date(date_time)) |>
  group_by(year, date) |>
  summarise(daily_avg_volume = mean(traffic_volume), .groups = "drop")

# Split by year
data_by_year <- split(daily_data, daily_data$year)

# Store results
models <- list()
plots <- list()

# Loop over each year and fit model
for (yr in names(data_by_year)) {
  df <- data_by_year[[yr]] %>%
    arrange(date) %>%
    mutate(day_num = as.numeric(date - min(date)))

  # Fit linear model
  model <- lm(daily_avg_volume ~ day_num, data = df)
  models[[yr]] <- model

  # Plot with trend line
  p <- ggplot(df, aes(x = date, y = daily_avg_volume)) +
    geom_line(color = "gray20", alpha = 0.6) +
    geom_smooth(method = "lm", se = TRUE, color = "blue", linetype = "dashed") +
    labs(
      title = paste("Trend in Traffic Volume -", yr),
      x = "Date",
      y = "Avg Traffic Volume"
    ) +
              theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

  plots[[yr]] <- p
}
plots[["17"]]

summary(models[["17"]])
## 
## Call:
## lm(formula = daily_avg_volume ~ day_num, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2268.8  -660.5   289.0   531.1   988.6 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.627e+03  7.191e+01  64.339   <2e-16 ***
## day_num     3.969e-02  3.420e-01   0.116    0.908    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 688.4 on 363 degrees of freedom
## Multiple R-squared:  3.711e-05,  Adjusted R-squared:  -0.002718 
## F-statistic: 0.01347 on 1 and 363 DF,  p-value: 0.9077

For 2017, The traffic volume remained essentially flat (Slope: 0.0397) through 2017. The small positive slope isn’t statistically meaningful. There’s strong day-to-day variability, but no long-term upward/downward trend in the year. We will compile the summaries across the years to see how it changes.

trend_summary <- map_dfr(names(models), function(yr) {
  mod <- models[[yr]]
  slope <- tidy(mod) %>% filter(term == "day_num")
  
  data.frame(
    year = yr,
    slope = slope$estimate,
    p_value = slope$p.value,
    r_squared = summary(mod)$r.squared
  )
})
Yearly regression model interpretations
Year Slope P-value R² Interpretation
2012 -11.1 0.0012 0.1110 Significant decrease in traffic over the year.
2013 0.28 0.5151 0.0012 No significant trend.
2016 -1.46 0.0001 0.0402 Significant decrease in traffic.
2017 0.04 0.9076 ~0.00 Completely flat — no significant trend.
2018 0.81 0.1679 0.0070 Slight increase, not statistically significant.
trend_summary$significant <- trend_summary$p_value < 0.05

ggplot(trend_summary, aes(x = year, y = slope, fill = significant)) +
  geom_col() +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  scale_fill_manual(values = c("gray70", "steelblue"), 
                    labels = c("Not Significant", "Significant")) +
  labs(
    title = "Linear Trend in Daily Traffic Volume (Per Year)",
    subtitle = "Slope of daily_avg_volume ~ day_num | 7 AM – 8 PM",
    x = "Year", y = "Slope",
    fill = "Trend Significance"
  ) +
          theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

We can see from the above plot the significant trends and the insignificant ones. 2012 and 2016 had significant shift in traffic patterns compared to other years.

ggplot(df, aes(x = date, y = daily_avg_volume)) +
  geom_line(color = "gray50") +
  geom_smooth(method = "loess", span = 0.08, se = FALSE, color = "blue", size = 1.2) +
  labs(
    title = "Smoothed Daily Traffic Volume",
    subtitle = "LOESS smoothing to reveal seasonality (7 AM – 8 PM only)",
    x = "Date", y = "Avg Traffic Volume"
  ) +
          theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

Applying LOESS smoothing, we observe that there some interesting patterns where towards the end of the month traffic seems to go down. These dates closely align with weekends so this is likely the weekday-weekend effect.

We will try to visualize these effects using ACF and PACF where spikes at specific lags determine seasonal effects

library(forecast)

# Make sure your data is complete and evenly spaced (daily)
ts_data <- ts(df$daily_avg_volume, frequency = 7)  # 7 = weekly seasonality

# ACF and PACF plots
par(mfrow = c(1, 2))
acf(ts_data, main = "ACF of Daily Avg Traffic")

Repeating spikes every 7 lags shows strong weekly seasonality. There is also slight Gradual decay which indicates that trend may also be present but subtle.

pacf(ts_data, main = "PACF of Daily Avg Traffic")

Significant PACF at lag 7 confirms the weekly auto regressive effect.