1. Reading Data

We read the data and make minor adjustments: removing implausible outliers and deriving a few convenience variables so the data are easier to work with.

library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
library(tidyr)
week2=read_csv("C:/Users/rajas/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]      # drop rows where temperature is recorded as 0 K (sensor errors)
week2=week2[week2$rain_1h< 60,] # drop implausibly large hourly rainfall readings
week2<- week2|>
  mutate(temp=((temp-273)*9/5)+32) # approximate Kelvin-to-Fahrenheit conversion
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$Weekday<-weekdays(as.Date(week2$date_time))
week2$Weekday<-factor(week2$Weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
data_df<-week2
data_df$high_traffic <- ifelse(data_df$traffic_volume >= 4000, 1, 0) # binary response: 1 if hourly volume is at least 4000 vehicles

# Select relevant variables
data_model <- data_df[, c("high_traffic", "Weekday", "hour", "rain_1h", "snow_1h", "temp")]
data_model
## # A tibble: 48,193 × 6
##    high_traffic Weekday  hour rain_1h snow_1h  temp
##           <dbl> <fct>   <int>   <dbl>   <dbl> <dbl>
##  1            1 Tuesday     9       0       0  59.5
##  2            1 Tuesday    10       0       0  61.4
##  3            1 Tuesday    11       0       0  61.8
##  4            1 Tuesday    12       0       0  62.8
##  5            1 Tuesday    13       0       0  64.7
##  6            1 Tuesday    14       0       0  65.7
##  7            1 Tuesday    15       0       0  68.3
##  8            1 Tuesday    16       0       0  69.5
##  9            1 Tuesday    17       0       0  70.1
## 10            1 Tuesday    18       0       0  68.2
## # ℹ 48,183 more rows
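
Before modeling, a quick look at the class balance of high_traffic is useful (a minimal sketch; the exact proportions depend on the rows kept after cleaning):

# Proportion of hours flagged as high traffic (>= 4000 vehicles per hour)
prop.table(table(data_model$high_traffic))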

Objective

We are going to explore how different explanatory variables affect the response variable for the Metro Interstate traffic dataset. We will compare the effect of each variable on model inference and try to arrive at the best model for interpretation, using a few of the diagnostic techniques discussed in class.

2. Variables: Selecting and Understanding Them

The quantity of interest is the volume of traffic recorded on the interstate in any given hour. Many factors influence traffic volume across the year; let's try to find which ones matter most by building a robust GLM.

high_traffic (Response Variable): We are interested in modeling traffic congestion, so we binarize traffic_volume into high_traffic (1 if the hourly volume is at least 4000 vehicles, 0 otherwise) and treat it as the primary outcome.

Explanatory Variables

Let's see how these variables relate to the response variable.

# Plot 1: Hour vs High Traffic
p1 <- ggplot(data_model, aes(x = hour, y = high_traffic)) +
  geom_jitter(alpha = 0.3, color = "steelblue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "darkred") +
  labs(title = "Hour vs High Traffic", x = "Hour", y = "High Traffic") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

# Plot 2: Weekday vs High Traffic
p2 <- ggplot(data_model, aes(x = Weekday, y = high_traffic)) +
  geom_jitter(alpha = 0.3, color = "darkgreen") +
  stat_summary(fun = mean, geom = "point", size = 3, color = "red") +
  labs(title = "Weekday vs High Traffic", x = "Weekday", y = "High Traffic") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

# Plot 3: Rainfall vs High Traffic
p3 <- ggplot(data_model, aes(x = rain_1h, y = high_traffic)) +
  geom_jitter(alpha = 0.3, color = "blue") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red") +
  labs(title = "Rainfall vs High Traffic", x = "Rainfall (mm)", y = "High Traffic") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

# Plot 4: Snowfall vs High Traffic
p4 <- ggplot(data_model, aes(x = snow_1h, y = high_traffic)) +
  geom_jitter(alpha = 0.3, color = "purple") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "orange") +
  labs(title = "Snowfall vs High Traffic", x = "Snowfall (mm)", y = "High Traffic") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

# Plot 5: Temperature vs High Traffic
p5 <- ggplot(data_model, aes(x = temp, y = high_traffic)) +
  geom_jitter(alpha = 0.3, color = "brown") +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "darkblue") +
  labs(title = "Temperature vs High Traffic", x = "Temperature (F)", y = "High Traffic") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 
# Combine using patchwork
(p1 | p2) / (p3 | p4) / p5

The scatter and jitter plots above show how high- and low-traffic conditions vary with each explanatory variable. Let's now see how to choose the best variables for a good GLM.
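
To put a rough number on what the weekday panel suggests, here is a minimal sketch (using the already-loaded dplyr) of the share of high-traffic hours by weekday:

data_model |>
  group_by(Weekday) |>
  summarise(prop_high = mean(high_traffic), n = n())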

3. Variable dependencies

3.1 Potential Multicollinearity Issues

Before proceeding, let’s check the Variance Inflation Factors (VIF) to detect multicollinearity:

model_vif <- glm(high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
vif(model_vif)
##             GVIF Df GVIF^(1/(2*Df))
## Weekday 1.007561  6        1.000628
## hour    1.011473  1        1.005720
## rain_1h 1.010245  1        1.005110
## snow_1h 1.001216  1        1.000608
## temp    1.020813  1        1.010353

A VIF above 5 is generally a concern, as it indicates high multicollinearity. Here every GVIF is close to 1, so multicollinearity is not an issue for these predictors. As an additional check, we refit the model with Weekday entering only through an interaction with hour, to see how that changes the collinearity of hour.

model_vif <- glm(high_traffic ~ Weekday:hour + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
vif(model_vif)
##                  GVIF Df GVIF^(1/(2*Df))
## hour         2.417165  1        1.554723
## rain_1h      1.010090  1        1.005032
## snow_1h      1.001615  1        1.000807
## temp         1.020055  1        1.009978
## Weekday:hour 2.416549  6        1.076299

From the results above, replacing the Weekday main effect with a Weekday:hour interaction actually raises the GVIF of hour to about 2.4, which is still well below the threshold of 5. Since neither specification shows problematic collinearity, we keep Weekday as a main effect alongside the other predictors; a pairwise correlation check on the numeric predictors, sketched after the list below, supports this. The working variable list is:

Explanatory Variables:

  • Weekday: Traffic patterns are expected to differ markedly between weekdays and weekends.

  • hour: Traffic volume typically follows a diurnal pattern, with peak hours in the morning and evening.

  • rain_1h and snow_1h: Weather conditions are known to affect traffic behavior.

  • temp: Extreme temperatures might correlate with traffic due to behavioral changes (e.g., staying indoors during very hot/cold weather).
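
As the complementary check mentioned above (a minimal sketch; it covers only the numeric predictors, not the factor Weekday), the pairwise correlations:

cor(data_model[, c("hour", "rain_1h", "snow_1h", "temp")])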

We can also sanity-check the selected variables using stepwise selection, with AIC measuring each variable's contribution. Note that the search below is run on a linear probability model (lm) as a quick screen; a logistic version is sketched further below.

model_aic <- lm(high_traffic ~ ., data_model)

step(model_aic, direction = "backward", k=2)
## Start:  AIC=-71721.04
## high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp
## 
##           Df Sum of Sq   RSS    AIC
## - snow_1h  1      0.01 10876 -71723
## <none>                 10876 -71721
## - rain_1h  1      3.50 10879 -71708
## - hour     1    122.89 10999 -71182
## - temp     1    127.21 11003 -71163
## - Weekday  6    681.02 11557 -68806
## 
## Step:  AIC=-71723.02
## high_traffic ~ Weekday + hour + rain_1h + temp
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 10876 -71723
## - rain_1h  1      3.50 10879 -71710
## - hour     1    122.89 10999 -71184
## - temp     1    127.31 11003 -71164
## - Weekday  6    681.12 11557 -68808
## 
## Call:
## lm(formula = high_traffic ~ Weekday + hour + rain_1h + temp, 
##     data = data_model)
## 
## Coefficients:
##      (Intercept)    WeekdayTuesday  WeekdayWednesday   WeekdayThursday  
##         0.279546          0.038111          0.055952          0.044196  
##    WeekdayFriday   WeekdaySaturday     WeekdaySunday              hour  
##         0.054716         -0.167853         -0.264081          0.007325  
##          rain_1h              temp  
##        -0.008531          0.002271

The results show that every variable except snow_1h is retained, so our initial choices are in the right ballpark.
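
For completeness, backward selection can also be run directly on the logistic model (a minimal sketch; model_aic_glm is just an illustrative name, and the selected terms may differ slightly from the linear screen above):

model_aic_glm <- glm(high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp,
                     data = data_model, family = binomial)
step(model_aic_glm, direction = "backward", k = 2)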

Now that we have selected the variables and understood their relationship to the response, let's start analyzing models. We begin with a baseline logistic regression using the initial set of variables and interpret it.

4. Evaluating the Models

4.1 Baseline Model

baseline_model <- glm(high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
summary(baseline_model)
## 
## Call:
## glm(formula = high_traffic ~ Weekday + hour + rain_1h + snow_1h + 
##     temp, family = binomial, data = data_model)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -0.9664774  0.0346638 -27.881  < 2e-16 ***
## WeekdayTuesday    0.1576888  0.0344223   4.581 4.63e-06 ***
## WeekdayWednesday  0.2306548  0.0343497   6.715 1.88e-11 ***
## WeekdayThursday   0.1825966  0.0344583   5.299 1.16e-07 ***
## WeekdayFriday     0.2259427  0.0344698   6.555 5.57e-11 ***
## WeekdaySaturday  -0.7368331  0.0360324 -20.449  < 2e-16 ***
## WeekdaySunday    -1.2630017  0.0387181 -32.620  < 2e-16 ***
## hour              0.0322577  0.0013957  23.113  < 2e-16 ***
## rain_1h          -0.0402054  0.0105468  -3.812 0.000138 ***
## snow_1h          -0.1528918  1.1215180  -0.136 0.891564    
## temp              0.0100925  0.0004293  23.509  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 65979  on 48192  degrees of freedom
## Residual deviance: 61819  on 48182  degrees of freedom
## AIC: 61841
## 
## Number of Fisher Scoring iterations: 4
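
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to interpret (a minimal sketch using approximate Wald confidence intervals):

# Odds ratios with approximate 95% Wald confidence intervals
exp(cbind(OR = coef(baseline_model), confint.default(baseline_model)))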

The negative intercept indicates that the baseline probability of high traffic, with all predictors at their reference levels, is below 50%. Interpreting the remaining coefficients:

  • Weekday: Weekends, especially Sunday, have far lower odds of high traffic, while midweek days have significantly higher odds than Monday.

  • hour: Each additional hour of the day increases the log-odds of high traffic, suggesting traffic builds up gradually and likely peaks at commute times.

  • rain_1h: The likelihood of high traffic decreases slightly with increasing rainfall, possibly because people avoid traveling or drive more slowly (fewer vehicles per hour).

  • snow_1h: Snowfall does not significantly affect the odds of high traffic in this model, likely because there are very few non-zero snowfall values or because its effect is confounded with other variables.

  • temp: Higher temperatures are associated with higher odds of high traffic, which may reflect seasonal effects, since warmer months generally see more movement.

The impact of snow on traffic is an interesting point worth digging into further. Let's see whether diagnostic techniques for variable selection change these interpretations.

4.2 Model comparisons

We now create a number of models using only a few of the explanatory variables, or combinations of them, to see how they compare to the baseline model above.

model_weekday <- glm(high_traffic ~ Weekday, data = data_model, family = binomial)
model_hour <- glm(high_traffic ~ hour, data = data_model, family = binomial)
model_temp <- glm(high_traffic ~ temp, data = data_model, family = binomial)
model_rain_snow <- glm(high_traffic ~ rain_1h + snow_1h, data = data_model, family = binomial)
model_hour_temp <- glm(high_traffic ~ hour + temp, data = data_model, family = binomial)
model_weekday_hour <- glm(high_traffic ~ Weekday + hour, data = data_model, family = binomial)

We will use the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and analysis of deviance (ANOVA) to draw our conclusions.
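
For reference, both information criteria trade goodness of fit against complexity: AIC = 2k - 2*log(L) and BIC = k*log(n) - 2*log(L), where k is the number of estimated parameters, n the number of observations, and L the maximized likelihood; lower values indicate a better balance.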

AIC(model_weekday, model_hour, model_temp, model_rain_snow, model_hour_temp, model_weekday_hour, baseline_model)
##                    df      AIC
## model_weekday       7 63073.74
## model_hour          2 65340.87
## model_temp          2 65348.30
## model_rain_snow     3 65978.96
## model_hour_temp     3 64833.48
## model_weekday_hour  8 62400.88
## baseline_model     11 61840.60
BIC(model_weekday, model_hour, model_temp, model_rain_snow, model_hour_temp, model_weekday_hour, baseline_model)
##                    df      BIC
## model_weekday       7 63135.22
## model_hour          2 65358.43
## model_temp          2 65365.87
## model_rain_snow     3 66005.31
## model_hour_temp     3 64859.83
## model_weekday_hour  8 62471.14
## baseline_model     11 61937.21
model_comparison <- data.frame(
  Model = c("baseline_model", "Weekday", "Hour", "Temp", "Rain + Snow", 
            "Hour + Temp", "Weekday + Hour"),
  AIC = c(AIC(baseline_model), AIC(model_weekday), AIC(model_hour), AIC(model_temp),
          AIC(model_rain_snow), AIC(model_hour_temp), AIC(model_weekday_hour)),
  BIC = c(BIC(baseline_model), BIC(model_weekday), BIC(model_hour), BIC(model_temp),
          BIC(model_rain_snow), BIC(model_hour_temp), BIC(model_weekday_hour)),
  Deviance = c(deviance(baseline_model), deviance(model_weekday), deviance(model_hour), deviance(model_temp),
               deviance(model_rain_snow), deviance(model_hour_temp), deviance(model_weekday_hour))
)
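
Printing the table sorted by AIC gives a quick ranking before plotting (a minimal sketch):

model_comparison[order(model_comparison$AIC), ]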

Let's visualize these comparisons to make them easier to interpret.

comparison_long <- pivot_longer(model_comparison, cols = c("AIC", "BIC", "Deviance"), 
                                 names_to = "Metric", values_to = "Value")

# Plot
ggplot(comparison_long, aes(x = reorder(Model, Value), y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~Metric, scales = "free_y") +
  labs(title = "Model Comparison: AIC, BIC, and Deviance",
       x = "Model", y = "Metric Value") +
    theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.text.x = element_text(angle = 45, hjust = 1),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey")) 

An analysis-of-deviance (ANOVA) test shows whether adding variables significantly improves model fit. The chi-squared test is strictly valid only for nested models: Model 1 (hour + temp) is not nested in Model 2 (Weekday + hour), so that comparison should be read with caution, whereas Model 2 versus the full baseline model is a valid nested comparison.

anova(model_hour_temp, model_weekday_hour, baseline_model, test = "Chisq")
## Analysis of Deviance Table
## 
## Model 1: high_traffic ~ hour + temp
## Model 2: high_traffic ~ Weekday + hour
## Model 3: high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     48190      64827                          
## 2     48185      62385  5  2442.60 < 2.2e-16 ***
## 3     48182      61819  3   566.27 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5. Conclusions and Takeaways

Looking at the comparison plots and the ANOVA tests, we see that the baseline model we chose is the best combination of variables: it has the lowest AIC, BIC, and deviance, and the analysis of deviance confirms that its additional predictors improve the fit significantly over the smaller models. Our initial conclusions therefore hold, with the following takeaways (a rough in-sample accuracy check for the baseline model is sketched after the list).

Key Takeaways

  • Targeted Traffic Alerts during high-risk hours and weekdays.

  • Traffic mitigation strategies during high-temperature or adverse weather forecasts.

  • City Planning: Use low-traffic periods for construction or maintenance.
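
As a rough check of how well the baseline model separates high- and low-traffic hours, a minimal in-sample sketch (not a proper out-of-sample evaluation; the 0.5 cutoff is an arbitrary choice):

pred_high <- as.integer(predict(baseline_model, type = "response") > 0.5) # predicted class at a 0.5 cutoff
mean(pred_high == data_model$high_traffic)                                # in-sample classification accuracy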