Reading the data and performing minor adjustments to remove inappropriate outliers and make the data easy to work with.
library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
library(tidyr)
week2=read_csv("C:/Users/rajas/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
week2<- week2|>
mutate(temp=(((temp-273)*9/5))+32)
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$Weekday<-weekdays(as.Date(week2$date_time))
week2$Weekday<-factor(week2$Weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
data_df<-week2
data_df$high_traffic <- ifelse(data_df$traffic_volume >= 4000, 1, 0)
# Select relevant variables
data_model <- data_df[, c("high_traffic", "Weekday", "hour", "rain_1h", "snow_1h", "temp")]
data_model
## # A tibble: 48,193 × 6
## high_traffic Weekday hour rain_1h snow_1h temp
## <dbl> <fct> <int> <dbl> <dbl> <dbl>
## 1 1 Tuesday 9 0 0 59.5
## 2 1 Tuesday 10 0 0 61.4
## 3 1 Tuesday 11 0 0 61.8
## 4 1 Tuesday 12 0 0 62.8
## 5 1 Tuesday 13 0 0 64.7
## 6 1 Tuesday 14 0 0 65.7
## 7 1 Tuesday 15 0 0 68.3
## 8 1 Tuesday 16 0 0 69.5
## 9 1 Tuesday 17 0 0 70.1
## 10 1 Tuesday 18 0 0 68.2
## # ℹ 48,183 more rows
We are going to explore different explanatory variable impact on the response variable traffic_volume for the metro interstate dataset. We will compare the effect of each variable on the inference of a model and try to arrive at the best model for interpretation using few diagnostic techniques that were discussed in class.
The primary response variable we choose will be the volume of traffic recorded on the interstate for any given hour. Various factors impact traffic volume across the year. lets try to find which factors influence the most by building a robust GLM model.
traffic_volume (Response Variable): We are interested in modeling or predicting traffic congestion, making this the primary outcome.
Explanatory Variables
Weekday: Traffic patterns are expected to vary significantly between weekdays and weekends.
hour: Traffic volume typically follows a diurnal pattern, with peak hours in the morning and evening.
rain_1h and snow_1h: Weather conditions are known to affect traffic behavior.
temp: Extreme temperatures might correlate with traffic due to behavioral changes (e.g., staying indoors during very hot/cold weather).
Lets see how these variables are related to the response variable
# Plot 1: Hour vs High Traffic
p1 <- ggplot(data_model, aes(x = hour, y = high_traffic)) +
geom_jitter(alpha = 0.3, color = "steelblue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "darkred") +
labs(title = "Hour vs High Traffic", x = "Hour", y = "High Traffic") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
# Plot 2: Weekday vs High Traffic
p2 <- ggplot(data_model, aes(x = Weekday, y = high_traffic)) +
geom_jitter(alpha = 0.3, color = "darkgreen") +
stat_summary(fun = mean, geom = "point", size = 3, color = "red") +
labs(title = "Weekday vs High Traffic", x = "Weekday", y = "High Traffic") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
# Plot 3: Rainfall vs High Traffic
p3 <- ggplot(data_model, aes(x = rain_1h, y = high_traffic)) +
geom_jitter(alpha = 0.3, color = "blue") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "red") +
labs(title = "Rainfall vs High Traffic", x = "Rainfall (mm)", y = "High Traffic") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
# Plot 4: Snowfall vs High Traffic
p4 <- ggplot(data_model, aes(x = snow_1h, y = high_traffic)) +
geom_jitter(alpha = 0.3, color = "purple") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "orange") +
labs(title = "Snowfall vs High Traffic", x = "Snowfall (mm)", y = "High Traffic") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
# Plot 5: Temperature vs High Traffic
p5 <- ggplot(data_model, aes(x = temp, y = high_traffic)) +
geom_jitter(alpha = 0.3, color = "brown") +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = "darkblue") +
labs(title = "Temperature vs High Traffic", x = "Temperature (F)", y = "High Traffic") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
# Combine using patchwork
(p1 | p2) / (p3 | p4) / p5
As we can see from the above scatter and jitter plots, we can see how high traffic conditions and low traffic conditions vary across the year based on different explanatory variables. Lets see how we can choose the best variables to create a good GLM model for data insights.
Before proceeding, let’s check Variance Inflation Factor (VIF) to detect multicollinearity:
model_vif <- glm(high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
vif(model_vif)
## GVIF Df GVIF^(1/(2*Df))
## Weekday 1.007561 6 1.000628
## hour 1.011473 1 1.005720
## rain_1h 1.010245 1 1.005110
## snow_1h 1.001216 1 1.000608
## temp 1.020813 1 1.010353
Any VIF over 5 is a concern as it indicates high multicollinearity. Here we observe Weekday having VIF over 5 indicating large multicolinearity. We should ideally remove this parameter. But we will run the VIF check removing it and add a interaction pair with hour to see if that reduces the colinearity of hour.
model_vif <- glm(high_traffic ~ Weekday:hour + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
vif(model_vif)
## GVIF Df GVIF^(1/(2*Df))
## hour 2.417165 1 1.554723
## rain_1h 1.010090 1 1.005032
## snow_1h 1.001615 1 1.000807
## temp 1.020055 1 1.009978
## Weekday:hour 2.416549 6 1.076299
From the results above we see that removing the Weekend and adding it as an interacting pair significantly reduced the multicolinearity . Based on the information we gained above, the new variable list changes as below.
Explanatory Variables:
Weekday: hour : Traffic patterns are expected to vary significantly between weekdays and weekends.
hour: Traffic volume typically follows a diurnal pattern, with peak hours in the morning and evening.
rain_1h and snow_1h: Weather conditions are known to affect traffic behavior.
temp: Extreme temperatures might correlate with traffic due to behavioral changes (e.g., staying indoors during very hot/cold weather).
We can also verify the validity of the selected variables by using step wise selection. We can check the AIC to understand their impact
model_aic <- lm(high_traffic ~ ., data_model)
step(model_aic, direction = "backward", k=2)
## Start: AIC=-71721.04
## high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp
##
## Df Sum of Sq RSS AIC
## - snow_1h 1 0.01 10876 -71723
## <none> 10876 -71721
## - rain_1h 1 3.50 10879 -71708
## - hour 1 122.89 10999 -71182
## - temp 1 127.21 11003 -71163
## - Weekday 6 681.02 11557 -68806
##
## Step: AIC=-71723.02
## high_traffic ~ Weekday + hour + rain_1h + temp
##
## Df Sum of Sq RSS AIC
## <none> 10876 -71723
## - rain_1h 1 3.50 10879 -71710
## - hour 1 122.89 10999 -71184
## - temp 1 127.31 11003 -71164
## - Weekday 6 681.12 11557 -68808
##
## Call:
## lm(formula = high_traffic ~ Weekday + hour + rain_1h + temp,
## data = data_model)
##
## Coefficients:
## (Intercept) WeekdayTuesday WeekdayWednesday WeekdayThursday
## 0.279546 0.038111 0.055952 0.044196
## WeekdayFriday WeekdaySaturday WeekdaySunday hour
## 0.054716 -0.167853 -0.264081 0.007325
## rain_1h temp
## -0.008531 0.002271
We can see from the above results that other than snow_1h all other variables are important. So our choices are in the ballpark.
Now that we have selected the variables and understood their impact on the response variable, lets start analysing models. We will start with a baseline Logistic regression model with the intial variables we selected and interpret it.
baseline_model <- glm(high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp, data = data_model, family = binomial)
summary(baseline_model)
##
## Call:
## glm(formula = high_traffic ~ Weekday + hour + rain_1h + snow_1h +
## temp, family = binomial, data = data_model)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9664774 0.0346638 -27.881 < 2e-16 ***
## WeekdayTuesday 0.1576888 0.0344223 4.581 4.63e-06 ***
## WeekdayWednesday 0.2306548 0.0343497 6.715 1.88e-11 ***
## WeekdayThursday 0.1825966 0.0344583 5.299 1.16e-07 ***
## WeekdayFriday 0.2259427 0.0344698 6.555 5.57e-11 ***
## WeekdaySaturday -0.7368331 0.0360324 -20.449 < 2e-16 ***
## WeekdaySunday -1.2630017 0.0387181 -32.620 < 2e-16 ***
## hour 0.0322577 0.0013957 23.113 < 2e-16 ***
## rain_1h -0.0402054 0.0105468 -3.812 0.000138 ***
## snow_1h -0.1528918 1.1215180 -0.136 0.891564
## temp 0.0100925 0.0004293 23.509 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 65979 on 48192 degrees of freedom
## Residual deviance: 61819 on 48182 degrees of freedom
## AIC: 61841
##
## Number of Fisher Scoring iterations: 4
Intercept value is negative, this indicates that base probability of high traffic in conditions when all predictors are at their reference level is less than 50%. Weekends, especially Sunday, have far lower odds of high traffic. Midweek days have significantly higher odds compared to Monday. For each additional hour later in the day, the log-odds of high traffic increases which means traffic builds up gradually throughout the day (likely peaking at commute times). Slight decrease in likelihood of high traffic with increasing rainfall. Possibly people avoid traveling or slow down (less volume per hour). Snowfall doesn’t appear to significantly affect the odds of high traffic in this model. Likely due to very few non-zero snowfall values or confounding with other variables. Higher temperature correlates with higher odds of high traffic. This might reflect seasonal effects—warmer months generally have more movement. A very interesting insight to be dig deeper on is the impact of snow on traffic. Lets see if diagnostic techniques to select variables change our interpretations.
We are going to create a number of models using only few explanatory variables or a combinations of them to see how they compare to the baseline model above.
model_weekday <- glm(high_traffic ~ Weekday, data = data_model, family = binomial)
model_hour <- glm(high_traffic ~ hour, data = data_model, family = binomial)
model_temp <- glm(high_traffic ~ temp, data = data_model, family = binomial)
model_rain_snow <- glm(high_traffic ~ rain_1h + snow_1h, data = data_model, family = binomial)
model_hour_temp <- glm(high_traffic ~ hour + temp, data = data_model, family = binomial)
model_weekday_hour <- glm(high_traffic ~ Weekday + hour, data = data_model, family = binomial)
We will use diagnostic techniques like Akaike Information Criterion, Bayesian Information Criterion and ANNOVA to make our conclusions
AIC(model_weekday, model_hour, model_temp, model_rain_snow, model_hour_temp, model_weekday_hour, baseline_model)
## df AIC
## model_weekday 7 63073.74
## model_hour 2 65340.87
## model_temp 2 65348.30
## model_rain_snow 3 65978.96
## model_hour_temp 3 64833.48
## model_weekday_hour 8 62400.88
## baseline_model 11 61840.60
BIC(model_weekday, model_hour, model_temp, model_rain_snow, model_hour_temp, model_weekday_hour, baseline_model)
## df BIC
## model_weekday 7 63135.22
## model_hour 2 65358.43
## model_temp 2 65365.87
## model_rain_snow 3 66005.31
## model_hour_temp 3 64859.83
## model_weekday_hour 8 62471.14
## baseline_model 11 61937.21
#paste(model1$deviance, model2$deviance, model3$deviance, model4$deviance, model5$deviance, model6$deviance, baseline_model$deviance)
model_comparison <- data.frame(
Model = c("baseline_model", "Weekday", "Hour", "Temp", "Rain + Snow",
"Hour + Temp", "Weekday + Hour"),
AIC = c(AIC(baseline_model), AIC(model_weekday), AIC(model_hour), AIC(model_temp),
AIC(model_rain_snow), AIC(model_hour_temp), AIC(model_weekday_hour)),
BIC = c(BIC(baseline_model), BIC(model_weekday), BIC(model_hour), BIC(model_temp),
BIC(model_rain_snow), BIC(model_hour_temp), BIC(model_weekday_hour)),
Deviance = c(deviance(baseline_model), deviance(model_weekday), deviance(model_hour), deviance(model_temp),
deviance(model_rain_snow), deviance(model_hour_temp), deviance(model_weekday_hour))
)
Lets visualize these comparisons for better optics for interpretation
comparison_long <- pivot_longer(model_comparison, cols = c("AIC", "BIC", "Deviance"),
names_to = "Metric", values_to = "Value")
# Plot
ggplot(comparison_long, aes(x = reorder(Model, Value), y = Value, fill = Metric)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~Metric, scales = "free_y") +
labs(title = "Model Comparison: AIC, BIC, and Deviance",
x = "Model", y = "Metric Value") +
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.text.x = element_text(angle = 45, hjust = 1),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
ANNOVA tests shows whether adding variables improves model fit significantly.
anova(model_hour_temp, model_weekday_hour, baseline_model, test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: high_traffic ~ hour + temp
## Model 2: high_traffic ~ Weekday + hour
## Model 3: high_traffic ~ Weekday + hour + rain_1h + snow_1h + temp
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 48190 64827
## 2 48185 62385 5 2442.60 < 2.2e-16 ***
## 3 48182 61819 3 566.27 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Surprisingly, looking at the comparison plots and the ANNOVA testes, we see that the baseline model we choose is the best combination of variables. The ANNOVA tests also show that adding more variables is not going to increase its impact ability. Therefore, our initial conclusion holds true as following.
Traffic volume is influenced by time, weather, and weekday-hour interactions. Time of Day and Weekday are the strongest predictors of high traffic.
Rainfall slightly reduces traffic odds, while temperature increases it. Snowfall appears to have no significant effect, likely due to data sparsity.
Sunday sees the most consistent drop in traffic. Traffic is least likely on weekends.
This model could be used to predict congestion windows and help city planners or commuters schedule travel accordingly.
Targeted Traffic Alerts during high-risk hours and weekdays.
Traffic mitigation strategies during high-temperature or adverse weather forecasts.
City Planning: Use low-traffic periods for construction or maintenance.