Reading the data and making minor adjustments: removing implausible outliers and deriving date columns that make the data easier to work with.
library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
# Read the raw data
week2 <- read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
# Drop implausible readings: non-positive Kelvin temperatures and rainfall of 60 mm or more in an hour
week2 <- week2[week2$temp > 0, ]
week2 <- week2[week2$rain_1h < 60, ]
# Convert temperature from Kelvin to Fahrenheit (273 is used as an approximation of 273.15)
week2 <- week2 |>
  mutate(temp = (temp - 273) * 9 / 5 + 32)
# Convert the date_time information into hour, month, year, day, and weekday to get relevant insights
week2$hour <- as.integer(format(as.POSIXct(week2$date_time), "%H"))
# Use lubridate's month() to get the month labels
week2$month <- month(as.integer(format(as.POSIXct(week2$date_time), "%m")), label = TRUE)
week2$year <- as.integer(format(as.POSIXct(week2$date_time), "%y")) # two-digit year
week2$day <- as.integer(format(as.POSIXct(week2$date_time), "%d"))
week2$weekday <- weekdays(as.Date(week2$date_time))
# Order the weekday factor from Monday to Sunday
week2$weekday <- factor(week2$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
df <- week2
Binary Variable Selection:
To conduct a logistic regression analysis, we need a binary outcome variable. While the dataset doesn’t contain a direct binary variable, we can create one by categorizing the traffic_volume variable.
Specifically, we’ll define a high traffic volume indicator (high_traffic) as follows:
high_traffic = 1 if traffic_volume is above the median value.
high_traffic = 0 if traffic_volume is at or below the median value.
This transformation allows us to model the likelihood of high traffic volume based on selected explanatory variables.
Explanatory Variables:
For our logistic regression model, we’ll consider the following explanatory variables:
temp (Temperature): Average temperature, converted from Kelvin to degrees Fahrenheit during data preparation.
rain_1h (Rainfall): Amount of rain in mm during the hour.
snow_1h (Snowfall): Amount of snow in mm during the hour.
is_weekend (Weekend Indicator): A binary variable indicating whether the observation falls on a weekend (1) or a weekday (0). This can be derived from the date_time feature.
# Compute median traffic volume
median_traffic <- median(df$traffic_volume, na.rm = TRUE)
# Create binary target variable: High traffic (1) vs. Low traffic (0)
df <- df %>%
  mutate(high_traffic = ifelse(traffic_volume > median_traffic, 1, 0))
# Create is_weekend variable (1 if weekend, 0 if weekday)
df <- df %>%
  mutate(is_weekend = ifelse(weekdays(date_time) %in% c("Saturday", "Sunday"), 1, 0))
# Check summary statistics
summary(df$high_traffic)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0     0.0     0.0     0.5     1.0     1.0
high_traffic allows us to analyze factors affecting heavy traffic.
is_weekend captures the difference between weekend and weekday traffic.
ggplot(df, aes(x = traffic_volume)) +
  geom_histogram(bins = 50, fill = "blue", alpha = 0.7) +
  labs(title = "Traffic Volume Distribution", x = "Traffic Volume", y = "Count") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))
Traffic volume exhibits a bimodal distribution, with the peaks likely corresponding to low-volume off-peak hours and high-volume rush hours. Previous data dives showed similar evidence of rush-hour traffic.
ggplot(df, aes(x = as.factor(is_weekend), y = traffic_volume, fill = as.factor(is_weekend))) +
  geom_boxplot() +
  labs(title = "Traffic Volume: Weekends vs. Weekdays",
       x = "Weekday (0) vs. Weekend (1)",
       y = "Traffic Volume") +
  theme(axis.text = element_text(size = 25),
        axis.title.x = element_text(size = 20),
        axis.title.y = element_text(size = 20),
        plot.title = element_text(size = 20),
        legend.key.size = unit(2, "cm"),
        legend.text = element_text(size = 18),
        legend.title = element_text(size = 14),
        panel.background = element_rect(fill = 'white'),
        panel.grid.major = element_line(color = "grey"))
Weekdays have visibly higher traffic than weekends. This is consistent with weekday commuting patterns dominating traffic trends.
# Fit logistic regression model
model <- glm(high_traffic ~ temp + rain_1h + snow_1h + is_weekend,
             data = df, family = binomial)
# Model summary
summary(model)
##
## Call:
## glm(formula = high_traffic ~ temp + rain_1h + snow_1h + is_weekend,
## family = binomial, data = df)
##
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3113373  0.0219682 -14.172  < 2e-16 ***
## temp         0.0110004  0.0004122  26.687  < 2e-16 ***
## rain_1h     -0.0563801  0.0102805  -5.484 4.15e-08 ***
## snow_1h      0.0141521  1.1170725   0.013     0.99
## is_weekend  -0.7035612  0.0208502 -33.744  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 66810 on 48192 degrees of freedom
## Residual deviance: 64925 on 48188 degrees of freedom
## AIC: 64935
##
## Number of Fisher Scoring iterations: 4
# Convert log-odds to odds ratios
exp(coef(model))
## (Intercept)        temp     rain_1h     snow_1h  is_weekend
##   0.7324668   1.0110611   0.9451798   1.0142527   0.4948200
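Intercept (-0.31) → Baseline log-odds of high traffic when every predictor is zero (a weekday hour at 0°F with no rain or snow).
Temperature (temp = 0.011) → Warmer weather increases the probability of high traffic; each additional degree Fahrenheit multiplies the odds by about 1.011.
Rainfall (rain_1h = -0.056) → More rain reduces the likelihood of high traffic; each additional mm in the hour multiplies the odds by about 0.945.
Snowfall (snow_1h = 0.014) → Not significant (p = 0.99); the standard error (1.12) is far larger than the estimate, so this needs further investigation.
Weekend (is_weekend = -0.70) → High traffic is much less likely on weekends; the odds are roughly half (0.49 times) the weekday odds.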
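To make these coefficients more concrete, the following sketch converts them into percent changes in the odds and into predicted probabilities for two illustrative scenarios. It re-enters the estimates from the summary output by hand so that it runs on its own. The 70°F scenario and the names coefs, pct_change_odds, scenarios, and pred_prob are assumptions chosen for illustration, not part of the original analysis.

```r
# Coefficient estimates copied from the summary(model) output
coefs <- c("(Intercept)" = -0.3113373,
           temp          =  0.0110004,
           rain_1h       = -0.0563801,
           snow_1h       =  0.0141521,
           is_weekend    = -0.7035612)

# Percent change in the odds of high traffic per one-unit increase in each predictor
pct_change_odds <- (exp(coefs[-1]) - 1) * 100
round(pct_change_odds, 2)

# Hypothetical scenarios: a dry, snow-free 70-degree-Fahrenheit hour on a weekday vs. a weekend
# Column order matches coefs: intercept, temp, rain_1h, snow_1h, is_weekend
scenarios <- rbind(weekday = c(1, 70, 0, 0, 0),
                   weekend = c(1, 70, 0, 0, 1))

# Linear predictor (log-odds), then inverse logit to obtain probabilities
log_odds <- drop(scenarios %*% coefs)
pred_prob <- plogis(log_odds)
round(pred_prob, 3)
```

Under these assumed conditions the predicted probability of high traffic is roughly 0.61 on a weekday and 0.44 on a weekend. With the fitted model object available, predict(model, newdata = ..., type = "response") should give the same probabilities directly.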
temp Coefficient
# Compute 95% confidence interval for temperature coefficient
conf_int_temp <- confint(model)["temp",]
conf_int_temp
##      2.5 %     97.5 %
## 0.01019318 0.01180903
If the 95% CI for temp does not contain zero, we conclude that temperature has a statistically significant association with high traffic. The interval computed above, (0.0102, 0.0118), lies entirely above zero, which supports a positive effect of temperature on the probability of high traffic.
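As a rough cross-check on this interval, a Wald interval can be computed by hand from the estimate and standard error reported in the model summary. Note that confint() on a glm object is based on profile likelihood, so the two intervals need not match exactly. The variable names below are illustrative only.

```r
# Estimate and standard error for temp, copied from the summary(model) output
est_temp <- 0.0110004
se_temp  <- 0.0004122

# 95% Wald interval on the log-odds scale: estimate +/- z * SE
z <- qnorm(0.975)
wald_ci <- c(lower = est_temp - z * se_temp,
             upper = est_temp + z * se_temp)
wald_ci

# Exponentiate to express the interval as an odds ratio per degree Fahrenheit
exp(wald_ci)
```

The hand-computed bounds agree with the confint() output to about four decimal places, which is expected with a sample this large.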
Higher temperatures significantly increase high traffic probability.
Rain reduces the odds of heavy traffic (possibly because people avoid driving in bad weather).
Weekends have lower traffic volumes, supporting the commuting pattern hypothesis.
Snowfall effect is not statistically significant, requiring further investigation.
Investigate why snowfall has a positive (though non-significant) coefficient despite expectations; the very large standard error suggests the estimate is poorly determined.
Explore interaction effects (e.g., temp * rain_1h).
Analyze seasonal variations in traffic volume.
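One way to approach the interaction question is a likelihood-ratio comparison of nested models. The sketch below uses simulated data so that it runs on its own; the simulated distributions, the effect sizes, and the object names (sim, fit_main, fit_inter, lrt) are assumptions for illustration only and say nothing about the real traffic data. With the real data, the same two glm() calls would use data = df.

```r
set.seed(42)

# Simulated stand-in for the traffic data (assumed distributions, not real values)
n <- 2000
sim <- data.frame(
  temp       = rnorm(n, mean = 47, sd = 23),       # degrees Fahrenheit
  rain_1h    = rexp(n, rate = 5),                  # mm during the hour
  is_weekend = rbinom(n, size = 1, prob = 2 / 7)   # 1 = weekend, 0 = weekday
)
log_odds <- -0.3 + 0.011 * sim$temp - 0.06 * sim$rain_1h - 0.7 * sim$is_weekend
sim$high_traffic <- rbinom(n, size = 1, prob = plogis(log_odds))

# Main-effects model vs. model that adds a temp-by-rain interaction
fit_main  <- glm(high_traffic ~ temp + rain_1h + is_weekend,
                 data = sim, family = binomial)
fit_inter <- glm(high_traffic ~ temp * rain_1h + is_weekend,
                 data = sim, family = binomial)

# Likelihood-ratio (chi-squared) test of the added interaction term
lrt <- anova(fit_main, fit_inter, test = "Chisq")
lrt
```

A small p-value in the second row of the table would suggest that the effect of temperature on high traffic depends on rainfall; otherwise the simpler main-effects model is usually preferred.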