Week 10: Data Dive

1. Reading Data

Read the data, remove implausible outliers, and derive a few time-based columns that make the data easier to work with.

library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
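# Drop rows with a temperature of 0 K or below and rainfall readings of 60 mm or more in the hour
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
# Convert temperature from Kelvin to Fahrenheit (the offset used here is 273, not 273.15)
week2<- week2|>
  mutate(temp=(((temp-273)*9/5))+32)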
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$weekday<-weekdays(as.Date(week2$date_time))
week2$weekday<-factor(week2$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
df<-week2

2. Feature Engineering

Binary Variable Selection:

To conduct a logistic regression analysis, we need a binary outcome variable. While the dataset doesn’t contain a direct binary variable, we can create one by categorizing the traffic_volume variable. Specifically, we’ll define a high traffic volume indicator (high_traffic) as follows:

high_traffic = 1 if traffic_volume is above the median value.
high_traffic = 0 if traffic_volume is at or below the median value.

This transformation allows us to model the likelihood of high traffic volume based on selected explanatory variables.

Explanatory Variables:

For our logistic regression model, we’ll consider the following explanatory variables:

temp (Temperature): Average temperature, converted from Kelvin to degrees Fahrenheit in Section 1.
rain_1h (Rainfall): Amount of rain in mm during the hour.
snow_1h (Snowfall): Amount of snow in mm during the hour.
is_weekend (Weekend Indicator): A binary variable indicating whether the observation falls on a weekend (1) or a weekday (0). This can be derived from the date_time feature.

# Compute median traffic volume
median_traffic <- median(df$traffic_volume, na.rm = TRUE)

# Create binary target variable: High traffic (1) vs. Low traffic (0)
df <- df %>%
  mutate(high_traffic = ifelse(traffic_volume > median_traffic, 1, 0))

# Create is_weekend variable (1 if weekend, 0 if weekday)
df <- df %>%
  mutate(is_weekend = ifelse(weekdays(date_time) %in% c("Saturday", "Sunday"), 1, 0))

# Check summary statistics
summary(df$high_traffic)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0     0.0     0.0     0.5     1.0     1.0

Why These Features?

high_traffic allows us to analyze factors affecting heavy traffic.
is_weekend captures the difference between weekend and weekday traffic.

3. Exploratory Data Analysis (EDA)

Traffic Volume Distribution

ggplot(df, aes(x = traffic_volume)) + 
  geom_histogram(bins = 50, fill = "blue", alpha = 0.7) +
  labs(title = "Traffic Volume Distribution", x = "Traffic Volume", y = "Count")+
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

Traffic volume exhibits a bimodal distribution, with peaks likely corresponding to rush hours. Previous data dives showed similar evidence of rush-hour traffic.

Weekend vs. Weekday Traffic

ggplot(df, aes(x = as.factor(is_weekend), y = traffic_volume, fill = as.factor(is_weekend))) +
  geom_boxplot() +
  labs(title = "Traffic Volume: Weekends vs. Weekdays", 
       x = " Weekday (0) vs. Weekend (1) ", 
       y = "Traffic Volume")+
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'white'),
          panel.grid.major = element_line(color = "grey"))

Weekdays show noticeably higher traffic than weekends in the boxplot. This is consistent with weekday commuting patterns dominating traffic trends.

4. Logistic Regression Model

Model Building

# Fit logistic regression model
model <- glm(high_traffic ~ temp + rain_1h + snow_1h + is_weekend, 
             data = df, family = binomial)

# Model summary
summary(model)

## 
## Call:
## glm(formula = high_traffic ~ temp + rain_1h + snow_1h + is_weekend, 
##     family = binomial, data = df)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.3113373  0.0219682 -14.172  < 2e-16 ***
## temp         0.0110004  0.0004122  26.687  < 2e-16 ***
## rain_1h     -0.0563801  0.0102805  -5.484 4.15e-08 ***
## snow_1h      0.0141521  1.1170725   0.013     0.99    
## is_weekend  -0.7035612  0.0208502 -33.744  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 66810  on 48192  degrees of freedom
## Residual deviance: 64925  on 48188  degrees of freedom
## AIC: 64935
## 
## Number of Fisher Scoring iterations: 4
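
The deviances in the summary above allow a quick overall check of the model. The following is a minimal sketch that reuses only the numbers printed in that output; the variable names are illustrative, and McFadden's pseudo-R^2 is an extra diagnostic that the original analysis does not compute.

```r
# Values copied from the summary(model) output above
null_dev  <- 66810; null_df  <- 48192
resid_dev <- 64925; resid_df <- 48188

# Likelihood-ratio test of the fitted model against the intercept-only model
lr_stat <- null_dev - resid_dev   # drop in deviance
lr_df   <- null_df - resid_df     # number of predictors added
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R^2; for a 0/1 outcome the deviance equals -2 * log-likelihood
pseudo_r2 <- 1 - resid_dev / null_dev

c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value, pseudo_r2 = pseudo_r2)
```

The drop in deviance (1885 on 4 degrees of freedom) indicates that the predictors jointly improve on the intercept-only model, while a pseudo-R^2 of roughly 0.03 suggests they account for only a small share of the variation in high_traffic.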

Interpreting Coefficients

# Convert log-odds to odds ratios
exp(coef(model))

## (Intercept)        temp     rain_1h     snow_1h  is_weekend 
##   0.7324668   1.0110611   0.9451798   1.0142527   0.4948200

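Intercept (-0.31) → Baseline log-odds of high traffic on a weekday at temp = 0 with no rain or snow (odds of about 0.73).
Temperature (temp = 0.011) → Each additional degree Fahrenheit multiplies the odds of high traffic by about 1.011, an increase of roughly 1.1%.
Rainfall (rain_1h = -0.056) → Each additional mm of rain in the hour multiplies the odds of high traffic by about 0.945, a decrease of roughly 5.5%.
Snowfall (snow_1h = 0.014) → Not significant (p = 0.99); the standard error (1.12) dwarfs the estimate, so the effect needs further investigation.
Weekend (is_weekend = -0.70) → The odds of high traffic on weekends are about 0.49 times the weekday odds, holding the weather variables constant.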
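
To make the coefficients more concrete, they can be turned into predicted probabilities. The sketch below is a worked example under an assumed scenario (60 degrees Fahrenheit, no rain, no snow) that is chosen purely for illustration; the coefficient values are copied from the model summary, and the vector name b is hypothetical.

```r
# Coefficients copied from the summary(model) output
b <- c(intercept = -0.3113373, temp = 0.0110004,
       rain_1h = -0.0563801, snow_1h = 0.0141521, is_weekend = -0.7035612)

# Assumed scenario: 60 degrees F, no rain, no snow
log_odds_weekday <- unname(b["intercept"] + b["temp"] * 60)
log_odds_weekend <- unname(log_odds_weekday + b["is_weekend"])

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
p_weekday <- plogis(log_odds_weekday)
p_weekend <- plogis(log_odds_weekend)

round(c(weekday = p_weekday, weekend = p_weekend), 3)
```

Under these assumptions the predicted probability of high traffic is roughly 0.59 on a weekday and 0.41 on a weekend. With the fitted model object available, predict(model, newdata, type = "response") gives the same kind of result without copying coefficients by hand.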

5. Confidence Interval for `temp` Coefficient

# Compute 95% confidence interval for temperature coefficient
conf_int_temp <- confint(model)["temp",]
conf_int_temp

##      2.5 %     97.5 % 
## 0.01019318 0.01180903

The 95% confidence interval for the temp coefficient, (0.0102, 0.0118), does not contain zero, so we conclude that temperature has a statistically significant positive association with high traffic.
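
As a cross-check, a Wald interval can be computed by hand from the estimate and standard error in the model summary. This is a minimal sketch: confint() on a glm reports a profile-likelihood interval, so the Wald interval is a close approximation and not an exact reproduction, and the variable names are illustrative.

```r
# Estimate and standard error for temp, copied from summary(model)
est <- 0.0110004
se  <- 0.0004122

# Wald 95% interval on the log-odds scale: estimate +/- z * SE
z       <- qnorm(0.975)
wald_ci <- est + c(-1, 1) * z * se

# Exponentiate to get the interval for the odds ratio per degree Fahrenheit
or_ci <- exp(wald_ci)

wald_ci
or_ci
```

The Wald limits agree with the confint() output to about five decimal places, which is expected with a sample this large. On the odds-ratio scale the interval lies entirely above 1.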

6. Conclusion

Findings

Higher temperatures are associated with significantly higher odds of high traffic.
Rain reduces the odds of heavy traffic (possibly because people avoid driving in bad weather).
Weekends have lower odds of high traffic, supporting the commuting pattern hypothesis.
The snowfall effect is not statistically significant and requires further investigation.

Possible Next Steps

Investigate the snowfall coefficient, which is positive despite expectations but too imprecisely estimated to interpret.
Explore interaction effects (e.g., temp * rain_1h).
Analyze seasonal variations in traffic volume.
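
The interaction idea listed above could be tested along the following lines. This is only a sketch on simulated data: the data frame sim, its effect sizes, and the seed are made up for illustration and say nothing about the real traffic data.

```r
# Simulated stand-in for the traffic data (all values are hypothetical)
set.seed(510)
n <- 2000
sim <- data.frame(
  temp    = runif(n, min = 0, max = 90),  # degrees Fahrenheit
  rain_1h = rexp(n, rate = 2)             # mm of rain in the hour
)
true_logit <- -0.3 + 0.01 * sim$temp - 0.05 * sim$rain_1h -
  0.002 * sim$temp * sim$rain_1h
sim$high_traffic <- rbinom(n, size = 1, prob = plogis(true_logit))

# Main-effects model vs. model with a temp-by-rain interaction
fit_main <- glm(high_traffic ~ temp + rain_1h, data = sim, family = binomial)
fit_int  <- glm(high_traffic ~ temp * rain_1h, data = sim, family = binomial)

# Likelihood-ratio test: does the interaction term improve the fit?
lrt <- anova(fit_main, fit_int, test = "Chisq")
lrt
```

Swapping sim for df (and adding snow_1h and is_weekend back into both formulas) would apply the same comparison to the real data.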

Week 10: Data Dive — GLMs

Rajashekar reddy Vedire

2025-03-31
