Week 8 | Data Dive — Regression Modeling

1. Reading Data

This section reads the data and makes minor adjustments: it removes implausible outliers and derives columns that make the data easier to work with.

library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2 <- week2[week2$temp > 0, ]      # drop rows with a non-positive Kelvin temperature
week2 <- week2[week2$rain_1h < 60, ]  # drop rows with hourly rainfall of 60 or more
week2 <- week2 |>
  mutate(temp = (temp - 273) * 9 / 5 + 32)  # Kelvin to Fahrenheit (273 approximates 273.15)
# Derive hour, month, year, day, and weekday from date_time to get relevant insights.
week2$hour <- as.integer(format(as.POSIXct(week2$date_time), "%H"))
week2$month <- month(as.integer(format(as.POSIXct(week2$date_time), "%m")), label = TRUE)  # lubridate month labels
week2$year <- as.integer(format(as.POSIXct(week2$date_time), "%y"))  # two-digit year
week2$day <- as.integer(format(as.POSIXct(week2$date_time), "%d"))
week2$weekday <- weekdays(as.Date(week2$date_time))
# Order the weekdays Monday through Sunday instead of alphabetically.
week2$weekday <- factor(week2$weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
data_df <- week2

df_selected <- data_df |>
  select(traffic_volume, weather_main, temp)
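The temperature step above converts Kelvin to Fahrenheit with an offset of 273, while the exact offset is 273.15. A minimal sketch of the conversion as a reusable helper follows; `kelvin_to_fahrenheit` is a hypothetical name that does not appear in the original analysis.

```r
# Hypothetical helper: convert Kelvin to Fahrenheit using the exact 273.15 offset.
kelvin_to_fahrenheit <- function(kelvin) {
  (kelvin - 273.15) * 9 / 5 + 32
}

# Sanity check with the freezing and boiling points of water
# (approximately 32 and 212 degrees Fahrenheit).
kelvin_to_fahrenheit(c(273.15, 373.15))

# Constant shift, in degrees Fahrenheit, introduced by using 273 instead of 273.15.
0.15 * 9 / 5
```

Because the rounded offset shifts every temperature by the same constant, it would change the intercept of a regression on temp slightly but leave the slope unchanged.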

2. Data Selection

Let's choose traffic_volume as our response variable. Weather patterns plausibly affect traffic volume, so weather_main will be our explanatory variable. Since we want to keep the number of categories to at most 10, we first check how many categories weather_main is classified into.

table(df_selected$weather_main)

## 
##        Clear       Clouds      Drizzle          Fog         Haze         Mist 
##        13381        15164         1821          912         1360         5950 
##         Rain        Smoke         Snow       Squall Thunderstorm 
##         5671           20         2876            4         1034

We have 11 categories. Since Squall and Thunderstorm are both extreme weather types, we can group them into a single Extreme category, which brings the total down to 10.

df_selected <- df_selected |>
  mutate(weather_main = recode(weather_main, "Squall" = "Extreme", "Thunderstorm" = "Extreme"))
table(df_selected$weather_main)

## 
##   Clear  Clouds Drizzle Extreme     Fog    Haze    Mist    Rain   Smoke    Snow 
##   13381   15164    1821    1038     912    1360    5950    5671      20    2876

3. ANOVA Test on weather_main and traffic_volume

3.1 Define the Hypothesis

Null Hypothesis (H₀): The mean traffic volume is the same across different weather conditions.
Alternative Hypothesis (H₁): At least one weather condition has a different mean traffic volume.

In our case we have only one factor, so the formula is traffic_volume ~ weather_main and the test is a one-way ANOVA:

anova_result <- aov(traffic_volume ~ weather_main, data = df_selected)
summary(anova_result)

##                 Df    Sum Sq   Mean Sq F value Pr(>F)    
## weather_main     9 3.759e+09 417640535   107.9 <2e-16 ***
## Residuals    48183 1.865e+11   3869919                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

3.2 Key Observations

Mean Square (Mean Sq): The variability explained by weather, per degree of freedom, is about 108 times larger than the unexplained (residual) variability per degree of freedom. This ratio is the F-statistic.

-   *weather_main*: **417,640,535**

-   *Residuals*: **3,869,919**

F-statistic (F value): 107.9 (very high, indicating strong evidence against the null hypothesis).
p-value (Pr(>F)): <2e-16 (extremely small, far below the 0.05 threshold).
Since the p-value is much smaller than 0.05, we reject the null hypothesis: mean traffic volume differs for at least one weather condition. Statistical significance is not the same as a large effect, though. With roughly 48,000 observations even small differences are detectable, and the sums of squares show that weather_main accounts for only about 2% of the total variation in traffic volume (3.759e+09 out of roughly 1.90e+11).
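The quantities in the ANOVA table can be cross-checked by hand. The sketch below copies the reported values from the output above (it does not refit the model) and computes the F-statistic as a ratio of mean squares, along with eta-squared, a common effect-size measure that is not part of the original analysis.

```r
# Values copied from the ANOVA table above.
ss_weather  <- 3.759e9    # Sum Sq for weather_main
ss_residual <- 1.865e11   # Sum Sq for Residuals
ms_weather  <- 417640535  # Mean Sq for weather_main
ms_residual <- 3869919    # Mean Sq for Residuals

# The F-statistic is the ratio of the two mean squares.
f_value <- ms_weather / ms_residual

# Eta-squared: share of the total variation attributed to weather_main.
eta_squared <- ss_weather / (ss_weather + ss_residual)

round(c(F = f_value, eta_squared = eta_squared), 4)
```

The F value reproduces the 107.9 in the table, while eta-squared comes out near 0.02, which is a useful reminder that a very small p-value can coexist with a modest effect size.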

3.3 Visualizing the Relationship (Boxplot)

ggplot(df_selected, aes(x = weather_main, y = traffic_volume)) +
  geom_boxplot() +
  labs(title = "Traffic Volume vs Weather Conditions",
       x = "Weather Condition",
       y = "Traffic Volume")+
  theme(axis.text=element_text(size=25),
          axis.title.x = element_text(size = 20),
          axis.title.y = element_text(size = 20),
          plot.title = element_text(size = 20),
          legend.key.size = unit(2,"cm"),
          legend.text = element_text(size = 18),
          legend.title = element_text(size = 14),
          panel.background = element_rect(fill = 'lightblue'),
          panel.grid.major = element_line(color = "white"))

3.4 Result Implications

Traffic volume differs significantly across weather conditions, although the ANOVA alone does not show that weather causes the differences. City planners and transportation authorities may still want to consider weather conditions when designing traffic management systems. The box plot above suggests that median traffic is lower during Fog and Mist (the line inside each box is the median, not the mean). These are the conditions that designs and mitigation measures would need to cater to.
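The ANOVA only says that at least one condition differs. To see which ones, a common follow-up is a table of group summaries plus Tukey's HSD for pairwise comparisons. The sketch below runs on simulated data (the `toy` data frame and its values are hypothetical, not the real dataset); the same calls could be applied to `anova_result` from Section 3.

```r
set.seed(1)

# Simulated stand-in for df_selected (hypothetical values, not the real data).
toy <- data.frame(
  weather_main = factor(rep(c("Clear", "Fog", "Mist"), each = 50)),
  traffic_volume = c(rnorm(50, mean = 3300, sd = 500),
                     rnorm(50, mean = 2700, sd = 500),
                     rnorm(50, mean = 2900, sd = 500))
)

# Per-group mean and median, to read alongside a box plot.
aggregate(traffic_volume ~ weather_main, data = toy,
          FUN = function(v) c(mean = mean(v), median = median(v)))

# One-way ANOVA followed by Tukey's HSD for all pairwise comparisons.
toy_fit <- aov(traffic_volume ~ weather_main, data = toy)
tukey <- TukeyHSD(toy_fit)

# One row per pair of conditions: difference, confidence bounds, adjusted p-value.
tukey$weather_main
```

With ten weather categories the real table would have 45 pairwise rows, so it may be easier to filter it to the pairs with small adjusted p-values.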

4. Linear Regression with Temperature

Temperature may also be related to traffic volume, so let's fit a simple linear regression to see how strong that relationship is.

lm_model <- lm(traffic_volume ~ temp, data = df_selected)
summary(lm_model)

## 
## Call:
## lm(formula = traffic_volume ~ temp, data = df_selected)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -3626  -1984     93   1662   4341 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2721.7694    20.4533  133.07   <2e-16 ***
## temp          11.4860     0.3921   29.29   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1969 on 48191 degrees of freedom
## Multiple R-squared:  0.01749,    Adjusted R-squared:  0.01747 
## F-statistic:   858 on 1 and 48191 DF,  p-value: < 2.2e-16

4.2 Key Observations

1. Regression Equation

The fitted regression equation is:

\[\widehat{\text{Traffic Volume}} = 2721.7694 + 11.4860 \times \text{Temp}\]

Intercept (2721.7694): This represents the estimated traffic volume when the temperature is 0°F, since temp was converted from Kelvin to Fahrenheit in Section 1.
Slope (11.4860): For every 1°F increase in temperature, traffic volume is expected to increase by about 11.49 vehicles.

2. Statistical Significance of the Coefficients: The p-values for both the intercept and temp are <2e-16, which is highly significant. This means that temperature has a statistically significant association with traffic volume.

3. Model Fit (R-squared)

Multiple R-squared (0.01749) and Adjusted R-squared (0.01747): Both are very low, meaning that temperature explains only about 1.75% of the variation in traffic volume. Many other factors likely influence traffic volume beyond temperature.

4. Residual Standard Error = 1969: This indicates the typical deviation of actual traffic volumes from the predicted values. It is large relative to the predictions themselves (the intercept is about 2722), so individual predictions are imprecise.

5. F-Statistic = 858 (with a p-value < 2.2e-16): This shows that the model as a whole is statistically significant, even though the explanatory power (R²) is weak.
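The numbers in the regression summary are tied together by simple identities, which makes them easy to sanity-check. The sketch below copies the reported values from the output above (it does not refit the model); `predict_volume` is a hypothetical helper, and temp is taken to be in degrees Fahrenheit as produced by the conversion in Section 1.

```r
# Values copied from the regression summary above.
intercept <- 2721.7694
slope     <- 11.4860
t_temp    <- 29.29   # t value for temp
f_stat    <- 858     # model F-statistic
df_resid  <- 48191   # residual degrees of freedom

# Hypothetical helper: predicted traffic volume at a temperature in degrees Fahrenheit.
predict_volume <- function(temp_f) intercept + slope * temp_f
predict_volume(c(32, 70))

# With a single predictor, the model F-statistic equals the squared t statistic.
t_temp^2

# R-squared can be recovered from F and the residual degrees of freedom.
f_stat / (f_stat + df_resid)
```

The squared t value comes out at roughly 858 and the recovered R-squared at roughly 0.0175, both consistent with the summary. A 38°F swing from 32°F to 70°F moves the prediction by only about 436 vehicles, well under the residual standard error of 1969.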

4.4 Result Implications

Temperature has a statistically significant but weak association with traffic volume. The effect size is small, meaning other variables (like time of day, day of the week, holidays, or weather conditions) likely have a much stronger influence on traffic volume. We observed this previously, where time of day and day of the week had a significant impact on traffic volume. So if we want a better model, we should add more predictors (e.g., holiday, hour of the day) to improve the model's predictive power.
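As a rough illustration of how adding predictors could be evaluated, the sketch below compares a temperature-only model with one that also includes hour and weekday. It uses simulated data with an assumed daytime peak (the `sim` data frame, its coefficients, and the noise level are all hypothetical), so the numbers say nothing about the real dataset; only the workflow carries over.

```r
set.seed(42)
n <- 2000

# Simulated stand-in for data_df (hypothetical values, not the real data).
sim <- data.frame(
  temp    = runif(n, min = 0, max = 90),
  hour    = sample(0:23, n, replace = TRUE),
  weekday = factor(sample(c("Monday", "Tuesday", "Wednesday", "Thursday",
                            "Friday", "Saturday", "Sunday"), n, replace = TRUE))
)

# Assumed pattern for illustration: a daytime peak plus a small temperature effect.
daytime <- sim$hour >= 6 & sim$hour <= 19
sim$traffic_volume <- 1500 + 10 * sim$temp + 2500 * daytime + rnorm(n, sd = 800)

# Temperature-only model versus a model with hour and weekday added.
fit_temp <- lm(traffic_volume ~ temp, data = sim)
fit_full <- lm(traffic_volume ~ temp + factor(hour) + weekday, data = sim)

# Compare adjusted R-squared, which penalizes the extra coefficients.
c(temp_only = summary(fit_temp)$adj.r.squared,
  with_hour_weekday = summary(fit_full)$adj.r.squared)

# F-test for whether the added predictors improve the fit.
anova(fit_temp, fit_full)
```

Treating hour as a factor lets each hour have its own level instead of forcing a straight-line trend across the day, at the cost of 23 extra coefficients.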