Reading the data and performing minor adjustments to remove inappropriate outliers and make the data easy to work with.
library(readr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(lubridate)
library(GGally)
library(corrplot)
library(car)
library(ggthemes)
library(ggrepel)
library(boot)
library(broom)
library(lindia)
week2=read_csv("C:/Users/rajas/OneDrive/Desktop/Desktop/Applied Data Science/INFOH510/R Jupyter/Metro_Interstate_Traffic_Volume.csv")
week2=week2[week2$temp>0,]
week2=week2[week2$rain_1h< 60,]
week2<- week2|>
mutate(temp=(((temp-273)*9/5))+32)
week2$hour<- as.integer(format(as.POSIXct(week2$date_time),"%H")) #converting the date_time information into hours,month,year, weekdays to get relevant insights.
week2$month<- month(as.integer(format(as.POSIXct(week2$date_time),"%m")),label = TRUE) #using lubridate library to get the month labels
week2$year<- as.integer(format(as.POSIXct(week2$date_time),"%y"))
week2$day<- as.integer(format(as.POSIXct(week2$date_time),"%d"))
week2$weekday<-weekdays(as.Date(week2$date_time))
week2$weekday<-factor(week2$weekday,levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) #sorting the weekdays
data_df<-week2
df_selected <- data_df |>
select(traffic_volume, weather_main, temp, hour, holiday,weekday)
df_selected
## # A tibble: 48,193 × 6
## traffic_volume weather_main temp hour holiday weekday
## <dbl> <chr> <dbl> <int> <chr> <fct>
## 1 5545 Clouds 59.5 9 None Tuesday
## 2 4516 Clouds 61.4 10 None Tuesday
## 3 4767 Clouds 61.8 11 None Tuesday
## 4 5026 Clouds 62.8 12 None Tuesday
## 5 4918 Clouds 64.7 13 None Tuesday
## 6 5181 Clear 65.7 14 None Tuesday
## 7 5584 Clear 68.3 15 None Tuesday
## 8 6015 Clear 69.5 16 None Tuesday
## 9 5791 Clouds 70.1 17 None Tuesday
## 10 4770 Clouds 68.2 18 None Tuesday
## # ℹ 48,183 more rows
Below are the insights we gained from the previous data dive where we chose the response and explanatory variable and performed various tests. One of the test was an anova test where we disproved the null hypothesis and concluded that weather effects have impact on mean traffic volume.
table(df_selected$weather_main)
##
## Clear Clouds Drizzle Fog Haze Mist
## 13381 15164 1821 912 1360 5950
## Rain Smoke Snow Squall Thunderstorm
## 5671 20 2876 4 1034
df_selected <- df_selected |>
mutate(weather_main = dplyr::recode(weather_main, "Squall" = "Extreme", "Thunderstorm" = "Extreme"))
table(df_selected$weather_main)
##
## Clear Clouds Drizzle Extreme Fog Haze Mist Rain Smoke Snow
## 13381 15164 1821 1038 912 1360 5950 5671 20 2876
ANOVA Test on weather_main and traffic_volume
anova_result <- aov(traffic_volume ~ weather_main, data = df_selected)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## weather_main 9 3.759e+09 417640535 107.9 <2e-16 ***
## Residuals 48183 1.865e+11 3869919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is much smaller than 0.05, we reject the null hypothesis. This means that weather conditions significantly affect traffic volume. The large F-statistic suggests that the differences in mean traffic volume across weather categories are meaningful. Weather plays a statistically significant role in determining traffic patterns. City planners and transportation authorities should consider weather conditions when designing traffic management systems. The above box plot shows us mean traffic is lower during Fog and Mist. These are the specific conditions that will be needed to be catered to while addressing concerns or creating designs.
ggplot(df_selected, aes(x = weather_main, y = traffic_volume)) +
geom_boxplot() +
labs(title = "Traffic Volume vs Weather Conditions",
x = "Weather Condition",
y = "Traffic Volume")+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'lightblue'),
panel.grid.major = element_line(color = "white"))
We also built a linear regression model to check the dependency of variables. We chose one dependent variable and concluded that Temperature has a statistically significant but weak effect on traffic volume. The effect size is small, meaning other variables (like time of day, day of the week, holidays, or weather conditions) likely have a much stronger influence on traffic volume. We have observed this previously where time of the day and week of the day have significant impact on traffic volume. So if we want a better model, we should add more predictors (e.g., holiday, hour of the day) to improve the model’s predictive power.
lm_model <- lm(traffic_volume ~ temp, data = df_selected)
summary(lm_model)
##
## Call:
## lm(formula = traffic_volume ~ temp, data = df_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3626 -1984 93 1662 4341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2721.7694 20.4533 133.07 <2e-16 ***
## temp 11.4860 0.3921 29.29 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1969 on 48191 degrees of freedom
## Multiple R-squared: 0.01749, Adjusted R-squared: 0.01747
## F-statistic: 858 on 1 and 48191 DF, p-value: < 2.2e-16
We will further delve deeper in this assignment to include more variables and diagnose them.
We’ll extend our model by adding:
Hour of the Day (hour) – Traffic volume is likely to vary depending on the time of day.
Weather Condition (weather_main) – Certain weather conditions may impact traffic volume.
Interaction Term (temp * hour) – To check if temperature influences traffic differently depending on the time of day.
Binary Variable (holiday) – To capture differences in traffic on holidays.
Hour (hour): We expect rush hour (7-9 AM, 4-7 PM) to have increased traffic.
Weather (weather_main): Certain conditions (rain, snow) might reduce traffic. We have already seen in the above anova analysis that Haze and Fog do effect weather similar to clouds, etc.
Interaction (temp * hour): Traffic may be more sensitive to temperature changes at certain times.
Holiday (holiday): Weekends/holidays may see reduced commuter traffic.
Before proceeding, let’s check Variance Inflation Factor (VIF) to detect multicollinearity:
vif(lm(traffic_volume ~ temp + hour + weather_main + temp:hour + holiday, data = df_selected))
## GVIF Df GVIF^(1/(2*Df))
## temp 4.310860 1 2.076261
## hour 5.304140 1 2.303072
## weather_main 1.167735 9 1.008652
## holiday 1.009854 11 1.000446
## temp:hour 9.271777 1 3.044959
Any VIF over 5 is a concern as it indicates high multicollinearity. Here we observe both hour and temp:hour having VIF greater. the interaction pair has a large multicolinearity. We should ideally remove these parameter. But we will run the VIF check removing only the pair to see if that reduces the colinearity of hour.
vif(lm(traffic_volume ~ temp + hour + weather_main + holiday, data = df_selected))
## GVIF Df GVIF^(1/(2*Df))
## temp 1.147373 1 1.071155
## hour 1.038489 1 1.019063
## weather_main 1.159077 9 1.008235
## holiday 1.006483 11 1.000294
From the results above we see that removing the interaction pair significantly reduced the multicolinearity of other variables. We will keep the variables hour and temp separately and instead choose weekday for interaction pair.
vif(lm(traffic_volume ~ temp + hour + weather_main + weekday:hour + holiday, data = df_selected))
## GVIF Df GVIF^(1/(2*Df))
## temp 1.147471 1 1.071201
## hour 2.581052 1 1.606565
## weather_main 1.166707 9 1.008603
## holiday 1.006485 11 1.000294
## hour:weekday 2.559308 6 1.081459
With every variable VIF within acceptable levels the new variable list changes as below.
Hour of the Day (hour) – Traffic volume is likely to vary depending on the time of day.
Weather Condition (weather_main) – Certain weather conditions may impact traffic volume.
Interaction Term (weekday * hour) – To check if weekday influences traffic differently depending on the time of day. (Traffic might be different on weekdays and weekends)
Binary Variable (holiday) – To capture differences in traffic on holidays.
Lets fit and diagnose the above model to gain further insights.
lm_extended <- lm(traffic_volume ~ temp + hour + weather_main + weekday:hour + holiday, data = df_selected)
summary(lm_extended)
##
## Call:
## lm(formula = traffic_volume ~ temp + hour + weather_main + weekday:hour +
## holiday, data = df_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5397.7 -1646.8 68.3 1485.2 4645.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 476.8176 736.9499 0.647 0.517625
## temp 8.8236 0.3849 22.924 < 2e-16 ***
## hour 89.0123 1.9028 46.779 < 2e-16 ***
## weather_mainClouds 443.1876 21.4701 20.642 < 2e-16 ***
## weather_mainDrizzle 165.9657 45.1162 3.679 0.000235 ***
## weather_mainExtreme -210.7459 58.6663 -3.592 0.000328 ***
## weather_mainFog 12.3467 61.9070 0.199 0.841920
## weather_mainHaze 501.0958 51.5611 9.718 < 2e-16 ***
## weather_mainMist 76.7633 28.2479 2.717 0.006580 **
## weather_mainRain 133.5057 28.8533 4.627 3.72e-06 ***
## weather_mainSmoke 143.3855 403.9934 0.355 0.722651
## weather_mainSnow 166.6649 38.3272 4.348 1.37e-05 ***
## holidayColumbus Day -473.8311 1092.8421 -0.434 0.664598
## holidayIndependence Day -0.8928 1092.9233 -0.001 0.999348
## holidayLabor Day -144.9491 1004.1190 -0.144 0.885221
## holidayMartin Luther King Jr Day -317.8631 1042.0217 -0.305 0.760334
## holidayMemorial Day -305.5051 1092.9102 -0.280 0.779837
## holidayNew Years Day 501.9674 1041.9403 0.482 0.629977
## holidayNone 1105.6598 736.9498 1.500 0.133538
## holidayState Fair -525.3388 1092.8549 -0.481 0.630730
## holidayThanksgiving Day -46.5805 1041.9750 -0.045 0.964343
## holidayVeterans Day -326.2052 1092.8027 -0.299 0.765320
## holidayWashingtons Birthday -438.4094 1092.7909 -0.401 0.688287
## hour:weekdayTuesday 17.2176 2.2925 7.510 6.00e-14 ***
## hour:weekdayWednesday 21.9347 2.2796 9.622 < 2e-16 ***
## hour:weekdayThursday 27.6850 2.2857 12.113 < 2e-16 ***
## hour:weekdayFriday 33.5054 2.2895 14.634 < 2e-16 ***
## hour:weekdaySaturday -14.9533 2.2923 -6.523 6.94e-11 ***
## hour:weekdaySunday -48.2795 2.2887 -21.094 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1805 on 48164 degrees of freedom
## Multiple R-squared: 0.1755, Adjusted R-squared: 0.175
## F-statistic: 366.1 on 28 and 48164 DF, p-value: < 2.2e-16
The Multiple R-squared shows that the model explains 17.55% of the variance in traffic_volume which is pretty much a weak fit. The Adjust R-squared value of 0.175 does not differ from the adjusted signifying the new variables we added have not over fitted. The higher F-statistic value and very low p value shows us that the overall model is highly significant and one or more of the predictors are useful. The model is statistically significant, but it explains a relatively small portion of the variability in traffic volume. This suggests there are other important factors influencing traffic that are not accounted for in the model. Lets analyse individual variables and then further discuss if we can improve the model.
Estimate: 89.01 (p < 2e-16): As each hour of the day passes, the traffic increases by 89 vehicles on average. This assumes a linear effect, meaning traffic increases steadily throughout the day. We will verify this by plotting the residuals under diagnostics to see if this conclusion hold true.
Estimate: 8.82 (p < 2e-16): Traffic volume increases by 8.82 vehicles per degree Celsius. This suggests a small but statistically significant relationship between temperature and traffic volume. Warmer temperatures might encourage more travel. However, the effect size is relatively small compared to time-of-day and weather.
ggplot(data = df_selected) +
geom_point(mapping = aes(x=temp, y=traffic_volume))+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
Some weather conditions show a strong impact on traffic:
Significant Increases in Traffic:
Clouds: +443.19 (p < 2e-16) → Substantial increase.
Haze: +501.10 (p < 2e-16) → Large increase.
Mist: +76.76 (p = 0.0066) → Small increase.
Rain: +133.51 (p = 3.72e-06) → Moderate increase.
Snow: +166.66 (p = 1.37e-05) → Moderate increase.
Unexpected finding:
Significant Decreases in Traffic:
Extreme weather (Thunderstorms, Squalls): -210.75 (p = 0.0003)
Traffic volume drops in extreme weather, possibly due to hazardous conditions.
Weather clearly plays a role, but the specific effects should be explored further, especially the increase during snow events.
Most holidays are not statistically significant,
meaning they do not have a major impact on traffic.
No Holiday (None
): +1105.66 (p = 0.133).
This suggests traffic is slightly higher on
non-holidays. Some holidays (e.g., Thanksgiving, Veterans Day)
may not affect traffic significantly.A deeper look at
pre-holiday and post-holiday effects might be
useful.
The interaction term shows how traffic changes depending on the day of the week:
Tuesday-Friday: Traffic increases significantly as the day progresses.
Saturday: Less traffic later in the day.
Sunday: Strongest negative effect. Traffic drops throughout the day.
This confirms that weekday traffic follows work-hour patterns, while weekends behave differently. Sunday consistently shows the largest decrease in traffic, likely due to fewer work commutes.
gg_reshist(lm_extended)+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
When we look at the above histogram of residuals, we see that there is a slight concern, as the tails are not equally distributed. There is a “shorter” tail on the left, and a longer one on the right. This is not severe, but it indicates that some aspect of the phenomenon is missing from the model (i.e., we may need more/different explanatory variables).
gg_resfitted(lm_extended) +
geom_smooth(se=FALSE)+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
For this particular plot, we can already see that our assumption is being violated, in that the variance in errors increases and decreases as the day progresses. Traffic volume typically follows a peak pattern (e.g., rush hours). A linear trend may not fully capture variations in morning/evening peaks.
gg_qqplot(lm_extended)+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
The residuals deviate significantly from the red diagonal line, indicating non-normality. The tails (both lower and upper extremes) curve away from the line, suggesting heavy tails (presence of outliers or skewness). The left tail (lower values) is much lower than expected. The right tail (higher values) is much higher than expected. This is characteristic of a skewed distribution or potential unequal variance. The pattern suggests outliers or non-linear relationships that the model may not be capturing well. We can use boxplots or leverage Cook’s distance to identify high-leverage points. Based on these details normalizing the residuals like Log transformation or polynomial transformation might help us improve the model.
gg_cooksd(lm_extended, threshold = 'matlab')+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
Here, we can see a few rows of data have a high influence on the model. Each number represents the row in the data. Lets select Rows 16334, 16337, 23088, 23086, 36808, 38793, 46802 , so we can investigate them by plotting various variables against the response variable.
ggplot(data = slice(df_selected,
c(16334,16337, 23088,23086, 36808,38793,46802))) +
geom_point(data = df_selected,
mapping = aes(x = hour, y = traffic_volume)) +
geom_point(mapping = aes(x = hour, y = traffic_volume),
color = 'darkred') +
geom_text_repel(mapping = aes(x = hour,
y = traffic_volume,
label = weather_main),
color = 'darkred') +
labs(title="Investigating High Influence Points",
subtitle="Label = Weather Condition")+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
The above plot shows us that the majority of the outliers that are present are on days when the weather is “Smoke”. But the phenomenon is not clear as it creates both low traffic conditions and very high traffic contions. We need to figure out if any other variable explains this behavior.
ggplot(data = slice(df_selected,
c(16334,16337, 23088,23086, 36808,38793,46802))) +
geom_point(data = df_selected,
mapping = aes(x = hour, y = traffic_volume)) +
geom_point(mapping = aes(x = hour, y = traffic_volume),
color = 'darkred') +
geom_text_repel(mapping = aes(x = hour,
y = traffic_volume,
label = weekday),
color = 'darkred') +
labs(title="Investigating High Influence Points",
subtitle="Label = Day of the Week")+
theme(axis.text=element_text(size=25),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 20),
legend.key.size = unit(2,"cm"),
legend.text = element_text(size = 18),
legend.title = element_text(size = 14),
panel.background = element_rect(fill = 'white'),
panel.grid.major = element_line(color = "grey"))
Its interesting to note that most influenced days by Smoky weather seem to be Monday, Friday and Saturday. Even though the weather was not ideal, people seemed to travel more or were unaffected on Monday and Friday. This could either be due to the fact that these are peak work travel days or that people prefer to take their own vehicles instead of public transport during days where there impaired vision.
Traffic volume is influenced by time, weather, and weekday-hour interactions.
Extreme weather reduces traffic, while clouds, mist, and haze increase it.
Sunday sees the most consistent drop in traffic.
Holiday effects are minimal.
The model can be made more efficient and fit better for more relevant insights by applying log transformations or polynomial for variables like hour
lm_poly <- lm(traffic_volume ~ temp + poly(hour, 2) + weather_main + holiday, data = df_selected)
summary(lm_poly)
##
## Call:
## lm(formula = traffic_volume ~ temp + poly(hour, 2) + weather_main +
## holiday, data = df_selected)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5082.4 -679.2 -116.7 644.1 3820.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.697e+03 4.804e+02 9.777 < 2e-16 ***
## temp 4.670e+00 2.513e-01 18.586 < 2e-16 ***
## poly(hour, 2)1 1.499e+05 1.198e+03 125.118 < 2e-16 ***
## poly(hour, 2)2 -3.138e+05 1.188e+03 -264.032 < 2e-16 ***
## weather_mainClouds 6.181e+01 1.406e+01 4.396 1.10e-05 ***
## weather_mainDrizzle 2.777e+01 2.939e+01 0.945 0.344646
## weather_mainExtreme -2.141e+02 3.821e+01 -5.603 2.12e-08 ***
## weather_mainFog -1.811e+02 4.033e+01 -4.490 7.13e-06 ***
## weather_mainHaze -5.300e+01 3.365e+01 -1.575 0.115219
## weather_mainMist -8.703e+01 1.840e+01 -4.729 2.27e-06 ***
## weather_mainRain -3.311e+01 1.879e+01 -1.762 0.078108 .
## weather_mainSmoke -4.974e+02 2.631e+02 -1.890 0.058745 .
## weather_mainSnow -1.512e+02 2.498e+01 -6.052 1.44e-09 ***
## holidayColumbus Day -4.858e+02 7.120e+02 -0.682 0.495051
## holidayIndependence Day 1.748e+01 7.121e+02 0.025 0.980414
## holidayLabor Day -2.667e+01 6.542e+02 -0.041 0.967487
## holidayMartin Luther King Jr Day -1.766e+02 6.789e+02 -0.260 0.794766
## holidayMemorial Day 9.338e+00 7.121e+02 0.013 0.989537
## holidayNew Years Day 5.803e+02 6.789e+02 0.855 0.392646
## holidayNone -1.644e+03 4.803e+02 -3.424 0.000618 ***
## holidayState Fair -4.380e+02 7.120e+02 -0.615 0.538465
## holidayThanksgiving Day 4.634e+01 6.789e+02 0.068 0.945578
## holidayVeterans Day -2.922e+02 7.120e+02 -0.410 0.681551
## holidayWashingtons Birthday -2.834e+02 7.120e+02 -0.398 0.690624
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1176 on 48169 degrees of freedom
## Multiple R-squared: 0.65, Adjusted R-squared: 0.6498
## F-statistic: 3889 on 23 and 48169 DF, p-value: < 2.2e-16
The Multiple R-squared shows that the model explains 65% of the variance in traffic_volume which is a moderate to good fit much better than the previews model. The Adjust R-squared value of 0.6498 does not differ from the adjusted signifying the new variables we added have not over fitted. The higher F-statistic value and very low p value shows us that the overall model is highly significant and one or more of the predictors are useful. the polynomial function added to the model shows us +ve poly 1 which signifies that the hour of the day significantly impact traffic volumes. the negative coefficients for ploy 2 variations for hour confirms a U-shaped or inverted-U pattern for traffic volume throughout the day. This suggests rush hour peaks and lower volume in early morning & late night. We have seen that in the Residual Vs Fitted graph we plotted for the previous model.
The new model is fits well (R² = 65%) and captures rush hour peaks, weather effects, and holiday impacts. Temperature has a small but significant effect. Bad weather (Extreme, Fog, Snow) lowers traffic significantly. Holidays have mixed effects, but traffic is lower on regular days.