library(tidyverse)
## Warning: package 'purrr' was built under R version 4.4.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(psych)
##
## Attaching package: 'psych'
##
## The following object is masked from 'package:car':
##
## logit
##
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
data <- read.csv("traffic_data.csv")
data$InsuranceBinary <- ifelse(data$Insurance == "YES", 1, 0)
data_clean <- na.omit(data[, c("ViolFine", "Insurance")])
model <- lm(ViolFine ~ Insurance, data = data_clean)
summary(model)
##
## Call:
## lm(formula = ViolFine ~ Insurance, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -105.39 -71.59 3.61 42.41 414.61
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.392 0.971 108.54 <2e-16 ***
## InsuranceYES -33.803 1.777 -19.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 79.47 on 9546 degrees of freedom
## Multiple R-squared: 0.03651, Adjusted R-squared: 0.03641
## F-statistic: 361.8 on 1 and 9546 DF, p-value: < 2.2e-16
raintest(model)
##
## Rainbow test
##
## data: model
## Rain = 1.0194, df1 = 4774, df2 = 4772, p-value = 0.2529
plot(model$fitted.values, model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")

durbinWatsonTest(model)
## lag Autocorrelation D-W Statistic p-value
## 1 0.1958997 1.608168 0
## Alternative hypothesis: rho != 0
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 142.22, df = 1, p-value < 2.2e-16
qqnorm(residuals(model))
qqline(residuals(model), col = "red")

data_clean$LogFine <- log(data_clean$ViolFine + 1)
model_log <- lm(LogFine ~ Insurance, data = data_clean)
summary(model_log)
##
## Call:
## lm(formula = LogFine ~ Insurance, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.620 -2.729 1.168 1.746 2.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.61991 0.02716 133.30 <2e-16 ***
## InsuranceYES -0.89058 0.04971 -17.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.223 on 9546 degrees of freedom
## Multiple R-squared: 0.03253, Adjusted R-squared: 0.03243
## F-statistic: 321 on 1 and 9546 DF, p-value: < 2.2e-16
bptest(model_log)
##
## studentized Breusch-Pagan test
##
## data: model_log
## BP = 125.03, df = 1, p-value < 2.2e-16
set.seed(123) # for reproducibility
resids_sample <- sample(residuals(model_log), size = 5000)
shapiro.test(resids_sample)
##
## Shapiro-Wilk normality test
##
## data: resids_sample
## W = 0.77465, p-value < 2.2e-16
data_clean$LogFine <- log(data_clean$ViolFine + 1)
model_log <- lm(LogFine ~ Insurance, data = data_clean)
summary(model_log)
##
## Call:
## lm(formula = LogFine ~ Insurance, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.620 -2.729 1.168 1.746 2.636
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.61991 0.02716 133.30 <2e-16 ***
## InsuranceYES -0.89058 0.04971 -17.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.223 on 9546 degrees of freedom
## Multiple R-squared: 0.03253, Adjusted R-squared: 0.03243
## F-statistic: 321 on 1 and 9546 DF, p-value: < 2.2e-16
bptest(model_log)
##
## studentized Breusch-Pagan test
##
## data: model_log
## BP = 125.03, df = 1, p-value < 2.2e-16
shapiro.test(resids_sample)
##
## Shapiro-Wilk normality test
##
## data: resids_sample
## W = 0.77465, p-value < 2.2e-16
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.4.3
library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 105.3922 1.0254 102.782 < 2.2e-16 ***
## InsuranceYES -33.8027 1.6332 -20.698 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
"In the initial model using ViolFine as the dependent variable and Insurance as the predictor, two key assumptions were violated:
Homoscedasticity (constant variance of residuals) was violated. This was confirmed by the Breusch-Pagan test, which returned a highly significant p-value, indicating that the residuals do not have equal variance.
Normality of residuals was also violated. The Shapiro-Wilk test yielded a p-value well below 0.05, and the QQ plot showed deviations from the line, especially in the tails of the distribution.
These violations suggest that the model may produce biased standard errors and confidence intervals if left unaddressed."
## [1] "In the initial model using ViolFine as the dependent variable and Insurance as the predictor, two key assumptions were violated:\n\nHomoscedasticity (constant variance of residuals) was violated. This was confirmed by the Breusch-Pagan test, which returned a highly significant p-value, indicating that the residuals do not have equal variance.\n\nNormality of residuals was also violated. The Shapiro-Wilk test yielded a p-value well below 0.05, and the QQ plot showed deviations from the line, especially in the tails of the distribution.\n\nThese violations suggest that the model may produce biased standard errors and confidence intervals if left unaddressed."
"To address these issues, I can apply a log transformation to the dependent variable ViolFine, creating a new variable called LogFine. This transformation is a standard approach to correct heteroskedasticity and reduce skewness in the residuals."
## [1] "To address these issues, I can apply a log transformation to the dependent variable ViolFine, creating a new variable called LogFine. This transformation is a standard approach to correct heteroskedasticity and reduce skewness in the residuals."