Homework 7

library(tidyverse)

## Warning: package 'purrr' was built under R version 4.4.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.4.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.4.3

## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(psych)

## 
## Attaching package: 'psych'
## 
## The following object is masked from 'package:car':
## 
##     logit
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

data <- read.csv("traffic_data.csv")

data$InsuranceBinary <- ifelse(data$Insurance == "YES", 1, 0)

data_clean <- na.omit(data[, c("ViolFine", "Insurance")])

model <- lm(ViolFine ~ Insurance, data = data_clean)

summary(model)

## 
## Call:
## lm(formula = ViolFine ~ Insurance, data = data_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -105.39  -71.59    3.61   42.41  414.61 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   105.392      0.971  108.54   <2e-16 ***
## InsuranceYES  -33.803      1.777  -19.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 79.47 on 9546 degrees of freedom
## Multiple R-squared:  0.03651,    Adjusted R-squared:  0.03641 
## F-statistic: 361.8 on 1 and 9546 DF,  p-value: < 2.2e-16

raintest(model)

## 
##  Rainbow test
## 
## data:  model
## Rain = 1.0194, df1 = 4774, df2 = 4772, p-value = 0.2529

plot(model$fitted.values, model$residuals, 
     xlab = "Fitted Values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "red")

durbinWatsonTest(model)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.1958997      1.608168       0
##  Alternative hypothesis: rho != 0

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 142.22, df = 1, p-value < 2.2e-16

qqnorm(residuals(model))
qqline(residuals(model), col = "red")

data_clean$LogFine <- log(data_clean$ViolFine + 1)
model_log <- lm(LogFine ~ Insurance, data = data_clean)


summary(model_log)

## 
## Call:
## lm(formula = LogFine ~ Insurance, data = data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.620 -2.729  1.168  1.746  2.636 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.61991    0.02716  133.30   <2e-16 ***
## InsuranceYES -0.89058    0.04971  -17.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.223 on 9546 degrees of freedom
## Multiple R-squared:  0.03253,    Adjusted R-squared:  0.03243 
## F-statistic:   321 on 1 and 9546 DF,  p-value: < 2.2e-16

bptest(model_log)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_log
## BP = 125.03, df = 1, p-value < 2.2e-16

set.seed(123)  # for reproducibility
resids_sample <- sample(residuals(model_log), size = 5000)

shapiro.test(resids_sample)

## 
##  Shapiro-Wilk normality test
## 
## data:  resids_sample
## W = 0.77465, p-value < 2.2e-16

data_clean$LogFine <- log(data_clean$ViolFine + 1)

model_log <- lm(LogFine ~ Insurance, data = data_clean)

summary(model_log)

## 
## Call:
## lm(formula = LogFine ~ Insurance, data = data_clean)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.620 -2.729  1.168  1.746  2.636 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.61991    0.02716  133.30   <2e-16 ***
## InsuranceYES -0.89058    0.04971  -17.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.223 on 9546 degrees of freedom
## Multiple R-squared:  0.03253,    Adjusted R-squared:  0.03243 
## F-statistic:   321 on 1 and 9546 DF,  p-value: < 2.2e-16

bptest(model_log)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_log
## BP = 125.03, df = 1, p-value < 2.2e-16

shapiro.test(resids_sample)

## 
##  Shapiro-Wilk normality test
## 
## data:  resids_sample
## W = 0.77465, p-value < 2.2e-16

library(sandwich)

## Warning: package 'sandwich' was built under R version 4.4.3

library(lmtest)
coeftest(model, vcov = vcovHC(model, type = "HC1"))

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)  105.3922     1.0254 102.782 < 2.2e-16 ***
## InsuranceYES -33.8027     1.6332 -20.698 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

"In the initial model using ViolFine as the dependent variable and Insurance as the predictor, two key assumptions were violated:

Homoscedasticity (constant variance of residuals) was violated. This was confirmed by the Breusch-Pagan test, which returned a highly significant p-value, indicating that the residuals do not have equal variance.

Normality of residuals was also violated. The Shapiro-Wilk test yielded a p-value well below 0.05, and the QQ plot showed deviations from the line, especially in the tails of the distribution.

These violations suggest that the model may produce biased standard errors and confidence intervals if left unaddressed."

## [1] "In the initial model using ViolFine as the dependent variable and Insurance as the predictor, two key assumptions were violated:\n\nHomoscedasticity (constant variance of residuals) was violated. This was confirmed by the Breusch-Pagan test, which returned a highly significant p-value, indicating that the residuals do not have equal variance.\n\nNormality of residuals was also violated. The Shapiro-Wilk test yielded a p-value well below 0.05, and the QQ plot showed deviations from the line, especially in the tails of the distribution.\n\nThese violations suggest that the model may produce biased standard errors and confidence intervals if left unaddressed."

"To address these issues, I can apply a log transformation to the dependent variable ViolFine, creating a new variable called LogFine. This transformation is a standard approach to correct heteroskedasticity and reduce skewness in the residuals."

## [1] "To address these issues, I can apply a log transformation to the dependent variable ViolFine, creating a new variable called LogFine. This transformation is a standard approach to correct heteroskedasticity and reduce skewness in the residuals."

Homework 7

Krista Martinez

2025-04-22