library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)
# Read the district data once; the second, unused copy (`district`) was dropped
district_ <- read.csv("district (1).csv", header = TRUE)
model <- lm(DPETALLC ~ DPETBLAP + DPETHISP, data = district_)
summary(model)
##
## Call:
## lm(formula = DPETALLC ~ DPETBLAP + DPETHISP, data = district_)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -10876  -4249  -2250   -730 186556
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   579.11     740.64   0.782    0.434
## DPETBLAP      106.56      24.50   4.349 1.48e-05 ***
## DPETHISP       68.45      13.30   5.148 3.08e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12300 on 1204 degrees of freedom
## Multiple R-squared: 0.03095, Adjusted R-squared: 0.02934
## F-statistic: 19.23 on 2 and 1204 DF, p-value: 6.033e-09
plot(model, which = 1)
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.4.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.4.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
raintest(model)
##
## Rainbow test
##
## data: model
## Rain = 0.76738, df1 = 604, df2 = 600, p-value = 0.9994
dwtest(model)
##
## Durbin-Watson test
##
## data: model
## DW = 1.5638, p-value = 1.38e-14
## alternative hypothesis: true autocorrelation is greater than 0
plot(model, which = 3)
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 10.415, df = 2, p-value = 0.005477
plot(model, which = 2)
shapiro.test(residuals(model))
##
## Shapiro-Wilk normality test
##
## data: residuals(model)
## W = 0.42594, p-value < 2.2e-16
library(car)
## Warning: package 'car' was built under R version 4.4.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.4.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
vif(model)
## DPETBLAP DPETHISP
##  1.03535  1.03535
cor(district_[, c("DPETBLAP", "DPETHISP")], use = "complete.obs")
##            DPETBLAP   DPETHISP
## DPETBLAP  1.0000000 -0.1847777
## DPETHISP -0.1847777  1.0000000
Step 4: Assumption Checks
Linearity: The Rainbow test (raintest) does not reject linearity (Rain = 0.767, p = 0.9994), and the residuals vs. fitted plot largely agrees. The plot shows some slight curvature, which may indicate mild non-linearity, but it does not appear severe.
Independence of Errors: The Durbin-Watson test returned DW = 1.5638 (p = 1.38e-14), below the value of 2 expected when residuals are uncorrelated, which suggests positive autocorrelation. The assumption of independence of errors may therefore be violated. Because the data do not appear to be a time series, this result depends on the order of the rows in the file and should be read with that in mind.
Homoscedasticity: The studentized Breusch-Pagan test was statistically significant (BP = 10.415, df = 2, p = 0.0055), indicating heteroscedasticity, that is, non-constant variance of the residuals. This violates the homoscedasticity assumption.
Normality of Residuals: The Q-Q plot shows clear deviations from the expected diagonal line, especially in the tails, and the Shapiro-Wilk test rejects normality (W = 0.426, p < 2.2e-16). The residual summary (median -2250, maximum 186556) points to strong right skew rather than a minor departure. With a large sample (n = 1207), inference on the coefficients is fairly robust to non-normal residuals, but the extreme values remain a concern.
Multicollinearity: The Variance Inflation Factor (VIF) for both independent variables was 1.035, close to the minimum of 1, and the correlation between the two variables was only -0.185. These results indicate no multicollinearity concern.
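Two of the figures in these checks can be reproduced by hand, which may help in reading them. The sketch below uses only the numbers printed in the output above; the variable names (`r`, `dw`, `vif_by_hand`, `rho_approx`) are illustrative. It relies on two standard relations: with exactly two predictors, each VIF equals 1 / (1 - r^2), and the Durbin-Watson statistic is approximately 2 * (1 - rho), where rho is the lag-1 residual autocorrelation.

```r
# Values copied from the console output above
r  <- -0.1847777   # correlation between DPETBLAP and DPETHISP
dw <- 1.5638       # Durbin-Watson statistic

# With two predictors, the VIF of each one is 1 / (1 - r^2)
vif_by_hand <- 1 / (1 - r^2)

# DW is approximately 2 * (1 - rho), so rho is approximately 1 - DW / 2
rho_approx <- 1 - dw / 2

print(round(vif_by_hand, 5))
print(round(rho_approx, 4))
```

The hand-computed VIF agrees with the vif() output, and the implied lag-1 autocorrelation of about 0.22 gives a sense of scale for the Durbin-Watson result.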
Step 5: Which assumptions were violated
The model violated three assumptions:
Independence of errors – The Durbin-Watson test suggests some positive autocorrelation.
Homoscedasticity – The Breusch-Pagan test indicates non-constant variance.
Normality of residuals – The Q-Q plot and Shapiro-Wilk test show that the residuals are not normally distributed.
Step 6: What would I do to address the violations
To address the violations:
Independence of errors: This issue is harder to resolve without time-series data or additional predictors that account for structure in the residuals. Including regional fixed effects might help if we had more identifying variables. District-level fixed effects would only be estimable with repeated observations per district, which this file does not appear to have.
Homoscedasticity: A common way to address this is to use robust standard errors (e.g., using lmtest::coeftest() with sandwich::vcovHC()), which adjust for heteroscedasticity and provide more reliable inference.
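Normality of residuals: This violation is less concerning for coefficient inference with a large sample. Still, because the residuals are strongly right-skewed, we could consider transforming the dependent variable (e.g., with a log transformation, which requires strictly positive values) to bring the distribution closer to normal. The coefficients would then describe changes in the log of the outcome rather than in the outcome itself.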
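The two remedies named above, robust standard errors and a log transformation, can be sketched as follows. This is a minimal illustration on simulated data, not a rerun of the district model: the data frame `sim` and its columns (`pct_a`, `pct_b`, `outcome`) are hypothetical stand-ins, and the choice of the HC3 estimator is an assumption (it is the default type in sandwich::vcovHC()).

```r
library(lmtest)    # coeftest()
library(sandwich)  # vcovHC()

set.seed(42)
n <- 500

# Simulated stand-in data; these names are hypothetical, not the district columns
sim <- data.frame(pct_a = runif(n, 0, 100), pct_b = runif(n, 0, 100))

# Strictly positive, right-skewed outcome whose spread grows with its mean
sim$outcome <- exp(6 + 0.01 * sim$pct_a + 0.005 * sim$pct_b + rnorm(n, sd = 1))

fit <- lm(outcome ~ pct_a + pct_b, data = sim)

# Remedy 1: keep the OLS coefficients, but replace the standard errors
# with heteroscedasticity-consistent (HC3) ones
robust <- coeftest(fit, vcov. = vcovHC(fit, type = "HC3"))
print(robust)

# Remedy 2: refit with a log-transformed outcome and compare residual normality
fit_log <- lm(log(outcome) ~ pct_a + pct_b, data = sim)
w_raw <- unname(shapiro.test(residuals(fit))$statistic)
w_log <- unname(shapiro.test(residuals(fit_log))$statistic)
print(c(raw = w_raw, log = w_log))
```

For the real analysis, the same two calls would be applied to `model` and to a refit with log(DPETALLC), provided DPETALLC has no zero or negative values. Robust standard errors change only the inference, not the estimates, whereas the log refit changes what the coefficients mean.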