Homework 7

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readxl)

# Read the district data once; the model below uses district_.
district_ <- read.csv("district (1).csv", header = TRUE)

model <- lm(DPETALLC ~ DPETBLAP + DPETHISP, data = district_)
summary(model)

## 
## Call:
## lm(formula = DPETALLC ~ DPETBLAP + DPETHISP, data = district_)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -10876  -4249  -2250   -730 186556 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   579.11     740.64   0.782    0.434    
## DPETBLAP      106.56      24.50   4.349 1.48e-05 ***
## DPETHISP       68.45      13.30   5.148 3.08e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12300 on 1204 degrees of freedom
## Multiple R-squared:  0.03095,    Adjusted R-squared:  0.02934 
## F-statistic: 19.23 on 2 and 1204 DF,  p-value: 6.033e-09

plot(model, which = 1)

library(lmtest)

## Warning: package 'lmtest' was built under R version 4.4.3

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 4.4.3

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

raintest(model)

## 
##  Rainbow test
## 
## data:  model
## Rain = 0.76738, df1 = 604, df2 = 600, p-value = 0.9994

dwtest(model)

## 
##  Durbin-Watson test
## 
## data:  model
## DW = 1.5638, p-value = 1.38e-14
## alternative hypothesis: true autocorrelation is greater than 0

plot(model, which = 3)

bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 10.415, df = 2, p-value = 0.005477

plot(model, which = 2)

shapiro.test(residuals(model))

## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.42594, p-value < 2.2e-16

library(car)

## Warning: package 'car' was built under R version 4.4.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.4.3

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

vif(model)

## DPETBLAP DPETHISP 
##  1.03535  1.03535

cor(district_[, c("DPETBLAP", "DPETHISP")], use = "complete.obs")

##            DPETBLAP   DPETHISP
## DPETBLAP  1.0000000 -0.1847777
## DPETHISP -0.1847777  1.0000000

Step 4: Assumption Checks

Linearity

The residuals vs. fitted plot and the Rainbow test (raintest) both suggest that the linearity assumption is reasonably met. The Rainbow test does not reject linearity (Rain = 0.767, p = 0.9994). The plot shows slight curvature, which may indicate mild non-linearity, but it does not appear severe.

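Independence of Errors

The Durbin-Watson test returned DW = 1.5638 (p = 1.38e-14). A statistic below 2 with a p-value this small points to positive autocorrelation in the residuals, so the assumption of independent errors appears to be violated. Because the test compares each residual with the one in the next row, the result depends on the order of the rows in the data file.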
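As a rough check on what the Durbin-Watson statistic measures, the sketch below computes it by hand in base R and backs out the lag-1 autocorrelation implied by the reported value. The data are simulated, and the names dw_stat, sim, fit_ordered, and fit_shuffled are made up for illustration; nothing here uses the district file.

```r
# Durbin-Watson statistic from a fitted lm object (base R only).
dw_stat <- function(fit) {
  e <- residuals(fit)
  sum(diff(e)^2) / sum(e^2)
}

# Implied lag-1 autocorrelation from the reported statistic,
# using the approximation DW = 2 * (1 - r).
implied_r <- 1 - 1.5638 / 2   # about 0.22

# Simulated data (hypothetical) with AR(1) errors.
set.seed(1)
n <- 500
x <- rnorm(n)
err <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
sim <- data.frame(x = x, y = 1 + 2 * x + err)

# Same data, same coefficients; only the row order differs.
fit_ordered  <- lm(y ~ x, data = sim)
fit_shuffled <- lm(y ~ x, data = sim[sample(n), ])

dw_ordered  <- dw_stat(fit_ordered)   # well below 2: neighboring residuals are alike
dw_shuffled <- dw_stat(fit_shuffled)  # near 2: shuffling the rows breaks the pattern
```

The comparison between the two fits is the point of the example: for data that are not a time series, a significant Durbin-Watson result says something about how the rows are arranged, not only about the model.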

Homoscedasticity

The Breusch-Pagan test was statistically significant (BP = 10.415, df = 2, p = 0.0055), indicating heteroscedasticity, that is, non-constant variance of the residuals across fitted values. This violates the homoscedasticity assumption.
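To make the test less of a black box, here is a minimal sketch of the studentized Breusch-Pagan statistic computed by hand: regress the squared residuals on the predictors, multiply the R-squared of that auxiliary regression by n, and compare the result to a chi-squared distribution with one degree of freedom per predictor. The data are simulated and the function name bp_by_hand is hypothetical; lmtest::bptest() with its default studentize = TRUE should report the same statistic.

```r
# Studentized (Koenker) Breusch-Pagan statistic by hand (base R only).
bp_by_hand <- function(fit) {
  X  <- model.matrix(fit)          # design matrix, including the intercept
  e2 <- residuals(fit)^2           # squared residuals
  aux <- lm.fit(X, e2)             # auxiliary regression of e^2 on the predictors
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- length(e2) * r2          # n * R^2
  df <- ncol(X) - 1                # number of predictors
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data (hypothetical): the error spread grows with x1.
set.seed(2)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n, sd = x1)

fit <- lm(y ~ x1 + x2)
bp  <- bp_by_hand(fit)
```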

Normality of Residuals

The Q-Q plot shows clear deviations from the expected diagonal line, especially in the tails, suggesting that the residuals are not normally distributed. The Shapiro-Wilk test supports this conclusion (W = 0.426, p < 2.2e-16), and the residual summary is strongly right-skewed (minimum -10876, maximum 186556). With a large sample (n = 1207), coefficient inference is less sensitive to non-normal residuals, but a departure this large is more than a minor deviation and may reflect a small number of extreme observations.

Multicollinearity

The Variance Inflation Factor (VIF) values for both independent variables were close to 1 (1.0354), and the correlation between the two predictors was only -0.1848. These results indicate no multicollinearity concern.
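The two numbers above are tied together: with exactly two predictors, each VIF equals 1 / (1 - r^2), where r is the correlation between them. The short check below reproduces the reported VIF from the reported correlation. It assumes the correlation and the model were computed on the same rows, which the matching result suggests but does not prove.

```r
# With two predictors, VIF = 1 / (1 - r^2), where r is their correlation.
r <- -0.1847777            # correlation between DPETBLAP and DPETHISP from cor()
vif_two <- 1 / (1 - r^2)   # about 1.03535, matching the vif(model) output
```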

Step 5: Which assumptions were violated

The model violated three assumptions:

Independence of errors – The Durbin-Watson test suggests some positive autocorrelation.

Homoscedasticity – The Breusch-Pagan test indicates non-constant variance.

Normality of residuals – The Q-Q plot and Shapiro-Wilk test show that the residuals are not normally distributed.

Step 6: What would I do to address the violations

To address the violations:

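Independence of errors: This issue is harder to resolve because the data are not a time series, so the autocorrelation may reflect how the rows are ordered in the file (for example, similar districts listed next to each other) rather than dependence over time. Including regional fixed effects or additional predictors that account for structure in the residuals might help if we had more identifying variables.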

Homoscedasticity: A common way to address this is to use robust standard errors (e.g., using lmtest::coeftest() with sandwich::vcovHC()), which adjust for heteroscedasticity and provide more reliable inference.
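A minimal sketch of what a heteroscedasticity-consistent standard error does, written in base R so the steps are visible. It computes the HC1 variant by hand; the function name hc1_se and the simulated data are assumptions for illustration. In practice the one-line call shown in the final comment is the usual route, and sandwich::vcovHC() with type = "HC1" should give the same variance matrix.

```r
# HC1 robust standard errors by hand (base R only).
hc1_se <- function(fit) {
  X <- model.matrix(fit)
  e <- residuals(fit)
  n <- nrow(X)
  k <- ncol(X)
  bread <- solve(crossprod(X))    # (X'X)^-1
  meat  <- crossprod(X * e)       # X' diag(e^2) X: each row of X scaled by its residual
  vc    <- n / (n - k) * bread %*% meat %*% bread
  sqrt(diag(vc))
}

# Simulated data (hypothetical): the error spread grows with x1.
set.seed(2)
n  <- 500
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n, sd = x1)
fit <- lm(y ~ x1 + x2)

classical_se <- sqrt(diag(vcov(fit)))  # assumes constant error variance
robust_se    <- hc1_se(fit)            # does not assume constant error variance

# With the lmtest and sandwich packages installed, the equivalent table is:
# lmtest::coeftest(fit, vcov. = sandwich::vcovHC(fit, type = "HC1"))
```

The coefficient estimates are the same under both approaches; only the standard errors, and therefore the t statistics and p-values, change.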

Normality of residuals: This violation is not as concerning with large sample sizes. Still, we could consider transforming the dependent variable (e.g., using a log transformation) to help normalize the distribution if the outcome values are highly skewed.
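The effect of a log transformation can be sketched on simulated data. The example below builds a strictly positive, right-skewed outcome (an assumption chosen so the log model is the correct one), fits the model on the raw and log scales, and compares the Shapiro-Wilk W statistics of the residuals. Whether the same improvement holds for DPETALLC would need to be checked on the actual data, and log() requires that the outcome has no zero or negative values.

```r
# Simulated data (hypothetical): y is log-normal given x, so strongly right-skewed.
set.seed(3)
n <- 500
x <- rnorm(n)
y <- exp(1 + 0.5 * x + rnorm(n))

fit_raw <- lm(y ~ x)        # outcome on its original scale
fit_log <- lm(log(y) ~ x)   # outcome on the log scale

# Shapiro-Wilk W is closer to 1 when residuals look more normal.
w_raw <- unname(shapiro.test(residuals(fit_raw))$statistic)
w_log <- unname(shapiro.test(residuals(fit_log))$statistic)
```

After the transformation the coefficients describe changes in log(y), so they need to be interpreted on that scale.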

Brandy A’Hearn

2025-04-08