Homework 7

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.2
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lmtest)

## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(readxl)
districts <- read_excel("district.xls")

cleaned_districts <- districts |> drop_na()

cleaned_districts_multiple <- lm(DAGC4X21R~DPFPAHSAP+DZCAMPUS+DPETECOP, data=cleaned_districts)
plot(cleaned_districts_multiple,which=1)

raintest(cleaned_districts_multiple)

## 
##  Rainbow test
## 
## data:  cleaned_districts_multiple
## Rain = 2.4044, df1 = 162, df2 = 157, p-value = 2.774e-08

durbinWatsonTest(cleaned_districts_multiple)

##  lag Autocorrelation D-W Statistic p-value
##    1      0.02382918      1.946878    0.46
##  Alternative hypothesis: rho != 0

plot(cleaned_districts_multiple,which=3)

bptest(cleaned_districts_multiple)

## 
##  studentized Breusch-Pagan test
## 
## data:  cleaned_districts_multiple
## BP = 4.097, df = 3, p-value = 0.2512

plot(cleaned_districts_multiple,which=2)

shapiro.test(cleaned_districts_multiple$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  cleaned_districts_multiple$residuals
## W = 0.53172, p-value < 2.2e-16

vif(cleaned_districts_multiple)

## DPFPAHSAP  DZCAMPUS  DPETECOP 
##  1.010078  1.027152  1.026994

cleaned_districts_vars <- cleaned_districts |> dplyr::select(DAGC4X21R, DPFPAHSAP,DZCAMPUS,DPETECOP)
cor(cleaned_districts_vars)

##             DAGC4X21R   DPFPAHSAP    DZCAMPUS    DPETECOP
## DAGC4X21R  1.00000000 -0.04883676 -0.16016676 -0.31783590
## DPFPAHSAP -0.04883676  1.00000000 -0.06593274  0.06475669
## DZCAMPUS  -0.16016676 -0.06593274  1.00000000  0.14403665
## DPETECOP  -0.31783590  0.06475669  0.14403665  1.00000000

Explanation

The diagnostic tests reveal mixed results. While the model satisfies assumptions of no autocorrelation (Durbin-Watson), homoscedasticity (Breusch-Pagan), and no multicollinearity (VIF < 10), it fails the normality assumption (Shapiro-Wilk test, confirmed by QQ-plot) and shows evidence of model misspecification (Rainbow test). The significant Rainbow test suggests the linear model may be missing important features, despite scatter plots appearing linear. In order to mitigate failures, I would transform the data using a log transformation, like so:

cleaned_districts_log <- lm(log(DAGC4X21R + 0.001) ~ 
                           log(DPFPAHSAP + 0.001) + 
                           log(DZCAMPUS + 0.001) + 
                           log(DPETECOP + 0.001), 
                           data = cleaned_districts_vars)
raintest(cleaned_districts_log)

## 
##  Rainbow test
## 
## data:  cleaned_districts_log
## Rain = 5.9154, df1 = 162, df2 = 157, p-value < 2.2e-16

Conclusion

However, after applying a log transformation, the Shapiro-Wilk test still showed significant non-normality (p < 0.05). Interestingly, the p-value remained at 2.2e-16, the same as the untransformed model, suggesting the transformation did not improve residual normality.

Homework 7

Janet N. Ekezie

2025-10-29

Explanation

Conclusion