library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(MASS) 
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
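# Load the school-level data set
data <- read.csv("bexar_schools.csv")
# Keep only rows where both the predictor and the response are present
clean_data <- data[!is.na(data$DPETECOP) & 
                  !is.na(data$DA0AT21R), ]
# Simple linear regression of DA0AT21R on DPETECOP
model <- lm(DA0AT21R ~ DPETECOP, data=clean_data)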
print("1. Testing Linearity:")
## [1] "1. Testing Linearity:"
plot(model, which=1)

print(raintest(model))
## 
##  Rainbow test
## 
## data:  model
## Rain = 2.6307, df1 = 19, df2 = 17, p-value = 0.0251
print("2. Testing Independence of Errors:")
## [1] "2. Testing Independence of Errors:"
print(dwtest(model))
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 2.3422, p-value = 0.8466
## alternative hypothesis: true autocorrelation is greater than 0
print("3. Testing Homoscedasticity:")
## [1] "3. Testing Homoscedasticity:"
plot(model, which=3)

print(bptest(model))
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 1.3034, df = 1, p-value = 0.2536
print("4. Testing Normality of Residuals:")
## [1] "4. Testing Normality of Residuals:"
plot(model, which=2)

print(shapiro.test(model$residuals))
## 
##  Shapiro-Wilk normality test
## 
## data:  model$residuals
## W = 0.75878, p-value = 1.675e-06
print("5. Correlation between variables:")
## [1] "5. Correlation between variables:"
print(cor(clean_data$DPETECOP, clean_data$DA0AT21R))
## [1] -0.4033817
print("Model Summary:")
## [1] "Model Summary:"
summary(model)
## 
## Call:
## lm(formula = DA0AT21R ~ DPETECOP, data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0702  -0.9946   1.2142   3.5569   6.7033 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 99.05073    2.60726  37.990   <2e-16 ***
## DPETECOP    -0.09724    0.03676  -2.645    0.012 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.133 on 36 degrees of freedom
## Multiple R-squared:  0.1627, Adjusted R-squared:  0.1395 
## F-statistic: 6.996 on 1 and 36 DF,  p-value: 0.01203
model_log <- lm(log(DA0AT21R) ~ log(DPETECOP), data=clean_data)
print("Log-transformed Model Results:")
## [1] "Log-transformed Model Results:"
print(summary(model_log))
## 
## Call:
## lm(formula = log(DA0AT21R) ~ log(DPETECOP), data = clean_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34651 -0.00947  0.02003  0.03726  0.07366 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.66897    0.06726  69.418   <2e-16 ***
## log(DPETECOP) -0.03563    0.01651  -2.158   0.0377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.07566 on 36 degrees of freedom
## Multiple R-squared:  0.1145, Adjusted R-squared:  0.08994 
## F-statistic: 4.657 on 1 and 36 DF,  p-value: 0.03768
print("Testing assumptions of log-transformed model:")
## [1] "Testing assumptions of log-transformed model:"
print(raintest(model_log))
## 
##  Rainbow test
## 
## data:  model_log
## Rain = 3.5179, df1 = 19, df2 = 17, p-value = 0.005977
print(bptest(model_log))
## 
##  studentized Breusch-Pagan test
## 
## data:  model_log
## BP = 0.6854, df = 1, p-value = 0.4077
print(shapiro.test(model_log$residuals))
## 
##  Shapiro-Wilk normality test
## 
## data:  model_log$residuals
## W = 0.69672, p-value = 1.461e-07

Explanations: Does your model meet those assumptions?

Linearity: The residuals vs. fitted plot shows some curvature in the red line, and the rainbow test gives p = 0.0251, which is below 0.05. This suggests the relationship is not linear.

Independence of errors: The Durbin-Watson test (DW = 2.3422, p = 0.8466) gives no evidence of positive autocorrelation since p > 0.05, so the errors in college readiness can be treated as independent.

Homoscedasticity: The Scale-Location plot suggests the spread of college readiness predictions varies somewhat with socioeconomic status. However, the Breusch-Pagan test (BP = 1.3034, p = 0.2536) does not reject constant variance, so there is no statistical evidence of heteroscedasticity at the 0.05 level.

Normal residuals: The Q-Q plot shows the errors in predicting college readiness deviating from a normal distribution, and the Shapiro-Wilk test confirms it (W = 0.75878, p = 1.675e-06). Since p < 0.05, the errors are not normally distributed.

Multicollinearity: The correlation of -0.4034 shows a moderate negative relationship between college readiness and socioeconomic disadvantage. Multicollinearity itself does not apply here, because the model has only one predictor (socioeconomic disadvantage).
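The statistics quoted in prose can be pulled directly from the test objects rather than retyped, which helps keep the write-up in step with the printed output. The following is a minimal sketch on simulated data: the data frame `sim` and its columns `x` and `y` are hypothetical stand-ins, not the Bexar County file. The lmtest functions used above also return `htest` objects, so the same `$p.value` accessor should work for them as well.

```r
# Simulated stand-in data; `sim`, `x`, and `y` are hypothetical names.
set.seed(1)
sim <- data.frame(x = runif(40, 10, 95))
sim$y <- 99 - 0.1 * sim$x + rnorm(40, sd = 6)

fit <- lm(y ~ x, data = sim)

# shapiro.test() returns an "htest" object; the p-value lives in $p.value.
sw <- shapiro.test(residuals(fit))

# Collect the numbers to report in one place instead of copying them by hand.
report <- c(
  shapiro_W = unname(sw$statistic),
  shapiro_p = sw$p.value,
  pearson_r = cor(sim$x, sim$y)
)
print(signif(report, 4))
```

In an R Markdown source, values such as `report["shapiro_p"]` could then be referenced inline so the text updates whenever the data or model changes.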

Which assumptions are violated? The model violates two assumptions:
(1) Linearity: The relationship between college readiness and socioeconomic disadvantage isn't linear (rainbow test p = 0.0251).
(2) Normality: The errors in predicting college readiness aren't normally distributed (Shapiro-Wilk p = 1.675e-06).
Homoscedasticity is not rejected by the Breusch-Pagan test (p = 0.2536), and independence is not rejected by the Durbin-Watson test (p = 0.8466).

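What would you do to mitigate this? Log-transforming both the college readiness and socioeconomic disadvantage variables showed:
No improvement in model fit (R-squared 0.1145 versus 0.1627 for the original model, although the two are measured on different response scales)
Still non-linear (rainbow test p = 0.005977)
Still no evidence of heteroscedasticity (BP test p = 0.4077)
Still non-normal residuals (Shapiro-Wilk test p = 1.461e-07)
The log transformation therefore did not resolve the linearity or normality violations.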
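The residual summary for the original model (minimum of -26.07 against a first quartile of -0.99) hints that a few extreme low values may be driving the non-normality. One possible follow-up, which is not something the analysis above attempted, is to flag influential observations with Cook's distance and compare the least-squares fit with a robust fit from `MASS::rlm()`. The sketch below uses simulated data with one planted outlier; `sim`, `x`, and `y` are hypothetical names, and the 4/n cutoff is a rule of thumb rather than a formal test.

```r
library(MASS)  # provides rlm() for robust regression

# Simulated stand-in data with one planted extreme low value (row 5).
set.seed(42)
sim <- data.frame(x = runif(38, 10, 95))
sim$y <- 99 - 0.1 * sim$x + rnorm(38, sd = 3)
sim$y[5] <- sim$y[5] - 25

ols <- lm(y ~ x, data = sim)

# Cook's distance per observation; 4/n is a common rule-of-thumb cutoff.
cd <- cooks.distance(ols)
flagged <- which(cd > 4 / nrow(sim))
print(flagged)

# Huber M-estimation downweights large residuals instead of dropping rows.
rob <- rlm(y ~ x, data = sim)
print(rbind(ols = coef(ols), robust = coef(rob)))
```

If the two sets of coefficients differ noticeably, that would suggest the least-squares estimates are sensitive to a handful of schools, which is worth reporting alongside the assumption tests.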