library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tceq_data <- read.csv("Texas_Commission_on_Environmental_Quality_-_Supplemental_Environmental_Projects_20240925.csv")
# My linear model

model <- lm(Payable.Amount ~ Penalty.Assessed + SEP.Costs.Total + Penalty.Deferred, data = tceq_data)
summary(model)
## 
## Call:
## lm(formula = Payable.Amount ~ Penalty.Assessed + SEP.Costs.Total + 
##     Penalty.Deferred, data = tceq_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6075.8    -9.1    -7.9    -4.5  6742.1 
## 
## Coefficients:
##                    Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)       9.8000517  9.8869601     0.991    0.322    
## Penalty.Assessed  1.0002789  0.0004447  2249.183   <2e-16 ***
## SEP.Costs.Total  -1.0005624  0.0008273 -1209.437   <2e-16 ***
## Penalty.Deferred -1.0007731  0.0010284  -973.096   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 302.9 on 1279 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.82e+06 on 3 and 1279 DF,  p-value: < 2.2e-16
# Linearity plot
plot(model, which = 1)

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
raintest(model)
## 
##  Rainbow test
## 
## data:  model
## Rain = 0.0012031, df1 = 642, df2 = 637, p-value = 1

The rain statistic is low (0.0012), and the high p-value (1) suggests that the model doesn’t show signs of non-linearity. It’s also evident in the linearity plot.

Based on the Rainbow test, the model meets the linearity assumption, meaning a linear relationship between the predictors and the outcome variable is appropriate.

# Durbin-Watson Test for independence of errors

dwtest(model)
## 
##  Durbin-Watson test
## 
## data:  model
## DW = 2.0027, p-value = 0.5178
## alternative hypothesis: true autocorrelation is greater than 0

DW: 2.0027, which is close to 2. This suggests that there is minimal or no autocorrelation in the residuals.

p-value: 0.5178, which is well above the typical significance threshold. This means we fail to reject the null hypothesis of no autocorrelation.

I think the model meets the independence of errors assumption, indicating that residuals are independent of each other and there is no strong evidence of autocorrelation.

# Normality check
plot(model, which = 2)  # QQ plot of residuals

# Shapiro-Wilk Test
shapiro.test(residuals(model))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(model)
## W = 0.033113, p-value < 2.2e-16

W statistic: 0.0331, which is low. A W value closer to 1 would indicate normality, so this low value suggests a deviation from normality.

p-value: less than 2.2e-16, which not significant. This very low p-value indicates that the residuals are likely not normally distributed?

# Correlation matrix
cor(tceq_data[, c("Penalty.Assessed", "SEP.Costs.Total", "Penalty.Deferred")])
##                  Penalty.Assessed SEP.Costs.Total Penalty.Deferred
## Penalty.Assessed        1.0000000       0.9651039        0.5626695
## SEP.Costs.Total         0.9651039       1.0000000        0.4305512
## Penalty.Deferred        0.5626695       0.4305512        1.0000000

The high VIFs and correlation between Penalty.Assessed and SEP.Costs.Total indicate multicollinearity. This may affect the stability of coefficient estimates for these variables, making it challenging to interpret their individual reliably.

Model assumptions summary:

The model met both Linearity and Independence of Errors assumptions.

I think the model violates Normality of Residuals since their is significant deviation from normality. Also the Homoscedasticity required confirmation from Breusch-Pagan Test and residual plots. Lastly, there is significant multicollinearity (linearly dependency) between Penalty.Assessed and SEP.Costs.Total.

#Since the Shapiro-Wilk test and QQ plot indicates a shift from normality, I can try to transform the dependent variable.

tceq_data$log_Penalty.Assessed <- log(tceq_data$Penalty.Assessed)

model_log <- lm(log_Penalty.Assessed ~ SEP.Costs.Total + Penalty.Deferred, data = tceq_data)

# Summary of the new model
summary(model_log)
## 
## Call:
## lm(formula = log_Penalty.Assessed ~ SEP.Costs.Total + Penalty.Deferred, 
##     data = tceq_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4410 -0.5374  0.0882  0.6435  1.6693 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      9.384e+00  2.809e-02  334.08   <2e-16 ***
## SEP.Costs.Total  1.344e-05  6.062e-07   22.17   <2e-16 ***
## Penalty.Deferred 2.401e-05  2.379e-06   10.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8954 on 1280 degrees of freedom
## Multiple R-squared:  0.4298, Adjusted R-squared:  0.4289 
## F-statistic: 482.4 on 2 and 1280 DF,  p-value: < 2.2e-16
# Check residuals after transformation
par(mfrow = c(1, 2)) # Create 2 plots in one row
plot(model_log, which = 1) # Residuals vs Fitted (for homoscedasticity)
plot(model_log, which = 2) # Normal Q-Q plot (for normality)

#Multicollinearity between Penalty.Assessed and SEP.Costs.Total is high (VIF > 10), going to try to remove one of the correlated variables.

model_no_SEP <- lm(Penalty.Assessed ~ Penalty.Deferred, data = tceq_data)

# Summary of the new model
summary(model_no_SEP)
## 
## Call:
## lm(formula = Penalty.Assessed ~ Penalty.Deferred, data = tceq_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -24757  -17982  -16518   -7505 2001306 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      1.891e+04  2.297e+03   8.232  4.5e-16 ***
## Penalty.Deferred 4.485e+00  1.841e-01  24.361  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 76770 on 1281 degrees of freedom
## Multiple R-squared:  0.3166, Adjusted R-squared:  0.3161 
## F-statistic: 593.4 on 1 and 1281 DF,  p-value: < 2.2e-16