library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
tceq_data <- read.csv("Texas_Commission_on_Environmental_Quality_-_Supplemental_Environmental_Projects_20240925.csv")
# Selected variables of interest
variables_of_interest <- tceq_data[, c("Penalty.Assessed", "Penalty.Deferred", "Payable.Amount", "SEP.Costs.Total", "SEP.Offset.Total")]

# Correlations
cor_matrix <- cor(variables_of_interest, use = "complete.obs")

# Correlation matrix
print(cor_matrix)
##                  Penalty.Assessed Penalty.Deferred Payable.Amount
## Penalty.Assessed        1.0000000        0.5626695      0.9687725
## Penalty.Deferred        0.5626695        1.0000000      0.4802672
## Payable.Amount          0.9687725        0.4802672      1.0000000
## SEP.Costs.Total         0.9651039        0.4305512      0.8930307
## SEP.Offset.Total        0.9650806        0.4305327      0.8930041
##                  SEP.Costs.Total SEP.Offset.Total
## Penalty.Assessed       0.9651039        0.9650806
## Penalty.Deferred       0.4305512        0.4305327
## Payable.Amount         0.8930307        0.8930041
## SEP.Costs.Total        1.0000000        0.9999827
## SEP.Offset.Total       0.9999827        1.0000000

Some correlations I see are between Penalty Assessed and other variables (except Penalty Deferred), suggesting that the assessed penalty is a key driver in determining other financial aspects of the enforcement actions. Additinoally, the near-perfect correlation between SEP Costs Total and SEP Offset Total suggests that the costs of SEPs are being fully offset, likely against the assessed penalties.

# Creating pairs plot here
pairs(variables_of_interest)

# Kendall correlation
cor.test(tceq_data$Penalty.Assessed, tceq_data$SEP.Costs.Total, method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  tceq_data$Penalty.Assessed and tceq_data$SEP.Costs.Total
## z = 43.736, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.8169446
ggplot(tceq_data, aes(x = Penalty.Assessed, y = SEP.Costs.Total)) +
  geom_point() +
  geom_smooth(method = "lm", col = "blue") +
  labs(title = "Penalty Assessed vs. SEP Costs", x = "Penalty Assessed", y = "SEP Costs")
## `geom_smooth()` using formula = 'y ~ x'

The analysis of the “Penalty Assessed” and “SEP Costs Total” variables from the TCEQ data shows a strong positive correlation (τ = 0.8169) between the two. I believe this indicates that as the assessed penalties increase, the costs of Supplemental Environmental Projects (SEPs) also rise. The statistical significance of this correlation (p-value < 2.2e-16) suggests that this relationship is not due to random chance.

I chose Kendall’s Tau as the correlation method becasue for one, the dataset contains ties where the same value for “Penalty Assessed” or “SEP Costs Total” appears multiple times. Given that the relationship between “Penalty Assessed” and “SEP Costs Total” does not need to be perfectly linear, Kendall’s Tau is appropriate as it assesses whether an increase in one variable consistently results in an increase in the other. Additionally, I think it works for skewed financial data like penalties and SEP costs.