Section 1: Introduction

1.1 Background and Motivating Question

With affordability being a central concern for many Americans, the question of what factors correlate with affordability is extremely relevant. The goal of the analysis is to investigate what variables most influence affordability across US counties, using the mean cost-to-income ratio (AVG C2I) in each county as the quantitative response variable.

The dataset was obtained from Kaggle and contains publicly available data on quality-of-life metrics across U.S. counties, compiled from sources such as the Census Bureau, EPA, and USDA. Each row in the dataset represents a unique U.S. county, identified by its state and county name (and FIPS code), along with various economic, environmental, and educational characteristics.

1.2 Variables

The following variables were selected as predictors after removing those that were incomplete, redundant, or structurally collinear with others. Political affiliation and air quality metrics were excluded because they were recorded as per-state rather than per-county measurements, and would cause singularities alongside the State dummy variable. Median income and raw cost-of-living figures were excluded because they are already fully absorbed into the AVG C2I response. Cost-to-income breakdowns by family type, park coverage, and diversity rankings were excluded for incompleteness or redundancy.

The final set of predictors consists of two categorical variables (State and Locale) and five quantitative variables (Unemployment, CrimeRate, Water_Quality, Population, Stu_Tea_Rank):

AVG C2I (response): ratio of annual cost of living to median household income as of 2022, expressed as a percentage.
Unemployment: county unemployment rate (year unspecified in source data).
CrimeRate: number of crimes per resident as of 2016.
Water_Quality: average number of EPA water quality violations per visit.
Population: county population as of 2022.
Stu_Tea_Rank: rank of county by average public school student-to-teacher ratio (rank 1 = lowest ratio). While ordinal in nature, this variable is treated as numeric given the large number of rank levels (1-3,048), for which a linear approximation is reasonable.
State: the state the county belongs to (51 levels including DC, though DC was later removed; see Data Cleaning).
Locale: the urbanization classification of the county (e.g. Urban, Suburban, Town, Rural), encoded as a 12-level factor derived from the ULOCALE code.

Section 2: Methods

2.1 Data Loading and Cleaning

# Load the data
df = read.csv("CountyQOL.csv")

# Helper functions to clean columns
clean_percent = function(x) {
  as.numeric(str_replace_all(x, "%", ""))  # remove percentage sign
}

clean_population = function(x) {
  as.numeric(str_replace_all(x, ",", ""))  # remove comma
}

num_extract = function(x) {
  as.numeric(str_extract(x, "\\d+"))  # extract first number encountered
}

# Clean the data
df_clean = df %>%
  mutate(
    # Convert categorical variables
    State = as.factor(LSTATE),  # 51 levels because of DC
    Locale = as.factor(num_extract(ULOCALE)),
    
    # Clean numeric columns
    Population = clean_population(X2022.Population),
    CrimeRate = num_extract(X2016.Crime.Rate)/1000,
    Unemployment = na_if(clean_percent(Unemployment), -1),  # -1 = no data
    Water_Quality = na_if(WaterQualityVPV, -1),
    Stu_Tea_Rank = na_if(Stu.Tea.Rank, -1),
    AVG_C2I = clean_percent(AVG.C2I),
    
  ) 

# Select columns of interest, remove NaN values
model_df = df_clean %>%
  dplyr::select(AVG_C2I,
    Unemployment,
    CrimeRate,
    Water_Quality,
    State,
    Locale,
    Population,
    Stu_Tea_Rank
  ) %>%
  na.omit()

Several steps were taken to prepare the raw dataset for modelling. Most numeric columns were stored as character strings with formatting characters such as percent signs and commas, so helper functions were written to strip those and coerce each column to numeric.

In addition, the categorical variables were converted to factor type to ensure correct treatment during the modeling stage.

A few columns required special cleaning. For instance, the crime rate column encoded values in the form “X/1000”, so only the leading integer was extracted and then divided by 1000 to express the rate as a proportion per capita. The ULOCALE column contained locale codes prefixed with a category number (e.g. “42-Rural: Distant”), so the numeric prefix was extracted and treated as a factor.

Finally, sentinel values of -1 were used throughout the dataset to indicate missing data, most notably in the Unemployment and Water_Quality columns. These were replaced with NA before omitting rows that had any NaN values.

The columns were renamed to more readable labels and only the columns of interest were kept in the dataset for modeling.

After cleaning, the dataset contained 2,763 complete observations across 8 variables (appendix A). One additional observation, Washington DC, was discovered during diagnostics to have a hat value of exactly 1, indicating perfect leverage, because DC has only one county-equivalent unit (appendix A). This causes the LOOCV and standardized residual formulas to produce undefined values and makes the model difficult to evaluate. It was removed, leaving a final modeling dataset of 2,762 observations in model_df_nodc.

model_df_nodc = model_df[model_df$State != "DC", ]
model_df_nodc$State = droplevels(model_df_nodc$State)

2.2 Exploratory Data Analysis

# Summary statistics
summary(model_df)

##     AVG_C2I        Unemployment      CrimeRate       Water_Quality    
##  Min.   : 60.83   Min.   : 1.340   Min.   :0.00000   Min.   :  0.000  
##  1st Qu.: 94.55   1st Qu.: 2.790   1st Qu.:0.01100   1st Qu.:  0.000  
##  Median :104.43   Median : 3.490   Median :0.01700   Median :  1.000  
##  Mean   :107.15   Mean   : 3.711   Mean   :0.01914   Mean   :  3.088  
##  3rd Qu.:116.88   3rd Qu.: 4.325   3rd Qu.:0.02600   3rd Qu.:  3.000  
##  Max.   :287.30   Max.   :17.190   Max.   :0.08900   Max.   :456.000  
##                                                                       
##      State          Locale      Population       Stu_Tea_Rank   
##  TX     : 235   43     :697   Min.   :    358   Min.   :   1.0  
##  GA     : 146   42     :590   1st Qu.:  10772   1st Qu.: 756.5  
##  VA     : 117   41     :358   Median :  26909   Median :1536.0  
##  MO     : 107   32     :350   Mean   : 107634   Mean   :1527.5  
##  KY     : 102   33     :238   3rd Qu.:  71953   3rd Qu.:2289.5  
##  KS     : 101   21     :164   Max.   :9721138   Max.   :3048.0  
##  (Other):1955   (Other):366

# Correlation matrix
nums = model_df_nodc[sapply(model_df_nodc, is.numeric)]  # numeric variables

knitr::kable(round(cor(nums), 3))

	AVG_C2I	Unemployment	CrimeRate	Water_Quality	Population	Stu_Tea_Rank
AVG_C2I	1.000	0.403	0.062	0.028	-0.077	-0.196
Unemployment	0.403	1.000	0.130	0.085	0.034	0.082
CrimeRate	0.062	0.130	1.000	-0.010	0.205	0.429
Water_Quality	0.028	0.085	-0.010	1.000	0.022	0.021
Population	-0.077	0.034	0.205	0.022	1.000	0.386
Stu_Tea_Rank	-0.196	0.082	0.429	0.021	0.386	1.000

# Student-teacher ratio, crime rate, and population
# AVG_C2I and Unemployment

# Distribution of the response
ggplot(model_df_nodc, aes(x = AVG_C2I)) +
  geom_histogram(bins = 60, fill = "steelblue", color = "white") +
  labs(title = "Distribution of AVG C2I",
       x = "AVG C2I (%)", y = "Count") +
  theme_minimal()

# Right-skewed tail: check if log transform helps later

ggplot(model_df_nodc, aes(x = log(AVG_C2I))) +
  geom_histogram(bins = 60, fill = "coral", color = "white") +
  labs(title = "Distribution of log(AVG C2I)", x="log(AVG C2I)", y = "Count") +
  theme_minimal()  # a little more symmetric

# Unemployment vs AVG_C2I – strongest numeric predictor
ggplot(model_df_nodc, aes(x = Unemployment, y = AVG_C2I)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = TRUE, color = "red") +
  labs(title = "Unemployment Rate vs. AVG C2I",
       x = "Unemployment (%)", y = "AVG C2I (%)") +
  theme_minimal()

# The whole pairs plot can be found in the appendix,
# these are the 2 most significant trends between predictors
# Population vs Stu_Tea_Rank - Exponential
ggplot(model_df_nodc, aes(x = Stu_Tea_Rank, y = Population)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "glm", formula = y ~ x, 
              method.args = list(family = gaussian(link = "log"),
                                 control=list(maxit=100)), 
              color = "blue") +
  labs(title = "Population vs Student-Teacher Rank",
       x = "Stu:Tea Rank", y = "Population") +
  theme_minimal()

# CrimeRate vs Stu_Tea_Rank - Linear
ggplot(model_df_nodc, aes(x = Stu_Tea_Rank, y = CrimeRate)) +
  geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", 
              color = "blue") +
  labs(title = "Crime Rate vs Student-Teacher Rank",
       x = "Stu:Tea Rank", y = "Crime Rate") +
  theme_minimal()

# Boxplot: AVG_C2I by locale
ggplot(model_df_nodc, aes(x = Locale, y = AVG_C2I)) +
  geom_boxplot(fill = "lightblue") +
  labs(title = "AVG C2I by Locale Type", x = "Locale", y = "AVG C2I (%)") +
  theme_minimal()

# Identify potential outliers: Water_Quality has max 456 - inspect
outlier_wq = model_df_nodc %>% filter(Water_Quality > 50)
# Counties with Water Quality violations > 50:
nrow(outlier_wq)

## [1] 10

# Does unemployment's slope differ by locale?
ggplot(model_df_nodc, aes(x = Unemployment, y = AVG_C2I, color = Locale)) +
  geom_smooth(method = "lm", se = FALSE)  # Lines not parallel- interaction

Exploratory analysis was conducted using summary statistics, a pairwise scatterplot matrix (appendix B), a correlation matrix, several individual plots, and outlier analysis.

The response variable AVG_C2I ranges from 60.83 to 287.30 with a mean of 107.15 and median of 104.43, indicating a moderate right skew driven by a small number of high-cost counties. A histogram of the log-transformed response showed a noticeably more symmetric distribution, motivating a log transformation of the response as one model improvement to explore.

Among the quantitative predictors, Unemployment has by far the strongest linear relationship with AVG_C2I (r = 0.40), which is visible in the scatterplot. Stu_Tea_Rank shows a meaningful negative correlation (r = -0.20): counties with a lower student-to-teacher ratio rank (i.e. fewer students per teacher) tend to have lower cost-to-income ratios, likely because these are wealthier counties with higher incomes.

CrimeRate and Water_Quality both show weaker individual correlations with the response (r = 0.06 and 0.03, respectively), though they may still contribute once other variables are controlled for.

Potential multicollinearity between predictors was examined. The most notable correlation among predictors is between CrimeRate and Stu_Tea_Rank (r = 0.43) and between Population and Stu_Tea_Rank (r = 0.39), which is expected since larger, denser counties tend to have both higher crime rates and more crowded schools. The relationship between population and student teacher rank appears to be exponential in the plot. These two correlations are moderate and not severe enough to preclude including both variables in the model, but they suggest that VIF should be examined. Water_Quality exhibited an extreme outlier at 456 violations per visit. This value appears legitimate (it reflects a county with persistent, severe water safety violations), but it warrants monitoring in subsequent diagnostics.

Boxplots of AVG_C2I by Locale type showed that urban core counties (Locale 11) tend to have higher cost-to-income ratios than rural counties, consistent with the well-known urban affordability crisis, though rural counties have more outliers. This pattern suggests Locale may be a useful predictor.

Finally, the slope of unemployment versus AVG C2I differed visibly across Locale types, providing a visual motivation for exploring a Locale:Unemployment interaction term in the model.

2.3 Initial Full Model

# Start with simple additive model

init_model = lm(AVG_C2I ~ ., data = model_df_nodc)

s = summary(init_model)
sprintf(paste0("Residual Std Error: %.4f | R^2: %.4f | Adj R^2: %.4f",
  "| F-stat: %.2f (p < 2.2e-16)"),
    s$sigma, s$r.squared, s$adj.r.squared, s$fstatistic[1])

## [1] "Residual Std Error: 13.7898 | R^2: 0.4943 | Adj R^2: 0.4821| F-stat: 40.53 (p < 2.2e-16)"

The initial model was fit using all available predictors (see appendix C for full summary). It yielded an adjusted R² of 0.4821, meaning the model explains approximately 48% of the variability in county cost-to-income ratios after adjusting for the number of predictors. The overall F-test is highly significant (p < \(2.2\times10^{-16}\)), confirming that the predictors collectively explain a meaningful portion of variance.

Among the continuous predictors, Unemployment is by far the most significant (coefficient = 7.370, p < \(2\times10^{-16}\)). Holding all else equal, a one percentage-point increase in the unemployment rate is associated with an approximately 7.4 percentage-point increase in the cost-to-income ratio. This is intuitive, since counties with weaker labor markets have lower incomes while the cost of living remains relatively fixed. CrimeRate (coef = 110.3, p < 0.001) and Water_Quality (coef = 0.061, p = 0.02) are also statistically significant, though their practical effect sizes are smaller. Stu_Tea_Rank carries a small but highly significant negative coefficient (-0.00417, \(2\times10^{-16}\)). Thus, a county ranked 1,000 places better is associated with a cost-to-income ratio about 4.2 points lower. Population is not significant (p = 0.24) and was removed in subsequent models. The State categorical variable is collectively very important: state fixed effects absorb broad regional differences in housing costs, wages, and policy not captured by the other predictors.

Locale 12 (City: Midsize, the first non-reference Locale level) carries a coefficient of -12.47, meaning that counties in that locale type have a cost-to-income ratio about 12.5 percentage points lower than the reference urban core (Locale 11), holding all else equal. Several Locale levels (33, 43) are not statistically significant at \(\alpha\) = 0.05, suggesting these rural tiers are not strongly differentiated from the reference once other predictors are controlled.

2.4 Initial Diagnostics

# Diagnostic plots for the initial model
# Residuals show clear right-skew and heteroscedasticity
par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(init_model)

par(mfrow=c(1,1))

bptest(init_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  init_model
## BP = 202.33, df = 65, p-value = 5.213e-16

shapiro.test(resid(init_model))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(init_model)
## W = 0.9139, p-value < 2.2e-16

sprintf("Influential obs (Cook's D > 4/n): %d",
    sum(cooks.distance(init_model) > 4/nobs(init_model)))

## [1] "Influential obs (Cook's D > 4/n): 124"

The diagnostic plots reveal several violations of linear model assumptions. The Residuals vs Fitted plot shows a somewhat fan-shaped spread and more negative residuals for higher fitted values, confirming heteroscedasticity. The Breusch-Pagan test rejects homoscedasticity formally (BP = 202.33, p = \(5.2\times10^{-16}\)). The Normal Q-Q plot shows heavy deviation on both tails, especially on the right. Violation of the normality assumption was confirmed by the Shapiro-Wilk test (W = 0.9139, p < \(2.2\times10^{-16}\)). Cook’s distance flagged 124 influential observations using 4/n as the threshold. VIF values were generally moderate (appendix D), indicating absence of severe multicollinearity. The State factor shows a somewhat high generalized VIF of 8.69, which is acceptable given it is a 50-level factor absorbing substantial real geographic variation. These diagnostic findings motivated exploring response transformations in the model improvement stage in order to satisfy assumptions and make inference more robust.

2.5 Cross-Validation Baseline

# Instant LOOCV for lm objects
loocv_rmse = function(model) {
  h = hatvalues(model)
  res = residuals(model)
  sqrt(mean((res / (1 - h))^2))
}

# Instant LOOCV if response variable was transformed using log
loocv_rmse_log = function(model, actual) {
  h = hatvalues(model)
  res = residuals(model)
  loo_pred_log = fitted(model) - (h * res) / (1 - h)
  loo_pred_original = exp(loo_pred_log)
  sqrt(mean((actual - loo_pred_original)^2))
}

loocv_rmse(init_model)

## [1] 13.97991

The LOOCV formula uses hat values to compute leave-one-out residuals analytically, avoiding the need to refit the model nearly 3,000 times. The baseline LOOCV RMSE for the initial full model is 13.98 percentage points, meaning on average the model’s prediction is off by about 14 percentage points of cost-to-income ratio on a held-out county.

2.6 Model Improvement

Several strategies were explored to improve on the initial model. Each is described below, with dead ends noted briefly and promising approaches developed further.

Stepwise selection (AIC and BIC). Backward stepwise selection using AIC dropped Population from the initial model, confirming its low marginal contribution. The resulting AIC model had an adjusted \(R^2\) of 0.482 and LOOCV RMSE of 13.979 - nearly identical to the full model, suggesting Population’s removal costs essentially nothing. BIC selection was more aggressive, additionally dropping Water_Quality and Locale, which worsened adjusted \(R^2\) to 0.469. An F-test confirmed that the AIC model is significantly better than the BIC model (p = \(1.23\times10^{-11}\)), but the initial model was not significiantly better than the AIC model, so Locale and Water_Quality were retained and Population was removed. These results are shown in Appendix E.

Response transformations. A log transformation of the response (log(AVG_C2I)) improved normality of residuals (W = 0.96855 vs 0.9139) and raised adjusted R² to 0.509. However, the p-value was still extremely low, and the Breusch-Pagan test actually resulted in a higher test statistic (BP = 240.41), suggesting greater heteroscedasticity. In addition, log-transforming the response raised the LOOCV RMSE to 14.26, since errors on extreme counties got exponentiated, meaning raw prediction accuracy in the original scale is hurt. Due to these factors, a log transformation of the response was not selected for the final model.

The Box-Cox procedure suggested an optimal \(\lambda\) of approximately -0.67, close to the log (\(\lambda\) = 0). But since the response transformation makes coefficients harder to interpret and it still did not result in significantly higher p-values on the diagnostics tests, Box-Cox was not selected for the final model.

These results are shown in Appendix F.

Log and polynomial transformation of Unemployment. The CR plot for Unemployment showed a subtle curve, and the residual-vs-predictor plot confirmed a modest downward bend at high unemployment values, suggesting the relationship is not purely linear. Both a log transformation (log(Unemployment)) and a quadratic polynomial (I(Unemployment^2)) were tested. The polynomial model achieved a LOOCV RMSE of 13.78, versus 13.82 for the log model - nearly identical, but the polynomial was marginally better and was carried forward. These results are shown in Appendix F.

Interaction screening. To identify promising interactions, all pairwise combinations of quantitative predictors were tested individually, and each quantitative-by-categorical combination was tested via F-test (appendix G). The quantitative screening identified CrimeRate:Water_Quality (p = 0.039) and CrimeRate:Stu_Tea_Rank (p = 0.049) as the most significant pairs, though neither was significant under \(\alpha=0.01\), so they were not added to the final model. The categorical screening found that Locale:Unemployment (p = \(1.95\times10^{-9}\)) and Locale:CrimeRate (p = \(5.88\times10^{-6}\)) were the strongest interactions, consistent with the EDA plot showing clearly non-parallel slopes across locale types. State interactions were also highly significant in screening, but caused singularities because Hawaii (2 counties) has too few observations to estimate state-specific slopes. State interactions were therefore excluded, and the final interaction model included two strongest locale interactions.

Shrinkage methods. LASSO and ridge regression were applied to the full design matrix. LASSO converged to a very small \(\lambda\) (\(\lambda\) = 0.0087), retaining nearly all variables, and achieved a 10-fold CV RMSE of 13.96. Ridge regression achieved 14.02. Neither outperformed the OLS interaction model in cross-validated RMSE, even though they only used 10 folds instead of LOOCV. This is expected given the large sample size (n = 2762) relative to predictors, so overfitting is not the primary challenge here. These results are shown in Appendix H.

aic_model = step(init_model, direction = "backward", trace = 0)
# Removes Population

bic_model = step(init_model, direction = "backward", k = log(nobs(init_model)),
                 trace = 0)
# Removes Population, Locale, Water Quality

model_log_resp = lm(log(AVG_C2I) ~ ., data = model_df_nodc)

bc = boxcox(lm(AVG_C2I ~ ., data = model_df_nodc), plotit = FALSE)
lambda = bc$x[which.max(bc$y)]  # -2/3
model_bc = lm(((AVG_C2I^lambda - 1) / lambda) ~ ., data = model_df_nodc)

model_poly = lm(AVG_C2I ~ . + I(Unemployment^2), data=model_df_nodc)

model_log_pred = lm(AVG_C2I ~ . - Unemployment + log(Unemployment),
                    data=model_df_nodc)

model_int = lm(AVG_C2I ~ .
               + Locale:CrimeRate
               + Locale:Unemployment,
               data=model_df_nodc)

rmses = c(
  loocv_rmse(init_model),
  loocv_rmse(aic_model),
  loocv_rmse(bic_model),
  loocv_rmse_log(model_log_resp, model_df_nodc$AVG_C2I),
  loocv_rmse(model_poly),
  loocv_rmse(model_log_pred),
  loocv_rmse(model_int)
)
adj_r2 = c(
  summary(init_model)$adj.r.squared,
  summary(aic_model)$adj.r.squared,
  summary(bic_model)$adj.r.squared,
  summary(model_log_resp)$adj.r.squared,
  summary(model_poly)$adj.r.squared,
  summary(model_log_pred)$adj.r.squared,
  summary(model_int)$adj.r.squared
)
comp_df = data.frame(
  Model = c("Initial (full)", "AIC stepwise", "BIC stepwise", "log(AVG_C2I)",
            "Quadratic Unemployment", "log(Unemployment)", "Interactions"),
  Adj_R2 = round(adj_r2, 4),
  LOOCV_RMSE = round(rmses, 4)
)
comp_df

##                    Model Adj_R2 LOOCV_RMSE
## 1         Initial (full) 0.4821    13.9799
## 2           AIC stepwise 0.4820    13.9788
## 3           BIC stepwise 0.4693    14.1101
## 4           log(AVG_C2I) 0.5090    14.2584
## 5 Quadratic Unemployment 0.4967    13.7809
## 6      log(Unemployment) 0.4932    13.8177
## 7           Interactions 0.4980    13.8468

The interaction model and polynomial model are the two strongest candidates. The final model combines both, adding the quadratic unemployment term to the interaction model for a further improvement, and removing Population as it is an insignificant predictor and is correlated with other variables like locale and student-teacher-rank.

Section 3: Conclusion

The final model incorporates all predictors except Population (dropped due to non-significance), a quadratic polynomial on Unemployment, and two interaction terms motivated by the systematic screening: Locale:Unemployment and Locale:CrimeRate. See Appendix I for the full model summary.

3.1 Final Model

final_model = lm(AVG_C2I ~ . - Population
                 + Locale:CrimeRate
                 + Locale:Unemployment
                 + I(Unemployment^2),
                 data = model_df_nodc)

sf = summary(final_model)
sprintf("Residual Std Error: %.4f | R^2: %.4f | Adj R^2: %.4f",
    sf$sigma, sf$r.squared, sf$adj.r.squared)

## [1] "Residual Std Error: 13.4548 | R^2: 0.5225 | Adj R^2: 0.5069"

sprintf("F-stat: %.2f on %d and %d DF, p < 2.2e-16",
    sf$fstatistic[1], sf$fstatistic[2], sf$fstatistic[3])

## [1] "F-stat: 33.63 on 87 and 2674 DF, p < 2.2e-16"

sprintf("LOOCV RMSE: %.5f  (initial: %.5f, improvement: %.5f)",
    loocv_rmse(final_model), loocv_rmse(init_model),
    loocv_rmse(init_model) - loocv_rmse(final_model))

## [1] "LOOCV RMSE: 13.70428  (initial: 13.97991, improvement: 0.27563)"

# F-test vs initial model
anova(init_model, final_model)

## Analysis of Variance Table
## 
## Model 1: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State + 
##     Locale + Population + Stu_Tea_Rank
## Model 2: AVG_C2I ~ (Unemployment + CrimeRate + Water_Quality + State + 
##     Locale + Population + Stu_Tea_Rank) - Population + Locale:CrimeRate + 
##     Locale:Unemployment + I(Unemployment^2)
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   2696 512670                                  
## 2   2674 484081 22     28589 7.1783 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The final model achieves an adjusted R² of 0.507 and a LOOCV RMSE of 13.70, compared to 0.482 and 13.98 for the initial model. The improvement is modest but statistically meaningful. An F-test confirms that the additional terms (interactions and polynomial) collectively improve fit significantly (p < \(2.2\times10^{-16}\)).

Key coefficient interpretations. The linear Unemployment term (22.60) and quadratic term (-0.473, highly significant, p = \(3.86\times10^{-12}\)) together describe a relationship that grows steeply at low unemployment but levels off at high values, consistent with the CR plot showing a concave curve. In practical terms, moving from 2% to 4% unemployment adds far more to the cost-to-income ratio than moving from 12% to 14%. Stu_Tea_Rank remains significantly negative (-0.00367, p = \(1.26\times10^{-14}\)): a 1,000-rank improvement is associated with about 3.7 percentage points lower AVG C2I. The 95% confidence interval for Stu_Tea_Rank is (-0.00460, -0.00274), which is entirely negative and excludes zero, confirming the direction of the effect (appendix J). Although the main effect of CrimeRate is not significant, it is retained because it participates in significant interaction terms.

The Locale:Unemployment interactions are also informative. The reference level is Locale 11 (urban core), where unemployment has the steepest positive slope (captured by the main Unemployment term of ~22.60). All Locale:Unemployment interaction coefficients are negative and significant: for example, Unemployment:Locale13 = -10.97 (p < \(10^{-5}\)) means that in mid-size cities, each percentage point of unemployment adds about 11.63 rather than 22.60 percentage points to AVG C2I. Intuitively, this means that urban core counties have the steepest relationship between unemployment and AVG C2I, and unemployment matters most in cities.

3.2 Final Model Diagnostics

# Diagnostic plots for the final model
# Residual structure is improved over the initial model but heteroscedasticity
# persists, though to a smaller extent
par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(final_model)

par(mfrow=c(1,1))

bp_f = bptest(final_model)
sw_f = shapiro.test(resid(final_model))
sprintf("Breusch-Pagan: BP=%.2f, p=%.2e", bp_f$statistic, bp_f$p.value)

## [1] "Breusch-Pagan: BP=185.34, p=4.42e-09"

sprintf("Shapiro-Wilk:  W=%.4f,  p=%.2e", sw_f$statistic, sw_f$p.value)

## [1] "Shapiro-Wilk:  W=0.9206,  p=6.98e-36"

cook_f = cooks.distance(final_model)
sprintf("Influential obs (Cook's D > 4): %d",
        sum(cook_f > 4/nobs(final_model)))

## [1] "Influential obs (Cook's D > 4): 116"

# Fitted vs actual values for the final model
# The model tracks the main body of the data well but underestimates
# extreme high-cost counties
plot(fitted(final_model), model_df_nodc$AVG_C2I,
     pch=16, cex=0.3, col=rgb(0.1,0.4,0.7,0.4),
     xlab="Fitted Values", ylab="Actual AVG C2I (%)",
     main="Final Model: Fitted vs. Actual")
abline(0, 1, col="red", lty=2, lwd=2)

Residual diagnostics show meaningful improvement over the initial model. The Scale-Location plot is more horizontal and the Residuals vs Fitted fan is less pronounced, though the Breusch-Pagan test still rejects homoscedasticity (BP = 185.34, p = \(4.42\times10^{-9}\)). The Shapiro-Wilk test also still rejects normality (W = 0.921, p = \(6.98\times10^{-36}\)). These violations are driven primarily by a small cluster of extreme high-cost counties — most notably observation 2052 (a low-population South Dakota county with AVG C2I = 287%), which appears consistently as the most extreme outlier across all diagnostics. The outlier is Todd County, a Native American-majority county classified as persistently impoverished, explaining the discrepancy (appendix K).

# Influential observations stability
clean_idx = which(cook_f <= 4/nobs(final_model))
final_clean = lm(formula(final_model), data=model_df_nodc[clean_idx,])
sprintf("Adj R^2 with all observations: %.4f", sf$adj.r.squared)

## [1] "Adj R^2 with all observations: 0.5069"

sprintf("Adj R^2 without influential points: %.4f",
        summary(final_clean)$adj.r.squared)

## [1] "Adj R^2 without influential points: 0.5918"

Refitting without the 116 influential observations raises adjusted R² to 0.592, suggesting the influential points reduce fit rather than propping it up. However, the influential counties are real observations, not data errors. They represent genuinely unusual counties that a comprehensive model should eventually explain. Both versions of the model are retained, and the full-data version is treated as the official final model, with the cleaned version noted as an alternative.

3.3 Discussion and Limitations

The final model provides a statistically sound and interpretable account of county-level affordability in the United States. Unemployment rate, geographic state, and student-to-teacher rank are the most consistent predictors of AVG C2I. The model’s adjusted R² of approximately 0.51 is reasonable given that county affordability is shaped by many unmeasured factors, including local housing supply constraints, zoning policy, property tax structures, and commuting patterns, none of which are available in this dataset.

Several limitations warrant acknowledgment. First, heteroscedasticity and non-normally distributed residuals persist in the final model. Standard errors and p-values based on ordinary least squares may be slightly misleading. Given the large sample size (n = 2,762), the central limit theorem provides some protection for inference even under non-normality, but the formal Shapiro-Wilk rejection means that exact p-values from this model should be interpreted with appropriate caution. Second, the crime rate data is from 2016 while most other variables are from 2020–2022, introducing a temporal mismatch that may weaken crime rate’s coefficient. Third, the student-teacher rank is an ordinal measure that compresses variation in the middle of the distribution - a direct ratio would be preferable. Fourth, the State fixed effects absorb a large amount of variance but also partially obscure the interpretation of other coefficients. States differ in many ways beyond what the remaining predictors capture.

The assumption of independent errors is partially violated because counties within the same state share unobserved characteristics. The state fixed effects partially mitigate this, but spatial autocorrelation within states is not fully addressed.

Overall, although causation cannot be concluded from the model, the findings suggest that unemployment and, to a lesser extent, quality of public school staffing, are the factors that correlate with affordability the most. However, the interaction terms, such as between locale type and unemployment, suggest that these correlations are not constant, and the relationship between unemployment and affordability is different in rural counties than it is in urban counties.

Section 4: Appendix

A. Cleaning Results

dim(model_df)   # rows and columns

## [1] 2763    8

model_with_dc = lm(AVG_C2I ~ ., data = model_df)
levs = hatvalues(model_with_dc)
model_df[which(levs == 1), ]  # DC has perfect leverage

##      AVG_C2I Unemployment CrimeRate Water_Quality State Locale Population
## 3077  103.16         4.91     0.061             1    DC     11     671803
##      Stu_Tea_Rank
## 3077         2972

B. Pairs Plot

pairs(nums)  # categorical variables are too cluttered to see

C. Initial Model Summary

## 
## Call:
## lm(formula = AVG_C2I ~ ., data = model_df_nodc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -76.753  -7.959  -0.719   6.638 171.039 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7.614e+01  4.820e+00  15.796  < 2e-16 ***
## Unemployment   7.370e+00  2.715e-01  27.147  < 2e-16 ***
## CrimeRate      1.103e+02  2.853e+01   3.866 0.000113 ***
## Water_Quality  6.055e-02  2.597e-02   2.331 0.019821 *  
## StateAL        3.723e+01  3.375e+00  11.033  < 2e-16 ***
## StateAR        1.710e+01  3.289e+00   5.200 2.15e-07 ***
## StateAZ        3.301e+01  5.025e+00   6.570 6.02e-11 ***
## StateCA        1.778e+01  3.395e+00   5.237 1.76e-07 ***
## StateCO        1.876e+01  3.305e+00   5.678 1.51e-08 ***
## StateCT        1.023e+00  5.643e+00   0.181 0.856169    
## StateDE        8.474e+00  8.482e+00   0.999 0.317819    
## StateFL        3.060e+01  3.304e+00   9.262  < 2e-16 ***
## StateGA        2.270e+01  3.052e+00   7.439 1.35e-13 ***
## StateHI        3.333e+01  1.017e+01   3.278 0.001058 ** 
## StateIA        5.102e+00  3.214e+00   1.588 0.112477    
## StateID        2.607e+01  3.535e+00   7.376 2.17e-13 ***
## StateIL       -1.670e+00  4.137e+00  -0.404 0.686454    
## StateIN        1.680e+01  3.247e+00   5.175 2.44e-07 ***
## StateKS        1.376e+01  3.178e+00   4.328 1.56e-05 ***
## StateKY        1.901e+01  3.091e+00   6.150 8.92e-10 ***
## StateLA        2.829e+01  3.297e+00   8.582  < 2e-16 ***
## StateMA        1.219e+01  4.648e+00   2.623 0.008754 ** 
## StateMD       -9.463e-01  4.044e+00  -0.234 0.815000    
## StateME        2.023e+01  4.875e+00   4.150 3.42e-05 ***
## StateMI        4.194e+00  3.184e+00   1.317 0.187962    
## StateMN        5.521e+00  3.198e+00   1.726 0.084395 .  
## StateMO        2.524e+01  3.153e+00   8.004 1.77e-15 ***
## StateMS        2.162e+01  3.136e+00   6.892 6.84e-12 ***
## StateMT        2.519e+01  3.439e+00   7.323 3.18e-13 ***
## StateNC        2.241e+01  3.131e+00   7.156 1.07e-12 ***
## StateND       -1.512e+00  3.593e+00  -0.421 0.673943    
## StateNE        2.662e+01  3.244e+00   8.205 3.52e-16 ***
## StateNH        5.092e+00  5.226e+00   0.974 0.329966    
## StateNJ        6.209e+00  4.201e+00   1.478 0.139484    
## StateNM        2.422e+01  3.829e+00   6.326 2.94e-10 ***
## StateNV        7.690e+00  4.338e+00   1.773 0.076373 .  
## StateNY        2.081e+01  3.360e+00   6.193 6.79e-10 ***
## StateOH        2.433e+00  3.210e+00   0.758 0.448432    
## StateOK        2.913e+01  3.293e+00   8.847  < 2e-16 ***
## StateOR        2.717e+01  3.656e+00   7.431 1.44e-13 ***
## StatePA        3.142e+00  3.302e+00   0.951 0.341465    
## StateRI        5.004e+00  6.829e+00   0.733 0.463754    
## StateSC        1.465e+01  3.596e+00   4.074 4.76e-05 ***
## StateSD        2.185e+01  3.427e+00   6.376 2.14e-10 ***
## StateTN        1.564e+01  3.193e+00   4.898 1.02e-06 ***
## StateTX        5.399e+00  2.903e+00   1.859 0.063067 .  
## StateUT        1.215e+01  3.979e+00   3.055 0.002276 ** 
## StateVA        1.794e+01  3.129e+00   5.734 1.09e-08 ***
## StateVT        2.270e+01  4.649e+00   4.883 1.11e-06 ***
## StateWA       -3.382e+00  3.512e+00  -0.963 0.335686    
## StateWI        1.295e+01  3.260e+00   3.972 7.31e-05 ***
## StateWV        3.308e+01  3.724e+00   8.885  < 2e-16 ***
## StateWY        1.605e+01  4.096e+00   3.918 9.17e-05 ***
## Locale12      -1.247e+01  3.717e+00  -3.354 0.000806 ***
## Locale13      -9.918e+00  3.496e+00  -2.837 0.004587 ** 
## Locale21      -1.377e+01  3.427e+00  -4.017 6.04e-05 ***
## Locale22      -1.156e+01  3.959e+00  -2.919 0.003538 ** 
## Locale23      -1.136e+01  3.960e+00  -2.868 0.004161 ** 
## Locale31      -1.372e+01  3.757e+00  -3.651 0.000266 ***
## Locale32      -9.486e+00  3.431e+00  -2.765 0.005738 ** 
## Locale33      -5.773e+00  3.511e+00  -1.644 0.100208    
## Locale41      -1.005e+01  3.422e+00  -2.935 0.003360 ** 
## Locale42      -1.009e+01  3.429e+00  -2.943 0.003280 ** 
## Locale43      -4.864e+00  3.507e+00  -1.387 0.165632    
## Population     1.113e-06  9.442e-07   1.179 0.238395    
## Stu_Tea_Rank  -4.174e-03  4.817e-04  -8.666  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.79 on 2696 degrees of freedom
## Multiple R-squared:  0.4943, Adjusted R-squared:  0.4821 
## F-statistic: 40.53 on 65 and 2696 DF,  p-value: < 2.2e-16

D. VIF for Initial Model

round(vif(init_model), 3)

##                GVIF Df GVIF^(1/(2*Df))
## Unemployment  1.912  1           1.383
## CrimeRate     1.760  1           1.327
## Water_Quality 1.169  1           1.081
## State         8.693 49           1.022
## Locale        4.715 11           1.073
## Population    1.511  1           1.229
## Stu_Tea_Rank  2.628  1           1.621

E. AIC and BIC Model Summaries

formula(aic_model)

## AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State + 
##     Locale + Stu_Tea_Rank

s_aic = summary(aic_model)

sprintf("Adj R^2: %.4f | LOOCV RMSE: %.4f",
    s_aic$adj.r.squared, loocv_rmse(aic_model))

## [1] "Adj R^2: 0.4820 | LOOCV RMSE: 13.9788"

formula(bic_model)

## AVG_C2I ~ Unemployment + CrimeRate + State + Stu_Tea_Rank

s_bic = summary(bic_model)

sprintf("Adj R^2: %.4f | LOOCV RMSE: %.4f",
    s_bic$adj.r.squared, loocv_rmse(bic_model))

## [1] "Adj R^2: 0.4693 | LOOCV RMSE: 14.1101"

anova(bic_model, aic_model, init_model) # AIC model is the best

## Analysis of Variance Table
## 
## Model 1: AVG_C2I ~ Unemployment + CrimeRate + State + Stu_Tea_Rank
## Model 2: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State + 
##     Locale + Stu_Tea_Rank
## Model 3: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State + 
##     Locale + Population + Stu_Tea_Rank
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)    
## 1   2709 527860                                 
## 2   2697 512935 12   14925.0 6.5406 1.23e-11 ***
## 3   2696 512670  1     264.5 1.3907   0.2384    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

F. Tranformation Models

shapiro.test(resid(model_log_resp))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_log_resp)
## W = 0.96855, p-value < 2.2e-16

bptest(model_log_resp)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_log_resp
## BP = 240.41, df = 65, p-value < 2.2e-16

shapiro.test(resid(model_bc))

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(model_bc)
## W = 0.97798, p-value < 2.2e-16

bptest(model_bc)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_bc
## BP = 225.98, df = 65, p-value < 2.2e-16

suppressWarnings(crPlots(init_model))

# Zoom in on unemployment - seems curved
ggplot(model_df_nodc, aes(x = Unemployment, y = residuals(init_model))) +
  geom_point(alpha = 0.3) +
  geom_smooth()  # downward curve on the right

G. Interaction Screening Results

# Quantitative interaction p-values (each tested against full additive model)
quant_vars = c("Unemployment", "CrimeRate", "Water_Quality", "Stu_Tea_Rank")

# Test each pair, record p-value of the interaction term
results = data.frame()
for (i in 1:(length(quant_vars)-1)) {
  for (j in (i+1):length(quant_vars)) {
    v1 = quant_vars[i]
    v2 = quant_vars[j]
    f = as.formula(paste("AVG_C2I ~ . +", v1, ":", v2))  # add this interaction term
    m = lm(f, data = model_df_nodc)
    coef_name = paste0(v1, ":", v2)
    p = summary(m)$coefficients[coef_name, "Pr(>|t|)"]
    results = rbind(results, data.frame(Var1=v1, Var2=v2, p_value=round(p, 5)))  # add row to table
  }
}
results[order(results$p_value), ]

##            Var1          Var2 p_value
## 4     CrimeRate Water_Quality 0.03909
## 5     CrimeRate  Stu_Tea_Rank 0.04897
## 1  Unemployment     CrimeRate 0.05466
## 6 Water_Quality  Stu_Tea_Rank 0.06794
## 3  Unemployment  Stu_Tea_Rank 0.86762
## 2  Unemployment Water_Quality 0.90015

# Categorical:quantitative interaction F-test p-values
cat_vars = c("State", "Locale")

results_cat = data.frame()
for (cat in cat_vars) {
  for (quant in quant_vars) {
    f_with = as.formula(paste("AVG_C2I ~ . +", quant, ":", cat))
    m_with = lm(f_with, data = model_df_nodc)
    p = anova(init_model, m_with)$`Pr(>F)`[2]
    results_cat = rbind(results_cat, data.frame(Categorical=cat,
                                                Quantitative=quant, p_value=p))
  }
}
results_cat[order(results_cat$p_value), ]

##   Categorical  Quantitative      p_value
## 1       State  Unemployment 2.692944e-36
## 4       State  Stu_Tea_Rank 8.226318e-10
## 5      Locale  Unemployment 1.945681e-09
## 6      Locale     CrimeRate 5.875221e-06
## 2       State     CrimeRate 8.968462e-04
## 3       State Water_Quality 1.512618e-01
## 8      Locale  Stu_Tea_Rank 1.803448e-01
## 7      Locale Water_Quality 9.674013e-01

H. LASSO and Ridge

# LASSO (for automatic variable selection)
# Prepare matrix (model.matrix handles factors as dummies)
# remove intercept - glmnet adds it automatically
X = model.matrix(AVG_C2I ~ ., data = model_df_nodc)[, -1]
y = model_df_nodc$AVG_C2I
 
set.seed(42)  # for reproducibility
cv_lasso = cv.glmnet(X, y, alpha = 1, nfolds = 10)
plot(cv_lasso)

# LASSO 10-fold CV RMSE
lasso_cv_rmse = sqrt(min(cv_lasso$cvm))
 
set.seed(42)
cv_ridge = cv.glmnet(X, y, alpha = 0, nfolds=10)
# Ridge 10-fold CV RMSE  
ridge_cv_rmse = sqrt(min(cv_ridge$cvm))

sprintf("LASSO best lambda: %.4f | 10-fold RMSE: %.4f", cv_lasso$lambda.min, 
        lasso_cv_rmse)

## [1] "LASSO best lambda: 0.0087 | 10-fold RMSE: 13.9621"

sprintf("Ridge best lambda: %.4f | 10-fold RMSE: %.4f", cv_ridge$lambda.min, 
        ridge_cv_rmse)

## [1] "Ridge best lambda: 0.7712 | 10-fold RMSE: 14.0210"

I. Final Model Summary

sf

## 
## Call:
## lm(formula = AVG_C2I ~ . - Population + Locale:CrimeRate + Locale:Unemployment + 
##     I(Unemployment^2), data = model_df_nodc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -45.278  -7.606  -0.801   6.635 164.336 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.320e+01  1.479e+01   1.569 0.116763    
## Unemployment           2.260e+01  2.386e+00   9.474  < 2e-16 ***
## CrimeRate              9.836e+00  1.843e+02   0.053 0.957439    
## Water_Quality          6.707e-02  2.567e-02   2.613 0.009030 ** 
## StateAL                4.056e+01  3.349e+00  12.109  < 2e-16 ***
## StateAR                1.864e+01  3.259e+00   5.720 1.18e-08 ***
## StateAZ                3.862e+01  4.982e+00   7.753 1.27e-14 ***
## StateCA                2.121e+01  3.375e+00   6.285 3.81e-10 ***
## StateCO                2.182e+01  3.288e+00   6.638 3.85e-11 ***
## StateCT                1.977e+00  5.607e+00   0.353 0.724472    
## StateDE                8.751e+00  8.422e+00   1.039 0.298882    
## StateFL                3.347e+01  3.274e+00  10.223  < 2e-16 ***
## StateGA                2.491e+01  3.021e+00   8.246 2.55e-16 ***
## StateHI                3.803e+01  9.964e+00   3.817 0.000138 ***
## StateIA                8.348e+00  3.206e+00   2.604 0.009264 ** 
## StateID                2.865e+01  3.499e+00   8.190 4.01e-16 ***
## StateIL                4.399e-01  4.088e+00   0.108 0.914329    
## StateIN                2.028e+01  3.223e+00   6.291 3.66e-10 ***
## StateKS                1.873e+01  3.210e+00   5.833 6.09e-09 ***
## StateKY                1.996e+01  3.063e+00   6.515 8.63e-11 ***
## StateLA                2.908e+01  3.278e+00   8.870  < 2e-16 ***
## StateMA                1.379e+01  4.612e+00   2.991 0.002809 ** 
## StateMD                1.183e+00  3.997e+00   0.296 0.767259    
## StateME                2.185e+01  4.794e+00   4.557 5.43e-06 ***
## StateMI                5.098e+00  3.196e+00   1.595 0.110811    
## StateMN                8.279e+00  3.173e+00   2.609 0.009137 ** 
## StateMO                2.922e+01  3.142e+00   9.300  < 2e-16 ***
## StateMS                2.191e+01  3.102e+00   7.062 2.08e-12 ***
## StateMT                2.926e+01  3.447e+00   8.490  < 2e-16 ***
## StateNC                2.370e+01  3.117e+00   7.602 4.00e-14 ***
## StateND                2.595e+00  3.606e+00   0.720 0.471745    
## StateNE                3.074e+01  3.279e+00   9.374  < 2e-16 ***
## StateNH                9.085e+00  5.162e+00   1.760 0.078499 .  
## StateNJ                8.071e+00  4.202e+00   1.921 0.054867 .  
## StateNM                2.645e+01  3.791e+00   6.977 3.78e-12 ***
## StateNV                9.774e+00  4.277e+00   2.285 0.022384 *  
## StateNY                2.123e+01  3.351e+00   6.335 2.77e-10 ***
## StateOH                3.033e+00  3.225e+00   0.941 0.347037    
## StateOK                3.212e+01  3.283e+00   9.783  < 2e-16 ***
## StateOR                2.707e+01  3.608e+00   7.502 8.49e-14 ***
## StatePA                3.863e+00  3.341e+00   1.156 0.247653    
## StateRI                8.488e+00  6.729e+00   1.261 0.207279    
## StateSC                1.632e+01  3.566e+00   4.576 4.95e-06 ***
## StateSD                2.538e+01  3.443e+00   7.371 2.24e-13 ***
## StateTN                1.759e+01  3.166e+00   5.555 3.05e-08 ***
## StateTX                7.329e+00  2.897e+00   2.529 0.011481 *  
## StateUT                1.622e+01  3.937e+00   4.118 3.93e-05 ***
## StateVA                2.060e+01  3.105e+00   6.635 3.91e-11 ***
## StateVT                2.510e+01  4.581e+00   5.479 4.68e-08 ***
## StateWA               -2.723e+00  3.496e+00  -0.779 0.436077    
## StateWI                1.555e+01  3.227e+00   4.818 1.53e-06 ***
## StateWV                3.461e+01  3.685e+00   9.391  < 2e-16 ***
## StateWY                1.779e+01  4.037e+00   4.405 1.10e-05 ***
## Locale12               1.955e+01  1.625e+01   1.203 0.229079    
## Locale13               2.632e+01  1.490e+01   1.767 0.077346 .  
## Locale21               1.196e+01  1.511e+01   0.792 0.428444    
## Locale22               3.901e+01  1.721e+01   2.266 0.023544 *  
## Locale23               2.527e+01  1.619e+01   1.561 0.118650    
## Locale31               2.261e+01  1.602e+01   1.411 0.158253    
## Locale32               2.856e+01  1.458e+01   1.959 0.050199 .  
## Locale33               3.168e+01  1.469e+01   2.157 0.031107 *  
## Locale41               2.726e+01  1.466e+01   1.859 0.063082 .  
## Locale42               3.013e+01  1.452e+01   2.076 0.038029 *  
## Locale43               3.587e+01  1.449e+01   2.475 0.013401 *  
## Stu_Tea_Rank          -3.670e-03  4.733e-04  -7.754 1.26e-14 ***
## I(Unemployment^2)     -4.734e-01  6.787e-02  -6.974 3.86e-12 ***
## CrimeRate:Locale12     7.129e-01  2.270e+02   0.003 0.997494    
## CrimeRate:Locale13     2.908e+02  2.087e+02   1.393 0.163703    
## CrimeRate:Locale21     3.573e+02  2.068e+02   1.728 0.084172 .  
## CrimeRate:Locale22    -2.036e+01  2.847e+02  -0.072 0.942997    
## CrimeRate:Locale23     2.637e+02  2.772e+02   0.951 0.341599    
## CrimeRate:Locale31     2.334e+02  2.353e+02   0.992 0.321338    
## CrimeRate:Locale32     1.592e+02  1.958e+02   0.813 0.416303    
## CrimeRate:Locale33     4.366e+01  2.031e+02   0.215 0.829782    
## CrimeRate:Locale41     1.126e+02  1.946e+02   0.579 0.562746    
## CrimeRate:Locale42     8.797e+01  1.929e+02   0.456 0.648482    
## CrimeRate:Locale43    -1.906e+02  1.922e+02  -0.992 0.321457    
## Unemployment:Locale12 -7.463e+00  2.895e+00  -2.578 0.010001 *  
## Unemployment:Locale13 -1.097e+01  2.404e+00  -4.564 5.24e-06 ***
## Unemployment:Locale21 -8.173e+00  2.644e+00  -3.092 0.002010 ** 
## Unemployment:Locale22 -1.274e+01  3.063e+00  -4.159 3.30e-05 ***
## Unemployment:Locale23 -1.081e+01  2.758e+00  -3.921 9.05e-05 ***
## Unemployment:Locale31 -1.039e+01  2.963e+00  -3.508 0.000459 ***
## Unemployment:Locale32 -1.044e+01  2.357e+00  -4.430 9.78e-06 ***
## Unemployment:Locale33 -9.664e+00  2.351e+00  -4.110 4.08e-05 ***
## Unemployment:Locale41 -9.963e+00  2.389e+00  -4.170 3.14e-05 ***
## Unemployment:Locale42 -1.061e+01  2.338e+00  -4.537 5.96e-06 ***
## Unemployment:Locale43 -9.613e+00  2.307e+00  -4.167 3.19e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 2674 degrees of freedom
## Multiple R-squared:  0.5225, Adjusted R-squared:  0.5069 
## F-statistic: 33.63 on 87 and 2674 DF,  p-value: < 2.2e-16

J. Key Confidence Intervals

# 95% confidence intervals for key quantitative predictors
confint(final_model)[c("Unemployment", "CrimeRate", "Stu_Tea_Rank",
                              "Water_Quality"), ]

##                       2.5 %        97.5 %
## Unemployment   1.792435e+01  27.280888652
## CrimeRate     -3.515380e+02 371.210358361
## Stu_Tea_Rank  -4.597987e-03  -0.002741861
## Water_Quality  1.673611e-02   0.117406258

K. Extreme Outlier County

df[2052, ]  # Todd County - persistent poverty

##       countyhelper LSTATE      NMCNTY  FIPS  LZIP          ULOCALE Overall.Rank
## 2052 SDTodd County     SD Todd County 46121 57555 43-Rural: Remote           NA
##      X2022.Population X2016.Crime.Rate Unemployment X2020PopulrVoteParty
## 2052            9,220        0                3.89%                    R
##      X2020.PopulrMajor. AQI.Good WaterQualityVPV ParkScore2023.Rank
## 2052             61.77%   92.09%               2                 -1
##      X.CvgCityPark NtnlPrkCnt X.CvgStatePark Cost.of.Living X2022.Median.Income
## 2052            -1          2          0.56%     $73,348.34          $25,529.98
##      AVG.C2I   X1p0c   X1p1c   X1p2c   X1p3c   X1p4c   X2p0c   X2p1c   X2p2c
## 2052 287.30% 147.62% 207.20% 263.98% 321.71% 350.32% 209.71% 271.82% 322.55%
##        X2p3c   X2p4c Stu.Tea.Rank Diversity.Rank..Race. Diversity.Rank..Gender.
## 2052 374.47% 403.64%         1357                  2052                    1454

County Affordability Analysis

Artem Khaiet

2026-05-01