With affordability being a central concern for many Americans, the question of what factors correlate with affordability is extremely relevant. The goal of the analysis is to investigate what variables most influence affordability across US counties, using the mean cost-to-income ratio (AVG C2I) in each county as the quantitative response variable.
The dataset was obtained from Kaggle and contains publicly available data on quality-of-life metrics across U.S. counties, compiled from sources such as the Census Bureau, EPA, and USDA. Each row in the dataset represents a unique U.S. county, identified by its state and county name (and FIPS code), along with various economic, environmental, and educational characteristics.
The following variables were selected as predictors after removing those that were incomplete, redundant, or structurally collinear with others. Political affiliation and air quality metrics were excluded because they were recorded as per-state rather than per-county measurements, and would cause singularities alongside the State dummy variable. Median income and raw cost-of-living figures were excluded because they are already fully absorbed into the AVG C2I response. Cost-to-income breakdowns by family type, park coverage, and diversity rankings were excluded for incompleteness or redundancy.
The final set of predictors consists of two categorical variables (State and Locale) and five quantitative variables (Unemployment, CrimeRate, Water_Quality, Population, Stu_Tea_Rank):
# Load the data
df = read.csv("CountyQOL.csv")
# Helper functions to clean columns
clean_percent = function(x) {
as.numeric(str_replace_all(x, "%", "")) # remove percentage sign
}
clean_population = function(x) {
as.numeric(str_replace_all(x, ",", "")) # remove comma
}
num_extract = function(x) {
as.numeric(str_extract(x, "\\d+")) # extract first number encountered
}
# Clean the data
df_clean = df %>%
mutate(
# Convert categorical variables
State = as.factor(LSTATE), # 51 levels because of DC
Locale = as.factor(num_extract(ULOCALE)),
# Clean numeric columns
Population = clean_population(X2022.Population),
CrimeRate = num_extract(X2016.Crime.Rate)/1000,
Unemployment = na_if(clean_percent(Unemployment), -1), # -1 = no data
Water_Quality = na_if(WaterQualityVPV, -1),
Stu_Tea_Rank = na_if(Stu.Tea.Rank, -1),
AVG_C2I = clean_percent(AVG.C2I),
)
# Select columns of interest, remove NaN values
model_df = df_clean %>%
dplyr::select(AVG_C2I,
Unemployment,
CrimeRate,
Water_Quality,
State,
Locale,
Population,
Stu_Tea_Rank
) %>%
na.omit()
Several steps were taken to prepare the raw dataset for modelling. Most numeric columns were stored as character strings with formatting characters such as percent signs and commas, so helper functions were written to strip those and coerce each column to numeric.
In addition, the categorical variables were converted to factor type to ensure correct treatment during the modeling stage.
A few columns required special cleaning. For instance, the crime rate column encoded values in the form “X/1000”, so only the leading integer was extracted and then divided by 1000 to express the rate as a proportion per capita. The ULOCALE column contained locale codes prefixed with a category number (e.g. “42-Rural: Distant”), so the numeric prefix was extracted and treated as a factor.
Finally, sentinel values of -1 were used throughout the dataset to indicate missing data, most notably in the Unemployment and Water_Quality columns. These were replaced with NA before omitting rows that had any NaN values.
The columns were renamed to more readable labels and only the columns of interest were kept in the dataset for modeling.
After cleaning, the dataset contained 2,763 complete observations
across 8 variables (appendix A). One additional observation, Washington
DC, was discovered during diagnostics to have a hat value of exactly 1,
indicating perfect leverage, because DC has only one county-equivalent
unit (appendix A). This causes the LOOCV and standardized residual
formulas to produce undefined values and makes the model difficult to
evaluate. It was removed, leaving a final modeling dataset of
2,762 observations in model_df_nodc.
model_df_nodc = model_df[model_df$State != "DC", ]
model_df_nodc$State = droplevels(model_df_nodc$State)
# Summary statistics
summary(model_df)
## AVG_C2I Unemployment CrimeRate Water_Quality
## Min. : 60.83 Min. : 1.340 Min. :0.00000 Min. : 0.000
## 1st Qu.: 94.55 1st Qu.: 2.790 1st Qu.:0.01100 1st Qu.: 0.000
## Median :104.43 Median : 3.490 Median :0.01700 Median : 1.000
## Mean :107.15 Mean : 3.711 Mean :0.01914 Mean : 3.088
## 3rd Qu.:116.88 3rd Qu.: 4.325 3rd Qu.:0.02600 3rd Qu.: 3.000
## Max. :287.30 Max. :17.190 Max. :0.08900 Max. :456.000
##
## State Locale Population Stu_Tea_Rank
## TX : 235 43 :697 Min. : 358 Min. : 1.0
## GA : 146 42 :590 1st Qu.: 10772 1st Qu.: 756.5
## VA : 117 41 :358 Median : 26909 Median :1536.0
## MO : 107 32 :350 Mean : 107634 Mean :1527.5
## KY : 102 33 :238 3rd Qu.: 71953 3rd Qu.:2289.5
## KS : 101 21 :164 Max. :9721138 Max. :3048.0
## (Other):1955 (Other):366
# Correlation matrix
nums = model_df_nodc[sapply(model_df_nodc, is.numeric)] # numeric variables
knitr::kable(round(cor(nums), 3))
| AVG_C2I | Unemployment | CrimeRate | Water_Quality | Population | Stu_Tea_Rank | |
|---|---|---|---|---|---|---|
| AVG_C2I | 1.000 | 0.403 | 0.062 | 0.028 | -0.077 | -0.196 |
| Unemployment | 0.403 | 1.000 | 0.130 | 0.085 | 0.034 | 0.082 |
| CrimeRate | 0.062 | 0.130 | 1.000 | -0.010 | 0.205 | 0.429 |
| Water_Quality | 0.028 | 0.085 | -0.010 | 1.000 | 0.022 | 0.021 |
| Population | -0.077 | 0.034 | 0.205 | 0.022 | 1.000 | 0.386 |
| Stu_Tea_Rank | -0.196 | 0.082 | 0.429 | 0.021 | 0.386 | 1.000 |
# Student-teacher ratio, crime rate, and population
# AVG_C2I and Unemployment
# Distribution of the response
ggplot(model_df_nodc, aes(x = AVG_C2I)) +
geom_histogram(bins = 60, fill = "steelblue", color = "white") +
labs(title = "Distribution of AVG C2I",
x = "AVG C2I (%)", y = "Count") +
theme_minimal()
# Right-skewed tail: check if log transform helps later
ggplot(model_df_nodc, aes(x = log(AVG_C2I))) +
geom_histogram(bins = 60, fill = "coral", color = "white") +
labs(title = "Distribution of log(AVG C2I)", x="log(AVG C2I)", y = "Count") +
theme_minimal() # a little more symmetric
# Unemployment vs AVG_C2I – strongest numeric predictor
ggplot(model_df_nodc, aes(x = Unemployment, y = AVG_C2I)) +
geom_point(alpha = 0.3) +
geom_smooth(method = "lm", se = TRUE, color = "red") +
labs(title = "Unemployment Rate vs. AVG C2I",
x = "Unemployment (%)", y = "AVG C2I (%)") +
theme_minimal()
# The whole pairs plot can be found in the appendix,
# these are the 2 most significant trends between predictors
# Population vs Stu_Tea_Rank - Exponential
ggplot(model_df_nodc, aes(x = Stu_Tea_Rank, y = Population)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "glm", formula = y ~ x,
method.args = list(family = gaussian(link = "log"),
control=list(maxit=100)),
color = "blue") +
labs(title = "Population vs Student-Teacher Rank",
x = "Stu:Tea Rank", y = "Population") +
theme_minimal()
# CrimeRate vs Stu_Tea_Rank - Linear
ggplot(model_df_nodc, aes(x = Stu_Tea_Rank, y = CrimeRate)) +
geom_point(alpha = 0.2) +
geom_smooth(method = "lm",
color = "blue") +
labs(title = "Crime Rate vs Student-Teacher Rank",
x = "Stu:Tea Rank", y = "Crime Rate") +
theme_minimal()
# Boxplot: AVG_C2I by locale
ggplot(model_df_nodc, aes(x = Locale, y = AVG_C2I)) +
geom_boxplot(fill = "lightblue") +
labs(title = "AVG C2I by Locale Type", x = "Locale", y = "AVG C2I (%)") +
theme_minimal()
# Identify potential outliers: Water_Quality has max 456 - inspect
outlier_wq = model_df_nodc %>% filter(Water_Quality > 50)
# Counties with Water Quality violations > 50:
nrow(outlier_wq)
## [1] 10
# Does unemployment's slope differ by locale?
ggplot(model_df_nodc, aes(x = Unemployment, y = AVG_C2I, color = Locale)) +
geom_smooth(method = "lm", se = FALSE) # Lines not parallel- interaction
Exploratory analysis was conducted using summary statistics, a pairwise scatterplot matrix (appendix B), a correlation matrix, several individual plots, and outlier analysis.
The response variable AVG_C2I ranges from 60.83 to 287.30 with a mean of 107.15 and median of 104.43, indicating a moderate right skew driven by a small number of high-cost counties. A histogram of the log-transformed response showed a noticeably more symmetric distribution, motivating a log transformation of the response as one model improvement to explore.
Among the quantitative predictors, Unemployment has by far the strongest linear relationship with AVG_C2I (r = 0.40), which is visible in the scatterplot. Stu_Tea_Rank shows a meaningful negative correlation (r = -0.20): counties with a lower student-to-teacher ratio rank (i.e. fewer students per teacher) tend to have lower cost-to-income ratios, likely because these are wealthier counties with higher incomes.
CrimeRate and Water_Quality both show weaker individual correlations with the response (r = 0.06 and 0.03, respectively), though they may still contribute once other variables are controlled for.
Potential multicollinearity between predictors was examined. The most notable correlation among predictors is between CrimeRate and Stu_Tea_Rank (r = 0.43) and between Population and Stu_Tea_Rank (r = 0.39), which is expected since larger, denser counties tend to have both higher crime rates and more crowded schools. The relationship between population and student teacher rank appears to be exponential in the plot. These two correlations are moderate and not severe enough to preclude including both variables in the model, but they suggest that VIF should be examined. Water_Quality exhibited an extreme outlier at 456 violations per visit. This value appears legitimate (it reflects a county with persistent, severe water safety violations), but it warrants monitoring in subsequent diagnostics.
Boxplots of AVG_C2I by Locale type showed that urban core counties (Locale 11) tend to have higher cost-to-income ratios than rural counties, consistent with the well-known urban affordability crisis, though rural counties have more outliers. This pattern suggests Locale may be a useful predictor.
Finally, the slope of unemployment versus AVG C2I differed visibly across Locale types, providing a visual motivation for exploring a Locale:Unemployment interaction term in the model.
# Start with simple additive model
init_model = lm(AVG_C2I ~ ., data = model_df_nodc)
s = summary(init_model)
sprintf(paste0("Residual Std Error: %.4f | R^2: %.4f | Adj R^2: %.4f",
"| F-stat: %.2f (p < 2.2e-16)"),
s$sigma, s$r.squared, s$adj.r.squared, s$fstatistic[1])
## [1] "Residual Std Error: 13.7898 | R^2: 0.4943 | Adj R^2: 0.4821| F-stat: 40.53 (p < 2.2e-16)"
The initial model was fit using all available predictors (see appendix C for full summary). It yielded an adjusted R2 of 0.4821, meaning the model explains approximately 48% of the variability in county cost-to-income ratios after adjusting for the number of predictors. The overall F-test is highly significant (p < \(2.2\times10^{-16}\)), confirming that the predictors collectively explain a meaningful portion of variance.
Among the continuous predictors, Unemployment is by far the most significant (coefficient = 7.370, p < \(2\times10^{-16}\)). Holding all else equal, a one percentage-point increase in the unemployment rate is associated with an approximately 7.4 percentage-point increase in the cost-to-income ratio. This is intuitive, since counties with weaker labor markets have lower incomes while the cost of living remains relatively fixed. CrimeRate (coef = 110.3, p < 0.001) and Water_Quality (coef = 0.061, p = 0.02) are also statistically significant, though their practical effect sizes are smaller. Stu_Tea_Rank carries a small but highly significant negative coefficient (-0.00417, \(2\times10^{-16}\)). Thus, a county ranked 1,000 places better is associated with a cost-to-income ratio about 4.2 points lower. Population is not significant (p = 0.24) and was removed in subsequent models. The State categorical variable is collectively very important: state fixed effects absorb broad regional differences in housing costs, wages, and policy not captured by the other predictors.
Locale 12 (City: Midsize, the first non-reference Locale level) carries a coefficient of -12.47, meaning that counties in that locale type have a cost-to-income ratio about 12.5 percentage points lower than the reference urban core (Locale 11), holding all else equal. Several Locale levels (33, 43) are not statistically significant at \(\alpha\) = 0.05, suggesting these rural tiers are not strongly differentiated from the reference once other predictors are controlled.
# Diagnostic plots for the initial model
# Residuals show clear right-skew and heteroscedasticity
par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(init_model)
par(mfrow=c(1,1))
bptest(init_model)
##
## studentized Breusch-Pagan test
##
## data: init_model
## BP = 202.33, df = 65, p-value = 5.213e-16
shapiro.test(resid(init_model))
##
## Shapiro-Wilk normality test
##
## data: resid(init_model)
## W = 0.9139, p-value < 2.2e-16
sprintf("Influential obs (Cook's D > 4/n): %d",
sum(cooks.distance(init_model) > 4/nobs(init_model)))
## [1] "Influential obs (Cook's D > 4/n): 124"
The diagnostic plots reveal several violations of linear model assumptions. The Residuals vs Fitted plot shows a somewhat fan-shaped spread and more negative residuals for higher fitted values, confirming heteroscedasticity. The Breusch-Pagan test rejects homoscedasticity formally (BP = 202.33, p = \(5.2\times10^{-16}\)). The Normal Q-Q plot shows heavy deviation on both tails, especially on the right. Violation of the normality assumption was confirmed by the Shapiro-Wilk test (W = 0.9139, p < \(2.2\times10^{-16}\)). Cook’s distance flagged 124 influential observations using 4/n as the threshold. VIF values were generally moderate (appendix D), indicating absence of severe multicollinearity. The State factor shows a somewhat high generalized VIF of 8.69, which is acceptable given it is a 50-level factor absorbing substantial real geographic variation. These diagnostic findings motivated exploring response transformations in the model improvement stage in order to satisfy assumptions and make inference more robust.
# Instant LOOCV for lm objects
loocv_rmse = function(model) {
h = hatvalues(model)
res = residuals(model)
sqrt(mean((res / (1 - h))^2))
}
# Instant LOOCV if response variable was transformed using log
loocv_rmse_log = function(model, actual) {
h = hatvalues(model)
res = residuals(model)
loo_pred_log = fitted(model) - (h * res) / (1 - h)
loo_pred_original = exp(loo_pred_log)
sqrt(mean((actual - loo_pred_original)^2))
}
loocv_rmse(init_model)
## [1] 13.97991
The LOOCV formula uses hat values to compute leave-one-out residuals analytically, avoiding the need to refit the model nearly 3,000 times. The baseline LOOCV RMSE for the initial full model is 13.98 percentage points, meaning on average the model’s prediction is off by about 14 percentage points of cost-to-income ratio on a held-out county.
Several strategies were explored to improve on the initial model. Each is described below, with dead ends noted briefly and promising approaches developed further.
Stepwise selection (AIC and BIC). Backward stepwise selection using AIC dropped Population from the initial model, confirming its low marginal contribution. The resulting AIC model had an adjusted \(R^2\) of 0.482 and LOOCV RMSE of 13.979 - nearly identical to the full model, suggesting Population’s removal costs essentially nothing. BIC selection was more aggressive, additionally dropping Water_Quality and Locale, which worsened adjusted \(R^2\) to 0.469. An F-test confirmed that the AIC model is significantly better than the BIC model (p = \(1.23\times10^{-11}\)), but the initial model was not significiantly better than the AIC model, so Locale and Water_Quality were retained and Population was removed. These results are shown in Appendix E.
Response transformations. A log transformation of
the response (log(AVG_C2I)) improved normality of residuals
(W = 0.96855 vs 0.9139) and raised adjusted R2 to 0.509.
However, the p-value was still extremely low, and the Breusch-Pagan test
actually resulted in a higher test statistic (BP = 240.41), suggesting
greater heteroscedasticity. In addition, log-transforming the response
raised the LOOCV RMSE to 14.26, since errors on extreme counties got
exponentiated, meaning raw prediction accuracy in the original scale is
hurt. Due to these factors, a log transformation of the response was not
selected for the final model.
The Box-Cox procedure suggested an optimal \(\lambda\) of approximately -0.67, close to the log (\(\lambda\) = 0). But since the response transformation makes coefficients harder to interpret and it still did not result in significantly higher p-values on the diagnostics tests, Box-Cox was not selected for the final model.
These results are shown in Appendix F.
Log and polynomial transformation of Unemployment.
The CR plot for Unemployment showed a subtle curve, and the
residual-vs-predictor plot confirmed a modest downward bend at high
unemployment values, suggesting the relationship is not purely linear.
Both a log transformation (log(Unemployment)) and a
quadratic polynomial (I(Unemployment^2)) were tested. The
polynomial model achieved a LOOCV RMSE of 13.78, versus
13.82 for the log model - nearly identical, but the
polynomial was marginally better and was carried forward. These results
are shown in Appendix F.
Interaction screening. To identify promising interactions, all pairwise combinations of quantitative predictors were tested individually, and each quantitative-by-categorical combination was tested via F-test (appendix G). The quantitative screening identified CrimeRate:Water_Quality (p = 0.039) and CrimeRate:Stu_Tea_Rank (p = 0.049) as the most significant pairs, though neither was significant under \(\alpha=0.01\), so they were not added to the final model. The categorical screening found that Locale:Unemployment (p = \(1.95\times10^{-9}\)) and Locale:CrimeRate (p = \(5.88\times10^{-6}\)) were the strongest interactions, consistent with the EDA plot showing clearly non-parallel slopes across locale types. State interactions were also highly significant in screening, but caused singularities because Hawaii (2 counties) has too few observations to estimate state-specific slopes. State interactions were therefore excluded, and the final interaction model included two strongest locale interactions.
Shrinkage methods. LASSO and ridge regression were applied to the full design matrix. LASSO converged to a very small \(\lambda\) (\(\lambda\) = 0.0087), retaining nearly all variables, and achieved a 10-fold CV RMSE of 13.96. Ridge regression achieved 14.02. Neither outperformed the OLS interaction model in cross-validated RMSE, even though they only used 10 folds instead of LOOCV. This is expected given the large sample size (n = 2762) relative to predictors, so overfitting is not the primary challenge here. These results are shown in Appendix H.
aic_model = step(init_model, direction = "backward", trace = 0)
# Removes Population
bic_model = step(init_model, direction = "backward", k = log(nobs(init_model)),
trace = 0)
# Removes Population, Locale, Water Quality
model_log_resp = lm(log(AVG_C2I) ~ ., data = model_df_nodc)
bc = boxcox(lm(AVG_C2I ~ ., data = model_df_nodc), plotit = FALSE)
lambda = bc$x[which.max(bc$y)] # -2/3
model_bc = lm(((AVG_C2I^lambda - 1) / lambda) ~ ., data = model_df_nodc)
model_poly = lm(AVG_C2I ~ . + I(Unemployment^2), data=model_df_nodc)
model_log_pred = lm(AVG_C2I ~ . - Unemployment + log(Unemployment),
data=model_df_nodc)
model_int = lm(AVG_C2I ~ .
+ Locale:CrimeRate
+ Locale:Unemployment,
data=model_df_nodc)
rmses = c(
loocv_rmse(init_model),
loocv_rmse(aic_model),
loocv_rmse(bic_model),
loocv_rmse_log(model_log_resp, model_df_nodc$AVG_C2I),
loocv_rmse(model_poly),
loocv_rmse(model_log_pred),
loocv_rmse(model_int)
)
adj_r2 = c(
summary(init_model)$adj.r.squared,
summary(aic_model)$adj.r.squared,
summary(bic_model)$adj.r.squared,
summary(model_log_resp)$adj.r.squared,
summary(model_poly)$adj.r.squared,
summary(model_log_pred)$adj.r.squared,
summary(model_int)$adj.r.squared
)
comp_df = data.frame(
Model = c("Initial (full)", "AIC stepwise", "BIC stepwise", "log(AVG_C2I)",
"Quadratic Unemployment", "log(Unemployment)", "Interactions"),
Adj_R2 = round(adj_r2, 4),
LOOCV_RMSE = round(rmses, 4)
)
comp_df
## Model Adj_R2 LOOCV_RMSE
## 1 Initial (full) 0.4821 13.9799
## 2 AIC stepwise 0.4820 13.9788
## 3 BIC stepwise 0.4693 14.1101
## 4 log(AVG_C2I) 0.5090 14.2584
## 5 Quadratic Unemployment 0.4967 13.7809
## 6 log(Unemployment) 0.4932 13.8177
## 7 Interactions 0.4980 13.8468
The interaction model and polynomial model are the two strongest candidates. The final model combines both, adding the quadratic unemployment term to the interaction model for a further improvement, and removing Population as it is an insignificant predictor and is correlated with other variables like locale and student-teacher-rank.
The final model incorporates all predictors except Population (dropped due to non-significance), a quadratic polynomial on Unemployment, and two interaction terms motivated by the systematic screening: Locale:Unemployment and Locale:CrimeRate. See Appendix I for the full model summary.
final_model = lm(AVG_C2I ~ . - Population
+ Locale:CrimeRate
+ Locale:Unemployment
+ I(Unemployment^2),
data = model_df_nodc)
sf = summary(final_model)
sprintf("Residual Std Error: %.4f | R^2: %.4f | Adj R^2: %.4f",
sf$sigma, sf$r.squared, sf$adj.r.squared)
## [1] "Residual Std Error: 13.4548 | R^2: 0.5225 | Adj R^2: 0.5069"
sprintf("F-stat: %.2f on %d and %d DF, p < 2.2e-16",
sf$fstatistic[1], sf$fstatistic[2], sf$fstatistic[3])
## [1] "F-stat: 33.63 on 87 and 2674 DF, p < 2.2e-16"
sprintf("LOOCV RMSE: %.5f (initial: %.5f, improvement: %.5f)",
loocv_rmse(final_model), loocv_rmse(init_model),
loocv_rmse(init_model) - loocv_rmse(final_model))
## [1] "LOOCV RMSE: 13.70428 (initial: 13.97991, improvement: 0.27563)"
# F-test vs initial model
anova(init_model, final_model)
## Analysis of Variance Table
##
## Model 1: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State +
## Locale + Population + Stu_Tea_Rank
## Model 2: AVG_C2I ~ (Unemployment + CrimeRate + Water_Quality + State +
## Locale + Population + Stu_Tea_Rank) - Population + Locale:CrimeRate +
## Locale:Unemployment + I(Unemployment^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2696 512670
## 2 2674 484081 22 28589 7.1783 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The final model achieves an adjusted R2 of 0.507 and a LOOCV RMSE of 13.70, compared to 0.482 and 13.98 for the initial model. The improvement is modest but statistically meaningful. An F-test confirms that the additional terms (interactions and polynomial) collectively improve fit significantly (p < \(2.2\times10^{-16}\)).
Key coefficient interpretations. The linear Unemployment term (22.60) and quadratic term (-0.473, highly significant, p = \(3.86\times10^{-12}\)) together describe a relationship that grows steeply at low unemployment but levels off at high values, consistent with the CR plot showing a concave curve. In practical terms, moving from 2% to 4% unemployment adds far more to the cost-to-income ratio than moving from 12% to 14%. Stu_Tea_Rank remains significantly negative (-0.00367, p = \(1.26\times10^{-14}\)): a 1,000-rank improvement is associated with about 3.7 percentage points lower AVG C2I. The 95% confidence interval for Stu_Tea_Rank is (-0.00460, -0.00274), which is entirely negative and excludes zero, confirming the direction of the effect (appendix J). Although the main effect of CrimeRate is not significant, it is retained because it participates in significant interaction terms.
The Locale:Unemployment interactions are also informative. The reference level is Locale 11 (urban core), where unemployment has the steepest positive slope (captured by the main Unemployment term of ~22.60). All Locale:Unemployment interaction coefficients are negative and significant: for example, Unemployment:Locale13 = -10.97 (p < \(10^{-5}\)) means that in mid-size cities, each percentage point of unemployment adds about 11.63 rather than 22.60 percentage points to AVG C2I. Intuitively, this means that urban core counties have the steepest relationship between unemployment and AVG C2I, and unemployment matters most in cities.
# Diagnostic plots for the final model
# Residual structure is improved over the initial model but heteroscedasticity
# persists, though to a smaller extent
par(mfrow=c(2,2), mar=c(4,4,2,1))
plot(final_model)
par(mfrow=c(1,1))
bp_f = bptest(final_model)
sw_f = shapiro.test(resid(final_model))
sprintf("Breusch-Pagan: BP=%.2f, p=%.2e", bp_f$statistic, bp_f$p.value)
## [1] "Breusch-Pagan: BP=185.34, p=4.42e-09"
sprintf("Shapiro-Wilk: W=%.4f, p=%.2e", sw_f$statistic, sw_f$p.value)
## [1] "Shapiro-Wilk: W=0.9206, p=6.98e-36"
cook_f = cooks.distance(final_model)
sprintf("Influential obs (Cook's D > 4): %d",
sum(cook_f > 4/nobs(final_model)))
## [1] "Influential obs (Cook's D > 4): 116"
# Fitted vs actual values for the final model
# The model tracks the main body of the data well but underestimates
# extreme high-cost counties
plot(fitted(final_model), model_df_nodc$AVG_C2I,
pch=16, cex=0.3, col=rgb(0.1,0.4,0.7,0.4),
xlab="Fitted Values", ylab="Actual AVG C2I (%)",
main="Final Model: Fitted vs. Actual")
abline(0, 1, col="red", lty=2, lwd=2)
Residual diagnostics show meaningful improvement over the initial model. The Scale-Location plot is more horizontal and the Residuals vs Fitted fan is less pronounced, though the Breusch-Pagan test still rejects homoscedasticity (BP = 185.34, p = \(4.42\times10^{-9}\)). The Shapiro-Wilk test also still rejects normality (W = 0.921, p = \(6.98\times10^{-36}\)). These violations are driven primarily by a small cluster of extreme high-cost counties — most notably observation 2052 (a low-population South Dakota county with AVG C2I = 287%), which appears consistently as the most extreme outlier across all diagnostics. The outlier is Todd County, a Native American-majority county classified as persistently impoverished, explaining the discrepancy (appendix K).
# Influential observations stability
clean_idx = which(cook_f <= 4/nobs(final_model))
final_clean = lm(formula(final_model), data=model_df_nodc[clean_idx,])
sprintf("Adj R^2 with all observations: %.4f", sf$adj.r.squared)
## [1] "Adj R^2 with all observations: 0.5069"
sprintf("Adj R^2 without influential points: %.4f",
summary(final_clean)$adj.r.squared)
## [1] "Adj R^2 without influential points: 0.5918"
Refitting without the 116 influential observations raises adjusted R2 to 0.592, suggesting the influential points reduce fit rather than propping it up. However, the influential counties are real observations, not data errors. They represent genuinely unusual counties that a comprehensive model should eventually explain. Both versions of the model are retained, and the full-data version is treated as the official final model, with the cleaned version noted as an alternative.
The final model provides a statistically sound and interpretable account of county-level affordability in the United States. Unemployment rate, geographic state, and student-to-teacher rank are the most consistent predictors of AVG C2I. The model’s adjusted R2 of approximately 0.51 is reasonable given that county affordability is shaped by many unmeasured factors, including local housing supply constraints, zoning policy, property tax structures, and commuting patterns, none of which are available in this dataset.
Several limitations warrant acknowledgment. First, heteroscedasticity and non-normally distributed residuals persist in the final model. Standard errors and p-values based on ordinary least squares may be slightly misleading. Given the large sample size (n = 2,762), the central limit theorem provides some protection for inference even under non-normality, but the formal Shapiro-Wilk rejection means that exact p-values from this model should be interpreted with appropriate caution. Second, the crime rate data is from 2016 while most other variables are from 2020–2022, introducing a temporal mismatch that may weaken crime rate’s coefficient. Third, the student-teacher rank is an ordinal measure that compresses variation in the middle of the distribution - a direct ratio would be preferable. Fourth, the State fixed effects absorb a large amount of variance but also partially obscure the interpretation of other coefficients. States differ in many ways beyond what the remaining predictors capture.
The assumption of independent errors is partially violated because counties within the same state share unobserved characteristics. The state fixed effects partially mitigate this, but spatial autocorrelation within states is not fully addressed.
Overall, although causation cannot be concluded from the model, the findings suggest that unemployment and, to a lesser extent, quality of public school staffing, are the factors that correlate with affordability the most. However, the interaction terms, such as between locale type and unemployment, suggest that these correlations are not constant, and the relationship between unemployment and affordability is different in rural counties than it is in urban counties.
dim(model_df) # rows and columns
## [1] 2763 8
model_with_dc = lm(AVG_C2I ~ ., data = model_df)
levs = hatvalues(model_with_dc)
model_df[which(levs == 1), ] # DC has perfect leverage
## AVG_C2I Unemployment CrimeRate Water_Quality State Locale Population
## 3077 103.16 4.91 0.061 1 DC 11 671803
## Stu_Tea_Rank
## 3077 2972
pairs(nums) # categorical variables are too cluttered to see
s
##
## Call:
## lm(formula = AVG_C2I ~ ., data = model_df_nodc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -76.753 -7.959 -0.719 6.638 171.039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.614e+01 4.820e+00 15.796 < 2e-16 ***
## Unemployment 7.370e+00 2.715e-01 27.147 < 2e-16 ***
## CrimeRate 1.103e+02 2.853e+01 3.866 0.000113 ***
## Water_Quality 6.055e-02 2.597e-02 2.331 0.019821 *
## StateAL 3.723e+01 3.375e+00 11.033 < 2e-16 ***
## StateAR 1.710e+01 3.289e+00 5.200 2.15e-07 ***
## StateAZ 3.301e+01 5.025e+00 6.570 6.02e-11 ***
## StateCA 1.778e+01 3.395e+00 5.237 1.76e-07 ***
## StateCO 1.876e+01 3.305e+00 5.678 1.51e-08 ***
## StateCT 1.023e+00 5.643e+00 0.181 0.856169
## StateDE 8.474e+00 8.482e+00 0.999 0.317819
## StateFL 3.060e+01 3.304e+00 9.262 < 2e-16 ***
## StateGA 2.270e+01 3.052e+00 7.439 1.35e-13 ***
## StateHI 3.333e+01 1.017e+01 3.278 0.001058 **
## StateIA 5.102e+00 3.214e+00 1.588 0.112477
## StateID 2.607e+01 3.535e+00 7.376 2.17e-13 ***
## StateIL -1.670e+00 4.137e+00 -0.404 0.686454
## StateIN 1.680e+01 3.247e+00 5.175 2.44e-07 ***
## StateKS 1.376e+01 3.178e+00 4.328 1.56e-05 ***
## StateKY 1.901e+01 3.091e+00 6.150 8.92e-10 ***
## StateLA 2.829e+01 3.297e+00 8.582 < 2e-16 ***
## StateMA 1.219e+01 4.648e+00 2.623 0.008754 **
## StateMD -9.463e-01 4.044e+00 -0.234 0.815000
## StateME 2.023e+01 4.875e+00 4.150 3.42e-05 ***
## StateMI 4.194e+00 3.184e+00 1.317 0.187962
## StateMN 5.521e+00 3.198e+00 1.726 0.084395 .
## StateMO 2.524e+01 3.153e+00 8.004 1.77e-15 ***
## StateMS 2.162e+01 3.136e+00 6.892 6.84e-12 ***
## StateMT 2.519e+01 3.439e+00 7.323 3.18e-13 ***
## StateNC 2.241e+01 3.131e+00 7.156 1.07e-12 ***
## StateND -1.512e+00 3.593e+00 -0.421 0.673943
## StateNE 2.662e+01 3.244e+00 8.205 3.52e-16 ***
## StateNH 5.092e+00 5.226e+00 0.974 0.329966
## StateNJ 6.209e+00 4.201e+00 1.478 0.139484
## StateNM 2.422e+01 3.829e+00 6.326 2.94e-10 ***
## StateNV 7.690e+00 4.338e+00 1.773 0.076373 .
## StateNY 2.081e+01 3.360e+00 6.193 6.79e-10 ***
## StateOH 2.433e+00 3.210e+00 0.758 0.448432
## StateOK 2.913e+01 3.293e+00 8.847 < 2e-16 ***
## StateOR 2.717e+01 3.656e+00 7.431 1.44e-13 ***
## StatePA 3.142e+00 3.302e+00 0.951 0.341465
## StateRI 5.004e+00 6.829e+00 0.733 0.463754
## StateSC 1.465e+01 3.596e+00 4.074 4.76e-05 ***
## StateSD 2.185e+01 3.427e+00 6.376 2.14e-10 ***
## StateTN 1.564e+01 3.193e+00 4.898 1.02e-06 ***
## StateTX 5.399e+00 2.903e+00 1.859 0.063067 .
## StateUT 1.215e+01 3.979e+00 3.055 0.002276 **
## StateVA 1.794e+01 3.129e+00 5.734 1.09e-08 ***
## StateVT 2.270e+01 4.649e+00 4.883 1.11e-06 ***
## StateWA -3.382e+00 3.512e+00 -0.963 0.335686
## StateWI 1.295e+01 3.260e+00 3.972 7.31e-05 ***
## StateWV 3.308e+01 3.724e+00 8.885 < 2e-16 ***
## StateWY 1.605e+01 4.096e+00 3.918 9.17e-05 ***
## Locale12 -1.247e+01 3.717e+00 -3.354 0.000806 ***
## Locale13 -9.918e+00 3.496e+00 -2.837 0.004587 **
## Locale21 -1.377e+01 3.427e+00 -4.017 6.04e-05 ***
## Locale22 -1.156e+01 3.959e+00 -2.919 0.003538 **
## Locale23 -1.136e+01 3.960e+00 -2.868 0.004161 **
## Locale31 -1.372e+01 3.757e+00 -3.651 0.000266 ***
## Locale32 -9.486e+00 3.431e+00 -2.765 0.005738 **
## Locale33 -5.773e+00 3.511e+00 -1.644 0.100208
## Locale41 -1.005e+01 3.422e+00 -2.935 0.003360 **
## Locale42 -1.009e+01 3.429e+00 -2.943 0.003280 **
## Locale43 -4.864e+00 3.507e+00 -1.387 0.165632
## Population 1.113e-06 9.442e-07 1.179 0.238395
## Stu_Tea_Rank -4.174e-03 4.817e-04 -8.666 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.79 on 2696 degrees of freedom
## Multiple R-squared: 0.4943, Adjusted R-squared: 0.4821
## F-statistic: 40.53 on 65 and 2696 DF, p-value: < 2.2e-16
round(vif(init_model), 3)
## GVIF Df GVIF^(1/(2*Df))
## Unemployment 1.912 1 1.383
## CrimeRate 1.760 1 1.327
## Water_Quality 1.169 1 1.081
## State 8.693 49 1.022
## Locale 4.715 11 1.073
## Population 1.511 1 1.229
## Stu_Tea_Rank 2.628 1 1.621
formula(aic_model)
## AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State +
## Locale + Stu_Tea_Rank
s_aic = summary(aic_model)
sprintf("Adj R^2: %.4f | LOOCV RMSE: %.4f",
s_aic$adj.r.squared, loocv_rmse(aic_model))
## [1] "Adj R^2: 0.4820 | LOOCV RMSE: 13.9788"
formula(bic_model)
## AVG_C2I ~ Unemployment + CrimeRate + State + Stu_Tea_Rank
s_bic = summary(bic_model)
sprintf("Adj R^2: %.4f | LOOCV RMSE: %.4f",
s_bic$adj.r.squared, loocv_rmse(bic_model))
## [1] "Adj R^2: 0.4693 | LOOCV RMSE: 14.1101"
anova(bic_model, aic_model, init_model) # AIC model is the best
## Analysis of Variance Table
##
## Model 1: AVG_C2I ~ Unemployment + CrimeRate + State + Stu_Tea_Rank
## Model 2: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State +
## Locale + Stu_Tea_Rank
## Model 3: AVG_C2I ~ Unemployment + CrimeRate + Water_Quality + State +
## Locale + Population + Stu_Tea_Rank
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2709 527860
## 2 2697 512935 12 14925.0 6.5406 1.23e-11 ***
## 3 2696 512670 1 264.5 1.3907 0.2384
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
shapiro.test(resid(model_log_resp))
##
## Shapiro-Wilk normality test
##
## data: resid(model_log_resp)
## W = 0.96855, p-value < 2.2e-16
bptest(model_log_resp)
##
## studentized Breusch-Pagan test
##
## data: model_log_resp
## BP = 240.41, df = 65, p-value < 2.2e-16
shapiro.test(resid(model_bc))
##
## Shapiro-Wilk normality test
##
## data: resid(model_bc)
## W = 0.97798, p-value < 2.2e-16
bptest(model_bc)
##
## studentized Breusch-Pagan test
##
## data: model_bc
## BP = 225.98, df = 65, p-value < 2.2e-16
suppressWarnings(crPlots(init_model))
# Zoom in on unemployment - seems curved
ggplot(model_df_nodc, aes(x = Unemployment, y = residuals(init_model))) +
geom_point(alpha = 0.3) +
geom_smooth() # downward curve on the right
# Quantitative interaction p-values (each tested against full additive model)
quant_vars = c("Unemployment", "CrimeRate", "Water_Quality", "Stu_Tea_Rank")
# Test each pair, record p-value of the interaction term
results = data.frame()
for (i in 1:(length(quant_vars)-1)) {
for (j in (i+1):length(quant_vars)) {
v1 = quant_vars[i]
v2 = quant_vars[j]
f = as.formula(paste("AVG_C2I ~ . +", v1, ":", v2)) # add this interaction term
m = lm(f, data = model_df_nodc)
coef_name = paste0(v1, ":", v2)
p = summary(m)$coefficients[coef_name, "Pr(>|t|)"]
results = rbind(results, data.frame(Var1=v1, Var2=v2, p_value=round(p, 5))) # add row to table
}
}
results[order(results$p_value), ]
## Var1 Var2 p_value
## 4 CrimeRate Water_Quality 0.03909
## 5 CrimeRate Stu_Tea_Rank 0.04897
## 1 Unemployment CrimeRate 0.05466
## 6 Water_Quality Stu_Tea_Rank 0.06794
## 3 Unemployment Stu_Tea_Rank 0.86762
## 2 Unemployment Water_Quality 0.90015
# Categorical:quantitative interaction F-test p-values
cat_vars = c("State", "Locale")
results_cat = data.frame()
for (cat in cat_vars) {
for (quant in quant_vars) {
f_with = as.formula(paste("AVG_C2I ~ . +", quant, ":", cat))
m_with = lm(f_with, data = model_df_nodc)
p = anova(init_model, m_with)$`Pr(>F)`[2]
results_cat = rbind(results_cat, data.frame(Categorical=cat,
Quantitative=quant, p_value=p))
}
}
results_cat[order(results_cat$p_value), ]
## Categorical Quantitative p_value
## 1 State Unemployment 2.692944e-36
## 4 State Stu_Tea_Rank 8.226318e-10
## 5 Locale Unemployment 1.945681e-09
## 6 Locale CrimeRate 5.875221e-06
## 2 State CrimeRate 8.968462e-04
## 3 State Water_Quality 1.512618e-01
## 8 Locale Stu_Tea_Rank 1.803448e-01
## 7 Locale Water_Quality 9.674013e-01
# LASSO (for automatic variable selection)
# Prepare matrix (model.matrix handles factors as dummies)
# remove intercept - glmnet adds it automatically
X = model.matrix(AVG_C2I ~ ., data = model_df_nodc)[, -1]
y = model_df_nodc$AVG_C2I
set.seed(42) # for reproducibility
cv_lasso = cv.glmnet(X, y, alpha = 1, nfolds = 10)
plot(cv_lasso)
# LASSO 10-fold CV RMSE
lasso_cv_rmse = sqrt(min(cv_lasso$cvm))
set.seed(42)
cv_ridge = cv.glmnet(X, y, alpha = 0, nfolds=10)
# Ridge 10-fold CV RMSE
ridge_cv_rmse = sqrt(min(cv_ridge$cvm))
sprintf("LASSO best lambda: %.4f | 10-fold RMSE: %.4f", cv_lasso$lambda.min,
lasso_cv_rmse)
## [1] "LASSO best lambda: 0.0087 | 10-fold RMSE: 13.9621"
sprintf("Ridge best lambda: %.4f | 10-fold RMSE: %.4f", cv_ridge$lambda.min,
ridge_cv_rmse)
## [1] "Ridge best lambda: 0.7712 | 10-fold RMSE: 14.0210"
sf
##
## Call:
## lm(formula = AVG_C2I ~ . - Population + Locale:CrimeRate + Locale:Unemployment +
## I(Unemployment^2), data = model_df_nodc)
##
## Residuals:
## Min 1Q Median 3Q Max
## -45.278 -7.606 -0.801 6.635 164.336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.320e+01 1.479e+01 1.569 0.116763
## Unemployment 2.260e+01 2.386e+00 9.474 < 2e-16 ***
## CrimeRate 9.836e+00 1.843e+02 0.053 0.957439
## Water_Quality 6.707e-02 2.567e-02 2.613 0.009030 **
## StateAL 4.056e+01 3.349e+00 12.109 < 2e-16 ***
## StateAR 1.864e+01 3.259e+00 5.720 1.18e-08 ***
## StateAZ 3.862e+01 4.982e+00 7.753 1.27e-14 ***
## StateCA 2.121e+01 3.375e+00 6.285 3.81e-10 ***
## StateCO 2.182e+01 3.288e+00 6.638 3.85e-11 ***
## StateCT 1.977e+00 5.607e+00 0.353 0.724472
## StateDE 8.751e+00 8.422e+00 1.039 0.298882
## StateFL 3.347e+01 3.274e+00 10.223 < 2e-16 ***
## StateGA 2.491e+01 3.021e+00 8.246 2.55e-16 ***
## StateHI 3.803e+01 9.964e+00 3.817 0.000138 ***
## StateIA 8.348e+00 3.206e+00 2.604 0.009264 **
## StateID 2.865e+01 3.499e+00 8.190 4.01e-16 ***
## StateIL 4.399e-01 4.088e+00 0.108 0.914329
## StateIN 2.028e+01 3.223e+00 6.291 3.66e-10 ***
## StateKS 1.873e+01 3.210e+00 5.833 6.09e-09 ***
## StateKY 1.996e+01 3.063e+00 6.515 8.63e-11 ***
## StateLA 2.908e+01 3.278e+00 8.870 < 2e-16 ***
## StateMA 1.379e+01 4.612e+00 2.991 0.002809 **
## StateMD 1.183e+00 3.997e+00 0.296 0.767259
## StateME 2.185e+01 4.794e+00 4.557 5.43e-06 ***
## StateMI 5.098e+00 3.196e+00 1.595 0.110811
## StateMN 8.279e+00 3.173e+00 2.609 0.009137 **
## StateMO 2.922e+01 3.142e+00 9.300 < 2e-16 ***
## StateMS 2.191e+01 3.102e+00 7.062 2.08e-12 ***
## StateMT 2.926e+01 3.447e+00 8.490 < 2e-16 ***
## StateNC 2.370e+01 3.117e+00 7.602 4.00e-14 ***
## StateND 2.595e+00 3.606e+00 0.720 0.471745
## StateNE 3.074e+01 3.279e+00 9.374 < 2e-16 ***
## StateNH 9.085e+00 5.162e+00 1.760 0.078499 .
## StateNJ 8.071e+00 4.202e+00 1.921 0.054867 .
## StateNM 2.645e+01 3.791e+00 6.977 3.78e-12 ***
## StateNV 9.774e+00 4.277e+00 2.285 0.022384 *
## StateNY 2.123e+01 3.351e+00 6.335 2.77e-10 ***
## StateOH 3.033e+00 3.225e+00 0.941 0.347037
## StateOK 3.212e+01 3.283e+00 9.783 < 2e-16 ***
## StateOR 2.707e+01 3.608e+00 7.502 8.49e-14 ***
## StatePA 3.863e+00 3.341e+00 1.156 0.247653
## StateRI 8.488e+00 6.729e+00 1.261 0.207279
## StateSC 1.632e+01 3.566e+00 4.576 4.95e-06 ***
## StateSD 2.538e+01 3.443e+00 7.371 2.24e-13 ***
## StateTN 1.759e+01 3.166e+00 5.555 3.05e-08 ***
## StateTX 7.329e+00 2.897e+00 2.529 0.011481 *
## StateUT 1.622e+01 3.937e+00 4.118 3.93e-05 ***
## StateVA 2.060e+01 3.105e+00 6.635 3.91e-11 ***
## StateVT 2.510e+01 4.581e+00 5.479 4.68e-08 ***
## StateWA -2.723e+00 3.496e+00 -0.779 0.436077
## StateWI 1.555e+01 3.227e+00 4.818 1.53e-06 ***
## StateWV 3.461e+01 3.685e+00 9.391 < 2e-16 ***
## StateWY 1.779e+01 4.037e+00 4.405 1.10e-05 ***
## Locale12 1.955e+01 1.625e+01 1.203 0.229079
## Locale13 2.632e+01 1.490e+01 1.767 0.077346 .
## Locale21 1.196e+01 1.511e+01 0.792 0.428444
## Locale22 3.901e+01 1.721e+01 2.266 0.023544 *
## Locale23 2.527e+01 1.619e+01 1.561 0.118650
## Locale31 2.261e+01 1.602e+01 1.411 0.158253
## Locale32 2.856e+01 1.458e+01 1.959 0.050199 .
## Locale33 3.168e+01 1.469e+01 2.157 0.031107 *
## Locale41 2.726e+01 1.466e+01 1.859 0.063082 .
## Locale42 3.013e+01 1.452e+01 2.076 0.038029 *
## Locale43 3.587e+01 1.449e+01 2.475 0.013401 *
## Stu_Tea_Rank -3.670e-03 4.733e-04 -7.754 1.26e-14 ***
## I(Unemployment^2) -4.734e-01 6.787e-02 -6.974 3.86e-12 ***
## CrimeRate:Locale12 7.129e-01 2.270e+02 0.003 0.997494
## CrimeRate:Locale13 2.908e+02 2.087e+02 1.393 0.163703
## CrimeRate:Locale21 3.573e+02 2.068e+02 1.728 0.084172 .
## CrimeRate:Locale22 -2.036e+01 2.847e+02 -0.072 0.942997
## CrimeRate:Locale23 2.637e+02 2.772e+02 0.951 0.341599
## CrimeRate:Locale31 2.334e+02 2.353e+02 0.992 0.321338
## CrimeRate:Locale32 1.592e+02 1.958e+02 0.813 0.416303
## CrimeRate:Locale33 4.366e+01 2.031e+02 0.215 0.829782
## CrimeRate:Locale41 1.126e+02 1.946e+02 0.579 0.562746
## CrimeRate:Locale42 8.797e+01 1.929e+02 0.456 0.648482
## CrimeRate:Locale43 -1.906e+02 1.922e+02 -0.992 0.321457
## Unemployment:Locale12 -7.463e+00 2.895e+00 -2.578 0.010001 *
## Unemployment:Locale13 -1.097e+01 2.404e+00 -4.564 5.24e-06 ***
## Unemployment:Locale21 -8.173e+00 2.644e+00 -3.092 0.002010 **
## Unemployment:Locale22 -1.274e+01 3.063e+00 -4.159 3.30e-05 ***
## Unemployment:Locale23 -1.081e+01 2.758e+00 -3.921 9.05e-05 ***
## Unemployment:Locale31 -1.039e+01 2.963e+00 -3.508 0.000459 ***
## Unemployment:Locale32 -1.044e+01 2.357e+00 -4.430 9.78e-06 ***
## Unemployment:Locale33 -9.664e+00 2.351e+00 -4.110 4.08e-05 ***
## Unemployment:Locale41 -9.963e+00 2.389e+00 -4.170 3.14e-05 ***
## Unemployment:Locale42 -1.061e+01 2.338e+00 -4.537 5.96e-06 ***
## Unemployment:Locale43 -9.613e+00 2.307e+00 -4.167 3.19e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 2674 degrees of freedom
## Multiple R-squared: 0.5225, Adjusted R-squared: 0.5069
## F-statistic: 33.63 on 87 and 2674 DF, p-value: < 2.2e-16
# 95% confidence intervals for key quantitative predictors
confint(final_model)[c("Unemployment", "CrimeRate", "Stu_Tea_Rank",
"Water_Quality"), ]
## 2.5 % 97.5 %
## Unemployment 1.792435e+01 27.280888652
## CrimeRate -3.515380e+02 371.210358361
## Stu_Tea_Rank -4.597987e-03 -0.002741861
## Water_Quality 1.673611e-02 0.117406258
df[2052, ] # Todd County - persistent poverty
## countyhelper LSTATE NMCNTY FIPS LZIP ULOCALE Overall.Rank
## 2052 SDTodd County SD Todd County 46121 57555 43-Rural: Remote NA
## X2022.Population X2016.Crime.Rate Unemployment X2020PopulrVoteParty
## 2052 9,220 0 3.89% R
## X2020.PopulrMajor. AQI.Good WaterQualityVPV ParkScore2023.Rank
## 2052 61.77% 92.09% 2 -1
## X.CvgCityPark NtnlPrkCnt X.CvgStatePark Cost.of.Living X2022.Median.Income
## 2052 -1 2 0.56% $73,348.34 $25,529.98
## AVG.C2I X1p0c X1p1c X1p2c X1p3c X1p4c X2p0c X2p1c X2p2c
## 2052 287.30% 147.62% 207.20% 263.98% 321.71% 350.32% 209.71% 271.82% 322.55%
## X2p3c X2p4c Stu.Tea.Rank Diversity.Rank..Race. Diversity.Rank..Gender.
## 2052 374.47% 403.64% 1357 2052 1454