Load and Prepare the Data
#Load the data
full_data <- read.table("uscrime.txt", header = TRUE)
# View structure
str(full_data)
'data.frame': 47 obs. of 16 variables:
$ M : num 15.1 14.3 14.2 13.6 14.1 12.1 12.7 13.1 15.7 14 ...
$ So : int 1 0 1 0 0 0 1 1 1 0 ...
$ Ed : num 9.1 11.3 8.9 12.1 12.1 11 11.1 10.9 9 11.8 ...
$ Po1 : num 5.8 10.3 4.5 14.9 10.9 11.8 8.2 11.5 6.5 7.1 ...
$ Po2 : num 5.6 9.5 4.4 14.1 10.1 11.5 7.9 10.9 6.2 6.8 ...
$ LF : num 0.51 0.583 0.533 0.577 0.591 0.547 0.519 0.542 0.553 0.632 ...
$ M.F : num 95 101.2 96.9 99.4 98.5 ...
$ Pop : int 33 13 18 157 18 25 4 50 39 7 ...
$ NW : num 30.1 10.2 21.9 8 3 4.4 13.9 17.9 28.6 1.5 ...
$ U1 : num 0.108 0.096 0.094 0.102 0.091 0.084 0.097 0.079 0.081 0.1 ...
$ U2 : num 4.1 3.6 3.3 3.9 2 2.9 3.8 3.5 2.8 2.4 ...
$ Wealth: int 3940 5570 3180 6730 5780 6890 6200 4720 4210 5260 ...
$ Ineq : num 26.1 19.4 25 16.7 17.4 12.6 16.8 20.6 23.9 17.4 ...
$ Prob : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
$ Time : num 26.2 25.3 24.3 29.9 21.3 ...
$ Crime : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
#Fit the Model
full_model <- lm(Crime ~ .,data=full_data)
summary(full_model)
Call:
lm(formula = Crime ~ ., data = full_data)
Residuals:
Min 1Q Median 3Q Max
-395.74 -98.09 -6.69 112.99 512.67
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.984e+03 1.628e+03 -3.675 0.000893 ***
M 8.783e+01 4.171e+01 2.106 0.043443 *
So -3.803e+00 1.488e+02 -0.026 0.979765
Ed 1.883e+02 6.209e+01 3.033 0.004861 **
Po1 1.928e+02 1.061e+02 1.817 0.078892 .
Po2 -1.094e+02 1.175e+02 -0.931 0.358830
LF -6.638e+02 1.470e+03 -0.452 0.654654
M.F 1.741e+01 2.035e+01 0.855 0.398995
Pop -7.330e-01 1.290e+00 -0.568 0.573845
NW 4.204e+00 6.481e+00 0.649 0.521279
U1 -5.827e+03 4.210e+03 -1.384 0.176238
U2 1.678e+02 8.234e+01 2.038 0.050161 .
Wealth 9.617e-02 1.037e-01 0.928 0.360754
Ineq 7.067e+01 2.272e+01 3.111 0.003983 **
Prob -4.855e+03 2.272e+03 -2.137 0.040627 *
Time -3.479e+00 7.165e+00 -0.486 0.630708
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 209.1 on 31 degrees of freedom
Multiple R-squared: 0.8031, Adjusted R-squared: 0.7078
F-statistic: 8.429 on 15 and 31 DF, p-value: 3.539e-07
Here lm fits a multiple linear regression model with Crime as the response variable and all other columns as predictors. Reviewing the summary output yields several key takeaways.
Evaluating the p-values allows us to identify the most notable predictors: M (percentage of young males), Ed (education), Po1 (police expenditure per capita, on the cusp with a p-value of roughly 0.08, significant at the 0.1 level but not at 0.05), U2 (unemployment rate), Ineq (income inequality), and Prob (probability of imprisonment). The remaining predictors have high p-values (> 0.1), indicating that they are not significant in this model. I will repeat the modeling while addressing these insignificant predictors.
R-squared: 0.8031 - the model accounts for ~80% of the variance in Crime. As discussed in the lectures, this is likely a product of a small, "academic" dataset and some overfitting; in real-world scenarios we would not expect such a high R-squared value.
Adjusted R-squared: 0.7078 - R-squared after accounting for the number of predictors; while lower, it still indicates a strong model.
F-statistic and p-value: 8.429 and 3.54e-07, indicating that the overall model is statistically significant.
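As a quick programmatic check, the predictor p-values discussed above can be pulled directly from the summary object rather than read off the printout; a minimal sketch (the 0.1 cutoff mirrors the discussion above):
# Extract the p-value column from the coefficient table
pvals <- summary(full_model)$coefficients[, "Pr(>|t|)"]
sort(pvals[pvals < 0.1])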
# Residual plot
plot(full_model$fitted.values, full_model$residuals,
xlab = "Fitted Values", ylab = "Residuals",
main = "Residuals vs Fitted")
abline(h = 0, col = "red")
Additionally, we can plot the residuals to confirm that they are roughly scattered around 0 with no obvious pattern. We can also check for collinearity among the predictors using the variance inflation factor (VIF) from the car package.
library(car)
vif(full_model)
M So Ed Po1 Po2 LF M.F Pop NW U1
2.892448 5.342783 5.077447 104.658667 113.559262 3.712690 3.785934 2.536708 4.674088 6.063931
U2 Wealth Ineq Prob Time
5.088880 10.530375 8.644528 2.809459 2.713785
For VIF, values above 5 are a moderate concern, while values above 10 indicate serious multicollinearity that should be addressed to maximize the efficacy of the model (here Po1, Po2, and Wealth all exceed 10). I will first repeat the exercise excluding the insignificant predictors and then re-assess collinearity.
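As a quick programmatic check of those thresholds, using the VIF values already computed above:
# Flag predictors above the moderate (5) and serious (10) VIF thresholds
vif_vals <- vif(full_model)
vif_vals[vif_vals > 5]   # moderate concern or worse
vif_vals[vif_vals > 10]  # serious multicollinearity: Po1, Po2, Wealth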
refined_model <- lm(Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = full_data)
summary(refined_model)
Call:
lm(formula = Crime ~ M + Ed + Po1 + U2 + Ineq + Prob, data = full_data)
Residuals:
Min 1Q Median 3Q Max
-470.68 -78.41 -19.68 133.12 556.23
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5040.50 899.84 -5.602 1.72e-06 ***
M 105.02 33.30 3.154 0.00305 **
Ed 196.47 44.75 4.390 8.07e-05 ***
Po1 115.02 13.75 8.363 2.56e-10 ***
U2 89.37 40.91 2.185 0.03483 *
Ineq 67.65 13.94 4.855 1.88e-05 ***
Prob -3801.84 1528.10 -2.488 0.01711 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 200.7 on 40 degrees of freedom
Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
F-statistic: 21.81 on 6 and 40 DF, p-value: 3.418e-11
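To follow through on re-assessing collinearity, the same VIF check can be run on the refined model (output not reproduced here):
# Re-check multicollinearity after dropping Po2, Wealth, and the other weak predictors
vif(refined_model)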
AIC(full_model)
[1] 650.0291
BIC(full_model)
[1] 681.4816
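For reference, AIC and BIC are penalized log-likelihood criteria (lower is better); a quick sketch showing how R arrives at the full model's values:
# AIC = -2 * log-likelihood + 2 * k, where k counts every estimated parameter
# (16 regression coefficients including the intercept, plus the residual variance)
ll <- logLik(full_model)
-2 * as.numeric(ll) + 2 * attr(ll, "df")                    # matches AIC(full_model)
-2 * as.numeric(ll) + log(nrow(full_data)) * attr(ll, "df") # matches BIC(full_model)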
Next, build a table to compare the key metrics of the two models.
# Extract model metrics
model_metrics <- data.frame(
Metric = c("R-squared", "Adjusted R-squared", "Residual Std. Error", "Model P-value", "AIC","BIC"),
Full_Model = c(
summary(full_model)$r.squared,
summary(full_model)$adj.r.squared,
summary(full_model)$sigma,
pf(summary(full_model)$fstatistic[1],
summary(full_model)$fstatistic[2],
summary(full_model)$fstatistic[3],
lower.tail = FALSE),
AIC(full_model),
BIC(full_model)
),
Refined_Model = c(
summary(refined_model)$r.squared,
summary(refined_model)$adj.r.squared,
summary(refined_model)$sigma,
pf(summary(refined_model)$fstatistic[1],
summary(refined_model)$fstatistic[2],
summary(refined_model)$fstatistic[3],
lower.tail = FALSE),
AIC(refined_model),
BIC(refined_model)
)
)
# Print the table
print(model_metrics, row.names = FALSE)
We can see that the refined model has a slightly lower R-squared but a higher adjusted R-squared. The residual standard errors are comparable, and both models are statistically significant overall.
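This pattern is exactly what adjusted R-squared is designed to capture, since it penalizes each additional predictor; a quick sketch of the formula applied to the refined model (6 predictors):
# adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
n  <- nrow(full_data)                   # 47 observations
r2 <- summary(refined_model)$r.squared  # 0.7659
1 - (1 - r2) * (n - 1) / (n - 6 - 1)    # ~0.7307, matching the summary output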
# Create a tidy data frame of metrics
library(tibble)
library(tidyr)
library(ggplot2)
metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared"),
Full_Model = c(
summary(full_model)$r.squared,
summary(full_model)$adj.r.squared
),
Refined_Model = c(
summary(refined_model)$r.squared,
summary(refined_model)$adj.r.squared
)
)
# Reshape for plotting
metrics_long <- pivot_longer(metrics, cols = -Metric, names_to = "Model", values_to = "Value")
# Create grouped bar chart
ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison Across Key Metrics",
x = "Metric",
y = "Value") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
metrics2 <- tibble(
Metric = c("Residual Std. Error", "AIC", "BIC"),
Full_Model = c(
summary(full_model)$sigma,
AIC(full_model),
BIC(full_model)
),
Refined_Model = c(
summary(refined_model)$sigma,
AIC(refined_model),
BIC(refined_model)
)
)
# Reshape for plotting
metrics2_long <- pivot_longer(metrics2, cols = -Metric, names_to = "Model", values_to = "Value")
# Create grouped bar chart
ggplot(metrics2_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison Across Key Metrics",
x = "Metric",
y = "Value") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
We can see that the refined model has a lower R-squared but a higher adjusted R-squared compared to the full model. The refined model also has lower AIC and BIC values, which suggests that it is the better model.
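Because the refined model's predictors are a subset of the full model's, we could also compare the two directly with a partial F-test; a sketch (not run above):
# Does adding back the nine dropped predictors significantly reduce the residual sum of squares?
anova(refined_model, full_model)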
# Extract coefficients and p-values from each model
extract_significance <- function(model, label) {
coef_df <- summary(model)$coefficients
data.frame(
Predictor = rownames(coef_df),
Estimate = coef_df[, "Estimate"],
P_value = coef_df[, "Pr(>|t|)"],
Model = label
)
}
full_sig <- extract_significance(full_model, "Full Model")
refined_sig <- extract_significance(refined_model, "Refined Model")
# Combine all
sig_data <- rbind(full_sig, refined_sig)
library(ggplot2)
ggplot(sig_data, aes(x = Predictor, y = -log10(P_value), fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Predictor Significance Across Models",
x = "Predictor",
y = expression(-log[10](p_value))) +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can also visually compare the significance of each predictor between the first model and the second. Additionally, we can use stepwise selection, which is AIC-based, to iteratively evaluate and select predictors.
stepwise_model <- step(lm(Crime ~ ., data = full_data), direction = "both")
Start: AIC=514.65
Crime ~ M + So + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 +
U2 + Wealth + Ineq + Prob + Time
Df Sum of Sq RSS AIC
- So 1 29 1354974 512.65
- LF 1 8917 1363862 512.96
- Time 1 10304 1365250 513.00
- Pop 1 14122 1369068 513.14
- NW 1 18395 1373341 513.28
- M.F 1 31967 1386913 513.74
- Wealth 1 37613 1392558 513.94
- Po2 1 37919 1392865 513.95
<none> 1354946 514.65
- U1 1 83722 1438668 515.47
- Po1 1 144306 1499252 517.41
- U2 1 181536 1536482 518.56
- M 1 193770 1548716 518.93
- Prob 1 199538 1554484 519.11
- Ed 1 402117 1757063 524.86
- Ineq 1 423031 1777977 525.42
Step: AIC=512.65
Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 +
Wealth + Ineq + Prob + Time
Df Sum of Sq RSS AIC
- Time 1 10341 1365315 511.01
- LF 1 10878 1365852 511.03
- Pop 1 14127 1369101 511.14
- NW 1 21626 1376600 511.39
- M.F 1 32449 1387423 511.76
- Po2 1 37954 1392929 511.95
- Wealth 1 39223 1394197 511.99
<none> 1354974 512.65
- U1 1 96420 1451395 513.88
+ So 1 29 1354946 514.65
- Po1 1 144302 1499277 515.41
- U2 1 189859 1544834 516.81
- M 1 195084 1550059 516.97
- Prob 1 204463 1559437 517.26
- Ed 1 403140 1758114 522.89
- Ineq 1 488834 1843808 525.13
Step: AIC=511.01
Crime ~ M + Ed + Po1 + Po2 + LF + M.F + Pop + NW + U1 + U2 +
Wealth + Ineq + Prob
Df Sum of Sq RSS AIC
- LF 1 10533 1375848 509.37
- NW 1 15482 1380797 509.54
- Pop 1 21846 1387161 509.75
- Po2 1 28932 1394247 509.99
- Wealth 1 36070 1401385 510.23
- M.F 1 41784 1407099 510.42
<none> 1365315 511.01
- U1 1 91420 1456735 512.05
+ Time 1 10341 1354974 512.65
+ So 1 65 1365250 513.00
- Po1 1 134137 1499452 513.41
- U2 1 184143 1549458 514.95
- M 1 186110 1551425 515.01
- Prob 1 237493 1602808 516.54
- Ed 1 409448 1774763 521.33
- Ineq 1 502909 1868224 523.75
Step: AIC=509.37
Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + NW + U1 + U2 + Wealth +
Ineq + Prob
Df Sum of Sq RSS AIC
- NW 1 11675 1387523 507.77
- Po2 1 21418 1397266 508.09
- Pop 1 27803 1403651 508.31
- M.F 1 31252 1407100 508.42
- Wealth 1 35035 1410883 508.55
<none> 1375848 509.37
- U1 1 80954 1456802 510.06
+ LF 1 10533 1365315 511.01
+ Time 1 9996 1365852 511.03
+ So 1 3046 1372802 511.26
- Po1 1 123896 1499744 511.42
- U2 1 190746 1566594 513.47
- M 1 217716 1593564 514.27
- Prob 1 226971 1602819 514.54
- Ed 1 413254 1789103 519.71
- Ineq 1 500944 1876792 521.96
Step: AIC=507.77
Crime ~ M + Ed + Po1 + Po2 + M.F + Pop + U1 + U2 + Wealth + Ineq +
Prob
Df Sum of Sq RSS AIC
- Po2 1 16706 1404229 506.33
- Pop 1 25793 1413315 506.63
- M.F 1 26785 1414308 506.66
- Wealth 1 31551 1419073 506.82
<none> 1387523 507.77
- U1 1 83881 1471404 508.52
+ NW 1 11675 1375848 509.37
+ So 1 7207 1380316 509.52
+ LF 1 6726 1380797 509.54
+ Time 1 4534 1382989 509.61
- Po1 1 118348 1505871 509.61
- U2 1 201453 1588976 512.14
- Prob 1 216760 1604282 512.59
- M 1 309214 1696737 515.22
- Ed 1 402754 1790276 517.74
- Ineq 1 589736 1977259 522.41
Step: AIC=506.33
Crime ~ M + Ed + Po1 + M.F + Pop + U1 + U2 + Wealth + Ineq +
Prob
Df Sum of Sq RSS AIC
- Pop 1 22345 1426575 505.07
- Wealth 1 32142 1436371 505.39
- M.F 1 36808 1441037 505.54
<none> 1404229 506.33
- U1 1 86373 1490602 507.13
+ Po2 1 16706 1387523 507.77
+ NW 1 6963 1397266 508.09
+ So 1 3807 1400422 508.20
+ LF 1 1986 1402243 508.26
+ Time 1 575 1403654 508.31
- U2 1 205814 1610043 510.76
- Prob 1 218607 1622836 511.13
- M 1 307001 1711230 513.62
- Ed 1 389502 1793731 515.83
- Ineq 1 608627 2012856 521.25
- Po1 1 1050202 2454432 530.57
Step: AIC=505.07
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Wealth + Ineq + Prob
Df Sum of Sq RSS AIC
- Wealth 1 26493 1453068 503.93
<none> 1426575 505.07
- M.F 1 84491 1511065 505.77
- U1 1 99463 1526037 506.24
+ Pop 1 22345 1404229 506.33
+ Po2 1 13259 1413315 506.63
+ NW 1 5927 1420648 506.87
+ So 1 5724 1420851 506.88
+ LF 1 5176 1421398 506.90
+ Time 1 3913 1422661 506.94
- Prob 1 198571 1625145 509.20
- U2 1 208880 1635455 509.49
- M 1 320926 1747501 512.61
- Ed 1 386773 1813348 514.35
- Ineq 1 594779 2021354 519.45
- Po1 1 1127277 2553852 530.44
Step: AIC=503.93
Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob
Df Sum of Sq RSS AIC
<none> 1453068 503.93
+ Wealth 1 26493 1426575 505.07
- M.F 1 103159 1556227 505.16
+ Pop 1 16697 1436371 505.39
+ Po2 1 14148 1438919 505.47
+ So 1 9329 1443739 505.63
+ LF 1 4374 1448694 505.79
+ NW 1 3799 1449269 505.81
+ Time 1 2293 1450775 505.86
- U1 1 127044 1580112 505.87
- Prob 1 247978 1701046 509.34
- U2 1 255443 1708511 509.55
- M 1 296790 1749858 510.67
- Ed 1 445788 1898855 514.51
- Ineq 1 738244 2191312 521.24
- Po1 1 1672038 3125105 537.93
summary(stepwise_model)
Call:
lm(formula = Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob,
data = full_data)
Residuals:
Min 1Q Median 3Q Max
-444.70 -111.07 3.03 122.15 483.30
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -6426.10 1194.61 -5.379 4.04e-06 ***
M 93.32 33.50 2.786 0.00828 **
Ed 180.12 52.75 3.414 0.00153 **
Po1 102.65 15.52 6.613 8.26e-08 ***
M.F 22.34 13.60 1.642 0.10874
U1 -6086.63 3339.27 -1.823 0.07622 .
U2 187.35 72.48 2.585 0.01371 *
Ineq 61.33 13.96 4.394 8.63e-05 ***
Prob -3796.03 1490.65 -2.547 0.01505 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 195.5 on 38 degrees of freedom
Multiple R-squared: 0.7888, Adjusted R-squared: 0.7444
F-statistic: 17.74 on 8 and 38 DF, p-value: 1.159e-10
# Create a tidy data frame of metrics
library(tibble)
library(tidyr)
library(ggplot2)
metrics <- tibble(
Metric = c("R-squared", "Adjusted R-squared"),
Full_Model = c(
summary(full_model)$r.squared,
summary(full_model)$adj.r.squared
),
Refined_Model = c(
summary(refined_model)$r.squared,
summary(refined_model)$adj.r.squared
),
Stepwise_Model = c(
summary(stepwise_model)$r.squared,
summary(stepwise_model)$adj.r.squared
))
# Reshape for plotting
metrics_long <- pivot_longer(metrics, cols = -Metric, names_to = "Model", values_to = "Value")
# Create grouped bar chart
ggplot(metrics_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison Across Key Metrics",
x = "Metric",
y = "Value") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
metrics2 <- tibble(
Metric = c("Residual Std. Error", "AIC", "BIC"),
Full_Model = c(
summary(full_model)$sigma,
AIC(full_model),
BIC(full_model)
),
Refined_Model = c(
summary(refined_model)$sigma,
AIC(refined_model),
BIC(refined_model)
),
Stepwise_Model = c(
summary(stepwise_model)$sigma,
AIC(stepwise_model),
BIC(stepwise_model))
)
# Reshape for plotting
metrics2_long <- pivot_longer(metrics2, cols = -Metric, names_to = "Model", values_to = "Value")
# Create grouped bar chart
ggplot(metrics2_long, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison Across Key Metrics",
x = "Metric",
y = "Value") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
# Extract coefficients and p-values from each model
extract_significance <- function(model, label) {
coef_df <- summary(model)$coefficients
data.frame(
Predictor = rownames(coef_df),
Estimate = coef_df[, "Estimate"],
P_value = coef_df[, "Pr(>|t|)"],
Model = label
)
}
full_sig <- extract_significance(full_model, "Full Model")
refined_sig <- extract_significance(refined_model, "Refined Model")
stepwise_sig <- extract_significance(stepwise_model, "Stepwise Model")
# Combine all
sig_data <- rbind(full_sig, refined_sig, stepwise_sig)
library(ggplot2)
ggplot(sig_data, aes(x = Predictor, y = -log10(P_value), fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Predictor Significance Across Models",
x = "Predictor",
y = expression(-log[10](p_value))) +
theme_minimal() +
scale_fill_brewer(palette = "Set1") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "gray") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The stepwise model further improves on the refined model, and it is what we will use to predict the crime rate for the new city. Ultimately, the model formula is Crime ~ M + Ed + Po1 + M.F + U1 + U2 + Ineq + Prob.
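The chunk that builds the new city's data point and generates predicted_crime is not shown above; a minimal sketch of how it would be produced, assuming the predictor values given in the assignment prompt (treat these numbers as placeholders if your prompt differs):
# Hypothetical new city; only the stepwise model's predictors are needed by predict()
new_city <- data.frame(
M = 14.0, Ed = 10.0, Po1 = 12.0, M.F = 94.0,
U1 = 0.120, U2 = 3.6, Ineq = 20.1, Prob = 0.04
)
predicted_crime <- predict(stepwise_model, newdata = new_city)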
print(predicted_crime)
1
1038.413
# Get fitted values for existing cities
existing_predictions <- predict(stepwise_model)
# Combine into a single vector
all_predictions <- c(existing_predictions, predicted_crime)
# Create a data frame with ranks
rank_df <- data.frame(
City = c(rownames(full_data), "New City"),
Predicted_Crime = all_predictions,
Rank = rank(-all_predictions, ties.method = "min") # Higher crime = lower rank
)
# Sort by rank
rank_df <- rank_df[order(rank_df$Rank), ]
rank_df
When comparing the new city's predicted crime rate against the fitted values for the existing cities, we see that it would rank 16th among the 48 ranked values (the 47 cities we currently have plus the new city).
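To pull just that row rather than scanning the full ranking table, a one-line sketch:
# Show only the new city's predicted crime rate and its rank
rank_df[rank_df$City == "New City", ]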