options(scipen=999)
library(socviz)
load("Murders96_viz.RData")Lab 6
DATA
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test <- Murders96_viz[-index,]Training set number of observations: 1320
Test dataset number of observations: 877
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) vars n mean sd min max range se
arrests 1 1320 5.21 18.76 0.00 320.00 320.00 0.52
countyid 2 1320 32550.57 15488.80 1001.00 56043.00 55042.00 426.32
density 3 1320 265.34 1817.83 0.08 54058.77 54058.69 50.03
popul 4 1320 94031.70 223537.24 141.00 3126966.00 3126825.00 6152.66
perc1019 5 1320 15.11 1.88 7.60 24.54 16.94 0.05
perc2029 6 1320 12.05 3.19 5.62 34.75 29.14 0.09
percblack 7 1320 8.34 13.93 0.00 86.28 86.28 0.38
percmale 8 1320 49.33 1.63 44.83 61.64 16.82 0.04
rpcincmaint 9 1320 207.78 116.83 19.14 1281.10 1261.96 3.22
rpcpersinc 10 1320 12569.38 3031.90 5322.20 41094.22 35772.02 83.45
rpcunemins 11 1320 52.90 34.85 0.00 287.10 287.10 0.96
year 12 1320 1996.00 0.00 1996.00 1996.00 0.00 0.00
murders 13 1320 5.88 26.27 0.00 420.00 420.00 0.72
murdrate 14 1320 0.42 0.77 0.00 11.01 11.01 0.02
arrestrate 15 1320 0.48 0.94 0.00 13.99 13.99 0.03
statefips 16 1320 32.45 15.46 1.00 56.00 55.00 0.43
countyfips 17 1320 99.81 109.51 1.00 840.00 839.00 3.01
execs 18 1320 0.01 0.14 0.00 3.00 3.00 0.00
lpopul 19 1320 10.40 1.35 4.95 14.96 10.01 0.04
execrate 20 1320 0.00 0.04 0.00 1.18 1.18 0.00
statename 21 1320 NaN NA Inf -Inf -Inf NA
statecode 22 1320 NaN NA Inf -Inf -Inf NA
region 23 1320 NaN NA Inf -Inf -Inf NA
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")library(DataExplorer)
plot_boxplot(train, by="region") Histograms
plot_histogram(train)Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="arrests")Correlation Matrix
plot_correlation(train, type="continuous")MODEL
Linear Regression Model
Estimate Coefficients and show coefficients table
train$arrest.n <- as.numeric(train$arrests)
library(dplyr)
train <- train %>%
mutate(reduced_statename = case_when(
statename %in% c("DC", "Louisiana") ~ statename,
TRUE ~ "AOther"))
model <- lm((arrest.n) ~ murdrate + murders + arrestrate + percblack + percmale + density + lpopul + execrate + popul + region + reduced_statename, data=train)
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | -10.936 | 5.943 | -1.840 | 0.0660 | . |
murdrate | -2.297 | 0.306 | -7.499 | 0.0000 | *** |
murders | 0.631 | 0.011 | 56.395 | 0.0000 | *** |
arrestrate | 3.273 | 0.232 | 14.094 | 0.0000 | *** |
percblack | -0.015 | 0.015 | -1.041 | 0.2980 |
|
percmale | 0.023 | 0.110 | 0.213 | 0.8314 |
|
density | 0.002 | 0.000 | 14.781 | 0.0000 | *** |
lpopul | 0.962 | 0.171 | 5.643 | 0.0000 | *** |
execrate | 1.219 | 4.093 | 0.298 | 0.7659 |
|
popul | 0.000 | 0.000 | 2.565 | 0.0104 | * |
regionNortheast | -0.929 | 0.614 | -1.513 | 0.1305 |
|
regionSouth | 0.364 | 0.430 | 0.846 | 0.3975 |
|
regionWest | 0.768 | 0.523 | 1.469 | 0.1420 |
|
reduced_statenameDC | -251.140 | 6.942 | -36.177 | 0.0000 | *** |
reduced_statenameLouisiana | 5.169 | 1.302 | 3.970 | 0.0001 | *** |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 5.83 on 1305 degrees of freedom | |||||
Multiple R-squared: 0.9044, Adjusted R-squared: 0.9034 | |||||
F-statistic: 882.2 on 1305 and 14 DF, p-value: 0.0000 | |||||
DC and Louisiana were the only states that were statistically significant. Arrest rate, murders, and murder rate were also highly significant.
Coefficient Magnitude Plot
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)DC has a coefficient value of about -250, which indicates that the variable has a strong negative influence on arrests. This means that there is an inverse relationship between being from DC and getting arrested.
Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) GVIF Df GVIF^(1/(2*Df))
murdrate 2.181668 1 1.477047
murders 3.351321 1 1.830661
arrestrate 1.835757 1 1.354901
percblack 1.668520 1 1.291712
percmale 1.250334 1 1.118184
density 1.337722 1 1.156599
lpopul 2.058492 1 1.434745
execrate 1.057285 1 1.028244
popul 3.384809 1 1.839785
region 1.824704 3 1.105432
reduced_statename 1.517300 2 1.109859
The murders variable had a VIF of 3.35 and the population variable was 3.38, which are both fairly high and indicate some degree of multicollinearity.
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-55.970 -1.020 -0.135 0.800 119.400
The residual range shows the smallest and largest residuals, as well as the first, second, and third quartiles.
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests))
p + geom_point()Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.std.resid, x=.fitted))
p + geom_point()Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
test <- test %>%
mutate(reduced_statename = case_when(
statename %in% c("DC", "Louisiana") ~ statename,
TRUE ~ "AOther"))
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(arrests, fit, lwr,upr) %>%
as_flextable(show_coltype = FALSE) arrests | fit | lwr | upr |
|---|---|---|---|
6 | 6.4 | -5.0 | 17.9 |
1 | 0.8 | -10.7 | 12.3 |
0 | 1.3 | -10.2 | 12.7 |
20 | 13.9 | 2.4 | 25.4 |
2 | 3.3 | -8.2 | 14.7 |
0 | 0.7 | -10.8 | 12.1 |
2 | 4.6 | -6.9 | 16.1 |
8 | 7.5 | -3.9 | 19.0 |
1 | 1.8 | -9.7 | 13.2 |
5 | 4.0 | -7.5 | 15.4 |
n: 10 | |||
Plot Actual vs Fitted (test)
plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit),4),
round(rmse(test_w_predict$arrests, test_w_predict$fit),4),
round(mape(test_w_predict$arrests, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 3.2259 |
RMSE | 26.3791 |
MAPE | Inf |
The Mean Absolute Error was 3.2259, which means that on average, the model’s predictions are off by that amount from the actual values. The RSME suggests that the typical error of your model’s predictions is approximately 26.3791 units. The MAPE was Inf, which suggests that some of the actual values are 0, so when we divide by 0 to calculate the percentage error, we get infinity.
Model Fit by Continent
# more in depth plots of performance
p <- ggplot(data = subset(test_w_predict, region %in% c("South", "West")),
aes(x = arrests,
y = fit, ymin = lwr, ymax = upr,
color = region,
fill = region,
group = region))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = .2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="Regions",
caption="countymurders{wooldridge}") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")ggplot2 explore (Intro Section)
This is just replication of the first section within Chapter 6 (as FYI)
pip <- lm(arrests ~ region, data=train)
summary(pip)
Call:
lm(formula = arrests ~ region, data = train)
Residuals:
Min 1Q Median 3Q Max
-11.689 -5.557 -1.857 -1.557 308.311
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8568 0.8989 2.066 0.03905 *
regionNortheast 9.8321 1.8323 5.366 0.0000000951 ***
regionSouth 3.6997 1.1941 3.098 0.00199 **
regionWest 5.1432 1.5849 3.245 0.00120 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.55 on 1316 degrees of freedom
Multiple R-squared: 0.02407, Adjusted R-squared: 0.02184
F-statistic: 10.82 on 3 and 1316 DF, p-value: 0.0000005048
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=arrests, x=region))
p + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
geom_smooth(color = "steelblue", fill="steelblue", method = "lm")p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
formula = y ~ splines::bs(x, 3), se = FALSE)p + geom_point(alpha=0.1) +
geom_quantile(color = "tomato", size = 1.2, method = "rqss",
lambda = 1, quantiles = c(0.20, 0.5, 0.85))There is one notable outlier in the northeast.
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
Using visualization is crucial for multiple regression models like this. Because there are so many predictor variables, it can be hard to understand the relationships between them solely through data output. Visualization helps us to identify trends, patterns, and outliers in the data set which aren’t always obvious unless we look at graphs or charts.Through this model, we can see that murder rate, murders, arrest rate, density, l population, DC, and Louisiana are highly statistically significant and can predict arrest well.