Lab 6

Author

Kate Shields

DATA

options(scipen=999)

library(socviz)
load("Murders96_viz.RData")

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test  <- Murders96_viz[-index,]

Training set number of observations: 1320
Test dataset number of observations: 877

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE)

            vars    n     mean        sd     min        max      range      se
arrests        1 1320     5.21     18.76    0.00     320.00     320.00    0.52
countyid       2 1320 32550.57  15488.80 1001.00   56043.00   55042.00  426.32
density        3 1320   265.34   1817.83    0.08   54058.77   54058.69   50.03
popul          4 1320 94031.70 223537.24  141.00 3126966.00 3126825.00 6152.66
perc1019       5 1320    15.11      1.88    7.60      24.54      16.94    0.05
perc2029       6 1320    12.05      3.19    5.62      34.75      29.14    0.09
percblack      7 1320     8.34     13.93    0.00      86.28      86.28    0.38
percmale       8 1320    49.33      1.63   44.83      61.64      16.82    0.04
rpcincmaint    9 1320   207.78    116.83   19.14    1281.10    1261.96    3.22
rpcpersinc    10 1320 12569.38   3031.90 5322.20   41094.22   35772.02   83.45
rpcunemins    11 1320    52.90     34.85    0.00     287.10     287.10    0.96
year          12 1320  1996.00      0.00 1996.00    1996.00       0.00    0.00
murders       13 1320     5.88     26.27    0.00     420.00     420.00    0.72
murdrate      14 1320     0.42      0.77    0.00      11.01      11.01    0.02
arrestrate    15 1320     0.48      0.94    0.00      13.99      13.99    0.03
statefips     16 1320    32.45     15.46    1.00      56.00      55.00    0.43
countyfips    17 1320    99.81    109.51    1.00     840.00     839.00    3.01
execs         18 1320     0.01      0.14    0.00       3.00       3.00    0.00
lpopul        19 1320    10.40      1.35    4.95      14.96      10.01    0.04
execrate      20 1320     0.00      0.04    0.00       1.18       1.18    0.00
statename     21 1320      NaN        NA     Inf       -Inf       -Inf      NA
statecode     22 1320      NaN        NA     Inf       -Inf       -Inf      NA
region        23 1320      NaN        NA     Inf       -Inf       -Inf      NA

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="region")

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="arrests")

Correlation Matrix

plot_correlation(train, type="continuous")

The variables “statefips” and “countyid” are perfectly correlated, which is interesting. Murders and arrests are highly correlated, which makes sense. Population and arrests are also highly correlated.

MODEL

Linear Regression Model

Estimate Coefficients and show coefficients table

train$arrest.n <- as.numeric(train$arrests)

library(dplyr)

train <- train %>%
  mutate(reduced_statename = case_when(
    statename %in% c("DC", "Louisiana") ~ statename,
    TRUE ~ "AOther"))


model <- lm((arrest.n) ~  murdrate + murders + arrestrate + percblack + percmale + density + lpopul + execrate + popul + region + reduced_statename, data=train)


model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	-10.936	5.943	-1.840	0.0660	.
murdrate	-2.297	0.306	-7.499	0.0000	***
murders	0.631	0.011	56.395	0.0000	***
arrestrate	3.273	0.232	14.094	0.0000	***
percblack	-0.015	0.015	-1.041	0.2980
percmale	0.023	0.110	0.213	0.8314
density	0.002	0.000	14.781	0.0000	***
lpopul	0.962	0.171	5.643	0.0000	***
execrate	1.219	4.093	0.298	0.7659
popul	0.000	0.000	2.565	0.0104	*
regionNortheast	-0.929	0.614	-1.513	0.1305
regionSouth	0.364	0.430	0.846	0.3975
regionWest	0.768	0.523	1.469	0.1420
reduced_statenameDC	-251.140	6.942	-36.177	0.0000	***
reduced_statenameLouisiana	5.169	1.302	3.970	0.0001	***
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 5.83 on 1305 degrees of freedom
Multiple R-squared: 0.9044, Adjusted R-squared: 0.9034
F-statistic: 882.2 on 1305 and 14 DF, p-value: 0.0000

DC and Louisiana were the only states that were statistically significant. Arrest rate, murders, and murder rate were also highly significant.

Coefficient Magnitude Plot

library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

DC has a coefficient value of about -250, which indicates that the variable has a strong negative influence on arrests. This means that there is an inverse relationship between being from DC and getting arrested.

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

                      GVIF Df GVIF^(1/(2*Df))
murdrate          2.181668  1        1.477047
murders           3.351321  1        1.830661
arrestrate        1.835757  1        1.354901
percblack         1.668520  1        1.291712
percmale          1.250334  1        1.118184
density           1.337722  1        1.156599
lpopul            2.058492  1        1.434745
execrate          1.057285  1        1.028244
popul             3.384809  1        1.839785
region            1.824704  3        1.105432
reduced_statename 1.517300  2        1.109859

The murders variable had a VIF of 3.35 and the population variable was 3.38, which are both fairly high and indicate some degree of multicollinearity.

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

     0%     25%     50%     75%    100% 
-55.970  -1.020  -0.135   0.800 119.400

The residual range shows the smallest and largest residuals, as well as the first, second, and third quartiles.

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.std.resid, x=.fitted))
p  + geom_point()

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

test <- test %>%
  mutate(reduced_statename = case_when(
    statename %in% c("DC", "Louisiana") ~ statename,
    TRUE ~ "AOther"))

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(arrests, fit, lwr,upr) %>% 
  as_flextable(show_coltype = FALSE)

arrests	fit	lwr	upr
6	6.4	-5.0	17.9
1	0.8	-10.7	12.3
0	1.3	-10.2	12.7
20	13.9	2.4	25.4
2	3.3	-8.2	14.7
0	0.7	-10.8	12.1
2	4.6	-6.9	16.1
8	7.5	-3.9	19.0
1	1.8	-9.7	13.2
5	4.0	-7.5	15.4
n: 10

Plot Actual vs Fitted (`test`)

plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit),4),
                         round(rmse(test_w_predict$arrests, test_w_predict$fit),4),
                         round(mape(test_w_predict$arrests, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	3.2259
RMSE	26.3791
MAPE	Inf

The Mean Absolute Error was 3.2259, which means that on average, the model’s predictions are off by that amount from the actual values. The RSME suggests that the typical error of your model’s predictions is approximately 26.3791 units. The MAPE was Inf, which suggests that some of the actual values are 0, so when we divide by 0 to calculate the percentage error, we get infinity.

Model Fit by Continent

# more in depth plots of performance 
p <- ggplot(data = subset(test_w_predict, region %in% c("South", "West")),
            aes(x = arrests,
                y = fit, ymin = lwr, ymax = upr,
                color = region,
                fill = region,
                group = region))
p +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = .2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="Regions",
              caption="countymurders{wooldridge}") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

`ggplot2` explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(arrests ~ region, data=train)
summary(pip)


Call:
lm(formula = arrests ~ region, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-11.689  -5.557  -1.857  -1.557 308.311 

Coefficients:
                Estimate Std. Error t value     Pr(>|t|)    
(Intercept)       1.8568     0.8989   2.066      0.03905 *  
regionNortheast   9.8321     1.8323   5.366 0.0000000951 ***
regionSouth       3.6997     1.1941   3.098      0.00199 ** 
regionWest        5.1432     1.5849   3.245      0.00120 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.55 on 1316 degrees of freedom
Multiple R-squared:  0.02407,   Adjusted R-squared:  0.02184 
F-statistic: 10.82 on 3 and 1316 DF,  p-value: 0.0000005048

library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=arrests, x=region))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

There is one notable outlier in the northeast.

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

Using visualization is crucial for multiple regression models like this. Because there are so many predictor variables, it can be hard to understand the relationships between them solely through data output. Visualization helps us to identify trends, patterns, and outliers in the data set which aren’t always obvious unless we look at graphs or charts.Through this model, we can see that murder rate, murders, arrest rate, density, l population, DC, and Louisiana are highly statistically significant and can predict arrest well.