Lab 6
options(scipen=999)
library(socviz) # used for the round_df() function
load("Murders96_viz.RData")
This chunk loads the data and sets options(scipen=999) so that results are not displayed in scientific notation.
library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test <- Murders96_viz[-index,]
Here we partition the data: 60% of the observations go to the training set and the remaining 40% to the test set.
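A quick sanity check on the split (a sketch; createDataPartition samples within quantiles of the outcome, so the proportions should be close to 60/40):
nrow(train); nrow(test)
nrow(train) / nrow(Murders96_viz) # should be approximately 0.6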
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) # (class note --> remove flextable)
vars n mean sd min max range se
arrests 1 1320 5.21 18.76 0.00 320.00 320.00 0.52
countyid 2 1320 32550.57 15488.80 1001.00 56043.00 55042.00 426.32
density 3 1320 265.34 1817.83 0.08 54058.77 54058.69 50.03
popul 4 1320 94031.70 223537.24 141.00 3126966.00 3126825.00 6152.66
perc1019 5 1320 15.11 1.88 7.60 24.54 16.94 0.05
perc2029 6 1320 12.05 3.19 5.62 34.75 29.14 0.09
percblack 7 1320 8.34 13.93 0.00 86.28 86.28 0.38
percmale 8 1320 49.33 1.63 44.83 61.64 16.82 0.04
rpcincmaint 9 1320 207.78 116.83 19.14 1281.10 1261.96 3.22
rpcpersinc 10 1320 12569.38 3031.90 5322.20 41094.22 35772.02 83.45
rpcunemins 11 1320 52.90 34.85 0.00 287.10 287.10 0.96
year 12 1320 1996.00 0.00 1996.00 1996.00 0.00 0.00
murders 13 1320 5.88 26.27 0.00 420.00 420.00 0.72
murdrate 14 1320 0.42 0.77 0.00 11.01 11.01 0.02
arrestrate 15 1320 0.48 0.94 0.00 13.99 13.99 0.03
statefips 16 1320 32.45 15.46 1.00 56.00 55.00 0.43
countyfips 17 1320 99.81 109.51 1.00 840.00 839.00 3.01
execs 18 1320 0.01 0.14 0.00 3.00 3.00 0.00
lpopul 19 1320 10.40 1.35 4.95 14.96 10.01 0.04
execrate 20 1320 0.00 0.04 0.00 1.18 1.18 0.00
statename 21 1320 NaN NA Inf -Inf -Inf NA
statecode 22 1320 NaN NA Inf -Inf -Inf NA
region 23 1320 NaN NA Inf -Inf -Inf NA
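The NaN/Inf entries for statename, statecode, and region appear because those columns are character variables; a small sketch that limits the summary to numeric columns:
train %>%
  select(where(is.numeric)) %>%
  psych::describe(fast = TRUE)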
### Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")
These horizontal boxplots show how each numeric variable is distributed, giving a quick view of the center, spread, and any outliers. Each panel is labeled with its variable name (e.g., arrestrate, arrests, countyfips). The line inside each box marks the median, the box spans the interquartile range, the whiskers extend to the most extreme non-outlying values, and points beyond the whiskers are potential outliers.
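To put a number on what the boxplots flag, one can apply the 1.5 * IQR rule to a single variable; a sketch using arrests (upper_fence is an illustrative name):
q <- quantile(train$arrests, c(0.25, 0.75))
upper_fence <- q[2] + 1.5 * IQR(train$arrests)
sum(train$arrests > upper_fence) # counties flagged as high-end outliers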
library(DataExplorer)
plot_boxplot(train, by="murders")
### Histograms
plot_histogram(train)
### Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="arrests")
### Correlation Matrix
plot_correlation(train, type="continuous")
We create a correlation matrix to help pick the best independent variables for the linear regression model. Since arrests is our dependent variable, we select density, popul, murders, and lpopul as predictors.
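The same screening can be done numerically by ranking each numeric column's correlation with arrests (a sketch; num_train is an illustrative name):
num_train <- select(train, where(is.numeric))
# arrests itself appears first with correlation 1; year has zero variance,
# so its correlation is NA and is dropped by sort()
sort(cor(num_train)[, "arrests"], decreasing = TRUE)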
## MODEL
### Linear Regression Model
Estimate the coefficients and show the coefficients table.
model <- lm(sqrt(arrests) ~ density + I(density^2) + popul + murders + lpopul, data=train)
model$coefficients
        (Intercept)             density        I(density^2)               popul
-4.3098484806007971 0.0000767535012589 -0.0000000001791262 0.0000018954562331
murders lpopul
0.0261811777683826 0.5005752188627091
summary(model)
Call:
lm(formula = sqrt(arrests) ~ density + I(density^2) + popul +
murders + lpopul, data = train)
Residuals:
Min 1Q Median 3Q Max
-14.3888 -0.6711 -0.1178 0.5220 8.7536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.3098484806008 0.3050870230274 -14.127 <0.0000000000000002 ***
density 0.0000767535013 0.0000535057856 1.434 0.152
I(density^2) -0.0000000001791 0.0000000010649 -0.168 0.866
popul 0.0000018954562 0.0000002250240 8.423 <0.0000000000000002 ***
murders 0.0261811777684 0.0017108349363 15.303 <0.0000000000000002 ***
lpopul 0.5005752188627 0.0303350311774 16.502 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.088 on 1314 degrees of freedom
Multiple R-squared: 0.6768, Adjusted R-squared: 0.6756
F-statistic: 550.3 on 5 and 1314 DF, p-value: < 0.00000000000000022
model %>% as_flextable()

|              | Estimate | Standard Error | t value | Pr(>\|t\|) |     |
|--------------|----------|----------------|---------|------------|-----|
| (Intercept)  | -4.310   | 0.305          | -14.127 | 0.0000     | *** |
| density      | 0.000    | 0.000          | 1.434   | 0.1517     |     |
| I(density^2) | -0.000   | 0.000          | -0.168  | 0.8664     |     |
| popul        | 0.000    | 0.000          | 8.423   | 0.0000     | *** |
| murders      | 0.026    | 0.002          | 15.303  | 0.0000     | *** |
| lpopul       | 0.501    | 0.030          | 16.502  | 0.0000     | *** |

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05
Residual standard error: 1.088 on 1314 degrees of freedom
Multiple R-squared: 0.6768, Adjusted R-squared: 0.6756
F-statistic: 550.3 on 5 and 1314 DF, p-value: 0.0000
This output is from a linear regression model that predicts the square root of arrests. The intercept is -4.31. Population, murders, and the natural log of population (lpopul) have the largest and most significant effects, while density and its square are not statistically significant. The model as a whole is statistically significant and accounts for 67.7% of the variance.
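To complement the t-tests, confidence intervals for the coefficients can be inspected directly (a sketch):
confint(model, level = 0.95) # intervals that cross zero match the non-significant terms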
### Check for predictor independence
Using Variance Inflation Factors (VIF):
library(car)
vif(model)
     density I(density^2)        popul      murders       lpopul
10.537501 8.833180 2.818287 2.249513 1.869026
The VIFs for density (10.5) and I(density^2) (8.8) are high, which is expected when a variable and its square appear in the same model; the remaining predictors show acceptable levels of multicollinearity. Turning to the coefficients: higher density is associated with a higher expected square root of arrests, population has a positive but small effect, more murders in an area correlate with a higher expected square root of arrests, and larger log population (lpopul) does as well.
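A common remedy for the density collinearity is to center the variable before squaring, which reduces the correlation between the linear and quadratic terms without changing the fitted values. A sketch (density_c and model_c are illustrative names, not used elsewhere in the lab):
train$density_c <- train$density - mean(train$density)
model_c <- lm(sqrt(arrests) ~ density_c + I(density_c^2) + popul + murders + lpopul,
              data = train)
car::vif(model_c) # VIFs for the density terms should drop substantially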
## Residual Analysis
### Residual Range
quantile(round(residuals(model, type = "deviance"), 2))
    0%    25%    50%    75%   100%
-14.39 -0.67 -0.12 0.52 8.75
The given quantiles shed light on the residuals’ distribution and show how the real square root of arrests differs from the values the model predicted.
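A histogram of the residuals gives a fuller view of their distribution than the quantiles alone (a sketch):
hist(residuals(model), breaks = 50,
     main = "Residuals of the sqrt(arrests) model", xlab = "Residual")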
### Residual Plots
We are looking for:
- Random scatter of residuals vs. fitted values
- Normally distributed residuals: Normal Q-Q plot with points along the line
- Homoskedasticity: a horizontal Scale-Location line with no pattern in the residuals
- Minimal influential observations, i.e., none outside the Cook's distance boundaries
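In addition to the plot(model) diagnostics that follow, the largest Cook's distances can be listed directly (a sketch; cooks is an illustrative name):
cooks <- cooks.distance(model)
head(sort(cooks, decreasing = TRUE), 5) # rows with the largest influence on the fit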
plot(model)
### Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests))
p + geom_point()
### Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()
## Performance Evaluation
Use the model to score the test dataset, then display the first 10 rows (selected columns only).
pred <- predict(model, newdata = test, interval="predict") # score test dataset with model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(countyid, arrests, density, popul,murders ) %>%
as_flextable(show_coltype = FALSE)

| countyid | arrests | density | popul   | murders |
|----------|---------|---------|---------|---------|
| 1,003    | 6       | 77.1    | 123,023 | 6       |
| 1,005    | 1       | 29.9    | 26,475  | 1       |
| 1,009    | 0       | 67.2    | 43,392  | 2       |
| 1,015    | 20      | 186.5   | 113,511 | 14      |
| 1,019    | 2       | 38.3    | 21,170  | 0       |
| 1,021    | 0       | 50.9    | 35,323  | 0       |
| 1,029    | 2       | 24.0    | 13,445  | 0       |
| 1,031    | 8       | 61.7    | 41,910  | 6       |
| 1,041    | 1       | 22.2    | 13,514  | 0       |
| 1,043    | 5       | 99.2    | 73,274  | 0       |

n: 10
### Plot Actual vs Fitted (test)
plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)
### Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit),4),
round(rmse(test_w_predict$arrests, test_w_predict$fit),4),
round(mape(test_w_predict$arrests, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

| Metric | Value   |
|--------|---------|
| MAE    | 6.7448  |
| RMSE   | 53.1457 |
| MAPE   | Inf     |
MAE shows an average deviation of about 6.74 units between predictions and actual values, while the much larger RMSE of about 53.15 indicates that a few large errors dominate. The MAPE is Inf not because of perfect predictions but because some counties have zero actual arrests, which makes the percentage error undefined; MAPE is therefore not a reliable metric for this data.
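A further caveat: the model was fit on sqrt(arrests), so predict() returns values on the square-root scale, while the metrics above compare them to raw arrest counts. A sketch of back-transforming before evaluating (fit_arrests and nonzero are illustrative names; zero-arrest counties are dropped from MAPE since their percentage error is undefined):
test_w_predict$fit_arrests <- test_w_predict$fit^2 # square to return to the count scale
mae(test_w_predict$arrests, test_w_predict$fit_arrests)
rmse(test_w_predict$arrests, test_w_predict$fit_arrests)
nonzero <- test_w_predict$arrests > 0
mape(test_w_predict$arrests[nonzero], test_w_predict$fit_arrests[nonzero])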
## ggplot2 explore (Intro Section)
This is just a replication of the first section of Chapter 6 (FYI).
pip <- lm(sqrt(arrests) ~ murders, data=train)
summary(pip)
Call:
lm(formula = sqrt(arrests) ~ murders, data = train)
Residuals:
Min 1Q Median 3Q Max
-20.9192 -0.9549 -0.0555 0.6263 8.6846
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.954919 0.038944 24.52 <0.0000000000000002 ***
murders 0.050288 0.001447 34.74 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.381 on 1318 degrees of freedom
Multiple R-squared: 0.4781, Adjusted R-squared: 0.4777
F-statistic: 1207 on 1 and 1318 DF, p-value: < 0.00000000000000022
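Since this single-predictor model is nested in the full model fit earlier (both use the training data and the same sqrt(arrests) response), an F-test can compare them directly (a sketch):
anova(pip, model) # does the full model significantly improve on murders alone?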
# knitr does not like autoplot
#autoplot(lm(log(Wage.n) ~ Potential, data=train), data=train, colour = 'Nationality_Continent')
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(arrests), x=murders))
p + geom_point(alpha = 0.2) +
  geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
  geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
  geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
  geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
              formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
  geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                lambda = 1, quantiles = c(0.20, 0.5, 0.85))

When analyzing and evaluating models, data visualization is essential because it provides insights that go beyond numerical summaries. Visualizing the data builds a thorough understanding of the variables, their distributions, and their potential relationships. Histograms, scatter plots, and box plots show how the data are shaped and distributed, letting analysts spot outliers and underlying trends. Visualizing correlation or covariance matrices reveals the direction and strength of relationships between variables, which aids feature selection and helps identify multicollinearity during model building. Finally, displaying a model's performance with graphs such as ROC curves, precision-recall curves, or learning curves gives a clear picture of its predictive power and enables comparison with other models or benchmarks.