Lab 6

Author

Jose Almanza

options(scipen=999)
library(socviz) # use for the round_df() function
load("Murders96_viz.RData")

This chunk loads the data and sets options(scipen=999) so that the very large and very small numbers we will be working with are not printed in scientific notation.

library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test  <- Murders96_viz[-index,]

Here we partition the dataset with a 60/40 split, creating the train and test datasets.
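
As a quick sanity check, the split proportions can be verified directly (a minimal sketch; the exact counts depend on the seed):

# Verify the 60/40 split produced by createDataPartition()
nrow(train) / nrow(Murders96_viz)   # should be close to 0.60
nrow(test)  / nrow(Murders96_viz)   # should be close to 0.40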

library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE)  # (class note --> remove flextable)
            vars    n     mean        sd     min        max      range      se
arrests        1 1320     5.21     18.76    0.00     320.00     320.00    0.52
countyid       2 1320 32550.57  15488.80 1001.00   56043.00   55042.00  426.32
density        3 1320   265.34   1817.83    0.08   54058.77   54058.69   50.03
popul          4 1320 94031.70 223537.24  141.00 3126966.00 3126825.00 6152.66
perc1019       5 1320    15.11      1.88    7.60      24.54      16.94    0.05
perc2029       6 1320    12.05      3.19    5.62      34.75      29.14    0.09
percblack      7 1320     8.34     13.93    0.00      86.28      86.28    0.38
percmale       8 1320    49.33      1.63   44.83      61.64      16.82    0.04
rpcincmaint    9 1320   207.78    116.83   19.14    1281.10    1261.96    3.22
rpcpersinc    10 1320 12569.38   3031.90 5322.20   41094.22   35772.02   83.45
rpcunemins    11 1320    52.90     34.85    0.00     287.10     287.10    0.96
year          12 1320  1996.00      0.00 1996.00    1996.00       0.00    0.00
murders       13 1320     5.88     26.27    0.00     420.00     420.00    0.72
murdrate      14 1320     0.42      0.77    0.00      11.01      11.01    0.02
arrestrate    15 1320     0.48      0.94    0.00      13.99      13.99    0.03
statefips     16 1320    32.45     15.46    1.00      56.00      55.00    0.43
countyfips    17 1320    99.81    109.51    1.00     840.00     839.00    3.01
execs         18 1320     0.01      0.14    0.00       3.00       3.00    0.00
lpopul        19 1320    10.40      1.35    4.95      14.96      10.01    0.04
execrate      20 1320     0.00      0.04    0.00       1.18       1.18    0.00
statename     21 1320      NaN        NA     Inf       -Inf       -Inf      NA
statecode     22 1320      NaN        NA     Inf       -Inf       -Inf      NA
region        23 1320      NaN        NA     Inf       -Inf       -Inf      NA
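
The NaN/Inf rows above come from the character columns (statename, statecode, region), which describe() cannot summarize numerically. A sketch that restricts the summary to numeric columns:

# Restrict describe() to numeric columns to drop the NaN/Inf rows
train %>%
  select(where(is.numeric)) %>%
  psych::describe(fast = TRUE)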

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

These horizontal boxplots show how each numeric variable is distributed, giving a quick view of the center, spread, and any outliers. Each panel is labeled with a variable name, such as ‘arrestrate’, ‘arrests’, or ‘countyfips’. The line inside each box marks the median, the box edges mark the first and third quartiles, and observations falling far beyond the whiskers are drawn as individual dots.

library(DataExplorer)
plot_boxplot(train, by="murders") 

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(train, by="arrests")

Correlation Matrix

plot_correlation(train, type="continuous")

We create a correlation matrix in order to pick the best independent variables for our linear regression model. Since arrests is our dependent variable, we pick density, popul, murders, and lpopul to build the model.
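
The same information can be pulled numerically to double-check the picks (a sketch using only packages already loaded):

# Correlation of every numeric variable with arrests, strongest first
num_cors <- cor(select(train, where(is.numeric)), use = "pairwise.complete.obs")
sort(round(num_cors[, "arrests"], 2), decreasing = TRUE)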

MODEL

Linear Regression Model

Estimate Coefficients and show coefficients table

model <- lm(sqrt(arrests) ~ density + I(density^2) + popul + murders +  lpopul, data=train)


model$coefficients
        (Intercept)             density        I(density^2)               popul 
-4.3098484806007971  0.0000767535012589 -0.0000000001791262  0.0000018954562331 
            murders              lpopul 
 0.0261811777683826  0.5005752188627091 
summary(model)

Call:
lm(formula = sqrt(arrests) ~ density + I(density^2) + popul + 
    murders + lpopul, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-14.3888  -0.6711  -0.1178   0.5220   8.7536 

Coefficients:
                     Estimate       Std. Error t value            Pr(>|t|)    
(Intercept)  -4.3098484806008  0.3050870230274 -14.127 <0.0000000000000002 ***
density       0.0000767535013  0.0000535057856   1.434               0.152    
I(density^2) -0.0000000001791  0.0000000010649  -0.168               0.866    
popul         0.0000018954562  0.0000002250240   8.423 <0.0000000000000002 ***
murders       0.0261811777684  0.0017108349363  15.303 <0.0000000000000002 ***
lpopul        0.5005752188627  0.0303350311774  16.502 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.088 on 1314 degrees of freedom
Multiple R-squared:  0.6768,    Adjusted R-squared:  0.6756 
F-statistic: 550.3 on 5 and 1314 DF,  p-value: < 0.00000000000000022
model %>% as_flextable

             Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)    -4.310           0.305  -14.127    0.0000  ***
density         0.000           0.000    1.434    0.1517
I(density^2)   -0.000           0.000   -0.168    0.8664
popul           0.000           0.000    8.423    0.0000  ***
murders         0.026           0.002   15.303    0.0000  ***
lpopul          0.501           0.030   16.502    0.0000  ***

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 1.088 on 1314 degrees of freedom

Multiple R-squared: 0.6768, Adjusted R-squared: 0.6756

F-statistic: 550.3 on 5 and 1314 DF, p-value: 0.0000

This output is from a linear regression model that predicts the square root of arrests. The intercept is -4.31. Population, the natural log of population, and murders have the largest effects, while density and its square do not significantly impact arrests. The model is statistically significant overall and accounts for 67.7% of the variance.
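
For completeness, interval estimates for these coefficients can be pulled directly (a minimal sketch; output not shown):

# 95% confidence intervals for the model coefficients
round(confint(model), 7)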

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)
     density I(density^2)        popul      murders       lpopul 
   10.537501     8.833180     2.818287     2.249513     1.869026 

The VIFs show strong collinearity between density and its square (10.5 and 8.8), while the remaining predictors sit well below common cutoffs. As for the coefficients: higher density, larger populations (raw and logged), and more murders are all associated with a higher expected square root of arrests, though the raw population effect is slight.
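
The elevated density VIFs are largely structural, since a variable and its square are inevitably correlated. One common remedy, sketched below (not part of the graded model above), is to center density before squaring:

# Center density before squaring; this usually lowers the density/density^2
# VIFs without changing the shape of the fitted curve
train_c <- mutate(train, density_c = density - mean(density))
model_c <- lm(sqrt(arrests) ~ density_c + I(density_c^2) + popul + murders + lpopul,
              data = train_c)
vif(model_c)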

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))
    0%    25%    50%    75%   100% 
-14.39  -0.67  -0.12   0.52   8.75 

These quantiles summarize the residual distribution, showing how far the actual square root of arrests falls from the values the model predicted.
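
The same distribution can be checked visually (a quick sketch):

# Visual check of the residual distribution
hist(residuals(model), breaks = 40,
     main = "Residuals of the sqrt(arrests) model", xlab = "Residual")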

Residual Plots

We are looking for:
- Random scatter of residuals vs. fitted values
- Normally distributed residuals: a Normal Q-Q plot with values along the line
- Homoskedasticity: a horizontal Scale-Location line with no residual pattern
- Minimal influential observations, i.e., none outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests)) 
p  + geom_point()
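
Because .fitted is on the square-root scale, a perfect model would trace y = sqrt(x); overlaying that curve (a sketch) makes deviations easier to judge:

# Overlay the ideal y = sqrt(x) curve on the fitted-vs-actual plot
p + geom_point(alpha = 0.4) +
  stat_function(fun = sqrt, color = "tomato")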

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()
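
A dashed zero line helps judge whether the residuals are centered (a sketch):

# Add a zero reference line to the residuals-vs-fitted plot
p + geom_point(alpha = 0.4) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "tomato")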

Performance Evaluation

Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval = "prediction")  # score test dataset with model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(countyid, arrests, density, popul, murders) %>% 
  as_flextable(show_coltype = FALSE) 

countyid  arrests  density    popul  murders
   1,003        6     77.1  123,023        6
   1,005        1     29.9   26,475        1
   1,009        0     67.2   43,392        2
   1,015       20    186.5  113,511       14
   1,019        2     38.3   21,170        0
   1,021        0     50.9   35,323        0
   1,029        2     24.0   13,445        0
   1,031        8     61.7   41,910        6
   1,041        1     22.2   13,514        0
   1,043        5     99.2   73,274        0

n: 10
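
The heading also asks for the fitted values; a sketch that shows the dependent variable alongside the prediction and its interval bounds (fit, lwr, and upr are the columns predict() returns), rounded with socviz::round_df():

# Dependent variable with fitted values and prediction-interval bounds
test_w_predict %>%
  head(10) %>%
  select(countyid, arrests, fit, lwr, upr) %>%
  round_df(2) %>%
  as_flextable(show_coltype = FALSE)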

Plot Actual vs Fitted (test)

plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit), 4),
             round(rmse(test_w_predict$arrests, test_w_predict$fit), 4),
             round(mape(test_w_predict$arrests, test_w_predict$fit), 4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric     Value
MAE       6.7448
RMSE     53.1457
MAPE         Inf

MAE shows an average deviation of about 6.74 units between predictions and actual values, while RMSE indicates a typical deviation of approximately 53.15 units, inflated by a few large counties. The MAPE is Inf not because of perfect predictions but because some counties have zero actual arrests, which makes the percentage error undefined. This warrants further scrutiny, as shown in the sketch below.
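
Two caveats are worth noting in code. First, fit is on the square-root scale (the model's response is sqrt(arrests)), so squaring it before computing metrics puts predictions back on the arrests scale. Second, MAPE divides by the actual value, so it must be computed on counties with nonzero arrests. A sketch (the resulting numbers would differ from the table above):

# Back-transform predictions to the arrests scale before computing metrics
fit_arrests <- test_w_predict$fit^2
round(mae(test_w_predict$arrests, fit_arrests), 4)
round(rmse(test_w_predict$arrests, fit_arrests), 4)

# MAPE is undefined when the actual value is 0; restrict to nonzero counties
nz <- test_w_predict$arrests > 0
round(mape(test_w_predict$arrests[nz], fit_arrests[nz]), 4)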


ggplot2 explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(sqrt(arrests) ~ murders, data=train)
summary(pip)

Call:
lm(formula = sqrt(arrests) ~ murders, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-20.9192  -0.9549  -0.0555   0.6263   8.6846 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 0.954919   0.038944   24.52 <0.0000000000000002 ***
murders     0.050288   0.001447   34.74 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.381 on 1318 degrees of freedom
Multiple R-squared:  0.4781,    Adjusted R-squared:  0.4777 
F-statistic:  1207 on 1 and 1318 DF,  p-value: < 0.00000000000000022
# knitr does not like autoplot
#autoplot(lm(log(Wage.n) ~ Potential, data=train), data=train, colour = 'Nationality_Continent')


library(ggplot2)
# note: counties with arrests == 0 give log(arrests) = -Inf and are dropped from these plots
p <- ggplot(data=train, mapping=aes(y=log(arrests), x=murders))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", linewidth = 1.2, method = "rqss",  # rqss requires the quantreg package
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

When analyzing and evaluating models, data visualization is essential because it provides insights that go beyond simple numerical summaries. Visualizing the data builds a thorough understanding of the variables, their distributions, and possible relationships. Plotting histograms, scatter plots, or box plots lets analysts see how the data are distributed and shaped, spot outliers, and detect underlying trends. Visualizing correlation or covariance matrices, which display the direction and strength of relationships between variables, helps with feature selection and with identifying multicollinearity during model building. Finally, displaying a model’s performance with graphs such as ROC curves, precision-recall curves, or learning curves gives a clear picture of its predictive power and enables comparison with other models or benchmarks.

END