library(GGally) # Visualisasi (korelasi)

## Loading required package: ggplot2

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(dplyr) # Data Cleansing

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(MLmetrics) # Metrics Error

## Warning: package 'MLmetrics' was built under R version 4.2.2

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

library(lmtest) # Uji Asumsi

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car) # Uji Asumsi

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(rsample)

## Warning: package 'rsample' was built under R version 4.2.2

1 Load Data

crime <- read.csv("crime.csv")
crime <- crime[,-1]
names(crime) <- c("percent_m",
                  "is_south",
                  "mean_education",
                  "police_exp60",
                  "police_exp59",
                  "labour_participation",
                  "m_per1000f",
                  "state_pop",
                  "nonwhites_per1000",
                  "unemploy_m24",
                  "unemploy_m39",
                  "gdp",
                  "inequality",
                  "prob_prison",
                  "time_prison",
                  "crime_rate")

plot(crime$state_pop, crime$inequality)

2 Data Preprocessing

Before making a model, we have to make sure that the variables’ types are already correct.

str(crime)

## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Data Description:

percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: avg time served in prisons
crime_rate: crime rate in an unspecified category Produce a linear model

Then, we will check if any missing values in the dataset

colSums(is.na(crime))

##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

The result, we don’t find any missing value in the crime data set

3 Explanatory Data Analysis

Among all variables within our crime dataset, there is a crime_rate variable that describes the measure of crime rate of the United States in 1960 from different states. We want to see each variable’s correlation with crime_rate variable.

ggcorr(crime, label = T, label_size = 2.5, hjust = 1, layout.exp = 2)

On the result of the previous chunk, we can see that not all variables have correlation with crime_rate variable. The variables that have the highest correlation with crime_rate are police_exp60 and police_exp59 with 0.7 correlation.

4 Modelling

From the data exploration process and the correlation information of our dataset, we want to build a simple linear model using one of the highly correlated variables: police_exp59.

model_crime <- lm(crime_rate ~ police_exp59, crime)
summary(model_crime)

## 
## Call:
## lm(formula = crime_rate ~ police_exp59, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -595.58 -156.76   12.29  146.74  593.74 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   165.164    130.427   1.266    0.212    
## police_exp59    9.222      1.537   6.001 3.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 291.4 on 45 degrees of freedom
## Multiple R-squared:  0.4445, Adjusted R-squared:  0.4322 
## F-statistic: 36.01 on 1 and 45 DF,  p-value: 3.114e-07

The R-squared of model_crime approximates 0.4445, which ideally needs to be improved upon by, example, add more predictor variables. We are going to use one of the techniques for selecting predictor variables which is using stepwise regression algorithm.

model_full <- lm(crime_rate ~ ., crime)
model_step <- step(model_full, direction = "backward", trace = 0) 
summary(model_step)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Step-wise regression method will produce an optimum formula based on the lowest value of AIC. After doing the step-wise regression method, compared to the first model that only use the police_exp59 variable, the regression model with percent_m, mean_education, police_exp60, m_per1000f, unemploy_m24, unemploy_m39, inequality, and prob_prison adjusted R-squared is 0.7444, higher than the first model which is 0.4322.

5 Prediction

crime_test <- read.csv("crime_test.csv")
pred_crime1 <- predict(model_crime, crime_test)
pred_crime2 <- predict(model_step, crime_test)

Mean Absolute Percentage Error (MAPE) shows the difference between the prediction and actual number in percentage. We are going to compare the mean absolute percentage error between the first model and the second model.

MAPE(pred_crime1, crime_test$crime_rate)

## [1] 0.3046554

MAPE(pred_crime2, crime_test$crime_rate)

## [1] 0.2339692

First model MAPE 0.3046554 means that by average the difference with the actual number is 30% and for the second model the difference with the actual number is 23%. We can conclude that the second model is better because it is lower than the first one.

6 Assumption

6.1 Shapiro Test for Normality test

First Model

shapiro.test(model_crime$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_crime$residuals
## W = 0.977, p-value = 0.4748

Second Model

shapiro.test(model_step$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  model_step$residuals
## W = 0.98511, p-value = 0.8051

The Shapiro Test above, we can conclude that the error is distributed normally for the first and second model because the P-value is higher than 0.05 which are 0.4748 and 0.8051

6.2 Breusch-Pagan Test

bptest(model_crime)

## 
##  studentized Breusch-Pagan test
## 
## data:  model_crime
## BP = 22.918, df = 1, p-value = 1.69e-06

The Breusch-Pagan Test above, we can conclude that there is heteroscesdasticity on the first model because the P-value is lower than 0.05 which is 0.00000169. For the second model we conclude that there is homocesdasticity because the P-value is above 0.05, it means the error is constant.

6.3 Variance Inflation Factor (Multicollinearity)

For the first model, we cannot use the multicollienarity test because it only has one variable. Therefore we only present the multicollinenarity test for the second model.

vif(model_step)

##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

Based on the Variance Inflation Factor Test above, we conclude that multicollinearity is not present in our model because the VIF values for all variables are below 10.

7 Conclusion and Recommendation

We can conclude that the second model or model_step which use the stepwise-regression method is better than the first model with only one variable that has the highest positive correlation with crime_rate variable. The reasons are:

The Mean Absolute Percentage Error (MAPE) for model_step is higher than model_crime by 7% (23% and 30%)
Based on the Shapiro Test, the error is distributed normally for both model but the P-value for model_step is higher. (0.8051 > 0.4748)
Based on the Breusch-Pagan Test, there is heteroscesdasticity on the first model so the error is not constant and for the second model there is homocesdasticity because the P-value is above 0.05, it means the error is constant.
Based on the Variance Inflation Factor Test above, multicollinearity is not present in the model_step model because the VIF values for all variables are below 10.

But, we also find the better point of the first model model_crime which is it has a lower number of adjusted R square than the second one (0.4322 < 0.7444)

LBB REGRESSION Furqan

Furqan Akbar Nizar

26 November, 2022