1 Introduction

In the following LBB, i would like to see linearity model for crime rate from crime dataset that collected in US on 1960 (source: Algoritma modul).

2 Data Preparing and Processing

2.1 Libraries and Data Input

Loading Library and Data Input

library(dplyr)
crime <- read.csv("data_input/crime.csv") %>% 
dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

crime

Remarks of the dataset contents/variables:
- percent_m: percentage of males aged 14-24 - is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
- mean_education: mean years of schooling
- police_exp60: police expenditure in 1960
- police_exp59: police expenditure in 1959 - labour_participation: labour force participation rate
- m_per1000f: number of males per 1000 females
- state_pop: state population
- nonwhites_per1000: number of non-whites resident per 1000 people
- unemploy_m24: unemployment rate of urban males aged 14-24
- unemploy_m39: unemployment rate of urban males aged 35-39
- gdp: gross domestic product per head
- inequality: income inequality
- prob_prison: probability of imprisonment
- time_prison: avg time served in prisons
- crime_rate: crime rate in an unspecified category

2.2 Initial Model

Create initial model

For the first step, i create linear regression model with crime_rate as the target, and the remaining variables in the dataset as predictors.

crime_rate <- lm(crime_rate~., data=crime)

2.3 Variabel Selection

In this step, i use stepwise regression with backward method to the initial model as automate variable selection in order to make the new best fit model.

step(crime_rate, direction = 'backward')

## Start:  AIC=514.65
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
##     police_exp59 + labour_participation + m_per1000f + state_pop + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1        29 1354974 512.65
## - labour_participation  1      8917 1363862 512.96
## - time_prison           1     10304 1365250 513.00
## - state_pop             1     14122 1369068 513.14
## - nonwhites_per1000     1     18395 1373341 513.28
## - m_per1000f            1     31967 1386913 513.74
## - gdp                   1     37613 1392558 513.94
## - police_exp59          1     37919 1392865 513.95
## <none>                              1354946 514.65
## - unemploy_m24          1     83722 1438668 515.47
## - police_exp60          1    144306 1499252 517.41
## - unemploy_m39          1    181536 1536482 518.56
## - percent_m             1    193770 1548716 518.93
## - prob_prison           1    199538 1554484 519.11
## - mean_education        1    402117 1757063 524.86
## - inequality            1    423031 1777977 525.42
## 
## Step:  AIC=512.65
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     time_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1     10341 1365315 511.01
## - labour_participation  1     10878 1365852 511.03
## - state_pop             1     14127 1369101 511.14
## - nonwhites_per1000     1     21626 1376600 511.39
## - m_per1000f            1     32449 1387423 511.76
## - police_exp59          1     37954 1392929 511.95
## - gdp                   1     39223 1394197 511.99
## <none>                              1354974 512.65
## - unemploy_m24          1     96420 1451395 513.88
## - police_exp60          1    144302 1499277 515.41
## - unemploy_m39          1    189859 1544834 516.81
## - percent_m             1    195084 1550059 516.97
## - prob_prison           1    204463 1559437 517.26
## - mean_education        1    403140 1758114 522.89
## - inequality            1    488834 1843808 525.13
## 
## Step:  AIC=511.01
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1     10533 1375848 509.37
## - nonwhites_per1000     1     15482 1380797 509.54
## - state_pop             1     21846 1387161 509.75
## - police_exp59          1     28932 1394247 509.99
## - gdp                   1     36070 1401385 510.23
## - m_per1000f            1     41784 1407099 510.42
## <none>                              1365315 511.01
## - unemploy_m24          1     91420 1456735 512.05
## - police_exp60          1    134137 1499452 513.41
## - unemploy_m39          1    184143 1549458 514.95
## - percent_m             1    186110 1551425 515.01
## - prob_prison           1    237493 1602808 516.54
## - mean_education        1    409448 1774763 521.33
## - inequality            1    502909 1868224 523.75
## 
## Step:  AIC=509.37
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison
## 
##                     Df Sum of Sq     RSS    AIC
## - nonwhites_per1000  1     11675 1387523 507.77
## - police_exp59       1     21418 1397266 508.09
## - state_pop          1     27803 1403651 508.31
## - m_per1000f         1     31252 1407100 508.42
## - gdp                1     35035 1410883 508.55
## <none>                           1375848 509.37
## - unemploy_m24       1     80954 1456802 510.06
## - police_exp60       1    123896 1499744 511.42
## - unemploy_m39       1    190746 1566594 513.47
## - percent_m          1    217716 1593564 514.27
## - prob_prison        1    226971 1602819 514.54
## - mean_education     1    413254 1789103 519.71
## - inequality         1    500944 1876792 521.96
## 
## Step:  AIC=507.77
## crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
##     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
##     inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - police_exp59    1     16706 1404229 506.33
## - state_pop       1     25793 1413315 506.63
## - m_per1000f      1     26785 1414308 506.66
## - gdp             1     31551 1419073 506.82
## <none>                        1387523 507.77
## - unemploy_m24    1     83881 1471404 508.52
## - police_exp60    1    118348 1505871 509.61
## - unemploy_m39    1    201453 1588976 512.14
## - prob_prison     1    216760 1604282 512.59
## - percent_m       1    309214 1696737 515.22
## - mean_education  1    402754 1790276 517.74
## - inequality      1    589736 1977259 522.41
## 
## Step:  AIC=506.33
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - state_pop       1     22345 1426575 505.07
## - gdp             1     32142 1436371 505.39
## - m_per1000f      1     36808 1441037 505.54
## <none>                        1404229 506.33
## - unemploy_m24    1     86373 1490602 507.13
## - unemploy_m39    1    205814 1610043 510.76
## - prob_prison     1    218607 1622836 511.13
## - percent_m       1    307001 1711230 513.62
## - mean_education  1    389502 1793731 515.83
## - inequality      1    608627 2012856 521.25
## - police_exp60    1   1050202 2454432 530.57
## 
## Step:  AIC=505.07
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     26493 1453068 503.93
## <none>                        1426575 505.07
## - m_per1000f      1     84491 1511065 505.77
## - unemploy_m24    1     99463 1526037 506.24
## - prob_prison     1    198571 1625145 509.20
## - unemploy_m39    1    208880 1635455 509.49
## - percent_m       1    320926 1747501 512.61
## - mean_education  1    386773 1813348 514.35
## - inequality      1    594779 2021354 519.45
## - police_exp60    1   1127277 2553852 530.44
## 
## Step:  AIC=503.93
## crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1453068 503.93
## - m_per1000f      1    103159 1556227 505.16
## - unemploy_m24    1    127044 1580112 505.87
## - prob_prison     1    247978 1701046 509.34
## - unemploy_m39    1    255443 1708511 509.55
## - percent_m       1    296790 1749858 510.67
## - mean_education  1    445788 1898855 514.51
## - inequality      1    738244 2191312 521.24
## - police_exp60    1   1672038 3125105 537.93

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education    police_exp60  
##      -6426.101           9.332          18.012          10.265  
##     m_per1000f    unemploy_m24    unemploy_m39      inequality  
##          2.234          -6.087          18.735           6.133  
##    prob_prison  
##      -3796.032

Insight: From the stepwise backward method, i get 8 variables as predictors (percent_m, mean_education, police_exp60,m_per1000f, unemploy_m24, unemploy_m39, inequality, prob_prison) that have possibility influence to the target (crime_rate).

Create model based on automate variable selection (based on result of backward stepswise regression point above), and checking the residuals leverage.

crime_rate_model <- lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
    m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime)

plot(crime_rate_model)

Insight: Based on plot cook’s distance, since there is no data on or above cook’s distance range, so there is no indication for high leverage, therefore the model is significant.

2.4 Inspect Adjusted R-squared Value and Influence/Significance Predictors

Checking the adjusted R-squared & significance of the predictor. After i get the new model after variable’s automate selection, i inspect the Adjusted R-Squared & p-value to see the significance of the new model.

summary(crime_rate_model)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
##     data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.70 -111.07    3.03  122.15  483.30 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6426.101   1194.611  -5.379 4.04e-06 ***
## percent_m          9.332      3.350   2.786  0.00828 ** 
## mean_education    18.012      5.275   3.414  0.00153 ** 
## police_exp60      10.265      1.552   6.613 8.26e-08 ***
## m_per1000f         2.234      1.360   1.642  0.10874    
## unemploy_m24      -6.087      3.339  -1.823  0.07622 .  
## unemploy_m39      18.735      7.248   2.585  0.01371 *  
## inequality         6.133      1.396   4.394 8.63e-05 ***
## prob_prison    -3796.032   1490.646  -2.547  0.01505 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 195.5 on 38 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
## F-statistic: 17.74 on 8 and 38 DF,  p-value: 1.159e-10

Insight: From above checking (summary), we get the result as follow: - Adjusted R-squared 74.44%, which can indicate that the model quite significant. - Significant predictors (predictors which can influence the crime rate value): police_exp60, inequality, percent_m, mean_education, unemploy_39, prob_prison. - Insignificant predictors: m_per100f, unemploy_24.

3 Linear Assumption Test

To see whether the linear model is significant, i do another checking to support the result of automate variable’s selection, Adjusted R-Squared, and p-value by doing regression linear assumption test.

3.1 Linearity

In Linearity, the model can be say as significant model if the p-value of the predictors <0.05, it means the target and predictor have significant correlation.

cor.test(crime$percent_m,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$percent_m and crime$crime_rate
## t = -0.60262, df = 45, p-value = 0.5498
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3672044  0.2029078
## sample estimates:
##        cor 
## -0.0894724

cor.test(crime$mean_education,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$mean_education and crime$crime_rate
## t = 2.2882, df = 45, p-value = 0.02688
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.03931263 0.55824793
## sample estimates:
##       cor 
## 0.3228349

cor.test(crime$police_exp60,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$police_exp60 and crime$crime_rate
## t = 6.3527, df = 45, p-value = 9.338e-08
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4989611 0.8140343
## sample estimates:
##       cor 
## 0.6876045

cor.test(crime$m_per1000f,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$m_per1000f and crime$crime_rate
## t = 1.469, df = 45, p-value = 0.1488
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.0780469  0.4720815
## sample estimates:
##       cor 
## 0.2139143

cor.test(crime$unemploy_m24,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$unemploy_m24 and crime$crime_rate
## t = -0.33905, df = 45, p-value = 0.7362
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3328203  0.2401703
## sample estimates:
##         cor 
## -0.05047792

cor.test(crime$unemploy_m39,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$unemploy_m39 and crime$crime_rate
## t = 1.2087, df = 45, p-value = 0.2331
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1157397  0.4419815
## sample estimates:
##       cor 
## 0.1773206

cor.test(crime$inequality,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$inequality and crime$crime_rate
## t = -1.2206, df = 45, p-value = 0.2286
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.4433957  0.1140040
## sample estimates:
##        cor 
## -0.1790237

cor.test(crime$prob_prison,crime$crime_rate)

## 
##  Pearson's product-moment correlation
## 
## data:  crime$prob_prison and crime$crime_rate
## t = -3.1715, df = 45, p-value = 0.00273
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6364680 -0.1598792
## sample estimates:
##        cor 
## -0.4274222

Insight: Based on Linearity test result above, there are predictors which have p-value > 0.05 : percent_m, m_per1000f, unemploy_m24, unemploy_m39, and inequality. Therefore, the linear model “crime_rate_model” is insignificant and/or less significant based on correlation between target and predictors.

3.2 Normality

In Normality, to check the significant level of the model by shapiro test. If the p-value >0.05, the model is significant.

shapiro.test(crime_rate_model$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  crime_rate_model$residuals
## W = 0.98511, p-value = 0.8051

Insight: Based on the shapiro test above, the p-value is 0.8051 (p-value>0.05) –> ‘crime_rate_model’ is significant.

3.3 Multicollinearity

Multicollinearity In Multicollinearity, to check the significant level of the model by VIF (Variance Inflation Factor). If the VIF < 10, the model is significant.

library(car)

## Warning: package 'car' was built under R version 3.5.2

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(crime_rate_model)

##      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
##       2.131963       4.189684       2.560496       1.932367       4.360038 
##   unemploy_m39     inequality    prob_prison 
##       4.508106       3.731074       1.381879

Insightt: based on the VIF test above, the VIF of all the predictors < 10 –> crime_rate_model significant.

3.4 Homoscedasticity

In Homoscedasticity, to check the significant level of the model by ‘bptest’. If the p-value > 0.05, the model is significant.

library(lmtest)

## Warning: package 'lmtest' was built under R version 3.5.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.5.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(crime_rate_model)

## 
##  studentized Breusch-Pagan test
## 
## data:  crime_rate_model
## BP = 13.51, df = 8, p-value = 0.09546

Insight: Based on the bptest result, the p-value is 0.09546 (p-value>0.05) –> ‘crime_rate_model’ is significant

3.5 Conclusion

The intial conclusion that i can take from ‘crime_rate_model’ are as follow:

crime_rate_model pass normality, multicollinearity, and homoscedasticity test, but not with linearity test, it means there is indication non linear correlation.
There are 3 criteria of predictor based on the regression test above:

Significant predictors: police_exp60, mean_education, prob_prison.
Insignificant predictors: m_per100f, unemploy_24.
Grey predictors:
1. inequality: summary check -> p-value: 63e-05 *** ; Linearity test -> p-value: 0.2286
2. unemploy_m39: summary check -> p-value: 0.01371 * ; Linearity test -> p-value: 0.2331
3. percent_m: summary check -> p-value: 0.00828 **: Linearity test -> p-value: 0.5498
note:
- Signif. codes from summary check: 0 *** 0.001 ** 00.01 * 0.05 ‘.’ 0.1 ‘’ 1 , the higher of the ’*’, the more significant the predictor/ have influence to crime rate value.
- Linearity test, p-value > 0.05 = the target and predictor doesn’t have significant correlation.
- grey predictors means that the predictor could be significan or have enough/high influence to crime rate value, and could be not, since the result of the test are opposites one of another (in this case between the summary check and linearity check).

Recommendations:

We need to check further the linear correlation between target (‘crime_rate’) and the predictors on point-c above, whether it has big influence for crime rate value or less influence. If the influence seems low, we could eliminate the predictors, but if the influence seems medium/quite high, we better to keep the predictor as part of the model.
Change the regression model which more best fit to get the crime rate value.

Crime Rate Regression Model

Regita Anggriani

August 7, 2019