1 General Information

We would like to create regression analysis report using the dataset provided by Algoritma (crime.csv and crime_rate as target variable).

Data Description

Criminologists are interested in the effect of punishment regimes on crime rates. This has been studied using aggregate data on 47 states of the USA for 1960 given in this data frame. The variables seems to have been re-scaled to convenient numbers.

Data Preparation

First of all, we would like to prepare the data, starting from data explanation, load the library , import, investigate and explore the data.

1.1 Explanation

This data frame contains the following columns:

  • M : percentage of males aged 14–24.

  • So : indicator variable for a Southern state.

  • Ed : mean years of schooling.

  • Po1 : police expenditure in 1960.

  • Po2 : police expenditure in 1959.

  • LF : labour force participation rate.

  • M.F : number of males per 1000 females.

  • Pop : state population.

  • NW : number of non-whites per 1000 people.

  • U1 : unemployment rate of urban males 14–24.

  • U2 : unemployment rate of urban males 35–39.

  • GDP : gross domestic product per head.

  • Ineq : income inequality.

  • Prob : probability of imprisonment.

  • Time : average time served in state prisons.

  • y : rate of crimes in a particular category per head of population.

1.2 Load Library

library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)     # for ggcorr
## Loading required package: ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(MLmetrics)  # for RMSE
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)     # for homoscedasticity
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)        # for no-multicolinearity test
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

1.3 Import the data

We read the crime.csv and rename the columns. After that, we can create a plot to give us the overview of the data , in this case we put crime_rate as the Target variable.

crime <- read.csv("data/crime.csv") %>% select(-X) 
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate") 

plot(crime$crime_rate, crime$police_exp59)

1.4 Investigate

Checking the data structure. In regression, the necessary data is the numeric.

str(crime)
## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

1.5 Explore

Before analyze the data, we can explore the data by creating the histogram. Through it, we can check the spread of the data. And check the outlier using boxplot.

hist(crime$crime_rate)

boxplot(crime$crime_rate)

2 Linear Regression

We would like to evaluate the model using Linear Regression.

2.1 Correlation check

Check the correlation from all the variable in the data.

ggcorr(crime, label = T, label_size = 2.5, hjust = 1)

Check whether there is outlier data or not

boxplot(crime)

Because of police_exp59 and police_exp60 is really related (correlation = 1), we would like to go to Feature engineering

police <- crime$police_exp59 + crime$police_exp60
crime <- cbind(crime,police)

After the Feature engineering, we can remove those 2 columns.

crime <- crime[,-c(4,5)]
head(crime)
##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police
## 1    0.084602     26.2011        791    114
## 2    0.029599     25.2999       1635    198
## 3    0.083401     24.3006        578     89
## 4    0.015801     29.9012       1969    290
## 5    0.041399     21.2998       1234    210
## 6    0.034201     20.9995        682    233

2.2 Create Model

Check the linear regression between variable by creating the model and predict. 1. Create linear model to predict Target: crime_rate based on Predictor = police

model1 <- lm(crime_rate~police,crime)

Predict crime_rate using predict()

crime$pred1 <- predict(object = model1, newdata = data.frame(police = crime$police))
  1. Create linear model to predict Target: crime_rate based on Predictor = police and gdp
model2 <- lm(crime_rate~police + prob_prison,crime)

Create predict crime_rate from model2

crime$pred2 <- predict(object = model2, newdata = data.frame(
  police = crime$police, prob_prison = crime$prob_prison))
  1. Create linear model to predict Target: crime_rate based on Predictor = police, prob_prison and state_pop
model3 <- lm(crime_rate~police + prob_prison + state_pop,crime)

Predict the value of crime_rate using predict() from model 3

crime$pred3 <- predict(object = model3, newdata = data.frame(
  police = crime$police, 
  prob_prison = crime$prob_prison,
  state_pop = crime$state_pop))

2.3 R Squared

R squared is the percentage of the total variability that is explained by the linear relationship with the predictor (Regression Variation / Total Variation).

In other words, R squared can be thought of as a quantity that represents the % of the total variation that’s represented by the model. We simply take the regression variation and divide it by the total variation. In our case, it is the % of the variation in crime_rate that is explained by the regression relationship with others.

Using R-squared itself can be misleading in our assessment of the model fit and this is due to the one of the key limitation of this metric. A model that has too many predictors also tend to overfit and worse, the regression model would “model” the random noise in our data as if they were “features”, hence producing misleading R-squared values.

The adjusted R-squared compares the explanatory power of regression models built with different number of predictors, allowing us to compare a crime_ _rate regression model with multiple variables. The better fit will have bigger adjusted R-squared.

  1. Compare value of r.squared, adj.r.squared, and RMSE-nya from those 3 models
print("R.squared") #seberapa besar pengaruh Prediktor menggambarkan Target
## [1] "R.squared"
summary(model1)$r.squared 
## [1] 0.4604511
summary(model2)$r.squared
## [1] 0.4749003
summary(model3)$r.squared
## [1] 0.4761414
print("Adj R.squared")
## [1] "Adj R.squared"
summary(model1)$adj.r.squared #Adj R2 mau nya sebesar2nya. 
## [1] 0.4484611
summary(model2)$adj.r.squared
## [1] 0.4510321
summary(model3)$adj.r.squared
## [1] 0.4395932
summary(model1)
## 
## Call:
## lm(formula = crime_rate ~ police, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -581.07 -151.94   20.33  147.24  580.59 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 152.0674   128.5331   1.183    0.243    
## police        4.5573     0.7354   6.197 1.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287.2 on 45 degrees of freedom
## Multiple R-squared:  0.4605, Adjusted R-squared:  0.4485 
## F-statistic:  38.4 on 1 and 45 DF,  p-value: 1.591e-07
summary(model2)
## 
## Call:
## lm(formula = crime_rate ~ police + prob_prison, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -594.30 -117.44   19.47  149.21  554.21 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   333.2029   208.6689   1.597    0.117    
## police          4.1228     0.8332   4.948 1.14e-05 ***
## prob_prison -2322.0203  2110.2719  -1.100    0.277    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 286.6 on 44 degrees of freedom
## Multiple R-squared:  0.4749, Adjusted R-squared:  0.451 
## F-statistic:  19.9 on 2 and 44 DF,  p-value: 7.004e-07
summary(model3)
## 
## Call:
## lm(formula = crime_rate ~ police + prob_prison + state_pop, data = crime)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -602.45 -125.50   24.22  154.05  538.40 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   331.7658   210.8798   1.573    0.123    
## police          4.2514     0.9333   4.555 4.28e-05 ***
## prob_prison -2413.7125  2151.4101  -1.122    0.268    
## state_pop      -0.4231     1.3256  -0.319    0.751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 289.5 on 43 degrees of freedom
## Multiple R-squared:  0.4761, Adjusted R-squared:  0.4396 
## F-statistic: 13.03 on 3 and 43 DF,  p-value: 3.454e-06

Based on the result from adjusted R-squared; if the Correlation between Predictor and Target is big enough, the model that has been created not necessarily will be the best one to describe the data.

[1] “Adj R.squared” [1] 0.4484611 [1] 0.4510321 [1] 0.4395932

Those 3 model are not good enough to describe the data because the value of adjusted R-squared is very low. Usually we can use the minimum adjusted R-squared bigger than 0.75

2.4 RMSE

Root Mean Square Error is the error value between the prediction with the Target variable. We have to choose the lowest RMSE.

print("RMSE")
## [1] "RMSE"
RMSE (crime$pred1, y_true = crime$crime_rate)
## [1] 281.0541
RMSE (crime$pred2, y_true = crime$crime_rate)
## [1] 277.2653
RMSE (crime$pred3, y_true = crime$crime_rate)
## [1] 276.9374
  1. Choose the best model The manual method to create the model is not success, so we can use the other method : Stepwise Regression.

2.5 Stepwise Regression

Stepwise regression is a set of methods to fit regression models where the choice of predictive variables is carried out by an automatic procedure. On each step, one variable is considered for addition to or subtraction from the set of variables in the current model - and this process repeats itself until the model cannot be further improved upon.

A common estimator of the relative quality of statistical models is the AIC (Akaike information criterion). The formula of the AIC is beyond the scope of this literature, but simply put the formula rewards goodness of fit while penalizing for number of estimated parameters. Akaike’s work demonstrated how we can estimate the amount of “information loss” from one model to another.

2.5.1 Backward

Backward elimination works by starting with all candidate variables, testing the deletion of each variable using a chosen model fit criterion (AIC or residual sum of squares) and then removing the variable whose absence gives the highest reduction in AIC (biggest improvement in model fit) and repeating this process until no further variables can be removed without a loss of fit.

First of all, we eliminate the predictions that has been created from manual method.

head(crime)
##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police     pred1     pred2     pred3
## 1    0.084602     26.2011        791    114  671.5972  606.7562  598.2588
## 2    0.029599     25.2999       1635    198 1054.4087 1080.7907 1096.6008
## 3    0.083401     24.3006        578     89  557.6653  506.4746  501.2191
## 4    0.015801     29.9012       1969    290 1473.6785 1492.1289 1460.1068
## 5    0.041399     21.2998       1234    210 1109.0961 1102.8647 1117.0203
## 6    0.034201     20.9995        682    233 1213.9135 1214.4033 1229.2149
summary(lm(crime_rate~.,crime[,-c(16,17,18)]))
## 
## Call:
## lm(formula = crime_rate ~ ., data = crime[, -c(16, 17, 18)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.84 -122.87   11.49  117.58  451.50 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6606.4737  1583.4238  -4.172 0.000215 ***
## percent_m                9.0258     4.2226   2.137 0.040298 *  
## is_south                 6.1504   150.5377   0.041 0.967664    
## mean_education          17.3268     6.1901   2.799 0.008612 ** 
## labour_participation    -0.1626     1.4416  -0.113 0.910921    
## m_per1000f               1.9532     2.0562   0.950 0.349278    
## state_pop               -0.7269     1.3066  -0.556 0.581830    
## nonwhites_per1000        0.2059     0.6369   0.323 0.748596    
## unemploy_m24            -5.4736     4.2578  -1.286 0.207824    
## unemploy_m39            17.3475     8.3316   2.082 0.045411 *  
## gdp                      0.9446     1.0503   0.899 0.375170    
## inequality               7.3384     2.2928   3.201 0.003091 ** 
## prob_prison          -4079.3771  2228.7052  -1.830 0.076521 .  
## time_prison             -0.2575     6.8520  -0.038 0.970261    
## police                   4.9475     1.2844   3.852 0.000530 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.8 on 32 degrees of freedom
## Multiple R-squared:  0.7913, Adjusted R-squared:    0.7 
## F-statistic: 8.668 on 14 and 32 DF,  p-value: 2.605e-07

We assign it into new names and create linear model : all. Because for “Backward elimination” , the step starts from all the variabels.

crime2 <- crime[,-c(16,17,18)] 
lm.all <- lm(crime_rate~., crime2)

step(lm.all, direction = "backward")
## Start:  AIC=515.37
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
##     police
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1        63 1435912 513.38
## - is_south              1        75 1435924 513.38
## - labour_participation  1       571 1436420 513.39
## - nonwhites_per1000     1      4689 1440538 513.53
## - state_pop             1     13889 1449738 513.83
## - gdp                   1     36295 1472144 514.55
## - m_per1000f            1     40488 1476337 514.68
## <none>                              1435849 515.37
## - unemploy_m24          1     74155 1510004 515.74
## - prob_prison           1    150328 1586178 518.05
## - unemploy_m39          1    194523 1630372 519.35
## - percent_m             1    205006 1640855 519.65
## - mean_education        1    351558 1787407 523.67
## - inequality            1    459655 1895504 526.43
## - police                1    665754 2101603 531.28
## 
## Step:  AIC=513.38
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1        94 1436006 511.38
## - labour_participation  1       573 1436486 511.40
## - nonwhites_per1000     1      4696 1440608 511.53
## - state_pop             1     15376 1451289 511.88
## - gdp                   1     36439 1472351 512.55
## - m_per1000f            1     42048 1477960 512.73
## <none>                              1435912 513.38
## - unemploy_m24          1     74740 1510652 513.76
## - unemploy_m39          1    196871 1632783 517.42
## - prob_prison           1    203777 1639689 517.61
## - percent_m             1    210715 1646627 517.81
## - mean_education        1    356119 1792032 521.79
## - inequality            1    459600 1895512 524.43
## - police                1    707398 2143310 530.20
## 
## Step:  AIC=511.38
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1      1064 1437071 509.41
## - nonwhites_per1000     1      6303 1442310 509.59
## - state_pop             1     15458 1451464 509.88
## - gdp                   1     39135 1475142 510.64
## - m_per1000f            1     44243 1480249 510.81
## <none>                              1436006 511.38
## - unemploy_m24          1     88715 1524721 512.20
## - unemploy_m39          1    208089 1644096 515.74
## - percent_m             1    212656 1648662 515.87
## - prob_prison           1    217046 1653052 516.00
## - mean_education        1    358370 1794376 519.85
## - inequality            1    544157 1980163 524.48
## - police                1    722777 2158784 528.54
## 
## Step:  AIC=509.41
## crime_rate ~ percent_m + mean_education + m_per1000f + state_pop + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + police
## 
##                     Df Sum of Sq     RSS    AIC
## - nonwhites_per1000  1      5723 1442794 507.60
## - state_pop          1     17753 1454824 507.99
## - gdp                1     38716 1475787 508.66
## - m_per1000f         1     52014 1489085 509.09
## <none>                           1437071 509.41
## - unemploy_m24       1     92066 1529136 510.33
## - unemploy_m39       1    209698 1646769 513.82
## - prob_prison        1    219922 1656993 514.11
## - percent_m          1    228700 1665771 514.36
## - mean_education     1    380957 1818028 518.47
## - inequality         1    543097 1980168 522.48
## - police             1    794558 2231628 528.10
## 
## Step:  AIC=507.6
## crime_rate ~ percent_m + mean_education + m_per1000f + state_pop + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     police
## 
##                  Df Sum of Sq     RSS    AIC
## - state_pop       1     16957 1459751 506.15
## - gdp             1     36113 1478907 506.76
## - m_per1000f      1     47772 1490565 507.13
## <none>                        1442794 507.60
## - unemploy_m24    1     93803 1536596 508.56
## - unemploy_m39    1    217091 1659885 512.19
## - prob_prison     1    218991 1661784 512.24
## - percent_m       1    306832 1749626 514.66
## - mean_education  1    375422 1818216 516.47
## - inequality      1    617614 2060408 522.35
## - police          1   1011638 2454432 530.57
## 
## Step:  AIC=506.15
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     30683 1490434 505.13
## <none>                        1459751 506.15
## - m_per1000f      1     95202 1554953 507.12
## - unemploy_m24    1    105276 1565027 507.42
## - prob_prison     1    202801 1662552 510.26
## - unemploy_m39    1    219084 1678835 510.72
## - percent_m       1    319185 1778936 513.44
## - mean_education  1    373762 1833513 514.87
## - inequality      1    614425 2074176 520.66
## - police          1   1094101 2553852 530.44
## 
## Step:  AIC=505.13
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + inequality + prob_prison + police
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1490434 505.13
## - m_per1000f      1    117843 1608276 506.70
## - unemploy_m24    1    136366 1626799 507.24
## - prob_prison     1    256641 1747074 510.60
## - unemploy_m39    1    271328 1761761 510.99
## - percent_m       1    292207 1782641 511.54
## - mean_education  1    433030 1923464 515.12
## - inequality      1    744855 2235288 522.18
## - police          1   1634672 3125105 537.93
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crime2)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education      m_per1000f  
##      -6541.179           9.258          17.775           2.383  
##   unemploy_m24    unemploy_m39      inequality     prob_prison  
##         -6.294          19.257           6.180       -3858.651  
##         police  
##          5.278

The result is the formula for new model.

back_crime <- lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    police, data = crime2)

Create the prediction from the model back_crime.

crime2$pred_back <- predict(object = back_crime, newdata = data.frame(
  percent_m = crime2$percent_m, 
    mean_education = crime2$mean_education,
    m_per1000f = crime2$m_per1000f, 
    unemploy_m24 = crime2$unemploy_m24, 
    unemploy_m39 = crime2$unemploy_m39, 
    inequality = crime2$inequality,
    prob_prison = crime2$prob_prison, 
    police = crime2$police))

2.5.2 Forward

In the forward step-wise regression, we’re starting from the lm.none model (inequality ~ 1) and searching through models lying in the range between that and the lm.all model (using all candidate variables).

lm.none <- lm(crime_rate~1,crime2)

step(lm.none, scope = list(lower = lm.none, upper = lm.all),direction = "forward")
## Start:  AIC=561.02
## crime_rate ~ 1
## 
##                        Df Sum of Sq     RSS    AIC
## + police                1   3168330 3712597 534.02
## + gdp                   1   1340152 5540775 552.84
## + prob_prison           1   1257075 5623853 553.54
## + state_pop             1    783660 6097267 557.34
## + mean_education        1    717146 6163781 557.85
## + m_per1000f            1    314867 6566061 560.82
## <none>                              6880928 561.02
## + labour_participation  1    245446 6635482 561.32
## + inequality            1    220530 6660397 561.49
## + unemploy_m39          1    216354 6664573 561.52
## + time_prison           1    154545 6726383 561.96
## + is_south              1     56527 6824400 562.64
## + percent_m             1     55084 6825844 562.65
## + unemploy_m24          1     17533 6863395 562.90
## + nonwhites_per1000     1      7312 6873615 562.97
## 
## Step:  AIC=534.02
## crime_rate ~ police
## 
##                        Df Sum of Sq     RSS    AIC
## + inequality            1    759855 2952743 525.26
## + percent_m             1    612966 3099632 527.54
## + m_per1000f            1    260696 3451901 532.60
## + nonwhites_per1000     1    232645 3479952 532.98
## + is_south              1    214795 3497803 533.22
## + gdp                   1    170207 3542390 533.82
## <none>                              3712597 534.02
## + prob_prison           1     99424 3613173 534.75
## + labour_participation  1     86340 3626257 534.92
## + time_prison           1     54647 3657950 535.33
## + unemploy_m39          1     22883 3689715 535.73
## + state_pop             1      2449 3710149 535.99
## + unemploy_m24          1      2269 3710328 535.99
## + mean_education        1      1064 3711533 536.01
## 
## Step:  AIC=525.26
## crime_rate ~ police + inequality
## 
##                        Df Sum of Sq     RSS    AIC
## + mean_education        1    578682 2374061 517.01
## + m_per1000f            1    479766 2472976 518.93
## + prob_prison           1    291301 2661441 522.38
## + labour_participation  1    287949 2664794 522.44
## + gdp                   1    234545 2718197 523.37
## + percent_m             1    176837 2775905 524.36
## <none>                              2952743 525.26
## + state_pop             1    117308 2835435 525.36
## + nonwhites_per1000     1     42677 2910065 526.58
## + is_south              1     41017 2911726 526.60
## + unemploy_m24          1      3597 2949145 527.20
## + time_prison           1      2838 2949905 527.22
## + unemploy_m39          1         4 2952738 527.26
## 
## Step:  AIC=517.01
## crime_rate ~ police + inequality + mean_education
## 
##                        Df Sum of Sq     RSS    AIC
## + prob_prison           1    245632 2128429 513.88
## + percent_m             1    233504 2140557 514.14
## + m_per1000f            1    133872 2240188 516.28
## <none>                              2374061 517.01
## + gdp                   1     95015 2279045 517.09
## + time_prison           1     77849 2296212 517.44
## + unemploy_m39          1     71032 2303029 517.58
## + state_pop             1     32168 2341893 518.37
## + labour_participation  1     14490 2359571 518.72
## + unemploy_m24          1      9465 2364596 518.82
## + nonwhites_per1000     1      2355 2371705 518.96
## + is_south              1      1893 2372167 518.97
## 
## Step:  AIC=513.88
## crime_rate ~ police + inequality + mean_education + prob_prison
## 
##                        Df Sum of Sq     RSS    AIC
## + percent_m             1    257271 1871157 509.82
## + m_per1000f            1    148320 1980109 512.48
## + state_pop             1     96832 2031597 513.69
## <none>                              2128429 513.88
## + is_south              1     64071 2064357 514.44
## + unemploy_m39          1     61658 2066770 514.49
## + nonwhites_per1000     1     43703 2084726 514.90
## + gdp                   1     38070 2090358 515.03
## + unemploy_m24          1      8726 2119703 515.68
## + labour_participation  1      1336 2127092 515.85
## + time_prison           1         3 2128426 515.88
## 
## Step:  AIC=509.82
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m
## 
##                        Df Sum of Sq     RSS    AIC
## + unemploy_m39          1    205935 1665223 506.34
## + gdp                   1    100497 1770660 509.23
## + m_per1000f            1     99365 1771793 509.26
## <none>                              1871157 509.82
## + unemploy_m24          1     57136 1814022 510.36
## + state_pop             1     41383 1829774 510.77
## + is_south              1     17419 1853739 511.38
## + time_prison           1      4963 1866195 511.70
## + nonwhites_per1000     1       245 1870912 511.81
## + labour_participation  1         3 1871154 511.82
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39
## 
##                        Df Sum of Sq     RSS    AIC
## + gdp                   1     69540 1595683 506.34
## <none>                              1665223 506.34
## + unemploy_m24          1     56946 1608276 506.70
## + state_pop             1     45741 1619482 507.03
## + m_per1000f            1     38423 1626799 507.24
## + labour_participation  1     23545 1641678 507.67
## + is_south              1     13773 1651449 507.95
## + time_prison           1      3062 1662160 508.25
## + nonwhites_per1000     1        29 1665194 508.34
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39 + gdp
## 
##                        Df Sum of Sq     RSS    AIC
## <none>                              1595683 506.34
## + state_pop             1     52319 1543364 506.77
## + unemploy_m24          1     40730 1554953 507.12
## + m_per1000f            1     30656 1565027 507.42
## + labour_participation  1     14263 1581420 507.91
## + is_south              1      6372 1589311 508.15
## + time_prison           1      5820 1589863 508.16
## + nonwhites_per1000     1      1206 1594477 508.30
## 
## Call:
## lm(formula = crime_rate ~ police + inequality + mean_education + 
##     prob_prison + percent_m + unemploy_m39 + gdp, data = crime2)
## 
## Coefficients:
##    (Intercept)          police      inequality  mean_education  
##      -5807.418           5.259           8.228          18.104  
##    prob_prison       percent_m    unemploy_m39             gdp  
##      -3388.152          11.279           8.578           1.218

The result is the formula for new model.

for_crime <- lm(formula = crime_rate ~ police + inequality + mean_education + 
    prob_prison + percent_m + unemploy_m39 + gdp, data = crime2)
crime2$pred_for <- predict(object = for_crime, newdata = data.frame(
  police = crime2$police,
  inequality = crime2$inequality,
  mean_education = crime2$mean_education,
  prob_prison = crime2$prob_prison,
  percent_m = crime2$percent_m, 
    unemploy_m39 = crime2$unemploy_m39,
    gdp = crime2$gdp))
print("R.squared") #seberapa besar pengaruh Prediktor menggambarkan Target
## [1] "R.squared"
summary(back_crime)$r.squared 
## [1] 0.7833964
summary(for_crime)$r.squared
## [1] 0.7681006
print("Adj R.squared")
## [1] "Adj R.squared"
summary(back_crime)$adj.r.squared #Adj R2 mau nya sebesar2nya. 
## [1] 0.7377957
summary(for_crime)$adj.r.squared
## [1] 0.7264776
print("RMSE")
## [1] "RMSE"
RMSE (crime2$pred_back, y_true = crime2$crime_rate)
## [1] 178.0768
RMSE (crime2$pred_for, y_true = crime2$crime_rate)
## [1] 184.2572
  1. Student can select variable that is significant to both statistically and business-wise.
  2. Student perform regression linear assumption checking (Linearity, No Multicolinearity, Homoscedasticity, Normality)
  3. Student write the findings and explaining recommendations where appropriate.

3 Assumption Checking

3.1 Linearity Check

ggcorr(crime2, label = T, label_size = 2.5, hjust = 1)

3.2 Normality of Residual

H0: normal distributed residual H1: non-normal distributed residual

# from model lm.all (data mtcars)
hist(lm.all$residuals)

# statistic test for normality
shapiro.test(lm.all$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  lm.all$residuals
## W = 0.98689, p-value = 0.8709

p-value = 0.8709 pvalue > 0.05, we fail to reject H0

3.3 Homoscedasticity

plot(crime2$crime_rate, lm.all$residuals) #liat pola = ga ada pola, tapi ini subjectif, makanya ada tes nya:

#statistic test for homoscedasticity
bptest(lm.all)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm.all
## BP = 13.63, df = 14, p-value = 0.4777

p-value = 0.4777 pvalue > 0.05, we fail to reject H0

3.4 No-Multicolinearity

ggcorr(crime2, label = T, label_size = 2.5, hjust = 1)

vif(lm.all)
##            percent_m             is_south       mean_education 
##             2.887155             5.329859             4.916133 
## labour_participation           m_per1000f            state_pop 
##             3.479376             3.763639             2.536678 
##    nonwhites_per1000         unemploy_m24         unemploy_m39 
##             4.397337             6.040838             5.075819 
##                  gdp           inequality          prob_prison 
##            10.528833             8.577985             2.632507 
##          time_prison               police 
##             2.417413             5.608804

The result of Multicol is gdp = 10.528833, so we have to remove it.

4 Analysis

The best model is:

[1] “R.squared”

[1] 0.7833964

[1] 0.7681006

[1] “Adj R.squared”

** [1] 0.7377957 **

[1] 0.7264776

[1] “RMSE”

** [1] 178.0768 **

[1] 184.2572

From the result of Stepwise Regression, we can chose the best model based on biggest Adj. R Squared and lowest Error. So, we chose model from Backward.

4.1 Model

back_crime <- lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    police, data = crime2)

Based on summary above, we can conclude:

  - Variable percent_m will reflect 9.258 to crime_rate. 
  
  - Variable mean_education will reflect 17.775 to crime_rate.
  
  - Variable m_per1000f will reflect 2.383 to crime_rate.
  
  - Variable unemploy_m24 will reflect -6.294 to crime_rate.
  
  - Variable unemploy_m39 will reflect 19.257 to crime_rate. 
  
  - Variable inequality will reflect 6.180 to crime_rate.
  
  - Variable prob_prison will reflect -3858.651 to crime_rate.
  
  - Variable police will reflect 5.278 to crime_rate.

4.2 Prediction

Create prediction for Backward model and round it to 2 decimals.

crime2$pred_back <- round(predict(object = back_crime, newdata = data.frame(
  percent_m = crime2$percent_m, 
    mean_education = crime2$mean_education,
    m_per1000f = crime2$m_per1000f, 
    unemploy_m24 = crime2$unemploy_m24, 
    unemploy_m39 = crime2$unemploy_m39, 
    inequality = crime2$inequality,
    prob_prison = crime2$prob_prison, 
    police = crime2$police)),2)
summary(back_crime)
## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crime2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475.87 -104.70    0.77  129.36  470.19 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -6541.1790  1210.6873  -5.403 3.75e-06 ***
## percent_m          9.2582     3.3919   2.729  0.00956 ** 
## mean_education    17.7753     5.3496   3.323  0.00198 ** 
## m_per1000f         2.3832     1.3749   1.733  0.09114 .  
## unemploy_m24      -6.2939     3.3755  -1.865  0.06997 .  
## unemploy_m39      19.2570     7.3216   2.630  0.01226 *  
## inequality         6.1800     1.4181   4.358 9.63e-05 ***
## prob_prison    -3858.6505  1508.4712  -2.558  0.01464 *  
## police             5.2781     0.8176   6.456 1.35e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 198 on 38 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7378 
## F-statistic: 17.18 on 8 and 38 DF,  p-value: 1.842e-10
head(crime2)
##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police pred_back  pred_for
## 1    0.084602     26.2011        791    114    736.46  835.1994
## 2    0.029599     25.2999       1635    198   1421.99 1375.7041
## 3    0.083401     24.3006        578     89    401.63  318.3766
## 4    0.015801     29.9012       1969    290   1848.47 1917.0064
## 5    0.041399     21.2998       1234    210   1098.89 1244.8671
## 6    0.034201     20.9995        682    233    738.06  782.9239
crime2$pred_back
##  [1]  736.46 1421.99  401.63 1848.47 1098.89  738.06  794.83 1395.93
##  [9]  688.20  774.85 1203.81  723.39  734.37  791.69  956.87  950.44
## [17]  435.28  796.30 1225.87 1224.23  745.02  671.89  934.06  846.59
## [25]  610.89 1900.19  324.81 1189.19 1378.45  709.35  468.40  763.37
## [33]  888.00 1006.66  720.11 1118.20 1023.12  575.46  782.88 1113.65
## [41]  753.98  335.90 1090.38 1195.31  569.00  769.93 1112.67

From the prediction, we can get the value : crime_rate pred_back 791 736.46 1635 1421.00 578 401.63

4.3 Simulation

We will try to make better model from Backward by adding 1 more predictor and review the Adj.R-Squared and RMSE

back_crime_2 <- lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    police + state_pop , data = crime2)

crime2$pred_back_2 <- round(predict(object = back_crime_2, newdata = data.frame(
  percent_m = crime2$percent_m, 
    mean_education = crime2$mean_education,
    m_per1000f = crime2$m_per1000f, 
    unemploy_m24 = crime2$unemploy_m24, 
    unemploy_m39 = crime2$unemploy_m39, 
    inequality = crime2$inequality,
    prob_prison = crime2$prob_prison, 
    police = crime2$police, 
  state_pop = crime2$state_pop)),2)
print("R.squared")
## [1] "R.squared"
summary(back_crime)$r.squared 
## [1] 0.7833964
summary(back_crime_2)$r.squared
## [1] 0.7850716
print("Adj R.squared")
## [1] "Adj R.squared"
summary(back_crime)$adj.r.squared 
## [1] 0.7377957
summary(back_crime_2)$adj.r.squared
## [1] 0.7327918
print("RMSE")
## [1] "RMSE"
RMSE (crime2$pred_back, y_true = crime2$crime_rate)
## [1] 178.0765
RMSE (crime2$pred_back_2, y_true = crime2$crime_rate)
## [1] 177.3869

The result of the new model: Adj R.squared decrease and RMSE increase.

My suggestion are:

1. If we want better model which reflected the data, we have to chose back_crime

2. If we want better accuracy which have minimum error, we have to chose back_crime_2

Thank you (c) Meilinie