1 Objective

Pada kesempatan kali ini, kita akan melakukan memprediksi data Crime Rate dari data kriminalitas yang ada di Amerika pada tahun 1960. Prediksi akan dilakukan dengan metode regresi linear. Data yang akan digunakan adalah dataset yang telah disediakan oleh platform Kaggle pada link dibawah berikut

Dataset:Crime

2 Preparation

2.1 Import Library

library(dplyr)
library(GGally)
library(MLmetrics)
library(car)
library(lmtest)

2.2 Read Data

Input data

crime <- read.csv("data_input/crime.csv") %>% 
  select(-X) # buang kolom X karena hanya nomor baris
head(crime)

karena variable masih dalam singkatan, akan kita ubah agar lebih mudah

# rename kolom agar lebih deskriptif
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)

Deskripsi kolom/variable:

  • percent_m: The number of males of age 14-24 per 1000 population
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: Mean # of years of schooling x 10 for persons of age 25 or older
  • police_exp60: 1960 per capita expenditure on police by state and local government
  • police_exp59: 1959 per capita expenditure on police by state and local government
  • labour_participation: Labor force participation rate per 1000 civilian urban males age 14-24
  • m_per1000f: nThe number of males per 1000 females
  • state_pop: State population size in hundred thousands
  • nonwhites_per1000: number of non-whites resident per 1000 people
  • unemploy_m24: Unemployment rate of urban males per 1000 of age 14-24
  • unemploy_m39: Unemployment rate of urban males per 1000 of age 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: avg time served in prisons
  • crime_rate: crime rate in an unspecified category

2.3 Cek data

Cek tipe data & ubah tipe data yang belum sesuai

str(crime)
#> 'data.frame':    47 obs. of  16 variables:
#>  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
#>  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
#>  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
#>  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
#>  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
#>  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
#>  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
#>  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
#>  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
#>  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
#>  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
#>  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
#>  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
#>  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
#>  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
#>  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...
crime <- crime %>% mutate(is_south = as.factor(is_south))
str(crime)
#> 'data.frame':    47 obs. of  16 variables:
#>  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
#>  $ is_south            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#>  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
#>  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
#>  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
#>  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
#>  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
#>  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
#>  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
#>  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
#>  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
#>  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
#>  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
#>  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
#>  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
#>  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

3 Cross-Validation & Model fitting

3.1 Split data train & test

Selanjutnya akan kita lakukan split data untuk membagi data train & juga data test

RNGkind(sample.kind = "Rounding")
set.seed(222)

#spli data train & test
index <- sample(nrow(crime), nrow(crime)*0.8)
crime_train <- crime[index,]
crime_test <- crime[-index,]

3.2 Pembuatan model

3.2.1 Model menggunakan seluruh variable

Dalam hal ini, kita akan melakukan prediksi terhadap variable crime_rate menggunakan seluruh variable

model_crime <- lm(crime_rate~.,data = crime_train)
summary(model_crime)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -310.91 -117.87    1.72  105.58  447.31 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)          -3593.2482  1851.4189  -1.941   0.0658 .
#> percent_m               14.0620     5.4571   2.577   0.0176 *
#> is_south1               14.0423   158.0050   0.089   0.9300  
#> mean_education          18.2434     7.9205   2.303   0.0316 *
#> police_exp60             6.5722    12.1225   0.542   0.5934  
#> police_exp59             6.1773    13.3735   0.462   0.6489  
#> labour_participation     1.4084     2.3318   0.604   0.5523  
#> m_per1000f              -1.7021     3.0878  -0.551   0.5873  
#> state_pop               -0.7578     1.4727  -0.515   0.6122  
#> nonwhites_per1000        0.1740     0.6872   0.253   0.8026  
#> unemploy_m24            -5.1227     5.1321  -0.998   0.3296  
#> unemploy_m39            21.7769    11.3853   1.913   0.0695 .
#> gdp                     -0.4605     1.1399  -0.404   0.6903  
#> inequality               5.6984     2.5441   2.240   0.0361 *
#> prob_prison          -8295.4333  3580.3621  -2.317   0.0307 *
#> time_prison             -8.7791     8.3117  -1.056   0.3029  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 198.3 on 21 degrees of freedom
#> Multiple R-squared:  0.8613, Adjusted R-squared:  0.7621 
#> F-statistic:  8.69 on 15 and 21 DF,  p-value: 6.542e-06

Berdasarkan hasil diatas, jika diperhatikan, tidak semua variable berperan signifikan terhadap target variable yakni crime_rate. Karena itu perlu kita cek terlebih dahulu bagaimana korelasi dari masing-masing prediktor terhadap targetnya

Berdasarkan hasil plot korelasi diatas, kita akan membuat 3 model yakni sebagai berikut: 1. Model regresi linear menggunakan seluruh variable sebagai prediktor (model_crime) 2. Model regresi linear menggunakan variable exp59 saja sebagai prediktor (model_crime_2) 3. Model regresi linear menggunakan stepwise regression (backward) (model_crime_3)

3.2.2 Model menggunakan 1 variable

model_crime_2 <- lm(crime_rate~police_exp59, crime_train)
plot(crime_train$police_exp59, crime_train$crime_rate)
abline(model_crime_2, col = "red", lty = 2)

summary(model_crime_2)
#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp59, data = crime_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -717.74 -128.62   20.12  185.08  489.07 
#> 
#> Coefficients:
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -59.691    139.754  -0.427    0.672    
#> police_exp59   12.691      1.714   7.406 1.15e-08 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 257.4 on 35 degrees of freedom
#> Multiple R-squared:  0.6105, Adjusted R-squared:  0.5993 
#> F-statistic: 54.85 on 1 and 35 DF,  p-value: 1.151e-08

3.2.3 Model menggunakan stepwise regression

model_crime_3 <- step(object = model_crime, 
                       direction = 'backward')
#> Start:  AIC=402.5
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     police_exp59 + labour_participation + m_per1000f + state_pop + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1       311  826388 400.51
#> - nonwhites_per1000     1      2520  828598 400.61
#> - gdp                   1      6419  832496 400.79
#> - police_exp59          1      8393  834470 400.87
#> - state_pop             1     10415  836493 400.96
#> - police_exp60          1     11562  837640 401.01
#> - m_per1000f            1     11953  838031 401.03
#> - labour_participation  1     14351  840429 401.14
#> - unemploy_m24          1     39193  865271 402.22
#> - time_prison           1     43885  869963 402.42
#> <none>                               826077 402.50
#> - unemploy_m39          1    143913  969991 406.44
#> - inequality            1    197345 1023422 408.43
#> - mean_education        1    208692 1034769 408.83
#> - prob_prison           1    211167 1037244 408.92
#> - percent_m             1    261197 1087274 410.67
#> 
#> Step:  AIC=400.51
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - nonwhites_per1000     1      3965  830354 398.69
#> - gdp                   1      6275  832663 398.79
#> - police_exp59          1      8171  834560 398.88
#> - state_pop             1     10106  836495 398.96
#> - police_exp60          1     11615  838004 399.03
#> - m_per1000f            1     11992  838380 399.05
#> - labour_participation  1     16479  842868 399.24
#> - time_prison           1     45225  871613 400.49
#> <none>                               826388 400.51
#> - unemploy_m24          1     53405  879793 400.83
#> - unemploy_m39          1    153679  980067 404.82
#> - mean_education        1    212453 1038841 406.98
#> - prob_prison           1    212639 1039027 406.99
#> - inequality            1    217143 1043532 407.15
#> - percent_m             1    262234 1088622 408.71
#> 
#> Step:  AIC=398.69
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - gdp                   1      7918  838272 397.04
#> - state_pop             1      9833  840186 397.13
#> - police_exp60          1     10201  840555 397.14
#> - police_exp59          1     12055  842409 397.22
#> - m_per1000f            1     12266  842620 397.23
#> - labour_participation  1     14621  844974 397.34
#> - time_prison           1     41557  871911 398.50
#> <none>                               830354 398.69
#> - unemploy_m24          1     50103  880457 398.86
#> - unemploy_m39          1    150067  980420 402.84
#> - mean_education        1    208837 1039190 404.99
#> - prob_prison           1    228633 1058987 405.69
#> - inequality            1    250287 1080640 406.44
#> - percent_m             1    265735 1096088 406.96
#> 
#> Step:  AIC=397.04
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + unemploy_m24 + 
#>     unemploy_m39 + inequality + prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - police_exp59          1      9823  848095 395.47
#> - police_exp60          1     11959  850230 395.57
#> - state_pop             1     14105  852376 395.66
#> - labour_participation  1     15627  853899 395.73
#> - m_per1000f            1     16419  854691 395.76
#> - time_prison           1     39819  878090 396.76
#> - unemploy_m24          1     44007  882279 396.94
#> <none>                               838272 397.04
#> - unemploy_m39          1    142720  980991 400.86
#> - mean_education        1    203510 1041781 403.08
#> - prob_prison           1    232932 1071204 404.11
#> - percent_m             1    285303 1123575 405.88
#> - inequality            1    490705 1328977 412.09
#> 
#> Step:  AIC=395.47
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
#>     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + inequality + 
#>     prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1      9630  857725 393.89
#> - state_pop             1     13455  861550 394.06
#> - m_per1000f            1     17054  865149 394.21
#> <none>                               848095 395.47
#> - unemploy_m24          1     48395  896490 395.53
#> - time_prison           1     50409  898504 395.61
#> - unemploy_m39          1    139146  987241 399.09
#> - prob_prison           1    238546 1086640 402.64
#> - mean_education        1    260169 1108264 403.37
#> - percent_m             1    276597 1124692 403.92
#> - inequality            1    522093 1370188 411.22
#> - police_exp60          1    916631 1764726 420.59
#> 
#> Step:  AIC=393.89
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     state_pop + unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
#>     time_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - m_per1000f      1      7695  865419 392.22
#> - state_pop       1      8274  865999 392.25
#> - time_prison     1     41103  898828 393.62
#> <none>                         857725 393.89
#> - unemploy_m24    1     60420  918145 394.41
#> - unemploy_m39    1    130472  988197 397.13
#> - prob_prison     1    229020 1086744 400.65
#> - mean_education  1    279982 1137706 402.34
#> - percent_m       1    340490 1198215 404.26
#> - inequality      1    519862 1377586 409.42
#> - police_exp60    1    913013 1770737 418.71
#> 
#> Step:  AIC=392.22
#> crime_rate ~ percent_m + mean_education + police_exp60 + state_pop + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
#>     time_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - state_pop       1      3290  868710 390.36
#> - time_prison     1     35430  900849 391.71
#> <none>                         865419 392.22
#> - unemploy_m24    1     85250  950669 393.70
#> - unemploy_m39    1    142409 1007829 395.86
#> - prob_prison     1    221674 1087094 398.66
#> - percent_m       1    333821 1199241 402.29
#> - mean_education  1    334088 1199508 402.30
#> - inequality      1    585816 1451235 409.35
#> - police_exp60    1   1021500 1886919 419.06
#> 
#> Step:  AIC=390.36
#> crime_rate ~ percent_m + mean_education + police_exp60 + unemploy_m24 + 
#>     unemploy_m39 + inequality + prob_prison + time_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - time_prison     1     47231  915941 390.32
#> <none>                         868710 390.36
#> - unemploy_m24    1     83686  952396 391.77
#> - unemploy_m39    1    140035 1008744 393.89
#> - prob_prison     1    221670 1090379 396.77
#> - mean_education  1    333975 1202685 400.40
#> - percent_m       1    367702 1236411 401.42
#> - inequality      1    589726 1458436 407.53
#> - police_exp60    1   1072386 1941096 418.11
#> 
#> Step:  AIC=390.32
#> crime_rate ~ percent_m + mean_education + police_exp60 + unemploy_m24 + 
#>     unemploy_m39 + inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> <none>                         915941 390.32
#> - unemploy_m24    1     54103  970044 390.44
#> - unemploy_m39    1    105500 1021441 392.35
#> - prob_prison     1    188253 1104194 395.24
#> - percent_m       1    322844 1238785 399.49
#> - mean_education  1    389657 1305598 401.44
#> - inequality      1    594487 1510428 406.83
#> - police_exp60    1   1300243 2216184 421.01
summary(model_crime_3)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison, data = crime_train)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -310.06  -93.39  -19.63  102.32  375.43 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)    -4741.956    921.972  -5.143 1.70e-05 ***
#> percent_m         11.170      3.494   3.197 0.003343 ** 
#> mean_education    18.884      5.376   3.512 0.001475 ** 
#> police_exp60      12.247      1.909   6.416 5.12e-07 ***
#> unemploy_m24      -4.499      3.437  -1.309 0.200876    
#> unemploy_m39      15.570      8.519   1.828 0.077910 .  
#> inequality         6.296      1.451   4.338 0.000159 ***
#> prob_prison    -4814.913   1972.205  -2.441 0.020968 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 177.7 on 29 degrees of freedom
#> Multiple R-squared:  0.8462, Adjusted R-squared:  0.809 
#> F-statistic: 22.79 on 7 and 29 DF,  p-value: 3.537e-10

Evaluasi model pada seen data (data train)

crime_pred <- data.frame(crime_train$crime_rate)
crime_pred$pred_all <- predict(model_crime, crime_train)
crime_pred$pred_backward <- predict(model_crime_3, crime_train)
crime_pred$pred_1 <- predict(model_crime_2, crime_train)
head(crime_pred)

3.3 Evaluasi Model

Setelah kita membuat 2 model, selanjutnya akan kita bandingkan hasil dari ketiga model tersebut menggunakan Mean Absolute Percentage Error agar eror yang dihasilkan berupa persentase

MAPE(y_pred = crime_pred$pred_all , y_true = crime_train$crime_rate)*100
#> [1] 15.54173
MAPE(y_pred = crime_pred$pred_backward , y_true = crime_train$crime_rate)*100
#> [1] 17.36276
MAPE(y_pred = crime_pred$pred_1 , y_true = crime_train$crime_rate)*100
#> [1] 22.82917

Berdasarkan hasil diatas, dapat diranking berdasarkan 1. Model dengan seluruh variable 2. Model stepwise regression 3. Model 1 variable

3.4 Asumption

Meskipun model telah dibuat, terdapat beberapa uji asumsi linearitas yang harus diuji yakni sebagai berikut

3.4.1 Linearity

Pada uji linearitas diharapkan residual berada disekitar titik 0 yang menandakan variable prediktor memiliki hubungan lurus dengan target nya. Berdasarkan hasil di bawah, ketiga model tidak terlalu membentuk garis lurus ideal, namun residunya masih sebagian besar berada di titik 0

plot(model_crime, which = 1)

plot(model_crime_2, which = 1)

plot(model_crime_3, which = 1)

3.4.2 Homoscedacity

Pada uji homoscedacity, diharapkan error yang dihasilkan oleh model menyebar secara acak. Untuk uji asumsi ini kita akan menggunakan Breusch-Pagan hypothesis test dimana p-value harus > daripada 0.05

bptest(model_crime)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_crime
#> BP = 18.332, df = 15, p-value = 0.2456
bptest(model_crime_2)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_crime_2
#> BP = 6.3493, df = 1, p-value = 0.01174
bptest(model_crime_3)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_crime_3
#> BP = 8.3745, df = 7, p-value = 0.3007

Berdasarkan hasil diatas, hanya model_crime (model dengan seluruh variable) & model_crime_3 (model dengan stepwise regression) yang memenuhi uji asumsi homoscedacity

3.4.3 Normality of Residuals

Pada uji asumsi ini, diharapkan error yang muncul dengan distribusi normal sehingga eror lebih banyak pada nilai 0. Kita akan lihat distribusinya menggunakan histogram terlebih dahulu

hist(model_crime$residuals)

hist(model_crime_2$residuals)

hist(model_crime_3$residuals)

Dari hasil di atas, ketiganya memiliki histogram eror yang terdistribusi normal, selanjutnya kita akan uji menggunakan Shaphiro test dimana p-value harus > 0.05

# shapiro test dari residual
shapiro.test(model_crime$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_crime$residuals
#> W = 0.96588, p-value = 0.3081
shapiro.test(model_crime_2$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_crime_2$residuals
#> W = 0.94479, p-value = 0.0656
shapiro.test(model_crime_3$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_crime_3$residuals
#> W = 0.9832, p-value = 0.8372

melalui shaphiro test, ketiga model memiliki nilai p-value > 0.05 sehingga dapat dikatakan ketiganya memiliki eror yang terdistribusi normal

3.4.4 No Multicollinearity

Selanjutnya perlu kita uji multicollinearity dimana antar prediktor tidak boleh memiliki keterkaitan. Pengujian ini dilakukan dengan metode Variance Inflation Factor, dimana nilainya tidak boleh >10

vif(model_crime)
#>            percent_m             is_south       mean_education 
#>             3.200306             5.145879             6.801945 
#>         police_exp60         police_exp59 labour_participation 
#>            97.150366           102.604695             8.327902 
#>           m_per1000f            state_pop    nonwhites_per1000 
#>             7.261454             2.466355             4.377014 
#>         unemploy_m24         unemploy_m39                  gdp 
#>             7.057757             7.325475            10.583290 
#>           inequality          prob_prison          time_prison 
#>             8.924597             5.218725             3.601374
vif(model_crime_3)
#>      percent_m mean_education   police_exp60   unemploy_m24   unemploy_m39 
#>       1.633647       3.903302       2.999648       3.943235       5.108470 
#>     inequality    prob_prison 
#>       3.616152       1.972181

Terlihat pada model_crime terdapat multikolinearitas dimana varable police_exp59 dan police_exp60 memiliki keterkaitan nilainya dan ini menandakan kurang baik

4 Evaluation

4.1 Testing on unseen data

Setelah kita menguji model yang telah dibuat, selanjutnya akan diuji pada data test, yakni data yang belum pernah dilihat oleh model sebelumnya

crime_pred_test <- data.frame(crime_test$crime_rate)
crime_pred_test$pred_all <- predict(model_crime, crime_test)
crime_pred_test$pred_backward <- predict(model_crime_3, crime_test)
crime_pred_test$pred_1 <- predict(model_crime_2, crime_test)
head(crime_pred_test)

4.2 Eror Evaluation

MAPE(y_pred = crime_pred_test$pred_all , y_true = crime_test$crime_rate)*100
#> [1] 28.95856
MAPE(y_pred = crime_pred_test$pred_backward , y_true = crime_test$crime_rate)*100
#> [1] 21.02302
MAPE(y_pred = crime_pred_test$pred_1 , y_true = crime_test$crime_rate)*100
#> [1] 51.92493

Berdasarkan nilai eror dari ketiga model. Ketiganya menghasilkan eror yang masih cukup tinggil dimana nilainya >15%, namun jika dibandingkan hasilnya, dapat diurutkan dari eror terkecil sebagai berikut:

  1. Model_crime_3 (Model dengan stepwise regression)
  2. Model_crime (Model dengan seluruh variable)
  3. Model_crime_2 (Model dengan 1 variable saja)

4.3 Conclusion

Berdasarkan hasil pengujian, dapat disimpulkan bahwa stepwise regression untuk analisa linear regression bekerja paling baik untuk memprediksi crime rate pada data crime hal ini dikarenakan stepwise regression mengeliminasi variable - variable prediktor yang dapat menimbulkan informational loss sehingga tidak harus menggunakan seluruh variable sebagai prediktor.