1 Introduction

This analysis uses the crime.csv dataset, with crime_rate chosen as the target variable for the regression models.

2 Data Exploration

2.1 Data Overview

The columns of the data frame are as follows:

M : Percentage of Males Aged 14–24.
So : Indicator variable for a Southern state.
Ed : Mean Years of Schooling.
Po1 : Police Expenditure in 1960.
Po2 : Police Expenditure in 1959.
LF : Labour Force Participation Rate.
M.F : Number of Males per 1000 Females.
Pop : State Population.
NW : Number of Non-whites per 1000 people.
U1 : Unemployment Rate of Urban Males 14–24.
U2 : Unemployment Rate of Urban Males 35–39.
GDP : Gross Domestic Product per Head.
Ineq : Income Inequality.
Prob : Probability of Imprisonment.
Time : Average Time Served in State Prisons.
y : Rate of Crimes in a Particular Category per Head of Population.

2.2 Load Library

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(GGally)

## Loading required package: ggplot2

## 
## Attaching package: 'GGally'

## The following object is masked from 'package:dplyr':
## 
##     nasa

library(MLmetrics)

## 
## Attaching package: 'MLmetrics'

## The following object is masked from 'package:base':
## 
##     Recall

2.3 Input Data

crime <- read.csv("crime.csv") %>% 
  select(-X) 
names(crime) <- c("percent_m", "is_south", "mean_education", 
  "police_exp60", "police_exp59", "labour_participation", 
  "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", 
  "unemploy_m39", "gdp", "inequality", "prob_prison", 
  "time_prison", "crime_rate")

Plot the data as a scatter plot of police_exp59 against crime_rate.

plot(crime$police_exp59, crime$crime_rate)

2.4 Data Structure

First check the structure of the dataset. For regression modeling, the variables should be numeric.

str(crime)

## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

2.5 Data Exploration

Before going further, we explore the dataset with a histogram, the correlations between variables, and a boxplot to look for outliers.

hist(crime$crime_rate)

boxplot(crime$crime_rate)
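
The outliers that a boxplot draws as separate points can also be listed numerically. The sketch below uses boxplot.stats() from base R on a small made-up vector (the values are illustrative and not taken from crime.csv); the same call could be applied to crime$crime_rate.

```r
# Hypothetical values for illustration only; 100 is an obvious outlier
x <- c(1, 2, 3, 4, 5, 100)

# boxplot.stats() returns in $out the points lying more than
# 1.5 times the box length beyond the hinges
outliers <- boxplot.stats(x)$out
print(outliers)
```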

3 Linear Model

Before building a linear model, we first check the correlation between crime_rate and the other variables in the dataset.

ggcorr(crime, label = T, hjust =1)

boxplot(crime)

The boxplot shows outliers, and the correlation plot shows that police_exp59 and police_exp60 have the strongest positive correlation with crime_rate, at about 0.7 each. We still need to check whether these two predictors are strongly related to each other. If they are, one of them should be dropped (or the two combined) so that the redundancy does not distort the model. In this analysis we apply feature engineering: the two columns are summed into a single police column.

police <- crime$police_exp59 + crime$police_exp60
crime <- cbind(crime,police)

crime <- crime[,-c(4,5)]
head(crime)

##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police
## 1    0.084602     26.2011        791    114
## 2    0.029599     25.2999       1635    198
## 3    0.083401     24.3006        578     89
## 4    0.015801     29.9012       1969    290
## 5    0.041399     21.2998       1234    210
## 6    0.034201     20.9995        682    233

The feature engineering step produced the combined police column, and the two original columns, police_exp59 and police_exp60, were removed.
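
Whether two predictors are redundant can be checked with their pairwise correlation and a variance inflation factor (VIF). The sketch below uses synthetic data standing in for the two police expenditure columns (exp_a and exp_b are hypothetical names); on the real data, cor(crime$police_exp59, crime$police_exp60) would have to be run before the columns are dropped.

```r
set.seed(1)
# Synthetic stand-ins for two nearly identical predictors (assumed data)
exp_a <- rnorm(50, mean = 80, sd = 30)
exp_b <- exp_a + rnorm(50, sd = 3)

# Pairwise correlation between the two predictors
r <- cor(exp_a, exp_b)

# Variance inflation factor for exp_a given exp_b: 1 / (1 - R^2)
r2 <- summary(lm(exp_a ~ exp_b))$r.squared
vif_a <- 1 / (1 - r2)

print(c(correlation = r, vif = vif_a))
```

A VIF above roughly 10 is a common rule of thumb for problematic collinearity. car::vif(), from the car package loaded earlier, reports the same quantity for every predictor in a fitted model.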

3.1 Building and Analyzing the Linear Models

A linear model can be fitted with lm(), and predictions obtained with predict().

3.1.1 Model with one predictor

crime_rate is the target variable and police is the predictor.

modelsatu <- lm(crime_rate~police,crime)

crime$predictsatu <- predict(object = modelsatu, 
                             newdata = data.frame(police = crime$police))

3.1.2 Model with two predictors

crime_rate is the target variable, with police and prob_prison as predictors.

modeldua <- lm(crime_rate~police + prob_prison,crime)

crime$predictdua <- predict(object = modeldua, newdata = data.frame(
  police = crime$police, prob_prison = crime$prob_prison))

3.1.3 Model with all variables as predictors

# Exclude the prediction columns added above so they are not used as predictors
modeltiga <- lm(crime_rate ~ . - predictsatu - predictdua, crime)

crime$predicttiga <- predict(object = modeltiga, newdata = crime)

3.2 R-Squared

R-squared is a number between 0 and 1 that indicates how much of the variation in the dependent variable is explained jointly by the independent variables. The closer it is to 1, the better the model fits the data. However, R-squared never decreases when a variable is added, and this is its main weakness: the more independent variables are used, the more "noise" may enter the model, and R-squared cannot reveal this.

One purpose of regressing a dependent variable on independent variables is to obtain an equation that can predict the response for given values of the predictors. If you intend to predict Y, you should also look at the predicted R-squared, which indicates how well the model predicts new observations.
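
The predicted R-squared is not reported by summary(), but for a linear model it can be computed from the PRESS statistic (the sum of squared leave-one-out residuals). The following is a minimal sketch on the built-in mtcars data, used here only as a stand-in for the crime data; the object names are illustrative.

```r
# mtcars ships with R; it stands in for the crime data here
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Leave-one-out residuals via the hat values: e_i / (1 - h_ii)
press <- sum((residuals(fit) / (1 - hatvalues(fit)))^2)

# Total sum of squares of the response
sst <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

pred_r2 <- 1 - press / sst
print(c(r2 = summary(fit)$r.squared, predicted_r2 = pred_r2))
```

Because PRESS is never smaller than the residual sum of squares, the predicted R-squared is never larger than the ordinary R-squared. A large gap between the two suggests overfitting.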

Relying on R-squared often causes problems because its value always increases (or at least never decreases) when an independent variable is added to a model. This introduces bias: a researcher who wants a high R-squared can add independent variables indiscriminately, and the value will rise whether or not the added variables are related to the dependent variable.

For this reason, many researchers recommend the adjusted R-squared. It is interpreted in the same way as R-squared, but it can rise or fall when a new variable is added, depending on whether that variable improves the fit enough to offset the penalty for the extra parameter. The adjusted R-squared can be negative; a negative value is treated as 0, meaning the independent variables explain none of the variance of the dependent variable.
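
The difference between the two measures can be seen by adding a pure-noise predictor to a model. The sketch below again uses mtcars as a stand-in dataset; the noise column and the helper adj_r2() are hypothetical and only illustrate the standard formula.

```r
set.seed(42)
dat <- mtcars
# A pure-noise column, unrelated to mpg by construction (assumption for the demo)
dat$noise <- rnorm(nrow(dat))

fit_small <- lm(mpg ~ wt + hp, data = dat)
fit_noise <- lm(mpg ~ wt + hp + noise, data = dat)

# Adjusted R-squared by hand: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n <- nobs(fit)
  p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared cannot go down when the noise column is added
print(c(summary(fit_small)$r.squared, summary(fit_noise)$r.squared))
# Adjusted R-squared may go down, because of the penalty for the extra term
print(c(adj_r2(fit_small), adj_r2(fit_noise)))
```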

3.2.1 Model Comparison

Now we compare the three models by their R-squared and adjusted R-squared values.

summary(modelsatu)$r.squared ; summary(modeldua)$r.squared ;

## [1] 0.4604511

## [1] 0.4749003

summary(modeltiga)$r.squared

## [1] 0.7913291

R-squared clearly increases as independent variables (predictors) are added; modeltiga uses every variable in the data as a predictor. Next we compare the adjusted R-squared values.

summary(modelsatu)$adj.r.squared ; summary(modeldua)$adj.r.squared ;

## [1] 0.4484611

## [1] 0.4510321

summary(modeltiga)$adj.r.squared

## [1] 0.7000357

The adjusted R-squared is also highest for the third model, which suggests that it fits the data reasonably well. What about its predictions? We assess the prediction error with RMSE.

3.3 RMSE (Root Mean Square Error)

RMSE is a common metric for evaluating the accuracy of a model's predictions. It is the square root of the mean of the squared errors, so it expresses the typical size of the prediction error in the units of the target variable. A low RMSE indicates that the predicted values are close to the observed values.
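
The definition can be written out directly in base R. The helper rmse() below is a hypothetical name used for illustration; it should agree with RMSE() from the MLmetrics package used in this section.

```r
# Root mean square error: square root of the mean squared difference
rmse <- function(y_pred, y_true) {
  sqrt(mean((y_pred - y_true)^2))
}

# Tiny made-up example: the errors are 0, 0 and -2
example_rmse <- rmse(c(1, 2, 3), c(1, 2, 5))
print(example_rmse)
```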

RMSE(crime$predictsatu, y_true = crime$crime_rate) ;

## [1] 281.0541

RMSE(crime$predictdua, y_true = crime$crime_rate) ;

## [1] 277.2653

RMSE(crime$predicttiga, y_true = crime$crime_rate)

## [1] 174.7855

Based on the lowest RMSE, modeltiga is the best of the three models. Note that these RMSE values are computed on the same data that was used to fit the models. If the results are still unsatisfying, or it is unclear which predictors to choose, we can turn to another approach such as stepwise regression, which offers backward, forward, and both directions, and then compare the three resulting models.

3.4 Stepwise Regression

Stepwise regression involves two kinds of procedures: forward selection and backward elimination. The method proceeds in stages, and at each stage we decide which variable should be added to or removed from the model. The criterion used in this analysis is the lowest AIC (Akaike Information Criterion).
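
For lm fits, the AIC values printed by step() come from extractAIC(), which computes n * log(RSS / n) + 2 * edf. The sketch below reproduces that formula with a hypothetical helper, step_aic(), and checks it on the built-in mtcars data, used here as a stand-in for the crime data.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * edf,
# where edf is the number of estimated coefficients
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

# Check against extractAIC() on a built-in dataset (mtcars stands in for crime)
fit <- lm(mpg ~ wt + hp, data = mtcars)
manual <- step_aic(sum(residuals(fit)^2), nobs(fit), length(coef(fit)))
print(c(manual = manual, extractAIC = extractAIC(fit)[2]))
```

With the values reported for the full model in section 3.5 (RSS = 1435849, 47 observations, 15 coefficients including the intercept), step_aic(1435849, 47, 15) gives about 515.37, the Start value printed by step().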

3.5 Backward

Before the analysis, we first drop the prediction columns created earlier by building a new data frame, crimestepwise.

head(crime)

##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police predictsatu predictdua
## 1    0.084602     26.2011        791    114    671.5972   606.7562
## 2    0.029599     25.2999       1635    198   1054.4087  1080.7907
## 3    0.083401     24.3006        578     89    557.6653   506.4746
## 4    0.015801     29.9012       1969    290   1473.6785  1492.1289
## 5    0.041399     21.2998       1234    210   1109.0961  1102.8647
## 6    0.034201     20.9995        682    233   1213.9135  1214.4033
##   predicttiga
## 1    769.6416
## 2   1436.7117
## 3    348.1900
## 4   1809.6612
## 5   1119.9189
## 6    803.0214

summary(lm(crime_rate~.,crime[,-c(16,17,18)]))

## 
## Call:
## lm(formula = crime_rate ~ ., data = crime[, -c(16, 17, 18)])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -459.84 -122.87   11.49  117.58  451.50 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6606.4737  1583.4238  -4.172 0.000215 ***
## percent_m                9.0258     4.2226   2.137 0.040298 *  
## is_south                 6.1504   150.5377   0.041 0.967664    
## mean_education          17.3268     6.1901   2.799 0.008612 ** 
## labour_participation    -0.1626     1.4416  -0.113 0.910921    
## m_per1000f               1.9532     2.0562   0.950 0.349278    
## state_pop               -0.7269     1.3066  -0.556 0.581830    
## nonwhites_per1000        0.2059     0.6369   0.323 0.748596    
## unemploy_m24            -5.4736     4.2578  -1.286 0.207824    
## unemploy_m39            17.3475     8.3316   2.082 0.045411 *  
## gdp                      0.9446     1.0503   0.899 0.375170    
## inequality               7.3384     2.2928   3.201 0.003091 ** 
## prob_prison          -4079.3771  2228.7052  -1.830 0.076521 .  
## time_prison             -0.2575     6.8520  -0.038 0.970261    
## police                   4.9475     1.2844   3.852 0.000530 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 211.8 on 32 degrees of freedom
## Multiple R-squared:  0.7913, Adjusted R-squared:    0.7 
## F-statistic: 8.668 on 14 and 32 DF,  p-value: 0.0000002605

crimestepwise <- crime[,-c(16,17,18)] 
lm.all <- lm(crime_rate~., crimestepwise)

step(lm.all, direction = "backward")

## Start:  AIC=515.37
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + time_prison + 
##     police
## 
##                        Df Sum of Sq     RSS    AIC
## - time_prison           1        63 1435912 513.38
## - is_south              1        75 1435924 513.38
## - labour_participation  1       571 1436420 513.39
## - nonwhites_per1000     1      4689 1440538 513.53
## - state_pop             1     13889 1449738 513.83
## - gdp                   1     36295 1472144 514.55
## - m_per1000f            1     40488 1476337 514.68
## <none>                              1435849 515.37
## - unemploy_m24          1     74155 1510004 515.74
## - prob_prison           1    150328 1586178 518.05
## - unemploy_m39          1    194523 1630372 519.35
## - percent_m             1    205006 1640855 519.65
## - mean_education        1    351558 1787407 523.67
## - inequality            1    459655 1895504 526.43
## - police                1    665754 2101603 531.28
## 
## Step:  AIC=513.38
## crime_rate ~ percent_m + is_south + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                        Df Sum of Sq     RSS    AIC
## - is_south              1        94 1436006 511.38
## - labour_participation  1       573 1436486 511.40
## - nonwhites_per1000     1      4696 1440608 511.53
## - state_pop             1     15376 1451289 511.88
## - gdp                   1     36439 1472351 512.55
## - m_per1000f            1     42048 1477960 512.73
## <none>                              1435912 513.38
## - unemploy_m24          1     74740 1510652 513.76
## - unemploy_m39          1    196871 1632783 517.42
## - prob_prison           1    203777 1639689 517.61
## - percent_m             1    210715 1646627 517.81
## - mean_education        1    356119 1792032 521.79
## - inequality            1    459600 1895512 524.43
## - police                1    707398 2143310 530.20
## 
## Step:  AIC=511.38
## crime_rate ~ percent_m + mean_education + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                        Df Sum of Sq     RSS    AIC
## - labour_participation  1      1064 1437071 509.41
## - nonwhites_per1000     1      6303 1442310 509.59
## - state_pop             1     15458 1451464 509.88
## - gdp                   1     39135 1475142 510.64
## - m_per1000f            1     44243 1480249 510.81
## <none>                              1436006 511.38
## - unemploy_m24          1     88715 1524721 512.20
## - unemploy_m39          1    208089 1644096 515.74
## - percent_m             1    212656 1648662 515.87
## - prob_prison           1    217046 1653052 516.00
## - mean_education        1    358370 1794376 519.85
## - inequality            1    544157 1980163 524.48
## - police                1    722777 2158784 528.54
## 
## Step:  AIC=509.41
## crime_rate ~ percent_m + mean_education + m_per1000f + state_pop + 
##     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
##     prob_prison + police
## 
##                     Df Sum of Sq     RSS    AIC
## - nonwhites_per1000  1      5723 1442794 507.60
## - state_pop          1     17753 1454824 507.99
## - gdp                1     38716 1475787 508.66
## - m_per1000f         1     52014 1489085 509.09
## <none>                           1437071 509.41
## - unemploy_m24       1     92066 1529136 510.33
## - unemploy_m39       1    209698 1646769 513.82
## - prob_prison        1    219922 1656993 514.11
## - percent_m          1    228700 1665771 514.36
## - mean_education     1    380957 1818028 518.47
## - inequality         1    543097 1980168 522.48
## - police             1    794558 2231628 528.10
## 
## Step:  AIC=507.6
## crime_rate ~ percent_m + mean_education + m_per1000f + state_pop + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     police
## 
##                  Df Sum of Sq     RSS    AIC
## - state_pop       1     16957 1459751 506.15
## - gdp             1     36113 1478907 506.76
## - m_per1000f      1     47772 1490565 507.13
## <none>                        1442794 507.60
## - unemploy_m24    1     93803 1536596 508.56
## - unemploy_m39    1    217091 1659885 512.19
## - prob_prison     1    218991 1661784 512.24
## - percent_m       1    306832 1749626 514.66
## - mean_education  1    375422 1818216 516.47
## - inequality      1    617614 2060408 522.35
## - police          1   1011638 2454432 530.57
## 
## Step:  AIC=506.15
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + gdp + inequality + prob_prison + police
## 
##                  Df Sum of Sq     RSS    AIC
## - gdp             1     30683 1490434 505.13
## <none>                        1459751 506.15
## - m_per1000f      1     95202 1554953 507.12
## - unemploy_m24    1    105276 1565027 507.42
## - prob_prison     1    202801 1662552 510.26
## - unemploy_m39    1    219084 1678835 510.72
## - percent_m       1    319185 1778936 513.44
## - mean_education  1    373762 1833513 514.87
## - inequality      1    614425 2074176 520.66
## - police          1   1094101 2553852 530.44
## 
## Step:  AIC=505.13
## crime_rate ~ percent_m + mean_education + m_per1000f + unemploy_m24 + 
##     unemploy_m39 + inequality + prob_prison + police
## 
##                  Df Sum of Sq     RSS    AIC
## <none>                        1490434 505.13
## - m_per1000f      1    117843 1608276 506.70
## - unemploy_m24    1    136366 1626799 507.24
## - prob_prison     1    256641 1747074 510.60
## - unemploy_m39    1    271328 1761761 510.99
## - percent_m       1    292207 1782641 511.54
## - mean_education  1    433030 1923464 515.12
## - inequality      1    744855 2235288 522.18
## - police          1   1634672 3125105 537.93

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crimestepwise)
## 
## Coefficients:
##    (Intercept)       percent_m  mean_education      m_per1000f  
##      -6541.179           9.258          17.775           2.383  
##   unemploy_m24    unemploy_m39      inequality     prob_prison  
##         -6.294          19.257           6.180       -3858.651  
##         police  
##          5.278

The model with the lowest AIC from the backward method is:

backwardmodel <-
    lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    police, data = crimestepwise)

summary(backwardmodel)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crimestepwise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475.87 -104.70    0.77  129.36  470.19 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)    -6541.1790  1210.6873  -5.403 0.000003750 ***
## percent_m          9.2582     3.3919   2.729     0.00956 ** 
## mean_education    17.7753     5.3496   3.323     0.00198 ** 
## m_per1000f         2.3832     1.3749   1.733     0.09114 .  
## unemploy_m24      -6.2939     3.3755  -1.865     0.06997 .  
## unemploy_m39      19.2570     7.3216   2.630     0.01226 *  
## inequality         6.1800     1.4181   4.358 0.000096343 ***
## prob_prison    -3858.6505  1508.4712  -2.558     0.01464 *  
## police             5.2781     0.8176   6.456 0.000000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 198 on 38 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7378 
## F-statistic: 17.18 on 8 and 38 DF,  p-value: 0.0000000001842

Generate predictions from the backward model.

crimestepwise$predictbackward <- predict(object = backwardmodel, data.frame(
    percent_m = crimestepwise$percent_m, 
    mean_education = crimestepwise$mean_education,
    m_per1000f = crimestepwise$m_per1000f, 
    unemploy_m24 = crimestepwise$unemploy_m24, 
    unemploy_m39 = crimestepwise$unemploy_m39, 
    inequality = crimestepwise$inequality,
    prob_prison = crimestepwise$prob_prison, 
    police = crimestepwise$police))
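
As a possible simplification, step() returns the selected model as an ordinary lm object, so the formula does not have to be retyped, and predict() without newdata returns the fitted values for the training rows. The following is a minimal sketch on the built-in mtcars data (a stand-in for crimestepwise; the variable names are not from the original analysis).

```r
# Full model on a stand-in dataset
full_fit <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# trace = 0 suppresses the step-by-step printout; the result is the final model
selected <- step(full_fit, direction = "backward", trace = 0)
print(formula(selected))

# Fitted values for the training rows, with and without an explicit newdata
fitted_a <- unname(predict(selected))
fitted_b <- unname(predict(selected, newdata = mtcars))
print(all.equal(fitted_a, fitted_b))
```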

3.6 Forward

Forward selection is a modeling method for finding the "best" combination of predictor variables for a dataset. In the forward selection procedure, once a variable has entered the equation it cannot be removed.

Forward selection starts by adding the independent variable most strongly correlated with the dependent variable (the one most likely to have a linear relationship with Y). It then adds the next most promising independent variable step by step, and stops when no remaining variable improves the model.
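
A single forward step can be reproduced by hand with add1(), which evaluates each candidate variable added to the current model. The sketch below uses mtcars with three arbitrarily chosen candidates (an assumption for illustration, not the crime predictors).

```r
# Start from the intercept-only model
null_fit <- lm(mpg ~ 1, data = mtcars)

# Evaluate adding each candidate on its own; the table has RSS and AIC columns
cand <- add1(null_fit, scope = ~ wt + hp + qsec)
print(cand)

# The candidate with the lowest AIC is the one a forward step would add first
best <- rownames(cand)[which.min(cand$AIC)]
print(best)
```

With a single predictor, the lowest AIC corresponds to the lowest RSS, which is why the first variable chosen is the one with the strongest correlation with the response.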

lm.none <- lm(crime_rate~1,crimestepwise)

step(lm.none, scope = list(lower = lm.none, upper = lm.all),
     direction = "forward")

## Start:  AIC=561.02
## crime_rate ~ 1
## 
##                        Df Sum of Sq     RSS    AIC
## + police                1   3168330 3712597 534.02
## + gdp                   1   1340152 5540775 552.84
## + prob_prison           1   1257075 5623853 553.54
## + state_pop             1    783660 6097267 557.34
## + mean_education        1    717146 6163781 557.85
## + m_per1000f            1    314867 6566061 560.82
## <none>                              6880928 561.02
## + labour_participation  1    245446 6635482 561.32
## + inequality            1    220530 6660397 561.49
## + unemploy_m39          1    216354 6664573 561.52
## + time_prison           1    154545 6726383 561.96
## + is_south              1     56527 6824400 562.64
## + percent_m             1     55084 6825844 562.65
## + unemploy_m24          1     17533 6863395 562.90
## + nonwhites_per1000     1      7312 6873615 562.97
## 
## Step:  AIC=534.02
## crime_rate ~ police
## 
##                        Df Sum of Sq     RSS    AIC
## + inequality            1    759855 2952743 525.26
## + percent_m             1    612966 3099632 527.54
## + m_per1000f            1    260696 3451901 532.60
## + nonwhites_per1000     1    232645 3479952 532.98
## + is_south              1    214795 3497803 533.22
## + gdp                   1    170207 3542390 533.82
## <none>                              3712597 534.02
## + prob_prison           1     99424 3613173 534.75
## + labour_participation  1     86340 3626257 534.92
## + time_prison           1     54647 3657950 535.33
## + unemploy_m39          1     22883 3689715 535.73
## + state_pop             1      2449 3710149 535.99
## + unemploy_m24          1      2269 3710328 535.99
## + mean_education        1      1064 3711533 536.01
## 
## Step:  AIC=525.26
## crime_rate ~ police + inequality
## 
##                        Df Sum of Sq     RSS    AIC
## + mean_education        1    578682 2374061 517.01
## + m_per1000f            1    479766 2472976 518.93
## + prob_prison           1    291301 2661441 522.38
## + labour_participation  1    287949 2664794 522.44
## + gdp                   1    234545 2718197 523.37
## + percent_m             1    176837 2775905 524.36
## <none>                              2952743 525.26
## + state_pop             1    117308 2835435 525.36
## + nonwhites_per1000     1     42677 2910065 526.58
## + is_south              1     41017 2911726 526.60
## + unemploy_m24          1      3597 2949145 527.20
## + time_prison           1      2838 2949905 527.22
## + unemploy_m39          1         4 2952738 527.26
## 
## Step:  AIC=517.01
## crime_rate ~ police + inequality + mean_education
## 
##                        Df Sum of Sq     RSS    AIC
## + prob_prison           1    245632 2128429 513.88
## + percent_m             1    233504 2140557 514.14
## + m_per1000f            1    133872 2240188 516.28
## <none>                              2374061 517.01
## + gdp                   1     95015 2279045 517.09
## + time_prison           1     77849 2296212 517.44
## + unemploy_m39          1     71032 2303029 517.58
## + state_pop             1     32168 2341893 518.37
## + labour_participation  1     14490 2359571 518.72
## + unemploy_m24          1      9465 2364596 518.82
## + nonwhites_per1000     1      2355 2371705 518.96
## + is_south              1      1893 2372167 518.97
## 
## Step:  AIC=513.88
## crime_rate ~ police + inequality + mean_education + prob_prison
## 
##                        Df Sum of Sq     RSS    AIC
## + percent_m             1    257271 1871157 509.82
## + m_per1000f            1    148320 1980109 512.48
## + state_pop             1     96832 2031597 513.69
## <none>                              2128429 513.88
## + is_south              1     64071 2064357 514.44
## + unemploy_m39          1     61658 2066770 514.49
## + nonwhites_per1000     1     43703 2084726 514.90
## + gdp                   1     38070 2090358 515.03
## + unemploy_m24          1      8726 2119703 515.68
## + labour_participation  1      1336 2127092 515.85
## + time_prison           1         3 2128426 515.88
## 
## Step:  AIC=509.82
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m
## 
##                        Df Sum of Sq     RSS    AIC
## + unemploy_m39          1    205935 1665223 506.34
## + gdp                   1    100497 1770660 509.23
## + m_per1000f            1     99365 1771793 509.26
## <none>                              1871157 509.82
## + unemploy_m24          1     57136 1814022 510.36
## + state_pop             1     41383 1829774 510.77
## + is_south              1     17419 1853739 511.38
## + time_prison           1      4963 1866195 511.70
## + nonwhites_per1000     1       245 1870912 511.81
## + labour_participation  1         3 1871154 511.82
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39
## 
##                        Df Sum of Sq     RSS    AIC
## + gdp                   1     69540 1595683 506.34
## <none>                              1665223 506.34
## + unemploy_m24          1     56946 1608276 506.70
## + state_pop             1     45741 1619482 507.03
## + m_per1000f            1     38423 1626799 507.24
## + labour_participation  1     23545 1641678 507.67
## + is_south              1     13773 1651449 507.95
## + time_prison           1      3062 1662160 508.25
## + nonwhites_per1000     1        29 1665194 508.34
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39 + gdp
## 
##                        Df Sum of Sq     RSS    AIC
## <none>                              1595683 506.34
## + state_pop             1     52319 1543364 506.77
## + unemploy_m24          1     40730 1554953 507.12
## + m_per1000f            1     30656 1565027 507.42
## + labour_participation  1     14263 1581420 507.91
## + is_south              1      6372 1589311 508.15
## + time_prison           1      5820 1589863 508.16
## + nonwhites_per1000     1      1206 1594477 508.30

## 
## Call:
## lm(formula = crime_rate ~ police + inequality + mean_education + 
##     prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)
## 
## Coefficients:
##    (Intercept)          police      inequality  mean_education  
##      -5807.418           5.259           8.228          18.104  
##    prob_prison       percent_m    unemploy_m39             gdp  
##      -3388.152          11.279           8.578           1.218

The model with the lowest AIC from the forward method is:

forwardmodel <- 
    lm(formula = crime_rate ~ police + inequality + mean_education + 
    prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)

summary(forwardmodel)

## 
## Call:
## lm(formula = crime_rate ~ police + inequality + mean_education + 
##     prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -456.61  -94.58  -10.87  106.26  491.20 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)    -5807.4184  1071.1423  -5.422 0.000003286 ***
## police             5.2589     0.8833   5.953 0.000000601 ***
## inequality         8.2282     1.7613   4.672 0.000035087 ***
## mean_education    18.1039     4.6633   3.882    0.000389 ***
## prob_prison    -3388.1521  1583.1465  -2.140    0.038655 *  
## percent_m         11.2791     3.4073   3.310    0.002014 ** 
## unemploy_m39       8.5781     4.1480   2.068    0.045319 *  
## gdp                1.2179     0.9342   1.304    0.199981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.3 on 39 degrees of freedom
## Multiple R-squared:  0.7681, Adjusted R-squared:  0.7265 
## F-statistic: 18.45 on 7 and 39 DF,  p-value: 0.000000000141
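Rather than retyping the selected formula by hand, the fitted model that step() returns can be assigned directly. The following is a minimal sketch on R's built-in mtcars data, used only as a stand-in because crime.csv does not ship with R; the names null_fit, full_fit and forward_fit are illustrative and not part of the original analysis.

```r
# Stand-in data: mtcars ships with base R (the crime data is not assumed here).
null_fit <- lm(mpg ~ 1, data = mtcars)   # intercept-only model (plays the role of lm.none)
full_fit <- lm(mpg ~ ., data = mtcars)   # all predictors (plays the role of lm.all)

# step() returns the final lm object, so it can be stored directly.
# trace = 0 suppresses the per-step AIC tables.
forward_fit <- step(null_fit,
                    scope = list(lower = formula(null_fit),
                                 upper = formula(full_fit)),
                    direction = "forward", trace = 0)

formula(forward_fit)                 # the formula chosen by the search
summary(forward_fit)$adj.r.squared   # adjusted R-squared of the selected model
```

Storing the returned object avoids transcription mistakes when the selected formula is long, and the same pattern applies to direction = "backward" and direction = "both".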

Predictions from the forward model:

crimestepwise$predictforward <- predict(object = forwardmodel, data.frame(
    police = crimestepwise$police,
    inequality = crimestepwise$inequality,
    mean_education = crimestepwise$mean_education,
    prob_prison = crimestepwise$prob_prison,
    percent_m = crimestepwise$percent_m, 
    unemploy_m39 = crimestepwise$unemploy_m39,
    gdp = crimestepwise$gdp))
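Because the predictions above are made on the same rows that were used to fit the model, rebuilding the predictors in a data.frame is optional. A minimal sketch, assuming the built-in mtcars data as a stand-in and illustrative object names, shows that predict() without newdata gives the same values:

```r
# Stand-in data: mtcars ships with base R.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Explicit reconstruction of the predictors, as in the text above.
pred_explicit <- predict(fit, newdata = data.frame(wt = mtcars$wt,
                                                   hp = mtcars$hp))

# Without newdata, predict() returns the fitted values for the training rows.
pred_default <- predict(fit)

# The two agree up to floating-point error (element names are dropped first).
all.equal(unname(pred_explicit), unname(pred_default))
```

The explicit data.frame form is still the one to use when predicting for new observations that were not part of the fit.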

3.7 Both

The both method combines forward and backward selection. Conceptually, the first predictor to enter is the one most strongly and significantly correlated with the response, and the second is the one with the highest partial correlation that is still significant. After each variable enters, the variables already in the model are re-evaluated, and any that is no longer significant is removed. Note that R's step() function applies this add-and-drop logic using AIC rather than significance tests: at each step it makes the single addition or removal that lowers AIC the most, and it stops when no move lowers AIC further.

step(lm.none, scope = list(lower = lm.none, upper = lm.all),
     direction = "both")

## Start:  AIC=561.02
## crime_rate ~ 1
## 
##                        Df Sum of Sq     RSS    AIC
## + police                1   3168330 3712597 534.02
## + gdp                   1   1340152 5540775 552.84
## + prob_prison           1   1257075 5623853 553.54
## + state_pop             1    783660 6097267 557.34
## + mean_education        1    717146 6163781 557.85
## + m_per1000f            1    314867 6566061 560.82
## <none>                              6880928 561.02
## + labour_participation  1    245446 6635482 561.32
## + inequality            1    220530 6660397 561.49
## + unemploy_m39          1    216354 6664573 561.52
## + time_prison           1    154545 6726383 561.96
## + is_south              1     56527 6824400 562.64
## + percent_m             1     55084 6825844 562.65
## + unemploy_m24          1     17533 6863395 562.90
## + nonwhites_per1000     1      7312 6873615 562.97
## 
## Step:  AIC=534.02
## crime_rate ~ police
## 
##                        Df Sum of Sq     RSS    AIC
## + inequality            1    759855 2952743 525.26
## + percent_m             1    612966 3099632 527.54
## + m_per1000f            1    260696 3451901 532.60
## + nonwhites_per1000     1    232645 3479952 532.98
## + is_south              1    214795 3497803 533.22
## + gdp                   1    170207 3542390 533.82
## <none>                              3712597 534.02
## + prob_prison           1     99424 3613173 534.75
## + labour_participation  1     86340 3626257 534.92
## + time_prison           1     54647 3657950 535.33
## + unemploy_m39          1     22883 3689715 535.73
## + state_pop             1      2449 3710149 535.99
## + unemploy_m24          1      2269 3710328 535.99
## + mean_education        1      1064 3711533 536.01
## - police                1   3168330 6880928 561.02
## 
## Step:  AIC=525.26
## crime_rate ~ police + inequality
## 
##                        Df Sum of Sq     RSS    AIC
## + mean_education        1    578682 2374061 517.01
## + m_per1000f            1    479766 2472976 518.93
## + prob_prison           1    291301 2661441 522.38
## + labour_participation  1    287949 2664794 522.44
## + gdp                   1    234545 2718197 523.37
## + percent_m             1    176837 2775905 524.36
## <none>                              2952743 525.26
## + state_pop             1    117308 2835435 525.36
## + nonwhites_per1000     1     42677 2910065 526.58
## + is_south              1     41017 2911726 526.60
## + unemploy_m24          1      3597 2949145 527.20
## + time_prison           1      2838 2949905 527.22
## + unemploy_m39          1         4 2952738 527.26
## - inequality            1    759855 3712597 534.02
## - police                1   3707655 6660397 561.49
## 
## Step:  AIC=517.01
## crime_rate ~ police + inequality + mean_education
## 
##                        Df Sum of Sq     RSS    AIC
## + prob_prison           1    245632 2128429 513.88
## + percent_m             1    233504 2140557 514.14
## + m_per1000f            1    133872 2240188 516.28
## <none>                              2374061 517.01
## + gdp                   1     95015 2279045 517.09
## + time_prison           1     77849 2296212 517.44
## + unemploy_m39          1     71032 2303029 517.58
## + state_pop             1     32168 2341893 518.37
## + labour_participation  1     14490 2359571 518.72
## + unemploy_m24          1      9465 2364596 518.82
## + nonwhites_per1000     1      2355 2371705 518.96
## + is_south              1      1893 2372167 518.97
## - mean_education        1    578682 2952743 525.26
## - inequality            1   1337472 3711533 536.01
## - police                1   3709362 6083423 559.23
## 
## Step:  AIC=513.88
## crime_rate ~ police + inequality + mean_education + prob_prison
## 
##                        Df Sum of Sq     RSS    AIC
## + percent_m             1    257271 1871157 509.82
## + m_per1000f            1    148320 1980109 512.48
## + state_pop             1     96832 2031597 513.69
## <none>                              2128429 513.88
## + is_south              1     64071 2064357 514.44
## + unemploy_m39          1     61658 2066770 514.49
## + nonwhites_per1000     1     43703 2084726 514.90
## + gdp                   1     38070 2090358 515.03
## + unemploy_m24          1      8726 2119703 515.68
## + labour_participation  1      1336 2127092 515.85
## + time_prison           1         3 2128426 515.88
## - prob_prison           1    245632 2374061 517.01
## - mean_education        1    533013 2661441 522.38
## - inequality            1   1474896 3603324 536.62
## - police                1   2998253 5126682 553.19
## 
## Step:  AIC=509.82
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m
## 
##                        Df Sum of Sq     RSS    AIC
## + unemploy_m39          1    205935 1665223 506.34
## + gdp                   1    100497 1770660 509.23
## + m_per1000f            1     99365 1771793 509.26
## <none>                              1871157 509.82
## + unemploy_m24          1     57136 1814022 510.36
## + state_pop             1     41383 1829774 510.77
## + is_south              1     17419 1853739 511.38
## + time_prison           1      4963 1866195 511.70
## + nonwhites_per1000     1       245 1870912 511.81
## + labour_participation  1         3 1871154 511.82
## - percent_m             1    257271 2128429 513.88
## - prob_prison           1    269399 2140557 514.14
## - mean_education        1    588615 2459772 520.68
## - inequality            1    986336 2857493 527.72
## - police                1   3200711 5071868 554.69
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39
## 
##                        Df Sum of Sq     RSS    AIC
## + gdp                   1     69540 1595683 506.34
## <none>                              1665223 506.34
## + unemploy_m24          1     56946 1608276 506.70
## + state_pop             1     45741 1619482 507.03
## + m_per1000f            1     38423 1626799 507.24
## + labour_participation  1     23545 1641678 507.67
## + is_south              1     13773 1651449 507.95
## + time_prison           1      3062 1662160 508.25
## + nonwhites_per1000     1        29 1665194 508.34
## - unemploy_m39          1    205935 1871157 509.82
## - prob_prison           1    259033 1924256 511.14
## - percent_m             1    401548 2066770 514.49
## - mean_education        1    776031 2441253 522.32
## - inequality            1    966674 2631897 525.85
## - police                1   2762901 4428124 550.31
## 
## Step:  AIC=506.34
## crime_rate ~ police + inequality + mean_education + prob_prison + 
##     percent_m + unemploy_m39 + gdp
## 
##                        Df Sum of Sq     RSS    AIC
## <none>                              1595683 506.34
## - gdp                   1     69540 1665223 506.34
## + state_pop             1     52319 1543364 506.77
## + unemploy_m24          1     40730 1554953 507.12
## + m_per1000f            1     30656 1565027 507.42
## + labour_participation  1     14263 1581420 507.91
## + is_south              1      6372 1589311 508.15
## + time_prison           1      5820 1589863 508.16
## + nonwhites_per1000     1      1206 1594477 508.30
## - unemploy_m39          1    174977 1770660 509.23
## - prob_prison           1    187398 1783081 509.55
## - percent_m             1    448351 2044033 515.97
## - mean_education        1    616660 2212343 519.69
## - inequality            1    892945 2488628 525.22
## - police                1   1450148 3045831 534.72

## 
## Call:
## lm(formula = crime_rate ~ police + inequality + mean_education + 
##     prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)
## 
## Coefficients:
##    (Intercept)          police      inequality  mean_education  
##      -5807.418           5.259           8.228          18.104  
##    prob_prison       percent_m    unemploy_m39             gdp  
##      -3388.152          11.279           8.578           1.218

From the output above, the both method arrives at the lowest-AIC model, which is refit as follows:

bothmodel <- 
lm(formula = crime_rate ~ police + inequality + mean_education + 
    prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)
summary(bothmodel)

## 
## Call:
## lm(formula = crime_rate ~ police + inequality + mean_education + 
##     prob_prison + percent_m + unemploy_m39 + gdp, data = crimestepwise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -456.61  -94.58  -10.87  106.26  491.20 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)    -5807.4184  1071.1423  -5.422 0.000003286 ***
## police             5.2589     0.8833   5.953 0.000000601 ***
## inequality         8.2282     1.7613   4.672 0.000035087 ***
## mean_education    18.1039     4.6633   3.882    0.000389 ***
## prob_prison    -3388.1521  1583.1465  -2.140    0.038655 *  
## percent_m         11.2791     3.4073   3.310    0.002014 ** 
## unemploy_m39       8.5781     4.1480   2.068    0.045319 *  
## gdp                1.2179     0.9342   1.304    0.199981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 202.3 on 39 degrees of freedom
## Multiple R-squared:  0.7681, Adjusted R-squared:  0.7265 
## F-statistic: 18.45 on 7 and 39 DF,  p-value: 0.000000000141
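The AIC column in the step() traces above can be reproduced from the RSS column. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The sketch below plugs in n = 47 and the two RSS values from the last steps of the trace; the helper name aic_from_rss is hypothetical.

```r
# Hypothetical helper: AIC as reported by step() for a linear model,
# n * log(RSS / n) + 2 * k, with k = number of coefficients incl. intercept.
aic_from_rss <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 47  # observations: 39 residual df + 8 coefficients

aic_with_gdp    <- aic_from_rss(1595683, n, k = 8)  # seven predictors + intercept
aic_without_gdp <- aic_from_rss(1665223, n, k = 7)  # same model without gdp

round(c(with_gdp = aic_with_gdp, without_gdp = aic_without_gdp), 2)
```

Both values round to 506.34, as in the trace, but the model with gdp is marginally lower before rounding. This explains why gdp enters the model even though its coefficient is not significant in the summary.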

Predictions from the both model:

crimestepwise$predictboth <- predict(object = bothmodel, data.frame(
    police = crimestepwise$police,
    inequality = crimestepwise$inequality,
    mean_education = crimestepwise$mean_education,
    prob_prison = crimestepwise$prob_prison,
    percent_m = crimestepwise$percent_m, 
    unemploy_m39 = crimestepwise$unemploy_m39,
    gdp = crimestepwise$gdp))

The forward and both methods select the same model with the same AIC. Next, we compare the adjusted R-squared and RMSE of the three models.

summary(backwardmodel)$adj.r.squared

## [1] 0.7377957

summary(forwardmodel)$adj.r.squared

## [1] 0.7264776

summary(bothmodel)$adj.r.squared

## [1] 0.7264776
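As a cross-check, adjusted R-squared can be recomputed from the multiple R-squared values reported in the model summaries (0.7834 for the backward model, 0.7681 for the forward and both models) with n = 47 observations. The helper adj_r2 is a hypothetical name, and because the inputs are rounded to four decimals the results match the output above only to about three decimals.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n observations and p predictors (the intercept is not counted in p).
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 47
adj_backward <- adj_r2(0.7834, n, p = 8)  # backward model: 8 predictors
adj_forward  <- adj_r2(0.7681, n, p = 7)  # forward/both model: 7 predictors

round(c(backward = adj_backward, forward = adj_forward), 4)
```

The penalty factor (n - 1) / (n - p - 1) grows with the number of predictors, so the backward model ranks higher despite carrying one more term.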

The backward model has the highest adjusted R-squared, so it is a candidate for the final model for the crime data. What about its prediction error? The comparison follows.

RMSE(y_pred = crimestepwise$predictbackward, y_true = crimestepwise$crime_rate)

## [1] 178.0768

RMSE(y_pred = crimestepwise$predictforward, y_true = crimestepwise$crime_rate)

## [1] 184.2572

RMSE(y_pred = crimestepwise$predictboth, y_true = crimestepwise$crime_rate)

## [1] 184.2572

The backward model also has the lowest RMSE of the three, so it is the best of the candidate linear models. It is therefore used as the final linear model for the next stage, the assumption tests.
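The RMSE values above follow from the definition, the square root of the mean squared difference between observed and predicted values. For in-sample predictions of a linear model this equals sqrt(RSS / n), so the forward/both figure can be recovered from the RSS in the step() trace. This is a minimal sketch in base R; rmse is a hypothetical helper rather than the MLmetrics function, and the three-element vectors are toy numbers, not crime data.

```r
# Hypothetical helper mirroring the definition of RMSE.
rmse <- function(y_pred, y_true) sqrt(mean((y_true - y_pred)^2))

# Toy check on illustrative numbers (not taken from the crime data).
toy_rmse <- rmse(y_pred = c(2, 4, 6), y_true = c(1, 4, 8))
toy_rmse

# For in-sample predictions of an lm, RMSE equals sqrt(RSS / n).
n <- 47
rmse_forward <- sqrt(1595683 / n)  # RSS of the forward/both model from the trace
round(rmse_forward, 3)
```

Note that in-sample RMSE can only decrease or stay the same as predictors are added, so the eight-predictor backward model has a built-in advantage over the seven-predictor alternative on this metric.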

3.8 Linear Model Assumption Tests

Comparing the three stepwise regression models gives the following best model:

crimemodelfix <-
    lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
    unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
    police, data = crimestepwise)

summary(crimemodelfix)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crimestepwise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475.87 -104.70    0.77  129.36  470.19 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)    -6541.1790  1210.6873  -5.403 0.000003750 ***
## percent_m          9.2582     3.3919   2.729     0.00956 ** 
## mean_education    17.7753     5.3496   3.323     0.00198 ** 
## m_per1000f         2.3832     1.3749   1.733     0.09114 .  
## unemploy_m24      -6.2939     3.3755  -1.865     0.06997 .  
## unemploy_m39      19.2570     7.3216   2.630     0.01226 *  
## inequality         6.1800     1.4181   4.358 0.000096343 ***
## prob_prison    -3858.6505  1508.4712  -2.558     0.01464 *  
## police             5.2781     0.8176   6.456 0.000000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 198 on 38 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7378 
## F-statistic: 17.18 on 8 and 38 DF,  p-value: 0.0000000001842

3.8.1 Residual Normality Test

hist(crimemodelfix$residuals)

The histogram looks approximately normal. For a formal check, use the shapiro.test() function.

shapiro.test(crimemodelfix$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  crimemodelfix$residuals
## W = 0.98017, p-value = 0.5993

Hypotheses:

H0: the residuals are normally distributed
H1: the residuals are not normally distributed

For the residuals to be considered normally distributed, the p-value must be greater than 0.05. The Shapiro-Wilk test gives a p-value of 0.5993, which is greater than alpha (0.05), so we fail to reject H0 (assumption satisfied).
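A normal Q-Q plot is a common visual companion to the histogram and the Shapiro-Wilk test. The sketch below uses simulated data with an arbitrary seed, so the object names and numbers are illustrative only and do not come from the crime analysis.

```r
set.seed(123)                     # arbitrary seed so the sketch is reproducible
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)        # simulated response with normal errors
fit <- lm(y ~ x)

# Q-Q plot: points close to the line suggest approximately normal residuals.
qqnorm(residuals(fit))
qqline(residuals(fit), col = "red")

# Formal test: a p-value above 0.05 means normality is not rejected.
sw <- shapiro.test(residuals(fit))
sw$p.value
```

Applied to the final model, the same three calls would take crimemodelfix$residuals as input.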

3.8.2 Homoscedasticity Test

Residuals should show no pattern, because we want the model to have captured every systematic pattern in the data. Whether the errors have a pattern (non-constant variance) can be tested with the bptest() function from the lmtest package.

Before testing, the spread can be inspected visually.

# Residuals against fitted values; a funnel shape would suggest heteroscedasticity.
plot(crimemodelfix$fitted.values, crimemodelfix$residuals, 
     ylim = c(-1500, 1500))
abline(h = 0, col="red")

library(lmtest)
bptest(crimemodelfix)

## 
##  studentized Breusch-Pagan test
## 
## data:  crimemodelfix
## BP = 12.955, df = 8, p-value = 0.1134

Decision rule: with alpha = 0.05, if the p-value is greater than alpha we fail to reject H0 (the residual variance is constant).

Conclusion: the linear model crimemodelfix satisfies the homoscedasticity assumption, because the p-value of 0.1134 is greater than alpha, so we fail to reject H0 (assumption satisfied).
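To make the test less of a black box, the studentized (Koenker) form of the Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the model's predictors and multiply the resulting R-squared by n. The sketch below does this in base R on the built-in mtcars data as a stand-in. To the best of my understanding this is the quantity bptest() reports with its default settings, but treat that correspondence as an assumption and verify it against lmtest if it matters.

```r
# Stand-in data: mtcars ships with base R.
fit <- lm(mpg ~ wt + hp, data = mtcars)
n <- nobs(fit)

# Auxiliary regression: squared residuals on the same predictors.
aux <- lm(residuals(fit)^2 ~ wt + hp, data = mtcars)

# Studentized statistic: n * R-squared of the auxiliary regression,
# compared with a chi-squared distribution, df = number of predictors.
bp_stat   <- n * summary(aux)$r.squared
bp_df     <- length(coef(aux)) - 1
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_pvalue)
```

A small p-value would indicate that the predictors explain the size of the squared residuals, which is what heteroscedasticity means in this setting.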

3.8.3 Multicollinearity Test (Variance Inflation Factor)

This check is needed because we want each predictor to be independent of the others, that is, not strongly explained by them. It can be carried out with the vif() function from the car package. A VIF below 10 is considered acceptable.

library(car)
vif(crimemodelfix)

##      percent_m mean_education     m_per1000f   unemploy_m24   unemploy_m39 
##       2.131212       4.200514       1.925157       4.343366       4.484234 
##     inequality    prob_prison         police 
##       3.754268       1.379647       2.599847

When a VIF exceeds 10, a variable should be eliminated or feature engineering applied (building a new variable from the existing ones).

Conclusion: all VIF values are below 10 (the largest is about 4.48), so the model shows no multicollinearity, meaning no strong correlation among its predictors (assumption satisfied).
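For numeric predictors, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes this by hand in base R on the built-in mtcars data as a stand-in; vif_manual and the chosen predictor set are hypothetical and only illustrate the formula.

```r
# Stand-in data: mtcars ships with base R.
predictors <- c("wt", "hp", "disp")

# Hypothetical helper: regress one predictor on the others,
# then return 1 / (1 - R^2) of that auxiliary regression.
vif_manual <- function(target, predictors, data) {
  others <- setdiff(predictors, target)
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(predictors, vif_manual, predictors = predictors, data = mtcars)
vifs
```

A VIF of 1 means a predictor is uncorrelated with the rest, and the value rises without bound as the other predictors explain it more completely.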

With all assumptions satisfied, we can generate predictions from the model.

3.8.4 Model Predictions

# Predict with the final model itself (it has the same formula as backwardmodel).
# The result is stored as an extra element on the model object.
crimemodelfix$predictfix <- predict(object = crimemodelfix, data.frame(
    percent_m = crimestepwise$percent_m, 
    mean_education = crimestepwise$mean_education,
    m_per1000f = crimestepwise$m_per1000f, 
    unemploy_m24 = crimestepwise$unemploy_m24, 
    unemploy_m39 = crimestepwise$unemploy_m39, 
    inequality = crimestepwise$inequality,
    prob_prison = crimestepwise$prob_prison, 
    police = crimestepwise$police))

summary(crimemodelfix)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + m_per1000f + 
##     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
##     police, data = crimestepwise)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -475.87 -104.70    0.77  129.36  470.19 
## 
## Coefficients:
##                  Estimate Std. Error t value    Pr(>|t|)    
## (Intercept)    -6541.1790  1210.6873  -5.403 0.000003750 ***
## percent_m          9.2582     3.3919   2.729     0.00956 ** 
## mean_education    17.7753     5.3496   3.323     0.00198 ** 
## m_per1000f         2.3832     1.3749   1.733     0.09114 .  
## unemploy_m24      -6.2939     3.3755  -1.865     0.06997 .  
## unemploy_m39      19.2570     7.3216   2.630     0.01226 *  
## inequality         6.1800     1.4181   4.358 0.000096343 ***
## prob_prison    -3858.6505  1508.4712  -2.558     0.01464 *  
## police             5.2781     0.8176   6.456 0.000000135 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 198 on 38 degrees of freedom
## Multiple R-squared:  0.7834, Adjusted R-squared:  0.7378 
## F-statistic: 17.18 on 8 and 38 DF,  p-value: 0.0000000001842

head(crimestepwise)

##   percent_m is_south mean_education labour_participation m_per1000f
## 1       151        1             91                  510        950
## 2       143        0            113                  583       1012
## 3       142        1             89                  533        969
## 4       136        0            121                  577        994
## 5       141        0            121                  591        985
## 6       121        0            110                  547        964
##   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
## 1        33               301          108           41 394        261
## 2        13               102           96           36 557        194
## 3        18               219           94           33 318        250
## 4       157                80          102           39 673        167
## 5        18                30           91           20 578        174
## 6        25                44           84           29 689        126
##   prob_prison time_prison crime_rate police predictbackward predictforward
## 1    0.084602     26.2011        791    114        736.4606       835.1994
## 2    0.029599     25.2999       1635    198       1421.9928      1375.7041
## 3    0.083401     24.3006        578     89        401.6267       318.3766
## 4    0.015801     29.9012       1969    290       1848.4665      1917.0064
## 5    0.041399     21.2998       1234    210       1098.8942      1244.8671
## 6    0.034201     20.9995        682    233        738.0554       782.9239
##   predictboth
## 1    835.1994
## 2   1375.7041
## 3    318.3766
## 4   1917.0064
## 5   1244.8671
## 6    782.9239

crimemodelfix$predictfix

##         1         2         3         4         5         6         7 
##  736.4606 1421.9928  401.6267 1848.4665 1098.8942  738.0554  794.8266 
##         8         9        10        11        12        13        14 
## 1395.9296  688.1957  774.8499 1203.8082  723.3948  734.3688  791.6883 
##        15        16        17        18        19        20        21 
##  956.8743  950.4361  435.2767  796.2979 1225.8735 1224.2330  745.0239 
##        22        23        24        25        26        27        28 
##  671.8874  934.0591  846.5946  610.8896 1900.1856  324.8053 1189.1900 
##        29        30        31        32        33        34        35 
## 1378.4461  709.3463  468.4032  763.3692  887.9989 1006.6591  720.1070 
##        36        37        38        39        40        41        42 
## 1118.1954 1023.1192  575.4571  782.8805 1113.6493  753.9829  335.9048 
##        43        44        45        46        47 
## 1090.3831 1195.3133  568.9993  769.9314 1112.6690

The predictions for the crimestepwise data are not far from the observed crime_rate values, which suggests the model performs reasonably well. Keep in mind that these are in-sample predictions, made on the same data used to fit the model.

4 Conclusion

Across all stages of the analysis, the model that performs reasonably well is the backward stepwise linear regression model, refit here as crimemodelfix. If the results still seem suboptimal, crimemodelfix can be re-analyzed with one or two additional predictors, and the new model compared again on adjusted R-squared and RMSE.
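The follow-up comparison suggested above can be sketched with update(), which adds a predictor without retyping the formula. The example uses the built-in mtcars data as a stand-in; base_fit, bigger_fit and rmse_of are hypothetical names, and the added predictor is arbitrary.

```r
# Stand-in data: mtcars ships with base R.
base_fit <- lm(mpg ~ wt + hp, data = mtcars)

# update() adds a predictor to an existing model formula.
bigger_fit <- update(base_fit, . ~ . + qsec)

# Hypothetical helper: in-sample RMSE of a fitted lm.
rmse_of <- function(fit) sqrt(mean(residuals(fit)^2))

comparison <- data.frame(
  model  = c("base", "base + qsec"),
  adj_r2 = c(summary(base_fit)$adj.r.squared,
             summary(bigger_fit)$adj.r.squared),
  rmse   = c(rmse_of(base_fit), rmse_of(bigger_fit))
)
comparison
```

Since in-sample RMSE never increases when a predictor is added, adjusted R-squared is the more informative of the two columns when deciding whether the extra term is worth keeping.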