Crime rates generally tend to rise, driven by a range of problems, chiefly economic and social issues, conflict, and low legal awareness. Low levels of education and high unemployment are further factors that encourage various forms of crime.

The following is an analysis of crime data for states in the United States; the data illustrate how various social factors relate to the crime rate.

Using linear regression analysis, we build a model to identify the factors associated with a decrease or increase in the crime rate of a region.

Loading & Cleaning the Dataset

library(dplyr)
library(GGally)
library(lmtest)
library(car)

crime_clean <- read.csv("data_input/crime.csv", stringsAsFactors = FALSE) %>%
  select(-X) %>% # dropped: this column is only a row index
  select(-Po2) %>% # dropped: nearly duplicates the Po1 column
  mutate(So = as.factor(So)) # categorical value, converted to a factor

head(crime_clean)

#>     M So  Ed Po1  LF  M.F Pop  NW  U1 U2 GDP Ineq     Prob    Time    y
#> 1 151  1  91  58 510  950  33 301 108 41 394  261 0.084602 26.2011  791
#> 2 143  0 113 103 583 1012  13 102  96 36 557  194 0.029599 25.2999 1635
#> 3 142  1  89  45 533  969  18 219  94 33 318  250 0.083401 24.3006  578
#> 4 136  0 121 149 577  994 157  80 102 39 673  167 0.015801 29.9012 1969
#> 5 141  0 121 109 591  985  18  30  91 20 578  174 0.041399 21.2998 1234
#> 6 121  0 110 118 547  964  25  44  84 29 689  126 0.034201 20.9995  682

# check for missing (NA) values

anyNA(crime_clean)

#> [1] FALSE

Based on the output above, the dataset contains no missing (NA) values.

# rename the columns to be more descriptive

names(crime_clean) <- c("percent_m","is_south", "mean_education", "police_exp",  
                          "labour_participation", "m_per1000f", "state_pop", 
                          "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
                          "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime_clean)

#>   percent_m is_south mean_education police_exp labour_participation m_per1000f
#> 1       151        1             91         58                  510        950
#> 2       143        0            113        103                  583       1012
#> 3       142        1             89         45                  533        969
#> 4       136        0            121        149                  577        994
#> 5       141        0            121        109                  591        985
#> 6       121        0            110        118                  547        964
#>   state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp inequality
#> 1        33               301          108           41 394        261
#> 2        13               102           96           36 557        194
#> 3        18               219           94           33 318        250
#> 4       157                80          102           39 673        167
#> 5        18                30           91           20 578        174
#> 6        25                44           84           29 689        126
#>   prob_prison time_prison crime_rate
#> 1    0.084602     26.2011        791
#> 2    0.029599     25.2999       1635
#> 3    0.083401     24.3006        578
#> 4    0.015801     29.9012       1969
#> 5    0.041399     21.2998       1234
#> 6    0.034201     20.9995        682
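
Assigning `names()` positionally works only while the column order stays exactly as shown. As a hedged alternative, the sketch below uses `dplyr::rename()` to map each old name to its new one explicitly. The data frame `crime_head` is a hypothetical three-row excerpt built from the values printed above, with only three of the columns.

```r
library(dplyr)

# Hypothetical excerpt: first three rows of three columns from the output above
crime_head <- data.frame(
  M   = c(151, 143, 142),
  Po1 = c(58, 103, 45),
  y   = c(791, 1635, 578)
)

# rename(new = old) ties each new name to a specific column, not to a position
crime_head_renamed <- crime_head %>%
  rename(percent_m = M, police_exp = Po1, crime_rate = y)

names(crime_head_renamed)
```

If a column is later dropped or reordered upstream, this form fails loudly instead of silently mislabelling a column.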

Exploratory Data Analysis

# check the data structure

str(crime_clean)

#> 'data.frame':    47 obs. of  15 variables:
#>  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
#>  $ is_south            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#>  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
#>  $ police_exp          : int  58 103 45 149 109 118 82 115 65 71 ...
#>  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
#>  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
#>  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
#>  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
#>  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
#>  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
#>  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
#>  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
#>  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
#>  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
#>  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Each column is described below:

percent_m: percentage of males aged 14-24
is_south: whether the state is a southern state (1 = yes, 0 = no)
mean_education: mean years of schooling
police_exp: annual police expenditure
labour_participation: labour force participation rate
m_per1000f: number of males per 1,000 females
state_pop: state population
nonwhites_per1000: number of non-white residents per 1,000 people
unemploy_m24: unemployment rate of males aged 14-24
unemploy_m39: unemployment rate of males aged 35-39
gdp: value of goods and services produced
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: average time served in prison
crime_rate: crime rate

# inspect the correlations between variables

ggcorr(crime_clean, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

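In the correlation plot, ‘police_exp’ shows the highest positive correlation with ‘crime_rate’ among all variables. Not every variable is positively correlated with ‘crime_rate’, so the sign should be read from the plot for each variable individually, and a correlation describes association rather than influence.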
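
To read the correlations as numbers rather than from the plot, they can be ranked directly. The sketch below assumes that `MASS::UScrime`, which ships with R, holds the same data as `crime.csv` under the original column names (`y` for the crime rate, `Po1` for police expenditure); that equivalence is an assumption, not something the source states.

```r
library(MASS)  # provides the UScrime data set (assumed equivalent to crime.csv)

# Drop Po2, as in the cleaning step, then correlate every column with y
crime_num <- UScrime[, setdiff(names(UScrime), "Po2")]
cor_with_y <- cor(crime_num)[, "y"]

# Remove the trivial self-correlation and sort from highest to lowest
cor_with_y <- sort(cor_with_y[names(cor_with_y) != "y"], decreasing = TRUE)
round(cor_with_y, 2)
```

The first element of the sorted vector is the strongest positive correlate, and negative entries at the tail show which variables move against the crime rate.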

Predicting the Crime Rate (crime_rate) from Police Expenditure (police_exp)

From the above, we have:

- The predictor variable (x) is 'police_exp'
- The target variable (y) is 'crime_rate'

Building the Linear Regression Model

We can now build a linear regression model with ‘police_exp’ as the predictor, because it has the highest positive correlation with the target variable ‘crime_rate’.

model_1 <- lm(formula = crime_rate ~ police_exp, 
                     data = crime_clean)

model_1

#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp, data = crime_clean)
#> 
#> Coefficients:
#> (Intercept)   police_exp  
#>     144.464        8.948

summary(model_1)

#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp, data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -586.91 -155.63   32.52  139.58  568.84 
#> 
#> Coefficients:
#>             Estimate Std. Error t value     Pr(>|t|)    
#> (Intercept)  144.464    126.693   1.140         0.26    
#> police_exp     8.948      1.409   6.353 0.0000000934 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 283.9 on 45 degrees of freedom
#> Multiple R-squared:  0.4728, Adjusted R-squared:  0.4611 
#> F-statistic: 40.36 on 1 and 45 DF,  p-value: 0.00000009338

plot(crime_clean$police_exp, crime_clean$crime_rate)
abline(model_1$coefficients[1],model_1$coefficients[2], col = "red")

The adjusted R-squared of this model is 0.4611.
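
The adjusted R-squared follows from the multiple R-squared, the number of observations (n = 47, i.e. 45 residual degrees of freedom plus 2 coefficients) and the number of predictors (p = 1) reported in the summary. A quick check with the rounded values:

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
                 = 1 - (1 - 0.4728)\,\frac{46}{45}
                 \approx 0.4611
```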

Next, the predictors are selected automatically using stepwise regression with backward elimination.

crime_backward <- lm(crime_rate ~ ., crime_clean)
step(crime_backward, direction = "backward")

#> Start:  AIC=513.95
#> crime_rate ~ percent_m + is_south + mean_education + police_exp + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1        64 1392929 511.95
#> - time_prison           1      1236 1394101 511.99
#> - labour_participation  1      1723 1394588 512.00
#> - nonwhites_per1000     1      6802 1399667 512.18
#> - state_pop             1     16168 1409033 512.49
#> - gdp                   1     33582 1426447 513.07
#> - m_per1000f            1     35080 1427945 513.12
#> <none>                              1392865 513.95
#> - unemploy_m24          1     73136 1466001 514.35
#> - prob_prison           1    167590 1560455 517.29
#> - unemploy_m39          1    185009 1577874 517.81
#> - percent_m             1    203389 1596254 518.35
#> - mean_education        1    369864 1762729 523.01
#> - inequality            1    451937 1844802 525.15
#> - police_exp            1    708738 2101603 531.28
#> 
#> Step:  AIC=511.95
#> crime_rate ~ percent_m + mean_education + police_exp + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - time_prison           1      1319 1394247 509.99
#> - labour_participation  1      2646 1395574 510.04
#> - nonwhites_per1000     1      8949 1401878 510.25
#> - state_pop             1     16166 1409095 510.49
#> - gdp                   1     36125 1429054 511.15
#> - m_per1000f            1     36467 1429396 511.16
#> <none>                              1392929 511.95
#> - unemploy_m24          1     86999 1479928 512.80
#> - prob_prison           1    171381 1564310 515.40
#> - unemploy_m39          1    196372 1589301 516.15
#> - percent_m             1    206121 1599050 516.43
#> - mean_education        1    371159 1764088 521.05
#> - inequality            1    534611 1927540 525.22
#> - police_exp            1    728570 2121499 529.72
#> 
#> Step:  AIC=509.99
#> crime_rate ~ percent_m + mean_education + police_exp + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1      3019 1397266 508.09
#> - nonwhites_per1000     1      7996 1402243 508.26
#> - state_pop             1     19634 1413881 508.65
#> - gdp                   1     35276 1429524 509.17
#> - m_per1000f            1     40680 1434928 509.34
#> <none>                              1394247 509.99
#> - unemploy_m24          1     85946 1480194 510.80
#> - unemploy_m39          1    195095 1589343 514.15
#> - percent_m             1    206909 1601157 514.50
#> - prob_prison           1    223309 1617557 514.98
#> - mean_education        1    381593 1775840 519.36
#> - inequality            1    537046 1931294 523.31
#> - police_exp            1    764536 2158784 528.54
#> 
#> Step:  AIC=508.09
#> crime_rate ~ percent_m + mean_education + police_exp + m_per1000f + 
#>     state_pop + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
#>     gdp + inequality + prob_prison
#> 
#>                     Df Sum of Sq     RSS    AIC
#> - nonwhites_per1000  1      6963 1404229 506.33
#> - state_pop          1     23381 1420648 506.87
#> - gdp                1     34787 1432053 507.25
#> - m_per1000f         1     41289 1438555 507.46
#> <none>                           1397266 508.09
#> - unemploy_m24       1     84385 1481652 508.85
#> - unemploy_m39       1    197849 1595115 512.32
#> - prob_prison        1    221812 1619078 513.02
#> - percent_m          1    226201 1623468 513.15
#> - mean_education     1    395884 1793150 517.82
#> - inequality         1    534370 1931637 521.32
#> - police_exp         1    834362 2231628 528.10
#> 
#> Step:  AIC=506.33
#> crime_rate ~ percent_m + mean_education + police_exp + m_per1000f + 
#>     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - state_pop       1     22345 1426575 505.07
#> - gdp             1     32142 1436371 505.39
#> - m_per1000f      1     36808 1441037 505.54
#> <none>                        1404229 506.33
#> - unemploy_m24    1     86373 1490602 507.13
#> - unemploy_m39    1    205814 1610043 510.76
#> - prob_prison     1    218607 1622836 511.13
#> - percent_m       1    307001 1711230 513.62
#> - mean_education  1    389502 1793731 515.83
#> - inequality      1    608627 2012856 521.25
#> - police_exp      1   1050202 2454432 530.57
#> 
#> Step:  AIC=505.07
#> crime_rate ~ percent_m + mean_education + police_exp + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - gdp             1     26493 1453068 503.93
#> <none>                        1426575 505.07
#> - m_per1000f      1     84491 1511065 505.77
#> - unemploy_m24    1     99463 1526037 506.24
#> - prob_prison     1    198571 1625145 509.20
#> - unemploy_m39    1    208880 1635455 509.49
#> - percent_m       1    320926 1747501 512.61
#> - mean_education  1    386773 1813348 514.35
#> - inequality      1    594779 2021354 519.45
#> - police_exp      1   1127277 2553852 530.44
#> 
#> Step:  AIC=503.93
#> crime_rate ~ percent_m + mean_education + police_exp + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> <none>                        1453068 503.93
#> - m_per1000f      1    103159 1556227 505.16
#> - unemploy_m24    1    127044 1580112 505.87
#> - prob_prison     1    247978 1701046 509.34
#> - unemploy_m39    1    255443 1708511 509.55
#> - percent_m       1    296790 1749858 510.67
#> - mean_education  1    445788 1898855 514.51
#> - inequality      1    738244 2191312 521.24
#> - police_exp      1   1672038 3125105 537.93

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)
#> 
#> Coefficients:
#>    (Intercept)       percent_m  mean_education      police_exp      m_per1000f  
#>      -6426.101           9.332          18.012          10.265           2.234  
#>   unemploy_m24    unemploy_m39      inequality     prob_prison  
#>         -6.087          18.735           6.133       -3796.032
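
The AIC values in the trace can be reproduced from the RSS column. For `lm` fits, `step()` relies on `extractAIC()`, which computes `n * log(RSS / n) + 2 * edf`, where `edf` counts the estimated coefficients including the intercept. The helper name `step_aic` below is made up for this illustration.

```r
# Hypothetical helper mirroring the AIC formula extractAIC() uses for lm objects
step_aic <- function(rss, n, edf) {
  n * log(rss / n) + 2 * edf
}

n_obs <- 47  # number of observations in the data

# Full model: 14 predictors + intercept = 15 coefficients
aic_full <- step_aic(rss = 1392865, n = n_obs, edf = 15)

# Final model: 8 predictors + intercept = 9 coefficients
aic_final <- step_aic(rss = 1453068, n = n_obs, edf = 9)

round(c(full = aic_full, final = aic_final), 2)
```

Up to rounding of the printed RSS, these should agree with the `Start` and final `Step` AIC values in the trace (513.95 and 503.93). Each elimination step keeps the candidate with the lowest AIC and stops once removing any further variable would raise it.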

summary(lm(formula = crime_rate ~ percent_m + mean_education + police_exp +
             m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime_clean))

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -444.70 -111.07    3.03  122.15  483.30 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value     Pr(>|t|)    
#> (Intercept)    -6426.101   1194.611  -5.379 0.0000040395 ***
#> percent_m          9.332      3.350   2.786      0.00828 ** 
#> mean_education    18.012      5.275   3.414      0.00153 ** 
#> police_exp        10.265      1.552   6.613 0.0000000826 ***
#> m_per1000f         2.234      1.360   1.642      0.10874    
#> unemploy_m24      -6.087      3.339  -1.823      0.07622 .  
#> unemploy_m39      18.735      7.248   2.585      0.01371 *  
#> inequality         6.133      1.396   4.394 0.0000863344 ***
#> prob_prison    -3796.032   1490.646  -2.547      0.01505 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 195.5 on 38 degrees of freedom
#> Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
#> F-statistic: 17.74 on 8 and 38 DF,  p-value: 0.0000000001159

# store the new regression model as an object
model_2  <- (lm(formula = crime_rate ~ percent_m + mean_education + police_exp + 
    m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
    data = crime_clean))
model_2

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)
#> 
#> Coefficients:
#>    (Intercept)       percent_m  mean_education      police_exp      m_per1000f  
#>      -6426.101           9.332          18.012          10.265           2.234  
#>   unemploy_m24    unemploy_m39      inequality     prob_prison  
#>         -6.087          18.735           6.133       -3796.032

Compared with the initial model, which used only ‘police_exp’, the regression model with the predictors ‘percent_m’, ‘mean_education’, ‘police_exp’, ‘m_per1000f’, ‘unemploy_m24’, ‘unemploy_m39’, ‘inequality’ and ‘prob_prison’ has an adjusted R-squared of 0.7444, higher than the previous model's 0.4611.
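
Rather than retyping the selected formula, the object returned by `step()` can be stored directly, since it is the final fitted model. The sketch again assumes `MASS::UScrime` matches `crime.csv` and therefore uses the original column names; `trace = 0` only silences the step-by-step printout. The object names are illustrative.

```r
library(MASS)  # UScrime: assumed stand-in for data_input/crime.csv

# Mirror the cleaning step: drop Po2 and treat So as a factor
crime_raw <- UScrime[, setdiff(names(UScrime), "Po2")]
crime_raw$So <- as.factor(crime_raw$So)

full_model <- lm(y ~ ., data = crime_raw)

# step() returns the final lm object, so it can be assigned and reused
selected_model <- step(full_model, direction = "backward", trace = 0)

formula(selected_model)
summary(selected_model)$adj.r.squared
```

Storing the result avoids copy errors when the selected set of predictors changes after the data are updated.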

Model Prediction & Error

We now predict ‘crime_rate’ from the values of the predictor variables and compare the result with the actual data we have.

# predict 'crime_rate' for the first observation with model_1

predict(model_1, data.frame(police_exp = 58), interval = "confidence", level = 0.95)

#>       fit      lwr      upr
#> 1 663.476 550.2256 776.7265
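
The fitted value is simply the regression line evaluated at `police_exp = 58`. Using the rounded coefficients printed in the model summary gives almost the same number; the small gap comes from that rounding.

```r
# Coefficients of model_1 as printed in the summary (rounded to 3 decimals)
intercept <- 144.464
slope     <- 8.948

# Point prediction for an observation with police_exp = 58
fit_by_hand <- intercept + slope * 58
fit_by_hand
```

The `lwr` and `upr` columns returned with `interval = "confidence"` bound the mean crime rate at that expenditure level, not an individual state's value.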

# predict 'crime_rate' for the first observation with model_2

predict(model_2, data.frame(percent_m = 151, mean_education = 91, police_exp = 58, m_per1000f = 950, unemploy_m24 = 108, unemploy_m39 = 41, inequality = 261, prob_prison = 0.0846), interval = "confidence", level = 0.95)

#>        fit      lwr      upr
#> 1 730.2679 563.3061 897.2297

# absolute error of model_1 on the first observation (actual crime_rate = 791)
sqrt((791-663.476)^2)

#> [1] 127.524

# absolute error of model_2 on the first observation (actual crime_rate = 791)
sqrt((791-730.2679)^2)

#> [1] 60.7321

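On this single observation, model_2 has the smaller absolute error (60.73 against 127.52) and therefore predicts it better than model_1. The residual standard errors reported by summary() (195.5 for model_2 against 283.9 for model_1) point the same way.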
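
An error computed on one row is not the residual standard error (RSE). The RSE summarises the residuals over all 47 observations and can be recovered from the RSS of the final model in the step() trace together with its residual degrees of freedom. A minimal check, using only numbers printed earlier:

```r
# RSS of the final backward-elimination model and its residual degrees of freedom
rss_model_2 <- 1453068
df_model_2  <- 38   # 47 observations - 9 estimated coefficients

# Residual standard error: typical size of a residual across all observations
rse_model_2 <- sqrt(rss_model_2 / df_model_2)
round(rse_model_2, 1)
```

This should match the "Residual standard error" line in the summary of the eight-predictor model.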

Model Evaluation

Normality

# histogram of the model_1 residuals
hist(model_1$residuals, breaks = 15)

# Shapiro-Wilk normality test on the model_1 residuals
shapiro.test(model_1$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_1$residuals
#> W = 0.97601, p-value = 0.439

# histogram of the model_2 residuals
hist(model_2$residuals, breaks = 15)

# Shapiro-Wilk normality test on the model_2 residuals
shapiro.test(model_2$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_2$residuals
#> W = 0.98511, p-value = 0.8051

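For both models the p-value is greater than 0.05, so we fail to reject H0 (the residuals are normally distributed). In other words, there is no evidence that the residuals deviate from a normal distribution, which is consistent with errors scattered around a mean of zero.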

Heteroscedasticity

plot(crime_clean$police_exp, model_1$residuals)
abline(h = 0, col = "red")

bptest(model_1)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_1
#> BP = 21.098, df = 1, p-value = 0.000004364

plot(model_2$fitted.values, model_2$residuals) # residuals against fitted values
abline(h = 0, col = "red")

bptest(model_2)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_2
#> BP = 13.51, df = 8, p-value = 0.09546

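For model_1 the p-value (0.000004364) is below 0.05, so H0 is rejected: the residuals are heteroscedastic, meaning their variance changes with ‘police_exp’ and the model has not captured all of the structure in the data. For model_2 the p-value (0.09546) is above 0.05, so we fail to reject H0 and find no evidence of heteroscedasticity at the 5% level.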
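
The studentized Breusch-Pagan statistic reported by `bptest()` can be reproduced by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of observations. The sketch assumes `MASS::UScrime` matches `crime.csv` (so `y` and `Po1` stand in for `crime_rate` and `police_exp`); object names are illustrative.

```r
library(MASS)  # UScrime: assumed stand-in for data_input/crime.csv

# Single-predictor model, analogous to model_1
fit <- lm(y ~ Po1, data = UScrime)

# Auxiliary regression: do the squared residuals depend on the predictor?
aux <- lm(residuals(fit)^2 ~ Po1, data = UScrime)

# Studentized (Koenker) BP statistic = n * R-squared of the auxiliary regression
bp_stat <- nrow(UScrime) * summary(aux)$r.squared

# Under H0 (constant variance) the statistic is chi-squared with 1 df here
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(BP = bp_stat, p_value = p_value)
```

A small p-value means the spread of the residuals changes with the predictor, which is the pattern the test output for the single-predictor model reports.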

Variance Inflation Factor (Multicollinearity)

Because model_1 uses only one predictor, vif() cannot be used to analyse multicollinearity for it.

vif(model_2)

#>      percent_m mean_education     police_exp     m_per1000f   unemploy_m24 
#>       2.131963       4.189684       2.560496       1.932367       4.360038 
#>   unemploy_m39     inequality    prob_prison 
#>       4.508106       3.731074       1.381879

Every VIF is below 10, so there is no sign of severe multicollinearity among the predictors. Based on the analysis, both models pass the normality check, but only model_2 passes the Breusch-Pagan test; model_1 shows heteroscedasticity. Comparing the residual standard errors of the two models, model_2 has the lower value (195.5 against 283.9). model_2 is therefore chosen as the better model.
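
For reference, the VIF of predictor j is defined from the R-squared obtained when that predictor is regressed on all the other predictors. Taking the largest value reported above (‘unemploy_m39’, 4.508106) as a worked example:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\quad\Longrightarrow\quad
R_j^2 = 1 - \frac{1}{\mathrm{VIF}_j} = 1 - \frac{1}{4.508106} \approx 0.778
```

So roughly 78% of the variance of ‘unemploy_m39’ is explained by the other predictors, which is noticeable overlap but still under the VIF threshold of 10 used here.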

Conclusions and Recommendations

model_2 has an adjusted R-squared of 0.7444 and a residual standard error of 195.5; its absolute error on the first observation is 60.7321. It also passed the normality, heteroscedasticity, and multicollinearity checks.

Based on this model, the crime rate is positively associated with police expenditure: states with larger police budgets tend to report higher crime rates. This is an association, not a causal effect, so the model does not show that a larger police budget lowers crime in a region; higher spending may itself be a response to higher crime.