1 Penjelasan

Data crime adalah data yang menunjukkan tingkat kejahatan pada suatu daerah tertentu. Kolom-kolom yang terdapat pada data crime adalah:

percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: average time served in prisons
crime_rate: crime rate in an unspecified category

2 Eksplorasi Data

# install package
library(dplyr)
library(leaps)
library(readr)
library(tufte)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(performance)

# Pemanggilan data
crime <- read.csv("crime.csv") %>% 
dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

#Cek struktur data
str(crime)

## 'data.frame':    47 obs. of  16 variables:
##  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
##  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
##  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
##  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
##  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
##  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
##  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
##  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
##  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
##  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
##  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
##  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
##  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
##  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
##  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
##  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Selanjutnya, kita mengecek apakah ada data yang hilang pada dataframe menggunakan colSums

colSums(is.na(crime))

##            percent_m             is_south       mean_education 
##                    0                    0                    0 
##         police_exp60         police_exp59 labour_participation 
##                    0                    0                    0 
##           m_per1000f            state_pop    nonwhites_per1000 
##                    0                    0                    0 
##         unemploy_m24         unemploy_m39                  gdp 
##                    0                    0                    0 
##           inequality          prob_prison          time_prison 
##                    0                    0                    0 
##           crime_rate 
##                    0

2.1 Mengubah tipe data

Kolom is_south dapat diubah menjadi tipe factor

crime_clean <- crime %>% 
  mutate(is_south = as.factor(is_south))

2.2 Mengecek tabel korelasi

ggcorr(crime_clean, label = TRUE, label_size = 2, hjust = 1, layout.exp = 2)

## Warning in ggcorr(crime_clean, label = TRUE, label_size = 2, hjust = 1, : data
## in column(s) 'is_south' are not numeric and were ignored

Dari diagram di atas, terdapat korelasi yang sangat tinggi antara kolom police_exp59 dan police_exp60 (bisa jadi datanya identical) sehingga kita buang salah 1. Kita akan menggunakan kolom yang relative lebih baru yaitu kolom police_exp60.

# membuang kolom
crime_clean <- crime_clean %>% 
  select(-c(police_exp59))

2.3 Mengecek Outliers

boxplot(crime_clean, cex.axis = 0.5)

boxplot(crime_clean$percent_m, cex.axis = 1)

boxplot(crime_clean$m_per1000f, cex.axis = 1)

boxplot(crime_clean$state_pop, cex.axis = 1)

boxplot(crime_clean$nonwhites_per1000, cex.axis = 1)

boxplot(crime_clean$unemploy_m24, cex.axis = 1)

boxplot(crime_clean$unemploy_m39, cex.axis = 1)

boxplot(crime_clean$prob_prison, cex.axis = 1)

boxplot(crime_clean$time_prison, cex.axis = 1)

boxplot(crime_clean$crime_rate, cex.axis = 1)

2.4 Membuang Outliers

Kolom percent_m, m_per1000f, state_pop, nonwhites_per1000, unemploy_m24, unemploy_m39, prob_prison, time_prison, dan crime rate memiliki outliers yang harus dibuang

crime_clean2 <- 
  crime_clean %>% 
  filter(percent_m < 169, m_per1000f < 1035, state_pop< 75, nonwhites_per1000 < 290, unemploy_m24 < 138, unemploy_m39< 55, prob_prison < 0.085, time_prison < 43, crime_rate < 1650)

3 Modeling

Selanjutnya, kita menentukan crime_rate sebagai variabel target. Kita akan membandingkan beberapa model yang nantinya akan dipilih model terbaik.

3.1 Semua Variabel Prediktor

model_all <- lm(crime_rate ~., crime_clean2)
summary(model_all)

## 
## Call:
## lm(formula = crime_rate ~ ., data = crime_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -327.03  -83.08   -9.26   68.87  238.84 
## 
## Coefficients:
##                         Estimate  Std. Error t value Pr(>|t|)  
## (Intercept)           -2721.7283   2057.2204  -1.323   0.2070  
## percent_m                10.1279      5.2467   1.930   0.0741 .
## is_south1               -20.5695    164.8588  -0.125   0.9025  
## mean_education           18.1182      7.9029   2.293   0.0379 *
## police_exp60              9.1780      3.8519   2.383   0.0319 *
## labour_participation      2.2147      2.2500   0.984   0.3417  
## m_per1000f               -2.5958      2.8024  -0.926   0.3700  
## state_pop                -6.5622      4.6464  -1.412   0.1797  
## nonwhites_per1000         2.1328      0.8602   2.479   0.0265 *
## unemploy_m24             -0.1969      4.9548  -0.040   0.9689  
## unemploy_m39             15.6557     11.0780   1.413   0.1794  
## gdp                      -0.3524      1.1885  -0.296   0.7712  
## inequality                6.3515      2.6186   2.426   0.0294 *
## prob_prison          -10505.1968   3867.4921  -2.716   0.0167 *
## time_prison              -9.0890     10.2716  -0.885   0.3912  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 170 on 14 degrees of freedom
## Multiple R-squared:  0.8533, Adjusted R-squared:  0.7066 
## F-statistic: 5.816 on 14 and 14 DF,  p-value: 0.001107

Dari hasil di atas, terlihat bahwa variabel prediktor yang signifikan adalah police_exp60, nonwhites_per1000, inequality dan prob_prison

3.2 Menggunakan Variabel Prediktor Signifikan

model_sig <- lm(crime_rate ~ police_exp60 + nonwhites_per1000 +inequality + prob_prison, crime_clean2)
summary(model_sig)

## 
## Call:
## lm(formula = crime_rate ~ police_exp60 + nonwhites_per1000 + 
##     inequality + prob_prison, data = crime_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403.25 -170.02   38.21  123.82  439.33 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)   
## (Intercept)        -409.7022   518.4416  -0.790  0.43712   
## police_exp60          8.5737     2.8646   2.993  0.00631 **
## nonwhites_per1000     1.0466     0.7064   1.482  0.15147   
## inequality            4.4020     1.7294   2.545  0.01776 * 
## prob_prison       -8056.3623  3514.6651  -2.292  0.03096 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 220.2 on 24 degrees of freedom
## Multiple R-squared:  0.5779, Adjusted R-squared:  0.5076 
## F-statistic: 8.215 on 4 and 24 DF,  p-value: 0.0002538

Dari hasil di atas, ternyata model yang menggunakan semua variabel prediktor hasilnya lebih baik, sehingga akan dipakai untuk model stepwise backward.

3.3 Stepwise Backward

mb <- step(model_all, direction="backward")

## Start:  AIC=306.76
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
##     time_prison
## 
##                        Df Sum of Sq    RSS    AIC
## - unemploy_m24          1        46 404658 304.76
## - is_south              1       450 405062 304.79
## - gdp                   1      2540 407152 304.94
## - time_prison           1     22629 427241 306.34
## - m_per1000f            1     24796 429408 306.48
## - labour_participation  1     28003 432615 306.70
## <none>                              404612 306.76
## - state_pop             1     57647 462259 308.62
## - unemploy_m39          1     57721 462333 308.63
## - percent_m             1    107689 512301 311.60
## - mean_education        1    151905 556517 314.00
## - police_exp60          1    164078 568690 314.63
## - inequality            1    170027 574639 314.93
## - nonwhites_per1000     1    177660 582272 315.31
## - prob_prison           1    213236 617848 317.03
## 
## Step:  AIC=304.76
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m39 + gdp + inequality + prob_prison + time_prison
## 
##                        Df Sum of Sq    RSS    AIC
## - is_south              1       405 405062 302.79
## - gdp                   1      2496 407153 302.94
## - time_prison           1     25159 429817 304.51
## <none>                              404658 304.76
## - labour_participation  1     28957 433614 304.77
## - m_per1000f            1     29184 433841 304.78
## - state_pop             1     69987 474644 307.39
## - unemploy_m39          1     96322 500979 308.95
## - percent_m             1    107650 512307 309.60
## - mean_education        1    153455 558113 312.09
## - police_exp60          1    166701 571358 312.77
## - inequality            1    171211 575869 312.99
## - nonwhites_per1000     1    182100 586758 313.54
## - prob_prison           1    216376 621034 315.18
## 
## Step:  AIC=302.79
## crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 + 
##     gdp + inequality + prob_prison + time_prison
## 
##                        Df Sum of Sq    RSS    AIC
## - gdp                   1      2707 407769 300.98
## - time_prison           1     25083 430146 302.53
## <none>                              405062 302.79
## - m_per1000f            1     29006 434068 302.80
## - labour_participation  1     39631 444693 303.50
## - state_pop             1     70955 476017 305.47
## - unemploy_m39          1     97184 502246 307.03
## - percent_m             1    111289 516351 307.83
## - mean_education        1    156451 561513 310.26
## - police_exp60          1    174738 579800 311.19
## - inequality            1    186249 591311 311.76
## - prob_prison           1    219466 624528 313.35
## - nonwhites_per1000     1    238617 643679 314.22
## 
## Step:  AIC=300.98
## crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
##     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 + 
##     inequality + prob_prison + time_prison
## 
##                        Df Sum of Sq    RSS    AIC
## <none>                              407769 300.98
## - time_prison           1     30088 437857 301.05
## - m_per1000f            1     34977 442747 301.37
## - labour_participation  1     40397 448166 301.72
## - state_pop             1     69263 477033 303.53
## - unemploy_m39          1     95590 503359 305.09
## - percent_m             1    115604 523373 306.22
## - mean_education        1    162231 570001 308.70
## - police_exp60          1    181814 589584 309.68
## - prob_prison           1    225223 632992 311.74
## - nonwhites_per1000     1    248620 656389 312.79
## - inequality            1    353059 760828 317.07

summary(mb)

## 
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
##     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
##     unemploy_m39 + inequality + prob_prison + time_prison, data = crime_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -316.85  -74.43  -14.32   66.24  235.08 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)   
## (Intercept)           -2876.268   1756.415  -1.638  0.11989   
## percent_m                10.353      4.716   2.195  0.04231 * 
## mean_education           18.198      6.997   2.601  0.01865 * 
## police_exp60              8.836      3.209   2.753  0.01358 * 
## labour_participation      2.364      1.822   1.298  0.21171   
## m_per1000f               -2.808      2.325  -1.208  0.24375   
## state_pop                -6.415      3.775  -1.699  0.10749   
## nonwhites_per1000         2.112      0.656   3.219  0.00503 **
## unemploy_m39             15.264      7.646   1.996  0.06218 . 
## inequality                6.727      1.753   3.837  0.00132 **
## prob_prison          -10209.503   3331.819  -3.064  0.00702 **
## time_prison              -9.832      8.779  -1.120  0.27829   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 154.9 on 17 degrees of freedom
## Multiple R-squared:  0.8521, Adjusted R-squared:  0.7565 
## F-statistic: 8.907 on 11 and 17 DF,  p-value: 0.00004636

3.4 Perbandingan

model <- compare_performance(model_all, model_sig, mb)

model_fix <- as.data.frame(model)

model_fix

##        Name Model      AIC       AIC_wt     AICc         AICc_wt      BIC
## 1 model_all    lm 391.0567 0.0527721133 436.3900 0.0000001746681 412.9334
## 2 model_sig    lm 401.7015 0.0002575815 405.5197 0.8823167331928 409.9053
## 3        mb    lm 385.2821 0.9469703052 409.5488 0.1176830921391 403.0570
##        BIC_wt        R2 R2_adjusted     RMSE    Sigma
## 1 0.006893314 0.8532824   0.7065648 118.1192 170.0025
## 2 0.031331666 0.5779064   0.5075575 200.3474 220.2303
## 3 0.961775020 0.8521375   0.7564617 118.5792 154.8756

Dari hasil di atas, dipilih model backward karena memiliki nilai R adjusted paling tinggi (0.75) dan RMSE sebesar 118.5)

4 Pengujian

4.1 Normality of Residuals

hist(mb$residuals)

4.2 Shapiro Test

shapiro.test(mb$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  mb$residuals
## W = 0.95965, p-value = 0.3223

Hasil menunjukkan p-value > 0.05, berarti error terdistribusi normal dan gagal tolak H0.

4.3 Homoscedasticity of Residuals

Visualisasi pengujian ini menggunakan scatter plot

plot(x = mb$fitted.values, y = mb$residuals)
abline(h = 0, col = "hotpink")

Dari plot di atas, dapat disimpulkan bahwa residual tersebar acak.

4.4 Variance Inflation Factor (VIF)

vif(mb)

##            percent_m       mean_education         police_exp60 
##             3.415692             7.199900             5.974537 
## labour_participation           m_per1000f            state_pop 
##             5.987367             3.144851             3.601034 
##    nonwhites_per1000         unemploy_m39           inequality 
##             3.249053             3.654311             5.375351 
##          prob_prison          time_prison 
##             3.861701             2.742363

Hasil di atas menunjukkan bahwa tidak ada nilai 10 atau lebih, sehingga dapat disimpulkan bahwa antar variabel prediktor bersifat saling independen.

5 KESIMPULAN

Model backward merupakan model yang baik untuk melakukan regresi linear. Hasil model ini menunjukkan bahwa crime_rate berkorelasi positif dengan: - percent_m - mean_education - police_exp60 - nonwhites_per1000 - inequality

Sementara itu, crime_rate berkorelasi negatif dengan prob_prison.

LBB Regression Model

Ali Ramadhani

2023-01-03