Data crime adalah data yang menunjukkan tingkat
kejahatan pada suatu daerah tertentu. Kolom-kolom yang terdapat pada
data crime adalah:
percent_m: percentage of males aged 14-24is_south: whether it is in a Southern state. 1 for Yes,
0 for No.mean_education: mean years of schoolingpolice_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation
ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per
1000 peopleunemploy_m24: unemployment rate of urban males aged
14-24unemploy_m39: unemployment rate of urban males aged
35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: average time served in prisonscrime_rate: crime rate in an unspecified category# install package
library(dplyr)
library(leaps)
library(readr)
library(tufte)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(performance)# Pemanggilan data
crime <- read.csv("crime.csv") %>%
dplyr::select(-X)
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
#Cek struktur data
str(crime)## 'data.frame': 47 obs. of 16 variables:
## $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
## $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
## $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
## $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
## $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
## $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
## $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
## $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
## $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
## $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
## $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
## $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
## $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
## $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
## $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
## $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Selanjutnya, kita mengecek apakah ada data yang hilang pada dataframe
menggunakan colSums
colSums(is.na(crime))## percent_m is_south mean_education
## 0 0 0
## police_exp60 police_exp59 labour_participation
## 0 0 0
## m_per1000f state_pop nonwhites_per1000
## 0 0 0
## unemploy_m24 unemploy_m39 gdp
## 0 0 0
## inequality prob_prison time_prison
## 0 0 0
## crime_rate
## 0
Kolom is_south dapat diubah menjadi tipe factor
crime_clean <- crime %>%
mutate(is_south = as.factor(is_south))ggcorr(crime_clean, label = TRUE, label_size = 2, hjust = 1, layout.exp = 2)## Warning in ggcorr(crime_clean, label = TRUE, label_size = 2, hjust = 1, : data
## in column(s) 'is_south' are not numeric and were ignored
Dari diagram di atas, terdapat korelasi yang sangat tinggi antara kolom
police_exp59 dan police_exp60 (bisa jadi
datanya identical) sehingga kita buang salah 1. Kita akan menggunakan
kolom yang relative lebih baru yaitu kolom
police_exp60.
# membuang kolom
crime_clean <- crime_clean %>%
select(-c(police_exp59))boxplot(crime_clean, cex.axis = 0.5)boxplot(crime_clean$percent_m, cex.axis = 1)boxplot(crime_clean$m_per1000f, cex.axis = 1)boxplot(crime_clean$state_pop, cex.axis = 1)boxplot(crime_clean$nonwhites_per1000, cex.axis = 1)boxplot(crime_clean$unemploy_m24, cex.axis = 1)boxplot(crime_clean$unemploy_m39, cex.axis = 1)boxplot(crime_clean$prob_prison, cex.axis = 1)boxplot(crime_clean$time_prison, cex.axis = 1)boxplot(crime_clean$crime_rate, cex.axis = 1)Kolom percent_m, m_per1000f,
state_pop, nonwhites_per1000,
unemploy_m24, unemploy_m39,
prob_prison, time_prison, dan
crime rate memiliki outliers yang harus dibuang
crime_clean2 <-
crime_clean %>%
filter(percent_m < 169, m_per1000f < 1035, state_pop< 75, nonwhites_per1000 < 290, unemploy_m24 < 138, unemploy_m39< 55, prob_prison < 0.085, time_prison < 43, crime_rate < 1650)Selanjutnya, kita menentukan crime_rate sebagai variabel
target. Kita akan membandingkan beberapa model yang nantinya akan
dipilih model terbaik.
model_all <- lm(crime_rate ~., crime_clean2)
summary(model_all)##
## Call:
## lm(formula = crime_rate ~ ., data = crime_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -327.03 -83.08 -9.26 68.87 238.84
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2721.7283 2057.2204 -1.323 0.2070
## percent_m 10.1279 5.2467 1.930 0.0741 .
## is_south1 -20.5695 164.8588 -0.125 0.9025
## mean_education 18.1182 7.9029 2.293 0.0379 *
## police_exp60 9.1780 3.8519 2.383 0.0319 *
## labour_participation 2.2147 2.2500 0.984 0.3417
## m_per1000f -2.5958 2.8024 -0.926 0.3700
## state_pop -6.5622 4.6464 -1.412 0.1797
## nonwhites_per1000 2.1328 0.8602 2.479 0.0265 *
## unemploy_m24 -0.1969 4.9548 -0.040 0.9689
## unemploy_m39 15.6557 11.0780 1.413 0.1794
## gdp -0.3524 1.1885 -0.296 0.7712
## inequality 6.3515 2.6186 2.426 0.0294 *
## prob_prison -10505.1968 3867.4921 -2.716 0.0167 *
## time_prison -9.0890 10.2716 -0.885 0.3912
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 170 on 14 degrees of freedom
## Multiple R-squared: 0.8533, Adjusted R-squared: 0.7066
## F-statistic: 5.816 on 14 and 14 DF, p-value: 0.001107
Dari hasil di atas, terlihat bahwa variabel prediktor yang signifikan
adalah police_exp60, nonwhites_per1000,
inequality dan prob_prison
model_sig <- lm(crime_rate ~ police_exp60 + nonwhites_per1000 +inequality + prob_prison, crime_clean2)
summary(model_sig)##
## Call:
## lm(formula = crime_rate ~ police_exp60 + nonwhites_per1000 +
## inequality + prob_prison, data = crime_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403.25 -170.02 38.21 123.82 439.33
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -409.7022 518.4416 -0.790 0.43712
## police_exp60 8.5737 2.8646 2.993 0.00631 **
## nonwhites_per1000 1.0466 0.7064 1.482 0.15147
## inequality 4.4020 1.7294 2.545 0.01776 *
## prob_prison -8056.3623 3514.6651 -2.292 0.03096 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 220.2 on 24 degrees of freedom
## Multiple R-squared: 0.5779, Adjusted R-squared: 0.5076
## F-statistic: 8.215 on 4 and 24 DF, p-value: 0.0002538
Dari hasil di atas, ternyata model yang menggunakan semua variabel prediktor hasilnya lebih baik, sehingga akan dipakai untuk model stepwise backward.
mb <- step(model_all, direction="backward")## Start: AIC=306.76
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
## time_prison
##
## Df Sum of Sq RSS AIC
## - unemploy_m24 1 46 404658 304.76
## - is_south 1 450 405062 304.79
## - gdp 1 2540 407152 304.94
## - time_prison 1 22629 427241 306.34
## - m_per1000f 1 24796 429408 306.48
## - labour_participation 1 28003 432615 306.70
## <none> 404612 306.76
## - state_pop 1 57647 462259 308.62
## - unemploy_m39 1 57721 462333 308.63
## - percent_m 1 107689 512301 311.60
## - mean_education 1 151905 556517 314.00
## - police_exp60 1 164078 568690 314.63
## - inequality 1 170027 574639 314.93
## - nonwhites_per1000 1 177660 582272 315.31
## - prob_prison 1 213236 617848 317.03
##
## Step: AIC=304.76
## crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m39 + gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - is_south 1 405 405062 302.79
## - gdp 1 2496 407153 302.94
## - time_prison 1 25159 429817 304.51
## <none> 404658 304.76
## - labour_participation 1 28957 433614 304.77
## - m_per1000f 1 29184 433841 304.78
## - state_pop 1 69987 474644 307.39
## - unemploy_m39 1 96322 500979 308.95
## - percent_m 1 107650 512307 309.60
## - mean_education 1 153455 558113 312.09
## - police_exp60 1 166701 571358 312.77
## - inequality 1 171211 575869 312.99
## - nonwhites_per1000 1 182100 586758 313.54
## - prob_prison 1 216376 621034 315.18
##
## Step: AIC=302.79
## crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
## m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 +
## gdp + inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## - gdp 1 2707 407769 300.98
## - time_prison 1 25083 430146 302.53
## <none> 405062 302.79
## - m_per1000f 1 29006 434068 302.80
## - labour_participation 1 39631 444693 303.50
## - state_pop 1 70955 476017 305.47
## - unemploy_m39 1 97184 502246 307.03
## - percent_m 1 111289 516351 307.83
## - mean_education 1 156451 561513 310.26
## - police_exp60 1 174738 579800 311.19
## - inequality 1 186249 591311 311.76
## - prob_prison 1 219466 624528 313.35
## - nonwhites_per1000 1 238617 643679 314.22
##
## Step: AIC=300.98
## crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
## m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 +
## inequality + prob_prison + time_prison
##
## Df Sum of Sq RSS AIC
## <none> 407769 300.98
## - time_prison 1 30088 437857 301.05
## - m_per1000f 1 34977 442747 301.37
## - labour_participation 1 40397 448166 301.72
## - state_pop 1 69263 477033 303.53
## - unemploy_m39 1 95590 503359 305.09
## - percent_m 1 115604 523373 306.22
## - mean_education 1 162231 570001 308.70
## - police_exp60 1 181814 589584 309.68
## - prob_prison 1 225223 632992 311.74
## - nonwhites_per1000 1 248620 656389 312.79
## - inequality 1 353059 760828 317.07
summary(mb)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
## unemploy_m39 + inequality + prob_prison + time_prison, data = crime_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -316.85 -74.43 -14.32 66.24 235.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2876.268 1756.415 -1.638 0.11989
## percent_m 10.353 4.716 2.195 0.04231 *
## mean_education 18.198 6.997 2.601 0.01865 *
## police_exp60 8.836 3.209 2.753 0.01358 *
## labour_participation 2.364 1.822 1.298 0.21171
## m_per1000f -2.808 2.325 -1.208 0.24375
## state_pop -6.415 3.775 -1.699 0.10749
## nonwhites_per1000 2.112 0.656 3.219 0.00503 **
## unemploy_m39 15.264 7.646 1.996 0.06218 .
## inequality 6.727 1.753 3.837 0.00132 **
## prob_prison -10209.503 3331.819 -3.064 0.00702 **
## time_prison -9.832 8.779 -1.120 0.27829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154.9 on 17 degrees of freedom
## Multiple R-squared: 0.8521, Adjusted R-squared: 0.7565
## F-statistic: 8.907 on 11 and 17 DF, p-value: 0.00004636
model <- compare_performance(model_all, model_sig, mb)
model_fix <- as.data.frame(model)
model_fix## Name Model AIC AIC_wt AICc AICc_wt BIC
## 1 model_all lm 391.0567 0.0527721133 436.3900 0.0000001746681 412.9334
## 2 model_sig lm 401.7015 0.0002575815 405.5197 0.8823167331928 409.9053
## 3 mb lm 385.2821 0.9469703052 409.5488 0.1176830921391 403.0570
## BIC_wt R2 R2_adjusted RMSE Sigma
## 1 0.006893314 0.8532824 0.7065648 118.1192 170.0025
## 2 0.031331666 0.5779064 0.5075575 200.3474 220.2303
## 3 0.961775020 0.8521375 0.7564617 118.5792 154.8756
Dari hasil di atas, dipilih model backward karena memiliki nilai R adjusted paling tinggi (0.75) dan RMSE sebesar 118.5)
hist(mb$residuals)shapiro.test(mb$residuals)##
## Shapiro-Wilk normality test
##
## data: mb$residuals
## W = 0.95965, p-value = 0.3223
Hasil menunjukkan p-value > 0.05, berarti error terdistribusi normal dan gagal tolak H0.
Visualisasi pengujian ini menggunakan scatter plot
plot(x = mb$fitted.values, y = mb$residuals)
abline(h = 0, col = "hotpink")Dari plot di atas, dapat disimpulkan bahwa residual tersebar acak.
vif(mb)## percent_m mean_education police_exp60
## 3.415692 7.199900 5.974537
## labour_participation m_per1000f state_pop
## 5.987367 3.144851 3.601034
## nonwhites_per1000 unemploy_m39 inequality
## 3.249053 3.654311 5.375351
## prob_prison time_prison
## 3.861701 2.742363
Hasil di atas menunjukkan bahwa tidak ada nilai 10 atau lebih, sehingga dapat disimpulkan bahwa antar variabel prediktor bersifat saling independen.
Model backward merupakan model yang baik untuk melakukan regresi
linear. Hasil model ini menunjukkan bahwa crime_rate
berkorelasi positif dengan: - percent_m -
mean_education - police_exp60 -
nonwhites_per1000 - inequality
Sementara itu, crime_rate berkorelasi negatif dengan
prob_prison.