Pada kesempatan kali ini, kita akan melakukan memprediksi data Crime Rate dari data kriminalitas yang ada di Amerika pada tahun 1960. Prediksi akan dilakukan dengan metode regresi linear. Data yang akan digunakan adalah dataset yang telah disediakan oleh platform Kaggle pada link dibawah berikut
library(dplyr)
library(GGally)
library(MLmetrics)
library(car)
library(lmtest)Input data
crime <- read.csv("data_input/crime.csv") %>%
select(-X) # buang kolom X karena hanya nomor baris
head(crime)karena variable masih dalam singkatan, akan kita ubah agar lebih mudah
# rename kolom agar lebih deskriptif
names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)Deskripsi kolom/variable:
percent_m: The number of males of age 14-24 per 1000 populationis_south: whether it is in a Southern state. 1 for Yes, 0 for No.mean_education: Mean # of years of schooling x 10 for persons of age 25 or olderpolice_exp60: 1960 per capita expenditure on police by state and local governmentpolice_exp59: 1959 per capita expenditure on police by state and local governmentlabour_participation: Labor force participation rate per 1000 civilian urban males age 14-24m_per1000f: nThe number of males per 1000 femalesstate_pop: State population size in hundred thousandsnonwhites_per1000: number of non-whites resident per 1000 peopleunemploy_m24: Unemployment rate of urban males per 1000 of age 14-24unemploy_m39: Unemployment rate of urban males per 1000 of age 35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified categoryCek tipe data & ubah tipe data yang belum sesuai
str(crime)#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
crime <- crime %>% mutate(is_south = as.factor(is_south))
str(crime)#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Selanjutnya akan kita lakukan split data untuk membagi data train & juga data test
RNGkind(sample.kind = "Rounding")
set.seed(222)
#spli data train & test
index <- sample(nrow(crime), nrow(crime)*0.8)
crime_train <- crime[index,]
crime_test <- crime[-index,]Dalam hal ini, kita akan melakukan prediksi terhadap variable crime_rate menggunakan seluruh variable
model_crime <- lm(crime_rate~.,data = crime_train)
summary(model_crime)#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -310.91 -117.87 1.72 105.58 447.31
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3593.2482 1851.4189 -1.941 0.0658 .
#> percent_m 14.0620 5.4571 2.577 0.0176 *
#> is_south1 14.0423 158.0050 0.089 0.9300
#> mean_education 18.2434 7.9205 2.303 0.0316 *
#> police_exp60 6.5722 12.1225 0.542 0.5934
#> police_exp59 6.1773 13.3735 0.462 0.6489
#> labour_participation 1.4084 2.3318 0.604 0.5523
#> m_per1000f -1.7021 3.0878 -0.551 0.5873
#> state_pop -0.7578 1.4727 -0.515 0.6122
#> nonwhites_per1000 0.1740 0.6872 0.253 0.8026
#> unemploy_m24 -5.1227 5.1321 -0.998 0.3296
#> unemploy_m39 21.7769 11.3853 1.913 0.0695 .
#> gdp -0.4605 1.1399 -0.404 0.6903
#> inequality 5.6984 2.5441 2.240 0.0361 *
#> prob_prison -8295.4333 3580.3621 -2.317 0.0307 *
#> time_prison -8.7791 8.3117 -1.056 0.3029
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 198.3 on 21 degrees of freedom
#> Multiple R-squared: 0.8613, Adjusted R-squared: 0.7621
#> F-statistic: 8.69 on 15 and 21 DF, p-value: 6.542e-06
Berdasarkan hasil diatas, jika diperhatikan, tidak semua variable berperan signifikan terhadap target variable yakni crime_rate. Karena itu perlu kita cek terlebih dahulu bagaimana korelasi dari masing-masing prediktor terhadap targetnya
Berdasarkan hasil plot korelasi diatas, kita akan membuat 3 model yakni sebagai berikut: 1. Model regresi linear menggunakan seluruh variable sebagai prediktor (model_crime) 2. Model regresi linear menggunakan variable exp59 saja sebagai prediktor (model_crime_2) 3. Model regresi linear menggunakan stepwise regression (backward) (model_crime_3)
model_crime_2 <- lm(crime_rate~police_exp59, crime_train)
plot(crime_train$police_exp59, crime_train$crime_rate)
abline(model_crime_2, col = "red", lty = 2)summary(model_crime_2)#>
#> Call:
#> lm(formula = crime_rate ~ police_exp59, data = crime_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -717.74 -128.62 20.12 185.08 489.07
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -59.691 139.754 -0.427 0.672
#> police_exp59 12.691 1.714 7.406 1.15e-08 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 257.4 on 35 degrees of freedom
#> Multiple R-squared: 0.6105, Adjusted R-squared: 0.5993
#> F-statistic: 54.85 on 1 and 35 DF, p-value: 1.151e-08
model_crime_3 <- step(object = model_crime,
direction = 'backward')#> Start: AIC=402.5
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
#> police_exp59 + labour_participation + m_per1000f + state_pop +
#> nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality +
#> prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - is_south 1 311 826388 400.51
#> - nonwhites_per1000 1 2520 828598 400.61
#> - gdp 1 6419 832496 400.79
#> - police_exp59 1 8393 834470 400.87
#> - state_pop 1 10415 836493 400.96
#> - police_exp60 1 11562 837640 401.01
#> - m_per1000f 1 11953 838031 401.03
#> - labour_participation 1 14351 840429 401.14
#> - unemploy_m24 1 39193 865271 402.22
#> - time_prison 1 43885 869963 402.42
#> <none> 826077 402.50
#> - unemploy_m39 1 143913 969991 406.44
#> - inequality 1 197345 1023422 408.43
#> - mean_education 1 208692 1034769 408.83
#> - prob_prison 1 211167 1037244 408.92
#> - percent_m 1 261197 1087274 410.67
#>
#> Step: AIC=400.51
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
#> labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
#> unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
#> time_prison
#>
#> Df Sum of Sq RSS AIC
#> - nonwhites_per1000 1 3965 830354 398.69
#> - gdp 1 6275 832663 398.79
#> - police_exp59 1 8171 834560 398.88
#> - state_pop 1 10106 836495 398.96
#> - police_exp60 1 11615 838004 399.03
#> - m_per1000f 1 11992 838380 399.05
#> - labour_participation 1 16479 842868 399.24
#> - time_prison 1 45225 871613 400.49
#> <none> 826388 400.51
#> - unemploy_m24 1 53405 879793 400.83
#> - unemploy_m39 1 153679 980067 404.82
#> - mean_education 1 212453 1038841 406.98
#> - prob_prison 1 212639 1039027 406.99
#> - inequality 1 217143 1043532 407.15
#> - percent_m 1 262234 1088622 408.71
#>
#> Step: AIC=398.69
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
#> labour_participation + m_per1000f + state_pop + unemploy_m24 +
#> unemploy_m39 + gdp + inequality + prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - gdp 1 7918 838272 397.04
#> - state_pop 1 9833 840186 397.13
#> - police_exp60 1 10201 840555 397.14
#> - police_exp59 1 12055 842409 397.22
#> - m_per1000f 1 12266 842620 397.23
#> - labour_participation 1 14621 844974 397.34
#> - time_prison 1 41557 871911 398.50
#> <none> 830354 398.69
#> - unemploy_m24 1 50103 880457 398.86
#> - unemploy_m39 1 150067 980420 402.84
#> - mean_education 1 208837 1039190 404.99
#> - prob_prison 1 228633 1058987 405.69
#> - inequality 1 250287 1080640 406.44
#> - percent_m 1 265735 1096088 406.96
#>
#> Step: AIC=397.04
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 +
#> labour_participation + m_per1000f + state_pop + unemploy_m24 +
#> unemploy_m39 + inequality + prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - police_exp59 1 9823 848095 395.47
#> - police_exp60 1 11959 850230 395.57
#> - state_pop 1 14105 852376 395.66
#> - labour_participation 1 15627 853899 395.73
#> - m_per1000f 1 16419 854691 395.76
#> - time_prison 1 39819 878090 396.76
#> - unemploy_m24 1 44007 882279 396.94
#> <none> 838272 397.04
#> - unemploy_m39 1 142720 980991 400.86
#> - mean_education 1 203510 1041781 403.08
#> - prob_prison 1 232932 1071204 404.11
#> - percent_m 1 285303 1123575 405.88
#> - inequality 1 490705 1328977 412.09
#>
#> Step: AIC=395.47
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
#> m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + inequality +
#> prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - labour_participation 1 9630 857725 393.89
#> - state_pop 1 13455 861550 394.06
#> - m_per1000f 1 17054 865149 394.21
#> <none> 848095 395.47
#> - unemploy_m24 1 48395 896490 395.53
#> - time_prison 1 50409 898504 395.61
#> - unemploy_m39 1 139146 987241 399.09
#> - prob_prison 1 238546 1086640 402.64
#> - mean_education 1 260169 1108264 403.37
#> - percent_m 1 276597 1124692 403.92
#> - inequality 1 522093 1370188 411.22
#> - police_exp60 1 916631 1764726 420.59
#>
#> Step: AIC=393.89
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f +
#> state_pop + unemploy_m24 + unemploy_m39 + inequality + prob_prison +
#> time_prison
#>
#> Df Sum of Sq RSS AIC
#> - m_per1000f 1 7695 865419 392.22
#> - state_pop 1 8274 865999 392.25
#> - time_prison 1 41103 898828 393.62
#> <none> 857725 393.89
#> - unemploy_m24 1 60420 918145 394.41
#> - unemploy_m39 1 130472 988197 397.13
#> - prob_prison 1 229020 1086744 400.65
#> - mean_education 1 279982 1137706 402.34
#> - percent_m 1 340490 1198215 404.26
#> - inequality 1 519862 1377586 409.42
#> - police_exp60 1 913013 1770737 418.71
#>
#> Step: AIC=392.22
#> crime_rate ~ percent_m + mean_education + police_exp60 + state_pop +
#> unemploy_m24 + unemploy_m39 + inequality + prob_prison +
#> time_prison
#>
#> Df Sum of Sq RSS AIC
#> - state_pop 1 3290 868710 390.36
#> - time_prison 1 35430 900849 391.71
#> <none> 865419 392.22
#> - unemploy_m24 1 85250 950669 393.70
#> - unemploy_m39 1 142409 1007829 395.86
#> - prob_prison 1 221674 1087094 398.66
#> - percent_m 1 333821 1199241 402.29
#> - mean_education 1 334088 1199508 402.30
#> - inequality 1 585816 1451235 409.35
#> - police_exp60 1 1021500 1886919 419.06
#>
#> Step: AIC=390.36
#> crime_rate ~ percent_m + mean_education + police_exp60 + unemploy_m24 +
#> unemploy_m39 + inequality + prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - time_prison 1 47231 915941 390.32
#> <none> 868710 390.36
#> - unemploy_m24 1 83686 952396 391.77
#> - unemploy_m39 1 140035 1008744 393.89
#> - prob_prison 1 221670 1090379 396.77
#> - mean_education 1 333975 1202685 400.40
#> - percent_m 1 367702 1236411 401.42
#> - inequality 1 589726 1458436 407.53
#> - police_exp60 1 1072386 1941096 418.11
#>
#> Step: AIC=390.32
#> crime_rate ~ percent_m + mean_education + police_exp60 + unemploy_m24 +
#> unemploy_m39 + inequality + prob_prison
#>
#> Df Sum of Sq RSS AIC
#> <none> 915941 390.32
#> - unemploy_m24 1 54103 970044 390.44
#> - unemploy_m39 1 105500 1021441 392.35
#> - prob_prison 1 188253 1104194 395.24
#> - percent_m 1 322844 1238785 399.49
#> - mean_education 1 389657 1305598 401.44
#> - inequality 1 594487 1510428 406.83
#> - police_exp60 1 1300243 2216184 421.01
summary(model_crime_3)#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> unemploy_m24 + unemploy_m39 + inequality + prob_prison, data = crime_train)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -310.06 -93.39 -19.63 102.32 375.43
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -4741.956 921.972 -5.143 1.70e-05 ***
#> percent_m 11.170 3.494 3.197 0.003343 **
#> mean_education 18.884 5.376 3.512 0.001475 **
#> police_exp60 12.247 1.909 6.416 5.12e-07 ***
#> unemploy_m24 -4.499 3.437 -1.309 0.200876
#> unemploy_m39 15.570 8.519 1.828 0.077910 .
#> inequality 6.296 1.451 4.338 0.000159 ***
#> prob_prison -4814.913 1972.205 -2.441 0.020968 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 177.7 on 29 degrees of freedom
#> Multiple R-squared: 0.8462, Adjusted R-squared: 0.809
#> F-statistic: 22.79 on 7 and 29 DF, p-value: 3.537e-10
Evaluasi model pada seen data (data train)
crime_pred <- data.frame(crime_train$crime_rate)
crime_pred$pred_all <- predict(model_crime, crime_train)
crime_pred$pred_backward <- predict(model_crime_3, crime_train)
crime_pred$pred_1 <- predict(model_crime_2, crime_train)
head(crime_pred)Setelah kita membuat 2 model, selanjutnya akan kita bandingkan hasil dari ketiga model tersebut menggunakan Mean Absolute Percentage Error agar eror yang dihasilkan berupa persentase
MAPE(y_pred = crime_pred$pred_all , y_true = crime_train$crime_rate)*100#> [1] 15.54173
MAPE(y_pred = crime_pred$pred_backward , y_true = crime_train$crime_rate)*100#> [1] 17.36276
MAPE(y_pred = crime_pred$pred_1 , y_true = crime_train$crime_rate)*100#> [1] 22.82917
Berdasarkan hasil diatas, dapat diranking berdasarkan 1. Model dengan seluruh variable 2. Model stepwise regression 3. Model 1 variable
Meskipun model telah dibuat, terdapat beberapa uji asumsi linearitas yang harus diuji yakni sebagai berikut
Pada uji linearitas diharapkan residual berada disekitar titik 0 yang menandakan variable prediktor memiliki hubungan lurus dengan target nya. Berdasarkan hasil di bawah, ketiga model tidak terlalu membentuk garis lurus ideal, namun residunya masih sebagian besar berada di titik 0
plot(model_crime, which = 1)plot(model_crime_2, which = 1)plot(model_crime_3, which = 1)Pada uji homoscedacity, diharapkan error yang dihasilkan oleh model menyebar secara acak. Untuk uji asumsi ini kita akan menggunakan Breusch-Pagan hypothesis test dimana p-value harus > daripada 0.05
bptest(model_crime)#>
#> studentized Breusch-Pagan test
#>
#> data: model_crime
#> BP = 18.332, df = 15, p-value = 0.2456
bptest(model_crime_2)#>
#> studentized Breusch-Pagan test
#>
#> data: model_crime_2
#> BP = 6.3493, df = 1, p-value = 0.01174
bptest(model_crime_3)#>
#> studentized Breusch-Pagan test
#>
#> data: model_crime_3
#> BP = 8.3745, df = 7, p-value = 0.3007
Berdasarkan hasil diatas, hanya model_crime (model dengan seluruh variable) & model_crime_3 (model dengan stepwise regression) yang memenuhi uji asumsi homoscedacity
Pada uji asumsi ini, diharapkan error yang muncul dengan distribusi normal sehingga eror lebih banyak pada nilai 0. Kita akan lihat distribusinya menggunakan histogram terlebih dahulu
hist(model_crime$residuals)hist(model_crime_2$residuals)hist(model_crime_3$residuals) Dari hasil di atas, ketiganya memiliki histogram eror yang terdistribusi normal, selanjutnya kita akan uji menggunakan Shaphiro test dimana p-value harus > 0.05
# shapiro test dari residual
shapiro.test(model_crime$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_crime$residuals
#> W = 0.96588, p-value = 0.3081
shapiro.test(model_crime_2$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_crime_2$residuals
#> W = 0.94479, p-value = 0.0656
shapiro.test(model_crime_3$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_crime_3$residuals
#> W = 0.9832, p-value = 0.8372
melalui shaphiro test, ketiga model memiliki nilai p-value > 0.05 sehingga dapat dikatakan ketiganya memiliki eror yang terdistribusi normal
Selanjutnya perlu kita uji multicollinearity dimana antar prediktor tidak boleh memiliki keterkaitan. Pengujian ini dilakukan dengan metode Variance Inflation Factor, dimana nilainya tidak boleh >10
vif(model_crime)#> percent_m is_south mean_education
#> 3.200306 5.145879 6.801945
#> police_exp60 police_exp59 labour_participation
#> 97.150366 102.604695 8.327902
#> m_per1000f state_pop nonwhites_per1000
#> 7.261454 2.466355 4.377014
#> unemploy_m24 unemploy_m39 gdp
#> 7.057757 7.325475 10.583290
#> inequality prob_prison time_prison
#> 8.924597 5.218725 3.601374
vif(model_crime_3)#> percent_m mean_education police_exp60 unemploy_m24 unemploy_m39
#> 1.633647 3.903302 2.999648 3.943235 5.108470
#> inequality prob_prison
#> 3.616152 1.972181
Terlihat pada model_crime terdapat multikolinearitas dimana varable police_exp59 dan police_exp60 memiliki keterkaitan nilainya dan ini menandakan kurang baik
Setelah kita menguji model yang telah dibuat, selanjutnya akan diuji pada data test, yakni data yang belum pernah dilihat oleh model sebelumnya
crime_pred_test <- data.frame(crime_test$crime_rate)
crime_pred_test$pred_all <- predict(model_crime, crime_test)
crime_pred_test$pred_backward <- predict(model_crime_3, crime_test)
crime_pred_test$pred_1 <- predict(model_crime_2, crime_test)
head(crime_pred_test)MAPE(y_pred = crime_pred_test$pred_all , y_true = crime_test$crime_rate)*100#> [1] 28.95856
MAPE(y_pred = crime_pred_test$pred_backward , y_true = crime_test$crime_rate)*100#> [1] 21.02302
MAPE(y_pred = crime_pred_test$pred_1 , y_true = crime_test$crime_rate)*100#> [1] 51.92493
Berdasarkan nilai eror dari ketiga model. Ketiganya menghasilkan eror yang masih cukup tinggil dimana nilainya >15%, namun jika dibandingkan hasilnya, dapat diurutkan dari eror terkecil sebagai berikut:
Model_crime_3 (Model dengan stepwise regression)Model_crime (Model dengan seluruh variable)Model_crime_2 (Model dengan 1 variable saja)Berdasarkan hasil pengujian, dapat disimpulkan bahwa stepwise regression untuk analisa linear regression bekerja paling baik untuk memprediksi crime rate pada data crime hal ini dikarenakan stepwise regression mengeliminasi variable - variable prediktor yang dapat menimbulkan informational loss sehingga tidak harus menggunakan seluruh variable sebagai prediktor.