Menggunakan data crime yang kita miliki, kita akan
membuat model regresi untuk memprediksi crime_rate dengan
menggunakan seluruh prediktor yang ada.
library(dplyr) # Data Cleansing
library(GGally) # Visualisasi (korelasi)
library(MLmetrics) # Metrics Error
library(performance) # Compare performance
library(lmtest) # Uji Asumsi
library(car) # Uji Asumsi
library(rsample)crime <- read.csv("data_input/crime.csv", stringsAsFactors = F) %>%
dplyr::select(-X) # take out kolom X
rmarkdown::paged_table(crime)names(crime) <- c("percent_m", "is_south", "mean_education", "police_exp60",
"police_exp59", "labour_participation", "m_per1000f",
"state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39",
"gdp", "inequality", "prob_prison", "time_prison", "crime_rate")
head(crime)glimpse(crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
Glossary data crime :
percent_m: percentage of males aged 14-24is_south: whether it is in a Southern state. 1 for Yes,
0 for No.mean_education: mean years of schoolingpolice_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation
ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per
1000 peopleunemploy_m24: unemployment rate of urban males aged
14-24unemploy_m39: unemployment rate of urban males aged
35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: avg time served in prisonscrime_rate: crime rate in an unspecified categoryUbah kolom is_south menjadi factor
crime <- crime %>%
mutate(is_south = as.factor(is_south))
glimpse(crime)## Rows: 47
## Columns: 16
## $ percent_m <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,~
## $ is_south <fct> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0~
## $ mean_education <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10~
## $ police_exp60 <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,~
## $ police_exp59 <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, ~
## $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,~
## $ m_per1000f <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102~
## $ state_pop <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2~
## $ nonwhites_per1000 <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106~
## $ unemploy_m24 <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83~
## $ unemploy_m39 <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2~
## $ gdp <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,~
## $ inequality <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,~
## $ prob_prison <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,~
## $ time_prison <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9~
## $ crime_rate <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, ~
anyNA(crime)## [1] FALSE
Sekarang kita akan melakukan pengecekan terhadap variabel apa saja yang mempunyai korelasi dengan crime_rate
ggcorr(crime, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)Variable police_exp59 dan police_exp60 memiliki korelasi yang cukup kuat dengan variabel target crime_rate
Karena kita akan membuat model untuk memprediksi crime_rate, maka model pertama yang akan kita coba adalah predictor yang memiliki korelasi yang cukup kuat dengan crime rate yakni police_59 dan police_exp60.
Sebelum melakukan pemodelan, terlebih dahulu dataset akan dibagi menjadi 2 bagian yakni 75% dari dataset untuk data train (pemodelan), dan 20% dari dataset untuk data test. Hal ini dilakukan agar nantinya dapat dilakukan pengecekan tentang seberapa baik data memprediksi data diluar data train, apakah model overfitting atau tidak.
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(x = nrow(crime), size = nrow(crime)*0.75)
crime_train <- crime[index,]
crime_test <- crime[-index,]boxplot(crime_train$crime_rate)
> Berdasarkan boxplot diatas, data sudah terdistribusi normal. Namun
terdapat nilai leverage. Selanjutnya akan dilakukan perbandingan apakah
menghilangkan leverage akan berpengaruh baik atau buruk terhadap
model.
hist(crime_train$crime_rate)crime_nl <- crime_train %>%
filter(crime_rate < 1500)strCor <- lm(crime_rate~police_exp59+police_exp60, crime_train)
summary(strCor)##
## Call:
## lm(formula = crime_rate ~ police_exp59 + police_exp60, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -514.2 -161.7 50.4 131.0 557.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 159.46 124.66 1.279 0.2100
## police_exp59 -28.76 13.73 -2.094 0.0442 *
## police_exp60 35.62 13.01 2.738 0.0100 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 254.7 on 32 degrees of freedom
## Multiple R-squared: 0.5761, Adjusted R-squared: 0.5496
## F-statistic: 21.74 on 2 and 32 DF, p-value: 0.000001088
Nilai Adjusted R-squared = 0.5496, artinya hanya 55% nilai crime_rate yang dapat dijelaskan oleh police_exp59 dan police_exp60. Maka dari itu diperlukan model yang lebih dapat menjelaskan crime_rate lebih baik.
Metode step-wise regression ini akan menghasilkan formula optimum berdasarkan nilai AIC yang terendah, dimana semakin rendah nilai AIC tersebut, maka nilai observasi yang tidak tertangkap semakin kecil.
Selanjutnya akan dicoba pemilihan variabel prediktor secara automatis menggunakan step-wise regression dengan metode both elimination.
crime_all <- lm(crime_rate~., crime_train)
crime_non <- lm(crime_rate~1, crime_train)
crime_both <- step(object = crime_all, direction = "both", scope = list(lower = crime_non, upper = crime_all), trace = F)
summary(crime_both)##
## Call:
## lm(formula = crime_rate ~ percent_m + police_exp60 + m_per1000f +
## gdp + inequality + prob_prison, data = crime_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -320.5 -117.8 10.8 104.7 372.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7072.5286 1298.3818 -5.447 0.00000818 ***
## percent_m 5.4635 3.3273 1.642 0.111769
## police_exp60 8.7766 1.7549 5.001 0.00002765 ***
## m_per1000f 3.9740 1.0792 3.682 0.000978 ***
## gdp 2.3764 0.9949 2.389 0.023896 *
## inequality 7.6033 1.7641 4.310 0.000182 ***
## prob_prison -3558.7773 1653.5475 -2.152 0.040148 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 180.5 on 28 degrees of freedom
## Multiple R-squared: 0.8138, Adjusted R-squared: 0.7738
## F-statistic: 20.39 on 6 and 28 DF, p-value: 0.000000004949
Dilihat dari summary model diatas model_both memiliki Adjusted R-squared sebesar 0.7738, maka dapat dikatakan bahwa variable percent_m, police_exp60, m_per1000f, gdp, inequality dapat prob_prison pada model crime_both dapat menjelaskan nilai crime_rate sebesar 77%. Model ini cukup baik jika dilihat pada nilai Adjusted R-squared dan model sebelumnya.
Untuk melihat pengaruh dari leverage yang ada, maka akan dilakukan perbandingan antara model yang menggunakan data asli dan data tanpa leverage
crime_all2 <- lm(crime_rate~., crime_nl)
crime_non2 <- lm(crime_rate~1, crime_nl)
crime_both2 <- step(object = crime_all2, direction = "both", scope = list(lower = crime_non2, upper = crime_all2), trace = F)
summary(crime_both2)##
## Call:
## lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
## police_exp59 + nonwhites_per1000 + gdp + inequality + prob_prison +
## time_prison, data = crime_nl)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212.61 -85.73 10.57 81.07 199.47
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2676.5184 1065.5207 -2.512 0.01984 *
## percent_m 4.5616 2.8314 1.611 0.12141
## mean_education 7.5453 4.3623 1.730 0.09770 .
## police_exp60 25.3815 9.5046 2.670 0.01397 *
## police_exp59 -21.1871 9.8326 -2.155 0.04239 *
## nonwhites_per1000 0.7341 0.4493 1.634 0.11654
## gdp 1.9173 0.7627 2.514 0.01975 *
## inequality 6.3728 1.7529 3.636 0.00146 **
## prob_prison -4823.2306 1612.8927 -2.990 0.00674 **
## time_prison -17.1741 5.7798 -2.971 0.00705 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 131.6 on 22 degrees of freedom
## Multiple R-squared: 0.7642, Adjusted R-squared: 0.6678
## F-statistic: 7.924 on 9 and 22 DF, p-value: 0.00003856
Jika dilihat pada summary model diatas model both tanpa leverage ini memiliki nilai Adjusted R-squared sebesar 0.6678. artinya nilai crime_rate yang dapat dijelaskan pada model ini hanya 67%. Jika dibandingkan dengan model tanpa membuang nilai leverage pada model crime_both sebelumnya, dapat dilihat model_both memiliki nilai Adjusted R-squared sebesar 0.7738, artinya leverage berpengaruh positif pada model prediksi. Maka diambil keputusan bahwa leverage tidak akan dibuang.
Untuk mengukur seberapa baik performa model dalam memprediksi target adalah dapat dilihat pada nilai Erornya. Kali ini akan digunakan RMSE untuk melihat seberapa besar error antara nilai aktual dan nilai prediksi yang ada.
RMSE(y_pred = crime_both$fitted.values, y_true = crime_train$crime_rate)## [1] 161.4481
summary(crime_train$crime_rate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 342.0 673.0 826.0 888.2 965.5 1993.0
Berdasarkan nilai RMSE dan range dari nilai crime_rate pada data train diatas, maka dapat diambil kesimpulan bahwa nilai eror masih jauh lebih kecil dari range data crime_rate. Artinya model sudah cukup baik jika ditinjau dari segi nilai error yang dimilki.
test_pred <- predict(object = crime_both2, newdata = crime_test)
RMSE(y_pred = test_pred, y_true = crime_test$crime_rate)## [1] 308.0406
summary(crime_test$crime_rate)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 373.0 534.2 988.0 954.2 1230.0 1635.0
Berdasarkan RMSE yang dihasilkan diatas dan jika dibandingkan dengan range nilai pada data test, maka dapat dikatakan bahwa nilai eror yang dihasilkan cukup besar.
Jika dilihat perbandingan error data train dan test memiliki nilai error yang terpaut cukup jauh. Model dapat memprediksi crime_rate cukup baik pada data yang sudah dikenali yang dipakai pada data train, namun model menghasilkan prediksi yang cukup buruk jika memprediksi data baru. Maka saya asumsikan bahwa model mengalami Overfitting yakni model hanya relevan dengan kumpulan datanya, dan tidak relevan dengan kumpulan data lainnya.
H0 : Tidak Linear
H1 : Linear
percent_m + police_exp60 + m_per1000f + gdp + inequality + prob_prison
cor.test(crime_train$percent_m, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$percent_m and crime_train$crime_rate
## t = -1.1439, df = 33, p-value = 0.2609
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4962407 0.1475605
## sample estimates:
## cor
## -0.1952892
cor.test(crime_train$police_exp60, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$police_exp60 and crime_train$crime_rate
## t = 5.9546, df = 33, p-value = 0.000001103
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5083601 0.8492532
## sample estimates:
## cor
## 0.7196853
cor.test(crime_train$m_per1000f, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$m_per1000f and crime_train$crime_rate
## t = 1.8281, df = 33, p-value = 0.07659
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.03337355 0.57807449
## sample estimates:
## cor
## 0.3032456
cor.test(crime_train$gdp, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$gdp and crime_train$crime_rate
## t = 3.5851, df = 33, p-value = 0.001074
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2382205 0.7333044
## sample estimates:
## cor
## 0.529437
cor.test(crime_train$inequality, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$inequality and crime_train$crime_rate
## t = -1.4805, df = 33, p-value = 0.1482
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.53805917 0.09127724
## sample estimates:
## cor
## -0.2495604
cor.test(crime_train$prob_prison, crime_train$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train$prob_prison and crime_train$crime_rate
## t = -2.5314, df = 33, p-value = 0.01631
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.64924666 -0.08086645
## sample estimates:
## cor
## -0.4032461
Dari hasil uji linearitas diatas dihasilkan bahwa variabel predictor percent_m, m_per1000f dan inequality tidak linear terhadap crime rate.
hist(crime_both$residuals)
H0: error berdistribusi normal H1: error tidak berdistribusi normal
shapiro.test(crime_both$residuals)##
## Shapiro-Wilk normality test
##
## data: crime_both$residuals
## W = 0.98904, p-value = 0.975
residual/error pada hasil prediksi sudah berdistribusi normal, yang mana artinya dari hasil prediksi yang diperoleh model cukup akurat karena errornya banyak yang mengumpul di titik 0.
plot(crime_train$crime_rate, crime_both$residuals)
abline(h = 0, col = "red")H0: model homoscedasticity
H1: model heteroscedasticity
bptest(crime_both)##
## studentized Breusch-Pagan test
##
## data: crime_both
## BP = 6.3884, df = 6, p-value = 0.3811
p-value > 0.05 artinya terima H0. Maka dapat dikatakan bahwa resisual tidak memiliki pola(Heteroscedasticity).
vif(crime_both)## percent_m police_exp60 m_per1000f gdp inequality prob_prison
## 2.154071 3.304372 1.082709 11.236056 5.692216 1.523129
Dilihat pada nilai VIF pada model crime_both, maka diperoleh bahwa model memiliki Multicollinearity antar variable. Disimpulkan bahwa asumsi ini tidak terpenuhi
crime_nc<- crime %>%
select(-c(gdp))
crime_nc <- crime_nc %>%
mutate_if(is.numeric, log10)RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- sample(x = nrow(crime_nc), size = nrow(crime_nc)*0.75)
crime_train2 <- crime_nc[index,]
crime_test2 <- crime_nc[-index,]crime_all3 <- lm(crime_rate~., crime_train2)
crime_non3 <- lm(crime_rate~1, crime_train2)
crime_both3 <- step(object = crime_all3, direction = "both", scope = list(lower = crime_non3, upper = crime_all3), trace = F)
summary(crime_both3)##
## Call:
## lm(formula = crime_rate ~ police_exp60 + m_per1000f + nonwhites_per1000 +
## inequality + prob_prison + mean_education, data = crime_train2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.152220 -0.045859 0.003791 0.056167 0.123680
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.36821 3.30786 -4.646 0.00007298 ***
## police_exp60 0.96364 0.16143 5.969 0.00000199 ***
## m_per1000f 3.89241 1.33496 2.916 0.00691 **
## nonwhites_per1000 0.09363 0.04341 2.157 0.03976 *
## inequality 1.17039 0.32570 3.593 0.00124 **
## prob_prison -0.23675 0.07869 -3.009 0.00550 **
## mean_education 0.81513 0.61268 1.330 0.19411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.07324 on 28 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.8126
## F-statistic: 25.57 on 6 and 28 DF, p-value: 0.0000000003825
Setelah melakukan beberapa langkah diatas, nilai Adjusted R-squared meningkat cukup signifikan menjadi 0.8126, artinya sebanyak 81% nilai crime_rate dapat dijeskan oleh model.
RMSE(y_pred = 10^(crime_both3$fitted.values), y_true = 10^(crime_train2$crime_rate))## [1] 126.7267
pred2 <- predict(object = crime_both3, newdata = crime_test2)
RMSE(y_pred = 10^(pred2), y_true = 10^(crime_test2$crime_rate))## [1] 193.5327
Jika dilihat perbandingan dari RMSE data train dan test masih memiliki perbedaan yang cukup besar. Mungkin masih bisa diasumsikan masih terjadinya overfitting.
H0 : Tidak Linear
H1 : Linear
percent_m + police_exp60 + m_per1000f + gdp + inequality + prob_prison
cor.test(crime_train2$percent_m, crime_train2$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train2$percent_m and crime_train2$crime_rate
## t = -1.0796, df = 33, p-value = 0.2881
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4879151 0.1582893
## sample estimates:
## cor
## -0.1847001
cor.test(crime_train2$m_per1000f, crime_train2$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train2$m_per1000f and crime_train2$crime_rate
## t = 1.5663, df = 33, p-value = 0.1268
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07693901 0.54823933
## sample estimates:
## cor
## 0.2630521
cor.test(crime_train2$nonwhites_per1000, crime_train2$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train2$nonwhites_per1000 and crime_train2$crime_rate
## t = 1.5617, df = 33, p-value = 0.1279
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.07770599 0.54769942
## sample estimates:
## cor
## 0.2623338
cor.test(crime_train2$inequality, crime_train2$crime_rate)##
## Pearson's product-moment correlation
##
## data: crime_train2$inequality and crime_train2$crime_rate
## t = -1.2084, df = 33, p-value = 0.2355
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5044936 0.1367736
## sample estimates:
## cor
## -0.2058557
Model sudah cukup Linear
hist(crime_both3$residuals)H0: error berdistribusi normal H1: error tidak berdistribusi normal
shapiro.test(crime_both3$residuals)##
## Shapiro-Wilk normality test
##
## data: crime_both3$residuals
## W = 0.98341, p-value = 0.8631
Terima H0, artinya error sudah terdistribusi normal.
plot(crime_train2$crime_rate, crime_both3$residuals)
abline(h = 0, col = "red")
H0: model homoscedasticity
H1: model heteroscedasticity
bptest(crime_both3)##
## studentized Breusch-Pagan test
##
## data: crime_both3
## BP = 3.9858, df = 6, p-value = 0.6786
p-value > 0.05 artinya terima H0. Maka dapat dikatakan bahwa resisual tidak memiliki pola(Heteroscedasticity).
vif(crime_both3)## police_exp60 m_per1000f nonwhites_per1000 inequality
## 3.879286 1.906638 2.812077 5.836119
## prob_prison mean_education
## 1.791914 5.753310
Dilihat pada nilai VIF pada model crime_both3, maka diperoleh bahwa model tidak memiliki Multicollinearity antar variable. Disimpulkan bahwa asumsi ini terpenuhi
Model terbaik yang didapatkan adalah both3 dengan nilai Adjusted R-squared sebesar 0.8126. Dan model ini juga sudah memenuhi uji analisis.
Model both3 dalam memprediksi crime_rate memakai beberapa prediktor yakni police_exp60, m_per1000f, nonwhites_per1000, inequality, prob_prison dan mean_education.