Analisis regresi ini bertujuan untuk membuat model regresi untuk menganalisa bagaimana pengaruh variabel-variabel socio-demographic terhadap tingkat kejahatan (crime rate) di suatu daerah. Dataset yang digunakan adalah: crime.csv
# Memanggil data
crime <- read.csv(file = "crime.csv")
# Cek missing values
is.na(crime) %>% colSums()#> percent_m is_south mean_education
#> 0 0 0
#> police_exp60 police_exp59 labour_participation
#> 0 0 0
#> m_per1000f state_pop nonwhites_per1000
#> 0 0 0
#> unemploy_m24 unemploy_m39 gdp
#> 0 0 0
#> inequality prob_prison time_prison
#> 0 0 0
#> crime_rate
#> 0
# Melihat struktur data
str(crime)#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : int 1 0 1 0 0 0 1 1 1 0 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
Deskripsi kolom:
percent_m: percentage of males aged 14-24is_south: whether it is in a Southern state. 1 for Yes,
0 for No.mean_education: mean years of schoolingpolice_exp60: police expenditure in 1960police_exp59: police expenditure in 1959labour_participation: labour force participation
ratem_per1000f: number of males per 1000 femalesstate_pop: state populationnonwhites_per1000: number of non-whites resident per
1000 peopleunemploy_m24: unemployment rate of urban males aged
14-24unemploy_m39: unemployment rate of urban males aged
35-39gdp: gross domestic product per headinequality: income inequalityprob_prison: probability of imprisonmenttime_prison: average time served in prisonscrime_rate: crime rate in an unspecified categorycrime_clean <- crime %>%
mutate(is_south = as.factor(is_south))
str(crime_clean)#> 'data.frame': 47 obs. of 16 variables:
#> $ percent_m : int 151 143 142 136 141 121 127 131 157 140 ...
#> $ is_south : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#> $ mean_education : int 91 113 89 121 121 110 111 109 90 118 ...
#> $ police_exp60 : int 58 103 45 149 109 118 82 115 65 71 ...
#> $ police_exp59 : int 56 95 44 141 101 115 79 109 62 68 ...
#> $ labour_participation: int 510 583 533 577 591 547 519 542 553 632 ...
#> $ m_per1000f : int 950 1012 969 994 985 964 982 969 955 1029 ...
#> $ state_pop : int 33 13 18 157 18 25 4 50 39 7 ...
#> $ nonwhites_per1000 : int 301 102 219 80 30 44 139 179 286 15 ...
#> $ unemploy_m24 : int 108 96 94 102 91 84 97 79 81 100 ...
#> $ unemploy_m39 : int 41 36 33 39 20 29 38 35 28 24 ...
#> $ gdp : int 394 557 318 673 578 689 620 472 421 526 ...
#> $ inequality : int 261 194 250 167 174 126 168 206 239 174 ...
#> $ prob_prison : num 0.0846 0.0296 0.0834 0.0158 0.0414 ...
#> $ time_prison : num 26.2 25.3 24.3 29.9 21.3 ...
#> $ crime_rate : int 791 1635 578 1969 1234 682 963 1555 856 705 ...
ggcorr(crime_clean, label = T,layout.exp = 3, hjust = 1)
Terdapat korelasi erat antara variable police_exp59 dan police_exp60,
maka diambil satu variable saja
crime_clean <- crime_clean %>%
select(-c(police_exp59))
head(crime_clean)#> percent_m is_south mean_education police_exp60 labour_participation
#> 1 151 1 91 58 510
#> 2 143 0 113 103 583
#> 3 142 1 89 45 533
#> 4 136 0 121 149 577
#> 5 141 0 121 109 591
#> 6 121 0 110 118 547
#> m_per1000f state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp
#> 1 950 33 301 108 41 394
#> 2 1012 13 102 96 36 557
#> 3 969 18 219 94 33 318
#> 4 994 157 80 102 39 673
#> 5 985 18 30 91 20 578
#> 6 964 25 44 84 29 689
#> inequality prob_prison time_prison crime_rate
#> 1 261 0.084602 26.2011 791
#> 2 194 0.029599 25.2999 1635
#> 3 250 0.083401 24.3006 578
#> 4 167 0.015801 29.9012 1969
#> 5 174 0.041399 21.2998 1234
#> 6 126 0.034201 20.9995 682
summary(crime_clean)#> percent_m is_south mean_education police_exp60 labour_participation
#> Min. :119.0 0:31 Min. : 87.0 Min. : 45.0 Min. :480.0
#> 1st Qu.:130.0 1:16 1st Qu.: 97.5 1st Qu.: 62.5 1st Qu.:530.5
#> Median :136.0 Median :108.0 Median : 78.0 Median :560.0
#> Mean :138.6 Mean :105.6 Mean : 85.0 Mean :561.2
#> 3rd Qu.:146.0 3rd Qu.:114.5 3rd Qu.:104.5 3rd Qu.:593.0
#> Max. :177.0 Max. :122.0 Max. :166.0 Max. :641.0
#> m_per1000f state_pop nonwhites_per1000 unemploy_m24
#> Min. : 934.0 Min. : 3.00 Min. : 2.0 Min. : 70.00
#> 1st Qu.: 964.5 1st Qu.: 10.00 1st Qu.: 24.0 1st Qu.: 80.50
#> Median : 977.0 Median : 25.00 Median : 76.0 Median : 92.00
#> Mean : 983.0 Mean : 36.62 Mean :101.1 Mean : 95.47
#> 3rd Qu.: 992.0 3rd Qu.: 41.50 3rd Qu.:132.5 3rd Qu.:104.00
#> Max. :1071.0 Max. :168.00 Max. :423.0 Max. :142.00
#> unemploy_m39 gdp inequality prob_prison
#> Min. :20.00 Min. :288.0 Min. :126.0 Min. :0.00690
#> 1st Qu.:27.50 1st Qu.:459.5 1st Qu.:165.5 1st Qu.:0.03270
#> Median :34.00 Median :537.0 Median :176.0 Median :0.04210
#> Mean :33.98 Mean :525.4 Mean :194.0 Mean :0.04709
#> 3rd Qu.:38.50 3rd Qu.:591.5 3rd Qu.:227.5 3rd Qu.:0.05445
#> Max. :58.00 Max. :689.0 Max. :276.0 Max. :0.11980
#> time_prison crime_rate
#> Min. :12.20 Min. : 342.0
#> 1st Qu.:21.60 1st Qu.: 658.5
#> Median :25.80 Median : 831.0
#> Mean :26.60 Mean : 905.1
#> 3rd Qu.:30.45 3rd Qu.:1057.5
#> Max. :44.00 Max. :1993.0
boxplot(crime_clean$percent_m, xlab = "percent_m", cex.lab = 1.2)boxplot(crime_clean$mean_education, xlab = "mean_education", cex.lab = 1.2 )boxplot(crime_clean$police_exp60, xlab = "police_exp60", cex.lab = 1.2)boxplot(crime_clean$labour_participation, xlab = "labour_participation", cex.lab = 1.2)boxplot(crime_clean$m_per1000f, xlab = "m_per1000f", cex.lab = 1.2)boxplot(crime_clean$state_pop, xlab = "state_pop", cex.lab = 1.2)boxplot(crime_clean$nonwhites_per1000,xlab = "nonwhites_per1000", cex.lab = 1.2 )boxplot(crime_clean$unemploy_m24,xlab = "unemploy_m24", cex.lab = 1.2)boxplot(crime_clean$unemploy_m39,xlab = "unemploy_m39", cex.lab = 1.2)boxplot(crime_clean$gdp,xlab = "gdp", cex.lab = 1.2)boxplot(crime_clean$inequality,xlab = "inequality", cex.lab = 1.2)boxplot(crime_clean$prob_prison,xlab = "prob_prison", cex.lab = 1.2)boxplot(crime_clean$time_prison,xlab = "time_prison", cex.lab = 1.2)boxplot(crime_clean$crime_rate,xlab = "crime_rate", cex.lab = 1.2)Ada outliers pada variabel percent_m, m_per1000f, state_pop, nonwhites_per1000, unemploy_m24, unemploy_m39, time_prison, prob_prison, dan crime rate.
Maka dibuat dataset berikut dengan membuang outliers.
crime_clean_no_outliers <-
crime_clean %>%
filter(crime_rate < 1700, state_pop< 75, prob_prison<0.085, percent_m < 168, m_per1000f < 1030, nonwhites_per1000 < 280, unemploy_m24 < 140, unemploy_m39< 60, time_prison < 43)Yang akan dilakukan untuk mencari model regresi terbaik dalam memprediksi tingkat kejahatan (crime_rate) adalah membandingkan model-model berikut ini:
Model tanpa prediktor
Model dengan semua prediktor dan mencari prediktor yang signifikan
Model dengan semua prediktor (TANPA outliers) dan mencari prediktor yang signifikan
Model dengan prediktor yang signifikan saja dengan membiarkan ada outliers
Model dengan prediktor yang signifikan saja tanpa outliers
Melihat hasil 2 dan 3 manakah yang memiliki nilai R-squared dan adjusted R lebih tinggi, lalu datasetnya digunakan untuk mencoba metode backward dan forward.
Model dengan stepwise backward dengan/tanpa outliers
(beracuan hasil 2 dan 3)
Mencoba model dengan stepwise forward dengan/tanpa
outliers (beracuan hasil 2 dan 3)
model_null <- lm(crime_rate ~1, crime_clean)
summary(model_null)#>
#> Call:
#> lm(formula = crime_rate ~ 1, data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -563.09 -246.59 -74.09 152.41 1087.91
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 905.09 56.42 16.04 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 386.8 on 46 degrees of freedom
model_all <- lm(crime_rate ~., crime_clean)
summary(model_all)#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -442.55 -116.46 8.86 118.26 473.49
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -6379.0457 1568.9382 -4.066 0.000291 ***
#> percent_m 8.9861 4.1571 2.162 0.038232 *
#> is_south1 5.6688 148.0997 0.038 0.969705
#> mean_education 17.7304 6.0824 2.915 0.006445 **
#> police_exp60 9.6526 2.3921 4.035 0.000317 ***
#> labour_participation -0.2801 1.4079 -0.199 0.843538
#> m_per1000f 1.8218 2.0293 0.898 0.376026
#> state_pop -0.7836 1.2857 -0.609 0.546523
#> nonwhites_per1000 0.2446 0.6187 0.395 0.695239
#> unemploy_m24 -5.4163 4.1784 -1.296 0.204164
#> unemploy_m39 16.9362 8.2148 2.062 0.047441 *
#> gdp 0.9072 1.0329 0.878 0.386292
#> inequality 7.2708 2.2564 3.222 0.002921 **
#> prob_prison -4285.1992 2183.8679 -1.962 0.058484 .
#> time_prison -1.1276 6.6919 -0.168 0.867251
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 208.6 on 32 degrees of freedom
#> Multiple R-squared: 0.7976, Adjusted R-squared: 0.709
#> F-statistic: 9.006 on 14 and 32 DF, p-value: 0.0000001673
model_all_no_outliers <- lm(crime_rate ~., crime_clean_no_outliers)
summary(model_all_no_outliers)#>
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean_no_outliers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -317.22 -72.75 -11.42 76.37 215.95
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -2372.72932 2126.54726 -1.116 0.2847
#> percent_m 11.56547 5.59880 2.066 0.0594 .
#> is_south1 -15.79851 167.00282 -0.095 0.9261
#> mean_education 16.98892 8.12060 2.092 0.0566 .
#> police_exp60 9.19855 3.89967 2.359 0.0347 *
#> labour_participation 3.29612 2.63837 1.249 0.2336
#> m_per1000f -3.96377 3.29936 -1.201 0.2510
#> state_pop -6.98010 4.73201 -1.475 0.1640
#> nonwhites_per1000 2.56700 1.02185 2.512 0.0260 *
#> unemploy_m24 1.47181 5.42051 0.272 0.7903
#> unemploy_m39 15.68005 11.21516 1.398 0.1855
#> gdp -0.06154 1.25539 -0.049 0.9616
#> inequality 6.23779 2.65475 2.350 0.0352 *
#> prob_prison -10032.66004 3958.35247 -2.535 0.0249 *
#> time_prison -9.70610 10.42644 -0.931 0.3689
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 172.1 on 13 degrees of freedom
#> Multiple R-squared: 0.8603, Adjusted R-squared: 0.7099
#> F-statistic: 5.72 on 14 and 13 DF, p-value: 0.001632
model_1 <- lm(crime_rate ~ percent_m + mean_education +police_exp60 +unemploy_m39+inequality, crime_clean)
summary(model_1)#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> unemploy_m39 + inequality, data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -453.44 -98.59 -18.07 106.03 629.64
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5243.743 951.156 -5.513 0.0000021265864 ***
#> percent_m 10.198 3.532 2.887 0.006175 **
#> mean_education 20.308 4.742 4.283 0.000109 ***
#> police_exp60 12.331 1.416 8.706 0.0000000000726 ***
#> unemploy_m39 9.136 4.341 2.105 0.041496 *
#> inequality 6.349 1.468 4.324 0.0000955935585 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 213 on 41 degrees of freedom
#> Multiple R-squared: 0.7296, Adjusted R-squared: 0.6967
#> F-statistic: 22.13 on 5 and 41 DF, p-value: 0.0000000001105
model_1_no_outliers <- lm(crime_rate ~ police_exp60 + nonwhites_per1000 +inequality + prob_prison, crime_clean_no_outliers)
summary(model_1_no_outliers)#>
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + nonwhites_per1000 +
#> inequality + prob_prison, data = crime_clean_no_outliers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -400.2 -170.2 43.0 132.9 436.5
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -397.3137 529.4404 -0.750 0.4606
#> police_exp60 8.5196 2.9229 2.915 0.0078 **
#> nonwhites_per1000 1.1653 0.7976 1.461 0.1575
#> inequality 4.3037 1.7849 2.411 0.0243 *
#> prob_prison -7956.2164 3592.7022 -2.215 0.0370 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 224.4 on 23 degrees of freedom
#> Multiple R-squared: 0.58, Adjusted R-squared: 0.507
#> F-statistic: 7.94 on 4 and 23 DF, p-value: 0.0003566
Model ini memakai acuan model_all_no_outliers karena memiliki R-squared lebih tinggi daripada model_all. Begitu juga dilakukan untuk model forward.
model_backward_no_outliers <- step(model_all_no_outliers, direction="backward")#> Start: AIC=296.81
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
#> labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
#> unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison +
#> time_prison
#>
#> Df Sum of Sq RSS AIC
#> - gdp 1 71 385141 294.82
#> - is_south 1 265 385335 294.83
#> - unemploy_m24 1 2184 387254 294.97
#> - time_prison 1 25669 410740 296.62
#> <none> 385070 296.81
#> - m_per1000f 1 42752 427822 297.76
#> - labour_participation 1 46231 431301 297.99
#> - unemploy_m39 1 57900 442971 298.73
#> - state_pop 1 64451 449521 299.14
#> - percent_m 1 126396 511466 302.76
#> - mean_education 1 129644 514714 302.94
#> - inequality 1 163535 548605 304.72
#> - police_exp60 1 164808 549879 304.79
#> - nonwhites_per1000 1 186926 571996 305.89
#> - prob_prison 1 190283 575353 306.06
#>
#> Step: AIC=294.82
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 +
#> labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
#> unemploy_m24 + unemploy_m39 + inequality + prob_prison +
#> time_prison
#>
#> Df Sum of Sq RSS AIC
#> - is_south 1 274 385416 292.84
#> - unemploy_m24 1 2526 387667 293.00
#> - time_prison 1 28190 413331 294.79
#> <none> 385141 294.82
#> - labour_participation 1 47938 433079 296.10
#> - m_per1000f 1 50749 435890 296.28
#> - unemploy_m39 1 58471 443613 296.77
#> - state_pop 1 64682 449823 297.16
#> - percent_m 1 130652 515794 301.00
#> - mean_education 1 130737 515879 301.00
#> - police_exp60 1 185420 570562 303.82
#> - nonwhites_per1000 1 206166 591308 304.82
#> - prob_prison 1 207974 593116 304.91
#> - inequality 1 251295 636436 306.88
#>
#> Step: AIC=292.84
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
#> m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 +
#> unemploy_m39 + inequality + prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - unemploy_m24 1 3336 388752 291.08
#> <none> 385416 292.84
#> - time_prison 1 28843 414259 292.86
#> - m_per1000f 1 52037 437453 294.38
#> - unemploy_m39 1 58665 444080 294.80
#> - labour_participation 1 62127 447543 295.02
#> - state_pop 1 64411 449827 295.16
#> - mean_education 1 132392 517808 299.10
#> - percent_m 1 134436 519852 299.21
#> - police_exp60 1 189797 575213 302.05
#> - prob_prison 1 211530 596945 303.09
#> - nonwhites_per1000 1 233688 619104 304.11
#> - inequality 1 278456 663872 306.06
#>
#> Step: AIC=291.08
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
#> m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 +
#> inequality + prob_prison + time_prison
#>
#> Df Sum of Sq RSS AIC
#> - time_prison 1 25911 414662 290.88
#> <none> 388752 291.08
#> - m_per1000f 1 50419 439170 292.49
#> - labour_participation 1 58950 447702 293.03
#> - state_pop 1 85538 474289 294.65
#> - unemploy_m39 1 113579 502331 296.25
#> - mean_education 1 130717 519469 297.19
#> - percent_m 1 133569 522320 297.35
#> - police_exp60 1 195848 584600 300.50
#> - prob_prison 1 208306 597058 301.09
#> - nonwhites_per1000 1 255602 644354 303.23
#> - inequality 1 279938 668690 304.26
#>
#> Step: AIC=290.88
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation +
#> m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 +
#> inequality + prob_prison
#>
#> Df Sum of Sq RSS AIC
#> <none> 414662 290.88
#> - m_per1000f 1 36804 451466 291.26
#> - labour_participation 1 53609 468271 292.29
#> - unemploy_m39 1 95125 509787 294.67
#> - mean_education 1 110473 525136 295.50
#> - percent_m 1 112458 527121 295.60
#> - state_pop 1 142768 557430 297.17
#> - prob_prison 1 212799 627461 300.48
#> - nonwhites_per1000 1 236707 651369 301.53
#> - inequality 1 290866 705528 303.77
#> - police_exp60 1 484604 899267 310.56
summary(model_backward_no_outliers)#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
#> unemploy_m39 + inequality + prob_prison, data = crime_clean_no_outliers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -278.24 -81.03 -12.97 58.09 220.54
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3209.6576 1705.6257 -1.882 0.077098 .
#> percent_m 10.3186 4.8056 2.147 0.046493 *
#> mean_education 14.9471 7.0235 2.128 0.048249 *
#> police_exp60 11.3922 2.5559 4.457 0.000346 ***
#> labour_participation 3.1171 2.1026 1.483 0.156507
#> m_per1000f -3.0075 2.4484 -1.228 0.236047
#> state_pop -8.9840 3.7134 -2.419 0.027048 *
#> nonwhites_per1000 2.2997 0.7382 3.115 0.006296 **
#> unemploy_m39 15.6430 7.9213 1.975 0.064761 .
#> inequality 6.3666 1.8437 3.453 0.003037 **
#> prob_prison -7632.8462 2584.1900 -2.954 0.008890 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 156.2 on 17 degrees of freedom
#> Multiple R-squared: 0.8496, Adjusted R-squared: 0.7611
#> F-statistic: 9.604 on 10 and 17 DF, p-value: 0.00003407
model_forward_no_outliers <- step(object=model_null,
direction="forward",
scope=list(upper= model_all_no_outliers),
trace=F)
summary(model_forward_no_outliers)#>
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + inequality + mean_education +
#> percent_m + prob_prison + unemploy_m39, data = crime_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -470.68 -78.41 -19.68 133.12 556.23
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -5040.505 899.843 -5.602 0.000001715273 ***
#> police_exp60 11.502 1.375 8.363 0.000000000256 ***
#> inequality 6.765 1.394 4.855 0.000018793765 ***
#> mean_education 19.647 4.475 4.390 0.000080720160 ***
#> percent_m 10.502 3.330 3.154 0.00305 **
#> prob_prison -3801.836 1528.097 -2.488 0.01711 *
#> unemploy_m39 8.937 4.091 2.185 0.03483 *
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 200.7 on 40 degrees of freedom
#> Multiple R-squared: 0.7659, Adjusted R-squared: 0.7307
#> F-statistic: 21.81 on 6 and 40 DF, p-value: 0.00000000003418
Membandingkan semua model menggunakan fungsi compare
dari library (performance)
library(performance)
comparison <- compare_performance(model_null, model_all, model_all_no_outliers, model_1, model_1_no_outliers, model_backward_no_outliers, model_forward_no_outliers)
compare <- as.data.frame(comparison)
compare#> Name Model AIC
#> 1 model_null lm 696.4037
#> 2 model_all lm 649.3263
#> 3 model_all_no_outliers lm 378.2719
#> 4 model_1 lm 644.9286
#> 5 model_1_no_outliers lm 389.1012
#> 6 model_backward_no_outliers lm 372.3450
#> 7 model_forward_no_outliers lm 640.1661
#> AIC_wt
#> 1 0.00000000000000000000000000000000000000000000000000000000000000000000004069887
#> 2 0.00000000000000000000000000000000000000000000000000000000000067967596813547134
#> 3 0.04909339019476005594455614300386514514684677124023437500000000000000000000000
#> 4 0.00000000000000000000000000000000000000000000000000000000000612717054672095975
#> 5 0.00021850910573053278630577256347322645524400286376476287841796875000000000000
#> 6 0.95068810069950948626882336611743085086345672607421875000000000000000000000000
#> 7 0.00000000000000000000000000000000000000000000000000000000006628485952170377742
#> BIC
#> 1 700.1040
#> 2 678.9287
#> 3 399.5872
#> 4 657.8796
#> 5 397.0944
#> 6 388.3314
#> 7 654.9673
#> BIC_wt
#> 1 0.00000000000000000000000000000000000000000000000000000000000000000001961156
#> 2 0.00000000000000000000000000000000000000000000000000000000000000077744460998
#> 3 0.00353923319807805409481504810287333384621888399124145507812500000000000000
#> 4 0.00000000000000000000000000000000000000000000000000000000002893460346309315
#> 5 0.01230838387291111404864274447845673421397805213928222656250000000000000000
#> 6 0.98415238292901074945717709852033294737339019775390625000000000000000000000
#> 7 0.00000000000000000000000000000000000000000000000000000000012411305312414714
#> R2 R2_adjusted RMSE Sigma
#> 1 0.0000000 0.0000000 382.6261 386.7627
#> 2 0.7975760 0.7090155 172.1494 208.6313
#> 3 0.8603390 0.7099349 117.2711 172.1069
#> 4 0.7296346 0.6966632 198.9528 213.0135
#> 5 0.5799959 0.5069518 203.3669 224.3858
#> 6 0.8496063 0.7611395 121.6937 156.1791
#> 7 0.7658663 0.7307463 185.1427 200.6899
Dengan hasil ini, dipilih model backward tanpa outliers, yaitu yang memiliki nilai multiple R-squared dan adjusted R paling tinggi yaitu 0.849 dan 0.761. Nilai RMSE sebesar 121.69.
hist()hist(model_backward_no_outliers$residuals)shapiro.test()Shapiro-Wilk hypothesis test:
Kondisi yang diharapkan: H0
# shapiro test dari residual
shapiro.test(model_backward_no_outliers$residuals)#>
#> Shapiro-Wilk normality test
#>
#> data: model_backward_no_outliers$residuals
#> W = 0.96749, p-value = 0.5152
p-value > 0.5 HO tercapai: error berdistribusi normal
fitted.values vs
residualsfitted.values adalah nilai hasil prediksi data
trainingresiduals adalah nilai error# scatter plot
plot(x = model_backward_no_outliers$fitted.values, y = model_backward_no_outliers$residuals)
abline(h = 0, col = "red") # garis horizontal di angka 0library(car)
vif(model_backward_no_outliers)#> percent_m mean_education police_exp60
#> 3.170246 6.559376 3.679443
#> labour_participation m_per1000f state_pop
#> 7.821026 3.245882 3.317766
#> nonwhites_per1000 unemploy_m39 inequality
#> 3.075163 3.840580 5.583312
#> prob_prison
#> 2.108012
Tidak ada nilai sama dengan atau lebih dari 10 sehingga tidak ditemukan multicollinearity antar variabel (antar variabel prediktor saling independen)
Berdasarkan hasil analisis, model backward memiliki kriteria yang baik sebagai model linear regression.
summary (model_backward_no_outliers)#>
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 +
#> labour_participation + m_per1000f + state_pop + nonwhites_per1000 +
#> unemploy_m39 + inequality + prob_prison, data = crime_clean_no_outliers)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -278.24 -81.03 -12.97 58.09 220.54
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -3209.6576 1705.6257 -1.882 0.077098 .
#> percent_m 10.3186 4.8056 2.147 0.046493 *
#> mean_education 14.9471 7.0235 2.128 0.048249 *
#> police_exp60 11.3922 2.5559 4.457 0.000346 ***
#> labour_participation 3.1171 2.1026 1.483 0.156507
#> m_per1000f -3.0075 2.4484 -1.228 0.236047
#> state_pop -8.9840 3.7134 -2.419 0.027048 *
#> nonwhites_per1000 2.2997 0.7382 3.115 0.006296 **
#> unemploy_m39 15.6430 7.9213 1.975 0.064761 .
#> inequality 6.3666 1.8437 3.453 0.003037 **
#> prob_prison -7632.8462 2584.1900 -2.954 0.008890 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 156.2 on 17 degrees of freedom
#> Multiple R-squared: 0.8496, Adjusted R-squared: 0.7611
#> F-statistic: 9.604 on 10 and 17 DF, p-value: 0.00003407
Model backward tanpa outliers (model_backward_no_outliers) adalah model yang dipilih dengan R-squared dan adjusted R paling tinggi yaitu 0.849 dan 0.761. Nilai RMSE sebesar 121.69.
Berdasarkan model ini, nilai crime rate berkorelasi positif dengan percent_m, mean_education, police_exp60, m_per1000f, unemploy_m39, dan inequality,yang artinya peningkatan nilai prediktor-prediktor tersebut juga akan meningkatkan angka crime rate.
Sedangkan crime rate berkorelasi negatif dengan unemploy_m24 dan prob_prison, yang artinya setiap peningkatan nilai kedua prediktor tersebut, akan mengurangi angka crime rate.
Model yang dipilih juga menghasilkan uji asumsi sesuai dengan yang diharapkan yaitu:
nilai residual berdistribusi normal
nilai residual tersebar acak (tidak ada pola tertentu)
tidak ditemukan multicollinearity antar variabel