PERSIAPAN DATA

Analisis regresi ini bertujuan untuk membuat model regresi untuk menganalisa bagaimana pengaruh variabel-variabel socio-demographic terhadap tingkat kejahatan (crime rate) di suatu daerah. Dataset yang digunakan adalah: crime.csv

Memanggil data, cek missing values, dan melihat summary data

# Memanggil data
crime <- read.csv(file = "crime.csv")

# Cek missing values
is.na(crime) %>% colSums()
#>            percent_m             is_south       mean_education 
#>                    0                    0                    0 
#>         police_exp60         police_exp59 labour_participation 
#>                    0                    0                    0 
#>           m_per1000f            state_pop    nonwhites_per1000 
#>                    0                    0                    0 
#>         unemploy_m24         unemploy_m39                  gdp 
#>                    0                    0                    0 
#>           inequality          prob_prison          time_prison 
#>                    0                    0                    0 
#>           crime_rate 
#>                    0
# Melihat struktur data
str(crime)
#> 'data.frame':    47 obs. of  16 variables:
#>  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
#>  $ is_south            : int  1 0 1 0 0 0 1 1 1 0 ...
#>  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
#>  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
#>  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
#>  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
#>  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
#>  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
#>  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
#>  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
#>  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
#>  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
#>  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
#>  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
#>  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
#>  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Deskripsi kolom:

  • percent_m: percentage of males aged 14-24
  • is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
  • mean_education: mean years of schooling
  • police_exp60: police expenditure in 1960
  • police_exp59: police expenditure in 1959
  • labour_participation: labour force participation rate
  • m_per1000f: number of males per 1000 females
  • state_pop: state population
  • nonwhites_per1000: number of non-whites resident per 1000 people
  • unemploy_m24: unemployment rate of urban males aged 14-24
  • unemploy_m39: unemployment rate of urban males aged 35-39
  • gdp: gross domestic product per head
  • inequality: income inequality
  • prob_prison: probability of imprisonment
  • time_prison: average time served in prisons
  • crime_rate: crime rate in an unspecified category

Mengubah tipe data yang belum sesuai

crime_clean <- crime %>% 
  mutate(is_south = as.factor(is_south))

str(crime_clean)
#> 'data.frame':    47 obs. of  16 variables:
#>  $ percent_m           : int  151 143 142 136 141 121 127 131 157 140 ...
#>  $ is_south            : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 2 2 2 1 ...
#>  $ mean_education      : int  91 113 89 121 121 110 111 109 90 118 ...
#>  $ police_exp60        : int  58 103 45 149 109 118 82 115 65 71 ...
#>  $ police_exp59        : int  56 95 44 141 101 115 79 109 62 68 ...
#>  $ labour_participation: int  510 583 533 577 591 547 519 542 553 632 ...
#>  $ m_per1000f          : int  950 1012 969 994 985 964 982 969 955 1029 ...
#>  $ state_pop           : int  33 13 18 157 18 25 4 50 39 7 ...
#>  $ nonwhites_per1000   : int  301 102 219 80 30 44 139 179 286 15 ...
#>  $ unemploy_m24        : int  108 96 94 102 91 84 97 79 81 100 ...
#>  $ unemploy_m39        : int  41 36 33 39 20 29 38 35 28 24 ...
#>  $ gdp                 : int  394 557 318 673 578 689 620 472 421 526 ...
#>  $ inequality          : int  261 194 250 167 174 126 168 206 239 174 ...
#>  $ prob_prison         : num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
#>  $ time_prison         : num  26.2 25.3 24.3 29.9 21.3 ...
#>  $ crime_rate          : int  791 1635 578 1969 1234 682 963 1555 856 705 ...

Melihat tabel korelasi antar variable numerik

ggcorr(crime_clean, label = T,layout.exp = 3, hjust = 1)

Terdapat korelasi erat antara variable police_exp59 dan police_exp60, maka diambil satu variable saja

crime_clean <- crime_clean %>% 
  select(-c(police_exp59))
head(crime_clean)
#>   percent_m is_south mean_education police_exp60 labour_participation
#> 1       151        1             91           58                  510
#> 2       143        0            113          103                  583
#> 3       142        1             89           45                  533
#> 4       136        0            121          149                  577
#> 5       141        0            121          109                  591
#> 6       121        0            110          118                  547
#>   m_per1000f state_pop nonwhites_per1000 unemploy_m24 unemploy_m39 gdp
#> 1        950        33               301          108           41 394
#> 2       1012        13               102           96           36 557
#> 3        969        18               219           94           33 318
#> 4        994       157                80          102           39 673
#> 5        985        18                30           91           20 578
#> 6        964        25                44           84           29 689
#>   inequality prob_prison time_prison crime_rate
#> 1        261    0.084602     26.2011        791
#> 2        194    0.029599     25.2999       1635
#> 3        250    0.083401     24.3006        578
#> 4        167    0.015801     29.9012       1969
#> 5        174    0.041399     21.2998       1234
#> 6        126    0.034201     20.9995        682

Cek apakah ada outliers dalam data numerik

summary(crime_clean)
#>    percent_m     is_south mean_education   police_exp60   labour_participation
#>  Min.   :119.0   0:31     Min.   : 87.0   Min.   : 45.0   Min.   :480.0       
#>  1st Qu.:130.0   1:16     1st Qu.: 97.5   1st Qu.: 62.5   1st Qu.:530.5       
#>  Median :136.0            Median :108.0   Median : 78.0   Median :560.0       
#>  Mean   :138.6            Mean   :105.6   Mean   : 85.0   Mean   :561.2       
#>  3rd Qu.:146.0            3rd Qu.:114.5   3rd Qu.:104.5   3rd Qu.:593.0       
#>  Max.   :177.0            Max.   :122.0   Max.   :166.0   Max.   :641.0       
#>    m_per1000f       state_pop      nonwhites_per1000  unemploy_m24   
#>  Min.   : 934.0   Min.   :  3.00   Min.   :  2.0     Min.   : 70.00  
#>  1st Qu.: 964.5   1st Qu.: 10.00   1st Qu.: 24.0     1st Qu.: 80.50  
#>  Median : 977.0   Median : 25.00   Median : 76.0     Median : 92.00  
#>  Mean   : 983.0   Mean   : 36.62   Mean   :101.1     Mean   : 95.47  
#>  3rd Qu.: 992.0   3rd Qu.: 41.50   3rd Qu.:132.5     3rd Qu.:104.00  
#>  Max.   :1071.0   Max.   :168.00   Max.   :423.0     Max.   :142.00  
#>   unemploy_m39        gdp          inequality     prob_prison     
#>  Min.   :20.00   Min.   :288.0   Min.   :126.0   Min.   :0.00690  
#>  1st Qu.:27.50   1st Qu.:459.5   1st Qu.:165.5   1st Qu.:0.03270  
#>  Median :34.00   Median :537.0   Median :176.0   Median :0.04210  
#>  Mean   :33.98   Mean   :525.4   Mean   :194.0   Mean   :0.04709  
#>  3rd Qu.:38.50   3rd Qu.:591.5   3rd Qu.:227.5   3rd Qu.:0.05445  
#>  Max.   :58.00   Max.   :689.0   Max.   :276.0   Max.   :0.11980  
#>   time_prison      crime_rate    
#>  Min.   :12.20   Min.   : 342.0  
#>  1st Qu.:21.60   1st Qu.: 658.5  
#>  Median :25.80   Median : 831.0  
#>  Mean   :26.60   Mean   : 905.1  
#>  3rd Qu.:30.45   3rd Qu.:1057.5  
#>  Max.   :44.00   Max.   :1993.0
boxplot(crime_clean$percent_m, xlab = "percent_m", cex.lab = 1.2)

boxplot(crime_clean$mean_education, xlab = "mean_education", cex.lab = 1.2 )

boxplot(crime_clean$police_exp60,  xlab = "police_exp60", cex.lab = 1.2)

boxplot(crime_clean$labour_participation,  xlab = "labour_participation", cex.lab = 1.2)

boxplot(crime_clean$m_per1000f,  xlab = "m_per1000f", cex.lab = 1.2)

boxplot(crime_clean$state_pop,  xlab = "state_pop", cex.lab = 1.2)

boxplot(crime_clean$nonwhites_per1000,xlab = "nonwhites_per1000", cex.lab = 1.2 )

boxplot(crime_clean$unemploy_m24,xlab = "unemploy_m24", cex.lab = 1.2)

boxplot(crime_clean$unemploy_m39,xlab = "unemploy_m39", cex.lab = 1.2)

boxplot(crime_clean$gdp,xlab = "gdp", cex.lab = 1.2)

boxplot(crime_clean$inequality,xlab = "inequality", cex.lab = 1.2)

boxplot(crime_clean$prob_prison,xlab = "prob_prison", cex.lab = 1.2)

boxplot(crime_clean$time_prison,xlab = "time_prison", cex.lab = 1.2)

boxplot(crime_clean$crime_rate,xlab = "crime_rate", cex.lab = 1.2)

Ada outliers pada variabel percent_m, m_per1000f, state_pop, nonwhites_per1000, unemploy_m24, unemploy_m39, time_prison, prob_prison, dan crime rate.

Maka dibuat dataset berikut dengan membuang outliers.

crime_clean_no_outliers <- 
  crime_clean %>% 
  filter(crime_rate < 1700, state_pop< 75, prob_prison<0.085, percent_m < 168, m_per1000f < 1030,  nonwhites_per1000 < 280, unemploy_m24 < 140, unemploy_m39< 60, time_prison < 43)

PEMODELAN

Yang akan dilakukan untuk mencari model regresi terbaik dalam memprediksi tingkat kejahatan (crime_rate) adalah membandingkan model-model berikut ini:

  1. Model tanpa prediktor

  2. Model dengan semua prediktor dan mencari prediktor yang signifikan

  3. Model dengan semua prediktor (TANPA outliers) dan mencari prediktor yang signifikan

  4. Model dengan prediktor yang signifikan saja dengan membiarkan ada outliers

  5. Model dengan prediktor yang signifikan saja tanpa outliers

Melihat hasil 2 dan 3 manakah yang memiliki nilai R-squared dan adjusted R lebih tinggi, lalu datasetnya digunakan untuk mencoba metode backward dan forward.

  1. Model dengan stepwise backward dengan/tanpa outliers (beracuan hasil 2 dan 3)

  2. Mencoba model dengan stepwise forward dengan/tanpa outliers (beracuan hasil 2 dan 3)

1. Model tanpa prediktor

model_null <- lm(crime_rate ~1, crime_clean)
summary(model_null)
#> 
#> Call:
#> lm(formula = crime_rate ~ 1, data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -563.09 -246.59  -74.09  152.41 1087.91 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   905.09      56.42   16.04 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 386.8 on 46 degrees of freedom

2. Model dengan seluruh prediktor menyertakan outliers

model_all <- lm(crime_rate ~., crime_clean)
summary(model_all)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -442.55 -116.46    8.86  118.26  473.49 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -6379.0457  1568.9382  -4.066 0.000291 ***
#> percent_m                8.9861     4.1571   2.162 0.038232 *  
#> is_south1                5.6688   148.0997   0.038 0.969705    
#> mean_education          17.7304     6.0824   2.915 0.006445 ** 
#> police_exp60             9.6526     2.3921   4.035 0.000317 ***
#> labour_participation    -0.2801     1.4079  -0.199 0.843538    
#> m_per1000f               1.8218     2.0293   0.898 0.376026    
#> state_pop               -0.7836     1.2857  -0.609 0.546523    
#> nonwhites_per1000        0.2446     0.6187   0.395 0.695239    
#> unemploy_m24            -5.4163     4.1784  -1.296 0.204164    
#> unemploy_m39            16.9362     8.2148   2.062 0.047441 *  
#> gdp                      0.9072     1.0329   0.878 0.386292    
#> inequality               7.2708     2.2564   3.222 0.002921 ** 
#> prob_prison          -4285.1992  2183.8679  -1.962 0.058484 .  
#> time_prison             -1.1276     6.6919  -0.168 0.867251    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 208.6 on 32 degrees of freedom
#> Multiple R-squared:  0.7976, Adjusted R-squared:  0.709 
#> F-statistic: 9.006 on 14 and 32 DF,  p-value: 0.0000001673

3. Model dengan seluruh prediktor tanpa outliers

model_all_no_outliers <- lm(crime_rate ~., crime_clean_no_outliers)
summary(model_all_no_outliers)
#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = crime_clean_no_outliers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -317.22  -72.75  -11.42   76.37  215.95 
#> 
#> Coefficients:
#>                          Estimate   Std. Error t value Pr(>|t|)  
#> (Intercept)           -2372.72932   2126.54726  -1.116   0.2847  
#> percent_m                11.56547      5.59880   2.066   0.0594 .
#> is_south1               -15.79851    167.00282  -0.095   0.9261  
#> mean_education           16.98892      8.12060   2.092   0.0566 .
#> police_exp60              9.19855      3.89967   2.359   0.0347 *
#> labour_participation      3.29612      2.63837   1.249   0.2336  
#> m_per1000f               -3.96377      3.29936  -1.201   0.2510  
#> state_pop                -6.98010      4.73201  -1.475   0.1640  
#> nonwhites_per1000         2.56700      1.02185   2.512   0.0260 *
#> unemploy_m24              1.47181      5.42051   0.272   0.7903  
#> unemploy_m39             15.68005     11.21516   1.398   0.1855  
#> gdp                      -0.06154      1.25539  -0.049   0.9616  
#> inequality                6.23779      2.65475   2.350   0.0352 *
#> prob_prison          -10032.66004   3958.35247  -2.535   0.0249 *
#> time_prison              -9.70610     10.42644  -0.931   0.3689  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 172.1 on 13 degrees of freedom
#> Multiple R-squared:  0.8603, Adjusted R-squared:  0.7099 
#> F-statistic:  5.72 on 14 and 13 DF,  p-value: 0.001632

4. Model dengan prediktor yang signifikan saja dengan menyertakan outliers

model_1 <- lm(crime_rate ~ percent_m + mean_education +police_exp60 +unemploy_m39+inequality, crime_clean)
summary(model_1)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     unemploy_m39 + inequality, data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -453.44  -98.59  -18.07  106.03  629.64 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value        Pr(>|t|)    
#> (Intercept)    -5243.743    951.156  -5.513 0.0000021265864 ***
#> percent_m         10.198      3.532   2.887        0.006175 ** 
#> mean_education    20.308      4.742   4.283        0.000109 ***
#> police_exp60      12.331      1.416   8.706 0.0000000000726 ***
#> unemploy_m39       9.136      4.341   2.105        0.041496 *  
#> inequality         6.349      1.468   4.324 0.0000955935585 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 213 on 41 degrees of freedom
#> Multiple R-squared:  0.7296, Adjusted R-squared:  0.6967 
#> F-statistic: 22.13 on 5 and 41 DF,  p-value: 0.0000000001105

5. Model dengan prediktor yang signifikan saja tanpa outliers

model_1_no_outliers <- lm(crime_rate ~ police_exp60 + nonwhites_per1000 +inequality + prob_prison, crime_clean_no_outliers)
summary(model_1_no_outliers)
#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + nonwhites_per1000 + 
#>     inequality + prob_prison, data = crime_clean_no_outliers)
#> 
#> Residuals:
#>    Min     1Q Median     3Q    Max 
#> -400.2 -170.2   43.0  132.9  436.5 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)        -397.3137   529.4404  -0.750   0.4606   
#> police_exp60          8.5196     2.9229   2.915   0.0078 **
#> nonwhites_per1000     1.1653     0.7976   1.461   0.1575   
#> inequality            4.3037     1.7849   2.411   0.0243 * 
#> prob_prison       -7956.2164  3592.7022  -2.215   0.0370 * 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 224.4 on 23 degrees of freedom
#> Multiple R-squared:   0.58,  Adjusted R-squared:  0.507 
#> F-statistic:  7.94 on 4 and 23 DF,  p-value: 0.0003566

6.Model menggunakan stepwise backward

Model ini memakai acuan model_all_no_outliers karena memiliki R-squared lebih tinggi daripada model_all. Begitu juga dilakukan untuk model forward.

model_backward_no_outliers <- step(model_all_no_outliers, direction="backward")
#> Start:  AIC=296.81
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq    RSS    AIC
#> - gdp                   1        71 385141 294.82
#> - is_south              1       265 385335 294.83
#> - unemploy_m24          1      2184 387254 294.97
#> - time_prison           1     25669 410740 296.62
#> <none>                              385070 296.81
#> - m_per1000f            1     42752 427822 297.76
#> - labour_participation  1     46231 431301 297.99
#> - unemploy_m39          1     57900 442971 298.73
#> - state_pop             1     64451 449521 299.14
#> - percent_m             1    126396 511466 302.76
#> - mean_education        1    129644 514714 302.94
#> - inequality            1    163535 548605 304.72
#> - police_exp60          1    164808 549879 304.79
#> - nonwhites_per1000     1    186926 571996 305.89
#> - prob_prison           1    190283 575353 306.06
#> 
#> Step:  AIC=294.82
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq    RSS    AIC
#> - is_south              1       274 385416 292.84
#> - unemploy_m24          1      2526 387667 293.00
#> - time_prison           1     28190 413331 294.79
#> <none>                              385141 294.82
#> - labour_participation  1     47938 433079 296.10
#> - m_per1000f            1     50749 435890 296.28
#> - unemploy_m39          1     58471 443613 296.77
#> - state_pop             1     64682 449823 297.16
#> - percent_m             1    130652 515794 301.00
#> - mean_education        1    130737 515879 301.00
#> - police_exp60          1    185420 570562 303.82
#> - nonwhites_per1000     1    206166 591308 304.82
#> - prob_prison           1    207974 593116 304.91
#> - inequality            1    251295 636436 306.88
#> 
#> Step:  AIC=292.84
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + inequality + prob_prison + time_prison
#> 
#>                        Df Sum of Sq    RSS    AIC
#> - unemploy_m24          1      3336 388752 291.08
#> <none>                              385416 292.84
#> - time_prison           1     28843 414259 292.86
#> - m_per1000f            1     52037 437453 294.38
#> - unemploy_m39          1     58665 444080 294.80
#> - labour_participation  1     62127 447543 295.02
#> - state_pop             1     64411 449827 295.16
#> - mean_education        1    132392 517808 299.10
#> - percent_m             1    134436 519852 299.21
#> - police_exp60          1    189797 575213 302.05
#> - prob_prison           1    211530 596945 303.09
#> - nonwhites_per1000     1    233688 619104 304.11
#> - inequality            1    278456 663872 306.06
#> 
#> Step:  AIC=291.08
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 + 
#>     inequality + prob_prison + time_prison
#> 
#>                        Df Sum of Sq    RSS    AIC
#> - time_prison           1     25911 414662 290.88
#> <none>                              388752 291.08
#> - m_per1000f            1     50419 439170 292.49
#> - labour_participation  1     58950 447702 293.03
#> - state_pop             1     85538 474289 294.65
#> - unemploy_m39          1    113579 502331 296.25
#> - mean_education        1    130717 519469 297.19
#> - percent_m             1    133569 522320 297.35
#> - police_exp60          1    195848 584600 300.50
#> - prob_prison           1    208306 597058 301.09
#> - nonwhites_per1000     1    255602 644354 303.23
#> - inequality            1    279938 668690 304.26
#> 
#> Step:  AIC=290.88
#> crime_rate ~ percent_m + mean_education + police_exp60 + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m39 + 
#>     inequality + prob_prison
#> 
#>                        Df Sum of Sq    RSS    AIC
#> <none>                              414662 290.88
#> - m_per1000f            1     36804 451466 291.26
#> - labour_participation  1     53609 468271 292.29
#> - unemploy_m39          1     95125 509787 294.67
#> - mean_education        1    110473 525136 295.50
#> - percent_m             1    112458 527121 295.60
#> - state_pop             1    142768 557430 297.17
#> - prob_prison           1    212799 627461 300.48
#> - nonwhites_per1000     1    236707 651369 301.53
#> - inequality            1    290866 705528 303.77
#> - police_exp60          1    484604 899267 310.56
summary(model_backward_no_outliers)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m39 + inequality + prob_prison, data = crime_clean_no_outliers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -278.24  -81.03  -12.97   58.09  220.54 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -3209.6576  1705.6257  -1.882 0.077098 .  
#> percent_m               10.3186     4.8056   2.147 0.046493 *  
#> mean_education          14.9471     7.0235   2.128 0.048249 *  
#> police_exp60            11.3922     2.5559   4.457 0.000346 ***
#> labour_participation     3.1171     2.1026   1.483 0.156507    
#> m_per1000f              -3.0075     2.4484  -1.228 0.236047    
#> state_pop               -8.9840     3.7134  -2.419 0.027048 *  
#> nonwhites_per1000        2.2997     0.7382   3.115 0.006296 ** 
#> unemploy_m39            15.6430     7.9213   1.975 0.064761 .  
#> inequality               6.3666     1.8437   3.453 0.003037 ** 
#> prob_prison          -7632.8462  2584.1900  -2.954 0.008890 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 156.2 on 17 degrees of freedom
#> Multiple R-squared:  0.8496, Adjusted R-squared:  0.7611 
#> F-statistic: 9.604 on 10 and 17 DF,  p-value: 0.00003407

7. Model stepwise forward

model_forward_no_outliers <- step(object=model_null,
                      direction="forward",
                      scope=list(upper= model_all_no_outliers), 
                      trace=F)

summary(model_forward_no_outliers)
#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp60 + inequality + mean_education + 
#>     percent_m + prob_prison + unemploy_m39, data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -470.68  -78.41  -19.68  133.12  556.23 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value       Pr(>|t|)    
#> (Intercept)    -5040.505    899.843  -5.602 0.000001715273 ***
#> police_exp60      11.502      1.375   8.363 0.000000000256 ***
#> inequality         6.765      1.394   4.855 0.000018793765 ***
#> mean_education    19.647      4.475   4.390 0.000080720160 ***
#> percent_m         10.502      3.330   3.154        0.00305 ** 
#> prob_prison    -3801.836   1528.097  -2.488        0.01711 *  
#> unemploy_m39       8.937      4.091   2.185        0.03483 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 200.7 on 40 degrees of freedom
#> Multiple R-squared:  0.7659, Adjusted R-squared:  0.7307 
#> F-statistic: 21.81 on 6 and 40 DF,  p-value: 0.00000000003418

PERBANDINGAN

Membandingkan semua model menggunakan fungsi compare dari library (performance)

library(performance)
comparison <- compare_performance(model_null, model_all, model_all_no_outliers, model_1, model_1_no_outliers, model_backward_no_outliers, model_forward_no_outliers)

compare <- as.data.frame(comparison)

compare
#>                         Name Model      AIC
#> 1                 model_null    lm 696.4037
#> 2                  model_all    lm 649.3263
#> 3      model_all_no_outliers    lm 378.2719
#> 4                    model_1    lm 644.9286
#> 5        model_1_no_outliers    lm 389.1012
#> 6 model_backward_no_outliers    lm 372.3450
#> 7  model_forward_no_outliers    lm 640.1661
#>                                                                            AIC_wt
#> 1 0.00000000000000000000000000000000000000000000000000000000000000000000004069887
#> 2 0.00000000000000000000000000000000000000000000000000000000000067967596813547134
#> 3 0.04909339019476005594455614300386514514684677124023437500000000000000000000000
#> 4 0.00000000000000000000000000000000000000000000000000000000000612717054672095975
#> 5 0.00021850910573053278630577256347322645524400286376476287841796875000000000000
#> 6 0.95068810069950948626882336611743085086345672607421875000000000000000000000000
#> 7 0.00000000000000000000000000000000000000000000000000000000006628485952170377742
#>        BIC
#> 1 700.1040
#> 2 678.9287
#> 3 399.5872
#> 4 657.8796
#> 5 397.0944
#> 6 388.3314
#> 7 654.9673
#>                                                                         BIC_wt
#> 1 0.00000000000000000000000000000000000000000000000000000000000000000001961156
#> 2 0.00000000000000000000000000000000000000000000000000000000000000077744460998
#> 3 0.00353923319807805409481504810287333384621888399124145507812500000000000000
#> 4 0.00000000000000000000000000000000000000000000000000000000002893460346309315
#> 5 0.01230838387291111404864274447845673421397805213928222656250000000000000000
#> 6 0.98415238292901074945717709852033294737339019775390625000000000000000000000
#> 7 0.00000000000000000000000000000000000000000000000000000000012411305312414714
#>          R2 R2_adjusted     RMSE    Sigma
#> 1 0.0000000   0.0000000 382.6261 386.7627
#> 2 0.7975760   0.7090155 172.1494 208.6313
#> 3 0.8603390   0.7099349 117.2711 172.1069
#> 4 0.7296346   0.6966632 198.9528 213.0135
#> 5 0.5799959   0.5069518 203.3669 224.3858
#> 6 0.8496063   0.7611395 121.6937 156.1791
#> 7 0.7658663   0.7307463 185.1427 200.6899

Dengan hasil ini, dipilih model backward tanpa outliers, yaitu yang memiliki nilai multiple R-squared dan adjusted R paling tinggi yaitu 0.849 dan 0.761. Nilai RMSE sebesar 121.69.

UJI ASUMSI

1. Visualisasi histogram residual menggunakan fungsi hist()

hist(model_backward_no_outliers$residuals)

2. Uji statistik dengan shapiro.test()

Shapiro-Wilk hypothesis test:

  • H0: error berdistribusi normal
  • H1: error TIDAK berdistribusi normal

Kondisi yang diharapkan: H0

# shapiro test dari residual
shapiro.test(model_backward_no_outliers$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward_no_outliers$residuals
#> W = 0.96749, p-value = 0.5152

p-value > 0.5 HO tercapai: error berdistribusi normal

3. Homoscedasticity of Residuals

Visualisasi scatter plot: fitted.values vs residuals

  • fitted.values adalah nilai hasil prediksi data training
  • residuals adalah nilai error
# scatter plot

plot(x = model_backward_no_outliers$fitted.values, y = model_backward_no_outliers$residuals)
abline(h = 0, col = "red") # garis horizontal di angka 0

4. Uji VIF

library(car)
vif(model_backward_no_outliers)
#>            percent_m       mean_education         police_exp60 
#>             3.170246             6.559376             3.679443 
#> labour_participation           m_per1000f            state_pop 
#>             7.821026             3.245882             3.317766 
#>    nonwhites_per1000         unemploy_m39           inequality 
#>             3.075163             3.840580             5.583312 
#>          prob_prison 
#>             2.108012

Tidak ada nilai sama dengan atau lebih dari 10 sehingga tidak ditemukan multicollinearity antar variabel (antar variabel prediktor saling independen)

KESIMPULAN

Berdasarkan hasil analisis, model backward memiliki kriteria yang baik sebagai model linear regression.

summary (model_backward_no_outliers)
#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m39 + inequality + prob_prison, data = crime_clean_no_outliers)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -278.24  -81.03  -12.97   58.09  220.54 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)          -3209.6576  1705.6257  -1.882 0.077098 .  
#> percent_m               10.3186     4.8056   2.147 0.046493 *  
#> mean_education          14.9471     7.0235   2.128 0.048249 *  
#> police_exp60            11.3922     2.5559   4.457 0.000346 ***
#> labour_participation     3.1171     2.1026   1.483 0.156507    
#> m_per1000f              -3.0075     2.4484  -1.228 0.236047    
#> state_pop               -8.9840     3.7134  -2.419 0.027048 *  
#> nonwhites_per1000        2.2997     0.7382   3.115 0.006296 ** 
#> unemploy_m39            15.6430     7.9213   1.975 0.064761 .  
#> inequality               6.3666     1.8437   3.453 0.003037 ** 
#> prob_prison          -7632.8462  2584.1900  -2.954 0.008890 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 156.2 on 17 degrees of freedom
#> Multiple R-squared:  0.8496, Adjusted R-squared:  0.7611 
#> F-statistic: 9.604 on 10 and 17 DF,  p-value: 0.00003407

Model backward tanpa outliers (model_backward_no_outliers) adalah model yang dipilih dengan R-squared dan adjusted R paling tinggi yaitu 0.849 dan 0.761. Nilai RMSE sebesar 121.69.

Berdasarkan model ini, nilai crime rate berkorelasi positif dengan percent_m, mean_education, police_exp60, m_per1000f, unemploy_m39, dan inequality,yang artinya peningkatan nilai prediktor-prediktor tersebut juga akan meningkatkan angka crime rate.

Sedangkan crime rate berkorelasi negatif dengan unemploy_m24 dan prob_prison, yang artinya setiap peningkatan nilai kedua prediktor tersebut, akan mengurangi angka crime rate.

Model yang dipilih juga menghasilkan uji asumsi sesuai dengan yang diharapkan yaitu:

  • nilai residual berdistribusi normal

  • nilai residual tersebar acak (tidak ada pola tertentu)

  • tidak ditemukan multicollinearity antar variabel