# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

# scientific notation
options(scipen = 9999)

1 Objective

In this report we fit a linear regression model to the crime data set. We want to understand the relationships among its variables, especially between the crime rate and the other variables, and to predict the crime rate for unseen data.

1.1 Just a Prediction!

Before we start modeling, we make an initial guess: there is a strong relationship between the crime rate and the average education level of the population.

We will try to answer this question through our modeling.

2 Data Preparation

2.1 Load the required packages

library(tidyverse)
library(lubridate)
library(GGally)
library(MLmetrics)
library(lmtest)
library(car)
library(plotly)

2.2 Read data

After loading the packages needed for modeling, we read the data we will use, crime.csv, into an object.

data <- read.csv("crime.csv")

colnames(data) <- c("percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

data

# Drop the first column (named percent_m above); it is not used in this analysis
data[1] <- NULL

head(data)

2.3 Checking the Data

Next we check the data types and look for variables that we may not need for the model.

glimpse(data)

#> Rows: 47
#> Columns: 15
#> $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1...
#> $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118,...
#> $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 1...
#> $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 11...
#> $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 6...
#> $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, ...
#> $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47...
#> $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, ...
#> $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77,...
#> $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31...
#> $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 5...
#> $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 1...
#> $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.0413...
#> $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 2...
#> $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 85...

Based on the summary above, we judge that all variables can be used in the model.

For is_south specifically, we convert the data type to factor, because the variable has only two levels, 1 and 0.

levels(data$is_south)

#> NULL

crime <- data %>% 
  mutate(is_south = as.factor(is_south))

crime
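As a small illustration of why `levels()` returned `NULL` before the conversion, the sketch below uses a made-up vector, `south_flag`, which is an assumption for this example and not part of the data set. An integer vector carries no levels, while the converted factor exposes the two categories explicitly.

```r
# Hypothetical 0/1 indicator standing in for a column such as is_south
south_flag <- c(1L, 0L, 1L, 0L, 0L)

# An integer vector has no levels attribute, so this is NULL
int_levels <- levels(south_flag)

# After conversion the two categories become explicit levels
south_factor <- as.factor(south_flag)
fac_levels <- levels(south_factor)

print(int_levels)
print(fac_levels)
```

When a factor is used in `lm()`, R creates a dummy variable for it, with the first level as the reference category.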

2.4 Check missing values

colSums(is.na(crime))

#>             is_south       mean_education         police_exp60 
#>                    0                    0                    0 
#>         police_exp59 labour_participation           m_per1000f 
#>                    0                    0                    0 
#>            state_pop    nonwhites_per1000         unemploy_m24 
#>                    0                    0                    0 
#>         unemploy_m39                  gdp           inequality 
#>                    0                    0                    0 
#>          prob_prison          time_prison           crime_rate 
#>                    0                    0                    0

3 Exploratory Data Analysis

Next we look at the relationships between the variables in our data set.

ggcorr(data, label = TRUE, layout.exp = 2, hjust = 1)

Based on the plot above, several variables have only a weak correlation with the target variable, such as prob_prison, inequality, unemploy_m24 and is_south. We therefore set these variables aside; the code below removes prob_prison, unemploy_m24 and is_south from crime.

crime <- crime %>% 
         select( -c(prob_prison, unemploy_m24, is_south))
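The same screening can be done numerically instead of by reading the plot. The sketch below ranks predictors by the absolute value of their correlation with the target using base R `cor()`. The data frame `toy` and its columns are made up for illustration and are not the crime data.

```r
# Made-up data for illustration only
toy <- data.frame(
  target = c(1, 2, 3, 4, 5, 6),
  strong = c(2, 4, 6, 8, 10, 12),   # perfectly linear in target
  weak   = c(1, -1, 1, -1, 1, -1)   # alternating, weakly related
)

# Correlation of every predictor with the target
predictors <- setdiff(names(toy), "target")
cors <- cor(toy[, predictors], toy$target)[, 1]

# Rank by absolute correlation, strongest first
ranked <- sort(abs(cors), decreasing = TRUE)
print(ranked)
```

A weak pairwise correlation does not guarantee that a variable is useless in a multiple regression, so this ranking is best treated as a first look rather than a selection rule.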

4 LET THE MODELING BEGIN!

We start with a model that uses only mean_education to predict crime_rate.

crime_model_education <- lm(crime_rate ~ mean_education, data = data)
summary(crime_model_education)

#> 
#> Call:
#> lm(formula = crime_rate ~ mean_education, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -600.61 -271.25  -46.54  171.33  916.46 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)    -273.967    518.104  -0.529   0.5996  
#> mean_education   11.161      4.878   2.288   0.0269 *
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 370.1 on 45 degrees of freedom
#> Multiple R-squared:  0.1042, Adjusted R-squared:  0.08432 
#> F-statistic: 5.236 on 1 and 45 DF,  p-value: 0.02688

The R-squared of this model is still small (0.1042), so other variables need to be added for the model to predict better.
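The R-squared values in the summary can be reproduced by hand. The sketch below uses two residual sums of squares that `step()` prints in Section 4.1.2: 6880928 for the intercept-only model and 6163781 after adding mean_education. The variable names are chosen for this example.

```r
# Sums of squares taken from the step() trace in Section 4.1.2
tss <- 6880928   # RSS of the intercept-only model (total sum of squares)
rss <- 6163781   # RSS after adding mean_education

n <- 47          # number of observations
p <- 1           # number of predictors

# Share of the total variation explained by the model
r2 <- 1 - rss / tss

# Adjusted version, which penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(r2, 4))
print(round(adj_r2, 5))
```

Both values agree with the `Multiple R-squared` and `Adjusted R-squared` lines of the summary.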

4.1 Model tuning

To get a better model we use stepwise regression to search for the best model for predicting the crime rate. The starting model is fit on data rather than on the reduced crime object, so it includes all 14 predictors.

crime_model_all <- lm(crime_rate ~ ., data = data)


summary(crime_model_all)

#> 
#> Call:
#> lm(formula = crime_rate ~ ., data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -374.04 -114.78  -11.12  116.20  439.17 
#> 
#> Coefficients:
#>                        Estimate Std. Error t value Pr(>|t|)   
#> (Intercept)          -5379.2157  1686.5498  -3.189  0.00318 **
#> is_south                25.9551   155.8237   0.167  0.86876   
#> mean_education          17.9108     6.5172   2.748  0.00976 **
#> police_exp60            19.9885    11.1601   1.791  0.08275 . 
#> police_exp59           -12.2350    12.3450  -0.991  0.32908   
#> labour_participation    -1.1240     1.5294  -0.735  0.46774   
#> m_per1000f               3.0959     2.0319   1.524  0.13742   
#> state_pop               -0.9567     1.3524  -0.707  0.48441   
#> nonwhites_per1000        0.8139     0.6530   1.246  0.22167   
#> unemploy_m24            -6.5356     4.4162  -1.480  0.14868   
#> unemploy_m39            13.7482     8.5305   1.612  0.11686   
#> gdp                      0.6209     1.0775   0.576  0.56850   
#> inequality               6.7665     2.3857   2.836  0.00785 **
#> prob_prison          -5239.8716  2383.4334  -2.198  0.03527 * 
#> time_prison             -1.0952     7.4451  -0.147  0.88398   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 220 on 32 degrees of freedom
#> Multiple R-squared:  0.7749, Adjusted R-squared:  0.6765 
#> F-statistic:  7.87 on 14 and 32 DF,  p-value: 0.0000007788

4.1.1 Backward

crime_backward <- step(object = crime_model_all, direction = "backward")

#> Start:  AIC=518.93
#> crime_rate ~ is_south + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - time_prison           1      1047 1549763 516.96
#> - is_south              1      1343 1550059 516.97
#> - gdp                   1     16069 1564785 517.42
#> - state_pop             1     24222 1572938 517.66
#> - labour_participation  1     26139 1574855 517.72
#> - police_exp59          1     47539 1596254 518.35
#> <none>                              1548716 518.93
#> - nonwhites_per1000     1     75186 1623901 519.16
#> - unemploy_m24          1    105996 1654712 520.04
#> - m_per1000f            1    112353 1661069 520.22
#> - unemploy_m39          1    125708 1674424 520.60
#> - police_exp60          1    155255 1703971 521.42
#> - prob_prison           1    233914 1782630 523.54
#> - mean_education        1    365540 1914256 526.89
#> - inequality            1    389337 1938053 527.47
#> 
#> Step:  AIC=516.96
#> crime_rate ~ is_south + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1      1662 1551425 515.01
#> - gdp                   1     15455 1565218 515.43
#> - labour_participation  1     25344 1575107 515.73
#> - state_pop             1     28095 1577859 515.81
#> - police_exp59          1     49082 1598845 516.43
#> <none>                              1549763 516.96
#> - nonwhites_per1000     1     79226 1628989 517.31
#> - unemploy_m24          1    105230 1654993 518.05
#> - m_per1000f            1    117312 1667075 518.39
#> - unemploy_m39          1    124680 1674444 518.60
#> - police_exp60          1    163197 1712960 519.67
#> - prob_prison           1    329264 1879027 524.02
#> - mean_education        1    368153 1917917 524.98
#> - inequality            1    391053 1940817 525.54
#> 
#> Step:  AIC=515.01
#> crime_rate ~ mean_education + police_exp60 + police_exp59 + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - gdp                   1     18491 1569916 513.57
#> - state_pop             1     28575 1580000 513.87
#> - labour_participation  1     42139 1593564 514.27
#> - police_exp59          1     49732 1601157 514.50
#> <none>                              1551425 515.01
#> - nonwhites_per1000     1    114651 1666076 516.36
#> - m_per1000f            1    129350 1680775 516.78
#> - unemploy_m24          1    134859 1686284 516.93
#> - unemploy_m39          1    136112 1687537 516.97
#> - police_exp60          1    162981 1714406 517.71
#> - prob_prison           1    339839 1891264 522.32
#> - mean_education        1    372386 1923811 523.12
#> - inequality            1    478612 2030037 525.65
#> 
#> Step:  AIC=513.57
#> crime_rate ~ mean_education + police_exp60 + police_exp59 + labour_participation + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - state_pop             1     23569 1593485 512.27
#> - labour_participation  1     39086 1609002 512.73
#> - police_exp59          1     48084 1617999 512.99
#> <none>                              1569916 513.57
#> - nonwhites_per1000     1    102467 1672383 514.54
#> - m_per1000f            1    141016 1710932 515.61
#> - unemploy_m24          1    159495 1729411 516.12
#> - police_exp60          1    170860 1740775 516.43
#> - unemploy_m39          1    173842 1743757 516.51
#> - prob_prison           1    367306 1937221 521.45
#> - mean_education        1    414349 1984265 522.58
#> - inequality            1    562021 2131936 525.95
#> 
#> Step:  AIC=512.27
#> crime_rate ~ mean_education + police_exp60 + police_exp59 + labour_participation + 
#>     m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
#>     inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - police_exp59          1     45660 1639145 511.60
#> - labour_participation  1     53168 1646653 511.81
#> <none>                              1593485 512.27
#> - nonwhites_per1000     1    103646 1697131 513.23
#> - police_exp60          1    156409 1749894 514.67
#> - unemploy_m39          1    171390 1764875 515.07
#> - unemploy_m24          1    185301 1778786 515.44
#> - m_per1000f            1    271258 1864743 517.66
#> - prob_prison           1    345070 1938555 519.48
#> - mean_education        1    414458 2007943 521.14
#> - inequality            1    539076 2132561 523.97
#> 
#> Step:  AIC=511.6
#> crime_rate ~ mean_education + police_exp60 + labour_participation + 
#>     m_per1000f + nonwhites_per1000 + unemploy_m24 + unemploy_m39 + 
#>     inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1     31149 1670295 510.48
#> <none>                              1639145 511.60
#> - nonwhites_per1000     1     82532 1721678 511.91
#> - unemploy_m24          1    176078 1815224 514.39
#> - unemploy_m39          1    180069 1819215 514.50
#> - m_per1000f            1    271655 1910801 516.81
#> - prob_prison           1    327632 1966777 518.16
#> - mean_education        1    372893 2012039 519.23
#> - inequality            1    608835 2247980 524.44
#> - police_exp60          1   1037573 2676718 532.65
#> 
#> Step:  AIC=510.48
#> crime_rate ~ mean_education + police_exp60 + m_per1000f + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison
#> 
#>                     Df Sum of Sq     RSS    AIC
#> <none>                           1670295 510.48
#> - nonwhites_per1000  1     79563 1749858 510.67
#> - unemploy_m24       1    148514 1818809 512.49
#> - unemploy_m39       1    180373 1850668 513.30
#> - m_per1000f         1    247139 1917434 514.97
#> - prob_prison        1    298976 1969270 516.22
#> - mean_education     1    342951 2013246 517.26
#> - inequality         1    589514 2259809 522.69
#> - police_exp60       1   1101942 2772237 532.30
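The AIC values in the trace can be reproduced from the RSS column. For a linear model, `step()` relies on `extractAIC()`, which computes n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The helper name `aic_lm` below is made up for this sketch.

```r
# AIC as used by step() for lm objects: n * log(RSS / n) + 2 * edf
aic_lm <- function(rss, n_coef, n_obs) {
  n_obs * log(rss / n_obs) + 2 * n_coef
}

n <- 47

# Full model: 14 predictors + intercept, RSS from the first <none> row
aic_full <- aic_lm(1548716, n_coef = 15, n_obs = n)

# Final backward model: 8 predictors + intercept, RSS from the last <none> row
aic_final <- aic_lm(1670295, n_coef = 9, n_obs = n)

print(round(aic_full, 2))
print(round(aic_final, 2))
```

This value differs from what `AIC()` returns by an additive constant, which does not affect the comparison between models fit to the same data.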

4.1.2 Forward

crime_model_none <- lm(crime_rate ~ 1, data = data)



crime_forward <- step(object = crime_model_none, direction = "forward", 
                      scope = list(lower=crime_model_none, upper=crime_model_all))

#> Start:  AIC=561.02
#> crime_rate ~ 1
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + police_exp60          1   3253302 3627626 532.94
#> + police_exp59          1   3058626 3822302 535.39
#> + gdp                   1   1340152 5540775 552.84
#> + prob_prison           1   1257075 5623853 553.54
#> + state_pop             1    783660 6097267 557.34
#> + mean_education        1    717146 6163781 557.85
#> + m_per1000f            1    314867 6566061 560.82
#> <none>                              6880928 561.02
#> + labour_participation  1    245446 6635482 561.32
#> + inequality            1    220530 6660397 561.49
#> + unemploy_m39          1    216354 6664573 561.52
#> + time_prison           1    154545 6726383 561.96
#> + is_south              1     56527 6824400 562.64
#> + unemploy_m24          1     17533 6863395 562.90
#> + nonwhites_per1000     1      7312 6873615 562.97
#> 
#> Step:  AIC=532.94
#> crime_rate ~ police_exp60
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + inequality            1    739819 2887807 524.22
#> + m_per1000f            1    250522 3377104 531.57
#> + nonwhites_per1000     1    232434 3395192 531.82
#> + is_south              1    219098 3408528 532.01
#> + gdp                   1    180872 3446754 532.53
#> <none>                              3627626 532.94
#> + police_exp59          1    146167 3481459 533.00
#> + prob_prison           1     92278 3535348 533.72
#> + labour_participation  1     77479 3550147 533.92
#> + time_prison           1     43185 3584441 534.37
#> + unemploy_m39          1     17848 3609778 534.70
#> + state_pop             1      5666 3621959 534.86
#> + unemploy_m24          1      2878 3624748 534.90
#> + mean_education        1       767 3626859 534.93
#> 
#> Step:  AIC=524.22
#> crime_rate ~ police_exp60 + inequality
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + mean_education        1    587050 2300757 515.53
#> + m_per1000f            1    454545 2433262 518.17
#> + prob_prison           1    280690 2607117 521.41
#> + labour_participation  1    260571 2627236 521.77
#> + gdp                   1    213937 2673871 522.60
#> + state_pop             1    130377 2757430 524.04
#> <none>                              2887807 524.22
#> + nonwhites_per1000     1     36439 2851369 525.62
#> + is_south              1     33738 2854069 525.66
#> + police_exp59          1     30673 2857134 525.71
#> + unemploy_m24          1      2309 2885498 526.18
#> + time_prison           1       497 2887310 526.21
#> + unemploy_m39          1       253 2887554 526.21
#> 
#> Step:  AIC=515.53
#> crime_rate ~ police_exp60 + inequality + mean_education
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + prob_prison           1    234981 2065776 512.47
#> + m_per1000f            1    117026 2183731 515.08
#> <none>                              2300757 515.53
#> + gdp                   1     79540 2221218 515.88
#> + unemploy_m39          1     62112 2238646 516.25
#> + time_prison           1     61770 2238987 516.26
#> + police_exp59          1     42584 2258174 516.66
#> + state_pop             1     39319 2261438 516.72
#> + unemploy_m24          1      7365 2293392 517.38
#> + labour_participation  1      7254 2293503 517.39
#> + nonwhites_per1000     1      4210 2296547 517.45
#> + is_south              1      4135 2296622 517.45
#> 
#> Step:  AIC=512.47
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + m_per1000f            1    131388 1934388 511.38
#> + state_pop             1    106659 1959118 511.98
#> <none>                              2065776 512.47
#> + is_south              1     72314 1993462 512.80
#> + unemploy_m39          1     54109 2011667 513.22
#> + nonwhites_per1000     1     48712 2017064 513.35
#> + police_exp59          1     35611 2030166 513.65
#> + gdp                   1     29624 2036152 513.79
#> + unemploy_m24          1      6851 2058926 514.31
#> + time_prison           1       886 2064890 514.45
#> + labour_participation  1        27 2065750 514.47
#> 
#> Step:  AIC=511.38
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison + 
#>     m_per1000f
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + is_south              1    113698 1820690 510.54
#> + nonwhites_per1000     1     81227 1853161 511.37
#> <none>                              1934388 511.38
#> + state_pop             1     34760 1899628 512.53
#> + unemploy_m39          1     25995 1908393 512.75
#> + gdp                   1     22095 1912293 512.84
#> + police_exp59          1     14475 1919913 513.03
#> + time_prison           1     12146 1922242 513.09
#> + labour_participation  1     12092 1922296 513.09
#> + unemploy_m24          1      5812 1928576 513.24
#> 
#> Step:  AIC=510.54
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison + 
#>     m_per1000f + is_south
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1820690 510.54
#> + police_exp59          1   31426.8 1789263 511.72
#> + nonwhites_per1000     1   29199.8 1791490 511.78
#> + unemploy_m39          1   28253.0 1792437 511.80
#> + state_pop             1   20854.7 1799835 511.99
#> + gdp                   1   12880.2 1807810 512.20
#> + time_prison           1   11213.0 1809477 512.24
#> + unemploy_m24          1     615.8 1820074 512.52
#> + labour_participation  1      93.6 1820597 512.53

4.1.3 Both

crime_both <- step(object = crime_model_none, direction = "both", 
                    scope = list(lower=crime_model_none, upper = crime_model_all))

#> Start:  AIC=561.02
#> crime_rate ~ 1
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + police_exp60          1   3253302 3627626 532.94
#> + police_exp59          1   3058626 3822302 535.39
#> + gdp                   1   1340152 5540775 552.84
#> + prob_prison           1   1257075 5623853 553.54
#> + state_pop             1    783660 6097267 557.34
#> + mean_education        1    717146 6163781 557.85
#> + m_per1000f            1    314867 6566061 560.82
#> <none>                              6880928 561.02
#> + labour_participation  1    245446 6635482 561.32
#> + inequality            1    220530 6660397 561.49
#> + unemploy_m39          1    216354 6664573 561.52
#> + time_prison           1    154545 6726383 561.96
#> + is_south              1     56527 6824400 562.64
#> + unemploy_m24          1     17533 6863395 562.90
#> + nonwhites_per1000     1      7312 6873615 562.97
#> 
#> Step:  AIC=532.94
#> crime_rate ~ police_exp60
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + inequality            1    739819 2887807 524.22
#> + m_per1000f            1    250522 3377104 531.57
#> + nonwhites_per1000     1    232434 3395192 531.82
#> + is_south              1    219098 3408528 532.01
#> + gdp                   1    180872 3446754 532.53
#> <none>                              3627626 532.94
#> + police_exp59          1    146167 3481459 533.00
#> + prob_prison           1     92278 3535348 533.72
#> + labour_participation  1     77479 3550147 533.92
#> + time_prison           1     43185 3584441 534.37
#> + unemploy_m39          1     17848 3609778 534.70
#> + state_pop             1      5666 3621959 534.86
#> + unemploy_m24          1      2878 3624748 534.90
#> + mean_education        1       767 3626859 534.93
#> - police_exp60          1   3253302 6880928 561.02
#> 
#> Step:  AIC=524.22
#> crime_rate ~ police_exp60 + inequality
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + mean_education        1    587050 2300757 515.53
#> + m_per1000f            1    454545 2433262 518.17
#> + prob_prison           1    280690 2607117 521.41
#> + labour_participation  1    260571 2627236 521.77
#> + gdp                   1    213937 2673871 522.60
#> + state_pop             1    130377 2757430 524.04
#> <none>                              2887807 524.22
#> + nonwhites_per1000     1     36439 2851369 525.62
#> + is_south              1     33738 2854069 525.66
#> + police_exp59          1     30673 2857134 525.71
#> + unemploy_m24          1      2309 2885498 526.18
#> + time_prison           1       497 2887310 526.21
#> + unemploy_m39          1       253 2887554 526.21
#> - inequality            1    739819 3627626 532.94
#> - police_exp60          1   3772590 6660397 561.49
#> 
#> Step:  AIC=515.53
#> crime_rate ~ police_exp60 + inequality + mean_education
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + prob_prison           1    234981 2065776 512.47
#> + m_per1000f            1    117026 2183731 515.08
#> <none>                              2300757 515.53
#> + gdp                   1     79540 2221218 515.88
#> + unemploy_m39          1     62112 2238646 516.25
#> + time_prison           1     61770 2238987 516.26
#> + police_exp59          1     42584 2258174 516.66
#> + state_pop             1     39319 2261438 516.72
#> + unemploy_m24          1      7365 2293392 517.38
#> + labour_participation  1      7254 2293503 517.39
#> + nonwhites_per1000     1      4210 2296547 517.45
#> + is_south              1      4135 2296622 517.45
#> - mean_education        1    587050 2887807 524.22
#> - inequality            1   1326101 3626859 534.93
#> - police_exp60          1   3782666 6083423 559.23
#> 
#> Step:  AIC=512.47
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + m_per1000f            1    131388 1934388 511.38
#> + state_pop             1    106659 1959118 511.98
#> <none>                              2065776 512.47
#> + is_south              1     72314 1993462 512.80
#> + unemploy_m39          1     54109 2011667 513.22
#> + nonwhites_per1000     1     48712 2017064 513.35
#> + police_exp59          1     35611 2030166 513.65
#> + gdp                   1     29624 2036152 513.79
#> + unemploy_m24          1      6851 2058926 514.31
#> + time_prison           1       886 2064890 514.45
#> + labour_participation  1        27 2065750 514.47
#> - prob_prison           1    234981 2300757 515.53
#> - mean_education        1    541341 2607117 521.41
#> - inequality            1   1460866 3526642 535.61
#> - police_exp60          1   3060905 5126682 553.19
#> 
#> Step:  AIC=511.38
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison + 
#>     m_per1000f
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + is_south              1    113698 1820690 510.54
#> + nonwhites_per1000     1     81227 1853161 511.37
#> <none>                              1934388 511.38
#> - m_per1000f            1    131388 2065776 512.47
#> + state_pop             1     34760 1899628 512.53
#> + unemploy_m39          1     25995 1908393 512.75
#> + gdp                   1     22095 1912293 512.84
#> + police_exp59          1     14475 1919913 513.03
#> + time_prison           1     12146 1922242 513.09
#> + labour_participation  1     12092 1922296 513.09
#> + unemploy_m24          1      5812 1928576 513.24
#> - mean_education        1    213450 2147838 514.30
#> - prob_prison           1    249343 2183731 515.08
#> - inequality            1   1220183 3154571 532.37
#> - police_exp60          1   3151626 5086014 554.82
#> 
#> Step:  AIC=510.54
#> crime_rate ~ police_exp60 + inequality + mean_education + prob_prison + 
#>     m_per1000f + is_south
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1820690 510.54
#> - is_south              1    113698 1934388 511.38
#> + police_exp59          1     31427 1789263 511.72
#> + nonwhites_per1000     1     29200 1791490 511.78
#> + unemploy_m39          1     28253 1792437 511.80
#> + state_pop             1     20855 1799835 511.99
#> + gdp                   1     12880 1807810 512.20
#> + time_prison           1     11213 1809477 512.24
#> + unemploy_m24          1       616 1820074 512.52
#> + labour_participation  1        94 1820597 512.53
#> - m_per1000f            1    172772 1993462 512.80
#> - mean_education        1    272506 2093196 515.09
#> - prob_prison           1    349773 2170463 516.79
#> - inequality            1    667529 2488219 523.22
#> - police_exp60          1   2594282 4414972 550.17

4.1.4 Goodness of fit comparison

summary(crime_model_none)$adj.r.squared

#> [1] 0

summary(crime_model_all)$adj.r.squared

#> [1] 0.6764565

summary(crime_forward)$adj.r.squared

#> [1] 0.6957106

summary(crime_backward)$adj.r.squared

#> [1] 0.7061536

summary(crime_both)$adj.r.squared

#> [1] 0.6957106

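The models obtained through stepwise selection have a higher adjusted R-squared than the full model, so they explain more of the variation in the target variable after accounting for the number of predictors.

In conclusion, the backward model, which has the highest adjusted R-squared (0.7062), is chosen.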

summary(crime_backward)

#> 
#> Call:
#> lm(formula = crime_rate ~ mean_education + police_exp60 + m_per1000f + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + inequality + 
#>     prob_prison, data = data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -371.20 -120.35   22.43  120.22  380.21 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value  Pr(>|t|)    
#> (Intercept)       -5751.9927  1245.5128  -4.618 0.0000434 ***
#> mean_education       15.5562     5.5692   2.793  0.008128 ** 
#> police_exp60          9.0230     1.8021   5.007 0.0000130 ***
#> m_per1000f            3.3736     1.4227   2.371  0.022904 *  
#> nonwhites_per1000     0.6703     0.4982   1.345  0.186471    
#> unemploy_m24         -6.5689     3.5737  -1.838  0.073869 .  
#> unemploy_m39         15.5352     7.6690   2.026  0.049857 *  
#> inequality            6.0220     1.6444   3.662  0.000758 ***
#> prob_prison       -4380.4081  1679.5830  -2.608  0.012946 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 209.7 on 38 degrees of freedom
#> Multiple R-squared:  0.7573, Adjusted R-squared:  0.7062 
#> F-statistic: 14.82 on 8 and 38 DF,  p-value: 0.000000001459

5 Model Evaluation

For model evaluation we use unseen data, which we name data_test.

data_test <- read.csv("crime_test.csv")
data_test

backward_pred <- predict(crime_backward,data_test)

5.1 Testing the error with MSE

MSE(predict(crime_backward, data_test), y_true = data_test$crime_rate)

#> [1] 38933.34

MSE(predict(crime_forward, data_test), y_true = data_test$crime_rate)

#> [1] 47710.34

MSE(predict(crime_both, data_test), y_true = data_test$crime_rate)

#> [1] 47710.34

Based on the results above, and because we prioritize a low error on unseen data, we choose the backward model as the best model.
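For reference, MSE is the mean of the squared differences between predictions and actual values. The sketch below computes it in base R on made-up vectors; the function name `mse` and the numbers are assumptions for this example only.

```r
# Mean squared error: average of squared prediction errors
mse <- function(y_pred, y_true) {
  mean((y_true - y_pred)^2)
}

# Made-up values for illustration only
y_true <- c(10, 20, 30)
y_pred <- c(12, 18, 33)

toy_mse <- mse(y_pred, y_true)    # (4 + 4 + 9) / 3
toy_rmse <- sqrt(toy_mse)         # back on the scale of the target

print(toy_mse)
print(toy_rmse)
```

Taking the square root puts the error back in the units of crime_rate. For the backward model, the square root of 38933.34 is roughly 197.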

5.2 Normality

Visualize the distribution of the residuals with a histogram.

hist(crime_backward$residuals,breaks = 10)

Run a statistical test using the shapiro.test() function.

Shapiro-Wilk hypothesis:

H0: the errors/residuals are normally distributed
H1: the errors/residuals are not normally distributed

shapiro.test(crime_backward$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  crime_backward$residuals
#> W = 0.97508, p-value = 0.4076

For our chosen model, crime_backward, the p-value (0.4076) is greater than 0.05, so we fail to reject H0. The residuals can therefore be treated as normally distributed, which means the model's errors are spread around their mean.
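The decision rule can be written out explicitly. The sketch below simulates data, fits a simple model, and applies the same 0.05 threshold to the Shapiro-Wilk p-value. The simulated variables `x` and `y` are assumptions for this example and are unrelated to the crime data.

```r
# Simulated data for illustration only
set.seed(1)
x <- runif(50)
y <- 3 + 2 * x + rnorm(50)

fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals
sw <- shapiro.test(residuals(fit))

# H0: residuals are normally distributed
decision <- if (sw$p.value > 0.05) "fail to reject H0" else "reject H0"

print(sw$statistic)
print(sw$p.value)
print(decision)
```

A large p-value does not prove normality; it only means the test found no evidence against it at the chosen level.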

5.3 Homoscedasticity

Plot the fitted values against the residuals (scatter plot).

plot(crime_backward$fitted.values, crime_backward$residuals)
abline(h=0, col = "red")

Breusch-Pagan hypothesis:

H0: Homoscedasticity
H1: Heteroscedasticity

library(lmtest)
bptest(crime_backward)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  crime_backward
#> BP = 12.908, df = 8, p-value = 0.115

For our chosen model, crime_backward, the p-value (0.115) is greater than 0.05, so we fail to reject H0. The spread of the residuals shows no clear pattern (homoscedasticity), which suggests the model has captured the systematic patterns in the data.
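To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. The data are simulated and all variable names are assumptions for this example.

```r
# Simulated data with constant error variance, for illustration only
set.seed(42)
n <- 200
x1 <- runif(n)
x2 <- runif(n)
y <- 1 + 2 * x1 - x2 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression: squared residuals on the same predictors
e2 <- residuals(fit)^2
aux <- lm(e2 ~ x1 + x2)

# Studentized (Koenker) statistic: n times R-squared of the auxiliary fit
bp_stat <- n * summary(aux)$r.squared
bp_df <- 2                                   # number of predictors
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(bp_stat)
print(bp_p)
```

If the predictors explain little of the variation in the squared residuals, the statistic is small and the p-value is large, which is the homoscedastic case.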

5.4 No multicollinearity

library(car)
vif(crime_backward)

#>    mean_education      police_exp60        m_per1000f nonwhites_per1000 
#>          4.062176          3.001690          1.839389          2.746325 
#>      unemploy_m24      unemploy_m39        inequality       prob_prison 
#>          4.344196          4.390032          4.504129          1.526218

No VIF value is equal to or greater than 10, so we find no severe multicollinearity among the predictor variables.

Based on these checks, the backward model satisfies the assumptions of linear regression. Its residual standard error (209.7) is also lower than that of the full model (220), so it is kept as the better model.
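The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it in base R on simulated data in which `x2` is deliberately built from `x1`. All names and values are assumptions for this example.

```r
# Simulated predictors for illustration only
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.5)   # strongly related to x1
x3 <- rnorm(n)                  # unrelated to the others
X <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  aux <- lm(reformulate(others, response = v), data = X)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF close to 1 means the predictor is nearly uncorrelated with the rest, while larger values indicate that its coefficient is estimated less precisely.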

6 Conclusion

The chosen model, crime_backward, has a better adjusted R-squared and test MSE than the other models. The diagnostic tests also show that it meets the assumptions of linear regression.

Based on this model, mean_education has a positive coefficient (15.56) for crime_rate. In other words, the average education level of a population is associated with its crime rate.

LINEAR REGRESSION ON CRIME RATE PREDICTION

by Ekaprana Danian