# clear-up the environment
rm(list = ls())

# chunk options
knitr::opts_chunk$set(
  message = FALSE,
  warning = FALSE,
  fig.align = "center",
  comment = "#>"
)

# scientific notation
options(scipen = 9999)

1 Flashback RM Day 1

1.1 Regression model formula:

\[ \hat{y} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \]

Where:

\(\hat{y}\) is the predicted value of the target variable
\(\beta_0\) is the intercept (the point where the line crosses the y-axis)
\(\beta_1, \dots, \beta_n\) are the coefficients of the predictors (the slopes of the line)
\(x_1, \dots, x_n\) are the values of the predictors
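As a rough illustration of the formula, the sketch below fits a small model on R's built-in mtcars data (an assumption; it is not the data used in this course) and rebuilds one prediction by hand from the intercept and slopes. The names fit, b, x_new and y_hat_manual are hypothetical.

```r
# Minimal sketch: y_hat = b0 + b1*x1 + b2*x2, using the built-in mtcars data
fit <- lm(formula = mpg ~ wt + hp, data = mtcars)

b <- coef(fit)                         # b0 (intercept), b1 (wt), b2 (hp)
x_new <- data.frame(wt = 3, hp = 110)  # hypothetical predictor values

# Manual calculation following the formula
y_hat_manual <- b["(Intercept)"] + b["wt"] * x_new$wt + b["hp"] * x_new$hp

# The same prediction obtained from predict()
y_hat_predict <- predict(fit, newdata = x_new)
```

Both values should agree, which shows that predict() is only applying the formula for us.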

1.2 Regression model function in R:

Syntax without predictors: lm(formula = nama_kolom_target ~ 1, data = nama_df), where nama_kolom_target is the target column and nama_df is the data frame
- formula: written as y ~ x
- data: a data frame object
- The 1 in the formula tells the model that we will not use any predictor at all, so only the intercept is fitted.
Syntax with 1 predictor: lm(formula = nama_kolom_target ~ nama_kolom_prediktor, data = nama_df)
Syntax with > 1 predictor: lm(formula = nama_kolom_target ~ nama_prediktor1 + nama_prediktor2 + ..., data = nama_df)
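The three syntax variants can be tried side by side. This is a small sketch on the built-in mtcars data (an assumption, chosen only so the code runs on its own); the object names m0, m1 and m2 are hypothetical.

```r
# Sketch on the built-in mtcars data (not the course data)
m0 <- lm(formula = mpg ~ 1, data = mtcars)        # no predictor
m1 <- lm(formula = mpg ~ wt, data = mtcars)       # 1 predictor
m2 <- lm(formula = mpg ~ wt + hp, data = mtcars)  # > 1 predictor

# With no predictor the only coefficient is the intercept,
# and it equals the mean of the target
coef(m0)
mean(mtcars$mpg)

# Each added predictor adds one slope to the coefficients
length(coef(m1))
length(coef(m2))
```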

1.3 Interpreting the regression model:

Intercept
- The intercept is the value of the target variable when the predictor is 0 (Sales = 0).
  
  So when Sales = 0, the company's Profit is -114.06 (a loss).
Slope
- The effect of the predictor used, which in this case is the Sales column
  
  Every increase in Sales increases Profit
  Every 1 USD increase in Sales increases Profit by 0.42286 USD
  - Calculation: Sales coefficient * Sales value = 0.42286 * 1 = 0.42286 USD
- The direction of a predictor's effect depends on whether its coefficient is positive or negative.
  
  If the coefficient is positive, the relationship between the target and the predictor is positive: the target value increases as the predictor increases
  
  If the coefficient is negative, the relationship between the target and the predictor is negative: the target value decreases as the predictor increases
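The intercept and slope interpretation can be checked with simple arithmetic. The sketch below only reuses the two coefficients quoted in the text (-114.06 and 0.42286); predict_profit is a hypothetical helper, not a function from any package.

```r
# Coefficients quoted in the text
intercept <- -114.06
slope_sales <- 0.42286

# Hypothetical helper applying Profit = intercept + slope * Sales
predict_profit <- function(sales) intercept + slope_sales * sales

# Profit when Sales = 0 is the intercept itself
profit_at_zero <- predict_profit(0)

# Change in Profit when Sales goes up by 1 USD is the slope
profit_change <- predict_profit(101) - predict_profit(100)

profit_at_zero
profit_change
```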
Significance/P-Value

We can read the significance level from the number of stars
- 3 stars *** (p < 0.001) -> the predictor has a highly significant effect on the target
- 2 stars ** (p < 0.01) -> the predictor has a significant effect on the target
- 1 star * (p < 0.05) -> the predictor has a significant effect on the target, but the evidence is weaker
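The stars come from fixed p-value cut-offs, the same ones printed in the "Signif. codes" legend of summary(). A minimal sketch of that mapping, using made-up p-values (p_values is hypothetical data, not model output):

```r
# Hypothetical p-values, one per significance band
p_values <- c(0.0004, 0.004, 0.03, 0.07, 0.4)

# Cut-offs taken from the legend: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
stars <- cut(p_values,
             breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "),
             include.lowest = TRUE)

data.frame(p_value = p_values, stars = as.character(stars))
```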
R-Squared & Adjusted R-Squared
- Multiple R-squared: used for models with 1 predictor
- Adjusted R-squared: used for models with > 1 predictor
  
  its formula adds a penalty for the number of predictors we use
- Range: 0-1 for multiple R-squared, and the closer to 1 the better (adjusted R-squared can fall slightly below 0 for a very poor model)
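To make the penalty concrete, here is a sketch of the usual adjusted R-squared formula, 1 - (1 - R²)(n - 1)/(n - p - 1), with n observations and p predictors. It is checked against summary() on the built-in mtcars data (an assumption, used only so the example is self-contained); adj_r2 is a hypothetical helper.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n = number of observations, p = number of predictors
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

fit <- lm(mpg ~ wt + hp, data = mtcars)  # model with p = 2 predictors
s <- summary(fit)

manual <- adj_r2(r2 = s$r.squared, n = nrow(mtcars), p = 2)

manual            # computed by hand
s$adj.r.squared   # reported by summary()
```

Because the penalty grows with p, the adjusted value is always a little lower than the multiple R-squared of the same model.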
Dummy variable
- Each value of a categorical column is turned into a dummy variable
- A variable with k categories needs a set of k-1 dummy variables to capture all of the information it contains.
- Interpretation of dummy variables (only for categorical predictors):
  
  The first level (in alphabetical order) does not enter the model as a predictor; it serves as the baseline
  
  The slope shows how much the target increases/decreases for that category compared with the baseline category, provided the values of the other predictors stay the same.
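A tiny made-up example shows the k-1 rule and the baseline. The data frame df and its segment column are hypothetical; model.matrix() displays the dummy columns that lm() builds internally.

```r
# Hypothetical data: one categorical predictor with k = 3 categories
df <- data.frame(
  y = c(10, 12, 20, 22, 30, 32),
  segment = factor(c("a", "a", "b", "b", "c", "c"))
)

# Dummy columns: k - 1 = 2 dummies, category "a" is the baseline
dummies <- model.matrix(y ~ segment, data = df)
colnames(dummies)

# Intercept = mean of the baseline category,
# each slope = difference between that category and the baseline
fit <- lm(y ~ segment, data = df)
coef(fit)
```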

1.4 Error

To calculate the error we will use library(MLmetrics). These are the error metrics we can use (hasil_prediksi = predicted values, nilai_aktual = actual values):

MAE (Mean Absolute Error)
Syntax: MAE(y_pred = hasil_prediksi, y_true = nilai_aktual)
- Easier to explain to people without a statistics background
- Less sensitive to large errors caused by outliers
MAPE (Mean Absolute Percentage Error)
Syntax: MAPE(y_pred = hasil_prediksi, y_true = nilai_aktual)
- Easy to interpret
- Cannot be used when the actual values contain 0
MSE (Mean Squared Error)
Syntax: MSE(y_pred = hasil_prediksi, y_true = nilai_aktual)
- Sensitive to large errors (because the errors are squared)
- Hard to interpret and explain, because its unit is the square of the target's unit
RMSE (Root Mean Squared Error)
Syntax: RMSE(y_pred = hasil_prediksi, y_true = nilai_aktual)
- Derived from MSE, so it is also sensitive to large errors
- The value can be interpreted (the unit is the same as the original data)
- The formula is harder to explain.
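For intuition, the four metrics can be written in a few lines of base R. This is only a sketch of the definitions, not the MLmetrics source; the *_manual helpers and the toy vectors are hypothetical, and mape_manual returns a proportion (0.05 means 5%).

```r
# Hypothetical helpers implementing the textbook definitions
mae_manual  <- function(y_pred, y_true) mean(abs(y_true - y_pred))
mape_manual <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))
mse_manual  <- function(y_pred, y_true) mean((y_true - y_pred)^2)
rmse_manual <- function(y_pred, y_true) sqrt(mse_manual(y_pred, y_true))

# Toy data: errors are -10, 20 and 0
y_true <- c(100, 200, 400)
y_pred <- c(110, 180, 400)

mae  <- mae_manual(y_pred, y_true)
mape <- mape_manual(y_pred, y_true)
mse  <- mse_manual(y_pred, y_true)
rmse <- rmse_manual(y_pred, y_true)

c(MAE = mae, MAPE = mape, MSE = mse, RMSE = rmse)
```

Note how the single error of 20 dominates MSE and RMSE more than MAE, and how MAPE would divide by zero if any actual value were 0.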

2 Case 2: Inequality Prediction

Here is the workflow we will follow:

Step 1: Import data
Step 2: Data preparation
- Step 2.1: Data wrangling
- Step 2.2: Exploratory Data Analysis (EDA)
Step 3: Model building
- Step 3.1: Build the regression model & interpret it
- Step 3.2: Variable reselection (new)
Step 4: Prediction
Step 5: Evaluation
- Step 5.1: Calculate the error
- Step 5.2: Check the assumptions (new)

2.1 Step 1: Import data

The following data covers crime in each state of the United States. The data was collected in 1959-1960.

# Library
library(dplyr)

# Read data
crime <- read.csv("data_input/crime.csv") 

# Peek at the data
head(crime)

The column names are not yet informative enough, so we will rename them using the names() function.

# rename column
names(crime) <- c("X", "percent_m", "is_south", "mean_education", "police_exp60", "police_exp59", "labour_participation", "m_per1000f", "state_pop", "nonwhites_per1000", "unemploy_m24", "unemploy_m39", "gdp", "inequality", "prob_prison", "time_prison", "crime_rate")

# Check the renamed columns
head(crime)

Variable description:

percent_m: percentage of males aged 14-24
is_south: whether it is in a Southern state. 1 for Yes, 0 for No.
mean_education: mean years of schooling
police_exp60: police expenditure in 1960
police_exp59: police expenditure in 1959
labour_participation: labour force participation rate
m_per1000f: number of males per 1000 females
state_pop: state population
nonwhites_per1000: number of non-whites resident per 1000 people
unemploy_m24: unemployment rate of urban males aged 14-24
unemploy_m39: unemployment rate of urban males aged 35-39
gdp: gross domestic product per head
inequality: income inequality
prob_prison: probability of imprisonment
time_prison: avg time served in prisons
crime_rate: crime rate in an unspecified category

2.2 Step 2: Data Preparation

2.2.1 Step 2.1: Data Wrangling

Check the data structure

# Please type your code
glimpse(crime)

#> Rows: 47
#> Columns: 17
#> $ X                    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
#> $ percent_m            <int> 151, 143, 142, 136, 141, 121, 127, 131, 157, 140,…
#> $ is_south             <int> 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0…
#> $ mean_education       <int> 91, 113, 89, 121, 121, 110, 111, 109, 90, 118, 10…
#> $ police_exp60         <int> 58, 103, 45, 149, 109, 118, 82, 115, 65, 71, 121,…
#> $ police_exp59         <int> 56, 95, 44, 141, 101, 115, 79, 109, 62, 68, 116, …
#> $ labour_participation <int> 510, 583, 533, 577, 591, 547, 519, 542, 553, 632,…
#> $ m_per1000f           <int> 950, 1012, 969, 994, 985, 964, 982, 969, 955, 102…
#> $ state_pop            <int> 33, 13, 18, 157, 18, 25, 4, 50, 39, 7, 101, 47, 2…
#> $ nonwhites_per1000    <int> 301, 102, 219, 80, 30, 44, 139, 179, 286, 15, 106…
#> $ unemploy_m24         <int> 108, 96, 94, 102, 91, 84, 97, 79, 81, 100, 77, 83…
#> $ unemploy_m39         <int> 41, 36, 33, 39, 20, 29, 38, 35, 28, 24, 35, 31, 2…
#> $ gdp                  <int> 394, 557, 318, 673, 578, 689, 620, 472, 421, 526,…
#> $ inequality           <int> 261, 194, 250, 167, 174, 126, 168, 206, 239, 174,…
#> $ prob_prison          <dbl> 0.084602, 0.029599, 0.083401, 0.015801, 0.041399,…
#> $ time_prison          <dbl> 26.2011, 25.2999, 24.3006, 29.9012, 21.2998, 20.9…
#> $ crime_rate           <int> 791, 1635, 578, 1969, 1234, 682, 963, 1555, 856, …

Q1: Are there any columns that should be dropped?

Q2: Are there any columns that need to be fixed?

# Please type your code
crime_clean <- crime %>% 
  select(-X) %>% 
  mutate(is_south = as.factor(is_south))

2.2.2 Step 2.2: Exploratory Data Analysis (EDA)

Data analysis 1: Identify the target & predictors

From the data we have, we want to predict crime_rate based on the columns that look promising. Which columns will be the target and the predictors?

Target: Crime Rate
Predictors: Not known yet
Data analysis 2: Check the correlation between the target variable and the predictors

Because we are not yet sure which predictors to use, let's look at the relationship between the target and each candidate predictor.

To make this easier, we will use a heatmap from the ggcorr() function in library(GGally).

# Please type your code
library(GGally)

ggcorr(data = crime_clean, # the data to use
       label = TRUE, # show the correlation values as labels
       hjust = 1, # shift the column names
       layout.exp = 3) # keep the labels from being cut off

From the visualization above, are there any predictors that look promising?

police_exp59
police_exp60
gdp
prob_prison

note: because we may end up using quite a lot of predictors, spotting outliers from the overall data pattern is difficult. For now we will use all of the data (without checking for outliers) to build the model

note: the correlation between police_exp59 and police_exp60 is 1, so we could keep just one of the two variables to avoid redundancy. -> this is covered in the discussion of the multicollinearity assumption.
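The same screening can be done numerically instead of reading it off a heatmap. The sketch below uses the built-in mtcars data with mpg as a stand-in target (an assumption; the course data is not bundled with R), ranks predictors by absolute correlation with the target, and lists predictor pairs that are highly correlated with each other. The 0.9 cut-off is an arbitrary illustration, not a rule from this course.

```r
# Correlation matrix of all numeric columns (mtcars is a stand-in dataset)
cor_mat <- cor(mtcars)

# Correlation of every predictor with the target, strongest first
cor_target <- cor_mat[, "mpg"]
cor_target <- cor_target[names(cor_target) != "mpg"]
ranked <- sort(abs(cor_target), decreasing = TRUE)
head(ranked, 4)

# Predictor pairs with |r| > 0.9: candidates for keeping only one of the two
pred_cor <- cor_mat[names(cor_target), names(cor_target)]
high <- which(abs(pred_cor) > 0.9 & upper.tri(pred_cor), arr.ind = TRUE)
data.frame(var1 = rownames(pred_cor)[high[, "row"]],
           var2 = colnames(pred_cor)[high[, "col"]])
```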

2.3 Step 3: Model Building

2.3.1 Step 3.1: Build the regression model & interpret it

Build the Model

We will build a model from the predictors we consider promising.

# Please type your code
model_potensial <- lm(formula = crime_rate ~ police_exp59 + police_exp60 + gdp + prob_prison, 
                      data = crime)

Model Interpretation

Let's check the model summary

# Please type your code
summary(model_potensial)

#> 
#> Call:
#> lm(formula = crime_rate ~ police_exp59 + police_exp60 + gdp + 
#>     prob_prison, data = crime)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -679.08 -123.01   25.81  140.31  480.64 
#> 
#> Coefficients:
#>                Estimate Std. Error t value Pr(>|t|)  
#> (Intercept)    849.8077   342.7515   2.479   0.0173 *
#> police_exp59   -14.1238    12.9385  -1.092   0.2812  
#> police_exp60    24.2491    11.9995   2.021   0.0497 *
#> gdp             -1.3353     0.7298  -1.830   0.0744 .
#> prob_prison  -3633.8717  2135.9895  -1.701   0.0963 .
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 273.1 on 42 degrees of freedom
#> Multiple R-squared:  0.5447, Adjusted R-squared:  0.5014 
#> F-statistic: 12.56 on 4 and 42 DF,  p-value: 0.0000008288

Q1: Are there any variables that are not significant?

Q2: How is the Adjusted R-Squared value?

2.3.2 Step 3.2: Variable reselection

Stepwise Regression

For variable reselection, R already provides a method called stepwise regression.

Stepwise regression helps us choose good predictors by searching for the combination of predictors that gives the best model according to its AIC. The Akaike Information Criterion (AIC) estimates how much information the model loses (information loss). A good regression model is therefore one with a small AIC.
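For a linear model, the AIC printed by step() can be reproduced from the residual sum of squares (RSS): it is n * log(RSS / n) + 2 * (number of coefficients), which is what extractAIC() returns for lm objects. The sketch below checks this on the built-in mtcars data and then on two RSS values taken from the step() traces in this section (47 rows; 1 coefficient for the intercept-only model, 16 for the model with all 15 predictors). step_aic is a hypothetical helper.

```r
# Hypothetical helper: AIC on the scale used by step() for lm models
step_aic <- function(rss, n, n_coef) n * log(rss / n) + 2 * n_coef

# Check against extractAIC() on a stand-in model
fit <- lm(mpg ~ wt, data = mtcars)
manual <- step_aic(rss = sum(resid(fit)^2),
                   n = nrow(mtcars),
                   n_coef = length(coef(fit)))
reference <- extractAIC(fit)[2]

# RSS values copied from the step() traces in this section
aic_null <- step_aic(rss = 6880928, n = 47, n_coef = 1)   # trace shows 561.02
aic_full <- step_aic(rss = 1354946, n = 47, n_coef = 16)  # trace shows 514.65
```

The formula makes the trade-off visible: an extra predictor lowers the RSS term but adds 2 to the penalty, so it only pays off when the fit improves enough.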

There are three approaches to stepwise regression:

Backward Elimination
Forward Selection
Both
Stepwise Regression - Backward Elimination

Backward Elimination process:

Step 1: Build a model with all predictors

Syntax with all predictors: lm(formula = nama_kolom_target ~ ., data = nama_df)

# Please type your code

model_all_predictor <- lm(formula = crime_rate ~ .,
                          data = crime_clean)

Step 2: Apply backward elimination

Backward elimination can be done with the step() function. These are the parameters we need to fill in:

object = the model that has already been built
direction = set to "backward"

# Please type your code
model_backward <- step(object = model_all_predictor, 
                       direction = "backward")

#> Start:  AIC=514.65
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     police_exp59 + labour_participation + m_per1000f + state_pop + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1        29 1354974 512.65
#> - labour_participation  1      8917 1363862 512.96
#> - time_prison           1     10304 1365250 513.00
#> - state_pop             1     14122 1369068 513.14
#> - nonwhites_per1000     1     18395 1373341 513.28
#> - m_per1000f            1     31967 1386913 513.74
#> - gdp                   1     37613 1392558 513.94
#> - police_exp59          1     37919 1392865 513.95
#> <none>                              1354946 514.65
#> - unemploy_m24          1     83722 1438668 515.47
#> - police_exp60          1    144306 1499252 517.41
#> - unemploy_m39          1    181536 1536482 518.56
#> - percent_m             1    193770 1548716 518.93
#> - prob_prison           1    199538 1554484 519.11
#> - mean_education        1    402117 1757063 524.86
#> - inequality            1    423031 1777977 525.42
#> 
#> Step:  AIC=512.65
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - time_prison           1     10341 1365315 511.01
#> - labour_participation  1     10878 1365852 511.03
#> - state_pop             1     14127 1369101 511.14
#> - nonwhites_per1000     1     21626 1376600 511.39
#> - m_per1000f            1     32449 1387423 511.76
#> - police_exp59          1     37954 1392929 511.95
#> - gdp                   1     39223 1394197 511.99
#> <none>                              1354974 512.65
#> - unemploy_m24          1     96420 1451395 513.88
#> - police_exp60          1    144302 1499277 515.41
#> - unemploy_m39          1    189859 1544834 516.81
#> - percent_m             1    195084 1550059 516.97
#> - prob_prison           1    204463 1559437 517.26
#> - mean_education        1    403140 1758114 522.89
#> - inequality            1    488834 1843808 525.13
#> 
#> Step:  AIC=511.01
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1     10533 1375848 509.37
#> - nonwhites_per1000     1     15482 1380797 509.54
#> - state_pop             1     21846 1387161 509.75
#> - police_exp59          1     28932 1394247 509.99
#> - gdp                   1     36070 1401385 510.23
#> - m_per1000f            1     41784 1407099 510.42
#> <none>                              1365315 511.01
#> - unemploy_m24          1     91420 1456735 512.05
#> - police_exp60          1    134137 1499452 513.41
#> - unemploy_m39          1    184143 1549458 514.95
#> - percent_m             1    186110 1551425 515.01
#> - prob_prison           1    237493 1602808 516.54
#> - mean_education        1    409448 1774763 521.33
#> - inequality            1    502909 1868224 523.75
#> 
#> Step:  AIC=509.37
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                     Df Sum of Sq     RSS    AIC
#> - nonwhites_per1000  1     11675 1387523 507.77
#> - police_exp59       1     21418 1397266 508.09
#> - state_pop          1     27803 1403651 508.31
#> - m_per1000f         1     31252 1407100 508.42
#> - gdp                1     35035 1410883 508.55
#> <none>                           1375848 509.37
#> - unemploy_m24       1     80954 1456802 510.06
#> - police_exp60       1    123896 1499744 511.42
#> - unemploy_m39       1    190746 1566594 513.47
#> - percent_m          1    217716 1593564 514.27
#> - prob_prison        1    226971 1602819 514.54
#> - mean_education     1    413254 1789103 519.71
#> - inequality         1    500944 1876792 521.96
#> 
#> Step:  AIC=507.77
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
#>     inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - police_exp59    1     16706 1404229 506.33
#> - state_pop       1     25793 1413315 506.63
#> - m_per1000f      1     26785 1414308 506.66
#> - gdp             1     31551 1419073 506.82
#> <none>                        1387523 507.77
#> - unemploy_m24    1     83881 1471404 508.52
#> - police_exp60    1    118348 1505871 509.61
#> - unemploy_m39    1    201453 1588976 512.14
#> - prob_prison     1    216760 1604282 512.59
#> - percent_m       1    309214 1696737 515.22
#> - mean_education  1    402754 1790276 517.74
#> - inequality      1    589736 1977259 522.41
#> 
#> Step:  AIC=506.33
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - state_pop       1     22345 1426575 505.07
#> - gdp             1     32142 1436371 505.39
#> - m_per1000f      1     36808 1441037 505.54
#> <none>                        1404229 506.33
#> - unemploy_m24    1     86373 1490602 507.13
#> - unemploy_m39    1    205814 1610043 510.76
#> - prob_prison     1    218607 1622836 511.13
#> - percent_m       1    307001 1711230 513.62
#> - mean_education  1    389502 1793731 515.83
#> - inequality      1    608627 2012856 521.25
#> - police_exp60    1   1050202 2454432 530.57
#> 
#> Step:  AIC=505.07
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> - gdp             1     26493 1453068 503.93
#> <none>                        1426575 505.07
#> - m_per1000f      1     84491 1511065 505.77
#> - unemploy_m24    1     99463 1526037 506.24
#> - prob_prison     1    198571 1625145 509.20
#> - unemploy_m39    1    208880 1635455 509.49
#> - percent_m       1    320926 1747501 512.61
#> - mean_education  1    386773 1813348 514.35
#> - inequality      1    594779 2021354 519.45
#> - police_exp60    1   1127277 2553852 530.44
#> 
#> Step:  AIC=503.93
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison
#> 
#>                  Df Sum of Sq     RSS    AIC
#> <none>                        1453068 503.93
#> - m_per1000f      1    103159 1556227 505.16
#> - unemploy_m24    1    127044 1580112 505.87
#> - prob_prison     1    247978 1701046 509.34
#> - unemploy_m39    1    255443 1708511 509.55
#> - percent_m       1    296790 1749858 510.67
#> - mean_education  1    445788 1898855 514.51
#> - inequality      1    738244 2191312 521.24
#> - police_exp60    1   1672038 3125105 537.93

The step() function directly returns a new model that keeps only the predictors that matter for the target, because removing any of them would increase the information loss (a higher AIC). We can inspect the new model with the summary() function.

# Please type your code
summary(model_backward)

#> 
#> Call:
#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -444.70 -111.07    3.03  122.15  483.30 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value     Pr(>|t|)    
#> (Intercept)    -6426.101   1194.611  -5.379 0.0000040395 ***
#> percent_m          9.332      3.350   2.786      0.00828 ** 
#> mean_education    18.012      5.275   3.414      0.00153 ** 
#> police_exp60      10.265      1.552   6.613 0.0000000826 ***
#> m_per1000f         2.234      1.360   1.642      0.10874    
#> unemploy_m24      -6.087      3.339  -1.823      0.07622 .  
#> unemploy_m39      18.735      7.248   2.585      0.01371 *  
#> inequality         6.133      1.396   4.394 0.0000863344 ***
#> prob_prison    -3796.032   1490.646  -2.547      0.01505 *  
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 195.5 on 38 degrees of freedom
#> Multiple R-squared:  0.7888, Adjusted R-squared:  0.7444 
#> F-statistic: 17.74 on 8 and 38 DF,  p-value: 0.0000000001159

Summary of the Backward Elimination process

1. Build a model using all predictors
2. Try removing each variable in turn, and compute the AIC each time
3. Remove the 1 variable whose removal gives the smallest AIC
4. Repeat steps 2 and 3
5. The process stops when removing any further variable would only produce a higher AIC
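One round of "try removing each variable" can be reproduced with drop1(), which reports the AIC after each single removal. The sketch below uses the built-in mtcars data (an assumption, so that the code runs on its own) and then lets step() repeat the rounds silently with trace = 0. The object names are hypothetical.

```r
# Full model on a stand-in dataset
full <- lm(mpg ~ ., data = mtcars)

# One round: AIC after removing each variable (row "<none>" = remove nothing)
candidates <- drop1(full)
first_drop <- rownames(candidates)[which.min(candidates$AIC)]
first_drop

# step() repeats that round until no removal lowers the AIC
final <- step(full, direction = "backward", trace = 0)
kept <- attr(terms(final), "term.labels")
kept

# The final model should not have a higher AIC than the starting model
extractAIC(full)[2]
extractAIC(final)[2]
```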

Stepwise Regression - Forward Selection

Forward Selection process:

Step 1: Build a model without predictors

# Please type your code

model_wo_predictor <- lm(formula = crime_rate ~ 1,
                          data = crime)

Step 2: Apply forward selection

For forward selection, the direction = parameter is set to "forward", and there is one additional parameter, scope =.

The scope = parameter sets the upper limit on the combination of predictors that may be added.

Example syntax for the scope = parameter: scope = list(upper = model_dengan_semua_prediktor), where model_dengan_semua_prediktor is the model with all predictors

# Please type your code
model_forward <- step(object = model_wo_predictor, 
                    direction = "forward",
                    scope = list(upper = model_all_predictor))

#> Start:  AIC=561.02
#> crime_rate ~ 1
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + police_exp60          1   3253302 3627626 532.94
#> + police_exp59          1   3058626 3822302 535.39
#> + gdp                   1   1340152 5540775 552.84
#> + prob_prison           1   1257075 5623853 553.54
#> + state_pop             1    783660 6097267 557.34
#> + mean_education        1    717146 6163781 557.85
#> + m_per1000f            1    314867 6566061 560.82
#> <none>                              6880928 561.02
#> + labour_participation  1    245446 6635482 561.32
#> + inequality            1    220530 6660397 561.49
#> + unemploy_m39          1    216354 6664573 561.52
#> + time_prison           1    154545 6726383 561.96
#> + is_south              1     56527 6824400 562.64
#> + percent_m             1     55084 6825844 562.65
#> + unemploy_m24          1     17533 6863395 562.90
#> + nonwhites_per1000     1      7312 6873615 562.97
#> 
#> Step:  AIC=532.94
#> crime_rate ~ police_exp60
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + inequality            1    739819 2887807 524.22
#> + percent_m             1    616741 3010885 526.18
#> + m_per1000f            1    250522 3377104 531.57
#> + nonwhites_per1000     1    232434 3395192 531.82
#> + is_south              1    219098 3408528 532.01
#> + gdp                   1    180872 3446754 532.53
#> <none>                              3627626 532.94
#> + police_exp59          1    146167 3481459 533.00
#> + prob_prison           1     92278 3535348 533.72
#> + labour_participation  1     77479 3550147 533.92
#> + time_prison           1     43185 3584441 534.37
#> + unemploy_m39          1     17848 3609778 534.70
#> + state_pop             1      5666 3621959 534.86
#> + unemploy_m24          1      2878 3624748 534.90
#> + mean_education        1       767 3626859 534.93
#> 
#> Step:  AIC=524.22
#> crime_rate ~ police_exp60 + inequality
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + mean_education        1    587050 2300757 515.53
#> + m_per1000f            1    454545 2433262 518.17
#> + prob_prison           1    280690 2607117 521.41
#> + labour_participation  1    260571 2627236 521.77
#> + gdp                   1    213937 2673871 522.60
#> + percent_m             1    181236 2706571 523.17
#> + state_pop             1    130377 2757430 524.04
#> <none>                              2887807 524.22
#> + nonwhites_per1000     1     36439 2851369 525.62
#> + is_south              1     33738 2854069 525.66
#> + police_exp59          1     30673 2857134 525.71
#> + unemploy_m24          1      2309 2885498 526.18
#> + time_prison           1       497 2887310 526.21
#> + unemploy_m39          1       253 2887554 526.21
#> 
#> Step:  AIC=515.53
#> crime_rate ~ police_exp60 + inequality + mean_education
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + percent_m             1    239405 2061353 512.37
#> + prob_prison           1    234981 2065776 512.47
#> + m_per1000f            1    117026 2183731 515.08
#> <none>                              2300757 515.53
#> + gdp                   1     79540 2221218 515.88
#> + unemploy_m39          1     62112 2238646 516.25
#> + time_prison           1     61770 2238987 516.26
#> + police_exp59          1     42584 2258174 516.66
#> + state_pop             1     39319 2261438 516.72
#> + unemploy_m24          1      7365 2293392 517.38
#> + labour_participation  1      7254 2293503 517.39
#> + nonwhites_per1000     1      4210 2296547 517.45
#> + is_south              1      4135 2296622 517.45
#> 
#> Step:  AIC=512.37
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + prob_prison           1    258063 1803290 508.08
#> + unemploy_m39          1    200988 1860365 509.55
#> + gdp                   1    163378 1897975 510.49
#> <none>                              2061353 512.37
#> + m_per1000f            1     74398 1986955 512.64
#> + unemploy_m24          1     50835 2010518 513.20
#> + police_exp59          1     45392 2015961 513.32
#> + time_prison           1     42746 2018607 513.39
#> + nonwhites_per1000     1     16488 2044865 513.99
#> + state_pop             1      8101 2053251 514.19
#> + is_south              1      3189 2058164 514.30
#> + labour_participation  1      2988 2058365 514.30
#> 
#> Step:  AIC=508.08
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m + 
#>     prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + unemploy_m39          1    192233 1611057 504.79
#> + gdp                   1     86490 1716801 507.77
#> + m_per1000f            1     84509 1718781 507.83
#> <none>                              1803290 508.08
#> + unemploy_m24          1     52313 1750977 508.70
#> + state_pop             1     47719 1755571 508.82
#> + police_exp59          1     37967 1765323 509.08
#> + is_south              1     21971 1781320 509.51
#> + time_prison           1     10194 1793096 509.82
#> + labour_participation  1       990 1802301 510.06
#> + nonwhites_per1000     1       797 1802493 510.06
#> 
#> Step:  AIC=504.79
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m + 
#>     prob_prison + unemploy_m39
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1611057 504.79
#> + gdp                   1     59910 1551147 505.00
#> + unemploy_m24          1     54830 1556227 505.16
#> + state_pop             1     51320 1559737 505.26
#> + m_per1000f            1     30945 1580112 505.87
#> + police_exp59          1     25017 1586040 506.05
#> + is_south              1     17958 1593098 506.26
#> + labour_participation  1     13179 1597878 506.40
#> + time_prison           1      7159 1603898 506.58
#> + nonwhites_per1000     1       359 1610698 506.78

Summary of the Forward Selection process

1. Build a model without predictors
2. Try adding each variable in turn, and compute the AIC each time
3. Add the 1 variable whose addition gives the smallest AIC
4. Repeat steps 2 and 3
5. The process stops when adding any further variable would only produce a higher AIC
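The mirror image of the backward round is add1(), which reports the AIC after adding each candidate variable. This sketch again uses the built-in mtcars data (an assumption); the scope is passed as a formula taken from a full model, and the object names are hypothetical.

```r
# Starting point (no predictors) and upper limit (all predictors)
null_model <- lm(mpg ~ 1, data = mtcars)
full <- lm(mpg ~ ., data = mtcars)

# One round: AIC after adding each variable (row "<none>" = add nothing)
candidates <- add1(null_model, scope = formula(full))
first_add <- rownames(candidates)[which.min(candidates$AIC)]
first_add

# step() repeats that round until no addition lowers the AIC
final <- step(null_model, direction = "forward",
              scope = list(upper = formula(full)), trace = 0)
added <- attr(terms(final), "term.labels")
added
```

In forward selection a variable is never removed once it is in, so the first variable chosen by add1() stays in the final model.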

Stepwise Regression - Both (Forward Selection & Backward Elimination)

With Stepwise Regression - Both, the step() function needs the model without predictors and the model with all predictors. Both are needed because at every step each variable is considered for either addition or removal, and the change that gives the lowest AIC is applied.

Because two models are involved, we need to tell step() that there is both an upper bound and a lower bound.

Example syntax for the scope = parameter: scope = list(upper = model_dengan_semua_prediktor, lower = model_tanpa_prediktor), where model_tanpa_prediktor is the model without predictors

# Please type your code
model_both_all <- step(object = model_all_predictor, 
                       direction = "both", 
                       scope = list(lower = model_wo_predictor, upper = model_all_predictor))

#> Start:  AIC=514.65
#> crime_rate ~ percent_m + is_south + mean_education + police_exp60 + 
#>     police_exp59 + labour_participation + m_per1000f + state_pop + 
#>     nonwhites_per1000 + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison + time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - is_south              1        29 1354974 512.65
#> - labour_participation  1      8917 1363862 512.96
#> - time_prison           1     10304 1365250 513.00
#> - state_pop             1     14122 1369068 513.14
#> - nonwhites_per1000     1     18395 1373341 513.28
#> - m_per1000f            1     31967 1386913 513.74
#> - gdp                   1     37613 1392558 513.94
#> - police_exp59          1     37919 1392865 513.95
#> <none>                              1354946 514.65
#> - unemploy_m24          1     83722 1438668 515.47
#> - police_exp60          1    144306 1499252 517.41
#> - unemploy_m39          1    181536 1536482 518.56
#> - percent_m             1    193770 1548716 518.93
#> - prob_prison           1    199538 1554484 519.11
#> - mean_education        1    402117 1757063 524.86
#> - inequality            1    423031 1777977 525.42
#> 
#> Step:  AIC=512.65
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison + 
#>     time_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - time_prison           1     10341 1365315 511.01
#> - labour_participation  1     10878 1365852 511.03
#> - state_pop             1     14127 1369101 511.14
#> - nonwhites_per1000     1     21626 1376600 511.39
#> - m_per1000f            1     32449 1387423 511.76
#> - police_exp59          1     37954 1392929 511.95
#> - gdp                   1     39223 1394197 511.99
#> <none>                              1354974 512.65
#> - unemploy_m24          1     96420 1451395 513.88
#> + is_south              1        29 1354946 514.65
#> - police_exp60          1    144302 1499277 515.41
#> - unemploy_m39          1    189859 1544834 516.81
#> - percent_m             1    195084 1550059 516.97
#> - prob_prison           1    204463 1559437 517.26
#> - mean_education        1    403140 1758114 522.89
#> - inequality            1    488834 1843808 525.13
#> 
#> Step:  AIC=511.01
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     labour_participation + m_per1000f + state_pop + nonwhites_per1000 + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - labour_participation  1     10533 1375848 509.37
#> - nonwhites_per1000     1     15482 1380797 509.54
#> - state_pop             1     21846 1387161 509.75
#> - police_exp59          1     28932 1394247 509.99
#> - gdp                   1     36070 1401385 510.23
#> - m_per1000f            1     41784 1407099 510.42
#> <none>                              1365315 511.01
#> - unemploy_m24          1     91420 1456735 512.05
#> + time_prison           1     10341 1354974 512.65
#> + is_south              1        65 1365250 513.00
#> - police_exp60          1    134137 1499452 513.41
#> - unemploy_m39          1    184143 1549458 514.95
#> - percent_m             1    186110 1551425 515.01
#> - prob_prison           1    237493 1602808 516.54
#> - mean_education        1    409448 1774763 521.33
#> - inequality            1    502909 1868224 523.75
#> 
#> Step:  AIC=509.37
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + nonwhites_per1000 + unemploy_m24 + 
#>     unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - nonwhites_per1000     1     11675 1387523 507.77
#> - police_exp59          1     21418 1397266 508.09
#> - state_pop             1     27803 1403651 508.31
#> - m_per1000f            1     31252 1407100 508.42
#> - gdp                   1     35035 1410883 508.55
#> <none>                              1375848 509.37
#> - unemploy_m24          1     80954 1456802 510.06
#> + labour_participation  1     10533 1365315 511.01
#> + time_prison           1      9996 1365852 511.03
#> + is_south              1      3046 1372802 511.26
#> - police_exp60          1    123896 1499744 511.42
#> - unemploy_m39          1    190746 1566594 513.47
#> - percent_m             1    217716 1593564 514.27
#> - prob_prison           1    226971 1602819 514.54
#> - mean_education        1    413254 1789103 519.71
#> - inequality            1    500944 1876792 521.96
#> 
#> Step:  AIC=507.77
#> crime_rate ~ percent_m + mean_education + police_exp60 + police_exp59 + 
#>     m_per1000f + state_pop + unemploy_m24 + unemploy_m39 + gdp + 
#>     inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - police_exp59          1     16706 1404229 506.33
#> - state_pop             1     25793 1413315 506.63
#> - m_per1000f            1     26785 1414308 506.66
#> - gdp                   1     31551 1419073 506.82
#> <none>                              1387523 507.77
#> - unemploy_m24          1     83881 1471404 508.52
#> + nonwhites_per1000     1     11675 1375848 509.37
#> + is_south              1      7207 1380316 509.52
#> + labour_participation  1      6726 1380797 509.54
#> + time_prison           1      4534 1382989 509.61
#> - police_exp60          1    118348 1505871 509.61
#> - unemploy_m39          1    201453 1588976 512.14
#> - prob_prison           1    216760 1604282 512.59
#> - percent_m             1    309214 1696737 515.22
#> - mean_education        1    402754 1790276 517.74
#> - inequality            1    589736 1977259 522.41
#> 
#> Step:  AIC=506.33
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     state_pop + unemploy_m24 + unemploy_m39 + gdp + inequality + 
#>     prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - state_pop             1     22345 1426575 505.07
#> - gdp                   1     32142 1436371 505.39
#> - m_per1000f            1     36808 1441037 505.54
#> <none>                              1404229 506.33
#> - unemploy_m24          1     86373 1490602 507.13
#> + police_exp59          1     16706 1387523 507.77
#> + nonwhites_per1000     1      6963 1397266 508.09
#> + is_south              1      3807 1400422 508.20
#> + labour_participation  1      1986 1402243 508.26
#> + time_prison           1       575 1403654 508.31
#> - unemploy_m39          1    205814 1610043 510.76
#> - prob_prison           1    218607 1622836 511.13
#> - percent_m             1    307001 1711230 513.62
#> - mean_education        1    389502 1793731 515.83
#> - inequality            1    608627 2012856 521.25
#> - police_exp60          1   1050202 2454432 530.57
#> 
#> Step:  AIC=505.07
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + gdp + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> - gdp                   1     26493 1453068 503.93
#> <none>                              1426575 505.07
#> - m_per1000f            1     84491 1511065 505.77
#> - unemploy_m24          1     99463 1526037 506.24
#> + state_pop             1     22345 1404229 506.33
#> + police_exp59          1     13259 1413315 506.63
#> + nonwhites_per1000     1      5927 1420648 506.87
#> + is_south              1      5724 1420851 506.88
#> + labour_participation  1      5176 1421398 506.90
#> + time_prison           1      3913 1422661 506.94
#> - prob_prison           1    198571 1625145 509.20
#> - unemploy_m39          1    208880 1635455 509.49
#> - percent_m             1    320926 1747501 512.61
#> - mean_education        1    386773 1813348 514.35
#> - inequality            1    594779 2021354 519.45
#> - police_exp60          1   1127277 2553852 530.44
#> 
#> Step:  AIC=503.93
#> crime_rate ~ percent_m + mean_education + police_exp60 + m_per1000f + 
#>     unemploy_m24 + unemploy_m39 + inequality + prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1453068 503.93
#> + gdp                   1     26493 1426575 505.07
#> - m_per1000f            1    103159 1556227 505.16
#> + state_pop             1     16697 1436371 505.39
#> + police_exp59          1     14148 1438919 505.47
#> + is_south              1      9329 1443739 505.63
#> + labour_participation  1      4374 1448694 505.79
#> + nonwhites_per1000     1      3799 1449269 505.81
#> + time_prison           1      2293 1450775 505.86
#> - unemploy_m24          1    127044 1580112 505.87
#> - prob_prison           1    247978 1701046 509.34
#> - unemploy_m39          1    255443 1708511 509.55
#> - percent_m             1    296790 1749858 510.67
#> - mean_education        1    445788 1898855 514.51
#> - inequality            1    738244 2191312 521.24
#> - police_exp60          1   1672038 3125105 537.93

# Please type your code
model_both_none <- step(object = model_wo_predictor, 
                        direction = "both", 
                        scope = list(lower = model_wo_predictor, upper = model_all_predictor))

#> Start:  AIC=561.02
#> crime_rate ~ 1
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + police_exp60          1   3253302 3627626 532.94
#> + police_exp59          1   3058626 3822302 535.39
#> + gdp                   1   1340152 5540775 552.84
#> + prob_prison           1   1257075 5623853 553.54
#> + state_pop             1    783660 6097267 557.34
#> + mean_education        1    717146 6163781 557.85
#> + m_per1000f            1    314867 6566061 560.82
#> <none>                              6880928 561.02
#> + labour_participation  1    245446 6635482 561.32
#> + inequality            1    220530 6660397 561.49
#> + unemploy_m39          1    216354 6664573 561.52
#> + time_prison           1    154545 6726383 561.96
#> + is_south              1     56527 6824400 562.64
#> + percent_m             1     55084 6825844 562.65
#> + unemploy_m24          1     17533 6863395 562.90
#> + nonwhites_per1000     1      7312 6873615 562.97
#> 
#> Step:  AIC=532.94
#> crime_rate ~ police_exp60
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + inequality            1    739819 2887807 524.22
#> + percent_m             1    616741 3010885 526.18
#> + m_per1000f            1    250522 3377104 531.57
#> + nonwhites_per1000     1    232434 3395192 531.82
#> + is_south              1    219098 3408528 532.01
#> + gdp                   1    180872 3446754 532.53
#> <none>                              3627626 532.94
#> + police_exp59          1    146167 3481459 533.00
#> + prob_prison           1     92278 3535348 533.72
#> + labour_participation  1     77479 3550147 533.92
#> + time_prison           1     43185 3584441 534.37
#> + unemploy_m39          1     17848 3609778 534.70
#> + state_pop             1      5666 3621959 534.86
#> + unemploy_m24          1      2878 3624748 534.90
#> + mean_education        1       767 3626859 534.93
#> - police_exp60          1   3253302 6880928 561.02
#> 
#> Step:  AIC=524.22
#> crime_rate ~ police_exp60 + inequality
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + mean_education        1    587050 2300757 515.53
#> + m_per1000f            1    454545 2433262 518.17
#> + prob_prison           1    280690 2607117 521.41
#> + labour_participation  1    260571 2627236 521.77
#> + gdp                   1    213937 2673871 522.60
#> + percent_m             1    181236 2706571 523.17
#> + state_pop             1    130377 2757430 524.04
#> <none>                              2887807 524.22
#> + nonwhites_per1000     1     36439 2851369 525.62
#> + is_south              1     33738 2854069 525.66
#> + police_exp59          1     30673 2857134 525.71
#> + unemploy_m24          1      2309 2885498 526.18
#> + time_prison           1       497 2887310 526.21
#> + unemploy_m39          1       253 2887554 526.21
#> - inequality            1    739819 3627626 532.94
#> - police_exp60          1   3772590 6660397 561.49
#> 
#> Step:  AIC=515.53
#> crime_rate ~ police_exp60 + inequality + mean_education
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + percent_m             1    239405 2061353 512.37
#> + prob_prison           1    234981 2065776 512.47
#> + m_per1000f            1    117026 2183731 515.08
#> <none>                              2300757 515.53
#> + gdp                   1     79540 2221218 515.88
#> + unemploy_m39          1     62112 2238646 516.25
#> + time_prison           1     61770 2238987 516.26
#> + police_exp59          1     42584 2258174 516.66
#> + state_pop             1     39319 2261438 516.72
#> + unemploy_m24          1      7365 2293392 517.38
#> + labour_participation  1      7254 2293503 517.39
#> + nonwhites_per1000     1      4210 2296547 517.45
#> + is_south              1      4135 2296622 517.45
#> - mean_education        1    587050 2887807 524.22
#> - inequality            1   1326101 3626859 534.93
#> - police_exp60          1   3782666 6083423 559.23
#> 
#> Step:  AIC=512.37
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + prob_prison           1    258063 1803290 508.08
#> + unemploy_m39          1    200988 1860365 509.55
#> + gdp                   1    163378 1897975 510.49
#> <none>                              2061353 512.37
#> + m_per1000f            1     74398 1986955 512.64
#> + unemploy_m24          1     50835 2010518 513.20
#> + police_exp59          1     45392 2015961 513.32
#> + time_prison           1     42746 2018607 513.39
#> + nonwhites_per1000     1     16488 2044865 513.99
#> + state_pop             1      8101 2053251 514.19
#> + is_south              1      3189 2058164 514.30
#> + labour_participation  1      2988 2058365 514.30
#> - percent_m             1    239405 2300757 515.53
#> - mean_education        1    645219 2706571 523.17
#> - inequality            1    864671 2926024 526.83
#> - police_exp60          1   4000849 6062202 561.07
#> 
#> Step:  AIC=508.08
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m + 
#>     prob_prison
#> 
#>                        Df Sum of Sq     RSS    AIC
#> + unemploy_m39          1    192233 1611057 504.79
#> + gdp                   1     86490 1716801 507.77
#> + m_per1000f            1     84509 1718781 507.83
#> <none>                              1803290 508.08
#> + unemploy_m24          1     52313 1750977 508.70
#> + state_pop             1     47719 1755571 508.82
#> + police_exp59          1     37967 1765323 509.08
#> + is_south              1     21971 1781320 509.51
#> + time_prison           1     10194 1793096 509.82
#> + labour_participation  1       990 1802301 510.06
#> + nonwhites_per1000     1       797 1802493 510.06
#> - prob_prison           1    258063 2061353 512.37
#> - percent_m             1    262486 2065776 512.47
#> - mean_education        1    598315 2401605 519.55
#> - inequality            1    968199 2771489 526.28
#> - police_exp60          1   3268577 5071868 554.69
#> 
#> Step:  AIC=504.79
#> crime_rate ~ police_exp60 + inequality + mean_education + percent_m + 
#>     prob_prison + unemploy_m39
#> 
#>                        Df Sum of Sq     RSS    AIC
#> <none>                              1611057 504.79
#> + gdp                   1     59910 1551147 505.00
#> + unemploy_m24          1     54830 1556227 505.16
#> + state_pop             1     51320 1559737 505.26
#> + m_per1000f            1     30945 1580112 505.87
#> + police_exp59          1     25017 1586040 506.05
#> + is_south              1     17958 1593098 506.26
#> + labour_participation  1     13179 1597878 506.40
#> + time_prison           1      7159 1603898 506.58
#> + nonwhites_per1000     1       359 1610698 506.78
#> - unemploy_m39          1    192233 1803290 508.08
#> - prob_prison           1    249308 1860365 509.55
#> - percent_m             1    400611 2011667 513.22
#> - mean_education        1    776207 2387264 521.27
#> - inequality            1    949221 2560278 524.56
#> - police_exp60          1   2817067 4428124 550.31
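
The AIC column in the traces above is the value that step() obtains from extractAIC(). For lm models this is n * log(RSS / n) + 2 * edf, where edf is the number of estimated coefficients. The printed numbers are consistent with n = 47 rows (an inference from the output, not something stated in this section): for the intercept-only model, 47 * log(6880928 / 47) + 2 * 1 gives about 561.02. A minimal sketch that reproduces the calculation by hand, assuming the built-in mtcars data as a stand-in for the crime data:

```r
# Stand-in data: mtcars ships with base R (assumption, for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)              # number of observations
rss <- sum(residuals(fit)^2)     # residual sum of squares
edf <- length(coef(fit))         # intercept + 2 slopes

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit)[2] # the AIC value that step() reports

all.equal(aic_manual, aic_step)
```

Because every extra predictor adds 2 to the penalty term, a predictor is only kept when it lowers RSS enough to offset that cost.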

Now that stepwise selection has produced several candidate models, we first evaluate them by looking at their Adjusted R-Squared.

2.3.3 Step 3.3: Evaluate the Models with Adjusted R-Squared

To evaluate the models, we compare the Adjusted R-Squared of each model built so far.
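
Adjusted R-Squared penalizes R-Squared for the number of predictors: 1 - (1 - R2) * (n - 1) / (n - p - 1), with n observations and p predictors. The sketch below recomputes it by hand and compares it with the value from summary(). It assumes the built-in mtcars data as a stand-in, not the crime data.

```r
# Stand-in data and model (assumption, for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
s   <- summary(fit)

n <- nrow(mtcars)            # number of observations
p <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)
```

This penalty is why Adjusted R-Squared is suitable for comparing models with different numbers of predictors.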

# Please type your code
summary(model_potensial)$adj.r.squared

#> [1] 0.5013695

summary(model_forward)$adj.r.squared

#> [1] 0.7307463

summary(model_backward)$adj.r.squared

#> [1] 0.7443692

summary(model_both_none)$adj.r.squared

#> [1] 0.7307463

summary(model_both_all)$adj.r.squared

#> [1] 0.7443692

# Option 2
library(performance)

compare_performance(model_potensial, model_forward, model_backward, model_both_all, model_both_none)

summary(model_backward)$call

#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)

summary(model_both_all)$call

#> lm(formula = crime_rate ~ percent_m + mean_education + police_exp60 + 
#>     m_per1000f + unemploy_m24 + unemploy_m39 + inequality + prob_prison, 
#>     data = crime_clean)

Insight: model_backward and model_both_all give the highest Adjusted R-Squared (about 0.7444). Both models use the same predictors:

percent_m
mean_education
police_exp60
m_per1000f
unemploy_m24
unemploy_m39
inequality
prob_prison

2.4 Step 4: Prediction

Normal Prediction

To predict crime_rate from the predictors of the best model, we can use predict() with these parameters:

object: the regression model to use
newdata: the data to predict on (must be a data frame whose column names exactly match the predictors in the model)

# Please type your code

pred <- predict(object = model_backward,
                newdata = crime_clean)

head(pred)

#>         1         2         3         4         5         6 
#>  730.2603 1429.5290  391.6707 1846.7501 1119.4533  724.2856

Note: The prediction here is still made on the same data that was used to train the model. We do this to see how well the model performs before applying it to new data.
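
For comparison, the sketch below predicts on rows the model has not seen. It assumes the built-in mtcars data and two hypothetical cars; the point is only that newdata must carry the same column names as the predictors.

```r
# Stand-in model on mtcars (assumption, for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observations; column names must match the predictors
new_cars <- data.frame(wt = c(2.5, 3.5), hp = c(100, 150))

pred_new <- predict(object = fit, newdata = new_cars)

# The same numbers from the regression formula: b0 + b1 * wt + b2 * hp
b <- coef(fit)
pred_manual <- b["(Intercept)"] + b["wt"] * new_cars$wt + b["hp"] * new_cars$hp

all.equal(as.numeric(pred_new), as.numeric(pred_manual))
```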

Prediction Interval [Optional]

Besides predicting a single point, we can predict a range. The range gives a lower and an upper bound around each prediction.

To do this, add two more parameters to predict():

interval: set to "prediction" so that the function returns an interval around each prediction.
level: the confidence level to use for the interval (for example 0.95)

# add lower and upper bounds to the prediction
pred_interval <- predict(
  object = model_backward,
  newdata = crime_clean,
  interval = "prediction",
  level = 0.95)

head(pred_interval)

#>         fit        lwr       upr
#> 1  730.2603  300.62618 1159.8945
#> 2 1429.5290 1014.61168 1844.4464
#> 3  391.6707  -29.24626  812.5876
#> 4 1846.7501 1407.37353 2286.1266
#> 5 1119.4533  670.47573 1568.4309
#> 6  724.2856  294.23863 1154.3326

# extract the lower and upper bounds of the prediction

pred_lower <- as.data.frame(pred_interval)$lwr
pred_upper <- as.data.frame(pred_interval)$upr

head(pred_lower)

#> [1]  300.62618 1014.61168  -29.24626 1407.37353  670.47573  294.23863

head(pred_upper)

#> [1] 1159.8945 1844.4464  812.5876 2286.1266 1568.4309 1154.3326
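
predict() also accepts interval = "confidence". That interval covers the mean response, while interval = "prediction" covers a single new observation and is therefore wider. A minimal sketch of the difference, assuming the built-in mtcars data and one hypothetical car:

```r
# Stand-in model on mtcars (assumption, for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)   # hypothetical observation

conf_int <- predict(fit, newdata = new_car, interval = "confidence", level = 0.95)
pred_int <- predict(fit, newdata = new_car, interval = "prediction", level = 0.95)

# Same point estimate, different widths
width_conf <- conf_int[, "upr"] - conf_int[, "lwr"]
width_pred <- pred_int[, "upr"] - pred_int[, "lwr"]

c(confidence = width_conf, prediction = width_pred)
```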

2.5 Step 5: Evaluation

2.5.1 Step 5.1: Error Evaluation

We measure error with RMSE because squaring penalizes large errors more heavily, so extreme errors show up clearly. We also compute the RMSE of the lower and upper bounds from the interval prediction.
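
RMSE is the square root of the mean squared error, sqrt(mean((y_true - y_pred)^2)). The sketch below uses made-up numbers (not the crime data) with one extreme error, and compares RMSE with the mean absolute error (MAE) to show how RMSE reacts to that single large miss.

```r
# Made-up values for illustration: three small errors and one large error
y_true <- c(10, 20, 30, 40)
y_pred <- c(11, 21, 31, 50)

err <- y_true - y_pred        # -1, -1, -1, -10

rmse <- sqrt(mean(err^2))     # sqrt(103 / 4), about 5.07
mae  <- mean(abs(err))        # 13 / 4 = 3.25

c(RMSE = rmse, MAE = mae)
```

The single error of 10 pulls RMSE well above MAE, which is the behavior the paragraph above relies on.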

RMSE for the Fitted Prediction

# Please type your code
library(MLmetrics)
RMSE(y_pred = pred, y_true = crime_clean$crime_rate)

#> [1] 175.8304

RMSE for the Lower Bound

# Please type your code
RMSE(y_pred = pred_lower, y_true = crime_clean$crime_rate)

#> [1] 465.8112

RMSE for the Upper Bound

# Please type your code
RMSE(y_pred = pred_upper, y_true = crime_clean$crime_rate)

#> [1] 467.2104

Additional Notes: If the error is still not good enough, we can try the following:

Add other predictors or collect more data.
Scaling -> put the predictors on a comparable range -> can make patterns in the data easier to see and reduce the effect of outliers
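
A minimal sketch of scaling with base R's scale(), which centers each column to mean 0 and rescales it to standard deviation 1. It assumes three mtcars columns as stand-in predictors.

```r
# Stand-in predictors from mtcars (assumption, for illustration only)
predictors <- mtcars[, c("wt", "hp", "disp")]

scaled <- scale(predictors)   # center to mean 0, scale to sd 1

col_means <- colMeans(scaled)
col_sds   <- apply(scaled, 2, sd)

# Refit on the scaled predictors and compare fitted values
fit_raw    <- lm(mpg ~ wt + hp + disp, data = mtcars)
scaled_df  <- data.frame(mpg = mtcars$mpg, scaled)
fit_scaled <- lm(mpg ~ wt + hp + disp, data = scaled_df)

all.equal(unname(fitted(fit_raw)), unname(fitted(fit_scaled)))
```

For lm(), a linear rescaling of the predictors changes the units of the coefficients but leaves the fitted values unchanged, so here its main benefit is making coefficients comparable across predictors.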

2.5.2 Step 5.2: Checking Assumptions

As a statistical model, linear regression relies on strict assumptions. The assumptions below should be checked to judge whether the model can be expected to predict new data consistently.

Assumptions of the linear regression model:

Linearity

Checked before building the model. It tests whether the target and the predictors have a linear relationship, which can be inspected through the correlation values using ggcorr() from the GGally package.

ggcorr(crime_clean, label = TRUE, hjust = 1, layout.exp = 3)
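
The same correlations can be read as numbers with base R's cor(). The sketch below lists the correlation of every column with the target, assuming the built-in mtcars data with mpg as a stand-in target.

```r
# Stand-in data: mtcars, with mpg playing the role of the target (assumption)
cors <- cor(mtcars)[, "mpg"]

# Drop the target's correlation with itself and sort from most negative
# to most positive
sort(cors[names(cors) != "mpg"])
```

Values close to -1 or 1 suggest a strong linear relationship with the target; values near 0 suggest a weak one.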

Normality of Residuals/Error

The errors are expected to be normally distributed, meaning most errors cluster around 0.

This assumption can be checked with:

1. A histogram of the residuals

# Please type your code
hist(model_backward$residuals, breaks = 20)

The histogram of the errors shows a few observations with large errors (around +-400).

2. A statistical test.

Shapiro-Wilk hypothesis test, using shapiro.test():

H0: the errors are normally distributed
H1: the errors are NOT normally distributed

Recall from the Practical Statistics (PS) material:

If p-value > alpha, fail to reject H0
If p-value < alpha, reject H0

Desired outcome: p-value > alpha

# Please type your code
shapiro.test(model_backward$residuals)

#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  model_backward$residuals
#> W = 0.98511, p-value = 0.8051

Conclusion: the p-value (0.8051) is greater than alpha, so we fail to reject H0 and treat the errors as normally distributed
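
The decision rule can also be written in code by reading p.value from the test result. This is a minimal sketch that assumes alpha = 0.05 and a stand-in model on the built-in mtcars data.

```r
# Stand-in model on mtcars; alpha = 0.05 is an assumed significance level
fit   <- lm(mpg ~ wt + hp, data = mtcars)
alpha <- 0.05

sw <- shapiro.test(residuals(fit))

# Apply the rule: p-value > alpha means we fail to reject H0
if (sw$p.value > alpha) {
  message("Fail to reject H0: residuals are consistent with normality")
} else {
  message("Reject H0: residuals deviate from normality")
}
```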

Homoscedasticity of Residuals

The errors are expected to show no pattern (scattered randomly, with constant variance). This indicates that the model has captured the systematic patterns in the data.

This assumption can be checked with:

1. A scatterplot of the predicted (fitted) values against the errors:

# Please type your code
plot(model_backward$fitted.values, model_backward$residuals)

2. A statistical test: Breusch-Pagan from the lmtest package

Breusch-Pagan hypothesis test, using bptest():

H0: the error variance is constant (homoscedasticity)
H1: the error variance is NOT constant (heteroscedasticity)

Desired outcome: p-value > alpha

# Please type your code
library(lmtest)
bptest(model_backward)

#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  model_backward
#> BP = 13.51, df = 8, p-value = 0.09546

Conclusion: the p-value (0.09546) is greater than alpha at the 0.05 level, so we fail to reject H0; the errors show no pattern (random) -> the homoscedasticity assumption is met
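
The studentized Breusch-Pagan statistic can be reproduced with base R: regress the squared residuals on the model's predictors, then take n times the R-Squared of that auxiliary regression and compare it with a chi-squared distribution whose degrees of freedom equal the number of predictors. This explains the df = 8 above for a model with 8 predictors. The sketch assumes the built-in mtcars data as a stand-in and is not a replacement for bptest().

```r
# Stand-in model on mtcars (assumption, for illustration only)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_res = residuals(fit)^2)
aux      <- lm(sq_res ~ wt + hp, data = aux_data)

n <- nrow(mtcars)
k <- length(coef(aux)) - 1                 # number of predictors

bp_stat <- n * summary(aux)$r.squared      # studentized BP statistic
p_value <- pchisq(bp_stat, df = k, lower.tail = FALSE)

c(BP = bp_stat, df = k, p.value = p_value)
```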

No Multicollinearity

Multicollinearity is a strong correlation between predictors. It is undesirable because it signals redundant predictors in the model; among variables that are very strongly related, one should be enough. We want the model to be free of multicollinearity.

VIF (Variance Inflation Factor) test with vif() from the car package:
VIF > 10: multicollinearity is present in the model
VIF < 10: no multicollinearity in the model

Desired outcome: VIF < 10

# VIF of model_backward
library(car)
vif(model_backward)

#>      percent_m mean_education   police_exp60     m_per1000f   unemploy_m24 
#>       2.131963       4.189684       2.560496       1.932367       4.360038 
#>   unemploy_m39     inequality    prob_prison 
#>       4.508106       3.731074       1.381879

vif(model_all_predictor)

#>            percent_m             is_south       mean_education 
#>             2.892448             5.342783             5.077447 
#>         police_exp60         police_exp59 labour_participation 
#>           104.658667           113.559262             3.712690 
#>           m_per1000f            state_pop    nonwhites_per1000 
#>             3.785934             2.536708             4.674088 
#>         unemploy_m24         unemploy_m39                  gdp 
#>             6.063931             5.088880            10.530375 
#>           inequality          prob_prison          time_prison 
#>             8.644528             2.809459             2.713785
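
The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. A predictor that the others explain almost perfectly therefore gets a very large VIF, as police_exp60 and police_exp59 do above. A minimal base R sketch of the calculation, assuming three mtcars columns as stand-in predictors:

```r
# Stand-in predictors from mtcars (assumption, for illustration only)
predictors <- c("wt", "hp", "disp")

vif_manual <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Regress this predictor on the remaining predictors
  aux <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual
```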

Action plan when variables show multicollinearity:

Keep the one variable that matters more from a business perspective
Remove variables whose VIF exceeds the chosen threshold
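
Removing a variable can be done with update(), which refits the model without rebuilding the formula by hand. In this sketch the dropped variable (disp) is a hypothetical choice on the built-in mtcars data; in practice the choice follows the action plan above.

```r
# Stand-in model on mtcars (assumption, for illustration only)
fit_full <- lm(mpg ~ wt + hp + disp, data = mtcars)

# Drop one predictor; ". ~ . - disp" keeps everything else unchanged
fit_reduced <- update(fit_full, . ~ . - disp)

names(coef(fit_reduced))
```

After refitting, it is worth recomputing the VIF and the Adjusted R-Squared to confirm that the reduced model still performs acceptably.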

Inclass RM BCA

Team Algoritma

September 21, 2022