Sebagai seorang penjual properti, kita ingin membuat model yang mana dapat memprediksi harga properti berdasarkan beberapa informasi yang ada pada data.
Tentukan variabel:
priceprice1. Read data house_data.csv
house <- read.csv("data_input/house_data.csv")
head(house)Deskripsi data :
2. Cek struktur data
glimpse(house)#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
💡 Hasil pemeriksaan struktur data: - Dataset house terdiri dari 12 kolom dan 21,613 baris - Variabel kategorik pada dataset house adalah bathrooms(30 kelas),bedrooms (13), floors(6), waterfront(2), grade(12) - Perlu dilakukan penyesuaian pada kolom waterfront dan floors menjadi factor dan membulatkan nilai floor
3. Cleansing Data
# Machine Learnig hanya menerima 2 input yaitu numeric dan kategori
unique(house$bathrooms)#> [1] 1.00 2.25 3.00 2.00 4.50 1.50 2.50 1.75 2.75 3.25 4.00 3.50 0.75 4.75 5.00
#> [16] 4.25 3.75 0.00 1.25 5.25 6.00 0.50 5.50 6.75 5.75 8.00 7.50 7.75 6.25 6.50
dim(house)#> [1] 21613 9
house$waterfront <- as.factor(house$waterfront)
house$floors <- round(house$floors, digits = 0)
house$floors <- as.factor(house$floors)
house <- house %>%
select( -bedrooms)
# ubah tipe data
# data kotor, kelas lebih dari 5, tidak digunakan3. EDA
#persebaran data
library(GGally)
ggcorr(house,label = TRUE )#korelasi💡 Insight:
Buatlah 3 model berdasarkan feature selection yg telah dipelajari 1. model all predictor 2. model selection based on correlation (korelasi > 0.5) 3. model selection hasil stepwise (backward/forward/both)
model_all <- lm(formula = price ~ . ,
data = house)
model_selection <- lm(formula = price ~ sqft_living + bathrooms + grade,
data = house)
model_backward <- step(object = model_all,
direction = "backward")#> Start: AIC=531742.6
#> price ~ bathrooms + sqft_living + sqft_lot + floors + waterfront +
#> grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> <none> 1045250022783039 531743
#> - sqft_lot 1 1179284314470 1046429307097510 531765
#> - bathrooms 1 6713329147569 1051963351930608 531879
#> - floors 3 11562439065145 1056812461848184 531974
#> - waterfront 1 88186040354655 1133436063137694 533491
#> - sqft_living 1 127676403283544 1172926426066583 534231
#> - yr_built 1 185730949852517 1230980972635556 535276
#> - grade 1 197693619508446 1242943642291484 535485
# model tanpa prediktor dari data house untuk `price`
model_none <- lm(formula = price ~ 1,
data = house )
model_forward <- step(object = model_none,
scope = list(upper = model_all),
direction = "forward")#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598809 1477276362322490 539204
#> + grade 1 1297612620095470 1615304141825828 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + waterfront 1 206679237434408 2706237524486890 552287
#> + floors 3 204917575621047 2707999186300252 552305
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921727 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + floors 3 10764066475322 1466512295847168 539051
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123633 1247002236250112 535545
#> + floors 3 10236669120186 1345719149253559 537196
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + floors 3 15086239339641 1141641933934950 533643
#> + bathrooms 1 8626538372689 1148101634901902 533761
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + floors 3 13111436224558 1053500931114044 531909
#> + bathrooms 1 8476715048277 1058135652290326 531999
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#>
#> Step: AIC=531908.6
#> price ~ sqft_living + grade + yr_built + waterfront + floors
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 7071624016534 1046429307097510 531765
#> + sqft_lot 1 1537579183436 1051963351930608 531879
#> <none> 1053500931114044 531909
#>
#> Step: AIC=531765
#> price ~ sqft_living + grade + yr_built + waterfront + floors +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 1179284314470 1045250022783039 531743
#> <none> 1046429307097510 531765
#>
#> Step: AIC=531742.6
#> price ~ sqft_living + grade + yr_built + waterfront + floors +
#> bathrooms + sqft_lot
model_both <- step(object = model_none,
direction = "both",
scope = list(upper = model_all))#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598809 1477276362322490 539204
#> + grade 1 1297612620095470 1615304141825828 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + waterfront 1 206679237434408 2706237524486890 552287
#> + floors 3 204917575621047 2707999186300252 552305
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921727 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + floors 3 10764066475322 1466512295847168 539051
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#> - sqft_living 1 1435640399598809 2912916761921299 553876
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123633 1247002236250112 535545
#> + floors 3 10236669120186 1345719149253559 537196
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#> - grade 1 121320543948745 1477276362322490 539204
#> - sqft_living 1 259348323452084 1615304141825828 541134
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + floors 3 15086239339641 1141641933934950 533643
#> + bathrooms 1 8626538372689 1148101634901902 533761
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#> - yr_built 1 199227645099154 1355955818373745 537353
#> - grade 1 227693783640699 1384421956915290 537802
#> - sqft_living 1 241311622806594 1398039796081185 538014
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + floors 3 13111436224558 1053500931114044 531909
#> + bathrooms 1 8476715048277 1058135652290326 531999
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#> - waterfront 1 90115805935988 1156728173274590 533921
#> - yr_built 1 180389868911509 1247002236250112 535545
#> - grade 1 219517939490357 1286130306828959 536213
#> - sqft_living 1 222939932856996 1289552300195599 536270
#>
#> Step: AIC=531908.6
#> price ~ sqft_living + grade + yr_built + waterfront + floors
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 7071624016534 1046429307097510 531765
#> + sqft_lot 1 1537579183436 1051963351930608 531879
#> <none> 1053500931114044 531909
#> - floors 3 13111436224558 1066612367338602 532170
#> - waterfront 1 88141002820905 1141641933934950 533643
#> - yr_built 1 184727950980833 1238228882094878 535398
#> - grade 1 201639096446661 1255140027560705 535692
#> - sqft_living 1 231760372734666 1285261303848710 536204
#>
#> Step: AIC=531765
#> price ~ sqft_living + grade + yr_built + waterfront + floors +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 1179284314470 1045250022783039 531743
#> <none> 1046429307097510 531765
#> - bathrooms 1 7071624016534 1053500931114044 531909
#> - floors 3 11706345192816 1058135652290326 531999
#> - waterfront 1 88083487972181 1134512795069691 533510
#> - sqft_living 1 126771514338914 1173200821436424 534235
#> - yr_built 1 187294202573311 1233723509670821 535322
#> - grade 1 198229532395873 1244658839493383 535512
#>
#> Step: AIC=531742.6
#> price ~ sqft_living + grade + yr_built + waterfront + floors +
#> bathrooms + sqft_lot
#>
#> Df Sum of Sq RSS AIC
#> <none> 1045250022783039 531743
#> - sqft_lot 1 1179284314470 1046429307097510 531765
#> - bathrooms 1 6713329147569 1051963351930608 531879
#> - floors 3 11562439065145 1056812461848184 531974
#> - waterfront 1 88186040354655 1133436063137695 533491
#> - sqft_living 1 127676403283543 1172926426066582 534231
#> - yr_built 1 185730949852517 1230980972635556 535276
#> - grade 1 197693619508445 1242943642291485 535485
# simpan hasil prediksi ke kolom baru
house$pred_all <- predict(object = model_all, newdata = house)
house$pred_selection <- predict(object = model_selection, newdata = house)
house$pred_backward <- predict(object = model_backward, newdata = house)
house$pred_forward <- predict(object = model_forward, newdata = house)
house$pred_both<- predict(object = model_both, newdata = house)# periksa nilai R-Squared untuk setiap model
summary(model_all)$adj.r.squared#> [1] 0.6410177
summary(model_selection)$adj.r.squared#> [1] 0.537065
summary(model_backward)$adj.r.squared#> [1] 0.6410177
summary(model_forward)$adj.r.squared#> [1] 0.6410177
summary(model_all)$adj.r.squared#> [1] 0.6410177
Berdasarkan RMSE model regresi manakah yang terbaik?
library(MLmetrics)
# RMSE model_all
RMSE(y_pred = house$pred_all,
y_true = house$price)#> [1] 219913.8
# RMSE model_selection
RMSE(y_pred = house$pred_selection,
y_true = house$price)#> [1] 249767.8
# RMSE model_forward
RMSE(y_pred = house$pred_forward,
y_true = house$price)#> [1] 219913.8
# RMSE model_backward
RMSE(y_pred = house$pred_backward,
y_true = house$price)#> [1] 219913.8
# RMSE model_both
RMSE(y_pred = house$pred_both,
y_true = house$price)#> [1] 219913.8
💡 Kesimpulan :
model yang dapat menjelaskan variabel target dengan baik adalah model_all, model_forward, model_backward dan model_both, yakni sebesar 0.6410177 atau 64,1%
Model yang memberikan error paling kecil dalam memprediksi nilai price adalah model_all, model_forward, model_backward dan model_both dengan nilai RMSE sebesar 219913.8
summary(model_all)#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1351044 -115009 -11131 91500 4382598
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6761353.16186 119332.39059 56.660 < 0.0000000000000002 ***
#> bathrooms 39711.29580 3371.30580 11.779 < 0.0000000000000002 ***
#> sqft_living 160.20232 3.11865 51.369 < 0.0000000000000002 ***
#> sqft_lot -0.18179 0.03682 -4.937 0.0000008 ***
#> floors2 -11144.70271 3498.14121 -3.186 0.001445 **
#> floors3 129985.45263 9669.80166 13.442 < 0.0000000000000002 ***
#> floors4 287251.90083 77830.92725 3.691 0.000224 ***
#> waterfront1 744436.57691 17437.36435 42.692 < 0.0000000000000002 ***
#> grade 135905.05230 2126.14234 63.921 < 0.0000000000000002 ***
#> yr_built -3896.69256 62.89369 -61.957 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 220000 on 21603 degrees of freedom
#> Multiple R-squared: 0.6412, Adjusted R-squared: 0.641
#> F-statistic: 4289 on 9 and 21603 DF, p-value: < 0.00000000000000022
artinya nilai pasar akan sebesar 6761353.16 apabila floors = 1 dan waterfront = 0, dan variabel prediktor lainnya bernilai 0
pasar akan sebesar 6761353.16 apabila rumah tidak memiliki pemandangan waterfront (pantai, danau, sungai, dll), memiliki rumah dengan lantai 1 dan tidak ada informasi tambahan lainnya.
1. Interpretasi coefficient untuk prediktor kategorik:
waterfront = 0 menjadi basis
waterfront = 1, 744436.57, artinya nilai price akan meningkat sebesar 744436.57 apabila rumah tersebut memiliki pemandangan waterfront (pantai, danau, sungai, dll) dan variabel prediktor lainnya bernilai tetap
floors = 1 menjadi basis
floors = 2, -11144.70271, artinya nilai price akan menurun sebesar 11144.7 apabila rumah tersebut bertingkat dan variabel prediktor lainnya bernilai tetap
floors = 3, 129985.45263, artinya nilai price akan meningkat sebesar 129985.45263 apabila rumah tersebut bertingkat dan variabel prediktor lainnya bernilai tetap
floors = 4, 287251.90083, artinya nilai price akan meningkat sebesar 287251.90083 apabila rumah tersebut bertingkat dan variabel prediktor lainnya bernilai tetap
2. Interpretasi coefficient untuk prediktor numerik:
3. Signifikansi prediktor:
4. Adjusted R Squared: