Sebagai seorang penjual properti, kita ingin membuat model yang mana dapat memprediksi harga properti berdasarkan beberapa informasi yang ada pada data.
Tentukan variabel:
priceprice1. Read data house_data.csv
house <- read.csv("data_input/house_data.csv")
head(house)2. Cek struktur data
glimpse(house)#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
💡 Hasil pemeriksaan struktur data: - waterfront -> diubah jadi factor
3. Cleansing Data
library(lubridate)
house <- house %>%
mutate(waterfront = factor(waterfront)) %>%
glimpse()#> Rows: 21,613
#> Columns: 9
#> $ price <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
3. EDA
#persebaran data
boxplot(house$price)💡 Insight: - ada banyak outlier pada data tersebut - distribusi persebaran data adalah right skewed - median data kurang lebih adalah 500.000
#korelasi
cor(x = house$price,
y = house$bedrooms)#> [1] 0.3083496
cor(x = house$price,
y = house$bathrooms)#> [1] 0.5251375
cor(x = house$price,
y = house$sqft_living)#> [1] 0.7020351
cor(x = house$price,
y = house$sqft_lot)#> [1] 0.08966086
cor(x = house$price,
y = house$floors)#> [1] 0.2567939
cor(x = house$price,
y = house$grade)#> [1] 0.6674343
cor(x = house$price,
y = house$yr_built)#> [1] 0.05401153
💡 Insight: - price dan bedrooms memiliki korelasi positif yang lemah sebesar 0.3083496, artinya semakin tinggi harga, maka jumlah kamar semakin tinggi - price dan bathrooms memiliki korelasi positif yang kuat sebesar 0.5251375, artinya semakin tinggi harga, maka jumlah kamar mandi semakin tinggi - price dan sqft_living memiliki korelasi positif yang kuat sebesar 0.7020351, artinya semakin tinggi harga, maka luas bangunan akan semakin tinggi - price dan sqft_lot memiliki korelasi positif yang sangat lemah sebesar 0.08966086, artinya semakin tinggi harga, maka luas tanah akan semakin tinggi - price dan floors memiliki korelasi positif yang lemah sebesar 0.2567939, artinya semakin tinggi harga, maka jumlah lantai akan semakin tinggi - price dan grade memiliki korelasi positif yang kuat sebesar 0.6674343, artinya semakin tinggi harga, maka nilai grade akan semakin tinggi - price dan yr_buit memiliki korelasi positif yang sangat lemah sebesar 0.05401153, artinya semakin tinggi harga, maka tahun rumah dibuat akan semakin tinggi (rumah semakin baru)
Buatlah 3 model berdasarkan feature selection yg telah dipelajari 1. model all predictor 2. model selection based on correlation (korelasi > 0.5) 3. model selection hasil stepwise (backward/forward/both)
#model all prediktor
model_all <- lm(formula = price ~ .,
data = house)
#model correlation > 0.5
model_correlation <- lm(formula = price ~ bathrooms + sqft_living + grade,
data = house)#model backward elimination
model_backward <- step(object = model_all,
direction = "backward")#> Start: AIC=531533.8
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors +
#> waterfront + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> <none> 1035294741860252 531534
#> - floors 1 1218939726736 1036513681586988 531557
#> - sqft_lot 1 2030230655068 1037324972515320 531574
#> - bathrooms 1 10844095263314 1046138837123566 531757
#> - bedrooms 1 19802603600782 1055097345461034 531941
#> - waterfront 1 82402185733932 1117696927594184 533187
#> - sqft_living 1 140116227741713 1175410969601965 534275
#> - grade 1 172030644434158 1207325386294410 534854
#> - yr_built 1 183528097123653 1218822838983905 535059
#model forward selection
model_none <- lm(formula = price ~ 1,
data = house)
model_forward <- step(object = model_none, #model tanpa prediktor
direction = "forward",
scope = list(upper = model_all)) #model dengan semua prediktor#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598810 1477276362322490 539204
#> + grade 1 1297612620095468 1615304141825832 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + bedrooms 1 276958595500072 2635958166421226 551718
#> + waterfront 1 206679237434409 2706237524486890 552287
#> + floors 1 192086763313773 2720829998607526 552403
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921726 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + bedrooms 1 40635382190095 1436640980132394 538603
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + floors 1 229913654972 1477046448667517 539202
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123634 1247002236250111 535545
#> + bedrooms 1 22141328690666 1333814489683078 537000
#> + floors 1 9622247208765 1346333571164980 537202
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + bedrooms 1 20121301261862 1136606872012728 533544
#> + bathrooms 1 8626538372689 1148101634901901 533761
#> + floors 1 4395842729126 1152332330545465 533841
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + bedrooms 1 13902067534638 1052710299803964 531888
#> + bathrooms 1 8476715048277 1058135652290325 531999
#> + floors 1 4061693690658 1062550673647944 532089
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#>
#> Step: AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 13953491780042 1038756808023922 531602
#> + floors 1 4176630866150 1048533668937814 531804
#> + sqft_lot 1 2870998480425 1049839301323540 531831
#> <none> 1052710299803964 531888
#>
#> Step: AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 2243126436934 1036513681586988 531557
#> + floors 1 1431835508601 1037324972515320 531574
#> <none> 1038756808023922 531602
#>
#> Step: AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot
#>
#> Df Sum of Sq RSS AIC
#> + floors 1 1218939726736 1035294741860252 531534
#> <none> 1036513681586988 531557
#>
#> Step: AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot + floors
#model both
model_both <- step(object = model_none ,
direction = "both",
scope = list(upper = model_all))#> Start: AIC=553875.8
#> price ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + sqft_living 1 1435640399598810 1477276362322490 539204
#> + grade 1 1297612620095468 1615304141825832 541134
#> + bathrooms 1 803293306497671 2109623455423628 546904
#> + bedrooms 1 276958595500072 2635958166421226 551718
#> + waterfront 1 206679237434409 2706237524486890 552287
#> + floors 1 192086763313773 2720829998607526 552403
#> + sqft_lot 1 23417141523777 2889499620397522 553703
#> + yr_built 1 8497693415832 2904419068505468 553815
#> <none> 2912916761921299 553876
#>
#> Step: AIC=539203.5
#> price ~ sqft_living
#>
#> Df Sum of Sq RSS AIC
#> + grade 1 121320543948745 1355955818373745 537353
#> + waterfront 1 110238185400763 1367038176921726 537529
#> + yr_built 1 92854405407200 1384421956915290 537802
#> + bedrooms 1 40635382190095 1436640980132394 538603
#> + sqft_lot 1 3011349102420 1474265013220070 539161
#> + floors 1 229913654972 1477046448667517 539202
#> + bathrooms 1 147193010785 1477129169311705 539203
#> <none> 1477276362322490 539204
#> - sqft_living 1 1435640399598810 2912916761921299 553876
#>
#> Step: AIC=537353.4
#> price ~ sqft_living + grade
#>
#> Df Sum of Sq RSS AIC
#> + yr_built 1 199227645099154 1156728173274590 533921
#> + waterfront 1 108953582123634 1247002236250111 535545
#> + bedrooms 1 22141328690666 1333814489683078 537000
#> + floors 1 9622247208765 1346333571164980 537202
#> + bathrooms 1 7651853153980 1348303965219765 537233
#> + sqft_lot 1 2020142096807 1353935676276938 537323
#> <none> 1355955818373745 537353
#> - grade 1 121320543948745 1477276362322490 539204
#> - sqft_living 1 259348323452087 1615304141825832 541134
#>
#> Step: AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#>
#> Df Sum of Sq RSS AIC
#> + waterfront 1 90115805935988 1066612367338602 532170
#> + bedrooms 1 20121301261862 1136606872012728 533544
#> + bathrooms 1 8626538372689 1148101634901901 533761
#> + floors 1 4395842729126 1152332330545465 533841
#> + sqft_lot 1 1713565021968 1155014608252622 533891
#> <none> 1156728173274590 533921
#> - yr_built 1 199227645099154 1355955818373745 537353
#> - grade 1 227693783640699 1384421956915290 537802
#> - sqft_living 1 241311622806597 1398039796081188 538014
#>
#> Step: AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#>
#> Df Sum of Sq RSS AIC
#> + bedrooms 1 13902067534638 1052710299803964 531888
#> + bathrooms 1 8476715048277 1058135652290325 531999
#> + floors 1 4061693690658 1062550673647944 532089
#> + sqft_lot 1 1826102349598 1064786264989004 532135
#> <none> 1066612367338602 532170
#> - waterfront 1 90115805935988 1156728173274590 533921
#> - yr_built 1 180389868911509 1247002236250111 535545
#> - grade 1 219517939490357 1286130306828959 536213
#> - sqft_living 1 222939932856999 1289552300195601 536270
#>
#> Step: AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#>
#> Df Sum of Sq RSS AIC
#> + bathrooms 1 13953491780042 1038756808023922 531602
#> + floors 1 4176630866150 1048533668937814 531804
#> + sqft_lot 1 2870998480425 1049839301323540 531831
#> <none> 1052710299803964 531888
#> - bedrooms 1 13902067534638 1066612367338602 532170
#> - waterfront 1 83896572208764 1136606872012728 533544
#> - yr_built 1 179366335438934 1232076635242899 535287
#> - grade 1 198282580205801 1250992880009766 535616
#> - sqft_living 1 217755228757075 1270465528561039 535950
#>
#> Step: AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms
#>
#> Df Sum of Sq RSS AIC
#> + sqft_lot 1 2243126436934 1036513681586988 531557
#> + floors 1 1431835508601 1037324972515320 531574
#> <none> 1038756808023922 531602
#> - bathrooms 1 13953491780042 1052710299803964 531888
#> - bedrooms 1 19378844266404 1058135652290325 531999
#> - waterfront 1 82547202240690 1121304010264612 533253
#> - sqft_living 1 136922615063114 1175679423087036 534276
#> - grade 1 184390684043560 1223147492067482 535132
#> - yr_built 1 190012863248603 1228769671272524 535231
#>
#> Step: AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot
#>
#> Df Sum of Sq RSS AIC
#> + floors 1 1218939726736 1035294741860252 531534
#> <none> 1036513681586988 531557
#> - sqft_lot 1 2243126436934 1038756808023922 531602
#> - bathrooms 1 13325619736552 1049839301323540 531831
#> - bedrooms 1 20298780261196 1056812461848184 531974
#> - waterfront 1 82496530390614 1119010211977602 533210
#> - sqft_living 1 138930506698738 1175444188285726 534274
#> - grade 1 182635783415639 1219149465002627 535063
#> - yr_built 1 188639111042699 1225152792629687 535169
#>
#> Step: AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms +
#> bathrooms + sqft_lot + floors
#>
#> Df Sum of Sq RSS AIC
#> <none> 1035294741860252 531534
#> - floors 1 1218939726736 1036513681586988 531557
#> - sqft_lot 1 2030230655068 1037324972515320 531574
#> - bathrooms 1 10844095263314 1046138837123566 531757
#> - bedrooms 1 19802603600782 1055097345461034 531941
#> - waterfront 1 82402185733932 1117696927594184 533187
#> - sqft_living 1 140116227741713 1175410969601965 534275
#> - grade 1 172030644434158 1207325386294410 534854
#> - yr_built 1 183528097123653 1218822838983905 535059
model_backward, model_forward, dan model_both menghasilkan AIC yang sama yaitu 531533.8 dan menggunakan semua prediktor kecuali price.
#menambahkan hasil prediksi dari ke ketiga model ke kolom baru
house$pred_all <- predict(object = model_all, newdata = house)
house$pred_correlation <- predict(object = model_correlation, newdata = house)
house$pred_backward<- predict(object = model_backward, newdata = house)
head(house)Berdasarkan RMSE model regresi manakah yang terbaik?
#RMSE model_all
RMSE(y_pred = house$pred_all,
y_true = house$price)#> [1] 218864.1
#RMSE model_correlation
RMSE(y_pred = house$pred_correlation,
y_true = house$price)#> [1] 249767.8
#RMSE model_backward
RMSE(y_pred = house$pred_backward,
y_true = house$price)#> [1] 218864.1
💡 Kesimpulan : model yang memberikan error paling kecil dalam memprediksi nilai price adalah model_all dan model_correlation dengan nilai RMSE sebesar 218864.1
summary(model_all)#>
#> Call:
#> lm(formula = price ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -1384206 -112972 -10077 91060 4251811
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 6999106.70657 121576.94670 57.569 < 0.0000000000000002 ***
#> bedrooms -41484.20936 2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms 51710.08964 3437.50666 15.043 < 0.0000000000000002 ***
#> sqft_living 177.91392 3.29026 54.073 < 0.0000000000000002 ***
#> sqft_lot -0.23947 0.03679 -6.509 0.0000000000774 ***
#> floors 17283.13337 3426.85939 5.043 0.0000004609553 ***
#> waterfront1 721804.73094 17406.65326 41.467 < 0.0000000000000002 ***
#> grade 128813.92794 2149.93255 59.915 < 0.0000000000000002 ***
#> yr_built -3963.73577 64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared: 0.6446, Adjusted R-squared: 0.6445
#> F-statistic: 4898 on 8 and 21604 DF, p-value: < 0.00000000000000022
1. Interpretasi coefficient untuk prediktor kategorik: - prediktor kategorik: waterfront - waterfront = 0 menjadi basis - waterfront = 1, yaitu 721804.73094, artinya nilai price akan meningkat sebesar 721804.73094 apabila rumah tersebut memiliki waterfront dan variabel prediktor lainnya bernilai tetap
2. Interpretasi coefficient untuk prediktor numerik:
3. Signifikansi prediktor:
4. Adjusted R Squared: