1 Business Problem

Sebagai seorang penjual properti, kita ingin membuat model yang mana dapat memprediksi harga properti berdasarkan beberapa informasi yang ada pada data.

Tentukan variabel:

  • target: price
  • prediktor: seluruh variabel terkecuali price

2 Data Wrangling & EDA

1. Read data house_data.csv

house <- read.csv("data_input/house_data.csv")
head(house)

2. Cek struktur data

glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…

💡 Hasil pemeriksaan struktur data: - Tipe data masing-masing kolom sudah sesuai, hanya saja waterfront lebih cocok dibuat tipe data faktor karena nilai uniquenya adalah 2 yaitu (0 dan 1)

house$waterfront <- as.factor(house$waterfront)
glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…

3. Cleansing Data

anyNA(house)
#> [1] FALSE

💡 Hasil pemeriksaan Missing Value: - Tidak ada missing value

3. EDA

#persebaran data
boxplot(house$price)

#korelasi
ggcorr(house,label = TRUE)

💡 Insight:

  • Persebaran data tidak terlalu banyak karena ukuran box yang sempit
  • Ada data yang outlier dengan jumlah yang cukup banyak pada variabel price
  • Variable price memiliki korelasi kuat (>0.5) dengan sqft_living dan grade

3 Modeling

Buatlah 3 model berdasarkan feature selection yg telah dipelajari 1. model all predictor 2. model selection based on correlation (korelasi > 0.5) 3. model selection hasil stepwise (backward/forward/both)

# Model Without Predictor
model_base_house <- lm(formula = price ~ 1,
                 data = house)

# Model All Predictor
model_all_house <- lm(formula = price ~.,
                data = house)

# Model Selection based on Correlation
model_selection_house <- lm(formula = price ~ sqft_living + grade,
                            data = house)

# Model Selection Stepwise Both
model_stepwise_both <- step(object = model_base_house,
                            direction = "both",
                            scope = list(upper = model_all_house))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + sqft_living  1 1435640399598803 1477276362322494 539204
#> + grade        1 1297612620095472 1615304141825826 541134
#> + bathrooms    1  803293306497667 2109623455423631 546904
#> + bedrooms     1  276958595500094 2635958166421204 551718
#> + waterfront   1  206679237434406 2706237524486891 552287
#> + floors       1  192086763313769 2720829998607528 552403
#> + sqft_lot     1   23417141523788 2889499620397510 553703
#> + yr_built     1    8497693415847 2904419068505450 553815
#> <none>                            2912916761921298 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + grade        1  121320543948748 1355955818373746 537353
#> + waterfront   1  110238185400773 1367038176921721 537529
#> + yr_built     1   92854405407209 1384421956915285 537802
#> + bedrooms     1   40635382190095 1436640980132400 538603
#> + sqft_lot     1    3011349102430 1474265013220065 539161
#> + floors       1     229913654980 1477046448667514 539202
#> + bathrooms    1     147193010776 1477129169311718 539203
#> <none>                            1477276362322494 539204
#> - sqft_living  1 1435640399598803 2912916761921298 553876
#> 
#> Step:  AIC=537353.4
#> price ~ sqft_living + grade
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + yr_built     1 199227645099163 1156728173274584 533921
#> + waterfront   1 108953582123630 1247002236250117 535545
#> + bedrooms     1  22141328690674 1333814489683073 537000
#> + floors       1   9622247208762 1346333571164984 537202
#> + bathrooms    1   7651853153983 1348303965219764 537233
#> + sqft_lot     1   2020142096816 1353935676276931 537323
#> <none>                           1355955818373746 537353
#> - grade        1 121320543948748 1477276362322494 539204
#> - sqft_living  1 259348323452079 1615304141825826 541134
#> 
#> Step:  AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + waterfront   1  90115805935978 1066612367338606 532170
#> + bedrooms     1  20121301261852 1136606872012732 533544
#> + bathrooms    1   8626538372687 1148101634901897 533761
#> + floors       1   4395842729118 1152332330545465 533841
#> + sqft_lot     1   1713565021960 1155014608252624 533891
#> <none>                           1156728173274584 533921
#> - yr_built     1 199227645099163 1355955818373747 537353
#> - grade        1 227693783640702 1384421956915285 537802
#> - sqft_living  1 241311622806606 1398039796081190 538014
#> 
#> Step:  AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bedrooms     1  13902067534646 1052710299803960 531888
#> + bathrooms    1   8476715048277 1058135652290329 531999
#> + floors       1   4061693690667 1062550673647939 532089
#> + sqft_lot     1   1826102349601 1064786264989005 532135
#> <none>                           1066612367338606 532170
#> - waterfront   1  90115805935978 1156728173274584 533921
#> - yr_built     1 180389868911511 1247002236250117 535545
#> - grade        1 219517939490350 1286130306828956 536213
#> - sqft_living  1 222939932856995 1289552300195601 536270
#> 
#> Step:  AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bathrooms    1  13953491780040 1038756808023920 531602
#> + floors       1   4176630866141 1048533668937819 531804
#> + sqft_lot     1   2870998480422 1049839301323538 531831
#> <none>                           1052710299803960 531888
#> - bedrooms     1  13902067534646 1066612367338606 532170
#> - waterfront   1  83896572208772 1136606872012732 533544
#> - yr_built     1 179366335438936 1232076635242896 535287
#> - grade        1 198282580205803 1250992880009763 535616
#> - sqft_living  1 217755228757082 1270465528561043 535950
#> 
#> Step:  AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + sqft_lot     1   2243126436929 1036513681586991 531557
#> + floors       1   1431835508600 1037324972515319 531574
#> <none>                           1038756808023920 531602
#> - bathrooms    1  13953491780040 1052710299803960 531888
#> - bedrooms     1  19378844266409 1058135652290329 531999
#> - waterfront   1  82547202240685 1121304010264605 533253
#> - sqft_living  1 136922615063111 1175679423087031 534276
#> - grade        1 184390684043561 1223147492067481 535132
#> - yr_built     1 190012863248597 1228769671272517 535231
#> 
#> Step:  AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + floors       1   1218939726729 1035294741860261 531534
#> <none>                           1036513681586991 531557
#> - sqft_lot     1   2243126436929 1038756808023920 531602
#> - bathrooms    1  13325619736547 1049839301323538 531831
#> - bedrooms     1  20298780261191 1056812461848181 531974
#> - waterfront   1  82496530390604 1119010211977595 533210
#> - sqft_living  1 138930506698739 1175444188285730 534274
#> - grade        1 182635783415641 1219149465002632 535063
#> - yr_built     1 188639111042702 1225152792629693 535169
#> 
#> Step:  AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot + floors
#> 
#>               Df       Sum of Sq              RSS    AIC
#> <none>                           1035294741860261 531534
#> - floors       1   1218939726729 1036513681586991 531557
#> - sqft_lot     1   2030230655058 1037324972515319 531574
#> - bathrooms    1  10844095263302 1046138837123564 531757
#> - bedrooms     1  19802603600768 1055097345461029 531941
#> - waterfront   1  82402185733920 1117696927594182 533187
#> - sqft_living  1 140116227741708 1175410969601969 534275
#> - grade        1 172030644434154 1207325386294415 534854
#> - yr_built     1 183528097123653 1218822838983914 535059

4 Evaluasi model

Berdasarkan RMSE model regresi manakah yang terbaik?

RMSE(y_pred = model_all_house$fitted.values,
     y_true = house$price)
#> [1] 218864.1
RMSE(y_pred=model_selection_house$fitted.values, 
     y_true=house$price)
#> [1] 250475.5
RMSE(y_pred=model_stepwise_both$fitted.values,
     y_true=house$price)
#> [1] 218864.1

💡 Kesimpulan :

5 Interpretasi Model Terbaik:

summary(model_all_house)
#> 
#> Call:
#> lm(formula = price ~ ., data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022

1. Interpretasi coefficient untuk prediktor kategorik:

  • Ketika nilai waterfront adalah 1 (rumah berada pada daerah tepi laut), maka nilai price akan meningkat sebesar 721.804,73094.

2. Interpretasi coefficient untuk prediktor numerik:

  • bedrooms : nilai price akan menurun sebesar 41.484,20936 ketika nilai bedrooms meningkat sebesar 1 dan variabel lainnya tetap.
  • bathrooms : nilai price akan meningkat sebesar 51.710,08964 ketika nilai bathrooms meningkat sebesar 1 dan variabel lainnya tetap.
  • sqft_living : nilai price akan meningkat sebanyak 177.91392 ketika nilai sqft_living meningkat sebanyak 1 dan variabel lainnya tetap.
  • sqft_lot : nilai price akan menurun sebanyak 0,23947 ketika nilai sqft_lot meningkat sebanyak 1 dan variabel lainnya tetap.
  • floors : nilai price akan meningkat sebanyak 17.283,13337 ketika nilai floors meningkat sebanyak 1 dan variabel lainnya tetap.
  • grade : nilai price akan meningkat sebanyak 128.813,92794 ketika nilai grade meningkat sebanyak 1 dan variabel lainnya tetap.
  • yr_built : nilai price akan menurun sebanyak 3.963,73577 ketika nilai yr_built meningkat sebanyak 1 dan variabel lainnya tetap.

3. Signifikansi prediktor:

  • Semua variabel prediktor yang ada memiliki pengaruh yang signifikan terhadap model.

4. Adjusted R Squared:

  • 0.6466, artinya model bisa menjelaskan nilai price dengan baik sebesar 64,66%