1 Business Problem

As property sellers, we want to build a model that predicts property prices from the information available in the data.

Define the variables:

  • target: price
  • predictors: all variables except price

2 Data Wrangling & EDA

1. Read the data from house_data.csv

library(dplyr)      # glimpse(), mutate(), %>%
library(MLmetrics)  # RMSE()
house <- read.csv("data_input/house_data.csv")
head(house)

2. Check the data structure

glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
# check the unique values of columns that may be categorical (factor)
unique(house$bedrooms)
#>  [1]  3  2  4  5  1  6  7  0  8  9 11 10 33
unique(house$waterfront)
#> [1] 0 1
unique(house$floors)
#> [1] 1.0 2.0 1.5 3.0 2.5 3.5
unique(house$grade)
#>  [1]  7  6  8 11  9  5 10 12  4  3 13  1

💡 Findings from the structure check:

  • convert the waterfront attribute to a factor, since it only takes the values 0 and 1
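
To illustrate what this conversion does, here is a minimal sketch on a small made-up vector (not the house data). It shows how R encodes a 0/1 factor as one dummy column in a model's design matrix. The names waterfront_raw, waterfront_fct and design are illustrative assumptions.

```r
# Hypothetical 0/1 indicator, standing in for a column such as waterfront
waterfront_raw <- c(0, 0, 1, 0, 1)

# Convert to a factor; the first level ("0") becomes the baseline
waterfront_fct <- as.factor(waterfront_raw)
levels(waterfront_fct)

# model.matrix() shows the design matrix a linear model would use:
# an intercept plus one dummy column for level "1"
design <- model.matrix(~ waterfront_fct)
colnames(design)
```

The dummy column is named after the variable and the non-baseline level, which is why a factor predictor with levels 0 and 1 shows up in a regression summary with a trailing 1 in its name.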

3. Cleansing Data

house <- house %>%
  mutate(waterfront = as.factor(waterfront))

glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
house_2 <- house  # untouched copy, reused later for outlier removal

4. EDA

# function that finds numeric predictors strongly correlated with the target
cek_korelasi <- function(data, predictors, target, threshold) {
  results <- list()
  
  for (predictor in predictors) {
    correlation <- cor(data[[predictor]], data[[target]])
    
    if (abs(correlation) > threshold) {
      results[[predictor]] <- correlation
    }
  }
  
  return(results)
}
target <- "price"
predictors <- colnames(house)[!colnames(house) %in% c(target, "waterfront")]
threshold <- 0.5

target
#> [1] "price"
predictors
#> [1] "bedrooms"    "bathrooms"   "sqft_living" "sqft_lot"    "floors"     
#> [6] "grade"       "yr_built"
threshold
#> [1] 0.5
hasil_korelasi <- cek_korelasi(house, predictors, target, threshold)

hasil_korelasi
#> $bathrooms
#> [1] 0.5251375
#> 
#> $sqft_living
#> [1] 0.7020351
#> 
#> $grade
#> [1] 0.6674343
# Test the association between the categorical predictor and the target (one-way ANOVA)
model_aov <- aov(price ~ waterfront, data = house)
anova_result <- anova(model_aov)

anova_result

💡 Insight:

  • The correlation function for numeric predictors shows that the predictors whose absolute correlation with price exceeds 0.5 are bathrooms, sqft_living, and grade
  • The one-way ANOVA F-test gives a very small p-value, less than 0.00000000000000022, so we can conclude that the waterfront predictor has a significant effect on the target
  • A model using correlation-based feature selection can therefore use the predictors bathrooms, sqft_living, grade, and waterfront
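
The two checks above can also be written more compactly. The sketch below uses the built-in mtcars data purely as a stand-in for the house data (mpg plays the role of the target, am the role of the 0/1 categorical predictor). It computes all correlations in one cor() call and extracts the F-test p-value from the ANOVA table. The object names are assumptions for illustration.

```r
# Built-in mtcars data is used here only as a stand-in for the house data
target <- "mpg"
predictors <- setdiff(names(mtcars), c(target, "am"))  # treat am as categorical

# Correlation of every numeric predictor with the target in one call
cors <- cor(mtcars[predictors], mtcars[[target]])[, 1]
strong <- cors[abs(cors) > 0.5]
names(strong)

# One-way ANOVA for the categorical predictor; extract the F-test p-value
fit_aov <- aov(mpg ~ factor(am), data = mtcars)
p_value <- anova(fit_aov)[["Pr(>F)"]][1]
p_value
```

Filtering on abs(cors) keeps strongly negative correlations as well as strongly positive ones, which matches the abs() call in cek_korelasi.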

3 Modeling

Build 3 models based on the feature selection methods covered so far:

  1. model with all predictors
  2. model selected by correlation (correlation > 0.5)
  3. model selected by stepwise regression (backward/forward/both)
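
For the correlation-based model, the formula does not have to be typed by hand. As a small sketch, reformulate() from base R can build it from a character vector of predictor names. The vector selected below simply repeats the predictors named in the insight above, and formula_cor is a hypothetical name.

```r
# Predictors chosen by the correlation check, plus the categorical waterfront
selected <- c("bathrooms", "sqft_living", "grade", "waterfront")

# reformulate() builds a formula object from character vectors
formula_cor <- reformulate(selected, response = "price")
formula_cor
```

The resulting object can be passed as the formula argument of lm() in place of a hand-written formula.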
# Model with all predictors
model_all <- lm(formula = price ~ .,
                data = house)
summary(model_all)
#> 
#> Call:
#> lm(formula = price ~ ., data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
# Model with predictors whose correlation with price exceeds 0.5, plus waterfront
model_selected_cor <- lm(formula = price ~ bathrooms + sqft_living + grade + waterfront,
                data = house)
summary(model_selected_cor)
#> 
#> Call:
#> lm(formula = price ~ bathrooms + sqft_living + grade + waterfront, 
#>     data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1239531  -131463   -20276   101079  4885132 
#> 
#> Coefficients:
#>                Estimate  Std. Error t value            Pr(>|t|)    
#> (Intercept) -584322.722   12724.367  -45.92 <0.0000000000000002 ***
#> bathrooms    -34648.476    3300.646  -10.50 <0.0000000000000002 ***
#> sqft_living     194.094       3.202   60.61 <0.0000000000000002 ***
#> grade        102888.055    2193.383   46.91 <0.0000000000000002 ***
#> waterfront1  820352.869   18947.759   43.30 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 239600 on 21608 degrees of freedom
#> Multiple R-squared:  0.5741, Adjusted R-squared:  0.574 
#> F-statistic:  7281 on 4 and 21608 DF,  p-value: < 0.00000000000000022
# Model with backward stepwise selection
model_backward <- step(object = model_all,
                       direction = "backward")
#> Start:  AIC=531533.8
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
#>     waterfront + grade + yr_built
#> 
#>               Df       Sum of Sq              RSS    AIC
#> <none>                           1035294741860252 531534
#> - floors       1   1218939726736 1036513681586988 531557
#> - sqft_lot     1   2030230655068 1037324972515320 531574
#> - bathrooms    1  10844095263314 1046138837123566 531757
#> - bedrooms     1  19802603600782 1055097345461034 531941
#> - waterfront   1  82402185733932 1117696927594184 533187
#> - sqft_living  1 140116227741713 1175410969601965 534275
#> - grade        1 172030644434158 1207325386294410 534854
#> - yr_built     1 183528097123653 1218822838983905 535059
summary(model_backward)
#> 
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + grade + yr_built, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
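
The AIC column printed by step() can be reproduced by hand. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below plugs in the RSS values from the backward output above. The variable names are illustrative assumptions.

```r
# Values copied from the backward step() output above
n   <- 21613              # number of rows
rss <- 1035294741860252   # RSS of the full model (the <none> row)
k   <- 9                  # intercept + 8 slope coefficients

# AIC as computed by extractAIC() for a linear model
aic_full <- n * log(rss / n) + 2 * k
round(aic_full, 1)

# Dropping floors: RSS rises and there is one coefficient fewer
rss_no_floors <- 1036513681586988
aic_no_floors <- n * log(rss_no_floors / n) + 2 * (k - 1)
round(aic_no_floors, 1)
```

Because removing even the weakest predictor (floors) raises the AIC, the <none> row has the lowest value and backward selection stops without dropping anything.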
# Intercept-only model, used as the starting point for forward and both-direction selection
model_none <- lm(formula = price ~ 1,
                 data = house)

model_forward <- step(object = model_none, 
                      direction = "forward", 
                      scope = list(upper = model_all))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + sqft_living  1 1435640399598810 1477276362322490 539204
#> + grade        1 1297612620095468 1615304141825832 541134
#> + bathrooms    1  803293306497671 2109623455423628 546904
#> + bedrooms     1  276958595500072 2635958166421226 551718
#> + waterfront   1  206679237434409 2706237524486890 552287
#> + floors       1  192086763313773 2720829998607526 552403
#> + sqft_lot     1   23417141523777 2889499620397522 553703
#> + yr_built     1    8497693415832 2904419068505468 553815
#> <none>                            2912916761921299 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>              Df       Sum of Sq              RSS    AIC
#> + grade       1 121320543948745 1355955818373745 537353
#> + waterfront  1 110238185400763 1367038176921726 537529
#> + yr_built    1  92854405407200 1384421956915290 537802
#> + bedrooms    1  40635382190095 1436640980132394 538603
#> + sqft_lot    1   3011349102420 1474265013220070 539161
#> + floors      1    229913654972 1477046448667517 539202
#> + bathrooms   1    147193010785 1477129169311705 539203
#> <none>                          1477276362322490 539204
#> 
#> Step:  AIC=537353.4
#> price ~ sqft_living + grade
#> 
#>              Df       Sum of Sq              RSS    AIC
#> + yr_built    1 199227645099154 1156728173274590 533921
#> + waterfront  1 108953582123634 1247002236250111 535545
#> + bedrooms    1  22141328690666 1333814489683078 537000
#> + floors      1   9622247208765 1346333571164980 537202
#> + bathrooms   1   7651853153980 1348303965219765 537233
#> + sqft_lot    1   2020142096807 1353935676276938 537323
#> <none>                          1355955818373745 537353
#> 
#> Step:  AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#> 
#>              Df      Sum of Sq              RSS    AIC
#> + waterfront  1 90115805935988 1066612367338602 532170
#> + bedrooms    1 20121301261862 1136606872012728 533544
#> + bathrooms   1  8626538372689 1148101634901901 533761
#> + floors      1  4395842729126 1152332330545465 533841
#> + sqft_lot    1  1713565021968 1155014608252622 533891
#> <none>                         1156728173274590 533921
#> 
#> Step:  AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#> 
#>             Df      Sum of Sq              RSS    AIC
#> + bedrooms   1 13902067534638 1052710299803964 531888
#> + bathrooms  1  8476715048277 1058135652290325 531999
#> + floors     1  4061693690658 1062550673647944 532089
#> + sqft_lot   1  1826102349598 1064786264989004 532135
#> <none>                        1066612367338602 532170
#> 
#> Step:  AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#> 
#>             Df      Sum of Sq              RSS    AIC
#> + bathrooms  1 13953491780042 1038756808023922 531602
#> + floors     1  4176630866150 1048533668937814 531804
#> + sqft_lot   1  2870998480425 1049839301323540 531831
#> <none>                        1052710299803964 531888
#> 
#> Step:  AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms
#> 
#>            Df     Sum of Sq              RSS    AIC
#> + sqft_lot  1 2243126436934 1036513681586988 531557
#> + floors    1 1431835508601 1037324972515320 531574
#> <none>                      1038756808023922 531602
#> 
#> Step:  AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot
#> 
#>          Df     Sum of Sq              RSS    AIC
#> + floors  1 1218939726736 1035294741860252 531534
#> <none>                    1036513681586988 531557
#> 
#> Step:  AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot + floors
summary(model_forward)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront + 
#>     bedrooms + bathrooms + sqft_lot + floors, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
model_both <- step(object = model_none, 
                      direction = "both", 
                      scope = list(upper = model_all))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + sqft_living  1 1435640399598810 1477276362322490 539204
#> + grade        1 1297612620095468 1615304141825832 541134
#> + bathrooms    1  803293306497671 2109623455423628 546904
#> + bedrooms     1  276958595500072 2635958166421226 551718
#> + waterfront   1  206679237434409 2706237524486890 552287
#> + floors       1  192086763313773 2720829998607526 552403
#> + sqft_lot     1   23417141523777 2889499620397522 553703
#> + yr_built     1    8497693415832 2904419068505468 553815
#> <none>                            2912916761921299 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + grade        1  121320543948745 1355955818373745 537353
#> + waterfront   1  110238185400763 1367038176921726 537529
#> + yr_built     1   92854405407200 1384421956915290 537802
#> + bedrooms     1   40635382190095 1436640980132394 538603
#> + sqft_lot     1    3011349102420 1474265013220070 539161
#> + floors       1     229913654972 1477046448667517 539202
#> + bathrooms    1     147193010785 1477129169311705 539203
#> <none>                            1477276362322490 539204
#> - sqft_living  1 1435640399598810 2912916761921299 553876
#> 
#> Step:  AIC=537353.4
#> price ~ sqft_living + grade
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + yr_built     1 199227645099154 1156728173274590 533921
#> + waterfront   1 108953582123634 1247002236250111 535545
#> + bedrooms     1  22141328690666 1333814489683078 537000
#> + floors       1   9622247208765 1346333571164980 537202
#> + bathrooms    1   7651853153980 1348303965219765 537233
#> + sqft_lot     1   2020142096807 1353935676276938 537323
#> <none>                           1355955818373745 537353
#> - grade        1 121320543948745 1477276362322490 539204
#> - sqft_living  1 259348323452087 1615304141825832 541134
#> 
#> Step:  AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + waterfront   1  90115805935988 1066612367338602 532170
#> + bedrooms     1  20121301261862 1136606872012728 533544
#> + bathrooms    1   8626538372689 1148101634901901 533761
#> + floors       1   4395842729126 1152332330545465 533841
#> + sqft_lot     1   1713565021968 1155014608252622 533891
#> <none>                           1156728173274590 533921
#> - yr_built     1 199227645099154 1355955818373745 537353
#> - grade        1 227693783640699 1384421956915290 537802
#> - sqft_living  1 241311622806597 1398039796081188 538014
#> 
#> Step:  AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bedrooms     1  13902067534638 1052710299803964 531888
#> + bathrooms    1   8476715048277 1058135652290325 531999
#> + floors       1   4061693690658 1062550673647944 532089
#> + sqft_lot     1   1826102349598 1064786264989004 532135
#> <none>                           1066612367338602 532170
#> - waterfront   1  90115805935988 1156728173274590 533921
#> - yr_built     1 180389868911509 1247002236250111 535545
#> - grade        1 219517939490357 1286130306828959 536213
#> - sqft_living  1 222939932856999 1289552300195601 536270
#> 
#> Step:  AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bathrooms    1  13953491780042 1038756808023922 531602
#> + floors       1   4176630866150 1048533668937814 531804
#> + sqft_lot     1   2870998480425 1049839301323540 531831
#> <none>                           1052710299803964 531888
#> - bedrooms     1  13902067534638 1066612367338602 532170
#> - waterfront   1  83896572208764 1136606872012728 533544
#> - yr_built     1 179366335438934 1232076635242899 535287
#> - grade        1 198282580205801 1250992880009766 535616
#> - sqft_living  1 217755228757075 1270465528561039 535950
#> 
#> Step:  AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + sqft_lot     1   2243126436934 1036513681586988 531557
#> + floors       1   1431835508601 1037324972515320 531574
#> <none>                           1038756808023922 531602
#> - bathrooms    1  13953491780042 1052710299803964 531888
#> - bedrooms     1  19378844266404 1058135652290325 531999
#> - waterfront   1  82547202240690 1121304010264612 533253
#> - sqft_living  1 136922615063114 1175679423087036 534276
#> - grade        1 184390684043560 1223147492067482 535132
#> - yr_built     1 190012863248603 1228769671272524 535231
#> 
#> Step:  AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + floors       1   1218939726736 1035294741860252 531534
#> <none>                           1036513681586988 531557
#> - sqft_lot     1   2243126436934 1038756808023922 531602
#> - bathrooms    1  13325619736552 1049839301323540 531831
#> - bedrooms     1  20298780261196 1056812461848184 531974
#> - waterfront   1  82496530390614 1119010211977602 533210
#> - sqft_living  1 138930506698738 1175444188285726 534274
#> - grade        1 182635783415639 1219149465002627 535063
#> - yr_built     1 188639111042699 1225152792629687 535169
#> 
#> Step:  AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot + floors
#> 
#>               Df       Sum of Sq              RSS    AIC
#> <none>                           1035294741860252 531534
#> - floors       1   1218939726736 1036513681586988 531557
#> - sqft_lot     1   2030230655068 1037324972515320 531574
#> - bathrooms    1  10844095263314 1046138837123566 531757
#> - bedrooms     1  19802603600782 1055097345461034 531941
#> - waterfront   1  82402185733932 1117696927594184 533187
#> - sqft_living  1 140116227741713 1175410969601965 534275
#> - grade        1 172030644434158 1207325386294410 534854
#> - yr_built     1 183528097123653 1218822838983905 535059
summary(model_both)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront + 
#>     bedrooms + bathrooms + sqft_lot + floors, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022

4 Model Evaluation

4.1 Goodness of Fit

Based on goodness of fit, which regression model is best?

summary(model_all)$r.squared
#> [1] 0.6445849
summary(model_selected_cor)$r.squared
#> [1] 0.5740781
summary(model_backward)$r.squared
#> [1] 0.6445849
summary(model_forward)$r.squared
#> [1] 0.6445849
summary(model_both)$r.squared
#> [1] 0.6445849

💡 Conclusion:

  • model_all, model_backward, model_forward, and model_both all have the same R-squared of 0.6445849 (adjusted R-squared 0.6445 in the summaries above), which indicates that these four models use the same predictors and the same training data. model_selected_cor is lower, at 0.5740781 (adjusted R-squared 0.574).
  • Inspecting the stepwise results (forward, backward, both) confirms that all three keep every predictor, so it is expected that the four models share the same R-squared.
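
Since the models being compared have different numbers of predictors, adjusted R-squared is the fairer measure. As a worked check, the sketch below recomputes it from the R-squared values printed above using the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1). The helper adj_r2 is a hypothetical name, not part of any package.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 21613  # rows in the house data

# model_all (8 predictors) and model_selected_cor (4 predictors)
adj_all <- adj_r2(0.6445849, n, p = 8)
adj_cor <- adj_r2(0.5740781, n, p = 4)

round(adj_all, 4)
round(adj_cor, 3)
```

With more than 21,000 rows the penalty for extra predictors is tiny, so the ranking of the models is the same under either measure.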

4.2 RMSE

house$pred_all <- predict(object = model_all, newdata = house)
house$pred_selected_cor <- predict(object = model_selected_cor, newdata = house)
house$pred_backward <- predict(object = model_backward, newdata = house)
house$pred_forward <- predict(object = model_forward, newdata = house)
house$pred_both <- predict(object = model_both, newdata = house)

Based on RMSE, which regression model is best?

# RMSE model_all
RMSE(y_pred = house$pred_all ,
    y_true = house$price)
#> [1] 218864.1
# RMSE model_selected_cor
RMSE(y_pred = house$pred_selected_cor ,
    y_true = house$price)
#> [1] 239591.5
# RMSE model_forward
RMSE(y_pred = house$pred_forward,
    y_true = house$price)
#> [1] 218864.1
# RMSE model_backward
RMSE(y_pred = house$pred_backward ,
    y_true = house$price)
#> [1] 218864.1
# RMSE model_both
RMSE(y_pred = house$pred_both ,
    y_true = house$price)
#> [1] 218864.1

💡 Conclusion:

  • The RMSE values confirm that stepwise feature selection gives the same result as the model with all features. As explained in the goodness-of-fit conclusion, the stepwise models and model_all use the same predictors.
  • model_all and the stepwise models therefore have the smallest error in predicting price, with an RMSE of 218864.1, compared with 239591.5 for model_selected_cor. Note that these values are computed on the same data used to fit the models.
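
To make the metric concrete, here is a minimal sketch of RMSE written out by hand. The helper rmse is a hypothetical name and is not the MLmetrics implementation. The second part uses the fact that in-sample RMSE equals sqrt(RSS / n), with the RSS of the full model copied from the step() output earlier.

```r
# RMSE written out by hand (hypothetical helper)
rmse <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2))
}

# Toy check: errors of 3 and 4 give sqrt((9 + 16) / 2)
rmse(y_pred = c(0, 0), y_true = c(3, 4))

# In-sample RMSE of the full model from its residual sum of squares
rss <- 1035294741860252
n   <- 21613
sqrt(rss / n)
```

The second value agrees with the RMSE reported for model_all, which shows that RMSE and RSS rank models fitted on the same data identically.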

5 Interpretation of the Best Model

summary(model_all)
#> 
#> Call:
#> lm(formula = price ~ ., data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022

1. Interpreting the coefficient of the categorical predictor:

  • waterfront = 0 is the baseline level
  • waterfront1 = 721804.73094, meaning that price is 721804.73094 higher when the house is on the waterfront, holding the other predictors constant

2. Interpreting the coefficients of the numeric predictors:

5.2 Increase

  • bathrooms = 51710.08964, meaning that price increases by 51710.08964 for each additional bathroom, holding the other predictors constant.
  • sqft_living = 177.91392, meaning that price increases by 177.91392 for each additional square foot of living area, holding the other predictors constant.
  • floors = 17283.13337, meaning that price increases by 17283.13337 for each additional floor, holding the other predictors constant.
  • grade = 128813.92794, meaning that price increases by 128813.92794 for each one-unit increase in grade, holding the other predictors constant.
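
These interpretations can be checked with plain arithmetic. The sketch below copies the coefficients from summary(model_all) and compares two otherwise identical houses. predict_price, house_a, house_b and house_c are hypothetical names, and the house values are taken from the first row shown by glimpse().

```r
# Coefficients copied from summary(model_all) above
coefs <- c(
  intercept   = 6999106.70657,
  bedrooms    = -41484.20936,
  bathrooms   = 51710.08964,
  sqft_living = 177.91392,
  sqft_lot    = -0.23947,
  floors      = 17283.13337,
  waterfront1 = 721804.73094,
  grade       = 128813.92794,
  yr_built    = -3963.73577
)

# Hypothetical helper: linear prediction for one house given as a named vector
predict_price <- function(x) {
  unname(coefs["intercept"] + sum(coefs[names(x)] * x))
}

house_a <- c(bedrooms = 3, bathrooms = 1, sqft_living = 1180, sqft_lot = 5650,
             floors = 1, waterfront1 = 0, grade = 7, yr_built = 1955)
house_b <- replace(house_a, "waterfront1", 1)  # same house, on the waterfront
house_c <- replace(house_a, "bathrooms", 2)    # same house, one more bathroom

# Each difference equals the corresponding coefficient
predict_price(house_b) - predict_price(house_a)
predict_price(house_c) - predict_price(house_a)
```

Changing one predictor by one unit while holding the rest fixed shifts the prediction by exactly that predictor's coefficient, which is what "holding the other predictors constant" means.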

3. Predictor significance:

  • all predictors are highly significant for price, as shown by p-values smaller than 0.0001

4. Adjusted R-squared:

  • 0.6445, meaning the model explains 64.45% of the variance in price, which is a reasonably good fit

6 Improvement

boxplot(house$bedrooms)

boxplot(house$bathrooms)

boxplot(house$sqft_living)

boxplot(house$sqft_lot)

boxplot(house$floors)

boxplot(house$grade)

boxplot(house$yr_built)

6.1 Remove Outliers

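remove_outliers <- function(data) {
  cleaned_data <- data
  
  # apply the 1.5 * IQR rule to every numeric column (this includes the target, price)
  for (col in names(data)) {
    if (is.numeric(data[[col]])) {
      # bounds are computed from the original data, not the progressively filtered data
      q1 <- quantile(data[[col]], 0.25)
      q3 <- quantile(data[[col]], 0.75)
      iqr <- q3 - q1  # lower-case name avoids shadowing stats::IQR()
      
      lower_bound <- q1 - 1.5 * iqr
      upper_bound <- q3 + 1.5 * iqr
      
      # keep only the rows that fall inside the bounds for this column
      cleaned_data <- cleaned_data[cleaned_data[[col]] >= lower_bound & cleaned_data[[col]] <= upper_bound, ]
    }
  }
  
  return(cleaned_data)
}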
# Call remove_outliers on the copy of the data without prediction columns
cleaned_data <- remove_outliers(house_2)
cleaned_data
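
The 1.5 × IQR rule used by remove_outliers can be traced on a single toy vector. The values below are made up for illustration only.

```r
# Toy vector with one obvious outlier (made-up values)
x <- c(1, 2, 3, 4, 5, 100)

q1 <- quantile(x, 0.25)
q3 <- quantile(x, 0.75)
iqr <- q3 - q1

lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr

# Keep only values inside [lower_bound, upper_bound]
kept <- x[x >= lower_bound & x <= upper_bound]
kept
```

Here the quartiles are 2.25 and 4.75, so the bounds are -1.5 and 8.5 and only the value 100 is dropped.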
model_all_no_outliers <- lm(formula = price ~ .,
                            data = cleaned_data)

# Backward stepwise selection on the data without outliers
model_backward_no_outliers <- step(object = model_all_no_outliers,
                                   direction = "backward")
#> Start:  AIC=401868.6
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
#>     waterfront + grade + yr_built
#> 
#>               Df      Sum of Sq             RSS    AIC
#> <none>                          304184394291320 401869
#> - floors       1   329640802640 304514035093959 401885
#> - waterfront   1  1664232477644 305848626768963 401960
#> - bathrooms    1  2023059768457 306207454059776 401979
#> - bedrooms     1  2148014539480 306332408830800 401986
#> - sqft_lot     1  7675382696985 311859776988305 402291
#> - sqft_living  1 30057070464447 334241464755767 403471
#> - grade        1 67040310753357 371224705044677 405257
#> - yr_built     1 74176810323310 378361204614629 405581
summary(model_backward_no_outliers)
#> 
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + grade + yr_built, data = cleaned_data)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -624453  -89842   -5757   79306  822094 
#> 
#> Coefficients:
#>                 Estimate   Std. Error t value             Pr(>|t|)    
#> (Intercept) 5265061.0489   85897.3612  61.295 < 0.0000000000000002 ***
#> bedrooms     -18646.0557    1701.1148 -10.961 < 0.0000000000000002 ***
#> bathrooms     27507.7861    2585.9286  10.637 < 0.0000000000000002 ***
#> sqft_living     113.5838       2.7702  41.002 < 0.0000000000000002 ***
#> sqft_lot         -7.1611       0.3456 -20.720 < 0.0000000000000002 ***
#> floors        11231.0232    2615.5564   4.294            0.0000177 ***
#> waterfront1  249095.4190   25818.0831   9.648 < 0.0000000000000002 ***
#> grade        110332.3801    1801.7736  61.235 < 0.0000000000000002 ***
#> yr_built      -2942.9818      45.6897 -64.412 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 133700 on 17014 degrees of freedom
#> Multiple R-squared:  0.5157, Adjusted R-squared:  0.5154 
#> F-statistic:  2264 on 8 and 17014 DF,  p-value: < 0.00000000000000022