1 Business Problem

Sebagai seorang penjual properti, kita ingin membuat model yang mana dapat memprediksi harga properti berdasarkan beberapa informasi yang ada pada data.

Tentukan variabel:

  • target: price
  • prediktor: seluruh variabel terkecuali price

2 Data Wrangling & EDA

1. Read data house_data.csv

house <- read.csv("data_input/house_data.csv")
head(house)

2. Cek struktur data

glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
length(unique(house$waterfront))
#> [1] 2
unique(house$grade)
#>  [1]  7  6  8 11  9  5 10 12  4  3 13  1

💡 Hasil pemeriksaan struktur data: Ditemukan bahwa kita harus mengubah tipe data kategorik terbelih dahulu yang meliputi : - waterfront

3. Cleansing Data

# Ubah Tipe Data
house$waterfront <- as.factor(house$waterfront)


glimpse(house)
#> Rows: 21,613
#> Columns: 9
#> $ price       <int> 221900, 538000, 180000, 604000, 510000, 1225000, 257500, 2…
#> $ bedrooms    <int> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2, 3…
#> $ bathrooms   <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.50…
#> $ sqft_living <int> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 1890,…
#> $ sqft_lot    <int> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470, 6…
#> $ floors      <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1.0…
#> $ waterfront  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
#> $ grade       <int> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7, …
#> $ yr_built    <int> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 2003…
# Cek missing value
colSums(is.na(house))
#>       price    bedrooms   bathrooms sqft_living    sqft_lot      floors 
#>           0           0           0           0           0           0 
#>  waterfront       grade    yr_built 
#>           0           0           0

4. EDA

#persebaran data
ggplot(house, aes(x = sqft_living, y = price)) +
  geom_point() +
  geom_hline(yintercept = mean(house$price), color = "red", linetype = "dashed") +
  geom_vline(xintercept = mean(house$sqft_living), color = "blue", linetype = "dashed")

#korelasi
cor(x = house$sqft_living,
    y = house$price)
#> [1] 0.7020351
# cek mean house$price
mean(house$price)
#> [1] 540088.1

💡 Insight: - nilai korelasi 0.7020351 artinya Semakin besar luas rumah, maka harganya akan semakin tinggi

  • rata2 harga jual rumah ada di harga 540.088.1 dengan luas tanah perkiraan sebesar 2400 sqft_living

  • sqft_living memiliki korelasi yang kuat sebagai prediktor yaitu sebesar 0.7020351

3 Modeling

Buatlah 3 model berdasarkan feature selection yg telah dipelajari 1. model all predictor 2. model selection based on correlation (korelasi > 0.5) 3. model selection hasil stepwise (backward/forward/both)

4 MODEL BERDASARKAN KORELASI

# 1. Model_all predictor
model_all <- lm(formula = price ~ .,
                  data = house)


summary(model_all)
#> 
#> Call:
#> lm(formula = price ~ ., data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
# 2. Model_selection
model_sqft_living <- lm(formula = price ~ sqft_living,
                  data = house)

summary(model_sqft_living)
#> 
#> Call:
#> lm(formula = price ~ sqft_living, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1476062  -147486   -24043   106182  4362067 
#> 
#> Coefficients:
#>               Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) -43580.743   4402.690  -9.899 <0.0000000000000002 ***
#> sqft_living    280.624      1.936 144.920 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 261500 on 21611 degrees of freedom
#> Multiple R-squared:  0.4929, Adjusted R-squared:  0.4928 
#> F-statistic: 2.1e+04 on 1 and 21611 DF,  p-value: < 0.00000000000000022

💡 dari ringkasan model_sqft_living terdapat perolehan informasi:

  • nilai intercept b0 = -43580.743
  • nilai estimate untuk sales b1 = 280.624

💡 p-value prediktor pada model_sqft_living: 0.0000000000000002, variabel sqft_living signifkan mempengaruhi price.

4.1 Plot berdasarkan model

💡 melihat sebaran data yang dihasilkan oleh sqft_living ~ price dan mmebuat line plot berdasarkan model_sqft_living yang telah dibuat

plot(house$sqft_living, house$price)
abline(model_sqft_living, col = "red")

## Mencoba mmebuat model tanpa tanpa outlier untuk cek high leverage

# membuat filter tanpa outlier dengan subset sqft_living dibawah 6000 berdasarkan plot yang telah dibuat
house_no_outlier <- house[house$sqft_living <= 6000,]

# Mengambil keputusan outlier dibawah 6000 karena data sqft_living cenderung berkumpul di range angka perkiraan 100-6000. Lebih dari 6000 data sudah mulai berpecah dan cenderung tidak berkumpul satu sama lain dalam sebarannya

Buatlah model linear regression dengan data copiers_no_outlier:

# model dengan data tanpa outlier
model_no_outlier <- lm(formula = price ~ sqft_living,
                       data = house_no_outlier)


summary(model_no_outlier)
#> 
#> Call:
#> lm(formula = price ~ sqft_living, data = house_no_outlier)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -881194 -144906  -26189  101683 2380546 
#> 
#> Coefficients:
#>              Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept) -9067.419   4330.219  -2.094              0.0363 *  
#> sqft_living   262.820      1.933 135.960 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 247100 on 21543 degrees of freedom
#> Multiple R-squared:  0.4618, Adjusted R-squared:  0.4618 
#> F-statistic: 1.849e+04 on 1 and 21543 DF,  p-value: < 0.00000000000000022

💡 Ketika model_selection dilakukan sebelumnya r-squared memiliki angka di 0.4928 dan setelah outlier dibuang dengan subset < 6000, r-squared mengalami penurunan menjadi 0.4618

# Memasukan line plot untuk cek high leverage
plot(house$sqft_living, house$price)
abline(model_sqft_living, col = "red")
abline(model_no_outlier, col = "green")

💡 Terdapat banyak sekali high leverage yang dihasilkan saat nilai sqft_living sudah mencapai diangka lebih dari 6000

5 MODEL BERDASARKAN STEPWISE

# membuat model_none terlebih dahulu untuk membuat model stepwise
model_none <- lm(formula = price ~ 1,
                 data = house)
model_backward <- step(object = model_all,
                       direction = 'backward')
#> Start:  AIC=531533.8
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
#>     waterfront + grade + yr_built
#> 
#>               Df       Sum of Sq              RSS    AIC
#> <none>                           1035294741860252 531534
#> - floors       1   1218939726736 1036513681586988 531557
#> - sqft_lot     1   2030230655068 1037324972515320 531574
#> - bathrooms    1  10844095263314 1046138837123566 531757
#> - bedrooms     1  19802603600782 1055097345461034 531941
#> - waterfront   1  82402185733932 1117696927594184 533187
#> - sqft_living  1 140116227741713 1175410969601965 534275
#> - grade        1 172030644434158 1207325386294409 534854
#> - yr_built     1 183528097123653 1218822838983905 535059
model_forward <- step(object = model_none,
                       direction = 'forward',
                   scope = list(upper = model_all))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + sqft_living  1 1435640399598809 1477276362322490 539204
#> + grade        1 1297612620095470 1615304141825828 541134
#> + bathrooms    1  803293306497671 2109623455423628 546904
#> + bedrooms     1  276958595500073 2635958166421226 551718
#> + waterfront   1  206679237434408 2706237524486890 552287
#> + floors       1  192086763313773 2720829998607526 552403
#> + sqft_lot     1   23417141523777 2889499620397522 553703
#> + yr_built     1    8497693415832 2904419068505468 553815
#> <none>                            2912916761921299 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>              Df       Sum of Sq              RSS    AIC
#> + grade       1 121320543948745 1355955818373745 537353
#> + waterfront  1 110238185400763 1367038176921727 537529
#> + yr_built    1  92854405407200 1384421956915290 537802
#> + bedrooms    1  40635382190095 1436640980132395 538603
#> + sqft_lot    1   3011349102420 1474265013220070 539161
#> + floors      1    229913654973 1477046448667517 539202
#> + bathrooms   1    147193010785 1477129169311705 539203
#> <none>                          1477276362322490 539204
#> 
#> Step:  AIC=537353.4
#> price ~ sqft_living + grade
#> 
#>              Df       Sum of Sq              RSS    AIC
#> + yr_built    1 199227645099154 1156728173274590 533921
#> + waterfront  1 108953582123633 1247002236250112 535545
#> + bedrooms    1  22141328690666 1333814489683078 537000
#> + floors      1   9622247208765 1346333571164980 537202
#> + bathrooms   1   7651853153980 1348303965219765 537233
#> + sqft_lot    1   2020142096807 1353935676276938 537323
#> <none>                          1355955818373745 537353
#> 
#> Step:  AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#> 
#>              Df      Sum of Sq              RSS    AIC
#> + waterfront  1 90115805935988 1066612367338602 532170
#> + bedrooms    1 20121301261862 1136606872012728 533544
#> + bathrooms   1  8626538372689 1148101634901902 533761
#> + floors      1  4395842729126 1152332330545465 533841
#> + sqft_lot    1  1713565021968 1155014608252622 533891
#> <none>                         1156728173274590 533921
#> 
#> Step:  AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#> 
#>             Df      Sum of Sq              RSS    AIC
#> + bedrooms   1 13902067534638 1052710299803964 531888
#> + bathrooms  1  8476715048277 1058135652290326 531999
#> + floors     1  4061693690658 1062550673647944 532089
#> + sqft_lot   1  1826102349598 1064786264989004 532135
#> <none>                        1066612367338602 532170
#> 
#> Step:  AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#> 
#>             Df      Sum of Sq              RSS    AIC
#> + bathrooms  1 13953491780043 1038756808023922 531602
#> + floors     1  4176630866150 1048533668937814 531804
#> + sqft_lot   1  2870998480425 1049839301323540 531831
#> <none>                        1052710299803964 531888
#> 
#> Step:  AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms
#> 
#>            Df     Sum of Sq              RSS    AIC
#> + sqft_lot  1 2243126436934 1036513681586988 531557
#> + floors    1 1431835508601 1037324972515321 531574
#> <none>                      1038756808023922 531602
#> 
#> Step:  AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot
#> 
#>          Df     Sum of Sq              RSS    AIC
#> + floors  1 1218939726736 1035294741860252 531534
#> <none>                    1036513681586988 531557
#> 
#> Step:  AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot + floors
model_both <- step(object = model_none,
                       direction = 'both',
                   scope = list(upper = model_all))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + sqft_living  1 1435640399598809 1477276362322490 539204
#> + grade        1 1297612620095470 1615304141825828 541134
#> + bathrooms    1  803293306497671 2109623455423628 546904
#> + bedrooms     1  276958595500073 2635958166421226 551718
#> + waterfront   1  206679237434408 2706237524486890 552287
#> + floors       1  192086763313773 2720829998607526 552403
#> + sqft_lot     1   23417141523777 2889499620397522 553703
#> + yr_built     1    8497693415832 2904419068505468 553815
#> <none>                            2912916761921299 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>               Df        Sum of Sq              RSS    AIC
#> + grade        1  121320543948745 1355955818373745 537353
#> + waterfront   1  110238185400763 1367038176921727 537529
#> + yr_built     1   92854405407200 1384421956915290 537802
#> + bedrooms     1   40635382190095 1436640980132395 538603
#> + sqft_lot     1    3011349102420 1474265013220070 539161
#> + floors       1     229913654973 1477046448667517 539202
#> + bathrooms    1     147193010785 1477129169311705 539203
#> <none>                            1477276362322490 539204
#> - sqft_living  1 1435640399598809 2912916761921299 553876
#> 
#> Step:  AIC=537353.4
#> price ~ sqft_living + grade
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + yr_built     1 199227645099154 1156728173274590 533921
#> + waterfront   1 108953582123633 1247002236250112 535545
#> + bedrooms     1  22141328690666 1333814489683078 537000
#> + floors       1   9622247208765 1346333571164980 537202
#> + bathrooms    1   7651853153980 1348303965219765 537233
#> + sqft_lot     1   2020142096807 1353935676276938 537323
#> <none>                           1355955818373745 537353
#> - grade        1 121320543948745 1477276362322490 539204
#> - sqft_living  1 259348323452084 1615304141825828 541134
#> 
#> Step:  AIC=533920.9
#> price ~ sqft_living + grade + yr_built
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + waterfront   1  90115805935988 1066612367338602 532170
#> + bedrooms     1  20121301261862 1136606872012728 533544
#> + bathrooms    1   8626538372689 1148101634901902 533761
#> + floors       1   4395842729126 1152332330545465 533841
#> + sqft_lot     1   1713565021968 1155014608252622 533891
#> <none>                           1156728173274590 533921
#> - yr_built     1 199227645099154 1355955818373745 537353
#> - grade        1 227693783640699 1384421956915290 537802
#> - sqft_living  1 241311622806594 1398039796081185 538014
#> 
#> Step:  AIC=532169.9
#> price ~ sqft_living + grade + yr_built + waterfront
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bedrooms     1  13902067534638 1052710299803964 531888
#> + bathrooms    1   8476715048277 1058135652290326 531999
#> + floors       1   4061693690658 1062550673647944 532089
#> + sqft_lot     1   1826102349598 1064786264989004 532135
#> <none>                           1066612367338602 532170
#> - waterfront   1  90115805935988 1156728173274590 533921
#> - yr_built     1 180389868911509 1247002236250112 535545
#> - grade        1 219517939490357 1286130306828959 536213
#> - sqft_living  1 222939932856996 1289552300195599 536270
#> 
#> Step:  AIC=531888.3
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + bathrooms    1  13953491780043 1038756808023922 531602
#> + floors       1   4176630866150 1048533668937814 531804
#> + sqft_lot     1   2870998480425 1049839301323540 531831
#> <none>                           1052710299803964 531888
#> - bedrooms     1  13902067534638 1066612367338602 532170
#> - waterfront   1  83896572208764 1136606872012728 533544
#> - yr_built     1 179366335438934 1232076635242899 535287
#> - grade        1 198282580205801 1250992880009766 535616
#> - sqft_living  1 217755228757072 1270465528561036 535950
#> 
#> Step:  AIC=531602
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + sqft_lot     1   2243126436934 1036513681586988 531557
#> + floors       1   1431835508601 1037324972515321 531574
#> <none>                           1038756808023922 531602
#> - bathrooms    1  13953491780043 1052710299803964 531888
#> - bedrooms     1  19378844266404 1058135652290326 531999
#> - waterfront   1  82547202240690 1121304010264612 533253
#> - sqft_living  1 136922615063114 1175679423087036 534276
#> - grade        1 184390684043560 1223147492067482 535132
#> - yr_built     1 190012863248603 1228769671272525 535231
#> 
#> Step:  AIC=531557.2
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot
#> 
#>               Df       Sum of Sq              RSS    AIC
#> + floors       1   1218939726736 1035294741860252 531534
#> <none>                           1036513681586988 531557
#> - sqft_lot     1   2243126436934 1038756808023922 531602
#> - bathrooms    1  13325619736552 1049839301323540 531831
#> - bedrooms     1  20298780261196 1056812461848184 531974
#> - waterfront   1  82496530390614 1119010211977602 533210
#> - sqft_living  1 138930506698738 1175444188285726 534274
#> - grade        1 182635783415639 1219149465002627 535063
#> - yr_built     1 188639111042699 1225152792629688 535169
#> 
#> Step:  AIC=531533.8
#> price ~ sqft_living + grade + yr_built + waterfront + bedrooms + 
#>     bathrooms + sqft_lot + floors
#> 
#>               Df       Sum of Sq              RSS    AIC
#> <none>                           1035294741860252 531534
#> - floors       1   1218939726736 1036513681586988 531557
#> - sqft_lot     1   2030230655068 1037324972515321 531574
#> - bathrooms    1  10844095263314 1046138837123566 531757
#> - bedrooms     1  19802603600782 1055097345461034 531941
#> - waterfront   1  82402185733932 1117696927594184 533187
#> - sqft_living  1 140116227741712 1175410969601964 534275
#> - grade        1 172030644434158 1207325386294410 534854
#> - yr_built     1 183528097123652 1218822838983905 535059
# SUMMARY MODEL STEPWISE
summary(model_backward)
#> 
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + grade + yr_built, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
# SUMMARY MODEL STEPWISE
summary(model_forward)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront + 
#>     bedrooms + bathrooms + sqft_lot + floors, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022
# SUMMARY MODEL STEPWISE
summary(model_both)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront + 
#>     bedrooms + bathrooms + sqft_lot + floors, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022

Membandingkan nilai R-Squared dari model-model yang telah dibuat

# periksa nilai R-Squared untuk setiap model
summary(model_sqft_living)$r.squared
#> [1] 0.4928532
summary(model_all)$adj.r.squared
#> [1] 0.6444532
summary(model_backward)$adj.r.squared
#> [1] 0.6444532
summary(model_forward)$adj.r.squared
#> [1] 0.6444532
summary(model_both)$adj.r.squared
#> [1] 0.6444532

💡 Kesimpulan model : model menggunakan stepwise apapun hasilnya sama saja, karena setiap variable memang saling mempengaruhi.

# menambahkan hasil prediksi model_both stepwise ke kolom baru di data house
house$pred_both<- predict(object = model_both, newdata = house)
head(house)

💡 saya memilih model both saja, karena menurut saya lebih stabil untuk kasus ini dikarenakan model sudah mengujinya dengan membuang dan menambahkan variable untuk mencoba mendapatkan hasil yang terbaik dari step perhitungan modelnya.

#membuat dataframe berisi data aktual, hasil prediksi, dan error dari model_both sebagai perbandingan hasil prediksi
res <- data.frame(aktual = house$price, 
                  prediksi = model_both$fitted.values) %>% 
  mutate(error = prediksi - aktual)

head(res)

6 RMSE

library(MLmetrics)
# Menghitung RMSE dari res
RMSE(y_pred = res$prediksi,
    y_true = res$aktual)
#> [1] 218864.1
# cek range target variable
range(res$aktual)
#> [1]   75000 7700000

#membuat dataframe berisi data aktual, hasil prediksi, dan error
res2 <- data.frame(aktual = house$price, #mengambil nilai aktual inequality
                  prediksi = model_sqft_living$fitted.values) %>% #mengambil nilai prediksi price di model_both
  mutate(error = prediksi - aktual) #membuat kolom berisi hasil perhitungan nilai eror

head(res2)
# Menghitung RMSE dari res
RMSE(y_pred = res2$prediksi,
    y_true = res2$aktual)
#> [1] 261440.8

7 Evaluasi model

Berdasarkan ‘RMSE’ model regresi manakah yang terbaik?

REGRESSI model_both<<<

💡 Kesimpulan :

8 Interpretasi Model Terbaik:

💡 model_both Menjadi model terbaik dikarenakan saat memakai stepwise dengan mode lainnya, hasilnya sama saja, mendapatkan hasil yang terbilang cukup bagus dengan r-squared di angka 0.6444532 dan RMSE masih dapat diterima jika dilihat dari range datanya.

model_both dengan nilai RMSE : 218864.1, Dibandingkan model_selection yakni model_sqft_living dengan r-squared di angka 0.4928532 dan RMSE lebih besar dibandingkan model_both, yakni diangka 261440.8. Terdapat GAP RMSE yang lumayan jauh.

summary(model_both)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + grade + yr_built + waterfront + 
#>     bedrooms + bathrooms + sqft_lot + floors, data = house)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1384206  -112972   -10077    91060  4251811 
#> 
#> Coefficients:
#>                  Estimate    Std. Error t value             Pr(>|t|)    
#> (Intercept) 6999106.70657  121576.94670  57.569 < 0.0000000000000002 ***
#> sqft_living     177.91392       3.29026  54.073 < 0.0000000000000002 ***
#> grade        128813.92794    2149.93255  59.915 < 0.0000000000000002 ***
#> yr_built      -3963.73577      64.04988 -61.885 < 0.0000000000000002 ***
#> waterfront1  721804.73094   17406.65326  41.467 < 0.0000000000000002 ***
#> bedrooms     -41484.20936    2040.73489 -20.328 < 0.0000000000000002 ***
#> bathrooms     51710.08964    3437.50666  15.043 < 0.0000000000000002 ***
#> sqft_lot         -0.23947       0.03679  -6.509      0.0000000000774 ***
#> floors        17283.13337    3426.85939   5.043      0.0000004609553 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 218900 on 21604 degrees of freedom
#> Multiple R-squared:  0.6446, Adjusted R-squared:  0.6445 
#> F-statistic:  4898 on 8 and 21604 DF,  p-value: < 0.00000000000000022

1. Interpretasi coefficient untuk prediktor kategorik:

  • waterfront1 : 721804.73094, nilai price akan naik senilai 721804.73094 ketika nilai waterfront1 meningkat sebanyak 1 nilai waterfront1 dan variable lainnya tetap

2. Interpretasi coefficient untuk prediktor numerik:

  • intercept : 6999106.70657, artinya price akan bernilai 6999106.70657 dengan syarat semua variabel prediktornya bernilai 0

  • sqft_living : 177.91392, nilai price akan naik senilai 177.91392 ketika nilai meningkat sebanyak 1 sqft_living dan variable lainnya tetap

  • grade : 128813.92794, nilai price akan naik senilai 128813.92794 ketika nilai meningkat sebanyak 1 grade dan variable lainnya tetap

  • yr_built : -3963.73577, nilai price akan turun senilai -3963.73577 ketika nilai meningkat sebanyak 1 yr_built dan variable lainnya tetap

  • bedrooms : -41484.20936, nilai price akan turun senilai -41484.20936 ketika nilai meningkat sebanyak 1 bedrooms dan variable lainnya tetap

  • bathrooms : 51710.08964, nilai price akan naik senilai 51710.08964 ketika nilai meningkat sebanyak 1 bathrooms dan variable lainnya tetap

  • sqft_lot : -0.23947, nilai price akan turun senilai -0.23947 ketika nilai meningkat sebanyak 1 sqft_lot dan variable lainnya tetap

  • floors : 17283.13337, nilai price akan naik senilai 17283.13337 ketika nilai meningkat sebanyak 1 floors dan variable lainnya tetap

3. Signifikansi prediktor:

💡 p-value prediktor pada model_both: 0.0000000000000002, variabel lainnya saling signifkan mempengaruhi price dalam harga suatu rumah.

4. Adjusted R Squared:

💡 Adjusted R-squared :0.6445, Model dapat menjelaskan dengan cukup baik sebesar 64.50%

9 Kesimpulan : Alasan membawa data outlier

dengan hasil diatas, saya memutuskan untuk membawa data-data outlier karena dirasa variable-variable yang memiliki outlier ini penting bagi business question kita untuk memprediksi harga properti rumah(dilihat dari data outlier yang sangat banyak juga di plot sqft_living ~ price, tidak mungkin kita akan membuang outlier2 ini).

MAPE(y_pred = res$prediksi,
    y_true = res$aktual)*100
#> [1] 29.75172
MAPE(y_pred = res2$prediksi,
    y_true = res2$aktual)*100
#> [1] 35.99486

10 Hasil Akhir

💡 model_both mendapatkan hasil r-squared yang paling baik dari model lainnya dengan hasil model dapat menginterpretasikan cukup bagus sebesar 64.50% dengan error sebesar : 29.75% saja

dan mendapatkan hasil RMSE : 218864.1, dari range data 75.000 - 7.700.000 dengan RMSE 218864.1, hasil RMSE masih terbilang cukup baik.

LINK HASIL : https://rpubs.com/zdims17/Divedeeper_ML_RegressionModels