House Price Prediction in King County, USA

created by Reza Lutfi Ismail

Background

This project predicts house prices in King County, USA, using regression models. Regression is appropriate because the target (y), the sale price, is numeric. The data come from Kaggle: https://www.kaggle.com/harlfoxem/housesalesprediction

EDA

Import Data

# import libraries
library(tidyverse) # data wrangling
library(lubridate) # date handling
library(GGally) # correlation plot (ggcorr)
library(MLmetrics) # error metrics (RMSE, MSE)
library(lmtest) # Breusch-Pagan test (bptest)
library(car) # variance inflation factor (vif)
library(plotly) # interactive plots

# import data 
house <- read_csv("data_input/kc_house_data.csv")

Check Data Types

glimpse(house)
#> Rows: 21,613
#> Columns: 21
#> $ id            <chr> "7129300520", "6414100192", "5631500400", "2487200875...
#> $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015...
#> $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 2575...
#> $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4,...
#> $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00,...
#> $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, ...
#> $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 74...
#> $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0...
#> $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0,...
#> $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4,...
#> $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7...
#> $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, ...
#> $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, ...
#> $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960,...
#> $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
#> $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 9819...
#> $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561,...
#> $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -12...
#> $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780,...
#> $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 811...

Delete columns:

  • id
  • date
  • zipcode
  • grade

house_new <- house %>% 
  select(-c(id, date, zipcode, grade))

Data Processing

Linear Regression

Build linear regression models without a predictor and with a single predictor. The correlation plot below guides the choice of predictors.

ggcorr(house_new, label = T)

Model with no Predictor

# no predictor 
model <- lm(formula = price~1, data = house_new)
# check summary of model 
summary(model)
#> 
#> Call:
#> lm(formula = price ~ 1, data = house_new)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -465088 -218138  -90088  104912 7159912 
#> 
#> Coefficients:
#>             Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   540088       2497   216.3 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 367100 on 21612 degrees of freedom

Based on the summary above, we get the formula

\[y = b_0\]
\[\text{price} = 540088\]

When no predictor is included in the model, the intercept is simply the mean of the target:

mean(house_new$price)
#> [1] 540088.1
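
The equality between the intercept and the mean can be checked on any numeric vector. The sketch below uses simulated prices (the values and the name `price_sim` are made up for illustration, not taken from the King County data) and also shows that the residual standard error of an intercept-only model is the sample standard deviation.

```r
# Simulated prices (hypothetical values, not the King County data)
set.seed(1)
price_sim <- rnorm(100, mean = 500000, sd = 100000)

# Intercept-only model: price ~ 1
fit0 <- lm(price_sim ~ 1)

# The single coefficient is the sample mean of the target
intercept <- unname(coef(fit0)[1])
all.equal(intercept, mean(price_sim))

# The residual standard error is the sample standard deviation
all.equal(summary(fit0)$sigma, sd(price_sim))
```

This is why the intercept-only model is a useful baseline: any model with predictors should reduce the residual standard error below the plain standard deviation of price.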

Model with sqft_living predictor

# model sqft_living 
model_s_living <- lm(formula = price~sqft_living, data = house_new)

\[\text{price} = -43580.743 + 280.64 \times \text{sqft\_living}\]

plot(house_new$sqft_living, house_new$price)
abline(model_s_living, col = "red")

Model with sqft_above predictor

# model sqft_above
model_above <- lm(formula = price~sqft_above, data = house_new)

\[\text{price} = 59953.2 + 268.5 \times \text{sqft\_above}\]

plot(house_new$sqft_above, house_new$price)
abline(model_above, col = "blue")

Model with sqft_living15 predictor

model_l_15 <- lm(formula = price~sqft_living15, data = house_new)

\[\text{price} = -82807.195 + 313.556 \times \text{sqft\_living15}\]

plot(house_new$sqft_living15, house_new$price)
abline(model_l_15, col = "green")

summary(model_s_living)$r.squared
#> [1] 0.4928532
summary(model_above)$r.squared
#> [1] 0.3667118
summary(model_l_15)$r.squared
#> [1] 0.3426685

Of the three models above, the one with the sqft_living predictor fits best, with an R-squared of about 0.49 (versus 0.37 for sqft_above and 0.34 for sqft_living15). We therefore use sqft_living as the predictor for the single-predictor model.
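
The same comparison can be written as a loop over candidate predictors. This is a minimal sketch on the built-in `mtcars` data, which stands in for `house_new` here; the candidate list and variable names are illustrative assumptions. It also shows that, with a single predictor, R-squared is the squared correlation between predictor and target, which is why the correlation plot and the R-squared ranking agree.

```r
# mtcars ships with R; it stands in for house_new here
candidates <- c("wt", "hp", "disp")

# Fit one single-predictor model per candidate and collect R-squared
r2 <- sapply(candidates, function(p) {
  f <- reformulate(p, response = "mpg")  # builds mpg ~ <p>
  summary(lm(f, data = mtcars))$r.squared
})

# For one predictor, R-squared equals the squared correlation with the target
r2_from_cor <- sapply(candidates, function(p) cor(mtcars[[p]], mtcars$mpg)^2)

sort(r2, decreasing = TRUE)
best <- names(which.max(r2))
best
```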

Multiple Linear Regression

Model with Multiple Predictors

From the correlation plot (ggcorr), the predictors most strongly correlated with price are:

  • sqft_living
  • sqft_above
  • sqft_living15

# multiple predictor 
model_multiple <- lm(formula = price~sqft_living + sqft_above + sqft_living15, data = house_new)
summary(model_multiple)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + sqft_above + sqft_living15, 
#>     data = house_new)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1258201  -146093   -23313   106499  4597726 
#> 
#> Coefficients:
#>                 Estimate Std. Error t value            Pr(>|t|)    
#> (Intercept)   -99360.984   5411.285 -18.362 <0.0000000000000002 ***
#> sqft_living      267.633      4.260  62.821 <0.0000000000000002 ***
#> sqft_above       -37.340      4.535  -8.233 <0.0000000000000002 ***
#> sqft_living15     75.295      4.031  18.677 <0.0000000000000002 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 259300 on 21609 degrees of freedom
#> Multiple R-squared:  0.5013, Adjusted R-squared:  0.5013 
#> F-statistic:  7241 on 3 and 21609 DF,  p-value: < 0.00000000000000022

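We get the formula

\[\text{price} = -99360.984 + 267.633 \times \text{sqft\_living} - 37.340 \times \text{sqft\_above} + 75.295 \times \text{sqft\_living15}\]

From the summary, every predictor has a statistically significant coefficient (p-value < 0.05). The R-squared of 0.5013 is only slightly higher than the 0.4929 obtained with sqft_living alone, so the two extra predictors add little explanatory power.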
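
To make the formula concrete, the sketch below plugs one house into it by hand. The coefficients are copied from the summary above; the house itself (2000, 1500 and 1800 square feet) is a hypothetical example, not a row from the data.

```r
# Coefficients copied from summary(model_multiple)
b <- c(intercept = -99360.984, sqft_living = 267.633,
       sqft_above = -37.340, sqft_living15 = 75.295)

# Hypothetical house (illustrative values only)
x <- c(sqft_living = 2000, sqft_above = 1500, sqft_living15 = 1800)

# Intercept plus the sum of coefficient * value for each predictor
price_hat <- b[["intercept"]] + sum(b[names(x)] * x)
price_hat
```

The result is about 515,426. With the fitted object available, `predict(model_multiple, newdata = data.frame(sqft_living = 2000, sqft_above = 1500, sqft_living15 = 1800))` should give the same value up to rounding of the coefficients.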

Model with All Predictors

# all predictor 
model_all <- lm(formula = price~., data = house_new)
summary(model_all)
#> 
#> Call:
#> lm(formula = price ~ ., data = house_new)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1329723  -107135   -10257    83039  4161420 
#> 
#> Coefficients: (1 not defined because of singularities)
#>                      Estimate      Std. Error t value             Pr(>|t|)    
#> (Intercept)   -51820690.20472   1631235.72956 -31.768 < 0.0000000000000002 ***
#> bedrooms         -45061.50431      1973.54316 -22.833 < 0.0000000000000002 ***
#> bathrooms         53904.80545      3414.75936  15.786 < 0.0000000000000002 ***
#> sqft_living         183.14124         4.53440  40.389 < 0.0000000000000002 ***
#> sqft_lot              0.17084         0.05045   3.386             0.000710 ***
#> floors            14544.76601      3756.31925   3.872             0.000108 ***
#> waterfront       569227.68416     18273.35744  31.151 < 0.0000000000000002 ***
#> view              57215.00632      2236.84375  25.578 < 0.0000000000000002 ***
#> condition         34249.71009      2459.48103  13.926 < 0.0000000000000002 ***
#> sqft_above           61.30804         4.54229  13.497 < 0.0000000000000002 ***
#> sqft_basement              NA              NA      NA                   NA    
#> yr_built          -1713.19198        73.88326 -23.188 < 0.0000000000000002 ***
#> yr_renovated         27.64023         3.84548   7.188     0.00000000000068 ***
#> lat              637283.46123     10883.25939  58.556 < 0.0000000000000002 ***
#> long            -201913.59418     12385.91912 -16.302 < 0.0000000000000002 ***
#> sqft_living15        72.48159         3.45852  20.957 < 0.0000000000000002 ***
#> sqft_lot15           -0.48238         0.07712  -6.255     0.00000000040604 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 211900 on 21597 degrees of freedom
#> Multiple R-squared:  0.667,  Adjusted R-squared:  0.6668 
#> F-statistic:  2884 on 15 and 21597 DF,  p-value: < 0.00000000000000022
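
The NA row for sqft_basement ("1 not defined because of singularities") is what lm() reports when one predictor is an exact linear combination of others. A plausible explanation, which can be checked with `all(house$sqft_living == house$sqft_above + house$sqft_basement)`, is that living area is the sum of the above-ground and basement areas. The sketch below reproduces the behaviour on simulated data; all variable names and values are made up for illustration.

```r
set.seed(42)
n <- 200
above    <- round(runif(n, 500, 3000))
basement <- round(runif(n, 0, 1000))
living   <- above + basement            # exact linear combination
price    <- 100 * living + 50 * above + rnorm(n, sd = 1000)

fit <- lm(price ~ living + above + basement)

coef(fit)             # the basement coefficient is NA
alias(fit)$Complete   # expresses basement in terms of the other columns
```

Because the aliased column carries no extra information, dropping it leaves the fitted values unchanged.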

Stepwise

Backward

backward <- step(object = model_all, direction = "backward")
#> Start:  AIC=530138.4
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
#>     waterfront + view + condition + sqft_above + sqft_basement + 
#>     yr_built + yr_renovated + lat + long + sqft_living15 + sqft_lot15
#> 
#> 
#> Step:  AIC=530138.4
#> price ~ bedrooms + bathrooms + sqft_living + sqft_lot + floors + 
#>     waterfront + view + condition + sqft_above + yr_built + yr_renovated + 
#>     lat + long + sqft_living15 + sqft_lot15
#> 
#>                 Df       Sum of Sq              RSS    AIC
#> <none>                              969935989419937 530138
#> - sqft_lot       1    514943987546  970450933407484 530148
#> - floors         1    673345763894  970609335183832 530151
#> - sqft_lot15     1   1756900127612  971692889547549 530176
#> - yr_renovated   1   2320239916195  972256229336132 530188
#> - sqft_above     1   8181532252160  978117521672097 530318
#> - condition      1   8709164479210  978645153899148 530330
#> - bathrooms      1  11191399594868  981127389014805 530384
#> - long           1  11935051171834  981871040591772 530401
#> - sqft_living15  1  19725372733472  989661362153409 530572
#> - bedrooms       1  23413576500218  993349565920156 530652
#> - yr_built       1  24147354886282  994083344306220 530668
#> - view           1  29383142925346  999319132345283 530781
#> - waterfront     1  43579774079066 1013515763499003 531086
#> - sqft_living    1  73262717718286 1043198707138223 531710
#> - lat            1 153991582273224 1123927571693161 533321
summary(backward)
#> 
#> Call:
#> lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_lot + 
#>     floors + waterfront + view + condition + sqft_above + yr_built + 
#>     yr_renovated + lat + long + sqft_living15 + sqft_lot15, data = house_new)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1329723  -107135   -10257    83039  4161420 
#> 
#> Coefficients:
#>                      Estimate      Std. Error t value             Pr(>|t|)    
#> (Intercept)   -51820690.20472   1631235.72956 -31.768 < 0.0000000000000002 ***
#> bedrooms         -45061.50431      1973.54316 -22.833 < 0.0000000000000002 ***
#> bathrooms         53904.80545      3414.75936  15.786 < 0.0000000000000002 ***
#> sqft_living         183.14124         4.53440  40.389 < 0.0000000000000002 ***
#> sqft_lot              0.17084         0.05045   3.386             0.000710 ***
#> floors            14544.76601      3756.31925   3.872             0.000108 ***
#> waterfront       569227.68416     18273.35744  31.151 < 0.0000000000000002 ***
#> view              57215.00632      2236.84375  25.578 < 0.0000000000000002 ***
#> condition         34249.71009      2459.48103  13.926 < 0.0000000000000002 ***
#> sqft_above           61.30804         4.54229  13.497 < 0.0000000000000002 ***
#> yr_built          -1713.19198        73.88326 -23.188 < 0.0000000000000002 ***
#> yr_renovated         27.64023         3.84548   7.188     0.00000000000068 ***
#> lat              637283.46123     10883.25939  58.556 < 0.0000000000000002 ***
#> long            -201913.59418     12385.91912 -16.302 < 0.0000000000000002 ***
#> sqft_living15        72.48159         3.45852  20.957 < 0.0000000000000002 ***
#> sqft_lot15           -0.48238         0.07712  -6.255     0.00000000040604 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 211900 on 21597 degrees of freedom
#> Multiple R-squared:  0.667,  Adjusted R-squared:  0.6668 
#> F-statistic:  2884 on 15 and 21597 DF,  p-value: < 0.00000000000000022

Forward

forward <- step(object = model, direction = "forward", scope = list(lower = model, upper = model_all))
#> Start:  AIC=553875.8
#> price ~ 1
#> 
#>                 Df        Sum of Sq              RSS    AIC
#> + sqft_living    1 1435640399598809 1477276362322490 539204
#> + sqft_above     1 1068200811636164 1844715950285135 544004
#> + sqft_living15  1  998164703117973 1914752058803326 544810
#> + bathrooms      1  803293306497671 2109623455423628 546904
#> + view           1  459780944970999 2453135816950300 550165
#> + sqft_basement  1  305439174800922 2607477587120376 551484
#> + bedrooms       1  276958595500073 2635958166421226 551718
#> + lat            1  274545716008578 2638371045912720 551738
#> + waterfront     1  206679237434408 2706237524486890 552287
#> + floors         1  192086763313773 2720829998607526 552403
#> + yr_renovated   1   46564442910138 2866352319011162 553529
#> + sqft_lot       1   23417141523777 2889499620397522 553703
#> + sqft_lot15     1   19800647694735 2893116114226564 553730
#> + yr_built       1    8497693415832 2904419068505468 553815
#> + condition      1    3851399435633 2909065362485666 553849
#> + long           1    1362354570271 2911554407351028 553868
#> <none>                              2912916761921299 553876
#> 
#> Step:  AIC=539203.5
#> price ~ sqft_living
#> 
#>                 Df       Sum of Sq              RSS    AIC
#> + lat            1 213137924966058 1264138437356432 535838
#> + view           1 123620038415268 1353656323907222 537317
#> + waterfront     1 110238185400763 1367038176921727 537529
#> + yr_built       1  92854405407200 1384421956915290 537802
#> + long           1  66817275084005 1410459087238485 538205
#> + bedrooms       1  40635382190095 1436640980132395 538603
#> + yr_renovated   1  22404898015578 1454871464306912 538875
#> + sqft_living15  1  20108677275202 1457167685047288 538909
#> + condition      1  17605348260420 1459671014062070 538946
#> + sqft_lot15     1   6440740801824 1470835621520666 539111
#> + sqft_lot       1   3011349102420 1474265013220070 539161
#> + sqft_above     1   1216499294160 1476059863028330 539188
#> + sqft_basement  1   1216499294160 1476059863028330 539188
#> + floors         1    229913654973 1477046448667517 539202
#> + bathrooms      1    147193010785 1477129169311705 539203
#> <none>                             1477276362322490 539204
#> 
#> Step:  AIC=535838
#> price ~ sqft_living + lat
#> 
#>                 Df       Sum of Sq              RSS    AIC
#> + view           1 126630803103906 1137507634252527 533559
#> + waterfront     1 116457216515170 1147681220841262 533751
#> + yr_built       1  51903847813610 1212234589542822 534934
#> + long           1  36167023609974 1227971413746458 535213
#> + bedrooms       1  32254157675667 1231884279680765 535281
#> + condition      1  19095077947046 1245043359409386 535511
#> + yr_renovated   1  18896934952437 1245241502403995 535515
#> + sqft_living15  1  18324966511740 1245813470844692 535524
#> + sqft_lot15     1   1242925578859 1262895511777573 535819
#> <none>                             1264138437356432 535838
#> + sqft_lot       1    109125131288 1264029312225144 535838
#> + sqft_above     1    103865920728 1264034571435704 535838
#> + sqft_basement  1    103865920728 1264034571435704 535838
#> + bathrooms      1      2294249118 1264136143107314 535840
#> + floors         1        29322202 1264138408034231 535840
#> 
#> Step:  AIC=533558.7
#> price ~ sqft_living + lat + view
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + waterfront     1 48301384600438 1089206249652089 532623
#> + yr_built       1 29685166215947 1107822468036580 532989
#> + bedrooms       1 20105107128838 1117402527123689 533175
#> + long           1 18126102858463 1119381531394064 533214
#> + condition      1 13259272107760 1124248362144767 533307
#> + yr_renovated   1 11033349370024 1126474284882502 533350
#> + sqft_living15  1  9777260266168 1127730373986359 533374
#> + sqft_above     1  5649302765515 1131858331487012 533453
#> + sqft_basement  1  5649302765514 1131858331487012 533453
#> + sqft_lot15     1  1822194970820 1135685439281707 533526
#> + floors         1   790838989632 1136716795262894 533546
#> + sqft_lot       1   392067203463 1137115567049064 533553
#> + bathrooms      1   192700592756 1137314933659771 533557
#> <none>                            1137507634252527 533559
#> 
#> Step:  AIC=532622.9
#> price ~ sqft_living + lat + view + waterfront
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + yr_built       1 29367526832499 1059838722819591 532034
#> + bedrooms       1 17478610279237 1071727639372852 532275
#> + long           1 17464543757804 1071741705894285 532276
#> + condition      1 13417265900615 1075788983751474 532357
#> + sqft_living15  1 11169376379955 1078036873272134 532402
#> + yr_renovated   1  8586367580730 1080619882071360 532454
#> + sqft_above     1  4670597698794 1084535651953295 532532
#> + sqft_basement  1  4670597698794 1084535651953295 532532
#> + sqft_lot15     1  1861231101284 1087345018550805 532588
#> + floors         1   572161272649 1088634088379440 532614
#> + sqft_lot       1   316492359462 1088889757292628 532619
#> + bathrooms      1   234351649857 1088971898002233 532620
#> <none>                            1089206249652089 532623
#> 
#> Step:  AIC=532034.2
#> price ~ sqft_living + lat + view + waterfront + yr_built
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + bedrooms       1 20688901922724 1039149820896867 531610
#> + sqft_living15  1 18321662372237 1041517060447353 531659
#> + sqft_above     1 15037471026572 1044801251793019 531727
#> + sqft_basement  1 15037471026571 1044801251793019 531727
#> + floors         1 11793327955809 1048045394863782 531794
#> + bathrooms      1  9687073571957 1050151649247634 531838
#> + long           1  6492502179478 1053346220640113 531903
#> + condition      1  3277674240866 1056561048578725 531969
#> + yr_renovated   1  2732915035417 1057105807784173 531980
#> + sqft_lot15     1  1862994537472 1057975728282119 531998
#> + sqft_lot       1   415711282743 1059423011536848 532028
#> <none>                            1059838722819591 532034
#> 
#> Step:  AIC=531610.1
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + bathrooms      1 16549684599818 1022600136297049 531265
#> + sqft_living15  1 15977295635070 1023172525261797 531277
#> + sqft_above     1 12393157291328 1026756663605539 531353
#> + sqft_basement  1 12393157291328 1026756663605539 531353
#> + floors         1 11258034321641 1027891786575226 531377
#> + long           1  6897822190854 1032251998706014 531468
#> + condition      1  4437764244632 1034712056652235 531520
#> + sqft_lot15     1  3328710661702 1035821110235166 531543
#> + yr_renovated   1  2454309574146 1036695511322721 531561
#> + sqft_lot       1  1110404179464 1038039416717403 531589
#> <none>                            1039149820896867 531610
#> 
#> Step:  AIC=531265.2
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + sqft_living15  1 18350055736756 1004250080560294 530876
#> + sqft_above     1 13904485810531 1008695650486518 530971
#> + sqft_basement  1 13904485810531 1008695650486518 530971
#> + floors         1  5885883452763 1016714252844286 531142
#> + long           1  5034210071147 1017565926225902 531160
#> + condition      1  3931668236404 1018668468060645 531184
#> + sqft_lot15     1  2353625870920 1020246510426129 531217
#> + yr_renovated   1   950579379520 1021649556917529 531247
#> + sqft_lot       1   713608944368 1021886527352680 531252
#> <none>                            1022600136297049 531265
#> 
#> Step:  AIC=530875.8
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15
#> 
#>                 Df      Sum of Sq              RSS    AIC
#> + long           1 10805533494283  993444547066010 530644
#> + sqft_above     1  8499381288515  995750699271779 530694
#> + sqft_basement  1  8499381288515  995750699271779 530694
#> + floors         1  6586050836779  997664029723515 530736
#> + condition      1  4245848049028 1000004232511265 530786
#> + sqft_lot15     1  3215905185938 1001034175374355 530808
#> + yr_renovated   1  1227639962617 1003022440597676 530851
#> + sqft_lot       1   817312968376 1003432767591917 530860
#> <none>                            1004250080560294 530876
#> 
#> Step:  AIC=530644
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long
#> 
#>                 Df      Sum of Sq             RSS    AIC
#> + sqft_above     1 11430787297210 982013759768800 530396
#> + sqft_basement  1 11430787297210 982013759768800 530396
#> + floors         1  5113171922797 988331375143213 530534
#> + condition      1  4867275002312 988577272063699 530540
#> + yr_renovated   1  1510892433283 991933654632728 530613
#> + sqft_lot15     1  1242605733929 992201941332081 530619
#> <none>                            993444547066010 530644
#> + sqft_lot       1    54136120758 993390410945252 530645
#> 
#> Step:  AIC=530395.9
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above
#> 
#>                Df     Sum of Sq             RSS    AIC
#> + condition     1 7030802787234 974982956981566 530243
#> + sqft_lot15    1 1394571936760 980619187832040 530367
#> + yr_renovated  1 1154163783763 980859595985037 530372
#> + floors        1  589657336434 981424102432366 530385
#> + sqft_lot      1  123985408301 981889774360499 530395
#> <none>                          982013759768800 530396
#> 
#> Step:  AIC=530242.6
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above + condition
#> 
#>                Df     Sum of Sq             RSS    AIC
#> + yr_renovated  1 2427753578407 972555203403159 530191
#> + sqft_lot15    1 1448160668390 973534796313176 530212
#> + floors        1  944508155898 974038448825669 530224
#> + sqft_lot      1  114782670508 974868174311059 530242
#> <none>                          974982956981566 530243
#> 
#> Step:  AIC=530190.7
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above + condition + 
#>     yr_renovated
#> 
#>              Df     Sum of Sq             RSS    AIC
#> + sqft_lot15  1 1454398952337 971100804450822 530160
#> + floors      1  789954019031 971765249384128 530175
#> + sqft_lot    1  105333375183 972449870027976 530190
#> <none>                        972555203403159 530191
#> 
#> Step:  AIC=530160.3
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above + condition + 
#>     yr_renovated + sqft_lot15
#> 
#>            Df    Sum of Sq             RSS    AIC
#> + floors    1 649871043338 970450933407484 530148
#> + sqft_lot  1 491469266990 970609335183832 530151
#> <none>                     971100804450822 530160
#> 
#> Step:  AIC=530147.9
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above + condition + 
#>     yr_renovated + sqft_lot15 + floors
#> 
#>            Df    Sum of Sq             RSS    AIC
#> + sqft_lot  1 514943987546 969935989419938 530138
#> <none>                     970450933407484 530148
#> 
#> Step:  AIC=530138.4
#> price ~ sqft_living + lat + view + waterfront + yr_built + bedrooms + 
#>     bathrooms + sqft_living15 + long + sqft_above + condition + 
#>     yr_renovated + sqft_lot15 + floors + sqft_lot
#> 
#>        Df Sum of Sq             RSS    AIC
#> <none>              969935989419938 530138
summary(forward)
#> 
#> Call:
#> lm(formula = price ~ sqft_living + lat + view + waterfront + 
#>     yr_built + bedrooms + bathrooms + sqft_living15 + long + 
#>     sqft_above + condition + yr_renovated + sqft_lot15 + floors + 
#>     sqft_lot, data = house_new)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1329723  -107135   -10257    83039  4161420 
#> 
#> Coefficients:
#>                      Estimate      Std. Error t value             Pr(>|t|)    
#> (Intercept)   -51820690.20472   1631235.72956 -31.768 < 0.0000000000000002 ***
#> sqft_living         183.14124         4.53440  40.389 < 0.0000000000000002 ***
#> lat              637283.46123     10883.25939  58.556 < 0.0000000000000002 ***
#> view              57215.00632      2236.84375  25.578 < 0.0000000000000002 ***
#> waterfront       569227.68416     18273.35744  31.151 < 0.0000000000000002 ***
#> yr_built          -1713.19198        73.88326 -23.188 < 0.0000000000000002 ***
#> bedrooms         -45061.50431      1973.54316 -22.833 < 0.0000000000000002 ***
#> bathrooms         53904.80545      3414.75936  15.786 < 0.0000000000000002 ***
#> sqft_living15        72.48159         3.45852  20.957 < 0.0000000000000002 ***
#> long            -201913.59418     12385.91912 -16.302 < 0.0000000000000002 ***
#> sqft_above           61.30804         4.54229  13.497 < 0.0000000000000002 ***
#> condition         34249.71009      2459.48103  13.926 < 0.0000000000000002 ***
#> yr_renovated         27.64023         3.84548   7.188     0.00000000000068 ***
#> sqft_lot15           -0.48238         0.07712  -6.255     0.00000000040604 ***
#> floors            14544.76601      3756.31925   3.872             0.000108 ***
#> sqft_lot              0.17084         0.05045   3.386             0.000710 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 211900 on 21597 degrees of freedom
#> Multiple R-squared:  0.667,  Adjusted R-squared:  0.6668 
#> F-statistic:  2884 on 15 and 21597 DF,  p-value: < 0.00000000000000022
summary(backward)$adj.r.squared
#> [1] 0.6667911
summary(forward)$adj.r.squared
#> [1] 0.6667911

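Backward elimination and forward selection arrive at the same model: both keep the same 15 predictors and have an identical adjusted R-squared (0.6667911). Backward elimination removes nothing beyond the aliased sqft_basement column, so the result is the full model.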
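
The AIC values in the traces above can be reproduced by hand. For lm fits, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below applies that formula to two RSS values read from the forward trace; the helper name `aic_step` is an assumption introduced here.

```r
# AIC as used by step() for lm fits:
#   n * log(RSS / n) + 2 * k,  k = number of estimated coefficients
aic_step <- function(rss, n, k) n * log(rss / n) + 2 * k

# Values read from the forward-selection trace above
n <- 21613
aic_null   <- aic_step(rss = 2912916761921299, n = n, k = 1)  # price ~ 1
aic_living <- aic_step(rss = 1477276362322490, n = n, k = 2)  # price ~ sqft_living
c(aic_null, aic_living)

# Cross-check the formula against extractAIC() on a built-in dataset
fit <- lm(mpg ~ wt, data = mtcars)
check <- aic_step(sum(residuals(fit)^2), nrow(mtcars), length(coef(fit)))
all.equal(check, extractAIC(fit)[2])
```

The two values agree with the 553875.8 and 539203.5 shown in the trace. Each added predictor costs 2 AIC points, so a step is accepted only when it lowers n * log(RSS / n) by more than that.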

Predict

house_new$price_s_living <- predict(object = model_s_living, newdata = house_new)
house_new$priceBackward <-predict(object = backward, newdata = house_new)

RMSE and MSE

RMSE(y_pred = house_new$priceBackward, y_true = house_new$price)
#> [1] 211842.9
RMSE(y_pred = house_new$price_s_living, y_true = house_new$price)
#> [1] 261440.8
MSE(y_pred = house_new$priceBackward, y_true = house_new$price)
#> [1] 44877434388
MSE(y_pred = house_new$price_s_living, y_true = house_new$price)
#> [1] 68351286833

Based on the predictions and the RMSE and MSE values, the multiple linear regression model (backward) performs best, with lower error than the simple linear regression model (model_s_living). Both errors are computed on the same data used to fit the models, so they measure in-sample fit.
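
Both metrics are short enough to write out directly, which makes their relationship explicit. The sketch below defines them in base R (the helper names `mse` and `rmse` and the three-house example are illustrative assumptions) and confirms that the reported RMSE values are the square roots of the reported MSE values.

```r
mse  <- function(y_true, y_pred) mean((y_true - y_pred)^2)
rmse <- function(y_true, y_pred) sqrt(mse(y_true, y_pred))

# Tiny made-up example
y_true <- c(200000, 350000, 500000)
y_pred <- c(210000, 330000, 520000)
mse(y_true, y_pred)    # (1e4^2 + 2e4^2 + 2e4^2) / 3 = 3e8
rmse(y_true, y_pred)

# RMSE is the square root of MSE, so the reported values should agree
rmse_backward <- sqrt(44877434388)   # backward model MSE from above
rmse_living   <- sqrt(68351286833)   # sqft_living model MSE from above
c(rmse_backward, rmse_living)
```

RMSE is in the same unit as price (US dollars), which makes it easier to interpret than MSE: the backward model is off by roughly 212,000 dollars in a root-mean-square sense.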

Normality of Errors

hist(backward$residuals)
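
A histogram gives only a visual impression. One way to complement it is a Shapiro-Wilk test plus a Q-Q plot. The sketch below is an illustration on simulated, right-skewed values that stand in for `backward$residuals`; since shapiro.test() accepts at most 5000 observations, it tests a random subsample.

```r
set.seed(7)
# Simulated right-skewed "residuals" (hypothetical stand-in for backward$residuals)
res <- rexp(20000) - 1

# shapiro.test() accepts between 3 and 5000 values, so test a random subsample
res_sub <- sample(res, size = min(length(res), 5000))
sw <- shapiro.test(res_sub)
sw$p.value     # a small p-value indicates departure from normality

# A Q-Q plot shows where the departure occurs (heavy tails, skew)
qqnorm(res_sub)
qqline(res_sub, col = "red")
```

With samples this large the test rejects even mild departures, so the Q-Q plot is usually the more informative of the two.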

Homoscedasticity

plot(backward$fitted.values, backward$residuals)
abline(h=0, col = "red")

bptest(backward)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  backward
#> BP = 3117.3, df = 15, p-value < 0.00000000000000022

H0 (constant error variance, i.e. homoscedasticity) is rejected if the p-value < alpha (0.05). The p-value here is far below 0.05, so H0 is rejected: the residuals are heteroscedastic and the homoscedasticity assumption is not met.
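
The studentized Breusch-Pagan statistic can be reproduced without lmtest: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. The sketch below does this on simulated data whose error spread grows with x; the data and variable names are made up, and the formula is the studentized (Koenker) variant named in the output above.

```r
set.seed(123)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)      # error spread grows with x

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictors
aux <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared   # studentized (Koenker) BP statistic
bp_df   <- length(coef(aux)) - 1        # one df per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p.value = bp_p)
```

A large statistic means the squared residuals are predictable from the regressors, which is exactly what constant variance rules out. Hence a small p-value is evidence against homoscedasticity.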

No Multicollinearity (predictors are not strongly correlated with each other)

vif(backward)
#>      bedrooms     bathrooms   sqft_living      sqft_lot        floors 
#>      1.621296      3.328365      8.346160      2.101672      1.979885 
#>    waterfront          view     condition    sqft_above      yr_built 
#>      1.202782      1.413950      1.232683      6.808499      2.266450 
#>  yr_renovated           lat          long sqft_living15    sqft_lot15 
#>      1.148164      1.094365      1.464138      2.703973      2.133977

No VIF value exceeds 10, so the no-multicollinearity assumption is met. The largest values belong to sqft_living (8.35) and sqft_above (6.81).
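
For a model with only numeric predictors, the VIF of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. A VIF of 8.35 for sqft_living therefore means that about 88% of its variance is explained by the remaining predictors. The sketch below computes VIF by hand on the built-in `mtcars` data; the helper `vif_manual` is an illustrative assumption, not part of the car package.

```r
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    # R-squared from regressing predictor p on the remaining predictors
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

preds <- c("wt", "hp", "disp")
v <- vif_manual(mtcars, preds)
v

# Equivalent shortcut: diagonal of the inverse correlation matrix
v_inv <- diag(solve(cor(mtcars[, preds])))
```

The threshold of 10 is a rule of thumb; it corresponds to an R_j^2 of 0.9.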

Conclusion

The linear regression model obtained through stepwise selection identifies the following predictors of house price:

  • bedrooms
  • bathrooms
  • sqft_living
  • sqft_lot
  • floors
  • waterfront
  • view
  • condition
  • sqft_above
  • yr_built
  • yr_renovated
  • lat
  • long
  • sqft_living15
  • sqft_lot15

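The adjusted R-squared is 0.667. A likely reason it is not higher is the large number of outliers, which leave many observations far from the fitted line. Of the assumption checks, no-multicollinearity is satisfied (all VIF values are below 10), but the Breusch-Pagan test rejects homoscedasticity, and normality of the residuals was only inspected visually with a histogram. The model can still be used for prediction, but its standard errors and p-values should be treated with caution.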