Predicting House Prices with the Kaggle House Prices Dataset

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(STARTS)
## #* STARTS 1.2-35 (2019-11-04 15:18:11)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidyr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(carData)
library(lmtest)
library(caret)
## Loading required package: lattice

Data Preparation

houseprice<-read.csv("data_input/HousePrices_HalfMil.csv")
anyNA(houseprice)
## [1] FALSE
str(houseprice)
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

This dataset has 500,000 observations and 16 variables. Some of the columns are not clear enough to serve as meaningful predictors: Area and the marble colours (because they are all the same), and City and Fiber (which are not described on Kaggle).

boxplot(houseprice$Prices)

There are outliers in our target variable, Prices.
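The boxplot only shows outliers visually. As a rough sketch, they can also be counted with the same 1.5 * IQR rule the boxplot uses. The helper name `count_outliers` and the toy prices are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: count values drawn beyond the boxplot whiskers
# (the 1.5 * IQR rule used by boxplot()).
count_outliers <- function(x) {
  # boxplot.stats() returns the points beyond the whiskers in $out
  out <- boxplot.stats(x)$out
  c(n_outliers = length(out), share = length(out) / length(x))
}

# Toy prices, NOT taken from the dataset; the last value is deliberately extreme
toy_prices <- c(30000, 32000, 35000, 36000, 38000, 40000, 41000, 95000)
count_outliers(toy_prices)
```

On the real data the call would be `count_outliers(houseprice$Prices)`.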

house_price <- houseprice %>% 
  dplyr::select(-c(Area, Black.Marble, White.Marble, Indian.Marble, City, Fiber))
str(house_price)
## 'data.frame':    500000 obs. of  10 variables:
##  $ Garage      : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace   : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths       : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ Floors      : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ Solar       : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric    : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Glass.Doors : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden      : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices      : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

Exploratory Data Analysis

glimpse(house_price)
## Rows: 500,000
## Columns: 10
## $ Garage       <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2, ~
## $ FirePlace    <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3, ~
## $ Baths        <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5, ~
## $ Floors       <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, ~
## $ Solar        <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, ~
## $ Electric     <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, ~
## $ Glass.Doors  <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, ~
## $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, ~
## $ Garden       <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, ~
## $ Prices       <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, 2~

The columns are described below:

Garage = the number of garages
FirePlace = the number of fireplaces
Baths = the number of bathrooms
Floors = the number of floors in the house
Solar = 1 if the house has a solar system, else 0
Electric = 1 if the house has electrical work to be done, else 0
Glass.Doors = 1 if the house has glass doors, else 0
Swiming.Pool = 1 if the house has a swimming pool, else 0
Garden = 1 if the house has a garden, else 0

house_price[,c("Garage","FirePlace","Baths", "Floors","Solar","Electric","Glass.Doors","Swiming.Pool")] <-
  lapply(house_price[,c("Garage","FirePlace","Baths", "Floors","Solar","Electric","Glass.Doors","Swiming.Pool")], FUN = as.numeric)

# check whether there is any missing data

colSums(is.na(house_price))
##       Garage    FirePlace        Baths       Floors        Solar     Electric 
##            0            0            0            0            0            0 
##  Glass.Doors Swiming.Pool       Garden       Prices 
##            0            0            0            0

There are no missing values in the data.

Finding the correlation between Prices and the other variables

ggcorr(house_price, label = TRUE, label_size = 3, hjust = 1, layout.exp = 3)

The plot shows that Floors has the strongest positive correlation with Prices.
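The same ranking can be read off numerically instead of from the plot. The sketch below uses a made-up data frame `toy` in place of `house_price`, so the numbers it prints say nothing about the real data.

```r
# Toy data frame standing in for house_price (all values are simulated)
set.seed(42)
toy <- data.frame(
  Floors = rbinom(200, 1, 0.5),
  Garage = sample(1:3, 200, replace = TRUE)
)
toy$Prices <- 30000 + 15000 * toy$Floors + 1500 * toy$Garage +
  rnorm(200, sd = 5000)

# Pearson correlation of every predictor with the target, strongest first
predictors <- setdiff(names(toy), "Prices")
cors <- sapply(toy[predictors], function(col) cor(col, toy$Prices))
sort(cors, decreasing = TRUE)
```

Replacing `toy` with `house_price` would give the values that `ggcorr()` prints in the Prices row.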

3. Building the Linear Regression Model

We first fit a linear regression model with Floors as the only predictor, because it has the highest positive correlation with the target variable Prices.

rm <- lm(Prices ~ Floors, data = house_price) %>% print()
## 
## Call:
## lm(formula = Prices ~ Floors, data = house_price)
## 
## Coefficients:
## (Intercept)       Floors  
##       34558        15003
summary(rm)
## 
## Call:
## lm(formula = Prices ~ Floors, data = house_price)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27211  -6986    -11   6592  28542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34557.66      19.00    1819   <2e-16 ***
## Floors      15003.39      26.89     558   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9507 on 499998 degrees of freedom
## Multiple R-squared:  0.3837, Adjusted R-squared:  0.3837 
## F-statistic: 3.113e+05 on 1 and 499998 DF,  p-value: < 2.2e-16
plot(house_price$Floors, house_price$Prices)
abline(rm$coefficients[1],rm$coefficients[2], col = "red")

The multiple R-squared is 0.3837, so Floors alone explains about 38% of the variance in Prices.
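For reference, the multiple R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that definition against `summary()` on simulated data; the data frame `toy` is an assumption and does not reproduce the 0.3837 above.

```r
set.seed(1)
toy <- data.frame(Floors = rbinom(100, 1, 0.5))
toy$Prices <- 34500 + 15000 * toy$Floors + rnorm(100, sd = 9500)

fit <- lm(Prices ~ Floors, data = toy)

rss <- sum(residuals(fit)^2)                   # residual sum of squares
tss <- sum((toy$Prices - mean(toy$Prices))^2)  # total sum of squares
r2_manual <- 1 - rss / tss

# TRUE when the manual value matches the one reported by summary()
all.equal(r2_manual, summary(fit)$r.squared)
```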

Next, we let stepwise regression with backward elimination choose the predictor variables automatically.

rm1 <- lm(Prices ~ ., house_price)
step(rm1, direction = "backward")
## Start:  AIC=9094486
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric + 
##     Glass.Doors + Swiming.Pool + Garden
## 
##                Df  Sum of Sq        RSS     AIC
## - Garden        1 7.8791e+07 3.9657e+13 9094485
## - Swiming.Pool  1 1.1853e+08 3.9657e+13 9094486
## <none>                       3.9657e+13 9094486
## - Solar         1 7.5030e+09 3.9665e+13 9094579
## - Electric      1 1.9692e+11 3.9854e+13 9096961
## - FirePlace     1 5.7817e+11 4.0235e+13 9101721
## - Garage        1 7.5734e+11 4.0415e+13 9103943
## - Baths         1 1.5673e+12 4.1225e+13 9113864
## - Glass.Doors   1 2.4402e+12 4.2097e+13 9124341
## - Floors        1 2.8158e+13 6.7816e+13 9362744
## 
## Step:  AIC=9094485
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric + 
##     Glass.Doors + Swiming.Pool
## 
##                Df  Sum of Sq        RSS     AIC
## - Swiming.Pool  1 1.1849e+08 3.9657e+13 9094485
## <none>                       3.9657e+13 9094485
## - Solar         1 7.4966e+09 3.9665e+13 9094578
## - Electric      1 1.9693e+11 3.9854e+13 9096960
## - FirePlace     1 5.7817e+11 4.0236e+13 9101720
## - Garage        1 7.5733e+11 4.0415e+13 9103942
## - Baths         1 1.5673e+12 4.1225e+13 9113864
## - Glass.Doors   1 2.4403e+12 4.2098e+13 9124341
## - Floors        1 2.8158e+13 6.7816e+13 9362742
## 
## Step:  AIC=9094485
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric + 
##     Glass.Doors
## 
##               Df  Sum of Sq        RSS     AIC
## <none>                      3.9657e+13 9094485
## - Solar        1 7.4957e+09 3.9665e+13 9094577
## - Electric     1 1.9693e+11 3.9854e+13 9096960
## - FirePlace    1 5.7819e+11 4.0236e+13 9101720
## - Garage       1 7.5735e+11 4.0415e+13 9103941
## - Baths        1 1.5674e+12 4.1225e+13 9113864
## - Glass.Doors  1 2.4403e+12 4.2098e+13 9124341
## - Floors       1 2.8158e+13 6.7816e+13 9362741
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors, data = house_price)
## 
## Coefficients:
## (Intercept)       Garage    FirePlace        Baths       Floors        Solar  
##     23303.9       1506.4        760.5       1252.0      15009.0        244.9  
##    Electric  Glass.Doors  
##      1255.2       4418.5
summary(lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
    Electric + Glass.Doors, data = house_price))
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors, data = house_price)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18854.9  -6681.5    116.2   6000.0  20189.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23303.868     52.739 441.872   <2e-16 ***
## Garage       1506.411     15.416  97.716   <2e-16 ***
## FirePlace     760.496      8.907  85.380   <2e-16 ***
## Baths        1251.960      8.906 140.575   <2e-16 ***
## Floors      15008.981     25.190 595.831   <2e-16 ***
## Solar         244.881     25.190   9.721   <2e-16 ***
## Electric     1255.179     25.190  49.828   <2e-16 ***
## Glass.Doors  4418.470     25.190 175.406   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8906 on 499992 degrees of freedom
## Multiple R-squared:  0.4592, Adjusted R-squared:  0.4592 
## F-statistic: 6.065e+04 on 7 and 499992 DF,  p-value: < 2.2e-16

Storing the new regression model as an object

rm1 <- lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
    Electric + Glass.Doors, data = house_price) %>% print()
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors, data = house_price)
## 
## Coefficients:
## (Intercept)       Garage    FirePlace        Baths       Floors        Solar  
##     23303.9       1506.4        760.5       1252.0      15009.0        244.9  
##    Electric  Glass.Doors  
##      1255.2       4418.5

Stepwise regression returns the formula with the lowest AIC. The lower the AIC, the less information the model fails to capture, after a penalty for the number of parameters.
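The AIC values printed by `step()` come from `extractAIC()`, which for a linear model is `n * log(RSS / n) + 2 * p`, with `p` the number of estimated coefficients. The sketch below verifies this on simulated data; `toy` and its columns are assumptions for illustration.

```r
set.seed(2)
n <- 300
toy <- data.frame(Floors = rbinom(n, 1, 0.5), Garden = rbinom(n, 1, 0.5))
toy$Prices <- 23000 + 15000 * toy$Floors + rnorm(n, sd = 9000)

fit <- lm(Prices ~ Floors + Garden, data = toy)

rss <- sum(residuals(fit)^2)     # residual sum of squares
p   <- length(coef(fit))         # intercept + 2 slopes
aic_manual <- n * log(rss / n) + 2 * p

# extractAIC() returns c(edf, AIC); step() prints the second element
aic_step <- extractAIC(fit)[2]
all.equal(aic_manual, aic_step)
```

Plugging the full model's printed values into the same formula (n = 500000, RSS = 3.9657e+13, 10 coefficients) gives roughly 9.0945 million, in line with the starting AIC of 9094486 once the rounding of the printed RSS is taken into account.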

Compared with the initial model, which uses only Floors, the regression model with 7 predictor variables has an adjusted R-squared of 0.4592, higher than the initial model's 0.3837.

Candidate models:

Prices = 34557.66 + 15003.39(Floors)

Prices = 23303.9 + 1506.4(Garage) + 760.5(FirePlace) + 1252.0(Baths) + 15009.0(Floors) + 244.9(Solar) + 1255.2(Electric) + 4418.5(Glass.Doors)
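To make the second equation concrete, the sketch below applies the `rm1` coefficients reported by `summary()` to the first row of `house_price` as printed by `str()`. The vector names `coefs` and `row1` are assumptions for illustration.

```r
# Coefficients of rm1, copied from the summary() output
coefs <- c(
  "(Intercept)" = 23303.868, Garage = 1506.411, FirePlace = 760.496,
  Baths = 1251.960, Floors = 15008.981, Solar = 244.881,
  Electric = 1255.179, Glass.Doors = 4418.470
)

# First row of house_price; the leading 1 multiplies the intercept
row1 <- c(1, Garage = 2, FirePlace = 0, Baths = 2, Floors = 0,
          Solar = 1, Electric = 1, Glass.Doors = 1)

# A linear model's prediction is the sum of coefficient * input
pred <- sum(coefs * row1)
round(pred, 2)  # 34739.14
```

This agrees with the first fitted value that `predict(rm1, ...)` reports in the prediction section.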

4. Model Prediction & Error

We predict Prices from the predictor values and compare the results with the actual data we have.

library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
# predict Prices using model `rm`
predict_rm <- predict(rm, newdata = head(house_price,10))
predict(rm, newdata = head(house_price,10), interval = "confidence", level = 0.95)
##         fit      lwr      upr
## 1  34557.66 34520.41 34594.90
## 2  49561.05 49523.76 49598.34
## 3  34557.66 34520.41 34594.90
## 4  49561.05 49523.76 49598.34
## 5  49561.05 49523.76 49598.34
## 6  49561.05 49523.76 49598.34
## 7  34557.66 34520.41 34594.90
## 8  49561.05 49523.76 49598.34
## 9  49561.05 49523.76 49598.34
## 10 34557.66 34520.41 34594.90
# predict Prices using model `rm1`
predict_rm1 <- predict(rm1, newdata = head(house_price,10))
predict(rm1, newdata = head(house_price,10), interval = "confidence", level = 0.95)
##         fit      lwr      upr
## 1  34739.14 34671.55 34806.73
## 2  50751.98 50684.36 50819.60
## 3  34366.51 34298.96 34434.07
## 4  55294.02 55226.41 55361.64
## 5  50028.51 49954.36 50102.67
## 6  53287.92 53222.63 53353.21
## 7  32987.85 32913.78 33061.92
## 8  49011.78 48944.22 49079.33
## 9  42323.18 42249.14 42397.22
## 10 31583.99 31518.65 31649.33
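The comparison with the actual data can be summarised with an error metric. MLmetrics, loaded above, provides `RMSE()` and `MAE()`; the sketch below defines base-R equivalents (the helper names `rmse` and `mae` are assumptions) and applies them to the ten fitted values printed above and the first ten Prices from the `str()` output.

```r
# Hypothetical helpers with the usual definitions of the two error metrics
rmse <- function(y_pred, y_true) sqrt(mean((y_true - y_pred)^2))
mae  <- function(y_pred, y_true) mean(abs(y_true - y_pred))

# Actual Prices of the first 10 rows (from the str() output)
actual <- c(43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, 29575, 22300)

# Fitted values printed above for the same rows
pred_rm  <- c(34557.66, 49561.05, 34557.66, 49561.05, 49561.05,
              49561.05, 34557.66, 49561.05, 49561.05, 34557.66)
pred_rm1 <- c(34739.14, 50751.98, 34366.51, 55294.02, 50028.51,
              53287.92, 32987.85, 49011.78, 42323.18, 31583.99)

c(rmse_rm = rmse(pred_rm, actual), rmse_rm1 = rmse(pred_rm1, actual))
c(mae_rm  = mae(pred_rm, actual),  mae_rm1  = mae(pred_rm1, actual))
```

Ten rows are far too few to rank the models; a fairer comparison would compute the same metrics on all rows, or on a held-out test set.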

5. Model Evaluation

Normality

hist(rm$residuals)

ks.test(rm$residuals, "pnorm")
## Warning in ks.test(rm$residuals, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rm$residuals
## D = 0.50057, p-value < 2.2e-16
## alternative hypothesis: two-sided
hist(rm1$residuals)

ks.test(rm1$residuals, "pnorm")
## Warning in ks.test(rm1$residuals, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rm1$residuals
## D = 0.50507, p-value < 2.2e-16
## alternative hypothesis: two-sided
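One caveat when reading these results: `ks.test(x, "pnorm")` without further arguments compares the data against a standard normal with mean 0 and standard deviation 1. Residuals on the scale of thousands will be rejected by that comparison even if they are perfectly bell-shaped, and a D statistic close to 0.5 is what such a scale mismatch typically produces. The sketch below illustrates this on simulated residuals; it is an assumption-laden illustration, not a re-analysis of `rm` or `rm1`.

```r
set.seed(3)
# Simulated residuals: exactly normal, but with a standard deviation of 9000
res <- rnorm(1000, mean = 0, sd = 9000)

# Compared against N(0, 1): rejected although the data are normal
raw <- ks.test(res, "pnorm")

# Compared against a normal with the residuals' own mean and sd
scaled <- ks.test(res, "pnorm", mean = mean(res), sd = sd(res))

c(D_raw = unname(raw$statistic), D_scaled = unname(scaled$statistic))
```

For the models here the analogous call would be `ks.test(rm1$residuals, "pnorm", mean = mean(rm1$residuals), sd = sd(rm1$residuals))`. Because the mean and standard deviation are estimated from the same data, the resulting p-value is conservative and should be read together with the histogram.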

Heteroscedasticity

Scatter plot of the fitted values against the residuals

plot(rm$fitted.values, rm$residuals)
abline(h = 0, col = "red")

bptest(rm)
## 
##  studentized Breusch-Pagan test
## 
## data:  rm
## BP = 0.13344, df = 1, p-value = 0.7149
plot(rm1$fitted.values, rm1$residuals)
abline(h = 0, col = "red")

bptest(rm1)
## 
##  studentized Breusch-Pagan test
## 
## data:  rm1
## BP = 6.0737, df = 7, p-value = 0.5312

For both models the p-value is greater than 0.05, so we fail to reject H0 (constant residual variance). In other words, the residuals show no evidence of heteroscedasticity: no systematic pattern remains in their spread that the models have failed to capture.
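As a rough sketch of what the test does: the studentized (Koenker) Breusch-Pagan statistic regresses the squared residuals on the predictors and uses n times the R-squared of that auxiliary regression, compared against a chi-squared distribution with one degree of freedom per predictor. The simulated data below are heteroscedastic by construction and are an assumption for illustration.

```r
set.seed(4)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)  # noise grows with x: heteroscedastic by design

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor
aux <- lm(residuals(fit)^2 ~ x)

bp_stat <- n * summary(aux)$r.squared                   # LM statistic
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # one predictor -> df = 1
c(BP = bp_stat, p_value = bp_p)
```

Here the small p-value flags the changing variance, the opposite of what `bptest()` reports for `rm` and `rm1`.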

Variance Inflation Factor (Multicollinearity)

For model rm, which uses only one predictor variable, vif() cannot be used.

library (car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif(rm1)
##      Garage   FirePlace       Baths      Floors       Solar    Electric 
##    1.000023    1.000004    1.000019    1.000009    1.000014    1.000008 
## Glass.Doors 
##    1.000010

No value is equal to or greater than 10, so there is no multicollinearity: the predictor variables are effectively independent of one another.
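The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data; the helper `vif_manual` and the columns `x1` to `x3` are assumptions for illustration.

```r
# Hypothetical helper: VIF of each column from the R-squared of regressing
# it on all the other columns
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    f  <- reformulate(setdiff(names(df), v), response = v)
    r2 <- summary(lm(f, data = df))$r.squared
    1 / (1 - r2)
  })
}

set.seed(5)
toy <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
toy$x3 <- toy$x1 + rnorm(200, sd = 0.1)  # x3 is almost a copy of x1

round(vif_manual(toy), 2)
```

The nearly duplicated pair `x1` and `x3` gets a large VIF while `x2` stays near 1, which is the situation the values close to 1 for `rm1` rule out.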

6. Conclusion

To predict future prices, we can use the significant variables retained in model rm1: Garage, FirePlace, Baths, Floors, Solar, Electric and Glass.Doors.

The resulting model rm1 has an adjusted R-squared of 0.4592. The diagnostic tests above found no evidence of heteroscedasticity or multicollinearity, although the Kolmogorov-Smirnov test as run rejected normality of the residuals.