Predicting House Prices with the Kaggle House Prices Dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(STARTS)
## #* STARTS 1.2-35 (2019-11-04 15:18:11)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyr)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(carData)
library(lmtest)
library(caret)
## Loading required package: lattice
houseprice<-read.csv("data_input/HousePrices_HalfMil.csv")
anyNA(houseprice)
## [1] FALSE
str(houseprice)
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
This dataset contains 500,000 observations of 16 variables. Several columns are not documented clearly enough to serve as meaningful predictors: Area, the marble colour columns (White.Marble, Black.Marble, Indian.Marble), which all look alike, and City and Fiber, which are not described on Kaggle.
boxplot(houseprice$Prices)
The target variable Prices contains outliers.
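To see how many points a boxplot flags, one option is `boxplot.stats()`, which applies the same 1.5 × IQR whisker rule that `boxplot()` draws. The sketch below uses a small simulated price vector (`prices_demo`, a hypothetical stand-in for `houseprice$Prices`), so the counts it prints say nothing about the real data.

```r
# Simulated stand-in for houseprice$Prices (assumption: not the real data)
set.seed(42)
prices_demo <- c(rnorm(1000, mean = 42000, sd = 9000), 95000, 99000)

# boxplot.stats() returns the points beyond the 1.5 * IQR whiskers in $out
outliers <- boxplot.stats(prices_demo)$out

length(outliers)   # number of flagged observations
range(outliers)    # smallest and largest flagged value
```

Running the same two calls on `houseprice$Prices` would show how many observations sit outside the whiskers before deciding whether to keep them.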
house_price <- houseprice %>%
dplyr::select(-c(Area, Black.Marble, White.Marble, Indian.Marble, City, Fiber))
str(house_price)
## 'data.frame': 500000 obs. of 10 variables:
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool: int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
glimpse(house_price)
## Rows: 500,000
## Columns: 10
## $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2, ~
## $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3, ~
## $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5, ~
## $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, ~
## $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, ~
## $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, ~
## $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, ~
## $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, ~
## $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, ~
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, 2~
The columns are described below:
Garage = number of garages
FirePlace = number of fireplaces
Baths = number of bathrooms
Floors = number of floors in the house
Solar = 1 if the house has a solar system, else 0
Electric = 1 if the house has electrical work to be done, else 0
Glass.Doors = 1 if the house has glass doors, else 0
Swiming.Pool = 1 if the house has a swimming pool, else 0
Garden = 1 if the house has a garden, else 0
house_price[,c("Garage","FirePlace","Baths", "Floors","Solar","Electric","Glass.Doors","Swiming.Pool")] <-
lapply(house_price[,c("Garage","FirePlace","Baths", "Floors","Solar","Electric","Glass.Doors","Swiming.Pool")], FUN = as.numeric)
# check for missing values
colSums(is.na(house_price))
## Garage FirePlace Baths Floors Solar Electric
## 0 0 0 0 0 0
## Glass.Doors Swiming.Pool Garden Prices
## 0 0 0 0
There are no missing values in the data.
Finding the correlation between Prices and the other variables
ggcorr(house_price, label = TRUE, label_size = 3, hjust = 1, layout.exp = 3)
The plot shows that Floors has the strongest positive correlation with Prices.
We therefore build a linear regression model with Floors as the only predictor, since it has the highest positive correlation with the target variable Prices.
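The correlation plot can be cross-checked numerically. A minimal sketch, assuming a small simulated data frame `demo` (hypothetical, built so that `Floors` dominates the price) in place of `house_price`: `cor()` gives the correlation of every predictor with `Prices`, and sorting makes the strongest one easy to spot.

```r
# Simulated stand-in for house_price (assumption: not the real data)
set.seed(1)
n <- 2000
demo <- data.frame(
  Floors      = rbinom(n, 1, 0.5),
  Glass.Doors = rbinom(n, 1, 0.5),
  Garage      = sample(1:3, n, replace = TRUE)
)
demo$Prices <- 30000 + 15000 * demo$Floors + 4000 * demo$Glass.Doors +
  1500 * demo$Garage + rnorm(n, sd = 8000)

# Correlation of each predictor with the target, strongest first
predictors <- setdiff(names(demo), "Prices")
cor_with_price <- sort(cor(demo[predictors], demo$Prices)[, 1], decreasing = TRUE)
cor_with_price
```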
rm <- lm(Prices ~ Floors, data = house_price) %>% print()
##
## Call:
## lm(formula = Prices ~ Floors, data = house_price)
##
## Coefficients:
## (Intercept) Floors
## 34558 15003
summary(rm)
##
## Call:
## lm(formula = Prices ~ Floors, data = house_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27211 -6986 -11 6592 28542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34557.66 19.00 1819 <2e-16 ***
## Floors 15003.39 26.89 558 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9507 on 499998 degrees of freedom
## Multiple R-squared: 0.3837, Adjusted R-squared: 0.3837
## F-statistic: 3.113e+05 on 1 and 499998 DF, p-value: < 2.2e-16
plot(house_price$Floors, house_price$Prices)
abline(rm$coefficients[1],rm$coefficients[2], col = "red")
The multiple R-squared is 0.3837, so Floors alone explains about 38% of the variance in Prices.
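With a single predictor, the multiple R-squared equals the squared Pearson correlation between predictor and target, so 0.3837 corresponds to a correlation of about 0.62 between Floors and Prices. The sketch below checks that identity on simulated data (`floors_demo` and `prices_demo` are hypothetical stand-ins, not the real columns).

```r
# Simulated one-predictor data (assumption: not the real data)
set.seed(7)
floors_demo <- rbinom(500, 1, 0.5)
prices_demo <- 34500 + 15000 * floors_demo + rnorm(500, sd = 9500)

fit_demo <- lm(prices_demo ~ floors_demo)
r2_model <- summary(fit_demo)$r.squared       # R-squared from the model
r2_cor   <- cor(floors_demo, prices_demo)^2   # squared correlation

all.equal(r2_model, r2_cor)

# Correlation implied by the R-squared reported above
sqrt(0.3837)
```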
Next, we try selecting the predictors automatically using stepwise regression with backward elimination.
rm1 <- lm(Prices ~ ., house_price)
step(rm1, direction = "backward")
## Start: AIC=9094486
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric +
## Glass.Doors + Swiming.Pool + Garden
##
## Df Sum of Sq RSS AIC
## - Garden 1 7.8791e+07 3.9657e+13 9094485
## - Swiming.Pool 1 1.1853e+08 3.9657e+13 9094486
## <none> 3.9657e+13 9094486
## - Solar 1 7.5030e+09 3.9665e+13 9094579
## - Electric 1 1.9692e+11 3.9854e+13 9096961
## - FirePlace 1 5.7817e+11 4.0235e+13 9101721
## - Garage 1 7.5734e+11 4.0415e+13 9103943
## - Baths 1 1.5673e+12 4.1225e+13 9113864
## - Glass.Doors 1 2.4402e+12 4.2097e+13 9124341
## - Floors 1 2.8158e+13 6.7816e+13 9362744
##
## Step: AIC=9094485
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric +
## Glass.Doors + Swiming.Pool
##
## Df Sum of Sq RSS AIC
## - Swiming.Pool 1 1.1849e+08 3.9657e+13 9094485
## <none> 3.9657e+13 9094485
## - Solar 1 7.4966e+09 3.9665e+13 9094578
## - Electric 1 1.9693e+11 3.9854e+13 9096960
## - FirePlace 1 5.7817e+11 4.0236e+13 9101720
## - Garage 1 7.5733e+11 4.0415e+13 9103942
## - Baths 1 1.5673e+12 4.1225e+13 9113864
## - Glass.Doors 1 2.4403e+12 4.2098e+13 9124341
## - Floors 1 2.8158e+13 6.7816e+13 9362742
##
## Step: AIC=9094485
## Prices ~ Garage + FirePlace + Baths + Floors + Solar + Electric +
## Glass.Doors
##
## Df Sum of Sq RSS AIC
## <none> 3.9657e+13 9094485
## - Solar 1 7.4957e+09 3.9665e+13 9094577
## - Electric 1 1.9693e+11 3.9854e+13 9096960
## - FirePlace 1 5.7819e+11 4.0236e+13 9101720
## - Garage 1 7.5735e+11 4.0415e+13 9103941
## - Baths 1 1.5674e+12 4.1225e+13 9113864
## - Glass.Doors 1 2.4403e+12 4.2098e+13 9124341
## - Floors 1 2.8158e+13 6.7816e+13 9362741
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors, data = house_price)
##
## Coefficients:
## (Intercept) Garage FirePlace Baths Floors Solar
## 23303.9 1506.4 760.5 1252.0 15009.0 244.9
## Electric Glass.Doors
## 1255.2 4418.5
summary(lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
Electric + Glass.Doors, data = house_price))
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors, data = house_price)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18854.9 -6681.5 116.2 6000.0 20189.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23303.868 52.739 441.872 <2e-16 ***
## Garage 1506.411 15.416 97.716 <2e-16 ***
## FirePlace 760.496 8.907 85.380 <2e-16 ***
## Baths 1251.960 8.906 140.575 <2e-16 ***
## Floors 15008.981 25.190 595.831 <2e-16 ***
## Solar 244.881 25.190 9.721 <2e-16 ***
## Electric 1255.179 25.190 49.828 <2e-16 ***
## Glass.Doors 4418.470 25.190 175.406 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8906 on 499992 degrees of freedom
## Multiple R-squared: 0.4592, Adjusted R-squared: 0.4592
## F-statistic: 6.065e+04 on 7 and 499992 DF, p-value: < 2.2e-16
Store the new regression model as an object
rm1 <- lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
Electric + Glass.Doors, data = house_price) %>% print()
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors, data = house_price)
##
## Coefficients:
## (Intercept) Garage FirePlace Baths Floors Solar
## 23303.9 1506.4 760.5 1252.0 15009.0 244.9
## Electric Glass.Doors
## 1255.2 4418.5
Stepwise regression selects the formula with the lowest AIC. A lower AIC means that the model leaves less of the information in the observations uncaptured, after penalising for the number of parameters.
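The AIC column that `step()` prints comes from `extractAIC()`, which for a linear model is n · log(RSS / n) + 2 · edf, where edf counts the estimated coefficients. The sketch below checks that formula on a simulated model (`fit_demo` and its data are hypothetical) and then plugs in the rounded RSS from the first step table. Because that RSS is printed to only five digits, the last line reproduces the reported AIC only approximately.

```r
# Simulated model (assumption: not the real data)
set.seed(3)
demo <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
demo$y <- 1 + 2 * demo$x1 + rnorm(200)
fit_demo <- lm(y ~ x1 + x2, data = demo)

n   <- nrow(demo)
rss <- sum(residuals(fit_demo)^2)
edf <- length(coef(fit_demo))        # intercept + 2 slopes

aic_manual <- n * log(rss / n) + 2 * edf
aic_step   <- extractAIC(fit_demo)[2]   # the value step() reports
all.equal(aic_manual, aic_step)

# Same formula with the figures printed for the full model:
# n = 500000, RSS ~ 3.9657e13, 9 predictors + intercept = 10 coefficients
500000 * log(3.9657e13 / 500000) + 2 * 10
```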
Compared with the initial model, which uses only Floors, the regression model with 7 predictors has an adjusted R-squared of 0.4592, higher than the 0.3837 of the previous model.
Candidate models:
Prices = 34557.66 + 15003.39(Floors)
Prices = 23303.9 + 1506.4(Garage) + 760.5(FirePlace) + 1252.0(Baths) + 15009.0(Floors) + 244.9(Solar) + 1255.2(Electric) + 4418.5(Glass.Doors)
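As a check on the multi-predictor equation, the coefficients from the summary of the stepwise model can be applied by hand to the first row of the data (Garage = 2, FirePlace = 0, Baths = 2, Floors = 0, Solar = 1, Electric = 1, Glass.Doors = 1, as shown in the `str()` output). The vector names below are illustrative only.

```r
# Coefficients copied from the summary() output of the stepwise model
coefs <- c(Intercept = 23303.868, Garage = 1506.411, FirePlace = 760.496,
           Baths = 1251.960, Floors = 15008.981, Solar = 244.881,
           Electric = 1255.179, Glass.Doors = 4418.470)

# First observation of house_price, with a leading 1 for the intercept
row1 <- c(Intercept = 1, Garage = 2, FirePlace = 0, Baths = 2, Floors = 0,
          Solar = 1, Electric = 1, Glass.Doors = 1)

pred_row1 <- sum(coefs * row1)
round(pred_row1, 2)   # 34739.14
```

This agrees with the first fitted value that `predict()` reports for `rm1`.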
We now predict Prices from the predictor values so that the results can be compared with the actual data.
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
##
## MAE, RMSE
## The following object is masked from 'package:base':
##
## Recall
# predict Prices with model `rm`
predict_rm <- predict(rm, newdata = head(house_price,10))
predict(rm, newdata = head(house_price,10), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 34557.66 34520.41 34594.90
## 2 49561.05 49523.76 49598.34
## 3 34557.66 34520.41 34594.90
## 4 49561.05 49523.76 49598.34
## 5 49561.05 49523.76 49598.34
## 6 49561.05 49523.76 49598.34
## 7 34557.66 34520.41 34594.90
## 8 49561.05 49523.76 49598.34
## 9 49561.05 49523.76 49598.34
## 10 34557.66 34520.41 34594.90
# predict Prices with model `rm1`
predict_rm1 <- predict(rm1, newdata = head(house_price,10))
predict(rm1, newdata = head(house_price,10), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 34739.14 34671.55 34806.73
## 2 50751.98 50684.36 50819.60
## 3 34366.51 34298.96 34434.07
## 4 55294.02 55226.41 55361.64
## 5 50028.51 49954.36 50102.67
## 6 53287.92 53222.63 53353.21
## 7 32987.85 32913.78 33061.92
## 8 49011.78 48944.22 49079.33
## 9 42323.18 42249.14 42397.22
## 10 31583.99 31518.65 31649.33
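The fitted values can be compared with the actual prices using an error metric. A minimal sketch, using only the ten actual prices and the ten fitted values printed above: `rmse()` and `mae()` are small helper functions defined here for illustration (MLmetrics offers `RMSE()` and `MAE()` for the same purpose). Ten rows are far too few to rank the models reliably, so this only shows the mechanics.

```r
# First ten actual prices and the fitted values printed above
actual  <- c(43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, 29575, 22300)
fit_rm  <- c(34557.66, 49561.05, 34557.66, 49561.05, 49561.05,
             49561.05, 34557.66, 49561.05, 49561.05, 34557.66)
fit_rm1 <- c(34739.14, 50751.98, 34366.51, 55294.02, 50028.51,
             53287.92, 32987.85, 49011.78, 42323.18, 31583.99)

# Illustrative helpers: root mean squared error and mean absolute error
rmse <- function(pred, obs) sqrt(mean((obs - pred)^2))
mae  <- function(pred, obs) mean(abs(obs - pred))

errors <- rbind(
  rm  = c(RMSE = rmse(fit_rm, actual),  MAE = mae(fit_rm, actual)),
  rm1 = c(RMSE = rmse(fit_rm1, actual), MAE = mae(fit_rm1, actual))
)
round(errors)
```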
Normality
hist(rm$residuals)
ks.test(rm$residuals, "pnorm")
## Warning in ks.test(rm$residuals, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: rm$residuals
## D = 0.50057, p-value < 2.2e-16
## alternative hypothesis: two-sided
hist(rm1$residuals)
ks.test(rm1$residuals, "pnorm")
## Warning in ks.test(rm1$residuals, "pnorm"): ties should not be present for the
## Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: rm1$residuals
## D = 0.50507, p-value < 2.2e-16
## alternative hypothesis: two-sided
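A caveat on the tests above: `ks.test(x, "pnorm")` with no further arguments compares the sample against a standard normal distribution (mean 0, standard deviation 1). Residuals measured in price units have a far larger spread, so a D statistic close to 0.5 is expected whether or not they are bell-shaped. The sketch below illustrates this on simulated residuals (`res_demo`, hypothetical) and shows one possible alternative, passing the sample standard deviation on to `pnorm`. Since that parameter is estimated from the same data, the resulting p-value is only approximate.

```r
# Simulated, genuinely normal residuals on a price-like scale (assumption)
set.seed(11)
res_demo <- rnorm(5000, mean = 0, sd = 9000)

# Default: compared against N(0, 1), so the scale mismatch dominates
ks_default <- ks.test(res_demo, "pnorm")

# Extra arguments are passed to pnorm(): compare against N(0, sd(res_demo))
ks_scaled <- ks.test(res_demo, "pnorm", mean = 0, sd = sd(res_demo))

unname(ks_default$statistic)
unname(ks_scaled$statistic)
```

The histograms of the residuals remain a useful visual check alongside either form of the test.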
Heteroscedasticity: scatter plot of fitted values against residuals
plot(rm$fitted.values, rm$residuals)
abline(h = 0, col = "red")
bptest(rm)
##
## studentized Breusch-Pagan test
##
## data: rm
## BP = 0.13344, df = 1, p-value = 0.7149
plot(rm1$fitted.values, rm1$residuals)
abline(h = 0, col = "red")
bptest(rm1)
##
## studentized Breusch-Pagan test
##
## data: rm1
## BP = 6.0737, df = 7, p-value = 0.5312
For both models the p-value is above 0.05, so we fail to reject H0 (constant error variance). The residuals therefore show no evidence of heteroscedasticity: their spread does not change systematically with the fitted values.
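The statistic reported by `bptest()` in its default studentized form can be reproduced by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. Under H0 the result follows a chi-squared distribution with as many degrees of freedom as there are predictors. A minimal base-R sketch on simulated data (the helper `bp_manual()` and both data sets are hypothetical), contrasting constant and growing error variance:

```r
# Hand-rolled studentized Breusch-Pagan statistic (illustrative helper)
bp_manual <- function(fit) {
  X    <- model.matrix(fit)[, -1, drop = FALSE]  # predictors without intercept
  e2   <- residuals(fit)^2
  aux  <- lm(e2 ~ X)                             # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared    # n * R-squared
  df   <- ncol(X)
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Simulated data (assumption: not the real data)
set.seed(5)
x <- runif(1000, 1, 5)
y_const <- 10 + 3 * x + rnorm(1000, sd = 2)   # constant error variance
y_grow  <- 10 + 3 * x + rnorm(1000, sd = x)   # error spread grows with x

bp_const <- bp_manual(lm(y_const ~ x))
bp_grow  <- bp_manual(lm(y_grow ~ x))
rbind(bp_const, bp_grow)
```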
Variance Inflation Factor (Multicollinearity)
For model rm, which uses only one predictor, vif() cannot be used for the analysis.
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif(rm1)
## Garage FirePlace Baths Floors Solar Electric
## 1.000023 1.000004 1.000019 1.000009 1.000014 1.000008
## Glass.Doors
## 1.000010
No value is 10 or greater, so there is no multicollinearity among the predictors (they are effectively independent of one another).
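For reference, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. Values near 1 mean the predictor is almost uncorrelated with the rest. A minimal base-R sketch on simulated data (the helper `vif_manual()` and the data are hypothetical; `car::vif()` remains the tool used in this analysis):

```r
# Illustrative helper: VIF of each predictor from an auxiliary regression
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Simulated predictors (assumption: not the real data)
set.seed(9)
demo <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
demo$x3 <- demo$x1 + rnorm(500, sd = 0.1)   # nearly a copy of x1

vif_indep <- vif_manual(demo, c("x1", "x2"))        # independent predictors
vif_coll  <- vif_manual(demo, c("x1", "x2", "x3"))  # x1 and x3 are collinear
vif_indep
vif_coll
```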
To predict future prices, we can use the significant variables retained in rm1: Garage, FirePlace, Baths, Floors, Solar, Electric, and Glass.Doors.
The resulting model rm1 has an adjusted R-squared of 0.4592. The diagnostic tests above found no heteroscedasticity and no multicollinearity, although the Kolmogorov-Smirnov test rejected normality of the residuals.