The following is an analysis of the House Pricing data, intended to support building a Machine Learning model using Linear Regression.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.2 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
house <- read.csv("HousePrices_HalfMil.csv")
str(house)
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
Data Completeness Inspection
anyNA(house)
## [1] FALSE
There are no missing values in the data we will use.
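When anyNA() does return TRUE, a per-column count is usually the next thing to look at. The sketch below uses a small made-up data frame (toy is hypothetical, with one NA inserted on purpose) rather than the real house data.

```r
# Hypothetical toy data standing in for `house`; one NA is added deliberately
toy <- data.frame(Area = c(164, 84, NA), Garage = c(2, 2, 2))

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single TRUE/FALSE answer for the whole data frame, as used in the report
any_missing <- anyNA(toy)
print(any_missing)
```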
In this project, the target variable is Prices, because the model is expected to help determine house prices in the future. The target was chosen based on business needs.
Next, let us look at the correlations between variables to help us choose suitable predictors.
ggcorr(house, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)
The correlation results show that Floors has the strongest correlation with Prices. On its own, this does not seem to give a significant enough insight for our analysis, so information from outside the data is needed to help select the predictor variables.
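The ranking read off the correlation plot can also be obtained numerically. This is a minimal sketch on simulated data: the data frame toy and its noise level are assumptions, and the effect sizes 25 and 15000 are borrowed from the coefficients fitted later in this report only to make the example resemble the real data.

```r
set.seed(1)
n <- 1000

# Simulated stand-in for `house` with just two predictors (hypothetical data)
toy <- data.frame(
  Area   = sample(1:250, n, replace = TRUE),
  Floors = sample(0:1, n, replace = TRUE)
)
toy$Prices <- 1000 + 25 * toy$Area + 15000 * toy$Floors + rnorm(n, sd = 5000)

# Correlation of every predictor with the target, ranked by absolute value
cor_with_price <- cor(toy)[, "Prices"]
cor_with_price <- cor_with_price[names(cor_with_price) != "Prices"]
ranked <- sort(abs(cor_with_price), decreasing = TRUE)
print(round(ranked, 3))
```

On the real data, replacing toy with house should give the same ordering that ggcorr() displays in the Prices row.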
According to the article https://www.opendoor.com/articles/factors-that-influence-home-value, there are 7 important factors in determining property prices. The top two are neighborhood and location. For this analysis, we therefore decided to use the Area variable as the main predictor.
Next, we will build a model using the Prices and Area variables.
model_area <- lm(formula = Prices ~ Area,
data = house)
summary(model_area)
##
## Call:
## lm(formula = Prices ~ Area, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31686 -8430 -223 8581 33033
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.894e+04 3.399e+01 1145.4 <2e-16 ***
## Area 2.492e+01 2.359e-01 105.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11980 on 499998 degrees of freedom
## Multiple R-squared: 0.02182, Adjusted R-squared: 0.02182
## F-statistic: 1.115e+04 on 1 and 499998 DF, p-value: < 2.2e-16
house %>%
  slice_sample(n = 500) %>%
  ggplot(aes(x = Area, y = Prices)) +
  geom_point() +
  geom_abline(intercept = coef(model_area)[1], slope = coef(model_area)[2])
Based on this model, the Adjusted R-squared is 0.02182, meaning Area alone explains only about 2.2% of the variance in Prices.
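To make the R-squared figures concrete, the sketch below recomputes them by hand from the residual and total sums of squares and compares them with summary(). The data are simulated (toy, the noise level, and the coefficients are assumptions chosen to give a weak fit), so the numbers will not match the report.

```r
set.seed(42)
n <- 500

# Simulated weak relationship between Area and Prices (hypothetical data)
toy <- data.frame(Area = runif(n, 1, 250))
toy$Prices <- 39000 + 25 * toy$Area + rnorm(n, sd = 12000)

fit <- lm(Prices ~ Area, data = toy)

rss <- sum(residuals(fit)^2)                    # residual sum of squares
tss <- sum((toy$Prices - mean(toy$Prices))^2)   # total sum of squares
r2 <- 1 - rss / tss

# Adjusted R-squared penalises the number of predictors (one here)
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - 1 - 1)

print(c(manual = r2, from_summary = summary(fit)$r.squared))
print(c(manual = adj_r2, from_summary = summary(fit)$adj.r.squared))
```

With 500000 rows and a single predictor, the penalty is negligible, which is why the report shows identical Multiple and Adjusted R-squared values.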
Next, let us select the predictor variables automatically using step-wise regression with backward elimination.
step_model <- lm(Prices ~ .,
house)
step(step_model,direction = "backward")
## Start: AIC=-15476867
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Indian.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors + Swiming.Pool + Garden
## Warning: attempting model selection on an essentially perfect fit is nonsense
##
## Step: AIC=-15476867
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool + Garden
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Garden 1 0.0000e+00 0.0000e+00 -15476868
## - Swiming.Pool 1 0.0000e+00 0.0000e+00 -15476868
## <none> 0.0000e+00 -15476867
## - Solar 1 7.8122e+09 7.8122e+09 4828320
## - Electric 1 1.9531e+11 1.9531e+11 6437774
## - FirePlace 1 5.6234e+11 5.6234e+11 6966532
## - Garage 1 7.5091e+11 7.5091e+11 7111121
## - Baths 1 1.5625e+12 1.5625e+12 7477490
## - Area 1 1.6108e+12 1.6108e+12 7492711
## - Black.Marble 1 2.0844e+12 2.0844e+12 7621588
## - Glass.Doors 1 2.4752e+12 2.4752e+12 7707526
## - City 1 4.0803e+12 4.0803e+12 7957447
## - White.Marble 1 1.6349e+13 1.6349e+13 8651433
## - Fiber 1 1.7257e+13 1.7257e+13 8678471
## - Floors 1 2.8125e+13 2.8125e+13 8922680
##
## Step: AIC=-15476868
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors +
## Swiming.Pool
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## - Swiming.Pool 1 0.0000e+00 0.0000e+00 -15476869
## <none> 0.0000e+00 -15476868
## - Solar 1 7.8123e+09 7.8123e+09 4828327
## - Electric 1 1.9531e+11 1.9531e+11 6437772
## - FirePlace 1 5.6234e+11 5.6234e+11 6966530
## - Garage 1 7.5091e+11 7.5091e+11 7111119
## - Baths 1 1.5625e+12 1.5625e+12 7477490
## - Area 1 1.6108e+12 1.6108e+12 7492710
## - Black.Marble 1 2.0844e+12 2.0844e+12 7621587
## - Glass.Doors 1 2.4753e+12 2.4753e+12 7707530
## - City 1 4.0803e+12 4.0803e+12 7957446
## - White.Marble 1 1.6349e+13 1.6349e+13 8651432
## - Fiber 1 1.7257e+13 1.7257e+13 8678469
## - Floors 1 2.8125e+13 2.8125e+13 8922678
##
## Step: AIC=-15476869
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
## Floors + City + Solar + Electric + Fiber + Glass.Doors
## Warning: attempting model selection on an essentially perfect fit is nonsense
## Df Sum of Sq RSS AIC
## <none> 0.0000e+00 -15476869
## - Solar 1 7.8123e+09 7.8123e+09 4828325
## - Electric 1 1.9531e+11 1.9531e+11 6437770
## - FirePlace 1 5.6234e+11 5.6234e+11 6966529
## - Garage 1 7.5091e+11 7.5091e+11 7111118
## - Baths 1 1.5625e+12 1.5625e+12 7477490
## - Area 1 1.6108e+12 1.6108e+12 7492708
## - Black.Marble 1 2.0844e+12 2.0844e+12 7621585
## - Glass.Doors 1 2.4753e+12 2.4753e+12 7707528
## - City 1 4.0803e+12 4.0803e+12 7957444
## - White.Marble 1 1.6349e+13 1.6349e+13 8651431
## - Fiber 1 1.7257e+13 1.7257e+13 8678475
## - Floors 1 2.8125e+13 2.8125e+13 8922676
##
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
## Black.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors, data = house)
##
## Coefficients:
## (Intercept) Area Garage FirePlace Baths
## 1000 25 1500 750 1250
## White.Marble Black.Marble Floors City Solar
## 14000 5000 15000 3500 250
## Electric Fiber Glass.Doors
## 1250 11750 4450
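The AIC values printed by step() come from extractAIC(), which for a linear model is n * log(RSS / n) + 2 * edf. The sketch below reproduces that number by hand on simulated data (toy, x1, x2, and y are hypothetical names). It also suggests why the AIC above is so extreme: when RSS is nearly zero, log(RSS / n) becomes a very large negative number, which is the situation the "essentially perfect fit" warning refers to.

```r
set.seed(7)
n <- 200

# Simulated data with ordinary noise (hypothetical, not the house data)
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

rss <- sum(residuals(fit)^2)   # residual sum of squares
edf <- n - df.residual(fit)    # number of estimated coefficients

aic_manual <- n * log(rss / n) + 2 * edf
aic_step <- extractAIC(fit)[2] # the value step() compares between models

print(c(manual = aic_manual, extractAIC = aic_step))
```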
summary(step_model)
##
## Call:
## lm(formula = Prices ~ ., data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.343e-04 0.000e+00 0.000e+00 1.000e-09 3.600e-07
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+03 1.509e-09 6.628e+11 <2e-16 ***
## Area 2.500e+01 3.740e-12 6.684e+12 <2e-16 ***
## Garage 1.500e+03 3.287e-10 4.564e+12 <2e-16 ***
## FirePlace 7.500e+02 1.899e-10 3.949e+12 <2e-16 ***
## Baths 1.250e+03 1.899e-10 6.583e+12 <2e-16 ***
## White.Marble 1.400e+04 6.574e-10 2.129e+13 <2e-16 ***
## Black.Marble 5.000e+03 6.576e-10 7.603e+12 <2e-16 ***
## Indian.Marble NA NA NA NA
## Floors 1.500e+04 5.371e-10 2.793e+13 <2e-16 ***
## City 3.500e+03 3.290e-10 1.064e+13 <2e-16 ***
## Solar 2.500e+02 5.371e-10 4.655e+11 <2e-16 ***
## Electric 1.250e+03 5.371e-10 2.327e+12 <2e-16 ***
## Fiber 1.175e+04 5.371e-10 2.188e+13 <2e-16 ***
## Glass.Doors 4.450e+03 5.371e-10 8.286e+12 <2e-16 ***
## Swiming.Pool 5.414e-10 5.371e-10 1.008e+00 0.313
## Garden 5.386e-10 5.371e-10 1.003e+00 0.316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.899e-07 on 499985 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 2.2e-16
model_new <- lm(Prices ~.,
house)
summary(model_new)
##
## Call:
## lm(formula = Prices ~ ., data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.343e-04 0.000e+00 0.000e+00 1.000e-09 3.600e-07
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.000e+03 1.509e-09 6.628e+11 <2e-16 ***
## Area 2.500e+01 3.740e-12 6.684e+12 <2e-16 ***
## Garage 1.500e+03 3.287e-10 4.564e+12 <2e-16 ***
## FirePlace 7.500e+02 1.899e-10 3.949e+12 <2e-16 ***
## Baths 1.250e+03 1.899e-10 6.583e+12 <2e-16 ***
## White.Marble 1.400e+04 6.574e-10 2.129e+13 <2e-16 ***
## Black.Marble 5.000e+03 6.576e-10 7.603e+12 <2e-16 ***
## Indian.Marble NA NA NA NA
## Floors 1.500e+04 5.371e-10 2.793e+13 <2e-16 ***
## City 3.500e+03 3.290e-10 1.064e+13 <2e-16 ***
## Solar 2.500e+02 5.371e-10 4.655e+11 <2e-16 ***
## Electric 1.250e+03 5.371e-10 2.327e+12 <2e-16 ***
## Fiber 1.175e+04 5.371e-10 2.188e+13 <2e-16 ***
## Glass.Doors 4.450e+03 5.371e-10 8.286e+12 <2e-16 ***
## Swiming.Pool 5.414e-10 5.371e-10 1.008e+00 0.313
## Garden 5.386e-10 5.371e-10 1.003e+00 0.316
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.899e-07 on 499985 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 2.2e-16
The model that uses all variables as predictors shows a far better adjusted R-squared (1) than the Area-only model (0.02182); the step-wise result, which drops only Indian.Marble, Garden, and Swiming.Pool, fits equally well. For now, the candidate model we will use is therefore the one with all variables as predictors.
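An adjusted R-squared of exactly 1 suggests that Prices is a deterministic linear function of the predictors. As a quick check, the sketch below plugs the first two rows shown in the str(house) output into the coefficients reported by the step-wise model. Only those two rows are checked here; the variable names coefs, intercept, and first_rows are introduced for illustration.

```r
# Coefficients copied from the step-wise model output in this report
intercept <- 1000
coefs <- c(Area = 25, Garage = 1500, FirePlace = 750, Baths = 1250,
           White.Marble = 14000, Black.Marble = 5000, Floors = 15000,
           City = 3500, Solar = 250, Electric = 1250, Fiber = 11750,
           Glass.Doors = 4450)

# First two observations, copied from the str(house) output
first_rows <- data.frame(
  Area = c(164, 84), Garage = c(2, 2), FirePlace = c(0, 0), Baths = c(2, 4),
  White.Marble = c(0, 0), Black.Marble = c(1, 0), Floors = c(0, 1),
  City = c(3, 2), Solar = c(1, 0), Electric = c(1, 0), Fiber = c(1, 0),
  Glass.Doors = c(1, 1), Prices = c(43800, 37550)
)

# Intercept plus the weighted sum of the predictors for each row
reconstructed <- as.vector(
  intercept + as.matrix(first_rows[, names(coefs)]) %*% coefs
)

print(data.frame(actual = first_rows$Prices, reconstructed = reconstructed))
```

Both rows are reproduced exactly, which is consistent with a data set generated from a fixed formula rather than observed market prices.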
pred_step <- predict(step_model,
newdata = house,
interval = "prediction",
level = 0.95)
head(pred_step)
## fit lwr upr
## 1 43800 43800 43800
## 2 37550 37550 37550
## 3 49500 49500 49500
## 4 50075 50075 50075
## 5 52400 52400 52400
## 6 54300 54300 54300
plot(model_new, which = 1)
hist(model_new$residuals)
residuals(model_new)[sample(5000)] %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.0031139, p-value < 2.2e-16
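One detail of the call above is worth noting: sample(5000) returns a random permutation of the indices 1 to 5000, so the test is applied to the first 5000 residuals in shuffled order rather than to a random draw from all 500000. shapiro.test() accepts at most 5000 values, so some subsetting is required either way. The sketch below contrasts the two ways of indexing on a hypothetical residual vector res.

```r
set.seed(123)

# Hypothetical residual vector, longer than the 5000-value limit of shapiro.test()
res <- rnorm(20000)

idx_first <- sample(5000)              # permutation of 1..5000 only
idx_all <- sample(length(res), 5000)   # 5000 indices drawn from the whole vector

print(range(idx_first))
print(range(idx_all))

# Normality test on a random subsample of all residuals
sw <- shapiro.test(res[idx_all])
print(sw)
```

If the rows of the data are already in random order, the two approaches should behave similarly; drawing from the full vector simply removes that assumption.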
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(model_new)
##
## studentized Breusch-Pagan test
##
## data: model_new
## BP = 13.316, df = 14, p-value = 0.5018
The p-value (0.5018) is greater than 0.05, so we fail to reject H0: the error variance is constant (homoscedasticity).
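For intuition about what the test measures, the studentized Breusch-Pagan statistic can be sketched by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. The example uses simulated data whose noise grows with x (toy and its columns are hypothetical), so here the test should reject H0, unlike in the report. This is an illustration of the idea and should correspond to what bptest() reports by default, not a replacement for it.

```r
set.seed(99)
n <- 400

# Simulated data whose error spread increases with x (heteroscedastic on purpose)
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 2 + 3 * toy$x + rnorm(n, sd = toy$x)

fit <- lm(y ~ x, data = toy)

# Auxiliary regression: squared residuals explained by the same predictor
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ x, data = toy)

bp_stat <- n * summary(aux)$r.squared   # test statistic
bp_df <- length(coef(aux)) - 1          # number of predictors in the auxiliary model
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = bp_pvalue))
```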
#library(car)
#vif(model_new)
The model fails the multicollinearity check, which indicates high correlation among the predictor variables, making them difficult to interpret as distinct variables. This is consistent with the model summary, where the Indian.Marble coefficient is not defined because of a singularity.
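A likely source of the singularity is the three marble columns: in the ten rows shown by str(house), exactly one of White.Marble, Black.Marble, and Indian.Marble equals 1 in every row, so together they duplicate the intercept. Assuming that pattern holds for the whole data set, the sketch below reproduces the effect on simulated data (toy and marble are hypothetical) and uses alias() from base R to find the redundant column.

```r
set.seed(2024)
n <- 300

# Simulated one-hot marble columns: exactly one of the three is 1 in each row
marble <- sample(c("White", "Black", "Indian"), n, replace = TRUE)
toy <- data.frame(
  White.Marble  = as.integer(marble == "White"),
  Black.Marble  = as.integer(marble == "Black"),
  Indian.Marble = as.integer(marble == "Indian")
)
toy$Prices <- 1000 + 14000 * toy$White.Marble + 5000 * toy$Black.Marble +
  rnorm(n, sd = 100)

# Fitting all three dummies plus an intercept leaves one coefficient undefined
fit <- lm(Prices ~ ., data = toy)
print(coef(fit))

# alias() names the terms that are exact linear combinations of the others
aliased <- rownames(alias(fit)$Complete)
print(aliased)

# Dropping the redundant dummy gives a model with all coefficients defined
fit_reduced <- lm(Prices ~ . - Indian.Marble, data = toy)
print(coef(fit_reduced))
```

With the redundant column removed, Indian marble becomes the baseline category and the remaining marble coefficients are read as differences from it.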