Below is an analysis of the House Pricing data, which will be used to help build a Machine Learning model with the Linear Regression method.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.2     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.1     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
house <- read.csv("HousePrices_HalfMil.csv")

Exploratory Data Analysis

str(house)
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

Data Completeness Check

anyNA(house)
## [1] FALSE

There are no missing values in the data to be used.
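
anyNA() only reports whether any value is missing at all. When it returns TRUE, a per-column count is usually the next step. The sketch below illustrates this on a small made-up data frame; `toy` and its values are hypothetical and are not taken from the house data.

```r
# Hypothetical toy data frame with a few missing values (not the house data)
toy <- data.frame(
  Area   = c(164, NA, 190, 75),
  Garage = c(2, 2, NA, NA),
  Prices = c(43800, 37550, 49500, 50075)
)

# TRUE if at least one value anywhere in the data frame is missing
anyNA(toy)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column
```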

In this project, the target variable is Prices, because the model is expected to help determine house prices in the future. The target was chosen based on business needs.

Next, let us look at the correlation between variables to help us choose suitable predictors.

ggcorr(house, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)

The correlation plot shows that Floors has the strongest correlation with Prices. On its own, that does not seem to give enough insight for our analysis, so information from outside the data is needed to help choose the predictor variables.
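
To read the strongest correlations off as numbers rather than from the plot, each predictor's correlation with the target can be computed and sorted. The sketch below uses synthetic data: the data frame `toy` is hypothetical, and its column names and effect sizes (25 per unit of Area, 15000 for Floors) are only borrowed from this report for readability.

```r
set.seed(42)
n <- 1000

# Hypothetical synthetic data (assumption: not the real house data)
toy <- data.frame(
  Area   = sample(50:250, n, replace = TRUE),
  Floors = sample(0:1, n, replace = TRUE),
  Garden = sample(0:1, n, replace = TRUE)
)
toy$Prices <- 1000 + 25 * toy$Area + 15000 * toy$Floors + rnorm(n, sd = 500)

# Pearson correlation of every predictor with the target
predictors <- setdiff(names(toy), "Prices")
cor_with_target <- sapply(toy[predictors], function(x) cor(x, toy$Prices))

# Sort by absolute strength, strongest first
cor_sorted <- cor_with_target[order(abs(cor_with_target), decreasing = TRUE)]
round(cor_sorted, 3)
```

On the real data the same pattern would be applied to `house` with `Prices` as the target.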

According to the article https://www.opendoor.com/articles/factors-that-influence-home-value, there are 7 important factors in determining property prices. The top two are neighborhood and location. Therefore, for this analysis we decided to use the variable Area as the main predictor.

Next, we will try to build a model using the variables Prices and Area.

Model Building

model_area <- lm(formula = Prices ~ Area,
                 data = house)
summary(model_area)
## 
## Call:
## lm(formula = Prices ~ Area, data = house)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -31686  -8430   -223   8581  33033 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3.894e+04  3.399e+01  1145.4   <2e-16 ***
## Area        2.492e+01  2.359e-01   105.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11980 on 499998 degrees of freedom
## Multiple R-squared:  0.02182,    Adjusted R-squared:  0.02182 
## F-statistic: 1.115e+04 on 1 and 499998 DF,  p-value: < 2.2e-16
house %>% 
  slice_sample(n = 500) %>%
  ggplot(aes(x = Area, y = Prices)) + 
  geom_point() +
  geom_abline(aes(intercept = model_area$coefficients[1], slope = model_area$coefficients[2]))

Based on this model, the Adjusted R-squared is 0.02182, which means Area alone explains only about 2.2% of the variance in Prices.
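
To make the R-squared figure concrete, it can be recomputed by hand from the residual and total sums of squares. The sketch below does this on synthetic data; the variables `x` and `y` are hypothetical, and the intercept, slope, and noise level are assumptions chosen only to resemble the summary above.

```r
set.seed(1)
n <- 500

# Hypothetical synthetic data: weak linear signal, large noise
x <- runif(n, 0, 250)
y <- 39000 + 25 * x + rnorm(n, sd = 12000)
fit <- lm(y ~ x)

# R-squared = 1 - RSS / TSS
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r2 <- 1 - rss / tss

# Adjusted R-squared penalizes the number of predictors p
p <- 1
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With a single predictor, R-squared equals the squared correlation
c(r2 = r2, adj_r2 = adj_r2, cor_squared = cor(x, y)^2)
```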

Next, let us select the predictor variables automatically using stepwise regression with the backward elimination method.

step_model <- lm(Prices ~ .,
                 house)
step(step_model,direction = "backward")
## Start:  AIC=-15476867
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Indian.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors + Swiming.Pool + Garden
## Warning: attempting model selection on an essentially perfect fit is nonsense
## 
## Step:  AIC=-15476867
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool + Garden
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## - Garden        1 0.0000e+00 0.0000e+00 -15476868
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -15476868
## <none>                       0.0000e+00 -15476867
## - Solar         1 7.8122e+09 7.8122e+09   4828320
## - Electric      1 1.9531e+11 1.9531e+11   6437774
## - FirePlace     1 5.6234e+11 5.6234e+11   6966532
## - Garage        1 7.5091e+11 7.5091e+11   7111121
## - Baths         1 1.5625e+12 1.5625e+12   7477490
## - Area          1 1.6108e+12 1.6108e+12   7492711
## - Black.Marble  1 2.0844e+12 2.0844e+12   7621588
## - Glass.Doors   1 2.4752e+12 2.4752e+12   7707526
## - City          1 4.0803e+12 4.0803e+12   7957447
## - White.Marble  1 1.6349e+13 1.6349e+13   8651433
## - Fiber         1 1.7257e+13 1.7257e+13   8678471
## - Floors        1 2.8125e+13 2.8125e+13   8922680
## 
## Step:  AIC=-15476868
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors + 
##     Swiming.Pool
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## - Swiming.Pool  1 0.0000e+00 0.0000e+00 -15476869
## <none>                       0.0000e+00 -15476868
## - Solar         1 7.8123e+09 7.8123e+09   4828327
## - Electric      1 1.9531e+11 1.9531e+11   6437772
## - FirePlace     1 5.6234e+11 5.6234e+11   6966530
## - Garage        1 7.5091e+11 7.5091e+11   7111119
## - Baths         1 1.5625e+12 1.5625e+12   7477490
## - Area          1 1.6108e+12 1.6108e+12   7492710
## - Black.Marble  1 2.0844e+12 2.0844e+12   7621587
## - Glass.Doors   1 2.4753e+12 2.4753e+12   7707530
## - City          1 4.0803e+12 4.0803e+12   7957446
## - White.Marble  1 1.6349e+13 1.6349e+13   8651432
## - Fiber         1 1.7257e+13 1.7257e+13   8678469
## - Floors        1 2.8125e+13 2.8125e+13   8922678
## 
## Step:  AIC=-15476869
## Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble + 
##     Floors + City + Solar + Electric + Fiber + Glass.Doors
## Warning: attempting model selection on an essentially perfect fit is nonsense
##                Df  Sum of Sq        RSS       AIC
## <none>                       0.0000e+00 -15476869
## - Solar         1 7.8123e+09 7.8123e+09   4828325
## - Electric      1 1.9531e+11 1.9531e+11   6437770
## - FirePlace     1 5.6234e+11 5.6234e+11   6966529
## - Garage        1 7.5091e+11 7.5091e+11   7111118
## - Baths         1 1.5625e+12 1.5625e+12   7477490
## - Area          1 1.6108e+12 1.6108e+12   7492708
## - Black.Marble  1 2.0844e+12 2.0844e+12   7621585
## - Glass.Doors   1 2.4753e+12 2.4753e+12   7707528
## - City          1 4.0803e+12 4.0803e+12   7957444
## - White.Marble  1 1.6349e+13 1.6349e+13   8651431
## - Fiber         1 1.7257e+13 1.7257e+13   8678475
## - Floors        1 2.8125e+13 2.8125e+13   8922676
## 
## Call:
## lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble + 
##     Black.Marble + Floors + City + Solar + Electric + Fiber + 
##     Glass.Doors, data = house)
## 
## Coefficients:
##  (Intercept)          Area        Garage     FirePlace         Baths  
##         1000            25          1500           750          1250  
## White.Marble  Black.Marble        Floors          City         Solar  
##        14000          5000         15000          3500           250  
##     Electric         Fiber   Glass.Doors  
##         1250         11750          4450
summary(step_model)
## 
## Call:
## lm(formula = Prices ~ ., data = house)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.343e-04  0.000e+00  0.000e+00  1.000e-09  3.600e-07 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)   1.000e+03  1.509e-09 6.628e+11   <2e-16 ***
## Area          2.500e+01  3.740e-12 6.684e+12   <2e-16 ***
## Garage        1.500e+03  3.287e-10 4.564e+12   <2e-16 ***
## FirePlace     7.500e+02  1.899e-10 3.949e+12   <2e-16 ***
## Baths         1.250e+03  1.899e-10 6.583e+12   <2e-16 ***
## White.Marble  1.400e+04  6.574e-10 2.129e+13   <2e-16 ***
## Black.Marble  5.000e+03  6.576e-10 7.603e+12   <2e-16 ***
## Indian.Marble        NA         NA        NA       NA    
## Floors        1.500e+04  5.371e-10 2.793e+13   <2e-16 ***
## City          3.500e+03  3.290e-10 1.064e+13   <2e-16 ***
## Solar         2.500e+02  5.371e-10 4.655e+11   <2e-16 ***
## Electric      1.250e+03  5.371e-10 2.327e+12   <2e-16 ***
## Fiber         1.175e+04  5.371e-10 2.188e+13   <2e-16 ***
## Glass.Doors   4.450e+03  5.371e-10 8.286e+12   <2e-16 ***
## Swiming.Pool  5.414e-10  5.371e-10 1.008e+00    0.313    
## Garden        5.386e-10  5.371e-10 1.003e+00    0.316    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.899e-07 on 499985 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 2.2e-16
model_new <- lm(Prices ~.,
                house)
summary(model_new)
## 
## Call:
## lm(formula = Prices ~ ., data = house)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -1.343e-04  0.000e+00  0.000e+00  1.000e-09  3.600e-07 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)   1.000e+03  1.509e-09 6.628e+11   <2e-16 ***
## Area          2.500e+01  3.740e-12 6.684e+12   <2e-16 ***
## Garage        1.500e+03  3.287e-10 4.564e+12   <2e-16 ***
## FirePlace     7.500e+02  1.899e-10 3.949e+12   <2e-16 ***
## Baths         1.250e+03  1.899e-10 6.583e+12   <2e-16 ***
## White.Marble  1.400e+04  6.574e-10 2.129e+13   <2e-16 ***
## Black.Marble  5.000e+03  6.576e-10 7.603e+12   <2e-16 ***
## Indian.Marble        NA         NA        NA       NA    
## Floors        1.500e+04  5.371e-10 2.793e+13   <2e-16 ***
## City          3.500e+03  3.290e-10 1.064e+13   <2e-16 ***
## Solar         2.500e+02  5.371e-10 4.655e+11   <2e-16 ***
## Electric      1.250e+03  5.371e-10 2.327e+12   <2e-16 ***
## Fiber         1.175e+04  5.371e-10 2.188e+13   <2e-16 ***
## Glass.Doors   4.450e+03  5.371e-10 8.286e+12   <2e-16 ***
## Swiming.Pool  5.414e-10  5.371e-10 1.008e+00    0.313    
## Garden        5.386e-10  5.371e-10 1.003e+00    0.316    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.899e-07 on 499985 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.453e+26 on 14 and 499985 DF,  p-value: < 2.2e-16

The model that uses all variables as predictors has a far higher adjusted R-squared (1) than the Area-only model (0.02182). The backward elimination itself dropped Indian.Marble, Swiming.Pool, and Garden, but its result was not stored, so step_model and model_new are both the full model. For now, the candidate model is therefore the one that uses all variables as predictors.
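
If the reduced model is the one to carry forward, the return value of step() has to be assigned to a variable. The sketch below shows that pattern on synthetic data; `toy`, `full_model`, and `selected_model` are hypothetical names, and the data are not the house data.

```r
set.seed(123)
n <- 300

# Hypothetical synthetic data: two real predictors and one pure-noise column
toy <- data.frame(
  x1    = rnorm(n),
  x2    = rnorm(n),
  noise = rnorm(n)
)
toy$y <- 3 + 2 * toy$x1 - 1.5 * toy$x2 + rnorm(n)

full_model <- lm(y ~ ., data = toy)

# Store the selected model; trace = 0 suppresses the step-by-step log
selected_model <- step(full_model, direction = "backward", trace = 0)

formula(selected_model)
summary(selected_model)$adj.r.squared
```

The stored object is an ordinary lm fit, so summary(), predict(), and the diagnostic plots can be applied to it directly.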

Prediction

pred_step <- predict(step_model, 
                     newdata = house, 
                     interval = "prediction",
                     level = 0.95)

head(pred_step)
##     fit   lwr   upr
## 1 43800 43800 43800
## 2 37550 37550 37550
## 3 49500 49500 49500
## 4 50075 50075 50075
## 5 52400 52400 52400
## 6 54300 54300 54300

Assumption Tests

1. Linearity

plot(model_new, which = 1)

2. Normality of Residuals

hist(model_new$residuals)

3. Statistical Test with shapiro.test()

# Note: sample(5000) permutes 1:5000, so only the first 5,000 residuals are tested
model_new$residuals[sample(5000)] %>% shapiro.test()
## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.0031139, p-value < 2.2e-16
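
shapiro.test() accepts at most 5,000 values, so a model with 500,000 residuals has to be subsampled. A random subsample drawn from all positions is probably closer to the intent than the first 5,000 residuals. The sketch below shows one way to draw it; `res` is a hypothetical stand-in for the model residuals, filled with simulated normal values.

```r
set.seed(2024)

# Hypothetical stand-in for the residual vector of a large model
res <- rnorm(500000)

# Draw 5,000 positions at random from the full range of indices
idx <- sample(length(res), size = 5000)
res_sample <- res[idx]

# Shapiro-Wilk test on the random subsample
sw <- shapiro.test(res_sample)
sw$statistic
sw$p.value
```

A p-value below 0.05 rejects the hypothesis that the residuals are normally distributed, which is the case for the test output above.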
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(model_new)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_new
## BP = 13.316, df = 14, p-value = 0.5018

The p-value (0.5018) is greater than 0.05, so we fail to reject H0: the errors have constant variance (homoscedasticity).
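
To show what the test measures, the statistic can be rebuilt by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. To my understanding this matches the studentized (Koenker) variant reported above, but treat that as an assumption. The data below are synthetic and deliberately heteroscedastic; all names are hypothetical.

```r
set.seed(7)
n <- 2000

# Hypothetical data whose error standard deviation grows with x
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)
fit <- lm(y ~ x)

# Auxiliary regression of the squared residuals on the predictors
aux <- lm(residuals(fit)^2 ~ x)

# Test statistic: n * R-squared of the auxiliary regression,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_pvalue <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

c(BP = bp_stat, df = bp_df, p_value = bp_pvalue)
```

Here the small p-value rejects H0, the opposite of the result for model_new.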

4. Multicollinearity Test

#library(car)
#vif(model_new)

The model fails the multicollinearity test. As the model summary shows, the Indian.Marble coefficient is not defined because of singularities, which indicates very high correlation among the predictor variables, so they are hard to interpret as distinct variables.
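
In the first ten rows printed by str(), exactly one of the three marble columns equals 1 in every row. If that holds for the whole data set, the three columns always sum to 1 and duplicate the intercept, which would explain the singularity. The sketch below reproduces that situation on synthetic data, finds the undefined coefficient, and then computes variance inflation factors by hand as 1 / (1 - R²) for the remaining predictors. All names and values are hypothetical.

```r
set.seed(99)
n <- 300

# Hypothetical data: three dummy columns that always sum to 1
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  Area   = runif(n, 50, 250),
  White  = as.integer(marble == "white"),
  Black  = as.integer(marble == "black"),
  Indian = as.integer(marble == "indian")
)
toy$Prices <- 1000 + 25 * toy$Area + 14000 * toy$White +
  5000 * toy$Black + rnorm(n, sd = 100)

# The last linearly dependent column gets an NA coefficient
fit_all <- lm(Prices ~ Area + White + Black + Indian, data = toy)
aliased <- names(coef(fit_all))[is.na(coef(fit_all))]
aliased

# Drop the aliased column, then regress each predictor on the others
keep <- setdiff(c("Area", "White", "Black", "Indian"), aliased)
vif_manual <- sapply(keep, function(v) {
  others <- setdiff(keep, v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

Dropping one dummy column per group before fitting is the usual way to make a VIF calculation possible on such a model.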