# Data Wrangling
library(dplyr)
# Visualization
library(GGally)
# Model testing
library(performance)
library(lmtest)
library(car)
This project uses linear regression to model the relationship between house prices and house features. The data used in this project is taken from: https://www.kaggle.com/datasets/greenwing1985/housepricing
houseprice <- read.csv("HousePrices_HalfMil.csv")
head(houseprice)
#> Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
#> 1 164 2 0 2 0 1 0 0
#> 2 84 2 0 4 0 0 1 1
#> 3 190 2 4 4 1 0 0 0
#> 4 75 2 4 4 0 0 1 1
#> 5 148 1 4 2 1 0 0 1
#> 6 124 3 3 3 0 1 0 1
#> City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
#> 1 3 1 1 1 1 0 0 43800
#> 2 2 0 0 0 1 1 1 37550
#> 3 2 0 0 1 0 0 0 49500
#> 4 1 1 1 1 1 1 1 50075
#> 5 2 1 0 0 1 1 1 52400
#> 6 1 0 0 1 1 1 1 54300
glimpse(houseprice)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
The data has 500,000 rows and 16 columns.
anyNA(houseprice)
#> [1] FALSE
There are no NA values in our data, so we can use it as is.
The Prices column will be our target variable.
Let's check the distribution of house prices.
summary(houseprice$Prices)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 7725 33500 41850 42050 50750 77975
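The mean (42,050) and median (41,850) are close, suggesting a roughly symmetric distribution. For a visual check (a sketch, output not shown), a quick histogram:
# Histogram of the target variable
hist(houseprice$Prices)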
We want to see the correlation between features.
ggcorr(houseprice, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)
Looking at each feature's correlation with Prices, the strongest correlations belong to Floors, White.Marble, Fiber, City, Glass.Doors, and Indian.Marble; the rest of the features have negligible correlation with Prices.
We also need to note that White.Marble, Black.Marble, and Indian.Marble have moderate correlation with each other. Judging from the data, these three columns look like mutually exclusive dummy (one-hot) indicators for a single marble-type feature, which would make them perfectly collinear as a group and cause multicollinearity in our linear regression model. We will deal with this after looking at our models.
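A quick way to check the one-hot suspicion (a sketch; if it holds, every row sum equals 1 and the table below reports a single count):
# If the marble columns are mutually exclusive dummies,
# each row should contain exactly one 1 across the three columns
table(houseprice$White.Marble + houseprice$Black.Marble + houseprice$Indian.Marble)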
For the first model, we will use all features as predictor variables.
model_all <- lm(formula = Prices ~ ., data = houseprice)
summary(model_all)
#>
#> Call:
#> lm(formula = Prices ~ ., data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000360
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value
#> (Intercept) 999.99999980933296 0.00000000150884 662758838130.408
#> Area 25.00000000000007 0.00000000000374 6684054946884.657
#> Garage 1500.00000000008640 0.00000000032868 4563708109116.631
#> FirePlace 750.00000000030752 0.00000000018991 3949331829292.207
#> Baths 1250.00000000023601 0.00000000018988 6583092214898.437
#> White.Marble 13999.99999999921783 0.00000000065745 21294466488403.605
#> Black.Marble 4999.99999999932425 0.00000000065760 7603454046266.312
#> Indian.Marble NA NA NA
#> Floors 15000.00000000155524 0.00000000053706 27929788605714.617
#> City 3499.99999999960937 0.00000000032900 10638307309887.180
#> Solar 249.99999999946232 0.00000000053707 465489478654.927
#> Electric 1249.99999999946135 0.00000000053706 2327484193486.549
#> Fiber 11749.99999999948523 0.00000000053707 21878078290865.754
#> Glass.Doors 4449.99999999952979 0.00000000053707 8285776488182.684
#> Swiming.Pool 0.00000000054140 0.00000000053707 1.008
#> Garden 0.00000000053855 0.00000000053707 1.003
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Indian.Marble NA
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Swiming.Pool 0.313
#> Garden 0.316
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 0.00000000000000022
The coefficient for Indian.Marble is undefined (NA), reported as "1 not defined because of singularities". As noted before, White.Marble, Black.Marble, and Indian.Marble behave like a set of mutually exclusive dummies, so Indian.Marble is perfectly collinear with the other two (the classic dummy variable trap) and lm() drops it automatically.
To handle this explicitly, we will remove the feature Indian.Marble from our model.
model_all <- lm(formula = Prices ~ . - Indian.Marble, data = houseprice)
summary(model_all)
#>
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble, data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000360
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980933296 0.00000000150884 662758838130.408
#> Area 25.00000000000007 0.00000000000374 6684054946884.657
#> Garage 1500.00000000008640 0.00000000032868 4563708109116.631
#> FirePlace 750.00000000030752 0.00000000018991 3949331829292.207
#> Baths 1250.00000000023601 0.00000000018988 6583092214898.437
#> White.Marble 13999.99999999921783 0.00000000065745 21294466488403.605
#> Black.Marble 4999.99999999932425 0.00000000065760 7603454046266.312
#> Floors 15000.00000000155524 0.00000000053706 27929788605714.617
#> City 3499.99999999960937 0.00000000032900 10638307309887.180
#> Solar 249.99999999946232 0.00000000053707 465489478654.927
#> Electric 1249.99999999946135 0.00000000053706 2327484193486.549
#> Fiber 11749.99999999948523 0.00000000053707 21878078290865.754
#> Glass.Doors 4449.99999999952979 0.00000000053707 8285776488182.684
#> Swiming.Pool 0.00000000054140 0.00000000053707 1.008
#> Garden 0.00000000053855 0.00000000053707 1.003
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Swiming.Pool 0.313
#> Garden 0.316
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 0.00000000000000022
Model Interpretation
All predictors are highly significant except Swiming.Pool (p = 0.313) and Garden (p = 0.316), so Prices can be explained by the remaining predictor variables. We will remove the two non-significant variables: Swiming.Pool and Garden.
model_all <- lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden, data = houseprice)
summary(model_all)
#>
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden,
#> data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000360
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980986900 0.00000000146117 684384458756
#> Area 25.00000000000009 0.00000000000374 6684062965981
#> Garage 1500.00000000008663 0.00000000032868 4563711740576
#> FirePlace 750.00000000030764 0.00000000018991 3949335111245
#> Baths 1250.00000000023670 0.00000000018988 6583118088334
#> White.Marble 13999.99999999921783 0.00000000065745 21294514660633
#> Black.Marble 4999.99999999932515 0.00000000065760 7603456106824
#> Floors 15000.00000000155524 0.00000000053706 27929792148117
#> City 3499.99999999960983 0.00000000032900 10638315514581
#> Solar 249.99999999945976 0.00000000053706 465493747927
#> Electric 1249.99999999946203 0.00000000053706 2327485216394
#> Fiber 11749.99999999948704 0.00000000053706 21878264248050
#> Glass.Doors 4449.99999999953161 0.00000000053706 8285822881367
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.695e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
Both this and the previous model have an adjusted R-squared of 1, so it is fine to keep this simpler model. (An adjusted R-squared of exactly 1, with residuals on the order of 1e-7, suggests this synthetic Kaggle dataset was generated from an exact linear formula; the perfect fit is expected here rather than a sign of error.)
Model Interpretation
All remaining predictors are significant, and Prices can be explained by the chosen predictor variables.
For the next model, we will use only a few features as predictors, selected by their correlation with Prices.
ggcorr(houseprice, label = TRUE, label_size = 3, hjust = 1, layout.exp = 2)
We will use only the features whose absolute correlation with Prices is at least 0.2: Glass.Doors, Fiber, City, Floors, Indian.Marble, and White.Marble.
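For reference, here is a sketch of how this threshold-based selection could be done programmatically (the 0.2 cutoff and the exclusion of Prices itself are the only inputs):
# Select predictors whose absolute correlation with Prices is >= 0.2
cors <- cor(houseprice)[, "Prices"]
names(cors)[abs(cors) >= 0.2 & names(cors) != "Prices"]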
houseprice_multiple <- houseprice %>% select(Prices, Glass.Doors, Fiber, City, Floors, Indian.Marble, White.Marble)
model_multiple <- lm(formula = Prices ~ ., data = houseprice_multiple)
summary(model_multiple)
#>
#> Call:
#> lm(formula = Prices ~ ., data = houseprice_multiple)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -9370.5 -2153.1 1.1 2147.2 9351.4
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 18144.277 15.012 1208.6 <0.0000000000000002 ***
#> Glass.Doors 4434.766 8.658 512.2 <0.0000000000000002 ***
#> Fiber 11750.084 8.658 1357.1 <0.0000000000000002 ***
#> City 3492.153 5.304 658.4 <0.0000000000000002 ***
#> Floors 14991.642 8.658 1731.6 <0.0000000000000002 ***
#> Indian.Marble -4997.262 10.601 -471.4 <0.0000000000000002 ***
#> White.Marble 9024.192 10.612 850.4 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 3061 on 499993 degrees of freedom
#> Multiple R-squared: 0.9361, Adjusted R-squared: 0.9361
#> F-statistic: 1.221e+06 on 6 and 499993 DF, p-value: < 0.00000000000000022
Model Interpretation
All predictors are significant, so Prices can be explained by the chosen predictors, but the adjusted R-squared drops to 0.9361 and the residual standard error rises to about 3061.
Next, we will do feature selection with all three kinds of stepwise regression: forward, backward, and both.
# Model without predictors (intercept only)
model_base_1 <- lm(formula = Prices ~ 1, data = houseprice)
# Model with all predictors
model_base_all <- lm(formula = Prices ~ ., data = houseprice)
# Stepwise regression in all three directions
model_step_forward <- step(object = model_base_1,
                           direction = "forward",
                           scope = list(upper = model_base_all),
                           trace = FALSE)
model_step_backward <- step(object = model_base_all,
                            direction = "backward",
                            scope = list(lower = model_base_1),
                            trace = FALSE)
model_step_both <- step(object = model_base_1,
                        direction = "both",
                        scope = list(upper = model_base_all,
                                     lower = model_base_1),
                        trace = FALSE)
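Before inspecting each summary, a quick side-by-side check is possible with base R's AIC(), which accepts several fitted models at once (a sketch, output not shown):
# Compare the three stepwise results by AIC in one call
AIC(model_step_forward, model_step_backward, model_step_both)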
summary(model_step_forward)
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134145 0.000000000 0.000000000 0.000000001 0.000004352
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.999999904189053 0.000000001590920 628567198577
#> Floors 14999.999999853393092 0.000000000584752 25651884669521
#> Fiber 11749.999999947449396 0.000000000584754 20093909338301
#> White.Marble 14000.000000052730684 0.000000000715828 19557769398158
#> City 3499.999999992096491 0.000000000358215 9770672162967
#> Glass.Doors 4450.000000006955815 0.000000000584753 7610044923319
#> Black.Marble 4999.999999999224201 0.000000000715991 6983330850034
#> Area 24.999999999998334 0.000000000004072 6138921887378
#> Baths 1250.000000000046384 0.000000000206741 6046209906362
#> Garage 1500.000000000003638 0.000000000357867 4191502987704
#> FirePlace 750.000000000298428 0.000000000206769 3627233896271
#> Electric 1249.999999999690090 0.000000000584752 2137659386244
#> Solar 249.999999999674060 0.000000000584756 427528850660
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.43e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
Model Interpretation
All 12 selected predictors are significant, and Prices can be explained by the chosen predictor variables. Forward selection drops Indian.Marble, Swiming.Pool, and Garden:
names(model_step_forward$coefficients) %>% sort()
#> [1] "(Intercept)" "Area" "Baths" "Black.Marble" "City"
#> [6] "Electric" "Fiber" "FirePlace" "Floors" "Garage"
#> [11] "Glass.Doors" "Solar" "White.Marble"
summary(model_step_backward)
#>
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
#> Black.Marble + Floors + City + Solar + Electric + Fiber +
#> Glass.Doors, data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000360
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980986900 0.00000000146117 684384458756
#> Area 25.00000000000009 0.00000000000374 6684062965981
#> Garage 1500.00000000008663 0.00000000032868 4563711740576
#> FirePlace 750.00000000030764 0.00000000018991 3949335111245
#> Baths 1250.00000000023670 0.00000000018988 6583118088334
#> White.Marble 13999.99999999921783 0.00000000065745 21294514660633
#> Black.Marble 4999.99999999932515 0.00000000065760 7603456106824
#> Floors 15000.00000000155524 0.00000000053706 27929792148117
#> City 3499.99999999960983 0.00000000032900 10638315514581
#> Solar 249.99999999945976 0.00000000053706 465493747927
#> Electric 1249.99999999946203 0.00000000053706 2327485216394
#> Fiber 11749.99999999948704 0.00000000053706 21878264248050
#> Glass.Doors 4449.99999999953161 0.00000000053706 8285822881367
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.695e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
Model Interpretation
Backward elimination keeps the same 12 significant predictors, so again Prices can be explained by the chosen predictor variables:
names(model_step_backward$coefficients) %>% sort()
#> [1] "(Intercept)" "Area" "Baths" "Black.Marble" "City"
#> [6] "Electric" "Fiber" "FirePlace" "Floors" "Garage"
#> [11] "Glass.Doors" "Solar" "White.Marble"
summary(model_step_both)
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = houseprice)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134145 0.000000000 0.000000000 0.000000001 0.000004352
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.999999904189053 0.000000001590920 628567198577
#> Floors 14999.999999853393092 0.000000000584752 25651884669521
#> Fiber 11749.999999947449396 0.000000000584754 20093909338301
#> White.Marble 14000.000000052730684 0.000000000715828 19557769398158
#> City 3499.999999992096491 0.000000000358215 9770672162967
#> Glass.Doors 4450.000000006955815 0.000000000584753 7610044923319
#> Black.Marble 4999.999999999224201 0.000000000715991 6983330850034
#> Area 24.999999999998334 0.000000000004072 6138921887378
#> Baths 1250.000000000046384 0.000000000206741 6046209906362
#> Garage 1500.000000000003638 0.000000000357867 4191502987704
#> FirePlace 750.000000000298428 0.000000000206769 3627233896271
#> Electric 1249.999999999690090 0.000000000584752 2137659386244
#> Solar 249.999999999674060 0.000000000584756 427528850660
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.43e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
Model Interpretation
Stepwise selection in both directions agrees with the forward result:
names(model_step_both$coefficients) %>% sort()
#> [1] "(Intercept)" "Area" "Baths" "Black.Marble" "City"
#> [6] "Electric" "Fiber" "FirePlace" "Floors" "Garage"
#> [11] "Glass.Doors" "Solar" "White.Marble"
All three stepwise procedures select the same 12 predictors, dropping Indian.Marble, Swiming.Pool, and Garden (the same set used in model_all). We will now make predictions with each model on our data and evaluate the prediction results.
houseprice_pred <- data.frame(actual_price = houseprice$Prices)
houseprice_pred$model_all <- predict(model_all, houseprice)
houseprice_pred$model_multiple <- predict(model_multiple, houseprice)
houseprice_pred$model_step_forward <- predict(model_step_forward, houseprice)
houseprice_pred$model_step_backward <- predict(model_step_backward, houseprice)
houseprice_pred$model_step_both <- predict(model_step_both, houseprice)
head(houseprice_pred)
#> actual_price model_all model_multiple model_step_forward model_step_backward
#> 1 43800 43800 44805.59 43800 43800
#> 2 37550 37550 39557.73 37550 37550
#> 3 49500 49500 45902.86 49500 49500
#> 4 50075 50075 47815.66 50075 50075
#> 5 52400 52400 53579.18 52400 52400
#> 6 54300 54300 52812.92 54300 54300
#> model_step_both
#> 1 43800
#> 2 37550
#> 3 49500
#> 4 50075
#> 5 52400
#> 6 54300
Here we can see actual_price compared with the predicted house price from each model. Most of the models seem to predict the house price accurately. However, we cannot tell which model is best just by looking at this table, so we need a proper model comparison.
We have built several models for our data, and we want to pick the best one: a model that can both predict and describe our target variable (Prices). We will evaluate each model on:
- AIC (Akaike Information Criterion): represents information loss; the smaller the AIC, the less information the model loses.
- R-squared: the higher the value, the better the chosen predictors describe the variance of the target variable. Because all of our models use more than one predictor, we will use the adjusted R-squared.
- Error, the difference between predicted and actual values: the smaller the error, the better the model predicts the target. We will measure error with RMSE (Root Mean Squared Error), the square root of the mean of the squared errors; it penalizes large errors and is easy to interpret because it is in the same units as the target (a worked sketch follows below).
To do the comparison, we will utilize the performance library.
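As a concrete sketch of the RMSE formula, we can compute it by hand from the predictions stored earlier; for model_multiple this should match the RMSE column reported by compare_performance() below (about 3061):
# RMSE by hand: square the errors, average them, take the square root
sqrt(mean((houseprice_pred$actual_price - houseprice_pred$model_multiple)^2))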
comparison <- compare_performance(model_all, model_multiple,
                                  model_step_forward, model_step_backward,
                                  model_step_both)
as.data.frame(comparison)[, c('Name', 'AIC', 'R2_adjusted', 'RMSE')] %>% arrange(RMSE)
#> Name AIC R2_adjusted RMSE
#> 1 model_all -14057928 1.0000000 0.0000001898760
#> 2 model_step_backward -14057928 1.0000000 0.0000001898760
#> 3 model_step_forward -13972851 1.0000000 0.0000002067371
#> 4 model_step_both -13972851 1.0000000 0.0000002067371
#> 5 model_multiple 9445453 0.9361105 3061.0061860142678
We want a model that has a small AIC, a high adjusted R-squared, and a small RMSE. From the table above, model_all has the smallest AIC, the highest adjusted R-squared, and the smallest RMSE (model_step_backward, which selects the same 12 predictors, ties with it). Therefore, we can conclude that model_all is the best model that we have.
We need to check whether our model fulfills the assumptions of linear regression, to ensure that it can predict new data consistently (a Best Linear Unbiased Estimator, or BLUE, model).
The assumptions that we need to check are: linearity, normality of residuals, homoscedasticity, and no multicollinearity.
We will perform the assumption checks only on our best model, model_all.
The assumption of linearity means that the target variable and its predictors have a linear relationship. In multiple linear regression, linearity can be assessed by plotting the residuals against the fitted (predicted) values.
The smoothing line in the plot should be close to horizontal, and the residuals should show no pattern.
plot(model_all, which = 1)
Conclusion: the residuals show no visible pattern around zero, so the linearity assumption is satisfied.
The errors/residuals are expected to follow a normal distribution, with the error values clustering around zero.
We can check this assumption using:
- the hist() function, to visualize the distribution of the residuals
- the shapiro.test() function, as a formal normality test
However, applying shapiro.test() to our residuals causes an error:
> Error in shapiro.test(.) : sample size must be between 3 and 5000
This is because our model size exceeds 5000, as seen below:
length(model_all$residuals)
#> [1] 500000
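A common workaround (a sketch, not used in this project; the seed is arbitrary) is to run the test on a random subsample of at most 5,000 residuals:
# Shapiro-Wilk test on a random 5,000-point subsample of the residuals
set.seed(100)
shapiro.test(sample(model_all$residuals, 5000))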
For this project, we will use the histogram visualization instead.
hist(model_all$residuals)
As the residual values cluster around 0, we can conclude that the assumption is satisfied.
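As an additional visual check (a sketch, output not shown), a normal Q-Q plot of the residuals should fall close to the reference line if they are normally distributed:
# Normal Q-Q plot of the model residuals
qqnorm(model_all$residuals)
qqline(model_all$residuals, col = "red")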
The errors/residuals are expected to show no pattern: if we plot the model's residuals against its fitted values and the residuals spread evenly around zero, the residuals are homoscedastic.
We can check this assumption in two ways:
1. Visually, with a scatter plot of residuals vs. fitted values:
plot(model_all$fitted.values, model_all$residuals)
abline(h = 0, col = "red")
2. Formally, with the Breusch-Pagan test, using bptest() from the lmtest package:
bptest(model_all)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_all
#> BP = 11.298, df = 12, p-value = 0.5035
The residuals spread around 0 in the scatter plot, and the p-value is greater than 0.05 in the Breusch-Pagan test. Therefore, we can conclude that the homoscedasticity assumption is satisfied.
Multicollinearity is the condition where there are strong correlations between predictor variables. This is undesirable because it indicates redundant predictors in the model; ideally, only one of each set of highly correlated variables should be kept.
We will do VIF (Variance Inflation Factor) test using
vif() function from package car:
vif(model_all)
#> Area Garage FirePlace Baths White.Marble Black.Marble
#> 1.000022 1.000031 1.000009 1.000030 1.331386 1.331386
#> Floors City Solar Electric Fiber Glass.Doors
#> 1.000012 1.000024 1.000018 1.000009 1.000019 1.000017
All VIF values are < 10, therefore there is no multicollinearity in the model and the assumption is satisfied.
Our best model is model_all, which uses all house features except Indian.Marble, Swiming.Pool, and Garden as predictors for the target variable Prices. The model's adjusted R-squared is 1 and its RMSE is 0.0000001898760. It satisfies all of the linear regression assumptions and can be used to predict house prices based on house features.
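Finally, as a usage sketch, the model can price a new house; every feature value below is hypothetical, made up purely for illustration:
# Hypothetical new house (illustrative values only)
new_house <- data.frame(
  Area = 150, Garage = 2, FirePlace = 1, Baths = 2,
  White.Marble = 1, Black.Marble = 0, Floors = 1, City = 2,
  Solar = 1, Electric = 1, Fiber = 1, Glass.Doors = 1
)
predict(model_all, newdata = new_house)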