In this LBB Project, we will create Regression Model using dataset provided by Kaggle to analyze house pricing based on its property attributes.
# Library setup and Installation necessary packages
library(dplyr)
library(lubridate) # working with datetime
library(GGally) # correlation relationship
library(MLmetrics) # for MAE calculations
library(performance) # model performance comparison
library(lmtest) # Testing Linear Regression Model
library(car)# Read dataset
house <- read.csv("data_input/HousePrices_HalfMil.csv")#, stringsAsFactors = F) %>%
str(house)#> 'data.frame': 500000 obs. of 16 variables:
#> $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
#> $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
#> $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
#> $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
#> $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
#> $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
#> $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
#> $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
#> $ City : int 3 2 2 1 2 1 3 1 1 2 ...
#> $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
#> $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
#> $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
#> $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
#> $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
#> $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
#> $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
As there have not been prior knowledge on information regarding variables in the dataset given from the data source information page, hence we will explore the dataset further to interpret the variables meaning of the house attributes.
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
There are 500,000 observation data in the dataset
#> Area Garage FirePlace Baths
#> Min. : 1.0 Min. :1.000 Min. :0.000 Min. :1.000
#> 1st Qu.: 63.0 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:2.000
#> Median :125.0 Median :2.000 Median :2.000 Median :3.000
#> Mean :124.9 Mean :2.001 Mean :2.003 Mean :2.998
#> 3rd Qu.:187.0 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:4.000
#> Max. :249.0 Max. :3.000 Max. :4.000 Max. :5.000
#> White.Marble Black.Marble Indian.Marble Floors
#> Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :0.000 Median :0.0000 Median :0.0000 Median :0.0000
#> Mean :0.333 Mean :0.3327 Mean :0.3343 Mean :0.4994
#> 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
#> City Solar Electric Fiber
#> Min. :1.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :2.000 Median :0.0000 Median :1.0000 Median :1.0000
#> Mean :2.001 Mean :0.4987 Mean :0.5007 Mean :0.5005
#> 3rd Qu.:3.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :3.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
#> Glass.Doors Swiming.Pool Garden Prices
#> Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. : 7725
#> 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:33500
#> Median :0.0000 Median :1.0000 Median :1.0000 Median :41850
#> Mean :0.4999 Mean :0.5004 Mean :0.5016 Mean :42050
#> 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:50750
#> Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :77975
Check any missing values in the dataset
#> Area Garage FirePlace Baths White.Marble
#> 0 0 0 0 0
#> Black.Marble Indian.Marble Floors City Solar
#> 0 0 0 0 0
#> Electric Fiber Glass.Doors Swiming.Pool Garden
#> 0 0 0 0 0
#> Prices
#> 0
Check for unique values on several columns
#> [1] 2 1 3
#> [1] 0 4 3 1 2
#> [1] 2 4 3 1 5
#> [1] 3 2 1
All the columns except Area and Prices can
be changed to categorial type as those columns have unique values at
most 5 unique values.
Based on information from our data exploration on the structure of
our original dataset house, we can interpret the variable
attributes as follows: - Area : Area of the house -
Garage : Number of Garages in the house
- FirePlace : Number of Fireplaces built in the house
- Baths : Number of Bathrooms in the house
- White.Marble : Whether the house using white marble. 1
for Yes, 0 for No. - Black.Marble : Whether the house using
black marble. 1 for Yes, 0 for No.
- Indian.Marble : Whether the house using black marble. 1
for Yes, 0 for No. - Floors : How many levels of floors in
the house. 0 for ground level, 1 for 1 level above ground level,
etc.
- City : Code for city area which the house is
located
- Solar : Whether the house is using Solar based power
sourced. 1 for Yes, 0 for No.
- Electric : Whether the house is using Electricity based
power sourced. 1 for Yes, 0 for No. - Fiber : Whether the
house is using Fiber. 1 for Yes, 0 for No. - Glass.Doors :
Whether the house has Glass Doors. 1 for Yes, 0 for No. -
Swiming.Pool : Whether the house has Swiming Pool. 1 for
Yes, 0 for No. - Garden : Whether the house has a Garden. 1
for Yes, 0 for No.
- Prices : House Price
We will change all the data types except Area and
Price to categorial factor type and store the
new dataset into new dataframe named house_clean.
# Change data_type
house_clean <- house %>%
mutate_at(vars(Garage, FirePlace, Baths, White.Marble, Black.Marble,
Indian.Marble, Floors, City, Solar, Electric,
Fiber, Glass.Doors, Swiming.Pool, Garden),
as.factor)
# Confirm Data Type Change
glimpse(house_clean)#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
#> $ Garage <fct> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
#> $ FirePlace <fct> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
#> $ Baths <fct> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
#> $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
#> $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
#> $ Floors <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
#> $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
#> $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
#> $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
#> $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
#> $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
#> $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
#> $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
In this project, we will analyze House Pricing Prices
based on the house attributes to find which house attributes affect
house pricing.
PricesCheck variable correlations using dataframe
house_clean
Because all the variables have been changed to categorial type, hence
the only attribute that can be used to analyze Price can
only be Area.
This is not what we wanted because our purpose is to analyze House Pricing based on the House’s attributes, so we will need to keep the data types of all the columns as it is.
Check variable correlations using dataframe house
Based on the figure above, we can observe the correlations between
each variable with Prices and begin to create several
modelling to analyze as follows:
Modelling using predictor variable with the highest correlations which is:
Floors : 0.6Modelling using predictor variables with higher correlations of ranges > 0.3 or < -0.3 :
Floors : 0.6Fiber : 0.5Indian.Marble : -0.4White.Marble : 0.4Modelling using all variables as predictor variables.
Predictor Variable : Floors
#>
#> Call:
#> lm(formula = Prices ~ Floors, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -27211 -6986 -11 6592 28542
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 34557.66 19.00 1819 <0.0000000000000002 ***
#> Floors 15003.39 26.89 558 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 9507 on 499998 degrees of freedom
#> Multiple R-squared: 0.3837, Adjusted R-squared: 0.3837
#> F-statistic: 3.113e+05 on 1 and 499998 DF, p-value: < 0.00000000000000022
💡 Insight :
Using one predictor variable Floors, R-squared value
is 0.3837 = 38.37%.
38.37% variance from target, house pricing Prices
can be explained from our chosen predictor variable.
Predictor Variables : Floors, Fiber,
Indian.Marble, White.Marble.
model_price_2 <- lm(formula = Prices ~ Floors + Fiber + Indian.Marble + White.Marble,
data = house)
summary(model_price_2)#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + Indian.Marble + White.Marble,
#> data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15059.3 -3325.2 -0.2 3328.2 14999.8
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 27362.24 14.98 1827.1 <0.0000000000000002 ***
#> Floors 14987.99 13.39 1119.1 <0.0000000000000002 ***
#> Fiber 11724.54 13.39 875.4 <0.0000000000000002 ***
#> Indian.Marble -4990.47 16.40 -304.3 <0.0000000000000002 ***
#> White.Marble 9020.49 16.41 549.5 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 4735 on 499995 degrees of freedom
#> Multiple R-squared: 0.8471, Adjusted R-squared: 0.8471
#> F-statistic: 6.926e+05 on 4 and 499995 DF, p-value: < 0.00000000000000022
💡 Insight :
Using four predictor variable, R-squared value is 0.8471 = 84.71%.
84.71% variance from target, house pricing Prices
can be explained from our four chosen predictor variables.
Predictor Variables : all variables.
#>
#> Call:
#> lm(formula = Prices ~ ., data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134259 0.000000000 0.000000000 0.000000001 0.000000359
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value
#> (Intercept) 999.99999980934274 0.00000000150884 662761448609.378
#> Area 25.00000000000007 0.00000000000374 6684081274085.180
#> Garage 1500.00000000008686 0.00000000032868 4563726084680.788
#> FirePlace 750.00000000030684 0.00000000018991 3949347384946.940
#> Baths 1250.00000000023488 0.00000000018988 6583118144426.266
#> White.Marble 13999.99999999921420 0.00000000065744 21294550363191.246
#> Black.Marble 4999.99999999932425 0.00000000065759 7603483994802.182
#> Indian.Marble NA NA NA
#> Floors 15000.00000000155524 0.00000000053706 27929898615753.586
#> City 3499.99999999960710 0.00000000032900 10638349212123.465
#> Solar 249.99999999946226 0.00000000053707 465491312128.012
#> Electric 1249.99999999946249 0.00000000053706 2327493360996.926
#> Fiber 11749.99999999948704 0.00000000053707 21878164464387.484
#> Glass.Doors 4449.99999999952979 0.00000000053706 8285809124254.771
#> Swiming.Pool 0.00000000054141 0.00000000053706 1.008
#> Garden 0.00000000053856 0.00000000053707 1.003
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Indian.Marble NA
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Swiming.Pool 0.313
#> Garden 0.316
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 0.00000000000000022
💡 Insight :
Using all variables as our predictor variable, R-squared value is 1 = 100%.
100% variance from target, house pricing Prices can
be explained from all variables above.
Significant predictor variables almost all except
Indian.Marble, Swimming.Pool and
Garden.
⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.
Indian.MarbleBased on the Third Model using All Variables as the predictor
variables, we observed that the value of R-squared equals to 1 (one) and
there is one variable Indian.Marble gives us
NA value and we will create additional model by eliminating
this variable to see the effect it gave to our model.
Predictor Variables : all except Indian.Marble.
model_price_no_IndianMarble <- lm(formula = Prices ~ . - Indian.Marble,
data = house)
summary(model_price_no_IndianMarble)#>
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134259 0.000000000 0.000000000 0.000000001 0.000000359
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980934274 0.00000000150884 662761448609.378
#> Area 25.00000000000007 0.00000000000374 6684081274085.180
#> Garage 1500.00000000008686 0.00000000032868 4563726084680.788
#> FirePlace 750.00000000030684 0.00000000018991 3949347384946.940
#> Baths 1250.00000000023488 0.00000000018988 6583118144426.266
#> White.Marble 13999.99999999921420 0.00000000065744 21294550363191.246
#> Black.Marble 4999.99999999932425 0.00000000065759 7603483994802.182
#> Floors 15000.00000000155524 0.00000000053706 27929898615753.586
#> City 3499.99999999960710 0.00000000032900 10638349212123.465
#> Solar 249.99999999946226 0.00000000053707 465491312128.012
#> Electric 1249.99999999946249 0.00000000053706 2327493360996.926
#> Fiber 11749.99999999948704 0.00000000053707 21878164464387.484
#> Glass.Doors 4449.99999999952979 0.00000000053706 8285809124254.771
#> Swiming.Pool 0.00000000054141 0.00000000053706 1.008
#> Garden 0.00000000053856 0.00000000053707 1.003
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Swiming.Pool 0.313
#> Garden 0.316
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499985 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.453e+26 on 14 and 499985 DF, p-value: < 0.00000000000000022
💡 Insight :
Using all variables as our predictor variables except
Indian.Marble, R-squared value is still 1 = 100%.
100% variance from target, house pricing Prices can
be explained from all variables above.
⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.
Based on the Fourth Model using All Variables except
Indian.Marble, we observed that the value of R-squared
remains the same as equals to 1 (one) and let us create additional model
by eliminating the other two non-significant variables
Swimming.Pool and Garden.
Predictor Variables : all variables except
Indian.Marble, Swiming.Pool and
Garden.
model_price_signf <- lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden,
data = house)
summary(model_price_signf)#>
#> Call:
#> lm(formula = Prices ~ . - Indian.Marble - Swiming.Pool - Garden,
#> data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000359
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980987025 0.00000000146116 684387154347
#> Area 25.00000000000009 0.00000000000374 6684089292557
#> Garage 1500.00000000008708 0.00000000032868 4563729715707
#> FirePlace 750.00000000030707 0.00000000018991 3949350666525
#> Baths 1250.00000000023556 0.00000000018988 6583144017318
#> White.Marble 13999.99999999921238 0.00000000065744 21294598533522
#> Black.Marble 4999.99999999932515 0.00000000065759 7603486054622
#> Floors 15000.00000000155524 0.00000000053706 27929902155431
#> City 3499.99999999960755 0.00000000032900 10638357415806
#> Solar 249.99999999945970 0.00000000053706 465495581371
#> Electric 1249.99999999946317 0.00000000053706 2327494383680
#> Fiber 11749.99999999948886 0.00000000053706 21878350420159
#> Glass.Doors 4449.99999999953161 0.00000000053706 8285855516810
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.695e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
💡 Insight :
Using all variables as our predictor variables except
Indian.Marble, Swiming.Pool and
Garden, R-squared value remains the same at 1 =
100%.
100% variance from target, house pricing Prices can
be explained from all variables above.
⚠️ R-square value = 1, we may need to be cautious that this model may be overfitting.
At this point, we have created several models which are :
model_price_1 : Model with Highest Correlation
Variable Floors.
model_price_2 : Model using Higher Correlation
Variables Floors, Fiber,
Indian.Marble, White.Marble.
model_price_all : Model using all variables as
Predictor Variables.
model_price_no_IndianMarble : Model using all
variables except Indian.Marble.
model_price_signf : Model using all significant
variables which is all variables except Indian.Marble,
Swiming.Pool and Garden.
Let us recap and compare all of our five modellings above using r-squared and error :
#> [1] 0.3837194
#> [1] 0.8471227
#> [1] 1
#> [1] 1
#> [1] 1
💡 Insight :
model_price_all,
model_price_no_IndianMarble, and
model_price_signf has the highest r-squared value at
1.
model_price_1 has the lowest r-squared value at
0.3837 = 38.37%.
⚠️ However, we may need to be cautious that those models may be overfitting because it is too perfect.
Make predictions for each model
price_pred_1 <- predict(model_price_1, house)
price_pred_2 <- predict(model_price_2, house)
price_pred_all <- predict(model_price_all, house)
price_pred_no_IndianMarble <- predict(model_price_no_IndianMarble, house)
price_pred_signf <- predict(model_price_signf, house)Prediction Evaluation using Error Metrics (RMSE)
RMSE_1 <- RMSE(y_pred = price_pred_1,
y_true = house$Prices)
RMSE_2 <- RMSE(y_pred = price_pred_2,
y_true = house$Prices)
RMSE_all <- RMSE(y_pred = price_pred_all,
y_true = house$Prices)
RMSE_no_IndianMarble <- RMSE(y_pred = price_pred_no_IndianMarble,
y_true = house$Prices)
RMSE_signf <- RMSE(y_pred = price_pred_signf,
y_true = house$Prices)
RMSE_1#> [1] 9506.956
#> [1] 4735.022
#> [1] 0.0000001901459
#> [1] 0.0000001901459
#> [1] 0.0000001901501
💡 Insight :
Based on RMSE values, model_price_all and
model_price_no_IndianMarble have the best performance
because the two models have the lowest RMSE values at
0.0000001901459.
However, model_price_signf also considered to have
best performance with almost no difference on RMSE values at
0.0000001901501.
⚠️ Still, with RMSE values almost non-existent, we may need to be cautious.
In the opposite side, model_price_1 have the worst
performance with the highest RMSE values at 9506.956 and this is
unexpected because this model is using the highest correlation variable
as its predictor variable.
Based on the Model Comparison that we have combined and created, we can conclude that there are many variables which can affect the value of our house pricing. We can observed it by summarizing as follows:
Model using only 1 predictor variable (with highest correlations)
Floors = 0.6, gives the worst model performance.
Model using 4 predictor variables (with higher correlations
variables) (Floors, Fiber,
Indian.Marble, White.Marble), gives better
model performance compare to only using 1 predictor variable.
Model using all variables as predictor variables or also only using all significant variables, gives best model performances but ⚠️ note of caution that this may be overfitting model as it is too perfect with almost zero error.
In other words, the more the house has many additional significant attributes except Indian Marble, Swimming Pool and Garden, the more it can affect the house pricing.
If in our previous attempt of modelling is more like trial and error to find the best combinations of predictor variables, let us now use method of Stepwise Regression instead.
# Create model without predictor to start
model_price_base <- lm(formula = Prices ~ 1,
data = house)
# We already have model for upper limit : model_price_all
# Stepwise Regression
model_price_forward <- step(object = model_price_base,
direction = "forward",
scope = list(upper = model_price_all)
)#> Start: AIC=9401807
#> Prices ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + Floors 1 28137670256676 45191105612975 9159783
#> + Fiber 1 17222144966830 56106630902821 9267960
#> + White.Marble 1 14727518113609 58601257756042 9289711
#> + Indian.Marble 1 10025454477185 63303321392466 9328302
#> + City 1 3989805570061 69338970299590 9373836
#> + Glass.Doors 1 2428230058991 70900545810660 9384972
#> + Area 1 1600050598200 71728725271451 9390779
#> + Baths 1 1543586418569 71785189451082 9391172
#> + Garage 1 737604900715 72591170968936 9396755
#> + FirePlace 1 582650002516 72746125867135 9397821
#> + Black.Marble 1 446697232267 72882078637384 9398754
#> + Electric 1 201670610161 73127105259490 9400432
#> + Solar 1 5209679561 73323566190090 9401774
#> <none> 73328775869651 9401807
#> + Swiming.Pool 1 234043881 73328541825770 9401808
#> + Garden 1 173920415 73328601949236 9401808
#>
#> Step: AIC=9159783
#> Prices ~ Floors
#>
#> Df Sum of Sq RSS AIC
#> + Fiber 1 17161774856129 28029330756847 8920961
#> + White.Marble 1 14724329410278 30466776202698 8962653
#> + Indian.Marble 1 10035180243376 35155925369599 9034231
#> + City 1 4003393234440 41187712378535 9113405
#> + Glass.Doors 1 2428367593536 42762738019440 9132168
#> + Area 1 1610488043954 43580617569021 9141641
#> + Baths 1 1555207029826 43635898583149 9142275
#> + Garage 1 746110257205 44444995355770 9151461
#> + FirePlace 1 581149541139 44609956071836 9153313
#> + Black.Marble 1 444090818181 44747014794794 9154847
#> + Electric 1 201282051847 44989823561128 9157553
#> + Solar 1 7437646280 45183667966695 9159703
#> + Swiming.Pool 1 269529965 45190836083010 9159782
#> + Garden 1 249560798 45190856052177 9159782
#> <none> 45191105612975 9159783
#>
#> Step: AIC=8920961
#> Prices ~ Floors + Fiber
#>
#> Df Sum of Sq RSS AIC
#> + White.Marble 1 14742671033035 13286659723812 8547717
#> + Indian.Marble 1 10048366919693 17980963837154 8698994
#> + City 1 4048561433438 23980769323408 8842963
#> + Glass.Doors 1 2457747958687 25571582798160 8875078
#> + Area 1 1609279115650 26420051641197 8891398
#> + Baths 1 1562306360520 26467024396327 8892287
#> + Garage 1 750129260320 27279201496527 8907399
#> + FirePlace 1 569726306346 27459604450501 8910695
#> + Black.Marble 1 444499183390 27584831573457 8912970
#> + Electric 1 202433546973 27826897209874 8917338
#> + Solar 1 7265670476 28022065086371 8920833
#> + Garden 1 252502079 28029078254768 8920958
#> <none> 28029330756847 8920961
#> + Swiming.Pool 1 464385 28029330292462 8920963
#>
#> Step: AIC=8547717
#> Prices ~ Floors + Fiber + White.Marble
#>
#> Df Sum of Sq RSS AIC
#> + City 1 4061895141055 9224764582757 8365285
#> + Glass.Doors 1 2462605272635 10824054451177 8445224
#> + Black.Marble 1 2076445268323 11210214455490 8462752
#> + Indian.Marble 1 2076445268323 11210214455490 8462752
#> + Area 1 1584782472730 11701877251082 8484214
#> + Baths 1 1538485207921 11748174515891 8486188
#> + Garage 1 746534973249 12540124750563 8518806
#> + FirePlace 1 564215464953 12722444258859 8526023
#> + Electric 1 200509914544 13086149809269 8540116
#> + Solar 1 8338855184 13278320868628 8547405
#> + Garden 1 149069376 13286510654437 8547714
#> <none> 13286659723812 8547717
#> + Swiming.Pool 1 43506897 13286616216915 8547717
#>
#> Step: AIC=8365285
#> Prices ~ Floors + Fiber + White.Marble + City
#>
#> Df Sum of Sq RSS AIC
#> + Glass.Doors 1 2457781442951 6766983139806 8210369
#> + Black.Marble 1 2081506095751 7143258487006 8237425
#> + Indian.Marble 1 2081506095751 7143258487006 8237425
#> + Area 1 1602371800156 7622392782601 8269886
#> + Baths 1 1542779602026 7681984980731 8273780
#> + Garage 1 743782030855 8480982551902 8323254
#> + FirePlace 1 564835861461 8659928721296 8333694
#> + Electric 1 199091680363 9025672902394 8354377
#> + Solar 1 8161007847 9216603574909 8364844
#> + Garden 1 95579979 9224669002778 8365282
#> <none> 9224764582757 8365285
#> + Swiming.Pool 1 35130383 9224729452374 8365285
#>
#> Step: AIC=8210369
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors
#>
#> Df Sum of Sq RSS AIC
#> + Black.Marble 1 2082103704397 4684879435409 8026513
#> + Indian.Marble 1 2082103704397 4684879435409 8026513
#> + Area 1 1607244583078 5159738556728 8074786
#> + Baths 1 1549285292131 5217697847674 8080371
#> + Garage 1 749672149111 6017310990695 8151663
#> + FirePlace 1 565687046739 6201296093066 8166722
#> + Electric 1 197573697371 6569409442435 8195555
#> + Solar 1 8394120147 6758589019658 8209750
#> + Swiming.Pool 1 28014024 6766955125782 8210368
#> <none> 6766983139806 8210369
#> + Garden 1 20781279 6766962358527 8210369
#>
#> Step: AIC=8026513
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble
#>
#> Df Sum of Sq RSS AIC
#> + Area 1 1608175453133 3076703982276 7816274
#> + Baths 1 1555500334873 3129379100536 7824762
#> + Garage 1 743586714890 3941292720518 7940099
#> + FirePlace 1 566807351596 4118072083812 7962037
#> + Electric 1 197972686303 4486906749106 8004927
#> + Solar 1 8099860660 4676779574749 8025650
#> + Swiming.Pool 1 32266754 4684847168654 8026512
#> <none> 4684879435409 8026513
#> + Garden 1 12521931 4684866913477 8026514
#>
#> Step: AIC=7816274
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area
#>
#> Df Sum of Sq RSS AIC
#> + Baths 1 1556798499478 1519905482798 7463671
#> + Garage 1 745553282885 2331150699390 7677528
#> + FirePlace 1 566099969257 2510604013018 7714609
#> + Electric 1 198113917947 2878590064329 7782997
#> + Solar 1 7979583108 3068724399167 7814978
#> + Swiming.Pool 1 24009667 3076679972609 7816272
#> <none> 3076703982276 7816274
#> + Garden 1 2962361 3076701019915 7816276
#>
#> Step: AIC=7463671
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths
#>
#> Df Sum of Sq RSS AIC
#> + Garage 1 753445238325 766460244472 7121363
#> + FirePlace 1 564897553777 955007929020 7231331
#> + Electric 1 196952649137 1322952833661 7394282
#> + Solar 1 8147991183 1511757491615 7460985
#> <none> 1519905482798 7463671
#> + Swiming.Pool 1 4538430 1519900944368 7463671
#> + Garden 1 189502 1519905293296 7463673
#>
#> Step: AIC=7121363
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage
#>
#> Df Sum of Sq RSS AIC
#> + FirePlace 1 563189576547 203270667925 6457742
#> + Electric 1 196348696668 570111547805 6973389
#> + Solar 1 7919027911 758541216561 7116172
#> <none> 766460244472 7121363
#> + Swiming.Pool 1 1379736 766458864737 7121364
#> + Garden 1 18324 766460226148 7121365
#>
#> Step: AIC=6457742
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace
#>
#> Df Sum of Sq RSS AIC
#> + Electric 1 195458365125 7812302801 4828325
#> + Solar 1 7960332630 195310335295 6437770
#> <none> 203270667925 6457742
#> + Swiming.Pool 1 43820 203270624105 6457744
#> + Garden 1 1465 203270666460 6457744
#>
#> Step: AIC=4828325
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace + Electric
#>
#> Df Sum of Sq RSS AIC
#> + Solar 1 7812302801 0 -15391789
#> + Garden 1 141896 7812160905 4828318
#> <none> 7812302801 4828325
#> + Swiming.Pool 1 1731 7812301069 4828327
#>
#> Step: AIC=-15391789
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace + Electric +
#> Solar
#>
#> Df Sum of Sq RSS AIC
#> <none> 0.00000002137 -15391789
#> + Garden 1 0.000000000000016711 0.00000002137 -15391787
#> + Swiming.Pool 1 0.000000000000016452 0.00000002137 -15391787
Let us look at the summary for our forward stepwise model
model_price_forward.
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134145 0.000000000 0.000000000 0.000000001 0.000004353
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.999999904196670 0.000000001590924 628565405640
#> Floors 14999.999999853393092 0.000000000584754 25651811499613
#> Fiber 11749.999999947447577 0.000000000584756 20093852022064
#> White.Marble 14000.000000052734322 0.000000000715830 19557713611217
#> City 3499.999999992095582 0.000000000358216 9770644292922
#> Glass.Doors 4450.000000006944902 0.000000000584755 7610023216287
#> Black.Marble 4999.999999999229658 0.000000000715993 6983310930652
#> Area 24.999999999998334 0.000000000004072 6138904376604
#> Baths 1250.000000000046839 0.000000000206742 6046192660042
#> Garage 1500.000000000003183 0.000000000357868 4191491031784
#> FirePlace 750.000000000298542 0.000000000206770 3627223549882
#> Electric 1249.999999999690544 0.000000000584753 2137653288745
#> Solar 249.999999999673605 0.000000000584758 427527631169
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.43e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
The best performance models using combination of the following predictor variables:
FloorsFiberWhite.MarbleCityGlass.DoorsBlack.MarbleAreaBathsGarageFirePlaceElectricSolarTotal predictor variables used above are combination of 12 variables,
except Indian.Marble, Swiming.Pool and
Garden.
# Backward Stepwise Regression
model_price_backward <- step(object = model_price_all,
direction = "backward",
scope = list(lower = model_price_base)
)#> Start: AIC=-15476871
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
#> Indian.Marble + Floors + City + Solar + Electric + Fiber +
#> Glass.Doors + Swiming.Pool + Garden
#>
#>
#> Step: AIC=-15476871
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
#> Floors + City + Solar + Electric + Fiber + Glass.Doors +
#> Swiming.Pool + Garden
#>
#> Df Sum of Sq RSS AIC
#> - Garden 1 0 0 -15476872
#> - Swiming.Pool 1 0 0 -15476872
#> <none> 0 -15476871
#> - Solar 1 7812159167 7812159167 4828320
#> - Electric 1 195310155285 195310155285 6437774
#> - FirePlace 1 562339506701 562339506701 6966532
#> - Garage 1 750908575108 750908575108 7111121
#> - Baths 1 1562468325721 1562468325721 7477490
#> - Area 1 1610761970684 1610761970684 7492711
#> - Black.Marble 1 2084362234399 2084362234399 7621588
#> - Glass.Doors 1 2475242609737 2475242609737 7707526
#> - City 1 4080340520010 4080340520010 7957447
#> - White.Marble 1 16348762026008 16348762026008 8651433
#> - Fiber 1 17257174438008 17257174438008 8678471
#> - Floors 1 28124618625749 28124618625749 8922680
#>
#> Step: AIC=-15476872
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
#> Floors + City + Solar + Electric + Fiber + Glass.Doors +
#> Swiming.Pool
#>
#> Df Sum of Sq RSS AIC
#> - Swiming.Pool 1 0 0 -15476873
#> <none> 0 -15476872
#> - Solar 1 7812301069 7812301069 4828327
#> - Electric 1 195310272173 195310272173 6437772
#> - FirePlace 1 562339536264 562339536264 6966530
#> - Garage 1 750908894209 750908894209 7111119
#> - Baths 1 1562472960134 1562472960134 7477490
#> - Area 1 1610765290914 1610765290914 7492710
#> - Black.Marble 1 2084363305721 2084363305721 7621587
#> - Glass.Doors 1 2475270016081 2475270016081 7707530
#> - City 1 4080346531325 4080346531325 7957446
#> - White.Marble 1 16348784685940 16348784685940 8651432
#> - Fiber 1 17257174439261 17257174439261 8678469
#> - Floors 1 28124625667462 28124625667462 8922678
#>
#> Step: AIC=-15476873
#> Prices ~ Area + Garage + FirePlace + Baths + White.Marble + Black.Marble +
#> Floors + City + Solar + Electric + Fiber + Glass.Doors
#>
#> Df Sum of Sq RSS AIC
#> <none> 0 -15476873
#> - Solar 1 7812302801 7812302801 4828325
#> - Electric 1 195310335295 195310335295 6437770
#> - FirePlace 1 562340465328 562340465328 6966529
#> - Garage 1 750909802193 750909802193 7111118
#> - Baths 1 1562480674329 1562480674329 7477490
#> - Area 1 1610765904412 1610765904412 7492708
#> - Black.Marble 1 2084363453100 2084363453100 7621585
#> - Glass.Doors 1 2475270433901 2475270433901 7707528
#> - City 1 4080346988017 4080346988017 7957444
#> - White.Marble 1 16348836692056 16348836692056 8651431
#> - Fiber 1 17257468537577 17257468537577 8678475
#> - Floors 1 28124626960386 28124626960386 8922676
Let us look at the summary for our backward stepwise model
model_price_backward.
#>
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
#> Black.Marble + Floors + City + Solar + Electric + Fiber +
#> Glass.Doors, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134260 0.000000000 0.000000000 0.000000001 0.000000359
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.99999980987025 0.00000000146116 684387154347
#> Area 25.00000000000009 0.00000000000374 6684089292557
#> Garage 1500.00000000008708 0.00000000032868 4563729715707
#> FirePlace 750.00000000030707 0.00000000018991 3949350666525
#> Baths 1250.00000000023556 0.00000000018988 6583144017318
#> White.Marble 13999.99999999921238 0.00000000065744 21294598533522
#> Black.Marble 4999.99999999932515 0.00000000065759 7603486054622
#> Floors 15000.00000000155524 0.00000000053706 27929902155431
#> City 3499.99999999960755 0.00000000032900 10638357415806
#> Solar 249.99999999945970 0.00000000053706 465495581371
#> Electric 1249.99999999946317 0.00000000053706 2327494383680
#> Fiber 11749.99999999948886 0.00000000053706 21878350420159
#> Glass.Doors 4449.99999999953161 0.00000000053706 8285855516810
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001899 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.695e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
The best performance models using combination of the following predictor variables:
AreaGarageFirePlaceBathsWhite.MarbleBlack.MarbleFloorsCitySolarElectricFiberGlass.DoorsTotal predictor variables used above are combination of 12 variables,
except Indian.Marble, Swiming.Pool and
Garden.
# Both
model_price_both <- step(object = model_price_base,
direction = "both",
scope = list(upper = model_price_all,
lower = model_price_base)
)#> Start: AIC=9401807
#> Prices ~ 1
#>
#> Df Sum of Sq RSS AIC
#> + Floors 1 28137670256676 45191105612975 9159783
#> + Fiber 1 17222144966830 56106630902821 9267960
#> + White.Marble 1 14727518113609 58601257756042 9289711
#> + Indian.Marble 1 10025454477185 63303321392466 9328302
#> + City 1 3989805570061 69338970299590 9373836
#> + Glass.Doors 1 2428230058991 70900545810660 9384972
#> + Area 1 1600050598200 71728725271451 9390779
#> + Baths 1 1543586418569 71785189451082 9391172
#> + Garage 1 737604900715 72591170968936 9396755
#> + FirePlace 1 582650002516 72746125867135 9397821
#> + Black.Marble 1 446697232267 72882078637384 9398754
#> + Electric 1 201670610161 73127105259490 9400432
#> + Solar 1 5209679561 73323566190090 9401774
#> <none> 73328775869651 9401807
#> + Swiming.Pool 1 234043881 73328541825770 9401808
#> + Garden 1 173920415 73328601949236 9401808
#>
#> Step: AIC=9159783
#> Prices ~ Floors
#>
#> Df Sum of Sq RSS AIC
#> + Fiber 1 17161774856129 28029330756847 8920961
#> + White.Marble 1 14724329410278 30466776202698 8962653
#> + Indian.Marble 1 10035180243376 35155925369599 9034231
#> + City 1 4003393234440 41187712378535 9113405
#> + Glass.Doors 1 2428367593536 42762738019440 9132168
#> + Area 1 1610488043954 43580617569021 9141641
#> + Baths 1 1555207029826 43635898583149 9142275
#> + Garage 1 746110257205 44444995355770 9151461
#> + FirePlace 1 581149541139 44609956071836 9153313
#> + Black.Marble 1 444090818181 44747014794794 9154847
#> + Electric 1 201282051847 44989823561128 9157553
#> + Solar 1 7437646280 45183667966695 9159703
#> + Swiming.Pool 1 269529965 45190836083010 9159782
#> + Garden 1 249560798 45190856052177 9159782
#> <none> 45191105612975 9159783
#> - Floors 1 28137670256676 73328775869651 9401807
#>
#> Step: AIC=8920961
#> Prices ~ Floors + Fiber
#>
#> Df Sum of Sq RSS AIC
#> + White.Marble 1 14742671033035 13286659723812 8547717
#> + Indian.Marble 1 10048366919693 17980963837154 8698994
#> + City 1 4048561433438 23980769323408 8842963
#> + Glass.Doors 1 2457747958687 25571582798160 8875078
#> + Area 1 1609279115650 26420051641197 8891398
#> + Baths 1 1562306360520 26467024396327 8892287
#> + Garage 1 750129260320 27279201496527 8907399
#> + FirePlace 1 569726306346 27459604450501 8910695
#> + Black.Marble 1 444499183390 27584831573457 8912970
#> + Electric 1 202433546973 27826897209874 8917338
#> + Solar 1 7265670476 28022065086371 8920833
#> + Garden 1 252502079 28029078254768 8920958
#> <none> 28029330756847 8920961
#> + Swiming.Pool 1 464385 28029330292462 8920963
#> - Fiber 1 17161774856129 45191105612975 9159783
#> - Floors 1 28077300145974 56106630902821 9267960
#>
#> Step: AIC=8547717
#> Prices ~ Floors + Fiber + White.Marble
#>
#> Df Sum of Sq RSS AIC
#> + City 1 4061895141055 9224764582757 8365285
#> + Glass.Doors 1 2462605272635 10824054451177 8445224
#> + Black.Marble 1 2076445268323 11210214455490 8462752
#> + Indian.Marble 1 2076445268323 11210214455490 8462752
#> + Area 1 1584782472730 11701877251082 8484214
#> + Baths 1 1538485207921 11748174515891 8486188
#> + Garage 1 746534973249 12540124750563 8518806
#> + FirePlace 1 564215464953 12722444258859 8526023
#> + Electric 1 200509914544 13086149809269 8540116
#> + Solar 1 8338855184 13278320868628 8547405
#> + Garden 1 149069376 13286510654437 8547714
#> <none> 13286659723812 8547717
#> + Swiming.Pool 1 43506897 13286616216915 8547717
#> - White.Marble 1 14742671033035 28029330756847 8920961
#> - Fiber 1 17180116478885 30466776202698 8962653
#> - Floors 1 28074080667113 41360740390925 9115501
#>
#> Step: AIC=8365285
#> Prices ~ Floors + Fiber + White.Marble + City
#>
#> Df Sum of Sq RSS AIC
#> + Glass.Doors 1 2457781442951 6766983139806 8210369
#> + Black.Marble 1 2081506095751 7143258487006 8237425
#> + Indian.Marble 1 2081506095751 7143258487006 8237425
#> + Area 1 1602371800156 7622392782601 8269886
#> + Baths 1 1542779602026 7681984980731 8273780
#> + Garage 1 743782030855 8480982551902 8323254
#> + FirePlace 1 564835861461 8659928721296 8333694
#> + Electric 1 199091680363 9025672902394 8354377
#> + Solar 1 8161007847 9216603574909 8364844
#> + Garden 1 95579979 9224669002778 8365282
#> <none> 9224764582757 8365285
#> + Swiming.Pool 1 35130383 9224729452374 8365285
#> - City 1 4061895141055 13286659723812 8547717
#> - White.Marble 1 14756004740652 23980769323408 8842963
#> - Fiber 1 17225391617828 26450156200585 8891968
#> - Floors 1 28087670960865 37312435543622 9064000
#>
#> Step: AIC=8210369
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors
#>
#> Df Sum of Sq RSS AIC
#> + Black.Marble 1 2082103704397 4684879435409 8026513
#> + Indian.Marble 1 2082103704397 4684879435409 8026513
#> + Area 1 1607244583078 5159738556728 8074786
#> + Baths 1 1549285292131 5217697847674 8080371
#> + Garage 1 749672149111 6017310990695 8151663
#> + FirePlace 1 565687046739 6201296093066 8166722
#> + Electric 1 197573697371 6569409442435 8195555
#> + Solar 1 8394120147 6758589019658 8209750
#> + Swiming.Pool 1 28014024 6766955125782 8210368
#> <none> 6766983139806 8210369
#> + Garden 1 20781279 6766962358527 8210369
#> - Glass.Doors 1 2457781442951 9224764582757 8365285
#> - City 1 4057071311372 10824054451177 8445224
#> - White.Marble 1 14760851566768 21527834706573 8789012
#> - Fiber 1 17254802476653 24021785616459 8843819
#> - Floors 1 28087748852556 34854731992362 9029933
#>
#> Step: AIC=8026513
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble
#>
#> Df Sum of Sq RSS AIC
#> + Area 1 1608175453133 3076703982276 7816274
#> + Baths 1 1555500334873 3129379100536 7824762
#> + Garage 1 743586714890 3941292720518 7940099
#> + FirePlace 1 566807351596 4118072083812 7962037
#> + Electric 1 197972686303 4486906749106 8004927
#> + Solar 1 8099860660 4676779574749 8025650
#> + Swiming.Pool 1 32266754 4684847168654 8026512
#> <none> 4684879435409 8026513
#> + Garden 1 12521931 4684866913477 8026514
#> - Black.Marble 1 2082103704397 6766983139806 8210369
#> - Glass.Doors 1 2458379051597 7143258487006 8237425
#> - City 1 4062129275196 8747008710605 8338697
#> - White.Marble 1 16399145659060 21084025094468 8778599
#> - Fiber 1 17257789445128 21942668880536 8798557
#> - Floors 1 28093555546010 32778434981418 8999226
#>
#> Step: AIC=7816274
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area
#>
#> Df Sum of Sq RSS AIC
#> + Baths 1 1556798499478 1519905482798 7463671
#> + Garage 1 745553282885 2331150699390 7677528
#> + FirePlace 1 566099969257 2510604013018 7714609
#> + Electric 1 198113917947 2878590064329 7782997
#> + Solar 1 7979583108 3068724399167 7814978
#> + Swiming.Pool 1 24009667 3076679972609 7816272
#> <none> 3076703982276 7816274
#> + Garden 1 2962361 3076701019915 7816276
#> - Area 1 1608175453133 4684879435409 8026513
#> - Black.Marble 1 2083034574452 5159738556728 8074786
#> - Glass.Doors 1 2463253972628 5539957954904 8110336
#> - City 1 4079747417786 7156451400062 8238350
#> - White.Marble 1 16377938345664 19454642327940 8738386
#> - Fiber 1 17256690991214 20333394973490 8760475
#> - Floors 1 28104013042666 31180717024941 8974243
#>
#> Step: AIC=7463671
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths
#>
#> Df Sum of Sq RSS AIC
#> + Garage 1 753445238325 766460244472 7121363
#> + FirePlace 1 564897553777 955007929020 7231331
#> + Electric 1 196952649137 1322952833661 7394282
#> + Solar 1 8147991183 1511757491615 7460985
#> <none> 1519905482798 7463671
#> + Swiming.Pool 1 4538430 1519900944368 7463671
#> + Garden 1 189502 1519905293296 7463673
#> - Baths 1 1556798499478 3076703982276 7816274
#> - Area 1 1609473617738 3129379100536 7824762
#> - Black.Marble 1 2089253977860 3609159460658 7896082
#> - Glass.Doors 1 2469785555845 3989691038643 7946202
#> - City 1 4084079035568 5603984518365 8116084
#> - White.Marble 1 16364805574575 17884711057373 8696318
#> - Fiber 1 17263849668437 18783755151235 8720841
#> - Floors 1 28115643074462 29635548557259 8948832
#>
#> Step: AIC=7121363
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage
#>
#> Df Sum of Sq RSS AIC
#> + FirePlace 1 563189576547 203270667925 6457742
#> + Electric 1 196348696668 570111547805 6973389
#> + Solar 1 7919027911 758541216561 7116172
#> <none> 766460244472 7121363
#> + Swiming.Pool 1 1379736 766458864737 7121364
#> + Garden 1 18324 766460226148 7121365
#> - Garage 1 753445238325 1519905482798 7463671
#> - Baths 1 1564690454918 2331150699390 7677528
#> - Area 1 1611454680757 2377914925229 7687459
#> - Black.Marble 1 2083134304329 2849594548801 7777935
#> - Glass.Doors 1 2475723608373 3242183852846 7842470
#> - City 1 4081309331261 4847769575733 8043606
#> - White.Marble 1 16352865847243 17119326091715 8674451
#> - Fiber 1 17267920331614 18034380576087 8700487
#> - Floors 1 28124203167579 28890663412051 8936106
#>
#> Step: AIC=6457742
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace
#>
#> Df Sum of Sq RSS AIC
#> + Electric 1 195458365125 7812302801 4828325
#> + Solar 1 7960332630 195310335295 6437770
#> <none> 203270667925 6457742
#> + Swiming.Pool 1 43820 203270624105 6457744
#> + Garden 1 1465 203270666460 6457744
#> - FirePlace 1 563189576547 766460244472 7121363
#> - Garage 1 751737261095 955007929020 7231331
#> - Baths 1 1563477838353 1766748506278 7538919
#> - Area 1 1610745649461 1814016317387 7552121
#> - Black.Marble 1 2084255608832 2287526276757 7668084
#> - Glass.Doors 1 2476565823927 2679836491852 7747227
#> - City 1 4081928408880 4285199076805 7981932
#> - White.Marble 1 16349433795392 16552704463317 8657624
#> - Fiber 1 17256526080070 17459796747996 8684299
#> - Floors 1 28122732979287 28326003647212 8926239
#>
#> Step: AIC=4828325
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace + Electric
#>
#> Df Sum of Sq RSS AIC
#> + Solar 1 7812302801 0 -15391789
#> + Garden 1 141896 7812160905 4828318
#> <none> 7812302801 4828325
#> + Swiming.Pool 1 1731 7812301069 4828327
#> - Electric 1 195458365125 203270667925 6457742
#> - FirePlace 1 562299245004 570111547805 6973389
#> - Garage 1 751136710960 758949013760 7116441
#> - Baths 1 1562316285510 1570128588311 7479930
#> - Area 1 1610885328953 1618697631753 7495162
#> - Black.Marble 1 2084651543827 2092463846627 7623522
#> - Glass.Doors 1 2475044981222 2482857284022 7709056
#> - City 1 4080521442083 4088333744884 7958420
#> - White.Marble 1 16348265198893 16356077501694 8651651
#> - Fiber 1 17257647498350 17265459801151 8678705
#> - Floors 1 28122340428933 28130152731733 8922772
#>
#> Step: AIC=-15391789
#> Prices ~ Floors + Fiber + White.Marble + City + Glass.Doors +
#> Black.Marble + Area + Baths + Garage + FirePlace + Electric +
#> Solar
#>
#> Df Sum of Sq RSS AIC
#> <none> 0 -15391789
#> + Garden 1 0 0 -15391787
#> + Swiming.Pool 1 0 0 -15391787
#> - Solar 1 7812302801 7812302801 4828325
#> - Electric 1 195310335295 195310335295 6437770
#> - FirePlace 1 562340465328 562340465328 6966529
#> - Garage 1 750909802193 750909802193 7111118
#> - Baths 1 1562480674329 1562480674329 7477490
#> - Area 1 1610765904413 1610765904413 7492708
#> - Black.Marble 1 2084363453100 2084363453100 7621585
#> - Glass.Doors 1 2475270433900 2475270433900 7707528
#> - City 1 4080346988017 4080346988017 7957444
#> - White.Marble 1 16348836692056 16348836692056 8651431
#> - Fiber 1 17257468537578 17257468537578 8678475
#> - Floors 1 28124626960385 28124626960385 8922676
Let us look at the summary for our both stepwise model
model_price_both.
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Black.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = house)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000134145 0.000000000 0.000000000 0.000000001 0.000004353
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 999.999999904196670 0.000000001590924 628565405640
#> Floors 14999.999999853393092 0.000000000584754 25651811499613
#> Fiber 11749.999999947447577 0.000000000584756 20093852022064
#> White.Marble 14000.000000052734322 0.000000000715830 19557713611217
#> City 3499.999999992095582 0.000000000358216 9770644292922
#> Glass.Doors 4450.000000006944902 0.000000000584755 7610023216287
#> Black.Marble 4999.999999999229658 0.000000000715993 6983310930652
#> Area 24.999999999998334 0.000000000004072 6138904376604
#> Baths 1250.000000000046839 0.000000000206742 6046192660042
#> Garage 1500.000000000003183 0.000000000357868 4191491031784
#> FirePlace 750.000000000298542 0.000000000206770 3627223549882
#> Electric 1249.999999999690544 0.000000000584753 2137653288745
#> Solar 249.999999999673605 0.000000000584758 427527631169
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber <0.0000000000000002 ***
#> White.Marble <0.0000000000000002 ***
#> City <0.0000000000000002 ***
#> Glass.Doors <0.0000000000000002 ***
#> Black.Marble <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric <0.0000000000000002 ***
#> Solar <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000002067 on 499987 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.43e+26 on 12 and 499987 DF, p-value: < 0.00000000000000022
The best performance models using combination of the following predictor variables:
FloorsFiberWhite.MarbleCityGlass.DoorsBlack.MarbleAreaBathsGarageFirePlaceElectricSolarTotal predictor variables used above are combination of 12 variables,
except Indian.Marble, Swiming.Pool and
Garden.
#> [1] 1
#> [1] 1
#> [1] 1
All three step-wise models also gives the best performance with r-squared value equals to 1.
Make prediction for each model
price_pred_forward <- predict(model_price_forward, house)
price_pred_backward <- predict(model_price_backward, house)
price_pred_both <- predict(model_price_both, house)Prediction Evaluation using Error Metrics (RMSE)
#> [1] 0.0000002071309
#> [1] 0.0000001901501
#> [1] 0.0000002071309
Even though all the error from three step-wise models above almost
non-existent, model_price_backward give the lowest error
value of 0.0000001901501.
We can conclude that the best performance models based on our
Stepwise Regression Models are using the combination
from 12 predictor variables which essentially all variables except
Indian.Marble, Swiming.Pool and
Garden.
Although all three models gives r-squared value equals to 1, but the
best modelling is using backward step-wise regression
model_price_backward with error value of RMSE equals to
0.0000001901501.
The RMSE value from our model_price_backward has the
same value as our model model_price_signf which essentially
also using all variables as its predictor variables except variables
Indian.Marble, Swiming.Pool and
Garden.
However, out of our previous modelings without Stepwise Regression,
there are two models that give slightly lower RMSE values compare to
model_price_backward and model_price_signf,
which are:
- `model_price_all`
- `model_price_no_IndianMarble`
With RMSE value of 0.0000001901459
As we have several best models to choose from let us choose
model_price_no_IndianMarble to check for Breusch-Pagan for
Homoscedasticity Test and No Multicolinearity because it seems that our
model has best performance so far to the point almost perfect.
Let us do Breusch-Pagan testing using the following hypotheses:
\(H_0\): Error is considered
Homoscedasticity
\(H_1\): Error is considered
Heteroscedasticity
#>
#> studentized Breusch-Pagan test
#>
#> data: model_price_no_IndianMarble
#> BP = 13.316, df = 14, p-value = 0.5018
Conclusion :
VIF (Variance Inflation Factor) Testing will indicate if:
#> Area Garage FirePlace Baths White.Marble Black.Marble
#> 1.000024 1.000033 1.000011 1.000037 1.331392 1.331387
#> Floors City Solar Electric Fiber Glass.Doors
#> 1.000012 1.000026 1.000037 1.000010 1.000036 1.000028
#> Swiming.Pool Garden
#> 1.000030 1.000039
Conclusion : All predictor variables used in
model_price_no_IndianMarble passed the No Multicolinearity
Test as the VIF values are all less than 10
It seems that we have created several best models to analyze and predict House Pricing based on the given dataset.
At the same time, we may need to be cautious whether these models are indeed provide us with the correct solution for the purpose of to analyze the list of House Attributes that can affect House Pricing, because these models give results that are near perfect with almost non-existent error.
The best models conclude that the more attributes within the House (up to the 12 variables), the best house pricing it will generates.
Further studies may need to be conducted with different dataset to check the validity of these models because using the provided dataset, these are already the best models created.