The purpose of machine learning is to build models that learn patterns from data so that they can predict future outcomes. Here we use a house price data set to find out which variables are related to the price of a house, and which model best predicts the price from the variables in the data set.
Load the required packages.
library(dplyr)
library(MLmetrics)
library(lmtest)
library(car)
library(GGally)
library(performance)
library(nortest)
options(scipen = 100, max.print = 1e+06)
Read data
# read data house price
houseprice <- read.csv("HousePrices_HalfMil.csv")
rmarkdown::paged_table(houseprice)
str(houseprice)
#> 'data.frame': 500000 obs. of 16 variables:
#> $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
#> $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
#> $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
#> $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
#> $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
#> $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
#> $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
#> $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
#> $ City : int 3 2 2 1 2 1 3 1 1 2 ...
#> $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
#> $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
#> $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
#> $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
#> $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
#> $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
#> $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
The data has 500,000 rows and 16 variables. Below is the description of the variables:
- Area: area of the house
- Garage: number of cars the garage holds
- FirePlace: number of fireplaces in the house
- Baths: number of bathrooms in the house
- White.Marble: whether the house uses white marble (0 = No, 1 = Yes)
- Black.Marble: whether the house uses black marble (0 = No, 1 = Yes)
- Indian.Marble: whether the house uses Indian marble (0 = No, 1 = Yes)
- Floors: number of floors in the house
- City: city where the house is located
- Solar: whether the house uses solar panels (0 = No, 1 = Yes)
- Electric: whether the house uses electricity for the solar panel (0 = No, 1 = Yes)
- Fiber: whether the house uses fiber (0 = No, 1 = Yes)
- Glass.Doors: whether the house uses glass doors (0 = No, 1 = Yes)
- Swiming.Pool: whether the house has a swimming pool (0 = No, 1 = Yes)
- Garden: whether the house has a garden (0 = No, 1 = Yes)
- Prices: price of the house
We want to predict the price of a house from the other variables, so our target variable is Prices.
Before going further, we need to make sure that the data is clean and that every column has the proper data type. Some variables are stored as integers but are actually categorical, so we convert them to factors.
# transform int into factor
houseprice_clean <- houseprice %>%
  mutate_at(
    c("White.Marble", "Black.Marble", "Indian.Marble", "City", "Solar",
      "Electric", "Fiber", "Glass.Doors", "Swiming.Pool", "Garden"),
    factor
  )
glimpse(houseprice_clean)
#> Rows: 500,000
#> Columns: 16
#> $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, ~
#> $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,~
#> $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,~
#> $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,~
#> $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
#> $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,~
#> $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,~
#> $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,~
#> $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,~
#> $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,~
#> $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,~
#> $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,~
#> $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,~
#> $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,~
#> $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,~
#> $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, ~
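The same conversion can be sketched in base R, which may help when dplyr is not available. The tiny data frame below is a made-up stand-in for the real data (its values are copied from the first rows shown above), and `cat_cols` is a hypothetical helper name.

```r
# Hypothetical three-row stand-in for the house price data
toy <- data.frame(
  Area  = c(164L, 84L, 190L),
  City  = c(3L, 2L, 2L),
  Solar = c(1L, 0L, 0L)
)

# Columns that hold categories rather than quantities
cat_cols <- c("City", "Solar")

# Convert each listed column to a factor, leaving the others untouched
toy[cat_cols] <- lapply(toy[cat_cols], factor)

str(toy)
```

Assigning back into `toy[cat_cols]` keeps the column order and the remaining integer columns as they were.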
# check if there is missing value
colSums(is.na(houseprice_clean))
#> Area Garage FirePlace Baths White.Marble
#> 0 0 0 0 0
#> Black.Marble Indian.Marble Floors City Solar
#> 0 0 0 0 0
#> Electric Fiber Glass.Doors Swiming.Pool Garden
#> 0 0 0 0 0
#> Prices
#> 0
Now we can explore the data to see whether any pattern shows a relationship between the variables.
Check the distribution of the Prices variable:
boxplot(houseprice_clean$Prices)
Check the distribution of the Area, Garage, FirePlace, Baths and Floors variables:
boxplot(houseprice_clean$Area)
boxplot(houseprice_clean$Garage)
boxplot(houseprice_clean$FirePlace)
boxplot(houseprice_clean$Baths)
boxplot(houseprice_clean$Floors)
Check the correlation between variables:
ggcorr(houseprice_clean, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)
💡 Insight:
- Area, Garage, FirePlace, Baths and Floors have a positive correlation with Prices
- Floors has the strongest positive correlation with Prices
Let’s use all variables other than Prices as our predictor variables.
model_houseprice_multi <- lm(Prices ~ ., houseprice_clean)
summary(model_houseprice_multi)
#>
#> Call:
#> lm(formula = Prices ~ ., data = houseprice_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000114740 0.000000000 0.000000000 0.000000001 0.000000504
#>
#> Coefficients: (1 not defined because of singularities)
#> Estimate Std. Error t value
#> (Intercept) 4499.999999838227268 0.000000001204656 3735506482202.710
#> Area 24.999999999988276 0.000000000003196 7821089712763.962
#> Garage 1500.000000000106411 0.000000000280896 5340047166572.192
#> FirePlace 750.000000000248633 0.000000000162297 4621155359901.714
#> Baths 1250.000000000239197 0.000000000162276 7702942606260.459
#> White.Marble1 13999.999999999954525 0.000000000561868 24916899410430.270
#> Black.Marble1 4999.999999999263309 0.000000000561994 8896888137234.410
#> Indian.Marble1 NA NA NA
#> Floors 15000.000000000349246 0.000000000458984 32680891666696.090
#> City2 3500.000000000096406 0.000000000562242 6225082231790.734
#> City3 6999.999999999426109 0.000000000562339 12448001978974.068
#> Solar1 249.999999999535987 0.000000000458990 544674534492.101
#> Electric1 1249.999999999550255 0.000000000458983 2723414492680.783
#> Fiber1 11749.999999999592546 0.000000000458990 25599693872701.074
#> Glass.Doors1 4449.999999999497959 0.000000000458986 9695282581124.541
#> Swiming.Pool1 0.000000000462530 0.000000000458987 1.008
#> Garden1 0.000000000462974 0.000000000458991 1.009
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Indian.Marble1 NA
#> Floors <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> Swiming.Pool1 0.314
#> Garden1 0.313
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001623 on 499984 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 1.856e+26 on 15 and 499984 DF, p-value: < 0.00000000000000022
💡 Insight:
Each coefficient is the expected change in Prices when that predictor changes and all other predictors are held constant:
- Area: each additional unit of Area increases Prices by 25.
- Garage: each additional unit of Garage increases Prices by 1500.
- FirePlace: each additional fireplace increases Prices by 750.
- Baths: each additional bathroom increases Prices by 1250.
- Floors: each additional unit of Floors increases Prices by 15000.
- White.Marble: Prices are 14000 higher than for a house with neither white nor black marble.
- Black.Marble: Prices are 5000 higher than for a house with neither white nor black marble.
- City 2: Prices are 3500 higher than in City 1.
- City 3: Prices are 7000 higher than in City 1.
- Solar: Prices are 250 higher than without it.
- Electric: Prices are 1250 higher than without it.
- Fiber: Prices are 11750 higher than without it.
- Glass.Doors: Prices are 4450 higher than without them.
Significant predictors: based on the Pr(>|t|) values, every estimated predictor is significant except Swiming.Pool and Garden (p-values 0.314 and 0.313). Indian.Marble has no estimate at all. Its coefficient is NA because of a singularity, which means the column is an exact linear combination of other predictors (most likely the other two marble indicators), so it adds no information once they are in the model. We can therefore try removing these three predictors.
Adjusted R-squared: because this is a multiple linear regression (more than one predictor variable), we read the Adjusted R-squared, which is equal to 1. The predictors explain 100% of the variance in Prices, leaving 0% for unused variables. A fit this exact, with residuals around 0.0000001, suggests that Prices in this data set is a deterministic function of the predictors.
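The NA coefficient can be reproduced on a small simulated example. The sketch below assumes, as a guess about the real data, that every house has exactly one marble type. The data frame `toy` and its coefficients are invented for illustration only.

```r
set.seed(42)
n <- 200

# Hypothetical toy data: each house gets exactly one marble type
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  white  = as.integer(marble == "white"),
  black  = as.integer(marble == "black"),
  indian = as.integer(marble == "indian"),
  area   = runif(n, 50, 250)
)
toy$price <- 4500 + 25 * toy$area + 14000 * toy$white +
  5000 * toy$black + rnorm(n, sd = 100)

# white + black + indian is always 1, the same as the intercept column,
# so lm() cannot estimate all three indicators and drops the last one
fit <- lm(price ~ white + black + indian + area, data = toy)
coef(fit)           # the coefficient for `indian` is NA

# alias() reports which terms are linearly dependent on the others
alias(fit)$Complete
```

Dropping one of the three indicators, or storing the marble type as a single factor, avoids the singularity.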
Now let’s create a second model that uses only the three predictors with the largest t values in model_houseprice_multi, which are White.Marble, Floors and Fiber.
model_houseprice_multi2 <- lm(Prices ~ White.Marble+Floors+Fiber, data = houseprice_clean)
summary(model_houseprice_multi2)
#>
#> Call:
#> lm(formula = Prices ~ White.Marble + Floors + Fiber, data = houseprice_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -17547 -3594 6 3589 17501
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 24862.19 13.63 1823.8 <0.0000000000000002 ***
#> White.Marble1 11521.81 15.47 744.8 <0.0000000000000002 ***
#> Floors 14986.44 14.58 1027.8 <0.0000000000000002 ***
#> Fiber1 11723.54 14.58 804.1 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 5155 on 499996 degrees of freedom
#> Multiple R-squared: 0.8188, Adjusted R-squared: 0.8188
#> F-statistic: 7.532e+05 on 3 and 499996 DF, p-value: < 0.00000000000000022
The Adjusted R-squared of this model is 0.8188, so the three predictors explain about 82% of the variance in Prices.
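Since we do not have new data, let’s use the training data to check the performance of the models, i.e. predict house prices for the same rows the models were fitted on. Keep in mind that this measures how well the models fit the data, not how well they perform on unseen data.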
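When no separate test set exists, a common alternative is to hold out part of the data before fitting. The sketch below is not part of the original analysis. It uses simulated data, an 80/20 split chosen arbitrarily, and hypothetical names (`toy`, `train_idx`, `rmse_test`).

```r
set.seed(123)
n <- 1000

# Hypothetical simulated data with a known linear relationship plus noise
toy <- data.frame(
  area   = runif(n, 50, 250),
  floors = rbinom(n, 1, 0.5)
)
toy$price <- 4500 + 25 * toy$area + 15000 * toy$floors + rnorm(n, sd = 500)

# Randomly assign 80% of the rows to training, the rest to testing
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Fit on the training rows only, then predict the held-out rows
fit <- lm(price ~ area + floors, data = train)
pred_test <- predict(fit, newdata = test)

# RMSE on rows the model has never seen
rmse_test <- sqrt(mean((test$price - pred_test)^2))
rmse_test
```

The held-out RMSE is usually a more honest estimate of prediction error than the training RMSE.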
Let’s compare the performance of two models:
1. model_houseprice_multi: all predictor variables (numeric and categorical)
2. model_houseprice_multi2: three significant predictor variables (White.Marble, Floors and Fiber)
# save the prediction result in new column
houseprice_clean$pred_multi <- predict(object = model_houseprice_multi, newdata = houseprice_clean)
houseprice_clean$pred_multi2 <- predict(object = model_houseprice_multi2, newdata = houseprice_clean)
head(houseprice_clean)
Now let’s evaluate the models to find out which one is better.
First, we look at the goodness of fit based on the Adjusted R-squared value.
# check r-squared value for each model
summary(model_houseprice_multi)$adj.r.squared # all predictors
#> [1] 1
summary(model_houseprice_multi2)$adj.r.squared # 3 predictors
#> [1] 0.8188059
The best model based on Adjusted R-squared is model_houseprice_multi.
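To make the Adjusted R-squared less of a black box, the sketch below recomputes it by hand as 1 - (1 - R²)(n - 1)/(n - p - 1), where p is the number of predictors, and compares it with the value reported by `summary()`. The data are simulated and the names `toy` and `adj_manual` are illustrative only.

```r
set.seed(11)
n <- 100

# Hypothetical data: x2 is pure noise and does not affect y
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

r2 <- summary(fit)$r.squared
p  <- length(coef(fit)) - 1   # number of predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(manual = adj_manual, summary = summary(fit)$adj.r.squared)
```

Because of the penalty, adding a useless predictor can lower the Adjusted R-squared even though the plain R-squared never decreases.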
Remember that the purpose of a regression model is to minimize prediction error, so we also measure the difference between the actual and the predicted values using the root mean squared error (RMSE).
# check RMSE for each model
RMSE(y_pred = houseprice_clean$pred_multi, y_true = houseprice_clean$Prices)
#> [1] 0.0000001625065
RMSE(y_pred = houseprice_clean$pred_multi2, y_true = houseprice_clean$Prices)
#> [1] 5154.932
The best model based on RMSE is model_houseprice_multi. On average, its predictions miss the actual price by about 0.0000001625065.
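RMSE is the square root of the mean squared difference between actual and predicted values. A minimal base R version is sketched below. The function name `rmse` and the three example prices are made up for illustration.

```r
# Root mean squared error: square the errors, average them, take the root
rmse <- function(y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  sqrt(mean((y_true - y_pred)^2))
}

# Errors are -10, 10 and 0, so the mean squared error is 200 / 3
rmse(y_true = c(100, 200, 300), y_pred = c(110, 190, 300))
```

Because the errors are squared before averaging, large misses weigh more heavily than small ones, and the result is in the same unit as Prices.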
As another way to find the best model, let’s try stepwise regression. Stepwise regression looks for a good combination of predictors by adding or removing one predictor at a time and keeping the change that most improves the Akaike Information Criterion (AIC). AIC estimates how much information a model loses, balancing goodness of fit against the number of parameters. The model with the smallest AIC is preferred.
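As a rough illustration of the criterion, the sketch below computes AIC by hand as -2 × log-likelihood + 2 × k, where k is the number of estimated parameters, and checks it against `AIC()`. The data are simulated and the names `fit_small`, `fit_big` and `aic_manual` are hypothetical.

```r
set.seed(1)

# Hypothetical data: only x1 is related to y
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 2 + 3 * toy$x1 + rnorm(100)

fit_small <- lm(y ~ x1, data = toy)
fit_big   <- lm(y ~ x1 + x2, data = toy)

# k counts the regression coefficients plus the residual variance
ll <- logLik(fit_small)
aic_manual <- -2 * as.numeric(ll) + 2 * attr(ll, "df")

c(manual = aic_manual, builtin = AIC(fit_small))

# Side-by-side comparison: the smaller AIC is preferred
AIC(fit_small, fit_big)
```

Note that `step()` uses `extractAIC()`, which differs from `AIC()` by an additive constant for a given data set, so the two rank models the same way.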
# stepwise regression: backward elimination
model_houseprice_backward <- step(object = model_houseprice_multi,
                                  direction = "backward",
                                  trace = FALSE) # step-wise process not shown
# summary model backward
summary(model_houseprice_backward)
#>
#> Call:
#> lm(formula = Prices ~ Area + Garage + FirePlace + Baths + White.Marble +
#> Black.Marble + Floors + City + Solar + Electric + Fiber +
#> Glass.Doors, data = houseprice_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000114741 0.000000000 0.000000000 0.000000001 0.000000504
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 4499.999999838680196 0.000000001160900 3876302704558
#> Area 24.999999999988287 0.000000000003196 7821099005137
#> Garage 1500.000000000106411 0.000000000280896 5340051354462
#> FirePlace 750.000000000248974 0.000000000162297 4621159145740
#> Baths 1250.000000000240107 0.000000000162275 7702972827634
#> White.Marble1 13999.999999999954525 0.000000000561866 24916955485305
#> Black.Marble1 4999.999999999263309 0.000000000561994 8896890444215
#> Floors 15000.000000000349246 0.000000000458984 32680895394628
#> City2 3500.000000000097316 0.000000000562241 6225092621617
#> City3 6999.999999999427018 0.000000000562339 12448011458101
#> Solar1 249.999999999533827 0.000000000458985 544679525694
#> Electric1 1249.999999999550710 0.000000000458982 2723415655671
#> Fiber1 11749.999999999592546 0.000000000458986 25599911210639
#> Glass.Doors1 4449.999999999498868 0.000000000458984 9695336766293
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> Black.Marble1 <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001623 on 499986 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 2.142e+26 on 13 and 499986 DF, p-value: < 0.00000000000000022
# create model without predictor variables
model_houseprice_none <- lm(Prices ~ 1, data = houseprice_clean)
# stepwise regression: forward selection
model_houseprice_forward <- step(
  object = model_houseprice_none, # lower limit
  direction = "forward",
  scope = list(upper = model_houseprice_multi), # upper limit
  trace = FALSE) # step-wise process not shown
# summary model forward
summary(model_houseprice_forward)
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = houseprice_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000105009 0.000000000 0.000000000 0.000000001 0.000016350
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 9499.999999827012289 0.000000001077595 8815929964174
#> Floors 15000.000000046766218 0.000000000425905 35219132098377
#> Fiber1 11749.999999995810867 0.000000000425907 27588187035442
#> White.Marble1 8999.999999984282113 0.000000000522011 17241008183984
#> City2 3500.000000009361429 0.000000000521720 6708578719096
#> City3 7000.000000005459697 0.000000000521811 13414814821034
#> Glass.Doors1 4450.000000006350092 0.000000000425905 10448347335250
#> Indian.Marble1 -5000.000000000656655 0.000000000521491 -9587887848100
#> Area 24.999999999999648 0.000000000002966 8428542599274
#> Baths 1250.000000000126420 0.000000000150580 8301241881238
#> Garage 1500.000000000003183 0.000000000260652 5754798691823
#> FirePlace 750.000000000237151 0.000000000150600 4980072070729
#> Electric1 1249.999999999593683 0.000000000425904 2934935979493
#> Solar1 249.999999999554802 0.000000000425906 586983310434
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> Indian.Marble1 <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001506 on 499986 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 2.488e+26 on 13 and 499986 DF, p-value: < 0.00000000000000022
# stepwise regression: both
model_houseprice_both <- step(
  object = model_houseprice_none, # lower limit
  direction = "both",
  scope = list(upper = model_houseprice_multi), # upper limit
  trace = FALSE
)
# summary model both
summary(model_houseprice_both)
#>
#> Call:
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = houseprice_clean)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.000105009 0.000000000 0.000000000 0.000000001 0.000016350
#>
#> Coefficients:
#> Estimate Std. Error t value
#> (Intercept) 9499.999999827012289 0.000000001077595 8815929964174
#> Floors 15000.000000046766218 0.000000000425905 35219132098377
#> Fiber1 11749.999999995810867 0.000000000425907 27588187035442
#> White.Marble1 8999.999999984282113 0.000000000522011 17241008183984
#> City2 3500.000000009361429 0.000000000521720 6708578719096
#> City3 7000.000000005459697 0.000000000521811 13414814821034
#> Glass.Doors1 4450.000000006350092 0.000000000425905 10448347335250
#> Indian.Marble1 -5000.000000000656655 0.000000000521491 -9587887848100
#> Area 24.999999999999648 0.000000000002966 8428542599274
#> Baths 1250.000000000126420 0.000000000150580 8301241881238
#> Garage 1500.000000000003183 0.000000000260652 5754798691823
#> FirePlace 750.000000000237151 0.000000000150600 4980072070729
#> Electric1 1249.999999999593683 0.000000000425904 2934935979493
#> Solar1 249.999999999554802 0.000000000425906 586983310434
#> Pr(>|t|)
#> (Intercept) <0.0000000000000002 ***
#> Floors <0.0000000000000002 ***
#> Fiber1 <0.0000000000000002 ***
#> White.Marble1 <0.0000000000000002 ***
#> City2 <0.0000000000000002 ***
#> City3 <0.0000000000000002 ***
#> Glass.Doors1 <0.0000000000000002 ***
#> Indian.Marble1 <0.0000000000000002 ***
#> Area <0.0000000000000002 ***
#> Baths <0.0000000000000002 ***
#> Garage <0.0000000000000002 ***
#> FirePlace <0.0000000000000002 ***
#> Electric1 <0.0000000000000002 ***
#> Solar1 <0.0000000000000002 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 0.0000001506 on 499986 degrees of freedom
#> Multiple R-squared: 1, Adjusted R-squared: 1
#> F-statistic: 2.488e+26 on 13 and 499986 DF, p-value: < 0.00000000000000022
Let’s compare the Adjusted R-squared values of these three models:
summary(model_houseprice_backward)$adj.r.squared
#> [1] 1
summary(model_houseprice_forward)$adj.r.squared
#> [1] 1
summary(model_houseprice_both)$adj.r.squared
#> [1] 1
Let’s compare all the models that we have created.
comparison <- compare_performance(model_houseprice_none, model_houseprice_multi, model_houseprice_multi2, model_houseprice_backward, model_houseprice_forward, model_houseprice_both)
as.data.frame(comparison)
💡 Insight:
- model_houseprice_none and model_houseprice_multi2 are the weakest models: they are the only ones whose Adjusted R-squared is below 1.
- model_houseprice_forward and model_houseprice_both select the same formula and have the smallest residual standard error (0.0000001506).
In this section, let’s use one of the best models from the stepwise regression, i.e. model_houseprice_forward.
# ordinary prediction
pred_model_step <- predict(model_houseprice_forward, newdata = houseprice_clean)
head(pred_model_step)
#> 1 2 3 4 5 6
#> 43800 37550 49500 50075 52400 54300
If we need a range rather than a single value, we can add interval = "prediction" to get a prediction interval.
pred_model_step_interval <- predict(
  object = model_houseprice_forward,
  newdata = houseprice_clean,
  interval = "prediction",
  level = 0.95) # level of confidence, with 0.05 error rate
head(pred_model_step_interval)
#> fit lwr upr
#> 1 43800 43800 43800
#> 2 37550 37550 37550
#> 3 49500 49500 49500
#> 4 50075 50075 50075
#> 5 52400 52400 52400
#> 6 54300 54300 54300
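In the output above the lower and upper bounds coincide with the fit because the residual error is practically zero. On noisier data the interval has a visible width. The sketch below uses simulated data (the names `toy`, `new_house`, `pred_int` and `conf_int` are hypothetical) to contrast a prediction interval with a confidence interval.

```r
set.seed(7)

# Hypothetical noisy data: price depends on area plus random error
toy <- data.frame(area = runif(100, 50, 250))
toy$price <- 4500 + 25 * toy$area + rnorm(100, sd = 300)

fit <- lm(price ~ area, data = toy)
new_house <- data.frame(area = c(100, 200))

# Prediction interval: range for the price of an individual new house
pred_int <- predict(fit, newdata = new_house,
                    interval = "prediction", level = 0.95)

# Confidence interval: range for the average price at that area
conf_int <- predict(fit, newdata = new_house,
                    interval = "confidence", level = 0.95)

pred_int
conf_int
```

The prediction interval is always wider than the confidence interval because it also accounts for the error of a single observation.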
Both evaluation criteria point to model_houseprice_forward or model_houseprice_both:
- highest R-squared value
- smallest RMSE value
Below are some assumptions we need to check before the model can be considered the Best Linear Unbiased Estimator (BLUE), i.e. the linear unbiased estimator with the smallest variance.
Normality of residuals: the errors of the model should be normally distributed, gathered around zero.
# residual histogram
hist(model_houseprice_forward$residuals)
# Anderson-Darling test from residual
ad.test(model_houseprice_forward$residuals)
#>
#> Anderson-Darling normality test
#>
#> data: model_houseprice_forward$residuals
#> A = 190847, p-value < 0.00000000000000022
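With p-value < alpha (0.05), we reject the null hypothesis that the residuals are normally distributed, so the assumption is not fulfilled and we need to look at how to handle this. Note that the residuals here are tiny (around 0.0000001), and that with 500,000 observations a normality test flags even very small departures from normality.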
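One hedged way to look at normality on very large data is to test a random subsample, since `shapiro.test()` from base R accepts at most 5000 values. The sketch below uses simulated stand-in residuals (`res`) rather than the real model residuals, and the subsample size is an arbitrary choice.

```r
set.seed(99)

# Stand-in for model residuals: 500,000 simulated normal values
res <- rnorm(500000)

# shapiro.test() accepts between 3 and 5000 observations,
# so draw a random subsample of the residuals
res_sample <- sample(res, 5000)

sw <- shapiro.test(res_sample)
sw$p.value
```

A Q-Q plot via `qqnorm()` and `qqline()` on the same subsample gives a visual check that does not depend on the sample size.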
# scatter plot
plot(x = model_houseprice_forward$fitted.values, y = model_houseprice_forward$residuals)
abline(h = 0, col = "red") # horizontal line at 0
# bptest of model
bptest(model_houseprice_forward)
#>
#> studentized Breusch-Pagan test
#>
#> data: model_houseprice_forward
#> BP = 11.606, df = 13, p-value = 0.5602
With p-value > alpha (0.05), we fail to reject the null hypothesis that the residuals have constant variance (homoscedasticity). Therefore, the assumption is fulfilled.
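The idea behind the studentized Breusch-Pagan test can be sketched in base R: regress the squared residuals on the predictors and use n × R² of that auxiliary regression as a chi-squared statistic. The data below are simulated with constant error variance, and the names `aux`, `bp_stat` and `bp_p` are hypothetical.

```r
set.seed(2024)
n <- 500

# Hypothetical data with constant error variance (homoscedastic)
toy <- data.frame(x = runif(n, 1, 10))
toy$y <- 1 + 2 * toy$x + rnorm(n, sd = 1)

fit <- lm(y ~ x, data = toy)

# Auxiliary regression: do the predictors explain the squared residuals?
toy$res_sq <- resid(fit)^2
aux <- lm(res_sq ~ x, data = toy)

# Studentized (Koenker) statistic: n times R-squared of the auxiliary model,
# compared with a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p_value = bp_p)
```

A small p-value would indicate that the error variance changes with the predictors.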
summary(model_houseprice_forward)$call
#> lm(formula = Prices ~ Floors + Fiber + White.Marble + City +
#> Glass.Doors + Indian.Marble + Area + Baths + Garage + FirePlace +
#> Electric + Solar, data = houseprice_clean)
# vif of model
vif(model_houseprice_forward)
#> GVIF Df GVIF^(1/(2*Df))
#> Floors 1.000016 1 1.000008
#> Fiber 1.000027 1 1.000013
#> White.Marble 1.334649 1 1.155270
#> City 1.000042 2 1.000011
#> Glass.Doors 1.000017 1 1.000008
#> Indian.Marble 1.334637 1 1.155265
#> Area 1.000022 1 1.000011
#> Baths 1.000032 1 1.000016
#> Garage 1.000032 1 1.000016
#> FirePlace 1.000010 1 1.000005
#> Electric 1.000011 1 1.000005
#> Solar 1.000019 1 1.000009
All VIF values are below 10 (the largest is about 1.33, for White.Marble and Indian.Marble), so there is no multicollinearity. Therefore, the assumption is fulfilled.
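For numeric predictors, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors. The sketch below computes it by hand on simulated data. The names `toy` and `vif_manual` are hypothetical, and `x3` is deliberately built to correlate with `x1`.

```r
set.seed(5)
n <- 300

# Hypothetical predictors: x3 is constructed to be correlated with x1
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$x3 <- toy$x1 + rnorm(n, sd = 0.5)

# For each predictor, regress it on the others and convert R-squared to VIF
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

vif_manual
```

The correlated pair `x1` and `x3` gets a clearly higher VIF than the independent `x2`. For factors with more than one degree of freedom, such as City, `car::vif()` reports a generalized VIF instead.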
model_houseprice_forward is selected as the best model. The variables that are useful for explaining the variance in house prices are Floors, Fiber, White.Marble, City, Glass.Doors, Indian.Marble, Area, Baths, Garage, FirePlace, Electric and Solar. The final model satisfies some of the classical assumptions (homoscedasticity and no multicollinearity) but not the normality of residuals. The R-squared of the model is 1, which means the predictors explain 100% of the variance in house prices. The accuracy of the model in predicting house prices is measured with RMSE, which is 0.0000001505767 on the training data.