library(dplyr)
library(GGally)
library(MLmetrics)
library(performance)
library(ggplot2)
library(lmtest)
library(car)
options(scipen=99)
This report predicts house prices with linear regression, using a house price data set taken from Kaggle. We will explore which predictors have a significant impact on house price.
The data set is computer-generated and is intended to help beginners in machine learning who wish to practice R and different ML models.
house <- read.csv("house.csv")
Checking the structure of the house data set:
glimpse(house)
## Rows: 500,000
## Columns: 16
## $ Area <int> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, ~
## $ Garage <int> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,~
## $ FirePlace <int> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,~
## $ Baths <int> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,~
## $ White.Marble <int> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~
## $ Black.Marble <int> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,~
## $ Indian.Marble <int> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,~
## $ Floors <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,~
## $ City <int> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,~
## $ Solar <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,~
## $ Electric <int> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,~
## $ Fiber <int> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,~
## $ Glass.Doors <int> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,~
## $ Swiming.Pool <int> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,~
## $ Garden <int> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,~
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, ~
Below is an explanation of each column:
Area : Area of the house
Garage : Number of garages in the house
FirePlace : Number of fireplaces in the house
Baths : Number of bathrooms in the house
White.Marble : Whether the house uses white marble or not
Black.Marble : Whether the house uses black marble or not
Indian.Marble : Whether the house uses Indian marble or not
Floors : Whether the house has floors or not
City : Location of the house
Solar : Whether the house has a solar water heater or not
Electric : Whether the house has an electric heater or not
Fiber : Whether the house has a fiber connection or not
Glass.Doors : Whether the house has glass doors or not
Swiming.Pool : Whether the house has a swimming pool or not
Garden : Whether the house has a garden or not
Prices : Price of the house
Checking for NA values:
anyNA(house)
## [1] FALSE
There are no NA values in the data set.
The data set has 500,000 rows, so we work with a random sample of the original data in order to test the models properly. We take 5,000 rows from the data set.
set.seed(100)
indx_hs <- sample(nrow(house), 5000)
hs_samp <- house[indx_hs,]
After sampling, we check that the distribution of the data is preserved.
hist(house$Prices)
hist(hs_samp$Prices)
From the two histograms above, the distribution of Prices in house and hs_samp is nearly identical, so we can safely use hs_samp as the data set for building the linear regression models.
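As a numeric complement to comparing histograms by eye, the sketch below puts a few quantiles of a full vector and a random sample side by side. It is only an illustration: `prices_all` and `prices_samp` are synthetic stand-ins generated here (an assumption), not the actual `house$Prices` and `hs_samp$Prices` columns.

```r
# Synthetic stand-in for house$Prices (assumption: the real data is not used here)
set.seed(100)
prices_all  <- round(rnorm(500000, mean = 42000, sd = 12000))
prices_samp <- prices_all[sample(length(prices_all), 5000)]

# Compare key quantiles of the full data and the sample side by side
probs <- c(0.05, 0.25, 0.50, 0.75, 0.95)
quant_cmp <- rbind(full   = quantile(prices_all, probs),
                   sample = quantile(prices_samp, probs))
print(quant_cmp)

# Relative difference in means; a small value suggests the sample is representative
rel_mean_diff <- abs(mean(prices_samp) - mean(prices_all)) / mean(prices_all)
print(rel_mean_diff)
```

With the real data, the same comparison could be run by substituting the two Prices columns for the synthetic vectors.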
Next, we look at the correlation between the variables:
ggcorr(hs_samp, label = TRUE, label_size = 3, hjust = 1)
From the chart above, most predictors are only weakly correlated with Prices. The exceptions are Fiber, Floors and White.Marble, which have a moderate correlation with Prices compared to the other predictors.
In addition, Black.Marble, White.Marble and Indian.Marble are correlated with one another, which could indicate multicollinearity, so at least one of these columns will have to be dropped.
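In the rows printed by glimpse, exactly one of the three marble columns equals 1 for each house, which suggests they form a one-hot encoding. The sketch below shows why that matters for a linear model, using synthetic data; the column names `White`, `Black`, `Indian` and the effect sizes are illustrative assumptions, not values taken from the report's data.

```r
set.seed(1)
n <- 200
# Hypothetical one-hot marble indicators: exactly one of the three is 1 per row
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
d <- data.frame(
  White  = as.integer(marble == "white"),
  Black  = as.integer(marble == "black"),
  Indian = as.integer(marble == "indian")
)
d$Price <- 1000 + 14000 * d$White + 5000 * d$Black + rnorm(n, sd = 100)

# The three dummies always sum to 1, i.e. together they replicate the intercept column
all_sum_to_one <- all(rowSums(d[, c("White", "Black", "Indian")]) == 1)
print(all_sum_to_one)

fit <- lm(Price ~ White + Black + Indian, data = d)
print(coef(fit))   # the last dummy is reported as NA because it is redundant
```

When all three indicators enter the model together with an intercept, one of them carries no extra information, so lm cannot estimate it and reports NA.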
We will start with Fiber and Floors, which show no multicollinearity, in the model mod_house_ff below.
mod_house_ff <- lm(formula=Prices~Fiber+Floors, data=hs_samp)
summary(mod_house_ff)
##
## Call:
## lm(formula = Prices ~ Fiber + Floors, data = hs_samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19810.3 -5549.5 -407.5 5591.1 21600.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28515.5 184.5 154.53 <0.0000000000000002 ***
## Fiber 11914.2 212.5 56.06 <0.0000000000000002 ***
## Floors 15344.7 212.5 72.20 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7513 on 4997 degrees of freedom
## Multiple R-squared: 0.6231, Adjusted R-squared: 0.6229
## F-statistic: 4130 on 2 and 4997 DF, p-value: < 0.00000000000000022
coef(mod_house_ff)
## (Intercept) Fiber Floors
## 28515.52 11914.23 15344.73
From the summary above we can note several points:
- Fiber and Floors have positive coefficients, 11914.23 and 15344.73, with an intercept of 28515.52.
- Fiber and Floors are significant predictors in the mod_house_ff model, as shown by p-values below 0.05.
- From the adjusted R-squared of mod_house_ff, the model explains 62.29% of the variance in Prices, while the rest is explained by variables not included in the model.
The explanatory power of mod_house_ff could be improved by using all predictors, so we fit that model to see the difference.
mod_house_all <- lm(formula=Prices~., data=hs_samp)
summary(mod_house_all)
##
## Call:
## lm(formula = Prices ~ ., data = hs_samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.000000000876 -0.000000000037 -0.000000000010 0.000000000017 0.000000053765
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value
## (Intercept) 1000.0000000008236611 0.0000000000606875 16477854946452.666
## Area 24.9999999999999822 0.0000000000001492 167568017144802.688
## Garage 1500.0000000000120508 0.0000000000132916 112852850157247.984
## FirePlace 750.0000000000053433 0.0000000000077381 96922660742804.609
## Baths 1249.9999999999902229 0.0000000000075672 165185662267132.281
## White.Marble 13999.9999999999981810 0.0000000000262460 533414713650915.000
## Black.Marble 5000.0000000000418368 0.0000000000267158 187155176597633.344
## Indian.Marble NA NA NA
## Floors 15000.0000000000072760 0.0000000000216129 694030933314415.750
## City 3499.9999999999786269 0.0000000000131179 266811384074804.969
## Solar 249.9999999999790248 0.0000000000216080 11569804692099.191
## Electric 1249.9999999999779448 0.0000000000216196 57817851986206.141
## Fiber 11749.9999999999417923 0.0000000000215977 544040442124294.750
## Glass.Doors 4450.0000000000463842 0.0000000000216148 205877330706994.969
## Swiming.Pool 0.0000000000235680 0.0000000000216042 1.091
## Garden -0.0000000000212118 0.0000000000216120 -0.981
## Pr(>|t|)
## (Intercept) <0.0000000000000002 ***
## Area <0.0000000000000002 ***
## Garage <0.0000000000000002 ***
## FirePlace <0.0000000000000002 ***
## Baths <0.0000000000000002 ***
## White.Marble <0.0000000000000002 ***
## Black.Marble <0.0000000000000002 ***
## Indian.Marble NA
## Floors <0.0000000000000002 ***
## City <0.0000000000000002 ***
## Solar <0.0000000000000002 ***
## Electric <0.0000000000000002 ***
## Fiber <0.0000000000000002 ***
## Glass.Doors <0.0000000000000002 ***
## Swiming.Pool 0.275
## Garden 0.326
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.000000000763 on 4985 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.182e+28 on 14 and 4985 DF, p-value: < 0.00000000000000022
From the summary above we can note several points:
- Indian.Marble has an NA coefficient (not defined because of singularities), which indicates collinearity with other predictors, as the ggcorr plot above also suggested. We therefore decided to drop Black.Marble and Indian.Marble from the model.
- Most predictors in the mod_house_all model are significant, with p-values below 0.05. Swiming.Pool and Garden are not significant predictors, with p-values above 0.05. Those predictors will be removed in the revised model.
- From the adjusted R-squared of mod_house_all, the model explains 100% of the variance in Prices. With a perfect fit, feature selection using step cannot be run.
We will drop the predictors that show multicollinearity and run ggcorr once again.
house_eliminate <- hs_samp %>%
  select(-Garden, -Swiming.Pool, -Indian.Marble, -Black.Marble,
         -Solar, -Electric, -Baths, -Garage, -FirePlace)
ggcorr(house_eliminate, label = TRUE, label_size = 3, hjust = 1)
From the correlation plot above, no remaining predictors show multicollinearity. White.Marble also stands out as one additional candidate predictor.
Fiber, Floors, White.Marble
After dropping several predictors and running ggcorr, we fit the regression with Fiber, Floors and White.Marble.
mod_hs_ffw <- lm(formula=Prices~Fiber+Floors+White.Marble, data=hs_samp)
summary(mod_hs_ffw)
##
## Call:
## lm(formula = Prices ~ Fiber + Floors + White.Marble, data = hs_samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15808.9 -3612.9 16.1 3580.3 15912.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24774.7 137.1 180.77 <0.0000000000000002 ***
## Fiber 11809.2 146.7 80.48 <0.0000000000000002 ***
## Floors 15029.0 146.8 102.38 <0.0000000000000002 ***
## White.Marble 11431.7 154.3 74.07 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5187 on 4996 degrees of freedom
## Multiple R-squared: 0.8203, Adjusted R-squared: 0.8202
## F-statistic: 7605 on 3 and 4996 DF, p-value: < 0.00000000000000022
From the summary above we can note several points:
- Fiber, Floors and White.Marble have positive coefficients, 11809.2, 15029.0 and 11431.7, with an intercept of 24774.7.
- Fiber, Floors and White.Marble are significant predictors in the mod_hs_ffw model, as shown by p-values below 0.05.
- From the adjusted R-squared of mod_hs_ffw, the model explains 82.02% of the variance in Prices, while the rest is explained by variables not included in the model.
Below is the model with all remaining predictors after dropping multicollinear and insignificant predictors.
mod_house_all_f <- lm(formula=Prices~., data=house_eliminate)
summary(mod_house_all_f)
##
## Call:
## lm(formula = Prices ~ ., data = house_eliminate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8663.8 -2254.7 26.6 2270.4 8773.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12451.8789 165.3735 75.30 <0.0000000000000002 ***
## Area 24.3637 0.6271 38.85 <0.0000000000000002 ***
## White.Marble 11543.3204 95.4672 120.91 <0.0000000000000002 ***
## Floors 15057.2971 90.8337 165.77 <0.0000000000000002 ***
## City 3523.5828 55.1490 63.89 <0.0000000000000002 ***
## Fiber 11770.4246 90.7903 129.64 <0.0000000000000002 ***
## Glass.Doors 4439.4131 90.7702 48.91 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3208 on 4993 degrees of freedom
## Multiple R-squared: 0.9313, Adjusted R-squared: 0.9312
## F-statistic: 1.129e+04 on 6 and 4993 DF, p-value: < 0.00000000000000022
From the summary of the revised model above we can see that:
- All predictors in mod_house_all_f are significant in predicting the house price.
- From the adjusted R-squared of mod_house_all_f, the model explains 93.12% of the variance in Prices.
Running backward stepwise regression results in the same model as mod_house_all_f.
step(object=mod_house_all_f, direction = "backward", trace = 0)
##
## Call:
## lm(formula = Prices ~ Area + White.Marble + Floors + City + Fiber +
## Glass.Doors, data = house_eliminate)
##
## Coefficients:
## (Intercept) Area White.Marble Floors City
## 12451.88 24.36 11543.32 15057.30 3523.58
## Fiber Glass.Doors
## 11770.42 4439.41
Note that this model is the result of a trial-and-error process to find a more balanced model than the alternatives. When using all predictors (after removing predictors with multicollinearity), the adjusted R-squared is 1, which indicates a problem in the model. With an adjusted R-squared of 1, feature selection with stepwise regression is also not possible, because the stepwise procedure treats the model as already perfect. After removing multicollinear and insignificant predictors, the model was still “too good”, with an adjusted R-squared of around 0.97. We therefore also removed some predictors that are significant but have a correlation of only about 0.1 (from ggcorr):
Solar, Electric, Baths, Garage, FirePlace. The aim of dropping these predictors was a model that meets the homoscedasticity assumption, although the Breusch-Pagan test reported later for mod_house_all_f still rejects it.
Below are candidate models for determining house Prices.
coef(mod_house_ff)
## (Intercept) Fiber Floors
## 28515.52 11914.23 15344.73
Prices = 28515.52 + 11914.23(Fiber) + 15344.73(Floors)
coef(mod_hs_ffw)
## (Intercept) Fiber Floors White.Marble
## 24774.71 11809.17 15029.03 11431.74
Prices = 24774.71 + 11809.17(Fiber) + 15029.03(Floors) + 11431.74(White.Marble)
coef(mod_house_all_f)
## (Intercept) Area White.Marble Floors City Fiber
## 12451.87889 24.36368 11543.32035 15057.29714 3523.58277 11770.42463
## Glass.Doors
## 4439.41313
Prices = 12451.87889 + 24.36368(Area) + 11543.32035(White.Marble) + 15057.29714(Floors) + 3523.58277(City) + 11770.42463(Fiber) + 4439.41313(Glass.Doors)
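To make the last equation concrete, the sketch below plugs in the first house shown in the glimpse output (Area 164, White.Marble 0, Floors 0, City 3, Fiber 1, Glass.Doors 1, observed Prices 43800). The coefficients are copied from coef(mod_house_all_f); the vector names `b` and `x` are introduced here only for illustration.

```r
# Coefficients copied from coef(mod_house_all_f) above
b <- c(Intercept = 12451.87889, Area = 24.36368, White.Marble = 11543.32035,
       Floors = 15057.29714, City = 3523.58277, Fiber = 11770.42463,
       Glass.Doors = 4439.41313)

# First house shown in the glimpse() output (observed Prices = 43800)
x <- c(Intercept = 1, Area = 164, White.Marble = 0, Floors = 0,
       City = 3, Fiber = 1, Glass.Doors = 1)

# Linear prediction: sum of coefficient times predictor value
pred <- sum(b * x[names(b)])
print(pred)            # predicted price, about 43228
print(pred - 43800)    # prediction error for this house
```

For this house the equation gives roughly 43,228 against an observed price of 43,800, an error of about 570.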
In this section we generate predictions from each model and check their errors.
h_p_pred <- predict(mod_house_all_f, house_eliminate)
h_p_pred_ff <- predict(mod_house_ff, house_eliminate)
h_p_pred_ffw <- predict(mod_hs_ffw, house_eliminate)
We will check each model using RMSE (Root Mean Square Error).
mod_house_all_f
RMSE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 3206.039
mod_hs_ffw
RMSE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 5185.418
mod_house_ff
RMSE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 7511.113
From the RMSE of each model, mod_house_all_f has the lowest RMSE.
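The errors above are computed on the same rows that were used to fit the models. A common complement is to hold out part of the data and measure the error on rows the model has not seen. The sketch below illustrates the idea on synthetic data; the data frame `d`, its effect sizes, and the 80/20 split are assumptions made for the example, not part of the report's analysis.

```r
set.seed(100)
n <- 5000
# Synthetic stand-in for hs_samp (assumption: column names and effects are illustrative)
d <- data.frame(Fiber = rbinom(n, 1, 0.5), Floors = rbinom(n, 1, 0.5))
d$Prices <- 28500 + 11900 * d$Fiber + 15300 * d$Floors + rnorm(n, sd = 7500)

# Hold out 20% of the rows for testing
test_idx <- sample(n, size = 0.2 * n)
train <- d[-test_idx, ]
test  <- d[test_idx, ]

# Fit on the training rows only
fit <- lm(Prices ~ Fiber + Floors, data = train)

rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))
rmse_train <- rmse(train$Prices, predict(fit, newdata = train))
rmse_test  <- rmse(test$Prices,  predict(fit, newdata = test))
print(c(train = rmse_train, test = rmse_test))
```

If the test RMSE is close to the training RMSE, the model is unlikely to be overfitting the sample.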
We will check each model using MAE (Mean Absolute Error).
summary(house_eliminate$Prices)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9650 33700 42000 42118 50900 77375
From the summary of Prices, the data ranges from 9650 to 77375.
mod_house_all_f
MAE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 2614.859
mod_hs_ffw
MAE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 4160.664
mod_house_ff
MAE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 6222.643
From the MAE of each model, mod_house_all_f has the lowest MAE, and its error is small relative to the range of Prices.
We will check each model using MAPE (Mean Absolute Percentage Error).
mod_house_all_f
MAPE(y_pred=h_p_pred, y_true=house_eliminate$Prices)
## [1] 0.07172891
mod_hs_ffw
MAPE(y_pred=h_p_pred_ffw, y_true=house_eliminate$Prices)
## [1] 0.1144254
mod_house_ff
MAPE(y_pred=h_p_pred_ff, y_true=house_eliminate$Prices)
## [1] 0.1685246
From the MAPE of each model, mod_house_all_f has the lowest MAPE, at about 7.2%.
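For reference, the three error metrics used above can be written in a few lines of base R. This is a minimal sketch of the standard definitions; the lowercase helper names `rmse`, `mae` and `mape` and the toy vectors are introduced here for illustration and are not the MLmetrics functions themselves.

```r
# Standard definitions of the three error metrics
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))
mae  <- function(y_true, y_pred) mean(abs(y_true - y_pred))
mape <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))

# Toy example: errors are -10, 20 and 0
y_true <- c(100, 200, 400)
y_pred <- c(110, 180, 400)

metrics <- c(RMSE = rmse(y_true, y_pred),
             MAE  = mae(y_true, y_pred),
             MAPE = mape(y_true, y_pred))
print(metrics)
```

RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE, while MAPE expresses the error as a fraction of the true value.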
In this section, we will evaluate each model using a normality test, a homoscedasticity test and a multicollinearity test.
mod_house_all_f
hist(mod_house_all_f$residuals)
shapiro.test(mod_house_all_f$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod_house_all_f$residuals
## W = 0.99653, p-value = 0.000000002084
mod_hs_ffw
hist(mod_hs_ffw$residuals)
shapiro.test(mod_hs_ffw$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod_hs_ffw$residuals
## W = 0.99911, p-value = 0.01036
mod_house_ff
hist(mod_house_ff$residuals)
shapiro.test(mod_house_ff$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod_house_ff$residuals
## W = 0.99183, p-value = 0.0000000000000002233
Using the Shapiro-Wilk normality test, none of the models has a p-value above 0.05, so normality of the residuals is rejected for all of them. The model whose p-value comes closest to 0.05 is mod_hs_ffw.
Judging by the histograms alone, the residuals of all models look approximately normal, with most residuals concentrated around 0.
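With a sample as large as 5,000, the Shapiro-Wilk test tends to flag even small departures from normality, which may explain why the test and the histograms disagree. A Q-Q plot is a common visual complement. The sketch below uses simulated residuals (`res` is an assumption standing in for a model's residuals), so its output says nothing about the actual models.

```r
set.seed(100)
# Simulated residuals (assumption: stand-in for a model's residuals), mildly heavy-tailed
res <- rt(5000, df = 10) * 3000

# shapiro.test() accepts between 3 and 5000 values
sw <- shapiro.test(res)
print(sw$p.value)

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(res)
qqline(res, col = "red", lty = 2)
```

Deviations from the line at the extremes of the plot point to heavier or lighter tails than a normal distribution would have.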
Below are the homoscedasticity tests for each model using bptest.
mod_house_all_f
bptest(mod_house_all_f)
##
## studentized Breusch-Pagan test
##
## data: mod_house_all_f
## BP = 314.38, df = 6, p-value < 0.00000000000000022
plot(x = mod_house_all_f$fitted.values, y = mod_house_all_f$residuals)
abline(h = 0, col = "red", lty = 2)
mod_hs_ffw
bptest(mod_hs_ffw)
##
## studentized Breusch-Pagan test
##
## data: mod_hs_ffw
## BP = 42.745, df = 3, p-value = 0.000000002788
plot(x = mod_hs_ffw$fitted.values, y = mod_hs_ffw$residuals)
abline(h = 0, col = "red", lty = 2)
mod_house_ff
bptest(mod_house_ff)
##
## studentized Breusch-Pagan test
##
## data: mod_house_ff
## BP = 0.24471, df = 2, p-value = 0.8848
plot(x = mod_house_ff$fitted.values, y = mod_house_ff$residuals)
abline(h = 0, col = "red", lty = 2)
In the homoscedasticity test, only mod_house_ff has a p-value above 0.05, so it is the only model with no evidence of heteroscedasticity. The p-values of the other models are below 0.05.
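To show what the studentized Breusch-Pagan statistic measures, the sketch below computes it by hand: the squared residuals are regressed on the predictors, and the statistic is the sample size times the R-squared of that auxiliary regression (Koenker's version). The data are synthetic and heteroscedastic by construction; the variables `x` and `y` are assumptions for the example.

```r
set.seed(100)
n <- 1000
x <- runif(n, 1, 10)
# Synthetic data whose noise grows with x (heteroscedastic by construction)
y <- 2 + 3 * x + rnorm(n, sd = x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor
aux <- lm(residuals(fit)^2 ~ x)

# Koenker's studentized statistic: n times the auxiliary R-squared,
# compared against a chi-squared distribution with one df per predictor
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(BP = bp_stat, p.value = bp_p))
```

A small p-value means the residual variance changes with the predictors, which is the pattern a funnel shape in the residual plot would also show.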
In this section we will test each model for multicollinearity.
mod_house_ff
vif(mod_house_ff)
## Fiber Floors
## 1.000144 1.000144
mod_house_all_f
vif(mod_house_all_f)
## Area White.Marble Floors City Fiber Glass.Doors
## 1.001263 1.001267 1.001975 1.001084 1.001015 1.000493
mod_hs_ffw
vif(mod_hs_ffw)
## Fiber Floors White.Marble
## 1.000238 1.000988 1.000931
All VIF values are close to 1, so no multicollinearity is found in any of the models.
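The VIF of a predictor can be reproduced by hand as 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below does this on synthetic data in which `x3` is deliberately built from `x1`; all variable names here are assumptions for the example.

```r
set.seed(100)
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.9 * x1 + rnorm(n, sd = 0.3)   # deliberately correlated with x1
X <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A value near 1, as in the models above, means a predictor is almost uncorrelated with the others, while the inflated values for `x1` and `x3` in this example show what collinearity looks like.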
From all the models above, we can see that:
mod_house_ff (low adjusted R-squared, fails Shapiro-Wilk, residual histogram looks normal, homoscedastic with BP p-value > 0.05, no multicollinearity)
mod_hs_ffw (medium adjusted R-squared, fails Shapiro-Wilk, residual histogram looks normal, heteroscedastic with BP p-value < 0.05, no multicollinearity)
mod_house_all_f (high adjusted R-squared, fails Shapiro-Wilk, residual histogram looks normal, heteroscedastic with BP p-value < 0.05, no multicollinearity)
From the summary above we can conclude that no model is perfect. However, we think that mod_house_all_f offers the best balance of pros and cons and performs well in predicting house Prices.
So, according to mod_house_all_f, house Prices are positively associated with Area, White.Marble, Floors, City, Fiber and Glass.Doors.