Business Problem

Predicting house prices with the House Price dataset from Kaggle.

Data Preparation

## [1] FALSE
## 'data.frame':    500000 obs. of  16 variables:
##  $ Area         : int  164 84 190 75 148 124 58 249 243 242 ...
##  $ Garage       : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace    : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths        : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ White.Marble : int  0 0 1 0 1 0 0 1 0 0 ...
##  $ Black.Marble : int  1 0 0 0 0 1 0 0 0 0 ...
##  $ Indian.Marble: int  0 1 0 1 0 0 1 0 1 1 ...
##  $ Floors       : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ City         : int  3 2 2 1 2 1 3 1 1 2 ...
##  $ Solar        : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric     : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Fiber        : int  1 0 1 1 0 1 1 0 0 0 ...
##  $ Glass.Doors  : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool : int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden       : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices       : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

This dataset has 500,000 observations of 16 variables. Several columns are not clear enough to use as prediction variables: Area, the marble color indicators (White.Marble, Black.Marble, Indian.Marble), which all describe the same attribute, and columns whose meaning is not documented on Kaggle, such as City and Fiber. These columns are dropped.
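A minimal sketch of how those columns might be dropped with dplyr (the object name houseprice comes from the cor.test output later in the report; the exact code is an assumption):

```r
library(dplyr)

# houseprice is assumed to hold the raw Kaggle data (500,000 x 16).
# Drop the columns that will not be used as predictors.
houseprice <- houseprice %>%
  select(-c(Area, White.Marble, Black.Marble, Indian.Marble, City, Fiber))

str(houseprice)
```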

There are outliers in our target variable, Prices, which I will …..

## 'data.frame':    500000 obs. of  10 variables:
##  $ Garage      : int  2 2 2 2 1 3 1 2 1 1 ...
##  $ FirePlace   : int  0 0 4 4 4 3 0 1 0 2 ...
##  $ Baths       : int  2 4 4 4 2 3 2 1 2 4 ...
##  $ Floors      : int  0 1 0 1 1 1 0 1 1 0 ...
##  $ Solar       : int  1 0 0 1 1 0 0 0 0 1 ...
##  $ Electric    : int  1 0 0 1 0 0 1 1 0 0 ...
##  $ Glass.Doors : int  1 1 0 1 1 1 1 1 0 0 ...
##  $ Swiming.Pool: int  0 1 0 1 1 1 0 1 1 1 ...
##  $ Garden      : int  0 1 0 1 1 1 1 0 0 0 ...
##  $ Prices      : int  43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...

Exploratory Data Analysis

Find the correlation between Prices and the other variables.
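One way to inspect these correlations is GGally::ggcorr(), which draws a labelled correlation heat map; this specific call is an assumption:

```r
library(GGally)

# Pairwise correlations between all (still numeric) columns,
# with the coefficients printed on the heat map
ggcorr(houseprice, label = TRUE, label_round = 2)
```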

From this, we can see that the variable most strongly correlated with Prices is Floors. However, I cannot use only one variable, since almost every variable matters.

I also check the linear relationship between Prices and Floors with a correlation test.
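The call that produces the test output below (inferred from the data: line of the printout):

```r
# Pearson correlation test between Prices and Floors
cor.test(houseprice$Prices, houseprice$Floors)
```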

## 
##  Pearson's product-moment correlation
## 
## data:  houseprice$Prices and houseprice$Floors
## t = 557.96, df = 499998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6177397 0.6211561
## sample estimates:
##       cor 
## 0.6194508

From cor.test(), we can see that the p-value is < 0.05, so Floors is significantly correlated with the target variable (Prices).

Here I change the class of Glass.Doors, Electric, and Floors (and, judging from the factor levels in the model summaries below, also Solar, Swiming.Pool, and Garden) to factor.
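A sketch of that conversion using dplyr::mutate() and across(); the exact set of columns and the code itself are assumptions based on the factor levels shown in the model summaries:

```r
library(dplyr)

# Treat the 0/1 indicator columns as categorical rather than numeric
houseprice <- houseprice %>%
  mutate(across(c(Floors, Solar, Electric, Glass.Doors, Swiming.Pool, Garden),
                as.factor))
```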

Next, I split the data into train and test sets.
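A minimal sketch of the split, assuming roughly 80% of the rows go to training (house_train appears in the lm() calls below; house_test, the proportion, and the seed are assumptions):

```r
set.seed(100)  # assumed seed, for reproducibility

# Sample ~80% of row indices for the training set
idx <- sample(nrow(houseprice), size = 0.8 * nrow(houseprice))

house_train <- houseprice[idx, ]
house_test  <- houseprice[-idx, ]
```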

Build Model

Using all variables, and an intercept-only model with no variables.
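A sketch of the two fits; model_houseall is named later in the Breusch-Pagan output, while the name of the intercept-only model is an assumption:

```r
# Full model: all remaining predictors
model_houseall <- lm(Prices ~ ., data = house_train)
summary(model_houseall)

# Intercept-only model: no predictors, used as the starting point for forward selection
model_housenone <- lm(Prices ~ 1, data = house_train)
summary(model_housenone)
```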

## 
## Call:
## lm(formula = Prices ~ ., data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18892.6  -6679.1    113.8   5999.4  20234.6 
## 
## Coefficients:
##                Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)   23286.831     62.151 374.684 <0.0000000000000002 ***
## Garage         1504.176     17.228  87.312 <0.0000000000000002 ***
## FirePlace       763.598      9.955  76.708 <0.0000000000000002 ***
## Baths          1246.708      9.952 125.266 <0.0000000000000002 ***
## Floors1       15018.231     28.146 533.577 <0.0000000000000002 ***
## Solar1          234.661     28.147   8.337 <0.0000000000000002 ***
## Electric1      1257.920     28.146  44.692 <0.0000000000000002 ***
## Glass.Doors1   4422.793     28.146 157.136 <0.0000000000000002 ***
## Swiming.Pool1    13.354     28.146   0.474               0.635    
## Garden1          43.049     28.147   1.529               0.126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8901 on 399992 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4598 
## F-statistic: 3.783e+04 on 9 and 399992 DF,  p-value: < 0.00000000000000022

In this summary, we can see that Swiming.Pool and Garden are insignificant variables (p-values of 0.635 and 0.126).

To see whether stepwise selection can help build the best model, I also fit an intercept-only model with no predictor variables.

## 
## Call:
## lm(formula = Prices ~ 1, data = house_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -34324  -8549   -199   8701  35476 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 42048.79      19.15    2196 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12110 on 400001 degrees of freedom

Feature Selection with Stepwise Regression
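A sketch of the three stepwise searches using step(); model_houseback is referenced later in the report, while the other object names and the reliance on step()'s default AIC criterion are assumptions:

```r
# Backward elimination: start from the full model and drop terms
model_houseback <- step(model_houseall, direction = "backward")

# Forward selection: start from the intercept-only model and add terms
model_housefor <- step(model_housenone, scope = formula(model_houseall),
                       direction = "forward")

# Both directions
model_houseboth <- step(model_housenone, scope = formula(model_houseall),
                        direction = "both")
```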

Backward

## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors + Garden, data = house_train)
## 
## Coefficients:
##  (Intercept)        Garage     FirePlace         Baths       Floors1  
##     23293.46       1504.19        763.60       1246.72      15018.22  
##       Solar1     Electric1  Glass.Doors1       Garden1  
##       234.65       1257.92       4422.80         43.05
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors, data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18864.5  -6680.3    113.5   5999.6  20206.3 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  23315.115     58.881  395.97 <0.0000000000000002 ***
## Garage        1504.163     17.228   87.31 <0.0000000000000002 ***
## FirePlace      763.585      9.955   76.71 <0.0000000000000002 ***
## Baths         1246.743      9.952  125.27 <0.0000000000000002 ***
## Floors1      15018.196     28.146  533.58 <0.0000000000000002 ***
## Solar1         234.465     28.147    8.33 <0.0000000000000002 ***
## Electric1     1257.923     28.147   44.69 <0.0000000000000002 ***
## Glass.Doors1  4422.886     28.146  157.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4598 
## F-statistic: 4.863e+04 on 7 and 399994 DF,  p-value: < 0.00000000000000022

Forward

## 
## Call:
## lm(formula = Prices ~ Floors + Glass.Doors + Baths + Garage + 
##     FirePlace + Electric + Solar + Garden, data = house_train)
## 
## Coefficients:
##  (Intercept)       Floors1  Glass.Doors1         Baths        Garage  
##     23293.46      15018.22       4422.80       1246.72       1504.19  
##    FirePlace     Electric1        Solar1       Garden1  
##       763.60       1257.92        234.65         43.05
## 
## Call:
## lm(formula = Prices ~ Floors + Glass.Doors + Baths + Garage + 
##     FirePlace + Electric + Solar, data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18864.5  -6680.3    113.5   5999.6  20206.3 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  23315.115     58.881  395.97 <0.0000000000000002 ***
## Floors1      15018.196     28.146  533.58 <0.0000000000000002 ***
## Glass.Doors1  4422.886     28.146  157.14 <0.0000000000000002 ***
## Baths         1246.743      9.952  125.27 <0.0000000000000002 ***
## Garage        1504.163     17.228   87.31 <0.0000000000000002 ***
## FirePlace      763.585      9.955   76.71 <0.0000000000000002 ***
## Electric1     1257.923     28.147   44.69 <0.0000000000000002 ***
## Solar1         234.465     28.147    8.33 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4598 
## F-statistic: 4.863e+04 on 7 and 399994 DF,  p-value: < 0.00000000000000022

Both

## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors + Garden, data = house_train)
## 
## Coefficients:
##  (Intercept)        Garage     FirePlace         Baths       Floors1  
##     23293.46       1504.19        763.60       1246.72      15018.22  
##       Solar1     Electric1  Glass.Doors1       Garden1  
##       234.65       1257.92       4422.80         43.05
## 
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar + 
##     Electric + Glass.Doors, data = house_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18864.5  -6680.3    113.5   5999.6  20206.3 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  23315.115     58.881  395.97 <0.0000000000000002 ***
## Garage        1504.163     17.228   87.31 <0.0000000000000002 ***
## FirePlace      763.585      9.955   76.71 <0.0000000000000002 ***
## Baths         1246.743      9.952  125.27 <0.0000000000000002 ***
## Floors1      15018.196     28.146  533.58 <0.0000000000000002 ***
## Solar1         234.465     28.147    8.33 <0.0000000000000002 ***
## Electric1     1257.923     28.147   44.69 <0.0000000000000002 ***
## Glass.Doors1  4422.886     28.146  157.14 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared:  0.4598, Adjusted R-squared:  0.4598 
## F-statistic: 4.863e+04 on 7 and 399994 DF,  p-value: < 0.00000000000000022

Comparing the adjusted R-squared of all three models
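A sketch of the comparison (the forward/both object names follow the naming assumed above):

```r
# Adjusted R-squared of each stepwise result
summary(model_houseback)$adj.r.squared
summary(model_housefor)$adj.r.squared
summary(model_houseboth)$adj.r.squared
```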

## [1] 0.4597662
## [1] 0.4597662
## [1] 0.4597662

All the stepwise procedures selected the same predictor variables, dropping Swiming.Pool and Garden.

Confidence Interval
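A sketch of how the fit/lwr/upr table below could be produced, assuming prediction on a handful of held-out test rows:

```r
# Fitted values with 95% confidence intervals for the mean response
head(predict(model_houseback, newdata = house_test, interval = "confidence"))
```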

##        fit      lwr      upr
## 2 50751.49 50675.93 50827.06
## 3 34364.75 34289.26 34440.24
## 4 55298.22 55222.72 55373.73
## 6 53299.67 53226.71 53372.63
## 7 32993.57 32910.74 33076.41

Prediction:

With 95% confidence, the price of house 1 has a lower bound of 34,671.55 dollars and an upper bound of 34,806.73 dollars.

Assumptions

Normality
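A sketch of how the residual histogram described below could be drawn (the number of bins is an assumption):

```r
# Distribution of the residuals of the chosen model
hist(model_houseback$residuals, breaks = 50,
     main = "Residuals of model_houseback", xlab = "Residuals")
```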

Summary: from this histogram, we can see that the residuals are roughly normally distributed and centered near zero, meaning the errors cluster around zero.

Checking normality with the Shapiro-Wilk test

Since the data has more than 5,000 observations, we cannot use the Shapiro-Wilk test (R's shapiro.test() only accepts sample sizes between 3 and 5,000).

Homoscedasticity

Scatter plot of fitted values vs. residuals
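A sketch of that plot:

```r
# Fitted values vs. residuals for the chosen model
plot(model_houseback$fitted.values, model_houseback$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")
```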

Since it is hard to tell from the plot whether there is a pattern, we check with the Breusch-Pagan test.

Breusch-Pagan
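The output below is in the format of lmtest::bptest(); a sketch of the calls:

```r
library(lmtest)

# Breusch-Pagan test for heteroscedasticity on both models
bptest(model_houseback)
bptest(model_houseall)
```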

## 
##  studentized Breusch-Pagan test
## 
## data:  model_houseback
## BP = 10.176, df = 7, p-value = 0.1788
## 
##  studentized Breusch-Pagan test
## 
## data:  model_houseall
## BP = 12.598, df = 9, p-value = 0.1817

The p-values from bptest() are 0.1788 for model_houseback and 0.1817 for model_houseall, both greater than 0.05. This means that for both models we fail to reject the null hypothesis (constant residual variance), so there is no evidence of heteroscedasticity.

Multicollinearity
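A sketch of the variance-inflation-factor check with car::vif():

```r
library(car)

# Variance inflation factors; values close to 1 indicate no multicollinearity
vif(model_houseback)
vif(model_houseall)
```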

##      Garage   FirePlace       Baths      Floors       Solar    Electric 
##    1.000019    1.000006    1.000021    1.000009    1.000017    1.000023 
## Glass.Doors 
##    1.000007
##       Garage    FirePlace        Baths       Floors        Solar     Electric 
##     1.000023     1.000007     1.000029     1.000010     1.000037     1.000023 
##  Glass.Doors Swiming.Pool       Garden 
##     1.000011     1.000010     1.000028

Since all VIF scores are far below 10, we can assume the predictor variables are independent of each other (no multicollinearity).

Model Evaluation

Evaluate the chosen model (model_houseback)
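A sketch of the evaluation with MLmetrics::MSE() and MLmetrics::RMSE(), assuming predictions on house_test:

```r
library(MLmetrics)

# Predict on the test set with the chosen model
pred_back <- predict(model_houseback, newdata = house_test)

MSE(y_pred = pred_back, y_true = house_test$Prices)   # mean squared error
RMSE(y_pred = pred_back, y_true = house_test$Prices)  # root mean squared error
```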

## [1] 79697397
## [1] 8927.34

Evaluate the full model (model_houseall)

## Warning in y_true - y_pred: longer object length is not a multiple of shorter
## object length
## [1] 214102244
## Warning in y_true - y_pred: longer object length is not a multiple of shorter
## object length
## [1] 14632.23

The MSE and RMSE for model_houseback are lower than for the model using all the variables. (Note, however, that the warnings above indicate a length mismatch between the predictions and the true values for model_houseall, so its metrics should be interpreted with caution.)

Summary

For predicting Prices in the future, we can use the model_houseback model with the following significant variables:

- Garage
- FirePlace
- Baths
- Floors
- Solar
- Electric
- Glass.Doors

The model using only the aforementioned variables (model_houseback) is a better choice than the model with all variables (model_houseall) because its MSE and RMSE are smaller. I suggest using model_houseback to predict prices in the future.