Business Problem
Predicting house prices with the House Price Kaggle dataset.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
## Loading required package: tidyr
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
houseprice <- read.csv("/Users/dinnah/Documents/ALGORITMA Machine Learning/LBB Machine Learning/HousePrices_HalfMil.csv")
anyNA(houseprice)
## [1] FALSE
## 'data.frame': 500000 obs. of 16 variables:
## $ Area : int 164 84 190 75 148 124 58 249 243 242 ...
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ White.Marble : int 0 0 1 0 1 0 0 1 0 0 ...
## $ Black.Marble : int 1 0 0 0 0 1 0 0 0 0 ...
## $ Indian.Marble: int 0 1 0 1 0 0 1 0 1 1 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ City : int 3 2 2 1 2 1 3 1 1 2 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Fiber : int 1 0 1 1 0 1 1 0 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool : int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
This dataset has 500,000 observations and 16 variables. Several variables are not clear enough to use as meaningful predictors: Area, the marble color columns (since they are all the same), and variables that are not described on Kaggle, such as City and Fiber.
There are outliers in our target variable, Prices. I will …..
house_price <- houseprice %>%
  dplyr::select(-c(Area, Black.Marble, White.Marble, Indian.Marble, City, Fiber))
str(house_price)
## 'data.frame': 500000 obs. of 10 variables:
## $ Garage : int 2 2 2 2 1 3 1 2 1 1 ...
## $ FirePlace : int 0 0 4 4 4 3 0 1 0 2 ...
## $ Baths : int 2 4 4 4 2 3 2 1 2 4 ...
## $ Floors : int 0 1 0 1 1 1 0 1 1 0 ...
## $ Solar : int 1 0 0 1 1 0 0 0 0 1 ...
## $ Electric : int 1 0 0 1 0 0 1 1 0 0 ...
## $ Glass.Doors : int 1 1 0 1 1 1 1 1 0 0 ...
## $ Swiming.Pool: int 0 1 0 1 1 1 0 1 1 1 ...
## $ Garden : int 0 1 0 1 1 1 1 0 0 0 ...
## $ Prices : int 43800 37550 49500 50075 52400 54300 34400 50425 29575 22300 ...
Find the correlation between Prices and the other variables
From this, we can see that the variable most strongly correlated with Prices is Floors. However, I cannot use only one variable, since almost every variable matters.
Here I check the linear relationship between Prices and Floors.
##
## Pearson's product-moment correlation
##
## data: houseprice$Prices and houseprice$Floors
## t = 557.96, df = 499998, p-value < 0.00000000000000022
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6177397 0.6211561
## sample estimates:
## cor
## 0.6194508
From the cor.test, we can see that the p-value is < 0.05, so the correlation between Floors and the target variable (Prices) is statistically significant.
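The correlation check described above can be sketched as follows. This is a minimal illustration on a small synthetic data frame (`toy`, an assumption standing in for `houseprice`), not the code that produced the output in this report. It ranks the predictors by their Pearson correlation with `Prices` and then runs `cor.test` on the strongest pair.

```r
# Synthetic stand-in for the housing data (assumption: not the Kaggle file).
set.seed(42)
n <- 1000
toy <- data.frame(
  Floors = rbinom(n, 1, 0.5),
  Baths  = sample(1:5, n, replace = TRUE),
  Garden = rbinom(n, 1, 0.5)
)
toy$Prices <- 23000 + 15000 * toy$Floors + 1250 * toy$Baths + rnorm(n, sd = 9000)

# Pearson correlation of every predictor with the target, strongest first.
cors <- cor(toy)[, "Prices"]
cors <- sort(cors[names(cors) != "Prices"], decreasing = TRUE)
print(cors)

# Formal test of the null hypothesis that the true correlation is zero.
ct <- cor.test(toy$Prices, toy$Floors)
print(ct$estimate)
print(ct$p.value)
```

With 500,000 rows, even a very weak correlation produces a tiny p-value, so the size of the estimate is usually more informative than the p-value alone.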
Here I change the class of Glass.Doors, Electric, Floors, Swiming.Pool, Garden, and Solar to factor.
house_price <- house_price %>%
  mutate(Glass.Doors = as.factor(Glass.Doors),
         Electric = as.factor(Electric),
         Floors = as.factor(Floors),
         Swiming.Pool = as.factor(Swiming.Pool),
         Garden = as.factor(Garden),
         Solar = as.factor(Solar))
Next, I split the data into train and test sets.
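The split itself is not shown in this report. A minimal sketch of one common way to do it is given below, assuming an 80/20 random split (the model output that follows reports about 400,000 training rows, which is roughly consistent with that ratio). The seed, the ratio, and the names `toy`, `toy_train`, and `toy_test` are assumptions for illustration.

```r
# Synthetic stand-in for house_price (assumption).
set.seed(123)  # hypothetical seed, only for reproducibility of this sketch
toy <- data.frame(x = rnorm(1000), y = rnorm(1000))

# Draw 80% of the row indices at random for training; the rest is the test set.
idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

print(dim(toy_train))
print(dim(toy_test))
```

Sampling indices without replacement guarantees that no row appears in both sets.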
##
## Call:
## lm(formula = Prices ~ ., data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18892.6 -6679.1 113.8 5999.4 20234.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23286.831 62.151 374.684 <0.0000000000000002 ***
## Garage 1504.176 17.228 87.312 <0.0000000000000002 ***
## FirePlace 763.598 9.955 76.708 <0.0000000000000002 ***
## Baths 1246.708 9.952 125.266 <0.0000000000000002 ***
## Floors1 15018.231 28.146 533.577 <0.0000000000000002 ***
## Solar1 234.661 28.147 8.337 <0.0000000000000002 ***
## Electric1 1257.920 28.146 44.692 <0.0000000000000002 ***
## Glass.Doors1 4422.793 28.146 157.136 <0.0000000000000002 ***
## Swiming.Pool1 13.354 28.146 0.474 0.635
## Garden1 43.049 28.147 1.529 0.126
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8901 on 399992 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4598
## F-statistic: 3.783e+04 on 9 and 399992 DF, p-value: < 0.00000000000000022
In this summary, we can see that Swiming.Pool and Garden are not significant predictors (p-values 0.635 and 0.126).
To see whether stepwise selection can help us build the best model, I also fit a model with no predictor variables.
##
## Call:
## lm(formula = Prices ~ 1, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -34324 -8549 -199 8701 35476
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42048.79 19.15 2196 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12110 on 400001 degrees of freedom
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors + Garden, data = house_train)
##
## Coefficients:
## (Intercept) Garage FirePlace Baths Floors1
## 23293.46 1504.19 763.60 1246.72 15018.22
## Solar1 Electric1 Glass.Doors1 Garden1
## 234.65 1257.92 4422.80 43.05
model_houseback <- lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
                        Electric + Glass.Doors, data = house_train)
summary(model_houseback)
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18864.5 -6680.3 113.5 5999.6 20206.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23315.115 58.881 395.97 <0.0000000000000002 ***
## Garage 1504.163 17.228 87.31 <0.0000000000000002 ***
## FirePlace 763.585 9.955 76.71 <0.0000000000000002 ***
## Baths 1246.743 9.952 125.27 <0.0000000000000002 ***
## Floors1 15018.196 28.146 533.58 <0.0000000000000002 ***
## Solar1 234.465 28.147 8.33 <0.0000000000000002 ***
## Electric1 1257.923 28.147 44.69 <0.0000000000000002 ***
## Glass.Doors1 4422.886 28.146 157.14 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4598
## F-statistic: 4.863e+04 on 7 and 399994 DF, p-value: < 0.00000000000000022
##
## Call:
## lm(formula = Prices ~ Floors + Glass.Doors + Baths + Garage +
## FirePlace + Electric + Solar + Garden, data = house_train)
##
## Coefficients:
## (Intercept) Floors1 Glass.Doors1 Baths Garage
## 23293.46 15018.22 4422.80 1246.72 1504.19
## FirePlace Electric1 Solar1 Garden1
## 763.60 1257.92 234.65 43.05
model_housefwd <- lm(formula = Prices ~ Floors + Glass.Doors + Baths + Garage +
                       FirePlace + Electric + Solar, data = house_train)
summary(model_housefwd)
##
## Call:
## lm(formula = Prices ~ Floors + Glass.Doors + Baths + Garage +
## FirePlace + Electric + Solar, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18864.5 -6680.3 113.5 5999.6 20206.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23315.115 58.881 395.97 <0.0000000000000002 ***
## Floors1 15018.196 28.146 533.58 <0.0000000000000002 ***
## Glass.Doors1 4422.886 28.146 157.14 <0.0000000000000002 ***
## Baths 1246.743 9.952 125.27 <0.0000000000000002 ***
## Garage 1504.163 17.228 87.31 <0.0000000000000002 ***
## FirePlace 763.585 9.955 76.71 <0.0000000000000002 ***
## Electric1 1257.923 28.147 44.69 <0.0000000000000002 ***
## Solar1 234.465 28.147 8.33 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4598
## F-statistic: 4.863e+04 on 7 and 399994 DF, p-value: < 0.00000000000000022
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors + Garden, data = house_train)
##
## Coefficients:
## (Intercept) Garage FirePlace Baths Floors1
## 23293.46 1504.19 763.60 1246.72 15018.22
## Solar1 Electric1 Glass.Doors1 Garden1
## 234.65 1257.92 4422.80 43.05
model_houseboth <- lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
                        Electric + Glass.Doors, data = house_train)
summary(model_houseboth)
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + Floors + Solar +
## Electric + Glass.Doors, data = house_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18864.5 -6680.3 113.5 5999.6 20206.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23315.115 58.881 395.97 <0.0000000000000002 ***
## Garage 1504.163 17.228 87.31 <0.0000000000000002 ***
## FirePlace 763.585 9.955 76.71 <0.0000000000000002 ***
## Baths 1246.743 9.952 125.27 <0.0000000000000002 ***
## Floors1 15018.196 28.146 533.58 <0.0000000000000002 ***
## Solar1 234.465 28.147 8.33 <0.0000000000000002 ***
## Electric1 1257.923 28.147 44.69 <0.0000000000000002 ***
## Glass.Doors1 4422.886 28.146 157.14 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8901 on 399994 degrees of freedom
## Multiple R-squared: 0.4598, Adjusted R-squared: 0.4598
## F-statistic: 4.863e+04 on 7 and 399994 DF, p-value: < 0.00000000000000022
Comparing the adjusted R-squared of all three models
## [1] 0.4597662
## [1] 0.4597662
## [1] 0.4597662
All three stepwise directions selected the same predictor variables and dropped Swiming.Pool. The refitted models above also leave out Garden.
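The `step()` calls that produced the selections above are not shown. The sketch below illustrates how backward, forward, and both-direction selection are typically run with base R's `step()`, on a synthetic data frame (`toy`, an assumption). The starting model for each direction is also an assumption, since the report does not state it.

```r
# Synthetic stand-in for house_train (assumption).
set.seed(1)
n <- 500
toy <- data.frame(
  Floors = factor(rbinom(n, 1, 0.5)),
  Baths  = sample(1:5, n, replace = TRUE),
  Garden = factor(rbinom(n, 1, 0.5))
)
toy$Prices <- 23000 + 15000 * (toy$Floors == "1") + 1250 * toy$Baths +
  rnorm(n, sd = 9000)

model_all  <- lm(Prices ~ ., data = toy)   # every predictor
model_none <- lm(Prices ~ 1, data = toy)   # intercept only

# Backward: start from the full model and drop terms while AIC improves.
back <- step(model_all, direction = "backward", trace = 0)

# Forward: start from the empty model and add terms up to the full set.
fwd <- step(model_none,
            scope = list(lower = ~ 1, upper = ~ Floors + Baths + Garden),
            direction = "forward", trace = 0)

# Both: terms may be dropped or re-added at each step.
both <- step(model_all, direction = "both", trace = 0)

print(formula(back))
print(formula(fwd))
print(formula(both))
```

`step()` selects by AIC rather than by p-value, which may explain why a term with a p-value above 0.05 (such as Garden here) can survive stepwise selection.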
predict(object = model_houseback, newdata = house_train[1:5,], interval = "confidence", level = 0.95)
##        fit      lwr      upr
## 2 50751.49 50675.93 50827.06
## 3 34364.75 34289.26 34440.24
## 4 55298.22 55222.72 55373.73
## 6 53299.67 53226.71 53372.63
## 7 32993.57 32910.74 33076.41
Prediction:
With 95% confidence, the mean price for the first house in the output above has a lower bound of 50675.93 dollars and an upper bound of 50827.06 dollars.
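Note that `interval = "confidence"` bounds the mean price of houses with the given features, not the price of one individual house. For an individual house, `predict()` also offers `interval = "prediction"`. The sketch below compares the two on synthetic data. The names `toy`, `fit`, and `new_houses` are assumptions for illustration.

```r
# Synthetic stand-in for the training data (assumption).
set.seed(9)
n <- 500
toy <- data.frame(Baths = sample(1:5, n, replace = TRUE))
toy$Prices <- 23000 + 1250 * toy$Baths + rnorm(n, sd = 9000)

fit <- lm(Prices ~ Baths, data = toy)
new_houses <- data.frame(Baths = c(1, 3, 5))

# Interval for the mean price at these feature values.
conf_int <- predict(fit, newdata = new_houses,
                    interval = "confidence", level = 0.95)

# Interval for the price of a single new house at these feature values.
pred_int <- predict(fit, newdata = new_houses,
                    interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

The prediction interval is always wider, because it adds the residual variance of a single observation to the uncertainty of the estimated mean.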
Summary: from this histogram, we can see that the residuals are roughly normally distributed and centered near zero. This means that the errors are concentrated around zero.
Checking with the Shapiro test
Since the data has more than 5,000 observations, we cannot use the Shapiro test.
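R's `shapiro.test()` accepts between 3 and 5,000 values. One possible workaround, which was not part of this analysis, is to test a random subsample of the residuals. The sketch below uses simulated residuals (`res`, an assumption standing in for the residuals of the fitted model).

```r
# Simulated residuals standing in for residuals(model) (assumption).
set.seed(7)
res <- rnorm(20000, mean = 0, sd = 9000)

# shapiro.test() rejects inputs longer than 5000, so draw a random subsample.
res_sample <- sample(res, 5000)
sw <- shapiro.test(res_sample)
print(sw)
```

A subsample test is only indicative, and with large samples even small departures from normality can be flagged, so the histogram remains a useful companion check.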
Scatter plot between Fitted Values & Residuals
Since we cannot tell from the plot whether the residuals form a pattern, we will check with the Breusch-Pagan test.
Breusch-Pagan
##
## studentized Breusch-Pagan test
##
## data: model_houseback
## BP = 10.176, df = 7, p-value = 0.1788
##
## studentized Breusch-Pagan test
##
## data: model_houseall
## BP = 12.598, df = 9, p-value = 0.1817
The p-value from bptest is 0.1788 for model_houseback and 0.1817 for model_houseall, both greater than 0.05. This means that for both models we fail to reject the null hypothesis (the variance of the residuals is constant), so there is no evidence of heteroscedasticity.
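To make the test less of a black box, the studentized Breusch-Pagan statistic can be reproduced by hand in base R: regress the squared residuals on the same predictors and compare n times the R-squared of that auxiliary regression against a chi-squared distribution. The sketch below does this on synthetic data with constant error variance. The names `toy`, `fit`, and `aux` are assumptions, and this is an illustration rather than the call used in this report.

```r
# Synthetic data with constant error variance (assumption).
set.seed(11)
n <- 2000
toy <- data.frame(x1 = rnorm(n), x2 = rbinom(n, 1, 0.5))
toy$y <- 1 + 2 * toy$x1 + 3 * toy$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = toy)

# Auxiliary regression: squared residuals on the same predictors.
toy$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ x1 + x2, data = toy)

# Studentized BP statistic: n * R^2, chi-squared with df = number of predictors.
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1
p_value <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p.value = p_value))
```

A p-value above 0.05 means the squared residuals are not explained by the predictors, which is what constant variance looks like.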
## Garage FirePlace Baths Floors Solar Electric
## 1.000019 1.000006 1.000021 1.000009 1.000017 1.000023
## Glass.Doors
## 1.000007
## Garage FirePlace Baths Floors Solar Electric
## 1.000023 1.000007 1.000029 1.000010 1.000037 1.000023
## Glass.Doors Swiming.Pool Garden
## 1.000011 1.000010 1.000028
Since all the VIF scores are lower than 10, we assume there is no multicollinearity, that is, the predictor variables are independent of each other.
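For numeric predictors, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other predictors. The sketch below computes it by hand in base R on synthetic, independently generated predictors (`toy`, an assumption), so values close to 1 are expected, as in the output above.

```r
# Synthetic, independently generated predictors (assumption).
set.seed(3)
n <- 1000
toy <- data.frame(
  Garage    = sample(1:3, n, replace = TRUE),
  Baths     = sample(1:5, n, replace = TRUE),
  FirePlace = sample(0:4, n, replace = TRUE)
)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the rest.
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

print(vif_manual)
```

A VIF of exactly 1 means a predictor is uncorrelated with the others; the value grows as the predictor becomes more predictable from the rest.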
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
## [1] 79697397
## [1] 8927.34
## Warning in y_true - y_pred: longer object length is not a multiple of shorter
## object length
## [1] 214102244
## Warning in y_true - y_pred: longer object length is not a multiple of shorter
## object length
## [1] 14632.23
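The MSE and RMSE reported for model_houseback are lower than those reported for the model using all the variables. Note that the warnings above ("longer object length is not a multiple of shorter object length") show that the actual and predicted vectors had different lengths in the second calculation, so that figure should be rechecked before relying on the comparison.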
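A minimal sketch of computing MSE and RMSE with matched vectors is shown below, using base R on synthetic data. The names `toy`, `train`, `test`, and `fit` are assumptions. The key point is that the predictions and the actual values come from the same rows, which the length check enforces.

```r
# Synthetic stand-in for the housing data (assumption).
set.seed(5)
n <- 1000
toy <- data.frame(
  Floors = rbinom(n, 1, 0.5),
  Baths  = sample(1:5, n, replace = TRUE)
)
toy$Prices <- 23000 + 15000 * toy$Floors + 1250 * toy$Baths + rnorm(n, sd = 9000)

idx   <- sample(n, 800)
train <- toy[idx, ]
test  <- toy[-idx, ]

fit  <- lm(Prices ~ Floors + Baths, data = train)
pred <- predict(fit, newdata = test)

# Guard against silent vector recycling: both vectors must have the same length.
stopifnot(length(pred) == nrow(test))

mse  <- mean((test$Prices - pred)^2)
rmse <- sqrt(mse)
print(c(MSE = mse, RMSE = rmse))
```

Evaluating every candidate model on the same test rows keeps the error figures directly comparable.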
To predict Prices in the future, we can use model_houseback with the following significant variables:
- Garage
- FirePlace
- Baths
- Floors
- Solar
- Electric
- Glass.Doors
The model using the aforementioned variables (model_houseback) appears to be a better choice than the model with all variables (model_houseall), because its MSE and RMSE are smaller. I suggest using model_houseback to predict prices in the future.