In this exercise, I want to predict the price of a house based on various predictor variables.
The data is taken from Kaggle, and the description suggests it is computer-generated, so we'll see how that affects our numbers.
I will use MSE and RMSE as measures of my model's accuracy. For those who are not familiar, MSE (mean squared error) and RMSE (root mean squared error) quantify the model's prediction error. Their magnitude depends on the scale of the target: an RMSE of 1,000 is a very low error rate if our prices are in the millions, but a very high one if our prices are in the hundreds.
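For reference, with $y_i$ the observed price, $\hat{y}_i$ the predicted price, and $n$ the number of observations, the two metrics are:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
$$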
Library packages that I’m using
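The original library chunk isn't reproduced here; based on the functions used later in this write-up, a likely set of packages is sketched below (treat the exact list as an assumption).

```r
# Likely packages, inferred from the functions used later; the original chunk is not shown
library(readr)    # read_csv() for loading the data
library(dplyr)    # rename(), mutate(), case_when() for feature engineering
library(stringr)  # str_to_lower(), str_replace_all() for cleaning column names
library(lmtest)   # bptest() for the Breusch-Pagan test
library(car)      # vif() for multicollinearity checks
```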
Reading the data
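The read step itself isn't shown; something like the following would do it, with the file name being a placeholder.

```r
# File name is a placeholder; read_csv() keeps the original column names
# (with spaces), which are cleaned up in a later step
house_price <- read_csv("house_price.csv")
```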
Checking the data
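The preview table below could be produced with, for example:

```r
# Quick look at the first ten rows
head(house_price, 10)
```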
| Area | Garage | FirePlace | Baths | White Marble | Black Marble | Indian Marble | Floors | City | Solar | Electric | Fiber | Glass Doors | Swiming Pool | Garden | Prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 164 | 2 | 0 | 2 | 0 | 1 | 0 | 0 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 43800 |
| 84 | 2 | 0 | 4 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 37550 |
| 190 | 2 | 4 | 4 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 49500 |
| 75 | 2 | 4 | 4 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 50075 |
| 148 | 1 | 4 | 2 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 52400 |
| 124 | 3 | 3 | 3 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 54300 |
| 58 | 1 | 0 | 2 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 1 | 0 | 1 | 34400 |
| 249 | 2 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 50425 |
| 243 | 1 | 0 | 2 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 29575 |
| 242 | 1 | 2 | 4 | 0 | 0 | 1 | 0 | 2 | 1 | 0 | 0 | 0 | 1 | 0 | 22300 |
At first glance, there is some linearity in our data, but I want to compare the results after engineering the data.
The data has a few features we need to engineer first.
1. White Marble, Black Marble, and Indian Marble should be collapsed into a single factor column instead of three separate columns.
2. Garage, FirePlace, Baths, and City should be converted to factors.
3. Columns that only contain 0 and 1, such as Floors, Solar, Electric, Fiber, Glass Doors, Swimming Pool, and Garden, will be converted to logical.
names(house_price) <- str_replace_all(str_to_lower(names(house_price)), " ", "_")
house_price_filter <- house_price %>%
rename(swimming_pool = swiming_pool) %>% #spelling!!
mutate( floors = as.logical(as.integer(floors)),
solar = as.logical(as.integer(solar)),
electric = as.logical(as.integer(electric)),
fiber = as.logical(as.integer(fiber)),
glass_doors = as.logical(as.integer(glass_doors)),
swimming_pool = as.logical(as.integer(swimming_pool)),
garden = as.logical(as.integer(garden))
) %>%
mutate(garage = as.factor(garage),
fireplace = as.factor(fireplace),
baths = as.factor(baths),
city = as.factor(city)) %>%
mutate(marble_choice = case_when( white_marble == 1 ~ "white_marble",
black_marble == 1 ~ "black_marble",
indian_marble == 1 ~ "indian_marble") ) %>%
select(-c(white_marble, black_marble, indian_marble))

In order to test the model at a later stage, I'll split the dataset into two parts, training and testing, with an 80:20 ratio.
set.seed(100)
intrain <- sample(nrow(house_price_filter), nrow(house_price_filter) * .8)
house_price_train <- house_price_filter[intrain,]
house_price_test <- house_price_filter[-intrain,]

We ended up with version 2 of the model. After trying out version 1, it appeared to be overfit: its R-squared was exactly 1, its RMSE was extremely low, and it relied on predictors with weak correlation to the target.
Version 2, however, is built with predictors that have relatively stronger correlation with our target variable and that should plausibly affect the price in a business-case sense.
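The full model (version 1) regresses price on every remaining predictor. The object name `model_house_v1` is my assumption, but the formula matches the Call shown in the output below.

```r
# Version 1: regress price on all predictors (object name assumed)
model_house_v1 <- lm(prices ~ ., data = house_price_train)
summary(model_house_v1)
```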
##
## Call:
## lm(formula = prices ~ ., data = house_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.000046590 0.000000000 0.000000000 0.000000000 0.000000198
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12249.999999925450538 0.000000000583347 20999507180557.438 <0.0000000000000002 ***
## area 25.000000000004619 0.000000000001623 15408197241292.121 <0.0000000000000002 ***
## garage2 1500.000000000111413 0.000000000285641 5251338988395.544 <0.0000000000000002 ***
## garage3 2999.999999999419742 0.000000000285171 10520012508983.896 <0.0000000000000002 ***
## fireplace1 749.999999999717488 0.000000000369017 2032427319033.808 <0.0000000000000002 ***
## fireplace2 1499.999999999714419 0.000000000369037 4064628304190.725 <0.0000000000000002 ***
## fireplace3 2249.999999999310603 0.000000000368570 6104671996517.958 <0.0000000000000002 ***
## fireplace4 3000.000000000022283 0.000000000368597 8138972100765.117 <0.0000000000000002 ***
## baths2 1249.999999998968860 0.000000000368232 3394599186564.467 <0.0000000000000002 ***
## baths3 2499.999999999759893 0.000000000367638 6800172397216.058 <0.0000000000000002 ***
## baths4 3749.999999999752617 0.000000000368039 10189128624311.863 <0.0000000000000002 ***
## baths5 4999.999999999542524 0.000000000368346 13574209596262.346 <0.0000000000000002 ***
## floorsTRUE 15000.000000000394721 0.000000000232974 64384820861913.500 <0.0000000000000002 ***
## city2 3500.000000000356522 0.000000000285400 12263477558947.682 <0.0000000000000002 ***
## city3 7000.000000000159162 0.000000000285355 24530818169298.816 <0.0000000000000002 ***
## solarTRUE 250.000000000234593 0.000000000232976 1073070716705.686 <0.0000000000000002 ***
## electricTRUE 1250.000000000187811 0.000000000232975 5365379701286.465 <0.0000000000000002 ***
## fiberTRUE 11750.000000000161890 0.000000000232976 50434366647922.492 <0.0000000000000002 ***
## glass_doorsTRUE 4449.999999999679858 0.000000000232976 19100663635603.164 <0.0000000000000002 ***
## swimming_poolTRUE -0.000000000239923 0.000000000232977 -1.030 0.303
## gardenTRUE 0.000000000232821 0.000000000232980 0.999 0.318
## marble_choiceindian_marble -4999.999999999145075 0.000000000285229 -17529758565682.213 <0.0000000000000002 ***
## marble_choicewhite_marble 9000.000000000556611 0.000000000285637 31508535793061.383 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00000007367 on 399977 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.912e+26 on 22 and 399977 DF, p-value: < 0.00000000000000022
Checking fitted values and residuals.
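The residual plot for version 1 isn't reproduced here; a minimal sketch of how it could be drawn, assuming the `model_house_v1` object from above:

```r
# Fitted values vs residuals for the version 1 model (object name assumed)
plot(model_house_v1$fitted.values, model_house_v1$residuals,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```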
model_house_v2 <- lm(formula = prices ~ area + floors + marble_choice + fiber + city, data = house_price_train)
summary(model_house_v2)

##
## Call:
## lm(formula = prices ~ area + floors + marble_choice + fiber +
## city, data = house_price_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8504.7 -2474.5 2.6 2470.9 8500.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20742.0789 16.6831 1243.3 <0.0000000000000002 ***
## area 24.8906 0.0733 339.6 <0.0000000000000002 ***
## floorsTRUE 14990.8490 10.5256 1424.2 <0.0000000000000002 ***
## marble_choiceindian_marble -4996.4787 12.8864 -387.7 <0.0000000000000002 ***
## marble_choicewhite_marble 9008.5019 12.9048 698.1 <0.0000000000000002 ***
## fiberTRUE 11741.9339 10.5256 1115.6 <0.0000000000000002 ***
## city2 3505.7211 12.8942 271.9 <0.0000000000000002 ***
## city3 7004.1872 12.8920 543.3 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3328 on 399992 degrees of freedom
## Multiple R-squared: 0.9245, Adjusted R-squared: 0.9244
## F-statistic: 6.992e+05 on 7 and 399992 DF, p-value: < 0.00000000000000022
Checking fitted values and residuals.
The residual plot looks close to normally distributed, although the Shapiro-Wilk test below (run on the first 5,000 residuals) still rejects strict normality; with this many observations, the test flags even tiny departures from normality.
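The check was presumably produced with something like the following; the plotting call is my assumption, while the Shapiro-Wilk call matches the `data:` line in the output.

```r
# Visual check of the residual distribution (plot call assumed), then
# Shapiro-Wilk on the first 5,000 residuals -- shapiro.test() accepts at most 5,000 values
hist(model_house_v2$residuals, breaks = 50,
     main = "Residuals of model_house_v2", xlab = "Residual")
shapiro.test(model_house_v2$residuals[0:5000])
```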
##
## Shapiro-Wilk normality test
##
## data: model_house_v2$residuals[0:5000]
## W = 0.99481, p-value = 0.000000000002007
Our model appears to satisfy the homoscedasticity assumption according to the Breusch-Pagan test: with p = 0.8355 we fail to reject the null hypothesis of constant error variance.
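The output below is consistent with a call like this, using `bptest()` from the lmtest package:

```r
# Breusch-Pagan test for heteroscedasticity on the version 2 model
lmtest::bptest(model_house_v2)
```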
##
## studentized Breusch-Pagan test
##
## data: model_house_v2
## BP = 3.4978, df = 7, p-value = 0.8355
The predictor variables do not seem to have strong linear relationships with each other: all of the (generalized) VIF values below are essentially 1.
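The table below matches what `vif()` from the car package reports for a model containing factor predictors (GVIF):

```r
# Generalized VIF; values near 1 indicate negligible multicollinearity
car::vif(model_house_v2)
```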
## GVIF Df GVIF^(1/(2*Df))
## area 1.000018 1 1.000009
## floors 1.000011 1 1.000006
## marble_choice 1.000010 2 1.000002
## fiber 1.000011 1 1.000005
## city 1.000031 2 1.000008
Below is the price range of the original dataset, as a guide to interpreting our model's error.
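A sketch of how that range could be obtained (the exact call used isn't shown):

```r
# Price range of the dataset, for scale when reading the RMSE values
range(house_price_filter$prices)
```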
Below are our train and test MSE and RMSE. As you can see, the two RMSE values are very close, which suggests our model is not overfit.
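The metric values themselves aren't reproduced here; a minimal sketch of how they could be computed, with variable names of my own choosing:

```r
# Predictions on the training and test sets
pred_train <- predict(model_house_v2, newdata = house_price_train)
pred_test  <- predict(model_house_v2, newdata = house_price_test)

# MSE and RMSE for each split
mse_train <- mean((house_price_train$prices - pred_train)^2)
mse_test  <- mean((house_price_test$prices - pred_test)^2)
c(rmse_train = sqrt(mse_train), rmse_test = sqrt(mse_test))
```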
Below is a plot of our predicted prices against the actual test prices.
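The plot isn't embedded here; a base-R sketch of the comparison, using the prediction variable assumed above:

```r
# Predicted vs actual prices on the test set; points close to the diagonal indicate a good fit
plot(house_price_test$prices, pred_test,
     xlab = "Actual price", ylab = "Predicted price")
abline(0, 1, col = "red")
```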
Currently, I think our model is a good one. The difference in RMSE between the train and test sets is very small, indicating that the model is not overfit, and it passes the homoscedasticity and multicollinearity checks, with residuals that look close to normal. We achieved a high adjusted R-squared of 0.9244, meaning the chosen predictors explain about 92% of the variance in our target variable.
Based on our model, the variables that affect house price positively are Area, Floors, Marble Choice, Fiber, and City. Within Marble Choice, Indian Marble is associated with a lower price and White Marble with a higher price, both relative to Black Marble, the reference level.
Floors and Fiber affect the price the most, while Area has the smallest coefficient, adding roughly 25 to the price per unit of area.