This report is built around the House Prices dataset provided on Kaggle.
Be aware that this dataset is computer generated and is provided for
academic research and practice in model development. We
will use linear regression to build a prediction model that estimates
Prices from the important parameters provided.
Load the required packages
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Read the CSV data
house_price <- read.csv("dataset_houseprices.csv")
house_price %>% head(5)
## Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1 164 2 0 2 0 1 0 0
## 2 84 2 0 4 0 0 1 1
## 3 190 2 4 4 1 0 0 0
## 4 75 2 4 4 0 0 1 1
## 5 148 1 4 2 1 0 0 1
## City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1 3 1 1 1 1 0 0 43800
## 2 2 0 0 0 1 1 1 37550
## 3 2 0 0 1 0 0 0 49500
## 4 1 1 1 1 1 1 1 50075
## 5 2 1 0 0 1 1 1 52400
The dataset contains many integer values, as we can see above, but
they are in fact treated here as categorical values that encode
information as numbers. For example, in the
Black.Marble column, 1 means ‘Yes, there is Black
Marble in the House’ and 0 means there is not. So that the model
treats these columns as categories rather than quantities, we convert them to type
factor.
house_price_clean <- house_price %>%
  mutate_at(vars(Area, Garage, FirePlace, Baths, White.Marble, Black.Marble, Indian.Marble,
                 Floors, City, Solar, Electric, Fiber, Glass.Doors, Swiming.Pool, Garden),
            as.factor)
glimpse(house_price_clean)
## Rows: 500,000
## Columns: 16
## $ Area <fct> 164, 84, 190, 75, 148, 124, 58, 249, 243, 242, 61, 189, …
## $ Garage <fct> 2, 2, 2, 2, 1, 3, 1, 2, 1, 1, 2, 2, 2, 3, 3, 3, 1, 3, 2,…
## $ FirePlace <fct> 0, 0, 4, 4, 4, 3, 0, 1, 0, 2, 4, 0, 0, 3, 3, 4, 0, 3, 3,…
## $ Baths <fct> 2, 4, 4, 4, 2, 3, 2, 1, 2, 4, 5, 4, 2, 3, 1, 1, 5, 3, 5,…
## $ White.Marble <fct> 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ Black.Marble <fct> 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ Indian.Marble <fct> 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0,…
## $ Floors <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,…
## $ City <fct> 3, 2, 2, 1, 2, 1, 3, 1, 1, 2, 1, 2, 1, 3, 3, 1, 3, 1, 3,…
## $ Solar <fct> 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0,…
## $ Electric <fct> 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1,…
## $ Fiber <fct> 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0,…
## $ Glass.Doors <fct> 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1,…
## $ Swiming.Pool <fct> 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0,…
## $ Garden <fct> 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ Prices <int> 43800, 37550, 49500, 50075, 52400, 54300, 34400, 50425, …
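The same kind of conversion can be sketched in base R, without dplyr. This is only an illustration on a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset): the listed columns become factors and the numeric target is left alone.

```r
# Hypothetical toy data with integer-coded columns like those in the report
toy <- data.frame(
  Garage = c(1L, 2L, 3L, 2L),
  Solar  = c(0L, 1L, 1L, 0L),
  Prices = c(30000L, 42000L, 51000L, 39000L)
)

# Columns to treat as categorical; the target (Prices) stays numeric
factor_cols <- c("Garage", "Solar")
toy[factor_cols] <- lapply(toy[factor_cols], as.factor)

str(toy)            # Garage and Solar are now factors
levels(toy$Garage)  # the distinct codes become the factor levels
```

Listing the columns in a character vector keeps the conversion in one place if the set of categorical columns changes.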
Check for missing data
anyNA(house_price_clean)
## [1] FALSE
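anyNA() only reports whether at least one value is missing anywhere in the data frame. Had it returned TRUE, a per-column count would show where the gaps are. A minimal sketch on a made-up data frame (`toy_na` is hypothetical):

```r
# Hypothetical data frame with a few missing values
toy_na <- data.frame(
  Garage = c(1, 2, NA, 3),
  Baths  = c(2, 4, 4, 1),
  Prices = c(43800, NA, 49500, NA)
)

anyNA(toy_na)                            # TRUE: something is missing

# Count missing values per column
na_per_column <- colSums(is.na(toy_na))
na_per_column
```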
We also notice that the predictor Area has 249 unique
values. As a factor it would add 248 dummy variables to the model, so we leave this predictor out of the training set
because it is too specific and has too many levels.
length(unique(house_price_clean$Area))
## [1] 249
house_price_clean <- house_price_clean %>%
  select(-Area)
names(house_price_clean)
## [1] "Garage" "FirePlace" "Baths" "White.Marble"
## [5] "Black.Marble" "Indian.Marble" "Floors" "City"
## [9] "Solar" "Electric" "Fiber" "Glass.Doors"
## [13] "Swiming.Pool" "Garden" "Prices"
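The level count used to single out Area can also be computed for every factor column at once, which makes it easy to spot other high-cardinality predictors. A minimal sketch on a made-up data frame (`toy_levels` and the threshold of 4 are hypothetical choices for illustration):

```r
# Hypothetical factor columns with different numbers of levels
toy_levels <- data.frame(
  Area   = factor(c(164, 84, 190, 75, 148, 124)),
  Garage = factor(c(2, 2, 2, 2, 1, 3)),
  Floors = factor(c(0, 1, 0, 1, 1, 1))
)

# Number of levels per column
n_levels <- sapply(toy_levels, nlevels)
n_levels

# Flag columns whose level count exceeds an arbitrary threshold
names(n_levels)[n_levels > 4]
```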
EDA (exploratory data analysis) is the phase where we assess the dataset. A correlation test is generally used to check the linear relationship between the predictors and the target, but since all of our predictors are factors, we cannot do a correlation assessment. Instead, we check the frequency distribution of each predictor to make sure that no column has heavily imbalanced levels.
house_price %>% group_by(Area) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 249 × 2
## Area count
## <int> <int>
## 1 1 2038
## 2 2 2030
## 3 3 1905
## 4 4 1967
## 5 5 2071
## 6 6 1990
## 7 7 2060
## 8 8 1983
## 9 9 1920
## 10 10 2025
## # … with 239 more rows
house_price %>% group_by(Garage) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 3 × 2
## Garage count
## <int> <int>
## 1 1 166552
## 2 2 166251
## 3 3 167197
house_price %>% group_by(FirePlace) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 5 × 2
## FirePlace count
## <int> <int>
## 1 0 99569
## 2 1 99983
## 3 2 99954
## 4 3 100168
## 5 4 100326
house_price %>% group_by(Baths) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 5 × 2
## Baths count
## <int> <int>
## 1 1 100319
## 2 2 99794
## 3 3 100158
## 4 4 99989
## 5 5 99740
house_price %>% group_by(White.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## White.Marble count
## <int> <int>
## 1 0 333504
## 2 1 166496
house_price %>% group_by(Black.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Black.Marble count
## <int> <int>
## 1 0 333655
## 2 1 166345
house_price %>% group_by(Indian.Marble) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Indian.Marble count
## <int> <int>
## 1 0 332841
## 2 1 167159
house_price %>% group_by(Floors) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Floors count
## <int> <int>
## 1 0 250307
## 2 1 249693
house_price %>% group_by(City) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 3 × 2
## City count
## <int> <int>
## 1 1 166314
## 2 2 166902
## 3 3 166784
house_price %>% group_by(Solar) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Solar count
## <int> <int>
## 1 0 250653
## 2 1 249347
house_price %>% group_by(Electric) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Electric count
## <int> <int>
## 1 0 249675
## 2 1 250325
house_price %>% group_by(Fiber) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Fiber count
## <int> <int>
## 1 0 249766
## 2 1 250234
house_price %>% group_by(Glass.Doors) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Glass.Doors count
## <int> <int>
## 1 0 250065
## 2 1 249935
house_price %>% group_by(Swiming.Pool) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Swiming.Pool count
## <int> <int>
## 1 0 249782
## 2 1 250218
house_price %>% group_by(Garden) %>% summarise(count=n()) %>% ungroup()
## # A tibble: 2 × 2
## Garden count
## <int> <int>
## 1 0 249177
## 2 1 250823
Although there are many variables, we can observe that all of them are
fairly distributed. Most of the variables have evenly
proportioned levels, except for White.Marble,
Black.Marble, and Indian.Marble, whose levels have a
ratio of roughly 1:2, which is still acceptable.
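The repeated group_by() calls above could be condensed into a single pass over the columns. A minimal base R sketch on a small made-up data frame (`toy_freq` is hypothetical); it reports the share of each level per column, so an imbalance such as the 1:2 split of the marble columns shows up directly as proportions:

```r
# Hypothetical indicator columns
toy_freq <- data.frame(
  White.Marble = c(0, 0, 1, 0, 1, 0),
  Floors       = c(0, 1, 0, 1, 1, 0)
)

# For every column: tabulate the levels and convert counts to proportions
level_shares <- lapply(toy_freq, function(x) prop.table(table(x)))
level_shares
```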
model_house_price <- lm(formula = Prices ~ . , data = house_price_clean)
summary(model_house_price)
##
## Call:
## lm(formula = Prices ~ ., data = house_price_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3127.69 -1556.89 0.65 1552.74 3137.98
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10379.346 11.910 871.512 <2e-16 ***
## Garage2 1505.576 6.223 241.946 <2e-16 ***
## Garage3 2996.052 6.214 482.149 <2e-16 ***
## FirePlace1 751.615 8.036 93.530 <2e-16 ***
## FirePlace2 1489.221 8.037 185.301 <2e-16 ***
## FirePlace3 2251.582 8.032 280.312 <2e-16 ***
## FirePlace4 3002.395 8.029 373.931 <2e-16 ***
## Baths2 1247.118 8.025 155.406 <2e-16 ***
## Baths3 2499.609 8.018 311.763 <2e-16 ***
## Baths4 3745.880 8.021 467.011 <2e-16 ***
## Baths5 4997.958 8.026 622.724 <2e-16 ***
## White.Marble1 14009.039 6.215 2254.144 <2e-16 ***
## Black.Marble1 4998.862 6.216 804.164 <2e-16 ***
## Indian.Marble1 NA NA NA NA
## Floors1 14997.228 5.077 2954.075 <2e-16 ***
## City2 3493.203 6.219 561.707 <2e-16 ***
## City3 6984.857 6.220 1122.962 <2e-16 ***
## Solar1 251.915 5.077 49.620 <2e-16 ***
## Electric1 1249.509 5.077 246.122 <2e-16 ***
## Fiber1 11750.359 5.077 2314.506 <2e-16 ***
## Glass.Doors1 4445.571 5.077 875.660 <2e-16 ***
## Swiming.Pool1 2.222 5.077 0.438 0.662
## Garden1 5.114 5.077 1.007 0.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1795 on 499978 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.978
## F-statistic: 1.06e+06 on 21 and 499978 DF, p-value: < 2.2e-16
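Based on the initial model, we can observe that one coefficient is
not defined because of singularities:
Indian.Marble1. The three marble indicators appear to be mutually exclusive
(their counts of 1 in the tables above sum to exactly 500,000 rows), so
Indian.Marble is fully determined by White.Marble and Black.Marble. For the next round of model tuning, we will remove
this redundant predictor and build the model with step-wise feature
selection to find the most efficient model.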
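The singularity reported for Indian.Marble1 is what one would expect when a set of indicator columns always sums to 1, because the last indicator is then a linear combination of the intercept and the others. The following sketch reproduces the effect on simulated data; the data frame `toy`, its column names, and the price formula are made up for illustration.

```r
set.seed(1)
n <- 300

# Each simulated house gets exactly one marble type
marble <- sample(c("white", "black", "indian"), n, replace = TRUE)
toy <- data.frame(
  white  = as.integer(marble == "white"),
  black  = as.integer(marble == "black"),
  indian = as.integer(marble == "indian")
)
toy$price <- 10000 + 14000 * toy$white + 5000 * toy$black + rnorm(n, sd = 100)

# The three indicators sum to 1 in every row
all(rowSums(toy[, c("white", "black", "indian")]) == 1)

# lm() cannot estimate all three alongside the intercept:
# the coefficient of the last collinear column comes back as NA
fit <- lm(price ~ white + black + indian, data = toy)
coef(fit)
```

Dropping one of the indicators, as the report does with Indian.Marble, turns the dropped category into the baseline that the remaining coefficients are measured against.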
house_price_clean <- house_price_clean %>% select(-Indian.Marble)
model_house_price_clean <- lm(formula = Prices ~ . , data = house_price_clean)
model_house_price_step <- step(object = model_house_price_clean,
                               direction = "both",
                               trace = FALSE)
summary(model_house_price_step)
##
## Call:
## lm(formula = Prices ~ Garage + FirePlace + Baths + White.Marble +
## Black.Marble + Floors + City + Solar + Electric + Fiber +
## Glass.Doors, data = house_price_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3124.25 -1556.78 0.72 1552.72 3134.30
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10383.005 11.360 913.97 <2e-16 ***
## Garage2 1505.584 6.223 241.95 <2e-16 ***
## Garage3 2996.051 6.214 482.15 <2e-16 ***
## FirePlace1 751.625 8.036 93.53 <2e-16 ***
## FirePlace2 1489.212 8.037 185.30 <2e-16 ***
## FirePlace3 2251.574 8.032 280.31 <2e-16 ***
## FirePlace4 3002.412 8.029 373.93 <2e-16 ***
## Baths2 1247.103 8.025 155.40 <2e-16 ***
## Baths3 2499.613 8.018 311.76 <2e-16 ***
## Baths4 3745.891 8.021 467.01 <2e-16 ***
## Baths5 4997.970 8.026 622.73 <2e-16 ***
## White.Marble1 14009.041 6.215 2254.15 <2e-16 ***
## Black.Marble1 4998.866 6.216 804.17 <2e-16 ***
## Floors1 14997.225 5.077 2954.08 <2e-16 ***
## City2 3493.215 6.219 561.71 <2e-16 ***
## City3 6984.865 6.220 1122.96 <2e-16 ***
## Solar1 251.892 5.077 49.62 <2e-16 ***
## Electric1 1249.514 5.077 246.12 <2e-16 ***
## Fiber1 11750.368 5.077 2314.53 <2e-16 ***
## Glass.Doors1 4445.589 5.077 875.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1795 on 499980 degrees of freedom
## Multiple R-squared: 0.978, Adjusted R-squared: 0.978
## F-statistic: 1.172e+06 on 19 and 499980 DF, p-value: < 2.2e-16
Step-wise regression, which compares candidate models by AIC, helps us drop
predictors that do not improve the model. In this case,
Swiming.Pool and Garden, the two predictors with non-significant p-values
in the first model, are removed from the
list of predictors.
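To make the selection criterion concrete: step() adds or removes one term at a time and keeps a change only when it lowers the AIC. The sketch below runs it on simulated data in which one predictor is pure noise; the data frame `toy` and its columns are hypothetical, and whether the noise column is actually dropped depends on the random draw.

```r
set.seed(42)
n <- 500

# y depends on x1 and x2; 'noise' is unrelated to y
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n)

full     <- lm(y ~ ., data = toy)
selected <- step(full, direction = "both", trace = FALSE)

formula(selected)   # the terms that survived the selection
AIC(full)
AIC(selected)       # never higher than the AIC of the starting model
```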
Because this dataset comes with no separate new data, we will predict the
prices for the same house_price observations used for training. We will use the two models
that we already built to predict the Prices:
model_house_price_clean, a model without the predictors
Area and Indian.Marble
model_house_price_step, a model built using step-wise
regression to find the best predictors
house_price$pred_all_clean <- predict(object = model_house_price_clean, newdata = house_price_clean)
house_price$pred_step <- predict(object = model_house_price_step, newdata = house_price_clean)
head(house_price)
## Area Garage FirePlace Baths White.Marble Black.Marble Indian.Marble Floors
## 1 164 2 0 2 0 1 0 0
## 2 84 2 0 4 0 0 1 1
## 3 190 2 4 4 1 0 0 0
## 4 75 2 4 4 0 0 1 1
## 5 148 1 4 2 1 0 0 1
## 6 124 3 3 3 0 1 0 1
## City Solar Electric Fiber Glass.Doors Swiming.Pool Garden Prices
## 1 3 1 1 1 1 0 0 43800
## 2 2 0 0 0 1 1 1 37550
## 3 2 0 0 1 0 0 0 49500
## 4 1 1 1 1 1 1 1 50075
## 5 2 1 0 0 1 1 1 52400
## 6 1 0 0 1 1 1 1 54300
## pred_all_clean pred_step
## 1 42813.11 42816.78
## 2 38574.14 38570.51
## 3 47885.80 47889.52
## 4 51335.11 51331.48
## 5 51833.15 51829.48
## 6 54325.95 54322.29
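The predictions above are made on the same rows the models were fitted on. When no new data is available, one common alternative (not what this report does) is to hold out part of the rows before fitting. A minimal sketch on simulated data, where `toy`, the 80/20 split, and the price formula are all assumptions made for illustration:

```r
set.seed(123)
n <- 200

# Hypothetical data with two categorical predictors
toy <- data.frame(
  Garage = factor(sample(1:3, n, replace = TRUE)),
  Floors = factor(sample(0:1, n, replace = TRUE))
)
toy$Prices <- 10000 + 1500 * as.integer(toy$Garage) +
  15000 * (toy$Floors == "1") + rnorm(n, sd = 500)

# 80% of the rows for training, the rest held out for testing
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

fit <- lm(Prices ~ ., data = train)
test$pred <- predict(fit, newdata = test)

# RMSE on rows the model has not seen
sqrt(mean((test$Prices - test$pred)^2))
```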
We will compare the goodness of fit of the two models based on their adjusted R-squared values.
summary(model_house_price_clean)$adj.r.squared
## [1] 0.978033
summary(model_house_price_step)$adj.r.squared
## [1] 0.9780331
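Adjusted R-squared penalizes the plain R-squared for the number of predictors, which is why a model with fewer terms can score marginally higher even when it explains the same share of variance. The sketch below recomputes the value by hand on R's built-in mtcars data (chosen only as a convenient example, unrelated to the house prices) and checks it against summary().

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)  # built-in example data
s   <- summary(fit)

n <- nrow(mtcars)           # number of observations
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

all.equal(adj_manual, s$adj.r.squared)
```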
Next we calculate the prediction error, that is, the difference between the actual and the predicted values. We use RMSE (Root Mean Squared Error), which is sensitive to large errors.
RMSE(y_pred = house_price$pred_all_clean, y_true = house_price$Prices)
## [1] 1794.85
RMSE(y_pred = house_price$pred_step, y_true = house_price$Prices)
## [1] 1794.852
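For reference, the quantity being reported can be written out in a few lines of base R. The helper `rmse_manual` and the three example values are hypothetical and only illustrate the formula; they are not part of MLmetrics.

```r
# Root mean squared error: square the errors, average them, take the root
rmse_manual <- function(y_pred, y_true) {
  sqrt(mean((y_true - y_pred)^2))
}

# Made-up actual and predicted values
y_true <- c(100, 200, 300)
y_pred <- c(110, 190, 330)

rmse_manual(y_pred = y_pred, y_true = y_true)
```

Because the errors are squared before averaging, the single error of 30 contributes far more to the result than the two errors of 10.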
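Comparing the RMSE of both models, we can argue that they produce almost identical predictions. An RMSE of about 1,795 means that the predicted price typically deviates from the actual price by around 1,795 price units. Individual errors can be larger, with residuals reaching roughly ±3,130 in the model summaries above.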
Both models produce reliable predictions of the house price, with an adjusted R-squared of about 0.978, although this was measured on the same data used for training. By using step-wise regression we obtained a smaller model whose results are practically equal to those of the model without step-wise regression. This means that our step-wise feature reduction successfully reduced the number of features for training efficiency while retaining the information needed to produce reliable predictions.