Project Description
This report contains the detailed implementation of a house price prediction model using a regression algorithm.
The dataset used in this report is sourced from real US housing data.
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
#1. Data Extraction
Import the necessary libraries:
library(ggplot2)   ## graphics and visualization
library(caret)     ## one-hot encoding
library(corrgram)  ## visualization of correlation coefficients
library(dplyr)     ## data manipulation
library(Metrics)   ## performance evaluation
library(gridExtra) ## plotting multiple graphs
Read the house dataset from a .csv file into an R data frame, then inspect the structure of the data frame.
## import data in csv format to data frame
house_df<-read.csv("data/house.csv")
## see the structure of the dataframe
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset has 4600 observations (rows) and 18 variables (columns). The target variable is `price`, and the remaining variables are feature candidates.
Compute the statistical summary of the numerical variables.
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Here we can observe the distribution summary (min, max, mean, Q1, Q2, Q3) of each numerical variable.
An interesting observation is that the minimum house price is zero; this is probably a data-entry error or an anomaly.
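As a quick sanity check (a minimal sketch; the same zero-price rows are handled later in the Data Preparation section), we can count these suspicious observations directly:
## count observations whose recorded price is zero
## (the Data Preparation section reports 49 such rows)
sum(house_df$price == 0)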
#2. Exploratory Data Analysis
Univariate analysis: we want to see the distribution of the target variable `price`.
ggplot(data = house_df, aes(y = price)) + geom_boxplot() +
scale_y_continuous(limits= c(0,2000000))
out_price <- boxplot.stats(house_df$price)$out
paste("Number of outliers: ", length(out_price))
## [1] "Number of outliers: 240"
Based on the boxplot above, we can see a significant number of outliers in `price`: 240 outliers, about 5.217% of the population analyzed.
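The percentage quoted above can be reproduced from the objects already defined (a minimal sketch):
## share of price outliers relative to all 4600 observations
round(length(out_price) / nrow(house_df) * 100, 3)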
Next, we run a bivariate analysis of the number of bedrooms vs. `price`.
house_df$bedrooms2 <- factor(house_df$bedrooms)
ggplot(data = house_df, aes(y = price, x = bedrooms2)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
In general, the higher the number of bedrooms, the higher the price. However, for houses with `bedrooms == 0`, the prices are significantly higher than this trend would suggest.
Let's look deeper into this finding to investigate what causes these especially high-value houses with zero bedrooms.
house_df$bedrooms2 <- factor(house_df$bedrooms)
ggplot(data = house_df, aes(y = sqft_living, x = bedrooms2)) +
geom_boxplot()
The analysis finds that houses with zero bedrooms have a higher median living area (`sqft_living`) than the rest of the house population with one or more bedrooms, as the summary below confirms.
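To quantify this, a short dplyr summary (a sketch using the already loaded packages) compares the median living area per bedroom count:
## median living area (sqft_living) grouped by number of bedrooms
house_df %>%
  group_by(bedrooms) %>%
  summarise(median_sqft_living = median(sqft_living),
            n = n()) %>%
  arrange(bedrooms)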
Now we analyze three or more variables together.
ggplot(data = house_df, aes(y = price, x = sqft_living,
color = bedrooms2)) +
geom_point() +
scale_y_continuous(limits = c(0, 2000000)) +
scale_x_continuous(limits = c(0, 7500))
The results of the analysis suggest that: 1. In general, the larger the `sqft_living`, the higher the `price`. 2. There are houses with various living-area sizes but a price of zero (the dots along the x-axis). This can suggest multiple errors within the dataset or anomalies in the actual conditions of the related houses.
Compute Pearson’s correlation coefficient (R)
house_df_num <- house_df[ , 2:12]
corrgram(house_df_num,
upper.panel = panel.cor)
All numerical variables have a positive correlation with `price`. The variables with the highest correlation to `price` are `sqft_living`, `sqft_above`, and `bathrooms`.
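For a numeric view of the same relationships, the correlations with `price` can be extracted and sorted (a minimal sketch on the numeric subset defined above):
## Pearson correlation of each numeric variable with price, sorted descending
sort(cor(house_df_num)[, "price"], decreasing = TRUE)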
#3. Data Preparation
Remove observations with an incorrect price (`price == 0`).
price_0 <- filter(house_df_num, price==0)
paste("Number of observations with price == 0: ", nrow(price_0))
## [1] "Number of observations with price == 0: 49"
house_df_num1 <- filter(house_df_num, price > 0)
paste("Number of Observations: ", nrow(house_df_num1))
## [1] "Number of Observations: 4551"
Our observation finds 49 houses with a price of zero. After removing these rows, the data frame has 4551 observations.
Remove outliers on `price`.
### get outliers
out_price <- boxplot.stats(house_df_num1$price)$out
paste("Number of outliers: ", length(out_price))
## [1] "Number of outliers: 240"
### get outliers' index
out_idx <- which( house_df_num1$price %in% c(out_price) )
### remove outliers
house_df_num1 <- house_df_num1[-out_idx, ]
paste("Number of current observations: ", nrow(house_df_num1))
## [1] "Number of current observations: 4311"
There are 240 observations flagged as outliers in `price`. After their removal, the data frame is left with 4311 observations.
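For reference, `boxplot.stats()` flags values outside the Tukey fences (roughly 1.5 × IQR beyond the quartiles). A sketch of the same rule, recomputed on the price values before outlier removal, should closely match the 240 flagged rows (hinges and quartiles can differ slightly):
## Tukey-fence cross-check of the outliers flagged by boxplot.stats()
p <- filter(house_df_num, price > 0)$price   ## prices before outlier removal
q <- quantile(p, c(0.25, 0.75))
fence_low  <- q[1] - 1.5 * IQR(p)
fence_high <- q[2] + 1.5 * IQR(p)
sum(p < fence_low | p > fence_high)          ## expected to be close to 240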
Apply one-hot encoding (OHE) to the chosen location variable (`statezip`).
## 1. create dataframe for column to be encoded.
location_df <- data.frame(house_df$statezip)
colnames(location_df) <- "location_"
## 2. OHE on that column
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))
paste("Number of additional columns: ", ncol(df2))
## [1] "Number of additional columns: 77"
## 3. Combine to original dataframe
idx <- rownames(house_df_num1)
house_df_num2 <- cbind(house_df_num1, df2[ idx, ] )
paste ("Number of current variables: ", ncol(house_df_num2))
## [1] "Number of current variables: 88"
Because there are 77 distinct `statezip` values, the OHE technique creates 77 new columns (one per value). Added to the existing 11 columns, we now have 88 variables.
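To make the OHE step concrete, here is a tiny self-contained example on hypothetical data (not from the house dataset) showing what `dummyVars()` produces:
## toy example: one-hot encoding a small factor column with caret::dummyVars
toy <- data.frame(location_ = factor(c("WA 98001", "WA 98002", "WA 98001")))
toy_enc <- dummyVars("~.", data = toy)
predict(toy_enc, newdata = toy)
## each distinct value becomes its own 0/1 indicator column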
Split the data between training and testing sets with a 70:30 ratio.
m <- nrow(house_df_num1)
n <- ncol(house_df_num1)
mtrain <- floor(0.7 * m)
set.seed(2022)
train_idx <- sample(m, mtrain)
train_df <- house_df_num2[ train_idx, ]
test_df <- house_df_num2[ -train_idx, ]
paste("Number of training obs: ", nrow(train_df))
## [1] "Number of training obs: 3017"
paste("Number of testing obs: ", nrow(test_df))
## [1] "Number of testing obs: 1294"
There are 3017 observations in the training data to develop the prediction model and 1294 observations in the testing data to evaluate it.
#4. Modeling
We use multiple linear regression (MLR) to predict house price.
model.mlr2 <- lm(formula = price ~ . ,
data = train_df)
summary (model.mlr2)
##
## Call:
## lm(formula = price ~ ., data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -842587 -51386 -464 51246 483921
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.111e+04 7.649e+04 -0.407 0.684249
## bedrooms -9.967e+03 2.950e+03 -3.379 0.000738 ***
## bathrooms 1.836e+04 4.517e+03 4.064 4.95e-05 ***
## sqft_living 9.948e+01 6.310e+00 15.765 < 2e-16 ***
## sqft_lot 1.363e-01 5.514e-02 2.471 0.013530 *
## floors -1.957e+04 5.467e+03 -3.579 0.000350 ***
## waterfront 1.581e+05 3.208e+04 4.929 8.71e-07 ***
## view 3.479e+04 3.359e+03 10.356 < 2e-16 ***
## condition 2.867e+04 3.120e+03 9.190 < 2e-16 ***
## sqft_above 7.374e+01 6.551e+00 11.256 < 2e-16 ***
## sqft_basement NA NA NA NA
## location_WA.98001 -8.852e+04 7.657e+04 -1.156 0.247773
## location_WA.98002 -9.675e+04 7.854e+04 -1.232 0.218122
## location_WA.98003 -8.770e+04 7.718e+04 -1.136 0.255895
## location_WA.98004 4.304e+05 7.727e+04 5.570 2.77e-08 ***
## location_WA.98005 2.637e+05 7.828e+04 3.368 0.000766 ***
## location_WA.98006 2.314e+05 7.630e+04 3.032 0.002447 **
## location_WA.98007 2.317e+05 7.788e+04 2.975 0.002956 **
## location_WA.98008 1.490e+05 7.740e+04 1.924 0.054394 .
## location_WA.98010 2.111e+04 8.867e+04 0.238 0.811809
## location_WA.98011 6.525e+04 7.818e+04 0.835 0.403992
## location_WA.98014 3.734e+04 7.960e+04 0.469 0.639040
## location_WA.98019 6.195e+03 7.725e+04 0.080 0.936086
## location_WA.98022 -6.777e+04 7.803e+04 -0.868 0.385207
## location_WA.98023 -9.450e+04 7.609e+04 -1.242 0.214381
## location_WA.98024 9.047e+04 8.385e+04 1.079 0.280671
## location_WA.98027 1.075e+05 7.631e+04 1.409 0.158995
## location_WA.98028 5.548e+04 7.664e+04 0.724 0.469178
## location_WA.98029 1.663e+05 7.655e+04 2.173 0.029892 *
## location_WA.98030 -4.593e+04 7.766e+04 -0.592 0.554227
## location_WA.98031 -6.642e+04 7.654e+04 -0.868 0.385569
## location_WA.98032 -1.210e+05 7.979e+04 -1.516 0.129591
## location_WA.98033 2.511e+05 7.630e+04 3.291 0.001011 **
## location_WA.98034 1.174e+05 7.601e+04 1.544 0.122607
## location_WA.98038 -5.324e+04 7.607e+04 -0.700 0.484051
## location_WA.98039 -3.736e+05 1.299e+05 -2.877 0.004043 **
## location_WA.98040 2.942e+05 7.714e+04 3.814 0.000140 ***
## location_WA.98042 -7.963e+04 7.598e+04 -1.048 0.294749
## location_WA.98045 1.471e+04 7.696e+04 0.191 0.848449
## location_WA.98047 -5.551e+04 8.863e+04 -0.626 0.531127
## location_WA.98050 5.543e+04 1.303e+05 0.425 0.670645
## location_WA.98051 1.815e+04 8.901e+04 0.204 0.838424
## location_WA.98052 2.080e+05 7.569e+04 2.748 0.006037 **
## location_WA.98053 1.746e+05 7.621e+04 2.291 0.022035 *
## location_WA.98055 -3.989e+04 7.921e+04 -0.504 0.614523
## location_WA.98056 2.473e+04 7.628e+04 0.324 0.745796
## location_WA.98057 -7.596e+04 8.147e+04 -0.932 0.351208
## location_WA.98058 -2.487e+04 7.601e+04 -0.327 0.743586
## location_WA.98059 3.724e+04 7.588e+04 0.491 0.623563
## location_WA.98065 4.485e+04 7.654e+04 0.586 0.557953
## location_WA.98068 NA NA NA NA
## location_WA.98070 1.408e+03 7.905e+04 0.018 0.985794
## location_WA.98072 1.225e+05 7.668e+04 1.598 0.110260
## location_WA.98074 1.458e+05 7.614e+04 1.915 0.055599 .
## location_WA.98075 1.694e+05 7.638e+04 2.218 0.026631 *
## location_WA.98077 1.142e+05 7.728e+04 1.477 0.139689
## location_WA.98092 -8.542e+04 7.614e+04 -1.122 0.262050
## location_WA.98102 3.458e+05 8.109e+04 4.265 2.06e-05 ***
## location_WA.98103 2.488e+05 7.580e+04 3.282 0.001041 **
## location_WA.98105 2.769e+05 7.801e+04 3.549 0.000393 ***
## location_WA.98106 4.414e+04 7.664e+04 0.576 0.564725
## location_WA.98107 2.469e+05 7.674e+04 3.218 0.001307 **
## location_WA.98108 4.986e+04 7.725e+04 0.645 0.518713
## location_WA.98109 4.164e+05 8.102e+04 5.140 2.93e-07 ***
## location_WA.98112 3.410e+05 7.741e+04 4.405 1.10e-05 ***
## location_WA.98115 2.078e+05 7.592e+04 2.737 0.006232 **
## location_WA.98116 2.283e+05 7.672e+04 2.976 0.002943 **
## location_WA.98117 2.169e+05 7.585e+04 2.860 0.004267 **
## location_WA.98118 8.666e+04 7.626e+04 1.136 0.255878
## location_WA.98119 3.844e+05 7.794e+04 4.932 8.58e-07 ***
## location_WA.98122 2.508e+05 7.654e+04 3.277 0.001062 **
## location_WA.98125 7.111e+04 7.644e+04 0.930 0.352310
## location_WA.98126 1.221e+05 7.621e+04 1.602 0.109347
## location_WA.98133 6.534e+04 7.608e+04 0.859 0.390445
## location_WA.98136 1.667e+05 7.701e+04 2.164 0.030515 *
## location_WA.98144 1.952e+05 7.645e+04 2.554 0.010714 *
## location_WA.98146 3.163e+04 7.667e+04 0.412 0.680012
## location_WA.98148 -2.256e+04 8.207e+04 -0.275 0.783450
## location_WA.98155 5.985e+04 7.607e+04 0.787 0.431448
## location_WA.98166 3.252e+04 7.675e+04 0.424 0.671860
## location_WA.98168 -2.261e+04 7.654e+04 -0.295 0.767693
## location_WA.98177 1.461e+05 7.791e+04 1.875 0.060837 .
## location_WA.98178 -8.146e+04 7.721e+04 -1.055 0.291490
## location_WA.98188 -7.244e+04 7.925e+04 -0.914 0.360720
## location_WA.98198 -7.122e+04 7.694e+04 -0.926 0.354748
## location_WA.98199 2.643e+05 7.688e+04 3.438 0.000594 ***
## location_WA.98288 -4.277e+04 9.676e+04 -0.442 0.658491
## location_WA.98354 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 105900 on 2932 degrees of freedom
## Multiple R-squared: 0.7612, Adjusted R-squared: 0.7543
## F-statistic: 111.2 on 84 and 2932 DF, p-value: < 2.2e-16
Of the 88 variables, several have significant coefficients (***). Some locations (`statezip`) are more significant than others. A side note from this observation: more specific models could be developed that cluster observations and predict using only the features most significantly correlated with the target, in order to build a stronger prediction algorithm; a sketch for extracting those features follows below.
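As a starting point for that follow-up, the coefficients flagged '***' can be extracted programmatically (a sketch; the 0.001 threshold mirrors the significance codes above):
## extract the terms significant at the 0.001 level ('***')
coefs <- summary(model.mlr2)$coefficients
rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.001]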
#5. Evaluation
## 5.1. Visualise Actual vs. Predicted Values
Plot the actual values of `price` in the testing data against their predicted values from the MLR model.
actual <- test_df$price
pred.mlr2 <- predict(model.mlr2, test_df)
pred_df <- data.frame(actual, pred.mlr2)
### plot MLR2 (used all num vars)
plot_mlr2 <- ggplot(data = pred_df,
aes(x = actual, y = pred.mlr2)) +
geom_point() +
geom_smooth() +
scale_x_continuous(limits = c(0, 1250000)) +
scale_y_continuous(limits = c(0, 1250000)) +
labs(title = "Actual Vs Predicted Values using MLR2")
plot_mlr2
The plot suggests that the observations lie close to the diagonal, which means the model can predict the actual price with good accuracy. However, a few points stray quite far from the rest of the values; this phenomenon needs further investigation, and improvements to the data extraction should be made as well.
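To support that investigation, a simple first step (a sketch using the prediction data frame above) is to list the test observations with the largest absolute errors:
## inspect the test observations with the largest absolute prediction errors
pred_df$abs_error <- abs(pred_df$actual - pred_df$pred.mlr2)
head(pred_df[order(-pred_df$abs_error), ], 10)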
Compute the Mean Absolute Error (MAE) and Pearson correlation coefficient (R).
mae_mlr2 <- mae(actual, pred.mlr2)
paste("MAE: ", round(mae_mlr2, 2))
## [1] "MAE: 74759.03"
r_mlr2 <- cor(actual, pred.mlr2)
paste ("R: ", round(r_mlr2,2))
## [1] "R: 0.87"
The current model's MAE = 74759.03 and R = 0.87 show a significant improvement in both the MAE and correlation scores compared to the initial performance (MAE > 250k and R < 0.6).
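For reference, `Metrics::mae()` computes the usual mean absolute error, which can be verified by hand on the vectors defined above (a minimal sketch):
## MAE = mean(|actual - predicted|); should equal mae_mlr2
mean(abs(actual - pred.mlr2))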
#6. Recommendation
The prediction model is sufficient to predict house prices within the mainstream distribution.
Location (`statezip`) is an important feature for improving the model's performance. It is highly recommended to consider other important features that can be engineered from the existing variables and that may have a significant influence on the model; one illustrative example follows below.
Judging by R and the nature of the current MAE score, the model is ready to be deployed, with a few notes. First, the correlation of almost 0.9 indicates the model is close to deployment-ready. Second, the fairly high MAE means that any prediction falling far from the diagonal trend line will need further advice from a domain expert. Finally, room for improvement should be explored in the future to increase the efficacy and accuracy of the model.