Heading 1 Project Description

Project Description

This report contains the detail implementation of House Price Prediction model using regression algorithm.

the data set used in this report is sourced from real US housing data.

The Report structured as follow:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

#1. Data Extraction
Import necessary libraries

library(ggplot2)    
library(caret)    
library(corrgram)  
library(dplyr)  
library(Metrics)  
library(gridExtra)

library (ggplot2) for graphic and visualization library(caret) one hot encoding library(corrgram)visualization of correlation coefficient
library(Metrics) plotting multiple graphs library(gridExtra)for performance evaluation library(dplyr) providing data manipulation

Read house data set from .csv file to R dataframe. Then, see the structure of datafame.

## import data in csv format to data frame
house_df<-read.csv("data/house.csv")

## see the structure of datafrrame 
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

the dataset has 4600 observation (rows) and 18 variables (columns). the target variable is ‘price’ and the remaining variables are features candidate.

Calculation of numerical variables statistical summary

summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

here we can observe the normal distribution data (min, max, mean, Q1, Q2, Q3) of each numerical distribution.

as noticed, an interesting observation made of the minimum value of house price is zero. this is probably an incorrect or data anomaly.

#2. Exploratory Data Analysis

2.1. Univariate Analysis

Analysis of a single variable. We want to see the distribution of target variable price.

ggplot(data = house_df, aes(y = price)) + geom_boxplot() +
  scale_y_continuous(limits= c(0,2000000))

out_price <- boxplot.stats(house_df$price)$out  
paste("Number of outliers: ", length(out_price))
## [1] "Number of outliers:  240"

based on the boxplot above, we can see that there are a significant number of outliers in variable price. There are 240 outliers (about 5.217%) of the population analyzed.

2.2. Bivariate Analysis

As the name suggests, we have run analysis on these specific variables (number of bedroom Vs price).

house_df$bedrooms2 <- factor(house_df$bedrooms)

ggplot(data = house_df, aes(y = price, x = bedrooms2)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

In a more normative distribution, the higher the number of bedrooms the higher the price. However, houses with bedrooms == 0, the prices are significantly higher.

Let’s see deeper into this finding, to investigate what cause this especially high value house with zero bedroom.

house_df$bedrooms2 <- factor(house_df$bedrooms)

ggplot(data = house_df, aes(y = sqft_living, x = bedrooms2)) +
  geom_boxplot() 

the analysis finds that the houses with zero bedroom have the highest median of square foot living than the rest of the normative house population with one or more bedrooms.

2.3. Multivariate Analysis

Now we are looking into the analysis of three or more variables.

ggplot(data = house_df, aes(y = price, x = sqft_living,
                        color = bedrooms2,)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 2000000)) +
  scale_x_continuous(limits = c(0, 7500))

the result of the analysis suggests that: 1. In general, the highest number of ‘sqft_livingthe higher theprice`. 2. there houses with various squarefoot living sizes with ’zero’ price (dots in x-axis). This can suggest more multiple errors within the dataset or anomalies in the actual conditions of the related houses.

Compute Pearson’s correlation coefficient (R)

house_df_num <- house_df[ , 2:12] 
corrgram(house_df_num,
         upper.panel = panel.cor)

All numerical variables have positive correlation with price. There variables with highest correlation to Price are sqft_living, sqft_above, bathrooms

#3. Data Preparation

3.1. Data Cleaning

Remove observation with incorrect price (price==0)

price_0 <- filter(house_df_num, price==0)
paste("Number of observations with price == 0: ", nrow(price_0))
## [1] "Number of observations with price == 0:  49"
house_df_num1 <- filter(house_df_num, price > 0)  
paste("Number of Observations: ", nrow(house_df_num1)) 
## [1] "Number of Observations:  4551"

Our observation finds there are 49 houses sith the price of zero. After removing these items, the current data frame has 4551 observations.

Removing outliers on price

### get outliers
out_price <- boxplot.stats(house_df_num1$price)$out
paste("Number of outliers: ", length(out_price))
## [1] "Number of outliers:  240"
### get outliers' index
out_idx <- which( house_df_num1$price %in% c(out_price) )

### remove outliers
house_df_num1 <- house_df_num1[-out_idx, ]
paste ("Number of curret observations: ", nrow(house_df_num1))
## [1] "Number of curret observations:  4311"

there are 240 observations with outliers in price. After removal, the current data frame left with 4311 observations.

3.2. Feature Extraction

adding One Hot Encoding (OHE)on chosen location variable (statezip)

## 1. create dataframe for column to be encoded.  
location_df <- data.frame(house_df$statezip)
colnames(location_df) <- "location_"

## 2. OHE on that column

df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))
paste("Number of additional columns: ", ncol(df2))
## [1] "Number of additional columns:  77"
## 3. Combine to original dataframe
idx <- rownames(house_df_num1)
house_df_num2 <- cbind(house_df_num1, df2[ idx, ] )

paste ("Number of current variables: ", ncol(house_df_num2)) 
## [1] "Number of current variables:  88"

Because there 77 distinctive statezip(s), by using OHE technique, we created 77 new columns (one column for each), adding to the existing 11 column, we currently have 88 columns of variables.

3.3. Splitting Data into Training & Testing

Data division ratio between Training:Testing is 70:30

m <- nrow(house_df_num1)  
n <- ncol(house_df_num1)   
mtrain <- floor(0.7 * m)   

set.seed(2022)  
train_idx <- sample(m, mtrain)  

train_df <- house_df_num2[ train_idx, ] 
test_df <- house_df_num2[ -train_idx, ] 

paste("Number of training obs: ", nrow(train_df))
## [1] "Number of training obs:  3017"
paste("Number of testing obs: ", nrow(test_df))
## [1] "Number of testing obs:  1294"

There are 3017 observations in training data to develop a prediction model and 1294 observations in testing data to evaluate the models

#4. Modeling

Using Multivariate Linear Regression (MLR) algorithms to predict house price.

model.mlr2 <- lm(formula = price ~ . ,
                 data = train_df)
summary (model.mlr2)
## 
## Call:
## lm(formula = price ~ ., data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -842587  -51386    -464   51246  483921 
## 
## Coefficients: (3 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.111e+04  7.649e+04  -0.407 0.684249    
## bedrooms          -9.967e+03  2.950e+03  -3.379 0.000738 ***
## bathrooms          1.836e+04  4.517e+03   4.064 4.95e-05 ***
## sqft_living        9.948e+01  6.310e+00  15.765  < 2e-16 ***
## sqft_lot           1.363e-01  5.514e-02   2.471 0.013530 *  
## floors            -1.957e+04  5.467e+03  -3.579 0.000350 ***
## waterfront         1.581e+05  3.208e+04   4.929 8.71e-07 ***
## view               3.479e+04  3.359e+03  10.356  < 2e-16 ***
## condition          2.867e+04  3.120e+03   9.190  < 2e-16 ***
## sqft_above         7.374e+01  6.551e+00  11.256  < 2e-16 ***
## sqft_basement             NA         NA      NA       NA    
## location_WA.98001 -8.852e+04  7.657e+04  -1.156 0.247773    
## location_WA.98002 -9.675e+04  7.854e+04  -1.232 0.218122    
## location_WA.98003 -8.770e+04  7.718e+04  -1.136 0.255895    
## location_WA.98004  4.304e+05  7.727e+04   5.570 2.77e-08 ***
## location_WA.98005  2.637e+05  7.828e+04   3.368 0.000766 ***
## location_WA.98006  2.314e+05  7.630e+04   3.032 0.002447 ** 
## location_WA.98007  2.317e+05  7.788e+04   2.975 0.002956 ** 
## location_WA.98008  1.490e+05  7.740e+04   1.924 0.054394 .  
## location_WA.98010  2.111e+04  8.867e+04   0.238 0.811809    
## location_WA.98011  6.525e+04  7.818e+04   0.835 0.403992    
## location_WA.98014  3.734e+04  7.960e+04   0.469 0.639040    
## location_WA.98019  6.195e+03  7.725e+04   0.080 0.936086    
## location_WA.98022 -6.777e+04  7.803e+04  -0.868 0.385207    
## location_WA.98023 -9.450e+04  7.609e+04  -1.242 0.214381    
## location_WA.98024  9.047e+04  8.385e+04   1.079 0.280671    
## location_WA.98027  1.075e+05  7.631e+04   1.409 0.158995    
## location_WA.98028  5.548e+04  7.664e+04   0.724 0.469178    
## location_WA.98029  1.663e+05  7.655e+04   2.173 0.029892 *  
## location_WA.98030 -4.593e+04  7.766e+04  -0.592 0.554227    
## location_WA.98031 -6.642e+04  7.654e+04  -0.868 0.385569    
## location_WA.98032 -1.210e+05  7.979e+04  -1.516 0.129591    
## location_WA.98033  2.511e+05  7.630e+04   3.291 0.001011 ** 
## location_WA.98034  1.174e+05  7.601e+04   1.544 0.122607    
## location_WA.98038 -5.324e+04  7.607e+04  -0.700 0.484051    
## location_WA.98039 -3.736e+05  1.299e+05  -2.877 0.004043 ** 
## location_WA.98040  2.942e+05  7.714e+04   3.814 0.000140 ***
## location_WA.98042 -7.963e+04  7.598e+04  -1.048 0.294749    
## location_WA.98045  1.471e+04  7.696e+04   0.191 0.848449    
## location_WA.98047 -5.551e+04  8.863e+04  -0.626 0.531127    
## location_WA.98050  5.543e+04  1.303e+05   0.425 0.670645    
## location_WA.98051  1.815e+04  8.901e+04   0.204 0.838424    
## location_WA.98052  2.080e+05  7.569e+04   2.748 0.006037 ** 
## location_WA.98053  1.746e+05  7.621e+04   2.291 0.022035 *  
## location_WA.98055 -3.989e+04  7.921e+04  -0.504 0.614523    
## location_WA.98056  2.473e+04  7.628e+04   0.324 0.745796    
## location_WA.98057 -7.596e+04  8.147e+04  -0.932 0.351208    
## location_WA.98058 -2.487e+04  7.601e+04  -0.327 0.743586    
## location_WA.98059  3.724e+04  7.588e+04   0.491 0.623563    
## location_WA.98065  4.485e+04  7.654e+04   0.586 0.557953    
## location_WA.98068         NA         NA      NA       NA    
## location_WA.98070  1.408e+03  7.905e+04   0.018 0.985794    
## location_WA.98072  1.225e+05  7.668e+04   1.598 0.110260    
## location_WA.98074  1.458e+05  7.614e+04   1.915 0.055599 .  
## location_WA.98075  1.694e+05  7.638e+04   2.218 0.026631 *  
## location_WA.98077  1.142e+05  7.728e+04   1.477 0.139689    
## location_WA.98092 -8.542e+04  7.614e+04  -1.122 0.262050    
## location_WA.98102  3.458e+05  8.109e+04   4.265 2.06e-05 ***
## location_WA.98103  2.488e+05  7.580e+04   3.282 0.001041 ** 
## location_WA.98105  2.769e+05  7.801e+04   3.549 0.000393 ***
## location_WA.98106  4.414e+04  7.664e+04   0.576 0.564725    
## location_WA.98107  2.469e+05  7.674e+04   3.218 0.001307 ** 
## location_WA.98108  4.986e+04  7.725e+04   0.645 0.518713    
## location_WA.98109  4.164e+05  8.102e+04   5.140 2.93e-07 ***
## location_WA.98112  3.410e+05  7.741e+04   4.405 1.10e-05 ***
## location_WA.98115  2.078e+05  7.592e+04   2.737 0.006232 ** 
## location_WA.98116  2.283e+05  7.672e+04   2.976 0.002943 ** 
## location_WA.98117  2.169e+05  7.585e+04   2.860 0.004267 ** 
## location_WA.98118  8.666e+04  7.626e+04   1.136 0.255878    
## location_WA.98119  3.844e+05  7.794e+04   4.932 8.58e-07 ***
## location_WA.98122  2.508e+05  7.654e+04   3.277 0.001062 ** 
## location_WA.98125  7.111e+04  7.644e+04   0.930 0.352310    
## location_WA.98126  1.221e+05  7.621e+04   1.602 0.109347    
## location_WA.98133  6.534e+04  7.608e+04   0.859 0.390445    
## location_WA.98136  1.667e+05  7.701e+04   2.164 0.030515 *  
## location_WA.98144  1.952e+05  7.645e+04   2.554 0.010714 *  
## location_WA.98146  3.163e+04  7.667e+04   0.412 0.680012    
## location_WA.98148 -2.256e+04  8.207e+04  -0.275 0.783450    
## location_WA.98155  5.985e+04  7.607e+04   0.787 0.431448    
## location_WA.98166  3.252e+04  7.675e+04   0.424 0.671860    
## location_WA.98168 -2.261e+04  7.654e+04  -0.295 0.767693    
## location_WA.98177  1.461e+05  7.791e+04   1.875 0.060837 .  
## location_WA.98178 -8.146e+04  7.721e+04  -1.055 0.291490    
## location_WA.98188 -7.244e+04  7.925e+04  -0.914 0.360720    
## location_WA.98198 -7.122e+04  7.694e+04  -0.926 0.354748    
## location_WA.98199  2.643e+05  7.688e+04   3.438 0.000594 ***
## location_WA.98288 -4.277e+04  9.676e+04  -0.442 0.658491    
## location_WA.98354         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105900 on 2932 degrees of freedom
## Multiple R-squared:  0.7612, Adjusted R-squared:  0.7543 
## F-statistic: 111.2 on 84 and 2932 DF,  p-value: < 2.2e-16

From 88 variables, there are some variables with significant coefficients (***). Some locations (state zip) are more significant than others. Also, side note grown from this observation is that: 1. more specific models need to be developed to cluster and predict by the correlation significance shared by the predicted variable and correlated variables to build stronger prediction algorithm.

#5. Evaluation
## 5.1. Visualise Actual vs. Predicted Values

Plot actual values of price in the testing data with its predicted values using MLR model.

actual <- test_df$price

pred.mlr2 <- predict(model.mlr2, test_df)

pred_df <- data.frame(actual, pred.mlr2)

### plot MLR2 (used all num vars)
plot_mlr2 <- ggplot(data = pred_df, 
                    aes(x = actual, y = pred.mlr2)) +
  geom_point() +
  geom_smooth() +
  scale_x_continuous(limits = c(0, 1250000)) +
  scale_y_continuous(limits = c(0, 1250000)) +
  labs(title = "Actual Vs Predicted Values using MLR2")  
plot_mlr2

the plot graphic suggests the distributions of the observations are in near diagonal tendency. This means the model can predict the actual price with a good accuracy. However, it is vividly observable that there are view points ‘strayed’ pretty far from the rest of the values. Further investigation of this phenomena needs to be taken. Also, improvements on data extraction needs to be done as well.

5.2. Compute Performance Metrics

Compute Mean Absolute Error (MAE) and Pearso Correlation Coefficient (R)

mae_mlr2 <- mae(actual, pred.mlr2)
paste("MAE: ", round(mae_mlr2, 2))
## [1] "MAE:  74759.03"
r_mlr2 <- cor(actual, pred.mlr2)
paste ("R: ", round(r_mlr2,2))
## [1] "R:  0.87"

The current method’s MAE = 74759.03 and R = 0.87 shows us a significant improvement in MAE & correlation scores compare to the initial performance (MAE>250k and R<0.6).

#6. Recommendation

  1. the prediction model is sufficient to predict house pricing in mainstream distribution.

  2. Location (statezip) is an important feature in improving the model’s performance. Highly recommended to consider other important features that can be generated from the variables that can make significant influence to the model.

  3. Judging by the R and the ‘nature’ of current MAE score, the model is ready to be deploy with few notes. The almost 90% reliability/correlation signifies that the model is quite ready for deployment. Although, our second note: that the pretty high MAE score dictates that any prediction that fall far from the diagonal trend line will need further advice from domain expert. On final note, rooms for improvements need to be explored and exploited in the future to increase the efficacy and accuracy of the model.