Description

This report presents house price prediction using regression algorithms.

The dataset used for modeling in this report contains real house sales data from the US. It is hosted on Kaggle and can be downloaded here: http://www.kaggle.com/shree1992/housedata.

The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Import necessary libraries.

rm(list = ls())
library(ggplot2)
library(corrgram)
library(gridExtra)
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill

Library ggplot2: for graphics and visualization
Library corrgram: for visualization of the correlation coefficient matrix
Library gridExtra: for arranging multiple plots
Library caret: for modeling utilities, in particular one-hot encoding with dummyVars

Read the house dataset from the .csv file into an R dataframe. Then, inspect the dataframe's structure.

## read data
house_df <- read.csv("data/house.csv")

## structure of dataframe
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

The dataset has 4600 observations and 18 variables. The target variable is price and the remaining variables are candidate features.

Compute statistical summary of each variable.

## statistical summary
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

We can see the minimum, median, mean, and maximum values of each numeric variable.

It is interesting to see that the minimum value of price is zero. This could be incorrect data.

We can also notice that the maximum value of price is far away from the median and the third quartile. This could be an outlier.
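Both observations can be verified quickly. The snippet below is a minimal check; the 1.5 * IQR fence is the standard boxplot convention and is used here only as a rough screen.

## quick checks on price before cleaning
sum(house_df$price == 0)   # number of observations with zero price
upper_fence <- quantile(house_df$price, 0.75) + 1.5 * IQR(house_df$price)
sum(house_df$price > upper_fence)   # number of prices above the 1.5 * IQR upper fence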

2. Exploratory Data Analysis

2.1. Univariate Analysis

Plot the distribution of price using a boxplot.

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Based on the boxplot above, we can see that there are outliers in price.

2.2. Bivariate Analysis

## casting bedrooms to factor
house_df$bedrooms2 <- factor(house_df$bedrooms)

ggplot(data = house_df, aes(x = bedrooms2, 
                            y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Based on the plot of price by number of bedrooms, we can see the following:
1. In general, the higher the number of bedrooms, the higher the price.
2. Interestingly, for houses with bedrooms == 0, the prices are significantly high. These could be special buildings, such as meeting halls, religious buildings, sports centers, etc.; see the quick check below.
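A quick way to look at these unusual observations is to filter them directly; the selected columns below are only illustrative.

## inspect the 0-bedroom observations
subset(house_df, bedrooms == 0, select = c(price, sqft_living, city))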

2.3. Multivariate Analysis

Compute Pearson’s Correlation Coefficient (R) among all numerical variables. Then, visualize the result in a diagram.

## Compute Pearson's Correlation Coefficient (R)

house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)

# install.packages("corrgram")
library(corrgram)
corrgram(house_df_num,
         upper.panel = panel.cor)

For the target variable (price), the variables with the highest correlation, in order, are sqft_living (0.43), sqft_above (0.37), and bathrooms (0.33).

Among the features themselves, several variables are highly correlated, for example sqft_living and sqft_above (0.88).
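For reference, the exact ordering can be read off the correlation matrix r computed above.

## correlations of all numeric variables with price, sorted descending
## (price itself appears first with correlation 1)
sort(r[ , "price"], decreasing = TRUE)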

Insights from EDA:

  1. There are outliers in price.
  2. There are incorrect price values (price == 0).
  3. In general, the higher the number of bedrooms, the higher the price. However, for houses with 0 bedrooms, the prices are significantly higher.
  4. Based on Pearson’s Correlation Coefficient (R), the variables with the highest correlation with the target (price) are sqft_living, sqft_above, and bathrooms.
  5. Location is an important feature for predicting price, so it is necessary to include it in the modeling.

3. Data Preparation

3.1 Data Cleaning

Remove observations with incorrect price (price == 0)

### get index that price == 0
idx_price_0 <- which( house_df_num$price == 0 )

### remove obs with price == 0
house_df_num <- house_df_num[ -idx_price_0, ] 
summary(house_df_num$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7800   326264   465000   557906   657500 26590000

The minimum value of price is now not zero, but 7800.

Remove observations with outliers in price

### get outlier idx

out_price <- boxplot.stats(house_df_num$price)$out
idx_out <- which(house_df_num$price %in% c(out_price))

### remove obs with outlier

house_df_num <- house_df_num[ -idx_out, ]
summary(house_df_num$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7800  320000  450000  487457  615000 1150000

The maximum value of price is now not 26,590,000, but 1,150,000.

dim(house_df_num)
## [1] 4311   11

The number of observations is now 4311, which means the data cleaning process removed 289 rows.

3.2 Feature Extraction

Extract location features by one-hot encoding the statezip column with caret's dummyVars, then attach the resulting indicator columns to the dataframe used for modeling.

### create dataframe to be encoded
statezip_df <- data.frame(house_df$statezip)
colnames(statezip_df) <- "loc"

### create OHE dataframe
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))

### combine to original dataframe
house_df <- cbind(house_df, df2)
house_df$statezip <- NULL

### combine to numerical data for modeling
idx <- rownames(house_df_num)
house_df_num <- cbind(house_df_num, df2[idx, ] )
dim(house_df_num)
## [1] 4311   88
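As a quick sanity check on the encoding (the exact column names depend on dummyVars' defaults and on the statezip levels), we can peek at the generated indicator columns.

## sanity-check the one-hot encoded location columns
head(colnames(df2))   # names of the generated indicator columns
table(df2[ , 1])      # each indicator column contains only 0s and 1s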

3.3. Training and Testing Division

Randomly divide the dataset into training and testing sets with a 70:30 ratio. For reproducible results, it is necessary to set a seed.

## for reproducible result
set.seed(2021) 

m <- nrow(house_df_num)
m_train <- m * 0.7
train_idx <- sample(m, m_train)


train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]

dim(train_df)
## [1] 3017   88
dim(test_df)
## [1] 1294   88

The training data has 3017 observations while the testing data has 1294 observations.
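Since caret is already loaded, an equivalent split could also be obtained with createDataPartition, which stratifies on the target. It is shown here only as an alternative sketch and is not used for the models below.

## alternative: stratified 70:30 split with caret (not used below)
set.seed(2021)
train_idx2 <- createDataPartition(house_df_num$price, p = 0.7, list = FALSE)
length(train_idx2)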

4. Modeling

Create regression models using multiple linear regression. We will create two models: one without location features and one with location features.

# Model without Location
model.mlr2 <- lm(formula = price ~ . ,
                 data = train_df[, 1:11])

# With Location
model.mlr3 <- lm(formula = price ~ . ,
                 data = train_df)
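The in-sample fit of each model can be inspected with summary(); only the R-squared values are pulled out below, since the full summary of the location model is long (output not shown here).

## in-sample R-squared of both models
summary(model.mlr2)$r.squared
summary(model.mlr3)$r.squared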

5. Evaluation

Get predicted values from the trained models. Then, create a dataframe of actual and predicted values. Note that predict() warns about a rank-deficient fit: with dummyVars' default encoding, the full set of location indicator columns is linearly dependent with the intercept, and some columns may be constant in the training split. Predictions are still produced, but the warning is worth keeping in mind.

# Actual Values from test data
actual <- test_df$price

## Predicted values 
pred.mlr2 <- predict(model.mlr2, test_df[, 1:11])
## Warning in predict.lm(model.mlr2, test_df[, 1:11]): prediction from a rank-
## deficient fit may be misleading
pred.mlr3 <- predict(model.mlr3, test_df)
## Warning in predict.lm(model.mlr3, test_df): prediction from a rank-deficient fit
## may be misleading
## create dataframe for actual and predicted values
prediction_df <- data.frame(actual,  
                            pred.mlr2, 
                            pred.mlr3)

5.1 Visualize Actual vs Predicted

The first plot shows the model without location features; the second shows the model with location features.

## Visualize actual vs predicted from model without location
p1 <- ggplot(data = prediction_df, 
       aes(x = actual, y = pred.mlr2)) +
  geom_point() + 
  geom_smooth() +
  scale_x_continuous(limits = c(0,1500000),
                     breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M")) +
  scale_y_continuous(limits = c(0,1500000),
                     breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M")) +
  labs(title = "Predicted Values without Location Features")
## Visualize actual vs predicted from model with location
p2 <- ggplot(data = prediction_df, 
       aes(x = actual, y = pred.mlr3)) +
  geom_point() + 
  geom_smooth() +
  scale_x_continuous(limits = c(0,1500000),
                     breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M")) +
  scale_y_continuous(limits = c(0,1500000),
                     breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M")) +
  labs(title = "Predicted Values with Location Features")
p1
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

p2
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
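Since gridExtra is loaded, the two plots can also be arranged side by side for easier comparison.

## side-by-side comparison of the two models
grid.arrange(p1, p2, ncol = 2)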

From the two plots above, we can see that the predicted values from the model with location features lie closer to the diagonal. This means they are closer to the actual values.

5.2 Compute Performance Metrics (RMSE and R)

Evaluate performance using Root Mean Squared Error (RMSE) and Pearson’s Correlation Coefficient (R).

The lower the RMSE, the better the model. On the other hand, the higher the R, the better the model.

We define a function called performance to compute RMSE and R.

## Compute Performance evaluation

performance <- function(actual, predicted, model){
  
  ## Root Mean Squared Error (RMSE)
  e <- predicted - actual  # error
  se <- e^2 # squared error
  sse <- sum(se) # sum of squared error
  mse <- mean(se) # mean squared error
  rmse <- sqrt(mse) # root mean squared error
  
  ## Pearson's Correlation Coefficient (R)
  r <- cor(predicted, actual)
  
  result <- paste("=== Model: ", model,
                  "\nRoot Mean Squared Error (RMSE): ", round(rmse, 2),
                  "\nCorrelation Coefficient (R): ", round(r,5),
                  "\n\n")
  cat(result)
}
performance(actual, pred.mlr2, "MLR without location features")
## === Model:  MLR without location features 
## Root Mean Squared Error (RMSE):  158561.44 
## Correlation Coefficient (R):  0.66553
performance(actual, pred.mlr3, "MLR with location features")
## === Model:  MLR with location features 
## Root Mean Squared Error (RMSE):  97530.56 
## Correlation Coefficient (R):  0.88837
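As a cross-check, caret's postResample() returns RMSE, R-squared, and MAE in one call; note that it reports R-squared rather than R (output not shown here).

## cross-check the metrics with caret
postResample(pred = pred.mlr2, obs = actual)
postResample(pred = pred.mlr3, obs = actual)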

With location features, RMSE is reduced from 158,561 to 97,531 and R is increased from about 0.67 to 0.89.

This means that location features are very significant for improving house price prediction.

6. Recommendation

  1. The model is ready for deployment. However, the prediction performance could be improved with PCA or other data preprocessing methods.

  2. The prediction model is good enough to predict the price of mainstream houses. An additional dedicated model can be developed to predict special houses, if necessary.

  3. Location is an important feature generated from one of the columns in the dataset (statezip). Other important features could be generated that also have significant influence, for example from yr_built or yr_renovated (a sketch is given below).
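As an illustration of the third point, hypothetical features could be derived from yr_built and yr_renovated; the sale year 2014 below is taken from the date column of this dataset, and these features are not evaluated in this report.

## hypothetical derived features (not evaluated in this report)
house_df$age <- 2014 - house_df$yr_built                     # sale year in this dataset is 2014
house_df$renovated <- as.integer(house_df$yr_renovated > 0)  # 1 if the house was ever renovated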