Description

This report provides house price prediction using regression algorithms.

The dataset using in this report for modeling is real house data in the USA. It is publicaly available at kaggle. It can be downloades here: https://www.kaggle.com/shree1992/housedata

This report is stuctured as follows:
1. Data Exraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Exraction

Import necessary libraryes.

rm(list = ls())
library(ggplot2)
library(corrgram)
library(caret)
library(gridExtra)
library(dplyr)

Library ggplot : for graphic and visualization.
Library corrgram : for visualizatoion of correlation coeficient.
Library caret : for One Hot Encoding.
Library gridExtra : for plotting multiple graphs.

Read house dataset from .csv file to R dataframe. Then see the dataframe’s structure

## read data to dataframe
house_df <- read.csv("../data/data.csv")

## structure of data
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

The dataset has 4600 observation (rows) and 18 variables (columns). The target variables is price and the reamining variables are features candidate.

Compute statistical summary of each variables.

## statistical summary
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

We can see the min, median, and max values of each numeric variable.

It is intersting to see that the min value of price is zero. This could be an incorrect data.

We can also nitice tha max value of price statisticaly far away from med and 3rd quantile. This could be an Outlier.

2. Exploratory Data Analysis

2.1. Univariate Analysis (one variable)

Plot and Analysis distribution of price using boxplot

# distribution of price 
ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

There are outliers (significantly high price).
Normaly house price is around 300000 up to 700000

Analysis of a single variable. there are bedrooms

# distribution of bedrooms
ggplot(data = house_df, aes(x = bedrooms)) +
  geom_bar()

Based on barchart above, we can see the house distribution by bedrooms.
According to data, at most house tahat have 3 bedrooms

2.2. Bivariate Analysis (two variables)

Plot and Analysis distribution of price based on number of bedrooms using boxplot

house_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$bathrooms2 <- factor(house_df$bathrooms)

# Relationship between price and bedrooms
ggplot(data = house_df, aes(y = price, 
                            x = bedrooms2)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher. These could be a sprcial buildings such as factory, commercial space, meeing hall, sport center, etc.

Plot and Analysis distribution of price based on number of bathrooms using boxplot

# Relationship between price and bathrooms
ggplot(data = house_df, aes(y = price, 
                            x = bathrooms2)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

In general, the higher number of bathrooms, the higher the price. However, price for house with bathrooms == 0 is significantly higher. These could be a sprcial buildings such as factory, commercial space, meeing hall, sport center, etc.

2.3. Multivariate Analysis

Compute Pearson’s Correlation Coeficient (R) among all numerical variable. Then, visualize the result in a diagram using corrgram.

### extract numerical variable from dataframe
house_df_num <- house_df[ , 2:12]

### visualize pearson's Correlation Coefficient (R)
corrgram(house_df_num, 
         upper.panel = panel.cor)

For target variable (price), the variables with high correlation in order are sqft_living (0.43), sqft_above (0.37), bathrooms (0.33)

2.4. Insight from EDA:

Insight from EDA:
1. Incorrect data (price == 0),
2. There are outliers (significantly high price)
3. In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher.
4. Based on Pearson’s Correlation Coefficient, the variables with highest correlation with target (price) are sqft_living, sqft_above, bathrooms.
5. Location is an important feature to predict price. So, it is necessary to be included in the modeling.

3. Data Preparation

3.1. Data Cleaning

Remove observation with incorrect price (price == 0)

### Remove rows with price == 0
#opsi pertama menggunakan filter
house_df_num1 <- filter(house_df_num, price > 0)
summary(house_df_num1$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7800   326264   465000   557906   657500 26590000
dim(house_df_num1)
## [1] 4551   11

The min value of price is now 7800, number of observation is 4551

Remove observation with outliers in price

### Remove rows with outliers in price
out_price <- boxplot.stats(house_df_num1$price)$out

idx_out <- which(house_df_num1$price %in% c(out_price))
house_df_num2 <- house_df_num1[-idx_out, ]
summary(house_df_num2$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7800  320000  450000  487457  615000 1150000
dim(house_df_num2)
## [1] 4311   11

The max value of price is now $1,150,000, number of observation is 4311

3.2. Feature Extraction

Add location feature (statezip). OHE on location

Add location feature using One Hot Encoding (statezip) variable

## 1. Create dataframe of location
idx <- rownames(house_df_num2)
location_df <- data.frame(house_df[idx, ]$statezip)
colnames(location_df) <- "location"
  1. OHE dataframe
## 2. OHE dataframe
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))
  1. Marge with original data
## 3. Marge with original data
house_df_num3 <- cbind(house_df_num2,df2)
dim(house_df_num3)
## [1] 4311   88

Number of columns is now 88. It means, we added 77 new features for location information using OHE in statezip.

Summary of dataframe:
- house_df_num : raw data. dimension: [4600, 11]
- house_df_num1 : remove price == 0. dimension: [4551, 11]
- house_df_num2 : remove price == 0, remove outliers in price. dimension: [4311, 11]
- house_df_num3 : remove price == 0, remove outliers in price, OHE in statezip. dimension: [4311, 88]

3.3. Split data into training and testing Data

get_train_idx <- function(m, ratio, seed_number){
  # input 
  # m: number of observation
  # ratio: percentage of tarining data
  # seed_number: reproducible restult
  m_train <- ratio * m      # number of observations in training data
  set.seed(seed_number)
  train_idx <- sample(m, m_train)
  # output:
  # tain_idx: train index
  return(train_idx)
}

ratio <- 0.7
seed_number <- 2021

# raw data. dimension: `[4600, 11]`  
train_idx <- get_train_idx(m = nrow(house_df_num), ratio, seed_number)
train_df <- house_df_num[train_idx, ]
test_df <- house_df_num[-train_idx, ]

# remove `price == 0`. dimension: `[4551, 11]` 
train_idx1 <- get_train_idx(m = nrow(house_df_num1), ratio, seed_number)
train_df1 <- house_df_num1[train_idx1, ]
test_df1 <- house_df_num1[-train_idx1, ]

# remove `price == 0`, remove outliers in `price`. dimension: `[4311, 11]`
train_idx2 <- get_train_idx(m = nrow(house_df_num2), ratio, seed_number)
train_df2 <- house_df_num2[train_idx2, ]
test_df2 <- house_df_num2[-train_idx2, ]

# remove `price == 0`, remove outliers in `price`, OHE in `statezip`. dimension: `[4311, 88]`
train_idx3 <- get_train_idx(m = nrow(house_df_num3), ratio, seed_number)
train_df3 <- house_df_num3[train_idx3, ]
test_df3 <- house_df_num3[-train_idx3, ]

Now, we have four pairs of training ang testing dat.

4. Modeling

Using Multivariate Linear Regression with different training data.

4.1. Modeling with Raw Data

## Model: Multivariate Linear Regression (from 2 to n variables)

### using all numerical variables raw data
model.mlr <- lm(formula = price ~ .,
                data = train_df)

4.2. Modeling with Cleaned Data

## Model: Multivariate Linear Regression (from 2 to n variables)

### using all numerical variables price > 0
model.mlr1 <- lm(formula = price ~ .,
                data = train_df1)
## Model: Multivariate Linear Regression (from 2 to n variables)

### using all numerical variables no outliers
model.mlr2 <- lm(formula = price ~ .,
                data = train_df2)

4.3. Modeling with Cleaned Data and OHE in Location

## Model: Multivariate Linear Regression (from 2 to n variables)

### using all numerical variables clean and OHE
model.mlr3 <- lm(formula = price ~ .,
                data = train_df3)

5. Evaluation

Actual

Actual value raw data

actual <- test_df$price

Actual value clean data

actual1 <- test_df1$price
actual2 <- test_df2$price

Actual value clean data and OHE

actual3 <- test_df3$price

Predict

Predict actual and predicted value with raw data

pred.mlr <- predict(model.mlr, test_df)     # x = all

Predict actual and predicted value with clean data

pred.mlr1 <- predict(model.mlr1, test_df1)     # x = all
pred.mlr2 <- predict(model.mlr2, test_df2)     # x = all

Predict actual and predicted value with clean data and OHE

pred.mlr3 <- predict(model.mlr3, test_df3)     # x = all

5.1. Visualize Actual vs. Prediction

prediction

actual vs predicted values with raw data

## actual vs predicted values
prediction_df <- data.frame(actual, pred.mlr)

actual vs predicted values with clean data

## actual vs predicted values
prediction_df1 <- data.frame(actual1, pred.mlr1)
## actual vs predicted values
prediction_df2 <- data.frame(actual2, pred.mlr2)

actual vs predicted values with clean data and OHE

## actual vs predicted values
prediction_df3 <- data.frame(actual3, pred.mlr3)

Visualise actual vs predicted values

Visualise actual vs predicted values raw data

plot_mlr <- ggplot(data = prediction_df, aes(x = actual, y = pred.mlr)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1200000)) +
  scale_y_continuous(limits = c(0, 1200000)) +
  labs(title = "Raw Data")

Visualise actual vs predicted values clean data

plot_mlr1 <- ggplot(data = prediction_df1, aes(x = actual1, y = pred.mlr1)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1200000)) +
  scale_y_continuous(limits = c(0, 1200000)) +
  labs(title = "Remove price = 0")
plot_mlr2 <- ggplot(data = prediction_df2, aes(x = actual2, y = pred.mlr2)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1200000)) +
  scale_y_continuous(limits = c(0, 1200000)) +
  labs(title = "Remove Outliers")

Visualise actual vs predicted values clean data and OHE

plot_mlr3 <- ggplot(data = prediction_df3, aes(x = actual3, y = pred.mlr3)) +
  geom_point() +
  scale_x_continuous(limits = c(0, 1200000)) +
  scale_y_continuous(limits = c(0, 1200000)) +
  labs(title = "Using OHE on Location")

Visualization

grid.arrange(plot_mlr, plot_mlr1, plot_mlr2, plot_mlr3)

5.2. Comparison of Performation Metrics (RSME and R)

library(Metrics)
rmse <- rmse(pred.mlr, actual)
rmse1 <- rmse(pred.mlr1, actual1)
rmse2 <- rmse(pred.mlr2, actual2)
rmse3 <- rmse(pred.mlr3, actual3)
rmse
## [1] 268352.1
rmse1
## [1] 239129.9
rmse2
## [1] 158561.4
rmse3
## [1] 102922
rmse_df <- data.frame(models = c("Raw Data", 
                                 "Remove Proce == 0", 
                                 "Remove Outliers", 
                                 "OHE"),
                      rmse_values = c(rmse, 
                                      rmse1, 
                                      rmse2, 
                                      rmse3))

ggplot(rmse_df, aes(x = models,
                    y = rmse_values,
                    fill = models)) +
  geom_bar(stat = "identity")

6. Recommendation

  1. The accuracy is good enough. The model can be deployed.
  2. However, there are opportunities for improvements.