Description

This report describe house price prediction using regression model. The dataset used in this project is the real estate data from Kaggle. https://www.kaggle.com/shree1992/housedata

Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modelling
5. Evaluation
6. Recommendation

1. Data Extraction

Extract data in csv format into dataframe in R

house_df <- read.csv("PERURI/Training DS Wiley/DS_Training/dataset/house_price/data.csv")

see the data structure of dataframe. There are 4600 onservations and 18 variables.

str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

Calculate statistical summary of numerical variables.

summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

Price is the target variable. The average value is $51,963. There is observation with price == 0. There are possibly outliers as the max price value is signifinantly higher than the average.

2. Exploratory Data Analysis

2.1 Univariate Analysis

ggplot(house_df, aes(y = price)) + 
  geom_boxplot() +
  ylim(0,2000000)

The target variable (Price) has outlier and invalid value (price == 0). This should be handled or cleaned in the data preparation step.

2.1 Bivariate Analysis

house_df$bedrooms2 <- as.factor(house_df$bedrooms) 
ggplot(data = house_df, aes(x = bedrooms2, y = price)) + 
  geom_boxplot()+
  ylim(0,2000000)

In general, the higher number of bedrooms the higher price of the house.

2.1 Multivariate Analysis

Compute and vizualize Pearson’s Correlation Coefficient (R) between

house_df_num <- house_df[ ,c("price","bedrooms","bathrooms","sqft_living",
                            "sqft_lot","floors","waterfront","view","condition",
                            "sqft_above","sqft_basement")]
corrgram(house_df_num,
         order = TRUE,
         upper.panel = panel.pie)

Based on diagram above, sqft_living has a high correlation with sqft_above and bathroom. Variable sqft_living, sqft_above and bathroom have the highest correlation with the target variable (Price)

3. Data Preparation

3.1 Feature Extraction

One Hot Encoding

Add location variable (statezip) and use OHE to convert categorical into numerical variable.

location_df <- data.frame(house_df$statezip)
colnames(location_df) <- c("loc.")

### OHE on Location Dataframe
library(caret)
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))

## Combine to Original Dataframe
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4600   88

The current dataframe gas 88 column (87 feature and 1 target)

3.2 Data Cleaning

Remove observation with price == 0

idx <- house_df_num$price > 0

# get dataframe
house_df_num <- house_df_num[idx, ]

Remove Outlier

out_price <- boxplot.stats(house_df_num$price)$out

# get rows idx of outliers value
out_idx <- which(house_df_num$price %in% c(out_price))

# remove rows with outliers or get rows without outliers
house_df_num <- house_df_num[ -out_idx, ]

3.3 Training and Testing Division

set.seed(2021)  # agar data tidak berubah
m <- nrow(house_df)
train_idx <- sample(m, 0.7*m)
train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]

train_idx[1:3]
## [1] 2214  442  743

4. Modelling

Use Multivatiate Linear Regression (MLR) Algorithm to predict house price based on all extracted feature.

# Without location data
fit.mlr1 <- lm(formula = price ~ ., data = train_df[, 1:11]) 

# With location data
fit.mlr2 <- lm(formula = price ~ ., data = train_df) 

We trained two models. First without location data and second with location data

5. Evaluation

To evaluate out models, we use Root Mean Square Error (RMSE) and Pearson’s Correlation Coefficient (R) as performance metrics.

performance <- function(prediction, actual, method){
  e <- prediction - actual
  se <- e^2
  sse <- sum(se)
  mse <- mean(se)
  rmse <- sqrt(mse)
  
  r <- cor(prediction,actual)
  result <- paste("==Method Name:",method,
                  "\nRoot Mean Square Error (RMSE) = ", round(rmse,2),
                  "\nCorrelation Coefficient (R) =", round(r,2) ) 
  cat(result)
}

Performance of Model 1 (without location data)

pred.mlr1 <- predict(fit.mlr1, test_df)
performance(pred.mlr1, test_df$price, "Model 1")
## ==Method Name: Model 1 
## Root Mean Square Error (RMSE) =  164896.24 
## Correlation Coefficient (R) = 0.61

Performance of Model 2 (with location data)

pred.mlr2 <- predict(fit.mlr2, test_df)
performance(pred.mlr2, test_df$price, "Model 2")
## ==Method Name: Model 2 
## Root Mean Square Error (RMSE) =  103961.28 
## Correlation Coefficient (R) = 0.87

RMSE is significantly reduced and R is significantly increase using location data in Model 2. This means location data is very important feature to predict house price.

Visualize actual price vs predicted price in Model 2

actual <- test_df$price
prediction_df <- data.frame(actual,pred.mlr1, pred.mlr2)

p<- ggplot(data = prediction_df,aes(x = actual, y = pred.mlr2)) + geom_point()+
  scale_x_continuous(breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M"),
                     limits = c(0,1500000))+
  scale_y_continuous(breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M"),
                     limits = c(0,1500000)) +
  labs(title = "Multivariate Linear Regression using Location Data", 
       x = "Actual Price", 
       y = "Prediction Price")
p

Most of the points are in the diagonal. This means the predicted price are close to actual price in Model 2.

6. Recommendation

  1. Location data is very important to predict house price. It significanly improve prediction performance. So, this data should always be included in modeling.

  2. One Hot Encoding is an effective method to convert categorical variable to numerical variables. In this case, location data (statezip)

  3. The model is good enough to be deployed. However, there are still some room of improvments.For example, try different regression model or using other preprocessing method.