Description

This report describes house price prediction using a linear regression model. The dataset used in this project is the real estate data from Kaggle: https://www.kaggle.com/shree1992/housedata

Report outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Load the data from the CSV file into a data frame in R.

house_df <- read.csv("data/data.csv")

View the structure of the data frame. There are 4600 observations and 18 variables.

str(house_df)

## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

2. Exploratory Data Analysis

2.1 Univariate Analysis

library(ggplot2)  # needed for ggplot()
ggplot(house_df, aes(y = price)) +
  geom_boxplot() +
  ylim(0,2000000)

The target variable (price) has outliers and invalid values (price == 0). These should be handled in the data preparation step.

2.2 Bivariate Data Analysis (two variables)

house_df$bedrooms2 <- as.factor(house_df$bedrooms) 
ggplot(data = house_df, aes(x = bedrooms2, y = price)) + 
  geom_boxplot()+
  ylim(0,2000000)

2.3 Multivariate Data Analysis

Correlation Coefficient

house_df_num <- house_df[ ,c("price","bedrooms","bathrooms","sqft_living",
                            "sqft_lot","floors","waterfront","view","condition",
                            "sqft_above","sqft_basement")]
r <- cor(house_df_num)
r[1, ]

##         price      bedrooms     bathrooms   sqft_living      sqft_lot 
##    1.00000000    0.20033629    0.32710992    0.43041003    0.05045130 
##        floors    waterfront          view     condition    sqft_above 
##    0.15146080    0.13564832    0.22850417    0.03491454    0.36756960 
## sqft_basement 
##    0.21042657
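
To make the ranking of predictors easier to read, the correlations with price can be sorted by absolute value. The sketch below uses a small made-up data frame (`toy_df` and its values are assumptions for illustration, not the Kaggle data); the same steps would apply to `house_df_num`.

```r
# Toy data (hypothetical values, for illustration only)
toy_df <- data.frame(
  price       = c(300, 450, 500, 620, 800),
  sqft_living = c(1000, 1500, 1600, 2100, 2600),
  bedrooms    = c(2, 3, 3, 4, 3),
  condition   = c(3, 4, 3, 5, 3)
)

# Pearson correlation of every column with price
r_price <- cor(toy_df)["price", ]

# Drop price itself, then order predictors by absolute correlation
r_pred   <- r_price[names(r_price) != "price"]
r_sorted <- r_pred[order(abs(r_pred), decreasing = TRUE)]
print(round(r_sorted, 2))
```

Ordering by absolute value keeps a strong negative correlation near the top of the list instead of pushing it to the bottom.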

library(corrgram)  
corrgram(house_df_num,
         order = TRUE,
         upper.panel = panel.pie)

Based on diagram above, sqft_living has a high correlation with

3. Data Preparation

3.1. Feature Extraction

One Hot Encoding

Add the location variable (statezip) and use one-hot encoding (OHE) to convert this categorical variable into numerical variables.

### create dataframe of location (statezip)
location_df <- data.frame(house_df$statezip)
colnames(location_df) <- c("loc")

### OHE on Location Dataframe
library(caret)
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))

## combine with the original data frame
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)

## [1] 4600   88

The data frame now has 88 columns: 1 target and 87 features (10 numeric features plus 77 location indicator columns).
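
One-hot encoding gives each distinct category its own 0/1 column, so every row has exactly one 1 among those columns. The sketch below illustrates this with base R's `model.matrix()` instead of `caret::dummyVars()` (a substitution made here to keep the example dependency-free); `toy_loc` is a hypothetical data frame whose zip codes are copied from the `str()` output above.

```r
# Toy location column (hypothetical data frame; zip codes from the str() output)
toy_loc <- data.frame(loc = c("WA 98133", "WA 98119", "WA 98042", "WA 98119"))

# "~ loc - 1" removes the intercept so every level gets its own 0/1 column
ohe <- model.matrix(~ loc - 1, data = toy_loc)

print(colnames(ohe))  # one column per distinct zip code
print(rowSums(ohe))   # each row contains exactly one 1
```

With three distinct zip codes the result has three columns, regardless of how many rows share a zip code.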

3.2. Data Cleaning

Remove rows with price == 0

# logical index of rows with price > 0
idx <- house_df_num$price > 0

# get dataframe
house_df_num <- house_df_num[idx, ]

Remove outliers

# get outlier values (beyond 1.5 * IQR from the box hinges)
out_price <- boxplot.stats(house_df_num$price)$out

# keep only the rows whose price is not an outlier
# (a logical mask also works when no outliers are found)
house_df_num <- house_df_num[ !(house_df_num$price %in% out_price), ]
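
By default, `boxplot.stats()` flags values that lie more than 1.5 times the interquartile range beyond the box hinges. The sketch below shows the rule on a toy vector (`toy_price` and its values are assumptions for illustration). It filters with a logical mask, because negative indexing with an empty index vector would return no elements at all.

```r
# Toy prices (hypothetical values); 2000 is far above the rest
toy_price <- c(100, 120, 130, 150, 160, 180, 200, 2000)

# $out holds the points beyond 1.5 * IQR from the hinges (default coef = 1.5)
out_vals <- boxplot.stats(toy_price)$out

# Keep every value that is not listed as an outlier
keep <- !(toy_price %in% out_vals)
clean_price <- toy_price[keep]

print(out_vals)
print(clean_price)
```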

3.3. Training and Testing Division

## Divide data into train and test = 70:30
set.seed(2021)  # fixed seed so the split is reproducible
m <- nrow(house_df_num)  # row count after cleaning, so indices stay in range
train_idx <- sample(m, 0.7*m)
train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]
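
A quick sanity check for a split like this is that the two parts are disjoint, together cover every row, and contain no `NA` rows. The sketch below runs that check on a hypothetical `toy_df`; the `id` column is an assumption added only to make the overlap test easy.

```r
# Toy data frame standing in for the cleaned data (hypothetical values)
toy_df <- data.frame(id = 1:10, price = seq(100, 1000, by = 100))

set.seed(2021)                          # fixed seed so the split is reproducible
m <- nrow(toy_df)                       # row count of the frame being split
train_idx <- sample(m, round(0.7 * m))  # 70% of the row positions, no repeats

toy_train <- toy_df[train_idx, ]
toy_test  <- toy_df[-train_idx, ]

# The two parts should be disjoint and together cover every row
print(c(train = nrow(toy_train), test = nrow(toy_test)))
print(length(intersect(toy_train$id, toy_test$id)))
```

Sampling from the row count of the same data frame that is being indexed matters: indices larger than the number of rows would produce rows filled with `NA`.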

4. Modeling

Use a multivariate linear regression model to predict house price from all extracted features.

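# Model without location data (columns 1 to 11: price and the numeric features)
fit.mlr1 <- lm(formula = price ~ ., data = train_df[, 1:11])

# Model with location data (all columns, including the one-hot location columns)
fit.mlr2 <- lm(formula = price ~ ., data = train_df)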

We trained two models: the first without location data and the second with location data.

5. Evaluation

To evaluate our models, we use Root Mean Squared Error (RMSE) and Pearson's Correlation Coefficient (R) as performance metrics.

performance <- function(prediction, actual, method){
  e <- prediction - actual
  se <- e^2
  sse <- sum(se)
  mse <- mean(se)
  rmse <- sqrt(mse)
  
  r <- cor(prediction,actual)
  result <- paste("==Method Name:",method,
                  "\nRoot Mean Square Error (RMSE) = ", round(rmse,2),
                  "\nCorrelation Coefficient (R) =", round(r,2) ) 
  cat(result)
}
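
As a worked example of the two metrics, the sketch below computes them step by step on toy vectors (the values are assumptions for illustration, not model output).

```r
# Toy predictions and actual prices (hypothetical values)
prediction <- c(110, 190, 310, 420)
actual     <- c(100, 200, 300, 400)

# RMSE: square the errors, average them, then take the square root
e    <- prediction - actual   # 10, -10, 10, 20
rmse <- sqrt(mean(e^2))       # sqrt((100 + 100 + 100 + 400) / 4) = sqrt(175)

# Pearson correlation between predicted and actual values
r <- cor(prediction, actual)

cat("RMSE =", round(rmse, 2), "\nR =", round(r, 3), "\n")
```

RMSE is expressed in the same unit as price, so it can be read as a typical prediction error. R only measures how well predictions and actual values line up linearly and is unaffected by a constant offset in the predictions, which is why the two metrics are reported together.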

Performance of model 1 (without location data)

pred.mlr1 <- predict(fit.mlr1, test_df)
performance(pred.mlr1, test_df$price, "Model 1")

## ==Method Name: Model 1 
## Root Mean Square Error (RMSE) =  164896.24 
## Correlation Coefficient (R) = 0.61

Performance of model 2 (with location data)

pred.mlr2 <- predict(fit.mlr2, test_df)
performance(pred.mlr2, test_df$price, "Model 2")

## ==Method Name: Model 2 
## Root Mean Square Error (RMSE) =  103961.28 
## Correlation Coefficient (R) = 0.87

Visualize actual price vs. predicted price for Model 2

actual <- test_df$price
prediction_df <- data.frame(actual, pred.mlr1, pred.mlr2)
p <- ggplot(data = prediction_df,aes(x = actual, y = pred.mlr2)) + geom_point()+
  scale_x_continuous(breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M"),
                     limits = c(0,1500000))+
  scale_y_continuous(breaks = c(500000, 1000000, 1500000),
                     labels = c("$500K", "$1M", "$1.5M"),
                     limits = c(0,1500000)) +
  labs(title = "Multivariate Linear Regression Using Location Data", 
       x = "Actual Price",
       y = "Predicted Price")
p

Most of the points lie close to the diagonal, which means the predicted prices are close to the actual prices in Model 2.

6. Recommendation

Location data is very important for predicting house price. Adding it lowered the test RMSE from 164,896.24 to 103,961.28 (about 37%) and raised R from 0.61 to 0.87, so this data should always be included in modeling.
One-hot encoding is an effective method for converting a categorical variable, in this case the location (statezip), into numerical variables.
The model is good enough to be deployed. However, there is still room for improvement, for example by trying a different regression model or other preprocessing methods.

House Price Prediction using Linear Regression

ALuL

5/23/2021