Description

This report privides house price prediction regression algorithm. The dataset using in this report for modeling is House Data in Australia.The dataset is hosted in Kaggel and can be downloaded here.

The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

1. Data Extraction

Import necessary libraries.

rm(list = ls())
library(ggplot2)

Read house dataset dan see its structure.

# Read data
house_df <- read.csv("C:/Users/lenovo/Documents/MyProjects/data/house.csv")

# Structure of dataframe
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

The dataset 4600 observations and 18 variables. The target variable is price and the remaining variables are features.

Extract statictical summary of each variables.

# Data dimension
d <- dim(house_df)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns

# Statistical summary
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

We can see minimun, median, mean and maximum values of each numeric variables. It is interesting that the minimum values of price is zero. This could be an incorrect data.

We can also notice that the maximum values statistically far away from median third quantile. This could be an outliers.

2. Exploratory Data Analysis

2.1. Univariate Data Analysis

Plot distribution of price using boxplot.

## Boxplot

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Based on boxplot above, we can see that there are outliers in price.

2.2. Bivariate Data Analysis

For plotting purposes, some numerical and char variables are transformed to factor.

## Casting bedrooms from numeric to factor
house_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$city2 <- factor(house_df$city)
house_df$statezip2 <- factor(house_df$statezip)
house_df$street2 <- factor(house_df$street)
house_df$country2 <- factor(house_df$country)

ggplot(data = house_df, aes(x = bedrooms2,
                            y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Plot house price based on number of bedrooms.

ggplot(data = house_df, aes(x = bedrooms2,
                            y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Based on price by number of bedrooms, we can see the following: 1. In general, the higher number of bedrooms the higher the price. 2. It is interesting that house with number of bedrooms == 0, the house price is significantly high. It could be a special house such as meeting hall, religious builing, sport center, etc.

2.3. Multivariate Data Analysis

Compute correlation coefficient (R) among all numerical variables. Visualize Correlation Coefficient in a diagram.

## Compute Correlation Coefficient
house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)

library(corrgram)
corrgram(house_df_num, order = TRUE,
         upper.panel = panel.pie)

Several variables are highly correlated. For example sqft_living and sqft_above.

For target variable (price), the variables with high correlation in order are sqft_living, sqft_above, bathroom and bedroom.

Insight from EDA

  1. Incorrect values on price (price == 0).
  2. Extremen outliers on price variables.
  3. There are co-linear variables.
  4. Based on Pearson’s correlation coefficient (r), the variables with highest correlation with target (price) are sqft_living, sqft_above, bathroom and bedroom.
  5. In general, the highter number of bedrooms, the higher the price. Exception for bedroom == 0.
  6. There are locations variables. However, they are non-numeric.

3. Data Preparation

3.1. Data Cleaning

Remove rows with incorrect values in price.

# Get indices
idx <- which(house_df_num$price %in% c(0))

# Remove those rows
house_df_num <- house_df_num[-idx, ]
summary(house_df_num$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7800   326264   465000   557906   657500 26590000

The minimum value on price is now not zero but 7800.

Remove rows with outliers in price.

# Get outliers values
out_price <- boxplot.stats(house_df_num$price)$out

# Get indices
out_idx <- which(house_df_num$price %in% c(out_price))

# Remove those rows
house_df_num <- house_df_num[ -out_idx, ]
summary(house_df_num$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7800  320000  450000  487457  615000 1150000

The maximum value on price is now 1150000.

nrow(house_df_num)
## [1] 4311

Number of observation is now 4311. It means the data cleaning process removed 298 rows.

3.2. Feature Extraction

One Hot Encoding for Location Features (using statezip column)

#### Remove rows as in house_df_num
house_df <- house_df[ rownames(house_df_num), ]

#### 1. Create dataframe for statezip
statezip <- house_df$statezip
statezip_df <- data.frame(statezip)
colnames(statezip_df) <- c("loc.")

#### 2. OHE on statezip dataframe
library(caret)
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))

#### 3. Combine to house_df_num dataframe
#### 11 + 77 = 88 columns
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4311   88

Number of features (columns) is now 88. It means added 77 new features for location information using OHE in statezip.

3.3. Training and Testing Division

Ramdomly divided the dataset into training dan testing with ration = 70:30. For reproducible result, it is necessary to set seed.

# Data dimension
d <- dim(house_df_num)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns

## Ratio train:test = 70:30
## Randomly divided
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]
## [1] 2214  442  743
train_data <- house_df_num[ train_idx , ]
test_data <- house_df_num[ -train_idx , ]

dim(train_data)
## [1] 3017   88
dim(test_data)
## [1] 1294   88

The train data has 3017 rows and test data has 1294 rows.

4. Modeling

Create regression model.

mymodel <- lm(formula = price ~ . + 
                I(sqft_living^2) +
                sqft_living:bedrooms,
             data = train_data)
summary(mymodel)
## 
## Call:
## lm(formula = price ~ . + I(sqft_living^2) + sqft_living:bedrooms, 
##     data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -872020  -52263   -1836   48176  520918 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.000e+05  1.037e+05  -0.964 0.335018    
## bedrooms              1.127e+04  7.435e+03   1.516 0.129518    
## bathrooms             2.191e+04  4.317e+03   5.075 4.11e-07 ***
## sqft_living           1.702e+02  1.137e+01  14.960  < 2e-16 ***
## sqft_lot              2.743e-01  5.575e-02   4.920 9.13e-07 ***
## floors               -3.338e+04  5.251e+03  -6.356 2.39e-10 ***
## waterfront            1.354e+05  3.682e+04   3.678 0.000239 ***
## view                  3.672e+04  3.262e+03  11.257  < 2e-16 ***
## condition             2.308e+04  3.043e+03   7.585 4.44e-14 ***
## sqft_above            8.334e+01  6.353e+00  13.117  < 2e-16 ***
## sqft_basement                NA         NA      NA       NA    
## loc.WA.98001         -1.046e+05  1.026e+05  -1.020 0.307959    
## loc.WA.98002         -1.127e+05  1.037e+05  -1.087 0.277266    
## loc.WA.98003         -9.356e+04  1.030e+05  -0.908 0.363934    
## loc.WA.98004          4.455e+05  1.036e+05   4.299 1.77e-05 ***
## loc.WA.98005          2.412e+05  1.042e+05   2.315 0.020657 *  
## loc.WA.98006          1.915e+05  1.023e+05   1.872 0.061294 .  
## loc.WA.98007          2.161e+05  1.036e+05   2.086 0.037080 *  
## loc.WA.98008          1.237e+05  1.030e+05   1.201 0.229827    
## loc.WA.98010          5.043e+03  1.112e+05   0.045 0.963840    
## loc.WA.98011          4.492e+04  1.037e+05   0.433 0.664739    
## loc.WA.98014          5.046e+03  1.052e+05   0.048 0.961734    
## loc.WA.98019         -8.018e+03  1.033e+05  -0.078 0.938134    
## loc.WA.98022         -1.214e+05  1.040e+05  -1.167 0.243275    
## loc.WA.98023         -1.179e+05  1.022e+05  -1.153 0.248916    
## loc.WA.98024          6.126e+04  1.097e+05   0.559 0.576464    
## loc.WA.98027          9.721e+04  1.025e+05   0.949 0.342795    
## loc.WA.98028          5.189e+04  1.026e+05   0.506 0.612964    
## loc.WA.98029          1.634e+05  1.024e+05   1.595 0.110740    
## loc.WA.98030         -9.482e+04  1.034e+05  -0.917 0.359028    
## loc.WA.98031         -9.087e+04  1.027e+05  -0.885 0.376375    
## loc.WA.98032         -1.182e+05  1.051e+05  -1.125 0.260848    
## loc.WA.98033          2.388e+05  1.024e+05   2.331 0.019809 *  
## loc.WA.98034          1.069e+05  1.024e+05   1.044 0.296346    
## loc.WA.98038         -7.578e+04  1.023e+05  -0.741 0.458861    
## loc.WA.98039         -3.825e+05  1.436e+05  -2.663 0.007786 ** 
## loc.WA.98040          3.101e+05  1.030e+05   3.009 0.002641 ** 
## loc.WA.98042         -9.962e+04  1.022e+05  -0.975 0.329789    
## loc.WA.98045         -6.168e+03  1.029e+05  -0.060 0.952219    
## loc.WA.98047         -9.415e+04  1.112e+05  -0.847 0.397284    
## loc.WA.98050          8.550e+03  1.441e+05   0.059 0.952686    
## loc.WA.98051         -2.062e+04  1.116e+05  -0.185 0.853467    
## loc.WA.98052          1.918e+05  1.020e+05   1.879 0.060306 .  
## loc.WA.98053          1.514e+05  1.023e+05   1.480 0.138858    
## loc.WA.98055         -5.428e+04  1.043e+05  -0.520 0.602784    
## loc.WA.98056          2.489e+04  1.024e+05   0.243 0.808066    
## loc.WA.98057         -7.989e+04  1.060e+05  -0.753 0.451264    
## loc.WA.98058         -5.394e+04  1.023e+05  -0.527 0.597992    
## loc.WA.98059          5.430e+03  1.022e+05   0.053 0.957646    
## loc.WA.98065          4.819e+04  1.027e+05   0.469 0.639102    
## loc.WA.98068          1.495e+05  1.437e+05   1.041 0.298156    
## loc.WA.98070          7.905e+03  1.044e+05   0.076 0.939640    
## loc.WA.98072          1.012e+05  1.027e+05   0.985 0.324596    
## loc.WA.98074          1.267e+05  1.023e+05   1.239 0.215417    
## loc.WA.98075          1.705e+05  1.025e+05   1.663 0.096330 .  
## loc.WA.98077          1.219e+05  1.032e+05   1.181 0.237887    
## loc.WA.98092         -1.075e+05  1.024e+05  -1.050 0.293706    
## loc.WA.98102          3.308e+05  1.050e+05   3.150 0.001648 ** 
## loc.WA.98103          2.502e+05  1.021e+05   2.450 0.014339 *  
## loc.WA.98105          3.583e+05  1.039e+05   3.447 0.000575 ***
## loc.WA.98106          3.489e+04  1.027e+05   0.340 0.734118    
## loc.WA.98107          2.486e+05  1.027e+05   2.421 0.015551 *  
## loc.WA.98108          2.727e+04  1.032e+05   0.264 0.791644    
## loc.WA.98109          4.323e+05  1.066e+05   4.055 5.13e-05 ***
## loc.WA.98112          3.731e+05  1.031e+05   3.619 0.000301 ***
## loc.WA.98115          2.409e+05  1.021e+05   2.360 0.018358 *  
## loc.WA.98116          2.219e+05  1.027e+05   2.160 0.030848 *  
## loc.WA.98117          2.241e+05  1.021e+05   2.195 0.028250 *  
## loc.WA.98118          8.617e+04  1.025e+05   0.841 0.400645    
## loc.WA.98119          3.310e+05  1.034e+05   3.200 0.001389 ** 
## loc.WA.98122          2.650e+05  1.028e+05   2.578 0.009984 ** 
## loc.WA.98125          8.670e+04  1.024e+05   0.847 0.397056    
## loc.WA.98126          1.142e+05  1.024e+05   1.115 0.264930    
## loc.WA.98133          6.011e+04  1.023e+05   0.588 0.556720    
## loc.WA.98136          1.694e+05  1.029e+05   1.646 0.099898 .  
## loc.WA.98144          1.748e+05  1.025e+05   1.706 0.088124 .  
## loc.WA.98146          4.353e+04  1.029e+05   0.423 0.672173    
## loc.WA.98148         -3.307e+04  1.085e+05  -0.305 0.760587    
## loc.WA.98155          3.714e+04  1.024e+05   0.363 0.716785    
## loc.WA.98166          9.244e+02  1.027e+05   0.009 0.992821    
## loc.WA.98168         -5.496e+04  1.026e+05  -0.535 0.592420    
## loc.WA.98177          1.633e+05  1.029e+05   1.587 0.112601    
## loc.WA.98178         -5.612e+04  1.032e+05  -0.544 0.586724    
## loc.WA.98188         -9.531e+04  1.049e+05  -0.909 0.363523    
## loc.WA.98198         -9.769e+04  1.028e+05  -0.950 0.341938    
## loc.WA.98199          2.732e+05  1.030e+05   2.653 0.008022 ** 
## loc.WA.98288          4.083e+04  1.438e+05   0.284 0.776479    
## loc.WA.98354                 NA         NA      NA       NA    
## I(sqft_living^2)     -8.753e-03  2.508e-03  -3.490 0.000490 ***
## bedrooms:sqft_living -1.086e+01  3.057e+00  -3.553 0.000387 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 101500 on 2929 degrees of freedom
## Multiple R-squared:  0.7877, Adjusted R-squared:  0.7814 
## F-statistic: 124.9 on 87 and 2929 DF,  p-value: < 2.2e-16

We can see that some statezip (locations) has significant influence on price. This means location is important to predict house price.

5. Evaluation

Predict Price

## Actual price
actual <- test_data$price

## Predicted price
pred.mymodel <- predict(mymodel, test_data)

Plot Actual vs. Predicted Price

price_df <- data.frame(actual, pred.mymodel)

ggplot(data = price_df, aes(x = actual, y = pred.mymodel)) +
  geom_point() +
  scale_x_continuous(limits = c(0,1500000)) +
  scale_y_continuous(limits = c(0,1500000))

Mostly the points are located in the diagonal. It means the predicted valeus are closed to the actual values. However, there are still some points that are relatively far for diagonal.

Measure RMSE and R

performance <- function(prediction, actual, method){
  e <- prediction - actual # error
  se <- e^2 # squared error
  sse <- sum(se) # sum of squared error
  mse <- mean(se) # mean squared error
  rmse <- sqrt(mse) # root mean squared error
  r <- cor(prediction, actual) # correlation coefficient
  
  result <- paste("*** Method: ", method,
                  "\nRMSE = ", round(rmse,3),
                  "\nR = ", round(r,3),
                  "\n")
  cat(result)
}

performance(pred.mymodel, actual, "My Model")
## *** Method:  My Model 
## RMSE =  97428.002 
## R =  0.889

The correlation coefficient is very high. The error in terms of RMSE is significantly reduced to lower 100000. This means we made huge improvement in modeling.

6. Recommendation

  1. The prediction model is good enough to predict the price of mainstream houses.

  2. It is ready for deployment. However, there are still room of improvement in the future. For example, house with bedroom == 0 and house with significantly expensive price may be designed a dedicated model.

  3. Location is an important features generated from one of the column in the dataset (statezip). There could be other important feature that can be generated and make significant influence.