Description

This report provides detail implementation house price prediction using regression algorithms

The dataset using in this report for modeling is a real hous data in US.

  1. Data Extraction
  2. Exploratory Data Ananlysis
  3. Data Preparartion
  4. Modeling
  5. Evaluation
  6. Recomendation

1. Data Extraction

Import necessary libraries.

library(ggplot2)
library(corrgram)
library(dplyr)
library(caret)
library(gridExtra)
library(Metrics)

Library ggplot2 : For grapich and visualization
Library corrgram : For visualization of correlation coefficient Library dplyr : For data manipulation Library caret : For One-Hot-Encoding Library gridExtra : Forploting multiple graphs Library Metrics : for performance

Read house dataset from .csv file to R dataframe. Then, set the structure of dataframe.

house_df <- read.csv("../data/house.csv")


str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

The dataset has 4600 observations (rows) and 18 variables (columns). The targer is price and the remaining variables are features candidate.

Calculate statistical summary of numerical variabel

summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

We can see the min, max, and mean, Q1, Q2(median), and Q3 of each numerical variable.

It is interisting to see that the min value of price is zero. This could be an incorrect data.

2. Exploratory Data Ananlysis

2.1 Univariate Analysis

Analysis of a single variabel. We want to see the distribution of the target variabel price.

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

out_price <- boxplot.stats(house_df$price)$out
paste("Number of Outliers:", length(out_price))
## [1] "Number of Outliers: 240"

Based on the boxplot above, we can see that there are outliers in variable price. There are 240 outliers in the house price

2.2 Bivariate Analysis

Analysis of two variabels. We want to see the realtionship between house price with number of bedrooms

house_df$bedrooms2 <- factor(house_df$bedrooms)

ggplot(data = house_df, aes(y = price, x = bedrooms2)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

In general, the higher number of bedrooms the higher the price However, houses with bedrooms == 0, the prices are significantly higher.

We want to investigate size of the house (sqft_living) for each number of bedrooms.

ggplot(data = house_df, aes(y=sqft_living, x = bedrooms2))+
  geom_boxplot()

We can see that the house with bedrooms == 0 has the highest median sqft_living than the rest of the number of bedrooms. This means, its is a special type of house.

2.3 Multivariate Analysis

Analysis of 3 variables. We want to see the relationship between bedroom, sqft_living, and price.

ggplot(data = house_df, aes(y = price, x = sqft_living, color = bedrooms2)) +
  geom_point() +
  scale_y_continuous(limits = c(0, 2000000)) +
  scale_x_continuous(limits = c(0, 7500))

In general, the highest number of sqft_living the higher price.

It is interisting to see observations with price == 0(dots in the x-axis). They have bedrooms and sqft_living. It indicates that the price is incorrect.

Compute Pearson’s Correlation Coefficient (R)

house_df_num <- house_df[ , 2:12]
corrgram(house_df_num,
         upper.panel = panel.cor)

All numerical variables have positive correlation with price. There variables with highest correlation are sqft_living, sqft_above, and bathrooms.

3. Data Preparartion

3.1 Data cleaning

Remove observations with incorrect price == 0

price_0<- filter(house_df_num, price ==0)
paste("Number of observation with price:", nrow(price_0))
## [1] "Number of observation with price: 49"
house_df_num1 <- filter(house_df_num, price >0)
paste("Number of observation:", nrow(house_df_num1))
## [1] "Number of observation: 4551"

There are 49 observations with price == 0. After removing these observations, the current, dataframe has 4551 observations.

Remove outliers on price

# get outliers
out_price <- boxplot.stats(house_df_num1$price)$out
paste("Number of Outliers:", length(out_price))
## [1] "Number of Outliers: 240"
# get outliers index

out_idx <- which(house_df_num1$price %in% c(out_price))

# remove outliers 
house_df_num1 <- house_df_num1[-out_idx,]

There are 240 observations with price == 0. After removing these observations, the current, dataframe has 4311 observations.

3.2 Features Extraction

add OHE on location Variabels (statezip)

## 1. create dataframe for column to be encode
location_df <- data.frame(house_df$statezip)
colnames(location_df) <- "location_"

#2. OHE on that column
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))

paste("Number of additional:", ncol(df2))
## [1] "Number of additional: 77"
#3 Combine to original dataframe
idx <- rownames(house_df_num1)
house_df_num2 <- cbind(house_df_num1, df2[idx,])

paste("Number of current variabels:", ncol(house_df_num2))
## [1] "Number of current variabels: 88"

There are 77 unique statezip. Using OHE techmique, we created 77 new columns (one columns one satetzip). After adding these variabels, the current dataframe has 88 variabels.

3.3 Spliting Data in Training and Testing

Divide data into training and testing with ratio 70:30

### train:test = 70:30
m <- nrow(house_df_num1) # number of obs/rows
n <- ncol(house_df_num1) #number of vars/colum

mtrain <-floor( 0.7 * m) # number of training examples

set.seed(2022)
train_idx <- sample(m, mtrain)


train_df <- house_df_num2[ train_idx,]
test_df <- house_df_num2  [ -train_idx,]

paste("Number of training.obs:", nrow(train_df))
## [1] "Number of training.obs: 3017"
paste("Number of testing.obs:", nrow(test_df))
## [1] "Number of testing.obs: 1294"

There are 3017 obs in training data to develop a prediction model obs and 1294 obs in testing data to evaluate the model.

4. Modeling

Using Multivariate Linear Regression (MLR) algorithm to predict house price.

model.mlr2 <- lm( formula = price ~ . ,
                  data = train_df)
summary(model.mlr2)
## 
## Call:
## lm(formula = price ~ ., data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -842587  -51386    -464   51246  483921 
## 
## Coefficients: (3 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -3.111e+04  7.649e+04  -0.407 0.684249    
## bedrooms          -9.967e+03  2.950e+03  -3.379 0.000738 ***
## bathrooms          1.836e+04  4.517e+03   4.064 4.95e-05 ***
## sqft_living        9.948e+01  6.310e+00  15.765  < 2e-16 ***
## sqft_lot           1.363e-01  5.514e-02   2.471 0.013530 *  
## floors            -1.957e+04  5.467e+03  -3.579 0.000350 ***
## waterfront         1.581e+05  3.208e+04   4.929 8.71e-07 ***
## view               3.479e+04  3.359e+03  10.356  < 2e-16 ***
## condition          2.867e+04  3.120e+03   9.190  < 2e-16 ***
## sqft_above         7.374e+01  6.551e+00  11.256  < 2e-16 ***
## sqft_basement             NA         NA      NA       NA    
## location_WA.98001 -8.852e+04  7.657e+04  -1.156 0.247773    
## location_WA.98002 -9.675e+04  7.854e+04  -1.232 0.218122    
## location_WA.98003 -8.770e+04  7.718e+04  -1.136 0.255895    
## location_WA.98004  4.304e+05  7.727e+04   5.570 2.77e-08 ***
## location_WA.98005  2.637e+05  7.828e+04   3.368 0.000766 ***
## location_WA.98006  2.314e+05  7.630e+04   3.032 0.002447 ** 
## location_WA.98007  2.317e+05  7.788e+04   2.975 0.002956 ** 
## location_WA.98008  1.490e+05  7.740e+04   1.924 0.054394 .  
## location_WA.98010  2.111e+04  8.867e+04   0.238 0.811809    
## location_WA.98011  6.525e+04  7.818e+04   0.835 0.403992    
## location_WA.98014  3.734e+04  7.960e+04   0.469 0.639040    
## location_WA.98019  6.195e+03  7.725e+04   0.080 0.936086    
## location_WA.98022 -6.777e+04  7.803e+04  -0.868 0.385207    
## location_WA.98023 -9.450e+04  7.609e+04  -1.242 0.214381    
## location_WA.98024  9.047e+04  8.385e+04   1.079 0.280671    
## location_WA.98027  1.075e+05  7.631e+04   1.409 0.158995    
## location_WA.98028  5.548e+04  7.664e+04   0.724 0.469178    
## location_WA.98029  1.663e+05  7.655e+04   2.173 0.029892 *  
## location_WA.98030 -4.593e+04  7.766e+04  -0.592 0.554227    
## location_WA.98031 -6.642e+04  7.654e+04  -0.868 0.385569    
## location_WA.98032 -1.210e+05  7.979e+04  -1.516 0.129591    
## location_WA.98033  2.511e+05  7.630e+04   3.291 0.001011 ** 
## location_WA.98034  1.174e+05  7.601e+04   1.544 0.122607    
## location_WA.98038 -5.324e+04  7.607e+04  -0.700 0.484051    
## location_WA.98039 -3.736e+05  1.299e+05  -2.877 0.004043 ** 
## location_WA.98040  2.942e+05  7.714e+04   3.814 0.000140 ***
## location_WA.98042 -7.963e+04  7.598e+04  -1.048 0.294749    
## location_WA.98045  1.471e+04  7.696e+04   0.191 0.848449    
## location_WA.98047 -5.551e+04  8.863e+04  -0.626 0.531127    
## location_WA.98050  5.543e+04  1.303e+05   0.425 0.670645    
## location_WA.98051  1.815e+04  8.901e+04   0.204 0.838424    
## location_WA.98052  2.080e+05  7.569e+04   2.748 0.006037 ** 
## location_WA.98053  1.746e+05  7.621e+04   2.291 0.022035 *  
## location_WA.98055 -3.989e+04  7.921e+04  -0.504 0.614523    
## location_WA.98056  2.473e+04  7.628e+04   0.324 0.745796    
## location_WA.98057 -7.596e+04  8.147e+04  -0.932 0.351208    
## location_WA.98058 -2.487e+04  7.601e+04  -0.327 0.743586    
## location_WA.98059  3.724e+04  7.588e+04   0.491 0.623563    
## location_WA.98065  4.485e+04  7.654e+04   0.586 0.557953    
## location_WA.98068         NA         NA      NA       NA    
## location_WA.98070  1.408e+03  7.905e+04   0.018 0.985794    
## location_WA.98072  1.225e+05  7.668e+04   1.598 0.110260    
## location_WA.98074  1.458e+05  7.614e+04   1.915 0.055599 .  
## location_WA.98075  1.694e+05  7.638e+04   2.218 0.026631 *  
## location_WA.98077  1.142e+05  7.728e+04   1.477 0.139689    
## location_WA.98092 -8.542e+04  7.614e+04  -1.122 0.262050    
## location_WA.98102  3.458e+05  8.109e+04   4.265 2.06e-05 ***
## location_WA.98103  2.488e+05  7.580e+04   3.282 0.001041 ** 
## location_WA.98105  2.769e+05  7.801e+04   3.549 0.000393 ***
## location_WA.98106  4.414e+04  7.664e+04   0.576 0.564725    
## location_WA.98107  2.469e+05  7.674e+04   3.218 0.001307 ** 
## location_WA.98108  4.986e+04  7.725e+04   0.645 0.518713    
## location_WA.98109  4.164e+05  8.102e+04   5.140 2.93e-07 ***
## location_WA.98112  3.410e+05  7.741e+04   4.405 1.10e-05 ***
## location_WA.98115  2.078e+05  7.592e+04   2.737 0.006232 ** 
## location_WA.98116  2.283e+05  7.672e+04   2.976 0.002943 ** 
## location_WA.98117  2.169e+05  7.585e+04   2.860 0.004267 ** 
## location_WA.98118  8.666e+04  7.626e+04   1.136 0.255878    
## location_WA.98119  3.844e+05  7.794e+04   4.932 8.58e-07 ***
## location_WA.98122  2.508e+05  7.654e+04   3.277 0.001062 ** 
## location_WA.98125  7.111e+04  7.644e+04   0.930 0.352310    
## location_WA.98126  1.221e+05  7.621e+04   1.602 0.109347    
## location_WA.98133  6.534e+04  7.608e+04   0.859 0.390445    
## location_WA.98136  1.667e+05  7.701e+04   2.164 0.030515 *  
## location_WA.98144  1.952e+05  7.645e+04   2.554 0.010714 *  
## location_WA.98146  3.163e+04  7.667e+04   0.412 0.680012    
## location_WA.98148 -2.256e+04  8.207e+04  -0.275 0.783450    
## location_WA.98155  5.985e+04  7.607e+04   0.787 0.431448    
## location_WA.98166  3.252e+04  7.675e+04   0.424 0.671860    
## location_WA.98168 -2.261e+04  7.654e+04  -0.295 0.767693    
## location_WA.98177  1.461e+05  7.791e+04   1.875 0.060837 .  
## location_WA.98178 -8.146e+04  7.721e+04  -1.055 0.291490    
## location_WA.98188 -7.244e+04  7.925e+04  -0.914 0.360720    
## location_WA.98198 -7.122e+04  7.694e+04  -0.926 0.354748    
## location_WA.98199  2.643e+05  7.688e+04   3.438 0.000594 ***
## location_WA.98288 -4.277e+04  9.676e+04  -0.442 0.658491    
## location_WA.98354         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 105900 on 2932 degrees of freedom
## Multiple R-squared:  0.7612, Adjusted R-squared:  0.7543 
## F-statistic: 111.2 on 84 and 2932 DF,  p-value: < 2.2e-16

From 88 variabels, there are some variabels with significant coefficients(***). Some location (statezip) are more significant than others.

5. Evaluation

5.1 Visualise Actual vs. Predicted

Plot actual values of price in the testing data its predicted values using MLR model.

actual <- test_df$price
pred.mlr2 <- predict(model.mlr2, test_df)
pred_df <- data.frame(actual,pred.mlr2)

## PLOT MLR2 
plot_mlr2 <- ggplot(data = pred_df, aes(y = pred.mlr2, x = actual)) +
  geom_point()+
  geom_smooth()+
  scale_x_continuous(limits = c(0, 1250000)) +
  scale_y_continuous(limits = c(0, 1250000)) +
  labs(title = "Actual VS Predicted Values using MLR")


plot_mlr2

Generally, the points are near diagonals. This means the model can predict actual price with good accuracy.

However, some point are still far away from the diagonal. Further investigation and improvement is needed.

5.2 Compute Performance Metrics

Compute Mean Absolute Error (MAE) and pearson correlation Coefficient (R)

mae_mlr2 <- mae(actual, pred.mlr2)
paste("MAE:", mae_mlr2)
## [1] "MAE: 74759.0279855563"
r_mlr2 <-cor(actual, pred.mlr2)
paste("R:", round(r_mlr2,2))
## [1] "R: 0.87"

The current methods has MAE = 74759.03 and R = 0.87. This is a significant improvement from initial performance (MAE > 250K and R < 0.6)

6. Recomendation

  1. The prediction model is generally good enough to predict the price of mainstream house.

  2. Location is an importan features generated from on the location column (statezip). There could be other importan feauters generated from variabels and make significant influence.

  3. There are still of improvement. Is it ready for deployment?. It depends on the business value generated from the curerent model.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.