Description

This report provides house price prediction using regression algorithms. The data set used for modeling in this report is a real house pricing data in the US. The data set is sourced out form Kaggle, it can be downloaded through the following link: https://www.kaggle.com/shree1992/housedata

The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation

#1. Data Extraction

Import Necessary Libraries.

rm(list = ls())
library(ggplot2)
library(corrgram)
library(gridExtra)
library(caret)
## 载入需要的程辑包:lattice
## 
## 载入程辑包:'lattice'
## The following object is masked from 'package:corrgram':
## 
##     panel.fill

Library ggplot2 : for graphic & Visualization Library corrgram : for visualization of correlation coefficient matrix Library gridExtra: for visualize multiple graph Library Caret : for One Hot Encoding (OHE)

## read data
house_df <- read.csv("data/house.csv")

## structure of dataframe
str(house_df)
## 'data.frame':    4600 obs. of  18 variables:
##  $ date         : chr  "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
##  $ price        : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms     : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms    : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living  : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot     : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors       : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition    : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above   : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement: int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built     : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ yr_renovated : int  2005 0 0 0 1992 1994 0 0 0 2010 ...
##  $ street       : chr  "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
##  $ city         : chr  "Shoreline" "Seattle" "Kent" "Bellevue" ...
##  $ statezip     : chr  "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
##  $ country      : chr  "USA" "USA" "USA" "USA" ...

read house data set from .csv file to R data frame. Then, set the data frame’s structure. The target variable is price and the remaining variables are candidate features. there are 4600 observations and 18 variables recorded in the data set.

## statistical summary
summary(house_df)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

The data summary gives us the pictures of minimum, median, mean and maximum values of each numerical variable. We can take notice that the minimum value of observed price is zero, this could be an incorrect data. The maximum value of price observed is statically fallen way from its normal distribution values and the 3rd quantile, this could be identified as an outlier.

2. Exploratory Data Analysis

2.1. Univariate Analysis

in this analysis we are using box plot statistical model to identify the existence and the degree of significance of out liers in the data since Box plot provides clearer picture of the out liers to the data set’s normal distribution values.

ggplot(data = house_df, aes(y = price)) +
  geom_boxplot() + 
  scale_y_continuous(limits = c(0, 2000000))

based on boxplot above, we can se that there are a significant number of outliers in the target’s distribution.

2.2. Bivariate Analysis

Plotting house price based on number of bedrooms.

house_df$bedrooms2 <- factor(house_df$bedrooms)
ggplot(data = house_df, aes(x = bedrooms2, 
                            y = price)) +
  geom_boxplot() +
  scale_y_continuous(limits = c(0, 2000000))

Based on the price by number of bedrooms plot, we can see the following observation: 1. In General, the higher the number of bedrooms, the higher the price of that particular property. 2. It is interesting that the houses with number of bedrooms == 0, the house prices are significantly high. This phenomena could be caused by few probabilities such as: different type of property than the rest of the distribution, sampling error, etc.

2.3. Multivariate Analyss

compute Pearson’s correlation coefficient (R) among all numerical variables/ Then visualize the result in a diagram

## Compute Pearson's Correlation Coefficient (R)

house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)


corrgram(house_df_num,
         upper.panel = panel.cor)

Based from our analysis, there are few notes can be taken as followed: 1. sqft_living and bathrooms are two top tiers of highly correlated feature variables to price with following correlation scores of 0.43 and 0.33 respectively 2. however it is important to note, even though sqft_above scores higher than bathrooms in correlation, nevertheless it also shares even higher correlation to **sqft_living*. This signifies that both of features provide the same (redundant) information.

insight:

  1. There are outliers in price
  2. Incorrect price values (price == 0)
  3. In general, the higher number of bedrooms, the higher the price. However, for houses with 0 bedrooms, the price are significantly higher.
  4. Based on Pearson’s Correlation Coefficient (R), the variable with highest correlation with target (price) are sqft_living, sqft_above, and bathrooms.
  5. Location is an important feature to predict price. So, it is necessary to include in the modeling.

3. Data Preparation

3.1. Data Cleaning

Remove observations with incorrect price (price == 0).

### get index that price == 0
idx_price_0 <- which( house_df_num$price == 0 )

### remove obs with price == 0
house_df_num <- house_df_num[ -idx_price_0, ] 

summary(house_df_num$price)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##     7800   326264   465000   557906   657500 26590000

The minimum value on price is now not zero, but 7800.

Remove observations with outliers in price

# get outliers idx
out_price <- boxplot.stats(house_df_num$price)$out
idx_out <- which(house_df_num$price %in% c(out_price))

### remove obs with outlier
house_df_num <- house_df_num[ -idx_out, ]
summary(house_df_num$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7800  320000  450000  487457  615000 1150000

The maximum value on price is now not 26590000, but 1150000.

dim(house_df_num)
## [1] 4311   11

Number of observations is now 4311. It means, the data cleaning process removed 289 rows.

3.2. Feature Extraction

Add location features using statezip variable with One Hot Encoding (OHE) method.

house_df$street <- as.factor(house_df$street)
house_df$city <- as.factor(house_df$city)
house_df$country <- as.factor(house_df$country)
house_df$statezip <- as.factor(house_df$statezip)

### OHE on statezip column (house_df dataframe)

### create dataframe to be encoded
statezip_df <- data.frame(house_df$statezip)
colnames(statezip_df) <- "loc"

### create OHE dataframe
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))

### combine to original dataframe
house_df <- cbind(house_df, df2)
house_df$statezip <- NULL

### combine to numerical data for modeling
idx <- rownames(house_df_num)
house_df_num <- cbind(house_df_num, df2[idx, ] )

dim(house_df_num)
## [1] 4311   88

Number of columns is now 88. It means, we added 77 new features for location information using OHE in statezip.

4. Modeling

Create regression model using multivariate linear regression algorithm. We will create two models: 1. without location feature 2. with location feature

## for reproducible result
set.seed(2021) 

m <- nrow(house_df_num)
m_train <- m * 0.7
train_idx <- sample(m, m_train)
# train_idx[1:3]

train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]
## 4.3. Multivariate Linear Regression (two or more variables)
model.mlr1 <- lm(formula = price ~ sqft_living + bathrooms + view, 
                data = train_df)
model.mlr1
## 
## Call:
## lm(formula = price ~ sqft_living + bathrooms + view, data = train_df)
## 
## Coefficients:
## (Intercept)  sqft_living    bathrooms         view  
##    137968.7        150.8      17711.4      35283.6
# Without Location
model.mlr2 <- lm(formula = price ~ . ,
                 data = train_df[, 1:11])
model.mlr2
## 
## Call:
## lm(formula = price ~ ., data = train_df[, 1:11])
## 
## Coefficients:
##   (Intercept)       bedrooms      bathrooms    sqft_living       sqft_lot  
##    -1.005e+04     -2.812e+04      1.491e+04      1.726e+02     -1.524e-01  
##        floors     waterfront           view      condition     sqft_above  
##     4.468e+04      7.641e+04      3.078e+04      4.545e+04     -9.451e+00  
## sqft_basement  
##            NA
# With Location
model.mlr3 <- lm(formula = price ~ . ,
                 data = train_df)
model.mlr3
## 
## Call:
## lm(formula = price ~ ., data = train_df)
## 
## Coefficients:
##   (Intercept)       bedrooms      bathrooms    sqft_living       sqft_lot  
##     3.079e+03     -8.555e+03      2.423e+04      8.759e+01      2.429e-01  
##        floors     waterfront           view      condition     sqft_above  
##    -2.884e+04      1.338e+05      3.576e+04      2.461e+04      7.897e+01  
## sqft_basement   loc.WA.98001   loc.WA.98002   loc.WA.98003   loc.WA.98004  
##            NA     -9.982e+04     -1.095e+05     -9.539e+04      4.527e+05  
##  loc.WA.98005   loc.WA.98006   loc.WA.98007   loc.WA.98008   loc.WA.98010  
##     2.436e+05      1.961e+05      2.202e+05      1.247e+05      1.019e+04  
##  loc.WA.98011   loc.WA.98014   loc.WA.98019   loc.WA.98022   loc.WA.98023  
##     5.375e+04      1.696e+03     -2.265e+03     -1.112e+05     -1.137e+05  
##  loc.WA.98024   loc.WA.98027   loc.WA.98028   loc.WA.98029   loc.WA.98030  
##     7.352e+04      9.424e+04      5.830e+04      1.664e+05     -8.976e+04  
##  loc.WA.98031   loc.WA.98032   loc.WA.98033   loc.WA.98034   loc.WA.98038  
##    -8.540e+04     -1.154e+05      2.423e+05      1.123e+05     -6.967e+04  
##  loc.WA.98039   loc.WA.98040   loc.WA.98042   loc.WA.98045   loc.WA.98047  
##    -3.787e+05      3.153e+05     -9.370e+04     -2.295e+03     -9.456e+04  
##  loc.WA.98050   loc.WA.98051   loc.WA.98052   loc.WA.98053   loc.WA.98055  
##     3.396e+04     -1.068e+04      1.982e+05      1.604e+05     -5.200e+04  
##  loc.WA.98056   loc.WA.98057   loc.WA.98058   loc.WA.98059   loc.WA.98065  
##     2.865e+04     -8.412e+04     -5.039e+04      1.179e+04      4.635e+04  
##  loc.WA.98068   loc.WA.98070   loc.WA.98072   loc.WA.98074   loc.WA.98075  
##     1.591e+05      1.602e+04      1.095e+05      1.346e+05      1.739e+05  
##  loc.WA.98077   loc.WA.98092   loc.WA.98102   loc.WA.98103   loc.WA.98105  
##     1.171e+05     -1.004e+05      3.287e+05      2.454e+05      3.598e+05  
##  loc.WA.98106   loc.WA.98107   loc.WA.98108   loc.WA.98109   loc.WA.98112  
##     2.772e+04      2.426e+05      2.498e+04      4.301e+05      3.728e+05  
##  loc.WA.98115   loc.WA.98116   loc.WA.98117   loc.WA.98118   loc.WA.98119  
##     2.426e+05      2.233e+05      2.224e+05      8.315e+04      3.298e+05  
##  loc.WA.98122   loc.WA.98125   loc.WA.98126   loc.WA.98133   loc.WA.98136  
##     2.608e+05      8.297e+04      1.085e+05      5.955e+04      1.696e+05  
##  loc.WA.98144   loc.WA.98146   loc.WA.98148   loc.WA.98155   loc.WA.98166  
##     1.689e+05      3.936e+04     -2.671e+04      3.573e+04      5.474e+03  
##  loc.WA.98168   loc.WA.98177   loc.WA.98178   loc.WA.98188   loc.WA.98198  
##    -5.839e+04      1.721e+05     -5.800e+04     -8.872e+04     -9.393e+04  
##  loc.WA.98199   loc.WA.98288   loc.WA.98354  
##     2.741e+05      8.278e+03             NA

5. Evaluation

6. Recommendation