This report provides house price prediction using regression algorithms. The data set used for modeling in this report is a real house pricing data in the US. The data set is sourced out form Kaggle, it can be downloaded through the following link: https://www.kaggle.com/shree1992/housedata
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
#1. Data Extraction
Import Necessary Libraries.
rm(list = ls())
library(ggplot2)
library(corrgram)
library(gridExtra)
library(caret)
## 载入需要的程辑包:lattice
##
## 载入程辑包:'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
Library ggplot2 : for graphic & Visualization Library corrgram : for visualization of correlation coefficient matrix Library gridExtra: for visualize multiple graph Library Caret : for One Hot Encoding (OHE)
## read data
house_df <- read.csv("data/house.csv")
## structure of dataframe
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
read house data set from .csv file to R data frame. Then, set the data frame’s structure. The target variable is price and the remaining variables are candidate features. there are 4600 observations and 18 variables recorded in the data set.
## statistical summary
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
The data summary gives us the pictures of minimum, median, mean and maximum values of each numerical variable. We can take notice that the minimum value of observed price is zero, this could be an incorrect data. The maximum value of price observed is statically fallen way from its normal distribution values and the 3rd quantile, this could be identified as an outlier.
in this analysis we are using box plot statistical model to identify the existence and the degree of significance of out liers in the data since Box plot provides clearer picture of the out liers to the data set’s normal distribution values.
ggplot(data = house_df, aes(y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
based on boxplot above, we can se that there are a significant number of outliers in the target’s distribution.
Plotting house price based on number of bedrooms.
house_df$bedrooms2 <- factor(house_df$bedrooms)
ggplot(data = house_df, aes(x = bedrooms2,
y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
Based on the price by number of bedrooms plot, we can see the following observation: 1. In General, the higher the number of bedrooms, the higher the price of that particular property. 2. It is interesting that the houses with number of bedrooms == 0, the house prices are significantly high. This phenomena could be caused by few probabilities such as: different type of property than the rest of the distribution, sampling error, etc.
compute Pearson’s correlation coefficient (R) among all numerical variables/ Then visualize the result in a diagram
## Compute Pearson's Correlation Coefficient (R)
house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)
corrgram(house_df_num,
upper.panel = panel.cor)
Based from our analysis, there are few notes can be taken as followed: 1. sqft_living and bathrooms are two top tiers of highly correlated feature variables to price with following correlation scores of 0.43 and 0.33 respectively 2. however it is important to note, even though sqft_above scores higher than bathrooms in correlation, nevertheless it also shares even higher correlation to **sqft_living*. This signifies that both of features provide the same (redundant) information.
insight:
Remove observations with incorrect price (price == 0).
### get index that price == 0
idx_price_0 <- which( house_df_num$price == 0 )
### remove obs with price == 0
house_df_num <- house_df_num[ -idx_price_0, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 326264 465000 557906 657500 26590000
The minimum value on price is now not zero, but 7800.
Remove observations with outliers in price
# get outliers idx
out_price <- boxplot.stats(house_df_num$price)$out
idx_out <- which(house_df_num$price %in% c(out_price))
### remove obs with outlier
house_df_num <- house_df_num[ -idx_out, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 320000 450000 487457 615000 1150000
The maximum value on price is now not 26590000, but 1150000.
dim(house_df_num)
## [1] 4311 11
Number of observations is now 4311. It means, the data cleaning process removed 289 rows.
Add location features using statezip variable with One Hot Encoding (OHE) method.
house_df$street <- as.factor(house_df$street)
house_df$city <- as.factor(house_df$city)
house_df$country <- as.factor(house_df$country)
house_df$statezip <- as.factor(house_df$statezip)
### OHE on statezip column (house_df dataframe)
### create dataframe to be encoded
statezip_df <- data.frame(house_df$statezip)
colnames(statezip_df) <- "loc"
### create OHE dataframe
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))
### combine to original dataframe
house_df <- cbind(house_df, df2)
house_df$statezip <- NULL
### combine to numerical data for modeling
idx <- rownames(house_df_num)
house_df_num <- cbind(house_df_num, df2[idx, ] )
dim(house_df_num)
## [1] 4311 88
Number of columns is now 88. It means, we added 77 new features for location information using OHE in statezip.
Create regression model using multivariate linear regression algorithm. We will create two models: 1. without location feature 2. with location feature
## for reproducible result
set.seed(2021)
m <- nrow(house_df_num)
m_train <- m * 0.7
train_idx <- sample(m, m_train)
# train_idx[1:3]
train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]
## 4.3. Multivariate Linear Regression (two or more variables)
model.mlr1 <- lm(formula = price ~ sqft_living + bathrooms + view,
data = train_df)
model.mlr1
##
## Call:
## lm(formula = price ~ sqft_living + bathrooms + view, data = train_df)
##
## Coefficients:
## (Intercept) sqft_living bathrooms view
## 137968.7 150.8 17711.4 35283.6
# Without Location
model.mlr2 <- lm(formula = price ~ . ,
data = train_df[, 1:11])
model.mlr2
##
## Call:
## lm(formula = price ~ ., data = train_df[, 1:11])
##
## Coefficients:
## (Intercept) bedrooms bathrooms sqft_living sqft_lot
## -1.005e+04 -2.812e+04 1.491e+04 1.726e+02 -1.524e-01
## floors waterfront view condition sqft_above
## 4.468e+04 7.641e+04 3.078e+04 4.545e+04 -9.451e+00
## sqft_basement
## NA
# With Location
model.mlr3 <- lm(formula = price ~ . ,
data = train_df)
model.mlr3
##
## Call:
## lm(formula = price ~ ., data = train_df)
##
## Coefficients:
## (Intercept) bedrooms bathrooms sqft_living sqft_lot
## 3.079e+03 -8.555e+03 2.423e+04 8.759e+01 2.429e-01
## floors waterfront view condition sqft_above
## -2.884e+04 1.338e+05 3.576e+04 2.461e+04 7.897e+01
## sqft_basement loc.WA.98001 loc.WA.98002 loc.WA.98003 loc.WA.98004
## NA -9.982e+04 -1.095e+05 -9.539e+04 4.527e+05
## loc.WA.98005 loc.WA.98006 loc.WA.98007 loc.WA.98008 loc.WA.98010
## 2.436e+05 1.961e+05 2.202e+05 1.247e+05 1.019e+04
## loc.WA.98011 loc.WA.98014 loc.WA.98019 loc.WA.98022 loc.WA.98023
## 5.375e+04 1.696e+03 -2.265e+03 -1.112e+05 -1.137e+05
## loc.WA.98024 loc.WA.98027 loc.WA.98028 loc.WA.98029 loc.WA.98030
## 7.352e+04 9.424e+04 5.830e+04 1.664e+05 -8.976e+04
## loc.WA.98031 loc.WA.98032 loc.WA.98033 loc.WA.98034 loc.WA.98038
## -8.540e+04 -1.154e+05 2.423e+05 1.123e+05 -6.967e+04
## loc.WA.98039 loc.WA.98040 loc.WA.98042 loc.WA.98045 loc.WA.98047
## -3.787e+05 3.153e+05 -9.370e+04 -2.295e+03 -9.456e+04
## loc.WA.98050 loc.WA.98051 loc.WA.98052 loc.WA.98053 loc.WA.98055
## 3.396e+04 -1.068e+04 1.982e+05 1.604e+05 -5.200e+04
## loc.WA.98056 loc.WA.98057 loc.WA.98058 loc.WA.98059 loc.WA.98065
## 2.865e+04 -8.412e+04 -5.039e+04 1.179e+04 4.635e+04
## loc.WA.98068 loc.WA.98070 loc.WA.98072 loc.WA.98074 loc.WA.98075
## 1.591e+05 1.602e+04 1.095e+05 1.346e+05 1.739e+05
## loc.WA.98077 loc.WA.98092 loc.WA.98102 loc.WA.98103 loc.WA.98105
## 1.171e+05 -1.004e+05 3.287e+05 2.454e+05 3.598e+05
## loc.WA.98106 loc.WA.98107 loc.WA.98108 loc.WA.98109 loc.WA.98112
## 2.772e+04 2.426e+05 2.498e+04 4.301e+05 3.728e+05
## loc.WA.98115 loc.WA.98116 loc.WA.98117 loc.WA.98118 loc.WA.98119
## 2.426e+05 2.233e+05 2.224e+05 8.315e+04 3.298e+05
## loc.WA.98122 loc.WA.98125 loc.WA.98126 loc.WA.98133 loc.WA.98136
## 2.608e+05 8.297e+04 1.085e+05 5.955e+04 1.696e+05
## loc.WA.98144 loc.WA.98146 loc.WA.98148 loc.WA.98155 loc.WA.98166
## 1.689e+05 3.936e+04 -2.671e+04 3.573e+04 5.474e+03
## loc.WA.98168 loc.WA.98177 loc.WA.98178 loc.WA.98188 loc.WA.98198
## -5.839e+04 1.721e+05 -5.800e+04 -8.872e+04 -9.393e+04
## loc.WA.98199 loc.WA.98288 loc.WA.98354
## 2.741e+05 8.278e+03 NA