This report provides house price prediction using regression algorithm. The dataset used in this report for modelling is House Data in Australia.
The dataset is hoseted in Kabble and can be downloaded here.
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries
rm(list = ls())
library(ggplot2)
Import house dataset and see its structure
# read data
house_df <- read.csv("data/house.csv")
# structure of dataframe
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset has 4600 obeservations and 18 variables. The target variables is price and the remaining variables are features.
# data dimension
d <- dim(house_df)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns
# statistical summary
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
We can see minimum, median, mean, and maximum values of each numeric variables. It is interesting that the minimum values of price is zero. This could be an incorrect data.
We can also notice that the maximum values of price is statistically far away from median and third quantile. This could be an outliers.
Plot distribution of price using boxplot.
ggplot(data = house_df, aes(y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
Based on boxplot above, we can see that there are outliers in price.
For plotting purposes, some numerical and char variables are transformed to factor.
## casting to factor
house_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$city2 <- factor(house_df$city)
house_df$statezip2 <- factor(house_df$statezip)
house_df$street2 <- factor(house_df$street)
house_df$country2 <- factor(house_df$country)
Plot house price based on number of bedrooms.
ggplot(data = house_df, aes(x = bedrooms2,
y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
Based on price by number of bedrooms, we can see the following:
1. In general, the higher number of bedrooms, the higher the price.
2. It is interesting that house with number of bedrooms == 0, the house price is significantly high. It could be a special house as meeting hall, religious building, sport center, etc.
Compute correlation coefficient (R) among all numeric variables.
Visualize Correlation Coefficient in a diagram.
house_df_num <- house_df[ , 2:12]
r <- cor(house_df_num)
library(corrgram)
corrgram(house_df_num, order = TRUE,
upper.panel = panel.pie)
Several variables are highly correlated.
For example: sqft_living and sqft_above.
For target variable (price), the variables with high correlation in order are sqft_living, sqft_above, bathroom, and bedroom.
Remove rows with incorrect values in price
# get indices
idx <- which(house_df_num$price %in% c(0))
# remove those rows
house_df_num <- house_df_num[-idx, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 326264 465000 557906 657500 26590000
The minimum value on price is now not zero, but 7800.
Remove rows outliers in price
# get outliers values
out_price <- boxplot.stats(house_df_num$price)$out
# get indices
out_idx <- which(house_df_num$price %in% c(out_price))
# remove those rows
house_df_num <- house_df_num[ -out_idx, ]
summary(house_df_num$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 320000 450000 487457 615000 1150000
The maximum value on price is now 1150000.
dim(house_df_num)
## [1] 4311 11
Number of observation is now 4311.
It means the data cleaning process has removed 289 rows.
One Hot Encoding for Location Features (using statezip column)
### Add location features
### OHE on statezip
#### remove rows as in house_df_num
house_df <- house_df[ rownames(house_df_num), ]
#### 1. Create dataframe for statezip (available on house_df)
statezip <- house_df$statezip
statezip_df <- data.frame(statezip)
colnames(statezip_df) <- c("loc.")
#### 2. OHE on statezip dataframe
library(caret)
df1 <- dummyVars("~.", data = statezip_df)
df2 <- data.frame(predict(df1, newdata = statezip_df))
#### 3. Combine to house_df_num dataframe
#### 11 + 77 = 88 columns
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4311 88
Number of features (columns) is no 88.
It means that we have added 77 new features for location information using OHE in statezip.
Randomly divided the dataset into training and testing with ratio = 70:30.
For reproducible result, it is necessary to set the seed.
# data dimension
d <- dim(house_df_num)
m <- d[1] # m: number of rows
n <- d[2] # n: number of columns
## ratio train:test = 70:30
## randomly divided
set.seed(2021)
train_idx <- sample(m, 0.7 * m)
train_idx[1:3]
## [1] 2214 442 743
train_data <- house_df_num[ train_idx , ]
test_data <- house_df_num[ -train_idx , ]
dim(train_data)
## [1] 3017 88
dim(test_data)
## [1] 1294 88
The train data has 3017 rows and test data hs 1294 rows.
Create regression model.
mymodel <- lm(formula = price ~ . +
I(sqft_living^2) +
sqft_living:bedrooms,
data = train_data)
summary(mymodel)
##
## Call:
## lm(formula = price ~ . + I(sqft_living^2) + sqft_living:bedrooms,
## data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -872020 -52263 -1836 48176 520918
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.000e+05 1.037e+05 -0.964 0.335018
## bedrooms 1.127e+04 7.435e+03 1.516 0.129518
## bathrooms 2.191e+04 4.317e+03 5.075 4.11e-07 ***
## sqft_living 1.702e+02 1.137e+01 14.960 < 2e-16 ***
## sqft_lot 2.743e-01 5.575e-02 4.920 9.13e-07 ***
## floors -3.338e+04 5.251e+03 -6.356 2.39e-10 ***
## waterfront 1.354e+05 3.682e+04 3.678 0.000239 ***
## view 3.672e+04 3.262e+03 11.257 < 2e-16 ***
## condition 2.308e+04 3.043e+03 7.585 4.44e-14 ***
## sqft_above 8.334e+01 6.353e+00 13.117 < 2e-16 ***
## sqft_basement NA NA NA NA
## loc.WA.98001 -1.046e+05 1.026e+05 -1.020 0.307959
## loc.WA.98002 -1.127e+05 1.037e+05 -1.087 0.277266
## loc.WA.98003 -9.356e+04 1.030e+05 -0.908 0.363934
## loc.WA.98004 4.455e+05 1.036e+05 4.299 1.77e-05 ***
## loc.WA.98005 2.412e+05 1.042e+05 2.315 0.020657 *
## loc.WA.98006 1.915e+05 1.023e+05 1.872 0.061294 .
## loc.WA.98007 2.161e+05 1.036e+05 2.086 0.037080 *
## loc.WA.98008 1.237e+05 1.030e+05 1.201 0.229827
## loc.WA.98010 5.043e+03 1.112e+05 0.045 0.963840
## loc.WA.98011 4.492e+04 1.037e+05 0.433 0.664739
## loc.WA.98014 5.046e+03 1.052e+05 0.048 0.961734
## loc.WA.98019 -8.018e+03 1.033e+05 -0.078 0.938134
## loc.WA.98022 -1.214e+05 1.040e+05 -1.167 0.243275
## loc.WA.98023 -1.179e+05 1.022e+05 -1.153 0.248916
## loc.WA.98024 6.126e+04 1.097e+05 0.559 0.576464
## loc.WA.98027 9.721e+04 1.025e+05 0.949 0.342795
## loc.WA.98028 5.189e+04 1.026e+05 0.506 0.612964
## loc.WA.98029 1.634e+05 1.024e+05 1.595 0.110740
## loc.WA.98030 -9.482e+04 1.034e+05 -0.917 0.359028
## loc.WA.98031 -9.087e+04 1.027e+05 -0.885 0.376375
## loc.WA.98032 -1.182e+05 1.051e+05 -1.125 0.260848
## loc.WA.98033 2.388e+05 1.024e+05 2.331 0.019809 *
## loc.WA.98034 1.069e+05 1.024e+05 1.044 0.296346
## loc.WA.98038 -7.578e+04 1.023e+05 -0.741 0.458861
## loc.WA.98039 -3.825e+05 1.436e+05 -2.663 0.007786 **
## loc.WA.98040 3.101e+05 1.030e+05 3.009 0.002641 **
## loc.WA.98042 -9.962e+04 1.022e+05 -0.975 0.329789
## loc.WA.98045 -6.168e+03 1.029e+05 -0.060 0.952219
## loc.WA.98047 -9.415e+04 1.112e+05 -0.847 0.397284
## loc.WA.98050 8.550e+03 1.441e+05 0.059 0.952686
## loc.WA.98051 -2.062e+04 1.116e+05 -0.185 0.853467
## loc.WA.98052 1.918e+05 1.020e+05 1.879 0.060306 .
## loc.WA.98053 1.514e+05 1.023e+05 1.480 0.138858
## loc.WA.98055 -5.428e+04 1.043e+05 -0.520 0.602784
## loc.WA.98056 2.489e+04 1.024e+05 0.243 0.808066
## loc.WA.98057 -7.989e+04 1.060e+05 -0.753 0.451264
## loc.WA.98058 -5.394e+04 1.023e+05 -0.527 0.597992
## loc.WA.98059 5.430e+03 1.022e+05 0.053 0.957646
## loc.WA.98065 4.819e+04 1.027e+05 0.469 0.639102
## loc.WA.98068 1.495e+05 1.437e+05 1.041 0.298156
## loc.WA.98070 7.905e+03 1.044e+05 0.076 0.939640
## loc.WA.98072 1.012e+05 1.027e+05 0.985 0.324596
## loc.WA.98074 1.267e+05 1.023e+05 1.239 0.215417
## loc.WA.98075 1.705e+05 1.025e+05 1.663 0.096330 .
## loc.WA.98077 1.219e+05 1.032e+05 1.181 0.237887
## loc.WA.98092 -1.075e+05 1.024e+05 -1.050 0.293706
## loc.WA.98102 3.308e+05 1.050e+05 3.150 0.001648 **
## loc.WA.98103 2.502e+05 1.021e+05 2.450 0.014339 *
## loc.WA.98105 3.583e+05 1.039e+05 3.447 0.000575 ***
## loc.WA.98106 3.489e+04 1.027e+05 0.340 0.734118
## loc.WA.98107 2.486e+05 1.027e+05 2.421 0.015551 *
## loc.WA.98108 2.727e+04 1.032e+05 0.264 0.791644
## loc.WA.98109 4.323e+05 1.066e+05 4.055 5.13e-05 ***
## loc.WA.98112 3.731e+05 1.031e+05 3.619 0.000301 ***
## loc.WA.98115 2.409e+05 1.021e+05 2.360 0.018358 *
## loc.WA.98116 2.219e+05 1.027e+05 2.160 0.030848 *
## loc.WA.98117 2.241e+05 1.021e+05 2.195 0.028250 *
## loc.WA.98118 8.617e+04 1.025e+05 0.841 0.400645
## loc.WA.98119 3.310e+05 1.034e+05 3.200 0.001389 **
## loc.WA.98122 2.650e+05 1.028e+05 2.578 0.009984 **
## loc.WA.98125 8.670e+04 1.024e+05 0.847 0.397056
## loc.WA.98126 1.142e+05 1.024e+05 1.115 0.264930
## loc.WA.98133 6.011e+04 1.023e+05 0.588 0.556720
## loc.WA.98136 1.694e+05 1.029e+05 1.646 0.099898 .
## loc.WA.98144 1.748e+05 1.025e+05 1.706 0.088124 .
## loc.WA.98146 4.353e+04 1.029e+05 0.423 0.672173
## loc.WA.98148 -3.307e+04 1.085e+05 -0.305 0.760587
## loc.WA.98155 3.714e+04 1.024e+05 0.363 0.716785
## loc.WA.98166 9.244e+02 1.027e+05 0.009 0.992821
## loc.WA.98168 -5.496e+04 1.026e+05 -0.535 0.592420
## loc.WA.98177 1.633e+05 1.029e+05 1.587 0.112601
## loc.WA.98178 -5.612e+04 1.032e+05 -0.544 0.586724
## loc.WA.98188 -9.531e+04 1.049e+05 -0.909 0.363523
## loc.WA.98198 -9.769e+04 1.028e+05 -0.950 0.341938
## loc.WA.98199 2.732e+05 1.030e+05 2.653 0.008022 **
## loc.WA.98288 4.083e+04 1.438e+05 0.284 0.776479
## loc.WA.98354 NA NA NA NA
## I(sqft_living^2) -8.753e-03 2.508e-03 -3.490 0.000490 ***
## bedrooms:sqft_living -1.086e+01 3.057e+00 -3.553 0.000387 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 101500 on 2929 degrees of freedom
## Multiple R-squared: 0.7877, Adjusted R-squared: 0.7814
## F-statistic: 124.9 on 87 and 2929 DF, p-value: < 2.2e-16
We can see that some statezip (location) have significant influence on price.
This means location is important to predict house price.
Predict Price
## actual price
actual <- test_data$price
## predicted price
pred.mymodel <- predict(mymodel, test_data)
## Warning in predict.lm(mymodel, test_data): prediction from a rank-deficient fit
## may be misleading
## plot actual vs predicted
price_df <- data.frame(actual, pred.mymodel)
Plot Actual vs. Predicted Price
ggplot(data = price_df, aes(x = actual, y = pred.mymodel)) +
geom_point() +
scale_x_continuous(limits = c(0,1500000)) +
scale_y_continuous(limits = c(0,1500000))
Mostly the points are located in the diagonal.
It means that the predicted values are close to the actual values.
However, there are still some points that are relatively far from diagonal.
Measure RMSE (root mean squared error) and R (coefficient relation)
## performance evaluation
performance <- function(prediction, actual, method){
e <- prediction - actual # error
se <- e^2 # squared error
sse <- sum(se) # sum of squared error
mse <- mean(se) # mean squared error
rmse <- sqrt(mse) # root mean squared error
r <- cor(prediction, actual) # correlation coefficient
result <- paste("*** Method: ", method,
"\nRMSE = ", round(rmse,3),
"\nR = ", round(r,3),
"\n")
cat(result)
}
performance(pred.mymodel, actual, "My Model")
## *** Method: My Model
## RMSE = 97428.002
## R = 0.889
The correlation coefficient is very high.
The error is terms of RMSE is significantly reduced to lower 100000.
This means we made huge improvement in modelling.
(Is the model good enough or not ?)
1. The prediction model is good enough to predict the price for mainstream houses.
(Can we deploy the current model or do we still need to improve it ?)
2. It is ready for deployment.
However, there are still room of improvement in the future.
For example, house with bedroom == 0 and house with significantly expensive price, maybe design a dedicated model.
(What is the important variables or any useful insight from the report ?)
3. Location is an important feature generated from one of the column in the dataset (statezip).
There could be other important feature that can be generated and make significant influence.