This report provides house price prediction using regression algorithms.
The dataset using in this report for modeling is a real house data in the US. It is publicly available at Kaggle. It can be downloaded here: https://www.kaggle.com/shree1992/housedata.
The report is structured as follows:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraries.
rm(list = ls())
library(ggplot2)
library(corrgram)
library(caret)
library(gridExtra)
library(dplyr)
Library ggplot: for graphic and visualization.
Library corrgram: for visualization of correlation coefficient.
Library caret: for One Hot Encoding.
Library gridExtra: for plotting multiple graphs. Library dplyr: for data manipulation.
Read house dataset from .csv file to R dataframe. Then, see the dataframe’s structure.
## read data to dataframe
house_df <- read.csv("../data/data.csv")
## structure of data
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset has 4600 observations (rows) and 18 variables (columns). The target variable is price and the remaining variables are features candidate.
Compute statistical summary of each variable.
## statistical summary
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
We can see the min, median, mean, and max values of each numeric variable.
It is interesting to see that the min value of price is zero. This could be an incorrect data.
We can also notice that the max value of price is statistically far away from med and 3rd quantile. This could be an outlier.
Plot distribution of price using boxplot.
# distribution of price
ggplot(data = house_df, aes(y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
Based on boxplot above, we can see that there are outliers in price.
Plot house price based on number of bedrooms.
house_df$bedrooms2 <- factor(house_df$bedrooms)
# Relationship between price and bedrooms
ggplot(data = house_df, aes(y = price,
x = bedrooms2)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher. These could be a special buildings such as factory, commercial space, meeting hall, sport center, etc.
Compute Pearson’s Correlation Coefficient (R) among all numerical variables. Then, visualize the result in a diagram using corrgram.
# extract numerical variable from dataframe
house_df_num <- house_df[ , 2:12]
# visualize Pearson's Correlation Coefficient (R)
corrgram(house_df_num,
upper.panel = panel.cor)
For target variable (price), the variables with high correlation in order are sqft_living (0.43), sqft_above (0.37), bathrooms (0.33).
Insight from EDA:
1. Incorrect data (price == 0)
2. There are outliers (significantly high price)
3. In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher.
4. Based on Pearson’s Correlation Coefficient, the variables with highest correlation with target (price) are sqft_living, sqft_above, bathrooms.
5. Location is an important feature to predict price. So, it is necessary to be included in the modeling.
Remove observation with incorrect price (price == 0)
house_df_num1 <- filter(house_df_num, price > 0)
out_price <- boxplot.stats(house_df_num1$price)$out
idx_out <- which( house_df_num1$price %in% c(out_price))
house_df_num2 <- house_df_num1[ -idx_out, ]
summary(house_df_num2$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 320000 450000 487457 615000 1150000
In Data Cleaning at this dataset, we decide to remove outliers from variable price and now, in summary we have a min value of price is USD 7,800 and max value of price is USD 1,150,000
Add location feature using One Hot Encoding in (zipcode) variable.
# OHE on Location
## 1. Create dataframe of location
idx <- rownames(house_df_num2)
location_df <- data.frame(house_df[idx,] $statezip)
colnames(location_df) <- "Location"
## 2. OHE dataframe
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame( predict(df1, newdata = location_df))
## 3. Merge with original data
house_df_num3 <- cbind(house_df_num2, df2)
dim(house_df_num3)
## [1] 4311 88
Number of columns is now 88. It means, we added 77 new features for location information using OHE in statezip
Summary of dataframe:
house_df_num : Raw data. Dimension: [4600,11]
house_df_num1 : Remove price == 0. Dimension: [4551,11]
house_df_num2 : Remove price == 0, then remove outliers. Dimension: [4311,11]
house_df_num3 : Remove price == 0, then remove outliers & OHE in statezip. Dimension: [4311,88]
get_train_idx <- function(m, ratio, seed_number){
m_train <- m * ratio
set.seed(seed_number)
train_idx <- sample(m, m_train)
return(train_idx);
}
ratio <- 0.7
seed_number <- 2021
# Raw data: house_df_num
train_idx <- get_train_idx(m = nrow(house_df_num), ratio, seed_number)
train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]
# Remove price == 0: house_df_num1
train_idx1 <- get_train_idx(m = nrow(house_df_num1), ratio, seed_number)
train_df1 <- house_df_num1[ train_idx1, ]
test_df1 <- house_df_num1[ -train_idx1, ]
# Remove outliers: house_df_num2
train_idx2 <- get_train_idx(m = nrow(house_df_num2), ratio, seed_number)
train_df2 <- house_df_num2[ train_idx2, ]
test_df2 <- house_df_num2[ -train_idx2, ]
# OHE in statezip: house_df_num3
train_idx3 <- get_train_idx(m = nrow(house_df_num3), ratio, seed_number)
train_df3 <- house_df_num3[ train_idx3, ]
test_df3 <- house_df_num3[ -train_idx3, ]
Now, we have four pairs of training and testing data.
Using Multivariate Linear Regression Model
model.mlr <- lm(formula = price ~ .,
data = train_df)
model.mlr1 <- lm(formula = price ~ .,
data = train_df1)
model.mlr2 <- lm(formula = price ~ .,
data = train_df2)
model.mlr3 <- lm(formula = price ~ .,
data = train_df3)
# Actual Values
actual <- test_df$price
actual1 <- test_df1$price
actual2 <- test_df2$price
actual3 <- test_df3$price
# Prediction Values
pred.mlr <- predict(model.mlr, test_df)
pred.mlr1 <- predict(model.mlr1, test_df1)
pred.mlr2 <- predict(model.mlr2, test_df2)
pred.mlr3 <- predict(model.mlr3, test_df3)
prediction_df <- data.frame(actual, pred.mlr)
prediction_df1 <- data.frame(actual1, pred.mlr1)
prediction_df2 <- data.frame(actual2, pred.mlr2)
prediction_df3 <- data.frame(actual3, pred.mlr3)
plot_mlr <- ggplot(data = prediction_df, aes(x = actual, y = pred.mlr)) +
geom_point() +
scale_x_continuous(limits = c(0, 2000000)) +
scale_y_continuous(limits = c(0, 2000000)) +
labs(title = "Raw Data")
plot_mlr1 <- ggplot(data = prediction_df1, aes(x = actual1, y = pred.mlr1)) +
geom_point() +
scale_x_continuous(limits = c(0, 2000000)) +
scale_y_continuous(limits = c(0, 2000000)) +
labs(title = "Cleaned Data(Remove price == 0)")
plot_mlr2 <- ggplot(data = prediction_df2, aes(x = actual2, y = pred.mlr2)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Cleaned Data(Remove outliers)")
plot_mlr3 <- ggplot(data = prediction_df3, aes(x = actual3, y = pred.mlr3)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Cleaned Data and OHE in Location")
grid.arrange(plot_mlr, plot_mlr1, plot_mlr2, plot_mlr3)
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
rmse <- rmse(pred.mlr, actual)
rmse1 <- rmse(pred.mlr1, actual1)
rmse2 <- rmse(pred.mlr2, actual2)
rmse3 <- rmse(pred.mlr3, actual3)
rmse_df <- data.frame(models = c("Raw Data",
"Remove Price == 0",
"Remove Outlier",
"OHE"),
rmse_values = c(rmse,
rmse1,
rmse2,
rmse3))
ggplot(rmse_df, aes(x = models,
y = rmse_values,
fill = models)) +
geom_bar(stat = "identity")