This report provides house price prediction using regression algorithms.
The dataset using in this report for modeling is real house data in the USA. It is publicaly available at kaggle. It can be downloades here: https://www.kaggle.com/shree1992/housedata
This report is stuctured as follows:
1. Data Exraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modeling
5. Evaluation
6. Recommendation
Import necessary libraryes.
rm(list = ls())
library(ggplot2)
library(corrgram)
library(caret)
library(gridExtra)
library(dplyr)
Library ggplot : for graphic and visualization.
Library corrgram : for visualizatoion of correlation coeficient.
Library caret : for One Hot Encoding.
Library gridExtra : for plotting multiple graphs.
Read house dataset from .csv file to R dataframe. Then see the dataframe’s structure
## read data to dataframe
house_df <- read.csv("../data/data.csv")
## structure of data
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
The dataset has 4600 observation (rows) and 18 variables (columns). The target variables is price and the reamining variables are features candidate.
Compute statistical summary of each variables.
## statistical summary
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
We can see the min, median, and max values of each numeric variable.
It is intersting to see that the min value of price is zero. This could be an incorrect data.
We can also nitice tha max value of price statisticaly far away from med and 3rd quantile. This could be an Outlier.
price using boxplot# distribution of price
ggplot(data = house_df, aes(y = price)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
There are outliers (significantly high price).
Normaly house price is around 300000 up to 700000
bedrooms# distribution of bedrooms
ggplot(data = house_df, aes(x = bedrooms)) +
geom_bar()
Based on barchart above, we can see the house distribution by bedrooms.
According to data, at most house tahat have 3 bedrooms
price based on number of bedrooms using boxplothouse_df$bedrooms2 <- factor(house_df$bedrooms)
house_df$bathrooms2 <- factor(house_df$bathrooms)
# Relationship between price and bedrooms
ggplot(data = house_df, aes(y = price,
x = bedrooms2)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher. These could be a sprcial buildings such as factory, commercial space, meeing hall, sport center, etc.
price based on number of bathrooms using boxplot# Relationship between price and bathrooms
ggplot(data = house_df, aes(y = price,
x = bathrooms2)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 2000000))
In general, the higher number of bathrooms, the higher the price. However, price for house with bathrooms == 0 is significantly higher. These could be a sprcial buildings such as factory, commercial space, meeing hall, sport center, etc.
corrgram.### extract numerical variable from dataframe
house_df_num <- house_df[ , 2:12]
### visualize pearson's Correlation Coefficient (R)
corrgram(house_df_num,
upper.panel = panel.cor)
For target variable (price), the variables with high correlation in order are sqft_living (0.43), sqft_above (0.37), bathrooms (0.33)
Insight from EDA:
1. Incorrect data (price == 0),
2. There are outliers (significantly high price)
3. In general, the higher number of bedrooms, the higher the price. However, price for house with bedrooms == 0 is significantly higher.
4. Based on Pearson’s Correlation Coefficient, the variables with highest correlation with target (price) are sqft_living, sqft_above, bathrooms.
5. Location is an important feature to predict price. So, it is necessary to be included in the modeling.
price (price == 0)### Remove rows with price == 0
#opsi pertama menggunakan filter
house_df_num1 <- filter(house_df_num, price > 0)
summary(house_df_num1$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 326264 465000 557906 657500 26590000
dim(house_df_num1)
## [1] 4551 11
The min value of price is now 7800, number of observation is 4551
price### Remove rows with outliers in price
out_price <- boxplot.stats(house_df_num1$price)$out
idx_out <- which(house_df_num1$price %in% c(out_price))
house_df_num2 <- house_df_num1[-idx_out, ]
summary(house_df_num2$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7800 320000 450000 487457 615000 1150000
dim(house_df_num2)
## [1] 4311 11
The max value of price is now $1,150,000, number of observation is 4311
statezip). OHE on locationAdd location feature using One Hot Encoding (statezip) variable
## 1. Create dataframe of location
idx <- rownames(house_df_num2)
location_df <- data.frame(house_df[idx, ]$statezip)
colnames(location_df) <- "location"
## 2. OHE dataframe
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))
## 3. Marge with original data
house_df_num3 <- cbind(house_df_num2,df2)
dim(house_df_num3)
## [1] 4311 88
Number of columns is now 88. It means, we added 77 new features for location information using OHE in statezip.
Summary of dataframe:
- house_df_num : raw data. dimension: [4600, 11]
- house_df_num1 : remove price == 0. dimension: [4551, 11]
- house_df_num2 : remove price == 0, remove outliers in price. dimension: [4311, 11]
- house_df_num3 : remove price == 0, remove outliers in price, OHE in statezip. dimension: [4311, 88]
get_train_idx <- function(m, ratio, seed_number){
# input
# m: number of observation
# ratio: percentage of tarining data
# seed_number: reproducible restult
m_train <- ratio * m # number of observations in training data
set.seed(seed_number)
train_idx <- sample(m, m_train)
# output:
# tain_idx: train index
return(train_idx)
}
ratio <- 0.7
seed_number <- 2021
# raw data. dimension: `[4600, 11]`
train_idx <- get_train_idx(m = nrow(house_df_num), ratio, seed_number)
train_df <- house_df_num[train_idx, ]
test_df <- house_df_num[-train_idx, ]
# remove `price == 0`. dimension: `[4551, 11]`
train_idx1 <- get_train_idx(m = nrow(house_df_num1), ratio, seed_number)
train_df1 <- house_df_num1[train_idx1, ]
test_df1 <- house_df_num1[-train_idx1, ]
# remove `price == 0`, remove outliers in `price`. dimension: `[4311, 11]`
train_idx2 <- get_train_idx(m = nrow(house_df_num2), ratio, seed_number)
train_df2 <- house_df_num2[train_idx2, ]
test_df2 <- house_df_num2[-train_idx2, ]
# remove `price == 0`, remove outliers in `price`, OHE in `statezip`. dimension: `[4311, 88]`
train_idx3 <- get_train_idx(m = nrow(house_df_num3), ratio, seed_number)
train_df3 <- house_df_num3[train_idx3, ]
test_df3 <- house_df_num3[-train_idx3, ]
Now, we have four pairs of training ang testing dat.
Using Multivariate Linear Regression with different training data.
## Model: Multivariate Linear Regression (from 2 to n variables)
### using all numerical variables raw data
model.mlr <- lm(formula = price ~ .,
data = train_df)
## Model: Multivariate Linear Regression (from 2 to n variables)
### using all numerical variables price > 0
model.mlr1 <- lm(formula = price ~ .,
data = train_df1)
## Model: Multivariate Linear Regression (from 2 to n variables)
### using all numerical variables no outliers
model.mlr2 <- lm(formula = price ~ .,
data = train_df2)
## Model: Multivariate Linear Regression (from 2 to n variables)
### using all numerical variables clean and OHE
model.mlr3 <- lm(formula = price ~ .,
data = train_df3)
Actual value raw data
actual <- test_df$price
Actual value clean data
actual1 <- test_df1$price
actual2 <- test_df2$price
Actual value clean data and OHE
actual3 <- test_df3$price
Predict actual and predicted value with raw data
pred.mlr <- predict(model.mlr, test_df) # x = all
Predict actual and predicted value with clean data
pred.mlr1 <- predict(model.mlr1, test_df1) # x = all
pred.mlr2 <- predict(model.mlr2, test_df2) # x = all
Predict actual and predicted value with clean data and OHE
pred.mlr3 <- predict(model.mlr3, test_df3) # x = all
actual vs predicted values with raw data
## actual vs predicted values
prediction_df <- data.frame(actual, pred.mlr)
actual vs predicted values with clean data
## actual vs predicted values
prediction_df1 <- data.frame(actual1, pred.mlr1)
## actual vs predicted values
prediction_df2 <- data.frame(actual2, pred.mlr2)
actual vs predicted values with clean data and OHE
## actual vs predicted values
prediction_df3 <- data.frame(actual3, pred.mlr3)
Visualise actual vs predicted values raw data
plot_mlr <- ggplot(data = prediction_df, aes(x = actual, y = pred.mlr)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Raw Data")
Visualise actual vs predicted values clean data
plot_mlr1 <- ggplot(data = prediction_df1, aes(x = actual1, y = pred.mlr1)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Remove price = 0")
plot_mlr2 <- ggplot(data = prediction_df2, aes(x = actual2, y = pred.mlr2)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Remove Outliers")
Visualise actual vs predicted values clean data and OHE
plot_mlr3 <- ggplot(data = prediction_df3, aes(x = actual3, y = pred.mlr3)) +
geom_point() +
scale_x_continuous(limits = c(0, 1200000)) +
scale_y_continuous(limits = c(0, 1200000)) +
labs(title = "Using OHE on Location")
grid.arrange(plot_mlr, plot_mlr1, plot_mlr2, plot_mlr3)
library(Metrics)
rmse <- rmse(pred.mlr, actual)
rmse1 <- rmse(pred.mlr1, actual1)
rmse2 <- rmse(pred.mlr2, actual2)
rmse3 <- rmse(pred.mlr3, actual3)
rmse
## [1] 268352.1
rmse1
## [1] 239129.9
rmse2
## [1] 158561.4
rmse3
## [1] 102922
rmse_df <- data.frame(models = c("Raw Data",
"Remove Proce == 0",
"Remove Outliers",
"OHE"),
rmse_values = c(rmse,
rmse1,
rmse2,
rmse3))
ggplot(rmse_df, aes(x = models,
y = rmse_values,
fill = models)) +
geom_bar(stat = "identity")