This report describes house price prediction using regression model. The dataset used in this project is the real estate data from kaggle. https://www.kaggle.com/shree1992/housedata
Report Outline:
1. Data Extraction
2. Exploratory Data Analysis
3. Data Preparation
4. Modelling
5. Evaluation
6. Recommendation
Extract data in csv format into dataframe in R.
house_df <- read.csv("data.csv")
See the structure of dataframe. There are 4600 observation and 18 variables.
str(house_df)
## 'data.frame': 4600 obs. of 18 variables:
## $ date : chr "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" "2014-05-02 00:00:00" ...
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement: int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ yr_renovated : int 2005 0 0 0 1992 1994 0 0 0 2010 ...
## $ street : chr "18810 Densmore Ave N" "709 W Blaine St" "26206-26214 143rd Ave SE" "857 170th Pl NE" ...
## $ city : chr "Shoreline" "Seattle" "Kent" "Bellevue" ...
## $ statezip : chr "WA 98133" "WA 98119" "WA 98042" "WA 98008" ...
## $ country : chr "USA" "USA" "USA" "USA" ...
Calculate statistical summary of numerical variables.
summary(house_df)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
Price is the target variable. The mean value is 551,963. There is observation with price == 0. There are possibly outliers as the maximum price value is significantly higher than the average.
ggplot(house_df, aes(y = price)) +
geom_boxplot() +
ylim(0,2000000)
The target variable (Price) has outliers and invalid values (Price==0). This should be handled or cleaned in the data preparation step.
house_df$bedrooms2 <- as.factor(house_df$bedrooms)
ggplot(data = house_df, aes(x = bedrooms2, y = price)) +
geom_boxplot()+
ylim(0,2000000)
## Warning: Removed 47 rows containing non-finite values (stat_boxplot).
in general, the higher number of bedrooms the higher price of the house.
Compute and visualize Pearson’s Correlation Coefficient (R) between numerical variables.
house_df_num <- house_df[ ,c("price","bedrooms","bathrooms","sqft_living",
"sqft_lot","floors","waterfront","view","condition",
"sqft_above","sqft_basement")]
library(corrgram)
corrgram(house_df_num,
order =TRUE,
upper.panel = panel.pie)
Based on diagram above, sqft_living has a high correlation with sqft_above and bathrooms. Variable sqft_living, sqft_above, and bathrooms have the highest correlation with the target variable (price).
One Hot Encoding
add location variable (statezip) and use OHE to convert categorical into numerical variables.
### Create Dataframe of Location (statezip)
location_df <- data.frame(house_df$statezip)
colnames(location_df) <- c("loc.")
### OHE on Location Dataframe
library(caret)
df1 <- dummyVars("~.", data = location_df)
df2 <- data.frame(predict(df1, newdata = location_df))
## Combine to Original Dataframe
house_df_num <- cbind(house_df_num, df2)
dim(house_df_num)
## [1] 4600 88
The current dataframe has 88 column (87 features and 1 target)
Remove Observation with price==0
# get row indices that price > 0
idx <- house_df_num$price > 0
# get dataframe
house_df_num <- house_df_num[idx, ]
Remove Outliers
out_price <- boxplot.stats(house_df_num$price)$out
# get rows idx of outliers value
out_idx <- which(house_df_num$price %in% c(out_price))
# remove rows with outliers or get rows without outliers
house_df_num <- house_df_num[ -out_idx, ]
set.seed(2021) # agar data tidak berubah
m <- nrow(house_df)
train_idx <- sample(m, 0.7*m)
train_df <- house_df_num[ train_idx, ]
test_df <- house_df_num[ -train_idx, ]
Use multivariate linear regression algorithm to predict house price based on all extracted features
# without location data
fit.mlr1 <- lm(formula = price ~ ., data = train_df[, 1:11])
# with location data
fit.mlr2 <- lm(formula = price ~ ., data = train_df)
we trained two models. First without location data and second with location data.
To evaluate our models, we use Root Mean Square Error (RMSE) and Pearson's Correlation Coeficient (R) as performance metrics.
performance <- function(prediction, actual, method){
e <- prediction - actual
se <- e^2
sse <- sum(se)
mse <- mean(se)
rmse <- sqrt(mse)
r <- cor(prediction,actual)
result <- paste("==Method Name:",method,
"\nRoot Mean Square Error (RMSE) = ", round(rmse,2),
"\nCorrelation Coefficient (R) =", round(r,2) )
cat(result)
}
performance of Model 1 (without location data)
pred.mlr1 <- predict(fit.mlr1, test_df)
performance(pred.mlr1, test_df$price, "Model 1")
## ==Method Name: Model 1
## Root Mean Square Error (RMSE) = 164896.24
## Correlation Coefficient (R) = 0.61
performance of Model 2 (with location data)
pred.mlr2 <- predict(fit.mlr2, test_df)
performance(pred.mlr2, test_df$price, "Model 2")
## ==Method Name: Model 2
## Root Mean Square Error (RMSE) = 103961.28
## Correlation Coefficient (R) = 0.87
RMSE is significantly reduced and R is significantly increased using location data in model 2. This mean location data is very important feature to predict house price.
Visualize actual price vs predicted price in Model 2
actual <- test_df$price
prediction_df <- data.frame(actual,pred.mlr1, pred.mlr2)
p<- ggplot(data = prediction_df,aes(x = actual, y = pred.mlr2)) + geom_point()+
scale_x_continuous(breaks = c(500000, 1000000, 1500000),
labels = c("$500K", "$1M", "$1.5M"),
limits = c(0,1500000))+
scale_y_continuous(breaks = c(500000, 1000000, 1500000),
labels = c("$500K", "$1M", "$1.5M"),
limits = c(0,1500000)) +
labs(title = "Multivariate Linear Regression",
x = "Actual Price",
y = "Prediction Price")
p
Most of the points are in the diagonal. This means the predicted price are close to actual price.
Location data is very important to predict house price . It significantly improve prediction performance. So, this data should always be included in modelling.
One Hot Encoding is an effective method to convert categorical variable to numerical variables. In this case, location data (statezip).
The model is good enough to be deployed. However, there are still some room of improvements.