1 Intro

The following is an analysis of the real estate dataset. In this analysis, we will predict the price of real estate using linear regression analysis.

1.1 Load Library and Dataset

library(dplyr)
library(ggplot2)
library(GGally)
library(MLmetrics)
library(performance)
library(lmtest)
library(car)
real_estate <- read.csv("data/Real_estate.csv")

2 Exploratory Data Analysis

First, we look at the first 10 rows of the dataset.

head(real_estate,10)

Then, we rename the columns to make them easier to read.

real_estate <- real_estate %>% 
  rename(transaction_date = X1.transaction.date,
         house_age = X2.house.age,
         distance_to_the_nearest_MRT_station = X3.distance.to.the.nearest.MRT.station,
         number_of_convenience_stores = X4.number.of.convenience.stores,
         latitude = X5.latitude,
         longitude = X6.longitude,
         house_price_of_unit_area = Y.house.price.of.unit.area)

Next, we check the data type of every column and describe what each column contains.

str(real_estate)
## 'data.frame':    414 obs. of  8 variables:
##  $ No                                 : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ transaction_date                   : num  2013 2013 2014 2014 2013 ...
##  $ house_age                          : num  32 19.5 13.3 13.3 5 7.1 34.5 20.3 31.7 17.9 ...
##  $ distance_to_the_nearest_MRT_station: num  84.9 306.6 562 562 390.6 ...
##  $ number_of_convenience_stores       : int  10 9 5 5 5 3 7 6 1 3 ...
##  $ latitude                           : num  25 25 25 25 25 ...
##  $ longitude                          : num  122 122 122 122 122 ...
##  $ house_price_of_unit_area           : num  37.9 42.2 47.3 54.8 43.1 32.1 40.3 46.7 18.8 22.1 ...

The following is a description of each column:

No : Record number
transaction_date : Date of transaction
house_age : Age of house
distance_to_the_nearest_MRT_station : Distance from house to nearest MRT station
number_of_convenience_stores : Number of convenience stores
latitude : Latitude of house
longitude : Longitude of house
house_price_of_unit_area : House price of unit area

The data types contained in each column are appropriate.

After that, we count the missing values in each column.

colSums(is.na(real_estate))
##                                  No                    transaction_date 
##                                   0                                   0 
##                           house_age distance_to_the_nearest_MRT_station 
##                                   0                                   0 
##        number_of_convenience_stores                            latitude 
##                                   0                                   0 
##                           longitude            house_price_of_unit_area 
##                                   0                                   0

The data has no missing values. Next, we keep only the columns needed for the analysis by dropping No and transaction_date.

real_estate <- real_estate %>% select(-No,-transaction_date)

Next, we look at the pairwise correlations between the remaining columns.

ggcorr(real_estate, label = TRUE)

The correlation plot shows that longitude, latitude, and number_of_convenience_stores are positively correlated with house_price_of_unit_area, while distance_to_the_nearest_MRT_station and house_age are negatively correlated with it.
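
The numbers behind a plot like this can also be inspected directly. The following is a minimal sketch, assuming a simulated stand-in for real_estate (the data frame sim and all of its values are made up for illustration and are not the real dataset). It computes the Pearson correlation matrix with cor() and sorts the correlations with the target.

```r
# Simulated stand-in for real_estate; the values are made up for illustration.
set.seed(42)
n <- 200
sim <- data.frame(
  distance_to_the_nearest_MRT_station = runif(n, 50, 5000),
  number_of_convenience_stores = rpois(n, 4)
)
sim$house_price_of_unit_area <- 45 -
  0.005 * sim$distance_to_the_nearest_MRT_station +
  2 * sim$number_of_convenience_stores +
  rnorm(n, sd = 5)

# Pearson correlation matrix of all columns.
cor_matrix <- cor(sim)

# Correlation of every column with the target, sorted from lowest to highest.
cor_with_price <- sort(cor_matrix[, "house_price_of_unit_area"])
print(round(cor_with_price, 2))
```

Sorting the target column of the matrix makes it easy to see which predictors have the strongest negative and positive relationships, which is harder to read from the colors of a plot alone.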

3 Linear Regression Modeling

In the next stage, we fit a linear regression model with number_of_convenience_stores as the only predictor, because this variable has the highest positive correlation with the target variable house_price_of_unit_area.

model_one_pred <- lm(formula = house_price_of_unit_area~number_of_convenience_stores, data = real_estate) # Model with one predictor
summary(model_one_pred)
## 
## Call:
## lm(formula = house_price_of_unit_area ~ number_of_convenience_stores, 
##     data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407  -7.341  -1.788   5.984  87.681 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   27.1811     0.9419   28.86   <2e-16 ***
## number_of_convenience_stores   2.6377     0.1868   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3244 
## F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.3244, so this single predictor explains only about 32% of the variation in house_price_of_unit_area.
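
As a quick check on how the adjusted R-squared relates to the multiple R-squared, the sketch below applies the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1) to the values reported in the summary above (R² = 0.326, n = 414 observations, p = 1 predictor). The helper name adjusted_r_squared is introduced here for illustration only.

```r
# Adjusted R-squared from R-squared, number of observations and number of predictors.
# adjusted_r_squared is a hypothetical helper written for this illustration.
adjusted_r_squared <- function(r_squared, n_obs, n_predictors) {
  1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)
}

# Values taken from summary(model_one_pred): R-squared 0.326, 414 rows, 1 predictor.
adj_one_pred <- adjusted_r_squared(r_squared = 0.326, n_obs = 414, n_predictors = 1)
print(round(adj_one_pred, 4))
```

Because the input R-squared is itself rounded, the result agrees with the reported 0.3244 only up to rounding. The penalty grows with the number of predictors, which is why adjusted R-squared is the fairer figure when comparing models of different sizes.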

Next, we select the predictor variables automatically using stepwise regression with the backward-elimination method. First, we have to create a model with all predictor variables.

model_all_pred <- lm(formula = house_price_of_unit_area~., data = real_estate)
summary(model_all_pred)
## 
## Call:
## lm(formula = house_price_of_unit_area ~ ., data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.546  -5.267  -1.600   4.247  76.372 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -4.946e+03  6.211e+03  -0.796    0.426    
## house_age                           -2.689e-01  3.900e-02  -6.896 2.04e-11 ***
## distance_to_the_nearest_MRT_station -4.259e-03  7.233e-04  -5.888 8.17e-09 ***
## number_of_convenience_stores         1.163e+00  1.902e-01   6.114 2.27e-09 ***
## latitude                             2.378e+02  4.495e+01   5.290 2.00e-07 ***
## longitude                           -7.805e+00  4.915e+01  -0.159    0.874    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.965 on 408 degrees of freedom
## Multiple R-squared:  0.5712, Adjusted R-squared:  0.5659 
## F-statistic: 108.7 on 5 and 408 DF,  p-value: < 2.2e-16

Then we run the backward elimination and store the result in the model_backward object.

model_backward <- step(model_all_pred, direction = "backward", trace = 0)
summary(model_backward)
## 
## Call:
## lm(formula = house_price_of_unit_area ~ house_age + distance_to_the_nearest_MRT_station + 
##     number_of_convenience_stores + latitude, data = real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.522  -5.292  -1.579   4.264  76.466 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -5.916e+03  1.113e+03  -5.317 1.74e-07 ***
## house_age                           -2.687e-01  3.893e-02  -6.903 1.95e-11 ***
## distance_to_the_nearest_MRT_station -4.175e-03  4.928e-04  -8.473 4.37e-16 ***
## number_of_convenience_stores         1.165e+00  1.897e-01   6.141 1.94e-09 ***
## latitude                             2.386e+02  4.456e+01   5.355 1.43e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.954 on 409 degrees of freedom
## Multiple R-squared:  0.5711, Adjusted R-squared:  0.5669 
## F-statistic: 136.2 on 4 and 409 DF,  p-value: < 2.2e-16

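Model Interpretation:

  1. Significant predictor variables: All four remaining predictors have a significant effect (p-value < 0.05). The longitude variable was removed by the backward elimination.

  2. Goodness of fit (adj r-squared): About 57% of the price variation can be explained by this model (adjusted R-squared = 0.5669)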
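
By default, step() compares candidate models by AIC and, in the backward direction, drops a term whenever doing so lowers the AIC. The following is a minimal sketch of that idea on simulated data. The data frame sim, its values, and the unrelated noise column are all assumptions made for illustration.

```r
# Simulated data: two informative predictors plus one unrelated column ("noise").
set.seed(1)
n <- 200
sim <- data.frame(
  house_age = runif(n, 0, 40),
  number_of_convenience_stores = rpois(n, 4),
  noise = rnorm(n)  # hypothetical predictor with no relationship to the target
)
sim$house_price_of_unit_area <- 40 - 0.3 * sim$house_age +
  1.5 * sim$number_of_convenience_stores + rnorm(n, sd = 5)

full_model <- lm(house_price_of_unit_area ~ house_age +
                   number_of_convenience_stores + noise, data = sim)
reduced_model <- lm(house_price_of_unit_area ~ house_age +
                      number_of_convenience_stores, data = sim)

# A lower AIC indicates a better trade-off between fit and model size.
print(AIC(full_model, reduced_model))

# Backward elimination removes terms only while that lowers the AIC.
backward_model <- step(full_model, direction = "backward", trace = 0)
print(formula(backward_model))
```

Setting trace = 1 instead of trace = 0 prints every elimination step, which helps to see why a particular variable was dropped.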

After that, we refit the formula selected by the backward elimination and store it in the model_new object.

model_new <- lm(formula = house_price_of_unit_area ~ house_age + distance_to_the_nearest_MRT_station + 
    number_of_convenience_stores + latitude, data = real_estate)

After that, we look at the predictions together with their 95% prediction intervals.

# Store the result under a name that does not shadow the predict() function
pred_interval <- predict(object = model_new, newdata = real_estate, interval = "prediction", level = 0.95)
head(pred_interval, 10)
##          fit       lwr      upr
## 1  48.519732  30.77427 66.26519
## 2  49.158260  31.46686 66.84966
## 3  46.798007  29.11538 64.48064
## 4  46.798007  29.11538 64.48064
## 5  47.813489  30.14700 65.47998
## 6  33.574811  15.91496 51.23466
## 7  41.233895  23.54032 58.92747
## 8  45.547325  27.90432 63.19033
## 9   7.814989 -10.13303 25.76301
## 10 33.325237  15.69445 50.95603
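
The same call can be used for a house that is not in the dataset. The sketch below is a minimal illustration on simulated data. The data frame sim, the model sim_model, and the new_house values are all hypothetical. It also contrasts the prediction interval with the narrower confidence interval for the mean response.

```r
# Simulated data and model; values are made up for illustration.
set.seed(7)
n <- 150
sim <- data.frame(house_age = runif(n, 0, 40))
sim$house_price_of_unit_area <- 45 - 0.3 * sim$house_age + rnorm(n, sd = 6)
sim_model <- lm(house_price_of_unit_area ~ house_age, data = sim)

# A hypothetical new house that is 10 years old.
new_house <- data.frame(house_age = 10)

# Prediction interval: range for a single new observation.
pred_int <- predict(sim_model, newdata = new_house,
                    interval = "prediction", level = 0.95)

# Confidence interval: range for the mean price of all such houses.
conf_int <- predict(sim_model, newdata = new_house,
                    interval = "confidence", level = 0.95)

print(pred_int)
print(conf_int)
```

The prediction interval is always wider than the confidence interval because it also accounts for the scatter of individual observations around the regression line.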

Finally, we evaluate the linear regression model by computing the RMSE and comparing it with the distribution of the target variable, house_price_of_unit_area.

RMSE(model_new$fitted.values, real_estate$house_price_of_unit_area)
## [1] 8.899817
summary(real_estate$house_price_of_unit_area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.60   27.70   38.45   37.98   46.60  117.50

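The RMSE of about 8.90 is quite small compared to the range of house_price_of_unit_area (7.60 to 117.50), so the model is already relatively good. Note that this RMSE is computed on the same data that was used to fit the model.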
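
To make the metric concrete, the sketch below computes the RMSE by hand on a tiny made-up example and then shows how the RMSE relates to the residual standard error printed by summary(). The helper rmse and the example vectors are assumptions for illustration. The figures 8.954, 409 and 414 are taken from the model summary above.

```r
# Root mean squared error: square root of the average squared prediction error.
# rmse is a hypothetical helper written for this illustration.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Tiny made-up example.
actual <- c(10, 20, 30, 40)
predicted <- c(12, 18, 33, 39)
example_rmse <- rmse(actual, predicted)
print(example_rmse)

# The residual standard error divides the residual sum of squares by the
# residual degrees of freedom (409), while the RMSE divides it by the number
# of observations (414). Rescaling one therefore gives the other.
rmse_from_rse <- 8.954 * sqrt(409 / 414)
print(rmse_from_rse)
```

The rescaled value agrees with the RMSE reported above up to the rounding of the residual standard error, which confirms that the two numbers measure the same in-sample error.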

4 Assumptions

4.1 Normality of Residual

shapiro.test(model_new$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_new$residuals
## W = 0.86768, p-value < 2.2e-16

The p-value is below 0.05, so we reject H0 (the hypothesis that the residuals are normally distributed). This means that the residuals/errors of the model are not normally distributed.
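
The residual summary of the model (minimum of about -34.5 against a maximum of about 76.5) suggests a long right tail, which is one way the normality assumption can fail. The following is a minimal sketch of how the Shapiro-Wilk test reacts to that kind of shape. Both samples are simulated and are not the residuals of model_new.

```r
# Two simulated "residual" samples of the same size; values are made up.
set.seed(123)
normal_resid <- rnorm(400)      # drawn from a normal distribution
skewed_resid <- rexp(400) - 1   # right-skewed, centered near zero

# H0 of the Shapiro-Wilk test: the sample comes from a normal distribution.
p_normal <- shapiro.test(normal_resid)$p.value
p_skewed <- shapiro.test(skewed_resid)$p.value

print(p_normal)
print(p_skewed)
```

A p-value below 0.05 for the skewed sample leads to rejecting H0, just as in the test on model_new above.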

4.2 Homoscedasticity of Residual

bptest(model_new)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_new
## BP = 2.1663, df = 4, p-value = 0.7052

Because the p-value (0.7052) is greater than alpha (0.05), we fail to reject H0. There is no evidence against a constant error variance, so the homoscedasticity assumption is met.
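
For intuition about what the test measures, the sketch below reproduces the idea of the studentized Breusch-Pagan statistic in base R: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the number of observations. The data are simulated with an error variance that grows with x1, and all names in the block are assumptions for illustration.

```r
# Simulated data whose error spread increases with x1 (heteroscedastic).
set.seed(2024)
n <- 300
sim <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
sim$y <- 5 + 2 * sim$x1 - sim$x2 + rnorm(n, sd = 0.5 + 0.5 * sim$x1)

fit <- lm(y ~ x1 + x2, data = sim)

# Auxiliary regression: squared residuals explained by the same predictors.
sim$sq_resid <- resid(fit)^2
aux <- lm(sq_resid ~ x1 + x2, data = sim)

# Studentized BP statistic = n * R-squared of the auxiliary regression,
# compared against a chi-squared distribution with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

print(c(BP = bp_stat, df = bp_df, p_value = bp_p))
```

The degrees of freedom equal the number of predictors, which is why the test on model_new reports df = 4.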

4.3 No Multicollinearity

vif(model_new)
##                           house_age distance_to_the_nearest_MRT_station 
##                            1.013216                            1.992371 
##        number_of_convenience_stores                            latitude 
##                            1.607857                            1.575344

All VIF values are below 2, far under the commonly used threshold of 10, so none of the predictor variables show multicollinearity.
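
For a model without factor terms, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following is a minimal sketch of that calculation in base R. The predictors x1, x2 and x3 are simulated, and x2 is deliberately built to be correlated with x1.

```r
# Simulated predictors; x2 is constructed to be correlated with x1.
set.seed(99)
n <- 200
x1 <- rnorm(n)
x2 <- 0.7 * x1 + rnorm(n, sd = 0.7)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF of each predictor: regress it on the remaining predictors,
# then apply 1 / (1 - R-squared).
manual_vif <- sapply(names(preds), function(v) {
  others <- setdiff(names(preds), v)
  r2 <- summary(lm(reformulate(others, response = v), data = preds))$r.squared
  1 / (1 - r2)
})

print(round(manual_vif, 3))
```

A VIF of 1 means a predictor is unrelated to the others, and the value rises as more of its variation can be explained by the remaining predictors.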

5 Conclusions

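The linear regression model with the house_age, distance_to_the_nearest_MRT_station, number_of_convenience_stores, and latitude predictors is good enough in terms of error, because its RMSE (about 8.90) is relatively small. However, its adjusted R-squared is still not very good: the model explains only about 57% of the price variation. The residuals are homoscedastic and the predictors show no multicollinearity, but the residuals are not normally distributed.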