The following is an analysis of a real estate dataset, in which we predict house prices using linear regression.
library(dplyr)
library(ggplot2)
library(GGally)
library(MLmetrics)
library(performance)
library(lmtest)
library(car)
real_estate <- read.csv("data/Real_estate.csv")
First, we want to look at the first 10 rows of this dataset.
head(real_estate, 10)
Then we rename the columns to make them easier to understand.
real_estate <- real_estate %>%
  rename(transaction_date = X1.transaction.date,
         house_age = X2.house.age,
         distance_to_the_nearest_MRT_station = X3.distance.to.the.nearest.MRT.station,
         number_of_convenience_stores = X4.number.of.convenience.stores,
         latitude = X5.latitude,
         longitude = X6.longitude,
         house_price_of_unit_area = Y.house.price.of.unit.area)
Next, we check the data type of each column.
str(real_estate)
## 'data.frame': 414 obs. of 8 variables:
## $ No                                 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ transaction_date                   : num 2013 2013 2014 2014 2013 ...
## $ house_age                          : num 32 19.5 13.3 13.3 5 7.1 34.5 20.3 31.7 17.9 ...
## $ distance_to_the_nearest_MRT_station: num 84.9 306.6 562 562 390.6 ...
## $ number_of_convenience_stores       : int 10 9 5 5 5 3 7 6 1 3 ...
## $ latitude                           : num 25 25 25 25 25 ...
## $ longitude                          : num 122 122 122 122 122 ...
## $ house_price_of_unit_area           : num 37.9 42.2 47.3 54.8 43.1 32.1 40.3 46.7 18.8 22.1 ...
The following is a description of each column:
No : record number
transaction_date : date of the transaction
house_age : age of the house
distance_to_the_nearest_MRT_station : distance from the house to the nearest MRT station
number_of_convenience_stores : number of convenience stores nearby
latitude : latitude of the house
longitude : longitude of the house
house_price_of_unit_area : house price per unit area
The data type of each column is appropriate.
After that, we check the number of missing values contained in this data
colSums(is.na(real_estate))
##                                  No                    transaction_date
##                                   0                                   0
##                           house_age distance_to_the_nearest_MRT_station
##                                   0                                   0
##        number_of_convenience_stores                            latitude
##                                   0                                   0
##                           longitude            house_price_of_unit_area
##                                   0                                   0
Great, this data has no missing values. Next, we keep only the columns we need for the analysis, dropping No and transaction_date.
real_estate <- real_estate %>% select(-No, -transaction_date)
We want to see the correlation between the columns in this dataset.
ggcorr(real_estate, label = T)
In the correlation graph, it can be seen that longitude, latitude, and number_of_convenience_stores have a positive correlation with house_price_of_unit_area, while distance_to_the_nearest_MRT_station and house_age have a negative correlation with it.
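Before modeling, it also helps to eyeball the strongest relationship directly. A minimal sketch (the plotting choices here are ours, not part of the original analysis):
ggplot(real_estate, aes(x = number_of_convenience_stores, y = house_price_of_unit_area)) +
  geom_point(alpha = 0.4) +                  # one point per house
  geom_smooth(method = "lm", se = FALSE) +   # fitted regression line
  labs(x = "number of convenience stores", y = "house price per unit area")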
In the next stage, we build a simple linear regression model with number_of_convenience_stores as the only predictor, because this variable has the highest positive correlation with the target variable house_price_of_unit_area.
model_one_pred <- lm(formula = house_price_of_unit_area ~ number_of_convenience_stores, data = real_estate) # model with one predictor
summary(model_one_pred)
##
## Call:
## lm(formula = house_price_of_unit_area ~ number_of_convenience_stores,
## data = real_estate)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -35.407  -7.341  -1.788   5.984  87.681
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   27.1811     0.9419   28.86   <2e-16 ***
## number_of_convenience_stores   2.6377     0.1868   14.12   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared: 0.326, Adjusted R-squared: 0.3244
## F-statistic: 199.3 on 1 and 412 DF, p-value: < 2.2e-16
It can be seen that the adjusted R-squared is only 0.3244, so this single predictor explains about 32% of the variation in price.
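As a quick sanity check of the coefficients, the fitted line is house_price_of_unit_area = 27.1811 + 2.6377 * number_of_convenience_stores, so we can reproduce a prediction by hand (the value 5 below is a hypothetical number of stores):
# manual prediction for a house near 5 convenience stores
27.1811 + 2.6377 * 5   # ~40.37
# the same prediction via predict()
predict(model_one_pred, newdata = data.frame(number_of_convenience_stores = 5))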
Next, we will select predictor variables automatically using stepwise regression with backward elimination. First, we build a model with all predictor variables.
model_all_pred <- lm(formula = house_price_of_unit_area ~ ., data = real_estate)
summary(model_all_pred)
##
## Call:
## lm(formula = house_price_of_unit_area ~ ., data = real_estate)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -34.546  -5.267  -1.600   4.247  76.372
##
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -4.946e+03  6.211e+03  -0.796    0.426    
## house_age                           -2.689e-01  3.900e-02  -6.896 2.04e-11 ***
## distance_to_the_nearest_MRT_station -4.259e-03  7.233e-04  -5.888 8.17e-09 ***
## number_of_convenience_stores         1.163e+00  1.902e-01   6.114 2.27e-09 ***
## latitude                             2.378e+02  4.495e+01   5.290 2.00e-07 ***
## longitude                           -7.805e+00  4.915e+01  -0.159    0.874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.965 on 408 degrees of freedom
## Multiple R-squared: 0.5712, Adjusted R-squared: 0.5659
## F-statistic: 108.7 on 5 and 408 DF, p-value: < 2.2e-16
Then we apply backward elimination and store the result in the model_backward object. As the summary above suggests, longitude is the obvious candidate to drop, since its coefficient is not significant (p = 0.874).
model_backward <- step(model_all_pred, direction = "backward", trace = 0)
summary(model_backward)
##
## Call:
## lm(formula = house_price_of_unit_area ~ house_age + distance_to_the_nearest_MRT_station +
## number_of_convenience_stores + latitude, data = real_estate)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -34.522  -5.292  -1.579   4.264  76.466
##
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -5.916e+03  1.113e+03  -5.317 1.74e-07 ***
## house_age                           -2.687e-01  3.893e-02  -6.903 1.95e-11 ***
## distance_to_the_nearest_MRT_station -4.175e-03  4.928e-04  -8.473 4.37e-16 ***
## number_of_convenience_stores         1.165e+00  1.897e-01   6.141 1.94e-09 ***
## latitude                             2.386e+02  4.456e+01   5.355 1.43e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.954 on 409 degrees of freedom
## Multiple R-squared: 0.5711, Adjusted R-squared: 0.5669
## F-statistic: 136.2 on 4 and 409 DF, p-value: < 2.2e-16
Model interpretation:
Significant predictors: all four remaining variables have a significant effect on the price (p < 0.001).
Goodness of fit (adjusted R-squared): about 57% of the price variation can be explained by this model.
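Since the performance package is already loaded, a minimal sketch comparing the single-predictor model against the backward-elimination model side by side:
# compares AIC, R-squared, RMSE, etc. across the two models
compare_performance(model_one_pred, model_backward)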
After that, we refit the selected formula as a new object, model_new.
model_new <- lm(formula = house_price_of_unit_area ~ house_age + distance_to_the_nearest_MRT_station +
                  number_of_convenience_stores + latitude, data = real_estate)
Then we generate predictions with a 95% prediction interval.
pred <- predict(object = model_new, newdata = real_estate, interval = "prediction", level = 0.95)
head(pred, 10)
##           fit       lwr      upr
## 1   48.519732  30.77427 66.26519
## 2   49.158260  31.46686 66.84966
## 3   46.798007  29.11538 64.48064
## 4   46.798007  29.11538 64.48064
## 5   47.813489  30.14700 65.47998
## 6   33.574811  15.91496 51.23466
## 7   41.233895  23.54032 58.92747
## 8   45.547325  27.90432 63.19033
## 9    7.814989 -10.13303 25.76301
## 10  33.325237  15.69445 50.95603
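As a usage example, a minimal sketch predicting the price of a single new house; all feature values below are made up for illustration:
new_house <- data.frame(house_age = 10,   # hypothetical house
                        distance_to_the_nearest_MRT_station = 500,
                        number_of_convenience_stores = 4,
                        latitude = 24.97)
predict(model_new, newdata = new_house, interval = "prediction", level = 0.95)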
Finally, we assess the performance of the linear regression model by checking the RMSE and comparing it with the scale of the target variable, house_price_of_unit_area.
RMSE(model_new$fitted.values, real_estate$house_price_of_unit_area)
## [1] 8.899817
summary(real_estate$house_price_of_unit_area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    7.60   27.70   38.45   37.98   46.60  117.50
The RMSE (about 8.9) is quite small compared to the range of house prices (7.6 to 117.5), so the model is already reasonably good.
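As a complementary check, a minimal sketch of two other error metrics from the already-loaded MLmetrics package; MAE is less sensitive than RMSE to the few very large residuals seen earlier:
MAE(model_new$fitted.values, real_estate$house_price_of_unit_area)    # mean absolute error
MAPE(model_new$fitted.values, real_estate$house_price_of_unit_area)   # mean absolute percentage error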
Next, we check whether the residuals are normally distributed using the Shapiro-Wilk test.
shapiro.test(model_new$residuals)
##
## Shapiro-Wilk normality test
##
## data: model_new$residuals
## W = 0.86768, p-value < 2.2e-16
Since the p-value < 0.05, we reject H0, which means the residuals/errors of the predictions are not normally distributed.
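To see where the normality assumption breaks down, a minimal visual sketch of the residual distribution; given the maximum residual of about 76.5 in the model summary, a long right tail is the likely cause:
hist(model_new$residuals, breaks = 30,
     main = "Distribution of residuals", xlab = "residual")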
Then we test for heteroscedasticity with the Breusch-Pagan test.
bptest(model_new)
##
## studentized Breusch-Pagan test
##
## data: model_new
## BP = 2.1663, df = 4, p-value = 0.7052
Because the p-value > alpha (0.05), the error variance spreads randomly, i.e. it is constant (homoscedasticity).
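As a visual counterpart to the Breusch-Pagan test, a minimal sketch plotting residuals against fitted values; homoscedasticity shows up as a band of roughly constant width around zero:
plot(model_new$fitted.values, model_new$residuals,
     xlab = "fitted values", ylab = "residuals")
abline(h = 0, col = "red", lty = 2)   # reference line at zero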
Finally, we check for multicollinearity among the predictors using the variance inflation factor (VIF).
vif(model_new)
##                           house_age distance_to_the_nearest_MRT_station
##                            1.013216                            1.992371
##        number_of_convenience_stores                            latitude
##                            1.607857                            1.575344
All VIF values are well below 10, so none of the predictor variables show multicollinearity.
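The same check is also available from the already-loaded performance package, which additionally labels the severity of each VIF:
check_collinearity(model_new)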
In conclusion, the linear regression model with house_age, distance_to_the_nearest_MRT_station, number_of_convenience_stores, and latitude as predictors is good enough in terms of error (RMSE about 8.9), but its adjusted R-squared of about 57% is still modest, so a substantial part of the price variation remains unexplained.