Picture are taken from Kaggle

Intro

1.1 Greetings

Hi Everyone :)

Welcome to my Rmd.

This is my HTML_Document which contains Linear Regression - Amsterdam House Price Prediction.

Hope you can enjoy that!

1.2. What We Will Do

We will learn to use linear regression model using Amsterdam house price dataset. We wanna know the relationship among variables. We also wanna predict the price of a new house based on the historical data.

Data Source: https://www.kaggle.com/datasets/thomasnibb/amsterdam-house-price-prediction.

1.3. About Dataset

Context

If you are like me, you might get overwhelmed when having to make big decisions such as buying a house. In such cases, I always like to go for a data driven approach, that will help me find an optimum solution.

Content

The housing prices have been obtained from Pararius.nl as a snapshot in August 2021. The original data provided features such as price, floor area and the number of rooms. The data has been further enhanced by utilising the Mapbox API to obtain the coordinates of each listing.

1.4. Business Goal

We wanna know :

-. Which variables are significant in predicting the price of a house -. How well those variables describe the price of a house

Data Preparation

The first things we should do is load all packages that might be needed.

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.9
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom        0.8.0     v rsample      0.1.1
## v dials        0.1.1     v tune         0.2.0
## v infer        1.0.0     v workflows    0.2.6
## v modeldata    0.1.1     v workflowsets 0.2.1
## v parsnip      0.2.1     v yardstick    0.0.9
## v recipes      0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x data.table::between()    masks dplyr::between()
## x scales::discard()        masks purrr::discard()
## x plotly::filter()         masks dplyr::filter(), stats::filter()
## x data.table::first()      masks dplyr::first()
## x recipes::fixed()         masks stringr::fixed()
## x dplyr::lag()             masks stats::lag()
## x data.table::last()       masks dplyr::last()
## x caret::lift()            masks purrr::lift()
## x yardstick::precision()   masks caret::precision()
## x yardstick::recall()      masks caret::recall()
## x yardstick::sensitivity() masks caret::sensitivity()
## x yardstick::spec()        masks readr::spec()
## x yardstick::specificity() masks caret::specificity()
## x recipes::step()          masks stats::step()
## x data.table::transpose()  masks purrr::transpose()
## * Learn how to get started at https://www.tidymodels.org/start/
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(scales)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(dplyr)

Read the dataset

ams <- read.csv("data_input/ams.csv")
glimpse(ams)
## Rows: 924
## Columns: 8
## $ X       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,~
## $ Address <chr> "Blasiusstraat 8 2, Amsterdam", "Kromme Leimuidenstraat 13 H, ~
## $ Zip     <chr> "1091 CR", "1059 EL", "1097 SM", "1060 TH", "1036 KN", "1051 A~
## $ Price   <dbl> 685000, 475000, 850000, 580000, 720000, 450000, 450000, 590000~
## $ Area    <int> 64, 60, 109, 128, 138, 53, 87, 80, 49, 33, 69, 88, 45, 70, 86,~
## $ Room    <int> 3, 3, 4, 6, 5, 2, 3, 2, 3, 2, 3, 3, 3, 2, 3, 5, 6, 3, 3, 3, 6,~
## $ Lon     <dbl> 4.907736, 4.850476, 4.944774, 4.789928, 4.902503, 4.875024, 4.~
## $ Lat     <dbl> 52.35616, 52.34859, 52.34378, 52.34371, 52.41054, 52.38223, 52~
dim(ams)
## [1] 924   8

The data has 924 rows and 8 columns. X is a unique number, so we can ignore it. Our target variable is the Price, which signifies the price of the house. We will use other variable except the Address and Zip.

Before we go further, first we need to make sure that our data is clean and will be useful.

Remove unused variables.

ams <- ams %>% select(-c(X, Address, Zip))
glimpse(ams)
## Rows: 924
## Columns: 5
## $ Price <dbl> 685000, 475000, 850000, 580000, 720000, 450000, 450000, 590000, ~
## $ Area  <int> 64, 60, 109, 128, 138, 53, 87, 80, 49, 33, 69, 88, 45, 70, 86, 1~
## $ Room  <int> 3, 3, 4, 6, 5, 2, 3, 2, 3, 2, 3, 3, 3, 2, 3, 5, 6, 3, 3, 3, 6, 3~
## $ Lon   <dbl> 4.907736, 4.850476, 4.944774, 4.789928, 4.902503, 4.875024, 4.89~
## $ Lat   <dbl> 52.35616, 52.34859, 52.34378, 52.34371, 52.41054, 52.38223, 52.4~

Explanatory Data Analysis

Next we must check the missing value data.

anyNA(ams)
## [1] TRUE

Oh no! There are some missing value datas. So we need to clean it first.

table(is.na(ams))
## 
## FALSE  TRUE 
##  4616     4

Delete missing value datas. We can delete the rows which contained ‘NA’

ams <- na.omit(ams)

Then after preparing and cleaning our dataset, now we can check the correlation between each variables.

ggcorr(ams, label = TRUE, label_size = 2.9, hjust = 1, layout.exp = 2)

As we can see from the graph of correlation, all variables has positive correlation to Price variable which are Area and Room have higher positive correlation than Lattitude and Longitude

Modelling

TRAIN - TEST SPLIT

Before we make model, we need to split the data into train dataset and test dataset. We will use the train dataset to train the linear regression model. The test dataset will be used as a comparasion and see if the model get overfit and can not predict new data that hasn’t been seen during training phase. We will 70% of the data as the training data and the rest of it as the testing data.

set.seed(123)
samplesize <- round(0.7 * nrow(ams), 0)
index <- sample(seq_len(nrow(ams)), size = samplesize)

data_train <- ams[index, ]
data_test <- ams[-index, ]

Linear Regression

And now we will try to model the linear regression using Price as the target variable.

set.seed(123)
ams_lm <- lm(Price ~ ., data = data_train)

summary(ams_lm)
## 
## Call:
## lm(formula = Price ~ ., data = data_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1794275  -148722    13176   122921  3181612 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.923e+07  2.789e+07  -1.765   0.0780 .  
## Area         9.342e+03  3.696e+02  25.279  < 2e-16 ***
## Room        -5.954e+04  1.375e+04  -4.331 1.72e-05 ***
## Lon         -3.388e+05  2.421e+05  -1.399   0.1622    
## Lat          9.707e+05  5.269e+05   1.842   0.0659 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 319800 on 639 degrees of freedom
## Multiple R-squared:  0.7105, Adjusted R-squared:  0.7087 
## F-statistic: 392.1 on 4 and 639 DF,  p-value: < 2.2e-16

The summary of ams_lm model shows a lot of information. But for now, we may be better focus on the Pr(>|t|). This column shows the signifance level of the variable toward the model. If the value is below 0.05, than we can safely asume that the variable has significant effect toward the model (meaning that the estimated coefficient are no different than 0), and vice versa. Thus, we can made a simpler model by removing variables that has p-value > 0.05, since they don’t have significant effect toward our model. The estimate value shows the coefficient of each variable.

ams2 <- ams %>% select(Area, Room, Price)

data_train2 <- ams2[index, ]
data_test2 <- ams2[-index, ]

set.seed(123)
ams_lm2 <- lm(Price ~ ., data = data_train2)

summary(ams_lm2)
## 
## Call:
## lm(formula = Price ~ ., data = data_train2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1777900  -163182    22019   125721  3189746 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -61930      30613  -2.023   0.0435 *  
## Area            9340        369  25.312  < 2e-16 ***
## Room          -59076      13718  -4.306 1.92e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 321000 on 641 degrees of freedom
## Multiple R-squared:  0.7074, Adjusted R-squared:  0.7065 
## F-statistic: 774.8 on 2 and 641 DF,  p-value: < 2.2e-16

We have removed the non-significant variables. Too see if this action affect our model, we can check the Adjusted R-Squared value from our two previous models. The first model with complete variables has adjusted R-squared of 0.7087, meaning that the model can explain 70.87% of variance in the target variable (house price). Meanwhile, our simpler model has adjusted R-squared of 70.65%, no big difference with our first model. This shows that it is safe to remove variables that has no significant coefficient values.

Evaluation

Model Performance

The performance of our model (how well our model predict the target variable) can be calculated using root mean squared error:

\[ RMSE = \sqrt{\frac{1}{n} \sum (\hat y - y)^2} \]

RMSE is better than MAE or mean absolute error, because RMSE squared the difference between the actual values and the predicted values, meaning that prediction with higher error will be penalized greatly. This metric is often used to compare two or more alternative models, even though it is harder to interpret than MAE. We can use the RMSE () functions from caret package. Below is the first model (with complete variables) performance.

lm_pred <- predict(ams_lm, newdata = data_test %>% select(-Price))

# RMSE of train dataset
RMSE(pred = ams_lm$fitted.values, obs = data_train$Price)
## [1] 318509.9
# RMSE of test dataset
RMSE(pred = lm_pred, obs = data_test$Price)
## [1] 214567.5

Below is the second model (with removed variables) performance.

lm_pred2 <- predict(ams_lm2, newdata = data_test2 %>% select(-Price))

# RMSE of train dataset
RMSE(pred = ams_lm2$fitted.values, obs = data_train2$Price)
## [1] 320244.9
# RMSE of test dataset
RMSE(pred = lm_pred2, obs = data_test2$Price)
## [1] 215251.8

The second model is better than the first one in predicting the testing dataset, even only by a small margin. On both models, the error of the training dataset is higher than the test dataset.

Assumptions

Linear regression is a parametric model, meaning that in order to create a model equation, the model follows some classical assumption. Linear regression that doesn’t follow the assumption may be misleading, or just simply has biased estimator. For this section, we only check the second model (the model with removed variables).

1. Linearity

The linear regression model assumes that there is a straight-line relationship between the predictors and the response. If the true relationship is far from linear, then virtually all of the conclusions that we draw from the fit are suspect. In addition, the prediction accuracy of the model can be significantly reduced.

Residual plots are a useful graphical tool for identifying non-linearity. If there is a pattern in the residual plot, it means that the model can be further improved upon or that it does not meet the linearity assumption. The plot shows the relationship between the residuals/errors with the predicted/fitted values.

resact <- data.frame(residual = ams_lm2$residuals, fitted = ams_lm2$fitted.values)

resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) + 
    theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

There is a pattern in the data, with the residuals has become more negative as the fitted values increase before increased again. The pattern indicate that our model may not be linear enough.

2. Normality

The second assumption in linear regression is that the residuals follow normal distribution. We can easily check this by using the Saphiro-Wilk normality test.

shapiro.test(ams_lm2$residuals[0:500])
## 
##  Shapiro-Wilk normality test
## 
## data:  ams_lm2$residuals[0:500]
## W = 0.72274, p-value < 2.2e-16

The null hypothesis is that the residuals follow normal distribution. With p-value < 0.05, we can conclude that our hypothesis is rejected, and our residuals are not following the normal distribution.

3. Autocorrelation

The standard errors that are computed for the estimated regression coefficients or the fitted values are based on the assumption of uncorrelated error terms (no autocorrelation). If in fact there is correlation among the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. For example, a 95% confidence interval may in reality have a much lower probability than 0.95 of containing the true value of the parameter. In addition, p-values associated with the model will be lower than they should be; this could cause us to erroneously conclude that a parameter is statistically significant. In short, if the error terms are correlated, we may have an unwarranted sense of confidence in our model.

Autocorrelation can be detected using the durbin watson test, with null hypothesis that there is no autocorrelation.

durbinWatsonTest(ams_lm2)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01823275      2.035447    0.64
##  Alternative hypothesis: rho != 0

The result shows that the null hypothesis is rejected, meaning that our residual has autocorrelation in it.

4. Heterocedasticity

Heterocedasticity means that the variances of the error terms are non-constant. One can identify non-constant variances in the errors from the presence of a funnel shape in the residual plot, same with the linearity one.

bptest(ams_lm2)
## 
##  studentized Breusch-Pagan test
## 
## data:  ams_lm2
## BP = 93.794, df = 2, p-value < 2.2e-16
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))

We can observe that on lower fitted values, the residuals are concentrated around the value of 0. As the fitted value increases, the residuals are also got bigger. Second way to detect heterocesdasticity is using the Breusch-Pagan test, with null hypothesis is there is no heterocesdasticity. With p-value < 0.05, we can conclude that heterocesdasticity is present in our model.

5. Multicollinearity

Multicollinearity mean that there is a correlation between the independent variables/predictors. To check the multicollinearity, we can measure the varianec inflation factor (VIF). As a rule of thumb, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.

vif(ams_lm2)
##     Area     Room 
## 3.242112 3.242112

Conclusion

Variables that are useful to describe the variances in Amsterdam house prices are Area, Room, Price. Our second model has The R-squared of the model with 70.65% of the variables can explain the variances in the house price. The accuracy of the model in predicting the house price is measured with RMSE, with training data has RMSE of 320244.9 and testing data has RMSE of 215251.8, suggesting that our model may overfit the traning dataset.

We have already learn how to build a linear regression model and what need to be concerned when building the model.