Data Preparation
The first thing we should do is load all the packages that might be
needed.
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.9
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidymodels)
## -- Attaching packages -------------------------------------- tidymodels 0.2.0 --
## v broom 0.8.0 v rsample 0.1.1
## v dials 0.1.1 v tune 0.2.0
## v infer 1.0.0 v workflows 0.2.6
## v modeldata 0.1.1 v workflowsets 0.2.1
## v parsnip 0.2.1 v yardstick 0.0.9
## v recipes 0.2.0
## -- Conflicts ----------------------------------------- tidymodels_conflicts() --
## x data.table::between() masks dplyr::between()
## x scales::discard() masks purrr::discard()
## x plotly::filter() masks dplyr::filter(), stats::filter()
## x data.table::first() masks dplyr::first()
## x recipes::fixed() masks stringr::fixed()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x caret::lift() masks purrr::lift()
## x yardstick::precision() masks caret::precision()
## x yardstick::recall() masks caret::recall()
## x yardstick::sensitivity() masks caret::sensitivity()
## x yardstick::spec() masks readr::spec()
## x yardstick::specificity() masks caret::specificity()
## x recipes::step() masks stats::step()
## x data.table::transpose() masks purrr::transpose()
## * Learn how to get started at https://www.tidymodels.org/start/
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(scales)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(dplyr)
Read the dataset.
ams <- read.csv("data_input/ams.csv")
glimpse(ams)
## Rows: 924
## Columns: 8
## $ X <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,~
## $ Address <chr> "Blasiusstraat 8 2, Amsterdam", "Kromme Leimuidenstraat 13 H, ~
## $ Zip <chr> "1091 CR", "1059 EL", "1097 SM", "1060 TH", "1036 KN", "1051 A~
## $ Price <dbl> 685000, 475000, 850000, 580000, 720000, 450000, 450000, 590000~
## $ Area <int> 64, 60, 109, 128, 138, 53, 87, 80, 49, 33, 69, 88, 45, 70, 86,~
## $ Room <int> 3, 3, 4, 6, 5, 2, 3, 2, 3, 2, 3, 3, 3, 2, 3, 5, 6, 3, 3, 3, 6,~
## $ Lon <dbl> 4.907736, 4.850476, 4.944774, 4.789928, 4.902503, 4.875024, 4.~
## $ Lat <dbl> 52.35616, 52.34859, 52.34378, 52.34371, 52.41054, 52.38223, 52~
dim(ams)
## [1] 924 8
The data has 924 rows and 8 columns. X is just a row identifier, so we
can ignore it. Our target variable is Price, the price of the house. We
will use all other variables except Address and Zip. Before we go
further, we first need to make sure that our data is clean and will be
useful.
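Before dropping any columns, we can quickly confirm that there are no
missing values or duplicated rows. This check is not part of the
original output; it is a minimal sketch using base R on the data we just
loaded.
colSums(is.na(ams))  # count of missing values in each column
sum(duplicated(ams)) # number of fully duplicated rows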
Remove unused variables.
ams <- ams %>% select(-c(X, Address, Zip))
glimpse(ams)
## Rows: 924
## Columns: 5
## $ Price <dbl> 685000, 475000, 850000, 580000, 720000, 450000, 450000, 590000, ~
## $ Area <int> 64, 60, 109, 128, 138, 53, 87, 80, 49, 33, 69, 88, 45, 70, 86, 1~
## $ Room <int> 3, 3, 4, 6, 5, 2, 3, 2, 3, 2, 3, 3, 3, 2, 3, 5, 6, 3, 3, 3, 6, 3~
## $ Lon <dbl> 4.907736, 4.850476, 4.944774, 4.789928, 4.902503, 4.875024, 4.89~
## $ Lat <dbl> 52.35616, 52.34859, 52.34378, 52.34371, 52.41054, 52.38223, 52.4~
Evaluation
Assumptions
Linear regression is a parametric model, meaning that for the model
equation to be valid, the model has to satisfy several classical
assumptions. A linear regression that does not satisfy these assumptions
may be misleading, or may simply have biased estimators. For this
section, we only check the second model (the model with the removed
variables), referred to below as ams_lm2.
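For reference, ams_lm2 is fitted in the modelling section. Based on the
two degrees of freedom in the Breusch-Pagan output and the variables
that appear in the VIF output below, it is assumed here to be an
ordinary least squares fit with Area and Room as predictors; a minimal
sketch (the actual fit in the modelling section may differ, for example
by using a train/test split):
ams_lm2 <- lm(Price ~ Area + Room, data = ams) # assumed specification of the second model
summary(ams_lm2)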
1. Linearity
The linear regression model assumes that there is a straight-line
relationship between the predictors and the response. If the true
relationship is far from linear, then virtually all of the conclusions
that we draw from the fit are suspect. In addition, the prediction
accuracy of the model can be significantly reduced.
Residual plots are a useful graphical tool for identifying
non-linearity. If there is a pattern in the residual plot, it means that
the model can be further improved upon or that it does not meet the
linearity assumption. The plot shows the relationship between the
residuals/errors and the predicted/fitted values.
resact <- data.frame(residual = ams_lm2$residuals, fitted = ams_lm2$fitted.values)
resact %>% ggplot(aes(fitted, residual)) + geom_point() + geom_smooth() + geom_hline(aes(yintercept = 0)) +
theme(panel.grid = element_blank(), panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
(Figure: residuals vs. fitted values with a LOESS smoother and a horizontal line at zero.)
There is a pattern in the residuals: they become more negative as the
fitted values increase, before increasing again at the higher fitted
values. This pattern indicates that our model may not be linear enough.
2. Normality
The second assumption in linear regression is that the residuals follow
a normal distribution. We can easily check this with the Shapiro-Wilk
normality test.
shapiro.test(ams_lm2$residuals[1:500])
##
## Shapiro-Wilk normality test
##
## data:  ams_lm2$residuals[1:500]
## W = 0.72274, p-value < 2.2e-16
The null hypothesis is that the residuals follow a normal distribution.
With p-value < 0.05, the null hypothesis is rejected, and we conclude
that our residuals do not follow a normal distribution.
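A visual check can complement the formal test; the sketch below is not
part of the original output and uses only base R graphics on the
residuals of ams_lm2.
hist(ams_lm2$residuals, breaks = 30, main = "Histogram of residuals", xlab = "Residual") # distribution shape
qqnorm(ams_lm2$residuals) # points should lie on a straight line under normality
qqline(ams_lm2$residuals)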
3. Autocorrelation
The standard errors that are computed for the estimated regression
coefficients or the fitted values are based on the assumption of
uncorrelated error terms (no autocorrelation). If in fact there is
correlation among the error terms, then the estimated standard errors
will tend to underestimate the true standard errors. As a result,
confidence and prediction intervals will be narrower than they should
be. For example, a 95% confidence interval may in reality have a much
lower probability than 0.95 of containing the true value of the
parameter. In addition, p-values associated with the model will be lower
than they should be; this could cause us to erroneously conclude that a
parameter is statistically significant. In short, if the error terms are
correlated, we may have an unwarranted sense of confidence in our
model.
Autocorrelation can be detected using the Durbin-Watson test, with the
null hypothesis that there is no autocorrelation.
durbinWatsonTest(ams_lm2)
## lag Autocorrelation D-W Statistic p-value
## 1 -0.01823275 2.035447 0.64
## Alternative hypothesis: rho != 0
With a p-value of 0.64 (> 0.05), we fail to reject the null hypothesis,
meaning that there is no significant autocorrelation in the residuals.
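The already-loaded lmtest package offers an equivalent check; the
Durbin-Watson statistic should match the one above, though the p-value
can differ slightly because car::durbinWatsonTest uses a bootstrap while
lmtest::dwtest is analytic. A sketch:
dwtest(ams_lm2) # Durbin-Watson test from lmtest; H0: no autocorrelation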
4. Heteroscedasticity
Heteroscedasticity means that the variances of the error terms are
non-constant. One can identify non-constant variances in the errors from
the presence of a funnel shape in the residual plot, the same plot used
for the linearity check.
bptest(ams_lm2)
##
## studentized Breusch-Pagan test
##
## data: ams_lm2
## BP = 93.794, df = 2, p-value < 2.2e-16
resact %>% ggplot(aes(fitted, residual)) + geom_point() + theme_light() + geom_hline(aes(yintercept = 0))
(Figure: residuals vs. fitted values with a horizontal line at zero.)
We can observe that at lower fitted values, the residuals are
concentrated around 0. As the fitted values increase, the residuals also
get larger. A second way to detect heteroscedasticity is the
Breusch-Pagan test, with the null hypothesis that there is no
heteroscedasticity. With p-value < 0.05, we can conclude that
heteroscedasticity is present in our model.
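A common remedy, not applied in this analysis, is to keep the model but
report heteroscedasticity-consistent standard errors (or to
log-transform the skewed Price target). The sketch below assumes the
sandwich package is installed; coeftest() comes from the already-loaded
lmtest package.
library(sandwich) # assumed to be installed; provides vcovHC()
coeftest(ams_lm2, vcov. = vcovHC(ams_lm2, type = "HC1")) # coefficients with robust (HC1) standard errors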
5. Multicollinearity
Multicollinearity means that there is correlation between the
independent variables/predictors. To check for multicollinearity, we can
measure the variance inflation factor (VIF). As a rule of thumb, a VIF
value that exceeds 5 or 10 indicates a problematic amount of
collinearity.
vif(ams_lm2)
## Area Room
## 3.242112 3.242112
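Both VIF values are well below 5, so multicollinearity is not considered
problematic here. The two values are identical because, with exactly two
predictors, each VIF equals 1 / (1 - r^2), where r is the correlation
between the predictors; a quick sketch of that correlation:
cor(ams$Area, ams$Room) # correlation between the two predictors; VIF = 1 / (1 - r^2)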