The data set used for this tutorial can be found here: https://www.kaggle.com/quantbruce/real-estate-price-prediction
RealEstate <- read.csv("file:///C:/Users/Dakota/Documents/Real estate.csv")
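Note that the code below also uses the pipe operator and ggplot2 graphics, so this walkthrough assumes the relevant tidyverse packages are already installed and loaded; a minimal setup sketch:
# Assumed setup for the rest of the tutorial
library(dplyr)    # provides %>% and select()
library(ggplot2)  # provides the plotting functions used below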
The first linear regression model to start with should be the variable you want to predict (the y variable) regressed on all of the explanatory variables. In this case, house price per unit area is predicted by all the other variables: Transaction Date, House Age, Latitude, Longitude, Distance to MRT, and Number of Convenience Stores.
hp1 <- lm(HousePriceUnitArea ~ (TransactionDate + HouseAge + Latitude + Longitude + DistanceMRT + NumberConvenienceStores), data = RealEstate)
Here is the corresponding summary and output:
summary(hp1)
##
## Call:
## lm(formula = HousePriceUnitArea ~ (TransactionDate + HouseAge +
## Latitude + Longitude + DistanceMRT + NumberConvenienceStores),
## data = RealEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -35.664 -5.410 -0.966 4.217 75.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.444e+04 6.776e+03 -2.131 0.03371 *
## TransactionDate 5.146e+00 1.557e+00 3.305 0.00103 **
## HouseAge -2.697e-01 3.853e-02 -7.000 1.06e-11 ***
## Latitude 2.255e+02 4.457e+01 5.059 6.38e-07 ***
## Longitude -1.242e+01 4.858e+01 -0.256 0.79829
## DistanceMRT -4.488e-03 7.180e-04 -6.250 1.04e-09 ***
## NumberConvenienceStores 1.133e+00 1.882e-01 6.023 3.84e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.858 on 407 degrees of freedom
## Multiple R-squared: 0.5824, Adjusted R-squared: 0.5762
## F-statistic: 94.59 on 6 and 407 DF, p-value: < 2.2e-16
Looking at this initial summary, we can see that the p-value of the F-statistic is below our usual significance level of 0.05, which is good; however, the R-squared (about 0.58) is not very high, and we would like it to be as close to 1 as possible.
Now we save the residuals and fitted values so we can plot the residuals against each explanatory variable (and the fitted values) to test for non-linearity:
RealEstate$fit.r <- hp1$residuals
RealEstate$fit.p <- hp1$fitted.values
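(As a side note, the broom package can attach fitted values and residuals in a single step; a sketch, assuming broom is installed and using a hypothetical object name aug:)
library(broom)
aug <- augment(hp1)                   # returns the data plus .fitted and .resid columns
head(aug[, c(".fitted", ".resid")])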
Install the "reshape2" package and load it with library():
library(reshape2)
Now we will test for non-linearity. Panels whose points are spread evenly above and below the line suggest a linear relationship; panels whose points are concentrated toward one side suggest a non-linear one.
RealEstate %>%
  melt(measure.vars = c("TransactionDate", "HouseAge", "Latitude", "Longitude", "DistanceMRT", "NumberConvenienceStores", "fit.p")) %>%
  ggplot(aes(value, fit.r, group = variable)) +
  geom_point(shape = 1) +
  geom_smooth(method = "loess") +
  geom_hline(yintercept = 0) +
  facet_wrap(~ variable, scales = "free")
## `geom_smooth()` using formula 'y ~ x'
Looking at the graphs, we can see that DistanceMRT and Longitude appear non-linear.
We will now modify our first linear regression model by adding the square of each explanatory variable; we do this to see which variables become more significant when squared:
RealEstate$HouseAge2 <- RealEstate$HouseAge^2
RealEstate$TransactionDate2 <- abs(RealEstate$TransactionDate^2)
RealEstate$Latitude2 <- RealEstate$Latitude^2
RealEstate$Longitude2 <- abs(RealEstate$Longitude^2)
RealEstate$DistanceMRT2 <- RealEstate$DistanceMRT^2
RealEstate$NumberConvenienceStores2 <- RealEstate$NumberConvenienceStores^2
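(An equivalent way to fit the quadratic terms, shown here only as a sketch, is to wrap the squares in I() inside the formula rather than creating new columns; we continue with the explicit columns below so the later steps stay simple.)
hp2_alt <- lm(HousePriceUnitArea ~ HouseAge + I(HouseAge^2) + TransactionDate + I(TransactionDate^2) +
                Latitude + I(Latitude^2) + Longitude + I(Longitude^2) +
                DistanceMRT + I(DistanceMRT^2) + NumberConvenienceStores + I(NumberConvenienceStores^2),
              data = RealEstate)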
hp2 <- lm(HousePriceUnitArea ~ HouseAge+HouseAge2+TransactionDate+TransactionDate2+Latitude+Latitude2+Longitude+Longitude2+DistanceMRT+DistanceMRT2+NumberConvenienceStores+NumberConvenienceStores2, data=RealEstate)
summary(hp2)
##
## Call:
## lm(formula = HousePriceUnitArea ~ HouseAge + HouseAge2 + TransactionDate +
## TransactionDate2 + Latitude + Latitude2 + Longitude + Longitude2 +
## DistanceMRT + DistanceMRT2 + NumberConvenienceStores + NumberConvenienceStores2,
## data = RealEstate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -33.478 -4.312 -0.281 3.653 72.832
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.847e+06 1.447e+06 2.658 0.00818 **
## HouseAge -1.023e+00 1.357e-01 -7.536 3.23e-13 ***
## HouseAge2 1.866e-02 3.307e-03 5.643 3.16e-08 ***
## TransactionDate 6.021e+00 1.402e+00 4.296 2.18e-05 ***
## TransactionDate2 NA NA NA NA
## Latitude -3.093e+05 1.158e+05 -2.670 0.00789 **
## Latitude2 6.198e+03 2.319e+03 2.672 0.00784 **
## Longitude -5.753e+00 4.537e+01 -0.127 0.89915
## Longitude2 NA NA NA NA
## DistanceMRT -1.194e-02 1.268e-03 -9.419 < 2e-16 ***
## DistanceMRT2 1.475e-06 2.138e-07 6.899 2.04e-11 ***
## NumberConvenienceStores 5.391e-01 5.007e-01 1.077 0.28219
## NumberConvenienceStores2 8.309e-03 5.180e-02 0.160 0.87265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.951 on 403 degrees of freedom
## Multiple R-squared: 0.6668, Adjusted R-squared: 0.6586
## F-statistic: 80.66 on 10 and 403 DF, p-value: < 2.2e-16
From here we can see that the squared DistanceMRT term is highly significant, leading us to believe that the relationship between distance to the MRT and price is in fact non-linear.
We will now run an analysis of variance (ANOVA) test between the two linear models to see whether adding the squared values improves the model.
anova(hp1,hp2)
## Analysis of Variance Table
##
## Model 1: HousePriceUnitArea ~ (TransactionDate + HouseAge + Latitude +
## Longitude + DistanceMRT + NumberConvenienceStores)
## Model 2: HousePriceUnitArea ~ HouseAge + HouseAge2 + TransactionDate +
## TransactionDate2 + Latitude + Latitude2 + Longitude + Longitude2 +
## DistanceMRT + DistanceMRT2 + NumberConvenienceStores + NumberConvenienceStores2
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 407 31933
## 2 403 25474 4 6458.4 25.543 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very small p-value (< 2.2e-16) indicates that adding the squared values does significantly improve the fit, so the quadratic model is preferred over the original one here.
A final way to test for non-linearity is Ramsey's Regression Equation Specification Error Test (RESET). Essentially, if this test produces a low p-value, then non-linearity is suspected.
library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
resettest(hp1,power=2:3,type="regressor")
##
## RESET test
##
## data: hp1
## RESET = 11.911, df1 = 12, df2 = 395, p-value < 2.2e-16
As we can see from the very small p-value, non-linearity is present in our model, as we expected.
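(As a quick cross-check, the RESET test can also be run with powers of the fitted values rather than of the regressors, which is its default form; a sketch, output not shown:)
resettest(hp1, power = 2:3, type = "fitted")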
Now testing for heteroscedasticity, we will plot the residuals against the fitted values:
RealEstate$fit.r <- hp1$residuals
RealEstate$fit.p <- hp1$fitted.values
ggplot(RealEstate, aes(fit.p, fit.r)) +
geom_jitter(shape = 1) +
geom_hline(yintercept = 0, color = "red") +
ylab("Residuals") +
xlab("Fitted")
We can see that the graph may show heteroscedasticity, because the residuals are not spread evenly across the range of fitted values.
We can test for non-constant error variance using the Breusch-Pagan (a.k.a. Cook-Weisberg) test.
This requires the "car" package:
library(car)
ncvTest(hp1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 4.194015, Df = 1, p = 0.040567
Our p-value (0.041) is below the alpha level of 0.05, though only barely, so the test does indicate some heteroscedasticity.
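(A closely related check, sketched below, is the studentized Breusch-Pagan test from the lmtest package we loaded earlier; it tests the same null hypothesis of constant error variance.)
bptest(hp1)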
We can calculate robust standard errors for our linear model predicting house price per unit area:
hccm(hp1) %>% diag() %>% sqrt()
## (Intercept) TransactionDate HouseAge
## 5.598146e+03 1.469637e+00 4.305044e-02
## Latitude Longitude DistanceMRT
## 4.679488e+01 4.410022e+01 8.354920e-04
## NumberConvenienceStores
## 2.316899e-01
robust.se <- function(model) {
  s <- summary(model)
  wse <- sqrt(diag(hccm(model)))   # heteroscedasticity-consistent (robust) standard errors
  t <- model$coefficients / wse    # t statistics recomputed with the robust SEs
  p <- 2 * pnorm(-abs(t))          # two-sided p-values
  results <- cbind(model$coefficients, wse, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}
robust.se(hp1)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.443710e+04 5.598146e+03 -2.5789077 9.911328e-03
## TransactionDate 5.146227e+00 1.469637e+00 3.5016999 4.623000e-04
## HouseAge -2.696954e-01 4.305044e-02 -6.2646392 3.736897e-10
## Latitude 2.254730e+02 4.679488e+01 4.8183259 1.447678e-06
## Longitude -1.242360e+01 4.410022e+01 -0.2817129 7.781637e-01
## DistanceMRT -4.487461e-03 8.354920e-04 -5.3710400 7.828384e-08
## NumberConvenienceStores 1.133277e+00 2.316899e-01 4.8913522 1.001456e-06
Comparing this to the original model, none of the coefficients change their significance in any meaningful way, so the mild heteroscedasticity does not materially affect our conclusions.
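(A more compact way to build a robust-SE coefficient table, sketched here, is lmtest::coeftest() with the hccm() covariance matrix; it should reproduce the same standard errors as our robust.se() function.)
coeftest(hp1, vcov. = hccm(hp1))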
We are now moving on to testing for autocorrelation. We will use the dwtest() function (the Durbin-Watson test) to test the null hypothesis that the autocorrelation of the residuals is 0.
This again uses the "lmtest" package:
library(lmtest)
dwtest(hp1)
##
## Durbin-Watson test
##
## data: hp1
## DW = 2.1527, p-value = 0.9415
## alternative hypothesis: true autocorrelation is greater than 0
As a rule of thumb, we want the Durbin-Watson statistic to be between 1.5 and 2.5, which it is (2.15), so autocorrelation does not appear to affect the model.
Now moving on to the multiple regression residuals, the graphs below test for normality:
p1 <- ggplot(RealEstate, aes(fit.r)) +   # histogram of the residuals
  geom_histogram(bins = 10, color = "black", fill = "white")
p2 <- ggplot(RealEstate, aes(fit.r)) +   # residual density with a matching normal curve overlaid
  geom_density() +
  stat_function(fun = dnorm, args = list(mean = mean(RealEstate$fit.r),
                                         sd = sd(RealEstate$fit.r)),
                color = "dodgerblue", size = 2, alpha = .5)
p3 <- ggplot(RealEstate, aes("", fit.r)) +   # boxplot of the residuals
  geom_boxplot()
p4 <- ggplot(RealEstate, aes(sample = fit.r)) +   # normal Q-Q plot of the residuals
  stat_qq(shape = 1) +
  stat_qq_line(size = 1.5, alpha = .5)
p1
p2
p3
p4
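(Optionally, the four plots can be arranged in a single panel with the gridExtra package; a sketch, assuming gridExtra is installed.)
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow = 2)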
The graphs do not appear normal: the first three show that the distribution is not symmetrical, with a skew to the right, and the Q-Q plot confirms this, with many points deviating from the line on the right side.
The Shapiro-Wilk test also checks for normality.
shapiro.test(hp1$residuals)
##
## Shapiro-Wilk normality test
##
## data: hp1$residuals
## W = 0.87621, p-value < 2.2e-16
Since our p-value is small, we conclude that the residuals are not normally distributed; we reject the hypothesis of normally distributed errors.
Next we will plot the residuals and look for any possible outliers
ggplot(RealEstate, aes(row.names(RealEstate), fit.r)) +
geom_point(shape = 1) +
geom_hline(yintercept = 0, color = "red")
There appears to be one clear outlier, with a residual of around 80.
To examine the outliers more in depth, we will look at the largest and smallest residuals.
output.1 <- sort(hp1$residuals) # smallest first
output.2 <- sort(hp1$residuals, decreasing = TRUE) # largest first
head(output.1, 1) # most negative residual
## 114
## -35.66435
head(output.2, 1) # most positive residual
## 271
## 75.1927
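(The same two observations can also be located directly with which.min() and which.max(); a quick sketch:)
which.min(hp1$residuals)   # observation with the most negative residual (114)
which.max(hp1$residuals)   # observation with the most positive residual (271)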
We will now pull these two rows from the data set:
RealEstate[c(271,114),c(2,3,4,5,6,7,8)] # [c(row numbers),c(column numbers)]
## TransactionDate HouseAge DistanceMRT NumberConvenienceStores Latitude
## 271 2013.333 10.8 252.5822 1 24.97460
## 114 2013.333 14.8 393.2606 6 24.96172
## Longitude HousePriceUnitArea
## 271 121.5305 117.5
## 114 121.5381 7.6
As you can see, the HousePriceUnitArea values for these rows (117.5 and 7.6) are noticeably larger and smaller than the rest, so they could be potential outliers.
Another approach is to look at the dfbetas, which measure how much each coefficient changes when a single observation is removed. A common cutoff for a concerning value is 2/sqrt(n):
df <- 2/sqrt(414)
df
## [1] 0.09829464
Now that we have this cutoff, we will plot the dfbetas for every variable and look for observations that fall outside it.
dfhp1 <- dfbetas(hp1)
melt(dfhp1, varnames = c("index", "variable")) %>%
ggplot(aes(index, value)) +
geom_point() +
geom_hline(yintercept = df) +
geom_hline(yintercept = -df) +
facet_wrap(~ variable, scales = "free")
A few variables have points beyond the cutoff, so, similar to what we did with the outliers, we will look for the largest absolute dfbetas value:
names(dfhp1) <- row.names(RealEstate)
dfhp1[abs(dfhp1) == max(abs(dfhp1))]
## <NA>
## -0.8991347
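(A quicker way to find which observation and which coefficient this value belongs to is which() with arr.ind = TRUE on the dfbetas matrix; a sketch:)
which(abs(dfhp1) == max(abs(dfhp1)), arr.ind = TRUE)   # row and column of the largest absolute dfbeta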
We will now look through the observations to find which row contains this value.
class(dfhp1)
## [1] "matrix"
df2hp1 <- as.data.frame(dfhp1)
# add an id variable
df2hp1$id <- 1:414 # generate a new observation number
# head function returns one value, based on ,1
# syntax - head(data_set[with(data_set, order(+/-variable)), ], 1)
# Longitude
head(df2hp1[with(df2hp1, order(-Longitude)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 245 -0.1151602 0.02422121 0.1170819 -0.05293354 0.1292413 0.05578565
## NumberConvenienceStores id
## 245 0.0234964 245
head(df2hp1[with(df2hp1, order(+Longitude)), ], 1) # order increasing
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 271 0.3738488 0.384242 -0.2429588 0.05812468 -0.6436714 -0.8943845
## NumberConvenienceStores id
## 271 -0.8991347 271
#Latitude
head(df2hp1[with(df2hp1, order(-Latitude)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 229 -0.1370148 0.03567835 -0.06047111 0.3361963 0.07492576 0.2016581
## NumberConvenienceStores id
## 229 -0.07306714 229
head(df2hp1[with(df2hp1, order(+Latitude)), ], 1) # order increasing
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 149 -0.08308603 0.2205801 -0.0004957044 -0.3569275 0.0455112 0.07767458
## NumberConvenienceStores id
## 149 0.01839568 149
#NumberConvenienceStores
head(df2hp1[with(df2hp1, order(-NumberConvenienceStores)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 313 -0.1207385 0.2925673 0.2835299 -0.1391709 0.009406094 -0.005841154
## NumberConvenienceStores id
## 313 0.2884421 313
head(df2hp1[with(df2hp1, order(+NumberConvenienceStores)), ], 1) # order inclining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 271 0.3738488 0.384242 -0.2429588 0.05812468 -0.6436714 -0.8943845
## NumberConvenienceStores id
## 271 -0.8991347 271
#DistanceMRT
head(df2hp1[with(df2hp1, order(-DistanceMRT)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 229 -0.1370148 0.03567835 -0.06047111 0.3361963 0.07492576 0.2016581
## NumberConvenienceStores id
## 229 -0.07306714 229
head(df2hp1[with(df2hp1, order(+DistanceMRT)), ], 1) # order inclining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 271 0.3738488 0.384242 -0.2429588 0.05812468 -0.6436714 -0.8943845
## NumberConvenienceStores id
## 271 -0.8991347 271
#HouseAge
head(df2hp1[with(df2hp1, order(-HouseAge)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 390 0.0005917871 0.06280159 0.3172411 -0.1614731 -0.003625267 -0.06477414
## NumberConvenienceStores id
## 390 0.159074 390
head(df2hp1[with(df2hp1, order(+HouseAge)), ], 1) # order inclining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 271 0.3738488 0.384242 -0.2429588 0.05812468 -0.6436714 -0.8943845
## NumberConvenienceStores id
## 271 -0.8991347 271
#TransactionDate
head(df2hp1[with(df2hp1, order(-TransactionDate)), ], 1) # order declining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 271 0.3738488 0.384242 -0.2429588 0.05812468 -0.6436714 -0.8943845
## NumberConvenienceStores id
## 271 -0.8991347 271
head(df2hp1[with(df2hp1, order(+TransactionDate)), ], 1) # order inclining
## (Intercept) TransactionDate HouseAge Latitude Longitude DistanceMRT
## 114 -0.03535736 -0.1581897 0.04081476 0.259526 0.07561799 0.1632732
## NumberConvenienceStores id
## 114 -0.1039386 114
Referencing the maximum absolute dfbetas value we found earlier, we can see that row 271 produces it for NumberConvenienceStores, and that the same row is also the most extreme observation for DistanceMRT, Longitude, HouseAge, and TransactionDate.
dfhp1[abs(dfhp1) == max(abs(dfhp1))]
## <NA>
## -0.8991347
RealEstate[c(271), c(2,3,4,5,6,7,8)]
## TransactionDate HouseAge DistanceMRT NumberConvenienceStores Latitude
## 271 2013.333 10.8 252.5822 1 24.9746
## Longitude HousePriceUnitArea
## 271 121.5305 117.5
From here we can see our most influential observation and can determine whether or not it unduly affects the model and the data set.
Our final test is for multicollinearity, which tells us how strongly the independent variables in the model are correlated with one another.
RealEstate %>%
dplyr::select(HouseAge, TransactionDate, DistanceMRT, NumberConvenienceStores, Latitude, Longitude) %>%
na.omit() %>%
data.frame() %>%
cor()
## HouseAge TransactionDate DistanceMRT
## HouseAge 1.00000000 0.017548767 0.02562205
## TransactionDate 0.01754877 1.000000000 0.06087995
## DistanceMRT 0.02562205 0.060879953 1.00000000
## NumberConvenienceStores 0.04959251 0.009635445 -0.60251914
## Latitude 0.05441990 0.035057756 -0.59106657
## Longitude -0.04852005 -0.041081778 -0.80631677
## NumberConvenienceStores Latitude Longitude
## HouseAge 0.049592513 0.05441990 -0.04852005
## TransactionDate 0.009635445 0.03505776 -0.04108178
## DistanceMRT -0.602519145 -0.59106657 -0.80631677
## NumberConvenienceStores 1.000000000 0.44414331 0.44909901
## Latitude 0.444143306 1.00000000 0.41292394
## Longitude 0.449099007 0.41292394 1.00000000
Looking at this correlation matrix, none of the variables seem excessively correlated. Values close to 1 or -1 off the diagonal would be the ones to watch out for; DistanceMRT and Longitude, at about -0.81, come closest.
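(For a quicker visual read of this matrix, the corrplot package can be used; a sketch, assuming corrplot is installed.)
library(corrplot)
RealEstate %>%
  dplyr::select(HouseAge, TransactionDate, DistanceMRT, NumberConvenienceStores, Latitude, Longitude) %>%
  cor() %>%
  corrplot(method = "number")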
We will now check the variance inflation factor and tolerance.
library(car)
vif(hp1)
## TransactionDate HouseAge Latitude
## 1.014674 1.014287 1.610234
## Longitude DistanceMRT NumberConvenienceStores
## 2.926302 4.323019 1.617038
1/vif(hp1)
## TransactionDate HouseAge Latitude
## 0.9855386 0.9859145 0.6210276
## Longitude DistanceMRT NumberConvenienceStores
## 0.3417282 0.2313198 0.6184148
We would like each VIF to be less than 5, which is the case here. Interpreting the tolerance (1/VIF) depends on the sample size: for n < 50, tolerance should exceed 0.7; for n < 300, it should exceed 0.5; for n < 600, it should exceed 0.3; and for n < 1000, it should exceed 0.1.
Our number of observations is 414, so the tolerance should exceed 0.3; it does not for DistanceMRT (0.23), telling us that we do in fact have some multicollinearity.
In summary, this data set might not be the best one for building a simple linear model: several assumptions were violated (non-linearity, mild heteroscedasticity, non-normal residuals, an influential outlier, and multicollinearity), and the data or the model would need adjusting so that these issues do not skew or bias the results.