The data set used for this tutorial can be found here: https://www.kaggle.com/quantbruce/real-estate-price-prediction

RealEstate <- read.csv("file:///C:/Users/Dakota/Documents/Real estate.csv")

The first linear regression model to start with should have the variable you want to predict as the response (y) and all of the other variables as explanatory variables. In this case, house price per unit area is predicted by Transaction Date, House Age, Latitude, Longitude, Distance to MRT, and Number of convenience stores.

hp1 <-  lm(HousePriceUnitArea ~ (TransactionDate + HouseAge + Latitude + Longitude + DistanceMRT + NumberConvenienceStores), data = RealEstate)

Here is the corresponding summary and output:

summary(hp1)
## 
## Call:
## lm(formula = HousePriceUnitArea ~ (TransactionDate + HouseAge + 
##     Latitude + Longitude + DistanceMRT + NumberConvenienceStores), 
##     data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.664  -5.410  -0.966   4.217  75.193 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.444e+04  6.776e+03  -2.131  0.03371 *  
## TransactionDate          5.146e+00  1.557e+00   3.305  0.00103 ** 
## HouseAge                -2.697e-01  3.853e-02  -7.000 1.06e-11 ***
## Latitude                 2.255e+02  4.457e+01   5.059 6.38e-07 ***
## Longitude               -1.242e+01  4.858e+01  -0.256  0.79829    
## DistanceMRT             -4.488e-03  7.180e-04  -6.250 1.04e-09 ***
## NumberConvenienceStores  1.133e+00  1.882e-01   6.023 3.84e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.858 on 407 degrees of freedom
## Multiple R-squared:  0.5824, Adjusted R-squared:  0.5762 
## F-statistic: 94.59 on 6 and 407 DF,  p-value: < 2.2e-16

Looking at this initial summary, the p-value of the overall F test is far below our usual significance level of 0.05, which is good; however, the R-squared (about 0.58) is not very high, and we would like that value to be as close to 1 as possible.
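
If you prefer to pull these numbers out of the summary object programmatically rather than reading them off the printout, a minimal sketch (the s1 name is just for illustration):

s1 <- summary(hp1)
s1$r.squared       # multiple R-squared
s1$adj.r.squared   # adjusted R-squared
# p-value of the overall F test, recomputed from the stored F statistic
with(as.list(s1$fstatistic), pf(value, numdf, dendf, lower.tail = FALSE))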

Now we store the residuals and fitted values from the model so we can plot each explanatory variable against its corresponding residuals to check for non-linearity:

RealEstate$fit.r <- hp1$residuals
RealEstate$fit.p <- hp1$fitted.values

Install the "reshape2" package and load it with library(). The plotting code below also uses the pipe (%>%) and ggplot2, so the tidyverse (or at least dplyr and ggplot2) should already be loaded as well:

library(reshape2)

Now we will test for non-linearity. Panels whose points fall fairly evenly above and below the horizontal line look linear; panels where the points cluster on one side of the line suggest a non-linear relationship.

RealEstate %>%
  melt(measure.vars = c("TransactionDate", "HouseAge", "Latitude", "Longitude", "DistanceMRT", "NumberConvenienceStores", "fit.p")) %>%
  ggplot(aes(value, fit.r, group = variable)) +
  geom_point(shape = 1) +
  geom_smooth(method = loess) +
  geom_hline(yintercept = 0) +
  facet_wrap(~ variable, scales = "free")
## `geom_smooth()` using formula 'y ~ x'

Looking at the graphs, we can see that DistanceMRT and Longitude appear non-linear.

We will now extend our first linear regression model by adding each explanatory variable squared; we are doing this to see which variables become more significant once a quadratic term is included:

RealEstate$HouseAge2 <- RealEstate$HouseAge^2
RealEstate$TransactionDate2 <- abs(RealEstate$TransactionDate^2)
RealEstate$Latitude2 <- RealEstate$Latitude^2
RealEstate$Longitude2 <- abs(RealEstate$Longitude^2)
RealEstate$DistanceMRT2 <- RealEstate$DistanceMRT^2
RealEstate$NumberConvenienceStores2 <- RealEstate$NumberConvenienceStores^2
hp2 <- lm(HousePriceUnitArea ~ HouseAge+HouseAge2+TransactionDate+TransactionDate2+Latitude+Latitude2+Longitude+Longitude2+DistanceMRT+DistanceMRT2+NumberConvenienceStores+NumberConvenienceStores2, data=RealEstate)
summary(hp2)
## 
## Call:
## lm(formula = HousePriceUnitArea ~ HouseAge + HouseAge2 + TransactionDate + 
##     TransactionDate2 + Latitude + Latitude2 + Longitude + Longitude2 + 
##     DistanceMRT + DistanceMRT2 + NumberConvenienceStores + NumberConvenienceStores2, 
##     data = RealEstate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.478  -4.312  -0.281   3.653  72.832 
## 
## Coefficients: (2 not defined because of singularities)
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.847e+06  1.447e+06   2.658  0.00818 ** 
## HouseAge                 -1.023e+00  1.357e-01  -7.536 3.23e-13 ***
## HouseAge2                 1.866e-02  3.307e-03   5.643 3.16e-08 ***
## TransactionDate           6.021e+00  1.402e+00   4.296 2.18e-05 ***
## TransactionDate2                 NA         NA      NA       NA    
## Latitude                 -3.093e+05  1.158e+05  -2.670  0.00789 ** 
## Latitude2                 6.198e+03  2.319e+03   2.672  0.00784 ** 
## Longitude                -5.753e+00  4.537e+01  -0.127  0.89915    
## Longitude2                       NA         NA      NA       NA    
## DistanceMRT              -1.194e-02  1.268e-03  -9.419  < 2e-16 ***
## DistanceMRT2              1.475e-06  2.138e-07   6.899 2.04e-11 ***
## NumberConvenienceStores   5.391e-01  5.007e-01   1.077  0.28219    
## NumberConvenienceStores2  8.309e-03  5.180e-02   0.160  0.87265    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.951 on 403 degrees of freedom
## Multiple R-squared:  0.6668, Adjusted R-squared:  0.6586 
## F-statistic: 80.66 on 10 and 403 DF,  p-value: < 2.2e-16

From this output we can see that the quadratic term for DistanceMRT is highly significant (as are those for HouseAge and Latitude), supporting our suspicion that its relationship with price is non-linear. Note also that TransactionDate2 and Longitude2 were dropped (reported as NA) because of singularities: those variables span such a narrow range that their squares are almost perfectly collinear with the original columns.
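
If we wanted to keep quadratic terms for every variable without the dropped coefficients, one alternative to the manual squared columns (an option not used in the original analysis) is orthogonal polynomials via poly(), which removes the near-perfect collinearity between a narrow-ranged variable and its square:

# hypothetical alternative fit using orthogonal polynomials (hp2.poly is an illustrative name)
hp2.poly <- lm(HousePriceUnitArea ~ poly(HouseAge, 2) + poly(TransactionDate, 2) +
                 poly(Latitude, 2) + poly(Longitude, 2) + poly(DistanceMRT, 2) +
                 poly(NumberConvenienceStores, 2), data = RealEstate)
summary(hp2.poly)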

We will now run an analysis of variance (ANOVA) test comparing the two models to see whether adding the squared terms improves the fit.

anova(hp1,hp2)
## Analysis of Variance Table
## 
## Model 1: HousePriceUnitArea ~ (TransactionDate + HouseAge + Latitude + 
##     Longitude + DistanceMRT + NumberConvenienceStores)
## Model 2: HousePriceUnitArea ~ HouseAge + HouseAge2 + TransactionDate + 
##     TransactionDate2 + Latitude + Latitude2 + Longitude + Longitude2 + 
##     DistanceMRT + DistanceMRT2 + NumberConvenienceStores + NumberConvenienceStores2
##   Res.Df   RSS Df Sum of Sq      F    Pr(>F)    
## 1    407 31933                                  
## 2    403 25474  4    6458.4 25.543 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very small p-value (< 2.2e-16) on this F test tells us that adding the squared terms significantly reduces the residual sum of squares, so the quadratic model is a genuine improvement over the original, consistent with the rise in adjusted R-squared from about 0.58 to 0.66.
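
A quick complementary comparison, not part of the original write-up, is to look at information criteria, where lower values indicate a better trade-off between fit and complexity:

AIC(hp1, hp2)  # Akaike information criterion for both models
BIC(hp1, hp2)  # Bayesian information criterion for both models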

A final way to test for non-linearity is Ramsey's Regression Equation Specification Error Test (RESET). Essentially, if this test produces a low p-value, then the model is misspecified and non-linearity is suspected.

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.6.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.6.3
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
resettest(hp1,power=2:3,type="regressor")
## 
##  RESET test
## 
## data:  hp1
## RESET = 11.911, df1 = 12, df2 = 395, p-value < 2.2e-16

As we can see, the tiny p-value means the test rejects the linear specification, so non-linearity is present in our model, as we expected.

Now testing for heteroscedasticity, we will plot the residuals against the fitted values:

RealEstate$fit.r <- hp1$residuals
RealEstate$fit.p <- hp1$fitted.values
ggplot(RealEstate, aes(fit.p, fit.r)) +
  geom_jitter(shape = 1) +
  geom_hline(yintercept = 0, color = "red") +
  ylab("Residuals") +
  xlab("Fitted")

Our plot does appear to show heteroscedasticity, because the spread of the residuals is not even across the range of fitted values.

We can test for non-constant error variance using the Breusch-Pagan (also known as the Cook-Weisberg) test.

This requires the "car" package to be installed and loaded:

library(car)
ncvTest(hp1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 4.194015, Df = 1, p = 0.040567

Our p-value (0.041) is below the alpha level of 0.05, though only barely, so the test does indicate heteroscedasticity.
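
As a cross-check (an addition, not part of the original write-up), the lmtest package loaded earlier also provides bptest(), which by default runs the studentized Breusch-Pagan test against the model's regressors:

bptest(hp1)  # studentized Breusch-Pagan test using the model's regressors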

We can calculate robust standard errors for our linear model predicting house price per unit area:

hccm(hp1) %>% diag() %>% sqrt()
##             (Intercept)         TransactionDate                HouseAge 
##            5.598146e+03            1.469637e+00            4.305044e-02 
##                Latitude               Longitude             DistanceMRT 
##            4.679488e+01            4.410022e+01            8.354920e-04 
## NumberConvenienceStores 
##            2.316899e-01
robust.se <- function(model) {
  # rebuild the coefficient table using heteroscedasticity-consistent (White) standard errors
  s <- summary(model)
  wse <- sqrt(diag(hccm(model)))    # robust standard errors from the model passed in
  t <- model$coefficients / wse     # recomputed t statistics
  p <- 2 * pnorm(-abs(t))           # two-sided p-values (normal approximation)
  results <- cbind(model$coefficients, wse, t, p)
  dimnames(results) <- dimnames(s$coefficients)
  results
}

robust.se(hp1)
##                              Estimate   Std. Error    t value     Pr(>|t|)
## (Intercept)             -1.443710e+04 5.598146e+03 -2.5789077 9.911328e-03
## TransactionDate          5.146227e+00 1.469637e+00  3.5016999 4.623000e-04
## HouseAge                -2.696954e-01 4.305044e-02 -6.2646392 3.736897e-10
## Latitude                 2.254730e+02 4.679488e+01  4.8183259 1.447678e-06
## Longitude               -1.242360e+01 4.410022e+01 -0.2817129 7.781637e-01
## DistanceMRT             -4.487461e-03 8.354920e-04 -5.3710400 7.828384e-08
## NumberConvenienceStores  1.133277e+00 2.316899e-01  4.8913522 1.001456e-06

Comparing these robust standard errors to the original ones, the changes are small and no coefficient changes its significance, so although the score test suggested some heteroscedasticity, it does not materially affect our inference.
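
An alternative route to the same robust inference, assuming the sandwich package is installed (it is not used in the original write-up), is coeftest() from lmtest with a heteroscedasticity-consistent covariance matrix:

library(sandwich)
coeftest(hp1, vcov. = vcovHC(hp1, type = "HC3"))  # coefficient table with HC3 robust standard errors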

We now move on to testing for autocorrelation. We will use the dwtest function to run a Durbin-Watson test of the null hypothesis that the autocorrelation of the residuals is 0.

This requires the "lmtest" package

library(lmtest)
dwtest(hp1)
## 
##  Durbin-Watson test
## 
## data:  hp1
## DW = 2.1527, p-value = 0.9415
## alternative hypothesis: true autocorrelation is greater than 0

We want the Durbin-Watson statistic to fall roughly between 1.5 and 2.5, which it does (2.15), and the large p-value (0.94) means we fail to reject the null of no positive autocorrelation, so autocorrelation is not a concern for this model.
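
As another cross-check that is not part of the original workflow, lmtest also provides the Breusch-Godfrey test, which can detect higher-order serial correlation in the residuals:

bgtest(hp1, order = 2)  # Breusch-Godfrey test for serial correlation up to lag 2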

Now we move on to the multiple regression residuals; the graphs below check whether they are approximately normally distributed:

p1 <- ggplot(RealEstate, aes(fit.r)) +
  geom_histogram(bins = 10, color = "black", fill = "white")
p2 <- ggplot(RealEstate, aes(fit.r)) +
  geom_density() +
  stat_function(fun = dnorm, args = list(mean = mean(RealEstate$fit.r),
                                         sd = sd(RealEstate$fit.r)),
                color = "dodgerblue", size = 2, alpha = .5)
p3 <- ggplot(RealEstate, aes("", fit.r)) +
  geom_boxplot() 
p4 <- ggplot(RealEstate, aes(sample = fit.r)) +
  stat_qq(shape = 1) +
  stat_qq_line(size = 1.5, alpha = .5)
p1

p2

p3

p4

The residuals do not appear normal: the first three plots show that the distribution is not symmetrical but skewed to the right, and the Q-Q plot shows the same thing, with many points deviating from the line in the right tail.

The Shapiro-Wilk test also checks for normality.

shapiro.test(hp1$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  hp1$residuals
## W = 0.87621, p-value < 2.2e-16

Since our p-value is extremely small, we reject the hypothesis of normally distributed errors and conclude that the residuals are not normal.

Next we will plot the residuals and look for any possible outliers:

ggplot(RealEstate, aes(row.names(RealEstate), fit.r)) +
  geom_point(shape = 1) +
  geom_hline(yintercept = 0, color = "red")

There appears to be one clear outlier, with a residual of roughly 75.

To examine the outliers in more depth we will look at the largest and smallest residuals.

output.1 <- sort(hp1$residuals)  # smallest first
output.2 <- sort(hp1$residuals, decreasing = TRUE) # largest first


head(output.1, 1) # most negative residual
##       114 
## -35.66435
head(output.2, 1) # largest positive residual
##     271 
## 75.1927

We will now pull these two observations from the data frame:

RealEstate[c(271,114),c(2,3,4,5,6,7,8)] # [c(row numbers),c(column numbers)]
##     TransactionDate HouseAge DistanceMRT NumberConvenienceStores Latitude
## 271        2013.333     10.8    252.5822                       1 24.97460
## 114        2013.333     14.8    393.2606                       6 24.96172
##     Longitude HousePriceUnitArea
## 271  121.5305              117.5
## 114  121.5381                7.6

As you can see, observation 271 has a much higher HousePriceUnitArea (117.5) and observation 114 a much lower one (7.6) than is typical for the data set, so both are potential outliers.
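
As an additional check not in the original analysis, studentized residuals put these points on a common scale, and outlierTest() from the car package (loaded earlier) runs a Bonferroni-adjusted test on the most extreme one:

rstudent(hp1)[c(271, 114)]  # studentized residuals for the two flagged observations
outlierTest(hp1)            # Bonferroni outlier test on the largest studentized residual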

Another approach is to look at the DFBETAS, which measure how much each estimated coefficient changes when a single observation is dropped. A common cutoff for flagging an influential observation is 2/sqrt(n):

df <- 2/sqrt(414) # rule-of-thumb DFBETAS cutoff, 2/sqrt(n), with n = 414 observations
df
## [1] 0.09829464

Now that we have our cutoff, we will plot the DFBETAS for every observation against it to look for possible influential points on each coefficient.

dfhp1 <- dfbetas(hp1)
melt(dfhp1, varnames = c("index", "variable")) %>%
  ggplot(aes(index, value)) +
  geom_point() +
  geom_hline(yintercept = df) +
  geom_hline(yintercept = -df) +
  facet_wrap(~ variable, scales = "free")

A few variables have points that exceed the cutoff, so we will do something similar to what we did with the outliers and look at the highest and lowest DFBETAS values:

names(dfhp1) <- row.names(RealEstate) # note: this only names the first 414 elements of the matrix, which is why the maximum below prints with an <NA> label
dfhp1[abs(dfhp1) == max(abs(dfhp1))] 
##       <NA> 
## -0.8991347

We will now look through the observations to find the row that holds this value. To do so we convert the DFBETAS matrix to a data frame and add an observation id:

class(dfhp1)
## [1] "matrix"
df2hp1 <- as.data.frame(dfhp1)

#  add an id variable
df2hp1$id <- 1:414 #  generate a new observation number

#  head(..., 1) returns the single most extreme observation after sorting
#  syntax - head(data_set[with(data_set, order(+/-variable)), ], 1)

#  Longitude
head(df2hp1[with(df2hp1, order(-Longitude)), ], 1) # order declining
##     (Intercept) TransactionDate  HouseAge    Latitude Longitude DistanceMRT
## 245  -0.1151602      0.02422121 0.1170819 -0.05293354 0.1292413  0.05578565
##     NumberConvenienceStores  id
## 245               0.0234964 245
head(df2hp1[with(df2hp1, order(+Longitude)), ], 1) # order increasing
##     (Intercept) TransactionDate   HouseAge   Latitude  Longitude DistanceMRT
## 271   0.3738488        0.384242 -0.2429588 0.05812468 -0.6436714  -0.8943845
##     NumberConvenienceStores  id
## 271              -0.8991347 271
#Latitude
head(df2hp1[with(df2hp1, order(-Latitude)), ], 1) # order declining
##     (Intercept) TransactionDate    HouseAge  Latitude  Longitude DistanceMRT
## 229  -0.1370148      0.03567835 -0.06047111 0.3361963 0.07492576   0.2016581
##     NumberConvenienceStores  id
## 229             -0.07306714 229
head(df2hp1[with(df2hp1, order(+Latitude)), ], 1) # order increasing
##     (Intercept) TransactionDate      HouseAge   Latitude Longitude DistanceMRT
## 149 -0.08308603       0.2205801 -0.0004957044 -0.3569275 0.0455112  0.07767458
##     NumberConvenienceStores  id
## 149              0.01839568 149
#NumberConvenienceStores
head(df2hp1[with(df2hp1, order(-NumberConvenienceStores)), ], 1) # order declining
##     (Intercept) TransactionDate  HouseAge   Latitude   Longitude  DistanceMRT
## 313  -0.1207385       0.2925673 0.2835299 -0.1391709 0.009406094 -0.005841154
##     NumberConvenienceStores  id
## 313               0.2884421 313
head(df2hp1[with(df2hp1, order(+NumberConvenienceStores)), ], 1) # order increasing
##     (Intercept) TransactionDate   HouseAge   Latitude  Longitude DistanceMRT
## 271   0.3738488        0.384242 -0.2429588 0.05812468 -0.6436714  -0.8943845
##     NumberConvenienceStores  id
## 271              -0.8991347 271
#DistanceMRT
head(df2hp1[with(df2hp1, order(-DistanceMRT)), ], 1) # order declining
##     (Intercept) TransactionDate    HouseAge  Latitude  Longitude DistanceMRT
## 229  -0.1370148      0.03567835 -0.06047111 0.3361963 0.07492576   0.2016581
##     NumberConvenienceStores  id
## 229             -0.07306714 229
head(df2hp1[with(df2hp1, order(+DistanceMRT)), ], 1) # order increasing
##     (Intercept) TransactionDate   HouseAge   Latitude  Longitude DistanceMRT
## 271   0.3738488        0.384242 -0.2429588 0.05812468 -0.6436714  -0.8943845
##     NumberConvenienceStores  id
## 271              -0.8991347 271
#HouseAge
head(df2hp1[with(df2hp1, order(-HouseAge)), ], 1) # order declining
##      (Intercept) TransactionDate  HouseAge   Latitude    Longitude DistanceMRT
## 390 0.0005917871      0.06280159 0.3172411 -0.1614731 -0.003625267 -0.06477414
##     NumberConvenienceStores  id
## 390                0.159074 390
head(df2hp1[with(df2hp1, order(+HouseAge)), ], 1) # order increasing
##     (Intercept) TransactionDate   HouseAge   Latitude  Longitude DistanceMRT
## 271   0.3738488        0.384242 -0.2429588 0.05812468 -0.6436714  -0.8943845
##     NumberConvenienceStores  id
## 271              -0.8991347 271
#TransactionDate
head(df2hp1[with(df2hp1, order(-TransactionDate)), ], 1) # order declining
##     (Intercept) TransactionDate   HouseAge   Latitude  Longitude DistanceMRT
## 271   0.3738488        0.384242 -0.2429588 0.05812468 -0.6436714  -0.8943845
##     NumberConvenienceStores  id
## 271              -0.8991347 271
head(df2hp1[with(df2hp1, order(+TransactionDate)), ], 1) # order increasing
##     (Intercept) TransactionDate   HouseAge Latitude  Longitude DistanceMRT
## 114 -0.03535736      -0.1581897 0.04081476 0.259526 0.07561799   0.1632732
##     NumberConvenienceStores  id
## 114              -0.1039386 114

Referencing the maximum absolute DFBETAS value we found earlier (-0.899), we can see that it belongs to row 271, in the NumberConvenienceStores column; row 271 also exceeds the cutoff for DistanceMRT, Longitude, TransactionDate, HouseAge, and the intercept.

dfhp1[abs(dfhp1) == max(abs(dfhp1))] 
##       <NA> 
## -0.8991347
RealEstate[c(271), c(2,3,4,5,6,7,8)]
##     TransactionDate HouseAge DistanceMRT NumberConvenienceStores Latitude
## 271        2013.333     10.8    252.5822                       1  24.9746
##     Longitude HousePriceUnitArea
## 271  121.5305              117.5

From here we can see our most influential observation and can decide whether or not it is distorting the model and should be handled separately.
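
One extra check, not part of the original write-up, is Cook's distance, which summarizes each observation's overall influence on the fitted model; a minimal sketch comparing row 271 with the common 4/n rule of thumb:

cooks.distance(hp1)[271]  # overall influence of observation 271
4 / nrow(RealEstate)      # common rule-of-thumb cutoff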

Our final test is the test for multicollinearity. This tells us how strongly the independent variables in the model are correlated with one another.

RealEstate %>%
  dplyr::select(HouseAge, TransactionDate, DistanceMRT, NumberConvenienceStores, Latitude, Longitude) %>%
  na.omit() %>%
  data.frame() %>%
  cor()
##                            HouseAge TransactionDate DistanceMRT
## HouseAge                 1.00000000     0.017548767  0.02562205
## TransactionDate          0.01754877     1.000000000  0.06087995
## DistanceMRT              0.02562205     0.060879953  1.00000000
## NumberConvenienceStores  0.04959251     0.009635445 -0.60251914
## Latitude                 0.05441990     0.035057756 -0.59106657
## Longitude               -0.04852005    -0.041081778 -0.80631677
##                         NumberConvenienceStores    Latitude   Longitude
## HouseAge                            0.049592513  0.05441990 -0.04852005
## TransactionDate                     0.009635445  0.03505776 -0.04108178
## DistanceMRT                        -0.602519145 -0.59106657 -0.80631677
## NumberConvenienceStores             1.000000000  0.44414331  0.44909901
## Latitude                            0.444143306  1.00000000  0.41292394
## Longitude                           0.449099007  0.41292394  1.00000000

Looking at this correlation matrix, most of the pairwise correlations are modest; the one to watch is the -0.81 correlation between DistanceMRT and Longitude, since off-diagonal values close to 1 or -1 signal potential collinearity problems.

We will now check the variance inflation factor and tolerance.

library(car)
vif(hp1)
##         TransactionDate                HouseAge                Latitude 
##                1.014674                1.014287                1.610234 
##               Longitude             DistanceMRT NumberConvenienceStores 
##                2.926302                4.323019                1.617038
1/vif(hp1)
##         TransactionDate                HouseAge                Latitude 
##               0.9855386               0.9859145               0.6210276 
##               Longitude             DistanceMRT NumberConvenienceStores 
##               0.3417282               0.2313198               0.6184148

For the VIF, values below 5 are generally considered acceptable, which is the case here. Tolerance (1/VIF) uses sample-size-dependent rules of thumb: for n < 50 it should exceed 0.7; for n < 300, 0.5; for n < 600, 0.3; and for n < 1000, 0.1.

Our number of observations is 414, so tolerance should exceed 0.3, but it does not for DistanceMRT (0.23), telling us that we do in fact have some multicollinearity involving that variable.
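
To flag the offending predictors programmatically (a small addition, not in the original write-up):

tol <- 1 / vif(hp1)  # tolerance = 1 / VIF
tol[tol < 0.3]       # predictors below the 0.3 threshold for this sample size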

In summary, this data set may not have been the best choice for a simple linear model: several diagnostic assumptions were violated (non-linearity, some heteroscedasticity, clearly non-normal and right-skewed residuals, an influential observation, and some multicollinearity), and the data or the model specification would need to be changed before the results could be trusted.
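
As one hedged illustration of the kind of respecification those changes might involve (a sketch, not a recommendation from the original analysis), we could model the log of the price to tame the right-skewed residuals and keep a squared DistanceMRT term, which the earlier diagnostics supported, and then re-run the diagnostics above on the new fit:

# hypothetical refit: log response plus a quadratic DistanceMRT term (hp3 is an illustrative name)
hp3 <- lm(log(HousePriceUnitArea) ~ TransactionDate + HouseAge + Latitude + Longitude +
            DistanceMRT + I(DistanceMRT^2) + NumberConvenienceStores, data = RealEstate)
summary(hp3)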