1.0 Introduction

This course project involves the use of a specific dataset (selected by the student and approved by the instructor) to demonstrate an understanding of the concepts learnt in class. I have selected the Real Estate Valuation Dataset, which is available at: https://archive.ics.uci.edu/ml/datasets/Real+estate+valuation+data+set# It contains historical market data on real estate valuations collected from Sindian District, New Taipei City, Taiwan. This is a regression problem, as the outcome variable is continuous.

There are 6 predictor variables and 1 outcome variable for each observation. Each observation represents the valuation record of a house.

All the variables are numeric and quantitative in nature. Specific details about the variables are given below.

The predictors are:

* X1 = the transaction date
* X2 = the house age
* X3 = the distance to the nearest MRT station
* X4 = the number of convenience stores in the living circle on foot
* X5 = the geographic coordinate, latitude
* X6 = the geographic coordinate, longitude

The outcome is:

* Y = house price of unit area (10,000 New Taiwan Dollars/Ping, where Ping is a local unit of area; 1 Ping = 3.3 square meters)

As listed above, the outcome variable is labelled 'Y'. It is the value of the house that we aim to predict from the predictor variables.

Because of my experience helping a friend in her quest to own a house, I can appreciate the listed variables. The time of year in which a house is sold has an impact on its value; the realtor we worked with at the time was of the opinion that houses are cheaper around July/August. The age of the house is without doubt important; newer houses tend to be more expensive. Nearness to public transport matters to buyers who will commute to work by train, so it also has an impact on the value of a house. The number of convenience stores around is a selling point, especially for those who don't buy things in bulk. The geographic coordinates may be an important factor too, as some areas have peculiarities that make them desirable or otherwise; areas with high crime rates, for instance, will likely see that negative factor depress the value of nearby properties.

It turns out that "X3, the distance to the nearest MRT station" and "X4, the number of convenience stores in the living circle on foot" were both derived from the geographic coordinates X5 and X6 [1]. So I'd say all the predictor variables are potentially useful for predicting the outcome.

2.0 Introduction to R

2.1 Loading Data

Here we load the data and remove the "No" variable, which is not among the predictors:

library(xlsx)
## Warning: package 'xlsx' was built under R version 3.5.2
valuation <-  read.xlsx('Real estate valuation data set.xlsx', sheetName = 'sheet1')
valuation <- valuation[,-1]
head(valuation)
##   X1.transaction.date X2.house.age X3.distance.to.the.nearest.MRT.station
## 1            2012.917         32.0                               84.87882
## 2            2012.917         19.5                              306.59470
## 3            2013.583         13.3                              561.98450
## 4            2013.500         13.3                              561.98450
## 5            2012.833          5.0                              390.56840
## 6            2012.667          7.1                             2175.03000
##   X4.number.of.convenience.stores X5.latitude X6.longitude
## 1                              10    24.98298     121.5402
## 2                               9    24.98034     121.5395
## 3                               5    24.98746     121.5439
## 4                               5    24.98746     121.5439
## 5                               5    24.97937     121.5425
## 6                               3    24.96305     121.5125
##   Y.house.price.of.unit.area
## 1                       37.9
## 2                       42.2
## 3                       47.3
## 4                       54.8
## 5                       43.1
## 6                       32.1

We do some simple inspection of the data:

dim(valuation)
## [1] 414   7
names(valuation)
## [1] "X1.transaction.date"                   
## [2] "X2.house.age"                          
## [3] "X3.distance.to.the.nearest.MRT.station"
## [4] "X4.number.of.convenience.stores"       
## [5] "X5.latitude"                           
## [6] "X6.longitude"                          
## [7] "Y.house.price.of.unit.area"
summary(valuation)
##  X1.transaction.date  X2.house.age   
##  Min.   :2013        Min.   : 0.000  
##  1st Qu.:2013        1st Qu.: 9.025  
##  Median :2013        Median :16.100  
##  Mean   :2013        Mean   :17.713  
##  3rd Qu.:2013        3rd Qu.:28.150  
##  Max.   :2014        Max.   :43.800  
##  X3.distance.to.the.nearest.MRT.station X4.number.of.convenience.stores
##  Min.   :  23.38                        Min.   : 0.000                 
##  1st Qu.: 289.32                        1st Qu.: 1.000                 
##  Median : 492.23                        Median : 4.000                 
##  Mean   :1083.89                        Mean   : 4.094                 
##  3rd Qu.:1454.28                        3rd Qu.: 6.000                 
##  Max.   :6488.02                        Max.   :10.000                 
##   X5.latitude     X6.longitude   Y.house.price.of.unit.area
##  Min.   :24.93   Min.   :121.5   Min.   :  7.60            
##  1st Qu.:24.96   1st Qu.:121.5   1st Qu.: 27.70            
##  Median :24.97   Median :121.5   Median : 38.45            
##  Mean   :24.97   Mean   :121.5   Mean   : 37.98            
##  3rd Qu.:24.98   3rd Qu.:121.5   3rd Qu.: 46.60            
##  Max.   :25.01   Max.   :121.6   Max.   :117.50
attach(valuation)
summary(X2.house.age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   9.025  16.100  17.713  28.150  43.800

2.2 Additional Graphs and Numerical Summaries

Here we use a graphical approach to get a better understanding of the dataset. The first output is a scatterplot of the location coordinates, while the second output presents a histogram showing the distribution of house prices; houses between 40 and 50 units of Y.house.price.of.unit.area have the highest frequency. Lastly, we have a pairs plot of all the variables:

plot(X5.latitude,X6.longitude)

hist(Y.house.price.of.unit.area)

pairs(valuation)

Each of X6, X5 and X4 seems to have a positive correlation with Y. X3 looks negatively correlated, while X2 and X1 will require further inspection to know exactly what is going on.

pairs(~X2.house.age+X3.distance.to.the.nearest.MRT.station+X4.number.of.convenience.stores+Y.house.price.of.unit.area , valuation)
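
To put rough numbers on the pairwise associations seen in the plots, the full correlation matrix can also be computed (a minimal sketch; the rounding is only for readability):

round(cor(valuation), 2)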

3.0 Linear Regression

3.1 Simple Linear Regression

From the pair plot it appears that X3, the distance to the nearest MRT station, has a relationship with Y.house.price.of.unit.area that looks quadratic, so I'll take a quick look at a simple linear regression on this predictor.

lm.fit = lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station)
lm.fit
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station)
## 
## Coefficients:
##                            (Intercept)  
##                              45.851427  
## X3.distance.to.the.nearest.MRT.station  
##                              -0.007262
summary(lm.fit)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.396  -6.007  -1.195   4.831  73.483 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            45.8514271  0.6526105   70.26
## X3.distance.to.the.nearest.MRT.station -0.0072621  0.0003925  -18.50
##                                        Pr(>|t|)    
## (Intercept)                              <2e-16 ***
## X3.distance.to.the.nearest.MRT.station   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 412 degrees of freedom
## Multiple R-squared:  0.4538, Adjusted R-squared:  0.4524 
## F-statistic: 342.2 on 1 and 412 DF,  p-value: < 2.2e-16

The predictor X3 yields an \({R}^2\) of 0.45, meaning this single variable explains about 45% of the variance in the outcome.

We'll make predictions for a few values of X3 and plot the linear regression fit using abline():

confint(lm.fit)
##                                               2.5 %       97.5 %
## (Intercept)                            44.568565390 47.134288726
## X3.distance.to.the.nearest.MRT.station -0.008033701 -0.006490402
predict(lm.fit,data.frame(X3.distance.to.the.nearest.MRT.station=c(4,16,32)),interval = "confidence")
##        fit      lwr      upr
## 1 45.82238 44.54153 47.10323
## 2 45.73523 44.46039 47.01008
## 3 45.61904 44.35214 46.88594
predict(lm.fit,data.frame(X3.distance.to.the.nearest.MRT.station=c(4,16,32)),interval = "prediction")
##        fit      lwr      upr
## 1 45.82238 25.98886 65.65590
## 2 45.73523 25.90210 65.56837
## 3 45.61904 25.78642 65.45166
plot(X3.distance.to.the.nearest.MRT.station,Y.house.price.of.unit.area)
abline(lm.fit)

par(mfrow=c(2,2))
plot(lm.fit)

This clearly does not look like a great fit; we will see later that some non-linear approaches can produce a better fit.

3.2 Multiple Linear Regression

We first add one more predictor to the simple linear regression, say X2.house.age:

lm.fit=lm(Y.house.price.of.unit.area~X2.house.age+X3.distance.to.the.nearest.MRT.station)
summary(lm.fit)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X2.house.age + X3.distance.to.the.nearest.MRT.station)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.032  -4.742  -1.037   4.533  71.930 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            49.8855858  0.9677644  51.547
## X2.house.age                           -0.2310266  0.0420383  -5.496
## X3.distance.to.the.nearest.MRT.station -0.0072086  0.0003795 -18.997
##                                        Pr(>|t|)    
## (Intercept)                             < 2e-16 ***
## X2.house.age                           6.84e-08 ***
## X3.distance.to.the.nearest.MRT.station  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.73 on 411 degrees of freedom
## Multiple R-squared:  0.4911, Adjusted R-squared:  0.4887 
## F-statistic: 198.3 on 2 and 411 DF,  p-value: < 2.2e-16

With just one added variable, the R-squared increased to 0.49. We can now attempt to use all the predictors:

lm.fit=lm(Y.house.price.of.unit.area~., data = valuation)
summary(lm.fit)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ ., data = valuation)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.667  -5.412  -0.967   4.217  75.190 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            -1.444e+04  6.775e+03  -2.132
## X1.transaction.date                     5.149e+00  1.557e+00   3.307
## X2.house.age                           -2.697e-01  3.853e-02  -7.000
## X3.distance.to.the.nearest.MRT.station -4.488e-03  7.180e-04  -6.250
## X4.number.of.convenience.stores         1.133e+00  1.882e-01   6.023
## X5.latitude                             2.255e+02  4.457e+01   5.059
## X6.longitude                           -1.243e+01  4.858e+01  -0.256
##                                        Pr(>|t|)    
## (Intercept)                             0.03364 *  
## X1.transaction.date                     0.00103 ** 
## X2.house.age                           1.06e-11 ***
## X3.distance.to.the.nearest.MRT.station 1.04e-09 ***
## X4.number.of.convenience.stores        3.83e-09 ***
## X5.latitude                            6.38e-07 ***
## X6.longitude                            0.79820    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.858 on 407 degrees of freedom
## Multiple R-squared:  0.5824, Adjusted R-squared:  0.5762 
## F-statistic:  94.6 on 6 and 407 DF,  p-value: < 2.2e-16

This pushes the \({R}^2\) to 0.58. We can also see from the p-values that X6.longitude is not significant, while X1.transaction.date is not as significant as the other predictors. It comes as no surprise that X6.longitude is not significant because, as stated earlier, the geographic coordinates X5 and X6 were used to derive X3 (the distance to the nearest MRT station) and X4 (the number of convenience stores in the living circle on foot), so there is likely some correlation among these variables, which explains why X6 might not add much on its own.

cor(X3.distance.to.the.nearest.MRT.station,X5.latitude)
## [1] -0.5910666
cor(X3.distance.to.the.nearest.MRT.station,X6.longitude)
## [1] -0.8063168
cor(X4.number.of.convenience.stores,X5.latitude)
## [1] 0.4441433
cor(X4.number.of.convenience.stores,X6.longitude)
## [1] 0.449099

These correlations are somewhat substantial, especially the negative correlations between X3 and X6 and between X3 and X5. It might be better to leave out X5 and X6, so I will create a valuationSub dataset without these two variables. I will, however, still use the full valuation dataset in cases where I wish to show how certain statistical methods react to correlated variables.
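
Another compact way to check this multicollinearity (a sketch, assuming the car package is installed) is through variance inflation factors on the full model:

library(car)
vif(lm(Y.house.price.of.unit.area ~ ., data = valuation))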

valuationSub <- valuation[,c(-5,-6)]
lm.fit=lm(Y.house.price.of.unit.area~., data = valuationSub)
summary(lm.fit)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ ., data = valuationSub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.392  -5.630  -0.985   4.304  76.004 
## 
## Coefficients:
##                                          Estimate Std. Error t value
## (Intercept)                            -1.159e+04  3.214e+03  -3.607
## X1.transaction.date                     5.780e+00  1.597e+00   3.620
## X2.house.age                           -2.545e-01  3.953e-02  -6.438
## X3.distance.to.the.nearest.MRT.station -5.513e-03  4.480e-04 -12.305
## X4.number.of.convenience.stores         1.258e+00  1.918e-01   6.559
##                                        Pr(>|t|)    
## (Intercept)                            0.000348 ***
## X1.transaction.date                    0.000331 ***
## X2.house.age                           3.40e-10 ***
## X3.distance.to.the.nearest.MRT.station  < 2e-16 ***
## X4.number.of.convenience.stores        1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.118 on 409 degrees of freedom
## Multiple R-squared:  0.5553, Adjusted R-squared:  0.551 
## F-statistic: 127.7 on 4 and 409 DF,  p-value: < 2.2e-16

Fitting a linear regression model on the 4 remaining predictors still records an \({R}^2\) of 0.55, which is not bad at all. Instead of refitting the model on the valuationSub data as I did above, I could also use the update() function on the earlier full-data model and simply drop the terms I no longer want:

lm.fit.full = lm(Y.house.price.of.unit.area ~ ., data = valuation)  # the full model from section 3.2
lm.fit1 = update(lm.fit.full, ~ . -X5.latitude -X6.longitude)

3.3 Interacting Terms

To demonstrate interaction terms, I'm guessing there might be some interaction effect between X3 and X4: if people value nearness to an MRT station, they are likely to also want convenience stores near them:

summary(lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station*X4.number.of.convenience.stores, data = valuationSub))
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station * 
##     X4.number.of.convenience.stores, data = valuationSub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.417  -5.747  -0.844   4.856  78.205 
## 
## Coefficients:
##                                                                          Estimate
## (Intercept)                                                            39.1439079
## X3.distance.to.the.nearest.MRT.station                                 -0.0045362
## X4.number.of.convenience.stores                                         1.6328694
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores -0.0013310
##                                                                        Std. Error
## (Intercept)                                                             1.2605696
## X3.distance.to.the.nearest.MRT.station                                  0.0005008
## X4.number.of.convenience.stores                                         0.2137367
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores  0.0002572
##                                                                        t value
## (Intercept)                                                             31.053
## X3.distance.to.the.nearest.MRT.station                                  -9.058
## X4.number.of.convenience.stores                                          7.640
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores  -5.176
##                                                                        Pr(>|t|)
## (Intercept)                                                             < 2e-16
## X3.distance.to.the.nearest.MRT.station                                  < 2e-16
## X4.number.of.convenience.stores                                        1.55e-13
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores 3.56e-07
##                                                                           
## (Intercept)                                                            ***
## X3.distance.to.the.nearest.MRT.station                                 ***
## X4.number.of.convenience.stores                                        ***
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.388 on 410 degrees of freedom
## Multiple R-squared:  0.5274, Adjusted R-squared:  0.524 
## F-statistic: 152.5 on 3 and 410 DF,  p-value: < 2.2e-16

With an \({R}^2\) of 0.52 and a highly significant interaction term, the interaction between these two predictors seems meaningful. One would be curious to see the effect when we have other predictors in the mix:

summary(lm(Y.house.price.of.unit.area~X1.transaction.date+X2.house.age+X3.distance.to.the.nearest.MRT.station*X4.number.of.convenience.stores, data = valuationSub))
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X1.transaction.date + 
##     X2.house.age + X3.distance.to.the.nearest.MRT.station * X4.number.of.convenience.stores, 
##     data = valuationSub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.362  -5.144  -0.706   4.016  75.519 
## 
## Coefficients:
##                                                                          Estimate
## (Intercept)                                                            -1.233e+04
## X1.transaction.date                                                     6.145e+00
## X2.house.age                                                           -2.557e-01
## X3.distance.to.the.nearest.MRT.station                                 -4.441e-03
## X4.number.of.convenience.stores                                         1.707e+00
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores -1.379e-03
##                                                                        Std. Error
## (Intercept)                                                             3.099e+03
## X1.transaction.date                                                     1.539e+00
## X2.house.age                                                            3.808e-02
## X3.distance.to.the.nearest.MRT.station                                  4.704e-04
## X4.number.of.convenience.stores                                         2.007e-01
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores  2.408e-04
##                                                                        t value
## (Intercept)                                                             -3.978
## X1.transaction.date                                                      3.992
## X2.house.age                                                            -6.716
## X3.distance.to.the.nearest.MRT.station                                  -9.441
## X4.number.of.convenience.stores                                          8.505
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores  -5.727
##                                                                        Pr(>|t|)
## (Intercept)                                                            8.23e-05
## X1.transaction.date                                                    7.78e-05
## X2.house.age                                                           6.29e-11
## X3.distance.to.the.nearest.MRT.station                                  < 2e-16
## X4.number.of.convenience.stores                                        3.48e-16
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores 1.98e-08
##                                                                           
## (Intercept)                                                            ***
## X1.transaction.date                                                    ***
## X2.house.age                                                           ***
## X3.distance.to.the.nearest.MRT.station                                 ***
## X4.number.of.convenience.stores                                        ***
## X3.distance.to.the.nearest.MRT.station:X4.number.of.convenience.stores ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.783 on 408 degrees of freedom
## Multiple R-squared:  0.5884, Adjusted R-squared:  0.5834 
## F-statistic: 116.7 on 5 and 408 DF,  p-value: < 2.2e-16

Even though the interaction term remains significant, the \({R}^2\) did not improve much compared with the fit that used all the predictors, including X5 and X6. Compared with the fit using just X1, X2, X3 and X4, however, the effect of the interaction is definitely felt.

3.4 Non-linear Transformations of Predictors

Here we investigate the effect of introducing non-linearity into the model, especially since we found evidence of non-linearity earlier in this analysis:

lm.fit1=lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station+I(X3.distance.to.the.nearest.MRT.station^2), data = valuationSub)
summary(lm.fit1)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station + 
##     I(X3.distance.to.the.nearest.MRT.station^2), data = valuationSub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.354  -4.939  -0.906   4.415  71.526 
## 
## Coefficients:
##                                               Estimate Std. Error t value
## (Intercept)                                  4.978e+01  7.826e-01  63.609
## X3.distance.to.the.nearest.MRT.station      -1.553e-02  1.100e-03 -14.126
## I(X3.distance.to.the.nearest.MRT.station^2)  1.822e-06  2.284e-07   7.976
##                                             Pr(>|t|)    
## (Intercept)                                  < 2e-16 ***
## X3.distance.to.the.nearest.MRT.station       < 2e-16 ***
## I(X3.distance.to.the.nearest.MRT.station^2) 1.51e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.381 on 411 degrees of freedom
## Multiple R-squared:  0.527,  Adjusted R-squared:  0.5247 
## F-statistic: 228.9 on 2 and 411 DF,  p-value: < 2.2e-16
lm.fit = lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = valuationSub)
anova(lm.fit,lm.fit1)
## Analysis of Variance Table
## 
## Model 1: Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station
## Model 2: Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station + 
##     I(X3.distance.to.the.nearest.MRT.station^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    412 41767                                 
## 2    411 36168  1    5598.6 63.62 1.513e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here the F-statistic is 63.62 and the associated p-value is virtually zero. This provides very clear evidence that the model containing the predictors X3 and X3^2 is far superior to the model that contains only X3. This is not surprising, since earlier we saw evidence of non-linearity in the relationship between Y and X3.

par(mfrow=c(2,2))
plot(lm.fit1)

Comparing the plots above to those made for lm.fit, we can see there is now little discernible pattern in the residuals. Below I attempt a cubic polynomial and a log transformation of the predictor to see what changes:

lm.fit3=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,3))
summary(lm.fit3)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ poly(X3.distance.to.the.nearest.MRT.station, 
##     3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.278  -4.955  -0.687   4.188  70.680 
## 
## Coefficients:
##                                                   Estimate Std. Error
## (Intercept)                                        37.9802     0.4471
## poly(X3.distance.to.the.nearest.MRT.station, 3)1 -186.2651     9.0972
## poly(X3.distance.to.the.nearest.MRT.station, 3)2   74.8237     9.0972
## poly(X3.distance.to.the.nearest.MRT.station, 3)3  -47.2930     9.0972
##                                                  t value Pr(>|t|)    
## (Intercept)                                       84.947  < 2e-16 ***
## poly(X3.distance.to.the.nearest.MRT.station, 3)1 -20.475  < 2e-16 ***
## poly(X3.distance.to.the.nearest.MRT.station, 3)2   8.225 2.61e-15 ***
## poly(X3.distance.to.the.nearest.MRT.station, 3)3  -5.199 3.17e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.097 on 410 degrees of freedom
## Multiple R-squared:  0.5562, Adjusted R-squared:  0.553 
## F-statistic: 171.3 on 3 and 410 DF,  p-value: < 2.2e-16
summary(lm(Y.house.price.of.unit.area~log(X3.distance.to.the.nearest.MRT.station)))
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ log(X3.distance.to.the.nearest.MRT.station))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.103  -5.023  -0.827   3.449  71.846 
## 
## Coefficients:
##                                             Estimate Std. Error t value
## (Intercept)                                  95.0169     2.6369   36.03
## log(X3.distance.to.the.nearest.MRT.station)  -8.9235     0.4064  -21.96
##                                             Pr(>|t|)    
## (Intercept)                                   <2e-16 ***
## log(X3.distance.to.the.nearest.MRT.station)   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.247 on 412 degrees of freedom
## Multiple R-squared:  0.5393, Adjusted R-squared:  0.5381 
## F-statistic: 482.2 on 1 and 412 DF,  p-value: < 2.2e-16
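
As a sanity check on whether the cubic term adds much beyond the quadratic, the two nested fits can again be compared with anova() (a minimal sketch reusing lm.fit1, the quadratic fit, and lm.fit3, the cubic fit, from above):

anova(lm.fit1, lm.fit3)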

5.0 Cross-Validation and the Bootstrap

5.1 The Validation Set Approach

To do this we split the dataset into a training set and a validation set. Here we fit the model using the training set:

set.seed(1)
train=sample(414,207)
lm.fit = lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = valuation, subset = train)

I now make predictions on the other half of the split and compute the mean squared error (MSE).

mean((Y.house.price.of.unit.area-predict(lm.fit, valuationSub))[-train]^2)
## [1] 119.7758

This means the estimated test MSE for the linear regression is 119.776. Below we compute the MSE for models with polynomial terms using the validation set approach:

lm.fit2=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,2), data= valuationSub, subset=train)
mean((Y.house.price.of.unit.area-predict(lm.fit2, valuation))[-train]^2)
## [1] 106.1868
lm.fit3=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,3), data= valuationSub, subset=train)
mean((Y.house.price.of.unit.area-predict(lm.fit3, valuation))[-train]^2)
## [1] 98.81372

These errors are 106.187 and 98.814 for the quadratic and cubic fits, respectively. If I chose a different training set, I would obtain somewhat different errors on the validation set because of this approach's variability:

set.seed(2)
train=sample(414,207)
lm.fit = lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = valuationSub, subset = train)
mean((Y.house.price.of.unit.area-predict(lm.fit, valuation))[-train]^2)
## [1] 83.50401
lm.fit2=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,2), data= valuationSub, subset=train)
mean((Y.house.price.of.unit.area-predict(lm.fit2, valuation))[-train]^2)
## [1] 71.53812
lm.fit3=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,3), data= valuationSub, subset=train)
mean((Y.house.price.of.unit.area-predict(lm.fit3, valuation))[-train]^2)
## [1] 65.48127

Here I recorded 83.504, 71.538, and 65.481 for the linear, quadratic and cubic fits, respectively. Even though the MSEs from the first split are larger than those from the second split, the results are still consistent with the earlier finding that a model predicting Y with a quadratic function of X3 performs better than a model involving only a linear function of X3.

5.2 Leave-One-Out Cross Validation

The Leave-One-Out Cross-Validation (LOOCV) approach uses k = n folds, where n is the number of observations. I am going to use a combination of the glm() and cv.glm() functions. Below is a fit with glm() without the family argument, which is equivalent to lm():

glm.fit = glm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = valuationSub)
coef(glm.fit)
##                            (Intercept) 
##                           45.851427058 
## X3.distance.to.the.nearest.MRT.station 
##                           -0.007262052

and remember

lm.fit = lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data=valuationSub)
coef(lm.fit)
##                            (Intercept) 
##                           45.851427058 
## X3.distance.to.the.nearest.MRT.station 
##                           -0.007262052

The results are exactly the same.

library(boot)

cv.err=cv.glm(valuationSub,glm.fit)
cv.err$delta
## [1] 101.8171 101.8160

The two numbers 101.8171 and 101.8160 in the delta vector contain the cross-validation results (the second is a bias-corrected version) and, as we can see, agree to two decimal places. I can repeat this procedure for increasing polynomial degrees; a for loop automates this:

cv.error=rep(0,5)
for (i in 1:5) {
  glm.fit=glm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station, i), data=valuationSub)
  cv.error[i]=cv.glm(valuationSub,glm.fit)$delta[1]
}
cv.error
## [1] 101.81709  88.70438  83.09502  81.12731  81.60174

Notice that there is a sharp drop in the estimated test MSE going from the linear to the quadratic fit, but little further improvement with higher-order polynomials.
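
A quick plot of these errors (a minimal sketch) makes the flattening after the quadratic fit easy to see:

plot(1:5, cv.error, type = "b", xlab = "Degree of polynomial", ylab = "LOOCV estimate of test MSE")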

5.3 K-fold Cross Validation

Here I set K=10 for the run of cv.glm(), again on the valuationSub dataset:

set.seed(15)
cv.error.10=rep(0,10)
for (i in 1:10) {
  glm.fit=glm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station, i), data=valuationSub)
  cv.error.10[i]=cv.glm(valuationSub,glm.fit,K=10)$delta[1]
}
cv.error.10
##  [1] 101.86857  88.47310  84.14043  80.85301  81.47782  80.12462  80.91027
##  [8]  80.75981  82.40350  81.23460

There is no clear evidence that going beyond the cubic fit improves performance as far as the estimated MSE is concerned.
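
For reference (a minimal sketch), which.min() reports the polynomial degree with the lowest 10-fold CV error:

which.min(cv.error.10)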

5.4 The Bootstrap

Here we use the bootstrap to assess the variability of the coefficient estimates (the intercept and slope terms) of the model that uses "X3.distance.to.the.nearest.MRT.station" to predict "Y.house.price.of.unit.area" in the valuationSub dataset.

boot.fn = function(data, index)
  return(coef(lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = data, subset = index)))
boot.fn(valuationSub,1:414)
##                            (Intercept) 
##                           45.851427058 
## X3.distance.to.the.nearest.MRT.station 
##                           -0.007262052

I now sample instances randomly with replacement in order to create bootstrap estimates for the intercept and slope terms:

set.seed(1)
boot.fn(valuationSub, sample(414,414, replace = T))
##                            (Intercept) 
##                           45.385024134 
## X3.distance.to.the.nearest.MRT.station 
##                           -0.007128944

Next I use the boot() function to compute the standard errors of 1000 bootstrap estimates for the intercept and slope terms:

boot(valuationSub,boot.fn, R=1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = valuationSub, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##         original        bias     std. error
## t1* 45.851427058  2.474703e-02 0.7073580380
## t2* -0.007262052 -3.675892e-05 0.0003752961

This means the bootstrap estimate of the standard error (SE) of the intercept is about 0.71 and that of the slope term is about 0.00038. Since I can also get standard errors from the regression output of a linear model, it is desirable to compare the bootstrap estimates with those from the linear model.

summary(lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station, data = valuationSub))$coef
##                                            Estimate   Std. Error   t value
## (Intercept)                            45.851427058 0.6526105126  70.25849
## X3.distance.to.the.nearest.MRT.station -0.007262052 0.0003925495 -18.49971
##                                             Pr(>|t|)
## (Intercept)                            1.856440e-231
## X3.distance.to.the.nearest.MRT.station  4.639825e-56

Comparing the standard errors from the two approaches, I can see how close they are. Lastly, I compute the bootstrap standard error estimates and the standard linear regression estimates that result from fitting the quadratic model to the data:

boot.fn=function(data, index)
  coefficients(lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station + I(X3.distance.to.the.nearest.MRT.station^2), data = data, subset = index))
set.seed(1)
boot(valuationSub,boot.fn, 1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = valuationSub, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias     std. error
## t1*  4.978154e+01  1.154286e-01 8.937389e-01
## t2* -1.553404e-02 -2.559037e-04 1.166370e-03
## t3*  1.821995e-06  6.366975e-08 2.595740e-07
summary(lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station + I(X3.distance.to.the.nearest.MRT.station^2), data = valuationSub))$coef
##                                                  Estimate   Std. Error
## (Intercept)                                  4.978154e+01 7.826176e-01
## X3.distance.to.the.nearest.MRT.station      -1.553404e-02 1.099683e-03
## I(X3.distance.to.the.nearest.MRT.station^2)  1.821995e-06 2.284285e-07
##                                                t value      Pr(>|t|)
## (Intercept)                                  63.609020 7.581184e-215
## X3.distance.to.the.nearest.MRT.station      -14.125925  3.276742e-37
## I(X3.distance.to.the.nearest.MRT.station^2)   7.976217  1.513422e-14
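
As an optional extra (a sketch; boot.out is just an illustrative name for a stored boot() result), percentile confidence intervals for the coefficients could also be read off the bootstrap output with boot.ci():

boot.out <- boot(valuationSub, boot.fn, R = 1000)  # store the bootstrap replicates
boot.ci(boot.out, type = "perc", index = 2)        # interval for the linear X3 term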

6.0 Linear Model Selection and Regularization

6.1 Subset Selection Methods

Best Subset Selection

For subset selection I'll favour the dataset with more variables (valuation) to see how these methods react to the variables I kept out of valuationSub.

library(leaps)
## Warning: package 'leaps' was built under R version 3.5.2
regfit.full= regsubsets(Y.house.price.of.unit.area~., valuation)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(Y.house.price.of.unit.area ~ ., valuation)
## 6 Variables  (and intercept)
##                                        Forced in Forced out
## X1.transaction.date                        FALSE      FALSE
## X2.house.age                               FALSE      FALSE
## X3.distance.to.the.nearest.MRT.station     FALSE      FALSE
## X4.number.of.convenience.stores            FALSE      FALSE
## X5.latitude                                FALSE      FALSE
## X6.longitude                               FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          X1.transaction.date X2.house.age
## 1  ( 1 ) " "                 " "         
## 2  ( 1 ) " "                 " "         
## 3  ( 1 ) " "                 "*"         
## 4  ( 1 ) " "                 "*"         
## 5  ( 1 ) "*"                 "*"         
## 6  ( 1 ) "*"                 "*"         
##          X3.distance.to.the.nearest.MRT.station
## 1  ( 1 ) "*"                                   
## 2  ( 1 ) "*"                                   
## 3  ( 1 ) "*"                                   
## 4  ( 1 ) "*"                                   
## 5  ( 1 ) "*"                                   
## 6  ( 1 ) "*"                                   
##          X4.number.of.convenience.stores X5.latitude X6.longitude
## 1  ( 1 ) " "                             " "         " "         
## 2  ( 1 ) "*"                             " "         " "         
## 3  ( 1 ) "*"                             " "         " "         
## 4  ( 1 ) "*"                             "*"         " "         
## 5  ( 1 ) "*"                             "*"         " "         
## 6  ( 1 ) "*"                             "*"         "*"

Above is the output of running best subset selection on the valuation dataset. It is interesting that "X3.distance.to.the.nearest.MRT.station" is selected as the variable for the one-variable model; this is consistent with the observation I made from the pair plot earlier, which I have relied on so far in the analysis. I'll nevertheless repeat the previous step using the valuationSub dataset, just for comparison:

regfit.sub= regsubsets(Y.house.price.of.unit.area~., valuationSub)
summary(regfit.sub)
## Subset selection object
## Call: regsubsets.formula(Y.house.price.of.unit.area ~ ., valuationSub)
## 4 Variables  (and intercept)
##                                        Forced in Forced out
## X1.transaction.date                        FALSE      FALSE
## X2.house.age                               FALSE      FALSE
## X3.distance.to.the.nearest.MRT.station     FALSE      FALSE
## X4.number.of.convenience.stores            FALSE      FALSE
## 1 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          X1.transaction.date X2.house.age
## 1  ( 1 ) " "                 " "         
## 2  ( 1 ) " "                 " "         
## 3  ( 1 ) " "                 "*"         
## 4  ( 1 ) "*"                 "*"         
##          X3.distance.to.the.nearest.MRT.station
## 1  ( 1 ) "*"                                   
## 2  ( 1 ) "*"                                   
## 3  ( 1 ) "*"                                   
## 4  ( 1 ) "*"                                   
##          X4.number.of.convenience.stores
## 1  ( 1 ) " "                            
## 2  ( 1 ) "*"                            
## 3  ( 1 ) "*"                            
## 4  ( 1 ) "*"

Now we can see that X1.transaction.date only made it into the four-variable model, while for the one-variable model we again have X3.distance.to.the.nearest.MRT.station.

reg.summary= summary(regfit.full)
names(reg.summary)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

As seen above, the summary() function returns \({R}^2\), RSS, adjusted \({R}^2\), Cp and BIC. These can be examined in order to select the best overall model:

reg.summary$rsq
## [1] 0.4537543 0.4965684 0.5410633 0.5711352 0.5823179 0.5823850

Here we can see the \({R}^2\) improved from 45% for the one-variable model to 49% for the two-variable model, and then to 54% for the three-variable model. Plotting RSS, adjusted \({R}^2\), Cp, and BIC for all of the models can help in deciding which model to select:

par(mfrow =c(2,2))
plot(reg.summary$rss ,xlab=" Number of Variables ",ylab=" RSS", type="l")
plot(reg.summary$adjr2 ,xlab =" Number of Variables ", ylab=" Adjusted RSq",type="l")
plot(reg.summary$cp ,xlab=" Number of Variables ",ylab=" Cp", type="l")
plot(reg.summary$bic ,xlab =" Number of Variables ", ylab=" BIC",type="l")

Except for the unexpected inclusion of X5.latitude in the four-variable model, this analysis agrees with our understanding of the variables. I need to restate that even though X5.latitude is selected as one of the best 4 variables in the four-variable model, the variable has little meaning on its own and has already been used to derive other, more interpretable variables, so it is better left out of any model. With the which.max() function we can identify the position of the maximum of a vector. See the maximum point for adjusted \({R}^2\) below:

which.max(reg.summary$adjr2)
## [1] 5
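
By the same logic (a minimal sketch), which.min() gives the model sizes favoured by Cp and BIC:

which.min(reg.summary$cp)
which.min(reg.summary$bic)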

Alternatively, regsubsets() has a built-in plot() method which can be used to display the selected variables for the best model of each size:

plot(regfit.full ,scale ="r2")

plot(regfit.full ,scale ="adjr2")

plot(regfit.full ,scale ="Cp")

plot(regfit.full ,scale ="bic")

Forward and Backward Stepwise Selection

Again we use the regsubsets() function, but with the method argument set to "forward" or "backward":

regfit.fwd=regsubsets(Y.house.price.of.unit.area~.,data=valuation , method ="forward")
summary(regfit.fwd)
## Subset selection object
## Call: regsubsets.formula(Y.house.price.of.unit.area ~ ., data = valuation, 
##     method = "forward")
## 6 Variables  (and intercept)
##                                        Forced in Forced out
## X1.transaction.date                        FALSE      FALSE
## X2.house.age                               FALSE      FALSE
## X3.distance.to.the.nearest.MRT.station     FALSE      FALSE
## X4.number.of.convenience.stores            FALSE      FALSE
## X5.latitude                                FALSE      FALSE
## X6.longitude                               FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: forward
##          X1.transaction.date X2.house.age
## 1  ( 1 ) " "                 " "         
## 2  ( 1 ) " "                 " "         
## 3  ( 1 ) " "                 "*"         
## 4  ( 1 ) " "                 "*"         
## 5  ( 1 ) "*"                 "*"         
## 6  ( 1 ) "*"                 "*"         
##          X3.distance.to.the.nearest.MRT.station
## 1  ( 1 ) "*"                                   
## 2  ( 1 ) "*"                                   
## 3  ( 1 ) "*"                                   
## 4  ( 1 ) "*"                                   
## 5  ( 1 ) "*"                                   
## 6  ( 1 ) "*"                                   
##          X4.number.of.convenience.stores X5.latitude X6.longitude
## 1  ( 1 ) " "                             " "         " "         
## 2  ( 1 ) "*"                             " "         " "         
## 3  ( 1 ) "*"                             " "         " "         
## 4  ( 1 ) "*"                             "*"         " "         
## 5  ( 1 ) "*"                             "*"         " "         
## 6  ( 1 ) "*"                             "*"         "*"
regfit.bwd=regsubsets(Y.house.price.of.unit.area~.,data=valuation , method ="backward")
summary(regfit.bwd )
## Subset selection object
## Call: regsubsets.formula(Y.house.price.of.unit.area ~ ., data = valuation, 
##     method = "backward")
## 6 Variables  (and intercept)
##                                        Forced in Forced out
## X1.transaction.date                        FALSE      FALSE
## X2.house.age                               FALSE      FALSE
## X3.distance.to.the.nearest.MRT.station     FALSE      FALSE
## X4.number.of.convenience.stores            FALSE      FALSE
## X5.latitude                                FALSE      FALSE
## X6.longitude                               FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: backward
##          X1.transaction.date X2.house.age
## 1  ( 1 ) " "                 " "         
## 2  ( 1 ) " "                 " "         
## 3  ( 1 ) " "                 "*"         
## 4  ( 1 ) " "                 "*"         
## 5  ( 1 ) "*"                 "*"         
## 6  ( 1 ) "*"                 "*"         
##          X3.distance.to.the.nearest.MRT.station
## 1  ( 1 ) "*"                                   
## 2  ( 1 ) "*"                                   
## 3  ( 1 ) "*"                                   
## 4  ( 1 ) "*"                                   
## 5  ( 1 ) "*"                                   
## 6  ( 1 ) "*"                                   
##          X4.number.of.convenience.stores X5.latitude X6.longitude
## 1  ( 1 ) " "                             " "         " "         
## 2  ( 1 ) "*"                             " "         " "         
## 3  ( 1 ) "*"                             " "         " "         
## 4  ( 1 ) "*"                             "*"         " "         
## 5  ( 1 ) "*"                             "*"         " "         
## 6  ( 1 ) "*"                             "*"         "*"

To view the coefficients of the subset selection methods, we can call the coef() function on each of regfit.fwd, regfit.bwd and regfit.full:

coef(regfit.fwd,4)
##                            (Intercept) 
##                          -5.916006e+03 
##                           X2.house.age 
##                          -2.687192e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.175080e-03 
##        X4.number.of.convenience.stores 
##                           1.164781e+00 
##                            X5.latitude 
##                           2.386357e+02
coef(regfit.bwd,4)
##                            (Intercept) 
##                          -5.916006e+03 
##                           X2.house.age 
##                          -2.687192e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.175080e-03 
##        X4.number.of.convenience.stores 
##                           1.164781e+00 
##                            X5.latitude 
##                           2.386357e+02
coef(regfit.full,4)
##                            (Intercept) 
##                          -5.916006e+03 
##                           X2.house.age 
##                          -2.687192e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.175080e-03 
##        X4.number.of.convenience.stores 
##                           1.164781e+00 
##                            X5.latitude 
##                           2.386357e+02

Choosing among Models using the Validation Set Approach and Cross-Validation

It is important to use only the training observations when performing model selection, so that we get accurate estimates of the test error. I'll start by creating a train/test split:

set.seed(1)
train= sample(c(TRUE,FALSE), nrow(valuation), rep=TRUE)
test= (!train)

I’ll now apply regsubsets() to the training set and compute the validation set errors:

regfit.best = regsubsets(Y.house.price.of.unit.area ~ ., data= valuation[train, ] )
test.mat=model.matrix(Y.house.price.of.unit.area ~ ., data= valuation[test, ])

val.errors= rep(NA,6)
for (i in 1:6) {
  coefi=coef(regfit.best, id=i)
  pred=test.mat[,names(coefi)]%*%coefi
  val.errors[i]=mean((valuation$Y.house.price.of.unit.area[test]-pred)^2)
}
val.errors
## [1] 86.33732 76.85376 66.61968 60.86184 60.93020 61.10042
which.min(val.errors)
## [1] 4

We find that the best model contains 4 variables. To check the coefficients, I can run coef() on regfit.best:

coef(regfit.best,4)
##                            (Intercept) 
##                          -6.587165e+03 
##                           X2.house.age 
##                          -2.596989e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.366077e-03 
##        X4.number.of.convenience.stores 
##                           1.030686e+00 
##                            X5.latitude 
##                           2.655454e+02

This can be made less tedious using the user-defined function from the book, which mimics the steps I took above:

predict.regsubsets = function(object, newdata, id, ...) {
  form = as.formula(object$call[[2]])  # recover the formula used in the regsubsets() call
  mat = model.matrix(form, newdata)    # build the design matrix for the new data
  coefi = coef(object, id = id)        # coefficients of the id-variable model
  xvars = names(coefi)
  mat[, xvars] %*% coefi               # predictions
}

I am going to use this function to perform cross-validation. To start with, I run best subset selection on the full dataset and note once more the best 4-variable model.

regfit.best = regsubsets(Y.house.price.of.unit.area ~ ., data= valuation )
coef(regfit.best, 4)
##                            (Intercept) 
##                          -5.916006e+03 
##                           X2.house.age 
##                          -2.687192e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.175080e-03 
##        X4.number.of.convenience.stores 
##                           1.164781e+00 
##                            X5.latitude 
##                           2.386357e+02

This agrees with the best 4-variable model we found from the training data, though the coefficients are slightly different. Below we try to choose among models of different sizes using cross-validation:

k=10
set.seed(1)
folds=sample(1:k,nrow(valuation), replace = T)
cv.errors=matrix(NA,k,6, dimnames = list(NULL, paste(1:6)))
for (j in 1:k) {
  best.fit=regsubsets(Y.house.price.of.unit.area ~ ., data= valuation[folds!=j,])
  for (i in 1:6) {
    pred=predict(best.fit,valuation[folds==j,], id=i)
    cv.errors[j,i]= mean((valuation$Y.house.price.of.unit.area[folds==j]-pred)^2)
    
  }
}

mean.cv.errors=apply(cv.errors, 2, mean)
mean.cv.errors
##         1         2         3         4         5         6 
## 110.08357 109.06606  95.59084  89.94835  87.87536  88.09624
par(mfrow=c(1,1))
plot(mean.cv.errors, type = 'b')

From the cross-validation plot we can see that from the 5-variable model onwards the MSE does not really change.
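
Following the usual workflow (a minimal sketch), the coefficients for the model size chosen by cross-validation can be pulled from the full-data best subset object fitted earlier:

coef(regfit.best, which.min(mean.cv.errors))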

6.2 Ridge Regression and the Lasso

We start by separating the dataset into an x matrix of predictors and a y vector of outcomes:

x=model.matrix(Y.house.price.of.unit.area ~ ., data= valuation)[,-1]
y=valuation$Y.house.price.of.unit.area

Ridge Regression

The glmnet() function with alpha = 0 fits ridge regression; rather than relying on its automatically selected range of \(\lambda\) values, here we supply our own grid:

library(glmnet)
## Warning: package 'glmnet' was built under R version 3.5.2
## Loading required package: Matrix
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.5.2
## Loaded glmnet 2.0-16
grid= 10^seq(10,-2,length=100)
ridge.mod = glmnet(x,y,alpha = 0,lambda = grid)
plot(ridge.mod,xvar="lambda",label=T)

From the plot we can see that a log lambda of 10 keeps all the coefficients close to 0. We also notice the erratic behaviour of the variables labelled 5 and 6 between log lambda values of -5 and 5.

Note that glmnet() is also used for the lasso: setting the alpha parameter to 1 fits a lasso model, while, as seen above, alpha = 0 fits a ridge regression model.

dim(coef(ridge.mod))
## [1]   7 100
ridge.mod$lambda[50]
## [1] 11497.57

From the above we can see that the 50th column of the ridge.mod coefficient matrix corresponds to \(\lambda\) = 11497.57. Below are the corresponding coefficients and their \(\ell_2\) norm:

coef(ridge.mod)[,50]
##                            (Intercept) 
##                          -5.609740e+01 
##                    X1.transaction.date 
##                           4.988195e-03 
##                           X2.house.age 
##                          -2.969285e-04 
## X3.distance.to.the.nearest.MRT.station 
##                          -8.561195e-06 
##        X4.number.of.convenience.stores 
##                           3.109136e-03 
##                            X5.latitude 
##                           7.059093e-01 
##                           X6.longitude 
##                           5.464471e-01
sqrt(sum(coef(ridge.mod)[-1,50]^2))
## [1] 0.8927189

In contrast, column 60 of ridge.mod corresponds to \(\lambda\) = 705.48.

ridge.mod$lambda[60]
## [1] 705.4802

Below are the corresponding coefficients and their \(\ell_2\) norm, which is noticeably larger at this smaller \(\lambda\):

coef(ridge.mod)[,60]
##                            (Intercept) 
##                          -1.425642e+03 
##                    X1.transaction.date 
##                           8.051419e-02 
##                           X2.house.age 
##                          -4.760133e-03 
## X3.distance.to.the.nearest.MRT.station 
##                          -1.331875e-04 
##        X4.number.of.convenience.stores 
##                           4.845372e-02 
##                            X5.latitude 
##                           1.099406e+01 
##                           X6.longitude 
##                           8.450803e+00
sqrt(sum(coef(ridge.mod)[-1,60]^2))
## [1] 13.86702

If we are interested in obtaining the ridge regression coefficients for a new value of \(\lambda\) (say 50), we can use the predict() function:

predict(ridge.mod,s=50, type = "coefficients")[1:7,]
##                            (Intercept) 
##                          -1.262065e+04 
##                    X1.transaction.date 
##                           9.825777e-01 
##                           X2.house.age 
##                          -5.526150e-02 
## X3.distance.to.the.nearest.MRT.station 
##                          -1.169744e-03 
##        X4.number.of.convenience.stores 
##                           4.324010e-01 
##                            X5.latitude 
##                           9.712821e+01 
##                           X6.longitude 
##                           6.793060e+01

To estimate the test error of ridge regression and the lasso, we split the samples into a training set and a test set:

set.seed(1)
train=sample (1: nrow(x), nrow(x)/2)
test=(-train )
y.test=y[test]

We now fit a ridge regression model on the training set and evaluate the MSE on the test set, setting \(\lambda\) = 4:

ridge.mod = glmnet(x[train, ], y[train], alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod ,s=4, newx=x[test, ])
mean((ridge.pred -y.test)^2)
## [1] 97.54287

This yields a test MSE of 97.54. Ignoring the predictors and using just an intercept would mean predicting each test observation with the mean of the training responses. With this in mind, we can compute the corresponding test MSE like this:

mean((mean(y[train])-y.test)^2)
## [1] 200.9584

Clearly, fitting a ridge regression model with \(\lambda\) = 4 yields a much lower test MSE than fitting a model with just an intercept.
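
The same comparison can be made inside glmnet by predicting with a very large value of \(\lambda\) (a minimal sketch; ridge.pred.null is just an illustrative name), since heavy shrinkage pushes the fit towards the intercept-only model:

ridge.pred.null = predict(ridge.mod, s = 1e10, newx = x[test, ])
mean((ridge.pred.null - y.test)^2)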

So far we have arbitrarily chosen the tuning parameter \(\lambda\) = 4; it would be more appropriate to use cross-validation to choose it. We use the cv.glmnet() function, which by default performs ten-fold cross-validation:

set.seed(1)
cv.out= cv.glmnet(x[train, ], y[train], alpha = 0)
plot(cv.out)

bestlam =cv.out$lambda.min
bestlam
## [1] 1.022016

From the plot we can see that all 6 coefficients are retained throughout (as expected for ridge regression) and that the value one standard error from the minimum lies at a little less than a log lambda of 3. The minimum itself gives \(\lambda\) = 1.022. We can check the test MSE associated with this value of \(\lambda\):

ridge.pred=predict (ridge.mod ,s=bestlam ,newx=x[test, ])
mean((ridge.pred - y.test)^2)
## [1] 94.76925

This is clearly an improvement over the test MSE we recorded using \(\lambda\) = 4. We now refit our ridge regression model on the full dataset using bestlam:

out=glmnet (x,y,alpha=0)
predict(out, type="coefficients",s=bestlam )[1:7, ]
##                            (Intercept) 
##                          -1.939100e+04 
##                    X1.transaction.date 
##                           4.676221e+00 
##                           X2.house.age 
##                          -2.498740e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -3.727150e-03 
##        X4.number.of.convenience.stores 
##                           1.121554e+00 
##                            X5.latitude 
##                           2.301395e+02 
##                           X6.longitude 
##                           3.515546e+01

The Lasso

The goal here is to find out whether the lasso can yield a more accurate or more interpretable model than ridge regression.

lasso.mod = glmnet(x[train, ], y[train], alpha = 1, lambda= grid)
plot(lasso.mod)

Right off the bat we can see from the coefficient paths that, depending on the tuning parameter, some of the coefficients will be exactly zero. We can now perform cross-validation and compute the associated test error:

set.seed(1)
cv.out = cv.glmnet(x[train, ],y[train], alpha =1)
plot(cv.out)

bestlam = cv.out$lambda.min
lasso.pred = predict(lasso.mod, s=bestlam, newx=x[test, ])
mean((lasso.pred -y.test)^2)
## [1] 94.63347

The lasso has a substantial advantage over ridge regression in that the resulting coefficient estimates are sparse.

out = glmnet(x,y,alpha = 1, lambda = grid)
lasso.coef = predict(out, type ="coefficients", s=bestlam )[1:7, ]
lasso.coef
##                            (Intercept) 
##                          -1.424595e+04 
##                    X1.transaction.date 
##                           4.397579e+00 
##                           X2.house.age 
##                          -2.501060e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.295610e-03 
##        X4.number.of.convenience.stores 
##                           1.095564e+00 
##                            X5.latitude 
##                           2.176917e+02 
##                           X6.longitude 
##                           0.000000e+00
lasso.coef[lasso.coef!= 0]
##                            (Intercept) 
##                          -1.424595e+04 
##                    X1.transaction.date 
##                           4.397579e+00 
##                           X2.house.age 
##                          -2.501060e-01 
## X3.distance.to.the.nearest.MRT.station 
##                          -4.295610e-03 
##        X4.number.of.convenience.stores 
##                           1.095564e+00 
##                            X5.latitude 
##                           2.176917e+02

The coefficient for X6.longitude has been shrunk to exactly zero, so the lasso model uses only 5 of the 6 predictors.
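As a further sketch (reusing cv.out from the lasso cross-validation above; not part of the original analysis), the one-standard-error rule typically produces an even sparser model:

lasso.coef.1se=predict(out, type="coefficients", s=cv.out$lambda.1se)[1:7, ]
lasso.coef.1se[lasso.coef.1se != 0]  # usually fewer nonzero coefficients than with lambda.min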

6.3 The PCR and PLS Regression

Principal Components Regression

To perform principal components regression (PCR) we use the pcr() function:

library(pls)
## Warning: package 'pls' was built under R version 3.5.2
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(2)
pcr.fit=pcr(Y.house.price.of.unit.area ~ ., data=valuation ,scale=TRUE , validation ="CV")
summary(pcr.fit)
## Data:    X dimension: 414 6 
##  Y dimension: 414 1
## Fit method: svdpc
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           13.62    9.686    9.640    9.187    9.066    9.044    8.964
## adjCV        13.62    9.682    9.671    9.181    9.054    9.036    8.954
## 
## TRAINING: % variance explained
##                             1 comps  2 comps  3 comps  4 comps  5 comps
## X                             44.57    62.03    78.41    88.34    97.50
## Y.house.price.of.unit.area    50.04    50.50    55.54    56.90    57.24
##                             6 comps
## X                            100.00
## Y.house.price.of.unit.area    58.24

summary() reports the cross-validated prediction error as RMSEP (root mean squared error of prediction); squaring these values gives the corresponding MSE. We can plot the cross-validation scores using validationplot():

validationplot(pcr.fit,val.type = "MSEP")
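As a small sketch, the cross-validated MSE values being plotted can also be extracted directly with the pls package's MSEP() function; they are simply the squares of the RMSEP values reported by summary():

msep.cv=MSEP(pcr.fit, estimate="CV")
msep.cv$val[1, 1, ]  # cross-validated MSEP for 0 to 6 components (e.g. 8.964^2 for 6 components)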

The cross-validation error with 3 components is already reasonably low and does not decrease much as more components are added. We now perform PCR on the training data and evaluate its test performance:

set.seed(1)
pcr.fit=pcr(Y.house.price.of.unit.area~., data=valuation, subset=train, scale=TRUE, validation="CV")
validationplot(pcr.fit, val.type="MSEP")

The lowest cross-validated error occurs at M = 6, although there is little improvement beyond M = 3. I will compute the test MSE for both M = 6 and M = 3:

pcr.pred=predict(pcr.fit, x[test, ], ncomp=6)
mean((pcr.pred - y.test)^2)
## [1] 94.24018
pcr.pred=predict(pcr.fit, x[test, ], ncomp=3)
mean((pcr.pred - y.test)^2)
## [1] 98.3154

As expected, M = 6 gives the lower test MSE. This test set MSE is competitive with the results obtained using ridge regression and the lasso.

Lastly we fit PCR on the full dataset:

pcr.fit=pcr(y~x, scale=TRUE, ncomp=6)
summary(pcr.fit)
## Data:    X dimension: 414 6 
##  Y dimension: 414 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##    1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X    44.57    62.03    78.41    88.34    97.50   100.00
## y    50.04    50.50    55.54    56.90    57.24    58.24

Partial Least Squares

We use the plsr() function to implement partial least squares:

set.seed(1)
pls.fit=plsr(Y.house.price.of.unit.area~., data=valuation, subset=train, scale=TRUE, validation="CV")
summary(pls.fit )
## Data:    X dimension: 207 6 
##  Y dimension: 207 1
## Fit method: kernelpls
## Number of components considered: 6
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            13.1    8.349    8.111    8.046    8.000    8.004    8.004
## adjCV         13.1    8.344    8.101    8.035    7.991    7.994    7.994
## 
## TRAINING: % variance explained
##                             1 comps  2 comps  3 comps  4 comps  5 comps
## X                             43.50    58.13    68.85    77.36    90.17
## Y.house.price.of.unit.area    60.04    63.17    63.78    64.09    64.16
##                             6 comps
## X                            100.00
## Y.house.price.of.unit.area    64.17

The lowest cross-validation error occurs at M = 4 partial least squares directions. We can evaluate the corresponding test set MSE:

pls.pred=predict(pls.fit, x[test, ], ncomp =4)
mean((pls.pred -y.test)^2)
## [1] 94.53886

This again is comparable to what we recorded for ridge regression, the lasso and PCR. Finally, we perform PLS on the full dataset with M = 6:

pls.fit=plsr(Y.house.price.of.unit.area~., data=valuation, scale=TRUE, ncomp=6)
summary(pls.fit)
## Data:    X dimension: 414 6 
##  Y dimension: 414 1
## Fit method: kernelpls
## Number of components considered: 6
## TRAINING: % variance explained
##                             1 comps  2 comps  3 comps  4 comps  5 comps
## X                             44.10    59.98    69.38    74.49    90.46
## Y.house.price.of.unit.area    54.18    57.48    57.95    58.22    58.24
##                             6 comps
## X                            100.00
## Y.house.price.of.unit.area    58.24

7.0 Non-linear Modeling

7.1 Polynomial Regression and Step Functions

We begin by fitting a polynomial regression of Y.house.price.of.unit.area on X3.distance.to.the.nearest.MRT.station, up to degree 4:

fit=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,4), data = valuationSub)
coef(summary(fit))
##                                                    Estimate Std. Error
## (Intercept)                                        37.98019  0.4419469
## poly(X3.distance.to.the.nearest.MRT.station, 4)1 -186.26507  8.9922887
## poly(X3.distance.to.the.nearest.MRT.station, 4)2   74.82367  8.9922887
## poly(X3.distance.to.the.nearest.MRT.station, 4)3  -47.29298  8.9922887
## poly(X3.distance.to.the.nearest.MRT.station, 4)4   29.31277  8.9922887
##                                                     t value      Pr(>|t|)
## (Intercept)                                       85.938368 6.843541e-264
## poly(X3.distance.to.the.nearest.MRT.station, 4)1 -20.713867  1.065100e-65
## poly(X3.distance.to.the.nearest.MRT.station, 4)2   8.320871  1.317772e-15
## poly(X3.distance.to.the.nearest.MRT.station, 4)3  -5.259282  2.336748e-07
## poly(X3.distance.to.the.nearest.MRT.station, 4)4   3.259767  1.208304e-03

poly() can also be used to obtain the polynomial terms directly (rather than an orthogonal basis) by setting raw = TRUE. As we are about to observe, this affects the coefficient estimates but not the fitted values:

fit2=lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,4,raw = T), data = valuationSub)
coef(summary(fit2))
##                                                                Estimate
## (Intercept)                                                5.562006e+01
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)1 -3.793675e-02
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)2  1.809243e-05
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)3 -3.739689e-09
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)4  2.621493e-13
##                                                             Std. Error
## (Intercept)                                               1.289050e+00
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)1 4.466579e-03
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)2 3.606429e-06
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)3 9.661662e-10
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)4 8.041962e-14
##                                                             t value
## (Intercept)                                               43.148114
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)1 -8.493469
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)2  5.016718
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)3 -3.870648
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)4  3.259767
##                                                                Pr(>|t|)
## (Intercept)                                               2.506774e-154
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)1  3.759656e-16
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)2  7.852080e-07
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)3  1.263064e-04
## poly(X3.distance.to.the.nearest.MRT.station, 4, raw = T)4  1.208304e-03
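As a quick check (a sketch reusing fit and fit2 from above), the two parameterizations give essentially identical fitted values:

max(abs(fitted(fit) - fitted(fit2)))  # essentially zero: only the basis differs, not the fit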

Even though they are more cumbersome, other equivalent ways of fitting the model are:

fit2a=lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station+I(X3.distance.to.the.nearest.MRT.station^2)+I(X3.distance.to.the.nearest.MRT.station^3)+I(X3.distance.to.the.nearest.MRT.station^4), data = valuationSub)
coef(fit2a)
##                                 (Intercept) 
##                                5.562006e+01 
##      X3.distance.to.the.nearest.MRT.station 
##                               -3.793675e-02 
## I(X3.distance.to.the.nearest.MRT.station^2) 
##                                1.809243e-05 
## I(X3.distance.to.the.nearest.MRT.station^3) 
##                               -3.739689e-09 
## I(X3.distance.to.the.nearest.MRT.station^4) 
##                                2.621493e-13
fit2b=lm(Y.house.price.of.unit.area~cbind(X3.distance.to.the.nearest.MRT.station,X3.distance.to.the.nearest.MRT.station^2,X3.distance.to.the.nearest.MRT.station^3,X3.distance.to.the.nearest.MRT.station^4), data = valuationSub)
coef(fit2b)
##                                                                                                                                                                                                       (Intercept) 
##                                                                                                                                                                                                      5.562006e+01 
## cbind(X3.distance.to.the.nearest.MRT.station, X3.distance.to.the.nearest.MRT.station^2, X3.distance.to.the.nearest.MRT.station^3, X3.distance.to.the.nearest.MRT.station^4)X3.distance.to.the.nearest.MRT.station 
##                                                                                                                                                                                                     -3.793675e-02 
##                                       cbind(X3.distance.to.the.nearest.MRT.station, X3.distance.to.the.nearest.MRT.station^2, X3.distance.to.the.nearest.MRT.station^3, X3.distance.to.the.nearest.MRT.station^4) 
##                                                                                                                                                                                                      1.809243e-05 
##                                       cbind(X3.distance.to.the.nearest.MRT.station, X3.distance.to.the.nearest.MRT.station^2, X3.distance.to.the.nearest.MRT.station^3, X3.distance.to.the.nearest.MRT.station^4) 
##                                                                                                                                                                                                     -3.739689e-09 
##                                       cbind(X3.distance.to.the.nearest.MRT.station, X3.distance.to.the.nearest.MRT.station^2, X3.distance.to.the.nearest.MRT.station^3, X3.distance.to.the.nearest.MRT.station^4) 
##                                                                                                                                                                                                      2.621493e-13

We can create a grid of values for X3.distance.to.the.nearest.MRT.station at which we want predictions and then call the predict() function, specifying that we want standard errors as well:

x3lims=range(X3.distance.to.the.nearest.MRT.station)
x3.grid=seq(from=x3lims[1], to=x3lims[2])
preds=predict(fit, newdata = list(X3.distance.to.the.nearest.MRT.station=x3.grid), se=T)
se.bands=cbind(preds$fit+2*preds$se.fit, preds$fit -2*preds$se.fit)

We can now plot the data and add the fit from the degree-4 polynomial:

par(mfrow=c(1,2), mar=c(4.5,4.5,1,1),oma=c(0,0,4,0))
plot(X3.distance.to.the.nearest.MRT.station,Y.house.price.of.unit.area,xlim = x3lims,cex=.5,col="darkgrey")
title("Degree -4 Polynomial ", outer = T)
lines(x3.grid, preds$fit, lwd=2,col="blue")
matlines(x3.grid,se.bands,lwd=1,col="blue", lty = 3)

So far our choice of polynomial degree has been rather arbitrary. We can use hypothesis tests to choose the degree: we fit a series of polynomial models (here up to degree 5) and test which is the simplest model sufficient to explain the relationship between Y.house.price.of.unit.area and X3.distance.to.the.nearest.MRT.station. We use the anova() function for analysis of variance:

fit.1= lm(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station ,data=valuationSub)
fit.2= lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,2) ,data=valuationSub)
fit.3= lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,3) ,data=valuationSub)
fit.4= lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,4) ,data=valuationSub)
fit.5= lm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station,5) ,data=valuationSub)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: Y.house.price.of.unit.area ~ X3.distance.to.the.nearest.MRT.station
## Model 2: Y.house.price.of.unit.area ~ poly(X3.distance.to.the.nearest.MRT.station, 
##     2)
## Model 3: Y.house.price.of.unit.area ~ poly(X3.distance.to.the.nearest.MRT.station, 
##     3)
## Model 4: Y.house.price.of.unit.area ~ poly(X3.distance.to.the.nearest.MRT.station, 
##     4)
## Model 5: Y.house.price.of.unit.area ~ poly(X3.distance.to.the.nearest.MRT.station, 
##     5)
##   Res.Df   RSS Df Sum of Sq       F    Pr(>F)    
## 1    412 41767                                   
## 2    411 36168  1    5598.6 69.1257 1.391e-15 ***
## 3    410 33931  1    2236.6 27.6156 2.390e-07 ***
## 4    409 33072  1     859.2 10.6090  0.001219 ** 
## 5    408 33044  1      27.8  0.3429  0.558500    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Judging from the p-values, a linear fit is not sufficient; a quadratic or cubic fit does a good job, and a degree-4 polynomial captures essentially all of the remaining structure in the relationship between Y.house.price.of.unit.area and X3.distance.to.the.nearest.MRT.station. There is no justification for higher-degree polynomials.

An alternative to using hypothesis tests and ANOVA is to choose the polynomial degree by cross-validation, as sketched below.
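As a minimal sketch (assuming the boot package is available; this was not run in the original analysis), we could compare 10-fold cross-validated errors for degrees 1 through 5 and pick the degree with the smallest error:

library(boot)
set.seed(1)
cv.errors=rep(NA, 5)
for (d in 1:5) {
  glm.fit=glm(Y.house.price.of.unit.area~poly(X3.distance.to.the.nearest.MRT.station, d), data=valuationSub)
  cv.errors[d]=cv.glm(valuationSub, glm.fit, K=10)$delta[1]  # 10-fold CV estimate of test MSE
}
cv.errors

We next fit a polynomial logistic regression, using I() to construct a binary response indicating whether Y.house.price.of.unit.area falls below 20: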

fit=glm(I(Y.house.price.of.unit.area<20)~poly(X3.distance.to.the.nearest.MRT.station,4), data = valuationSub,family = binomial)
preds=predict(fit ,newdata =list(X3.distance.to.the.nearest.MRT.station=x3.grid),se=T)

pfit=exp(preds$fit)/(1+ exp( preds$fit))
se.bands.logit = cbind(preds$fit +2*preds$se.fit , preds$fit -2*preds$se.fit)
se.bands = exp(se.bands.logit)/(1+ exp(se.bands.logit))

plot(X3.distance.to.the.nearest.MRT.station ,I(Y.house.price.of.unit.area <20) ,xlim=x3lims ,type ="n",ylim=c(0 ,.2))
points(jitter(X3.distance.to.the.nearest.MRT.station), I((Y.house.price.of.unit.area <20) /5) ,cex =.5, pch ="|",col ="darkgrey")
lines(x3.grid ,pfit ,lwd =2, col ="blue")
matlines(x3.grid ,se.bands ,lwd =1, col ="blue",lty =3)

Here the X3.distance.to.the.nearest.MRT.station values for observations with Y.house.price.of.unit.area below 20 (a threshold chosen for lack of a clearer separation in the scatterplot) are shown as grey marks along the top of the plot, and the remaining observations as grey marks along the bottom. The jitter() function is used to jitter the X3.distance.to.the.nearest.MRT.station values slightly so that observations with the same value do not cover each other up. In order to fit a step function, we use the cut() function:

table(cut(X3.distance.to.the.nearest.MRT.station,4))
## 
##     (16.9,1.64e+03] (1.64e+03,3.26e+03] (3.26e+03,4.87e+03] 
##                 320                  57                  32 
## (4.87e+03,6.49e+03] 
##                   5
fit=lm(Y.house.price.of.unit.area~cut(X3.distance.to.the.nearest.MRT.station,4), data = valuationSub)
coef(summary(fit))
##                                                                    Estimate
## (Intercept)                                                        42.21375
## cut(X3.distance.to.the.nearest.MRT.station, 4)(1.64e+03,3.26e+03] -15.48568
## cut(X3.distance.to.the.nearest.MRT.station, 4)(3.26e+03,4.87e+03] -22.92312
## cut(X3.distance.to.the.nearest.MRT.station, 4)(4.87e+03,6.49e+03] -27.29375
##                                                                   Std. Error
## (Intercept)                                                        0.6153076
## cut(X3.distance.to.the.nearest.MRT.station, 4)(1.64e+03,3.26e+03]  1.5824337
## cut(X3.distance.to.the.nearest.MRT.station, 4)(3.26e+03,4.87e+03]  2.0407444
## cut(X3.distance.to.the.nearest.MRT.station, 4)(4.87e+03,6.49e+03]  4.9607684
##                                                                     t value
## (Intercept)                                                        68.60593
## cut(X3.distance.to.the.nearest.MRT.station, 4)(1.64e+03,3.26e+03]  -9.78599
## cut(X3.distance.to.the.nearest.MRT.station, 4)(3.26e+03,4.87e+03] -11.23273
## cut(X3.distance.to.the.nearest.MRT.station, 4)(4.87e+03,6.49e+03]  -5.50192
##                                                                        Pr(>|t|)
## (Intercept)                                                       7.758794e-227
## cut(X3.distance.to.the.nearest.MRT.station, 4)(1.64e+03,3.26e+03]  1.833744e-20
## cut(X3.distance.to.the.nearest.MRT.station, 4)(3.26e+03,4.87e+03]  1.044903e-25
## cut(X3.distance.to.the.nearest.MRT.station, 4)(4.87e+03,6.49e+03]  6.628491e-08

The cut() function returns an ordered categorical variable, for which lm() creates a set of dummy variables in the regression.

We can draw an inference from the summary above: the category X3.distance.to.the.nearest.MRT.station < 1640 is not listed because it is absorbed into the intercept, so 42.21 represents the average price (Y.house.price.of.unit.area) of properties whose distance to the nearest MRT station is below 1640.
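As a quick check (a sketch using the same four bins), the bin means line up with the intercept and the reported offsets:

tapply(Y.house.price.of.unit.area, cut(X3.distance.to.the.nearest.MRT.station, 4), mean)
# the first bin's mean is the intercept (about 42.21); the other means equal the intercept plus their offsets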

7.2 Splines

Regression splines can be fit by constructing an appropriate matrix of basis functions with the bs() function; by default, cubic splines are produced. We now fit Y.house.price.of.unit.area to X3.distance.to.the.nearest.MRT.station using a regression spline:

library(splines)
fit=lm(Y.house.price.of.unit.area~bs(X3.distance.to.the.nearest.MRT.station, knots=c(750, 1500, 2500)), data=valuationSub)
pred=predict(fit, newdata=list(X3.distance.to.the.nearest.MRT.station=x3.grid), se=T)
plot(X3.distance.to.the.nearest.MRT.station, Y.house.price.of.unit.area, col="gray")
lines(x3.grid, pred$fit, lwd=2)
lines(x3.grid, pred$fit+2*pred$se, lty="dashed")
lines(x3.grid, pred$fit-2*pred$se, lty="dashed")

Here we have prespecified knots at X3.distance.to.the.nearest.MRT.station values of 750, 1500 and 2500, which yields a spline with six basis functions. We could instead set the df argument to produce a spline with knots at uniform quantiles of the data:

dim(bs(X3.distance.to.the.nearest.MRT.station, knots=c(750,1500,2500)))
## [1] 414   6
dim(bs(X3.distance.to.the.nearest.MRT.station, df=6))
## [1] 414   6
attr(bs(X3.distance.to.the.nearest.MRT.station, df=6), "knots")
##       25%       50%       75% 
##  289.3248  492.2313 1454.2790

In this case the values 289.3248, 492.2313 and 1454.2790, corresponding to the 25th, 50th and 75th percentiles of X3.distance.to.the.nearest.MRT.station, have been chosen as knots. The bs() function also has a degree argument, so we can fit a spline of any degree instead of the default cubic.
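For instance, as a sketch (this model was not part of the original analysis), a quadratic spline with the same three knots can be obtained by setting degree = 2:

fit.q=lm(Y.house.price.of.unit.area~bs(X3.distance.to.the.nearest.MRT.station, knots=c(750, 1500, 2500), degree=2), data=valuationSub)
# 3 knots + degree 2 gives 5 basis functions instead of 6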

To fit a natural spline we use the ns() function instead:

fit2=lm(Y.house.price.of.unit.area~ns(X3.distance.to.the.nearest.MRT.station, df =4) ,data=valuationSub)
pred2=predict(fit2, newdata =list(X3.distance.to.the.nearest.MRT.station=x3.grid),se=T)
plot(X3.distance.to.the.nearest.MRT.station,Y.house.price.of.unit.area,xlim = x3lims,cex=.5, col="darkgrey")
lines(x3.grid , pred2$fit ,col ="red",lwd =2)

#par(mfrow=c(1,2))
plot(X3.distance.to.the.nearest.MRT.station,Y.house.price.of.unit.area,xlim = x3lims,cex=.5, col="darkgrey")
title("Smoothing Spline")
fit=smooth.spline(X3.distance.to.the.nearest.MRT.station ,Y.house.price.of.unit.area ,df=16)
fit2=smooth.spline(X3.distance.to.the.nearest.MRT.station ,Y.house.price.of.unit.area ,cv=T)
## Warning in smooth.spline(X3.distance.to.the.nearest.MRT.station,
## Y.house.price.of.unit.area, : cross-validation with non-unique 'x' values
## seems doubtful
fit2$df
## [1] 19.71385
lines(fit,col="red", lwd=2)

lines(fit2, col="blue", lwd=2)
legend("topright", legend = c("16 DF", "19.71385 DF"),col=c("red","blue"), lty=1,lwd = 2,cex = .8)

Because some data points share the same x value, ordinary leave-one-out cross-validation (cv=TRUE) is of doubtful validity here and cv=FALSE (generalized cross-validation) would be more appropriate; we have kept cv=TRUE for demonstration, noting that the difference is small.
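As a sketch (its output was not reported in the original analysis), the GCV-based fit is obtained by setting cv=FALSE, and its effective degrees of freedom can be compared with the 19.71 above:

fit3=smooth.spline(X3.distance.to.the.nearest.MRT.station, Y.house.price.of.unit.area, cv=FALSE)  # generalized cross-validation
fit3$df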

In order to perform local regression, we use the loess() function:

plot(X3.distance.to.the.nearest.MRT.station, Y.house.price.of.unit.area ,xlim=x3lims ,cex =.5, col ="darkgrey")
title("Local Regression")
fit=loess(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station ,span =.2, data=valuationSub)
fit2=loess(Y.house.price.of.unit.area~X3.distance.to.the.nearest.MRT.station ,span =.5, data=valuationSub)
lines(x3.grid, predict(fit, data.frame(X3.distance.to.the.nearest.MRT.station=x3.grid)),col ="red",lwd=2)
lines(x3.grid, predict(fit2, data.frame(X3.distance.to.the.nearest.MRT.station=x3.grid)), col ="blue",lwd=2)
legend("topright",legend = c("Span =0.2" ,"Span =0.5") ,col=c("red","blue"),lty = 1, lwd = 2, cex = .8)

Spans of 0.2 and 0.5 mean that each neighbourhood consists of 20% and 50% of the observations, respectively. The larger the span, the smoother the fit.

7.3 GAMs

We now fit a GAM to predict Y.house.price.of.unit.area using natural spline functions of X2.house.age and X3.distance.to.the.nearest.MRT.station, with X4.number.of.convenience.stores entering as a linear term:

gam1=lm(Y.house.price.of.unit.area~ns(X2.house.age,4)+ns(X3.distance.to.the.nearest.MRT.station,5)+X4.number.of.convenience.stores, data = valuationSub)

summary(gam1)
## 
## Call:
## lm(formula = Y.house.price.of.unit.area ~ ns(X2.house.age, 4) + 
##     ns(X3.distance.to.the.nearest.MRT.station, 5) + X4.number.of.convenience.stores, 
##     data = valuationSub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.491  -4.572  -0.663   3.286  70.607 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                     48.5605     3.1947  15.201
## ns(X2.house.age, 4)1                            -9.4118     1.9212  -4.899
## ns(X2.house.age, 4)2                           -12.6821     2.4168  -5.247
## ns(X2.house.age, 4)3                            -7.9123     3.9525  -2.002
## ns(X2.house.age, 4)4                            -2.4587     2.9142  -0.844
## ns(X3.distance.to.the.nearest.MRT.station, 5)1  -2.6424     2.3814  -1.110
## ns(X3.distance.to.the.nearest.MRT.station, 5)2 -13.2266     3.3940  -3.897
## ns(X3.distance.to.the.nearest.MRT.station, 5)3 -22.0705     3.3354  -6.617
## ns(X3.distance.to.the.nearest.MRT.station, 5)4 -20.1727     6.1039  -3.305
## ns(X3.distance.to.the.nearest.MRT.station, 5)5 -25.7467     4.4427  -5.795
## X4.number.of.convenience.stores                  0.6131     0.2046   2.997
##                                                Pr(>|t|)    
## (Intercept)                                     < 2e-16 ***
## ns(X2.house.age, 4)1                           1.40e-06 ***
## ns(X2.house.age, 4)2                           2.50e-07 ***
## ns(X2.house.age, 4)3                           0.045972 *  
## ns(X2.house.age, 4)4                           0.399351    
## ns(X3.distance.to.the.nearest.MRT.station, 5)1 0.267835    
## ns(X3.distance.to.the.nearest.MRT.station, 5)2 0.000114 ***
## ns(X3.distance.to.the.nearest.MRT.station, 5)3 1.17e-10 ***
## ns(X3.distance.to.the.nearest.MRT.station, 5)4 0.001035 ** 
## ns(X3.distance.to.the.nearest.MRT.station, 5)5 1.38e-08 ***
## X4.number.of.convenience.stores                0.002897 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.358 on 403 degrees of freedom
## Multiple R-squared:  0.6318, Adjusted R-squared:  0.6226 
## F-statistic: 69.15 on 10 and 403 DF,  p-value: < 2.2e-16

This pushes the \({R}^2\) to 0.63. We now fit more general GAMs, using smoothing splines or other components that cannot be expressed in terms of basis functions and fit with least squares; for this we need the gam library:

library(gam)
## Warning: package 'gam' was built under R version 3.5.2
## Loaded gam 1.16
gam.m3=gam(Y.house.price.of.unit.area~s(X2.house.age,4)+s(X3.distance.to.the.nearest.MRT.station,5)+X4.number.of.convenience.stores, data = valuationSub)
par(mfrow=c(1,3))
plot(gam.m3, se=TRUE, col="blue")

The generic plot() function recognizes that gam.m3 was produced by gam() and invokes the appropriate plotting method (plot.Gam()). To determine which model is best, we can compare three candidates with a series of ANOVA tests: gam.m1, which excludes X2.house.age; gam.m2, which includes a linear X2.house.age term; and gam.m3, which uses a smoothing spline of X2.house.age:

gam.m1=gam(Y.house.price.of.unit.area~s(X3.distance.to.the.nearest.MRT.station,5) +X4.number.of.convenience.stores ,data=valuationSub)
gam.m2=gam(Y.house.price.of.unit.area~X2.house.age+s(X3.distance.to.the.nearest.MRT.station,5)+X4.number.of.convenience.stores, data = valuationSub)
anova(gam.m1, gam.m2, gam.m3,test="F")
## Analysis of Deviance Table
## 
## Model 1: Y.house.price.of.unit.area ~ s(X3.distance.to.the.nearest.MRT.station, 
##     5) + X4.number.of.convenience.stores
## Model 2: Y.house.price.of.unit.area ~ X2.house.age + s(X3.distance.to.the.nearest.MRT.station, 
##     5) + X4.number.of.convenience.stores
## Model 3: Y.house.price.of.unit.area ~ s(X2.house.age, 4) + s(X3.distance.to.the.nearest.MRT.station, 
##     5) + X4.number.of.convenience.stores
##   Resid. Df Resid. Dev     Df Deviance       F    Pr(>F)    
## 1       407      32353                                      
## 2       406      29421 1.0000   2931.6 42.6374 1.987e-10 ***
## 3       403      27709 3.0002   1711.7  8.2979 2.284e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the result, there is compelling evidence (p = 1.99e-10) that a GAM including X2.house.age (gam.m2) is better than one that excludes it (gam.m1). The p-value of 2.28e-05 for the third comparison further suggests that the non-linear, smoothing-spline term for X2.house.age in gam.m3 is helpful.

summary(gam.m3)
## 
## Call: gam(formula = Y.house.price.of.unit.area ~ s(X2.house.age, 4) + 
##     s(X3.distance.to.the.nearest.MRT.station, 5) + X4.number.of.convenience.stores, 
##     data = valuationSub)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.033  -4.500  -0.460   3.517  72.651 
## 
## (Dispersion Parameter for gaussian family taken to be 68.7573)
## 
##     Null Deviance: 76461.38 on 413 degrees of freedom
## Residual Deviance: 27709.16 on 402.9998 degrees of freedom
## AIC: 2939.194 
## 
## Number of Local Scoring Iterations: 2 
## 
## Anova for Parametric Effects
##                                               Df  Sum Sq Mean Sq F value
## s(X2.house.age, 4)                             1  3727.2  3727.2  54.208
## s(X3.distance.to.the.nearest.MRT.station, 5)   1 29340.5 29340.5 426.725
## X4.number.of.convenience.stores                1   995.4   995.4  14.477
## Residuals                                    403 27709.2    68.8        
##                                                 Pr(>F)    
## s(X2.house.age, 4)                           1.024e-12 ***
## s(X3.distance.to.the.nearest.MRT.station, 5) < 2.2e-16 ***
## X4.number.of.convenience.stores              0.0001639 ***
## Residuals                                                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                                              Npar Df  Npar F     Pr(F)    
## (Intercept)                                                               
## s(X2.house.age, 4)                                 3  8.8883 1.024e-05 ***
## s(X3.distance.to.the.nearest.MRT.station, 5)       4 17.5983 2.527e-13 ***
## X4.number.of.convenience.stores                                           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can make predictions from gam objects, just like from lm objects:

preds=predict(gam.m2,newdata = valuationSub)
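As a sketch (not reported in the original analysis), these in-sample predictions can be summarized against the observed prices:

mean((preds - valuationSub$Y.house.price.of.unit.area)^2)  # training MSE of gam.m2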

We can use local regression fits as building blocks in a GAM using the lo() function:

#par(mfrow=c(1,3))
gam.lo=gam(Y.house.price.of.unit.area ~ s(X2.house.age, df=4)+lo(X3.distance.to.the.nearest.MRT.station, span=0.7)+X4.number.of.convenience.stores,
           data=valuationSub)
plot.Gam(gam.lo, se=TRUE, col="green")

Here we have used local regression with a span of 0.7 for the X3.distance.to.the.nearest.MRT.station term.

The lo() function can also be used to create an interaction term before calling gam():

gam.lo.i=gam(Y.house.price.of.unit.area~lo(X2.house.age,X3.distance.to.the.nearest.MRT.station, span=0.5)+X4.number.of.convenience.stores, data = valuationSub)


library(akima)
## Warning: package 'akima' was built under R version 3.5.3
plot(gam.lo.i)

In order to fit a logistic regression GAM, we once again use the I() function to construct the binary response variable, and set family=binomial:

gam.lr=gam(I(Y.house.price.of.unit.area<20)~X2.house.age+s(X3.distance.to.the.nearest.MRT.station, df=5)+X4.number.of.convenience.stores, family = binomial, data = valuationSub)
par(mfrow=c(1,3))
plot(gam.lr,se=T,col="green")

table(X4.number.of.convenience.stores,I(Y.house.price.of.unit.area<20))
##                                
## X4.number.of.convenience.stores FALSE TRUE
##                              0     41   26
##                              1     39    7
##                              2     24    0
##                              3     45    1
##                              4     31    0
##                              5     67    0
##                              6     36    1
##                              7     31    0
##                              8     30    0
##                              9     25    0
##                              10    10    0

It is easy to see that properties with Y.house.price.of.unit.area of 20 or below tend to have few convenience stores nearby: 26 such properties have 0 convenience stores and 7 have only 1, while no property with 7 or more convenience stores falls in this price range. We therefore refit the logistic regression GAM excluding observations with X4.number.of.convenience.stores of 7 or more, since they contribute no events:

gam.lr.s=gam(I(Y.house.price.of.unit.area<20)~X2.house.age+s(X3.distance.to.the.nearest.MRT.station, df=5)+X4.number.of.convenience.stores,family = binomial,data=valuationSub, subset = (X4.number.of.convenience.stores<7))
plot(gam.lr.s,se=T,col="green")

8.0 Decision Tree

8.2 Fitting a Regression Tree

To fit a regression tree to our valuationSub dataset, we first create a training set and then fit the tree to the training data:

library(tree)
## Warning: package 'tree' was built under R version 3.5.3
set.seed(1)
names(valuationSub)= c("x1","x2","x3","x4","Y")
train= sample(1:nrow(valuationSub), nrow(valuationSub)/2)
tree.valuation=tree(Y~.,valuationSub, subset = train)
summary(tree.valuation)
## 
## Regression tree:
## tree(formula = Y ~ ., data = valuationSub, subset = train)
## Variables actually used in tree construction:
## [1] "x3" "x2" "x4"
## Number of terminal nodes:  10 
## Residual mean deviance:  45.15 = 8895 / 197 
## Distribution of residuals:
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -29.3700  -3.7490  -0.4686   0.0000   3.6190  24.2000

It is interesting to see that only 3 of the predictors have been used in constructing the tree. To make the tree labels readable, I have renamed the columns of valuationSub. We now plot the tree:

plot(tree.valuation)

Here is the tree without labels. We now add labels:

plot(tree.valuation)
text(tree.valuation, pretty = 0)

Now we use the cv.tree() function to see whether pruning the tree will improve performance:

cv.valuation=cv.tree(tree.valuation)
plot(cv.valuation$size, cv.valuation$dev, type = "b")

Here we see that the cross-validated error does not improve much beyond a tree of size 4. We can prune the tree with the prune.tree() function:

prune.valuation=prune.tree(tree.valuation, best = 4)
plot(prune.valuation)
text(prune.valuation, pretty=0)
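As a sketch (its value was not reported in the original analysis), the pruned tree's test MSE could be computed for comparison:

yhat.prune=predict(prune.valuation, newdata=valuationSub[-train, ])
mean((yhat.prune - valuationSub[-train, "Y"])^2)  # test MSE of the pruned 4-node tree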

We will now use the unpruned tree to make predictions on the test data:

yhat=predict(tree.valuation,newdata = valuationSub[-train, ])
valuation.test=valuationSub[-train, "Y"]
plot(yhat, valuation.test)
abline(0,1)

mean((yhat-valuation.test)^2)
## [1] 89.84545

The test MSE associated with the regression tree is 89.84; its square root is about 9.5, so the model's test predictions are typically within roughly 9.5 units of house price per unit area of the true value.

8.3 Bagging and Random Forests

Here we apply bagging and random forest to the valuation data, using the randomForest package in R (Recall that bagging is simply a special case of a random forest with m=p):

library(randomForest)
## Warning: package 'randomForest' was built under R version 3.5.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1)
bag.valuation=randomForest(Y~., data=valuationSub, subset=train, mtry=4,importance=T)
bag.valuation
## 
## Call:
##  randomForest(formula = Y ~ ., data = valuationSub, mtry = 4,      importance = T, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 53.3088
##                     % Var explained: 68.64

The argument mtry=4 indicates that all 4 predictors in valuationSub should be considered at each split, i.e. bagging. For comparison, we can also bag the full valuation dataset with all 6 predictors:

set.seed(1)
bag.valuationFull=randomForest(Y.house.price.of.unit.area~., data=valuation, subset=train, mtry=6,importance=T)
bag.valuationFull
## 
## Call:
##  randomForest(formula = Y.house.price.of.unit.area ~ ., data = valuation,      mtry = 6, importance = T, subset = train) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 6
## 
##           Mean of squared residuals: 50.67226
##                     % Var explained: 70.19

The 2 additional predictors in the full valuation dataset only increase the percentage of variance explained from 68.64% to 70.19%.
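As a further sketch (not part of the original analysis), the 6-predictor bagged model's test MSE could be computed the same way:

yhat.bagFull=predict(bag.valuationFull, newdata=valuation[-train, ])
mean((yhat.bagFull - valuation[-train, "Y.house.price.of.unit.area"])^2)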

We can test how the bagged model fit on valuationSub performs on the test set:

yhat.bag= predict(bag.valuation, newdata = valuationSub[-train, ])
plot(yhat.bag, valuation.test)
abline(0,1)

mean((yhat.bag-valuation.test)^2)
## [1] 77.4828

The test MSE associated with the bagged regression tree is 77.48, well below the 89.84 we recorded earlier for a single tree. We could change the number of trees grown by randomForest() using the ntree argument:

bag.valuation=randomForest(Y~., data=valuationSub, subset=train, mtry=4,ntree=6)
yhat.bag= predict(bag.valuation, newdata = valuationSub[-train, ])
mean((yhat.bag-valuation.test)^2)
## [1] 85.97656

Growing a random forest proceeds in exactly the same way except that we use a smaller mtry argument. By default, randomForest() uses p/3 variables when building a random forest of regression trees:

rf.valuation=randomForest(Y~., data=valuationSub, subset=train, mtry=2, importance=T)
yhat.rf=predict(rf.valuation, newdata=valuationSub[-train, ])
mean((yhat.rf-valuation.test)^2)

Here we set mtry=2 and recorded a test MSE of 78.11.

Using the importance() function we can view the importance of each variable:

importance(rf.valuation)
##      %IncMSE IncNodePurity
## x1  5.400982      2383.471
## x2 28.283211      6611.023
## x3 53.268912     17778.182
## x4 17.320115      6465.840
varImpPlot(rf.valuation)

Clearly we can see that x3 is the most important variable.

8.4 Boosting

Here we use the gbm() function in the gbm package to fit boosted regression trees to the valuation data set. We set the distribution parameter to “gaussian” indicating regression:

library(gbm)
## Warning: package 'gbm' was built under R version 3.5.2
## Loaded gbm 2.1.5
set.seed(1)
boost.valuation=gbm(Y~., data = valuationSub[train, ], distribution = "gaussian", n.trees = 5000, interaction.depth = 4)
summary(boost.valuation)

##    var  rel.inf
## x3  x3 47.52232
## x2  x2 28.58931
## x1  x1 12.39509
## x4  x4 11.49328

We see that x3 and x2 are the most important predictors. We can also produce partial dependence plots for the two variables:

par(mfrow=c(1,2))
plot(boost.valuation,i='x3')

plot(boost.valuation,i='x2')

yhat.boost=predict(boost.valuation, newdata = valuationSub[-train, ], n.trees = 45)
mean((yhat.boost- valuation.test)^2)
## [1] 81.21231
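# A sketch, not in the original run: the model was fit with 5000 trees, so we could also
# predict with all of them rather than n.trees = 45 (the result is not reported here).
yhat.boost.full=predict(boost.valuation, newdata = valuationSub[-train, ], n.trees = 5000)
mean((yhat.boost.full - valuation.test)^2)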
boost.valuation=gbm(Y~.,data = valuationSub[train, ], distribution = "gaussian", n.trees = 47, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost=predict(boost.valuation, newdata = valuationSub[-train, ], n.trees = 47)
mean((yhat.boost-valuation.test)^2)
## [1] 85.42025

References:

1. Yeh, I. C. & Hsu, T. K. (2018). Building real estate valuation models with comparative approach through case-based reasoning. Applied Soft Computing, 65, 260-271.
2. James, G., Witten, D., Hastie, T. & Tibshirani, R. (2017). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer.