Introduction

Our aim is to predict house values. Before we begin any analysis, we should always check whether the dataset has missing values. We do so by typing:

library(tidyverse)  # provides glimpse(), %>%, rename(), mutate() and ggplot2, all used below

taiwan_real_estate <- read.csv("real_estates.csv", row.names = 1)
attach(taiwan_real_estate)  # optional: later chunks pass the data explicitly
any(is.na(taiwan_real_estate))
## [1] FALSE

Let’s take a look at the structure of the dataset:

glimpse(taiwan_real_estate)
## Rows: 414
## Columns: 6
## $ house.age                           <dbl> 32.0, 19.5, 13.3, 13.3, 5.0, 7.1, …
## $ distance.to.the.nearest.MRT.station <dbl> 84.87882, 306.59470, 561.98450, 56…
## $ number.of.convenience.stores        <int> 10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, …
## $ latitude                            <dbl> 24.98298, 24.98034, 24.98746, 24.9…
## $ longitude                           <dbl> 121.5402, 121.5395, 121.5439, 121.…
## $ house.price.of.unit.area            <dbl> 37.9, 42.2, 47.3, 54.8, 43.1, 32.1…

Let’s simplify the variable names:

taiwan_real_estate <- taiwan_real_estate %>%
  rename(house_age_years = house.age,
         price_twd_msq = house.price.of.unit.area,
         n_convenience = number.of.convenience.stores,
         dist_to_mrt_m = distance.to.the.nearest.MRT.station)

We can also bin “house_age_years” into categories:

# perform binning with a specific number of bins
taiwan_real_estate <- taiwan_real_estate %>%
  mutate(house_age_cat = cut(house_age_years,
                             breaks = c(0, 15, 30, 45),
                             include.lowest = TRUE, right = FALSE))

Descriptive Statistics

Prepare a heatmap with correlation coefficients on it:

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
M <- cor(taiwan_real_estate[, 1:6])  # numeric columns only; house_age_cat (column 7) is a factor
corrplot(M, method = 'number')

Draw a scatter plot of n_convenience vs. price_twd_msq:

ggplot(taiwan_real_estate, aes(x = n_convenience, y = price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scatter Plot of n_convenience vs. price_twd_msq",
       x = "Number of Convenience Stores",
       y = "Price (TWD per square meter)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Draw a scatter plot of house_age_years vs. price_twd_msq:

ggplot(taiwan_real_estate, aes(x = house_age_years, y = price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scatter Plot of house_age_years vs. price_twd_msq",
       x = "House Age (years)",
       y = "Price (TWD per square meter)") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Draw a scatter plot of distance to nearest MRT station vs. price_twd_msq:
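
No chunk was included for this plot; following the pattern of the two plots above, a sketch:

ggplot(taiwan_real_estate, aes(x = dist_to_mrt_m, y = price_twd_msq)) +
  geom_point() +
  geom_smooth(method = "loess", se = FALSE) +
  labs(title = "Scatter Plot of dist_to_mrt_m vs. price_twd_msq",
       x = "Distance to Nearest MRT Station (m)",
       y = "Price (TWD per square meter)") +
  theme_minimal()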

Plot a histogram of price_twd_msq with 10 bins, faceting the plot so each house age group gets its own panel:

ggplot(taiwan_real_estate, aes(x = price_twd_msq)) +
  geom_histogram(bins = 10) +
  facet_wrap(~ house_age_cat)

Summarize to calculate the mean, median, standard deviation, minimum and maximum of house price per unit area by house age category:

taiwan_real_estate %>%
  group_by(house_age_cat) %>%
  summarise(mean = mean(price_twd_msq, na.rm = TRUE),
            median = median(price_twd_msq, na.rm = TRUE),
            sd = sd(price_twd_msq, na.rm = TRUE),
            min = min(price_twd_msq, na.rm = TRUE),
            max = max(price_twd_msq, na.rm = TRUE))
## # A tibble: 3 × 6
##   house_age_cat  mean median    sd   min   max
##   <fct>         <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 [0,15)         41.8   42.6  14.2   7.6 118. 
## 2 [15,30)        32.6   32.9  11.4  11.2  59.6
## 3 [30,45]        37.7   38.3  12.8  12.2  78.3

Simple model

Run a linear regression of price_twd_msq on the best single predictor (here, n_convenience):

model <- lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)

We start by displaying the statistical summary of the model using the R function summary():

summary(model)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407  -7.341  -1.788   5.984  87.681 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    27.1811     0.9419   28.86   <2e-16 ***
## n_convenience   2.6377     0.1868   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3244 
## F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16

You can access lots of different aspects of the regression object. To see what’s inside, use names():

names(model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

What do they mean?
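
For example, a few of these components can be inspected directly (an illustrative sketch):

model$coefficients         # the estimated intercept and slope
head(model$residuals)      # observed minus fitted values
head(model$fitted.values)  # predictions on the training data
model$df.residual          # residual degrees of freedom (412 here)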

Discuss model accuracy:
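
For instance, two common accuracy summaries can be read straight off the fit (a sketch):

sigma(model)              # residual standard error (11.18 above)
summary(model)$r.squared  # proportion of variance explained (0.326 above)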

Model diagnostics:

par(mfrow = c(2, 2))
plot(model) # use the name of your own model object

The four plots show…

Create the diagnostic plots using ggfortify:

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.3
autoplot(model) # use the name of your own model object

Outliers and high leverage points:

plot(model, 5)  # use the name of your own model object

Influential values:

# Cook's distance
plot(model, 4) # use the name of your own model object

or just plot all of the diagnostic plots together:

autoplot(model, which = 1:6, label.size = 3) # use the name of your own model object

Discussion:

Multiple Regression Model

Test and training set

We begin by splitting the dataset into two parts, a training set and a testing set. In this example we will randomly assign 75% of the rows to the training set and the remaining 25% to the testing set:

smp_size <- floor(0.75 * nrow(taiwan_real_estate))
set.seed(12)
train_ind <- sample(seq_len(nrow(taiwan_real_estate)), size = smp_size)
train <- taiwan_real_estate[train_ind, ]
test <- taiwan_real_estate[-train_ind, ]
  • 1st comment: floor() returns the largest integer that is not greater than the given number or expression.

  • 2nd comment: set.seed() sets the seed of R’s random number generator; we use it so the results of this example can be reproduced easily.

Now we have our training set and testing set.

Variable selection methods

Generally, selecting variables for linear regression is a debatable topic.

There are many methods of variable selection, namely forward stepwise selection, backward stepwise selection, etc.; some are valid, some are heavily criticized.

I recommend this document: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/26/lecture-26.pdf and Gung’s comment: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 if you want to learn more about the variable selection process.

If our goal is prediction, it is safer to include all predictors in our model; removing variables without knowing the science behind them usually does more harm than good!

We now create our multiple linear regression model:

model2 <- lm(price_twd_msq ~ ., data = train)
summary(model2)
## 
## Call:
## lm(formula = price_twd_msq ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.009  -4.953  -1.296   4.461  75.042 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.861e+03  7.542e+03  -0.379 0.704717    
## house_age_years      -4.209e-01  1.223e-01  -3.442 0.000659 ***
## dist_to_mrt_m        -4.558e-03  8.681e-04  -5.251 2.86e-07 ***
## n_convenience         9.826e-01  2.287e-01   4.297 2.34e-05 ***
## latitude              2.505e+02  5.375e+01   4.660 4.74e-06 ***
## longitude            -2.755e+01  6.032e+01  -0.457 0.648221    
## house_age_cat[15,30) -1.089e+00  1.916e+00  -0.568 0.570124    
## house_age_cat[30,45]  5.789e+00  3.577e+00   1.618 0.106655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.331 on 302 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5575 
## F-statistic: 56.62 on 7 and 302 DF,  p-value: < 2.2e-16

Discuss the results…

Looking at the model summary, we see that variables …. are insignificant, so let’s estimate the model without those variables:

model3 <- lm(price_twd_msq ~ n_convenience + dist_to_mrt_m, data = train)
summary(model3)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience + dist_to_mrt_m, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.936  -6.057  -1.652   4.812  77.254 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   40.7446728  1.5962317  25.526  < 2e-16 ***
## n_convenience  1.0280148  0.2429554   4.231 3.07e-05 ***
## dist_to_mrt_m -0.0060439  0.0005741 -10.528  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.09 on 307 degrees of freedom
## Multiple R-squared:  0.4864, Adjusted R-squared:  0.483 
## F-statistic: 145.3 on 2 and 307 DF,  p-value: < 2.2e-16

Evaluating multi-collinearity

There are many standards researchers apply when deciding whether a VIF is too large. In some domains a VIF over 2 is worthy of suspicion; others set the bar higher, at 5 or 10; still others say you shouldn’t pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs, but this may just be a natural, unavoidable fact of your model.
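
The chunk that computed the values below was not echoed; presumably it was the VIFs of the full training model, along the lines of (assumed):

library(car)
vif(model2)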

## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
##                     GVIF Df GVIF^(1/(2*Df))
## house_age_years 6.920312  1        2.630649
## dist_to_mrt_m   4.244914  1        2.060319
## n_convenience   1.644746  1        1.282476
## latitude        1.522464  1        1.233882
## longitude       3.014812  1        1.736321
## house_age_cat   7.480049  2        1.653774

Finally, we repeat the multi-collinearity check on the test dataset:
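
Again the chunk was not echoed; presumably the full model was refitted on the test set, e.g. (assumed):

vif(lm(price_twd_msq ~ ., data = test))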

##                     GVIF Df GVIF^(1/(2*Df))
## house_age_years 7.926079  1        2.815329
## dist_to_mrt_m   4.898253  1        2.213200
## n_convenience   1.606998  1        1.267674
## latitude        1.998653  1        1.413737
## longitude       2.926354  1        1.710659
## house_age_cat   8.511494  2        1.708053

Interpret the results… The values are quite similar.

Variable selection using best subset regression

Best subset and stepwise (forward, backward, both) techniques of variable selection can be used to come up with the best linear regression model for the dependent variable price_twd_msq.

Variable selection using stepwise regression
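
The chunks that produced the traces below were not echoed. Presumably three step() calls were made, in the order forward, backward, and both; the object names below match the AIC comparison further down, while the exact scope argument is an assumption:

null_model <- lm(price_twd_msq ~ 1, data = train)
full_model <- lm(price_twd_msq ~ ., data = train)
model_forward  <- step(null_model, scope = formula(full_model), direction = "forward")
model_backward <- step(full_model, direction = "backward")
model_step     <- step(null_model, scope = formula(full_model), direction = "both")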

## Start:  AIC=1638.47
## price_twd_msq ~ 1
## 
##                   Df Sum of Sq   RSS    AIC
## + dist_to_mrt_m    1   27754.0 33057 1451.5
## + n_convenience    1   18298.4 42512 1529.5
## + longitude        1   17543.7 43267 1535.0
## + latitude         1   16685.2 44125 1541.0
## + house_age_cat    2    5118.7 55692 1615.2
## + house_age_years  1    2037.6 58773 1629.9
## <none>                         60811 1638.5
## 
## Step:  AIC=1451.52
## price_twd_msq ~ dist_to_mrt_m
## 
##                   Df Sum of Sq   RSS    AIC
## + n_convenience    1    1821.6 31235 1435.9
## + latitude         1    1793.8 31263 1436.2
## + house_age_years  1    1773.7 31283 1436.4
## + house_age_cat    2    1666.5 31390 1439.5
## <none>                         33057 1451.5
## + longitude        1      27.3 33029 1453.3
## 
## Step:  AIC=1435.94
## price_twd_msq ~ dist_to_mrt_m + n_convenience
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_years  1    2123.9 29111 1416.1
## + house_age_cat    2    2113.2 29122 1418.2
## + latitude         1    1388.9 29846 1423.8
## <none>                         31235 1435.9
## + longitude        1       6.3 31229 1437.9
## 
## Step:  AIC=1416.11
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
## 
##                 Df Sum of Sq   RSS    AIC
## + latitude       1   1822.97 27288 1398.1
## + house_age_cat  2    871.12 28240 1410.7
## <none>                       29111 1416.1
## + longitude      1     32.98 29078 1417.8
## 
## Step:  AIC=1398.07
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude
## 
##                 Df Sum of Sq   RSS    AIC
## + house_age_cat  2    972.81 26315 1390.8
## <none>                       27288 1398.1
## + longitude      1      8.02 27280 1400.0
## 
## Step:  AIC=1390.81
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude + house_age_cat
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   26315 1390.8
## + longitude  1    18.161 26297 1392.6
## Start:  AIC=1392.6
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## - longitude        1     18.16 26315 1390.8
## <none>                         26297 1392.6
## - house_age_cat    2    982.95 27280 1400.0
## - house_age_years  1   1031.51 27329 1402.5
## - n_convenience    1   1607.54 27905 1409.0
## - latitude         1   1891.31 28189 1412.1
## - dist_to_mrt_m    1   2400.71 28698 1417.7
## 
## Step:  AIC=1390.81
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         26315 1390.8
## - house_age_cat    2     972.8 27288 1398.1
## - house_age_years  1    1035.7 27351 1400.8
## - n_convenience    1    1625.4 27941 1407.4
## - latitude         1    1924.7 28240 1410.7
## - dist_to_mrt_m    1    4348.8 30664 1436.2
## Start:  AIC=1638.47
## price_twd_msq ~ 1
## 
##                   Df Sum of Sq   RSS    AIC
## + dist_to_mrt_m    1   27754.0 33057 1451.5
## + n_convenience    1   18298.4 42512 1529.5
## + longitude        1   17543.7 43267 1535.0
## + latitude         1   16685.2 44125 1541.0
## + house_age_cat    2    5118.7 55692 1615.2
## + house_age_years  1    2037.6 58773 1629.9
## <none>                         60811 1638.5
## 
## Step:  AIC=1451.52
## price_twd_msq ~ dist_to_mrt_m
## 
##                   Df Sum of Sq   RSS    AIC
## + n_convenience    1    1821.6 31235 1435.9
## + latitude         1    1793.8 31263 1436.2
## + house_age_years  1    1773.7 31283 1436.4
## + house_age_cat    2    1666.5 31390 1439.5
## <none>                         33057 1451.5
## + longitude        1      27.3 33029 1453.3
## - dist_to_mrt_m    1   27754.0 60811 1638.5
## 
## Step:  AIC=1435.94
## price_twd_msq ~ dist_to_mrt_m + n_convenience
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_years  1    2123.9 29111 1416.1
## + house_age_cat    2    2113.2 29122 1418.2
## + latitude         1    1388.9 29846 1423.8
## <none>                         31235 1435.9
## + longitude        1       6.3 31229 1437.9
## - n_convenience    1    1821.6 33057 1451.5
## - dist_to_mrt_m    1   11277.2 42512 1529.5
## 
## Step:  AIC=1416.11
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
## 
##                   Df Sum of Sq   RSS    AIC
## + latitude         1    1823.0 27288 1398.1
## + house_age_cat    2     871.1 28240 1410.7
## <none>                         29111 1416.1
## + longitude        1      33.0 29078 1417.8
## - house_age_years  1    2123.9 31235 1435.9
## - n_convenience    1    2171.8 31283 1436.4
## - dist_to_mrt_m    1   10564.4 39676 1510.1
## 
## Step:  AIC=1398.07
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_cat    2     972.8 26315 1390.8
## <none>                         27288 1398.1
## + longitude        1       8.0 27280 1400.0
## - n_convenience    1    1706.1 28994 1414.9
## - latitude         1    1823.0 29111 1416.1
## - house_age_years  1    2557.9 29846 1423.8
## - dist_to_mrt_m    1    5494.4 32783 1452.9
## 
## Step:  AIC=1390.81
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         26315 1390.8
## + longitude        1      18.2 26297 1392.6
## - house_age_cat    2     972.8 27288 1398.1
## - house_age_years  1    1035.7 27351 1400.8
## - n_convenience    1    1625.4 27941 1407.4
## - latitude         1    1924.7 28240 1410.7
## - dist_to_mrt_m    1    4348.8 30664 1436.2

Comparing competing models
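
The comparison below was presumably produced with AIC() on the three stepwise fits, followed by a summary of the selected model (a sketch, assuming the object names above). Note that AIC() keeps additive constants that step() drops, which is why these values differ from the AICs printed in the traces:

AIC(model_forward, model_backward, model_step)
summary(model_step)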

##                df      AIC
## model_forward   8 2272.556
## model_backward  8 2272.556
## model_step      8 2272.556
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience + 
##     house_age_years + latitude + house_age_cat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.913  -5.098  -1.227   4.551  75.387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.250e+03  1.337e+03  -4.673 4.46e-06 ***
## dist_to_mrt_m        -4.274e-03  6.040e-04  -7.076 1.03e-11 ***
## n_convenience         9.871e-01  2.282e-01   4.326 2.06e-05 ***
## house_age_years      -4.217e-01  1.221e-01  -3.453 0.000633 ***
## latitude              2.521e+02  5.356e+01   4.708 3.82e-06 ***
## house_age_cat[15,30) -1.024e+00  1.909e+00  -0.536 0.592043    
## house_age_cat[30,45]  5.851e+00  3.570e+00   1.639 0.102303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.319 on 303 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5587 
## F-statistic:  66.2 on 6 and 303 DF,  p-value: < 2.2e-16

From best subset regression and stepwise selection (forward, backward, both), we see that …….

Plots show that…

Models are compared based on adjusted R-squared, AIC and BIC for in-sample performance, and on the mean squared prediction error (MSPE) for out-of-sample performance.
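
The five numbers below appear to be, in order, the in-sample MSE, R-squared, adjusted R-squared, AIC and BIC of model3; a sketch of how they could be computed (the mapping to model3 is an assumption, though the R-squared values match its summary above):

model3_sum <- summary(model3)
mean(model3_sum$residuals^2)  # in-sample MSE
model3_sum$r.squared
model3_sum$adj.r.squared
AIC(model3)
BIC(model3)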

## [1] 100.7582
## [1] 0.4863557
## [1] 0.4830094
## [1] 2317.686
## [1] 2332.633

Finally, we can check the out-of-sample prediction error, i.e. the test MSPE:
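
The chunk was not echoed; presumably the squared prediction errors of two competing models, averaged over the test set, along these lines (which two models is an assumption):

pred_step <- predict(model_step, newdata = test)
mean((test$price_twd_msq - pred_step)^2)  # MSPE of the stepwise model
pred_m3 <- predict(model3, newdata = test)
mean((test$price_twd_msq - pred_m3)^2)    # MSPE of model3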

## [1] 55.28947
## [1] 71.45759

Cross Validation

Please check how the function cv.glm works (see ?cv.glm).

We will extract just one element of the object created by the cv.glm command: the raw cross-validation estimate of prediction error.
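
A sketch of the call (the exact model, data and number of folds K are assumptions):

library(boot)
glm_fit <- glm(price_twd_msq ~ ., data = taiwan_real_estate)
cv_res <- cv.glm(taiwan_real_estate, glm_fit, K = 10)
cv_res$delta[1]  # raw cross-validation estimate of prediction error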

## Warning: package 'boot' was built under R version 4.3.3
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
## [1] NaN

Strange result

Based on AIC criteria …

We need to check out-of-sample MSPE for both models. Based on out-of-sample prediction error, model …