Introduction

Our aim is to predict house values. Before we begin any analysis, we should always check whether the dataset contains missing values. We do so by typing:

taiwan_real_estate <- read.csv("real_estates.csv", row.names = 1)
attach(taiwan_real_estate)
# returns TRUE if any value in the dataset is missing
any(is.na(taiwan_real_estate))
## [1] FALSE

Let’s take a look at the structure of the dataset:

library(dplyr)   # provides glimpse(), rename(), mutate(), and the %>% pipe
glimpse(taiwan_real_estate)
## Rows: 414
## Columns: 6
## $ house.age                           <dbl> 32.0, 19.5, 13.3, 13.3, 5.0, 7.1, …
## $ distance.to.the.nearest.MRT.station <dbl> 84.87882, 306.59470, 561.98450, 56…
## $ number.of.convenience.stores        <int> 10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, …
## $ latitude                            <dbl> 24.98298, 24.98034, 24.98746, 24.9…
## $ longitude                           <dbl> 121.5402, 121.5395, 121.5439, 121.…
## $ house.price.of.unit.area            <dbl> 37.9, 42.2, 47.3, 54.8, 43.1, 32.1…

Let’s simplify the variables’ names:

taiwan_real_estate <- taiwan_real_estate %>%
  rename(house_age_years = house.age,
         price_twd_msq = house.price.of.unit.area,
         n_convenience = number.of.convenience.stores,
         dist_to_mrt_m = distance.to.the.nearest.MRT.station)

We can also perform binning for “house_age_years”:

# perform binning with a specific set of breaks (15-year intervals)
taiwan_real_estate <- taiwan_real_estate %>%
  mutate(house_age_cat = cut(house_age_years,
                             breaks = c(0, 15, 30, 45),
                             include.lowest = TRUE,
                             right = FALSE))

Descriptive Statistics

Prepare a heatmap with correlation coefficients on it:

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
M <- cor(taiwan_real_estate[, 1:6])   # correlation matrix of the six numeric columns

corrplot(M, method = 'color', col = colorRampPalette(c("hotpink3", "white", "orchid4"))(200),
         addCoef.col = "black", # Add correlation coefficients
         tl.col = "black", tl.srt = 45, # Text label color and rotation
         title = "Correlation Heatmap", mar = c(0, 0, 1, 0))
# corrplot() draws with base graphics, so appending ggplot2's theme_minimal()
# has no effect and only printed NULL; it is dropped here

Thanks to this heat map we can easily read the relationship between each pair of variables; it sums up the correlations nicely in a visual way.

Draw a scatter plot of n_convenience vs. price_twd_msq:

ggplot(data = taiwan_real_estate, aes(x = n_convenience, y = price_twd_msq))+
  geom_point(color = "mediumpurple3")+
  geom_smooth(method=lm,se=FALSE, color = "olivedrab")+
  labs(
    title = "Number of Conveneince Stores vs House Price of Unit Area", 
    x = "Number of Conveneince Stores", 
    y = "House Price of Unit Area"
  )+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Based on this plot and its trend, we can say that there is a positive correlation: the more convenience stores in the neighborhood, the higher the price per unit area.

Draw a scatter plot of house_age_years vs. price_twd_msq:

ggplot(data = taiwan_real_estate, aes(x = house_age_years, y = price_twd_msq))+
  geom_point(color = "darkseagreen")+
  geom_smooth(method=lm,se=FALSE, color = "tomato4")+
  labs(
    title = "House Age vs House Price of Unit Area", 
    x = "House Age", 
    y = "House Price of Unit Area"
  )+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

This plot shows the correlation between house age and price per unit area: the older the house, the lower the price per square meter, though the relationship is weak (r ≈ -0.21).

Draw a scatter plot of distance to nearest MRT station vs. price_twd_msq:
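The plotting code was omitted here; a sketch consistent with the earlier scatter plots (the colors are my own choice) would be:

ggplot(data = taiwan_real_estate, aes(x = dist_to_mrt_m, y = price_twd_msq)) +
  geom_point(color = "steelblue") +
  geom_smooth(method = lm, se = FALSE, color = "tomato4") +
  labs(
    title = "Distance to Nearest MRT Station vs House Price of Unit Area",
    x = "Distance to Nearest MRT Station (m)",
    y = "House Price of Unit Area"
  ) +
  theme_minimal()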

Here we see that the greater the distance to the nearest MRT station, the lower the price per unit area; compared to the other correlations, this one is by far the largest in absolute value (r ≈ -0.67).

Plot a histogram of price_twd_msq with 10 bins; facet the plot so each house age group gets its own panel:

# note: house_age duplicates house_age_cat; keeping both columns causes
# rank deficiency (NA coefficients) in the regression models later on
taiwan_real_estate$house_age <- as.factor(taiwan_real_estate$house_age_cat)

ggplot(data = taiwan_real_estate, aes(x = price_twd_msq)) +
  geom_histogram(bins = 10, fill = "darkorange3", color = "black") +
  facet_wrap(~ house_age) +
  labs(
    title = "House Price of Unit Area vs House Age Group",
    x = "House Price of Unit Area",
    y = "Amount"
  ) +
  theme_minimal()

This histogram nicely shows the number of houses and their prices for each age group: the youngest group contains the most houses and the oldest group rather few, while the price distributions themselves are fairly similar across groups.

Summarize to calculate the mean, sd, median, etc. of house price per unit area by house age group:

summary_stats <- taiwan_real_estate %>%
  group_by(house_age) %>%
  summarize(
    min_price = min(price_twd_msq, na.rm = TRUE),
    max_price = max(price_twd_msq, na.rm = TRUE),
    median_price = median(price_twd_msq, na.rm = TRUE),
    mean_price = mean(price_twd_msq, na.rm = TRUE),
    sd_price = sd(price_twd_msq, na.rm = TRUE),
    iqr_price = IQR(price_twd_msq, na.rm = TRUE),
    q1_price = quantile(price_twd_msq, 0.25, na.rm = TRUE),
    q3_price = quantile(price_twd_msq, 0.75, na.rm = TRUE),
    count = n()
  )

# Print the summary statistics
print(summary_stats)
## # A tibble: 3 × 10
##   house_age min_price max_price median_price mean_price sd_price iqr_price
##   <fct>         <dbl>     <dbl>        <dbl>      <dbl>    <dbl>     <dbl>
## 1 [0,15)          7.6     118.          42.6       41.8     14.2      21.0
## 2 [15,30)        11.2      59.6         32.9       32.6     11.4      17.5
## 3 [30,45]        12.2      78.3         38.3       37.7     12.8      11.0
## # ℹ 3 more variables: q1_price <dbl>, q3_price <dbl>, count <int>

Simple model

Run a linear regression of price_twd_msq vs. the best single predictor:

correlation_coeffs <- cor(taiwan_real_estate[c("house_age_years", "n_convenience", "dist_to_mrt_m", "price_twd_msq")], method = "pearson")

correlation_with_price <- correlation_coeffs["price_twd_msq", c("house_age_years", "n_convenience", "dist_to_mrt_m")]

print(correlation_with_price)
## house_age_years   n_convenience   dist_to_mrt_m 
##      -0.2105670       0.5710049      -0.6736129
# n_convenience has the highest positive correlation with price_twd_msq.
# I assume "best" means the highest positive correlation with price_twd_msq,
# so the best single predictor is n_convenience (note that dist_to_mrt_m is
# actually larger in absolute value, but its correlation is negative).

simple_model <- lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)

We start by displaying the statistical summary of the model using the R function summary():

summary(simple_model)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience, data = taiwan_real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.407  -7.341  -1.788   5.984  87.681 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    27.1811     0.9419   28.86   <2e-16 ***
## n_convenience   2.6377     0.1868   14.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.18 on 412 degrees of freedom
## Multiple R-squared:  0.326,  Adjusted R-squared:  0.3244 
## F-statistic: 199.3 on 1 and 412 DF,  p-value: < 2.2e-16

You can access lots of different aspects of the regression object. To see what’s inside, use names():

names(simple_model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

What do they mean?

  • coefficients: the estimated coefficients (parameters) of the regression model, representing the relationships between the predictors and the response.

  • residuals: the differences between the observed values of the response and the values predicted by the model; they indicate how well the model fits the individual data points.

  • effects: the orthogonal single-degree-of-freedom effects obtained from the QR decomposition of the model matrix.

  • rank: the rank of the model matrix, i.e. the number of linearly independent columns among the predictors.

  • fitted.values: the predicted values of the response based on the regression model.

  • assign: the mapping of columns of the model matrix to the terms in the formula.

  • qr: the QR decomposition of the model matrix, the numerical method used to solve the least-squares problem.

  • df.residual: the residual degrees of freedom, i.e. the number of observations minus the number of estimated parameters.

  • xlevels: the levels of any factor predictors used in the fit.

  • call: the original call that produced the model object, including the formula and any other arguments.

  • terms: the terms object used in constructing the model formula.

  • model: the model frame, i.e. the data actually used in fitting.
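For example, individual components can be accessed directly:

coef(simple_model)                 # equivalent to simple_model$coefficients
head(simple_model$fitted.values)   # first few fitted values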

Discuss model accuracy:

  • Coefficient interpretation: for each additional convenience store nearby, the price per square meter increases by approximately 2.64 TWD.

  • Significance: the p-values (< 2e-16) give strong evidence that both the intercept and the effect of n_convenience on price_twd_msq are nonzero.

  • Model fit: the adjusted R-squared is 0.3244, indicating a moderate level of explanatory power; n_convenience alone explains about a third of the variance in price.

  • Residual analysis: the residual standard error (11.18) is the average amount by which the observed price_twd_msq values deviate from the values predicted by the model.
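As a quick sanity check of the coefficient interpretation, we can predict the price for a hypothetical house (the input value 5 is just an illustration):

predict(simple_model, newdata = data.frame(n_convenience = 5))
# roughly 27.18 + 2.64 * 5 ≈ 40.4, in the units of price_twd_msq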

Model diagnostics:

par(mfrow = c(2, 2))
plot(simple_model)

The four plots assess the regression assumptions: linearity (Residuals vs Fitted), normality of residuals (Normal Q-Q), homoscedasticity (Scale-Location), and influential observations (Residuals vs Leverage).

Create the diagnostic plots using ggfortify:

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.3
autoplot(simple_model)

Outliers and high leverage points:

plot(simple_model, 5)

Influential values:

# Cook's distance
plot(simple_model, 4)

or just plot all of the diagnostic plots together:

autoplot(simple_model, which = 1:6, label.size = 3)

Discussion:

The Residuals vs Fitted, Scale-Location, and Normal Q-Q plots tell a consistent story: the residuals have a heavy right tail, so the normality assumption is questionable. The Residuals vs Leverage plot shows a number of outliers.

Multiple Regression Model

Test and training set

We begin by splitting the dataset into two parts: a training set and a test set. In this example we randomly assign 75% of the rows to the training set and the remaining 25% to the test set:

smp_size <- floor(0.75 * nrow(taiwan_real_estate))
set.seed(12)
train_ind <- sample(seq_len(nrow(taiwan_real_estate)), size = smp_size)
train <- taiwan_real_estate[train_ind, ]
test <- taiwan_real_estate[-train_ind, ]
  • 1st comment: floor() returns the largest integer that is not greater than its argument.

  • 2nd comment: set.seed() sets the seed of R’s random number generator; we use it so the results of this example can be reproduced exactly.

Now we have our training set and testing set.

Variable selection methods

Generally, selecting variables for linear regression is a debatable topic.

There are many methods for variable selection, namely forward stepwise selection, backward stepwise selection, etc.; some are valid, some are heavily criticized.

I recommend this document: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/26/lecture-26.pdf and Gung’s comment: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 if you want to learn more about variable selection process.

If our goal is prediction, it is safer to include all predictors in our model; removing variables without knowing the science behind them usually does more harm than good!

We begin to create our multiple linear regression model:

model2 <- lm(price_twd_msq ~ ., data = train)
summary(model2)
## 
## Call:
## lm(formula = price_twd_msq ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.009  -4.953  -1.296   4.461  75.042 
## 
## Coefficients: (2 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.861e+03  7.542e+03  -0.379 0.704717    
## house_age_years      -4.209e-01  1.223e-01  -3.442 0.000659 ***
## dist_to_mrt_m        -4.558e-03  8.681e-04  -5.251 2.86e-07 ***
## n_convenience         9.826e-01  2.287e-01   4.297 2.34e-05 ***
## latitude              2.505e+02  5.375e+01   4.660 4.74e-06 ***
## longitude            -2.755e+01  6.032e+01  -0.457 0.648221    
## house_age_cat[15,30) -1.089e+00  1.916e+00  -0.568 0.570124    
## house_age_cat[30,45]  5.789e+00  3.577e+00   1.618 0.106655    
## house_age[15,30)             NA         NA      NA       NA    
## house_age[30,45]             NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.331 on 302 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5575 
## F-statistic: 56.62 on 7 and 302 DF,  p-value: < 2.2e-16

Discuss the results:

a) The intercept term is not significant (p = 0.705).

b) house_age_years: significant (p < 0.001) -> as the age of the house increases, the price decreases.

c) dist_to_mrt_m: significant (p < 0.001) -> as the distance to the nearest MRT station increases, the price decreases.

d) n_convenience: significant (p < 0.001) -> an increase in the number of convenience stores increases the price.

e) latitude: significant (p < 0.001) -> higher latitude correlates with higher prices.

f) longitude: not significant (p = 0.648) -> longitude does not have a significant impact on the price.

g) house_age_cat: neither dummy is significant (p = 0.570 and p = 0.107). The house_age rows are NA because house_age is an exact duplicate of house_age_cat, so lm() drops it as a singularity.

Looking at the model summary, we see that longitude and the house-age category variables are insignificant, so let’s estimate the model without them:

model3 <- lm(formula = price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + latitude, data = train)
summary(model3)
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + dist_to_mrt_m + 
##     n_convenience + latitude, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.782  -5.448  -1.616   4.269  75.310 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -6.079e+03  1.356e+03  -4.481 1.05e-05 ***
## house_age_years -2.546e-01  4.762e-02  -5.347 1.76e-07 ***
## dist_to_mrt_m   -4.692e-03  5.988e-04  -7.837 7.78e-14 ***
## n_convenience    1.006e+00  2.303e-01   4.367 1.73e-05 ***
## latitude         2.452e+02  5.432e+01   4.514 9.10e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.459 on 305 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5454 
## F-statistic: 93.67 on 4 and 305 DF,  p-value: < 2.2e-16

Evaluating multi-collinearity

There are many standards researchers apply for deciding whether a VIF is too large. In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Others still will say you shouldn’t pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs, but this may just be a natural, unavoidable fact with your model.
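The chunk that produced the output below was not echoed; presumably it was simply:

library(car)
vif(model3)   # variance inflation factors for the predictors of model3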

## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## house_age_years   dist_to_mrt_m   n_convenience        latitude 
##        1.021076        1.965314        1.623361        1.513424

Finally, we test our model on the test dataset:

Interpret results:

Multicollinearity -> all VIF values are below 5, indicating that multicollinearity is not a significant concern; the predictors are not highly correlated with each other.
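The diagnostic plots interpreted below are not shown in the text; presumably they were produced with ggfortify once more (a sketch, assuming model3 is the model under inspection):

autoplot(model3)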

Diagnostics of model3:

  1. Residuals vs Fitted: the residuals appear to be scattered randomly around the horizontal line at 0 -> model3 fits the data well.

  2. Normal Q-Q plot: the plot shows some deviation from the line at the tails, but this does not seem to be a major concern because the points are not off severely.

  3. Scale-Location plot: the spread is relatively consistent across the range of fitted values -> homoscedasticity holds.

  4. Residuals vs Leverage: most points are within the Cook’s distance lines -> no highly influential points affecting the model.

Variable selection using best subset regression

Best subset and stepwise (forward, backward, both) techniques of variable selection can be used to come up with the best linear regression model for the dependent variable price_twd_msq.
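The selection chunk itself was not echoed; based on the Call line in the output below, it was presumably:

library(leaps)
best_subset <- regsubsets(price_twd_msq ~ ., data = train, nvmax = 10)
summary(best_subset)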

## Warning: package 'leaps' was built under R version 4.3.3
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : 2 linear dependencies found
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax, force.in =
## force.in, : nvmax reduced to 7
## Subset selection object
## Call: regsubsets.formula(price_twd_msq ~ ., data = train, nvmax = 10)
## 9 Variables  (and intercept)
##                      Forced in Forced out
## house_age_years          FALSE      FALSE
## dist_to_mrt_m            FALSE      FALSE
## n_convenience            FALSE      FALSE
## latitude                 FALSE      FALSE
## longitude                FALSE      FALSE
## house_age_cat[15,30)     FALSE      FALSE
## house_age_cat[30,45]     FALSE      FALSE
## house_age[15,30)         FALSE      FALSE
## house_age[30,45]         FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          house_age_years dist_to_mrt_m n_convenience latitude longitude
## 1  ( 1 ) " "             "*"           " "           " "      " "      
## 2  ( 1 ) " "             "*"           "*"           " "      " "      
## 3  ( 1 ) "*"             "*"           " "           "*"      " "      
## 4  ( 1 ) "*"             "*"           "*"           "*"      " "      
## 5  ( 1 ) "*"             "*"           "*"           "*"      " "      
## 6  ( 1 ) "*"             "*"           "*"           "*"      " "      
## 7  ( 1 ) "*"             "*"           "*"           "*"      "*"      
##          house_age_cat[15,30) house_age_cat[30,45] house_age[15,30)
## 1  ( 1 ) " "                  " "                  " "             
## 2  ( 1 ) " "                  " "                  " "             
## 3  ( 1 ) " "                  " "                  " "             
## 4  ( 1 ) " "                  " "                  " "             
## 5  ( 1 ) " "                  " "                  " "             
## 6  ( 1 ) " "                  " "                  "*"             
## 7  ( 1 ) "*"                  "*"                  " "             
##          house_age[30,45]
## 1  ( 1 ) " "             
## 2  ( 1 ) " "             
## 3  ( 1 ) " "             
## 4  ( 1 ) " "             
## 5  ( 1 ) "*"             
## 6  ( 1 ) "*"             
## 7  ( 1 ) " "
##          (Intercept)      house_age_years        dist_to_mrt_m 
##                 TRUE                 TRUE                 TRUE 
##        n_convenience             latitude            longitude 
##                 TRUE                 TRUE                FALSE 
## house_age_cat[15,30) house_age_cat[30,45]     house_age[15,30) 
##                FALSE                FALSE                FALSE 
##     house_age[30,45] 
##                 TRUE

Variable selection using stepwise regression
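Again the code was not echoed; judging from the output (forward selection starting from the null model, then backward and both-direction selection from the full model), it was presumably along these lines:

null_model <- lm(price_twd_msq ~ 1, data = train)
full_model <- lm(price_twd_msq ~ ., data = train)
model_forward  <- step(null_model, scope = formula(full_model), direction = "forward")
model.backward <- step(full_model, direction = "backward")
model.step     <- step(full_model, direction = "both")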

## Start:  AIC=1638.47
## price_twd_msq ~ 1
## 
##                   Df Sum of Sq   RSS    AIC
## + dist_to_mrt_m    1   27754.0 33057 1451.5
## + n_convenience    1   18298.4 42512 1529.5
## + longitude        1   17543.7 43267 1535.0
## + latitude         1   16685.2 44125 1541.0
## + house_age_cat    2    5118.7 55692 1615.2
## + house_age        2    5118.7 55692 1615.2
## + house_age_years  1    2037.6 58773 1629.9
## <none>                         60811 1638.5
## 
## Step:  AIC=1451.52
## price_twd_msq ~ dist_to_mrt_m
## 
##                   Df Sum of Sq   RSS    AIC
## + n_convenience    1    1821.6 31235 1435.9
## + latitude         1    1793.8 31263 1436.2
## + house_age_years  1    1773.7 31283 1436.4
## + house_age_cat    2    1666.5 31390 1439.5
## + house_age        2    1666.5 31390 1439.5
## <none>                         33057 1451.5
## + longitude        1      27.3 33029 1453.3
## 
## Step:  AIC=1435.94
## price_twd_msq ~ dist_to_mrt_m + n_convenience
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_years  1    2123.9 29111 1416.1
## + house_age_cat    2    2113.2 29122 1418.2
## + house_age        2    2113.2 29122 1418.2
## + latitude         1    1388.9 29846 1423.8
## <none>                         31235 1435.9
## + longitude        1       6.3 31229 1437.9
## 
## Step:  AIC=1416.11
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
## 
##                 Df Sum of Sq   RSS    AIC
## + latitude       1   1822.97 27288 1398.1
## + house_age_cat  2    871.12 28240 1410.7
## + house_age      2    871.12 28240 1410.7
## <none>                       29111 1416.1
## + longitude      1     32.98 29078 1417.8
## 
## Step:  AIC=1398.07
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude
## 
##                 Df Sum of Sq   RSS    AIC
## + house_age_cat  2    972.81 26315 1390.8
## + house_age      2    972.81 26315 1390.8
## <none>                       27288 1398.1
## + longitude      1      8.02 27280 1400.0
## 
## Step:  AIC=1390.81
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude + house_age_cat
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   26315 1390.8
## + longitude  1    18.161 26297 1392.6
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience + 
##     house_age_years + latitude + house_age_cat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.913  -5.098  -1.227   4.551  75.387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.250e+03  1.337e+03  -4.673 4.46e-06 ***
## dist_to_mrt_m        -4.274e-03  6.040e-04  -7.076 1.03e-11 ***
## n_convenience         9.871e-01  2.282e-01   4.326 2.06e-05 ***
## house_age_years      -4.217e-01  1.221e-01  -3.453 0.000633 ***
## latitude              2.521e+02  5.356e+01   4.708 3.82e-06 ***
## house_age_cat[15,30) -1.024e+00  1.909e+00  -0.536 0.592043    
## house_age_cat[30,45]  5.851e+00  3.570e+00   1.639 0.102303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.319 on 303 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5587 
## F-statistic:  66.2 on 6 and 303 DF,  p-value: < 2.2e-16
## Start:  AIC=1392.6
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat + house_age
## 
## 
## Step:  AIC=1392.6
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## - longitude        1     18.16 26315 1390.8
## <none>                         26297 1392.6
## - house_age_cat    2    982.95 27280 1400.0
## - house_age_years  1   1031.51 27329 1402.5
## - n_convenience    1   1607.54 27905 1409.0
## - latitude         1   1891.31 28189 1412.1
## - dist_to_mrt_m    1   2400.71 28698 1417.7
## 
## Step:  AIC=1390.81
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         26315 1390.8
## - house_age_cat    2     972.8 27288 1398.1
## - house_age_years  1    1035.7 27351 1400.8
## - n_convenience    1    1625.4 27941 1407.4
## - latitude         1    1924.7 28240 1410.7
## - dist_to_mrt_m    1    4348.8 30664 1436.2
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + dist_to_mrt_m + 
##     n_convenience + latitude + house_age_cat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.913  -5.098  -1.227   4.551  75.387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.250e+03  1.337e+03  -4.673 4.46e-06 ***
## house_age_years      -4.217e-01  1.221e-01  -3.453 0.000633 ***
## dist_to_mrt_m        -4.274e-03  6.040e-04  -7.076 1.03e-11 ***
## n_convenience         9.871e-01  2.282e-01   4.326 2.06e-05 ***
## latitude              2.521e+02  5.356e+01   4.708 3.82e-06 ***
## house_age_cat[15,30) -1.024e+00  1.909e+00  -0.536 0.592043    
## house_age_cat[30,45]  5.851e+00  3.570e+00   1.639 0.102303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.319 on 303 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5587 
## F-statistic:  66.2 on 6 and 303 DF,  p-value: < 2.2e-16
## Start:  AIC=1392.6
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat + house_age
## 
## 
## Step:  AIC=1392.6
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## - longitude        1     18.16 26315 1390.8
## <none>                         26297 1392.6
## - house_age_cat    2    982.95 27280 1400.0
## - house_age_years  1   1031.51 27329 1402.5
## - n_convenience    1   1607.54 27905 1409.0
## - latitude         1   1891.31 28189 1412.1
## - dist_to_mrt_m    1   2400.71 28698 1417.7
## 
## Step:  AIC=1390.81
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         26315 1390.8
## + longitude        1      18.2 26297 1392.6
## - house_age_cat    2     972.8 27288 1398.1
## - house_age_years  1    1035.7 27351 1400.8
## - n_convenience    1    1625.4 27941 1407.4
## - latitude         1    1924.7 28240 1410.7
## - dist_to_mrt_m    1    4348.8 30664 1436.2
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + dist_to_mrt_m + 
##     n_convenience + latitude + house_age_cat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.913  -5.098  -1.227   4.551  75.387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.250e+03  1.337e+03  -4.673 4.46e-06 ***
## house_age_years      -4.217e-01  1.221e-01  -3.453 0.000633 ***
## dist_to_mrt_m        -4.274e-03  6.040e-04  -7.076 1.03e-11 ***
## n_convenience         9.871e-01  2.282e-01   4.326 2.06e-05 ***
## latitude              2.521e+02  5.356e+01   4.708 3.82e-06 ***
## house_age_cat[15,30) -1.024e+00  1.909e+00  -0.536 0.592043    
## house_age_cat[30,45]  5.851e+00  3.570e+00   1.639 0.102303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.319 on 303 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5587 
## F-statistic:  66.2 on 6 and 303 DF,  p-value: < 2.2e-16

Comparing competing models
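The comparison code was not echoed; presumably it was along these lines (note that AIC() reports the full log-likelihood-based AIC, which is why the values differ from those printed during step()):

AIC(model_forward, model.backward, model.step)
summary(model.step)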

##                df      AIC
## model_forward   8 2272.556
## model.backward  8 2272.556
## model.step      8 2272.556
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + dist_to_mrt_m + 
##     n_convenience + latitude + house_age_cat, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.913  -5.098  -1.227   4.551  75.387 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -6.250e+03  1.337e+03  -4.673 4.46e-06 ***
## house_age_years      -4.217e-01  1.221e-01  -3.453 0.000633 ***
## dist_to_mrt_m        -4.274e-03  6.040e-04  -7.076 1.03e-11 ***
## n_convenience         9.871e-01  2.282e-01   4.326 2.06e-05 ***
## latitude              2.521e+02  5.356e+01   4.708 3.82e-06 ***
## house_age_cat[15,30) -1.024e+00  1.909e+00  -0.536 0.592043    
## house_age_cat[30,45]  5.851e+00  3.570e+00   1.639 0.102303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.319 on 303 degrees of freedom
## Multiple R-squared:  0.5673, Adjusted R-squared:  0.5587 
## F-statistic:  66.2 on 6 and 303 DF,  p-value: < 2.2e-16

From best subset regression and stepwise selection (forward, backward, both) we see that the methods converge on the same model: the variables dist_to_mrt_m, n_convenience, house_age_years, latitude, and house_age_cat provide the best fit for predicting price_twd_msq. This is supported by the identical AIC values and by the fact that these variables appear in every final selected model.
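The diagnostic plots discussed below are not embedded in the text; presumably they were drawn from the selected stepwise model, e.g.:

autoplot(model.step, which = 1:6, label.size = 3)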

Plots show that:

  1. Residuals vs Fitted: residuals appear mostly randomly scattered -> the linearity assumption is satisfied; slight deviations can be spotted, but without any clear pattern.

  2. Normal Q-Q Plot: most of the residuals fall along the line, with some deviations at the tails -> normality holds approximately, except at the extremes.

  3. Scale-Location Plot: residuals are spread relatively evenly, with a slight increase in variability at higher fitted values -> homoscedasticity is generally satisfied, with possible heteroscedasticity at higher fitted values.

  4. Residuals vs Leverage: a few points have higher leverage and some fall outside the Cook’s distance lines (e.g. 271, 273) -> these points could be influential.

Models are compared based on adjusted R-squared, AIC, and BIC for in-sample performance, and on the mean squared prediction error (MSPE) for out-of-sample performance.

Finally, we can check the Out-of-sample Prediction or test error (MSPE):
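The computation itself was not shown; a minimal sketch of the test-set MSPE (the helper name mspe is my own) is:

# mean squared prediction error on the held-out test set
mspe <- function(model, newdata) {
  pred <- predict(model, newdata = newdata)
  mean((newdata$price_twd_msq - pred)^2)
}
mspe(model2, test)   # repeat for model3 and the stepwise model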

## Warning in predict.lm(model2, newdata = test): prediction from rank-deficient
## fit; attr(*, "non-estim") has doubtful cases
##      Model      MSE R_squared Adjusted_R_squared      AIC      BIC     MSPE
## 1 Stepwise 84.88832 0.5672572          0.5586880 2272.556 2302.449 55.28947
## 2  Model 3 88.02643 0.5512597          0.5453746 2279.809 2302.229 54.56405
## 3   Model2 84.82973 0.5675558          0.5575323 2274.342 2307.971 55.57320

Cross Validation

Please check how the function cv.glm() works (see ?cv.glm).

From the object created by the cv.glm() command we will just extract the raw cross-validation estimate of the prediction error (the first component of delta).
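A minimal sketch of this step (the use of the full dataset and K = 10 folds are my assumptions, since the original chunk is hidden):

library(boot)
# cv.glm() expects a glm object; a gaussian glm refit reproduces the lm model
glm_step <- glm(formula(model.step), data = taiwan_real_estate)
cv.glm(taiwan_real_estate, glm_step, K = 10)$delta[1]   # raw CV estimate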

## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

##      Model CV_Error
## 1 Stepwise 88.73881
## 2  Model 3 91.43365
## 3   Model2 90.30129

Based on the AIC criterion, the stepwise model appears to be the best (AIC = 2272.56), although Model 3 has the lowest BIC by a small margin.

We also need to check out-of-sample performance for all models. Based on the test-set MSPE, Model 3 has the lowest prediction error (54.56), while the stepwise model has the lowest cross-validation error (88.74).