Introduction

Our aim is to predict house values. Before we begin to do any analysis, we should always check whether the dataset has missing value or not, we do so by typing:

taiwan_real_estate <- read.csv("real_estates.csv",row.names=1)
attach(taiwan_real_estate)
any(is.na(taiwan_real_estate))
## [1] FALSE

Let’s take a look at structure of the data set:

glimpse(taiwan_real_estate)
## Rows: 414
## Columns: 6
## $ house.age                           <dbl> 32.0, 19.5, 13.3, 13.3, 5.0, 7.1, …
## $ distance.to.the.nearest.MRT.station <dbl> 84.87882, 306.59470, 561.98450, 56…
## $ number.of.convenience.stores        <int> 10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, …
## $ latitude                            <dbl> 24.98298, 24.98034, 24.98746, 24.9…
## $ longitude                           <dbl> 121.5402, 121.5395, 121.5439, 121.…
## $ house.price.of.unit.area            <dbl> 37.9, 42.2, 47.3, 54.8, 43.1, 32.1…

Let’s simplify variables’ names:

taiwan_real_estate <- taiwan_real_estate %>%
rename(house_age_years = house.age, price_twd_msq = house.price.of.unit.area,
       n_convenience = number.of.convenience.stores, 
       dist_to_mrt_m = distance.to.the.nearest.MRT.station)

We can also perform binning for “house_age_years”:

#perform binning with specific number of bins
taiwan_real_estate<-taiwan_real_estate %>% mutate(house_age_cat = cut(house_age_years, breaks=c(0,15,30,45),include.lowest = T,
                                                                        right = F))

Descriptive Statistics

Prepare a heatmap with correlation coefficients on it:

library(corrplot)
## Warning: package 'corrplot' was built under R version 4.3.3
## corrplot 0.92 loaded
M<-cor(taiwan_real_estate[,1:6])
corrplot(M, method = 'number')

Draw a scatter plot of n_convenience vs. price_twd_msq:

ggplot(data = taiwan_real_estate, aes(x = n_convenience, y = price_twd_msq)) +
  geom_point() +
  labs(title = "Scatter Plot of Convenience Stores vs. Price per Square Meter",
       x = "Number of Convenience Stores",
       y = "Price per Square Meter (TWD)") +
  theme_minimal()

Draw a scatter plot of house_age_years vs. price_twd_msq:

ggplot(data = taiwan_real_estate, aes(x = house_age_years, y = price_twd_msq)) +
  geom_point() +
  labs(title = "Scatter Plot of House Age vs. Price per Square Meter",
       x = "Number of Convenience Stores",
       y = "Price per Square Meter (TWD)") +
  theme_minimal()

Draw a scatter plot of distance to nearest MRT station vs. price_twd_msq:

Plot a histogram of price_twd_msq with 10 bins, facet the plot so each house age group gets its own panel:

ggplot(data = taiwan_real_estate, aes(x = price_twd_msq)) +
  geom_histogram(bins = 10, fill = "blue", color = "black") +
  facet_wrap(~ house_age_cat) +
  labs(title = "Histogram of Price per Square Meter by House Age Group",
       x = "Price per Square Meter (TWD)",
       y = "Frequency") +
  theme_minimal()

Summarize to calculate the mean, sd, median etc. house price/area by house age:

library(dplyr)
library(summarytools)
## Warning: package 'summarytools' was built under R version 4.3.3
## 
## Attaching package: 'summarytools'
## The following object is masked from 'package:tibble':
## 
##     view
(stats_by <- stby(data      = taiwan_real_estate, INDICES   = taiwan_real_estate$house_age_cat, FUN       = descr, stats     = "common", transpose = TRUE))
## Non-numerical variable(s) ignored: house_age_cat
## Descriptive Statistics  
## taiwan_real_estate  
## Group: house_age_cat = [0,15)  
## N: 190  
## 
##                           Mean   Std.Dev      Min   Median       Max   N.Valid   Pct.Valid
## --------------------- -------- --------- -------- -------- --------- --------- -----------
##         dist_to_mrt_m   921.67   1041.21    23.38   391.91   4573.78    190.00      100.00
##       house_age_years     7.95      4.85     0.00     8.05     14.80    190.00      100.00
##              latitude    24.97      0.01    24.94    24.97     25.01    190.00      100.00
##             longitude   121.54      0.01   121.50   121.54    121.57    190.00      100.00
##         n_convenience     4.07      2.87     0.00     5.00     10.00    190.00      100.00
##         price_twd_msq    41.77     14.16     7.60    42.55    117.50    190.00      100.00
## 
## Group: house_age_cat = [15,30)  
## N: 129  
## 
##                            Mean   Std.Dev      Min   Median       Max   N.Valid   Pct.Valid
## --------------------- --------- --------- -------- -------- --------- --------- -----------
##         dist_to_mrt_m   1529.58   1499.98    82.89   837.72   6488.02    129.00      100.00
##       house_age_years     19.61      4.18    15.00    18.00     29.60    129.00      100.00
##              latitude     24.97      0.01    24.93    24.97     24.99    129.00      100.00
##             longitude    121.53      0.02   121.47   121.54    121.55    129.00      100.00
##         n_convenience      3.53      2.69     0.00     3.00     10.00    129.00      100.00
##         price_twd_msq     32.64     11.40    11.20    32.90     59.60    129.00      100.00
## 
## Group: house_age_cat = [30,45]  
## N: 95  
## 
##                           Mean   Std.Dev      Min   Median       Max   N.Valid   Pct.Valid
## --------------------- -------- --------- -------- -------- --------- --------- -----------
##         dist_to_mrt_m   803.12   1161.78    57.59   480.70   6396.28     95.00      100.00
##       house_age_years    34.66      3.37    30.00    34.40     43.80     95.00      100.00
##              latitude    24.97      0.01    24.94    24.97     25.00     95.00      100.00
##             longitude   121.54      0.01   121.48   121.54    121.55     95.00      100.00
##         n_convenience     4.91      3.25     0.00     5.00     10.00     95.00      100.00
##         price_twd_msq    37.65     12.84    12.20    38.30     78.30     95.00      100.00

Simple model

Run a linear regression of price_twd_msq vs. best, but only 1 predictor:

model <- lm(price_twd_msq ~ house_age_years , data = taiwan_real_estate)

We start by displaying the statistical summary of the model using the R function summary():

summary(model)
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years, data = taiwan_real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.113 -10.738   1.626   8.199  77.781 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     42.43470    1.21098  35.042  < 2e-16 ***
## house_age_years -0.25149    0.05752  -4.372 1.56e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.32 on 412 degrees of freedom
## Multiple R-squared:  0.04434,    Adjusted R-squared:  0.04202 
## F-statistic: 19.11 on 1 and 412 DF,  p-value: 1.56e-05

You can access lots of different aspects of the regression object. To see what’s inside, use names():

names(model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

What do they mean?

Talking about residuals, 25% of the residuals are below -10.738. A negative first quartile indicates that a substantial portion of the residuals are below the predicted value, whilst 75% of the residuals are above 8.199. A positive third quartile indicates that most residuals are above zero, suggesting a slight tendency to underestimate the dependent variable. Estimation of house age shows that for each marginal increase of house age price decreases by 0.25149. R-squared indicates that approximately 4.43% of the variability of dependent variable is expalined by independent.

Discuss model accuracy:

Overall, model seems not really accurate (large residuals, low R-sqaured)

Model diagnostics:

par(mfrow = c(2, 2))
plot(model) # ?your name of the model

The four plots show… -First plot (Residuals vs Fitted) is used to check the linearity, roughly, it could be named linear, but rather heteroskedastic, also, some outliers exist. -Second plot (Normal Q-Q) shows normal distribution, which, in this case mostly exists, except for the outliers. -Third plot (Scale-Location) displays the homoscedasticity assumption, Points start to be more concentrated on the right side, which can indicate heteroscedasticity. -Fourth plot (Residuals vs Leverage) is used to identify influential data points (leverage points) and outliers, which exist in this case.

Create the diagnostic plots using ggfortify:

library(ggfortify)
## Warning: package 'ggfortify' was built under R version 4.3.3
autoplot(model) # ?your name of the model

Outliers and high levarage points:

plot(model, 5)  # ?your name of the model

Influential values:

# Cook's distance
plot(model, 4) # ?your name of the model

or just plot all of diagnostic plots together:

autoplot(model, which = 1:6, label.size = 3) # ?your name of the model

Discussion:

in Cook’s distance plot 5 influential variables(outliers) could be observed, judging from Cook vs Leverage plot these values are most probably outliers.

Multiple Regression Model

Test and training set

We begin by splitting the dataset into two parts, training set and testing set. In this example we will randomly take 75% row in this dataset and put it into the training set, and other 25% row in the testing set:

smp_size<-floor(0.75*nrow(taiwan_real_estate))
set.seed(12)
train_ind<-sample(seq_len(nrow(taiwan_real_estate)), size=smp_size)
train<-taiwan_real_estate[train_ind, ]
test<-taiwan_real_estate[-train_ind, ]
  • 1st comment: floor() is used to return the largest integer value which is not greater than an individual number, or expression.

  • 2nd comment: set.seed() is used to set the seed of R’s random number generator, this function is used so results from this example can be recreated easily.

Now we have our training set and testing set.

Variable selection methods

Generally, selecting variables for linear regression is a debatable topic.

There are many methods for variable selecting, namely, forward stepwise selection, backward stepwise selection, etc, some are valid, some are heavily criticized.

I recommend this document: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/26/lecture-26.pdf and Gung’s comment: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 if you want to learn more about variable selection process.

If our goal is prediction, it is safer to include all predictors in our model, removing variables without knowing the science behind it usually does more harm than good!!!

We begin to create our multiple linear regression model:

model2 <- lm(price_twd_msq ~ ., data = train)
summary(model2)
## 
## Call:
## lm(formula = price_twd_msq ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.009  -4.953  -1.296   4.461  75.042 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.861e+03  7.542e+03  -0.379 0.704717    
## house_age_years      -4.209e-01  1.223e-01  -3.442 0.000659 ***
## dist_to_mrt_m        -4.558e-03  8.681e-04  -5.251 2.86e-07 ***
## n_convenience         9.826e-01  2.287e-01   4.297 2.34e-05 ***
## latitude              2.505e+02  5.375e+01   4.660 4.74e-06 ***
## longitude            -2.755e+01  6.032e+01  -0.457 0.648221    
## house_age_cat[15,30) -1.089e+00  1.916e+00  -0.568 0.570124    
## house_age_cat[30,45]  5.789e+00  3.577e+00   1.618 0.106655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.331 on 302 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5575 
## F-statistic: 56.62 on 7 and 302 DF,  p-value: < 2.2e-16

Discuss the results…

Now model has improved in quality. Dispersion of residuals is lower, R-squared shows much more satisfying results, overall significance of the model rose also.

Looking at model summary, we see that variables house_age_years(and connected), longitude are insignificant, so let’s estimate the model without those variables:

model3 <- lm(price_twd_msq ~ dist_to_mrt_m + n_convenience + latitude, data = train)
summary(model3)
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience + 
##     latitude, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.384  -5.804  -1.360   4.648  76.637 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.270e+03  1.407e+03  -3.745 0.000216 ***
## dist_to_mrt_m -5.044e-03  6.214e-04  -8.118 1.17e-14 ***
## n_convenience  9.139e-01  2.398e-01   3.811 0.000167 ***
## latitude       2.127e+02  5.636e+01   3.774 0.000193 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.876 on 306 degrees of freedom
## Multiple R-squared:  0.5092, Adjusted R-squared:  0.5044 
## F-statistic: 105.8 on 3 and 306 DF,  p-value: < 2.2e-16

Evaluating multi-collinearity

There are many standards researchers apply for deciding whether a VIF is too large. In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Others still will say you shouldn’t pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs, but this may just be a natural, unavoidable fact with your model.

## Warning: package 'car' was built under R version 4.3.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.3
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
## dist_to_mrt_m n_convenience      latitude 
##      1.941558      1.614338      1.494453

Finally we test our model on test dataset:

model4 <- lm(price_twd_msq ~ dist_to_mrt_m + n_convenience + latitude, data = taiwan_real_estate)
summary(model4)
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience + 
##     latitude, data = taiwan_real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -33.961  -5.359  -1.391   4.729  77.895 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -5.336e+03  1.171e+03  -4.557 6.87e-06 ***
## dist_to_mrt_m -4.503e-03  5.176e-04  -8.700  < 2e-16 ***
## n_convenience  1.072e+00  1.997e-01   5.370 1.32e-07 ***
## latitude       2.152e+02  4.689e+01   4.590 5.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.45 on 410 degrees of freedom
## Multiple R-squared:  0.5212, Adjusted R-squared:  0.5177 
## F-statistic: 148.8 on 3 and 410 DF,  p-value: < 2.2e-16
correlation_matrix <- cor(taiwan_real_estate[, c("dist_to_mrt_m", "n_convenience", "latitude", "longitude")])
correlation_matrix
##               dist_to_mrt_m n_convenience   latitude  longitude
## dist_to_mrt_m     1.0000000    -0.6025191 -0.5910666 -0.8063168
## n_convenience    -0.6025191     1.0000000  0.4441433  0.4490990
## latitude         -0.5910666     0.4441433  1.0000000  0.4129239
## longitude        -0.8063168     0.4490990  0.4129239  1.0000000
vif_values <- car::vif(model4)
vif_values
## dist_to_mrt_m n_convenience      latitude 
##      1.973819      1.599834      1.566223

Interpret results… The median residual is close to zero, indicating that the model’s predictions are centered around the actual values.Values in first 25% are slightly lower, than expected, whilst in 75% - slightly higher. Prices grow with increase in latitude and nearable stores, and decrease with smaller distance to the nearest MRT. More than half of the variance is explained by the model. All VIF values are below 2, suggesting that multicollinearity is not a significant concern in this model.

Variable selection using best subset regression

Best subset and stepwise (forward, backward, both) techniques of variable selection can be used to come up with the best linear regression model for the dependent variable medv.

library(rsample)
## Warning: package 'rsample' was built under R version 4.3.3
split <- initial_split(taiwan_real_estate, prop = 0.7)
dataTrain <- training(split)
dataTest <- testing(split)

Variable selection using stepwise regression

## Start:  AIC=1490.31
## price_twd_msq ~ 1
## 
##                   Df Sum of Sq   RSS    AIC
## + dist_to_mrt_m    1   22960.5 26866 1313.8
## + n_convenience    1   18551.2 31275 1357.7
## + longitude        1   12590.2 37236 1408.1
## + latitude         1   12160.6 37666 1411.5
## + house_age_cat    2    4244.0 45582 1468.6
## + house_age_years  1    2261.2 47565 1478.9
## <none>                         49826 1490.3
## 
## Step:  AIC=1313.8
## price_twd_msq ~ dist_to_mrt_m
## 
##                   Df Sum of Sq   RSS    AIC
## + n_convenience    1   3027.03 23839 1281.3
## + house_age_years  1   2060.06 24806 1292.8
## + house_age_cat    2   1589.73 25276 1300.2
## + latitude         1    948.23 25918 1305.4
## <none>                         26866 1313.8
## + longitude        1      0.96 26865 1315.8
## 
## Step:  AIC=1281.26
## price_twd_msq ~ dist_to_mrt_m + n_convenience
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_years  1   2457.80 21381 1251.8
## + house_age_cat    2   1983.79 21855 1260.2
## + latitude         1    602.23 23237 1275.9
## <none>                         23839 1281.3
## + longitude        1      6.02 23833 1283.2
## 
## Step:  AIC=1251.81
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
## 
##                 Df Sum of Sq   RSS    AIC
## + latitude       1    806.73 20574 1242.7
## + house_age_cat  2    477.52 20904 1249.3
## <none>                       21381 1251.8
## + longitude      1      1.98 21379 1253.8
## 
## Step:  AIC=1242.69
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude
## 
##                 Df Sum of Sq   RSS    AIC
## + house_age_cat  2    542.51 20032 1239.0
## <none>                       20574 1242.7
## + longitude      1     22.80 20552 1244.4
## 
## Step:  AIC=1238.97
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude + house_age_cat
## 
##             Df Sum of Sq   RSS    AIC
## <none>                   20032 1239.0
## + longitude  1    29.352 20003 1240.5
## Start:  AIC=1240.55
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + longitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## - longitude        1     29.35 20032 1239.0
## <none>                         20003 1240.5
## - house_age_cat    2    549.07 20552 1244.4
## - latitude         1    897.28 20900 1251.2
## - house_age_years  1   1025.69 21028 1253.0
## - dist_to_mrt_m    1   1540.17 21543 1260.0
## - n_convenience    1   2945.81 22948 1278.2
## 
## Step:  AIC=1238.97
## price_twd_msq ~ house_age_years + dist_to_mrt_m + n_convenience + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         20032 1239.0
## - house_age_cat    2     542.5 20574 1242.7
## - latitude         1     871.7 20904 1249.3
## - house_age_years  1    1006.4 21038 1251.1
## - n_convenience    1    2925.9 22958 1276.4
## - dist_to_mrt_m    1    3212.1 23244 1280.0
## Start:  AIC=1490.31
## price_twd_msq ~ 1
## 
##                   Df Sum of Sq   RSS    AIC
## + dist_to_mrt_m    1   22960.5 26866 1313.8
## + n_convenience    1   18551.2 31275 1357.7
## + longitude        1   12590.2 37236 1408.1
## + latitude         1   12160.6 37666 1411.5
## + house_age_cat    2    4244.0 45582 1468.6
## + house_age_years  1    2261.2 47565 1478.9
## <none>                         49826 1490.3
## 
## Step:  AIC=1313.8
## price_twd_msq ~ dist_to_mrt_m
## 
##                   Df Sum of Sq   RSS    AIC
## + n_convenience    1    3027.0 23839 1281.3
## + house_age_years  1    2060.1 24806 1292.8
## + house_age_cat    2    1589.7 25276 1300.2
## + latitude         1     948.2 25918 1305.4
## <none>                         26866 1313.8
## + longitude        1       1.0 26865 1315.8
## - dist_to_mrt_m    1   22960.5 49826 1490.3
## 
## Step:  AIC=1281.26
## price_twd_msq ~ dist_to_mrt_m + n_convenience
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_years  1    2457.8 21381 1251.8
## + house_age_cat    2    1983.8 21855 1260.2
## + latitude         1     602.2 23237 1275.9
## <none>                         23839 1281.3
## + longitude        1       6.0 23833 1283.2
## - n_convenience    1    3027.0 26866 1313.8
## - dist_to_mrt_m    1    7436.3 31275 1357.7
## 
## Step:  AIC=1251.81
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years
## 
##                   Df Sum of Sq   RSS    AIC
## + latitude         1     806.7 20574 1242.7
## + house_age_cat    2     477.5 20904 1249.3
## <none>                         21381 1251.8
## + longitude        1       2.0 21379 1253.8
## - house_age_years  1    2457.8 23839 1281.3
## - n_convenience    1    3424.8 24806 1292.8
## - dist_to_mrt_m    1    6938.6 28320 1331.0
## 
## Step:  AIC=1242.69
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude
## 
##                   Df Sum of Sq   RSS    AIC
## + house_age_cat    2     542.5 20032 1239.0
## <none>                         20574 1242.7
## + longitude        1      22.8 20552 1244.4
## - latitude         1     806.7 21381 1251.8
## - house_age_years  1    2662.3 23237 1275.9
## - n_convenience    1    3025.0 23599 1280.3
## - dist_to_mrt_m    1    4004.2 24579 1292.1
## 
## Step:  AIC=1238.97
## price_twd_msq ~ dist_to_mrt_m + n_convenience + house_age_years + 
##     latitude + house_age_cat
## 
##                   Df Sum of Sq   RSS    AIC
## <none>                         20032 1239.0
## + longitude        1      29.4 20003 1240.5
## - house_age_cat    2     542.5 20574 1242.7
## - latitude         1     871.7 20904 1249.3
## - house_age_years  1    1006.4 21038 1251.1
## - n_convenience    1    2925.9 22958 1276.4
## - dist_to_mrt_m    1    3212.1 23244 1280.0

Comparing competing models

##                df      AIC
## model_forward   8 2061.118
## model_backward  8 2061.118
## model_step      8 2061.118
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m + n_convenience + 
##     house_age_years + latitude + house_age_cat, data = dataTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.755  -4.765  -1.509   4.475  32.712 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -4.432e+03  1.278e+03  -3.469 0.000604 ***
## dist_to_mrt_m        -4.326e-03  6.433e-04  -6.724 9.78e-11 ***
## n_convenience         1.390e+00  2.166e-01   6.418 5.83e-10 ***
## house_age_years      -4.304e-01  1.143e-01  -3.764 0.000203 ***
## latitude              1.792e+02  5.117e+01   3.503 0.000535 ***
## house_age_cat[15,30) -5.519e-02  1.822e+00  -0.030 0.975853    
## house_age_cat[30,45]  5.692e+00  3.348e+00   1.700 0.090233 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.428 on 282 degrees of freedom
## Multiple R-squared:  0.598,  Adjusted R-squared:  0.5894 
## F-statistic: 69.91 on 6 and 282 DF,  p-value: < 2.2e-16

From Best subset regression and stepwise selection (forward, backward, both), we see that the median residual is close to zero, indicating that the model’s predictions are generally centered around the actual values. The spread of residuals suggests there might be some extreme values (notably the maximum residual).

Plots show that extreme values indeed exist, they are rather influential due to their extreme predictor values but do not have a large impact on the regression coefficients.

Models are compared based on adjusted r square, AIC, BIC criteria for in-sample performance and mean square prediction error (MSPE) for out-of-sample performance.

## [1] 77.98945
## [1] 77.98945
## [1] 77.98945
## [1] 0.5979661
## [1] 0.5979661
## [1] 0.5979661
## [1] 0.5894122 0.5894122 0.5894122
## [1] 2061.118 2061.118 2061.118
## [1] 2090.449 2090.449 2090.449

Finally, we can check the Out-of-sample Prediction or test error (MSPE):

## [1] 9.901804
## [1] 9.901804
## [1] 9.901804

Cross Validation

Please check how function ?cv.glm works.

We will just extract from this object created by cv.glm command - the raw cross-validation estimate of prediction error.

## Warning: package 'boot' was built under R version 4.3.3
## 
## Attaching package: 'boot'
## The following object is masked from 'package:car':
## 
##     logit
## [1] "AIC Model 3: 2305.58582491523"
## [1] "AIC Model 4: 3040.53625932183"
##       Length Class  Mode   
## call    4    -none- call   
## K       1    -none- numeric
## delta   2    -none- numeric
## seed  626    -none- numeric
## [1] NaN
## NULL
##       Length Class  Mode   
## call    4    -none- call   
## K       1    -none- numeric
## delta   2    -none- numeric
## seed  626    -none- numeric
## [1] NaN
## NULL
## [1] 9.425119
## [1] 9.403946

Based on AIC criteria, model 3 is better, than model 4.

We need to check out-of-sample MSPE for both models. Based on out-of-sample prediction error, models seem more or less equal