Introduction

Our aim is to predict house values. Before we begin any analysis, we should check whether the dataset has missing values. We do so by typing:

library(dplyr)     # used for glimpse(), rename(), mutate(), and the summaries below
library(ggplot2)   # used for the plots below
taiwan_real_estate <- read.csv("real_estates.csv", row.names = 1)
attach(taiwan_real_estate)
any(is.na(taiwan_real_estate))
## [1] FALSE

Let’s take a look at the structure of the dataset:

glimpse(taiwan_real_estate)
## Rows: 414
## Columns: 6
## $ house.age                           <dbl> 32.0, 19.5, 13.3, 13.3, 5.0, 7.1, …
## $ distance.to.the.nearest.MRT.station <dbl> 84.87882, 306.59470, 561.98450, 56…
## $ number.of.convenience.stores        <int> 10, 9, 5, 5, 5, 3, 7, 6, 1, 3, 1, …
## $ latitude                            <dbl> 24.98298, 24.98034, 24.98746, 24.9…
## $ longitude                           <dbl> 121.5402, 121.5395, 121.5439, 121.…
## $ house.price.of.unit.area            <dbl> 37.9, 42.2, 47.3, 54.8, 43.1, 32.1…

Let’s simplify the variable names:

taiwan_real_estate <- taiwan_real_estate %>%
  rename(house_age_years = house.age,
         price_twd_msq = house.price.of.unit.area,
         n_convenience = number.of.convenience.stores,
         dist_to_mrt_m = distance.to.the.nearest.MRT.station)

We can also perform binning for “house_age_years”:

# Perform binning with a specific set of break points
taiwan_real_estate <- taiwan_real_estate %>%
  mutate(house_age_cat = cut(house_age_years, breaks = c(0, 15, 30, 45),
                             include.lowest = TRUE, right = FALSE))

Descriptive Statistics

Prepare a heatmap with correlation coefficients on it:

library(corrplot)
M<-cor(taiwan_real_estate[,1:6])
corrplot(M, method = 'number')

Draw a scatter plot of n_convenience vs. price_twd_msq:

# Scatter plot of number of convenience stores vs. price per square meter
ggplot(taiwan_real_estate, aes(x = n_convenience, y = price_twd_msq)) +
  geom_point(color = "red") +
  labs(title = "Number of Convenience Stores vs. Price per Square Meter", 
       x = "Number of Convenience Stores", y = "Price per Square Meter (TWD)")

Draw a scatter plot of house_age_years vs. price_twd_msq:

# Scatter plot of house age vs. price per square meter
ggplot(taiwan_real_estate, aes(x = house_age_years, y = price_twd_msq)) +
  geom_point(color = "blue") +
  labs(title = "House Age vs. Price per Square Meter", 
       x = "House Age (years)", y = "Price per Square Meter (TWD)")

Draw a scatter plot of distance to nearest MRT station vs. price_twd_msq:
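A sketch following the same pattern as the scatter plots above:

# Scatter plot of distance to the nearest MRT station vs. price per square meter
ggplot(taiwan_real_estate, aes(x = dist_to_mrt_m, y = price_twd_msq)) +
  geom_point(color = "darkgreen") +
  labs(title = "Distance to Nearest MRT Station vs. Price per Square Meter",
       x = "Distance to Nearest MRT Station (m)", y = "Price per Square Meter (TWD)")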

Plot a histogram of price_twd_msq with 10 bins, facet the plot so each house age group gets its own panel:

# Histogram of price per square meter with 10 bins, faceted by house age category
ggplot(taiwan_real_estate, aes(x = price_twd_msq)) +
  geom_histogram(bins = 10, fill = "green", color = "black") +
  facet_wrap(~house_age_cat) +
  labs(title = "Histogram of Price per Square Meter by House Age Category", 
       x = "Price per Square Meter (TWD)", y = "Count")

Summarize the house price per unit area by house age category (mean, standard deviation, median, etc.):

# Summarize statistics of price per square meter by house age category
summary_stats <- taiwan_real_estate %>%
  group_by(house_age_cat) %>%
  summarize(
    mean_price = mean(price_twd_msq, na.rm = TRUE),
    sd_price = sd(price_twd_msq, na.rm = TRUE),
    median_price = median(price_twd_msq, na.rm = TRUE),
    min_price = min(price_twd_msq, na.rm = TRUE),
    max_price = max(price_twd_msq, na.rm = TRUE)
  )

summary_stats
## # A tibble: 3 × 6
##   house_age_cat mean_price sd_price median_price min_price max_price
##   <fct>              <dbl>    <dbl>        <dbl>     <dbl>     <dbl>
## 1 [0,15)              41.8     14.2         42.6       7.6     118. 
## 2 [15,30)             32.6     11.4         32.9      11.2      59.6
## 3 [30,45]             37.7     12.8         38.3      12.2      78.3

Simple model

Run a linear regression of price_twd_msq on a single predictor; the correlation heatmap suggests dist_to_mrt_m, the variable most strongly (negatively) correlated with price:

# Linear regression of price_twd_msq vs. dist_to_mrt_m
lm_model <- lm(price_twd_msq ~ dist_to_mrt_m, data = taiwan_real_estate)
summary(lm_model)
## 
## Call:
## lm(formula = price_twd_msq ~ dist_to_mrt_m, data = taiwan_real_estate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.396  -6.007  -1.195   4.831  73.483 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   45.8514271  0.6526105   70.26   <2e-16 ***
## dist_to_mrt_m -0.0072621  0.0003925  -18.50   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.07 on 412 degrees of freedom
## Multiple R-squared:  0.4538, Adjusted R-squared:  0.4524 
## F-statistic: 342.2 on 1 and 412 DF,  p-value: < 2.2e-16

The statistical summary produced by summary() above shows that dist_to_mrt_m has a highly significant negative effect on price (p < 2e-16) and that the model explains about 45% of the variance in price per square meter (R-squared = 0.454).

You can access lots of different aspects of the regression object. To see what’s inside, use names():

# Display the names of elements in the regression model object
names(lm_model)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

What do they mean?

The elements in the regression model object lm_model can be accessed using the names() function, which lists the following components:

• coefficients: The regression coefficients, including the intercept and the slope.
• residuals: The residuals, which are the differences between the observed values and the values predicted by the model.
• fitted.values: The values predicted by the model.
• rank: The rank of the model matrix.
• df.residual: The degrees of freedom for the residuals.
• call: The function call that was used to create the model.
• terms: An object containing information about the terms in the model.
• model: The input data on which the model is based.
• assign: Indexes assigning values to variables in the model.
• qr: The QR decomposition object used in the model calculations.
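For example, individual components can be pulled out with the $ operator or the matching extractor function:

# Extract the fitted coefficients from the model object
lm_model$coefficients
coef(lm_model)   # equivalent extractor function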

Discuss model accuracy:

Model accuracy can be evaluated using several metrics obtained from the model summary (summary(lm_model)):

• R-squared: The coefficient of determination, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variable(s). A higher R-squared value means a better model fit.
• Adjusted R-squared: The adjusted coefficient of determination, which accounts for the number of predictors in the model. It provides a more accurate measure of model fit for models with multiple predictors.
• F-statistic: Tests the overall significance of the model. A high F-statistic and a low p-value (usually < 0.05) indicate that the model is statistically significant.
• p-values: The p-values for the regression coefficients indicate whether each predictor is statistically significantly different from zero. Small p-values (usually < 0.05) suggest that the predictor is significant.
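These quantities can also be extracted from the summary object directly, for example:

# R-squared and adjusted R-squared from the model summary
summary(lm_model)$r.squared
summary(lm_model)$adj.r.squared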

Model diagnostics:

par(mfrow = c(2, 2))
plot(lm_model)

The four plots show…

• Residuals vs. Fitted: Checks whether the residuals are randomly distributed around zero. This plot helps to identify non-linearity, unequal error variances, and outliers.
• Normal Q-Q: Checks whether the residuals are normally distributed. Deviations from the diagonal line indicate departures from normality.
• Scale-Location (or Spread-Location): Checks the homoscedasticity (constant variance) of the residuals. Ideally, residuals should be spread equally along the range of predictors.
• Residuals vs. Leverage: Identifies points that have a high influence on the model fit. Points with high leverage and large residuals can disproportionately affect the model.

Create the diagnostic plots using ggfortify:

library(ggfortify)
autoplot(lm_model)

Outliers and high leverage points:

plot(lm_model, 5)

Influential values:

# Cook's distance
plot(lm_model, 4)

or just plot all of the diagnostic plots together:

autoplot(lm_model, which = 1:6, label.size = 3) 

Discussion:

Based on the diagnostic plots, the following conclusions can be drawn:

• Outliers: The residuals vs. fitted plot and the normal Q-Q plot help identify potential outliers. Observations with large residuals (far from the horizontal line in the residuals vs. fitted plot) and deviations from the line in the normal Q-Q plot indicate outliers.
• High leverage points: The residuals vs. leverage plot helps identify high leverage points. Points far from the center in the horizontal direction are high leverage points.
• Influential values: Cook’s distance plot highlights influential observations. Points with Cook’s distance greater than 1 are considered highly influential and should be examined closely.

Overall, the diagnostic plots provide valuable insights into the fit and assumptions of the linear regression model. By identifying and addressing outliers, high leverage points, and influential values, we can improve the robustness and accuracy of the model.
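As a quick check, observations exceeding that Cook’s distance threshold can be listed directly (a sketch; the threshold of 1 follows the rule of thumb above):

# Identify observations with Cook's distance greater than 1
cooks_d <- cooks.distance(lm_model)
which(cooks_d > 1)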

Multiple Regression Model

Test and training set

We begin by splitting the dataset into two parts: a training set and a test set. In this example we randomly assign 75% of the rows to the training set and the remaining 25% to the test set:

# Split the data into 75% training and 25% test sets
smp_size <- floor(0.75 * nrow(taiwan_real_estate))
set.seed(12)
train_ind <- sample(seq_len(nrow(taiwan_real_estate)), size = smp_size)
train <- taiwan_real_estate[train_ind, ]
test <- taiwan_real_estate[-train_ind, ]
  • 1st comment: floor() returns the largest integer not greater than its argument, so the training-set size is a whole number.

  • 2nd comment: set.seed() sets the seed of R’s random number generator; we use it so the results of this example can easily be recreated.

Now we have our training set and testing set.

Variable selection methods

Generally, selecting variables for linear regression is a debatable topic.

There are many methods for variable selection, e.g., forward stepwise selection and backward stepwise selection; some are valid, some are heavily criticized.

I recommend this document: https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/26/lecture-26.pdf and Gung’s comment: https://stats.stackexchange.com/questions/20836/algorithms-for-automatic-model-selection/20856#20856 if you want to learn more about variable selection process.

If our goal is prediction, it is safer to include all predictors in our model; removing variables without understanding the science behind them usually does more harm than good.

We begin to create our multiple linear regression model:

model2 <- lm(price_twd_msq ~ ., data = train)
summary(model2)
## 
## Call:
## lm(formula = price_twd_msq ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.009  -4.953  -1.296   4.461  75.042 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -2.861e+03  7.542e+03  -0.379 0.704717    
## house_age_years      -4.209e-01  1.223e-01  -3.442 0.000659 ***
## dist_to_mrt_m        -4.558e-03  8.681e-04  -5.251 2.86e-07 ***
## n_convenience         9.826e-01  2.287e-01   4.297 2.34e-05 ***
## latitude              2.505e+02  5.375e+01   4.660 4.74e-06 ***
## longitude            -2.755e+01  6.032e+01  -0.457 0.648221    
## house_age_cat[15,30) -1.089e+00  1.916e+00  -0.568 0.570124    
## house_age_cat[30,45]  5.789e+00  3.577e+00   1.618 0.106655    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.331 on 302 degrees of freedom
## Multiple R-squared:  0.5676, Adjusted R-squared:  0.5575 
## F-statistic: 56.62 on 7 and 302 DF,  p-value: < 2.2e-16

Discuss the results…

Looking at the model summary, we see that longitude and the binned age category house_age_cat are insignificant (p > 0.05), so let’s estimate the model without those variables:

# Refit the model without the insignificant variables (longitude and house_age_cat)
model3 <- lm(price_twd_msq ~ . - longitude - house_age_cat, data = train)
summary(model3)

Evaluating multicollinearity

There are many standards researchers apply for deciding whether a VIF is too large. In some domains, a VIF over 2 is worthy of suspicion. Others set the bar higher, at 5 or 10. Others still will say you shouldn’t pay attention to these at all. Ultimately, the main thing to consider is that small effects are more likely to be “drowned out” by higher VIFs, but this may just be a natural, unavoidable fact with your model.
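A sketch of how VIFs can be computed for the full model, assuming the car package is available (for factor terms such as house_age_cat, car::vif() reports generalized VIFs):

# Variance inflation factors for the predictors in model2
library(car)
vif(model2)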

Finally, we test our model on the test dataset:
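A minimal sketch, assuming we evaluate model2 (the full model) on the held-out rows and use mean squared prediction error (MSPE) as the metric:

# Predict prices for the test set and compute the out-of-sample MSPE
pred_test <- predict(model2, newdata = test)
mspe_full <- mean((test$price_twd_msq - pred_test)^2)
mspe_full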

Interpret results…

Variable selection using best subset regression

Best subset and stepwise (forward, backward, both) variable selection techniques can be used to come up with the best linear regression model for the dependent variable price_twd_msq.
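A sketch using the leaps package (an assumption; any best-subset implementation would do):

# Best subset regression: evaluate candidate models of each size up to 7 terms
library(leaps)
best_subset <- regsubsets(price_twd_msq ~ ., data = train, nvmax = 7)
summary(best_subset)$which    # variables included at each subset size
summary(best_subset)$adjr2    # adjusted R-squared for each subset size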

Variable selection using stepwise regression
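A sketch of AIC-based stepwise selection with step(), searching in both directions from the full model:

# Stepwise selection (both directions) starting from the full model
full_model <- lm(price_twd_msq ~ ., data = train)
step_model <- step(full_model, direction = "both", trace = 0)
summary(step_model)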

Comparing competing models

From Best subset regression and stepwise selection (forward, backward, both), we see that …….

Plots show that…

Models are compared based on adjusted R-squared, AIC, and BIC for in-sample performance, and on mean squared prediction error (MSPE) for out-of-sample performance.

Finally, we can check the out-of-sample prediction or test error (MSPE):
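A sketch comparing the two candidate models, reusing model2, step_model, and mspe_full from the sketches above (these object names are assumptions introduced earlier):

# In-sample criteria
AIC(model2); AIC(step_model)
BIC(model2); BIC(step_model)

# Out-of-sample MSPE on the test set
mspe_step <- mean((test$price_twd_msq - predict(step_model, newdata = test))^2)
c(full = mspe_full, stepwise = mspe_step)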

Cross Validation

Please check how the function cv.glm works (see ?cv.glm).

From the object created by the cv.glm command we will just extract the raw cross-validation estimate of prediction error.
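A sketch using boot::cv.glm with 10-fold cross-validation (K = 10 and fitting on the full dataset are assumptions); the first element of delta is the raw cross-validation estimate of prediction error:

# 10-fold cross-validation of the full model
library(boot)
glm_full <- glm(price_twd_msq ~ ., data = taiwan_real_estate)
set.seed(12)
cv_res <- cv.glm(taiwan_real_estate, glm_full, K = 10)
cv_res$delta[1]   # raw cross-validation estimate of prediction error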

Based on AIC criteria …

We need to check out-of-sample MSPE for both models. Based on out-of-sample prediction error, model …