Multiple Linear Regression_02

Kate C

2021-12-29

Many Numeric Explanatory Variables

Load Packages and Dataset

The packages used are fst (for reading .fst files), dplyr (data manipulation), ggplot2 (plotting), and broom (tidying model output). The dataset is taiwan_real_estate, which covers property prices in Taiwan.

Visualizing Many Numeric Variables

Faceting is a good choice when the data include categorical (non-numeric) variables and the relationship between the numeric variables differs from one category to the next.

In general, it gets trickier to include more than three numeric variables in a scatter plot (without losing focus and confusing the audience), but we can (in theory) include as many categorical variables as we like by applying faceting.

However, it is important to note that more facets can make it harder to see an overall picture.

Faceted Visualization

In the plot below, the scatter points are faceted by house age.

ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), n_convenience, color = price_twd_msq)) +
  # Make it a scatter plot
  geom_point() +
  # Use the continuous viridis plasma color scale
  scale_color_viridis_c(option = "plasma") +
  # Facet, wrapped by house age
  facet_wrap(~house_age_years)

Different Levels of Interactions

  1. No interactions. No global intercept, no interactions.

    • Note that all explanatory variables are joined with “+” signs in the lm formula, and “+ 0” is added to the formula to drop the global intercept.
mdl_price_vs_all_no_inter <- lm(
  price_twd_msq ~ sqrt(dist_to_mrt_m) + n_convenience + house_age_years + 0, 
  data = taiwan_real_estate
)

See the results.

mdl_price_vs_all_no_inter
  2. All interactions. No global intercept, but with 2-way and 3-way interactions (i.e. all interactions) between the explanatory variables. Note that the “+” signs are swapped for “*” signs, while “+ 0” is kept because we still don’t want a global intercept in this case.
mdl_price_vs_all_3_way_inter <- lm(
  price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * house_age_years + 0, 
  data = taiwan_real_estate
)

See the results.

mdl_price_vs_all_3_way_inter
## 
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * 
##     house_age_years + 0, data = taiwan_real_estate)
## 
## Coefficients:
##                                       sqrt(dist_to_mrt_m)  
##                                                 -0.162944  
##                                             n_convenience  
##                                                  0.374982  
##                                    house_age_years0 to 15  
##                                                 16.046849  
##                                   house_age_years15 to 30  
##                                                 13.760066  
##                                   house_age_years30 to 45  
##                                                 12.088773  
##                         sqrt(dist_to_mrt_m):n_convenience  
##                                                 -0.008393  
##               sqrt(dist_to_mrt_m):house_age_years15 to 30  
##                                                  0.036618  
##               sqrt(dist_to_mrt_m):house_age_years30 to 45  
##                                                  0.061281  
##                     n_convenience:house_age_years15 to 30  
##                                                  0.078370  
##                     n_convenience:house_age_years30 to 45  
##                                                  0.066720  
## sqrt(dist_to_mrt_m):n_convenience:house_age_years15 to 30  
##                                                 -0.003821  
## sqrt(dist_to_mrt_m):n_convenience:house_age_years30 to 45  
##                                                  0.004401
  3. Two-way interactions only. No global intercept, with only 2-way interactions between the explanatory variables. Inside a formula, the power of 2 indicates two-way interactions; it does not square the variables themselves.
    • To actually square a variable (or raise it to any other power), wrap the expression in I(), for example I(var ^ 2).
mdl_price_vs_all_2_way_inter <- lm(
  price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience + house_age_years) ^ 2 + 0, 
  data = taiwan_real_estate
)
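
To check what each formula operator expands to without fitting a model, one option is to inspect the terms of the formula directly. The sketch below uses placeholder variable names (y, a, b and c are hypothetical, not columns of taiwan_real_estate), and the two helper functions are introduced here purely for illustration.

```r
# Placeholder names y, a, b, c are hypothetical; no data is needed because
# terms() only parses the formula symbolically.
term_labels <- function(f) attr(terms(f), "term.labels")
has_intercept <- function(f) attr(terms(f), "intercept") == 1

f_no_inter  <- y ~ a + b + c + 0        # main effects only
f_all_inter <- y ~ a * b * c + 0        # main effects, 2-way and 3-way interactions
f_two_way   <- y ~ (a + b + c)^2 + 0    # main effects and 2-way interactions
f_squared   <- y ~ a + I(a^2)           # a plus the arithmetic square of a

term_labels(f_no_inter)
term_labels(f_all_inter)
term_labels(f_two_way)
term_labels(f_squared)

# "+ 0" switches the global intercept off
has_intercept(f_no_inter)
has_intercept(f_squared)
```

The two-way formula lists every pairwise term such as a:b but no a:b:c term, while I(a^2) shows up as a single term of its own.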

Predictions

Generating explanatory values

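As in the earlier LM_TwoVariables notes, an expanded grid of explanatory values is created with seq. Because the model uses sqrt(dist_to_mrt_m), the sequence for dist_to_mrt_m is generated on the square-root scale and then squared, so the grid points are evenly spaced after the square-root transformation.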

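# expand_grid() comes from tidyr, which must be loaded as well
library(tidyr)

explanatory_data <- expand_grid(
  # Square the sequence so that sqrt(dist_to_mrt_m) runs evenly from 0 to 80
  dist_to_mrt_m = seq(0, 80, 10) ^ 2,
  n_convenience = 0:10,
  house_age_years = unique(taiwan_real_estate$house_age_years)
)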
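
As a rough check on the size and spacing of such a grid, the sketch below rebuilds it with base R's expand.grid() as a stand-in for expand_grid(), so it runs without the dataset. The three age labels are an assumption copied from the coefficient names printed earlier, and the row order differs from what expand_grid() would produce.

```r
# Base R stand-in for tidyr::expand_grid(); the age labels are assumed to be
# the three groups that appear in the printed model coefficients.
age_levels <- c("0 to 15", "15 to 30", "30 to 45")

grid <- expand.grid(
  dist_to_mrt_m = seq(0, 80, 10) ^ 2,
  n_convenience = 0:10,
  house_age_years = age_levels,
  stringsAsFactors = FALSE
)

# 9 distances x 11 store counts x 3 age groups
nrow(grid)

# On the square-root scale the distances are evenly spaced
sort(unique(sqrt(grid$dist_to_mrt_m)))
```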

Predictions

Following the same workflow as before, we create a table of prediction data that combines the explanatory data with the prices predicted by the linear model.

prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_all_3_way_inter, explanatory_data))
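
To make clear what predict() does with the three-way interaction model, the sketch below reproduces one prediction by hand from the coefficients printed earlier. The example house (100 m from the MRT, 5 convenience stores, 0 to 15 years old) is a hypothetical input chosen for illustration; houses in the other age groups would also need the age-specific interaction terms.

```r
# Coefficients copied from the printed three-way interaction model.
# These four are enough for the "0 to 15" age group.
coef_sqrt_dist    <- -0.162944   # sqrt(dist_to_mrt_m)
coef_n_conv       <-  0.374982   # n_convenience
coef_age_0_to_15  <- 16.046849   # house_age_years0 to 15
coef_dist_by_conv <- -0.008393   # sqrt(dist_to_mrt_m):n_convenience

# Hypothetical house used only for this illustration
dist_to_mrt_m <- 100
n_convenience <- 5

predicted_price <- coef_age_0_to_15 +
  coef_sqrt_dist * sqrt(dist_to_mrt_m) +
  coef_n_conv * n_convenience +
  coef_dist_by_conv * sqrt(dist_to_mrt_m) * n_convenience

predicted_price  # about 15.87
```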

Plot

ggplot(
  taiwan_real_estate, 
  aes(sqrt(dist_to_mrt_m), n_convenience, color = price_twd_msq)
) +
  geom_point() +
  scale_color_viridis_c(option = "plasma") +
  facet_wrap(vars(house_age_years)) +
  geom_point(data = prediction_data, size = 3, shape = 15)

Conclusion

The plot shows that house price decreases as the square root of the distance to the nearest MRT station increases, increases with the number of nearby convenience stores, and is higher for houses under 15 years old (the leftmost panel contains more of the brighter points).

One More Bit of Math Fun After Plotting

Theory

Linear regression minimizes the sum of the squares of the differences between the actual responses and the predicted responses.

Calculation

Define a simple quadratic function to illustrate how optimization works.

calc_quadratic <- function(coeffs) {
  x <- coeffs[1]
  x ^ 2 - x + 10
}

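Use the optim function to find the x value that minimizes this function. par is the starting guess and fn is the function to minimize. The exact minimum is at x = 0.5, where the function equals 9.75, so the numerical result below is a close approximation.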

optim(par = c(x = 3), fn = calc_quadratic)
## $par
##         x 
## 0.4998047 
## 
## $value
## [1] 9.75
## 
## $counts
## function gradient 
##       30       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL
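
By default optim uses the Nelder-Mead method, and R warns that this method is unreliable for one-dimensional problems. A possible alternative, sketched below, is method = "Brent", which requires finite lower and upper bounds. The search interval of -10 to 10 is an assumption made for this illustration.

```r
calc_quadratic <- function(coeffs) {
  x <- coeffs[1]
  x ^ 2 - x + 10
}

# Brent's method searches within [lower, upper]; the bounds here are an
# assumed interval that is known to contain the minimum at x = 0.5.
result_brent <- optim(
  par = 3,
  fn = calc_quadratic,
  method = "Brent",
  lower = -10,
  upper = 10
)

result_brent$par
result_brent$value
```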

Conceptually, linear regression solves the same kind of optimization problem: it finds the slopes and intercept that make the sum of squared residuals as small as possible.
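
To illustrate the connection, the sketch below minimizes the sum of squared residuals with optim and compares the result with lm. It uses R's built-in cars dataset rather than the Taiwan data so that it runs on its own, and the choice of method = "BFGS" and the starting values of zero are assumptions made for this example. lm itself does not search iteratively, but it arrives at the same least-squares answer.

```r
# Sum of squared residuals for a straight line through the cars data,
# written as a function of the two coefficients.
calc_sum_of_squares <- function(coeffs) {
  intercept <- coeffs[1]
  slope <- coeffs[2]
  predicted <- intercept + slope * cars$speed
  sum((cars$dist - predicted) ^ 2)
}

# Numerical minimization, starting from an arbitrary guess of zero for both
fit_optim <- optim(
  par = c(intercept = 0, slope = 0),
  fn = calc_sum_of_squares,
  method = "BFGS"
)

# Least-squares fit of the same straight line
fit_lm <- lm(dist ~ speed, data = cars)

fit_optim$par
coef(fit_lm)
```

The two sets of coefficients should agree closely, up to the tolerance of the numerical optimizer.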