Multiple Linear Regression

Kate C

2021-12-28

Two Numeric Explanatory Variables

Load Packages and Dataset

The packages used are fst (for reading .fst files), dplyr (data manipulation), ggplot2 (plotting), and broom. The dataset is the one on Taiwan's property prices.
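The loading step might look like the sketch below. The real file location is not given here, so the sketch writes a small toy data frame to a temporary .fst file and reads it back. The path, the `toy` and `taiwan_toy` names, and the values are all assumptions for illustration; only the three column names come from this document.

```r
library(fst)
library(dplyr)
library(ggplot2)
library(broom)

# Toy stand-in with the column names used in this document.
# The values are made up and are NOT taken from the real dataset.
toy <- data.frame(
  n_convenience = c(0L, 5L, 10L),
  dist_to_mrt_m = c(4900, 400, 100),
  price_twd_msq = c(8, 12, 15)
)

# Hypothetical file location; replace with the path to the real .fst file
path <- file.path(tempdir(), "taiwan_real_estate.fst")
write_fst(toy, path)

# read_fst() returns an ordinary data frame
taiwan_toy <- read_fst(path)
glimpse(taiwan_toy)
```

With the real file, only the `read_fst()` call is needed.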

Visualizing Three Numeric Variables

  1. 3D scatter plot: can suffer from perspective issues and be difficult to interpret
  2. 2D scatter plot with the response shown as color

3D Visualization

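To keep the 3D-plotting code readable, we use the exposition pipe (%$%) from the magrittr package, which makes the columns of the data frame available by name inside the plotting call. scatter3D() comes from the plot3D package. x = number of convenience stores, y = square root of the distance to the nearest MRT station, z = price per square meter.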

taiwan_real_estate %$%
  scatter3D(n_convenience, sqrt(dist_to_mrt_m), price_twd_msq)
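The behaviour of the exposition pipe can be checked on a built-in dataset. This is a minimal sketch, assuming magrittr is installed. It uses R's built-in mtcars data rather than taiwan_real_estate, and `cor_pipe` and `cor_with` are names chosen for illustration.

```r
library(magrittr)

# %$% exposes the columns of the left-hand data frame to the
# expression on the right, much like base R's with().
cor_pipe <- mtcars %$% cor(mpg, wt)
cor_with <- with(mtcars, cor(mpg, wt))

# Both calls compute the same correlation
all.equal(cor_pipe, cor_with)
```

The same pattern applies to the plotting call above: the column names are looked up inside the data frame on the left of the pipe.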

2D plot with response as color

Using taiwan_real_estate, draw a scatter plot of the square root of the distance to the nearest MRT station against the number of convenience stores, with points colored by price on the continuous viridis "plasma" color scale. A flat plot like this is easier to interpret than the 3D version.

ggplot(taiwan_real_estate, aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)) +
  geom_point() +
  scale_color_viridis_c(option = "plasma")

Modelling Two Numeric Explanatory Variables

A Linear Regression

Model the house price as a function of the number of nearby convenience stores and the square root of the distance to the nearest MRT station, then make predictions from that model.

This section uses the dplyr, tidyr, and ggplot2 packages.

Two Numeric Explanatory Variables with no Interaction

mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), data = taiwan_real_estate)
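With `+` in the formula, the model has one intercept and one slope per explanatory variable. The sketch below shows how the fitted coefficients could be inspected with broom's tidy(). It runs on simulated stand-in data, not the Taiwan dataset: the `toy` data frame, the `mdl_toy` model, and the "true" coefficients 15, 0.2, and -0.15 are assumptions chosen for illustration.

```r
library(broom)

set.seed(1)
n <- 500

# Simulated stand-in data (NOT the Taiwan dataset)
toy <- data.frame(
  n_convenience = sample(0:10, n, replace = TRUE),
  dist_to_mrt_m = runif(n, min = 0, max = 6400)
)

# Response built from known coefficients plus noise
toy$price_twd_msq <- 15 + 0.2 * toy$n_convenience -
  0.15 * sqrt(toy$dist_to_mrt_m) + rnorm(n, sd = 0.5)

mdl_toy <- lm(price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), data = toy)

# One row per coefficient: intercept, n_convenience, sqrt(dist_to_mrt_m)
coefs <- tidy(mdl_toy)
coefs
```

Because the data were simulated from known values, the estimates should land close to 15, 0.2, and -0.15, which is a quick sanity check that the formula means what it appears to mean.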

Create a grid of every combination of explanatory variable values with expand_grid().

explanatory_data <- expand_grid(
  n_convenience = 0:10,
  dist_to_mrt_m = seq(from = 0, to = 80, by = 10)^2
)
explanatory_data
## # A tibble: 99 × 2
##    n_convenience dist_to_mrt_m
##            <int>         <dbl>
##  1             0             0
##  2             0           100
##  3             0           400
##  4             0           900
##  5             0          1600
##  6             0          2500
##  7             0          3600
##  8             0          4900
##  9             0          6400
## 10             1             0
## # … with 89 more rows
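The 99 rows come from pairing each of the 11 store counts with each of the 9 distances. The sketch below rebuilds the same grid and checks that count. It also shows why the distances are squared: the square root taken in the model formula turns them back into an evenly spaced sequence from 0 to 80. The `sqrt_dist` name is introduced here for illustration.

```r
library(tidyr)

explanatory_data <- expand_grid(
  n_convenience = 0:10,
  dist_to_mrt_m = seq(from = 0, to = 80, by = 10)^2
)

# 11 store counts x 9 distances = 99 combinations
nrow(explanatory_data) == length(0:10) * length(seq(0, 80, 10))

# Taking the square root recovers an evenly spaced grid on the plot's y-axis
sqrt_dist <- sort(unique(sqrt(explanatory_data$dist_to_mrt_m)))
sqrt_dist
```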

Create the prediction data by adding a column of predictions from the linear model to the explanatory data.

prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))
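Note that predict() receives the raw dist_to_mrt_m column and applies the sqrt() from the model formula itself. The sketch below checks this by reproducing the predictions by hand from the coefficients. It uses simulated stand-in data, not the Taiwan dataset. The `toy`, `mdl_toy`, and `new_data` objects and the coefficients used to simulate the response are assumptions for illustration.

```r
set.seed(42)
n <- 200

# Simulated stand-in data (NOT the Taiwan dataset)
toy <- data.frame(
  n_convenience = sample(0:10, n, replace = TRUE),
  dist_to_mrt_m = runif(n, min = 0, max = 6400)
)
toy$price_twd_msq <- 15 + 0.2 * toy$n_convenience -
  0.15 * sqrt(toy$dist_to_mrt_m) + rnorm(n, sd = 0.5)

mdl_toy <- lm(price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m), data = toy)

# New data holds the untransformed distance
new_data <- data.frame(
  n_convenience = c(0, 5),
  dist_to_mrt_m = c(100, 2500)
)

# predict() applies sqrt() to dist_to_mrt_m internally
pred_auto <- predict(mdl_toy, new_data)

# Same calculation by hand: intercept + slope * x1 + slope * sqrt(x2)
b <- coef(mdl_toy)
pred_manual <- b[["(Intercept)"]] +
  b[["n_convenience"]] * new_data$n_convenience +
  b[["sqrt(dist_to_mrt_m)"]] * sqrt(new_data$dist_to_mrt_m)

all.equal(unname(pred_auto), pred_manual)
```

This is why the explanatory grid stores squared distances rather than their square roots.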

Add the predictions to the plot as larger yellow points.

ggplot(
  taiwan_real_estate, 
  aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)
) + 
  geom_point() +
  scale_color_viridis_c(option = "plasma") +
  geom_point(data = prediction_data, color = "yellow", size = 3)

Two Numeric Explanatory Variables with Interaction

mdl_price_vs_conv_dist <- lm(price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m), data = taiwan_real_estate)
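In an R formula, `a * b` is shorthand for `a + b + a:b`, so the model gains one extra coefficient for the interaction. The sketch below confirms this by fitting both spellings and comparing the coefficients. It uses simulated stand-in data, not the Taiwan dataset. The `toy`, `mdl_star`, and `mdl_long` objects and the coefficients used to simulate the response are assumptions for illustration.

```r
set.seed(7)
n <- 200

# Simulated stand-in data (NOT the Taiwan dataset)
toy <- data.frame(
  n_convenience = sample(0:10, n, replace = TRUE),
  dist_to_mrt_m = runif(n, min = 0, max = 6400)
)
toy$price_twd_msq <- 15 + 0.2 * toy$n_convenience -
  0.15 * sqrt(toy$dist_to_mrt_m) +
  0.01 * toy$n_convenience * sqrt(toy$dist_to_mrt_m) +
  rnorm(n, sd = 0.5)

# Shorthand: main effects plus interaction
mdl_star <- lm(
  price_twd_msq ~ n_convenience * sqrt(dist_to_mrt_m),
  data = toy
)

# Same model with every term written out
mdl_long <- lm(
  price_twd_msq ~ n_convenience + sqrt(dist_to_mrt_m) +
    n_convenience:sqrt(dist_to_mrt_m),
  data = toy
)

# Four coefficients: intercept, two slopes, one interaction
names(coef(mdl_star))
all.equal(coef(mdl_star), coef(mdl_long))
```

The interaction coefficient lets the effect of the number of convenience stores change with the distance to the MRT station, and vice versa.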

The rest of the process is the same as in Two Numeric Explanatory Variables with no Interaction.

explanatory_data <- expand_grid(n_convenience = 0:10, dist_to_mrt_m = seq(0, 80, 10) ^ 2)

prediction_data <- explanatory_data %>% 
  mutate(price_twd_msq = predict(mdl_price_vs_conv_dist, explanatory_data))

ggplot(
  taiwan_real_estate, 
  aes(n_convenience, sqrt(dist_to_mrt_m), color = price_twd_msq)
) + 
  geom_point() +
  scale_color_viridis_c(option = "plasma") +
  geom_point(data = prediction_data, color = "yellow", size = 3)