Many Numeric Explanatory Variables
Load Packages and Dataset
The packages used are fst (for reading .fst files), dplyr (data manipulation), ggplot2 (plotting), and broom (tidying model output). The dataset is taiwan_real_estate, which records property prices in Taiwan.
Visualizing Many Numeric Variables
Faceting is a good choice when the data include categorical (non-numeric) variables: each category gets its own panel, so the relationship among the numeric variables can be inspected separately for each category.
In general, it gets trickier to include more than three numeric variables in a scatter plot (without losing focus and confusing the audience), but we can (in theory) include as many categorical variables as we like by applying faceting.
However, it is important to note that more facets can make it harder to see an overall picture.
Faceted Visualization
In the plot below, the scatter points are faceted by house age.
ggplot(taiwan_real_estate, aes(sqrt(dist_to_mrt_m), n_convenience, color = price_twd_msq)) +
# Make it a scatter plot
geom_point() +
# Use the continuous viridis plasma color scale
scale_color_viridis_c(option = "plasma") +
# Facet, wrapped by house age
facet_wrap(~house_age_years)
Different Levels of Interactions
- No global intercept and no interactions. All explanatory variables are joined with the “+” sign, and “+ 0” is added to the formula so that no global intercept is fitted.
mdl_price_vs_all_no_inter <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m) + n_convenience + house_age_years + 0,
data = taiwan_real_estate
)
See the results.
mdl_price_vs_all_no_inter
- No global intercept, but with all interactions (two-way and three-way) between the explanatory variables. The “+” signs are swapped for “*”, and “+ 0” is kept because we still don’t want a global intercept.
mdl_price_vs_all_3_way_inter <- lm(
price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience * house_age_years + 0,
data = taiwan_real_estate
)
See the results.
mdl_price_vs_all_3_way_inter
##
## Call:
## lm(formula = price_twd_msq ~ sqrt(dist_to_mrt_m) * n_convenience *
## house_age_years + 0, data = taiwan_real_estate)
##
## Coefficients:
## sqrt(dist_to_mrt_m)
## -0.162944
## n_convenience
## 0.374982
## house_age_years0 to 15
## 16.046849
## house_age_years15 to 30
## 13.760066
## house_age_years30 to 45
## 12.088773
## sqrt(dist_to_mrt_m):n_convenience
## -0.008393
## sqrt(dist_to_mrt_m):house_age_years15 to 30
## 0.036618
## sqrt(dist_to_mrt_m):house_age_years30 to 45
## 0.061281
## n_convenience:house_age_years15 to 30
## 0.078370
## n_convenience:house_age_years30 to 45
## 0.066720
## sqrt(dist_to_mrt_m):n_convenience:house_age_years15 to 30
## -0.003821
## sqrt(dist_to_mrt_m):n_convenience:house_age_years30 to 45
## 0.004401
- No global intercept, with only two-way interactions between the explanatory variables. Here the power of 2 indicates two-way interactions; it does not square the variables themselves.
- To square (or take any other power of) a variable itself, wrap the whole expression in I(), i.e. I(var ^ 2).
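The three formula styles in this section can be compared without fitting anything. The following is a minimal sketch using base R's terms(); the placeholder names y, a, b, c and the helper expand_terms are hypothetical and stand in for the real variables.

```r
# Hypothetical helper: list the terms a formula expands to (no data needed).
expand_terms <- function(formula) attr(terms(formula), "term.labels")

# "+" only: main effects, no interactions.
no_inter <- expand_terms(y ~ a + b + c + 0)

# "*": main effects plus every two-way and three-way interaction.
all_inter <- expand_terms(y ~ a * b * c + 0)

# "^ 2" on a bracketed sum: main effects plus two-way interactions only.
two_way <- expand_terms(y ~ (a + b + c) ^ 2 + 0)

# I() protects arithmetic, so this is a single squared term.
squared <- expand_terms(y ~ I(a ^ 2) + 0)

print(no_inter)
print(all_inter)
print(two_way)
print(squared)
```

The two-way version drops only the a:b:c term compared with the “*” version, which is why the two models differ by the three-way coefficients alone.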
mdl_price_vs_all_2_way_inter <- lm(
price_twd_msq ~ (sqrt(dist_to_mrt_m) + n_convenience + house_age_years) ^ 2 + 0,
data = taiwan_real_estate
)
Predictions
Generating explanatory values
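As in LM_TwoVariables, a grid of explanatory values is created with expand_grid() (from the tidyr package) and seq(). Note that for the dist_to_mrt_m variable, the sequence of numbers is created first and then squared, so the values are evenly spaced on the square-root scale used in the model.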
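To see why the sequence is squared, here is a small base R sketch. It only checks the arithmetic; the count of three house age categories is an assumption taken from the levels that appear in the model output (0 to 15, 15 to 30, 30 to 45).

```r
# Evenly spaced values on the square-root scale used by the model.
sqrt_dist <- seq(0, 80, 10)

# Squaring converts them back to the original scale of dist_to_mrt_m.
dist_to_mrt_m <- sqrt_dist ^ 2

# Taking the square root recovers the evenly spaced sequence.
print(sqrt(dist_to_mrt_m))

# Size of the full grid: 9 distances x 11 store counts x 3 age categories (assumed).
n_age_categories <- 3
grid_rows <- length(dist_to_mrt_m) * length(0:10) * n_age_categories
print(grid_rows)
```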
explanatory_data <- expand_grid(
dist_to_mrt_m = seq(0, 80, 10) ^ 2,
n_convenience = 0:10,
house_age_years = unique(taiwan_real_estate$house_age_years)
)
Predictions
Following the same workflow as before, we build a prediction table that combines the explanatory data with the prices predicted by the linear model.
prediction_data <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_price_vs_all_3_way_inter, explanatory_data))
Plot
ggplot(
taiwan_real_estate,
aes(sqrt(dist_to_mrt_m), n_convenience, color = price_twd_msq)
) +
geom_point() +
scale_color_viridis_c(option = "plasma") +
facet_wrap(vars(house_age_years)) +
geom_point(data = prediction_data, size = 3, shape = 15)
Conclusion
The plot shows that house price decreases as the square root of the distance to the nearest MRT station increases, increases with the number of nearby convenience stores, and is higher for houses under 15 years old (the leftmost panel has more brightly colored points).
One More Bit of Math Fun After Plotting
Theory
Linear regression minimizes the sum of the squares of the differences between the actual responses and the predicted responses.
Calculation
Define a simple quadratic function to illustrate how optimization works.
calc_quadratic <- function(coeffs) {
x <- coeffs[1]
x ^ 2 - x + 10
}
Use the optim function to find the x value that minimizes the function. par is the initial guess, and fn is the function to minimize.
optim(par = c(x = 3), fn = calc_quadratic)
## $par
## x
## 0.4998047
##
## $value
## [1] 9.75
##
## $counts
## function gradient
## 30 NA
##
## $convergence
## [1] 0
##
## $message
## NULL
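As a quick sanity check on this output, the minimum can also be worked out by hand: the derivative 2x - 1 is zero at x = 0.5, where the function equals 0.25 - 0.5 + 10 = 9.75. The sketch below redefines the function so that it runs on its own and uses base R's optimize(), which is designed for one-dimensional problems; the search interval c(-10, 10) is an arbitrary assumption.

```r
# Same quadratic as above, redefined so this block is self-contained.
calc_quadratic <- function(coeffs) {
  x <- coeffs[1]
  x ^ 2 - x + 10
}

# Analytic minimum: d/dx (x^2 - x + 10) = 2x - 1 = 0, so x = 0.5.
x_analytic <- 0.5
value_analytic <- calc_quadratic(x_analytic)

# One-dimensional numerical search over an assumed interval.
result <- optimize(calc_quadratic, interval = c(-10, 10))

# Both differences should be close to zero.
print(result$minimum - x_analytic)
print(result$objective - value_analytic)
```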
Linear regression solves the same kind of problem: it finds the slopes and intercept that make the sum of squared residuals as small as possible.
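To make that connection concrete, here is a minimal sketch that minimizes a sum of squared residuals with optim and compares the result with lm. It uses R's built-in cars dataset rather than taiwan_real_estate so that it runs on its own, and it selects method = "BFGS" instead of the default used above; both choices, and the function name calc_sum_of_squares, are assumptions made for illustration.

```r
# Sum of squared residuals for a straight line through the built-in cars data.
calc_sum_of_squares <- function(coeffs) {
  intercept <- coeffs[1]
  slope <- coeffs[2]
  predicted <- intercept + slope * cars$speed
  sum((cars$dist - predicted) ^ 2)
}

# Numerical minimization from an arbitrary starting guess.
fit_optim <- optim(
  par = c(intercept = 0, slope = 0),
  fn = calc_sum_of_squares,
  method = "BFGS"
)

# Reference fit from lm() on the same data.
fit_lm <- lm(dist ~ speed, data = cars)

# The two sets of coefficients should agree closely.
print(fit_optim$par)
print(coef(fit_lm))
```

In practice lm is the right tool for fitting; the optim version is only meant to show what quantity is being minimized.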