Packages and Data
Packages includes fst (for reading fst document), dplyr (data manipulation), ggplot2, broom.
r dataset is the one on taiwan’s property price.
One Model Per Category
Filter data set for each subcategory per categorical data:
taiwan_0_to_15 <- taiwan_real_estate %>%
filter(house_age_years == "0 to 15")
taiwan_15_to_30 <- taiwan_real_estate %>%
filter(house_age_years == "15 to 30")
taiwan_30_to_45 <- taiwan_real_estate %>%
filter(house_age_years == "30 to 45")Model each category by linear regression
mdl_0_to_15 <- lm(price_twd_msq ~ n_convenience, data = taiwan_0_to_15 )
mdl_15_to_30 <- lm(price_twd_msq ~ n_convenience, data = taiwan_15_to_30 )
mdl_30_to_45 <- lm(price_twd_msq ~ n_convenience, data = taiwan_30_to_45 )
mdl_all_ages <- lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)Make Predictions for Each Category
Create a tibble of explanatory data, setting n_convenience to a vector 0:10, assigning to explanatory_data
explanatory_data <- tibble(n_convenience = 1:10)Add column of predictions using “0 to 15” model and explanatory data, and assign to prediction_data_0_to_15
prediction_data_0_to_15 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_0_to_15, explanatory_data))Add column of predictions using “15 to 30” model and explanatory data, and assign to prediction_data_15_to_30
prediction_data_15_to_30 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_15_to_30, explanatory_data))Add column of predictions using “30 to 45” model and explanatory data, and assign to prediction_data_30_to_45
prediction_data_30_to_45 <- explanatory_data %>%
mutate(price_twd_msq = predict(mdl_30_to_45, explanatory_data))
knitr::kable(prediction_data_30_to_45, caption = "Prediction Table", "simple")| n_convenience | price_twd_msq |
|---|---|
| 1 | 8.781822 |
| 2 | 9.450520 |
| 3 | 10.119218 |
| 4 | 10.787916 |
| 5 | 11.456614 |
| 6 | 12.125312 |
| 7 | 12.794010 |
| 8 | 13.462708 |
| 9 | 14.131407 |
| 10 | 14.800105 |
Visualizing multiple models
Using taiwan_real_estate, plot price vs. no. of conv. stores, colored by house age
Extend the original plot to include prediction points which are in geom_point(data = )
One Model with an Interactions
Two intuitive way to perform the model
Better solution to above is to have a single model that had all predictive power of the individual models, with adding interactions between explanatory variables.
- Model price vs both with an interaction using “times” syntax
lm(price_twd_msq ~ n_convenience * house_age_years, data = taiwan_real_estate)##
## Call:
## lm(formula = price_twd_msq ~ n_convenience * house_age_years,
## data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 9.24170 0.83359
## house_age_years15 to 30 house_age_years30 to 45
## -2.36978 -1.12858
## n_convenience:house_age_years15 to 30 n_convenience:house_age_years30 to 45
## 0.01833 -0.16489
- Model with precise code using “colon” syntax
lm(price_twd_msq ~ n_convenience + house_age_years + n_convenience:house_age_years, data = taiwan_real_estate)##
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years +
## n_convenience:house_age_years, data = taiwan_real_estate)
##
## Coefficients:
## (Intercept) n_convenience
## 9.24170 0.83359
## house_age_years15 to 30 house_age_years30 to 45
## -2.36978 -1.12858
## n_convenience:house_age_years15 to 30 n_convenience:house_age_years30 to 45
## 0.01833 -0.16489
Reformulate and Easy to Interpret
- Reformulate the model so that it returns easy-to-read coefficients: model price vs. house age plus an interaction, no intercept
mdl_readable_inter <- lm(price_twd_msq ~ house_age_years + house_age_years:n_convenience + 0, data = taiwan_real_estate)
mdl_readable_inter##
## Call:
## lm(formula = price_twd_msq ~ house_age_years + house_age_years:n_convenience +
## 0, data = taiwan_real_estate)
##
## Coefficients:
## house_age_years0 to 15 house_age_years15 to 30
## 9.2417 6.8719
## house_age_years30 to 45 house_age_years0 to 15:n_convenience
## 8.1131 0.8336
## house_age_years15 to 30:n_convenience house_age_years30 to 45:n_convenience
## 0.8519 0.6687
See the modelling results
dt <- tidy(mdl_readable_inter)
kbl(dt, booktabs = T)| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| house_age_years0 to 15 | 9.2417022 | 0.4052812 | 22.803184 | 0 |
| house_age_years15 to 30 | 6.8719186 | 0.4673710 | 14.703348 | 0 |
| house_age_years30 to 45 | 8.1131235 | 0.5992443 | 13.538926 | 0 |
| house_age_years0 to 15:n_convenience | 0.8335867 | 0.0813824 | 10.242834 | 0 |
| house_age_years15 to 30:n_convenience | 0.8519172 | 0.1054725 | 8.077154 | 0 |
| house_age_years30 to 45:n_convenience | 0.6686981 | 0.1020143 | 6.554943 | 0 |
Making Predictions with Interactions
Generate some explanatory_data. Note that using “expand_grid” from tidyR package to generate a tibble from multiple source of data.
explanatory_data <- expand_grid(
n_convenience = 0:10,
house_age_years = unique(taiwan_real_estate$house_age_years)
)Check out the explanatory_data created
kable(head(explanatory_data), align = "ll") | n_convenience | house_age_years |
|---|---|
| 0 | 30 to 45 |
| 0 | 15 to 30 |
| 0 | 0 to 15 |
| 1 | 30 to 45 |
| 1 | 15 to 30 |
| 1 | 0 to 15 |
Add another column in explanatory data
prediction_data <- explanatory_data %>%
mutate(
price_twd_msq = predict(mdl_readable_inter, explanatory_data)
)Plot the predicted data versus original - flow is identical to parallel slopes
ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
geom_point(data = prediction_data, size = 5, shape = 15)## `geom_smooth()` using formula 'y ~ x'
Manually check the results - understand how predicted coefficients are used to calculate the predictions of prices
coeffs <- coefficients(mdl_readable_inter)
intercept_0_15 <- coeffs[1]
intercept_15_30 <- coeffs[2]
intercept_30_45 <- coeffs[3]
slope_0_15 <- coeffs[4]
slope_15_30 <- coeffs[5]
slope_30_45 <- coeffs[6]
prediction_data <- explanatory_data %>%
mutate(
# Consider the 3 cases to choose the price
price_twd_msq = case_when(
house_age_years == "0 to 15" ~ intercept_0_15 + slope_0_15 * n_convenience,
house_age_years == "15 to 30" ~ intercept_15_30 + slope_15_30 * n_convenience,
house_age_years == "30 to 45" ~ intercept_30_45 + slope_30_45 * n_convenience )
)Check out the prediction data
head(prediction_data)## # A tibble: 6 × 3
## n_convenience house_age_years price_twd_msq
## <int> <fct> <dbl>
## 1 0 30 to 45 8.11
## 2 0 15 to 30 6.87
## 3 0 0 to 15 9.24
## 4 1 30 to 45 8.78
## 5 1 15 to 30 7.72
## 6 1 0 to 15 10.1
Simpson’s Paradox
Models of a whole dataset vs. individual models for each category.
- Simpson’s Paradox occurs when the trend of a model on the whole dataset is very different from the trends shown by models on the subsets of the dataset
- Here in the context, trend means slope (“the slope coefficient”)
- Visualizing the dataset shall be the first step when approaching the dataset though usually the grouped model contains more insight
- cautious that if there are missing explanatory variables that help explain the results
Illustration by an Example
Take a glimpse of the dataset
glimpse(auctions)## Rows: 5,917
## Columns: 3
## $ price <dbl> 256.86, 256.86, 256.86, 256.86, 256.86, 256.86, 256.86, 2…
## $ openbid <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
## $ auction_type <chr> "3 day auction", "3 day auction", "3 day auction", "3 day…
Model price vs. opening bid using auctions dataset
mdl_price_vs_openbid <- lm(price ~ openbid, data = auctions)Plot
ggplot(auctions, aes(openbid, price)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)## `geom_smooth()` using formula 'y ~ x'
Model variables with interactions
mdl_price_vs_both <- lm(
price ~ auction_type + openbid:auction_type + 0, # or price ~ auction_type * openbid
data = auctions
)
ggplot(auctions, aes(openbid, price, color = auction_type)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)## `geom_smooth()` using formula 'y ~ x'
Conclusion
The two models (one on whole dataset, the other one on each categorical dataset) disagree, and the best model to take advice from depends upon the questions we are trying to answer.