Packages and Data

Packages used include fst (for reading .fst files), dplyr (data manipulation), tidyr (expand_grid()), ggplot2 (plotting), broom (tidying model output), and knitr/kableExtra (tables).

The dataset is taiwan_real_estate, which records Taiwan property prices (price_twd_msq, price in TWD per square metre) together with the number of nearby convenience stores (n_convenience) and the house age bracket (house_age_years).
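
A minimal setup sketch; the .fst file name below is an assumption, not taken from the original document:

library(fst)        # read .fst files
library(dplyr)      # data manipulation
library(tidyr)      # expand_grid()
library(ggplot2)    # plotting
library(broom)      # tidy() model output
library(knitr)      # kable() tables
library(kableExtra) # kbl() tables

# File path is assumed; replace with the actual location of the dataset
taiwan_real_estate <- read_fst("taiwan_real_estate.fst")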

One Model Per Category

Filter the dataset into one subset for each level of the categorical variable house_age_years:

taiwan_0_to_15 <- taiwan_real_estate %>%
  filter(house_age_years == "0 to 15")

taiwan_15_to_30 <- taiwan_real_estate %>%
  filter(house_age_years == "15 to 30")

taiwan_30_to_45 <- taiwan_real_estate %>%
  filter(house_age_years == "30 to 45")

Fit a linear regression for each category, and one more for the whole dataset:

mdl_0_to_15 <- lm(price_twd_msq ~ n_convenience, data = taiwan_0_to_15 )

mdl_15_to_30 <- lm(price_twd_msq ~ n_convenience, data = taiwan_15_to_30 )

mdl_30_to_45 <- lm(price_twd_msq ~ n_convenience, data = taiwan_30_to_45 )

mdl_all_ages <- lm(price_twd_msq ~ n_convenience, data = taiwan_real_estate)
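
As a quick check (not part of the original output), the per-category intercepts and slopes can be inspected with coefficients(); the single interaction model introduced later reproduces these values, while mdl_all_ages gives the whole-dataset fit for comparison.

coefficients(mdl_0_to_15)
coefficients(mdl_15_to_30)
coefficients(mdl_30_to_45)
coefficients(mdl_all_ages)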

Make Predictions for Each Category

Create a tibble of explanatory data, setting n_convenience to the vector 1:10, and assign it to explanatory_data

explanatory_data <- tibble(n_convenience = 1:10)

Add a column of predictions using the “0 to 15” model and the explanatory data, and assign the result to prediction_data_0_to_15

prediction_data_0_to_15 <- explanatory_data %>%
  mutate(price_twd_msq = predict(mdl_0_to_15, explanatory_data))

Add a column of predictions using the “15 to 30” model and the explanatory data, and assign the result to prediction_data_15_to_30

prediction_data_15_to_30 <- explanatory_data %>%
  mutate(price_twd_msq = predict(mdl_15_to_30, explanatory_data))

Add a column of predictions using the “30 to 45” model and the explanatory data, and assign the result to prediction_data_30_to_45

prediction_data_30_to_45 <- explanatory_data %>%
  mutate(price_twd_msq = predict(mdl_30_to_45, explanatory_data))
knitr::kable(prediction_data_30_to_45, caption = "Prediction Table", "simple")
Prediction Table

n_convenience    price_twd_msq
-------------    -------------
            1         8.781822
            2         9.450520
            3        10.119218
            4        10.787916
            5        11.456614
            6        12.125312
            7        12.794010
            8        13.462708
            9        14.131407
           10        14.800105

Visualizing multiple models

  1. Using taiwan_real_estate, plot price versus the number of convenience stores, colored by house age.

  2. Extend the plot with the prediction points by adding a geom_point(data = ...) layer for each prediction data frame, as shown in the sketch below.
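
A sketch of the extended plot, assuming the three prediction data frames built above; the fixed colours, sizes, and shapes are arbitrary choices:

ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  # Overlay predictions from each per-category model; a fixed colour set per
  # layer overrides the inherited colour aesthetic
  geom_point(data = prediction_data_0_to_15, color = "red", size = 3, shape = 15) +
  geom_point(data = prediction_data_15_to_30, color = "green", size = 3, shape = 15) +
  geom_point(data = prediction_data_30_to_45, color = "blue", size = 3, shape = 15)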

One Model with Interactions

A better solution than the separate per-category models is a single model that keeps all the predictive power of the individual models by adding interactions between the explanatory variables.

There are two intuitive ways to specify it:

  1. Model price versus both explanatory variables with an interaction, using the “times” syntax (*)
lm(price_twd_msq ~ n_convenience * house_age_years, data = taiwan_real_estate)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience * house_age_years, 
##     data = taiwan_real_estate)
## 
## Coefficients:
##                           (Intercept)                          n_convenience  
##                               9.24170                                0.83359  
##               house_age_years15 to 30                house_age_years30 to 45  
##                              -2.36978                               -1.12858  
## n_convenience:house_age_years15 to 30  n_convenience:house_age_years30 to 45  
##                               0.01833                               -0.16489
  2. The same model written out explicitly, using the “colon” syntax (:)
lm(price_twd_msq ~ n_convenience + house_age_years + n_convenience:house_age_years, data = taiwan_real_estate)
## 
## Call:
## lm(formula = price_twd_msq ~ n_convenience + house_age_years + 
##     n_convenience:house_age_years, data = taiwan_real_estate)
## 
## Coefficients:
##                           (Intercept)                          n_convenience  
##                               9.24170                                0.83359  
##               house_age_years15 to 30                house_age_years30 to 45  
##                              -2.36978                               -1.12858  
## n_convenience:house_age_years15 to 30  n_convenience:house_age_years30 to 45  
##                               0.01833                               -0.16489

Reformulate for Easier Interpretation

  1. Reformulate the model so that it returns easy-to-read coefficients: model price vs. house age plus an interaction, no intercept
mdl_readable_inter <- lm(price_twd_msq ~ house_age_years + house_age_years:n_convenience + 0, data = taiwan_real_estate)
mdl_readable_inter
## 
## Call:
## lm(formula = price_twd_msq ~ house_age_years + house_age_years:n_convenience + 
##     0, data = taiwan_real_estate)
## 
## Coefficients:
##                house_age_years0 to 15                house_age_years15 to 30  
##                                9.2417                                 6.8719  
##               house_age_years30 to 45   house_age_years0 to 15:n_convenience  
##                                8.1131                                 0.8336  
## house_age_years15 to 30:n_convenience  house_age_years30 to 45:n_convenience  
##                                0.8519                                 0.6687

See the modelling results

dt <- tidy(mdl_readable_inter)
kbl(dt, booktabs = T)
term                                      estimate    std.error   statistic   p.value
--------------------------------------   ----------   ---------   ---------   -------
house_age_years0 to 15                     9.2417022   0.4052812   22.803184         0
house_age_years15 to 30                    6.8719186   0.4673710   14.703348         0
house_age_years30 to 45                    8.1131235   0.5992443   13.538926         0
house_age_years0 to 15:n_convenience       0.8335867   0.0813824   10.242834         0
house_age_years15 to 30:n_convenience      0.8519172   0.1054725    8.077154         0
house_age_years30 to 45:n_convenience      0.6686981   0.1020143    6.554943         0

Making Predictions with Interactions

Generate some explanatory_data. Note that expand_grid() from the tidyr package generates a tibble containing every combination of its inputs.

explanatory_data <- expand_grid(
  n_convenience = 0:10, 
  house_age_years = unique(taiwan_real_estate$house_age_years)
)

Check out the explanatory_data created

kable(head(explanatory_data), align = "ll") 
n_convenience   house_age_years
-------------   ---------------
            0   30 to 45
            0   15 to 30
            0   0 to 15
            1   30 to 45
            1   15 to 30
            1   0 to 15

Add a column of predictions to the explanatory data

prediction_data <- explanatory_data %>% 
  mutate(
    price_twd_msq = predict(mdl_readable_inter, explanatory_data)
  )

Plot the predicted data against the original data; the workflow is identical to the parallel slopes case.

ggplot(taiwan_real_estate, aes(n_convenience, price_twd_msq, color = house_age_years)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  geom_point(data = prediction_data, size = 5, shape = 15)
## `geom_smooth()` using formula 'y ~ x'

Manually check the results to understand how the fitted coefficients are used to calculate the predicted prices.

coeffs <- coefficients(mdl_readable_inter)
intercept_0_15 <- coeffs[1]
intercept_15_30 <- coeffs[2]
intercept_30_45 <- coeffs[3]
slope_0_15 <- coeffs[4]
slope_15_30 <- coeffs[5]
slope_30_45 <- coeffs[6]

prediction_data <- explanatory_data %>% 
  mutate(
    # Consider the 3 cases to choose the price
    price_twd_msq = case_when(
      house_age_years == "0 to 15" ~ intercept_0_15 + slope_0_15 * n_convenience, 
      house_age_years == "15 to 30" ~ intercept_15_30 + slope_15_30 * n_convenience, 
      house_age_years == "30 to 45" ~ intercept_30_45 + slope_30_45 * n_convenience
    )
  )

Check out the prediction data

head(prediction_data)
## # A tibble: 6 × 3
##   n_convenience house_age_years price_twd_msq
##           <int> <fct>                   <dbl>
## 1             0 30 to 45                 8.11
## 2             0 15 to 30                 6.87
## 3             0 0 to 15                  9.24
## 4             1 30 to 45                 8.78
## 5             1 15 to 30                 7.72
## 6             1 0 to 15                 10.1

Simpson’s Paradox

A model of the whole dataset vs. individual models for each category:

  1. Simpson’s Paradox occurs when the trend of a model fit on the whole dataset is very different from the trends shown by models fit on subsets of the dataset.
  2. In this context, trend means slope (the slope coefficient).
  3. Visualizing the dataset should be the first step; usually the grouped models contain more insight.
  4. Be cautious about missing explanatory variables that might help explain the results.

Illustration by an Example

Take a glimpse of the dataset

glimpse(auctions)
## Rows: 5,917
## Columns: 3
## $ price        <dbl> 256.86, 256.86, 256.86, 256.86, 256.86, 256.86, 256.86, 2…
## $ openbid      <dbl> 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.0…
## $ auction_type <chr> "3 day auction", "3 day auction", "3 day auction", "3 day…

Model price vs. opening bid using auctions dataset

mdl_price_vs_openbid <- lm(price ~ openbid, data = auctions)

Plot

ggplot(auctions, aes(openbid, price)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'

Model price versus opening bid and auction type with an interaction, and plot by auction type

mdl_price_vs_both <- lm(
  price ~ auction_type + openbid:auction_type + 0, # or price ~ auction_type * openbid
  data = auctions
)

ggplot(auctions, aes(openbid, price, color = auction_type)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
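
As a quick numerical check (not shown in the original output), compare the overall slope with the per-auction-type slopes:

# Slope from the whole-dataset model
coefficients(mdl_price_vs_openbid)

# Per-auction-type intercepts and slopes from the interaction model
coefficients(mdl_price_vs_both)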

Conclusion

The two models (one fit on the whole dataset, the other fitting separate trends for each category) disagree, and the best model to take advice from depends on the question we are trying to answer.