Logistic Regression - County data - Model explainer

Author

Tural Sadigov

libraries and data

We use tidyverse (Wickham et al. (2019)) for data wrangling, tidymodels (Kuhn and Wickham (2020)) for modeling, and stats2data (Sadigov (2022)) for the data.

```{r}
library(tidyverse)
library(tidymodels)
library(stats2data)
# one <- 'https://raw.githubusercontent.com/'
# two <- 'turalsadigov/'
# three <- 'MATH_254/main/data/'
# four <- 'county.csv'
# url <- str_c(one, two, three, four)
# county <- read_csv(url)

# keep a county identifier, two predictors, and the response
df <- stats2data::county %>% 
  select(name, state, pop2017, median_hh_income, metro) %>% 
  unite('name/state', name:state, sep = '/') %>% 
  mutate(metro = factor(metro)) %>% 
  drop_na()

# work with the natural log of population
df <- 
  df %>% 
  mutate(pop2017 = log(pop2017))

# 80/20 split, stratified on the response
set.seed(2022)
df_split <- initial_split(data = df, 
                          prop = 0.80, 
                          strata = metro)
df_training <- training(df_split)
df_testing <- testing(df_split)
```
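To see what stratifying on the response buys us, here is a minimal base R sketch on simulated data. The data frame `toy` and its values are made up for illustration, and the manual sampling only approximates what `initial_split(strata = metro)` does internally.

```r
set.seed(2022)

# simulated stand-in for the county data (hypothetical values)
n <- 1000
toy <- data.frame(
  pop = exp(rnorm(n, mean = 10, sd = 1.5)),
  metro = factor(sample(c("no", "yes"), n, replace = TRUE,
                        prob = c(0.63, 0.37)))
)
toy$log_pop <- log(toy$pop)

# stratified 80/20 split: sample 80% of the row indices within each class
idx_by_class <- split(seq_len(n), toy$metro)
train_idx <- unlist(
  lapply(idx_by_class, function(i) sample(i, size = floor(0.8 * length(i)))),
  use.names = FALSE
)
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# the class shares should be nearly identical in all three sets
prop.table(table(toy$metro))
prop.table(table(toy_train$metro))
prop.table(table(toy_test$metro))
```

Sampling within each class keeps the share of metro counties about the same in the training and testing sets, which matters when one class is less common.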

load the fitted model

```{r}
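# load() restores the saved objects into the workspace;
# the code below relies on one of them, `chosen_model`
load(file = 'county_logistic_model.Rdata')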
```
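Note that `load()` restores objects under the names they were saved with, so nothing is assigned explicitly. A small round trip with a temporary file shows the behaviour; `toy_model` and the file path are hypothetical and unrelated to the county model.

```r
# fit and save a throwaway model (built-in `cars` data)
toy_model <- lm(dist ~ speed, data = cars)
path <- tempfile(fileext = ".Rdata")
save(toy_model, file = path)

# remove it, then restore it from disk
rm(toy_model)
restored <- load(path)

restored              # names of the objects that were restored
exists("toy_model")   # the model is back under its original name
```

Capturing the return value of `load()` is a handy way to check what a `.Rdata` file actually contains.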

explain your model to others

See chapter 18 of Tidy Modeling with R (Kuhn and Silge (2022)), https://www.tmwr.org/explain.html. We use four additional packages:

  • vip (Greenwell and Boehmke (2020))

  • DALEX (Biecek (2018))

  • DALEXtra (Maksymiuk, Gosiewska, and Biecek (2020))

  • modelStudio (Baniecki and Biecek (2019))

```{r}
library(vip)
library(DALEX)
library(DALEXtra)
library(modelStudio)
```

variable importance

```{r}
vip(extract_fit_engine(chosen_model))
```
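For a glm engine, our understanding of the vip documentation is that the default, model-specific importance is the absolute value of each coefficient's test statistic. The sketch below reproduces that idea in base R on simulated data; `x1`, `x2`, and the coefficients are invented, and this is not the fitted county model.

```r
set.seed(1)

# simulated data: x1 has a strong effect, x2 a weak one
n <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x1 + 0.3 * x2))

fit <- glm(y ~ x1 + x2, family = binomial)

# z statistics of the slopes (drop the intercept), ranked by absolute value
z <- coef(summary(fit))[-1, "z value"]
importance <- sort(abs(z), decreasing = TRUE)
importance
```

Because the statistic divides each estimate by its standard error, this ranking reflects how precisely an effect is estimated, not only how large it is.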

explainer

```{r}
# modelStudio likes numeric response, so we create one
metro_dummy_num <- 
  df_training %>% 
  transmute(metro = if_else(metro == 'yes', 1, 0)) %>% 
  pull(metro)

# create explainer
explainer_logistic <- 
  explain_tidymodels(
    extract_fit_parsnip(chosen_model), 
    data = df_training %>% select(-metro), 
    y = metro_dummy_num,
    label = "logistic regression",
    verbose = FALSE
  )
explainer_logistic
```
Model label:  logistic regression 
Model class:  _glm,model_fit 
Data head  :
              name/state   pop2017 median_hh_income
1 Barbour County/Alabama 10.137373            33368
2 Bullock County/Alabama  9.240773            29655
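An explainer bundles the model, the predictor data, the response, and a predict function, which is why a numeric 0/1 response is convenient: residuals and loss-based summaries need arithmetic on `y`. The base R sketch below mimics two such computations on simulated data. The names `toy`, `predict_prob`, and the RMSE loss are our own choices for illustration, not what DALEX uses internally.

```r
set.seed(2022)

# simulated stand-in: only log_pop drives the response
n <- 400
toy <- data.frame(log_pop = rnorm(n, 10, 1.5),
                  income = rnorm(n, 50, 10))
toy$metro_num <- rbinom(n, 1, plogis(-12 + 1.2 * toy$log_pop))

fit <- glm(metro_num ~ log_pop + income, data = toy, family = binomial)

# predict function returning P(metro = 1)
predict_prob <- function(model, newdata) {
  as.numeric(predict(model, newdata = newdata, type = "response"))
}

# residuals are only defined because the response is numeric
y_hat <- predict_prob(fit, toy)
residuals_num <- toy$metro_num - y_hat

# permutation importance: increase in RMSE after shuffling one column
rmse <- function(y, p) sqrt(mean((y - p)^2))
baseline <- rmse(toy$metro_num, y_hat)
perm_importance <- sapply(c("log_pop", "income"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  rmse(toy$metro_num, predict_prob(fit, shuffled)) - baseline
})
perm_importance
```

Unlike the model-specific importance above, the permutation approach needs only predictions, so the same recipe works for any model type.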

create interactive explainer

```{r}
modelStudio::modelStudio(explainer = explainer_logistic)
```
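One of the views modelStudio offers is a ceteris paribus profile: take a single observation, vary one predictor over a grid, hold the others fixed, and plot the predicted probability. A minimal base R sketch of that computation follows. The data are simulated and the grid size is arbitrary, so treat it as an illustration of the idea only.

```r
set.seed(2022)

# simulated stand-in for the training data
n <- 400
toy <- data.frame(log_pop = rnorm(n, 10, 1.5),
                  income = rnorm(n, 50, 10))
toy$metro_num <- rbinom(n, 1, plogis(-12 + 1.2 * toy$log_pop))

fit <- glm(metro_num ~ log_pop + income, data = toy, family = binomial)

# pick one observation and vary log_pop while income stays at its value
obs <- toy[1, c("log_pop", "income")]
grid <- seq(min(toy$log_pop), max(toy$log_pop), length.out = 25)
profile <- data.frame(log_pop = grid, income = obs$income)
profile$prob <- predict(fit, newdata = profile, type = "response")

head(profile)
```

Plotting `prob` against `log_pop` gives the profile curve for that one county-like observation.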

References

Baniecki, Hubert, and Przemyslaw Biecek. 2019. “modelStudio: Interactive Studio with Explanations for ML Predictive Models.” Journal of Open Source Software 4: 1798. https://doi.org/10.21105/joss.01798.
Biecek, Przemyslaw. 2018. “DALEX: Explainers for Complex Predictive Models in R.” Journal of Machine Learning Research 19: 1–5. https://jmlr.org/papers/v19/18-416.html.
Greenwell, Brandon M., and Bradley C. Boehmke. 2020. “Variable Importance Plots: An Introduction to the vip Package.” The R Journal 12. https://doi.org/10.32614/RJ-2020-013.
Kuhn, Max, and Julia Silge. 2022. “Tidy Modeling with R.” https://www.tmwr.org/.
Kuhn, Max, and Hadley Wickham. 2020. “Tidymodels: A Collection of Packages for Modeling and Machine Learning Using Tidyverse Principles.” https://www.tidymodels.org.
Maksymiuk, Szymon, Alicja Gosiewska, and Przemyslaw Biecek. 2020. “Landscape of R Packages for eXplainable Artificial Intelligence.” https://arxiv.org/abs/2009.13248.
Sadigov, Tural. 2022. “stats2data: Data Package for MATH 254, Statistical Modeling and Applications, at Hamilton College.”
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4: 1686. https://doi.org/10.21105/joss.01686.