1 Background

The purpose of this project is to estimate the average percentage of women completing undergraduate degrees in STEM programs for public and private 4-YR+ institutions (via. tree-based models and ensembles) in the United States based on extrinsic factors. The latter include such covariates as institution size, location, sector, student:faculty ratio, etc.

Data Source: Integrated Postsecondary Education Database System

The full dataset includes variables for 1803 post-secondary institutions (Public 4 Year +, Private non_profit) and comprises the period from 2010-2020, excluding 2011-12 and 2016-17. The latter are currently omitted due to data quality concerns.

Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physcial Science.

library(tidyverse)
library(magrittr)
library(flextable)
library(tidymodels)
library(dlookr)
library(xgboost)
library(vip) # variable importance plots
library(rpart.plot)

First, we load the dataset.

data.stm<-read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')

2 Data Processing

The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.

The factor covariates include:

  • sector: academic sector
  • region: geographical region
  • location: type of location (urbanicity)
  • size: number of students in institution

Numerical covariates include:

  • gran_tot: total number of completions (institution)
  • perc_wm_stem: percent of STEM graduates who are women
  • perc_wm_comp: percent of computer science graduates who are women
  • perc_wm_admit: percent of women admitted to the institution
  • ret_rate: full year retention rate
  • sf_ratio: student faculty ratio
# remove selected covariates for gender model

data.stm%<>%
    select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp))

# factor categorical variables and drop NA

data.stm%<>%
    mutate(across(where(is.character), factor))%>%
    drop_na()

2.1 Summary Statistics: Numerical

Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset.

# Calculate statistics summary for numeric variables    

data.stm%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 1. Summary Statistics for Numeric Variables")

2.2 Pairwise Correlations

# Calculate pairwise relationships

data.stm%>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output 
    flextable()%>%
    set_caption("Table 2. Pairwise Correlation: Numerical Covariates")

Given that perc_wm_admit and perc_admit are almost perfectly correlated, we will drop the latter from our working dataset.

data.stm%<>%
    select(!perc_admit)

2.3 Summary Statistics: Factor

A data summary for factor variables is shown in Table 2. The data are imbalanced for all classes and levels.

# Statistical summary for categorical variables 

data.stm%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 2. Summary Statistics for Factor Variables")
# Check for any remaining missingness among factor variables

zero<-map(data.stm, ~sum(is.factor(is.na(.))))

3 Tree Regression

With the data prepared we will build a basic tree regression models for our response variables, perc_wm_stem and per_wm_comp. The latter represent the average percentage of STEM graduates and computer science graduates (over the period of record) that are women.

Note that STEM programs are inclusive of computer science.

3.1 STEM Program Completions

The numerical variables are normalized (scaled and centerd) prior to modeling. Model results indicate the importance of retention rate, location, total number of students, and geographic region as predictors of STEM completion (percent of STEM graduates) by women.

set.seed(005001)

# remove perc_wm_comp

nocomp<-data.stm%>%
    select(!perc_wm_comp)

#scale numerical variables
 
nocomp%>%
    mutate_if(is.numeric, scale)
# run model and plot results

wm.stem <- rpart(perc_wm_stem ~ ., nocomp, method = "anova")
rpart.plot(wm.stem, main='Figure 1. Percent STEM Graduates: Women')

3.2 Computer Science Program Completions

Completion rates for women from computer science programs is strongly predicted by overall STEM completion rates (women) as well as the percent of women admitted to the institution, total student number, sector, and region. Retention rates are not an important variable in this case.

set.seed(0033001)

wm.computer <- rpart(perc_wm_comp ~ ., data.stm, method = "anova")
rpart.plot(wm.computer, main='Figure 2. Percent Computer Science Graduates: Women')

4 Ensemble Methods

We will now undertake a more complete model development, selection, and evaluation process for perc_wm_stem. This will involve the use of ensemble methods such as boosting and random forests.

4.1 Model Splits and Resampling

First we split our data into training and test sets. And set up 10-fold cross-validation for hyperparameter tuning.

set.seed(1)

# create random data split -- data will not include variable for computer programs

wm.split <- initial_split(nocomp, strata = perc_wm_stem, prop = 0.75)

# create train and test sets from split

wm.train <- training(wm.split)

wm.test  <- testing(wm.split)

# create resamples for model tuning

set.seed(2)

cv_folds<-
  vfold_cv(wm.train, 
           v=10,
           strata= perc_wm_stem)

4.2 Tidymodels Recipe

We then establish a preprocessing recipe for the tree regression models. This will include, one-hot encoding for factor covariates, removal of near-zero variance covariates, and normalizing (center and scale) the numerical covariates.

# pre-model processing using recipes

wm_recipe<-
    recipe(perc_wm_stem ~., data=wm.train)%>%
    update_role(contains("id"), new_role = "id vars") %>%
    step_dummy(c(sector, region, location, size))%>%
    step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>%
    step_normalize(all_numeric()) # center and scale numerical vars

#prep the recipe and call juice to extract the transformed dataset
    
wm_juiced <-juice(prep(wm_recipe))
  
# review type, roles, source
  
summary(wm_recipe)%>%
    flextable()%>%
    set_caption('Table 3. Overview of Variable Type, Role, and Source')

4.3 Model Specification & Tuning Parameters

We will build and specify three tree-based models for comparison and evaluation:

  1. basic
  2. boosted (XGBoost)
  3. random forest.

Each model includes hyperparameters that will be selected using tuning grids and cross-validation.

The hyperparameters include:

  • basic: tree_depth, min_n, cost_complexity
  • boosted: min_n, tree_depth, learn_rate
  • random forest: mtry, min_n

The number of trees grown for the boosted and random forest models = 1000.

# specify decision tree

set.seed(3)

wm.dec <-decision_tree(
  tree_depth = tune(), # creating tuneable parameters
  min_n = tune(),
  cost_complexity = tune()
)%>%
    set_engine("rpart")%>% # ctree - tree_depth, min_n
    set_mode("regression")
  

#set up tuning grid for dec tree, levels default = 3

dec.grid<-grid_regular(tree_depth(), min_n(), cost_complexity(), levels=4) 

# specify boosted tree

set.seed(4)

wm.boost<-boost_tree(
  mtry =  NULL, # .preds() all predictors is actually the default, use this for bagging
  trees = 1000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = NULL,
  sample_size = NULL,
  stop_iter = NULL
)%>%
    set_engine("xgboost")%>%
    set_mode("regression")

#set up tuning grid for dec tree, levels default = 3

boost.grid<-grid_regular(tree_depth(), min_n(), learn_rate(), levels=4)

# specify random forest

set.seed(5)

wm.rf <-rand_forest(
    mtry = tune(),
    trees = 1000,
    min_n = tune()
) %>%
    set_engine("ranger", importance='impurity')%>%
    set_mode("regression") 

4.4 Model Workflows

Using tidymodels, the data recipes and model specifications are combined into a single workflow for each model.

# initiate a workflow for decision tree

wm.dec.wf<- workflow()%>%
    add_recipe(wm_recipe)%>%
    add_model(wm.dec)

# workflow for boosted tree

wm.boost.wf <- workflow() %>% 
  add_recipe(wm_recipe) %>% 
  add_model(wm.boost)

# workflow for random forest

wm.rf.wf <- workflow() %>% 
  add_recipe(wm_recipe) %>% 
  add_model(wm.rf)

# select metrics for model evaluation

wm.metrics <- metric_set(rmse, rsq, mae, mape)

4.5 Model Training

The models are then fit to the training set. Metrics for model performance are collected at the same time.

These include: rmse, rsq, mae, mape.

Model results are saved to a local directory as an Rdata file.

doParallel::registerDoParallel()

set.seed(6)

# fit decision tree

wm.dec.fit <- 
    wm.dec.wf %>%
    tune_grid(
    resamples = cv_folds,
    grid = dec.grid,
    metrics = wm.metrics
) 


# fit boosted tree

set.seed(7)

wm.boost.fit<-
    wm.boost.wf%>%
    tune_grid(
    resamples = cv_folds,
    grid = boost.grid,
    metrics = wm.metrics
)

# fit random forest

set.seed(8)

wm.rf.fit<-
    tune_grid(
    wm.rf.wf,
    resamples = cv_folds,
    grid=20
)
    

save(wm.dec.fit, wm.boost.fit, wm.rf.fit, file="gender_trees.Rdata")
# run to load model results without rerunning models

#load("gender_trees.Rdata")

4.6 Cross-Validation Results

Parameter tuning results for the basic tree model, XGBoost, and random forest are shown in Figures 3-5, respectively.

The optimized hyperparameter settings for each model are as follows:

  • basic: tree_depth = 1, min_n = 1, cost_complexity = 1e-10
  • boosted: min_n = 2, tree_depth =1, learn_rate = 0.1
  • random forest: mtry = 6, min_n = 34
# Decision Tree

#collect_metrics(wm.dec.fit)

autoplot(wm.dec.fit)+theme_light()+labs(title='Figure 3. Hyperparameter Tuning for Basic Tree Regression Model')

#show_best(wm.dec.fit)

best.dec<-select_best(wm.dec.fit, 'rmse')

# Boosted Tree

#wm.boost.fit%>% 
    #collect_metrics()

autoplot(wm.boost.fit)+theme_light()+labs(title='Figure 4. Hyperparameter Tuning for XGBoost Model')

#show_best(wm.boost.fit)

best.boost<-select_best(wm.boost.fit, 'rmse')


# Random Forests

#wm.rf.fit%>% 
    #collect_metrics()

autoplot(wm.rf.fit)+theme_light()+labs(title='Figure 5. Hyperparameter Tuning for Random Forest Model')

#show_best(wm.dec.fit)

best.rf<-select_best(wm.rf.fit, 'rmse')

4.7 Variable Importance Plots

Variable importance scores (a measure of variable influence on the outcome) are shown for each of the models in Figures 5-7.

Student retention rates (ret_rate) are rank the highest for each of the models. Followed by total student number (gran_tot). It’s also worthing noting the greater complexity of the XGBoost and random forest models (which include scores for sub-regions).

#vip for decision tree

vip.dec<-finalize_workflow(wm.dec.wf, best.dec)

vip.dec%>%
    fit(data=wm.train)%>%
    extract_fit_parsnip()%>%
    vip(geom = 'col',  aesthetics=list(fill='midnightblue', alpha=0.7))+
    labs(title='Figure 5. Variable Importance: Decision Tree Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)')+
    theme_light()

# vip for XGboost

vip.boost<-finalize_workflow(wm.boost.wf, best.boost)

vip.boost%>%
    fit(data=wm.train)%>%
    extract_fit_parsnip()%>%
    vip(geom = 'col',  aesthetics=list(fill='midnightblue', alpha=0.7))+
    labs(title='Figure 6. Variable Importance: XGBoost Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)')+
    theme_light()

# vip for XGboost

vip.rf<-finalize_workflow(wm.rf.wf, best.rf)

vip.rf%>%
    fit(data=wm.train)%>%
    extract_fit_parsnip()%>%
    vip(geom = 'col',  aesthetics=list(fill='midnightblue', alpha=0.7))+
    labs(title='Figure 7. Variable Importance: Random Forest Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)')+
    theme_light()

## Final Models and Evaluation

Each model is updated with the selected hyperparameters (based on RMSE).

# Decision Tree

final.dec<-finalize_model(wm.dec, select_best(wm.dec.fit, 'rmse')) #with tuning specs

# Boosted Tree

final.boost<-finalize_model(wm.boost, select_best(wm.boost.fit, 'rmse'))

# Random Forest

final.rf<-finalize_model(wm.rf, select_best(wm.rf.fit, 'rmse'))

And then fit to the test splits.

#Fit to training and eval on testing - last fit is a convenience function to save code

# Decision Tree

final.fit.dec<-last_fit(final.dec,  perc_wm_stem ~., split = wm.split)

# XGoost Tree

final.fit.boost<-last_fit(final.boost,  perc_wm_stem ~., split = wm.split)

# Random Forest

final.fit.rf<-last_fit(final.rf,  perc_wm_stem ~., split = wm.split)

The XGBoost model returns the lowest estimated RMSE (10.73) and, as a result, represents the best fit for this dataset. However, the difference in RMSE estimates between the models is slight.

It’s worth noting that the RMSE is average difference between the predictions made for perc_wm_stem and the actual observed values in the test set. On this basis, the average error is ~ 11% for the models.

collect_metrics(final.fit.dec) %>% 
    bind_rows(collect_metrics(final.fit.boost)) %>%
    bind_rows(collect_metrics(final.fit.rf)) %>% 
    filter(.metric == "rmse") %>% 
    mutate(model = c("Tree", "XGboost", "Random Forest")) %>% 
    select(model, .metric, .estimator, .estimate) %>% 
    flextable()%>%
    set_caption('Table 4. Goodness of Fit (RMSE) for the Final Models')

4.8 Model Predictions: Scatterplots

Figures 8-10 plot the relationship between observed vs. predicted outcomes for each of the models.

# plot decision tree as scatterplot

final.fit.dec%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 8. Decision Tree Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)', x='Observed', y='Predicted')+
    theme_light()

# plot XGboost as scatterplot

final.fit.boost%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 9. XGboost Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, n=(889)', x='Observed', y='Predicted')+
    theme_light()

# plot decision tree as scatterplot

final.fit.rf%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 10. Random Forest Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, n=(889)', x='Observed', y='Predicted')+
    theme_light()