The purpose of this project is to estimate the average percentage of women completing undergraduate degrees in STEM programs at public and private 4-year+ institutions in the United States, using tree-based models and ensembles, based on extrinsic factors. The latter include covariates such as institution size, location, sector, and student:faculty ratio.
Data Source: Integrated Postsecondary Education Database System
The full dataset includes variables for 1,803 post-secondary institutions (public 4-year+ and private nonprofit) and covers the period 2010-2020, excluding 2011-12 and 2016-17; the latter are currently omitted due to data quality concerns.
Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physical Science.
library(tidyverse)
library(magrittr)
library(flextable)
library(tidymodels)
library(dlookr)
library(xgboost)
library(vip) # variable importance plots
library(rpart.plot)
First, we load the dataset.
data.stm <- read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')
The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.
The factor covariates include: sector, region, location, and size.
Numerical covariates include: gran_tot (total students), perc_admit, perc_wm_admit, ret_rate (retention rate), and sf_ratio (student:faculty ratio), along with the response variables perc_wm_stem and perc_wm_comp.
# remove selected covariates for gender model
data.stm %<>%
  select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp))
# factor categorical variables and drop NA
data.stm %<>%
  mutate(across(where(is.character), factor)) %>%
  drop_na()
Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset (see the check after Table 1).
# calculate summary statistics for numeric variables
data.stm %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Table 1. Summary Statistics for Numeric Variables")
| variables | min | mean | median | max | zero | minus | outlier |
|---|---|---|---|---|---|---|---|
| gran_tot | 28 | 1,174.52868 | 534 | 9,328 | 0 | 0 | 97 |
| perc_wm_stem | 0 | 58.34871 | 58 | 100 | 1 | 0 | 51 |
| perc_wm_comp | 0 | 19.15861 | 16 | 100 | 27 | 0 | 45 |
| perc_admit | 5 | 67.59505 | 71 | 100 | 0 | 0 | 37 |
| perc_wm_admit | 5 | 68.39708 | 72 | 100 | 0 | 0 | 40 |
| ret_rate | 0 | 76.17660 | 77 | 99 | 2 | 0 | 13 |
| sf_ratio | 3 | 14.34196 | 14 | 67 | 0 | 0 | 8 |
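The single zero-completion institution noted above can be pulled directly (a quick sketch; the identifying columns were removed earlier, so only the remaining covariates are returned):

# identify the one institution with no women STEM graduates
data.stm %>%
  filter(perc_wm_stem == 0)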
# calculate pairwise relationships
data.stm %>%
  filter_if(is.numeric, all_vars((.) != 0)) %>%
  correlate() %>%
  filter(coef_corr > .5 | coef_corr < -.5) %>% # set thresholds to limit output
  flextable() %>%
  set_caption("Table 2. Pairwise Correlation: Numerical Covariates")
| var1 | var2 | coef_corr |
|---|---|---|
| perc_wm_admit | perc_admit | 0.992224 |
| perc_admit | perc_wm_admit | 0.992224 |
Given that perc_wm_admit and perc_admit are almost perfectly correlated (r = 0.99), we drop the latter from our working dataset.
data.stm %<>%
  select(!perc_admit)
A data summary for the factor variables is shown in Table 3. The data are imbalanced across the levels of every factor.
# summary statistics for categorical variables
data.stm %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Table 3. Summary Statistics for Factor Variables")
| variables | levels | N | freq | ratio | rank |
|---|---|---|---|---|---|
| sector | private_nonprofit | 889 | 514 | 57.817773 | 1 |
| sector | public | 889 | 375 | 42.182227 | 2 |
| region | Southeast | 889 | 234 | 26.321710 | 1 |
| region | Great_Lakes | 889 | 146 | 16.422947 | 2 |
| region | Mid_East | 889 | 135 | 15.185602 | 3 |
| region | Plains | 889 | 114 | 12.823397 | 4 |
| region | Southwest | 889 | 81 | 9.111361 | 5 |
| region | Far_West | 889 | 78 | 8.773903 | 6 |
| region | New_England | 889 | 57 | 6.411699 | 7 |
| region | Rocky_Mountains | 889 | 28 | 3.149606 | 8 |
| region | Other_US_Jurisdictions | 889 | 16 | 1.799775 | 9 |
| location | City | 889 | 571 | 64.229471 | 1 |
| location | Town | 889 | 275 | 30.933633 | 2 |
| location | Rural | 889 | 43 | 4.836895 | 3 |
| size | 1,000 - 4,999 | 889 | 434 | 48.818898 | 1 |
| size | 5,000 - 9,999 | 889 | 143 | 16.085489 | 2 |
| size | 10,000 - 19,999 | 889 | 124 | 13.948256 | 3 |
| size | 20,000 + | 889 | 120 | 13.498313 | 4 |
| size | < 1,000 | 889 | 68 | 7.649044 | 5 |
# check for any remaining missingness among factor variables
zero <- map(data.stm, ~ sum(is.na(.))) # count of NA values per column
With the data prepared, we build basic regression tree models for our response variables, perc_wm_stem and perc_wm_comp. These represent the average percentage of STEM graduates and computer science graduates (over the period of record) who are women.
Note that STEM programs are inclusive of computer science.
The numerical variables are normalized (scaled and centered) prior to modeling. Model results indicate the importance of retention rate, location, total number of students, and geographic region as predictors of STEM completion (percent of STEM graduates) by women.
set.seed(005001)
# remove perc_wm_comp
nocomp <- data.stm %>%
  select(!perc_wm_comp)
# scale numerical variables (assign back so the scaling persists)
nocomp %<>%
  mutate_if(is.numeric, scale)
# run model and plot results
wm.stem <- rpart(perc_wm_stem ~ ., nocomp, method = "anova")
rpart.plot(wm.stem, main = 'Figure 1. Percent STEM Graduates: Women')
Completion rates for women in computer science programs are strongly predicted by overall STEM completion rates (women), as well as by the percent of women admitted to the institution, total student number, sector, and region. Retention rate is not an important variable in this case.
set.seed(0033001)

wm.computer <- rpart(perc_wm_comp ~ ., data.stm, method = "anova")
rpart.plot(wm.computer, main = 'Figure 2. Percent Computer Science Graduates: Women')
We will now undertake a more complete model development, selection, and evaluation process for perc_wm_stem. This will involve the use of ensemble methods such as boosting and random forests.
First, we split our data into training and test sets and set up 10-fold cross-validation for hyperparameter tuning. The split is stratified on the outcome so that both sets cover its full range (see the check after this chunk).
set.seed(1)
# create random data split -- data will not include variable for computer programs
wm.split <- initial_split(nocomp, strata = perc_wm_stem, prop = 0.75)

# create train and test sets from split
wm.train <- training(wm.split)
wm.test <- testing(wm.split)

# create resamples for model tuning
set.seed(2)
cv_folds <- vfold_cv(wm.train,
                     v = 10,
                     strata = perc_wm_stem)
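As a quick sanity check (a sketch; the exact quantiles depend on the seed), the stratified split should yield similar outcome distributions in the two sets:

# compare outcome distributions in the train and test sets
summary(wm.train$perc_wm_stem)
summary(wm.test$perc_wm_stem)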
We then establish a preprocessing recipe for the tree regression models. This includes dummy encoding of the factor covariates, removal of near-zero-variance covariates, and normalization (centering and scaling) of the numerical covariates.
# pre-model processing using recipes
wm_recipe <- recipe(perc_wm_stem ~ ., data = wm.train) %>%
  update_role(contains("id"), new_role = "id vars") %>%
  step_dummy(c(sector, region, location, size)) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>%
  step_normalize(all_numeric()) # center and scale numerical vars

# prep the recipe and call juice to extract the transformed dataset
wm_juiced <- juice(prep(wm_recipe))
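A quick look at the juiced data (a sketch; column order and values depend on the training split) confirms that the factor levels are now dummy columns and the numeric variables are scaled:

# inspect the transformed training data
glimpse(wm_juiced)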
# review variable type, role, and source
summary(wm_recipe) %>%
  flextable() %>%
  set_caption('Table 4. Overview of Variable Type, Role, and Source')
| variable | type | role | source |
|---|---|---|---|
| sector | nominal | predictor | original |
| region | nominal | predictor | original |
| location | nominal | predictor | original |
| size | nominal | predictor | original |
| gran_tot | numeric | predictor | original |
| perc_wm_admit | numeric | predictor | original |
| ret_rate | numeric | predictor | original |
| sf_ratio | numeric | predictor | original |
| perc_wm_stem | numeric | outcome | original |
We will build and specify three tree-based models for comparison and evaluation: a decision tree (rpart), a boosted tree (XGBoost), and a random forest (ranger).
Each model includes hyperparameters that will be selected using tuning grids and cross-validation.
The hyperparameters include: tree_depth, min_n, and cost_complexity for the decision tree; tree_depth, min_n, and learn_rate for the boosted tree; and mtry and min_n for the random forest.
The number of trees grown for the boosted and random forest models is fixed at 1,000.
# specify decision tree
set.seed(3)

wm.dec <- decision_tree(
  tree_depth = tune(), # create tunable hyperparameters
  min_n = tune(),
  cost_complexity = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# set up tuning grid for the decision tree (levels default is 3; we use 4)
dec.grid <- grid_regular(tree_depth(), min_n(), cost_complexity(), levels = 4)
# specify boosted tree
set.seed(4)

wm.boost <- boost_tree(
  mtry = NULL, # NULL defaults to all predictors (as in bagging)
  trees = 1000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = NULL,
  sample_size = NULL,
  stop_iter = NULL
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# set up tuning grid for the boosted tree (levels default is 3; we use 4)
boost.grid <- grid_regular(tree_depth(), min_n(), learn_rate(), levels = 4)
# specify random forest
set.seed(5)

wm.rf <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger", importance = 'impurity') %>%
  set_mode("regression")
Using tidymodels, the preprocessing recipe and model specification are combined into a single workflow for each model.
# initiate a workflow for decision tree
wm.dec.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.dec)

# workflow for boosted tree
wm.boost.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.boost)

# workflow for random forest
wm.rf.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.rf)
# select metrics for model evaluation
wm.metrics <- metric_set(rmse, rsq, mae, mape)
The models are then tuned over the cross-validation folds, and performance metrics (rmse, rsq, mae, mape) are collected at the same time. Model results are saved to a local directory as an .Rdata file so they can be reloaded without refitting.
doParallel::registerDoParallel()

set.seed(6)

# fit decision tree across the tuning grid
wm.dec.fit <- wm.dec.wf %>%
  tune_grid(
    resamples = cv_folds,
    grid = dec.grid,
    metrics = wm.metrics
  )

# fit boosted tree across the tuning grid
set.seed(7)

wm.boost.fit <- wm.boost.wf %>%
  tune_grid(
    resamples = cv_folds,
    grid = boost.grid,
    metrics = wm.metrics
  )

# fit random forest across the tuning grid
set.seed(8)

wm.rf.fit <- tune_grid(
  wm.rf.wf,
  resamples = cv_folds,
  grid = 20 # 20 automatically generated candidate parameter sets
)

save(wm.dec.fit, wm.boost.fit, wm.rf.fit, file = "gender_trees.Rdata")

# run to load model results without rerunning models
# load("gender_trees.Rdata")
Parameter tuning results for the basic tree model, XGBoost, and random forest are shown in Figures 3-5, respectively.
The optimized hyperparameter settings for each model are selected on RMSE and can be printed as shown after the tuning plots.
# Decision Tree
# collect_metrics(wm.dec.fit)
autoplot(wm.dec.fit) + theme_light() +
  labs(title = 'Figure 3. Hyperparameter Tuning for Basic Tree Regression Model')

# show_best(wm.dec.fit)
best.dec <- select_best(wm.dec.fit, 'rmse')

# Boosted Tree
# collect_metrics(wm.boost.fit)
autoplot(wm.boost.fit) + theme_light() +
  labs(title = 'Figure 4. Hyperparameter Tuning for XGBoost Model')

# show_best(wm.boost.fit)
best.boost <- select_best(wm.boost.fit, 'rmse')

# Random Forest
# collect_metrics(wm.rf.fit)
autoplot(wm.rf.fit) + theme_light() +
  labs(title = 'Figure 5. Hyperparameter Tuning for Random Forest Model')

# show_best(wm.rf.fit)
best.rf <- select_best(wm.rf.fit, 'rmse')
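The selected settings can be inspected directly (a sketch; the actual values depend on the tuning run):

# print the hyperparameters selected for each model
best.dec
best.boost
best.rf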
Variable importance scores (a measure of variable influence on the outcome) are shown for each of the models in Figures 6-8.
Student retention rate (ret_rate) ranks highest for each of the models, followed by total student number (gran_tot). It's also worth noting the greater complexity of the XGBoost and random forest models, which include scores for individual region levels.
# vip for decision tree
vip.dec <- finalize_workflow(wm.dec.wf, best.dec)

vip.dec %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 6. Variable Importance: Decision Tree Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
# vip for XGBoost
vip.boost <- finalize_workflow(wm.boost.wf, best.boost)

vip.boost %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 7. Variable Importance: XGBoost Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
# vip for random forest
vip.rf <- finalize_workflow(wm.rf.wf, best.rf)

vip.rf %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 8. Variable Importance: Random Forest Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
## Final Models and Evaluation
Each model is updated with the selected hyperparameters (based on RMSE).
# Decision Tree
final.dec <- finalize_model(wm.dec, select_best(wm.dec.fit, 'rmse')) # with tuning specs

# Boosted Tree
final.boost <- finalize_model(wm.boost, select_best(wm.boost.fit, 'rmse'))

# Random Forest
final.rf <- finalize_model(wm.rf, select_best(wm.rf.fit, 'rmse'))
Each finalized model is then fit to the full training set and evaluated on the test set.
# last_fit() is a convenience function: it fits on the training set and evaluates on the test set

# Decision Tree
final.fit.dec <- last_fit(final.dec, perc_wm_stem ~ ., split = wm.split)

# XGBoost Tree
final.fit.boost <- last_fit(final.boost, perc_wm_stem ~ ., split = wm.split)

# Random Forest
final.fit.rf <- last_fit(final.rf, perc_wm_stem ~ ., split = wm.split)
The XGBoost model returns the lowest test-set RMSE (10.73) and, as a result, represents the best fit for this dataset. However, the differences in RMSE estimates between the models are slight.
It's worth noting that RMSE is the square root of the mean squared difference between the predicted and observed values of perc_wm_stem in the test set. Because it is on the same scale as the outcome, the typical prediction error is roughly 11 percentage points for all three models (this can be verified by hand, as shown after Table 5).
collect_metrics(final.fit.dec) %>%
  bind_rows(collect_metrics(final.fit.boost)) %>%
  bind_rows(collect_metrics(final.fit.rf)) %>%
  filter(.metric == "rmse") %>%
  mutate(model = c("Tree", "XGBoost", "Random Forest")) %>%
  select(model, .metric, .estimator, .estimate) %>%
  flextable() %>%
  set_caption('Table 5. Goodness of Fit (RMSE) for the Final Models')
| model | .metric | .estimator | .estimate |
|---|---|---|---|
| Tree | rmse | standard | 10.90850 |
| XGBoost | rmse | standard | 10.73072 |
| Random Forest | rmse | standard | 10.84231 |
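As a quick check (a sketch using the collected test-set predictions), the RMSE in Table 5 can be reproduced by hand:

# recompute test-set RMSE for the XGBoost model from its predictions
final.fit.boost %>%
  collect_predictions() %>%
  summarise(rmse = sqrt(mean((perc_wm_stem - .pred)^2)))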
Figures 9-11 plot the relationship between observed and predicted outcomes for each of the models.
# plot decision tree predictions as a scatterplot
final.fit.dec %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 9. Decision Tree Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()
# plot XGBoost predictions as a scatterplot
final.fit.boost %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 10. XGBoost Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()
# plot random forest predictions as a scatterplot
final.fit.rf %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 11. Random Forest Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()