The purpose of this project is to estimate the average percentage of women completing undergraduate degrees in STEM programs at public and private 4-year+ institutions in the United States, using tree-based models and ensembles, based on extrinsic factors. The latter include covariates such as institution size, location, sector, and student:faculty ratio.
Data Source: Integrated Postsecondary Education Database System
The full dataset includes variables for 1,803 post-secondary institutions (public 4-year+ and private nonprofit) and covers the period 2010-2020, excluding 2011-12 and 2016-17; the latter are currently omitted due to data quality concerns.
Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physical Science.
library(tidyverse)
library(magrittr)
library(flextable)
library(tidymodels)
library(dlookr)
library(xgboost)
library(vip) # variable importance plots
library(rpart.plot)
First, we load the dataset.
data.stm <- read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')
The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.
The factor covariates include: sector, region, location, and size.
Numerical covariates include: gran_tot (total students), perc_admit, perc_wm_admit, ret_rate (retention rate), and sf_ratio (student:faculty ratio), along with the response variables perc_wm_stem and perc_wm_comp.
# remove selected covariates for gender model
data.stm %<>%
  select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp))
# factor categorical variables and drop NA
data.stm %<>%
  mutate(across(where(is.character), factor)) %>%
  drop_na()
Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset (see the check after Table 1).
# calculate summary statistics for numeric variables
data.stm %>%
  diagnose_numeric() %>%
  dplyr::select(variables, min, mean, median, max, zero, minus, outlier) %>%
  flextable() %>%
  set_caption("Table 1. Summary Statistics for Numeric Variables")
| variables | min | mean | median | max | zero | minus | outlier |
|---|---|---|---|---|---|---|---|
| gran_tot | 28 | 1,174.52868 | 534 | 9,328 | 0 | 0 | 97 |
| perc_wm_stem | 0 | 58.34871 | 58 | 100 | 1 | 0 | 51 |
| perc_wm_comp | 0 | 19.15861 | 16 | 100 | 27 | 0 | 45 |
| perc_admit | 5 | 67.59505 | 71 | 100 | 0 | 0 | 37 |
| perc_wm_admit | 5 | 68.39708 | 72 | 100 | 0 | 0 | 40 |
| ret_rate | 0 | 76.17660 | 77 | 99 | 2 | 0 | 13 |
| sf_ratio | 3 | 14.34196 | 14 | 67 | 0 | 0 | 8 |
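The single zero-completion institution noted above can be pulled directly (a quick sketch; the identifying columns were removed earlier, so only the remaining covariates are returned):

# identify the one institution with no women STEM graduates
data.stm %>%
  filter(perc_wm_stem == 0)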
# calculate pairwise relationships
data.stm %>%
  filter_if(is.numeric, all_vars((.) != 0)) %>%
  correlate() %>%
  filter(coef_corr > .5 | coef_corr < -.5) %>% # set thresholds to limit output
  flextable() %>%
  set_caption("Table 2. Pairwise Correlation: Numerical Covariates")
| var1 | var2 | coef_corr |
|---|---|---|
| perc_wm_admit | perc_admit | 0.992224 |
| perc_admit | perc_wm_admit | 0.992224 |
Given that perc_wm_admit and perc_admit are almost perfectly correlated (r = 0.99), we drop the latter from our working dataset.
data.stm %<>%
  select(!perc_admit)
A data summary for the factor variables is shown in Table 3. The data are imbalanced across the levels of every factor.
# summary statistics for categorical variables
data.stm %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Table 3. Summary Statistics for Factor Variables")
| variables | levels | N | freq | ratio | rank |
|---|---|---|---|---|---|
| sector | private_nonprofit | 889 | 514 | 57.817773 | 1 |
| sector | public | 889 | 375 | 42.182227 | 2 |
| region | Southeast | 889 | 234 | 26.321710 | 1 |
| region | Great_Lakes | 889 | 146 | 16.422947 | 2 |
| region | Mid_East | 889 | 135 | 15.185602 | 3 |
| region | Plains | 889 | 114 | 12.823397 | 4 |
| region | Southwest | 889 | 81 | 9.111361 | 5 |
| region | Far_West | 889 | 78 | 8.773903 | 6 |
| region | New_England | 889 | 57 | 6.411699 | 7 |
| region | Rocky_Mountains | 889 | 28 | 3.149606 | 8 |
| region | Other_US_Jurisdictions | 889 | 16 | 1.799775 | 9 |
| location | City | 889 | 571 | 64.229471 | 1 |
| location | Town | 889 | 275 | 30.933633 | 2 |
| location | Rural | 889 | 43 | 4.836895 | 3 |
| size | 1,000 - 4,999 | 889 | 434 | 48.818898 | 1 |
| size | 5,000 - 9,999 | 889 | 143 | 16.085489 | 2 |
| size | 10,000 - 19,999 | 889 | 124 | 13.948256 | 3 |
| size | 20,000 + | 889 | 120 | 13.498313 | 4 |
| size | < 1,000 | 889 | 68 | 7.649044 | 5 |
# check for any remaining missingness among factor variables
zero <- map(data.stm, ~ sum(is.na(.))) # count of NA values per column
With the data prepared, we build basic regression tree models for our response variables, perc_wm_stem and perc_wm_comp. These represent the average percentage of STEM graduates and computer science graduates (over the period of record) who are women.
Note that STEM programs are inclusive of computer science.
The numerical variables are normalized (scaled and centered) prior to modeling. Model results indicate the importance of retention rate, location, total number of students, and geographic region as predictors of STEM completion (percent of STEM graduates) by women.
set.seed(005001)
# remove perc_wm_comp
nocomp <- data.stm %>%
  select(!perc_wm_comp)
# scale numerical variables (assign back so the scaling persists)
nocomp %<>%
  mutate_if(is.numeric, scale)
# run model and plot results
wm.stem <- rpart(perc_wm_stem ~ ., nocomp, method = "anova")
rpart.plot(wm.stem, main = 'Figure 1. Percent STEM Graduates: Women')
Completion rates for women in computer science programs are strongly predicted by overall STEM completion rates (women), as well as by the percent of women admitted to the institution, total student number, sector, and region. Retention rate is not an important variable in this case.
set.seed(0033001)

wm.computer <- rpart(perc_wm_comp ~ ., data.stm, method = "anova")
rpart.plot(wm.computer, main = 'Figure 2. Percent Computer Science Graduates: Women')
We will now undertake a more complete model development, selection, and evaluation process for perc_wm_stem. This will involve the use of ensemble methods such as boosting and random forests.
First, we split our data into training and test sets and set up 10-fold cross-validation for hyperparameter tuning. The split is stratified on the outcome so that both sets cover its full range (see the check after this chunk).
set.seed(1)
# create random data split -- data will not include variable for computer programs
wm.split <- initial_split(nocomp, strata = perc_wm_stem, prop = 0.75)

# create train and test sets from split
wm.train <- training(wm.split)
wm.test <- testing(wm.split)

# create resamples for model tuning
set.seed(2)
cv_folds <- vfold_cv(wm.train,
                     v = 10,
                     strata = perc_wm_stem)
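As a quick sanity check (a sketch; the exact quantiles depend on the seed), the stratified split should yield similar outcome distributions in the two sets:

# compare outcome distributions in the train and test sets
summary(wm.train$perc_wm_stem)
summary(wm.test$perc_wm_stem)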
We then establish a preprocessing recipe for the tree regression models. This includes dummy encoding of the factor covariates, removal of near-zero-variance covariates, and normalization (centering and scaling) of the numerical covariates.
# pre-model processing using recipes
wm_recipe <- recipe(perc_wm_stem ~ ., data = wm.train) %>%
  update_role(contains("id"), new_role = "id vars") %>%
  step_dummy(c(sector, region, location, size)) %>%
  step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>%
  step_normalize(all_numeric()) # center and scale numerical vars

# prep the recipe and call juice to extract the transformed dataset
wm_juiced <- juice(prep(wm_recipe))
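A quick look at the juiced data (a sketch; column order and values depend on the training split) confirms that the factor levels are now dummy columns and the numeric variables are scaled:

# inspect the transformed training data
glimpse(wm_juiced)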
# review variable type, role, and source
summary(wm_recipe) %>%
  flextable() %>%
  set_caption('Table 4. Overview of Variable Type, Role, and Source')
| variable | type | role | source |
|---|---|---|---|
| sector | nominal | predictor | original |
| region | nominal | predictor | original |
| location | nominal | predictor | original |
| size | nominal | predictor | original |
| gran_tot | numeric | predictor | original |
| perc_wm_admit | numeric | predictor | original |
| ret_rate | numeric | predictor | original |
| sf_ratio | numeric | predictor | original |
| perc_wm_stem | numeric | outcome | original |
We will build and specify three tree-based models for comparison and evaluation: a decision tree (rpart), a boosted tree (XGBoost), and a random forest (ranger).
Each model includes hyperparameters that will be selected using tuning grids and cross-validation.
The hyperparameters include: tree_depth, min_n, and cost_complexity for the decision tree; tree_depth, min_n, and learn_rate for the boosted tree; and mtry and min_n for the random forest.
The number of trees grown for the boosted and random forest models is fixed at 1,000.
# specify decision tree
set.seed(3)

wm.dec <- decision_tree(
  tree_depth = tune(), # create tunable hyperparameters
  min_n = tune(),
  cost_complexity = tune()
) %>%
  set_engine("rpart") %>%
  set_mode("regression")

# set up tuning grid for the decision tree (levels default is 3; we use 4)
dec.grid <- grid_regular(tree_depth(), min_n(), cost_complexity(), levels = 4)
# specify boosted tree
set.seed(4)

wm.boost <- boost_tree(
  mtry = NULL, # NULL defaults to all predictors (as in bagging)
  trees = 1000,
  min_n = tune(),
  tree_depth = tune(),
  learn_rate = tune(),
  loss_reduction = NULL,
  sample_size = NULL,
  stop_iter = NULL
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# set up tuning grid for the boosted tree (levels default is 3; we use 4)
boost.grid <- grid_regular(tree_depth(), min_n(), learn_rate(), levels = 4)
# specify random forest
set.seed(5)

wm.rf <- rand_forest(
  mtry = tune(),
  trees = 1000,
  min_n = tune()
) %>%
  set_engine("ranger", importance = 'impurity') %>%
  set_mode("regression")
Using tidymodels, the preprocessing recipe and model specification are combined into a single workflow for each model.
# initiate a workflow for decision tree
wm.dec.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.dec)

# workflow for boosted tree
wm.boost.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.boost)

# workflow for random forest
wm.rf.wf <- workflow() %>%
  add_recipe(wm_recipe) %>%
  add_model(wm.rf)
# select metrics for model evaluation
wm.metrics <- metric_set(rmse, rsq, mae, mape)
The models are then tuned over the cross-validation folds, and performance metrics (rmse, rsq, mae, mape) are collected at the same time. Model results are saved to a local directory as an .Rdata file so they can be reloaded without refitting.
doParallel::registerDoParallel()

set.seed(6)

# fit decision tree across the tuning grid
wm.dec.fit <- wm.dec.wf %>%
  tune_grid(
    resamples = cv_folds,
    grid = dec.grid,
    metrics = wm.metrics
  )

# fit boosted tree across the tuning grid
set.seed(7)

wm.boost.fit <- wm.boost.wf %>%
  tune_grid(
    resamples = cv_folds,
    grid = boost.grid,
    metrics = wm.metrics
  )

# fit random forest across the tuning grid
set.seed(8)

wm.rf.fit <- tune_grid(
  wm.rf.wf,
  resamples = cv_folds,
  grid = 20 # 20 automatically generated candidate parameter sets
)

save(wm.dec.fit, wm.boost.fit, wm.rf.fit, file = "gender_trees.Rdata")

# run to load model results without rerunning models
# load("gender_trees.Rdata")
Parameter tuning results for the basic tree model, XGBoost, and random forest are shown in Figures 3-5, respectively.
The optimized hyperparameter settings for each model are selected on RMSE and can be printed as shown after the tuning plots.
# Decision Tree
# collect_metrics(wm.dec.fit)
autoplot(wm.dec.fit) + theme_light() +
  labs(title = 'Figure 3. Hyperparameter Tuning for Basic Tree Regression Model')

# show_best(wm.dec.fit)
best.dec <- select_best(wm.dec.fit, 'rmse')

# Boosted Tree
# collect_metrics(wm.boost.fit)
autoplot(wm.boost.fit) + theme_light() +
  labs(title = 'Figure 4. Hyperparameter Tuning for XGBoost Model')

# show_best(wm.boost.fit)
best.boost <- select_best(wm.boost.fit, 'rmse')

# Random Forest
# collect_metrics(wm.rf.fit)
autoplot(wm.rf.fit) + theme_light() +
  labs(title = 'Figure 5. Hyperparameter Tuning for Random Forest Model')

# show_best(wm.rf.fit)
best.rf <- select_best(wm.rf.fit, 'rmse')
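The selected settings can be inspected directly (a sketch; the actual values depend on the tuning run):

# print the hyperparameters selected for each model
best.dec
best.boost
best.rf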
Variable importance scores (a measure of variable influence on the outcome) are shown for each of the models in Figures 6-8.
Student retention rate (ret_rate) ranks highest for each of the models, followed by total student number (gran_tot). It's also worth noting the greater complexity of the XGBoost and random forest models, which include scores for individual region levels.
# vip for decision tree
vip.dec <- finalize_workflow(wm.dec.wf, best.dec)

vip.dec %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 6. Variable Importance: Decision Tree Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
# vip for XGBoost
vip.boost <- finalize_workflow(wm.boost.wf, best.boost)

vip.boost %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 7. Variable Importance: XGBoost Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
# vip for random forest
vip.rf <- finalize_workflow(wm.rf.wf, best.rf)

vip.rf %>%
  fit(data = wm.train) %>%
  extract_fit_parsnip() %>%
  vip(geom = 'col', aesthetics = list(fill = 'midnightblue', alpha = 0.7)) +
  labs(title = 'Figure 8. Variable Importance: Random Forest Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)') +
  theme_light()
## Final Models and Evaluation
Each model is updated with the selected hyperparameters (based on RMSE).
# Decision Tree
final.dec <- finalize_model(wm.dec, select_best(wm.dec.fit, 'rmse')) # with tuning specs

# Boosted Tree
final.boost <- finalize_model(wm.boost, select_best(wm.boost.fit, 'rmse'))

# Random Forest
final.rf <- finalize_model(wm.rf, select_best(wm.rf.fit, 'rmse'))
Each finalized model is then fit to the full training set and evaluated on the test set.
# last_fit() is a convenience function: it fits on the training set and evaluates on the test set

# Decision Tree
final.fit.dec <- last_fit(final.dec, perc_wm_stem ~ ., split = wm.split)

# XGBoost Tree
final.fit.boost <- last_fit(final.boost, perc_wm_stem ~ ., split = wm.split)

# Random Forest
final.fit.rf <- last_fit(final.rf, perc_wm_stem ~ ., split = wm.split)
The XGBoost model returns the lowest test-set RMSE (10.73) and, as a result, represents the best fit for this dataset. However, the differences in RMSE estimates between the models are slight.
It's worth noting that RMSE is the square root of the mean squared difference between the predicted and observed values of perc_wm_stem in the test set. Because it is on the same scale as the outcome, the typical prediction error is roughly 11 percentage points for all three models (this can be verified by hand, as shown after Table 5).
collect_metrics(final.fit.dec) %>%
  bind_rows(collect_metrics(final.fit.boost)) %>%
  bind_rows(collect_metrics(final.fit.rf)) %>%
  filter(.metric == "rmse") %>%
  mutate(model = c("Tree", "XGBoost", "Random Forest")) %>%
  select(model, .metric, .estimator, .estimate) %>%
  flextable() %>%
  set_caption('Table 5. Goodness of Fit (RMSE) for the Final Models')
| model | .metric | .estimator | .estimate |
|---|---|---|---|
| Tree | rmse | standard | 10.90850 |
| XGBoost | rmse | standard | 10.73072 |
| Random Forest | rmse | standard | 10.84231 |
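As a quick check (a sketch using the collected test-set predictions), the RMSE in Table 5 can be reproduced by hand:

# recompute test-set RMSE for the XGBoost model from its predictions
final.fit.boost %>%
  collect_predictions() %>%
  summarise(rmse = sqrt(mean((perc_wm_stem - .pred)^2)))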
Figures 9-11 plot the relationship between observed and predicted outcomes for each of the models.
# plot decision tree predictions as a scatterplot
final.fit.dec %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 9. Decision Tree Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()
# plot XGBoost predictions as a scatterplot
final.fit.boost %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 10. XGBoost Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()
# plot random forest predictions as a scatterplot
final.fit.rf %>%
  collect_predictions() %>%
  ggplot(aes(perc_wm_stem, .pred)) +
  geom_abline(slope = 1, lty = 2, color = "gray50", alpha = 0.5) +
  geom_point(alpha = 0.6, color = "midnightblue") +
  labs(title = 'Figure 11. Random Forest Model',
       subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020 (n=889)',
       x = 'Observed', y = 'Predicted') +
  theme_light()