library(tidymodels)
library(finetune) # customize grid search for speed
library(vip) # variable importance
library(tidyverse)
library(dlookr)
library(magrittr)
library(GGally) # scatterplot matrix
library(gridExtra) # grid.arrange
library(flextable)This report focuses on the training and evaluation of an SVM regression model to estimate the average percentage of women completing undergraduate degrees in STEM programs for public and private 4-YR+ institutions in the United States based on extrinsic factors. The latter include such covariates as institution size, location, sector, student:faculty ratio, and more.
Data Source: Integrated Postsecondary Education Database System
The full dataset includes variables for 1803 post-secondary institutions (Public 4 Year +, Private non_profit) and comprises the period from 2010-2020, excluding 2011-12 and 2016-17. The latter are currently omitted due to data quality concerns.
Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physcial Science.
Note: see https://rpubs.com/sconnin/885276 for an analysis of this data using tree regression models.
data.stm<-read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.
The factor covariates include:
Numerical covariates include:
# remove selected covariates for gender model
data.stm%<>%
    select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp, perc_wm_comp))
# factor categorical variables and drop NA
data.stm%<>%
    mutate(across(where(is.character), factor))%>%
    drop_na()Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset.
# Calculate statistics summary for numeric variables    
data.stm%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 1. Summary Statistics for Numeric Variables")| variables | min | mean | median | max | zero | minus | outlier | 
| gran_tot | 28 | 1,174.52868 | 534 | 9,328 | 0 | 0 | 97 | 
| perc_wm_stem | 0 | 58.34871 | 58 | 100 | 1 | 0 | 51 | 
| perc_admit | 5 | 67.59505 | 71 | 100 | 0 | 0 | 37 | 
| perc_wm_admit | 5 | 68.39708 | 72 | 100 | 0 | 0 | 40 | 
| ret_rate | 0 | 76.17660 | 77 | 99 | 2 | 0 | 13 | 
| sf_ratio | 3 | 14.34196 | 14 | 67 | 0 | 0 | 8 | 
Pairwise correlations are shown in Table 2. Only two variables are included based on the a threshold of r > 0.5 and < -0.5 for the correlation coefficient.
# Calculate pairwise relationships
data.stm%>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output 
    flextable()%>%
    set_caption("Table 2. Pairwise Correlation: Numerical Covariates")| var1 | var2 | coef_corr | 
| perc_wm_admit | perc_admit | 0.992003 | 
| perc_admit | perc_wm_admit | 0.992003 | 
Given that perc_wm_admit and perc_admit are perfectly correlated, we will drop the latter from our working dataset.
The pairwise correlation of 0.99 for these variables is unexpected and may owe to data labeling or entry errors. A review of the data is warranted but will not be undertaken as part of this exercise.
data.stm%<>%
    select(!perc_admit)We can explore linear and/or nonlinear relationships between our response (perc_wm_stem) and numerical covariates using scatterplots.
From Figure 1., it is difficult to discern any clear patterns in this context.
# scatterplot to compare response and numerical covariates
numerics<-data.stm%>%
  select(where(is.numeric))%>%
  relocate(perc_wm_stem, .before=gran_tot)
#scatterplot function
create_scatterplots <- function(x, y){
  ggplot(data = numerics) +
  aes_string(x = x, y = y) +
  geom_point(color='steelblue') +
  theme_minimal()
}
response <- names(numerics)[1]
numerical <- names(numerics)[2:5]
scatter <- map2(response, numerical, create_scatterplots)
grid.arrange(grobs=scatter, top = 'Figure 1. Perc_WM_STEM vs. Numerical Covariates')A data summary for factor variables is shown in Table 2. The data are imbalanced for all classes and levels.
Similar to the numerical covariates, it is difficult to discern any clear relationship(s) between our response (perc_wm_stem) and categorical variables in the dataset (Figure 2).
# Statistical summary for categorical variables 
data.stm%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Categorical Variables")| variables | levels | N | freq | ratio | rank | 
| sector | private_nonprofit | 889 | 514 | 57.817773 | 1 | 
| sector | public | 889 | 375 | 42.182227 | 2 | 
| region | Southeast | 889 | 234 | 26.321710 | 1 | 
| region | Great_Lakes | 889 | 146 | 16.422947 | 2 | 
| region | Mid_East | 889 | 135 | 15.185602 | 3 | 
| region | Plains | 889 | 114 | 12.823397 | 4 | 
| region | Southwest | 889 | 81 | 9.111361 | 5 | 
| region | Far_West | 889 | 78 | 8.773903 | 6 | 
| region | New_England | 889 | 57 | 6.411699 | 7 | 
| region | Rocky_Mountains | 889 | 28 | 3.149606 | 8 | 
| region | Other_US_Jurisdictions | 889 | 16 | 1.799775 | 9 | 
| location | City | 889 | 571 | 64.229471 | 1 | 
| location | Town | 889 | 275 | 30.933633 | 2 | 
| location | Rural | 889 | 43 | 4.836895 | 3 | 
| size | 1,000 - 4,999 | 889 | 434 | 48.818898 | 1 | 
| size | 5,000 - 9,999 | 889 | 143 | 16.085489 | 2 | 
| size | 10,000 - 19,999 | 889 | 124 | 13.948256 | 3 | 
| size | 20,000 + | 889 | 120 | 13.498313 | 4 | 
| size | < 1,000 | 889 | 68 | 7.649044 | 5 | 
# Check for any remaining missingness among factor variables
zero<-map(data.stm, ~sum(is.factor(is.na(.))))
# graph boxplots to compare response and covariates
# Subset variables
factors<-data.stm%>%
  select_if(is.factor)
response<- data.stm%>%
  select(perc_wm_stem)
boxes<-cbind(response, factors)%>%
  as.data.frame()
#boxplot function
create_boxplots <- function(x, y){
  ggplot(data = boxes) +
  aes_string(x = x, y = y) +
  geom_boxplot() +
  coord_flip()+
  theme(panel.background = element_rect(fill = "white"), axis.text.x = element_text(angle = 45, hjust=1))
}
x_var <- names(boxes)[1]
y_var <- names(boxes)[2:5]
plots <- map2(x_var, y_var, create_boxplots)
grid.arrange(grobs=plots, top = "Figure 2. Perc_WM_Stem vs. Categorical Covariates")For this exercise, an SVM regression model will be constructed to predict the response variable (perc_wm_stem) on the basis of numerical and categorical covariates.
First we split our data into training and test sets. And set up 10-fold cross-validation for hyperparameter tuning.
set.seed(10)
# create random data split -- data will not include variable for computer programs
wm.split <- initial_split(data.stm, strata = perc_wm_stem, prop = 0.75)
# create train and test sets from split
wm.train <- training(wm.split)
wm.test  <- testing(wm.split)
# create resamples for model tuning
set.seed(20)
cv_folds<-
  vfold_cv(wm.train, 
           v=10,
           strata= perc_wm_stem)We then establish a pre-processing recipe for the SVM model. This will include, one-hot encoding for categorical covariates and normalizing (center and scale) the numerical covariates.
# pre-model processing using recipes
wm.recipe <- 
  recipe(perc_wm_stem ~., data=wm.train)%>% 
  step_dummy(c(sector, region, location, size))%>%
  step_normalize(all_numeric()) # scale and center
#prep the recipe and call juice to extract the transformed dataset
    
wm_juiced <-juice(prep(wm.recipe)) # not required for model development
  
# review type, roles, source
  
summary(wm.recipe)%>%
    flextable()%>%
    set_caption('Table 4. Overview of Variable Type, Role, and Source')| variable | type | role | source | 
| sector | nominal | predictor | original | 
| region | nominal | predictor | original | 
| location | nominal | predictor | original | 
| size | nominal | predictor | original | 
| gran_tot | numeric | predictor | original | 
| perc_wm_admit | numeric | predictor | original | 
| ret_rate | numeric | predictor | original | 
| sf_ratio | numeric | predictor | original | 
| perc_wm_stem | numeric | outcome | original | 
We will build and specify a single svm regression model with the following hyperparameters:
# specify SVM model
set.seed(30)
wm.svm.spec<-svm_rbf(
  cost = tune(),
  margin = tune(),
  rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")Using tidymodels, the data recipes and model specifications are combined into a single workflow for the SVM model.
# initiate a workflow for SVM model
wm.svm.wf<- workflow()%>%
    add_recipe(wm.recipe)%>%
    add_model(wm.svm.spec)
# select metrics for model evaluation
wm.metrics <- metric_set(rmse, rsq, mae, mape)The SVM model is tuned using 25 point grid and then fit to the training data. Metrics for model performance are collected at the same time. These include: rmse and rsq.
The cross-validation tuning results are displayed in Figure 4. And the best fit parameters and model (Model09) are displayed in Table 5.
Note: here we use the control_race() function to speed up the tuning process.
# build grids and tune the models - use race via. finetune to speed analyses
race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )
grid.results <-
   wm.svm.wf %>%
   tune_grid(
      seed = 40,
      resamples = cv_folds,
      grid = 25, # setting up grid size of 25
      control = race.ctrl
   )
# plot parameter tuning results
autoplot(grid.results)+labs(title='Figure 4. Hyperparameter Tuning for SVM Regression Model')+theme_light()# display best fit model
collect_metrics(grid.results)%>%
  filter(.config %in% 'Preprocessor1_Model09')%>%
  flextable()%>%
  set_caption('Table 5. Best Fit SVM Model From 10-Fold Cross Validation')| cost | rbf_sigma | margin | .metric | .estimator | mean | n | std_err | .config | 
| 4.848643 | 0.004941744 | 0.1371824 | rmse | standard | 0.9426258 | 10 | 0.05911445 | Preprocessor1_Model09 | 
| 4.848643 | 0.004941744 | 0.1371824 | rsq | standard | 0.1185440 | 10 | 0.03247675 | Preprocessor1_Model09 | 
# extract best fit model for final model development
best.svm<-select_best(grid.results, 'rmse')The final model specification is updated with the selected tuning parameters (based on RMSE) and is then (simultaneously) trained on the training data and evaluated on the test data.
Metrics for model performance on the test data are show in Table 6. And the relationship between observed and predicted test data are shown in Figure 5. It appears that the model tends to over-predict values < 50% and under-predict values > 75%.
# update model to include optimized parameters
wm.svm.fit<-finalize_model(wm.svm.spec, best.svm) #with tuning specs
# train and evaluate final model
wm.svm.final<-last_fit(wm.svm.fit,  perc_wm_stem ~., split = wm.split)
# collect and display model performance metrics
collect_metrics(wm.svm.final) %>%
  mutate(model = 'SVM') %>% 
  select(model, .metric, .estimate) %>% 
  rename(Metric = .metric, Estimate = .estimate)%>% 
  flextable()%>%
  set_caption('Table 6. Goodness of Fit (RMSE) for the Final SVM Model')| model | Metric | Estimate | 
| SVM | rmse | 8.182372 | 
| SVM | rsq | 0.150484 | 
# plot svm predictions vs. observations
wm.svm.final%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 5. SVM Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, n=(889)', x='Observed', y='Predicted')+
    theme_light() # Variable Importance
With the final trained model we can evaluate the model dependent variable importance using permutation. The steps involved in this process include:
The top three variables in order of importance are: retention rate, region, and institution size.
wm.svm.final %>%
  extract_fit_parsnip() %>%
  vip(method = "permute", 
      target = "perc_wm_stem", metric = "rmse",aesthetics=list(fill='midnightblue', alpha=0.7),
      pred_wrapper = kernlab::predict, train = wm.train)+
  labs(title='Figure 6. Variable Importance Scores: SVM Regression Model', , subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)')+
  theme_classic()# for other model algorithms, methods for model specific variable importance may be required
# see: https://stackoverflow.com/questions/61606337/computing-importance-measure-using-vip-package-on-a-parsnip-model/61811440#61811440The final SVM model returned an estimate RMSE of 8.18 (rsq = 0.15). This is less than final RMSE (10.73) produced by an XGBoost model from the same dataset (see: https://rpubs.com/sconnin/885276), indicating that the SVM out-performed the latter.
The SVM model accounts for ~ 15% of the variance in the data and points to the influence of extrinsic factors on STEM completion rates (undergraduate) by women in public and private 4+ institutions.
It’s interesting to note differences in variable importance compared between the SVM (reported here) and XGboost (reported previously) models. For example, the top three variables in order of importance for the latter were: retention rate, gran_tot, and location (town).
These differences reflect the extent to which each model relied upon a given predictor to build accurate predictions.