library(tidymodels)
library(finetune) # customize grid search for speed
library(vip) # variable importance
library(tidyverse)
library(dlookr)
library(magrittr)
library(GGally) # scatterplot matrix
library(gridExtra) # grid.arrange
library(flextable)

1 Background

This report focuses on the training and evaluation of an SVM regression model to estimate the average percentage of women completing undergraduate degrees in STEM programs at public and private 4-year+ institutions in the United States, based on extrinsic factors. These factors include covariates such as institution size, location, sector, and student:faculty ratio.

Data Source: Integrated Postsecondary Education Data System (IPEDS)

The full dataset includes variables for 1803 post-secondary institutions (public 4-year+ and private non-profit) and covers the period 2010-2020, excluding 2011-12 and 2016-17; these two years are omitted due to data quality concerns.

Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physical Science.

Note: see https://rpubs.com/sconnin/885276 for an analysis of this data using tree regression models.

data.stm<-read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')

2 Data Processing

The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.

The factor covariates include:

  • sector: academic sector
  • region: geographical region
  • location: type of location (urbanicity)
  • size: number of students in institution

Numerical covariates include:

  • gran_tot: total number of completions (institution)
  • perc_wm_stem: percent of STEM graduates who are women
  • perc_wm_comp: percent of computer science graduates who are women
  • perc_wm_admit: percent of women admitted to the institution
  • ret_rate: full-year retention rate
  • sf_ratio: student faculty ratio
# remove selected covariates for gender model

data.stm%<>%
    select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp, perc_wm_comp))

# factor categorical variables and drop NA

data.stm%<>%
    mutate(across(where(is.character), factor))%>%
    drop_na()

2.1 Summary Statistics: Numerical

Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset.

# Calculate statistics summary for numeric variables    

data.stm%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 1. Summary Statistics for Numeric Variables")

2.2 Pairwise Correlations

Pairwise correlations are shown in Table 2. Only two variables are included, based on a threshold of r > 0.5 or r < -0.5 for the correlation coefficient.

# Calculate pairwise relationships

data.stm%>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output 
    flextable()%>%
    set_caption("Table 2. Pairwise Correlation: Numerical Covariates")

Given that perc_wm_admit and perc_admit are almost perfectly correlated, we will drop the latter from our working dataset.

The pairwise correlation of 0.99 for these variables is unexpected and may be due to data labeling or entry errors. A review of the data is warranted but will not be undertaken as part of this exercise.

data.stm%<>%
    select(!perc_admit)

We can explore linear and/or nonlinear relationships between our response (perc_wm_stem) and numerical covariates using scatterplots.

From Figure 1, it is difficult to discern any clear patterns in this context.

# scatterplot to compare response and numerical covariates

numerics<-data.stm%>%
  select(where(is.numeric))%>%
  relocate(perc_wm_stem, .before=gran_tot)

#scatterplot function

create_scatterplots <- function(x, y){
  ggplot(data = numerics) +
  aes_string(x = x, y = y) +
  geom_point(color='steelblue') +
  theme_minimal()
}


response <- names(numerics)[1]
numerical <- names(numerics)[2:5]

scatter <- map2(response, numerical, create_scatterplots)

grid.arrange(grobs=scatter, top = 'Figure 1. Perc_WM_STEM vs. Numerical Covariates')

2.3 Summary Statistics: Categorical Covariates

A data summary for factor variables is shown in Table 3. The data are imbalanced across all classes and levels.

Similar to the numerical covariates, it is difficult to discern any clear relationship(s) between our response (perc_wm_stem) and categorical variables in the dataset (Figure 2).

# Statistical summary for categorical variables 

data.stm%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Categorical Variables")
# Check for any remaining missingness among factor variables

zero <- map(select(data.stm, where(is.factor)), ~ sum(is.na(.x)))

# graph boxplots to compare response and covariates

# Subset variables

factors<-data.stm%>%
  select_if(is.factor)

response<- data.stm%>%
  select(perc_wm_stem)

boxes<-cbind(response, factors)%>%
  as.data.frame()


#boxplot function

create_boxplots <- function(x, y){
  ggplot(data = boxes) +
  aes_string(x = x, y = y) +
  geom_boxplot() +
  coord_flip()+
  theme(panel.background = element_rect(fill = "white"), axis.text.x = element_text(angle = 45, hjust=1))
}


x_var <- names(boxes)[1]
y_var <- names(boxes)[2:5]

plots <- map2(x_var, y_var, create_boxplots)

grid.arrange(grobs=plots, top = "Figure 2. Perc_WM_Stem vs. Categorical Covariates")

3 SVM Regression Model

For this exercise, an SVM regression model will be constructed to predict the response variable (perc_wm_stem) on the basis of numerical and categorical covariates.

First, we split our data into training and test sets and set up 10-fold cross-validation for hyperparameter tuning.

set.seed(10)

# create random data split -- data will not include variable for computer programs

wm.split <- initial_split(data.stm, strata = perc_wm_stem, prop = 0.75)

# create train and test sets from split

wm.train <- training(wm.split)

wm.test  <- testing(wm.split)

# create resamples for model tuning

set.seed(20)

cv_folds<-
  vfold_cv(wm.train, 
           v=10,
           strata= perc_wm_stem)

3.1 Model Recipe

We then establish a pre-processing recipe for the SVM model. This includes dummy encoding of the categorical covariates and normalizing (centering and scaling) the numeric variables (see the note following this code chunk).

# pre-model processing using recipes

wm.recipe <- 
  recipe(perc_wm_stem ~., data=wm.train)%>% 
  step_dummy(c(sector, region, location, size))%>%
  step_normalize(all_numeric()) # scale and center

#prep the recipe and call juice to extract the transformed dataset
    
wm_juiced <-juice(prep(wm.recipe)) # not required for model development
  
# review type, roles, source
  
summary(wm.recipe)%>%
    flextable()%>%
    set_caption('Table 4. Overview of Variable Type, Role, and Source')
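
Note that step_dummy() creates n-1 indicator columns per factor by default rather than a full one-hot encoding, and step_normalize(all_numeric()) also centers and scales the outcome and the dummy columns. If full one-hot encoding with normalization restricted to the numeric predictors were preferred, the recipe could be written as in the sketch below (an alternative shown for reference, not the recipe used in this analysis).

# alternative recipe sketch: true one-hot encoding, with normalization
# limited to the numeric predictors (not used in the analysis above)

wm.recipe.onehot <- 
  recipe(perc_wm_stem ~., data=wm.train)%>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)%>%
  step_normalize(all_numeric_predictors())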

3.2 Model Specification

We will build and specify a single SVM regression model with the following tunable hyperparameters (their default tuning ranges can be inspected with dials, as sketched after this list):

  1. cost = a positive number for the cost of predicting a sample within or on the wrong side of the margin.
  2. rbf_sigma = a positive number for the sigma parameter of the radial basis function kernel.
  3. margin = a positive number for the epsilon in the SVM insensitive loss function.
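
As a quick reference, the default tuning ranges that the dials package assigns to these hyperparameters can be printed directly; this is a minimal sketch for inspection only and is not part of the modeling pipeline.

# print the default dials parameter objects and their tuning ranges
# (tune_grid() below draws its 25-point grid from these ranges)

dials::cost()       # cost, searched on a log2 scale
dials::rbf_sigma()  # kernel sigma, searched on a log10 scale
dials::svm_margin() # epsilon margin for the insensitive loss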
# specify SVM model

set.seed(30)

wm.svm.spec<-svm_rbf(
  cost = tune(),
  margin = tune(),
  rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

3.3 Model Workflow and Metrics

Using tidymodels, the data recipes and model specifications are combined into a single workflow for the SVM model.

# initiate a workflow for SVM model

wm.svm.wf<- workflow()%>%
    add_recipe(wm.recipe)%>%
    add_model(wm.svm.spec)

# select metrics for model evaluation
# (these apply only if passed as metrics = wm.metrics to tune_grid()/last_fit();
# otherwise the regression defaults, rmse and rsq, are collected)

wm.metrics <- metric_set(rmse, rsq, mae, mape)

3.4 Model Tuning & Training

The SVM model is tuned over a 25-point grid and then fit to the training data. Metrics for model performance are collected at the same time; these are rmse and rsq, the regression defaults for tune_grid().

The cross-validation tuning results are displayed in Figure 4, and the best-fit parameters and model (Model09) are displayed in Table 5.

Note: the control object is created with control_race() from the finetune package; racing methods can substantially speed up tuning (a sketch of a full racing run follows the tuning code below).

# build grids and tune the models - use race via. finetune to speed analyses

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )

set.seed(40) # tune_grid() has no seed argument, so set the seed beforehand

grid.results <-
   wm.svm.wf %>%
   tune_grid(
      resamples = cv_folds,
      grid = 25, # grid size of 25 candidate parameter sets
      control = race.ctrl
   )
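
For reference, a full racing run would replace tune_grid() with tune_race_anova() from finetune, which drops poorly performing parameter combinations early in the resampling process. The sketch below shows what that call could look like with the same workflow and folds; the hypothetical race.results object is not used in the remainder of this analysis.

# hypothetical racing alternative to the grid search above

set.seed(40)

race.results <-
   wm.svm.wf %>%
   tune_race_anova(
      resamples = cv_folds,
      grid = 25,
      control = race.ctrl
   )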

# plot parameter tuning results

autoplot(grid.results)+labs(title='Figure 4. Hyperparameter Tuning for SVM Regression Model')+theme_light()

# display best fit model

collect_metrics(grid.results)%>%
  filter(.config %in% 'Preprocessor1_Model09')%>%
  flextable()%>%
  set_caption('Table 5. Best Fit SVM Model From 10-Fold Cross Validation')
# extract best fit model for final model development

best.svm <- select_best(grid.results, metric = 'rmse')

3.5 Model Evaluation

The final model specification is updated with the selected tuning parameters (based on RMSE) and is then trained on the training data and evaluated on the test data in a single step with last_fit().

Metrics for model performance on the test data are shown in Table 6, and the relationship between observed and predicted test values is shown in Figure 5. It appears that the model tends to over-predict values < 50% and under-predict values > 75%.

# update model to include optimized parameters

wm.svm.fit<-finalize_model(wm.svm.spec, best.svm) #with tuning specs


# train and evaluate the final model in a single step
# (note: the formula interface is used here, so the recipe pre-processing is not applied)

wm.svm.final <- last_fit(wm.svm.fit, perc_wm_stem ~ ., split = wm.split)

# collect and display model performance metrics

collect_metrics(wm.svm.final) %>%
  mutate(model = 'SVM') %>% 
  select(model, .metric, .estimate) %>% 
  rename(Metric = .metric, Estimate = .estimate)%>% 
  flextable()%>%
  set_caption('Table 6. Goodness of Fit (RMSE) for the Final SVM Model')
# plot svm predictions vs. observations

wm.svm.final%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 5. SVM Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)', x='Observed', y='Predicted')+
    theme_light()

3.6 Variable Importance

With the final trained model, we can evaluate variable importance using permutation. The steps involved in this process include:

  1. extracting the model object
  2. specifying the response variable (perc_wm_stem)
  3. specifying the prediction function
  4. specifying the data used for model training

The top three variables in order of importance are: retention rate, region, and institution size.

wm.svm.final %>%
  extract_fit_parsnip() %>%
  vip(method = "permute", 
      target = "perc_wm_stem", metric = "rmse", aesthetics = list(fill = 'midnightblue', alpha = 0.7),
      pred_wrapper = kernlab::predict, train = wm.train) +
  labs(title = 'Figure 6. Variable Importance Scores: SVM Regression Model', subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)') +
  theme_classic()

# for other model algorithms, methods for model specific variable importance may be required

# see: https://stackoverflow.com/questions/61606337/computing-importance-measure-using-vip-package-on-a-parsnip-model/61811440#61811440
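
Following the general approach in the linked Stack Overflow answer, a prediction wrapper built on parsnip's predict() method could be supplied in place of the engine-level function. The svm_pred() helper below is a hypothetical sketch, not part of the original analysis.

# hypothetical prediction wrapper for permutation importance:
# takes a parsnip model fit and returns a numeric vector of predictions

svm_pred <- function(object, newdata) {
  predict(object, new_data = newdata)$.pred
}

# this could then be passed to vip() via pred_wrapper = svm_pred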

4 Discussion

The final SVM model returned an estimated RMSE of 8.18 (rsq = 0.15). This is lower than the final RMSE (10.73) produced by an XGBoost model built from the same dataset (see: https://rpubs.com/sconnin/885276), indicating that the SVM out-performed the latter.

The SVM model accounts for ~15% of the variance in the data and points to the influence of extrinsic factors on undergraduate STEM completion rates for women at public and private 4-year+ institutions.

It is interesting to note the differences in variable importance between the SVM (reported here) and XGBoost (reported previously) models. For example, the top three variables in order of importance for the latter were: retention rate, gran_tot, and location (town).

These differences reflect the extent to which each model relied upon a given predictor to build accurate predictions.