library(tidymodels)
library(finetune) # customize grid search for speed
library(vip) # variable importance
library(tidyverse)
library(dlookr)
library(magrittr)
library(GGally) # scatterplot matrix
library(gridExtra) # grid.arrange
library(flextable)

1 Background

This report focuses on the training and evaluation of an SVM regression model to estimate the average percentage of women completing undergraduate degrees in STEM programs at public and private 4-year+ institutions in the United States, based on extrinsic factors. These factors include covariates such as institution size, location, sector, and student:faculty ratio.

Data Source: Integrated Postsecondary Education Data System (IPEDS)

The full dataset includes variables for 1803 post-secondary institutions (public 4-year+ and private non-profit) and covers the period 2010-2020, excluding 2011-12 and 2016-17; these two years are omitted due to data quality concerns.

Included are the following disciplines: Computer Science, Biological Science, Engineering, Mathematics, and Physical Science.

Note: see https://rpubs.com/sconnin/885276 for an analysis of this data using tree regression models.

data.stm<-read_csv('https://raw.githubusercontent.com/sconnin/STEM_Completion_Study/main/gender_tree_data.csv')

2 Data Processing

The initial processing steps include feature selection, setting factor types, and removal of missing values from factor covariates.

The factor covariates include:

  • sector: academic sector
  • region: geographical region
  • location: type of location (urbanicity)
  • size: number of students in institution

Numerical covariates include:

  • gran_tot: total number of completions (institution)
  • perc_wm_stem: percent of STEM graduates who are women
  • perc_wm_comp: percent of computer science graduates who are women
  • perc_wm_admit: percent of women admitted to the institution
  • ret_rate: full-year retention rate
  • sf_ratio: student faculty ratio
# remove selected covariates for gender model

data.stm%<>%
    select(!c(id, name, state, tot_men, tot_women, perc_wm_enroll, perc_mn_admit, perc_mn_stem, perc_mn_comp, perc_wm_comp))

# factor categorical variables and drop NA

data.stm%<>%
    mutate(across(where(is.character), factor))%>%
    drop_na()

2.1 Summary Statistics: Numerical

Summary statistics for numerical variables are shown in Table 1. Women STEM graduates are absent from only one institution in this dataset.

# Calculate statistics summary for numeric variables    

data.stm%>%
    diagnose_numeric()%>%
    dplyr::select(variables, min, mean, median, max, zero, minus, outlier)%>%
    flextable()%>%
    set_caption("Table 1. Summary Statistics for Numeric Variables")

2.2 Pairwise Correlations

Pairwise correlations are shown in Table 2. Only two variables are included, based on a threshold of r > 0.5 or r < -0.5 for the correlation coefficient.

# Calculate pairwise relationships

data.stm%>%
    filter_if(is.numeric, all_vars((.) != 0))%>%
    correlate()%>%
    filter(coef_corr > .5 | coef_corr < -.5)%>% # set thresholds to limit output 
    flextable()%>%
    set_caption("Table 2. Pairwise Correlation: Numerical Covariates")

Given that perc_wm_admit and perc_admit are almost perfectly correlated, we will drop the latter from our working dataset.

The pairwise correlation of 0.99 for these variables is unexpected and may be due to data labeling or entry errors. A review of the data is warranted but will not be undertaken as part of this exercise.

data.stm%<>%
    select(!perc_admit)

We can explore linear and/or nonlinear relationships between our response (perc_wm_stem) and numerical covariates using scatterplots.

From Figure 1, it is difficult to discern any clear patterns in this context.

# scatterplot to compare response and numerical covariates

numerics<-data.stm%>%
  select(where(is.numeric))%>%
  relocate(perc_wm_stem, .before=gran_tot)

#scatterplot function

create_scatterplots <- function(x, y){
  ggplot(data = numerics) +
  aes_string(x = x, y = y) +
  geom_point(color='steelblue') +
  theme_minimal()
}


response <- names(numerics)[1]
numerical <- names(numerics)[2:5]

scatter <- map2(response, numerical, create_scatterplots)

grid.arrange(grobs=scatter, top = 'Figure 1. Perc_WM_STEM vs. Numerical Covariates')

2.3 Summary Statistics: Categorical Covariates

A data summary for factor variables is shown in Table 3. The data are imbalanced across all classes and levels.

Similar to the numerical covariates, it is difficult to discern any clear relationship(s) between our response (perc_wm_stem) and categorical variables in the dataset (Figure 2).

# Statistical summary for categorical variables 

data.stm%>%
    diagnose_category()%>%
    flextable()%>%
    set_caption("Table 3. Summary Statistics for Categorical Variables")
# Check for any remaining missingness among factor variables

zero <- map(select(data.stm, where(is.factor)), ~ sum(is.na(.x)))

# graph boxplots to compare response and covariates

# Subset variables

factors<-data.stm%>%
  select_if(is.factor)

response<- data.stm%>%
  select(perc_wm_stem)

boxes<-cbind(response, factors)%>%
  as.data.frame()


#boxplot function

create_boxplots <- function(x, y){
  ggplot(data = boxes) +
  aes_string(x = x, y = y) +
  geom_boxplot() +
  coord_flip()+
  theme(panel.background = element_rect(fill = "white"), axis.text.x = element_text(angle = 45, hjust=1))
}


x_var <- names(boxes)[1]
y_var <- names(boxes)[2:5]

plots <- map2(x_var, y_var, create_boxplots)

grid.arrange(grobs=plots, top = "Figure 2. Perc_WM_Stem vs. Categorical Covariates")

3 SVM Regression Model

For this exercise, an SVM regression model will be constructed to predict the response variable (perc_wm_stem) on the basis of numerical and categorical covariates.

First, we split our data into training and test sets and set up 10-fold cross-validation for hyperparameter tuning.

set.seed(10)

# create random data split -- data will not include variable for computer programs

wm.split <- initial_split(data.stm, strata = perc_wm_stem, prop = 0.75)

# create train and test sets from split

wm.train <- training(wm.split)

wm.test  <- testing(wm.split)

# create resamples for model tuning

set.seed(20)

cv_folds<-
  vfold_cv(wm.train, 
           v=10,
           strata= perc_wm_stem)

3.1 Model Recipe

We then establish a pre-processing recipe for the SVM model. This includes dummy encoding of the categorical covariates and normalizing (centering and scaling) the numeric variables (see the note following this code chunk).

# pre-model processing using recipes

wm.recipe <- 
  recipe(perc_wm_stem ~., data=wm.train)%>% 
  step_dummy(c(sector, region, location, size))%>%
  step_normalize(all_numeric()) # scale and center

#prep the recipe and call juice to extract the transformed dataset
    
wm_juiced <-juice(prep(wm.recipe)) # not required for model development
  
# review type, roles, source
  
summary(wm.recipe)%>%
    flextable()%>%
    set_caption('Table 4. Overview of Variable Type, Role, and Source')
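
Note that step_dummy() creates n-1 indicator columns per factor by default rather than a full one-hot encoding, and step_normalize(all_numeric()) also centers and scales the outcome and the dummy columns. If full one-hot encoding with normalization restricted to the numeric predictors were preferred, the recipe could be written as in the sketch below (an alternative shown for reference, not the recipe used in this analysis).

# alternative recipe sketch: true one-hot encoding, with normalization
# limited to the numeric predictors (not used in the analysis above)

wm.recipe.onehot <- 
  recipe(perc_wm_stem ~., data=wm.train)%>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE)%>%
  step_normalize(all_numeric_predictors())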

3.2 Model Specification

We will build and specify a single SVM regression model with the following tunable hyperparameters (their default tuning ranges can be inspected with dials, as sketched after this list):

  1. cost = a positive number for the cost of predicting a sample within or on the wrong side of the margin.
  2. rbf_sigma = a positive number for the sigma parameter of the radial basis function kernel.
  3. margin = a positive number for the epsilon in the SVM insensitive loss function.
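
As a quick reference, the default tuning ranges that the dials package assigns to these hyperparameters can be printed directly; this is a minimal sketch for inspection only and is not part of the modeling pipeline.

# print the default dials parameter objects and their tuning ranges
# (tune_grid() below draws its 25-point grid from these ranges)

dials::cost()       # cost, searched on a log2 scale
dials::rbf_sigma()  # kernel sigma, searched on a log10 scale
dials::svm_margin() # epsilon margin for the insensitive loss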
# specify SVM model

set.seed(30)

wm.svm.spec<-svm_rbf(
  cost = tune(),
  margin = tune(),
  rbf_sigma = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("regression")

3.3 Model Workflow and Metrics

Using tidymodels, the data recipes and model specifications are combined into a single workflow for the SVM model.

# initiate a workflow for SVM model

wm.svm.wf<- workflow()%>%
    add_recipe(wm.recipe)%>%
    add_model(wm.svm.spec)

# select metrics for model evaluation
# (these apply only if passed as metrics = wm.metrics to tune_grid()/last_fit();
# otherwise the regression defaults, rmse and rsq, are collected)

wm.metrics <- metric_set(rmse, rsq, mae, mape)

3.4 Model Tuning & Training

The SVM model is tuned over a 25-point grid and then fit to the training data. Metrics for model performance are collected at the same time; these are rmse and rsq, the regression defaults for tune_grid().

The cross-validation tuning results are displayed in Figure 4, and the best-fit parameters and model (Model09) are displayed in Table 5.

Note: the control object is created with control_race() from the finetune package; racing methods can substantially speed up tuning (a sketch of a full racing run follows the tuning code below).

# build grids and tune the models - use race via. finetune to speed analyses

race.ctrl <-
   control_race(
      save_pred = TRUE,
      save_workflow = TRUE
   )

set.seed(40) # tune_grid() has no seed argument, so set the seed beforehand

grid.results <-
   wm.svm.wf %>%
   tune_grid(
      resamples = cv_folds,
      grid = 25, # grid size of 25 candidate parameter sets
      control = race.ctrl
   )
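
For reference, a full racing run would replace tune_grid() with tune_race_anova() from finetune, which drops poorly performing parameter combinations early in the resampling process. The sketch below shows what that call could look like with the same workflow and folds; the hypothetical race.results object is not used in the remainder of this analysis.

# hypothetical racing alternative to the grid search above

set.seed(40)

race.results <-
   wm.svm.wf %>%
   tune_race_anova(
      resamples = cv_folds,
      grid = 25,
      control = race.ctrl
   )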

# plot parameter tuning results

autoplot(grid.results)+labs(title='Figure 4. Hyperparameter Tuning for SVM Regression Model')+theme_light()

# display best fit model

collect_metrics(grid.results)%>%
  filter(.config %in% 'Preprocessor1_Model09')%>%
  flextable()%>%
  set_caption('Table 5. Best Fit SVM Model From 10-Fold Cross Validation')
# extract best fit model for final model development

best.svm <- select_best(grid.results, metric = 'rmse')

3.5 Model Evaluation

The final model specification is updated with the selected tuning parameters (based on RMSE) and is then trained on the training data and evaluated on the test data in a single step with last_fit().

Metrics for model performance on the test data are shown in Table 6, and the relationship between observed and predicted test values is shown in Figure 5. It appears that the model tends to over-predict values < 50% and under-predict values > 75%.

# update model to include optimized parameters

wm.svm.fit<-finalize_model(wm.svm.spec, best.svm) #with tuning specs


# train and evaluate the final model in a single step
# (note: the formula interface is used here, so the recipe pre-processing is not applied)

wm.svm.final <- last_fit(wm.svm.fit, perc_wm_stem ~ ., split = wm.split)

# collect and display model performance metrics

collect_metrics(wm.svm.final) %>%
  mutate(model = 'SVM') %>% 
  select(model, .metric, .estimate) %>% 
  rename(Metric = .metric, Estimate = .estimate)%>% 
  flextable()%>%
  set_caption('Table 6. Goodness of Fit (RMSE) for the Final SVM Model')
# plot svm predictions vs. observations

wm.svm.final%>%
    collect_predictions() %>%
    ggplot(aes(perc_wm_stem, .pred)) +
    geom_point(alpha = 0.6, color = "midnightblue") +
    labs(title='Figure 5. SVM Model', subtitle='Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)', x='Observed', y='Predicted')+
    theme_light()

3.6 Variable Importance

With the final trained model, we can evaluate variable importance using permutation. The steps involved in this process include:

  1. extracting the model object
  2. specifying the response variable (perc_wm_stem)
  3. specifying the prediction function
  4. specifying the data used for model training

The top three variables in order of importance are: retention rate, region, and institution size.

wm.svm.final %>%
  extract_fit_parsnip() %>%
  vip(method = "permute", 
      target = "perc_wm_stem", metric = "rmse", aesthetics = list(fill = 'midnightblue', alpha = 0.7),
      pred_wrapper = kernlab::predict, train = wm.train) +
  labs(title = 'Figure 6. Variable Importance Scores: SVM Regression Model', subtitle = 'Percent Completion: Women in Undergraduate\nSTEM Programs: 2010-2020, (n=889)') +
  theme_classic()

# for other model algorithms, methods for model specific variable importance may be required

# see: https://stackoverflow.com/questions/61606337/computing-importance-measure-using-vip-package-on-a-parsnip-model/61811440#61811440
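
Following the general approach in the linked Stack Overflow answer, a prediction wrapper built on parsnip's predict() method could be supplied in place of the engine-level function. The svm_pred() helper below is a hypothetical sketch, not part of the original analysis.

# hypothetical prediction wrapper for permutation importance:
# takes a parsnip model fit and returns a numeric vector of predictions

svm_pred <- function(object, newdata) {
  predict(object, new_data = newdata)$.pred
}

# this could then be passed to vip() via pred_wrapper = svm_pred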

4 Discussion

The final SVM model returned an estimated RMSE of 8.18 (rsq = 0.15). This is lower than the final RMSE (10.73) produced by an XGBoost model built from the same dataset (see: https://rpubs.com/sconnin/885276), indicating that the SVM out-performed the latter.

The SVM model accounts for ~15% of the variance in the data and points to the influence of extrinsic factors on undergraduate STEM completion rates for women at public and private 4-year+ institutions.

It is interesting to note the differences in variable importance between the SVM (reported here) and XGBoost (reported previously) models. For example, the top three variables in order of importance for the latter were: retention rate, gran_tot, and location (town).

These differences reflect the extent to which each model relied upon a given predictor to build accurate predictions.