Introduction

In this case study, we use predictive modeling techniques to predict research experience on graduate school applications from the other factors contained in the application.

The data set contains several parameters that are considered important during the application for Master's programs. There are 400 observations and 9 variables. The data set is from Kaggle (https://www.kaggle.com/datasets/mohansacharya/graduate-admissions); how the data were collected has not been specified. The data were also uploaded to GitHub (https://raw.githubusercontent.com/JackRoss10089/STA-321/main/Admission_Predict.csv). When applying to graduate school, many important factors are considered. For this multiple logistic regression, we focus on the relationship between research experience and other relevant predictor variables, which are selected during the exploratory data analysis.

The parameters included are:

Research Question

The objective of this case study is to build a logistic regression model to predict research experience using various factors associated with graduate school applications. Research experience is a key indicator for many graduate programs. Examining which factors are associated with students who obtain research experience can provide valuable insight for creating future research opportunities that better prepare students for graduate school.

Exploratory Data Analysis

To begin the analysis, we first evaluate the variables in the data set and choose which ones can be used to build the model.

After creating the correlation matrix plot, we observe that all predictor variables are unimodal. The variable Chance.of.Admit is slightly skewed to the left. By investigating this variable more closely, we can assess how to properly discretize this variable.

By creating new groups within Chance.of.Admit, we transform it into a categorical variable with three categories: 0.34-0.63 (LOW), 0.64-0.80 (MEDIUM), and 0.81-0.97 (HIGH). We also convert the variable from decimals to whole numbers by multiplying it by 100.
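The grouping described above can be sketched as follows. This is a minimal illustration in Python, not the code used in the case study; the function name `admit_category` and the sample values are assumptions made for the example, while the cut points come from the three ranges listed above.

```python
def admit_category(chance):
    """Map a Chance.of.Admit value on the 0-1 scale to LOW, MEDIUM, or HIGH."""
    # Convert the decimal to a whole number first, e.g. 0.64 -> 64.
    pct = round(chance * 100)
    if pct <= 63:
        return "LOW"
    if pct <= 80:
        return "MEDIUM"
    return "HIGH"


# Hypothetical values sitting on the boundaries of the three groups.
sample = [0.34, 0.63, 0.64, 0.80, 0.81, 0.97]
categories = [admit_category(c) for c in sample]
print(categories)
```

Rounding after multiplying by 100 avoids floating-point surprises at the group boundaries.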

Next, we standardize the numeric predictor variables in the data set. Since this is a predictive model, we are not concerned with the interpretation of the coefficients; the objective is to identify the model with the best predictive performance. We also convert other variables to categorical or numeric types, as needed, to simplify the interpretation of the model.
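Standardizing a numeric predictor means subtracting its mean and dividing by its standard deviation. The sketch below shows the idea in Python; the helper `standardize` and the score values are made up for illustration, and the use of the sample standard deviation (n - 1 denominator) is an assumption about the convention.

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation, n - 1 denominator
    return [(v - m) / s for v in values]


# Made-up scores, not taken from the data set.
gre_scores = [300, 310, 320, 330, 340]
z = standardize(gre_scores)
print([round(v, 3) for v in z])
```

After standardization the values have mean 0 and standard deviation 1, so predictors measured on different scales become comparable.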

Data Split - Training and Testing Data

We randomly split the data into two subsets. 80% of the data is used as training data: we use it to search the candidate models, validate them, and identify the final model using cross-validation. The remaining 20%, the hold-out sample, is used to assess the performance of the final model.
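A random 80/20 split can be sketched by shuffling the row indices and cutting the shuffled list. The function name `train_test_split_indices` and the seed value are assumptions for this example; only the 400 rows and the 80% fraction come from the text.

```python
import random


def train_test_split_indices(n_rows, train_fraction=0.8, seed=123):
    """Shuffle row indices and split them into training and testing sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return indices[:n_train], indices[n_train:]


train_idx, test_idx = train_test_split_indices(400)
print(len(train_idx), len(test_idx))  # 320 80
```

Fixing the seed is one way to keep the reported metrics stable between runs.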

Candidate Models

For convenience, we use 0.5 as the common cut-off probability for all three models to define the predicted outcome.
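Applying a cut-off turns each predicted probability into a predicted label. The sketch below assumes that a probability strictly greater than the cut-off is labeled 1 (research experience); the treatment of ties and the sample probabilities are assumptions, not details taken from the case study.

```python
def classify(probabilities, cutoff=0.5):
    """Label an observation 1 when its predicted probability exceeds the cut-off."""
    return [1 if p > cutoff else 0 for p in probabilities]


# Made-up predicted probabilities.
predicted = classify([0.12, 0.49, 0.51, 0.93])
print(predicted)  # [0, 0, 1, 1]
```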

Since our training data is relatively small, we will use 5-fold cross-validation to ensure the validation data set has enough graduate school applications.
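The cross-validation procedure can be sketched as follows: split the training rows into five folds, hold out each fold in turn, and average the misclassification rates. In this sketch the `majority_class` predictor is a hypothetical stand-in for fitting a logistic regression model, and the labels are made up; only the 5-fold scheme mirrors the text.

```python
import random


def k_fold_indices(n_rows, k=5, seed=123):
    """Shuffle row indices and deal them into k folds of near-equal size."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cv_prediction_error(labels, fit_and_predict, k=5, seed=123):
    """Average the misclassification rate over k held-out folds."""
    errors = []
    for fold in k_fold_indices(len(labels), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(labels)) if i not in held_out]
        predicted = fit_and_predict(train, fold)
        wrong = sum(1 for i, p in zip(fold, predicted) if p != labels[i])
        errors.append(wrong / len(fold))
    return sum(errors) / k


# Made-up labels for 320 training rows (70% ones).
labels = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1] * 32


def majority_class(train, validation):
    """Placeholder model: predict the most common training label."""
    ones = sum(labels[i] for i in train)
    guess = 1 if ones * 2 >= len(train) else 0
    return [guess] * len(validation)


pe = cv_prediction_error(labels, majority_class)
print(round(pe, 6))
```

With 320 training rows, each validation fold holds 64 applications, so every model is evaluated on all training rows exactly once.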

Average of prediction errors of candidate models
PE1       PE2       PE3
0.259375  0.259375  0.228125

The average prediction errors show that candidate models 1 and 2 have the same prediction error. Since model 3 has a smaller prediction error than the other two, we choose model 3 as the final predictive model.

Final Model

The actual accuracy of the final model on the testing data is given below.

The actual accuracy of the final model
0.8125

Therefore, the final model has an accuracy of 0.8125 (81.25%) on the testing data.
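Accuracy is the share of hold-out observations whose predicted label matches the observed label. The sketch below uses made-up labels. The final comment is an inference rather than a reported figure: if the hold-out sample has 80 rows (20% of 400), an accuracy of 0.8125 corresponds to 65 correct predictions.

```python
def accuracy(observed, predicted):
    """Fraction of observations whose predicted label equals the observed label."""
    correct = sum(1 for y, yhat in zip(observed, predicted) if y == yhat)
    return correct / len(observed)


# Made-up labels: three of four predictions are correct.
acc = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(acc)  # 0.75

# Inferred, not reported: 65 correct out of 80 hold-out rows gives 0.8125.
print(65 / 80)  # 0.8125
```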

Because we use a random split to define the training and testing data, the performance metrics will differ slightly each time the code is re-run. Next, we try an ROC approach to select a final model for the case study.

ROC Approach

We first estimate the TPR (true positive rate, sensitivity) and FPR (false positive rate, 1 - specificity) at each cut-off probability for each of the three candidate models.
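The TPR and FPR at each cut-off can be computed by classifying at that cut-off and counting true and false positives. The sketch below uses made-up labels, probabilities, and a coarse grid of cut-offs; the function name `roc_points` is an assumption for the example.

```python
def roc_points(labels, probabilities, cutoffs):
    """Return (cutoff, TPR, FPR) for each cut-off probability."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for c in cutoffs:
        predicted = [1 if p > c else 0 for p in probabilities]
        tp = sum(1 for y, yhat in zip(labels, predicted) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(labels, predicted) if y == 0 and yhat == 1)
        points.append((c, tp / positives, fp / negatives))
    return points


# Made-up observed labels and predicted probabilities.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]
points = roc_points(labels, probs, [0.0, 0.25, 0.5, 0.75, 1.0])
for cutoff, tpr, fpr in points:
    print(cutoff, round(tpr, 3), round(fpr, 3))
```

Plotting TPR against FPR across the cut-offs traces the ROC curve, running from (1, 1) at the lowest cut-off to (0, 0) at the highest.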

The ROC curves of the three candidate models are given below.

The ROC curves show a linear trend, which suggests that the fitted models do not separate applicants with research experience from those without at a rate better than pure randomness. Based on the cross-validation approach, we still choose model 3 as the final working model.
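One common way to quantify how close an ROC curve is to the diagonal is the area under the curve (AUC), which equals 0.5 for a random classifier and 1 for a perfect one. The case study does not report an AUC; the sketch below is only an illustration on made-up (FPR, TPR) points, using the trapezoidal rule.

```python
def auc_trapezoid(points):
    """Approximate the area under an ROC curve from (FPR, TPR) points."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # area of one trapezoid
    return area


# A curve lying on the diagonal, as produced by random guessing.
diagonal = [(0.0, 0.0), (0.25, 0.25), (0.5, 0.5), (0.75, 0.75), (1.0, 1.0)]
auc_diag = auc_trapezoid(diagonal)
print(auc_diag)  # 0.5

# A perfect classifier reaches TPR = 1 at FPR = 0.
auc_perfect = auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
print(auc_perfect)  # 1.0
```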

Summary and Conclusion

The case study focused on predicting research experience on graduate school applications. For illustrative purposes, we used three models as candidates and used both cross-validation and ROC curves to select the final working model. The ROC approach did not yield a meaningful interpretation for the response variable, but we can still conclude from the cross-validation approach that model 3 is the best of the three candidates for predicting the presence of research experience on a graduate school application within this case study.