STA 321 HW 5

Introduction

This case study practices model-building techniques using logistic regression.

The data set contains several parameters that are considered important in applications to master's programs. There are 400 observations and 9 variables. The data come from Kaggle (https://www.kaggle.com/datasets/mohansacharya/graduate-admissions); how they were collected is not specified. A copy was also uploaded to GitHub (https://raw.githubusercontent.com/JackRoss10089/STA-321/main/Admission_Predict.csv). Many factors are considered when applying to graduate school. For this multiple logistic regression, we focus on the relationship between research experience and other relevant predictor variables, which are selected during the exploratory data analysis.

The parameters included are:

GRE Scores ( out of 340 )
TOEFL Scores ( out of 120 )
University Rating ( out of 5 )
Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
Undergraduate GPA ( out of 10 )
Research Experience ( either 0 or 1 )
Chance of Admit ( ranging from 0 to 1 )

Practical Question

The practical question concerns the relationship between a graduate school applicant’s research experience and other relevant predictors. Research experience is a key indicator for many graduate programs. Examining which factors are associated with students who obtain research experience can offer valuable insight for creating future research opportunities that better prepare students for graduate school.

Exploratory Data Analysis

To begin analysis, first it is necessary to evaluate the variables in the data set and choose which variables can be used to build the model.

After creating the correlation matrix plot, we observe that all predictor variables are unimodal. The variable Chance.of.Admit is slightly skewed to the left. By investigating this variable more closely, we can assess how to properly discretize this variable.

By creating groups within Chance.of.Admit, the variable can be transformed into a categorical variable with three categories: 0.34-0.63 (Low), 0.64-0.80 (Medium), and 0.81-0.97 (High). We also convert the variable from decimals to whole numbers by multiplying it by 100.
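The grouping described above can be sketched in Python as follows. This is only an illustration, not the code used for the analysis: the function name `admit_group` is hypothetical, and it assumes the cut points are inclusive exactly as listed, with any value at or below 63 percent treated as Low.

```python
def admit_group(chance_of_admit):
    """Map a Chance.of.Admit value in [0, 1] to Low / Medium / High.

    Assumption: cut points follow the report (<= 63 Low, 64-80 Medium,
    >= 81 High) after converting the decimal to a whole-number percentage.
    """
    pct = round(chance_of_admit * 100)  # e.g. 0.57 -> 57, avoiding float noise
    if pct <= 63:
        return "Low"
    if pct <= 80:
        return "Medium"
    return "High"


# Boundary values taken from the category ranges in the text.
examples = [0.34, 0.63, 0.64, 0.80, 0.81, 0.97]
groups = [admit_group(x) for x in examples]
print(list(zip(examples, groups)))
```

Rounding before comparing keeps values such as 0.57, which is stored as 56.99999... after multiplication, in the intended group.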

With the Chance.of.Admit variable grouped, we are ready to fit a multiple logistic regression model with research experience as the response.

Model Building and Variable Selection

Summary of inferential statistics of the full model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-31.4310531	5.6765881	-5.5369621	0.0000000
Chance.of.Admit01Medium	0.5032709	0.3622835	1.3891631	0.1647832
Chance.of.Admit01High	2.0183256	0.6330967	3.1880210	0.0014325
GRE.Score	0.1155451	0.0236846	4.8784997	0.0000011
TOEFL.Score	-0.0467645	0.0437278	-1.0694473	0.2848681
University.Rating	-0.0410313	0.2002620	-0.2048882	0.8376595
SOP	0.3172054	0.2206878	1.4373493	0.1506188
LOR	0.0382019	0.2226186	0.1716023	0.8637502
CGPA	-0.1932977	0.4927942	-0.3922482	0.6948748
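As a reading aid for the table above, the z value in each row is the estimate divided by its standard error, and the p-value is the two-sided tail area of a standard normal distribution. The sketch below assumes that standard Wald-test construction; the function name `wald_test` is illustrative, and the inputs are copied from the GRE.Score and Chance.of.Admit01High rows.

```python
import math


def wald_test(estimate, std_error):
    """Return the Wald z statistic and its two-sided normal p-value."""
    z = estimate / std_error
    # P(|Z| > |z|) for a standard normal equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Rows copied from the full-model table.
z_gre, p_gre = wald_test(0.1155451, 0.0236846)
z_high, p_high = wald_test(2.0183256, 0.6330967)

print(f"GRE.Score: z = {z_gre:.4f}, p = {p_gre:.7f}")
print(f"Chance.of.Admit01High: z = {z_high:.4f}, p = {p_high:.7f}")
```

Small differences from the table in the last decimal places are expected because the tabulated estimates and standard errors are already rounded.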

After fitting the full model to the admissions data, we perform manual variable selection to reduce it. We simply select the variables with the lowest p-values in the full model. The categorical variable Chance.of.Admit is also omitted to reduce the model further.

Summary of inferential statistics of the reduced model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-40.9351415	4.9041146	-8.3471014	0.0000000
GRE.Score	0.1333916	0.0216198	6.1698889	0.0000000
TOEFL.Score	-0.0238659	0.0401430	-0.5945218	0.5521632
SOP	0.4478046	0.1638406	2.7331718	0.0062728

Next, we perform automatic variable selection to produce a second reduced model.

Summary of inferential statistics of the final model
	Estimate	Std. Error	z value	Pr(>|z|)
(Intercept)	-31.2691274	5.4936354	-5.691883	0.0000000
GRE.Score	0.1124020	0.0225153	4.992250	0.0000006
TOEFL.Score	-0.0532816	0.0414156	-1.286511	0.1982649
SOP	0.2929219	0.1732845	1.690410	0.0909495
Chance.of.Admit01Medium	0.4744637	0.3436798	1.380540	0.1674205
Chance.of.Admit01High	1.9331886	0.5800682	3.332692	0.0008601

Next, we compute global goodness-of-fit measures to gauge which model best describes the relationship between research experience and the other relevant predictors.

Comparison of global goodness-of-fit statistics
	Deviance.residual	Null.Deviance.Residual	AIC
full.model	371.7508	550.9023	389.7508
reduced.model	385.2231	550.9023	393.2231
forward.reduced.model	371.9707	550.9023	383.9707
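The AIC column above can be checked by hand. For a logistic regression fitted to ungrouped binary responses, the residual deviance equals minus twice the log-likelihood, so AIC is the residual deviance plus twice the number of estimated coefficients. The sketch below assumes that relationship and counts the coefficients from the three summary tables (9, 4, and 6, intercept included); the helper name `aic_from_deviance` is illustrative.

```python
def aic_from_deviance(residual_deviance, n_params):
    """AIC for a binary logistic regression: deviance + 2 * (number of coefficients)."""
    return residual_deviance + 2 * n_params


# (residual deviance, number of coefficients incl. intercept) from the tables.
models = {
    "full.model": (371.7508, 9),
    "reduced.model": (385.2231, 4),
    "forward.reduced.model": (371.9707, 6),
}

aic = {name: aic_from_deviance(dev, k) for name, (dev, k) in models.items()}
best = min(aic, key=aic.get)  # smaller AIC is preferred

for name, value in aic.items():
    print(f"{name}: AIC = {value:.4f}")
print("Lowest AIC:", best)
```

The penalty term explains why the model from automatic selection is preferred: its deviance is nearly the same as the full model's, but it uses three fewer coefficients.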

Final Model

From the full model we selected GRE.Score, TOEFL.Score, and SOP for a reduced model. After automatic variable selection, those same variables were retained along with the Chance.of.Admit variable. The model from automatic variable selection has the lowest AIC of the three (383.97), so it is the final model used for the conclusions of this analysis.
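To make the final model concrete, the sketch below plugs its estimated coefficients into the logistic function to get a predicted probability of research experience. The coefficients are copied from the final-model table; the function name `predicted_probability` and the example applicant (GRE 320, TOEFL 110, SOP 4.0) are hypothetical and chosen only for illustration.

```python
import math

# Coefficient estimates copied from the final-model table.
INTERCEPT = -31.2691274
B_GRE = 0.1124020
B_TOEFL = -0.0532816
B_SOP = 0.2929219
B_GROUP = {"Low": 0.0, "Medium": 0.4744637, "High": 1.9331886}  # Low is baseline


def predicted_probability(gre, toefl, sop, admit_group):
    """Predicted probability of research experience under the final model."""
    log_odds = (INTERCEPT + B_GRE * gre + B_TOEFL * toefl
                + B_SOP * sop + B_GROUP[admit_group])
    return 1.0 / (1.0 + math.exp(-log_odds))  # inverse logit


# Hypothetical applicant, evaluated in each Chance.of.Admit group.
probs = {g: predicted_probability(320, 110, 4.0, g) for g in ("Low", "Medium", "High")}
for group, p in probs.items():
    print(f"{group}: {p:.3f}")
```

Holding the three scores fixed, the predicted probability rises from the Low to the High group, which mirrors the positive group coefficients in the table.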

Next, odds ratios are computed by exponentiating the coefficient estimates so that the final model can be interpreted.

Summary Stats with Odds Ratios
	Estimate	Std. Error	z value	Pr(>|z|)	odds.ratio
(Intercept)	-31.2691274	5.4936354	-5.691883	0.0000000	0.000000
GRE.Score	0.1124020	0.0225153	4.992250	0.0000006	1.118963
TOEFL.Score	-0.0532816	0.0414156	-1.286511	0.1982649	0.948113
SOP	0.2929219	0.1732845	1.690410	0.0909495	1.340338
Chance.of.Admit01Medium	0.4744637	0.3436798	1.380540	0.1674205	1.607152
Chance.of.Admit01High	1.9331886	0.5800682	3.332692	0.0008601	6.911513

The odds ratios in the table above give the most practical interpretation of this model. For the Chance.of.Admit variable, the baseline category is “Low”. The odds ratio for the “Medium” level is about 1.61, suggesting that, for applicants with the same GRE score, TOEFL score, and Statement of Purpose score, the odds of having research experience in the “Medium” group are 1.61 times the odds in the “Low” group, although this difference is not statistically significant (p ≈ 0.17). The corresponding ratio for the “High” group relative to the “Low” group is almost 7 and is statistically significant (p < 0.001).
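The odds.ratio column can be reproduced by exponentiating each estimate, as in the short sketch below. The coefficients are copied from the table; the 10-point GRE comparison at the end is an added illustration rather than a result reported in the analysis.

```python
import math

# Coefficient estimates copied from the odds-ratio table (intercept omitted).
coefs = {
    "GRE.Score": 0.1124020,
    "TOEFL.Score": -0.0532816,
    "SOP": 0.2929219,
    "Chance.of.Admit01Medium": 0.4744637,
    "Chance.of.Admit01High": 1.9331886,
}

# Odds ratio for a one-unit increase (or for a group versus the Low baseline).
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: {ratio:.6f}")

# Illustration: multiplicative change in odds for a 10-point higher GRE score.
gre_10_point_ratio = math.exp(10 * coefs["GRE.Score"])
print(f"10-point GRE increase: {gre_10_point_ratio:.3f}")
```

Because odds ratios multiply, a per-point ratio that looks small, such as the one for GRE.Score, compounds to a much larger ratio over a realistic score difference.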

Summary and Conclusion

This case study focused on an association analysis between graduate school application criteria and the presence of research experience on a graduate school application. The initial data set has 9 variables, including one “fake” variable for the serial number.

After exploratory data analysis, we grouped the Chance.of.Admit variable in three categories. We then defined a new variable to make the grouped variable a factor with three levels. This new variable was used in the process to search for the final model.

GRE scores, TOEFL scores, and the quality of the Statement of Purpose letter were selected for the final model as they were perceived as major contributors to the presence of research experience on the application. We also included the Chance.of.Admit variable in the final model as it helped to quantify the acceptance rate of students who have research experience. The final model can be insightful in helping to assess the paths to providing more research opportunities and allow students to further engage in their education by pursuing post-bachelor programs.

A practical drawback of this model was the lack of a binary acceptance variable in the data set. If the data set were to be reevaluated with this variable present, it could change the scope of the research question and allow for a different practical question about the data.