This report presents a study of admission data for prospective students, aiming to predict the likelihood of their acceptance into a university. The dataset, obtained from [Kaggle](https://www.kaggle.com/datasets/mohansacharya/graduate-admissions), contains various attributes, including GRE and TOEFL scores, Statement of Purpose (SOP) and Letter of Recommendation (LOR) strength ratings, Cumulative Grade Point Average (CGPA), and the chance of admission.
The analysis begins with exploratory data analysis (EDA), where we
investigate the relationships between the variables and their
distributions. We also visualize the influence of factors such as
University Rating and Research on the
admission process. The main goal is to identify patterns and insights
that may help improve the prediction models.
For modeling, two different approaches are explored: Support Vector Machine (SVM) and Gradient Boosting using XGBoost. SVM is employed as a powerful algorithm that works well for both linear and non-linear relationships, while XGBoost, a popular boosting technique, is utilized to handle complex interactions and improve predictive accuracy.
To evaluate the performance of the models, we use Receiver Operating Characteristic Area Under the Curve (ROC AUC), a metric that assesses the ability of the models to distinguish between accepted and rejected applicants.
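As a quick illustration of the metric, the ROC AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. A minimal sketch with made-up scores (not the study data):

library(MLmetrics)
# Hypothetical predicted probabilities and true labels, for illustration only
scores <- c(0.9, 0.8, 0.6, 0.4, 0.2)
labels <- c(1, 1, 0, 1, 0)
# 5 of the 6 positive/negative pairs are ranked correctly, so AUC = 5/6 ~ 0.833
AUC(y_pred = scores, y_true = labels)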
Let’s proceed with a detailed analysis and interpretation of the results obtained from the SVM and XGBoost models.
The methodology followed in this study involves several key steps to
predict the likelihood of admission for prospective students. The
initial phase revolves around data pre-processing, where the admission
dataset is loaded and its structure is examined. To facilitate analysis,
the University Rating column is transformed into a factor.
Additionally, numeric attributes, such as GRE Score,
TOEFL Score, SOP, LOR, and
CGPA, are scaled to ensure their comparability.
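In R, this standardization is handled by scale(), which converts each column to z-scores. A minimal illustration with hypothetical GRE scores:

# Illustration only: scale() computes (x - mean(x)) / sd(x) per column
x <- c(300, 310, 320, 330, 340)
as.numeric(scale(x))
## [1] -1.2649111 -0.6324555  0.0000000  0.6324555  1.2649111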
The subsequent step involves exploratory data analysis (EDA) to gain
insights into the dataset. The distribution of the “Chance of Admission”
variable is visualized to understand its spread and potential patterns.
Scatter plots are generated to explore the relationships between the
numeric variables (GRE Score, TOEFL Score, SOP, LOR, CGPA)
and the Chance of Admission. Furthermore, a correlation
plot is constructed to examine the interrelationships between the
numeric attributes, identifying potential correlations that may
influence the admission outcome.
To evaluate the performance of predictive models, the dataset is
split into a training set (90%) and a test set (10%) using the
caret package. The training set will be utilized to train
the models, while the test set will serve for model evaluation. Two
different approaches are explored for modeling: Support Vector Machine
(SVM) and Gradient Boosting using XGBoost. SVM, known for its ability to
handle both linear and non-linear relationships, is employed to create
the initial model. On the other hand, XGBoost, a powerful boosting
technique, is utilized to capture complex interactions and enhance
predictive accuracy.
The Exploratory Data Analysis (EDA) section of this study focuses on gaining a comprehensive understanding of the admission dataset and identifying potential patterns and relationships among its variables. The primary goal is to explore the data visually and statistically, providing valuable insights to guide the subsequent modeling process.
knitr::opts_chunk$set(echo = TRUE)
# Load the required packages
library(tidyverse)  # data wrangling (dplyr, %>%)
library(e1071)      # svm()
library(xgboost)    # gradient boosting
library(MLmetrics)  # AUC()
library(ggplot2)    # also attached by tidyverse; kept for clarity
library(corrplot)   # correlation plot
# Load the dataset
admission <- read.csv("Admission_Predict_Ver1.1.csv")
# Encode University Rating as a factor and standardize the numeric predictors
admission_data <- admission %>%
  mutate(University_Rating = as.factor(University.Rating))
numeric_cols <- c("GRE.Score", "TOEFL.Score", "SOP", "LOR", "CGPA")
admission_data[numeric_cols] <- scale(admission_data[numeric_cols])
# Exploratory Data Analysis
# Convert University Rating and Research columns to factors
admission$University.Rating <- as.factor(admission$University.Rating)
admission$Research <- as.factor(admission$Research)
The summary statistics provide valuable insights into the distribution and characteristics of the admission dataset. Let’s delve deeper into each attribute’s summary statistics:
# Summary statistics
summary(admission_data)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :-2.34366 Min. :-2.49792 Min. :1.000
## 1st Qu.:125.8 1st Qu.:-0.75006 1st Qu.:-0.68926 1st Qu.:2.000
## Median :250.5 Median : 0.04675 Median :-0.03157 Median :3.000
## Mean :250.5 Mean : 0.00000 Mean : 0.00000 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.: 0.75501 3rd Qu.: 0.79055 3rd Qu.:4.000
## Max. :500.0 Max. : 2.08302 Max. : 2.10593 Max. :5.000
## SOP LOR CGPA Research
## Min. :-2.3956 Min. :-2.68410 Min. :-2.93717 Min. :0.00
## 1st Qu.:-0.8819 1st Qu.:-0.52299 1st Qu.:-0.74228 1st Qu.:0.00
## Median : 0.1271 Median : 0.01729 Median :-0.02718 Median :1.00
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000 Mean :0.56
## 3rd Qu.: 0.6317 3rd Qu.: 0.55757 3rd Qu.: 0.76645 3rd Qu.:1.00
## Max. : 1.6408 Max. : 1.63812 Max. : 2.22145 Max. :1.00
## Chance.of.Admit University_Rating
## Min. :0.3400 1: 34
## 1st Qu.:0.6300 2:126
## Median :0.7200 3:162
## Mean :0.7217 4:105
## 3rd Qu.:0.8200 5: 73
## Max. :0.9700
The GRE Score and TOEFL Score are standardized variables, centered at a mean of 0 with a standard deviation of 1. The minimum and maximum values give the range of standardized scores in the dataset. The first quartile (1st Qu.) marks the 25th percentile of scores, while the third quartile (3rd Qu.) marks the 75th percentile; the interquartile range (IQR) measures the spread between them. The median is the 50th percentile, the midpoint of the distribution. The mean is 0 by design of the standardization.
University Rating is represented as a factor with values ranging from 1 to 5, indicating the rating of the universities applied to by the students. Research is a binary variable, represented as 0 or 1, indicating whether a student has research experience (1) or not (0). The summary statistics provide the count of students belonging to each University Rating category and the proportion of students with or without research experience (Research).
Similar to GRE Score and TOEFL Score, the SOP, LOR, and CGPA variables are standardized, with a mean of 0 and a standard deviation of 1. The quartile values (1st Qu., Median, 3rd Qu.) provide insights into the spread of scores across these attributes.
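For example, the interquartile range of the standardized CGPA scores follows directly from the quartiles reported above:

# IQR = 3rd Qu. - 1st Qu. = 0.76645 - (-0.74228), about 1.509
IQR(admission_data$CGPA)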
The Chance of Admission is the target variable we seek
to predict. The summary statistics include the minimum and maximum
values, which represent the range of admission probabilities in the
dataset. The first quartile (1st Qu.), median, and third quartile (3rd
Qu.) provide insights into the spread of admission probabilities and the
central tendency of the data. The mean, approximately 0.7217, represents
the average chance of admission in the dataset.
Overall, the summary statistics offer a comprehensive overview of the dataset’s numeric attributes and their distributions. By understanding the spread and central tendency of each attribute, we gain crucial insights that help inform our subsequent modeling and decision-making processes. The standardized scores allow for comparisons and analysis of the relative importance of different attributes in predicting the likelihood of admission.
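As a quick sanity check (a sketch, not part of the original run), the scaled columns should come back with mean approximately 0 and standard deviation 1:

# Verify the standardization of the scaled predictors
round(colMeans(admission_data[numeric_cols]), 3)
round(apply(admission_data[numeric_cols], 2, sd), 3)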
To explore the relationship between University Rating and Research experience, we employ a bar plot. The visualization shows the distribution of students across the University Ratings, with side-by-side bars giving the counts of students with and without research experience within each rating.
# Visualize University Rating and Research
ggplot(admission, aes(x = University.Rating, fill = Research)) +
  geom_bar(position = "dodge") +
  labs(title = "Research vs. University Rating",
       x = "University Rating",
       y = "Count",
       fill = "Research")
A comprehensive understanding of the distribution of the Chance of Admission variable is crucial for modeling. We first visualize its distribution, then plot it against individual independent variables.
# Visualize Chance of Admission
ggplot(admission, aes(x = Chance.of.Admit)) +
  geom_histogram(binwidth = 0.02, fill = "darkblue", color = "black") +
  labs(title = "Distribution of Chance of Admission",
       x = "Chance of Admission",
       y = "Count")
# Visualize Chance of Admission vs. GRE Score with smoothing
ggplot(admission, aes(x = GRE.Score, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") +
  labs(title = "Chance of Admission vs. GRE Score",
       x = "GRE Score",
       y = "Chance of Admission")
# Visualize Chance of Admission vs. CGPA with smoothing
ggplot(admission, aes(x = CGPA, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red") +
  labs(title = "Chance of Admission vs. CGPA",
       x = "CGPA",
       y = "Chance of Admission")
We then examine the pairwise relationships among the numeric attributes (GRE Score, TOEFL Score, SOP, LOR, CGPA) and the Chance of Admission using a scatter-plot matrix. Together with the loess-smoothed plots above, these views facilitate the identification of potential linear and non-linear associations between the attributes and admission chances.
# Scatter plots for numeric variables
scatter_data <- admission %>% select(GRE.Score, TOEFL.Score, SOP, LOR, CGPA, Chance.of.Admit)
pairs(scatter_data)
To assess the interdependency between numeric attributes, we performed correlation analysis as highly correlated features might introduce multicollinearity in the predictive models.
# Correlation plot
correlation_matrix <- cor(scatter_data)
corrplot(correlation_matrix, type = "upper", order = "hclust", tl.col = "black")
By executing this exploratory data analysis, we gained a solid understanding of the dataset’s characteristics, unveiled relationships between variables, and obtained valuable insights. These insights guide us in selecting appropriate features for modeling, addressing data quality concerns, and building accurate and reliable prediction models for university admission likelihood.
# Split into a training set (90%) and a test set (10%)
set.seed(1234)
inTrain <- caret::createDataPartition(admission_data$Chance.of.Admit, p = 0.9, list = FALSE)
train_data <- admission_data[inTrain, ]
test_data <- admission_data[-inTrain, ]
# Build the support vector machine model; the numeric response means svm()
# fits a support vector regression, and `~ .` uses all remaining columns
# (including Serial.No.) as predictors
adm_model_svm <- svm(Chance.of.Admit ~ ., data = train_data, probability = TRUE)
adm_test_pred_svm <- predict(adm_model_svm, test_data)
# Convert the response variable to binary class labels for AUC calculation
threshold <- 0.5
train_data$Chance.of.Admit_binary <- ifelse(train_data$Chance.of.Admit >= threshold, 1, 0)
test_data$Chance.of.Admit_binary <- ifelse(test_data$Chance.of.Admit >= threshold, 1, 0)
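Before training, it is worth checking the class balance this threshold induces; a quick check (not part of the original output):

# Count of not-admitted (0) vs. admitted (1) labels in the training set
table(train_data$Chance.of.Admit_binary)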
# Train the XGBoost model with a DMatrix built from the five numeric predictors
xgb_train_matrix <- xgboost::xgb.DMatrix(data = as.matrix(train_data[, numeric_cols]),
                                         label = train_data$Chance.of.Admit_binary)
params <- list(booster = "gbtree", objective = "binary:logistic", eval_metric = "logloss")
adm_model_xgb <- xgboost::xgboost(data = xgb_train_matrix, params = params, nrounds = 100)
## [1] train-logloss:0.474831
## [2] train-logloss:0.349991
## [3] train-logloss:0.270528
## ... (iterations 4-98 omitted for brevity)
## [99] train-logloss:0.020555
## [100] train-logloss:0.020459
# Make predictions on the test set for XGBoost
xgb_test_matrix <- xgboost::xgb.DMatrix(data = as.matrix(test_data[, numeric_cols]))
adm_test_pred_xgb <- predict(adm_model_xgb, xgb_test_matrix)
# Evaluate models using ROC AUC on the test set
roc_auc_svm <- MLmetrics::AUC(y_pred = adm_test_pred_svm, y_true = test_data$Chance.of.Admit_binary)
roc_auc_xgb <- MLmetrics::AUC(y_pred = adm_test_pred_xgb, y_true = test_data$Chance.of.Admit_binary)
print(paste("ROC AUC (Support Vector Machine): ", roc_auc_svm))
## [1] "ROC AUC (Support Vector Machine): 0.911627906976744"
print(paste("ROC AUC (Gradient Boosting): ", roc_auc_xgb))
## [1] "ROC AUC (Gradient Boosting): 0.776744186046512"
We present a comprehensive evaluation of the two predictive models, Support Vector Machine (SVM) and Gradient Boosting using XGBoost, for the task of predicting students’ chances of admission to universities. The evaluation process includes data partitioning, model building, prediction, and performance assessment using Receiver Operating Characteristic Area Under the Curve (ROC AUC).
Data Partitioning: To ensure an unbiased evaluation, we split the admission dataset into two subsets: a training set (90% of the data) and a testing set (10% of the data). The createDataPartition function from the caret package is utilized to randomly partition the data, ensuring that both sets are representative of the overall dataset.
Support Vector Machine (SVM) Model: We employ the Support Vector Machine algorithm, available in the e1071 package, to build the first predictive model. All available attributes, including GRE Score, TOEFL Score, Statement of Purpose (SOP), Letter of Recommendation (LOR), CGPA, University Rating, and Research, are used as predictors of the “Chance of Admission” target variable. Because the target is numeric, svm() fits a support vector regression rather than a two-class classifier; the resulting continuous predictions are later scored against binarized admission labels.
Conversion to Binary Class Labels: To assess the models’ performance using the ROC AUC metric, we convert the “Chance of Admission” target variable to binary class labels. A threshold of 0.5 is applied to categorize instances into two classes, 1 (admitted) and 0 (not admitted), based on whether their chance of admission is at least the threshold. Because the observed chances range from 0.34 to 0.97 with a first quartile of 0.63, most instances fall into class 1, so the resulting labels are imbalanced.
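The 0.5 cut-off is a modeling choice rather than a property of the data. A small sketch, assuming other cut-offs are equally plausible, of how the reported AUC could be recomputed under alternative thresholds:

# Sketch: recompute the SVM ROC AUC under alternative binarization thresholds
for (t in c(0.5, 0.6, 0.7)) {
  y_bin <- ifelse(test_data$Chance.of.Admit >= t, 1, 0)
  cat(sprintf("threshold %.1f: SVM ROC AUC = %.3f\n", t,
              MLmetrics::AUC(y_pred = adm_test_pred_svm, y_true = y_bin)))
}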
Gradient Boosting using XGBoost: The second predictive model is built using Gradient Boosting, a powerful ensemble learning technique, with the XGBoost implementation available in the xgboost package. Unlike the SVM model, only the five standardized numeric attributes (GRE Score, TOEFL Score, SOP, LOR, CGPA) are used as features to predict the binary admission outcome. We set the objective to “binary:logistic” for binary classification.
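Once trained, the XGBoost model can also report how much each of the five features contributes to its splits. A short sketch using the package’s built-in importance function (not part of the original run):

# Sketch: gain-based feature importance for the fitted XGBoost model
importance <- xgboost::xgb.importance(feature_names = numeric_cols, model = adm_model_xgb)
print(importance)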
Model Training and Iterations: The XGBoost model undergoes a training process, where it iteratively improves its performance over 100 rounds (nrounds). The primary objective is to minimize the log-loss (logistic loss) during training. The training log provides valuable insights into the learning progress and log-loss values at each iteration.
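Since the log above tracks training loss only (still decreasing at round 100), the fixed choice of 100 rounds is not validated against held-out data. A hedged sketch, not part of the original analysis, of how cross-validation with early stopping could select the number of rounds instead:

# Sketch: 5-fold CV with early stopping to choose nrounds
set.seed(1234)
cv <- xgboost::xgb.cv(data = xgb_train_matrix, params = params, nrounds = 200,
                      nfold = 5, early_stopping_rounds = 10, verbose = 0)
cv$best_iteration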
Model Prediction: After model training, we utilize the trained SVM and XGBoost models to make predictions on the test data. The predicted probabilities for admission are stored in adm_test_pred_svm for SVM and adm_test_pred_xgb for XGBoost.
We evaluate the predictive models’ performance using the ROC AUC metric, which measures the models’ ability to discriminate between admitted and non-admitted students. A higher ROC AUC value, closer to 1, indicates better model discrimination. The SVM model achieves an ROC AUC of approximately 0.912, while the XGBoost model achieves a score of around 0.777.
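Beyond the single AUC numbers, the full ROC curves can be compared visually. A sketch using the pROC package (cited in the references, though not run in the original analysis):

# Sketch: overlay the test-set ROC curves of both models
library(pROC)
roc_svm <- roc(test_data$Chance.of.Admit_binary, as.numeric(adm_test_pred_svm))
roc_xgb <- roc(test_data$Chance.of.Admit_binary, as.numeric(adm_test_pred_xgb))
plot(roc_svm, col = "darkblue")
lines(roc_xgb, col = "darkred")
legend("bottomright", legend = c("SVM", "XGBoost"),
       col = c("darkblue", "darkred"), lwd = 2)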
The detailed model evaluation allows us to compare the performance of the SVM and XGBoost models for predicting students’ chances of admission. The SVM model demonstrates superior discrimination ability, as evidenced by its higher ROC AUC score. These results provide valuable insights into selecting the most effective model for making accurate predictions in the university admission process.
Data Source: https://www.kaggle.com/datasets/mohansacharya/graduate-admissions
Kuhn, M. (2020). caret: Classification and Regression Training. R package version 6.0-86. https://CRAN.R-project.org/package=caret
Robin, X., Turck, N., Hainard, A., Tiberti, N., Lisacek, F., Sanchez, J.-C., & Muller, M. (2011). pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics, 12(1), 77.
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Healy, K. (2018). Data Visualization: A Practical Introduction. Princeton University Press.
Wickham, H., & Grolemund, G. (2017). R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media.
Wilke, C. O. (2019). Fundamentals of Data Visualization: A Primer on Making Informative and Compelling Figures. O’Reilly Media.