The purpose of this project is to analyze a dataset on graduate school admissions to uncover patterns and insights that can aid students in understanding their chances of admission. By leveraging statistical and machine learning techniques, the project aims to predict admission probabilities and categorize students into likelihood tiers (high or low). This analysis can provide valuable guidance for prospective graduate students seeking to optimize their applications.
library(rmarkdown)
library(knitr)
knitr::opts_chunk$set(echo = TRUE)
set.seed(123)
Dataset Details
Our data is obtained from https://www.kaggle.com/datasets/mohansacharya/graduate-admissions
df <- read.csv("Admission_Predict.csv")
# Preview the first few rows of the dataset
head(df)
## Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 1 1 337 118 4 4.5
## 2 2 324 107 4 4.0
## 3 3 316 104 3 3.0
## 4 4 322 110 3 3.5
## 5 5 314 103 2 2.0
## 6 6 330 115 5 4.5
## Letter.of.Recommendation CGPA Research Chance.of.Admit
## 1 4.5 9.65 1 0.92
## 2 4.5 8.87 1 0.76
## 3 3.5 8.00 1 0.72
## 4 2.5 8.67 1 0.80
## 5 3.0 8.21 0 0.65
## 6 3.0 9.34 1 0.90
str(df)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating : int 4 4 3 3 2 5 3 2 1 3 ...
## $ Statement.of.Purpose : num 4.5 4 3 3.5 2 4.5 NA 3 2 3.5 ...
## $ Letter.of.Recommendation: num 4.5 4.5 3.5 2.5 3 3 4 4 NA 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
summary(df)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:100.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :200.5 Median :317.0 Median :107.0 Median :3.000
## Mean :200.5 Mean :316.8 Mean :107.4 Mean :3.087
## 3rd Qu.:300.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :400.0 Max. :340.0 Max. :120.0 Max. :5.000
##
## Statement.of.Purpose Letter.of.Recommendation CGPA Research
## Min. :1.000 Min. :1.00 Min. :6.800 Min. :0.0000
## 1st Qu.:2.500 1st Qu.:3.00 1st Qu.:8.170 1st Qu.:0.0000
## Median :3.500 Median :3.50 Median :8.610 Median :1.0000
## Mean :3.416 Mean :3.45 Mean :8.599 Mean :0.5475
## 3rd Qu.:4.000 3rd Qu.:4.00 3rd Qu.:9.062 3rd Qu.:1.0000
## Max. :5.000 Max. :5.00 Max. :9.920 Max. :1.0000
## NA's :18 NA's :19
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6400
## Median :0.7300
## Mean :0.7244
## 3rd Qu.:0.8300
## Max. :0.9700
##
# Check if there are any missing values in the data frame
any_missing <- any(is.na(df))
cat("Are there any missing values in the dataset? ", any_missing, "\n")
## Are there any missing values in the dataset? TRUE
# Count the total number of missing values in the dataset
total_missing <- sum(is.na(df))
cat("Total number of missing values in the dataset: ", total_missing, "\n")
## Total number of missing values in the dataset: 37
# Count missing values for each column
missing_per_column <- colSums(is.na(df))
cat("Missing values per column:\n")
## Missing values per column:
print(missing_per_column)
## Serial.No. GRE.Score TOEFL.Score
## 0 0 0
## University.Rating Statement.of.Purpose Letter.of.Recommendation
## 0 18 19
## CGPA Research Chance.of.Admit
## 0 0 0
# Display rows with missing values
rows_with_na <- df[!complete.cases(df), ]
cat("Rows with missing values:\n")
## Rows with missing values:
print(rows_with_na)
## Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 7 7 321 109 3 NA
## 9 9 302 102 1 2.0
## 14 14 307 109 3 NA
## 21 21 312 107 3 NA
## 24 24 334 119 5 5.0
## 37 37 299 106 2 NA
## 49 49 321 110 3 3.5
## 67 67 327 114 3 NA
## 77 77 327 112 3 NA
## 80 80 294 93 1 1.5
## 93 93 298 98 2 4.0
## 95 95 303 99 3 NA
## 105 105 326 112 3 3.5
## 113 113 301 107 3 NA
## 128 128 319 112 3 2.5
## 140 140 318 109 1 NA
## 155 155 326 108 3 NA
## 170 170 311 99 2 2.5
## 182 182 305 107 2 2.5
## 194 194 336 118 5 4.5
## 203 203 340 120 5 4.5
## 215 215 331 117 4 4.5
## 224 224 308 109 2 3.0
## 232 232 319 106 3 NA
## 233 233 312 107 2 2.5
## 244 244 325 114 3 NA
## 253 253 318 100 2 NA
## 262 262 312 104 3 NA
## 274 274 312 99 1 NA
## 285 285 340 112 4 5.0
## 294 294 312 98 1 3.5
## 303 303 322 105 2 3.0
## 311 311 320 104 3 NA
## 312 312 328 108 4 4.5
## 325 325 315 104 3 NA
## 332 332 311 105 2 NA
## Letter.of.Recommendation CGPA Research Chance.of.Admit
## 7 4.0 8.20 1 0.75
## 9 NA 8.00 0 0.50
## 14 3.0 8.00 1 0.62
## 21 2.0 7.90 1 0.64
## 24 NA 9.70 1 0.95
## 37 4.0 8.40 0 0.64
## 49 NA 8.85 1 0.82
## 67 3.0 9.02 0 0.61
## 77 3.0 8.72 1 0.74
## 80 NA 7.36 0 0.46
## 93 NA 8.03 0 0.34
## 95 2.5 7.66 0 0.36
## 105 NA 9.05 1 0.74
## 113 3.5 8.34 1 0.62
## 128 NA 8.71 1 0.78
## 140 3.5 9.12 0 0.78
## 155 NA 8.89 0 0.80
## 170 NA 7.98 0 0.65
## 182 NA 8.42 0 0.71
## 194 NA 9.53 1 0.94
## 203 NA 9.91 1 0.97
## 215 NA 9.42 1 0.94
## 224 NA 8.45 0 0.71
## 232 2.5 8.33 1 0.74
## 233 NA 8.27 0 0.69
## 244 3.0 9.04 1 0.76
## 253 3.5 8.54 1 0.71
## 262 4.0 8.09 0 0.71
## 274 1.5 8.01 1 0.52
## 285 NA 9.66 1 0.94
## 294 NA 8.18 1 0.64
## 303 NA 8.45 1 0.65
## 311 3.5 8.74 1 0.76
## 312 NA 9.18 1 0.84
## 325 2.5 8.33 0 0.67
## 332 2.0 8.12 1 0.73
Replace NA Values
# Replace NA values with 0 in 'Statement of Purpose' and 'Letter of Recommendation'
df$`Statement.of.Purpose`[is.na(df$`Statement.of.Purpose`)] <- 0
df$`Letter.of.Recommendation`[is.na(df$`Letter.of.Recommendation`)] <- 0
# Check if the missing values have been replaced successfully
cat("Missing values in 'Statement of Purpose':", sum(is.na(df$`Statement.of.Purpose`)), "\n")
## Missing values in 'Statement of Purpose': 0
cat("Missing values in 'Letter of Recommendation':", sum(is.na(df$`Letter.of.Recommendation`)), "\n")
## Missing values in 'Letter of Recommendation': 0
# Visualizations (Histograms and Boxplots) for distribution and to choose mean/median
# SOP Histogram
hist(df$`Statement.of.Purpose`,
main = "Statement of Purpose (SOP) Distribution",
xlab = "SOP",
col = "lightblue",
border = "black")
# LOR Histogram
hist(df$`Letter.of.Recommendation`,
main = "Letter of Recommendation (LOR) Distribution",
xlab = "LOR",
col = "lightgreen",
border = "black")
# SOP Boxplot
boxplot(df$`Statement.of.Purpose`,
main = "Statement of Purpose (SOP) Boxplot",
ylab = "SOP",
col = "lightblue",
border = "blue",
horizontal = TRUE)
# LOR Boxplot
boxplot(df$`Letter.of.Recommendation`,
main = "Letter of Recommendation (LOR) Boxplot",
ylab = "LOR",
col = "lightgreen",
border = "green",
horizontal = TRUE)
Histogram (First Plot):
The data distribution is slightly skewed to the left with a significant amount of data clustered towards the higher SOP values (3 to 5). There is a small concentration of values at 0, likely due to replacing NA values. Boxplot (Second Plot):
The median (thick blue line inside the box) is closer to the center of the interquartile range (IQR), but there is a noticeable outlier at 0. The presence of an outlier suggests that the mean might be pulled down due to the replacement of NA values with 0. Decision: Median is more appropriate Since the mean is sensitive to outliers and skewed data, the median provides a better measure of central tendency in this case. The outlier at 0 (from replacing NA) will distort the mean, but the median remains robust.
Replace 0 values with the median In order to ensure that the distribution of the dataset remains more consistent, median imputation technique is selected instead of mean imputation techniques. Reasons as follows: - Median remains more central to the majority of data points without skewing it away. - Median is less sensitive to outliers. - Median is suitable when the data is continous, not categorical (mode is more suitable).
Target columns: (1) Statement of Purpose, (2) Letter of Recommendation
# Replace 0 values with the median for specified columns
df$Statement.of.Purpose[df$Statement.of.Purpose == 0] <- median(df$Statement.of.Purpose[df$Statement.of.Purpose != 0], na.rm = TRUE)
df$Letter.of.Recommendation[df$Letter.of.Recommendation == 0] <- median(df$Letter.of.Recommendation[df$Letter.of.Recommendation != 0], na.rm = TRUE)
# Display the updated dataframe (first 20 rows)
head(df, 20)
## Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 1 1 337 118 4 4.5
## 2 2 324 107 4 4.0
## 3 3 316 104 3 3.0
## 4 4 322 110 3 3.5
## 5 5 314 103 2 2.0
## 6 6 330 115 5 4.5
## 7 7 321 109 3 3.5
## 8 8 308 101 2 3.0
## 9 9 302 102 1 2.0
## 10 10 323 108 3 3.5
## 11 11 325 106 3 3.5
## 12 12 327 111 4 4.0
## 13 13 328 112 4 4.0
## 14 14 307 109 3 3.5
## 15 15 311 104 3 3.5
## 16 16 314 105 3 3.5
## 17 17 317 107 3 4.0
## 18 18 319 106 3 4.0
## 19 19 318 110 3 4.0
## 20 20 303 102 3 3.5
## Letter.of.Recommendation CGPA Research Chance.of.Admit
## 1 4.5 9.65 1 0.92
## 2 4.5 8.87 1 0.76
## 3 3.5 8.00 1 0.72
## 4 2.5 8.67 1 0.80
## 5 3.0 8.21 0 0.65
## 6 3.0 9.34 1 0.90
## 7 4.0 8.20 1 0.75
## 8 4.0 7.90 0 0.68
## 9 3.5 8.00 0 0.50
## 10 3.0 8.60 0 0.45
## 11 4.0 8.40 1 0.52
## 12 4.5 9.00 1 0.84
## 13 4.5 9.10 1 0.78
## 14 3.0 8.00 1 0.62
## 15 2.0 8.20 1 0.61
## 16 2.5 8.30 0 0.54
## 17 3.0 8.70 0 0.66
## 18 3.0 8.00 1 0.65
## 19 3.0 8.80 0 0.63
## 20 3.0 8.50 0 0.62
Calculate skewness for each column
library(e1071)
## Warning: package 'e1071' was built under R version 4.4.2
sop_skewness <- skewness(df$`Statement.of.Purpose`, na.rm = TRUE)
lor_skewness <- skewness(df$`Letter.of.Recommendation`, na.rm = TRUE)
gre_skewness <- skewness(df$`GRE.Score`, na.rm = TRUE)
toefl_skewness <- skewness(df$`TOEFL.Score`, na.rm = TRUE)
cgpa_skewness <- skewness(df$`CGPA`, na.rm = TRUE)
admit_skewness <- skewness(df$`Chance.of.Admit`, na.rm = TRUE)
cat("SOP Skewness:", sop_skewness, "\n")
## SOP Skewness: -0.3007009
cat("LOR Skewness:", lor_skewness, "\n")
## LOR Skewness: -0.1174626
cat("GRE Skewness:", gre_skewness, "\n")
## GRE Skewness: -0.06242254
cat("TOEEFL Skewness:", toefl_skewness, "\n")
## TOEEFL Skewness: 0.05678751
cat("CGPA Skewness:", cgpa_skewness, "\n")
## CGPA Skewness: -0.06549644
cat("Chance of Admit Skewness:", admit_skewness, "\n")
## Chance of Admit Skewness: -0.3508017
# Check for negative values in SOP and LOR columns
sum(df$`Statement.of.Purpose` < 0, na.rm = TRUE)
## [1] 0
sum(df$`Letter.of.Recommendation` < 0, na.rm = TRUE)
## [1] 0
Summary of Skewness: - Negative skewed (Left-skewed): Statement of Purpose, Chance of Admit - Slightly Negative skewed (Left-skewed): Letter of Recommendation, GRE, CGPA - Slightly Positive skewed (Right-skewed): TOEEFL
# Rename columns
colnames(df)[colnames(df) == "Statement.of.Purpose"] <- "SOP"
colnames(df)[colnames(df) == "Letter.of.Recommendation"] <- "LOR"
colnames(df)[colnames(df) == "GRE.Score"] <- "GRE"
colnames(df)[colnames(df) == "TOEFL.Score"] <- "TOEFL"
# Check the updated column names
colnames(df)
## [1] "Serial.No." "GRE" "TOEFL"
## [4] "University.Rating" "SOP" "LOR"
## [7] "CGPA" "Research" "Chance.of.Admit"
# Calculate Total Score (GRE + TOEFL)
df$Total_Score <- df$GRE + df$TOEFL
# Calculate Research Experience & GPA Interaction
df$Research_GPA_Interaction <- df$Research * df$CGPA
# Binning Chance of Admit into categorical column
df$Chance.Level <- ifelse(df$Chance.of.Admit > 0.7, "1", # Class "High"
"0") # Class "Low"
Total_Score helps you understand the combined performance of the student in both GRE and TOEFL. Research_GPA_Interaction captures the effect of research experience combined with the academic performance (CGPA). This feature could help predict how research experience might influence a student’s chance of admission when combined with their GPA. New column Chance.Level is converting continuous variable of chance of admit into a categorical numeric classes(1 for High, 0 for Low) in order to proceed with the classification modeling.
# Remove Serial.No column
df <- df[, !colnames(df) %in% "Serial.No."]
df <- df[, c(setdiff(colnames(df), "Chance.of.Admit"), "Chance.of.Admit")]
head(df, 5)
## GRE TOEFL University.Rating SOP LOR CGPA Research Total_Score
## 1 337 118 4 4.5 4.5 9.65 1 455
## 2 324 107 4 4.0 4.5 8.87 1 431
## 3 316 104 3 3.0 3.5 8.00 1 420
## 4 322 110 3 3.5 2.5 8.67 1 432
## 5 314 103 2 2.0 3.0 8.21 0 417
## Research_GPA_Interaction Chance.Level Chance.of.Admit
## 1 9.65 1 0.92
## 2 8.87 1 0.76
## 3 8.00 1 0.72
## 4 8.67 1 0.80
## 5 0.00 0 0.65
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.2
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.4.2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:e1071':
##
## impute
## The following objects are masked from 'package:base':
##
## format.pval, units
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.4.2
## corrplot 0.95 loaded
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.2
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.4.2
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(patchwork)
## Warning: package 'patchwork' was built under R version 4.4.2
# Remove Serial.No column
df <- df[, !colnames(df) %in% "Serial.No."]
df <- df[, c(setdiff(colnames(df), "Chance.of.Admit"), "Chance.of.Admit")]
# Compute correlation matrix using Spearman for mixed types
cor_matrix <- rcorr(as.matrix(df), type = "spearman")
# Extract correlation coefficients
corr <- cor_matrix$r
corr
## GRE TOEFL University.Rating SOP
## GRE 1.0000000 0.8318602 0.6762653 0.6220924
## TOEFL 0.8318602 1.0000000 0.6968676 0.6462887
## University.Rating 0.6762653 0.6968676 1.0000000 0.7392971
## SOP 0.6220924 0.6462887 0.7392971 1.0000000
## LOR 0.5172537 0.5209534 0.6238022 0.6926797
## CGPA 0.8318480 0.8257200 0.7505625 0.7223057
## Research 0.5959110 0.5043224 0.4541311 0.4523064
## Total_Score 0.9788624 0.9225135 0.7056926 0.6542414
## Research_GPA_Interaction 0.7549828 0.6862895 0.6227475 0.6022060
## Chance.Level 0.6711266 0.6357582 0.5863705 0.5424313
## Chance.of.Admit 0.8153521 0.7955730 0.7319766 0.6915088
## LOR CGPA Research Total_Score
## GRE 0.5172537 0.8318480 0.5959110 0.9788624
## TOEFL 0.5209534 0.8257200 0.5043224 0.9225135
## University.Rating 0.6238022 0.7505625 0.4541311 0.7056926
## SOP 0.6926797 0.7223057 0.4523064 0.6542414
## LOR 1.0000000 0.6371129 0.3854145 0.5356037
## CGPA 0.6371129 1.0000000 0.5302649 0.8607412
## Research 0.3854145 0.5302649 1.0000000 0.5852178
## Total_Score 0.5356037 0.8607412 0.5852178 1.0000000
## Research_GPA_Interaction 0.5096447 0.7601049 0.9050681 0.7545088
## Chance.Level 0.5179281 0.6990713 0.5033846 0.6894465
## Chance.of.Admit 0.6396353 0.8784026 0.5817423 0.8389873
## Research_GPA_Interaction Chance.Level Chance.of.Admit
## GRE 0.7549828 0.6711266 0.8153521
## TOEFL 0.6862895 0.6357582 0.7955730
## University.Rating 0.6227475 0.5863705 0.7319766
## SOP 0.6022060 0.5424313 0.6915088
## LOR 0.5096447 0.5179281 0.6396353
## CGPA 0.7601049 0.6990713 0.8784026
## Research 0.9050681 0.5033846 0.5817423
## Total_Score 0.7545088 0.6894465 0.8389873
## Research_GPA_Interaction 1.0000000 0.5926343 0.7691904
## Chance.Level 0.5926343 1.0000000 0.8529164
## Chance.of.Admit 0.7691904 0.8529164 1.0000000
corrplot.mixed(corr,
upper = "circle", # Upper triangle with squares
lower = "number", # Lower triangle with correlation coefficients
addgrid.col = "black", # Add gridlines
tl.col = "black", # Text color
tl.cex = 0.6, # Adjust text size
number.cex = 0.6, # Adjust size of numbers in the lower triangle
tl.pos = "lt", # Rotate text to 45 degrees for better fit
title = "Spearman Correlation Matrix",
mar = c(2, 2, 2, 2))
# GRE
gre_hist <- ggplot(df, aes(x = GRE)) +
geom_histogram(binwidth = 5, fill = "darkseagreen", color = "black") +
labs(title = "Distribution of GRE Scores", x = "GRE", y = "Count")
stats_gre <- summary(df$GRE)
q1 <- stats_gre["1st Qu."]
median <- stats_gre["Median"]
q3 <- stats_gre["3rd Qu."]
gre_box <- ggplot(df, aes(x = factor(1), y = GRE)) +
geom_boxplot(fill = "darkseagreen", width = 0.4) +
annotate("text", x = 1.4, y = q1, label = paste0("Q1: ", round(q1, 1)), color = "blue", size = 4) +
annotate("text", x = 1.4, y = median, label = paste0("Median: ", round(median, 1)), color = "red", size = 4) +
annotate("text", x = 1.4, y = q3, label = paste0("Q3: ", round(q3, 1)), color = "darkgreen", size = 4) +
labs(title = "Boxplot of GRE Scores", y = "GRE Scores", x = NULL) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
grid.arrange(gre_hist, gre_box, ncol = 2)
# TOEFL
toefl_hist <- ggplot(df, aes(x = TOEFL)) +
geom_histogram(binwidth = 3, fill = "cadetblue2", color = "black") +
labs(title = "Distribution of TOEFL Scores", x = "TOEFL", y = "Count")
stats_toefl <- summary(df$TOEFL)
q1_toefl <- stats_toefl["1st Qu."]
median_toefl <- stats_toefl["Median"]
q3_toefl <- stats_toefl["3rd Qu."]
toefl_box <- ggplot(df, aes(x = factor(1), y = TOEFL)) +
geom_boxplot(fill = "cadetblue2", width = 0.4) +
annotate("text", x = 1.3, y = q1_toefl, label = paste0("Q1: ", round(q1_toefl, 1)), color = "blue", size = 4, hjust = 0) +
annotate("text", x = 1.3, y = median_toefl, label = paste0("Median: ", round(median_toefl, 1)), color = "red", size = 4, hjust = 0) +
annotate("text", x = 1.3, y = q3_toefl, label = paste0("Q3: ", round(q3_toefl, 1)), color = "darkgreen", size = 4, hjust = 0) +
labs(title = "Boxplot of TOEFL Scores", y = "TOEFL", x = NULL) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
grid.arrange(toefl_hist, toefl_box, ncol = 2)
# CGPA
cgpa_hist <- ggplot(df, aes(x = CGPA)) +
geom_histogram(binwidth = 0.3, fill = "cyan4", color = "black") +
labs(title = "Distribution of CGPA Scores", x = "CGPA", y = "Count")
stats_cgpa <- summary(df$CGPA)
q1_cgpa <- stats_cgpa["1st Qu."]
median_cgpa <- stats_cgpa["Median"]
q3_cgpa <- stats_cgpa["3rd Qu."]
cgpa_box <- ggplot(df, aes(x = factor(1), y = CGPA)) +
geom_boxplot(fill = "cyan4", width = 0.3) +
annotate("text", x = 1.2, y = q1_cgpa, label = paste0("Q1: ", round(q1_cgpa, 2)), color = "blue", size = 4, hjust = 0) +
annotate("text", x = 1.2, y = median_cgpa, label = paste0("Median: ", round(median_cgpa, 2)), color = "red", size = 4, hjust = 0) +
annotate("text", x = 1.2, y = q3_cgpa, label = paste0("Q3: ", round(q3_cgpa, 2)), color = "darkgreen", size = 4, hjust = 0) +
labs(title = "Boxplot of CGPA", y = "CGPA", x = NULL) +
scale_y_continuous(limits = c(6, 10)) +
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())
grid.arrange(cgpa_hist, cgpa_box, ncol = 2)
# Proportion of students with research experience
bar_research <- df %>%
count(Research) %>%
mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar
dist_research<-ggplot(bar_research, aes(x = factor(Research), y = n, fill = color)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
scale_fill_identity() + # Use the color column directly
labs(
title = "Distribution of Students by Research Experience",
x = "Research Experience (0: No, 1: Yes)",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
axis.title.x = element_text(size = 8),
axis.title.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8)
)+
ylim(0, max(bar_research$n) * 1)
# Proportion of students with LOR
bar_LOR <- df %>%
count(LOR) %>%
mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar
dist_LOR<-ggplot(bar_LOR, aes(x = factor(LOR), y = n, fill = color)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
scale_fill_identity() + # Use the color column directly
labs(
title = "Distribution of Students by Letter of Recommendation Strength (LOR)",
x = "Letter of Recommendation Strength (LOR)",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
axis.title.x = element_text(size = 8),
axis.title.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8)
)+
ylim(0, max(bar_LOR$n) * 1)
# Proportion of students with SOP
bar_SOP <- df %>%
count(SOP) %>%
mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar
dist_SOP<-ggplot(bar_SOP, aes(x = factor(SOP), y = n, fill = color)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
scale_fill_identity() + # Use the color column directly
labs(
title = "Distribution of Students by Statement of Purpose Strength (SOP)",
x = "Statement of Purpose Strength (SOP)",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
axis.title.x = element_text(size = 8),
axis.title.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8)
)+
ylim(0, max(bar_SOP$n) * 1)
# Proportion of students with University Rating
bar_rating <- df %>%
count(University.Rating) %>%
mutate(color = ifelse(n == max(n), "red", "orange"))
dist_rating<-ggplot(bar_rating, aes(x = factor(University.Rating), y = n, fill = color)) +
geom_bar(stat = "identity") +
geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
scale_fill_identity() + # Use the color column directly
labs(
title = "Distribution of Students by University Rating",
x = "University Rating",
y = "Count"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
axis.title.x = element_text(size = 8),
axis.title.y = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8)
)+
ylim(0, max(bar_rating$n) * 1)
combined_dist_plot <- (dist_research | dist_LOR) / (dist_SOP | dist_rating)
combined_dist_plot
# To identify relationships between key predictors and target variable (Chance.of.Admit)
# GRE vs Chance of Admit
scatter_gre <- ggplot(df, aes(x = GRE, y = Chance.of.Admit)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "GRE Scores vs Chance of Admit", x = "GRE", y = "Chance of Admit")+
theme(
plot.title = element_text(size = 9),
axis.title.x = element_text(size = 9),
axis.title.y = element_text(size = 9)
)
# TOEFL Scores vs Chance of Admit
scatter_toefl <- ggplot(df, aes(x = TOEFL, y = Chance.of.Admit)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "TOEFL Scores vs Chance of Admit", x = "TOEFL", y = "Chance of Admit")+
theme(
plot.title = element_text(size = 9),
axis.title.x = element_text(size = 9),
axis.title.y = element_text(size = 9)
)
# CGPA vs Chance of Admit
scatter_cgpa <- ggplot(df, aes(x = CGPA, y = Chance.of.Admit)) +
geom_point() +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs(title = "CGPA vs Chance of Admit", x = "CGPA", y = "Chance of Admit")+
theme(
plot.title = element_text(size = 9),
axis.title.x = element_text(size = 9),
axis.title.y = element_text(size = 9)
)
combined_scatter <- scatter_gre + scatter_toefl + scatter_cgpa +
plot_layout(ncol = 3)
combined_scatter
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# University Rating vs Chance of Admit
quartiles_by_uni_rating <- df %>%
group_by(University.Rating) %>%
summarise(
Q1 = quantile(Chance.of.Admit, 0.25),
Median = median(Chance.of.Admit),
Q3 = quantile(Chance.of.Admit, 0.75))
quartiles_by_uni_rating
## # A tibble: 5 × 4
## University.Rating Q1 Median Q3
## <int> <dbl> <dbl> <dbl>
## 1 1 0.492 0.545 0.58
## 2 2 0.56 0.64 0.705
## 3 3 0.67 0.73 0.77
## 4 4 0.78 0.835 0.9
## 5 5 0.87 0.91 0.94
ggplot(df, aes(x = factor(University.Rating), y = Chance.of.Admit)) +
geom_boxplot(fill = "burlywood1") +
labs(title = "University Rating vs Chance of Admit", x = "University Rating", y = "Chance of Admit")
# Research vs Chance of Admit
quartiles_by_research <- df %>%
group_by(Research) %>%
summarise(
Q1 = quantile(Chance.of.Admit, 0.25),
Median = median(Chance.of.Admit),
Q3 = quantile(Chance.of.Admit, 0.75))
quartiles_by_research
## # A tibble: 2 × 4
## Research Q1 Median Q3
## <int> <dbl> <dbl> <dbl>
## 1 0 0.57 0.65 0.72
## 2 1 0.73 0.8 0.9
ggplot(df, aes(x = factor(Research), y = Chance.of.Admit)) +
geom_boxplot(fill = "lightpink") +
labs(title = "Research vs Chance of Admit", x = "Research Experience (0: No, 1: Yes)", y = "Chance of Admit")
# SOP vs Chance of Admit
quartiles_by_SOP <- df %>%
group_by(SOP) %>%
summarise(
Q1 = quantile(Chance.of.Admit, 0.25),
Median = median(Chance.of.Admit),
Q3 = quantile(Chance.of.Admit, 0.75))
quartiles_by_SOP
## # A tibble: 9 × 4
## SOP Q1 Median Q3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0.56 0.57 0.58
## 2 1.5 0.47 0.525 0.56
## 3 2 0.525 0.64 0.7
## 4 2.5 0.61 0.67 0.71
## 5 3 0.64 0.705 0.742
## 6 3.5 0.64 0.72 0.775
## 7 4 0.728 0.8 0.86
## 8 4.5 0.78 0.87 0.91
## 9 5 0.88 0.92 0.94
ggplot(df, aes(x = factor(SOP), y = Chance.of.Admit)) +
geom_boxplot(fill = "darksalmon") +
labs(title = "Statement of Purpose vs Chance of Admit", x = "Statement of Purpose Strength", y = "Chance of Admit")
# LOR vs Chance of Admit
quartiles_by_LOR <- df %>%
group_by(LOR) %>%
summarise(
Q1 = quantile(Chance.of.Admit, 0.25),
Median = median(Chance.of.Admit),
Q3 = quantile(Chance.of.Admit, 0.75))
quartiles_by_LOR
## # A tibble: 9 × 4
## LOR Q1 Median Q3
## <dbl> <dbl> <dbl> <dbl>
## 1 1 0.42 0.42 0.42
## 2 1.5 0.498 0.52 0.655
## 3 2 0.485 0.56 0.633
## 4 2.5 0.57 0.645 0.728
## 5 3 0.62 0.67 0.763
## 6 3.5 0.65 0.725 0.8
## 7 4 0.705 0.78 0.855
## 8 4.5 0.782 0.89 0.91
## 9 5 0.85 0.895 0.94
ggplot(df, aes(x = factor(LOR), y = Chance.of.Admit)) +
geom_boxplot(fill = "cornsilk") +
labs(title = "Letter of Recommendation Strength vs Chance of Admit", x = "Letter of Recommendation Strength", y = "Chance of Admit")
Feature Selection For Regression Correlation scores for Chance.Of.Admit: University.Rating: 0.732 SOP:0.691 LOR: 0.640 CGPA: 0.878 Research: 0.582 Total_Score: 0.839
Feature Selection For Classification Correlation scores for Chance.Level: University.Rating: 0.5863705 SOP: 0.5424313 LOR: 0.5179281 CGPA: 0.6990713 Research: 0.5033846 Total_Score: 0.6894465
Chance.Level is a categorical variable derived from Chance.of.Admit. When converting continuous data like Chance.of.Admit into categorical bins (such as “1”, “2”, “3”), the relationship between individual features and the new target (Chance.Level) becomes less precise. The values in Chance.Level now represent broader categories, which naturally reduces the correlation.
Dropping variables (or features) during the feature selection process before modeling is a crucial step to enhance model performance, reduce overfitting, and improve computational efficiency.
In our case, unwanted features that might cause duplication such as GRE, TOEFL, Research_GPA_Interaction, and Chance.of.Admit/Chance.Level are dropped.
# Drop the unwanted features for classification
dfc <- df[, !(names(df) %in% c("GRE", "TOEFL", "Research_GPA_Interaction","Chance.of.Admit"))]
# Drop the unwanted features for regression
dfr <- df[, !(names(df) %in% c("GRE", "TOEFL", "Research_GPA_Interaction","Chance.Level"))]
Dropping the GRE and TOEFL because we have the Total_Score column, which combines these two. Dropping the Research_GPA_Interaction due to many 0 values (as Research is binary and can lead to a lot of 0 values).
skewness(df$Chance.of.Admit)
## [1] -0.3508017
table(df$Chance.Level)
##
## 0 1
## 165 235
Since the skewness value of -0.3508017 indicates only a slight left-skew, it’s likely not severe enough to require immediate resampling or transformation. The classes are fairly close to each other, with class 1 having 235 entries and class 0 having 165 entries. The difference of 70 instances is not large, so this isn’t considered a significant imbalance.
Data Splitting (80-20) for classification model
The 80/20 rule is selected to split the data to get a better balance between having enough data to train the model and enough data to assess how it generalizes. This can reduce overfitting issues, thus improve the performance of the model.
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: lattice
# Split the dataset into training and test sets (80-20 split)
set.seed(123) # Ensures reproducibility
split.index <- createDataPartition(dfc$Chance.Level, p = 0.8, list = FALSE) #target column is 'Chance.Level'
# Create the training set (80% of the data)
train.data <- dfc[split.index, ]
# Create the test set (remaining 20% of the data)
test.data <- dfc[-split.index, ]
# Check the distribution of the target variable in both sets to ensure proper splitting
table(train.data$Chance.Level)
##
## 0 1
## 132 188
table(test.data$Chance.Level)
##
## 0 1
## 33 47
# Convert Chance.Level to factor for classification
train.data$Chance.Level <- factor(train.data$Chance.Level, levels = c("1", "0"), labels = c("HighChance", "LowChance"))
test.data$Chance.Level <- factor(test.data$Chance.Level, levels = c("1", "0"), labels = c("HighChance", "LowChance"))
As the Random Forest Model can provide feature importance scores, it is utilized to help understand which features are most significant for the classification tasks.
Besides, different cross-validation techniques are implemented to assess the performance of Random Forest classifier, while the best classifier will be selected for modelling.
After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.4.2
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
#Fit the Random Forest Model
model_rf <- randomForest(Chance.Level ~ ., data = train.data)
# Print the model to see the results
print(model_rf)
##
## Call:
## randomForest(formula = Chance.Level ~ ., data = train.data)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 15.94%
## Confusion matrix:
## HighChance LowChance class.error
## HighChance 164 24 0.1276596
## LowChance 27 105 0.2045455
# Make predictions on the test set
predictions_rf <- predict(model_rf, newdata = test.data)
# Confusion matrix to evaluate the model
conf_matrix_rf <- confusionMatrix(predictions_rf, test.data$Chance.Level)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HighChance LowChance
## HighChance 37 6
## LowChance 10 27
##
## Accuracy : 0.8
## 95% CI : (0.6956, 0.8811)
## No Information Rate : 0.5875
## P-Value [Acc > NIR] : 4.709e-05
##
## Kappa : 0.5947
##
## Mcnemar's Test P-Value : 0.4533
##
## Sensitivity : 0.7872
## Specificity : 0.8182
## Pos Pred Value : 0.8605
## Neg Pred Value : 0.7297
## Prevalence : 0.5875
## Detection Rate : 0.4625
## Detection Prevalence : 0.5375
## Balanced Accuracy : 0.8027
##
## 'Positive' Class : HighChance
##
Cross Validation for Random Forest
# k-fold
train_control <- trainControl(method="cv", number=10, classProbs = TRUE)
model_rf1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3, classProbs = TRUE)
model_rf2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")
# Leave one out CV
train_control <- trainControl(method="LOOCV", classProbs = TRUE)
model_rf3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")
# bootstrap
train_control <- trainControl(method="boot", number=100, classProbs = TRUE)
model_rf4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")
# Extract performance metrics for each model
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
Accuracy = c(
max(model_rf1$results$Accuracy),
max(model_rf2$results$Accuracy),
max(model_rf3$results$Accuracy),
max(model_rf4$results$Accuracy)
),
Kappa = c(
max(model_rf1$results$Kappa),
max(model_rf2$results$Kappa),
max(model_rf3$results$Kappa),
max(model_rf4$results$Kappa)
)
)
# Print the comparison table
print(results)
## Model Accuracy Kappa
## 1 k-fold 0.8310881 0.6504532
## 2 Repeated k-fold 0.8235144 0.6341111
## 3 LOOCV 0.8281250 0.6457897
## 4 Bootstrap 0.8252230 0.6386866
# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: k-fold"
Best Performing Random Forest
# Extract Random Forest Model from k-fold
rf_model <- model_rf1$finalModel
# Random Forest Feature Importance
feature_importance_rf <- importance(rf_model)
# Convert to data frame and prepare for visualization
importance_df_rf <- as.data.frame(feature_importance_rf)
importance_df_rf$Feature <- rownames(importance_df_rf)
# Sort by Mean Decrease Gini
importance_df_rf <- importance_df_rf[order(importance_df_rf$MeanDecreaseGini, decreasing = TRUE), ]
# Visualize Random Forest Feature Importance
library(ggplot2)
ggplot(importance_df_rf, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "skyblue") +
coord_flip() +
labs(title = "Feature Importance for Random Forest Model", x = "Feature", y = "Importance (Mean Decrease Gini)") +
theme_minimal()
Mean Decrease Gini (or Mean Decrease Accuracy) is used for calculating feature importance in Random Forest model trained using Repeated k-fold CV. CGPA and Total Score are the most important feature.
library(pROC)
## Warning: package 'pROC' was built under R version 4.4.2
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# Make predictions using the trained Random Forest model on the test set
predictions_rf <- predict(rf_model, newdata = test.data)
# Confusion Matrix for Random Forest
conf_matrix_rf <- confusionMatrix(predictions_rf, test.data$Chance.Level)
# Extract metrics from the confusion matrix
accuracy_rf <- conf_matrix_rf$overall['Accuracy']
precision_rf <- posPredValue(predictions_rf, test.data$Chance.Level)
recall_rf <- sensitivity(predictions_rf, test.data$Chance.Level)
f1_rf <- (2 * precision_rf * recall_rf) / (precision_rf + recall_rf)
# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_rf))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
auc_rf <- auc(roc_curve)
# Create a data frame to store the metrics
results_rf <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
Random_Forest = c(as.numeric(accuracy_rf), as.numeric(precision_rf),
as.numeric(recall_rf), as.numeric(f1_rf), as.numeric(auc_rf))
)
# Print the results
print(results_rf)
## Metric Random_Forest
## 1 Accuracy 0.8125000
## 2 Precision 0.8809524
## 3 Recall 0.7872340
## 4 F1-Score 0.8314607
## 5 AUC 0.8178594
The second model chosen is logistic regression which specialize in binary classification tasks.
Different cross-validation techniques are implemented in this step to assess the performance of linear regression classifier, while the best classifier will be selected for modelling.
After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.
# Fit Logistic Regression model
model_lr <- train(Chance.Level ~ ., data = train.data)
# Print the model summary
print(model_lr)
## Random Forest
##
## 320 samples
## 6 predictor
## 2 classes: 'HighChance', 'LowChance'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 320, 320, 320, 320, 320, 320, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.8233746 0.6313569
## 4 0.8167568 0.6171070
## 6 0.8116929 0.6068051
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Make predictions on the test set
predictions_lr <- predict(model_lr, newdata = test.data)
# Evaluate performance with a confusion matrix
conf_matrix_lr <- confusionMatrix(predictions_lr, test.data$Chance.Level)
print(conf_matrix_lr)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HighChance LowChance
## HighChance 37 5
## LowChance 10 28
##
## Accuracy : 0.8125
## 95% CI : (0.7097, 0.8911)
## No Information Rate : 0.5875
## P-Value [Acc > NIR] : 1.59e-05
##
## Kappa : 0.6217
##
## Mcnemar's Test P-Value : 0.3017
##
## Sensitivity : 0.7872
## Specificity : 0.8485
## Pos Pred Value : 0.8810
## Neg Pred Value : 0.7368
## Prevalence : 0.5875
## Detection Rate : 0.4625
## Detection Prevalence : 0.5250
## Balanced Accuracy : 0.8179
##
## 'Positive' Class : HighChance
##
Cross Validation for Logistic Regression
# k-fold
train_control <- trainControl(method="cv", number=10)
model_lr1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_lr2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_lr3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_lr4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")
# Extract performance metrics for each model
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
Accuracy = c(
max(model_lr1$results$Accuracy),
max(model_lr2$results$Accuracy),
max(model_lr3$results$Accuracy),
max(model_lr4$results$Accuracy)
),
Kappa = c(
max(model_lr1$results$Kappa),
max(model_lr2$results$Kappa),
max(model_lr3$results$Kappa),
max(model_lr4$results$Kappa)
)
)
# Print the comparison table
print(results)
## Model Accuracy Kappa
## 1 k-fold 0.8561248 0.7032586
## 2 Repeated k-fold 0.8521709 0.6940222
## 3 LOOCV 0.8562500 0.7034172
## 4 Bootstrap 0.8397092 0.6673948
# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: LOOCV"
Best Performing Logistic Regression
# Extract Logistic Regression Model from LOOCV
lr_model <- model_rf3$finalModel
# Extract Feature Importance using varImp from the caret package
importance_lr <- varImp(model_lr, scale = FALSE)
# Check the structure of the importance object
print(str(importance_lr))
## List of 3
## $ importance:'data.frame': 6 obs. of 1 variable:
## ..$ Overall: num [1:6] 15.28 14.56 15.57 52.13 7.81 ...
## $ model : chr "rf"
## $ calledFrom: chr "varImp"
## - attr(*, "class")= chr "varImp.train"
## NULL
# Convert the importance object to a data frame for visualization
importance_df_lr <- as.data.frame(importance_lr$importance)
importance_df_lr$Feature <- rownames(importance_df_lr)
# Sort by importance value (descending)
importance_df_lr <- importance_df_lr[order(importance_df_lr$Overall, decreasing = TRUE), ]
# Check the structure and data
print(str(importance_df_lr))
## 'data.frame': 6 obs. of 2 variables:
## $ Overall: num 52.1 45.6 15.6 15.3 14.6 ...
## $ Feature: chr "CGPA" "Total_Score" "LOR" "University.Rating" ...
## NULL
print(head(importance_df_lr))
## Overall Feature
## CGPA 52.130220 CGPA
## Total_Score 45.560152 Total_Score
## LOR 15.567068 LOR
## University.Rating 15.278563 University.Rating
## SOP 14.563066 SOP
## Research 7.808335 Research
# Visualize the feature importance using ggplot
ggplot(importance_df_lr, aes(x = reorder(Feature, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "lightgreen") +
coord_flip() +
labs(title = "Feature Importance for Logistic Regression Model",
x = "Feature", y = "Importance (Overall)") +
theme_minimal()
The varImp() function from the caret package is designed to extract feature importance directly from a trained model like logistic regression. CGPA and Total_Score shows the highest importance.
# Make predictions using the trained Logistic Regression model on the test set
predictions_lr <- predict(lr_model, newdata = test.data)
# Logistic Regression Metrics
conf_matrix_lr <- confusionMatrix(predictions_lr, test.data$Chance.Level)
accuracy_lr <- conf_matrix_lr$overall['Accuracy']
precision_lr <- posPredValue(predictions_lr, test.data$Chance.Level)
recall_lr <- sensitivity(predictions_lr, test.data$Chance.Level)
f1_lr <- (2 * precision_lr * recall_lr) / (precision_lr + recall_lr)
# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_lr))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
auc_lr <- auc(roc_curve)
results_lr <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
Logistic_Regression = c(as.numeric(accuracy_lr), as.numeric(precision_lr),
as.numeric(recall_lr), as.numeric(f1_lr), as.numeric(auc_lr))
)
# Print the results
print(results_lr)
## Metric Logistic_Regression
## 1 Accuracy 0.8250000
## 2 Precision 0.9024390
## 3 Recall 0.7872340
## 4 F1-Score 0.8409091
## 5 AUC 0.8330110
The third alternative chosen is the Support Vector Machine (SVM) Model, which specialize in finding the optimal hyperplane that separates different classes in the feature space.
Different cross-validation techniques are implemented in this step to assess the performance of SVM classifier, while the best classifier will be selected for modelling.
After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.
# Fit the SVM model
model_svm <- svm(Chance.Level ~ ., data = train.data, probability = TRUE)
# Print the model summary
print(model_svm)
##
## Call:
## svm(formula = Chance.Level ~ ., data = train.data, probability = TRUE)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
##
## Number of Support Vectors: 133
# Make predictions on the test set
predictions_svm <- predict(model_svm, newdata = test.data)
prob_predictions <- attr(predict(model_svm, test.data, probability = TRUE), "probabilities")[, 2]
# Evaluate performance with a confusion matrix
conf_matrix_svm <- confusionMatrix(predictions_svm, test.data$Chance.Level)
print(conf_matrix_svm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HighChance LowChance
## HighChance 37 4
## LowChance 10 29
##
## Accuracy : 0.825
## 95% CI : (0.7238, 0.9009)
## No Information Rate : 0.5875
## P-Value [Acc > NIR] : 4.97e-06
##
## Kappa : 0.6485
##
## Mcnemar's Test P-Value : 0.1814
##
## Sensitivity : 0.7872
## Specificity : 0.8788
## Pos Pred Value : 0.9024
## Neg Pred Value : 0.7436
## Prevalence : 0.5875
## Detection Rate : 0.4625
## Detection Prevalence : 0.5125
## Balanced Accuracy : 0.8330
##
## 'Positive' Class : HighChance
##
Cross Validation for SVM
library(kernlab)
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
# k-fold
train_control <- trainControl(method="cv", number=10)
model_svm1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_svm2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_svm3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_svm4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")
Best Performing SVM
# Extract performance metrics for each model
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
Accuracy = c(
max(model_svm1$results$Accuracy),
max(model_svm2$results$Accuracy),
max(model_svm3$results$Accuracy),
max(model_svm4$results$Accuracy)
),
Kappa = c(
max(model_svm1$results$Kappa),
max(model_svm2$results$Kappa),
max(model_svm3$results$Kappa),
max(model_svm4$results$Kappa)
)
)
# Print the comparison table
print(results)
## Model Accuracy Kappa
## 1 k-fold 0.8558101 0.7054645
## 2 Repeated k-fold 0.8511364 0.6946757
## 3 LOOCV 0.8593750 0.7121382
## 4 Bootstrap 0.8458848 0.6826663
# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: LOOCV"
# Extract SVM Model from LOOCV
svm_model <- model_svm3$finalModel
# performance function to calculate accuracy
performance_function <- function(model, data, labels) {
predictions <- predict(model, newdata = data)
accuracy <- sum(predictions == labels) / length(labels)
return(accuracy)
}
# SVM Permutation Feature Importance
importance_scores_svm <- numeric(ncol(train.data) - 1)
for (i in 1:(ncol(train.data) - 1)) {
shuffled_data <- train.data
shuffled_data[, i] <- sample(shuffled_data[, i])
baseline_accuracy <- performance_function(model_svm, train.data, train.data$Chance.Level)
shuffled_accuracy <- performance_function(model_svm, shuffled_data, shuffled_data$Chance.Level)
importance_scores_svm[i] <- baseline_accuracy - shuffled_accuracy
}
importance_df_svm <- data.frame(Feature = names(train.data)[-ncol(train.data)], Importance = importance_scores_svm)
importance_df_svm <- importance_df_svm[order(importance_df_svm$Importance, decreasing = TRUE), ]
# Visualize SVM Feature Importance
ggplot(importance_df_svm, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "orange") +
coord_flip() +
labs(title = "Feature Importance for SVM Model", x = "Feature", y = "Importance (Decrease in Accuracy)") +
theme_minimal()
Since SVM does not inherently provide a way to rank feature importance, permutation importance is used as a model-agnostic method to estimate the effect of each feature on model performance. Total_Score amd CGPA is still having the highest effect on the model.
# SVM created with train function is different
test.data_no_target <- test.data[, -which(names(test.data) == "Chance.Level")]
# Make predictions using the trained SVM model on the test set
predictions_svm <- predict(svm_model, newdata = test.data_no_target)
# SVM Metrics
conf_matrix_svm <- confusionMatrix(predictions_svm, test.data$Chance.Level)
accuracy_svm <- conf_matrix_svm$overall['Accuracy']
precision_svm <- posPredValue(predictions_svm, test.data$Chance.Level)
recall_svm <- sensitivity(predictions_svm, test.data$Chance.Level)
f1_svm <- (2 * precision_svm * recall_svm) / (precision_svm + recall_svm)
# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_svm))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
auc_svm <- auc(roc_curve)
results_svm <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
SVM = c(as.numeric(accuracy_svm), as.numeric(precision_svm),
as.numeric(recall_svm), as.numeric(f1_svm), as.numeric(auc_svm))
)
# Print the results
print(results_svm)
## Metric SVM
## 1 Accuracy 0.8375000
## 2 Precision 0.9047619
## 3 Recall 0.8085106
## 4 F1-Score 0.8539326
## 5 AUC 0.8436493
The last model chosen for classification is the KNN Model which is good in classifying a data point based on how its neighbors are classified.
Different cross-validation techniques are implemented in this step to assess the performance of KNN classifier, while the best classifier will be selected for modelling.
After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.
library(class)
# Fit k-NN model
predictions_knn <- knn(train = train.data[, -ncol(train.data)],
test = test.data[, -ncol(test.data)],
cl = train.data$Chance.Level
)
# Evaluate performance with a confusion matrix
conf_matrix_knn <- confusionMatrix(predictions_knn, test.data$Chance.Level)
print(conf_matrix_knn)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HighChance LowChance
## HighChance 38 12
## LowChance 9 21
##
## Accuracy : 0.7375
## 95% CI : (0.6271, 0.8296)
## No Information Rate : 0.5875
## P-Value [Acc > NIR] : 0.003784
##
## Kappa : 0.451
##
## Mcnemar's Test P-Value : 0.662521
##
## Sensitivity : 0.8085
## Specificity : 0.6364
## Pos Pred Value : 0.7600
## Neg Pred Value : 0.7000
## Prevalence : 0.5875
## Detection Rate : 0.4750
## Detection Prevalence : 0.6250
## Balanced Accuracy : 0.7224
##
## 'Positive' Class : HighChance
##
Cross Validation for KNN
# k-fold
train_control <- trainControl(method="cv", number=10)
model_knn1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_knn2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_knn3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_knn4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")
# Extract performance metrics for each model
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
Accuracy = c(
max(model_knn1$results$Accuracy),
max(model_knn2$results$Accuracy),
max(model_knn3$results$Accuracy),
max(model_knn4$results$Accuracy)
),
Kappa = c(
max(model_knn1$results$Kappa),
max(model_knn2$results$Kappa),
max(model_knn3$results$Kappa),
max(model_knn4$results$Kappa)
)
)
# Print the comparison table
print(results)
## Model Accuracy Kappa
## 1 k-fold 0.8346102 0.6549046
## 2 Repeated k-fold 0.8343465 0.6529221
## 3 LOOCV 0.8312500 0.6470588
## 4 Bootstrap 0.8169719 0.6164830
# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: k-fold"
Best Performing KNN
# Extract K-Nearest Neighbour Model from k-fold
knn_model <- model_knn1$finalModel
# Calculate feature importance using varImp()
importance_knn <- varImp(model_knn3, scale = FALSE)
# Print the importance of each feature
print(importance_knn)
## ROC curve variable importance
##
## Importance
## CGPA 0.9131
## Total_Score 0.9049
## University.Rating 0.8452
## SOP 0.8205
## LOR 0.8156
## Research 0.7409
# Visualize the importance using ggplot2
library(ggplot2)
ggplot(importance_knn, aes(x = reorder(Features, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Feature Importance for k-NN Model", x = "Feature", y = "Importance (Decrease in Accuracy)") +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
Unlike other models, k-Nearest Neighbors (kNN) doesn’t explicitly model feature importance. It makes predictions based on the majority class of the nearest neighbors, so it doesn’t produce direct feature importance scores. As a result, the feature importance values might seem to be quite similar, as they are not derived from a model that inherently ranks features.
# KNN created with train function is different
test.data_no_target <- test.data[, -which(names(test.data) == "Chance.Level")]
# Make predictions using the trained KNN model on the test set
predictions_knn <- predict(knn_model, newdata = test.data_no_target, type = "class")
# KNN Metrics
conf_matrix_knn <- confusionMatrix(predictions_knn, test.data$Chance.Level)
accuracy_knn <- conf_matrix_knn$overall['Accuracy']
precision_knn <- posPredValue(predictions_knn, test.data$Chance.Level)
recall_knn <- sensitivity(predictions_knn, test.data$Chance.Level)
f1_knn <- (2 * precision_knn * recall_knn) / (precision_knn + recall_knn)
# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_knn))
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
auc_knn <- auc(roc_curve)
results_knn <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
kNN = c(as.numeric(accuracy_knn), as.numeric(precision_knn),
as.numeric(recall_knn), as.numeric(f1_knn), as.numeric(auc_knn))
)
# Print the results
print(results_knn)
## Metric kNN
## 1 Accuracy 0.8125000
## 2 Precision 0.8478261
## 3 Recall 0.8297872
## 4 F1-Score 0.8387097
## 5 AUC 0.8088330
In this step, all the models are compared with their performance metrics to choose the most suitable model for classification tasks.
# Create a summary data frame
results_classification <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
Random_Forest = c(accuracy_rf, precision_rf, recall_rf, f1_rf, auc_rf),
Logistic_Regression = c(accuracy_lr, precision_lr, recall_lr, f1_lr, auc_lr),
SVM = c(accuracy_svm, precision_svm, recall_svm, f1_svm, auc_svm),
kNN = c(accuracy_knn, precision_knn, recall_knn, f1_knn, auc_knn)
)
# Print the results table
print(results_classification)
## Metric Random_Forest Logistic_Regression SVM kNN
## 1 Accuracy 0.8125000 0.8250000 0.8375000 0.8125000
## 2 Precision 0.8809524 0.9024390 0.9047619 0.8478261
## 3 Recall 0.7872340 0.7872340 0.8085106 0.8297872
## 4 F1-Score 0.8314607 0.8409091 0.8539326 0.8387097
## 5 AUC 0.8178594 0.8330110 0.8436493 0.8088330
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.2
# Reshape the data into a long format for ggplot
metrics_long_df <- results_classification %>%
pivot_longer(cols = -Metric, names_to = "Model", values_to = "Value")
# Visualize the metrics with ggplot
ggplot(metrics_long_df, aes(x = Metric, y = Value, fill = Model)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Performance Comparison", x = "Metric", y = "Score") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
SVM with Leave-one-out cross-validation (LOOCV) is the best model in terms of all metrics, making it the most effective for distinguishing between classes and capturing positives.
Data Splitting (80-20) for regression model
# Split the dataset into training and test sets (80-20 split)
set.seed(123) # Ensures reproducibility
split.index <- createDataPartition(dfr$Chance.of.Admit, p = 0.8, list = FALSE) #target column is 'Chance.of.Admit'
# Create the training set (80% of the data)
train_data <- dfr[split.index, ]
# Create the test set (remaining 20% of the data)
test_data <- dfr[-split.index, ]
Then, the Random Forest model is selected to capture complex relationships between our target variable (Chance of Admit) and the input features. Besides, this step allows us to identify the feature importance of this dataset via prediction.
To evaluate the model performance, various performance metrics are calculated, which includes: - RMSE (Root Mean Square Error): Quantifies error in model’s prediction. - R2 (Coefficient of Determination): Understand goodness of fit - MAE (Mean Absolute Error): Shows average model performance (less sensitive to outliers) - MAPE (Mean Absolute Percentage Error): Understands errors in terms of percentage.
Before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled.. Furthermore, a bar chart is plotted to visualize the importance of the variables.
Cross Validation for Random Forest
# k-fold
train_control <- trainControl(method="cv", number=10)
model_rfa <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_rfb <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_rfc <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_rfd <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")
Best Performing Random Forest
# Compare Performance Metrics
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
RMSE = c(
min(model_rfa$results$RMSE),
min(model_rfb$results$RMSE),
min(model_rfc$results$RMSE),
min(model_rfd$results$RMSE)
),
Rsquared = c(
max(model_rfa$results$Rsquared),
max(model_rfb$results$Rsquared),
max(model_rfc$results$Rsquared),
max(model_rfd$results$Rsquared)
)
)
# Print results table
print(results)
## Model RMSE Rsquared
## 1 k-fold 0.06488381 0.8007859
## 2 Repeated k-fold 0.06539835 0.7876448
## 3 LOOCV 0.06585309 0.7811270
## 4 Bootstrap 0.06742689 0.7719480
# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: k-fold"
# Feature Importance for the Best Model
rf_model <- model_rfa$finalModel
importance_rf <- varImp(model_rfb, scale = FALSE)
print(importance_rf)
## rf variable importance
##
## Overall
## CGPA 2.3576
## Total_Score 1.8087
## University.Rating 0.6978
## SOP 0.5016
## Research 0.4064
## LOR 0.2974
# Visualize Feature Importance
ggplot(importance_rf, aes(x = reorder(Variables, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "lightgreen") +
coord_flip() +
labs(title = "Feature Importance for Random Forest Model",
x = "Feature", y = "Importance") +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Define the MAPE function
MAPE <- function(predictions, actual) {
mean(abs((actual - predictions) / actual)) * 100
}
# Ensure predictions and target are numeric vectors
predictions_rf <- predict(rf_model, newdata = test_data)
target_rf <- as.numeric(test_data$Chance.of.Admit)
# Calculate Metrics
rmse_rf <- RMSE(predictions_rf, target_rf)
mae_rf <- MAE(predictions_rf, target_rf)
rsquared_rf <- cor(predictions_rf, target_rf)^2
mape_rf <- MAPE(predictions_rf, target_rf)
# Summarize Results
results_rf <- data.frame(
Metric = c("RMSE", "MAE", "R^2", "MAPE"),
Random_Forest = c(rmse_rf, mae_rf, rsquared_rf, mape_rf)
)
# Print Results
print(results_rf)
## Metric Random_Forest
## 1 RMSE 0.06933500
## 2 MAE 0.05126318
## 3 R^2 0.78999014
## 4 MAPE 8.39879037
The second regression model implemented is the linear regression model, which models the relationship between the target variable and all other predictors. The performance metrics used for model evaluation is similar to Random Forest Model, which are RSME, R2, MAE, and MAPE.
Before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled.
A bar chart is created to visualize the feature importance modeled by Linear Regression.
Cross Validation for Linear Regression
# k-fold
train_control <- trainControl(method="cv", number=10)
model_lm1 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_lm2 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_lm3 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_lm4 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")
# Compare Performance Metrics
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
RMSE = c(
min(model_lm1$results$RMSE),
min(model_lm2$results$RMSE),
min(model_lm3$results$RMSE),
min(model_lm4$results$RMSE)
),
Rsquared = c(
max(model_lm1$results$Rsquared),
max(model_lm2$results$Rsquared),
max(model_lm3$results$Rsquared),
max(model_lm4$results$Rsquared)
)
)
# Print results table
print(results)
## Model RMSE Rsquared
## 1 k-fold 0.06295156 0.8067221
## 2 Repeated k-fold 0.06322814 0.7995616
## 3 LOOCV 0.06415998 0.7921128
## 4 Bootstrap 0.06526121 0.7914113
# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: k-fold"
Best Performing Linear Regression
# Feature Importance for the Best Model
lm_model <- model_lm1$finalModel
importance_lm <- varImp(model_lm1, scale = FALSE)
print(importance_lm)
## lm variable importance
##
## Overall
## CGPA 8.721
## Total_Score 5.044
## Research 3.469
## LOR 3.396
## SOP 1.642
## University.Rating 1.541
# Visualize Feature Importance
ggplot(importance_lm, aes(x = reorder(Variables, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "purple") +
coord_flip() +
labs(title = "Feature Importance for Linear Regression Model",
x = "Feature", y = "Importance") +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Ensure predictions and target are numeric vectors
predictions_lm <- predict(lm_model, newdata = test_data)
target_lm <- as.numeric(test_data$Chance.of.Admit)
# Calculate Metrics
rmse_lm <- RMSE(predictions_lm, target_lm)
mae_lm <- MAE(predictions_lm, target_lm)
rsquared_lm <- cor(predictions_lm, target_lm)^2
mape_lm <- MAPE(predictions_lm, target_lm)
# Summarize Results
results_lm <- data.frame(
Metric = c("RMSE", "MAE", "R^2", "MAPE"),
Linear_Regression = c(rmse_lm, mae_lm, rsquared_lm, mape_lm)
)
# Print Results
print(results_lm)
## Metric Linear_Regression
## 1 RMSE 0.06738683
## 2 MAE 0.04930168
## 3 R^2 0.80110282
## 4 MAPE 8.03041013
The 3rd alternative regression model we used is the Support Vector Regression (SVR) Model, where the performance metrics are fixed as RSME, R2, MAE, and MAPE. Before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled. A bar chart is created to visualize the importance of each feature towards Chance of Admission too.
Cross Validation for SVR Model
# k-fold
train_control <- trainControl(method="cv", number=10)
model_svr1 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")
# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_svr2 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")
# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_svr3 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")
# bootstrap
train_control <- trainControl(method="boot", number=100)
model_svr4 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")
This step compares the performance metrics
# Compare Performance Metrics
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
RMSE = c(
min(model_svr1$results$RMSE),
min(model_svr2$results$RMSE),
min(model_svr3$results$RMSE),
min(model_svr4$results$RMSE)
),
Rsquared = c(
max(model_svr1$results$Rsquared),
max(model_svr2$results$Rsquared),
max(model_svr3$results$Rsquared),
max(model_svr4$results$Rsquared)
)
)
# Print results table
print(results)
## Model RMSE Rsquared
## 1 k-fold 0.06701107 0.7803287
## 2 Repeated k-fold 0.06864063 0.7721359
## 3 LOOCV 0.06718161 0.7765524
## 4 Bootstrap 0.06938481 0.7641694
# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: k-fold"
Best Performing SVR
# Feature Importance for the Best Model
svr_model <- model_svr1$finalModel
importance_svr <- varImp(model_svr2, scale = FALSE)
print(importance_svr)
## loess r-squared variable importance
##
## Overall
## CGPA 0.7605
## Total_Score 0.7045
## University.Rating 0.4934
## SOP 0.4171
## LOR 0.3849
## Research 0.3440
# Verify variable names in importance output
importance_svr$Variables <- rownames(importance_svr)
# Visualize Feature Importance
ggplot(importance_svr, aes(x = reorder(Variables, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "orange") +
coord_flip() +
labs(title = "Feature Importance for Support Vector Regression (SVR) Model",
x = "Feature", y = "Importance") +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
A Feature Importance Plot is created to visualize the importance of the features, where CGPA is evaluated as the most important feature for the target variable, followed by Total_Score.
# Ensure predictions and target are numeric vectors
predictions_svr <- predict(model_svr2, newdata = test_data)
target_svr <- as.numeric(test_data$Chance.of.Admit)
# Calculate Metrics
rmse_svr <- RMSE(predictions_svr, target_svr)
mae_svr <- MAE(predictions_svr, target_svr)
rsquared_svr <- ifelse(var(predictions_svr) > 0 && var(target_svr) > 0, cor(predictions_svr, target_svr)^2, NA)
mape_svr <- MAPE(predictions_svr, target_svr)
# Summarize Results
results_svr <- data.frame(
Metric = c("RMSE", "MAE", "R^2", "MAPE"),
Support_Vector_Regression = c(rmse_svr, mae_svr, rsquared_svr, mape_svr)
)
# Print Results
print(results_svr)
## Metric Support_Vector_Regression
## 1 RMSE 0.07388692
## 2 MAE 0.05429197
## 3 R^2 0.77866232
## 4 MAPE 9.26627097
The last regression model used is the Light Gradient Boosting Machine (LightGBM) model, which is particularly effective for structured/tabular data as it build decision trees sequentially, while each will improve on the previous one. Using this model, the feature importance will be computed and visualized through a bar chart too.
As usual, before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled.
Cross Validation for LightGBM
library(gbm)
## Warning: package 'gbm' was built under R version 4.4.2
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# k-fold Cross Validation
train_control <- trainControl(method = "cv", number = 10, verboseIter = FALSE)
model_gbm1 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)
# Repeated k-fold Cross Validation
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, verboseIter = FALSE)
model_gbm2 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)
# Leave-One-Out Cross Validation
train_control <- trainControl(method = "LOOCV")
model_gbm3 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)
# Bootstrap
train_control <- trainControl(method = "boot", number = 100)
model_gbm4 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)
# Compare Performance Metrics
results <- data.frame(
Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
RMSE = c(
min(model_gbm1$results$RMSE),
min(model_gbm2$results$RMSE),
min(model_gbm3$results$RMSE),
min(model_gbm4$results$RMSE)
),
Rsquared = c(
max(model_gbm1$results$Rsquared),
max(model_gbm2$results$Rsquared),
max(model_gbm3$results$Rsquared),
max(model_gbm4$results$Rsquared)
)
)
# Print results table
print(results)
## Model RMSE Rsquared
## 1 k-fold 0.06707669 0.7752814
## 2 Repeated k-fold 0.06649505 0.7826897
## 3 LOOCV 0.06695469 0.7737474
## 4 Bootstrap 0.06795659 0.7702729
# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))
## [1] "The best model is: Repeated k-fold"
Best Performing LightGBM
# Feature Importance for the Best Model
importance_gbm <- varImp(model_gbm2, scale = FALSE)
print(importance_gbm)
## gbm variable importance
##
## Overall
## CGPA 11.37258
## Total_Score 3.56790
## LOR 0.37719
## Research 0.31816
## SOP 0.29775
## University.Rating 0.09249
# Visualize Feature Importance
ggplot(importance_gbm, aes(x = reorder(Variables, Overall), y = Overall)) +
geom_bar(stat = "identity", fill = "lightblue") +
coord_flip() +
labs(title = "Feature Importance for GBM Model", x = "Feature", y = "Importance") +
theme_minimal()
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
# Ensure predictions and target are numeric vectors for GBM
predictions_gbm <- predict(model_gbm1, newdata = test_data)
target_gbm <- as.numeric(test_data$Chance.of.Admit)
# Calculate Metrics for GBM
rmse_gbm <- RMSE(predictions_gbm, target_gbm)
mae_gbm <- MAE(predictions_gbm, target_gbm)
rsquared_gbm <- cor(predictions_gbm, target_gbm)^2
mape_gbm <- MAPE(predictions_gbm, target_gbm)
# Summarize Results for GBM
results_gbm <- data.frame(
Metric = c("RMSE", "MAE", "R^2", "MAPE"),
Gradient_Boosting = c(rmse_gbm, mae_gbm, rsquared_gbm, mape_gbm)
)
# Print Results for GBM
print(results_gbm)
## Metric Gradient_Boosting
## 1 RMSE 0.07070301
## 2 MAE 0.05135666
## 3 R^2 0.78056142
## 4 MAPE 8.48132472
Evaluation of Regression Models The last step is to store all the models’ performance results in a summary table for comparison across all the metrics. Besides, bar plots are created to visualize the performance comparison of the models too, which refers to Random Forest Model, Linear Regression Model, and SVR Model. Meanwhile, the selected performance metrics for comparison are RMSE, R2, MAE, and MAPE.
# Combine Results into a Table
results_regression <- data.frame(
Model = c("Random Forest", "Linear Regression", "SVR", "LightGBM"),
RMSE = c(rmse_rf, rmse_lm, rmse_svr, rmse_gbm),
R2 = c(rsquared_rf, rsquared_lm, rsquared_svr, rsquared_gbm),
MAE = c(mae_rf, mae_lm, mae_svr, mae_gbm),
MAPE = c(mape_rf, mape_lm, mape_svr, mape_gbm)
)
print(results_regression)
## Model RMSE R2 MAE MAPE
## 1 Random Forest 0.06933500 0.7899901 0.05126318 8.398790
## 2 Linear Regression 0.06738683 0.8011028 0.04930168 8.030410
## 3 SVR 0.07388692 0.7786623 0.05429197 9.266271
## 4 LightGBM 0.07070301 0.7805614 0.05135666 8.481325
# Create Bar Plot for RMSE Comparison
par(mfrow=c(2, 2)) # Set up a 2x2 plot layout
barplot(results_regression$RMSE,
names.arg = results_regression$Model,
las = 2, col = "lightblue",
main = "RMSE Comparison of Models",
xlab = "Model", ylab = "RMSE",
cex.names = 0.8)
# Create Bar Plot for R^2 Comparison
barplot(results_regression$R2,
names.arg = results_regression$Model,
las = 2, col = "lightgreen",
main = "R^2 Comparison of Models",
xlab = "Model", ylab = "R^2",
cex.names = 0.8)
# Create Bar Plot for MAE Comparison
barplot(results_regression$MAE,
names.arg = results_regression$Model,
las = 2, col = "lightcoral",
main = "MAE Comparison of Models",
xlab = "Model", ylab = "MAE",
cex.names = 0.8)
# Create Bar Plot for MAPE Comparison
barplot(results_regression$MAPE,
names.arg = results_regression$Model,
las = 2, col = "orange",
main = "MAPE Comparison of Models",
xlab = "Model", ylab = "MAPE",
cex.names = 0.8)
As visualized, in general all model perform well with close value in all the metrics of RMSE, R2, MAE and MAPE. Linear Regression with k-fold outperforms others across all three metrics, with highest R^2 and lowest RMSE, MAE and MAPE, making it the best-performing model for this dataset.
Hence, in conclusion, Linear Regression Model is selected as the best model among all the regression models.
Comparison of Correlation and Feature Importance
Classification
chance_cor <- cor_matrix$r[,"Chance.Level", drop = FALSE] # Correlations
# Correlation table
cor_table <- data.frame(
Feature = rownames(chance_cor),
Correlation = chance_cor[, 1],
row.names = NULL # Remove row names
)
# Remove specific features and self-correlation
cor_table <- cor_table[!cor_table$Feature %in% c("GRE", "TOEFL", "Research_GPA_Interaction", "Chance.Level", "Chance.of.Admit"), ]
# Rank correlation values
cor_table <- cor_table[order(-cor_table$Correlation), ]
cor_table$Correlation_Rank <- 1:nrow(cor_table)
# SVM Feature importance table (already computed)
importance_table_svm <- importance_df_svm
importance_table_svm$Importance_Rank <- 1:nrow(importance_table_svm)
# Merge both tables for comparison
comparison_table <- merge(cor_table, importance_table_svm, by = "Feature", all = TRUE)
comparison_table <- comparison_table %>%
arrange(Correlation_Rank) %>%
mutate(
Correlation = round(Correlation, 3),
Importance = round(Importance, 3)
)
# View the comparison without row numbers
print(comparison_table, row.names = FALSE)
## Feature Correlation Correlation_Rank Importance Importance_Rank
## CGPA 0.699 1 0.044 2
## Total_Score 0.689 2 0.075 1
## University.Rating 0.586 3 -0.009 6
## SOP 0.542 4 0.006 5
## LOR 0.518 5 0.028 3
## Research 0.503 6 0.012 4
Total_Score and CGPA are both highly correlated with the target variable and are very important in predicting the target in the SVM model. Both features are key factors for the model’s performance. Other features with moderate correlation but low importance might be less useful in predicting the target. SOP and Research have low importance in the SVM model, with SOP showing a correlation but no impact on model performance. University Rating and LOR have moderate correlations and play an intermediate role in the model, but their importance in the SVM model is not as high.
Regression
chance_cor <- cor_matrix$r[,"Chance.of.Admit", drop = FALSE] # Correlations
# Correlation table
cor_table <- data.frame(
Feature = rownames(chance_cor),
Correlation = chance_cor[, 1],
row.names = NULL # Remove row names
)
# Remove specific features and self-correlation
cor_table <- cor_table[!cor_table$Feature %in% c("GRE", "TOEFL", "Research_GPA_Interaction", "Chance.Level", "Chance.of.Admit"), ]
# Rank correlation values
cor_table <- cor_table[order(-cor_table$Correlation), ]
cor_table$Correlation_Rank <- 1:nrow(cor_table)
# Feature importance table
importance_lm <- varImp(model_lm1, scale = FALSE)$importance
importance_table <- data.frame(
Feature = rownames(importance_lm),
Importance = importance_lm$Overall,
row.names = NULL # Remove row names
)
# Rank feature importance
importance_table <- importance_table[order(-importance_table$Importance), ]
importance_table$Importance_Rank <- 1:nrow(importance_table)
# Merge both tables for comparison
comparison_table <- merge(cor_table, importance_table, by = "Feature", all = TRUE)
comparison_table <- comparison_table %>%
arrange(Correlation_Rank) %>%
mutate(
Correlation = round(Correlation, 3),
Importance = round(Importance, 3)
)
# View the comparison
print(comparison_table, row.names = FALSE)
## Feature Correlation Correlation_Rank Importance Importance_Rank
## CGPA 0.878 1 8.721 1
## Total_Score 0.839 2 5.044 2
## University.Rating 0.732 3 1.541 6
## SOP 0.692 4 1.642 5
## LOR 0.640 5 3.396 4
## Research 0.582 6 3.469 3
The regression analysis indicates that CGPA and Total_Score are the most influential predictors of the target variable, with the highest correlation (0.878 and 0.839, respectively) and importance scores (8.721 and 5.044), suggesting that academic performance is strongly linked to the outcome. University Rating and SOP also show moderate correlations and moderate importance, but they have a less significant impact compared to CGPA and Total_Score. LOR and Research exhibit weaker correlations (0.640 and 0.582), but their importance ranks (4 and 3) suggest that they still play a meaningful role in the model’s predictions.
Overall, academic achievements, such as CGPA and Total_Score, dominate the classification and regression model, while other factors like University Rating, LOR, and Research contribute to the prediction but with less influence.
Classification Model
The performance metrics of four classification models—Random Forest, Logistic Regression, SVM, and kNN—are compared across several evaluation criteria.Overall, SVM demonstrates the strongest overall performance, particularly in terms of precision, recall, F1-Score, and AUC, while Logistic Regression and kNN also provide strong results.
# Classification Model
print(results_classification)
## Metric Random_Forest Logistic_Regression SVM kNN
## 1 Accuracy 0.8125000 0.8250000 0.8375000 0.8125000
## 2 Precision 0.8809524 0.9024390 0.9047619 0.8478261
## 3 Recall 0.7872340 0.7872340 0.8085106 0.8297872
## 4 F1-Score 0.8314607 0.8409091 0.8539326 0.8387097
## 5 AUC 0.8178594 0.8330110 0.8436493 0.8088330
Regression Model
The performance metrics of four regression models—Random Forest, Linear Regression, SVR, and LightGBM—are compared across several evaluation criteria Overall, Linear Regression is the top performer, with Random Forest and LightGBM also providing strong results.
# Regression Model
print(results_regression)
## Model RMSE R2 MAE MAPE
## 1 Random Forest 0.06933500 0.7899901 0.05126318 8.398790
## 2 Linear Regression 0.06738683 0.8011028 0.04930168 8.030410
## 3 SVR 0.07388692 0.7786623 0.05429197 9.266271
## 4 LightGBM 0.07070301 0.7805614 0.05135666 8.481325
Summary
In conclusion, SVM is the top-performing model in classification tasks, excelling in Precision, Recall, F1-Score, and AUC, followed closely by Logistic Regression and kNN. For regression tasks, Linear Regression leads with the best performance in RMSE, R2,MAE and MAPE, slightly outperforming Random Forest, with LightGBM also showing strong results.
Meanwhile, feature importance analysis shows that features like CGPA and Total_Score are the most influential in both classification and regression models. The academic performance in CGPA and Total_Score are the key predictors that significantly influence graduate school admission outcomes.