1.0 Introduction

The purpose of this project is to analyze a dataset on graduate school admissions to uncover patterns and insights that can aid students in understanding their chances of admission. By leveraging statistical and machine learning techniques, the project aims to predict admission probabilities and categorize students into likelihood tiers (high or low). This analysis can provide valuable guidance for prospective graduate students seeking to optimize their applications.

1.1 Research Objectives

To predict graduate school admission probabilities using a regression model based on the factors.
To classify students into high and low admission chance level categories to develop a classification model that leverages the same factors.
To identify the key predictors that significantly influence graduate school admission outcomes.
To evaluate and compare the performance of different regression and classification algorithms in accurately predicting admission outcomes.

1.2 R Markdown Setup

library(rmarkdown)
library(knitr)

knitr::opts_chunk$set(echo = TRUE)
set.seed(123)

2.0 Data Collection

Dataset Details

Our data is obtained from https://www.kaggle.com/datasets/mohansacharya/graduate-admissions

df <- read.csv("Admission_Predict.csv")

# Preview the first few rows of the dataset
head(df)

##   Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 1          1       337         118                 4                  4.5
## 2          2       324         107                 4                  4.0
## 3          3       316         104                 3                  3.0
## 4          4       322         110                 3                  3.5
## 5          5       314         103                 2                  2.0
## 6          6       330         115                 5                  4.5
##   Letter.of.Recommendation CGPA Research Chance.of.Admit
## 1                      4.5 9.65        1            0.92
## 2                      4.5 8.87        1            0.76
## 3                      3.5 8.00        1            0.72
## 4                      2.5 8.67        1            0.80
## 5                      3.0 8.21        0            0.65
## 6                      3.0 9.34        1            0.90

str(df)

## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score               : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score             : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating       : int  4 4 3 3 2 5 3 2 1 3 ...
##  $ Statement.of.Purpose    : num  4.5 4 3 3.5 2 4.5 NA 3 2 3.5 ...
##  $ Letter.of.Recommendation: num  4.5 4.5 3.5 2.5 3 3 4 4 NA 3 ...
##  $ CGPA                    : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research                : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit         : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

summary(df)

##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:100.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :200.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :200.5   Mean   :316.8   Mean   :107.4   Mean   :3.087    
##  3rd Qu.:300.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :400.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##                                                                   
##  Statement.of.Purpose Letter.of.Recommendation      CGPA          Research     
##  Min.   :1.000        Min.   :1.00             Min.   :6.800   Min.   :0.0000  
##  1st Qu.:2.500        1st Qu.:3.00             1st Qu.:8.170   1st Qu.:0.0000  
##  Median :3.500        Median :3.50             Median :8.610   Median :1.0000  
##  Mean   :3.416        Mean   :3.45             Mean   :8.599   Mean   :0.5475  
##  3rd Qu.:4.000        3rd Qu.:4.00             3rd Qu.:9.062   3rd Qu.:1.0000  
##  Max.   :5.000        Max.   :5.00             Max.   :9.920   Max.   :1.0000  
##  NA's   :18           NA's   :19                                               
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6400  
##  Median :0.7300  
##  Mean   :0.7244  
##  3rd Qu.:0.8300  
##  Max.   :0.9700  
##

3.0 Data Cleaning and Preprocessing

3.1 Handling Missing Values

# Check if there are any missing values in the data frame
any_missing <- any(is.na(df))
cat("Are there any missing values in the dataset? ", any_missing, "\n")

## Are there any missing values in the dataset?  TRUE

# Count the total number of missing values in the dataset
total_missing <- sum(is.na(df))
cat("Total number of missing values in the dataset: ", total_missing, "\n")

## Total number of missing values in the dataset:  37

# Count missing values for each column
missing_per_column <- colSums(is.na(df))
cat("Missing values per column:\n")

## Missing values per column:

print(missing_per_column)

##               Serial.No.                GRE.Score              TOEFL.Score 
##                        0                        0                        0 
##        University.Rating     Statement.of.Purpose Letter.of.Recommendation 
##                        0                       18                       19 
##                     CGPA                 Research          Chance.of.Admit 
##                        0                        0                        0

# Display rows with missing values
rows_with_na <- df[!complete.cases(df), ]
cat("Rows with missing values:\n")

## Rows with missing values:

print(rows_with_na)

##     Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 7            7       321         109                 3                   NA
## 9            9       302         102                 1                  2.0
## 14          14       307         109                 3                   NA
## 21          21       312         107                 3                   NA
## 24          24       334         119                 5                  5.0
## 37          37       299         106                 2                   NA
## 49          49       321         110                 3                  3.5
## 67          67       327         114                 3                   NA
## 77          77       327         112                 3                   NA
## 80          80       294          93                 1                  1.5
## 93          93       298          98                 2                  4.0
## 95          95       303          99                 3                   NA
## 105        105       326         112                 3                  3.5
## 113        113       301         107                 3                   NA
## 128        128       319         112                 3                  2.5
## 140        140       318         109                 1                   NA
## 155        155       326         108                 3                   NA
## 170        170       311          99                 2                  2.5
## 182        182       305         107                 2                  2.5
## 194        194       336         118                 5                  4.5
## 203        203       340         120                 5                  4.5
## 215        215       331         117                 4                  4.5
## 224        224       308         109                 2                  3.0
## 232        232       319         106                 3                   NA
## 233        233       312         107                 2                  2.5
## 244        244       325         114                 3                   NA
## 253        253       318         100                 2                   NA
## 262        262       312         104                 3                   NA
## 274        274       312          99                 1                   NA
## 285        285       340         112                 4                  5.0
## 294        294       312          98                 1                  3.5
## 303        303       322         105                 2                  3.0
## 311        311       320         104                 3                   NA
## 312        312       328         108                 4                  4.5
## 325        325       315         104                 3                   NA
## 332        332       311         105                 2                   NA
##     Letter.of.Recommendation CGPA Research Chance.of.Admit
## 7                        4.0 8.20        1            0.75
## 9                         NA 8.00        0            0.50
## 14                       3.0 8.00        1            0.62
## 21                       2.0 7.90        1            0.64
## 24                        NA 9.70        1            0.95
## 37                       4.0 8.40        0            0.64
## 49                        NA 8.85        1            0.82
## 67                       3.0 9.02        0            0.61
## 77                       3.0 8.72        1            0.74
## 80                        NA 7.36        0            0.46
## 93                        NA 8.03        0            0.34
## 95                       2.5 7.66        0            0.36
## 105                       NA 9.05        1            0.74
## 113                      3.5 8.34        1            0.62
## 128                       NA 8.71        1            0.78
## 140                      3.5 9.12        0            0.78
## 155                       NA 8.89        0            0.80
## 170                       NA 7.98        0            0.65
## 182                       NA 8.42        0            0.71
## 194                       NA 9.53        1            0.94
## 203                       NA 9.91        1            0.97
## 215                       NA 9.42        1            0.94
## 224                       NA 8.45        0            0.71
## 232                      2.5 8.33        1            0.74
## 233                       NA 8.27        0            0.69
## 244                      3.0 9.04        1            0.76
## 253                      3.5 8.54        1            0.71
## 262                      4.0 8.09        0            0.71
## 274                      1.5 8.01        1            0.52
## 285                       NA 9.66        1            0.94
## 294                       NA 8.18        1            0.64
## 303                       NA 8.45        1            0.65
## 311                      3.5 8.74        1            0.76
## 312                       NA 9.18        1            0.84
## 325                      2.5 8.33        0            0.67
## 332                      2.0 8.12        1            0.73

Replace NA Values

# Replace NA values with 0 in 'Statement of Purpose' and 'Letter of Recommendation'
df$`Statement.of.Purpose`[is.na(df$`Statement.of.Purpose`)] <- 0
df$`Letter.of.Recommendation`[is.na(df$`Letter.of.Recommendation`)] <- 0

# Check if the missing values have been replaced successfully
cat("Missing values in 'Statement of Purpose':", sum(is.na(df$`Statement.of.Purpose`)), "\n")

## Missing values in 'Statement of Purpose': 0

cat("Missing values in 'Letter of Recommendation':", sum(is.na(df$`Letter.of.Recommendation`)), "\n")

## Missing values in 'Letter of Recommendation': 0

# Visualizations (Histograms and Boxplots) for distribution and to choose mean/median

# SOP Histogram
hist(df$`Statement.of.Purpose`,
     main = "Statement of Purpose (SOP) Distribution",
     xlab = "SOP",
     col = "lightblue",
     border = "black")

# LOR Histogram
hist(df$`Letter.of.Recommendation`,
     main = "Letter of Recommendation (LOR) Distribution",
     xlab = "LOR",
     col = "lightgreen",
     border = "black")

# SOP Boxplot
boxplot(df$`Statement.of.Purpose`,
        main = "Statement of Purpose (SOP) Boxplot",
        ylab = "SOP",
        col = "lightblue",
        border = "blue",
        horizontal = TRUE)

# LOR Boxplot
boxplot(df$`Letter.of.Recommendation`,
        main = "Letter of Recommendation (LOR) Boxplot",
        ylab = "LOR",
        col = "lightgreen",
        border = "green",
        horizontal = TRUE)

Histogram (First Plot):

The data distribution is slightly skewed to the left with a significant amount of data clustered towards the higher SOP values (3 to 5). There is a small concentration of values at 0, likely due to replacing NA values. Boxplot (Second Plot):

The median (thick blue line inside the box) is closer to the center of the interquartile range (IQR), but there is a noticeable outlier at 0. The presence of an outlier suggests that the mean might be pulled down due to the replacement of NA values with 0. Decision: Median is more appropriate Since the mean is sensitive to outliers and skewed data, the median provides a better measure of central tendency in this case. The outlier at 0 (from replacing NA) will distort the mean, but the median remains robust.

Replace 0 values with the median In order to ensure that the distribution of the dataset remains more consistent, median imputation technique is selected instead of mean imputation techniques. Reasons as follows: - Median remains more central to the majority of data points without skewing it away. - Median is less sensitive to outliers. - Median is suitable when the data is continous, not categorical (mode is more suitable).

Target columns: (1) Statement of Purpose, (2) Letter of Recommendation

# Replace 0 values with the median for specified columns
df$Statement.of.Purpose[df$Statement.of.Purpose == 0] <- median(df$Statement.of.Purpose[df$Statement.of.Purpose != 0], na.rm = TRUE)
df$Letter.of.Recommendation[df$Letter.of.Recommendation == 0] <- median(df$Letter.of.Recommendation[df$Letter.of.Recommendation != 0], na.rm = TRUE)

# Display the updated dataframe (first 20 rows)
head(df, 20)

##    Serial.No. GRE.Score TOEFL.Score University.Rating Statement.of.Purpose
## 1           1       337         118                 4                  4.5
## 2           2       324         107                 4                  4.0
## 3           3       316         104                 3                  3.0
## 4           4       322         110                 3                  3.5
## 5           5       314         103                 2                  2.0
## 6           6       330         115                 5                  4.5
## 7           7       321         109                 3                  3.5
## 8           8       308         101                 2                  3.0
## 9           9       302         102                 1                  2.0
## 10         10       323         108                 3                  3.5
## 11         11       325         106                 3                  3.5
## 12         12       327         111                 4                  4.0
## 13         13       328         112                 4                  4.0
## 14         14       307         109                 3                  3.5
## 15         15       311         104                 3                  3.5
## 16         16       314         105                 3                  3.5
## 17         17       317         107                 3                  4.0
## 18         18       319         106                 3                  4.0
## 19         19       318         110                 3                  4.0
## 20         20       303         102                 3                  3.5
##    Letter.of.Recommendation CGPA Research Chance.of.Admit
## 1                       4.5 9.65        1            0.92
## 2                       4.5 8.87        1            0.76
## 3                       3.5 8.00        1            0.72
## 4                       2.5 8.67        1            0.80
## 5                       3.0 8.21        0            0.65
## 6                       3.0 9.34        1            0.90
## 7                       4.0 8.20        1            0.75
## 8                       4.0 7.90        0            0.68
## 9                       3.5 8.00        0            0.50
## 10                      3.0 8.60        0            0.45
## 11                      4.0 8.40        1            0.52
## 12                      4.5 9.00        1            0.84
## 13                      4.5 9.10        1            0.78
## 14                      3.0 8.00        1            0.62
## 15                      2.0 8.20        1            0.61
## 16                      2.5 8.30        0            0.54
## 17                      3.0 8.70        0            0.66
## 18                      3.0 8.00        1            0.65
## 19                      3.0 8.80        0            0.63
## 20                      3.0 8.50        0            0.62

Calculate skewness for each column

library(e1071)

## Warning: package 'e1071' was built under R version 4.4.2

sop_skewness <- skewness(df$`Statement.of.Purpose`, na.rm = TRUE)
lor_skewness <- skewness(df$`Letter.of.Recommendation`, na.rm = TRUE)
gre_skewness <- skewness(df$`GRE.Score`, na.rm = TRUE)
toefl_skewness <- skewness(df$`TOEFL.Score`, na.rm = TRUE)
cgpa_skewness <- skewness(df$`CGPA`, na.rm = TRUE)
admit_skewness <- skewness(df$`Chance.of.Admit`, na.rm = TRUE)

cat("SOP Skewness:", sop_skewness, "\n")

## SOP Skewness: -0.3007009

cat("LOR Skewness:", lor_skewness, "\n")

## LOR Skewness: -0.1174626

cat("GRE Skewness:", gre_skewness, "\n")

## GRE Skewness: -0.06242254

cat("TOEEFL Skewness:", toefl_skewness, "\n")

## TOEEFL Skewness: 0.05678751

cat("CGPA Skewness:", cgpa_skewness, "\n")

## CGPA Skewness: -0.06549644

cat("Chance of Admit Skewness:", admit_skewness, "\n")

## Chance of Admit Skewness: -0.3508017

# Check for negative values in SOP and LOR columns
sum(df$`Statement.of.Purpose` < 0, na.rm = TRUE)

## [1] 0

sum(df$`Letter.of.Recommendation` < 0, na.rm = TRUE)

## [1] 0

Summary of Skewness: - Negative skewed (Left-skewed): Statement of Purpose, Chance of Admit - Slightly Negative skewed (Left-skewed): Letter of Recommendation, GRE, CGPA - Slightly Positive skewed (Right-skewed): TOEEFL

3.2 Feature Engineering

# Rename columns
colnames(df)[colnames(df) == "Statement.of.Purpose"] <- "SOP"
colnames(df)[colnames(df) == "Letter.of.Recommendation"] <- "LOR"
colnames(df)[colnames(df) == "GRE.Score"] <- "GRE"
colnames(df)[colnames(df) == "TOEFL.Score"] <- "TOEFL"

# Check the updated column names
colnames(df)

## [1] "Serial.No."        "GRE"               "TOEFL"            
## [4] "University.Rating" "SOP"               "LOR"              
## [7] "CGPA"              "Research"          "Chance.of.Admit"

# Calculate Total Score (GRE + TOEFL)
df$Total_Score <- df$GRE + df$TOEFL

# Calculate Research Experience & GPA Interaction
df$Research_GPA_Interaction <- df$Research * df$CGPA

# Binning Chance of Admit into categorical column
df$Chance.Level <- ifelse(df$Chance.of.Admit > 0.7, "1", # Class "High"
                             "0") # Class "Low"

Total_Score helps you understand the combined performance of the student in both GRE and TOEFL. Research_GPA_Interaction captures the effect of research experience combined with the academic performance (CGPA). This feature could help predict how research experience might influence a student’s chance of admission when combined with their GPA. New column Chance.Level is converting continuous variable of chance of admit into a categorical numeric classes(1 for High, 0 for Low) in order to proceed with the classification modeling.

# Remove Serial.No column
df <- df[, !colnames(df) %in% "Serial.No."]
df <- df[, c(setdiff(colnames(df), "Chance.of.Admit"), "Chance.of.Admit")]

head(df, 5)

##   GRE TOEFL University.Rating SOP LOR CGPA Research Total_Score
## 1 337   118                 4 4.5 4.5 9.65        1         455
## 2 324   107                 4 4.0 4.5 8.87        1         431
## 3 316   104                 3 3.0 3.5 8.00        1         420
## 4 322   110                 3 3.5 2.5 8.67        1         432
## 5 314   103                 2 2.0 3.0 8.21        0         417
##   Research_GPA_Interaction Chance.Level Chance.of.Admit
## 1                     9.65            1            0.92
## 2                     8.87            1            0.76
## 3                     8.00            1            0.72
## 4                     8.67            1            0.80
## 5                     0.00            0            0.65

4.0 Exploratory Data Analysis (EDA)

4.1 Correlation matrix

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.4.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(Hmisc)

## Warning: package 'Hmisc' was built under R version 4.4.2

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following object is masked from 'package:e1071':
## 
##     impute

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(corrplot)

## Warning: package 'corrplot' was built under R version 4.4.2

## corrplot 0.95 loaded

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.4.2

library(gridExtra)

## Warning: package 'gridExtra' was built under R version 4.4.2

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(patchwork)

## Warning: package 'patchwork' was built under R version 4.4.2

# Remove Serial.No column
df <- df[, !colnames(df) %in% "Serial.No."]
df <- df[, c(setdiff(colnames(df), "Chance.of.Admit"), "Chance.of.Admit")]

# Compute correlation matrix using Spearman for mixed types
cor_matrix <- rcorr(as.matrix(df), type = "spearman")

# Extract correlation coefficients
corr <- cor_matrix$r
corr

##                                GRE     TOEFL University.Rating       SOP
## GRE                      1.0000000 0.8318602         0.6762653 0.6220924
## TOEFL                    0.8318602 1.0000000         0.6968676 0.6462887
## University.Rating        0.6762653 0.6968676         1.0000000 0.7392971
## SOP                      0.6220924 0.6462887         0.7392971 1.0000000
## LOR                      0.5172537 0.5209534         0.6238022 0.6926797
## CGPA                     0.8318480 0.8257200         0.7505625 0.7223057
## Research                 0.5959110 0.5043224         0.4541311 0.4523064
## Total_Score              0.9788624 0.9225135         0.7056926 0.6542414
## Research_GPA_Interaction 0.7549828 0.6862895         0.6227475 0.6022060
## Chance.Level             0.6711266 0.6357582         0.5863705 0.5424313
## Chance.of.Admit          0.8153521 0.7955730         0.7319766 0.6915088
##                                LOR      CGPA  Research Total_Score
## GRE                      0.5172537 0.8318480 0.5959110   0.9788624
## TOEFL                    0.5209534 0.8257200 0.5043224   0.9225135
## University.Rating        0.6238022 0.7505625 0.4541311   0.7056926
## SOP                      0.6926797 0.7223057 0.4523064   0.6542414
## LOR                      1.0000000 0.6371129 0.3854145   0.5356037
## CGPA                     0.6371129 1.0000000 0.5302649   0.8607412
## Research                 0.3854145 0.5302649 1.0000000   0.5852178
## Total_Score              0.5356037 0.8607412 0.5852178   1.0000000
## Research_GPA_Interaction 0.5096447 0.7601049 0.9050681   0.7545088
## Chance.Level             0.5179281 0.6990713 0.5033846   0.6894465
## Chance.of.Admit          0.6396353 0.8784026 0.5817423   0.8389873
##                          Research_GPA_Interaction Chance.Level Chance.of.Admit
## GRE                                     0.7549828    0.6711266       0.8153521
## TOEFL                                   0.6862895    0.6357582       0.7955730
## University.Rating                       0.6227475    0.5863705       0.7319766
## SOP                                     0.6022060    0.5424313       0.6915088
## LOR                                     0.5096447    0.5179281       0.6396353
## CGPA                                    0.7601049    0.6990713       0.8784026
## Research                                0.9050681    0.5033846       0.5817423
## Total_Score                             0.7545088    0.6894465       0.8389873
## Research_GPA_Interaction                1.0000000    0.5926343       0.7691904
## Chance.Level                            0.5926343    1.0000000       0.8529164
## Chance.of.Admit                         0.7691904    0.8529164       1.0000000

corrplot.mixed(corr,
               upper = "circle",           # Upper triangle with squares
               lower = "number",           # Lower triangle with correlation coefficients
               addgrid.col = "black",      # Add gridlines
               tl.col = "black",           # Text color
               tl.cex = 0.6,               # Adjust text size
               number.cex = 0.6,           # Adjust size of numbers in the lower triangle
               tl.pos = "lt",              # Rotate text to 45 degrees for better fit
               title = "Spearman Correlation Matrix",
               mar = c(2, 2, 2, 2))

4.2 Distributions for numerical variables

# GRE
gre_hist <- ggplot(df, aes(x = GRE)) +
  geom_histogram(binwidth = 5, fill = "darkseagreen", color = "black") +
  labs(title = "Distribution of GRE Scores", x = "GRE", y = "Count")

stats_gre <- summary(df$GRE)
q1 <- stats_gre["1st Qu."]
median <- stats_gre["Median"]
q3 <- stats_gre["3rd Qu."]

gre_box <- ggplot(df, aes(x = factor(1), y = GRE)) +
  geom_boxplot(fill = "darkseagreen", width = 0.4) +
  annotate("text", x = 1.4, y = q1, label = paste0("Q1: ", round(q1, 1)), color = "blue", size = 4) +
  annotate("text", x = 1.4, y = median, label = paste0("Median: ", round(median, 1)), color = "red", size = 4) +
  annotate("text", x = 1.4, y = q3, label = paste0("Q3: ", round(q3, 1)), color = "darkgreen", size = 4) +
  labs(title = "Boxplot of GRE Scores", y = "GRE Scores", x = NULL) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

grid.arrange(gre_hist, gre_box, ncol = 2)

# TOEFL
toefl_hist <- ggplot(df, aes(x = TOEFL)) +
  geom_histogram(binwidth = 3, fill = "cadetblue2", color = "black") +
  labs(title = "Distribution of TOEFL Scores", x = "TOEFL", y = "Count")

stats_toefl <- summary(df$TOEFL)
q1_toefl <- stats_toefl["1st Qu."]
median_toefl <- stats_toefl["Median"]
q3_toefl <- stats_toefl["3rd Qu."]

toefl_box <- ggplot(df, aes(x = factor(1), y = TOEFL)) +
  geom_boxplot(fill = "cadetblue2", width = 0.4) +
  annotate("text", x = 1.3, y = q1_toefl, label = paste0("Q1: ", round(q1_toefl, 1)), color = "blue", size = 4, hjust = 0) +
  annotate("text", x = 1.3, y = median_toefl, label = paste0("Median: ", round(median_toefl, 1)), color = "red", size = 4, hjust = 0) +
  annotate("text", x = 1.3, y = q3_toefl, label = paste0("Q3: ", round(q3_toefl, 1)), color = "darkgreen", size = 4, hjust = 0) +
  labs(title = "Boxplot of TOEFL Scores", y = "TOEFL", x = NULL) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

grid.arrange(toefl_hist, toefl_box, ncol = 2)

# CGPA
cgpa_hist <- ggplot(df, aes(x = CGPA)) +
  geom_histogram(binwidth = 0.3, fill = "cyan4", color = "black") +
  labs(title = "Distribution of CGPA Scores", x = "CGPA", y = "Count")

stats_cgpa <- summary(df$CGPA)
q1_cgpa <- stats_cgpa["1st Qu."]
median_cgpa <- stats_cgpa["Median"]
q3_cgpa <- stats_cgpa["3rd Qu."]

cgpa_box <- ggplot(df, aes(x = factor(1), y = CGPA)) +
  geom_boxplot(fill = "cyan4", width = 0.3) +
  annotate("text", x = 1.2, y = q1_cgpa, label = paste0("Q1: ", round(q1_cgpa, 2)), color = "blue", size = 4, hjust = 0) +
  annotate("text", x = 1.2, y = median_cgpa, label = paste0("Median: ", round(median_cgpa, 2)), color = "red", size = 4, hjust = 0) +
  annotate("text", x = 1.2, y = q3_cgpa, label = paste0("Q3: ", round(q3_cgpa, 2)), color = "darkgreen", size = 4, hjust = 0) +
  labs(title = "Boxplot of CGPA", y = "CGPA", x = NULL) +
  scale_y_continuous(limits = c(6, 10)) +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

grid.arrange(cgpa_hist, cgpa_box, ncol = 2)

4.3 Distribution of categorical variables

# Proportion of students with research experience
bar_research <- df %>%
  count(Research) %>%
  mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar

dist_research<-ggplot(bar_research, aes(x = factor(Research), y = n, fill = color)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
  scale_fill_identity() + # Use the color column directly
  labs(
    title = "Distribution of Students by Research Experience",
    x = "Research Experience (0: No, 1: Yes)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
    axis.title.x = element_text(size = 8),
    axis.title.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8)
  )+
  ylim(0, max(bar_research$n) * 1)


# Proportion of students with LOR
bar_LOR <- df %>%
  count(LOR) %>%
  mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar

dist_LOR<-ggplot(bar_LOR, aes(x = factor(LOR), y = n, fill = color)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
  scale_fill_identity() + # Use the color column directly
  labs(
    title = "Distribution of Students by Letter of Recommendation Strength (LOR)",
    x = "Letter of Recommendation Strength (LOR)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
    axis.title.x = element_text(size = 8),
    axis.title.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8)
  )+
  ylim(0, max(bar_LOR$n) * 1)

# Proportion of students with SOP
bar_SOP <- df %>%
  count(SOP) %>%
  mutate(color = ifelse(n == max(n), "red", "orange")) # Highlight the highest bar

dist_SOP<-ggplot(bar_SOP, aes(x = factor(SOP), y = n, fill = color)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
  scale_fill_identity() + # Use the color column directly
  labs(
    title = "Distribution of Students by Statement of Purpose Strength (SOP)",
    x = "Statement of Purpose Strength (SOP)",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
    axis.title.x = element_text(size = 8),
    axis.title.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8)
  )+
  ylim(0, max(bar_SOP$n) * 1)

# Proportion of students with University Rating
bar_rating <- df %>%
  count(University.Rating) %>%
  mutate(color = ifelse(n == max(n), "red", "orange"))

dist_rating<-ggplot(bar_rating, aes(x = factor(University.Rating), y = n, fill = color)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = n), vjust = -0.5, size=3) + # Add count labels
  scale_fill_identity() + # Use the color column directly
  labs(
    title = "Distribution of Students by University Rating",
    x = "University Rating",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 9),
    axis.title.x = element_text(size = 8),
    axis.title.y = element_text(size = 8),
    axis.text.x = element_text(size = 8),
    axis.text.y = element_text(size = 8)
  )+
  ylim(0, max(bar_rating$n) * 1)

combined_dist_plot <- (dist_research | dist_LOR) / (dist_SOP | dist_rating)
combined_dist_plot

4.4 Scatter plots by numerical variables

# To identify relationships between key predictors and target variable (Chance.of.Admit)
# GRE vs Chance of Admit
scatter_gre <- ggplot(df, aes(x = GRE, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "GRE Scores vs Chance of Admit", x = "GRE", y = "Chance of Admit")+
  theme(
    plot.title = element_text(size = 9),
    axis.title.x = element_text(size = 9),
    axis.title.y = element_text(size = 9)
  )

# TOEFL Scores vs Chance of Admit
scatter_toefl <- ggplot(df, aes(x = TOEFL, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "TOEFL Scores vs Chance of Admit", x = "TOEFL", y = "Chance of Admit")+
    theme(
    plot.title = element_text(size = 9),
    axis.title.x = element_text(size = 9),
    axis.title.y = element_text(size = 9)
  )

# CGPA vs Chance of Admit
scatter_cgpa <- ggplot(df, aes(x = CGPA, y = Chance.of.Admit)) +
  geom_point() +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "CGPA vs Chance of Admit", x = "CGPA", y = "Chance of Admit")+
    theme(
    plot.title = element_text(size = 9),
    axis.title.x = element_text(size = 9),
    axis.title.y = element_text(size = 9)
  )

combined_scatter <- scatter_gre + scatter_toefl + scatter_cgpa +
  plot_layout(ncol = 3)
combined_scatter

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

4.5 Boxplot by categorical variable

# University Rating vs Chance of Admit
quartiles_by_uni_rating <- df %>%
  group_by(University.Rating) %>%
  summarise(
    Q1 = quantile(Chance.of.Admit, 0.25),
    Median = median(Chance.of.Admit),
    Q3 = quantile(Chance.of.Admit, 0.75))

quartiles_by_uni_rating

## # A tibble: 5 × 4
##   University.Rating    Q1 Median    Q3
##               <int> <dbl>  <dbl> <dbl>
## 1                 1 0.492  0.545 0.58 
## 2                 2 0.56   0.64  0.705
## 3                 3 0.67   0.73  0.77 
## 4                 4 0.78   0.835 0.9  
## 5                 5 0.87   0.91  0.94

ggplot(df, aes(x = factor(University.Rating), y = Chance.of.Admit)) +
  geom_boxplot(fill = "burlywood1") +
  labs(title = "University Rating vs Chance of Admit", x = "University Rating", y = "Chance of Admit")

# Research vs Chance of Admit
quartiles_by_research <- df %>%
  group_by(Research) %>%
  summarise(
    Q1 = quantile(Chance.of.Admit, 0.25),
    Median = median(Chance.of.Admit),
    Q3 = quantile(Chance.of.Admit, 0.75))

quartiles_by_research

## # A tibble: 2 × 4
##   Research    Q1 Median    Q3
##      <int> <dbl>  <dbl> <dbl>
## 1        0  0.57   0.65  0.72
## 2        1  0.73   0.8   0.9

ggplot(df, aes(x = factor(Research), y = Chance.of.Admit)) +
  geom_boxplot(fill = "lightpink") +
  labs(title = "Research vs Chance of Admit", x = "Research Experience (0: No, 1: Yes)", y = "Chance of Admit")

# SOP vs Chance of Admit
quartiles_by_SOP <- df %>%
  group_by(SOP) %>%
  summarise(
    Q1 = quantile(Chance.of.Admit, 0.25),
    Median = median(Chance.of.Admit),
    Q3 = quantile(Chance.of.Admit, 0.75))

quartiles_by_SOP

## # A tibble: 9 × 4
##     SOP    Q1 Median    Q3
##   <dbl> <dbl>  <dbl> <dbl>
## 1   1   0.56   0.57  0.58 
## 2   1.5 0.47   0.525 0.56 
## 3   2   0.525  0.64  0.7  
## 4   2.5 0.61   0.67  0.71 
## 5   3   0.64   0.705 0.742
## 6   3.5 0.64   0.72  0.775
## 7   4   0.728  0.8   0.86 
## 8   4.5 0.78   0.87  0.91 
## 9   5   0.88   0.92  0.94

ggplot(df, aes(x = factor(SOP), y = Chance.of.Admit)) +
  geom_boxplot(fill = "darksalmon") +
  labs(title = "Statement of Purpose vs Chance of Admit", x = "Statement of Purpose Strength", y = "Chance of Admit")

# LOR vs Chance of Admit
quartiles_by_LOR <- df %>%
  group_by(LOR) %>%
  summarise(
    Q1 = quantile(Chance.of.Admit, 0.25),
    Median = median(Chance.of.Admit),
    Q3 = quantile(Chance.of.Admit, 0.75))

quartiles_by_LOR

## # A tibble: 9 × 4
##     LOR    Q1 Median    Q3
##   <dbl> <dbl>  <dbl> <dbl>
## 1   1   0.42   0.42  0.42 
## 2   1.5 0.498  0.52  0.655
## 3   2   0.485  0.56  0.633
## 4   2.5 0.57   0.645 0.728
## 5   3   0.62   0.67  0.763
## 6   3.5 0.65   0.725 0.8  
## 7   4   0.705  0.78  0.855
## 8   4.5 0.782  0.89  0.91 
## 9   5   0.85   0.895 0.94

ggplot(df, aes(x = factor(LOR), y = Chance.of.Admit)) +
  geom_boxplot(fill = "cornsilk") +
  labs(title = "Letter of Recommendation Strength vs Chance of Admit", x = "Letter of Recommendation Strength", y = "Chance of Admit")

Feature Selection For Regression Correlation scores for Chance.Of.Admit: University.Rating: 0.732 SOP:0.691 LOR: 0.640 CGPA: 0.878 Research: 0.582 Total_Score: 0.839

Feature Selection For Classification Correlation scores for Chance.Level: University.Rating: 0.5863705 SOP: 0.5424313 LOR: 0.5179281 CGPA: 0.6990713 Research: 0.5033846 Total_Score: 0.6894465

Chance.Level is a categorical variable derived from Chance.of.Admit. When converting continuous data like Chance.of.Admit into categorical bins (such as “1”, “2”, “3”), the relationship between individual features and the new target (Chance.Level) becomes less precise. The values in Chance.Level now represent broader categories, which naturally reduces the correlation.

5.0 Modelling

5.1 Feature Selection

Dropping variables (or features) during the feature selection process before modeling is a crucial step to enhance model performance, reduce overfitting, and improve computational efficiency.

In our case, unwanted features that might cause duplication such as GRE, TOEFL, Research_GPA_Interaction, and Chance.of.Admit/Chance.Level are dropped.

# Drop the unwanted features for classification
dfc <- df[, !(names(df) %in% c("GRE", "TOEFL", "Research_GPA_Interaction","Chance.of.Admit"))]

# Drop the unwanted features for regression
dfr <- df[, !(names(df) %in% c("GRE", "TOEFL", "Research_GPA_Interaction","Chance.Level"))]

Dropping the GRE and TOEFL because we have the Total_Score column, which combines these two. Dropping the Research_GPA_Interaction due to many 0 values (as Research is binary and can lead to a lot of 0 values).

skewness(df$Chance.of.Admit)

## [1] -0.3508017

table(df$Chance.Level)

## 
##   0   1 
## 165 235

Since the skewness value of -0.3508017 indicates only a slight left-skew, it’s likely not severe enough to require immediate resampling or transformation. The classes are fairly close to each other, with class 1 having 235 entries and class 0 having 165 entries. The difference of 70 instances is not large, so this isn’t considered a significant imbalance.

5.2 Classification Model

Data Splitting (80-20) for classification model

The 80/20 rule is selected to split the data to get a better balance between having enough data to train the model and enough data to assess how it generalizes. This can reduce overfitting issues, thus improve the performance of the model.

library(caret)

## Warning: package 'caret' was built under R version 4.4.2

## Loading required package: lattice

# Split the dataset into training and test sets (80-20 split)
set.seed(123)  # Ensures reproducibility
split.index <- createDataPartition(dfc$Chance.Level, p = 0.8, list = FALSE) #target column is 'Chance.Level'
# Create the training set (80% of the data)
train.data <- dfc[split.index, ]

# Create the test set (remaining 20% of the data)
test.data <- dfc[-split.index, ]

# Check the distribution of the target variable in both sets to ensure proper splitting
table(train.data$Chance.Level)

## 
##   0   1 
## 132 188

table(test.data$Chance.Level)

## 
##  0  1 
## 33 47

# Convert Chance.Level to factor for classification
train.data$Chance.Level <- factor(train.data$Chance.Level, levels = c("1", "0"), labels = c("HighChance", "LowChance"))
test.data$Chance.Level <- factor(test.data$Chance.Level, levels = c("1", "0"), labels = c("HighChance", "LowChance"))

5.2.1 Random Forest

As the Random Forest Model can provide feature importance scores, it is utilized to help understand which features are most significant for the classification tasks.

Besides, different cross-validation techniques are implemented to assess the performance of Random Forest classifier, while the best classifier will be selected for modelling.

After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.

library(randomForest)

## Warning: package 'randomForest' was built under R version 4.4.2

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:gridExtra':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

#Fit the Random Forest Model
model_rf <- randomForest(Chance.Level ~ ., data = train.data)

# Print the model to see the results
print(model_rf)

## 
## Call:
##  randomForest(formula = Chance.Level ~ ., data = train.data) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 15.94%
## Confusion matrix:
##            HighChance LowChance class.error
## HighChance        164        24   0.1276596
## LowChance          27       105   0.2045455

# Make predictions on the test set
predictions_rf <- predict(model_rf, newdata = test.data)

# Confusion matrix to evaluate the model
conf_matrix_rf <- confusionMatrix(predictions_rf, test.data$Chance.Level)
print(conf_matrix_rf)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HighChance LowChance
##   HighChance         37         6
##   LowChance          10        27
##                                           
##                Accuracy : 0.8             
##                  95% CI : (0.6956, 0.8811)
##     No Information Rate : 0.5875          
##     P-Value [Acc > NIR] : 4.709e-05       
##                                           
##                   Kappa : 0.5947          
##                                           
##  Mcnemar's Test P-Value : 0.4533          
##                                           
##             Sensitivity : 0.7872          
##             Specificity : 0.8182          
##          Pos Pred Value : 0.8605          
##          Neg Pred Value : 0.7297          
##              Prevalence : 0.5875          
##          Detection Rate : 0.4625          
##    Detection Prevalence : 0.5375          
##       Balanced Accuracy : 0.8027          
##                                           
##        'Positive' Class : HighChance      
##

Cross Validation for Random Forest

# k-fold
train_control <- trainControl(method="cv", number=10, classProbs = TRUE)
model_rf1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3, classProbs = TRUE)
model_rf2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")

# Leave one out CV
train_control <- trainControl(method="LOOCV", classProbs = TRUE)
model_rf3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")

# bootstrap
train_control <- trainControl(method="boot", number=100, classProbs = TRUE)
model_rf4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="rf")

# Extract performance metrics for each model
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  Accuracy = c(
    max(model_rf1$results$Accuracy),
    max(model_rf2$results$Accuracy),
    max(model_rf3$results$Accuracy),
    max(model_rf4$results$Accuracy)
  ),
  Kappa = c(
    max(model_rf1$results$Kappa),
    max(model_rf2$results$Kappa),
    max(model_rf3$results$Kappa),
    max(model_rf4$results$Kappa)
  )
)

# Print the comparison table
print(results)

##             Model  Accuracy     Kappa
## 1          k-fold 0.8310881 0.6504532
## 2 Repeated k-fold 0.8235144 0.6341111
## 3           LOOCV 0.8281250 0.6457897
## 4       Bootstrap 0.8252230 0.6386866

# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: k-fold"

Best Performing Random Forest

# Extract Random Forest Model from k-fold
rf_model <- model_rf1$finalModel

# Random Forest Feature Importance
feature_importance_rf <- importance(rf_model)

# Convert to data frame and prepare for visualization
importance_df_rf <- as.data.frame(feature_importance_rf)
importance_df_rf$Feature <- rownames(importance_df_rf)

# Sort by Mean Decrease Gini
importance_df_rf <- importance_df_rf[order(importance_df_rf$MeanDecreaseGini, decreasing = TRUE), ]

# Visualize Random Forest Feature Importance
library(ggplot2)
ggplot(importance_df_rf, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  coord_flip() +
  labs(title = "Feature Importance for Random Forest Model", x = "Feature", y = "Importance (Mean Decrease Gini)") +
  theme_minimal()

Mean Decrease Gini (or Mean Decrease Accuracy) is used for calculating feature importance in Random Forest model trained using Repeated k-fold CV. CGPA and Total Score are the most important feature.

library(pROC)

## Warning: package 'pROC' was built under R version 4.4.2

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

# Make predictions using the trained Random Forest model on the test set
predictions_rf <- predict(rf_model, newdata = test.data)

# Confusion Matrix for Random Forest
conf_matrix_rf <- confusionMatrix(predictions_rf, test.data$Chance.Level)

# Extract metrics from the confusion matrix
accuracy_rf <- conf_matrix_rf$overall['Accuracy']
precision_rf <- posPredValue(predictions_rf, test.data$Chance.Level)
recall_rf <- sensitivity(predictions_rf, test.data$Chance.Level)
f1_rf <- (2 * precision_rf * recall_rf) / (precision_rf + recall_rf)

# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_rf))

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

auc_rf <- auc(roc_curve)

# Create a data frame to store the metrics
results_rf <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
  Random_Forest = c(as.numeric(accuracy_rf), as.numeric(precision_rf),
                    as.numeric(recall_rf), as.numeric(f1_rf), as.numeric(auc_rf))
)

# Print the results
print(results_rf)

##      Metric Random_Forest
## 1  Accuracy     0.8125000
## 2 Precision     0.8809524
## 3    Recall     0.7872340
## 4  F1-Score     0.8314607
## 5       AUC     0.8178594

5.2.2 Logistic Regression

The second model chosen is logistic regression which specialize in binary classification tasks.

Different cross-validation techniques are implemented in this step to assess the performance of linear regression classifier, while the best classifier will be selected for modelling.

After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.

# Fit Logistic Regression model
model_lr <- train(Chance.Level ~ ., data = train.data)

# Print the model summary
print(model_lr)

## Random Forest 
## 
## 320 samples
##   6 predictor
##   2 classes: 'HighChance', 'LowChance' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 320, 320, 320, 320, 320, 320, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8233746  0.6313569
##   4     0.8167568  0.6171070
##   6     0.8116929  0.6068051
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

# Make predictions on the test set
predictions_lr <- predict(model_lr, newdata = test.data)

# Evaluate performance with a confusion matrix
conf_matrix_lr <- confusionMatrix(predictions_lr, test.data$Chance.Level)
print(conf_matrix_lr)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HighChance LowChance
##   HighChance         37         5
##   LowChance          10        28
##                                           
##                Accuracy : 0.8125          
##                  95% CI : (0.7097, 0.8911)
##     No Information Rate : 0.5875          
##     P-Value [Acc > NIR] : 1.59e-05        
##                                           
##                   Kappa : 0.6217          
##                                           
##  Mcnemar's Test P-Value : 0.3017          
##                                           
##             Sensitivity : 0.7872          
##             Specificity : 0.8485          
##          Pos Pred Value : 0.8810          
##          Neg Pred Value : 0.7368          
##              Prevalence : 0.5875          
##          Detection Rate : 0.4625          
##    Detection Prevalence : 0.5250          
##       Balanced Accuracy : 0.8179          
##                                           
##        'Positive' Class : HighChance      
##

Cross Validation for Logistic Regression

# k-fold
train_control <- trainControl(method="cv", number=10)
model_lr1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_lr2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_lr3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_lr4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="glm")

# Extract performance metrics for each model
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  Accuracy = c(
    max(model_lr1$results$Accuracy),
    max(model_lr2$results$Accuracy),
    max(model_lr3$results$Accuracy),
    max(model_lr4$results$Accuracy)
  ),
  Kappa = c(
    max(model_lr1$results$Kappa),
    max(model_lr2$results$Kappa),
    max(model_lr3$results$Kappa),
    max(model_lr4$results$Kappa)
  )
)

# Print the comparison table
print(results)

##             Model  Accuracy     Kappa
## 1          k-fold 0.8561248 0.7032586
## 2 Repeated k-fold 0.8521709 0.6940222
## 3           LOOCV 0.8562500 0.7034172
## 4       Bootstrap 0.8397092 0.6673948

# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: LOOCV"

Best Performing Logistic Regression

# Extract Logistic Regression Model from LOOCV
lr_model <- model_rf3$finalModel

# Extract Feature Importance using varImp from the caret package
importance_lr <- varImp(model_lr, scale = FALSE)

# Check the structure of the importance object
print(str(importance_lr))

## List of 3
##  $ importance:'data.frame':  6 obs. of  1 variable:
##   ..$ Overall: num [1:6] 15.28 14.56 15.57 52.13 7.81 ...
##  $ model     : chr "rf"
##  $ calledFrom: chr "varImp"
##  - attr(*, "class")= chr "varImp.train"
## NULL

# Convert the importance object to a data frame for visualization
importance_df_lr <- as.data.frame(importance_lr$importance)
importance_df_lr$Feature <- rownames(importance_df_lr)

# Sort by importance value (descending)
importance_df_lr <- importance_df_lr[order(importance_df_lr$Overall, decreasing = TRUE), ]

# Check the structure and data
print(str(importance_df_lr))

## 'data.frame':    6 obs. of  2 variables:
##  $ Overall: num  52.1 45.6 15.6 15.3 14.6 ...
##  $ Feature: chr  "CGPA" "Total_Score" "LOR" "University.Rating" ...
## NULL

print(head(importance_df_lr))

##                     Overall           Feature
## CGPA              52.130220              CGPA
## Total_Score       45.560152       Total_Score
## LOR               15.567068               LOR
## University.Rating 15.278563 University.Rating
## SOP               14.563066               SOP
## Research           7.808335          Research

# Visualize the feature importance using ggplot
ggplot(importance_df_lr, aes(x = reorder(Feature, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  coord_flip() +
  labs(title = "Feature Importance for Logistic Regression Model",
       x = "Feature", y = "Importance (Overall)") +
  theme_minimal()

The varImp() function from the caret package is designed to extract feature importance directly from a trained model like logistic regression. CGPA and Total_Score shows the highest importance.

# Make predictions using the trained Logistic Regression model on the test set
predictions_lr <- predict(lr_model, newdata = test.data)

# Logistic Regression Metrics
conf_matrix_lr <- confusionMatrix(predictions_lr, test.data$Chance.Level)
accuracy_lr <- conf_matrix_lr$overall['Accuracy']
precision_lr <- posPredValue(predictions_lr, test.data$Chance.Level)
recall_lr <- sensitivity(predictions_lr, test.data$Chance.Level)
f1_lr <- (2 * precision_lr * recall_lr) / (precision_lr + recall_lr)

# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_lr))

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

auc_lr <- auc(roc_curve)

results_lr <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
  Logistic_Regression = c(as.numeric(accuracy_lr), as.numeric(precision_lr),
                        as.numeric(recall_lr), as.numeric(f1_lr), as.numeric(auc_lr))
)

# Print the results
print(results_lr)

##      Metric Logistic_Regression
## 1  Accuracy           0.8250000
## 2 Precision           0.9024390
## 3    Recall           0.7872340
## 4  F1-Score           0.8409091
## 5       AUC           0.8330110

5.2.3 Support Vector Machine

The third alternative chosen is the Support Vector Machine (SVM) Model, which specialize in finding the optimal hyperplane that separates different classes in the feature space.

Different cross-validation techniques are implemented in this step to assess the performance of SVM classifier, while the best classifier will be selected for modelling.

After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.

# Fit the SVM model
model_svm <- svm(Chance.Level ~ ., data = train.data, probability = TRUE)

# Print the model summary
print(model_svm)

## 
## Call:
## svm(formula = Chance.Level ~ ., data = train.data, probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  133

# Make predictions on the test set
predictions_svm <- predict(model_svm, newdata = test.data)
prob_predictions <- attr(predict(model_svm, test.data, probability = TRUE), "probabilities")[, 2]

# Evaluate performance with a confusion matrix
conf_matrix_svm <- confusionMatrix(predictions_svm, test.data$Chance.Level)
print(conf_matrix_svm)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HighChance LowChance
##   HighChance         37         4
##   LowChance          10        29
##                                           
##                Accuracy : 0.825           
##                  95% CI : (0.7238, 0.9009)
##     No Information Rate : 0.5875          
##     P-Value [Acc > NIR] : 4.97e-06        
##                                           
##                   Kappa : 0.6485          
##                                           
##  Mcnemar's Test P-Value : 0.1814          
##                                           
##             Sensitivity : 0.7872          
##             Specificity : 0.8788          
##          Pos Pred Value : 0.9024          
##          Neg Pred Value : 0.7436          
##              Prevalence : 0.5875          
##          Detection Rate : 0.4625          
##    Detection Prevalence : 0.5125          
##       Balanced Accuracy : 0.8330          
##                                           
##        'Positive' Class : HighChance      
##

Cross Validation for SVM

library(kernlab)

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

# k-fold
train_control <- trainControl(method="cv", number=10)
model_svm1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_svm2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_svm3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_svm4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="svmRadial")

Best Performing SVM

# Extract performance metrics for each model
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  Accuracy = c(
    max(model_svm1$results$Accuracy),
    max(model_svm2$results$Accuracy),
    max(model_svm3$results$Accuracy),
    max(model_svm4$results$Accuracy)
  ),
  Kappa = c(
    max(model_svm1$results$Kappa),
    max(model_svm2$results$Kappa),
    max(model_svm3$results$Kappa),
    max(model_svm4$results$Kappa)
  )
)

# Print the comparison table
print(results)

##             Model  Accuracy     Kappa
## 1          k-fold 0.8558101 0.7054645
## 2 Repeated k-fold 0.8511364 0.6946757
## 3           LOOCV 0.8593750 0.7121382
## 4       Bootstrap 0.8458848 0.6826663

# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: LOOCV"

# Extract SVM Model from LOOCV
svm_model <- model_svm3$finalModel

# performance function to calculate accuracy
performance_function <- function(model, data, labels) {
  predictions <- predict(model, newdata = data)
  accuracy <- sum(predictions == labels) / length(labels)
  return(accuracy)
}

# SVM Permutation Feature Importance
importance_scores_svm <- numeric(ncol(train.data) - 1)
for (i in 1:(ncol(train.data) - 1)) {
  shuffled_data <- train.data
  shuffled_data[, i] <- sample(shuffled_data[, i])
  baseline_accuracy <- performance_function(model_svm, train.data, train.data$Chance.Level)
  shuffled_accuracy <- performance_function(model_svm, shuffled_data, shuffled_data$Chance.Level)
  importance_scores_svm[i] <- baseline_accuracy - shuffled_accuracy
}
importance_df_svm <- data.frame(Feature = names(train.data)[-ncol(train.data)], Importance = importance_scores_svm)
importance_df_svm <- importance_df_svm[order(importance_df_svm$Importance, decreasing = TRUE), ]

# Visualize SVM Feature Importance
ggplot(importance_df_svm, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  labs(title = "Feature Importance for SVM Model", x = "Feature", y = "Importance (Decrease in Accuracy)") +
  theme_minimal()

Since SVM does not inherently provide a way to rank feature importance, permutation importance is used as a model-agnostic method to estimate the effect of each feature on model performance. Total_Score amd CGPA is still having the highest effect on the model.

# SVM created with train function is different
test.data_no_target <- test.data[, -which(names(test.data) == "Chance.Level")]

# Make predictions using the trained SVM model on the test set
predictions_svm <- predict(svm_model, newdata = test.data_no_target)

# SVM Metrics
conf_matrix_svm <- confusionMatrix(predictions_svm, test.data$Chance.Level)
accuracy_svm <- conf_matrix_svm$overall['Accuracy']
precision_svm <- posPredValue(predictions_svm, test.data$Chance.Level)
recall_svm <- sensitivity(predictions_svm, test.data$Chance.Level)
f1_svm <- (2 * precision_svm * recall_svm) / (precision_svm + recall_svm)

# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_svm))

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

auc_svm <- auc(roc_curve)

results_svm <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
  SVM = c(as.numeric(accuracy_svm), as.numeric(precision_svm),
                        as.numeric(recall_svm), as.numeric(f1_svm), as.numeric(auc_svm))
)

# Print the results
print(results_svm)

##      Metric       SVM
## 1  Accuracy 0.8375000
## 2 Precision 0.9047619
## 3    Recall 0.8085106
## 4  F1-Score 0.8539326
## 5       AUC 0.8436493

5.2.4 K-Nearest Neighbors

The last model chosen for classification is the KNN Model which is good in classifying a data point based on how its neighbors are classified.

Different cross-validation techniques are implemented in this step to assess the performance of KNN classifier, while the best classifier will be selected for modelling.

After that, a bar chart is created to visualize the importance of each feature variables that can impact the chance of admission by the most.

library(class)
# Fit k-NN model
predictions_knn <- knn(train = train.data[, -ncol(train.data)],
                       test = test.data[, -ncol(test.data)],
                       cl = train.data$Chance.Level
                       )

# Evaluate performance with a confusion matrix
conf_matrix_knn <- confusionMatrix(predictions_knn, test.data$Chance.Level)
print(conf_matrix_knn)

## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   HighChance LowChance
##   HighChance         38        12
##   LowChance           9        21
##                                           
##                Accuracy : 0.7375          
##                  95% CI : (0.6271, 0.8296)
##     No Information Rate : 0.5875          
##     P-Value [Acc > NIR] : 0.003784        
##                                           
##                   Kappa : 0.451           
##                                           
##  Mcnemar's Test P-Value : 0.662521        
##                                           
##             Sensitivity : 0.8085          
##             Specificity : 0.6364          
##          Pos Pred Value : 0.7600          
##          Neg Pred Value : 0.7000          
##              Prevalence : 0.5875          
##          Detection Rate : 0.4750          
##    Detection Prevalence : 0.6250          
##       Balanced Accuracy : 0.7224          
##                                           
##        'Positive' Class : HighChance      
##

Cross Validation for KNN

# k-fold
train_control <- trainControl(method="cv", number=10)
model_knn1 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_knn2 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_knn3 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_knn4 <- train(Chance.Level~., data=train.data, trControl=train_control, method="knn")

# Extract performance metrics for each model
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  Accuracy = c(
    max(model_knn1$results$Accuracy),
    max(model_knn2$results$Accuracy),
    max(model_knn3$results$Accuracy),
    max(model_knn4$results$Accuracy)
  ),
  Kappa = c(
    max(model_knn1$results$Kappa),
    max(model_knn2$results$Kappa),
    max(model_knn3$results$Kappa),
    max(model_knn4$results$Kappa)
  )
)

# Print the comparison table
print(results)

##             Model  Accuracy     Kappa
## 1          k-fold 0.8346102 0.6549046
## 2 Repeated k-fold 0.8343465 0.6529221
## 3           LOOCV 0.8312500 0.6470588
## 4       Bootstrap 0.8169719 0.6164830

# Identify the best model based on highest accuracy
best_model_index <- which.max(results$Accuracy)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: k-fold"

Best Performing KNN

# Extract K-Nearest Neighbour Model from k-fold
knn_model <- model_knn1$finalModel

# Calculate feature importance using varImp()
importance_knn <- varImp(model_knn3, scale = FALSE)

# Print the importance of each feature
print(importance_knn)

## ROC curve variable importance
## 
##                   Importance
## CGPA                  0.9131
## Total_Score           0.9049
## University.Rating     0.8452
## SOP                   0.8205
## LOR                   0.8156
## Research              0.7409

# Visualize the importance using ggplot2
library(ggplot2)
ggplot(importance_knn, aes(x = reorder(Features, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Feature Importance for k-NN Model", x = "Feature", y = "Importance (Decrease in Accuracy)") +
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

Unlike other models, k-Nearest Neighbors (kNN) doesn’t explicitly model feature importance. It makes predictions based on the majority class of the nearest neighbors, so it doesn’t produce direct feature importance scores. As a result, the feature importance values might seem to be quite similar, as they are not derived from a model that inherently ranks features.

# KNN created with train function is different
test.data_no_target <- test.data[, -which(names(test.data) == "Chance.Level")]

# Make predictions using the trained KNN model on the test set
predictions_knn <- predict(knn_model, newdata = test.data_no_target, type = "class")

# KNN Metrics
conf_matrix_knn <- confusionMatrix(predictions_knn, test.data$Chance.Level)
accuracy_knn <- conf_matrix_knn$overall['Accuracy']
precision_knn <- posPredValue(predictions_knn, test.data$Chance.Level)
recall_knn <- sensitivity(predictions_knn, test.data$Chance.Level)
f1_knn <- (2 * precision_knn * recall_knn) / (precision_knn + recall_knn)

# Calculate AUC (Area Under the ROC Curve)
# Convert test.data$Chance.Level to numeric for AUC calculation
roc_curve <- roc(as.numeric(test.data$Chance.Level), as.numeric(predictions_knn))

## Setting levels: control = 1, case = 2

## Setting direction: controls < cases

auc_knn <- auc(roc_curve)

results_knn <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
  kNN = c(as.numeric(accuracy_knn), as.numeric(precision_knn),
                        as.numeric(recall_knn), as.numeric(f1_knn), as.numeric(auc_knn))
)

# Print the results
print(results_knn)

##      Metric       kNN
## 1  Accuracy 0.8125000
## 2 Precision 0.8478261
## 3    Recall 0.8297872
## 4  F1-Score 0.8387097
## 5       AUC 0.8088330

5.2.5 Evaluation of Classification Models

In this step, all the models are compared with their performance metrics to choose the most suitable model for classification tasks.

# Create a summary data frame
results_classification <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall", "F1-Score", "AUC"),
  Random_Forest = c(accuracy_rf, precision_rf, recall_rf, f1_rf, auc_rf),
  Logistic_Regression = c(accuracy_lr, precision_lr, recall_lr, f1_lr, auc_lr),
  SVM = c(accuracy_svm, precision_svm, recall_svm, f1_svm, auc_svm),
  kNN = c(accuracy_knn, precision_knn, recall_knn, f1_knn, auc_knn)
)

# Print the results table
print(results_classification)

##      Metric Random_Forest Logistic_Regression       SVM       kNN
## 1  Accuracy     0.8125000           0.8250000 0.8375000 0.8125000
## 2 Precision     0.8809524           0.9024390 0.9047619 0.8478261
## 3    Recall     0.7872340           0.7872340 0.8085106 0.8297872
## 4  F1-Score     0.8314607           0.8409091 0.8539326 0.8387097
## 5       AUC     0.8178594           0.8330110 0.8436493 0.8088330

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.4.2

# Reshape the data into a long format for ggplot
metrics_long_df <- results_classification %>%
  pivot_longer(cols = -Metric, names_to = "Model", values_to = "Value")

# Visualize the metrics with ggplot
ggplot(metrics_long_df, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Model Performance Comparison", x = "Metric", y = "Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

SVM with Leave-one-out cross-validation (LOOCV) is the best model in terms of all metrics, making it the most effective for distinguishing between classes and capturing positives.

5.3 Regression Model

Data Splitting (80-20) for regression model

# Split the dataset into training and test sets (80-20 split)
set.seed(123)  # Ensures reproducibility
split.index <- createDataPartition(dfr$Chance.of.Admit, p = 0.8, list = FALSE) #target column is 'Chance.of.Admit'
# Create the training set (80% of the data)
train_data <- dfr[split.index, ]

# Create the test set (remaining 20% of the data)
test_data <- dfr[-split.index, ]

5.3.1 Random Forest

Then, the Random Forest model is selected to capture complex relationships between our target variable (Chance of Admit) and the input features. Besides, this step allows us to identify the feature importance of this dataset via prediction.

To evaluate the model performance, various performance metrics are calculated, which includes: - RMSE (Root Mean Square Error): Quantifies error in model’s prediction. - R2 (Coefficient of Determination): Understand goodness of fit - MAE (Mean Absolute Error): Shows average model performance (less sensitive to outliers) - MAPE (Mean Absolute Percentage Error): Understands errors in terms of percentage.

Before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled.. Furthermore, a bar chart is plotted to visualize the importance of the variables.

Cross Validation for Random Forest

# k-fold
train_control <- trainControl(method="cv", number=10)
model_rfa <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_rfb <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_rfc <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_rfd <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="rf")

Best Performing Random Forest

# Compare Performance Metrics
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  RMSE = c(
    min(model_rfa$results$RMSE),
    min(model_rfb$results$RMSE),
    min(model_rfc$results$RMSE),
    min(model_rfd$results$RMSE)
  ),
  Rsquared = c(
    max(model_rfa$results$Rsquared),
    max(model_rfb$results$Rsquared),
    max(model_rfc$results$Rsquared),
    max(model_rfd$results$Rsquared)
  )
)

# Print results table
print(results)

##             Model       RMSE  Rsquared
## 1          k-fold 0.06488381 0.8007859
## 2 Repeated k-fold 0.06539835 0.7876448
## 3           LOOCV 0.06585309 0.7811270
## 4       Bootstrap 0.06742689 0.7719480

# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: k-fold"

# Feature Importance for the Best Model
rf_model <- model_rfa$finalModel
importance_rf <- varImp(model_rfb, scale = FALSE)
print(importance_rf)

## rf variable importance
## 
##                   Overall
## CGPA               2.3576
## Total_Score        1.8087
## University.Rating  0.6978
## SOP                0.5016
## Research           0.4064
## LOR                0.2974

# Visualize Feature Importance
ggplot(importance_rf, aes(x = reorder(Variables, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  coord_flip() +
  labs(title = "Feature Importance for Random Forest Model",
       x = "Feature", y = "Importance") +
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

# Define the MAPE function
MAPE <- function(predictions, actual) {
  mean(abs((actual - predictions) / actual)) * 100
}

# Ensure predictions and target are numeric vectors
predictions_rf <- predict(rf_model, newdata = test_data)
target_rf <- as.numeric(test_data$Chance.of.Admit)

# Calculate Metrics
rmse_rf <- RMSE(predictions_rf, target_rf)
mae_rf <- MAE(predictions_rf, target_rf)
rsquared_rf <- cor(predictions_rf, target_rf)^2
mape_rf <- MAPE(predictions_rf, target_rf)

# Summarize Results
results_rf <- data.frame(
  Metric = c("RMSE", "MAE", "R^2", "MAPE"),
  Random_Forest = c(rmse_rf, mae_rf, rsquared_rf, mape_rf)
)

# Print Results
print(results_rf)

##   Metric Random_Forest
## 1   RMSE    0.06933500
## 2    MAE    0.05126318
## 3    R^2    0.78999014
## 4   MAPE    8.39879037

5.3.2 Linear Regression

The second regression model implemented is the linear regression model, which models the relationship between the target variable and all other predictors. The performance metrics used for model evaluation is similar to Random Forest Model, which are RSME, R2, MAE, and MAPE.

A bar chart is created to visualize the feature importance modeled by Linear Regression.

Cross Validation for Linear Regression

# k-fold
train_control <- trainControl(method="cv", number=10)
model_lm1 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_lm2 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_lm3 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_lm4 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="lm")

# Compare Performance Metrics
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  RMSE = c(
    min(model_lm1$results$RMSE),
    min(model_lm2$results$RMSE),
    min(model_lm3$results$RMSE),
    min(model_lm4$results$RMSE)
  ),
  Rsquared = c(
    max(model_lm1$results$Rsquared),
    max(model_lm2$results$Rsquared),
    max(model_lm3$results$Rsquared),
    max(model_lm4$results$Rsquared)
  )
)

# Print results table
print(results)

##             Model       RMSE  Rsquared
## 1          k-fold 0.06295156 0.8067221
## 2 Repeated k-fold 0.06322814 0.7995616
## 3           LOOCV 0.06415998 0.7921128
## 4       Bootstrap 0.06526121 0.7914113

# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: k-fold"

Best Performing Linear Regression

# Feature Importance for the Best Model
lm_model <- model_lm1$finalModel
importance_lm <- varImp(model_lm1, scale = FALSE)
print(importance_lm)

## lm variable importance
## 
##                   Overall
## CGPA                8.721
## Total_Score         5.044
## Research            3.469
## LOR                 3.396
## SOP                 1.642
## University.Rating   1.541

# Visualize Feature Importance
ggplot(importance_lm, aes(x = reorder(Variables, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(title = "Feature Importance for Linear Regression Model",
       x = "Feature", y = "Importance") +
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

# Ensure predictions and target are numeric vectors
predictions_lm <- predict(lm_model, newdata = test_data)
target_lm <- as.numeric(test_data$Chance.of.Admit)

# Calculate Metrics
rmse_lm <- RMSE(predictions_lm, target_lm)
mae_lm <- MAE(predictions_lm, target_lm)
rsquared_lm <- cor(predictions_lm, target_lm)^2
mape_lm <- MAPE(predictions_lm, target_lm)

# Summarize Results
results_lm <- data.frame(
  Metric = c("RMSE", "MAE", "R^2", "MAPE"),
  Linear_Regression = c(rmse_lm, mae_lm, rsquared_lm, mape_lm)
)

# Print Results
print(results_lm)

##   Metric Linear_Regression
## 1   RMSE        0.06738683
## 2    MAE        0.04930168
## 3    R^2        0.80110282
## 4   MAPE        8.03041013

5.3.3 SVR Model

The 3rd alternative regression model we used is the Support Vector Regression (SVR) Model, where the performance metrics are fixed as RSME, R2, MAE, and MAPE. Before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled. A bar chart is created to visualize the importance of each feature towards Chance of Admission too.

Cross Validation for SVR Model

# k-fold
train_control <- trainControl(method="cv", number=10)
model_svr1 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")

# repeated K Fold
train_control <- trainControl(method="repeatedcv", number=10, repeats=3)
model_svr2 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")

# Leave one out CV
train_control <- trainControl(method="LOOCV")
model_svr3 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")

# bootstrap
train_control <- trainControl(method="boot", number=100)
model_svr4 <- train(Chance.of.Admit~., data=train_data, trControl=train_control, method="svmRadial")

This step compares the performance metrics

# Compare Performance Metrics
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  RMSE = c(
    min(model_svr1$results$RMSE),
    min(model_svr2$results$RMSE),
    min(model_svr3$results$RMSE),
    min(model_svr4$results$RMSE)
  ),
  Rsquared = c(
    max(model_svr1$results$Rsquared),
    max(model_svr2$results$Rsquared),
    max(model_svr3$results$Rsquared),
    max(model_svr4$results$Rsquared)
  )
)

# Print results table
print(results)

##             Model       RMSE  Rsquared
## 1          k-fold 0.06701107 0.7803287
## 2 Repeated k-fold 0.06864063 0.7721359
## 3           LOOCV 0.06718161 0.7765524
## 4       Bootstrap 0.06938481 0.7641694

# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: k-fold"

Best Performing SVR

# Feature Importance for the Best Model
svr_model <- model_svr1$finalModel
importance_svr <- varImp(model_svr2, scale = FALSE)
print(importance_svr)

## loess r-squared variable importance
## 
##                   Overall
## CGPA               0.7605
## Total_Score        0.7045
## University.Rating  0.4934
## SOP                0.4171
## LOR                0.3849
## Research           0.3440

# Verify variable names in importance output
importance_svr$Variables <- rownames(importance_svr)

# Visualize Feature Importance
ggplot(importance_svr, aes(x = reorder(Variables, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "orange") +
  coord_flip() +
  labs(title = "Feature Importance for Support Vector Regression (SVR) Model",
       x = "Feature", y = "Importance") +
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

A Feature Importance Plot is created to visualize the importance of the features, where CGPA is evaluated as the most important feature for the target variable, followed by Total_Score.

# Ensure predictions and target are numeric vectors
predictions_svr <- predict(model_svr2, newdata = test_data)
target_svr <- as.numeric(test_data$Chance.of.Admit)

# Calculate Metrics
rmse_svr <- RMSE(predictions_svr, target_svr)
mae_svr <- MAE(predictions_svr, target_svr)
rsquared_svr <- ifelse(var(predictions_svr) > 0 && var(target_svr) > 0, cor(predictions_svr, target_svr)^2, NA)
mape_svr <- MAPE(predictions_svr, target_svr)

# Summarize Results
results_svr <- data.frame(
  Metric = c("RMSE", "MAE", "R^2", "MAPE"),
  Support_Vector_Regression = c(rmse_svr, mae_svr, rsquared_svr, mape_svr)
)

# Print Results
print(results_svr)

##   Metric Support_Vector_Regression
## 1   RMSE                0.07388692
## 2    MAE                0.05429197
## 3    R^2                0.77866232
## 4   MAPE                9.26627097

5.3.4 LightGBM

The last regression model used is the Light Gradient Boosting Machine (LightGBM) model, which is particularly effective for structured/tabular data as it build decision trees sequentially, while each will improve on the previous one. Using this model, the feature importance will be computed and visualized through a bar chart too.

As usual, before evaluating, different cross-validation methods are done and the best performing fold will be chosen as the best model to be evaluated. Via this model, the feature importance is computed based on permutation, also known as via evaluating the decrease in model performance when a feature’s values are shuffled.

Cross Validation for LightGBM

library(gbm)

## Warning: package 'gbm' was built under R version 4.4.2

## Loaded gbm 2.2.2

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

# k-fold Cross Validation
train_control <- trainControl(method = "cv", number = 10, verboseIter = FALSE)
model_gbm1 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)

# Repeated k-fold Cross Validation
train_control <- trainControl(method = "repeatedcv", number = 10, repeats = 3, verboseIter = FALSE)
model_gbm2 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)

# Leave-One-Out Cross Validation
train_control <- trainControl(method = "LOOCV")
model_gbm3 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)

# Bootstrap
train_control <- trainControl(method = "boot", number = 100)
model_gbm4 <- train(Chance.of.Admit ~ ., data = train_data, trControl = train_control, method = "gbm", verbose = FALSE)

# Compare Performance Metrics
results <- data.frame(
  Model = c("k-fold", "Repeated k-fold", "LOOCV", "Bootstrap"),
  RMSE = c(
    min(model_gbm1$results$RMSE),
    min(model_gbm2$results$RMSE),
    min(model_gbm3$results$RMSE),
    min(model_gbm4$results$RMSE)
  ),
  Rsquared = c(
    max(model_gbm1$results$Rsquared),
    max(model_gbm2$results$Rsquared),
    max(model_gbm3$results$Rsquared),
    max(model_gbm4$results$Rsquared)
  )
)

# Print results table
print(results)

##             Model       RMSE  Rsquared
## 1          k-fold 0.06707669 0.7752814
## 2 Repeated k-fold 0.06649505 0.7826897
## 3           LOOCV 0.06695469 0.7737474
## 4       Bootstrap 0.06795659 0.7702729

# Identify the best model
best_model_index <- which.min(results$RMSE)
best_model <- results$Model[best_model_index]
print(paste("The best model is:", best_model))

## [1] "The best model is: Repeated k-fold"

Best Performing LightGBM

# Feature Importance for the Best Model
importance_gbm <- varImp(model_gbm2, scale = FALSE)
print(importance_gbm)

## gbm variable importance
## 
##                    Overall
## CGPA              11.37258
## Total_Score        3.56790
## LOR                0.37719
## Research           0.31816
## SOP                0.29775
## University.Rating  0.09249

# Visualize Feature Importance
ggplot(importance_gbm, aes(x = reorder(Variables, Overall), y = Overall)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  coord_flip() +
  labs(title = "Feature Importance for GBM Model", x = "Feature", y = "Importance") +
  theme_minimal()

## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.

# Ensure predictions and target are numeric vectors for GBM
predictions_gbm <- predict(model_gbm1, newdata = test_data)
target_gbm <- as.numeric(test_data$Chance.of.Admit)

# Calculate Metrics for GBM
rmse_gbm <- RMSE(predictions_gbm, target_gbm)
mae_gbm <- MAE(predictions_gbm, target_gbm)
rsquared_gbm <- cor(predictions_gbm, target_gbm)^2
mape_gbm <- MAPE(predictions_gbm, target_gbm)

# Summarize Results for GBM
results_gbm <- data.frame(
  Metric = c("RMSE", "MAE", "R^2", "MAPE"),
  Gradient_Boosting = c(rmse_gbm, mae_gbm, rsquared_gbm, mape_gbm)
)

# Print Results for GBM
print(results_gbm)

##   Metric Gradient_Boosting
## 1   RMSE        0.07070301
## 2    MAE        0.05135666
## 3    R^2        0.78056142
## 4   MAPE        8.48132472

Evaluation of Regression Models The last step is to store all the models’ performance results in a summary table for comparison across all the metrics. Besides, bar plots are created to visualize the performance comparison of the models too, which refers to Random Forest Model, Linear Regression Model, and SVR Model. Meanwhile, the selected performance metrics for comparison are RMSE, R2, MAE, and MAPE.

5.3.5 Evaluation of Regression Models

# Combine Results into a Table
results_regression <- data.frame(
  Model = c("Random Forest", "Linear Regression", "SVR", "LightGBM"),
  RMSE = c(rmse_rf, rmse_lm, rmse_svr, rmse_gbm),
  R2 = c(rsquared_rf, rsquared_lm, rsquared_svr, rsquared_gbm),
  MAE = c(mae_rf, mae_lm, mae_svr, mae_gbm),
  MAPE = c(mape_rf, mape_lm, mape_svr, mape_gbm)
)
print(results_regression)

##               Model       RMSE        R2        MAE     MAPE
## 1     Random Forest 0.06933500 0.7899901 0.05126318 8.398790
## 2 Linear Regression 0.06738683 0.8011028 0.04930168 8.030410
## 3               SVR 0.07388692 0.7786623 0.05429197 9.266271
## 4          LightGBM 0.07070301 0.7805614 0.05135666 8.481325

# Create Bar Plot for RMSE Comparison
par(mfrow=c(2, 2))  # Set up a 2x2 plot layout
barplot(results_regression$RMSE,
        names.arg = results_regression$Model,
        las = 2, col = "lightblue",
        main = "RMSE Comparison of Models",
        xlab = "Model", ylab = "RMSE",
        cex.names = 0.8)

# Create Bar Plot for R^2 Comparison
barplot(results_regression$R2,
        names.arg = results_regression$Model,
        las = 2, col = "lightgreen",
        main = "R^2 Comparison of Models",
        xlab = "Model", ylab = "R^2",
        cex.names = 0.8)

# Create Bar Plot for MAE Comparison
barplot(results_regression$MAE,
        names.arg = results_regression$Model,
        las = 2, col = "lightcoral",
        main = "MAE Comparison of Models",
        xlab = "Model", ylab = "MAE",
        cex.names = 0.8)

# Create Bar Plot for MAPE Comparison
barplot(results_regression$MAPE,
        names.arg = results_regression$Model,
        las = 2, col = "orange",
        main = "MAPE Comparison of Models",
        xlab = "Model", ylab = "MAPE",
        cex.names = 0.8)

As visualized, in general all model perform well with close value in all the metrics of RMSE, R2, MAE and MAPE. Linear Regression with k-fold outperforms others across all three metrics, with highest R^2 and lowest RMSE, MAE and MAPE, making it the best-performing model for this dataset.

Hence, in conclusion, Linear Regression Model is selected as the best model among all the regression models.

6.0 Evaluation and Intepretation

Comparison of Correlation and Feature Importance

Classification

chance_cor <- cor_matrix$r[,"Chance.Level", drop = FALSE]  # Correlations

# Correlation table
cor_table <- data.frame(
  Feature = rownames(chance_cor),
  Correlation = chance_cor[, 1],
  row.names = NULL # Remove row names
)

# Remove specific features and self-correlation
cor_table <- cor_table[!cor_table$Feature %in% c("GRE", "TOEFL", "Research_GPA_Interaction", "Chance.Level", "Chance.of.Admit"), ]

# Rank correlation values
cor_table <- cor_table[order(-cor_table$Correlation), ]
cor_table$Correlation_Rank <- 1:nrow(cor_table)

# SVM Feature importance table (already computed)
importance_table_svm <- importance_df_svm
importance_table_svm$Importance_Rank <- 1:nrow(importance_table_svm)

# Merge both tables for comparison
comparison_table <- merge(cor_table, importance_table_svm, by = "Feature", all = TRUE)

comparison_table <- comparison_table %>%
  arrange(Correlation_Rank) %>%
  mutate(
    Correlation = round(Correlation, 3),
    Importance = round(Importance, 3)
  )

# View the comparison without row numbers
print(comparison_table, row.names = FALSE)

##            Feature Correlation Correlation_Rank Importance Importance_Rank
##               CGPA       0.699                1      0.044               2
##        Total_Score       0.689                2      0.075               1
##  University.Rating       0.586                3     -0.009               6
##                SOP       0.542                4      0.006               5
##                LOR       0.518                5      0.028               3
##           Research       0.503                6      0.012               4

Total_Score and CGPA are both highly correlated with the target variable and are very important in predicting the target in the SVM model. Both features are key factors for the model’s performance. Other features with moderate correlation but low importance might be less useful in predicting the target. SOP and Research have low importance in the SVM model, with SOP showing a correlation but no impact on model performance. University Rating and LOR have moderate correlations and play an intermediate role in the model, but their importance in the SVM model is not as high.

Regression

chance_cor <- cor_matrix$r[,"Chance.of.Admit", drop = FALSE]  # Correlations

# Correlation table
cor_table <- data.frame(
  Feature = rownames(chance_cor),
  Correlation = chance_cor[, 1],
  row.names = NULL # Remove row names
)

# Remove specific features and self-correlation
cor_table <- cor_table[!cor_table$Feature %in% c("GRE", "TOEFL", "Research_GPA_Interaction", "Chance.Level", "Chance.of.Admit"), ]

# Rank correlation values
cor_table <- cor_table[order(-cor_table$Correlation), ]
cor_table$Correlation_Rank <- 1:nrow(cor_table)

# Feature importance table
importance_lm <- varImp(model_lm1, scale = FALSE)$importance
importance_table <- data.frame(
  Feature = rownames(importance_lm),
  Importance = importance_lm$Overall,
  row.names = NULL # Remove row names
)

# Rank feature importance
importance_table <- importance_table[order(-importance_table$Importance), ]
importance_table$Importance_Rank <- 1:nrow(importance_table)

# Merge both tables for comparison
comparison_table <- merge(cor_table, importance_table, by = "Feature", all = TRUE)

comparison_table <- comparison_table %>%
  arrange(Correlation_Rank) %>%
  mutate(
    Correlation = round(Correlation, 3),
    Importance = round(Importance, 3)
  )

# View the comparison
print(comparison_table, row.names = FALSE)

##            Feature Correlation Correlation_Rank Importance Importance_Rank
##               CGPA       0.878                1      8.721               1
##        Total_Score       0.839                2      5.044               2
##  University.Rating       0.732                3      1.541               6
##                SOP       0.692                4      1.642               5
##                LOR       0.640                5      3.396               4
##           Research       0.582                6      3.469               3

The regression analysis indicates that CGPA and Total_Score are the most influential predictors of the target variable, with the highest correlation (0.878 and 0.839, respectively) and importance scores (8.721 and 5.044), suggesting that academic performance is strongly linked to the outcome. University Rating and SOP also show moderate correlations and moderate importance, but they have a less significant impact compared to CGPA and Total_Score. LOR and Research exhibit weaker correlations (0.640 and 0.582), but their importance ranks (4 and 3) suggest that they still play a meaningful role in the model’s predictions.

Overall, academic achievements, such as CGPA and Total_Score, dominate the classification and regression model, while other factors like University Rating, LOR, and Research contribute to the prediction but with less influence.

7.0 Conclusion

Classification Model

The performance metrics of four classification models—Random Forest, Logistic Regression, SVM, and kNN—are compared across several evaluation criteria.Overall, SVM demonstrates the strongest overall performance, particularly in terms of precision, recall, F1-Score, and AUC, while Logistic Regression and kNN also provide strong results.

# Classification Model
print(results_classification)

##      Metric Random_Forest Logistic_Regression       SVM       kNN
## 1  Accuracy     0.8125000           0.8250000 0.8375000 0.8125000
## 2 Precision     0.8809524           0.9024390 0.9047619 0.8478261
## 3    Recall     0.7872340           0.7872340 0.8085106 0.8297872
## 4  F1-Score     0.8314607           0.8409091 0.8539326 0.8387097
## 5       AUC     0.8178594           0.8330110 0.8436493 0.8088330

Regression Model

The performance metrics of four regression models—Random Forest, Linear Regression, SVR, and LightGBM—are compared across several evaluation criteria Overall, Linear Regression is the top performer, with Random Forest and LightGBM also providing strong results.

# Regression Model
print(results_regression)

##               Model       RMSE        R2        MAE     MAPE
## 1     Random Forest 0.06933500 0.7899901 0.05126318 8.398790
## 2 Linear Regression 0.06738683 0.8011028 0.04930168 8.030410
## 3               SVR 0.07388692 0.7786623 0.05429197 9.266271
## 4          LightGBM 0.07070301 0.7805614 0.05135666 8.481325

Summary

In conclusion, SVM is the top-performing model in classification tasks, excelling in Precision, Recall, F1-Score, and AUC, followed closely by Logistic Regression and kNN. For regression tasks, Linear Regression leads with the best performance in RMSE, R2,MAE and MAPE, slightly outperforming Random Forest, with LightGBM also showing strong results.

Meanwhile, feature importance analysis shows that features like CGPA and Total_Score are the most influential in both classification and regression models. The academic performance in CGPA and Total_Score are the key predictors that significantly influence graduate school admission outcomes.

Group_Assignment

2024-12-16

1.0 Introduction

1.1 Research Objectives

1.2 R Markdown Setup

2.0 Data Collection

3.0 Data Cleaning and Preprocessing

3.1 Handling Missing Values

3.2 Feature Engineering

4.0 Exploratory Data Analysis (EDA)

4.1 Correlation matrix

4.2 Distributions for numerical variables

4.3 Distribution of categorical variables

4.4 Scatter plots by numerical variables

4.5 Boxplot by categorical variable

5.0 Modelling

5.1 Feature Selection

5.2 Classification Model

5.2.1 Random Forest

5.2.2 Logistic Regression

5.2.3 Support Vector Machine

5.2.4 K-Nearest Neighbors

5.2.5 Evaluation of Classification Models

5.3 Regression Model

5.3.1 Random Forest

5.3.2 Linear Regression

5.3.3 SVR Model

5.3.4 LightGBM

5.3.5 Evaluation of Regression Models

6.0 Evaluation and Intepretation

7.0 Conclusion