Assignment 3 - Discriminant Analysis

Soal 1

Question 1: Conduct discriminant analysis and report the results using student-mat.csv data.

Binary classification: pass if G3≥10, else fail.

Dataset Information

This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).

Atribute Information

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
age - student’s age (numeric: from 15 to 22)
address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â€“ 5th to 9th grade, 3 â€“ secondary education or 4 â€“ higher education)
Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 â€“ 5th to 9th grade, 3 â€“ secondary education or 4 â€“ higher education)
Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
failures - number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup - extra educational support (binary: yes or no)
famsup - family educational support (binary: yes or no)
paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities - extra-curricular activities (binary: yes or no)
nursery - attended nursery school (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
internet - Internet access at home (binary: yes or no)
romantic - with a romantic relationship (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
freetime - free time after school (numeric: from 1 - very low to 5 - very high)
goout - going out with friends (numeric: from 1 - very low to 5 - very high)
Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
health - current health status (numeric: from 1 - very bad to 5 - very good)
absences - number of school absences (numeric: from 0 to 93)

these grades are related with the course subject, Math or Portuguese:

G1 - first period grade (numeric: from 0 to 20)
G2 - second period grade (numeric: from 0 to 20)
G3 - final grade (numeric: from 0 to 20, output target)

Data Analysis

A. Library

library(caret)
library(dplyr)
select <- dplyr::select 
library(ggplot2)
library(tidyr)
library(GGally)
library(corrplot)
library(GGally)
library(patchwork)
library(biotools)
library(MASS)

B. Data

2.1. Import Data

mat <- read.csv("D:/UNY/MySta/SEM 5/StatMul/dataset StatMul/student-mat.csv",header = TRUE, sep = ';')
head(mat)

##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

2.2. Data Structure, Character Variables, and Size

# Structure
str(mat)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

# Character variable
sapply(mat, class)

##      school         sex         age     address     famsize     Pstatus 
## "character" "character"   "integer" "character" "character" "character" 
##        Medu        Fedu        Mjob        Fjob      reason    guardian 
##   "integer"   "integer" "character" "character" "character" "character" 
##  traveltime   studytime    failures   schoolsup      famsup        paid 
##   "integer"   "integer"   "integer" "character" "character" "character" 
##  activities     nursery      higher    internet    romantic      famrel 
## "character" "character" "character" "character" "character"   "integer" 
##    freetime       goout        Dalc        Walc      health    absences 
##   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
##          G1          G2          G3 
##   "integer"   "integer"   "integer"

# Size
cat("Number of rows:", nrow(mat), "\n")

## Number of rows: 395

cat("Number of columns:", ncol(mat), "\n")

## Number of columns: 33

2.3. Data Summary

summary(mat)

##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

C. Data Preprocessing

3.1. Missing and Duplicate Values

# Check missing values
colSums(is.na(mat))

##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0

# Check duplicate values
sum(duplicated(mat))

## [1] 0

3.2.Correcting Data Types

# Nominal
nominal_vars <- c("school","sex","address","famsize","Pstatus",
                  "Mjob","Fjob","reason","guardian",
                  "schoolsup","famsup","paid","activities",
                  "nursery","higher","internet","romantic")

mat[nominal_vars] <- lapply(mat[nominal_vars], factor)

# Ordinal
ordinal_vars <- c("Medu","Fedu","traveltime","studytime","failures",
                  "famrel","freetime","goout","Dalc","Walc","health")

mat[ordinal_vars] <- lapply(mat[ordinal_vars], function(x){
  factor(x, ordered = TRUE)
})

str(mat)

## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Fedu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 4 1 1 1 1 1 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

3.3. Target Variable (Pass/Fail)

names(mat)

##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"

# Make binary target
mat$G3_pass <- ifelse(mat$G3 >= 10, "pass", "fail")
mat$G3_pass <- factor(mat$G3_pass, levels = c("fail", "pass"))

# Drop original G3
mat <- dplyr::select(mat, -G3)

# Check distribution
table(mat$G3_pass)

## 
## fail pass 
##  130  265

prop.table(table(mat$G3_pass)) * 100

## 
##     fail     pass 
## 32.91139 67.08861

The distribution is slightly imbalanced, but not severely imbalanced → no need for SMOTE.

Not SMOTE because:

LDA is sensitive to changes in the covariance matrix
SMOTE changes the data structure → reduces the validity of LDA
An imbalance of 33% vs. 67% is still mild
LDA is already capable of performing well on mild imbalances

3.4. Train–Test Split

set.seed(2025)
index <- createDataPartition(mat$G3_pass, p = 0.8, list = FALSE)
train <- mat[index, ]
test  <- mat[-index, ]

3.5. Outlier

Because many ordinal variables are not appropriately analyzed as numeric variables, we only count outliers on the actual numeric variables:age, absences, G1, G2

num_cols <- train %>% select(age, absences, G1, G2)

detect_outliers <- function(x){
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  low <- Q1 - 1.5 * IQR
  up  <- Q3 + 1.5 * IQR
  sum(x < low | x > up)
}

out_count <- sapply(num_cols, detect_outliers)
out_percent <- round((out_count / nrow(train)) * 100, 2)

data.frame(
  Variable = names(out_count),
  Outlier_Count = out_count,
  Outlier_Percent = out_percent
)

##          Variable Outlier_Count Outlier_Percent
## age           age             1            0.32
## absences absences            14            4.43
## G1             G1             0            0.00
## G2             G2            12            3.80

It does not handle outliers because:

Outlier absences are a natural characteristic of student behavior.
LDA is more robust than regression methods.
Capping/transformation changes the covariance matrix → violates the LDA assumptions.

3.6. Scaling Numeric Predictors

train_scaled <- train
test_scaled  <- test

numeric_vars <- c("age","absences","G1","G2")

train_scaled[, numeric_vars] <- scale(train_scaled[, numeric_vars])
test_scaled[, numeric_vars]  <- scale(test_scaled[, numeric_vars])

D. Exploratory Data Analysis (EDA)

4.1.Data Structure

str(train_scaled)

## 'data.frame':    316 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 1 1 2 ...
##  $ age       : num  0.228 -1.305 -0.539 -0.539 0.228 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 2 1 2 1 1 1 2 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 1 1 2 2 2 2 ...
##  $ Medu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 5 5 3 5 4 4 5 3 5 ...
##  $ Fedu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 3 4 3 5 3 5 5 2 5 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 2 4 3 3 4 3 5 4 2 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 4 3 3 5 3 3 2 3 4 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 2 4 2 2 2 2 4 4 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 1 2 2 2 2 2 2 2 1 1 ...
##  $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 1 1 1 2 1 1 1 3 1 ...
##  $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 2 2 2 2 2 2 3 1 ...
##  $ failures  : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 2 2 1 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 2 2 1 1 1 2 1 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ famrel    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 3 5 4 4 4 5 3 5 4 ...
##  $ freetime  : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 2 4 4 1 2 5 3 2 3 ...
##  $ goout     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 2 2 4 4 2 1 3 2 3 ...
##  $ Dalc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Walc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 2 1 3 ...
##  $ health    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 5 5 3 1 1 5 2 4 5 ...
##  $ absences  : num  -0.2029 -0.4391 0.5056 -0.6752 0.0333 ...
##  $ G1        : num  -1.827 1.267 1.267 0.339 -1.517 ...
##  $ G2        : num  -1.51 0.892 1.159 0.358 -1.51 ...
##  $ G3_pass   : Factor w/ 2 levels "fail","pass": 1 2 2 2 1 2 2 1 2 2 ...

summary(train_scaled)

##  school   sex          age          address famsize   Pstatus Medu    Fedu  
##  GP:278   F:165   Min.   :-1.3051   R: 67   GT3:220   A: 33   0:  2   0: 1  
##  MS: 38   M:151   1st Qu.:-0.5385   U:249   LE3: 96   T:283   1: 49   1:66  
##                   Median : 0.2280                             2: 79   2:92  
##                   Mean   : 0.0000                             3: 79   3:81  
##                   3rd Qu.: 0.9946                             4:107   4:76  
##                   Max.   : 4.0609                                           
##        Mjob           Fjob            reason      guardian   traveltime
##  at_home : 43   at_home : 17   course    :113   father: 69   1:204     
##  health  : 28   health  : 14   home      : 93   mother:219   2: 88     
##  other   :112   other   :168   other     : 29   other : 28   3: 18     
##  services: 84   services: 93   reputation: 81                4:  6     
##  teacher : 49   teacher : 24                                           
##                                                                        
##  studytime failures schoolsup famsup     paid     activities nursery  
##  1: 82     0:247    no :275   no :127   no :170   no :150    no : 61  
##  2:158     1: 41    yes: 41   yes:189   yes:146   yes:166    yes:255  
##  3: 54     2: 16                                                      
##  4: 22     3: 12                                                      
##                                                                       
##                                                                       
##  higher    internet  romantic  famrel  freetime goout   Dalc    Walc    health 
##  no : 16   no : 47   no :203   1:  7   1: 15    1: 17   1:218   1:122   1: 37  
##  yes:300   yes:269   yes:113   2: 13   2: 43    2: 76   2: 61   2: 69   2: 38  
##                                3: 55   3:132    3:110   3: 21   3: 63   3: 65  
##                                4:150   4: 95    4: 72   4:  8   4: 39   4: 54  
##                                5: 91   5: 31    5: 41   5:  8   5: 23   5:122  
##                                                                                
##     absences             G1                 G2           G3_pass   
##  Min.   :-0.6752   Min.   :-2.13623   Min.   :-2.84496   fail:104  
##  1st Qu.:-0.6752   1st Qu.:-0.89875   1st Qu.:-0.44262   pass:212  
##  Median :-0.2029   Median : 0.02937   Median : 0.09123             
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000             
##  3rd Qu.: 0.2694   3rd Qu.: 0.64812   3rd Qu.: 0.62508             
##  Max.   : 8.1808   Max.   : 2.50435   Max.   : 2.22663

4.2. Target Distribution

ggplot(train_scaled, aes(x = G3_pass, fill = G3_pass)) +
  geom_bar() +
  geom_text(stat = "count", 
            aes(label = ..count..), 
            vjust = -0.5) +
  scale_fill_manual(values = c("fail"="tomato","pass"="steelblue")) +
  labs(title = "Distribution of Pass/Fail in Training Set",
       x = "Outcome", y = "Count") +
  theme_minimal()

## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

4.3. Numeric Distribution

num_vars <- c("age","absences","G1","G2")

train_scaled %>%
  pivot_longer(cols = num_vars, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x=Value)) +
  geom_histogram(bins = 30, fill="steelblue", alpha=0.7) +
  facet_wrap(~ Variable, scales="free") +
  labs(title="Histogram of Numeric Variables (Scaled)") +
  theme_minimal()

## Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
## ℹ Please use `all_of()` or `any_of()` instead.
##   # Was:
##   data %>% select(num_vars)
## 
##   # Now:
##   data %>% select(all_of(num_vars))
## 
## See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

4.4. Boxplot

train_scaled %>%
  pivot_longer(cols = num_vars, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x=Variable, y=Value, fill=Variable)) +
  geom_boxplot(alpha=0.7) +
  theme_minimal() +
  labs(title="Boxplot of Scaled Numeric Variables") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

4.5. Korelasi antar numeric

ggpairs(train_scaled[, num_vars],
        upper = list(continuous = wrap("cor", size = 4)),
        lower = list(continuous = "smooth"),
        diag = list(continuous = "densityDiag"))

4.6. Categorical Nominal Variables

options(repr.plot.width = 14, repr.plot.height = 10)

nominal_vars <- c("school","sex","address","famsize","Pstatus",
                  "Mjob","Fjob","reason","guardian",
                  "schoolsup","famsup","paid","activities",
                  "nursery","higher","internet","romantic")

max_count <- max(sapply(nominal_vars, function(v){
  max(table(train_scaled[[v]]))
}))

plot_nominal_freq <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]])) +
    geom_bar(fill="steelblue") +
    coord_cartesian(ylim = c(0, max_count)) +
    labs(title = var, x = NULL, y = "Count") +
    theme_minimal(base_size = 12) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
      plot.title = element_text(size = 14, face = "bold"),
      plot.margin = margin(5,5,5,5),
      panel.grid.minor = element_blank()
    )
}

grid_nominal_freq <- wrap_plots(
  lapply(nominal_vars, plot_nominal_freq),
  ncol = 4
)

grid_nominal_freq

# Crosstab vs Target (Pass/Fail)

options(repr.plot.width = 14, repr.plot.height = 10)

plot_nominal_target <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]], fill = G3_pass)) +
    geom_bar(position = "fill") +
    scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
    labs(title = var, x = NULL, y = "Proportion") +
    theme_minimal(base_size = 12) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size = 9),
      plot.title = element_text(size = 14, face = "bold"),
      plot.margin = margin(5,5,5,5)
    )
}

grid_nominal_target <- wrap_plots(
  lapply(nominal_vars, plot_nominal_target),
  ncol = 4,
  guides = "collect"
)

grid_nominal_target

4.7. Categorical Ordinal Variables

ordinal_vars <- c("Medu","Fedu","traveltime","studytime","failures",
                  "famrel","freetime","goout","Dalc","Walc","health")

# Barplot Frekuensi
plot_ordinal_freq <- function(var){
  ggplot(train_scaled, aes_string(x = var)) +
    geom_bar(fill="steelblue") +
    theme_minimal() +
    labs(title = var, x = NULL, y = "Count")
}

grid_ordinal_freq <- wrap_plots(lapply(ordinal_vars, plot_ordinal_freq), ncol = 3)

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

grid_ordinal_freq

# Crosstab vs Target (Pass/Fail)

plot_ordinal_target <- function(var){
  ggplot(train_scaled, aes_string(x = var, fill = "G3_pass")) +
    geom_bar(position = "fill") +
    scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
    theme_minimal() +
    labs(title = var, x = NULL, y = "Proportion")
}

grid_ordinal_target <- wrap_plots(lapply(ordinal_vars, plot_ordinal_target), ncol = 3)
grid_ordinal_target

4.8. Numeric vs Target (Density / Boxplot)

train_scaled %>%
  pivot_longer(cols=num_vars, names_to="Variable", values_to="Value") %>%
  ggplot(aes(x=Value, fill=G3_pass)) +
  geom_density(alpha=0.5) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
  labs(title="Density Plot of Numeric Variables by Pass/Fail") +
  theme_minimal()

train_scaled %>%
  pivot_longer(cols=num_vars, names_to="Variable", values_to="Value") %>%
  ggplot(aes(x=G3_pass, y=Value, fill=G3_pass)) +
  geom_boxplot(alpha=.7) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
  labs(title="Numeric Variables vs Pass/Fail") +
  theme_minimal()

4.9. Pairwise Plot

ggpairs(train_scaled, columns = numeric_vars, aes(color=G3_pass))

E. Linear Discriminant Analysis (LDA)

which assumes that the covariance of the independent variables is equal across all classes

# Convert ordinal → numeric
convert_ordinal <- function(df, vars){
  for(v in vars){
    df[[v]] <- as.numeric(df[[v]])
  }
  return(df)
}

train_final <- convert_ordinal(train_scaled, ordinal_vars)
test_final  <- convert_ordinal(test_scaled, ordinal_vars)

Preparation

# Prior sample (sesuai train_final)
prior_sample <- prop.table(table(train_final$G3_pass))
prior_sample

## 
##      fail      pass 
## 0.3291139 0.6708861

prior_equal <- c(1,1)/2

# X and y split
y_train <- train_final$G3_pass
y_test  <- test_final$G3_pass

X_train <- train_final[, setdiff(names(train_final), "G3_pass")]
X_test  <- test_final[, setdiff(names(test_final), "G3_pass")]

summarize.class <- function(original, classify){
  class.table <- table(original, classify)
  numb <- rowSums(class.table)
  prop <- round(class.table/numb, 4)
  overall <- round(sum(diag(class.table))/sum(class.table), 4)
  list(class.table = class.table, prop = prop, overall.correct = overall)
}

# Summarize Class Code
summarize.class <- function(original, classify){
  class.table <- table(original, classify)
  numb <- rowSums(class.table)
  prop <- round(class.table/numb, 4)
  overall <- round(sum(diag(class.table))/sum(class.table), 4)
  list(class.table = class.table, prop = prop, overall.correct = overall)
}

DA1 - LDA Resubstitution (prior = c(1,1)/2)

DA1 <- lda(
  G3_pass ~ .,
  data = train_final,
  CV = FALSE,
  prior = prior_equal
)

DA1

## Call:
## lda(G3_pass ~ ., data = train_final, CV = FALSE, prior = prior_equal)
## 
## Prior probabilities of groups:
## fail pass 
##  0.5  0.5 
## 
## Group means:
##       schoolMS      sexM        age  addressU famsizeLE3  PstatusT     Medu
## fail 0.1442308 0.4230769  0.2648881 0.7500000  0.2692308 0.9230769 3.548077
## pass 0.1084906 0.5047170 -0.1299451 0.8066038  0.3207547 0.8820755 3.863208
##          Fedu Mjobhealth Mjobother Mjobservices Mjobteacher Fjobhealth
## fail 3.307692 0.05769231 0.4038462    0.2403846   0.1538462 0.01923077
## pass 3.627358 0.10377358 0.3301887    0.2783019   0.1556604 0.05660377
##      Fjobother Fjobservices Fjobteacher reasonhome reasonother reasonreputation
## fail 0.5384615    0.3173077  0.05769231  0.2884615  0.07692308        0.2115385
## pass 0.5283019    0.2830189  0.08490566  0.2971698  0.09905660        0.2783019
##      guardianmother guardianother traveltime studytime failures schoolsupyes
## fail      0.6923077    0.14423077   1.528846  1.942308 1.750000    0.1730769
## pass      0.6933962    0.06132075   1.410377  2.103774 1.146226    0.1084906
##      famsupyes   paidyes activitiesyes nurseryyes higheryes internetyes
## fail 0.6538462 0.4134615     0.5288462  0.8076923 0.8942308   0.8269231
## pass 0.5707547 0.4858491     0.5235849  0.8066038 0.9764151   0.8632075
##      romanticyes   famrel freetime    goout     Dalc     Walc   health
## fail   0.4519231 3.913462 3.317308 3.432692 1.576923 2.365385 3.711538
## pass   0.3113208 3.990566 3.240566 2.995283 1.466981 2.235849 3.528302
##         absences         G1        G2
## fail  0.14566053 -0.9463416 -1.043207
## pass -0.07145611  0.4642431  0.511762
## 
## Coefficients of linear discriminants:
##                           LD1
## schoolMS         -0.157641005
## sexM              0.016155875
## age              -0.184439141
## addressU         -0.076221538
## famsizeLE3       -0.248553200
## PstatusT         -0.445774831
## Medu             -0.069679116
## Fedu              0.016186178
## Mjobhealth       -0.434844020
## Mjobother        -0.173955448
## Mjobservices     -0.080591316
## Mjobteacher      -0.385145204
## Fjobhealth       -0.054480316
## Fjobother         0.046078900
## Fjobservices     -0.216629527
## Fjobteacher      -0.290054426
## reasonhome        0.213842135
## reasonother       0.390951396
## reasonreputation  0.186025993
## guardianmother   -0.054782024
## guardianother    -0.182278007
## traveltime        0.138035696
## studytime        -0.001142424
## failures         -0.290233698
## schoolsupyes     -0.297870987
## famsupyes        -0.130122846
## paidyes          -0.026726206
## activitiesyes    -0.145843854
## nurseryyes       -0.261226987
## higheryes         0.107052597
## internetyes       0.071384293
## romanticyes       0.035879386
## famrel            0.135548236
## freetime         -0.037126921
## goout            -0.136272410
## Dalc             -0.014657208
## Walc              0.126073843
## health           -0.033065769
## absences         -0.190160417
## G1                0.328380840
## G2                1.166422841

# Posterior, class, LD scores
pred_DA1 <- predict(DA1)

head(pred_DA1$posterior)

##           fail       pass
## 2 0.9633519880 0.03664801
## 4 0.0129978094 0.98700219
## 6 0.0023152754 0.99768472
## 7 0.0441119557 0.95588804
## 8 0.9930064698 0.00699353
## 9 0.0000106775 0.99998932

head(pred_DA1$class)

## [1] fail pass pass pass fail pass
## Levels: fail pass

head(pred_DA1$x)

##         LD1
## 2 -1.226714
## 4  1.624791
## 6  2.276232
## 7  1.154235
## 8 -1.859646
## 9  4.295622

# Rule accuracy: RESUB
acc_DA1_resub <- summarize.class(y_train, pred_DA1$class)

# Validation 
pred_DA1_val <- predict(DA1, newdata = test_final)
acc_DA1_val <- summarize.class(y_test, pred_DA1_val$class)
acc_DA1_val

## $class.table
##         classify
## original fail pass
##     fail   23    3
##     pass    7   46
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.8846 0.1154
##     pass 0.1321 0.8679
## 
## $overall.correct
## [1] 0.8734

DA2 - LDA Cross-Validation (prior = c(1,1)/2)

DA2 <- lda(
  G3_pass ~ .,
  data = train_final,
  CV = TRUE,
  prior = prior_equal
)

head(DA2$posterior)

##           fail        pass
## 2 9.459679e-01 0.054032107
## 4 1.426205e-02 0.985737951
## 6 2.148856e-03 0.997851144
## 7 4.994589e-02 0.950054108
## 8 9.901866e-01 0.009813376
## 9 4.236083e-06 0.999995764

head(DA2$class)

## [1] fail pass pass pass fail pass
## Levels: fail pass

# CV accuracy
acc_DA2_cv <- summarize.class(y_train, DA2$class)
acc_DA2_cv

## $class.table
##         classify
## original fail pass
##     fail   92   12
##     pass   32  180
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.8846 0.1154
##     pass 0.1509 0.8491
## 
## $overall.correct
## [1] 0.8608

DA3 - LDA Resubstitution (Prior = Sample Prior)

DA3 <- lda(
  G3_pass ~ .,
  data = train_final,
  CV = FALSE   
)

# Posterior validation
pred_DA3_val <- predict(DA3, newdata = test_final)

head(pred_DA3_val$posterior)

##            fail       pass
## 1  0.9856296532 0.01437035
## 3  0.8705693860 0.12943061
## 5  0.2198304309 0.78016957
## 15 0.0000618789 0.99993812
## 25 0.4077133182 0.59228668
## 42 0.0152107350 0.98478927

head(pred_DA3_val$class)

## [1] fail fail pass pass pass pass
## Levels: fail pass

acc_DA3_val <- summarize.class(y_test, pred_DA3_val$class)
acc_DA3_val

## $class.table
##         classify
## original fail pass
##     fail   19    7
##     pass    4   49
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.7308 0.2692
##     pass 0.0755 0.9245
## 
## $overall.correct
## [1] 0.8608

DA4 - LDA Cross-Validation (Prior = Sample Prior)

DA4 <- lda(
  G3_pass ~ .,
  data = train_final,
  CV = TRUE
)

head(DA4$posterior)

##           fail       pass
## 2 8.957094e-01 0.10429060
## 4 7.047682e-03 0.99295232
## 6 1.055311e-03 0.99894469
## 7 2.514146e-02 0.97485854
## 8 9.801976e-01 0.01980239
## 9 2.078083e-06 0.99999792

head(DA4$class)

## [1] fail pass pass pass fail pass
## Levels: fail pass

acc_DA4_cv <- summarize.class(y_train, DA4$class)
acc_DA4_cv

## $class.table
##         classify
## original fail pass
##     fail   83   21
##     pass   21  191
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.7981 0.2019
##     pass 0.0991 0.9009
## 
## $overall.correct
## [1] 0.8671

Box’s M Test

num_vars <- sapply(X_train, is.numeric)
X_num <- X_train[, num_vars]

# Box’s M test (only numeric continuous)
boxM(as.matrix(X_num), y_train)

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  as.matrix(X_num)
## Chi-Sq (approx.) = 431.28, df = 120, p-value < 2.2e-16

hypothesis

H₀: Σ₁ = Σ₂ (covariance matrices equal)
H₁: Σ₁ ≠ Σ₂ (covariance matrices unequal)

Kesimpulan

p-value < 0.001 → tolak H. then, covariance matrices berbeda (Σ₁ ≠ Σ₂)
LDA tidak sepenuhnya sesuai asumsi, karena LDA mengharuskan Σ₁ = Σ₂. QDA lebih cocok karena QDA membolehkan Σ₁ ≠ Σ₂

F. Quadratic Discriminant Analysis (QDA)

which does not assume equal covariance across the classes.

DA5 - QDA Resubstitution (Prior = Sample Prior)

DA5 <- qda(
  G3_pass ~ .,
  data = train_final,
  CV = FALSE
)

# Posterior validation
pred_DA5_val <- predict(DA5, newdata = test_final)

head(pred_DA5_val$posterior)

##            fail      pass
## 1  3.408787e-01 0.6591213
## 3  3.264243e-03 0.9967358
## 5  6.487377e-02 0.9351262
## 15 2.298263e-15 1.0000000
## 25 8.141272e-13 1.0000000
## 42 1.326957e-02 0.9867304

head(pred_DA5_val$class)

## [1] pass pass pass pass pass pass
## Levels: fail pass

acc_DA5_val <- summarize.class(y_test, pred_DA5_val$class)
acc_DA5_val

## $class.table
##         classify
## original fail pass
##     fail    8   18
##     pass    2   51
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.3077 0.6923
##     pass 0.0377 0.9623
## 
## $overall.correct
## [1] 0.7468

DA6 - QDA Cross-Validation (Prior = Sample Prior)

DA6 <- qda(
  G3_pass ~ .,
  data = train_final,
  CV = TRUE
)

head(DA6$posterior)

##           fail      pass
## 2 9.868601e-04 0.9990131
## 4 3.764598e-13 1.0000000
## 6 1.849190e-22 1.0000000
## 7 1.983241e-08 1.0000000
## 8 7.143495e-08 0.9999999
## 9 1.081972e-27 1.0000000

head(DA6$class)

## [1] pass pass pass pass pass pass
## Levels: fail pass

acc_DA6_cv <- summarize.class(y_train, DA6$class)
acc_DA6_cv

## $class.table
##         classify
## original fail pass
##     fail   47   57
##     pass   15  197
## 
## $prop
##         classify
## original   fail   pass
##     fail 0.4519 0.5481
##     pass 0.0708 0.9292
## 
## $overall.correct
## [1] 0.7722

G. Summary

Summary table

# SUMMARY TABLE UNTUK MODEL DISCRIMINANT ANALYSIS 

extract_metrics <- function(acc_obj) {
  tab <- acc_obj$class.table
  
  # Overall accuracy 
  overall <- acc_obj$overall.correct
  
  # Success rate
  per_class_success <- diag(tab) / rowSums(tab)
  success <- mean(per_class_success, na.rm = TRUE)
  
  # Failure = 1 - overall
  failure <- 1 - overall
  
  return(list(
    success = success,
    failure = failure,
    overall = overall
  ))
}

m1 <- extract_metrics(acc_DA1_val)
m2 <- extract_metrics(acc_DA2_cv)
m3 <- extract_metrics(acc_DA3_val)
m4 <- extract_metrics(acc_DA4_cv)
m5 <- extract_metrics(acc_DA5_val)
m6 <- extract_metrics(acc_DA6_cv)

DA_summary <- data.frame(
  Model = c("DA1", "DA2", "DA3", "DA4", "DA5", "DA6"),
  CovMatrix = c(
    "Equal", "Equal", "Equal", "Equal", "Unequal", "Unequal"
  ),
  Priors = c(
    "Equal", "Equal", "Proportional", "Proportional", "Proportional", "Proportional"
  ),
  AccuracyMethod = c(
    "Validation", "Cross-validation",
    "Validation", "Cross-validation",
    "Validation", "Cross-validation"
  ),
  Success = c(
    m1$success, m2$success, m3$success, m4$success, m5$success, m6$success
  ),
  Failure = c(
    m1$failure, m2$failure, m3$failure, m4$failure, m5$failure, m6$failure
  ),
  Overall = c(
    m1$overall, m2$overall, m3$overall, m4$overall, m5$overall, m6$overall
  )
)
DA_summary

##   Model CovMatrix       Priors   AccuracyMethod   Success Failure Overall
## 1   DA1     Equal        Equal       Validation 0.8762700  0.1266  0.8734
## 2   DA2     Equal        Equal Cross-validation 0.8668360  0.1392  0.8608
## 3   DA3     Equal Proportional       Validation 0.8276488  0.1392  0.8608
## 4   DA4     Equal Proportional Cross-validation 0.8495102  0.1329  0.8671
## 5   DA5   Unequal Proportional       Validation 0.6349782  0.2532  0.7468
## 6   DA6   Unequal Proportional Cross-validation 0.6905842  0.2278  0.7722

Example : Count Manually:

## $class.table
##         classify
## original fail pass
##     fail   23    3
##     pass    7   46

## $overall.correct
## [1] 0.8734

Total = 23+3+7+46 = 79
Benar = 23 + 46 = 69
success = 69/79 = 0.8734177
failure = 1 – 0.8734177 = 0.1265823
overall.correct = 0.8734

Plot Model Comparison

# Plot Model Comparison 

DA_long <- DA_summary %>%
  select(Model, Success, Failure, Overall) %>%
  pivot_longer(cols = c(Success, Failure, Overall),
               names_to = "Metric",
               values_to = "Value")

ggplot(DA_long, aes(x = Model, y = Value, fill = Metric)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
  
  labs(
    title = "Comparison of Success, Failure, and Overall Accuracy Across DA Models",
    x = "Model",
    y = "Accuracy"
  ) +

  scale_y_continuous(limits = c(0,1), breaks = seq(0,1,0.1)) +
  scale_fill_brewer(palette = "Set2") +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    legend.title = element_blank()
  )