Assignment 3 - Discriminant Analysis

Question 1

Question 1: Conduct discriminant analysis and report the results using student-mat.csv data.

Binary classification: pass if G3≥10, else fail.

Dataset Information

This dataset describes student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and were collected from school reports and questionnaires. Two datasets are provided, covering performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see the source paper for more details).
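
The note above implies two modelling settings: one that keeps G1 and G2 as predictors, and a harder one that drops them. As a minimal sketch, the two predictor sets could be written as formulas. The toy data frame, the pass/fail column name `G3_pass`, and the names `full_formula` and `reduced_formula` are illustrative assumptions here, and the values are made up.

```r
# Toy data frame with a few of the dataset's column names (values are made up).
toy <- data.frame(
  age = c(15, 16, 17, 18),
  absences = c(0, 4, 10, 2),
  G1 = c(5, 12, 15, 9),
  G2 = c(6, 11, 14, 10),
  G3_pass = factor(c("fail", "pass", "pass", "pass"))
)

predictors <- setdiff(names(toy), "G3_pass")

# Setting 1: all predictors, including the earlier period grades.
full_formula <- reformulate(predictors, response = "G3_pass")

# Setting 2: drop G1 and G2, which are strongly correlated with G3.
reduced_formula <- reformulate(setdiff(predictors, c("G1", "G2")),
                               response = "G3_pass")

print(full_formula)
print(reduced_formula)
```

Comparing a model fitted under each formula would show how much of the predictive performance comes from the earlier grades alone.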

Attribute Information

Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:

  1. school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  2. sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
  3. age - student’s age (numeric: from 15 to 22)
  4. address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
  5. famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
  6. Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
  7. Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  8. Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  9. Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  10. Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
  11. reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  12. guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
  13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. failures - number of past class failures (numeric: n if 1<=n<3, else 4; the values observed in student-mat.csv range from 0 to 3)
  16. schoolsup - extra educational support (binary: yes or no)
  17. famsup - family educational support (binary: yes or no)
  18. paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  19. activities - extra-curricular activities (binary: yes or no)
  20. nursery - attended nursery school (binary: yes or no)
  21. higher - wants to take higher education (binary: yes or no)
  22. internet - Internet access at home (binary: yes or no)
  23. romantic - with a romantic relationship (binary: yes or no)
  24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
  26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
  27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. health - current health status (numeric: from 1 - very bad to 5 - very good)
  30. absences - number of school absences (numeric: from 0 to 93)

These grades are related to the course subject, Math or Portuguese:

  • G1 - first period grade (numeric: from 0 to 20)
  • G2 - second period grade (numeric: from 0 to 20)
  • G3 - final grade (numeric: from 0 to 20, output target)

Data Analysis

A. Library

library(caret)
library(dplyr)
library(ggplot2)
library(tidyr)
library(GGally)
library(corrplot)
library(patchwork)

B. Data

2.1. Import Data

mat <- read.csv("D:/UNY/MySta/SEM 5/StatMul/dataset StatMul/student-mat.csv",header = TRUE, sep = ';')
head(mat)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        3       yes     no  yes         no
## 4   mother          1         3        0        no    yes  yes        yes
## 5   father          1         2        0        no    yes  yes         no
## 6   mother          1         2        0        no    yes  yes        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        6  5  6  6
## 2        4  5  5  6
## 3       10  7  8 10
## 4        2 15 14 15
## 5        4  6 10 10
## 6       10 15 15 15

2.2. Data Structure, Character Variables, and Size

# Structure
str(mat)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "yes" "yes" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
# Character variable
sapply(mat, class)
##      school         sex         age     address     famsize     Pstatus 
## "character" "character"   "integer" "character" "character" "character" 
##        Medu        Fedu        Mjob        Fjob      reason    guardian 
##   "integer"   "integer" "character" "character" "character" "character" 
##  traveltime   studytime    failures   schoolsup      famsup        paid 
##   "integer"   "integer"   "integer" "character" "character" "character" 
##  activities     nursery      higher    internet    romantic      famrel 
## "character" "character" "character" "character" "character"   "integer" 
##    freetime       goout        Dalc        Walc      health    absences 
##   "integer"   "integer"   "integer"   "integer"   "integer"   "integer" 
##          G1          G2          G3 
##   "integer"   "integer"   "integer"
# Size
cat("Number of rows:", nrow(mat), "\n")
## Number of rows: 395
cat("Number of columns:", ncol(mat), "\n")
## Number of columns: 33

2.3. Data Summary

summary(mat)
##     school              sex                 age         address         
##  Length:395         Length:395         Min.   :15.0   Length:395        
##  Class :character   Class :character   1st Qu.:16.0   Class :character  
##  Mode  :character   Mode  :character   Median :17.0   Mode  :character  
##                                        Mean   :16.7                     
##                                        3rd Qu.:18.0                     
##                                        Max.   :22.0                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:395         Length:395         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :2.000  
##                                        Mean   :2.749   Mean   :2.522  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:395        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.448   Mean   :2.035   Mean   :0.3342                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:395         Length:395         Length:395         Length:395        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:395         Length:395         Length:395         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.944  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :3.000   Median :3.000   Median :1.000   Median :2.000  
##  Mean   :3.235   Mean   :3.109   Mean   :1.481   Mean   :2.291  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##      health         absences            G1              G2       
##  Min.   :1.000   Min.   : 0.000   Min.   : 3.00   Min.   : 0.00  
##  1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00   1st Qu.: 9.00  
##  Median :4.000   Median : 4.000   Median :11.00   Median :11.00  
##  Mean   :3.554   Mean   : 5.709   Mean   :10.91   Mean   :10.71  
##  3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :75.000   Max.   :19.00   Max.   :19.00  
##        G3       
##  Min.   : 0.00  
##  1st Qu.: 8.00  
##  Median :11.00  
##  Mean   :10.42  
##  3rd Qu.:14.00  
##  Max.   :20.00

C. Data Preprocessing

3.1. Missing and Duplicate Values

# Check missing values
colSums(is.na(mat))
##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0
# Check duplicate values
sum(duplicated(mat))
## [1] 0

3.2. Correcting Data Types

# List nominal categorical variables
nominal_vars <- c("school","sex","address","famsize","Pstatus",
                  "Mjob","Fjob","reason","guardian",
                  "schoolsup","famsup","paid","activities",
                  "nursery","higher","internet","romantic")

mat[nominal_vars] <- lapply(mat[nominal_vars], factor)

# Ordinal variables
ordinal_vars <- c("Medu","Fedu","traveltime","studytime","failures",
                  "famrel","freetime","goout","Dalc","Walc","health")

mat[ordinal_vars] <- lapply(mat[ordinal_vars], function(x){
  factor(x, ordered = TRUE)
})
str(mat)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 5 4 5 3 5 4 4 ...
##  $ Fedu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 5 2 2 3 4 4 3 5 3 5 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 4 1 1 1 1 1 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
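
One consequence of storing these variables as ordered factors is worth keeping in mind: formula-based model functions in R expand an ordered factor with polynomial contrasts (linear, quadratic, cubic, and so on) rather than with one dummy column per level. The sketch below uses made-up values for a `studytime`-like variable to show the columns that `model.matrix()` would produce; it is an illustration, not output from the assignment data.

```r
# Made-up values covering all four levels of a studytime-like variable.
toy <- data.frame(studytime = factor(c(1, 2, 3, 4, 2, 1), ordered = TRUE))

# Ordered factors use polynomial contrasts (contr.poly) by default.
print(contrasts(toy$studytime))

# The design matrix therefore has .L, .Q and .C columns.
mm <- model.matrix(~ studytime, data = toy)
print(colnames(mm))
```

If one dummy column per level is preferred, the variable can be kept as an unordered factor instead; if a single linear effect is preferred, it can be left as an integer.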

3.3. Target Variable (Pass/Fail)

# Make binary target
mat$G3_pass <- ifelse(mat$G3 >= 10, "pass", "fail")
mat$G3_pass <- factor(mat$G3_pass, levels = c("fail", "pass"))

# Drop original G3
mat <- mat %>% select(-G3)

# Check distribution
table(mat$G3_pass)
## 
## fail pass 
##  130  265
prop.table(table(mat$G3_pass)) * 100
## 
##     fail     pass 
## 32.91139 67.08861

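The classes are moderately imbalanced (about 33% fail vs. 67% pass), but not severely, so SMOTE is not applied.

Reasons for not using SMOTE:

  • LDA relies on the class means and the pooled covariance matrix, so it is sensitive to anything that alters the covariance structure
  • SMOTE adds synthetic observations interpolated between existing ones, which changes that structure and can distort the LDA estimates
  • An imbalance of 33% vs. 67% is still mild
  • LDA generally copes well with mild imbalance, since the class proportions enter the model as prior probabilities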
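
A useful reference point when reading later accuracy figures is the majority-class baseline. Using only the class counts reported above (130 fail, 265 pass), a classifier that always predicts "pass" would be right about 67% of the time on the full data. The variable names below are illustrative.

```r
# Class counts taken from the table above.
class_counts <- c(fail = 130, pass = 265)

# Share of each class, in percent.
class_share <- class_counts / sum(class_counts) * 100
print(round(class_share, 2))

# Accuracy of always predicting the most frequent class.
baseline_accuracy <- max(class_counts) / sum(class_counts)
print(round(baseline_accuracy, 4))
```

Any model worth reporting should beat this baseline by a clear margin.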

3.4. Train–Test Split

set.seed(2025)
index <- createDataPartition(mat$G3_pass, p = 0.8, list = FALSE)
train <- mat[index, ]
test  <- mat[-index, ]
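
`createDataPartition()` samples within each level of the outcome, so the pass/fail proportions stay roughly the same in both subsets. The base-R sketch below illustrates that idea on made-up labels with the same class counts as this dataset. It approximates the concept only and is not caret's actual implementation; `stratified_index` is a hypothetical helper.

```r
set.seed(2025)

# Made-up labels with the same class counts as the data (130 fail, 265 pass).
y <- factor(rep(c("fail", "pass"), times = c(130, 265)))

# Hypothetical helper: draw a share p of the row indices within each class.
stratified_index <- function(y, p) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(idx) {
    idx[sample.int(length(idx), size = ceiling(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

idx <- stratified_index(y, p = 0.8)

# Class proportions in the full data, the training part and the test part.
print(prop.table(table(y)))
print(prop.table(table(y[idx])))
print(prop.table(table(y[-idx])))
```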

3.5. Outlier

Because the ordinal variables are not meaningfully analyzed as numeric, outliers are counted only for the truly numeric variables: age, absences, G1, and G2.

num_cols <- train %>% select(age, absences, G1, G2)

detect_outliers <- function(x){
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  low <- Q1 - 1.5 * IQR
  up  <- Q3 + 1.5 * IQR
  sum(x < low | x > up)
}

out_count <- sapply(num_cols, detect_outliers)
out_percent <- round((out_count / nrow(train)) * 100, 2)

out_summary <- data.frame(
  Variable = names(out_count),
  Outlier_Count = out_count,
  Outlier_Percent = out_percent
)

print(out_summary)
##          Variable Outlier_Count Outlier_Percent
## age           age             1            0.32
## absences absences            14            4.43
## G1             G1             0            0.00
## G2             G2            12            3.80
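
To make the 1.5 × IQR rule concrete, here is a small worked example on a made-up vector (not taken from the dataset). With quartiles of 2 and 4, the fences fall at -1 and 7, so only the value 100 is flagged.

```r
# Made-up values with one obvious extreme.
x <- c(1, 2, 3, 4, 100)

q1 <- quantile(x, 0.25, names = FALSE)   # first quartile: 2
q3 <- quantile(x, 0.75, names = FALSE)   # third quartile: 4
iqr <- q3 - q1                           # interquartile range: 2

lower_fence <- q1 - 1.5 * iqr            # -1
upper_fence <- q3 + 1.5 * iqr            # 7

# Values outside the fences are counted as outliers.
flagged <- x[x < lower_fence | x > upper_fence]
print(flagged)
```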

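Outliers are left untreated because:

  • Extreme absence counts are a natural characteristic of student behavior, not data errors.
  • The flagged share is small (under 5% for every variable), so LDA is expected to tolerate it, even though its mean and covariance estimates are not outlier-resistant.
  • Capping or transforming the values would alter the covariance structure that LDA estimates from the data.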

3.6. Scaling Numeric Predictors

train_scaled <- train
test_scaled  <- test

numeric_vars <- c("age","absences","G1","G2")

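# Estimate centering and scaling values on the training set only
train_center <- colMeans(train[, numeric_vars])
train_sd     <- apply(train[, numeric_vars], 2, sd)

# Apply the training parameters to both sets so they share one scale
train_scaled[, numeric_vars] <- scale(train[, numeric_vars], center = train_center, scale = train_sd)
test_scaled[, numeric_vars]  <- scale(test[, numeric_vars], center = train_center, scale = train_sd)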
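
When standardizing, the centering and scaling values are normally estimated on the training set and then reused for the test set, so that both sets live on the same scale and no information from the test set enters preprocessing. A minimal sketch on made-up numbers (the objects `toy_train` and `toy_test` are illustrative):

```r
# Made-up training and test values for one numeric predictor.
toy_train <- data.frame(absences = c(0, 2, 4, 6, 8))
toy_test  <- data.frame(absences = c(4, 10))

# Estimate the mean and standard deviation on the training set only.
train_mean <- colMeans(toy_train)
train_sd   <- apply(toy_train, 2, sd)

# Apply the same parameters to both sets.
train_z <- scale(toy_train, center = train_mean, scale = train_sd)
test_z  <- scale(toy_test,  center = train_mean, scale = train_sd)

print(train_z[, "absences"])
print(test_z[, "absences"])
```

Here a test value equal to the training mean (4) maps to 0. Scaling the test set on its own would instead center it on the test mean, so the same raw value would receive a different standardized score in each set.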

D. Exploratory Data Analysis (EDA)

4.1. Data Structure

str(train_scaled)
## 'data.frame':    316 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 2 2 1 2 2 1 1 2 ...
##  $ age       : num  0.228 -1.305 -0.539 -0.539 0.228 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 2 1 2 1 1 1 2 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 2 2 2 2 1 1 2 2 2 2 ...
##  $ Medu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 5 5 3 5 4 4 5 3 5 ...
##  $ Fedu      : Ord.factor w/ 5 levels "0"<"1"<"2"<"3"<..: 2 3 4 3 5 3 5 5 2 5 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 2 4 3 3 4 3 5 4 2 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 3 4 3 3 5 3 3 2 3 4 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 2 4 2 2 2 2 4 4 1 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 1 2 2 2 2 2 2 2 1 1 ...
##  $ traveltime: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 1 1 1 2 1 1 1 3 1 ...
##  $ studytime : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 2 2 2 2 2 2 3 1 ...
##  $ failures  : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 1 1 1 1 2 1 1 1 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 2 2 2 1 2 2 2 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 2 2 1 1 2 2 2 1 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 2 2 1 1 1 2 1 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ famrel    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 5 3 5 4 4 4 5 3 5 4 ...
##  $ freetime  : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 2 4 4 1 2 5 3 2 3 ...
##  $ goout     : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 2 2 4 4 2 1 3 2 3 ...
##  $ Dalc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Walc      : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 1 1 2 1 1 1 1 2 1 3 ...
##  $ health    : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 3 5 5 3 1 1 5 2 4 5 ...
##  $ absences  : num  -0.2029 -0.4391 0.5056 -0.6752 0.0333 ...
##  $ G1        : num  -1.827 1.267 1.267 0.339 -1.517 ...
##  $ G2        : num  -1.51 0.892 1.159 0.358 -1.51 ...
##  $ G3_pass   : Factor w/ 2 levels "fail","pass": 1 2 2 2 1 2 2 1 2 2 ...
summary(train_scaled)
##  school   sex          age          address famsize   Pstatus Medu    Fedu  
##  GP:278   F:165   Min.   :-1.3051   R: 67   GT3:220   A: 33   0:  2   0: 1  
##  MS: 38   M:151   1st Qu.:-0.5385   U:249   LE3: 96   T:283   1: 49   1:66  
##                   Median : 0.2280                             2: 79   2:92  
##                   Mean   : 0.0000                             3: 79   3:81  
##                   3rd Qu.: 0.9946                             4:107   4:76  
##                   Max.   : 4.0609                                           
##        Mjob           Fjob            reason      guardian   traveltime
##  at_home : 43   at_home : 17   course    :113   father: 69   1:204     
##  health  : 28   health  : 14   home      : 93   mother:219   2: 88     
##  other   :112   other   :168   other     : 29   other : 28   3: 18     
##  services: 84   services: 93   reputation: 81                4:  6     
##  teacher : 49   teacher : 24                                           
##                                                                        
##  studytime failures schoolsup famsup     paid     activities nursery  
##  1: 82     0:247    no :275   no :127   no :170   no :150    no : 61  
##  2:158     1: 41    yes: 41   yes:189   yes:146   yes:166    yes:255  
##  3: 54     2: 16                                                      
##  4: 22     3: 12                                                      
##                                                                       
##                                                                       
##  higher    internet  romantic  famrel  freetime goout   Dalc    Walc    health 
##  no : 16   no : 47   no :203   1:  7   1: 15    1: 17   1:218   1:122   1: 37  
##  yes:300   yes:269   yes:113   2: 13   2: 43    2: 76   2: 61   2: 69   2: 38  
##                                3: 55   3:132    3:110   3: 21   3: 63   3: 65  
##                                4:150   4: 95    4: 72   4:  8   4: 39   4: 54  
##                                5: 91   5: 31    5: 41   5:  8   5: 23   5:122  
##                                                                                
##     absences             G1                 G2           G3_pass   
##  Min.   :-0.6752   Min.   :-2.13623   Min.   :-2.84496   fail:104  
##  1st Qu.:-0.6752   1st Qu.:-0.89875   1st Qu.:-0.44262   pass:212  
##  Median :-0.2029   Median : 0.02937   Median : 0.09123             
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.00000             
##  3rd Qu.: 0.2694   3rd Qu.: 0.64812   3rd Qu.: 0.62508             
##  Max.   : 8.1808   Max.   : 2.50435   Max.   : 2.22663

4.2. Target Distribution

ggplot(train_scaled, aes(x = G3_pass, fill = G3_pass)) +
  geom_bar() +
  scale_fill_manual(values = c("fail"="tomato","pass"="steelblue")) +
  labs(title="Distribution of Pass/Fail in Training Set",
       x="Outcome", y="Count") +
  theme_minimal()

4.3. Numeric Distribution

num_vars <- c("age","absences","G1","G2")

train_scaled %>%
  pivot_longer(cols = all_of(num_vars), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x=Value)) +
  geom_histogram(bins = 30, fill="steelblue", alpha=0.7) +
  facet_wrap(~ Variable, scales="free") +
  labs(title="Histogram of Numeric Variables (Scaled)") +
  theme_minimal()

4.4. Boxplot

train_scaled %>%
  pivot_longer(cols = all_of(num_vars), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x=Variable, y=Value, fill=Variable)) +
  geom_boxplot(alpha=0.7) +
  theme_minimal() +
  labs(title="Boxplot of Scaled Numeric Variables") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

4.5. Correlation Between Numeric Variables

ggpairs(train_scaled[, num_vars],
        upper = list(continuous = wrap("cor", size = 4)),
        lower = list(continuous = "smooth"),
        diag = list(continuous = "densityDiag"))
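
Because Pearson correlation is unaffected by centering and rescaling, the coefficients shown for the scaled variables are the same as those of the raw variables. A quick check on made-up values (the vectors `g1` and `g2` are illustrative, not rows from the data):

```r
# Made-up grade-like values.
g1 <- c(5, 7, 15, 6, 12, 16, 14)
g2 <- c(6, 8, 14, 10, 12, 18, 15)

# Correlation before and after standardizing each vector.
raw_cor    <- cor(g1, g2)
scaled_cor <- cor(as.numeric(scale(g1)), as.numeric(scale(g2)))

print(raw_cor)
print(scaled_cor)
print(isTRUE(all.equal(raw_cor, scaled_cor)))
```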

4.6. Categorical Nominal Variables

# Nominal variables
nominal_vars <- c("school","sex","address","famsize","Pstatus",
                  "Mjob","Fjob","reason","guardian",
                  "schoolsup","famsup","paid","activities",
                  "nursery","higher","internet","romantic")

# Frequency bar plots
max_count <- max(sapply(nominal_vars, function(v){
  max(table(train_scaled[[v]]))
}))

plot_nominal_freq <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]])) +
    geom_bar(fill="steelblue") +
    coord_cartesian(ylim = c(0, max_count)) +   # Fix y-axis
    labs(title = var, x = NULL, y = "Count") +
    theme_minimal(base_size = 12) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      plot.title = element_text(size = 13, face = "bold"),
      panel.grid.minor = element_blank()
    )
}

grid_nominal_freq <- wrap_plots(lapply(nominal_vars, plot_nominal_freq), ncol = 3)
grid_nominal_freq

# Crosstab vs Target (Pass/Fail)

plot_nominal_target <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]], fill = G3_pass)) +
    geom_bar(position = "fill") +
    scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
    labs(title = var, x = NULL, y = "Proportion") +
    theme_minimal(base_size = 12) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1),
      plot.title = element_text(size = 13, face = "bold")
    )
}

grid_nominal_target <- wrap_plots(lapply(nominal_vars, plot_nominal_target), ncol = 3)
grid_nominal_target
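
The stacked bars drawn with `position = "fill"` show, for each level of a predictor, the share of fail and pass. The same numbers can be tabulated directly. The sketch below uses a made-up data frame with a `schoolsup`-like column, so the values are illustrative only.

```r
# Made-up observations (not taken from student-mat.csv).
toy <- data.frame(
  schoolsup = factor(c("no", "no", "no", "no", "yes", "yes")),
  G3_pass   = factor(c("pass", "pass", "pass", "fail", "fail", "pass"),
                     levels = c("fail", "pass"))
)

# Counts per combination of predictor level and outcome.
counts <- table(toy$schoolsup, toy$G3_pass)
print(counts)

# Row proportions: within each predictor level, the share of fail and pass.
row_share <- prop.table(counts, margin = 1)
print(round(row_share, 2))
```

Each row of `row_share` sums to 1, which corresponds to the full height of one bar in the proportion plots.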

4.7. Categorical Ordinal Variables

ordinal_vars <- c("Medu","Fedu","traveltime","studytime","failures",
                  "famrel","freetime","goout","Dalc","Walc","health")

# Frequency bar plots
plot_ordinal_freq <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]])) +
    geom_bar(fill="steelblue") +
    theme_minimal() +
    labs(title = var, x = NULL, y = "Count")
}

grid_ordinal_freq <- wrap_plots(lapply(ordinal_vars, plot_ordinal_freq), ncol = 3)
grid_ordinal_freq

# Crosstab vs Target (Pass/Fail)

plot_ordinal_target <- function(var){
  ggplot(train_scaled, aes(x = .data[[var]], fill = G3_pass)) +
    geom_bar(position = "fill") +
    scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
    theme_minimal() +
    labs(title = var, x = NULL, y = "Proportion")
}

grid_ordinal_target <- wrap_plots(lapply(ordinal_vars, plot_ordinal_target), ncol = 3)
grid_ordinal_target

4.8. Numeric vs Target (Density / Boxplot)

train_scaled %>%
  pivot_longer(cols = all_of(num_vars), names_to="Variable", values_to="Value") %>%
  ggplot(aes(x=Value, fill=G3_pass)) +
  geom_density(alpha=0.5) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
  labs(title="Density Plot of Numeric Variables by Pass/Fail") +
  theme_minimal()

train_scaled %>%
  pivot_longer(cols = all_of(num_vars), names_to="Variable", values_to="Value") %>%
  ggplot(aes(x=G3_pass, y=Value, fill=G3_pass)) +
  geom_boxplot(alpha=.7) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_manual(values=c("fail"="tomato","pass"="steelblue")) +
  labs(title="Numeric Variables vs Pass/Fail") +
  theme_minimal()
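
These plots compare the class-wise distributions of each numeric predictor. The class means are also among the quantities a linear discriminant analysis works with, so tabulating them alongside the plots can help. A minimal sketch on made-up values (the data frame `toy` is illustrative, not taken from the data):

```r
# Made-up grades for three failing and three passing students.
toy <- data.frame(
  G1 = c(5, 6, 7, 12, 14, 15),
  G2 = c(6, 5, 8, 12, 15, 14),
  G3_pass = factor(c("fail", "fail", "fail", "pass", "pass", "pass"))
)

# Mean of each numeric predictor within each outcome class.
class_means <- aggregate(cbind(G1, G2) ~ G3_pass, data = toy, FUN = mean)
print(class_means)
```

A large gap between the two rows for a variable suggests that it separates the classes well.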

4.9. Pairwise Plot

ggpairs(train_scaled, columns = numeric_vars, mapping = aes(color = G3_pass))