Pendahuluan

Penelitian ini bertujuan untuk mengklasifikasikan tingkat prestasi akademik siswa sekolah menengah berdasarkan sejumlah faktor demografis, sosial, dan akademik menggunakan dua pendekatan statistik, yaitu Analisis Diskriminan Linear (LDA) dan Regresi Logistik Ordinal. Dataset yang digunakan mencakup informasi dari 649 siswa dengan 33 variabel. Analisis dimulai dari eksplorasi data, penanganan data ekstrem dan transformasi, seleksi fitur yang relevan, uji asumsi masing-masing model, hingga evaluasi performa klasifikasi. Hasil akhir diharapkan dapat membantu dalam memahami faktor-faktor penting yang mempengaruhi prestasi akademik siswa serta memilih model klasifikasi yang paling efektif.

# Input data
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- read.csv("C:\\Users\\Shobah\\Downloads\\student-por.csv")

A. Exploratory Data Analysis (EDA)

EDA bertujuan untuk memahami karakteristik data sebelum dilakukan pemodelan.

str(data)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

Dataset terdiri atas 649 observasi dengan 33 variabel, mencakup data demografi siswa, latar belakang keluarga, serta nilai akademik. Struktur data juga memberikan informasi mengenai type data tersebut yakni variabel numerik dan kategorik.

Kemudian dilakukan checking missing value:

colSums(is.na(data))
##     school        sex        age    address    famsize    Pstatus       Medu 
##          0          0          0          0          0          0          0 
##       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
##          0          0          0          0          0          0          0 
##   failures  schoolsup     famsup       paid activities    nursery     higher 
##          0          0          0          0          0          0          0 
##   internet   romantic     famrel   freetime      goout       Dalc       Walc 
##          0          0          0          0          0          0          0 
##     health   absences         G1         G2         G3 
##          0          0          0          0          0

Tidak ditemukan nilai yang hilang.

Selanjutnya, dilakukan deteksi outlier dengan metode IQR:

detect_outliers_iqr <- function(x) {
  Q1 <- quantile(x, 0.25, na.rm = TRUE)
  Q3 <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  lower_bound <- Q1 - 1.5 * IQR
  upper_bound <- Q3 + 1.5 * IQR
  which(x < lower_bound | x > upper_bound)
}

numeric_vars <- sapply(data, is.numeric)
outliers_index_list <- lapply(data[, numeric_vars], detect_outliers_iqr)

sapply(outliers_index_list, length)
##        age       Medu       Fedu traveltime  studytime   failures     famrel 
##          1          0          0         16         35        100         51 
##   freetime      goout       Dalc       Walc     health   absences         G1 
##         45          0         34          0          0         21         16 
##         G2         G3 
##         25         16

Melihat Distribusi nilai akhir G3:

hist(data$G3, main = "Distribusi Nilai G3", xlab = "Nilai G3", col = "skyblue")

Mayoritas siswa memperoleh nilai G3 pada rentang 10–14. Hal ini menunjukkan distribusi agak normal namun agak right-skewed.


B. Pre-Processing

Langkah preprocessing dilakukan untuk meningkatkan kualitas data. Tahapan yang dilakukan:

1. Penanganan Outlier (Winsorization)

Outlier yang terdeteksi ditangani dengan winsorization.

winsorize <- function(x) {
  Q1 <- quantile(x, 0.25)
  Q3 <- quantile(x, 0.75)
  IQR <- Q3 - Q1
  lower <- Q1 - 1.5 * IQR
  upper <- Q3 + 1.5 * IQR
  x[x < lower] <- lower
  x[x > upper] <- upper
  return(x)
}

numeric_vars <- sapply(data, is.numeric)
numeric_vars["G3"] <- FALSE  

data_winsorized <- data
data_winsorized[, numeric_vars] <- lapply(data[, numeric_vars], winsorize)

summary(data_winsorized)
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :15.00   Length:649        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.74                     
##                                        3rd Qu.:18.00                     
##                                        Max.   :21.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :2.000   Median :2.000  
##                                        Mean   :2.515   Mean   :2.307  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures  schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0   Class :character  
##  Median :1.000   Median :2.000   Median :0   Mode  :character  
##  Mean   :1.556   Mean   :1.904   Mean   :0                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0                     
##  Max.   :3.500   Max.   :3.500   Max.   :0                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :2.500  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :4.004  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime         goout            Dalc           Walc          health     
##  Min.   :1.500   Min.   :1.000   Min.   :1.00   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.000   Median :3.000   Median :1.00   Median :2.00   Median :4.000  
##  Mean   :3.215   Mean   :3.185   Mean   :1.45   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.00   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.000   Max.   :5.000   Max.   :3.50   Max.   :5.00   Max.   :5.000  
##     absences           G1              G2              G3       
##  Min.   : 0.00   Min.   : 5.50   Min.   : 5.50   Min.   : 0.00  
##  1st Qu.: 0.00   1st Qu.:10.00   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.00   Median :11.00   Median :11.00   Median :12.00  
##  Mean   : 3.51   Mean   :11.41   Mean   :11.62   Mean   :11.91  
##  3rd Qu.: 6.00   3rd Qu.:13.00   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :15.00   Max.   :17.50   Max.   :17.50   Max.   :19.00
Variabel Min sebelum Max sebelum Min sesudah Max sesudah
absences 0 32 0 15
G1 0 19 5.5 17.5
G2 0 19 5.5 17.5

Nilai ekstrem dipangkas ke batas bawah dan atas IQR. Ini membantu menjaga kestabilan model.

2. Transformasi

Dilakukan transformasi akar kuadrat untuk mengurangi skewness:

data_clean <- data_winsorized
numeric_vars <- sapply(data_clean, is.numeric)
numeric_vars["G3"] <- FALSE  

data_clean[, numeric_vars] <- lapply(data_clean[, numeric_vars], sqrt)
summary(data_clean)
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :3.873   Length:649        
##  Class :character   Class :character   1st Qu.:4.000   Class :character  
##  Mode  :character   Mode  :character   Median :4.123   Mode  :character  
##                                        Mean   :4.089                     
##                                        3rd Qu.:4.243                     
##                                        Max.   :4.583                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:1.414   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.414   Median :1.414  
##                                        Mean   :1.536   Mean   :1.468  
##                                        3rd Qu.:2.000   3rd Qu.:1.732  
##                                        Max.   :2.000   Max.   :2.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures  schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0   Class :character  
##  Median :1.000   Median :1.414   Median :0   Mode  :character  
##  Mean   :1.218   Mean   :1.351   Mean   :0                     
##  3rd Qu.:1.414   3rd Qu.:1.414   3rd Qu.:0                     
##  Max.   :1.871   Max.   :1.871   Max.   :0                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :1.581  
##  Class :character   Class :character   Class :character   1st Qu.:2.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :2.000  
##                                                           Mean   :1.991  
##                                                           3rd Qu.:2.236  
##                                                           Max.   :2.236  
##     freetime         goout            Dalc            Walc      
##  Min.   :1.225   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.732   1st Qu.:1.414   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.732   Median :1.732   Median :1.000   Median :1.414  
##  Mean   :1.771   Mean   :1.751   Mean   :1.171   Mean   :1.451  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:1.414   3rd Qu.:1.732  
##  Max.   :2.236   Max.   :2.236   Max.   :1.871   Max.   :2.236  
##      health         absences           G1              G2       
##  Min.   :1.000   Min.   :0.000   Min.   :2.345   Min.   :2.345  
##  1st Qu.:1.414   1st Qu.:0.000   1st Qu.:3.162   1st Qu.:3.162  
##  Median :2.000   Median :1.414   Median :3.317   Median :3.317  
##  Mean   :1.830   Mean   :1.398   Mean   :3.354   Mean   :3.385  
##  3rd Qu.:2.236   3rd Qu.:2.449   3rd Qu.:3.606   3rd Qu.:3.606  
##  Max.   :2.236   Max.   :3.873   Max.   :4.183   Max.   :4.183  
##        G3       
##  Min.   : 0.00  
##  1st Qu.:10.00  
##  Median :12.00  
##  Mean   :11.91  
##  3rd Qu.:14.00  
##  Max.   :19.00
Statistik Nilai
Min 0.000
Median 1.414
Maks 3.873

Data menjadi lebih simetris dan normal.

3. Konversi Variabel Kategorik

Variabel karakter diubah ke tipe faktor

data_clean[] <- lapply(names(data_clean), function(colname) {
  x <- data_clean[[colname]]
  if (is.character(x) && colname != "G3") return(as.factor(x)) else return(x)
})
str(data_clean)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : num  4.24 4.12 3.87 3.87 4 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : num  2 1 1 2 1.73 ...
##  $ Fedu      : num  2 1 1 1.41 1.73 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: num  1.41 1 1 1 1 ...
##  $ studytime : num  1.41 1.41 1.41 1.73 1.41 ...
##  $ failures  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : num  2 2.24 2 1.73 2 ...
##  $ freetime  : num  1.73 1.73 1.73 1.41 1.73 ...
##  $ goout     : num  2 1.73 1.41 1.41 1.41 ...
##  $ Dalc      : num  1 1 1.41 1 1 ...
##  $ Walc      : num  1 1 1.73 1 1.41 ...
##  $ health    : num  1.73 1.73 1.73 2.24 2.24 ...
##  $ absences  : num  2 1.41 2.45 0 0 ...
##  $ G1        : num  2.35 3 3.46 3.74 3.32 ...
##  $ G2        : num  3.32 3.32 3.61 3.74 3.61 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

Hasil: Factor w/ 2 levels "GP","MS"

4. Klasifikasi Target G3

data_clean$G3_class <- cut(data_clean$G3,
                           breaks = c(-1, 9.9, 13.9, 15.9, 17.9, 20),
                           labels = c("Fail", "Sufficient", "Good", "Very Good", "Excellent"),
                           ordered_result = TRUE)
table(data_clean$G3_class)
## 
##       Fail Sufficient       Good  Very Good  Excellent 
##        100        355        112         65         17
Kelas Jumlah
Fail 100
Sufficient 355
Good 112
Very Good 65
Excellent 17

G3 dikategorikan ke dalam lima tingkatan untuk klasifikasi.

5. Normalisasi

Semua variabel numerik dinormalisasi agar memiliki mean = 0 dan standar deviasi = 1:

numeric_vars <- sapply(data_clean, is.numeric)
data_clean[, numeric_vars] <- scale(data_clean[, numeric_vars])

summary(data_clean)
##  school   sex          age          address famsize   Pstatus      Medu        
##  GP:423   F:383   Min.   :-1.4661   R:197   GT3:457   A: 80   Min.   :-3.8900  
##  MS:226   M:266   1st Qu.:-0.6046   U:452   LE3:192   T:569   1st Qu.:-0.3082  
##                   Median : 0.2305                             Median :-0.3082  
##                   Mean   : 0.0000                             Mean   : 0.0000  
##                   3rd Qu.: 1.0413                             3rd Qu.: 1.1754  
##                   Max.   : 3.3471                             Max.   : 1.1754  
##                                                                                
##       Fedu               Mjob           Fjob            reason      guardian  
##  Min.   :-3.7517   at_home :135   at_home : 42   course    :285   father:153  
##  1st Qu.:-1.1953   health  : 48   health  : 23   home      :149   mother:455  
##  Median :-0.1365   other   :258   other   :367   other     : 72   other : 41  
##  Mean   : 0.0000   services:136   services:181   reputation:143               
##  3rd Qu.: 0.6760   teacher : 72   teacher : 36                                
##  Max.   : 1.3610                                                              
##                                                                               
##    traveltime        studytime          failures   schoolsup famsup   
##  Min.   :-0.8134   Min.   :-1.2528   Min.   : NA   no :581   no :251  
##  1st Qu.:-0.8134   1st Qu.:-1.2528   1st Qu.: NA   yes: 68   yes:398  
##  Median :-0.8134   Median : 0.2255   Median : NA                      
##  Mean   : 0.0000   Mean   : 0.0000   Mean   :NaN                      
##  3rd Qu.: 0.7298   3rd Qu.: 0.2255   3rd Qu.: NA                      
##  Max.   : 2.4311   Max.   : 1.8550   Max.   : NA                      
##                                      NA's   :649                      
##   paid     activities nursery   higher    internet  romantic 
##  no :610   no :334    no :128   no : 69   no :151   no :410  
##  yes: 39   yes:315    yes:521   yes:580   yes:498   yes:239  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##      famrel            freetime           goout               Dalc        
##  Min.   :-2.03856   Min.   :-1.9374   Min.   :-2.16029   Min.   :-0.6136  
##  1st Qu.: 0.04548   1st Qu.:-0.1374   1st Qu.:-0.96805   1st Qu.:-0.6136  
##  Median : 0.04548   Median :-0.1374   Median :-0.05321   Median :-0.6136  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 1.22004   3rd Qu.: 0.8133   3rd Qu.: 0.71803   3rd Qu.: 0.8697  
##  Max.   : 1.22004   Max.   : 1.6509   Max.   : 1.39751   Max.   : 2.5048  
##                                                                           
##       Walc              health           absences             G1         
##  Min.   :-1.07547   Min.   :-1.9226   Min.   :-1.1204   Min.   :-2.5023  
##  1st Qu.:-1.07547   1st Qu.:-0.9635   1st Qu.:-1.1204   1st Qu.:-0.4746  
##  Median :-0.08739   Median : 0.3930   Median : 0.0128   Median :-0.0916  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67079   3rd Qu.: 0.9396   3rd Qu.: 0.8424   3rd Qu.: 0.6254  
##  Max.   : 1.87308   Max.   : 0.9396   Max.   : 1.9831   Max.   : 2.0592  
##                                                                          
##        G2                G3                 G3_class  
##  Min.   :-2.6007   Min.   :-3.68532   Fail      :100  
##  1st Qu.:-0.5573   1st Qu.:-0.58998   Sufficient:355  
##  Median :-0.1713   Median : 0.02909   Good      :112  
##  Mean   : 0.0000   Mean   : 0.00000   Very Good : 65  
##  3rd Qu.: 0.5513   3rd Qu.: 0.64816   Excellent : 17  
##  Max.   : 1.9962   Max.   : 2.19584                   
## 
Statistik Nilai
Min -2.6007
Median -0.1713
Maks 1.9962

Normalisasi dilakukan agar semua variabel memiliki kontribusi setara dalam model.


Ringkasan Preprocessing

Tahap Tujuan Hasil Output Penting
Winsorization Mengurangi pengaruh outlier absences maksimal turun dari 32 → 15
Transformasi SQRT Menurunkan skewness absences jadi 0–3.873 (lebih simetris)
Faktor Mengubah kategorik ke numerik kategorikal school, sex, address, dll. jadi factor
Klasifikasi G3 Membuat target klasifikasi 5 kelas G3: Fail – Excellent
Normalisasi Menyamakan skala numerik Semua numerik jadi mean 0, sd 1

C. Seleksi Fitur

Seleksi fitur dilakukan untuk memilih variabel-variabel yang paling berpengaruh terhadap klasifikasi nilai akhir siswa.

Untuk variabel numerik digunakan korelasi Spearman terhadap G3_class:

numeric_vars <- sapply(data_clean, is.numeric)
numeric_vars["G3"] <- FALSE  
numeric_vars["G3_class"] <- FALSE  

spearman_result <- sapply(names(data_clean)[numeric_vars], function(var) {
  cor(as.numeric(data_clean[[var]]), as.numeric(data_clean$G3_class), method = "spearman")
})

spearman_result <- sort(abs(spearman_result), decreasing = TRUE)
print(spearman_result)
##         G2         G1       Medu  studytime       Fedu       Dalc       Walc 
## 0.85744519 0.80490197 0.25065187 0.24294597 0.18538665 0.16652126 0.14781968 
##   absences traveltime   freetime     health      goout     famrel        age 
## 0.11948967 0.11162019 0.09202719 0.08960823 0.08735400 0.03435474 0.01985015

Hasil menunjukkan bahwa variabel G2 dan G1 memiliki korelasi paling tinggi (masing-masing 0.85 dan 0.80).

Untuk variabel kategorik digunakan uji Chi-square:

factor_vars <- sapply(data_clean, is.factor)
chisq_result <- sapply(names(data_clean)[factor_vars], function(var) {
  tbl <- table(data_clean[[var]], data_clean$G3_class)
  test <- chisq.test(tbl)
  return(test$p.value)
})
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
chisq_result <- sort(chisq_result)
print(chisq_result)
##     G3_class       higher       school    schoolsup         Mjob       reason 
## 0.000000e+00 3.543213e-15 1.214454e-12 3.548904e-04 5.291069e-04 1.885781e-03 
##      address     internet         Fjob          sex      nursery   activities 
## 5.060560e-03 5.090668e-03 1.609668e-02 7.683058e-02 1.324086e-01 1.553318e-01 
##         paid     guardian     romantic       famsup      famsize      Pstatus 
## 1.628765e-01 1.700738e-01 2.516525e-01 6.510745e-01 7.433608e-01 8.980707e-01

Variabel yang memiliki nilai p < 0.01 dianggap signifikan, antara lain higher, school, schoolsup, Mjob, reason, address, dan internet.

Kemudian dilakukan penggabungan kedua fitur variabel tersebut

selected_numeric <- names(spearman_result[spearman_result > 0.3])
selected_factor <- names(chisq_result[chisq_result < 0.01])

selected_features <- c(selected_numeric, selected_factor)
selected_features <- setdiff(selected_features, "G3_class")

print(selected_features)
## [1] "G2"        "G1"        "higher"    "school"    "schoolsup" "Mjob"     
## [7] "reason"    "address"   "internet"

Fitur-fitur yang dipilih untuk model:


D. Uji Asumsi

Regresi Logistik Ordinal

Dilakukan uji multikolinearitas dengan menghitung Variance Inflation Factor (VIF):

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
vif_data <- data_clean[, selected_features]
vif_data$G3_class <- data_clean$G3_class  

vif_model <- lm(as.numeric(G3_class) ~ ., data = vif_data)
vif_values <- vif(vif_model)
print(vif_values)
##               GVIF Df GVIF^(1/(2*Df))
## G2        4.644795  1        2.155179
## G1        4.805692  1        2.192189
## higher    1.191868  1        1.091727
## school    1.381606  1        1.175417
## schoolsup 1.057334  1        1.028268
## Mjob      1.263417  4        1.029659
## reason    1.195480  3        1.030205
## address   1.196680  1        1.093929
## internet  1.155197  1        1.074801

Semua nilai VIF berada di bawah 5 yang berarti tidak ada gejala multikolinearitas serius antar prediktor.

Linear Discriminant Analysis (LDA)

Dilakukan uji normalitas multivariat menggunakan Mardia Test dan uji homogenitas kovarian menggunakan Box’s M Test:

library(MVN)
## Warning: package 'MVN' was built under R version 4.4.3
numeric_features <- c("G1", "G2") 
mardia_result <- mvn(data_clean[, numeric_features], mvnTest = "mardia")

print(mardia_result$multivariateNormality)
##              Test        Statistic              p value Result
## 1 Mardia Skewness 85.5666614809526 1.15012313282574e-17     NO
## 2 Mardia Kurtosis 4.90356946172837 9.41106419016791e-07     NO
## 3             MVN             <NA>                 <NA>     NO
library(biotools)
## Warning: package 'biotools' was built under R version 4.4.3
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## ---
## biotools version 4.3
boxm_result <- boxM(data_clean[, numeric_features], grouping = data_clean$G3_class)
print(boxm_result)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  data_clean[, numeric_features]
## Chi-Sq (approx.) = 143.77, df = 12, p-value < 2.2e-16

Hasil:

Uji Nilai p-value Kesimpulan
Mardia Skewness 85.57 < 0.001 Tidak normal
Mardia Kurtosis 4.90 < 0.001 Tidak normal
Box’s M 143.77 < 0.001 Varians tidak homogen
  • Uji Mardia menunjukkan data tidak berdistribusi normal.
  • Uji Box’s M menunjukkan bahwa kovarian antar kelas tidak homogen.

Meskipun asumsi tidak terpenuhi secara ketat, model LDA tetap dilanjutkan karena sifatnya yang cukup robust.


E. Pemodelan

Regresi Logistik Ordinal

Model dibangun menggunakan fungsi polr:

library(MASS)

model_polr <- polr(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")], Hess = TRUE)
summary(model_polr)
## Call:
## polr(formula = G3_class ~ ., data = data_clean[, c(selected_features, 
##     "G3_class")], Hess = TRUE)
## 
## Coefficients:
##                     Value Std. Error  t value
## G2                4.64511     0.3765 12.33902
## G1                1.30536     0.2595  5.02952
## higheryes         0.45503     0.4488  1.01384
## schoolMS          0.01525     0.2640  0.05777
## schoolsupyes     -0.69196     0.3821 -1.81116
## Mjobhealth        0.01482     0.4556  0.03253
## Mjobother         0.07228     0.3202  0.22576
## Mjobservices     -0.13523     0.3789 -0.35691
## Mjobteacher       0.22862     0.4166  0.54871
## reasonhome       -0.25828     0.2909 -0.88798
## reasonother       0.11992     0.3820  0.31397
## reasonreputation  0.13110     0.2818  0.46519
## addressU          0.20950     0.2646  0.79172
## internetyes       0.29421     0.2823  1.04229
## 
## Intercepts:
##                     Value   Std. Error t value
## Fail|Sufficient     -5.5458  0.6183    -8.9695
## Sufficient|Good      4.0874  0.6300     6.4875
## Good|Very Good       8.1100  0.7552    10.7394
## Very Good|Excellent 12.4631  0.9451    13.1875
## 
## Residual Deviance: 594.9772 
## AIC: 630.9772
coef(model_polr)
##               G2               G1        higheryes         schoolMS 
##       4.64511246       1.30536330       0.45503082       0.01525236 
##     schoolsupyes       Mjobhealth        Mjobother     Mjobservices 
##      -0.69196168       0.01481916       0.07227988      -0.13522882 
##      Mjobteacher       reasonhome      reasonother reasonreputation 
##       0.22862147      -0.25827559       0.11992143       0.13109807 
##         addressU      internetyes 
##       0.20950274       0.29420706

Koefisien G2 dan G1 signifikan dengan nilai p < 0.05, artinya kedua variabel tersebut memiliki pengaruh yang signifikan terhadap klasifikasi nilai akhir siswa.

Linear Discriminant Analysis (LDA)

Model LDA dibentuk menggunakan fungsi lda:

model_lda <- lda(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")])

print(model_lda)
## Call:
## lda(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")])
## 
## Prior probabilities of groups:
##       Fail Sufficient       Good  Very Good  Excellent 
## 0.15408320 0.54699538 0.17257319 0.10015408 0.02619414 
## 
## Group means:
##                    G2         G1 higheryes  schoolMS schoolsupyes Mjobhealth
## Fail       -1.4505900 -1.4009938 0.6700000 0.6800000   0.08000000 0.06000000
## Sufficient -0.1974866 -0.1607238 0.9014085 0.3098592   0.15211268 0.05352113
## Good        0.7860927  0.7056960 0.9910714 0.2232143   0.03571429 0.10714286
## Very Good   1.4498856  1.3204290 1.0000000 0.2461538   0.01538462 0.12307692
## Excellent   1.9342233  1.8994403 1.0000000 0.4117647   0.05882353 0.17647059
##            Mjobother Mjobservices Mjobteacher reasonhome reasonother
## Fail       0.4200000   0.19000000  0.04000000  0.1700000  0.18000000
## Sufficient 0.4169014   0.20563380  0.09577465  0.2535211  0.10422535
## Good       0.3571429   0.24107143  0.15178571  0.2232143  0.08928571
## Very Good  0.3692308   0.24615385  0.20000000  0.2153846  0.09230769
## Excellent  0.2352941   0.05882353  0.23529412  0.1764706  0.05882353
##            reasonreputation  addressU internetyes
## Fail              0.1000000 0.5600000   0.6800000
## Sufficient        0.1971831 0.6929577   0.7436620
## Good              0.3035714 0.7678571   0.8571429
## Very Good         0.3384615 0.8000000   0.8769231
## Excellent         0.4117647 0.7058824   0.7647059
## 
## Coefficients of linear discriminants:
##                           LD1         LD2        LD3        LD4
## G2                1.602640785  0.27543752 -0.6998131 -0.1771381
## G1                0.538380559 -0.02651963  1.0270536 -0.1000573
## higheryes        -0.013352908 -1.72311513 -0.5512671  1.3813041
## schoolMS          0.105507647  1.64353787  0.7387131 -0.4052150
## schoolsupyes     -0.260443810 -0.97711027  1.8566742 -1.5440125
## Mjobhealth       -0.009432997  1.29710494 -0.7499566 -0.5351045
## Mjobother        -0.055887595  0.17323363 -1.1880468 -1.8397717
## Mjobservices     -0.061502551  0.36984893 -1.7373340 -1.7820798
## Mjobteacher       0.139908262  0.71218379 -0.7207513 -1.7416423
## reasonhome       -0.114655835 -0.26530449  0.1674413 -0.1360225
## reasonother       0.036966429  0.08591206 -0.5181147 -0.3210036
## reasonreputation  0.122766438  0.23804254  0.1853363  0.6382720
## addressU          0.028696006  0.13847738 -0.1217455 -0.1795131
## internetyes       0.071401650  0.22813567 -0.5495137  0.5135963
## 
## Proportion of trace:
##    LD1    LD2    LD3    LD4 
## 0.9674 0.0234 0.0081 0.0012
print(model_lda$scaling)
##                           LD1         LD2        LD3        LD4
## G2                1.602640785  0.27543752 -0.6998131 -0.1771381
## G1                0.538380559 -0.02651963  1.0270536 -0.1000573
## higheryes        -0.013352908 -1.72311513 -0.5512671  1.3813041
## schoolMS          0.105507647  1.64353787  0.7387131 -0.4052150
## schoolsupyes     -0.260443810 -0.97711027  1.8566742 -1.5440125
## Mjobhealth       -0.009432997  1.29710494 -0.7499566 -0.5351045
## Mjobother        -0.055887595  0.17323363 -1.1880468 -1.8397717
## Mjobservices     -0.061502551  0.36984893 -1.7373340 -1.7820798
## Mjobteacher       0.139908262  0.71218379 -0.7207513 -1.7416423
## reasonhome       -0.114655835 -0.26530449  0.1674413 -0.1360225
## reasonother       0.036966429  0.08591206 -0.5181147 -0.3210036
## reasonreputation  0.122766438  0.23804254  0.1853363  0.6382720
## addressU          0.028696006  0.13847738 -0.1217455 -0.1795131
## internetyes       0.071401650  0.22813567 -0.5495137  0.5135963

LD1 menjelaskan 96.7% dari variasi antar kelas, menunjukkan bahwa diskriminan pertama sangat kuat membedakan antar kelas prestasi.


F. Uji Signifikansi Variabel

Regresi Logistik Ordinal

Uji signifikansi dilakukan secara serentak menggunakan Likelihood Ratio Test:

model_null <- polr(G3_class ~ 1, data = data_clean[, c(selected_features, "G3_class")], Hess = TRUE)

lr_stat <- 2 * (logLik(model_polr) - logLik(model_null))
df_lr <- attr(logLik(model_polr), "df") - attr(logLik(model_null), "df")
p_value_lr <- pchisq(lr_stat, df = df_lr, lower.tail = FALSE)

cat("Likelihood Ratio Statistic:", lr_stat, "\n")
## Likelihood Ratio Statistic: 1023.953
cat("Degrees of Freedom:", df_lr, "\n")
## Degrees of Freedom: 14
cat("p-value:", p_value_lr, "\n")
## p-value: 1.13402e-209
  • Likelihood Ratio Test: p-value sangat kecil (< 2.2e-16) → model secara keseluruhan signifikan.

Dilakuakn juga uji parsial menggunakan Wald Tes:

summary_polr <- summary(model_polr)
wald_stat <- (summary_polr$coefficients[, "Value"] / summary_polr$coefficients[, "Std. Error"])^2
wald_p_values <- 1 - pchisq(wald_stat, df = 1)

wald_table <- data.frame(
  Estimate = summary_polr$coefficients[, "Value"],
  StdError = summary_polr$coefficients[, "Std. Error"],
  p_value = round(wald_p_values, 4)
)
print(wald_table)
##                        Estimate  StdError p_value
## G2                   4.64511246 0.3764572  0.0000
## G1                   1.30536330 0.2595404  0.0000
## higheryes            0.45503082 0.4488182  0.3107
## schoolMS             0.01525236 0.2640229  0.9539
## schoolsupyes        -0.69196168 0.3820548  0.0701
## Mjobhealth           0.01481916 0.4556115  0.9741
## Mjobother            0.07227988 0.3201652  0.8214
## Mjobservices        -0.13522882 0.3788897  0.7212
## Mjobteacher          0.22862147 0.4166495  0.5832
## reasonhome          -0.25827559 0.2908572  0.3746
## reasonother          0.11992143 0.3819520  0.7535
## reasonreputation     0.13109807 0.2818190  0.6418
## addressU             0.20950274 0.2646157  0.4285
## internetyes          0.29420706 0.2822701  0.2973
## Fail|Sufficient     -5.54577578 0.6182900  0.0000
## Sufficient|Good      4.08743704 0.6300437  0.0000
## Good|Very Good       8.10995175 0.7551577  0.0000
## Very Good|Excellent 12.46310464 0.9450727  0.0000
  • Wald Test: G2 dan G1 signifikan (p < 0.05), variabel lain tidak signifikan secara statistik.

LDA

Uji Wilks’ Lambda digunakan untuk menilai kemampuan diskriminan:

eigen_values <- model_lda$svd^2
wilks_lambda <- prod(1 / (1 + eigen_values))

cat("Wilks' Lambda:", wilks_lambda, "\n")
## Wilks' Lambda: 1.307714e-05

Nilai Wilks’ Lambda sangat kecil (~0.000013), menunjukkan bahwa fungsi diskriminan dapat memisahkan kelas dengan sangat baik.


G. Evaluasi Model

Confusion Matrix

pred_polr <- predict(model_polr, newdata = data_clean[, selected_features])
pred_lda <- predict(model_lda)$class

conf_matrix_polr <- table(Predicted = pred_polr, Actual = data_clean$G3_class)
conf_matrix_lda <- table(Predicted = pred_lda, Actual = data_clean$G3_class)

cat("Confusion Matrix - Regresi Logistik Ordinal:\n")
## Confusion Matrix - Regresi Logistik Ordinal:
print(conf_matrix_polr)
##             Actual
## Predicted    Fail Sufficient Good Very Good Excellent
##   Fail         73         16    0         0         0
##   Sufficient   27        320   31         1         0
##   Good          0         19   70        15         0
##   Very Good     0          0   11        45         8
##   Excellent     0          0    0         4         9
cat("\nConfusion Matrix - LDA:\n")
## 
## Confusion Matrix - LDA:
print(conf_matrix_lda)
##             Actual
## Predicted    Fail Sufficient Good Very Good Excellent
##   Fail         74         19    0         0         0
##   Sufficient   26        316   27         1         0
##   Good          0         20   80        17         0
##   Very Good     0          0    4        46         9
##   Excellent     0          0    1         1         8

Tabel klasifikasi menunjukkan bahwa model LDA dan regresi logistik ordinal mampu memprediksi kelas Sufficient dengan baik. Namun, kelas Excellent dan Fail masih memiliki tingkat kesalahan relatif tinggi.

Evaluasi Metrik

evaluate_model <- function(conf_matrix) {
  accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
  precision <- diag(conf_matrix) / rowSums(conf_matrix)
  recall <- diag(conf_matrix) / colSums(conf_matrix)
  f1 <- 2 * (precision * recall) / (precision + recall)
  
  hasil <- data.frame(
    Class = names(precision),
    Precision = round(precision, 2),
    Recall = round(recall, 2),
    F1_Score = round(f1, 2)
  )
  hasil$Accuracy <- round(accuracy, 2)
  return(hasil)
}

cat("\nEvaluasi Regresi Logistik Ordinal:\n")
## 
## Evaluasi Regresi Logistik Ordinal:
print(evaluate_model(conf_matrix_polr))
##                 Class Precision Recall F1_Score Accuracy
## Fail             Fail      0.82   0.73     0.77      0.8
## Sufficient Sufficient      0.84   0.90     0.87      0.8
## Good             Good      0.67   0.62     0.65      0.8
## Very Good   Very Good      0.70   0.69     0.70      0.8
## Excellent   Excellent      0.69   0.53     0.60      0.8
cat("\nEvaluasi LDA:\n")
## 
## Evaluasi LDA:
print(evaluate_model(conf_matrix_lda))
##                 Class Precision Recall F1_Score Accuracy
## Fail             Fail      0.80   0.74     0.77     0.81
## Sufficient Sufficient      0.85   0.89     0.87     0.81
## Good             Good      0.68   0.71     0.70     0.81
## Very Good   Very Good      0.78   0.71     0.74     0.81
## Excellent   Excellent      0.80   0.47     0.59     0.81
Model Akurasi Rata-rata F1
Regresi Logistik 0.80 0.72
LDA 0.81 0.73

Model LDA sedikit lebih unggul dalam hal akurasi dan F1-score dibanding regresi logistik ordinal.


H. Interpretasi Hasil

Regresi Logistik Ordinal

summary(model_polr)
## Call:
## polr(formula = G3_class ~ ., data = data_clean[, c(selected_features, 
##     "G3_class")], Hess = TRUE)
## 
## Coefficients:
##                     Value Std. Error  t value
## G2                4.64511     0.3765 12.33902
## G1                1.30536     0.2595  5.02952
## higheryes         0.45503     0.4488  1.01384
## schoolMS          0.01525     0.2640  0.05777
## schoolsupyes     -0.69196     0.3821 -1.81116
## Mjobhealth        0.01482     0.4556  0.03253
## Mjobother         0.07228     0.3202  0.22576
## Mjobservices     -0.13523     0.3789 -0.35691
## Mjobteacher       0.22862     0.4166  0.54871
## reasonhome       -0.25828     0.2909 -0.88798
## reasonother       0.11992     0.3820  0.31397
## reasonreputation  0.13110     0.2818  0.46519
## addressU          0.20950     0.2646  0.79172
## internetyes       0.29421     0.2823  1.04229
## 
## Intercepts:
##                     Value   Std. Error t value
## Fail|Sufficient     -5.5458  0.6183    -8.9695
## Sufficient|Good      4.0874  0.6300     6.4875
## Good|Very Good       8.1100  0.7552    10.7394
## Very Good|Excellent 12.4631  0.9451    13.1875
## 
## Residual Deviance: 594.9772 
## AIC: 630.9772
coef_polr <- coef(model_polr)
odds_ratio <- exp(coef_polr)

cat("Koefisien Regresi Logistik:\n")
## Koefisien Regresi Logistik:
print(coef_polr)
##               G2               G1        higheryes         schoolMS 
##       4.64511246       1.30536330       0.45503082       0.01525236 
##     schoolsupyes       Mjobhealth        Mjobother     Mjobservices 
##      -0.69196168       0.01481916       0.07227988      -0.13522882 
##      Mjobteacher       reasonhome      reasonother reasonreputation 
##       0.22862147      -0.25827559       0.11992143       0.13109807 
##         addressU      internetyes 
##       0.20950274       0.29420706
cat("\nOdds Ratio:\n")
## 
## Odds Ratio:
print(round(odds_ratio, 2))
##               G2               G1        higheryes         schoolMS 
##           104.08             3.69             1.58             1.02 
##     schoolsupyes       Mjobhealth        Mjobother     Mjobservices 
##             0.50             1.01             1.07             0.87 
##      Mjobteacher       reasonhome      reasonother reasonreputation 
##             1.26             0.77             1.13             1.14 
##         addressU      internetyes 
##             1.23             1.34

Koefisien model dan odds ratio:

  • G2 memiliki odds ratio sebesar 104.08 → peningkatan 1 satuan pada G2 meningkatkan kemungkinan naik ke kelas prestasi lebih tinggi sebanyak 104 kali.
  • G1 memiliki odds ratio sebesar 3.69 → juga signifikan.
  • Variabel lain memiliki pengaruh yang kecil dan tidak signifikan.

Linear Discriminant Analysis

cat("Koefisien Fungsi Diskriminan:\n")
## Koefisien Fungsi Diskriminan:
print(model_lda$scaling)
##                           LD1         LD2        LD3        LD4
## G2                1.602640785  0.27543752 -0.6998131 -0.1771381
## G1                0.538380559 -0.02651963  1.0270536 -0.1000573
## higheryes        -0.013352908 -1.72311513 -0.5512671  1.3813041
## schoolMS          0.105507647  1.64353787  0.7387131 -0.4052150
## schoolsupyes     -0.260443810 -0.97711027  1.8566742 -1.5440125
## Mjobhealth       -0.009432997  1.29710494 -0.7499566 -0.5351045
## Mjobother        -0.055887595  0.17323363 -1.1880468 -1.8397717
## Mjobservices     -0.061502551  0.36984893 -1.7373340 -1.7820798
## Mjobteacher       0.139908262  0.71218379 -0.7207513 -1.7416423
## reasonhome       -0.114655835 -0.26530449  0.1674413 -0.1360225
## reasonother       0.036966429  0.08591206 -0.5181147 -0.3210036
## reasonreputation  0.122766438  0.23804254  0.1853363  0.6382720
## addressU          0.028696006  0.13847738 -0.1217455 -0.1795131
## internetyes       0.071401650  0.22813567 -0.5495137  0.5135963

Koefisien LD1:

Variabel Koefisien Odds Ratio Signifikan
G2 4.65 104.08 Ya
G1 1.30 3.69 Ya
lainnya < 1 ~1 Tidak signifikan
  • G2 dan G1 memberikan kontribusi paling besar terhadap LD1, sehingga menjadi fitur utama dalam klasifikasi.
  • Koefisien lainnya memiliki pengaruh lebih kecil.

Visualisasi Distribusi LD1

library(ggplot2)
lda_pred <- predict(model_lda)
lda_df <- data.frame(LD1 = lda_pred$x[, 1], G3_class = data_clean$G3_class)

ggplot(lda_df, aes(x = LD1, fill = G3_class)) +
  geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.6) +
  labs(title = "Distribusi Fungsi Diskriminan Pertama", x = "LD1", y = "Frekuensi")

Visualisasi menunjukkan bahwa fungsi diskriminan pertama (LD1) mampu memisahkan kelas dengan cukup jelas. Kelas Fail dan Excellent berada di ujung distribusi yang berlawanan.


Kesimpulan