Penelitian ini bertujuan untuk mengklasifikasikan tingkat prestasi akademik siswa sekolah menengah berdasarkan sejumlah faktor demografis, sosial, dan akademik menggunakan dua pendekatan statistik, yaitu Analisis Diskriminan Linear (LDA) dan Regresi Logistik Ordinal. Dataset yang digunakan mencakup informasi dari 649 siswa dengan 33 variabel. Analisis dimulai dari eksplorasi data, penanganan data ekstrem dan transformasi, seleksi fitur yang relevan, uji asumsi masing-masing model, hingga evaluasi performa klasifikasi. Hasil akhir diharapkan dapat membantu dalam memahami faktor-faktor penting yang mempengaruhi prestasi akademik siswa serta memilih model klasifikasi yang paling efektif.
# Input data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data <- read.csv("C:\\Users\\Shobah\\Downloads\\student-por.csv")
EDA bertujuan untuk memahami karakteristik data sebelum dilakukan pemodelan.
str(data)
## 'data.frame': 649 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "no" "no" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
Dataset terdiri atas 649 observasi dengan 33 variabel, mencakup data demografi siswa, latar belakang keluarga, serta nilai akademik. Struktur data juga memberikan informasi mengenai type data tersebut yakni variabel numerik dan kategorik.
Kemudian dilakukan checking missing value:
colSums(is.na(data))
## school sex age address famsize Pstatus Medu
## 0 0 0 0 0 0 0
## Fedu Mjob Fjob reason guardian traveltime studytime
## 0 0 0 0 0 0 0
## failures schoolsup famsup paid activities nursery higher
## 0 0 0 0 0 0 0
## internet romantic famrel freetime goout Dalc Walc
## 0 0 0 0 0 0 0
## health absences G1 G2 G3
## 0 0 0 0 0
Tidak ditemukan nilai yang hilang.
Selanjutnya, dilakukan deteksi outlier dengan metode IQR:
detect_outliers_iqr <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR
upper_bound <- Q3 + 1.5 * IQR
which(x < lower_bound | x > upper_bound)
}
numeric_vars <- sapply(data, is.numeric)
outliers_index_list <- lapply(data[, numeric_vars], detect_outliers_iqr)
sapply(outliers_index_list, length)
## age Medu Fedu traveltime studytime failures famrel
## 1 0 0 16 35 100 51
## freetime goout Dalc Walc health absences G1
## 45 0 34 0 0 21 16
## G2 G3
## 25 16
Melihat Distribusi nilai akhir G3
:
hist(data$G3, main = "Distribusi Nilai G3", xlab = "Nilai G3", col = "skyblue")
Mayoritas siswa memperoleh nilai G3 pada rentang 10–14. Hal ini menunjukkan distribusi agak normal namun agak right-skewed.
Langkah preprocessing dilakukan untuk meningkatkan kualitas data. Tahapan yang dilakukan:
Outlier yang terdeteksi ditangani dengan winsorization.
winsorize <- function(x) {
Q1 <- quantile(x, 0.25)
Q3 <- quantile(x, 0.75)
IQR <- Q3 - Q1
lower <- Q1 - 1.5 * IQR
upper <- Q3 + 1.5 * IQR
x[x < lower] <- lower
x[x > upper] <- upper
return(x)
}
numeric_vars <- sapply(data, is.numeric)
numeric_vars["G3"] <- FALSE
data_winsorized <- data
data_winsorized[, numeric_vars] <- lapply(data[, numeric_vars], winsorize)
summary(data_winsorized)
## school sex age address
## Length:649 Length:649 Min. :15.00 Length:649
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.74
## 3rd Qu.:18.00
## Max. :21.00
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Mode :character Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0 Class :character
## Median :1.000 Median :2.000 Median :0 Mode :character
## Mean :1.556 Mean :1.904 Mean :0
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0
## Max. :3.500 Max. :3.500 Max. :0
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :2.500
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :4.004
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.500 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:2.000
## Median :3.000 Median :3.000 Median :1.00 Median :2.00 Median :4.000
## Mean :3.215 Mean :3.185 Mean :1.45 Mean :2.28 Mean :3.536
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.00 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :3.50 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.00 Min. : 5.50 Min. : 5.50 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.:10.00 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.00 Median :11.00 Median :11.00 Median :12.00
## Mean : 3.51 Mean :11.41 Mean :11.62 Mean :11.91
## 3rd Qu.: 6.00 3rd Qu.:13.00 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :15.00 Max. :17.50 Max. :17.50 Max. :19.00
Variabel | Min sebelum | Max sebelum | Min sesudah | Max sesudah |
---|---|---|---|---|
absences | 0 | 32 | 0 | 15 |
G1 | 0 | 19 | 5.5 | 17.5 |
G2 | 0 | 19 | 5.5 | 17.5 |
Nilai ekstrem dipangkas ke batas bawah dan atas IQR. Ini membantu menjaga kestabilan model.
Dilakukan transformasi akar kuadrat untuk mengurangi skewness:
data_clean <- data_winsorized
numeric_vars <- sapply(data_clean, is.numeric)
numeric_vars["G3"] <- FALSE
data_clean[, numeric_vars] <- lapply(data_clean[, numeric_vars], sqrt)
summary(data_clean)
## school sex age address
## Length:649 Length:649 Min. :3.873 Length:649
## Class :character Class :character 1st Qu.:4.000 Class :character
## Mode :character Mode :character Median :4.123 Mode :character
## Mean :4.089
## 3rd Qu.:4.243
## Max. :4.583
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:1.414 1st Qu.:1.000
## Mode :character Mode :character Median :1.414 Median :1.414
## Mean :1.536 Mean :1.468
## 3rd Qu.:2.000 3rd Qu.:1.732
## Max. :2.000 Max. :2.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0 Class :character
## Median :1.000 Median :1.414 Median :0 Mode :character
## Mean :1.218 Mean :1.351 Mean :0
## 3rd Qu.:1.414 3rd Qu.:1.414 3rd Qu.:0
## Max. :1.871 Max. :1.871 Max. :0
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :1.581
## Class :character Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Mode :character Median :2.000
## Mean :1.991
## 3rd Qu.:2.236
## Max. :2.236
## freetime goout Dalc Walc
## Min. :1.225 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.732 1st Qu.:1.414 1st Qu.:1.000 1st Qu.:1.000
## Median :1.732 Median :1.732 Median :1.000 Median :1.414
## Mean :1.771 Mean :1.751 Mean :1.171 Mean :1.451
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:1.414 3rd Qu.:1.732
## Max. :2.236 Max. :2.236 Max. :1.871 Max. :2.236
## health absences G1 G2
## Min. :1.000 Min. :0.000 Min. :2.345 Min. :2.345
## 1st Qu.:1.414 1st Qu.:0.000 1st Qu.:3.162 1st Qu.:3.162
## Median :2.000 Median :1.414 Median :3.317 Median :3.317
## Mean :1.830 Mean :1.398 Mean :3.354 Mean :3.385
## 3rd Qu.:2.236 3rd Qu.:2.449 3rd Qu.:3.606 3rd Qu.:3.606
## Max. :2.236 Max. :3.873 Max. :4.183 Max. :4.183
## G3
## Min. : 0.00
## 1st Qu.:10.00
## Median :12.00
## Mean :11.91
## 3rd Qu.:14.00
## Max. :19.00
Statistik | Nilai |
---|---|
Min | 0.000 |
Median | 1.414 |
Maks | 3.873 |
Data menjadi lebih simetris dan normal.
Variabel karakter diubah ke tipe faktor
data_clean[] <- lapply(names(data_clean), function(colname) {
x <- data_clean[[colname]]
if (is.character(x) && colname != "G3") return(as.factor(x)) else return(x)
})
str(data_clean)
## 'data.frame': 649 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : num 4.24 4.12 3.87 3.87 4 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : num 2 1 1 2 1.73 ...
## $ Fedu : num 2 1 1 1.41 1.73 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: num 1.41 1 1 1 1 ...
## $ studytime : num 1.41 1.41 1.41 1.73 1.41 ...
## $ failures : num 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : num 2 2.24 2 1.73 2 ...
## $ freetime : num 1.73 1.73 1.73 1.41 1.73 ...
## $ goout : num 2 1.73 1.41 1.41 1.41 ...
## $ Dalc : num 1 1 1.41 1 1 ...
## $ Walc : num 1 1 1.73 1 1.41 ...
## $ health : num 1.73 1.73 1.73 2.24 2.24 ...
## $ absences : num 2 1.41 2.45 0 0 ...
## $ G1 : num 2.35 3 3.46 3.74 3.32 ...
## $ G2 : num 3.32 3.32 3.61 3.74 3.61 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
Hasil: Factor w/ 2 levels "GP","MS"
data_clean$G3_class <- cut(data_clean$G3,
breaks = c(-1, 9.9, 13.9, 15.9, 17.9, 20),
labels = c("Fail", "Sufficient", "Good", "Very Good", "Excellent"),
ordered_result = TRUE)
table(data_clean$G3_class)
##
## Fail Sufficient Good Very Good Excellent
## 100 355 112 65 17
Kelas | Jumlah |
---|---|
Fail | 100 |
Sufficient | 355 |
Good | 112 |
Very Good | 65 |
Excellent | 17 |
G3 dikategorikan ke dalam lima tingkatan untuk klasifikasi.
Semua variabel numerik dinormalisasi agar memiliki mean = 0 dan standar deviasi = 1:
numeric_vars <- sapply(data_clean, is.numeric)
data_clean[, numeric_vars] <- scale(data_clean[, numeric_vars])
summary(data_clean)
## school sex age address famsize Pstatus Medu
## GP:423 F:383 Min. :-1.4661 R:197 GT3:457 A: 80 Min. :-3.8900
## MS:226 M:266 1st Qu.:-0.6046 U:452 LE3:192 T:569 1st Qu.:-0.3082
## Median : 0.2305 Median :-0.3082
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.0413 3rd Qu.: 1.1754
## Max. : 3.3471 Max. : 1.1754
##
## Fedu Mjob Fjob reason guardian
## Min. :-3.7517 at_home :135 at_home : 42 course :285 father:153
## 1st Qu.:-1.1953 health : 48 health : 23 home :149 mother:455
## Median :-0.1365 other :258 other :367 other : 72 other : 41
## Mean : 0.0000 services:136 services:181 reputation:143
## 3rd Qu.: 0.6760 teacher : 72 teacher : 36
## Max. : 1.3610
##
## traveltime studytime failures schoolsup famsup
## Min. :-0.8134 Min. :-1.2528 Min. : NA no :581 no :251
## 1st Qu.:-0.8134 1st Qu.:-1.2528 1st Qu.: NA yes: 68 yes:398
## Median :-0.8134 Median : 0.2255 Median : NA
## Mean : 0.0000 Mean : 0.0000 Mean :NaN
## 3rd Qu.: 0.7298 3rd Qu.: 0.2255 3rd Qu.: NA
## Max. : 2.4311 Max. : 1.8550 Max. : NA
## NA's :649
## paid activities nursery higher internet romantic
## no :610 no :334 no :128 no : 69 no :151 no :410
## yes: 39 yes:315 yes:521 yes:580 yes:498 yes:239
##
##
##
##
##
## famrel freetime goout Dalc
## Min. :-2.03856 Min. :-1.9374 Min. :-2.16029 Min. :-0.6136
## 1st Qu.: 0.04548 1st Qu.:-0.1374 1st Qu.:-0.96805 1st Qu.:-0.6136
## Median : 0.04548 Median :-0.1374 Median :-0.05321 Median :-0.6136
## Mean : 0.00000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 1.22004 3rd Qu.: 0.8133 3rd Qu.: 0.71803 3rd Qu.: 0.8697
## Max. : 1.22004 Max. : 1.6509 Max. : 1.39751 Max. : 2.5048
##
## Walc health absences G1
## Min. :-1.07547 Min. :-1.9226 Min. :-1.1204 Min. :-2.5023
## 1st Qu.:-1.07547 1st Qu.:-0.9635 1st Qu.:-1.1204 1st Qu.:-0.4746
## Median :-0.08739 Median : 0.3930 Median : 0.0128 Median :-0.0916
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.67079 3rd Qu.: 0.9396 3rd Qu.: 0.8424 3rd Qu.: 0.6254
## Max. : 1.87308 Max. : 0.9396 Max. : 1.9831 Max. : 2.0592
##
## G2 G3 G3_class
## Min. :-2.6007 Min. :-3.68532 Fail :100
## 1st Qu.:-0.5573 1st Qu.:-0.58998 Sufficient:355
## Median :-0.1713 Median : 0.02909 Good :112
## Mean : 0.0000 Mean : 0.00000 Very Good : 65
## 3rd Qu.: 0.5513 3rd Qu.: 0.64816 Excellent : 17
## Max. : 1.9962 Max. : 2.19584
##
Statistik | Nilai |
---|---|
Min | -2.6007 |
Median | -0.1713 |
Maks | 1.9962 |
Normalisasi dilakukan agar semua variabel memiliki kontribusi setara dalam model.
Tahap | Tujuan | Hasil Output Penting |
---|---|---|
Winsorization | Mengurangi pengaruh outlier | absences maksimal turun dari 32 → 15 |
Transformasi SQRT | Menurunkan skewness | absences jadi 0–3.873 (lebih simetris) |
Faktor | Mengubah kategorik ke numerik kategorikal | school , sex , address , dll.
jadi factor |
Klasifikasi G3 | Membuat target klasifikasi | 5 kelas G3: Fail – Excellent |
Normalisasi | Menyamakan skala numerik | Semua numerik jadi mean 0, sd 1 |
Seleksi fitur dilakukan untuk memilih variabel-variabel yang paling berpengaruh terhadap klasifikasi nilai akhir siswa.
Untuk variabel numerik digunakan korelasi Spearman terhadap
G3_class
:
numeric_vars <- sapply(data_clean, is.numeric)
numeric_vars["G3"] <- FALSE
numeric_vars["G3_class"] <- FALSE
spearman_result <- sapply(names(data_clean)[numeric_vars], function(var) {
cor(as.numeric(data_clean[[var]]), as.numeric(data_clean$G3_class), method = "spearman")
})
spearman_result <- sort(abs(spearman_result), decreasing = TRUE)
print(spearman_result)
## G2 G1 Medu studytime Fedu Dalc Walc
## 0.85744519 0.80490197 0.25065187 0.24294597 0.18538665 0.16652126 0.14781968
## absences traveltime freetime health goout famrel age
## 0.11948967 0.11162019 0.09202719 0.08960823 0.08735400 0.03435474 0.01985015
Hasil menunjukkan bahwa variabel G2
dan G1
memiliki korelasi paling tinggi (masing-masing 0.85 dan 0.80).
Untuk variabel kategorik digunakan uji Chi-square:
factor_vars <- sapply(data_clean, is.factor)
chisq_result <- sapply(names(data_clean)[factor_vars], function(var) {
tbl <- table(data_clean[[var]], data_clean$G3_class)
test <- chisq.test(tbl)
return(test$p.value)
})
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
## Warning in chisq.test(tbl): Chi-squared approximation may be incorrect
chisq_result <- sort(chisq_result)
print(chisq_result)
## G3_class higher school schoolsup Mjob reason
## 0.000000e+00 3.543213e-15 1.214454e-12 3.548904e-04 5.291069e-04 1.885781e-03
## address internet Fjob sex nursery activities
## 5.060560e-03 5.090668e-03 1.609668e-02 7.683058e-02 1.324086e-01 1.553318e-01
## paid guardian romantic famsup famsize Pstatus
## 1.628765e-01 1.700738e-01 2.516525e-01 6.510745e-01 7.433608e-01 8.980707e-01
Variabel yang memiliki nilai p < 0.01 dianggap signifikan, antara
lain higher
, school
, schoolsup
,
Mjob
, reason
, address
, dan
internet
.
Kemudian dilakukan penggabungan kedua fitur variabel tersebut
selected_numeric <- names(spearman_result[spearman_result > 0.3])
selected_factor <- names(chisq_result[chisq_result < 0.01])
selected_features <- c(selected_numeric, selected_factor)
selected_features <- setdiff(selected_features, "G3_class")
print(selected_features)
## [1] "G2" "G1" "higher" "school" "schoolsup" "Mjob"
## [7] "reason" "address" "internet"
Fitur-fitur yang dipilih untuk model:
Dilakukan uji multikolinearitas dengan menghitung Variance Inflation Factor (VIF):
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
vif_data <- data_clean[, selected_features]
vif_data$G3_class <- data_clean$G3_class
vif_model <- lm(as.numeric(G3_class) ~ ., data = vif_data)
vif_values <- vif(vif_model)
print(vif_values)
## GVIF Df GVIF^(1/(2*Df))
## G2 4.644795 1 2.155179
## G1 4.805692 1 2.192189
## higher 1.191868 1 1.091727
## school 1.381606 1 1.175417
## schoolsup 1.057334 1 1.028268
## Mjob 1.263417 4 1.029659
## reason 1.195480 3 1.030205
## address 1.196680 1 1.093929
## internet 1.155197 1 1.074801
Semua nilai VIF berada di bawah 5 yang berarti tidak ada gejala multikolinearitas serius antar prediktor.
Dilakukan uji normalitas multivariat menggunakan Mardia Test dan uji homogenitas kovarian menggunakan Box’s M Test:
library(MVN)
## Warning: package 'MVN' was built under R version 4.4.3
numeric_features <- c("G1", "G2")
mardia_result <- mvn(data_clean[, numeric_features], mvnTest = "mardia")
print(mardia_result$multivariateNormality)
## Test Statistic p value Result
## 1 Mardia Skewness 85.5666614809526 1.15012313282574e-17 NO
## 2 Mardia Kurtosis 4.90356946172837 9.41106419016791e-07 NO
## 3 MVN <NA> <NA> NO
library(biotools)
## Warning: package 'biotools' was built under R version 4.4.3
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## ---
## biotools version 4.3
boxm_result <- boxM(data_clean[, numeric_features], grouping = data_clean$G3_class)
print(boxm_result)
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: data_clean[, numeric_features]
## Chi-Sq (approx.) = 143.77, df = 12, p-value < 2.2e-16
Hasil:
Uji | Nilai | p-value | Kesimpulan |
---|---|---|---|
Mardia Skewness | 85.57 | < 0.001 | Tidak normal |
Mardia Kurtosis | 4.90 | < 0.001 | Tidak normal |
Box’s M | 143.77 | < 0.001 | Varians tidak homogen |
Meskipun asumsi tidak terpenuhi secara ketat, model LDA tetap dilanjutkan karena sifatnya yang cukup robust.
Model dibangun menggunakan fungsi
polr
:
library(MASS)
model_polr <- polr(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")], Hess = TRUE)
summary(model_polr)
## Call:
## polr(formula = G3_class ~ ., data = data_clean[, c(selected_features,
## "G3_class")], Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## G2 4.64511 0.3765 12.33902
## G1 1.30536 0.2595 5.02952
## higheryes 0.45503 0.4488 1.01384
## schoolMS 0.01525 0.2640 0.05777
## schoolsupyes -0.69196 0.3821 -1.81116
## Mjobhealth 0.01482 0.4556 0.03253
## Mjobother 0.07228 0.3202 0.22576
## Mjobservices -0.13523 0.3789 -0.35691
## Mjobteacher 0.22862 0.4166 0.54871
## reasonhome -0.25828 0.2909 -0.88798
## reasonother 0.11992 0.3820 0.31397
## reasonreputation 0.13110 0.2818 0.46519
## addressU 0.20950 0.2646 0.79172
## internetyes 0.29421 0.2823 1.04229
##
## Intercepts:
## Value Std. Error t value
## Fail|Sufficient -5.5458 0.6183 -8.9695
## Sufficient|Good 4.0874 0.6300 6.4875
## Good|Very Good 8.1100 0.7552 10.7394
## Very Good|Excellent 12.4631 0.9451 13.1875
##
## Residual Deviance: 594.9772
## AIC: 630.9772
coef(model_polr)
## G2 G1 higheryes schoolMS
## 4.64511246 1.30536330 0.45503082 0.01525236
## schoolsupyes Mjobhealth Mjobother Mjobservices
## -0.69196168 0.01481916 0.07227988 -0.13522882
## Mjobteacher reasonhome reasonother reasonreputation
## 0.22862147 -0.25827559 0.11992143 0.13109807
## addressU internetyes
## 0.20950274 0.29420706
Koefisien G2 dan G1 signifikan dengan nilai p < 0.05, artinya kedua variabel tersebut memiliki pengaruh yang signifikan terhadap klasifikasi nilai akhir siswa.
Model LDA dibentuk menggunakan fungsi
lda
:
model_lda <- lda(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")])
print(model_lda)
## Call:
## lda(G3_class ~ ., data = data_clean[, c(selected_features, "G3_class")])
##
## Prior probabilities of groups:
## Fail Sufficient Good Very Good Excellent
## 0.15408320 0.54699538 0.17257319 0.10015408 0.02619414
##
## Group means:
## G2 G1 higheryes schoolMS schoolsupyes Mjobhealth
## Fail -1.4505900 -1.4009938 0.6700000 0.6800000 0.08000000 0.06000000
## Sufficient -0.1974866 -0.1607238 0.9014085 0.3098592 0.15211268 0.05352113
## Good 0.7860927 0.7056960 0.9910714 0.2232143 0.03571429 0.10714286
## Very Good 1.4498856 1.3204290 1.0000000 0.2461538 0.01538462 0.12307692
## Excellent 1.9342233 1.8994403 1.0000000 0.4117647 0.05882353 0.17647059
## Mjobother Mjobservices Mjobteacher reasonhome reasonother
## Fail 0.4200000 0.19000000 0.04000000 0.1700000 0.18000000
## Sufficient 0.4169014 0.20563380 0.09577465 0.2535211 0.10422535
## Good 0.3571429 0.24107143 0.15178571 0.2232143 0.08928571
## Very Good 0.3692308 0.24615385 0.20000000 0.2153846 0.09230769
## Excellent 0.2352941 0.05882353 0.23529412 0.1764706 0.05882353
## reasonreputation addressU internetyes
## Fail 0.1000000 0.5600000 0.6800000
## Sufficient 0.1971831 0.6929577 0.7436620
## Good 0.3035714 0.7678571 0.8571429
## Very Good 0.3384615 0.8000000 0.8769231
## Excellent 0.4117647 0.7058824 0.7647059
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4
## G2 1.602640785 0.27543752 -0.6998131 -0.1771381
## G1 0.538380559 -0.02651963 1.0270536 -0.1000573
## higheryes -0.013352908 -1.72311513 -0.5512671 1.3813041
## schoolMS 0.105507647 1.64353787 0.7387131 -0.4052150
## schoolsupyes -0.260443810 -0.97711027 1.8566742 -1.5440125
## Mjobhealth -0.009432997 1.29710494 -0.7499566 -0.5351045
## Mjobother -0.055887595 0.17323363 -1.1880468 -1.8397717
## Mjobservices -0.061502551 0.36984893 -1.7373340 -1.7820798
## Mjobteacher 0.139908262 0.71218379 -0.7207513 -1.7416423
## reasonhome -0.114655835 -0.26530449 0.1674413 -0.1360225
## reasonother 0.036966429 0.08591206 -0.5181147 -0.3210036
## reasonreputation 0.122766438 0.23804254 0.1853363 0.6382720
## addressU 0.028696006 0.13847738 -0.1217455 -0.1795131
## internetyes 0.071401650 0.22813567 -0.5495137 0.5135963
##
## Proportion of trace:
## LD1 LD2 LD3 LD4
## 0.9674 0.0234 0.0081 0.0012
print(model_lda$scaling)
## LD1 LD2 LD3 LD4
## G2 1.602640785 0.27543752 -0.6998131 -0.1771381
## G1 0.538380559 -0.02651963 1.0270536 -0.1000573
## higheryes -0.013352908 -1.72311513 -0.5512671 1.3813041
## schoolMS 0.105507647 1.64353787 0.7387131 -0.4052150
## schoolsupyes -0.260443810 -0.97711027 1.8566742 -1.5440125
## Mjobhealth -0.009432997 1.29710494 -0.7499566 -0.5351045
## Mjobother -0.055887595 0.17323363 -1.1880468 -1.8397717
## Mjobservices -0.061502551 0.36984893 -1.7373340 -1.7820798
## Mjobteacher 0.139908262 0.71218379 -0.7207513 -1.7416423
## reasonhome -0.114655835 -0.26530449 0.1674413 -0.1360225
## reasonother 0.036966429 0.08591206 -0.5181147 -0.3210036
## reasonreputation 0.122766438 0.23804254 0.1853363 0.6382720
## addressU 0.028696006 0.13847738 -0.1217455 -0.1795131
## internetyes 0.071401650 0.22813567 -0.5495137 0.5135963
LD1 menjelaskan 96.7% dari variasi antar kelas, menunjukkan bahwa diskriminan pertama sangat kuat membedakan antar kelas prestasi.
Uji signifikansi dilakukan secara serentak menggunakan Likelihood Ratio Test:
model_null <- polr(G3_class ~ 1, data = data_clean[, c(selected_features, "G3_class")], Hess = TRUE)
lr_stat <- 2 * (logLik(model_polr) - logLik(model_null))
df_lr <- attr(logLik(model_polr), "df") - attr(logLik(model_null), "df")
p_value_lr <- pchisq(lr_stat, df = df_lr, lower.tail = FALSE)
cat("Likelihood Ratio Statistic:", lr_stat, "\n")
## Likelihood Ratio Statistic: 1023.953
cat("Degrees of Freedom:", df_lr, "\n")
## Degrees of Freedom: 14
cat("p-value:", p_value_lr, "\n")
## p-value: 1.13402e-209
Dilakuakn juga uji parsial menggunakan Wald Tes:
summary_polr <- summary(model_polr)
wald_stat <- (summary_polr$coefficients[, "Value"] / summary_polr$coefficients[, "Std. Error"])^2
wald_p_values <- 1 - pchisq(wald_stat, df = 1)
wald_table <- data.frame(
Estimate = summary_polr$coefficients[, "Value"],
StdError = summary_polr$coefficients[, "Std. Error"],
p_value = round(wald_p_values, 4)
)
print(wald_table)
## Estimate StdError p_value
## G2 4.64511246 0.3764572 0.0000
## G1 1.30536330 0.2595404 0.0000
## higheryes 0.45503082 0.4488182 0.3107
## schoolMS 0.01525236 0.2640229 0.9539
## schoolsupyes -0.69196168 0.3820548 0.0701
## Mjobhealth 0.01481916 0.4556115 0.9741
## Mjobother 0.07227988 0.3201652 0.8214
## Mjobservices -0.13522882 0.3788897 0.7212
## Mjobteacher 0.22862147 0.4166495 0.5832
## reasonhome -0.25827559 0.2908572 0.3746
## reasonother 0.11992143 0.3819520 0.7535
## reasonreputation 0.13109807 0.2818190 0.6418
## addressU 0.20950274 0.2646157 0.4285
## internetyes 0.29420706 0.2822701 0.2973
## Fail|Sufficient -5.54577578 0.6182900 0.0000
## Sufficient|Good 4.08743704 0.6300437 0.0000
## Good|Very Good 8.10995175 0.7551577 0.0000
## Very Good|Excellent 12.46310464 0.9450727 0.0000
Uji Wilks’ Lambda digunakan untuk menilai kemampuan diskriminan:
eigen_values <- model_lda$svd^2
wilks_lambda <- prod(1 / (1 + eigen_values))
cat("Wilks' Lambda:", wilks_lambda, "\n")
## Wilks' Lambda: 1.307714e-05
Nilai Wilks’ Lambda sangat kecil (~0.000013), menunjukkan bahwa fungsi diskriminan dapat memisahkan kelas dengan sangat baik.
pred_polr <- predict(model_polr, newdata = data_clean[, selected_features])
pred_lda <- predict(model_lda)$class
conf_matrix_polr <- table(Predicted = pred_polr, Actual = data_clean$G3_class)
conf_matrix_lda <- table(Predicted = pred_lda, Actual = data_clean$G3_class)
cat("Confusion Matrix - Regresi Logistik Ordinal:\n")
## Confusion Matrix - Regresi Logistik Ordinal:
print(conf_matrix_polr)
## Actual
## Predicted Fail Sufficient Good Very Good Excellent
## Fail 73 16 0 0 0
## Sufficient 27 320 31 1 0
## Good 0 19 70 15 0
## Very Good 0 0 11 45 8
## Excellent 0 0 0 4 9
cat("\nConfusion Matrix - LDA:\n")
##
## Confusion Matrix - LDA:
print(conf_matrix_lda)
## Actual
## Predicted Fail Sufficient Good Very Good Excellent
## Fail 74 19 0 0 0
## Sufficient 26 316 27 1 0
## Good 0 20 80 17 0
## Very Good 0 0 4 46 9
## Excellent 0 0 1 1 8
Tabel klasifikasi menunjukkan bahwa model LDA dan regresi logistik
ordinal mampu memprediksi kelas Sufficient
dengan baik.
Namun, kelas Excellent
dan Fail
masih memiliki
tingkat kesalahan relatif tinggi.
evaluate_model <- function(conf_matrix) {
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
precision <- diag(conf_matrix) / rowSums(conf_matrix)
recall <- diag(conf_matrix) / colSums(conf_matrix)
f1 <- 2 * (precision * recall) / (precision + recall)
hasil <- data.frame(
Class = names(precision),
Precision = round(precision, 2),
Recall = round(recall, 2),
F1_Score = round(f1, 2)
)
hasil$Accuracy <- round(accuracy, 2)
return(hasil)
}
cat("\nEvaluasi Regresi Logistik Ordinal:\n")
##
## Evaluasi Regresi Logistik Ordinal:
print(evaluate_model(conf_matrix_polr))
## Class Precision Recall F1_Score Accuracy
## Fail Fail 0.82 0.73 0.77 0.8
## Sufficient Sufficient 0.84 0.90 0.87 0.8
## Good Good 0.67 0.62 0.65 0.8
## Very Good Very Good 0.70 0.69 0.70 0.8
## Excellent Excellent 0.69 0.53 0.60 0.8
cat("\nEvaluasi LDA:\n")
##
## Evaluasi LDA:
print(evaluate_model(conf_matrix_lda))
## Class Precision Recall F1_Score Accuracy
## Fail Fail 0.80 0.74 0.77 0.81
## Sufficient Sufficient 0.85 0.89 0.87 0.81
## Good Good 0.68 0.71 0.70 0.81
## Very Good Very Good 0.78 0.71 0.74 0.81
## Excellent Excellent 0.80 0.47 0.59 0.81
Model | Akurasi | Rata-rata F1 |
---|---|---|
Regresi Logistik | 0.80 | 0.72 |
LDA | 0.81 | 0.73 |
Model LDA sedikit lebih unggul dalam hal akurasi dan F1-score dibanding regresi logistik ordinal.
summary(model_polr)
## Call:
## polr(formula = G3_class ~ ., data = data_clean[, c(selected_features,
## "G3_class")], Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## G2 4.64511 0.3765 12.33902
## G1 1.30536 0.2595 5.02952
## higheryes 0.45503 0.4488 1.01384
## schoolMS 0.01525 0.2640 0.05777
## schoolsupyes -0.69196 0.3821 -1.81116
## Mjobhealth 0.01482 0.4556 0.03253
## Mjobother 0.07228 0.3202 0.22576
## Mjobservices -0.13523 0.3789 -0.35691
## Mjobteacher 0.22862 0.4166 0.54871
## reasonhome -0.25828 0.2909 -0.88798
## reasonother 0.11992 0.3820 0.31397
## reasonreputation 0.13110 0.2818 0.46519
## addressU 0.20950 0.2646 0.79172
## internetyes 0.29421 0.2823 1.04229
##
## Intercepts:
## Value Std. Error t value
## Fail|Sufficient -5.5458 0.6183 -8.9695
## Sufficient|Good 4.0874 0.6300 6.4875
## Good|Very Good 8.1100 0.7552 10.7394
## Very Good|Excellent 12.4631 0.9451 13.1875
##
## Residual Deviance: 594.9772
## AIC: 630.9772
coef_polr <- coef(model_polr)
odds_ratio <- exp(coef_polr)
cat("Koefisien Regresi Logistik:\n")
## Koefisien Regresi Logistik:
print(coef_polr)
## G2 G1 higheryes schoolMS
## 4.64511246 1.30536330 0.45503082 0.01525236
## schoolsupyes Mjobhealth Mjobother Mjobservices
## -0.69196168 0.01481916 0.07227988 -0.13522882
## Mjobteacher reasonhome reasonother reasonreputation
## 0.22862147 -0.25827559 0.11992143 0.13109807
## addressU internetyes
## 0.20950274 0.29420706
cat("\nOdds Ratio:\n")
##
## Odds Ratio:
print(round(odds_ratio, 2))
## G2 G1 higheryes schoolMS
## 104.08 3.69 1.58 1.02
## schoolsupyes Mjobhealth Mjobother Mjobservices
## 0.50 1.01 1.07 0.87
## Mjobteacher reasonhome reasonother reasonreputation
## 1.26 0.77 1.13 1.14
## addressU internetyes
## 1.23 1.34
Koefisien model dan odds ratio:
cat("Koefisien Fungsi Diskriminan:\n")
## Koefisien Fungsi Diskriminan:
print(model_lda$scaling)
## LD1 LD2 LD3 LD4
## G2 1.602640785 0.27543752 -0.6998131 -0.1771381
## G1 0.538380559 -0.02651963 1.0270536 -0.1000573
## higheryes -0.013352908 -1.72311513 -0.5512671 1.3813041
## schoolMS 0.105507647 1.64353787 0.7387131 -0.4052150
## schoolsupyes -0.260443810 -0.97711027 1.8566742 -1.5440125
## Mjobhealth -0.009432997 1.29710494 -0.7499566 -0.5351045
## Mjobother -0.055887595 0.17323363 -1.1880468 -1.8397717
## Mjobservices -0.061502551 0.36984893 -1.7373340 -1.7820798
## Mjobteacher 0.139908262 0.71218379 -0.7207513 -1.7416423
## reasonhome -0.114655835 -0.26530449 0.1674413 -0.1360225
## reasonother 0.036966429 0.08591206 -0.5181147 -0.3210036
## reasonreputation 0.122766438 0.23804254 0.1853363 0.6382720
## addressU 0.028696006 0.13847738 -0.1217455 -0.1795131
## internetyes 0.071401650 0.22813567 -0.5495137 0.5135963
Koefisien LD1:
Variabel | Koefisien | Odds Ratio | Signifikan |
---|---|---|---|
G2 | 4.65 | 104.08 | Ya |
G1 | 1.30 | 3.69 | Ya |
lainnya | < 1 | ~1 | Tidak signifikan |
library(ggplot2)
lda_pred <- predict(model_lda)
lda_df <- data.frame(LD1 = lda_pred$x[, 1], G3_class = data_clean$G3_class)
ggplot(lda_df, aes(x = LD1, fill = G3_class)) +
geom_histogram(binwidth = 0.5, position = "identity", alpha = 0.6) +
labs(title = "Distribusi Fungsi Diskriminan Pertama", x = "LD1", y = "Frekuensi")
Visualisasi menunjukkan bahwa fungsi diskriminan pertama (LD1) mampu memisahkan kelas dengan cukup jelas. Kelas Fail dan Excellent berada di ujung distribusi yang berlawanan.