url <- "https://archive.ics.uci.edu/static/public/320/data.csv"
data <- read.csv(url)
head(data)
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## 1 GP F 18 U GT3 A 4 4 at_home teacher course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health services home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reputation
## guardian traveltime studytime failures schoolsup famsup paid activities
## 1 mother 2 2 0 yes no no no
## 2 father 1 2 0 no yes no no
## 3 mother 1 2 0 yes no no no
## 4 mother 1 3 0 no yes no yes
## 5 father 1 2 0 no yes no no
## 6 mother 1 2 0 no yes no yes
## nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1 yes yes no no 4 3 4 1 1 3
## 2 no yes yes no 5 3 3 1 1 3
## 3 yes yes yes no 4 3 2 2 3 3
## 4 yes yes yes yes 3 2 2 1 1 5
## 5 yes yes no no 4 3 2 1 2 5
## 6 yes yes yes no 5 4 2 1 2 5
## absences G1 G2 G3
## 1 4 0 11 11
## 2 2 9 11 11
## 3 6 12 13 12
## 4 0 14 14 14
## 5 0 11 13 13
## 6 6 12 12 13
Create variable ordinal target from G3
data$performance <- cut(data$G3,
breaks = c(-1, 9, 14, 20),
labels = c("low", "medium", "high"),
ordered_result = TRUE)
Feature Selection for data that has causality
selected_data <- data[, c("performance", "studytime", "failures", "absences", "Medu", "famsup", "goout", "higher")]
# Cek data
str(selected_data)
## 'data.frame': 649 obs. of 8 variables:
## $ performance: Ord.factor w/ 3 levels "low"<"medium"<..: 2 2 2 2 2 2 2 2 3 2 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
library(MASS)
selected_data$famsup <- as.factor(selected_data$famsup)
selected_data$higher <- as.factor(selected_data$higher)
# Fit model with POLR (proportional odds logistic regression)
model <- polr(performance ~ studytime + failures + absences +
Medu + famsup + goout + higher,
data = selected_data, Hess = TRUE)
# Ringkasan hasil model
summary(model)
## Call:
## polr(formula = performance ~ studytime + failures + absences +
## Medu + famsup + goout + higher, data = selected_data, Hess = TRUE)
##
## Coefficients:
## Value Std. Error t value
## studytime 0.38064 0.10413 3.6554
## failures -0.99905 0.16536 -6.0417
## absences -0.02484 0.01897 -1.3094
## Medu 0.29212 0.07797 3.7466
## famsupyes -0.11643 0.17465 -0.6667
## goout -0.06805 0.07210 -0.9438
## higheryes 1.26897 0.29546 4.2948
##
## Intercepts:
## Value Std. Error t value
## low|medium -0.0504 0.4321 -0.1166
## medium|high 3.6497 0.4631 7.8808
##
## Residual Deviance: 1014.48
## AIC: 1032.48
# Add p-value to summary
ctable <- coef(summary(model))
pvals <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = round(pvals, 4))
ctable
## Value Std. Error t value p value
## studytime 0.38064003 0.10412967 3.6554428 0.0003
## failures -0.99905155 0.16536060 -6.0416543 0.0000
## absences -0.02484278 0.01897229 -1.3094241 0.1904
## Medu 0.29212002 0.07797019 3.7465603 0.0002
## famsupyes -0.11643403 0.17464598 -0.6666860 0.5050
## goout -0.06805385 0.07210476 -0.9438192 0.3453
## higheryes 1.26896766 0.29546346 4.2948378 0.0000
## low|medium -0.05036035 0.43208599 -0.1165517 0.9072
## medium|high 3.64972532 0.46311521 7.8808151 0.0000
# Performance category prediction
predicted <- predict(model, selected_data)
# Confusion matrix
table(predicted, selected_data$performance)
##
## predicted low medium high
## low 24 18 0
## medium 76 397 128
## high 0 3 3
Based on the results of the analysis using Ordinal Logistic Regression, it was found that the variables failures, studytime, Medu, and higher significantly influence students’ academic performance levels. The coefficient values indicate that more study time, higher maternal education, and the desire to pursue higher education increase the likelihood of students achieving high performance. Conversely, previous academic failures significantly reduce the chances of students being in the high-performance category. Meanwhile, variables such as absences (absences), going out habits (goout), and family support (famsup) did not show a significant influence. Model evaluation shows that the model is fairly capable of predicting the ‘medium’ category but is less accurate in classifying the ‘high’ and ‘low’ categories, with most predictions tending to fall into the middle class, indicating the need for improvement in classifying the extreme categories.