Load data Student Performance from UCI Repository

url <- "https://archive.ics.uci.edu/static/public/320/data.csv"

data <- read.csv(url)

head(data)
##   school sex age address famsize Pstatus Medu Fedu     Mjob     Fjob     reason
## 1     GP   F  18       U     GT3       A    4    4  at_home  teacher     course
## 2     GP   F  17       U     GT3       T    1    1  at_home    other     course
## 3     GP   F  15       U     LE3       T    1    1  at_home    other      other
## 4     GP   F  15       U     GT3       T    4    2   health services       home
## 5     GP   F  16       U     GT3       T    3    3    other    other       home
## 6     GP   M  16       U     LE3       T    4    3 services    other reputation
##   guardian traveltime studytime failures schoolsup famsup paid activities
## 1   mother          2         2        0       yes     no   no         no
## 2   father          1         2        0        no    yes   no         no
## 3   mother          1         2        0       yes     no   no         no
## 4   mother          1         3        0        no    yes   no        yes
## 5   father          1         2        0        no    yes   no         no
## 6   mother          1         2        0        no    yes   no        yes
##   nursery higher internet romantic famrel freetime goout Dalc Walc health
## 1     yes    yes       no       no      4        3     4    1    1      3
## 2      no    yes      yes       no      5        3     3    1    1      3
## 3     yes    yes      yes       no      4        3     2    2    3      3
## 4     yes    yes      yes      yes      3        2     2    1    1      5
## 5     yes    yes       no       no      4        3     2    1    2      5
## 6     yes    yes      yes       no      5        4     2    1    2      5
##   absences G1 G2 G3
## 1        4  0 11 11
## 2        2  9 11 11
## 3        6 12 13 12
## 4        0 14 14 14
## 5        0 11 13 13
## 6        6 12 12 13

Create variable ordinal target from G3

data$performance <- cut(data$G3,
                        breaks = c(-1, 9, 14, 20),
                        labels = c("low", "medium", "high"),
                        ordered_result = TRUE)

Feature Selection for data that has causality

selected_data <- data[, c("performance", "studytime", "failures", "absences", "Medu", "famsup", "goout", "higher")]

# Cek data
str(selected_data)
## 'data.frame':    649 obs. of  8 variables:
##  $ performance: Ord.factor w/ 3 levels "low"<"medium"<..: 2 2 2 2 2 2 2 2 3 2 ...
##  $ studytime  : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ absences   : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ Medu       : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ famsup     : chr  "no" "yes" "no" "yes" ...
##  $ goout      : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ higher     : chr  "yes" "yes" "yes" "yes" ...
library(MASS)
selected_data$famsup <- as.factor(selected_data$famsup)
selected_data$higher <- as.factor(selected_data$higher)
# Fit model with POLR (proportional odds logistic regression)
model <- polr(performance ~ studytime + failures + absences + 
                            Medu + famsup + goout + higher,
              data = selected_data, Hess = TRUE)

# Ringkasan hasil model
summary(model)
## Call:
## polr(formula = performance ~ studytime + failures + absences + 
##     Medu + famsup + goout + higher, data = selected_data, Hess = TRUE)
## 
## Coefficients:
##              Value Std. Error t value
## studytime  0.38064    0.10413  3.6554
## failures  -0.99905    0.16536 -6.0417
## absences  -0.02484    0.01897 -1.3094
## Medu       0.29212    0.07797  3.7466
## famsupyes -0.11643    0.17465 -0.6667
## goout     -0.06805    0.07210 -0.9438
## higheryes  1.26897    0.29546  4.2948
## 
## Intercepts:
##             Value   Std. Error t value
## low|medium  -0.0504  0.4321    -0.1166
## medium|high  3.6497  0.4631     7.8808
## 
## Residual Deviance: 1014.48 
## AIC: 1032.48
# Add p-value to summary
ctable <- coef(summary(model))
pvals <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
ctable <- cbind(ctable, "p value" = round(pvals, 4))
ctable
##                   Value Std. Error    t value p value
## studytime    0.38064003 0.10412967  3.6554428  0.0003
## failures    -0.99905155 0.16536060 -6.0416543  0.0000
## absences    -0.02484278 0.01897229 -1.3094241  0.1904
## Medu         0.29212002 0.07797019  3.7465603  0.0002
## famsupyes   -0.11643403 0.17464598 -0.6666860  0.5050
## goout       -0.06805385 0.07210476 -0.9438192  0.3453
## higheryes    1.26896766 0.29546346  4.2948378  0.0000
## low|medium  -0.05036035 0.43208599 -0.1165517  0.9072
## medium|high  3.64972532 0.46311521  7.8808151  0.0000
# Performance category prediction
predicted <- predict(model, selected_data)

# Confusion matrix
table(predicted, selected_data$performance)
##          
## predicted low medium high
##    low     24     18    0
##    medium  76    397  128
##    high     0      3    3

Based on the results of the analysis using Ordinal Logistic Regression, it was found that the variables failures, studytime, Medu, and higher significantly influence students’ academic performance levels. The coefficient values indicate that more study time, higher maternal education, and the desire to pursue higher education increase the likelihood of students achieving high performance. Conversely, previous academic failures significantly reduce the chances of students being in the high-performance category. Meanwhile, variables such as absences (absences), going out habits (goout), and family support (famsup) did not show a significant influence. Model evaluation shows that the model is fairly capable of predicting the ‘medium’ category but is less accurate in classifying the ‘high’ and ‘low’ categories, with most predictions tending to fall into the middle class, indicating the need for improvement in classifying the extreme categories.