Στόχος μας είναι η πρόβλεψη της μεταβλητής Pass_Fail με Λογιστική Παλινδρόμηση.

Θα χρησιμοποιήσουμε το dataset Student_Performance_Dataset.csv και θα χρειαστούμε τις βιβλιοθήκες caTools και ROCR, οπότε τις φορτώνουμε αυτές και μετά το dataset:

library(caTools)
## Warning: package 'caTools' was built under R version 4.5.3
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.5.3
data <- read.csv("C:/Users/maria/Desktop/Student_Performance_Dataset.csv")

Με τα str(data) και summary(data) βλέπουμε τα δεδομένα μας (μεταβλητές, τύπος, κτλ).

str(data)
## 'data.frame':    5000 obs. of  16 variables:
##  $ Student_ID                : chr  "S0001" "S0002" "S0003" "S0004" ...
##  $ Age                       : int  15 19 14 18 14 19 16 18 18 18 ...
##  $ Gender                    : chr  "Male" "Female" "Female" "Female" ...
##  $ Class                     : int  12 9 12 9 10 12 12 11 11 10 ...
##  $ Study_Hours_Per_Day       : num  1 1.6 3.6 5.5 5 5.2 0.8 2 3.5 5.7 ...
##  $ Attendance_Percentage     : int  65 58 64 68 80 82 78 81 61 97 ...
##  $ Parental_Education        : chr  "Postgraduate" "Graduate" "High School" "Postgraduate" ...
##  $ Internet_Access           : chr  "No" "No" "Yes" "Yes" ...
##  $ Extracurricular_Activities: chr  "No" "Yes" "Yes" "No" ...
##  $ Math_Score                : int  40 80 83 68 41 88 75 93 55 72 ...
##  $ Science_Score             : int  39 44 73 48 46 70 69 37 39 73 ...
##  $ English_Score             : int  72 35 59 77 36 46 64 48 83 73 ...
##  $ Previous_Year_Score       : int  81 47 58 54 68 60 84 55 71 61 ...
##  $ Final_Percentage          : num  50.3 53 71.7 64.3 41 ...
##  $ Performance_Level         : chr  "Average" "Average" "Good" "Average" ...
##  $ Pass_Fail                 : chr  "Pass" "Pass" "Pass" "Pass" ...
summary(data)
##   Student_ID             Age           Gender              Class     
##  Length:5000        Min.   :14.00   Length:5000        Min.   : 9.0  
##  Class :character   1st Qu.:15.00   Class :character   1st Qu.:10.0  
##  Mode  :character   Median :17.00   Mode  :character   Median :10.0  
##                     Mean   :16.51                      Mean   :10.5  
##                     3rd Qu.:18.00                      3rd Qu.:11.0  
##                     Max.   :19.00                      Max.   :12.0  
##  Study_Hours_Per_Day Attendance_Percentage Parental_Education
##  Min.   :0.500       Min.   : 50.00        Length:5000       
##  1st Qu.:1.900       1st Qu.: 62.00        Class :character  
##  Median :3.300       Median : 75.00        Mode  :character  
##  Mean   :3.287       Mean   : 74.92                          
##  3rd Qu.:4.700       3rd Qu.: 88.00                          
##  Max.   :6.000       Max.   :100.00                          
##  Internet_Access    Extracurricular_Activities   Math_Score     Science_Score  
##  Length:5000        Length:5000                Min.   : 35.00   Min.   : 35.0  
##  Class :character   Class :character           1st Qu.: 52.00   1st Qu.: 50.0  
##  Mode  :character   Mode  :character           Median : 68.00   Median : 67.0  
##                                                Mean   : 67.75   Mean   : 66.9  
##                                                3rd Qu.: 84.00   3rd Qu.: 83.0  
##                                                Max.   :100.00   Max.   :100.0  
##  English_Score    Previous_Year_Score Final_Percentage Performance_Level 
##  Min.   : 35.00   Min.   :40.00       Min.   :36.33    Length:5000       
##  1st Qu.: 51.00   1st Qu.:53.00       1st Qu.:59.67    Class :character  
##  Median : 68.00   Median :67.00       Median :67.33    Mode  :character  
##  Mean   : 67.78   Mean   :67.28       Mean   :67.48                      
##  3rd Qu.: 85.00   3rd Qu.:81.00       3rd Qu.:75.33                      
##  Max.   :100.00   Max.   :95.00       Max.   :98.33                      
##   Pass_Fail        
##  Length:5000       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Έχουμε 5000 μαθητές και 16 μεταβλητές. Το StudentID αριθμεί τους ασθενείς απο S0001 έως S5000. Οι 14 μεταβλητές απο Age έως Performance_Level είναι οι ανεξάρτητες μεταβλητές μας και θα τις χρησιμοποιήσουμε όλες.

Η τελευταία μεταβλητή Pass_Fail είναι το αποτέλεσμα ή εξαρτημένη μεταβλητή μας και κανονικά παίρνει τιμή Pass (επιτυχία) και Fail (αποτυχία), θα την μετατρέψουμε όμως, έτσι ώστε να παίρνει τιμή 1 για την επιτυχία και 0 για την αποτυχία του μαθητή.

data$Pass_Fail <- ifelse(data$Pass_Fail == "Pass", 1, 0)
table(data$Pass_Fail)
## 
##    0    1 
##  265 4735
set.seed(903)
split <- sample.split(data$Pass_Fail, SplitRatio = 0.65)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
nrow(train)
## [1] 3250

Αντίστοιχα:

nrow(test)
## [1] 1750
model <- glm(Pass_Fail ~ Age + Class + Study_Hours_Per_Day + Math_Score + Science_Score + English_Score + Attendance_Percentage + Previous_Year_Score, data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
## 
## Call:
## glm(formula = Pass_Fail ~ Age + Class + Study_Hours_Per_Day + 
##     Math_Score + Science_Score + English_Score + Attendance_Percentage + 
##     Previous_Year_Score, family = binomial, data = train)
## 
## Coefficients:
##                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)           -4.277e+03  5.312e+04  -0.081    0.936
## Age                   -2.173e-01  1.421e+02  -0.002    0.999
## Class                  3.945e-01  2.147e+02   0.002    0.999
## Study_Hours_Per_Day   -4.816e-02  2.278e+02   0.000    1.000
## Math_Score             2.863e+01  3.557e+02   0.080    0.936
## Science_Score          2.863e+01  3.563e+02   0.080    0.936
## English_Score          2.856e+01  3.530e+02   0.081    0.936
## Attendance_Percentage  7.906e-03  1.380e+01   0.001    1.000
## Previous_Year_Score   -3.214e-03  1.507e+01   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1.3457e+03  on 3249  degrees of freedom
## Residual deviance: 2.3863e-05  on 3241  degrees of freedom
## AIC: 18
## 
## Number of Fisher Scoring iterations: 25

Η predict επιστρέφει την εκτιμώμενη πιθανότητα κάθε παρατήρηση να ανήκει στην κατηγορία Pass.

predictTest <- predict(model, newdata = test, type = "response")

predicted <- ifelse(predictTest > 0.5, 1, 0)

cm <- table(
  Predicted = factor(predicted, levels = c(0,1)),
  Actual = factor(test$Pass_Fail, levels = c(0,1))
)

cm
##          Actual
## Predicted    0    1
##         0   93    0
##         1    0 1657
accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm[2,2] / (cm[2,2] + cm[1,2])
specificity <- cm[1,1] / (cm[1,1] + cm[2,1])

table(test$Pass_Fail)
## 
##    0    1 
##   93 1657
baseline_accuracy <- max(prop.table(table(test$Pass_Fail)))
baseline_accuracy
## [1] 0.9468571
ROCRpred <- prediction(predictTest, test$Pass_Fail)
ROCRperf <- performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE)

auc <- performance(ROCRpred, "auc")
auc@y.values[[1]]
## [1] 1
data2 <- na.omit(data)
set.seed(903)
split2 <- sample.split(data2$Pass_Fail, SplitRatio = 0.65)
train2 <- subset(data2, split2 == TRUE)
test2 <- subset(data2, split2 == FALSE)
nrow(train2)
## [1] 3250
nrow(test2)
## [1] 1750
model2 <- glm(Pass_Fail ~ Age + Class + Study_Hours_Per_Day + Math_Score + Science_Score + English_Score + Attendance_Percentage + Previous_Year_Score, data = train2, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictTest2 <- predict(model2, newdata = test2, type = "response")
ROCRpred2 <- prediction(predictTest2, test2$Pass_Fail)
ROCRperf2 <- performance(ROCRpred2, "tpr", "fpr")
plot(ROCRperf2, colorize = TRUE)

auc2 <- performance(ROCRpred2, "auc")
auc2@y.values[[1]]
## [1] 1

Συμπεράσματα

Το μοντέλο παρουσιάζει χαμηλή απόδοση λόγω ανισορροπίας των κλάσεων. Η AUC επιβεβαιώνει περιορισμένη διακριτική ικανότητα.