Στόχος μας είναι η πρόβλεψη της μεταβλητής Pass_Fail με Λογιστική Παλινδρόμηση.
Θα χρησιμοποιήσουμε το dataset Student_Performance_Dataset.csv και θα χρειαστούμε τις βιβλιοθήκες caTools και ROCR, οπότε τις φορτώνουμε αυτές και μετά το dataset:
library(caTools)
## Warning: package 'caTools' was built under R version 4.5.3
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.5.3
data <- read.csv("C:/Users/maria/Desktop/Student_Performance_Dataset.csv")
Με τα str(data) και summary(data) βλέπουμε τα δεδομένα μας (μεταβλητές, τύπος, κτλ).
str(data)
## 'data.frame': 5000 obs. of 16 variables:
## $ Student_ID : chr "S0001" "S0002" "S0003" "S0004" ...
## $ Age : int 15 19 14 18 14 19 16 18 18 18 ...
## $ Gender : chr "Male" "Female" "Female" "Female" ...
## $ Class : int 12 9 12 9 10 12 12 11 11 10 ...
## $ Study_Hours_Per_Day : num 1 1.6 3.6 5.5 5 5.2 0.8 2 3.5 5.7 ...
## $ Attendance_Percentage : int 65 58 64 68 80 82 78 81 61 97 ...
## $ Parental_Education : chr "Postgraduate" "Graduate" "High School" "Postgraduate" ...
## $ Internet_Access : chr "No" "No" "Yes" "Yes" ...
## $ Extracurricular_Activities: chr "No" "Yes" "Yes" "No" ...
## $ Math_Score : int 40 80 83 68 41 88 75 93 55 72 ...
## $ Science_Score : int 39 44 73 48 46 70 69 37 39 73 ...
## $ English_Score : int 72 35 59 77 36 46 64 48 83 73 ...
## $ Previous_Year_Score : int 81 47 58 54 68 60 84 55 71 61 ...
## $ Final_Percentage : num 50.3 53 71.7 64.3 41 ...
## $ Performance_Level : chr "Average" "Average" "Good" "Average" ...
## $ Pass_Fail : chr "Pass" "Pass" "Pass" "Pass" ...
summary(data)
## Student_ID Age Gender Class
## Length:5000 Min. :14.00 Length:5000 Min. : 9.0
## Class :character 1st Qu.:15.00 Class :character 1st Qu.:10.0
## Mode :character Median :17.00 Mode :character Median :10.0
## Mean :16.51 Mean :10.5
## 3rd Qu.:18.00 3rd Qu.:11.0
## Max. :19.00 Max. :12.0
## Study_Hours_Per_Day Attendance_Percentage Parental_Education
## Min. :0.500 Min. : 50.00 Length:5000
## 1st Qu.:1.900 1st Qu.: 62.00 Class :character
## Median :3.300 Median : 75.00 Mode :character
## Mean :3.287 Mean : 74.92
## 3rd Qu.:4.700 3rd Qu.: 88.00
## Max. :6.000 Max. :100.00
## Internet_Access Extracurricular_Activities Math_Score Science_Score
## Length:5000 Length:5000 Min. : 35.00 Min. : 35.0
## Class :character Class :character 1st Qu.: 52.00 1st Qu.: 50.0
## Mode :character Mode :character Median : 68.00 Median : 67.0
## Mean : 67.75 Mean : 66.9
## 3rd Qu.: 84.00 3rd Qu.: 83.0
## Max. :100.00 Max. :100.0
## English_Score Previous_Year_Score Final_Percentage Performance_Level
## Min. : 35.00 Min. :40.00 Min. :36.33 Length:5000
## 1st Qu.: 51.00 1st Qu.:53.00 1st Qu.:59.67 Class :character
## Median : 68.00 Median :67.00 Median :67.33 Mode :character
## Mean : 67.78 Mean :67.28 Mean :67.48
## 3rd Qu.: 85.00 3rd Qu.:81.00 3rd Qu.:75.33
## Max. :100.00 Max. :95.00 Max. :98.33
## Pass_Fail
## Length:5000
## Class :character
## Mode :character
##
##
##
Έχουμε 5000 μαθητές και 16 μεταβλητές. Το StudentID αριθμεί τους ασθενείς απο S0001 έως S5000. Οι 14 μεταβλητές απο Age έως Performance_Level είναι οι ανεξάρτητες μεταβλητές μας και θα τις χρησιμοποιήσουμε όλες.
Η τελευταία μεταβλητή Pass_Fail είναι το αποτέλεσμα ή εξαρτημένη μεταβλητή μας και κανονικά παίρνει τιμή Pass (επιτυχία) και Fail (αποτυχία), θα την μετατρέψουμε όμως, έτσι ώστε να παίρνει τιμή 1 για την επιτυχία και 0 για την αποτυχία του μαθητή.
data$Pass_Fail <- ifelse(data$Pass_Fail == "Pass", 1, 0)
table(data$Pass_Fail)
##
## 0 1
## 265 4735
set.seed(903)
split <- sample.split(data$Pass_Fail, SplitRatio = 0.65)
train <- subset(data, split == TRUE)
test <- subset(data, split == FALSE)
nrow(train)
## [1] 3250
Αντίστοιχα:
nrow(test)
## [1] 1750
model <- glm(Pass_Fail ~ Age + Class + Study_Hours_Per_Day + Math_Score + Science_Score + English_Score + Attendance_Percentage + Previous_Year_Score, data = train, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(model)
##
## Call:
## glm(formula = Pass_Fail ~ Age + Class + Study_Hours_Per_Day +
## Math_Score + Science_Score + English_Score + Attendance_Percentage +
## Previous_Year_Score, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.277e+03 5.312e+04 -0.081 0.936
## Age -2.173e-01 1.421e+02 -0.002 0.999
## Class 3.945e-01 2.147e+02 0.002 0.999
## Study_Hours_Per_Day -4.816e-02 2.278e+02 0.000 1.000
## Math_Score 2.863e+01 3.557e+02 0.080 0.936
## Science_Score 2.863e+01 3.563e+02 0.080 0.936
## English_Score 2.856e+01 3.530e+02 0.081 0.936
## Attendance_Percentage 7.906e-03 1.380e+01 0.001 1.000
## Previous_Year_Score -3.214e-03 1.507e+01 0.000 1.000
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1.3457e+03 on 3249 degrees of freedom
## Residual deviance: 2.3863e-05 on 3241 degrees of freedom
## AIC: 18
##
## Number of Fisher Scoring iterations: 25
Η predict επιστρέφει την εκτιμώμενη πιθανότητα κάθε παρατήρηση να ανήκει στην κατηγορία Pass.
predictTest <- predict(model, newdata = test, type = "response")
predicted <- ifelse(predictTest > 0.5, 1, 0)
cm <- table(
Predicted = factor(predicted, levels = c(0,1)),
Actual = factor(test$Pass_Fail, levels = c(0,1))
)
cm
## Actual
## Predicted 0 1
## 0 93 0
## 1 0 1657
accuracy <- sum(diag(cm)) / sum(cm)
sensitivity <- cm[2,2] / (cm[2,2] + cm[1,2])
specificity <- cm[1,1] / (cm[1,1] + cm[2,1])
table(test$Pass_Fail)
##
## 0 1
## 93 1657
baseline_accuracy <- max(prop.table(table(test$Pass_Fail)))
baseline_accuracy
## [1] 0.9468571
ROCRpred <- prediction(predictTest, test$Pass_Fail)
ROCRperf <- performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf, colorize = TRUE)
auc <- performance(ROCRpred, "auc")
auc@y.values[[1]]
## [1] 1
data2 <- na.omit(data)
set.seed(903)
split2 <- sample.split(data2$Pass_Fail, SplitRatio = 0.65)
train2 <- subset(data2, split2 == TRUE)
test2 <- subset(data2, split2 == FALSE)
nrow(train2)
## [1] 3250
nrow(test2)
## [1] 1750
model2 <- glm(Pass_Fail ~ Age + Class + Study_Hours_Per_Day + Math_Score + Science_Score + English_Score + Attendance_Percentage + Previous_Year_Score, data = train2, family = binomial)
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
predictTest2 <- predict(model2, newdata = test2, type = "response")
ROCRpred2 <- prediction(predictTest2, test2$Pass_Fail)
ROCRperf2 <- performance(ROCRpred2, "tpr", "fpr")
plot(ROCRperf2, colorize = TRUE)
auc2 <- performance(ROCRpred2, "auc")
auc2@y.values[[1]]
## [1] 1
Συμπεράσματα
Το μοντέλο παρουσιάζει χαμηλή απόδοση λόγω ανισορροπίας των κλάσεων. Η AUC επιβεβαιώνει περιορισμένη διακριτική ικανότητα.