1. Dataset Description

The dataset contains information about students, such as study hours per day, attendance percentage, subject scores (Math, Science, English), and the final percentage (Final_Percentage).

Goal: predict Final_Percentage.

data <- read.csv("Student_Performance_Dataset.csv")
head(data)
##   Student_ID Age Gender Class Study_Hours_Per_Day Attendance_Percentage
## 1      S0001  15   Male    12                 1.0                    65
## 2      S0002  19 Female     9                 1.6                    58
## 3      S0003  14 Female    12                 3.6                    64
## 4      S0004  18 Female     9                 5.5                    68
## 5      S0005  14   Male    10                 5.0                    80
## 6      S0006  19   Male    12                 5.2                    82
##   Parental_Education Internet_Access Extracurricular_Activities Math_Score
## 1       Postgraduate              No                         No         40
## 2           Graduate              No                        Yes         80
## 3        High School             Yes                        Yes         83
## 4       Postgraduate             Yes                         No         68
## 5        High School             Yes                         No         41
## 6        High School              No                        Yes         88
##   Science_Score English_Score Previous_Year_Score Final_Percentage
## 1            39            72                  81            50.33
## 2            44            35                  47            53.00
## 3            73            59                  58            71.67
## 4            48            77                  54            64.33
## 5            46            36                  68            41.00
## 6            70            46                  60            68.00
##   Performance_Level Pass_Fail
## 1           Average      Pass
## 2           Average      Pass
## 3              Good      Pass
## 4           Average      Pass
## 5              Poor      Fail
## 6              Good      Pass
summary(data)
##   Student_ID             Age           Gender              Class     
##  Length:5000        Min.   :14.00   Length:5000        Min.   : 9.0  
##  Class :character   1st Qu.:15.00   Class :character   1st Qu.:10.0  
##  Mode  :character   Median :17.00   Mode  :character   Median :10.0  
##                     Mean   :16.51                      Mean   :10.5  
##                     3rd Qu.:18.00                      3rd Qu.:11.0  
##                     Max.   :19.00                      Max.   :12.0  
##  Study_Hours_Per_Day Attendance_Percentage Parental_Education
##  Min.   :0.500       Min.   : 50.00        Length:5000       
##  1st Qu.:1.900       1st Qu.: 62.00        Class :character  
##  Median :3.300       Median : 75.00        Mode  :character  
##  Mean   :3.287       Mean   : 74.92                          
##  3rd Qu.:4.700       3rd Qu.: 88.00                          
##  Max.   :6.000       Max.   :100.00                          
##  Internet_Access    Extracurricular_Activities   Math_Score     Science_Score  
##  Length:5000        Length:5000                Min.   : 35.00   Min.   : 35.0  
##  Class :character   Class :character           1st Qu.: 52.00   1st Qu.: 50.0  
##  Mode  :character   Mode  :character           Median : 68.00   Median : 67.0  
##                                                Mean   : 67.75   Mean   : 66.9  
##                                                3rd Qu.: 84.00   3rd Qu.: 83.0  
##                                                Max.   :100.00   Max.   :100.0  
##  English_Score    Previous_Year_Score Final_Percentage Performance_Level 
##  Min.   : 35.00   Min.   :40.00       Min.   :36.33    Length:5000       
##  1st Qu.: 51.00   1st Qu.:53.00       1st Qu.:59.67    Class :character  
##  Median : 68.00   Median :67.00       Median :67.33    Mode  :character  
##  Mean   : 67.78   Mean   :67.28       Mean   :67.48                      
##  3rd Qu.: 85.00   3rd Qu.:81.00       3rd Qu.:75.33                      
##  Max.   :100.00   Max.   :95.00       Max.   :98.33                      
##   Pass_Fail        
##  Length:5000       
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

2. Plots

Scatterplot (Study Hours vs Final Percentage)

plot(data$Study_Hours_Per_Day, data$Final_Percentage,
     main="Study Hours vs Final Percentage",
     xlab="Study Hours",
     ylab="Final %",
     pch=19)

Boxplot

boxplot(data$Final_Percentage,
        main="Distribution of Final Percentage")

3. Simple Linear Regression

We try to predict Final_Percentage from the daily study hours.

model1 <- lm(Final_Percentage ~ Study_Hours_Per_Day, data=data)
summary(model1)
## 
## Call:
## lm(formula = Final_Percentage ~ Study_Hours_Per_Day, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.9855  -7.8244  -0.0123   7.7841  30.8088 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         67.77175    0.35650 190.101   <2e-16 ***
## Study_Hours_Per_Day -0.08946    0.09765  -0.916     0.36    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 4998 degrees of freedom
## Multiple R-squared:  0.0001679,  Adjusted R-squared:  -3.215e-05 
## F-statistic: 0.8393 on 1 and 4998 DF,  p-value: 0.3596
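
A confidence interval makes the slope estimate above easier to read. With the reported estimate of -0.08946 and standard error of 0.09765, an approximate 95% interval is -0.08946 ± 1.96 × 0.09765, roughly (-0.281, 0.102), which includes zero. The sketch below shows how `confint()` gives the same kind of interval. It runs on simulated data (the data frame `sim` and its values are assumptions made for illustration, not the real CSV):

```r
# Simulated stand-in for the student data: the column names mirror the report,
# but the values are random and unrelated to the real CSV.
set.seed(3)
sim <- data.frame(Study_Hours_Per_Day = runif(300, 0.5, 6))
sim$Final_Percentage <- rnorm(300, mean = 67, sd = 11)  # pure noise, no signal

fit <- lm(Final_Percentage ~ Study_Hours_Per_Day, data = sim)

# 95% confidence interval for the slope
ci <- confint(fit, "Study_Hours_Per_Day", level = 0.95)
print(ci)

# The same interval by hand: estimate +/- t quantile * standard error
est <- coef(summary(fit))["Study_Hours_Per_Day", ]
manual <- est["Estimate"] +
  c(-1, 1) * qt(0.975, df.residual(fit)) * est["Std. Error"]
print(manual)
```

On the real data the equivalent call would be `confint(model1)`.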

Graphical representation

plot(data$Study_Hours_Per_Day, data$Final_Percentage)
abline(model1, col="red", lwd=2)

4. Multiple Linear Regression

We add more predictors:

model2 <- lm(Final_Percentage ~ Study_Hours_Per_Day +
                                   Attendance_Percentage +
                                   Previous_Year_Score,
             data=data)
summary(model2)
## 
## Call:
## lm(formula = Final_Percentage ~ Study_Hours_Per_Day + Attendance_Percentage + 
##     Previous_Year_Score, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.056  -7.813  -0.003   7.793  31.020 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           68.4879726  1.0780987  63.527   <2e-16 ***
## Study_Hours_Per_Day   -0.0873326  0.0977150  -0.894    0.371    
## Attendance_Percentage  0.0007018  0.0105732   0.066    0.947    
## Previous_Year_Score   -0.0115307  0.0096007  -1.201    0.230    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.96 on 4996 degrees of freedom
## Multiple R-squared:  0.0004578,  Adjusted R-squared:  -0.0001424 
## F-statistic: 0.7627 on 3 and 4996 DF,  p-value: 0.5148
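
Because model1 is nested inside model2, the two can also be compared with a partial F-test, which asks whether the extra predictors reduce the residual sum of squares by more than chance alone would. The sketch below uses simulated data (the data frame `sim` and the fits `m1` and `m2` are assumptions made for illustration, not the real CSV):

```r
# Simulated stand-in for the student data: the column names mirror the report,
# but the values are random and unrelated to the real CSV.
set.seed(1)
n <- 200
sim <- data.frame(
  Study_Hours_Per_Day   = runif(n, 0.5, 6),
  Attendance_Percentage = runif(n, 50, 100),
  Previous_Year_Score   = runif(n, 40, 95)
)
sim$Final_Percentage <- rnorm(n, mean = 67, sd = 11)  # pure noise, no signal

m1 <- lm(Final_Percentage ~ Study_Hours_Per_Day, data = sim)
m2 <- lm(Final_Percentage ~ Study_Hours_Per_Day + Attendance_Percentage +
           Previous_Year_Score, data = sim)

# Partial F-test for the two additional predictors
cmp <- anova(m1, m2)
print(cmp)

# Adjusted R-squared penalises extra predictors, unlike plain R-squared
adj <- c(model1 = summary(m1)$adj.r.squared,
         model2 = summary(m2)$adj.r.squared)
print(adj)
```

On the real data the equivalent call would be `anova(model1, model2)`.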

5. Model Comparison (R² & SSE)

# R-squared
summary(model1)$r.squared
## [1] 0.0001678996
summary(model2)$r.squared
## [1] 0.0004578019
# SSE
SSE1 <- sum(residuals(model1)^2)
SSE2 <- sum(residuals(model2)^2)

SSE1
## [1] 600834.3
SSE2
## [1] 600660.1

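We expect model2 to have a higher R² and a lower SSE, because adding predictors can never reduce the in-sample fit. Here the difference is negligible (R² of 0.00046 vs 0.00017, SSE of 600660.1 vs 600834.3), so on its own it does not indicate better prediction.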
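
The two measures are linked by R² = 1 - SSE/SST, where SST is the total sum of squares of the response around its mean. Since both models share the same response, a lower SSE always goes with a higher R². The sketch below checks the identity on a toy dataset (the variables `x` and `y` and their relationship are hypothetical, chosen only for illustration):

```r
# Toy data with a hypothetical linear relationship
set.seed(42)
x <- runif(100, 0.5, 6)
y <- 60 + 2 * x + rnorm(100, sd = 5)

fit <- lm(y ~ x)

sse <- sum(residuals(fit)^2)   # variation left unexplained by the model
sst <- sum((y - mean(y))^2)    # total variation around the mean
r2_manual <- 1 - sse / sst

# Both values should agree
print(c(manual = r2_manual, from_summary = summary(fit)$r.squared))
```

Applied to the values reported above, SSE1 / (1 - R²) for model1 and SSE2 / (1 - R²) for model2 both give an SST of about 600935, as expected for two models fitted to the same response.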

6. Correlations

cor(data[, c("Study_Hours_Per_Day",
             "Attendance_Percentage",
             "Previous_Year_Score",
             "Final_Percentage")])
##                       Study_Hours_Per_Day Attendance_Percentage
## Study_Hours_Per_Day            1.00000000          0.0269157371
## Attendance_Percentage          0.02691574          1.0000000000
## Previous_Year_Score            0.01966172         -0.0116111435
## Final_Percentage              -0.01295761          0.0007960933
##                       Previous_Year_Score Final_Percentage
## Study_Hours_Per_Day            0.01966172    -0.0129576088
## Attendance_Percentage         -0.01161114     0.0007960933
## Previous_Year_Score            1.00000000    -0.0172520949
## Final_Percentage              -0.01725209     1.0000000000
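
In a simple linear regression, R² equals the squared Pearson correlation between predictor and response. This is consistent with the output above: (-0.01295761)² ≈ 0.000168, the R² of model1. The sketch below checks the identity on simulated data and also shows that `cor.test()` returns the same p-value as the slope's t-test (the vectors `hours` and `final` are assumptions made for illustration, not the real CSV):

```r
# Simulated stand-in for two columns of the student data; values are random
set.seed(7)
hours <- runif(150, 0.5, 6)
final <- rnorm(150, mean = 67, sd = 11)  # pure noise, no signal

r   <- cor(hours, final)
fit <- lm(final ~ hours)

# R-squared of the simple regression vs the squared correlation
print(c(r_squared = summary(fit)$r.squared, cor_squared = r^2))

# Significance test for the correlation; in a simple regression its p-value
# coincides with the p-value of the slope
ct      <- cor.test(hours, final)
p_slope <- coef(summary(fit))["hours", "Pr(>|t|)"]
print(c(cor_test = ct$p.value, slope = p_slope))
```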
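
7. Conclusions

Study hours show no detectable effect on the final percentage: the estimated slope is slightly negative (-0.089) and not statistically significant (p = 0.36). Adding attendance and the previous-year score barely changes the fit: R² rises from 0.00017 to 0.00046 and SSE falls from 600834.3 to 600660.1, but the adjusted R² is negative for both models and the overall F-test of model2 is not significant (p = 0.51). Neither model explains a meaningful share of the variance in Final_Percentage, so model2 is not better than model1 in any practical sense. The pairwise correlations of the three predictors with Final_Percentage are all close to zero, which points the same way.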