Εργασία 5 - Γραμμική Παλινδρόμιση

Περιγραφή Δεδομένων
Γραμμική Παλινδρόμηση

Περιγραφή Δεδομένων

Το επιλεγμένο σύνολο δεδομένων με 1000 εγγραφές, περιλαμβάνει έναν συνδυασμό ποσοτικών και ποιοτικών μεταβλητών που αφορούν τις καθημερινές συνήθειες μαθητών και την τελική τους επίδοση στις εξετάσεις. Κύριος στόχος της ανάλυσης είναι ο προσδιορισμός των παραγόντων που παρουσιάζουν την ισχυρότερη συσχέτιση με τη βαθμολογία των μαθητών.

Γιατί επιλέξαμε το συγκεκριμένο dataset;

Το συγκεκριμένο dataset επιλέχθηκε για να κατανοήσουμε τους παράγοντες που οδηγούν σε υψηλότερες βαθμολογίες στις εξετάσεις. Στόχος μας είναι να δείξουμε ότι οι καθημερινές συνήθειες επηρεάζουν άμεσα την επίδοση και ότι η επιτυχία δεν εξαρτάται αποκλειστικά από το έμφυτο ταλέντο. Μέσω της επιχειρηματικής αναλυτικής, θέλουμε να προσφέρουμε στους φοιτητές ουσιαστική υποστήριξη για την ακαδημαϊκή τους πορεία.

Τι ερωτήματα μπορούν να απαντηθούν;

Θα βρόυμε τις μεταβλητές που έχουν σημαντική επίδραση στη βαθμολογία. Πως και πόσο επιδρούν η διατροφή, ο ύπνος, η χρήση των social media, η γυμναστική στην επίδοση μας; Μπορούμε να εντοπίσουμε συνήθειες που θα βελτιώσουν σημαντικά την ακαδημαϊκή πορεία των φοιτητών;

kable(metadata, caption = "Περιγραφή Μεταβλητών")

Περιγραφή Μεταβλητών
Μεταβλητή	Τύπος	Εύρος_Τιμών	Περιγραφή
student_id	character	-	Μοναδικό αναγνωριστικό μαθητή
age	numeric	17 - 24	Ηλικία φοιτητή
gender	character	-	Φύλο (Male/Female/Other)
study_hours_per_day	numeric	0 - 8.3	Ώρες μελέτης ανά ημέρα
social_media_hours	numeric	0 - 7.2	Χρήση Social Media καθημερινά (ώρες)
netflix_hours	numeric	0 - 5.4	Χρήση Streaming/Netflix (ώρες)
part_time_job	character	-	Ύπαρξη μερικής απασχόλησης (Yes/No)
attendance_percentage	numeric	56 - 100	Ποσοστό παρουσιών (0-100%)
sleep_hours	numeric	3.2 - 10	Ώρες ύπνου καθημερινά
diet_quality	character	-	Ποιότητα διατροφής (Poor/Fair/Good)
exercise_frequency	numeric	0 - 6	Συχνότητα άσκησης (ημέρες/βδομάδα)
parental_education_level	character	-	Επίπεδο μόρφωσης γονέων
internet_quality	character	-	Ποιότητα σύνδεσης Internet
mental_health_rating	numeric	1 - 10	Δείκτης ψυχικής υγείας (1 = Χαμηλή - 10 = Καλή
extracurricular_participation	character	-	Συμμετοχή σε εξωδιδακτικές δραστηριότητες (Yes/No)
exam_score	numeric	18.4 - 100	Βαθμολογία εξέτασης (0,100)

Περιγραφικά στατιστικά του Dataset:

# Read CSV file
studentTable <- read_csv("student_performance.csv" , show_col_types = FALSE)
summary(studentTable)

##   student_id             age           gender          study_hours_per_day
##  Length:1000        Min.   :17.00   Length:1000        Min.   :0.00       
##  Class :character   1st Qu.:18.75   Class :character   1st Qu.:2.60       
##  Mode  :character   Median :20.00   Mode  :character   Median :3.50       
##                     Mean   :20.50                      Mean   :3.55       
##                     3rd Qu.:23.00                      3rd Qu.:4.50       
##                     Max.   :24.00                      Max.   :8.30       
##  social_media_hours netflix_hours   part_time_job      attendance_percentage
##  Min.   :0.000      Min.   :0.000   Length:1000        Min.   : 56.00       
##  1st Qu.:1.700      1st Qu.:1.000   Class :character   1st Qu.: 78.00       
##  Median :2.500      Median :1.800   Mode  :character   Median : 84.40       
##  Mean   :2.506      Mean   :1.820                      Mean   : 84.13       
##  3rd Qu.:3.300      3rd Qu.:2.525                      3rd Qu.: 91.03       
##  Max.   :7.200      Max.   :5.400                      Max.   :100.00       
##   sleep_hours    diet_quality       exercise_frequency parental_education_level
##  Min.   : 3.20   Length:1000        Min.   :0.000      Length:1000             
##  1st Qu.: 5.60   Class :character   1st Qu.:1.000      Class :character        
##  Median : 6.50   Mode  :character   Median :3.000      Mode  :character        
##  Mean   : 6.47                      Mean   :3.042                              
##  3rd Qu.: 7.30                      3rd Qu.:5.000                              
##  Max.   :10.00                      Max.   :6.000                              
##  internet_quality   mental_health_rating extracurricular_participation
##  Length:1000        Min.   : 1.000       Length:1000                  
##  Class :character   1st Qu.: 3.000       Class :character             
##  Mode  :character   Median : 5.000       Mode  :character             
##                     Mean   : 5.438                                    
##                     3rd Qu.: 8.000                                    
##                     Max.   :10.000                                    
##    exam_score    
##  Min.   : 18.40  
##  1st Qu.: 58.48  
##  Median : 70.50  
##  Mean   : 69.60  
##  3rd Qu.: 81.33  
##  Max.   :100.00

Συσχέτιση μεταξύ μεταβλητών (Correlation Plot)

Θα βρούμε ποιες μεταβλητές έχουν ισχυρή συσχέτιση μεταξύ τους.

#Βήμα 1.  Πίνακας Συσχετίσεων όλων των numerical μεταβλητών
numeric_data <- studentTable %>% select(where(is.numeric)) # Select only numeric columns
student_numeric_data <- cor(numeric_data, use = "complete.obs")
corrplot(student_numeric_data, method = "color", addCoef.col = "black", type = "upper", 
         tl.col = "black", tl.srt = 45)

cor(studentTable$study_hours_per_day,studentTable$exam_score)

## [1] 0.8254185

cor(studentTable$social_media_hours,studentTable$exam_score)

## [1] -0.1667329

cor(studentTable$mental_health_rating,studentTable$exam_score)

## [1] 0.3215229

Σχολιασμός

Παρατηρούμε πως οι ώρες που διαβάζει ένας μαθητής καθημερινά καθώς και η ψυχική υγεία έχουν ισχυρή επίδραση στο τελικό βαθμό.
Όσο αυξάνεται η χρήση των social media καθημερινά, τότε μειώνεται η απόδοση του μαθητή. *

Scatterplot

# Scatterplot
scatter <- ggplot(data = studentTable, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(color = "firebrick", size = 3, alpha = 0.7) +
  labs(
    title = "Exam score based on Study hours per day",
    color = "Exam Score"
  ) +
  theme_classic()

scatter

Σχολιασμός Παρατηρούμε πως όσο περισσότερο διαβάζει κάποιος μαθητής, γράφει καλύτερα στις εξετάσεις στις περισσότερες περιπτώσεις.

Boxplot

# 1. Δημιουργεί 2 groups με βάση τον βαθμό της ψυχικής υγείας.
boxplot_data <- studentTable %>%
  mutate(mental_health_group = ifelse(mental_health_rating < 5, 
                                      "Below 5 (Struggling)", 
                                      "5 and Above (Stable)"))

# 2. Create the boxplot
boxplot_mental_health <- ggplot(boxplot_data, aes(x = mental_health_group, y = exam_score, fill = mental_health_group)) +
  geom_boxplot(alpha = 0.7, width = 0.6) +
  geom_jitter(width = 0.2, alpha = 0.3) + # Adds individual student dots to see the sample size
  scale_fill_manual(values = c("Λιγότερο απο 5 (Δυσκολία)" = "#f8766d", 
                               "5 και άνω (Σταθερότητα)" = "#00bfc4")) +
  labs(
    title = "Επίδοση στις Εξετάσεις με βάση την Ψυχική Υγεία",
    subtitle = "Σύγκριση Βαθμολογιών με βάση την ψυχική υγεία 1-10 rating",
    x = "Ψυχική Υγεία",
    y = "Βαθμολογία (%)",
    fill = "Category"
  ) +
  theme_minimal() +
  theme(legend.position = "none") # Legend is redundant here since x-axis is labeled

boxplot_mental_health

Σχολιασμός Παρατηρούμε πως οι μαθητές με ψυχική υγεία που είναι πιο υγιείς (5+) σημειώνουν μέσο όρο στην βαθμολογία των εξετάσεων (~75) ενώ οι μαθητές που δυσκολεύονται ψυχικά σημειώνουν ~62.

Ιστόγραμμα Συχνοτήτων

#Step 4 A histogram for the exam_scores

score_hist <- ggplot(data = studentTable, aes(x = exam_score)) +
  geom_histogram(binwidth = 5, fill = "#3498DB", color = "white") +
  stat_bin(binwidth = 5, geom = "text", aes(label = after_stat(count)), 
           vjust = -0.5, size = 3.5) + 
  labs(
    title = "Κατανομή Βαθμολογίας Μαθητών",
    x = "Βαθμολογία",
    y = "Πλήθος Μαθητών"
  ) +
  theme_minimal()

score_hist

Σχολιασμός Παρατηρούμε πως περίπου το 30% των μαθητων (300+ μαθητές) έγραψαν μεταξύ 62 και 75 στις εξετάσεις.

Γραμμική Παλινδρόμηση

Μοντέλο Απλής Παλινδρόμησης (Study Hours)

# Δημιουργία μοντέλου
lm_1 <- lm(exam_score ~ study_hours_per_day, data = studentTable)

# Οπτικοποίηση
ggplot(data = studentTable, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.5) +
  geom_abline(intercept = coef(lm_1)[1], 
              slope = coef(lm_1)[2], 
              colour = "red", size = 1) +
  labs(
    title = "Exam score based on Study hours per day",
    subtitle = paste("R-squared:", round(summary(lm_1)$r.squared, 3)),
    x = "Study Hours per Day",
    y = "Exam Score"
  ) +
  theme_classic()

summary(lm_1)

## 
## Call:
## lm(formula = exam_score ~ study_hours_per_day, data = studentTable)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.979  -6.626   0.236   6.537  34.319 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          35.9102     0.7893   45.50   <2e-16 ***
## study_hours_per_day   9.4903     0.2055   46.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.539 on 998 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.681 
## F-statistic:  2134 on 1 and 998 DF,  p-value: < 2.2e-16

# Υπολογισμός RMSE
actual_values <- studentTable$exam_score
predicted_values <- predict(lm_1)
rmse <- sqrt(mean((actual_values - predicted_values)^2))

cat("Το μέσο σφάλμα πρόβλεψης (RMSE) στο απλό μοντέλο είναι:", round(rmse, 2), "μονάδες.")

## Το μέσο σφάλμα πρόβλεψης (RMSE) στο απλό μοντέλο είναι: 9.53 μονάδες.

Σχολιασμός

Αρχικά βάλαμε ώς ανεξάρτητη μεταβλητή το study_hours_per_day διότι φαίνεται να έχει την ισχυρότερη επίδραση στο βαθμό των εξετάσεων. το \(R^2\) είναι 0.68 που είναι ένα ικανοποιητικό ποσοστό για μια μεταβλητή. *Εξηγεί την διακύμανση της βαθμολογίας κατά 68%.

Μοντέλο Πολλαπλής Παλινδρόμησης (Study Hours + Social Media Hours)

Στη συνέχεια προσθέτουμε τη μεταβλητή social_media_hours, η οποία καταγράφει την καθημερινή χρήση μέσων κοινωνικής δικτύωσης.

# Δημιουργία μοντέλου
lm_2 <- lm(exam_score ~ study_hours_per_day + social_media_hours, data = studentTable)

# Υπολογισμός Adjusted Intercept για τον μέσο χρήστη social media
avg_social_media <- mean(studentTable$social_media_hours, na.rm = TRUE)
adj_intercept <- coef(lm_2)[1] + (coef(lm_2)[3] * avg_social_media)

# Οπτικοποίηση
ggplot(data = studentTable, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.5) +
  geom_abline(intercept = adj_intercept, 
              slope = coef(lm_2)[2], 
              colour = "red", size = 1) +
  labs(
    title = "Exam score vs Study Hours",
    subtitle = paste("Προσαρμοσμένο για μέσο χρόνο Social Media:", round(avg_social_media, 1), "ώρες"),
    x = "Study Hours per Day",
    y = "Exam Score"
  ) +
  theme_classic()

summary(lm_2)

## 
## Call:
## lm(formula = exam_score ~ study_hours_per_day + social_media_hours, 
##     data = studentTable)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -28.1953  -6.5285   0.0703   6.2253  27.7181 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          42.3828     0.9557   44.35   <2e-16 ***
## study_hours_per_day   9.5331     0.1944   49.03   <2e-16 ***
## social_media_hours   -2.6440     0.2436  -10.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.025 on 997 degrees of freedom
## Multiple R-squared:  0.715,  Adjusted R-squared:  0.7144 
## F-statistic:  1251 on 2 and 997 DF,  p-value: < 2.2e-16

Σχολιασμός

adjusted_intercept = β0 + (-2.64 * mean(social_media_hours))

Παρατηρούμε πως το Intercept αυξήθηκε στο 42.38. Αυτό συμβαίνει διότι η τιμή αυτή αντιπροσωπεύει έναν φοιτητή με μηδενική χρήση Social Media.

Αν ο μέσος μαθητής χρησιμοποιεί τα social media για 2.5 ώρες καθημερινά έχουμε: \(42.38 + (-2.64 \times 2.5) \approx 35.7\), η τιμή σχεδόν ταυτόσημη με το προηγούμενο μοντέλο.

Backward Elimination Τεχνική

Στην συνέχεια θα βρούμε όλες τις μεταβλητές που συμβάλουν σημαντικά στην διακύμανση.

# Ξεκινάμε με όλες τις μεταβλητές εκτός απο το student ID
full_model <- lm(exam_score ~ . - student_id, data = studentTable)

# Backward Elimination (αυτόματα αφαιρεί τις μεταβλητές που δεν είναι σημαντικές)
best_model <- step(full_model, direction = "backward")

## Start:  AIC=3371.06
## exam_score ~ (student_id + age + gender + study_hours_per_day + 
##     social_media_hours + netflix_hours + part_time_job + attendance_percentage + 
##     sleep_hours + diet_quality + exercise_frequency + parental_education_level + 
##     internet_quality + mental_health_rating + extracurricular_participation) - 
##     student_id
## 
##                                 Df Sum of Sq    RSS    AIC
## - parental_education_level       3        44  28012 3366.6
## - gender                         2        26  27994 3368.0
## - internet_quality               2        50  28018 3368.8
## - extracurricular_participation  1         0  27968 3369.1
## - age                            1         1  27969 3369.1
## - part_time_job                  1         7  27975 3369.3
## - diet_quality                   2        94  28062 3370.4
## <none>                                        27968 3371.1
## - attendance_percentage          1      1770  29738 3430.4
## - sleep_hours                    1      5906  33874 3560.7
## - netflix_hours                  1      5968  33936 3562.5
## - exercise_frequency             1      8510  36478 3634.7
## - social_media_hours             1      9205  37173 3653.6
## - mental_health_rating           1     29934  57902 4096.7
## - study_hours_per_day            1    195185 223153 5445.9
## 
## Step:  AIC=3366.64
## exam_score ~ age + gender + study_hours_per_day + social_media_hours + 
##     netflix_hours + part_time_job + attendance_percentage + sleep_hours + 
##     diet_quality + exercise_frequency + internet_quality + mental_health_rating + 
##     extracurricular_participation
## 
##                                 Df Sum of Sq    RSS    AIC
## - gender                         2        23  28035 3363.5
## - internet_quality               2        49  28061 3364.4
## - extracurricular_participation  1         0  28012 3364.6
## - age                            1         1  28013 3364.7
## - part_time_job                  1         7  28019 3364.9
## - diet_quality                   2        93  28106 3366.0
## <none>                                        28012 3366.6
## - attendance_percentage          1      1784  29796 3426.4
## - sleep_hours                    1      5934  33946 3556.8
## - netflix_hours                  1      5948  33960 3557.2
## - exercise_frequency             1      8563  36576 3631.4
## - social_media_hours             1      9238  37250 3649.7
## - mental_health_rating           1     30494  58506 4101.1
## - study_hours_per_day            1    195183 223196 5440.0
## 
## Step:  AIC=3363.46
## exam_score ~ age + study_hours_per_day + social_media_hours + 
##     netflix_hours + part_time_job + attendance_percentage + sleep_hours + 
##     diet_quality + exercise_frequency + internet_quality + mental_health_rating + 
##     extracurricular_participation
## 
##                                 Df Sum of Sq    RSS    AIC
## - internet_quality               2        48  28083 3361.2
## - extracurricular_participation  1         0  28035 3361.5
## - age                            1         1  28036 3361.5
## - part_time_job                  1         8  28043 3361.7
## - diet_quality                   2        94  28129 3362.8
## <none>                                        28035 3363.5
## - attendance_percentage          1      1779  29814 3423.0
## - netflix_hours                  1      5927  33963 3553.3
## - sleep_hours                    1      5985  34020 3555.0
## - exercise_frequency             1      8625  36660 3629.7
## - social_media_hours             1      9229  37264 3646.0
## - mental_health_rating           1     30498  58533 4097.6
## - study_hours_per_day            1    195307 223342 5436.7
## 
## Step:  AIC=3361.17
## exam_score ~ age + study_hours_per_day + social_media_hours + 
##     netflix_hours + part_time_job + attendance_percentage + sleep_hours + 
##     diet_quality + exercise_frequency + mental_health_rating + 
##     extracurricular_participation
## 
##                                 Df Sum of Sq    RSS    AIC
## - extracurricular_participation  1         0  28083 3359.2
## - age                            1         1  28084 3359.2
## - part_time_job                  1         9  28093 3359.5
## - diet_quality                   2        96  28180 3360.6
## <none>                                        28083 3361.2
## - attendance_percentage          1      1795  29878 3421.1
## - netflix_hours                  1      5968  34051 3551.9
## - sleep_hours                    1      5988  34071 3552.5
## - exercise_frequency             1      8670  36753 3628.2
## - social_media_hours             1      9313  37396 3645.6
## - mental_health_rating           1     30728  58811 4098.3
## - study_hours_per_day            1    195566 223649 5434.1
## 
## Step:  AIC=3359.17
## exam_score ~ age + study_hours_per_day + social_media_hours + 
##     netflix_hours + part_time_job + attendance_percentage + sleep_hours + 
##     diet_quality + exercise_frequency + mental_health_rating
## 
##                         Df Sum of Sq    RSS    AIC
## - age                    1         1  28084 3357.2
## - part_time_job          1         9  28093 3357.5
## - diet_quality           2        96  28180 3358.6
## <none>                                28083 3359.2
## - attendance_percentage  1      1795  29879 3419.1
## - netflix_hours          1      5968  34052 3549.9
## - sleep_hours            1      5993  34076 3550.6
## - exercise_frequency     1      8671  36754 3626.2
## - social_media_hours     1      9316  37400 3643.7
## - mental_health_rating   1     30730  58813 4096.4
## - study_hours_per_day    1    195567 223650 5432.1
## 
## Step:  AIC=3357.21
## exam_score ~ study_hours_per_day + social_media_hours + netflix_hours + 
##     part_time_job + attendance_percentage + sleep_hours + diet_quality + 
##     exercise_frequency + mental_health_rating
## 
##                         Df Sum of Sq    RSS    AIC
## - part_time_job          1         9  28094 3355.5
## - diet_quality           2        96  28181 3356.6
## <none>                                28084 3357.2
## - attendance_percentage  1      1799  29883 3417.3
## - netflix_hours          1      5968  34053 3547.9
## - sleep_hours            1      5995  34080 3548.7
## - exercise_frequency     1      8672  36756 3624.3
## - social_media_hours     1      9315  37400 3641.7
## - mental_health_rating   1     30807  58891 4095.7
## - study_hours_per_day    1    195567 223651 5430.1
## 
## Step:  AIC=3355.55
## exam_score ~ study_hours_per_day + social_media_hours + netflix_hours + 
##     attendance_percentage + sleep_hours + diet_quality + exercise_frequency + 
##     mental_health_rating
## 
##                         Df Sum of Sq    RSS    AIC
## - diet_quality           2        95  28189 3354.9
## <none>                                28094 3355.5
## - attendance_percentage  1      1792  29885 3415.4
## - netflix_hours          1      5965  34059 3546.1
## - sleep_hours            1      5997  34090 3547.0
## - exercise_frequency     1      8663  36757 3622.3
## - social_media_hours     1      9307  37401 3639.7
## - mental_health_rating   1     30824  58917 4094.1
## - study_hours_per_day    1    195659 223752 5428.5
## 
## Step:  AIC=3354.92
## exam_score ~ study_hours_per_day + social_media_hours + netflix_hours + 
##     attendance_percentage + sleep_hours + exercise_frequency + 
##     mental_health_rating
## 
##                         Df Sum of Sq    RSS    AIC
## <none>                                28189 3354.9
## - attendance_percentage  1      1843  30032 3416.3
## - netflix_hours          1      5980  34169 3545.3
## - sleep_hours            1      6027  34216 3546.7
## - exercise_frequency     1      8616  36805 3619.6
## - social_media_hours     1      9388  37577 3640.4
## - mental_health_rating   1     30752  58941 4090.5
## - study_hours_per_day    1    196883 225072 5430.4

# summary(full_model)
summary(best_model)

## 
## Call:
## lm(formula = exam_score ~ study_hours_per_day + social_media_hours + 
##     netflix_hours + attendance_percentage + sleep_hours + exercise_frequency + 
##     mental_health_rating, data = studentTable)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9509  -3.3953  -0.0283   3.6680  15.9059 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.15722    1.89252   3.253  0.00118 ** 
## study_hours_per_day    9.57456    0.11503  83.238  < 2e-16 ***
## social_media_hours    -2.61978    0.14413 -18.177  < 2e-16 ***
## netflix_hours         -2.27708    0.15697 -14.507  < 2e-16 ***
## attendance_percentage  0.14473    0.01797   8.054 2.28e-15 ***
## sleep_hours            2.00462    0.13764  14.564  < 2e-16 ***
## exercise_frequency     1.45187    0.08338  17.413  < 2e-16 ***
## mental_health_rating   1.94891    0.05924  32.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.331 on 992 degrees of freedom
## Multiple R-squared:  0.9011, Adjusted R-squared:  0.9004 
## F-statistic:  1291 on 7 and 992 DF,  p-value: < 2.2e-16

Στην συνέχεια θα φτιάξουμε έναν υποθετικό μέσο μαθητή

# 1. Φτιάχνουμε έναν υποθετικό "Μέσο Μαθητή"
plot_data <- data.frame(
  study_hours_per_day = seq(min(studentTable$study_hours_per_day), 
                            max(studentTable$study_hours_per_day), length.out = 100),
  social_media_hours = mean(studentTable$social_media_hours),
  netflix_hours = mean(studentTable$netflix_hours),
  exercise_frequency = mean(studentTable$exercise_frequency),
  sleep_hours = mean(studentTable$sleep_hours),
  attendance_percentage = mean(studentTable$attendance_percentage),
  mental_health_rating = mean(studentTable$mental_health_rating)
)

plot_data$predicted_y <- predict(best_model, newdata = plot_data)

ggplot(studentTable, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(alpha = 0.3) +
  geom_line(data = plot_data, aes(y = predicted_y), color = "red", size = 1) +
  labs(title = "Improved Linear Model",
       subtitle = "Calculated via Backward Elimination (All significant variables included)")

summary(best_model)

## 
## Call:
## lm(formula = exam_score ~ study_hours_per_day + social_media_hours + 
##     netflix_hours + attendance_percentage + sleep_hours + exercise_frequency + 
##     mental_health_rating, data = studentTable)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21.9509  -3.3953  -0.0283   3.6680  15.9059 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            6.15722    1.89252   3.253  0.00118 ** 
## study_hours_per_day    9.57456    0.11503  83.238  < 2e-16 ***
## social_media_hours    -2.61978    0.14413 -18.177  < 2e-16 ***
## netflix_hours         -2.27708    0.15697 -14.507  < 2e-16 ***
## attendance_percentage  0.14473    0.01797   8.054 2.28e-15 ***
## sleep_hours            2.00462    0.13764  14.564  < 2e-16 ***
## exercise_frequency     1.45187    0.08338  17.413  < 2e-16 ***
## mental_health_rating   1.94891    0.05924  32.897  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.331 on 992 degrees of freedom
## Multiple R-squared:  0.9011, Adjusted R-squared:  0.9004 
## F-statistic:  1291 on 7 and 992 DF,  p-value: < 2.2e-16

Σχολιασμός

Το μοντέλο έχει βελτιωμένο \(R^2\) στο ~90% και μικρότερο Residual Std Error στο 5.3!

Train and Test Split Data

Στην συνέχεια θα χωρίσουμε το δείγμα σε train και test για να δούμε το RMSE και πόσο έξω πέφτουμε.

# 1. Διαχωρισμός 80/20 (Train/Test Split)
set.seed(123) # Για να έχουμε τα ίδια αποτελέσματα κάθε φορά
n <- nrow(studentTable)
train_indices <- sample(1:n, size = 0.8 * n)

train_data <- studentTable[train_indices, ]
test_data  <- studentTable[-train_indices, ]

# 2. Εκπαίδευση του βελτιστοποιημένου μοντέλου στο Train Set
final_model <- lm(exam_score ~ . - student_id, data = train_data)
best_model_train <- step(final_model, direction = "backward", trace = 0)

# 3. Πρόβλεψη πάνω στο Test Set (τα "άγνωστα" δεδομένα)
test_predictions <- predict(best_model_train, newdata = test_data)

# 4. Δημιουργία Dataframe σύγκρισης για τους φοιτητές του Test Set
validation_results <- data.frame(
  student_id = test_data$student_id,
  Actual = test_data$exam_score,
  Predicted = test_predictions,
  Residual = test_data$exam_score - test_predictions
)


# 6. Γράφημα Actual vs Predicted
ggplot(validation_results, aes(x = Actual, y = Predicted)) +
  geom_point(color = "steelblue", alpha = 0.6, size = 3) +
  # 45 μοίρες γραμμή για την *τέλεια πρόβλεψη*
  geom_abline(intercept = 0, slope = 1, color = "red", linetype = "dashed", size = 1) +
  labs(title = "Actual vs Predicted Exam Scores (Test Set)",
       subtitle = "Όσο πιο κοντά στην κόκκινη γραμμή, τόσο πιο ακριβές το μοντέλο",
       x = "Actual Value",
       y = "Predicted Value") +
  theme_minimal()

# 6. Υπολογίζουμε το σφάλμα (RMSE - Root Mean Squared Error)
actual_values <- test_data$exam_score
rmse <- sqrt(mean((actual_values - test_predictions)^2))

cat("Το μέσο σφάλμα πρόβλεψης (RMSE) είναι:", round(rmse, 2), "μονάδες.")

## Το μέσο σφάλμα πρόβλεψης (RMSE) είναι: 5.4 μονάδες.

Στην συνέχεια κάνουμε export τα test data με τις τιμές που προβλέψαμε έναντι των πραγματικών! Παρατηρούμε πως το σφάλμα μας είναι όντως +- 5 μονάδες.

# 5. Export σε CSV
output_csv <- write_csv(validation_results, "test_predictions_comparison.csv")

#View(output_csv)

# Εμφάνιση του dataframe των προβλέψεων που φτιάξαμε
datatable(output_csv, 
          caption = "Πίνακας Πραγματικών vs Προβλεπόμενων Τιμών",
          filter = 'top', 
          options = list(pageLength = 10))

Συμπεράσματα

Ξεκινήσαμε την γραμμική παλινδρόμηση με την ανεξάρτητη μεταβλητή study_hours και είχαμε RMSE ~9.53, αλλά με την χρήση του Backward Elimination μπορέσαμε να βρούμε τις μεταβλητές που εξηγούν την διακύμανση του βαθμού σημαντικά και φτάσαμε \(R^2\) ~90% και χαμηλόσαμε το RMSE στο 5.3!

Παρατηρήσαμε πως η ομοσκεδαστικότητα ισχύει στο datasheet που επιλέξαμε, διότι όσο αυξάνονται οι ανεξάρτητες μεταβλητές, τότε η διακύμανση παραμένει ίδιο οπότε μπορέσαμε να εφαρμόσουμε την Γραμμική παλινδρόμηση με ευκολία.

sessionInfo()

## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## time zone: Europe/Bucharest
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.34.0        kableExtra_1.4.0 knitr_1.51       corrplot_0.95   
##  [5] lubridate_1.9.5  forcats_1.0.1    stringr_1.6.0    dplyr_1.2.0     
##  [9] purrr_1.2.1      readr_2.2.0      tidyr_1.3.2      tibble_3.3.1    
## [13] ggplot2_4.0.2    tidyverse_2.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10        generics_0.1.4     xml2_1.5.2         stringi_1.8.7     
##  [5] hms_1.1.4          digest_0.6.39      magrittr_2.0.4     evaluate_1.0.5    
##  [9] grid_4.5.2         timechange_0.4.0   RColorBrewer_1.1-3 fastmap_1.2.0     
## [13] jsonlite_2.0.0     crosstalk_1.2.2    viridisLite_0.4.3  scales_1.4.0      
## [17] textshaping_1.0.4  jquerylib_0.1.4    cli_3.6.5          crayon_1.5.3      
## [21] rlang_1.1.7        bit64_4.6.0-1      withr_3.0.2        cachem_1.1.0      
## [25] yaml_2.3.12        otel_0.2.0         parallel_4.5.2     tools_4.5.2       
## [29] tzdb_0.5.0         vctrs_0.7.1        R6_2.6.1           lifecycle_1.0.5   
## [33] bit_4.6.0          htmlwidgets_1.6.4  vroom_1.7.0        pkgconfig_2.0.3   
## [37] pillar_1.11.1      bslib_0.10.0       gtable_0.3.6       glue_1.8.0        
## [41] systemfonts_1.3.2  xfun_0.56          tidyselect_1.2.1   rstudioapi_0.18.0 
## [45] farver_2.1.2       htmltools_0.5.9    labeling_0.4.3     rmarkdown_2.31    
## [49] svglite_2.2.2      compiler_4.5.2     S7_0.2.1

Εργασία 5 - Γραμμική Παλινδρόμιση

Kostas Soantzoglou

2026-04-01

Περιγραφή Δεδομένων

Γιατί επιλέξαμε το συγκεκριμένο dataset;

Τι ερωτήματα μπορούν να απαντηθούν;

Συσχέτιση μεταξύ μεταβλητών (Correlation Plot)

Scatterplot

Boxplot

Ιστόγραμμα Συχνοτήτων

Γραμμική Παλινδρόμηση

Μοντέλο Απλής Παλινδρόμησης (Study Hours)

Σχολιασμός

Backward Elimination Τεχνική

Στην συνέχεια θα φτιάξουμε έναν υποθετικό μέσο μαθητή

Σχολιασμός

Train and Test Split Data

Συμπεράσματα

Εργασία 5 - Γραμμική Παλινδρόμιση

Kostas Soantzoglou

2026-04-01

Περιγραφή Δεδομένων

Γιατί επιλέξαμε το συγκεκριμένο dataset;

Τι ερωτήματα μπορούν να απαντηθούν;

Συσχέτιση μεταξύ μεταβλητών (Correlation Plot)

Scatterplot

Boxplot

Ιστόγραμμα Συχνοτήτων

Γραμμική Παλινδρόμηση

Μοντέλο Απλής Παλινδρόμησης (Study Hours)

Σχολιασμός

Μοντέλο Πολλαπλής Παλινδρόμησης (Study Hours + Social Media Hours)

Σχολιασμός

Σύγκριση Επιπέδων χρήσης Social Media

Σχολιασμός

Backward Elimination Τεχνική

Στην συνέχεια θα φτιάξουμε έναν υποθετικό μέσο μαθητή

Σχολιασμός

Train and Test Split Data

Συμπεράσματα