Student Habits Analysis. GR7

Ozan Patlar 23023069 Elif Şimşek 22023027 Yasin Özcan 21023022

PART A: Introducing the Dataset

Data Source and Explanation

For this project, we are analyzing the “Student Habits and Performance” dataset. This dataset contains records for 1000 students and includes variables that capture demographics, academic habits, lifestyle factors, and academic performance. This dataset is suitable for linear regression because it contains a continuous numerical target variable, exam_score, allowing us to predict specific performance values based on students’ habits. The dataset is also suitable for Logistic Regression since the target variable exam_status is binary.

Descriptive Statistics and Variable Transformations

First, we load the required libraries (including the new visualization packages), import our dataset, and examine its general structure. We also prepare our variables for statistical testing and machine learning.

# Load necessary libraries
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(readr)
library(forcats)
library(rstatix)

Attaching package: 'rstatix'
The following object is masked from 'package:stats':

    filter
library(ggpubr)
library(RColorBrewer)
library(tidymodels)
── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
✔ broom        1.0.12     ✔ rsample      1.3.2 
✔ dials        1.4.2      ✔ tailor       0.1.0 
✔ infer        1.1.0      ✔ tune         2.0.1 
✔ modeldata    1.5.1      ✔ workflows    1.3.0 
✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
✔ purrr        1.2.1      ✔ yardstick    1.3.2 
✔ recipes      1.3.1      
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ infer::chisq_test() masks rstatix::chisq_test()
✖ purrr::discard()    masks scales::discard()
✖ rstatix::filter()   masks dplyr::filter(), stats::filter()
✖ dials::get_n()      masks rstatix::get_n()
✖ dplyr::lag()        masks stats::lag()
✖ infer::prop_test()  masks rstatix::prop_test()
✖ yardstick::spec()   masks readr::spec()
✖ recipes::step()     masks stats::step()
✖ infer::t_test()     masks rstatix::t_test()
# Load the dataset 
df <- read_csv("/Users/ozanpatlar/Downloads/student_habits_performance.csv")
Rows: 1000 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): student_id, gender, part_time_job, diet_quality, parental_education...
dbl (9): age, study_hours_per_day, social_media_hours, netflix_hours, attend...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Basic cleaning and transformations
df_clean <- df %>%
  mutate(
    gender = as.factor(gender),
    part_time_job = as.factor(part_time_job),
    diet_quality = factor(diet_quality, levels = c("Poor", "Fair", "Good")),
    extracurricular_participation = as.factor(extracurricular_participation),
    internet_quality         = factor(internet_quality, levels = c("Poor", "Average", "Good")),
    parental_education_level = factor(parental_education_level,
                                      levels = c("High School", "Bachelor", "Master")),
    # Creating target variable for Machine Learning classification
    exam_status = ifelse(exam_score >= 50, "Pass", "Fail"),
    exam_status = factor(exam_status, levels = c("Fail", "Pass"))
  )

#Taking a look at the dataset:
summary(df_clean)
  student_id             age           gender    study_hours_per_day
 Length:1000        Min.   :17.00   Female:481   Min.   :0.00       
 Class :character   1st Qu.:18.75   Male  :477   1st Qu.:2.60       
 Mode  :character   Median :20.00   Other : 42   Median :3.50       
                    Mean   :20.50                Mean   :3.55       
                    3rd Qu.:23.00                3rd Qu.:4.50       
                    Max.   :24.00                Max.   :8.30       
 social_media_hours netflix_hours   part_time_job attendance_percentage
 Min.   :0.000      Min.   :0.000   No :785       Min.   : 56.00       
 1st Qu.:1.700      1st Qu.:1.000   Yes:215       1st Qu.: 78.00       
 Median :2.500      Median :1.800                 Median : 84.40       
 Mean   :2.506      Mean   :1.820                 Mean   : 84.13       
 3rd Qu.:3.300      3rd Qu.:2.525                 3rd Qu.: 91.03       
 Max.   :7.200      Max.   :5.400                 Max.   :100.00       
  sleep_hours    diet_quality exercise_frequency parental_education_level
 Min.   : 3.20   Poor:185     Min.   :0.000      High School:392         
 1st Qu.: 5.60   Fair:437     1st Qu.:1.000      Bachelor   :350         
 Median : 6.50   Good:378     Median :3.000      Master     :167         
 Mean   : 6.47                Mean   :3.042      NA's       : 91         
 3rd Qu.: 7.30                3rd Qu.:5.000                              
 Max.   :10.00                Max.   :6.000                              
 internet_quality mental_health_rating extracurricular_participation
 Poor   :162      Min.   : 1.000       No :682                      
 Average:391      1st Qu.: 3.000       Yes:318                      
 Good   :447      Median : 5.000                                    
                  Mean   : 5.438                                    
                  3rd Qu.: 8.000                                    
                  Max.   :10.000                                    
   exam_score     exam_status
 Min.   : 18.40   Fail:131   
 1st Qu.: 58.48   Pass:869   
 Median : 70.50              
 Mean   : 69.60              
 3rd Qu.: 81.33              
 Max.   :100.00              
str(df_clean)
tibble [1,000 × 17] (S3: tbl_df/tbl/data.frame)
 $ student_id                   : chr [1:1000] "S1000" "S1001" "S1002" "S1003" ...
 $ age                          : num [1:1000] 23 20 21 23 19 24 21 21 23 18 ...
 $ gender                       : Factor w/ 3 levels "Female","Male",..: 1 1 2 1 1 2 1 1 1 1 ...
 $ study_hours_per_day          : num [1:1000] 0 6.9 1.4 1 5 7.2 5.6 4.3 4.4 4.8 ...
 $ social_media_hours           : num [1:1000] 1.2 2.8 3.1 3.9 4.4 1.3 1.5 1 2.2 3.1 ...
 $ netflix_hours                : num [1:1000] 1.1 2.3 1.3 1 0.5 0 1.4 2 1.7 1.3 ...
 $ part_time_job                : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 2 2 1 1 ...
 $ attendance_percentage        : num [1:1000] 85 97.3 94.8 71 90.9 82.9 85.8 77.7 100 95.4 ...
 $ sleep_hours                  : num [1:1000] 8 4.6 8 9.2 4.9 7.4 6.5 4.6 7.1 7.5 ...
 $ diet_quality                 : Factor w/ 3 levels "Poor","Fair",..: 2 3 1 1 2 2 3 2 3 3 ...
 $ exercise_frequency           : num [1:1000] 6 6 1 4 3 1 2 0 3 5 ...
 $ parental_education_level     : Factor w/ 3 levels "High School",..: 3 1 1 3 3 3 3 2 2 2 ...
 $ internet_quality             : Factor w/ 3 levels "Poor","Average",..: 2 2 1 3 3 2 1 2 3 3 ...
 $ mental_health_rating         : num [1:1000] 8 8 1 1 1 4 4 8 1 10 ...
 $ extracurricular_participation: Factor w/ 2 levels "No","Yes": 2 1 1 2 1 1 1 1 1 2 ...
 $ exam_score                   : num [1:1000] 56.2 100 34.3 26.8 66.4 100 89.8 72.6 78.9 100 ...
 $ exam_status                  : Factor w/ 2 levels "Fail","Pass": 2 2 1 1 2 2 2 2 2 2 ...

PART B: Hypothesis Testing and Visualizations

Hypothesis 1: Study Hours vs. Exam Score

Does Study Hours significantly predict Exam Score?

Hypothesis: * H0: There is no linear relationship between study hours per day and exam score. * H1: There IS a significant positive linear relationship between study hours per day and exam score.

We are using a correlation test and simple lineer regression since two variables are continious.

# Data Prep
h1_data <- df_clean %>%
  select(student_id, study_hours_per_day, exam_score) %>%
  filter(!is.na(study_hours_per_day), !is.na(exam_score)) %>%
  arrange(desc(study_hours_per_day))

h1_data %>%
  select(study_hours_per_day, exam_score) %>%
  get_summary_stats(type = "common")
# A tibble: 2 × 10
  variable                n   min   max median   iqr  mean    sd    se    ci
  <fct>               <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 study_hours_per_day  1000   0     8.3    3.5   1.9  3.55  1.47 0.046 0.091
2 exam_score           1000  18.4 100     70.5  22.8 69.6  16.9  0.534 1.05 
# Summary & Correlation
cor_test_h1 <- h1_data %>% cor_test(study_hours_per_day, exam_score, method = "pearson")
print(cor_test_h1)
# A tibble: 1 × 8
  var1                var2      cor statistic        p conf.low conf.high method
  <chr>               <chr>   <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr> 
1 study_hours_per_day exam_s…  0.83      46.2 4.6e-250    0.805     0.844 Pears…
# Why: Both variables are continuous; we want to model how
#      one predicts the other and test the slope coefficient.
# Linear Regression Test
lm_h1 <- lm(exam_score ~ study_hours_per_day, data = h1_data)
summary(lm_h1)

Call:
lm(formula = exam_score ~ study_hours_per_day, data = h1_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.979  -6.626   0.236   6.537  34.319 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          35.9102     0.7893   45.50   <2e-16 ***
study_hours_per_day   9.4903     0.2055   46.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.539 on 998 degrees of freedom
Multiple R-squared:  0.6813,    Adjusted R-squared:  0.681 
F-statistic:  2134 on 1 and 998 DF,  p-value: < 2.2e-16
# Visualization
ggplot(h1_data, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(aes(color = exam_score), alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +
  scale_color_viridis_c(option = "C") +
  labs(title = "Study Hours vs Exam Score", subtitle = "Linear regression fit with 95% CI",
       x = "Study Hours per Day", y = "Exam Score", color = "Score") +
  theme_minimal()

Comment: We used Simple Linear Regression and a Pearson Correlation test. If the p-value is < 0.05, we reject H0. A positive slope indicates that more study hours directly predict higher exam scores. p<0.05 H0 reject


Hypothesis 2: Gender vs. Exam Score

Do male and female students differ in Exam Score?

Hypothesis: * H0: Mean exam scores are equal for male and female students. * H1: Mean exam scores differ between male and female students.

We are using a T-test since we are comparing means of one continious variable and one categorical variable between two independent groups.

# Summary Stats
df_clean %>% group_by(gender) %>% get_summary_stats(exam_score, type = "common")
# A tibble: 3 × 11
  gender variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>  <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female exam_score   481  18.4   100   70.7  23.6  69.7  16.9 0.771  1.51
2 Male   exam_score   477  23.1   100   70.2  22.2  69.4  17.2 0.785  1.54
3 Other  exam_score    42  43.9   100   69    19.5  70.6  13.8 2.12   4.29
#Why: Comparing means of one continuous variable (exam_score)
#      bet ween two independent groups (Male/Female). Assumes
#      approximate normality (large sample → CLT applies).
# T-test
# ── Normality check ─────────────────────────────────────────
df %>%
  group_by(gender) %>%
  shapiro_test(exam_score)
# A tibble: 3 × 4
  gender variable   statistic         p
  <chr>  <chr>          <dbl>     <dbl>
1 Female exam_score     0.986 0.000170 
2 Male   exam_score     0.985 0.0000846
3 Other  exam_score     0.975 0.490    
t_test_h2 <- df_clean %>% t_test(exam_score ~ gender, var.equal = FALSE)
print(t_test_h2)
# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1     0.339  955.   0.735 two.sided      0.373    -1.79     2.53
# Effect Size
df_clean %>% cohens_d(exam_score ~ gender)
# A tibble: 3 × 7
  .y.        group1 group2 effsize    n1    n2 magnitude 
* <chr>      <chr>  <chr>    <dbl> <int> <int> <ord>     
1 exam_score Female Male    0.0219   481   477 negligible
2 exam_score Female Other  -0.0588   481    42 negligible
3 exam_score Male   Other  -0.0823   477    42 negligible
# Visualization (ggpubr)
ggboxplot(df_clean, x = "gender", y = "exam_score",
          color = "gender", palette = "jco",
          add = "jitter", shape = "gender") +
  stat_compare_means(method = "t.test", label = "p.signif",
                     comparisons = list(c("Female", "Male"))) +
  facet_wrap(~part_time_job) +
  labs(title = "Exam Score by Gender", subtitle = "Faceted by Part-Time Job",
       x = "Gender", y = "Exam Score")

Comment: We applied an Independent Samples t-test. If p < 0.05, there is a significant gender difference in exam scores. Cohen’s d indicates the practical magnitude of that difference. p=0.735>0.05 fail to reject H0


Hypothesis 3: Diet Quality vs. Mental Health Rating

Does Diet Quality affect Mental Health?

Hypothesis: * H0: Mean mental health ratings are equal across all diet quality groups (Poor, Fair, Good). * H1: At least one diet quality group has a different mean mental health rating.

We are using an ANOVA test since we are comparing means of a continuous variable across 3+ # independent groups.

# Summary Stats
df_clean %>% group_by(diet_quality) %>% get_summary_stats(mental_health_rating, type = "common")
# A tibble: 3 × 11
  diet_quality variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>        <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Poor         mental_he…   185     1    10      6     5  5.56  2.79 0.205 0.405
2 Fair         mental_he…   437     1    10      5     5  5.21  2.88 0.138 0.271
3 Good         mental_he…   378     1    10      6     5  5.65  2.82 0.145 0.286
# Why: Comparing means of a continuous variable across 3+
#      independent groups. ANOVA generalizes the t-test.
# ANOVA Test
# ── Normality & Levene test ──────────────────────────────────
df %>%
  group_by(diet_quality) %>%
  shapiro_test(mental_health_rating)
# A tibble: 3 × 4
  diet_quality variable             statistic        p
  <chr>        <chr>                    <dbl>    <dbl>
1 Fair         mental_health_rating     0.933 4.33e-13
2 Good         mental_health_rating     0.941 4.10e-11
3 Poor         mental_health_rating     0.935 2.25e- 7
df %>% levene_test(mental_health_rating ~ diet_quality)
# A tibble: 1 × 4
    df1   df2 statistic     p
  <int> <int>     <dbl> <dbl>
1     2   997     0.182 0.833
aov_h3 <- df_clean %>% anova_test(mental_health_rating ~ diet_quality)
print(aov_h3)
ANOVA Table (type II tests)

        Effect DFn DFd     F     p p<.05   ges
1 diet_quality   2 997 2.595 0.075       0.005
# Post-hoc: Tukey HSD if ANOVA is significant
posthoc_h3 <- df %>%
  tukey_hsd(mental_health_rating ~ diet_quality)
print(posthoc_h3)
# A tibble: 3 × 9
  term  group1 group2 null.value estimate conf.low conf.high  p.adj p.adj.signif
* <chr> <chr>  <chr>       <dbl>    <dbl>    <dbl>     <dbl>  <dbl> <chr>       
1 diet… Fair   Good            0   0.437   -0.0315     0.906 0.0734 ns          
2 diet… Fair   Poor            0   0.349   -0.237      0.934 0.343  ns          
3 diet… Good   Poor            0  -0.0887  -0.687      0.510 0.935  ns          
# Visualization
ggplot(df_clean, aes(x = diet_quality, y = mental_health_rating,
                     fill = diet_quality, color = diet_quality)) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.shape = NA) +
  facet_wrap(~gender) +
  scale_fill_brewer(palette  = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Mental Health Rating by Diet Quality", subtitle = "Faceted by Gender",
       x = "Diet Quality", y = "Mental Health Rating (1–10)") +
  theme_minimal() +
  theme(legend.position = "bottom")

Comment: We used a One-Way ANOVA test. A significant ANOVA (p < 0.05) implies at least one group differs, meaning diet quality has an impact on mental health. p=0.178>0.05 fail to reject H0


Hypothesis 4: Part-Time Job vs. Attendance Percentage

Is there an association between having a Part Time job and Attendance?

Hypothesis: * H0: Mean attendance percentage is equal for students with and without part-time jobs. * H1: Mean attendance percentage differs between students with and without part-time jobs.

We are using a T-test since we are comparing means of one continious variable and one categorical variable between two independent groups.

# Why: Comparing means of one continuous variable between two
#      independent groups (Yes/No part-time job).
# T-test
# ── Normality ───────────────────────────────────────────────
df %>%
  group_by(part_time_job) %>%
  shapiro_test(attendance_percentage)
# A tibble: 2 × 4
  part_time_job variable              statistic            p
  <chr>         <chr>                     <dbl>        <dbl>
1 No            attendance_percentage     0.981 0.0000000134
2 Yes           attendance_percentage     0.982 0.00648     
t_test_h4 <- df_clean %>% filter(!is.na(part_time_job)) %>% t_test(attendance_percentage ~ part_time_job, var.equal = FALSE)
print(t_test_h4)
# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      1.35  351.   0.178 two.sided      0.955   -0.437     2.35
# Visualization (ggpubr)
ggviolin(df_clean %>% filter(!is.na(part_time_job)),
         x = "part_time_job", y = "attendance_percentage",
         fill = "part_time_job", color = "part_time_job",
         palette = "npg", add = "boxplot", add.params = list(fill = "white")) +
  stat_compare_means(method = "t.test", label = "p.format", label.x = 1.5) +
  facet_grid(~gender) +
  labs(title = "Attendance % by Part-Time Job Status", subtitle = "Faceted by Gender",
       x = "Has Part-Time Job", y = "Attendance Percentage (%)") +
  theme(legend.position = "none")

Comment: We used an Independent Samples t-test. If p < 0.05, having a part-time job significantly affects attendance, likely due to time constraints. p=0.178>0.05 fail to reject H0


Hypothesis 5: Parental Education vs. Exam Score

Does Parental Education Level influence Exam Score?

Hypothesis: * H0: Mean exam scores are equal across all parental education levels. * H1: At least one parental education level group has a different mean exam score.

# ── dplyr: group_by + summarise + slice_max ──────────────────
h5_summary <- df_clean %>%
  group_by(parental_education_level) %>%
  summarise(
    n          = n(),
    mean_score = mean(exam_score, na.rm = TRUE),
    sd_score   = sd(exam_score, na.rm = TRUE)
  )
print(h5_summary)
# A tibble: 4 × 4
  parental_education_level     n mean_score sd_score
  <fct>                    <int>      <dbl>    <dbl>
1 High School                392       69.5     16.8
2 Bachelor                   350       70.3     17.3
3 Master                     167       68.1     16.4
4 <NA>                        91       70.0     16.6
# Top scorer per parental education level
df_clean %>%
  group_by(parental_education_level) %>%
  slice_max(order_by = exam_score, n = 1) %>%
  select(student_id, parental_education_level, exam_score, study_hours_per_day)
# A tibble: 48 × 4
# Groups:   parental_education_level [4]
   student_id parental_education_level exam_score study_hours_per_day
   <chr>      <fct>                         <dbl>               <dbl>
 1 S1001      High School                     100                 6.9
 2 S1131      High School                     100                 7.2
 3 S1230      High School                     100                 4.2
 4 S1250      High School                     100                 5.5
 5 S1335      High School                     100                 6.5
 6 S1356      High School                     100                 5.8
 7 S1455      High School                     100                 8.3
 8 S1489      High School                     100                 7.4
 9 S1579      High School                     100                 6.1
10 S1611      High School                     100                 6  
# ℹ 38 more rows
# ── rstatix ─────────────────────────────────────────────────
df_clean %>% freq_table(parental_education_level)
# A tibble: 3 × 3
  parental_education_level     n  prop
  <fct>                    <int> <dbl>
1 High School                392  43.1
2 Bachelor                   350  38.5
3 Master                     167  18.4
df_clean %>%
  group_by(parental_education_level) %>%
  get_summary_stats(exam_score, type = "common")
# A tibble: 4 × 11
  parental_education_level variable       n   min   max median   iqr  mean    sd
  <fct>                    <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 High School              exam_score   392  23.1   100   70.0  24.0  69.5  16.8
2 Bachelor                 exam_score   350  18.4   100   71.7  23.2  70.3  17.3
3 Master                   exam_score   167  26.8   100   66.8  21.9  68.1  16.4
4 <NA>                     exam_score    91  30.2   100   71    21.4  70.0  16.6
# ℹ 2 more variables: se <dbl>, ci <dbl>
# ── Normality & Levene ───────────────────────────────────────
df_clean %>%
  group_by(parental_education_level) %>%
  shapiro_test(exam_score)
# A tibble: 4 × 4
  parental_education_level variable   statistic        p
  <fct>                    <chr>          <dbl>    <dbl>
1 High School              exam_score     0.987 0.00119 
2 Bachelor                 exam_score     0.981 0.000138
3 Master                   exam_score     0.985 0.0618  
4 <NA>                     exam_score     0.982 0.236   
df_clean %>% levene_test(exam_score ~ parental_education_level)
# A tibble: 1 × 4
    df1   df2 statistic     p
  <int> <int>     <dbl> <dbl>
1     2   906     0.277 0.758
# ── Statistical Test: One-Way ANOVA (or Kruskal-Wallis) ─────
# Why: Comparing a continuous outcome (exam_score) across 3+
#      independent groups. If normality fails → Kruskal-Wallis.


# Non-parametric alternative (use if normality violated)
kruskal_h5 <- df %>%
  kruskal_test(exam_score ~ parental_education_level)
print(kruskal_h5)
# A tibble: 1 × 6
  .y.            n statistic    df     p method        
* <chr>      <int>     <dbl> <int> <dbl> <chr>         
1 exam_score  1000      3.16     3 0.367 Kruskal-Wallis
# Visualization (ggplot2)
ggplot(df_clean, aes(x = parental_education_level, y = exam_score,
                     fill = parental_education_level, shape = parental_education_level)) +
  stat_summary(fun = mean, geom = "bar", alpha = 0.7) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "gray30") +
  geom_jitter(aes(color = parental_education_level), width = 0.2, alpha = 0.25, size = 1) +
  scale_fill_brewer(palette  = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(diet_quality ~ gender) +
  labs(title = "Exam Score by Parental Education Level", subtitle = "Faceted by Diet Quality and Gender",
       x = "Parental Education Level", y = "Mean Exam Score ± SE") +
  theme_minimal() +
  theme(axis.text.x  = element_text(angle = 25, hjust = 1), legend.position = "bottom")

Comment: A Kruskall Wallis test was utilized. Since p=0.224, we fail to reject H0. parental education level was found to have no statistically significant effect on exam scores.

Before the analysis, normality was checked using the Shapiro-Wilk test. The results showed that the normality assumption was not fully satisfied because some groups had p-values below 0.05. Therefore, a Kruskal-Wallis test was also performed as a non-parametric alternative. Kruskall-Wallis test produced p = 0.224. Since p-value was greater than 0.05, parental education level was found to have no statistically significant effect on exam scores.

PART C: MACHINE LEARNING APPLICATION PART

For this section, we utilize the tidymodels framework. To predict the continuous “exam_score”, we selected Multiple Linear Regression. To predict the binary “exam_status” (Pass/Fail), we selected Logistic Regression.

h) Method Definitions and Logic

1. Regression Model: Multiple Linear Regression

  • Definition: A fundamental machine learning technique used to model the linear relationship between a single continuous dependent variable and multiple independent variables.
  • Purpose: To predict the continuous value of a target variable and infer the strength of relationships.
  • Logic: Based on Ordinary Least Squares (OLS), it minimizes the Sum of Squared Residuals (SSR) between actual and predicted values.
  • Formula: Y = b0 + (b1 * X1) + (b2 * X2) + … + (bp * Xp) + e
  • Use Cases: When the target is continuous (exam scores) and high interpretability is needed.

2. Classification Model: Logistic Regression

  • Definition: A classification algorithm used to model the probability of a discrete, binary outcome.
  • Purpose: To classify observations into specific classes and estimate the probability of belonging to a class.
  • Logic: Uses a logistic (sigmoid) function to constrain the output between 0 and 1.
  • Formula: P = 1 / (1 + e^-(b0 + b1X1 + … + bpXp))
  • Use Cases: Binary classification problems (Pass/Fail) where calibrated probabilities are required.

i) Reasoning for Method Choice

  • Multiple Linear Regression: exam_score is continuous. Regression is mandatory to see how 1-unit changes in daily habits affect the final score.
  • Logistic Regression: exam_status is a binary categorical variable. We need a model that provides the precise probability of a student passing.

j) Machine Learning Models Implementation

# Remove student_id for ML processing
df_ml <- df_clean %>% select(-student_id)

Model 1: Regression (with V-Fold Cross-Validation)

set.seed(123)
folds <- vfold_cv(df_ml, v = 5)

lm_recipe <- recipe(exam_score ~ ., data = df_ml) %>%
  step_rm(exam_status) %>%
  step_dummy(all_nominal_predictors(), one_hot=TRUE) %>%
  step_normalize(all_numeric_predictors())

lm_model <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")
lm_workflow <- workflow() %>% add_model(lm_model) %>% add_recipe(lm_recipe)

lm_res <- fit_resamples(lm_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))

print(collect_metrics(lm_res))
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   5.44      5 0.172   pre0_mod0_post0
2 rsq     standard   0.898     5 0.00544 pre0_mod0_post0

Findings:

RMSE (Root Mean Squared Error): The Root Mean Squared Error (RMSE) value of 5.4394 indicates that our model’s predictions deviate from the actual exam score by an average of 5.4394 units.

R-Squared: The R-squared value of 0.8983 demonstrates that approximately %89.83 of the variance in exam scores can be explained by the behavioral and academic habits included in our model.

Model 2: Classification (with Stratified Split)

set.seed(456)
data_split <- initial_split(df_ml, prop = 0.80, strata = exam_status)
train_data <- training(data_split)
test_data  <- testing(data_split)

log_recipe <- recipe(exam_status ~ ., data = train_data) %>%
  step_rm(exam_score) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_model <- logistic_reg() %>% 
  set_engine("glm") %>% 
  set_mode("classification")

log_workflow <- workflow() %>% 
  add_model(log_model) %>% 
  add_recipe(log_recipe)

log_fit <- fit(log_workflow, data = train_data)

#Predict classes on test data
class_preds <- predict(log_fit, new_data = test_data)

#Predict probabilities on test data (required for ROC AUC and ROC Curve)
prob_preds <- predict(log_fit, new_data = test_data, type = "prob")

#Bind predictions to RAW test data
log_res <- test_data %>%
  bind_cols(class_preds) %>%
  bind_cols(prob_preds)

#Check classification metrics (Accuracy and Kappa)
log_res %>%
  accuracy(truth = exam_status, estimate = .pred_class) %>%
  print()
# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.929
#Check ROC AUC
log_res %>%
  roc_auc(truth = exam_status, .pred_Fail) %>%
  print()
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.951
#Confusion Matrix
log_res %>%
  conf_mat(truth = exam_status, estimate = .pred_class) %>%
  print()
          Truth
Prediction Fail Pass
      Fail   20    7
      Pass    6  150
#Plot ROC Curve
log_res %>%
  roc_curve(truth = exam_status, .pred_Fail) %>%
  autoplot() +
  labs(title = "ROC Curve for Logistic Regression (Pass/Fail)")

Detailed Classification Metrics (Accuracy) and Cohen’s Kappa:

The model achieved an overall accuracy of 92.89%, indicating a reliable performance in predicting whether a student will pass or fail.

The Cohen’s Kappa value of 0.71 indicates a substantial level of agreement between the model’s predictions and the actual exam status, demonstrating strong predictive performance beyond what would be expected by random chance.”

Confusion Matrix:

The logistic regression model correctly classified 170 students, but misclassified 13 cases.

ROC AUC:

The ROC AUC score of 0.9514 demonstrates that the model has a strong capability to correctly distinguish between passing and failing students.

Sensitivity / True Positive Rate: The steep climb on the y-axis indicates that the model successfully and correctly identifies the vast majority of the actual “Pass” cases.

1 - Specificity / False Positive Rate: The x-axis values show that the model maintains an exceptionally low probability of incorrectly predicting a “Pass” for an instance that is actually a “Fail”.