R.project.Student

Ozan Patlar 23023069 Elif Şimşek 22023027 Yasin Özcan 21023022

PART A: Introducing the Dataset

Data Source and Explanation

For this project, we are analyzing the “Student Habits and Performance” dataset, which was sourced from Kaggle. This dataset contains records for 1000 students and includes variables that capture demographics, academic habits, lifestyle factors, and academic performance. Our main goal is to explore how these daily habits and lifestyle choices impact a student’s final exam score.

Descriptive Statistics and Variable Transformations

First, we load the required libraries (including the visualization packages), import the dataset, and examine its general structure. We also prepare the variables for statistical testing and machine learning.

# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(forcats)
library(rstatix)
library(ggpubr)
library(RColorBrewer)
library(viridis)
library(tidymodels)

# Load the dataset 
df <- read_csv("student_habits_performance.csv")

# Basic cleaning and transformations
df_clean <- df %>%
  mutate(
    gender = as.factor(gender),
    part_time_job = as.factor(part_time_job),
    diet_quality = factor(diet_quality, levels = c("Poor", "Fair", "Good")),
    extracurricular_participation = as.factor(extracurricular_participation),
    # Creating target variable for Machine Learning classification
    exam_status = ifelse(exam_score >= 50, "Pass", "Fail"),
    exam_status = factor(exam_status, levels = c("Fail", "Pass"))
  )

# Display descriptive statistics for numerical variables
get_summary_stats(df_clean, type = "common")

# A tibble: 9 × 10
  variable                  n   min   max median   iqr  mean    sd    se    ci
  <fct>                 <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age                    1000  17    24     20    4.25 20.5   2.31 0.073 0.143
2 study_hours_per_day    1000   0     8.3    3.5  1.9   3.55  1.47 0.046 0.091
3 social_media_hours     1000   0     7.2    2.5  1.6   2.51  1.17 0.037 0.073
4 netflix_hours          1000   0     5.4    1.8  1.52  1.82  1.08 0.034 0.067
5 attendance_percentage  1000  56   100     84.4 13.0  84.1   9.40 0.297 0.583
6 sleep_hours            1000   3.2  10      6.5  1.7   6.47  1.23 0.039 0.076
7 exercise_frequency     1000   0     6      3    4     3.04  2.02 0.064 0.126
8 mental_health_rating   1000   1    10      5    5     5.44  2.85 0.09  0.177
9 exam_score             1000  18.4 100     70.5 22.8  69.6  16.9  0.534 1.05

PART B: Hypothesis Testing and Visualizations

Hypothesis 1: Study Hours vs. Exam Score

Hypothesis:
* H0: There is no linear relationship between study hours per day and exam score.
* H1: There is a significant positive linear relationship between study hours per day and exam score.

# Data Prep
h1_data <- df_clean %>%
  select(student_id, study_hours_per_day, exam_score) %>%
  filter(!is.na(study_hours_per_day), !is.na(exam_score)) %>%
  arrange(desc(study_hours_per_day))

# Summary & Correlation
cor_test_h1 <- h1_data %>% cor_test(study_hours_per_day, exam_score, method = "pearson")
print(cor_test_h1)

# A tibble: 1 × 8
  var1                var2      cor statistic        p conf.low conf.high method
  <chr>               <chr>   <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr> 
1 study_hours_per_day exam_s…  0.83      46.2 4.6e-250    0.805     0.844 Pears…

# Linear Regression Test
lm_h1 <- lm(exam_score ~ study_hours_per_day, data = h1_data)
summary(lm_h1)


Call:
lm(formula = exam_score ~ study_hours_per_day, data = h1_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.979  -6.626   0.236   6.537  34.319 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          35.9102     0.7893   45.50   <2e-16 ***
study_hours_per_day   9.4903     0.2055   46.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.539 on 998 degrees of freedom
Multiple R-squared:  0.6813,    Adjusted R-squared:  0.681 
F-statistic:  2134 on 1 and 998 DF,  p-value: < 2.2e-16

# Visualization
ggplot(h1_data, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(aes(color = exam_score), alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +
  scale_color_viridis_c(option = "C") +
  labs(title = "Study Hours vs Exam Score", subtitle = "Linear regression fit with 95% CI",
       x = "Study Hours per Day", y = "Exam Score", color = "Score") +
  theme_minimal()

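Comment: We used a Pearson correlation test and simple linear regression. The correlation is strong and positive (r = 0.83, p < 0.001), so we reject H0. The fitted slope is 9.49: each additional hour of study per day is associated with an exam score about 9.5 points higher on average, and study hours alone explain about 68% of the variance in exam scores (R-squared = 0.681). Because the data are observational, this shows an association rather than a causal effect.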
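As a consistency check, the regression output above can be tied back to the correlation test using only the reported numbers. The sketch below is illustrative: it hard-codes values copied from the printed output (the variable names are ours) and does not re-fit the model on the dataset.

```r
# Values copied from the printed lm() output above (rounded as reported)
intercept <- 35.9102
slope     <- 9.4903
r_squared <- 0.6813
n         <- 1000

# In simple linear regression, R-squared is the squared Pearson correlation;
# the slope is positive, so the positive root is taken
r <- sqrt(r_squared)

# t statistic for testing zero correlation: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Predicted exam score for a hypothetical student who studies 4 hours per day
pred_4h <- intercept + slope * 4

round(c(r = r, t = t_stat, predicted_score = pred_4h), 2)
```

The recovered correlation and t statistic should agree with the reported 0.83 and 46.2 up to rounding, which confirms that the correlation test and the regression describe the same relationship.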

Hypothesis 2: Gender vs. Exam Score

Hypothesis:
* H0: Mean exam scores are equal for male and female students.
* H1: Mean exam scores differ between male and female students.

# Summary Stats
df_clean %>% group_by(gender) %>% get_summary_stats(exam_score, type = "common")

# A tibble: 3 × 11
  gender variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>  <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female exam_score   481  18.4   100   70.7  23.6  69.7  16.9 0.771  1.51
2 Male   exam_score   477  23.1   100   70.2  22.2  69.4  17.2 0.785  1.54
3 Other  exam_score    42  43.9   100   69    19.5  70.6  13.8 2.12   4.29

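# Welch t-test, restricted to Female vs. Male as stated in the hypothesis
# (tidymodels attaches infer after rstatix, so t_test() resolves to infer::t_test())
t_test_h2 <- df_clean %>%
  filter(gender %in% c("Female", "Male")) %>%
  mutate(gender = droplevels(gender)) %>%
  infer::t_test(exam_score ~ gender, order = c("Female", "Male"), var.equal = FALSE)
print(t_test_h2)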

# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1     0.339  955.   0.735 two.sided      0.373    -1.79     2.53

# Effect Size
df_clean %>% cohens_d(exam_score ~ gender)

# A tibble: 3 × 7
  .y.        group1 group2 effsize    n1    n2 magnitude 
* <chr>      <chr>  <chr>    <dbl> <int> <int> <ord>     
1 exam_score Female Male    0.0219   481   477 negligible
2 exam_score Female Other  -0.0588   481    42 negligible
3 exam_score Male   Other  -0.0823   477    42 negligible

# Visualization (ggpubr)
ggboxplot(df_clean, x = "gender", y = "exam_score",
          color = "gender", palette = "jco",
          add = "jitter", shape = "gender") +
  stat_compare_means(method = "t.test", label = "p.signif",
                     comparisons = list(c("Female", "Male"))) +
  facet_wrap(~part_time_job) +
  labs(title = "Exam Score by Gender", subtitle = "Faceted by Part-Time Job",
       x = "Gender", y = "Exam Score")

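Comment: We applied an independent-samples Welch t-test (var.equal = FALSE). The reported test compares the Female and Male groups only (df ≈ 955); the "Other" group (n = 42) appears in the summary statistics and effect sizes but is not part of the hypothesis. With t = 0.339 and p = 0.735, we fail to reject H0: there is no evidence of a difference in mean exam scores between female and male students. Cohen’s d is 0.02 (negligible), so the difference is also unimportant in practical terms.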
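To make the Welch test and the effect size less of a black box, the sketch below recomputes them from the summary statistics printed above. It is only an approximation under stated assumptions: the standard deviations are the rounded values from the summary table, the mean difference is taken from the "estimate" column of the t-test output, and all variable names are ours.

```r
# Summary values copied from the output above (Female vs. Male)
n1 <- 481; sd1 <- 16.9
n2 <- 477; sd2 <- 17.2
mean_diff <- 0.373   # "estimate" column of the t-test output

# Welch standard error: the group variances are not pooled
v1 <- sd1^2 / n1
v2 <- sd2^2 / n2
t_stat <- mean_diff / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_welch)

# Cohen's d, scaling the mean difference by the average of the two variances
cohen_d <- mean_diff / sqrt((sd1^2 + sd2^2) / 2)

round(c(t = t_stat, df = df_welch, p = p_value, d = cohen_d), 3)
```

Up to rounding, these values should match the reported t statistic, degrees of freedom, p-value, and the Female vs. Male row of the effect size table.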

Hypothesis 3: Diet Quality vs. Mental Health Rating

Hypothesis:
* H0: Mean mental health ratings are equal across all diet quality groups (Poor, Fair, Good).
* H1: At least one diet quality group has a different mean mental health rating.

# Summary Stats
df_clean %>% group_by(diet_quality) %>% get_summary_stats(mental_health_rating, type = "common")

# A tibble: 3 × 11
  diet_quality variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>        <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Poor         mental_he…   185     1    10      6     5  5.56  2.79 0.205 0.405
2 Fair         mental_he…   437     1    10      5     5  5.21  2.88 0.138 0.271
3 Good         mental_he…   378     1    10      6     5  5.65  2.82 0.145 0.286

# ANOVA Test
aov_h3 <- df_clean %>% anova_test(mental_health_rating ~ diet_quality)
print(aov_h3)

ANOVA Table (type II tests)

        Effect DFn DFd     F     p p<.05   ges
1 diet_quality   2 997 2.595 0.075       0.005

# Visualization
ggplot(df_clean, aes(x = diet_quality, y = mental_health_rating,
                     fill = diet_quality, color = diet_quality)) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.shape = NA) +
  facet_wrap(~gender) +
  scale_fill_brewer(palette  = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Mental Health Rating by Diet Quality", subtitle = "Faceted by Gender",
       x = "Diet Quality", y = "Mental Health Rating (1–10)") +
  theme_minimal() +
  theme(legend.position = "bottom")

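Comment: We used a one-way ANOVA. The result is not significant (F(2, 997) = 2.595, p = 0.075), so we fail to reject H0 at the 0.05 level: the data do not show a difference in mean mental health rating across diet quality groups. The effect size is also very small (generalized eta squared = 0.005). Had the ANOVA been significant, a post-hoc test would have been needed to identify which groups differ.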
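The F statistic of a one-way ANOVA is the ratio of between-group to within-group variability. The sketch below rebuilds it from the group summaries printed above. Because those means and standard deviations are rounded, it reproduces the reported F and p only approximately; the variable names are ours.

```r
# Group summaries copied from the output above, in the order Poor, Fair, Good
n <- c(185, 437, 378)
m <- c(5.56, 5.21, 5.65)
s <- c(2.79, 2.88, 2.82)

k <- length(n)                 # number of groups
N <- sum(n)                    # total sample size
grand_mean <- sum(n * m) / N   # weighted mean of the group means

# Between-group and within-group sums of squares
ss_between <- sum(n * (m - grand_mean)^2)
ss_within  <- sum((n - 1) * s^2)

# F statistic: mean square between divided by mean square within
f_stat  <- (ss_between / (k - 1)) / (ss_within / (N - k))
p_value <- pf(f_stat, df1 = k - 1, df2 = N - k, lower.tail = FALSE)

# Share of total variability attributable to the grouping factor
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_stat, p = p_value, eta_sq = eta_sq), 3)
```

The degrees of freedom (2 and 997) match the DFn and DFd columns of the ANOVA table, and the variance share is close to the reported ges of 0.005, which for a design with a single between-subjects factor should coincide with this ratio.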

Hypothesis 4: Part-Time Job vs. Attendance Percentage

Hypothesis:
* H0: Mean attendance percentage is equal for students with and without part-time jobs.
* H1: Mean attendance percentage differs between students with and without part-time jobs.

# T-test
t_test_h4 <- df_clean %>% filter(!is.na(part_time_job)) %>% t_test(attendance_percentage ~ part_time_job, var.equal = FALSE)
print(t_test_h4)

# A tibble: 1 × 7
  statistic  t_df p_value alternative estimate lower_ci upper_ci
      <dbl> <dbl>   <dbl> <chr>          <dbl>    <dbl>    <dbl>
1      1.35  351.   0.178 two.sided      0.955   -0.437     2.35

# Visualization (ggpubr)
ggviolin(df_clean %>% filter(!is.na(part_time_job)),
         x = "part_time_job", y = "attendance_percentage",
         fill = "part_time_job", color = "part_time_job",
         palette = "npg", add = "boxplot", add.params = list(fill = "white")) +
  stat_compare_means(method = "t.test", label = "p.format", label.x = 1.5) +
  facet_grid(~gender) +
  labs(title = "Attendance % by Part-Time Job Status", subtitle = "Faceted by Gender",
       x = "Has Part-Time Job", y = "Attendance Percentage (%)") +
  theme(legend.position = "none")

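Comment: We used an independent-samples Welch t-test. With t = 1.35 and p = 0.178, we fail to reject H0: mean attendance does not differ significantly between students with and without part-time jobs. The estimated difference is 0.96 percentage points, and its 95% confidence interval (-0.44 to 2.35) includes zero.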

Hypothesis 5: Parental Education vs. Exam Score

Hypothesis:
* H0: Mean exam scores are equal across all parental education levels.
* H1: At least one parental education level group has a different mean exam score.

# ANOVA Test
aov_h5 <- df_clean %>% anova_test(exam_score ~ parental_education_level)
print(aov_h5)

ANOVA Table (type II tests)

                    Effect DFn DFd     F     p p<.05   ges
1 parental_education_level   3 996 0.653 0.581       0.002

# Visualization (ggplot2)
ggplot(df_clean, aes(x = parental_education_level, y = exam_score,
                     fill = parental_education_level, shape = parental_education_level)) +
  stat_summary(fun = mean, geom = "bar", alpha = 0.7) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2, color = "gray30") +
  geom_jitter(aes(color = parental_education_level), width = 0.2, alpha = 0.25, size = 1) +
  scale_fill_brewer(palette  = "Dark2") +
  scale_color_brewer(palette = "Dark2") +
  facet_grid(diet_quality ~ gender) +
  labs(title = "Exam Score by Parental Education Level", subtitle = "Faceted by Diet Quality and Gender",
       x = "Parental Education Level", y = "Mean Exam Score ± SE") +
  theme_minimal() +
  theme(axis.text.x  = element_text(angle = 25, hjust = 1), legend.position = "bottom")

Comment: We used a one-way ANOVA. The result is not significant (F(3, 996) = 0.653, p = 0.581, generalized eta squared = 0.002), so we fail to reject H0: mean exam scores do not differ detectably across parental education levels in this dataset.

PART C: Machine Learning Application

For this section, we utilize the tidymodels framework. To predict the continuous “exam_score”, we selected Multiple Linear Regression. To predict the binary “exam_status” (Pass/Fail), we selected Logistic Regression.

h) Method Definitions and Logic

1. Regression Model: Multiple Linear Regression

Definition: A fundamental machine learning technique used to model the linear relationship between a single continuous dependent variable and multiple independent variables.
Purpose: To predict the continuous value of a target variable and infer the strength of relationships.
Logic: Based on Ordinary Least Squares (OLS), it minimizes the Sum of Squared Residuals (SSR) between actual and predicted values.
Formula: Y = b0 + (b1 * X1) + (b2 * X2) + … + (bp * Xp) + e
Use Cases: When the target is continuous (exam scores) and high interpretability is needed.
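The OLS logic described above can be illustrated on simulated data. The following is a minimal sketch, not the project's model: the predictors x1 and x2 and the true coefficients are made up for illustration. It solves the normal equations directly and checks that the result agrees with lm().

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)                                     # hypothetical predictor 1
x2 <- rnorm(n)                                     # hypothetical predictor 2
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3) # assumed true model plus noise

# Design matrix with a column of ones for the intercept b0
X <- cbind(1, x1, x2)

# OLS solution of the normal equations: (X'X) b = X'y
b_hat <- solve(crossprod(X), crossprod(X, y))

# The same coefficients as estimated by lm()
fit <- lm(y ~ x1 + x2)

# Sum of squared residuals that OLS minimizes
ssr <- sum((y - X %*% b_hat)^2)

all.equal(as.numeric(b_hat), unname(coef(fit)))
```

Any other coefficient vector gives a larger sum of squared residuals on the same data, which is what "least squares" means in practice.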

2. Classification Model: Logistic Regression

Definition: A classification algorithm used to model the probability of a discrete, binary outcome.
Purpose: To classify observations into specific classes and estimate the probability of belonging to a class.
Logic: Uses a logistic (sigmoid) function to constrain the output between 0 and 1.
Formula: P = 1 / (1 + e^-(b0 + b1X1 + … + bpXp))
Use Cases: Binary classification problems (Pass/Fail) where calibrated probabilities are required.
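The sigmoid formula above can be checked against R's glm() on simulated data. This is a sketch under assumptions: the predictor hours, the true coefficients, and the sample are all made up for illustration and are unrelated to the project dataset.

```r
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

set.seed(42)
n      <- 500
hours  <- runif(n, 0, 8)                       # hypothetical predictor
p_true <- sigmoid(-2 + 0.9 * hours)            # assumed true pass probability
passed <- rbinom(n, size = 1, prob = p_true)   # simulated binary outcome

# Logistic regression fitted by maximum likelihood
fit <- glm(passed ~ hours, family = binomial())

# Probability by hand: apply the sigmoid to the linear predictor b0 + b1 * x
b        <- coef(fit)
p_manual <- sigmoid(b[["(Intercept)"]] + b[["hours"]] * hours)

# Probability as returned by predict()
p_glm <- unname(predict(fit, type = "response"))

all.equal(p_manual, p_glm)
range(p_manual)
```

The fitted probabilities always stay strictly between 0 and 1; a class label is obtained by thresholding them, typically at 0.5.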

i) Reasoning for Method Choice

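Multiple Linear Regression: exam_score is continuous, so a regression model is the natural choice. Its coefficients show how a change in each daily habit is associated with the final score. Because the recipe normalizes the numeric predictors, each coefficient corresponds to a one-standard-deviation change in the predictor rather than a one-unit change.
Logistic Regression: exam_status is a binary categorical variable. We need a model that provides an estimated probability of a student passing.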

j) Machine Learning Models Implementation

# Remove student_id for ML processing
df_ml <- df_clean %>% select(-student_id)

Model 1: Regression (with V-Fold Cross-Validation)

set.seed(123)
folds <- vfold_cv(df_ml, v = 5)

lm_recipe <- recipe(exam_score ~ ., data = df_ml) %>%
  step_rm(exam_status) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

lm_model <- linear_reg() %>% set_engine("lm") %>% set_mode("regression")
lm_workflow <- workflow() %>% add_model(lm_model) %>% add_recipe(lm_recipe)

lm_res <- fit_resamples(lm_workflow, resamples = folds, control = control_resamples(save_pred = TRUE))

cat("Regression Model Metrics:\n")

Regression Model Metrics:

print(collect_metrics(lm_res))

# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config        
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>          
1 rmse    standard   5.38      5 0.192   pre0_mod0_post0
2 rsq     standard   0.899     5 0.00585 pre0_mod0_post0

Findings: Across the 5 folds, the mean RMSE is 5.38, so predictions are typically off by about 5.4 exam points, and the mean R-squared is 0.899, so the predictors together explain about 90% of the variance in exam scores.
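For readers unfamiliar with V-fold cross-validation, the sketch below performs the same kind of resampling by hand in base R. It is an illustration under assumptions, not a re-run of the model above: the data are simulated, the names sim, fold_id, and fold_metrics are ours, and R-squared is computed as the squared correlation between observed and predicted values, which is our understanding of how the rsq metric is defined in tidymodels.

```r
set.seed(123)
n   <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))   # hypothetical predictors
sim$y <- 2 + 1.5 * sim$x1 - 0.5 * sim$x2 + rnorm(n, sd = 0.3)

# Randomly assign each row to one of v folds of equal size
v       <- 5
fold_id <- sample(rep(seq_len(v), length.out = n))

# For each fold: fit on the other folds, then evaluate on the held-out fold
fold_metrics <- t(sapply(seq_len(v), function(i) {
  train    <- sim[fold_id != i, ]
  held_out <- sim[fold_id == i, ]
  fit      <- lm(y ~ x1 + x2, data = train)
  pred     <- predict(fit, newdata = held_out)
  c(rmse = sqrt(mean((held_out$y - pred)^2)),   # root mean squared error
    rsq  = cor(held_out$y, pred)^2)             # squared correlation
}))

# Average the per-fold metrics, as collect_metrics() does in its "mean" column
colMeans(fold_metrics)
```

Each observation is used for evaluation exactly once, so the averaged metrics estimate out-of-sample performance rather than fit to the training data.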

Model 2: Classification (with Stratified Split)

set.seed(456)
data_split <- initial_split(df_ml, prop = 0.80, strata = exam_status)
train_data <- training(data_split)
test_data  <- testing(data_split)

log_recipe <- recipe(exam_status ~ ., data = train_data) %>%
  step_rm(exam_score) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

log_model <- logistic_reg() %>% set_engine("glm") %>% set_mode("classification")
log_workflow <- workflow() %>% add_model(log_model) %>% add_recipe(log_recipe)

log_fit <- fit(log_workflow, data = train_data)
log_preds <- augment(log_fit, new_data = test_data)

cat("Classification Accuracy:\n")

Classification Accuracy:

print(accuracy(log_preds, truth = exam_status, estimate = .pred_class))

# A tibble: 1 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.935

cat("\nROC AUC:\n")


ROC AUC:

print(roc_auc(log_preds, truth = exam_status, .pred_Fail))

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.957

cat("\nConfusion Matrix:\n")


Confusion Matrix:

print(conf_mat(log_preds, truth = exam_status, estimate = .pred_class))

          Truth
Prediction Fail Pass
      Fail   21    7
      Pass    6  167

# ROC Curve Plot
log_preds %>%
  roc_curve(truth = exam_status, .pred_Fail) %>%
  autoplot() +
  labs(title = "ROC Curve for Logistic Regression (Pass/Fail)")

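Findings: The split was stratified on exam_status so that the training and test sets keep the same Pass/Fail proportions as the full data. Stratification does not balance the classes, and the test set remains imbalanced (27 Fail, 174 Pass). On the test set the model reaches 93.5% accuracy, compared with 86.6% for always predicting Pass, and a ROC AUC of 0.957. It correctly identifies 21 of 27 failing students (78%) and 167 of 174 passing students (96%), so it separates the classes well but still misses about one in five failing students.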
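Because the classes are imbalanced, accuracy alone can be misleading, so it helps to read per-class metrics off the confusion matrix. The sketch below does this in base R using the four counts printed above; "Fail" is treated as the event of interest, matching the factor level order used in the report, and the object names are ours.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = truth)
cm <- matrix(c(21, 6, 7, 167), nrow = 2,
             dimnames = list(Prediction = c("Fail", "Pass"),
                             Truth      = c("Fail", "Pass")))

# Overall share of correct predictions
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of truly failing students that the model flags as Fail
sensitivity <- cm["Fail", "Fail"] / sum(cm[, "Fail"])

# Specificity: share of truly passing students that the model labels Pass
specificity <- cm["Pass", "Pass"] / sum(cm[, "Pass"])

# Precision: share of predicted Fail cases that are truly Fail
precision <- cm["Fail", "Fail"] / sum(cm["Fail", ])

# Baseline accuracy from always predicting the majority class
baseline <- max(colSums(cm)) / sum(cm)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision,
        baseline = baseline), 3)
```

The accuracy computed this way agrees with the reported 0.935, and the gap between it and the majority-class baseline is a fairer measure of what the model adds.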

References

https://www.kaggle.com/datasets/jayaantanaath/student-habits-vs-academic-performance/data
Anthropic Claude LLM
Google Gemini LLM