For this project, we are analyzing the “Student Habits and Performance” dataset, which was sourced from Kaggle. This dataset contains records for 1000 students and includes variables that capture demographics, academic habits, lifestyle factors, and academic performance. Our main goal is to explore how these daily habits and lifestyle choices impact a student’s final exam score.
Descriptive Statistics and Variable Transformations
First, we load the required libraries (including the new visualization packages), import our dataset, and examine its general structure. We also prepare our variables for statistical testing and machine learning.
# Load necessary libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(readr)
library(forcats)
library(rstatix)
library(ggpubr)
library(RColorBrewer)
library(viridis)
library(tidymodels)

# Load the dataset
df <- read_csv("student_habits_performance.csv")

# Basic cleaning and transformations
df_clean <- df %>%
  mutate(
    gender = as.factor(gender),
    part_time_job = as.factor(part_time_job),
    diet_quality = factor(diet_quality, levels = c("Poor", "Fair", "Good")),
    extracurricular_participation = as.factor(extracurricular_participation),
    # Creating target variable for Machine Learning classification
    exam_status = ifelse(exam_score >= 50, "Pass", "Fail"),
    exam_status = factor(exam_status, levels = c("Fail", "Pass"))
  )

# Display descriptive statistics for numerical variables
get_summary_stats(df_clean, type = "common")
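One detail of the factor conversion is worth checking: `factor()` with an explicit `levels` argument silently turns any value outside that list into `NA`. The sketch below shows this on a made-up vector; the lower-case "good" is a hypothetical entry, not a value taken from the dataset.

```r
# Hypothetical raw values; "good" (lower case) is deliberately not in the level list
diet_raw <- c("Poor", "Fair", "Good", "good")

diet <- factor(diet_raw, levels = c("Poor", "Fair", "Good"))

# Values that are not listed in `levels` become NA without a warning
n_unmatched <- sum(is.na(diet) & !is.na(diet_raw))

print(table(diet, useNA = "ifany"))
print(n_unmatched)
```

Running the same count on `df_clean$diet_quality` against the raw column would show whether any rows were lost to spelling or capitalization differences.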
Hypothesis 1: Study Hours vs. Exam Score
Hypothesis:
* H0: There is no linear relationship between study hours per day and exam score.
* H1: There is a positive linear relationship between study hours per day and exam score.
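# Linear Regression Test
# Note: h1_data is not defined in the code shown here; it is assumed to be
# a subset of df_clean prepared in an earlier step.
lm_h1 <- lm(exam_score ~ study_hours_per_day, data = h1_data)
summary(lm_h1)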
Call:
lm(formula = exam_score ~ study_hours_per_day, data = h1_data)

Residuals:
    Min      1Q  Median      3Q     Max
-25.979  -6.626   0.236   6.537  34.319

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)
(Intercept)          35.9102     0.7893   45.50   <2e-16 ***
study_hours_per_day   9.4903     0.2055   46.19   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.539 on 998 degrees of freedom
Multiple R-squared:  0.6813,  Adjusted R-squared:  0.681
F-statistic:  2134 on 1 and 998 DF,  p-value: < 2.2e-16
# Visualization
ggplot(h1_data, aes(x = study_hours_per_day, y = exam_score)) +
  geom_point(aes(color = exam_score), alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = TRUE, color = "firebrick") +
  scale_color_viridis_c(option = "C") +
  labs(title = "Study Hours vs Exam Score",
       subtitle = "Linear regression fit with 95% CI",
       x = "Study Hours per Day", y = "Exam Score", color = "Score") +
  theme_minimal()
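Comment: We used simple linear regression and a Pearson correlation test. The slope is positive and highly significant (9.49 points per additional study hour per day, p < 2e-16), so we reject H0 at the 0.05 level. Study hours alone explain about 68% of the variance in exam scores (R-squared = 0.681). This is an association in observational data, so it does not by itself show that studying more causes higher scores.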
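The regression and the Pearson test are two views of the same quantity: in simple linear regression, R-squared equals the squared Pearson correlation, and the slope's t statistic equals the correlation test's t statistic. For the fit reported above, R-squared = 0.6813 with a positive slope implies r of about 0.83. The sketch below illustrates the identity on synthetic data; the variable names and simulated values are assumptions for illustration and are not the Kaggle data.

```r
set.seed(1)

# Synthetic data for illustration only (not the Kaggle dataset)
n <- 200
study_hours <- runif(n, 0, 8)
exam_score <- 36 + 9.5 * study_hours + rnorm(n, sd = 9.5)

# Simple linear regression
fit <- lm(exam_score ~ study_hours)
r_squared <- summary(fit)$r.squared
slope_t <- summary(fit)$coefficients["study_hours", "t value"]

# Pearson correlation test on the same two variables
ct <- cor.test(study_hours, exam_score, method = "pearson")
r <- unname(ct$estimate)
cor_t <- unname(ct$statistic)

# Both comparisons should print TRUE
print(isTRUE(all.equal(r^2, r_squared)))
print(isTRUE(all.equal(slope_t, cor_t)))
```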
Hypothesis 2: Gender vs. Exam Score
Hypothesis:
* H0: Mean exam scores are equal for male and female students.
* H1: Mean exam scores differ between male and female students.
# Summary Stats
df_clean %>%
  group_by(gender) %>%
  get_summary_stats(exam_score, type = "common")
# A tibble: 3 × 11
  gender variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>  <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female exam_score   481  18.4   100   70.7  23.6  69.7  16.9 0.771  1.51
2 Male   exam_score   477  23.1   100   70.2  22.2  69.4  17.2 0.785  1.54
3 Other  exam_score    42  43.9   100   69    19.5  70.6  13.8 2.12   4.29

# A tibble: 3 × 7
  .y.        group1 group2 effsize    n1    n2 magnitude
* <chr>      <chr>  <chr>    <dbl> <int> <int> <ord>
1 exam_score Female Male    0.0219   481   477 negligible
2 exam_score Female Other  -0.0588   481    42 negligible
3 exam_score Male   Other  -0.0823   477    42 negligible
# Visualization (ggpubr)
ggboxplot(df_clean, x = "gender", y = "exam_score",
          color = "gender", palette = "jco",
          add = "jitter", shape = "gender") +
  stat_compare_means(method = "t.test", label = "p.signif",
                     comparisons = list(c("Female", "Male"))) +
  facet_wrap(~part_time_job) +
  labs(title = "Exam Score by Gender", subtitle = "Faceted by Part-Time Job",
       x = "Gender", y = "Exam Score")
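Comment: We applied an independent samples t-test to the Female and Male groups; if p < 0.05, the difference in mean exam scores is statistically significant. Cohen's d measures the practical magnitude of a difference. Here it is negligible for every pair of groups (|d| < 0.1), and the group means lie within about one point of each other (69.4 to 70.6).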
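As a rough check on the effect size, Cohen's d can be recomputed by hand from the Female and Male rows of the summary table above. This sketch uses the pooled-standard-deviation form of d. Because the table values are rounded, and because rstatix may use a slightly different denominator, the result only approximates the reported 0.0219.

```r
# Summary statistics copied from the table above (already rounded there)
n_f <- 481; mean_f <- 69.7; sd_f <- 16.9
n_m <- 477; mean_m <- 69.4; sd_m <- 17.2

# Pooled standard deviation (classic equal-variance form)
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2))

# Cohen's d: standardized difference between the two group means
d <- (mean_f - mean_m) / sd_pooled
print(round(d, 3))
```

The value comes out at roughly 0.02, well below the conventional 0.2 threshold for a "small" effect, which agrees with the "negligible" label in the output.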
Hypothesis 3: Diet Quality vs. Mental Health Rating
Hypothesis:
* H0: Mean mental health ratings are equal across all diet quality groups (Poor, Fair, Good).
* H1: At least one diet quality group has a different mean mental health rating.
# Summary Stats
df_clean %>%
  group_by(diet_quality) %>%
  get_summary_stats(mental_health_rating, type = "common")
# A tibble: 3 × 11
  diet_quality variable       n   min   max median   iqr  mean    sd    se    ci
  <fct>        <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Poor         mental_he…   185     1    10      6     5  5.56  2.79 0.205 0.405
2 Fair         mental_he…   437     1    10      5     5  5.21  2.88 0.138 0.271
3 Good         mental_he…   378     1    10      6     5  5.65  2.82 0.145 0.286

ANOVA Table (type II tests)

        Effect DFn DFd     F     p p<.05   ges
1 diet_quality   2 997 2.595 0.075       0.005
# Visualization
ggplot(df_clean, aes(x = diet_quality, y = mental_health_rating,
                     fill = diet_quality, color = diet_quality)) +
  geom_violin(alpha = 0.4, trim = FALSE) +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.shape = NA) +
  facet_wrap(~gender) +
  scale_fill_brewer(palette = "Set2") +
  scale_color_brewer(palette = "Set2") +
  labs(title = "Mental Health Rating by Diet Quality", subtitle = "Faceted by Gender",
       x = "Diet Quality", y = "Mental Health Rating (1–10)") +
  theme_minimal() +
  theme(legend.position = "bottom")
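Comment: We used a one-way ANOVA. The test is not significant at the 0.05 level (F(2, 997) = 2.595, p = 0.075), so we fail to reject H0: these data do not show that mean mental health ratings differ across diet quality groups. The effect size is also very small (generalized eta-squared = 0.005).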
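The p-value and effect size in the ANOVA table can be recovered from the F statistic and its degrees of freedom alone. The sketch below uses the values printed above; it relies on the fact that, for a one-way design, eta-squared equals F * DFn / (F * DFn + DFd).

```r
# Values copied from the ANOVA table above
f_stat <- 2.595
df_n <- 2
df_d <- 997

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_stat, df_n, df_d, lower.tail = FALSE)

# Eta-squared for a one-way design, expressed through F and its degrees of freedom
eta_sq <- (f_stat * df_n) / (f_stat * df_n + df_d)

print(round(p_value, 3))
print(round(eta_sq, 3))
```

Both values round to the figures in the table (p = 0.075, ges = 0.005), and p exceeds 0.05, so the test does not reject H0.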
Hypothesis 4: Part-Time Job vs. Attendance Percentage
Hypothesis:
* H0: Mean attendance percentage is equal for students with and without part-time jobs.
* H1: Mean attendance percentage differs between students with and without part-time jobs.
# Visualization (ggpubr)
ggviolin(df_clean %>% filter(!is.na(part_time_job)),
         x = "part_time_job", y = "attendance_percentage",
         fill = "part_time_job", color = "part_time_job",
         palette = "npg", add = "boxplot", add.params = list(fill = "white")) +
  stat_compare_means(method = "t.test", label = "p.format", label.x = 1.5) +
  facet_grid(~gender) +
  labs(title = "Attendance % by Part-Time Job Status", subtitle = "Faceted by Gender",
       x = "Has Part-Time Job", y = "Attendance Percentage (%)") +
  theme(legend.position = "none")
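Comment: We used an independent samples t-test. If p < 0.05, mean attendance differs significantly between students with and without a part-time job. Time constraints are one plausible explanation, although the test itself does not establish a cause.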
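No numeric test output is shown for this hypothesis, so the following is only a sketch of how the comparison could be run outside the plot. It uses base R's `t.test()`, which defaults to the Welch (unequal variance) version; whether the p-value printed on the plot uses the same variant depends on ggpubr's defaults. The data frame is synthetic and the group sizes are made up.

```r
set.seed(7)

# Synthetic attendance data for illustration only (not the Kaggle dataset)
attendance <- data.frame(
  part_time_job = factor(rep(c("No", "Yes"), times = c(120, 40))),
  attendance_percentage = c(rnorm(120, mean = 84, sd = 9),
                            rnorm(40, mean = 84, sd = 9))
)

# Two-sample t-test; var.equal = FALSE (Welch) is the base R default
res <- t.test(attendance_percentage ~ part_time_job, data = attendance)

# Decision rule from the comment above
reject_h0 <- res$p.value < 0.05

print(res$method)
print(res$p.value)
print(reject_h0)
```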
Hypothesis 5: Parental Education vs. Exam Score
Hypothesis:
* H0: Mean exam scores are equal across all parental education levels.
* H1: At least one parental education level group has a different mean exam score.
Comment: We used a one-way ANOVA. If p < 0.05, mean exam scores differ for at least one parental education level. A richer academic environment at home is one possible explanation, which the test alone cannot confirm.
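The code and output for this test are not shown, so the following is a sketch of one way to run it, with a Tukey HSD follow-up that identifies which pairs of levels differ when the overall test is significant. It uses base R rather than rstatix. The column name `parental_education_level`, its level labels, and the data are assumptions made for illustration.

```r
set.seed(11)

# Synthetic data; column name and level labels are assumed, not taken from the report
scores <- data.frame(
  parental_education_level = factor(rep(c("High School", "Bachelor", "Master"), each = 50)),
  exam_score = rnorm(150, mean = 70, sd = 15)
)

# One-way ANOVA
fit <- aov(exam_score ~ parental_education_level, data = scores)
anova_p <- summary(fit)[[1]][["Pr(>F)"]][1]

# Tukey's HSD: all pairwise comparisons with family-wise error control
tukey <- TukeyHSD(fit)
pairs <- tukey$parental_education_level

print(anova_p)
print(pairs)
```

Each row of `pairs` gives the estimated difference in means, its confidence bounds, and an adjusted p-value for one pair of levels.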
PART C: MACHINE LEARNING APPLICATION
For this section, we utilize the tidymodels framework. To predict the continuous “exam_score”, we selected Multiple Linear Regression. To predict the binary “exam_status” (Pass/Fail), we selected Logistic Regression.
h) Method Definitions and Logic
1. Regression Model: Multiple Linear Regression
Definition: A fundamental machine learning technique used to model the linear relationship between a single continuous dependent variable and multiple independent variables.
Purpose: To predict the continuous value of a target variable and infer the strength of relationships.
Logic: Based on Ordinary Least Squares (OLS), it minimizes the Sum of Squared Residuals (SSR) between actual and predicted values.
Formula: Y = b0 + (b1 * X1) + (b2 * X2) + … + (bp * Xp) + e
Use Cases: When the target is continuous (exam scores) and high interpretability is needed.
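The OLS logic can be made concrete by solving the normal equations (X'X) b = X'y directly and comparing the result with `lm()`. This is a sketch on synthetic data; the predictors `x1` and `x2` and the true coefficients are made up for illustration.

```r
set.seed(3)

# Synthetic predictors and response for illustration only
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n, sd = 0.5)

# Design matrix with an intercept column
X <- cbind(Intercept = 1, x1 = x1, x2 = x2)

# Solve the normal equations (X'X) b = X'y for b
b_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (internally via a QR decomposition)
fit <- lm(y ~ x1 + x2)

# Sum of squared residuals at the OLS solution and at a perturbed coefficient vector
ssr_ols <- sum((y - X %*% b_hat)^2)
ssr_perturbed <- sum((y - X %*% (b_hat + 0.1))^2)

print(cbind(normal_equations = as.numeric(b_hat), lm = unname(coef(fit))))
print(c(ssr_ols = ssr_ols, ssr_perturbed = ssr_perturbed))
```

The two coefficient columns agree, and moving the coefficients away from the OLS solution increases the sum of squared residuals.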
2. Classification Model: Logistic Regression
Definition: A classification algorithm used to model the probability of a discrete, binary outcome.
Purpose: To classify observations into specific classes and estimate the probability of belonging to a class.
Logic: Uses a logistic (sigmoid) function to constrain the output between 0 and 1.
Formula: P = 1 / (1 + e^-(b0 + b1X1 + … + bpXp))
Use Cases: Binary classification problems (Pass/Fail) where calibrated probabilities are required.
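The logistic formula can be evaluated directly. The sketch below uses a single predictor with hypothetical coefficients (`b0` and `b1` are made up, not estimated from the data) and checks the hand-written function against `plogis()`, base R's built-in logistic function.

```r
# Logistic (sigmoid) function: maps any real number into (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients, chosen only to illustrate the formula
b0 <- -4
b1 <- 1.2
study_hours <- c(0, 2, 4, 6)

# P = 1 / (1 + e^-(b0 + b1 * X1))
p_pass <- sigmoid(b0 + b1 * study_hours)

print(round(p_pass, 3))
print(isTRUE(all.equal(p_pass, plogis(b0 + b1 * study_hours))))
```

With a positive `b1`, the predicted probability rises monotonically with the predictor and always stays strictly between 0 and 1.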
i) Reasoning for Method Choice
Multiple Linear Regression: exam_score is continuous, so a regression model is appropriate. Its coefficients show how a one-unit change in each daily habit is associated with the final score, holding the other predictors fixed.
Logistic Regression: exam_status is a binary categorical variable, so we need a classifier. Logistic regression also returns an estimated probability that a student passes.
j) Machine Learning Models Implementation
# Remove student_id for ML processing
df_ml <- df_clean %>%
  select(-student_id)
Model 1: Regression (with V-Fold Cross-Validation)
set.seed(123)
folds <- vfold_cv(df_ml, v = 5)

lm_recipe <- recipe(exam_score ~ ., data = df_ml) %>%
  step_rm(exam_status) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

lm_model <- linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")

lm_workflow <- workflow() %>%
  add_model(lm_model) %>%
  add_recipe(lm_recipe)

lm_res <- fit_resamples(
  lm_workflow,
  resamples = folds,
  control = control_resamples(save_pred = TRUE)
)

cat("Regression Model Metrics:\n")
Regression Model Metrics:
print(collect_metrics(lm_res))
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>
1 rmse    standard   5.38      5 0.192   pre0_mod0_post0
2 rsq     standard   0.899     5 0.00585 pre0_mod0_post0
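Findings: Across the five folds, the mean RMSE is 5.38, so a typical prediction misses the actual exam score by roughly 5.4 points. The mean R-squared is 0.899, so the predictors together account for about 90% of the variance in exam scores.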
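Both metrics are simple to compute by hand. The sketch below uses a small made-up vector of observed and predicted scores (not taken from the cross-validation results). It also shows two common definitions of R-squared: to my understanding, the `rsq` metric reported by tidymodels is the squared correlation between observed and predicted values, which can differ from the traditional 1 - SSR/SST form.

```r
# Hypothetical observed and predicted exam scores, for illustration only
truth    <- c(55, 62, 70, 78, 91)
estimate <- c(58, 60, 73, 75, 88)

# Root mean squared error: square the errors, average them, take the square root
rmse <- sqrt(mean((truth - estimate)^2))

# R-squared as the squared correlation between observed and predicted values
rsq_cor <- cor(truth, estimate)^2

# Traditional R-squared: 1 - SSR / SST
rsq_trad <- 1 - sum((truth - estimate)^2) / sum((truth - mean(truth))^2)

print(c(rmse = rmse, rsq_cor = rsq_cor, rsq_trad = rsq_trad))
```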
Findings (classification model): The logistic regression model used stratified splitting to keep the Pass/Fail ratio balanced across the splits. The confusion matrix and the high ROC AUC value indicate that the model classifies student success well.
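The classification code itself is not shown, so the following is only a sketch of the steps the findings describe: a stratified split, a logistic regression fit, a confusion matrix, and ROC AUC. It uses base R `glm()` instead of the tidymodels workflow mentioned in the report. The data are synthetic, and the single predictor, the 75/25 split, and the 0.5 cutoff are assumptions.

```r
set.seed(123)

# Synthetic data for illustration only (not the Kaggle dataset)
n <- 400
study_hours <- runif(n, 0, 8)
pass <- rbinom(n, 1, plogis(-1 + 0.9 * study_hours))
dat <- data.frame(
  study_hours = study_hours,
  exam_status = factor(ifelse(pass == 1, "Pass", "Fail"), levels = c("Fail", "Pass"))
)

# Stratified 75/25 split: sample 75% of the row indices within each class
train_idx <- unlist(lapply(split(seq_len(n), dat$exam_status),
                           function(idx) sample(idx, floor(0.75 * length(idx)))))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# glm() models the probability of the second factor level ("Pass")
fit <- glm(exam_status ~ study_hours, data = train, family = binomial)
p_pass <- predict(fit, newdata = test, type = "response")
pred <- factor(ifelse(p_pass >= 0.5, "Pass", "Fail"), levels = c("Fail", "Pass"))

# Confusion matrix and accuracy on the held-out rows
conf_mat <- table(Truth = test$exam_status, Prediction = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

# ROC AUC via the rank (Mann-Whitney) formulation
is_pass <- test$exam_status == "Pass"
r <- rank(p_pass)
auc <- (sum(r[is_pass]) - sum(is_pass) * (sum(is_pass) + 1) / 2) /
  (sum(is_pass) * sum(!is_pass))

print(conf_mat)
print(c(accuracy = accuracy, auc = auc))
```

Sampling within each class keeps the Pass share in the training set close to the share in the full data, which matters when one class is much rarer than the other.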