For this project, we are analyzing the “Student Habits and Performance” dataset. This dataset contains records for 1000 students and includes variables that capture demographics, academic habits, lifestyle factors, and academic performance. This dataset is suitable for linear regression because it contains a continuous numerical target variable, exam_score, allowing us to predict specific performance values based on students’ habits. The dataset is also suitable for Logistic Regression since the target variable exam_status is binary.
Descriptive Statistics and Variable Transformations
First, we load the required libraries (including the new visualization packages), import our dataset, and examine its general structure. We also prepare our variables for statistical testing and machine learning.
# Load necessary librarieslibrary(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
# Load the dataset df <-read_csv("/Users/ozanpatlar/Downloads/student_habits_performance.csv")
Rows: 1000 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): student_id, gender, part_time_job, diet_quality, parental_education...
dbl (9): age, study_hours_per_day, social_media_hours, netflix_hours, attend...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Basic cleaning and transformationsdf_clean <- df %>%mutate(gender =as.factor(gender),part_time_job =as.factor(part_time_job),diet_quality =factor(diet_quality, levels =c("Poor", "Fair", "Good")),extracurricular_participation =as.factor(extracurricular_participation),internet_quality =factor(internet_quality, levels =c("Poor", "Average", "Good")),parental_education_level =factor(parental_education_level,levels =c("High School", "Bachelor", "Master")),# Creating target variable for Machine Learning classificationexam_status =ifelse(exam_score >=50, "Pass", "Fail"),exam_status =factor(exam_status, levels =c("Fail", "Pass")) )#Taking a look at the dataset:summary(df_clean)
student_id age gender study_hours_per_day
Length:1000 Min. :17.00 Female:481 Min. :0.00
Class :character 1st Qu.:18.75 Male :477 1st Qu.:2.60
Mode :character Median :20.00 Other : 42 Median :3.50
Mean :20.50 Mean :3.55
3rd Qu.:23.00 3rd Qu.:4.50
Max. :24.00 Max. :8.30
social_media_hours netflix_hours part_time_job attendance_percentage
Min. :0.000 Min. :0.000 No :785 Min. : 56.00
1st Qu.:1.700 1st Qu.:1.000 Yes:215 1st Qu.: 78.00
Median :2.500 Median :1.800 Median : 84.40
Mean :2.506 Mean :1.820 Mean : 84.13
3rd Qu.:3.300 3rd Qu.:2.525 3rd Qu.: 91.03
Max. :7.200 Max. :5.400 Max. :100.00
sleep_hours diet_quality exercise_frequency parental_education_level
Min. : 3.20 Poor:185 Min. :0.000 High School:392
1st Qu.: 5.60 Fair:437 1st Qu.:1.000 Bachelor :350
Median : 6.50 Good:378 Median :3.000 Master :167
Mean : 6.47 Mean :3.042 NA's : 91
3rd Qu.: 7.30 3rd Qu.:5.000
Max. :10.00 Max. :6.000
internet_quality mental_health_rating extracurricular_participation
Poor :162 Min. : 1.000 No :682
Average:391 1st Qu.: 3.000 Yes:318
Good :447 Median : 5.000
Mean : 5.438
3rd Qu.: 8.000
Max. :10.000
exam_score exam_status
Min. : 18.40 Fail:131
1st Qu.: 58.48 Pass:869
Median : 70.50
Mean : 69.60
3rd Qu.: 81.33
Max. :100.00
Does Study Hours significantly predict Exam Score?
Hypothesis: * H0: There is no linear relationship between study hours per day and exam score. * H1: There IS a significant positive linear relationship between study hours per day and exam score.
We are using a correlation test and simple lineer regression since two variables are continious.
# Why: Both variables are continuous; we want to model how# one predicts the other and test the slope coefficient.# Linear Regression Testlm_h1 <-lm(exam_score ~ study_hours_per_day, data = h1_data)summary(lm_h1)
Call:
lm(formula = exam_score ~ study_hours_per_day, data = h1_data)
Residuals:
Min 1Q Median 3Q Max
-25.979 -6.626 0.236 6.537 34.319
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.9102 0.7893 45.50 <2e-16 ***
study_hours_per_day 9.4903 0.2055 46.19 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.539 on 998 degrees of freedom
Multiple R-squared: 0.6813, Adjusted R-squared: 0.681
F-statistic: 2134 on 1 and 998 DF, p-value: < 2.2e-16
# Visualizationggplot(h1_data, aes(x = study_hours_per_day, y = exam_score)) +geom_point(aes(color = exam_score), alpha =0.5, size =2) +geom_smooth(method ="lm", se =TRUE, color ="firebrick") +scale_color_viridis_c(option ="C") +labs(title ="Study Hours vs Exam Score", subtitle ="Linear regression fit with 95% CI",x ="Study Hours per Day", y ="Exam Score", color ="Score") +theme_minimal()
Comment: We used Simple Linear Regression and a Pearson Correlation test. If the p-value is < 0.05, we reject H0. A positive slope indicates that more study hours directly predict higher exam scores. p<0.05 H0 reject
Hypothesis 2: Gender vs. Exam Score
Do male and female students differ in Exam Score?
Hypothesis: * H0: Mean exam scores are equal for male and female students. * H1: Mean exam scores differ between male and female students.
We are using a T-test since we are comparing means of one continious variable and one categorical variable between two independent groups.
# Summary Statsdf_clean %>%group_by(gender) %>%get_summary_stats(exam_score, type ="common")
# A tibble: 3 × 11
gender variable n min max median iqr mean sd se ci
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female exam_score 481 18.4 100 70.7 23.6 69.7 16.9 0.771 1.51
2 Male exam_score 477 23.1 100 70.2 22.2 69.4 17.2 0.785 1.54
3 Other exam_score 42 43.9 100 69 19.5 70.6 13.8 2.12 4.29
#Why: Comparing means of one continuous variable (exam_score)# bet ween two independent groups (Male/Female). Assumes# approximate normality (large sample → CLT applies).# T-test# ── Normality check ─────────────────────────────────────────df %>%group_by(gender) %>%shapiro_test(exam_score)
# A tibble: 3 × 4
gender variable statistic p
<chr> <chr> <dbl> <dbl>
1 Female exam_score 0.986 0.000170
2 Male exam_score 0.985 0.0000846
3 Other exam_score 0.975 0.490
# A tibble: 3 × 7
.y. group1 group2 effsize n1 n2 magnitude
* <chr> <chr> <chr> <dbl> <int> <int> <ord>
1 exam_score Female Male 0.0219 481 477 negligible
2 exam_score Female Other -0.0588 481 42 negligible
3 exam_score Male Other -0.0823 477 42 negligible
# Visualization (ggpubr)ggboxplot(df_clean, x ="gender", y ="exam_score",color ="gender", palette ="jco",add ="jitter", shape ="gender") +stat_compare_means(method ="t.test", label ="p.signif",comparisons =list(c("Female", "Male"))) +facet_wrap(~part_time_job) +labs(title ="Exam Score by Gender", subtitle ="Faceted by Part-Time Job",x ="Gender", y ="Exam Score")
Comment: We applied an Independent Samples t-test. If p < 0.05, there is a significant gender difference in exam scores. Cohen’s d indicates the practical magnitude of that difference. p=0.735>0.05 fail to reject H0
Hypothesis 3: Diet Quality vs. Mental Health Rating
Does Diet Quality affect Mental Health?
Hypothesis: * H0: Mean mental health ratings are equal across all diet quality groups (Poor, Fair, Good). * H1: At least one diet quality group has a different mean mental health rating.
We are using an ANOVA test since we are comparing means of a continuous variable across 3+ # independent groups.
# Summary Statsdf_clean %>%group_by(diet_quality) %>%get_summary_stats(mental_health_rating, type ="common")
# A tibble: 3 × 11
diet_quality variable n min max median iqr mean sd se ci
<fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Poor mental_he… 185 1 10 6 5 5.56 2.79 0.205 0.405
2 Fair mental_he… 437 1 10 5 5 5.21 2.88 0.138 0.271
3 Good mental_he… 378 1 10 6 5 5.65 2.82 0.145 0.286
# Why: Comparing means of a continuous variable across 3+# independent groups. ANOVA generalizes the t-test.# ANOVA Test# ── Normality & Levene test ──────────────────────────────────df %>%group_by(diet_quality) %>%shapiro_test(mental_health_rating)
# A tibble: 3 × 4
diet_quality variable statistic p
<chr> <chr> <dbl> <dbl>
1 Fair mental_health_rating 0.933 4.33e-13
2 Good mental_health_rating 0.941 4.10e-11
3 Poor mental_health_rating 0.935 2.25e- 7
# Visualizationggplot(df_clean, aes(x = diet_quality, y = mental_health_rating,fill = diet_quality, color = diet_quality)) +geom_violin(alpha =0.4, trim =FALSE) +geom_boxplot(width =0.15, alpha =0.8, outlier.shape =NA) +facet_wrap(~gender) +scale_fill_brewer(palette ="Set2") +scale_color_brewer(palette ="Set2") +labs(title ="Mental Health Rating by Diet Quality", subtitle ="Faceted by Gender",x ="Diet Quality", y ="Mental Health Rating (1–10)") +theme_minimal() +theme(legend.position ="bottom")
Comment: We used a One-Way ANOVA test. A significant ANOVA (p < 0.05) implies at least one group differs, meaning diet quality has an impact on mental health. p=0.178>0.05 fail to reject H0
Hypothesis 4: Part-Time Job vs. Attendance Percentage
Is there an association between having a Part Time job and Attendance?
Hypothesis: * H0: Mean attendance percentage is equal for students with and without part-time jobs. * H1: Mean attendance percentage differs between students with and without part-time jobs.
We are using a T-test since we are comparing means of one continious variable and one categorical variable between two independent groups.
# Why: Comparing means of one continuous variable between two# independent groups (Yes/No part-time job).# T-test# ── Normality ───────────────────────────────────────────────df %>%group_by(part_time_job) %>%shapiro_test(attendance_percentage)
# A tibble: 2 × 4
part_time_job variable statistic p
<chr> <chr> <dbl> <dbl>
1 No attendance_percentage 0.981 0.0000000134
2 Yes attendance_percentage 0.982 0.00648
# Visualization (ggpubr)ggviolin(df_clean %>%filter(!is.na(part_time_job)),x ="part_time_job", y ="attendance_percentage",fill ="part_time_job", color ="part_time_job",palette ="npg", add ="boxplot", add.params =list(fill ="white")) +stat_compare_means(method ="t.test", label ="p.format", label.x =1.5) +facet_grid(~gender) +labs(title ="Attendance % by Part-Time Job Status", subtitle ="Faceted by Gender",x ="Has Part-Time Job", y ="Attendance Percentage (%)") +theme(legend.position ="none")
Comment: We used an Independent Samples t-test. If p < 0.05, having a part-time job significantly affects attendance, likely due to time constraints. p=0.178>0.05 fail to reject H0
Hypothesis 5: Parental Education vs. Exam Score
Does Parental Education Level influence Exam Score?
Hypothesis: * H0: Mean exam scores are equal across all parental education levels. * H1: At least one parental education level group has a different mean exam score.
# A tibble: 4 × 4
parental_education_level n mean_score sd_score
<fct> <int> <dbl> <dbl>
1 High School 392 69.5 16.8
2 Bachelor 350 70.3 17.3
3 Master 167 68.1 16.4
4 <NA> 91 70.0 16.6
# Top scorer per parental education leveldf_clean %>%group_by(parental_education_level) %>%slice_max(order_by = exam_score, n =1) %>%select(student_id, parental_education_level, exam_score, study_hours_per_day)
# A tibble: 48 × 4
# Groups: parental_education_level [4]
student_id parental_education_level exam_score study_hours_per_day
<chr> <fct> <dbl> <dbl>
1 S1001 High School 100 6.9
2 S1131 High School 100 7.2
3 S1230 High School 100 4.2
4 S1250 High School 100 5.5
5 S1335 High School 100 6.5
6 S1356 High School 100 5.8
7 S1455 High School 100 8.3
8 S1489 High School 100 7.4
9 S1579 High School 100 6.1
10 S1611 High School 100 6
# ℹ 38 more rows
Comment: A Kruskall Wallis test was utilized. Since p=0.224, we fail to reject H0. parental education level was found to have no statistically significant effect on exam scores.
Before the analysis, normality was checked using the Shapiro-Wilk test. The results showed that the normality assumption was not fully satisfied because some groups had p-values below 0.05. Therefore, a Kruskal-Wallis test was also performed as a non-parametric alternative. Kruskall-Wallis test produced p = 0.224. Since p-value was greater than 0.05, parental education level was found to have no statistically significant effect on exam scores.
PART C: MACHINE LEARNING APPLICATION PART
For this section, we utilize the tidymodels framework. To predict the continuous “exam_score”, we selected Multiple Linear Regression. To predict the binary “exam_status” (Pass/Fail), we selected Logistic Regression.
h) Method Definitions and Logic
1. Regression Model: Multiple Linear Regression
Definition: A fundamental machine learning technique used to model the linear relationship between a single continuous dependent variable and multiple independent variables.
Purpose: To predict the continuous value of a target variable and infer the strength of relationships.
Logic: Based on Ordinary Least Squares (OLS), it minimizes the Sum of Squared Residuals (SSR) between actual and predicted values.
Formula: Y = b0 + (b1 * X1) + (b2 * X2) + … + (bp * Xp) + e
Use Cases: When the target is continuous (exam scores) and high interpretability is needed.
2. Classification Model: Logistic Regression
Definition: A classification algorithm used to model the probability of a discrete, binary outcome.
Purpose: To classify observations into specific classes and estimate the probability of belonging to a class.
Logic: Uses a logistic (sigmoid) function to constrain the output between 0 and 1.
Formula: P = 1 / (1 + e^-(b0 + b1X1 + … + bpXp))
Use Cases: Binary classification problems (Pass/Fail) where calibrated probabilities are required.
i) Reasoning for Method Choice
Multiple Linear Regression:exam_score is continuous. Regression is mandatory to see how 1-unit changes in daily habits affect the final score.
Logistic Regression:exam_status is a binary categorical variable. We need a model that provides the precise probability of a student passing.
j) Machine Learning Models Implementation
# Remove student_id for ML processingdf_ml <- df_clean %>%select(-student_id)
Model 1: Regression (with V-Fold Cross-Validation)
set.seed(123)folds <-vfold_cv(df_ml, v =5)lm_recipe <-recipe(exam_score ~ ., data = df_ml) %>%step_rm(exam_status) %>%step_dummy(all_nominal_predictors(), one_hot=TRUE) %>%step_normalize(all_numeric_predictors())lm_model <-linear_reg() %>%set_engine("lm") %>%set_mode("regression")lm_workflow <-workflow() %>%add_model(lm_model) %>%add_recipe(lm_recipe)lm_res <-fit_resamples(lm_workflow, resamples = folds, control =control_resamples(save_pred =TRUE))print(collect_metrics(lm_res))
# A tibble: 2 × 6
.metric .estimator mean n std_err .config
<chr> <chr> <dbl> <int> <dbl> <chr>
1 rmse standard 5.44 5 0.172 pre0_mod0_post0
2 rsq standard 0.898 5 0.00544 pre0_mod0_post0
Findings:
RMSE (Root Mean Squared Error): The Root Mean Squared Error (RMSE) value of 5.4394 indicates that our model’s predictions deviate from the actual exam score by an average of 5.4394 units.
R-Squared: The R-squared value of 0.8983 demonstrates that approximately %89.83 of the variance in exam scores can be explained by the behavioral and academic habits included in our model.
Model 2: Classification (with Stratified Split)
set.seed(456)data_split <-initial_split(df_ml, prop =0.80, strata = exam_status)train_data <-training(data_split)test_data <-testing(data_split)log_recipe <-recipe(exam_status ~ ., data = train_data) %>%step_rm(exam_score) %>%step_dummy(all_nominal_predictors()) %>%step_normalize(all_numeric_predictors())log_model <-logistic_reg() %>%set_engine("glm") %>%set_mode("classification")log_workflow <-workflow() %>%add_model(log_model) %>%add_recipe(log_recipe)log_fit <-fit(log_workflow, data = train_data)#Predict classes on test dataclass_preds <-predict(log_fit, new_data = test_data)#Predict probabilities on test data (required for ROC AUC and ROC Curve)prob_preds <-predict(log_fit, new_data = test_data, type ="prob")#Bind predictions to RAW test datalog_res <- test_data %>%bind_cols(class_preds) %>%bind_cols(prob_preds)#Check classification metrics (Accuracy and Kappa)log_res %>%accuracy(truth = exam_status, estimate = .pred_class) %>%print()
Detailed Classification Metrics (Accuracy) and Cohen’s Kappa:
The model achieved an overall accuracy of 92.89%, indicating a reliable performance in predicting whether a student will pass or fail.
The Cohen’s Kappa value of 0.71 indicates a substantial level of agreement between the model’s predictions and the actual exam status, demonstrating strong predictive performance beyond what would be expected by random chance.”
Confusion Matrix:
The logistic regression model correctly classified 170 students, but misclassified 13 cases.
ROC AUC:
The ROC AUC score of 0.9514 demonstrates that the model has a strong capability to correctly distinguish between passing and failing students.
Sensitivity / True Positive Rate: The steep climb on the y-axis indicates that the model successfully and correctly identifies the vast majority of the actual “Pass” cases.
1 - Specificity / False Positive Rate: The x-axis values show that the model maintains an exceptionally low probability of incorrectly predicting a “Pass” for an instance that is actually a “Fail”.
Comment: A Kruskall Wallis test was utilized. Since p=0.224, we fail to reject H0. parental education level was found to have no statistically significant effect on exam scores.
Before the analysis, normality was checked using the Shapiro-Wilk test. The results showed that the normality assumption was not fully satisfied because some groups had p-values below 0.05. Therefore, a Kruskal-Wallis test was also performed as a non-parametric alternative. Kruskall-Wallis test produced p = 0.224. Since p-value was greater than 0.05, parental education level was found to have no statistically significant effect on exam scores.