R.project.Student

PART A: Introducing the Dataset

Data Source and Explanation

For this project, we are analyzing the “Student Habits and Performance” dataset. This dataset contains records for 1,000 students and includes variables that capture demographics, academic habits, lifestyle factors, and academic performance measured by exam_score. Our main goal is to explore how these daily habits and lifestyle choices relate to a student’s final exam score.

Descriptive Statistics

First, we load the required libraries, import our dataset, and examine its general structure and descriptive statistics.

# Load necessary libraries
library(tidyverse)
library(rstatix)
library(ggpubr)

# Load the dataset 
df <- read_csv("student_habits_performance.csv")

# Display descriptive statistics for numerical variables
get_summary_stats(df, type = "common")
# A tibble: 9 × 10
  variable                  n   min   max median   iqr  mean    sd    se    ci
  <fct>                 <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 age                    1000  17    24     20    4.25 20.5   2.31 0.073 0.143
2 study_hours_per_day    1000   0     8.3    3.5  1.9   3.55  1.47 0.046 0.091
3 social_media_hours     1000   0     7.2    2.5  1.6   2.51  1.17 0.037 0.073
4 netflix_hours          1000   0     5.4    1.8  1.52  1.82  1.08 0.034 0.067
5 attendance_percentage  1000  56   100     84.4 13.0  84.1   9.40 0.297 0.583
6 sleep_hours            1000   3.2  10      6.5  1.7   6.47  1.23 0.039 0.076
7 exercise_frequency     1000   0     6      3    4     3.04  2.02 0.064 0.126
8 mental_health_rating   1000   1    10      5    5     5.44  2.85 0.09  0.177
9 exam_score             1000  18.4 100     70.5 22.8  69.6  16.9  0.534 1.05 
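
The se and ci columns of the table above can be reproduced by hand, which makes them easier to interpret. The sketch below assumes that ci is the half-width of a 95% t-based confidence interval for the mean, that is qt(0.975, n - 1) * se. The inputs are copied from the rounded exam_score row, so the results agree with the table only to the printed precision.

```r
# Values copied from the exam_score row of the summary table (rounded)
n_exam  <- 1000
sd_exam <- 16.9

# Standard error of the mean
se_exam <- sd_exam / sqrt(n_exam)

# Half-width of the 95% confidence interval (assumed definition of `ci`)
ci_exam <- qt(0.975, df = n_exam - 1) * se_exam

round(c(se = se_exam, ci = ci_exam), 3)
```

Read this way, the mean exam score of 69.6 carries a 95% interval of roughly 69.6 plus or minus the ci value.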

Variable Transformations

df_clean <- df %>%
  mutate(
    # Treat the rating-style variables as ordered categories rather than numbers
    mental_health_rating = factor(mental_health_rating, ordered = TRUE),
    exercise_frequency = factor(exercise_frequency, ordered = TRUE),
    # Derive a pass/fail indicator using a cut-off of 50 points
    exam_status = ifelse(exam_score >= 50, "Pass", "Fail"),
    exam_status = factor(exam_status, levels = c("Fail", "Pass"))
  )

PART B: Hypothesis Testing and Visualizations

Data Manipulation (dplyr)

# H1 Data: Gender
h1_data <- df_clean %>%
  select(student_id, gender, exam_score) %>%
  filter(gender %in% c("Male", "Female"))

# H4 Data: Part-Time Job
h4_data <- df_clean %>%
  select(part_time_job, attendance_percentage) %>%
  filter(!is.na(part_time_job))

# H5 Data: Parental Education
h5_data <- df_clean %>%
  filter(!is.na(parental_education_level)) %>%
  select(parental_education_level, exam_score)

Descriptive Statistics (rstatix)

# Summary Stats for H1 (Gender)
h1_stats <- h1_data %>%
  group_by(gender) %>%
  get_summary_stats(exam_score, type = "common")
print(h1_stats)
# A tibble: 2 × 11
  gender variable       n   min   max median   iqr  mean    sd    se    ci
  <chr>  <fct>      <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Female exam_score   481  18.4   100   70.7  23.6  69.7  16.9 0.771  1.51
2 Male   exam_score   477  23.1   100   70.2  22.2  69.4  17.2 0.785  1.54

# Frequency Table for H2 (Diet Quality)
h2_freq <- df_clean %>%
  freq_table(diet_quality)
print(h2_freq)
# A tibble: 3 × 3
  diet_quality     n  prop
  <chr>        <int> <dbl>
1 Fair           437  43.7
2 Good           378  37.8
3 Poor           185  18.5

# Summary Stats for H4 (Part-Time Job)
h4_stats <- h4_data %>%
  group_by(part_time_job) %>%
  get_summary_stats(attendance_percentage, type = "robust")
print(h4_stats)
# A tibble: 2 × 5
  part_time_job variable                  n median   iqr
  <chr>         <fct>                 <dbl>  <dbl> <dbl>
1 No            attendance_percentage   785   84.5  13.5
2 Yes           attendance_percentage   215   83.9  10.8

# Frequency Table for H5 (Parental Education)
h5_freq <- h5_data %>%
  freq_table(parental_education_level)
print(h5_freq)
# A tibble: 4 × 3
  parental_education_level     n  prop
  <chr>                    <int> <dbl>
1 Bachelor                   350  35  
2 High School                392  39.2
3 Master                     167  16.7
4 None                        91   9.1

Visualizations

Plot 1: Study Hours vs. Exam Scores (ggplot2)

df_plot_h3 <- df_clean %>%
  mutate(study_intensity = ifelse(study_hours_per_day >= 4, "Heavy", "Light"))

ggplot(df_plot_h3, aes(x = study_hours_per_day, y = exam_score, 
                       color = gender, shape = part_time_job)) +
  geom_point(alpha = 0.7, size = 3) +
  facet_wrap(~ study_intensity, scales = "free_x") +
  scale_color_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9", "Other" = "#009E73")) +
  labs(title = "Impact of Study Hours on Exam Scores by Intensity",
       x = "Study Hours per Day", y = "Exam Score", color = "Gender", shape = "Part-Time Job") +
  theme_minimal()

Plot 2: Parental Education vs. Exam Scores (ggplot2)

ggplot(h5_data, aes(x = parental_education_level, y = exam_score, fill = parental_education_level)) +
  geom_boxplot(alpha = 0.8) +
  labs(title = "Exam Scores by Parental Education Level",
       x = "Parental Education Level", y = "Exam Score", fill = "Education Level") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
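
Because parental_education_level is a character column, ggplot2 arranges the boxes alphabetically (Bachelor, High School, Master, None), which hides the natural ordering of the levels. A minimal sketch of one way to fix the order, assuming None < High School < Bachelor < Master is the intended ranking and using a small made-up vector in place of the real column:

```r
# Hypothetical toy values standing in for parental_education_level
edu <- c("Master", "None", "High School", "Bachelor", "High School")

# Declare the levels explicitly so plots and tables follow this order
edu_ordered <- factor(
  edu,
  levels = c("None", "High School", "Bachelor", "Master"),
  ordered = TRUE
)

levels(edu_ordered)
table(edu_ordered)
```

Applying the same factor() call inside mutate() on h5_data before plotting should order the boxes accordingly.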

Plot 3: Diet Quality vs. Exam Scores (ggpubr)

h2_ggpubr <- ggboxplot(
  df_clean, 
  x = "diet_quality", 
  y = "exam_score",
  color = "gender", 
  shape = "gender",
  palette = c("#00AFBB", "#E7B800", "#FC4E07"), 
  facet.by = "part_time_job",        
  add = "jitter",                    
  short.panel.labs = FALSE
) +
  stat_compare_means(method = "anova", label.y = 110) + 
  labs(
    title = "Effect of Diet Quality on Exam Scores",
    subtitle = "Faceted by Part-Time Job Status, colored by Gender",
    x = "Diet Quality",
    y = "Exam Score"
  )

print(h2_ggpubr)

Formal Hypothesis Testing

In this section, we conduct formal statistical tests for each of our hypotheses using the rstatix package. We evaluate the p-values against a standard significance level of \(\alpha = 0.05\).

H1: Gender vs. Exam Scores (Independent t-test)

res_h1 <- df_clean %>% filter(gender %in% c("Male", "Female")) %>% t_test(exam_score ~ gender)
print(res_h1)
# A tibble: 1 × 8
  .y.        group1 group2    n1    n2 statistic    df     p
* <chr>      <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
1 exam_score Female Male     481   477     0.339  955. 0.735

Reasoning: We used an independent-samples t-test because we are comparing the mean of a continuous variable (exam score) between two independent groups (Male vs. Female). By default, t_test() in rstatix runs Welch's version of the test, which does not assume equal variances in the two groups.

Hypothesis:

* \(H_0\): Mean exam scores are equal for male and female students.
* \(H_1\): Mean exam scores differ between male and female students.

Comment: The test yielded t = 0.339 and a p-value of 0.735. Since p > 0.05, we fail to reject the null hypothesis (\(H_0\)). The data provide no evidence of a difference in mean exam scores between male and female students; the group means (69.7 vs. 69.4) are nearly identical.
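
The degrees of freedom of about 955 in the output (rather than n1 + n2 - 2 = 956) reflect the Welch correction. As a rough check, the sketch below rebuilds the Welch statistic, the Welch-Satterthwaite degrees of freedom, and a pooled-SD Cohen's d from the rounded group summaries in h1_stats. Because the means are rounded to one decimal, the t value will not match the printed 0.339 exactly; the degrees of freedom should land close to 955.

```r
# Group summaries copied from h1_stats (rounded to the printed precision)
n_f <- 481; mean_f <- 69.7; sd_f <- 16.9
n_m <- 477; mean_m <- 69.4; sd_m <- 17.2

# Squared standard error of each group mean
v_f <- sd_f^2 / n_f
v_m <- sd_m^2 / n_m

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_welch  <- (mean_f - mean_m) / sqrt(v_f + v_m)
df_welch <- (v_f + v_m)^2 / (v_f^2 / (n_f - 1) + v_m^2 / (n_m - 1))

# Cohen's d with a pooled standard deviation, as a rough effect size
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2))
d <- (mean_f - mean_m) / sd_pooled

round(c(t = t_welch, df = df_welch, d = d), 3)
```

A Cohen's d this close to zero supports reading the result as a negligible difference, not merely a non-significant one.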


H2: Diet Quality vs. Exam Scores (One-way ANOVA)

res_h2 <- df_clean %>% anova_test(exam_score ~ diet_quality)
print(res_h2)
ANOVA Table (type II tests)

        Effect DFn DFd     F     p p<.05   ges
1 diet_quality   2 997 1.266 0.282       0.003

Reasoning: We selected a One-way ANOVA test because we are comparing the means of a continuous variable (exam score) across three independent categorical groups (Poor, Fair, Good).

Hypothesis:

* \(H_0\): Mean exam scores are equal across all diet quality groups.
* \(H_1\): At least one group mean differs from the others.

Comment: The ANOVA resulted in F(2, 997) = 1.266, p = 0.282. Since p > 0.05, we fail to reject \(H_0\). We find no statistically significant difference in mean exam scores across the diet quality groups in this dataset, and the effect size is negligible (ges = 0.003).
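
The ges column reports generalized eta squared. For a one-way between-subjects design like this one it coincides with ordinary eta squared, which can be recovered from the F statistic and its degrees of freedom. The helper below (eta_sq_from_f is a name introduced here for illustration, not an rstatix function) sketches that arithmetic with the values printed in the ANOVA table.

```r
# Recover eta squared from a one-way ANOVA's F statistic and degrees of freedom
# (hypothetical helper, not part of rstatix)
eta_sq_from_f <- function(f, df_effect, df_error) {
  (f * df_effect) / (f * df_effect + df_error)
}

# Values copied from the res_h2 table
eta_h2 <- eta_sq_from_f(f = 1.266, df_effect = 2, df_error = 997)
round(eta_h2, 3)
```

The result says that diet quality accounts for well under 1% of the variance in exam scores.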


H3: Study Hours vs. Exam Scores (Pearson Correlation)

res_h3 <- df_clean %>% cor_test(study_hours_per_day, exam_score, method = "pearson")
print(res_h3)
# A tibble: 1 × 8
  var1                var2      cor statistic        p conf.low conf.high method
  <chr>               <chr>   <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr> 
1 study_hours_per_day exam_s…  0.83      46.2 4.6e-250    0.805     0.844 Pears…

Reasoning: We applied a Pearson Correlation test to determine the strength and direction of the linear relationship between two continuous numerical variables (study hours and exam score).

Hypothesis:

* \(H_0\): There is no linear correlation between study hours per day and exam scores (\(\rho = 0\)).
* \(H_1\): There is a linear correlation (\(\rho \neq 0\)).

Comment: The test shows a strong positive correlation (r = 0.83, 95% CI 0.805 to 0.844) with an extremely small p-value (p < 0.001). We reject \(H_0\). Students who study more hours per day tend to score higher on the exam. Because the data are observational, this is an association and does not by itself establish that additional study time causes higher scores.
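
The test statistic and the correlation coefficient are linked by \(t = r\sqrt{n-2}/\sqrt{1-r^2}\), so either can be recovered from the other. The sketch below inverts that relation using the printed t value and n = 1000 to obtain r with more precision than the rounded 0.83, then squares it to estimate the share of variance in exam_score that is linearly associated with study hours.

```r
# Values copied from the res_h3 output
t_stat <- 46.2
n_obs  <- 1000

# Invert t = r * sqrt(n - 2) / sqrt(1 - r^2) to recover r
r_from_t <- t_stat / sqrt(t_stat^2 + (n_obs - 2))

# Coefficient of determination: share of variance in exam_score
# linearly associated with study_hours_per_day
r_squared <- r_from_t^2

round(c(r = r_from_t, r_squared = r_squared), 3)
```

The recovered r rounds to the 0.83 shown in the output, and the squared value suggests that roughly two thirds of the variation in exam scores moves together with study time.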


H4: Part-Time Job vs. Attendance (Independent t-test)

res_h4 <- df_clean %>% filter(!is.na(part_time_job)) %>% t_test(attendance_percentage ~ part_time_job)
print(res_h4)
# A tibble: 1 × 8
  .y.                   group1 group2    n1    n2 statistic    df     p
* <chr>                 <chr>  <chr>  <int> <int>     <dbl> <dbl> <dbl>
1 attendance_percentage No     Yes      785   215      1.35  351. 0.178

Reasoning: An Independent Samples t-test was chosen to compare the mean attendance percentage (continuous) between students who have a part-time job and those who do not (two categorical groups).

Hypothesis:

* \(H_0\): Mean attendance is the same for students with and without a part-time job.
* \(H_1\): Mean attendance differs between students with and without a part-time job.

Comment: With t = 1.35 and p = 0.178 (> 0.05), we fail to reject \(H_0\). We find no statistically significant difference in mean attendance percentage between students with and without a part-time job.
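
Attendance is capped at 100%, and the descriptive statistics for this hypothesis were medians and IQRs, so a rank-based test is a natural robustness check alongside the t-test. The sketch below runs a Wilcoxon rank-sum test with base R on simulated data. The data frame sim, its means, and its standard deviation are assumptions chosen only to resemble the group sizes above; it is not the real dataset, and its p-value says nothing about the actual students. On the real data, rstatix's wilcox_test(attendance_percentage ~ part_time_job) would be the tidy equivalent.

```r
set.seed(42)  # reproducible simulated data, not the real dataset

# Simulated attendance percentages; group sizes mirror h4_stats,
# means and SD are illustrative assumptions
sim <- data.frame(
  part_time_job = rep(c("No", "Yes"), times = c(785, 215)),
  attendance_percentage = pmin(100, c(rnorm(785, mean = 84.5, sd = 9.4),
                                      rnorm(215, mean = 83.9, sd = 9.4)))
)

# Wilcoxon rank-sum (Mann-Whitney) test as a rank-based robustness check
res_sim <- wilcox.test(attendance_percentage ~ part_time_job, data = sim)
res_sim$p.value
```

If the rank-based test and the t-test lead to the same decision on the real data, the conclusion does not hinge on the normality assumption.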


H5: Parental Education vs. Exam Scores (One-way ANOVA)

res_h5 <- h5_data %>% anova_test(exam_score ~ parental_education_level)
print(res_h5)
ANOVA Table (type II tests)

                    Effect DFn DFd     F     p p<.05   ges
1 parental_education_level   3 996 0.653 0.581       0.002

Reasoning: We used a One-way ANOVA to check for mean differences in exam scores (continuous) across four categorical levels of parental education (None, High School, Bachelor, Master).

Hypothesis:

* \(H_0\): Mean exam scores are equal regardless of parental education level.
* \(H_1\): At least one parental education group has a different mean exam score.

Comment: The ANOVA resulted in F(3, 996) = 0.653, p = 0.581. Since p > 0.05, we fail to reject \(H_0\). We find no evidence that mean exam scores differ by parental education level in this sample, and the effect size is negligible (ges = 0.002).