PART A: Introducing the Dataset

Data Source and Explanation

For this project, we are analyzing the “Student Habits and Performance” dataset. This dataset contains records for 1,000 students and includes 16 variables that capture demographics (e.g., age, gender), academic habits (e.g., study hours, attendance), lifestyle factors (e.g., sleep hours, social media usage, diet quality), and academic performance measured by exam_score. Our main goal is to explore how these daily habits and lifestyle choices relate to a student’s final exam score.

Descriptive Statistics

First, we load the required libraries, import our dataset, and examine its general structure and descriptive statistics.

```{r setup, message=FALSE, warning=FALSE}
# Load necessary libraries
library(tidyverse)
library(rstatix)
library(skimr)

# Load the dataset
df <- read_csv("student_habits_performance.csv")

# Display general structure
glimpse(df)

# Display descriptive statistics for numerical variables
get_summary_stats(df, type = "common")

df_clean <- df %>%
  mutate(
    # Converting specific numerical rating/frequency variables to ordered factors
    mental_health_rating = factor(mental_health_rating, ordered = TRUE),
    exercise_frequency = factor(exercise_frequency, ordered = TRUE),
    # Converting a continuous numerical variable to a categorical one (Pass/Fail)
    exam_status = ifelse(exam_score >= 50, "Pass", "Fail"),
    exam_status = factor(exam_status, levels = c("Fail", "Pass"))
  )

# Check the transformed variables
df_clean %>%
  select(mental_health_rating, exercise_frequency, exam_score, exam_status) %>%
  head()
```

PART B: Data Manipulation, Visualization, and Hypothesis Testing

In this section, we investigate five distinct hypotheses to understand the underlying patterns in student performance and habits. For each investigation, we define our null ($H_0$) and alternative ($H_1$) hypotheses.

Selected Hypotheses

Hypothesis 1: Impact of Gender on Exam Scores (Two-Sample Test)

We want to investigate whether academic performance differs by gender.

* $H_0$: The true mean exam scores of male and female students are equal. ($\mu_{male} = \mu_{female}$)
* $H_1$: The true mean exam scores of male and female students differ. ($\mu_{male} \neq \mu_{female}$)
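
As a minimal sketch of how this two-sided comparison could be run, the example below applies base R's `t.test()` to simulated scores. By default `t.test()` uses Welch's version, which does not assume equal variances. The data frame `sim`, its values, and the 5% threshold are illustrative assumptions; only the column names `gender` and `exam_score` come from the dataset.

```r
# Simulated stand-in for the real data (values are invented for illustration)
set.seed(42)
sim <- data.frame(
  gender = rep(c("Female", "Male"), each = 50),
  exam_score = c(rnorm(50, mean = 70, sd = 15), rnorm(50, mean = 69, sd = 15))
)

# Two-sided Welch two-sample t-test of H0: mu_male = mu_female
h1_test <- t.test(exam_score ~ gender, data = sim, alternative = "two.sided")

h1_test$statistic  # t statistic
h1_test$p.value    # reject H0 at the 5% level only if this is below 0.05
```

Welch's test is a cautious default here because the two gender groups need not have the same spread of scores.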

Hypothesis 2: Effect of Diet Quality on Exam Scores (ANOVA)

Diet quality (e.g., Poor, Fair, Good) might play a role in cognitive function and, consequently, exam scores.

* $H_0$: The true mean exam scores are equal across all diet quality levels. ($\mu_{Poor} = \mu_{Fair} = \mu_{Good}$)
* $H_1$: At least one diet quality level has a true mean exam score that differs from the others.
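
One way to test this kind of hypothesis is a one-way ANOVA with base R's `aov()`. The sketch below runs it on simulated data; `sim` and its values are invented for illustration, and the three level names are taken from the example levels mentioned above rather than verified against the file.

```r
# Simulated stand-in for the real data (values are invented for illustration)
set.seed(42)
sim <- data.frame(
  diet_quality = factor(rep(c("Poor", "Fair", "Good"), each = 40),
                        levels = c("Poor", "Fair", "Good")),
  exam_score = rnorm(120, mean = 70, sd = 15)
)

# One-way ANOVA of H0: mu_Poor = mu_Fair = mu_Good
h2_fit <- aov(exam_score ~ diet_quality, data = sim)
h2_table <- summary(h2_fit)[[1]]  # ANOVA table as a data frame

h2_table[["F value"]][1]  # F statistic for diet_quality
h2_table[["Pr(>F)"]][1]   # p-value for the overall test
```

A small p-value only indicates that some group mean differs. It does not say which one.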

Hypothesis 3: Relationship Between Study Hours and Exam Scores (Correlation)

It is generally expected that more study hours go along with better exam scores. We will test this linear association.

* $H_0$: There is no linear correlation between daily study hours and exam scores. ($\rho = 0$)
* $H_1$: There is a nonzero linear correlation between daily study hours and exam scores. ($\rho \neq 0$)
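
A hedged sketch of the corresponding test uses base R's `cor.test()` with the Pearson method. The two vectors below are simulated, and the slope and noise level are arbitrary assumptions chosen only so that the example has a visible positive association.

```r
# Simulated stand-in for the real columns (values are invented for illustration)
set.seed(42)
study_hours_per_day <- runif(100, min = 0, max = 8)
exam_score <- 40 + 6 * study_hours_per_day + rnorm(100, sd = 10)

# Two-sided Pearson correlation test of H0: rho = 0
h3_test <- cor.test(study_hours_per_day, exam_score,
                    method = "pearson", alternative = "two.sided")

h3_test$estimate  # sample correlation coefficient r
h3_test$p.value   # p-value for H0: rho = 0
h3_test$conf.int  # 95% confidence interval for rho
```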

Hypothesis 4: Part-Time Job Influence on Attendance (Two-Sample Test)

Working a part-time job might limit a student’s ability to attend classes regularly.

* $H_0$: The true mean attendance percentage is the same for students with and without part-time jobs. ($\mu_{Job} = \mu_{NoJob}$)
* $H_1$: The true mean attendance percentage differs between students with and without part-time jobs. ($\mu_{Job} \neq \mu_{NoJob}$)

Hypothesis 5: Parental Education Level and Student Performance (ANOVA)

Socioeconomic factors, represented here by parental education level, often influence student academic outcomes.

* $H_0$: The true mean exam scores are identical across different parental education levels (e.g., High School, Bachelor’s, Master’s).
* $H_1$: The true mean exam score for at least one parental education level differs from the others.

Data Manipulation for Hypotheses (dplyr)

Before running our statistical tests and visualizations, we need to prepare our data using dplyr functions to ensure clean and targeted analysis for each hypothesis.

Prep for Hypothesis 1: Gender vs. Exam Scores

Here, we isolate the relevant columns and ensure there are no unexpected categories in the gender variable.

```{r dplyr-h1, message=FALSE}
# Using select() and filter()
h1_data <- df_clean %>%
  select(student_id, gender, exam_score) %>%
  filter(gender %in% c("Male", "Female"))

head(h1_data, 3)

# Using group_by(), summarise(), and arrange()
h2_summary <- df_clean %>%
  group_by(diet_quality) %>%
  summarise(
    mean_exam_score = mean(exam_score, na.rm = TRUE),
    student_count = n()
  ) %>%
  arrange(desc(mean_exam_score))

print(h2_summary)

# Using mutate() and select()
h3_data <- df_clean %>%
  select(student_id, study_hours_per_day, exam_score) %>%
  mutate(study_intensity = ifelse(study_hours_per_day >= 4, "Heavy", "Light"))

head(h3_data, 3)

# Using group_by(), slice_max(), and select()
h4_top_attendance <- df_clean %>%
  group_by(part_time_job) %>%
  slice_max(order_by = attendance_percentage, n = 3) %>%
  select(student_id, part_time_job, attendance_percentage, exam_score)

print(h4_top_attendance)

# Main data for the H4 test
h4_data <- df_clean %>%
  select(part_time_job, attendance_percentage) %>%
  filter(!is.na(part_time_job))

# Using filter() to remove NAs
h5_data <- df_clean %>%
  filter(!is.na(parental_education_level)) %>%
  select(parental_education_level, exam_score)

# Display counts per education level
table(h5_data$parental_education_level)
```

Descriptive Statistics with rstatix

Before diving into formal hypothesis testing and complex visualizations, it is crucial to understand the distribution and summary statistics of our variables. As requested, we utilize the rstatix package (freq_table() and get_summary_stats()) to describe the subsets of data we prepared in the previous step.

Summary Stats for Hypothesis 1: Gender vs. Exam Scores

To understand the baseline distribution of scores between genders, we use get_summary_stats(). With type = "common", it reports the sample size, minimum, maximum, median, IQR, mean, standard deviation, standard error, and a 95% confidence interval for the mean of exam scores, grouped by gender.
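
The `type` argument of `get_summary_stats()` controls which statistics are returned. The sketch below compares three settings on a small invented data frame. `toy` and its values are assumptions for illustration, and the example assumes the dplyr and rstatix packages are installed.

```r
library(dplyr)
library(rstatix)

# Tiny invented data set: four scores per gender
toy <- data.frame(
  gender = rep(c("Female", "Male"), each = 4),
  exam_score = c(55, 62, 71, 88, 48, 66, 74, 91)
)

by_gender <- toy %>% group_by(gender)

# Same variable, three different sets of summary statistics
common_stats <- by_gender %>% get_summary_stats(exam_score, type = "common")
robust_stats <- by_gender %>% get_summary_stats(exam_score, type = "robust")
five_stats   <- by_gender %>% get_summary_stats(exam_score, type = "five_number")

names(common_stats)  # includes mean, sd, median, and more
names(robust_stats)  # median-based measures such as median and iqr
names(five_stats)    # includes the quartiles q1 and q3
```

Choosing `"five_number"` rather than `"robust"` matters when the quartiles themselves are needed, for example to compare against a boxplot.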

```{r rstatix-h1, message=FALSE, warning=FALSE}
# Using get_summary_stats() for numerical data grouped by a categorical variable
h1_stats <- h1_data %>%
  group_by(gender) %>%
  get_summary_stats(exam_score, type = "common")

print(h1_stats)

# Using freq_table() for categorical data
h2_freq <- df_clean %>%
  freq_table(diet_quality)

print(h2_freq)

# Using get_summary_stats() with robust measures: type = "robust" returns the
# median and IQR (type = "five_number" would give the five-number summary)
h4_stats <- h4_data %>%
  group_by(part_time_job) %>%
  get_summary_stats(attendance_percentage, type = "robust")

print(h4_stats)

# Using freq_table() for the filtered parental education data
h5_freq <- h5_data %>%
  freq_table(parental_education_level)

print(h5_freq)

# Scatter plot demonstrating color, shape, and facet_wrap
h3_plot <- df_clean %>%
  # study_intensity exists only in h3_data, so derive it here before faceting
  mutate(study_intensity = ifelse(study_hours_per_day >= 4, "Heavy", "Light")) %>%
  ggplot(aes(x = study_hours_per_day, y = exam_score,
             color = gender, shape = part_time_job)) +
  geom_point(alpha = 0.7, size = 3) +
  facet_wrap(~ study_intensity, scales = "free_x") +
  # Gender levels not named here are drawn in the scale's default NA color
  scale_color_manual(values = c("Female" = "#E69F00", "Male" = "#56B4E9")) +
  labs(
    title = "Impact of Study Hours on Exam Scores by Intensity",
    subtitle = "Dimensionality added via Gender (Color) and Job Status (Shape)",
    x = "Study Hours per Day",
    y = "Exam Score",
    color = "Gender",
    shape = "Part-Time Job"
  ) +
  theme_minimal()

print(h3_plot)

# Boxplot demonstrating fill and facet_grid
h5_plot <- df_clean %>%
  filter(!is.na(parental_education_level)) %>%
  ggplot(aes(x = parental_education_level, y = exam_score,
             fill = parental_education_level)) +
  geom_boxplot(alpha = 0.8) +
  facet_grid(part_time_job ~ gender) +
  labs(
    title = "Exam Scores by Parental Education Level",
    subtitle = "Grid grouped by Part-Time Job (Rows) and Gender (Columns)",
    x = "Parental Education Level",
    y = "Exam Score",
    fill = "Education Level"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(h5_plot)
```