2026-03-29

Dataset Overview and Source

Student Exam Performance Dataset

This analysis examines data from 10,000 students to identify key indicators of passing or failing the final exam.

Data Source: Kaggle.com

Key Variables:

  • study_hours_per_day: Average daily study time
  • study_environment: Quality of study environment
  • attendance_rate: Percentage of class attendance
  • sleep_hours: Average hours of sleep per night
  • final_exam_score: The grade achieved on the final exam
  • pass_fail: A binary outcome where 0 is failing and 1 is passing the exam (the groups appear as Fail and Pass in the summary table)

R Code for Data Preparation

Here’s how I loaded and prepared the data:

# Load required libraries
library(ggplot2)
library(plotly)
library(dplyr)
library(htmltools)

# Load the data
scores <- read.csv("student_exam_performance_dataset.csv")

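# Convert categorical columns to factors; grade levels are listed from F up to A
# so that plots order the categories from lowest to highest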
scores$gender <- factor(scores$gender)
scores$grade_category <- factor(scores$grade_category, 
                                levels = c("F","D","C","B","A"))

3D Plotly: Study Hours, Attendance, and Sleep Hours

3D Plot Analysis

Key Observations:

  • Study Time: More time spent studying is strongly associated with passing, while students who study less fail far more often.
  • Attendance Rate: Attendance rate does not show a clear correlation with passing or failing in the graph. A deeper analysis of the numbers is needed to be certain.
  • Time Spent Sleeping: There seems to be some correlation between more sleep and more passes, but it is less obvious than for study time. The two may also be related, since students often skip sleep to study more. The middle range of 7-8 hours of sleep appears to be the ideal amount.
  • Combined Effect: The 3D view reveals that students who study 4 or more hours have a much higher chance of passing, while students who study less are more likely to fail. A high attendance rate does seem to slightly counteract a low amount of studying.
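
The 4-hour threshold and the attendance question in the observations above can be checked numerically rather than by eye. Below is a minimal sketch in base R. It uses a small made-up data frame (`toy` and all of its values are hypothetical, not taken from the Kaggle data) and assumes pass_fail holds the labels "Pass" and "Fail" as in the summary output. With the real data, `scores` would take the place of `toy`.

```r
# Toy data for illustration only; these values are NOT from the Kaggle dataset.
toy <- data.frame(
  study_hours_per_day = c(1.5, 2.0, 2.8, 3.5, 4.2, 4.5, 5.0, 3.9),
  attendance_rate     = c(70, 92, 85, 95, 80, 88, 91, 60),
  pass_fail           = c("Fail", "Fail", "Fail", "Pass",
                          "Pass", "Pass", "Pass", "Fail")
)

# TRUE for students at or above the 4-hour threshold mentioned above
toy$studies_4_plus <- toy$study_hours_per_day >= 4

# Share of passing students within each group (FALSE = under 4 hours)
pass_rate <- tapply(toy$pass_fail == "Pass", toy$studies_4_plus, mean)
print(pass_rate)

# Correlation between attendance and a 0/1 pass indicator
# (a point-biserial correlation), as one "deeper look" at attendance
passed <- as.numeric(toy$pass_fail == "Pass")
attendance_cor <- cor(toy$attendance_rate, passed)
print(attendance_cor)
```

On the real data, a large gap between the two pass rates would support the threshold observation, and a correlation near zero would agree with the unclear attendance pattern in the plot.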

Plotly Scatter: Study Time vs Attendance Rate

ggplot Bar Chart: Average Study Hours by Grade Category

ggplot Pie Chart: Percentage of Students in Each Grade Category

Statistical Analysis: Summary Statistics

# Counts and means for key variables, plus median and SD of study hours, by pass/fail group
scores %>%
  group_by(pass_fail) %>%
  summarise(
    Count = n(),
    Mean_A = round(mean(attendance_rate), 1),
    Mean_S = round(mean(sleep_hours), 1),
    Mean_FE = round(mean(final_exam_score), 1),
    Mean_Study = round(mean(study_hours_per_day), 1),
    Median_Study = median(study_hours_per_day),
    SD_Study = round(sd(study_hours_per_day), 1)
  )
## # A tibble: 2 × 8
##   pass_fail Count Mean_A Mean_S Mean_FE Mean_Study Median_Study SD_Study
##   <chr>     <int>  <dbl>  <dbl>   <dbl>      <dbl>        <dbl>    <dbl>
## 1 Fail       5142   83.5      7    40.3        2.5         2.48        1
## 2 Pass       4858   86        7    59.6        3.6         3.61        1
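
The table above reports counts, means, one median, and one standard deviation, not a full five-number summary. If one is wanted, base R's `fivenum()` can be applied per group. This is a sketch only: `toy` and its values are made up for illustration, and with the real data `scores` would replace it.

```r
# Toy data for illustration only; these values are NOT from the Kaggle dataset.
toy <- data.frame(
  study_hours_per_day = c(1.0, 2.0, 2.5, 3.0, 4.0,
                          3.0, 3.5, 3.6, 4.0, 5.0),
  pass_fail = rep(c("Fail", "Pass"), each = 5)
)

# fivenum() returns minimum, lower hinge, median, upper hinge, maximum
five_num <- tapply(toy$study_hours_per_day, toy$pass_fail, fivenum)
print(five_num)
```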

Summary Statistics: Interpretation

Detailed Findings:

  • Balanced Dataset: With 5142 fails versus 4858 passes (51.4% vs. 48.6%), the two groups are well balanced, which allows us to make reasonable statistical comparisons.

  • Attendance Rate: The mean attendance rate for the failing group is 83.5, while the mean attendance rate for the passing group is 86.0. This shows a positive association between attending class and passing, but not a very strong one, since the gap is only 2.5 points.

  • Final Exam Scores: The mean final exam score for the passing group was 19.3 points higher than for the failing group (59.6 vs. 40.3). While the average score for both groups is extremely low, the failing group's average was much lower than I expected.

  • Study Time Amount: The difference in study hours appears to be the strongest indicator of passing or failing. The failing group had a mean of 2.5 hours, while the passing group had a mean of 3.6 hours. In both groups the median (2.48 and 3.61) is nearly identical to the mean, and the standard deviation is about 1 hour. This suggests the study-hour distributions are roughly symmetric and not strongly affected by outliers.
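
One way to put a number on the size of the study-time gap is a standardized mean difference (Cohen's d). It is not part of the original analysis and is added here only as an illustrative check. The sketch uses the rounded values from the summary table, so the result is approximate.

```r
# Values copied from the summary table (rounded to one decimal there)
n_fail <- 5142
n_pass <- 4858
mean_fail <- 2.5   # mean study hours, failing group
mean_pass <- 3.6   # mean study hours, passing group
sd_fail <- 1
sd_pass <- 1

# Pooled standard deviation across the two groups
pooled_sd <- sqrt(((n_fail - 1) * sd_fail^2 + (n_pass - 1) * sd_pass^2) /
                    (n_fail + n_pass - 2))

# Cohen's d: difference in means expressed in pooled-SD units
cohens_d <- (mean_pass - mean_fail) / pooled_sd
print(cohens_d)  # about 1.1
```

A difference of roughly one standard deviation is consistent with study time standing out as the clearest separator between the two groups in this summary.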

Thank You