Student Exam Performance Analysis

Author

Vanessa Lade

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.1     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.3     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readxl)

data <- read_excel("API_4_DS2_en_excel_v2_2101.xls")

glimpse(data)
Rows: 10,000
Columns: 23
$ student_id                 <chr> "S00001", "S00002", "S00003", "S00004", "S0…
$ gender                     <chr> "Male", "Female", "Male", "Male", "Male", "…
$ age                        <dbl> 17, 18, 17, 18, 18, 17, 16, 18, 17, 15, 17,…
$ parental_education         <chr> "High School", "High School", "High School"…
$ family_income              <chr> "Medium", "Low", "Medium", "Medium", "Mediu…
$ internet_access            <chr> "Yes", "Yes", "No", "Yes", "Yes", "Yes", "Y…
$ study_environment          <chr> "Quiet", "Quiet", "Quiet", "Quiet", "Quiet"…
$ study_hours_per_day        <dbl> 2.98, 4.45, 3.75, 2.03, 5.14, 1.16, 3.33, 2…
$ attendance_rate            <dbl> 96.5, 95.7, 76.0, 72.6, 87.3, 92.0, 74.2, 8…
$ sleep_hours                <dbl> 6.05, 6.96, 7.02, 6.23, 8.54, 8.16, 7.34, 6…
$ social_media_hours         <dbl> 0.1, 2.9, 2.4, 3.5, 2.1, 2.6, 3.9, 0.0, 1.1…
$ assignment_completion_rate <dbl> 80.5, 70.9, 77.6, 63.5, 71.8, 100.0, 81.3, …
$ participation_score        <dbl> 68.7, 92.6, 45.8, 72.9, 55.7, 76.3, 41.8, 7…
$ online_courses_completed   <dbl> 1, 0, 4, 4, 0, 2, 1, 4, 2, 2, 1, 6, 0, 6, 2…
$ tutoring                   <chr> "Yes", "Yes", "Yes", "No", "No", "No", "No"…
$ math_score                 <dbl> 42.8, 77.9, 53.5, 28.3, 74.7, 40.3, 52.1, 4…
$ reading_score              <dbl> 62.4, 73.5, 38.3, 23.5, 54.9, 42.2, 48.4, 5…
$ writing_score              <dbl> 54.8, 64.4, 36.3, 32.0, 73.6, 44.1, 49.4, 4…
$ science_score              <dbl> 51.8, 61.6, 47.1, 39.0, 55.5, 30.0, 47.8, 6…
$ final_exam_score           <dbl> 49.1, 70.1, 42.2, 31.9, 66.4, 40.9, 47.1, 5…
$ previous_gpa               <dbl> 2.44, 2.79, 1.49, 1.34, 2.60, 1.61, 1.35, 2…
$ pass_fail                  <chr> "Fail", "Pass", "Fail", "Fail", "Pass", "Fa…
$ grade_category             <chr> "F", "C", "F", "F", "C", "F", "F", "D", "F"…

Gender Differences by Subject: Do male and female students perform differently across subjects?

data %>%
  pivot_longer(cols = c(math_score, reading_score, writing_score),
               names_to = "subject",
               values_to = "score") %>%
  group_by(gender, subject) %>%
  summarise(mean_score = mean(score), .groups = "drop") %>%
  ggplot(aes(x = subject, y = mean_score, fill = gender)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Exam Scores by Gender and Subject",
    x = "Subject",
    y = "Average Score",
    fill = "Gender"
  ) +  
  theme_minimal()

Interpretation: Male students average slightly higher in math, while female students average higher in reading and writing. The largest gap appears in writing. The gender differences are therefore subject-specific rather than uniform across subjects.
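The size of each gap is hard to read off a dodged bar chart, so a small table can help check the interpretation above. The following is a minimal sketch, not part of the original analysis: the helper name `gender_gaps` is hypothetical, it assumes `gender` takes the values "Male" and "Female" as seen in the glimpse output, and the toy rows are invented purely so the block runs on its own.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: mean score per gender and subject,
# then the female-minus-male gap for each subject.
# Assumes the gender column contains "Male" and "Female".
gender_gaps <- function(df) {
  df %>%
    pivot_longer(cols = c(math_score, reading_score, writing_score),
                 names_to = "subject",
                 values_to = "score") %>%
    group_by(subject, gender) %>%
    summarise(mean_score = mean(score, na.rm = TRUE), .groups = "drop") %>%
    pivot_wider(names_from = gender, values_from = mean_score) %>%
    mutate(gap_female_minus_male = Female - Male)
}

# Toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  gender        = c("Male", "Male", "Female", "Female"),
  math_score    = c(60, 70, 55, 65),
  reading_score = c(50, 60, 60, 70),
  writing_score = c(50, 60, 65, 75)
)

gender_gaps(toy)
```

Calling `gender_gaps(data)` on the loaded data would give the actual gaps; a positive value means female students average higher in that subject.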

Correlation Between Reading and Writing: Are reading and writing scores related?

ggplot(data, aes(x = reading_score, y = writing_score)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(
    title = "Relationship Between Reading and Writing Scores",
    x = "Reading Score",
    y = "Writing Score"
  ) +
  theme_minimal() 
`geom_smooth()` using formula = 'y ~ x'

Interpretation: Reading and writing scores show a strong positive linear relationship: students who score high in reading tend to score high in writing as well. This suggests that the two skills are closely related academically.
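A correlation coefficient would put a number on "strong". The sketch below uses base R's `cor()`; the wrapper name `reading_writing_cor` is hypothetical and the toy data frame is invented only to make the block runnable on its own.

```r
# Hypothetical helper: Pearson correlation between reading and writing scores.
# Rows with a missing value in either column are dropped.
reading_writing_cor <- function(df) {
  cor(df$reading_score, df$writing_score,
      use = "complete.obs", method = "pearson")
}

# Toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  reading_score = c(40, 50, 60, 70),
  writing_score = c(45, 55, 65, 75)
)

reading_writing_cor(toy)
```

Running `reading_writing_cor(data)` would return the coefficient for the real scores. Values near 1 would support the "strong positive" reading of the scatter plot.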

Academic Influence of Parental Education: How does parental education affect student scores?

ggplot(data, aes(x = parental_education, y = final_exam_score, fill = parental_education)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 1) +
  labs(
    title = "Impact of Parental Education on Final Exam Scores",
    subtitle = "Higher parental education levels correlate with improved academic outcomes.",
    x = "Parental Education Level",
    y = "Final Exam Score",
    fill = "Education Level"
  ) +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2") +
  theme(legend.position = "none")

Interpretation: Students whose parents hold advanced degrees (Master’s or PhD) achieve higher median scores, and fewer of them fall into the failing range, than students whose parents completed only high school. This suggests that household academic background is associated with a meaningful advantage, possibly through better access to resources or academic guidance.
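The two claims in this interpretation (higher medians, fewer failing students) can be checked with a grouped summary. This is a sketch under assumptions: the helper name `summarise_by_parent_edu` is hypothetical, it assumes `pass_fail` uses the labels "Pass" and "Fail" shown in the glimpse output, and the toy rows are invented for illustration.

```r
library(dplyr)

# Hypothetical helper: per parental education level, report the group size,
# the median final exam score, and the share of students marked "Fail".
summarise_by_parent_edu <- function(df) {
  df %>%
    group_by(parental_education) %>%
    summarise(
      n            = n(),
      median_score = median(final_exam_score, na.rm = TRUE),
      fail_share   = mean(pass_fail == "Fail", na.rm = TRUE),
      .groups      = "drop"
    ) %>%
    arrange(desc(median_score))
}

# Toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  parental_education = c("High School", "High School", "PhD", "PhD"),
  final_exam_score   = c(40, 60, 70, 80),
  pass_fail          = c("Fail", "Pass", "Pass", "Pass")
)

summarise_by_parent_edu(toy)
```

Applied as `summarise_by_parent_edu(data)`, the table would show whether the ordering of medians and fail shares matches what the boxplot suggests.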

Correlation Between Study Habits and Performance: Do better study habits result in better grades?

ggplot(data, aes(x = study_hours_per_day, y = final_exam_score, color = tutoring)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", color = "black", linetype = "dashed") +
  facet_wrap(~tutoring) +
  labs(
    title = "Study Hours vs. Final Exam Scores",
    subtitle = "Visualizing the relationship between daily study time and success, faceted by tutoring.",
    x = "Daily Study Hours",
    y = "Final Exam Score",
    color = "Tutoring Support"
  ) +
  theme_light() +
  scale_color_manual(values = c("Yes" = "#2E8B57", "No" = "#CD5C5C"))
`geom_smooth()` using formula = 'y ~ x'

Interpretation: The scatter plot shows a strong positive correlation between daily study hours and final exam scores. In both tutoring panels, more study time is associated with higher scores. Students receiving tutoring (“Yes”) cluster more tightly around the trend line at higher score levels, which suggests that tutoring may help students get more out of their study hours.
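To compare the two panels numerically, one option is the correlation and fitted slope within each tutoring group. This is a minimal sketch, not the author's method: the helper name `study_effect_by_tutoring` is hypothetical, the slope is computed as cov(x, y) / var(x) (the ordinary least squares slope for a single predictor), and it assumes neither column has missing values. The toy rows are invented for illustration.

```r
library(dplyr)

# Hypothetical helper: per tutoring group, the Pearson correlation between
# study hours and final exam score, and the simple-regression slope
# (exam points per additional daily study hour).
# Assumes no missing values in the two numeric columns.
study_effect_by_tutoring <- function(df) {
  df %>%
    group_by(tutoring) %>%
    summarise(
      n     = n(),
      r     = cor(study_hours_per_day, final_exam_score),
      slope = cov(study_hours_per_day, final_exam_score) /
                var(study_hours_per_day),
      .groups = "drop"
    )
}

# Toy rows for illustration only; these are NOT values from the dataset.
toy <- data.frame(
  tutoring            = c("Yes", "Yes", "Yes", "No", "No", "No"),
  study_hours_per_day = c(1, 2, 3, 1, 2, 3),
  final_exam_score    = c(50, 60, 70, 40, 45, 50)
)

study_effect_by_tutoring(toy)
```

With `study_effect_by_tutoring(data)`, a higher `r` in the "Yes" group would be consistent with the tighter clustering described above, and a steeper `slope` would be consistent with tutoring raising the return on each study hour.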