Student Exam Performance Analysis

2026-06-14

Background

This project analyzes the Students Performance in Exams dataset from Kaggle.

The dataset contains student exam scores in:

Math
Reading
Writing

It also includes student background variables such as gender, lunch type, parental education, and test preparation course status.

Research Questions

The main questions are:

Do students who complete test preparation have higher average scores?
Is math score related to reading score?
Does lunch type appear related to average student performance?

Dataset Source

The dataset comes from Kaggle:

Students Performance in Exams

The dataset includes information about student exam scores and background factors.

## # A tibble: 6 × 10
##   gender race_ethnicity parental_level_of_education lunch test_preparation_cou…¹
##   <chr>  <chr>          <chr>                       <chr> <chr>                 
## 1 female group B        bachelor's degree           stan… none                  
## 2 female group C        some college                stan… completed             
## 3 female group B        master's degree             stan… none                  
## 4 male   group A        associate's degree          free… none                  
## 5 male   group C        some college                stan… none                  
## 6 female group B        associate's degree          stan… none                  
## # ℹ abbreviated name: ¹test_preparation_course
## # ℹ 5 more variables: math_score <dbl>, reading_score <dbl>,
## #   writing_score <dbl>, average_score <dbl>, passed_math <chr>

Variables in the Dataset

The main variables are:

gender
race_ethnicity
parental_level_of_education
lunch
test_preparation_course
math_score
reading_score
writing_score

I also created a new variable:

library(dplyr)  # provides %>% and mutate()

# Average of the three exam scores for each student
students <- students %>%
  mutate(
    average_score = (math_score + reading_score + writing_score) / 3
  )

Data Cleaning

The dataset was already mostly clean.

I renamed the columns to make them easier to use in R.

# Lower-case the names, turn each run of other characters into "_", drop a trailing "_"
names(students) <- tolower(names(students))
names(students) <- gsub("[^a-z0-9]+", "_", names(students))
names(students) <- gsub("_$", "", names(students))
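To make the effect of these three lines concrete, here is a small sketch applied to a few column names. The raw names below are an assumption about what the original CSV headers might look like (spaces, a slash, mixed case); they are not shown anywhere in this report.

```r
# Hypothetical raw column names (assumed for illustration, not taken from this report)
raw_names <- c("race/ethnicity", "parental level of education", "Math Score ")

clean_names <- tolower(raw_names)                     # lower-case everything
clean_names <- gsub("[^a-z0-9]+", "_", clean_names)   # each run of other characters becomes one "_"
clean_names <- gsub("_$", "", clean_names)            # drop a trailing "_", if any

# clean_names is now:
#   "race_ethnicity", "parental_level_of_education", "math_score"
print(clean_names)
```

The order matters: the trailing-underscore cleanup only has something to remove after the second step has turned a trailing space into "_".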

Summary Statistics

## # A tibble: 1 × 6
##   mean_math mean_reading mean_writing mean_average sd_average number_of_students
##       <dbl>        <dbl>        <dbl>        <dbl>      <dbl>              <int>
## 1      66.1         69.2         68.1         67.8       14.3               1000

Distribution of Average Scores

Average Score by Test Preparation

Mean Scores by Test Preparation

## # A tibble: 2 × 6
##   test_preparation_course mean_math mean_reading mean_writing mean_average count
##   <chr>                       <dbl>        <dbl>        <dbl>        <dbl> <int>
## 1 completed                    69.7         73.9         74.4         72.7   358
## 2 none                         64.1         66.5         64.5         65.0   642
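A table like this one is a grouped summary: one mean and one count per test preparation group. The report does not show the code behind it, so the following is only a sketch of the idea in base R, run on a made-up five-row data frame that reuses the report's column names.

```r
# Toy data with the same column names as the report (values are made up)
toy <- data.frame(
  test_preparation_course = c("completed", "completed", "none", "none", "none"),
  average_score           = c(80, 70, 60, 65, 55)
)

# Mean of average_score within each test preparation group
group_means <- aggregate(average_score ~ test_preparation_course,
                         data = toy, FUN = mean)

# Number of students in each group
group_counts <- table(toy$test_preparation_course)

print(group_means)
print(group_counts)
```

On the real data, the same call with `data = students` should give group means comparable to the `mean_average` column above.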

Average Score by Lunch Type

Interactive Plot: Math vs Reading Score

Correlation Between Math and Reading

## [1] 0.8175797

There is a strong positive relationship between math score and reading score (r ≈ 0.82).

Students who score higher in math also tend to score higher in reading.
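The report prints only the resulting number, so as a sketch of how such a coefficient is typically obtained, the example below calls `cor()` on made-up scores for five students. The vectors are illustrative and are not drawn from the dataset; on the real data the equivalent call would presumably be `cor(students$math_score, students$reading_score)`.

```r
# Made-up scores for five students (not from the dataset)
math_score    <- c(50, 60, 70, 80, 90)
reading_score <- c(55, 58, 72, 79, 91)

# Pearson correlation, which is the default method for cor()
r <- cor(math_score, reading_score)

# A value near +1 means the two scores rise together almost linearly
print(r)
```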

Linear Regression Model

I used a simple linear regression model to predict reading score from math score.

model <- lm(reading_score ~ math_score, data = students)
summary(model)

## 
## Call:
## lm(formula = reading_score ~ math_score, data = students)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.2905  -5.8011   0.1139   6.0341  21.4117 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 17.14181    1.19000   14.40   <2e-16 ***
## math_score   0.78723    0.01755   44.85   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.411 on 998 degrees of freedom
## Multiple R-squared:  0.6684, Adjusted R-squared:  0.6681 
## F-statistic:  2012 on 1 and 998 DF,  p-value: < 2.2e-16
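The fitted line can be read straight off the coefficient table: predicted reading score = 17.14181 + 0.78723 × math score. The sketch below applies that formula to a few math scores. The helper name `predict_reading` is hypothetical and exists only for this illustration.

```r
# Coefficients copied from the summary(model) output above
intercept <- 17.14181
slope     <- 0.78723

# Hypothetical helper: predicted reading score for a given math score
predict_reading <- function(math_score) {
  intercept + slope * math_score
}

# A student with a math score of 70 is predicted to score about 72.2 in reading
print(predict_reading(70))

# Each extra math point adds about 0.79 predicted reading points
print(predict_reading(c(50, 90)))

# Consistency check: with one predictor, R-squared equals the squared correlation
print(round(0.8175797^2, 4))  # 0.6684, matching "Multiple R-squared" above
```

With the fitted object available, `predict(model, newdata = data.frame(math_score = 70))` should give the same prediction without copying coefficients by hand.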

Regression Plot

## `geom_smooth()` using formula = 'y ~ x'

Test Preparation Comparison

I used a Welch two-sample t-test (the default for t.test(), which does not assume equal variances) to compare average scores between students who completed test preparation and students who did not.

t_test_result <- t.test(
  average_score ~ test_preparation_course,
  data = students
)

t_test_result

## 
##  Welch Two Sample t-test
## 
## data:  average_score by test_preparation_course
## t = 8.5945, df = 791.84, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group completed and group none is not equal to 0
## 95 percent confidence interval:
##  5.887734 9.373305
## sample estimates:
## mean in group completed      mean in group none 
##                72.66946                65.03894

Interpretation of the T-Test

The t-test compares the mean average score between the two groups:

Students who completed test preparation
Students who did not complete test preparation

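A small p-value suggests that the mean average scores differ between the groups. Here the p-value is below 2.2e-16, and students who completed test preparation averaged about 7.6 points higher (95% confidence interval: roughly 5.9 to 9.4 points).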
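The printed t-test output contains enough information to recover the size of the difference and its standard error. The short calculation below uses only numbers copied from that output; the variable names are chosen for this illustration.

```r
# Values copied from the Welch t-test output above
mean_completed <- 72.66946
mean_none      <- 65.03894
ci_lower       <- 5.887734
ci_upper       <- 9.373305
t_stat         <- 8.5945

# Estimated difference in mean average score (completed minus none), about 7.63 points
mean_diff <- mean_completed - mean_none

# The confidence interval is centered on that difference
ci_midpoint <- (ci_lower + ci_upper) / 2

# Standard error implied by the t statistic, about 0.89 points
std_error <- mean_diff / t_stat

print(c(mean_diff = mean_diff, ci_midpoint = ci_midpoint, std_error = std_error))
```

The same quantities can be pulled from the result object directly, for example `t_test_result$estimate`, `t_test_result$conf.int`, and `t_test_result$p.value`.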

Average Score by Parental Education

## # A tibble: 6 × 3
##   parental_level_of_education mean_average count
##   <chr>                              <dbl> <int>
## 1 master's degree                     73.6    59
## 2 bachelor's degree                   71.9   118
## 3 associate's degree                  69.6   222
## 4 some college                        68.5   226
## 5 some high school                    65.1   179
## 6 high school                         63.1   196

Parental Education Plot

Main Findings

The main findings are:

Students who completed test preparation had a higher mean average score (72.7 vs. 65.0).
Math and reading scores had a strong positive relationship (r ≈ 0.82).
Lunch type appeared related to score differences.
Mean average score generally rose with parental education level, from 63.1 (high school) to 73.6 (master's degree).

Conclusion

This project used data cleaning, exploratory data analysis, visualization, and simple statistical methods.

The analysis suggests that student performance is related to several factors, especially test preparation and lunch type, and that scores in one subject are closely related to scores in another.

The results do not prove causation, but they show useful patterns in the data.