Executive summary

The 2025 Korean College Scholastic Ability Test (CSAT) has just ended. In South Korea, CSAT day is a national event where everyone feels the tension, and special measures are taken to support test-takers. This highlights the significance of exams in Korea’s competitive educational landscape, often referred to as “K-education.”

Preparing for such an important exam taught us lessons that still apply now that we are university students. At this very moment, we are facing our final exam season. University exams are more than assessments of academic performance; they also play a crucial role in shaping our future goals. This makes it essential to explore effective strategies for exam preparation.

So, what are the most practical and impactful learning strategies we can apply as we prepare for final exams?

This research aims to analyze the key factors that influence students’ academic performance, not only to draw theoretical conclusions but also to identify actionable strategies we can implement immediately. By doing so, we seek to discover ways to study more effectively while reducing stress during the exam period.

Ultimately, this study aspires to go beyond improving grades. It aims to uncover methods that enhance learning efficiency while maintaining a balanced and fulfilling life—a critical starting point for long-term success.

Data background

The dataset we used is “StudentPerformanceFactors.csv”. It provides a comprehensive overview of factors affecting student performance in exams, including study habits, attendance, parental involvement, and other aspects influencing academic success.

We found this dataset on Kaggle; its original source is the UCI Machine Learning Repository. The underlying study collected data through a survey of parents’ occupations and home education levels, school records such as attendance and test scores, and an observational survey of students from two secondary schools in Portugal.

Data cleaning

We decided to examine how six independent variables (Hours_Studied, Attendance, Family_Income, Parental_Involvement, Sleep_Hours, and Peer_Influence) relate to the dependent variable, Exam_Score.

Descriptive statistics for these variables are summarized below.

summary(data[, c("Exam_Score", "Hours_Studied", "Attendance",
                 "Family_Income", "Parental_Involvement",
                 "Sleep_Hours", "Peer_Influence")])

##    Exam_Score     Hours_Studied     Attendance     Family_Income     
##  Min.   : 55.00   Min.   : 1.00   Min.   : 60.00   Length:6607       
##  1st Qu.: 65.00   1st Qu.:16.00   1st Qu.: 70.00   Class :character  
##  Median : 67.00   Median :20.00   Median : 80.00   Mode  :character  
##  Mean   : 67.24   Mean   :19.98   Mean   : 79.98                     
##  3rd Qu.: 69.00   3rd Qu.:24.00   3rd Qu.: 90.00                     
##  Max.   :101.00   Max.   :44.00   Max.   :100.00                     
##  Parental_Involvement  Sleep_Hours     Peer_Influence    
##  Length:6607          Min.   : 4.000   Length:6607       
##  Class :character     1st Qu.: 6.000   Class :character  
##  Mode  :character     Median : 7.000   Mode  :character  
##                       Mean   : 7.029                     
##                       3rd Qu.: 8.000                     
##                       Max.   :10.000
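The summary above reports a maximum Exam_Score of 101, which would be out of range if scores are on a 0 to 100 scale (an assumption; the source does not state the scale). A minimal sketch of a cleaning check, run on a small invented data frame named `toy` rather than on the real data, might look like this:

```r
# Toy stand-in for the real data; all values below are invented for illustration.
toy <- data.frame(
  Exam_Score    = c(55, 67, 101, 70, NA),
  Hours_Studied = c(1, 20, 44, 24, 16),
  Sleep_Hours   = c(4, 7, 10, 8, 6)
)

# Count missing values in each column.
na_counts <- colSums(is.na(toy))

# Flag scores outside the assumed valid range of 0 to 100.
out_of_range <- !is.na(toy$Exam_Score) &
  (toy$Exam_Score < 0 | toy$Exam_Score > 100)

print(na_counts)
print(toy[out_of_range, ])
```

Whether a flagged row should be dropped, capped, or kept is a judgment call that depends on how the scores were recorded.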

We grouped the six independent variables by similarity for data visualization. First, we investigated the effect of study time and sleep time on exam scores. Second, we examined the effect of attendance on exam scores. Finally, we examined the relationship between exam performance and the student’s surrounding environment, that is, family income, parental involvement, and peer influence.

Figure 1

The first figure is a scatter plot with a regression line, showing the relationship between Hours_Studied, Sleep_Hours, and Exam_Score. Each point represents one student’s combination of study hours and sleep hours, and the color gradient indicates that student’s exam score, from low (yellow) to high (red).

We chose this visualization because it shows how two key independent variables, Hours_Studied and Sleep_Hours, jointly relate to exam performance. The scatter plot reveals trends, clusters, and outliers. Note that the regression line is fitted to Sleep_Hours as a function of Hours_Studied, so it describes the relationship between those two variables rather than their effect on Exam_Score; exam performance is conveyed by color alone. The color gradient lets us quickly identify which combinations of study and sleep are associated with higher scores.

library(ggplot2)

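# Scatter plot of study hours against sleep hours, colored by exam score
ggplot(data, aes(x = Hours_Studied, y = Sleep_Hours, color = Exam_Score)) +
  geom_point(alpha = 0.6, size = 3) +                      # semi-transparent points reduce overplotting
  geom_smooth(method = "lm", color = "blue", se = FALSE) + # linear fit of Sleep_Hours on Hours_Studied
  scale_color_gradient(low = "yellow", high = "red") +     # low scores yellow, high scores red
  labs(title = "Hours Studied and Sleep Hours vs. Exam Score",
       x = "Hours Studied",
       y = "Sleep Hours") +
  theme_minimal()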

## `geom_smooth()` using formula = 'y ~ x'
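Because the line in this plot relates Sleep_Hours to Hours_Studied, it does not quantify how either variable relates to the exam score. One possible complement, sketched here on simulated data and not part of the original analysis, is a linear model with Exam_Score as the outcome. The data frame `toy` and the coefficients used to simulate it are invented for illustration.

```r
# Simulated data, invented for illustration only; not taken from the real dataset.
set.seed(1)
n <- 200
toy <- data.frame(
  Hours_Studied = sample(1:44, n, replace = TRUE),
  Sleep_Hours   = sample(4:10, n, replace = TRUE)
)
# Simulated scores rise with study hours and do not depend on sleep.
toy$Exam_Score <- 60 + 0.3 * toy$Hours_Studied + rnorm(n, sd = 2)

# Regress the outcome on both predictors at once.
fit <- lm(Exam_Score ~ Hours_Studied + Sleep_Hours, data = toy)
coefs <- coef(fit)
print(round(coefs, 2))
```

On the real data, the same call with `data = data` would give one slope per predictor, each interpreted with the other predictor held fixed.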

Figure 2

The second figure is a line graph of the relationship between Attendance and the mean Exam_Score. To simplify the analysis, we divided Attendance into ten equal-width bins with the cut() function and calculated the mean exam score for each bin. The x-axis shows the binned attendance levels and the y-axis the corresponding mean exam scores. A line connects the bin means, and red markers emphasize the value for each bin.

We chose this figure type because it clearly shows how exam performance changes with attendance. Binning reduces noise and emphasizes broader patterns. The line graph makes it easy to see whether mean exam scores increase, decrease, or remain stable as attendance rises. Point markers and angled x-axis labels further improve readability.

library(dplyr)

# Bin attendance into 10 equal-width intervals and average the exam score per bin
attendance_score <- data %>%
  group_by(Attendance_Bin = cut(Attendance, breaks = 10)) %>%
  summarise(Mean_Score = mean(Exam_Score, na.rm = TRUE))

ggplot(attendance_score, aes(x = Attendance_Bin, y = Mean_Score, group = 1)) +
  geom_line(color = "blue", linewidth = 1.2) +  # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(color = "red", size = 3) +
  labs(title = "Line Graph: Attendance vs. Exam Score",
       x = "Attendance (Binned)",
       y = "Mean Exam Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
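The binning step can also be written with explicit break points, which gives round bin boundaries that may be easier to read on the axis. The sketch below uses base R on a small invented data frame named `toy`; the 10-point bin width is an assumption chosen to match the 60 to 100 attendance range in the summary table.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Attendance = c(60, 65, 72, 78, 85, 88, 95, 100),
  Exam_Score = c(62, 64, 65, 67, 68, 70, 71, 73)
)

# Fixed 10-point bins; include.lowest keeps the minimum value of 60 in the first bin.
bins <- cut(toy$Attendance, breaks = seq(60, 100, by = 10),
            include.lowest = TRUE)

# Mean exam score within each bin.
mean_by_bin <- tapply(toy$Exam_Score, bins, mean)
print(mean_by_bin)
```

Fewer, wider bins smooth the trend further but hide more detail, so the bin width is worth varying before drawing conclusions.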

Figure 3

The third figure is a box plot that visualizes the relationship between Family Income, Exam Score, and Peer Influence, segmented by Parental Involvement using faceting. Each box plot represents the distribution of exam scores for a specific level of family income, while the fill color corresponds to the level of Peer Influence. Facets allow us to observe how Parental Involvement modifies these relationships.

We selected this figure type because it effectively showcases the combined effects of multiple variables on exam performance. Box plots are ideal for comparing distributions and spotting outliers, while faceting adds an extra layer of analysis by showing variations under different levels of Parental Involvement. The use of color for Peer Influence further emphasizes its impact, making it easier to interpret the complex interactions between these variables.

ggplot(data, aes(x = Family_Income, y = Exam_Score, fill = Peer_Influence)) +
  geom_boxplot() +
  facet_wrap(~ Parental_Involvement) +
  labs(title = "Exam Score by Family Income, Parental Involvement, and Peer Influence",
       x = "Family Income",
       y = "Exam Score") +
  theme_light() +
  scale_fill_brewer(palette = "Set2")
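The summary table shows that Family_Income, Parental_Involvement, and Peer_Influence are stored as character columns, and ggplot2 draws character categories in alphabetical order. If the categories have a natural order, converting them to factors with explicit levels before plotting keeps the axis and facets in that order. The sketch below uses an invented data frame named `toy`, and the level names "Low", "Medium", and "High" are an assumption; the actual values should be checked with `unique(data$Family_Income)` first.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  Family_Income = c("Medium", "Low", "High", "Low"),
  Exam_Score    = c(67, 65, 70, 66)
)

# Assumed level names, listed in the order they should appear on the axis.
income_levels <- c("Low", "Medium", "High")
toy$Family_Income <- factor(toy$Family_Income, levels = income_levels)

print(levels(toy$Family_Income))
print(table(toy$Family_Income))
```

The same conversion could be applied to Parental_Involvement and Peer_Influence, each with its own level order.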

We designed the figures with clarity and interpretability as the main goals, ensuring that the visual elements effectively communicate the underlying patterns and relationships in the data.

Colors: We used distinct color schemes to highlight key variables. For example, in the scatter plot, the gradient from yellow to red for Exam Scores emphasizes the performance levels, making it easier to identify trends. In the box plot, the Peer Influence levels were distinguished using the “Set2” palette, chosen for its pleasant and easily distinguishable colors, suitable for categorical data.

Fonts: We kept the clean default font of the R themes (theme_minimal and theme_light) to maintain readability, especially for titles, axis labels, and legends.

Design Elements:
- In the scatter plot, the use of a regression line adds statistical context, while the transparency (alpha) reduces clutter.
- In the line graph, angled x-axis labels ensure that bin names remain readable. Red points highlight data points, adding emphasis to the trend.
- In the box plot, faceting allows viewers to compare distributions across Parental Involvement levels, making the visualization more informative without overwhelming the reader.

These choices are intended to help the figures convey the data accurately, without introducing bias or distortion. By balancing simplicity and detail, the figures communicate the key findings to the audience.

Factors affecting students’ final grades

김효령, 조윤경

December 10, 2024