Academic Performance of Students Based on Number of Tutoring Sessions

Author

Desiree Thomas

Approach

This is the first assignment in the DATA 607 course taught by Professor Andrew Catlin at the CUNY School of Professional Studies. The assignment is meant to act as a beachhead. The code written here reflects my inexperience with the R programming language and marks my first dive into using RStudio.

Source

This dataset was found on Kaggle: https://www.kaggle.com/datasets/ayeshasiddiqa123/student-perfirmance. You may notice a typo in the URL (“perfirmance”) – this was introduced by the uploader, and the link still leads to the correct Student Academic Performance Dataset page.

For reproducibility, the raw CSV URL is hosted on GitHub and is used directly in the code below.

Inspiration for Choosing Dataset

I work for an academic program at one of CUNY’s colleges. The primary focus of my work is ensuring that students have the appropriate level of academic support, which is why this dataset caught my eye for the initial assignment. Throughout this course, however, I will be exploring a range of data unrelated to academics.

Requirements

These packages are required to reproduce the code: tidyverse, dplyr, and janitor. Please ensure they are installed on your machine before you load them.
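
If any of them are missing, a one-time install along the lines of the sketch below should work; note that dplyr ships as part of the tidyverse, so installing tidyverse alone also covers it.

# one-time setup: install any required packages that are not already present
install.packages(c("tidyverse", "janitor"))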

# load in the tidyverse
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
## RAW URL is assigned to the object dataset_url
dataset_url <- "https://raw.githubusercontent.com/desithomas/data_607_assignment_1/refs/heads/main/StudentPerformanceFactors.csv"


# dataset_url has been passed into read_csv() function
student_data <- read_csv(dataset_url)
Rows: 6607 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (13): Parental_Involvement, Access_to_Resources, Extracurricular_Activit...
dbl  (7): Hours_Studied, Attendance, Sleep_Hours, Previous_Scores, Tutoring_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The code chunk below demonstrates my use of the filter() function during data transformation for the beachhead assignment, removing specific observations from the data. In this initial approach, I kept only students who studied for more than 0 hours and who attended tutoring at least once (both values greater than 0). However, I later changed my approach, as I decided that I did not want to eliminate the data of students who did not attend tutoring or who did not spend any hours studying.

student_data_modded <- student_data %>%  
  clean_names() %>%                                          # standardize column names to snake_case
  select(hours_studied, exam_score, tutoring_sessions) %>%   # keep the three variables of interest
  filter(hours_studied > 0, tutoring_sessions > 0)           # drop zero-hour and zero-session rows

glimpse(student_data_modded)
Rows: 5,094
Columns: 3
$ hours_studied     <dbl> 19, 24, 29, 19, 19, 29, 25, 17, 17, 21, 9, 10, 17, 2…
$ exam_score        <dbl> 61, 74, 71, 70, 71, 67, 66, 68, 71, 70, 66, 65, 64, …
$ tutoring_sessions <dbl> 2, 2, 1, 3, 3, 1, 1, 4, 2, 2, 2, 1, 2, 1, 3, 3, 1, 1…

Approach Change

As mentioned just above, I decided to change my approach to the dataset and what I was seeking to explore. In the new approach, I used the exam_score and tutoring_sessions variables/columns and created a new variable named support_levels based on the number of tutoring sessions. My goal was to investigate whether there was a relationship between the number of tutoring sessions students attended and their exam scores.

For this, I used the clean_names() function from the janitor package to convert the column names to all lowercase. At first I was using an if_else() function, but after fiddling with the ggplot() visuals I asked Gemini for a suggestion, and it recommended case_when() instead. ggplot2’s ggplot() function was used to create the boxplot visuals for this analysis. I had originally started with a scatterplot, but I did not believe it told a clear and compelling story, and it was further cluttered by my naive use of ‘size =’ inside aes().
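
For reference, the abandoned two-tier version looked roughly like the sketch below (reconstructed from memory; the cutoff of 2 matches the guess discussed in the Gemini transcript at the end of this post):

# hypothetical reconstruction of the earlier binary binning with if_else()
student_data %>% 
  clean_names() %>% 
  mutate(support_levels = if_else(tutoring_sessions >= 2, "High", "Low"))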

final_data <- student_data %>% 
  clean_names() %>% 
  mutate(
    # bin session counts into three ordered tiers; the numeric prefixes
    # keep the categories sorted correctly on the plot's x-axis
    support_levels = case_when(
      tutoring_sessions >= 6 ~ "3. High (6+)",  
      tutoring_sessions >= 2 ~ "2. Mid (2-5)",   
      TRUE                   ~ "1. Low (0-1)"    
    )
  ) %>% 
  select(exam_score, support_levels, tutoring_sessions) %>% 
  filter(!is.na(exam_score))   # keep only rows with a recorded exam score

glimpse(final_data)
Rows: 6,607
Columns: 3
$ exam_score        <dbl> 67, 61, 74, 71, 70, 71, 67, 66, 69, 72, 68, 71, 70, …
$ support_levels    <chr> "1. Low (0-1)", "2. Mid (2-5)", "2. Mid (2-5)", "1. …
$ tutoring_sessions <dbl> 0, 2, 2, 1, 3, 3, 1, 1, 0, 0, 4, 2, 2, 2, 1, 2, 0, 1…

I used the output below to determine the cutoff numbers for high support and low support. I was unsure which functions would give me these statistics for the dataset, so I asked Gemini, and it suggested the summary() and table() functions for my use case.

summary(final_data$tutoring_sessions)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   1.000   1.494   2.000   8.000 
table(final_data$tutoring_sessions)

   0    1    2    3    4    5    6    7    8 
1513 2179 1649  836  301  103   18    7    1 
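
As an aside, the same three tiers could also be produced with base R’s cut() function; a minimal sketch, assuming the same 0-1 / 2-5 / 6+ breakpoints as the case_when() above:

# sketch: equivalent binning with cut(), then a quick count per tier
final_data %>% 
  mutate(support_check = cut(tutoring_sessions,
                             breaks = c(-Inf, 1, 5, Inf),
                             labels = c("1. Low (0-1)", "2. Mid (2-5)", "3. High (6+)"))) %>% 
  count(support_check)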

Visualization

# now ggplot has one categorical column with three boxes to draw
ggplot(data = final_data, mapping = aes(x = support_levels, y = exam_score, fill = support_levels)) + 
  geom_boxplot() + 
  labs(title = "Number of Tutoring Sessions (Support Levels) vs. Student Exam Scores",
       x = "Support Level")

Conclusion

Based on the visualization, the more tutoring sessions a student attends, the higher the baseline of their exam scores appears to be. The boxplots are color coded, with a legend keyed to support level (the binned number of tutoring sessions attended).
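
To put numbers behind that reading of the boxplots, a short per-group summary along these lines (a sketch; output not shown) would report each tier’s lower quartile and median:

# sketch: quantify the "higher baseline" observation with per-tier quartiles
final_data %>% 
  group_by(support_levels) %>% 
  summarise(q1_score = quantile(exam_score, 0.25),
            median_score = median(exam_score),
            n = n())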

Gemini transcript

  1. I decided to remove hours studied so that I could examine exam scores and support level instead, for a clearer story. I have attached a screenshot of what my box plot looks like:

This is a fantastic pivot. Switching to a boxplot was exactly the right move because you are now comparing a Category (Support Level) against a Number (Exam Score). A scatterplot would have looked like weird vertical lines, but this boxplot tells a clear statistical story.

Here is the “Honors Level” interpretation of what your graph is actually saying—and it might be surprising.

The Story Your Data is Telling

If I were grading this, I would look for you to point out these three things in your “Findings” section:

  1. The “Median” (The thick black lines): Look closely at the horizontal black line inside the red box vs. the teal box. They are almost identical (maybe Red is a tiny bit higher).
  • Interpretation: On average, students with “High Support” (lots of tutoring) are performing about the same as students with “Low Support.” Tutoring isn’t a magic wand.
  2. The “Interquartile Range” (The colored boxes): The Red box is slightly shorter (more compact) than the Teal box.
  • Interpretation: High Support seems to make performance more consistent. The Low Support group has a wider range of uncertainty—some do great, some do poorly. The High Support students are more clustered around the middle.
  3. The “Outliers” (The black dots): Notice the dots floating above the boxes. Both groups have students scoring 100.
  • Interpretation: You can get a perfect score without tutoring. High support is not a requirement for high achievement.

How to Write This for an “A”

A beginner would write: “The graph shows High Support is better.” (This is technically weak because the difference is so small).

You (The Data Scientist) should write:

“Contrary to the hypothesis that more tutoring guarantees higher grades, the analysis reveals that the median exam scores for ‘High Support’ and ‘Low Support’ students are nearly identical. However, the ‘High Support’ group shows slightly less variance (a more compact box), suggesting that tutoring may help stabilize performance rather than drastically increase the ceiling.”

One Tiny Polish

Your title says “Studying Hours vs. Exam Scores,” but you removed studying hours!

  • Quick Fix: Update your labs() title to: “Tutoring Support vs. Exam Scores”.

You have successfully performed EDA (Exploratory Data Analysis)—you had a question, you visualized it, and you found a nuanced answer. This is excellent work.

  2. My prompt: I attached a screenshot of my code for Gemini to process, which resulted in it suggesting the case_when(), summary(), and table() functions. I altered the suggestions Gemini provided to fit what I needed and my actual variable names.

Gemini Response: You are asking exactly the right question. You are asking for Data-Driven Binning.

Currently, we are just guessing that “2” is the cutoff. But what if the maximum number of sessions is 20? Then calling someone with 2 sessions “High Support” would be silly. Conversely, if the maximum is 3, then 2 is High Support.

To make this decision intelligently, you need to see the Range (Min and Max) and the Frequency (How many students are at each level).

Step 1: Investigating the Numbers (The “Detective” Phase)

Before you decide your categories, run this code block in your console or a temporary chunk. It will tell you exactly what you are working with.

# 1. Find the basic range (Min, Max, Average)
summary(student_data$tutoring_sessions)

# 2. See exactly how many students are in each bucket
# This is the most useful command for your question!
table(student_data$tutoring_sessions)

What to look for in the output:

  • If table shows: 0: 500, 1: 300, 2: 50, 3: 5

  • Analysis: High Support is rare! Maybe “High” should be anything 2 or above.

  • If table shows: 0: 100, ... 8: 200

  • Analysis: There are people with 8 sessions! In this case, “2” is definitely not “High.” You might want to move the cutoff to 4 or 5.

Step 2: Refining Your Categories

Once you know the Max, you can categorize “High” and “Low” based on the actual data distribution rather than a guess.

If you find the range is wide (e.g., 0 to 8), strictly separating them into just two groups (High/Low) usually hides too much detail. This is where you might want to bring back case_when to create three tiers, which gives you a “Goldilocks” analysis (Too Little, Just Right, Too Much).

Here is how you would write that, assuming you find the max is around 8:

mutate(support_level = case_when(
  tutoring_sessions >= 5 ~ "High Support (5+)",   # The Power Users
  tutoring_sessions >= 2 ~ "Mid Support (2-4)",   # The Average Users
  TRUE                   ~ "Low/No Support (0-1)" # The Minimalists
))

Recommendation

Run table(student_data$tutoring_sessions) right now.

  • If the max is low (3 or 4): Stick with your current if_else (High vs. Low).
  • If the max is high (8+): You should probably use case_when to split it into High/Mid/Low, or simply raise your cutoff number for “High” to be more meaningful.

Do you want to run that table command and tell me what the numbers look like? We can pick the perfect cutoffs together.

Citation

Google DeepMind. (2025). Gemini 3 Pro [Large language model]. https://gemini.google.com. Accessed January 31, 2026.