Student Grading

Author

Madina Kudanova

Introduction

This project develops practical competency in transforming wide-format datasets into tidy formats suitable for downstream analysis. All transformations and data manipulations are performed using the `tidyr` and `dplyr` packages in R, with visualizations built using `ggplot2`.

Dataset used:

Student Performance & Behavior Dataset: //www.kaggle.com/datasets/mahmoudelhemaly/students-grading-dataset

Approach

This dataset tracks 5,000 students, each with their own Student_ID, across 23 columns covering everything from midterm and final scores to sleep hours, stress levels, and whether they have internet at home. It’s wide in a very natural way there are six separate score columns (midterm, final, assignments, quizzes, participation, and projects) that all measure the same thing: how a student is performing academically.

The plan is to collapse those six score columns into long format using pivot_longer(), so each row represents one student’s score on one type of assessment. That opens up comparisons that would be awkward to do in wide format, like whether students with more study hours consistently score higher across all assessment types, or just some. The tricky part will be deciding which columns actually get pivoted and which ones stay as contextual background, stress level and sleep hours, for instance, belong in the latter category.

Once the data is tidy, the analysis will focus on which factors are most associated with a higher total score. I’ll use group_by() and summarize() to compare average scores across socioeconomic groups (family income level, internet access, parental education), and calculate the average score per assessment type per student to see whether performance is consistent across all assessments or whether certain students shine in some areas and struggle in others. Visualizations will include grouped bar charts and scatter plots of study hours against total score, faceted by factors like internet access and family income.

Base Code

 #| message: false
 library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Data Import

# Raw Data Import from my GitHub repository 
raw_data <- read_csv(
  "https://raw.githubusercontent.com/MKudanova/Data607/refs/heads/main/Project2/Students%20Performance%20Dataset.csv")

Rows: 5000 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Student_ID, First_Name, Last_Name, Email, Gender, Department, Grad...
dbl (12): Age, Attendance (%), Midterm_Score, Final_Score, Assignments_Avg, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

glimpse(raw_data)

Rows: 5,000
Columns: 23
$ Student_ID                 <chr> "S1000", "S1001", "S1002", "S1003", "S1004"…
$ First_Name                 <chr> "Omar", "Maria", "Ahmed", "Omar", "John", "…
$ Last_Name                  <chr> "Williams", "Brown", "Jones", "Williams", "…
$ Email                      <chr> "student0@university.com", "student1@univer…
$ Gender                     <chr> "Female", "Male", "Male", "Female", "Female…
$ Age                        <dbl> 22, 18, 24, 24, 23, 21, 24, 19, 21, 22, 23,…
$ Department                 <chr> "Mathematics", "Business", "Engineering", "…
$ `Attendance (%)`           <dbl> 97.36, 97.71, 99.52, 90.38, 59.41, 60.60, 6…
$ Midterm_Score              <dbl> 40.61, 57.27, 41.84, 45.65, 53.13, 70.21, 9…
$ Final_Score                <dbl> 59.61, 74.00, 63.85, 44.44, 61.77, 64.48, 9…
$ Assignments_Avg            <dbl> 73.69, 74.23, 85.85, 68.10, 67.66, 87.85, 8…
$ Quizzes_Avg                <dbl> 53.17, 98.23, 50.00, 66.27, 83.98, 52.26, 5…
$ Participation_Score        <dbl> 73.4, 88.0, 4.7, 4.2, 64.3, 54.8, 15.5, 54.…
$ Projects_Score             <dbl> 62.84, 98.23, 91.22, 55.48, 87.43, 67.65, 8…
$ Total_Score                <dbl> 59.8865, 81.9170, 67.7170, 51.6535, 71.4030…
$ Grade                      <chr> "F", "B", "D", "F", "C", "D", "B", "C", "D"…
$ Study_Hours_per_Week       <dbl> 10.3, 27.1, 12.4, 25.5, 13.3, 25.3, 22.1, 1…
$ Extracurricular_Activities <chr> "Yes", "No", "Yes", "No", "Yes", "Yes", "Ye…
$ Internet_Access_at_Home    <chr> "No", "No", "No", "Yes", "No", "No", "No", …
$ Parent_Education_Level     <chr> "Master's", "High School", "High School", "…
$ Family_Income_Level        <chr> "Medium", "Low", "Low", "Low", "Medium", "M…
$ `Stress_Level (1-10)`      <dbl> 1, 4, 9, 8, 6, 6, 2, 5, 9, 8, 4, 7, 9, 5, 2…
$ Sleep_Hours_per_Night      <dbl> 5.9, 4.3, 6.1, 4.9, 4.5, 5.6, 4.0, 7.8, 4.8…

names(raw_data)

 [1] "Student_ID"                 "First_Name"                
 [3] "Last_Name"                  "Email"                     
 [5] "Gender"                     "Age"                       
 [7] "Department"                 "Attendance (%)"            
 [9] "Midterm_Score"              "Final_Score"               
[11] "Assignments_Avg"            "Quizzes_Avg"               
[13] "Participation_Score"        "Projects_Score"            
[15] "Total_Score"                "Grade"                     
[17] "Study_Hours_per_Week"       "Extracurricular_Activities"
[19] "Internet_Access_at_Home"    "Parent_Education_Level"    
[21] "Family_Income_Level"        "Stress_Level (1-10)"       
[23] "Sleep_Hours_per_Night"

head(raw_data)

# A tibble: 6 × 23
  Student_ID First_Name Last_Name Email Gender   Age Department `Attendance (%)`
  <chr>      <chr>      <chr>     <chr> <chr>  <dbl> <chr>                 <dbl>
1 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
2 S1001      Maria      Brown     stud… Male      18 Business               97.7
3 S1002      Ahmed      Jones     stud… Male      24 Engineeri…             99.5
4 S1003      Omar       Williams  stud… Female    24 Engineeri…             90.4
5 S1004      John       Smith     stud… Female    23 CS                     59.4
6 S1005      Liam       Brown     stud… Male      21 Mathemati…             60.6
# ℹ 15 more variables: Midterm_Score <dbl>, Final_Score <dbl>,
#   Assignments_Avg <dbl>, Quizzes_Avg <dbl>, Participation_Score <dbl>,
#   Projects_Score <dbl>, Total_Score <dbl>, Grade <chr>,
#   Study_Hours_per_Week <dbl>, Extracurricular_Activities <chr>,
#   Internet_Access_at_Home <chr>, Parent_Education_Level <chr>,
#   Family_Income_Level <chr>, `Stress_Level (1-10)` <dbl>,
#   Sleep_Hours_per_Night <dbl>

Data Cleaning and Tidying

# Cleaning and Tidying

# Standardize column names
clean_data <- raw_data |>
  rename_with(tolower)

# Convert wide score columns to long format
clean_data <- clean_data |>
  pivot_longer(
    cols = c(
      midterm_score,
      final_score,
      assignments_avg,
      quizzes_avg,
      participation_score,
      projects_score
    ),
    names_to = "assessment_type",
    values_to = "score"
  )

# Remove rows with missing score values
clean_data <- clean_data |>
  drop_na(score)

# Check final tidy dataset
glimpse(clean_data)

Rows: 30,000
Columns: 19
$ student_id                 <chr> "S1000", "S1000", "S1000", "S1000", "S1000"…
$ first_name                 <chr> "Omar", "Omar", "Omar", "Omar", "Omar", "Om…
$ last_name                  <chr> "Williams", "Williams", "Williams", "Willia…
$ email                      <chr> "student0@university.com", "student0@univer…
$ gender                     <chr> "Female", "Female", "Female", "Female", "Fe…
$ age                        <dbl> 22, 22, 22, 22, 22, 22, 18, 18, 18, 18, 18,…
$ department                 <chr> "Mathematics", "Mathematics", "Mathematics"…
$ `attendance (%)`           <dbl> 97.36, 97.36, 97.36, 97.36, 97.36, 97.36, 9…
$ total_score                <dbl> 59.8865, 59.8865, 59.8865, 59.8865, 59.8865…
$ grade                      <chr> "F", "F", "F", "F", "F", "F", "B", "B", "B"…
$ study_hours_per_week       <dbl> 10.3, 10.3, 10.3, 10.3, 10.3, 10.3, 27.1, 2…
$ extracurricular_activities <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ internet_access_at_home    <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ parent_education_level     <chr> "Master's", "Master's", "Master's", "Master…
$ family_income_level        <chr> "Medium", "Medium", "Medium", "Medium", "Me…
$ `stress_level (1-10)`      <dbl> 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 9, 9, 9…
$ sleep_hours_per_night      <dbl> 5.9, 5.9, 5.9, 5.9, 5.9, 5.9, 4.3, 4.3, 4.3…
$ assessment_type            <chr> "midterm_score", "final_score", "assignment…
$ score                      <dbl> 40.61, 59.61, 73.69, 53.17, 73.40, 62.84, 5…

head(clean_data)

# A tibble: 6 × 19
  student_id first_name last_name email gender   age department `attendance (%)`
  <chr>      <chr>      <chr>     <chr> <chr>  <dbl> <chr>                 <dbl>
1 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
2 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
3 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
4 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
5 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
6 S1000      Omar       Williams  stud… Female    22 Mathemati…             97.4
# ℹ 11 more variables: total_score <dbl>, grade <chr>,
#   study_hours_per_week <dbl>, extracurricular_activities <chr>,
#   internet_access_at_home <chr>, parent_education_level <chr>,
#   family_income_level <chr>, `stress_level (1-10)` <dbl>,
#   sleep_hours_per_night <dbl>, assessment_type <chr>, score <dbl>

names(clean_data)

 [1] "student_id"                 "first_name"                
 [3] "last_name"                  "email"                     
 [5] "gender"                     "age"                       
 [7] "department"                 "attendance (%)"            
 [9] "total_score"                "grade"                     
[11] "study_hours_per_week"       "extracurricular_activities"
[13] "internet_access_at_home"    "parent_education_level"    
[15] "family_income_level"        "stress_level (1-10)"       
[17] "sleep_hours_per_night"      "assessment_type"           
[19] "score"

glimpse(clean_data)

Rows: 30,000
Columns: 19
$ student_id                 <chr> "S1000", "S1000", "S1000", "S1000", "S1000"…
$ first_name                 <chr> "Omar", "Omar", "Omar", "Omar", "Omar", "Om…
$ last_name                  <chr> "Williams", "Williams", "Williams", "Willia…
$ email                      <chr> "student0@university.com", "student0@univer…
$ gender                     <chr> "Female", "Female", "Female", "Female", "Fe…
$ age                        <dbl> 22, 22, 22, 22, 22, 22, 18, 18, 18, 18, 18,…
$ department                 <chr> "Mathematics", "Mathematics", "Mathematics"…
$ `attendance (%)`           <dbl> 97.36, 97.36, 97.36, 97.36, 97.36, 97.36, 9…
$ total_score                <dbl> 59.8865, 59.8865, 59.8865, 59.8865, 59.8865…
$ grade                      <chr> "F", "F", "F", "F", "F", "F", "B", "B", "B"…
$ study_hours_per_week       <dbl> 10.3, 10.3, 10.3, 10.3, 10.3, 10.3, 27.1, 2…
$ extracurricular_activities <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "…
$ internet_access_at_home    <chr> "No", "No", "No", "No", "No", "No", "No", "…
$ parent_education_level     <chr> "Master's", "Master's", "Master's", "Master…
$ family_income_level        <chr> "Medium", "Medium", "Medium", "Medium", "Me…
$ `stress_level (1-10)`      <dbl> 1, 1, 1, 1, 1, 1, 4, 4, 4, 4, 4, 4, 9, 9, 9…
$ sleep_hours_per_night      <dbl> 5.9, 5.9, 5.9, 5.9, 5.9, 5.9, 4.3, 4.3, 4.3…
$ assessment_type            <chr> "midterm_score", "final_score", "assignment…
$ score                      <dbl> 40.61, 59.61, 73.69, 53.17, 73.40, 62.84, 5…

select(clean_data, student_id, assessment_type, score) |> head()

# A tibble: 6 × 3
  student_id assessment_type     score
  <chr>      <chr>               <dbl>
1 S1000      midterm_score        40.6
2 S1000      final_score          59.6
3 S1000      assignments_avg      73.7
4 S1000      quizzes_avg          53.2
5 S1000      participation_score  73.4
6 S1000      projects_score       62.8

Data Analysis

avg_scores <- clean_data |>
  group_by(assessment_type) |>
  summarise(
    average_score = mean(score, na.rm = TRUE),
    .groups = "drop"
  )

avg_scores

# A tibble: 6 × 2
  assessment_type     average_score
  <chr>                       <dbl>
1 assignments_avg              75.0
2 final_score                  69.5
3 midterm_score                70.7
4 participation_score          50.0
5 projects_score               74.8
6 quizzes_avg                  74.8

Visualization

ggplot(avg_scores, aes(x = assessment_type, y = average_score)) +
  geom_col(fill = "steelblue") +
  geom_text(aes(label = round(average_score,1)),
            vjust = -0.5) +
  labs(
    title = "                                     Average Score by Assessment Type
    
    ",
    x = "Assessment Type",
    y = "Average Score"
  ) +
  theme_minimal()

Conclusion

The original dataset stored multiple assessment scores in separate columns, resulting in a wide structure that made comparisons between different types of evaluations less straightforward. By applying the pivot_longer() function from the tidyr package within the tidyverse, these score columns were transformed into a tidy format where each row represents a single student’s score for a specific assessment type.

This restructuring created a consistent variable structure that made it easier to group, summarize, and visualize the data. Using the tidy dataset, the analysis compared average performance across different assessment categories. The results indicate that students tend to achieve higher average scores on projects, quizzes, and assignments, while participation scores are noticeably lower. Midterm and final exam scores fall within the middle range of overall performance.

Overall, transforming the dataset from wide to tidy format significantly simplified the analytical workflow and enabled clearer comparisons between assessment types. This project demonstrates how proper data structuring is a critical step in data analysis, allowing meaningful insights to be extracted more efficiently.