The following document is ongoing work on the Applied Project of DAMA51. Following the Generic Project Description, I have chosen to work on the project “Tracking and Analysing Student Progress”. This report will evolve as the course progresses, with incremental updates after each unit.
1 Project Understanding
This project focuses on tracking and analysing student progress in a module (assumed to be part of HOU) using a dataset of student grades for a specific academic year. Reading through the project overview, we note that the dataset includes marks for various activities, namely:
Exams
Homework Assignments
Compulsory Tasks (contributing to the final grade)
Optional Tasks (not contributing to the final grade)
From these activities, exams, homework assignments, and compulsory tasks contribute to the final grade. Optional tasks do not. Activities of the same type are chronologically ordered, while a -1 value indicates non-participation.
1.1 Objective
The primary goals are:
Gain insights
Draw conclusions from the data
Attempt preliminary forecasting
Potential analyses include:
Predicting exam pass/fail based on previous activities
Clustering students based on grade similarities
Identifying patterns in student performance over time
Stakeholders
The primary stakeholders for this project include:
Teaching Academic Community of HOU: Educators and academic advisors who will use the insights to improve student outcomes.
Students: Indirect beneficiaries through targeted interventions and improved teaching strategies.
Cognitive Map
The proposed cognitive map is a preliminary assumption and will be revised as the project progresses through data understanding and analysis. Revisions will be made based on insights from exploratory data analysis, correlation studies, and clustering results.
An initial set of our main questions includes the following:
Can we predict exam pass/fail outcomes based on prior activity performance?
What are the key factors (e.g., homework grades, compulsory task participation) influencing exam performance?
How do non-participation (-1) and optional tasks impact student outcomes?
Can we cluster students into performance-based groups for targeted interventions?
As we progress, we will re-evaluate these questions and, potentially, reformulate them.
Success Criteria
The preliminary assessment is based on model performance, correlation analysis, and clustering quality.
Model Performance: Achieve ≥75% accuracy in predicting exam pass/fail.
Identified Factors: Statistically significant correlation (p < 0.05) between ≥2 activities (e.g., homework grades) and exam outcomes.
Clustering Performance: Identify ≥3 distinct student clusters based on performance patterns.
1.2 Methodology
The project follows a structured, iterative approach aligned with the data science lifecycle, organized into distinct yet interconnected chapters. Each phase builds on the previous one, ensuring a logical progression from foundational understanding to actionable insights.
1 – Project Understanding
This foundational chapter establishes the purpose and scope of the analysis. It begins by defining the core objective: investigating the relationship between student engagement and academic performance while identifying behavioural patterns influencing outcomes. Stakeholder needs—such as educators seeking to improve course design or administrators aiming to reduce dropout rates—are clarified to align the analysis with real-world priorities.
2 – Data Understanding and Preparation
This chapter combines exploratory analysis with preprocessing to ensure robust, analysis-ready data. The process begins with examining the dataset’s structure, including features like homework grades, class activities, optional assignments, and exam scores. The preparation phase directly addresses data quality issues.
3 – Engagement Analysis
Focusing on participation, this chapter investigates how students interact with coursework. Engagement is quantified through metrics like assignment submission rates, consistency over time, and participation in optional tasks.
4 – Performance Analysis
Building on engagement metrics, this chapter evaluates academic outcomes and their drivers. Exam grades and assignment averages are analyzed to identify performance distributions. Correlation tests quantify the strength of relationships between engagement rates and exam scores, while regression models assess whether participation predicts academic success. Outliers—like students with high engagement but low grades—are flagged for further investigation, potentially revealing issues such as ineffective study habits or external challenges. Comparative analyses, such as t-tests between engagement-based student groups, highlight statistically significant differences in performance, enabling targeted support strategies.
5 – Pattern Finding
This chapter employs advanced analytical techniques to uncover hidden structures and anomalies. Clustering algorithms, such as k-means or hierarchical clustering, group students by shared engagement-performance profiles, revealing archetypes like “consistent high performers” or “disengaged strugglers.” Association rule mining identifies behavioural combinations linked to success—for example, students who complete optional assignments and attend tutoring sessions are more likely to achieve top grades. Deviation analysis detects anomalies, such as students with moderate engagement but unusually high exam scores, prompting investigations into potential causes like prior knowledge or exceptional talent. These patterns synthesize insights from earlier chapters, offering a holistic view of student behaviour and its academic implications.
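To make the clustering step concrete, the following is a minimal sketch of how k-means could be applied once per-student engagement and performance metrics exist. The data frame grades_with_engagement and its columns Engagement_Rate and ex_f anticipate names introduced in later chapters, and k = 3 is a placeholder rather than a validated choice.

library(dplyr)

# Sketch only: cluster students on engagement and final-exam performance
cluster_input <- grades_with_engagement %>%
  filter(!is.na(ex_f)) %>%        # keep students with a final exam grade
  select(Engagement_Rate, ex_f) %>%
  scale()                         # standardize so both variables weigh equally

set.seed(42)                      # k-means is stochastic; fix the seed for reproducibility
km <- kmeans(cluster_input, centers = 3, nstart = 25)
table(km$cluster)                 # cluster sizes, i.e., candidate student archetypes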
Constraints
To better frame the methodology, we provide an initial set of constraints that define the boundaries and limitations of the project.
Anonymity: Student identities are anonymized to comply with privacy regulations, and ensure ethical handling of sensitive data.
Static Dataset: The dataset is historical and unmodifiable. This limits analysis to retrospective insights (no real-time updates).
Data Scope: Analysis is restricted to academic activities (e.g., grades, participation); external factors (e.g., socioeconomic status) are excluded due to data unavailability.
Assumptions
Additionally, we start with some initial assumptions in order to proceed with the analysis.
Exam Weight: Later on, we will assume a hypothetical exam contribution of 70% to the final grade. This will assist forecasting and simplify modeling where data is unavailable.
Representativeness: The dataset reflects the broader student population. This ensures findings can generalize beyond the sample.
Informativeness: Non-participation (-1) is a meaningful indicator. It allows treating -1 as a deliberate choice rather than missing data.
Data Quality: Missing values/outliers are minimal. This avoids overcomplicating preprocessing without evidence of major issues.
External Factors: Excluded variables (e.g., student motivation) do not dominate outcomes. The analysis focuses on measurable academic factors.
1.3 Deliverables
The deliverables of this project include:
Report: A Quarto document structured around each unit we follow on DAMA51, rendered as an HTML document for easy sharing and accessibility.
Scripts: Independent R scripts for review.
Presentation: A 3–5 minute summary highlighting actionable insights for the teaching academic community of HOU.
2 Data Understanding & Preparation
The dataset used for this project is “grades.xlsx”. It tracks student performance across exams, homework assignments, and activities in an educational course, providing insight for understanding student engagement, identifying learning trends, and informing pedagogical improvements.
2.1 Attribute Understanding
We start by examining the individual attributes in the dataset, their characteristics, and their domain relevance.
Overview
From the screenshot below, we can understand the structure and format of the dataset.
Raw Dataset Screenshot
Column Details
The 4 main column categories are further subdivided.
| Column Group | Columns | Description |
|---|---|---|
| Exams | Final, Repeat | Final exam score (0–10 scale) and repeat exam score (mixed numeric/flags). |
| Homework Assignments | 4 columns (1–4) | Grades for 4 homework assignments (0–10 scale). |
| Compulsory Activities | 8 columns (1–8) | Scores for 8 mandatory activities (0–10 scale). |
| Optional Activities | 10 columns (1–10) | Scores for 10 optional activities (-1 indicates non-participation). |
2.2 Data Quality
After a general overview of the dataset, we dive deeper to understand what types of data we have and the challenges we will face. We then import the dataset in R, after first setting global parameters.
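For reference, a minimal sketch of that import step; it assumes the readxl package and that grades.xlsx sits in the working directory, and the global option shown is a placeholder rather than the project's actual settings.

# Global parameters (assumed): numeric display precision
options(digits = 4)

library(readxl)

# Import the raw dataset; path and sheet are assumptions
grades_raw <- read_excel("grades.xlsx")
str(grades_raw)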
Data Types
The structure of the dataset mainly includes numeric data. However, there is categorical data in the headers and sub-headers. Finally, there is missing data, as well as a symbolic numeric value designating non-participation. Below is an overview of the data types:
Numeric Data
Exam scores (Final, Repeat)
Activity grades (Homework_1 to Optional_10)
Grade range with two-digit precision: 0.00–10.00, though some actual values deviate from this format
Categorical Data
Exam score sub-categories (Final, Repeat)
Activity types (Homework/Compulsory/Optional)
Implicit ordinality in activity sequence (e.g., Homework_1 precedes Homework_2)
Missing Data Encoding
“-” for general missingness
-1 for explicit non-participation in optional activities
Empty cells due to visual separator columns (C, H, I, R)
The current dataset does not yet have the structure we want: it lacks proper headers for each assignment or exam, and the empty separator columns must be removed.
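A sketch of this first clean-up, under the assumptions that the file has two header rows (group labels and sub-labels) and that the separator columns C, H, I, and R sit at positions 3, 8, 9, and 18; the short names ex_f, hw_1, …, oa_10 are the ones used throughout the rest of the report.

library(dplyr)
library(readxl)

# Re-read the file, skipping the two header rows (assumed layout)
grades_clean <- read_excel("grades.xlsx", skip = 2, col_names = FALSE)

# Drop the empty visual-separator columns C, H, I, R (positions 3, 8, 9, 18)
grades_clean <- grades_clean %>% select(-c(3, 8, 9, 18))

# Assign descriptive names: 2 exams, 4 homework, 8 compulsory, 10 optional
names(grades_clean) <- c("ex_f", "ex_r",
                         paste0("hw_", 1:4),
                         paste0("ca_", 1:8),
                         paste0("oa_", 1:10))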
From this first clean-up we can make some remarks:
As stated earlier, the non-participation value “-1” distorts the statistical analysis
Similarly, the missing-value marker “-” distorts the statistical analysis
Some numeric values do not share the same data type and digit precision (dbl vs chr)
We will start by converting “-1” and “-” values to NA. Then we will convert all columns to numeric, assign two-digit precision, and, lastly, impute the NA values to preserve data. Imputation keeps all rows while filling in missing values, instead of removing rows with NAs (and potentially losing important information). Statistical integrity is maintained, as each NA is replaced with the column mean. We will also keep the original cleaned dataset in order to track students’ engagement per topic.
Hence we have to further clean and modify the “grades_clean” dataset.
#| label: data-clean_up_2
#| message: false
#| warning: false
#| code-fold: false
#| code-summary: "Show the code"
#| code-line-numbers: true
#| code-overflow: scroll
#| code-copy: true

library(dplyr)

# Replace "-" and -1 with NA globally
grades_clean <- grades_clean %>%
  mutate(across(everything(), ~ ifelse(. == "-" | . == -1, NA, .)))

# Convert character columns to numeric
grades_clean <- grades_clean %>%
  mutate(across(where(is.character), ~ as.numeric(.)))

# Round numeric columns to 2 decimal places
grades_clean <- grades_clean %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

# Save this version as grades_original (with NAs)
grades_original <- grades_clean

knitr::kable(head(grades_original))
| ex_f | ex_r | hw_1 | hw_2 | hw_3 | hw_4 | ca_1 | ca_2 | ca_3 | ca_4 | ca_5 | ca_6 | ca_7 | ca_8 | oa_1 | oa_2 | oa_3 | oa_4 | oa_5 | oa_6 | oa_7 | oa_8 | oa_9 | oa_10 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| 10.0 | NA | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10 | 9.0 | 10.0 | 10.0 | 10.0 | 10 | 5.0 | 4.0 | 5.0 | 5.0 | 4 | 5.0 | 4.5 | NA | NA | NA |
| 9.1 | NA | 9.5 | 9.6 | 8.8 | 10.0 | 7.5 | 9.0 | 10 | 6.0 | 7.5 | 8.8 | NA | 10 | 3.5 | 4.0 | 3.5 | 4.0 | 2 | 4.7 | 3.0 | 1.0 | 3 | NA |
| 9.0 | NA | 9.7 | 10.0 | 9.5 | 1.9 | 10.0 | 9.0 | 10 | 9.5 | 10.0 | 10.0 | 9.0 | 10 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 8.8 | NA | 9.1 | 8.9 | 6.0 | 0.8 | 8.5 | 9.0 | 10 | 8.0 | 9.5 | 10.0 | NA | 8 | 2.3 | 1.5 | 5.0 | 2.0 | 2 | 2.0 | NA | 0.0 | NA | 3 |
| 8.8 | NA | 9.9 | 9.6 | 9.5 | 10.0 | 9.5 | 10.0 | 10 | 9.0 | 10.0 | 10.0 | 10.0 | 8 | 2.5 | 1.0 | 1.0 | 3.0 | 1 | 5.0 | 0.0 | 2.0 | 2 | 3 |
| 8.5 | NA | 8.5 | 10.0 | 8.0 | 10.0 | 7.5 | 9.5 | 9 | 8.0 | NA | 6.5 | 7.5 | 7 | 3.0 | 4.0 | 3.0 | 4.3 | 5 | 5.0 | 2.0 | 3.5 | 4 | 5 |
# Impute NA values with column mean and keep 2-digit precision
grades_imputed <- grades_original %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

grades_imputed <- grades_imputed %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

# Verify the imputed dataset
knitr::kable(head(grades_imputed))
| ex_f | ex_r | hw_1 | hw_2 | hw_3 | hw_4 | ca_1 | ca_2 | ca_3 | ca_4 | ca_5 | ca_6 | ca_7 | ca_8 | oa_1 | oa_2 | oa_3 | oa_4 | oa_5 | oa_6 | oa_7 | oa_8 | oa_9 | oa_10 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| 10.0 | 4.5 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10 | 9.0 | 10.00 | 10.0 | 10.00 | 10 | 5.00 | 4.00 | 5.00 | 5.00 | 4.00 | 5.00 | 4.50 | 3.12 | 3.13 | 3.76 |
| 9.1 | 4.5 | 9.5 | 9.6 | 8.8 | 10.0 | 7.5 | 9.0 | 10 | 6.0 | 7.50 | 8.8 | 8.02 | 10 | 3.50 | 4.00 | 3.50 | 4.00 | 2.00 | 4.70 | 3.00 | 1.00 | 3.00 | 3.76 |
| 9.0 | 4.5 | 9.7 | 10.0 | 9.5 | 1.9 | 10.0 | 9.0 | 10 | 9.5 | 10.00 | 10.0 | 9.00 | 10 | 2.94 | 3.33 | 3.39 | 3.21 | 2.82 | 3.65 | 3.18 | 3.12 | 3.13 | 3.76 |
| 8.8 | 4.5 | 9.1 | 8.9 | 6.0 | 0.8 | 8.5 | 9.0 | 10 | 8.0 | 9.50 | 10.0 | 8.02 | 8 | 2.30 | 1.50 | 5.00 | 2.00 | 2.00 | 2.00 | 3.18 | 0.00 | 3.13 | 3.00 |
| 8.8 | 4.5 | 9.9 | 9.6 | 9.5 | 10.0 | 9.5 | 10.0 | 10 | 9.0 | 10.00 | 10.0 | 10.00 | 8 | 2.50 | 1.00 | 1.00 | 3.00 | 1.00 | 5.00 | 0.00 | 2.00 | 2.00 | 3.00 |
| 8.5 | 4.5 | 8.5 | 10.0 | 8.0 | 10.0 | 7.5 | 9.5 | 9 | 8.0 | 8.09 | 6.5 | 7.50 | 7 | 3.00 | 4.00 | 3.00 | 4.30 | 5.00 | 5.00 | 2.00 | 3.50 | 4.00 | 5.00 |
3 Engagement Analysis
Now that the dataset is prepared and cleaned, we can conduct a preliminary analysis. We start with a pre-step for analyzing engagement, which involves visualizing missing values to identify patterns in assignment submissions. It also includes calculating participation rates to determine the percentage of students who submitted each assignment and comparing exam attempts by analyzing the results of retakes (ex_r) against first attempts (ex_f).
We first analyze student engagement by looking at missing assignments and participation rates. This will help us see how many students skipped assignments and identify patterns in non-participation.
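A sketch of how these participation rates can be computed from grades_original (the cleaned, pre-imputation dataset), treating NA as non-participation:

library(dplyr)
library(tidyr)

# Percentage of non-NA entries per column (assignments and exams)
participation_rates <- grades_original %>%
  summarise(across(everything(), ~ round(100 * mean(!is.na(.)), 2))) %>%
  pivot_longer(everything(),
               names_to = "assignment",
               values_to = "participation_pct")

participation_rates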
The dataset reveals clear patterns of student engagement across different assignment types (homework, class assignments, optional assignments, and exams). Participation rates vary significantly, with high engagement early in the course and a steady decline as the course progresses.
3.1 Key Observations
A. Homework Assignments (hw_1 to hw_4)
The course initially experienced high engagement, with participation rates of 98.11% for hw_1 and 93.08% for hw_2. However, there was a gradual decline as the course progressed, with participation dropping to 91.82% for hw_3 and further to 76.73% for hw_4. This decline suggests that students may be experiencing workload fatigue or shifting priorities.
B. Class Assignments (ca_1 to ca_8)
Participation rates initially reflect moderate engagement, with ca_1 at 70.44% and ca_2 at 66.67%, which, while lower than homework rates, are still substantial. Engagement holds at a similar level through ca_5 (71.07%) and ca_6 (73.58%), but then drops sharply to 58.49% for ca_7 and 53.46% for ca_8, indicating significant disengagement among students in the latter half of the course.
C. Optional Assignments (oa_1 to oa_10)
The engagement levels observed reveal a concerning trend. For instance, the participation rate for “oa_1” stands at 42.77%, while “oa_10” shows an alarmingly low participation rate of just 10.69%. This consistent decline in engagement suggests that students prioritize mandatory tasks over optional ones, reflecting a broader issue of low engagement across the board.
D. Exams (ex_f and ex_r)
In the final exam (ex_f), participation stood at 63.52%. This figure raises some concerns, as 36.48% of students did not take the final exam, indicating a potential issue that warrants further investigation.
Regarding the repeat exam (ex_r), the participation rate was significantly lower at 23.27%. This low percentage suggests that only a few students opted for retakes, implying that most either passed the first attempt or decided against retaking the exam altogether.
3.2 Conclusion
The data reveals a clear pattern of high engagement early in the course, followed by a steady decline in participation as the course progresses. Homework assignments see the highest engagement, while optional assignments are largely ignored. Low participation in exams (especially final exams) is a critical issue that requires attention.
3.3 Next Steps
Visualizing Trends in Participation
We can create a curve graph to track participation changes over time for homework, class assignments, optional assignments, and exams. The X-axis will represent assignment numbers, while the Y-axis will show participation rates.
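A possible sketch for that graph, reusing the participation_rates table computed earlier; the grouping by name prefix is an assumption about the column-naming scheme.

library(dplyr)
library(ggplot2)

participation_rates %>%
  filter(grepl("^(hw|ca|oa)_", assignment)) %>%   # assignments only, exams excluded
  mutate(
    group = sub("_\\d+$", "", assignment),              # hw / ca / oa
    index = as.integer(sub("^[a-z]+_", "", assignment)) # assignment number
  ) %>%
  ggplot(aes(x = index, y = participation_pct, color = group)) +
  geom_line() +
  geom_point() +
  labs(x = "Assignment number", y = "Participation rate (%)",
       color = "Assignment type") +
  theme_minimal()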
Correlation Analysis
Additionally, we should analyze the relationship between participation rates and exam scores to see if higher homework participation correlates with better exam performance.
Outlier Detection
Lastly, it’s important to identify students with unusual participation patterns, such as those who skipped many assignments but still excelled on exams, to understand the factors influencing their performance.
4 Performance Analysis
After understanding the overall course engagement per assignment category, we proceed to evaluate students individually.
First, we will calculate an engagement rate per student (ERS), defined as the percentage of assignments (homework, class assignments, optional assignments) a student participated in (i.e., non-NA values). NA values therefore count as non-participation.
4.1 Engagement Rate Per Student (ERS)
This step examines how missing exam grades may affect our analysis of engagement and performance. We calculate engagement rates for all students and flag those without exam scores. We then create a histogram of engagement rates and a scatter plot comparing engagement to exam grades, highlighting students with missing data. The correlation and regression line are computed only for the 109 students with valid exam grades, ensuring valid statistics while still showing the effect of missing data. This approach provides context on missing exams and prevents misleading correlations.
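The engagement-rate calculation behind the plot below is likely along these lines; grades_with_engagement, Engagement_Rate, and Exam_Status are the names the plotting code expects, and the 22 activity columns are the 4 homework, 8 compulsory, and 10 optional ones.

# Activity columns counted towards engagement (exams excluded)
activity_cols <- c(paste0("hw_", 1:4), paste0("ca_", 1:8), paste0("oa_", 1:10))

grades_with_engagement <- grades_original

# Share of the 22 activities with a submitted (non-NA) grade, as a percentage
grades_with_engagement$Engagement_Rate <-
  round(100 * rowMeans(!is.na(grades_original[activity_cols])), 2)

# Flag students with no final-exam grade
grades_with_engagement$Exam_Status <-
  ifelse(is.na(grades_with_engagement$ex_f), "No Exam", "Has Exam")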
We visualize the relationship between engagement and exam performance in a scatter plot and quantify it.
library(dplyr)
library(ggplot2)

# Create a placeholder y-value for students without exams
min_exam_grade <- min(grades_with_engagement$ex_f, na.rm = TRUE)

grades_plot_data <- grades_with_engagement %>%
  mutate(
    ex_f_plot = ifelse(is.na(ex_f), min_exam_grade - 5, ex_f),
    Exam_Status = factor(Exam_Status, levels = c("Has Exam", "No Exam"))
  )

# Plot with adjusted y-axis
ggplot(grades_plot_data, aes(x = Engagement_Rate, y = ex_f_plot)) +
  geom_point(aes(color = Exam_Status), alpha = 0.7) +
  geom_smooth(
    data = filter(grades_plot_data, Exam_Status == "Has Exam"),
    method = "lm",
    color = "red",
    se = FALSE
  ) +
  scale_color_manual(
    values = c("Has Exam" = "darkblue", "No Exam" = "gray70"),
    labels = c("Has Exam", "No Exam")
  ) +
  labs(
    title = "Engagement Rate vs. Final Exam Grade",
    x = "Engagement Rate (%)",
    y = "Final Exam Grade",
    color = "Exam Status",
    caption = "Students without exams are shown 5 points below the lowest exam grade."
  ) +
  theme_minimal()
The correlation test yields a coefficient of 0.46 between engagement rate and final exam grades, indicating a moderate positive relationship. Students with higher engagement tend to score somewhat higher on exams, but the relationship is not very strong.
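The reported coefficient comes from a Pearson test along these lines, restricted to students with a valid final-exam grade (a sketch, not necessarily the exact chunk):

# Pearson correlation between engagement and final exam grade
with_exam <- subset(grades_with_engagement, !is.na(ex_f))
cor.test(with_exam$Engagement_Rate, with_exam$ex_f, method = "pearson")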
To identify the underlying structure of student performance data and reduce dimensionality, we perform Principal Component Analysis (PCA). This technique transforms correlated variables (assignment grades, exam scores) into a set of uncorrelated principal components that capture maximum variance. The scree plot visualizes each component’s relative importance, helping determine how many components meaningfully explain performance patterns.
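A sketch of this PCA step, run on the imputed dataset so that prcomp receives no missing values; centering and scaling put all grade columns on a comparable footing.

# PCA on the imputed grades; center and scale each column
pca_fit <- prcomp(grades_imputed, center = TRUE, scale. = TRUE)

# Proportion of variance explained per component
summary(pca_fit)

# Scree plot of the component variances
screeplot(pca_fit, type = "lines", main = "Scree Plot of Principal Components")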
The Principal Component Analysis (PCA) shows that the first four components (PC1 to PC4) explain 65.4% of the differences in student performance data. PC1 accounts for 36.5% of the variance, reflecting overall performance, as students who do well in one area tend to do well in others. PC2, which explains 15.7% of the variance, highlights the difference between assignment and exam performance.
The scree plot shows a clear break at PC4, indicating that additional components add little value, contributing less than 5% each. Keeping four components captures important patterns while reducing noise.
Extending the analysis to eight components (PC1 to PC8) would increase the explained variance to 79.1%. However, this could introduce noise and complicate insights.