The following document is ongoing work on the Applied Project of DAMA51. Following the Generic Project Description, I have chosen to work on the project “Tracking and Analysing Student Progress”. This report will evolve as the course progresses, with incremental updates after each unit.
1 Project Understanding
This project focuses on tracking and analysing student progress in a module (assumed to be part of HOU) using a dataset of student grades for a specific academic year. Reading through the project overview, we note that the dataset includes marks for various activities, namely:
Exams
Homework Assignments
Compulsory Tasks (contributing to the final grade)
Optional Tasks (not contributing to the final grade)
From these activities, exams, homework assignments, and compulsory tasks contribute to the final grade. Optional tasks do not. Activities of the same type are chronologically ordered, while a -1 value indicates non-participation.
1.1 Objective
The primary goals are:
Gain insights
Draw conclusions from the data
Attempt preliminary forecasting
Potential analyses include:
Predicting exam pass/fail based on previous activities
Clustering students based on grade similarities
Identifying patterns in student performance over time
Stakeholders
The primary stakeholders for this project include:
Teaching Academic Community of HOU: Educators and academic advisors who will use the insights to improve student outcomes.
Students: Indirect beneficiaries through targeted interventions and improved teaching strategies.
Cognitive Map
The proposed cognitive map is a preliminary assumption and will be revised as the project progresses through data understanding and analysis. Revisions will be made based on insights from exploratory data analysis, correlation studies, and clustering results.
An initial set of our main questions includes the following:
Can we predict exam pass/fail outcomes based on prior activity performance?
What are the key factors (e.g., homework grades, compulsory task participation) influencing exam performance?
How do non-participation (-1) and optional tasks impact student outcomes?
Can we cluster students into performance-based groups for targeted interventions?
As we progress, we will re-evaluate these questions and, potentially, reformulate them.
Success Criteria
The preliminary assessment is based on model performance, correlation analysis, and clustering quality.
Model Performance: Achieve ≥75% accuracy in predicting exam pass/fail.
Identified Factors: Statistically significant correlation (p < 0.05) between ≥2 activities (e.g., homework grades) and exam outcomes.
Clustering Performance: Identify ≥3 distinct student clusters based on performance patterns.
1.2 Methodology
The project follows a structured, iterative approach aligned with the data science lifecycle, organized into distinct yet interconnected chapters. Each phase builds on the previous one, ensuring a logical progression from foundational understanding to actionable insights.
1 – Project Understanding
This foundational chapter establishes the purpose and scope of the analysis. It begins by defining the core objective: investigating the relationship between student engagement and academic performance while identifying behavioural patterns influencing outcomes. Stakeholder needs—such as educators seeking to improve course design or administrators aiming to reduce dropout rates—are clarified to align the analysis with real-world priorities.
2 – Data Understanding and Preparation
This chapter combines exploratory analysis with preprocessing to ensure robust, analysis-ready data. The process begins with examining the dataset’s structure, including features like homework grades, class activities, optional assignments, and exam scores. The preparation phase directly addresses data quality issues.
3 – Engagement Analysis
Focusing on participation, this chapter investigates how students interact with coursework. Engagement is quantified through metrics like assignment submission rates, consistency over time, and participation in optional tasks.
4 – Performance Analysis
Building on engagement metrics, this chapter evaluates academic outcomes and their drivers. Exam grades and assignment averages are analyzed to identify performance distributions. Correlation tests quantify the strength of relationships between engagement rates and exam scores, while regression models assess whether participation predicts academic success. Outliers—like students with high engagement but low grades—are flagged for further investigation, potentially revealing issues such as ineffective study habits or external challenges. Comparative analyses, such as t-tests between engagement-based student groups, highlight statistically significant differences in performance, enabling targeted support strategies.
5 – Pattern Finding
This chapter employs advanced analytical techniques to uncover hidden structures and anomalies. Clustering algorithms, such as k-means or hierarchical clustering, group students by shared engagement-performance profiles, revealing archetypes like “consistent high performers” or “disengaged strugglers.” Association rule mining identifies behavioural combinations linked to success—for example, students who complete optional assignments and attend tutoring sessions are more likely to achieve top grades. Deviation analysis detects anomalies, such as students with moderate engagement but unusually high exam scores, prompting investigations into potential causes like prior knowledge or exceptional talent. These patterns synthesize insights from earlier chapters, offering a holistic view of student behaviour and its academic implications.
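To make the clustering step concrete, the following is a minimal sketch of how k-means could be applied once per-student engagement and performance metrics exist. The data frame grades_with_engagement and its columns Engagement_Rate and ex_f anticipate names introduced in later chapters, and k = 3 is a placeholder rather than a validated choice.

library(dplyr)

# Sketch only: cluster students on engagement and final-exam performance
cluster_input <- grades_with_engagement %>%
  filter(!is.na(ex_f)) %>%        # keep students with a final exam grade
  select(Engagement_Rate, ex_f) %>%
  scale()                         # standardize so both variables weigh equally

set.seed(42)                      # k-means is stochastic; fix the seed for reproducibility
km <- kmeans(cluster_input, centers = 3, nstart = 25)
table(km$cluster)                 # cluster sizes, i.e., candidate student archetypes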
Constraints
To better frame the methodology, we provide an initial set of constraints that define the boundaries and limitations of the project.
Anonymity: Student identities are anonymized to comply with privacy regulations, and ensure ethical handling of sensitive data.
Static Dataset: The dataset is historical and unmodifiable. This limits analysis to retrospective insights (no real-time updates).
Data Scope: Analysis is restricted to academic activities (e.g., grades, participation); external factors (e.g., socioeconomic status) are excluded due to data unavailability.
Assumptions
Additionally, we start with some initial assumptions in order to proceed with the analysis.
Exam Weight: Later on, we will assume a hypothetical exam contribution of 70% to the final grade. This will assist forecasting and simplify modeling where data is unavailable.
Representativeness: The dataset reflects the broader student population. This ensures findings can generalize beyond the sample.
Informativeness: Non-participation (-1) is a meaningful indicator. It allows treating -1 as a deliberate choice rather than missing data.
Data Quality: Missing values/outliers are minimal. This avoids overcomplicating preprocessing without evidence of major issues.
External Factors: Excluded variables (e.g., student motivation) do not dominate outcomes. The analysis focuses on measurable academic factors.
1.3 Deliverables
The deliverables of this project include:
Report: A Quarto document structured around each unit we follow on DAMA51, rendered as an HTML document for easy sharing and accessibility.
Scripts: Independent R scripts for review.
Presentation: A 3–5 minute summary highlighting actionable insights for the teaching academic community of HOU.
2 Data Understanding & Preparation
The dataset used for this project is “grades.xlsx”. It tracks student performance across exams, homework assignments, and activities in an educational course, providing insight for understanding student engagement, identifying learning trends, and informing pedagogical improvements.
2.1 Attribute Understanding
We start by examining the individual attributes in the dataset, their characteristics, and their domain relevance.
Overview
From the screenshot below, we can understand the structure and format of the dataset.
Raw Dataset Screenshot
Column Details
The 4 main column categories are further subdivided.
| Column Group | Columns | Description |
|---|---|---|
| Exams | Final, Repeat | Final exam score (0–10 scale) and repeat exam score (mixed numeric/flags). |
| Homework Assignments | 4 columns (1–4) | Grades for 4 homework assignments (0–10 scale). |
| Compulsory Activities | 8 columns (1–8) | Scores for 8 mandatory activities (0–10 scale). |
| Optional Activities | 10 columns (1–10) | Scores for 10 optional activities (-1 indicates non-participation). |
2.2 Data Quality
After a general overview of the dataset, we dive deeper to understand what types of data we have and the challenges we will face. We then import the dataset in R, after first setting global parameters.
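For reference, a minimal sketch of that import step; it assumes the readxl package and that grades.xlsx sits in the working directory, and the global option shown is a placeholder rather than the project's actual settings.

# Global parameters (assumed): numeric display precision
options(digits = 4)

library(readxl)

# Import the raw dataset; path and sheet are assumptions
grades_raw <- read_excel("grades.xlsx")
str(grades_raw)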
Data Types
The structure of the dataset mainly includes numeric data. However, there is categorical data in the headers and sub-headers. Finally, there is missing data, as well as a symbolic numeric value designating non-participation. Below is an overview of the data types:
Numeric Data
Exam scores (Final, Repeat)
Activity grades (Homework_1 to Optional_10)
Grade range with two-digit precision: 0.00–10.00, though some actual values deviate from this format
Categorical Data
Exam score sub-categories (Final, Repeat)
Activity types (Homework/Compulsory/Optional)
Implicit ordinality in activity sequence (e.g., Homework_1 precedes Homework_2)
Missing Data Encoding
“-” for general missingness
-1 for explicit non-participation in optional activities
Empty cells due to visual separator columns (C, H, I, R)
The current dataset does not yet have the structure we want: it lacks proper headers for each assignment or exam, and the empty separator columns must be removed.
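A sketch of this first clean-up, under the assumptions that the file has two header rows (group labels and sub-labels) and that the separator columns C, H, I, and R sit at positions 3, 8, 9, and 18; the short names ex_f, hw_1, …, oa_10 are the ones used throughout the rest of the report.

library(dplyr)
library(readxl)

# Re-read the file, skipping the two header rows (assumed layout)
grades_clean <- read_excel("grades.xlsx", skip = 2, col_names = FALSE)

# Drop the empty visual-separator columns C, H, I, R (positions 3, 8, 9, 18)
grades_clean <- grades_clean %>% select(-c(3, 8, 9, 18))

# Assign descriptive names: 2 exams, 4 homework, 8 compulsory, 10 optional
names(grades_clean) <- c("ex_f", "ex_r",
                         paste0("hw_", 1:4),
                         paste0("ca_", 1:8),
                         paste0("oa_", 1:10))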
From this first clean-up we can make some remarks:
As stated earlier, the non-participation value “-1” distorts the statistical analysis
Similarly, the missing-value marker “-” distorts the statistical analysis
Some numeric values do not share the same data type and digit precision (dbl vs chr)
We will start by converting “-1” and “-” values to NA. Then we will convert all columns to numeric, assign two-digit precision, and, lastly, impute the NA values to preserve data. Imputation keeps all rows while filling in missing values, instead of removing rows with NAs (and potentially losing important information). Statistical integrity is maintained, as each NA is replaced with the column mean. We will also keep the original cleaned dataset in order to track students’ engagement per topic.
Hence we have to further clean and modify the “grades_clean” dataset.
#| label: data-clean_up_2
#| message: false
#| warning: false
#| code-fold: false
#| code-summary: "Show the code"
#| code-line-numbers: true
#| code-overflow: scroll
#| code-copy: true

library(dplyr)

# Replace "-" and -1 with NA globally
grades_clean <- grades_clean %>%
  mutate(across(everything(), ~ ifelse(. == "-" | . == -1, NA, .)))

# Convert character columns to numeric
grades_clean <- grades_clean %>%
  mutate(across(where(is.character), ~ as.numeric(.)))

# Round numeric columns to 2 decimal places
grades_clean <- grades_clean %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

# Save this version as grades_original (with NAs)
grades_original <- grades_clean

knitr::kable(head(grades_original))
| ex_f | ex_r | hw_1 | hw_2 | hw_3 | hw_4 | ca_1 | ca_2 | ca_3 | ca_4 | ca_5 | ca_6 | ca_7 | ca_8 | oa_1 | oa_2 | oa_3 | oa_4 | oa_5 | oa_6 | oa_7 | oa_8 | oa_9 | oa_10 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| 10.0 | NA | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10 | 9.0 | 10.0 | 10.0 | 10.0 | 10 | 5.0 | 4.0 | 5.0 | 5.0 | 4 | 5.0 | 4.5 | NA | NA | NA |
| 9.1 | NA | 9.5 | 9.6 | 8.8 | 10.0 | 7.5 | 9.0 | 10 | 6.0 | 7.5 | 8.8 | NA | 10 | 3.5 | 4.0 | 3.5 | 4.0 | 2 | 4.7 | 3.0 | 1.0 | 3 | NA |
| 9.0 | NA | 9.7 | 10.0 | 9.5 | 1.9 | 10.0 | 9.0 | 10 | 9.5 | 10.0 | 10.0 | 9.0 | 10 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| 8.8 | NA | 9.1 | 8.9 | 6.0 | 0.8 | 8.5 | 9.0 | 10 | 8.0 | 9.5 | 10.0 | NA | 8 | 2.3 | 1.5 | 5.0 | 2.0 | 2 | 2.0 | NA | 0.0 | NA | 3 |
| 8.8 | NA | 9.9 | 9.6 | 9.5 | 10.0 | 9.5 | 10.0 | 10 | 9.0 | 10.0 | 10.0 | 10.0 | 8 | 2.5 | 1.0 | 1.0 | 3.0 | 1 | 5.0 | 0.0 | 2.0 | 2 | 3 |
| 8.5 | NA | 8.5 | 10.0 | 8.0 | 10.0 | 7.5 | 9.5 | 9 | 8.0 | NA | 6.5 | 7.5 | 7 | 3.0 | 4.0 | 3.0 | 4.3 | 5 | 5.0 | 2.0 | 3.5 | 4 | 5 |
# Impute NA values with column mean and keep 2-digit precision
grades_imputed <- grades_original %>%
  mutate(across(where(is.numeric), ~ ifelse(is.na(.), mean(., na.rm = TRUE), .)))

grades_imputed <- grades_imputed %>%
  mutate(across(where(is.numeric), ~ round(., 2)))

# Verify the imputed dataset
knitr::kable(head(grades_imputed))
| ex_f | ex_r | hw_1 | hw_2 | hw_3 | hw_4 | ca_1 | ca_2 | ca_3 | ca_4 | ca_5 | ca_6 | ca_7 | ca_8 | oa_1 | oa_2 | oa_3 | oa_4 | oa_5 | oa_6 | oa_7 | oa_8 | oa_9 | oa_10 |
|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|------|-------|
| 10.0 | 4.5 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10.0 | 10 | 9.0 | 10.00 | 10.0 | 10.00 | 10 | 5.00 | 4.00 | 5.00 | 5.00 | 4.00 | 5.00 | 4.50 | 3.12 | 3.13 | 3.76 |
| 9.1 | 4.5 | 9.5 | 9.6 | 8.8 | 10.0 | 7.5 | 9.0 | 10 | 6.0 | 7.50 | 8.8 | 8.02 | 10 | 3.50 | 4.00 | 3.50 | 4.00 | 2.00 | 4.70 | 3.00 | 1.00 | 3.00 | 3.76 |
| 9.0 | 4.5 | 9.7 | 10.0 | 9.5 | 1.9 | 10.0 | 9.0 | 10 | 9.5 | 10.00 | 10.0 | 9.00 | 10 | 2.94 | 3.33 | 3.39 | 3.21 | 2.82 | 3.65 | 3.18 | 3.12 | 3.13 | 3.76 |
| 8.8 | 4.5 | 9.1 | 8.9 | 6.0 | 0.8 | 8.5 | 9.0 | 10 | 8.0 | 9.50 | 10.0 | 8.02 | 8 | 2.30 | 1.50 | 5.00 | 2.00 | 2.00 | 2.00 | 3.18 | 0.00 | 3.13 | 3.00 |
| 8.8 | 4.5 | 9.9 | 9.6 | 9.5 | 10.0 | 9.5 | 10.0 | 10 | 9.0 | 10.00 | 10.0 | 10.00 | 8 | 2.50 | 1.00 | 1.00 | 3.00 | 1.00 | 5.00 | 0.00 | 2.00 | 2.00 | 3.00 |
| 8.5 | 4.5 | 8.5 | 10.0 | 8.0 | 10.0 | 7.5 | 9.5 | 9 | 8.0 | 8.09 | 6.5 | 7.50 | 7 | 3.00 | 4.00 | 3.00 | 4.30 | 5.00 | 5.00 | 2.00 | 3.50 | 4.00 | 5.00 |
3 Engagement Analysis
Now that the dataset is prepared and cleaned, we can conduct a preliminary analysis. We start with a pre-step for analyzing engagement, which involves visualizing missing values to identify patterns in assignment submissions. It also includes calculating participation rates to determine the percentage of students who submitted each assignment and comparing exam attempts by analyzing the results of retakes (ex_r) against first attempts (ex_f).
We first analyze student engagement by looking at missing assignments and participation rates. This will help us see how many students skipped assignments and identify patterns in non-participation.
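A sketch of how these participation rates can be computed from grades_original (the cleaned, pre-imputation dataset), treating NA as non-participation:

library(dplyr)
library(tidyr)

# Percentage of non-NA entries per column (assignments and exams)
participation_rates <- grades_original %>%
  summarise(across(everything(), ~ round(100 * mean(!is.na(.)), 2))) %>%
  pivot_longer(everything(),
               names_to = "assignment",
               values_to = "participation_pct")

participation_rates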
The dataset reveals clear patterns of student engagement across different assignment types (homework, class assignments, optional assignments, and exams). Participation rates vary significantly, with high engagement early in the course and a steady decline as the course progresses.
3.1 Key Observations
A. Homework Assignments (hw_1 to hw_4)
The course initially experienced high engagement, with participation rates of 98.11% for hw_1 and 93.08% for hw_2. However, there was a gradual decline as the course progressed, with participation dropping to 91.82% for hw_3 and further to 76.73% for hw_4. This decline suggests that students may be experiencing workload fatigue or shifting priorities.
B. Class Assignments (ca_1 to ca_8)
Participation rates initially reflect moderate engagement, with ca_1 at 70.44% and ca_2 at 66.67%, which, while lower than homework rates, are still substantial. Engagement holds at a similar level through ca_5 (71.07%) and ca_6 (73.58%), but then drops sharply to 58.49% for ca_7 and 53.46% for ca_8, indicating significant disengagement among students in the latter half of the course.
C. Optional Assignments (oa_1 to oa_10)
The engagement levels observed reveal a concerning trend. For instance, the participation rate for “oa_1” stands at 42.77%, while “oa_10” shows an alarmingly low participation rate of just 10.69%. This consistent decline in engagement suggests that students prioritize mandatory tasks over optional ones, reflecting a broader issue of low engagement across the board.
D. Exams (ex_f and ex_r)
In the final exam (ex_f), participation stood at 63.52%. This figure raises some concerns, as 36.48% of students did not take the final exam, indicating a potential issue that warrants further investigation.
Regarding the repeat exam (ex_r), the participation rate was significantly lower at 23.27%. This low percentage suggests that only a few students opted for retakes, implying that most either passed the first attempt or decided against retaking the exam altogether.
3.2 Conclusion
The data reveals a clear pattern of high engagement early in the course, followed by a steady decline in participation as the course progresses. Homework assignments see the highest engagement, while optional assignments are largely ignored. Low participation in exams (especially final exams) is a critical issue that requires attention.
3.3 Next Steps
Visualizing Trends in Participation
We can create a curve graph to track participation changes over time for homework, class assignments, optional assignments, and exams. The X-axis will represent assignment numbers, while the Y-axis will show participation rates.
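A possible sketch for that graph, reusing the participation_rates table computed earlier; the grouping by name prefix is an assumption about the column-naming scheme.

library(dplyr)
library(ggplot2)

participation_rates %>%
  filter(grepl("^(hw|ca|oa)_", assignment)) %>%   # assignments only, exams excluded
  mutate(
    group = sub("_\\d+$", "", assignment),              # hw / ca / oa
    index = as.integer(sub("^[a-z]+_", "", assignment)) # assignment number
  ) %>%
  ggplot(aes(x = index, y = participation_pct, color = group)) +
  geom_line() +
  geom_point() +
  labs(x = "Assignment number", y = "Participation rate (%)",
       color = "Assignment type") +
  theme_minimal()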
Correlation Analysis
Additionally, we should analyze the relationship between participation rates and exam scores to see if higher homework participation correlates with better exam performance.
Outlier Detection
Lastly, it’s important to identify students with unusual participation patterns, such as those who skipped many assignments but still excelled on exams, to understand the factors influencing their performance.
4 Performance Analysis
After understanding the overall course engagement per assignment category, we proceed to evaluate students individually.
First, we will calculate an engagement rate per student (ERS), defined as the percentage of assignments (homework, class assignments, optional assignments) a student participated in (i.e., non-NA values). NA values therefore count as non-participation.
4.1 Engagement Rate Per Student (ERS)
This step examines how missing exam grades may affect our analysis of engagement and performance. We calculate engagement rates for all students and flag those without exam scores. We then create a histogram of engagement rates and a scatter plot comparing engagement to exam grades, highlighting students with missing data. The correlation and regression line are computed only for the 109 students with valid exam grades, ensuring valid statistics while still showing the effect of missing data. This approach provides context on missing exams and prevents misleading correlations.
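The engagement-rate calculation behind the plot below is likely along these lines; grades_with_engagement, Engagement_Rate, and Exam_Status are the names the plotting code expects, and the 22 activity columns are the 4 homework, 8 compulsory, and 10 optional ones.

# Activity columns counted towards engagement (exams excluded)
activity_cols <- c(paste0("hw_", 1:4), paste0("ca_", 1:8), paste0("oa_", 1:10))

grades_with_engagement <- grades_original

# Share of the 22 activities with a submitted (non-NA) grade, as a percentage
grades_with_engagement$Engagement_Rate <-
  round(100 * rowMeans(!is.na(grades_original[activity_cols])), 2)

# Flag students with no final-exam grade
grades_with_engagement$Exam_Status <-
  ifelse(is.na(grades_with_engagement$ex_f), "No Exam", "Has Exam")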
We visualize the relationship between engagement and exam performance in a scatter plot and quantify it.
library(dplyr)
library(ggplot2)

# Create a placeholder y-value for students without exams
min_exam_grade <- min(grades_with_engagement$ex_f, na.rm = TRUE)

grades_plot_data <- grades_with_engagement %>%
  mutate(
    ex_f_plot = ifelse(is.na(ex_f), min_exam_grade - 5, ex_f),
    Exam_Status = factor(Exam_Status, levels = c("Has Exam", "No Exam"))
  )

# Plot with adjusted y-axis
ggplot(grades_plot_data, aes(x = Engagement_Rate, y = ex_f_plot)) +
  geom_point(aes(color = Exam_Status), alpha = 0.7) +
  geom_smooth(
    data = filter(grades_plot_data, Exam_Status == "Has Exam"),
    method = "lm",
    color = "red",
    se = FALSE
  ) +
  scale_color_manual(
    values = c("Has Exam" = "darkblue", "No Exam" = "gray70"),
    labels = c("Has Exam", "No Exam")
  ) +
  labs(
    title = "Engagement Rate vs. Final Exam Grade",
    x = "Engagement Rate (%)",
    y = "Final Exam Grade",
    color = "Exam Status",
    caption = "Students without exams are shown 5 points below the lowest exam grade."
  ) +
  theme_minimal()
The correlation test yields a coefficient of 0.46 between engagement rate and final exam grades, indicating a moderate positive relationship. Students with higher engagement tend to score somewhat higher on exams, but the relationship is not very strong.
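The reported coefficient comes from a Pearson test along these lines, restricted to students with a valid final-exam grade (a sketch, not necessarily the exact chunk):

# Pearson correlation between engagement and final exam grade
with_exam <- subset(grades_with_engagement, !is.na(ex_f))
cor.test(with_exam$Engagement_Rate, with_exam$ex_f, method = "pearson")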
To identify the underlying structure of student performance data and reduce dimensionality, we perform Principal Component Analysis (PCA). This technique transforms correlated variables (assignment grades, exam scores) into a set of uncorrelated principal components that capture maximum variance. The scree plot visualizes each component’s relative importance, helping determine how many components meaningfully explain performance patterns.
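A sketch of this PCA step, run on the imputed dataset so that prcomp receives no missing values; centering and scaling put all grade columns on a comparable footing.

# PCA on the imputed grades; center and scale each column
pca_fit <- prcomp(grades_imputed, center = TRUE, scale. = TRUE)

# Proportion of variance explained per component
summary(pca_fit)

# Scree plot of the component variances
screeplot(pca_fit, type = "lines", main = "Scree Plot of Principal Components")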
The Principal Component Analysis (PCA) shows that the first four components (PC1 to PC4) explain 65.4% of the differences in student performance data. PC1 accounts for 36.5% of the variance, reflecting overall performance, as students who do well in one area tend to do well in others. PC2, which explains 15.7% of the variance, highlights the difference between assignment and exam performance.
The scree plot shows a clear break at PC4, indicating that additional components add little value, contributing less than 5% each. Keeping four components captures important patterns while reducing noise.
Extending the analysis to eight components (PC1 to PC8) would increase the explained variance to 79.1%. However, this could introduce noise and complicate insights.