Team members:-

1. MUHAMMAD IRFAN BIN ZULKEPLI 22110533

2. LEE SIEW XUEN 23095095

3. LEONG WING WAI 24074889

4. MUHAMMAD HABIB ZAMAN BIN KAMARUL ZAMAN 24078074

5. KIRTHIGA MANIMARAN U2005006

1. Introduction and Objectives

1.1 Project Background

With the increased use of online learning platforms, more data is being collected on students’ learning behaviour and academic performance. This data can be analysed to better understand how factors such as study habits, attendance, and motivation affect students’ results.

This project, titled “Optimizing Education via Machine Learning: Predicting Student Performance and Classifying Learning Behaviors,” uses a student performance dataset to explore key learning factors and predict student academic performance.

1.2 Problem Statement

Raw educational datasets often contain mixed data types, inconsistent formats, or unnecessary attributes that may affect analysis results. Without proper exploratory data analysis and data profiling, it is difficult to understand the characteristics of student performance data, because the relationships between variables are rarely obvious from the raw values.

It can also be difficult to identify the key factors that influence academic performance, because the learning behaviour variables are interdependent.

Without the application of machine learning techniques and data-driven insights, educational institutions face difficulties in accurately predicting student performance, classifying learning behaviours, and designing effective targeted interventions and personalised learning strategies.

1.3 Project Objectives (Goals of processing this dataset)

  1. To perform exploratory data analysis and data profiling to understand the characteristics of the learning behaviour variables and their relationships with student performance.
  2. To identify the key factors that influence academic performance, taking the interdependence between the variables into account.
  3. To apply machine learning techniques to predict student performance and classify learning behaviours.
  4. To derive targeted interventions and personalised strategies supported by data-driven insights.

2. Data Profiling & Exploration

2.1 Dataset Metadata (Source, Title, Year, Purpose)

  • Dataset Title: Student Performance and Learning Behavior Dataset

  • Year: 2024

  • Source: Kaggle (link: https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style/data)

  • Purpose: The dataset is used to analyse student learning behaviours, engagement levels, and academic performance. It supports exploratory data analysis and serves as input for machine learning tasks such as performance prediction and learning behaviour classification.

2.2 Data Structure & Dimensions (Rows vs Columns)

Based on an initial inspection of the raw data, the student performance dataset contains 14,003 rows and 16 columns. Each row holds the values recorded for one student, and the columns cover different aspects of student learning.

Learning behaviour in the dataset is reflected through variables such as StudyHours, Attendance, and AssignmentCompletion, which describe students’ study time, class attendance, and assignment completion rate. Student engagement is captured using variables such as Discussions, OnlineCourses, Extracurricular, and EduTech, which indicate participation in discussions, online learning activities, extracurricular involvement, and the use of educational technology.

The dataset also includes background information such as Age, Gender, Internet, and Resources, which describe the students’ learning environment and access to study support. In addition, Motivation and StressLevel are included as numeric variables to represent students’ self-reported motivation and stress levels. LearningStyle indicates the preferred learning style of the student and is numerically encoded.

Academic performance is measured using ExamScore, while FinalGrade represents the overall course outcome and is used as the target variable in later analysis. All variables are stored in numeric form, with some categorical information encoded as integers, which will require further preprocessing in the next stage.

2.3 Summary of Raw Data (Statistical summary)

Based on the summary statistics of the raw data:

  • No missing values were detected in any variable, so the dataset is complete and needs no missing-value handling at this stage.

  • StudyHours and Attendance both vary widely: study hours range from 5 to 44 (standard deviation about 5.9) and attendance ranges from 60% to 100% (standard deviation about 11.5). This suggests that students differ considerably in study habits and class participation.

  • AssignmentCompletion averages about 74.5% and ranges from 50% to 100%. Although most students keep up with coursework, consistency still varies across the dataset.

  • ExamScore ranges from 40 to 100, so the dataset contains both lower- and higher-performing students, which makes it suitable for performance analysis.

  • Age ranges from 18 to 29 years, with a mean of approximately 23.5 years, which suggests that the dataset mainly represents students of higher-education age.

  • The FinalGrade variable ranges from 0 to 3, representing multiple performance levels. This distribution supports its use as a target variable for classification tasks in later stages of the project.

  • Overall, the raw dataset shows good data quality and sufficient variation to support further analysis after basic data cleaning.
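The ranges and spreads quoted in the bullets above can be collected into a single profiling table. The following is a minimal sketch in base R; `toy` is a small made-up data frame that only borrows three column names from the dataset, and its values are illustrative rather than taken from the real file.

```r
# Toy stand-in for the raw data (values are illustrative, not from the dataset)
toy <- data.frame(
  StudyHours = c(5, 16, 20, 24, 44),
  Attendance = c(60, 70, 80, 90, 100),
  ExamScore  = c(40, 55, 70, 86, 100)
)

# One row per variable: mean, standard deviation, minimum and maximum
profile <- t(sapply(toy, function(x) {
  c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
}))

print(round(profile, 2))
```

Applied to the real data frame, the same call would put the standard deviation next to the minimum and maximum, which makes statements about spread easier to check than separate mean, min and max calls.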

library(tidyverse)
df <- read_csv("student_performance.csv")
colSums(is.na(df))
##           StudyHours           Attendance            Resources 
##                    0                    0                    0 
##      Extracurricular           Motivation             Internet 
##                    0                    0                    0 
##               Gender                  Age        LearningStyle 
##                    0                    0                    0 
##        OnlineCourses          Discussions AssignmentCompletion 
##                    0                    0                    0 
##            ExamScore              EduTech          StressLevel 
##                    0                    0                    0 
##           FinalGrade 
##                    0
numeric_cols <- sapply(df, is.numeric)
numeric_data <- df[, numeric_cols]
apply(numeric_data, 2, mean, na.rm = TRUE)
##           StudyHours           Attendance            Resources 
##           19.9874313           80.1943155            1.1044062 
##      Extracurricular           Motivation             Internet 
##            0.5941584            0.9058059            0.9255160 
##               Gender                  Age        LearningStyle 
##            0.5519532           23.5321717            1.5154610 
##        OnlineCourses          Discussions AssignmentCompletion 
##            9.8919517            0.6058702           74.5025352 
##            ExamScore              EduTech          StressLevel 
##           70.3469257            0.7090623            1.3043634 
##           FinalGrade 
##            1.4479040
apply(numeric_data, 2, min, na.rm = TRUE)
##           StudyHours           Attendance            Resources 
##                    5                   60                    0 
##      Extracurricular           Motivation             Internet 
##                    0                    0                    0 
##               Gender                  Age        LearningStyle 
##                    0                   18                    0 
##        OnlineCourses          Discussions AssignmentCompletion 
##                    0                    0                   50 
##            ExamScore              EduTech          StressLevel 
##                   40                    0                    0 
##           FinalGrade 
##                    0
apply(numeric_data, 2, max, na.rm = TRUE)
##           StudyHours           Attendance            Resources 
##                   44                  100                    2 
##      Extracurricular           Motivation             Internet 
##                    1                    2                    1 
##               Gender                  Age        LearningStyle 
##                    1                   29                    3 
##        OnlineCourses          Discussions AssignmentCompletion 
##                   20                    1                  100 
##            ExamScore              EduTech          StressLevel 
##                  100                    1                    2 
##           FinalGrade 
##                    3

3. Data Cleaning & Pre-processing

3.1 Cleaning Strategy

The purpose of the data cleaning process is to ensure the dataset is suitable for exploratory data analysis and prediction modelling.

A systematic approach was followed to perform the data cleaning process:

  1. Dataset structure and summary statistics were studied to understand variable types and possible data issues.

  2. The dataset was checked for duplicate records, missing values, and negative values.

  3. Categorical variables that were numerically encoded were converted into factor variables.

  4. The cleaned dataset was validated to verify that every variable stays within a logical and meaningful range.

The packages used for data cleaning are “tidyverse”, “janitor”, and “skimr”.

3.2 Tidying the data

1. Handling Missing Values

  1. The purpose of missing value checks is to prevent errors during model development.

  2. The presence of null values was checked with is.na().

  3. Based on the data assessment, there were no missing values.

  4. The data was complete, so no null value treatment was required.

2. Handling Duplicate Records

  1. The purpose of duplicate record checks is to prevent potential data bias.

  2. The duplicate records were removed with the distinct() function.

  3. There were 1,534 duplicate records, so the dataset shrank from 14,003 to 12,469 rows.

3. Handling Negative Values

  1. The purpose of negative value checks is to verify if the dataset has any illogical values.

  2. The presence of negative values in the numeric columns was checked with any(df < 0, na.rm = TRUE).

  3. There were no negative values.

4. Column Name Cleaning

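  1. The purpose of this step is to have variable names that are consistent and easy to read.

  2. Column names were converted to snake_case with the clean_names() function from janitor (for example, StudyHours became study_hours). This step is mainly for user convenience.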

5. Conversion of Numerically Encoded Categorical Variables into Factor Variables

  1. The purpose is to reflect the qualitative nature of those variables.

  2. All the variables in the dataset are numerically encoded.

  3. There are 16 variables in total; 10 of them are categorical.

  4. The 10 variables are: Gender, Discussions, EduTech, Extracurricular, Internet, Resources, Motivation, StressLevel, LearningStyle, and FinalGrade. OnlineCourses takes values from 0 to 20 and remains numeric.

  5. Apart from FinalGrade, the other 9 categorical variables were converted into factor variables with the factor() or as.factor() function.

  6. FinalGrade was converted into an ordinal categorical variable with the as.ordered() function because its levels are ranked (e.g. 0 is A, 3 is D).
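The duplicate count and the factor conversion described in the steps above can be sketched on a small example. This is a minimal base R illustration, not the report's own pipeline: the `toy` data frame and its values are made up, and only the column names and the Female/Male labels follow the dataset. It also shows a point worth keeping in mind for later modelling: as.numeric() on a factor returns level codes, not the original values.

```r
# Toy stand-in for the raw data (illustrative values only)
toy <- data.frame(
  Gender     = c(0, 1, 1, 0, 1),
  FinalGrade = c(3, 2, 2, 0, 1),
  ExamScore  = c(40, 66, 66, 99, 80)
)

# Count fully duplicated rows before dropping them (rows 2 and 3 are identical)
n_dup <- sum(duplicated(toy))
toy_unique <- toy[!duplicated(toy), ]

# Nominal variable -> unordered factor; ranked variable -> ordered factor
toy_unique$Gender     <- factor(toy_unique$Gender, levels = c(0, 1),
                                labels = c("Female", "Male"))
toy_unique$FinalGrade <- as.ordered(toy_unique$FinalGrade)

# Caution: as.numeric() on a factor returns level codes (1, 2, ...),
# not the original labels; go through as.character() to recover 0-3.
codes    <- as.numeric(toy_unique$FinalGrade)
original <- as.numeric(as.character(toy_unique$FinalGrade))

print(n_dup)
print(rbind(codes = codes, original = original))
```

Counting duplicates before removing them documents how many rows a later distinct() or !duplicated() step is expected to drop.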

Post-Cleaning Summary

After cleaning, the dataset was verified to check its quality. The cleaned dataset has 12,469 records with standardized variable names and suitable data types. All 10 categorical variables were converted into factor variables, and the 6 numerical variables were left unchanged. Based on the summary statistics, the dataset contains no negative values or illogical outliers.
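The validation step can also be made explicit rather than read off the summary tables. The sketch below is a hedged base R example: `toy_clean` is a made-up data frame, and the bounds in `valid_ranges` are assumptions chosen for illustration (percentages and scores between 0 and 100, age within the observed 18 to 29 range), not rules stated by the dataset's author.

```r
# Toy stand-in for the cleaned data (illustrative values only)
toy_clean <- data.frame(
  attendance = c(60, 75, 100),
  exam_score = c(40, 70, 100),
  age        = c(18, 24, 29)
)

# Assumed plausible bounds per column (hypothetical, for illustration)
valid_ranges <- list(
  attendance = c(0, 100),
  exam_score = c(0, 100),
  age        = c(18, 29)
)

# TRUE for each column whose values all fall inside its allowed range
in_range <- sapply(names(valid_ranges), function(col) {
  rng <- valid_ranges[[col]]
  all(toy_clean[[col]] >= rng[1] & toy_clean[[col]] <= rng[2])
})

# No missing values should remain after cleaning
no_missing <- !anyNA(toy_clean)

print(in_range)
print(no_missing)
```

Wrapping such checks in stopifnot() would make a report fail loudly if a later change to the cleaning code introduced out-of-range values.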

library(tidyverse)
library(janitor)
library(skimr)
df <- read_csv("student_performance.csv")
summary(df)
##    StudyHours      Attendance       Resources     Extracurricular 
##  Min.   : 5.00   Min.   : 60.00   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:16.00   1st Qu.: 70.00   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :20.00   Median : 80.00   Median :1.000   Median :1.0000  
##  Mean   :19.99   Mean   : 80.19   Mean   :1.104   Mean   :0.5942  
##  3rd Qu.:24.00   3rd Qu.: 90.00   3rd Qu.:2.000   3rd Qu.:1.0000  
##  Max.   :44.00   Max.   :100.00   Max.   :2.000   Max.   :1.0000  
##    Motivation        Internet          Gender           Age       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.000   Min.   :18.00  
##  1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.000   1st Qu.:20.00  
##  Median :1.0000   Median :1.0000   Median :1.000   Median :24.00  
##  Mean   :0.9058   Mean   :0.9255   Mean   :0.552   Mean   :23.53  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:27.00  
##  Max.   :2.0000   Max.   :1.0000   Max.   :1.000   Max.   :29.00  
##  LearningStyle   OnlineCourses     Discussions     AssignmentCompletion
##  Min.   :0.000   Min.   : 0.000   Min.   :0.0000   Min.   : 50.0       
##  1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:0.0000   1st Qu.: 62.0       
##  Median :2.000   Median :10.000   Median :1.0000   Median : 74.0       
##  Mean   :1.515   Mean   : 9.892   Mean   :0.6059   Mean   : 74.5       
##  3rd Qu.:3.000   3rd Qu.:15.000   3rd Qu.:1.0000   3rd Qu.: 87.0       
##  Max.   :3.000   Max.   :20.000   Max.   :1.0000   Max.   :100.0       
##    ExamScore         EduTech        StressLevel      FinalGrade   
##  Min.   : 40.00   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.: 55.00   1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.000  
##  Median : 70.00   Median :1.0000   Median :2.000   Median :1.000  
##  Mean   : 70.35   Mean   :0.7091   Mean   :1.304   Mean   :1.448  
##  3rd Qu.: 86.00   3rd Qu.:1.0000   3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :100.00   Max.   :1.0000   Max.   :2.000   Max.   :3.000

Quick Checks

dim(df)
## [1] 14003    16
str(df)
## spc_tbl_ [14,003 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ StudyHours          : num [1:14003] 19 19 19 19 19 19 19 19 19 19 ...
##  $ Attendance          : num [1:14003] 64 64 64 64 64 64 64 64 64 64 ...
##  $ Resources           : num [1:14003] 1 1 1 1 1 1 0 0 0 1 ...
##  $ Extracurricular     : num [1:14003] 0 0 0 1 1 1 1 1 1 1 ...
##  $ Motivation          : num [1:14003] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Internet            : num [1:14003] 1 1 1 1 1 1 1 1 1 1 ...
##  $ Gender              : num [1:14003] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Age                 : num [1:14003] 19 23 28 19 23 28 19 23 28 19 ...
##  $ LearningStyle       : num [1:14003] 2 3 1 2 3 1 2 3 1 2 ...
##  $ OnlineCourses       : num [1:14003] 8 16 19 8 16 19 8 16 19 8 ...
##  $ Discussions         : num [1:14003] 1 0 0 1 0 0 1 0 0 1 ...
##  $ AssignmentCompletion: num [1:14003] 59 90 67 59 90 67 59 90 67 59 ...
##  $ ExamScore           : num [1:14003] 40 66 99 40 66 99 40 66 99 40 ...
##  $ EduTech             : num [1:14003] 0 0 1 0 0 1 0 0 1 0 ...
##  $ StressLevel         : num [1:14003] 1 1 1 1 1 1 1 1 1 1 ...
##  $ FinalGrade          : num [1:14003] 3 2 0 3 2 0 3 2 0 3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   StudyHours = col_double(),
##   ..   Attendance = col_double(),
##   ..   Resources = col_double(),
##   ..   Extracurricular = col_double(),
##   ..   Motivation = col_double(),
##   ..   Internet = col_double(),
##   ..   Gender = col_double(),
##   ..   Age = col_double(),
##   ..   LearningStyle = col_double(),
##   ..   OnlineCourses = col_double(),
##   ..   Discussions = col_double(),
##   ..   AssignmentCompletion = col_double(),
##   ..   ExamScore = col_double(),
##   ..   EduTech = col_double(),
##   ..   StressLevel = col_double(),
##   ..   FinalGrade = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
head(df)
## # A tibble: 6 × 16
##   StudyHours Attendance Resources Extracurricular Motivation Internet Gender
##        <dbl>      <dbl>     <dbl>           <dbl>      <dbl>    <dbl>  <dbl>
## 1         19         64         1               0          0        1      0
## 2         19         64         1               0          0        1      0
## 3         19         64         1               0          0        1      0
## 4         19         64         1               1          0        1      0
## 5         19         64         1               1          0        1      0
## 6         19         64         1               1          0        1      0
## # ℹ 9 more variables: Age <dbl>, LearningStyle <dbl>, OnlineCourses <dbl>,
## #   Discussions <dbl>, AssignmentCompletion <dbl>, ExamScore <dbl>,
## #   EduTech <dbl>, StressLevel <dbl>, FinalGrade <dbl>
colSums(is.na(df))
##           StudyHours           Attendance            Resources 
##                    0                    0                    0 
##      Extracurricular           Motivation             Internet 
##                    0                    0                    0 
##               Gender                  Age        LearningStyle 
##                    0                    0                    0 
##        OnlineCourses          Discussions AssignmentCompletion 
##                    0                    0                    0 
##            ExamScore              EduTech          StressLevel 
##                    0                    0                    0 
##           FinalGrade 
##                    0
skim(df)
Data summary
Name df
Number of rows 14003
Number of columns 16
_______________________
Column type frequency:
numeric 16
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
StudyHours 0 1 19.99 5.89 5 16 20 24 44 ▂▇▇▂▁
Attendance 0 1 80.19 11.47 60 70 80 90 100 ▇▇▇▇▇
Resources 0 1 1.10 0.70 0 1 1 2 2 ▃▁▇▁▅
Extracurricular 0 1 0.59 0.49 0 0 1 1 1 ▆▁▁▁▇
Motivation 0 1 0.91 0.70 0 0 1 1 2 ▅▁▇▁▃
Internet 0 1 0.93 0.26 0 1 1 1 1 ▁▁▁▁▇
Gender 0 1 0.55 0.50 0 0 1 1 1 ▆▁▁▁▇
Age 0 1 23.53 3.51 18 20 24 27 29 ▇▅▅▅▇
LearningStyle 0 1 1.52 1.11 0 1 2 3 3 ▇▇▁▇▇
OnlineCourses 0 1 9.89 6.11 0 5 10 15 20 ▇▆▆▆▆
Discussions 0 1 0.61 0.49 0 0 1 1 1 ▅▁▁▁▇
AssignmentCompletion 0 1 74.50 14.63 50 62 74 87 100 ▇▇▇▇▇
ExamScore 0 1 70.35 17.69 40 55 70 86 100 ▇▇▇▇▇
EduTech 0 1 0.71 0.45 0 0 1 1 1 ▃▁▁▁▇
StressLevel 0 1 1.30 0.79 0 1 2 2 2 ▃▁▅▁▇
FinalGrade 0 1 1.45 1.12 0 0 1 2 3 ▇▇▁▇▇

Check for negative values (numeric only)

has_negatives <- df %>%
  select(where(is.numeric)) %>%
  { any(. < 0, na.rm = TRUE) }

if (has_negatives) {
  warning("Dataset contains negative values! Investigate them before proceeding.")
} else {
  print("Data is clean: No negative values found.")
}
## [1] "Data is clean: No negative values found."

Cleaning + preprocessing

df_clean <- df %>%
  clean_names() %>%
  distinct() %>%
  mutate(
    gender = factor(gender, levels = c(0, 1), labels = c("Female", "Male")),
    motivation = factor(motivation, levels = c(0, 1, 2), labels = c("Low", "Medium", "High")),
    extracurricular = factor(extracurricular, levels = c(0, 1), labels = c("No", "Yes")),
    resources = factor(resources, levels = c(0, 1, 2), labels = c("Low", "Medium", "High")),
    internet = factor(internet, levels = c(0, 1), labels = c("No", "Yes")),
    discussions = as.factor(discussions),
    edu_tech = as.factor(edu_tech),
    stress_level = as.factor(stress_level),
    learning_style = as.factor(learning_style),
    final_grade = as.ordered(final_grade)
  )

Post cleaning review

dim(df_clean)
## [1] 12469    16
summary(df_clean)
##   study_hours      attendance      resources    extracurricular  motivation  
##  Min.   : 5.00   Min.   : 60.00   Low   :2585   No :5198        Low   :3770  
##  1st Qu.:16.00   1st Qu.: 70.00   Medium:6035   Yes:7271        Medium:6084  
##  Median :20.00   Median : 80.00   High  :3849                   High  :2615  
##  Mean   :20.03   Mean   : 80.24                                              
##  3rd Qu.:24.00   3rd Qu.: 90.00                                              
##  Max.   :44.00   Max.   :100.00                                              
##  internet       gender          age        learning_style online_courses  
##  No : 1034   Female:5753   Min.   :18.00   0:3029         Min.   : 0.000  
##  Yes:11435   Male  :6716   1st Qu.:20.00   1:3164         1st Qu.: 5.000  
##                            Median :24.00   2:3097         Median :10.000  
##                            Mean   :23.53   3:3179         Mean   : 9.872  
##                            3rd Qu.:27.00                  3rd Qu.:15.000  
##                            Max.   :29.00                  Max.   :20.000  
##  discussions assignment_completion   exam_score     edu_tech stress_level
##  0:4910      Min.   : 50.00        Min.   : 40.00   0:3651   0:2524      
##  1:7559      1st Qu.: 62.00        1st Qu.: 55.00   1:8818   1:3614      
##              Median : 74.00        Median : 70.00            2:6331      
##              Mean   : 74.52        Mean   : 70.31                        
##              3rd Qu.: 87.00        3rd Qu.: 86.00                        
##              Max.   :100.00        Max.   :100.00                        
##  final_grade
##  0:3401     
##  1:2943     
##  2:3221     
##  3:2904     
##             
## 
skim(df_clean)
Data summary
Name df_clean
Number of rows 12469
Number of columns 16
_______________________
Column type frequency:
factor 10
numeric 6
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
resources 0 1 FALSE 3 Med: 6035, Hig: 3849, Low: 2585
extracurricular 0 1 FALSE 2 Yes: 7271, No: 5198
motivation 0 1 FALSE 3 Med: 6084, Low: 3770, Hig: 2615
internet 0 1 FALSE 2 Yes: 11435, No: 1034
gender 0 1 FALSE 2 Mal: 6716, Fem: 5753
learning_style 0 1 FALSE 4 3: 3179, 1: 3164, 2: 3097, 0: 3029
discussions 0 1 FALSE 2 1: 7559, 0: 4910
edu_tech 0 1 FALSE 2 1: 8818, 0: 3651
stress_level 0 1 FALSE 3 2: 6331, 1: 3614, 0: 2524
final_grade 0 1 TRUE 4 0: 3401, 2: 3221, 1: 2943, 3: 2904

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
study_hours 0 1 20.03 6.05 5 16 20 24 44 ▂▇▇▂▁
attendance 0 1 80.24 11.47 60 70 80 90 100 ▇▇▇▇▇
age 0 1 23.53 3.51 18 20 24 27 29 ▇▅▅▅▇
online_courses 0 1 9.87 6.11 0 5 10 15 20 ▇▆▆▆▆
assignment_completion 0 1 74.52 14.66 50 62 74 87 100 ▇▇▇▇▇
exam_score 0 1 70.31 17.70 40 55 70 86 100 ▇▇▇▇▇

Boxplot (diagnostic)

boxplot(df_clean, main = "Boxplot of All Columns", las = 2)

4. Data Analysis & Results

Building upon the data profiling and exploration phases, this section addresses the third project objective: applying machine learning techniques to predict student performance. Specifically, we focus on regression and classification analysis to categorize students based on their likely academic outcomes.

4.1 Regression Analysis: Multiple Linear Regression

This section applies linear regression to model and predict student academic performance based on selected educational and behavioral factors. Linear regression is chosen due to its interpretability and effectiveness in explaining relationships between independent variables and a continuous outcome variable.

4.1.1 Research Question 1

RQ1: How do students’ learning-related factors, such as study time, attendance, and engagement levels, influence their final academic score?

Objective: To quantify the relationship between key student attributes and final performance and to determine which factors significantly contribute to academic outcomes.

4.1.2 Visualization

Before fitting the Multiple Linear Regression model, a comprehensive Exploratory Data Analysis (EDA) was conducted. Visualization is not merely a descriptive step but a diagnostic requirement to ensure the mathematical assumptions of the linear model are satisfied.

Individual relationships between primary predictors and FinalGrade were analyzed using scatter plots with fitted linear regression lines. These visualizations provide an initial understanding of how variables behave in isolation.

I. Analysis of Study Hours vs. Final Grade

The scatter plot for StudyHours displays a regression line that is nearly horizontal, signifying a weak linear relationship.

Interpretation: The flat slope suggests that simply increasing the quantity of study hours does not result in a predictable or significant increase in the FinalGrade. This indicates that the quality of study or the specific methods used by students may be more influential than the total time spent.

II. Analysis of Attendance vs. Final Grade

A slightly positive trend is observed when plotting Attendance against academic outcomes.

Interpretation: Although a positive slope exists, the wide dispersion of data points around the line suggests that attendance contributes to success but is not a dominant or sole predictor. Regular presence in class appears to be a foundational factor rather than a guarantee of high performance.

III. Analysis of Assignment Completion vs. Final Grade

The trend for AssignmentCompletion is notably weak and trends slightly negative, a counter-intuitive finding.

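Interpretation: The lack of a strong positive linear trend indicates that the volume of assignments completed does not translate linearly into higher final grades. This unexpected result may be attributed to “grading complexity” (where the difficulty of assignments increases) or to overlapping effects with other variables such as StressLevel or LearningStyle. The direction of the slope should also be read against the grade encoding: if lower codes denote better grades (e.g. 0 is A, 3 is D), a negative slope would mean that higher assignment completion is associated with slightly better grades.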

library(ggplot2)

# Study Hours vs Final Grade
ggplot(df_clean, aes(x = study_hours, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Study Hours vs Final Grade",
    x = "Study Hours",
    y = "Final Grade (numeric order)"
  )

# Attendance vs Final Grade
ggplot(df_clean, aes(x = attendance, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Attendance vs Final Grade",
    x = "Attendance (%)",
    y = "Final Grade (numeric order)"
  )

# Assignment Completion vs Final Grade
ggplot(df_clean, aes(x = assignment_completion, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Assignment Completion vs Final Grade",
    x = "Assignment Completion (%)",
    y = "Final Grade (numeric order)"
  )

Correlation Matrix Analysis

The correlation matrix, visualized using the corrplot package, serves as a diagnostic tool to evaluate the strength and direction of relationships between all numeric variables in the dataset.

The primary reason for this step is to check for multicollinearity, a situation where two or more independent variables are highly correlated with each other (for example, if StudyHours and AssignmentCompletion were almost identical). High multicollinearity can “confuse” the regression model, making the coefficients unstable and difficult to interpret.

library(corrplot)
cor_matrix <- cor(df, use = "complete.obs")
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7)

The corrplot shows very light colors between most predictors and FinalGrade. This is consistent with the very low Adjusted R-squared of the full regression model (0.006). When variables do not show strong colors in the heatmap, it is a visual warning that a linear model may struggle to find a strong “signal” or predictive pattern in the data.
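The multicollinearity check described above can be complemented with variance inflation factors (VIF), which are not part of the original analysis. The sketch below computes them by hand in base R on simulated data; the predictors `x1`, `x2` and `x3` are hypothetical, and `x2` is deliberately built to be almost identical to `x1` so that the effect is visible.

```r
set.seed(1)
n <- 200

# Toy predictors (illustrative): x2 is built to be strongly correlated with x1
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  fit <- lm(reformulate(others, response = v), data = toy)
  1 / (1 - summary(fit)$r.squared)
})

print(round(vif_manual, 2))
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity. This hand-rolled version only covers numeric predictors; factor predictors with several levels need a generalised form of the statistic.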

4.1.3 Model Implementation & Output

The regression analysis was conducted in three distinct phases to ensure both statistical significance and predictive reliability.

1. Initial Modeling

A full linear regression was first implemented using 14 predictors (every variable except ExamScore and the FinalGrade target). This “Global Model” served as an exploratory step to identify significant predictors and filter out statistical “noise” from non-contributing variables like Age or Resources.

2. Data Partitioning (Train-Test Split)

To validate the model’s accuracy, the data was partitioned into a Training Set (80%) and a Testing Set (20%).

Purpose: The training set builds the mathematical coefficients, while the testing set acts as “unseen” data to verify if the model can accurately predict grades for new students.

Reproducibility: set.seed(123) was applied to ensure the random split remains consistent for future verification.

3. Predictive Performance Metrics

A refined model was tested against the holdout set, yielding the following error metrics:

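MAE (0.9936): On average, the model’s predictions deviate from the actual grade by about 0.99 grade points.

RMSE (1.1135): The root mean squared error, which penalises large errors more heavily than the MAE; its proximity to the MAE suggests relatively consistent error magnitudes. It is also close to the standard deviation of FinalGrade in the raw data (about 1.12), which is consistent with the low R-squared.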
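To make the two metrics concrete, the sketch below computes MAE and RMSE on a tiny made-up example and compares them with a naive baseline that always predicts the mean grade. The vectors `actual` and `predicted` are illustrative values, not results from the report, and the helper names `mae` and `rmse` are hypothetical.

```r
# Toy actual and predicted grades (illustrative values only)
actual    <- c(1, 2, 3, 4)
predicted <- c(2, 2, 2, 2)

# Mean absolute error and root mean squared error
mae  <- function(y, y_hat) mean(abs(y - y_hat))
rmse <- function(y, y_hat) sqrt(mean((y - y_hat)^2))

# Baseline: always predict the mean of the observed grades
baseline <- rep(mean(actual), length(actual))

metrics <- c(
  model_mae     = mae(actual, predicted),
  model_rmse    = rmse(actual, predicted),
  baseline_mae  = mae(actual, baseline),
  baseline_rmse = rmse(actual, baseline)
)
print(round(metrics, 4))
```

In a real evaluation the baseline mean would be taken from the training set only. A model whose RMSE is no lower than this baseline adds little predictive value, however small the error looks in isolation.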

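# Note: as.numeric() on an ordered factor returns the level codes (1 to 4),
# i.e. the original 0 to 3 grades shifted up by one. Slopes, MAE and RMSE
# are unaffected by this shift; only the intercept changes.
df_clean_regression <- df_clean %>%
  mutate(final_grade_num = as.numeric(final_grade))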
  
model <- lm(
  final_grade_num ~ study_hours + attendance + resources + extracurricular +
    motivation + internet + gender + age + learning_style +
    online_courses + discussions + assignment_completion +
    edu_tech + stress_level,
  data = df_clean_regression
)

summary(model)
## 
## Call:
## lm(formula = final_grade_num ~ study_hours + attendance + resources + 
##     extracurricular + motivation + internet + gender + age + 
##     learning_style + online_courses + discussions + assignment_completion + 
##     edu_tech + stress_level, data = df_clean_regression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6869 -1.2981 -0.2646  0.7084  1.8649 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.4258820  0.1306828  18.563  < 2e-16 ***
## study_hours           -0.0019624  0.0016589  -1.183 0.236835    
## attendance             0.0013047  0.0008763   1.489 0.136558    
## resourcesMedium        0.0005999  0.0263240   0.023 0.981819    
## resourcesHigh         -0.0115310  0.0284824  -0.405 0.685597    
## extracurricularYes     0.0120142  0.0203402   0.591 0.554756    
## motivationMedium      -0.0169987  0.0232082  -0.732 0.463911    
## motivationHigh         0.0243144  0.0284891   0.853 0.393420    
## internetYes            0.0423890  0.0363808   1.165 0.243981    
## genderMale            -0.0430250  0.0201452  -2.136 0.032719 *  
## age                    0.0017111  0.0028614   0.598 0.549861    
## learning_style1       -0.0689805  0.0284800  -2.422 0.015447 *  
## learning_style2       -0.0612458  0.0286281  -2.139 0.032426 *  
## learning_style3       -0.0348207  0.0284611  -1.223 0.221184    
## online_courses        -0.0036293  0.0016397  -2.213 0.026886 *  
## discussions1           0.0925251  0.0205396   4.505 6.71e-06 ***
## assignment_completion -0.0023454  0.0006841  -3.429 0.000609 ***
## edu_tech1             -0.0320267  0.0220520  -1.452 0.146437    
## stress_level1          0.1643784  0.0290581   5.657 1.58e-08 ***
## stress_level2          0.1471444  0.0263617   5.582 2.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.119 on 12449 degrees of freedom
## Multiple R-squared:  0.00762,    Adjusted R-squared:  0.006106 
## F-statistic: 5.031 on 19 and 12449 DF,  p-value: 3.772e-12
set.seed(123)

index <- sample(1:nrow(df_clean_regression), 0.8 * nrow(df_clean_regression))
train <- df_clean_regression[index, ]
test <- df_clean_regression[-index, ]

model_train <- lm(
  final_grade_num ~ study_hours + attendance + assignment_completion +
    motivation + online_courses + stress_level,
  data = train
)

pred <- predict(model_train, test)

MAE <- mean(abs(pred - test$final_grade_num))
RMSE <- sqrt(mean((pred - test$final_grade_num)^2))
MAE
## [1] 0.9936224
RMSE
## [1] 1.113461
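
An MAE close to 1 is easier to judge next to a naive reference. The sketch below is one possible way to do that: it compares a fitted model against a baseline that always predicts the training mean. The helper `compare_to_baseline` and the toy data are assumptions made for illustration; they are not part of the original analysis.

```r
# Hypothetical helper: error metrics for a model vs. a predict-the-mean baseline.
# `actual` and `predicted` are test-set values; `train_mean` is the mean of the
# training target. All names here are illustrative assumptions.
compare_to_baseline <- function(actual, predicted, train_mean) {
  mae  <- function(err) mean(abs(err))
  rmse <- function(err) sqrt(mean(err^2))
  data.frame(
    model = c("regression", "mean baseline"),
    MAE   = c(mae(predicted - actual),  mae(train_mean - actual)),
    RMSE  = c(rmse(predicted - actual), rmse(train_mean - actual))
  )
}

# Toy data (synthetic, not the project dataset)
set.seed(123)
toy <- data.frame(x = rnorm(200))
toy$y <- 2 + 0.5 * toy$x + rnorm(200)
idx <- sample(seq_len(nrow(toy)), size = floor(0.8 * nrow(toy)))
toy_train <- toy[idx, ]
toy_test  <- toy[-idx, ]

toy_fit  <- lm(y ~ x, data = toy_train)
toy_pred <- predict(toy_fit, newdata = toy_test)

baseline_table <- compare_to_baseline(toy_test$y, toy_pred, mean(toy_train$y))
print(baseline_table)
```

In the project itself, the analogous call would pass `test$final_grade_num`, `pred`, and `mean(train$final_grade_num)`. If the regression row is barely better than the baseline row, the predictors add little beyond the average grade.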

4.2 Classification Analysis: Support Vector Machine (SVM)

The primary objective of this phase is to develop a predictive framework capable of categorizing students into specific performance levels. We treated final_grade as a categorical target variable with four levels (0 to 3) and trained a Support Vector Machine (SVM) classifier. SVMs are well suited to educational datasets because they can model complex, non-linear interactions between behavioural inputs (e.g., study habits) and academic outcomes, establishing clear boundaries between student performance groups.

4.2.1 Research Question 2

RQ2: To what extent can machine learning algorithms, specifically SVM, accurately predict a student’s final grade by synthesizing demographic profiles, learning behaviours, and intermediate assessment scores?

This inquiry aims to validate whether the available data possesses sufficient “signal” to automate the grading process. Success here would imply that educational institutions could deploy such models as “Early Warning Systems,” identifying students destined for lower performance bands while there is still time to intervene.

4.2.2 Visualization: Feature Space Separation

Before model training, it is essential to verify if the data contains distinct patterns. Instead of a simple boxplot, we employed a Scatter Plot Analysis mapping exam_score against study_hours, color-coded by the target variable final_grade.

library(ggplot2)
# Visualization 4.2: Scatter Plot of Exam Score vs. Study Hours
# This plot checks whether grades form distinct bands rather than random noise.
ggplot(df_clean, aes(x = study_hours, y = exam_score, color = as.factor(final_grade))) +
  geom_point(alpha = 0.6, size = 2) +
  theme_minimal() +
  labs(
    title = "Figure 4.2: Class Separation by Exam Score and Study Effort",
    subtitle = "Distinct stratification is visible: High scores (Top) correlate perfectly with Grade 0.",
    x = "Weekly Study Hours",
    y = "Exam Score",
    color = "Final Grade"
  ) +
  scale_color_brewer(palette = "Set1")

Figure 4.2 reveals distinct stratification along the exam score axis. Students achieving high exam scores consistently cluster in the “Grade 0” category (top), while those with lower scores fall into “Grade 3” (bottom). This visual separation suggests that the SVM should be able to draw linear or near-linear decision boundaries with high precision.
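
The visual impression of separation can also be checked numerically. The sketch below is a minimal example, assuming a data frame with exam_score and final_grade columns: it tabulates the minimum and maximum exam score per grade, and non-overlapping ranges would support the stratification seen in the plot. The helper `score_ranges` and the toy values are hypothetical.

```r
# Hypothetical helper: per-grade [min, max] of a score column.
# Non-overlapping intervals would indicate clean separation between grades.
score_ranges <- function(data, score_col, grade_col) {
  ranges <- lapply(split(data[[score_col]], data[[grade_col]]), range)
  out <- do.call(rbind, ranges)
  colnames(out) <- c("min", "max")
  out
}

# Toy data (synthetic; the score bands are assumptions for illustration only)
toy <- data.frame(
  exam_score  = c(95, 88, 76, 71, 64, 58, 45, 30),
  final_grade = factor(c(0, 0, 1, 1, 2, 2, 3, 3))
)
toy_ranges <- score_ranges(toy, "exam_score", "final_grade")
print(toy_ranges)
```

On the project data, the equivalent call would be `score_ranges(df_clean, "exam_score", "final_grade")`.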

4.2.3 Model Implementation & Evaluation (Confusion Matrix)

The SVM model was constructed using the e1071 library in R, employing a Radial Basis Function (RBF) kernel to capture non-linear relationships. To ensure the model’s generalizability, the data was partitioned into a training set (80%) and a testing set (20%). The model’s performance is visualized below using a Confusion Matrix Heatmap, which highlights where the predictions align with the actual student grades.

library(lava)
library(recipes)
library(caret)
library(e1071)
# 1. Data Splitting (80% Train, 20% Test)
set.seed(123)
train_idx <- createDataPartition(df_clean$final_grade, p = 0.8, list = FALSE)
train_data <- df_clean[train_idx, ]
test_data  <- df_clean[-train_idx, ]

# 2. Train SVM Model
svm_model <- svm(final_grade ~ ., data = train_data, kernel = "radial")

# 3. Predict & Evaluate
preds <- predict(svm_model, test_data)
conf_matrix <- confusionMatrix(preds, test_data$final_grade)

# 4. Generate Heatmap
cm_df <- as.data.frame(conf_matrix$table)
ggplot(cm_df, aes(x = Reference, y = Prediction, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), color = "white", fontface = "bold") +
  scale_fill_gradient(low = "pink", high = "red") +
  labs(title = "Figure 4.3: Confusion Matrix Heatmap", x = "Actual Grade", y = "Predicted Grade") +
  theme_minimal()

The heatmap visualizes the model’s precision. The intense red tiles along the diagonal represent correct predictions, where the model’s predicted grade matches the student’s actual grade. Conversely, the pale pink tiles outside this diagonal indicate low misclassification counts. The clear separation shows that the model rarely confuses distinct performance levels (e.g., it never mistakes a failing student for a top achiever).

model_accuracy <- conf_matrix$overall['Accuracy']
print(paste("Model Accuracy:", round(model_accuracy * 100, 2), "%"))
## [1] "Model Accuracy: 97.11 %"

The SVM model achieves near-perfect classification accuracy (97.11%) on the test set. This indicates that the dataset contains highly deterministic patterns, specifically the strong link between exam_score and final_grade, which allow the model to predict student outcomes with high reliability and minimal ambiguity.

5. Discussion of Output

5.1 Interpretation of Regression Results

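The model summary provides the following key insights into student performance. Two points apply throughout. First, the model explains less than 1% of the variance in final grade (adjusted R-squared = 0.006), so all effects are small. Second, final_grade_num is the numeric code of the grade category, and Figure 4.2 associates Grade 0 with the highest exam scores, so the direction of each coefficient should be read with this coding in mind.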

  • Significant Positive Predictors:
    • Stress Level: This was the strongest predictor in the model. Compared to the baseline of Low Stress, students with Medium Stress (\(\beta = 0.164, p < 0.001\)) and High Stress (\(\beta = 0.147, p < 0.001\)) had significantly higher final grade values. This is consistent with the “eustress” theory, where moderate pressure serves as a motivator, although the regression alone cannot establish causation.
    • Discussions: Active engagement was also positively associated with the outcome. Students who participated in Discussions scored on average 0.093 points higher than those who did not (\(\beta = 0.093, p < 0.001\)), which suggests that collaborative learning strategies are more effective than passive study.
  • Significant Negative Predictors:
    • Assignment Completion: Counter-intuitively, higher assignment completion was associated with slightly lower final grades (\(\beta = -0.002, p < 0.001\)). This negative trend is statistically significant and may reflect a “remedial effect,” where struggling students are assigned a higher volume of make-up work to catch up.
    • Online Courses: Taking additional online courses was linked to a marginal decrease in performance (\(\beta = -0.004, p = 0.027\)). While the effect size is small, the significance suggests that spreading focus across too many digital courses may dilute learning quality.
  • Statistically Insignificant Factors:
    • Study Hours & Attendance: Neither Study Hours (\(p = 0.24\)) nor Attendance (\(p = 0.14\)) achieved statistical significance. The regression results indicate that once the other predictors, including discussion participation and stress level, are accounted for, the raw quantity of time spent studying or sitting in class does not reliably predict a higher Final Grade.
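
The grouping above into significant and insignificant predictors can be reproduced programmatically. The sketch below is one way to do it, using a hypothetical helper `significant_terms` and synthetic data: it extracts each coefficient with its 95% confidence interval and keeps the terms whose p-value falls below a chosen threshold.

```r
# Hypothetical helper: coefficients with 95% confidence intervals,
# filtered by p-value (intercept excluded).
significant_terms <- function(fit, alpha = 0.05) {
  est <- coef(summary(fit))          # Estimate, Std. Error, t value, Pr(>|t|)
  ci  <- confint(fit, level = 0.95)  # lower and upper bound per term
  out <- data.frame(
    term     = rownames(est),
    estimate = est[, "Estimate"],
    lower    = ci[, 1],
    upper    = ci[, 2],
    p_value  = est[, "Pr(>|t|)"],
    row.names = NULL
  )
  out[out$p_value < alpha & out$term != "(Intercept)", ]
}

# Toy data (synthetic): y depends on x1 but not on x2
set.seed(123)
toy <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
toy$y <- 1 + 0.8 * toy$x1 + rnorm(300)
toy_fit <- lm(y ~ x1 + x2, data = toy)

toy_sig <- significant_terms(toy_fit)
print(toy_sig)
```

Applied to the report's model, the call would be `significant_terms(model)`. Reporting the interval alongside the p-value makes the small effect sizes visible.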

5.2 Interpretation of Classification Results

The Support Vector Machine (SVM) classification yielded results that are both statistically strong and educationally relevant. By constructing separating hyperplanes in the kernel-induced feature space, the model categorized students into their respective grade levels with high accuracy.

1. Model Performance and Accuracy

The Confusion Matrix reveals a robust overall accuracy rate of 97.11%. This high metric indicates that the SVM model is highly effective at defining the boundaries between the four grade categories (0–3).

  • Statistical Insight: The low misclassification rate (~2.9%) on the held-out test set suggests that the decision boundaries created by the SVM are stable and generalize well. High precision across all four classes would confirm that the model captures the distinct patterns for each grade level with minimal misclassification, supporting its reliability for automated grading support.
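
Overall accuracy can hide weak performance on an individual grade, so per-class figures are worth computing. The sketch below is a minimal base R example, assuming a confusion table with predictions in rows and actual classes in columns (the layout of `conf_matrix$table`). The helper `per_class_metrics` and the counts are made up for illustration and are not the project's results.

```r
# Hypothetical helper: per-class precision and recall from a confusion table
# whose rows are predictions and whose columns are actual classes.
per_class_metrics <- function(cm) {
  correct <- diag(cm)
  data.frame(
    class     = rownames(cm),
    precision = correct / rowSums(cm),  # of those predicted as k, share truly k
    recall    = correct / colSums(cm),  # of those truly k, share predicted as k
    row.names = NULL
  )
}

# Toy confusion table (made-up counts)
toy_cm <- matrix(
  c(48,  2,  0,  0,
     1, 45,  3,  0,
     0,  3, 46,  2,
     0,  0,  1, 49),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = as.character(0:3),
                  Reference  = as.character(0:3))
)
toy_metrics <- per_class_metrics(toy_cm)
print(toy_metrics)
```

With the fitted model, the equivalent call would be `per_class_metrics(conf_matrix$table)`.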

2. Deterministic Nature of Performance

The classification model reveals a clear, almost deterministic structure within the grading system.

  • The “What”: Figure 4.2 and the model’s accuracy together indicate that Exam Score acts as the dominant feature defining the decision boundaries between classes.
  • The Insight: This implies that the grading criteria are rigid. If a student scores within a specific numerical range on the exam, their Final Grade classification is effectively guaranteed. Other variables (like attendance or study hours) do not independently shift a student across a grade boundary unless they result in a corresponding shift in the exam score.
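
One way to test how much the classifier depends on a single feature is to retrain it without that feature and compare test accuracy. The sketch below illustrates the idea on synthetic data in which the grade is derived from exam score bands by construction. The helper `svm_accuracy`, the band thresholds, and the toy data are all assumptions for illustration and do not reproduce the project's results.

```r
library(e1071)

# Hypothetical helper: test-set accuracy of an RBF SVM on chosen predictors.
svm_accuracy <- function(train, test, target, predictors) {
  fml <- reformulate(predictors, response = target)
  fit <- svm(fml, data = train, kernel = "radial")
  mean(predict(fit, test) == test[[target]])
}

# Toy data (synthetic): the grade is derived from exam_score bands,
# while study_hours is unrelated noise.
set.seed(123)
n <- 400
toy <- data.frame(
  exam_score  = runif(n, 0, 100),
  study_hours = runif(n, 0, 40)
)
toy$final_grade <- cut(toy$exam_score, breaks = c(0, 40, 60, 80, 100),
                       labels = c("3", "2", "1", "0"), include.lowest = TRUE)
idx <- sample(seq_len(n), size = floor(0.8 * n))

acc_with    <- svm_accuracy(toy[idx, ], toy[-idx, ], "final_grade",
                            c("exam_score", "study_hours"))
acc_without <- svm_accuracy(toy[idx, ], toy[-idx, ], "final_grade",
                            "study_hours")
print(c(with_exam_score = acc_with, without_exam_score = acc_without))
```

A large gap between the two accuracies would indicate that exam_score carries most of the predictive signal. On the project data, the same comparison could be run by dropping exam_score from `train_data` and `test_data` before fitting.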

3. Implications for Academic Intervention

The near-perfect predictive capability of this model validates its potential use as a Grading Verification Tool, ensuring that final grades are assigned consistently based on performance metrics. However, for proactive student support, educators must recognize the distinction between classification and intervention.

  • Actionable Strategy: Because the SVM model relies heavily on the Exam Score, the classification is a “lagging indicator”: it predicts the grade only after the primary assessment is complete. To intervene effectively, institutions cannot wait for the classification result. Instead, they must implement Early Warning Systems that track student engagement throughout the semester. By addressing issues (such as low participation or high stress) before the exam takes place, educators can influence the Exam Score itself, thereby shifting the student into a higher classification category (e.g., moving from Grade 2 to Grade 1).

6. Conclusion

6.1 Summary of Findings

This project applied data science and machine learning techniques to analyse student learning behaviours and predict academic performance using the Student Performance and Learning Behavior dataset. Exploratory data analysis and data profiling provided a clear understanding of the dataset structure, variable distributions, and relationships.

The data cleaning process ensured high data quality by removing duplicate records, standardising variable names, and converting numerically encoded categorical variables into appropriate factor and ordered factor formats. The final cleaned dataset consisted of 12,469 records and was suitable for predictive modelling.

Two machine learning approaches were applied. The regression analysis revealed that behavioural and engagement-related factors such as discussions and stress level are more strongly associated with academic performance than study hours or attendance alone. The low adjusted R-squared value (0.006) shows that the linear model explains less than 1% of the variation in final grade, which suggests that student performance depends on complex, non-linear interactions or on factors outside the model rather than on simple linear relationships.

The classification analysis using a Support Vector Machine (SVM) achieved a high accuracy of 97.11%, demonstrating that student final grades can be predicted with high reliability. The results indicate that exam score acts as a dominant determinant of final grade, while behavioural variables may influence outcomes indirectly through their impact on exam performance.

6.2 Implications

The findings demonstrate that machine learning can be effectively applied to educational data to support academic performance analysis. The classification model can function as a reliable grading verification tool and, if retrained on indicators that are available before the exam, as the basis of an early warning system that enables institutions to identify students at risk of lower performance.

More importantly, the regression analysis provides insight into why students perform as they do. This allows educators to focus on actionable behavioural factors such as student engagement and stress management rather than relying solely on attendance or study duration. Such insights can support targeted interventions and personalised learning strategies.

6.3 Limitations and Future Work

This study is limited by the use of a publicly available dataset, which may not fully reflect real-world educational environments. Additionally, the strong dependency between exam score and final grade may inflate classification accuracy.

Future work could involve applying alternative machine learning models, incorporating real institutional data, and analysing longitudinal data to capture changes in student behaviour over time. These enhancements could improve the generalisability and practical applicability of the findings.