With the increased use of online learning platforms, more data are being collected on students’ learning behaviour and academic performance. These data can be analysed to better understand how factors such as study habits, attendance, and motivation affect students’ results.
This project, titled “Optimizing Education via Machine Learning: Predicting Student Performance and Classifying Learning Behaviors,” uses a student performance dataset to explore key learning factors and predict student academic performance.
Raw educational datasets often contain mixed data types, inconsistent formats, or unnecessary attributes that may affect analysis results. Without proper exploratory data analysis and data profiling, it is difficult to understand the characteristics of student performance data, because the relationships between the variables may not be obvious from the raw data.
It can also be difficult to identify the key factors that influence academic performance, because many learning behaviour variables are interdependent.
Without the application of machine learning techniques and data-driven insights, educational institutions face difficulties in accurately predicting student performance, classifying learning behaviours, and designing effective targeted interventions and personalised learning strategies.
Dataset Title: Student Performance and Learning Behavior Dataset
Year : 2024
Source: Kaggle (link: https://www.kaggle.com/datasets/adilshamim8/student-performance-and-learning-style/data)
Purpose : The dataset is used to analyse student learning behaviours, engagement levels, and academic performance. It supports exploratory data analysis and serves as input for machine learning tasks such as performance prediction and learning behaviour classification.
Based on the initial observations on the raw data, the student performance dataset contains 14,003 rows and 16 columns. Each row captures the corresponding variables for one student and the columns cover different aspects of student learning.
Learning behaviour in the dataset is reflected through variables such as StudyHours, Attendance, and AssignmentCompletion, which describe students’ study time, class attendance, and assignment completion rate. Student engagement is captured using variables such as Discussions, OnlineCourses, Extracurricular, and EduTech, which indicate participation in discussions, online learning activities, extracurricular involvement, and the use of educational technology.
The dataset also includes background information such as Age, Gender, Internet, and Resources, which describe the students’ learning environment and access to study support. In addition, Motivation and StressLevel are included as numeric variables to represent students’ self-reported motivation and stress levels. LearningStyle indicates the preferred learning style of the student and is numerically encoded.
Academic performance is measured using ExamScore, while FinalGrade represents the overall course outcome and is used as the target variable in later analysis. All variables are stored in numeric form, with some categorical information encoded as integers, which will require further preprocessing in the next stage.
Based on the summary statistics of the raw data:
No missing values were detected in any variable. The dataset is therefore complete and does not require missing-value handling at this stage.
StudyHours ranges from 5 to 44 and Attendance from 60% to 100%. This suggests that students have quite different study habits and levels of class participation.
AssignmentCompletion averages around 74.5% and ranges from 50% to 100%. This indicates that although every student completes at least half of the coursework, consistency still varies across the dataset.
ExamScore ranges from 40 to 100, suggesting that the dataset contains both lower and better performing students, making it suitable for performance analysis.
The Age variable ranges from 18 to 29 years, with a mean of approximately 23.5 years, which suggests that the dataset mainly represents students in higher education age groups.
The FinalGrade variable ranges from 0 to 3, representing multiple performance levels. This distribution supports its use as a target variable for classification tasks in later stages of the project.
Overall, the raw dataset shows good data quality and sufficient variation to support further analysis after basic data cleaning.
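The observations above can be collected into a single profile table. The following is a minimal sketch on an invented toy data frame (the values are illustrative assumptions, not taken from the dataset); it only shows how minimum, mean, maximum, and standard deviation can be computed per column in base R.

```r
# Toy stand-in for the student data; values are invented for illustration.
toy <- data.frame(
  StudyHours = c(5, 16, 20, 24, 44),
  Attendance = c(60, 70, 80, 90, 100),
  ExamScore  = c(40, 55, 70, 86, 100)
)

# One row per variable with min, mean, max, and standard deviation.
profile <- data.frame(
  variable = names(toy),
  min  = sapply(toy, min),
  mean = sapply(toy, mean),
  max  = sapply(toy, max),
  sd   = sapply(toy, sd),
  row.names = NULL
)

print(profile)
```

Comparing the standard deviations side by side makes it easier to say which variables actually vary the most, rather than judging from the ranges alone.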
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
df <- read_csv("./student_performance.csv")
## Rows: 14003 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): StudyHours, Attendance, Resources, Extracurricular, Motivation, In...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colSums(is.na(df))
## StudyHours Attendance Resources
## 0 0 0
## Extracurricular Motivation Internet
## 0 0 0
## Gender Age LearningStyle
## 0 0 0
## OnlineCourses Discussions AssignmentCompletion
## 0 0 0
## ExamScore EduTech StressLevel
## 0 0 0
## FinalGrade
## 0
numeric_cols <- sapply(df, is.numeric)
numeric_data <- df[, numeric_cols]
apply(numeric_data, 2, mean, na.rm = TRUE)
## StudyHours Attendance Resources
## 19.9874313 80.1943155 1.1044062
## Extracurricular Motivation Internet
## 0.5941584 0.9058059 0.9255160
## Gender Age LearningStyle
## 0.5519532 23.5321717 1.5154610
## OnlineCourses Discussions AssignmentCompletion
## 9.8919517 0.6058702 74.5025352
## ExamScore EduTech StressLevel
## 70.3469257 0.7090623 1.3043634
## FinalGrade
## 1.4479040
apply(numeric_data, 2, min, na.rm = TRUE)
## StudyHours Attendance Resources
## 5 60 0
## Extracurricular Motivation Internet
## 0 0 0
## Gender Age LearningStyle
## 0 18 0
## OnlineCourses Discussions AssignmentCompletion
## 0 0 50
## ExamScore EduTech StressLevel
## 40 0 0
## FinalGrade
## 0
apply(numeric_data, 2, max, na.rm = TRUE)
## StudyHours Attendance Resources
## 44 100 2
## Extracurricular Motivation Internet
## 1 2 1
## Gender Age LearningStyle
## 1 29 3
## OnlineCourses Discussions AssignmentCompletion
## 20 1 100
## ExamScore EduTech StressLevel
## 100 1 2
## FinalGrade
## 3
The purpose of the data cleaning process is to ensure the dataset is suitable for exploratory data analysis and prediction modelling.
A systematic approach was followed to perform the data cleaning process:
Dataset structure and summary statistics were studied to understand variable types and possible data issues.
The dataset was checked for duplicate records, missing values, and negative values.
Categorical variables that were numerically encoded were converted into factor variables.
The cleaned dataset was validated to verify that it maintains a logical and meaningful range across all variables.
Packages that were used for data cleaning are “tidyverse”, “janitor”, “skimr”.
The purpose of missing value checks is to prevent errors during model development.
The presence of null values was checked with is.na().
Based on the data assessment, there were no missing values.
The data was complete, so no null value treatment was required.
The purpose of duplicate record checks is to prevent potential bias from repeated observations.
Duplicate records were removed with the distinct() function.
There were 1534 duplicate records.
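The duplicate count can be cross-checked against the row counts reported in this document (14,003 raw rows minus 12,469 cleaned rows gives 1,534). The sketch below illustrates the same check on an invented toy data frame; it uses base R duplicated() rather than the dplyr distinct() call used in the cleaning code, and the column values are assumptions for illustration only.

```r
# Toy data: row 2 is an exact copy of row 1; row 3 differs in exam_score.
toy <- data.frame(
  study_hours = c(19, 19, 19, 22),
  attendance  = c(64, 64, 64, 91),
  exam_score  = c(40, 40, 66, 75)
)

# duplicated() flags rows that are identical to an earlier row.
n_dup <- sum(duplicated(toy))

# Keep only the first occurrence of each row.
toy_unique <- toy[!duplicated(toy), ]

# The number of removed rows should equal the number of flagged duplicates.
rows_removed <- nrow(toy) - nrow(toy_unique)
print(c(duplicates = n_dup, removed = rows_removed))
```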
The purpose of negative value checks is to verify if the dataset has any illogical values.
The presence of negative values was checked with any(df < 0, na.rm = TRUE).
There were no negative values.
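A per-column variant of the negative value check can make it easier to locate a problem if one ever appears. This is a minimal sketch on invented toy data (the negative exam_score is deliberately planted for illustration and does not occur in the real dataset).

```r
# Toy data with one deliberately invalid (negative) exam score.
toy <- data.frame(
  age        = c(18, 23, 29),
  exam_score = c(40, -5, 99),
  attendance = c(60, 80, 100)
)

# Count negative entries in each column.
neg_per_col <- colSums(toy < 0, na.rm = TRUE)

# TRUE if any column contains at least one negative value.
has_negatives <- any(neg_per_col > 0)

print(neg_per_col)
```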
The purpose of standardizing the variable names with clean_names() is to have variable names that are easy to read.
This step is mainly for user convenience.
The purpose of converting the encoded variables into factors is to reflect the qualitative nature of those variables.
All the variables in the dataset are numerically encoded.
There are 16 variables in total; 11 of them are categorical.
The 11 variables are: Gender, Discussions, OnlineCourses, Extracurricular, Internet, Resources, Motivation, StressLevel, LearningStyle, EduTech, FinalGrade.
Apart from FinalGrade, the other 10 categorical variables were converted into factor variables with the factor() or as.factor() function.
FinalGrade was converted into an ordered factor with the as.ordered() function because its levels are ranked, e.g. 0 is A and 3 is D.
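One detail of ordered factors is worth illustrating, because it affects any later step that turns the grade back into a number: as.numeric() applied to a factor returns the position of each level, not the original code. The sketch below uses an invented vector of grade codes to show the difference.

```r
# Invented grade codes on the 0-3 scale used in this document.
grade_codes <- c(3, 2, 0, 1, 0)

# Ordered factor with levels 0 < 1 < 2 < 3.
grade <- factor(grade_codes, levels = 0:3, ordered = TRUE)

# as.numeric() on a factor gives the level positions (1 to 4).
positions <- as.numeric(grade)

# Converting via character recovers the original 0-3 codes.
original <- as.numeric(as.character(grade))

# Ordered factors support comparisons between levels.
first_is_higher_code <- grade[1] > grade[3]

print(positions)
print(original)
```

For the regression, the shift by one only moves the intercept; slopes and error metrics are unaffected.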
Post-Cleaning Summary
The dataset was verified upon the completion of the cleaning process to check the quality. The cleaned dataset has 12,469 records with standardized variable names and suitable data types. All categorical variables were changed into factor variables, and no changes were made for numerical variables. There were no negative values or illogical outliers in the dataset based on summary statistics.
library(tidyverse)
library(janitor)
##
## Attaching package: 'janitor'
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(skimr)
df <- read_csv("./student_performance.csv")
## Rows: 14003 Columns: 16
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (16): StudyHours, Attendance, Resources, Extracurricular, Motivation, In...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
summary(df)
## StudyHours Attendance Resources Extracurricular
## Min. : 5.00 Min. : 60.00 Min. :0.000 Min. :0.0000
## 1st Qu.:16.00 1st Qu.: 70.00 1st Qu.:1.000 1st Qu.:0.0000
## Median :20.00 Median : 80.00 Median :1.000 Median :1.0000
## Mean :19.99 Mean : 80.19 Mean :1.104 Mean :0.5942
## 3rd Qu.:24.00 3rd Qu.: 90.00 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :44.00 Max. :100.00 Max. :2.000 Max. :1.0000
## Motivation Internet Gender Age
## Min. :0.0000 Min. :0.0000 Min. :0.000 Min. :18.00
## 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.000 1st Qu.:20.00
## Median :1.0000 Median :1.0000 Median :1.000 Median :24.00
## Mean :0.9058 Mean :0.9255 Mean :0.552 Mean :23.53
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.:27.00
## Max. :2.0000 Max. :1.0000 Max. :1.000 Max. :29.00
## LearningStyle OnlineCourses Discussions AssignmentCompletion
## Min. :0.000 Min. : 0.000 Min. :0.0000 Min. : 50.0
## 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.0000 1st Qu.: 62.0
## Median :2.000 Median :10.000 Median :1.0000 Median : 74.0
## Mean :1.515 Mean : 9.892 Mean :0.6059 Mean : 74.5
## 3rd Qu.:3.000 3rd Qu.:15.000 3rd Qu.:1.0000 3rd Qu.: 87.0
## Max. :3.000 Max. :20.000 Max. :1.0000 Max. :100.0
## ExamScore EduTech StressLevel FinalGrade
## Min. : 40.00 Min. :0.0000 Min. :0.000 Min. :0.000
## 1st Qu.: 55.00 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.000
## Median : 70.00 Median :1.0000 Median :2.000 Median :1.000
## Mean : 70.35 Mean :0.7091 Mean :1.304 Mean :1.448
## 3rd Qu.: 86.00 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :100.00 Max. :1.0000 Max. :2.000 Max. :3.000
dim(df)
## [1] 14003 16
str(df)
## spc_tbl_ [14,003 × 16] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ StudyHours : num [1:14003] 19 19 19 19 19 19 19 19 19 19 ...
## $ Attendance : num [1:14003] 64 64 64 64 64 64 64 64 64 64 ...
## $ Resources : num [1:14003] 1 1 1 1 1 1 0 0 0 1 ...
## $ Extracurricular : num [1:14003] 0 0 0 1 1 1 1 1 1 1 ...
## $ Motivation : num [1:14003] 0 0 0 0 0 0 0 0 0 1 ...
## $ Internet : num [1:14003] 1 1 1 1 1 1 1 1 1 1 ...
## $ Gender : num [1:14003] 0 0 0 0 0 0 0 0 0 0 ...
## $ Age : num [1:14003] 19 23 28 19 23 28 19 23 28 19 ...
## $ LearningStyle : num [1:14003] 2 3 1 2 3 1 2 3 1 2 ...
## $ OnlineCourses : num [1:14003] 8 16 19 8 16 19 8 16 19 8 ...
## $ Discussions : num [1:14003] 1 0 0 1 0 0 1 0 0 1 ...
## $ AssignmentCompletion: num [1:14003] 59 90 67 59 90 67 59 90 67 59 ...
## $ ExamScore : num [1:14003] 40 66 99 40 66 99 40 66 99 40 ...
## $ EduTech : num [1:14003] 0 0 1 0 0 1 0 0 1 0 ...
## $ StressLevel : num [1:14003] 1 1 1 1 1 1 1 1 1 1 ...
## $ FinalGrade : num [1:14003] 3 2 0 3 2 0 3 2 0 3 ...
## - attr(*, "spec")=
## .. cols(
## .. StudyHours = col_double(),
## .. Attendance = col_double(),
## .. Resources = col_double(),
## .. Extracurricular = col_double(),
## .. Motivation = col_double(),
## .. Internet = col_double(),
## .. Gender = col_double(),
## .. Age = col_double(),
## .. LearningStyle = col_double(),
## .. OnlineCourses = col_double(),
## .. Discussions = col_double(),
## .. AssignmentCompletion = col_double(),
## .. ExamScore = col_double(),
## .. EduTech = col_double(),
## .. StressLevel = col_double(),
## .. FinalGrade = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
head(df)
## # A tibble: 6 × 16
## StudyHours Attendance Resources Extracurricular Motivation Internet Gender
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 19 64 1 0 0 1 0
## 2 19 64 1 0 0 1 0
## 3 19 64 1 0 0 1 0
## 4 19 64 1 1 0 1 0
## 5 19 64 1 1 0 1 0
## 6 19 64 1 1 0 1 0
## # ℹ 9 more variables: Age <dbl>, LearningStyle <dbl>, OnlineCourses <dbl>,
## # Discussions <dbl>, AssignmentCompletion <dbl>, ExamScore <dbl>,
## # EduTech <dbl>, StressLevel <dbl>, FinalGrade <dbl>
colSums(is.na(df))
## StudyHours Attendance Resources
## 0 0 0
## Extracurricular Motivation Internet
## 0 0 0
## Gender Age LearningStyle
## 0 0 0
## OnlineCourses Discussions AssignmentCompletion
## 0 0 0
## ExamScore EduTech StressLevel
## 0 0 0
## FinalGrade
## 0
skim(df)
| Property | Value |
|---|---|
| Name | df |
| Number of rows | 14003 |
| Number of columns | 16 |
| Column type frequency: numeric | 16 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| StudyHours | 0 | 1 | 19.99 | 5.89 | 5 | 16 | 20 | 24 | 44 | ▂▇▇▂▁ |
| Attendance | 0 | 1 | 80.19 | 11.47 | 60 | 70 | 80 | 90 | 100 | ▇▇▇▇▇ |
| Resources | 0 | 1 | 1.10 | 0.70 | 0 | 1 | 1 | 2 | 2 | ▃▁▇▁▅ |
| Extracurricular | 0 | 1 | 0.59 | 0.49 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
| Motivation | 0 | 1 | 0.91 | 0.70 | 0 | 0 | 1 | 1 | 2 | ▅▁▇▁▃ |
| Internet | 0 | 1 | 0.93 | 0.26 | 0 | 1 | 1 | 1 | 1 | ▁▁▁▁▇ |
| Gender | 0 | 1 | 0.55 | 0.50 | 0 | 0 | 1 | 1 | 1 | ▆▁▁▁▇ |
| Age | 0 | 1 | 23.53 | 3.51 | 18 | 20 | 24 | 27 | 29 | ▇▅▅▅▇ |
| LearningStyle | 0 | 1 | 1.52 | 1.11 | 0 | 1 | 2 | 3 | 3 | ▇▇▁▇▇ |
| OnlineCourses | 0 | 1 | 9.89 | 6.11 | 0 | 5 | 10 | 15 | 20 | ▇▆▆▆▆ |
| Discussions | 0 | 1 | 0.61 | 0.49 | 0 | 0 | 1 | 1 | 1 | ▅▁▁▁▇ |
| AssignmentCompletion | 0 | 1 | 74.50 | 14.63 | 50 | 62 | 74 | 87 | 100 | ▇▇▇▇▇ |
| ExamScore | 0 | 1 | 70.35 | 17.69 | 40 | 55 | 70 | 86 | 100 | ▇▇▇▇▇ |
| EduTech | 0 | 1 | 0.71 | 0.45 | 0 | 0 | 1 | 1 | 1 | ▃▁▁▁▇ |
| StressLevel | 0 | 1 | 1.30 | 0.79 | 0 | 1 | 2 | 2 | 2 | ▃▁▅▁▇ |
| FinalGrade | 0 | 1 | 1.45 | 1.12 | 0 | 0 | 1 | 2 | 3 | ▇▇▁▇▇ |
has_negatives <- df %>%
  select(where(is.numeric)) %>%
  { any(. < 0, na.rm = TRUE) }
if (has_negatives) {
  warning("Dataset contains negative values! Investigate them before proceeding.")
} else {
  print("Data is clean: No negative values found.")
}
## [1] "Data is clean: No negative values found."
# Standardize names, drop duplicate rows, then convert encoded variables to factors
df_clean <- df %>%
  clean_names() %>%
  distinct() %>%
  mutate(
    gender = factor(gender, levels = c(0, 1), labels = c("Female", "Male")),
    motivation = factor(motivation, levels = c(0, 1, 2), labels = c("Low", "Medium", "High")),
    extracurricular = factor(extracurricular, levels = c(0, 1), labels = c("No", "Yes")),
    resources = factor(resources, levels = c(0, 1, 2), labels = c("Low", "Medium", "High")),
    internet = factor(internet, levels = c(0, 1), labels = c("No", "Yes")),
    discussions = as.factor(discussions),
    online_courses = as.factor(online_courses),
    edu_tech = as.factor(edu_tech),
    stress_level = as.factor(stress_level),
    learning_style = as.factor(learning_style),
    # final_grade is ranked, so it becomes an ordered factor
    final_grade = as.ordered(final_grade)
  )
dim(df_clean)
## [1] 12469 16
summary(df_clean)
## study_hours attendance resources extracurricular motivation
## Min. : 5.00 Min. : 60.00 Low :2585 No :5198 Low :3770
## 1st Qu.:16.00 1st Qu.: 70.00 Medium:6035 Yes:7271 Medium:6084
## Median :20.00 Median : 80.00 High :3849 High :2615
## Mean :20.03 Mean : 80.24
## 3rd Qu.:24.00 3rd Qu.: 90.00
## Max. :44.00 Max. :100.00
##
## internet gender age learning_style online_courses
## No : 1034 Female:5753 Min. :18.00 0:3029 5 : 665
## Yes:11435 Male :6716 1st Qu.:20.00 1:3164 17 : 639
## Median :24.00 2:3097 18 : 638
## Mean :23.53 3:3179 2 : 637
## 3rd Qu.:27.00 0 : 631
## Max. :29.00 1 : 630
## (Other):8629
## discussions assignment_completion exam_score edu_tech stress_level
## 0:4910 Min. : 50.00 Min. : 40.00 0:3651 0:2524
## 1:7559 1st Qu.: 62.00 1st Qu.: 55.00 1:8818 1:3614
## Median : 74.00 Median : 70.00 2:6331
## Mean : 74.52 Mean : 70.31
## 3rd Qu.: 87.00 3rd Qu.: 86.00
## Max. :100.00 Max. :100.00
##
## final_grade
## 0:3401
## 1:2943
## 2:3221
## 3:2904
##
##
##
skim(df_clean)
| Property | Value |
|---|---|
| Name | df_clean |
| Number of rows | 12469 |
| Number of columns | 16 |
| Column type frequency: factor | 11 |
| Column type frequency: numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| resources | 0 | 1 | FALSE | 3 | Med: 6035, Hig: 3849, Low: 2585 |
| extracurricular | 0 | 1 | FALSE | 2 | Yes: 7271, No: 5198 |
| motivation | 0 | 1 | FALSE | 3 | Med: 6084, Low: 3770, Hig: 2615 |
| internet | 0 | 1 | FALSE | 2 | Yes: 11435, No: 1034 |
| gender | 0 | 1 | FALSE | 2 | Mal: 6716, Fem: 5753 |
| learning_style | 0 | 1 | FALSE | 4 | 3: 3179, 1: 3164, 2: 3097, 0: 3029 |
| online_courses | 0 | 1 | FALSE | 21 | 5: 665, 17: 639, 18: 638, 2: 637 |
| discussions | 0 | 1 | FALSE | 2 | 1: 7559, 0: 4910 |
| edu_tech | 0 | 1 | FALSE | 2 | 1: 8818, 0: 3651 |
| stress_level | 0 | 1 | FALSE | 3 | 2: 6331, 1: 3614, 0: 2524 |
| final_grade | 0 | 1 | TRUE | 4 | 0: 3401, 2: 3221, 1: 2943, 3: 2904 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| study_hours | 0 | 1 | 20.03 | 6.05 | 5 | 16 | 20 | 24 | 44 | ▂▇▇▂▁ |
| attendance | 0 | 1 | 80.24 | 11.47 | 60 | 70 | 80 | 90 | 100 | ▇▇▇▇▇ |
| age | 0 | 1 | 23.53 | 3.51 | 18 | 20 | 24 | 27 | 29 | ▇▅▅▅▇ |
| assignment_completion | 0 | 1 | 74.52 | 14.66 | 50 | 62 | 74 | 87 | 100 | ▇▇▇▇▇ |
| exam_score | 0 | 1 | 70.31 | 17.70 | 40 | 55 | 70 | 86 | 100 | ▇▇▇▇▇ |
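# Factor columns are excluded: boxplot() would otherwise plot their integer level codes
boxplot(select(df_clean, where(is.numeric)), main = "Boxplot of Numeric Columns", las = 2)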
Building upon the data profiling and exploration phases, this section addresses the third project objective: applying machine learning techniques to predict student performance. Specifically, we focus on regression and classification analysis to categorize students based on their likely academic outcomes.
This section applies linear regression to model and predict student academic performance based on selected educational and behavioral factors. Linear regression is chosen due to its interpretability and effectiveness in explaining relationships between independent variables and a continuous outcome variable.
RQ1: How do students’ learning-related factors, such as study time, attendance, and previous academic performance, influence their final academic score?
Objective: To quantify the relationship between key student attributes and final performance and to determine which factors significantly contribute to academic outcomes.
Before fitting the Multiple Linear Regression model, a comprehensive Exploratory Data Analysis (EDA) was conducted. Visualization is not merely a descriptive step but a diagnostic requirement to ensure the mathematical assumptions of the linear model are satisfied.
Individual relationships between primary predictors and FinalGrade were analyzed using scatter plots with fitted linear regression lines. These visualizations provide an initial understanding of how variables behave in isolation.
The scatter plot for StudyHours displays a regression line that is nearly horizontal, signifying a weak linear relationship.
Interpretation: The flat slope suggests that increasing the quantity of study hours alone is not associated with a predictable or significant change in FinalGrade. The quality of study or the specific methods used by students may be more influential than the total time spent.
A slightly positive slope is observed when plotting Attendance against FinalGrade.
Interpretation: The slope is small and the data points are widely dispersed around the line, so attendance on its own is a weak predictor. Because FinalGrade is encoded with 0 as the best grade, a positive slope on this axis points towards lower rather than higher grades; however, the attendance coefficient in the full model is not statistically significant (p ≈ 0.15), so no direction should be read into it.
The trend for AssignmentCompletion is weak and slightly negative.
Interpretation: Because 0 encodes the best grade, a negative slope means that higher assignment completion is associated with slightly better final grades, which is the expected direction. The effect is statistically significant in the full model but very small (about -0.0023 grade levels per percentage point), and it may overlap with the effects of other variables such as StressLevel or LearningStyle.
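The visual impression of a flat line can be backed by two numbers: the fitted slope and the R-squared of a single-predictor model. The sketch below is an illustration on invented toy values (the column names study_hours and grade_num are used here as assumptions); it is not a re-analysis of the dataset.

```r
# Invented toy data: grades alternate regardless of study hours.
toy <- data.frame(
  study_hours = c(5, 10, 15, 20, 25, 30),
  grade_num   = c(1, 2, 1, 2, 1, 2)
)

# Single-predictor linear model, the same kind of fit geom_smooth(method = "lm") draws.
fit <- lm(grade_num ~ study_hours, data = toy)

# Slope: change in grade per additional study hour.
slope <- coef(fit)[["study_hours"]]

# R-squared: share of grade variance explained by study hours alone.
r2 <- summary(fit)$r.squared

print(c(slope = slope, r_squared = r2))
```

A slope close to zero together with a small R-squared is the numerical counterpart of a nearly horizontal regression line with widely scattered points.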
library(ggplot2)
# Study Hours vs Final Grade
ggplot(df_clean, aes(x = study_hours, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Study Hours vs Final Grade",
    x = "Study Hours",
    y = "Final Grade (numeric order)"
  )
## `geom_smooth()` using formula = 'y ~ x'
# Attendance vs Final Grade
ggplot(df_clean, aes(x = attendance, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Attendance vs Final Grade",
    x = "Attendance (%)",
    y = "Final Grade (numeric order)"
  )
## `geom_smooth()` using formula = 'y ~ x'
# Assignment Completion vs Final Grade
ggplot(df_clean, aes(x = assignment_completion, y = as.numeric(final_grade))) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Assignment Completion vs Final Grade",
    x = "Assignment Completion (%)",
    y = "Final Grade (numeric order)"
  )
## `geom_smooth()` using formula = 'y ~ x'
The correlation matrix, visualized using the corrplot package, serves as a diagnostic tool to evaluate the strength and direction of relationships between all numeric variables in the dataset.
The primary reason for this step is to check for multicollinearity, a situation where two or more independent variables are highly correlated with each other (for example, if StudyHours and AssignmentCompletion were almost identical). High multicollinearity can “confuse” the regression model, making the coefficients unstable and difficult to interpret.
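Besides reading the heatmap by eye, highly correlated predictor pairs can be listed programmatically. The sketch below is a minimal illustration on invented toy data; the 0.8 cutoff is an assumed rule of thumb, not a value taken from this project.

```r
# Invented toy data: assignment_completion is almost a linear function of study_hours.
toy <- data.frame(
  study_hours           = c(10, 15, 20, 25, 30),
  assignment_completion = c(52, 61, 70, 82, 90),
  age                   = c(22, 19, 28, 18, 25)
)

cor_matrix <- cor(toy)
threshold <- 0.8  # assumed cutoff for "highly correlated"

# Look only at the upper triangle so each pair is reported once.
high <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r    = cor_matrix[high]
)

print(flagged)
```

Any pair that appears in the flagged table is a candidate for dropping or combining before fitting the regression.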
library(corrplot)
## corrplot 0.95 loaded
# Correlations use the raw numeric data (df), because df_clean stores categorical variables as factors
cor_matrix <- cor(df, use = "complete.obs")
corrplot(cor_matrix,
         method = "color",
         type = "upper",
         tl.cex = 0.7)
The corrplot shows very light colors between most predictors and FinalGrade. This is consistent with the very low Adjusted R-squared of the full regression model (about 0.010). When variables do not show strong colors in the heatmap, it is a visual warning that a linear model may struggle to find a strong “signal” or predictive pattern in the data.
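The Adjusted R-squared can be reproduced by hand from the figures printed in the summary output of the full regression model in this section (Multiple R-squared 0.01303, 38 slope coefficients, 12,430 residual degrees of freedom). The sketch below applies the standard adjustment formula to those reported numbers.

```r
# Values copied from the regression summary output in this section.
r2       <- 0.01303  # Multiple R-squared
p        <- 38       # number of slope coefficients (numerator DF of the F-statistic)
df_resid <- 12430    # residual degrees of freedom

# Sample size implied by the degrees of freedom: n = df_resid + p + 1.
n <- df_resid + p + 1

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

print(n)
print(round(adj_r2, 5))
```

The implied sample size equals the number of rows in the cleaned dataset, and the adjusted value shows that the predictors together explain only about 1% of the variance in the grade.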
The regression analysis was conducted in three distinct phases to ensure both statistical significance and predictive reliability.
A full linear regression was first implemented using all 14 attributes. This “Global Model” served as an exploratory step to identify significant predictors and filter out statistical “noise” from non-contributing variables like Age or Resources.
To validate the model’s accuracy, the data was partitioned into a Training Set (80%) and a Testing Set (20%).
Purpose: The training set builds the mathematical coefficients, while the testing set acts as “unseen” data to verify if the model can accurately predict grades for new students.
Reproducibility: set.seed(123) was applied to ensure the random split remains consistent for future verification.
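The effect of fixing the seed can be demonstrated directly: two splits drawn with the same seed select exactly the same rows. This is a minimal sketch on an invented ten-row data frame; split_rows() is a hypothetical helper introduced only for this illustration.

```r
# Invented toy data with ten rows.
toy <- data.frame(id = 1:10, score = seq(40, 94, by = 6))

# Hypothetical helper: draw training row indices for a given proportion and seed.
split_rows <- function(n, prop, seed) {
  set.seed(seed)
  sample(seq_len(n), size = floor(prop * n))
}

# Same seed, same split.
idx_a <- split_rows(nrow(toy), 0.8, seed = 123)
idx_b <- split_rows(nrow(toy), 0.8, seed = 123)

train <- toy[idx_a, ]
test  <- toy[-idx_a, ]

print(identical(idx_a, idx_b))
print(c(train = nrow(train), test = nrow(test)))
```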
A refined model was tested against the holdout set, yielding the following error metrics:
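MAE (0.9884): On average, the model’s predictions deviate from the actual grade by about 0.99 grade levels.
RMSE (1.111): The root mean squared error, which penalizes large errors more heavily than the MAE; its proximity to the MAE suggests relatively consistent error margins. Both values are large relative to the four-level grade scale, and the RMSE is close to the standard deviation of FinalGrade (about 1.12 in the raw data), so the model predicts little better than the mean grade.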
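To put the two error metrics in context, it helps to compare them with a naive baseline that always predicts the mean grade. The sketch below defines both metrics and runs them on invented toy values; mae() and rmse() are helper names introduced here for illustration.

```r
# Mean absolute error: average size of the prediction errors.
mae <- function(actual, predicted) mean(abs(actual - predicted))

# Root mean squared error: like MAE, but large errors weigh more.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Invented toy grades and model predictions.
actual    <- c(0, 1, 2, 3, 1, 2)
predicted <- c(1.5, 1.4, 1.6, 1.5, 1.3, 1.7)

# Naive baseline: always predict the mean of the actual grades.
baseline <- rep(mean(actual), length(actual))

print(c(model_mae = mae(actual, predicted), model_rmse = rmse(actual, predicted)))
print(c(base_mae = mae(actual, baseline), base_rmse = rmse(actual, baseline)))
```

If the model's errors are only marginally below the baseline's, as in this toy example, the predictors carry little usable signal.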
# Note: as.numeric() on a factor returns the level positions (1 to 4),
# so final_grade_num is the original 0-3 code shifted up by one.
df_clean_regression <- df_clean %>%
  mutate(final_grade_num = as.numeric(final_grade))
model <- lm(
  final_grade_num ~ study_hours + attendance + resources + extracurricular +
    motivation + internet + gender + age + learning_style +
    online_courses + discussions + assignment_completion +
    edu_tech + stress_level,
  data = df_clean_regression
)
summary(model)
##
## Call:
## lm(formula = final_grade_num ~ study_hours + attendance + resources +
## extracurricular + motivation + internet + gender + age +
## learning_style + online_courses + discussions + assignment_completion +
## edu_tech + stress_level, data = df_clean_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.7629 -1.2507 -0.2163 0.7346 1.9611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4689455 0.1362171 18.125 < 2e-16 ***
## study_hours -0.0019081 0.0016618 -1.148 0.250890
## attendance 0.0012705 0.0008772 1.448 0.147545
## resourcesMedium 0.0035858 0.0262874 0.136 0.891502
## resourcesHigh -0.0087003 0.0284423 -0.306 0.759692
## extracurricularYes 0.0140911 0.0203059 0.694 0.487731
## motivationMedium -0.0152933 0.0231787 -0.660 0.509393
## motivationHigh 0.0250474 0.0284507 0.880 0.378671
## internetYes 0.0363877 0.0363336 1.001 0.316610
## genderMale -0.0415031 0.0201519 -2.060 0.039466 *
## age 0.0020185 0.0028639 0.705 0.480929
## learning_style1 -0.0624229 0.0284962 -2.191 0.028502 *
## learning_style2 -0.0623313 0.0286717 -2.174 0.029726 *
## learning_style3 -0.0313687 0.0285084 -1.100 0.271208
## online_courses1 0.0488114 0.0629451 0.775 0.438082
## online_courses2 -0.1863161 0.0628097 -2.966 0.003019 **
## online_courses3 -0.0318801 0.0646866 -0.493 0.622135
## online_courses4 -0.0060490 0.0634863 -0.095 0.924094
## online_courses5 -0.2587976 0.0621726 -4.163 3.17e-05 ***
## online_courses6 -0.1427457 0.0641068 -2.227 0.025986 *
## online_courses7 -0.0231334 0.0651555 -0.355 0.722559
## online_courses8 0.0315137 0.0631252 0.499 0.617630
## online_courses9 -0.0842806 0.0641845 -1.313 0.189174
## online_courses10 -0.0043394 0.0652747 -0.066 0.946997
## online_courses11 -0.2329745 0.0644488 -3.615 0.000302 ***
## online_courses12 -0.1012385 0.0652003 -1.553 0.120513
## online_courses13 -0.0461449 0.0642897 -0.718 0.472916
## online_courses14 -0.1561075 0.0649620 -2.403 0.016273 *
## online_courses15 -0.1527717 0.0654245 -2.335 0.019555 *
## online_courses16 -0.1728313 0.0638539 -2.707 0.006805 **
## online_courses17 -0.0530836 0.0628308 -0.845 0.398202
## online_courses18 -0.1420717 0.0628137 -2.262 0.023727 *
## online_courses19 -0.0501315 0.0658565 -0.761 0.446538
## online_courses20 -0.1167487 0.0637369 -1.832 0.067016 .
## discussions1 0.0926879 0.0205568 4.509 6.58e-06 ***
## assignment_completion -0.0023335 0.0006852 -3.406 0.000662 ***
## edu_tech1 -0.0262880 0.0220968 -1.190 0.234197
## stress_level1 0.1633931 0.0290613 5.622 1.92e-08 ***
## stress_level2 0.1446431 0.0263722 5.485 4.22e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.116 on 12430 degrees of freedom
## Multiple R-squared: 0.01303, Adjusted R-squared: 0.01001
## F-statistic: 4.318 on 38 and 12430 DF, p-value: < 2.2e-16
set.seed(123)
index <- sample(1:nrow(df_clean_regression), 0.8 * nrow(df_clean_regression))
train <- df_clean_regression[index, ]
test <- df_clean_regression[-index, ]
model_train <- lm(
  final_grade_num ~ study_hours + attendance + assignment_completion +
    motivation + online_courses + stress_level,
  data = train
)
pred <- predict(model_train, test)
MAE <- mean(abs(pred - test$final_grade_num))
RMSE <- sqrt(mean((pred - test$final_grade_num)^2))
MAE
## [1] 0.9884214
RMSE
## [1] 1.111205
The primary objective of this phase is to develop a predictive framework capable of categorizing students into specific performance levels. By treating final_grade as an ordinal target variable (ranging from 0 to 3), we utilized a Support Vector Machine (SVM) classifier. SVMs are particularly advantageous for educational datasets as they can effectively map complex, non-linear interactions between behavioural inputs (e.g., study habits) and academic outcomes, establishing clear boundaries between student performance groups.
RQ2: To what extent can machine learning algorithms, specifically SVM, accurately predict a student’s final grade by synthesizing demographic profiles, learning behaviours, and intermediate assessment scores?
This inquiry aims to validate whether the available data possesses sufficient “signal” to automate the grading process. Success here would imply that educational institutions could deploy such models as “Early Warning Systems,” identifying students destined for lower performance bands while there is still time to intervene.
Before model training, it is essential to verify if the data contains distinct patterns. Instead of a simple boxplot, we employed a Scatter Plot Analysis mapping exam_score against study_hours, color-coded by the target variable final_grade.
library(ggplot2)
# Visualization 4.2: Scatter Plot of Exam Score vs. Study Hours
# This plot checks whether the grades form distinct bands rather than random noise.
ggplot(df_clean, aes(x = study_hours, y = exam_score, color = as.factor(final_grade))) +
  geom_point(alpha = 0.6, size = 2) +
  theme_minimal() +
  labs(
    title = "Figure 4.2: Class Separation by Exam Score and Study Effort",
    subtitle = "Stratification is visible: high exam scores (top) fall into Grade 0.",
    x = "Weekly Study Hours",
    y = "Exam Score",
    color = "Final Grade"
  ) +
  scale_color_brewer(palette = "Set1")
Figure 4.2 reveals distinct bands stacked along the exam-score axis. Students achieving high exam scores consistently cluster in the “Grade 0” category (top), while those with lower scores fall into “Grade 3” (bottom). This visual separation suggests that the SVM should be able to separate the classes with high accuracy, because the class boundaries appear to depend mainly on exam_score.
The SVM model was constructed using the e1071 library in R, employing a Radial Basis Function (RBF) kernel to capture non-linear relationships. To ensure the model’s generalizability, the data was partitioned into a training set (80%) and a testing set (20%). The model’s performance is visualized below using a Confusion Matrix Heatmap, which highlights where the predictions align with the actual student grades.
library(lava)
library(recipes)
library(caret)
library(e1071)
# 1. Data Splitting (80% Train, 20% Test)
set.seed(123)
train_idx <- createDataPartition(df_clean$final_grade, p = 0.8, list = FALSE)
train_data <- df_clean[train_idx, ]
test_data <- df_clean[-train_idx, ]
# 2. Train SVM Model
svm_model <- svm(final_grade ~ ., data = train_data, kernel = "radial")
# 3. Predict & Evaluate
preds <- predict(svm_model, test_data)
conf_matrix <- confusionMatrix(preds, test_data$final_grade)
# 4. Generate Heatmap
cm_df <- as.data.frame(conf_matrix$table)
ggplot(cm_df, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "white", fontface = "bold") +
scale_fill_gradient(low = "pink", high = "red") +
labs(title = "Figure 4.3: Confusion Matrix Heatmap", x = "Actual Grade", y = "Predicted Grade") +
theme_minimal()
The heatmap gives a visual check of the model's accuracy. The intense red tiles along the diagonal represent correct predictions, where the model's predicted grade matches the student's actual grade. Conversely, the pale pink tiles outside this diagonal hold low counts, indicating a near-zero error rate. The clear separation shows that the model rarely confuses distinct performance levels (e.g., it never mistakes a failing student for a top achiever).
model_accuracy <- conf_matrix$overall['Accuracy']
print(paste("Model Accuracy:", round(model_accuracy * 100, 2), "%"))
## [1] "Model Accuracy: 97.23 %"
The heatmap and the accuracy figure show that the SVM model achieves high classification accuracy (97.23%) on the test set. This indicates that the dataset contains highly deterministic patterns, specifically the strong link between exam_score and final_grade, which allow the model to predict student outcomes reliably.
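Overall accuracy can hide weak performance on individual grades, so it can help to read per-class precision and recall from the same confusion matrix. The sketch below uses a hypothetical 4x4 matrix with invented counts (rows are predictions and columns are actual grades, matching the layout of conf_matrix$table); it is an illustration, not the study's result. caret also reports per-class statistics in conf_matrix$byClass.

```r
# Hypothetical confusion matrix with invented counts, for illustration only.
# Rows = predicted grade, columns = actual grade.
cm <- matrix(
  c(50,  2,  0,  0,
     3, 45,  4,  0,
     0,  1, 48,  2,
     0,  0,  3, 42),
  nrow = 4, byrow = TRUE,
  dimnames = list(Prediction = 0:3, Reference = 0:3)
)

accuracy  <- sum(diag(cm)) / sum(cm)   # share of all predictions that are correct
precision <- diag(cm) / rowSums(cm)    # correct / everything predicted as that grade
recall    <- diag(cm) / colSums(cm)    # correct / everything actually in that grade

round(data.frame(precision, recall), 3)
accuracy
```

With real results, cm can be replaced by conf_matrix$table to obtain the same quantities for the fitted SVM.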
The summary of the regression model fitted earlier provides the following key insights into student performance:
Significant Positive Predictors: Discussions (+0.093) and stress level (+0.057) had the largest positive coefficients (p < 0.001). This suggests that active peer engagement and a degree of “productive stress” are associated with higher grades.
Significant Negative Predictors: Assignment completion (-0.002) and online courses (-0.003) showed small but significant negative associations (p < 0.05). This may indicate that high-volume task completion without deep comprehension does not yield higher grades.
Statistically Insignificant Factors: Interestingly, study hours and attendance did not reach the significance threshold (p > 0.05) when other engagement factors were present, suggesting that quality of engagement matters more than quantity of time spent.
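Coefficient signs and p-values like those above can be read directly from the coefficient table of a fitted lm object. The sketch below is a minimal illustration on simulated data: the data frame sim, its variable names, and the effect sizes are invented and do not reproduce the study's model.

```r
# Simulated data; variable names and effect sizes are invented for illustration.
set.seed(42)
n <- 500
sim <- data.frame(
  discussions  = rnorm(n),
  stress_level = rnorm(n),
  study_hours  = rnorm(n)
)
sim$grade_num <- 1.5 + 0.4 * sim$discussions + 0.2 * sim$stress_level + rnorm(n)

fit <- lm(grade_num ~ discussions + stress_level + study_hours, data = sim)

# Coefficient table: estimate, standard error, t value, p value
coef_table <- as.data.frame(summary(fit)$coefficients)
names(coef_table) <- c("estimate", "std_error", "t_value", "p_value")

# Flag predictors at the conventional 0.05 threshold
coef_table$significant <- coef_table$p_value < 0.05
coef_table[order(coef_table$p_value), ]
```

Sorting by p-value puts the most strongly supported predictors first. The estimate column gives the direction and size of each association, holding the other predictors fixed.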
The classification analysis yielded results that are both statistically robust and educationally significant.
Model Performance and Accuracy: The Confusion Matrix shows an accuracy of 97.23% on the test set. This indicates that the model is highly effective at distinguishing between the four grade categories (0–3). The remaining error is small (~2.8%), and the high precision scores across classes suggest that the model captures the core patterns in the data without significant bias.
Hierarchy of Predictors: To fully understand why the accuracy is so high, we must view this classification model in tandem with the Regression Analysis conducted.
The “What” (Classification): The scatter plot and the SVM's high accuracy indicate that exam_score is the dominant determinant of final_grade. The relationship is close to deterministic: if a student scores within a certain range, their grade is effectively guaranteed.
The “Why” (Regression): While the classification model relies on the exam score, the Regression model indicates which behavioural factors (e.g., discussions, motivation, stress level) are associated with performance.
Synthesis: Therefore, the “optimization of education” follows a hierarchical path: Behavioural interventions (Level 2) improve Exam Scores (Level 1), which then mathematically dictates the Final Grade (Outcome).
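One way to check whether final_grade is effectively a binning of exam_score is to compare the range of exam scores observed within each grade: if the ranges do not overlap, exam score alone determines the grade. The sketch below is a minimal illustration on simulated data. The cut points (50, 65, 80) are assumptions, and the coding with Grade 0 as the highest band follows Figure 4.2. On the real data, exam_score and final_grade would come from df_clean.

```r
# Simulated data; the cut points are assumptions for illustration only.
set.seed(1)
exam_score  <- round(runif(1000, 30, 100))
final_grade <- cut(exam_score,
                   breaks = c(-Inf, 50, 65, 80, Inf),
                   labels = c(3, 2, 1, 0),   # Grade 0 = highest scores
                   right  = FALSE)

# Lowest and highest exam score observed within each grade
band_min <- tapply(exam_score, final_grade, min)
band_max <- tapply(exam_score, final_grade, max)

# Order the bands by their minimum score, then check whether any band's
# maximum reaches into the next band's range.
ord      <- order(band_min)
overlaps <- any(head(band_max[ord], -1) >= tail(band_min[ord], -1))

data.frame(grade = names(band_min)[ord],
           min = as.vector(band_min[ord]),
           max = as.vector(band_max[ord]))
overlaps
```

If overlaps is FALSE on the real data, a simple threshold rule on exam_score would reproduce the grades, which would explain most of the classifier's accuracy.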
This project applied data science and machine learning techniques to analyse student learning behaviours and predict academic performance using the Student Performance and Learning Behavior dataset. Exploratory data analysis and data profiling provided a clear understanding of the dataset structure, variable distributions, and relationships.
The data cleaning process ensured high data quality by removing duplicate records, standardising variable names, and converting numerically encoded categorical variables into appropriate factor and ordered factor formats. The final cleaned dataset consisted of 12,469 records and was suitable for predictive modelling.
Two machine learning approaches were applied. The regression analysis revealed that behavioural and engagement-related factors such as discussions, motivation, and stress level have a stronger association with academic performance than study hours or attendance alone. The low adjusted R-squared value indicates that a linear combination of these factors explains only a small share of the variation in grades, which suggests that performance depends on non-linear interactions or on factors the linear model does not capture.
The classification analysis using a Support Vector Machine (SVM) achieved a high accuracy of 97.23% on the held-out test set, demonstrating that student final grades can be predicted with high reliability. The results indicate that exam score acts as a dominant determinant of final grade, while behavioural variables indirectly influence outcomes through their impact on exam performance.
The findings demonstrate that machine learning can be effectively applied to educational data to support academic performance analysis. The classification model can function as a reliable grading verification or early warning system, enabling institutions to identify students at risk of lower performance.
More importantly, the regression analysis provides insight into why students perform as they do. This allows educators to focus on actionable behavioural factors such as student engagement and stress management rather than relying solely on attendance or study duration. Such insights can support targeted interventions and personalised learning strategies.
This study is limited by the use of a publicly available dataset, which may not fully reflect real-world educational environments. Additionally, the strong dependency between exam score and final grade may inflate classification accuracy.
Future work could involve applying alternative machine learning models, incorporating real institutional data, and analysing longitudinal data to capture changes in student behaviour over time. These enhancements could improve the generalisability and practical applicability of the findings.