Students’ academic performance is typically measured by the grades they achieve on a course’s exams. It is affected by several factors, including parental background, daily habits, environmental influence, and learning facilities. Based on the collected data, this project aims to predict whether students will pass at the end of the mathematics course. This can benefit teachers, who can supervise and communicate with their students about how they are performing in the course given their background, daily habits, environment, and the facilities provided for them to study. Students can also use this project to evaluate their academic performance based on their present circumstances.
The packages used throughout this project are imported here.
library(dplyr)
library(lubridate)
library(e1071)
library(caret)
library(randomForest)
library(partykit)
library(ROCR)
library(ggplot2)
The dataset is taken from secondary education students at two Portuguese schools. Its attributes include student grades in mathematics along with demographic, social, and school-related features, and it was collected using school reports and questionnaires.
Here, we read our dataset using the read.csv() function, converting string columns into factors with the stringsAsFactors = T parameter, and then use head() to view the first few rows.
students <- read.csv("input_data/student-mat.csv", sep=';', stringsAsFactors=T)
head(students)
Here are the descriptions of the attributes in the math course students dataset:
The glimpse() function is used to view the data type of each column.
glimpse(students)
#> Rows: 395
#> Columns: 33
#> $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
It seems that a few of the column types do not match the data. We will fix this using the mutate_at() function to convert those columns into factors with the as.factor function.
students <- students %>%
mutate_at(vars(Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health), as.factor)
glimpse(students)
#> Rows: 395
#> Columns: 33
#> $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…
Now all the column types match the data.
An outlier is a data point that differs significantly from other observations. We need to remove outliers so that they do not affect our machine learning models. When outliers are present, a model can become overly sensitive to these points, which can alter how it identifies patterns in the data.
We will use boxplot() to check which numeric columns contain outliers.
students_num <- students %>%
select_if(is.numeric)
boxplot(students_num)
It seems that the absences column contains outliers, because most students are not absent more than 20 times (judging by the upper whisker of the boxplot). If we include the students with more than 20 absences, the characteristics of these outliers could affect how the model treats the rest of the data.
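As a quick check on that reading of the boxplot, here is a short sketch that computes the upper-whisker cutoff for absences explicitly, using the usual Q3 + 1.5 × IQR rule (the same rule boxplot() uses to flag outlier points).
# Upper-whisker cutoff for absences: Q3 + 1.5 * IQR
q3  <- quantile(students$absences, 0.75)        # third quartile
iqr <- IQR(students$absences)                   # interquartile range
upper_whisker <- q3 + 1.5 * iqr                 # points above this are drawn as outliers by boxplot()
upper_whisker
sum(students$absences > upper_whisker)          # how many students fall above the cutoff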
We will use filter() to remove the outlier data.
students <- students %>%
filter(absences < 20)
summary(students)
#> school sex age address famsize Pstatus Medu Fedu
#> GP:330 F:194 Min. :15.00 R: 85 GT3:266 A: 38 0: 3 0: 2
#> MS: 46 M:182 1st Qu.:16.00 U:291 LE3:110 T:338 1: 58 1: 79
#> Median :17.00 2: 99 2:111
#> Mean :16.66 3: 92 3: 92
#> 3rd Qu.:18.00 4:124 4: 92
#> Max. :22.00
#> Mjob Fjob reason guardian traveltime
#> at_home : 57 at_home : 20 course :143 father: 88 1:248
#> health : 34 health : 18 home : 98 mother:261 2: 99
#> other :132 other :205 other : 35 other : 27 3: 21
#> services: 97 services:105 reputation:100 4: 8
#> teacher : 56 teacher : 28
#>
#> studytime failures schoolsup famsup paid activities nursery
#> 1:100 0:302 no :327 no :147 no :205 no :182 no : 78
#> 2:186 1: 42 yes: 49 yes:229 yes:171 yes:194 yes:298
#> 3: 63 2: 16
#> 4: 27 3: 16
#>
#>
#> higher internet romantic famrel freetime goout Dalc Walc health
#> no : 18 no : 66 no :258 1: 8 1: 16 1: 21 1:263 1:144 1: 44
#> yes:358 yes:310 yes:118 2: 18 2: 59 2: 98 2: 70 2: 82 2: 41
#> 3: 63 3:152 3:126 3: 25 3: 77 3: 87
#> 4:184 4:111 4: 80 4: 9 4: 45 4: 65
#> 5:103 5: 38 5: 51 5: 9 5: 28 5:139
#>
#> absences G1 G2 G3
#> Min. : 0.000 Min. : 3.0 Min. : 0.00 Min. : 0.00
#> 1st Qu.: 0.000 1st Qu.: 8.0 1st Qu.: 9.00 1st Qu.: 8.00
#> Median : 3.000 Median :11.0 Median :11.00 Median :11.00
#> Mean : 4.439 Mean :10.9 Mean :10.72 Mean :10.41
#> 3rd Qu.: 7.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
#> Max. :19.000 Max. :19.0 Max. :19.00 Max. :20.00
The colSums(is.na()) function is used to check whether there are any missing values in the data.
colSums(is.na(students))
#> school sex age address famsize Pstatus Medu
#> 0 0 0 0 0 0 0
#> Fedu Mjob Fjob reason guardian traveltime studytime
#> 0 0 0 0 0 0 0
#> failures schoolsup famsup paid activities nursery higher
#> 0 0 0 0 0 0 0
#> internet romantic famrel freetime goout Dalc Walc
#> 0 0 0 0 0 0 0
#> health absences G1 G2 G3
#> 0 0 0 0 0
There are no missing values in any column.
We are going to perform Exploratory Data Analysis to examine some factors that affect students’ final grade, such as absences, age, study time, parents’ education, extra educational support, internet access, having a romantic relationship, alcohol consumption, and health condition.
Missing classes can have a negative impact on a student’s academic performance. When students miss their classes, they miss out on important lectures, discussions, exercises, and quizzes given by the teachers. Let’s compare how the students performed in the math class by splitting them with filter() into students who passed the math class (G3 >= 10) and students who failed it (G3 < 10).
To visualize and compare the distribution of the students’ absences, we can use a boxplot.
pass_students <- students %>%
filter(G3 >= 10)
fail_students <- students %>%
filter(G3 < 10)
boxplot(pass_students$absences, fail_students$absences,
horizontal = T,
names = c("Pass", "Fail"),
xlab = "Number of Absences",
col = '#3A9BDC')
Analysis
Based on the boxplot, we can see that even though the passing students have a higher median number of absences than the failing students, their absences are less dispersed. Overall, the passing students missed fewer classes than the failing students.
In a school, it’s possible for students of different ages to be in the same class. Age can influence how students interact with their friends and how much attention they pay to their academic performance. Let us see the age distribution of the students in the math class by plotting a histogram with the hist() function.
hist(students$age,
breaks = 8,
main = NULL,
xlab = "Age",
ylab = "Number of Students",
ylim = c(0, 200),
col = '#3A9BDC')
Analysis
Most of the students are 15 years old, with the rest aged 16 and older. We can see that some students are older than 19, which is not a typical age for a high school student. Let us see how well they performed in the math class by comparing the passing and failing students based on their age.
We group the students by age using group_by() and count the number of students of each age. To visualize the data for easier interpretation, the passing and failing student counts are combined in the same dataframe and barplot() is used to create a bar chart.
pass_age <- pass_students %>%
group_by(age) %>%
summarise(count = n()) %>%
rbind(list(age = 21, count = 0), list(age = 22, count = 0)) #Adding students with the age of 21 and 22 to match the dataframe
fail_age <- students %>%
filter(G3 < 10) %>%
group_by(age) %>%
summarise(count = n())
plot_age <- pass_age %>%
mutate(pass_age = count, fail_age = fail_age$count)
barplot(cbind(pass_age, fail_age) ~ age,
data = plot_age,
xlab = "Age",
ylab = "Number of Students",
col = c('#34A583', '#EA4335'),
legend.text = TRUE,
args.legend = list(legend = c("Pass", "Fail")),
beside = T)
Analysis
Based on the plot, more students aged 15 to 18 passed than failed. However, the share of failing students increases slightly with age: among students aged 19 and older, more failed than passed the math class. From the plot we can conclude that the older the students are, the more likely they are to fail the math class.
Students have to set aside some time outside of school to study so they can learn and broaden their knowledge. Self-studying also helps them review the material so they are better prepared for tests.
We are going to calculate the average final grade of the students based on how long they study each week, using the group_by() function and then aggregating the final grade with summarise_at().
plot_study <- students %>%
group_by(studytime) %>%
summarise_at(vars(G3), mean)
plot_study
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(G3~studytime,
data = plot_study,
names.arg = c('<2 Hours','2-5 Hours','5-10 Hours','>10 Hours'),
xlab = "Study Time",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
Students who study for less than 5 hours a week have an average final grade that barely passes the math course (passing grade = 10), while students who study for more than 5 hours have higher final grades on average. How long students study can therefore affect their performance significantly.
Parents are one of the most important factors in students’ education. In the academic field, parents can also help with their children’s studies to deepen their knowledge of the course.
We are going to calculate the average final grade based on the students’ mother’s (Medu) and father’s (Fedu) education, using the group_by() function and then aggregating the final grade with summarise_at().
Note: due to the small number of students whose parents have no education, that data won’t be used in the analysis.
Mother’s Education
plot_medu <- students %>%
group_by(Medu) %>%
summarise_at(vars(G3), mean)
plot_medu
Father’s Education
plot_fedu <- students %>%
group_by(Fedu) %>%
summarise_at(vars(G3), mean)
plot_fedu
To visualize the data for easier interpretation, the data from Medu and Fedu are combined in the same dataframe and barplot() is used to create a bar chart.
plot_parent <- plot_fedu %>%
mutate(fedu_G3 = G3, medu_G3 = plot_medu$G3)
barplot(cbind(fedu_G3, medu_G3) ~ Fedu,
data = plot_parent[2:5,],
names.arg = c('primary','5th to 9th grade','secondary', 'higher education'),
xlab = "Parents Education",
ylab = "Final Grade Average",
ylim = c(0,14),
col = c('#3A9BDC', 'violetred1'),
legend.text = TRUE,
args.legend = list(x = "topleft", legend = c("Father", "Mother")),
beside = T)
Analysis
The higher the parents’ education, the higher the students’ average final grade. This shows that parents’ education impacts students’ academic performance. Parents with higher education have more knowledge to teach their children and they can help if their children have learning problems.
Students often receive extra educational support to help them study. That extra support can come from school (schoolsup) and also from their family (famsup).
We are going to calculate the average final grade based on the extra
educational support they received using group_by() function
and then aggregating the final grade using
summarise_at().
plot_schoolsup <- students %>%
group_by(schoolsup, famsup) %>%
summarise_at(vars(G3), mean)
plot_schoolsup
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(plot_schoolsup$G3,
data = plot_schoolsup,
names.arg = c("No Support", "Family Only", "School Only", "Family and School"),
xlab = "Extra Educational Support",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
Students who received no extra educational support have the highest average grade compared to students who received support. On the contrary, students who received extra educational support from both family and school have the lowest average grade.
The internet is a powerful tool for learning. Using the internet, students can search for various study materials that aren’t taught at school.
We are going to calculate the average final grade based on the
students’ internet availability at home using group_by()
function and then aggregating the final grade using
summarise_at().
plot_internet <- students %>%
group_by(internet) %>%
summarise_at(vars(G3), mean)
plot_internet
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(G3~internet,
data = plot_internet,
xlab = "Internet Availability",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
Students who have internet access have a higher average grade than students who don’t. The average final grade of students without internet is below the course’s passing grade. This suggests that having internet access can improve students’ academic performance.
Having a romantic relationship can also affect students’ academic performance, since students with a romantic partner tend to spend more of their time with their partner.
We are going to calculate the average final grade based on the
students’ relationship status using group_by() function and
then aggregating the final grade using summarise_at().
plot_romantic <- students %>%
group_by(romantic) %>%
summarise_at(vars(G3), mean)
plot_romantic
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(G3~romantic,
data = plot_romantic,
xlab = "Romantic Partner",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
Students without a romantic partner have a higher average final grade than students with one. The average final grade of students with a romantic partner is below the course’s passing grade. It seems that students who have a romantic partner tend to focus on their loved one more than on their grades.
Alcohol is commonly consumed by students as a means of relaxing and socializing, but excessive drinking can lead to negative consequences such as poor academic performance, health problems, risky behavior, and even alcohol addiction.
We are going to calculate the average final grade based on how much
alcohol the students consume on weekdays using group_by()
function and then aggregating the final grade using
summarise_at().
plot_alcohol <- students %>%
group_by(Dalc) %>%
summarise_at(vars(G3), mean)
plot_alcohol
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(G3~Dalc,
data = plot_alcohol,
names.arg = c("Very Low", "Low", "Medium", "High", "Very High"),
xlab = "Alcohol Consumption",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
Students with very low alcohol consumption have the highest average final grade, but the difference compared to students with very high consumption is very small. Given how little the average final grade varies across the levels, alcohol consumption does not appear to have a significant impact on students’ performance.
The health condition of students can have a significant impact on their academic performance. When students are not feeling well, they may struggle to concentrate and stay focused on their studies. Illness can also cause students to miss classes, assignments, and tests.
We are going to calculate the average final grade based on how
healthy the students are using group_by() function and then
aggregating the final grade using summarise_at().
plot_health <- students %>%
group_by(health) %>%
summarise_at(vars(G3), mean)
plot_health
To visualize the data for easier interpretation, barplot() is used to create a bar plot.
barplot(G3~health,
data = plot_health,
names.arg = c("Very Bad", "Bad", "Medium", "Good", "Very Good"),
xlab = "Health",
ylab = "Final Grade Average",
ylim = c(0,12),
col = '#3A9BDC')
Analysis
It seems that students with very bad health have the highest average final grade of all the health levels. For the other levels, however, the better the students’ health, the better their average grade. Students in poor health may be more motivated to achieve higher grades than healthier students.
We are going to look further into the students with very bad health, who unexpectedly have the highest average grades compared to the healthy students. First, we split off the students with very bad health (health == 1) and the healthy students (health >= 3) using filter().
sick_students <- students %>%
filter(health == 1)
healthy_students <- students %>%
filter(health %in% c(3, 4, 5))
summary(sick_students)
#> school sex age address famsize Pstatus Medu Fedu
#> GP:37 F:28 Min. :15.00 R:10 GT3:29 A: 8 0: 0 0: 1
#> MS: 7 M:16 1st Qu.:15.00 U:34 LE3:15 T:36 1: 8 1: 8
#> Median :16.00 2: 5 2:13
#> Mean :16.50 3:13 3:11
#> 3rd Qu.:17.25 4:18 4:11
#> Max. :22.00
#> Mjob Fjob reason guardian traveltime studytime
#> at_home : 8 at_home : 4 course :14 father: 5 1:30 1:11
#> health : 2 health : 1 home :10 mother:39 2:11 2:26
#> other :17 other :23 other : 5 other : 0 3: 2 3: 3
#> services: 9 services:14 reputation:15 4: 1 4: 4
#> teacher : 8 teacher : 2
#>
#> failures schoolsup famsup paid activities nursery higher internet
#> 0:41 no :38 no :19 no :17 no :21 no : 6 no : 1 no : 4
#> 1: 2 yes: 6 yes:25 yes:27 yes:23 yes:38 yes:43 yes:40
#> 2: 0
#> 3: 1
#>
#>
#> romantic famrel freetime goout Dalc Walc health absences
#> no :33 1: 4 1: 3 1: 4 1:33 1:20 1:44 Min. : 0.000
#> yes:11 2: 2 2: 8 2: 9 2: 6 2:10 2: 0 1st Qu.: 0.000
#> 3: 6 3:14 3:12 3: 3 3: 8 3: 0 Median : 3.000
#> 4:18 4:14 4:14 4: 1 4: 4 4: 0 Mean : 4.477
#> 5:14 5: 5 5: 5 5: 1 5: 2 5: 0 3rd Qu.: 6.250
#> Max. :17.000
#> G1 G2 G3
#> Min. : 6.00 Min. : 5.0 Min. : 0.00
#> 1st Qu.: 9.75 1st Qu.:10.0 1st Qu.:10.00
#> Median :11.00 Median :13.0 Median :13.00
#> Mean :11.86 Mean :12.3 Mean :12.16
#> 3rd Qu.:15.00 3rd Qu.:15.0 3rd Qu.:15.00
#> Max. :19.00 Max. :18.0 Max. :19.00
By using summary(), we can see the distribution of each column. Let’s check a few aspects of the sick students’ data and compare them with the healthy students.
Absences
Students with bad health tend to miss school because of their sickness and the need to see a doctor. Sick students should therefore be absent more often than healthy students. We use boxplot() to compare the students’ absence data.
boxplot(healthy_students$absences, sick_students$absences,
horizontal = T,
names = c("Healthy", "Sick"),
xlab = "Number of Absences",
col = '#3A9BDC')
Analysis: It turns out that even though the sick students have a higher median number of absences than the healthy ones, some healthy students miss school more often than they do. Those frequently absent students could be the reason the healthy group has a lower average grade.
Study Time
We would like to check whether students with bad health tend to study more than the healthy students. We compare the proportion of each study-time level among the sick and healthy students using prop.table() and visualize it with barplot() to create a bar chart.
studyhealth_prop <- as.data.frame(list(healthy = prop.table(table(healthy_students$studytime)),
sick = prop.table(table(sick_students$studytime))))
barplot(cbind(healthy.Freq, sick.Freq) ~ sick.Var1,
data = studyhealth_prop,
names.arg = c('<2 Hours','2-5 Hours','5-10 Hours','>10 Hours'),
xlab = "Study Time",
ylim = c(0,0.6),
col = c('#3A9BDC', 'skyblue'),
beside = T,
legend.text = T,
args.legend = list(legend = c("Healthy", "Sick")))
Analysis: Most of the sick and healthy students studied for 2-5 hours. A larger share of the healthy students studied for 5-10 hours than of the sick students, but a smaller share studied for more than 10 hours.
Conclusion
There is not much information to be gained by comparing the healthy and sick students because the two groups are so similar.
This project aims to predict whether students pass at the end of the mathematics course. Based on Portugal’s curriculum, grades are given on a scale from 0 to 20, with the minimum passing grade being 10. Therefore, to pass the course, a student needs a final grade of 10 or more.
We are going to create a target variable status using the mutate() function, based on the students’ final grade in the math course.
students <- students %>%
mutate(status = ifelse(G3 >= 10, "Pass", "Fail")) %>%
mutate(status = as.factor(status)) %>%
select(-c(G3))
glimpse(students)
#> Rows: 376
#> Columns: 33
#> $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ status <fct> Fail, Fail, Pass, Pass, Pass, Pass, Pass, Fail, Pass, Pass,…
Cross validation is a method in machine learning used to evaluate the performance of a model by having it predict new, unseen data. It is done by randomly splitting the dataset into a training set and a testing set. The training set is used to train the model, and the testing set is predicted so we can evaluate the model’s performance.
The sample() function is used to take a sample of a specified size from the data. We are going to use 80% of the data for the training set and the remaining 20% for the test set.
set.seed(1)
RNGkind(sample.kind = 'Rounding')
# Random Index Sampling
index <- sample(x = nrow(students),
size = 0.8*nrow(students))
# Train-Test Split
students_train <- students[index,]
students_test <- students[-index,]
We are going to check how balanced our target variable is by using prop.table().
prop.table(table(students_train$status))
#>
#> Fail Pass
#> 0.31 0.69
It seems that our data is a little imbalanced, because the number of students who passed the math course is higher than the number who failed. We will need to upsample our data using upSample() to balance it.
students_train <- upSample(x = students_train %>% select(-status),
y = students_train$status,
yname = "status")
prop.table(table(students_train$status))
#>
#> Fail Pass
#> 0.5 0.5
The proportion of students that failed and passed the course is now balanced.
Naive Bayes is a machine learning model based on Bayes’ theorem. Bayes’ theorem describes the probability of an event occurring based on conditions that might be related to the event, where additional information updates the initial probability.
This is the Bayes Theorem formula:
\[P(A|B) = \frac{P(B|A) * P(A)}{P(B)}\]
Where P(A|B) is the probability of event A occurring given that event B has occurred (the posterior probability), P(B|A) is the probability of event B given that A has occurred, and P(A) and P(B) are the individual probabilities of events A and B (the prior probabilities).
Note :
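To make the theorem concrete, here is a small numeric sketch that estimates P(Pass | internet = yes) from the training data both directly and through Bayes’ rule; the two estimates should agree.
# P(A): probability that a student passes
p_pass <- mean(students_train$status == "Pass")
# P(B): probability that a student has internet at home
p_internet <- mean(students_train$internet == "yes")
# P(B|A): probability of having internet, given that the student passes
p_internet_given_pass <- mean(students_train$internet[students_train$status == "Pass"] == "yes")

# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
p_internet_given_pass * p_pass / p_internet

# Direct estimate of P(Pass | internet = yes) for comparison
mean(students_train$status[students_train$internet == "yes"] == "Pass")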
We are going to build our Naive Bayes model using the naiveBayes() function to train on the students_train data. The laplace = 1 parameter is added to apply Laplace smoothing with a value of 1.
model_bayes <- naiveBayes(formula = status~.,
data = students_train,
laplace = 1)
A decision tree is a tree-based machine learning model that repeatedly splits the data into two branches using the predictor with the highest information gain, i.e. the split whose branches have the most homogeneous target variable.
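As a rough illustration of what information gain means here (note that ctree() itself selects splits using conditional-inference tests rather than raw information gain), the sketch below computes the entropy-based gain of a single candidate split on the training data.
# Entropy of a class label vector
entropy <- function(y) {
  p <- prop.table(table(y))
  -sum(ifelse(p > 0, p * log2(p), 0))
}

# Information gain of splitting the target `y` by a logical condition `split`
info_gain <- function(y, split) {
  w <- mean(split)
  entropy(y) - (w * entropy(y[split]) + (1 - w) * entropy(y[!split]))
}

# Example: how much does splitting on G2 >= 10 reduce uncertainty about passing?
info_gain(students_train$status, students_train$G2 >= 10)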
We are going to build our Decision Tree model using the
ctree() function to train the students_train
data and plot the decision tree model using plot().
model_dtree <- ctree(formula = status~.,
data = students_train)
plot(model_dtree, type = "simple")
The first node is called the root node; it splits based on the value of G2, which is considered the predictor with the most information gain. The branches then split again, creating new branches with new sets of rules, until they reach the terminal nodes, which contain information about the target variable: each terminal node shows the predicted class together with the number of observations that reach it and the node’s classification error.
Random forest is a machine learning model that combines the output of multiple decision trees to reach a single result. Random Forest uses the bagging (bootstrap aggregation) method.
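To give a feel for the bagging idea, here is a toy sketch that trains several trees on bootstrap resamples of the training data and aggregates their votes; the actual Random Forest built below additionally samples a random subset of predictors at each split, which this sketch does not.
# Bagging by hand: several trees on bootstrap resamples, majority vote at the end
set.seed(1)
n_trees <- 25
votes <- replicate(n_trees, {
  boot_rows <- sample(nrow(students_train), replace = TRUE)        # bootstrap resample
  tree <- ctree(status ~ ., data = students_train[boot_rows, ])    # one tree per resample
  as.character(predict(tree, newdata = students_test))             # its predictions on the test set
})
# Aggregate: the most frequent vote across trees for each test student
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
head(bagged_pred)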
Before building our model, we are going to use the K-fold cross-validation method to evaluate it. In this method, the training dataset is divided into k subsets. In each fold, one of the k subsets is used as the validation set and the remaining subsets are used as the training set. The process is repeated until every subset has served as the validation set, and the accuracies of the folds are averaged to give the final accuracy of the model.
Here, we use the trainControl() function to apply cross-validation to our Random Forest training process. The repeatedcv method is used with 5 folds, and the whole procedure is repeated 3 times.
ctrl <- trainControl(method = "repeatedcv",
number = 5, # Number of Folds
repeats = 3) # Number of Repetitions
We are going to build our Random Forest model using the
train() function to train the students_train
data. The rf method is selected to use the Random Forest algorithm, and the trControl parameter applies the k-fold cross-validation.
set.seed(1)
RNGkind(sample.kind = 'Rounding')
model_rf <- train(status ~ .,
data = students_train,
method = "rf",
trControl = ctrl)
model_rf
#> Random Forest
#>
#> 414 samples
#> 32 predictor
#> 2 classes: 'Fail', 'Pass'
#>
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times)
#> Summary of sample sizes: 332, 331, 332, 330, 331, 331, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9257973 0.8516120
#> 36 0.9443017 0.8886158
#> 71 0.9475736 0.8951632
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 71.
From the model summary, we can see that the best model used mtry = 71 (the number of predictor variables randomly sampled as split candidates at each node) and achieved a cross-validation accuracy of 0.9475.
We are going to use two evaluation methods to assess our models’ performance: the confusion matrix and ROC-AUC.
1. Confusion Matrix
A confusion matrix is a table that compares the predicted classes against the actual classification labels. From it we can calculate the metrics used to evaluate the model, such as accuracy, sensitivity (recall), specificity, and precision.
Based on our data, the positive class is "Pass", so a False Positive is a student who is predicted to pass but actually fails, and a False Negative is a student who is predicted to fail but actually passes.
Therefore, we should prioritize suppressing the False Positive count so that we don’t overlook the failing students, which makes Precision our priority metric.
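As a quick reference for how these metrics are derived, here is a minimal sketch using illustrative counts (the numbers are made up for the example and are not taken from any model in this project).
# Illustrative counts only, with "Pass" as the positive class
TP <- 40   # predicted Pass, actually passed
FP <- 5    # predicted Pass, actually failed (the costly mistake here)
FN <- 8    # predicted Fail, actually passed
TN <- 23   # predicted Fail, actually failed

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
precision   <- TP / (TP + FP)   # of the predicted passes, how many really passed
recall      <- TP / (TP + FN)   # of the actual passes, how many were detected (sensitivity)
specificity <- TN / (TN + FP)   # of the actual fails, how many were detected
c(accuracy = accuracy, precision = precision, recall = recall, specificity = specificity)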
2. ROC-AUC
ROC (Receiver Operating Characteristic) is a graph representing the performance of a classification model in terms of its True Positive Rate and False Positive Rate. The ROC curve plots TPR against FPR at different classification thresholds. A good model has a high True Positive Rate and a low False Positive Rate. To measure how good an ROC curve is, we can use the AUC value. AUC (Area Under the ROC Curve) measures the entire area underneath the ROC curve; the higher the AUC, the better the model.
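As a sketch of how a single point on that curve arises, the hypothetical helper below computes the TPR and FPR at one threshold, given a vector of predicted probabilities for the positive class and the true labels; the ROCR functions used later sweep this calculation over all thresholds to draw the full curve.
# TPR and FPR at a single classification threshold (illustrative helper)
tpr_fpr <- function(probs, actual, threshold, positive = "Pass") {
  pred <- ifelse(probs >= threshold, positive, "Fail")
  tpr  <- sum(pred == positive & actual == positive) / sum(actual == positive)
  fpr  <- sum(pred == positive & actual != positive) / sum(actual != positive)
  c(TPR = tpr, FPR = fpr)
}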
Confusion Matrix
We are going to evaluate our Naive Bayes model’s performance by predicting the training data from students_train and the unseen data from students_test, and then building confusion matrices. The predict() function is used to obtain the class assigned to each observation by the model, and the confusionMatrix() function then creates the confusion matrix from the prediction results.
#Training Evaluation
predict_bayes <- predict(model_bayes, students_train)
confusionMatrix(predict_bayes,
students_train$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 181 31
#> Pass 26 176
#>
#> Accuracy : 0.8623
#> 95% CI : (0.8253, 0.894)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <0.0000000000000002
#>
#> Kappa : 0.7246
#>
#> Mcnemar's Test P-Value : 0.5962
#>
#> Sensitivity : 0.8502
#> Specificity : 0.8744
#> Pos Pred Value : 0.8713
#> Neg Pred Value : 0.8538
#> Prevalence : 0.5000
#> Detection Rate : 0.4251
#> Detection Prevalence : 0.4879
#> Balanced Accuracy : 0.8623
#>
#> 'Positive' Class : Pass
#>
Our Naive-Bayes model got 0.8623 train accuracy and 0.8713 train precision.
#Testing Evaluation
predict_bayes <- predict(model_bayes, students_test)
confusionMatrix(predict_bayes,
students_test$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 26 8
#> Pass 2 40
#>
#> Accuracy : 0.8684
#> 95% CI : (0.7713, 0.9351)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : 0.000003932
#>
#> Kappa : 0.7293
#>
#> Mcnemar's Test P-Value : 0.1138
#>
#> Sensitivity : 0.8333
#> Specificity : 0.9286
#> Pos Pred Value : 0.9524
#> Neg Pred Value : 0.7647
#> Prevalence : 0.6316
#> Detection Rate : 0.5263
#> Detection Prevalence : 0.5526
#> Balanced Accuracy : 0.8810
#>
#> 'Positive' Class : Pass
#>
Our Naive Bayes model got 0.8684 test accuracy and 0.9524 test precision. The precision of the model is very high because 40 out of its 42 "Pass" predictions are correct.
Both the training and testing evaluations show decent performance, and the model does not overfit because there is little difference between the two accuracies.
ROC & AUC
To create an ROC curve, we need our predictions as probabilities, which we obtain by setting type = "raw"; this gives the probability of each student being classified as failing or passing.
prob_bayes <- predict(object = model_bayes,
newdata = students_test,
type = "raw")
head(prob_bayes)
#> Fail Pass
#> [1,] 0.9961522575929 0.0038477424
#> [2,] 0.9991555701996 0.0008444298
#> [3,] 0.0000007558935 0.9999992441
#> [4,] 0.1612843812754 0.8387156187
#> [5,] 0.0000188164298 0.9999811836
#> [6,] 0.0000052712139 0.9999947288
Using the probabilities of the passed students, we can graph the true
positive rate and the false positive rate at different classification
thresholds to create an ROC curve. performance() function
is used to create the ROC curve.
roc_bayes <- prediction(prob_bayes[,2], students_test$status)
model_roc_bayes <- performance(prediction.obj = roc_bayes,
measure = "tpr", # True Positive Rate
x.measure = "fpr") # False Positive Rate
plot(model_roc_bayes)
abline(0, 1, lty = 2)
We can get the AUC value from the model’s ROC object by using the measure = "auc" parameter.
model_auc_bayes <- performance(roc_bayes,
measure = "auc")
model_auc_bayes@y.values[[1]]
#> [1] 0.953125
Our Naive-Bayes model got an AUC value of 0.953125.
Confusion Matrix
We are going to evaluate our Decision Tree model’s performance by predicting the training data from students_train and the unseen data from students_test, and then building confusion matrices. The predict() function is used to obtain the class assigned to each observation by the model, and the confusionMatrix() function then creates the confusion matrix from the prediction results.
#Training Evaluation
predict_dtree <- predict(object = model_dtree,
newdata = students_train)
confusionMatrix(data = predict_dtree,
reference = students_train$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 199 23
#> Pass 8 184
#>
#> Accuracy : 0.9251
#> 95% CI : (0.8954, 0.9486)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.0000000000000002
#>
#> Kappa : 0.8502
#>
#> Mcnemar's Test P-Value : 0.01192
#>
#> Sensitivity : 0.8889
#> Specificity : 0.9614
#> Pos Pred Value : 0.9583
#> Neg Pred Value : 0.8964
#> Prevalence : 0.5000
#> Detection Rate : 0.4444
#> Detection Prevalence : 0.4638
#> Balanced Accuracy : 0.9251
#>
#> 'Positive' Class : Pass
#>
Our Decision Tree model got 0.9251 train accuracy and 0.9583 train precision.
#Testing Evaluation
predict_dtree <- predict(object = model_dtree,
newdata = students_test)
confusionMatrix(data = predict_dtree,
reference = students_test$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 27 6
#> Pass 1 42
#>
#> Accuracy : 0.9079
#> 95% CI : (0.8194, 0.9622)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : 0.00000004098
#>
#> Kappa : 0.8092
#>
#> Mcnemar's Test P-Value : 0.1306
#>
#> Sensitivity : 0.8750
#> Specificity : 0.9643
#> Pos Pred Value : 0.9767
#> Neg Pred Value : 0.8182
#> Prevalence : 0.6316
#> Detection Rate : 0.5526
#> Detection Prevalence : 0.5658
#> Balanced Accuracy : 0.9196
#>
#> 'Positive' Class : Pass
#>
Our Decision Tree model got 0.9079 test accuracy and 0.9767 test precision. The precision of the model is very high because 42 out of its 43 "Pass" predictions are correct.
Both the training and testing evaluations show very good performance, and the model does not overfit because there is little difference between the two accuracies.
ROC & AUC
To create an ROC curve, we need our predictions as probabilities, which we obtain by setting type = "prob"; this gives the probability of each student being classified as failing or passing.
prob_dtree <- predict(object = model_dtree,
newdata = students_test,
type = "prob")
head(prob_dtree)
#> Fail Pass
#> 1 0.85 0.15
#> 2 1.00 0.00
#> 6 0.00 1.00
#> 11 1.00 0.00
#> 15 0.00 1.00
#> 16 0.00 1.00
Using the probabilities of the passed students, we can graph the true
positive rate and the false positive rate at different classification
thresholds to create an ROC curve. performance() function
is used to create the ROC curve.
roc_dtree <- prediction(prob_dtree[,2], students_test$status)
model_roc_dtree <- performance(prediction.obj = roc_dtree,
measure = "tpr", # True Positive Rate
x.measure = "fpr") # False Positive Rate
plot(model_roc_dtree)
abline(0, 1, lty = 2)
We can get the AUC value from the model’s ROC object by using the measure = "auc" parameter.
model_auc_dtree <- performance(roc_dtree,
measure = "auc")
model_auc_dtree@y.values[[1]]
#> [1] 0.9668899
Our Decision Tree model got an AUC value of 0.9668899.
Confusion Matrix
We are going to evaluate our Random Forest model’s performance by predicting the training data from students_train and the unseen data from students_test, and then building confusion matrices. The predict() function is used to obtain the class assigned to each observation by the model, and the confusionMatrix() function then creates the confusion matrix from the prediction results.
#Training Evaluation
predict_rf <- predict(object = model_rf,
newdata = students_train)
confusionMatrix(data = predict_rf,
reference = students_train$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 207 0
#> Pass 0 207
#>
#> Accuracy : 1
#> 95% CI : (0.9911, 1)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 0.00000000000000022
#>
#> Kappa : 1
#>
#> Mcnemar's Test P-Value : NA
#>
#> Sensitivity : 1.0
#> Specificity : 1.0
#> Pos Pred Value : 1.0
#> Neg Pred Value : 1.0
#> Prevalence : 0.5
#> Detection Rate : 0.5
#> Detection Prevalence : 0.5
#> Balanced Accuracy : 1.0
#>
#> 'Positive' Class : Pass
#>
Our Random Forest model got a perfect 1.00 train accuracy and 1.00 train precision.
#Testing Evaluation
predict_rf <- predict(object = model_rf,
newdata = students_test)
confusionMatrix(data = predict_rf,
reference = students_test$status,
positive = "Pass")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction Fail Pass
#> Fail 27 3
#> Pass 1 45
#>
#> Accuracy : 0.9474
#> 95% CI : (0.8707, 0.9855)
#> No Information Rate : 0.6316
#> P-Value [Acc > NIR] : 0.0000000001112
#>
#> Kappa : 0.8886
#>
#> Mcnemar's Test P-Value : 0.6171
#>
#> Sensitivity : 0.9375
#> Specificity : 0.9643
#> Pos Pred Value : 0.9783
#> Neg Pred Value : 0.9000
#> Prevalence : 0.6316
#> Detection Rate : 0.5921
#> Detection Prevalence : 0.6053
#> Balanced Accuracy : 0.9509
#>
#> 'Positive' Class : Pass
#>
Our Random Forest model got 0.9474 test accuracy and 0.9783 test precision. The precision of the model is very high because 45 out of its 46 "Pass" predictions are correct.
The training evaluation shows a perfect fit to the training data, while the testing evaluation shows a small decrease in performance. Nevertheless, the model still performs excellently on unseen data, and the gap between training and testing accuracy is small enough that overfitting is not a serious concern.
ROC & AUC
To create an ROC curve, we need our predictions as probabilities, which we obtain by setting type = "prob"; this gives the probability of each student being classified as failing or passing.
prob_rf <- predict(object = model_rf,
newdata = students_test,
type = "prob")
head(prob_rf)
Using the probabilities of the passed students, we can graph the true
positive rate and the false positive rate at different classification
thresholds to create an ROC curve. performance() function
is used to create the ROC curve.
roc_rf <- prediction(prob_rf[,2], students_test$status)
model_roc_rf <- performance(prediction.obj = roc_rf,
measure = "tpr", # True Positive Rate
x.measure = "fpr") # False Positive Rate
plot(model_roc_rf)
abline(0, 1, lty = 2)
We can get the AUC value from the model’s ROC object by using the measure = "auc" parameter.
model_auc_rf <- performance(roc_rf,
measure = "auc")
model_auc_rf@y.values[[1]]
#> [1] 0.9784226
Our Random Forest model got an AUC value of 0.9784226.
We’ve performed Exploratory Data Analysis on the students’ performance dataset by analyzing some of the factors that affect students’ final grade. Here is a summary of the analysis:
We’ve built our classification models using Naive Bayes, Decision Tree, and Random Forest. Based on the model evaluation, Random Forest is the best classification method for classifying students’ academic performance in the math course, with 0.9474 accuracy, 0.9783 precision, and 0.9784 AUC. Random Forest models tend to perform very well because they combine and aggregate the predictions of multiple decision trees. They are also efficient at handling datasets with many predictors: Random Forest automatically performs feature selection by evaluating the importance of each feature during training, which makes it a suitable model for the students’ performance dataset.
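As a possible follow-up, those importance scores could be inspected directly with caret’s varImp() function; a short sketch using the model_rf object trained above:
# Inspect which predictors the Random Forest relied on most
rf_importance <- varImp(model_rf)
rf_importance
plot(rf_importance, top = 10)   # the ten most influential predictors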
In conclusion, this project has explored students’ academic performance data to gain insights and successfully predicts students’ academic performance from various input factors. Students can use it to forecast their grade before the final exam and prepare themselves for the outcome, while teachers can use it to identify and supervise students who are likely to fail the course based on their data.