Introduction

Students’ academic performance is typically measured by the grades they achieve in a course’s exams. It is affected by several factors, including parental background, daily habits, environmental influences, and learning facilities. Based on the collected data, this project aims to predict whether students will pass the mathematics course. The prediction can benefit teachers, who can supervise and communicate with students about their performance in light of their background, daily habits, environment, and the facilities available to them for studying. Students can also use it to evaluate their academic performance given their current circumstances.

Import Packages

The packages used in this project are imported here.

library(dplyr)
library(lubridate)
library(e1071)
library(caret)
library(randomForest)
library(partykit)
library(ROCR)
library(ggplot2)

Load Data

The dataset comes from secondary education at two Portuguese schools. Its attributes include student grades in mathematics along with demographic, social, and school-related features, and it was collected using school reports and questionnaires.

Here, we read the dataset with the read.csv() function, convert string columns into factors with the stringsAsFactors=T parameter, and then use head() to view the first 6 rows (the default).

students <- read.csv("input_data/student-mat.csv", sep=';', stringsAsFactors=T)
head(students)
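As a minimal sketch of the loading step, the snippet below reads a few semicolon-separated rows from an inline string instead of the real file. The rows and the names csv_text and toy are hypothetical and only illustrate how stringsAsFactors and head() behave.

```r
# Hypothetical inline rows standing in for student-mat.csv (semicolon-separated)
csv_text <- "school;sex;age
GP;F;18
GP;M;17
MS;F;15
GP;F;15
MS;M;16
GP;M;16
GP;F;17"

toy <- read.csv(text = csv_text, sep = ";", stringsAsFactors = TRUE)

first_default <- head(toy)        # head() returns 6 rows unless n is given
first_five    <- head(toy, n = 5) # ask for 5 rows explicitly
```

String columns such as school become factors, while age stays an integer column.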

Here are the descriptions of the attributes in the math course students dataset:

  1. school - student’s school (‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira)
  2. sex - student’s gender (‘F’ - female or ‘M’ - male)
  3. age - student’s age
  4. address - student’s home address type (‘U’ - urban or ‘R’ - rural)
  5. famsize - family size (‘LE3’ - <=3 or ‘GT3’ - >3)
  6. Pstatus - parent’s cohabitation status (‘T’ - living together or ‘A’ - apart)
  7. Medu - mother’s education (0 - none, 1 - primary, 2 - 5-9th grade, 3 - secondary or 4 - higher education)
  8. Fedu - father’s education (0 - none, 1 - primary, 2 - 5-9th grade, 3 - secondary or 4 - higher education)
  9. Mjob - mother’s job (‘teacher’, ‘health’ care related, civil ‘services’, ‘at_home’ or ‘other’)
  10. Fjob - father’s job (‘teacher’, ‘health’ care related, civil ‘services’, ‘at_home’ or ‘other’)
  11. reason - reason to choose this school (close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
  12. guardian - student’s guardian (‘mother’, ‘father’ or ‘other’)
  13. traveltime - home to school travel time (1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  14. studytime - weekly study time (1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. failures - number of past class failures (documented as n if 1<=n<3, else 4; the values in this data range from 0 to 3)
  16. schoolsup - extra educational support (yes or no)
  17. famsup - family educational support (yes or no)
  18. paid - extra paid classes within the course subject (yes or no)
  19. activities - extra-curricular activities (yes or no)
  20. nursery - attended nursery school (yes or no)
  21. higher - wants to take higher education (yes or no)
  22. internet - Internet access at home (yes or no)
  23. romantic - with a romantic relationship (yes or no)
  24. famrel - quality of family relationships (from 1 - very bad to 5 - excellent)
  25. freetime - free time after school (from 1 - very low to 5 - very high)
  26. goout - going out with friends (from 1 - very low to 5 - very high)
  27. Dalc - workday alcohol consumption (from 1 - very low to 5 - very high)
  28. Walc - weekend alcohol consumption (from 1 - very low to 5 - very high)
  29. health - current health status (from 1 - very bad to 5 - very good)
  30. absences - number of school absences
  31. G1 - first period grade (from 0 to 20)
  32. G2 - second period grade (from 0 to 20)
  33. G3 - final grade (from 0 to 20)

Data Wrangling

The glimpse() function is used to view the data type of each column.

glimpse(students)
#> Rows: 395
#> Columns: 33
#> $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason     <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…

Data Structure

Several column types don’t match the data they hold: these columns are stored as integers but encode categories. We fix this with the mutate_at() function, converting them into factors with as.factor.

students <- students %>%
  mutate_at(vars(Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health), as.factor)
  
glimpse(students)
#> Rows: 395
#> Columns: 33
#> $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu       <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu       <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason     <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime  <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures   <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel     <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime   <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout      <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc       <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc       <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health     <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11, 16, 14,…

Now all the column types match the data.
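as.factor keeps the numeric codes as level names. As an optional variation (not what this project does), the codes can be given readable, ordered labels with factor(). The sketch below assumes the studytime coding from the attribute list and uses hypothetical values.

```r
# Hypothetical studytime codes (1 to 4, as in the attribute list)
studytime_codes <- c(2L, 3L, 1L, 4L, 2L)

# Attach labels and declare the order so that comparisons are meaningful
studytime_f <- factor(
  studytime_codes,
  levels  = 1:4,
  labels  = c("<2 hours", "2-5 hours", "5-10 hours", ">10 hours"),
  ordered = TRUE
)
```

With an ordered factor, plots and summaries show the labels directly, so names.arg does not have to be typed by hand for each chart.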

Data Outlier

An outlier is a data point that differs significantly from other observations. We need to remove outliers so that they don’t affect our machine learning model. When outliers are present, the model can become overly sensitive to these points, which can alter how the model identifies patterns in the data.

We will use boxplot() to check which numeric columns contain outliers.

students_num <- students %>% 
  select_if(is.numeric)

boxplot(students_num)

The absences column appears to contain outliers: judging by the upper whisker, most students are absent fewer than 20 times. If we keep the students with 20 or more absences, the characteristics of these outliers could affect how the model learns from the rest of the data.
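Rather than reading the whisker off the plot, the same quantities can be computed with boxplot.stats(), which applies the 1.5 x IQR rule that boxplot() draws. This is a minimal sketch on hypothetical absence counts, not the project's data.

```r
# Hypothetical absence counts
absences <- c(0, 0, 2, 4, 4, 6, 6, 8, 10, 12, 30, 54)

stats <- boxplot.stats(absences)

upper_whisker <- stats$stats[5]  # largest value that is not flagged as an outlier
outliers      <- stats$out       # values beyond 1.5 * IQR from the hinges
```

The value of upper_whisker is one data-driven candidate for the filter threshold; the cut-off of 20 used in this project is a rounded choice made by eye.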

We will use filter() to remove the outlier data.

students <- students %>% 
  filter(absences < 20)
  
summary(students)
#>  school   sex          age        address famsize   Pstatus Medu    Fedu   
#>  GP:330   F:194   Min.   :15.00   R: 85   GT3:266   A: 38   0:  3   0:  2  
#>  MS: 46   M:182   1st Qu.:16.00   U:291   LE3:110   T:338   1: 58   1: 79  
#>                   Median :17.00                             2: 99   2:111  
#>                   Mean   :16.66                             3: 92   3: 92  
#>                   3rd Qu.:18.00                             4:124   4: 92  
#>                   Max.   :22.00                                            
#>        Mjob           Fjob            reason      guardian   traveltime
#>  at_home : 57   at_home : 20   course    :143   father: 88   1:248     
#>  health  : 34   health  : 18   home      : 98   mother:261   2: 99     
#>  other   :132   other   :205   other     : 35   other : 27   3: 21     
#>  services: 97   services:105   reputation:100                4:  8     
#>  teacher : 56   teacher : 28                                           
#>                                                                        
#>  studytime failures schoolsup famsup     paid     activities nursery  
#>  1:100     0:302    no :327   no :147   no :205   no :182    no : 78  
#>  2:186     1: 42    yes: 49   yes:229   yes:171   yes:194    yes:298  
#>  3: 63     2: 16                                                      
#>  4: 27     3: 16                                                      
#>                                                                       
#>                                                                       
#>  higher    internet  romantic  famrel  freetime goout   Dalc    Walc    health 
#>  no : 18   no : 66   no :258   1:  8   1: 16    1: 21   1:263   1:144   1: 44  
#>  yes:358   yes:310   yes:118   2: 18   2: 59    2: 98   2: 70   2: 82   2: 41  
#>                                3: 63   3:152    3:126   3: 25   3: 77   3: 87  
#>                                4:184   4:111    4: 80   4:  9   4: 45   4: 65  
#>                                5:103   5: 38    5: 51   5:  9   5: 28   5:139  
#>                                                                                
#>     absences            G1             G2              G3       
#>  Min.   : 0.000   Min.   : 3.0   Min.   : 0.00   Min.   : 0.00  
#>  1st Qu.: 0.000   1st Qu.: 8.0   1st Qu.: 9.00   1st Qu.: 8.00  
#>  Median : 3.000   Median :11.0   Median :11.00   Median :11.00  
#>  Mean   : 4.439   Mean   :10.9   Mean   :10.72   Mean   :10.41  
#>  3rd Qu.: 7.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
#>  Max.   :19.000   Max.   :19.0   Max.   :19.00   Max.   :20.00

Missing Values

colSums(is.na()) is used to check whether there are missing values in the data.

colSums(is.na(students))
#>     school        sex        age    address    famsize    Pstatus       Medu 
#>          0          0          0          0          0          0          0 
#>       Fedu       Mjob       Fjob     reason   guardian traveltime  studytime 
#>          0          0          0          0          0          0          0 
#>   failures  schoolsup     famsup       paid activities    nursery     higher 
#>          0          0          0          0          0          0          0 
#>   internet   romantic     famrel   freetime      goout       Dalc       Walc 
#>          0          0          0          0          0          0          0 
#>     health   absences         G1         G2         G3 
#>          0          0          0          0          0

There are no missing values in any column.
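For comparison, here is how the same check looks when values are missing. The toy data frame below is hypothetical; it only shows what colSums(is.na()) reports and one common way to keep complete rows.

```r
# Hypothetical data frame with one missing value per column
toy <- data.frame(
  age    = c(15, NA, 17),
  G3     = c(10, 12, NA),
  school = c("GP", "MS", NA)
)

na_counts <- colSums(is.na(toy))        # number of NA values per column
complete  <- toy[complete.cases(toy), ] # keep only rows without any NA
```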


Exploratory Data Analysis

We will use exploratory data analysis to examine some factors that may affect students’ final grades, such as absences, age, study time, parents’ education, extra educational support, internet access, romantic relationships, alcohol consumption, and health condition.

Absences

Missing classes can hurt a student’s academic performance. When students miss class, they miss important lectures, discussions, exercises, and quizzes given by the teachers. Let’s compare how the students performed in the math class by using filter to split them into those who passed (G3 >= 10) and those who failed (G3 < 10).

To compare the distribution of absences between the two groups, we can use a boxplot.

pass_students <- students %>%
  filter(G3 >= 10)

fail_students <- students %>% 
  filter(G3 < 10)

boxplot(pass_students$absences, fail_students$absences,
        horizontal = T,
        names = c("Pass", "Fail"),
        xlab = "Number of Absences",
        col = '#3A9BDC')

Analysis

Based on the boxplot, although the students who passed have a higher median than the students who failed, their absences are less dispersed. Overall, the students who passed missed fewer classes than the students who failed.
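The median and spread read from a boxplot can also be computed directly. The sketch below uses hypothetical grades and absences, with the same pass rule (G3 >= 10), and summarises each group with tapply().

```r
# Hypothetical grades and absence counts
toy <- data.frame(
  G3       = c(12, 15, 8, 5, 10, 9, 14, 0),
  absences = c(4, 2, 0, 16, 6, 0, 3, 10)
)
toy$result <- ifelse(toy$G3 >= 10, "Pass", "Fail")

med <- tapply(toy$absences, toy$result, median) # centre of each group
iqr <- tapply(toy$absences, toy$result, IQR)    # spread of each group
```

Reporting both numbers side by side makes it clear whether two groups differ in their typical value, in their spread, or in both.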

Age

In a school, it’s possible for students of different ages to be in the same class. Age can influence how students interact with their friends and how much attention they pay to their academic performance. Let us look at the age distribution of the students in the math class by plotting a histogram with the hist function.

hist(students$age,
     breaks = 8,
     main = NULL,
     xlab = "Age",
     ylab = "Number of Students",
     ylim = c(0, 200),
     col = '#3A9BDC')

Analysis

In the histogram, the youngest students form the largest group and the counts drop as age increases. Some students are older than 19, which is unusual for a high school student. Let us see how well they performed in the math class by comparing the students who passed with those who failed, by age.

We group the students by age using group_by and count the number of students at each age. To make the data easier to interpret, the pass and fail counts are combined in the same dataframe and barplot() is used to create a bar chart.
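One caveat when reading a histogram of whole-number ages: by default hist() uses right-closed bins and includes the lowest value in the first bin, so the first bar can combine two ages. The sketch below, on hypothetical ages, compares the histogram counts with exact per-age counts from table().

```r
# Hypothetical ages
ages <- c(15, 15, 16, 16, 16, 17, 17, 18, 19, 22)

h     <- hist(ages, breaks = 8, plot = FALSE) # compute the bins without drawing
exact <- table(ages)                          # one count per distinct age

# Number of observations that fall between the first two break points
first_bin <- sum(ages >= h$breaks[1] & ages <= h$breaks[2])
```

If first_bin is larger than the count for the youngest age in exact, the first bar covers more than one age, and a bar chart of exact may be the safer display for discrete values.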

pass_age <- pass_students %>% 
  group_by(age) %>% 
  summarise(count = n()) %>% 
  # Add zero-count rows for ages 21 and 22 so the rows line up with fail_age
  rbind(list(age = 21, count = 0), list(age = 22, count = 0))

fail_age <- students %>% 
  filter(G3 < 10) %>%
  group_by(age) %>% 
  summarise(count = n())

plot_age <- pass_age %>%
  mutate(pass_age = count, fail_age = fail_age$count)

barplot(cbind(pass_age, fail_age) ~ age,
        data = plot_age,
        xlab = "Age",
        ylab = "Number of Students",
        col = c('#34A583', '#EA4335'),
        legend.text = TRUE,
        args.legend = list(legend = c("Pass", "Fail")),
        beside = T)
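The code above lines up the pass and fail counts by row position, which is why zero-count rows are added by hand. A possible alternative, sketched here on hypothetical data, is a two-way table(): it fills in zeros for missing age and result combinations automatically.

```r
# Hypothetical ages and final grades
toy <- data.frame(
  age = c(15, 15, 16, 16, 19, 21),
  G3  = c(12, 8, 14, 10, 6, 7)
)

# Fix the level order so that "Pass" is always the first row
toy$result <- factor(ifelse(toy$G3 >= 10, "Pass", "Fail"),
                     levels = c("Pass", "Fail"))

# Rows are results, columns are ages; absent combinations are counted as 0
counts <- table(toy$result, toy$age)
```

Passing counts to barplot() with beside = TRUE would draw the same kind of grouped bar chart without any manual alignment.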

Analysis

Based on the plot, more students aged 15 to 18 passed than failed, but the number of failing students rises slightly with age. Among students aged 19 and older, more failed than passed the math class. The plot therefore suggests that the older the students are, the more likely they are to fail the math class.

Study Time

Students have to set aside some time outside of school to study so they can learn and increase their knowledge. Self-study also helps them review the material so they are better prepared for tests.

We calculate the average final grade of the students based on how long they study each week, using the group_by() function and then aggregating the final grade with summarise_at().

plot_study <- students %>%
  group_by(studytime) %>% 
  summarise_at(vars(G3), mean)

plot_study

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

barplot(G3~studytime, 
        data = plot_study,
        names.arg = c('<2 Hours','2-5 Hours','5-10 Hours','>10 Hours'),
        xlab = "Study Time",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')

Analysis

Students who study less than 5 hours a week have an average final grade that barely reaches the passing grade of the math course (passing grade = 10), while students who study more than 5 hours have higher final grades on average. This suggests that study time is related to students’ performance.
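A group average is easier to judge when the group size is shown next to it, since a mean over a handful of students is less reliable. The following base R sketch, on hypothetical values, computes both per study-time level.

```r
# Hypothetical study-time codes and final grades
toy <- data.frame(
  studytime = factor(c(1, 1, 2, 2, 2, 3, 4)),
  G3        = c(8, 10, 11, 9, 13, 14, 16)
)

by_group <- data.frame(
  studytime = levels(toy$studytime),
  mean_G3   = as.vector(tapply(toy$G3, toy$studytime, mean)), # average grade
  n         = as.vector(table(toy$studytime))                 # students per level
)
```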

Parent’s Education

Parents are one of the most important factors in students’ education. Parents can also help their children study and deepen their knowledge of the course.

We are going to calculate the average final grade based on the mother’s (Medu) and father’s (Fedu) education using the group_by() function and then aggregating the final grade using summarise_at().

Note: Because very few students have parents with no education, that group is excluded from the analysis.

Mother’s Education

plot_medu <- students %>%
  group_by(Medu) %>% 
  summarise_at(vars(G3), mean)

plot_medu

Father’s Education

plot_fedu <- students %>%
  group_by(Fedu) %>% 
  summarise_at(vars(G3), mean)

plot_fedu

To make the data easier to interpret, the Medu and Fedu results are combined in the same dataframe and barplot() is used to create a bar chart.

plot_parent <- plot_fedu %>%
  mutate(fedu_G3 = G3, medu_G3 = plot_medu$G3)

barplot(cbind(fedu_G3, medu_G3) ~ Fedu,
        data = plot_parent[2:5,],
        names.arg = c('primary','5th to 9th grade','secondary', 'higher education'),
        xlab = "Parents Education",
        ylab = "Final Grade Average",
        ylim = c(0,14),
        col = c('#3A9BDC', 'violetred1'),
        legend.text = TRUE,
        args.legend = list(x = "topleft", legend = c("Father", "Mother")),
        beside = T)
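Combining the two summaries with mutate() relies on both tables listing the education levels in the same order. A key-based alternative is merge(), sketched below. The averages are made-up placeholder numbers, not results from this dataset, and the column name level is hypothetical.

```r
# Placeholder averages; the rows are deliberately in a different order
fedu <- data.frame(level = c(1, 2, 3, 4), fedu_G3 = c(9.5, 10.2, 10.6, 11.3))
medu <- data.frame(level = c(4, 2, 3, 1), medu_G3 = c(11.8, 9.9, 10.4, 8.7))

# Match rows on the education level instead of on row position
both <- merge(fedu, medu, by = "level")
```

Because rows are matched on level, the result stays correct even if one summary is missing a level or is sorted differently.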

Analysis

The higher the parents’ education, the higher the students’ average final grade. This suggests that parents’ education is associated with students’ academic performance. Parents with higher education may have more knowledge to pass on to their children and may be able to help when their children have learning problems.

Extra Educational Support

Students may receive extra educational support to help them study. This support can come from the school (schoolsup) and from their family (famsup).

We are going to calculate the average final grade based on the extra educational support they received using group_by() function and then aggregating the final grade using summarise_at().

plot_schoolsup <- students %>%
  group_by(schoolsup, famsup) %>% 
  summarise_at(vars(G3), mean)

plot_schoolsup

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

# height is passed as a plain vector here, so no data argument is needed
barplot(plot_schoolsup$G3, 
        names.arg = c("No Support", "Family Only", "School Only", "Family and School"),
        xlab = "Extra Educational Support",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')
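The bar labels above are typed by hand and depend on the order of the grouped rows. As a sketch of a safer variation, the labels can be built from the grouping columns themselves. The data below is hypothetical and the label format is only an illustration.

```r
# Hypothetical support flags and final grades
toy <- data.frame(
  schoolsup = c("no", "no", "yes", "yes", "no", "yes"),
  famsup    = c("no", "yes", "no", "yes", "no", "yes"),
  G3        = c(12, 10, 9, 8, 14, 10)
)

# One row per combination of the two support flags
avg <- aggregate(G3 ~ schoolsup + famsup, data = toy, FUN = mean)

# Build each label from the row's own keys so it cannot drift out of order
avg$label <- paste0("school=", avg$schoolsup, ", family=", avg$famsup)
```

Using avg$label as names.arg keeps every bar paired with the group it came from.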

Analysis

Students who received no extra educational support have the highest average grade. In contrast, students who received extra support from both family and school have the lowest average grade. This does not necessarily mean that support lowers grades; it may be that support is given mainly to students who are already struggling.

Having Internet

The internet is a powerful tool for learning. Using the internet, students can search for study materials that aren’t taught at school.

We are going to calculate the average final grade based on the students’ internet availability at home using group_by() function and then aggregating the final grade using summarise_at().

plot_internet <- students %>%
  group_by(internet) %>% 
  summarise_at(vars(G3), mean)

plot_internet

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

barplot(G3~internet,
        data = plot_internet,
        xlab = "Internet Availability",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')

Analysis

Students who have internet access at home have a higher average grade than students who don’t. The average final grade of students without internet is lower than the course’s passing grade. This suggests that having internet access may improve students’ academic performance.
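A difference between two group averages can be checked with a two-sample test before drawing conclusions. The sketch below runs Welch's t-test (the default of t.test()) on hypothetical grades; it is an optional extra step, not part of the original analysis, and it says nothing about cause.

```r
# Hypothetical final grades for students with and without internet at home
toy <- data.frame(
  internet = rep(c("yes", "no"), each = 5),
  G3       = c(12, 14, 10, 15, 11, 9, 8, 11, 10, 7)
)

res <- t.test(G3 ~ internet, data = toy) # Welch two-sample t-test

group_means <- res$estimate # mean of each group, "no" first (alphabetical order)
p_value     <- res$p.value  # small values indicate a difference unlikely by chance
```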

Having Romantic Partner

Having a romantic relationship can also affect students’ academic performance. Students with a romantic partner tend to spend time with their partner.

We are going to calculate the average final grade based on the students’ relationship status using group_by() function and then aggregating the final grade using summarise_at().

plot_romantic <- students %>%
  group_by(romantic) %>% 
  summarise_at(vars(G3), mean)

plot_romantic

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

barplot(G3~romantic,
        data = plot_romantic,
        xlab = "Romantic Partner",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')

Analysis

Students without a romantic partner have a higher average final grade than students with one. The average final grade of students with a romantic partner is lower than the course’s passing grade. It seems that students who have a romantic partner tend to focus on their loved one more than on their grades.

Alcohol Consumption

Alcohol is commonly consumed by students as a means of relaxing and socializing, but excessive drinking can lead to negative consequences such as poor academic performance, health problems, risky behaviors, and even alcohol addiction.

We are going to calculate the average final grade based on how much alcohol the students consume on weekdays using group_by() function and then aggregating the final grade using summarise_at().

plot_alcohol <- students %>%
  group_by(Dalc) %>% 
  summarise_at(vars(G3), mean)

plot_alcohol

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

barplot(G3~Dalc,
        data = plot_alcohol,
        names.arg = c("Very Low", "Low", "Medium", "High", "Very High"),
        xlab = "Alcohol Consumption",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')

Analysis

Students with very low alcohol consumption have the highest average final grade, but the difference from students with very high alcohol consumption is small. Because the average final grade varies without a clear trend across consumption levels, alcohol consumption does not appear to have a strong impact on students’ performance.
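Whether grades differ across several consumption levels can be tested rather than judged from the bars alone. One option is the Kruskal-Wallis test, which does not assume normally distributed grades. The data below is hypothetical and uses only three levels to keep the sketch short.

```r
# Hypothetical consumption levels (1 to 3) and final grades
toy <- data.frame(
  Dalc = factor(rep(1:3, each = 4)),
  G3   = c(12, 10, 14, 11, 9, 13, 10, 12, 8, 11, 10, 9)
)

kw <- kruskal.test(G3 ~ Dalc, data = toy) # rank-based test across all levels

p_value <- kw$p.value   # a large p-value gives no evidence of a difference
df      <- kw$parameter # degrees of freedom: number of levels minus 1
```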

Health Condition

Students’ health can have a significant impact on their academic performance. When students are not feeling well, they may struggle to concentrate and stay focused on their studies. Illness can also cause students to miss classes, assignments, and tests.

We are going to calculate the average final grade based on how healthy the students are using group_by() function and then aggregating the final grade using summarise_at().

plot_health <- students %>%
  group_by(health) %>% 
  summarise_at(vars(G3), mean)

plot_health

To visualize the data for easier interpretation, barplot() is used to create a bar plot.

barplot(G3~health,
        data = plot_health,
        names.arg = c("Very Bad", "Bad", "Medium", "Good", "Very Good"),
        xlab = "Health",
        ylab = "Final Grade Average",
        ylim = c(0,12),
        col = '#3A9BDC')

Analysis

Students with very bad health appear to have the highest average final grade of all the health conditions. For the other health conditions, the better the health, the better the average grade. One possible explanation is that students in poor health are more motivated to achieve higher grades than healthier students.

Students with Very Bad Health Analysis

We will look further at the students with very bad health, who unexpectedly have the highest average grades compared with the healthy students. First, we use filter to split off the students with very bad health (health == 1) and the healthy students (health of 3 or above).
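Because health was converted to a factor earlier, a comparison such as health >= 3 is not meaningful on it, which is why the healthy group has to be selected by listing the levels. The sketch below, on hypothetical values, shows two equivalent ways to express "3 or above" for a factor.

```r
# Hypothetical health ratings stored as a factor
health <- factor(c(1, 3, 5, 2, 4))

# Option 1: list the wanted levels explicitly
keep_in <- health %in% c(3, 4, 5)

# Option 2: convert the level labels back to numbers, then compare
health_num <- as.integer(as.character(health))
keep_num   <- health_num >= 3
```

Converting through as.character() matters: as.integer() applied directly to a factor returns the internal level positions, which only happen to equal the labels when the levels are exactly 1, 2, 3 and so on.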

sick_students <- students %>%
  filter(health == 1)

healthy_students <- students %>%
  filter(health %in% c(3, 4, 5))

summary(sick_students)
#>  school  sex         age        address famsize  Pstatus Medu   Fedu  
#>  GP:37   F:28   Min.   :15.00   R:10    GT3:29   A: 8    0: 0   0: 1  
#>  MS: 7   M:16   1st Qu.:15.00   U:34    LE3:15   T:36    1: 8   1: 8  
#>                 Median :16.00                            2: 5   2:13  
#>                 Mean   :16.50                            3:13   3:11  
#>                 3rd Qu.:17.25                            4:18   4:11  
#>                 Max.   :22.00                                         
#>        Mjob          Fjob           reason     guardian  traveltime studytime
#>  at_home : 8   at_home : 4   course    :14   father: 5   1:30       1:11     
#>  health  : 2   health  : 1   home      :10   mother:39   2:11       2:26     
#>  other   :17   other   :23   other     : 5   other : 0   3: 2       3: 3     
#>  services: 9   services:14   reputation:15               4: 1       4: 4     
#>  teacher : 8   teacher : 2                                                   
#>                                                                              
#>  failures schoolsup famsup    paid    activities nursery  higher   internet
#>  0:41     no :38    no :19   no :17   no :21     no : 6   no : 1   no : 4  
#>  1: 2     yes: 6    yes:25   yes:27   yes:23     yes:38   yes:43   yes:40  
#>  2: 0                                                                      
#>  3: 1                                                                      
#>                                                                            
#>                                                                            
#>  romantic famrel freetime goout  Dalc   Walc   health    absences     
#>  no :33   1: 4   1: 3     1: 4   1:33   1:20   1:44   Min.   : 0.000  
#>  yes:11   2: 2   2: 8     2: 9   2: 6   2:10   2: 0   1st Qu.: 0.000  
#>           3: 6   3:14     3:12   3: 3   3: 8   3: 0   Median : 3.000  
#>           4:18   4:14     4:14   4: 1   4: 4   4: 0   Mean   : 4.477  
#>           5:14   5: 5     5: 5   5: 1   5: 2   5: 0   3rd Qu.: 6.250  
#>                                                       Max.   :17.000  
#>        G1              G2             G3       
#>  Min.   : 6.00   Min.   : 5.0   Min.   : 0.00  
#>  1st Qu.: 9.75   1st Qu.:10.0   1st Qu.:10.00  
#>  Median :11.00   Median :13.0   Median :13.00  
#>  Mean   :11.86   Mean   :12.3   Mean   :12.16  
#>  3rd Qu.:15.00   3rd Qu.:15.0   3rd Qu.:15.00  
#>  Max.   :19.00   Max.   :18.0   Max.   :19.00

Using summary, we can see the distribution of each column. Let’s check a few features of the sick students and compare them with the healthy students.

Absences

Students with bad health tend to miss school because of illness and doctor visits. We would therefore expect sick students to be absent more often than healthy students. We use a boxplot to compare the absences of the two groups.

boxplot(healthy_students$absences, sick_students$absences,
        horizontal = T,
        names = c("Healthy", "Sick"),
        xlab = "Number of Absences",
        col = '#3A9BDC')

Analysis : Although the sick students have a higher median than the healthy ones, some healthy students miss school more often than they do. These frequently absent students could be one reason the healthy students have a lower average grade.

Study Time

We would like to check if students with bad health tend to study more than the healthy students. We are comparing the proportion of each study time from the sick and healthy students using prop.table and visualizing it using barplot to create a bar chart.

studyhealth_prop <- as.data.frame(list(healthy = prop.table(table(healthy_students$studytime)), 
                                       sick = prop.table(table(sick_students$studytime))))

barplot(cbind(healthy.Freq, sick.Freq) ~ sick.Var1,
        data = studyhealth_prop,
        names.arg = c('<2 Hours','2-5 Hours','5-10 Hours','>10 Hours'),
        xlab = "Study Time",
        ylim = c(0,0.6),
        col = c('#3A9BDC', 'skyblue'),
        beside = T,
        legend.text = T,
        args.legend = list(legend = c("Healthy", "Sick")))

Analysis : Most of the sick and the healthy students studied for 2-5 hours. A larger share of healthy students studied for 5-10 hours compared with the sick students, but a smaller share studied for more than 10 hours.

Conclusion

Comparing the healthy and sick students yields little information because the two groups are so similar.
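Proportions are used here because the two groups differ in size, so raw counts would not be comparable. A compact way to line up the two sets of proportions is a matrix with one row per group, sketched below on hypothetical study-time codes. Fixing the factor levels keeps a column for every study-time level even when a group has no students in it.

```r
# Hypothetical study-time codes for each group, sharing the same four levels
healthy <- factor(c(1, 2, 2, 3, 2, 4, 1, 2), levels = 1:4)
sick    <- factor(c(2, 2, 1, 4), levels = 1:4)

# c() turns each one-dimensional table into a named numeric vector
props <- rbind(
  Healthy = c(prop.table(table(healthy))),
  Sick    = c(prop.table(table(sick)))
)
```

Each row of props sums to 1, and the matrix could be passed to barplot() with beside = TRUE to draw the grouped bars.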


Data Preprocessing

This project aims to predict whether students pass the mathematics course. In Portugal’s curriculum, grades are given on a scale from 0 to 20, with a minimum passing grade of 10. Therefore, to pass the course, a student needs a final grade of at least 10.

We are going to create a target variable status using mutate() function based on the students’ final grade on the math course.
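Before applying this to the full data, the boundary case can be checked on a few hypothetical grades. The sketch below uses the same rule (G3 >= 10 is a pass) and sets the factor levels explicitly, which is an optional choice that makes the level order visible.

```r
# Hypothetical final grades, including the boundary value 10
G3 <- c(0, 9, 10, 11, 20)

status <- factor(ifelse(G3 >= 10, "Pass", "Fail"),
                 levels = c("Fail", "Pass"))

class_share <- prop.table(table(status)) # share of each class
```

Looking at class_share on the real data is a quick way to see how balanced the two classes are before modelling.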

students <- students %>%
  mutate(status = ifelse(G3 >= 10, "Pass", "Fail")) %>%
  mutate(status = as.factor(status)) %>% 
  select(-c(G3))
  
glimpse(students)
#> Rows: 376
#> Columns: 33
#> $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,…
#> $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F, M, M,…
#> $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, 15,…
#> $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U,…
#> $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, GT3, GT3,…
#> $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, T, T, T,…
#> $ Medu       <fct> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3, 4,…
#> $ Fedu       <fct> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2, 3,…
#> $ Mjob       <fct> at_home, at_home, at_home, health, other, services, other, …
#> $ Fjob       <fct> teacher, other, other, services, other, other, other, teach…
#> $ reason     <fct> course, course, other, home, home, reputation, home, home, …
#> $ guardian   <fct> mother, father, mother, mother, father, mother, mother, mot…
#> $ traveltime <fct> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1, 1,…
#> $ studytime  <fct> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1, 1,…
#> $ failures   <fct> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,…
#> $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, no, no, …
#> $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes, yes, ye…
#> $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, no, yes,…
#> $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes, yes, no…
#> $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, …
#> $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,…
#> $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes, yes, ye…
#> $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no, no, ye…
#> $ famrel     <fct> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5, 3,…
#> $ freetime   <fct> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5, 1,…
#> $ goout      <fct> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5, 3,…
#> $ Dalc       <fct> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1,…
#> $ Walc       <fct> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4, 3,…
#> $ health     <fct> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5, 5,…
#> $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6, 4, 16,…
#> $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, 14, 14, …
#> $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, 16, 14, …
#> $ status     <fct> Fail, Fail, Pass, Pass, Pass, Pass, Pass, Fail, Pass, Pass,…

Cross Validation

Cross validation is a method in machine learning used to evaluate the performance of a model on new, unseen data. Here it is done by randomly splitting the dataset into a training and a testing dataset (a holdout split). The training dataset is used to train the model, and the model then predicts the testing dataset so we can evaluate its performance.

The sample() function is used to take a random sample of a specified size from the data. We are going to use 80% of the data for the training dataset and the remaining 20% for the test dataset.

set.seed(1)
RNGkind(sample.kind = 'Rounding')

# Random Index Sampling
index <- sample(x = nrow(students),
                size = 0.8*nrow(students))

# Train-Test Split
students_train <- students[index,]
students_test <- students[-index,] 

Data Balancing

We are going to check how balanced our target variable is by using prop.table().

prop.table(table(students_train$status))
#> 
#> Fail Pass 
#> 0.31 0.69

Our data is somewhat imbalanced: about 69% of the students in the training data passed the math course and only 31% failed. We will upsample the training data using upSample() to balance the classes.

students_train <- upSample(x = students_train %>% select(-status), 
                           y = students_train$status,
                           yname = "status")

prop.table(table(students_train$status))
#> 
#> Fail Pass 
#>  0.5  0.5

The proportion of students that failed and passed the course is now balanced.
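
To illustrate what upsampling does, here is a minimal base R sketch on made-up data. The toy data frame and the upsample_minority() helper are hypothetical and only mimic the idea of resampling the smaller class with replacement; caret's upSample() may differ in its details.

```r
# Toy data (made up): 7 "Pass" rows and 3 "Fail" rows
toy <- data.frame(
  G2     = c(12, 14, 11, 15, 13, 10, 16, 6, 8, 7),
  status = factor(c(rep("Pass", 7), rep("Fail", 3)))
)

# Hypothetical helper: resample every smaller class with replacement
# until it has as many rows as the largest class
upsample_minority <- function(df, target) {
  counts <- table(df[[target]])
  n_max  <- max(counts)
  parts  <- lapply(names(counts), function(cls) {
    rows  <- which(df[[target]] == cls)
    extra <- rows[sample.int(length(rows),
                             size = n_max - length(rows),
                             replace = TRUE)]
    df[c(rows, extra), , drop = FALSE]
  })
  do.call(rbind, parts)
}

set.seed(1)
toy_balanced <- upsample_minority(toy, "status")
table(toy_balanced$status)
```

Only the training rows are resampled in this project, so the test data keeps its original class proportions.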


Modelling

Naive Bayes

Naive Bayes is a machine learning model based on Bayes’ theorem. Bayes’ theorem describes the probability of an event given conditions that might be related to it, where additional information updates the initial probability.

This is the Bayes Theorem formula:

\[P(A|B) = \frac{P(B|A) * P(A)}{P(B)}\]

Where :

  • P(A|B) = Probability of A given that B has occurred
  • P(B|A) = Probability of B given that A has occurred
  • P(A) = Probability of A
  • P(B) = Probability of B
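
As a small worked example of the formula, the sketch below takes A to be "the student passes" and B to be some observed condition. The prior uses the class proportions printed for the training split before balancing; the two likelihoods are made-up numbers chosen only for illustration.

```r
# Prior: class proportions of the training split before balancing
p_pass <- 0.69
p_fail <- 0.31

# Hypothetical likelihoods (made up for illustration only)
p_b_given_pass <- 0.80   # P(B | Pass)
p_b_given_fail <- 0.40   # P(B | Fail)

# P(B) by the law of total probability
p_b <- p_b_given_pass * p_pass + p_b_given_fail * p_fail

# Bayes theorem: P(Pass | B) = P(B | Pass) * P(Pass) / P(B)
p_pass_given_b <- p_b_given_pass * p_pass / p_b
p_pass_given_b
```

Because B is more likely among passing students in this example, observing B raises the probability of passing above the prior of 0.69.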

Note :

  • Naive Bayes assumes that all features of the dataset are equally important and independent of each other given the class.
  • Data scarcity in a predictor could cause the prediction to have zero probability: if a category never appears together with a class in the training data, its conditional probability is zero. Applying Laplace Smoothing prevents this by adding a small number to every category count.

We are going to build our Naive Bayes model using the naiveBayes() function to train on the students_train data. The laplace = 1 parameter is added to apply Laplace smoothing with 1 as the value.
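
The effect of Laplace smoothing can be sketched with a few made-up counts. The counts vector below is hypothetical, and the smoothing formula shown is the common textbook form (add the smoothing value to every count and the number of levels times that value to the denominator); it is meant as an illustration rather than a description of any package's internals.

```r
# Made-up counts of a categorical predictor within one class:
# the level "3" never occurs among these observations
counts <- c("0" = 12, "1" = 5, "2" = 3, "3" = 0)

# Unsmoothed conditional probabilities: level "3" gets probability 0,
# which would zero out the whole product of likelihoods
p_raw <- counts / sum(counts)

# Laplace smoothing with value 1: add 1 to every count and
# the number of levels to the denominator
laplace  <- 1
p_smooth <- (counts + laplace) / (sum(counts) + laplace * length(counts))

round(rbind(p_raw, p_smooth), 3)
```

After smoothing, every level has a small non-zero probability and the probabilities still sum to 1.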

model_bayes <- naiveBayes(formula = status~., 
                          data = students_train,
                          laplace = 1)

Decision Tree

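Decision tree is a machine learning model that repeatedly splits the data into branches, choosing at each step the predictor and split point that make the target variable in the resulting groups as homogeneous as possible. Classic decision trees measure this with information gain. The ctree() function used here builds a conditional inference tree, which instead selects the split variable that has the strongest statistically significant association with the target.

We are going to build our Decision Tree model using the ctree() function to train on the students_train data and plot the decision tree model using plot().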

model_dtree <- ctree(formula = status~.,
                     data = students_train)

plot(model_dtree, type = "simple")

The first node is called the root node. It splits on the value of G2, which the algorithm selected as the strongest predictor for the first split. Each branch then splits again, creating new branches with new sets of rules, until it reaches a terminal node that contains information about the target variable. Each terminal node consists of :

  • Predicted class (Fail / Pass) of the observations
  • Number of observations in the node
  • Error percentage of the observations

Random Forest

Random forest is a machine learning model that combines the output of multiple decision trees to reach a single result. Random Forest utilises the Bagging (Bootstrap and Aggregation) method.

  • Bootstrap Sampling is done by sampling the data with replacement and creating a decision tree for each bootstrap sample. Each decision tree makes predictions for new observations.
  • Aggregation is done by counting the predictions made by all decision trees. The classification result is the class that receives the most votes.
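
The two steps above can be sketched in base R. This is a simplified illustration, not the randomForest algorithm: the toy data is made up, and the hypothetical fit_stump() and predict_stump() helpers stand in for full decision trees. A real random forest also considers only a random subset of predictors at each split, which this sketch omits.

```r
set.seed(1)

# Toy data (made up): one numeric predictor and a two-class target
toy <- data.frame(
  G2     = c(5, 6, 7, 8, 9, 11, 12, 13, 14, 15),
  status = factor(c(rep("Fail", 5), rep("Pass", 5)))
)

# Hypothetical weak learner standing in for a decision tree:
# a one-split "stump" whose threshold is the midpoint of the class means
fit_stump <- function(df) {
  mean(tapply(df$G2, df$status, mean), na.rm = TRUE)
}
predict_stump <- function(threshold, x) {
  ifelse(x >= threshold, "Pass", "Fail")
}

# Bootstrap: draw rows with replacement and fit one stump per sample
n_trees <- 25
thresholds <- sapply(seq_len(n_trees), function(i) {
  boot_rows <- sample(nrow(toy), replace = TRUE)
  fit_stump(toy[boot_rows, ])
})

# Aggregation: every stump votes and the majority class wins
new_G2 <- c(4, 10.5, 16)
votes  <- sapply(thresholds, predict_stump, x = new_G2)  # rows = observations
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))
bagged
```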

Before building our model, we are going to use the K-fold Cross-Validation method to evaluate our model. In this method, the training dataset is divided into k subsets (folds). In each round, one of the k subsets is used as the validation set and the remaining subsets are used as the training set. The process is repeated until every subset has served as the validation set once. The accuracies of all folds are then averaged to give us the final accuracy of the model.

Here, we are going to use the trainControl() function to apply cross validation to our Random Forest training process. The repeatedcv method is used with 5 folds, and the whole process is repeated 3 times.
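
The fold logic described above can be written out by hand. The sketch below is a base R illustration on made-up data with a hypothetical threshold rule standing in for the model; it is not what caret runs internally, only the same idea in a few lines.

```r
set.seed(1)

# Toy data (made up): status is "Pass" when G2 is at least 10
toy <- data.frame(G2 = rep(c(4, 6, 8, 9, 10, 11, 12, 14, 15, 17), 2))
toy$status <- ifelse(toy$G2 >= 10, "Pass", "Fail")

k <- 5
# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = nrow(toy)))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]   # k - 1 folds train the model
  valid <- toy[fold_id == i, ]   # fold i is held out for validation

  # Hypothetical stand-in model: split at the midpoint of the class means
  threshold <- mean(tapply(train$G2, train$status, mean))
  predicted <- ifelse(valid$G2 >= threshold, "Pass", "Fail")

  mean(predicted == valid$status)
})

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy: the average over folds
```

Repeated cross-validation runs this whole procedure several times with different random fold assignments and averages all results.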

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # Number of Folds
                     repeats = 3) # Number of Repetition

We are going to build our Random Forest model using the train() function to train on the students_train data. The rf method is selected to use the Random Forest algorithm, and the trControl parameter is used to apply the k-fold cross validation.

set.seed(1)
RNGkind(sample.kind = 'Rounding')

model_rf <- train(status ~ .,
                  data = students_train,
                  method = "rf",
                  trControl = ctrl)

model_rf
#> Random Forest 
#> 
#> 414 samples
#>  32 predictor
#>   2 classes: 'Fail', 'Pass' 
#> 
#> No pre-processing
#> Resampling: Cross-Validated (5 fold, repeated 3 times) 
#> Summary of sample sizes: 332, 331, 332, 330, 331, 331, ... 
#> Resampling results across tuning parameters:
#> 
#>   mtry  Accuracy   Kappa    
#>    2    0.9257973  0.8516120
#>   36    0.9443017  0.8886158
#>   71    0.9475736  0.8951632
#> 
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 71.

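From the model summary, we can see that the selected model uses mtry = 71, which is the number of predictors randomly considered as candidates at each split, and has an accuracy of 0.9476 from cross-validation. mtry can exceed the 32 original predictors because the formula interface expands each factor into dummy variables.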
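
To see how mtry values larger than the 32 original predictors can appear, the sketch below shows how factor columns expand into dummy columns. It uses base R's model.matrix() on a made-up data frame as an illustration of the general mechanism; the exact column count in the real model depends on the levels of every factor in the data.

```r
# Toy data (made up): two factor predictors and one numeric predictor
toy <- data.frame(
  Medu = factor(c(0, 1, 2, 3, 4)),
  sex  = factor(c("F", "M", "F", "M", "F")),
  age  = c(15, 16, 17, 18, 15)
)

# model.matrix() turns each factor into (number of levels - 1) dummy columns
expanded <- model.matrix(~ ., data = toy)[, -1]  # drop the intercept column

ncol(toy)       # 3 original predictors
ncol(expanded)  # 6 columns: 4 for Medu, 1 for sex, 1 for age
colnames(expanded)
```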


Model Evaluation

There are 2 model evaluation methods that we are going to use to evaluate our models’ performance: Confusion Matrix and ROC-AUC.

1. Confusion Matrix

Confusion Matrix is a table that compares the predicted classes with the actual classification labels. It allows us to calculate the metrics that we are going to use to evaluate the model:

  • Accuracy : The proportion of all observations that are correctly predicted.
  • Precision/Pos Pred Value : The proportion of observations predicted as positive that are actually positive. (lowered by False Positives)
  • Recall/Sensitivity : The proportion of observations from the positive class that are correctly predicted. (lowered by False Negatives)
  • Specificity : The proportion of observations from the negative class that are correctly predicted.

Based on our data:

  • False Positive means a student that failed the math course being predicted as passed
  • False Negative means a student that passed the math course being predicted as failed

Therefore, we should prioritize minimizing False Positives so that we don’t miss students who are going to fail, which makes Precision our priority metric.
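
As a check on these definitions, the metrics can be computed by hand from the four cells of a confusion matrix. The counts below are taken from the Naive Bayes test confusion matrix reported in this document, with "Pass" as the positive class; the variable names are only illustrative.

```r
# Counts from the Naive Bayes test confusion matrix, positive class "Pass"
tp <- 40  # predicted Pass, actually Pass
fp <- 2   # predicted Pass, actually Fail
fn <- 8   # predicted Fail, actually Pass
tn <- 26  # predicted Fail, actually Fail

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
precision   <- tp / (tp + fp)   # Pos Pred Value
recall      <- tp / (tp + fn)   # Sensitivity
specificity <- tn / (tn + fp)

round(c(accuracy = accuracy, precision = precision,
        recall = recall, specificity = specificity), 4)
```

These values match the Accuracy, Pos Pred Value, Sensitivity, and Specificity lines printed by confusionMatrix() for that model.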

2. ROC-AUC

ROC (Receiver Operating Characteristic) is a graph representing the performance of a classification model based on True Positive Rate (TPR) and False Positive Rate (FPR). The ROC curve plots TPR against FPR at different classification thresholds. A good model has a high True Positive Rate and a low False Positive Rate. To measure how good an ROC curve is, we can use the AUC value. AUC (Area Under the ROC Curve) measures the entire area underneath the ROC curve and ranges from 0 to 1. The higher the value of AUC, the better the model is.
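
To make the construction of the curve concrete, here is a minimal base R sketch that builds the ROC points and the AUC by hand. The probabilities and labels are made up for illustration, and the code assumes there are no tied probabilities; the project itself uses the ROCR package for this.

```r
# Made-up predicted probabilities of "Pass" and the true labels
prob_pass <- c(0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10)
actual    <- c("Pass", "Pass", "Fail", "Pass", "Pass", "Fail", "Pass", "Fail")

# Sweep the threshold from high to low: each observation newly predicted
# as "Pass" moves the curve up (true positive) or right (false positive)
ord     <- order(prob_pass, decreasing = TRUE)
is_pass <- actual[ord] == "Pass"
tpr <- c(0, cumsum(is_pass) / sum(is_pass))
fpr <- c(0, cumsum(!is_pass) / sum(!is_pass))

# AUC by the trapezoidal rule over the (FPR, TPR) points
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

The same value can be read as the share of (Pass, Fail) pairs in which the passing student received the higher predicted probability, here 11 out of 15 pairs.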

Naive Bayes Evaluation

Confusion Matrix

We are going to evaluate our Naive Bayes model’s performance by predicting both the training data (students_train) and the unseen data (students_test), then summarizing the results in a confusion matrix. The predict() function returns the class that the model assigns to each observation, and the confusionMatrix() function creates the confusion matrix from those predictions.

#Training Evaluation
predict_bayes <- predict(model_bayes, students_train)

confusionMatrix(predict_bayes, 
                students_train$status, 
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail  181   31
#>       Pass   26  176
#>                                              
#>                Accuracy : 0.8623             
#>                  95% CI : (0.8253, 0.894)    
#>     No Information Rate : 0.5                
#>     P-Value [Acc > NIR] : <0.0000000000000002
#>                                              
#>                   Kappa : 0.7246             
#>                                              
#>  Mcnemar's Test P-Value : 0.5962             
#>                                              
#>             Sensitivity : 0.8502             
#>             Specificity : 0.8744             
#>          Pos Pred Value : 0.8713             
#>          Neg Pred Value : 0.8538             
#>              Prevalence : 0.5000             
#>          Detection Rate : 0.4251             
#>    Detection Prevalence : 0.4879             
#>       Balanced Accuracy : 0.8623             
#>                                              
#>        'Positive' Class : Pass               
#> 

Our Naive-Bayes model got 0.8623 train accuracy and 0.8713 train precision.

#Testing Evaluation
predict_bayes <- predict(model_bayes, students_test)

confusionMatrix(predict_bayes, 
                students_test$status, 
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail   26    8
#>       Pass    2   40
#>                                           
#>                Accuracy : 0.8684          
#>                  95% CI : (0.7713, 0.9351)
#>     No Information Rate : 0.6316          
#>     P-Value [Acc > NIR] : 0.000003932     
#>                                           
#>                   Kappa : 0.7293          
#>                                           
#>  Mcnemar's Test P-Value : 0.1138          
#>                                           
#>             Sensitivity : 0.8333          
#>             Specificity : 0.9286          
#>          Pos Pred Value : 0.9524          
#>          Neg Pred Value : 0.7647          
#>              Prevalence : 0.6316          
#>          Detection Rate : 0.5263          
#>    Detection Prevalence : 0.5526          
#>       Balanced Accuracy : 0.8810          
#>                                           
#>        'Positive' Class : Pass            
#> 

Our Naive-Bayes model got 0.8684 test accuracy and 0.9524 test precision. The precision of the model is very high because 40 out of the 42 students predicted to pass actually passed.

Both the training and testing evaluations show decent performance, and the model does not overfit because there’s not much difference between their accuracies.

ROC & AUC

To create an ROC curve, we need our predictions as probabilities, which we get by changing the type to "raw". This lets us see the probability of each student being predicted as a failed or a passed student.

prob_bayes <- predict(object = model_bayes, 
                      newdata = students_test, 
                      type = "raw")

head(prob_bayes)
#>                 Fail         Pass
#> [1,] 0.9961522575929 0.0038477424
#> [2,] 0.9991555701996 0.0008444298
#> [3,] 0.0000007558935 0.9999992441
#> [4,] 0.1612843812754 0.8387156187
#> [5,] 0.0000188164298 0.9999811836
#> [6,] 0.0000052712139 0.9999947288

Using the predicted probabilities of the Pass class, we can graph the true positive rate and the false positive rate at different classification thresholds to create an ROC curve. The prediction() function pairs the probabilities with the actual labels, and the performance() function is used to create the ROC curve.

roc_bayes <- prediction(prob_bayes[,2], students_test$status)

model_roc_bayes <- performance(prediction.obj = roc_bayes, 
                               measure = "tpr", # True Positive Rate
                               x.measure = "fpr") # False Positive Rate
                        

plot(model_roc_bayes)
abline(0,1 , lty = 2)

We can get the AUC value from the model’s ROC by using the measure = "auc" parameter.

model_auc_bayes <- performance(roc_bayes,
                               measure = "auc")

model_auc_bayes@y.values[[1]]
#> [1] 0.953125

Our Naive-Bayes model got an AUC value of 0.953125.

Decision Tree Evaluation

Confusion Matrix

We are going to evaluate our Decision Tree model’s performance by predicting both the training data (students_train) and the unseen data (students_test), then summarizing the results in a confusion matrix. The predict() function returns the class that the model assigns to each observation, and the confusionMatrix() function creates the confusion matrix from those predictions.

#Training Evaluation
predict_dtree <- predict(object = model_dtree,
                         newdata = students_train) 

confusionMatrix(data = predict_dtree,
                reference = students_train$status,
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail  199   23
#>       Pass    8  184
#>                                               
#>                Accuracy : 0.9251              
#>                  95% CI : (0.8954, 0.9486)    
#>     No Information Rate : 0.5                 
#>     P-Value [Acc > NIR] : < 0.0000000000000002
#>                                               
#>                   Kappa : 0.8502              
#>                                               
#>  Mcnemar's Test P-Value : 0.01192             
#>                                               
#>             Sensitivity : 0.8889              
#>             Specificity : 0.9614              
#>          Pos Pred Value : 0.9583              
#>          Neg Pred Value : 0.8964              
#>              Prevalence : 0.5000              
#>          Detection Rate : 0.4444              
#>    Detection Prevalence : 0.4638              
#>       Balanced Accuracy : 0.9251              
#>                                               
#>        'Positive' Class : Pass                
#> 

Our Decision Tree model got 0.9251 train accuracy and 0.9583 train precision.

#Testing Evaluation
predict_dtree <- predict(object = model_dtree,
                         newdata = students_test) 

confusionMatrix(data = predict_dtree,
                reference = students_test$status,
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail   27    6
#>       Pass    1   42
#>                                           
#>                Accuracy : 0.9079          
#>                  95% CI : (0.8194, 0.9622)
#>     No Information Rate : 0.6316          
#>     P-Value [Acc > NIR] : 0.00000004098   
#>                                           
#>                   Kappa : 0.8092          
#>                                           
#>  Mcnemar's Test P-Value : 0.1306          
#>                                           
#>             Sensitivity : 0.8750          
#>             Specificity : 0.9643          
#>          Pos Pred Value : 0.9767          
#>          Neg Pred Value : 0.8182          
#>              Prevalence : 0.6316          
#>          Detection Rate : 0.5526          
#>    Detection Prevalence : 0.5658          
#>       Balanced Accuracy : 0.9196          
#>                                           
#>        'Positive' Class : Pass            
#> 

Our Decision Tree model got 0.9079 test accuracy and 0.9767 test precision. The precision of the model is very high because 42 out of the 43 students predicted to pass actually passed.

Both the training and testing evaluations show very good performance, and the model does not overfit because there’s not much difference between their accuracies.

ROC & AUC

To create an ROC curve, we need our predictions as probabilities, which we get by changing the type to "prob". This lets us see the probability of each student being predicted as a failed or a passed student.

prob_dtree <- predict(object = model_dtree, 
                      newdata = students_test, 
                      type = "prob")

head(prob_dtree)
#>    Fail Pass
#> 1  0.85 0.15
#> 2  1.00 0.00
#> 6  0.00 1.00
#> 11 1.00 0.00
#> 15 0.00 1.00
#> 16 0.00 1.00

Using the predicted probabilities of the Pass class, we can graph the true positive rate and the false positive rate at different classification thresholds to create an ROC curve. The prediction() function pairs the probabilities with the actual labels, and the performance() function is used to create the ROC curve.

roc_dtree <- prediction(prob_dtree[,2], students_test$status)

model_roc_dtree <- performance(prediction.obj = roc_dtree, 
                               measure = "tpr", # True Positive Rate
                               x.measure = "fpr") # False Positive Rate
                        

plot(model_roc_dtree)
abline(0,1 , lty = 2)

We can get the AUC value from the model’s ROC by using the measure = "auc" parameter.

model_auc_dtree <- performance(roc_dtree,
                               measure = "auc")

model_auc_dtree@y.values[[1]]
#> [1] 0.9668899

Our Decision Tree model got an AUC value of 0.9668899.

Random Forest Evaluation

Confusion Matrix

We are going to evaluate our Random Forest model’s performance by predicting both the training data (students_train) and the unseen data (students_test), then summarizing the results in a confusion matrix. The predict() function returns the class that the model assigns to each observation, and the confusionMatrix() function creates the confusion matrix from those predictions.

#Training Evaluation
predict_rf <- predict(object = model_rf,
                      newdata = students_train) 

confusionMatrix(data = predict_rf,
                reference = students_train$status,
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail  207    0
#>       Pass    0  207
#>                                                
#>                Accuracy : 1                    
#>                  95% CI : (0.9911, 1)          
#>     No Information Rate : 0.5                  
#>     P-Value [Acc > NIR] : < 0.00000000000000022
#>                                                
#>                   Kappa : 1                    
#>                                                
#>  Mcnemar's Test P-Value : NA                   
#>                                                
#>             Sensitivity : 1.0                  
#>             Specificity : 1.0                  
#>          Pos Pred Value : 1.0                  
#>          Neg Pred Value : 1.0                  
#>              Prevalence : 0.5                  
#>          Detection Rate : 0.5                  
#>    Detection Prevalence : 0.5                  
#>       Balanced Accuracy : 1.0                  
#>                                                
#>        'Positive' Class : Pass                 
#> 

Our Random Forest model got perfect 1.00 train accuracy and 1.00 train precision.

#Testing Evaluation
predict_rf <- predict(object = model_rf,
                      newdata = students_test) 

confusionMatrix(data = predict_rf,
                reference = students_test$status,
                positive = "Pass")
#> Confusion Matrix and Statistics
#> 
#>           Reference
#> Prediction Fail Pass
#>       Fail   27    3
#>       Pass    1   45
#>                                           
#>                Accuracy : 0.9474          
#>                  95% CI : (0.8707, 0.9855)
#>     No Information Rate : 0.6316          
#>     P-Value [Acc > NIR] : 0.0000000001112 
#>                                           
#>                   Kappa : 0.8886          
#>                                           
#>  Mcnemar's Test P-Value : 0.6171          
#>                                           
#>             Sensitivity : 0.9375          
#>             Specificity : 0.9643          
#>          Pos Pred Value : 0.9783          
#>          Neg Pred Value : 0.9000          
#>              Prevalence : 0.6316          
#>          Detection Rate : 0.5921          
#>    Detection Prevalence : 0.6053          
#>       Balanced Accuracy : 0.9509          
#>                                           
#>        'Positive' Class : Pass            
#> 

Our Random Forest model got 0.9474 test accuracy and 0.9783 test precision. The precision of the model is very high because 45 out of the 46 students predicted to pass actually passed.

The training evaluation shows a perfect fit to the training data, while the testing evaluation shows a decrease in performance (accuracy of 1.00 versus 0.9474). This gap indicates that the model fits the training data more closely than unseen data. Nevertheless, the test performance is still excellent and is the highest of the three models.

ROC & AUC

To create an ROC curve, we need our predictions as probabilities, which we get by changing the type to "prob". This lets us see the probability of each student being predicted as a failed or a passed student.

prob_rf <- predict(object = model_rf, 
                   newdata = students_test, 
                   type = "prob")

head(prob_rf)

Using the predicted probabilities of the Pass class, we can graph the true positive rate and the false positive rate at different classification thresholds to create an ROC curve. The prediction() function pairs the probabilities with the actual labels, and the performance() function is used to create the ROC curve.

roc_rf <- prediction(prob_rf[,2], students_test$status)

model_roc_rf <- performance(prediction.obj = roc_rf, 
                            measure = "tpr", # True Positive Rate
                            x.measure = "fpr") # False Positive Rate
                        

plot(model_roc_rf)
abline(0,1 , lty = 2)

We can get the AUC value from the model’s ROC by using the measure = "auc" parameter.

model_auc_rf <- performance(roc_rf,
                            measure = "auc")

model_auc_rf@y.values[[1]]
#> [1] 0.9784226

Our Random Forest model got an AUC value of 0.9784226.


Conclusion

We’ve done Exploratory Data Analysis on the students’ performance dataset by analyzing some factors that affect students’ final grades. Here is the summary of the analysis that has been done:

  • Students that study for more than 5 hours a week have higher grades than students that study for less than 5 hours a week.
  • The higher the parents’ education, the higher the students’ grade.
  • Students that received no extra educational support have higher grades compared to the students that received any educational support.
  • Students that have internet access have higher grades compared to students that don’t have internet.
  • Students without a romantic partner have a higher average final grade compared to the students with a romantic partner.
  • Alcohol consumption does not have a significant impact on the students’ performance.
  • The average grade gets better as the students’ health gets better, except for the students with very bad health, who have the highest final grade.

We’ve built our classification models using Naive-Bayes, Decision Tree, and Random Forest. Based on the model evaluation, Random Forest is the best classification method for classifying the students’ academic performance in the math course, with 0.9474 accuracy, 0.9783 precision, and 0.9784 AUC on the test data. Random Forest models tend to perform very well because they combine predictions from multiple decision trees and aggregate their outputs. Random Forest models also handle datasets with a lot of predictors well. During training they favor the most informative features at each split and can report the importance of each feature, which makes Random Forest a suitable model for the students’ performance dataset.

In conclusion, this project has explored students’ academic performance data to get insights and has successfully predicted students’ academic performance based on various input factors. Students can utilize this project to forecast whether they will pass before their final exam and prepare themselves for the outcome. Teachers can also utilize the project to supervise students that would possibly fail the course based on the students’ data.
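
For a side-by-side view of this comparison, the sketch below collects the test-set metrics reported in the evaluation sections into one data frame. The numbers are copied from the outputs in this document (AUC rounded to four decimals); the results data frame itself is only an illustrative helper.

```r
# Test-set metrics copied from the evaluation outputs in this document
results <- data.frame(
  model     = c("Naive Bayes", "Decision Tree", "Random Forest"),
  accuracy  = c(0.8684, 0.9079, 0.9474),
  precision = c(0.9524, 0.9767, 0.9783),
  auc       = c(0.9531, 0.9669, 0.9784)
)

# Rank by the priority metric (precision), highest first
results[order(results$precision, decreasing = TRUE), ]
```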


Further Improvements

  • Use Hyperparameter Tuning and Feature Engineering to improve model performance.
  • Increase the dataset size by gathering more data, considering the initial dataset was small and imbalanced.
  • Try other machine learning algorithms that may fit the data better than the algorithms that have been used in this project.