The required R packages are installed (if missing) and loaded beforehand.

if (!require("caret")) {
  install.packages("caret", repos = "https://cran.asia/")
}
if (!require("dplyr")) {
  install.packages("dplyr", repos = "https://cran.asia/")
}
if (!require("ggplot2")) {
  install.packages("ggplot2", repos = "https://cran.asia/")
}
if (!require("kableExtra")) {
  install.packages("kableExtra", repos = "https://cran.asia/")
}
if (!require('tidyr')) {
  install.packages('tidyr', repos = "https://cran.asia/")
}
if (!require("caTools")) {
  install.packages("caTools", repos = "https://cran.asia/")
}
if (!require("e1071")) {
  install.packages("e1071", repos = "https://cran.asia/")
}
if (!require("MASS")) {
  install.packages("MASS", repos = "https://cran.asia/")
}

library(caret)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(tidyr)
library(caTools)
library(e1071)
library(MASS)

Introduction

General Introduction

Education is essential for long-term economic progress. Over the last decades, the Portuguese educational level has improved; however, the statistics still place Portugal at Europe’s tail end owing to its high student failure and dropout rates. For example, in 2006 the early school leaving rate in Portugal was 40% for 18 to 24 year olds, while the European Union average was just 15% (Eurostat 2007). Failure in the core classes of Mathematics and Portuguese (the native language) is especially serious, since these subjects provide fundamental knowledge for success in the remaining school subjects (e.g. physics or history).

Meanwhile, interest in Data Mining (DM) has grown with advances in Information Technology, which have led to an exponential growth of business and organizational databases. All these data hold valuable information, such as trends and patterns, that can be used to improve decision making and optimize success. The aim of this project is to predict student achievement and, where possible, to identify the key variables that affect educational success or failure.

Literature Review

LR table

Problem Statement

There is no doubt that everyone hopes to perform well in school; this is not only an individual ideal but also a need of social development and progress. Education in Europe is world-renowned; however, Portugal’s educational level is weak within Europe, and student dropout is an important issue. Predicting students’ performance and analyzing the factors that affect it can help educators adjust how they work, and provide support and encouragement so that students who might otherwise drop out can pursue higher education.

Objectives

In this project, there are three objectives to be achieved:

  • To predict the total academic score using multiple regression and support vector machine models.
  • To classify students’ intention to pursue higher education using naive Bayes and random forest models.
  • To evaluate which model achieves the highest accuracy in the prediction and classification analyses.

Based on the aforementioned objectives, the dataset “student.csv” was obtained from https://archive.ics.uci.edu/ml/datasets/student+performance and imported into RStudio for further data processing and analysis.

Data Ingestion

df <- read.csv("student.csv", header = TRUE, sep = ";")
# Inspect the structure of the data frame
str(df)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

Data Inspection

# Check whether the numerical attributes contain noise
summary(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
[summary() output: a wide table spanning all 33 variables, rendered as a scrollable HTML box. Two anomalies stand out: age has a minimum of -2.00, which is clearly noise, and freetime contains one NA.]
# Check the first 6 rows
head(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 4 0 11 11
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 2 9 11 11
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 0 yes no no no yes yes yes no 4 3 2 2 3 3 6 12 13 12
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes no yes yes yes yes yes 3 2 2 1 1 5 0 14 14 14
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes no no yes yes no no 4 3 2 1 2 5 0 11 13 13
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes no yes yes yes yes no 5 4 2 1 2 5 6 12 12 13
# Checking whether the dataset contains "" or NA.
columns <- names(df)
for (i in columns) {
  print(paste(i, sum(df[i] == "" | is.na(df[i]))))
}
## [1] "school 0"
## [1] "sex 0"
## [1] "age 0"
## [1] "address 0"
## [1] "famsize 0"
## [1] "Pstatus 0"
## [1] "Medu 0"
## [1] "Fedu 0"
## [1] "Mjob 0"
## [1] "Fjob 2"
## [1] "reason 2"
## [1] "guardian 2"
## [1] "traveltime 0"
## [1] "studytime 0"
## [1] "failures 0"
## [1] "schoolsup 0"
## [1] "famsup 0"
## [1] "paid 0"
## [1] "activities 0"
## [1] "nursery 0"
## [1] "higher 1"
## [1] "internet 2"
## [1] "romantic 1"
## [1] "famrel 0"
## [1] "freetime 1"
## [1] "goout 0"
## [1] "Dalc 0"
## [1] "Walc 0"
## [1] "health 0"
## [1] "absences 0"
## [1] "G1 0"
## [1] "G2 0"
## [1] "G3 0"
# Check data consistency in the character columns
columns <- names(df)
for (i in columns) {
  if (sapply(df[i], typeof) == "character") {
    print(unique(df[i]))
  }
}
##     school
## 1       GP
## 424     MS
##     sex
## 1     F
## 6     M
## 10    B
## 275   G
##    address
## 1        U
## 25       R
##   famsize
## 1     GT3
## 3     LE3
##   Pstatus
## 1       A
## 2       T
##        Mjob
## 1   at_home
## 4    health
## 5     other
## 6  services
## 11  teacher
##         Fjob
## 1    teacher
## 2      other
## 4   services
## 11    health
## 33   at_home
## 239         
##         reason
## 1       course
## 3        other
## 4         home
## 6   reputation
## 209           
##     guardian
## 1     mother
## 2     father
## 15     other
## 125         
##   schoolsup
## 1       yes
## 2        no
##   famsup
## 1     no
## 2    yes
##    paid
## 1    no
## 19  yes
## 90    0
##   activities
## 1         no
## 4        yes
##   nursery
## 1     yes
## 2      no
##     higher
## 1      yes
## 79      no
## 561       
##    internet
## 1        no
## 2       yes
## 25         
##     romantic
## 1         no
## 4        yes
## 455

Data Preparation

# Replace the noisy (negative) ages with the rounded mean of the valid ages
df$age[df$age < 0] <- round(mean(df$age[df$age >= 0], na.rm = TRUE))

# Amend the data inconsistencies in the sex and paid columns
df$sex[df$sex == 'B'] <- 'M'
df$sex[df$sex == 'G'] <- 'F'
df$paid[df$paid == 0] <- 'no'

# Replace the "" and na value with the mean or mode of the columns
getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], function(x) ifelse(x=="" | is.na(x), getmode(x), x))

df$freetime[is.na(df$freetime)] <- getmode(df$freetime)

columns <- names(df)
for (i in columns) {
  print(paste(i, sum(df[i] == "" | is.na(df[i]))))
}
## [1] "school 0"
## [1] "sex 0"
## [1] "age 0"
## [1] "address 0"
## [1] "famsize 0"
## [1] "Pstatus 0"
## [1] "Medu 0"
## [1] "Fedu 0"
## [1] "Mjob 0"
## [1] "Fjob 0"
## [1] "reason 0"
## [1] "guardian 0"
## [1] "traveltime 0"
## [1] "studytime 0"
## [1] "failures 0"
## [1] "schoolsup 0"
## [1] "famsup 0"
## [1] "paid 0"
## [1] "activities 0"
## [1] "nursery 0"
## [1] "higher 0"
## [1] "internet 0"
## [1] "romantic 0"
## [1] "famrel 0"
## [1] "freetime 0"
## [1] "goout 0"
## [1] "Dalc 0"
## [1] "Walc 0"
## [1] "health 0"
## [1] "absences 0"
## [1] "G1 0"
## [1] "G2 0"
## [1] "G3 0"
# Integrate the columns G1, G2 and G3 into a total score
df['TotalScore'] <- df['G1'] + df['G2'] + df['G3'] 

# Construct a second dataset for the classification task (retains higher as the target)
df2 <- subset(df, select = -c(G1, G2, G3, school, reason))

# Drop G1, G2, G3, higher, school and reason for the regression dataset
df <- subset(df, select = -c(G1, G2, G3, higher, school, reason))
head(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
sex age address famsize Pstatus Medu Fedu Mjob Fjob guardian traveltime studytime failures schoolsup famsup paid activities nursery internet romantic famrel freetime goout Dalc Walc health absences TotalScore
F 18 U GT3 A 4 4 at_home teacher mother 2 2 0 yes no no no yes no no 4 3 4 1 1 3 4 22
F 17 U GT3 T 1 1 at_home other father 1 2 0 no yes no no no yes no 5 3 3 1 1 3 2 31
F 15 U LE3 T 1 1 at_home other mother 1 2 0 yes no no no yes yes no 4 3 2 2 3 3 6 37
F 15 U GT3 T 4 2 health services mother 1 3 0 no yes no yes yes yes yes 3 2 2 1 1 5 0 42
F 16 U GT3 T 3 3 other other father 1 2 0 no yes no no yes no no 4 3 2 1 2 5 0 37
M 16 U LE3 T 4 3 services other mother 1 2 0 no yes no yes yes yes no 5 4 2 1 2 5 6 37

Exploratory Data Analysis

# Boxplot of the number of school absences
boxplot(df$absences, horizontal=T, main="Boxplot of the number of school absences", xlab="Number of school absences")

On average, the number of school absences falls roughly between 0 and 15, but a few students have exceptionally high absence counts (the outliers on the right).
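The individual outlier values can be listed with base R's boxplot.stats(); a quick check:

# Absence counts flagged as outliers by the 1.5 * IQR boxplot rule
boxplot.stats(df$absences)$out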

boxplot(df$TotalScore, horizontal=F, main="Boxplot of Total Score",
        ylab="Total Score")

Most total scores fall roughly between 15 and 55, with a top student reaching close to 60.

# Examine how the number of school absences influences the total score, by gender
ggplot(df, aes(x=absences, y=TotalScore, color=sex)) + geom_point() + stat_smooth(method=lm, se=FALSE) + labs(title = "Number of school absences vs total score by gender")
## `geom_smooth()` using formula 'y ~ x'

The higher the number of absences, the lower the total score tends to be, for both sexes.

# Average Score group by sex
df %>%
  dplyr::group_by(sex) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x = sex, y = Average_score, fill = sex)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Sex",
    y = "Average Score",
    title = paste("Average Score group by sex")
  )

Girls have a higher average score than boys.
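Whether this gap is statistically significant could be checked with a two-sample t-test; a minimal sketch:

# Welch two-sample t-test of total score between the sexes
t.test(TotalScore ~ sex, data = df)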

# Average Score group by region
df %>%
  dplyr::group_by(address) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  address, y = Average_score, fill = address)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Region",
    y = "Average Score",
    title = paste("Average Score group by region")
  )

Urban students have a higher average score; a shorter commute may leave more time to study.

# Average Score group by studytime
df %>%
  dplyr::group_by(studytime) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  studytime, y = Average_score, fill = studytime)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Study Time",
    y = "Average Score",
    title = paste("Average Score group by study time")
  )

The longer the weekly study time, the better the average score.

# Average Score group by famsup
df %>%
  dplyr::group_by(famsup) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  famsup, y = Average_score, fill = famsup)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Family Educational Support",
    y = "Average Score",
    title = paste("Average Score group by family educational support")
  )

There is little difference between students with and without family educational support.

# Average Score group by internet access
df %>%
  dplyr::group_by(internet) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  internet, y = Average_score, fill = internet)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Internet Access",
    y = "Average Score",
    title = paste("Average Score group by internet access")
  )

Students with internet access have a higher average score than those without.

# Average Score group by romantic relationship
df %>%
  dplyr::group_by(romantic) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  romantic, y = Average_score, fill = romantic)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Romantic Relationship",
    y = "Average Score",
    title = paste("Average Score group by romantic relationship")
  )

Students in a romantic relationship have a lower average score.

# Total Score vs number of school absences
df %>%
  dplyr::group_by(absences) %>%
  ggplot(aes(x =  absences, y = TotalScore, fill = absences)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Number of School Absences",
    y = "Total Score",
    title = paste("Total Score vs number of school absences")
  )

The fewer the school absences, the higher the total score obtained by the students.
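Note that because geom_bar(stat = "identity") stacks one segment per student, the bar heights above sum the scores of all students sharing an absence count, so they partly reflect group sizes. A mean-based view may be clearer; a minimal sketch:

df %>%
  dplyr::group_by(absences) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x = absences, y = Average_score)) +
  geom_col() +
  theme_classic() +
  labs(
    x = "Number of School Absences",
    y = "Average Total Score",
    title = "Average total score by number of school absences"
  )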

Data Modeling

Regression

Multiple Linear Regression

# Encoding categorical data
df$sex = factor(df$sex,
                       levels = c('F', 'M'),
                       labels = c(0,1))
df$address = factor(df$address,
                       levels = c('U', 'R'),
                       labels = c(0,1))

df$famsize = factor(df$famsize,
                       levels = c('LE3', 'GT3'),
                       labels = c(0,1))

df$Pstatus = factor(df$Pstatus,
                       levels = c('T', 'A'),
                       labels = c(0,1))

df$Mjob = factor(df$Mjob,
                       levels = c('teacher', 'health', 'services',   'at_home', 'other'),
                       labels = c(0,1,2,3,4))


df$Fjob = factor(df$Fjob,
                       levels = c('teacher', 'health', 'services',   'at_home', 'other'),
                       labels = c(0,1,2,3,4))

df$guardian = factor(df$guardian,
                       levels = c('mother', 'father', 'other'),
                       labels = c(0,1,3))

df$schoolsup = factor(df$schoolsup,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$famsup = factor(df$famsup,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$paid = factor(df$paid,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$activities = factor(df$activities,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$nursery = factor(df$nursery,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$internet = factor(df$internet,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$romantic = factor(df$romantic,
                       levels = c('no', 'yes'),
                       labels = c(0,1))
# Verify the encoding of one of the columns
df$Fjob
##   [1] 0 4 4 2 4 4 4 0 4 4 1 4 2 4 4 4 2 4 2 4 4 1 4 4 1 2 4 2 4 0 2 2 3 4 4 4 2
##  [38] 0 1 4 4 4 0 2 3 4 2 2 4 0 2 4 1 2 4 4 2 1 3 4 0 2 2 1 2 2 2 4 2 4 4 4 4 4
##  [75] 2 4 4 4 4 4 2 4 4 2 4 2 4 4 4 1 4 4 4 2 1 4 4 4 4 3 2 0 4 4 4 1 4 4 4 1 0
## [112] 4 4 4 2 0 0 2 4 4 2 2 1 4 4 2 4 4 0 4 0 4 4 4 4 4 4 4 0 3 4 4 2 2 2 4 4 0
## [149] 4 2 2 2 2 2 3 4 2 4 2 4 2 0 4 4 4 2 3 0 2 4 4 4 2 2 4 4 4 2 2 2 4 2 4 1 4
## [186] 4 4 2 4 4 4 4 2 4 4 4 2 4 4 2 2 4 4 4 2 2 2 2 4 2 0 4 0 0 4 4 4 4 2 2 4 4
## [223] 4 4 4 4 4 4 4 4 4 1 4 3 2 4 0 4 4 4 4 2 4 4 4 4 0 4 4 4 4 4 4 4 2 4 3 0 4
## [260] 3 4 2 4 2 4 4 4 4 4 4 2 2 4 4 4 4 4 2 4 2 2 4 4 4 2 4 4 4 4 2 2 4 4 3 4 3
## [297] 4 2 2 4 4 4 4 4 2 4 4 4 4 4 4 2 2 4 3 4 1 4 2 2 1 2 2 2 4 4 2 4 4 4 4 4 3
## [334] 0 2 0 4 2 4 4 4 4 4 4 0 4 0 2 4 1 2 4 2 0 4 4 0 4 2 2 4 4 4 4 1 2 4 4 4 4
## [371] 4 4 4 2 2 3 4 4 2 4 2 0 4 4 2 4 4 4 2 4 4 4 4 2 4 2 4 3 4 2 2 4 4 4 4 4 3
## [408] 4 2 4 2 4 4 4 4 4 4 4 2 4 4 4 4 4 4 2 4 2 4 4 4 4 4 4 4 2 2 4 4 2 2 4 2 2
## [445] 1 4 3 4 0 2 2 4 4 2 2 4 4 2 4 4 4 4 3 4 4 4 2 2 4 0 4 4 4 2 4 2 4 2 4 3 2
## [482] 4 4 4 4 4 2 4 2 4 3 4 4 3 4 4 2 3 2 4 4 4 2 4 4 4 2 2 4 4 4 4 4 4 3 3 0 4
## [519] 2 2 4 2 1 1 4 2 4 2 4 2 2 2 2 2 1 2 4 2 4 4 4 4 2 4 4 4 1 3 3 4 2 4 4 4 3
## [556] 4 3 4 4 3 2 3 2 4 4 4 4 4 2 3 2 4 3 3 4 2 4 2 4 2 4 4 4 4 4 3 2 2 3 4 4 3
## [593] 2 4 0 4 4 4 2 2 2 4 2 4 2 2 4 4 2 3 3 4 4 2 2 4 2 2 4 2 3 2 4 2 2 2 0 2 3
## [630] 4 4 4 4 2 4 4 0 4 2 2 4 4 4 3 4 2 4 2 4
## Levels: 0 1 2 3 4
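The seven binary yes/no columns could equally be encoded in a single pass, as a compact alternative to the per-column factor() calls above (same levels and labels):

# One-pass encoding of the yes/no columns (replaces the calls above)
yn_cols <- c("schoolsup", "famsup", "paid", "activities", "nursery",
             "internet", "romantic")
df[yn_cols] <- lapply(df[yn_cols],
                      function(x) factor(x, levels = c("no", "yes"),
                                         labels = c(0, 1)))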
# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df$TotalScore, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)

# Fitting Multiple Linear Regression to the Training set
regressor = lm(TotalScore~sex+age+address+famsize+Pstatus+Medu+Fedu+Mjob+Fjob+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+internet+romantic+famrel+freetime+goout+Dalc+Walc+health, data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = TotalScore ~ sex + age + address + famsize + Pstatus + 
##     Medu + Fedu + Mjob + Fjob + guardian + traveltime + studytime + 
##     failures + schoolsup + famsup + paid + activities + nursery + 
##     internet + romantic + famrel + freetime + goout + Dalc + 
##     Walc + health, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.151  -3.739  -0.353   4.116  19.703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.76779    6.11341   6.505 1.92e-10 ***
## sex1        -1.51228    0.75183  -2.011  0.04482 *  
## age          0.00715    0.31390   0.023  0.98184    
## address1    -1.43010    0.75472  -1.895  0.05870 .  
## famsize1    -1.06520    0.74968  -1.421  0.15599    
## Pstatus1    -0.71320    1.11143  -0.642  0.52137    
## Medu         0.44276    0.47824   0.926  0.35500    
## Fedu         0.51621    0.42514   1.214  0.22524    
## Mjob1        0.45834    1.66915   0.275  0.78375    
## Mjob2       -0.60045    1.32771  -0.452  0.65129    
## Mjob3       -2.44144    1.54918  -1.576  0.11568    
## Mjob4       -1.59979    1.36665  -1.171  0.24233    
## Fjob1       -5.02570    2.36670  -2.124  0.03421 *  
## Fjob2       -4.84692    1.61473  -3.002  0.00282 ** 
## Fjob3       -5.89512    2.03603  -2.895  0.00396 ** 
## Fjob4       -3.35671    1.58278  -2.121  0.03444 *  
## guardian1    1.23847    0.81560   1.518  0.12954    
## guardian3    1.89626    1.60473   1.182  0.23791    
## traveltime  -0.16335    0.46597  -0.351  0.72606    
## studytime    1.67676    0.41596   4.031 6.44e-05 ***
## failures    -4.30605    0.61081  -7.050 6.13e-12 ***
## schoolsup1  -2.45202    1.14147  -2.148  0.03219 *  
## famsup1     -0.27608    0.69204  -0.399  0.69012    
## paid1       -1.26911    1.38395  -0.917  0.35958    
## activities1  1.07615    0.67785   1.588  0.11303    
## nursery1    -0.60931    0.84106  -0.724  0.46913    
## internet1    1.07463    0.83223   1.291  0.19722    
## romantic1   -1.68258    0.69937  -2.406  0.01650 *  
## famrel       0.53240    0.35738   1.490  0.13694    
## freetime    -0.42623    0.33780  -1.262  0.20764    
## goout       -0.09145    0.33091  -0.276  0.78239    
## Dalc        -0.83445    0.46049  -1.812  0.07059 .  
## Walc        -0.33459    0.36823  -0.909  0.36399    
## health      -0.24421    0.23312  -1.048  0.29535    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.325 on 490 degrees of freedom
## Multiple R-squared:  0.3349, Adjusted R-squared:  0.2901 
## F-statistic: 7.477 on 33 and 490 DF,  p-value: < 2.2e-16
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

# RMSE of the model
RMSE(y_pred, test_set$TotalScore)
## [1] 6.610499
# R2 of the model
R2(y_pred, test_set$TotalScore)
## [1] 0.2734951

The F-statistic’s p-value is < 2.2e-16, so the null hypothesis that all coefficients are zero is rejected: at least one coefficient β is significantly different from 0, and the model is better than an intercept-only model. The adjusted R² of 0.29, however, indicates that the predictors explain only a modest share of the variation in total score.
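Since many coefficients are not significant, a leaner model might generalize better. One option, not run here, is stepwise selection by AIC with MASS::stepAIC (MASS is already loaded); a minimal sketch:

# Prune insignificant predictors by stepwise AIC (both directions)
regressor_step <- stepAIC(regressor, direction = "both", trace = FALSE)
summary(regressor_step)
RMSE(predict(regressor_step, newdata = test_set), test_set$TotalScore)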

Support Vector Machine

# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df$TotalScore, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
 
svm_model = svm(TotalScore~., data = training_set)
print(svm_model)
## 
## Call:
## svm(formula = TotalScore ~ ., data = training_set)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.02857143 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  472
pred = predict(svm_model, test_set)
 
x = 1:length(test_set$TotalScore)
plot(x, test_set$TotalScore, pch = 18, col = "red",
     xlab = "Test observation", ylab = "Total Score",
     main = "SVM: actual (red) vs. predicted (blue)")
lines(x, pred, lwd = 1, col = "blue")

# accuracy check 
rmse = RMSE(test_set$TotalScore, pred)
r2 = R2(pred, test_set$TotalScore)

cat( "RMSE:", rmse, "\n", "R-squared:", r2)
## RMSE: 6.320792 
##  R-squared: 0.321206

The support vector machine outperforms the multiple linear regression model on the test set, with a lower RMSE (6.32 vs. 6.61) and a higher R² (0.32 vs. 0.27).
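The SVM above uses e1071’s defaults (cost = 1, gamma ≈ 0.029); a grid search with e1071::tune.svm might lower the RMSE further. A sketch with illustrative (not tuned) grid values:

set.seed(42)
# Cross-validated grid search over cost and gamma
tuned <- tune.svm(TotalScore ~ ., data = training_set,
                  gamma = c(0.01, 0.03, 0.1), cost = c(0.5, 1, 2, 4))
best_svm <- tuned$best.model
RMSE(predict(best_svm, test_set), test_set$TotalScore)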

Classification

Naive Bayes

# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df2$higher, SplitRatio = 0.8)
training_set = subset(df2, split == TRUE)
test_set = subset(df2, split == FALSE)

# Exclude the target by name: after dropping school and reason, "higher" is
# column 19, not 21, so a positional index would remove the wrong column
nb_model = naiveBayes(x = training_set[, names(training_set) != "higher"],
                      y = training_set$higher)

y_pred <- predict(nb_model, newdata = test_set)

# Confusion Matrix
cm <- table(test_set$higher, y_pred)

# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##      y_pred
##        no yes
##   no   14   0
##   yes   3 113
##                                          
##                Accuracy : 0.9769         
##                  95% CI : (0.934, 0.9952)
##     No Information Rate : 0.8692         
##     P-Value [Acc > NIR] : 1.748e-05      
##                                          
##                   Kappa : 0.8903         
##                                          
##  Mcnemar's Test P-Value : 0.2482         
##                                          
##             Sensitivity : 0.8235         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9741         
##              Prevalence : 0.1308         
##          Detection Rate : 0.1077         
##    Detection Prevalence : 0.1077         
##       Balanced Accuracy : 0.9118         
##                                          
##        'Positive' Class : no             
## 

Random Forest

# Splitting the dataset into the training and test datasets

df2$higher = factor(df2$higher,levels = c('no', 'yes'),
                    labels = c(0,1))
set.seed(42)
split = sample.split(df2$higher, SplitRatio = 0.8)
training_set = subset(df2, split == TRUE)
test_set = subset(df2, split == FALSE)


require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model = randomForest(higher~., data = training_set)

y_pred <- predict(rf_model, newdata = test_set)

# Confusion Matrix
cm <- table(test_set$higher, y_pred)

# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##    y_pred
##       0   1
##   0   2  12
##   1   1 115
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8351, 0.9457)
##     No Information Rate : 0.9769          
##     P-Value [Acc > NIR] : 0.999998        
##                                           
##                   Kappa : 0.2051          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.66667         
##             Specificity : 0.90551         
##          Pos Pred Value : 0.14286         
##          Neg Pred Value : 0.99138         
##              Prevalence : 0.02308         
##          Detection Rate : 0.01538         
##    Detection Prevalence : 0.10769         
##       Balanced Accuracy : 0.78609         
##                                           
##        'Positive' Class : 0               
## 

For the classification task, the naive Bayes model achieved higher accuracy (0.977 vs. 0.900) and a much higher kappa (0.890 vs. 0.205) than the random forest model.
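The wide gap between the two kappa values largely reflects the heavy class imbalance in higher, which makes raw accuracy flattering for both models; the class proportions can be checked directly:

# Proportion of students who do not (0) / do (1) intend to pursue higher education
prop.table(table(df2$higher))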

Conclusions

Reference

Kaunang, F. J., & Rotikan, R. (2018). Students’ Academic Performance Prediction using Data Mining. 2018 Third International Conference on Informatics and Computing (ICIC), 1–5. https://doi.org/10.1109/IAC.2018.8780547

Kiu, C.-C. (2018). Data Mining Analysis on Student’s Academic Performance through Exploration of Student’s Background and Social Activities. 2018 Fourth International Conference on Advances in Computing, Communication & Automation (ICACCA), 1–5. https://doi.org/10.1109/ICACCAF.2018.8776809

Mengash, H. A. (2020). Using Data Mining Techniques to Predict Student Performance to Support Decision Making in University Admission Systems. IEEE Access, 8, 55462–55470. https://doi.org/10.1109/ACCESS.2020.2981905

Yossy, E. H., Heryadi, Y., & Lukas. (2019). Comparison of Data Mining Classification Algorithms for Student Performance. 2019 IEEE International Conference on Engineering, Technology and Education (TALE), 1–4. https://doi.org/10.1109/TALE48000.2019.9225887