The required R packages are installed (if missing) and loaded beforehand.

if (!require("caret")) {
  install.packages("caret", repos = "https://cran.asia/")
}
if (!require("dplyr")) {
  install.packages("dplyr", repos = "https://cran.asia/")
}
if (!require("ggplot2")) {
  install.packages("ggplot2", repos = "https://cran.asia/")
}
if (!require("kableExtra")) {
  install.packages("kableExtra", repos = "https://cran.asia/")
}
if (!require('tidyr')) {
  install.packages('tidyr', repos = "https://cran.asia/")
}
if (!require("caTools")) {
  install.packages("caTools", repos = "https://cran.asia/")
}
if (!require("e1071")) {
  install.packages("e1071", repos = "https://cran.asia/")
}
if (!require("MASS")) {
  install.packages("MASS", repos = "https://cran.asia/")
}

library(caret)
library(dplyr)
library(ggplot2)
library(kableExtra)
library(tidyr)
library(caTools)
library(e1071)
library(MASS)

Introduction

General Introduction

Education is essential for long-term economic progress. Over the last decades, the Portuguese educational level has improved; however, the statistics still place Portugal at Europe’s tail end owing to its high student failure and dropout rates. For example, in 2006 the early school leaving rate in Portugal was 40% for 18 to 24 year olds, while the European Union average was just 15% (Eurostat 2007). Failure in the core classes of Mathematics and Portuguese (the native language) is especially serious, since these subjects provide fundamental knowledge for success in the remaining school subjects (e.g. physics or history).

Meanwhile, interest in Data Mining (DM) has grown with advances in Information Technology, which have led to an exponential growth of business and organizational databases. All these data hold valuable information, such as trends and patterns, that can be used to improve decision making and optimize success. The aim of this project is to predict student achievement and, where possible, to identify the key variables that affect educational success or failure.

Literature Review

LR table

Problem Statement

There is no doubt that everyone hopes to perform well in school; this is not only an individual ideal but also a need of social development and progress. Education in Europe is world-renowned; however, Portugal’s educational level is weak within Europe, and student dropout is an important issue. Predicting students’ performance and analyzing the factors that affect it can help educators adjust how they work, and provide support and encouragement so that students who might otherwise drop out can pursue higher education.

Objectives

In this project, there are three objectives to be achieved:

  • To predict the total academic score using multiple regression and support vector machine models.
  • To classify students’ intention to pursue higher education using naive Bayes and random forest models.
  • To evaluate which model achieves the highest accuracy in the prediction and classification analyses.

Based on the aforementioned objectives, the dataset “student.csv” was obtained from https://archive.ics.uci.edu/ml/datasets/student+performance and imported into RStudio for further data processing and analysis.

Data Ingestion

df <- read.csv("student.csv", header = TRUE, sep = ";")
# Inspect the structure of the data frame
str(df)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...

Data Inspection

# Check whether the numerical attributes contain noise
summary(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
[summary() output: a wide table spanning all 33 variables, rendered as a scrollable HTML box. Two anomalies stand out: age has a minimum of -2.00, which is clearly noise, and freetime contains one NA.]
# Check the first 6 rows
head(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason guardian traveltime studytime failures schoolsup famsup paid activities nursery higher internet romantic famrel freetime goout Dalc Walc health absences G1 G2 G3
GP F 18 U GT3 A 4 4 at_home teacher course mother 2 2 0 yes no no no yes yes no no 4 3 4 1 1 3 4 0 11 11
GP F 17 U GT3 T 1 1 at_home other course father 1 2 0 no yes no no no yes yes no 5 3 3 1 1 3 2 9 11 11
GP F 15 U LE3 T 1 1 at_home other other mother 1 2 0 yes no no no yes yes yes no 4 3 2 2 3 3 6 12 13 12
GP F 15 U GT3 T 4 2 health services home mother 1 3 0 no yes no yes yes yes yes yes 3 2 2 1 1 5 0 14 14 14
GP F 16 U GT3 T 3 3 other other home father 1 2 0 no yes no no yes yes no no 4 3 2 1 2 5 0 11 13 13
GP M 16 U LE3 T 4 3 services other reputation mother 1 2 0 no yes no yes yes yes yes no 5 4 2 1 2 5 6 12 12 13
# Checking whether the dataset contains "" or NA.
columns <- names(df)
for (i in columns) {
  print(paste(i, sum(df[i] == "" | is.na(df[i]))))
}
## [1] "school 0"
## [1] "sex 0"
## [1] "age 0"
## [1] "address 0"
## [1] "famsize 0"
## [1] "Pstatus 0"
## [1] "Medu 0"
## [1] "Fedu 0"
## [1] "Mjob 0"
## [1] "Fjob 2"
## [1] "reason 2"
## [1] "guardian 2"
## [1] "traveltime 0"
## [1] "studytime 0"
## [1] "failures 0"
## [1] "schoolsup 0"
## [1] "famsup 0"
## [1] "paid 0"
## [1] "activities 0"
## [1] "nursery 0"
## [1] "higher 1"
## [1] "internet 2"
## [1] "romantic 1"
## [1] "famrel 0"
## [1] "freetime 1"
## [1] "goout 0"
## [1] "Dalc 0"
## [1] "Walc 0"
## [1] "health 0"
## [1] "absences 0"
## [1] "G1 0"
## [1] "G2 0"
## [1] "G3 0"
# Check data consistency in the character columns
columns <- names(df)
for (i in columns) {
  if (sapply(df[i], typeof) == "character") {
    print(unique(df[i]))
  }
}
##     school
## 1       GP
## 424     MS
##     sex
## 1     F
## 6     M
## 10    B
## 275   G
##    address
## 1        U
## 25       R
##   famsize
## 1     GT3
## 3     LE3
##   Pstatus
## 1       A
## 2       T
##        Mjob
## 1   at_home
## 4    health
## 5     other
## 6  services
## 11  teacher
##         Fjob
## 1    teacher
## 2      other
## 4   services
## 11    health
## 33   at_home
## 239         
##         reason
## 1       course
## 3        other
## 4         home
## 6   reputation
## 209           
##     guardian
## 1     mother
## 2     father
## 15     other
## 125         
##   schoolsup
## 1       yes
## 2        no
##   famsup
## 1     no
## 2    yes
##    paid
## 1    no
## 19  yes
## 90    0
##   activities
## 1         no
## 4        yes
##   nursery
## 1     yes
## 2      no
##     higher
## 1      yes
## 79      no
## 561       
##    internet
## 1        no
## 2       yes
## 25         
##     romantic
## 1         no
## 4        yes
## 455

Data Preparation

# Replace the noisy (negative) ages with the rounded mean of the valid ages
df$age[df$age < 0] <- round(mean(df$age[df$age >= 0], na.rm = TRUE))

# Amend the data inconsistencies in the sex and paid columns
df$sex[df$sex == 'B'] <- 'M'
df$sex[df$sex == 'G'] <- 'F'
df$paid[df$paid == 0] <- 'no'

# Replace the "" and na value with the mean or mode of the columns
getmode <- function(v){
  v=v[nchar(as.character(v))>0]
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}
df[sapply(df, is.character)] <- lapply(df[sapply(df, is.character)], function(x) ifelse(x=="" | is.na(x), getmode(x), x))

df$freetime[is.na(df$freetime)] <- getmode(df$freetime)

columns <- names(df)
for (i in columns) {
  print(paste(i, sum(df[i] == "" | is.na(df[i]))))
}
## [1] "school 0"
## [1] "sex 0"
## [1] "age 0"
## [1] "address 0"
## [1] "famsize 0"
## [1] "Pstatus 0"
## [1] "Medu 0"
## [1] "Fedu 0"
## [1] "Mjob 0"
## [1] "Fjob 0"
## [1] "reason 0"
## [1] "guardian 0"
## [1] "traveltime 0"
## [1] "studytime 0"
## [1] "failures 0"
## [1] "schoolsup 0"
## [1] "famsup 0"
## [1] "paid 0"
## [1] "activities 0"
## [1] "nursery 0"
## [1] "higher 0"
## [1] "internet 0"
## [1] "romantic 0"
## [1] "famrel 0"
## [1] "freetime 0"
## [1] "goout 0"
## [1] "Dalc 0"
## [1] "Walc 0"
## [1] "health 0"
## [1] "absences 0"
## [1] "G1 0"
## [1] "G2 0"
## [1] "G3 0"
# Integrate the columns G1, G2 and G3 into a total score
df['TotalScore'] <- df['G1'] + df['G2'] + df['G3'] 

# Construct a second dataset for the classification task (retains higher as the target)
df2 <- subset(df, select = -c(G1, G2, G3, school, reason))

# Drop G1, G2, G3, higher, school and reason for the regression dataset
df <- subset(df, select = -c(G1, G2, G3, higher, school, reason))
head(df) %>%
  kable("html") %>%
  scroll_box(width = "100%") %>%
  kable_styling(font_size = 12)
sex age address famsize Pstatus Medu Fedu Mjob Fjob guardian traveltime studytime failures schoolsup famsup paid activities nursery internet romantic famrel freetime goout Dalc Walc health absences TotalScore
F 18 U GT3 A 4 4 at_home teacher mother 2 2 0 yes no no no yes no no 4 3 4 1 1 3 4 22
F 17 U GT3 T 1 1 at_home other father 1 2 0 no yes no no no yes no 5 3 3 1 1 3 2 31
F 15 U LE3 T 1 1 at_home other mother 1 2 0 yes no no no yes yes no 4 3 2 2 3 3 6 37
F 15 U GT3 T 4 2 health services mother 1 3 0 no yes no yes yes yes yes 3 2 2 1 1 5 0 42
F 16 U GT3 T 3 3 other other father 1 2 0 no yes no no yes no no 4 3 2 1 2 5 0 37
M 16 U LE3 T 4 3 services other mother 1 2 0 no yes no yes yes yes no 5 4 2 1 2 5 6 37

Exploratory Data Analysis

# Boxplot of the number of school absences
boxplot(df$absences, horizontal=T, main="Boxplot of the number of school absences", xlab="Number of school absences")

On average, the number of school absences falls roughly between 0 and 15, but a few students have exceptionally high absence counts (the outliers on the right).
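The individual outlier values can be listed with base R's boxplot.stats(); a quick check:

# Absence counts flagged as outliers by the 1.5 * IQR boxplot rule
boxplot.stats(df$absences)$out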

boxplot(df$TotalScore, horizontal=F, main="Boxplot of Total Score",
        ylab="Total Score")

Most total scores fall roughly between 15 and 55, with a top student reaching close to 60.

# Examine how the number of school absences influences the total score, by gender
ggplot(df, aes(x=absences, y=TotalScore, color=sex)) + geom_point() + stat_smooth(method=lm, se=FALSE) + labs(title = "Number of school absences vs total score by gender")
## `geom_smooth()` using formula 'y ~ x'

The higher the number of absences, the lower the total score tends to be, for both sexes.

# Average Score group by sex
df %>%
  dplyr::group_by(sex) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x = sex, y = Average_score, fill = sex)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Sex",
    y = "Average Score",
    title = paste("Average Score group by sex")
  )

Girls have a higher average score than boys.
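Whether this gap is statistically significant could be checked with a two-sample t-test; a minimal sketch:

# Welch two-sample t-test of total score between the sexes
t.test(TotalScore ~ sex, data = df)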

# Average Score group by region
df %>%
  dplyr::group_by(address) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  address, y = Average_score, fill = address)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Region",
    y = "Average Score",
    title = paste("Average Score group by region")
  )

Urban students have a higher average score; a shorter commute may leave more time to study.

# Average Score group by studytime
df %>%
  dplyr::group_by(studytime) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  studytime, y = Average_score, fill = studytime)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Study Time",
    y = "Average Score",
    title = paste("Average Score group by study time")
  )

The longer the weekly study time, the better the average score.

# Average Score group by famsup
df %>%
  dplyr::group_by(famsup) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  famsup, y = Average_score, fill = famsup)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Family Educational Support",
    y = "Average Score",
    title = paste("Average Score group by family educational support")
  )

There is little difference between students with and without family educational support.

# Average Score group by internet access
df %>%
  dplyr::group_by(internet) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  internet, y = Average_score, fill = internet)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Internet Access",
    y = "Average Score",
    title = paste("Average Score group by internet access")
  )

Students with internet access have a higher average score than those without.

# Average Score group by romantic relationship
df %>%
  dplyr::group_by(romantic) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x =  romantic, y = Average_score, fill = romantic)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Romantic Relationship",
    y = "Average Score",
    title = paste("Average Score group by romantic relationship")
  )

Students in a romantic relationship have a lower average score.

# Total Score vs number of school absences
df %>%
  dplyr::group_by(absences) %>%
  ggplot(aes(x =  absences, y = TotalScore, fill = absences)) + 
  geom_bar(stat = "identity") + 
  theme_classic() +
  labs(
    x = "Number of School Absences",
    y = "Total Score",
    title = paste("Total Score vs number of school absences")
  )

The fewer the school absences, the higher the total score obtained by the students.
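Note that because geom_bar(stat = "identity") stacks one segment per student, the bar heights above sum the scores of all students sharing an absence count, so they partly reflect group sizes. A mean-based view may be clearer; a minimal sketch:

df %>%
  dplyr::group_by(absences) %>%
  dplyr::summarise(Average_score = mean(TotalScore)) %>%
  ggplot(aes(x = absences, y = Average_score)) +
  geom_col() +
  theme_classic() +
  labs(
    x = "Number of School Absences",
    y = "Average Total Score",
    title = "Average total score by number of school absences"
  )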

Data Modeling

Regression

Multiple Linear Regression

# Encoding categorical data
df$sex = factor(df$sex,
                       levels = c('F', 'M'),
                       labels = c(0,1))
df$address = factor(df$address,
                       levels = c('U', 'R'),
                       labels = c(0,1))

df$famsize = factor(df$famsize,
                       levels = c('LE3', 'GT3'),
                       labels = c(0,1))

df$Pstatus = factor(df$Pstatus,
                       levels = c('T', 'A'),
                       labels = c(0,1))

df$Mjob = factor(df$Mjob,
                       levels = c('teacher', 'health', 'services',   'at_home', 'other'),
                       labels = c(0,1,2,3,4))


df$Fjob = factor(df$Fjob,
                       levels = c('teacher', 'health', 'services',   'at_home', 'other'),
                       labels = c(0,1,2,3,4))

df$guardian = factor(df$guardian,
                       levels = c('mother', 'father', 'other'),
                       labels = c(0,1,3))

df$schoolsup = factor(df$schoolsup,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$famsup = factor(df$famsup,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$paid = factor(df$paid,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$activities = factor(df$activities,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$nursery = factor(df$nursery,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$internet = factor(df$internet,
                       levels = c('no', 'yes'),
                       labels = c(0,1))

df$romantic = factor(df$romantic,
                       levels = c('no', 'yes'),
                       labels = c(0,1))
# Verify the encoding of one of the columns
df$Fjob
##   [1] 0 4 4 2 4 4 4 0 4 4 1 4 2 4 4 4 2 4 2 4 4 1 4 4 1 2 4 2 4 0 2 2 3 4 4 4 2
##  [38] 0 1 4 4 4 0 2 3 4 2 2 4 0 2 4 1 2 4 4 2 1 3 4 0 2 2 1 2 2 2 4 2 4 4 4 4 4
##  [75] 2 4 4 4 4 4 2 4 4 2 4 2 4 4 4 1 4 4 4 2 1 4 4 4 4 3 2 0 4 4 4 1 4 4 4 1 0
## [112] 4 4 4 2 0 0 2 4 4 2 2 1 4 4 2 4 4 0 4 0 4 4 4 4 4 4 4 0 3 4 4 2 2 2 4 4 0
## [149] 4 2 2 2 2 2 3 4 2 4 2 4 2 0 4 4 4 2 3 0 2 4 4 4 2 2 4 4 4 2 2 2 4 2 4 1 4
## [186] 4 4 2 4 4 4 4 2 4 4 4 2 4 4 2 2 4 4 4 2 2 2 2 4 2 0 4 0 0 4 4 4 4 2 2 4 4
## [223] 4 4 4 4 4 4 4 4 4 1 4 3 2 4 0 4 4 4 4 2 4 4 4 4 0 4 4 4 4 4 4 4 2 4 3 0 4
## [260] 3 4 2 4 2 4 4 4 4 4 4 2 2 4 4 4 4 4 2 4 2 2 4 4 4 2 4 4 4 4 2 2 4 4 3 4 3
## [297] 4 2 2 4 4 4 4 4 2 4 4 4 4 4 4 2 2 4 3 4 1 4 2 2 1 2 2 2 4 4 2 4 4 4 4 4 3
## [334] 0 2 0 4 2 4 4 4 4 4 4 0 4 0 2 4 1 2 4 2 0 4 4 0 4 2 2 4 4 4 4 1 2 4 4 4 4
## [371] 4 4 4 2 2 3 4 4 2 4 2 0 4 4 2 4 4 4 2 4 4 4 4 2 4 2 4 3 4 2 2 4 4 4 4 4 3
## [408] 4 2 4 2 4 4 4 4 4 4 4 2 4 4 4 4 4 4 2 4 2 4 4 4 4 4 4 4 2 2 4 4 2 2 4 2 2
## [445] 1 4 3 4 0 2 2 4 4 2 2 4 4 2 4 4 4 4 3 4 4 4 2 2 4 0 4 4 4 2 4 2 4 2 4 3 2
## [482] 4 4 4 4 4 2 4 2 4 3 4 4 3 4 4 2 3 2 4 4 4 2 4 4 4 2 2 4 4 4 4 4 4 3 3 0 4
## [519] 2 2 4 2 1 1 4 2 4 2 4 2 2 2 2 2 1 2 4 2 4 4 4 4 2 4 4 4 1 3 3 4 2 4 4 4 3
## [556] 4 3 4 4 3 2 3 2 4 4 4 4 4 2 3 2 4 3 3 4 2 4 2 4 2 4 4 4 4 4 3 2 2 3 4 4 3
## [593] 2 4 0 4 4 4 2 2 2 4 2 4 2 2 4 4 2 3 3 4 4 2 2 4 2 2 4 2 3 2 4 2 2 2 0 2 3
## [630] 4 4 4 4 2 4 4 0 4 2 2 4 4 4 3 4 2 4 2 4
## Levels: 0 1 2 3 4
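The seven binary yes/no columns could equally be encoded in a single pass, as a compact alternative to the per-column factor() calls above (same levels and labels):

# One-pass encoding of the yes/no columns (replaces the calls above)
yn_cols <- c("schoolsup", "famsup", "paid", "activities", "nursery",
             "internet", "romantic")
df[yn_cols] <- lapply(df[yn_cols],
                      function(x) factor(x, levels = c("no", "yes"),
                                         labels = c(0, 1)))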
# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df$TotalScore, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)

# Fitting Multiple Linear Regression to the Training set
regressor = lm(TotalScore~sex+age+address+famsize+Pstatus+Medu+Fedu+Mjob+Fjob+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+internet+romantic+famrel+freetime+goout+Dalc+Walc+health, data = training_set)

summary(regressor)
## 
## Call:
## lm(formula = TotalScore ~ sex + age + address + famsize + Pstatus + 
##     Medu + Fedu + Mjob + Fjob + guardian + traveltime + studytime + 
##     failures + schoolsup + famsup + paid + activities + nursery + 
##     internet + romantic + famrel + freetime + goout + Dalc + 
##     Walc + health, data = training_set)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.151  -3.739  -0.353   4.116  19.703 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.76779    6.11341   6.505 1.92e-10 ***
## sex1        -1.51228    0.75183  -2.011  0.04482 *  
## age          0.00715    0.31390   0.023  0.98184    
## address1    -1.43010    0.75472  -1.895  0.05870 .  
## famsize1    -1.06520    0.74968  -1.421  0.15599    
## Pstatus1    -0.71320    1.11143  -0.642  0.52137    
## Medu         0.44276    0.47824   0.926  0.35500    
## Fedu         0.51621    0.42514   1.214  0.22524    
## Mjob1        0.45834    1.66915   0.275  0.78375    
## Mjob2       -0.60045    1.32771  -0.452  0.65129    
## Mjob3       -2.44144    1.54918  -1.576  0.11568    
## Mjob4       -1.59979    1.36665  -1.171  0.24233    
## Fjob1       -5.02570    2.36670  -2.124  0.03421 *  
## Fjob2       -4.84692    1.61473  -3.002  0.00282 ** 
## Fjob3       -5.89512    2.03603  -2.895  0.00396 ** 
## Fjob4       -3.35671    1.58278  -2.121  0.03444 *  
## guardian1    1.23847    0.81560   1.518  0.12954    
## guardian3    1.89626    1.60473   1.182  0.23791    
## traveltime  -0.16335    0.46597  -0.351  0.72606    
## studytime    1.67676    0.41596   4.031 6.44e-05 ***
## failures    -4.30605    0.61081  -7.050 6.13e-12 ***
## schoolsup1  -2.45202    1.14147  -2.148  0.03219 *  
## famsup1     -0.27608    0.69204  -0.399  0.69012    
## paid1       -1.26911    1.38395  -0.917  0.35958    
## activities1  1.07615    0.67785   1.588  0.11303    
## nursery1    -0.60931    0.84106  -0.724  0.46913    
## internet1    1.07463    0.83223   1.291  0.19722    
## romantic1   -1.68258    0.69937  -2.406  0.01650 *  
## famrel       0.53240    0.35738   1.490  0.13694    
## freetime    -0.42623    0.33780  -1.262  0.20764    
## goout       -0.09145    0.33091  -0.276  0.78239    
## Dalc        -0.83445    0.46049  -1.812  0.07059 .  
## Walc        -0.33459    0.36823  -0.909  0.36399    
## health      -0.24421    0.23312  -1.048  0.29535    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.325 on 490 degrees of freedom
## Multiple R-squared:  0.3349, Adjusted R-squared:  0.2901 
## F-statistic: 7.477 on 33 and 490 DF,  p-value: < 2.2e-16
# Predicting the Test set results
y_pred = predict(regressor, newdata = test_set)

# RMSE of the model
RMSE(y_pred, test_set$TotalScore)
## [1] 6.610499
# R2 of the model
R2(y_pred, test_set$TotalScore)
## [1] 0.2734951

The F-statistic’s p-value is < 2.2e-16, so the null hypothesis that all coefficients are zero is rejected: at least one coefficient β is significantly different from 0, and the model is better than an intercept-only model. The adjusted R² of 0.29, however, indicates that the predictors explain only a modest share of the variation in total score.
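Since many coefficients are not significant, a leaner model might generalize better. One option, not run here, is stepwise selection by AIC with MASS::stepAIC (MASS is already loaded); a minimal sketch:

# Prune insignificant predictors by stepwise AIC (both directions)
regressor_step <- stepAIC(regressor, direction = "both", trace = FALSE)
summary(regressor_step)
RMSE(predict(regressor_step, newdata = test_set), test_set$TotalScore)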

Support Vector Machine

# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df$TotalScore, SplitRatio = 0.8)
training_set = subset(df, split == TRUE)
test_set = subset(df, split == FALSE)
 
svm_model = svm(TotalScore~., data = training_set)
print(svm_model)
## 
## Call:
## svm(formula = TotalScore ~ ., data = training_set)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.02857143 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  472
pred = predict(svm_model, test_set)
 
x = 1:length(test_set$TotalScore)
plot(x, test_set$TotalScore, pch = 18, col = "red",
     xlab = "Test observation", ylab = "Total Score",
     main = "SVM: actual (red) vs. predicted (blue)")
lines(x, pred, lwd = 1, col = "blue")

# accuracy check 
rmse = RMSE(test_set$TotalScore, pred)
r2 = R2(pred, test_set$TotalScore)

cat( "RMSE:", rmse, "\n", "R-squared:", r2)
## RMSE: 6.320792 
##  R-squared: 0.321206

The support vector machine outperforms the multiple linear regression model on the test set, with a lower RMSE (6.32 vs. 6.61) and a higher R² (0.32 vs. 0.27).
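The SVM above uses e1071’s defaults (cost = 1, gamma ≈ 0.029); a grid search with e1071::tune.svm might lower the RMSE further. A sketch with illustrative (not tuned) grid values:

set.seed(42)
# Cross-validated grid search over cost and gamma
tuned <- tune.svm(TotalScore ~ ., data = training_set,
                  gamma = c(0.01, 0.03, 0.1), cost = c(0.5, 1, 2, 4))
best_svm <- tuned$best.model
RMSE(predict(best_svm, test_set), test_set$TotalScore)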

Classification

Naive Bayes

# Splitting the dataset into the training and test datasets
set.seed(42)
split = sample.split(df2$higher, SplitRatio = 0.8)
training_set = subset(df2, split == TRUE)
test_set = subset(df2, split == FALSE)

# Exclude the target by name: after dropping school and reason, "higher" is
# column 19, not 21, so a positional index would remove the wrong column
nb_model = naiveBayes(x = training_set[, names(training_set) != "higher"],
                      y = training_set$higher)

y_pred <- predict(nb_model, newdata = test_set)

# Confusion Matrix
cm <- table(test_set$higher, y_pred)

# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##      y_pred
##        no yes
##   no   14   0
##   yes   3 113
##                                          
##                Accuracy : 0.9769         
##                  95% CI : (0.934, 0.9952)
##     No Information Rate : 0.8692         
##     P-Value [Acc > NIR] : 1.748e-05      
##                                          
##                   Kappa : 0.8903         
##                                          
##  Mcnemar's Test P-Value : 0.2482         
##                                          
##             Sensitivity : 0.8235         
##             Specificity : 1.0000         
##          Pos Pred Value : 1.0000         
##          Neg Pred Value : 0.9741         
##              Prevalence : 0.1308         
##          Detection Rate : 0.1077         
##    Detection Prevalence : 0.1077         
##       Balanced Accuracy : 0.9118         
##                                          
##        'Positive' Class : no             
## 

Random Forest

# Splitting the dataset into the training and test datasets

df2$higher = factor(df2$higher,levels = c('no', 'yes'),
                    labels = c(0,1))
set.seed(42)
split = sample.split(df2$higher, SplitRatio = 0.8)
training_set = subset(df2, split == TRUE)
test_set = subset(df2, split == FALSE)


require(randomForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
rf_model = randomForest(higher~., data = training_set)

y_pred <- predict(rf_model, newdata = test_set)

# Confusion Matrix
cm <- table(test_set$higher, y_pred)

# Model Evaluation
confusionMatrix(cm)
## Confusion Matrix and Statistics
## 
##    y_pred
##       0   1
##   0   2  12
##   1   1 115
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.8351, 0.9457)
##     No Information Rate : 0.9769          
##     P-Value [Acc > NIR] : 0.999998        
##                                           
##                   Kappa : 0.2051          
##                                           
##  Mcnemar's Test P-Value : 0.005546        
##                                           
##             Sensitivity : 0.66667         
##             Specificity : 0.90551         
##          Pos Pred Value : 0.14286         
##          Neg Pred Value : 0.99138         
##              Prevalence : 0.02308         
##          Detection Rate : 0.01538         
##    Detection Prevalence : 0.10769         
##       Balanced Accuracy : 0.78609         
##                                           
##        'Positive' Class : 0               
## 

For the classification task, the naive Bayes model achieved higher accuracy (0.977 vs. 0.900) and a much higher kappa (0.890 vs. 0.205) than the random forest model.
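The wide gap between the two kappa values largely reflects the heavy class imbalance in higher, which makes raw accuracy flattering for both models; the class proportions can be checked directly:

# Proportion of students who do not (0) / do (1) intend to pursue higher education
prop.table(table(df2$higher))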

Conclusions

Reference

Kaunang, F. J., & Rotikan, R. (2018). Students’ Academic Performance Prediction using Data Mining. 2018 Third International Conference on Informatics and Computing (ICIC), 1–5. https://doi.org/10.1109/IAC.2018.8780547

Kiu, C.-C. (2018). Data Mining Analysis on Student’s Academic Performance through Exploration of Student’s Background and Social Activities. 2018 Fourth International Conference on Advances in Computing, Communication & Automation (ICACCA), 1–5. https://doi.org/10.1109/ICACCAF.2018.8776809

Mengash, H. A. (2020). Using Data Mining Techniques to Predict Student Performance to Support Decision Making in University Admission Systems. IEEE Access, 8, 55462–55470. https://doi.org/10.1109/ACCESS.2020.2981905

Yossy, E. H., Heryadi, Y., & Lukas. (2019). Comparison of Data Mining Classification Algorithms for Student Performance. 2019 IEEE International Conference on Engineering, Technology and Education (TALE), 1–4. https://doi.org/10.1109/TALE48000.2019.9225887