Classification models can be an invaluable tool to a researcher. Utilizing these methods allow a research to focus on qualitative techniques to answer questions other statistical techniques cannot. In this tutorial we are going to look at the use of linear discriminant analysis on our Student Performance dataset.
The dataset is from the UCI Machine Learning Repository and can be found by clicking the following link: https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this step we need to get a few packages for our analysis. We can do this by using the following code. The dependencies = TRUE ensures we install all of the necessary packages the other packages depend on.
# install.packages(c("shiny","ggvis","reshape2","dplyr","gam","tree","randomForest"),dependencies = TRUE)
Ensure you have the MASS package loaded as this is necessary to perform linear discriminant analysis in R.
library(MASS)
library(splines)
library(gam)
We need to import our data into R. We can do this in several ways. In this tutorial we will be doing this via the read.csv() function that is available in base R. The header = T allows us to keep our headers from our excel sheet, and our sep = “,” tells R that our data is separated by commas. We can utilize tab / or semicolon ; as well in our ‘sep =’ argument. We are then renaming our dataset to ‘Student_Performance_Ben_Gonzalez’ which creates an object in R we can use throughout our analysis. The attach() function then ensures we have attached our dataset so we can analyze it.
##Student Performance Dataset-Ben Gonzalez ##########################
Student_Performance_Ben_Gonzalez <-read.csv("~/Datasets/Student_Performance_Ben_Gonzalez.csv",header = T, sep = ",")
attach(Student_Performance_Ben_Gonzalez)
This will give us an overview of what variables are in our dataset. I recommend doing this to help better understand our dataset.
names(Student_Performance_Ben_Gonzalez)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
The summary overview gives us a solid understanding of our data, it also may highlight extreme values for us as well.
summary(Student_Performance_Ben_Gonzalez)
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason
## Min. :0.000 at_home : 59 at_home : 20 course :145
## 1st Qu.:2.000 health : 34 health : 18 home :109
## Median :2.000 other :141 other :217 other : 36
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## guardian traveltime studytime failures schoolsup
## father: 90 Min. :1.000 Min. :1.000 Min. :0.0000 no :344
## mother:273 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51
## other : 32 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery higher internet romantic
## no :153 no :214 no :194 no : 81 no : 20 no : 66 no :263
## yes:242 yes:181 yes:201 yes:314 yes:375 yes:329 yes:132
##
##
##
##
## famrel freetime goout Dalc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :3.000 Median :1.000
## Mean :3.944 Mean :3.235 Mean :3.109 Mean :1.481
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Walc health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :4.000 Median : 4.000 Median :11.00
## Mean :2.291 Mean :3.554 Mean : 5.709 Mean :10.91
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :5.000 Max. :5.000 Max. :75.000 Max. :19.00
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
This step allows us to look at the structure of our data and which variables are integers or factors. This will help us in determining what statistical techniques we can utilize on our data. In this tutorial we will be focusing on ‘classification’ techniques.
str(Student_Performance_Ben_Gonzalez)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
train=(G1<10)
g1.10=Student_Performance_Ben_Gonzalez[!train,]
dim(g1.10)
## [1] 253 33
gradeone.10=schoolsup[!train]
This shows us we have 253 observations for G1 that are less than 10 and 33 that are higher than 10.
Here we are going to utilize a classification technique called linear discriminant analysis. This is similar to using both the lm() and ** glm()** functions. The absence of the family option is the notable difference from the glm() function.
## Schoolsup LDA
lda.fit=lda(schoolsup~.,data = Student_Performance_Ben_Gonzalez,subset = train)
lda.fit1=lda(schoolsup~G1,data = Student_Performance_Ben_Gonzalez,subset = train)
lda.fit1
## Call:
## lda(schoolsup ~ G1, data = Student_Performance_Ben_Gonzalez,
## subset = train)
##
## Prior probabilities of groups:
## no yes
## 0.8028169 0.1971831
##
## Group means:
## G1
## no 7.429825
## yes 7.321429
##
## Coefficients of linear discriminants:
## LD1
## G1 0.8012909
plot(lda.fit1)
Here we will have 3 elements returned to us.
lda.pred1=predict(lda.fit1,g1.10)
lda.class1=lda.pred1$class
table(lda.class1, gradeone.10)
## gradeone.10
## lda.class1 no yes
## no 230 23
## yes 0 0
mean(lda.class1== gradeone.10)
## [1] 0.9090909
sum(lda.pred1$posterior [ ,1] >=.5)
## [1] 253
sum(lda.pred1$posterior [ ,1] <.5)
## [1] 0
lda.pred1$posterior[1:10, 1]
## 4 6 7 9 10 11 12
## 0.8737611 0.8737611 0.8488794 0.8812404 0.8658826 0.8301434 0.8301434
## 13 14 15
## 0.8658826 0.8301434 0.8658826
lda.class1[1:65]
## [1] no no no no no no no no no no no no no no no no no no no no no no no
## [24] no no no no no no no no no no no no no no no no no no no no no no no
## [47] no no no no no no no no no no no no no no no no no no no
## Levels: no yes