Classification models can be an invaluable tool to a researcher. Utilizing these methods allow a research to focus on qualitative techniques to answer questions other statistical techniques cannot. In this tutorial we are going to look at the use of logistic regression on our Student Performance dataset.
The dataset is from the UCI Machine Learning Repository and can be found by clicking the following link: https://archive.ics.uci.edu/ml/datasets/Student+Performance
In this step we need to get a few packages for our analysis. We can do this by using the following code. The dependencies = TRUE ensures we install all of the necessary packages the other packages depend on.
# install.packages(c("shiny","ggvis","reshape2","dplyr","gam","tree","randomForest"),dependencies = TRUE)
library(ggvis)
library(reshape2)
library(dplyr)
library(splines)
library(gam)
We need to import our data into R. We can do this in several ways. In this tutorial we will be doing this via the read.csv() function that is available in base R. The header = T allows us to keep our headers from our excel sheet, and our sep = “,” tells R that our data is separated by commas. We can utilize tab / or semicolon ; as well in our ‘sep =’ argument. We are then renaming our dataset to ‘Student_Performance_Ben_Gonzalez’ which creates an object in R we can use throughout our analysis. The attach() function then ensures we have attached our dataset so we can analyze it.
##Student Performance Dataset-Ben Gonzalez ##########################
Student_Performance_Ben_Gonzalez <-read.csv("~/Datasets/Student_Performance_Ben_Gonzalez.csv",header = T, sep = ",")
attach(Student_Performance_Ben_Gonzalez)
This will give us an overview of what variables are in our dataset. I recommend doing this to help better understand our dataset.
names(Student_Performance_Ben_Gonzalez)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
The summary overview gives us a solid understanding of our data, it also may highlight extreme values for us as well.
summary(Student_Performance_Ben_Gonzalez)
## school sex age address famsize Pstatus Medu
## GP:349 F:208 Min. :15.0 R: 88 GT3:281 A: 41 Min. :0.000
## MS: 46 M:187 1st Qu.:16.0 U:307 LE3:114 T:354 1st Qu.:2.000
## Median :17.0 Median :3.000
## Mean :16.7 Mean :2.749
## 3rd Qu.:18.0 3rd Qu.:4.000
## Max. :22.0 Max. :4.000
## Fedu Mjob Fjob reason
## Min. :0.000 at_home : 59 at_home : 20 course :145
## 1st Qu.:2.000 health : 34 health : 18 home :109
## Median :2.000 other :141 other :217 other : 36
## Mean :2.522 services:103 services:111 reputation:105
## 3rd Qu.:3.000 teacher : 58 teacher : 29
## Max. :4.000
## guardian traveltime studytime failures schoolsup
## father: 90 Min. :1.000 Min. :1.000 Min. :0.0000 no :344
## mother:273 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 yes: 51
## other : 32 Median :1.000 Median :2.000 Median :0.0000
## Mean :1.448 Mean :2.035 Mean :0.3342
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery higher internet romantic
## no :153 no :214 no :194 no : 81 no : 20 no : 66 no :263
## yes:242 yes:181 yes:201 yes:314 yes:375 yes:329 yes:132
##
##
##
##
## famrel freetime goout Dalc
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:1.000
## Median :4.000 Median :3.000 Median :3.000 Median :1.000
## Mean :3.944 Mean :3.235 Mean :3.109 Mean :1.481
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:2.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## Walc health absences G1
## Min. :1.000 Min. :1.000 Min. : 0.000 Min. : 3.00
## 1st Qu.:1.000 1st Qu.:3.000 1st Qu.: 0.000 1st Qu.: 8.00
## Median :2.000 Median :4.000 Median : 4.000 Median :11.00
## Mean :2.291 Mean :3.554 Mean : 5.709 Mean :10.91
## 3rd Qu.:3.000 3rd Qu.:5.000 3rd Qu.: 8.000 3rd Qu.:13.00
## Max. :5.000 Max. :5.000 Max. :75.000 Max. :19.00
## G2 G3
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 9.00 1st Qu.: 8.00
## Median :11.00 Median :11.00
## Mean :10.71 Mean :10.42
## 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :19.00 Max. :20.00
This step allows us to look at the structure of our data and which variables are integers or factors. This will help us in determining what statistical techniques we can utilize on our data. In this tutorial we will be focusing on ‘classification’ techniques.
str(Student_Performance_Ben_Gonzalez)
## 'data.frame': 395 obs. of 33 variables:
## $ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
## $ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
## $ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
## $ Pstatus : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
## $ Fjob : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
## $ reason : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
## $ guardian : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 3 0 0 0 0 0 0 0 ...
## $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
## $ famsup : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
## $ paid : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
## $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
## $ nursery : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
## $ higher : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ internet : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
## $ romantic : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 6 4 10 2 4 10 0 6 0 0 ...
## $ G1 : int 5 5 7 15 6 15 12 6 16 14 ...
## $ G2 : int 6 5 8 14 10 15 12 5 18 15 ...
## $ G3 : int 6 6 10 15 10 15 11 6 19 15 ...
Here we will use the glm() function to create our logistic regression models. The glm() works the same way our lm() function works with one difference. We need to set the argument family = to ‘binomial’ to let R know we are wanting to perform a logistic regression. We want to see if schoolsup has an effect on G3.
## Famsup Classification Prediction
glm.fit=glm(schoolsup~., family = binomial, data = Student_Performance_Ben_Gonzalez)
glm.fit1=glm(schoolsup~G1, family = binomial, data = Student_Performance_Ben_Gonzalez)
summary(glm.fit1)
##
## Call:
## glm(formula = schoolsup ~ G1, family = binomial, data = Student_Performance_Ben_Gonzalez)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0268 -0.5864 -0.4314 -0.3144 2.5457
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.28929 0.52213 0.554 0.58
## G1 -0.21810 0.05346 -4.080 4.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 303.91 on 394 degrees of freedom
## Residual deviance: 284.68 on 393 degrees of freedom
## AIC: 288.68
##
## Number of Fisher Scoring iterations: 5
Next we create an object called glm.probs and utilize the predict() function to train our data with. We set the argument to type = ‘response’ to tell R that we want a ‘yes’ or ‘no’ response from our dataset.
glm.probs=predict(glm.fit1,schoolsup,type = "response")
Here we utilize the contrasts() function to look at how R has create our dummy variables. We have a ‘0’ for no and a ‘1’ for yes in our analysis.
contrasts(schoolsup)
## yes
## no 0
## yes 1
Here we print the first 10 probabilities of our dataset. This gives us an understanding of what our probabilities look like.
glm.probs[1:10]
## 1 2 3 4 5 6
## 0.30976274 0.30976274 0.22488266 0.04823361 0.26515750 0.04823361
## 7 8 9 10
## 0.08883421 0.26515750 0.03915192 0.05929191
Next we create vectors for our response variables. We create two vectors based on whether the predicted probability of school supplies being bought by the family is greater than or less than 0.5.
glm.pred=rep("no",395)
glm.pred[glm.probs>.5]="yes"
Next we create a confusion matrix to look at our data. The reason it is called a confusion matrix is due to the way our data is output. We want to look at the data diagonally from top left to bottom right in the table, and then from top right to bottom left. The top left to bottom right are the correct predictions, while the top right to bottom left are incorrect predictions.
table(glm.pred, schoolsup)
## schoolsup
## glm.pred no yes
## no 344 51
Next we look to see how well our model works. The mean() function shows us what fraction for which the prediction was correct. Our first line of code shows us our percentage correctly predicted. While our second line of code uses the != argument which means ‘does not or did not’ correctly predict the influence of schoolsup on G3.
mean(glm.pred== schoolsup)
## [1] 0.8708861
mean(glm.pred!= schoolsup)
## [1] 0.1291139
We are able to see our model predicts the impact of G3 on schoolsup 87% of the time and that we do not properly predict the impact 13% of the time.