Logistic Regression

Classification models can be an invaluable tool to a researcher. Utilizing these methods allow a research to focus on qualitative techniques to answer questions other statistical techniques cannot. In this tutorial we are going to look at the use of logistic regression on our Student Performance dataset.

The dataset is from the UCI Machine Learning Repository and can be found by clicking the following link: https://archive.ics.uci.edu/ml/datasets/Student+Performance

Step 1: Load Libraries.

In this step we need to get a few packages for our analysis. We can do this by using the following code. The dependencies = TRUE ensures we install all of the necessary packages the other packages depend on.

# install.packages(c("shiny","ggvis","reshape2","dplyr","gam","tree","randomForest"),dependencies = TRUE)
library(ggvis)
library(reshape2)
library(dplyr)
library(splines)
library(gam)

Step 2: Import the data into R.

We need to import our data into R. We can do this in several ways. In this tutorial we will be doing this via the read.csv() function that is available in base R. The header = T allows us to keep our headers from our excel sheet, and our sep = “,” tells R that our data is separated by commas. We can utilize tab / or semicolon ; as well in our ‘sep =’ argument. We are then renaming our dataset to ‘Student_Performance_Ben_Gonzalez’ which creates an object in R we can use throughout our analysis. The attach() function then ensures we have attached our dataset so we can analyze it.

##Student Performance Dataset-Ben Gonzalez ##########################

Student_Performance_Ben_Gonzalez <-read.csv("~/Datasets/Student_Performance_Ben_Gonzalez.csv",header = T, sep = ",")
attach(Student_Performance_Ben_Gonzalez)

Step 3: Look at variable names.

This will give us an overview of what variables are in our dataset. I recommend doing this to help better understand our dataset.

names(Student_Performance_Ben_Gonzalez)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "guardian"   "traveltime" "studytime"  "failures"  
## [16] "schoolsup"  "famsup"     "paid"       "activities" "nursery"   
## [21] "higher"     "internet"   "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"

Step 4: Look at a summary of our data.

The summary overview gives us a solid understanding of our data, it also may highlight extreme values for us as well.

summary(Student_Performance_Ben_Gonzalez)
##  school   sex          age       address famsize   Pstatus      Medu      
##  GP:349   F:208   Min.   :15.0   R: 88   GT3:281   A: 41   Min.   :0.000  
##  MS: 46   M:187   1st Qu.:16.0   U:307   LE3:114   T:354   1st Qu.:2.000  
##                   Median :17.0                             Median :3.000  
##                   Mean   :16.7                             Mean   :2.749  
##                   3rd Qu.:18.0                             3rd Qu.:4.000  
##                   Max.   :22.0                             Max.   :4.000  
##       Fedu             Mjob           Fjob            reason   
##  Min.   :0.000   at_home : 59   at_home : 20   course    :145  
##  1st Qu.:2.000   health  : 34   health  : 18   home      :109  
##  Median :2.000   other   :141   other   :217   other     : 36  
##  Mean   :2.522   services:103   services:111   reputation:105  
##  3rd Qu.:3.000   teacher : 58   teacher : 29                   
##  Max.   :4.000                                                 
##    guardian     traveltime      studytime        failures      schoolsup
##  father: 90   Min.   :1.000   Min.   :1.000   Min.   :0.0000   no :344  
##  mother:273   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   yes: 51  
##  other : 32   Median :1.000   Median :2.000   Median :0.0000            
##               Mean   :1.448   Mean   :2.035   Mean   :0.3342            
##               3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000            
##               Max.   :4.000   Max.   :4.000   Max.   :3.0000            
##  famsup     paid     activities nursery   higher    internet  romantic 
##  no :153   no :214   no :194    no : 81   no : 20   no : 66   no :263  
##  yes:242   yes:181   yes:201    yes:314   yes:375   yes:329   yes:132  
##                                                                        
##                                                                        
##                                                                        
##                                                                        
##      famrel         freetime         goout            Dalc      
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:4.000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :4.000   Median :3.000   Median :3.000   Median :1.000  
##  Mean   :3.944   Mean   :3.235   Mean   :3.109   Mean   :1.481  
##  3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:2.000  
##  Max.   :5.000   Max.   :5.000   Max.   :5.000   Max.   :5.000  
##       Walc           health         absences            G1       
##  Min.   :1.000   Min.   :1.000   Min.   : 0.000   Min.   : 3.00  
##  1st Qu.:1.000   1st Qu.:3.000   1st Qu.: 0.000   1st Qu.: 8.00  
##  Median :2.000   Median :4.000   Median : 4.000   Median :11.00  
##  Mean   :2.291   Mean   :3.554   Mean   : 5.709   Mean   :10.91  
##  3rd Qu.:3.000   3rd Qu.:5.000   3rd Qu.: 8.000   3rd Qu.:13.00  
##  Max.   :5.000   Max.   :5.000   Max.   :75.000   Max.   :19.00  
##        G2              G3       
##  Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 9.00   1st Qu.: 8.00  
##  Median :11.00   Median :11.00  
##  Mean   :10.71   Mean   :10.42  
##  3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :19.00   Max.   :20.00

Step 5: Look at the structure of our data.

This step allows us to look at the structure of our data and which variables are integers or factors. This will help us in determining what statistical techniques we can utilize on our data. In this tutorial we will be focusing on ‘classification’ techniques.

str(Student_Performance_Ben_Gonzalez)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...

Step 6: Logistic Regression.

Here we will use the glm() function to create our logistic regression models. The glm() works the same way our lm() function works with one difference. We need to set the argument family = to ‘binomial’ to let R know we are wanting to perform a logistic regression. We want to see if schoolsup has an effect on G3.

## Famsup Classification Prediction
glm.fit=glm(schoolsup~., family = binomial, data = Student_Performance_Ben_Gonzalez)
glm.fit1=glm(schoolsup~G1, family = binomial, data = Student_Performance_Ben_Gonzalez)
summary(glm.fit1)
## 
## Call:
## glm(formula = schoolsup ~ G1, family = binomial, data = Student_Performance_Ben_Gonzalez)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0268  -0.5864  -0.4314  -0.3144   2.5457  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.28929    0.52213   0.554     0.58    
## G1          -0.21810    0.05346  -4.080 4.51e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 303.91  on 394  degrees of freedom
## Residual deviance: 284.68  on 393  degrees of freedom
## AIC: 288.68
## 
## Number of Fisher Scoring iterations: 5

Step 7: Using the predict() function.

Next we create an object called glm.probs and utilize the predict() function to train our data with. We set the argument to type = ‘response’ to tell R that we want a ‘yes’ or ‘no’ response from our dataset.

glm.probs=predict(glm.fit1,schoolsup,type = "response")

Step 8: Using the contrasts() function.

Here we utilize the contrasts() function to look at how R has create our dummy variables. We have a ‘0’ for no and a ‘1’ for yes in our analysis.

contrasts(schoolsup)
##     yes
## no    0
## yes   1

Step 9: Print the first 10 rows of our data.

Here we print the first 10 probabilities of our dataset. This gives us an understanding of what our probabilities look like.

glm.probs[1:10]
##          1          2          3          4          5          6 
## 0.30976274 0.30976274 0.22488266 0.04823361 0.26515750 0.04823361 
##          7          8          9         10 
## 0.08883421 0.26515750 0.03915192 0.05929191

Step 10: Create our variable vectors.

Next we create vectors for our response variables. We create two vectors based on whether the predicted probability of school supplies being bought by the family is greater than or less than 0.5.

glm.pred=rep("no",395)
glm.pred[glm.probs>.5]="yes"

Step 11: Create our ‘confusion’ matrix.

Next we create a confusion matrix to look at our data. The reason it is called a confusion matrix is due to the way our data is output. We want to look at the data diagonally from top left to bottom right in the table, and then from top right to bottom left. The top left to bottom right are the correct predictions, while the top right to bottom left are incorrect predictions.

table(glm.pred, schoolsup)
##         schoolsup
## glm.pred  no yes
##       no 344  51

Step 12: Look at how well our model fits our data.

Next we look to see how well our model works. The mean() function shows us what fraction for which the prediction was correct. Our first line of code shows us our percentage correctly predicted. While our second line of code uses the != argument which means ‘does not or did not’ correctly predict the influence of schoolsup on G3.

mean(glm.pred== schoolsup)
## [1] 0.8708861
mean(glm.pred!= schoolsup)
## [1] 0.1291139

Results:

We are able to see our model predicts the impact of G3 on schoolsup 87% of the time and that we do not properly predict the impact 13% of the time.