Educational Predictive Analytics Tutorial
1 Goals
- Practice using decision tree and logistic regression models to predict which students will pass or fail, using the student-por.csv data.
- Fine-tune predictive models.
- Compare the results of different predictive models and choose the best one.
- Brainstorm about how the predictions would be used in an educational setting.
2 Important reminders
Anywhere you see the word MODIFY is a place where you might consider making changes to the code.
If you are not certain about any interpretations of results—especially confusion matrices, accuracy, sensitivity, and specificity—stop and ask an instructor for assistance.
Link to this document: https://rpubs.com/anshulkumar/EducAnalytics1
3 Load relevant packages
Step 1: Load packages
if (!require(PerformanceAnalytics)) install.packages('PerformanceAnalytics')
if (!require(rpart)) install.packages('rpart')
if (!require(rpart.plot)) install.packages('rpart.plot')
if (!require(car)) install.packages('car')
if (!require(rattle)) install.packages('rattle')
library(PerformanceAnalytics)
library(rpart)
library(rpart.plot)
library(car)
library(rattle)
4 Import and describe data
We will use the student-por.csv data.
Step 2: Import data
d <- read.csv("student-por.csv")
Data source and details:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. Available at https://archive.ics.uci.edu/ml/datasets/Student+Performance.
Alternate source: https://www.kaggle.com/larsen0966/student-performance-data-set
Step 3: List variables
names(d)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "guardian" "traveltime" "studytime" "failures"
## [16] "schoolsup" "famsup" "paid" "activities" "nursery"
## [21] "higher" "internet" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3"
Description of dataset and all variables: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
All variables joined with + for easy copying and pasting into model formulas (remember to delete the outcome variable, G3 or passed, from the predictor list when you paste):
(b <- paste(names(d), collapse="+"))
## [1] "school+sex+age+address+famsize+Pstatus+Medu+Fedu+Mjob+Fjob+reason+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+higher+internet+romantic+famrel+freetime+goout+Dalc+Walc+health+absences+G1+G2+G3"
Step 4: Calculate number of observations
nrow(d)
## [1] 649
Step 5: Generate binary version of dependent variable, G3 (final grade, 0 to 20).
d$passed <- ifelse(d$G3 > 9.99, 1, 0)
We’re assuming that a score above 9.99 (i.e., 10 or higher, since grades are whole numbers) counts as passing, and anything below that as failing.
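As a quick sanity check (our own addition, using only base R), cross-tabulate the new variable against the threshold rule; all 649 students should land on the agreeing diagonal:
with(d, table(passed, G3 >= 10))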
Step 6: Descriptive statistics for G3, a continuous numeric variable.
summary(d$G3)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 10.00 12.00 11.91 14.00 19.00
sd(d$G3)
## [1] 3.230656
Step 7: Tabulate who passed and who failed (passed is a binary qualitative categorical variable).
with(d, table(passed, useNA = "always"))
## passed
## 0 1 <NA>
## 100 549 0
Step 8: Histogram
hist(d$G3)
Step 9: Scatterplots
plot(d$G1, d$G3) # MODIFY which variables you plot
plot(d$health, d$G3)
Step 10: Selected correlations (optional)
chart.Correlation(d[c("G3","G1","Medu","failures")], histogram=TRUE, pch=19)
5 Divide training and testing data
Step 11: Divide data
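Note: sample() draws a random subset, so your split (and all downstream numbers) will differ from run to run. If you want reproducible results, a minimal addition of ours is to fix the random seed before splitting (any integer works; 1234 is arbitrary):
set.seed(1234) # MODIFY or delete for a fresh random split each run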
trainingRowIndex <- sample(1:nrow(d), 0.75*nrow(d)) # row indices for training data
dtrain <- d[trainingRowIndex, ] # model training data
dtest <- d[-trainingRowIndex, ] # test data
5.1 Training data characteristics
Step 12: Examine training data
nrow(dtrain)
## [1] 486
with(dtrain, table(passed, useNA = "always"))
## passed
## 0 1 <NA>
## 79 407 0
5.2 Testing data characteristics
Step 13: Examine testing data
nrow(dtest)
## [1] 163
with(dtest, table(passed, useNA = "always"))
## passed
## 0 1 <NA>
## 21 142 0
6 Decision tree model – regression tree
Activity summary:
- Goal: predict continuous outcome G3, using a regression tree.
- Start by using all variables to make the decision tree. Check predictive capability.
- Remove variables to make predictions with less information.
- Modify cutoff thresholds and see how the confusion matrix changes.
- Anywhere you see the word MODIFY is a place where you might consider making changes to the code.
- Figure out which students to remediate.
6.1 Train and inspect model
Step 14: Train a decision tree model
tree1 <- rpart(G3 ~ school+sex+age+address+famsize+Pstatus+Medu+Fedu+Mjob+Fjob+reason+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+higher+internet+romantic+famrel+freetime+goout+Dalc+Walc+health+absences+G1+G2, data=dtrain, method = 'anova')
# MODIFY. Try without G1 and G2. Then try other combinations.
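# For example, one possible modification (ours, not the only reasonable one):
# tree1b <- rpart(G3 ~ failures + studytime + absences + Medu + higher,
#                 data = dtrain, method = 'anova')
# prp(tree1b) # then repeat Steps 16-21 with tree1b and compare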
summary(tree1)
## Call:
## rpart(formula = G3 ~ school + sex + age + address + famsize +
## Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian +
## traveltime + studytime + failures + schoolsup + famsup +
## paid + activities + nursery + higher + internet + romantic +
## famrel + freetime + goout + Dalc + Walc + health + absences +
## G1 + G2, data = dtrain, method = "anova")
## n= 486
##
## CP nsplit rel error xerror xstd
## 1 0.51254992 0 1.0000000 1.0014104 0.09835703
## 2 0.14621244 1 0.4874501 0.4933781 0.06289929
## 3 0.08751316 2 0.3412376 0.3635385 0.03806078
## 4 0.03585219 3 0.2537245 0.2786784 0.03812384
## 5 0.03169099 4 0.2178723 0.2721950 0.04066375
## 6 0.01847363 5 0.1861813 0.2236670 0.03828464
## 7 0.01000000 6 0.1677077 0.2042922 0.03802597
##
## Variable importance
## G2 G1 Medu failures school reason absences
## 45 26 6 6 5 5 2
## address famsize traveltime studytime Mjob famsup freetime
## 1 1 1 1 1 1 1
##
## Node number 1: 486 observations, complexity param=0.5125499
## mean=11.78395, MSE=10.95127
## left son=2 (254 obs) right son=3 (232 obs)
## Primary splits:
## G2 < 11.5 to the left, improve=0.51254990, (0 missing)
## G1 < 11.5 to the left, improve=0.43964680, (0 missing)
## failures < 0.5 to the right, improve=0.19056210, (0 missing)
## higher splits as LR, improve=0.11339540, (0 missing)
## school splits as RL, improve=0.07697885, (0 missing)
## Surrogate splits:
## G1 < 11.5 to the left, agree=0.879, adj=0.746, (0 split)
## failures < 0.5 to the right, agree=0.615, adj=0.194, (0 split)
## Medu < 3.5 to the left, agree=0.613, adj=0.190, (0 split)
## school splits as RL, agree=0.611, adj=0.185, (0 split)
## reason splits as LRLR, agree=0.601, adj=0.164, (0 split)
##
## Node number 2: 254 observations, complexity param=0.1462124
## mean=9.519685, MSE=7.22599
## left son=4 (27 obs) right son=5 (227 obs)
## Primary splits:
## G2 < 7.5 to the left, improve=0.42398820, (0 missing)
## G1 < 8.5 to the left, improve=0.29191630, (0 missing)
## failures < 0.5 to the right, improve=0.10862330, (0 missing)
## school splits as RL, improve=0.07848752, (0 missing)
## higher splits as LR, improve=0.05366950, (0 missing)
## Surrogate splits:
## G1 < 5.5 to the left, agree=0.902, adj=0.074, (0 split)
##
## Node number 3: 232 observations, complexity param=0.08751316
## mean=14.26293, MSE=3.271385
## left son=6 (125 obs) right son=7 (107 obs)
## Primary splits:
## G2 < 13.5 to the left, improve=0.61369750, (0 missing)
## G1 < 14.5 to the left, improve=0.46371290, (0 missing)
## schoolsup splits as RL, improve=0.06600864, (0 missing)
## studytime < 1.5 to the left, improve=0.04155828, (0 missing)
## paid splits as RL, improve=0.03131116, (0 missing)
## Surrogate splits:
## G1 < 13.5 to the left, agree=0.819, adj=0.607, (0 split)
## Medu < 3.5 to the left, agree=0.591, adj=0.112, (0 split)
## studytime < 2.5 to the left, agree=0.591, adj=0.112, (0 split)
## Mjob splits as LLLLR, agree=0.586, adj=0.103, (0 split)
## Fedu < 3.5 to the left, agree=0.578, adj=0.084, (0 split)
##
## Node number 4: 27 observations, complexity param=0.03585219
## mean=4.444444, MSE=13.95062
## left son=8 (15 obs) right son=9 (12 obs)
## Primary splits:
## absences < 1 to the left, improve=0.5065929, (0 missing)
## G2 < 5.5 to the left, improve=0.3580810, (0 missing)
## famsup splits as LR, improve=0.2168142, (0 missing)
## traveltime < 1.5 to the right, improve=0.1387611, (0 missing)
## freetime < 3.5 to the right, improve=0.1283134, (0 missing)
## Surrogate splits:
## address splits as LR, agree=0.704, adj=0.333, (0 split)
## famsize splits as LR, agree=0.704, adj=0.333, (0 split)
## traveltime < 1.5 to the right, agree=0.704, adj=0.333, (0 split)
## famsup splits as LR, agree=0.667, adj=0.250, (0 split)
## freetime < 3.5 to the right, agree=0.667, adj=0.250, (0 split)
##
## Node number 5: 227 observations, complexity param=0.03169099
## mean=10.12335, MSE=2.998001
## left son=10 (83 obs) right son=11 (144 obs)
## Primary splits:
## G2 < 9.5 to the left, improve=0.24784420, (0 missing)
## G1 < 8.5 to the left, improve=0.19138490, (0 missing)
## failures < 0.5 to the right, improve=0.10049120, (0 missing)
## higher splits as LR, improve=0.04495073, (0 missing)
## goout < 4.5 to the right, improve=0.03829217, (0 missing)
## Surrogate splits:
## G1 < 8.5 to the left, agree=0.744, adj=0.301, (0 split)
## failures < 0.5 to the right, agree=0.683, adj=0.133, (0 split)
## higher splits as LR, agree=0.661, adj=0.072, (0 split)
## famrel < 1.5 to the left, agree=0.643, adj=0.024, (0 split)
## absences < 11.5 to the right, agree=0.643, adj=0.024, (0 split)
##
## Node number 6: 125 observations
## mean=12.952, MSE=0.813696
##
## Node number 7: 107 observations, complexity param=0.01847363
## mean=15.79439, MSE=1.789501
## left son=14 (80 obs) right son=15 (27 obs)
## Primary splits:
## G2 < 16.5 to the left, improve=0.51349590, (0 missing)
## G1 < 15.5 to the left, improve=0.40601700, (0 missing)
## studytime < 3.5 to the left, improve=0.04418029, (0 missing)
## Mjob splits as RRLLL, improve=0.03995700, (0 missing)
## absences < 0.5 to the right, improve=0.03431453, (0 missing)
## Surrogate splits:
## G1 < 16.5 to the left, agree=0.897, adj=0.593, (0 split)
## studytime < 3.5 to the left, agree=0.757, adj=0.037, (0 split)
##
## Node number 8: 15 observations
## mean=2.066667, MSE=11.79556
##
## Node number 9: 12 observations
## mean=7.416667, MSE=0.7430556
##
## Node number 10: 83 observations
## mean=8.987952, MSE=3.554072
##
## Node number 11: 144 observations
## mean=10.77778, MSE=1.506173
##
## Node number 14: 80 observations
## mean=15.2375, MSE=0.9810938
##
## Node number 15: 27 observations
## mean=17.44444, MSE=0.5432099
6.2 Tree visualization
Step 15: Visualize decision tree model in two ways.
prp(tree1)
fancyRpartPlot(tree1, caption = "Regression Tree")
6.3 Test model
Step 16: Make predictions on testing data, using trained model
dtest$tree1.pred <- predict(tree1, newdata = dtest)
Step 17: Visualize predictions
with(dtest, plot(G3, tree1.pred, main = "Actual vs Predicted, testing data", xlab = "Actual G3", ylab = "Predicted G3"))
Step 18: Make confusion matrix.
PredictionCutoff <- 9.99 # MODIFY. Compare values in 9-11 range.
dtest$tree1.pred.passed <- ifelse(dtest$tree1.pred > PredictionCutoff, 1, 0)
(cm1 <- with(dtest,table(tree1.pred.passed,passed)))
## passed
## tree1.pred.passed 0 1
## 0 19 16
## 1 2 126
Step 19: Calculate accuracy
CorrectPredictions1 <- cm1[1,1] + cm1[2,2]
TotalStudents1 <- nrow(dtest)
(Accuracy1 <- CorrectPredictions1/TotalStudents1)
## [1] 0.8895706
Step 20: Sensitivity (proportion of people who actually failed that were correctly predicted to fail).
(Sensitivity1 <- cm1[1,1]/(cm1[1,1]+cm1[2,1]))
## [1] 0.9047619
Step 21: Specificity (proportion of people who actually passed that were correctly predicted to pass).
(Specificity1 <- cm1[2,2]/(cm1[1,2]+cm1[2,2]))
## [1] 0.8873239
BE SURE TO DOUBLE-CHECK THE CALCULATIONS ABOVE MANUALLY!
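For instance, using only the counts printed in cm1 above (your counts will differ if you re-ran the random split):
(19 + 126) / 163 # accuracy: correct predictions out of all 163 test students
19 / (19 + 2)    # sensitivity: predicted failures among the 21 actual failures
126 / (16 + 126) # specificity: predicted passes among the 142 actual passes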
Step 22: It is very important for you, the data analyst, to modify the 9.99 cutoff assigned as PredictionCutoff above to see how you can change the predictions made by the model. Write down what you observe as you change this value and re-run the confusion matrix, accuracy, sensitivity, and specificity code above. What are the implications of your manual modification of this cutoff? Remind your instructors to discuss this, in case they forget!
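Here is a minimal sketch of how that exploration could be automated (our own addition; the candidate cutoffs and variable names are ours to modify):
cutoffs <- c(9, 9.5, 10, 10.5, 11) # MODIFY
for (co in cutoffs) {
  pred <- factor(ifelse(dtest$tree1.pred > co, 1, 0), levels = c(0, 1))
  cm <- table(pred, actual = factor(dtest$passed, levels = c(0, 1)))
  cat("cutoff =", co,
      "accuracy =", round(sum(diag(cm)) / sum(cm), 3),
      "sensitivity =", round(cm[1, 1] / sum(cm[, 1]), 3),
      "specificity =", round(cm[2, 2] / sum(cm[, 2]), 3), "\n")
}
Fixing the factor levels keeps each table 2 x 2 even when a cutoff never predicts one of the classes.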
7 Decision tree model – classification tree
Activity summary:
- Goal: predict binary outcome passed, using a classification tree.
- Start by using all variables to make the decision tree. Check predictive capability.
- Remove variables to make predictions with less information.
- Modify cutoff thresholds and see how the confusion matrix changes.
- Anywhere you see the word MODIFY is a place where you might consider making changes to the code.
- Figure out which students to remediate.
7.1 Train and inspect model
Step 23: Train a decision tree model
tree2 <- rpart(passed ~ school+sex+age+address+famsize+Pstatus+Medu+Fedu+Mjob+Fjob+reason+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+higher+internet+romantic+famrel+freetime+goout+Dalc+Walc+health+absences+G1+G2, data=dtrain, method = "class")
# MODIFY. Try without G1 and G2. Then try other combinations.
summary(tree2)
## Call:
## rpart(formula = passed ~ school + sex + age + address + famsize +
## Pstatus + Medu + Fedu + Mjob + Fjob + reason + guardian +
## traveltime + studytime + failures + schoolsup + famsup +
## paid + activities + nursery + higher + internet + romantic +
## famrel + freetime + goout + Dalc + Walc + health + absences +
## G1 + G2, data = dtrain, method = "class")
## n= 486
##
## CP nsplit rel error xerror xstd
## 1 0.59493671 0 1.0000000 1.0000000 0.10295929
## 2 0.03164557 1 0.4050633 0.4050633 0.06920822
## 3 0.01000000 3 0.3417722 0.4556962 0.07308231
##
## Variable importance
## G2 G1 Mjob traveltime absences age goout
## 63 29 3 1 1 1 1
## higher studytime
## 1 1
##
## Node number 1: 486 observations, complexity param=0.5949367
## predicted class=1 expected loss=0.1625514 P(node) =1
## class counts: 79 407
## probabilities: 0.163 0.837
## left son=2 (59 obs) right son=3 (427 obs)
## Primary splits:
## G2 < 8.5 to the left, improve=72.70349, (0 missing)
## G1 < 8.5 to the left, improve=61.83621, (0 missing)
## failures < 0.5 to the right, improve=18.50123, (0 missing)
## school splits as RL, improve=14.15570, (0 missing)
## higher splits as LR, improve=10.81315, (0 missing)
## Surrogate splits:
## G1 < 7.5 to the left, agree=0.920, adj=0.339, (0 split)
## absences < 21.5 to the right, agree=0.881, adj=0.017, (0 split)
##
## Node number 2: 59 observations
## predicted class=0 expected loss=0.1016949 P(node) =0.1213992
## class counts: 53 6
## probabilities: 0.898 0.102
##
## Node number 3: 427 observations, complexity param=0.03164557
## predicted class=1 expected loss=0.06088993 P(node) =0.8786008
## class counts: 26 401
## probabilities: 0.061 0.939
## left son=6 (26 obs) right son=7 (401 obs)
## Primary splits:
## G1 < 8.5 to the left, improve=8.888203, (0 missing)
## G2 < 9.5 to the left, improve=8.597908, (0 missing)
## school splits as RL, improve=1.561668, (0 missing)
## Dalc < 2.5 to the right, improve=1.490442, (0 missing)
## failures < 0.5 to the right, improve=1.434721, (0 missing)
##
## Node number 6: 26 observations, complexity param=0.03164557
## predicted class=1 expected loss=0.4615385 P(node) =0.05349794
## class counts: 12 14
## probabilities: 0.462 0.538
## left son=12 (15 obs) right son=13 (11 obs)
## Primary splits:
## Mjob splits as LLRLR, improve=2.983683, (0 missing)
## absences < 6.5 to the right, improve=2.753142, (0 missing)
## age < 17.5 to the left, improve=2.223077, (0 missing)
## Medu < 2.5 to the right, improve=1.230769, (0 missing)
## activities splits as RL, improve=1.230769, (0 missing)
## Surrogate splits:
## traveltime < 1.5 to the left, agree=0.769, adj=0.455, (0 split)
## age < 17.5 to the left, agree=0.731, adj=0.364, (0 split)
## higher splits as RL, agree=0.731, adj=0.364, (0 split)
## goout < 4.5 to the left, agree=0.731, adj=0.364, (0 split)
## studytime < 1.5 to the right, agree=0.692, adj=0.273, (0 split)
##
## Node number 7: 401 observations
## predicted class=1 expected loss=0.03491272 P(node) =0.8251029
## class counts: 14 387
## probabilities: 0.035 0.965
##
## Node number 12: 15 observations
## predicted class=0 expected loss=0.3333333 P(node) =0.0308642
## class counts: 10 5
## probabilities: 0.667 0.333
##
## Node number 13: 11 observations
## predicted class=1 expected loss=0.1818182 P(node) =0.02263374
## class counts: 2 9
## probabilities: 0.182 0.818
7.2 Tree Visualization
Step 24: Visualize decision tree model in two ways
prp(tree2)
fancyRpartPlot(tree2, caption = "Classification Tree")
7.3 Test model
Step 25: Make predictions and confusion matrix on testing data classes, using trained model.
dtest$tree2.pred <- predict(tree2, newdata = dtest, type = 'class')
# MODIFY. change 'class' to 'prob'
(cm2 <- with(dtest,table(tree2.pred,passed)))
## passed
## tree2.pred 0 1
## 0 15 3
## 1 6 139
Step 26: Make predictions and confusion matrix on testing data using probability cutoffs. Optional; results not shown.
dtest$tree2.pred <- predict(tree2, newdata = dtest, type = 'prob')
ProbabilityCutoff <- 0.5 # MODIFY. Compare different probability values.
dtest$tree2.pred.probs <- 1-dtest$tree2.pred[,1]
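# Note (our comment): type = 'prob' returns one column per class level, so
# dtest$tree2.pred[,2] is P(passed = 1) and gives the same values as the line above.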
dtest$tree2.pred.passed <- ifelse(dtest$tree2.pred.probs > ProbabilityCutoff, 1, 0)
(cm2b <- with(dtest,table(tree2.pred.passed,passed)))
Step 27: Calculate accuracy
CorrectPredictions2 <- cm2[1,1] + cm2[2,2]
TotalStudents2 <- nrow(dtest)
(Accuracy2 <- CorrectPredictions2/TotalStudents2)
## [1] 0.9447853
Step 28: Sensitivity (proportion of people who actually failed that were correctly predicted to fail)
(Sensitivity2 <- cm2[1,1]/(cm2[1,1]+cm2[2,1]))
## [1] 0.7142857
Step 29: Specificity (proportion of people who actually passed that were correctly predicted to pass):
(Specificity2 <- cm2[2,2]/(cm2[1,2]+cm2[2,2]))
## [1] 0.9788732
ALSO DOUBLE-CHECK THE CALCULATIONS ABOVE MANUALLY!
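Another quick check (our suggestion, using base R): prop.table() with margin = 2 divides each column of the confusion matrix by its column total, so the diagonal shows sensitivity and specificity directly.
prop.table(cm2, margin = 2) # [1,1] = sensitivity, [2,2] = specificity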
8 Logistic regression model – classification
Activity summary:
- Goal: predict binary outcome passed, using logistic regression.
- Start by using all variables to make a logistic regression model. Check predictive capability.
- Remove variables to make predictions with less information.
- Modify cutoff thresholds and see how the confusion matrix changes.
- Anywhere you see the word MODIFY is a place where you might consider making changes to the code.
- Figure out which students to remediate.
8.1 Train and inspect model
Step 30: Train a logistic regression model
blr1 <- glm(passed ~ school+sex+age+address+famsize+Pstatus+Medu+Fedu+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+higher+internet+romantic+famrel+freetime+goout+Dalc+Walc+health+absences+Mjob+reason+Fjob+G1+G2, data=dtrain, family = "binomial")
# MODIFY. Try without G1 and G2. Then try other combinations.
# also remove variables causing multicollinearity and see if it makes a difference!
summary(blr1)
##
## Call:
## glm(formula = passed ~ school + sex + age + address + famsize +
## Pstatus + Medu + Fedu + guardian + traveltime + studytime +
## failures + schoolsup + famsup + paid + activities + nursery +
## higher + internet + romantic + famrel + freetime + goout +
## Dalc + Walc + health + absences + Mjob + reason + Fjob +
## G1 + G2, family = "binomial", data = dtrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0757 0.0001 0.0066 0.0813 1.8149
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.15118 8.14281 -3.703 0.000213 ***
## schoolMS -1.08851 0.78243 -1.391 0.164168
## sexM -0.78627 0.73955 -1.063 0.287704
## age 0.47710 0.29663 1.608 0.107740
## addressU 0.44035 0.76060 0.579 0.562622
## famsizeLE3 0.39831 0.68341 0.583 0.560005
## PstatusT 0.67439 1.14806 0.587 0.556925
## Medu -0.22366 0.34423 -0.650 0.515856
## Fedu 0.06211 0.32804 0.189 0.849816
## guardianmother -0.13636 0.79511 -0.171 0.863836
## guardianother 0.78562 1.53926 0.510 0.609779
## traveltime 0.34318 0.38701 0.887 0.375220
## studytime 0.04504 0.37989 0.119 0.905616
## failures -0.06421 0.38269 -0.168 0.866761
## schoolsupyes -0.68289 0.97599 -0.700 0.484118
## famsupyes 0.08435 0.60815 0.139 0.889686
## paidyes -0.69395 1.19520 -0.581 0.561498
## activitiesyes -0.18586 0.59727 -0.311 0.755669
## nurseryyes -0.57052 0.67052 -0.851 0.394847
## higheryes -0.24928 0.82068 -0.304 0.761315
## internetyes -0.12180 0.71099 -0.171 0.863979
## romanticyes -0.47423 0.65288 -0.726 0.467612
## famrel 0.02052 0.28264 0.073 0.942109
## freetime 0.03488 0.28645 0.122 0.903088
## goout -0.12763 0.29909 -0.427 0.669568
## Dalc 0.10583 0.44857 0.236 0.813492
## Walc -0.22782 0.42610 -0.535 0.592878
## health -0.22539 0.24286 -0.928 0.353369
## absences -0.03952 0.06205 -0.637 0.524267
## Mjobhealth 0.62621 1.35132 0.463 0.643072
## Mjobother 0.12519 0.75506 0.166 0.868311
## Mjobservices 0.13444 1.05109 0.128 0.898221
## Mjobteacher 2.68861 1.58196 1.700 0.089216 .
## reasonhome 0.30812 0.77833 0.396 0.692199
## reasonother 0.23116 0.87319 0.265 0.791215
## reasonreputation 1.61106 1.14599 1.406 0.159775
## Fjobhealth -0.82151 2.00861 -0.409 0.682545
## Fjobother -0.27813 1.30689 -0.213 0.831469
## Fjobservices -0.34707 1.35233 -0.257 0.797453
## Fjobteacher -3.50771 1.93941 -1.809 0.070506 .
## G1 1.11496 0.30570 3.647 0.000265 ***
## G2 1.65136 0.35287 4.680 2.87e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 431.45 on 485 degrees of freedom
## Residual deviance: 114.32 on 444 degrees of freedom
## AIC: 198.32
##
## Number of Fisher Scoring iterations: 9
car::vif(blr1)
## GVIF Df GVIF^(1/(2*Df))
## school 2.673217 1 1.634998
## sex 2.386451 1 1.544814
## age 2.451568 1 1.565749
## address 2.477824 1 1.574111
## famsize 1.546042 1 1.243399
## Pstatus 1.992302 1 1.411489
## Medu 2.421625 1 1.556157
## Fedu 2.217443 1 1.489108
## guardian 2.978028 2 1.313658
## traveltime 1.758702 1 1.326160
## studytime 1.543159 1 1.242239
## failures 1.934883 1 1.391001
## schoolsup 2.041797 1 1.428915
## famsup 1.575093 1 1.255027
## paid 1.346165 1 1.160243
## activities 1.442708 1 1.201128
## nursery 1.468244 1 1.211711
## higher 1.731705 1 1.315943
## internet 2.005652 1 1.416210
## romantic 1.728680 1 1.314793
## famrel 1.701719 1 1.304500
## freetime 1.798360 1 1.341029
## goout 2.931592 1 1.712189
## Dalc 4.744859 1 2.178270
## Walc 6.404792 1 2.530769
## health 1.950421 1 1.396575
## absences 1.708603 1 1.307135
## Mjob 9.910933 4 1.332031
## reason 5.619753 3 1.333377
## Fjob 8.185609 4 1.300563
## G1 2.222563 1 1.490826
## G2 1.895424 1 1.376744
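The GVIF^(1/(2*Df)) column is the value to compare across predictors; values above about sqrt(5) ≈ 2.24 (the analogue of a VIF of 5 for a single-df term) are one common warning flag, which here points at Walc, with Dalc close behind. A minimal sketch of the experiment suggested in the MODIFY comment above (blr2 is our own name; dropping Walc is just one reasonable choice):
blr2 <- glm(passed ~ school+sex+age+address+famsize+Pstatus+Medu+Fedu+guardian+traveltime+studytime+failures+schoolsup+famsup+paid+activities+nursery+higher+internet+romantic+famrel+freetime+goout+Dalc+health+absences+Mjob+reason+Fjob+G1+G2, data=dtrain, family = "binomial")
car::vif(blr2) # did the remaining GVIF values come down?
AIC(blr1, blr2) # compare overall fit of the two models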
8.2 Test model
Step 31: Make predictions on testing data, using trained model.
Predicting probabilities…
dtest$blr1.pred <- predict(blr1, newdata = dtest, type = 'response')
# type = 'response' gives each student's predicted probability of passing
ProbabilityCutoff <- 0.5 # MODIFY. Compare different probability values.
dtest$blr1.pred.passed <- ifelse(dtest$blr1.pred > ProbabilityCutoff, 1, 0)
Step 32: Make confusion matrix
(cm3 <- with(dtest,table(blr1.pred.passed,passed)))
## passed
## blr1.pred.passed 0 1
## 0 17 5
## 1 4 137
Step 33: Calculate accuracy
CorrectPredictions3 <- cm3[1,1] + cm3[2,2]
TotalStudents3 <- nrow(dtest)
(Accuracy3 <- CorrectPredictions3/TotalStudents3)
## [1] 0.9447853
Step 34: Sensitivity (proportion of people who actually failed that were correctly predicted to fail)
(Sensitivity3 <- cm3[1,1]/(cm3[1,1]+cm3[2,1]))
## [1] 0.8095238
Step 35: Specificity (proportion of people who actually passed that were correctly predicted to pass)
(Specificity3 <- cm3[2,2]/(cm3[1,2]+cm3[2,2]))
## [1] 0.9647887
ALSO DOUBLE-CHECK THE CALCULATIONS ABOVE MANUALLY!
Step 36: It is very important for you, the data analyst, to modify the 0.5 cutoff assigned as ProbabilityCutoff above to see how you can change the predictions made by the model. Write down what you observe as you change this value and re-run the confusion matrix, accuracy, sensitivity, and specificity code above. What are the implications of your manual modification of this cutoff? Remind your instructors to discuss this, in case they forget!
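Finally, circling back to the goals in Section 1, here is a minimal sketch (our own addition) that gathers the three models' test-set metrics into a single table for comparison:
data.frame(
  model = c("regression tree", "classification tree", "logistic regression"),
  accuracy = c(Accuracy1, Accuracy2, Accuracy3),
  sensitivity = c(Sensitivity1, Sensitivity2, Sensitivity3),
  specificity = c(Specificity1, Specificity2, Specificity3)
)
Which model is "best" depends on which error is costlier in your setting: missing a student who will fail (lower sensitivity) or flagging a student who would have passed for remediation (lower specificity).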