Part I Introduction and Data Acquisition

1.1 Problem Scoping and Diagnosis

In any educational institution, it is essential to provide added support to students who are at high risk of failing a course, so that their performance can be turned around. Identifying such students manually is a complex task for academic professionals on top of their other duties. The purpose of this project is to build a tool that makes this prediction by identifying the socioeconomic factors that affect student grades, helping both students and educators.

1.2 Goals and Objectives of the Project

The goal of this project is to build prediction models that (1) classify whether a student will pass, based on their alcohol consumption, and (2) predict the student's final grade from their demographic and social conditions.

We use a dataset from the Machine Learning Repository at the University of California, Irvine. Source of the dataset: Paulo Cortez, University of Minho, Guimarães, Portugal (2008).

1.3 Dataset Description

This dataset (student-por.csv) consists of 649 observations on 33 variables. It covers grades as well as demographic, social, and school-related attributes for students in a Portuguese language course. The variable G3 is the target variable; it is also used to classify students into Pass vs. Fail. These outcomes will be predicted from the other, independent variables.

Part II Data Preparation

2.1 Import Data

#Data source downloaded from : https://archive.ics.uci.edu/ml/datasets/student+performance

data <- read.csv("student-por.csv")

dim(data)
## [1] 649  33
str(data)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : chr  "GP" "GP" "GP" "GP" ...
##  $ sex       : chr  "F" "F" "F" "F" ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : chr  "U" "U" "U" "U" ...
##  $ famsize   : chr  "GT3" "GT3" "LE3" "GT3" ...
##  $ Pstatus   : chr  "A" "T" "T" "T" ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : chr  "at_home" "at_home" "at_home" "health" ...
##  $ Fjob      : chr  "teacher" "other" "other" "services" ...
##  $ reason    : chr  "course" "course" "other" "home" ...
##  $ guardian  : chr  "mother" "father" "mother" "mother" ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : chr  "yes" "no" "yes" "no" ...
##  $ famsup    : chr  "no" "yes" "no" "yes" ...
##  $ paid      : chr  "no" "no" "no" "no" ...
##  $ activities: chr  "no" "no" "no" "yes" ...
##  $ nursery   : chr  "yes" "no" "yes" "yes" ...
##  $ higher    : chr  "yes" "yes" "yes" "yes" ...
##  $ internet  : chr  "no" "yes" "yes" "yes" ...
##  $ romantic  : chr  "no" "no" "no" "yes" ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
summary(data)
##     school              sex                 age          address         
##  Length:649         Length:649         Min.   :15.00   Length:649        
##  Class :character   Class :character   1st Qu.:16.00   Class :character  
##  Mode  :character   Mode  :character   Median :17.00   Mode  :character  
##                                        Mean   :16.74                     
##                                        3rd Qu.:18.00                     
##                                        Max.   :22.00                     
##    famsize            Pstatus               Medu            Fedu      
##  Length:649         Length:649         Min.   :0.000   Min.   :0.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :2.000   Median :2.000  
##                                        Mean   :2.515   Mean   :2.307  
##                                        3rd Qu.:4.000   3rd Qu.:3.000  
##                                        Max.   :4.000   Max.   :4.000  
##      Mjob               Fjob              reason            guardian        
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##    traveltime      studytime        failures       schoolsup        
##  Min.   :1.000   Min.   :1.000   Min.   :0.0000   Length:649        
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:0.0000   Class :character  
##  Median :1.000   Median :2.000   Median :0.0000   Mode  :character  
##  Mean   :1.569   Mean   :1.931   Mean   :0.2219                     
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:0.0000                     
##  Max.   :4.000   Max.   :4.000   Max.   :3.0000                     
##     famsup              paid            activities          nursery         
##  Length:649         Length:649         Length:649         Length:649        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##     higher            internet           romantic             famrel     
##  Length:649         Length:649         Length:649         Min.   :1.000  
##  Class :character   Class :character   Class :character   1st Qu.:4.000  
##  Mode  :character   Mode  :character   Mode  :character   Median :4.000  
##                                                           Mean   :3.931  
##                                                           3rd Qu.:5.000  
##                                                           Max.   :5.000  
##     freetime        goout            Dalc            Walc          health     
##  Min.   :1.00   Min.   :1.000   Min.   :1.000   Min.   :1.00   Min.   :1.000  
##  1st Qu.:3.00   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:1.00   1st Qu.:2.000  
##  Median :3.00   Median :3.000   Median :1.000   Median :2.00   Median :4.000  
##  Mean   :3.18   Mean   :3.185   Mean   :1.502   Mean   :2.28   Mean   :3.536  
##  3rd Qu.:4.00   3rd Qu.:4.000   3rd Qu.:2.000   3rd Qu.:3.00   3rd Qu.:5.000  
##  Max.   :5.00   Max.   :5.000   Max.   :5.000   Max.   :5.00   Max.   :5.000  
##     absences            G1             G2              G3       
##  Min.   : 0.000   Min.   : 0.0   Min.   : 0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:10.0   1st Qu.:10.00   1st Qu.:10.00  
##  Median : 2.000   Median :11.0   Median :11.00   Median :12.00  
##  Mean   : 3.659   Mean   :11.4   Mean   :11.57   Mean   :11.91  
##  3rd Qu.: 6.000   3rd Qu.:13.0   3rd Qu.:13.00   3rd Qu.:14.00  
##  Max.   :32.000   Max.   :19.0   Max.   :19.00   Max.   :19.00

2.2 Data Cleaning

Data often come from different sources and rarely arrive in the right format for a machine to process, so data cleaning is an important part of any data science project. We need to check for missing data and remove columns that are not correlated with the final grade variable, after identifying the related variables through exploratory data analysis using histograms.

#Find Missing data: 
sum(is.na(data))
## [1] 0

There is no missing data in the dataset.

The data are further cleaned by selecting only the necessary socioeconomic variables, leaving the dataset in a tidy format.

The response variable "passed" is added for the purpose of predictive analytics. We assume that a final grade above 9.99 (i.e., 10 or more out of 20) is a passing score, and anything below that is failing.

data$passed <- ifelse(data$G3 > 9.99, 1, 0)

data <- dplyr::select(data, age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout, Dalc, Walc, health, absences, G1, G2, G3, passed)

str(data)
## 'data.frame':    649 obs. of  17 variables:
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
##  $ passed    : num  1 1 1 1 1 1 1 1 1 1 ...

Part III Descriptive and Exploratory Analysis of the Data

3.1 Visualization of significant predictors

Let us pursue our hunt for significant predictors of student performance within the remaining set of variables. First, we plot a set of histograms contrasting performance with other factors. Because of the large number of columns, we limit the display to the variables that appeal most to basic common sense: number of absences, age, commute time, mother's and father's education, family relationship level, how much the student goes out, and past failures.

#Plot histograms with density overlays for each retained variable (psych package):
multi.hist(data, freq = F, dcol = "red", dlty = c("dotted", "solid"))

#Find correlations
res<-Hmisc::rcorr(as.matrix(data))

From the histograms and correlation coefficients, we can see that some variables make much better predictors than others.

3.2 Correlation Matrix

Positive correlations are displayed in blue and negative correlations in red. Color intensity and circle size are proportional to the correlation coefficients.

corrplot(res$r, type = "upper", order = "hclust",  # insignificant correlations are left blank
         p.mat = res$P, sig.level = 0.01, insig = "blank")

#G1 and G2 (the first- and second-period grades) are highly correlated with the final grade G3.
#Since we propose to predict G3 without them, we remove G1 and G2 from our model.

student_data <- dplyr::select(data, -c(G1, G2))

3.3 Dealing with nominal variables

The fastDummies library is used to convert the ordinal variables related to social activity and alcohol consumption (goout, Dalc, and Walc) into binary dummy variables, to improve classification results.

#Change the ordinal variables to binary dummy variables
student_data <- fastDummies::dummy_cols(student_data, select_columns = c("goout", "Dalc", "Walc"),
                                        remove_most_frequent_dummy = T, remove_selected_columns = T)
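As a quick sanity check (our own addition, base R only), we can list the dummy columns that were created. With remove_most_frequent_dummy = T, the modal level of each variable (goout_3, Dalc_1, and Walc_1, judging by the columns used in the model formulas below) is dropped and serves as the implicit reference category.

#List the dummy columns fastDummies created; the most frequent level of each
#variable has been dropped and acts as the reference category.
grep("^(goout|Dalc|Walc)_", names(student_data), value = TRUE)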

Part IV Data Analysis & Model Building

4.1 Splitting the dataset into the Training set and Test set

The usual practice in machine learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which the model has not been exposed to before. To ensure that both samples are true representations of the dataset, we check the class proportions of the split.

#Splitting data on train and test set
#create train and test set
set.seed(333)

trainingRowIndex <- sample(1:nrow(student_data), 0.75*nrow(student_data))  # row indices for training data
train_data <- student_data[trainingRowIndex, ]  # model training data
dim(train_data)                                 # Training data characteristics
## [1] 486  24
with(train_data, table(passed, useNA = "always"))
## passed
##    0    1 <NA> 
##   80  406    0
test_data  <- student_data[-trainingRowIndex, ]   # model test data
dim(test_data)                                    # Testing data characteristics
## [1] 163  24
with(test_data, table(passed, useNA = "always"))
## passed
##    0    1 <NA> 
##   20  143    0
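To see those counts as proportions, a small sketch using base R's prop.table confirms that both splits carry a similar class balance: roughly 84% of the training students and about 88% of the test students passed.

#Pass/fail proportions in each split; similar values indicate the random
#split is representative of the full dataset.
round(prop.table(with(train_data, table(passed))), 3)
round(prop.table(with(test_data, table(passed))), 3)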

4.2 Model Fitting

4.2.1 Logistic regression model – classification

We build our first model to predict the binary outcome passed, based on alcohol consumption, using logistic regression.

#Train a logistic regression model:

blr1 <- glm(passed ~ goout_1+goout_2+goout_4+goout_5+Dalc_2+Dalc_3+Dalc_4+Dalc_5+Walc_2+Walc_3+Walc_4+Walc_5, data=train_data, family = "binomial")

summary(blr1)  
## 
## Call:
## glm(formula = passed ~ goout_1 + goout_2 + goout_4 + goout_5 + 
##     Dalc_2 + Dalc_3 + Dalc_4 + Dalc_5 + Walc_2 + Walc_3 + Walc_4 + 
##     Walc_5, family = "binomial", data = train_data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1857   0.4406   0.5092   0.6550   1.1623  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.283291   0.319747   7.141 9.27e-13 ***
## goout_1     -1.130705   0.461580  -2.450   0.0143 *  
## goout_2     -0.003700   0.383405  -0.010   0.9923    
## goout_4      0.009031   0.370692   0.024   0.9806    
## goout_5     -0.568915   0.380326  -1.496   0.1347    
## Dalc_2      -0.547247   0.348868  -1.569   0.1167    
## Dalc_3       0.293384   0.637639   0.460   0.6454    
## Dalc_4      -1.177791   0.664723  -1.772   0.0764 .  
## Dalc_5      -0.964846   0.815710  -1.183   0.2369    
## Walc_2      -0.305911   0.369919  -0.827   0.4083    
## Walc_3      -0.394970   0.421499  -0.937   0.3487    
## Walc_4      -0.500909   0.473909  -1.057   0.2905    
## Walc_5      -0.550201   0.672524  -0.818   0.4133    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 434.71  on 485  degrees of freedom
## Residual deviance: 412.10  on 473  degrees of freedom
## AIC: 438.1
## 
## Number of Fisher Scoring iterations: 4
car::vif(blr1)
##  goout_1  goout_2  goout_4  goout_5   Dalc_2   Dalc_3   Dalc_4   Dalc_5 
## 1.351811 1.347480 1.434162 1.529986 1.444008 1.310073 1.261014 1.907266 
##   Walc_2   Walc_3   Walc_4   Walc_5 
## 1.547454 1.798360 1.855203 2.646943

We assessed multicollinearity by computing the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated by collinearity with the other predictors. The smallest possible value of VIF is one (no multicollinearity), and a VIF exceeding 5 or 10 indicates a problematic amount of collinearity. Our model stays well below 5.
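For reference, here is a minimal sketch of where these numbers come from, computed by hand for one dummy variable (lm_aux is our own name; note that car::vif() on a glm works from the estimated coefficient covariance matrix, so its values can differ somewhat from this ordinary-least-squares version).

#VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on
#all the other predictors in the model.
lm_aux <- lm(Walc_5 ~ goout_1 + goout_2 + goout_4 + goout_5 +
               Dalc_2 + Dalc_3 + Dalc_4 + Dalc_5 +
               Walc_2 + Walc_3 + Walc_4, data = train_data)
1 / (1 - summary(lm_aux)$r.squared)  # compare with car::vif(blr1)["Walc_5"]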

#Test model
#Make predictions on testing data, using trained model:

test_data$blr1.pred <- predict(blr1, newdata = test_data, type = 'response')


# Make confusion matrix:

ProbabilityCutoff <- 0.5
test_data$blr1.pred.probs <- 1 - test_data$blr1.pred  # predicted probability of failing

test_data$blr1.pred.passed <- ifelse(test_data$blr1.pred > ProbabilityCutoff, 1, 0)

(cm1 <- with(test_data,table(blr1.pred.passed,passed)))
##                 passed
## blr1.pred.passed   0   1
##                0   0   1
##                1  20 142
#Evaluate the model by calculating the accuracy:

CorrectPredictions1 <- cm1[1,1] + cm1[2,2]
TotalStudents1 <- nrow(test_data)

(Accuracy1 <- CorrectPredictions1/TotalStudents1)
## [1] 0.8711656
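Because only about 12% of the test students failed, overall accuracy can mask how the model treats each class. A short supplementary check (our own addition, using the cm1 matrix computed above) separates the two:

#Per-class rates: rows of cm1 are predictions, columns are actual outcomes.
Sensitivity1 <- cm1[2, 2] / sum(cm1[, 2])  # share of passing students predicted to pass
Specificity1 <- cm1[1, 1] / sum(cm1[, 1])  # share of failing students predicted to fail
c(sensitivity = Sensitivity1, specificity = Specificity1)

From the matrix above, the model identifies nearly every passing student (142 of 143) but none of the failing ones (0 of 20), which is worth keeping in mind alongside the headline accuracy.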

4.2.2 Decision tree model – Regression tree

We build our second model to predict the continuous outcome G3, based on relevant socioeconomic variables, using a regression tree.

#Train a decision tree model:
tree1 <- rpart(G3 ~ age+Medu+Fedu+traveltime+studytime+failures+famrel+freetime+health+absences+passed+goout_1+goout_2+goout_4+goout_5+Dalc_2+Dalc_3+Dalc_4+Dalc_5+Walc_2+Walc_3+Walc_4+Walc_5, data=train_data, method = 'anova')
summary(tree1)
## Call:
## rpart(formula = G3 ~ age + Medu + Fedu + traveltime + studytime + 
##     failures + famrel + freetime + health + absences + passed + 
##     goout_1 + goout_2 + goout_4 + goout_5 + Dalc_2 + Dalc_3 + 
##     Dalc_4 + Dalc_5 + Walc_2 + Walc_3 + Walc_4 + Walc_5, data = train_data, 
##     method = "anova")
##   n= 486 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.46139145      0 1.0000000 1.0038862 0.09827608
## 2 0.04849765      1 0.5386085 0.5452987 0.03718525
## 3 0.03841289      2 0.4901109 0.5218651 0.03278874
## 4 0.02106355      3 0.4516980 0.4781988 0.03059034
## 5 0.01754239      4 0.4306345 0.4943932 0.03511224
## 6 0.01337600      5 0.4130921 0.4948611 0.03538565
## 7 0.01000000      6 0.3997161 0.4847677 0.03490410
## 
## Variable importance
##    passed  failures  absences       age studytime    famrel      Medu    Dalc_4 
##        69        11         7         6         3         1         1         1 
##    Dalc_2    health 
##         1         1 
## 
## Node number 1: 486 observations,    complexity param=0.4613915
##   mean=11.89712, MSE=10.77131 
##   left son=2 (80 obs) right son=3 (406 obs)
##   Primary splits:
##       passed    < 0.5  to the left,  improve=0.46139150, (0 missing)
##       failures  < 0.5  to the right, improve=0.20734760, (0 missing)
##       studytime < 1.5  to the left,  improve=0.06425934, (0 missing)
##       Medu      < 3.5  to the left,  improve=0.05068504, (0 missing)
##       Fedu      < 2.5  to the left,  improve=0.04345307, (0 missing)
##   Surrogate splits:
##       failures < 0.5  to the right, agree=0.846, adj=0.063, (0 split)
##       famrel   < 1.5  to the left,  agree=0.837, adj=0.013, (0 split)
## 
## Node number 2: 80 observations,    complexity param=0.04849765
##   mean=6.875, MSE=9.659375 
##   left son=4 (26 obs) right son=5 (54 obs)
##   Primary splits:
##       absences < 1    to the left,  improve=0.32853860, (0 missing)
##       age      < 17.5 to the right, improve=0.11181280, (0 missing)
##       goout_1  < 0.5  to the right, improve=0.04659611, (0 missing)
##       Medu     < 3.5  to the right, improve=0.02404215, (0 missing)
##       goout_4  < 0.5  to the left,  improve=0.02274749, (0 missing)
##   Surrogate splits:
##       Dalc_4 < 0.5  to the right, agree=0.713, adj=0.115, (0 split)
## 
## Node number 3: 406 observations,    complexity param=0.03841289
##   mean=12.8867, MSE=5.04135 
##   left son=6 (34 obs) right son=7 (372 obs)
##   Primary splits:
##       failures  < 0.5  to the right, improve=0.09824462, (0 missing)
##       Medu      < 3.5  to the left,  improve=0.06358073, (0 missing)
##       studytime < 1.5  to the left,  improve=0.04891397, (0 missing)
##       age       < 18.5 to the right, improve=0.03494529, (0 missing)
##       absences  < 10.5 to the right, improve=0.02847947, (0 missing)
##   Surrogate splits:
##       age < 19.5 to the right, agree=0.924, adj=0.088, (0 split)
## 
## Node number 4: 26 observations,    complexity param=0.02106355
##   mean=4.307692, MSE=18.13609 
##   left son=8 (12 obs) right son=9 (14 obs)
##   Primary splits:
##       age      < 17.5 to the right, improve=0.23384020, (0 missing)
##       failures < 0.5  to the right, improve=0.13886010, (0 missing)
##       Medu     < 1.5  to the right, improve=0.08728346, (0 missing)
##       famrel   < 4.5  to the right, improve=0.07173153, (0 missing)
##       health   < 3.5  to the right, improve=0.06394780, (0 missing)
##   Surrogate splits:
##       Medu     < 3.5  to the right, agree=0.654, adj=0.250, (0 split)
##       Dalc_2   < 0.5  to the right, agree=0.654, adj=0.250, (0 split)
##       failures < 1.5  to the right, agree=0.615, adj=0.167, (0 split)
##       famrel   < 3.5  to the left,  agree=0.615, adj=0.167, (0 split)
##       health   < 2.5  to the right, agree=0.615, adj=0.167, (0 split)
## 
## Node number 5: 54 observations
##   mean=8.111111, MSE=0.8765432 
## 
## Node number 6: 34 observations
##   mean=10.55882, MSE=1.364187 
## 
## Node number 7: 372 observations,    complexity param=0.01754239
##   mean=13.09946, MSE=4.836881 
##   left son=14 (89 obs) right son=15 (283 obs)
##   Primary splits:
##       studytime < 1.5  to the left,  improve=0.05103701, (0 missing)
##       Medu      < 3.5  to the left,  improve=0.04687513, (0 missing)
##       age       < 16.5 to the left,  improve=0.02670706, (0 missing)
##       health    < 2.5  to the right, improve=0.02206145, (0 missing)
##       absences  < 10.5 to the right, improve=0.02194982, (0 missing)
##   Surrogate splits:
##       Walc_5   < 0.5  to the right, agree=0.772, adj=0.045, (0 split)
##       freetime < 4.5  to the right, agree=0.766, adj=0.022, (0 split)
##       Dalc_5   < 0.5  to the right, agree=0.766, adj=0.022, (0 split)
## 
## Node number 8: 12 observations
##   mean=2.083333, MSE=13.24306 
## 
## Node number 9: 14 observations
##   mean=6.214286, MSE=14.45408 
## 
## Node number 14: 89 observations
##   mean=12.21348, MSE=3.740942 
## 
## Node number 15: 283 observations,    complexity param=0.013376
##   mean=13.37809, MSE=4.857047 
##   left son=30 (130 obs) right son=31 (153 obs)
##   Primary splits:
##       age        < 16.5 to the left,  improve=0.05094157, (0 missing)
##       Medu       < 3.5  to the left,  improve=0.04635605, (0 missing)
##       traveltime < 3.5  to the right, improve=0.03387211, (0 missing)
##       absences   < 11   to the right, improve=0.02660105, (0 missing)
##       health     < 2.5  to the right, improve=0.02462094, (0 missing)
##   Surrogate splits:
##       traveltime < 2.5  to the right, agree=0.572, adj=0.069, (0 split)
##       famrel     < 2.5  to the left,  agree=0.565, adj=0.054, (0 split)
##       Medu       < 3.5  to the right, agree=0.558, adj=0.038, (0 split)
##       health     < 4.5  to the right, agree=0.558, adj=0.038, (0 split)
##       goout_2    < 0.5  to the right, agree=0.558, adj=0.038, (0 split)
## 
## Node number 30: 130 observations
##   mean=12.83846, MSE=3.48929 
## 
## Node number 31: 153 observations
##   mean=13.8366, MSE=5.561536
#Plot the fitted rpart tree using rattle's pretty plotter:
fancyRpartPlot(tree1, caption = "Regression Tree")

In the plotted regression tree, each node box displays the predicted mean value of G3 at that node and the percentage of the training observations that reach it.

#Make predictions on testing data, using trained model:
test_data$tree1.pred.G3 <- predict(tree1, newdata = test_data)

#Visualize predictions:
with(test_data, plot(G3,tree1.pred.G3, main="Actual vs Predicted, testing data",xlab = "Actual G3",ylab = "Predicted G3"))
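Before thresholding the continuous predictions, we can summarize their fit with a single number (a small sketch we add here): the root-mean-squared error on the test set.

#Root-mean-squared error of the predicted final grades on the test set.
sqrt(mean((test_data$G3 - test_data$tree1.pred.G3)^2))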

#Make confusion matrix:
  
PredictionCutoff <- 10.99 # cutoff in the 9-11 range; an ROC curve can guide this choice (see the sketch below)

test_data$tree1.pred.passed <- ifelse(test_data$tree1.pred.G3 > PredictionCutoff, 1, 0)

(cm2 <- with(test_data,table(tree1.pred.passed,passed)))
##                  passed
## tree1.pred.passed   0   1
##                 0  20  17
##                 1   0 126
#Evaluate the model by calculating the accuracy:
  
CorrectPredictions2 <- cm2[1,1] + cm2[2,2]
TotalStudents2 <- nrow(test_data)

(Accuracy2 <- CorrectPredictions2/TotalStudents2)
## [1] 0.8957055
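As noted in the cutoff comment above, an ROC curve can guide the choice of PredictionCutoff. A hedged sketch of that step, assuming the pROC package (an extra dependency, not among the libraries loaded in this project):

#Sketch only: trace the ROC curve of the continuous predictions against the
#true pass/fail labels and report the threshold that best balances
#sensitivity and specificity.
library(pROC)  # assumed extra dependency
roc_obj <- roc(response = test_data$passed, predictor = test_data$tree1.pred.G3)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))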
#Final output dataset for visual validation, using predicted results:
predicted_data <- dplyr::select(test_data, -c(goout_1, goout_2, goout_4, goout_5,
                                              Dalc_2, Dalc_3, Dalc_4, Dalc_5,
                                              Walc_2, Walc_3, Walc_4, Walc_5,
                                              blr1.pred, blr1.pred.probs, tree1.pred.passed))

Part V Conclusion and Validity

5.1 Explaining the validity of the model

The logistic regression model (classification) and the decision tree model (regression tree) both performed well in this project, showing 87.12% and 89.57% accuracy on the test data, respectively.

5.2 Conclusion

The essence of building a student performance predictor is for the model to effectively identify a high-risk student by determining their ability to pass and their final grade, based on relevant factors. A model is not doing very well if it cannot classify both categories (pass and fail) effectively. While we expect some errors in our predictions, we also expect our model to do a good job, as shown by the confusion matrices and accuracy calculations. This student performance predictor was built for academic purposes only.

# libraries used in the project
library(dplyr)     # data manipulation (select)
library(readr)     # data import
library(stringr)   # string handling
library(ggplot2)   # plotting
library(corrplot)  # correlation matrix visualization (corrplot)
library(Hmisc)     # correlation/significance matrix (rcorr)
library(psych)     # histograms with density overlays (multi.hist)
library(rattle)    # decision tree plotting (fancyRpartPlot)
library(rpart)     # regression trees (rpart)
library(car)       # variance inflation factors (vif)
# fastDummies is called via the :: operator above, so it is not attached here