In any educational institution, it is essential to provide added support to students who are at high risk of failing a course, so that their performance can be turned around. Identifying such students manually is a complex task for academic staff on top of their other duties. The purpose of this work is to build a tool that makes this prediction by identifying the socioeconomic factors that affect student grades, helping both students and educators.
The goal of this project is to build prediction models that classify whether a student's alcohol consumption affects their ability to pass, and that predict their final grade from their demographic and social conditions.
We use a dataset from the Machine Learning Repository at the University of California, Irvine. Source of the dataset: Paulo Cortez, University of Minho, Guimarães, Portugal (2008).
This dataset (student-por.csv) consists of 649 observations on 33 variables, covering grades, demographic, social and school-related attributes for students of a Portuguese language course. The variable G3 is the target variable and is also used to classify students into Pass vs. Fail; both outcomes are predicted from the other independent variables.
#Data source downloaded from : https://archive.ics.uci.edu/ml/datasets/student+performance
data <- read.csv("student-por.csv")
dim(data)
## [1] 649 33
str(data)
## 'data.frame': 649 obs. of 33 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ address : chr "U" "U" "U" "U" ...
## $ famsize : chr "GT3" "GT3" "LE3" "GT3" ...
## $ Pstatus : chr "A" "T" "T" "T" ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ Mjob : chr "at_home" "at_home" "at_home" "health" ...
## $ Fjob : chr "teacher" "other" "other" "services" ...
## $ reason : chr "course" "course" "other" "home" ...
## $ guardian : chr "mother" "father" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ schoolsup : chr "yes" "no" "yes" "no" ...
## $ famsup : chr "no" "yes" "no" "yes" ...
## $ paid : chr "no" "no" "no" "no" ...
## $ activities: chr "no" "no" "no" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "no" "yes" "yes" "yes" ...
## $ romantic : chr "no" "no" "no" "yes" ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
summary(data)
## school sex age address
## Length:649 Length:649 Min. :15.00 Length:649
## Class :character Class :character 1st Qu.:16.00 Class :character
## Mode :character Mode :character Median :17.00 Mode :character
## Mean :16.74
## 3rd Qu.:18.00
## Max. :22.00
## famsize Pstatus Medu Fedu
## Length:649 Length:649 Min. :0.000 Min. :0.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:1.000
## Mode :character Mode :character Median :2.000 Median :2.000
## Mean :2.515 Mean :2.307
## 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :4.000 Max. :4.000
## Mjob Fjob reason guardian
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## traveltime studytime failures schoolsup
## Min. :1.000 Min. :1.000 Min. :0.0000 Length:649
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:0.0000 Class :character
## Median :1.000 Median :2.000 Median :0.0000 Mode :character
## Mean :1.569 Mean :1.931 Mean :0.2219
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:0.0000
## Max. :4.000 Max. :4.000 Max. :3.0000
## famsup paid activities nursery
## Length:649 Length:649 Length:649 Length:649
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## higher internet romantic famrel
## Length:649 Length:649 Length:649 Min. :1.000
## Class :character Class :character Class :character 1st Qu.:4.000
## Mode :character Mode :character Mode :character Median :4.000
## Mean :3.931
## 3rd Qu.:5.000
## Max. :5.000
## freetime goout Dalc Walc health
## Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00 Min. :1.000
## 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.00 1st Qu.:2.000
## Median :3.00 Median :3.000 Median :1.000 Median :2.00 Median :4.000
## Mean :3.18 Mean :3.185 Mean :1.502 Mean :2.28 Mean :3.536
## 3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:5.000
## Max. :5.00 Max. :5.000 Max. :5.000 Max. :5.00 Max. :5.000
## absences G1 G2 G3
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:10.0 1st Qu.:10.00 1st Qu.:10.00
## Median : 2.000 Median :11.0 Median :11.00 Median :12.00
## Mean : 3.659 Mean :11.4 Mean :11.57 Mean :11.91
## 3rd Qu.: 6.000 3rd Qu.:13.0 3rd Qu.:13.00 3rd Qu.:14.00
## Max. :32.000 Max. :19.0 Max. :19.00 Max. :19.00
Data often come from different sources and rarely arrive in the right format for a machine to process, so data cleaning is an important part of any data science project. We check for missing data and remove columns that are not related to the final grade variable; the related variables are identified through exploratory data analysis using histograms and a correlation plot.
#Find Missing data:
sum(is.na(data))
## [1] 0
There is no missing data in the dataset.
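If a breakdown by column is wanted, the same check can be run per variable; a minimal sketch:
#Count missing values per column (all zeros for this dataset)
colSums(is.na(data))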
The data is further cleaned by selecting only the relevant socioeconomic variables, leaving it in a tidy format.
The response variable "passed" is added for the purpose of predictive analytics. Grades are on a 0-20 scale, and we treat a final grade of 10 or above as a pass and anything below as a fail.
data$passed <- ifelse(data$G3 > 9.99, 1, 0)
data<-dplyr::select(data, age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime, goout,Dalc, Walc,health,absences, G1,G2,G3, passed)
str(data)
## 'data.frame': 649 obs. of 17 variables:
## $ age : int 18 17 15 15 16 16 16 17 15 15 ...
## $ Medu : int 4 1 1 4 3 4 2 4 3 3 ...
## $ Fedu : int 4 1 1 2 3 3 2 4 2 4 ...
## $ traveltime: int 2 1 1 1 1 1 1 2 1 1 ...
## $ studytime : int 2 2 2 3 2 2 2 2 2 2 ...
## $ failures : int 0 0 0 0 0 0 0 0 0 0 ...
## $ famrel : int 4 5 4 3 4 5 4 4 4 5 ...
## $ freetime : int 3 3 3 2 3 4 4 1 2 5 ...
## $ goout : int 4 3 2 2 2 2 4 4 2 1 ...
## $ Dalc : int 1 1 2 1 1 1 1 1 1 1 ...
## $ Walc : int 1 1 3 1 2 2 1 1 1 1 ...
## $ health : int 3 3 3 5 5 5 3 1 1 5 ...
## $ absences : int 4 2 6 0 0 6 0 2 0 0 ...
## $ G1 : int 0 9 12 14 11 12 13 10 15 12 ...
## $ G2 : int 11 11 13 14 13 12 12 13 16 12 ...
## $ G3 : int 11 11 12 14 13 13 13 13 17 13 ...
## $ passed : num 1 1 1 1 1 1 1 1 1 1 ...
Let us continue the hunt for significant predictors of student performance among the remaining variables. First, we plot histograms contrasting performance with the other factors. Because of the large number of columns, the discussion focuses on the variables that appeal most to basic common sense: number of absences, age, commute time, mother's and father's education, quality of family relationships, how often the student goes out, and past failures (a restricted histogram panel is sketched after the full one below).
multi.hist(data, freq=F, dcol = "red", dlty=c("dotted", "solid"))
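The full panel above can be crowded; the same call can be restricted to the variables singled out in the previous paragraph. A minimal sketch, with an illustrative subset of columns:
#Plot histograms for a hand-picked subset of variables
key_vars <- c("age", "Medu", "Fedu", "traveltime", "famrel", "goout", "failures", "absences", "G3")
multi.hist(data[, key_vars], freq=F, dcol = "red", dlty=c("dotted", "solid"))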
#Find correlations
res<-Hmisc::rcorr(as.matrix(data))
We can see that some variables make much better predictors than others.
Positive correlations are displayed in blue and negative correlations in red. Color intensity and the size of the circle are proportional to the correlation coefficients.
#Insignificant correlations are left blank
corrplot(res$r, type="upper", order="hclust",
         p.mat = res$P, sig.level = 0.01, insig = "blank")
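The correlations with the target can also be read off numerically; a small sketch listing each variable's correlation with G3, strongest first:
#Correlation of every variable with the final grade G3, in decreasing order
round(sort(res$r["G3", ], decreasing = TRUE), 2)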
#G1 and G2 are the first- and second-period grades and are highly correlated with G3. Since we propose to predict G3 without them, we remove G1 and G2 from the model.
student_data<-dplyr::select(data, -c(G1, G2))
The fastDummies package is used to convert the ordinal (1-5) variables related to going out and alcohol consumption, namely goout, Dalc and Walc, into binary dummy variables to improve classification results.
#Convert the ordinal variables to binary dummy variables
student_data<-fastDummies::dummy_cols(student_data, select_columns = c("goout","Dalc", "Walc"), remove_most_frequent_dummy = T, remove_selected_columns = T)
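A quick look at the resulting column names confirms which dummy columns were created (remove_most_frequent_dummy drops the most frequent level of each variable as the reference category); a minimal sketch:
#List the dummy columns created for goout, Dalc and Walc
grep("^(goout|Dalc|Walc)_", names(student_data), value = TRUE)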
The usual practice in machine learning is to split the dataset into a training set and a test set. The model is built on the training set and evaluated on the test set, which it has not been exposed to before. To ensure that both samples are representative of the dataset, we check the class proportions after the split.
#Splitting data on train and test set
#create train and test set
set.seed(333)
trainingRowIndex <- sample(1:nrow(student_data), 0.75*nrow(student_data)) # row indices for training data
train_data <- student_data[trainingRowIndex, ] # model training data
dim(train_data) # Training data characteristics
## [1] 486 24
with(train_data, table(passed, useNA = "always"))
## passed
## 0 1 <NA>
## 80 406 0
test_data <- student_data[-trainingRowIndex, ] # model test data
dim(test_data) # Testing data characteristics
## [1] 163 24
with(test_data, table(passed, useNA = "always"))
## passed
## 0 1 <NA>
## 20 143 0
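As mentioned above, we also check that the pass/fail proportions are similar across the two splits; a minimal sketch:
#Compare the class balance of the target in the training and test sets
prop.table(table(train_data$passed))
prop.table(table(test_data$passed))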
We will build our first model to predict the binary outcome passed, based on going out and alcohol consumption, using logistic regression.
#Train a logistic regression model:
blr1 <- glm(passed ~ goout_1+goout_2+goout_4+goout_5+Dalc_2+Dalc_3+Dalc_4+Dalc_5+Walc_2+Walc_3+Walc_4+Walc_5, data=train_data, family = "binomial")
summary(blr1)
##
## Call:
## glm(formula = passed ~ goout_1 + goout_2 + goout_4 + goout_5 +
## Dalc_2 + Dalc_3 + Dalc_4 + Dalc_5 + Walc_2 + Walc_3 + Walc_4 +
## Walc_5, family = "binomial", data = train_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1857 0.4406 0.5092 0.6550 1.1623
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.283291 0.319747 7.141 9.27e-13 ***
## goout_1 -1.130705 0.461580 -2.450 0.0143 *
## goout_2 -0.003700 0.383405 -0.010 0.9923
## goout_4 0.009031 0.370692 0.024 0.9806
## goout_5 -0.568915 0.380326 -1.496 0.1347
## Dalc_2 -0.547247 0.348868 -1.569 0.1167
## Dalc_3 0.293384 0.637639 0.460 0.6454
## Dalc_4 -1.177791 0.664723 -1.772 0.0764 .
## Dalc_5 -0.964846 0.815710 -1.183 0.2369
## Walc_2 -0.305911 0.369919 -0.827 0.4083
## Walc_3 -0.394970 0.421499 -0.937 0.3487
## Walc_4 -0.500909 0.473909 -1.057 0.2905
## Walc_5 -0.550201 0.672524 -0.818 0.4133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 434.71 on 485 degrees of freedom
## Residual deviance: 412.10 on 473 degrees of freedom
## AIC: 438.1
##
## Number of Fisher Scoring iterations: 4
car::vif(blr1)
## goout_1 goout_2 goout_4 goout_5 Dalc_2 Dalc_3 Dalc_4 Dalc_5
## 1.351811 1.347480 1.434162 1.529986 1.444008 1.310073 1.261014 1.907266
## Walc_2 Walc_3 Walc_4 Walc_5
## 1.547454 1.798360 1.855203 2.646943
We assess multicollinearity by computing the variance inflation factor (VIF), which measures how much the variance of a regression coefficient is inflated due to multicollinearity in the model. The smallest possible value of VIF is one (no multicollinearity), while a value above 5 or 10 indicates a problematic amount of collinearity. Our model stays well below 5.
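As an illustration of the definition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below checks this for goout_1; note that the values car::vif reports for a GLM are computed from the coefficient covariance matrix and can differ slightly from this linear-model calculation:
#Auxiliary regression of goout_1 on the remaining predictors in the model
aux <- lm(goout_1 ~ goout_2 + goout_4 + goout_5 + Dalc_2 + Dalc_3 + Dalc_4 + Dalc_5 +
            Walc_2 + Walc_3 + Walc_4 + Walc_5, data = train_data)
1/(1 - summary(aux)$r.squared) # should be close to the reported VIF for goout_1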
#Test model
#Make predictions on testing data, using trained model:
test_data$blr1.pred <- predict(blr1, newdata = test_data, type = 'response')
# Make confusion matrix:
ProbabilityCutoff <- 0.5
test_data$blr1.pred.probs <- 1-test_data$blr1.pred # predicted probability of failing
test_data$blr1.pred.passed <- ifelse(test_data$blr1.pred > ProbabilityCutoff, 1, 0)
(cm1 <- with(test_data,table(blr1.pred.passed,passed)))
## passed
## blr1.pred.passed 0 1
## 0 0 1
## 1 20 142
#Evaluate the model by calculating the accuracy:
CorrectPredictions1 <- cm1[1,1] + cm1[2,2]
TotalStudents1 <- nrow(test_data)
(Accuracy1 <- CorrectPredictions1/TotalStudents1)
## [1] 0.8711656
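Because passes far outnumber fails, accuracy alone can be flattering; the confusion matrix can also be summarized per class. A minimal sketch (rows of cm1 are predictions, columns are actual outcomes):
#Per-class performance of the logistic regression model
(Sensitivity1 <- cm1["1", "1"] / sum(cm1[, "1"])) # proportion of actual passes predicted correctly
(Specificity1 <- cm1["0", "0"] / sum(cm1[, "0"])) # proportion of actual fails predicted correctly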
We will build our second model to predict the continuous outcome G3, based on the relevant socioeconomic variables, using a regression tree.
#Train a regression tree model:
tree1 <- rpart(G3 ~ age+Medu+Fedu+traveltime+studytime+failures+famrel+freetime+health+absences+passed+goout_1+goout_2+goout_4+goout_5+Dalc_2+Dalc_3+Dalc_4+Dalc_5+Walc_2+Walc_3+Walc_4+Walc_5, data=train_data, method = 'anova')
summary(tree1)
## Call:
## rpart(formula = G3 ~ age + Medu + Fedu + traveltime + studytime +
## failures + famrel + freetime + health + absences + passed +
## goout_1 + goout_2 + goout_4 + goout_5 + Dalc_2 + Dalc_3 +
## Dalc_4 + Dalc_5 + Walc_2 + Walc_3 + Walc_4 + Walc_5, data = train_data,
## method = "anova")
## n= 486
##
## CP nsplit rel error xerror xstd
## 1 0.46139145 0 1.0000000 1.0038862 0.09827608
## 2 0.04849765 1 0.5386085 0.5452987 0.03718525
## 3 0.03841289 2 0.4901109 0.5218651 0.03278874
## 4 0.02106355 3 0.4516980 0.4781988 0.03059034
## 5 0.01754239 4 0.4306345 0.4943932 0.03511224
## 6 0.01337600 5 0.4130921 0.4948611 0.03538565
## 7 0.01000000 6 0.3997161 0.4847677 0.03490410
##
## Variable importance
## passed failures absences age studytime famrel Medu Dalc_4
## 69 11 7 6 3 1 1 1
## Dalc_2 health
## 1 1
##
## Node number 1: 486 observations, complexity param=0.4613915
## mean=11.89712, MSE=10.77131
## left son=2 (80 obs) right son=3 (406 obs)
## Primary splits:
## passed < 0.5 to the left, improve=0.46139150, (0 missing)
## failures < 0.5 to the right, improve=0.20734760, (0 missing)
## studytime < 1.5 to the left, improve=0.06425934, (0 missing)
## Medu < 3.5 to the left, improve=0.05068504, (0 missing)
## Fedu < 2.5 to the left, improve=0.04345307, (0 missing)
## Surrogate splits:
## failures < 0.5 to the right, agree=0.846, adj=0.063, (0 split)
## famrel < 1.5 to the left, agree=0.837, adj=0.013, (0 split)
##
## Node number 2: 80 observations, complexity param=0.04849765
## mean=6.875, MSE=9.659375
## left son=4 (26 obs) right son=5 (54 obs)
## Primary splits:
## absences < 1 to the left, improve=0.32853860, (0 missing)
## age < 17.5 to the right, improve=0.11181280, (0 missing)
## goout_1 < 0.5 to the right, improve=0.04659611, (0 missing)
## Medu < 3.5 to the right, improve=0.02404215, (0 missing)
## goout_4 < 0.5 to the left, improve=0.02274749, (0 missing)
## Surrogate splits:
## Dalc_4 < 0.5 to the right, agree=0.713, adj=0.115, (0 split)
##
## Node number 3: 406 observations, complexity param=0.03841289
## mean=12.8867, MSE=5.04135
## left son=6 (34 obs) right son=7 (372 obs)
## Primary splits:
## failures < 0.5 to the right, improve=0.09824462, (0 missing)
## Medu < 3.5 to the left, improve=0.06358073, (0 missing)
## studytime < 1.5 to the left, improve=0.04891397, (0 missing)
## age < 18.5 to the right, improve=0.03494529, (0 missing)
## absences < 10.5 to the right, improve=0.02847947, (0 missing)
## Surrogate splits:
## age < 19.5 to the right, agree=0.924, adj=0.088, (0 split)
##
## Node number 4: 26 observations, complexity param=0.02106355
## mean=4.307692, MSE=18.13609
## left son=8 (12 obs) right son=9 (14 obs)
## Primary splits:
## age < 17.5 to the right, improve=0.23384020, (0 missing)
## failures < 0.5 to the right, improve=0.13886010, (0 missing)
## Medu < 1.5 to the right, improve=0.08728346, (0 missing)
## famrel < 4.5 to the right, improve=0.07173153, (0 missing)
## health < 3.5 to the right, improve=0.06394780, (0 missing)
## Surrogate splits:
## Medu < 3.5 to the right, agree=0.654, adj=0.250, (0 split)
## Dalc_2 < 0.5 to the right, agree=0.654, adj=0.250, (0 split)
## failures < 1.5 to the right, agree=0.615, adj=0.167, (0 split)
## famrel < 3.5 to the left, agree=0.615, adj=0.167, (0 split)
## health < 2.5 to the right, agree=0.615, adj=0.167, (0 split)
##
## Node number 5: 54 observations
## mean=8.111111, MSE=0.8765432
##
## Node number 6: 34 observations
## mean=10.55882, MSE=1.364187
##
## Node number 7: 372 observations, complexity param=0.01754239
## mean=13.09946, MSE=4.836881
## left son=14 (89 obs) right son=15 (283 obs)
## Primary splits:
## studytime < 1.5 to the left, improve=0.05103701, (0 missing)
## Medu < 3.5 to the left, improve=0.04687513, (0 missing)
## age < 16.5 to the left, improve=0.02670706, (0 missing)
## health < 2.5 to the right, improve=0.02206145, (0 missing)
## absences < 10.5 to the right, improve=0.02194982, (0 missing)
## Surrogate splits:
## Walc_5 < 0.5 to the right, agree=0.772, adj=0.045, (0 split)
## freetime < 4.5 to the right, agree=0.766, adj=0.022, (0 split)
## Dalc_5 < 0.5 to the right, agree=0.766, adj=0.022, (0 split)
##
## Node number 8: 12 observations
## mean=2.083333, MSE=13.24306
##
## Node number 9: 14 observations
## mean=6.214286, MSE=14.45408
##
## Node number 14: 89 observations
## mean=12.21348, MSE=3.740942
##
## Node number 15: 283 observations, complexity param=0.013376
## mean=13.37809, MSE=4.857047
## left son=30 (130 obs) right son=31 (153 obs)
## Primary splits:
## age < 16.5 to the left, improve=0.05094157, (0 missing)
## Medu < 3.5 to the left, improve=0.04635605, (0 missing)
## traveltime < 3.5 to the right, improve=0.03387211, (0 missing)
## absences < 11 to the right, improve=0.02660105, (0 missing)
## health < 2.5 to the right, improve=0.02462094, (0 missing)
## Surrogate splits:
## traveltime < 2.5 to the right, agree=0.572, adj=0.069, (0 split)
## famrel < 2.5 to the left, agree=0.565, adj=0.054, (0 split)
## Medu < 3.5 to the right, agree=0.558, adj=0.038, (0 split)
## health < 4.5 to the right, agree=0.558, adj=0.038, (0 split)
## goout_2 < 0.5 to the right, agree=0.558, adj=0.038, (0 split)
##
## Node number 30: 130 observations
## mean=12.83846, MSE=3.48929
##
## Node number 31: 153 observations
## mean=13.8366, MSE=5.561536
#Plot the fitted rpart tree using the fancy rpart plotter from the rattle package
fancyRpartPlot(tree1, caption = "Regression Tree")
In the plotted regression tree, each node box displays the predicted mean final grade (G3) at that node and the percentage of observations that fall into that node.
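The complexity-parameter table in the summary can also be used to prune the tree; a common rule of thumb is to cut at the cp value with the smallest cross-validated error (xerror). A minimal sketch:
#Prune the tree at the complexity parameter with the lowest cross-validated error
best_cp <- tree1$cptable[which.min(tree1$cptable[, "xerror"]), "CP"]
tree1_pruned <- prune(tree1, cp = best_cp)
fancyRpartPlot(tree1_pruned, caption = "Pruned Regression Tree")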
#Make predictions on testing data, using trained model:
test_data$tree1.pred.G3 <- predict(tree1, newdata = test_data)
#Visualize predictions:
with(test_data, plot(G3,tree1.pred.G3, main="Actual vs Predicted, testing data",xlab = "Actual G3",ylab = "Predicted G3"))
#Make confusion matrix:
PredictionCutoff <- 10.99 # cutoff in the 9-11 range; an ROC curve can be used to decide it (see sketch below)
test_data$tree1.pred.passed <- ifelse(test_data$tree1.pred.G3 > PredictionCutoff, 1, 0)
(cm2 <- with(test_data,table(tree1.pred.passed,passed)))
## passed
## tree1.pred.passed 0 1
## 0 20 17
## 1 0 126
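The cutoff above was hand-picked from the 9-11 range; an alternative is to derive it from an ROC curve on the training predictions. A sketch, assuming the pROC package is installed (it is not among the libraries listed at the end):
#Choose a grade cutoff from an ROC curve built on the training data
train_pred_G3 <- predict(tree1, newdata = train_data)
roc_tree <- pROC::roc(train_data$passed, train_pred_G3)
pROC::coords(roc_tree, "best", ret = "threshold") # cutoff maximising Youden's J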
#Evaluate the model by calculating the accuracy:
CorrectPredictions2 <- cm2[1,1] + cm2[2,2]
TotalStudents2 <- nrow(test_data)
(Accuracy2 <- CorrectPredictions2/TotalStudents2)
## [1] 0.8957055
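Since the regression tree predicts a continuous grade, its fit can also be summarized with the root mean squared error on the test set, expressed in grade points on the 0-20 scale; a minimal sketch:
#Root mean squared error of the predicted final grades on the test set
(RMSE1 <- sqrt(mean((test_data$G3 - test_data$tree1.pred.G3)^2)))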
#Final output dataset for visual validation, using predicted results:
predicted_data<-dplyr::select(test_data, -c(goout_1, goout_2, goout_4, goout_5,
Dalc_2, Dalc_3, Dalc_4, Dalc_5,
Walc_2, Walc_3, Walc_4, Walc_5,
blr1.pred, blr1.pred.probs, tree1.pred.passed ))
The logistic regression model (classification) and the regression tree model performed well in this project, showing 87.12% and 89.57% accuracy on the test data, respectively; the regression tree also correctly identified all 20 failing students in the test set.
The essence of building a student performance predictor is for the model to effectively flag high-risk students by predicting their ability to pass and their final grades from relevant factors. A model is not doing well if it cannot classify both categories (pass and fail) effectively. While we expect some errors in the predictions, we also expect the model to do a good job, as shown by the confusion matrices and accuracy calculations. This student performance predictor was built for academic purposes only.
# libraries used in the project
library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
library(corrplot)
library(Hmisc)
library(psych)
library(rattle)
library(rpart)
library(car)