library(dplyr)
library(readr)
library(stringr)
library(ggplot2)
library(caret)
library(rpart)
library(rattle)
library(randomForest)
library(outliers)
library(pander)
We start by loading the data and taking a look at its general format as well as its dimensions.
dat <- read.csv("D:/iXperience/Projects/model_project/student-mat.csv", sep = ";")
dim(dat)
## [1] 395 33
| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | reason | guardian | traveltime | studytime | failures |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | course | mother | 2 | 2 | 0 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | course | father | 1 | 2 | 0 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | other | mother | 1 | 2 | 3 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | home | mother | 1 | 3 | 0 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | home | father | 1 | 2 | 0 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | reputation | mother | 1 | 2 | 0 |
| schoolsup | famsup | paid | activities | nursery | higher | internet | romantic | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| yes | no | no | no | yes | yes | no | no | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| no | yes | no | no | no | yes | yes | no | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| yes | no | yes | no | yes | yes | yes | no | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| no | yes | yes | yes | yes | yes | yes | yes | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| no | yes | yes | no | yes | yes | no | no | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| no | yes | yes | yes | yes | yes | yes | no | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |
The table contains quite a few columns! There are 33 variables and 395 rows in total. Below is a list of all the available columns, with the type and brief description provided with the dataset:
## [1] "school - student's school (binary: 'GP'-Gabriel Pereira or 'MS'-Mousinho da Silveira)"
## [2] "sex - student's sex (binary: 'F' - female or 'M' - male)"
## [3] "age - student's age (numeric: from 15 to 22)"
## [4] "address - student's home address type (binary: 'U' - urban or 'R' - rural)"
## [5] "famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)"
## [6] "Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)"
## [7] "Medu - mother's education (numeric)"
## [8] "Fedu - father's education (numeric)"
## [9] "Mjob - mother's job (nominal)"
## [10] "Fjob - father's job (nominal)"
## [11] "reason - reason to choose this school (nominal)"
## [12] "guardian - student's guardian (nominal: 'mother', 'father' or 'other')"
## [13] "traveltime - home to school travel time (numeric)"
## [14] "studytime - weekly study time (numeric)"
## [15] "failures - number of past class failures (numeric: n if 1<=n<3, else 4)"
## [16] "schoolsup - extra educational support (binary: yes or no)"
## [17] "famsup - family educational support (binary: yes or no)"
## [18] "paid - extra paid classes within the course subject (binary: yes or no)"
## [19] "activities - extra-curricular activities (binary: yes or no)"
## [20] "nursery - attended nursery school (binary: yes or no)"
## [21] "higher - wants to take higher education (binary: yes or no)"
## [22] "internet - Internet access at home (binary: yes or no)"
## [23] "romantic - with a romantic relationship (binary: yes or no)"
## [24] "famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)"
## [25] "freetime - free time after school (numeric: from 1 - very low to 5 - very high)"
## [26] "goout - going out with friends (numeric: from 1 - very low to 5 - very high)"
## [27] "Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)"
## [28] "Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)"
## [29] "health - current health status (numeric: from 1 - very bad to 5 - very good)"
## [30] "absences - number of school absences (numeric: from 0 to 93)"
## [31] "G1 - first period grade (numeric: from 0 to 20)"
## [32] "G2 - second period grade (numeric: from 0 to 20)"
## [33] "G3 - final grade (numeric: from 0 to 20, output target)"
We have at our disposal an extensive report on the characteristics of high-school students from two different institutions: Gabriel Pereira HS and Mousinho da Silveira HS. The data includes detailed observations about each student’s family situation, lifestyle, school performance, alcohol consumption, and more. The goal of our model is to predict the academic performance of a given student on the basis of the available information about him or her.
It’s time to get our hands dirty with some data cleaning and various visualisations.
In order to perform classification on our data, we need to introduce a binary variable that captures student performance. No definition of “good performance” will ever receive unanimous approval; with this in mind, let us design our own performance variable:
Of the G1, G2 and G3 columns, we focus only on the first one: our goal is to distinguish above-average students from students in difficulty, and a single measure of a student’s performance is enough to place him or her in one of two categories.
We choose as our criterion whether a given student scores above the median. We thus create a new variable ‘performance’ which takes the value 0 if the student’s G1 grade is at or below the median and 1 if it is above the median.
By construction, roughly half of the students fall on each side of the median (ties at the median prevent an exactly even split), so any significant deviation upward from that ~50% baseline accuracy will indicate that the model we craft has real predictive value.
dat <- dat %>% mutate(performance = ifelse(G1 > median(G1), 1, 0))
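As a quick sanity check (not part of the original output), we can verify that the two classes are indeed roughly balanced:
# proportion of students in each performance class; both values should hover around 0.5
table(dat$performance) / nrow(dat)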
Just as a single tweaked G1 column suffices to assess performance, several columns in the dataset don’t improve our understanding of the situation. The three grade variables (G1, G2, G3) encode the very quantity we want to predict, so we get rid of them. Furthermore, we do not need to distinguish between the two schools, nor do we care about the student’s address (the traveltime variable does a good enough job of indicating how long a commute any given student has to endure).
dat <- dat[, -which(colnames(dat) %in% c("G1", "G2", "G3"))]
dat <- dat[, -which(colnames(dat) %in% c("address", "school"))]
Good news: the data at our disposal is extremely well suited to classification analysis, as most of the columns are either binary variables or take values from a small discrete set of around five levels. We can therefore convert most of the columns to factors. The absences variable is the only one excluded from the transformation because its range is much wider than the others’.
# convert every column except 'absences' (column 28) to a factor
for (i in c(1:27, 29)) {
  dat[, i] <- as.factor(dat[, i])
}
Let’s start with some data exploration using nothing more than our intuition. If we believe conventional wisdom, there should be a high positive correlation between the studytime variable and our feature-engineered performance column. We use the cor.test() function to verify that claim:
cor.test(as.numeric(dat$studytime), as.numeric(dat$performance))
##
## Pearson's product-moment correlation
##
## data: as.numeric(dat$studytime) and as.numeric(dat$performance)
## t = 1.9829, df = 393, p-value = 0.04807
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.0008669578 0.1962729650
## sample estimates:
## cor
## 0.09952947
We get a very low, although positive, correlation: r ≈ 0.1. The test is technically significant at the 5% level (p ≈ 0.048), but a correlation is usually only considered meaningful when its absolute value exceeds 0.3, which is far from being the case here. The value of machine learning becomes much clearer when we are confronted with such blatant failures of human intuition. In the meantime, let us pursue our hunt for strong predictors of student performance among the remaining variables.
First, we plot a bunch of graphs contrasting performance with other factors. Because of the overwhelming number of columns, we limit ourselves to the variables that once again appeal most to basic common sense: number of absences, age, commute time, whether the student has internet access at home, how much he/she goes out and whether he/she is contemplating higher education.
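The plotting code is not reproduced in this document; a minimal sketch of such a pairs graph, assuming the column selection described above, might look like this:
# pairs plot of performance against a handful of intuitively relevant variables,
# coloured by performance class (green = above median, red = at or below)
pairs(dat[, c("performance", "absences", "age", "traveltime", "internet", "goout", "higher")],
      col = ifelse(dat$performance == 1, "forestgreen", "firebrick"))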
The performance line is the one we are interested in. Despite the limitation on the number of variables, the plot quality is quite poor on this pairs graph. Nevertheless, we can already see that some variables make much better predictors than others. After a trial-and-error period, here are some of the most telling plots the dataset has to offer:
The more a student tends to go out, the less academically successful he or she is likely to be. It seems that Friday nights out are more damaging than limited study time!
Students who aren’t contemplating higher education are much less likely to be performant. In order to achieve big, one has first to dream big.
Older students’ performance peaks at age 20 and then plummets for ages 21 and 22. It looks like slackers have a narrow window of brilliance which, if missed, is followed by an abrupt plunge into mediocrity.
Students with a lot of absences are most likely to be less performant. As the saying goes, out of sight, out of the top two quartiles.
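The exact ggplot2 calls behind these figures are not shown in this document; as an illustration, the ‘going out’ plot could plausibly be reproduced along the following lines (the geometry and labels are my own guess):
# proportion of above-median students within each 'goout' level
ggplot(dat, aes(x = goout, fill = performance)) +
  geom_bar(position = "fill") +
  labs(y = "proportion of students", fill = "performance")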
Putting everything together, we get a much nicer bird’s-eye view of the hidden connections within our dataset.
Now that we have gained a basic understanding of what is going on within the dataset, it is time to let the machine take the controls. From now on, we shall only play the role of SHMs (Support Humans for Machines), playing around with the different available models and comparing the accuracy of their respective predictions.
Here is the overall plan that shall guide our modeling wanderings:
Note: The accuracy results obtained with the various models were analyzed during the coding process, and, as you will see, the structure of the presentation relies quite heavily on an exact transcription of those results. Because accuracy rates inherently vary across replications of the same model, I decided to omit the actual output of the model-related code included in this document, in order to remain consistent with the initial storyline. I thus ask the reader to trust that the accuracy percentages given throughout the modeling process are reliable.
We start with the most straightforward classification model and fit a logistic regression, which estimates the probability of each outcome of a single dependent variable - performance - using every other variable in the dataset as an independent variable.
In order to assess the accuracy of our first model, we will be using bootstrap validation. Indeed, bootstrapping has been proven time and time again to be the coolest and most badass way to assess the accuracy of a model. On a more sober note, bootstrapping allows us to use all the available data when building our predictions because it does not require us to set aside test data. This property is particularly useful considering that our dataset’s generous width (33 columns) contrasts with a rather limited number of observations by data science standards.
logistic.model <- train(data = dat, performance ~ ., method = "glm", trControl = trainControl(method = "boot",
verboseIter = T,
number = 50))
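The bootstrap-averaged accuracy can then be read off the fitted object; this check is not part of the original output, but with caret it would typically be:
# average accuracy across the 50 bootstrap resamples
logistic.model$results$Accuracy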
We get a 61.7% accuracy rate, which is a reasonable starting point for our modeling journey. As mentioned previously, any significant deviation upward from the 50% baseline value makes the attempt relatively successful. In that sense, the model gives us a non-trivial predictive ability. Logistic regression being one of the simpler algorithms out there, it isn’t unreasonably optimistic to expect to make some improvements as we choose more robust and flexible models.
We now move on to a decision tree model, whose accuracy we will assess the good old way: we assign 85% of the dataset’s rows to a training subset used to grow the decision tree, and we then use the remaining 15% of the data to test the model we have generated.
# randomly select 85% of the rows for training; the rest is held out for testing
index <- sample(1:nrow(dat), size = floor(0.85 * nrow(dat)))
training <- dat[index, ]
test <- dat[-index, ]
tree.model <- rpart(data = training, formula = performance ~ ., control = rpart.control(cp = 0.015))
Summing the TRUE values of the logical vector obtained by comparing the model’s predictions with the actual test labels, then dividing by the total number of test observations, yields the accuracy rate:
sum(predict(tree.model, test, type = "class") == test$performance) / length(test$performance)
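For a more detailed breakdown than a single number, caret’s confusionMatrix() could be applied to the same predictions (this check is not part of the original analysis):
# cross-tabulate the tree's predictions against the actual test labels
confusionMatrix(predict(tree.model, test, type = "class"), test$performance)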
The above line yields an accuracy rate of 58%. Surprisingly, the value obtained with the decision tree is lower than the one we got from the logistic regression. Perhaps tweaking the model’s parameters will give us a more or less significant boost in accuracy.
minsplit, minbucket, maxcompete and maxsurrogate are all parameters we can specify in the control argument of the rpart() call. We shall compare the accuracy of the model as we vary each of these parameters within an appropriate range, all else kept constant.
Using the sapply() function, we make minsplit vary between 3 and 30 in increments of 3:
sapply(seq(3, 30, by = 3), function(k) {
sprintf("%.7f", # get output values with 7 decimal places
sum(
predict(rpart(data = training, formula = performance ~ ., control = rpart.control(cp = 0.3597, minsplit = k)),
test, type = "class") == test$performance) / length(test$performance))
})
We somehow got up to a 60% accuracy rate, but it is invariant to the value of minsplit. We therefore leave minsplit at its default.
We then vary minbucket. According to the rpart.control documentation, its default value is round(minsplit/3), so we simply make minbucket vary between 1 and 10, the image of the previously used minsplit values under the mapping f(x) = x/3:
sapply(1:10, function(k) {
sprintf("%.7f",
sum(
predict(rpart(data = training, formula = performance ~ ., control = rpart.control(cp = 0.3597, minbucket = k)),
test, type = "class") == test$performance) / length(test$performance))
})
Again, none of the values produces any deviation from the constant 60% accuracy rate. Applying a similar process to the maxcompete and maxsurrogate parameters, we invariably end up at the same point of stagnation. It thus seems that we have reached a plateau for the decision tree model.
Accuracy concerns aside, decision trees offer a valuable resource: by virtue of their simplicity, they allow us to produce nice visualisations that usually improve our understanding of the dataset at hand. Let us look at the nodes of a rather simple decision tree, obtained by setting the minsplit parameter to 30, thereby requiring that no split be attempted on a node containing fewer than 30 observations.
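The plotting code is not reproduced here; given that the rattle package is loaded, the figure was presumably produced with something along these lines:
# grow a shallow tree (no split attempted on nodes with fewer than 30 observations) and draw it
simple.tree <- rpart(data = training, formula = performance ~ ., control = rpart.control(minsplit = 30))
fancyRpartPlot(simple.tree)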
The resulting plot goes beyond our expectations by directing our attention to a variable we had completely overlooked in our initial exploration! Indeed, the failures column, which indicates how many classes the student has failed in the past, sits at the top of the decision tree and is consequently strongly associated with the performance value. A low number of failures is the number-one predictor of student success, and it is only past this node that the decision tree considers other factors, such as school support or amount of free time, in order to further refine the model for high-failure students.
Let us go one step further down the decision tree path and generate a random forest. By doing so, we deliberately trade the ability to visualize our model for a higher accuracy rate. We reuse the training/test split initially created for the decision tree so that we can make a direct side-by-side comparison of the two models.
forest.model <- randomForest(data = training, performance ~ ., importance=TRUE, ntree=2000)
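Since importance = TRUE was specified, we can also take a quick look at the variables the forest relies on most - a check that is not part of the original write-up but nicely complements the failures insight from the decision tree:
# variable importance measures (mean decrease in accuracy and in Gini impurity)
importance(forest.model)
varImpPlot(forest.model)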
We use the same basic arithmetic to arrive at an accuracy value for our random forest model:
sum(predict(forest.model, test, type = "class") == test$performance) / length(test$performance)
Awesome! We got a huge accuracy jump! Our model now has a 68% accuracy rate, which is far higher than anything we have seen so far with logistic regression and basic decision trees. The accuracy gap is so large that it seems reasonable to generate another random forest model, this time using a technique such as cross-validation in order to average the accuracy across multiple generated models.
forest.model2 <- train(data = dat, performance ~ ., method = "rf", trControl = trainControl(method = "cv",
verboseIter = T,
number = 10))
Using cross-validation in order to be able to build our model using the entire dataset, we plateau at 64.3% accuracy, a smaller and more realistic value than the 68% we got initially. This might be due to the fact that in the split we made on our first run, the test data we preserved was relatively small (60 rows) and was thus more susceptible to fluctuations that might significantly distort the accuracy of the model.
In any case, the accuracy of the random forest is clearly better than that of both previous models, and our predictive power for the performance of any given high-schooler has increased accordingly.
Let us now fit one last model: the one and only, highly mathematical, hyperplane-based Support Vector Machine! The SVM is reputedly the most efficient alternative of all the models we have tested so far, mainly thanks to its flexibility. Let us verify that claim. We will be using cross-validation to maximize the data that goes into building the model and to avoid a scenario similar to the one we encountered while producing our random forest model.
Fun fact: The caret package’s train() function offers 8 variations of the SVM model. I therefore ran each of them separately and compared their optimal accuracy, where ‘optimal’ refers to the accuracy corresponding to the best combination of the model’s tuning parameters - parameters such as C, degree, scale, …
Given an SVM variant, denoted <SVM variant> below, each model is fitted as follows:
svm.model <- train(data = dat, performance ~ ., method = <SVM variant>, trControl = trainControl(method = "cv",
verboseIter = T,
number = 10))
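Although the eight variants were run one by one, a compact way to compare a few of them would be to loop over the corresponding caret method strings and extract the best cross-validated accuracy of each; the sketch below is illustrative and only covers three of the variants:
# compare the optimal cross-validated accuracy of a few kernlab-based SVM variants
svm.methods <- c("svmLinear", "svmRadial", "svmPoly")
sapply(svm.methods, function(m) {
  fit <- train(performance ~ ., data = dat, method = m,
               trControl = trainControl(method = "cv", number = 10))
  max(fit$results$Accuracy)
})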
and… we have a winner! Although most SVM models hovered around 64 to 66% accuracy, the “svmPoly” method with degree = 1, scale = 0.1 and C = 0.25 tops the charts with a remarkable accuracy of 67.3%. While the optimal accuracy of the SVM model surpasses the cross-validated random forest results by a significant margin, it is, surprisingly, still just below the 68% accuracy we obtained from the random forest on the training/test split. Could this be because the training/test combination we randomly generated produced an outlier accuracy value? Let us investigate the validity of that claim.
One flaw of the standard training/test modeling method, beyond the reduced number of rows we can use to build the model, is that it only yields a single accuracy rate. While the bootstrapping and cross-validation methods output an average accuracy, and thereby give us a value we can use as our expectation for the final model, it would be unreasonable to assume that a model built on a single training/test split will achieve the same accuracy when confronted with a different set of test data.
In order to tackle this issue, we apply the following procedure: we generate 10 different training/test combinations from the data, run a random forest on each of them, and use the resulting vector of accuracies to obtain an average accuracy for the model.
# fit a random forest on 10 different training/test splits and record the accuracy of each
values <- sapply(1:10, function(k) {
  index <- sample(1:nrow(dat), size = floor(0.85 * nrow(dat)))
  training <- dat[index, ]
  test <- dat[-index, ]
  sum(predict(randomForest(data = training, performance ~ ., importance = TRUE, ntree = 2000),
              test, type = "class") == test$performance) / length(test$performance)
})
The values vector obtained by running the above code contains the following values:
## [1] 0.5500000 0.5833333 0.6833333 0.6166667 0.6000000 0.6166667 0.6000000
## [8] 0.6333333 0.6666667 0.6000000
We can now take the average of the elements of the values vector:
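mean(values)  # the call itself is omitted in the original document; shown here for completeness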
## [1] 0.615
After 10 iterations of the train/test modeling method, we get an average accuracy rate of 61.5%, much closer to the cross-validated random forest result and far, far away from the 68% we got on the first try. It seems that we were right: we were (un)lucky and generated an outlier value on our first random forest attempt, a situation that might have led us to over-estimate the power of the random forest algorithm relative to the optimal SVM method.
We can actually check how likely the 68% value is to be an outlier: we simply add it to the values vector and run the scores() function from the outliers package:
values <- c(values, 0.68)
values[which(scores(values, prob = 0.90))]
## [1] 0.5500000 0.6833333 0.6800000
The illegitimate king has been dethroned: we can now assert with 90% confidence that the 68% accuracy rate we initially obtained using a random forest was nothing more than an outlier, and that it is unreasonable to expect the actual random forest model to consistently produce such a high accuracy.
Having resolved our initial confusion, we can now serenely join the clan of SVM defenders! The SVM yielded a 67.3% accuracy rate in predicting student performance, a value significantly higher than that of any other model we generated throughout this analysis.
However, we have certainly gained multiple insights into both our dataset and the modeling process while trying out models that achieved lower accuracy. The decision tree had the nice advantage of producing an accessible output which deepened our understanding of the data; on the technical side, the interleaved use of cross-validation, bootstrapping and data splitting has made us more aware of the intricacies and pitfalls of model generation. The advantage of the averaged accuracy rates of bootstrapping and cross-validation over a single training/test split was also made clear the hard way: failing to realize the reason behind the accuracy gap might have led us to ultimately prefer the random forest over the better-performing SVM model.