Playing with SVM to predict academic performance (grade on final report) from demographic, feedback provision and feedback use data

loading data…
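(The actual loading code wasn't echoed; a minimal sketch, assuming a csv export with a made-up file name, plus the packages used further down, would be:)

library(e1071)     # svm() and tune()
library(reshape2)  # dcast() for long -> wide reshaping

# file name is a placeholder
df.raw <- read.csv("feedback_data.csv", stringsAsFactors = FALSE)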

So, we have data on students’ academic performance, feedback provision, feedback use, and demographics. Two semesters of two courses, with 1947 students in total (some students took both courses and some repeated a course - 101 students, to be exact). Here is the breakdown of the number of students in each course each semester:

##          
##           Semester 1 Semester 2
##   Level 1        733        972
##   Level 2        251         92

As usual, I ended up spending 90% of the time and code on wrangling the data - even though this is a clean data set from which we have published an analysis in a leading, peer-reviewed, international speciality journal, and I am working in a programming language I am fluent in (R) - and then 5% of the time running, altering and re-running the ‘machine learning’ algorithms (and 5% of the time making it pretty for posting here :P)

Data wrangling

Checking for duplicate records (StudentID+report) within a single semester
these would be errors since students shouldn’t be resubmitting in this context
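A check along these lines (column names are my guesses for the long-format data) returns the row indices of any duplicate StudentID + report combinations:

# returns integer(0) when there are no duplicates
which(duplicated(df.raw[, c("StudentID", "Report", "Semester", "Course")]))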

## integer(0)

No duplicate submissions - so the data are reliable and I don’t need to delete duplicates (or, alternatively, take the mean in dcast) - great!

Removing cases with missing data
(svm will train, but predict() will throw errors if any feature contains NA or NaN)
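e.g. just keeping complete cases:

# drop any row containing NA/NaN so prediction won't choke later on
df1 <- df.raw[complete.cases(df.raw), ]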

Starting with just a single semester to mock up a draft.

also selecting the variables to use as features
then reshaping so there is one row per StudentID
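Roughly (I'm assuming the long data has one row per StudentID x report x measure, with columns named something like Report, Measure and Value, and report labels R1-R3):

# semester 1 only, and just the columns that will become features
df.long <- subset(df1, Semester == "Semester 1",
                  select = c(StudentID, Report, Measure, Value))

# reshape: one row per student, one column per report x measure combination
df.wide <- dcast(df.long, StudentID ~ Report + Measure, value.var = "Value")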

Turning the ‘outcome’, ie the mark for Report 3, into a categorical feature so a logistic/classification approach can be used, ie “did student X get an A on their 3rd (final) report?” Also removing Report 3 Final.Grade from the predictors, since that alone would predict the outcome perfectly :P
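Something like (the wide column name here is whatever dcast spat out - an assumption on my part):

# outcome: did the student get an A on the final (3rd) report?
df.wide$y <- factor(df.wide$R3_Final.Grade == "A")

# and drop the Report 3 grade itself from the predictors
df.wide$R3_Final.Grade <- NULL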

Split data into training, cv and test sets
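A 60/20/20 split is what I'm sketching here (the proportions are my choice):

set.seed(42)                     # arbitrary seed so the split is reproducible
n   <- nrow(df.wide)
idx <- sample(seq_len(n))        # shuffled row indices

df.train <- df.wide[idx[1:round(0.6 * n)], ]
df.cv    <- df.wide[idx[(round(0.6 * n) + 1):round(0.8 * n)], ]
df.test  <- df.wide[idx[(round(0.8 * n) + 1):n], ]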

NB - in the next chunk, I’m not really sure about the ‘tune’ function in R - may need to read some more. If not ‘tuning’ to optimise the regularisation parameter (entered as ‘cost’ in this R package/function; cost works roughly as the inverse of the usual lambda), then the cv and test data sets can be pooled to get a better estimate of svm model performance.

SVM

First, train svm on training set (prints out svm parameters). Then use the svm model to predict outcome (y ie get an A on final report or not) on the cross-validation set.
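The model call matches the one echoed below; the Results table is my reconstruction of how the percentages fall out of the confusion matrix (not necessarily the original code):

fit.svm <- svm(y ~ ., data = df.train)   # radial kernel, cost = 1 by default
pred.cv <- predict(fit.svm, df.cv)

conf <- table(predicted = pred.cv, actual = df.cv$y)

# metrics as percentages
results <- 100 * c(
  accuracy          = sum(diag(conf)) / sum(conf),
  misclassification = 1 - sum(diag(conf)) / sum(conf),
  prevalence        = mean(df.cv$y == "TRUE"),
  precision         = conf["TRUE", "TRUE"] / sum(conf["TRUE", ]),
  true.pos          = conf["TRUE", "TRUE"] / sum(conf[, "TRUE"]),
  false.pos         = conf["TRUE", "FALSE"] / sum(conf[, "FALSE"]),
  true.neg          = conf["FALSE", "FALSE"] / sum(conf[, "FALSE"])
)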

## 
## Call:
## svm(formula = y ~ ., data = df.train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.007936508 
## 
## Number of Support Vectors:  296
##                    Results
## accuracy          66.39344
## misclassification 33.60656
## prevalence        46.72131
## precision         62.90323
## true.pos          68.42105
## false.pos         35.38462
## true.neg          64.61538

So, pretty good - about 66% accuracy on the cross-validation split shown above at predicting who will get an A for the final report and who won’t (other runs/splits came in nearer 70-80%).

Now trying to ‘tune cost’ (ie optimise regularisation parameter)
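The grid below is my guess, reverse-engineered from the output (cost from 10^-4 up to 10^2); tune() uses 10-fold cross-validation by default:

tuned <- tune(svm, y ~ ., data = df.train,
              ranges = list(cost = 10^(-4:2)))
summary(tuned)

# refit at the best cost (this call is the one echoed below)
fit.tuned <- svm(y ~ ., data = df.train, cost = as.numeric(tuned[[1]]))
pred.cv   <- predict(fit.tuned, df.cv)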

## 
## Parameter tuning of 'svm':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##  cost
##    10
## 
## - best performance: 0.4089744 
## 
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-04 0.4673077  0.1165049
## 2 1e-03 0.4673077  0.1165049
## 3 1e-02 0.4673077  0.1165049
## 4 1e-01 0.4673077  0.1165049
## 5 1e+00 0.4673077  0.1096822
## 6 1e+01 0.4089744  0.1421799
## 7 1e+02 0.4089744  0.1421799
## 
## Call:
## svm(formula = y ~ ., data = df.train, cost = as.numeric(tuned[[1]]))
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  10 
##       gamma:  0.007936508 
## 
## Number of Support Vectors:  259
##                    Results
## accuracy          76.22951
## misclassification 23.77049
## prevalence        50.00000
## precision         78.57143
## true.pos          72.13115
## false.pos         19.67213
## true.neg          80.32787

Tuning really seems to drop accuracy on the test data set - as low as 42% - whereas just using cost = 1 kept accuracy up near 70-80% on the test set. Need to look into the tune function…

Generalising

Next, to generalise: add in the demographic data and reduce back to Report 1 (mark, feedback provision and use) so the model can be used on both courses. Pragmatically, this would also give academics, markers and students an earlier indication of likely performance on the final report…

More data wrangling

Reducing down to 1st and final reports to align/generalise

## [1] 0

Reshaping so that StudentID leads each row of features and adding in demographic data

add on the ‘outcome’, ie final grade = A, and remove R3 Final.Grade from the predictors
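A sketch of that wrangling (df.demog stands in for however the demographics were loaded - it, and the report labels, are assumptions):

# keep only the 1st and final reports
df.long2 <- subset(df.raw, Report %in% c("R1", "R3"))

# one row per student again, then bolt the demographics on by StudentID
df.wide2 <- dcast(df.long2, StudentID ~ Report + Measure, value.var = "Value")
df.wide2 <- merge(df.wide2, df.demog, by = "StudentID")

# outcome = A on the final report; drop the final grade from the predictors
df.wide2$y <- factor(df.wide2$R3_Final.Grade == "A")
df.wide2$R3_Final.Grade <- NULL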

Need to turn characters into factors for svm
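e.g.:

# svm() wants factors (or numerics), not character columns
is.chr <- sapply(df.wide2, is.character)
df.wide2[is.chr] <- lapply(df.wide2[is.chr], factor)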

Found the NAs tripping up the later svm - UQ..OP (the high school score students entered uni with) has 899 NAs, which takes out 25% of the students - and not at random, because these would be the international, interstate and mature-entry students.
Since prior academic performance predicts future performance, and lack of an OP is unevenly distributed/correlated with other demographic features (“biased”) in the data set, I will try with and without it to compare performance…
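ie two versions of the data set, one per comparison:

sum(is.na(df.wide2$UQ..OP))    # the 899 NAs mentioned above

df.with.op <- df.wide2[!is.na(df.wide2$UQ..OP), ]            # fewer students, OP kept
df.no.op   <- df.wide2[, setdiff(names(df.wide2), "UQ..OP")] # all students, OP dropped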

Next, need to split for svm - but drawing the random samples separately from each course x semester combination
with OP included first
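My reading of that is sampling within each course x semester cell - something like this (Course and Semester column names are assumptions):

set.seed(42)
strata <- interaction(df.with.op$Course, df.with.op$Semester)

# ~70% of each course x semester cell into training, the rest held out
train.idx <- unlist(lapply(split(seq_len(nrow(df.with.op)), strata),
                           function(i) sample(i, round(0.7 * length(i)))))

df.train <- df.with.op[train.idx, ]
df.test  <- df.with.op[-train.idx, ]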

SVM - with OP

## 
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.0007692308 
## 
## Number of Support Vectors:  696
##       FALSE TRUE Sum
## FALSE   130    7 137
## TRUE     76   51 127
## Sum     206   58 264
##                     Results
## accuracy          68.560606
## misclassification 31.439394
## prevalence        48.106061
## precision         87.931034
## true.pos          40.157480
## false.pos          5.109489
## true.neg          94.890511

Ok, the first run gave 75% accuracy (the run shown above came in at ~69%) - so still very good. (A 2nd run without UQM.ID and Consent gave 73%.) Next, to try the larger data set but without the OP variable…

Again split for svm
no OP this time

SVM - without OP

## 
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.0007698229 
## 
## Number of Support Vectors:  918
##       FALSE TRUE Sum
## FALSE   151   40 191
## TRUE     52  109 161
## Sum     203  149 352
##                    Results
## accuracy          73.86364
## misclassification 26.13636
## prevalence        45.73864
## precision         73.15436
## true.pos          67.70186
## false.pos         20.94241
## true.neg          79.05759

Accuracy came down to 69% on that run (the one shown above is ~74%), but that’s not really different from 75%, given that the random split of the data will cause that much variation anyway (a 2nd run without UQM.ID and Consent gave 75%).

So, you would loop over the svm a few times to get a range of accuracies?
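Something like this would do it - repeat the split/train/predict cycle and look at the spread (20 repeats is arbitrary):

accs <- replicate(20, {
  i <- sample(seq_len(nrow(df.no.op)), round(0.7 * nrow(df.no.op)))
  m <- svm(y ~ ., data = df.no.op[i, ], cost = 1)
  p <- predict(m, df.no.op[-i, ])
  100 * mean(p == df.no.op$y[-i])
})

range(accs)   # spread of test-set accuracies over the 20 random splits
mean(accs)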