loading data…
So, we have data on students’ academic performance, feedback provision, feedback use, and demographics: two semesters of two courses, with a total of 1947 students (some students took both courses and some repeated a course - 101 students, to be exact). Here is the breakdown of the number of students in each course each semester:
##
##         Semester 1 Semester 2
## Level 1        733        972
## Level 2        251         92
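For reference, a cross-tab like the one above can be produced along these lines (a sketch only - `df`, `StudentID`, `Level` and `Semester` are assumed names for the long-format data and its columns):

```r
# Count each student once per course level x semester cell
# (data frame and column names are assumptions)
with(unique(df[, c("StudentID", "Level", "Semester")]),
     table(Level, Semester))
```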
As usual, I ended up spending 90% of the time and code on wrangling the data - even though this is a clean data set from which we have published an analysis in a leading speciality, peer-reviewed, international journal, and I am working in a programming language I am fluent in (R) - and then 5% of the time running, altering and re-running the ‘machine learning’ algorithms (and 5% of the time making it pretty for posting here :P).
Checking for duplicate records (StudentID + report) within a single semester - these would be errors, since students shouldn’t be resubmitting in this context.
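The check itself can be as simple as the following (a sketch - the column names `Semester`, `StudentID` and `Report` are assumptions); it returns `integer(0)` when nothing is duplicated:

```r
# Rows where the same StudentID + Report combination appears more than
# once within a semester (these would be resubmission errors)
which(duplicated(df[, c("Semester", "StudentID", "Report")]))
```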
## integer(0)
No duplicate submissions - so the data is reliable and we don’t need to delete duplicates (or, alternatively, use the mean as the aggregation function in dcast) - great!
Removing cases with missing data (svm will train, but throws errors during prediction if any feature contains NA or NaN), also selecting the variables to use as features, then reshaping with StudentID as the rows.
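Roughly, that wrangling step looks like this (a sketch assuming reshape2, and made-up column names - `Mark`, `Feedback.Provided`, `Feedback.Used` - for the feature variables):

```r
library(reshape2)

# Keep only the columns used as features, then drop incomplete cases
# (svm trains with NAs present, but predict() fails on them)
keep   <- c("StudentID", "Report", "Mark", "Feedback.Provided", "Feedback.Used")
df.sub <- na.omit(df[, keep])

# Reshape long -> wide: one row per StudentID, one column per report x measure
df.wide <- dcast(melt(df.sub, id.vars = c("StudentID", "Report")),
                 StudentID ~ Report + variable)
```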
Turning the ‘outcome’, i.e. the mark for Report 3, into a categorical feature so we can use a logistic/classification approach - i.e. “did student X get an A on their 3rd (final) report?” - and also removing the Report 3 Final.Grade from the predictor variables, since that alone would predict the outcome perfectly :P
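In code, that is something like the following (the 80% cut-off for an A and the column name `R3.Final.Grade` are assumptions - only the “A on Report 3” idea comes from the analysis):

```r
# Binary outcome: did the student get an A on the final (3rd) report?
# (the 80 cut-off and the column name R3.Final.Grade are assumptions)
df.wide$y <- factor(df.wide$R3.Final.Grade >= 80)   # levels FALSE / TRUE

# Drop the Report 3 grade itself from the predictors -
# on its own it would predict the outcome perfectly
df.model <- df.wide[, setdiff(names(df.wide), "R3.Final.Grade")]
```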
Splitting the data into training, cross-validation (cv) and test sets.
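A simple way to do the three-way split (the 60/20/20 proportions are an assumption; only the training/cv/test structure is from the text):

```r
# Shuffle row indices and cut them into training / cv / test sets
set.seed(1)
n   <- nrow(df.model)
grp <- sample(cut(seq_len(n), breaks = n * c(0, 0.6, 0.8, 1),
                  labels = c("train", "cv", "test")))
df.train <- df.model[grp == "train", ]
df.cv    <- df.model[grp == "cv", ]
df.test  <- df.model[grp == "test", ]
```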
NB - in the next chunk, I’m not really sure about the ‘tune’ function in R - may need to read some more. If not ‘tuning’ to optimise the regularisation parameter (entered as ‘cost’ in this R package/function - cost is effectively the inverse of lambda), then we could pool the cv and test data sets together to get a better estimate of svm model performance.
First, train the svm on the training set (the svm parameters are printed below). Then use the svm model to predict the outcome (y, i.e. getting an A on the final report or not) on the cross-validation set.
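The printed parameters below are the defaults of svm() from the e1071 package; the fit-and-predict step is essentially:

```r
library(e1071)

# Default radial-kernel C-classification SVM on the training data
fit <- svm(y ~ ., data = df.train)
summary(fit)

# Predict the cross-validation outcomes and tabulate against the truth
pred.cv <- predict(fit, newdata = df.cv)
table(predicted = pred.cv, actual = df.cv$y)
```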
##
## Call:
## svm(formula = y ~ ., data = df.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.007936508
##
## Number of Support Vectors: 296
## Results
## accuracy 66.39344
## misclassification 33.60656
## prevalence 46.72131
## precision 62.90323
## true.pos 68.42105
## false.pos 35.38462
## true.neg 64.61538
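The summary metrics above can be recovered from the confusion matrix with the usual definitions - the following is a sketch of plausible calculations, not the original code:

```r
# Derive the percentage metrics from the cv confusion matrix
conf <- table(predicted = pred.cv, actual = df.cv$y)
TP <- conf["TRUE", "TRUE"];   FP <- conf["TRUE", "FALSE"]
FN <- conf["FALSE", "TRUE"];  TN <- conf["FALSE", "FALSE"]

round(c(
  accuracy          = 100 * (TP + TN) / sum(conf),
  misclassification = 100 * (FP + FN) / sum(conf),
  prevalence        = 100 * (TP + FN) / sum(conf),  # proportion who actually got an A
  precision         = 100 * TP / (TP + FP),
  true.pos          = 100 * TP / (TP + FN),         # sensitivity / recall
  false.pos         = 100 * FP / (FP + TN),
  true.neg          = 100 * TN / (FP + TN)          # specificity
), 2)
```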
So, pretty good - it predicts who will get an A for the final report and who won’t with around 66% accuracy on this cross-validation split (roughly 70-80% on other runs).
Now trying to ‘tune’ cost (i.e. optimise the regularisation parameter).
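The tuning step uses e1071’s tune() - the cost grid below matches the detailed results that follow, and the refit uses the best cost found (the Call printed further down pulls it out as `as.numeric(tuned[[1]])`; `tuned$best.parameters$cost` is the same value). A sketch:

```r
# 10-fold cross-validated grid search over the cost parameter
tuned <- tune(svm, y ~ ., data = df.train,
              ranges = list(cost = 10^(-4:2)))
summary(tuned)

# Refit on the training set with the best cost, then predict the held-out set
fit.tuned  <- svm(y ~ ., data = df.train, cost = tuned$best.parameters$cost)
pred.tuned <- predict(fit.tuned, newdata = df.test)
```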
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.4089744
##
## - Detailed performance results:
## cost error dispersion
## 1 1e-04 0.4673077 0.1165049
## 2 1e-03 0.4673077 0.1165049
## 3 1e-02 0.4673077 0.1165049
## 4 1e-01 0.4673077 0.1165049
## 5 1e+00 0.4673077 0.1096822
## 6 1e+01 0.4089744 0.1421799
## 7 1e+02 0.4089744 0.1421799
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = as.numeric(tuned[[1]]))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.007936508
##
## Number of Support Vectors: 259
## Results
## accuracy 76.22951
## misclassification 23.77049
## prevalence 50.00000
## precision 78.57143
## true.pos 72.13115
## false.pos 19.67213
## true.neg 80.32787
Tuning really seems to drop accuracy on the test data set - as low as 42% on some runs, whereas just using cost = 1 kept accuracy up near 70-80% on the test set. Need to look into the tune function some more…
Next, to generalise: add in the demographic data and reduce back to Report 1 only (mark, feedback provision and use) so that the model can be used on both courses. Pragmatically, this would also give academics, markers and students an earlier indication of likely performance on the final report…
Reducing down to the 1st and final reports to align/generalise across courses.
## [1] 0
Reshaping so that StudentID leads each row of features, adding in the demographic data, adding on the ‘outcome’ (i.e. final report = A), and removing the Report 3 Final.Grade from the predictor variables.
Also need to turn the character variables into factors for svm.
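Pulling those notes together, the wrangling for this round looks roughly like this (a sketch - `df.long`, `df.demo`, the 80% cut-off and the column names other than Final.Grade are assumptions):

```r
library(reshape2)

# Report 1 measures (mark, feedback provision and use) become the features;
# the Report 3 grade becomes the outcome
r1 <- dcast(melt(df.long[df.long$Report == 1, ],
                 id.vars = c("StudentID", "Report")),
            StudentID ~ variable)
r3 <- df.long[df.long$Report == 3, c("StudentID", "Final.Grade")]

# One row per student: Report 1 features + demographics + outcome
df.all   <- merge(merge(r1, df.demo, by = "StudentID"), r3, by = "StudentID")
df.all$y <- factor(df.all$Final.Grade >= 80)   # A on the final report, or not
df.all$Final.Grade <- NULL                     # would predict y perfectly

# svm needs factors, not character vectors
chr <- sapply(df.all, is.character)
df.all[chr] <- lapply(df.all[chr], factor)
```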
Found the NAs that were tripping up svm later on - UQ..OP (the high-school score students entered uni with) has 899 NAs, which takes out 25% of the students - and the missingness is biased, because these would be international, interstate and mature-entry students.
Since prior academic performance predicts future performance, and the lack of an OP is unevenly distributed/correlated with other demographic features (“biased”) in the data set, I will try the model with and without OP to compare performance…
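Concretely, that means two versions of the data set (a sketch; `df.all` is the assumed merged data frame from the previous sketch, and `UQ..OP` is the OP column named above):

```r
# (1) keep UQ..OP but drop students with no OP recorded
df.withOP <- df.all[!is.na(df.all$UQ..OP), ]

# (2) drop the UQ..OP column and keep every student
df.noOP <- df.all[, setdiff(names(df.all), "UQ..OP")]
```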
Next, need to split the data for svm - but taking systematic random samples from each course × semester. With OP included first:
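One way to take those systematic samples per course × semester cell (a sketch - the 2/3 training fraction and the `Course`/`Semester` column names are assumptions):

```r
# Stratified train/test split: sample the same fraction from each
# course x semester cell
set.seed(1)
split.by.group <- function(d, p.train = 2/3) {
  idx <- unlist(lapply(split(seq_len(nrow(d)),
                             interaction(d$Course, d$Semester)),
                       function(i) i[sample.int(length(i),
                                                round(length(i) * p.train))]))
  list(train = d[idx, ], test = d[-idx, ])
}
sets     <- split.by.group(df.withOP)
df.train <- sets$train
df.test  <- sets$test
```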
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007692308
##
## Number of Support Vectors: 696
## FALSE TRUE Sum
## FALSE 130 7 137
## TRUE 76 51 127
## Sum 206 58 264
## Results
## accuracy 68.560606
## misclassification 31.439394
## prevalence 48.106061
## precision 87.931034
## true.pos 40.157480
## false.pos 5.109489
## true.neg 94.890511
OK, the first run gave 75% accuracy - so still very good (a second run without UQM.ID and Consent gave 73%). Next, to try the larger data set, but without the OP variable…
Again splitting the data for svm - no OP this time:
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007698229
##
## Number of Support Vectors: 918
## FALSE TRUE Sum
## FALSE 151 40 191
## TRUE 52 109 161
## Sum 203 149 352
## Results
## accuracy 73.86364
## misclassification 26.13636
## prevalence 45.73864
## precision 73.15436
## true.pos 67.70186
## false.pos 20.94241
## true.neg 79.05759
Accuracy came down to 69%, but that is not really very different from 75%, given that the random split of the data will vary things that much anyway (a second run without UQM.ID and Consent gave 75%).
So, you would loop over the svm a few times to get a range of accuracies?
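One simple way to do that would be to repeat the split/fit/predict cycle and look at the spread (a sketch re-using the split.by.group() helper assumed earlier):

```r
# Distribution of test-set accuracy over repeated random splits
accs <- replicate(20, {
  sets <- split.by.group(df.noOP)
  fit  <- svm(y ~ ., data = sets$train, cost = 1)
  pred <- predict(fit, newdata = sets$test)
  100 * mean(pred == sets$test$y)
})
summary(accs)
```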