loading data…
So, we have data on students’ academic performance, feedback provision, feedback use, and demographics: two semesters of two courses, with a total of 1947 students (some students took both courses and some repeated a course - 101 students to be exact, so they appear in two cells of the table below). Here is the breakdown of the number of students in each course each semester:
table(demog$course, demog$sem)
## 
##           Semester 1 Semester 2
##   Level 1        733        972
##   Level 2        251         92
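(The 101 figure can be sanity-checked from demog - a quick sketch, assuming demog has one row per student per course per semester:)
# students who appear in more than one course x semester combination
sum(table(demog$StudentID) > 1)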
As usual, I ended up spending 90% of the time and code on wrangling the data - even though this is a clean data set that we have already published analyses from in a leading speciality, peer-reviewed, international journal, and I am working in a programming language I am fluent in (R) - and then 5% of the time running, altering and re-running the ‘machine learning’ algorithms (and 5% of the time making it pretty for posting here :P)
Checking for duplicate records (StudentID + report) within a single semester; these would be errors, since students shouldn’t be resubmitting in this context.
df = AcP.FbU.FbP
df = subset(AcP.FbU.FbP, course == "Level 1" & sem == "Semester 1")
sub.count = with(df, table(StudentID, report))
which(sub.count > 1) # any StudentID x report combination counted more than once?
## integer(0)
No duplicate submissions - so the data are reliable and there’s no need to delete duplicates (or, alternatively, to use mean as the aggregation function in dcast) - great!
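(For the record, if there had been duplicates, reshape2’s dcast can aggregate them away - a sketch, using Final.Grade as an example value column:)
# collapse duplicate StudentID x report rows by averaging the value column
# df2 = dcast(df, StudentID ~ report, value.var = "Final.Grade", fun.aggregate = mean)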
Removing cases with missing data
(svm will train, but throws errors during prediction if any feature contains NA or NaN)
df = AcP.FbU.FbP
df2 = df[!rowSums(is.na(df)),] # drops data from 705 obs to 461!
# actually most NAs (90) are in OpenDuration.min; min(OpenDuration.min) > 0, and open.bin shows 66, 44, 50 and 104 unopened for R0 -> R3, so convert the NAs to 0
cols = c("OpenDuration.min")
for (j in cols) {
  df[is.na(df[, j]), j] = 0
}
df2 = df[!rowSums(is.na(df)),] # brings data back up to 609 obs :)
AcP.FbU.FbP = df
also selecting the variables to use as features,
then reshaping so there is one row per StudentID
#df = AcP.FbU.FbP
df = subset(AcP.FbU.FbP, course == "Level 1" & sem == "Semester 1")
#df.FinalGrade = melt(df[,c(1,13,17)], id=c("StudentID", "report"), value.name="Final.Grade")
#df2 = dcast(df.FinalGrade, StudentID~variable+report, mean)
#df columns = "StudentID", "report", "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID"
#dv = "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID"
df = df[,c(10,17,13,2:9,18,19,11)]
dv = names(df[,c(3:ncol(df))])
ls = NULL
for (i in 1:length(dv)) {
ls[[i]] = melt(df[,c(1,2,(i+2))], id=c("StudentID", "report"), value.name=dv[i])
}
ls2 = NULL
for (i in 1:length(dv)) {
ls2[[i]] = dcast(ls[[i]], StudentID~variable+report)
}
df2 = ls2[[1]]
for (i in 2:length(ls2)) {
df2 = merge(df2, ls2[[i]], by = "StudentID", all.x=T)
}
# removing NAs so svm runs
#str(df2)
#names(df2)
#length(which(is.na(rowSums(df2[,2:41]))))
df2 = df2[-which(is.na(rowSums(df2[,2:41]))),]
for (i in 42:49) { # the open.bin and MarkerID columns for each report are categorical
df2[,i] = as.factor(df2[,i])
}
mach.learn = df2
Turning the ‘outcome’, i.e. the mark for Report 3, into a categorical feature so I can use a logistic/classification approach, i.e. “did student X get an A on their 3rd (final) report?” (Report 3 is marked out of 48, so 41/48 scaled up is roughly the 85% A cut-off.) Also removing the Report 3 Final.Grade from the features, since that alone would predict the outcome perfectly :P
df = mach.learn
df$y = as.factor(df$`Final.Grade_Report 3` > (41 * (100/48)))
#table(df[,37], df[,42]) #check
col = which(names(df) == "Final.Grade_Report 3")
mach.learn = df[,-col]
Split data into training, cv and test sets
df = mach.learn
stud.split = splitter(df$StudentID, 60)
df.train = df[which(df$StudentID %in% stud.split[[1]]),2:ncol(df)]
df.cv = df[which(df$StudentID %in% stud.split[[2]]),2:ncol(df)]
df.test = df[which(df$StudentID %in% stud.split[[3]]),2:ncol(df)]
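(splitter is a custom helper that isn’t shown in this post - roughly, it randomly splits the unique IDs into training, cv and test sets, with the second argument giving the training percentage. A minimal sketch of that idea - not the actual function:)
splitter = function(ids, train.pct) {
  ids = sample(unique(ids))                   # shuffle the unique IDs
  n = length(ids)
  n.train = round(n * train.pct / 100)
  n.cv = round(n * (100 - train.pct) / 200)   # split the remainder evenly between cv and test
  list(train = ids[1:n.train],
       cv = ids[(n.train + 1):(n.train + n.cv)],
       test = ids[(n.train + n.cv + 1):n])
}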
NB - in the next chunk, I’m not really sure about the ‘tune’ function in R - may need to read some more. If not ‘tuning’ to optimise the regularisation parameter (lambda) [entered as ‘cost’ in this R package/function - strictly, cost behaves like 1/lambda, so a larger cost means less regularisation], then the cv and test data sets could be pooled to get a better estimate of svm model performance.
First, train the svm on the training set (this prints out the svm parameters). Then use the svm model to predict the outcome (y, i.e. whether the student gets an A on the final report or not) on the cross-validation set.
#df = df.train[,c(2:ncol(df.train))]
#df$y = as.logical(df$y)
svmfit <- svm(y ~., data=df.train)
print(svmfit)
##
## Call:
## svm(formula = y ~ ., data = df.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.007936508
##
## Number of Support Vectors: 311
#my_cols <- c("#00AFBB", "#E7B800")
#plot(svmfit, df) # can't plot coz >2 variables (high dimensionality data)
p.cv = predict(svmfit, df.cv)
df.cv$p = p.cv
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cmStats = conMatrixStats(cm) # for accuracy could use: mean(df.cv$p == df.cv$y)
cmStats
## Results
## accuracy 77.04918
## misclassification 22.95082
## prevalence 49.18033
## precision 69.51220
## true.pos 95.00000
## false.pos 40.32258
## true.neg 59.67742
So, pretty good (predicts with ~70-80% accuracy who will get an A for the final report and who won’t).
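(conMatrixStats is another custom helper not shown in this post. For completeness, a sketch of how these stats could be computed from the margined confusion matrix - assuming rows are actual y, columns are predicted p, with FALSE/TRUE/Sum in that order:)
conMatrixStats = function(cm) {
  # cm has rows/columns FALSE, TRUE, Sum; e.g. cm["TRUE", "TRUE"] = true positives
  total = cm["Sum", "Sum"]
  res = c(accuracy          = (cm["TRUE", "TRUE"] + cm["FALSE", "FALSE"]) / total,
          misclassification = (cm["TRUE", "FALSE"] + cm["FALSE", "TRUE"]) / total,
          prevalence        = cm["TRUE", "Sum"] / total,                  # proportion of actual TRUEs
          precision         = cm["TRUE", "TRUE"] / cm["Sum", "TRUE"],
          true.pos          = cm["TRUE", "TRUE"] / cm["TRUE", "Sum"],     # sensitivity / recall
          false.pos         = cm["FALSE", "TRUE"] / cm["FALSE", "Sum"],
          true.neg          = cm["FALSE", "FALSE"] / cm["FALSE", "Sum"])  # specificity
  data.frame(Results = 100 * res)
}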
Now trying to ‘tune cost’ (ie optimise regularisation parameter)
tuned = tune(svm, y~., data = df.cv, ranges = list(cost=c(0.0001,0.001,0.01, 0.1, 1, 10, 100)))
summary(tuned)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.2961538
##
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-04 0.6121795  0.1092150
## 2 1e-03 0.6121795  0.1092150
## 3 1e-02 0.6121795  0.1092150
## 4 1e-01 0.6121795  0.1092150
## 5 1e+00 0.3839744  0.1522487
## 6 1e+01 0.2961538  0.1517786
## 7 1e+02 0.3275641  0.1327351
# returns lowest cost but that gives worst performance on test data set - clearly overfitting?
# now returns highest ie 100 dropping accuracy on df.test to 68%, but also get 69% for cost = 10 and cost = 1, so must be data split
svmfit.tuned <- svm(y ~., data=df.train, cost = as.numeric(tuned[[1]]))
print(svmfit.tuned)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = as.numeric(tuned[[1]]))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.007936508
##
## Number of Support Vectors: 273
p.test = predict(svmfit.tuned, df.test)
df.test$p = p.test
cm.test = as.data.frame.matrix(addmargins(with(df.test, table(y, p))))
cmStats = conMatrixStats(cm.test)
cmStats
## Results
## accuracy 80.32787
## misclassification 19.67213
## prevalence 54.09836
## precision 81.81818
## true.pos 81.81818
## false.pos 21.42857
## true.neg 78.57143
Tuning really seems to drop accuracy on the test data set - as low as 42% on some runs, whereas just using cost = 1 kept accuracy up near 70-80% on the test set. Need to look into the tune function…
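(One thing to try: tune on the training set rather than the cv set, and over gamma as well as cost, then score the best model once on held-out data. A sketch:)
tuned2 = tune(svm, y ~ ., data = df.train,
              ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.001, 0.01, 0.1)))
mean(predict(tuned2$best.model, df.test) == df.test$y)  # accuracy on the test set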
Next, to generalise: add in the demographic data and reduce back to just the first report (mark, feedback provision and feedback use) so the model can be used on both courses. Pragmatically, this would also give academics, markers and students an earlier indication of performance on the final report…
Reducing down to 1st and final reports to align/generalise
df = AcP.FbU.FbP
for (i in 1:nrow(df)) {
if (df$course[i] == "Level 2" & df$final[i] == "nonfinal") {
df$final[i] = "first"
}
if (df$course[i] == "Level 1" & df$report[i] == "Report 0") {
df$final[i] = "first"
}
if (df$course[i] == "Level 1" & df$sem[i] == "Semester 2" & df$report[i] == "Report 1") {
df$final[i] = "first"
}
}
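# (a vectorised sketch of the same relabelling, assuming 'final' is a character column:
#  df$final[df$course == "Level 2" & df$final == "nonfinal"] = "first"
#  df$final[df$course == "Level 1" & df$report == "Report 0"] = "first"
#  df$final[df$course == "Level 1" & df$sem == "Semester 2" & df$report == "Report 1"] = "first")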
df2 = subset(df, final == "first" | final == "final")
# check for duplicates (errors)
length(which(duplicated(df2[,c("StudentID", "course", "sem", "report")])))
## [1] 0
mach.learn2 = df2
Reshaping so that each row starts with a StudentID followed by all of that student’s features, and adding in the demographic data
df = mach.learn2
# UQM data first (since diff for first and final)
df = df[,c("StudentID", "course", "sem", "final", "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID")]
vars = c("Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID")
cs = as.factor(paste(df$course, df$sem))
cs.ls = NULL
cs.ls = split(df, cs)
last.ls = NULL
for (l in 1:length(cs.ls))
{
ls.11 = NULL
for (i in 1:length(vars)) {
ls.11[[i]] = melt(cs.ls[[l]][,c(1,4, (i+4))], id=c("StudentID", "final"), value.name = vars[i])
}
ls.11b = NULL
for (i in 1:length(vars)) {
ls.11b[[i]] = dcast(ls.11[[i]], StudentID~variable+final)
}
df2 = ls.11b[[1]]
for (i in 2:length(ls.11b)) {
df2 = merge(df2, ls.11b[[i]], by = "StudentID", all.x=T)
}
df2$cs = names(cs.ls)[l]
last.ls[[l]] = df2
}
rm(ls.11, ls.11b, df2)
df3 = last.ls[[1]]
for (l in 2:length(last.ls)) {
df3 = rbind(df3, last.ls[[l]])
}
# then demog data (since same for first and final)
# first removing end of semester demographic info since this would come in after final report outcome
#"Official.Grade", "Course.Grade", "Semester.GPA", "Cumulative.Semester.GPA"
# and removing UQM.ID (unique to each student) and consent (same for all students)
# listing the columns kept here; dropping by name also works (see the commented sketch below)
demog = demog[,c("StudentID","Date.of.Birth", "Gender.Description", "Permanent.Country", "International.Indicator.Code", "Lang.Spoken.at.Home.Description", "Citizenship.Status.Description", "Birth.Country.Description", "ATSI.Indicator.Code", "UQ.Translated.Base.OP", "course", "sem")]
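# (drop-by-name sketch using setdiff; the exact names of the ID and consent columns are assumed:
#  demog = demog[, setdiff(names(demog), c("Official.Grade", "Course.Grade", "Semester.GPA", "Cumulative.Semester.GPA", "UQM.ID", "Consent"))])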
demog$cs = as.factor(paste(demog$course, demog$sem))
df4 = merge(df3, demog, by=c("StudentID", "cs"))
# dropping cs now that the merge is done, keeping course and sem
mach.learn2 = df4[,c(1,36,37,3:35)]
Adding on the ‘outcome’ (i.e. did the student get an A on the final report?) and removing the final report’s Final.Grade from the features
df = mach.learn2
df$y = as.factor(df$`Final.Grade_final` > 85)
#table(df[,37], df[,42]) #check
# not generalising well to grade bands, and need to adjust conMatrixStats function if y > 2 classes
#bins = c(0,49,64,74,84,100)
#df$y = as.factor(cut(df$`Final.Grade_final`, bins))
col = which(names(df) == "Final.Grade_final")
mach.learn2 = df[,-col]
Need to turn characters into factors for svm
df = mach.learn2
char.cols = which(sapply(df, is.character))
for (j in char.cols) {
  df[, j] = as.factor(df[, j])
}
mach.learn2 = df
Found the NAs that were tripping up the svm later: UQ.Translated.Base.OP (the high school score students entered uni with) has 899 NAs, which takes out about 25% of the students - and in a biased way, because these would be internationals, interstate students and mature-age entrants.
Since prior academic performance (AcP) predicts future performance, and the missing OP is unevenly distributed across (correlated with) other demographic features in the data set (“biased”), I will try the model with and without it and compare performance…
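(A quick way to see which columns are driving the row loss - a sketch:)
# count NAs per column; UQ.Translated.Base.OP should dominate
sort(colSums(is.na(mach.learn2)), decreasing = TRUE)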
df = mach.learn2
df2 = df[,c(1:34,36)] # removing OP
mach.learn2.op = df[!rowSums(is.na(df)),] # drops data from 1869 obs to 1322 if keep OP
mach.learn2.no_op = df2[!rowSums(is.na(df2)),] # drops data from 1869 obs to 1762 if leave OP out
Next, need to split for svm - but taking systematic random samples from each course x semester
with OP included first
df = mach.learn2.op
df$sID.cs = with(df, paste(StudentID, course, sem))
col = ncol(df)
df = df[,c(1,col, 2:(col-1))]
stud.split = splitter(df$sID.cs, 60)
df.train = df[which(df$sID.cs %in% stud.split[[1]]),3:ncol(df)]
df.cv = df[which(df$sID.cs %in% stud.split[[2]]),3:ncol(df)]
df.test = df[which(df$sID.cs %in% stud.split[[3]]),3:ncol(df)]
svmfit2 <- svm(y ~., data=df.train, cost=1)
print(svmfit2)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007692308
##
## Number of Support Vectors: 697
p <- predict(svmfit2, df.cv)
df.cv$p = p
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cm # rows are actual values (y) and columns are prediction (p)
##       FALSE TRUE Sum
## FALSE   121   25 146
## TRUE     42   76 118
## Sum     163  101 264
cmStats = conMatrixStats(cm)
cmStats
## Results
## accuracy 74.62121
## misclassification 25.37879
## prevalence 44.69697
## precision 75.24752
## true.pos 64.40678
## false.pos 17.12329
## true.neg 82.87671
OK, the first run gave 75% accuracy - so still very good. (A 2nd run without UQM.ID and Consent gave 73%.) Next, to try the larger data set but without the OP variable…
Again split for svm
no OP this time
df = mach.learn2.no_op
df$sID.cs = with(df, paste(StudentID, course, sem))
col = ncol(df)
df = df[,c(1,col, 2:(col-1))]
stud.split = splitter(df$sID.cs, 60)
df.train = df[which(df$sID.cs %in% stud.split[[1]]),3:ncol(df)]
df.cv = df[which(df$sID.cs %in% stud.split[[2]]),3:ncol(df)]
df.test = df[which(df$sID.cs %in% stud.split[[3]]),3:ncol(df)]
svmfit2 <- svm(y ~., data=df.train, cost=1)
print(svmfit2)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007698229
##
## Number of Support Vectors: 910
p <- predict(svmfit2, df.cv)
df.cv$p = p
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cm # rows are actual values (y) and columns are prediction (p)
##       FALSE TRUE Sum
## FALSE   144   60 204
## TRUE     37  111 148
## Sum     181  171 352
cmStats = conMatrixStats(cm)
cmStats
## Results
## accuracy 72.44318
## misclassification 27.55682
## prevalence 42.04545
## precision 64.91228
## true.pos 75.00000
## false.pos 29.41176
## true.neg 70.58824
Accuracy came down to 69% on one run (about 72% in the run shown above), but that’s not really that different from 75%, given that the random split of the data will produce that sort of variation anyway (a 2nd run without UQM.ID and Consent gave 75%).
So, you would loop over the svm a few times to get a range of accuracies?
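Something like the following sketch would do it: re-split, retrain and score a handful of times (pooling the cv and test portions, since there is no tuning here), then look at the spread of accuracies rather than a single figure.
# repeat the split / train / predict cycle to get a distribution of accuracies
df = mach.learn2.no_op
df$sID.cs = with(df, paste(StudentID, course, sem))
accs = numeric(20)
for (k in 1:20) {
  sp = splitter(df$sID.cs, 60)
  keep = !(names(df) %in% c("StudentID", "sID.cs"))
  train = df[df$sID.cs %in% sp[[1]], keep]
  test  = df[df$sID.cs %in% c(sp[[2]], sp[[3]]), keep]
  fit = svm(y ~ ., data = train, cost = 1)
  # note: a split can leave rare factor levels (e.g. countries) unseen in training,
  # in which case predict() will complain about new levels
  accs[k] = mean(predict(fit, test) == test$y)
}
summary(accs)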