loading data…
So, we have data on students’ academic performance, feedback provision, feedback use, and demographics: two semesters of two courses, with a total of 1947 students (some students took both courses and some repeated a course - 101 students to be exact, so they appear in two cells of the table below). Here is the breakdown of the number of students in each course each semester:
table(demog$course, demog$sem)
## 
##           Semester 1 Semester 2
##   Level 1        733        972
##   Level 2        251         92
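(The 101 figure can be sanity-checked from demog - a quick sketch, assuming demog has one row per student per course per semester:)
# students who appear in more than one course x semester combination
sum(table(demog$StudentID) > 1)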
As usual, I ended up spending 90% of the time and code on wrangling the data - even though this is a clean data set that we have already published analyses from in a leading speciality, peer-reviewed, international journal, and I am working in a programming language I am fluent in (R) - and then 5% of the time running, altering and re-running the ‘machine learning’ algorithms (and 5% of the time making it pretty for posting here :P)
Checking for duplicate records (StudentID + report) within a single semester; these would be errors, since students shouldn’t be resubmitting in this context.
df = AcP.FbU.FbP
df = subset(AcP.FbU.FbP, course == "Level 1" & sem == "Semester 1")
sub.count = with(df, table(StudentID, report))
which(sub.count > 1) # any StudentID x report combination counted more than once?
## integer(0)
No duplicate submissions - so the data are reliable and there’s no need to delete duplicates (or, alternatively, to use mean as the aggregation function in dcast) - great!
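(For the record, if there had been duplicates, reshape2’s dcast can aggregate them away - a sketch, using Final.Grade as an example value column:)
# collapse duplicate StudentID x report rows by averaging the value column
# df2 = dcast(df, StudentID ~ report, value.var = "Final.Grade", fun.aggregate = mean)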
Removing cases with missing data
(svm will train, but throws errors during prediction if any feature contains NA or NaN)
df = AcP.FbU.FbP
df2 = df[!rowSums(is.na(df)),] # drops data from 705 obs to 461!
# actually most NAs (90) are in OpenDuration.min; min(OpenDuration.min) > 0, and open.bin shows 66, 44, 50 and 104 unopened for R0 -> R3, so convert the NAs to 0
cols = c("OpenDuration.min")
for (j in cols) {
  df[is.na(df[, j]), j] = 0
}
df2 = df[!rowSums(is.na(df)),] # brings data back up to 609 obs :)
AcP.FbU.FbP = df
also selecting the variables to use as features,
then reshaping so there is one row per StudentID
#df = AcP.FbU.FbP
df = subset(AcP.FbU.FbP, course == "Level 1" & sem == "Semester 1")
#df.FinalGrade = melt(df[,c(1,13,17)], id=c("StudentID", "report"), value.name="Final.Grade")
#df2 = dcast(df.FinalGrade, StudentID~variable+report, mean)
#df columns = "StudentID", "report", "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID"
#dv = "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID"
df = df[,c(10,17,13,2:9,18,19,11)]
dv = names(df[,c(3:ncol(df))])
ls = NULL
for (i in 1:length(dv)) {
ls[[i]] = melt(df[,c(1,2,(i+2))], id=c("StudentID", "report"), value.name=dv[i])
}
ls2 = NULL
for (i in 1:length(dv)) {
ls2[[i]] = dcast(ls[[i]], StudentID~variable+report)
}
df2 = ls2[[1]]
for (i in 2:length(ls2)) {
df2 = merge(df2, ls2[[i]], by = "StudentID", all.x=T)
}
# removing NAs so svm runs
#str(df2)
#names(df2)
#length(which(is.na(rowSums(df2[,2:41]))))
df2 = df2[-which(is.na(rowSums(df2[,2:41]))),]
for (i in 42:49) { # the open.bin and MarkerID columns for each report are categorical
df2[,i] = as.factor(df2[,i])
}
mach.learn = df2
Turning the ‘outcome’, i.e. the mark for Report 3, into a categorical feature so I can use a logistic/classification approach, i.e. “did student X get an A on their 3rd (final) report?” (Report 3 is marked out of 48, so 41/48 scaled up is roughly the 85% A cut-off.) Also removing the Report 3 Final.Grade from the features, since that alone would predict the outcome perfectly :P
df = mach.learn
df$y = as.factor(df$`Final.Grade_Report 3` > (41 * (100/48)))
#table(df[,37], df[,42]) #check
col = which(names(df) == "Final.Grade_Report 3")
mach.learn = df[,-col]
Split data into training, cv and test sets
df = mach.learn
stud.split = splitter(df$StudentID, 60)
df.train = df[which(df$StudentID %in% stud.split[[1]]),2:ncol(df)]
df.cv = df[which(df$StudentID %in% stud.split[[2]]),2:ncol(df)]
df.test = df[which(df$StudentID %in% stud.split[[3]]),2:ncol(df)]
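(splitter is a custom helper that isn’t shown in this post - roughly, it randomly splits the unique IDs into training, cv and test sets, with the second argument giving the training percentage. A minimal sketch of that idea - not the actual function:)
splitter = function(ids, train.pct) {
  ids = sample(unique(ids))                   # shuffle the unique IDs
  n = length(ids)
  n.train = round(n * train.pct / 100)
  n.cv = round(n * (100 - train.pct) / 200)   # split the remainder evenly between cv and test
  list(train = ids[1:n.train],
       cv = ids[(n.train + 1):(n.train + n.cv)],
       test = ids[(n.train + n.cv + 1):n])
}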
NB - in the next chunk, I’m not really sure about the ‘tune’ function in R - may need to read some more. If not ‘tuning’ to optimise the regularisation parameter (lambda) [entered as ‘cost’ in this R package/function - strictly, cost behaves like 1/lambda, so a larger cost means less regularisation], then the cv and test data sets could be pooled to get a better estimate of svm model performance.
First, train the svm on the training set (this prints out the svm parameters). Then use the svm model to predict the outcome (y, i.e. whether the student gets an A on the final report or not) on the cross-validation set.
#df = df.train[,c(2:ncol(df.train))]
#df$y = as.logical(df$y)
svmfit <- svm(y ~., data=df.train)
print(svmfit)
##
## Call:
## svm(formula = y ~ ., data = df.train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.007936508
##
## Number of Support Vectors: 311
#my_cols <- c("#00AFBB", "#E7B800")
#plot(svmfit, df) # can't plot coz >2 variables (high dimensionality data)
p.cv = predict(svmfit, df.cv)
df.cv$p = p.cv
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cmStats = conMatrixStats(cm) # for accuracy could use: mean(df.cv$p == df.cv$y)
cmStats
## Results
## accuracy 77.04918
## misclassification 22.95082
## prevalence 49.18033
## precision 69.51220
## true.pos 95.00000
## false.pos 40.32258
## true.neg 59.67742
So, pretty good (predicts with ~70-80% accuracy who will get an A for the final report and who won’t).
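(conMatrixStats is another custom helper not shown in this post. For completeness, a sketch of how these stats could be computed from the margined confusion matrix - assuming rows are actual y, columns are predicted p, with FALSE/TRUE/Sum in that order:)
conMatrixStats = function(cm) {
  # cm has rows/columns FALSE, TRUE, Sum; e.g. cm["TRUE", "TRUE"] = true positives
  total = cm["Sum", "Sum"]
  res = c(accuracy          = (cm["TRUE", "TRUE"] + cm["FALSE", "FALSE"]) / total,
          misclassification = (cm["TRUE", "FALSE"] + cm["FALSE", "TRUE"]) / total,
          prevalence        = cm["TRUE", "Sum"] / total,                  # proportion of actual TRUEs
          precision         = cm["TRUE", "TRUE"] / cm["Sum", "TRUE"],
          true.pos          = cm["TRUE", "TRUE"] / cm["TRUE", "Sum"],     # sensitivity / recall
          false.pos         = cm["FALSE", "TRUE"] / cm["FALSE", "Sum"],
          true.neg          = cm["FALSE", "FALSE"] / cm["FALSE", "Sum"])  # specificity
  data.frame(Results = 100 * res)
}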
Now trying to ‘tune cost’ (ie optimise regularisation parameter)
tuned = tune(svm, y~., data = df.cv, ranges = list(cost=c(0.0001,0.001,0.01, 0.1, 1, 10, 100)))
summary(tuned)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## cost
## 10
##
## - best performance: 0.2961538
##
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-04 0.6121795  0.1092150
## 2 1e-03 0.6121795  0.1092150
## 3 1e-02 0.6121795  0.1092150
## 4 1e-01 0.6121795  0.1092150
## 5 1e+00 0.3839744  0.1522487
## 6 1e+01 0.2961538  0.1517786
## 7 1e+02 0.3275641  0.1327351
# returns lowest cost but that gives worst performance on test data set - clearly overfitting?
# now returns highest ie 100 dropping accuracy on df.test to 68%, but also get 69% for cost = 10 and cost = 1, so must be data split
svmfit.tuned <- svm(y ~., data=df.train, cost = as.numeric(tuned[[1]]))
print(svmfit.tuned)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = as.numeric(tuned[[1]]))
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 10
## gamma: 0.007936508
##
## Number of Support Vectors: 273
p.test = predict(svmfit.tuned, df.test)
df.test$p = p.test
cm.test = as.data.frame.matrix(addmargins(with(df.test, table(y, p))))
cmStats = conMatrixStats(cm.test)
cmStats
## Results
## accuracy 80.32787
## misclassification 19.67213
## prevalence 54.09836
## precision 81.81818
## true.pos 81.81818
## false.pos 21.42857
## true.neg 78.57143
Tuning really seems to drop accuracy on the test data set - as low as 42% on some runs, whereas just using cost = 1 kept accuracy up near 70-80% on the test set. Need to look into the tune function…
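(One thing to try: tune on the training set rather than the cv set, and over gamma as well as cost, then score the best model once on held-out data. A sketch:)
tuned2 = tune(svm, y ~ ., data = df.train,
              ranges = list(cost = c(0.1, 1, 10, 100), gamma = c(0.001, 0.01, 0.1)))
mean(predict(tuned2$best.model, df.test) == df.test$y)  # accuracy on the test set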
Next, to generalise: add in the demographic data and reduce back to just the first report (mark, feedback provision and feedback use) so the model can be used on both courses. Pragmatically, this would also give academics, markers and students an earlier indication of performance on the final report…
Reducing down to 1st and final reports to align/generalise
df = AcP.FbU.FbP
for (i in 1:nrow(df)) {
if (df$course[i] == "Level 2" & df$final[i] == "nonfinal") {
df$final[i] = "first"
}
if (df$course[i] == "Level 1" & df$report[i] == "Report 0") {
df$final[i] = "first"
}
if (df$course[i] == "Level 1" & df$sem[i] == "Semester 2" & df$report[i] == "Report 1") {
df$final[i] = "first"
}
}
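# (a vectorised sketch of the same relabelling, assuming 'final' is a character column:
#  df$final[df$course == "Level 2" & df$final == "nonfinal"] = "first"
#  df$final[df$course == "Level 1" & df$report == "Report 0"] = "first"
#  df$final[df$course == "Level 1" & df$sem == "Semester 2" & df$report == "Report 1"] = "first")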
df2 = subset(df, final == "first" | final == "final")
# check for duplicates (errors)
length(which(duplicated(df2[,c("StudentID", "course", "sem", "report")])))
## [1] 0
mach.learn2 = df2
Reshaping so that each row starts with a StudentID followed by all of that student’s features, and adding in the demographic data
df = mach.learn2
# UQM data first (since diff for first and final)
df = df[,c("StudentID", "course", "sem", "final", "Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID")]
vars = c("Final.Grade", "Freehand", "Highlight", "Recording", "Text", "total.words", "auidio.min", "audio.words", "txt.words", "OpenDuration.min", "open.bin", "MarkerID")
cs = as.factor(paste(df$course, df$sem))
cs.ls = NULL
cs.ls = split(df, cs)
last.ls = NULL
for (l in 1:length(cs.ls))
{
ls.11 = NULL
for (i in 1:length(vars)) {
ls.11[[i]] = melt(cs.ls[[l]][,c(1,4, (i+4))], id=c("StudentID", "final"), value.name = vars[i])
}
ls.11b = NULL
for (i in 1:length(vars)) {
ls.11b[[i]] = dcast(ls.11[[i]], StudentID~variable+final)
}
df2 = ls.11b[[1]]
for (i in 2:length(ls.11b)) {
df2 = merge(df2, ls.11b[[i]], by = "StudentID", all.x=T)
}
df2$cs = names(cs.ls)[l]
last.ls[[l]] = df2
}
rm(ls.11, ls.11b, df2)
df3 = last.ls[[1]]
for (l in 2:length(last.ls)) {
df3 = rbind(df3, last.ls[[l]])
}
# then demog data (since same for first and final)
# first removing end of semester demographic info since this would come in after final report outcome
#"Official.Grade", "Course.Grade", "Semester.GPA", "Cumulative.Semester.GPA"
# and removing UQM.ID (unique to each student) and consent (same for all students)
# listing the columns kept here; dropping by name also works (see the commented sketch below)
demog = demog[,c("StudentID","Date.of.Birth", "Gender.Description", "Permanent.Country", "International.Indicator.Code", "Lang.Spoken.at.Home.Description", "Citizenship.Status.Description", "Birth.Country.Description", "ATSI.Indicator.Code", "UQ.Translated.Base.OP", "course", "sem")]
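# (drop-by-name sketch using setdiff; the exact names of the ID and consent columns are assumed:
#  demog = demog[, setdiff(names(demog), c("Official.Grade", "Course.Grade", "Semester.GPA", "Cumulative.Semester.GPA", "UQM.ID", "Consent"))])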
demog$cs = as.factor(paste(demog$course, demog$sem))
df4 = merge(df3, demog, by=c("StudentID", "cs"))
# dropping cs now that the merge is done, keeping course and sem
mach.learn2 = df4[,c(1,36,37,3:35)]
Adding on the ‘outcome’ (i.e. did the student get an A on the final report?) and removing the final report’s Final.Grade from the features
df = mach.learn2
df$y = as.factor(df$`Final.Grade_final` > 85)
#table(df[,37], df[,42]) #check
# not generalising well to grade bands, and need to adjust conMatrixStats function if y > 2 classes
#bins = c(0,49,64,74,84,100)
#df$y = as.factor(cut(df$`Final.Grade_final`, bins))
col = which(names(df) == "Final.Grade_final")
mach.learn2 = df[,-col]
Need to turn characters into factors for svm
df = mach.learn2
char.cols = which(sapply(df, is.character))
for (j in char.cols) {
  df[, j] = as.factor(df[, j])
}
mach.learn2 = df
Found the NAs that were tripping up the svm later: UQ.Translated.Base.OP (the high school score students entered uni with) has 899 NAs, which takes out about 25% of the students - and in a biased way, because these would be internationals, interstate students and mature-age entrants.
Since prior academic performance (AcP) predicts future performance, and the missing OP is unevenly distributed across (correlated with) other demographic features in the data set (“biased”), I will try the model with and without it and compare performance…
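(A quick way to see which columns are driving the row loss - a sketch:)
# count NAs per column; UQ.Translated.Base.OP should dominate
sort(colSums(is.na(mach.learn2)), decreasing = TRUE)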
df = mach.learn2
df2 = df[,c(1:34,36)] # removing OP
mach.learn2.op = df[!rowSums(is.na(df)),] # drops data from 1869 obs to 1322 if keep OP
mach.learn2.no_op = df2[!rowSums(is.na(df2)),] # drops data from 1869 obs to 1762 if leave OP out
Next, need to split for svm - but taking systematic random samples from each course x semester
with OP included first
df = mach.learn2.op
df$sID.cs = with(df, paste(StudentID, course, sem))
col = ncol(df)
df = df[,c(1,col, 2:(col-1))]
stud.split = splitter(df$sID.cs, 60)
df.train = df[which(df$sID.cs %in% stud.split[[1]]),3:ncol(df)]
df.cv = df[which(df$sID.cs %in% stud.split[[2]]),3:ncol(df)]
df.test = df[which(df$sID.cs %in% stud.split[[3]]),3:ncol(df)]
svmfit2 <- svm(y ~., data=df.train, cost=1)
print(svmfit2)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007692308
##
## Number of Support Vectors: 697
p <- predict(svmfit2, df.cv)
df.cv$p = p
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cm # rows are actual values (y) and columns are prediction (p)
##       FALSE TRUE Sum
## FALSE   121   25 146
## TRUE     42   76 118
## Sum     163  101 264
cmStats = conMatrixStats(cm)
cmStats
## Results
## accuracy 74.62121
## misclassification 25.37879
## prevalence 44.69697
## precision 75.24752
## true.pos 64.40678
## false.pos 17.12329
## true.neg 82.87671
OK, the first run gave 75% accuracy - so still very good. (A 2nd run without UQM.ID and Consent gave 73%.) Next, to try the larger data set but without the OP variable…
Again split for svm
no OP this time
df = mach.learn2.no_op
df$sID.cs = with(df, paste(StudentID, course, sem))
col = ncol(df)
df = df[,c(1,col, 2:(col-1))]
stud.split = splitter(df$sID.cs, 60)
df.train = df[which(df$sID.cs %in% stud.split[[1]]),3:ncol(df)]
df.cv = df[which(df$sID.cs %in% stud.split[[2]]),3:ncol(df)]
df.test = df[which(df$sID.cs %in% stud.split[[3]]),3:ncol(df)]
svmfit2 <- svm(y ~., data=df.train, cost=1)
print(svmfit2)
##
## Call:
## svm(formula = y ~ ., data = df.train, cost = 1)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.0007698229
##
## Number of Support Vectors: 910
p <- predict(svmfit2, df.cv)
df.cv$p = p
cm = as.data.frame.matrix(addmargins(with(df.cv, table(y, p))))
cm # rows are actual values (y) and columns are prediction (p)
##       FALSE TRUE Sum
## FALSE   144   60 204
## TRUE     37  111 148
## Sum     181  171 352
cmStats = conMatrixStats(cm)
cmStats
## Results
## accuracy 72.44318
## misclassification 27.55682
## prevalence 42.04545
## precision 64.91228
## true.pos 75.00000
## false.pos 29.41176
## true.neg 70.58824
Accuracy came down to 69% on one run (about 72% in the run shown above), but that’s not really that different from 75%, given that the random split of the data will produce that sort of variation anyway (a 2nd run without UQM.ID and Consent gave 75%).
So, you would loop over the svm a few times to get a range of accuracies?
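Something like the following sketch would do it: re-split, retrain and score a handful of times (pooling the cv and test portions, since there is no tuning here), then look at the spread of accuracies rather than a single figure.
# repeat the split / train / predict cycle to get a distribution of accuracies
df = mach.learn2.no_op
df$sID.cs = with(df, paste(StudentID, course, sem))
accs = numeric(20)
for (k in 1:20) {
  sp = splitter(df$sID.cs, 60)
  keep = !(names(df) %in% c("StudentID", "sID.cs"))
  train = df[df$sID.cs %in% sp[[1]], keep]
  test  = df[df$sID.cs %in% c(sp[[2]], sp[[3]]), keep]
  fit = svm(y ~ ., data = train, cost = 1)
  # note: a split can leave rare factor levels (e.g. countries) unseen in training,
  # in which case predict() will complain about new levels
  accs[k] = mean(predict(fit, test) == test$y)
}
summary(accs)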