This dataset is an aggregate of the screen fixations and screen movements extracted from StarCraft 2 replay files. The goal is to see whether we can predict a player's League Index from their attributes (age, number of hours played) and in-game skills.
From Wikipedia:
Ranked from the lowest to the highest, the Leagues are: Bronze, Silver, Gold, Platinum, Diamond, Master and Grandmaster. The Copper league, which was formerly below Bronze, was removed in favor of Diamond in beta patch 13. The Master League was added with patch 1.2, and the Grandmaster League was added in 1.3.
For now I will not look into tuning; I just want to see which models work reasonably well out of the box.
library(ggplot2) # for plotting
library(gridExtra) # for plotting
library(ROCR) # for ROC curve
library(dplyr) # utils
library(rpart) # Decision tree package
library(randomForest) # random forest model
library(rpart.plot) # tree visualization
library(rattle) # fancyRpartPlot
library(caTools) # utils for splitting datasets
library(caret) # confusion matrix
df<-read.csv('../../SC2/starcraft.csv',sep=',')
df <- na.omit(df) # drop rows with missing values
df<-df[df$TotalHours!=1000000,] # remove an obvious outlier (1,000,000 total hours played)
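A quick sanity check that the outlier is really gone (a minimal sketch):
head(sort(df$TotalHours, decreasing=TRUE)) # largest remaining values should look plausible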
I tried a multiclass classification but could not easily find a way to plot the ROC curves, so I went back to a binary classification, defined as follows:
makeLeague<-function(x){
if(x>=1 & x<=5) {return('LOW')}
else if(x>=6) {return('HIGH')}
}
df$LeagueGlob<-sapply(df$LeagueIndex,makeLeague)
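The same mapping could also be written in vectorized form, avoiding the sapply call (a sketch, assuming LeagueIndex only takes integer values from 1 to 8):
df$LeagueGlob <- ifelse(df$LeagueIndex >= 6, 'HIGH', 'LOW') # leagues 1-5 are 'LOW', 6 and above are 'HIGH'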
Some plots show the distribution of players within each league, and how a skill (here ActionLatency) varies with the League Index and the age of the player.
actionLatVsLeague<-ggplot(data=df,aes(x=factor(LeagueIndex),y= ActionLatency)) +
geom_boxplot(aes(fill=factor(Age))) + theme(legend.position=c(.9, .65)) + xlab('League Index') + ylab('Action Latency')
leagueVsAge<-ggplot(data=df,aes(x=factor(LeagueIndex))) + geom_bar(aes(color=factor(Age))) + theme(legend.position=c(.9, .65)) + xlab('League Index')
print(leagueVsAge)
print(actionLatVsLeague)
#grid.arrange(leagueVsAge,actionLatVsLeague,ncol=2)
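Since high leagues are rare, the binary target is likely unbalanced, which matters when interpreting the models later; a quick count (a minimal sketch):
table(df$LeagueGlob) # raw counts per class
prop.table(table(df$LeagueGlob)) # class proportions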
I remove some columns not used for training the model and split the dataset into training and testing samples (training = 70% of the initial dataset), so that the models can be evaluated on unseen data and overfitting can be detected.
df2<-select(df, -GameID,-LeagueIndex,-MaxTimeStamp) # drop columns not used for training
split<-sample.split(df2$LeagueGlob,SplitRatio=.7)
train<-subset(df2,split==T)
test<-subset(df2,split==F)
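sample.split stratifies on the label, so both samples should keep roughly the same HIGH/LOW ratio; this can be checked directly (and a set.seed() call before splitting would make the split reproducible):
prop.table(table(train$LeagueGlob)) # class proportions in the training sample
prop.table(table(test$LeagueGlob)) # class proportions in the test sample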
tree <- rpart(LeagueGlob ~ ., method='class',data = train)
print(tree$variable.importance)
## ActionLatency APM NumberOfPACs
## 176.2075508 118.9580576 114.5710983
## GapBetweenPACs SelectByHotkeys AssignToHotkeys
## 52.8847511 42.6027082 30.3756021
## MinimapAttacks ActionsInPAC TotalHours
## 22.9125614 13.0232210 11.2553854
## TotalMapExplored MinimapRightClicks UniqueHotkeys
## 8.2052008 6.9113904 6.5990019
## UniqueUnitsMade WorkersMade HoursPerWeek
## 4.0442274 3.5770676 2.1531690
## ComplexUnitsMade
## 0.9583732
#prp(tree)
fancyRpartPlot(tree)
I usually use the rattle package to display the tree, but it's not working on Kaggle.
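If rattle keeps failing, rpart.plot (already loaded above) gives a similar display; a minimal sketch, where the plotting options are just one possible choice:
rpart.plot(tree, type=2, extra=104) # split labels below the nodes, per-class probabilities and observation percentages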
I need to define another function to transform the labels into 0 and 1 (as required by the ROCR package).
makeClass2<-function(x){
if(x=='HIGH'){
return(1)
} else {return(0)}
}
test$LEAGUE<-sapply(test$LeagueGlob,makeClass2)
predictionTree<-as.data.frame(predict(tree,test)) # class probabilities, one column per class ('HIGH','LOW')
pred<-prediction(predictionTree$HIGH,test$LEAGUE)
perf<-performance(pred,"tpr", "fpr")
roc.data <- data.frame(fpr=unlist(perf@x.values),tpr=unlist(perf@y.values),model="tree")
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
g1<-ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +geom_ribbon(alpha=0.2) +geom_line(aes(y=tpr)) + ggtitle(paste0("ROC Curve w/ AUC=", round(auc,3)))
g1<-g1 + geom_segment(x = 0, y = 0, xend = 1, yend = 1, colour = 'red')
print(g1)
As seen above, this very basic model is only somewhat better than a random guess (red line). There is definitely room for improvement!
train$LeagueGlob <- factor(train$LeagueGlob) # randomForest needs a factor response for classification
rf.model<-randomForest(LeagueGlob ~ . , data = train,importance = TRUE)
print(rf.model)
##
## Call:
## randomForest(formula = LeagueGlob ~ ., data = train, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.85%
## Confusion matrix:
## HIGH LOW class.error
## HIGH 217 242 0.52723312
## LOW 105 1772 0.05594033
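Since the forest was fitted with importance = TRUE, its own variable importance measures can be inspected and compared with the tree's ranking above (a minimal sketch):
importance(rf.model) # importance measures per variable
varImpPlot(rf.model)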
predictionRF<-as.data.frame(predict(rf.model,test))
colnames(predictionRF)<-c('res')
confusionMatrix(predictionRF$res,test$LeagueGlob)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 88 49
## LOW 109 755
##
## Accuracy : 0.8422
## 95% CI : (0.8181, 0.8642)
## No Information Rate : 0.8032
## P-Value [Acc > NIR] : 0.0008585
##
## Kappa : 0.4359
## Mcnemar's Test P-Value : 2.682e-06
##
## Sensitivity : 0.44670
## Specificity : 0.93905
## Pos Pred Value : 0.64234
## Neg Pred Value : 0.87384
## Prevalence : 0.19680
## Detection Rate : 0.08791
## Detection Prevalence : 0.13686
## Balanced Accuracy : 0.69288
##
## 'Positive' Class : HIGH
##
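To compare the random forest with the tree on the same footing, its ROC curve and AUC can be computed the same way (a sketch, reusing test$LEAGUE defined above):
rf.prob <- predict(rf.model, test, type='prob') # class probabilities, 'HIGH' and 'LOW' columns
pred.rf <- prediction(rf.prob[,'HIGH'], test$LEAGUE)
perf.rf <- performance(pred.rf, "tpr", "fpr")
auc.rf <- performance(pred.rf, measure="auc")@y.values[[1]]
print(auc.rf)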
library(e1071) # for svm()
svm.model <- svm(LeagueGlob ~ ., data=train)
print(svm.model)
##
## Call:
## svm(formula = LeagueGlob ~ ., data = train)
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: radial
## cost: 1
## gamma: 0.05555556
##
## Number of Support Vectors: 896
predictionSVM<-as.data.frame(predict(svm.model,test))
colnames(predictionSVM)<-c('res')
confusionMatrix(predictionSVM$res,test$LeagueGlob)
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 82 41
## LOW 115 763
##
## Accuracy : 0.8442
## 95% CI : (0.8202, 0.8661)
## No Information Rate : 0.8032
## P-Value [Acc > NIR] : 0.0004772
##
## Kappa : 0.4256
## Mcnemar's Test P-Value : 5.076e-09
##
## Sensitivity : 0.41624
## Specificity : 0.94900
## Pos Pred Value : 0.66667
## Neg Pred Value : 0.86902
## Prevalence : 0.19680
## Detection Rate : 0.08192
## Detection Prevalence : 0.12288
## Balanced Accuracy : 0.68262
##
## 'Positive' Class : HIGH
##
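As said at the beginning, no tuning is done here; for reference, a grid search over cost and gamma could later be run with e1071's tune() (a sketch; the grid values are placeholders, and this can be slow):
tune.out <- tune(svm, LeagueGlob ~ ., data=train,
                 ranges=list(cost=c(0.1,1,10), gamma=c(0.01,0.05,0.1))) # cross-validated grid search
summary(tune.out)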
Comments and estimation of accuracy
As seen in the visualizations, some variables are more meaningful than others. For the evaluation of each model (confusion matrix), the data frames need to be reworked a little:
* I define another function to classify the result of the prediction as 'HIGH' or 'LOW', since for the tree it initially returns the probability of belonging to a given class (see the sketch after this list)
* the model is then tested on the test dataset (unseen data)
* the confusion matrix is built by comparing the true classes with the predictions
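For the decision tree this looks as follows (a sketch; the 0.5 probability cut-off is my assumption, and predictionTree holds the class probabilities computed earlier):
predictionTree$res <- ifelse(predictionTree$HIGH > 0.5, 'HIGH', 'LOW') # probabilities to labels
confusionMatrix(factor(predictionTree$res, levels=c('HIGH','LOW')),
                factor(test$LeagueGlob, levels=c('HIGH','LOW')))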