Motivation

This dataset is an aggregate of the screen-fixations from screen movements of StarCraft 2 replay files. The goal is to see whether we can predict the League Index of players based on their attributes (age, number of hours played) and game skills.

From Wikipedia:

Ranked from the lowest to the highest, the Leagues are: Bronze, Silver, Gold, Platinum, Diamond, Master and Grandmaster. The Copper league, which was formerly below Bronze, was removed in favor of Diamond in beta patch 13. The Master League was added with patch 1.2, and the Grandmaster League was added in 1.3.

For now I will not look into tuning; the goal is just to see which model works reasonably well with its default settings.

data file

library(ggplot2) # for plotting
library(gridExtra) # for plotting
library(ROCR) # for ROC curve
library(dplyr) # utils
library(rpart) # Decision tree package
library(randomForest)
library(rpart.plot)
library(rattle) 
library(caTools) # utils for splitting datasets
library(caret) # confusion matrix
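# load the data, drop rows with missing values and remove the
# outlier with TotalHours equal to 1000000
df <- read.csv('../../SC2/starcraft.csv', sep=',')
df <- na.omit(df)
df <- df[df$TotalHours != 1000000, ]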

I first tried a multiclass classification but did not find an easy way to plot the ROC curves, so I went back to a binary classification, with the two classes defined as follows:

# leagues 1 to 5 are labelled 'LOW', leagues 6 and above 'HIGH'
makeLeague<-function(x){
  if(x>=1 && x<=5) {return('LOW')}
  else if(x>=6) {return('HIGH')}
}
df$LeagueGlob<-sapply(df$LeagueIndex,makeLeague)
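
The same labelling can also be written without a helper function. This is only a sketch on a made-up vector (`league_index` is hypothetical, not taken from the dataset), and it assumes every index is at least 1, so that anything below 6 is LOW:

```r
# Hypothetical league indices, for illustration only
league_index <- c(1, 3, 5, 6, 7, 8)

# Vectorised equivalent of makeLeague(): one comparison over the whole vector
league_glob <- ifelse(league_index >= 6, "HIGH", "LOW")

# Store the labels as a factor so that classifiers treat them as classes
league_glob <- factor(league_glob, levels = c("HIGH", "LOW"))

print(table(league_glob))
```

Building the factor at this point would also make the later `factor(train$LeagueGlob)` conversion unnecessary.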

Some plots

Some plots to show the distribution of players within each league, and how a skill (here ActionLatency) varies with the League Index and the age of the player.

actionLatVsLeague<-ggplot(data=df,aes(x=factor(LeagueIndex),y= ActionLatency)) + 
geom_boxplot(aes(fill=factor(Age))) + theme(legend.position=c(.9, .65)) + xlab('League Index') + ylab('Action Latency')

leagueVsAge<-ggplot(data=df,aes(x=factor(LeagueIndex))) + geom_bar(aes(color=factor(Age))) + theme(legend.position=c(.9, .65)) + xlab('League Index') 

print(leagueVsAge)

print(actionLatVsLeague)

#grid.arrange(leagueVsAge,actionLatVsLeague,ncol=2)

Modelling

I remove some columns that are not used for training the model and split the dataset into training and testing samples (training = 70% of the initial dataset), so that the models are evaluated on data they have not seen and an overfitted model does not look better than it is.

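# drop the identifier, the original target and the game length
df2<-select(df, -GameID,-LeagueIndex,-MaxTimeStamp)
# 70/30 split on the binary label
split<-sample.split(df2$LeagueGlob,SplitRatio=.7)
train<-subset(df2,split==TRUE)
test<-subset(df2,split==FALSE)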
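
As far as I understand, `sample.split` keeps the proportion of each label roughly the same in both parts. The idea can be sketched in base R as follows; the labels are made up and the seed is an arbitrary choice, used only so that the split can be repeated:

```r
set.seed(42)  # arbitrary seed, only to make the split repeatable

# Toy labels with an imbalance similar to LeagueGlob (made-up data)
labels <- c(rep("HIGH", 20), rep("LOW", 80))

# Draw 70% of the row indices within each class (a stratified split)
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

is_train <- seq_along(labels) %in% train_idx

# The class proportions are preserved in both parts
print(table(labels[is_train]))
print(table(labels[!is_train]))
```

Without a fixed seed, each run of the notebook gives a slightly different split, so the numbers in the confusion matrices will move a little from run to run.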

Decision Tree

tree <- rpart(LeagueGlob ~ ., method='class',data = train)
print(tree$variable.importance)
##      ActionLatency                APM       NumberOfPACs 
##        176.2075508        118.9580576        114.5710983 
##     GapBetweenPACs    SelectByHotkeys    AssignToHotkeys 
##         52.8847511         42.6027082         30.3756021 
##     MinimapAttacks       ActionsInPAC         TotalHours 
##         22.9125614         13.0232210         11.2553854 
##   TotalMapExplored MinimapRightClicks      UniqueHotkeys 
##          8.2052008          6.9113904          6.5990019 
##    UniqueUnitsMade        WorkersMade       HoursPerWeek 
##          4.0442274          3.5770676          2.1531690 
##   ComplexUnitsMade 
##          0.9583732
#prp(tree)
fancyRpartPlot(tree)

I usually use the rattle package to display the tree, but it does not work on Kaggle.

Comments and estimation of accuracy

As the visualization shows, some variables are more meaningful than others. To evaluate the model (confusion matrix), the data frames need to be reworked a little bit:
  • I define another function to classify the result of the prediction as ‘HIGH’ or ‘LOW’ (initially it returns the probability of belonging to each class)
  • the model is then tested on the test dataset (unseen data)
  • the confusion matrix is built by comparing the true class with the prediction

makeClass<-function(x){
  if(x>0.5){
    return('HIGH')
  } else {return('LOW')
      }
}

prediction<-as.data.frame(predict(tree,test))
prediction$result<-sapply(prediction$HIGH,makeClass)
treeTab <- table(prediction$result,test$LeagueGlob)
confusionMatrix(treeTab)
## Confusion Matrix and Statistics
## 
##       
##        HIGH LOW
##   HIGH   76  62
##   LOW   121 742
##                                           
##                Accuracy : 0.8172          
##                  95% CI : (0.7918, 0.8407)
##     No Information Rate : 0.8032          
##     P-Value [Acc > NIR] : 0.1413          
##                                           
##                   Kappa : 0.348           
##  Mcnemar's Test P-Value : 1.807e-05       
##                                           
##             Sensitivity : 0.38579         
##             Specificity : 0.92289         
##          Pos Pred Value : 0.55072         
##          Neg Pred Value : 0.85979         
##              Prevalence : 0.19680         
##          Detection Rate : 0.07592         
##    Detection Prevalence : 0.13786         
##       Balanced Accuracy : 0.65434         
##                                           
##        'Positive' Class : HIGH            
## 
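  • a brute-force model (all features) does not really give a good model: the accuracy (0.817) is barely above the no-information rate (0.803, i.e. always predicting ‘LOW’), and the sensitivity (the positive class corresponds to leagues 6 and above) is quite low at 0.39, even though the specificity (leagues 1 to 5) is high at 0.92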
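
The statistics printed by `confusionMatrix` can be recomputed by hand from the four counts in the table. The sketch below copies the counts from the output above, with ‘HIGH’ as the positive class; the variable names are my own:

```r
# Counts copied from the decision-tree confusion matrix above
# (rows = prediction, columns = true class)
tp <- 76   # predicted HIGH, truly HIGH
fp <- 62   # predicted HIGH, truly LOW
fn <- 121  # predicted LOW,  truly HIGH
tn <- 742  # predicted LOW,  truly LOW
n  <- tp + fp + fn + tn

sensitivity       <- tp / (tp + fn)   # share of HIGH players found
specificity       <- tn / (tn + fp)   # share of LOW players found
accuracy          <- (tp + tn) / n
no_info_rate      <- (tn + fp) / n    # accuracy of always predicting LOW
balanced_accuracy <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity,
              specificity = specificity,
              accuracy = accuracy,
              no_info_rate = no_info_rate,
              balanced_accuracy = balanced_accuracy), 4))
```

With about 80% of the test players in the ‘LOW’ class, accuracy alone says little; sensitivity and balanced accuracy are more informative here.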

ROC curve

I need to define another function to transform the labels into 0 and 1, which is the format I pass to the ROCR package.

makeClass2<-function(x){
  if(x=='HIGH'){
    return(1)
  } else {return(0)}
}

test$LEAGUE<-sapply(test$LeagueGlob,makeClass2)

pred<-prediction(prediction$HIGH,test$LEAGUE)
perf<-performance(pred,"tpr", "fpr")
roc.data <- data.frame(fpr=unlist(perf@x.values),tpr=unlist(perf@y.values),model="tree")
auc <- performance(pred, measure = "auc")
auc <- auc@y.values[[1]]
g1<-ggplot(roc.data, aes(x=fpr, ymin=0, ymax=tpr)) +geom_ribbon(alpha=0.2) +geom_line(aes(y=tpr)) + ggtitle(paste0("ROC Curve w/ AUC=", auc))
g1<-g1 + geom_segment(x = 0, y = 0, xend = 1, yend = 1, colour = 'red')
print(g1)

As seen before, this very basic model is just above a random guess (red line). There is definitely room for improvement!
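
The AUC can be read as the probability that a randomly chosen ‘HIGH’ player gets a higher score than a randomly chosen ‘LOW’ player; a random guess gives 0.5. The following is a minimal base R sketch of that interpretation on made-up values (`score` and `label` are hypothetical, not the predictions of the tree):

```r
# Made-up scores and 0/1 labels, for illustration only
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    0,   1,   0)

n_pos <- sum(label == 1)
n_neg <- sum(label == 0)

# Rank-based (Mann-Whitney) estimate of the AUC; ties get average ranks
r <- rank(score)
auc_rank <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Same quantity by brute force: compare every positive with every negative,
# counting a tie as half a win
diffs <- outer(score[label == 1], score[label == 0], FUN = "-")
auc_pairs <- mean((diffs > 0) + 0.5 * (diffs == 0))

print(c(auc_rank = auc_rank, auc_pairs = auc_pairs))
```

Both computations give the same value, which should also match what `performance(pred, measure = "auc")` returns when given the same scores and labels.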

Random Forest

train$LeagueGlob = factor(train$LeagueGlob)
rf.model<-randomForest(LeagueGlob ~ . , data = train,importance = TRUE)
print(rf.model)
## 
## Call:
##  randomForest(formula = LeagueGlob ~ ., data = train, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 14.85%
## Confusion matrix:
##      HIGH  LOW class.error
## HIGH  217  242  0.52723312
## LOW   105 1772  0.05594033
predictionRF<-as.data.frame(predict(rf.model,test))
colnames(predictionRF)<-c('res')
confusionMatrix(predictionRF$res,test$LeagueGlob)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   88  49
##       LOW   109 755
##                                           
##                Accuracy : 0.8422          
##                  95% CI : (0.8181, 0.8642)
##     No Information Rate : 0.8032          
##     P-Value [Acc > NIR] : 0.0008585       
##                                           
##                   Kappa : 0.4359          
##  Mcnemar's Test P-Value : 2.682e-06       
##                                           
##             Sensitivity : 0.44670         
##             Specificity : 0.93905         
##          Pos Pred Value : 0.64234         
##          Neg Pred Value : 0.87384         
##              Prevalence : 0.19680         
##          Detection Rate : 0.08791         
##    Detection Prevalence : 0.13686         
##       Balanced Accuracy : 0.69288         
##                                           
##        'Positive' Class : HIGH            
## 

SVM (Support Vector Machines)

library(e1071)
svm.model <- svm(LeagueGlob ~ ., data=train)
print(svm.model)
## 
## Call:
## svm(formula = LeagueGlob ~ ., data = train)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.05555556 
## 
## Number of Support Vectors:  896
predictionSVM<-as.data.frame(predict(svm.model,test))
colnames(predictionSVM)<-c('res')
confusionMatrix(predictionSVM$res,test$LeagueGlob)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   82  41
##       LOW   115 763
##                                           
##                Accuracy : 0.8442          
##                  95% CI : (0.8202, 0.8661)
##     No Information Rate : 0.8032          
##     P-Value [Acc > NIR] : 0.0004772       
##                                           
##                   Kappa : 0.4256          
##  Mcnemar's Test P-Value : 5.076e-09       
##                                           
##             Sensitivity : 0.41624         
##             Specificity : 0.94900         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.86902         
##              Prevalence : 0.19680         
##          Detection Rate : 0.08192         
##    Detection Prevalence : 0.12288         
##       Balanced Accuracy : 0.68262         
##                                           
##        'Positive' Class : HIGH            
##
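
To put the three models side by side, the test-set confusion matrices above can be collected in one data frame. This is only a sketch: the counts are copied from the outputs above, with ‘HIGH’ as the positive class, and the column names are my own.

```r
# Counts copied from the three test-set confusion matrices above
results <- data.frame(
  model = c("tree", "randomForest", "svm"),
  tp = c(76, 88, 82),     # predicted HIGH, truly HIGH
  fp = c(62, 49, 41),     # predicted HIGH, truly LOW
  fn = c(121, 109, 115),  # predicted LOW,  truly HIGH
  tn = c(742, 755, 763)   # predicted LOW,  truly LOW
)

# Accuracy and sensitivity for each model
results$accuracy    <- with(results, (tp + tn) / (tp + fp + fn + tn))
results$sensitivity <- with(results, tp / (tp + fn))

print(cbind(results["model"],
            round(results[c("accuracy", "sensitivity")], 4)))
```

On this split the random forest and the SVM are both about 0.84 accurate, a little better than the tree, but all three still find fewer than half of the ‘HIGH’ players.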