Use the player-reported Hearthstone: Heroes of Warcraft Arena deck composition to predict deck success. This investigation looks at whether the success of a player can be predicted by the cards that player selects during a deck-selection phase. There is skill required at all levels of the game (from deck-selection to the gameplay itself), however, deck selection is one aspect where a player can look to a variety of outside sources for help
This study made use of MySQL to access card data and made extensive use of the dplyr
and caret
libraries for data frame manipulation and machine learning, respectively. For modeling, I used Random Forest and Gradient Boosting Machines (GBM), evaluating model performance using area under the receiver operating characteristic (ROC) curve. The source code for this report can be found at my GitHub repository.
Hearthstone is turn-based electronic card game where players assume the role of one of nine heroes including a mage, a rogue, and a warrior. Each hero competes using 30-card decks of creatures (minions), spells, and weapons to reduce their opponent’s health points to zero. These decks may contain any combination of cards from a pool of common cards and cards that only that hero may weild. For instance, mages have access to Fireball, but not to Fiery War Axe–a warrior card–whereas both have access to the Abomination. This study focuses on Arena Mode, where players craft their decks by drafting one card at a time, making this choice among three random cards each round. Since players do not know which cards may appear in later rounds, they must choose these cards based on their perceived value among the cards shown or based on any synergy among the cards they have already selected.
There are no built-in quantitative datasets available to the player by way of a profile or ranking list. Dedicated players can record and track their progress in these arena matches on websites such as Arena Mastery. A player may keep track of which cards they had in their deck, how many wins/losses they ended their run with, and even which heroes they lost to. Because Blizzard Entertainment, the game’s creator, has not released an API for this game, these websites are the best way for outsiders to acquire quantitative information on gameplay. The webmaster and developer of Arena Mastery made available a de-personalized portion of this data for use in this project. This subset includes data from the game’s release until late November, 2014, over 90,000 completed decks by over 9,000 unique players in total. It is because of his hard work and the self-reporting of scores and decks from players around the world that this work was made possible.
The data set contains several SQL tables that track player performance. The tables accessed in this study are described below:
arenaCards
: All Hearthstone cards (with names, id, description and other stats)
arenaArena
: All arena results (summary of wins/losses)
arenaDraftRow
: Records of arena “picks” but not the individual cards
arenaDraftCards
: The cards associated with the picks tracked in arenaDraftRow
statEras
: Events that mark different periods to track updates/expansions/changes to the game
arenaClass
: All 9 Hearthstone classes
The RMySQL
library was essential to reading these data and allowed the selection of only variables of interest to be loaded, saving memory.
The data were generally pretty clean. Most of what was done in this section was done by merges, joins and filters through the dplyr
package.
cardPool
data, i.e. details about the cards that could be chosenHere, the primary variables I was concerned with were:
cardId
: A unique numeric card identifier
cardName
: Name of the card
cardSet
: The set to which the card belongs (i.e. original release or expansion)
cardRarity
: A factor determining the level of rarity of the card (common, rare, epic, or legendary)
cardType
: Is the card a minion/creature, a spell, or a weapon
cardClass
: Which class can use the card (0 is common to all classes)
cardCost
: The card’s mana cost
cardText
: Any additional text that described other card attributes.
For example, the Abomination card appears like this to a player:
And has a corresponding entry in cardPool
that looks like this:
cardId | cardName | cardSet | cardRarity | cardType | cardClass | cardCost | cardText |
---|---|---|---|---|---|---|---|
121 | Abomination | 1 | Rare | Minion | Neutral | 5 | Taunt. Deathrattle: Deal 2 damage to ALL characters. |
Later, the cardText
was parsed in order to pick out special features like “Taunt” or “Damage.”
With the card pool in place, the next step was to load the arena records themselves. The variables of interest are as follows:
arenaId
: A unique identifier for the arena run
arenaPlayerId
: A unique player ID (account number)
arenaClassId
: The ID of the class being played during the arena run
arenaOfficialWins
: Number of wins
arenaOfficialLosses
: Number of losses
arenaWins
: Number of wins (not including “disconnects”)
arenaLosses
: Number of losses (not including “disconnects”)
arenaRetireEarly
: Did the player end the run early
arenaStartDate
: When was the arena deck selected, starting that run
arenaId | arenaPlayerId | arenaClassId | wins | losses | arenaStartDate |
---|---|---|---|---|---|
107774 | 6925 | Mage | 8 | 3 | 2014-03-13 04:00:00 |
109627 | 9605 | Shaman | 9 | 3 | 2014-05-31 04:00:00 |
145994 | 10193 | Mage | 12 | 2 | 2014-03-12 04:00:00 |
150369 | 8624 | Druid | 1 | 3 | 2014-03-16 04:00:00 |
153911 | 8530 | Rogue | 4 | 3 | 2014-03-12 04:00:00 |
Here, I made the choice to only look at reported (“unofficial”) wins and losses, rather than official outcomes. Players have the option to flag a win or a loss as if the game ended in either opponent disconnecting early. For example, if a player’s connection to the server fails and causes a game loss, that player may choose to report the loss, or evaluate his/her standing at the time of disconnect and flag the game as a “win.” Out of 235159 arena games, games flagged in this way account for an over-reporting of wins by 0.3% and an under-reporting of losses by 0.76%.
The final stage was to join arena records and card choices. Here is where most of the data were discarded. It turns out that only 33.694% of games had full decks associated with them. This small subset was not randomly selected and a students’ t-test indicates that those who recorded their decks had a mean win rate that was lower than those who did not, by about 0.5 at 95% confidence.
To sum up, deck lists associated with the win rates of over 90,000 games have been cleaned, merged, and subsetted. The next steps were to determine which features would help summarize deck quality and determine if “powerful” deck lists were associated with high win rates.
The data were rich with information about each card, however there were some other features that might be useful in predicting the success of a deck. For example, some players value cards that allow them to draw more cards when played. Some might value minions that “taunt” other minions, thereby protecting their life points. Furthermore, the original data set had information about which cards were selected and which were seen in a given draft round. From this information, I can learn the popularity of a card among the community of players reporting the data and get some measure as to the strength (or perceived strength) of that card. Finally, I wanted some idea of the strength of an individual card measured by how the mean win rate of decks with one or more copies of that card differed from those with no copies.
Since some of the most powerful aspects of a card are written in the card text. This text was parsed for certain functions and those were stored as a boolean variable.
For example, the Abomination would show the following relevant TRUE
flags (the FALSE
flags have been omitted for brevity):
cardName | hasTaunt | hasDeathrattle | hasDamage | hasAOEdmg |
---|---|---|---|---|
Abomination | TRUE | TRUE | TRUE | TRUE |
The complete list of card abilities was added to the provided list of card attributes so that the more detailed characteristics (beyond card cost, card class, etc) of each deck could be summarized later.
Since every player is shown a random set of cards , it is difficult to separate those who make poor selections from those who have bad luck, i.e. those who do not pick the best cards vs. those who cannot. In the data set, cards are recorded whether they are selected (isSelected==1
) or not (isSelected==0
); so, an objective measure of card popularity would be to normalize a card’s selection by how often it appeared as an option.
For example, the top 10 cards selected by Mages (among all deck win rates, 0-12) are:
rank | cardName | fractionPicked |
---|---|---|
1 | Ragnaros the Firelord | 0.9683908 |
2 | Azure Drake | 0.9663462 |
3 | Fireball | 0.9428052 |
4 | Argent Commander | 0.9258144 |
5 | Pyroblast | 0.9179663 |
6 | Flamestrike | 0.9158199 |
7 | Ysera | 0.9003115 |
8 | Water Elemental | 0.8986598 |
9 | Frostbolt | 0.8894714 |
10 | Chillwind Yeti | 0.8656741 |
These ranks, when averaged among 30-card dekcs, would give some clue as to how many popular cards made it into the deck in question. More popular cards might indicate access to or selection of cards that the population as a whole agreed were “good.” Furthermore, each individual card’s rank was used to weight the card attributes when summarizing the decks. For example, not all cards with the taunt attribute are created equally. This ensured that a deck with 2 popular taunt cards would receive a different score than a deck with two unpopular taunt cards.
The last feature I wanted to generate was the effective swing of a card. For each card, I calculated the mean win rate of decks without that card then the mean win rate for decks with 1 or more copies of that card, comparing the change in win rate with respect to the mean win rate of the class. For example, the “0-1” swing for Chillwind Yeti is +0.1, meaning that decks that have 1 copy of that card average 0.1 wins more than decks with no copies of that card, decks that have 3 copies perform even better. Flamestrike, has much greater positive swing values, perhaps suggesting it is a very important card to draft. These values have also been tabulated on wowmetrics.com using a similar data set.
With these features created, each deck was summarized, resulting in the following predictors:
Predictor Name | Description |
---|---|
cost.mean |
mean deck mana cost |
cost.median |
median deck mana cost |
skew |
skew of the mana curve |
taunt |
score of cards with taunt, weighted by card popularity |
draw |
score of cards with draw, weighted by card popularity |
destroy |
score of cards with destroy, weighted by card popularity |
aoe |
score of cards with damage all, weighted by card popularity |
silence |
score of cards with silence, weighted by card popularity |
charge |
score of cards with charge, weighted by card popularity |
heal |
score of cards with heal, weighted by card popularity |
rattle |
score of cards with deathrattle, weighted by card popularity |
enrage |
score of cards with enrage, weighted by card popularity |
damage |
score of cards that deal damage, weighted by card popularity |
buff |
score of cards with buffs, weighted by card popularity |
battlecry |
score of cards with battlecry, weighted by card popularity |
blank |
score of cards with no special abilities, weighted by card popularity |
dmgSpell |
score of spell cards with damage, weighted by card popularity |
minions |
score of minion cards, weighted by card popularity |
spells |
score of spell cards, weighted by card popularity |
classCard |
score of cards available only to that class, weighted by card popularity |
avgRank |
average popularity of the cards in the deck |
top15 |
total cards in the top 15 most popular |
deckSwing |
sum of cardSwing |
The header of the numeric dataframe of predictors is sampled below:
## Source: local data frame [6 x 25]
##
## winCount cost.mean cost.median skew rarity taunt draw destroy
## 1 8 3.633333 4.0 0.13490137 1.233333 116 252 0
## 2 12 3.900000 4.0 1.25304461 1.166667 203 155 0
## 3 12 4.066667 3.5 3.51218185 1.366667 69 160 82
## 4 3 3.633333 3.5 0.37719214 1.266667 93 257 0
## 5 7 4.033333 4.0 0.92222271 1.500000 17 204 97
## 6 3 3.866667 4.0 0.04184354 1.300000 23 157 19
## Variables not shown: aoe (dbl), silence (dbl), charge (dbl), heal (dbl),
## rattle (dbl), enrage (dbl), damage (dbl), buff (dbl), battlecry (dbl),
## blank (dbl), dmgSpell (dbl), minions (dbl), spells (dbl), classCard
## (dbl), avgRank (dbl), top15 (int), deckSwing (dbl)
Notably missing from this list is a metric card synergy. Many cards are much stronger in tandem than individually. Cursory attempts were made to derive these interactions from the raw data; however, professional players have a stronger sense of this classification. This would be a great metric to add for future work.
Each hero class plays with a different style, so I chose to model the classes separately. Furthermore, gameplay was potentially changed with the release of the expansion. Here, I use the Mage data as a test case for the original release of the game. This subset resulted in a selection of approximately 11,000 games.
First, I looked at a correlation plot of the features in question, providing some hint of colineariaty among features.
require(corrplot)
corrplot(cor(mageSet),method="ellipse",order="hclust",type="upper")
There is a generally low level of colinearity for the features present. For some features, high correlation was expected. For example, dmgSpell
cards are a subset of damage
cards and are correlated as a result. However, there is little correlation between any of the predictors and the winCount
of the decks.
Principal component analysis (PCA) is a popular method to summarize the description of variance among multiple predictors by selecting a series of principle components along which variance is maximized. However, use of PCA requires a fair degree of colinearity. A high Kaiser-Mayer-Olkin (KMO) index (greater than 0.75) gives some indication of whether PCA is applicable. In this case, however, the KMO index was insufficient (0.58), so PCA was not used.
require(psych)
## Center and Scale the data
df.scale <- scale(mageSet, center=T, scale=T)
df.pca<- principal(r=df.scale,
nfactors=10,
covar=FALSE)
# Run a KMO index test
KMO(cor(as.matrix(df.scale)))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(as.matrix(df.scale)))
## Overall MSA = 0.59
## MSA for each item =
## winCount cost.mean cost.median skew rarity taunt
## 0.83 0.55 0.54 0.34 0.57 0.47
## draw destroy aoe silence charge heal
## 0.35 0.52 0.53 0.41 0.30 0.43
## rattle enrage damage buff battlecry blank
## 0.49 0.53 0.57 0.50 0.56 0.27
## dmgSpell minions spells classCard avgRank top15
## 0.71 0.47 0.52 0.93 0.54 0.91
## deckSwing
## 0.87
caret
)In order to select the most important features, I employed recursive feature elimination (RFE) using the caret
package. In this case, I used random forest functions to evaluate variable importance and found that up to 18 (out of 24) predictors were necessary to provide significant boosts in accuracy. Below is a plot showing the absolute accuracy vs. number of features.
# Select Features with RFE
# define the control using a recursive variable selection function
# function options include "lmFuncs" (linear models), "rfFuncs" (random forest), among others
rfeControl <- rfeControl(functions=rfFuncs, method="cv", number=10)
set.seed(1337)
mage.rfe <- rfe(training[,predictors], training$topHalf, sizes=c(1:20), rfeControl=rfeControl) # 1-20 variables
## [1] "deckSwing" "avgRank" "spells" "classCard" "top15"
## [6] "minions" "dmgSpell" "damage" "buff" "skew"
## [11] "cost.mean" "cost.median" "rattle" "draw" "aoe"
## [16] "blank" "silence" "heal"
With a narrowed list of features, I proceeded to set up my training and test sets of data. I focused on Mage decks that were recorded during the original release of the game (approximately 11,000 rows of data). Regression models provided poor accuracy, so categorization was used. Two bins were created: “Good” and “Bad” decks, binned by being above or below the mean win rate, respectively. The models were trained on a randomly selected, stratified 75% of the data. Repeated cross-validation was used to determine optimal performance. Prior to modeling, predictors were checked for near-zero variance.
df<- mageSet %>%
mutate(perf=as.factor(ifelse(winCount>mean(winCount),"Good","Bad"))) %>%
select(-winCount)
labelName<-"perf" # name of observation
predictors<-names(df)[!(names(df) %in% labelName)] # predictors
set.seed(1337)
labelNum<-which(names(df)==labelName)
inTrain<-createDataPartition(y=df$perf,p=0.75,list=F)
training<-df[inTrain,]
test<-df[-inTrain,]
trControl <- trainControl(method="repeatedcv", number=10, repeats=3,
classProbs=T,summaryFunction=twoClassSummary)
Random forest and gradient boosting machines were both used to classify the mage decks as above average (“Good”) or below average (“Bad”). The variables were centered and scaled before the models were trained.
######### Random Forest #########
set.seed(1337)
mage.rf<-train(perf~.,
data=training[,c(topPredictors,labelName)],
method='rf',
trControl=trControl,
preProc = c("center","scale"))
pred.rf<-predict(mage.rf,test[,c(topPredictors,labelName)])
perf.rf<-confusionMatrix(predictions,test$perf)
######### GBM #########
set.seed(1337)
mage.gbm<-train(perf~.,
data=training[,c(topPredictors,labelName)],
method='gbm',
trControl=trControl,
metric="ROC",
preProc = c("center","scale"))
pred.gbm<-predict(mage.gbm,test[,c(topPredictors,labelName)])
perf.gbm<-confusionMatrix(pred.gbm,test$perf)
print(perf.rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 731 470
## Good 608 906
##
## Accuracy : 0.6029
## 95% CI : (0.5843, 0.6214)
## No Information Rate : 0.5068
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2046
## Mcnemar's Test P-Value : 3.011e-05
##
## Sensitivity : 0.5459
## Specificity : 0.6584
## Pos Pred Value : 0.6087
## Neg Pred Value : 0.5984
## Prevalence : 0.4932
## Detection Rate : 0.2692
## Detection Prevalence : 0.4424
## Balanced Accuracy : 0.6022
##
## 'Positive' Class : Bad
##
print(perf.gbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 661 376
## Good 678 1000
##
## Accuracy : 0.6118
## 95% CI : (0.5932, 0.6302)
## No Information Rate : 0.5068
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.2211
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.4937
## Specificity : 0.7267
## Pos Pred Value : 0.6374
## Neg Pred Value : 0.5959
## Prevalence : 0.4932
## Detection Rate : 0.2435
## Detection Prevalence : 0.3820
## Balanced Accuracy : 0.6102
##
## 'Positive' Class : Bad
##
Neither method did particularly well at classifying deck outcomes, given the list of predictors. Both Random Forest and GBM produced areas under the ROC curve of approximately 0.65 and 0.68, respectively. This could be good news: it would be bad for game design if anyone could win with a particular deck without having some element of skill. Next, I looked at the ability of these features and models to distinguish excellent from poor decks.
While it was possible that the difficulty in classifying games was due to good game design, it could be an issue with the feature set chosen. As an experiment, I removed the middle 50% of the data (between 3 and 7 wins, inclusive) labeling more than 7 wins as “Good” and fewer than 3 wins as “Bad.” I hypothesized that very good and very poor decks might be easy to spot, however skill/luck might play a role in blurring the line between good and bad decks with average results. Once the GBM model was trained, the performance was tested against the same (segmented) data.
### No Middle ###
df<- mageSet[mageSet$winCount>7 | mageSet$winCount<3,] %>% # remove the middle 50% of data
mutate(perf=as.factor(ifelse(winCount>7,"Good","Bad"))) %>%
select(-winCount)
labelName<-"perf" # name of observation
predictors<-names(df)[!(names(df) %in% labelName)] # predictors
set.seed(1337)
labelNum<-which(names(df)==labelName)
inTrain<-createDataPartition(y=df$perf,p=0.75,list=F)
training<-df[inTrain,]
test<-df[-inTrain,]
trControl <- trainControl(method="repeatedcv", number=10, repeats=3,
classProbs=T,summaryFunction=twoClassSummary)
set.seed(1337)
mage.gbmx<-train(perf~.,
data=training[,c(topPredictors,labelName)],
method='gbm',
trControl=trControl,
metric="ROC",
preProc = c("center","scale"))
pred.gbmx<-predict(mage.gbmx,test[,c(topPredictors,labelName)])
perf.gbmx<-confusionMatrix(pred.gbmx,test$perf)
print(perf.gbmx)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Bad Good
## Bad 377 154
## Good 172 347
##
## Accuracy : 0.6895
## 95% CI : (0.6606, 0.7174)
## No Information Rate : 0.5229
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.3787
## Mcnemar's Test P-Value : 0.3464
##
## Sensitivity : 0.6867
## Specificity : 0.6926
## Pos Pred Value : 0.7100
## Neg Pred Value : 0.6686
## Prevalence : 0.5229
## Detection Rate : 0.3590
## Detection Prevalence : 0.5057
## Balanced Accuracy : 0.6897
##
## 'Positive' Class : Bad
##
When testing performance of this “extremes” case on a complete set of data (with the middle 50% re-inserted), performance did not improve substantially. Modeling the extremes marginally improved the specificity of the model from a true positive rate of 0.5 to 0.68.
To compare the extremes case with the full-data case, the area under the ROC curves were compared. From the plot below, it is clear that there is some improvement in modeling (and testing) the extremes of the data versus the complete set.
# get ROC values for plotting using ROCR
require("ROCR")
# ROC for entire data set
prob.gbm<-predict(mage.gbm,test[,c(topPredictors,labelName)],type="prob") # calculate class probabilities
pred.rocr.gbm<-prediction(prob.gbm$Good,labels=test$perf)
perf.rocr.gbm<-performance(pred.rocr.gbm,"tpr","fpr") # track true positive rate and false positive rate
# ROC for extremes
prob.gbmx<-predict(mage.gbmx,test[,c(topPredictors,labelName)],type="prob") # calculate class probabilities
pred.rocr.x<-prediction(prob.gbmx$Good,labels=test$perf)
perf.rocr.x<-performance(pred.rocr.x,"tpr","fpr") # track true positive rate and false positive rate
The model trained on extremes was applied to try to classify the total dataset with no significant improvement noted. It would appear that decks that performed at and around an average win rate might lead to some misclassification. Perhaps a skilled player received a weak set of cards to choose from or an unskilled player used a good set of cards in a non-ideal way during their games.
Modeling deck success based on cards alone seems to be a difficult task. This study looked only factors that would influence an arena run before the games were started. In reality, other factors (still outside of the player’s control) weigh in on arena performance such as which hero the opponent is playing and who goes first. Adding these factors would surely improve model accuracy, but would require the completion of an arena run.
There are clearly other factors to consider when characterizing a deck. For example, card combos and synergies were not included. In the future, it would be interesting to see how card synergies, such as those provided at HearthArena.com influence model performance.
Finally, although the models themselves were not terribly accurate, the relative importance of predictors such as deck swing and card popularity were supported after the GBM and RF models were trained. This led to the wireframing of a card recommendation system which was covered in a follow-up RPub.