Goal & Motivation

Use player-reported Hearthstone: Heroes of Warcraft Arena deck compositions to predict deck success. This investigation asks whether a player's success can be predicted from the cards that player selects during the deck-selection phase. Skill is required at all levels of the game, from deck selection to the gameplay itself; however, deck selection is one aspect where a player can turn to a variety of outside sources for help.

Methods and Techniques

This study used MySQL to access the card data and relied heavily on the dplyr and caret libraries for data frame manipulation and machine learning, respectively. For modeling, I used Random Forest and Gradient Boosting Machines (GBM), evaluating model performance with the area under the receiver operating characteristic (ROC) curve. The source code for this report can be found in my GitHub repository.

About the Game

Hearthstone is a turn-based electronic card game in which players assume the role of one of nine heroes, including a mage, a rogue, and a warrior. Each hero competes using a 30-card deck of creatures (minions), spells, and weapons to reduce the opponent's health points to zero. A deck may contain any combination of cards from a pool common to all heroes and cards that only that hero may wield. For instance, mages have access to Fireball but not to Fiery War Axe (a warrior card), whereas both have access to Abomination. This study focuses on Arena Mode, where players craft their decks by drafting one card at a time, choosing among three random cards each round. Since players do not know which cards may appear in later rounds, they must choose based on a card's perceived value among the cards shown or on its synergy with the cards they have already selected.

The Data Source

There are no built-in quantitative datasets available to the player by way of a profile or ranking list. Dedicated players can, however, record and track their progress in these arena matches on websites such as Arena Mastery: a player may log which cards were in their deck, how many wins and losses the run ended with, and even which heroes they lost to. Because Blizzard Entertainment, the game's creator, has not released an API for the game, these websites are the best way for outsiders to acquire quantitative information on gameplay. The webmaster and developer of Arena Mastery made a de-personalized portion of this data available for use in this project. The subset spans the game's release through late November 2014 and contains over 90,000 completed decks from more than 9,000 unique players. It is because of his hard work, and the self-reporting of scores and decks by players around the world, that this work was possible.

The data set contains several SQL tables that track player performance; the tables and fields accessed in this study are described in the sections that follow.

The RMySQL library was essential for reading these data and allowed only the variables of interest to be loaded, saving memory.
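As a minimal sketch of that selective loading (the connection credentials are placeholders, and the table and column names simply mirror the cardPool fields described in the next section), the read looked something like this:

require(RMySQL)

# Open a connection to a local copy of the Arena Mastery database (credentials are placeholders)
con <- dbConnect(MySQL(), dbname = "arenaMastery", user = "reader", password = "****")

# Pull only the columns of interest rather than the entire table, saving memory
cardPool <- dbGetQuery(con, "SELECT cardId, cardName, cardSet, cardRarity,
                                    cardType, cardClass, cardCost, cardText
                             FROM cardPool")

dbDisconnect(con)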

Data Tidying

The data were generally clean. Most of the work in this section consisted of merges, joins, and filters using the dplyr package.

Load and clean the cardPool data, i.e. details about the cards that could be chosen

Here, the primary variables I was concerned with were:

  • cardId: A unique numeric card identifier

  • cardName: Name of the card

  • cardSet: The set to which the card belongs (i.e. original release or expansion)

  • cardRarity: A factor determining the level of rarity of the card (common, rare, epic, or legendary)

  • cardType: Whether the card is a minion/creature, a spell, or a weapon

  • cardClass: Which class can use the card (0 is common to all classes)

  • cardCost: The card’s mana cost

  • cardText: Any additional text describing other card attributes

For example, the Abomination card has the following entry in cardPool:

cardId  cardName     cardSet  cardRarity  cardType  cardClass  cardCost  cardText
121     Abomination  1        Rare        Minion    Neutral    5         Taunt. Deathrattle: Deal 2 damage to ALL characters.

Later, the cardText was parsed in order to pick out special features like “Taunt” or “Damage.”

Merge arena results and deck selection records

With the card pool in place, the next step was to load the arena records themselves. The variables of interest are as follows:

  • arenaId: A unique identifier for the arena run

  • arenaPlayerId: A unique player ID (account number)

  • arenaClassId: The ID of the class being played during the arena run

  • arenaOfficialWins: Number of wins

  • arenaOfficialLosses: Number of losses

  • arenaWins: Number of wins (not including “disconnects”)

  • arenaLosses: Number of losses (not including “disconnects”)

  • arenaRetireEarly: Whether the player ended the run early

  • arenaStartDate: When the arena deck was drafted, marking the start of that run

arenaId  arenaPlayerId  arenaClassId  wins  losses  arenaStartDate
107774   6925           Mage          8     3       2014-03-13 04:00:00
109627   9605           Shaman        9     3       2014-05-31 04:00:00
145994   10193          Mage          12    2       2014-03-12 04:00:00
150369   8624           Druid         1     3       2014-03-16 04:00:00
153911   8530           Rogue         4     3       2014-03-12 04:00:00

Official vs. Reported wins/losses

Here, I chose to look only at reported ("unofficial") wins and losses rather than official outcomes. Players have the option to flag a win or a loss as a game that ended with either opponent disconnecting early. For example, if a player's connection to the server fails and causes a game loss, that player may choose to report the loss, or to evaluate his/her standing at the time of the disconnect and flag the game as a "win." Out of 235,159 arena games, games flagged in this way account for an over-reporting of wins by 0.3% and an under-reporting of losses by 0.76%.
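Those percentages can be sketched by comparing the official and reported tallies; here, arena is a hypothetical data frame holding the four win/loss columns listed above:

# Disconnect-flagged games: reported wins exceed official wins, reported losses fall short of official losses
(sum(arena$arenaWins) - sum(arena$arenaOfficialWins)) / sum(arena$arenaOfficialWins)       # ~0.3% over-reporting of wins
(sum(arena$arenaOfficialLosses) - sum(arena$arenaLosses)) / sum(arena$arenaOfficialLosses) # ~0.76% under-reporting of losses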

The final stage was to join the arena records with the card choices. This is where most of the data were discarded: only about 33.7% of games had full decks associated with them. This subset was not randomly selected, and a Student's t-test indicates that players who recorded their decks had a mean win rate roughly 0.5 wins lower than those who did not (at 95% confidence).
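That comparison can be sketched with a Welch two-sample t-test, where hasFullDeck is a hypothetical flag marking runs with a complete recorded deck:

# Do players who record full decks win at a different rate than those who do not?
t.test(arenaWins ~ hasFullDeck, data = arena, conf.level = 0.95)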

To sum up, deck lists and win rates from over 90,000 arena runs were cleaned, merged, and subsetted. The next step was to determine which features would best summarize deck quality and whether "powerful" deck lists were associated with high win rates.

Feature Generation

The data were rich with information about each card; however, there were other features that might be useful in predicting the success of a deck. For example, some players value cards that allow them to draw more cards when played. Others might value minions with "taunt," which protect a player's health points by forcing attacks onto themselves. Furthermore, the original data set recorded which cards were selected and which were merely seen in a given draft round. From this information, I could estimate the popularity of a card among the community of players reporting the data and obtain some measure of the strength (or perceived strength) of that card. Finally, I wanted some idea of the strength of an individual card, measured by how the mean win rate of decks with one or more copies of that card differed from that of decks with no copies.

Detailed Card/Deck Attributes (Parsing Card Text)

Some of the most powerful aspects of a card are written in its card text, so this text was parsed for certain abilities, each of which was stored as a boolean variable.
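A minimal sketch of that parsing, assuming the cardPool frame from above (the patterns used in the full analysis were more extensive):

require(dplyr)

# Flag card abilities by pattern-matching the card text
cardPool <- cardPool %>%
  mutate(hasTaunt       = grepl("Taunt",         cardText),
         hasDeathrattle = grepl("Deathrattle",   cardText),
         hasDamage      = grepl("damage",        cardText, ignore.case = TRUE),
         hasAOEdmg      = grepl("damage to ALL", cardText, ignore.case = TRUE))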

For example, the Abomination would show the following relevant TRUE flags (the FALSE flags have been omitted for brevity):

cardName     hasTaunt  hasDeathrattle  hasDamage  hasAOEdmg
Abomination  TRUE      TRUE            TRUE       TRUE

The complete list of card abilities was added to the provided card attributes so that the more detailed characteristics of each deck (beyond card cost, card class, etc.) could be summarized later.

Determine Card Popularity Ranks

Since every player is shown a random set of cards, it is difficult to separate those who make poor selections from those who simply have bad luck, i.e., those who do not pick the best cards from those who cannot. In the data set, cards are recorded whether they are selected (isSelected==1) or not (isSelected==0), so an objective measure of card popularity is to normalize how often a card was selected by how often it appeared as an option.
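This normalization can be sketched with dplyr, where picks is a hypothetical frame of draft-round records with one row per card offered:

# Popularity = times picked / times offered, ranked from most to least popular
cardRanks <- picks %>%
  group_by(cardId, cardName) %>%
  summarise(timesSeen      = n(),
            fractionPicked = sum(isSelected) / n()) %>%
  ungroup() %>%
  arrange(desc(fractionPicked)) %>%
  mutate(rank = row_number())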

For example, the top 10 cards selected by Mages (among all deck win rates, 0-12) are:

rank  cardName               fractionPicked
1     Ragnaros the Firelord  0.9683908
2     Azure Drake            0.9663462
3     Fireball               0.9428052
4     Argent Commander       0.9258144
5     Pyroblast              0.9179663
6     Flamestrike            0.9158199
7     Ysera                  0.9003115
8     Water Elemental        0.8986598
9     Frostbolt              0.8894714
10    Chillwind Yeti         0.8656741

These ranks, when averaged over a 30-card deck, give some clue as to how many popular cards made it into the deck in question. More popular cards might indicate access to, or selection of, cards that the population as a whole agreed were "good." Furthermore, each individual card's rank was used to weight the card attributes when summarizing the decks; not all cards with the taunt attribute are created equal. This ensured that a deck with two popular taunt cards received a different score than a deck with two unpopular taunt cards.
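As a sketch of that weighting, assuming a deckCards frame in which each deck's 30 cards have been joined to their parsed abilities and popularity values (the exact weighting scheme in the original analysis may differ, e.g. it may use rank positions rather than pick fractions):

# Per-deck ability scores, weighted by each card's popularity
deckFeatures <- deckCards %>%
  group_by(arenaId) %>%
  summarise(taunt   = sum(hasTaunt * fractionPicked),
            rattle  = sum(hasDeathrattle * fractionPicked),
            avgRank = mean(fractionPicked),
            top15   = sum(rank <= 15))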

Card and Deck Swing

The last feature I wanted to generate was the effective swing of a card. For each card, I calculated the mean win rate of decks without that card and the mean win rate of decks with one or more copies, comparing the change in win rate against the mean win rate of the class. For example, the "0-1" swing for Chillwind Yeti is +0.1, meaning that decks with one copy of that card average 0.1 more wins than decks with no copies; decks with three copies perform even better. Flamestrike has much greater positive swing values, perhaps suggesting it is a very important card to draft. Similar values have been tabulated on wowmetrics.com using a comparable data set.
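A sketch of the "0-1" swing calculation for a single card, where decks and deckCards are the hypothetical merged frames from the previous steps:

# Swing: mean wins of decks holding at least one copy of a card minus mean wins of decks without it
cardSwing <- function(id, decks, deckCards) {
  withCard <- decks$arenaId %in% deckCards$arenaId[deckCards$cardId == id]
  mean(decks$wins[withCard]) - mean(decks$wins[!withCard])
}

# e.g. cardSwing(chillwindYetiId, mageDecks, mageDeckCards) would return roughly +0.1 per the text above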

Feature Generation Summary

With these features created, each deck was summarized, resulting in the following predictors:

Predictor Name  Description
cost.mean       mean deck mana cost
cost.median     median deck mana cost
skew            skew of the mana curve
taunt           score of cards with taunt, weighted by card popularity
draw            score of cards with draw, weighted by card popularity
destroy         score of cards with destroy, weighted by card popularity
aoe             score of cards with damage all, weighted by card popularity
silence         score of cards with silence, weighted by card popularity
charge          score of cards with charge, weighted by card popularity
heal            score of cards with heal, weighted by card popularity
rattle          score of cards with deathrattle, weighted by card popularity
enrage          score of cards with enrage, weighted by card popularity
damage          score of cards that deal damage, weighted by card popularity
buff            score of cards with buffs, weighted by card popularity
battlecry       score of cards with battlecry, weighted by card popularity
blank           score of cards with no special abilities, weighted by card popularity
dmgSpell        score of spell cards with damage, weighted by card popularity
minions         score of minion cards, weighted by card popularity
spells          score of spell cards, weighted by card popularity
classCard       score of cards available only to that class, weighted by card popularity
avgRank         average popularity of the cards in the deck
top15           total cards in the top 15 most popular
deckSwing       sum of cardSwing

The first few rows of the numeric data frame of predictors are shown below:

## Source: local data frame [6 x 25]
## 
##   winCount cost.mean cost.median       skew   rarity taunt draw destroy
## 1        8  3.633333         4.0 0.13490137 1.233333   116  252       0
## 2       12  3.900000         4.0 1.25304461 1.166667   203  155       0
## 3       12  4.066667         3.5 3.51218185 1.366667    69  160      82
## 4        3  3.633333         3.5 0.37719214 1.266667    93  257       0
## 5        7  4.033333         4.0 0.92222271 1.500000    17  204      97
## 6        3  3.866667         4.0 0.04184354 1.300000    23  157      19
## Variables not shown: aoe (dbl), silence (dbl), charge (dbl), heal (dbl),
##   rattle (dbl), enrage (dbl), damage (dbl), buff (dbl), battlecry (dbl),
##   blank (dbl), dmgSpell (dbl), minions (dbl), spells (dbl), classCard
##   (dbl), avgRank (dbl), top15 (int), deckSwing (dbl)

Notably missing from this list is a metric of card synergy. Many cards are much stronger in tandem than individually. Cursory attempts were made to derive these interactions from the raw data; however, professional players have a stronger sense of this classification. This would be a great metric to add in future work.

Feature Selection

Each hero class plays with a different style, so I chose to model the classes separately. Furthermore, gameplay potentially changed with the release of the expansion. Here, I use the Mage data from the original release of the game as a test case, a subset of approximately 11,000 games.
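A sketch of that subsetting step, assuming a hypothetical deckSummaries frame of per-deck features that still carries the class and start date, with expansionDate as a placeholder for the expansion's release date:

# Keep only Mage runs drafted before the expansion changed the card pool
mageSet <- deckSummaries %>%
  filter(arenaClassId == "Mage", arenaStartDate < expansionDate) %>%
  select(-arenaId, -arenaClassId, -arenaStartDate)   # keep winCount and the numeric predictors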

Overview of Features: Correlation Plots

First, I looked at a correlation plot of the features in question, which gives some hint of collinearity among the features.

require(corrplot)
corrplot(cor(mageSet),method="ellipse",order="hclust",type="upper")

There is a generally low level of collinearity among the features present. For some features, high correlation was expected; for example, dmgSpell cards are a subset of damage cards and are correlated as a result. However, there is little correlation between any of the predictors and the winCount of the decks.

A note about Principal Component Analysis

Principal component analysis (PCA) is a popular method for summarizing the variance among multiple predictors by selecting a series of principal components along which variance is maximized. However, PCA is only useful when the predictors show a fair degree of collinearity. A high Kaiser-Meyer-Olkin (KMO) index (greater than 0.75) gives some indication of whether PCA is applicable. In this case, the KMO index was insufficient (0.59), so PCA was not used.

require(psych)
## Center and Scale the data
df.scale <- scale(mageSet, center=T, scale=T)
df.pca<- principal(r=df.scale,
                   nfactors=10,
                   covar=FALSE)

# Run a KMO index test
KMO(cor(as.matrix(df.scale)))
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cor(as.matrix(df.scale)))
## Overall MSA =  0.59
## MSA for each item = 
##    winCount   cost.mean cost.median        skew      rarity       taunt 
##        0.83        0.55        0.54        0.34        0.57        0.47 
##        draw     destroy         aoe     silence      charge        heal 
##        0.35        0.52        0.53        0.41        0.30        0.43 
##      rattle      enrage      damage        buff   battlecry       blank 
##        0.49        0.53        0.57        0.50        0.56        0.27 
##    dmgSpell     minions      spells   classCard     avgRank       top15 
##        0.71        0.47        0.52        0.93        0.54        0.91 
##   deckSwing 
##        0.87

Feature selection using Recursive Feature Elimination (caret)

In order to select the most important features, I employed recursive feature elimination (RFE) using the caret package. In this case, I used random forest functions to evaluate variable importance and found that up to 18 (out of 24) predictors were necessary to provide significant boosts in accuracy. Below is a plot showing the absolute accuracy vs. number of features.

# Select Features with RFE
# define the control using a recursive variable selection function
# function options include "lmFuncs" (linear models), "rfFuncs" (random forest), among others
rfeControl <- rfeControl(functions=rfFuncs, method="cv", number=10)
set.seed(1337)
mage.rfe <- rfe(training[,predictors], training$topHalf, sizes=c(1:20), rfeControl=rfeControl) # 1-20 variables
topPredictors <- predictors(mage.rfe) # keep the selected subset for the models below
topPredictors

##  [1] "deckSwing"   "avgRank"     "spells"      "classCard"   "top15"      
##  [6] "minions"     "dmgSpell"    "damage"      "buff"        "skew"       
## [11] "cost.mean"   "cost.median" "rattle"      "draw"        "aoe"        
## [16] "blank"       "silence"     "heal"
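The accuracy-vs-features plot referenced above can be drawn directly from the rfe object; a minimal sketch:

# Resampled accuracy as a function of the number of retained predictors
plot(mage.rfe, type = c("g", "o"))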

Modeling Deck Performance

With a narrowed list of features, I proceeded to set up the training and test sets. I focused on Mage decks recorded during the original release of the game (approximately 11,000 rows of data). Regression models provided poor accuracy, so classification was used instead: decks were binned as "Good" or "Bad" according to whether their win count was above or below the mean. The models were trained on a randomly selected, stratified 75% of the data, with repeated cross-validation used to tune for optimal performance. Prior to modeling, predictors were checked for near-zero variance.

df<- mageSet %>%
  mutate(perf=as.factor(ifelse(winCount>mean(winCount),"Good","Bad"))) %>%
  select(-winCount)

labelName<-"perf" # name of observation
predictors<-names(df)[!(names(df) %in% labelName)] # predictors
set.seed(1337)
labelNum<-which(names(df)==labelName) 
inTrain<-createDataPartition(y=df$perf,p=0.75,list=F) 
training<-df[inTrain,]
test<-df[-inTrain,]

trControl <- trainControl(method="repeatedcv", number=10, repeats=3,
                          classProbs=T,summaryFunction=twoClassSummary)

Random forest and gradient boosting machines were both used to classify the mage decks as above average (“Good”) or below average (“Bad”). The variables were centered and scaled before the models were trained.

######### Random Forest #########
set.seed(1337)
mage.rf<-train(perf~.,
               data=training[,c(topPredictors,labelName)],
               method='rf',
               trControl=trControl,
               preProc = c("center","scale"))

pred.rf<-predict(mage.rf,test[,c(topPredictors,labelName)])
perf.rf<-confusionMatrix(pred.rf,test$perf)

######### GBM #########
set.seed(1337)
mage.gbm<-train(perf~.,
                data=training[,c(topPredictors,labelName)],
                method='gbm',
                trControl=trControl,
                metric="ROC",
                preProc = c("center","scale"))

pred.gbm<-predict(mage.gbm,test[,c(topPredictors,labelName)])
perf.gbm<-confusionMatrix(pred.gbm,test$perf)

Random Forest Performance

print(perf.rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  731  470
##       Good 608  906
##                                           
##                Accuracy : 0.6029          
##                  95% CI : (0.5843, 0.6214)
##     No Information Rate : 0.5068          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2046          
##  Mcnemar's Test P-Value : 3.011e-05       
##                                           
##             Sensitivity : 0.5459          
##             Specificity : 0.6584          
##          Pos Pred Value : 0.6087          
##          Neg Pred Value : 0.5984          
##              Prevalence : 0.4932          
##          Detection Rate : 0.2692          
##    Detection Prevalence : 0.4424          
##       Balanced Accuracy : 0.6022          
##                                           
##        'Positive' Class : Bad             
## 

GBM Performance

print(perf.gbm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Bad Good
##       Bad   661  376
##       Good  678 1000
##                                           
##                Accuracy : 0.6118          
##                  95% CI : (0.5932, 0.6302)
##     No Information Rate : 0.5068          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.2211          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.4937          
##             Specificity : 0.7267          
##          Pos Pred Value : 0.6374          
##          Neg Pred Value : 0.5959          
##              Prevalence : 0.4932          
##          Detection Rate : 0.2435          
##    Detection Prevalence : 0.3820          
##       Balanced Accuracy : 0.6102          
##                                           
##        'Positive' Class : Bad             
## 

Neither method did particularly well at classifying deck outcomes given this list of predictors: Random Forest and GBM produced areas under the ROC curve of approximately 0.65 and 0.68, respectively. This could be good news, as it would reflect poorly on the game's design if a particular deck could win regardless of the player's skill. Next, I looked at the ability of these features and models to distinguish excellent decks from poor ones.

Modeling Extremes: Top and Bottom Win Quartiles

While it is possible that the difficulty in classifying games was due to good game design, it could also be an issue with the chosen feature set. As an experiment, I removed the middle 50% of the data (between 3 and 7 wins, inclusive), labeling runs with more than 7 wins as "Good" and those with fewer than 3 wins as "Bad." I hypothesized that very good and very poor decks might be easy to spot, whereas skill and luck might blur the line between good and bad decks with average results. Once the GBM model was trained, its performance was tested against the same (segmented) data.

### No Middle ###
df<- mageSet[mageSet$winCount>7 | mageSet$winCount<3,] %>% # remove the middle 50% of data
  mutate(perf=as.factor(ifelse(winCount>7,"Good","Bad"))) %>%
  select(-winCount)

labelName<-"perf" # name of observation
predictors<-names(df)[!(names(df) %in% labelName)] # predictors
set.seed(1337)
labelNum<-which(names(df)==labelName) 
inTrain<-createDataPartition(y=df$perf,p=0.75,list=F) 
training<-df[inTrain,]
test<-df[-inTrain,]

trControl <- trainControl(method="repeatedcv", number=10, repeats=3,
                          classProbs=T,summaryFunction=twoClassSummary)

set.seed(1337)
mage.gbmx<-train(perf~.,
                  data=training[,c(topPredictors,labelName)],
                  method='gbm',
                  trControl=trControl,
                  metric="ROC",
                  preProc = c("center","scale"))

pred.gbmx<-predict(mage.gbmx,test[,c(topPredictors,labelName)])
perf.gbmx<-confusionMatrix(pred.gbmx,test$perf)
print(perf.gbmx)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Bad Good
##       Bad  377  154
##       Good 172  347
##                                           
##                Accuracy : 0.6895          
##                  95% CI : (0.6606, 0.7174)
##     No Information Rate : 0.5229          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.3787          
##  Mcnemar's Test P-Value : 0.3464          
##                                           
##             Sensitivity : 0.6867          
##             Specificity : 0.6926          
##          Pos Pred Value : 0.7100          
##          Neg Pred Value : 0.6686          
##              Prevalence : 0.5229          
##          Detection Rate : 0.3590          
##    Detection Prevalence : 0.5057          
##       Balanced Accuracy : 0.6897          
##                                           
##        'Positive' Class : Bad             
## 

When the performance of this "extremes" model was tested on a complete set of data (with the middle 50% re-inserted), performance did not improve substantially. Modeling the extremes did marginally improve the model's sensitivity, with the true positive rate for the "Bad" class rising from roughly 0.5 to 0.69.

Comparison of Area under ROC Curves

To compare the extremes case with the full-data case, the areas under the ROC curves were compared. The plot below shows some improvement when modeling (and testing on) the extremes of the data versus the complete set.

# get ROC values for plotting using ROCR
require("ROCR")
# ROC for entire data set
prob.gbm<-predict(mage.gbm,test[,c(topPredictors,labelName)],type="prob") # calculate class probabilities
pred.rocr.gbm<-prediction(prob.gbm$Good,labels=test$perf)
perf.rocr.gbm<-performance(pred.rocr.gbm,"tpr","fpr") # track true positive rate and false positive rate

# ROC for extremes
prob.gbmx<-predict(mage.gbmx,test[,c(topPredictors,labelName)],type="prob") # calculate class probabilities
pred.rocr.x<-prediction(prob.gbmx$Good,labels=test$perf)
perf.rocr.x<-performance(pred.rocr.x,"tpr","fpr") # track true positive rate and false positive rate
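A sketch of how the two curves and their areas can then be compared (plot styling is arbitrary):

# Area under each ROC curve
auc.gbm <- performance(pred.rocr.gbm, "auc")@y.values[[1]]
auc.x   <- performance(pred.rocr.x,   "auc")@y.values[[1]]

# Overlay the two ROC curves for comparison
plot(perf.rocr.gbm, col = "black", main = "GBM ROC: full data vs. extremes only")
plot(perf.rocr.x, col = "red", add = TRUE)
abline(0, 1, lty = 2)  # reference line for a no-skill classifier
legend("bottomright",
       legend = c(sprintf("Full data (AUC = %.2f)", auc.gbm),
                  sprintf("Extremes only (AUC = %.2f)", auc.x)),
       col = c("black", "red"), lty = 1)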

The model trained on the extremes was also applied to classify the complete data set, with no significant improvement noted. It would appear that decks performing at or around an average win rate lead to some misclassification: perhaps a skilled player received a weak set of cards to choose from, or an unskilled player used a good set of cards in a non-ideal way during their games.

Conclusions & Future Work

Modeling deck success based on cards alone appears to be a difficult task. This study looked only at factors that influence an arena run before any games are played. In reality, other factors (still outside of the player's control) weigh on arena performance, such as which hero the opponent is playing and who goes first. Adding these factors would likely improve model accuracy, but would require the completion of an arena run.

There are clearly other factors to consider when characterizing a deck. For example, card combos and synergies were not included. In the future, it would be interesting to see how card synergies, such as those provided at HearthArena.com, influence model performance.

Finally, although the models themselves were not terribly accurate, the relative importance of predictors such as deck swing and card popularity was supported after the GBM and RF models were trained. This led to the wireframing of a card recommendation system, which is covered in a follow-up RPub.