Classification w/ Decision Trees
Introduction:
The purpose of this project will be to
attempt to classify a win or a loss pending game metrics. This project
will use a favorite passion of mine, a competitive video game called
“Dota 2.” It holds the record for the largest overall prize pool of all
“eSports” for the top 7 spots and holds 8 out of the top 10 spots.
knitr::include_graphics("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Projects/Project4%20-%20'Document'%20Classification/Prize.png") $40 million isn’t too shabby, eh?
Ignoring our dreams of playing video games to earn money, let’s come
back to reality and see if we can’t predict a win for one of these
games. Great for betting, if you’re into that. Before we get into
predicting a win, I’d like to go over some basics of the game.
- There are 5 players to a team and 2 teams that play against each other, for 10 total players
- There are hundreds of heroes you can choose to play as
- These 5 heroes are chosen by the teams before play begins in an alternating fashion
- These 5 players and their chosen heroes fight against each other to gain gold, kill the enemy team, destroy the enemy buildings, and win the game once the most important building is destroyed
For this project of classification, we are
going to try and predict/classify a win using some game variables while
holding one variable constant. That variable will be the hero, and in
this case my favorite hero, Nature’s Prophet. We won’t get into the fine
details but to give you an idea the hero is a heavily objective based
hero that has the ability to teleport to different locations and spawn
little servants that help him destroy buildings. Due to this, the hero
is considered to be playing his best when taking buildings. This is
important since it will have an impact of some of his game stats and
what we can expect.
To give you a face to the name, below are a few of Nature’s Prophets spells he can cast in the game. Picture was clipped together from Dota’s official site
To give you a face to the name, below are a few of Nature’s Prophets spells he can cast in the game. Picture was clipped together from Dota’s official site
Data Collection
This data was gathered by using OpenDota, an API that allows
for parsing all matches played in the game. I used their website to
generate the csv since their GUI was so easy to use. I went ahead and
removed some information like player id, match id, and match time since
we don’t want these to influence our predictions.
Data Cleaning & Variable Explanation
npData<-read_csv("https://raw.githubusercontent.com/d-ev-craig/DATA607/main/Projects/Project4%20-%20'Document'%20Classification/NPBase.csv")
npData$gold <- as.numeric(npData$gold)
npData <- npData[,-c(1:2,4,5:6)]
dataModel <- data.frame(npData)
dataModel$win <- as.factor(dataModel$win)
dataModel$lane <- as.factor(dataModel$lane)
dataModel$lane_role <- as.factor(dataModel$lane_role)
paged_table(dataModel)Variables in Context
Some variables with explanation
below:
- Gold_per_min: How much gold this hero earned a minute (the higher the better, more gold means more items purchased which means the hero is stronger)
- Net Worth: Rough indicator of how much the hero was worth at the end of the game (higher net worth means more items means the hero is stronger)
- Gold: Amount of gold the hero ended the game on (Did they have a lot of extra gold? Did they spend everything?)
- Kills: How many times did this hero kill an enemy hero
- Tower Damage: How much damage to buildings the hero did(remember, the game ends when the most important building falls)
- Duration: How long did the game last (mm:ss)
- Lane: Ranked from 1 -3, characterizes the first 10-12 minutes of a game. 1 will typically be easier, while 3 is a bit harder.
- Lane_Role: Ranked from 1 - 5, represents the team’s priority of the hero. If it’s ranked 1, teammates will sacrifice their life to save yours, or give you preferential areas of the map.
Training the Decision Tree
We will be using the tree and caret
packages to perform our training. The basic idea behind a decision tree
is to identify critical points in a variable’s value that influence it
to be more of a yes vs a no in terms of classifying.
- I use caret’s createDataPartition command to create a 70/30 split to train/test on (train on 70% / test on 30%)
## Set Training Data
dataIndex <- createDataPartition(dataModel$win, p = .7, list = FALSE) #Creates a list containing 70% of the records and puts them into a vector
dataTrain <- dataModel[dataIndex, ] # Selects the rows of our dataset we want to train(which are the 70% we chose in our dataIndex line)
paged_table(dataTrain)## Set Testing Data
dataTest <- dataModel[-dataIndex, ] #Selects rows of our dataset that were not used for training
paged_table(dataTest)- I train our decision tree and print out a summary of the tree.
#Train our Decision Tree
classTreeFit <- tree(win ~ ., data = dataTrain)
summary(classTreeFit)##
## Classification tree:
## tree(formula = win ~ ., data = dataTrain)
## Variables actually used in tree construction:
## [1] "tower_damage" "duration" "gold" "kills" "net_worth"
## [6] "gold_per_min"
## Number of terminal nodes: 13
## Residual mean deviance: 0.3518 = 44.32 / 126
## Misclassification error rate: 0.07914 = 11 / 139
#Plot it and admire its beauty
plot(classTreeFit, show.node.label=TRUE)
text(classTreeFit,pretty=0,cex=0.7)
| From our tree, we can see the remaining error in the tree is about 25%
with a 6% misclassification rate. This is really pretty good to start
off with. Let’s try pruning it to simplify the tree without losing too
much accuracy.
- Post-pruning
#Pruning Our Tree
##Getting the misclass rate
pruneFit <- cv.tree(classTreeFit, FUN = prune.misclass)
pruneFit## $size
## [1] 13 10 9 7 5 3 2 1
##
## $dev
## [1] 31 31 31 35 38 37 32 63
##
## $k
## [1] -Inf 0.0 1.0 1.5 2.0 2.5 7.0 31.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
##
dfPruneFit <- cbind(size=pruneFit$size,dev=pruneFit$dev) #Grabbing the Size and Dev classes as these show the least amount of 'deviations' aka bad classifications and the relative size.
dfPruneFit <- data.frame(dfPruneFit)
dfPruneFit <- dfPruneFit %>% group_by(size)%>%arrange(size)%>%arrange(dev)
dfPruneFit## # A tibble: 8 × 2
## # Groups: size [8]
## size dev
## <dbl> <dbl>
## 1 9 31
## 2 10 31
## 3 13 31
## 4 2 32
## 5 7 35
## 6 3 37
## 7 5 38
## 8 1 63
bestVal <- dfPruneFit$size[1] #Sets our best number of tree branches to a variable to use later
# Using our pruned misclass rate to alter our training
pruneFitFinal <- prune.misclass(classTreeFit, best = bestVal) #Use our bestVal to prune our tree
summary(pruneFitFinal)##
## Classification tree:
## snip.tree(tree = classTreeFit, nodes = c(6L, 22L, 23L))
## Variables actually used in tree construction:
## [1] "tower_damage" "duration" "gold" "kills"
## Number of terminal nodes: 9
## Residual mean deviance: 0.4848 = 63.02 / 130
## Misclassification error rate: 0.08633 = 12 / 139
plot(pruneFitFinal, show.node.label=TRUE)
text(pruneFitFinal,pretty=0,cex=0.7)## Prune Graph Generate
plot(pruneFit$size, pruneFit$dev, type = "b") You can tell by the plot of dev v size,
that 8 is the lowest and isn’t too many extra nodes, so we will prune
the tree back to that. Our residual mean deviance increases up to 42%
which is quite a bit higher, but out mis-classification error rate stays
the same.
Making Predictions
From here we can now use our trained
tree’s to predict on our test partition of the dataset. I created a few
tables to compare the misclassifications from both the un-pruned and
pruned version of the tree.
#Predicting on the test set
fullPred <- predict(classTreeFit, dplyr::select(dataTest, -"win"), type = "class") #predict using the classTreeFit trained decision tree on the dataTest dataframe without the win column
fullTbl <- table(data.frame(fullPred, dataTest$win))
fullTbl## dataTest.win
## fullPred FALSE TRUE
## FALSE 21 2
## TRUE 5 31
accFull<-sum(diag(fullTbl)/sum(fullTbl))
accFull## [1] 0.8813559
prunePred <- predict(pruneFitFinal, dplyr::select(dataTest, -"win"), type = "class")
pruneTbl <- table(data.frame(prunePred, dataTest$win))
pruneTbl## dataTest.win
## prunePred FALSE TRUE
## FALSE 20 2
## TRUE 6 31
accPrun<-sum(diag(pruneTbl)/sum(pruneTbl))
accPrun## [1] 0.8644068
As you can see, we have identical
prediction capabilities between the two trees. Looks like the pruning is
worth it.
print(paste0("The pruned tree had an accuracy of ",accPrun," and the un-pruned tree had an accuracy of ",accFull," when it came to predicting the win with the hero Nature's Prophet."))## [1] "The pruned tree had an accuracy of 0.864406779661017 and the un-pruned tree had an accuracy of 0.88135593220339 when it came to predicting the win with the hero Nature's Prophet."
Conclusions
Let’s take a look at our pruned decision
tree and see what the Gini index helped the tree choose as the most
important variables.
plot(pruneFitFinal, show.node.label=TRUE)
text(pruneFitFinal,pretty=0,cex=0.7) The biggest factor seemed to be tower
damage. This aligns with some of the hero context I mentioned earlier,
that the hero is best played when focusing on dealing damage to
buildings. The next biggest factor was duration of the game. So let’s
imagine going down the right path of the root tower damage node.. if we
deal more than 8793 damage to a tower, and our game is less than 24
minutes, we most likely win with this hero. If we go longer than 24
minutes, our next most important factor is again our tower damage, but
it has to jump all the way to 20,000 damage. That is quite a lot. I
would interpret this as some type of timing in the game is important for
this hero, and if you miss that timing, your next best bet is to plan on
a longer game where you deal more damage to buildings. Let’s interpret
the left side of the tree.
If our tower damage is less than 8793 and the game went longer than 17 minutes, we lost. This probably means at some point the hero was unable to hit buildings, most likely from having to play defensively. If the game is shorter than 17 minutes, the best way out is by prioritizing kills. Most games last 25 or more minutes, but its good to know that if one felt like they won’t be able to end the game at 25 minutes, the next best bet is to aim for kills early and end it at 17.
Decision trees are useful for how easy it is to interpret them. There are some criticisms about how accurate they can get and there are methods such as boosting, bagging, and randomizing forests, but this is one way to classify something. Other methods in the supervised realm are Support Vector Machines or K Nearest Neighbors. For the unsupervised, you could look at K-Means Clustering, Hierarchical Clustering, and Principal Component Analysis (PCA).
If our tower damage is less than 8793 and the game went longer than 17 minutes, we lost. This probably means at some point the hero was unable to hit buildings, most likely from having to play defensively. If the game is shorter than 17 minutes, the best way out is by prioritizing kills. Most games last 25 or more minutes, but its good to know that if one felt like they won’t be able to end the game at 25 minutes, the next best bet is to aim for kills early and end it at 17.
Decision trees are useful for how easy it is to interpret them. There are some criticisms about how accurate they can get and there are methods such as boosting, bagging, and randomizing forests, but this is one way to classify something. Other methods in the supervised realm are Support Vector Machines or K Nearest Neighbors. For the unsupervised, you could look at K-Means Clustering, Hierarchical Clustering, and Principal Component Analysis (PCA).