Esports, short for electronic sports, is a form of sport played on electronic devices such as phones, PCs, or game consoles. In esports, a team of players usually competes against other teams within the same game, trying to come out as the winner.
Esports began in 1972, when five students from Stanford University competed in an event called the Intergalactic Spacewar Olympics for the game Spacewar. Today, esports and its enthusiasts keep growing in every part of the world, with competitions being held at every scale, from local tournaments to international ones.
In Indonesia, where the author of this article is from, esports has been recognized as a sport under the Indonesian National Sports Committee (KONI) and is under the auspices of the Indonesian Esports Executive Board (PB ESI). Based on statistics gathered from esportsearnings.com, Indonesia is ranked 23 out of 161 countries by prize money won by its esports players. Every country wants its teams to win each competition and to be the number one country with the best players in every game, and the main key to achieving that is the skill of its players.
In this article, the author uses machine learning to build a model that can help measure esports players' skills. The model reports a player's skill status based on their in-game performance compared with other players' data.
# load library
library(tidyverse) # data wrangling and piping
library(FactoMineR) # package for PCA
library(factoextra) # package for cluster visualization, e.g. `fviz_nbclust()`
library(ggplot2) # data visualization
library(GGally) # package for the correlation plot `ggcorr()`
library(ggiraphExtra) # extra ggplot-based visualizations
library(gridExtra) # arrange multiple plots in one grid
library(rsample) # package for initial split
library(class) # package for using `knn()` function
library(caret) # package for `upSample()` and `confusionMatrix()`
library(randomForest) # package to build random forest model
library(nnet) # package for computing multinomial logistic regression
For this project, the author used static data taken from kaggle.com containing more than 1K players' behaviors from the Call of Duty: Modern Warfare video game. In the code below, we read the data and inspect the first six rows.
cod <- read.csv("data_input/cod.csv")
head(cod)
## name wins kills kdRatio killstreak level losses prestige
## 1 RggRt45#4697369 0 0 0.000000 0 1 0 0
## 2 JohniceRex#9176033 0 0 0.000000 0 1 0 110
## 3 bootybootykill#1892064 0 66 1.031250 0 9 0 110
## 4 JNaCo#5244172 3 2 0.400000 0 1 0 0
## 5 gomezyayo_007#6596687 0 2 0.200000 0 1 0 110
## 6 Brxndoon7-LK#4002715 684 27011 1.066743 18 177 10 110
## hits timePlayed headshots averageTime gamesPlayed assists misses xp
## 1 0 0 0 0.000000 0 0 0 0
## 2 0 7 0 7.000000 0 0 0 700
## 3 0 32 16 32.000000 0 1 0 48300
## 4 0 3 0 3.000000 0 0 0 1150
## 5 0 5 1 5.000000 0 0 0 1000
## 6 98332 1366 5113 2.323129 588 6063 305319 3932335
## scorePerMinute shots deaths
## 1 0.000 0 0
## 2 0.000 0 16
## 3 0.000 0 64
## 4 0.000 0 5
## 5 0.000 0 10
## 6 255.672 403651 25321
Variable explanation: each column is named after the in-game statistic it records, for example wins, kills, kdRatio (kill/death ratio), timePlayed, and xp (experience points).
There are other ways, besides looking at the first six rows, to get to know the inside of our data. The code below is one of them:
glimpse(cod)
## Rows: 1,558
## Columns: 19
## $ name <chr> "RggRt45#4697369", "JohniceRex#9176033", "bootybootykil…
## $ wins <int> 0, 0, 0, 3, 0, 684, 4, 186, 741, 26, 0, 0, 188, 0, 15, …
## $ kills <int> 0, 0, 66, 2, 2, 27011, 162, 1898, 21803, 349, 0, 26, 19…
## $ kdRatio <dbl> 0.0000000, 0.0000000, 1.0312500, 0.4000000, 0.2000000, …
## $ killstreak <int> 0, 0, 0, 0, 0, 18, 4, 13, 26, 7, 0, 0, 22, 0, 7, 10, 17…
## $ level <int> 1, 1, 9, 1, 1, 177, 6, 37, 185, 12, 1, 6, 53, 1, 5, 40,…
## $ losses <int> 0, 0, 0, 0, 0, 10, 2, 7, 29, 4, 0, 0, 4, 0, 4, 9, 11, 1…
## $ prestige <int> 0, 110, 110, 0, 110, 110, 0, 2, 111, 0, 0, 110, 57, 0, …
## $ hits <int> 0, 0, 0, 0, 0, 98332, 568, 5111, 81361, 996, 0, 0, 3333…
## $ timePlayed <int> 0, 7, 32, 3, 5, 1366, 8, 550, 2442, 44, 0, 37, 409, 1, …
## $ headshots <int> 0, 0, 16, 0, 1, 5113, 35, 485, 3894, 40, 0, 3, 536, 0, …
## $ averageTime <dbl> 0.000000, 7.000000, 32.000000, 3.000000, 5.000000, 2.32…
## $ gamesPlayed <int> 0, 0, 0, 0, 0, 588, 4, 150, 864, 15, 0, 0, 25, 0, 6, 12…
## $ assists <int> 0, 0, 1, 0, 0, 6063, 68, 488, 4029, 138, 0, 4, 150, 0, …
## $ misses <int> 0, 0, 0, 0, 0, 305319, 4836, 39978, 327230, 4844, 0, 0,…
## $ xp <int> 0, 700, 48300, 1150, 1000, 3932335, 24485, 458269, 4269…
## $ scorePerMinute <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 255.67204,…
## $ shots <int> 0, 0, 0, 0, 0, 403651, 5404, 45089, 408591, 5840, 0, 0,…
## $ deaths <int> 0, 16, 64, 5, 10, 25321, 256, 3332, 21032, 786, 0, 77, …
We get information about the data: the number of rows is 1,558 and the number of columns (variables) is 19. The output also lists every column, a sample of the data in each, and the data type of each column, such as <chr> for character or string values, <int> for integers, and <dbl> for decimal values.
The final model will be used to predict player performance from the scoreboard shown right after a match ends. Since not all of the variables in our cod data appear there, we will only use the ones that are common and easy to read straight off the scoreboard.
Call of Duty: Modern Warfare Team Deathmatch Scoreboard
For this article, the author will use the variables that appear on the scoreboard in the image above and are also available in the main data: kdRatio, assists, and scorePerMinute. We don't use the kills and deaths information since it is already summarized in kdRatio, the kill/death ratio, while wins and losses aren't included since they result from team performance, not individual performance. These variables can be changed later if the developer updates or extends the information shown on the scoreboard.
used_var <- c("kdRatio", "assists", "scorePerMinute")
cod <- cod %>%
select(all_of(used_var))
We have filtered the data down to the variables provided on the scoreboard. Next, we need to explore the data and get to know its content. The first thing is to make sure that no variable contains empty (missing) values, which we check with the code below.
colSums(is.na(cod))
## kdRatio assists scorePerMinute
## 0 0 0
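The counts above are all zero, so nothing needs to be dropped. Had any count been non-zero, one simple option (a hedged suggestion, not part of the original workflow) would be to remove incomplete rows with tidyr's drop_na(), which is loaded with the tidyverse; cod_complete below is a hypothetical name:
# keep only rows without missing values (only needed when the counts above are non-zero)
cod_complete <- cod %>% drop_na()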
Since the data has no missing values, the next step is to inspect its distribution. One way to do that is by looking at the summary statistics of each variable. The next code will help us with that task.
summary(cod)
## kdRatio assists scorePerMinute
## Min. :0.0000 Min. : 0.0 Min. : 0.00
## 1st Qu.:0.2614 1st Qu.: 0.0 1st Qu.: 0.00
## Median :0.7328 Median : 36.5 Median : 56.79
## Mean :0.6371 Mean : 685.8 Mean :107.87
## 3rd Qu.:0.9553 3rd Qu.: 609.8 3rd Qu.:221.65
## Max. :3.0000 Max. :14531.0 Max. :413.80
Here are insights that we can get from the summary statistics above:

- The variables live on very different scales: kdRatio stays in single digits, scorePerMinute reaches the hundreds, and assists reaches the thousands.
- For assists, the data up to the 3rd quartile (75% of it) ranges from tens to hundreds, but its maximum is in the tens of thousands, which is a bit odd and needs further exploration.

From the second insight, we know that assists contains some values that are very large compared to the others. Values like these are commonly known as outliers: observations that we can presume lie an abnormal distance from the other values. We need to get rid of such values to prevent bias in our analysis. For this, we can use a few techniques to check whether there are outliers in the data.
The first technique the author will be using is boxplot visualization. Below is an example of building the boxplots with the ggplot2 library.
theme <- theme_minimal() +
theme(
axis.title.y = element_blank(),
axis.text.x = element_blank())
# boxplot for kdRatio column
boxplot_1 <- ggplot(data = cod, aes(y = kdRatio)) +
geom_boxplot() + labs(x = "kdRatio") + theme
# boxplot for assists column
boxplot_2 <- ggplot(data = cod, aes(y = assists)) +
geom_boxplot() + labs(x = "assists") + theme
# boxplot for scorePerMinute column
boxplot_3 <- ggplot(data = cod, aes(y = scorePerMinute)) +
geom_boxplot() + labs(x = "scorePerMinute") + theme
# combine all boxplot in one plot
grid.arrange(boxplot_1, boxplot_2, boxplot_3,
ncol = 3)
A boxplot shows the information already provided by the summary statistics, but most importantly it also shows something we can't get from them: signs of outliers in our data. In the boxplot visualization above, for example, we can see lots of black points. Each black point represents one observation of our data and is a sign of an outlier, specifically in the kdRatio and assists columns. Because scorePerMinute has no black points in its boxplot, there are no outliers for that variable.
We want to know exactly which rows of our data contain outliers because later we will remove them. To do this, we will use conditional subsetting based on the quartile values of the kdRatio and assists variables. In statistics, an outlier lies above Q3 + 1.5*IQR (the upper bound) or below Q1 - 1.5*IQR (the lower bound). So here are our next steps:

- Calculate the upper and lower bounds of kdRatio and assists.
- Find the rows whose kdRatio or assists values fall below the lower bound or above the upper bound.

# calculate upper and lower bound of kdRatio
q1_kdRatio <- quantile(cod$kdRatio, probs = 0.25)
q3_kdRatio <- quantile(cod$kdRatio, probs = 0.75)
iqr_kdRatio <- q3_kdRatio - q1_kdRatio
upper_kdRatio <- q3_kdRatio + 1.5 * iqr_kdRatio
lower_kdRatio <- q1_kdRatio - 1.5 * iqr_kdRatio
# calculate upper and lower bound of assists
q1_assists <- quantile(cod$assists, probs = 0.25)
q3_assists <- quantile(cod$assists, probs = 0.75)
iqr_assists <- q3_assists - q1_assists
upper_assists <- q3_assists + 1.5 * iqr_assists
lower_assists <- q1_assists - 1.5 * iqr_assists
# get row containing outliers using conditional subsetting
outliers_index <- which((cod$kdRatio < lower_kdRatio) |
(cod$kdRatio > upper_kdRatio) |
(cod$assists < lower_assists) |
(cod$assists > upper_assists))
head(outliers_index)
## [1] 6 9 18 40 48 65
From the above step, we get numbers saved in an object called outliers_index. These numbers are the indexes of the rows in our data that contain outliers.
The second technique the author will be using is Principal Component Analysis (PCA). PCA is an algorithm commonly used for dimensionality reduction in machine learning, but here we will use another feature of it: outlier detection. The code below builds the PCA object and visualizes it to search for outliers in the data:
# build PCA object
pca_cod <- PCA(X = cod,
scale.unit = T,
graph = F)
# visualize PCA
plot.PCA(x = pca_cod,
choix = "ind",
select = "contrib5")The visualization of the PCA object above is an example of when we
want to look at the five outermost outliers in our data based on
all of the variables. Its number is set using the
select parameter inside the plot.PCA function,
where we only need to change the value written after the word
contrib, and because it has been set as
contrib5 so the plot only shows the index of the five
outermost outliers.
There are some weaknesses in how PCA gives us information about outliers. It only shows the row indexes of the number we define in contrib, and the plot becomes hard to read when we want to show more of them, since the labels overlap with one another. Moreover, if we want to collect those indexes, we have to type them manually from the plot rather than extract them with a function.
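That said, if we are willing to work with the PCA object directly, a similar list can be recovered programmatically. Below is a minimal sketch, under the assumption that ranking rows by their distance from the origin of the PCA coordinate space approximates the "outermost" points highlighted by plot.PCA:
# row coordinates in PCA space (stored by FactoMineR in pca_cod$ind$coord)
pca_coords <- pca_cod$ind$coord
# Euclidean distance of each row from the center of the PCA space
dist_from_center <- sqrt(rowSums(pca_coords^2))
# row indexes of the five outermost observations
head(order(dist_from_center, decreasing = TRUE), 5)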
Here, the author only uses PCA to check whether the indexes of the five outermost outliers from this method are also contained in the outliers_index object from the boxplot method.
# build function to check for the existence of a value
fun <- function(a){
a %in% outliers_index
}
# the indexes from PCA need to be typed manually
sapply(list(1096, 236, 136, 1314, 1121), FUN = fun)
## [1] TRUE TRUE TRUE TRUE TRUE
The step above checks whether the five outermost indexes shown in the PCA visualization are also in the outliers_index object. It checks each index and returns TRUE five times, which means all of the indexes are contained in outliers_index. These indexes are therefore considered outliers by both the PCA and the boxplot method, which gives us some confidence in the results of the outlier detection from these two methods.
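As a side note, the same membership check can be written more compactly without a helper function (an equivalent one-liner over the same five indexes read off the plot):
# TRUE only when all five indexes are found in outliers_index
all(c(1096, 236, 136, 1314, 1121) %in% outliers_index)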
After going through the outlier detection process, we have the list of row indexes considered to contain outliers. We will use this information in the next step, where we exclude these rows from the main data. First, the code below tells us how many rows are considered outliers:
length(outliers_index)
## [1] 220
The output of 220 means 220 rows are considered outliers. Next, we exclude all of those rows from our data using the code below and save the result in a new variable called cod_clean. Our data initially has 1558 rows; after the outliers are removed, it consists of 1338 rows.
# remove outliers from initial data
cod_clean <- cod[-outliers_index,]
# check for dimension of the initial data
dim(cod) # initial data
## [1] 1558 3
dim(cod_clean) # data after removing outliers
## [1] 1338 3
Remember that the final model of this project will be used to analyze a player's skill status based on their in-game performance, which means it should be a supervised model, since it predicts a target: in this case, the skill status. Looking at the main data, there is no information about this skill status, so we need to create it first before the model can make a prediction.
For the next step, we first use an unsupervised method, clustering, to create clusters or categories from our data, which has no target. Later, the clusters resulting from the clustering process will be renamed to the appropriate skill statuses and used as the target for the final supervised model. Here, we will see an implementation that combines unsupervised and supervised methods. In this article, the unsupervised model selected for the clustering step is the k-means model.
K-means is a clustering model that groups the data based on the distance between data points. The important thing about distance calculation is that it is sensitive to variables with different ranges or units, so it is better to scale the data before the clustering process. After scaling, the variables will be in the same range, or at least not far from one another. In this article, the author uses the Z-score method, which scales the data so that most values fall roughly between -3 and 3. Below is the code:
# scale data
cod_clean_scale <- scale(cod_clean)
# check the range of variables in the data using summary statistic
summary(cod_clean_scale)
## kdRatio assists scorePerMinute
## Min. :-1.3811 Min. :-0.56049 Min. :-0.8016
## 1st Qu.:-0.9787 1st Qu.:-0.56049 1st Qu.:-0.8016
## Median : 0.1043 Median :-0.52393 Median :-0.6785
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.8165 3rd Qu.: 0.04914 3rd Qu.: 0.9269
## Max. : 3.1612 Max. : 3.72589 Max. : 2.7656
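As a quick sanity check, scale() with its default arguments applies the z-score formula (x - mean(x)) / sd(x) to each column; the minimal sketch below compares one manually computed column against the scaled data:
# manual z-score of kdRatio should match the corresponding scaled column
head((cod_clean$kdRatio - mean(cod_clean$kdRatio)) / sd(cod_clean$kdRatio))
head(cod_clean_scale[, "kdRatio"])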
The summary above shows that after the scaling process, the variables are in the same range, or at least not far from one another. But keep in mind that the scaled values are no longer the same as the original ones, which means we can't interpret the data in this condition. The code below shows the first six rows of the scaled data:
head(cod_clean_scale)
## kdRatio assists scorePerMinute
## 1 -1.3810916 -0.5604909 -0.8016452
## 2 -1.3810916 -0.5604909 -0.8016452
## 3 1.1084000 -0.5576783 -0.8016452
## 4 -0.4154706 -0.5604909 -0.8016452
## 5 -0.8982811 -0.5604909 -0.8016452
## 7 0.1465510 -0.3692350 1.4871587
In k-means, we need to define the k value that we will use for clustering. k indicates the number of clusters or groups the model will try to find in the data, meaning that after the clustering process there will be k cluster(s) of data, where data in different clusters have different characteristics while data within one cluster share the same characteristics. There are several ways to decide which k value to use:

- Based on business needs or domain knowledge.
- Using a statistical approach such as the elbow method to find the optimum k value for k-means clustering.

For this article, the author will use the first way to find k: since the skill statuses will be pro, average, or newbie, we will use 3 for the k value. Still, we can check whether the optimum value from the elbow method agrees by using the code below, and don't forget to use the scaled data:
fviz_nbclust(x = cod_clean_scale,
FUNcluster = kmeans,
method = "wss")From the graphic above, we need to look at the value of
k at the elbow i.e. the point before the
decline is no longer significant. Thus, for the optimum k
value it will be 4 from the elbow method above.
Since it was already mentioned that the author will be using 3 for the k value based on the business needs, we won't proceed with the k value from the elbow method. Below is the code for k-means clustering on the scaled data, with the k value set to 3 through the centers parameter:
RNGkind(sample.kind = "Rounding")
set.seed(123)
cod_kmeans <- kmeans(x = cod_clean_scale,
centers = 3)
The result of the clustering process is saved in an object called cod_kmeans. To get the cluster of each row or observation in our data, we can access the cluster element of cod_kmeans. We will directly save this cluster information to a new column called cluster_result in our data using the code below:
cod_clean$cluster_result <- cod_kmeans$cluster
head(cod_clean)
## kdRatio assists scorePerMinute cluster_result
## 1 0.0000000 0 0.0 1
## 2 0.0000000 0 0.0 1
## 3 1.0312500 1 0.0 2
## 4 0.4000000 0 0.0 1
## 5 0.2000000 0 0.0 1
## 7 0.6328125 68 265.5 2
Note that the cluster information is saved to the unscaled, original data. Why is that? As mentioned previously, once the data is scaled we can no longer interpret its values. Since we need the original values when interpreting the clusters in the further analysis, we attach the clustering result back to the unscaled data; the scaling was done only for the needs of the clustering process.
Next, we will look at the mean value of each variable per cluster to summarize the clusters, which can help us determine the characteristics of the data in each cluster.
cod_clean %>%
group_by(cluster_result) %>%
summarise_all(mean)
## # A tibble: 3 × 4
## cluster_result kdRatio assists scorePerMinute
## <int> <dbl> <dbl> <dbl>
## 1 1 0.246 12.1 7.64
## 2 2 0.837 145. 167.
## 3 3 0.979 979. 187.
Now that we have each variable's mean for each cluster, we can conclude that, in general:
The values of every variable in the 3rd cluster are the largest of the three clusters, the 1st cluster has the smallest values, and the 2nd cluster sits in the middle.
From the conclusions made above, the author will change the name of
each cluster to match the skill status according to the variable’s mean
value in each cluster. The 1st cluster will be renamed as a
newbie, the 2nd cluster as an average, and the
3rd cluster as a pro.
Since the data now has a target variable, cluster_result, we can proceed to the next step: using a supervised method to build a model that can classify new data into its skill status. In this article, the author will build two kinds of supervised models: a multinomial logistic regression model and a random forest model.
But before starting to build the models, we need to make sure the data we are going to use is ready by going through the preprocessing and exploratory steps.
The step we haven't done yet is renaming the clusters based on the cluster profiling step. So next, we rename each cluster: the 1st cluster becomes newbie, the 2nd cluster becomes average, and the 3rd cluster becomes pro. We do this step using the code below:
cod_clean <- cod_clean %>%
mutate(cluster_result = as.factor(cluster_result))
# change the name of each cluster
levels(cod_clean$cluster_result) <- c("newbie", "average", "pro") # 1 = "newbie"; 2 = "average"; 3 = "pro"
head(cod_clean)
## kdRatio assists scorePerMinute cluster_result
## 1 0.0000000 0 0.0 newbie
## 2 0.0000000 0 0.0 newbie
## 3 1.0312500 1 0.0 average
## 4 0.4000000 0 0.0 newbie
## 5 0.2000000 0 0.0 newbie
## 7 0.6328125 68 265.5 average
We have completed the renaming step for the target variable. Before we continue to use the data to build the model, we will explore it once more, as we did before. We need to do this because the previous exploration was carried out on data that didn't have a target variable; now that we have that information, we can learn more and do further analysis with our data.
A common thing to do in the exploratory step is to check the correlation between variables in the data. Correlation analysis is worth doing because it tells us about the relationship between one variable and another. One easy way to do it is through visualization, with the code below:
ggcorr(cod_clean, label = TRUE)
A correlation value lies between -1 and 1: the closer to 1, the stronger the positive correlation between the two variables; the closer to -1, the stronger the negative correlation; and a value at or near zero means there is no correlation. Here is what we can get from the visualization above:
Each variable has a correlation value between 0.4 and 0.5 with other variables, which means they all have a positive correlation with one another but are not too strong.
A positive correlation between two variables means that when one of them increases, so does the other. A negative correlation, by contrast, means that when one of them increases, the other decreases. Note that there is no correlation information for our cluster variable: the correlation analysis above is only available for continuous values, and since the cluster information is categorical, it can't be calculated with this method.
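If we still want to relate a continuous variable to the categorical cluster labels, one hedged alternative is to compare the variable's distribution across clusters, for example with a boxplot per cluster:
# distribution of kdRatio for each cluster; an exploratory substitute for
# correlation when one of the variables is categorical
ggplot(cod_clean, aes(x = cluster_result, y = kdRatio)) +
geom_boxplot()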
The last exploratory step the author will do before going straight to model building is to check the proportion of our target variable, cluster_result, in the data. Why do we need to do this? Checking the proportion of the target variable is important, especially for classification purposes, because a model trained on data with an unbalanced target will tend to classify new data into the majority class. We can check the proportion of the target variable using the code below:
prop.table(table(cod_clean$cluster_result))
##
## newbie average pro
## 0.4828102 0.3751868 0.1420030
From the result of the above code, we can see that the proportion of each target value is different, especially the pro proportion, which is small compared to the other two. When faced with an unbalanced target problem like this, there are two common ways to balance the proportions:

- Downsampling: reducing the number of rows in the majority classes.
- Upsampling: resampling the minority classes until every class has the same number of rows.
The number of rows in our current data is 1338. Since the author doesn't want to reduce the size of the data to balance its target, this article will use the upsampling method for balancing the proportion of the target variable.
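For reference, below is a minimal sketch of the downsampling counterpart, which is not used in this article; caret's downSample() mirrors the signature of the upSample() function applied later, and it is run on the full data here only for illustration:
# downsampling shrinks the majority classes instead of duplicating the minority
down_example <- downSample(x = cod_clean %>% select(-cluster_result),
y = cod_clean$cluster_result,
yname = "cluster_result")
prop.table(table(down_example$cluster_result))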
Before going through the upsampling process, we first need to split the data into train and test data. This step provides the data the model uses to learn (the train data) and the data used to evaluate the model's performance (the test data). The author chooses a proportion of 75% train and 25% test data for this article. Since we want the split to keep every unique target value, in this case every skill status, present in both the train and the test data, we use the stratified method in the code below:
RNGkind(sample.kind = "Rounding")
set.seed(100)
index <- initial_split(cod_clean, prop = 0.75, strata = cluster_result)
train <- training(index) # train data
test <- testing(index) # test data
We have split the data into train and test; the next step is the upsampling process. One thing to note is that we only balance the target in the train data. Why is that? Because real-life data is unlikely to have a balanced target, since observations are gathered as they come, and the test data should reflect that. So in the code below, we apply the upsampling method to the train data for training purposes only:
RNGkind(sample.kind = "Rounding")
set.seed(100)
# upsampling method
train <- upSample(x = train %>% select(-cluster_result),
y = train$cluster_result,
yname = "cluster_result")# check proportion of target variable
prop.table(table(train$cluster_result))##
## newbie average pro
## 0.3333333 0.3333333 0.3333333
Above, we can see that the proportions of the unique values of the target variable are now the same, i.e. balanced, after the upsampling process. Our data is ready for the next step: building the classification model.
The first model we will build is a multinomial logistic regression model. It is multinomial because the target variable has more than two unique values. Logistic regression is a supervised machine learning model that works by calculating the probability that a new observation falls into each unique value of the target variable; the final prediction is the target value with the largest probability.
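To make the mechanics concrete, here is a toy sketch with invented numbers (not taken from our model): multinom() fits one linear predictor per non-baseline class, and the class probabilities come from the softmax of those predictors.
# hypothetical linear predictors for one player (the baseline class gets 0)
eta <- c(newbie = 0, average = 1.2, pro = -0.5)
# softmax turns the predictors into probabilities that sum to 1
probs <- exp(eta) / sum(exp(eta))
round(probs, 3)
# the predicted class is the one with the largest probability
names(which.max(probs))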
For the first step, we will build the model using the train data for learning purposes. We don't want to rerun the training every time we need this model in the future, because some models require a long time to train. So in this case, the author chooses to save the trained model as a file in RDS format; later, when the model is needed, we only have to read the saved file back. The code below builds the multinomial logistic regression model and saves it to an RDS file:
# # ONLY RUNS ONCE, please return to the comment form again after a run
#
# # build logistic regression
# logreg_model <- multinom(formula = cluster_result ~., data = train)
#
# # save model in the local folder
# saveRDS(logreg_model, "logreg_model.RDS")
Note that the code above is commented out to prevent it from being evaluated when the chunk is executed, because we only need to run it once to build the model. Since the model has already been saved as a file, we don't need to run it again. To read the saved model, we can use the code below:
# read model
logreg_model <- readRDS("logreg_model.RDS")
logreg_model
## Call:
## multinom(formula = cluster_result ~ ., data = train)
##
## Coefficients:
## (Intercept) kdRatio assists scorePerMinute
## average -116.8678 133.8923 0.0401555 0.4762593
## pro -446.4343 194.0271 0.5156745 0.5177757
##
## Residual Deviance: 5.096176
## AIC: 21.09618
The model summary above shows several pieces of information. The Coefficients table lists, for each non-baseline class (average and pro, with newbie as the baseline), an intercept and one coefficient per predictor (kdRatio, assists, and scorePerMinute); a positive coefficient means that increasing that predictor raises the odds of that class relative to newbie. The summary also reports the Residual Deviance and the AIC, which measure how well the model fits the data.
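If we want the per-class probabilities behind a prediction rather than only the final label, nnet's predict method for multinom objects can also return them; a small usage sketch:
# per-class probabilities for the first rows of the test data
head(predict(logreg_model, newdata = test, type = "probs"))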
Next, now that the logistic regression model has been created, we want to use it to predict the target variable of the test data and use the prediction results to evaluate the model's performance. The prediction is made with the code below, where we save the results in an object called pred_result_logreg:
pred_result_logreg <- predict(object = logreg_model, newdata = test)
head(pred_result_logreg)
## [1] average average average newbie average average
## Levels: newbie average pro
Now that we have the prediction results for the test data, the last step is to evaluate the model using an appropriate measurement. The common tool for evaluating model performance, especially for classification problems, is the confusion matrix, which gives many metrics we can choose from to evaluate the model. To build the confusion matrix, we need to specify the actual values of the target variable in the test data along with their predicted results. Here is the code:
confusionMatrix(data = pred_result_logreg, reference = test$cluster_result, positive = "pro")
## Confusion Matrix and Statistics
##
## Reference
## Prediction newbie average pro
## newbie 162 0 0
## average 0 126 0
## pro 0 0 48
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9891, 1)
## No Information Rate : 0.4821
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: newbie Class: average Class: pro
## Sensitivity 1.0000 1.000 1.0000
## Specificity 1.0000 1.000 1.0000
## Pos Pred Value 1.0000 1.000 1.0000
## Neg Pred Value 1.0000 1.000 1.0000
## Prevalence 0.4821 0.375 0.1429
## Detection Rate 0.4821 0.375 0.1429
## Detection Prevalence 0.4821 0.375 0.1429
## Balanced Accuracy 1.0000 1.000 1.0000
The confusion matrix produces a lot of measurements, and we don't need to look at all of them, only a few. Commonly used metrics are Accuracy, Sensitivity, Specificity, and Pos Pred Value, but the author will mostly look at the accuracy score. Our logistic regression model has an accuracy of 1, meaning it predicts the target correctly 100% of the time, which is a really good performance. The model performs so well that the other metrics also score 1.
The second model is the random forest model. Random forest is a supervised machine learning model that is robust, meaning it is likely to perform well with small errors when predicting new data. It is a strong model because it combines the outputs of multiple decision tree models, using majority voting over the trees' outputs to make its final prediction.
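As a toy illustration of the voting idea (the votes below are invented and not the actual model internals), the final prediction is simply the most frequent class among the individual trees' outputs:
# five hypothetical tree predictions for one observation
tree_votes <- c("pro", "average", "pro", "pro", "newbie")
# majority vote: the most frequent class wins
names(which.max(table(tree_votes)))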
The processes that we will do next are the same as the ones we used for the logistic regression model: build the model using the train data and save it to an RDS file, use the model to predict the test data, and finally evaluate the model based on how it performs on the test data.
# # ONLY RUNS ONCE, please return to the comment form again after a run
#
# set.seed(417)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
#
# # build Random Forest (note: caret's train() takes the control object
# # through its trControl parameter)
# rf_model <- train(form = cluster_result ~ ., data = train, method = "rf",
#                   trControl = control)
#
# # save model in the local folder
# saveRDS(rf_model, "rf_model.RDS")
# read model
rf_model <- readRDS("rf_model.RDS")
In the random forest model, we can see how much the model relies on each predictor variable in our data, in this case assists, scorePerMinute, and kdRatio, to make an accurate prediction of the target. To do this, we can use the code below:
# variable importance
varImp(rf_model) %>% plot()
The variable importance analysis above shows that assists is the most influential variable in predicting a player's skill status, followed by scorePerMinute and kdRatio. This result only gives us an idea of how much each variable is used by the model; it doesn't mean we should remove the least influential variable, in this case kdRatio. If these variables are indeed important in the game itself, we can still keep them in the data.
pred_result_rf <- predict(object = rf_model, newdata = test)
head(pred_result_rf)
## [1] average average average newbie average average
## Levels: newbie average pro
confusionMatrix(data = pred_result_rf, reference = test$cluster_result, positive = "pro")
## Confusion Matrix and Statistics
##
## Reference
## Prediction newbie average pro
## newbie 160 2 0
## average 2 124 0
## pro 0 0 48
##
## Overall Statistics
##
## Accuracy : 0.9881
## 95% CI : (0.9698, 0.9967)
## No Information Rate : 0.4821
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9804
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: newbie Class: average Class: pro
## Sensitivity 0.9877 0.9841 1.0000
## Specificity 0.9885 0.9905 1.0000
## Pos Pred Value 0.9877 0.9841 1.0000
## Neg Pred Value 0.9885 0.9905 1.0000
## Prevalence 0.4821 0.3750 0.1429
## Detection Rate 0.4762 0.3690 0.1429
## Detection Prevalence 0.4821 0.3750 0.1429
## Balanced Accuracy 0.9881 0.9873 1.0000
Based on the model's performance in predicting the target variable of the test data, its accuracy score is high at 98.81%, while the scores of the other per-class metrics are also high, between 98% and 100%.
Overall, based on the confusion matrix of each model, both perform really well at predicting the target, the skill status. Both models exceed 95% accuracy, and the logistic regression model even scores 100%. These models give their final prediction in the form of one of the skill statuses, newbie, average, or pro, as an evaluation of where the player's skill stands based on their performance at the end of every game.
Why were both accuracies so high? Normally we should be cautious about a high accuracy score, especially when, as here, only three variables predict the target and one of the models is as simple as logistic regression. But in this case both models could reach high accuracy because we used an unsupervised method first to build our target variable. The targets were built from a pattern the unsupervised model caught in the initial data, so the resulting target already follows a clear pattern in the data. From there, our supervised models, the logistic regression and random forest models, could also easily catch that pattern during the learning process and deliver good prediction performance.
Since both of our final models are reliable enough for predicting a player's skill status, we could use either of them in the future. Here, however, the author chooses the random forest model: even though its accuracy is slightly lower than the logistic regression's, it is more robust, so it should stay reliable across whatever kinds of data it is asked to predict.
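As a small usage sketch of the chosen model (the values below are invented for illustration), predicting the skill status of a single new scoreboard entry would look like this:
# hypothetical scoreboard entry for one player
new_player <- data.frame(kdRatio = 1.1, assists = 250, scorePerMinute = 180)
# predicted skill status: newbie, average, or pro
predict(rf_model, newdata = new_player)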
What to do next? Most of the time, when we want to be the best of the best, we need to improve and develop ourselves further. Every esports player and even the esports team manager can use this skill status prediction result as the basis and evaluation material to improve in-game performance and be ready to face real competition later.
Furthermore, here is an example of how the author used the final model and combined it with other visuals in an analytics dashboard. The author made a simple dashboard using Shiny, an R package for building interactive web applications straight from R, and deployed the app on shinyapps.io. The dashboard contains several features for analyzing the player's overall gameplay; below is a screenshot of those features with an explanation of each:
Performance Analysis Dashboard Feature
- A skill status prediction feature with input fields for the date, kdRatio, assists, and scorePerMinute scores. This feature utilizes the model to predict the skill status based on the kdRatio, assists, and scorePerMinute values that have been submitted. Its default output is newbie, since the defaults of the kdRatio, assists, and scorePerMinute values are zero, which the model predicts as newbie. Two buttons can be used in this feature: the check button, which only makes the prediction and displays the output at the very top, and the save button, which makes the prediction and saves it to the main data along with the submitted date, kdRatio, assists, and scorePerMinute values.
- A feature with a date-range input. The default value of the end date is today's date from the Sys.Date function, while the default value of the start date is one week before the end date. (Warning: If an error appears, it may be caused by an input error or unavailable data. Please re-check your input and data.)
- A feature with a date input whose default value is today's date, also from the Sys.Date function. (Warning: If an error appears, it may be caused by an input error or unavailable data. Please re-check your input and data.)

To try the dashboard go to this link, or if you want to replicate the dashboard go to this GitHub repository.