1 Introduction

Esports, short for electronic sports, is a form of sport played on electronic devices such as phones, PCs, or game consoles. In esports, a team of players typically plays the same game to compete with other teams and tries to win.

Esports began in 1972, when five students from Stanford University competed in an event called the Intergalactic Spacewar Olympics, playing the game Spacewar. Today, esports and its enthusiasts keep growing in every part of the world, and many competitions are held at scales ranging from local to international.

In Indonesia, where the author of this article is from, esports has been recognized as a sport under the Indonesian National Sports Committee (KONI) and is under the auspices of the Executive Board of Esports Indonesia (PB ESI). Based on statistics gathered from esportsearnings.com, Indonesia is ranked 23rd out of 161 countries based on the prize money its esports players have won in competitions. Every country wants its teams to win each competition and to be the number one country with the best players in every game. The main key to achieving that is the skill of its players.

In this article, the author wants to use machine learning to build a model that can help measure esports players’ skills. The model will report a player’s skill status based on their in-game performance compared with other players’ data.

1.1 Import Library

# load library
library(tidyverse)
library(FactoMineR)
library(factoextra)
library(ggplot2)
library(GGally)
library(ggiraphExtra)
library(gridExtra)
library(rsample) # package for initial split
library(class) # package for using `knn()` function
library(caret)
library(randomForest) # package to build random forest model
library(nnet) # package for computing multinomial logistic regression

2 Import Data

For this project, the author used static data taken from kaggle.com containing the behavior of more than 1,000 players of the Call of Duty: Modern Warfare video game. In the code below, we read the data and inspect the first six rows.

cod <- read.csv("data_input/cod.csv")
head(cod)
##                     name wins kills  kdRatio killstreak level losses prestige
## 1        RggRt45#4697369    0     0 0.000000          0     1      0        0
## 2     JohniceRex#9176033    0     0 0.000000          0     1      0      110
## 3 bootybootykill#1892064    0    66 1.031250          0     9      0      110
## 4          JNaCo#5244172    3     2 0.400000          0     1      0        0
## 5  gomezyayo_007#6596687    0     2 0.200000          0     1      0      110
## 6   Brxndoon7-LK#4002715  684 27011 1.066743         18   177     10      110
##    hits timePlayed headshots averageTime gamesPlayed assists misses      xp
## 1     0          0         0    0.000000           0       0      0       0
## 2     0          7         0    7.000000           0       0      0     700
## 3     0         32        16   32.000000           0       1      0   48300
## 4     0          3         0    3.000000           0       0      0    1150
## 5     0          5         1    5.000000           0       0      0    1000
## 6 98332       1366      5113    2.323129         588    6063 305319 3932335
##   scorePerMinute  shots deaths
## 1          0.000      0      0
## 2          0.000      0     16
## 3          0.000      0     64
## 4          0.000      0      5
## 5          0.000      0     10
## 6        255.672 403651  25321

Variable explanation:

  • name: the name of each player
  • wins: number of matches the player has won
  • kills: number of kills the player made across all matches
  • kdRatio: kill/death ratio; for example, a player with 10 kills and 5 deaths has a KD ratio of 2. A KD ratio of 1 means the player got killed exactly as many times as they eliminated their opponents
  • killstreak: the highest number of enemy players killed without dying
  • level: the player’s in-game level
  • losses: total number of matches lost
  • prestige: an optional mode that players can enter after progressing to the maximum level of 55
  • hits: number of times the player damaged another player
  • timePlayed: the time each player has spent playing Call of Duty, in hours
  • headshots: number of times the player hit others with headshots
  • averageTime: average time played per match (approximately timePlayed divided by gamesPlayed, judging from the values above)
  • gamesPlayed: number of multiplayer matches the player has played
  • assists: number of times the player damaged an enemy that a teammate then killed
  • misses: number of shots that missed
  • xp: Experience Points (XP), a numerical quantity exclusive to multiplayer that dictates a player’s level and progress in the game
  • scorePerMinute: a measure of how many points a player gains per unit of time
  • shots: number of shots the player fired
  • deaths: number of times the player got killed in the game

3 Data Wrangling

There are other ways, besides looking at the first six rows, to get to know our data. The code below is one of them:

glimpse(cod)
## Rows: 1,558
## Columns: 19
## $ name           <chr> "RggRt45#4697369", "JohniceRex#9176033", "bootybootykil…
## $ wins           <int> 0, 0, 0, 3, 0, 684, 4, 186, 741, 26, 0, 0, 188, 0, 15, …
## $ kills          <int> 0, 0, 66, 2, 2, 27011, 162, 1898, 21803, 349, 0, 26, 19…
## $ kdRatio        <dbl> 0.0000000, 0.0000000, 1.0312500, 0.4000000, 0.2000000, …
## $ killstreak     <int> 0, 0, 0, 0, 0, 18, 4, 13, 26, 7, 0, 0, 22, 0, 7, 10, 17…
## $ level          <int> 1, 1, 9, 1, 1, 177, 6, 37, 185, 12, 1, 6, 53, 1, 5, 40,…
## $ losses         <int> 0, 0, 0, 0, 0, 10, 2, 7, 29, 4, 0, 0, 4, 0, 4, 9, 11, 1…
## $ prestige       <int> 0, 110, 110, 0, 110, 110, 0, 2, 111, 0, 0, 110, 57, 0, …
## $ hits           <int> 0, 0, 0, 0, 0, 98332, 568, 5111, 81361, 996, 0, 0, 3333…
## $ timePlayed     <int> 0, 7, 32, 3, 5, 1366, 8, 550, 2442, 44, 0, 37, 409, 1, …
## $ headshots      <int> 0, 0, 16, 0, 1, 5113, 35, 485, 3894, 40, 0, 3, 536, 0, …
## $ averageTime    <dbl> 0.000000, 7.000000, 32.000000, 3.000000, 5.000000, 2.32…
## $ gamesPlayed    <int> 0, 0, 0, 0, 0, 588, 4, 150, 864, 15, 0, 0, 25, 0, 6, 12…
## $ assists        <int> 0, 0, 1, 0, 0, 6063, 68, 488, 4029, 138, 0, 4, 150, 0, …
## $ misses         <int> 0, 0, 0, 0, 0, 305319, 4836, 39978, 327230, 4844, 0, 0,…
## $ xp             <int> 0, 700, 48300, 1150, 1000, 3932335, 24485, 458269, 4269…
## $ scorePerMinute <dbl> 0.00000, 0.00000, 0.00000, 0.00000, 0.00000, 255.67204,…
## $ shots          <int> 0, 0, 0, 0, 0, 403651, 5404, 45089, 408591, 5840, 0, 0,…
## $ deaths         <int> 0, 16, 64, 5, 10, 25321, 256, 3332, 21032, 786, 0, 77, …

We learn that the data has 1,558 rows and 19 columns or variables. The output also lists every column along with a sample of its values and its data type, such as <chr> for character or string, <int> for integer, and <dbl> for decimal (double) values.

The final model will be used to predict player performance from the scoreboard shown right after a match ends, and not all of the variables in our cod data are provided there. So we will only use the ones that are common and easy to read straight off the scoreboard.

Call of Duty: Modern Warfare Team Deathmatch Scoreboard

For this article, the author will use the variables that appear both on the scoreboard in the image above and in the main data: kdRatio, assists, and scorePerMinute. We don’t use the kills and deaths information since it is already summarized in kdRatio, the kill/death ratio, while wins and losses aren’t included since they result from team performance, not individual skill. These variables can be changed later if the developer updates or extends the information on the scoreboard.

used_var <- c("kdRatio", "assists", "scorePerMinute")

cod <- cod %>% 
  select(all_of(used_var))

4 Exploratory Data Analysis (EDA)

We have filtered the data down to the variables provided on the scoreboard. Next, we need to explore the data and get to know its contents better. First, we need to make sure the data doesn’t contain missing values in any variable, which the code below checks.

colSums(is.na(cod))
##        kdRatio        assists scorePerMinute 
##              0              0              0

Since every variable shows a count of zero, the data has no missing values. The next step is to inspect the distribution of the data. One way to do that is by looking at the summary statistics of each variable. The next code will help us with that task.

summary(cod)
##     kdRatio          assists        scorePerMinute  
##  Min.   :0.0000   Min.   :    0.0   Min.   :  0.00  
##  1st Qu.:0.2614   1st Qu.:    0.0   1st Qu.:  0.00  
##  Median :0.7328   Median :   36.5   Median : 56.79  
##  Mean   :0.6371   Mean   :  685.8   Mean   :107.87  
##  3rd Qu.:0.9553   3rd Qu.:  609.8   3rd Qu.:221.65  
##  Max.   :3.0000   Max.   :14531.0   Max.   :413.80

Here are the insights we can get from the above summary statistics:

  • From the minimum and maximum values, we can see that these variables don’t share the same range or scale: kdRatio is in the ones, assists reaches the tens of thousands, while scorePerMinute is in the hundreds.
  • Up to the 3rd quartile, i.e. for 75% of the data, assists ranges from tens to hundreds, but its maximum value is in the tens of thousands, which is a bit odd and needs further exploration.

From the second insight, we know that assists contains some values that are very large compared to the others. Such a value is commonly known as an outlier: an observation that we can presume lies an abnormal distance from the other values. We need to get rid of these values to prevent bias in our analysis. There are several techniques we can use to check whether the data contains outliers.

4.1 Outliers Detection

4.1.1 Boxplot

The first technique the author will use is boxplot visualization. Below is an example of how to build boxplots with the ggplot2 library.

theme <- theme_minimal() +
  theme(
    axis.title.y = element_blank(),
    axis.text.x = element_blank())

# boxplot for kdRatio column
boxplot_1 <- ggplot(data = cod, aes(y = kdRatio)) +
  geom_boxplot() + labs(x = "kdRatio") + theme

# boxplot for assists column
boxplot_2 <- ggplot(data = cod, aes(y = assists)) +
  geom_boxplot() + labs(x = "assists") + theme

# boxplot for scorePerMinute column
boxplot_3 <- ggplot(data = cod, aes(y = scorePerMinute)) +
  geom_boxplot() + labs(x = "scorePerMinute") + theme

# combine all boxplot in one plot
grid.arrange(boxplot_1, boxplot_2, boxplot_3,
             ncol = 3)

A boxplot conveys the same information as the summary statistics, but most importantly, it also provides something the summary can’t: a visual sign of outliers. In the boxplots above, for example, we can see many black points. Each black point represents one observation of our data and marks a potential outlier, specifically in the kdRatio and assists columns. Because scorePerMinute has no black points in its boxplot, there are no outliers for that variable.
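
If we want a quick numeric cross-check of what the boxplots flag, base R’s boxplot.stats() returns those points in its out element; by default it uses the same 1.5 × IQR rule as geom_boxplot. A minimal sketch:

# count the points each boxplot would draw as outliers (1.5 * IQR rule)
length(boxplot.stats(cod$kdRatio)$out)
length(boxplot.stats(cod$assists)$out)
length(boxplot.stats(cod$scorePerMinute)$out)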

We want to know exactly which rows of our data contain outliers because we will remove them later. To do this, we perform conditional subsetting using the quartile values of the variables kdRatio and assists. In statistics, an outlier lies above Q3 + 1.5*IQR, called the upper bound, or below Q1 – 1.5*IQR, called the lower bound. So here are our next steps:

  1. Calculate the upper and lower bounds of the variables kdRatio and assists.
  2. Find the rows in which the kdRatio or assists value falls below the lower bound or above the upper bound.
# calculate upper and lower bound of kdRatio
q1_kdRatio <- quantile(cod$kdRatio, probs = 0.25)
q3_kdRatio <- quantile(cod$kdRatio, probs = 0.75)
iqr_kdRatio <- q3_kdRatio - q1_kdRatio

upper_kdRatio <- q3_kdRatio + 1.5 * iqr_kdRatio
lower_kdRatio <- q1_kdRatio - 1.5 * iqr_kdRatio

# calculate upper and lower bound of assists
q1_assists <- quantile(cod$assists, probs = 0.25)
q3_assists <- quantile(cod$assists, probs = 0.75)
iqr_assists <- q3_assists - q1_assists

upper_assists <- q3_assists + 1.5 * iqr_assists
lower_assists <- q1_assists - 1.5 * iqr_assists

# get row containing outliers using conditional subsetting
outliers_index <- which((cod$kdRatio < lower_kdRatio) | 
      (cod$kdRatio > upper_kdRatio) |
      (cod$assists < lower_assists) |
      (cod$assists > upper_assists))

head(outliers_index)
## [1]  6  9 18 40 48 65

From the above step, we get numbers saved in an object called outliers_index. These numbers are the indexes of the rows in our data that contain outliers.

4.1.2 Principal Component Analysis

The second technique the author will use is Principal Component Analysis (PCA). PCA is an algorithm commonly used for dimensionality reduction in machine learning, but here we will use another of its features: outlier detection. Here are the steps to search for outliers in the data using PCA:

  1. Build the PCA object using our data.
  2. Visualize the PCA object to see if there are any outliers in the data.
# build PCA object
pca_cod <- PCA(X = cod,
               scale.unit = T,
               graph = F)

# visualize PCA
plot.PCA(x = pca_cod,
         choix = "ind",
         select = "contrib5")

The visualization of the PCA object above shows the five outermost outliers in our data across all variables. That number is set with the select parameter of the plot.PCA function: we only need to change the value written after the word contrib, and because it is set to contrib5, the plot shows only the indexes of the five outermost outliers.

There are some weaknesses in how PCA gives us information about outliers. It only shows the row indexes of the number we define in contrib, and when we want to show more of them, the labels overlap and become hard to read. Besides, if we want to collect those indexes, we need to type them manually, since the plot offers no function to extract them and speed up the process.
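
One possible workaround, not part of the original workflow: FactoMineR’s PCA result stores each observation’s distance to the center of the point cloud in its ind$dist element (an assumption worth verifying on your version of the package), so the outermost rows can be ranked programmatically instead of typed by hand. Note that plot.PCA’s contrib criterion ranks by contribution to the components, so the two rankings may not coincide exactly.

# rank observations by their distance from the center of the PCA cloud
head(order(pca_cod$ind$dist, decreasing = TRUE), 5)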

Here, the author only uses PCA to check whether the indexes of the five outermost outliers from this method are also contained in the outliers_index object from the boxplot method.

# build a function to check whether a value exists in outliers_index
fun <- function(a){
  a %in% outliers_index
}

# the indexes from the PCA plot need to be typed manually
sapply(list(1096, 236, 136, 1314, 1121), FUN = fun)
## [1] TRUE TRUE TRUE TRUE TRUE

The above step checks whether the five outermost indexes shown in the PCA visualization are also in the outliers_index object. It checks each index and returns TRUE five times, which means all of the indexes are contained in outliers_index. In other words, these rows are considered outliers by both the PCA and the boxplot method, so we can have some confidence in the results of the two outlier-detection methods.

5 Data Preprocessing

After going through the outlier-detection process, we have the list of row indexes considered to contain outliers. We can use this information in the next step, where we exclude these rows from the main data. First, the code below tells us how many rows are considered outliers:

length(outliers_index)
## [1] 220

The output of 220 means 220 rows are considered outliers. Next, we exclude all of those rows from our data using the code below, saving the result in a new variable called cod_clean. The data initially has 1558 rows; after the outliers are removed, 1338 rows remain.

# remove outliers from initial data
cod_clean <- cod[-outliers_index,]

# check for dimension of the initial data
dim(cod) # initial data
## [1] 1558    3
dim(cod_clean) # data after removing outliers
## [1] 1338    3

Remember that the final model of this project will be used to analyze a player’s skill status based on their in-game performance, which means it should be a supervised model, since it predicts a target: in this case, the skill status. Looking at the main data, there is no information about this skill status, so we need to create it first before the model can make predictions.

For the next step, we first use an unsupervised method, clustering, to create clusters or categories from our data, which has no target. Later, the clusters resulting from the clustering process will be relabeled with the appropriate skill statuses and used as the target for the final supervised model. Here, we will see an implementation that combines unsupervised and supervised methods. In this article, the unsupervised model selected for the clustering step is k-means.

6 Clustering

K-means is a clustering model that groups data based on the distances between data points. The important thing about distance calculation is that it is sensitive to variables with different ranges or units, so it is better to scale the data before clustering. After scaling, the variables will be in the same range, or at least not far from one another. In this article, the author uses the Z-score method, which standardizes the data so that most values fall roughly between -3 and 3. The code below does this:

# scale data
cod_clean_scale <- scale(cod_clean)

# check the range of variables in the data using summary statistic
summary(cod_clean_scale)
##     kdRatio           assists         scorePerMinute   
##  Min.   :-1.3811   Min.   :-0.56049   Min.   :-0.8016  
##  1st Qu.:-0.9787   1st Qu.:-0.56049   1st Qu.:-0.8016  
##  Median : 0.1043   Median :-0.52393   Median :-0.6785  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.8165   3rd Qu.: 0.04914   3rd Qu.: 0.9269  
##  Max.   : 3.1612   Max.   : 3.72589   Max.   : 2.7656

The insight from the scaling process above is that the variables are now in the same range, or at least not far from one another. But we must remember that the scaled values are no longer the same as the original ones, which means we can’t interpret the data in this form. The code below shows the first six rows of the scaled data:

head(cod_clean_scale)
##      kdRatio    assists scorePerMinute
## 1 -1.3810916 -0.5604909     -0.8016452
## 2 -1.3810916 -0.5604909     -0.8016452
## 3  1.1084000 -0.5576783     -0.8016452
## 4 -0.4154706 -0.5604909     -0.8016452
## 5 -0.8982811 -0.5604909     -0.8016452
## 7  0.1465510 -0.3692350      1.4871587
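
As a side note on how scale() works: it computes the z-score (x - mean) / sd for each column and stores the column means and standard deviations as attributes on the result, so the scaled values can always be mapped back to the original units. A small sketch:

# scale() keeps each column's mean and sd as attributes
centers <- attr(cod_clean_scale, "scaled:center")
sds <- attr(cod_clean_scale, "scaled:scale")

# reverse the z-score, x = z * sd + mean; compare with head(cod_clean)
head(sweep(sweep(cod_clean_scale, 2, sds, "*"), 2, centers, "+"))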

6.1 Choosing Optimum K

In k-means, we need to define the k value used for clustering. k indicates the number of clusters or groups the model will try to find in the data. After clustering, there will be k clusters, where data in different clusters has different characteristics while data within one cluster shares the same characteristics. There are several ways to decide what k value to use:

  • Based on business needs. If we already know how many clusters or groups we want to make from the data, we can directly use that value for the k-means clustering.
  • Using the elbow method. A more statistical way is the elbow method, a graphical approach to finding the optimal k value for k-means clustering.

For this article, the author will use the first way to choose k: since there will be three skill statuses, pro, average, and newbie, we will use 3 for the k value. Even so, we can check whether the statistically optimal value matches by using the code below; don’t forget to use the scaled data:

fviz_nbclust(x = cod_clean_scale,
             FUNcluster = kmeans,
             method = "wss")

From the graphic above, we look at the value of k at the elbow, i.e. the point after which the decline is no longer significant. By the elbow method, the optimal k value would be 4.
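
For intuition, this is roughly what fviz_nbclust(method = "wss") computes under the hood: the total within-cluster sum of squares for a range of k values, with the elbow at the point where the curve stops dropping sharply. A minimal manual sketch:

# total within-cluster sum of squares for k = 1..10
set.seed(123)
wss <- sapply(1:10, function(k) {
  kmeans(cod_clean_scale, centers = k, nstart = 10)$tot.withinss
})
plot(1:10, wss, type = "b",
     xlab = "number of clusters k",
     ylab = "total within-cluster sum of squares")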

Since the author already decided to use 3 for the k value based on business needs, we won’t proceed with the k value from the elbow method. Below is the code for k-means clustering on the scaled data, with the k value defined in the centers parameter set to 3:

RNGkind(sample.kind = "Rounding")
set.seed(123)

cod_kmeans <- kmeans(x = cod_clean_scale,
                    centers = 3)

6.2 Cluster Profiling

The result of the clustering process is saved in an object called cod_kmeans. To get the cluster of each row or observation, we can access the cluster element of cod_kmeans. We save this cluster information directly to a new column called cluster_result in our data using the code below:

cod_clean$cluster_result <- cod_kmeans$cluster

head(cod_clean)
##     kdRatio assists scorePerMinute cluster_result
## 1 0.0000000       0            0.0              1
## 2 0.0000000       0            0.0              1
## 3 1.0312500       1            0.0              2
## 4 0.4000000       0            0.0              1
## 5 0.2000000       0            0.0              1
## 7 0.6328125      68          265.5              2

Note that the cluster information is saved in the unscaled, original data. Why? As mentioned earlier, once we scale the data we can no longer interpret its values. Because we need the values in their original form when interpreting the further analysis, we attach the clustering result to the unscaled data; the scaling was done only for the needs of the clustering process.

Next, we look at the mean value of each variable per cluster to summarize the clusters, which helps us determine the characteristics of the data in each one.

cod_clean %>% 
  group_by(cluster_result) %>% 
  summarise_all(mean)
## # A tibble: 3 × 4
##   cluster_result kdRatio assists scorePerMinute
##            <int>   <dbl>   <dbl>          <dbl>
## 1              1   0.246    12.1           7.64
## 2              2   0.837   145.          167.  
## 3              3   0.979   979.          187.

Now that we have each variable’s mean for each cluster, we can conclude that, in general:

The values of every variable in the 3rd cluster are the largest of the three clusters, the 1st cluster’s are the smallest, and the 2nd cluster sits in the middle.

From this conclusion, the author will rename each cluster to the skill status matching its variable means: the 1st cluster will be renamed newbie, the 2nd cluster average, and the 3rd cluster pro.

7 Classification

Since the data now has a target variable, cluster_result, we can proceed to the next step: using a supervised method to build a model that can classify new data into its skill status. In this article, the author will build two kinds of supervised models:

  1. A simple logistic regression model
  2. A robust random forest model

But before starting to build the models, we need to make sure the data is ready by going through the preprocessing and exploratory steps.

7.1 Data Wrangling

The step we haven’t done yet is renaming the clusters based on the cluster profiling step. Next, we rename each cluster: the 1st cluster becomes newbie, the 2nd becomes average, and the 3rd becomes pro. We do this with the code below:

cod_clean <- cod_clean %>% 
  mutate(cluster_result = as.factor(cluster_result))

# change the name of each cluster
levels(cod_clean$cluster_result) <- c("newbie", "average", "pro") # 1 = "newbie"; 2 = "average"; 3 = "pro"

head(cod_clean)
##     kdRatio assists scorePerMinute cluster_result
## 1 0.0000000       0            0.0         newbie
## 2 0.0000000       0            0.0         newbie
## 3 1.0312500       1            0.0        average
## 4 0.4000000       0            0.0         newbie
## 5 0.2000000       0            0.0         newbie
## 7 0.6328125      68          265.5        average

7.2 Exploratory Data Analysis

We have completed the renaming step for the target variable. But before we use the data to build the models, we will explore it once more, as we did before. We need to do this because the previous exploration was carried out on data without a target variable; now that we have that information, we can do further analysis.

A common thing to do in the exploratory step is to check the correlation between the variables in the data. Correlation analysis is worthwhile because it tells us about the relationship between one variable and another. One easy way to do correlation analysis is through visualization, and below is the code for that:

ggcorr(cod_clean, label = TRUE)

The correlation value lies between -1 and 1: the closer to 1, the stronger the positive correlation between two variables; the closer to -1, the stronger the negative correlation; and a value at or near zero means no correlation. Here is what we can take from the visualization above:

Each variable has a correlation value between 0.4 and 0.5 with the others, which means they all have a positive, though not particularly strong, correlation with one another.

A positive correlation between two variables means that when one of them increases, so does the other. A negative correlation means that when one increases, the other decreases. Note that there is no correlation information for our cluster variable. This is because the correlation analysis above is only available for continuous values, and since the cluster information is discrete, it can’t be calculated with this method.
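
If we want the exact coefficients rather than reading them off the heatmap, base R’s cor() on the numeric columns gives the same values:

# exact correlation values behind the ggcorr heatmap
cod_clean %>% 
  select(where(is.numeric)) %>% 
  cor() %>% 
  round(2)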

The last exploratory step before model building is to check the proportion of our target variable, cluster_result. Why do we need to do this? Checking the target proportion is important, especially for classification, because a model trained on data with an unbalanced target will tend to classify new data into the majority class. We can check the proportion with the code below:

prop.table(table(cod_clean$cluster_result))
## 
##    newbie   average       pro 
## 0.4828102 0.3751868 0.1420030

From the result above, we can see that the proportions differ, especially pro’s, which is small compared to the other two. When faced with an unbalanced target like this, there are two common ways to balance the proportions:

  • Upsampling balances the target proportions by duplicating observations of the minority classes until their proportions match the majority class. This method is good to use when the amount of data is small.
  • Downsampling balances the target proportions by removing observations of the majority class until its proportion matches the minority class. This method is usually good when the data, especially in the majority class, is large.

Our current data has 1338 rows. Since the author doesn’t want to shrink the data to balance its target, this article will use the upsampling method to balance the target proportions, sketched below.
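
To make the idea concrete before we apply it, here is a minimal, hypothetical sketch of what upsampling does; caret’s upSample(), used in the next section, handles this for us:

# hypothetical helper: resample every class (with replacement) up to the
# size of the majority class, so all classes end up with equal counts
upsample_sketch <- function(df, target) {
  n_max <- max(table(df[[target]]))
  groups <- split(df, df[[target]])
  do.call(rbind, lapply(groups, function(g) {
    g[sample(nrow(g), size = n_max, replace = TRUE), ]
  }))
}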

7.3 Cross Validation

Before the upsampling process, we first need to split the data into train and test sets. This gives the model data to learn from (the train data) while reserving other data to evaluate the model’s performance (the test data). The author chooses a proportion of 75% train and 25% test for this article. Since we want the split to ensure that every unique target value, in this case every skill status, is represented in both the train and test data, we use the stratified method in the code below:

RNGkind(sample.kind = "Rounding")
set.seed(100)

index <- initial_split(cod_clean, prop = 0.75, strata = cluster_result)
train <- training(index) # train data
test <- testing(index) # test data

Now that the data is split into train and test, the next step is the upsampling process. One thing to note is that we only balance the target in the train data. Why? Because real-life data is unlikely to have a balanced target, since observations are gathered as they come. So in the code below, we apply the upsampling method to the train data, for training purposes only:

RNGkind(sample.kind = "Rounding")
set.seed(100)

# upsampling method
train <- upSample(x = train %>% select(-cluster_result),
                  y = train$cluster_result,
                  yname = "cluster_result")
# check proportion of target variable
prop.table(table(train$cluster_result))
## 
##    newbie   average       pro 
## 0.3333333 0.3333333 0.3333333

Above, we can see that the proportions of the unique target values are equal after the upsampling process. Now our data is ready for the next step: building the classification models.

7.4 Build Supervised Models

7.4.1 Logistic Regression

The first model we will build is a multinomial logistic regression model; it is multinomial because the target variable has more than two unique values. Logistic regression is a supervised machine learning model that works by calculating the probability that a new observation falls into each unique value of the target variable. The final prediction for that observation is the unique value with the highest probability.

7.4.1.1 Model Fitting

For the first step, we build the model on the train data. We don’t want to rerun the training every time we need this model in the future, because some models take a long time to train. So the author chooses to save the trained model as a file in RDS format; later, whenever we need the model, we only have to read the saved file. The code below builds the multinomial logistic regression model and saves it to an RDS file:

# # ONLY RUNS ONCE, please return to the comment form again after a run
# 
# # build logistic regression
# logreg_model <- multinom(formula = cluster_result ~., data = train)
# 
# # save model in the local folder
# saveRDS(logreg_model, "logreg_model.RDS")

Note that the code above is commented out to prevent it from being evaluated when the chunk is executed, because we only need to run it once to build the model. Since the model is already saved as a file, we don’t need to run it again. To read the saved model, we can use the code below:

# read model
logreg_model <- readRDS("logreg_model.RDS")
logreg_model
## Call:
## multinom(formula = cluster_result ~ ., data = train)
## 
## Coefficients:
##         (Intercept)  kdRatio   assists scorePerMinute
## average   -116.8678 133.8923 0.0401555      0.4762593
## pro       -446.4343 194.0271 0.5156745      0.5177757
## 
## Residual Deviance: 5.096176 
## AIC: 21.09618

The model summary above shows a lot of information. Here are some highlights:

  • Residual Deviance shows how well the model fits the target variable during the learning process. The value ranges from 0 upward; there is no fixed threshold for what counts as good, but the closer to 0, the better.
  • AIC, the abbreviation of Akaike Information Criterion, represents how much of the information in our data the model fails to capture. We want this value to be as small as possible, meaning little information is lost (see the check after this list).
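
The two numbers are also related: for this model, AIC equals the residual deviance plus twice the number of estimated parameters. We can verify this from the model object, assuming nnet stores the deviance and effective degrees of freedom in its deviance and edf elements, which it does in current versions:

# 8 parameters are estimated: an intercept plus 3 predictor coefficients
# for each of the 2 non-reference classes (average and pro)
logreg_model$deviance + 2 * logreg_model$edf # should match the AIC above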

7.4.1.2 Predict

Next, after the logistic regression model has been created, we use it to predict the target variable of the test data and use the prediction results to evaluate the model’s performance. The prediction is made with the code below, which saves the result in an object called pred_result_logreg:

pred_result_logreg <- predict(object = logreg_model, newdata = test)
head(pred_result_logreg)
## [1] average average average newbie  average average
## Levels: newbie average pro
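
These labels come from picking the class with the highest probability, as described earlier. To inspect the underlying probabilities themselves, predict() for multinom models accepts type = "probs":

# per-class probabilities behind the predicted labels above
prob_logreg <- predict(object = logreg_model, newdata = test, type = "probs")
head(round(prob_logreg, 4))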

7.4.1.3 Model Evaluation

Now that we have the prediction results for the test data, the last step is to evaluate the model with an appropriate measurement. The most common tool for evaluating classification models is the confusion matrix, which provides many metrics to choose from. To build it, we specify the predicted results along with the actual values of the target variable in the test data. Here is the code:

confusionMatrix(data = pred_result_logreg, reference = test$cluster_result, positive = "pro")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction newbie average pro
##    newbie     162       0   0
##    average      0     126   0
##    pro          0       0  48
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9891, 1)
##     No Information Rate : 0.4821     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: newbie Class: average Class: pro
## Sensitivity                 1.0000          1.000     1.0000
## Specificity                 1.0000          1.000     1.0000
## Pos Pred Value              1.0000          1.000     1.0000
## Neg Pred Value              1.0000          1.000     1.0000
## Prevalence                  0.4821          0.375     0.1429
## Detection Rate              0.4821          0.375     0.1429
## Detection Prevalence        0.4821          0.375     0.1429
## Balanced Accuracy           1.0000          1.000     1.0000

The confusion matrix above produces a lot of measurements, but we don’t need to look at all of them, only a few. Commonly used metrics are Accuracy, Sensitivity, Specificity, and Pos Pred Value; the author will mostly look at the accuracy score. Our logistic regression model has an accuracy of 1, meaning it predicts the target correctly 100% of the time, which is a remarkably good performance. The model performs so well that the other metrics also score 1.
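
As a sanity check, accuracy is simply the share of predictions that match the actual labels, so we can also compute it directly:

# proportion of correct predictions; should equal the Accuracy above
mean(pred_result_logreg == test$cluster_result)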

7.4.2 Random Forest

The second model is the random forest. Random forest is a supervised machine learning model that is robust, meaning it tends to perform well with small errors when predicting new data. It is a strong model because it combines the outputs of multiple decision trees, using majority voting over the trees’ outputs to make its final prediction; we will illustrate this voting after the model is built below.

The next steps are the same as for the logistic regression model: build the model on the train data and save it to an RDS file, use it to predict the test data, and finally evaluate the model based on its performance on the test data.

7.4.2.1 Model Fitting

# # ONLY RUNS ONCE, please return to the comment form again after a run
# 
# set.seed(417)
# control <- trainControl(method = "repeatedcv", number = 5, repeats = 3)
# 
# # build Random Forest (note: caret's train() takes the control object
# # through its trControl argument)
# rf_model <- train(form = cluster_result ~ ., data = train, method = "rf",
#                   trControl = control)
# 
# # save model in the local folder
# saveRDS(rf_model, "rf_model.RDS")

# read model
rf_model <- readRDS("rf_model.RDS")

In the random forest model, we can see how much the model relies on each predictor variable, in this case assists, scorePerMinute, and kdRatio, to make accurate target predictions. To do this, we can use the code below:

# variable importance
varImp(rf_model) %>% plot()

The variable importance analysis above shows that assists is the most influential variable in predicting a player’s skill status, followed by scorePerMinute and kdRatio. This result only tells us how much each variable is used by the model; it doesn’t mean we must remove the least influential variable, in this case kdRatio. If these variables turn out to be important in the context of the game, we can still keep them in the data.
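
Returning to the earlier note on majority voting: we can ask every individual tree for its vote on one test observation and tabulate the results. A sketch, assuming caret stores the underlying randomForest object in rf_model$finalModel, which it does by default for method = "rf":

# each tree votes for a class; the forest predicts the majority class
votes <- predict(rf_model$finalModel, newdata = test[1, ], predict.all = TRUE)
table(votes$individual)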

7.4.2.2 Predict

pred_result_rf <- predict(object = rf_model, newdata = test)
head(pred_result_rf)
## [1] average average average newbie  average average
## Levels: newbie average pro

7.4.2.3 Model Evaluation

confusionMatrix(data = pred_result_rf, reference = test$cluster_result, positive = "pro")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction newbie average pro
##    newbie     160       2   0
##    average      2     124   0
##    pro          0       0  48
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9881          
##                  95% CI : (0.9698, 0.9967)
##     No Information Rate : 0.4821          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9804          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: newbie Class: average Class: pro
## Sensitivity                 0.9877         0.9841     1.0000
## Specificity                 0.9885         0.9905     1.0000
## Pos Pred Value              0.9877         0.9841     1.0000
## Neg Pred Value              0.9885         0.9905     1.0000
## Prevalence                  0.4821         0.3750     0.1429
## Detection Rate              0.4762         0.3690     0.1429
## Detection Prevalence        0.4821         0.3750     0.1429
## Balanced Accuracy           0.9881         0.9873     1.0000

Based on the model’s performance predicting the target variable in the test data, its accuracy score is a high 98.81%, while the other per-class metrics are also high, between 98% and 100%.

8 Conclusion

8.1 Conclusion

Overall, based on the confusion matrix of each model, both perform very well at predicting the target, the skill status, exceeding 95% accuracy, with the logistic regression model even scoring 100%. These models give their final prediction as one of the skill statuses, newbie, average, or pro, as an evaluation of where a player’s skill stands based on their performance at the end of every game.

Why were both accuracies so high? Normally, we should be cautious when we get a very high accuracy score, especially given that our data has only three predictor variables and one of the models, logistic regression, is simple. But here, both models could reach high accuracy because we first used an unsupervised method to build the target variable. The targets were constructed from a pattern the unsupervised model found in the initial data, so the resulting labels already follow a clear pattern in the data. From there, our supervised models, logistic regression and random forest, could easily pick up that same pattern during the learning process and deliver good predictive performance.

Since both of our final models are reliable enough for predicting a player’s skill status, we could use either of them in the future. Here, the author chooses the random forest model: even though its accuracy is slightly lower than the logistic regression’s, it is more robust, so it should remain reliable across data with different types and characteristics.

8.2 Suggestion

What to do next? Most of the time, when we want to be the best of the best, we need to keep improving and developing ourselves. Every esports player, and even an esports team manager, can use this skill status prediction as a basis and as evaluation material to improve in-game performance and be ready to face real competition later.

Furthermore, here is an example of how the author used the final model, combined with other visuals, in an analytics dashboard. The author built a simple dashboard using Shiny, an R package for building interactive web applications straight from R, and deployed it on shinyapps.io. The dashboard contains several features for analyzing a player’s overall gameplay; below is a screenshot of those features with an explanation of each:

Performance Analysis Dashboard Feature

  • Feature 1: Button to start the user guide on how to use each dashboard feature.
  • Feature 2: Insert the gameplay date, kdRatio, assists, and scorePerMinute scores in their fields. This feature uses the model to predict the skill status based on the submitted kdRatio, assists, and scorePerMinute values. The default result is newbie, since the default kdRatio, assists, and scorePerMinute values are zero and the model predicts those as newbie. Two buttons can be used in this feature: the check button only makes the prediction and displays the output at the very top, while the save button makes the prediction and saves it to the main data along with the submitted date, kdRatio, assists, and scorePerMinute values.
  • Feature 3: The performance score based on the data of the latest game the player played. (Warning: if an error appears, it may be caused by unavailable data. Please re-check your data.)
  • Feature 4: The overall gameplay average score for each variable. This radar chart is built from the maximum and minimum values of each variable: the outer circle represents the maximum value, the small middle circle represents the minimum value, and the plotted points are the average value for each variable (here, the author uses the median so that outliers don’t affect the result). (Warning: if an error appears, it may be caused by unavailable data. Please re-check your data.)
  • Feature 5: Insert the start-to-end date range at the top of the plot. This line plot visualizes the player’s performance score over the submitted date range. The default end date comes from the Sys.Date function, while the default start date is one week before the end date. (Warning: if an error appears, it may be caused by an input error or unavailable data. Please re-check your input and data.)
  • Feature 6: Insert the month and year values in the input. The calendar uses this input to show on which days the player played the game and on which they didn’t: grey means the player played that day, white means they didn’t. The default value for this visualization is the month and year from the Sys.Date function. (Warning: if an error appears, it may be caused by an input error or unavailable data. Please re-check your input and data.)

To try the dashboard, go to this link, or if you want to replicate it, go to this GitHub repository.