PUBG is a successful online multiplayer game set on an island where one hundred people play a battle royale. Each player fights, hides, escapes, pursues, ambushes, stalks, and uses whatever means necessary to survive. Only the last player standing wins. Fight or escape? Hide or attack? Join forces or play alone? To identify the best strategy, a set of player statistics has been collected and analyzed. The final aim is to build a predictive model that targets the player ranking.
PUBG is organized in matches. In each match, up to 100 people participate as singles or in teams. At the end of the match, players (singles or groups) are ranked based on how many other groups are still alive when they are eliminated. During the match, players can find objects such as weapons, vehicles, and medical kits, which they can use to kill and injure other players (including team members), to drive away from dangerous areas, or to head right into the middle of them.
The data set, available here, consists of 29 statistics collected for 4446966 players. The statistics are:
Given the large number of statistics and observations, summary plots of the match and single-player features are shown.
The data contain statistics for 47965 matches. For each match the following are reported: the type [16 types], the duration, the number of participating players, the number of participating groups, the worst placement ranked in the match, and the number of winners in the match.
Although the number of matches is not evenly distributed among types, as shown in the table below, each match type shows high variability in the statistic distributions [see Figure 1], and overall the types have comparable distributions. For this reason the match type is disregarded in the following analysis.
##
## crashfpp crashtpp duo duo-fpp
## 73 5 3356 10620
## flarefpp flaretpp normal-duo normal-duo-fpp
## 9 29 12 158
## normal-solo normal-solo-fpp normal-squad normal-squad-fpp
## 23 96 16 358
## solo solo-fpp squad squad-fpp
## 2297 5679 6658 18576
On average, a match lasts 1579 +/- 264 s [~26 +/- 4 min] and has 93 +/- 12 players divided into 42 +/- 23 groups. At the end of a match, the worst placement is on average 44 +/- 24 and the number of winners is 3 +/- 1.
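The per-match averages above can be obtained by grouping the player rows by match. The following is a minimal sketch, not the code used for this analysis: the field names `matchId`, `groupId`, and `matchDuration` are assumptions about the data layout, and the four rows are toy values.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_matches(rows):
    """Aggregate player rows into per-match statistics.

    Each row is a dict with the (assumed) keys matchId, groupId, matchDuration.
    Returns (mean, standard deviation) pairs computed across matches.
    """
    players = defaultdict(int)   # matchId -> number of players
    groups = defaultdict(set)    # matchId -> distinct group ids
    duration = {}                # matchId -> duration in seconds
    for row in rows:
        match = row["matchId"]
        players[match] += 1
        groups[match].add(row["groupId"])
        duration[match] = row["matchDuration"]

    def mean_sd(values):
        values = list(values)
        return mean(values), (stdev(values) if len(values) > 1 else 0.0)

    return {
        "duration_s": mean_sd(duration.values()),
        "players": mean_sd(players.values()),
        "groups": mean_sd(len(g) for g in groups.values()),
    }


# Toy rows for illustration only, not taken from the data set.
rows = [
    {"matchId": "a", "groupId": "g1", "matchDuration": 1500},
    {"matchId": "a", "groupId": "g1", "matchDuration": 1500},
    {"matchId": "a", "groupId": "g2", "matchDuration": 1500},
    {"matchId": "b", "groupId": "g3", "matchDuration": 1700},
]
summary = summarize_matches(rows)
```

Each entry of `summary` is a (mean, standard deviation) pair, which corresponds to the "value +/- spread" notation used in the text.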
For better visualization, the over 4 million players have been divided into 4 classes depending on their placement at the end of the match, i.e. the winPlacePerc, which ranges between 0 and 1. The classes represent winning, good, medium, and poor performances rather than a uniform partition of the players. Winning players score 1 and are labeled as “top”; players scoring between 0.9 and 0.99 are labeled as “head”, while scores between 0.2 and 0.9 (excluded) are labeled as “heart” and scores between 0 and 0.2 (excluded) are labeled as “tail” (see table below).
## score number percentage
## tail [0-0.2) 1107104 0.25
## heart [0.2-0.9) 2870630 0.65
## head [0.9-0.99] 341658 0.08
## top 1 127573 0.03
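The class boundaries in the table can be written as a small labeling function. This is a sketch of the rule as stated in the text; the function name `placement_class` is hypothetical, and any score in [0.9, 1) is treated as “head”, which assumes no score falls strictly between 0.99 and 1.

```python
from collections import Counter


def placement_class(win_place_perc):
    """Map a winPlacePerc score in [0, 1] to one of the four classes."""
    if win_place_perc >= 1.0:
        return "top"     # winners only
    if win_place_perc >= 0.9:
        return "head"    # [0.9, 0.99]
    if win_place_perc >= 0.2:
        return "heart"   # [0.2, 0.9)
    return "tail"        # [0, 0.2)


# Toy scores for illustration only.
scores = [0.0, 0.15, 0.2, 0.5, 0.89, 0.9, 0.99, 1.0]
counts = Counter(placement_class(s) for s in scores)
shares = {label: n / len(scores) for label, n in counts.items()}
```

Dividing each count by the total number of players gives the fractions reported in the last column of the table.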
For each class, the value distributions of the player statistics have been compared. The “top” class spans a larger range in almost every statistic, especially in weaponsAcquired, damageDealt, walkDistance, killPoints, longestKill, and winPoints, as shown in Figure 2 and Figure 3.
Now, let’s look at the pairwise correlations between statistics, i.e. their mutual interactions, divided per class. Figure 4 shows that in all classes there is a high correlation (~0.9) between winPoints and killPoints. This is not surprising, since the two statistics are calculated using an external ranking (rankPoints). As expected, the number of knocked enemies, DBNOs, and the damage dealt, damageDealt, are strongly correlated (~0.7) in all classes. The statistics damageDealt, kills, and headShotKills correlate with each other, with a decreasing trend from the “top” to the “tail” class. This trend likely reflects that the winning players (“top”) manage to inflict more damage and do so more effectively. The opposite trend in correlation values is found for the killStreaks, kills, DBNOs, and longestKill statistics.
For all classes, killPlace anti-correlates (~-0.7) with the kills, killStreaks, and damageDealt statistics, which means that the lower (i.e. better) a player's kill ranking in the match, the higher the damage dealt and the number of kills achieved.
The ultimate goal of this analysis is to use the player statistics to predict the final placement after the match, i.e. winPlacePerc. This statistic, which is the target, is expressed as a percentile placement among 100 possible final scores and ranges from 0 to 1, where 0 represents the loser and 1 represents the winner.
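To make the correlation values concrete, the sketch below computes a correlation coefficient between two statistics. The text does not say which coefficient was used, so Pearson's is an assumption here, and the `kills` and `kill_place` values are made-up toy numbers rather than rows from the data set.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Toy values: players with more kills get a lower (better) kill ranking.
kills = [0, 1, 2, 5, 8]
kill_place = [90, 60, 40, 10, 2]
r = pearson(kills, kill_place)  # negative: the two statistics anti-correlate
```

A value of `r` close to -1 means that the two statistics move in opposite directions, which is the pattern described for killPlace.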
In order to create a balanced model, the winPlacePerc statistic has been divided into 11 classes, each representing a subset of the final placement. Each class spans an interval of width 0.1: the first ranges from 0.00 (included) to 0.10 (excluded), and so on. The last two classes are slightly different: one covers 0.90 to 0.99 (both included), and the last class is reserved for the winners, i.e. players with a score of 1. One hundred thousand players have been randomly selected from each class to create a balanced training data set with a total of 1.1 million observations. Figure 5 shows the distribution of each class for the whole data set.
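The binning and balanced sampling described above could look as follows. This is a sketch under assumptions: the names `placement_bin` and `balanced_sample` are hypothetical, sampling is done without replacement, and the toy input uses 3 observations per class instead of one hundred thousand.

```python
import random
from collections import Counter, defaultdict


def placement_bin(win_place_perc):
    """Map winPlacePerc to a class index 0..10.

    Classes 0-8 cover [0.0, 0.1) ... [0.8, 0.9), class 9 covers [0.9, 1),
    and class 10 is reserved for winners (score exactly 1).
    """
    if win_place_perc >= 1.0:
        return 10
    return min(int(win_place_perc * 10), 9)


def balanced_sample(scores, per_class, seed=0):
    """Draw up to per_class scores from every class, without replacement."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for score in scores:
        by_bin[placement_bin(score)].append(score)
    sample = []
    for index in sorted(by_bin):
        members = by_bin[index]
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# Toy scores 0.00, 0.01, ..., 1.00 for illustration only.
scores = [i / 100 for i in range(101)]
sample = balanced_sample(scores, per_class=3)
class_counts = Counter(placement_bin(s) for s in sample)
```

In the toy input the winners' class holds a single score, so it contributes fewer observations than requested; in the full data set every class is large enough to supply the requested number.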
A logistic Generalized Linear Model (GLM) has been chosen as the predictive model to achieve a compromise between computational effort and accuracy. Briefly, a logistic approach treats the target variable as a binomial random variable, which in practice behaves like the winPlacePerc statistic, with “1” being the success and “0” the failure. The logistic model calculates the probability that an observation is successful; in this case, this probability is read as the player’s placement percentage at the end of the match.
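To illustrate how a logistic model maps features to a value between 0 and 1, here is a minimal sketch that fits a logistic model to a fractional target. It uses plain gradient descent on the binomial log-likelihood, which is a swapped-in technique chosen for brevity and not necessarily the fitting routine used for this analysis. The function names, learning rate, number of epochs, and toy data are all assumptions.

```python
from math import exp


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def predict(beta, features):
    """Predicted placement in (0, 1); beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], features)))


def fit_logistic(X, y, learning_rate=0.5, epochs=2000):
    """Fit coefficients by gradient descent on the binomial log-likelihood."""
    n, k = len(X), len(X[0])
    beta = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for features, target in zip(X, y):
            diff = predict(beta, features) - target
            grad[0] += diff
            for j, value in enumerate(features):
                grad[j + 1] += diff * value
        beta = [b - learning_rate * g / n for b, g in zip(beta, grad)]
    return beta


# Toy data: one feature scaled to [0, 1] and a fractional target.
X = [[0.0], [0.25], [0.5], [0.75], [1.0]]
y = [0.1, 0.3, 0.5, 0.7, 0.9]
beta = fit_logistic(X, y)
```

Because the output of `predict` always lies strictly between 0 and 1, it can be compared directly with winPlacePerc.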
A subset of weakly correlated features (|corr| < 0.5) has been chosen to build the GLM prediction model (DBNOs, kills, and killStreaks have been discarded). Categorical features, i.e. the ids of the player, the match, and the group, have also been discarded because they are not relevant to the prediction. Figure 6 shows the effect of a unit change of each feature on the final placement (y-axis). The size of each coefficient in the plot indicates the degree of uncertainty, i.e. the width of the confidence interval. For example, assists have the highest positive effect (~0.2) on the final placement per unit change, with medium uncertainty, while teamKills have a negative influence (-0.1) with higher uncertainty. In contrast, swimDistance and rideDistance have no influence at all, with a very low degree of uncertainty.
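One way to obtain a subset of weakly correlated features is a greedy filter over the pairwise correlations. The text does not describe the selection procedure, so the rule below is an assumption, and the correlation values are illustrative rather than measured; the name `drop_correlated` is hypothetical.

```python
def drop_correlated(features, corr, threshold=0.5):
    """Greedy filter over a list of feature names.

    A feature is kept only if its absolute correlation with every feature
    kept so far is below the threshold. Pairs missing from corr count as 0.
    """
    kept = []
    for name in features:
        if all(abs(corr.get(frozenset((name, other)), 0.0)) < threshold
               for other in kept):
            kept.append(name)
    return kept


# Illustrative correlations (assumed values), keyed by unordered pairs.
corr = {
    frozenset(("damageDealt", "DBNOs")): 0.7,
    frozenset(("damageDealt", "kills")): 0.8,
    frozenset(("damageDealt", "walkDistance")): 0.3,
}
features = ["damageDealt", "DBNOs", "kills", "walkDistance"]
selected = drop_correlated(features, corr)
```

With this rule the order of `features` decides which member of a correlated pair survives, so the list should be sorted by the priority given to each statistic.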
Figure 7 shows the diagnostics of the model, that is, how far each predicted (fitted) value lies from the corresponding observed value (this distance is called the residual). The red line represents the mean of the residuals, which is -0.68. Even if the problem is not strictly linear, the concentration of the residuals near zero shows that the model achieves reasonable results.
The in-sample error of the model, i.e. the error calculated on the training data set, is on average 0.1 +/- 0.092. The left panel of Figure 8 shows the error counts (expressed as a percentage of the maximum for better visualization): 50% of the errors are below the mean (blue line).
The out-of-sample error, i.e. the error calculated on data not used to build the model, is calculated on 50 subsamples of 10 thousand observations each. The overall average error is similar to the in-sample error: 0.1 +/- 0.088. The error counts are shown in the right panel of Figure 8: 50% of the errors are below the mean (blue line).
Finally, the test data set has almost 2 million observations. The mean error, calculated by Kaggle, is 0.10072.
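The subsampling scheme above can be sketched as follows. The error of a single prediction is assumed to be the absolute difference between the observed and the predicted winPlacePerc, and the reported mean and spread are assumed to be computed over the errors pooled from all subsamples; neither detail is stated in the text. The function name `subsample_error` and the toy values are hypothetical.

```python
import random
from statistics import mean, stdev


def subsample_error(y_true, y_pred, n_subsamples=50, size=10_000, seed=0):
    """Mean and standard deviation of absolute errors over random subsamples.

    Each subsample is drawn without replacement; the errors from all
    subsamples are pooled before computing the statistics.
    """
    rng = random.Random(seed)
    indices = range(len(y_true))
    pooled = []
    for _ in range(n_subsamples):
        chosen = rng.sample(indices, min(size, len(y_true)))
        pooled.extend(abs(y_true[i] - y_pred[i]) for i in chosen)
    return mean(pooled), stdev(pooled)


# Toy observed and predicted placements for illustration only.
y_true = [0.0, 0.5, 1.0, 0.25]
y_pred = [0.1, 0.4, 0.8, 0.25]
avg_error, spread = subsample_error(y_true, y_pred, n_subsamples=5, size=10)
```

Here the requested subsample size exceeds the number of toy observations, so every subsample contains all of them; with the real data each subsample would hold 10 thousand held-out observations.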