Exploratory analysis report for PUBG

Synopsis

PUBG is a successful multiplayer online game set on an island where one hundred people play a battle royale. Each player fights, hides, escapes, pursues, ambushes, stalks, and uses whatever means necessary to survive. Only the last player standing wins. Fight or escape? Hide or attack? Join forces or play alone? To decide the best strategy, a set of player statistics has been collected and analyzed. The final aim is to apply a predictive model that targets the player ranking.

Data Set

PUBG is organized in matches. In each match, up to 100 people participate as singles or teams. At the end of the match, players (singles or groups) are ranked based on how many other groups are still alive when they are eliminated. During the match, players can find objects such as weapons, vehicles, and medical kits that they can use to kill and injure other players (including team members), to drive away from dangerous areas, or to head right into the middle of them.

The data set, available here, consists of 29 statistics collected for 4446966 players. The statistics are:

Id: player’s Id
assists: number of enemies this player damaged that were killed by teammates
boosts: number of boost items used
heals: number of healing items used
revives: number of times this player revived teammates
damageDealt: total damage dealt. Note: self-inflicted damage is subtracted
DBNOs: number of enemies knocked down
killPlace: ranking in match by number of enemies killed
killPoints: kills-based external ranking of player [when rankPoints is not -1, a “0” should be treated as “None”]
killStreaks: max number of enemies killed in a short amount of time
kills: number of enemies killed
headshotKills: number of enemies killed with headshots
roadKills: number of kills while in a vehicle
teamKills: number of times this player killed a teammate
longestKill: longest distance between this player and a player killed at time of death
rideDistance: total distance traveled in vehicles measured in meters
swimDistance: total distance traveled by swimming measured in meters
vehicleDestroys: number of vehicles destroyed
walkDistance: total distance traveled on foot measured in meters
weaponsAcquired: number of weapons picked up
winPoints: win-based external ranking of player [when rankPoints is not -1, a “0” should be treated as “None”]
winPlacePerc: percentile winning placement, where 1 corresponds to 1st place and 0 to last place in the match. It is calculated from maxPlace [TARGET]
rankPoints: Elo-like ranking of player; it is inconsistent and is being deprecated in the API’s next version
groupId: ID of the group within a match. The same group of players will have a different ID in different matches
matchDuration: duration of match in seconds
matchId: ID to identify match
matchType: game mode such as: “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, “squad-fpp”, and other custom modes
numGroups: number of groups we have data for in the match
maxPlace: worst placement we have data for in the match

Given the large number of statistics and observations, summary plots are shown for match-level and player-level features.

Summary statistics

Match statistics

The data contain statistics for 47965 matches. For each match the following are reported: the type [16 types], the duration, the number of participating players, the number of participating groups, the worst placement ranked in the match, and the number of winners in the match.
Although the number of matches is not evenly distributed among types, as shown in the table below, each match type shows high variability in the statistic distributions [see Figure 1], and overall the distributions are comparable. For this reason the match type is disregarded in the following analysis.

## 
##         crashfpp         crashtpp              duo          duo-fpp 
##               73                5             3356            10620 
##         flarefpp         flaretpp       normal-duo   normal-duo-fpp 
##                9               29               12              158 
##      normal-solo  normal-solo-fpp     normal-squad normal-squad-fpp 
##               23               96               16              358 
##             solo         solo-fpp            squad        squad-fpp 
##             2297             5679             6658            18576

On average, a match lasts 1579 +/- 264 s [~26 +/- 4 min] and has 93 +/- 12 players divided into 42 +/- 23 groups. At the end of the match, on average, the worst placement is 44 +/- 24 and the number of winners is 3 +/- 1.
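The match-level figures above come from collapsing the per-player rows into one record per match. The sketch below shows one way this could be done with the standard library only. It assumes rows are plain dictionaries keyed by the column names of the data set, and that a "winner" is any player whose winPlacePerc equals 1; the helper names `summarize_matches` and `mean_sd` and the toy rows are hypothetical, not the code used for this report.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_matches(rows):
    """Collapse per-player rows into one summary record per match."""
    players = defaultdict(int)   # matchId -> number of player rows
    groups = defaultdict(set)    # matchId -> distinct groupIds
    winners = defaultdict(int)   # matchId -> players with winPlacePerc == 1
    info = {}                    # matchId -> (matchType, matchDuration, maxPlace)
    for row in rows:
        match = row["matchId"]
        players[match] += 1
        groups[match].add(row["groupId"])
        if row["winPlacePerc"] == 1:  # assumption: winners score exactly 1
            winners[match] += 1
        info[match] = (row["matchType"], row["matchDuration"], row["maxPlace"])
    return {
        match: {
            "matchType": info[match][0],
            "matchDuration": info[match][1],
            "maxPlace": info[match][2],
            "numPlayers": players[match],
            "numGroups": len(groups[match]),
            "numWinners": winners[match],
        }
        for match in info
    }


def mean_sd(values):
    """Return (mean, sample standard deviation); needs at least two values."""
    return mean(values), stdev(values)


# Hypothetical toy rows, only to show the expected input shape.
toy_rows = [
    {"matchId": "m1", "groupId": "g1", "matchType": "duo",
     "matchDuration": 1500, "maxPlace": 2, "winPlacePerc": 1.0},
    {"matchId": "m1", "groupId": "g1", "matchType": "duo",
     "matchDuration": 1500, "maxPlace": 2, "winPlacePerc": 1.0},
    {"matchId": "m1", "groupId": "g2", "matchType": "duo",
     "matchDuration": 1500, "maxPlace": 2, "winPlacePerc": 0.0},
    {"matchId": "m2", "groupId": "g3", "matchType": "solo",
     "matchDuration": 1700, "maxPlace": 2, "winPlacePerc": 1.0},
    {"matchId": "m2", "groupId": "g4", "matchType": "solo",
     "matchDuration": 1700, "maxPlace": 2, "winPlacePerc": 0.0},
]
summary = summarize_matches(toy_rows)
duration_mean, duration_sd = mean_sd([m["matchDuration"] for m in summary.values()])
```

The "+/-" values quoted in the text would then correspond to the second element returned by `mean_sd`, under the assumption that they are standard deviations.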

Personal statistics

For better visualization, the over 4 million players have been divided into 4 classes depending on their placement at the end of the match, i.e. the winPlacePerc, which spans between 0 and 1. Classes represent winning, good, medium, and poor performances rather than a uniform partition of the players. Winning players score 1 and are labeled as “top”; players scoring between 0.9 and 0.99 are labeled as “head”, scores between 0.2 and 0.9 (excluded) are labeled as “heart”, and scores between 0 and 0.2 (excluded) are labeled as “tail” (see table below).

##            score  number percentage
## tail     [0-0.2) 1107104       0.25
## heart  [0.2-0.9) 2870630       0.65
## head  [0.9-0.99]  341658       0.08
## top            1  127573       0.03
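The class assignment in the table can be written as a small lookup on winPlacePerc. This is a minimal sketch; the function names `performance_class` and `class_table` are hypothetical, and any score strictly between 0.99 and 1 (if one existed) would fall into “head” here, which is an assumption rather than something the report states.

```python
from collections import Counter


def performance_class(win_place_perc):
    """Map a winPlacePerc score to one of the four performance classes."""
    if win_place_perc == 1:
        return "top"      # winners only
    if win_place_perc >= 0.9:
        return "head"     # 0.9 up to (but not including) 1
    if win_place_perc >= 0.2:
        return "heart"    # [0.2, 0.9)
    return "tail"         # [0, 0.2)


def class_table(scores):
    """Return {class: (count, fraction of all scores)} in table order."""
    counts = Counter(performance_class(score) for score in scores)
    total = len(scores)
    return {
        name: (counts[name], counts[name] / total)
        for name in ("tail", "heart", "head", "top")
    }


# Hypothetical scores, only to show the call.
table = class_table([0.0, 0.5, 0.5, 1.0])
```

Note that the "percentage" column of the table holds fractions of the total (they sum to about 1), which is what `class_table` returns.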

For each class, the value distributions of the player statistics have been compared. The “top” class spans a larger range in almost every statistic, especially in weaponsAcquired, damageDealt, walkDistance, killPoints, longestKill, and winPoints, as shown in Figure 2 and Figure 3.

Histograms

Correlations

Now, let’s see the pair-wise correlation between statistics, i.e. their mutual interaction, divided per class. Figure 4 shows that in all classes there is a high correlation (~0.9) between winPoints and killPoints. This fact is not surprising since the two statistics are calculated using an external ranking (rankPoints). As expected, the number of knocked enemies, DBNOs, and the damage dealt, damageDealt, are strongly correlated (~0.7) for all classes. The statistics damageDealt, kills, and headshotKills correlate with each other with a decreasing trend from the “top” to the “tail” class. This trend likely reflects that the winning players (“top”) manage to inflict more damage and do so more effectively. The opposite trend in correlation values is found for the killStreaks, kills, DBNOs, and longestKill statistics.

For all classes, killPlace anti-correlates (~-0.7) with the kills, killStreaks, and damageDealt statistics: the lower (i.e. better) a player’s kill ranking in the match, the higher the damage dealt and the number of kills achieved.
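The per-class correlations discussed above can be reproduced with a plain Pearson coefficient computed separately for each class. The following is an illustrative sketch, assuming rows are dictionaries that already carry a class label under a hypothetical `"class"` key; it is not necessarily how Figure 4 was produced, and the toy values are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def correlation_by_class(rows, feature_a, feature_b, class_key="class"):
    """Compute the Pearson correlation of two features within each class."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[class_key], []).append(row)
    return {
        label: pearson([r[feature_a] for r in members],
                       [r[feature_b] for r in members])
        for label, members in grouped.items()
    }


# Hypothetical rows: more kills go with a lower (better) killPlace.
toy_rows = [
    {"class": "top", "kills": 5, "killPlace": 1},
    {"class": "top", "kills": 3, "killPlace": 2},
    {"class": "top", "kills": 1, "killPlace": 3},
]
by_class = correlation_by_class(toy_rows, "kills", "killPlace")
```

A negative value for kills against killPlace, as in the toy rows, is the anti-correlation described in the text.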

Predictive analysis

The ultimate goal of this analysis is to use the player statistics to predict the final placement after the match, i.e. winPlacePerc. This statistic, which is the target, is expressed as a percentile over the possible final placements (up to 100), where 0 represents the last place and 1 represents the winner.

In order to create a balanced model, the winPlacePerc statistic has been divided into 11 classes, each representing a subset of the final placement. Each class spans an interval of length 0.1: the first ranges from 0.00 (included) to 0.10 (excluded), and so on. The last two classes are slightly different: one covers 0.90 to 0.99 (both included), and the last class is reserved for the winners, i.e. players with a score of 1. One hundred thousand players have been randomly selected from each class to create a balanced training data set with a total of 1.1 million observations. Figure 5 shows the distributions of each class for the whole data set.
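The binning and balanced sampling described above can be sketched as follows. The names `placement_class` and `balanced_sample`, the seed, and the toy rows are hypothetical; the sketch assumes sampling without replacement and caps the draw at the class size when a class has fewer rows than requested.

```python
import random
from collections import defaultdict


def placement_class(win_place_perc):
    """Return a class index 0..10: ten 0.1-wide bins plus class 10 for winners."""
    if win_place_perc == 1:
        return 10
    # Bins 0..8 are [k/10, (k+1)/10); bin 9 holds everything from 0.9 up to 1.
    return min(int(win_place_perc * 10), 9)


def balanced_sample(rows, per_class, seed=0):
    """Draw up to per_class rows from each placement class, without replacement."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[placement_class(row["winPlacePerc"])].append(row)
    sample = []
    for label in sorted(by_class):
        members = by_class[label]
        sample.extend(rng.sample(members, min(per_class, len(members))))
    return sample


# Hypothetical rows: five players in each of three classes.
toy_rows = [{"winPlacePerc": score} for score in [0.05] * 5 + [0.55] * 5 + [1.0] * 5]
training = balanced_sample(toy_rows, per_class=3)
```

With `per_class=100_000` and all 11 classes populated, this would yield the 1.1 million training rows mentioned in the text.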

Predictive model

A logistic Generalized Linear Model (GLM) has been chosen as the predictive model to achieve a compromise between computational effort and accuracy. Briefly, a logistic approach treats the target variable as a binomial random variable that, in practice, behaves like the winPlacePerc statistic, with “1” being the success and “0” the failure. The logistic model calculates the probability that an observation is a success, which is read here as the player’s placement percentage at the end of the match.
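To make the logistic idea concrete, the sketch below fits a one-feature logistic model to a target that takes fractional values between 0 and 1, as winPlacePerc does. The report does not say how the GLM was fitted; plain gradient descent on the cross-entropy loss is swapped in here for brevity, and the function names, learning rate, and toy data are all hypothetical.

```python
from math import exp


def logistic(eta):
    """Inverse logit link: maps a linear predictor to a value in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))


def fit_logistic(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p = logistic(b0 + b1 * x) by gradient descent on cross-entropy.

    ys may be fractions in [0, 1]; the gradient with respect to the
    linear predictor is (p - y) in that case as well.
    """
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = logistic(b0 + b1 * x) - y
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# Hypothetical data: placement improves as the feature grows.
toy_x = [0, 1, 2, 3, 4]
toy_y = [0.1, 0.3, 0.5, 0.7, 0.9]
intercept, slope = fit_logistic(toy_x, toy_y)
predicted = [logistic(intercept + slope * x) for x in toy_x]
```

Every prediction stays strictly between 0 and 1, which is why the logistic form suits a target like winPlacePerc.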

A subset of weakly correlated features (|corr| < 0.5) has been chosen to build the GLM prediction model (DBNOs, kills, and killStreaks have been discarded). Categorical features, i.e. the IDs of the player, the match, and the group, have also been discarded because they are not relevant to the prediction. Figure 6 shows how much a unit change in each feature affects the final placement (y-axis). The size of each plotted coefficient indicates the degree of uncertainty, i.e. the width of the confidence interval. For example, assists have the highest positive effect (~0.2) per unit change with medium uncertainty, while teamKills have a negative influence (-0.1) with higher uncertainty. In contrast, swimDistance and rideDistance have no influence at all, with a very low degree of uncertainty.
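One simple way to obtain a weakly correlated feature subset is a greedy filter: walk through the features in order and keep one only if its absolute correlation with every feature already kept is below the threshold. This is a sketch of that idea, not necessarily the selection procedure used here; the function name `drop_correlated` is hypothetical and, apart from the ~0.7 between damageDealt and DBNOs quoted earlier, the correlation values are invented for illustration.

```python
def drop_correlated(features, corr, threshold=0.5):
    """Greedily keep features with |corr| < threshold against all kept ones.

    corr maps frozenset({a, b}) to the correlation of a and b;
    missing pairs are treated as uncorrelated (0.0).
    """
    kept = []
    for name in features:
        if all(abs(corr.get(frozenset((name, other)), 0.0)) < threshold
               for other in kept):
            kept.append(name)
    return kept


# Illustrative correlations (only the first value comes from the text).
toy_corr = {
    frozenset(("damageDealt", "DBNOs")): 0.7,
    frozenset(("damageDealt", "kills")): 0.8,        # hypothetical
    frozenset(("damageDealt", "killStreaks")): 0.6,  # hypothetical
    frozenset(("damageDealt", "heals")): 0.3,        # hypothetical
}
selected = drop_correlated(
    ["damageDealt", "DBNOs", "kills", "killStreaks", "heals"], toy_corr
)
```

The result depends on the order in which features are visited, so a different ordering could keep a different representative of each correlated group.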

Figure 7 shows the model diagnostics, that is, how far each predicted (fitted) value is from the theoretical model (this distance is called the residual). The red line represents the mean of the residuals, which is -0.68. Even though the problem is not strictly linear, the concentration of the residuals near zero shows that the model achieves reasonable results.

The in-sample error of the model, i.e. the error calculated on the training data set, is on average 0.1 +/- 0.092. The left panel in Figure 8 shows the error counts (expressed as a percentage of the maximum for a better visualization): 50% of the errors are under the mean (blue line).

The out-of-sample error, i.e. the error calculated on data not used to fit the model, is calculated on 50 subsamples of 10 thousand observations each. The overall average error is similar to the in-sample error: 0.1 +/- 0.088. The error counts are shown in the right panel of Figure 8: 50% of the errors are under the mean (blue line).
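The in-sample and out-of-sample errors can be computed along the following lines. The report does not define the error explicitly; this sketch assumes it is the absolute difference between predicted and actual winPlacePerc, summarized as mean +/- standard deviation, and that the subsamples are drawn without replacement. All function names and toy values are hypothetical.

```python
import random
from statistics import mean, stdev


def absolute_errors(predicted, actual):
    """Per-observation absolute error between prediction and target."""
    return [abs(p - a) for p, a in zip(predicted, actual)]


def mean_sd(values):
    """Return (mean, sample standard deviation); needs at least two values."""
    return mean(values), stdev(values)


def subsample_errors(predicted, actual, n_subsamples=50, size=10_000, seed=0):
    """Mean absolute error on repeated random subsamples of held-out data."""
    rng = random.Random(seed)
    errors = absolute_errors(predicted, actual)
    size = min(size, len(errors))  # cap at the available number of rows
    return [mean(rng.sample(errors, size)) for _ in range(n_subsamples)]


# Hypothetical held-out predictions and targets.
toy_pred = [0.50, 0.20, 0.90, 0.10]
toy_true = [0.40, 0.40, 1.00, 0.10]
overall_mean, overall_sd = mean_sd(absolute_errors(toy_pred, toy_true))
per_subsample = subsample_errors(toy_pred, toy_true, n_subsamples=50, size=3)
```

Averaging `per_subsample` gives an overall out-of-sample figure that can be compared with the in-sample one.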

Test data set

Finally, the test data set has almost 2 million observations. The mean error, calculated by Kaggle, is 0.10072 (link).