PlayerUnknown’s Battlegrounds (PUBG) is a battle royale game that pits 100 players against each other in a struggle for survival. Gather supplies and outwit your opponents to become the last person standing.
In this report, I’ll share some insights from a Kaggle competition, PUBG Finish Placement Prediction, where participants are asked to predict each player’s final placement. You can find the competition page on Kaggle. The goal of this project is to provide insights on the best strategies to win in PUBG. Sounds like fun! The report focuses on exploratory analysis and aims to get a good sense of how the variables are distributed and correlated, as well as the profiles of players from winning teams, to inform feature engineering for modeling.
The data set contains anonymized player data from 65,000 games, split into training and testing sets. The training set has the target variable winPlacePerc, a percentile winning placement where 1 corresponds to 1st place and 0 to last place in the match, plus 28 other features that cover three aspects of information: final in-game stats such as kills and weaponsAcquired; initial player ratings such as winPoints, a win-based external ranking of the player; and match configurations such as matchType and numGroups.
First of all, there are 4,446,965 player records from 47,965 games in the training set, with one missing winPlacePerc value in a game that has only a single player record. There also appear to be 68 games in which every team is ranked in last place and 74 games without a first place. These games are removed since they are not valid and account for only a tiny share of the data.
library(tidyverse)
# import training set
train_og <- read.csv('E:/KaggleCompetitions/PUBG/train_V2.csv', header = TRUE)
# count number of games
nlevels(train_og$matchId)
# check and remove missing values
colSums(is.na(train_og))
train_og <- train_og[!is.na(train_og$winPlacePerc),]
# check and cleanup games with invalid winplace
train_og <- train_og %>%
  group_by(matchId) %>%
  filter(sum(winPlacePerc) != 0) %>%
  filter(max(winPlacePerc) == 1)
Next, we are interested in the distribution of the values of each variable. The analysis below summarizes some of the interesting ones.
# distribution and levels of categorical variable, matchType
levels(train_og$matchType)
## [1] "crashfpp" "crashtpp" "duo"
## [4] "duo-fpp" "flarefpp" "flaretpp"
## [7] "normal-duo" "normal-duo-fpp" "normal-solo"
## [10] "normal-solo-fpp" "normal-squad" "normal-squad-fpp"
## [13] "solo" "solo-fpp" "squad"
## [16] "squad-fpp"
From the output above, we see that matchType differentiates between first-person perspective (fpp) and third-person perspective (tpp). It’s better to separate these two attributes, so I removed the perspective indicator from matchType and created a new variable to keep that information. Also, the normal-duo and normal-solo modes have too few samples to be treated separately from duo and solo, so the normal prefix was removed as well.
# create a new variable for view
train_og <- train_og %>%
  mutate(view = ifelse(grepl('fpp', matchType), 'fpp', 'tpp'))
# cleanup matchtype
train_og$matchType <- gsub('fpp|tpp|-|normal', '', train_og$matchType)
# frequency of matchtype
prop.table(table(train_og$matchType))
##
## crash duo flare solo squad
## 0.0014977337 0.2960075656 0.0007250219 0.1618447724 0.5399249064
The frequency table above shows that squad is the most popular mode, accounting for 54% of the records; 30% are duo and 16% are solo games. This makes a lot of sense because solo play is almost a different game, and most people prefer to play in teams. The crash and flare modes come from events and custom matches and account for less than 1% combined. The test set shows a very similar distribution.
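As a quick, hedged check of that last claim, the same cleanup can be applied to the test set; the file path below simply mirrors the one used for the training set and is an assumption.
# rough check on the test set (path assumed to mirror the training file)
test_og <- read.csv('E:/KaggleCompetitions/PUBG/test_V2.csv', header = TRUE)
prop.table(table(gsub('fpp|tpp|-|normal', '', test_og$matchType)))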
# distribution of some numeric data
summary(train_og$kills)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.9251 1.0000 72.0000
It was not really surprising to see that most players get 0 kills in a game. If you manage to kill one enemy, you are already doing better than about 75% of players. But there appear to be experts with as many as 72 kills in a single game, which I don’t actually think is possible. Let’s take a deeper look.
# check strangely high kills game info
train_og %>%
  group_by(matchId, matchType) %>%
  summarise(maxkills = max(kills), numplys = n(), numgrp = max(numGroups)) %>%
  filter(maxkills > numplys) %>%
  arrange(desc(maxkills)) %>%
  head(5)
## # A tibble: 5 x 5
## # Groups: matchId [5]
## matchId matchType maxkills numplys numgrp
## <fct> <chr> <dbl> <int> <dbl>
## 1 6680c7c3d17d48 squad 72 47 15
## 2 08e4c9e6c033e2 solo 66 18 12
## 3 f900de1ec39fa5 solo 65 11 11
## 4 17dea22cefe62a duo 57 28 12
## 5 cfa2775c9ef944 solo 56 41 22
# remove records with more than 35 individual kills and drop games where kills
# exceed the number of players we have data for (potentially incomplete games)
train_og <- train_og %>%
  filter(kills <= 35) %>%
  group_by(matchId) %>%
  mutate(numplys = n(), flag = ifelse(kills > numplys, 1, 0)) %>%
  filter(flag == 0) %>%
  select(-c(numplys, flag))
As shown in the sample table above, many of these strangely high kill counts come from games where the maximum kills exceed the total number of players in the game. After some research, it appears that 40-50 kills in a game is already a record-breaking stat, so these outliers are most likely bad data. I removed the 452 records from games with more than 35 individual kills, as well as the games where kills exceed the number of players we have data for.
# distribution of other numeric data
summary(train_og$teamKills)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.02384 0.00000 10.00000
tapply(train_og$matchDuration, train_og$matchType, FUN = summary)
## $crash
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 829.0 875.0 901.0 893.1 914.0 924.0
##
## $duo
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 152 1374 1448 1595 1864 2204
##
## $flare
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1746 1820 1878 1875 1914 2031
##
## $solo
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 133 1384 1455 1601 1875 2237
##
## $squad
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 246 1357 1420 1566 1836 2226
Another interesting variable to look at is teamKills. On average, a player kills roughly one teammate every 40 games or so, which is not too bad. Again, I don’t really think the maximum of 10 team kills is possible in this game. Given that most players have 0 team kills, this variable may not have much influence on the prediction.
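A quick sketch to back that up, counting the share of players with zero team kills:
# share of players with no team kills at all
mean(train_og$teamKills == 0)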
Because the first and only game I played took about 30 minutes, I was interested in how long a game lasts. The matchDuration-by-matchType table shows that the three standard match types do not differ much, with a median around 24 minutes. This seems like a fairly slow-paced game compared with titles like League of Legends, given that most players won’t get a single kill by the end of a match. It started to make sense when a coworker who is a big fan of the game told me he usually plays it while hanging out with friends.
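For reference, the medians above are in seconds; a one-liner converts them to minutes (assuming seconds is indeed the unit of matchDuration):
# median match duration per mode, converted from seconds to minutes
round(tapply(train_og$matchDuration, train_og$matchType, median) / 60, 1)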
train_og %>%
  filter(walkDistance < 50000) %>%
  group_by(groupId) %>%
  summarise(teamWalkDis = sum(walkDistance), winplace = max(winPlacePerc)) %>%
  ggplot(aes(x = winplace, y = teamWalkDis)) +
  geom_point() +
  xlab('Percentile winning placement') +
  ylab('Walk distance') +
  ggtitle('Total Team Walk Distance vs Winning Placement') +
  theme(plot.title = element_text(size = 22))
Now let’s look at a really interesting variable in the data set, walkDistance. Given that this is a last-player-standing game played on a shrinking map, how far you are able to walk pretty much tells how long you stay alive, and thus your placement. The graph above shows the relationship between a team’s total walk distance and winPlacePerc, and there is clearly a positive correlation between the two.
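As a rough numeric confirmation of the trend in the plot, a simple player-level Pearson correlation can be computed; this is only a sketch, and the modeling step should rely on proper team-level aggregation instead.
# player-level correlation between walk distance and winning placement
cor(train_og$walkDistance, train_og$winPlacePerc, use = "complete.obs")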
After this initial analysis, most variables’ distributions look right-skewed, meaning the values concentrate on the lower side. This makes sense given the nature of the game, but it also makes it hard to see differences across placements. Since I’m interested in strategies to win the game, it would be interesting to see whether there is a pattern of role assignments within a winning team, like the Tanks, Fighters, Assassins and ADCs in League of Legends. So, I picked all the player records from the 1st-place team in every game and ran a k-means clustering analysis on them. The variables I used are kills, damageDealt, assists, heals, boosts, weaponsAcquired and revives, which reflect the most essential events and performance in a game.
# select the players with 1st place
temp <- train_og %>%
  filter(winPlacePerc == 1) %>%
  .[, c(1, 4, 5, 6, 9, 12, 20, 27)]  # Id, assists, boosts, damageDealt, heals, kills, revives, weaponsAcquired
# scale the variables to avoid bias
temp$assists <- scale(temp$assists,center = T,scale = T)
temp$boosts <- scale(temp$boosts,center = T,scale = T)
temp$damageDealt <- scale(temp$damageDealt,center = T,scale = T)
temp$kills <- scale(temp$kills,center = T,scale = T)
temp$revives <- scale(temp$revives,center = T,scale = T)
temp$weaponsAcquired <- scale(temp$weaponsAcquired,center = T,scale = T)
temp$heals <- scale(temp$heals,center = T,scale = T)
# decide how many clusters (k) we need
k.max <- 10
wss <- sapply(1:k.max,
              function(k){kmeans(temp[, 2:8], k, nstart = 50, iter.max = 15)$tot.withinss})
plot(1:k.max, wss,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")
Because these stats have different scales, which would bias the clustering, the first step is to scale the variables using their means and standard deviations. Then we need to decide on the optimal number of clusters. Here I used the elbow method: the optimal k is the point where adding one more cluster no longer reduces the total within-cluster sum of squared errors (SSE) by much. The plot above shows k = 3 as that point, the smallest k at which the SSE has already dropped sharply. And now comes the most exciting part.
# clustering
k.means.fit <- kmeans(temp[,2:8], 3)
# turn the cluster labels into a factor so the facet panels appear in a fixed order
temp <- temp[, 2:8] %>%
  cbind(cluster = k.means.fit$cluster)
temp$cluster <- factor(temp$cluster, levels = c(1, 2, 3))
# plotting
temp %>%
  gather(vars, val, 1:7) %>%
  filter(val < 15) %>%
  ggplot(aes(x = factor(vars, levels = c('damageDealt','kills','weaponsAcquired','revives','assists','boosts','heals')), y = val)) +
  geom_boxplot() +
  facet_wrap(~ cluster, nrow = 1) +
  labs(title = "Player profiles of win teams", x = "Variables", y = "Scaled values") +
  theme(plot.title = element_text(size = 30))
The plot above shows players’ basic game stats by cluster. This is a good way to visualize and understand the clusters, especially when we have more than two dimensions. Not surprisingly, kills and damageDealt have very similar distributions within each cluster. The group with the highest kills and damage dealt clearly contains the expert players of each team. They also do not use significantly more boosts and heals, which suggests they are good at taking cover while taking down enemies first.
The group with the second-highest kills looks like an intermediate/supporting group: they are able to kill or hurt some enemies and support their teammates, with a significantly higher average number of revives. But we can also see they use the most healing items, which suggests they are not as good at covering themselves and take a lot of damage from enemies, since healing items can only be used when your health is below 75%.
On the other hand, the last group is clearly the free-riders, or players who were unlucky enough to get killed early in the game. They have very low kills and don’t pick up many weapons, possibly because they don’t fire much. This group also uses significantly fewer healing and boosting items, which suggests they are either beginners who avoid engaging in fights or players who get killed too fast.
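To back up the boxplot reading with numbers, one option is to compute the mean scaled value of each stat per cluster; the snippet below is only a sketch and assumes temp still holds the seven scaled variables plus the cluster label.
# mean scaled stats per cluster (columns coerced to plain numeric vectors first)
temp_num <- as.data.frame(lapply(temp[, 1:7], as.numeric))
aggregate(temp_num, by = list(cluster = temp$cluster), FUN = mean)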
Last, since our goal is to predict placements, it is important to see how the variables are correlated with each other; it can also give us ideas on how to aggregate and select variables to avoid collinearity. Since we have 27 variables, variable clustering is a good way to visualize and understand the correlations between them. It is a hierarchical cluster analysis using the Euclidean distance based on the absolute correlation between variables.
# assign numeric values to categorical variables
train_og <- train_og %>%
  mutate(matchTypenum = ifelse(matchType == 'solo', 1,
                        ifelse(matchType == 'duo', 2,
                        ifelse(matchType == 'squad', 3,
                        ifelse(matchType == 'crash', 4, 5)))))
train_og <- train_og %>%
  mutate(viewnum = ifelse(view == 'fpp', 1, 3))
# plotting
plot(hclust(dist(abs(cor(na.omit(train_og[,c(4:15,17:29,31,32)]))))), xlab=' ')
The plot above shows the results of the clustering. The branches show the hierarchical clusters of variables, and the height at which two branches merge indicates how relatively close they are within the same parent branch.
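If you prefer the variable groups as an explicit list rather than reading them off the dendrogram, the same tree can be cut at a chosen number of groups; the choice of six below is arbitrary and only for illustration.
# optional: cut the same tree into groups to list the variable clusters explicitly
hc <- hclust(dist(abs(cor(na.omit(train_og[, c(4:15, 17:29, 31, 32)])))))
sort(cutree(hc, k = 6))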
The first thing you may notice is that it confirms walk distance is indeed the variable most correlated with winPlacePerc, as it is a very good indicator of how long a player stays alive in a game. Also, the highly correlated variables that fall into the same branch largely reflect the same aspect of a player’s performance. For instance, kills and damageDealt are in the same branch, with killStreaks and killPlace in their parent branches. matchType, numGroups and maxPlace are in the same branch, since they are by definition highly correlated with each other. Besides, the external ranking data, including killPoints, winPoints and rankPoints, are grouped together and most correlated with each other.
Boosts is the second most correlated variable with winPlacePerc, and sits at a deeper hierarchical level than heals. This makes sense given what we saw in the winning players’ profiles: experts tend to use more boosts than heals. weaponsAcquired is the third most important, which is also consistent with the profiles, where killers tend to pick up more weapons; to some extent it also indicates how long a player stays alive. Kill-related variables appear to be the next most important attributes, followed by the branch with assists and revives.
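As a hedged sanity check on this ordering, the raw correlations of a few key stats with winPlacePerc can be computed directly:
# raw correlations with winPlacePerc for a few key stats
train_og %>%
  ungroup() %>%
  summarise(walkDistance = cor(walkDistance, winPlacePerc),
            boosts = cor(boosts, winPlacePerc),
            weaponsAcquired = cor(weaponsAcquired, winPlacePerc),
            kills = cor(kills, winPlacePerc))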
The exploratory analysis gives us a good grasp of what the data looks like and how the variables are correlated with each other, with walk distance being the biggest indicator of winning placement. It was also interesting to see that players from winning teams cluster into three groups: killers, supporters and free-riders. This information can help us decide how to aggregate player data to the team level and how to combine clustered variables for prediction.
Finally, I believe a player’s location could be a valuable predictor of placement as well, because being able to find good cover, and which part of the map a player initially lands on, can be really important strategically. This information is not included in the data set, but I think it could be reflected by adding new variables such as the player density of a player’s initial landing spot or how long a player spends at each altitude level of the map.