Dota 2 is a multiplayer online battle arena video game developed and published by Valve. In each game, every player selects an in-game character, known as a hero. This dataset records the overall performance of the heroes. The data was extracted manually from dotabuff.com, a Dota 2 statistics provider, on 13/10/2019 into an .xlsx file. The data covers 16 attributes for all 117 heroes available at that time. Both univariate and multivariate EDA are carried out, and all visualisations are produced with ggplot2.
library(openxlsx)
df<-read.xlsx('D:/Dataset/datdota/dotabuff.xlsx',sheet = 1, startRow = 1, colNames = T)
head(df,3)
## Hero Matches.Played Pick.Rate Win.Rate KDA.Ratio Kills Deaths
## 1 Abaddon 115975606 0.0553 0.5642 2.91 5.75 6.60
## 2 Alchemist 190340443 0.0907 0.4650 2.23 5.89 7.53
## 3 Ancient Apparition 128959122 0.0614 0.5026 2.69 5.54 7.82
## Assists Match.Duration Gold./.Minute Experience./.Minute
## 1 13.45 1.687500 379.89 483.94
## 2 10.90 1.708333 611.70 540.32
## 3 15.50 1.768056 315.50 381.71
## Last.Hits./.10.Minutes Denies./.10.Minutes Hero.Damage./.Minute
## 1 25.46 0.91 328.41
## 2 52.01 0.84 366.16
## 3 12.00 0.93 290.71
## Tower.Damage./.Minute Hero.Healing./.Minute
## 1 33.77 45.62
## 2 57.82 0.77
## 3 10.05 7.42
Before the analysis, the data is checked for missing or empty cells. This can be done using the colSums() function, which counts the flagged cells in each column.
colSums(is.na(df) | df=='')
## Hero Matches.Played Pick.Rate
## 0 0 0
## Win.Rate KDA.Ratio Kills
## 0 0 0
## Deaths Assists Match.Duration
## 0 0 0
## Gold./.Minute Experience./.Minute Last.Hits./.10.Minutes
## 0 0 0
## Denies./.10.Minutes Hero.Damage./.Minute Tower.Damage./.Minute
## 0 0 0
## Hero.Healing./.Minute
## 0
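Since the real data has no missing cells, the check above returns only zeros. A minimal sketch on a made-up data frame (the `toy` object and its columns are assumptions for illustration, not part of the Dota 2 data) shows what the same expression reports when problems do exist:

```r
# Toy data frame (made up for illustration, not the Dota 2 data)
toy <- data.frame(
  score = c(1.5, NA, 3.2),   # one missing value
  label = c("a", "", "c"),   # one empty string
  stringsAsFactors = FALSE
)

# is.na() flags NA cells and == "" flags empty strings;
# colSums() then counts the flagged cells in each column
problem_cells <- colSums(is.na(toy) | toy == "")
problem_cells
```

Each column should report one problem cell here: the NA in `score` and the empty string in `label`.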
R automatically replaces spaces in column names with ‘.’ on import, which can make some names awkward to read and type later. Take ‘Experience / Minute’ as an example: after importing, the column is named ‘Experience./.Minute’. Columns can be renamed in various ways. If we only plan to rename one or two columns, this is easily done with the names() function.
new_df <- data.frame(index=1:3,oldName=rnorm(3))
head(new_df)
## index oldName
## 1 1 -1.3059777
## 2 2 0.1066659
## 3 3 1.3087354
names(new_df)[2] <- c('newName')
head(new_df)
## index newName
## 1 1 -1.3059777
## 2 2 0.1066659
## 3 3 1.3087354
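For a larger dataset, the renaming can be done with the stringr package. Replacing the period is tricky because ‘.’ is a regular-expression metacharacter that matches any single character. Using ‘.’ on its own as the pattern would therefore remove every character and leave blank column names, hence the square brackets [] are needed to match a literal period.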
library(stringr)
colname_old <- colnames(df)
colname_new <- str_replace_all(colname_old,pattern = '[.]',replacement = '')
colname_new <- str_replace_all(colname_new,pattern = '/',replacement = 'Per')
colnames(df) <- colname_new
head(df,3)
## Hero MatchesPlayed PickRate WinRate KDARatio Kills Deaths
## 1 Abaddon 115975606 0.0553 0.5642 2.91 5.75 6.60
## 2 Alchemist 190340443 0.0907 0.4650 2.23 5.89 7.53
## 3 Ancient Apparition 128959122 0.0614 0.5026 2.69 5.54 7.82
## Assists MatchDuration GoldPerMinute ExperiencePerMinute LastHitsPer10Minutes
## 1 13.45 1.687500 379.89 483.94 25.46
## 2 10.90 1.708333 611.70 540.32 52.01
## 3 15.50 1.768056 315.50 381.71 12.00
## DeniesPer10Minutes HeroDamagePerMinute TowerDamagePerMinute
## 1 0.91 328.41 33.77
## 2 0.84 366.16 57.82
## 3 0.93 290.71 10.05
## HeroHealingPerMinute
## 1 45.62
## 2 0.77
## 3 7.42
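The difference between a wildcard period and a literal period can be checked on a single column name. The sketch below uses base R's gsub() so that it runs without extra packages; str_replace_all() also treats its pattern as a regular expression by default, so the same patterns apply there. The variable names are made up for illustration.

```r
old_name <- "Gold./.Minute"

# An unescaped '.' is a regex wildcard: every character matches and is removed
wildcard <- gsub(".", "", old_name)

# '[.]' (or '\\.') matches only a literal period
bracketed <- gsub("[.]", "", old_name)

# fixed = TRUE switches regex interpretation off altogether
literal <- gsub(".", "", old_name, fixed = TRUE)

c(wildcard = wildcard, bracketed = bracketed, literal = literal)
```

The wildcard version returns an empty string, while the other two keep the name and drop only the periods.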
The WinRate column is selected for the univariate exploratory data analysis. The summary() function gives some of the basic descriptive statistics. Since R does not have a built-in function to compute the statistical mode, we work around that by estimating the mode as the peak of the kernel density estimate of WinRate. Based on the output, the gameplay appears to be reasonably balanced, as the mean, median and mode of the win rate are all around 50%.
attach(df)
summary(WinRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4011 0.4656 0.4934 0.4916 0.5150 0.5709
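# Kernel density estimate of the win rate
density_estimate <- density(WinRate)
# Mode estimate: the x value at which the estimated density is highest
mode_value <- density_estimate$x[which.max(density_estimate$y)]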
mode_value
## [1] 0.5027889
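To get a feel for how the density-based mode estimate behaves, it can be wrapped in a small function and tried on simulated data. This is only a sketch: `kde_mode()` is a hypothetical helper, and the simulated win rates are an assumption for illustration, not the Dota 2 data.

```r
# Hypothetical helper: location of the highest point of the kernel density estimate
kde_mode <- function(x) {
  d <- density(x)        # default Gaussian kernel and default bandwidth
  d$x[which.max(d$y)]
}

# Simulated win rates centred on 0.5 (illustrative only)
set.seed(42)
simulated <- rnorm(1000, mean = 0.5, sd = 0.03)

kde_mode(simulated)      # expected to land close to 0.5
```

The estimate depends on the bandwidth chosen by density(), so a different `bw` argument can shift the result slightly.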
One of the fundamental tasks in EDA is to characterise the variability and location of a dataset. Apart from the standard descriptive statistics, the dataset can be characterised further via skewness and kurtosis. Skewness is a measure of asymmetry, whereas kurtosis describes whether the data has light or heavy tails.
The data has a skewness of -0.005, which is so close to zero that the distribution is practically symmetric; the negative sign indicates only a very slight tendency for the longer tail to lie on the left. The kurtosis() function from e1071 reports excess kurtosis (kurtosis minus 3), so the value of -0.45 below corresponds to a kurtosis of about 2.55 (less than 3), indicating that the distribution is platykurtic, with thinner tails than a normal distribution. These can be validated later using the KDE plot.
library(e1071)
wr_skw <- skewness(WinRate)
wr_skw
## [1] -0.005136054
wr_kur <- kurtosis(WinRate)
wr_kur
## [1] -0.4520088
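The two measures can also be written out from the plain central-moment definitions, which makes the "minus 3" in the excess kurtosis explicit. This is a minimal sketch with hypothetical helper names; e1071's default applies a sample-size adjustment, so its values may differ slightly from these on small samples.

```r
# Hypothetical helpers based on the plain central-moment definitions
moment_skewness <- function(x) {
  dev <- x - mean(x)
  mean(dev^3) / mean(dev^2)^1.5
}

moment_excess_kurtosis <- function(x) {
  dev <- x - mean(x)
  mean(dev^4) / mean(dev^2)^2 - 3   # subtract 3 so a normal distribution scores 0
}

x <- c(1, 2, 3, 4, 5)        # symmetric toy sample
moment_skewness(x)           # zero for a symmetric sample
moment_excess_kurtosis(x)    # negative: flatter than a normal distribution
```

A negative excess kurtosis, as in the win rate data, therefore means a kurtosis below 3.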
To create the KDE plot, we can use the ggplot2 library. The output shows that the data is unimodal. The rug plot (red lines along the x axis) marks the individual win rate values.
library(ggplot2)
ggplot(df, aes(WinRate)) + geom_histogram(aes(y=..density..), colour="gray", fill="white") +
geom_density()+geom_rug(alpha=0.5,color='Red') +
geom_vline(aes(xintercept=mean(WinRate)), color="blue", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(WinRate)), color="red", linetype="dashed", size=1) +
geom_vline(xintercept = mode_value,color='purple', linetype="twodash", size =1)+
geom_label(aes(mean(WinRate),4,label='mean'),show.legend = F)+
geom_label(aes(median(WinRate),6,label='median'),show.legend = F)+
geom_label(aes(mode_value,8,label='KDE \nmode'),show.legend = F)+
xlim(0.39,0.61)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The ECDF shows, for each win rate, the proportion of heroes whose win rate is at or below that value. Based on the plot, only one hero has a win rate below 42%, and the large majority (95%) of the heroes have a win rate under 55%.
# ECDF with vline for 25th, 50th, 75th and 95th
# Compute the qth quantile to be displayed on the plot
q25 <- quantile(WinRate, probs=c(.25))
q50 <- quantile(WinRate, probs=c(.5))
q75 <- quantile(WinRate, probs=c(.75))
q95 <- quantile(WinRate, probs=c(.95))
ggplot(df, aes(WinRate))+stat_ecdf(geom='point') + labs(x="WinRate",y="ECDF") +
geom_vline(xintercept=q25,linetype="dashed",colour='red',size =1)+
geom_label(aes(q25,0.1,label='25th Percentile'),show.legend = F)+
geom_vline(xintercept=q50,linetype="dashed",colour='blue',size =1)+
geom_label(aes(q50,0.3,label='50th Percentile'),show.legend = F)+
geom_vline(xintercept=q75,linetype="dashed",colour='green',size =1)+
geom_label(aes(q75,0.55,label='75th Percentile'),show.legend = F)+
geom_vline(xintercept=q95,linetype="dashed",colour='black',size =1)+
geom_label(aes(q95,0.8,label='95th Percentile'),show.legend = F)
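Statements such as "only one hero is below 42%" can also be read off numerically with base R's ecdf() function rather than from the plot. The sketch below uses made-up win rates (an assumption for illustration, not the Dota 2 data):

```r
# Toy win rates (made up for illustration)
toy_wr <- c(0.41, 0.44, 0.46, 0.47, 0.49, 0.50, 0.51, 0.52, 0.54, 0.57)

wr_ecdf <- ecdf(toy_wr)   # returns a step function

wr_ecdf(0.42)             # share of values at or below 0.42
wr_ecdf(0.55)             # share of values at or below 0.55
sum(toy_wr < 0.42)        # number of values below 0.42
```

Applied to the real data, the same calls on WinRate would give the proportions that the plot shows graphically.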
Multivariate EDA examines the relationships between variables. To simplify the visualisation and analysis, only GoldPerMinute (GPM), ExperiencePerMinute (XPM) and KDARatio (KDA) are selected. Based on the correlation matrix, all the variables are positively correlated. The GPM - XPM pair has the highest correlation (0.89), followed by XPM - KDA (0.45) and GPM - KDA (0.30). The high correlation between GPM and XPM is consistent with the logic of the game, where heroes obtain both gold and experience by killing creeps or opposing heroes.
# Select 3 columns: GoldPerMinute, ExperiencePerMinute and KDARatio
# Create a data frame from the selected columns
multivariate <- data.frame(GoldPerMinute, ExperiencePerMinute, KDARatio)
cor(multivariate[1:3])
## GoldPerMinute ExperiencePerMinute KDARatio
## GoldPerMinute 1.0000000 0.8862935 0.2959903
## ExperiencePerMinute 0.8862935 1.0000000 0.4473213
## KDARatio 0.2959903 0.4473213 1.0000000
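By default, cor() returns the Pearson correlation coefficient. As a rough illustration of what each entry of the matrix measures, the coefficient can be computed by hand on a toy pair of vectors (made up for illustration) and compared with cor():

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

# Deviations from the mean
dx <- x - mean(x)
dy <- y - mean(y)

# Pearson correlation: co-variation scaled by the spread of each variable
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

r_manual
cor(x, y)    # should agree with r_manual
```

Values near 1 indicate that the two variables rise together almost linearly, as with GPM and XPM.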
To visualise the correlation between the variables, a pair plot is created. The lower triangle shows the scatterplot of each pair with a fitted regression line, the diagonal summarises each variable using a kernel density estimation plot, and the upper triangle shows the 2D density diagram.
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggpairs(multivariate,upper=list(continuous = 'density'),
lower = list(continuous='smooth'),
diag = list(continuous ='densityDiag'))
Packages in R are handy; however, some of them have limited customisability. Take ggpairs() as an example: we can choose the type of plot for each pair, which is already a big improvement over base R. However, the plot can still be improved further. One possible improvement is to combine the 2D KDE diagram (or contour plot) with a scatterplot, to better visualise the exact position of each point and, more importantly, to locate outliers.
Another way to improve the visualisation is to create a 2D bin diagram (or heatmap). Each tile on the plot represents a bin, and the intensity of its colour shows the number of observations located within the bin.
More reading on GGally::ggpairs can be found at https://ggobi.github.io/ggally/#columns_and_mapping
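Both ideas can be sketched for a single pair of variables directly in ggplot2, outside ggpairs(). The example below is only one possible approach and runs on simulated data; the `sim` data frame and its `gpm` and `xpm` columns are assumptions for illustration, not the Dota 2 data.

```r
library(ggplot2)

# Simulated, positively correlated data (illustrative only)
set.seed(1)
gpm <- rnorm(200, mean = 400, sd = 80)
sim <- data.frame(gpm = gpm, xpm = 1.1 * gpm + rnorm(200, sd = 40))

# Scatterplot with 2D KDE contours drawn on top of the points
p_contour <- ggplot(sim, aes(gpm, xpm)) +
  geom_point(alpha = 0.6) +
  geom_density_2d()

# 2D bin diagram: fill colour encodes the number of points per bin
p_bins <- ggplot(sim, aes(gpm, xpm)) +
  geom_bin2d(bins = 10)

p_contour
p_bins
```

Replacing `sim`, `gpm` and `xpm` with the multivariate data frame and two of its columns would produce the same kind of plots for the hero data.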