DATA 606 Data Project

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

## 
## Attaching package: 'kableExtra'

## The following object is masked from 'package:dplyr':
## 
##     group_rows

Abstract

League of Legends is an online video game that has been around for over a decade and had a monthly active user base of 115 million people in 2021. Players, pros and casuals alike, claim that the first step to becoming proficient at the game is to practice CS’ing. CS is short for creep score, a tally of how many minions a player killed in the game. Users can spend hundreds or even thousands of hours playing this game, so I wanted to test the validity of this claim: do better players kill more minions in a game on average?

\[ H_{0}: \text{There is no correlation between rank and creep score} \]
\[ H_{a}: \text{There is a correlation between rank and creep score} \]

My personal hypothesis is that higher-ranked players on average kill more minions in a game than lower-ranked players. To test my hypothesis I collected data directly from the League of Legends API. I then fit a linear regression model of total minions killed in a game on rank. The model shows a positive correlation between rank and creep score, and the resulting function is: \[ \text{Creep Score} = 126.9362 + 2.7321 \times \text{Rank} \] Based on these results, I would advise a new player to practice and become better at CS’ing.

Part 1 - Introduction

I have been playing a game named “League of Legends” on and off for approximately a decade. During my journey as a player I began to practice and study methods to get better at the game. One of the most common pieces of advice I would hear from friends and professional players online was: “Get better at CS’ing”. I won’t get into too much of the mechanics of the game, but CS stands for creep score, or in other words, how many creeps a player kills throughout the game. It’s a simple aspect of the game, but becoming proficient at it is important for improving overall. I want to answer the question: are higher-ranked players “better” at CS’ing than lower-ranked players?

I will try to answer this question by comparing creep scores of lower-ranked players to those of higher-ranked players. I will be using a linear model to see if there is a positive correlation between rank and creep score. The data is pulled from the League of Legends API, which can be found here: https://developer.riotgames.com/apis. Unfortunately, since I do not have a production API key, it took my script quite a while to pull the data necessary for this analysis. The script can be found here: https://github.com/jglendrange/data606/blob/main/leagueAPI.ipynb. This is something I have always been interested in, and I’m glad I have the opportunity to test its validity.

Part 2 - Data

As I mentioned in the introduction, the data was obtained using the League of Legends API. I used a website that generates random matchIds (each matchId corresponds to an individual game), and each match returns performance metrics for each player that participated in it. I queried 1,000 matches, which gave me data on 10,000 random players.

Now, let’s take a look at the data set I put together. We have 8 columns (plus the row index X), but the columns we will be most interested in are rank and totalMinionsKilled. These are what we will use to create the linear model. We may be interested in role and lane after our first attempt at the model, because these can influence totalMinionsKilled in a game. For example, when role = “DUO_SUPPORT” we can expect creep score to be low, since support players are not focused on killing minions.

head(playerStats) %>%
  kbl() %>%
  kable_styling()

X	gameId	visionScore	goldEarned	totalMinionsKilled	role	lane	userId	rank
0	3856873255	13	9173	139	SOLO	MIDDLE	F_7N8wSdjoPdUw1SBn6P2t62PeDArMYt_GmtbU4RJe9rr9k	SILVERII
1	3856873255	39	8568	12	NONE	JUNGLE	TAdsWNVJiqCj0hA5npp4-z5EKXkkBJw7ss5vOhdFldZw-kI	GOLDIV
2	3856873255	10	8377	155	SOLO	TOP	M01izMLGcM-Q5WjIkGooQHDSljWmjY0uIDHyeV-lCO6hEYM	GOLDIV
3	3856873255	10	11397	168	DUO_CARRY	BOTTOM	hQwpywrGvpGKyK0Dcz006NBlzUe4mEewMjEQuPgkgq5JrYk	GOLDIV
4	3856873255	37	7035	25	DUO_SUPPORT	BOTTOM	v1MLAUQfrq1FadXN0cPPbxkpAKxxLsdkcNUmv0kfuOM6TiA	GOLDIV
5	3856873255	13	12398	147	SOLO	TOP	4DnCOG5yQV5SFo5n59BaBU2gXk-AjGgZUTJflaHOaIvqaXk	GOLDIII

Every player has a rank going into the game. They receive it after playing a minimum number of placement games. Ideally, as you get better at the game you rise through the ranks. The highest ranks have very few players; the highest, Challenger, has only hundreds. There are 27 ranks in total: six tiers with four divisions each, plus Master, Grandmaster, and Challenger. Our rank distribution is fairly close to the one reported at https://www.leagueofgraphs.com/rankings/rank-distribution.

playerStats %>%
  group_by(rank) %>%
  summarise(count = n()) %>%
  ggplot(aes(x=reorder(rank, -count), y=count)) + geom_bar(stat="identity") + coord_flip()
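
The chart above orders ranks by player count. To read the same counts against the ladder itself, the ranks can instead be ordered from lowest to highest. The sketch below builds that ordering in base R. It assumes rank strings are a tier name followed directly by a division (as in GOLDIV in the table above) and that the three apex tiers carry a trailing I; the second point is an assumption about how the API labels them, not something shown in the data preview. The names rank_levels and example are introduced here for illustration only.

```r
# Hypothetical helper vectors: build the full ladder from lowest to highest rank.
tiers     <- c("IRON", "BRONZE", "SILVER", "GOLD", "PLATINUM", "DIAMOND")
divisions <- c("IV", "III", "II", "I")   # IV is the lowest division in a tier

# outer() pastes every division onto every tier; t() makes the divisions
# vary fastest when the matrix is flattened: IRONIV, IRONIII, ..., DIAMONDI.
divided <- as.vector(t(outer(tiers, divisions, paste0)))

# Assumption: apex tiers have a single division labeled "I".
apex <- paste0(c("MASTER", "GRANDMASTER", "CHALLENGER"), "I")

rank_levels <- c(divided, apex)
length(rank_levels)

# An ordered factor sorts by ladder position instead of alphabetically.
example <- factor(c("GOLDIV", "SILVERII", "GOLDIII"),
                  levels = rank_levels, ordered = TRUE)
sort(example)
```

Passing a factor with these levels to the x aesthetic (in place of the reorder() call) would lay the bars out in ladder order.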

Part 3 - Exploratory data analysis

First let’s take a look at the summary statistics on all of our fields to get a better understanding of our data.

summary(playerStats)

##        X             gameId           visionScore       goldEarned   
##  Min.   :    0   Min.   :3.857e+09   Min.   :  0.00   Min.   :  667  
##  1st Qu.: 3137   1st Qu.:3.857e+09   1st Qu.: 12.00   1st Qu.: 8011  
##  Median : 6274   Median :3.857e+09   Median : 19.00   Median :10326  
##  Mean   : 6274   Mean   :3.857e+09   Mean   : 23.62   Mean   :10554  
##  3rd Qu.: 9412   3rd Qu.:3.857e+09   3rd Qu.: 29.00   3rd Qu.:12821  
##  Max.   :12549   Max.   :3.857e+09   Max.   :171.00   Max.   :38844  
##  totalMinionsKilled     role               lane              userId         
##  Min.   :  0.0      Length:12550       Length:12550       Length:12550      
##  1st Qu.: 35.0      Class :character   Class :character   Class :character  
##  Median :112.0      Mode  :character   Mode  :character   Mode  :character  
##  Mean   :105.3                                                              
##  3rd Qu.:162.0                                                              
##  Max.   :403.0                                                              
##      rank          
##  Length:12550      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Now let’s look at our target variable totalMinionsKilled. This is really interesting. Plotting totalMinionsKilled gives us a bimodal distribution. I suspect the lower peak is from support roles and jungle roles that are not killing as many creeps throughout the game. We will likely have to separate the roles.

hist(playerStats$totalMinionsKilled)

Let’s take a look at the average totalMinionsKilled by lane and role. My suspicions were correct: DUO_SUPPORT and NONE have low averages.

playerStats %>%
  group_by(lane, role) %>%
  summarise(avg = mean(totalMinionsKilled),
            med = median(totalMinionsKilled)) %>%
  kbl() %>%
  kable_styling()

## `summarise()` has grouped output by 'lane'. You can override using the `.groups` argument.

lane	role	avg	med
BOTTOM	DUO	109.80000	94.0
BOTTOM	DUO_CARRY	164.94041	163.0
BOTTOM	DUO_SUPPORT	34.04340	32.0
BOTTOM	SOLO	109.20253	111.5
JUNGLE	NONE	46.03748	34.0
MIDDLE	DUO	158.50820	155.0
MIDDLE	DUO_CARRY	159.62000	152.5
MIDDLE	DUO_SUPPORT	47.48077	37.0
MIDDLE	SOLO	153.28166	151.0
NONE	DUO	50.22530	14.0
NONE	DUO_SUPPORT	60.41154	61.0
TOP	DUO	168.65972	165.0
TOP	DUO_CARRY	160.08571	156.0
TOP	DUO_SUPPORT	52.04762	36.0
TOP	SOLO	161.00209	158.0

ggplot(data = playerStats, aes(x=totalMinionsKilled, y=role)) + geom_boxplot()

Let’s see how the histogram changes when I remove those 2 roles. The distribution becomes roughly normal, which is a good sign for our analysis.

playerStats %>%
  filter(role != "DUO_SUPPORT" & role != "NONE") %>%
  ggplot(aes(x = totalMinionsKilled)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here I break the distribution out by rank to get a visual feel for how these distributions look. You can see that some ranks have a much larger population than others. For now this will have to do, because it would be difficult to make each group equal in size.

playerStats %>%
  filter(role != "DUO_SUPPORT" & role != "NONE") %>%
  ggplot(aes(x = totalMinionsKilled)) + geom_histogram() + 
  facet_wrap(vars(rank))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This section of code converts the ranks into numeric form, 0-26. Note that GRANDMASTER has to be tested before MASTER, because the pattern "MASTER" also matches "GRANDMASTER".

# Tier offset: each tier below Master spans four divisions (IV, III, II, I).
# case_when() returns the value of the first condition that matches.
playerStats$rank_num1 <- case_when(
  grepl("IRON", playerStats$rank)        ~ 0,
  grepl("BRONZE", playerStats$rank)      ~ 4,
  grepl("SILVER", playerStats$rank)      ~ 8,
  grepl("GOLD", playerStats$rank)        ~ 12,
  grepl("PLATINUM", playerStats$rank)    ~ 16,
  grepl("DIAMOND", playerStats$rank)     ~ 20,
  grepl("GRANDMASTER", playerStats$rank) ~ 25,
  grepl("MASTER", playerStats$rank)      ~ 24,
  grepl("CHALLENGER", playerStats$rank)  ~ 26,
  TRUE                                   ~ 0   # e.g. "INVALID"; filtered out later
)

# Division offset within a tier: IV = 0, III = 1, II = 2, I = 3.
# Master, Grandmaster, and Challenger have no divisions, so they get 0.
playerStats$rank_num2 <- case_when(
  grepl("III$", playerStats$rank)              ~ 1,
  grepl("II$", playerStats$rank)               ~ 2,
  grepl("MASTER|CHALLENGER", playerStats$rank) ~ 0,
  grepl("I$", playerStats$rank)                ~ 3,
  TRUE                                         ~ 0   # "IV" and anything unmatched
)

playerStats$rank_num <- playerStats$rank_num1 + playerStats$rank_num2
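
A spot check of the intended tier-plus-division mapping on a few hand-picked rank strings can catch ordering mistakes in the pattern matching. The helper rank_to_num below is hypothetical and written only for this illustration; it tests GRANDMASTER before MASTER because the pattern "MASTER" also matches "GRANDMASTER". The apex-tier strings with a trailing I are an assumption about the API's labels.

```r
# Hypothetical helper that mirrors the intended tier + division mapping.
rank_to_num <- function(rank) {
  tier_names  <- c("IRON", "BRONZE", "SILVER", "GOLD", "PLATINUM", "DIAMOND",
                   "GRANDMASTER", "MASTER", "CHALLENGER")
  tier_values <- c(0, 4, 8, 12, 16, 20, 25, 24, 26)

  tier <- rep(0, length(rank))        # unmatched strings (e.g. "INVALID") stay 0
  done <- rep(FALSE, length(rank))    # TRUE once a tier has been assigned
  for (i in seq_along(tier_names)) {
    hit <- !done & grepl(tier_names[i], rank, fixed = TRUE)
    tier[hit] <- tier_values[i]
    done <- done | hit
  }

  # Apex tiers have no divisions; other tiers add 0 (IV) up to 3 (I).
  apex <- tier >= 24
  division <- ifelse(apex, 0,
              ifelse(grepl("III$", rank), 1,
              ifelse(grepl("II$", rank), 2,
              ifelse(grepl("I$", rank), 3, 0))))
  tier + division
}

rank_to_num(c("IRONIV", "SILVERII", "GOLDIV", "GOLDIII", "DIAMONDI",
              "MASTERI", "GRANDMASTERI", "CHALLENGERI"))
# 0 10 12 13 23 24 25 26
```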

The first view below looks at totalMinionsKilled vs rank_num for our entire dataset. As we could expect from the previous graphs, I think removing some of the roles will help our model greatly.

# geom_jitter() alone draws each point once, with a small random offset to reduce overplotting
ggplot(data = playerStats, aes(x = rank_num, y = totalMinionsKilled)) + geom_jitter()

Filtering our set down makes a clear trend visible: the higher the rank, the higher the player’s creep score tends to be in a game.

laners <- playerStats %>%
  filter(role != 'DUO_SUPPORT' & role != 'NONE' & rank != "INVALID" & totalMinionsKilled > 5)

# Jittered points for laners only, with the least squares line overlaid
ggplot(data = laners, aes(x = rank_num, y = totalMinionsKilled)) + geom_jitter() + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Just to be sure, we can check the correlation between the two variables and see that it is positive.

cor(laners$rank_num,laners$totalMinionsKilled)

## [1] 0.23255

# Same view for the full dataset, split into one panel per role
ggplot(data = playerStats, aes(x = rank_num, y = totalMinionsKilled)) + geom_jitter() + facet_wrap(vars(role)) + geom_smooth(method = "lm")

## `geom_smooth()` using formula 'y ~ x'

Part 4 - Inference

Now let’s plug our dataset into a linear regression model and see what function is returned. Note the Pr(>|t|) value: it is the probability of observing a t statistic at least as far from zero as the one reported, assuming the true coefficient is zero. Since this p-value is extremely small, a slope this large would almost never arise by chance alone if rank and creep score were unrelated, so we can conclude that the results are statistically significant.

\[ \text{Creep Score} = 126.9362 + 2.7321 \times \text{Rank} \]
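
To make the fitted line concrete, the sketch below plugs a few numeric ranks into the equation. The function name predict_creep_score is hypothetical, and the coefficients are copied by hand from the equation rather than read from a model object; the rank values follow the report's encoding (Iron IV = 0, Gold IV = 12, Challenger = 26).

```r
# Coefficients copied from the fitted equation above.
intercept <- 126.9362
slope     <- 2.7321

# Expected creep score for a laner at a given numeric rank (hypothetical helper).
predict_creep_score <- function(rank_num) intercept + slope * rank_num

# Iron IV = 0, Gold IV = 12, Challenger = 26 under the report's encoding.
predict_creep_score(c(0, 12, 26))
# 126.9362 159.7214 197.9708
```

Read this way, the model puts a typical Challenger laner about 71 minions per game ahead of a typical Iron IV laner.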

minion_model <- lm(totalMinionsKilled ~ rank_num, data = laners)
summary(minion_model)

## 
## Call:
## lm(formula = totalMinionsKilled ~ rank_num, data = laners)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -161.649  -27.721   -1.721   27.671  248.743 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 126.9362     1.6138   78.66   <2e-16 ***
## rank_num      2.7321     0.1391   19.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 46.02 on 6748 degrees of freedom
## Multiple R-squared:  0.05408,    Adjusted R-squared:  0.05394 
## F-statistic: 385.8 on 1 and 6748 DF,  p-value: < 2.2e-16
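
The numbers in this summary are tied together by simple arithmetic, which can be checked by hand. The sketch below recomputes the t value, the two-sided p-value, R-squared, and the F statistic from values copied out of the output above and the earlier cor() result; the variable names are introduced here for illustration only.

```r
# Values copied from the model summary and the earlier cor() output.
estimate  <- 2.7321
std_error <- 0.1391
df_resid  <- 6748
r         <- 0.23255

t_value   <- estimate / std_error                  # about 19.64
p_value   <- 2 * pt(-abs(t_value), df = df_resid)  # two-sided tail probability
r_squared <- r^2                                   # about 0.0541
f_value   <- t_value^2                             # about 385.8 with one predictor

c(t = t_value, p = p_value, r_squared = r_squared, f = f_value)
```

With a single predictor, R-squared is the squared correlation and the F statistic is the squared t value, which is why the four reported figures agree. The recomputed p-value is far below the 2e-16 threshold that R prints.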

Now that we have created our model, let’s make sure a linear regression fits this data set by checking each condition:

Linearity: Our view shows a positive linear trend. This condition is met.

Nearly Normal Residuals: The QQ plot and histogram below show that the residuals are nearly normally distributed. This condition is met.

Constant Variability: The 3rd plot shows no noticeable trend, and the variability is constant. This condition is met.

Independent Observations: The site I pulled the matchIds from claims its selection is random, and the rank distribution is fairly close to that of the actual population. This condition is met.

All of the conditions for a least squares line have been met, so a linear model is appropriate for describing this relationship. Note, however, that the R-squared of 0.054 means rank explains only about 5% of the variation in creep score.

qqnorm(minion_model$residuals)
qqline(minion_model$residuals)

hist(minion_model$residuals)

plot(x=minion_model$residuals)
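
A common complementary check for constant variability is to plot residuals against fitted values rather than against row order. The sketch below shows the idea on simulated stand-in data, since playerStats is not reproduced here; the simulated coefficients and noise level are loosely modeled on the summary above and are assumptions, not results, and sim and sim_model are hypothetical names.

```r
# Simulated stand-in data (not the real playerStats): a numeric rank
# predictor and a noisy linear response of roughly the same scale.
set.seed(606)
sim <- data.frame(rank_num = sample(0:26, 500, replace = TRUE))
sim$totalMinionsKilled <- 127 + 2.7 * sim$rank_num + rnorm(500, sd = 46)

sim_model <- lm(totalMinionsKilled ~ rank_num, data = sim)

# Residuals against fitted values: a flat, evenly spread band around zero
# is what constant variability looks like.
plot(fitted(sim_model), resid(sim_model),
     xlab = "Fitted creep score", ylab = "Residual")
abline(h = 0, lty = 2)
```

On the real data, the same two plotting calls would take minion_model in place of sim_model.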

Part 5 - Conclusion

In conclusion, there is a positive relationship between rank and creep score, and our Pr(>|t|) value means the result is statistically significant. A relationship like this would almost never appear by chance if the two variables were unrelated, so we can reject the null hypothesis that there is no correlation between them.

Furthermore, after evaluating the residuals we confirm that all the necessary conditions are met and a linear regression model fits this data set. With these results I would feel comfortable telling a beginner getting into the game to invest time into improving this skill.

References

https://developer.riotgames.com/apis