# This is the R chunk for the required packages
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(outliers)
library(editrules)
library(MVN)
To prepare an analysis of competitive League of Legends matches, two data sets were imported into R. Both contain variables collected about the league matches; the Data section describes these variables and their types (e.g. character, numeric). One data set, “gold_matches”, is left-joined onto “League_Matches” by the variable “Address” to create League_data. The new data set is then checked for obvious errors.
In the Understand section, League_data is checked for incorrectly typed variables. Several variables were converted to factors because they had been read in as characters or doubles.
League_data was then tidied and manipulated. First, the Season_Type column was separated into two columns because it contained two variables. Second, a column was removed because it did not contain a variable. Lastly, a new variable, “Golddiff_change_from_min10_min20”, was created from the variables min_10 and min_20.
League_data was then scanned, first for missing values, special values and obvious errors or inconsistencies. There were 435 missing values; the rows containing them were excluded because they made up a small percentage of the data set. No special values or obvious errors or inconsistencies were found.
Second, the data set was scanned for outliers. The variable “Golddiff_change_from_min10_min20” was found to have outliers, which were excluded using z-scores computed with scores(type = "z") from the outliers package.
Lastly, the variable “gamelength” was transformed with logarithmic and square-root transformations to bring it closer to a normal distribution, which is preferred for statistical inference.
Two data sets, “League_Matches” and “gold_matches”, both contain data from competitive League of Legends matches from 2015 to 2018. League_Matches has numeric and qualitative variables about the games played, including the result, game length, team names, players and champions picked. gold_matches also has qualitative and numeric variables, recording the difference in gold from the first minute to the 20th minute of each match.
League_Matches has 29 variables and 7620 observations. Examples of the definitions of these variables are given below:
League: The league in which the game was played. i.e. North American league (NALCS)
Year: Year the game was played
Season_Type: Which season and what type of game it was i.e. season match or final
blueTeamTag: Blue side team name
bResult: Result for Blue team
rResult: Result for Red team
redTeamTag: Red side team name
gamelength: How long the game lasted
blueTop: Player who played top for the Blue team
blueTopChamp: Champion played top by the Blue team
blueJungle: Player who played jungle for the Blue team
blueJungleChamp: Champion played jungle by the Blue team
blueMiddle: Player who played mid for the Blue team
blueMiddleChamp: Champion played mid by the Blue team
blueADC: Player who played ADC for the Blue team
blueADCChamp: Champion played ADC by the Blue team
blueSupport: Player who played support for the Blue team
blueSupportChamp: Champion played support by the Blue team
The position (e.g. Support) and champion variables are repeated for the Red team. Website URL: https://www.kaggle.com/chuckephron/leagueoflegends
gold_matches has 22 variables and 7620 observations. Examples of the definitions of these variables are given below:
Address: The match URL on the official League of Legends match history site
Type: The gold type
min_1: Gold difference at 1 minute
min_2: Gold difference at 2 minutes
min_3: Gold difference at 3 minutes
... up to min_20: Gold difference at 20 minutes
Website URL: https://www.kaggle.com/chuckephron/leagueoflegends
The League_Matches and gold_matches CSV files are imported and their heads checked for obvious issues. gold_matches is then left-joined onto League_Matches by “Address”.
# This is the R chunk for the Data Section
League_Matches <- read_csv("~/Pictures/Grad Cert/League Matches.csv")
── Column specification ──────────────────────────────────────────────────────────────
cols(
.default = col_character(),
Year = col_double(),
bResult = col_double(),
rResult = col_double(),
gamelength = col_double()
)
ℹ Use `spec()` for the full column specifications.
gold_matches <- read_csv("~/Pictures/Grad Cert/gold matches.csv")
── Column specification ──────────────────────────────────────────────────────────────
cols(
.default = col_double(),
Address = col_character(),
Type = col_character()
)
ℹ Use `spec()` for the full column specifications.
head(League_Matches)
head(gold_matches)
League_data <- League_Matches %>% left_join(gold_matches,by = "Address")
head(League_data)
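A left join keeps every row of the left table, but duplicated keys in the right table would silently add rows. The following is a minimal sketch of how that could be checked, using small made-up tables (`matches` and `gold` are hypothetical stand-ins, not the real data):

```r
library(dplyr)

# Hypothetical toy tables standing in for League_Matches and gold_matches
matches <- data.frame(Address = c("a", "b", "c"),
                      gamelength = c(30, 41, 27),
                      stringsAsFactors = FALSE)
gold <- data.frame(Address = c("a", "b", "c"),
                   min_1 = c(0, 12, -8),
                   stringsAsFactors = FALSE)

# The join key should be unique in the right-hand table
stopifnot(anyDuplicated(gold$Address) == 0)

joined <- left_join(matches, gold, by = "Address")

# The joined table should have exactly as many rows as the left table
nrow(joined) == nrow(matches)

# Rows of the left table with no partner in the right table
unmatched <- anti_join(matches, gold, by = "Address")
nrow(unmatched)
```

The same checks could be run on the real tables before relying on the joined row count.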
The dimensions of the League_data data frame were checked: 7620 observations and 50 variables. Of the 50 variables, 26 are read as characters and 24 as doubles. bResult and rResult should be converted into ordered factors labelled “Win” and “Loss”, with Win ordered higher than Loss. League, Year, Season_Type, blueTeamTag and redTeamTag should also be converted to factors. This is done with the factor() and as.factor() functions.
# This is the R chunk for the Understand Section
# Levels are listed from lowest to highest, so "Win" is ordered above "Loss"
League_data$bResult <- factor(League_data$bResult, levels = c(0, 1), labels = c("Loss", "Win"), ordered = TRUE)
League_data$rResult <- factor(League_data$rResult, levels = c(0, 1), labels = c("Loss", "Win"), ordered = TRUE)
League_data$League <- as.factor(League_data$League)
League_data$Year <- as.factor(League_data$Year)
League_data$Season_Type <- as.factor(League_data$Season_Type)
League_data$blueTeamTag <- as.factor(League_data$blueTeamTag)
League_data$redTeamTag <- as.factor(League_data$redTeamTag)
head(League_data)
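In an ordered factor, the levels run from lowest to highest in the order they are given. A small sketch on a made-up 0/1 vector (`result` is hypothetical) shows one way to confirm that “Win” ends up ordered above “Loss”:

```r
# Hypothetical 0/1 results, coded like bResult and rResult
result <- c(1, 0, 1, 1, 0)

# "Loss" is listed first, so it is the lowest level
result_f <- factor(result, levels = c(0, 1),
                   labels = c("Loss", "Win"), ordered = TRUE)

levels(result_f)             # level order, lowest first
result_f[1] > result_f[2]    # compares a Win with a Loss
table(result_f)              # counts per level
```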
The “Season_Type” column held two variables, making the data frame untidy: whether the game was played in “Spring” or “Summer”, and whether it was a “Season” game or a “Playoff” game. The separate() function was therefore used to split these into two columns.
The “Type” variable was also removed from the data set because it carried no information: every observation had the same value, “golddiff”.
# This is the R chunk for the Tidy & Manipulate Data I
League_data <- League_data %>% separate(Season_Type, into = c("Season","Game_Type"),sep = " ")
League_data <- League_data %>% select (-Type)
League_data$Game_Type <- as.factor(League_data$Game_Type)
League_data$Season <- as.factor(League_data$Season)
head(League_data)
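Both tidying steps can be illustrated on a small made-up data frame. This is only a sketch: the values in `toy` are hypothetical, and the constant-column check is one possible way to spot a column like “Type”, not necessarily how it was found here.

```r
library(tidyr)

# Hypothetical toy data with a two-part column and a constant column
toy <- data.frame(
  Season_Type = c("Spring Season", "Summer Playoffs", "Summer Season"),
  Type = c("golddiff", "golddiff", "golddiff"),
  stringsAsFactors = FALSE
)

# Split the two-part column on the space
tidy_toy <- separate(toy, Season_Type, into = c("Season", "Game_Type"), sep = " ")
tidy_toy

# Columns with a single distinct value carry no information
is_constant <- vapply(tidy_toy, function(col) length(unique(col)) == 1, logical(1))
constant_cols <- names(is_constant)[is_constant]
constant_cols
```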
A new variable was created from the min_10 and min_20 variables, showing the change in the gold lead from 10 minutes to 20 minutes into the game. It was created with the mutate() function and allows easier analysis of how the gold lead changes from earlier to later in the game.
# This is the R chunk for the Tidy & Manipulate Data II
League_data <- mutate(League_data, Golddiff_change_from_min10_min20 = min_20 - min_10)
The data was scanned for missing values, special values and obvious errors and inconsistencies.
Scanning for missing values showed that there were 435 missing values. These occurred primarily in 38 rows of the data set, which is less than 5% of the data set. It was therefore decided that the best strategy was to leave out the rows with missing values so as not to bias the analysis.
The data was scanned for special values using is.infinite() and is.nan(); none were found.
The gold difference variables (e.g. min_1, min_2) and the “gamelength” variable were also scanned for obvious errors or inconsistencies. None were found.
# This is the R chunk for the Scan I
sum(is.na(League_data))
[1] 435
colSums(is.na(League_data))
League Year
0 0
Season Game_Type
0 0
blueTeamTag bResult
38 0
rResult redTeamTag
0 37
gamelength blueTop
0 37
blueTopChamp blueJungle
0 28
blueJungleChamp blueMiddle
0 37
blueMiddleChamp blueADC
0 37
blueADCChamp blueSupport
0 37
blueSupportChamp redTop
0 37
redTopChamp redJungle
0 24
redJungleChamp redMiddle
0 37
redMiddleChamp redADC
0 37
redADCChamp redSupport
0 37
redSupportChamp Address
0 0
min_1 min_2
0 0
min_3 min_4
0 0
min_5 min_6
0 0
min_7 min_8
0 0
min_9 min_10
0 0
min_11 min_12
0 0
min_13 min_14
0 0
min_15 min_16
0 0
min_17 min_18
0 1
min_19 min_20
3 4
Golddiff_change_from_min10_min20
4
sapply(League_data,function(x) sum(is.infinite(x)))
League Year
0 0
Season Game_Type
0 0
blueTeamTag bResult
0 0
rResult redTeamTag
0 0
gamelength blueTop
0 0
blueTopChamp blueJungle
0 0
blueJungleChamp blueMiddle
0 0
blueMiddleChamp blueADC
0 0
blueADCChamp blueSupport
0 0
blueSupportChamp redTop
0 0
redTopChamp redJungle
0 0
redJungleChamp redMiddle
0 0
redMiddleChamp redADC
0 0
redADCChamp redSupport
0 0
redSupportChamp Address
0 0
min_1 min_2
0 0
min_3 min_4
0 0
min_5 min_6
0 0
min_7 min_8
0 0
min_9 min_10
0 0
min_11 min_12
0 0
min_13 min_14
0 0
min_15 min_16
0 0
min_17 min_18
0 0
min_19 min_20
0 0
Golddiff_change_from_min10_min20
0
sapply(League_data,function(x) sum(is.nan(x)))
League Year
0 0
Season Game_Type
0 0
blueTeamTag bResult
0 0
rResult redTeamTag
0 0
gamelength blueTop
0 0
blueTopChamp blueJungle
0 0
blueJungleChamp blueMiddle
0 0
blueMiddleChamp blueADC
0 0
blueADCChamp blueSupport
0 0
blueSupportChamp redTop
0 0
redTopChamp redJungle
0 0
redJungleChamp redMiddle
0 0
redMiddleChamp redADC
0 0
redADCChamp redSupport
0 0
redSupportChamp Address
0 0
min_1 min_2
0 0
min_3 min_4
0 0
min_5 min_6
0 0
min_7 min_8
0 0
min_9 min_10
0 0
min_11 min_12
0 0
min_13 min_14
0 0
min_15 min_16
0 0
min_17 min_18
0 0
min_19 min_20
0 0
Golddiff_change_from_min10_min20
0
# This is the R chunk for the Scan I
League_data <- na.omit(League_data)
sum(is.na(League_data))
[1] 0
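sum(is.na()) counts missing cells, whereas na.omit() drops whole rows, so it can help to look at both. A minimal sketch on a made-up data frame (`toy` and its values are hypothetical):

```r
# Hypothetical toy data with missing values spread over two rows
toy <- data.frame(
  blueTop = c("p1", NA, "p3", "p4"),
  redTop  = c("q1", NA, "q3", "q4"),
  min_20  = c(500, -200, NA, 1200),
  stringsAsFactors = FALSE
)

n_missing_cells <- sum(is.na(toy))            # missing cells
n_incomplete    <- sum(!complete.cases(toy))  # rows with at least one NA
share_dropped   <- mean(!complete.cases(toy)) # fraction of rows na.omit() removes

cleaned <- na.omit(toy)
nrow(cleaned)
```

Comparing `share_dropped` against a chosen threshold (such as the 5% mentioned above) makes the decision to drop rows explicit.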
Rule1 <- editset(c("min_1 <= 2000", "min_2 <=4000","min_3 <= 6000","min_4 <= 8000","min_5 <=10000", "min_6 <= 10000", "min_7 <= 15000", "min_8 <=20000","min_9 <= 20000","min_10 <= 20000", "min_11 <=20000","min_12 <= 30000","min_13 <= 30000", "min_14 <=30000","min_15 <= 30000", "min_16 <= 40000", "min_17 <=50000","min_18 <= 50000", "min_19 <=50000","min_20 <= 50000"))
Rule2 <-editset(c("min_1 >= -3000", "min_2 >= -6000","min_3 >= -8000","min_4 >= -8000","min_5 >= -10000", "min_6 >= -10000", "min_7 >= -15000", "min_8 >= -20000","min_9 >= -20000","min_10 >= -20000", "min_11 >= -20000","min_12 >= -30000","min_13 >= -30000", "min_14 >= -30000","min_15 >= -30000", "min_16 >= -40000", "min_17 >= -50000","min_18 >= -50000", "min_19 >= -50000","min_20 >= -50000"))
Rule3 <- editset(c("gamelength <=100"))
violatedEdits(Rule1, League_data) %>% summary()
No violations detected, 0 checks evaluated to NA
NULL
violatedEdits(Rule2, League_data) %>% summary()
No violations detected, 0 checks evaluated to NA
NULL
violatedEdits(Rule3, League_data) %>% summary()
No violations detected, 0 checks evaluated to NA
NULL
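The same kind of range rule can also be checked in base R, which may be useful as a cross-check. The sketch below is an illustration on made-up data, not a replacement for the editrules output; the `limits` table is a hypothetical helper, with bounds for min_1 and gamelength copied from the rules above.

```r
# Hypothetical toy data; the last row breaks both rules
toy <- data.frame(min_1 = c(0, 150, -2500, 4000),
                  gamelength = c(32, 45, 28, 120))

# One row per variable: allowed lower and upper bound
limits <- data.frame(
  variable = c("min_1", "gamelength"),
  lower = c(-3000, -Inf),
  upper = c(2000, 100),
  stringsAsFactors = FALSE
)

# Count the values outside the allowed range for each variable
violations <- sapply(seq_len(nrow(limits)), function(i) {
  x <- toy[[limits$variable[i]]]
  sum(x < limits$lower[i] | x > limits$upper[i], na.rm = TRUE)
})
names(violations) <- limits$variable
violations
```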
The created variable “Golddiff_change_from_min10_min20” was scanned for outliers. It was plotted in a box plot and a histogram. The box plot showed outliers at both the low and high ends of the data. The histogram showed an approximately normal distribution, so z-scores can be used to identify outliers. This method was applied and 52 outliers (observations with an absolute z-score above 3) were excluded.
The “gamelength” variable was also scanned for outliers. It was plotted in a box plot and a histogram. The box plot showed outliers at the high end of the data. The histogram is right-skewed rather than normally distributed, so z-scores are not appropriate for excluding outliers. The chi-square Q-Q plot method was attempted, but it could not be run because the data set has more than 5000 observations. Outliers in “gamelength” were therefore not excluded.
# This is the R chunk for the Scan II
League_data$Golddiff_change_from_min10_min20 %>% boxplot(main= "Box Plot of Gold Diff Change from 10 mins to 20 mins", ylab= "Gold", col = "grey")
hist(League_data$Golddiff_change_from_min10_min20)
z.scores <- League_data$Golddiff_change_from_min10_min20 %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-4.631499 -0.634183 -0.001472 0.000000 0.642287 4.534876
length(which(abs(z.scores) > 3))
[1] 52
# Logical indexing keeps every value when no z-score exceeds 3, whereas -which() would return an empty vector
Golddiff_change_from_min10_min20_clean <- League_data$Golddiff_change_from_min10_min20[abs(z.scores) <= 3]
hist(Golddiff_change_from_min10_min20_clean)
League_data$gamelength %>% boxplot(main= "Box Plot of Game Length", ylab= "Game Length", col = "grey")
hist(League_data$gamelength)
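For reference, a z-score is the distance of a value from the mean in units of standard deviation. The sketch below computes it by hand in base R on simulated data (the vector `x` is made up), on the assumption that this matches what scores(type = "z") returns:

```r
set.seed(1)

# Simulated, roughly normal data with two planted extreme values
x <- c(rnorm(200), 8, -9)

# z-score: (value - mean) / standard deviation
z <- (x - mean(x)) / sd(x)

# Flag values more than 3 standard deviations from the mean
is_outlier <- abs(z) > 3
sum(is_outlier)

# Logical indexing is safe even when nothing is flagged
x_clean <- x[!is_outlier]
length(x_clean)
```

Because the mean and standard deviation are themselves pulled by extreme values, this rule works best when the bulk of the data is close to normal, as noted above.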
From the previous step we know that the “gamelength” variable is right-skewed. Logarithmic and square-root transformations were applied to “gamelength” to try to make its distribution more symmetrical.
Applying log10(), log() and sqrt() all improve the symmetry of “gamelength”, bringing it closer to a normal distribution. A normal distribution is preferred for statistical inference, so transforming the data is helpful for analysis.
# This is the R chunk for the Transform Section
hist(League_data$gamelength)
log_gamelength <- log10(League_data$gamelength)
hist(log_gamelength)
ln_gamelength <- log(League_data$gamelength)
hist(ln_gamelength)
sqrt_gamelength <- sqrt(League_data$gamelength)
hist(sqrt_gamelength)
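Histograms can be complemented with a numeric measure of symmetry. The sketch below compares sample skewness before and after each transformation on simulated right-skewed data; `gamelength_sim` and the hand-written `skewness()` helper are assumptions for illustration, not part of the original analysis.

```r
set.seed(42)

# Simulated right-skewed durations (log-normal), standing in for gamelength
gamelength_sim <- rlnorm(1000, meanlog = log(35), sdlog = 0.25)

# Moment-based sample skewness: about 0 for symmetric data, > 0 for right skew
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

sk <- sapply(
  list(raw   = gamelength_sim,
       log10 = log10(gamelength_sim),
       ln    = log(gamelength_sim),
       sqrt  = sqrt(gamelength_sim)),
  skewness
)
round(sk, 3)
```

log10() and log() differ only by a constant factor, so they give the same skewness; only the axis scale of the histogram changes.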