Introduction

Hey all, I want to show you my process and code for retrieving a complete, preprocessed dataset from the Riot API for League of Legends in R. I know there is a Python module for this, but I come from a scientific background and am most at home in R, so I wanted to do it in R and gain the experience. This is my first time working with any API, so I am sure there are more efficient ways to get a complete dataset. If you have any guides or specific topics I should look into to optimize things, let me know.

Side note: I collected the script in RMarkdown, but I ran the requests in a “normal” RStudio session, because RMarkdown was too slow for that. Throughout the process I saved the data from each step, so that I don’t have to send every request again if I need to change something along the way.

Let’s get right into it. In my first request I asked for 100 accounts for each tier and division. Although that can be done very quickly, the API has a rate limit that only allows 100 requests every 120 seconds. To avoid exceeding this limit, which could get me banned, I put 1.3 seconds of system sleep into every iteration. This explains why I am not going to access as much data as I would like to. For now, I ended up with a dataset of ~25k observations as player and 2.5k as unique matches. My steps were the following:

  1. Request 100 summoner accounts from each tier (Iron to Diamond) and division (IV to I) (24 requests)
  2. Request the account id for every account to access match history for these individuals (2400 requests)
  3. Request the last 50 matches for every individual (2400 requests)
  4. Request match information for every individual match (~25k requests)

I want to illustrate how much time it costs to build a large dataset when you don’t have access to a production API key from Riot, whose higher rate limits would remove the need for the 1.3 seconds of system sleep. I have a total of ~30k requests and am only permitted 100 requests every two minutes. That adds up to 600 minutes, or 10 hours, for what is in my opinion a relatively small dataset. But I can work with it, and I ended up with approximately 1000 observations per tier and division, which is okay.
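The estimate above is simple arithmetic and can be checked quickly. This is only a sketch: the request counts are the rounded figures from the list of steps, not exact numbers.

```r
# Rounded request counts from the four steps above
requests <- c(accounts = 24, account_ids = 2400, match_lists = 2400, matches = 25000)
total_requests <- sum(requests)  # 29824, i.e. roughly 30k

# Rate limit: 100 requests per 120 seconds
min_seconds_per_request <- 120 / 100
hours_at_limit <- total_requests * min_seconds_per_request / 3600

# With the 1.3 s pause actually used in the script
hours_with_sleep <- total_requests * 1.3 / 3600

round(c(at_limit = hours_at_limit, with_sleep = hours_with_sleep), 1)
```

The pause of 1.3 seconds sits slightly above the 1.2 seconds the limit strictly requires, so the real run takes a bit longer than the 10 hours at the limit.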

Starting somewhere

Here I loaded all libraries and defined some helper functions: one to access the data through the API and convert it from JSON to a data frame, and another (source) to convert the millisecond timestamps Riot uses for the date and time of a match into a date format. I also created lookup tables from Riot’s Data Dragon to translate the integer identifiers of champions and summoner spells into readable names.

library(httr)
library(jsonlite)
library(data.table)
library(stringr)
library(tidyverse)
library(lemon)
library(knitr)
options(dplyr.print_max = 1e9)
data_path <- "Data/"
riot_api_fetching <- function(x) {
  key <- "censored"
  url <- paste0(x, key)
  json <- GET(url = url)
  raw <- rawToChar(json$content)
  fromJSON(raw)
}
delimiter <- "?api_key="
delimiter2 <- "&api_key="

ms_to_date <- function(ms, t0 = "1970-01-01", timezone = "CET") {
  ## @ms: a numeric vector of milliseconds (big integers of 13 digits)
  ## @t0: a string of the format "yyyy-mm-dd", specifying the date that
  ##      corresponds to 0 millisecond
  ## @timezone: a string specifying a timezone that can be recognized by R
  ## return: a POSIXt vector representing calendar dates and times
  sec <- ms / 1000
  as.POSIXct(sec, origin = t0, tz = timezone)
}

#####Accounts for divisions
tier <- c("IRON", "BRONZE", "SILVER", "GOLD", "PLATINUM", "DIAMOND")
division <- c("IV", "III", "II", "I")

# #Data Dragon
champion_dd <- fromJSON("http://ddragon.leagueoflegends.com/cdn/11.10.1/data/en_US/championFull.json")
spell_dd <- fromJSON("http://ddragon.leagueoflegends.com/cdn/11.10.1/data/en_US/summoner.json")

spell_lookup <- list()
for (i in seq_along(spell_dd$data)) {
  spell <- spell_dd$data[[i]]
  spell_lookup[[spell[["key"]]]] <- str_replace(spell[["id"]], "Summoner", "")
}

load_data <- function(name_df) {
  readRDS(list.files(data_path, pattern = name_df, full.names = TRUE))
}
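The code further down calls file_date() and time_in_hms(), whose definitions are not shown in this post. A minimal sketch of what they might look like follows. The output formats are assumptions: a date stamp for file names and a clock time for progress messages.

```r
# Assumed helper: today's date as a string for file names (format is a guess)
file_date <- function() {
  format(Sys.Date(), "%Y-%m-%d")
}

# Assumed helper: current clock time as "HH:MM:SS" for progress messages
time_in_hms <- function() {
  format(Sys.time(), "%H:%M:%S")
}

paste0("Data/full-accounts-", file_date(), ".RData")
paste("Sleep for two minutes @", time_in_hms())
```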

Fetching the data

I started by retrieving the accounts the API returned for every combination of tier and division. I created every possible combination with expand.grid() and stored each one in a list, which I passed to lapply() to iterate over it and paste the information I wanted into the appropriate URL from the Riot API. This is basically what I do in every request to the API: store the information I want in strings, iterate through them and paste them into the respective URL. Then I bind the results into one data frame with the dplyr function bind_rows().

To get the match history of these individuals I needed the account id, which was not included in my first request for the “random” accounts per tier and division, so I had to request it with each summoner name. Since summoner names containing spaces were not accepted as a valid request, I removed the spaces from the names with str_remove_all(). I then added the account id as a column to the account data frame.

# Accounts ####
accounts_per_divison <- 100

combinations_df <- expand.grid(division, tier)
combinations_list <- list()
for (i in seq_len(nrow(combinations_df))) {
  combinations_list[[i]] <- combinations_df[i, ]
}

accounts <- lapply(combinations_list, function(x) {
  Sys.sleep(1.3)
  division <- x[1, 1]
  tier <- x[1, 2]
  print(x)
  return(riot_api_fetching(
    paste0("https://euw1.api.riotgames.com/lol/league/v4/entries/RANKED_SOLO_5x5/", tier, "/",  division, "?page=1", delimiter2)))
})

# Keep at most the first 100 accounts per tier and division
for (i in seq_along(accounts)) {
  accounts[[i]] <- head(accounts[[i]], accounts_per_divison)
}
full_accounts <- bind_rows(accounts)
saveRDS(full_accounts, file = paste0("Data/full-accounts-", file_date(), ".RData"))

###Join account_id in full_accounts
names <- str_remove_all(full_accounts$summonerName, " ")

# Iterate over positions instead of names, so duplicate names cannot break the progress check
add <- lapply(seq_along(names), function(i) {
  summoner <- riot_api_fetching(paste0("https://euw1.api.riotgames.com/lol/summoner/v4/summoners/by-name/", names[i], delimiter))
  acc_id <- summoner$accountId
  # Progress message every 200 requests
  if (i %% 200 == 0) {
    print(paste(i, "/", length(names), "@", format(Sys.time(), "%H:%M:%S")))
  }
  Sys.sleep(1.3)
  return(acc_id)
})
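Removing the spaces is one workaround. A common alternative for names with spaces or special characters is percent-encoding them before they go into the URL. The sketch below uses utils::URLencode() from base R. The helper name encode_name is hypothetical, and whether the endpoint accepts encoded names is an assumption that I have not tested against the API.

```r
# utils::URLencode() ships with R; reserved = TRUE also encodes characters
# such as "/" and "?" that have a special meaning inside a URL
encode_name <- function(name) {
  utils::URLencode(name, reserved = TRUE)
}

base_url <- "https://euw1.api.riotgames.com/lol/summoner/v4/summoners/by-name/"
example_name <- "High Elo Smurf"  # a summoner name with spaces
request_url <- paste0(base_url, encode_name(example_name))
print(request_url)
```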

There were some account ids that I could not access. I wanted to know whether these were systematic errors or just random problems with the way some summoner names appeared in the dataset. I created a vector that flags for each name whether an account id was returned and plotted it with ggplot() as a scatter plot along the x-axis. As you can see below, there were just some random problems, nothing systematic. I saved the dataset in my project directory and moved on.

# TRUE where an account id was returned, FALSE where the request gave nothing
non_empty <- !vapply(add, is.null, logical(1))

missings <- ggplot(as.data.frame(cbind(non_empty, nrow = seq_along(non_empty))), aes(nrow, as.logical(non_empty))) +
  geom_jitter()
ggsave("summoner-names.png", missings, dpi = 1200, units = "cm", width = 30, height = 20)

to_add <- do.call(rbind, add)
saveRDS(to_add, file = paste0("Data/to_add", file_date(), ".RData"))
full_accounts2 <- full_accounts %>% select(-miniSeries) %>% filter(non_empty) %>% cbind(to_add) %>% rename(acc_id = to_add)
saveRDS(full_accounts2, file = paste0("Data/full-accounts2-", file_date(), ".RData"))

Here are a few lines from the requested joined data.

## Rows: 2,166
## Columns: 14
## $ leagueId     <chr> "cb67c803-7156-466b-b80d-ef989152bcc0", "0e7976f6-68e5...
## $ queueType    <chr> "RANKED_SOLO_5x5", "RANKED_SOLO_5x5", "RANKED_SOLO_5x5...
## $ tier         <chr> "IRON", "IRON", "IRON", "IRON", "IRON", "IRON", "IRON"...
## $ rank         <chr> "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", ...
## $ summonerId   <chr> "fKtnBbrFyGd4Af2ohdpKiEEjIwj4DHqfH1HFkZebW9v-_pnn", "3...
## $ summonerName <chr> "High Elo Smurf", "reprotj", "OBG    Tirex", "AntoLegg...
## $ leaguePoints <int> 60, 21, 0, 75, 44, 47, 77, 69, 19, 92, 6, 15, 37, 61, ...
## $ wins         <int> 436, 6, 91, 110, 70, 32, 54, 41, 23, 47, 207, 7, 24, 1...
## $ losses       <int> 520, 23, 126, 140, 115, 52, 92, 60, 44, 77, 337, 20, 4...
## $ veteran      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ inactive     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ freshBlood   <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ hotStreak    <lgl> TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,...
## $ acc_id       <chr> "ML7UWHUvCDGk86NFqgOPnVmPlPZGBdsP62HtrIbQ2h1HQuA", "td...

My next goal was to obtain match ids from which I could then request match information. I chose to request the match history of many players instead of retrieving many matches for fewer players. I filtered for ranked solo/duo matches (the code for each queue can be seen at http://static.developer.riotgames.com/docs/lol/queues.json) and requested the last 50 matches of each individual, then added the tier and division of the account to each match. I wanted to keep track of the requests, so I used sink() with the split = TRUE argument to store progress information in a .txt file while also printing it live to the R console. The information I printed regularly was how far the requests had progressed, together with the current time. Some accounts don’t play ranked at all, so I wanted to know how often that is the case and printed these out too. I then bound the list into a data frame again with dplyr’s bind_rows() function. Next, I checked for and removed duplicate match ids, because the match histories of different players can overlap. Finally, since time is valuable and I didn’t want to do 120k requests, I reduced the data to one quarter of its original size.

#####     Match History ##################
acc_ids <- as.character(full_accounts2$acc_id)
queue <- "420"
begin_index <- "0"
end_index <- "50"

sink("Sinks/Match_history_abfrage.txt", split = TRUE)
matches <- lapply(seq_along(acc_ids), function(i) {
  Sys.sleep(1.3)
  x <- acc_ids[i]
  y <- riot_api_fetching(paste0("https://euw1.api.riotgames.com/lol/match/v4/matchlists/by-account/", x, "?queue=", queue, "&endIndex=", end_index, "&beginIndex=", begin_index, delimiter2))[1]$matches
  # No match list came back, for example because of an error response
  if (!is.data.frame(y)) {
    print(paste(i, "is data frame: ", is.data.frame(y)))
    return(NULL)
  }
  # Account without any ranked solo/duo games
  if (nrow(y) == 0) {
    print(paste0(i, " nrow is 0"))
    return(NULL)
  }
  # Progress message every 200 requests
  if (i %% 200 == 0) {
    print(paste(i, "/", length(acc_ids), "@", format(Sys.time(), "%H:%M:%S")))
  }
  account <- full_accounts2[full_accounts2$acc_id == x, ]
  return(y %>% mutate(tier = account$tier, division = account$rank))
})

closeAllConnections()

full_matchhistory <- do.call(bind_rows, matches)
full_matchhistory <- full_matchhistory %>% rename(gameid = gameId)
full_matchhistory <- full_matchhistory[!duplicated(full_matchhistory$gameid), ]
#saveRDS(full_matchhistory, file = paste0(data_path, "full-matchhistory", file_date(), ".RData"))


#shortening of match data ####
# Keep every fourth row
one_quarter_vec <- rep(c(TRUE, FALSE, FALSE, FALSE), length.out = nrow(full_matchhistory))
full_matchhistory2 <- full_matchhistory[one_quarter_vec, ]
saveRDS(full_matchhistory2, file = paste0(data_path, "full-matchhistory2-", file_date(), ".RData"))

This is a glimpse of the data frame which contains the match ids.

## Rows: 25,930
## Columns: 10
## $ platformId <chr> "EUW1", "EUW1", "EUW1", "EUW1", "EUW1", "EUW1", "EUW1", ...
## $ gameid     <dbl> 5284028758, 5247333816, 5242023273, 5241753231, 52413797...
## $ champion   <int> 517, 157, 350, 350, 350, 350, 350, 350, 350, 350, 350, 3...
## $ queue      <int> 420, 420, 420, 420, 420, 420, 420, 420, 420, 420, 420, 4...
## $ season     <int> 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, ...
## $ timestamp  <dbl> 1.621783e+12, 1.620059e+12, 1.619821e+12, 1.619815e+12, ...
## $ role       <chr> "DUO", "DUO", "DUO_SUPPORT", "DUO", "DUO_SUPPORT", "DUO_...
## $ lane       <chr> "TOP", "NONE", "BOTTOM", "NONE", "BOTTOM", "BOTTOM", "BO...
## $ tier       <chr> "IRON", "IRON", "IRON", "IRON", "IRON", "IRON", "IRON", ...
## $ division   <chr> "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", "I...

The function

Finally I got to the point where I requested the detailed data for each match. The returned structure was too confusing for me to work with, because there were data frames nested inside the data frame, so I decided to split the data into three main data frames: an info data frame with meta information about each match, a team data frame with the team statistics and a stats data frame with information about each individual player. This was the most difficult part for me to program. The requests ran for a long time and I wasn’t watching R the whole time, so it was important that the loop did not break because of errors, for example if my API key, which is only valid for 24 hours, expired or if one data frame deviated from the usual structure. In a long trial-and-error process I managed to keep the code as flexible as possible, built in different checkpoints and got informative output about which game ids deviate or throw errors, and which errors, without interrupting the whole run. I used two structures for this: one function and one for loop. The function is wrapped in possibly() from the purrr package to catch errors I could not anticipate. The checkpoints were:

  • If-else structure
    • If the length of the returned list is under 13, no complete match was returned but an error code instead. The function then prints the meaning of that error code, which I found in the Riot API documentation, without breaking.
    • Matches with a game duration under 240 seconds are remakes and are not returned.
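possibly() wraps a function so that it returns a default value instead of stopping with an error. To illustrate the idea without purrr, here is a base R sketch built on tryCatch(). It is not how purrr implements it, and the names or_otherwise and risky_sqrt are made up for the example.

```r
# Hypothetical wrapper: return `otherwise` instead of stopping on an error
or_otherwise <- function(f, otherwise) {
  force(f)
  force(otherwise)
  function(...) {
    tryCatch(f(...), error = function(e) otherwise)
  }
}

# Toy function that fails on some inputs, standing in for an API request
risky_sqrt <- function(x) {
  if (x < 0) stop("negative input")
  sqrt(x)
}

safe_sqrt <- or_otherwise(risky_sqrt, otherwise = "Check later")

# The failing call in the middle no longer interrupts the iteration
results <- lapply(c(4, -1, 9), safe_sqrt)
str(results)
```

Because the failing element carries a marker value, it can be found and requested again after the run.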

I looked through one example match and sorted the information into the three data frames (info, team and stats), which I stored in one list with three elements. I looked up the respective tier and division by game id and put them into the info data frame, since this information is nowhere to be found in the match data itself but is a very important variable to account for. I needed do.call() to flatten the retrieved data into one data frame instead of the original structure of data frames inside a data frame.

#Fetching Data ####
x <- list.files(data_path, pattern = "full-matchhistory2", full.names = TRUE)
full_matchhistory2 <- readRDS(x)

# Store possibly() in the object/function or wrap it separately
match_fetching <- possibly(function(x) {
  Sys.sleep(1.2)
  x <- riot_api_fetching(paste0("https://euw1.api.riotgames.com/lol/match/v4/matches/", x, delimiter))
  if (length(x) < 13) {
    if (x$status$status_code == 429) {
      print("429 - Rate limit exceeded")
      print(paste("Sleep for two minutes @", time_in_hms()))
      Sys.sleep(120)
      return(NULL)
    } else if (x$status$status_code == 401) {
      print("401 - Unauthorized")
      return(NULL)
    } else if (x$status$status_code == 403) {
      print("403 - Forbidden invalid API key")
      return("403 - Forbidden invalid API key")
    } else if (any(c(500, 503) %in% x$status$status_code)) {
      print("Server-side")
      return(NULL)
    } else {
      print(paste(x$status$status_code, "Length under 13 without 429, 401, 403, 500, 503 status code"))
      return(NULL)
    }}
  
  if (x$gameDuration < 240) {
    print(paste("gameid:", x$gameId, "; GameDuration < 240s @", time_in_hms()))
    return(NULL)
  } else {
    match_combined <- vector(mode = "list", length = 3)
    names(x)[grep("gameId", names(x))] <- "gameid"
    gameid_pulled <- as.character(x$gameid)
    tier <- full_matchhistory2 %>% filter(gameid == gameid_pulled) %>% pull(tier)
    division <- full_matchhistory2 %>% filter(gameid == gameid_pulled) %>% pull(division)
    match_combined[[1]] <- data.frame(gameid = gameid_pulled, gameDuration = x$gameDuration, gameCreation = x$gameCreation,
                                      tier = tier, division = division)
    
    match_combined[[2]] <- cbind(x$teams %>% select(-bans), gameid_pulled)
    match_combined[[3]] <- cbind(x$participants, gameid_pulled)
    # Flatten the nested data frames; the call is repeated to flatten a second level of nesting
    match_combined[[3]] <- do.call(data.frame, match_combined[[3]])
    match_combined[[3]] <- do.call(data.frame, match_combined[[3]])
  }
  return(match_combined)
}, otherwise = "Check later")

The for loop

The aforementioned function was called inside a for loop that iterates through the game ids. I chose this approach because if the process was terminated, for example by an API key that had exceeded its 24-hour lifetime, I could trace back at which game id it stopped and restart from there instead of starting all over. I also made sure that the data was saved to my project directory every 3k requests. If my function returned the message that my API key was invalid, the for loop stopped the process and saved everything fetched up to that point.

gameids <- as.character(full_matchhistory2$gameid)
start <- 1
to <- length(gameids)
save_every <- 3000
pseudo_iteration <- 1
x <- vector(mode = "list", length = nrow(full_matchhistory2))

sink("Sinks/matches_query.txt", split = TRUE)

for (i in start:to) {
  id <- gameids[i]
  res <- match_fetching(id)
  # list(res) keeps the slot even if res is NULL; x[[k]] <- NULL would delete it
  x[pseudo_iteration] <- list(res)
  if (identical(res, "403 - Forbidden invalid API key")) {
    # API key no longer valid: save what was fetched so far and stop
    filename <- paste0("df", start, "_", i, ".RData")
    to_save <- x[seq_len(pseudo_iteration)]
    saveRDS(to_save, file = paste0(data_path, filename))
    break
  } else if (identical(res, "Check later")) {
    print(paste("Fetching Check later", "gameid:", gameids[i]))
  }
  if (pseudo_iteration %% save_every == 0) {
    filename <- paste0("df", start, "_", i, ".RData")
    print(paste(filename, "normal saveRDS"))
    to_save <- x[1:save_every]
    saveRDS(to_save, file = paste0(data_path, filename))
    rm(list = c("x", "to_save"))
    x <- list()
    start <- i + 1
    pseudo_iteration <- 0
  } else if (i == to) {
    filename <- paste0("df", start, "_", i, ".RData")
    print(paste(filename, i, "saveRDS at the end"))
    to_save <- x[1:pseudo_iteration]
    saveRDS(to_save, file = paste0(data_path, filename))
    rm(list = c("x", "to_save"))
  }
  if (i %% 300 == 0) {
    print(paste(i, "/", length(gameids), "@", time_in_hms()))
  }
  pseudo_iteration <- pseudo_iteration  + 1
}

closeAllConnections()

Finally, only one thing was left to do: join all the saved lists, combine the 1st, 2nd and 3rd elements across all matches and save each of the three resulting data frames individually.
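The code for this joining step is not shown, so the following is only a sketch of how it could be done. It assumes the chunk files contain lists whose entries are either a three-element list (info, team, stats), NULL or a marker string. The helper names combine_matches and make_match and the toy data are made up, and base rbind() is used for simplicity, while bind_rows() would be more forgiving if the columns differ between matches.

```r
# Hypothetical helper: turn a list of per-match results into three data frames
combine_matches <- function(matches) {
  # Keep only successful fetches: plain lists with exactly three elements
  ok <- Filter(function(m) is.list(m) && !is.data.frame(m) && length(m) == 3, matches)
  list(
    info  = do.call(rbind, lapply(ok, `[[`, 1)),
    team  = do.call(rbind, lapply(ok, `[[`, 2)),
    stats = do.call(rbind, lapply(ok, `[[`, 3))
  )
}

# Toy stand-in for one fetched match (made-up columns and values)
make_match <- function(id) {
  list(
    data.frame(gameid = id, gameDuration = 1500),
    data.frame(gameid = id, teamId = c(100, 200)),
    data.frame(gameid = id, participantId = 1:10)
  )
}

# Write two toy chunk files to a temporary folder, mimicking the saved lists
tmp <- file.path(tempdir(), "chunks-demo")
dir.create(tmp, showWarnings = FALSE)
saveRDS(list(make_match("1"), NULL), file.path(tmp, "df1_2.RData"))
saveRDS(list("Check later", make_match("4")), file.path(tmp, "df3_4.RData"))

# Read all chunks back, concatenate them and split them into three data frames
files <- list.files(tmp, pattern = "^df[0-9]+_[0-9]+\\.RData$", full.names = TRUE)
all_matches <- do.call(c, lapply(files, readRDS))
combined <- combine_matches(all_matches)
sapply(combined, nrow)
```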

Preprocessing

Now it was time for the preprocessing, so I could get reliable data to make beautiful, insightful graphs from. The only matches I explicitly excluded during the request process were games with a duration of less than 240 seconds, because these were remakes: a remake can only be requested before the four-minute mark. I started by tidying up the info data frame, which you can see below.

## Rows: 25,282
## Columns: 5
## $ gameid       <chr> "5284028758", "5247333816", "5242023273", "5241379736"...
## $ gameDuration <int> 1631, 1156, 1255, 1480, 1340, 1913, 2261, 1684, 2255, ...
## $ gameCreation <dbl> 1.621783e+12, 1.620059e+12, 1.619821e+12, 1.619805e+12...
## $ tier         <chr> "IRON", "IRON", "IRON", "IRON", "IRON", "IRON", "IRON"...
## $ division     <chr> "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", "IV", ...

For the preprocessing of the info data frame I converted tier and division into factors with their levels in rank order, made sense of the gameCreation variable with the ms_to_date() function I mentioned at the start and converted gameDuration from seconds to minutes.
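To make the timestamp conversion concrete, here is a small standalone sketch that applies the same steps as ms_to_date() to one of the gameCreation values shown above. The value is the rounded one from the printed output, so the resulting time is only approximate.

```r
# A gameCreation value as printed above: milliseconds since 1970-01-01
ms <- 1.621783e12

# Same conversion as in ms_to_date(): milliseconds -> seconds -> POSIXct
created <- as.POSIXct(ms / 1000, origin = "1970-01-01", tz = "CET")

# The same instant, printed in two different time zones
format(created, "%Y-%m-%d %H:%M:%S", tz = "UTC")
format(created, "%Y-%m-%d %H:%M:%S", tz = "CET")
```

The time zone only changes how the instant is displayed, not the underlying value.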

#Preprocessing & Quality Control ####
#info_df ####
library(scales)
info_df <- readRDS(paste0(data_path, list.files(data_path, pattern = "info-df-pre")))
info_df_post <- info_df %>%
  mutate(across(c(tier, division), factor)) %>% 
  mutate(tier = fct_relevel(tier, toupper(c("iron", "bronze", "silver", "gold", "platinum", "diamond"))), 
         division = fct_relevel(division, c("IV", "III", "II", "I")),
         gameCreation = ms_to_date(info_df$gameCreation), gameDuration = round(gameDuration / 60, 1))
str(info_df_post)
saveRDS(info_df_post, file = paste0(data_path, "info-df-post-", file_date(), ".RData"))

We can take a look at the overall distribution of game length and see that the average game lasts approximately half an hour. It ranges from 7.5 minutes (probably in high elo games) to a whopping 90 minutes (probably in low elo games). I include one histogram with the whole data and one with outliers excluded.

info_df <- readRDS(list.files(data_path, pattern = "info-df-post", full.names = TRUE))
duration_hist <- function(data, title_plot) {
  ggplot(data, aes(x = gameDuration)) +
    geom_histogram(aes(y = ..density..), color = "black", fill = "white", size = 2) +
    geom_density(fill = "blue", alpha = 0.2) +
    theme_bw() +
    geom_boxplot(aes(x = gameDuration, y = 0.065), width = 0.006) +
    stat_summary(geom = "text", fun = quantile,
                 aes(label = ..x.., y = 0.07), size = 2.5, orientation = "y") +
    labs(title = title_plot) +
    theme(plot.title = element_text(hjust = 0.5))
}

duration_hist(info_df, "Histogram of gamelength")

x <- quantile(info_df$gameDuration, c(0.25, 0.75))
iqr <- x[2] - x[1]
info_df_outlier <- info_df %>% filter(gameDuration > x[1]-(1.5*iqr) & gameDuration < x[2]+(1.5*iqr))

duration_hist(info_df_outlier, "Histogram of gamelength without outliers")
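The filter above keeps only values inside the 1.5 × IQR fences, the same rule a boxplot uses for its whiskers. A small worked example with made-up durations shows which values survive:

```r
# Toy game durations in minutes (made-up values for illustration)
durations <- c(22, 25, 27, 28, 30, 31, 33, 35, 38, 90)

q <- quantile(durations, c(0.25, 0.75))
iqr <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr

# Keep everything strictly inside the fences
kept <- durations[durations > lower & durations < upper]
kept
```

Here only the 90-minute game falls outside the fences and is dropped.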

Since high elo players know when a game is over and make fewer mistakes that dramatically change its outcome, it isn’t surprising that games get shorter the higher the elo. Below we see a slight decrease in game length as the tier goes up from Iron to Diamond, reflected in a decreasing trend in all summary statistics: min, 25th percentile, median, 75th percentile and max. Again I present histograms for the whole data and with outliers excluded.

#break down gameduration by tier
duration_tier_hist <- function(data, title) {
  ggplot(data, aes(x = gameDuration)) +
  geom_histogram(aes(y = ..density..), color = "black", fill = "white", size = 2) +
  geom_density(fill = "blue", alpha = 0.2) +
  theme_bw() +
  geom_boxplot(aes(x = gameDuration, y = 0.075), width = 0.006) +
  stat_summary(geom = "text", fun = quantile,
               aes(label = ..x.., y = 0.085), size = 2.5, orientation = "y") +
  labs(title = title) +
  facet_grid(rows = vars(tier))
}

duration_tier_hist(info_df, "With outliers")

#gameduration without outliers per tier
is_within_iqr_fences <- function(x) {
  # TRUE for values inside the 1.5 * IQR fences, i.e. for the non-outliers
  q <- quantile(x, c(0.25, 0.75))
  iqr <- diff(q)
  return((x > q[1] - 1.5 * iqr) & (x < q[2] + 1.5 * iqr))
}

info_df_outliers <- info_df %>% 
  group_by(tier) %>% 
  filter(is_within_iqr_fences(gameDuration))

duration_tier_hist(info_df_outliers, "Outliers removed")

We see we have ~1k observations per tier and division.

info_df_post <- load_data("info-df-post")
table(info_df_post$tier, info_df_post$division)
##           
##              IV  III   II    I
##   IRON     1117 1062 1064 1042
##   BRONZE   1023 1015  994 1047
##   SILVER   1114 1087 1050 1043
##   GOLD     1021 1070 1164 1019
##   PLATINUM 1026 1016 1080 1033
##   DIAMOND  1033 1119 1002 1041

I assumed that games with less than 10 minutes of playtime were won by forfeit, mostly because of AFK teammates, and wanted to exclude them. But there were none.

info_path <- list.files(data_path, pattern = "info-df-post", full.names = TRUE)
stats_path <- list.files(data_path, pattern = "stats-df-pre", full.names = TRUE)
team_path <- list.files(data_path, pattern = "team-df-pre", full.names = TRUE)
info_df_post <- readRDS(info_path)
stats_df_pre <- readRDS(stats_path)
team_df_pre <- readRDS(team_path)

under_10 <- info_df_post %>% filter(gameDuration < 10)
team_df_pre %>% filter(gameid %in% under_10$gameid) %>% left_join(under_10 %>% select(gameid, gameDuration), by = "gameid")
stats_df_pre %>% filter(gameid %in% under_10$gameid) %>% left_join(under_10 %>% select(gameid, gameDuration), by = "gameid")

saveRDS(info_df_post, file = info_path)
saveRDS(stats_df_pre, file = stats_path)
saveRDS(team_df_pre, file = team_path)

For the preprocessing of the stats data frame I had to look through a lot of variables. Mainly I converted variables such as champion or summoner spell into factors and recoded them into readable champion and summoner spell names with the previously mentioned lookup tables. Riot’s Data Dragon contains summoner spells called “Boost” and “Haste”; I figured that these are Cleanse and Ghost, so I replaced them with these names. Finally I excluded some variables I couldn’t make sense of, but I may revisit this step later, for example for a data-driven analysis method.

#stats_df ####
stats_df <- readRDS(paste0(data_path, list.files(data_path, pattern = "stats-df-pre")))

stats_df_post <- stats_df %>% mutate(across(contains("participant"), factor),
                                     championId = factor(championId),
                                     teamId = factor(teamId),
                                     across(contains("spell"), factor),
                                     across(c("championId", contains("spell")), as.character))

ch_ids <- stats_df_post$championId
for (i in unique(ch_ids)) {
  stats_df_post$championId[ch_ids %in% i] <- champion_dd$keys[[i]]
}

summs1_ids <- stats_df_post$spell1Id
summs2_ids <- stats_df_post$spell2Id
for (i in union(summs1_ids, summs2_ids)) {
  stats_df_post$spell1Id[summs1_ids %in% i] <- spell_lookup[[i]]
  stats_df_post$spell2Id[summs2_ids %in% i] <- spell_lookup[[i]]
}

stats_df_post$spell1Id <- str_replace_all(stats_df_post$spell1Id, "Boost", "Cleanse")
stats_df_post$spell1Id <- str_replace_all(stats_df_post$spell1Id, "Haste", "Ghost")

stats_df_post$spell2Id <- str_replace_all(stats_df_post$spell2Id, "Boost", "Cleanse")
stats_df_post$spell2Id <- str_replace_all(stats_df_post$spell2Id, "Haste", "Ghost")
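The loops above translate one id at a time. As an alternative sketch, the same recoding can be done in one step by indexing a named character vector. The lookup entries and the id column below are illustrative toy values, not the full Data Dragon table. With the real data, something like unlist(spell_lookup) should give such a named vector.

```r
# Toy lookup: Data Dragon key -> spell id without the "Summoner" prefix
# (a few illustrative entries, not the full table)
spell_names <- c("4" = "Flash", "14" = "Dot", "1" = "Boost", "6" = "Haste")

# Readable replacements for the two internal names mentioned in the text
rename_map <- c(Boost = "Cleanse", Haste = "Ghost")

spell_ids <- c("4", "1", "6", "4", "14")  # made-up column of spell ids

# Index the named vector by id to translate the whole column at once
readable <- unname(spell_names[spell_ids])

# Replace only the entries that have a nicer name
needs_rename <- readable %in% names(rename_map)
readable[needs_rename] <- unname(rename_map[readable[needs_rename]])

readable
```

Ids that are missing from the lookup come back as NA here, which makes them easy to spot.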

sink(paste0("Processing-stuff/stats_summary", file_date(), ".txt"))
options(max.print = 1000000000)
print(summary(stats_df_post))
closeAllConnections()
options(max.print = 60)

stats_df_post <- stats_df_post %>% mutate(across(contains("spell"), factor),
                                          championId = factor(championId),
                                          stats.longestTimeSpentLiving = stats.longestTimeSpentLiving / 60,
                                          timeline.role = factor(timeline.role),
                                          timeline.lane = factor(timeline.lane)) %>%
                                  select(-c(stats.unrealKills, stats.sightWardsBoughtInGame, stats.combatPlayerScore, stats.objectivePlayerScore,
                                            stats.totalPlayerScore, stats.totalScoreRank, highestAchievedSeasonTier, starts_with("X0")))
                                
str(stats_df_post, list.len = ncol(stats_df_post))

saveRDS(stats_df_post, file = paste0(data_path, "stats-df-post-", file_date(), ".RData"))

In the team data frame there were only a few variables which needed to be changed to their correct datatype and two outdated variables were kicked out.

# Team_df ####
team_df <- readRDS(paste0(data_path, list.files(data_path, pattern = "team-df-pre")))
str(team_df, list.len = ncol(team_df))
team_df_post <- team_df %>% mutate(win = factor(win),
                                   teamId = factor(teamId)) %>% 
  select(-vilemawKills, -dominionVictoryScore)

sink(paste0("Processing-stuff/team_summary", file_date(), ".txt"))
options(max.print = 1000000000)
print(summary(team_df_post))
closeAllConnections()
options(max.print = 60)

saveRDS(team_df_post, file = paste0(data_path, "team-df-post-", file_date(), ".RData"))

Finally I matched the common gameids between all three datasets and worked with them.

#All data frames with the same ids ####
team_path <- list.files(data_path, pattern = "team-df-post", full.names = TRUE)
stats_path <- list.files(data_path, pattern = "stats-df-post", full.names = TRUE)
info_path <- list.files(data_path, pattern = "info-df-post", full.names = TRUE)
team_post <- readRDS(team_path)
stats_post <- readRDS(stats_path)
info_post <- readRDS(info_path)

# Game ids that are present in all three preprocessed data frames
common_ids <- intersect(intersect(team_post$gameid, stats_post$gameid), info_post$gameid)

# Filter the preprocessed (post) data frames, not the raw ones
team_df_post <- team_post %>% filter(gameid %in% common_ids) %>% 
  filter(!gameid %in% under_10$gameid)
info_df_post <- info_post %>% filter(gameid %in% common_ids) %>% 
  filter(!gameid %in% under_10$gameid)
stats_df_post <- stats_post %>% filter(gameid %in% common_ids) %>% 
  filter(!gameid %in% under_10$gameid)

saveRDS(team_df_post, file = paste0(data_path, "team-df-post-", file_date(), ".RData"))
saveRDS(stats_df_post, file = paste0(data_path, "stats-df-post-", file_date(), ".RData"))
saveRDS(info_df_post, file = paste0(data_path, "info-df-post-", file_date(), ".RData"))

I finally ended up with 25,282 matches, and here is the pipeline of the observation numbers as “actual observations/possible number of observations”.

My next step will be to do some exploratory analysis on winrates on specific champs depending on some variables like getting first blood or first tower. See you then. Please feel free to educate me.