World of Warships is a niche team-based online game. Like many games, it has a publicly accessible API giving access to a reasonable set of player performance statistics, which are used in community-created metrics, and displayed on some community sites, suchs as http://www.wows-numbers.com.
These widely available statistics are available only for more ‘casual’ game modes. There is an ongoing (seasonal) competitive game mode with player-determined teams, a limited map pool, and no publicly available statistics bar team-wide aggregate winrates, and ladder position.
This project has been focused on trying to aggregate what data exists, piping it into SQL, and then using it to generate otherwise unavailable statistics, and hopefully useful insights.
This writeup is half documentation, and half portfolio demonstration for anyone interested (or bored) enough to look closely, covering the scraping and primary ETL pipeline written in R feeding in to MySQL.
The below is probably due for a bit of a rewrite: continuous scraping through each game ‘prime time’ to give semi-live updates rather than single-run ‘end of session’ scrapes. Per-season updates could do with a rewrite or update as well.
Package management is handled by pacman, with a quick check to ensure that it is installed. I don’t use much for this: tidyverse primarily for the dplyr grammar, jsonlite to interpret the web and api-returned objects, RMariaDB which is a wrapper for the MySQL ODBC driver enabling me to push to SQL (though really, I should be rewriting some of the wrapped functions with DBI ‘basics’ as there is something catastrophically slow about whatever RMariaDB does to insert scripts), lubridate for slightly nicer date/time handling (which may be an unused holdover from an older version) and httr, which may well be unneeded on top of jsonlite - again, a holdover. Oh well.
Instead of keeping usernames and passwords hardcoded, I pull them from a CSV so I can share this ‘intact’. The settings include an application ID for the vendor API, connection information for the MySQL server housing the data, and a list of user authentication tokens which I’ll cover below.
if (!require("pacman")){install.packages("pacman")}
pacman::p_load(tidyverse, jsonlite, RMariaDB, lubridate, httr)
rm(list=ls())
settings <- read_csv("settings.csv",show_col_types = FALSE)
tokens <- read_csv("tokens.csv",show_col_types = FALSE)
appID <- settings %>%
filter(setting == 'appID') %>%
.$value
conn <- dbConnect(MariaDB(),
host = settings %>% filter(setting == 'host') %>% .$value,
port = settings %>% filter(setting == 'port') %>% .$value,
user = settings %>% filter(setting == 'user') %>% .$value,
password = settings %>% filter(setting == 'password') %>% .$value,
dbname = settings %>% filter(setting == 'database') %>% .$value)
rm(settings)
The primary ‘extraction’ portion of the ETL routine is, in this case, scraping the game data on a nightly basis from the api endpoint backend of a semi-public vendor website. I say semi-public in that these are available only to logged in members of the related team structure (clan), which is what the authentication tokens mentioned above are: these are the account-associated cookies that tell the vendor’s website that the user is logged in and ‘allowed’ to see this content.
As I handle data for a number of teams, I have a number of these tokens. They do not seem to have an expiry date, but will be expired if the user ever manually logs out of the vendor web site - potentially, a new one is generated when they log in rather than the old one being expired when they log out. I do not actually know which. People giving me these is actually a substantial exercise of trust, as simply pasting someone else’s token into my cookies gives me full access to their account. I do my best to explain or demonstrate this to people before they give me a token. I also don’t mess with their account, because doing so would impair my ability to collect data, would offer me no benefit at all, and would be just plain mean.
There are two ratings (sub teams, I guess) available to each clan. Each rating’s history stores the last 50 games played. On average, there are 16 games played by a team on any given evening, and clans can have multiple teams playing simultaneously. This means there is a theoretical ‘highly active clan’ case where I would need to scrape more than nightly (which would be easily managed with while() or a cron job), in general, scraping on a nightly basis gives me all games without losing any. This does mean that adding new clans mid-season tends to give an incomplete picture, but there is no way around this.
I save an aggregate of the raw scraped JSON at this stage just in case, and grab the current season number for checks, below.
for(wsauth_token in tokens$token){
for(team in c(1,2)){
temp <- vector(mode = "list")
try(temp <- jsonlite::fromJSON(rawToChar(GET(paste("https://clans.worldofwarships.eu/api/ladder/battles/?team=",team,sep=''),
set_cookies(`wsauth_token`=wsauth_token))$content)), silent=TRUE)
if(length(temp) > 0){
if(exists("newCBs")){newCBs <- union(newCBs,temp)
}else{newCBs <- temp}}}}
season <- newCBs$season_number[1]
write_json(newCBs,paste("cbJSON - ", Sys.Date(),".json",sep=''))
Alongside the nightly scrapes of games played, there are a few reference tables to keep up-to-date - ship lists, map lists, season information, and (pending) player historic statistics.
This is a simple check of the ‘season’ field from a current scraped game against MAX(season_number) from the SQL database. If the newly scraped games have a higher season_number than the database, the script updates all of the reference tables. This could be accomplished differently: the season reference table includes start and end dates, so I could also compare Sys.Date() to the latest recorded season end_date to accomplish the same without needing a new game. Contextually, these approaches should yield identical results. It is also theoretically possible that this will under-update the ships list. Generally, ships are added to the API reference months in advance of the patches that add them to the game proper, and not much is changed during seasons, but this isn’t an officially static state, just the way things have tended to work. Manual intervention to update these is possible, but also trivial, and the impact of slightly out of date reference tables is trivially small, and quickly spotted and remedied.
I have a separate script which scrapes historic clanbattle performance for all clans seen which I would like to add in here, but I haven’t gotten around to it, partially because the value of historic performance data could be questionable when I already have historic seasonal ladder data integrated.
check <- tbl(conn,"seasonList") %>%
slice_max(finish_time) %>%
select(season_number) %>%
collect()
if(nrow(check) == 0 || season != check$season_number[1]){
# Update shipList ------
shipURL=paste("https://api.worldofwarships.eu/wows/encyclopedia/ships/?application_id=",appID,"&fields=name%2Cship_id%2Ctier%2Ctype%2Cnation",sep="")
shipTemp <- jsonlite::fromJSON(shipURL)
for (n in 1:shipTemp$meta$page_total){
url <- paste(shipURL,"&page_no=",n,sep="")
if (exists("shipsTemp")){
shipsTemp <- union(shipsTemp,jsonlite::fromJSON(url)$data)
} else {
shipsTemp <- jsonlite::fromJSON(url)$data
}
}
dedupe <- tbl(conn, "shipList") %>%
select(SID) %>%
collect()
shipList <- data.frame(matrix(unlist(shipsTemp),nrow=length(shipsTemp),byrow=TRUE)) %>%
rename("Tier"=X1,
"SID"=X2,
"Class"=X3,
"Ship"=X4,
"Nation"=X5)
shipList <- shipList %>%
mutate(Class = case_when(Class == "Cruiser" ~ "CA",
Class == "Battleship" ~ "BB",
Class == "AirCarrier" ~ "CV",
Class == "Destroyer" ~ "DD"),
ship_name=iconv(Ship, from = "UTF-8", to = "ASCII//TRANSLIT"),
Trunc = str_trunc(ship_name,5,ellipsis=""))
shipListUpd <- shipList %>%
filter(!SID %in% dedupe$SID)
if(nrow(shipListUpd) > 0){dbWriteTable(conn,"shipList",shipList,append=TRUE)}else{print("No new ships")}
# Update seasonList ------
seasonURL <- paste("https://api.worldofwarships.eu/wows/clans/season/?application_id=",appID,"&fields=start_time%2Cfinish_time%2Cseason_id%2C+name%2C+ship_tier_min%2C+ship_tier_max",sep="")
seasonTemp <- jsonlite::fromJSON(seasonURL)$data
seasonList <- data.frame(matrix(unlist(seasonTemp),nrow=length(seasonTemp),byrow=TRUE)) %>%
rename("season_number"=X6,
"season_name"=X1,
"start_time"=X5,
"finish_time"=X3,
"ship_tier_min"=X4,
"ship_tier_max"=X2) %>%
mutate(ship_tier_min=as.numeric(ship_tier_min),
ship_tier_max=as.numeric(ship_tier_max),
season_number=as.numeric(season_number),
start_time=as_datetime(as.numeric(start_time)),
finish_time=as_datetime(as.numeric(finish_time)))
dedupe <- tbl(conn, "seasonList") %>%
select(season_number) %>%
collect()
seasonListUpd <- seasonList %>%
filter(!season_number %in% dedupe$season_number)
if(nrow(seasonListUpd) > 0){
dbWriteTable(conn,"seasonList",seasonListUpd,append=TRUE)
dbExecute(conn,"CALL update_seasonList")
}else{print("No new seasons")}
# Update Maptable -----
url <- paste("https://api.worldofwarships.eu/wows/encyclopedia/battlearenas/?application_id=",appID,sep="")
mapTemp <- jsonlite::fromJSON(url)$data
mapTable <- tibble("map_id"=1:length(mapTemp),"map_name"="")
for(i in 1:length(mapTemp)){
mapTable$map_id[i] <- mapTemp[[i]]$battle_arena_id
mapTable$map_name[i] <- mapTemp[[i]]$name
}
rm(i,mapTemp,url)
map_dedupe <- tbl(conn,"mapList") %>% collect()
mapTable <- mapTable %>% filter(!map_id %in% map_dedupe$map_id)
dbWriteTable(conn,"mapList",mapTable,append=TRUE)
}
There are multiple ‘generations’ of script throughout - if something from an older version was reasonably vectorised or I couldn’t see a nicer way to do things, I pretty much just left it. The reference tables are small enough in general that awkward loops (see: the map table above) aren’t too much of a performance hit, especially given how infrequently they are run. I found that, given how almost all of these JSON objects are structured, R’s JSON-handling packages (or, at least, jsonlite) could not gracefully flatten them, though this is an area I should probably revisit.
These scripts also still use the catastrophically slow dbWriteTable() and similar RMariaDB functions. At some point, I will go back and rewrite all of these using the much faster but slightly finicky DBI commands.
The real meat of this ETL process is in handling the scraped games. As stated above, the accessible lists contain the last 50 games played, so deduplication is necessary. This is especially true when scraping/recovering from flat files.
I handle this by grabbing a list of season numbers present in the ingested data, grabbing game IDs for all games from those seasons from my database, and deduplicating in a relatively brute force way.
As game IDs are sequential, I could theoretically deduplicate by grabbing the MAX() value only, but as I tend to add more clans over time, and occasionally have to replace de-authorised tokens, this method would be lossy without significant manual intervention. It could be a very substantial performance improvement if the list of clans I scraped were static, or if I deduplicated as I scraped the data for the first time instead. This isn’t a guaranteed improvement, but a situational one, so I have retained the ‘brute force’ method that gives the correct output in all cases instead of more ‘technically correct’ solutions which would result in data issues.
Another option, which I am more inclined toward the more I work with SQL, is to flatten everything from JSON in R, but otherwise dump it in a relatively untransformed state into staging tables and then handle all further transformation with stored procedures. If I were doing this from scratch, I would likely take that approach in most cases.
seasons <- newCBs %>% select(season_number) %>% distinct()
dedupe <- tbl(conn,"cbGames") %>%
filter(season_number %in% local(seasons$season_number)) %>%
select(id) %>%
collect()
newCBs <- newCBs %>%
filter(!id %in% dedupe$id)
The data itself comes in hideously nested JSON objects that R struggles with a little bit.
At the top level, each game is a single row with a unique game ID, and some overall information - the map, the time, etc. There is a ‘teams’ column containing two rows of its own, with team result ids, clan identifiers, results, and similar. Each ‘teams’ entry contains a seven row object of players on that team, the ships they were in, whether they survived, etc. This results in one game entry, two team entries, and fourteen player entries per match.
The first round of denesting is done with a clumsy loop, primarily because there is no key within ‘team’ objects that relates back to the games. The team objects have a result_id, which the player objects have as well, so these can be effortlessly pulled apart, but each team object has to have the related game ID appended to it. I suspect this is a place where if I played around with some flatten() or unnest() type functions I could vectorise this. At an average of 15 rows per clan per evening, this would only result in a fairly modest performance gain, but it would make me feel good about myself.
Everything else is relatively straightforward - pull the data apart into a variety of dataframes, strip actively useless data (mostly, ship image URLs and colour information that exists for the web interface but gives no actual information) and then update tables in SQL.
The one exception is grabbing historic clan ladder details. For this, I check the existing SQL tables and identify clans I have not seen yet this season, then construct an API call to grab that clan’s peak ladder position in the previous season.
# Single sad loop to get through the first layer of JSON nesting ------
for(i in 1:nrow(newCBs)){
cbTeamsTemp <- as_tibble(newCBs$teams[[i]],.name_repair="universal") %>% mutate(GID=newCBs$id[i])
cbPlayersTemp <- union(as_tibble(cbTeamsTemp$players[[1]],.name_repair="universal"), as_tibble(cbTeamsTemp$players[[2]],.name_repair="universal"))
cbTeamsTemp <- cbTeamsTemp %>% select(-players)
if(exists("cbTeams")){
cbTeams <- bind_rows(cbTeams, cbTeamsTemp)
cbPlayers <- union(cbPlayers, cbPlayersTemp)
} else {
cbTeams <- cbTeamsTemp
cbPlayers <- cbPlayersTemp
}
}
cbTeams <- cbTeams %>%
left_join(newCBs %>% select(id,season_number), by=c("GID"="id"))
cbPlayers <- cbPlayers %>%
mutate(ship_name=ship$name, tier=ship$level) %>%
select(-ship)
cbGames <- newCBs %>%
select(-teams) %>%
distinct() %>%
mutate(map_name=map$name,finished_at=as_datetime(finished_at)) %>%
select(-map)
stages <- cbTeams$stage %>%
distinct() %>%
select(-progress)
cbClans <- cbTeams$claninfo %>%
select(-hex_color,-members_count,-color,-disbanded) %>%
left_join(cbTeams %>% select(id,clan_id, season_number, GID), by=c("id"="id")) %>%
select(-id,-GID) %>%
distinct() # Append Last CB Season Result ------
temp <- tbl(conn,"cbClans") %>%
collect() %>%
filter(season_number %in% seasons$season_number)
cbClans <- cbClans %>% setdiff(temp %>% select(-last_season))
last_season_ratings <- cbClans %>%
select(clan_id) %>%
mutate(last_season=0)
if((nrow(cbClans)>0)){
for(i in 1:nrow(cbClans)){
LS_temp <- cbClans$season_number[i]-1
realm_temp <- ifelse(cbClans$realm[i]=="us","com",ifelse(cbClans$realm[i]=="sg","asia",cbClans$realm[i]))
cid <- cbClans$clan_id[i]
mpr <- fromJSON(rawToChar(GET(paste("https://clans.worldofwarships.",realm_temp,"/api/clanbase/",cid,"/claninfo/",sep=''))$content))$clanview$wows_ladder$ratings %>%
filter(season_number==LS_temp, is_best_season_rating==TRUE) %>%
select(max_public_rating)
last_season_ratings$last_season[i] <- ifelse(length(mpr[[1]])>0, mpr[[1]],0)
}
cbClans <- cbClans %>%
left_join(last_season_ratings, by="clan_id")
dbWriteTable(conn,"cbClans",cbClans,append=TRUE)
}
cbTeams <- cbTeams %>%
distinct() %>%
select(-stage,-claninfo,-season_number) %>%
mutate(spawn=2-(id %% 2))
stages <- stages %>%
na.omit()
if(nrow(cbGames)>0){
dbWriteTable(conn,"cbGames",cbGames,append=TRUE)
dbWriteTable(conn,"cbPlayers",cbPlayers,append=TRUE)
dbWriteTable(conn,"cbTeams",cbTeams,append=TRUE)
}
if(nrow(stages)>0){
dbWriteTable(conn,"stages",stages,append=TRUE)
}