Overview

World of Warships is a niche team-based online game. Like many games, it has a publicly accessible API giving access to a reasonable set of player performance statistics, which are used in community-created metrics and displayed on community sites such as http://www.wows-numbers.com.

These statistics, however, cover only the more ‘casual’ game modes. There is an ongoing (seasonal) competitive game mode with player-determined teams, a limited map pool, and no publicly available statistics bar team-wide aggregate win rates and ladder position. My project focuses on remedying this and providing useful performance statistics to competitive teams.

In this case, a source of aggregate competitive statistics was identified through aggressive exploration of the vendor website and game client (watching web traffic with Wireshark and poking at browser developer tools to identify public API endpoints), and I wanted to see if I could make something worthwhile out of it. Turns out, I could.

Starting Point - What Exists?

The endpoint identified contains per-season aggregate statistics for all players in a clan. These aren’t massively involved - number of games played, average damage, and average frags - but I figured that if I could scrape these as games were being played, capturing each game, I could do some simple math to derive a player’s performance stats in that game: \(dam_{game}=(dam_{avg} \cdot games) - (LAG(dam_{avg})\cdot (games-1))\) - the total damage after this game (\(games \cdot dam_{avg}\)), less the total damage as of the previous game, must be the amount of damage dealt this game. Frags can be derived identically. The only ‘technical’ problem seemed to be grabbing every game, which ended up not really being much of a problem at all.
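To make the arithmetic concrete with some invented numbers: if a player sits at 10 games with an 80,000 damage average, and the next scrape shows 11 games at an 81,000 average, then

\[dam_{game} = 81{,}000 \cdot 11 - 80{,}000 \cdot 10 = 891{,}000 - 800{,}000 = 91{,}000\]

so that player dealt 91,000 damage in the game just played.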

There are, however, a multitude of problems after the scraping point, which I will discuss below. Namely: the timestamps do not match other timestamps used by the vendor; there are no game IDs; despite being ‘bound’ to an individual clan, the aggregates ‘follow’ the player if they change clans; players can play with clans other than those they belong to; and so forth.

Scraping the Games

First things first, breaking down the URL - https://clans.worldofwarships.eu/api/members/500205591/?battle_type=cvc&season=19 - three things stand out. First, the top-level domain, .eu in this case, which changes based on the realm in which a clan is based, so has to be considered. Second, the 9-digit numeric string, the clan_id. Third, the season number. Playing with all of these, I can grab data from whatever clans I want, as these aggregates are actually public rather than needing some kind of authentication token. At first, I ignored this beyond it being slightly more convenient, but as I moved forward it became more interesting, as I will detail later.
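As a minimal sketch, a helper like the following builds the endpoint from those three parts (the realm-to-domain mapping here is my assumption - only the EU domain is confirmed by my own use, and the script below hardcodes it):

cb_url <- function(clan_id, season, realm = "eu") {
  # Assumed realm -> domain mapping; only "eu" is used in the script below
  domain <- switch(realm,
                   eu   = "clans.worldofwarships.eu",
                   na   = "clans.worldofwarships.com",
                   asia = "clans.worldofwarships.asia")
  paste0("https://", domain, "/api/members/", clan_id,
         "/?battle_type=cvc&season=", season)
}

cb_url(500205591, 19)
# [1] "https://clans.worldofwarships.eu/api/members/500205591/?battle_type=cvc&season=19"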

# Scrape CB aggregate stats 'live' to derive per-player per-game performance metrics
#
# Michael Zorzella 
#
# Version     Action                            Date    
# 1.0         Created                           22-12-22     
# 1.1         Baseline from SQL                 24-12-22
# 1.2         Dump into staging table live      26-12-22
#             Handle deduplication in SQL 

## Setup -----
pacman::p_load(tidyverse, jsonlite, RMariaDB, lubridate, httr, DBI)
rm(list=ls())


settings <- read_csv("settings.csv",show_col_types = FALSE)
url1 <- "https://clans.worldofwarships.eu/api/members/"
url2 <- "/?battle_type=cvc&season="

appID <- settings %>%
  filter(setting == 'appID') %>%
  .$value

conn <- dbConnect(MariaDB(),
                  host = settings %>% filter(setting == 'host') %>% .$value,
                  port = settings %>% filter(setting == 'port') %>% .$value,
                  user = settings %>% filter(setting == 'user') %>% .$value,
                  password = settings %>% filter(setting == 'password') %>% .$value,
                  dbname = settings %>% filter(setting == 'database') %>% .$value)

rm(settings)

# Identify the current season: today falls within its start/finish window
season <- tbl(conn,"seasonList") %>% 
  filter(start_time <= local(Sys.Date()) & finish_time >= local(Sys.Date())) %>% 
  select(season_number) %>% 
  collect() %>% 
  .$season_number

clans <- tbl(conn,"trackedClans") %>%
  filter(season == local(season)) %>%
  select(tag,clan_id) %>%
  collect()

## Scrape Live -----

# Ghetto crontab: 64800 s = 18:00, when the competitive session opens
if(hour(Sys.time()) < 18){
  untilthen <- 64800 - (hour(now())*3600+minute(now())*60+second(now()))
  print(paste('Waiting',untilthen,'seconds'))
  Sys.sleep(untilthen)
}

# Loop over the whole session window (18:00 - 22:20)
while(hour(Sys.time()) >= 18 && (hour(Sys.time()) < 22 || (hour(Sys.time()) == 22 && minute(Sys.time()) < 20))){

  ### Scrape -----
  for(i in 1:nrow(clans)){
    scrape <- jsonlite::fromJSON(paste(url1,clans$clan_id[i],url2,season,sep=''))$items %>% 
      select(name, 
             id, 
             battles_count, 
             last_battle_time, 
             frags_per_battle, 
             damage_per_battle) %>% 
      filter(battles_count > 0) %>%
      mutate(last_battle_time = as.character(as_datetime(last_battle_time)),
             tag=clans$tag[i], 
             season_number=season)
    
    ### Update SQL and fire stored procedure if appropriate -----
    if(nrow(scrape)>0){
      query <- DBI::sqlAppendTable(conn,"cbLiveStage",scrape,row.names=FALSE)
      res <- dbSendStatement(conn,query)
      print(paste(dbGetRowsAffected(res), 'new', clans$tag[i],'rows at',Sys.time()))
      dbClearResult(res)
      write.table(scrape, file=paste(Sys.Date(),"cbStatsLive.csv"),append=TRUE,col.names=FALSE)
      dbExecute(conn,"CALL update_cbLive")

      #### Save all updates in DF -----
      #if(exists("cbStatsLive")){
      #  cbStatsLive <- scrape
      #  } else {
      #  cbStatsLive <- union(cbStatsLive,scrape)
      #  }
      }
    }
  
  ### Scrape every 5 minutes -----
  Sys.sleep(300)
  }

## Save raw locally (only fires if the in-memory accumulation above is re-enabled) -----
if(exists("cbStatsLive")){
  write.csv(cbStatsLive,paste(Sys.Date(),"cbStatsLive.csv"))
}

The above is my local working copy rather than what I have scheduled, hence the ghetto-crontab Sys.sleep(untilthen) delaying the run until the competitive session has actually started.

What this script isn’t is complicated. In earlier iterations, I tried to deduplicate much more in R, but this proved difficult, primarily because the last_battle_time timestamp is… slightly insane. Instead of tracking the last competitive battle, it tracks the last battle on the account in any mode, leading to mountains of technically-not-duplicate duplicates. This isn’t insurmountable, I just find it much easier to handle in SQL, so I opted to dump everything into a staging table and call a stored procedure to deduplicate and merge results into the main record.

There are also minor annoyances looking at this script now: I upload and call the stored procedure for each clan rather than once per scrape cycle. This should be easily remedied with an additional union() and de-nesting the upload by one ‘level’ of the loop, as sketched below. As of the time of writing there are no timing problems (it takes about 1s per clan per loop to execute, and most of that is waiting for the web response), so there is no immediate benefit to making this fix; it would just be ‘better’ if I did. Also, the print() in each loop is not very informative - because I do not deduplicate in R, the number of new rows each clan adds is simply the number of players who have played any clan battles. It serves to show me that everything is working, which I’m happy enough with, but without a specific goal in mind there is no real purpose served by changing it.
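For what it’s worth, that fix would look something like the following sketch, where scrape_clan() is a hypothetical wrapper around the fromJSON() pipeline from the script above:

# Collect every clan's rows first, then upload and deduplicate once per cycle
batch <- bind_rows(lapply(seq_len(nrow(clans)), function(i) {
  scrape_clan(clans$clan_id[i], clans$tag[i], season)  # hypothetical wrapper
}))

if (nrow(batch) > 0) {
  query <- DBI::sqlAppendTable(conn, "cbLiveStage", batch, row.names = FALSE)
  res <- dbSendStatement(conn, query)
  dbClearResult(res)
  dbExecute(conn, "CALL update_cbLive")  # fires once per cycle, not once per clan
}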

There is no magic to this loop, though: I grab the list of clans I track this season from SQL (trackedClans), then iterate through that list, constructing a URL for each clan, grabbing the results, paring them down to only what I want (the discarded fields are useful to the vendor - account status and the like - but are not statistics or anything usable here), and transforming the date from a UNIX timestamp to a datetime, then to the character string that DBI wants to pass to SQL. The five-minute wait is arbitrary, but ‘feels’ correct - games take a maximum of 20 minutes, and most take in excess of 10. Technically there is no minimum duration, but the idea of a team finishing two games within a five-minute window is so outlandish as to be functionally (hopefully?) impossible. The damage it would do if that did happen is contextually trivial - the loss of accurate data for two games out of a few hundred.

I then dump the result into my staging table, and call a stored procedure to deduplicate and load the data:

BEGIN
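    # Keep only genuinely new games: rows whose battles_count exceeds the
    # player's previously recorded maximum, or players never seen before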
    INSERT INTO cbLive
        SELECT
            st.`name`
            ,st.id
            ,st.battles_count
            ,MIN(st.last_battle_time) AS last_battle_time
            ,st.frags_per_battle
            ,st.damage_per_battle
            ,st.tag
            ,st.season_number
        FROM
            cbLiveStage st
            LEFT JOIN (SELECT id, season_number, MAX(battles_count) AS battles_count FROM cbLive GROUP BY id, season_number) cl ON st.id=cl.id AND st.season_number=cl.season_number
        WHERE
            st.battles_count > cl.battles_count OR cl.battles_count IS NULL
        GROUP BY
            st.`name`
            ,st.id
            ,st.battles_count
            ,st.frags_per_battle
            ,st.damage_per_battle
            ,st.tag
            ,st.season_number
        ORDER BY
            last_battle_time DESC;
    INSERT INTO cbLiveStageBUP
        SELECT
            *
        FROM
            cbLiveStage;
    TRUNCATE TABLE cbLiveStage;
END

This isn’t especially graceful, but it is okay. The line of note is MIN(st.last_battle_time) AS last_battle_time, which handles the case where a player plays other formats and keeps incrementing the timestamp, generating multiple staged records for the same battle count; the first timestamp seen for any particular battle count should be the correct one, so that is the one used. Otherwise, the WHERE clause handles deduplication - an increase in battles_count is what we actually want to see, and cl.battles_count being NULL means the player has not been recorded before - so, easy peasy.

Since this is still at least somewhat in flux, I dump everything I scrape from the staging table into a heavy-handed backup table, and then truncate the staging table so I only process genuinely new records each iteration.

I now have a clean, consistent record of the changing aggregate statistics.

Making This Useful

When I started this, I was scraping the per-game stats of the clans I was tracking anyway, with the view that the worst case would be manually collating per-game stats with tracked games, though I was hopeful that I could automate this. What I realised very quickly was that these stats are public - I could grab the summary statistics of whichever clans I wanted. Now, scraping EVERY clan is nonsensical, but I could at least scrape the top few percent of performers - the games that tend to matter most for reviews - and try to construct something. Having added that, I could clearly see whole games:

[Image: cbLive game construct]

With millisecond-accurate timestamps, I can be FAIRLY comfortable that blocks of 4-7 players from the same clan (a team of 7 can include up to three non-members) with an identical last_battle_time are playing the same game. This will always be a little fuzzy, but the intent here is to match these records with the ‘main’ game records, so I should be able to verify.
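Deriving the per-game numbers from these blocks is mechanical. As a rough dplyr equivalent of the LAG() logic the SQL view further down uses (assuming cbLive has been pulled into a local data frame):

per_game <- cbLive %>%
  mutate(tot_damage = battles_count * damage_per_battle,   # running totals implied by the averages
         tot_frags  = battles_count * frags_per_battle) %>%
  group_by(id, season_number) %>%
  arrange(battles_count, .by_group = TRUE) %>%
  mutate(damage = tot_damage - lag(tot_damage),            # delta against the previous snapshot
         frags  = round(tot_frags - lag(tot_frags))) %>%
  ungroup()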

So, I can derive per-game damage and frags from this point effortlessly using LAG() and the simple math from above, but what I really want to do is match these with ships, maps, and the full game records scraped elsewhere. Because these are derived, I have no game identifiers - but I have that millisecond-accurate timestamp, right?

[Image: cbGames timestamp match?]

Nope! It turns out that the finished_at timestamp used in the API and all other records is the game end time, while the last_battle_time timestamp is the time the game started! So the timestamps are useful for grouping games, but cannot actually be used to match the records.

Another option could be team composition - fuzzy matching on the presence of the same set of players - but the ‘friendly’ team tends to be static from game to game, so this is only even remotely possible if I happen to be actively scraping both the friendly and the enemy team. I thought about trying to match ‘stepped’ timestamps, but came up with what I think is a much better solution.

Basically, I match on a per-player basis rather than a per-game basis. I generate a RANK() ordering of ‘qualifying’ games (a timestamp qualifies when 4 or more records from the same clan share it, which excludes players individually playing with ‘foreign’ teams) for each of the live-scraped games and for the ‘main’ scraped games that night, then match on that RANK() as well as player_id. This seems to generate accurate records, and I at least can’t think of edge cases that are likely to break it.
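Sketching the idea in dplyr terms - live and games here are hypothetical data frames holding the derived records and the ‘main’ per-game records respectively:

live_ranked <- live %>%
  group_by(tag, last_battle_time) %>%
  filter(n() >= 4) %>%                            # 'qualifying': 4+ clanmates share the timestamp
  group_by(id, date = as_date(last_battle_time)) %>%
  mutate(game_rnk = min_rank(battles_count)) %>%  # nth game of the night, per player
  ungroup()

games_ranked <- games %>%
  group_by(spa_id, date = as_date(finished_at)) %>%
  mutate(game_rnk = min_rank(finished_at)) %>%    # the same ordering, from the 'main' records
  ungroup()

matched <- inner_join(live_ranked, games_ranked,
                      by = c("id" = "spa_id", "date", "game_rnk"))

The amalgamated result comes out looking like this: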

[Image: cbLive amalgamated dataset]

So, from having no per-game or per-ship granular statistics, I can now produce per-player, per-ship performance statistics, as well as the per-player performance statistics of enemies. I have made this into a view, which is then queried by Google Data Studio and the like:

WITH
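    # tots: total damage/frags implied by battles_count * the per-battle averages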
    tots AS
        (
        SELECT
            *
            ,battles_count*frags_per_battle AS tot_frags
            ,battles_count*damage_per_battle AS tot_damage
        FROM
            #ntt_test
            cbLive
        ),
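    # teams: how many records share each millisecond timestamp (one same-game block)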
    teams AS
        (
        SELECT
            COUNT(*) AS team_ct
            ,last_battle_time
        FROM
            #ntt_test
            cbLive
        GROUP BY
            last_battle_time
        )
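    # fin: per-game deltas - current implied totals minus the previous snapshot's (LAG)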
    ,fin AS
        (
        SELECT 
            `name`
            ,id
            ,battles_count
            ,tots.last_battle_time
            ,frags_per_battle
            ,damage_per_battle
            ,tag
            ,tot_damage - LAG(tot_damage) OVER (PARTITION BY `name` ORDER BY battles_count) AS damage
            ,ROUND(tot_frags - LAG(tot_frags) OVER (PARTITION BY `name` ORDER BY battles_count)) AS frags
            ,season_number
        FROM 
            tots
        ORDER BY
            last_battle_time DESC
        )
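    # cbp: the 'main' scraped per-game records (ship, map, result, enemy tag), ranked per player per day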
    ,cbp AS
        (
        SELECT DISTINCT
            cbp.`name`
            ,spa_id
            ,result_id
            ,vehicle_id
            ,ship_name
            ,finished_at
            ,map_name
            ,RANK() OVER (PARTITION BY spa_id, DATE(finished_at) ORDER BY finished_at) AS game_rnk
            ,cbt.result
            ,cbc.tag AS enemy
            ,cbp.survived AS survived
        FROM
            cbPlayers cbp
            LEFT JOIN cbTeams cbt ON cbp.result_id=cbt.id
            LEFT JOIN cbGames cbg ON cbt.GID=cbg.id
            INNER JOIN trackedClans tc ON tc.clan_id=cbp.clan_id
            LEFT JOIN cbTeams cbt2 ON cbt2.GID=cbg.id AND cbt2.clan_id <> cbt.clan_id
            LEFT JOIN cbClans cbc ON cbt2.clan_id=cbc.clan_id AND cbc.season_number=cbg.season_number
        )
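    # per_player: derived records from qualifying (4+ player) blocks, ranked per player per day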
    ,per_player AS
        (
        SELECT
            fin.*
            ,RANK() OVER (PARTITION BY id, DATE(fin.last_battle_time) ORDER BY battles_count) AS game_rnk
        FROM
            fin
            LEFT JOIN teams ON fin.last_battle_time=teams.last_battle_time
        WHERE
            team_ct >= 4
        )
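# Join the derived stats to the 'main' records on player id + nightly game rank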
SELECT
    pp.`name`
    ,id
    ,battles_count
    ,finished_at
    ,last_battle_time
    ,tag
    ,ROUND(IFNULL(damage,damage_per_battle))/IFNULL((battles_count-LAG(battles_count) OVER (PARTITION BY id, season_number ORDER BY battles_count)),1) AS damage
    ,ABS(ROUND(IFNULL(frags,frags_per_battle)))/IFNULL((battles_count-LAG(battles_count) OVER (PARTITION BY id, season_number ORDER BY battles_count)),1) AS frags
    ,ship_name
    ,map_name
    ,result
    ,enemy
    ,survived
    ,season_number
FROM
    per_player pp
    LEFT JOIN cbp ON cbp.spa_id=pp.id AND cbp.game_rnk=pp.game_rnk AND DATE(last_battle_time)=DATE(finished_at)
#WHERE
#   (tag IN ('YOBOY') OR enemy IN ('YOBOY'))
ORDER BY 
    last_battle_time DESC
    ,tag

I am fairly certain I could write that a bit more nicely - I don’t think the tots CTE is needed at all, and I am probably doubling up on window functions - it just… probably isn’t great. But it works right now, and there are no performance concerns, even when not excluding games I don’t explicitly care about.