World of Warships is a niche team-based online game. Like many games, it has a publicly accessible API giving access to a reasonable set of player performance statistics, which are used in community-created metrics, and displayed on some community sites, suchs as http://www.wows-numbers.com.
These widely available statistics are available only for more ‘casual’ game modes. There is an ongoing (seasonal) competitive game mode with player-determined teams, a limited map pool, and no publicly available statistics bar team-wide aggregate winrates, and ladder position. My project focuses on trying to remedy this, and provide useful performance statistics to competitive teams.
In this case, a source of aggregate competitive statistics was identified through aggressive exploration of the vendor web site and game client (watching web traffic with Wireshark or looking at web browser developer tools to identify public API endpoints) and I wanted to see if I could make something worthwhile out of it. Turns out, I could.
The endpoint identified contains per-season aggregate statistics for all players in a clan. These aren’t massively involved - number of games played, average damage, and average frags - but I figured that if I could scrape these as games were being played, capturing each game, I could do some simple math to derive the player performance stats in that game: \(dam_{game}=(dam_{avg} \cdot games) - (LAG(dam_{avg})\cdot (games-1))\) - the total damage (games*avg_dam) this game, less the total damage at the previous game must be the amount of damage dealt this game. Frags can be derived identically. The only ‘technical’ problem seemed to be grabbing every game, which ended up not really being much of a problem at all.
There are, however, a multitude of problems after the scraping point which I will discuss below. Namely, the timestamps do not match other timestamps used by the vendor, there are no game IDs, despite being ‘bound’ to an individual clan, these aggregates ‘follow’ the player if they change clans, players can play with clans other than those they belong to, and so forth.
First things first, breaking down the URL -
https://clans.worldofwarships.eu/api/members/500205591/?battle_type=cvc&season=19
- three things stand out. First, the web address top-level domain, .eu
in this case, which can change based on the realm in which a clan is
based, so has to be considered. Second, the 9-digit numeric string, the
clan_id, third, the season number. Playing with all of these, I can grab
data from whatever clans I want, as these aggregates are actually
public, rather than needing some kind of authentication token. At first
instance, I ignored this outside of it being slightly more convenient,
but as I moved forward, this became more interesting, as I will detail
later.
# Scrape CB aggregate stats 'live' to derive per-player per-game performance metrics
#
# Michael Zorzella
#
# Version Action Date
# 1.0 Created 22-12-22
# 1.1 Baseline from SQL 24-12-22
# 1.2 Dump into staging table live 26-12-22
# Handle deduplication in SQL
## Setup -----
pacman::p_load(tidyverse, jsonlite, RMariaDB, lubridate, httr, DBI)
rm(list=ls())
settings <- read_csv("settings.csv",show_col_types = FALSE)
url1 <- "https://clans.worldofwarships.eu/api/members/"
url2 <- "/?battle_type=cvc&season="
appID <- settings %>%
filter(setting == 'appID') %>%
.$value
conn <- dbConnect(MariaDB(),
host = settings %>% filter(setting == 'host') %>% .$value,
port = settings %>% filter(setting == 'port') %>% .$value,
user = settings %>% filter(setting == 'user') %>% .$value,
password = settings %>% filter(setting == 'password') %>% .$value,
dbname = settings %>% filter(setting == 'database') %>% .$value)
rm(settings)
season <- tbl(conn,"seasonList") %>%
filter(start_time <= local(Sys.Date()) && finish_time >= local(Sys.Date())) %>%
select(season_number) %>%
collect() %>%
.$season_number
clans <- tbl(conn,"trackedClans") %>%
filter(season == local(season)) %>%
select(tag,clan_id) %>%
collect()
## Scrape Live -----
if(hour(Sys.time()) < 18){
untilthen <- 64800 - (hour(now())*3600+minute(now())*60+second(now()))
print(paste('Waiting',untilthen,'seconds'))
Sys.sleep(untilthen)
}
while((hour(Sys.time()) < 22 || minute(Sys.time()) < 20) && hour(Sys.time()) >= 18){
### Scrape -----
for(i in 1:nrow(clans)){
scrape <- jsonlite::fromJSON(paste(url1,clans$clan_id[i],url2,season,sep=''))$items %>%
select(name,
id,
battles_count,
last_battle_time,
frags_per_battle,
damage_per_battle) %>%
filter(battles_count > 0) %>%
mutate(last_battle_time = as.character(as_datetime(last_battle_time)),
tag=clans$tag[i],
season_number=season)
### Update SQL and fire stored procedure if appropriate -----
if(nrow(scrape)>0){
query <- DBI::sqlAppendTable(conn,"cbLiveStage",scrape,row.names=FALSE)
res <- dbSendStatement(conn,query)
dbGetRowsAffected(res)
print(paste(dbGetRowsAffected(res), 'new', clans$tag[i],'rows at',Sys.time()))
dbClearResult(res)
write.table(scrape, file=paste(Sys.Date(),"cbStatsLive.csv"),append=TRUE,col.names=FALSE)
dbExecute(conn,"CALL update_cbLive")
#### Save all updates in DF -----
#if(exists("cbStatsLive")){
# cbStatsLive <- scrape
# } else {
# cbStatsLive <- union(cbStatsLive,scrape)
# }
}
}
### Scrape every 5 minutes -----
Sys.sleep(300)
}
## Save raw Locally -----
if(exists("cbStatsLive")){
write.csv(cbStatsLive,paste(Sys.Date(),"cbStatsLive.csv"))
}
The above is my local working copy rather than what I have scheduled,
hence the ghetto crontab Sys.sleep(untilthen) delaying the
run until the competitive session had actually started.
What this script isn’t is complicated. In earlier iterations, I tried
to deduplicate much more in R, but this proved difficult, primarily
because the last_battle_time timestamp is.. slightly
insane. Instead of this tracking the last competitive battle,
it tracks the last account battle, leading to mountains of technically
not duplicate duplicates. This isn’t insurmountable, I just find it much
easier to handle in SQL, so opted to dump everything into a staging
table and call a stored procedure to deduplicate and merge results into
the main record.
There are also minor annoyances looking at this script now: I upload
and call the stored procedure for each clan rather than once per scrape
cycle. This should be easily remedied with an additional
union() and de-nesting the upload by one ‘level’ of the
loop. As of time of writing, there are no timing problems (it takes
about 1s per clan per loop to execute, and most of this is waiting for
the web response) so there is no immediate benefit to making this fix,
it would just be ‘better’ if I did. Also, the print() in
each loop is meaningless - because I do not deduplicate in R, the number
of new rows each clan adds is simply the number of players who have
played any clan battles. This serves to notify me that everything is
working properly, which I guess I’m happy enough with, but it isn’t
really meaningful. Without a specific goal in mind, though, there is no
real purpose served by changing it.
There is no magic to this loop, though: I grab the list of clans I track from SQL this season (trackedClans), then iterate through that list, constructing a URL for each clan, grabbing the results, paring them down to be only what I want (the data I discard is actively pointless - useful information for the vendor about the player account status rather than statistics or anything usable) and transforming the date from a UNIX timestamp to datetime, then the character string that DBI wants to pass to SQL. The five minute wait is arbitrary, but ‘feels’ correct - games take a maximum of 20 minutes, and most take in excess of 10. Technically, there is no minimum duration, but the idea of having two games finish within a five minute window is so outlandish as to be functionally (hopefully?) impossible. The amount of damage it would do if that did happen is rather trivial contextually - loss of accurate data for two games out of a few hundred.
I then dump the result into my staging table, and call a stored procedure to deduplicate and load the data:
BEGIN
INSERT INTO cbLive
SELECT
st.`name`
,st.id
,st.battles_count
,MIN(st.last_battle_time) AS last_battle_time
,st.frags_per_battle
,st.damage_per_battle
,st.tag
,st.season_number
FROM
cbLiveStage st
LEFT JOIN (SELECT id, season_number, MAX(battles_count) AS battles_count FROM cbLive GROUP BY id, season_number) cl ON st.id=cl.id AND st.season_number=cl.season_number
WHERE
st.battles_count > cl.battles_count OR cl.battles_count IS NULL
GROUP BY
st.`name`
,st.id
,st.battles_count
,st.frags_per_battle
,st.damage_per_battle
,st.tag
,st.season_number
ORDER BY
last_battle_time DESC;
INSERT INTO cbLiveStageBUP
SELECT
*
FROM
cbLiveStage;
TRUNCATE TABLE cbLiveStage;
END
This isn’t especially graceful, but it is okay. Lines of note include
,MIN(st.last_battle_time) AS last_battle_time which handles
the case where players are playing other formats and incrementing the
timestamp, generating multiple records. The first timestamp seen for any
particular battle number should be the correct one, so that is the one
that is used. Otherwise, the WHERE clause handles
deduplication - the battles_count difference being what we really want
to see, and battles_count being NULL means the player has
not been previously recorded - so, easy peasy.
Since this is still at least somewhat in flux, I dump everything I scrape from the staging table into a heavy-handed backup table, and then truncate the staging table so I only process genuinely new records each iteration.
I now have a clean, consistent records of the changing aggregate statistics
When I started this, I scraped the per-game stats of the clans I was tracking anyway, with the view that the worst-case would be manually collating per-game stats with games tracked, but hopeful that I could automate this. What I realised very quickly was that these stats were public - I could grab all of summary statistics from whichever clans I wanted. Now, scraping EVERY clan is nonsensical, but I could at least scrape the top few percent of performers - the games that tend to matter most for reviews - and try to construct something. Having added that, I could clearly see whole games:
cbLive game construct
With millisecond-accurate timestamps, I can be FAIRLY comfortable that blocks of 4-7 players from the same clan (a team can include up to three non-members in a team of 7) with an identical last_battle_time are playing the same game. This will always be a little bit fuzzy, but the intent here is to match these records with the ‘main’ game records, so I should be able to verify.
So, I can derive per-game damage and frags from this point
effortlessly using LAG() and the simple math from above,
but what I really want to do is match these with ships, maps, and the
full game records scraped elsewhere. Because these are derived, I have
no game identifiers - but I have that millisecond-accurate timestamp,
right?
cbGames timestamp match?
Nope! It turns out that the finished_at timestamp used in the API and all other records is the game end time, but the last_battle_time timestamp is the time the game started! So, the timestamps are useful for grouping the games, but cannot actually be used to match the records.
Another option could be team composition - fuzzy matching based on the presence of all of the players - but the ‘friendly’ team will tend to be static, so this is only even remotely possible if I happen to be actively scraping the friendly and enemy teams. I thought about trying to match ‘stepped’ timestamps, but came up with what I think is a much better solution.
Basically, I match on a per-player basis rather than a per-game
basis, generating a RANK() order of ‘qualifying’ (see 4 or
more records from the same clan at the same timestamp to exclude players
individually playing with ‘foreign’ teams) for each of the live scraped
games and the ‘main’ scraped games for that night, and matching on that
RANK() as well as player_id. This seems to generate
accurate records, and I at least can’t think of edge cases that are
likely to break it:
cbLive amalgamated dataset
So, from having no per-game or per-ship granular statistics, I can now produce per-player per-ship performance statistics, as well as the per-player performance statistics of enemies. I have made this into a view which is then queried by Google Data Studio and similar:
WITH
tots AS
(
SELECT
*
,battles_count*frags_per_battle AS tot_frags
,battles_count*damage_per_battle AS tot_damage
FROM
#ntt_test
cbLive
),
teams AS
(
SELECT
COUNT(*) AS team_ct
,last_battle_time
FROM
#ntt_test
cbLive
GROUP BY
last_battle_time
)
,fin AS
(
SELECT
`name`
,id
,battles_count
,tots.last_battle_time
,frags_per_battle
,damage_per_battle
,tag
,tot_damage - LAG(tot_damage) OVER (PARTITION BY `name` ORDER BY battles_count) AS damage
,ROUND(tot_frags - LAG(tot_frags) OVER (PARTITION BY `name` ORDER BY battles_count)) AS frags
,season_number
FROM
tots
ORDER BY
last_battle_time DESC
)
,cbp AS
(
SELECT DISTINCT
cbp.`name`
,spa_id
,result_id
,vehicle_id
,ship_name
,finished_at
,map_name
,RANK() OVER (PARTITION BY spa_id, DATE(finished_at) ORDER BY finished_at) AS game_rnk
,cbt.result
,cbc.tag AS enemy
,cbp.survived AS survived
FROM
cbPlayers cbp
LEFT JOIN cbTeams cbt ON cbp.result_id=cbt.id
LEFT JOIN cbGames cbg ON cbt.GID=cbg.id
INNER JOIN trackedClans tc ON tc.clan_id=cbp.clan_id
LEFT JOIN cbTeams cbt2 ON cbt2.GID=cbg.id AND cbt2.clan_id <> cbt.clan_id
LEFT JOIN cbClans cbc ON cbt2.clan_id=cbc.clan_id AND cbc.season_number=cbg.season_number
)
,per_player AS
(
SELECT
fin.*
,RANK() OVER (PARTITION BY id, DATE(fin.last_battle_time) ORDER BY battles_count) AS game_rnk
FROM
fin
LEFT JOIN teams ON fin.last_battle_time=teams.last_battle_time
WHERE
team_ct >= 4
)
SELECT
pp.`name`
,id
,battles_count
,finished_at
,last_battle_time
,tag
,ROUND(IFNULL(damage,damage_per_battle))/IFNULL((battles_count-LAG(battles_count) OVER (PARTITION BY id, season_number ORDER BY battles_count)),1) AS damage
,ABS(ROUND(IFNULL(frags,frags_per_battle)))/IFNULL((battles_count-LAG(battles_count) OVER (PARTITION BY id, season_number ORDER BY battles_count)),1) AS frags
,ship_name
,map_name
,result
,enemy
,survived
,season_number
FROM
per_player pp
LEFT JOIN cbp ON cbp.spa_id=pp.id AND cbp.game_rnk=pp.game_rnk AND DATE(last_battle_time)=DATE(finished_at)
#WHERE
# (tag IN ('YOBOY') OR enemy IN ('YOBOY'))
ORDER BY
last_battle_time DESC
,tag
I am fairly certain I could write that a bit more nicely - I don’t think the totals CTE is needed at all, I am probably doubling up on the window functions, it just.. probably isn’t great. But, it works right now, and there are no performance concerns, even when not excluding games I don’t explicitly care about.