World of Warships is a niche team-based online game. Like many games, it has a publicly accessible API giving access to a reasonable set of player performance statistics, which are used in community-created metrics, and displayed on some community sites, suchs as http://www.wows-numbers.com.
These widely available statistics are available only for more ‘casual’ game modes. There is an ongoing (seasonal) competitive game mode with player-determined teams, a limited map pool, and no publicly available statistics bar team-wide aggregate winrates, and ladder position.
This project has been focused on trying to aggregate what data exists, piping it into SQL, and then using it to generate otherwise unavailable statistics, and hopefully useful insights.
This writeup covers the SQL model, transformations, and stored procedures within MySQL. It is quite dry, and probably not recommended reading unless you want to understand how everything fits together.
In the beginning, I piped all of this into Google Sheets as a kind of free/terrible database-alike. The ingestion was clumsy, and there was no real ‘relationality’ - it was one big, flat file with massive redundancy. This was extremely difficult to work with, and even more difficult to derive any value from.
The first generation of SQL was a bit better - I tracked one clan, which simplified some reporting issues substantially, but discarded a substantial amount of data that i didn’t know what to do with at the time. There was a lack of resilience in the entire process, and it was entire unscalable - in order to track additional clans, I had to duplicate the SQL database, duplicate scripts with different hardcoded clan information, and duplicate dashboards, changing all of the sources and filters over to handle the other teams. This sucked.
In reality, this is probably my fourth iteration of this project, and apart from relatively minor tweaks or additional elements, I think the physical model is now relatively static and resilient. My ETL and reporting processes can handle new teams near seamlessly, I have recovery options in place, and with the exception of adding clans to my settings/tokens file, no manual intervention is required to keep the process going.
Since generating this model image, I have added two physical tables - cbHist which contains historic performance information, and cbLive which is a current attempt to derive otherwise unavailable per-player per-game statistics, which I will cover a bit more in a different writeup.
SQL Model
I have considered breaking this down further into a Kimball-style heavily normalised snowflake schema, but I feel like this would add work for most normal applications, both complicating the ETL process and increasing the number of joins required to get to analytically useful data models. There are a few places remaining where some normalisation could be applied relatively easily, but there is a real question about how much value that would generate. I think the level of denormalisation existing in this model is relatively acceptable, but I will cover these bits and pieces as they come up.
Realistically, this is the ‘parent’ table of the database, or at least one of the ‘hubs’ of a snowflake, if I were to go down that pathway. There is, however, little actual information here. The ID allows teams to be linked together to give a view of each match, finished_at gives us a timestamp, and the map name tells us which map was played.
That is.. kind of it.
The other identifiers here - cluster_id, realm, and arena_id - are of between no and questionable value. cluster_id is nothing more than a server identifier, as far as I can play - which cluster the game took place on, nothing more. Realm matches which ‘prime time’ the match was played in - functionally, whether the majority of the teams will be Asian, EU, CIS, or North American. Not currently analytically useful. arena_id has potential value, but I don’t know how to parse it - I think this should contain information like the in-game weather (some cosmetic, some with gameplay effects), map mode, and similar. Right now, though, it has no value.
In theory, I should normalise map_name out of this table - I actually have a mapList reference table made and maintained for just that purpose. While this would free up a trivial amount of storage space, it would require one more table join to get back to where we are now, and that is a join that we want to have done in pretty much all cases. My current judgement call is that this isn’t REALLY worth the effort needed to accomplish.
Off to the side, the seasonList reference table, containing start and end dates, as well as tiering information on all clan battle seasons.
This was created almost exclusively to record and handle the number of players per team changing with the season. Sadly, the API does not actually return this information, so I have to append it quite stupidly:
PROCEDURE `update_seasonList`()
BEGIN
DROP TABLE IF EXISTS
season_players;
CREATE TEMPORARY TABLE
season_players
SELECT DISTINCT
season_number,
COUNT(*) AS players_per_team
FROM
cbGames cbg
LEFT JOIN cbTeams cbt ON cbg.id=cbt.GID
LEFT JOIN cbPlayers cbp ON cbt.id=cbp.result_id
WHERE
result_id IS NOT NULL
GROUP BY
season_number
,result_id;
UPDATE
seasonList
LEFT JOIN season_players ON seasonList.season_number=season_players.season_number
SET
player_count=players_per_team;
END
In the context of the season table, I think I can rewrite this as a trigger rather than a called procedure as only one row should be written at a time so it won’t be spammed nonsensically, but this is fine for purposes. It is very blunt: simply counting the number of players per team per season and appending that value. There probably isn’t a great reason to drop and regenerate this table, but it is also concise enough that that there is no realistic downside.
Mostly an intermediary table - one entry in cbGames matches two teams here. Each team has an associated clanID, matching which clan the team ‘belongs’ to, leaving the edge case of clanless or ‘mercenary’ players, but this isn’t a big deal. Each team also has a separate result id, which is unique on a per-team per-game basis, and matches to a number of players in cbPlayers - usually but not always 7.
Each entry has a few unique pieces of information as well. First, the result from the perspective of that team (generally one winner one loser per match. Draws are theoretically possible but I’m not sure I’ve ever recorded any. This could be an unhandled edge case that will break something someday. Second, all of the ladder information - league, division, and division_rating. Five leagues, (0..4, 0 being highest) each with three divisions (1..3, 1 being highest), and then a 1-100 rating within each division, 100 being highest. The rating delta is also part of the provided data, and I have taken the approach of not discarding anything.
At some point, I thought the map spawn side - a piece of information otherwise unavailable - might be derivable from the result id sequence - they are always sequential, so assuming the lower one would be side A and the higher side B seemed likely. I might need to revisit this at some point, as I cannot remember whether I verified this or not.
At the upper and lower bound of each league, there are promotion and demotion games played - a series of five games where the balance must be won to either ascend or maintain position. When games are within these promotion/demotion games, there are additional records, all of which can be logically derived - the number of wins required, the league the team will be promoted or demoted into, and similar.
I currently don’t use this for anything. It does not seem to give meaningful information. In the interest of not repeating mistakes, I ensure that it is collected and recorded correctly.
As previously stated, this will contain records for each member of each team, one record per player per game. It includes the ship the player was in, tier information, whether they survived or not, and similar. There is a clan_id here as well, but it only relates to the ‘game’ clan, so players from external clans are recorded as being ‘from’ the clan they play for. This could be remedied with an abundance of API calls, on a per-player basis, with an excess of 1s response time from the API.
Given that players can switch clans at basically any time, recording all of this information accurately in anything other than an ‘at the time’ kind of way seems impossible, which would mean checking every player from every game, adding substantial processing time, as well as a requirement to process the now relatively substantial backlog of games. To date, there has been no ‘need’ identified to accurately identify and record clans on a per-player basis.
This is a simple reference table of all ships existing ingame scraped from the API once per season. Technically, a map table exists as well, but it has no additional information. This table includes only a few bits and pieces not already included in cbPlayers - nation and class - as well as a heavy-handedly derived ‘short form name.’
In the other tables, the only reference to clans is through clan_ids. These are great as keys, but they do not give human-usable information.
When I scrape games, I check to see if there are any clans I do not have a record of for the current season. If no record exists, I grab details from the vendor API. This gives me the realm (EU, Asia, North America) that the clan is ‘from’, which will impact where any other calls need to be made, as well as the human-readable identifiers, like clan tag and name. I also grab clan historic performance - the last season’s ladder finishing position - and use this for a more ‘consistent’ indicator of team quality than current ladder position, as that is heavily skewed in the first half of a season.
In reality, both clan tag and name can change at will. I make no effort to keep this information accurate ‘to the minute,’ and so far, this has not mattered - most ‘good’ teams do not change their identity often. I generate one accurate entry per season, but may need to increase this frequency in the future. I do not think updating these records would cause any technical problem, but may require the addition of a slow-changing factor date column, and some modifications of the views I feed to dashboard applications to only take the most recent record.
Simply, the clans I am focused on - the only ones that I have authentication tokens for, so records should generally be shown from the perspective of these teams.