I was wrong about dimensional modelling. This is for a few reasons, but chief among them is that I didn’t correctly understand dimensional models. I may still not, but that is beside the point. Fundamentally, I understood dimensional modelling as entailing more normalisation and more tables by design, targeting 3NF everywhere, and necessitating a huge number of joins to get anywhere analytically useful. Without meaning to make excuses for myself, I’m going to make excuses for myself.
My first ‘professional’ exposure to data warehousing was second-hand: a somewhat distant contractor, managed by a different business unit, was tasked with creating a data warehouse based chiefly, but not entirely, on Aeron’s QL housing asset management system. What he came up with was, in places, extremely heavily normalised, splitting off tiny dimension tables for every facet of tenant demographic data, for instance, turning single-factor demographics into single keys that then required a join to get the single factor back - a technical space saving, but analytically head-scratching. In other places, my understanding was hindered by the extent to which the ‘production’ QL database is denormalised. It is a weird, creaky, aging, largely undocumented (as far as anyone in the organisation I worked in could tell, at least) database, where every table is a one-to-one reproduction of a front-end screen - if you have the same tenant information on five different screens in the frontend, it exists in five different places in the backend.
Combining these aspects, as well as the corporate politics surrounding the data warehouse project, probably led me to be less interested in dimensional modelling than I had any business being, and to not (until very recently) dig into it deeply enough to fully grasp that the point is to make something analytically more useful at the expense of normalisation, rather than something that just seemed deliberately clunky. I’ve spent a good portion of the last week taking a slightly more thorough look at dimensional modelling. The rest of this writeup is an exploration of applying dimensional modelling to my existing e-sports performance analytics database.
At the end, I’ll briefly talk about wide tables (or a wide table?) as an alternative to both my existing approach and dimensional modelling. Ironically, this will resemble a previous version of this project’s data model (the first attempt, largely using Google Sheets as a crap database replacement), though it will hopefully be at least a little more considered now.
The existing data model is not far off a ‘lowest reasonable effort’ kind of ingestion - it heavily resembles the JSON objects I load from, because, broadly speaking, that is ‘good enough’, covering all the bases and keeping things relatively simple. I hadn’t, honestly, put too much thought into this. It is quite aggressively normalised, only incidentally denormalised where I have had to add additional reference tables to fill in missing information, plus a small handful of oddities existing in the source data (e.g. clan_id existing at both the lineup and player levels of granularity, but not accurately reflecting the difference between the ‘team’ clan and the ‘player’ clan).
Current data model
From an analytic point of view, however, this leaves a lot to be desired: there are pieces of information, such as season_number, which exist multiple joins away from what I’m trying to model most of the time, leading to almost every view created for a report or dashboard including something like:
```sql
WITH every_gd_time AS (
    SELECT
        *
    FROM
        cbGames cbg
        LEFT JOIN cbTeams cbt ON cbg.id = cbt.GID
        LEFT JOIN cbPlayers cbp ON cbt.result_id = cbp.result_id
        LEFT JOIN cbClans cbc ON cbt.clan_id = cbc.clan_id
            AND cbg.season_number = cbc.season_number
    WHERE
        cbg.season_number = @this_season
        AND cbc.tag = @this_clan
)
```
… which could also include a further join for ship classes, and so forth. It isn’t life-changing, but it is a mild grumble every time, especially because I haven’t mastered DRY and tend to just rewrite the construct constantly.
Basically, in an effort to retain normalisation and the existing structure, there are pieces of information that I need more than 95% of the time but which require one or more joins to get at. This is dumb, and it makes me feel dumb.
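If nothing else, the repeated construct could be written once - as a view, or an equivalent dbt model - and filtered per report. A minimal sketch, assuming the table and column names from the block above; the view name, the trimmed column list, and the player_id column are my own inventions:

```sql
-- Sketch only: the repeated construct defined once as a view, so reports
-- filter it rather than rebuilding the joins. The column list is trimmed
-- (and partly guessed) to avoid the duplicate names a raw SELECT * produces.
CREATE VIEW vw_game_detail AS
SELECT
    cbg.id            AS game_id,
    cbg.season_number AS season_number,
    cbt.result_id     AS result_id,
    cbt.clan_id       AS clan_id,
    cbc.tag           AS tag,
    cbp.player_id     AS player_id
FROM cbGames cbg
LEFT JOIN cbTeams   cbt ON cbg.id = cbt.GID
LEFT JOIN cbPlayers cbp ON cbt.result_id = cbp.result_id
LEFT JOIN cbClans   cbc ON cbt.clan_id = cbc.clan_id
                       AND cbg.season_number = cbc.season_number;

-- Each report then becomes a one-line filter:
-- SELECT * FROM vw_game_detail
-- WHERE season_number = @this_season AND tag = @this_clan;
```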
I could never quite make a dimensional model work for this previously, due to one simple mental block: I saw the fact of the ‘game’ as the hub, rather than seeing ‘this team played a game’ as the core analytic concept, with the fact that it is part of a matched pair being incidental. The latter more closely matches how I actually use the data, and I can’t quite work out why I had this mental block. Framed that way, it fits a dimensional model, and allowing some degree of denormalisation - including all the things you always want directly in the fact (and in some of the dimensions) without necessitating joins - is desirable.
There are definitely questions around whether surrogate keys should be used in some places, whether some degree of snowflaking would be desirable as the model expands, and whether slowly changing dimension flags are needed in a few places, but I’ll cover that momentarily.
It is worth mentioning, I think, that I haven’t included dates (other than the game date) or versioning information, because most of this is unchanging - there is no case where most of these records will or should change or update. There are, though, additional possible dimensions (and/or a move of players into a second fact table alongside the added dimensions) which would necessitate some degree of versioning.
For the majority of the day-to-day reporting I do, this model would be sufficient, and would make things marginally easier… except that Google Data Studio basically wants wide tables anyway, so the value of this is questionable.
I’ve just noticed that a pile of the relationships on my ERD are backward, and there is nothing worth talking about with regard to the season dimension table. Oh well.
### fact_games
Now the core: in my original model this was represented as ‘teams’, but that isn’t really the point - the point of the data model is representing the game state from that team’s point of view, so reframing ‘team’ as ‘game’ helps me, at least mentally, to centre this. Nothing in this table should ever change, so there is no versioning or surrogate key requirement, as far as I know.
My kneejerk reaction is that there is ‘too much’ here, but it is also a near-complete record of a single game from a single point of view without needing further joins - the date, season, result, clan, and all of the ladder data that is tied to the result are here, and this makes sense to me.
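Put concretely, I picture something like the following - a sketch only, with guessed types, and the ladder columns standing in for whatever the result actually carries:

```sql
-- Rough sketch of fact_games at the 'this team played a game' grain.
-- Types and the specific ladder columns are placeholders, not a spec.
CREATE TABLE fact_games (
    result_id       INT         NOT NULL,  -- one row per team per game
    game_id         INT         NOT NULL,  -- 'production' key; pairs the two rows
    game_date       DATE        NOT NULL,
    season_number   INT         NOT NULL,
    clan_id         INT         NOT NULL,
    result          VARCHAR(16) NOT NULL,  -- e.g. win / loss
    league          VARCHAR(32) NULL,      -- primary ladder metric
    division        INT         NULL,      -- stand-ins for the rest of the
    division_rating INT         NULL       -- ladder data tied to the result
);
```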
I can, however, see an argument for splitting off a dim_result table
with all of the ladder information - fundamentally, it isn’t always or
even usually needed. I think, on consideration, that is the correct
approach, leaving league denormalised within both tables,
as it tends to be the primary ladder-related metric. In fact:
Dimensional Model v2
This feels a bit better to me. Still a question about the
game_date, which broadly boils down to: how useful should
dimension tables be on their own? There is a further question about
including season_number in multiple places, which boils
down to an identical issue.
For now, though, this is a slight improvement, I think.
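In query terms, what the split buys is that day-to-day reports touch only the fact, and the full ladder detail sits one optional join away. A sketch, reusing the placeholder columns from above (and an invented @this_clan_id variable):

```sql
-- The common case: no joins at all, because league sits on the fact.
SELECT game_date, league, result
FROM fact_games
WHERE season_number = @this_season
  AND clan_id = @this_clan_id;

-- The occasional case: full ladder detail via a single join to dim_result.
SELECT f.game_date, r.league, r.division, r.division_rating
FROM fact_games f
LEFT JOIN dim_result r ON f.result_id = r.result_id
WHERE f.season_number = @this_season;
```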
As stated above, I used to consider the ‘game’ entry the real core of
the dataset, but now think it is only incidental. The
game_id is the ‘production’ key, but there is no obvious
need for any surrogate key - there is no versioning, uniqueness is
guaranteed through the ETL process, etc.
The real ‘question’ is: is there any point to keeping this as its own table? Fundamentally, the map is not a ‘core’ data feature, and self-joining the fact table on game_id can pair the teams, so there is an argument for just widening the fact table with the map information. However, there is equally a reasonable argument for pulling game_date off of the fact table and leaving it only here.
Oh well.
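For reference, the self-join mentioned above is about as simple as it sounds - a sketch, where the result column and @this_clan_id variable are placeholders:

```sql
-- Pairing the two sides of a game by self-joining the fact on game_id.
SELECT
    ours.game_id,
    ours.clan_id   AS our_clan_id,
    theirs.clan_id AS their_clan_id,
    ours.result    AS our_result
FROM fact_games ours
JOIN fact_games theirs
  ON  ours.game_id   = theirs.game_id
  AND ours.result_id <> theirs.result_id   -- the other team's row
WHERE ours.clan_id = @this_clan_id;
```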
Just clan information now, but this is the first place where slowly changing dimensions might be realistic.
Fundamentally, clans can change their external identifying
information more-or-less at will - both tag and
clan_name. Right now, I grab this data once per season, and
make absolutely no effort to correct this - clan_id does
not change, so there is no ‘loss’ here apart from mild confusion where
reporting tags do not match tags in replays.
This would, therefore, be a good candidate for a slowly changing dimension, but I am not entirely certain how or when I would trigger an update. Even without the slowly changing dimension aspect, this is a candidate for a surrogate key as well - there is no single unique key, as there will be one row per clan per season. A concatenation of clan_id, season_number and version would allow for easy ‘correct at time of play’ single-key matches, reverting to the composite key for ‘correct for current information’. This feels a bit backward to me.
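To make that a bit more concrete, here is roughly what I have in mind - an illustration only, with an invented table name, guessed column types, and a type-2-style current-row flag, rather than anything I have actually built:

```sql
-- Illustrative sketch of a slowly changing clan dimension.
-- One row per clan per season per version of its tag/clan_name.
CREATE TABLE dim_clans (
    clan_key      INT         NOT NULL,  -- surrogate key, e.g. built from
                                          -- clan_id + season_number + version
    clan_id       INT         NOT NULL,  -- stable 'production' id
    season_number INT         NOT NULL,
    version       INT         NOT NULL,  -- bumped when tag or clan_name changes
    tag           VARCHAR(8)  NOT NULL,
    clan_name     VARCHAR(64) NOT NULL,
    is_current    TINYINT     NOT NULL   -- 1 for the latest row per clan
);

-- 'Correct at time of play': join the fact on clan_key alone.
-- 'Correct for current information': join on clan_id where is_current = 1.
```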
This is another candidate for a surrogate key, which would have to be result_id plus player_id to give ‘this player in this game’. It might be worthwhile if I took the snowflaking approach, splitting off the per-game statistics and adding further random statistics. That would actually complicate things quite a bit, as it would definitely require a slowly changing player fact table holding dated clan and player name data, and linking through to the expanded statistics. I don’t currently care about which clan a player is technically in, or when they change their name, and I do not collect or store wider random or ranked game statistics; those would require further decisions on how to handle the slowly changing nature, as they are stored as aggregate stats only and the change over time is quite important.
Having game_date in this table is almost certainly
overkill unless I want to visualise average performance over time
without needing joins - back to the same question about how
self-contained fact tables should be.
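The same question in query form, assuming a players table keyed on result_id and player_id (the table name, the damage_dealt metric, and @this_player are all invented for illustration):

```sql
-- With game_date denormalised onto the players table, no join is needed:
SELECT game_date, AVG(damage_dealt) AS avg_damage
FROM fact_players                 -- invented name for the players table
WHERE player_id = @this_player
GROUP BY game_date;

-- Without it, the same trend chart needs a hop through the games fact:
SELECT f.game_date, AVG(p.damage_dealt) AS avg_damage
FROM fact_players p
JOIN fact_games f ON p.result_id = f.result_id
WHERE p.player_id = @this_player
GROUP BY f.game_date;
```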
Listening to a dbt talk from a couple of years ago, I gathered that wide tables are apparently quite efficient - and, certainly for Google Data Studio, at least partially preferred. A couple of approaches come to mind:
- games and players - this slightly reduces redundancy, having two rows per game (one per team) and (usually) seven rows of players per team.

Theoretically, in the first approach, the games table could be structured with a single line per game differentiating ‘team 1’ and ‘team 2’, but all this really does is complicate both ETL and analysis in order to make a table that looks slightly more like an ‘end product’ in and of itself, where you get all of the information pertaining to a game in a single line. I’m not a fan.
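For what it’s worth, the two-table version would amount to little more than materialising the joins from earlier - a sketch, reusing the invented names from above with a trimmed column list:

```sql
-- Sketch of the 'games' half of the two-wide-table approach: one row per
-- team per game, with everything a Data Studio report needs pre-joined.
CREATE VIEW wide_games AS
SELECT
    f.game_id,
    f.game_date,
    f.season_number,
    f.result,
    f.league,
    c.tag,
    c.clan_name
FROM fact_games f
LEFT JOIN dim_clans c
  ON  f.clan_id       = c.clan_id
  AND f.season_number = c.season_number;

-- A matching wide players view (one row per player per game) would be
-- built the same way on top of the players table.
```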
All in all, I’m not really enthused by this idea. I can see the argument for certain wide tables or materialised views to be created - indeed, I do just that to ease Data Studio reporting - but I think I’ll stick with thinking about dimensional modelling and optimising for analytics - at least, for analytics platforms that aren’t the awkward mess that is Data Studio.