I was wrong about dimensional modelling. This is for a few reasons, but chief among them is that I didn’t correctly understand dimensional models. I may still not, but that is beside the point. Fundamentally, I understood dimensional modelling as entailing more normalisation and more tables by design, targeting 3NF everywhere, and necessitating a huge number of joins to get anywhere analytically useful. Without meaning to make excuses for myself, I’m going to make excuses for myself.
My first ‘professional’ exposure to data warehousing was second-hand: a somewhat distant contractor, managed by a different business unit, was tasked with creating a data warehouse based chiefly, but not entirely, on Aeron’s QL housing asset management system. What he came up with was, in places, extremely heavily normalised, splitting off tiny dimension tables for every facet of tenant demographic data, for instance, turning single-factor demographics into single keys that then required a join to get the single factor back - a technical space saving, but analytically head-scratching. In other places, my understanding was hindered by the extent to which the ‘production’ QL database is denormalised. It is a weird, creaky, aging, largely undocumented (as far as anyone in the organisation I worked in could tell, at least) database, where every table is a one-to-one reproduction of a front-end screen - if you have the same tenant information on five different screens in the frontend, it exists in five different places in the backend.
Combining these aspects, as well as the corporate politics surrounding the data warehouse project, probably led me to be less interested in dimensional modelling than I had any business being, and to not (until very recently) dig into it deeply enough to fully grasp that the point is to make something analytically more useful at the expense of normalisation, rather than something that just seemed deliberately clunky. I’ve spent a good portion of the last week taking a slightly more thorough look at dimensional modelling. The rest of this writeup is an exploration of applying dimensional modelling to my existing e-sports performance analytics database.
At the end, I’ll briefly talk about wide tables (or a wide table?) as an alternative to both my existing approach and dimensional modelling. Ironically, this will resemble a previous version of this project’s data model (the first attempt, largely using Google Sheets as a crap database replacement), though it will hopefully be at least a little more considered now.
The existing data model is not far off a ‘lowest reasonable effort’ kind of ingestion - it heavily resembles the JSON objects I load from, because, broadly speaking, that is ‘good enough’, covering all the bases and keeping things relatively simple. I hadn’t, honestly, put too much thought into this. It is quite aggressively normalised, only incidentally denormalised where I have had to add additional reference tables to fill in missing information, plus a small handful of oddities existing in the source data (e.g. clan_id existing at both the lineup and player levels of granularity, but not accurately reflecting the difference between the ‘team’ clan and the ‘player’ clan).
Current data model
From an analytic point of view, however, this leaves a lot to be desired: there are pieces of information, such as season_number, which exist multiple joins away from what I’m trying to model most of the time, leading to almost every view created for a report or dashboard including something like:
```sql
WITH every_gd_time AS (
    SELECT
        *
    FROM
        cbGames cbg
        LEFT JOIN cbTeams cbt ON cbg.id = cbt.GID
        LEFT JOIN cbPlayers cbp ON cbt.result_id = cbp.result_id
        LEFT JOIN cbClans cbc ON cbt.clan_id = cbc.clan_id
            AND cbg.season_number = cbc.season_number
    WHERE
        cbg.season_number = @this_season
        AND cbc.tag = @this_clan
)
```
… which could also include a further join for ship classes, and so forth. It isn’t life-changing, but it is a mild grumble every time, especially because I haven’t mastered DRY and tend to just rewrite the construct constantly.
Basically, in an effort to retain normalisation and the existing structure, there are pieces of information that I need more than 95% of the time but which require one or more joins to get at. This is dumb, and it makes me feel dumb.
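If nothing else, the repeated construct could be written once - as a view, or an equivalent dbt model - and filtered per report. A minimal sketch, assuming the table and column names from the block above; the view name, the trimmed column list, and the player_id column are my own inventions:

```sql
-- Sketch only: the repeated construct defined once as a view, so reports
-- filter it rather than rebuilding the joins. The column list is trimmed
-- (and partly guessed) to avoid the duplicate names a raw SELECT * produces.
CREATE VIEW vw_game_detail AS
SELECT
    cbg.id            AS game_id,
    cbg.season_number AS season_number,
    cbt.result_id     AS result_id,
    cbt.clan_id       AS clan_id,
    cbc.tag           AS tag,
    cbp.player_id     AS player_id
FROM cbGames cbg
LEFT JOIN cbTeams   cbt ON cbg.id = cbt.GID
LEFT JOIN cbPlayers cbp ON cbt.result_id = cbp.result_id
LEFT JOIN cbClans   cbc ON cbt.clan_id = cbc.clan_id
                       AND cbg.season_number = cbc.season_number;

-- Each report then becomes a one-line filter:
-- SELECT * FROM vw_game_detail
-- WHERE season_number = @this_season AND tag = @this_clan;
```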
I could never quite make a dimensional model work for this previously, due to one simple mental block: I saw the fact of the ‘game’ as the hub, rather than seeing ‘this team played a game’ as the core analytic concept, with the fact that it is part of a matched pair being incidental. The latter more closely matches how I actually use the data, and I can’t quite work out why I had this mental block. Framed that way, it fits a dimensional model, and allowing some degree of denormalisation - including all the things you always want directly in the fact (and in some of the dimensions) without necessitating joins - is desirable.
There are definitely questions around whether surrogate keys should be used in some places, whether some degree of snowflaking would be desirable as the model expands, and whether slowly changing dimension flags are needed in a few places, but I’ll cover that momentarily.
It is worth mentioning, I think, that I haven’t included dates (other than the game date) or versioning information, because most of this is unchanging - there is no case where most of these records will or should change or update. There are, though, additional possible dimensions (and/or a move of players into a second fact table alongside the added dimensions) which would necessitate some degree of versioning.
For the majority of the day-to-day reporting I do, this model would be sufficient, and would make things marginally easier… except that Google Data Studio basically wants wide tables anyway, so the value of this is questionable.
I’ve just noticed that a pile of the relationships on my ERD are backward, and there is nothing worth talking about with regard to the season dimension table. Oh well.
### fact_games
Now the core: in my original model this was represented as ‘teams’, but that isn’t really the point - the point of the data model is representing the game state from that team’s point of view, so reframing ‘team’ as ‘game’ helps me, at least mentally, to centre this. Nothing in this table should ever change, so there is no versioning or surrogate key requirement, as far as I know.
My kneejerk reaction is that there is ‘too much’ here, but it is also a near-complete record of a single game from a single point of view without needing further joins - the date, season, result, clan, and all of the ladder data that is tied to the result are here, and this makes sense to me.
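Put concretely, I picture something like the following - a sketch only, with guessed types, and the ladder columns standing in for whatever the result actually carries:

```sql
-- Rough sketch of fact_games at the 'this team played a game' grain.
-- Types and the specific ladder columns are placeholders, not a spec.
CREATE TABLE fact_games (
    result_id       INT         NOT NULL,  -- one row per team per game
    game_id         INT         NOT NULL,  -- 'production' key; pairs the two rows
    game_date       DATE        NOT NULL,
    season_number   INT         NOT NULL,
    clan_id         INT         NOT NULL,
    result          VARCHAR(16) NOT NULL,  -- e.g. win / loss
    league          VARCHAR(32) NULL,      -- primary ladder metric
    division        INT         NULL,      -- stand-ins for the rest of the
    division_rating INT         NULL       -- ladder data tied to the result
);
```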
I can, however, see an argument for splitting off a dim_result table
with all of the ladder information - fundamentally, it isn’t always or
even usually needed. I think, on consideration, that is the correct
approach, leaving league denormalised within both tables,
as it tends to be the primary ladder-related metric. In fact:
Dimensional Model v2
This feels a bit better to me. Still a question about the
game_date, which broadly boils down to: how useful should
dimension tables be on their own? There is a further question about
including season_number in multiple places, which boils
down to an identical issue.
For now, though, this is a slight improvement, I think.
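In query terms, what the split buys is that day-to-day reports touch only the fact, and the full ladder detail sits one optional join away. A sketch, reusing the placeholder columns from above (and an invented @this_clan_id variable):

```sql
-- The common case: no joins at all, because league sits on the fact.
SELECT game_date, league, result
FROM fact_games
WHERE season_number = @this_season
  AND clan_id = @this_clan_id;

-- The occasional case: full ladder detail via a single join to dim_result.
SELECT f.game_date, r.league, r.division, r.division_rating
FROM fact_games f
LEFT JOIN dim_result r ON f.result_id = r.result_id
WHERE f.season_number = @this_season;
```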
As stated above, I used to consider the ‘game’ entry the real core of
the dataset, but now think it is only incidental. The
game_id is the ‘production’ key, but there is no obvious
need for any surrogate key - there is no versioning, uniqueness is
guaranteed through the ETL process, etc.
The real ‘question’ is: is there any point to keeping this as its own table? Fundamentally, the map is not a ‘core’ data feature, and self-joining the fact table on game_id can pair the teams, so there is an argument for just widening the fact table with the map information. However, there is equally a reasonable argument for pulling game_date off of the fact table and leaving it only here.
Oh well.
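For reference, the self-join mentioned above is about as simple as it sounds - a sketch, where the result column and @this_clan_id variable are placeholders:

```sql
-- Pairing the two sides of a game by self-joining the fact on game_id.
SELECT
    ours.game_id,
    ours.clan_id   AS our_clan_id,
    theirs.clan_id AS their_clan_id,
    ours.result    AS our_result
FROM fact_games ours
JOIN fact_games theirs
  ON  ours.game_id   = theirs.game_id
  AND ours.result_id <> theirs.result_id   -- the other team's row
WHERE ours.clan_id = @this_clan_id;
```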
Just clan information now, but this is the first place where slowly changing dimensions might be realistic.
Fundamentally, clans can change their external identifying
information more-or-less at will - both tag and
clan_name. Right now, I grab this data once per season, and
make absolutely no effort to correct this - clan_id does
not change, so there is no ‘loss’ here apart from mild confusion where
reporting tags do not match tags in replays.
This would, therefore, be a good candidate for a slowly changing dimension, but I am not entirely certain how or when I would trigger an update. Even without the slowly changing dimension aspect, this is a candidate for a surrogate key as well - there is no single unique key, as there will be one row per clan per season. A concatenation of clan_id, season_number and version would allow for easy ‘correct at time of play’ single-key matches, reverting to the composite key for ‘correct for current information’. This feels a bit backward to me.
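To make that a bit more concrete, here is roughly what I have in mind - an illustration only, with an invented table name, guessed column types, and a type-2-style current-row flag, rather than anything I have actually built:

```sql
-- Illustrative sketch of a slowly changing clan dimension.
-- One row per clan per season per version of its tag/clan_name.
CREATE TABLE dim_clans (
    clan_key      INT         NOT NULL,  -- surrogate key, e.g. built from
                                          -- clan_id + season_number + version
    clan_id       INT         NOT NULL,  -- stable 'production' id
    season_number INT         NOT NULL,
    version       INT         NOT NULL,  -- bumped when tag or clan_name changes
    tag           VARCHAR(8)  NOT NULL,
    clan_name     VARCHAR(64) NOT NULL,
    is_current    TINYINT     NOT NULL   -- 1 for the latest row per clan
);

-- 'Correct at time of play': join the fact on clan_key alone.
-- 'Correct for current information': join on clan_id where is_current = 1.
```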
This is another candidate for a surrogate key, which would have to be result_id plus player_id to give ‘this player in this game’. It might be worthwhile if I took the snowflaking approach, splitting off the per-game statistics and adding further random statistics. That would actually complicate things quite a bit, as it would definitely require a slowly changing player fact table holding dated clan and player name data, and linking through to the expanded statistics. I don’t currently care about which clan a player is technically in, or when they change their name, and I do not collect or store wider random or ranked game statistics; those would require further decisions on how to handle the slowly changing nature, as they are stored as aggregate stats only and the change over time is quite important.
Having game_date in this table is almost certainly
overkill unless I want to visualise average performance over time
without needing joins - back to the same question about how
self-contained fact tables should be.
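The same question in query form, assuming a players table keyed on result_id and player_id (the table name, the damage_dealt metric, and @this_player are all invented for illustration):

```sql
-- With game_date denormalised onto the players table, no join is needed:
SELECT game_date, AVG(damage_dealt) AS avg_damage
FROM fact_players                 -- invented name for the players table
WHERE player_id = @this_player
GROUP BY game_date;

-- Without it, the same trend chart needs a hop through the games fact:
SELECT f.game_date, AVG(p.damage_dealt) AS avg_damage
FROM fact_players p
JOIN fact_games f ON p.result_id = f.result_id
WHERE p.player_id = @this_player
GROUP BY f.game_date;
```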
Listening to a dbt talk from a couple of years ago, I gathered that wide tables are apparently quite efficient - and, certainly for Google Data Studio, at least partially preferred. A couple of approaches come to mind:
- games and players - this slightly reduces redundancy, having two rows per game (one per team) and (usually) seven rows of players per team.

Theoretically, in the first approach, the games table could be structured with a single line per game differentiating ‘team 1’ and ‘team 2’, but all this really does is complicate both ETL and analysis in order to make a table that looks slightly more like an ‘end product’ in and of itself, where you get all of the information pertaining to a game in a single line. I’m not a fan.
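For what it’s worth, the two-table version would amount to little more than materialising the joins from earlier - a sketch, reusing the invented names from above with a trimmed column list:

```sql
-- Sketch of the 'games' half of the two-wide-table approach: one row per
-- team per game, with everything a Data Studio report needs pre-joined.
CREATE VIEW wide_games AS
SELECT
    f.game_id,
    f.game_date,
    f.season_number,
    f.result,
    f.league,
    c.tag,
    c.clan_name
FROM fact_games f
LEFT JOIN dim_clans c
  ON  f.clan_id       = c.clan_id
  AND f.season_number = c.season_number;

-- A matching wide players view (one row per player per game) would be
-- built the same way on top of the players table.
```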
All in all, I’m not really enthused by this idea. I can see the argument for certain wide tables or materialised views to be created - indeed, I do just that to ease Data Studio reporting - but I think I’ll stick with thinking about dimensional modelling and optimising for analytics - at least, for analytics platforms that aren’t the awkward mess that is Data Studio.