The goal of Sprint 3 is to improve the Lane Theory Lab analysis without restarting the project. Sprint 2 successfully built a reproducible modeling pipeline, but the first models used broad composition counts that were likely too blunt to capture meaningful Dota strategy.
Sprint 3 keeps the project focused on the academic MVP while adding a stronger interpretation layer:
Lane Theory Lab uses public Dota 2 match data to evaluate whether simplified draft-composition features can support esports business intelligence. The business problem is not just whether a model can predict winners. The deeper question is whether public match data can be transformed into useful decision-support signals for players, analysts, content creators, or future dashboard users.
The product-facing version of the question is:
Can a simple public-data draft-risk calculator help users understand whether a draft shape appears fragile, balanced, greedy, or difficult to execute?
This sprint focuses on four questions:
required_packages <- c(
"tidyverse",
"tidymodels",
"ranger",
"knitr",
"broom",
"scales"
)
missing_packages <- required_packages[
!required_packages %in% rownames(installed.packages())
]
if (length(missing_packages) > 0) {
stop(
paste0(
"Missing required package(s): ",
paste(missing_packages, collapse = ", "),
"\nInstall them with: install.packages(c('",
paste(missing_packages, collapse = "', '"),
"'))"
)
)
}
library(tidyverse)
library(tidymodels)
library(ranger)
library(knitr)
library(broom)
library(scales)
set.seed(580)
# ================================================================
# Data loading block
# Purpose:
# Prefer the larger Sprint 3 expanded dataset if it exists.
# Otherwise, fall back to the stable Sprint 1/Sprint 2 dataset.
# ================================================================
expanded_path <- "data/lane_theory_modeling_dataset_sprint3_expanded.csv"
original_path <- "data/lane_theory_modeling_dataset.csv"
data_path <- if (file.exists(expanded_path)) expanded_path else original_path
if (!file.exists(data_path)) {
stop(
paste0(
"Could not find a modeling dataset. Expected one of:\n",
expanded_path, "\n",
original_path, "\n",
"Check that this Rmd file is saved in the project root folder."
)
)
}
model_raw <- read_csv(data_path, show_col_types = FALSE)
cat("Using dataset:", data_path, "\n")
## Using dataset: data/lane_theory_modeling_dataset_sprint3_expanded.csv
glimpse(model_raw)
## Rows: 5,000
## Columns: 36
## $ match_id <dbl> 8855078541, 8855077836, 8855074571, 8855074472, 8855073800…
## $ radiant_win <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,…
## $ avg_rank_tier <dbl> 62, 52, 35, 64, 42, 33, 24, 21, 41, 32, 41, 63, 13, 23, 12…
## $ duration <dbl> 398, 499, 384, 539, 897, 895, 549, 765, 982, 1020, 919, 80…
## $ game_mode <dbl> 23, 23, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 4,…
## $ lobby_type <dbl> 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0…
## $ cluster <dbl> 225, 227, 144, 151, 153, 183, 186, 153, 202, 181, 413, 153…
## $ dire_heroes_on_team <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ radiant_heroes_on_team <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ dire_melee_count <dbl> 2, 3, 2, 3, 1, 4, 2, 4, 3, 1, 2, 1, 3, 2, 3, 3, 3, 2, 4, 3…
## $ radiant_melee_count <dbl> 5, 2, 2, 1, 2, 3, 4, 2, 1, 3, 2, 3, 3, 2, 4, 1, 4, 1, 3, 1…
## $ dire_ranged_count <dbl> 3, 2, 3, 2, 4, 1, 3, 1, 2, 4, 3, 4, 2, 3, 2, 2, 2, 3, 1, 2…
## $ radiant_ranged_count <dbl> 0, 3, 3, 4, 3, 2, 1, 3, 4, 2, 3, 2, 2, 3, 1, 4, 1, 4, 2, 4…
## $ dire_str_count <dbl> 1, 2, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 0, 3, 1…
## $ radiant_str_count <dbl> 2, 1, 2, 1, 2, 2, 3, 2, 1, 1, 1, 2, 0, 1, 5, 1, 1, 0, 2, 1…
## $ dire_agi_count <dbl> 3, 1, 1, 1, 2, 1, 0, 2, 2, 1, 1, 1, 2, 1, 1, 2, 0, 3, 1, 1…
## $ radiant_agi_count <dbl> 2, 1, 1, 1, 2, 1, 0, 1, 2, 3, 1, 1, 3, 1, 0, 1, 2, 2, 1, 1…
## $ dire_int_count <dbl> 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 2, 0, 1, 2, 2, 1, 2…
## $ radiant_int_count <dbl> 1, 2, 2, 3, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2, 0, 2, 0, 2, 2, 2…
## $ dire_all_count <dbl> 1, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0, 0, 3, 0, 1, 0, 0, 1…
## $ radiant_all_count <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 2, 1, 0, 1…
## $ dire_missing_attack_type <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_attack_type <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dire_missing_primary_attr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_primary_attr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ melee_count_difference <dbl> 3, -1, 0, -2, 1, -1, 2, -2, -2, 2, 0, 2, 0, 0, 1, -2, 1, -…
## $ ranged_count_difference <dbl> -3, 1, 0, 2, -1, 1, -2, 2, 2, -2, 0, -2, 0, 0, -1, 2, -1, …
## $ str_count_difference <dbl> 1, -1, 0, -2, 0, 0, 1, 0, 0, 0, 0, 1, -2, -1, 4, -1, -1, 0…
## $ agi_count_difference <dbl> -1, 0, 0, 0, 0, 0, 0, -1, 0, 2, 0, 0, 1, 0, -1, -1, 2, -1,…
## $ int_count_difference <dbl> 1, 0, 0, 2, -1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 1, -2, 0, …
## $ all_count_difference <dbl> -1, 1, 0, 0, 1, 0, -1, 1, 0, -1, 1, -1, 1, 1, -3, 1, 1, 1,…
## $ radiant_melee_share <dbl> 1.0, 0.4, 0.4, 0.2, 0.4, 0.6, 0.8, 0.4, 0.2, 0.6, 0.4, 0.6…
## $ dire_melee_share <dbl> 0.4, 0.6, 0.4, 0.6, 0.2, 0.8, 0.4, 0.8, 0.6, 0.2, 0.4, 0.2…
## $ radiant_ranged_share <dbl> 0.0, 0.6, 0.6, 0.8, 0.6, 0.4, 0.2, 0.6, 0.8, 0.4, 0.6, 0.4…
## $ dire_ranged_share <dbl> 0.6, 0.4, 0.6, 0.4, 0.8, 0.2, 0.6, 0.2, 0.4, 0.8, 0.6, 0.8…
## $ rank_bracket <chr> "Divine/Immortal", "Legend/Ancient", "Crusader/Archon", "D…
# ================================================================
# Initial checks block
# Purpose:
# Confirm the dataset has the expected match-level structure.
# ================================================================
initial_checks <- tibble(
check = c(
"Dataset has rows",
"match_id column exists",
"radiant_win target exists",
"Match IDs are unique",
"Melee difference feature exists",
"Attribute difference features exist"
),
passed = c(
nrow(model_raw) > 0,
"match_id" %in% names(model_raw),
"radiant_win" %in% names(model_raw),
if ("match_id" %in% names(model_raw)) n_distinct(model_raw$match_id) == nrow(model_raw) else FALSE,
"melee_count_difference" %in% names(model_raw),
all(c(
"str_count_difference",
"agi_count_difference",
"int_count_difference",
"all_count_difference"
) %in% names(model_raw))
)
) %>%
mutate(result = if_else(passed, "PASS", "FAIL"))
kable(initial_checks, caption = "Sprint 3 Initial Dataset Checks")
| check | passed | result |
|---|---|---|
| Dataset has rows | TRUE | PASS |
| match_id column exists | TRUE | PASS |
| radiant_win target exists | TRUE | PASS |
| Match IDs are unique | TRUE | PASS |
| Melee difference feature exists | TRUE | PASS |
| Attribute difference features exist | TRUE | PASS |
stopifnot(all(initial_checks$passed))
# ================================================================
# Target preparation block
# Purpose:
# Convert radiant_win into a two-class factor for classification.
# ================================================================
model_prepped <- model_raw %>%
mutate(
radiant_win_text = str_to_lower(as.character(radiant_win)),
radiant_win = case_when(
radiant_win_text %in% c("true", "1", "radiant_win", "radiant", "win", "yes") ~ "Radiant_Win",
radiant_win_text %in% c("false", "0", "dire_win", "dire", "loss", "no") ~ "Dire_Win",
TRUE ~ NA_character_
),
radiant_win = factor(radiant_win, levels = c("Dire_Win", "Radiant_Win"))
) %>%
select(-radiant_win_text) %>%
drop_na(radiant_win)
target_balance <- model_prepped %>%
count(radiant_win) %>%
mutate(percent = n / sum(n))
kable(target_balance, digits = 3, caption = "Target Class Balance")
| radiant_win | n | percent |
|---|---|---|
| Dire_Win | 2357 | 0.471 |
| Radiant_Win | 2643 | 0.529 |
The first Sprint 2 models used broad team-composition variables because they are transparent and easy to reproduce. Sprint 3 keeps that transparency but reshapes some variables into draft-risk indicators.
The player-informed idea is not that melee heroes, ranged heroes, or primary attributes are automatically good or bad. The more useful idea is that certain draft shapes can create public-match risks:
This is why the model is called domain-informed, not expert-deterministic. It uses Dota experience to ask better questions, but the data still decides what signal appears.
# ================================================================
# Domain-informed feature engineering block
# Purpose:
# Create interpretable draft-risk variables from existing composition
# counts. These features do not require new API calls.
# ================================================================
required_feature_columns <- c(
"radiant_melee_count",
"dire_melee_count",
"radiant_ranged_count",
"dire_ranged_count",
"radiant_str_count",
"dire_str_count",
"radiant_agi_count",
"dire_agi_count",
"radiant_int_count",
"dire_int_count",
"radiant_all_count",
"dire_all_count",
"melee_count_difference",
"str_count_difference",
"agi_count_difference",
"int_count_difference",
"all_count_difference"
)
missing_feature_columns <- setdiff(required_feature_columns, names(model_prepped))
if (length(missing_feature_columns) > 0) {
stop(
paste0(
"Missing required feature columns: ",
paste(missing_feature_columns, collapse = ", ")
)
)
}
# Helper function for attribute imbalance.
# A balanced four-attribute profile has counts close to the team's average.
# A skewed profile has one or more attributes far above or below that average.
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
attr_mean <- (str_count + agi_count + int_count + all_count) / 4
abs(str_count - attr_mean) +
abs(agi_count - attr_mean) +
abs(int_count - attr_mean) +
abs(all_count - attr_mean)
}
model_domain <- model_prepped %>%
mutate(
# Attack type categories.
melee_advantage_category = case_when(
melee_count_difference <= -2 ~ "Radiant much less melee",
melee_count_difference == -1 ~ "Radiant slightly less melee",
melee_count_difference == 0 ~ "Even melee count",
melee_count_difference == 1 ~ "Radiant slightly more melee",
melee_count_difference >= 2 ~ "Radiant much more melee",
TRUE ~ NA_character_
),
melee_advantage_category = factor(
melee_advantage_category,
levels = c(
"Radiant much less melee",
"Radiant slightly less melee",
"Even melee count",
"Radiant slightly more melee",
"Radiant much more melee"
)
),
# Heavy composition flags.
radiant_heavy_melee = as.integer(radiant_melee_count >= 4),
dire_heavy_melee = as.integer(dire_melee_count >= 4),
radiant_all_ranged = as.integer(radiant_melee_count == 0),
dire_all_ranged = as.integer(dire_melee_count == 0),
radiant_all_melee = as.integer(radiant_ranged_count == 0),
dire_all_melee = as.integer(dire_ranged_count == 0),
# Weak frontline/brawling proxy.
# This is intentionally labeled as a proxy because Strength/Universal
# does not equal initiation or true frontline.
radiant_low_frontline_proxy = as.integer(radiant_str_count <= 1 & radiant_all_count <= 1),
dire_low_frontline_proxy = as.integer(dire_str_count <= 1 & dire_all_count <= 1),
frontline_proxy_difference = radiant_low_frontline_proxy - dire_low_frontline_proxy,
# Attribute imbalance.
radiant_attribute_imbalance = attribute_imbalance(
radiant_str_count,
radiant_agi_count,
radiant_int_count,
radiant_all_count
),
dire_attribute_imbalance = attribute_imbalance(
dire_str_count,
dire_agi_count,
dire_int_count,
dire_all_count
),
attribute_imbalance_difference = radiant_attribute_imbalance - dire_attribute_imbalance,
# Simple greed/scaling proxy.
# This is not a hero-role model. It is a deliberately rough proxy.
radiant_high_agi_proxy = as.integer(radiant_agi_count >= 3),
dire_high_agi_proxy = as.integer(dire_agi_count >= 3),
high_agi_proxy_difference = radiant_high_agi_proxy - dire_high_agi_proxy,
# Extreme intelligence stack proxy.
radiant_high_int_proxy = as.integer(radiant_int_count >= 4),
dire_high_int_proxy = as.integer(dire_int_count >= 4),
high_int_proxy_difference = radiant_high_int_proxy - dire_high_int_proxy
) %>%
drop_na()
# Preview the new feature columns.
domain_feature_preview <- model_domain %>%
select(
radiant_win,
melee_count_difference,
melee_advantage_category,
radiant_heavy_melee,
dire_heavy_melee,
radiant_all_ranged,
dire_all_ranged,
radiant_low_frontline_proxy,
dire_low_frontline_proxy,
radiant_attribute_imbalance,
dire_attribute_imbalance,
attribute_imbalance_difference
) %>%
head(10)
kable(domain_feature_preview, caption = "Preview of Domain-Informed Draft-Risk Features")
| radiant_win | melee_count_difference | melee_advantage_category | radiant_heavy_melee | dire_heavy_melee | radiant_all_ranged | dire_all_ranged | radiant_low_frontline_proxy | dire_low_frontline_proxy | radiant_attribute_imbalance | dire_attribute_imbalance | attribute_imbalance_difference |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dire_Win | 3 | Radiant much more melee | 1 | 0 | 0 | 0 | 0 | 1 | 3.0 | 3.5 | -0.5 |
| Dire_Win | -1 | Radiant slightly less melee | 0 | 0 | 0 | 0 | 1 | 0 | 1.5 | 3.0 | -1.5 |
| Dire_Win | 0 | Even melee count | 0 | 0 | 0 | 0 | 0 | 0 | 3.0 | 3.0 | 0.0 |
| Dire_Win | -2 | Radiant much less melee | 0 | 0 | 0 | 0 | 1 | 0 | 3.5 | 3.5 | 0.0 |
| Radiant_Win | 1 | Radiant slightly more melee | 0 | 0 | 0 | 0 | 0 | 0 | 3.0 | 3.0 | 0.0 |
| Radiant_Win | -1 | Radiant slightly less melee | 0 | 1 | 0 | 0 | 0 | 0 | 1.5 | 1.5 | 0.0 |
| Radiant_Win | 2 | Radiant much more melee | 1 | 0 | 0 | 0 | 0 | 0 | 3.5 | 3.0 | 0.5 |
| Dire_Win | -2 | Radiant much less melee | 0 | 1 | 0 | 0 | 0 | 0 | 1.5 | 3.0 | -1.5 |
| Radiant_Win | -2 | Radiant much less melee | 0 | 0 | 0 | 0 | 1 | 1 | 3.0 | 3.0 | 0.0 |
| Dire_Win | 2 | Radiant much more melee | 0 | 0 | 0 | 0 | 1 | 1 | 3.5 | 1.5 | 2.0 |
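To make the imbalance score concrete, the short check below recomputes it for the first match in the preview table, using the attribute counts shown in the `glimpse()` output (Radiant: 2 Strength, 2 Agility, 1 Intelligence, 0 Universal; Dire: 1, 3, 0, 1). It is only a sketch: the helper is repeated so the example runs on its own.

```r
# Same helper as in the feature engineering block, repeated so this
# example is self-contained.
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
  attr_mean <- (str_count + agi_count + int_count + all_count) / 4
  abs(str_count - attr_mean) +
    abs(agi_count - attr_mean) +
    abs(int_count - attr_mean) +
    abs(all_count - attr_mean)
}

# First match in the dataset: counts taken from the glimpse() output.
radiant_score <- attribute_imbalance(2, 2, 1, 0)  # 3.0
dire_score <- attribute_imbalance(1, 3, 0, 1)     # 3.5

# Difference used as attribute_imbalance_difference.
radiant_score - dire_score  # -0.5, matching the preview table
```

With five heroes per team the attribute mean is always 1.25, so the score ranges from 1.5 (a 2/1/1/1 split) to 7.5 (all five heroes sharing one attribute).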
# ================================================================
# EDA block
# Purpose:
# Make the melee/ranged question easier to read than raw counts.
# ================================================================
melee_category_summary <- model_domain %>%
group_by(melee_advantage_category) %>%
summarise(
matches = n(),
radiant_win_rate = mean(radiant_win == "Radiant_Win"),
.groups = "drop"
) %>%
mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))
kable(
melee_category_summary,
digits = 3,
caption = "Radiant Win Rate by Melee Advantage Category"
)
| melee_advantage_category | matches | radiant_win_rate | win_rate_label |
|---|---|---|---|
| Radiant much less melee | 589 | 0.487 | 48.7% |
| Radiant slightly less melee | 1177 | 0.517 | 51.7% |
| Even melee count | 1466 | 0.550 | 55.0% |
| Radiant slightly more melee | 1144 | 0.525 | 52.5% |
| Radiant much more melee | 624 | 0.546 | 54.6% |
melee_category_summary %>%
ggplot(aes(x = melee_advantage_category, y = radiant_win_rate)) +
geom_col() +
geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.4) +
scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
labs(
title = "Radiant Win Rate by Melee Advantage Category",
subtitle = "Domain-informed check: raw melee difference grouped into readable draft shapes",
x = "Melee Advantage Category",
y = "Radiant Win Rate"
) +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
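The win rates in the melee category table differ by only a few percentage points, so it can help to put an uncertainty band around each one. The sketch below computes a Wilson score interval from the match counts and rounded win rates in that table. The helper name `wilson_interval()` is hypothetical and is not part of the project code.

```r
# Wilson score interval for a proportion, vectorized over p_hat and n.
wilson_interval <- function(p_hat, n, conf_level = 0.95) {
  z <- qnorm(1 - (1 - conf_level) / 2)
  denom <- 1 + z^2 / n
  center <- (p_hat + z^2 / (2 * n)) / denom
  half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / denom
  data.frame(lower = center - half_width, upper = center + half_width)
}

# Match counts and rounded win rates copied from the summary table.
melee_table <- data.frame(
  category = c(
    "Radiant much less melee",
    "Radiant slightly less melee",
    "Even melee count",
    "Radiant slightly more melee",
    "Radiant much more melee"
  ),
  matches = c(589, 1177, 1466, 1144, 624),
  radiant_win_rate = c(0.487, 0.517, 0.550, 0.525, 0.546)
)

melee_intervals <- cbind(
  melee_table,
  wilson_interval(melee_table$radiant_win_rate, melee_table$matches)
)
print(melee_intervals, digits = 3)
```

With group sizes between roughly 600 and 1,500 matches, each interval extends a few percentage points on either side of the estimate, so small gaps between categories should be read cautiously.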
# ================================================================
# EDA block
# Purpose:
# Summarize simple draft-risk flags in one readable table.
# ================================================================
risk_flag_summary <- model_domain %>%
summarise(
radiant_heavy_melee_wr = mean(radiant_win == "Radiant_Win" & radiant_heavy_melee == 1) / mean(radiant_heavy_melee == 1),
radiant_all_ranged_wr = mean(radiant_win == "Radiant_Win" & radiant_all_ranged == 1) / mean(radiant_all_ranged == 1),
radiant_low_frontline_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_low_frontline_proxy == 1) / mean(radiant_low_frontline_proxy == 1),
radiant_high_agi_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_high_agi_proxy == 1) / mean(radiant_high_agi_proxy == 1),
radiant_high_int_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_high_int_proxy == 1) / mean(radiant_high_int_proxy == 1),
overall_radiant_wr = mean(radiant_win == "Radiant_Win")
) %>%
pivot_longer(
cols = everything(),
names_to = "draft_condition",
values_to = "radiant_win_rate"
) %>%
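# Keep undefined rates (a flag with no matching games) as NA rather than reporting 0.
mutate(radiant_win_rate = if_else(is.nan(radiant_win_rate), NA_real_, radiant_win_rate))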
kable(
risk_flag_summary,
digits = 3,
caption = "Radiant Win Rate Under Selected Draft-Risk Conditions"
)
| draft_condition | radiant_win_rate |
|---|---|
| radiant_heavy_melee_wr | 0.540 |
| radiant_all_ranged_wr | 0.435 |
| radiant_low_frontline_proxy_wr | 0.511 |
| radiant_high_agi_proxy_wr | 0.549 |
| radiant_high_int_proxy_wr | 0.563 |
| overall_radiant_wr | 0.529 |
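The risk-flag table reports win rates without the number of matches behind each one, and rare conditions such as an all-ranged draft may rest on few games. A minimal sketch of a helper that returns both values is shown below. `flag_win_rate()` and the toy vectors are hypothetical and are used only for illustration.

```r
# Hypothetical helper: sample size and Radiant win rate for one binary flag.
# Returns NA for the rate when no match carries the flag.
flag_win_rate <- function(flag, radiant_won) {
  flagged <- flag == 1
  n_flagged <- sum(flagged)
  data.frame(
    matches = n_flagged,
    radiant_win_rate = if (n_flagged > 0) mean(radiant_won[flagged]) else NA_real_
  )
}

# Toy vectors for illustration only (not project data).
toy_flag <- c(1, 0, 1, 1, 0, 0, 1, 0)
toy_radiant_won <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

toy_result <- flag_win_rate(toy_flag, toy_radiant_won)
empty_result <- flag_win_rate(rep(0, 8), toy_radiant_won)

toy_result    # 4 flagged matches, win rate 0.75
empty_result  # 0 flagged matches, win rate NA
```

In the report this could be called as `flag_win_rate(model_domain$radiant_all_ranged, model_domain$radiant_win == "Radiant_Win")`, once per flag.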
# ================================================================
# EDA block
# Purpose:
# Test whether skewed attribute profiles appear meaningfully different.
# ================================================================
attribute_imbalance_summary <- model_domain %>%
mutate(
radiant_more_imbalanced = case_when(
attribute_imbalance_difference > 0 ~ "Radiant more imbalanced",
attribute_imbalance_difference < 0 ~ "Dire more imbalanced",
TRUE ~ "Equal imbalance"
)
) %>%
group_by(radiant_more_imbalanced) %>%
summarise(
matches = n(),
radiant_win_rate = mean(radiant_win == "Radiant_Win"),
.groups = "drop"
) %>%
mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))
kable(
attribute_imbalance_summary,
digits = 3,
caption = "Radiant Win Rate by Attribute Imbalance Direction"
)
| radiant_more_imbalanced | matches | radiant_win_rate | win_rate_label |
|---|---|---|---|
| Dire more imbalanced | 1802 | 0.530 | 53.0% |
| Equal imbalance | 1460 | 0.508 | 50.8% |
| Radiant more imbalanced | 1738 | 0.545 | 54.5% |
attribute_imbalance_summary %>%
ggplot(aes(x = radiant_more_imbalanced, y = radiant_win_rate)) +
geom_col() +
geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.5) +
scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
labs(
title = "Radiant Win Rate by Attribute Imbalance Direction",
subtitle = "A rough proxy for whether one team has a more skewed draft identity",
x = "Attribute Imbalance Category",
y = "Radiant Win Rate"
)
# ================================================================
# Model 3 data block
# Purpose:
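# Select a compact set of draft-risk features.
#
# Important:
# This avoids the broad Sprint 2 issue where melee counts, ranged
# counts, differences, and shares encoded overlapping information.
# Two exact redundancies remain: the four attribute differences sum to
# zero, and frontline_proxy_difference equals radiant_low_frontline_proxy
# minus dire_low_frontline_proxy. glm() reports NA for one term in each set.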
# ================================================================
model3_predictors <- c(
"melee_count_difference",
"str_count_difference",
"agi_count_difference",
"int_count_difference",
"all_count_difference",
"attribute_imbalance_difference",
"frontline_proxy_difference",
"high_agi_proxy_difference",
"high_int_proxy_difference",
"radiant_heavy_melee",
"dire_heavy_melee",
"radiant_all_ranged",
"dire_all_ranged",
"radiant_low_frontline_proxy",
"dire_low_frontline_proxy"
)
model3_df <- model_domain %>%
select(radiant_win, all_of(model3_predictors)) %>%
drop_na()
model3_columns <- tibble(
column = names(model3_df),
type = map_chr(model3_df, ~ class(.x)[1])
)
kable(model3_columns, caption = "Model 3 Final Modeling Columns")
| column | type |
|---|---|
| radiant_win | factor |
| melee_count_difference | numeric |
| str_count_difference | numeric |
| agi_count_difference | numeric |
| int_count_difference | numeric |
| all_count_difference | numeric |
| attribute_imbalance_difference | numeric |
| frontline_proxy_difference | integer |
| high_agi_proxy_difference | integer |
| high_int_proxy_difference | integer |
| radiant_heavy_melee | integer |
| dire_heavy_melee | integer |
| radiant_all_ranged | integer |
| dire_all_ranged | integer |
| radiant_low_frontline_proxy | integer |
| dire_low_frontline_proxy | integer |
# ================================================================
# Readiness checks block
# Purpose:
# Confirm Model 3 data is usable before fitting.
# ================================================================
model3_checks <- tibble(
check = c(
"Model 3 data has rows",
"Target has exactly two classes",
"Target has no missing values",
"All predictors are numeric",
"At least five predictors are available",
"duration is excluded",
"match_id is excluded"
),
passed = c(
nrow(model3_df) > 0,
n_distinct(model3_df$radiant_win) == 2,
all(!is.na(model3_df$radiant_win)),
all(map_lgl(model3_df %>% select(-radiant_win), is.numeric)),
ncol(model3_df) - 1 >= 5,
!"duration" %in% names(model3_df),
!"match_id" %in% names(model3_df)
)
) %>%
mutate(result = if_else(passed, "PASS", "FAIL"))
kable(model3_checks, caption = "Model 3 Readiness Checks")
| check | passed | result |
|---|---|---|
| Model 3 data has rows | TRUE | PASS |
| Target has exactly two classes | TRUE | PASS |
| Target has no missing values | TRUE | PASS |
| All predictors are numeric | TRUE | PASS |
| At least five predictors are available | TRUE | PASS |
| duration is excluded | TRUE | PASS |
| match_id is excluded | TRUE | PASS |
stopifnot(all(model3_checks$passed))
# ================================================================
# Train/test split block
# Purpose:
# Split data while preserving class balance.
# ================================================================
lane_split <- initial_split(model3_df, prop = 0.80, strata = radiant_win)
train_data <- training(lane_split)
test_data <- testing(lane_split)
split_summary <- tibble(
dataset = c("Training", "Testing"),
rows = c(nrow(train_data), nrow(test_data)),
radiant_win_rate = c(
mean(train_data$radiant_win == "Radiant_Win"),
mean(test_data$radiant_win == "Radiant_Win")
)
)
kable(split_summary, digits = 3, caption = "Model 3 Train/Test Split Summary")
| dataset | rows | radiant_win_rate |
|---|---|---|
| Training | 3999 | 0.529 |
| Testing | 1001 | 0.528 |
# ================================================================
# Baseline block
# Purpose:
# Establish the majority-class benchmark. The model should be judged
# against this, not just against 50 percent accuracy.
# ================================================================
majority_class_rate <- train_data %>%
count(radiant_win) %>%
mutate(rate = n / sum(n)) %>%
arrange(desc(rate)) %>%
slice(1)
kable(
majority_class_rate,
digits = 3,
caption = "Majority-Class Baseline from Training Data"
)
| radiant_win | n | rate |
|---|---|---|
| Radiant_Win | 2114 | 0.529 |
# ================================================================
# Model 3 fitting block
# Purpose:
# Fit an interpretable logistic regression using cleaner, domain-
# informed draft-risk features.
# ================================================================
model3_recipe <- recipe(radiant_win ~ ., data = train_data) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
model3_log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
model3_log_workflow <- workflow() %>%
add_recipe(model3_recipe) %>%
add_model(model3_log_spec)
model3_log_fit <- fit(model3_log_workflow, data = train_data)
model3_log_fit
## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
##
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
##
## • step_zv()
## • step_normalize()
##
## ── Model ────────────────────────────────────────────────────────────────────────────────────────
##
## Call: stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
##
## Coefficients:
## (Intercept) melee_count_difference str_count_difference
## 0.115178 0.036814 0.064215
## agi_count_difference int_count_difference all_count_difference
## 0.079635 0.098550 NA
## attribute_imbalance_difference frontline_proxy_difference high_agi_proxy_difference
## -0.013211 -0.095076 0.060048
## high_int_proxy_difference radiant_heavy_melee dire_heavy_melee
## 0.034014 -0.003934 0.070702
## radiant_all_ranged dire_all_ranged radiant_low_frontline_proxy
## -0.029035 -0.028420 -0.021098
## dire_low_frontline_proxy
## NA
##
## Degrees of Freedom: 3998 Total (i.e. Null); 3985 Residual
## Null Deviance: 5531
## Residual Deviance: 5511 AIC: 5539
# ================================================================
# Prediction and metrics block
# Purpose:
# Evaluate Model 3 on the test set.
# ================================================================
model3_predictions <- predict(model3_log_fit, test_data, type = "prob") %>%
bind_cols(predict(model3_log_fit, test_data, type = "class")) %>%
bind_cols(test_data %>% select(radiant_win))
model3_metrics <- bind_rows(
accuracy(model3_predictions, truth = radiant_win, estimate = .pred_class),
roc_auc(model3_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
mutate(model = "Model 3: Domain-Informed Draft Risk", .before = 1)
kable(model3_metrics, digits = 3, caption = "Model 3 Performance")
| model | .metric | .estimator | .estimate |
|---|---|---|---|
| Model 3: Domain-Informed Draft Risk | accuracy | binary | 0.519 |
| Model 3: Domain-Informed Draft Risk | roc_auc | binary | 0.492 |
# ================================================================
# Confusion matrix block
# Purpose:
# Check whether the model predicts both classes or mostly defaults to
# the majority class.
# ================================================================
model3_confusion <- conf_mat(
model3_predictions,
truth = radiant_win,
estimate = .pred_class
)
model3_confusion
## Truth
## Prediction Dire_Win Radiant_Win
## Dire_Win 111 120
## Radiant_Win 361 409
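The confusion matrix can be unpacked with a few lines of arithmetic to show how strongly the model leans toward the majority class. The sketch below re-enters the four counts by hand and derives accuracy, per-class recall, and the share of matches predicted as Radiant wins.

```r
# Counts copied from the confusion matrix (rows = prediction, columns = truth).
confusion <- matrix(
  c(111, 361, 120, 409),
  nrow = 2,
  dimnames = list(
    Prediction = c("Dire_Win", "Radiant_Win"),
    Truth = c("Dire_Win", "Radiant_Win")
  )
)

accuracy <- sum(diag(confusion)) / sum(confusion)
radiant_recall <- confusion["Radiant_Win", "Radiant_Win"] / sum(confusion[, "Radiant_Win"])
dire_recall <- confusion["Dire_Win", "Dire_Win"] / sum(confusion[, "Dire_Win"])
predicted_radiant_share <- sum(confusion["Radiant_Win", ]) / sum(confusion)

round(c(
  accuracy = accuracy,
  radiant_recall = radiant_recall,
  dire_recall = dire_recall,
  predicted_radiant_share = predicted_radiant_share
), 3)
# accuracy 0.519, radiant_recall 0.773, dire_recall 0.235,
# predicted_radiant_share 0.769
```

The model predicts a Radiant win for about 77 percent of test matches and recovers fewer than a quarter of actual Dire wins, so it behaves much like a softened majority-class rule.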
# ================================================================
# Coefficient interpretation block
# Purpose:
# Identify which domain-informed features have the largest positive
# or negative association with Radiant victory.
#
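# Important:
# Coefficients are associations, not causal proof. Because the recipe
# normalizes every predictor, each odds ratio describes a change of one
# standard deviation in that feature, not one additional hero.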
# ================================================================
model3_coefficients <- model3_log_fit %>%
extract_fit_parsnip() %>%
tidy() %>%
filter(term != "(Intercept)") %>%
mutate(
odds_ratio = exp(estimate),
direction = case_when(
is.na(estimate) ~ "Not estimated",
estimate > 0 ~ "Higher odds of Radiant win",
estimate < 0 ~ "Lower odds of Radiant win",
TRUE ~ "No estimated direction"
)
) %>%
arrange(desc(abs(estimate)))
kable(
model3_coefficients %>%
select(term, estimate, odds_ratio, direction),
digits = 3,
caption = "Model 3 Logistic Regression Coefficients"
)
| term | estimate | odds_ratio | direction |
|---|---|---|---|
| int_count_difference | 0.099 | 1.104 | Higher odds of Radiant win |
| frontline_proxy_difference | -0.095 | 0.909 | Lower odds of Radiant win |
| agi_count_difference | 0.080 | 1.083 | Higher odds of Radiant win |
| dire_heavy_melee | 0.071 | 1.073 | Higher odds of Radiant win |
| str_count_difference | 0.064 | 1.066 | Higher odds of Radiant win |
| high_agi_proxy_difference | 0.060 | 1.062 | Higher odds of Radiant win |
| melee_count_difference | 0.037 | 1.037 | Higher odds of Radiant win |
| high_int_proxy_difference | 0.034 | 1.035 | Higher odds of Radiant win |
| radiant_all_ranged | -0.029 | 0.971 | Lower odds of Radiant win |
| dire_all_ranged | -0.028 | 0.972 | Lower odds of Radiant win |
| radiant_low_frontline_proxy | -0.021 | 0.979 | Lower odds of Radiant win |
| attribute_imbalance_difference | -0.013 | 0.987 | Lower odds of Radiant win |
| radiant_heavy_melee | -0.004 | 0.996 | Lower odds of Radiant win |
| all_count_difference | NA | NA | Not estimated |
| dire_low_frontline_proxy | NA | NA | Not estimated |
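The two `NA` rows are what `glm()` reports when a predictor is an exact linear combination of others. Each team has five heroes spread across four attributes, so the four attribute differences always sum to zero. Likewise, `frontline_proxy_difference` is defined as `radiant_low_frontline_proxy - dire_low_frontline_proxy`, so only two of those three columns carry independent information. The sketch below reproduces the first case on simulated drafts. The simulated counts and the helper `draw_team()` are illustrative assumptions, not project data.

```r
set.seed(1)
n_matches <- 200

# Hypothetical toy drafts: five heroes per team, spread at random across
# four primary attributes (str, agi, int, all).
draw_team <- function(n) t(rmultinom(n, size = 5, prob = rep(0.25, 4)))

radiant <- draw_team(n_matches)
dire <- draw_team(n_matches)

diffs <- radiant - dire
colnames(diffs) <- c(
  "str_count_difference",
  "agi_count_difference",
  "int_count_difference",
  "all_count_difference"
)

# Every row sums to 5 - 5 = 0, so any one column is determined by the rest.
row_sums <- rowSums(diffs)
all(row_sums == 0)

# The design matrix (intercept plus four differences) is rank deficient.
design <- cbind(intercept = 1, diffs)
design_rank <- qr(design)$rank
c(columns = ncol(design), rank = design_rank)
```

Dropping one attribute difference and one of the three frontline columns would likely remove the `NA` terms without changing the fitted probabilities, because the dropped columns add no information.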
# ================================================================
# Optional comparison block
# Purpose:
# Fit a random forest using the same Model 3 features to check whether
# nonlinear interactions improve predictive performance.
# ================================================================
rf_mtry <- max(1, floor(sqrt(ncol(train_data) - 1)))
model3_rf_spec <- rand_forest(
trees = 500,
mtry = rf_mtry,
min_n = 5
) %>%
set_engine("ranger", importance = "permutation") %>%
set_mode("classification")
model3_rf_workflow <- workflow() %>%
add_recipe(model3_recipe) %>%
add_model(model3_rf_spec)
model3_rf_fit <- fit(model3_rf_workflow, data = train_data)
model3_rf_predictions <- predict(model3_rf_fit, test_data, type = "prob") %>%
bind_cols(predict(model3_rf_fit, test_data, type = "class")) %>%
bind_cols(test_data %>% select(radiant_win))
model3_rf_metrics <- bind_rows(
accuracy(model3_rf_predictions, truth = radiant_win, estimate = .pred_class),
roc_auc(model3_rf_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
mutate(model = "Model 3 RF: Domain-Informed Features", .before = 1)
kable(model3_rf_metrics, digits = 3, caption = "Model 3 Random Forest Performance")
| model | .metric | .estimator | .estimate |
|---|---|---|---|
| Model 3 RF: Domain-Informed Features | accuracy | binary | 0.502 |
| Model 3 RF: Domain-Informed Features | roc_auc | binary | 0.498 |
# ================================================================
# Comparison block
# Purpose:
# Compare Model 3 against the majority-class baseline.
# ================================================================
baseline_row <- tibble(
model = "Majority-Class Baseline",
accuracy = majority_class_rate$rate,
roc_auc = NA_real_
)
model3_comparison <- bind_rows(model3_metrics, model3_rf_metrics) %>%
select(model, .metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
bind_rows(baseline_row) %>%
arrange(desc(replace_na(roc_auc, 0)))
kable(
model3_comparison,
digits = 3,
caption = "Model 3 Performance Compared with Majority-Class Baseline"
)
| model | accuracy | roc_auc |
|---|---|---|
| Model 3 RF: Domain-Informed Features | 0.502 | 0.498 |
| Model 3: Domain-Informed Draft Risk | 0.519 | 0.492 |
| Majority-Class Baseline | 0.529 | NA |
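One way to judge the gap between the logistic model and the baseline is an exact binomial test on the test-set accuracy. The sketch below assumes the baseline rate of 0.529 can be treated as a fixed reference value and uses the correct-prediction count implied by the logistic model's confusion matrix (111 + 409 out of 1,001). It is a rough check, not a formal model comparison.

```r
# Correct predictions and test-set size from the logistic confusion matrix.
test_correct <- 111 + 409
test_total <- 111 + 361 + 120 + 409

# Assumption: the majority-class rate is treated as a known constant.
baseline_accuracy <- 0.529

accuracy_test <- binom.test(test_correct, test_total, p = baseline_accuracy)

accuracy_test$estimate   # observed test accuracy
accuracy_test$conf.int   # 95 percent confidence interval for accuracy
accuracy_test$p.value    # two-sided p-value against the baseline rate
```

If the confidence interval contains the baseline rate, the test set gives no evidence that the model's accuracy differs from always predicting a Radiant win.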
# ================================================================
# ROC curve block
# Purpose:
# Compare class-separation ability for the logistic and random forest
# versions of Model 3.
# ================================================================
model3_log_roc <- roc_curve(
model3_predictions,
truth = radiant_win,
.pred_Radiant_Win,
event_level = "second"
) %>%
mutate(model = "Model 3 Logistic")
model3_rf_roc <- roc_curve(
model3_rf_predictions,
truth = radiant_win,
.pred_Radiant_Win,
event_level = "second"
) %>%
mutate(model = "Model 3 Random Forest")
bind_rows(model3_log_roc, model3_rf_roc) %>%
ggplot(aes(x = 1 - specificity, y = sensitivity, linetype = model)) +
geom_path(linewidth = 1) +
geom_abline(linetype = "dashed") +
labs(
title = "Model 3 ROC Curve Comparison",
subtitle = "Curves near the diagonal indicate limited class-separation power",
x = "False Positive Rate",
y = "True Positive Rate",
linetype = "Model"
)
Use this section after reviewing the output values.
The Sprint 3 model tests whether cleaner, domain-informed draft-risk features improve the interpretability of the Lane Theory Lab analysis. These features are based on public match data and intentionally avoid unavailable behavioral details such as whether supports pulled correctly, whether a lane over-dived, or whether a position 4 actually played a support function.
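The key comparison is not whether the model exceeds 50 percent accuracy but whether it improves on the majority-class baseline, which reflects the observed Radiant-side win advantage in the sample. In this run it does not: the logistic model reaches 0.519 test accuracy (ROC AUC 0.492) and the random forest reaches 0.502 (ROC AUC 0.498), against a baseline of 0.529. Both ROC AUC values sit slightly below 0.5, so neither model separates the classes better than chance. The result suggests that broad team-level draft-risk features still do not explain enough of the match outcome by themselves.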
That finding would not invalidate the project. Instead, it would support the idea that useful esports BI requires richer context than simple team-level counts. Future versions should consider lane assignments, hero identity, role fidelity, reliable stun counts, initiation tools, support behavior, patch context, game mode, and rank bracket.
This sprint still uses match-level draft-composition features. It does not directly observe laning behavior, support behavior, bad pulls, inopportune dives, player skill, hero familiarity, item builds, objective control, or coordination. These are major parts of Dota outcomes.
The frontline and greed variables are deliberately labeled as proxies. Strength or Universal heroes do not automatically equal frontline or initiation. Agility-heavy drafts do not automatically mean greed. Intelligence-heavy drafts do not automatically mean spell pressure. These features are simple public-data approximations designed to ask better questions, not perfect strategic labels.
The current dataset may also mix game modes, rank brackets, patches, regions, and public-match environments. Turbo and ranked All Pick may not represent the same mindset. Public-match data may also include bots, smurfs, griefing, role abuse, and other behaviors that are difficult to detect from composition data alone.
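The mix of game modes can be checked directly from the `game_mode` column. The sketch below tabulates the first 15 values shown in the `glimpse()` output. The mode labels are an assumption based on the commonly published Dota 2 game-mode IDs (4 = Single Draft, 22 = All Draft, 23 = Turbo) and should be verified against the data source before use.

```r
# First 15 game_mode values copied from the glimpse() output.
game_mode_sample <- c(23, 23, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 4)

# Assumed ID-to-label mapping; verify against the data source.
mode_labels <- c(
  "4" = "Single Draft",
  "22" = "All Draft (ranked All Pick)",
  "23" = "Turbo"
)

mode_counts <- table(mode_labels[as.character(game_mode_sample)])
mode_counts
```

In this small sample 13 of 15 matches carry mode 23. In the full pipeline, `model_raw %>% count(game_mode, lobby_type)` would give the complete breakdown and indicate whether modeling each mode separately is feasible.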
The strongest next step is to treat Model 3 as a domain-informed baseline rather than a final predictive product. If the model performs weakly, the recommendation is not to force the result. Instead, the recommendation is to improve feature quality by adding context.
Recommended next steps:
To rerun this sprint, open the .Rmd file in RStudio and confirm that the modeling dataset is in the data/ folder.