1 Sprint 3 Goal
2 Business Intelligence Framing
3 Research Questions
4 Load Required Packages
5 Load Modeling Dataset
6 Initial Dataset Checks
7 Prepare Target Variable
8 Domain-Informed Feature Design
9 Create Domain-Informed Draft-Risk Features
10 Domain-Informed EDA
- 10.1 Win Rate by Melee Advantage Category
- 10.2 Win Rate by Draft-Risk Flags
- 10.3 Attribute Imbalance Summary
11 Build Modeling Dataset for Model 3
12 Model 3 Readiness Checks
13 Train / Test Split
14 Baseline Accuracy Reference
15 Model 3: Domain-Informed Draft Risk Logistic Regression
16 Model 3 Predictions and Metrics
17 Model 3 Confusion Matrix
18 Model 3 Coefficient Interpretation
19 Optional Comparison Model: Random Forest on Model 3 Features
20 Model 3 Comparison Table
21 ROC Curve Comparison
22 Plain-English Interpretation Template
23 Limitations
24 Recommendations
25 Replication Notes

1 Sprint 3 Goal

The goal of Sprint 3 is to improve the Lane Theory Lab analysis without restarting the project. Sprint 2 successfully built a reproducible modeling pipeline, but the first models used broad composition counts that were likely too blunt to capture meaningful Dota strategy.

Sprint 3 keeps the project focused on the academic MVP while adding a stronger interpretation layer:

- Reframe the project as esports business intelligence and predictive analytics using public match data.
- Add domain-informed draft-risk features that are still simple, transparent, and reproducible.
- Compare the new model against the Sprint 2 baseline logic and the Radiant-side majority baseline.
- Preserve honest interpretation, including weak or null results.

2 Business Intelligence Framing

Lane Theory Lab uses public Dota 2 match data to evaluate whether simplified draft-composition features can support esports business intelligence. The business problem is not just whether a model can predict winners. The deeper question is whether public match data can be transformed into useful decision-support signals for players, analysts, content creators, or future dashboard users.

The product-facing version of the question is:

Can a simple public-data draft-risk calculator help users understand whether a draft shape appears fragile, balanced, greedy, or difficult to execute?

3 Research Questions

This sprint focuses on four questions:

- Do broad draft-composition features predict match outcomes beyond the Radiant-side baseline?
- Are broad melee/ranged and primary attribute counts too coarse to explain match outcomes on their own?
- Can domain-informed draft-risk features make the model more interpretable?
- What additional data would be needed to build a stronger esports BI product?

4 Load Required Packages

required_packages <- c(
  "tidyverse",
  "tidymodels",
  "ranger",
  "knitr",
  "broom",
  "scales"
)

missing_packages <- required_packages[
  !required_packages %in% rownames(installed.packages())
]

if (length(missing_packages) > 0) {
  stop(
    paste0(
      "Missing required package(s): ",
      paste(missing_packages, collapse = ", "),
      "\nInstall them with: install.packages(c('",
      paste(missing_packages, collapse = "', '"),
      "'))"
    )
  )
}

library(tidyverse)
library(tidymodels)
library(ranger)
library(knitr)
library(broom)
library(scales)

set.seed(580)

5 Load Modeling Dataset

# ================================================================
# Data loading block
# Purpose:
#   Prefer the larger Sprint 3 expanded dataset if it exists.
#   Otherwise, fall back to the stable Sprint 1/Sprint 2 dataset.
# ================================================================

expanded_path <- "data/lane_theory_modeling_dataset_sprint3_expanded.csv"
original_path <- "data/lane_theory_modeling_dataset.csv"

data_path <- if (file.exists(expanded_path)) expanded_path else original_path

if (!file.exists(data_path)) {
  stop(
    paste0(
      "Could not find a modeling dataset. Expected one of:\n",
      expanded_path, "\n",
      original_path, "\n",
      "Check that this Rmd file is saved in the project root folder."
    )
  )
}

model_raw <- read_csv(data_path, show_col_types = FALSE)

cat("Using dataset:", data_path, "\n")

## Using dataset: data/lane_theory_modeling_dataset_sprint3_expanded.csv

glimpse(model_raw)

## Rows: 5,000
## Columns: 36
## $ match_id                     <dbl> 8855078541, 8855077836, 8855074571, 8855074472, 8855073800…
## $ radiant_win                  <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,…
## $ avg_rank_tier                <dbl> 62, 52, 35, 64, 42, 33, 24, 21, 41, 32, 41, 63, 13, 23, 12…
## $ duration                     <dbl> 398, 499, 384, 539, 897, 895, 549, 765, 982, 1020, 919, 80…
## $ game_mode                    <dbl> 23, 23, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 4,…
## $ lobby_type                   <dbl> 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0…
## $ cluster                      <dbl> 225, 227, 144, 151, 153, 183, 186, 153, 202, 181, 413, 153…
## $ dire_heroes_on_team          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ radiant_heroes_on_team       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ dire_melee_count             <dbl> 2, 3, 2, 3, 1, 4, 2, 4, 3, 1, 2, 1, 3, 2, 3, 3, 3, 2, 4, 3…
## $ radiant_melee_count          <dbl> 5, 2, 2, 1, 2, 3, 4, 2, 1, 3, 2, 3, 3, 2, 4, 1, 4, 1, 3, 1…
## $ dire_ranged_count            <dbl> 3, 2, 3, 2, 4, 1, 3, 1, 2, 4, 3, 4, 2, 3, 2, 2, 2, 3, 1, 2…
## $ radiant_ranged_count         <dbl> 0, 3, 3, 4, 3, 2, 1, 3, 4, 2, 3, 2, 2, 3, 1, 4, 1, 4, 2, 4…
## $ dire_str_count               <dbl> 1, 2, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 0, 3, 1…
## $ radiant_str_count            <dbl> 2, 1, 2, 1, 2, 2, 3, 2, 1, 1, 1, 2, 0, 1, 5, 1, 1, 0, 2, 1…
## $ dire_agi_count               <dbl> 3, 1, 1, 1, 2, 1, 0, 2, 2, 1, 1, 1, 2, 1, 1, 2, 0, 3, 1, 1…
## $ radiant_agi_count            <dbl> 2, 1, 1, 1, 2, 1, 0, 1, 2, 3, 1, 1, 3, 1, 0, 1, 2, 2, 1, 1…
## $ dire_int_count               <dbl> 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 2, 0, 1, 2, 2, 1, 2…
## $ radiant_int_count            <dbl> 1, 2, 2, 3, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2, 0, 2, 0, 2, 2, 2…
## $ dire_all_count               <dbl> 1, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0, 0, 3, 0, 1, 0, 0, 1…
## $ radiant_all_count            <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 2, 1, 0, 1…
## $ dire_missing_attack_type     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_attack_type  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dire_missing_primary_attr    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_primary_attr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ melee_count_difference       <dbl> 3, -1, 0, -2, 1, -1, 2, -2, -2, 2, 0, 2, 0, 0, 1, -2, 1, -…
## $ ranged_count_difference      <dbl> -3, 1, 0, 2, -1, 1, -2, 2, 2, -2, 0, -2, 0, 0, -1, 2, -1, …
## $ str_count_difference         <dbl> 1, -1, 0, -2, 0, 0, 1, 0, 0, 0, 0, 1, -2, -1, 4, -1, -1, 0…
## $ agi_count_difference         <dbl> -1, 0, 0, 0, 0, 0, 0, -1, 0, 2, 0, 0, 1, 0, -1, -1, 2, -1,…
## $ int_count_difference         <dbl> 1, 0, 0, 2, -1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 1, -2, 0, …
## $ all_count_difference         <dbl> -1, 1, 0, 0, 1, 0, -1, 1, 0, -1, 1, -1, 1, 1, -3, 1, 1, 1,…
## $ radiant_melee_share          <dbl> 1.0, 0.4, 0.4, 0.2, 0.4, 0.6, 0.8, 0.4, 0.2, 0.6, 0.4, 0.6…
## $ dire_melee_share             <dbl> 0.4, 0.6, 0.4, 0.6, 0.2, 0.8, 0.4, 0.8, 0.6, 0.2, 0.4, 0.2…
## $ radiant_ranged_share         <dbl> 0.0, 0.6, 0.6, 0.8, 0.6, 0.4, 0.2, 0.6, 0.8, 0.4, 0.6, 0.4…
## $ dire_ranged_share            <dbl> 0.6, 0.4, 0.6, 0.4, 0.8, 0.2, 0.6, 0.2, 0.4, 0.8, 0.6, 0.8…
## $ rank_bracket                 <chr> "Divine/Immortal", "Legend/Ancient", "Crusader/Archon", "D…

6 Initial Dataset Checks

# ================================================================
# Initial checks block
# Purpose:
#   Confirm the dataset has the expected match-level structure.
# ================================================================

initial_checks <- tibble(
  check = c(
    "Dataset has rows",
    "match_id column exists",
    "radiant_win target exists",
    "Match IDs are unique or nearly unique",
    "Melee difference feature exists",
    "Attribute difference features exist"
  ),
  passed = c(
    nrow(model_raw) > 0,
    "match_id" %in% names(model_raw),
    "radiant_win" %in% names(model_raw),
    if ("match_id" %in% names(model_raw)) n_distinct(model_raw$match_id) == nrow(model_raw) else FALSE,
    "melee_count_difference" %in% names(model_raw),
    all(c(
      "str_count_difference",
      "agi_count_difference",
      "int_count_difference",
      "all_count_difference"
    ) %in% names(model_raw))
  )
) %>%
  mutate(result = if_else(passed, "PASS", "FAIL"))

kable(initial_checks, caption = "Sprint 3 Initial Dataset Checks")

Sprint 3 Initial Dataset Checks
check	passed	result
Dataset has rows	TRUE	PASS
match_id column exists	TRUE	PASS
radiant_win target exists	TRUE	PASS
Match IDs are unique or nearly unique	TRUE	PASS
Melee difference feature exists	TRUE	PASS
Attribute difference features exist	TRUE	PASS

stopifnot(all(initial_checks$passed))

7 Prepare Target Variable

# ================================================================
# Target preparation block
# Purpose:
#   Convert radiant_win into a two-class factor for classification.
# ================================================================

model_prepped <- model_raw %>%
  mutate(
    radiant_win_text = str_to_lower(as.character(radiant_win)),
    radiant_win = case_when(
      radiant_win_text %in% c("true", "1", "radiant_win", "radiant", "win", "yes") ~ "Radiant_Win",
      radiant_win_text %in% c("false", "0", "dire_win", "dire", "loss", "no") ~ "Dire_Win",
      TRUE ~ NA_character_
    ),
    radiant_win = factor(radiant_win, levels = c("Dire_Win", "Radiant_Win"))
  ) %>%
  select(-radiant_win_text) %>%
  drop_na(radiant_win)

target_balance <- model_prepped %>%
  count(radiant_win) %>%
  mutate(percent = n / sum(n))

kable(target_balance, digits = 3, caption = "Target Class Balance")

Target Class Balance
radiant_win	n	percent
Dire_Win	2357	0.471
Radiant_Win	2643	0.529

8 Domain-Informed Feature Design

The first Sprint 2 models used broad team-composition variables because they are transparent and easy to reproduce. Sprint 3 keeps that transparency but reshapes some variables into draft-risk indicators.

The player-informed idea is not that melee heroes, ranged heroes, or primary attributes are automatically good or bad. The more useful idea is that certain draft shapes can create public-match risks:

- heavy melee drafts may require coordination, initiation, or stun chaining;
- all-ranged drafts may lack a durable fight starter;
- low Strength or low Universal counts may act as a weak frontline or brawling proxy;
- highly skewed attribute profiles may indicate an execution burden or narrow draft identity;
- broad team-level counts cannot observe role abuse, bad pulls, bad dives, or whether supports are actually supporting.

This is why the model is called domain-informed, not expert-deterministic. It uses Dota experience to ask better questions, but the data still decides what signal appears.

9 Create Domain-Informed Draft-Risk Features

# ================================================================
# Domain-informed feature engineering block
# Purpose:
#   Create interpretable draft-risk variables from existing composition
#   counts. These features do not require new API calls.
# ================================================================

required_feature_columns <- c(
  "radiant_melee_count",
  "dire_melee_count",
  "radiant_ranged_count",
  "dire_ranged_count",
  "radiant_str_count",
  "dire_str_count",
  "radiant_agi_count",
  "dire_agi_count",
  "radiant_int_count",
  "dire_int_count",
  "radiant_all_count",
  "dire_all_count",
  "melee_count_difference",
  "str_count_difference",
  "agi_count_difference",
  "int_count_difference",
  "all_count_difference"
)

missing_feature_columns <- setdiff(required_feature_columns, names(model_prepped))

if (length(missing_feature_columns) > 0) {
  stop(
    paste0(
      "Missing required feature columns: ",
      paste(missing_feature_columns, collapse = ", ")
    )
  )
}

# Helper function for attribute imbalance.
# Returns the sum of absolute deviations of the four attribute counts
# from their mean. A balanced profile scores low (counts close to the
# team's average); a skewed profile scores high (one or more counts far
# above or below that average).
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
  attr_mean <- (str_count + agi_count + int_count + all_count) / 4
  abs(str_count - attr_mean) +
    abs(agi_count - attr_mean) +
    abs(int_count - attr_mean) +
    abs(all_count - attr_mean)
}

model_domain <- model_prepped %>%
  mutate(
    # Attack type categories.
    melee_advantage_category = case_when(
      melee_count_difference <= -2 ~ "Radiant much less melee",
      melee_count_difference == -1 ~ "Radiant slightly less melee",
      melee_count_difference == 0 ~ "Even melee count",
      melee_count_difference == 1 ~ "Radiant slightly more melee",
      melee_count_difference >= 2 ~ "Radiant much more melee",
      TRUE ~ NA_character_
    ),
    melee_advantage_category = factor(
      melee_advantage_category,
      levels = c(
        "Radiant much less melee",
        "Radiant slightly less melee",
        "Even melee count",
        "Radiant slightly more melee",
        "Radiant much more melee"
      )
    ),
    
    # Heavy composition flags.
    radiant_heavy_melee = as.integer(radiant_melee_count >= 4),
    dire_heavy_melee = as.integer(dire_melee_count >= 4),
    radiant_all_ranged = as.integer(radiant_melee_count == 0),
    dire_all_ranged = as.integer(dire_melee_count == 0),
    radiant_all_melee = as.integer(radiant_ranged_count == 0),
    dire_all_melee = as.integer(dire_ranged_count == 0),
    
    # Weak frontline/brawling proxy.
    # This is intentionally labeled as a proxy because Strength/Universal
    # does not equal initiation or true frontline.
    radiant_low_frontline_proxy = as.integer(radiant_str_count <= 1 & radiant_all_count <= 1),
    dire_low_frontline_proxy = as.integer(dire_str_count <= 1 & dire_all_count <= 1),
    frontline_proxy_difference = radiant_low_frontline_proxy - dire_low_frontline_proxy,
    
    # Attribute imbalance.
    radiant_attribute_imbalance = attribute_imbalance(
      radiant_str_count,
      radiant_agi_count,
      radiant_int_count,
      radiant_all_count
    ),
    dire_attribute_imbalance = attribute_imbalance(
      dire_str_count,
      dire_agi_count,
      dire_int_count,
      dire_all_count
    ),
    attribute_imbalance_difference = radiant_attribute_imbalance - dire_attribute_imbalance,
    
    # Simple greed/scaling proxy.
    # This is not a hero-role model. It is a deliberately rough proxy.
    radiant_high_agi_proxy = as.integer(radiant_agi_count >= 3),
    dire_high_agi_proxy = as.integer(dire_agi_count >= 3),
    high_agi_proxy_difference = radiant_high_agi_proxy - dire_high_agi_proxy,
    
    # Extreme intelligence stack proxy.
    radiant_high_int_proxy = as.integer(radiant_int_count >= 4),
    dire_high_int_proxy = as.integer(dire_int_count >= 4),
    high_int_proxy_difference = radiant_high_int_proxy - dire_high_int_proxy
  ) %>%
  drop_na()

# Preview the new feature columns.
domain_feature_preview <- model_domain %>%
  select(
    radiant_win,
    melee_count_difference,
    melee_advantage_category,
    radiant_heavy_melee,
    dire_heavy_melee,
    radiant_all_ranged,
    dire_all_ranged,
    radiant_low_frontline_proxy,
    dire_low_frontline_proxy,
    radiant_attribute_imbalance,
    dire_attribute_imbalance,
    attribute_imbalance_difference
  ) %>%
  head(10)

kable(domain_feature_preview, caption = "Preview of Domain-Informed Draft-Risk Features")

Preview of Domain-Informed Draft-Risk Features
radiant_win	melee_count_difference	melee_advantage_category	radiant_heavy_melee	dire_heavy_melee	radiant_low_frontline_proxy	dire_low_frontline_proxy	radiant_attribute_imbalance	dire_attribute_imbalance	attribute_imbalance_difference
Dire_Win	3	Radiant much more melee	1	0	0	1	3.0	3.5	-0.5
Dire_Win	-1	Radiant slightly less melee	0	0	1	0	1.5	3.0	-1.5
Dire_Win	0	Even melee count	0	0	0	0	3.0	3.0	0.0
Dire_Win	-2	Radiant much less melee	0	0	1	0	3.5	3.5	0.0
Radiant_Win	1	Radiant slightly more melee	0	0	0	0	3.0	3.0	0.0
Radiant_Win	-1	Radiant slightly less melee	0	1	0	0	1.5	1.5	0.0
Radiant_Win	2	Radiant much more melee	1	0	0	0	3.5	3.0	0.5
Dire_Win	-2	Radiant much less melee	0	1	0	0	1.5	3.0	-1.5
Radiant_Win	-2	Radiant much less melee	0	0	1	1	3.0	3.0	0.0
Dire_Win	2	Radiant much more melee	0	0	1	1	3.5	1.5	2.0
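
As a hand check on the imbalance helper, the first preview row can be recomputed from its raw counts in the dataset glimpse. This is a minimal sketch in base R that restates `attribute_imbalance()` so it runs on its own; the two extreme drafts are hypothetical examples and assume every hero on a five-hero team has a known primary attribute.

```r
# Restated from the feature-engineering block so this example runs on its own.
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
  attr_mean <- (str_count + agi_count + int_count + all_count) / 4
  abs(str_count - attr_mean) +
    abs(agi_count - attr_mean) +
    abs(int_count - attr_mean) +
    abs(all_count - attr_mean)
}

# First preview row: Radiant has 2 STR, 2 AGI, 1 INT, 0 Universal.
# Mean count is 1.25, so the deviations are 0.75 + 0.75 + 0.25 + 1.25.
radiant_row1 <- attribute_imbalance(2, 2, 1, 0)

# Same match: Dire has 1 STR, 3 AGI, 0 INT, 1 Universal.
dire_row1 <- attribute_imbalance(1, 3, 0, 1)

# Hypothetical extremes for a five-hero team (not rows from the dataset).
most_balanced <- attribute_imbalance(2, 1, 1, 1)
most_skewed <- attribute_imbalance(5, 0, 0, 0)

print(c(
  radiant_row1 = radiant_row1,
  dire_row1 = dire_row1,
  most_balanced = most_balanced,
  most_skewed = most_skewed
))
```

The first two values reproduce the 3.0 and 3.5 in the preview table. For a five-hero team the score can only range from 1.5 to 7.5, which gives a scale for reading `attribute_imbalance_difference`.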

10 Domain-Informed EDA

10.1 Win Rate by Melee Advantage Category

# ================================================================
# EDA block
# Purpose:
#   Make the melee/ranged question easier to read than raw counts.
# ================================================================

melee_category_summary <- model_domain %>%
  group_by(melee_advantage_category) %>%
  summarise(
    matches = n(),
    radiant_win_rate = mean(radiant_win == "Radiant_Win"),
    .groups = "drop"
  ) %>%
  mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))

kable(
  melee_category_summary,
  digits = 3,
  caption = "Radiant Win Rate by Melee Advantage Category"
)

Radiant Win Rate by Melee Advantage Category
melee_advantage_category	matches	radiant_win_rate	win_rate_label
Radiant much less melee	589	0.487	48.7%
Radiant slightly less melee	1177	0.517	51.7%
Even melee count	1466	0.550	55.0%
Radiant slightly more melee	1144	0.525	52.5%
Radiant much more melee	624	0.546	54.6%

melee_category_summary %>%
  ggplot(aes(x = melee_advantage_category, y = radiant_win_rate)) +
  geom_col() +
  geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.4) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  labs(
    title = "Radiant Win Rate by Melee Advantage Category",
    subtitle = "Domain-informed check: raw melee difference grouped into readable draft shapes",
    x = "Melee Advantage Category",
    y = "Radiant Win Rate"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
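
The category sample sizes differ by more than a factor of two, so it helps to put a rough margin of error on each win rate before reading a pattern into the bars. The sketch below re-enters the rounded values from the table above and applies a normal-approximation 95% interval. The data frame name `melee_ci` and its added columns are illustrative, and a Wilson or exact binomial interval would be a reasonable alternative.

```r
# Values re-entered from the "Radiant Win Rate by Melee Advantage Category" table.
melee_ci <- data.frame(
  category = c(
    "Radiant much less melee",
    "Radiant slightly less melee",
    "Even melee count",
    "Radiant slightly more melee",
    "Radiant much more melee"
  ),
  matches = c(589, 1177, 1466, 1144, 624),
  radiant_win_rate = c(0.487, 0.517, 0.550, 0.525, 0.546)
)

# Normal-approximation standard error of a proportion: sqrt(p * (1 - p) / n).
z <- qnorm(0.975)
melee_ci$std_error <- sqrt(
  melee_ci$radiant_win_rate * (1 - melee_ci$radiant_win_rate) / melee_ci$matches
)
melee_ci$ci_low <- melee_ci$radiant_win_rate - z * melee_ci$std_error
melee_ci$ci_high <- melee_ci$radiant_win_rate + z * melee_ci$std_error

print(melee_ci, digits = 3)
```

Each interval spans several percentage points, which is comparable to the gaps between categories, so differences between adjacent bars should be read cautiously.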

10.2 Win Rate by Draft-Risk Flags

# ================================================================
# EDA block
# Purpose:
#   Summarize simple draft-risk flags in one readable table.
# ================================================================

risk_flag_summary <- model_domain %>%
  summarise(
    # Win rate among matches where each flag is set: subset first, then average.
    radiant_heavy_melee_wr = mean(radiant_win[radiant_heavy_melee == 1] == "Radiant_Win"),
    radiant_all_ranged_wr = mean(radiant_win[radiant_all_ranged == 1] == "Radiant_Win"),
    radiant_low_frontline_proxy_wr = mean(radiant_win[radiant_low_frontline_proxy == 1] == "Radiant_Win"),
    radiant_high_agi_proxy_wr = mean(radiant_win[radiant_high_agi_proxy == 1] == "Radiant_Win"),
    radiant_high_int_proxy_wr = mean(radiant_win[radiant_high_int_proxy == 1] == "Radiant_Win"),
    overall_radiant_wr = mean(radiant_win == "Radiant_Win")
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "draft_condition",
    values_to = "radiant_win_rate"
  ) %>%
  # A flag that never occurs yields NaN. Report it as NA, not as a 0% win rate.
  mutate(radiant_win_rate = if_else(is.nan(radiant_win_rate), NA_real_, radiant_win_rate))

kable(
  risk_flag_summary,
  digits = 3,
  caption = "Radiant Win Rate Under Selected Draft-Risk Conditions"
)

Radiant Win Rate Under Selected Draft-Risk Conditions
draft_condition	radiant_win_rate
radiant_heavy_melee_wr	0.540
radiant_all_ranged_wr	0.435
radiant_low_frontline_proxy_wr	0.511
radiant_high_agi_proxy_wr	0.549
radiant_high_int_proxy_wr	0.563
overall_radiant_wr	0.529
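
The table above reports rates only, so a rare flag can look more extreme than its sample size supports. A minimal sketch of a helper that returns the match count next to the rate, using a hypothetical `flag_win_rate()` function and toy data rather than the project dataset:

```r
# Hypothetical helper: win rate and match count for one 0/1 draft-risk flag.
# Returns NA (not 0) for the rate when the flag never occurs.
flag_win_rate <- function(win, flag) {
  hit <- flag == 1
  data.frame(
    matches = sum(hit),
    radiant_win_rate = if (any(hit)) mean(win[hit]) else NA_real_
  )
}

# Toy data for illustration only.
toy <- data.frame(
  radiant_win = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE),
  heavy_melee = c(1, 1, 1, 0, 0, 0),
  all_ranged  = c(0, 0, 0, 0, 0, 0)
)

heavy <- flag_win_rate(toy$radiant_win, toy$heavy_melee)
never <- flag_win_rate(toy$radiant_win, toy$all_ranged)

print(rbind(heavy_melee = heavy, all_ranged = never))
```

Applied to `model_domain`, the same idea would show how many matches sit behind each flag, which matters most for all-ranged drafts because a five-ranged lineup is the least common shape.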

10.3 Attribute Imbalance Summary

# ================================================================
# EDA block
# Purpose:
#   Test whether skewed attribute profiles appear meaningfully different.
# ================================================================

attribute_imbalance_summary <- model_domain %>%
  mutate(
    radiant_more_imbalanced = case_when(
      attribute_imbalance_difference > 0 ~ "Radiant more imbalanced",
      attribute_imbalance_difference < 0 ~ "Dire more imbalanced",
      TRUE ~ "Equal imbalance"
    )
  ) %>%
  group_by(radiant_more_imbalanced) %>%
  summarise(
    matches = n(),
    radiant_win_rate = mean(radiant_win == "Radiant_Win"),
    .groups = "drop"
  ) %>%
  mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))

kable(
  attribute_imbalance_summary,
  digits = 3,
  caption = "Radiant Win Rate by Attribute Imbalance Direction"
)

Radiant Win Rate by Attribute Imbalance Direction
radiant_more_imbalanced	matches	radiant_win_rate	win_rate_label
Dire more imbalanced	1802	0.530	53.0%
Equal imbalance	1460	0.508	50.8%
Radiant more imbalanced	1738	0.545	54.5%

attribute_imbalance_summary %>%
  ggplot(aes(x = radiant_more_imbalanced, y = radiant_win_rate)) +
  geom_col() +
  geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.5) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  labs(
    title = "Radiant Win Rate by Attribute Imbalance Direction",
    subtitle = "A rough proxy for whether one team has a more skewed draft identity",
    x = "Attribute Imbalance Category",
    y = "Radiant Win Rate"
  )

11 Build Modeling Dataset for Model 3

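# ================================================================
# Model 3 data block
# Purpose:
#   Select a compact set of draft-risk features.
#
# Important:
#   This avoids the broad Sprint 2 issue where melee counts, ranged
#   counts, differences, and shares encoded overlapping information.
#   Two exact redundancies remain and are reported as NA coefficients
#   by glm in Section 15: all_count_difference (the four attribute
#   differences sum to zero when both teams field five heroes) and
#   dire_low_frontline_proxy (already implied by
#   radiant_low_frontline_proxy and frontline_proxy_difference).
# ================================================================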

model3_predictors <- c(
  "melee_count_difference",
  "str_count_difference",
  "agi_count_difference",
  "int_count_difference",
  "all_count_difference",
  "attribute_imbalance_difference",
  "frontline_proxy_difference",
  "high_agi_proxy_difference",
  "high_int_proxy_difference",
  "radiant_heavy_melee",
  "dire_heavy_melee",
  "radiant_all_ranged",
  "dire_all_ranged",
  "radiant_low_frontline_proxy",
  "dire_low_frontline_proxy"
)

model3_df <- model_domain %>%
  select(radiant_win, all_of(model3_predictors)) %>%
  drop_na()

model3_columns <- tibble(
  column = names(model3_df),
  type = map_chr(model3_df, ~ class(.x)[1])
)

kable(model3_columns, caption = "Model 3 Final Modeling Columns")

Model 3 Final Modeling Columns
column	type
radiant_win	factor
melee_count_difference	numeric
str_count_difference	numeric
agi_count_difference	numeric
int_count_difference	numeric
all_count_difference	numeric
attribute_imbalance_difference	numeric
frontline_proxy_difference	integer
high_agi_proxy_difference	integer
high_int_proxy_difference	integer
radiant_heavy_melee	integer
dire_heavy_melee	integer
radiant_all_ranged	integer
dire_all_ranged	integer
radiant_low_frontline_proxy	integer
dire_low_frontline_proxy	integer

12 Model 3 Readiness Checks

# ================================================================
# Readiness checks block
# Purpose:
#   Confirm Model 3 data is usable before fitting.
# ================================================================

model3_checks <- tibble(
  check = c(
    "Model 3 data has rows",
    "Target has exactly two classes",
    "Target has no missing values",
    "All predictors are numeric",
    "At least five predictors are available",
    "duration is excluded",
    "match_id is excluded"
  ),
  passed = c(
    nrow(model3_df) > 0,
    n_distinct(model3_df$radiant_win) == 2,
    all(!is.na(model3_df$radiant_win)),
    all(map_lgl(model3_df %>% select(-radiant_win), is.numeric)),
    ncol(model3_df) - 1 >= 5,
    !"duration" %in% names(model3_df),
    !"match_id" %in% names(model3_df)
  )
) %>%
  mutate(result = if_else(passed, "PASS", "FAIL"))

kable(model3_checks, caption = "Model 3 Readiness Checks")

Model 3 Readiness Checks
check	passed	result
Model 3 data has rows	TRUE	PASS
Target has exactly two classes	TRUE	PASS
Target has no missing values	TRUE	PASS
All predictors are numeric	TRUE	PASS
At least five predictors are available	TRUE	PASS
duration is excluded	TRUE	PASS
match_id is excluded	TRUE	PASS

stopifnot(all(model3_checks$passed))

13 Train / Test Split

# ================================================================
# Train/test split block
# Purpose:
#   Split data while preserving class balance.
# ================================================================

lane_split <- initial_split(model3_df, prop = 0.80, strata = radiant_win)

train_data <- training(lane_split)
test_data <- testing(lane_split)

split_summary <- tibble(
  dataset = c("Training", "Testing"),
  rows = c(nrow(train_data), nrow(test_data)),
  radiant_win_rate = c(
    mean(train_data$radiant_win == "Radiant_Win"),
    mean(test_data$radiant_win == "Radiant_Win")
  )
)

kable(split_summary, digits = 3, caption = "Model 3 Train/Test Split Summary")

Model 3 Train/Test Split Summary
dataset	rows	radiant_win_rate
Training	3999	0.529
Testing	1001	0.528

14 Baseline Accuracy Reference

# ================================================================
# Baseline block
# Purpose:
#   Establish the majority-class benchmark. The model should be judged
#   against this, not just against 50 percent accuracy.
# ================================================================

majority_class_rate <- train_data %>%
  count(radiant_win) %>%
  mutate(rate = n / sum(n)) %>%
  arrange(desc(rate)) %>%
  slice(1)

kable(
  majority_class_rate,
  digits = 3,
  caption = "Majority-Class Baseline from Training Data"
)

Majority-Class Baseline from Training Data
radiant_win	n	rate
Radiant_Win	2114	0.529

15 Model 3: Domain-Informed Draft Risk Logistic Regression

# ================================================================
# Model 3 fitting block
# Purpose:
#   Fit an interpretable logistic regression using cleaner, domain-
#   informed draft-risk features.
# ================================================================

model3_recipe <- recipe(radiant_win ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

model3_log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

model3_log_workflow <- workflow() %>%
  add_recipe(model3_recipe) %>%
  add_model(model3_log_spec)

model3_log_fit <- fit(model3_log_workflow, data = train_data)

model3_log_fit

## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_zv()
## • step_normalize()
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                    (Intercept)          melee_count_difference            str_count_difference  
##                       0.115178                        0.036814                        0.064215  
##           agi_count_difference            int_count_difference            all_count_difference  
##                       0.079635                        0.098550                              NA  
## attribute_imbalance_difference      frontline_proxy_difference       high_agi_proxy_difference  
##                      -0.013211                       -0.095076                        0.060048  
##      high_int_proxy_difference             radiant_heavy_melee                dire_heavy_melee  
##                       0.034014                       -0.003934                        0.070702  
##             radiant_all_ranged                 dire_all_ranged     radiant_low_frontline_proxy  
##                      -0.029035                       -0.028420                       -0.021098  
##       dire_low_frontline_proxy  
##                             NA  
## 
## Degrees of Freedom: 3998 Total (i.e. Null);  3985 Residual
## Null Deviance:       5531 
## Residual Deviance: 5511  AIC: 5539
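
Two coefficients in the output above are `NA`, which is how `glm` reports predictors that are exact linear combinations of others. The sketch below reproduces the pattern on simulated drafts (not the project data): when both teams always field five heroes, the four attribute differences sum to zero, so the last one is aliased. The same logic applies to `dire_low_frontline_proxy`, which equals `radiant_low_frontline_proxy - frontline_proxy_difference`. All object names here are illustrative.

```r
set.seed(1)
n <- 200

# Simulate attribute counts for five-hero teams (columns: STR, AGI, INT, Universal).
draw_team <- function(n) t(rmultinom(n, size = 5, prob = rep(0.25, 4)))
radiant <- draw_team(n)
dire <- draw_team(n)

diffs <- radiant - dire
colnames(diffs) <- c("str_diff", "agi_diff", "int_diff", "all_diff")

# Every row sums to zero because both teams total five heroes.
row_sums_are_zero <- all(rowSums(diffs) == 0)

# With an intercept there are five columns but only four independent directions.
design_rank <- qr(cbind(intercept = 1, diffs))$rank

# glm drops the aliased column and reports its coefficient as NA.
win <- rbinom(n, 1, 0.5)
toy_fit <- glm(win ~ ., data = data.frame(win = win, diffs), family = binomial)

print(row_sums_are_zero)
print(design_rank)
print(coef(toy_fit))
```

Dropping the aliased columns before fitting would not change the logistic regression's predictions, but it would keep the coefficient table free of `NA` rows.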

16 Model 3 Predictions and Metrics

# ================================================================
# Prediction and metrics block
# Purpose:
#   Evaluate Model 3 on the test set.
# ================================================================

model3_predictions <- predict(model3_log_fit, test_data, type = "prob") %>%
  bind_cols(predict(model3_log_fit, test_data, type = "class")) %>%
  bind_cols(test_data %>% select(radiant_win))

model3_metrics <- bind_rows(
  accuracy(model3_predictions, truth = radiant_win, estimate = .pred_class),
  roc_auc(model3_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
  mutate(model = "Model 3: Domain-Informed Draft Risk", .before = 1)

kable(model3_metrics, digits = 3, caption = "Model 3 Performance")

Model 3 Performance
model	.metric	.estimator	.estimate
Model 3: Domain-Informed Draft Risk	accuracy	binary	0.519
Model 3: Domain-Informed Draft Risk	roc_auc	binary	0.492

17 Model 3 Confusion Matrix

# ================================================================
# Confusion matrix block
# Purpose:
#   Check whether the model predicts both classes or mostly defaults to
#   the majority class.
# ================================================================

model3_confusion <- conf_mat(
  model3_predictions,
  truth = radiant_win,
  estimate = .pred_class
)

model3_confusion

##              Truth
## Prediction    Dire_Win Radiant_Win
##   Dire_Win         111         120
##   Radiant_Win      361         409
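
The confusion matrix can be unpacked by hand to answer the question in the block comment. The sketch below re-enters the four counts shown above and derives accuracy, per-class recall, and the accuracy of always predicting Radiant on the same test set; the object names are illustrative.

```r
# Counts re-entered from the confusion matrix above (rows = prediction, columns = truth).
cm <- matrix(
  c(111, 361, 120, 409),
  nrow = 2,
  dimnames = list(
    Prediction = c("Dire_Win", "Radiant_Win"),
    Truth = c("Dire_Win", "Radiant_Win")
  )
)

total <- sum(cm)

# Share of test matches classified correctly.
accuracy <- sum(diag(cm)) / total

# Recall per class: correct predictions divided by the true count of that class.
radiant_recall <- cm["Radiant_Win", "Radiant_Win"] / sum(cm[, "Radiant_Win"])
dire_recall <- cm["Dire_Win", "Dire_Win"] / sum(cm[, "Dire_Win"])

# How often the model predicts Radiant, and how a constant "Radiant" guess would score.
predicted_radiant_share <- sum(cm["Radiant_Win", ]) / total
always_radiant_accuracy <- sum(cm[, "Radiant_Win"]) / total

print(round(c(
  accuracy = accuracy,
  radiant_recall = radiant_recall,
  dire_recall = dire_recall,
  predicted_radiant_share = predicted_radiant_share,
  always_radiant_accuracy = always_radiant_accuracy
), 3))
```

The model predicts Radiant for roughly three quarters of test matches and recovers under a quarter of Dire wins, and its accuracy sits slightly below the constant-Radiant guess on the same rows.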

18 Model 3 Coefficient Interpretation

# ================================================================
# Coefficient interpretation block
# Purpose:
#   Identify which domain-informed features have the largest positive
#   or negative association with Radiant victory.
#
# Important:
#   Coefficients are associations, not causal proof.
#   Predictors were normalized in the recipe, so each estimate and
#   odds ratio describes a one-standard-deviation increase, not a
#   one-unit increase.
# ================================================================

model3_coefficients <- model3_log_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(estimate),
    direction = case_when(
      is.na(estimate) ~ "Not estimated",
      estimate > 0 ~ "Higher odds of Radiant win",
      estimate < 0 ~ "Lower odds of Radiant win",
      TRUE ~ "No estimated direction"
    )
  ) %>%
  arrange(desc(abs(estimate)))

kable(
  model3_coefficients %>%
    select(term, estimate, odds_ratio, direction),
  digits = 3,
  caption = "Model 3 Logistic Regression Coefficients"
)

Model 3 Logistic Regression Coefficients
term	estimate	odds_ratio	direction
int_count_difference	0.099	1.104	Higher odds of Radiant win
frontline_proxy_difference	-0.095	0.909	Lower odds of Radiant win
agi_count_difference	0.080	1.083	Higher odds of Radiant win
dire_heavy_melee	0.071	1.073	Higher odds of Radiant win
str_count_difference	0.064	1.066	Higher odds of Radiant win
high_agi_proxy_difference	0.060	1.062	Higher odds of Radiant win
melee_count_difference	0.037	1.037	Higher odds of Radiant win
high_int_proxy_difference	0.034	1.035	Higher odds of Radiant win
radiant_all_ranged	-0.029	0.971	Lower odds of Radiant win
dire_all_ranged	-0.028	0.972	Lower odds of Radiant win
radiant_low_frontline_proxy	-0.021	0.979	Lower odds of Radiant win
attribute_imbalance_difference	-0.013	0.987	Lower odds of Radiant win
radiant_heavy_melee	-0.004	0.996	Lower odds of Radiant win
all_count_difference	NA	NA	Not estimated
dire_low_frontline_proxy	NA	NA	Not estimated

19 Optional Comparison Model: Random Forest on Model 3 Features

# ================================================================
# Optional comparison block
# Purpose:
#   Fit a random forest using the same Model 3 features to check whether
#   nonlinear interactions improve predictive performance.
# ================================================================

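# mtry uses the common sqrt(p) rule of thumb. This assumes train_data holds
# only the outcome plus the Model 3 predictors, so ncol(train_data) - 1 = p.
rf_mtry <- max(1, floor(sqrt(ncol(train_data) - 1)))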

model3_rf_spec <- rand_forest(
  trees = 500,
  mtry = rf_mtry,
  min_n = 5
) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

model3_rf_workflow <- workflow() %>%
  add_recipe(model3_recipe) %>%
  add_model(model3_rf_spec)

model3_rf_fit <- fit(model3_rf_workflow, data = train_data)

model3_rf_predictions <- predict(model3_rf_fit, test_data, type = "prob") %>%
  bind_cols(predict(model3_rf_fit, test_data, type = "class")) %>%
  bind_cols(test_data %>% select(radiant_win))

model3_rf_metrics <- bind_rows(
  accuracy(model3_rf_predictions, truth = radiant_win, estimate = .pred_class),
  roc_auc(model3_rf_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
  mutate(model = "Model 3 RF: Domain-Informed Features", .before = 1)

kable(model3_rf_metrics, digits = 3, caption = "Model 3 Random Forest Performance")

Model 3 Random Forest Performance
model	.metric	.estimator	.estimate
Model 3 RF: Domain-Informed Features	accuracy	binary	0.502
Model 3 RF: Domain-Informed Features	roc_auc	binary	0.498

20 Model 3 Comparison Table

# ================================================================
# Comparison block
# Purpose:
#   Compare both Model 3 variants (logistic regression and random forest)
#   against the majority-class baseline.
# ================================================================

baseline_row <- tibble(
  model = "Majority-Class Baseline",
  accuracy = majority_class_rate$rate,
  roc_auc = NA_real_
)

model3_comparison <- bind_rows(model3_metrics, model3_rf_metrics) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  bind_rows(baseline_row) %>%
  arrange(desc(replace_na(roc_auc, 0)))

kable(
  model3_comparison,
  digits = 3,
  caption = "Model 3 Performance Compared with Majority-Class Baseline"
)

Model 3 Performance Compared with Majority-Class Baseline
model	accuracy	roc_auc
Model 3 RF: Domain-Informed Features	0.502	0.498
Model 3: Domain-Informed Draft Risk	0.519	0.492
Majority-Class Baseline	0.529	NA

21 ROC Curve Comparison

# ================================================================
# ROC curve block
# Purpose:
#   Compare class-separation ability for the logistic and random forest
#   versions of Model 3.
# ================================================================

model3_log_roc <- roc_curve(
  model3_predictions,
  truth = radiant_win,
  .pred_Radiant_Win,
  event_level = "second"
) %>%
  mutate(model = "Model 3 Logistic")

model3_rf_roc <- roc_curve(
  model3_rf_predictions,
  truth = radiant_win,
  .pred_Radiant_Win,
  event_level = "second"
) %>%
  mutate(model = "Model 3 Random Forest")

bind_rows(model3_log_roc, model3_rf_roc) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, linetype = model)) +
  geom_path(linewidth = 1) +
  geom_abline(linetype = "dashed") +
  labs(
    title = "Model 3 ROC Curve Comparison",
    subtitle = "Curves near the diagonal indicate limited class-separation power",
    x = "False Positive Rate",
    y = "True Positive Rate",
    linetype = "Model"
  )
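
To make the "near the diagonal" reading concrete, AUC can be computed by hand as the probability that a randomly chosen Radiant win receives a higher predicted score than a randomly chosen Radiant loss. The sketch below is a minimal base R version of that rank-based calculation; `rank_auc` and the toy scores are hypothetical names and values, not part of the project pipeline, which uses yardstick's `roc_auc()`.

```r
# AUC as the probability that a randomly chosen positive case outranks a
# randomly chosen negative case (ties count as half).
rank_auc <- function(scores, is_positive) {
  ranks <- rank(scores)  # average ranks handle tied scores
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy outcomes (hypothetical): three wins followed by three losses.
is_win <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# A model that ranks every win above every loss.
auc_perfect <- rank_auc(c(0.9, 0.8, 0.7, 0.3, 0.2, 0.1), is_win)

# A model that gives every match the same score.
auc_uninformative <- rank_auc(rep(0.5, 6), is_win)

c(perfect = auc_perfect, uninformative = auc_uninformative)
```

The perfectly separating scores give 1 and the constant scores give 0.5, so the reported values of 0.492 and 0.498 sit at the uninformative end of that range.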

22 Plain-English Interpretation Template

Use this section after reviewing the output values.

The Sprint 3 model tests whether cleaner, domain-informed draft-risk features improve the interpretability of the Lane Theory Lab analysis. These features are based on public match data and intentionally avoid unavailable behavioral details such as whether supports pulled correctly, whether a lane over-dived, or whether a position 4 actually played a support function.

The key comparison is not whether the model exceeds 50 percent accuracy. The key comparison is whether it improves on the majority-class baseline, which reflects the observed Radiant-side win advantage in the sample. If Model 3 does not outperform this baseline, the result suggests that broad team-level draft-risk features still do not explain enough of the match outcome by themselves.
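
As a rough illustration of that comparison, the sketch below computes accuracy lift over the baseline from the values reported in the Model 3 comparison table, and then shows how a majority-class rate is derived from an outcome column. The toy outcome vector and the `Dire_Win` level name are assumptions for illustration only.

```r
# Values reported in the Model 3 comparison table.
baseline_accuracy <- 0.529
model_accuracy <- c(
  "Model 3 Logistic" = 0.519,
  "Model 3 Random Forest" = 0.502
)

# Lift over the majority-class baseline, in percentage points.
# Negative values mean the model does worse than always predicting
# the more common outcome.
lift_points <- round((model_accuracy - baseline_accuracy) * 100, 1)
lift_points

# How a majority-class rate is computed, using a toy outcome vector
# (hypothetical values, not the project data).
toy_outcome <- factor(
  c("Radiant_Win", "Dire_Win", "Radiant_Win", "Radiant_Win", "Dire_Win")
)
majority_rate <- max(prop.table(table(toy_outcome)))
majority_rate
```

Both lift values are negative for the reported numbers, which is the situation the rest of this paragraph describes.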

That finding would not invalidate the project. Instead, it would support the idea that useful esports BI requires richer context than simple team-level counts. Future versions should consider lane assignments, hero identity, role fidelity, reliable stun counts, initiation tools, support behavior, patch context, game mode, and rank bracket.

23 Limitations

This sprint still uses match-level draft-composition features. It does not directly observe laning behavior, support behavior, bad pulls, inopportune dives, player skill, hero familiarity, item builds, objective control, or coordination. These are major parts of Dota outcomes.

The frontline and greed variables are deliberately labeled as proxies. Strength or Universal heroes do not automatically equal frontline or initiation. Agility-heavy drafts do not automatically mean greed. Intelligence-heavy drafts do not automatically mean spell pressure. These features are simple public-data approximations designed to ask better questions, not perfect strategic labels.

The current dataset may also mix game modes, rank brackets, patches, regions, and public-match environments. Turbo and ranked All Pick may not represent the same mindset. Public-match data may also include bots, smurfs, griefing, role abuse, and other behaviors that are difficult to detect from composition data alone.

24 Recommendations

The strongest next step is to treat Model 3 as a domain-informed baseline rather than a final predictive product. If the model performs weakly, the recommendation is not to force the result but to improve feature quality by adding context.

Recommended next steps:

Compare Model 3 performance against the majority-class baseline.
Review whether domain-informed features improve interpretation, even if predictive lift is small.
Add rank and game-mode comparison cuts to see whether behavior changes across public-match contexts.
Consider a future hero-archetype table for initiation, stun reliability, scaling core count, sustain, and support function.
Use the final project as a foundation for a future draft-risk calculator or esports BI dashboard.
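
The rank and game-mode comparison cuts recommended above could start from a simple grouped summary. The sketch below uses base R on a toy table; the `game_mode` column, its values, and the 0/1 coding of `radiant_win` are hypothetical, since the source does not show whether the modeling dataset carries these fields.

```r
# Toy match table (hypothetical columns and values, not the project data).
toy_matches <- data.frame(
  game_mode = c("All Pick", "All Pick", "All Pick",
                "Turbo", "Turbo", "Turbo", "Turbo"),
  radiant_win = c(1, 0, 1, 1, 1, 0, 1)
)

# Radiant win rate and match count within each game mode.
win_rate <- tapply(toy_matches$radiant_win, toy_matches$game_mode, mean)
matches <- tapply(toy_matches$radiant_win, toy_matches$game_mode, length)

mode_summary <- data.frame(
  game_mode = names(win_rate),
  matches = as.vector(matches),
  radiant_win_rate = as.vector(win_rate)
)
mode_summary
```

Reporting the match count next to each rate matters here, because small cuts can show large win-rate differences purely by chance.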

25 Replication Notes

To rerun this sprint:

Open this .Rmd file in RStudio.
Confirm the dataset exists in the data/ folder.
Install missing packages if needed.
Knit to HTML.
Review the Model 3 comparison table and coefficient table.
Merge useful sections into the final midterm report.
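
For the package step, a small check like the one below can list what is missing before knitting. The package list is an assumption inferred from the functions used in this report (for example `kable()`, `workflow()`, and the ranger engine) and should be checked against the Load Required Packages section; `find_missing_packages` is a hypothetical helper name.

```r
# Packages this report appears to rely on (an assumed list; confirm it
# against the "Load Required Packages" section of the .Rmd).
required_packages <- c("tidyverse", "tidymodels", "ranger", "knitr")

# Return the packages from pkgs that are not installed in the library paths.
find_missing_packages <- function(pkgs) {
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

missing_packages <- find_missing_packages(required_packages)
if (length(missing_packages) > 0) {
  message("Missing packages: ", paste(missing_packages, collapse = ", "))
  # install.packages(missing_packages)  # uncomment to install them
}
```

The install call is left commented out so that knitting the document never changes the local library without an explicit decision.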

Lane Theory Lab: Sprint 3 Domain-Informed Draft Risk Model

Shep

2026-06-23