1 Sprint 3 Goal

The goal of Sprint 3 is to improve the Lane Theory Lab analysis without restarting the project. Sprint 2 successfully built a reproducible modeling pipeline, but the first models used broad composition counts that were likely too blunt to capture meaningful Dota strategy.

Sprint 3 keeps the project focused on the academic MVP while adding a stronger interpretation layer:

  1. Reframe the project as esports business intelligence and predictive analytics using public match data.
  2. Add domain-informed draft-risk features that are still simple, transparent, and reproducible.
  3. Compare the new model against the Sprint 2 baseline logic and the Radiant-side majority baseline.
  4. Preserve honest interpretation, including weak or null results.

2 Business Intelligence Framing

Lane Theory Lab uses public Dota 2 match data to evaluate whether simplified draft-composition features can support esports business intelligence. The business problem is not just whether a model can predict winners. The deeper question is whether public match data can be transformed into useful decision-support signals for players, analysts, content creators, or future dashboard users.

The product-facing version of the question is:

Can a simple public-data draft-risk calculator help users understand whether a draft shape appears fragile, balanced, greedy, or difficult to execute?

3 Research Questions

This sprint focuses on four questions:

  1. Do broad draft-composition features predict match outcomes beyond the Radiant-side baseline?
  2. Are broad melee/ranged and primary attribute counts too coarse to explain match outcomes on their own?
  3. Can domain-informed draft-risk features make the model more interpretable?
  4. What additional data would be needed to build a stronger esports BI product?

4 Load Required Packages

required_packages <- c(
  "tidyverse",
  "tidymodels",
  "ranger",
  "knitr",
  "broom",
  "scales"
)

missing_packages <- required_packages[
  !required_packages %in% rownames(installed.packages())
]

if (length(missing_packages) > 0) {
  stop(
    paste0(
      "Missing required package(s): ",
      paste(missing_packages, collapse = ", "),
      "\nInstall them with: install.packages(c('",
      paste(missing_packages, collapse = "', '"),
      "'))"
    )
  )
}

library(tidyverse)
library(tidymodels)
library(ranger)
library(knitr)
library(broom)
library(scales)

set.seed(580)

5 Load Modeling Dataset

# ================================================================
# Data loading block
# Purpose:
#   Prefer the larger Sprint 3 expanded dataset if it exists.
#   Otherwise, fall back to the stable Sprint 1/Sprint 2 dataset.
# ================================================================

expanded_path <- "data/lane_theory_modeling_dataset_sprint3_expanded.csv"
original_path <- "data/lane_theory_modeling_dataset.csv"

# file.exists() returns a single TRUE/FALSE here, so a scalar if/else is enough.
data_path <- if (file.exists(expanded_path)) expanded_path else original_path

if (!file.exists(data_path)) {
  stop(
    paste0(
      "Could not find a modeling dataset. Expected one of:\n",
      expanded_path, "\n",
      original_path, "\n",
      "Check that this Rmd file is saved in the project root folder."
    )
  )
}

model_raw <- read_csv(data_path, show_col_types = FALSE)

cat("Using dataset:", data_path, "\n")
## Using dataset: data/lane_theory_modeling_dataset_sprint3_expanded.csv
glimpse(model_raw)
## Rows: 5,000
## Columns: 36
## $ match_id                     <dbl> 8855078541, 8855077836, 8855074571, 8855074472, 8855073800…
## $ radiant_win                  <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, TRUE,…
## $ avg_rank_tier                <dbl> 62, 52, 35, 64, 42, 33, 24, 21, 41, 32, 41, 63, 13, 23, 12…
## $ duration                     <dbl> 398, 499, 384, 539, 897, 895, 549, 765, 982, 1020, 919, 80…
## $ game_mode                    <dbl> 23, 23, 22, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 4,…
## $ lobby_type                   <dbl> 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 0…
## $ cluster                      <dbl> 225, 227, 144, 151, 153, 183, 186, 153, 202, 181, 413, 153…
## $ dire_heroes_on_team          <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ radiant_heroes_on_team       <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5…
## $ dire_melee_count             <dbl> 2, 3, 2, 3, 1, 4, 2, 4, 3, 1, 2, 1, 3, 2, 3, 3, 3, 2, 4, 3…
## $ radiant_melee_count          <dbl> 5, 2, 2, 1, 2, 3, 4, 2, 1, 3, 2, 3, 3, 2, 4, 1, 4, 1, 3, 1…
## $ dire_ranged_count            <dbl> 3, 2, 3, 2, 4, 1, 3, 1, 2, 4, 3, 4, 2, 3, 2, 2, 2, 3, 1, 2…
## $ radiant_ranged_count         <dbl> 0, 3, 3, 4, 3, 2, 1, 3, 4, 2, 3, 2, 2, 3, 1, 4, 1, 4, 2, 4…
## $ dire_str_count               <dbl> 1, 2, 2, 3, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 2, 2, 0, 3, 1…
## $ radiant_str_count            <dbl> 2, 1, 2, 1, 2, 2, 3, 2, 1, 1, 1, 2, 0, 1, 5, 1, 1, 0, 2, 1…
## $ dire_agi_count               <dbl> 3, 1, 1, 1, 2, 1, 0, 2, 2, 1, 1, 1, 2, 1, 1, 2, 0, 3, 1, 1…
## $ radiant_agi_count            <dbl> 2, 1, 1, 1, 2, 1, 0, 1, 2, 3, 1, 1, 3, 1, 0, 1, 2, 2, 1, 1…
## $ dire_int_count               <dbl> 0, 2, 2, 1, 1, 1, 1, 1, 2, 2, 3, 2, 1, 2, 0, 1, 2, 2, 1, 2…
## $ radiant_int_count            <dbl> 1, 2, 2, 3, 0, 1, 1, 1, 2, 1, 2, 2, 1, 2, 0, 2, 0, 2, 2, 2…
## $ dire_all_count               <dbl> 1, 0, 0, 0, 0, 1, 2, 0, 0, 1, 0, 1, 0, 0, 3, 0, 1, 0, 0, 1…
## $ radiant_all_count            <dbl> 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 2, 1, 0, 1…
## $ dire_missing_attack_type     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_attack_type  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ dire_missing_primary_attr    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ radiant_missing_primary_attr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ melee_count_difference       <dbl> 3, -1, 0, -2, 1, -1, 2, -2, -2, 2, 0, 2, 0, 0, 1, -2, 1, -…
## $ ranged_count_difference      <dbl> -3, 1, 0, 2, -1, 1, -2, 2, 2, -2, 0, -2, 0, 0, -1, 2, -1, …
## $ str_count_difference         <dbl> 1, -1, 0, -2, 0, 0, 1, 0, 0, 0, 0, 1, -2, -1, 4, -1, -1, 0…
## $ agi_count_difference         <dbl> -1, 0, 0, 0, 0, 0, 0, -1, 0, 2, 0, 0, 1, 0, -1, -1, 2, -1,…
## $ int_count_difference         <dbl> 1, 0, 0, 2, -1, 0, 0, 0, 0, -1, -1, 0, 0, 0, 0, 1, -2, 0, …
## $ all_count_difference         <dbl> -1, 1, 0, 0, 1, 0, -1, 1, 0, -1, 1, -1, 1, 1, -3, 1, 1, 1,…
## $ radiant_melee_share          <dbl> 1.0, 0.4, 0.4, 0.2, 0.4, 0.6, 0.8, 0.4, 0.2, 0.6, 0.4, 0.6…
## $ dire_melee_share             <dbl> 0.4, 0.6, 0.4, 0.6, 0.2, 0.8, 0.4, 0.8, 0.6, 0.2, 0.4, 0.2…
## $ radiant_ranged_share         <dbl> 0.0, 0.6, 0.6, 0.8, 0.6, 0.4, 0.2, 0.6, 0.8, 0.4, 0.6, 0.4…
## $ dire_ranged_share            <dbl> 0.6, 0.4, 0.6, 0.4, 0.8, 0.2, 0.6, 0.2, 0.4, 0.8, 0.6, 0.8…
## $ rank_bracket                 <chr> "Divine/Immortal", "Legend/Ancient", "Crusader/Archon", "D…

6 Initial Dataset Checks

# ================================================================
# Initial checks block
# Purpose:
#   Confirm the dataset has the expected match-level structure.
# ================================================================

initial_checks <- tibble(
  check = c(
    "Dataset has rows",
    "match_id column exists",
    "radiant_win target exists",
    "Match IDs are unique or nearly unique",
    "Melee difference feature exists",
    "Attribute difference features exist"
  ),
  passed = c(
    nrow(model_raw) > 0,
    "match_id" %in% names(model_raw),
    "radiant_win" %in% names(model_raw),
    # Strict check: passes only when every match_id is distinct.
    if ("match_id" %in% names(model_raw)) {
      n_distinct(model_raw$match_id) == nrow(model_raw)
    } else {
      FALSE
    },
    "melee_count_difference" %in% names(model_raw),
    all(c(
      "str_count_difference",
      "agi_count_difference",
      "int_count_difference",
      "all_count_difference"
    ) %in% names(model_raw))
  )
) %>%
  mutate(result = if_else(passed, "PASS", "FAIL"))

kable(initial_checks, caption = "Sprint 3 Initial Dataset Checks")
Sprint 3 Initial Dataset Checks

| check                                 | passed | result |
|---------------------------------------|--------|--------|
| Dataset has rows                      | TRUE   | PASS   |
| match_id column exists                | TRUE   | PASS   |
| radiant_win target exists             | TRUE   | PASS   |
| Match IDs are unique or nearly unique | TRUE   | PASS   |
| Melee difference feature exists       | TRUE   | PASS   |
| Attribute difference features exist   | TRUE   | PASS   |

stopifnot(all(initial_checks$passed))

7 Prepare Target Variable

# ================================================================
# Target preparation block
# Purpose:
#   Convert radiant_win into a two-class factor for classification.
# ================================================================

model_prepped <- model_raw %>%
  mutate(
    radiant_win_text = str_to_lower(as.character(radiant_win)),
    radiant_win = case_when(
      radiant_win_text %in% c("true", "1", "radiant_win", "radiant", "win", "yes") ~ "Radiant_Win",
      radiant_win_text %in% c("false", "0", "dire_win", "dire", "loss", "no") ~ "Dire_Win",
      TRUE ~ NA_character_
    ),
    radiant_win = factor(radiant_win, levels = c("Dire_Win", "Radiant_Win"))
  ) %>%
  select(-radiant_win_text) %>%
  drop_na(radiant_win)

target_balance <- model_prepped %>%
  count(radiant_win) %>%
  mutate(percent = n / sum(n))

kable(target_balance, digits = 3, caption = "Target Class Balance")
Target Class Balance

| radiant_win | n    | percent |
|-------------|------|---------|
| Dire_Win    | 2357 | 0.471   |
| Radiant_Win | 2643 | 0.529   |
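
The mapping above accepts several encodings of the outcome and sends anything unrecognized to NA, which drop_na() then removes. As a minimal sketch, the same rule can be checked in isolation on made-up input. The helper name encode_outcome() and the toy values are assumptions for illustration and are not part of the project code.

```r
library(dplyr)
library(stringr)

# Hypothetical helper (not in the project): mirrors the case_when()
# mapping used in the target preparation block.
encode_outcome <- function(x) {
  x_text <- str_to_lower(as.character(x))
  label <- case_when(
    x_text %in% c("true", "1", "radiant_win", "radiant", "win", "yes") ~ "Radiant_Win",
    x_text %in% c("false", "0", "dire_win", "dire", "loss", "no") ~ "Dire_Win",
    TRUE ~ NA_character_
  )
  factor(label, levels = c("Dire_Win", "Radiant_Win"))
}

# Toy input: logical-like text, numeric-like text, a team name,
# and one unrecognized value that should become NA.
toy_outcomes <- encode_outcome(c("TRUE", "0", "Radiant", "unknown"))
toy_outcomes
```

Because Dire_Win is the first factor level, the yardstick calls later in the report pass event_level = "second" so that Radiant_Win is treated as the event of interest.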

8 Domain-Informed Feature Design

The first Sprint 2 models used broad team-composition variables because they are transparent and easy to reproduce. Sprint 3 keeps that transparency but reshapes some variables into draft-risk indicators.

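The player-informed idea is not that melee heroes, ranged heroes, or primary attributes are automatically good or bad. The more useful idea is that certain draft shapes can create public-match risks:

  1. Melee-heavy drafts (four or more melee heroes) and drafts with no melee heroes at all.
  2. Drafts with at most one Strength hero and at most one Universal hero, a rough proxy for a weak frontline.
  3. Attribute profiles skewed heavily toward one primary attribute.
  4. Agility-heavy drafts (three or more Agility heroes), a rough proxy for greedy, scaling-dependent lineups.
  5. Intelligence-heavy drafts (four or more Intelligence heroes).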

This is why the model is called domain-informed, not expert-deterministic. It uses Dota experience to ask better questions, but the data still decides what signal appears.

9 Create Domain-Informed Draft-Risk Features

# ================================================================
# Domain-informed feature engineering block
# Purpose:
#   Create interpretable draft-risk variables from existing composition
#   counts. These features do not require new API calls.
# ================================================================

required_feature_columns <- c(
  "radiant_melee_count",
  "dire_melee_count",
  "radiant_ranged_count",
  "dire_ranged_count",
  "radiant_str_count",
  "dire_str_count",
  "radiant_agi_count",
  "dire_agi_count",
  "radiant_int_count",
  "dire_int_count",
  "radiant_all_count",
  "dire_all_count",
  "melee_count_difference",
  "str_count_difference",
  "agi_count_difference",
  "int_count_difference",
  "all_count_difference"
)

missing_feature_columns <- setdiff(required_feature_columns, names(model_prepped))

if (length(missing_feature_columns) > 0) {
  stop(
    paste0(
      "Missing required feature columns: ",
      paste(missing_feature_columns, collapse = ", ")
    )
  )
}

# Helper function for attribute imbalance.
# A balanced four-attribute profile has counts close to the team's average.
# A skewed profile has one or more attributes far above or below that average.
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
  attr_mean <- (str_count + agi_count + int_count + all_count) / 4
  abs(str_count - attr_mean) +
    abs(agi_count - attr_mean) +
    abs(int_count - attr_mean) +
    abs(all_count - attr_mean)
}

model_domain <- model_prepped %>%
  mutate(
    # Attack type categories.
    melee_advantage_category = case_when(
      melee_count_difference <= -2 ~ "Radiant much less melee",
      melee_count_difference == -1 ~ "Radiant slightly less melee",
      melee_count_difference == 0 ~ "Even melee count",
      melee_count_difference == 1 ~ "Radiant slightly more melee",
      melee_count_difference >= 2 ~ "Radiant much more melee",
      TRUE ~ NA_character_
    ),
    melee_advantage_category = factor(
      melee_advantage_category,
      levels = c(
        "Radiant much less melee",
        "Radiant slightly less melee",
        "Even melee count",
        "Radiant slightly more melee",
        "Radiant much more melee"
      )
    ),
    
    # Heavy composition flags.
    radiant_heavy_melee = as.integer(radiant_melee_count >= 4),
    dire_heavy_melee = as.integer(dire_melee_count >= 4),
    radiant_all_ranged = as.integer(radiant_melee_count == 0),
    dire_all_ranged = as.integer(dire_melee_count == 0),
    radiant_all_melee = as.integer(radiant_ranged_count == 0),
    dire_all_melee = as.integer(dire_ranged_count == 0),
    
    # Weak frontline/brawling proxy.
    # This is intentionally labeled as a proxy because Strength/Universal
    # does not equal initiation or true frontline.
    radiant_low_frontline_proxy = as.integer(radiant_str_count <= 1 & radiant_all_count <= 1),
    dire_low_frontline_proxy = as.integer(dire_str_count <= 1 & dire_all_count <= 1),
    frontline_proxy_difference = radiant_low_frontline_proxy - dire_low_frontline_proxy,
    
    # Attribute imbalance.
    radiant_attribute_imbalance = attribute_imbalance(
      radiant_str_count,
      radiant_agi_count,
      radiant_int_count,
      radiant_all_count
    ),
    dire_attribute_imbalance = attribute_imbalance(
      dire_str_count,
      dire_agi_count,
      dire_int_count,
      dire_all_count
    ),
    attribute_imbalance_difference = radiant_attribute_imbalance - dire_attribute_imbalance,
    
    # Simple greed/scaling proxy.
    # This is not a hero-role model. It is a deliberately rough proxy.
    radiant_high_agi_proxy = as.integer(radiant_agi_count >= 3),
    dire_high_agi_proxy = as.integer(dire_agi_count >= 3),
    high_agi_proxy_difference = radiant_high_agi_proxy - dire_high_agi_proxy,
    
    # Extreme intelligence stack proxy.
    radiant_high_int_proxy = as.integer(radiant_int_count >= 4),
    dire_high_int_proxy = as.integer(dire_int_count >= 4),
    high_int_proxy_difference = radiant_high_int_proxy - dire_high_int_proxy
  ) %>%
  drop_na()

# Preview the new feature columns.
domain_feature_preview <- model_domain %>%
  select(
    radiant_win,
    melee_count_difference,
    melee_advantage_category,
    radiant_heavy_melee,
    dire_heavy_melee,
    radiant_all_ranged,
    dire_all_ranged,
    radiant_low_frontline_proxy,
    dire_low_frontline_proxy,
    radiant_attribute_imbalance,
    dire_attribute_imbalance,
    attribute_imbalance_difference
  ) %>%
  head(10)

kable(domain_feature_preview, caption = "Preview of Domain-Informed Draft-Risk Features")
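Preview of Domain-Informed Draft-Risk Features

| radiant_win | melee_count_difference | melee_advantage_category    | radiant_heavy_melee | dire_heavy_melee | radiant_all_ranged | dire_all_ranged | radiant_low_frontline_proxy | dire_low_frontline_proxy | radiant_attribute_imbalance | dire_attribute_imbalance | attribute_imbalance_difference |
|-------------|------------------------|-----------------------------|---------------------|------------------|--------------------|-----------------|-----------------------------|--------------------------|-----------------------------|--------------------------|--------------------------------|
| Dire_Win    | 3                      | Radiant much more melee     | 1                   | 0                | 0                  | 0               | 0                           | 1                        | 3.0                         | 3.5                      | -0.5                           |
| Dire_Win    | -1                     | Radiant slightly less melee | 0                   | 0                | 0                  | 0               | 1                           | 0                        | 1.5                         | 3.0                      | -1.5                           |
| Dire_Win    | 0                      | Even melee count            | 0                   | 0                | 0                  | 0               | 0                           | 0                        | 3.0                         | 3.0                      | 0.0                            |
| Dire_Win    | -2                     | Radiant much less melee     | 0                   | 0                | 0                  | 0               | 1                           | 0                        | 3.5                         | 3.5                      | 0.0                            |
| Radiant_Win | 1                      | Radiant slightly more melee | 0                   | 0                | 0                  | 0               | 0                           | 0                        | 3.0                         | 3.0                      | 0.0                            |
| Radiant_Win | -1                     | Radiant slightly less melee | 0                   | 1                | 0                  | 0               | 0                           | 0                        | 1.5                         | 1.5                      | 0.0                            |
| Radiant_Win | 2                      | Radiant much more melee     | 1                   | 0                | 0                  | 0               | 0                           | 0                        | 3.5                         | 3.0                      | 0.5                            |
| Dire_Win    | -2                     | Radiant much less melee     | 0                   | 1                | 0                  | 0               | 0                           | 0                        | 1.5                         | 3.0                      | -1.5                           |
| Radiant_Win | -2                     | Radiant much less melee     | 0                   | 0                | 0                  | 0               | 1                           | 1                        | 3.0                         | 3.0                      | 0.0                            |
| Dire_Win    | 2                      | Radiant much more melee     | 0                   | 0                | 0                  | 0               | 1                           | 1                        | 3.5                         | 1.5                      | 2.0                            |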
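
The attribute imbalance score is easier to read with a worked example. The sketch below repeats the helper from the feature engineering block and applies it to the first match in the dataset, using the counts shown in the glimpse() output (Radiant: 2 Strength, 2 Agility, 1 Intelligence, 0 Universal; Dire: 1, 3, 0, 1). The variable names are illustrative. Because every team has five heroes, the per-attribute average is always 1.25, which bounds the score.

```r
# Same helper as in the feature engineering block.
attribute_imbalance <- function(str_count, agi_count, int_count, all_count) {
  attr_mean <- (str_count + agi_count + int_count + all_count) / 4
  abs(str_count - attr_mean) +
    abs(agi_count - attr_mean) +
    abs(int_count - attr_mean) +
    abs(all_count - attr_mean)
}

# First match in the dataset (counts read from the glimpse() output).
radiant_first_match <- attribute_imbalance(2, 2, 1, 0)  # 3.0, as in the preview
dire_first_match <- attribute_imbalance(1, 3, 0, 1)     # 3.5, as in the preview

# Bounds for a five-hero team.
most_balanced <- attribute_imbalance(2, 1, 1, 1)  # 1.5: no team can score lower
most_skewed <- attribute_imbalance(5, 0, 0, 0)    # 7.5: all five heroes share one attribute

c(radiant_first_match, dire_first_match, most_balanced, most_skewed)
```

The score therefore only takes a handful of distinct values between 1.5 and 7.5, which is one reason it should be read as a coarse proxy rather than a fine-grained measure.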

10 Domain-Informed EDA

10.1 Win Rate by Melee Advantage Category

# ================================================================
# EDA block
# Purpose:
#   Make the melee/ranged question easier to read than raw counts.
# ================================================================

melee_category_summary <- model_domain %>%
  group_by(melee_advantage_category) %>%
  summarise(
    matches = n(),
    radiant_win_rate = mean(radiant_win == "Radiant_Win"),
    .groups = "drop"
  ) %>%
  mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))

kable(
  melee_category_summary,
  digits = 3,
  caption = "Radiant Win Rate by Melee Advantage Category"
)
Radiant Win Rate by Melee Advantage Category

| melee_advantage_category    | matches | radiant_win_rate | win_rate_label |
|-----------------------------|---------|------------------|----------------|
| Radiant much less melee     | 589     | 0.487            | 48.7%          |
| Radiant slightly less melee | 1177    | 0.517            | 51.7%          |
| Even melee count            | 1466    | 0.550            | 55.0%          |
| Radiant slightly more melee | 1144    | 0.525            | 52.5%          |
| Radiant much more melee     | 624     | 0.546            | 54.6%          |

melee_category_summary %>%
  ggplot(aes(x = melee_advantage_category, y = radiant_win_rate)) +
  geom_col() +
  geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.4) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  labs(
    title = "Radiant Win Rate by Melee Advantage Category",
    subtitle = "Domain-informed check: raw melee difference grouped into readable draft shapes",
    x = "Melee Advantage Category",
    y = "Radiant Win Rate"
  ) +
  theme(axis.text.x = element_text(angle = 25, hjust = 1))
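
Each win rate in the melee category table rests on between 589 and 1,466 matches, so it carries sampling uncertainty. As a rough sketch, approximate 95% intervals can be attached using the normal approximation to the binomial. The helper name win_rate_interval() is hypothetical, and the rates and counts are retyped from the table rather than recomputed from the data.

```r
# Hypothetical helper: approximate 95% interval for a win rate,
# using the normal approximation to the binomial.
win_rate_interval <- function(win_rate, matches, z = 1.96) {
  std_error <- sqrt(win_rate * (1 - win_rate) / matches)
  data.frame(
    win_rate = win_rate,
    matches = matches,
    lower = win_rate - z * std_error,
    upper = win_rate + z * std_error
  )
}

# Values retyped from the melee advantage category table.
melee_intervals <- win_rate_interval(
  win_rate = c(0.487, 0.517, 0.550, 0.525, 0.546),
  matches = c(589, 1177, 1466, 1144, 624)
)
melee_intervals$category <- c(
  "Radiant much less melee",
  "Radiant slightly less melee",
  "Even melee count",
  "Radiant slightly more melee",
  "Radiant much more melee"
)

melee_intervals[, c("category", "matches", "win_rate", "lower", "upper")]
```

Where intervals for two categories overlap heavily, the gap between their win rates may be within sampling noise and should be interpreted cautiously.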

10.2 Win Rate by Draft-Risk Flags

# ================================================================
# EDA block
# Purpose:
#   Summarize simple draft-risk flags in one readable table.
# ================================================================

risk_flag_summary <- model_domain %>%
  summarise(
    radiant_heavy_melee_wr = mean(radiant_win == "Radiant_Win" & radiant_heavy_melee == 1) / mean(radiant_heavy_melee == 1),
    radiant_all_ranged_wr = mean(radiant_win == "Radiant_Win" & radiant_all_ranged == 1) / mean(radiant_all_ranged == 1),
    radiant_low_frontline_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_low_frontline_proxy == 1) / mean(radiant_low_frontline_proxy == 1),
    radiant_high_agi_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_high_agi_proxy == 1) / mean(radiant_high_agi_proxy == 1),
    radiant_high_int_proxy_wr = mean(radiant_win == "Radiant_Win" & radiant_high_int_proxy == 1) / mean(radiant_high_int_proxy == 1),
    overall_radiant_wr = mean(radiant_win == "Radiant_Win")
  ) %>%
  pivot_longer(
    cols = everything(),
    names_to = "draft_condition",
    values_to = "radiant_win_rate"
  ) %>%
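  # A flag that never occurs yields 0/0 = NaN. Report it as NA rather
  # than as a misleading 0% win rate.
  mutate(radiant_win_rate = if_else(is.nan(radiant_win_rate), NA_real_, radiant_win_rate))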

kable(
  risk_flag_summary,
  digits = 3,
  caption = "Radiant Win Rate Under Selected Draft-Risk Conditions"
)
Radiant Win Rate Under Selected Draft-Risk Conditions

| draft_condition                | radiant_win_rate |
|--------------------------------|------------------|
| radiant_heavy_melee_wr         | 0.540            |
| radiant_all_ranged_wr          | 0.435            |
| radiant_low_frontline_proxy_wr | 0.511            |
| radiant_high_agi_proxy_wr      | 0.549            |
| radiant_high_int_proxy_wr      | 0.563            |
| overall_radiant_wr             | 0.529            |

10.3 Attribute Imbalance Summary

# ================================================================
# EDA block
# Purpose:
#   Test whether skewed attribute profiles appear meaningfully different.
# ================================================================

attribute_imbalance_summary <- model_domain %>%
  mutate(
    radiant_more_imbalanced = case_when(
      attribute_imbalance_difference > 0 ~ "Radiant more imbalanced",
      attribute_imbalance_difference < 0 ~ "Dire more imbalanced",
      TRUE ~ "Equal imbalance"
    )
  ) %>%
  group_by(radiant_more_imbalanced) %>%
  summarise(
    matches = n(),
    radiant_win_rate = mean(radiant_win == "Radiant_Win"),
    .groups = "drop"
  ) %>%
  mutate(win_rate_label = percent(radiant_win_rate, accuracy = 0.1))

kable(
  attribute_imbalance_summary,
  digits = 3,
  caption = "Radiant Win Rate by Attribute Imbalance Direction"
)
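Radiant Win Rate by Attribute Imbalance Direction

| radiant_more_imbalanced | matches | radiant_win_rate | win_rate_label |
|-------------------------|---------|------------------|----------------|
| Dire more imbalanced    | 1802    | 0.530            | 53.0%          |
| Equal imbalance         | 1460    | 0.508            | 50.8%          |
| Radiant more imbalanced | 1738    | 0.545            | 54.5%          |
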
attribute_imbalance_summary %>%
  ggplot(aes(x = radiant_more_imbalanced, y = radiant_win_rate)) +
  geom_col() +
  geom_text(aes(label = paste0(win_rate_label, "\nn=", matches)), vjust = -0.25, size = 3.5) +
  scale_y_continuous(labels = percent_format(), limits = c(0, 1)) +
  labs(
    title = "Radiant Win Rate by Attribute Imbalance Direction",
    subtitle = "A rough proxy for whether one team has a more skewed draft identity",
    x = "Attribute Imbalance Category",
    y = "Radiant Win Rate"
  )

11 Build Modeling Dataset for Model 3

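# ================================================================
# Model 3 data block
# Purpose:
#   Select a compact set of draft-risk features.
#
# Important:
#   This avoids the broad Sprint 2 issue where melee counts, ranged
#   counts, differences, and shares encoded overlapping information.
#   Two exact linear dependencies remain: the four attribute-count
#   differences sum to zero (each team has five heroes), and
#   frontline_proxy_difference equals radiant_low_frontline_proxy
#   minus dire_low_frontline_proxy. glm() reports NA for the
#   redundant terms.
# ================================================================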

model3_predictors <- c(
  "melee_count_difference",
  "str_count_difference",
  "agi_count_difference",
  "int_count_difference",
  "all_count_difference",
  "attribute_imbalance_difference",
  "frontline_proxy_difference",
  "high_agi_proxy_difference",
  "high_int_proxy_difference",
  "radiant_heavy_melee",
  "dire_heavy_melee",
  "radiant_all_ranged",
  "dire_all_ranged",
  "radiant_low_frontline_proxy",
  "dire_low_frontline_proxy"
)

model3_df <- model_domain %>%
  select(radiant_win, all_of(model3_predictors)) %>%
  drop_na()

model3_columns <- tibble(
  column = names(model3_df),
  type = map_chr(model3_df, ~ class(.x)[1])
)

kable(model3_columns, caption = "Model 3 Final Modeling Columns")
Model 3 Final Modeling Columns

| column                         | type    |
|--------------------------------|---------|
| radiant_win                    | factor  |
| melee_count_difference         | numeric |
| str_count_difference           | numeric |
| agi_count_difference           | numeric |
| int_count_difference           | numeric |
| all_count_difference           | numeric |
| attribute_imbalance_difference | numeric |
| frontline_proxy_difference     | integer |
| high_agi_proxy_difference      | integer |
| high_int_proxy_difference      | integer |
| radiant_heavy_melee            | integer |
| dire_heavy_melee               | integer |
| radiant_all_ranged             | integer |
| dire_all_ranged                | integer |
| radiant_low_frontline_proxy    | integer |
| dire_low_frontline_proxy       | integer |

12 Model 3 Readiness Checks

# ================================================================
# Readiness checks block
# Purpose:
#   Confirm Model 3 data is usable before fitting.
# ================================================================

model3_checks <- tibble(
  check = c(
    "Model 3 data has rows",
    "Target has exactly two classes",
    "Target has no missing values",
    "All predictors are numeric",
    "At least five predictors are available",
    "duration is excluded",
    "match_id is excluded"
  ),
  passed = c(
    nrow(model3_df) > 0,
    n_distinct(model3_df$radiant_win) == 2,
    all(!is.na(model3_df$radiant_win)),
    all(map_lgl(model3_df %>% select(-radiant_win), is.numeric)),
    ncol(model3_df) - 1 >= 5,
    !"duration" %in% names(model3_df),
    !"match_id" %in% names(model3_df)
  )
) %>%
  mutate(result = if_else(passed, "PASS", "FAIL"))

kable(model3_checks, caption = "Model 3 Readiness Checks")
Model 3 Readiness Checks

| check                                  | passed | result |
|----------------------------------------|--------|--------|
| Model 3 data has rows                  | TRUE   | PASS   |
| Target has exactly two classes         | TRUE   | PASS   |
| Target has no missing values           | TRUE   | PASS   |
| All predictors are numeric             | TRUE   | PASS   |
| At least five predictors are available | TRUE   | PASS   |
| duration is excluded                   | TRUE   | PASS   |
| match_id is excluded                   | TRUE   | PASS   |

stopifnot(all(model3_checks$passed))

13 Train / Test Split

# ================================================================
# Train/test split block
# Purpose:
#   Split data while preserving class balance.
# ================================================================

lane_split <- initial_split(model3_df, prop = 0.80, strata = radiant_win)

train_data <- training(lane_split)
test_data <- testing(lane_split)

split_summary <- tibble(
  dataset = c("Training", "Testing"),
  rows = c(nrow(train_data), nrow(test_data)),
  radiant_win_rate = c(
    mean(train_data$radiant_win == "Radiant_Win"),
    mean(test_data$radiant_win == "Radiant_Win")
  )
)

kable(split_summary, digits = 3, caption = "Model 3 Train/Test Split Summary")
Model 3 Train/Test Split Summary

| dataset  | rows | radiant_win_rate |
|----------|------|------------------|
| Training | 3999 | 0.529            |
| Testing  | 1001 | 0.528            |

14 Baseline Accuracy Reference

# ================================================================
# Baseline block
# Purpose:
#   Establish the majority-class benchmark. The model should be judged
#   against this, not just against 50 percent accuracy.
# ================================================================

majority_class_rate <- train_data %>%
  count(radiant_win) %>%
  mutate(rate = n / sum(n)) %>%
  arrange(desc(rate)) %>%
  slice(1)

kable(
  majority_class_rate,
  digits = 3,
  caption = "Majority-Class Baseline from Training Data"
)
Majority-Class Baseline from Training Data

| radiant_win | n    | rate  |
|-------------|------|-------|
| Radiant_Win | 2114 | 0.529 |

15 Model 3: Domain-Informed Draft Risk Logistic Regression

# ================================================================
# Model 3 fitting block
# Purpose:
#   Fit an interpretable logistic regression using cleaner, domain-
#   informed draft-risk features.
# ================================================================

model3_recipe <- recipe(radiant_win ~ ., data = train_data) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())

model3_log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

model3_log_workflow <- workflow() %>%
  add_recipe(model3_recipe) %>%
  add_model(model3_log_spec)

model3_log_fit <- fit(model3_log_workflow, data = train_data)

model3_log_fit
## ══ Workflow [trained] ═══════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: logistic_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────
## 2 Recipe Steps
## 
## • step_zv()
## • step_normalize()
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────
## 
## Call:  stats::glm(formula = ..y ~ ., family = stats::binomial, data = data)
## 
## Coefficients:
##                    (Intercept)          melee_count_difference            str_count_difference  
##                       0.115178                        0.036814                        0.064215  
##           agi_count_difference            int_count_difference            all_count_difference  
##                       0.079635                        0.098550                              NA  
## attribute_imbalance_difference      frontline_proxy_difference       high_agi_proxy_difference  
##                      -0.013211                       -0.095076                        0.060048  
##      high_int_proxy_difference             radiant_heavy_melee                dire_heavy_melee  
##                       0.034014                       -0.003934                        0.070702  
##             radiant_all_ranged                 dire_all_ranged     radiant_low_frontline_proxy  
##                      -0.029035                       -0.028420                       -0.021098  
##       dire_low_frontline_proxy  
##                             NA  
## 
## Degrees of Freedom: 3998 Total (i.e. Null);  3985 Residual
## Null Deviance:       5531 
## Residual Deviance: 5511  AIC: 5539

16 Model 3 Predictions and Metrics

# ================================================================
# Prediction and metrics block
# Purpose:
#   Evaluate Model 3 on the test set.
# ================================================================

model3_predictions <- predict(model3_log_fit, test_data, type = "prob") %>%
  bind_cols(predict(model3_log_fit, test_data, type = "class")) %>%
  bind_cols(test_data %>% select(radiant_win))

model3_metrics <- bind_rows(
  accuracy(model3_predictions, truth = radiant_win, estimate = .pred_class),
  roc_auc(model3_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
  mutate(model = "Model 3: Domain-Informed Draft Risk", .before = 1)

kable(model3_metrics, digits = 3, caption = "Model 3 Performance")
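Model 3 Performance

| model                               | .metric  | .estimator | .estimate |
|-------------------------------------|----------|------------|-----------|
| Model 3: Domain-Informed Draft Risk | accuracy | binary     | 0.519     |
| Model 3: Domain-Informed Draft Risk | roc_auc  | binary     | 0.492     |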
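
ROC AUC can be read as the probability that a randomly chosen Radiant win receives a higher predicted probability than a randomly chosen Dire win, so 0.5 corresponds to random ranking. The sketch below computes that quantity from ranks on four made-up scores. The helper name rank_auc() and the toy values are hypothetical and are not part of the project code.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) identity.
# Ties receive average ranks, which counts a tied pair as half correct.
rank_auc <- function(scores, is_event) {
  n_event <- sum(is_event)
  n_other <- sum(!is_event)
  event_rank_sum <- sum(rank(scores)[is_event])
  (event_rank_sum - n_event * (n_event + 1) / 2) / (n_event * n_other)
}

# Toy predictions: two Radiant wins and two Dire wins.
toy_scores <- c(0.9, 0.8, 0.4, 0.3)
toy_is_radiant_win <- c(TRUE, FALSE, TRUE, FALSE)

# Of the four (Radiant win, Dire win) pairs, three are ranked correctly.
toy_auc <- rank_auc(toy_scores, toy_is_radiant_win)
toy_auc  # 0.75
```

Read this way, a test-set AUC of 0.492 means the logistic model ranks Radiant wins above Dire wins about as often as a coin flip would.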

17 Model 3 Confusion Matrix

# ================================================================
# Confusion matrix block
# Purpose:
#   Check whether the model predicts both classes or mostly defaults to
#   the majority class.
# ================================================================

model3_confusion <- conf_mat(
  model3_predictions,
  truth = radiant_win,
  estimate = .pred_class
)

model3_confusion
##              Truth
## Prediction    Dire_Win Radiant_Win
##   Dire_Win         111         120
##   Radiant_Win      361         409

18 Model 3 Coefficient Interpretation

# ================================================================
# Coefficient interpretation block
# Purpose:
#   Identify which domain-informed features have the largest positive
#   or negative association with Radiant victory.
#
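# Important:
#   Coefficients are associations, not causal proof. Because the recipe
#   normalizes every predictor, each coefficient and odds ratio
#   describes a one-standard-deviation change in that feature.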
# ================================================================

model3_coefficients <- model3_log_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)") %>%
  mutate(
    odds_ratio = exp(estimate),
    direction = case_when(
      is.na(estimate) ~ "Not estimated",
      estimate > 0 ~ "Higher odds of Radiant win",
      estimate < 0 ~ "Lower odds of Radiant win",
      TRUE ~ "No estimated direction"
    )
  ) %>%
  arrange(desc(abs(estimate)))

kable(
  model3_coefficients %>%
    select(term, estimate, odds_ratio, direction),
  digits = 3,
  caption = "Model 3 Logistic Regression Coefficients"
)
Model 3 Logistic Regression Coefficients

| term                           | estimate | odds_ratio | direction                  |
|--------------------------------|----------|------------|----------------------------|
| int_count_difference           | 0.099    | 1.104      | Higher odds of Radiant win |
| frontline_proxy_difference     | -0.095   | 0.909      | Lower odds of Radiant win  |
| agi_count_difference           | 0.080    | 1.083      | Higher odds of Radiant win |
| dire_heavy_melee               | 0.071    | 1.073      | Higher odds of Radiant win |
| str_count_difference           | 0.064    | 1.066      | Higher odds of Radiant win |
| high_agi_proxy_difference      | 0.060    | 1.062      | Higher odds of Radiant win |
| melee_count_difference         | 0.037    | 1.037      | Higher odds of Radiant win |
| high_int_proxy_difference      | 0.034    | 1.035      | Higher odds of Radiant win |
| radiant_all_ranged             | -0.029   | 0.971      | Lower odds of Radiant win  |
| dire_all_ranged                | -0.028   | 0.972      | Lower odds of Radiant win  |
| radiant_low_frontline_proxy    | -0.021   | 0.979      | Lower odds of Radiant win  |
| attribute_imbalance_difference | -0.013   | 0.987      | Lower odds of Radiant win  |
| radiant_heavy_melee            | -0.004   | 0.996      | Lower odds of Radiant win  |
| all_count_difference           | NA       | NA         | Not estimated              |
| dire_low_frontline_proxy       | NA       | NA         | Not estimated              |
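
To make the scale of these coefficients concrete, the sketch below converts the largest one into an odds ratio and into a change in predicted probability. The two coefficient values are retyped from the printed model fit. The interpretation assumes, as the recipe's step_normalize() implies, that a predictor value of zero corresponds to the training mean and one unit corresponds to one standard deviation.

```r
# Coefficients retyped from the printed model fit (log-odds per 1 SD).
intercept <- 0.115178
int_count_difference_coef <- 0.098550

# Odds ratio for a one-standard-deviation increase in int_count_difference.
odds_ratio <- exp(int_count_difference_coef)

# Predicted Radiant win probability with every predictor at its mean,
# and with int_count_difference raised by one standard deviation.
prob_at_mean <- plogis(intercept)
prob_plus_one_sd <- plogis(intercept + int_count_difference_coef)

round(c(
  odds_ratio = odds_ratio,
  prob_at_mean = prob_at_mean,
  prob_plus_one_sd = prob_plus_one_sd,
  probability_shift = prob_plus_one_sd - prob_at_mean
), 3)
```

Even the largest coefficient moves the predicted probability by only a few percentage points, which is consistent with the weak test-set metrics reported earlier.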

19 Optional Comparison Model: Random Forest on Model 3 Features

# ================================================================
# Optional comparison block
# Purpose:
#   Fit a random forest using the same Model 3 features to check whether
#   nonlinear interactions improve predictive performance.
# ================================================================

rf_mtry <- max(1, floor(sqrt(ncol(train_data) - 1)))

model3_rf_spec <- rand_forest(
  trees = 500,
  mtry = rf_mtry,
  min_n = 5
) %>%
  set_engine("ranger", importance = "permutation") %>%
  set_mode("classification")

model3_rf_workflow <- workflow() %>%
  add_recipe(model3_recipe) %>%
  add_model(model3_rf_spec)

model3_rf_fit <- fit(model3_rf_workflow, data = train_data)

model3_rf_predictions <- predict(model3_rf_fit, test_data, type = "prob") %>%
  bind_cols(predict(model3_rf_fit, test_data, type = "class")) %>%
  bind_cols(test_data %>% select(radiant_win))

model3_rf_metrics <- bind_rows(
  accuracy(model3_rf_predictions, truth = radiant_win, estimate = .pred_class),
  roc_auc(model3_rf_predictions, truth = radiant_win, .pred_Radiant_Win, event_level = "second")
) %>%
  mutate(model = "Model 3 RF: Domain-Informed Features", .before = 1)

kable(model3_rf_metrics, digits = 3, caption = "Model 3 Random Forest Performance")
Model 3 Random Forest Performance

| model | .metric | .estimator | .estimate |
|---|---|---|---|
| Model 3 RF: Domain-Informed Features | accuracy | binary | 0.502 |
| Model 3 RF: Domain-Informed Features | roc_auc | binary | 0.498 |
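
The random forest specification requests permutation importance from ranger, but this section does not display those scores. A minimal sketch of how they could be read, using simulated data and the ranger package directly; the toy data frame and its values are hypothetical, and only the column names echo Model 3 features.

```r
library(ranger)

# Toy data: the outcome is unrelated to the predictors by construction.
set.seed(42)
n <- 400
toy <- data.frame(
  radiant_win = factor(sample(c("Dire_Win", "Radiant_Win"), n, replace = TRUE)),
  melee_count_difference = sample(-5:5, n, replace = TRUE),
  str_count_difference = sample(-5:5, n, replace = TRUE)
)

# Fit a probability forest and ask ranger to compute permutation importance.
toy_rf <- ranger(
  radiant_win ~ .,
  data = toy,
  num.trees = 500,
  importance = "permutation",
  probability = TRUE,
  seed = 42
)

# Named numeric vector, one score per predictor, sorted from largest to smallest.
toy_importance <- sort(toy_rf$variable.importance, decreasing = TRUE)
print(toy_importance)
```

For the fitted workflow, `extract_fit_engine(model3_rf_fit)$variable.importance` should return the same kind of named vector. When ROC AUC sits near 0.5, scores close to zero, including slightly negative ones, are expected and should not be read as meaningful rankings.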

20 Model 3 Comparison Table

# ================================================================
# Comparison block
# Purpose:
#   Compare Model 3 against the majority-class baseline.
# ================================================================

baseline_row <- tibble(
  model = "Majority-Class Baseline",
  accuracy = majority_class_rate$rate,
  roc_auc = NA_real_
)

model3_comparison <- bind_rows(model3_metrics, model3_rf_metrics) %>%
  select(model, .metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  bind_rows(baseline_row) %>%
  arrange(desc(replace_na(roc_auc, 0)))

kable(
  model3_comparison,
  digits = 3,
  caption = "Model 3 Performance Compared with Majority-Class Baseline"
)
Model 3 Performance Compared with Majority-Class Baseline

| model | accuracy | roc_auc |
|---|---|---|
| Model 3 RF: Domain-Informed Features | 0.502 | 0.498 |
| Model 3: Domain-Informed Draft Risk | 0.519 | 0.492 |
| Majority-Class Baseline | 0.529 | NA |

21 ROC Curve Comparison

# ================================================================
# ROC curve block
# Purpose:
#   Compare class-separation ability for the logistic and random forest
#   versions of Model 3.
# ================================================================

model3_log_roc <- roc_curve(
  model3_predictions,
  truth = radiant_win,
  .pred_Radiant_Win,
  event_level = "second"
) %>%
  mutate(model = "Model 3 Logistic")

model3_rf_roc <- roc_curve(
  model3_rf_predictions,
  truth = radiant_win,
  .pred_Radiant_Win,
  event_level = "second"
) %>%
  mutate(model = "Model 3 Random Forest")

bind_rows(model3_log_roc, model3_rf_roc) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity, linetype = model)) +
  geom_path(linewidth = 1) +
  geom_abline(linetype = "dashed") +
  labs(
    title = "Model 3 ROC Curve Comparison",
    subtitle = "Curves near the diagonal indicate limited class-separation power",
    x = "False Positive Rate",
    y = "True Positive Rate",
    linetype = "Model"
  )

22 Plain-English Interpretation Template

Use this section after reviewing the output values.

The Sprint 3 model tests whether cleaner, domain-informed draft-risk features improve the interpretability of the Lane Theory Lab analysis. These features are based on public match data and intentionally avoid unavailable behavioral details such as whether supports pulled correctly, whether a lane over-dived, or whether a position 4 actually played a support function.

The key comparison is not whether the model exceeds 50 percent accuracy but whether it improves on the majority-class baseline, which reflects the observed Radiant-side win advantage in the sample. In this run, neither version does: the logistic model reaches 0.519 accuracy and the random forest 0.502, against a baseline of 0.529, and both ROC AUC values (0.492 and 0.498) fall slightly below 0.5. When Model 3 does not outperform this baseline, the result suggests that broad team-level draft-risk features still do not explain enough of the match outcome by themselves.
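
The arithmetic behind that comparison can be sketched in base R using the values reported in the comparison table. The test-set size is not stated in this section, so `n_test` below is a placeholder assumption, and the interval is a rough normal approximation that treats the baseline as a fixed number.

```r
# Values reported in the Model 3 comparison table.
baseline_accuracy <- 0.529
model_accuracy <- c(logistic = 0.519, random_forest = 0.502)

# ASSUMPTION: placeholder test-set size; replace with nrow(test_data).
n_test <- 2000

# Lift over the baseline in percentage points (negative means worse than baseline).
lift_points <- 100 * (model_accuracy - baseline_accuracy)

# Approximate binomial standard error of each accuracy estimate.
std_error <- sqrt(model_accuracy * (1 - model_accuracy) / n_test)

comparison <- data.frame(
  model = names(model_accuracy),
  accuracy = unname(model_accuracy),
  lift_points = unname(lift_points),
  lower_95 = unname(model_accuracy - 1.96 * std_error),
  upper_95 = unname(model_accuracy + 1.96 * std_error)
)

# A model clearly beats the baseline only if its whole interval sits above it.
comparison$beats_baseline <- comparison$lower_95 > baseline_accuracy
print(comparison)
```

Because both reported accuracies are below the baseline, `beats_baseline` is FALSE for both models whatever value `n_test` takes. The sample size only changes how wide the intervals are.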

That finding would not invalidate the project. Instead, it would support the idea that useful esports BI requires richer context than simple team-level counts. Future versions should consider lane assignments, hero identity, role fidelity, reliable stun counts, initiation tools, support behavior, patch context, game mode, and rank bracket.

23 Limitations

This sprint still uses match-level draft-composition features. It does not directly observe laning behavior, support behavior, bad pulls, inopportune dives, player skill, hero familiarity, item builds, objective control, or coordination. These are major parts of Dota outcomes.

The frontline and greed variables are deliberately labeled as proxies. Strength or Universal heroes do not automatically equal frontline or initiation. Agility-heavy drafts do not automatically mean greed. Intelligence-heavy drafts do not automatically mean spell pressure. These features are simple public-data approximations designed to ask better questions, not perfect strategic labels.

The current dataset may also mix game modes, rank brackets, patches, regions, and public-match environments. Turbo and ranked All Pick matches may not reflect the same player mindset or drafting priorities. Public-match data may also include bots, smurfs, griefing, role abuse, and other behaviors that are difficult to detect from composition data alone.

24 Recommendations

The strongest next step is to treat Model 3 as a domain-informed baseline rather than a final predictive product. If the model performs weakly, the recommendation is not to force the result but to improve feature quality by adding context.

Recommended next steps:

  1. Compare Model 3 performance against the majority-class baseline.
  2. Review whether domain-informed features improve interpretation, even if predictive lift is small.
  3. Add rank and game-mode comparison cuts to see whether behavior changes across public-match contexts.
  4. Consider a future hero-archetype table for initiation, stun reliability, scaling core count, sustain, and support function.
  5. Use the final project as a foundation for a future draft-risk calculator or esports BI dashboard.
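
The rank and game-mode cuts in step 3 could start from a grouped summary like the sketch below. The `game_mode` and `rank_bracket` columns, their values, and the simulated outcomes are hypothetical; the current dataset may not carry these fields in this form.

```r
library(dplyr)

# Simulated matches with hypothetical context columns.
set.seed(42)
n <- 600
toy_matches <- tibble(
  game_mode = sample(c("All Pick", "Turbo"), n, replace = TRUE),
  rank_bracket = sample(c("Low", "Mid", "High"), n, replace = TRUE),
  radiant_win = sample(c(TRUE, FALSE), n, replace = TRUE, prob = c(0.53, 0.47))
)

# Radiant win rate and the matching majority-class baseline within each cut.
context_cuts <- toy_matches %>%
  group_by(game_mode, rank_bracket) %>%
  summarise(
    matches = n(),
    radiant_win_rate = mean(radiant_win),
    .groups = "drop"
  ) %>%
  mutate(majority_baseline = pmax(radiant_win_rate, 1 - radiant_win_rate)) %>%
  arrange(game_mode, rank_bracket)

print(context_cuts)
```

Reporting the baseline per cut matters because the Radiant-side advantage may differ by context, so a model evaluated within one game mode or bracket should be compared against that subgroup's own baseline.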

25 Replication Notes

To rerun this sprint:

  1. Open this .Rmd file in RStudio.
  2. Confirm the dataset exists in the data/ folder.
  3. Install missing packages if needed.
  4. Knit to HTML.
  5. Review the Model 3 comparison table and coefficient table.
  6. Merge useful sections into the final midterm report.