Introduction

The purpose of this project is to develop a model that will be able to predict the win rate of various unique team compositions in Valorant.

[Image: Valorant Agents Relaxing]

About Valorant

Valorant is a 5v5 tactical first-person shooter developed by Riot Games in which one team starts as Attackers and the other as Defenders, switching roles at halftime after 12 rounds. The Attackers’ objective is to plant a bomb called the Spike at a bomb site, while the Defenders aim to eliminate them or defuse the Spike after it is planted. The first team to reach 13 round wins takes the map, so regulation lasts at most 24 rounds; in professional play, a 12-12 tie goes to overtime. What makes Valorant unique is its roster of 29 playable characters called Agents, each with distinct abilities belonging to one of four roles: Duelists (fraggers who create space), Initiators (information gatherers), Controllers (smoke and area denial), and Sentinels (defensive anchors).

The Objective: Why Predict Team Compositions?

At the professional level, Valorant is organized through the Valorant Champions Tour (VCT), one of Riot Games’ official tournament circuits, spanning three international leagues: Americas, EMEA, and Pacific. Riot Games frequently rotates maps in and out of the active map pool (the collection of maps currently used in competitive play), so play styles keep shifting and teams keep constructing new compositions. This is further complicated by the meta, the prevailing trends in Agent selection at any given time, which changes constantly due to patches and new Agent releases.

As someone who really enjoys watching pro play, I find it difficult to keep up with which team compositions are actually performing well on a given map, given how frequently things change. The purpose of this project is to use VCT match data from 2021 to 2026 to build a model that predicts a team’s win probability based on its Agent composition and map, offering insight into the strongest compositions without having to manually track the changing meta.

Look at all those different Agents!

Loading Packages and Data (Source)

library(tidyverse)
library(tidymodels)
library(janitor)
library(kknn)
library(glmnet)
library(ranger)
library(xgboost)
library(ggplot2)
library(readr)
library(dplyr)
library(vip)

The data used in this project is sourced from the Kaggle dataset “Valorant Champion Tour 2021-2026 Data” by Ryan Luong, scraped from vlr.gg, the primary statistics tracker for pro league Valorant. The data set is available here.

Of the many files available, we use four per year:

overview.csv — one row per player per map, including which agent they played
maps_scores.csv — one row per map played, with final scores for both teams
maps_stats.csv — attacker and defender side win rates for each map, used as feature enrichment
agents_pick_rates.csv — each agent’s global pick rate across the competitive meta for a given year

The first two files provide the structure for building our outcome variable and agent composition. The other two are joined in as additional predictors during the tidying stage.

#Setting the base path to the folder containing all six years of data (allows for easier access)
base_path <- "~/Desktop/Final Project/archive"

#list.files() lets us search recursively through the subfolders in "archive" for files matching the pattern. This helps a lot because there are 6 years of data with similarly named files.
#full.names = TRUE returns the complete file path so read_csv knows where to look
overview_files <- list.files(base_path, pattern = "overview.*\\.csv",
                             recursive = TRUE, full.names = TRUE)
score_files <- list.files(base_path, pattern = "maps_scores.*\\.csv",
                          recursive = TRUE, full.names = TRUE)
maps_stats_files <- list.files(base_path, pattern = "maps_stats.*\\.csv",
                               recursive = TRUE, full.names = TRUE)
pick_rates_files <- list.files(base_path, pattern = "agents_pick_rates.*\\.csv",
                               recursive = TRUE, full.names = TRUE)

#clean_names() standardizes column names to lowercase with underscores (Map Name -> map_name)
#str_extract is used to pull a 4 digit number from the file path string allowing us to use the year as a value
#Note, in the original data set, all the files had the same name despite the different years, I had to manually change the file to the corresponding year (e.g., overview.csv -> overview_2021.csv)
overview_all <- read_csv(overview_files, show_col_types = FALSE, id = "source_file") |>
  mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
  clean_names()

scores_all <- read_csv(score_files, show_col_types = FALSE, id = "source_file") |>
  mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
  clean_names()

maps_stats_all <- read_csv(maps_stats_files, show_col_types = FALSE, id = "source_file") |>
  mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
  clean_names()

pick_rates_all <- read_csv(pick_rates_files, show_col_types = FALSE, id = "source_file") |>
  mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
  clean_names()
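
As a small, hedged illustration of the year-extraction step, the sketch below applies the same str_extract() pattern to made-up file paths (the paths are hypothetical, not the actual folder layout). It also shows a caveat: the pattern returns the first run of four digits, so any earlier digits in a path would be picked up instead of the year.

```r
library(stringr)

# Hypothetical paths for illustration only
paths <- c("archive/vct_2021/overview_2021.csv",
           "archive/vct_2024/overview_2024.csv")

# str_extract() returns the first match of the pattern in each string
years <- as.integer(str_extract(paths, "\\d{4}"))
print(years)

# Caveat: an earlier run of digits wins, because only the first match is returned
tricky <- str_extract("export_10000/overview_2021.csv", "\\d{4}")
print(tricky)
```

The base path used in this project contains no digits, so the first four-digit match is the year taken from the folder or file name.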

Codebook

These are the predictors we are going to be working with.

Outcome Variable - won - Whether the team won the map (yes/no)

Categorical Predictors

map - The name of the map the match was played on
agent_1 - First agent in the team’s sorted composition
agent_2 - Second agent in the team’s sorted composition
agent_3 - Third agent in the team’s sorted composition
agent_4 - Fourth agent in the team’s sorted composition
agent_5 - Fifth agent in the team’s sorted composition

Numeric Predictors

attacker_win_pct - Average attacker side win percentage for that map in that year
defender_win_pct - Average defender side win percentage for that map in that year
mean_pick_rate - Average global pick rate across all five agents on the team

Exploring Raw Data & Tidying

How many variables do we have? Should we narrow it down?

Before building our modeling dataset, let’s take a look at what we’re working with. The four raw files each serve a different purpose, so we’ll first look at their dimensions and relevant columns.

dim(overview_all)

## [1] 1128745      23

dim(scores_all)

## [1] 26977    18

dim(maps_stats_all)

## [1] 16826     9

dim(pick_rates_all)

## [1] 409250      8

colnames(overview_all)

##  [1] "source_file"                       "tournament"                       
##  [3] "stage"                             "match_type"                       
##  [5] "match_name"                        "map"                              
##  [7] "player"                            "team"                             
##  [9] "agents"                            "rating"                           
## [11] "average_combat_score"              "kills"                            
## [13] "deaths"                            "assists"                          
## [15] "kills_deaths_kd"                   "kill_assist_trade_survive_percent"
## [17] "average_damage_per_round"          "headshot_percent"                 
## [19] "first_kills"                       "first_deaths"                     
## [21] "kills_deaths_fkd"                  "side"                             
## [23] "year"

colnames(scores_all)

##  [1] "source_file"           "tournament"            "stage"                
##  [4] "match_type"            "match_name"            "map"                  
##  [7] "team_a"                "team_a_score"          "team_a_attacker_score"
## [10] "team_a_defender_score" "team_a_overtime_score" "team_b"               
## [13] "team_b_score"          "team_b_attacker_score" "team_b_defender_score"
## [16] "team_b_overtime_score" "duration"              "year"

colnames(maps_stats_all)

## [1] "source_file"                  "tournament"                  
## [3] "stage"                        "match_type"                  
## [5] "map"                          "total_maps_played"           
## [7] "attacker_side_win_percentage" "defender_side_win_percentage"
## [9] "year"

colnames(pick_rates_all)

## [1] "source_file" "tournament"  "stage"       "match_type"  "map"        
## [6] "agent"       "pick_rate"   "year"

It’s apparent that the files vary significantly in size. overview_all contains 1,128,745 rows across 23 columns (21 from the raw file plus the source_file and year columns added at load time). It holds per-player rows for every map, so each map contributes at least 10 rows (one per player, since it’s 5v5), plus separate rows per side and the “All Maps” series summaries. scores_all has 26,977 rows and 18 columns, maps_stats_all has 16,826 rows across 9 columns, and pick_rates_all has 409,250 rows across 8 columns (each count again including source_file and year). Clearly, we cannot model directly from these raw files. However, the files share common identifiers such as tournament, stage, match_type, match_name, and map, which we can use as join keys. Ideally, we combine them into a single, clean data set with one row per team per map, reducing the over 1 million rows in overview_all down to something more workable.

Tidying

From overview_all, the columns we care about are tournament, match_name, map, team, player, agents, and side (plus the year column we added). Everything else, such as kills, deaths, rating, and headshot percentage, is an individual player performance stat that goes beyond our research question. We are predicting outcomes from composition alone, not player skill.

Now that we understand the structure of our raw data, we can start building our modeling dataset. First, we will reshape overview_all to get one row per team per map with all five agents as separate columns, then determine the winner of each map from scores_all, and finally join the two together!

#Filter to "both" sides to avoid counting attack and defense rows separately
#Group by match + map + team, then number each player 1-5 to create agent slots
agents_wide <- overview_all |>
  #Set map!= "All Maps" since "All Maps" rows are the summaries of the series (Best of 3)
  filter(side == "both", map != "All Maps") |>
  distinct(tournament, match_name, map, team, player, .keep_all = TRUE) |>
  group_by(year, tournament, match_name, map, team) |>
  arrange(player, .by_group = TRUE) |>
  slice_head(n = 5) |>
  mutate(slot = paste0("agent_", row_number())) |>
  select(year, tournament, match_name, map, team, slot, agents) |>
  pivot_wider(names_from = slot, values_from = agents) |>
  ungroup()

#Verifying code works
head(agents_wide)

## # A tibble: 6 × 10
##    year tournament        match_name map   team  agent_1 agent_2 agent_3 agent_4
##   <int> <chr>             <chr>      <chr> <chr> <chr>   <chr>   <chr>   <chr>  
## 1  2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra   killjoy sova    jett   
## 2  2021 Champions Tour A… BOOM Espo… Asce… DAMW… sova    astra   skye    cypher 
## 3  2021 Champions Tour A… BOOM Espo… Bind  BOOM… astra   killjoy skye    raze   
## 4  2021 Champions Tour A… BOOM Espo… Bind  DAMW… sova    astra   skye    viper  
## 5  2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra   killjoy sova    jett   
## 6  2021 Champions Tour A… BOOM Espo… Asce… FENN… astra   sage    jett    sova   
## # ℹ 1 more variable: agent_5 <chr>

We now have one row per team per map, with columns agent_1 through agent_5 representing the five agents that the team played!
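
To make the reshape concrete, here is a minimal sketch on a made-up table (the teams, players, and agents are invented for illustration). It numbers each team’s players 1 to 5 and spreads their agents into agent_1 through agent_5, following the same idea as the code above.

```r
library(dplyr)
library(tidyr)

# Hypothetical long table: one row per player on a single map
toy <- tibble(
  map    = "Bind",
  team   = rep(c("Team A", "Team B"), each = 5),
  player = c("p1", "p2", "p3", "p4", "p5", "q1", "q2", "q3", "q4", "q5"),
  agents = c("jett", "omen", "sova", "sage", "raze",
             "viper", "skye", "raze", "brimstone", "cypher")
)

toy_wide <- toy |>
  group_by(map, team) |>
  arrange(player, .by_group = TRUE) |>
  # Number the players within each team to create the slot names
  mutate(slot = paste0("agent_", row_number())) |>
  ungroup() |>
  select(map, team, slot, agents) |>
  pivot_wider(names_from = slot, values_from = agents)

print(toy_wide)
```

The ten player rows collapse into two team rows, each with five agent columns.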

Next, let’s determine the winner of each map using scores_all. The team with the higher score wins (13 rounds in regulation, or more if the map went to overtime). We do this by pivoting the data from wide (one row per map with two team columns) to long (one row per team per map) and creating a binary outcome variable won.

#Using if_else statement with mutate to create a column that dictates the winner
scores_long <- scores_all |>
  mutate(winner = if_else(team_a_score > team_b_score, team_a, team_b)) |>
  select(year, tournament, match_name, map, team_a, team_b, winner) |>
  #Pivoting to attach `won` label to each team
  pivot_longer(cols = c(team_a, team_b),
               names_to = "side",
               values_to = "team") |>
#Creating a binary outcome: did this team win? which is stored as a factor for classification
  mutate(won = as.factor(if_else(team == winner, "yes", "no"))) |>
  select(year, tournament, match_name, map, team, won)

#Printing a few rows for verification
head(scores_long)

## # A tibble: 6 × 6
##    year tournament              match_name                    map    team  won  
##   <int> <chr>                   <chr>                         <chr>  <chr> <fct>
## 1  2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Haven  Visi… yes  
## 2  2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Haven  FULL… no   
## 3  2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Breeze Visi… yes  
## 4  2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Breeze FULL… no   
## 5  2021 Valorant Champions 2021 Team Vikings vs Crazy Raccoon Icebox Team… yes  
## 6  2021 Valorant Champions 2021 Team Vikings vs Crazy Raccoon Icebox Craz… no

Each map now has two rows, one for each team, with won indicating whether that team won the map or not.
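
The labeling step can be sketched on two invented maps (team names and scores are hypothetical). Each map yields exactly one "yes" row and one "no" row. Note that with this logic an exact tie would label team_b as the winner, so a guard would be needed if tied or missing scores could occur.

```r
library(dplyr)
library(tidyr)

# Hypothetical scores: one row per map
scores_toy <- tibble(
  map          = c("Haven", "Bind"),
  team_a       = c("Alpha", "Alpha"),
  team_a_score = c(13, 11),
  team_b       = c("Bravo", "Bravo"),
  team_b_score = c(7, 13)
)

long <- scores_toy |>
  mutate(winner = if_else(team_a_score > team_b_score, team_a, team_b)) |>
  # One row per team per map
  pivot_longer(cols = c(team_a, team_b),
               names_to = "side", values_to = "team") |>
  mutate(won = factor(if_else(team == winner, "yes", "no"),
                      levels = c("no", "yes"))) |>
  select(map, team, won)

print(long)
```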

Lastly, we can now join the agent compositions with the match outcomes. We can do that by using inner_join so only maps that appear in both datasets are kept.

#Joining agents_wide and scores_long
model_data <- agents_wide |>
  inner_join(scores_long, by = c("year", "tournament", "match_name", "map", "team"))

#Checking dimensions
dim(model_data)

## [1] 53674    11

#Calling head to verify
head(model_data)

## # A tibble: 6 × 11
##    year tournament        match_name map   team  agent_1 agent_2 agent_3 agent_4
##   <int> <chr>             <chr>      <chr> <chr> <chr>   <chr>   <chr>   <chr>  
## 1  2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra   killjoy sova    jett   
## 2  2021 Champions Tour A… BOOM Espo… Asce… DAMW… sova    astra   skye    cypher 
## 3  2021 Champions Tour A… BOOM Espo… Bind  BOOM… astra   killjoy skye    raze   
## 4  2021 Champions Tour A… BOOM Espo… Bind  DAMW… sova    astra   skye    viper  
## 5  2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra   killjoy sova    jett   
## 6  2021 Champions Tour A… BOOM Espo… Asce… FENN… astra   sage    jett    sova   
## # ℹ 2 more variables: agent_5 <chr>, won <fct>

We went from over 1 million rows to just 53,674 observations!
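
Because inner_join silently drops rows without a match, one optional check is to list them with anti_join. The sketch below uses invented tables; it is not a step from the original analysis, only a way one might audit the join.

```r
library(dplyr)

# Hypothetical compositions and results
comps <- tibble(map = c("Bind", "Bind", "Haven"),
                team = c("Alpha", "Bravo", "Alpha"),
                agent_1 = c("jett", "raze", "omen"))
results <- tibble(map = c("Bind", "Bind"),
                  team = c("Alpha", "Bravo"),
                  won = c("yes", "no"))

# Rows kept by the join
joined <- inner_join(comps, results, by = c("map", "team"))

# Rows of comps that have no matching result and would be dropped
unmatched <- anti_join(comps, results, by = c("map", "team"))

print(unmatched)
```

Running the same anti_join in the other direction would show results with no recorded composition.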

Adding Features

With our core data set built, we can now add some special features to it! First, we can join maps_stats_all which provides the attacker and defender win percentage for each map.

maps_features <- maps_stats_all |>
  #Remove "All Maps" rows, these are just summary rows, not real maps
  filter(map != "All Maps") |>
  #Strip the % sign from win percentage columns so they can be converted to numeric (was originally character)
  mutate(attacker_side_win_percentage = as.numeric(str_remove(attacker_side_win_percentage, "%")),
         defender_side_win_percentage = as.numeric(str_remove(defender_side_win_percentage, "%"))
         ) |>
  group_by(year, map) |>
  #Average attacker and defender win percentage per map per year
  summarise(
    attacker_win_pct = mean(attacker_side_win_percentage, na.rm = TRUE),
    defender_win_pct = mean(defender_side_win_percentage, na.rm = TRUE),
    .groups = "drop"
  )

#Join map balance features onto model_data by year and map
model_data <- model_data |>
  left_join(maps_features, by = c("year", "map"))


#Verifying dimensions
dim(model_data)

## [1] 53674    13

colnames(model_data)

##  [1] "year"             "tournament"       "match_name"       "map"             
##  [5] "team"             "agent_1"          "agent_2"          "agent_3"         
##  [9] "agent_4"          "agent_5"          "won"              "attacker_win_pct"
## [13] "defender_win_pct"
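
As a small sketch of the percentage handling, the example below strips the % sign from invented values and averages them per year and map. It also shows a weighted alternative using a hypothetical total_maps_played column; the analysis above uses the plain mean, so the weighted version is only an illustration of a possible design choice.

```r
library(dplyr)
library(stringr)

# Hypothetical per-tournament map stats
stats_toy <- tibble(
  year = c(2021L, 2021L, 2022L),
  map  = "Bind",
  attacker_side_win_percentage = c("52%", "48%", "55%"),
  total_maps_played = c(10, 30, 5)
)

features_toy <- stats_toy |>
  # "52%" -> 52
  mutate(att = as.numeric(str_remove(attacker_side_win_percentage, "%"))) |>
  group_by(year, map) |>
  summarise(
    attacker_win_pct          = mean(att),
    attacker_win_pct_weighted = weighted.mean(att, total_maps_played),
    .groups = "drop"
  )

print(features_toy)
```

The plain mean treats every tournament row equally, while the weighted mean gives more influence to rows that cover more maps.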

Next, we can incorporate the global agent pick rates from pick_rates_all. Instead of joining pick rate for each individual agent slot, we can compute the average pick rate across all five agents on a team. This gives us a single numeric feature capturing how meta a team composition is. A high average pick rate would suggest that the team is running widely favored agents, while a low average suggests a more unconventional draft (could be experimenting with new compositions).

#Summarizing pick rates to one row per agent per year
#Since pick_rate was a character, we have to strip % from it and convert to numeric
pick_rate_lookup <- pick_rates_all |>
  mutate(pick_rate = as.numeric(str_remove(pick_rate, "%"))) |>
  group_by(year, agent) |>
  summarise(avg_pick_rate = mean(pick_rate, na.rm = TRUE), .groups = "drop")

#For each row, get the average pick rate across all 5 agents
agent_pick_rates <- model_data |>
  pivot_longer(cols = c(agent_1, agent_2, agent_3, agent_4, agent_5),
               names_to = "slot", values_to = "agent") |>
  left_join(pick_rate_lookup, by = c("year", "agent")) |>
  group_by(year, tournament, match_name, map, team) |>
  summarise(mean_pick_rate = mean(avg_pick_rate, na.rm = TRUE), .groups = "drop")

#Join back onto model_data
model_data <- model_data |>
  left_join(agent_pick_rates, by = c("year", "tournament", "match_name", "map", "team"))

dim(model_data)

## [1] 53674    14

colnames(model_data)

##  [1] "year"             "tournament"       "match_name"       "map"             
##  [5] "team"             "agent_1"          "agent_2"          "agent_3"         
##  [9] "agent_4"          "agent_5"          "won"              "attacker_win_pct"
## [13] "defender_win_pct" "mean_pick_rate"

Our dataset is now fully built, containing 53,674 observations across 14 variables. We can now move onto exploratory data analysis to better understand the distributions and relationships in our data before modeling!

Exploratory Data Analysis (EDA)

Before exploring the data visually, we should check for missing values across all columns. Missing data could cause issues during modeling, so it’s important that we identify any and handle it early.

Missing Values

#Check for missing values in each column
colSums(is.na(model_data))

##             year       tournament       match_name              map 
##                0                0                0                0 
##             team          agent_1          agent_2          agent_3 
##                0                2               23               35 
##          agent_4          agent_5              won attacker_win_pct 
##               55              215                0                0 
## defender_win_pct   mean_pick_rate 
##                0                2

Several agent columns contain missing values: agent_1 through agent_5 have varying numbers of missing entries, and mean_pick_rate has 2. Before removing these rows, let’s investigate why this might be the case.

#Finding specific rows where agent slots are missing to compare manually
model_data |>
  filter(is.na(agent_1) | is.na(agent_2) | is.na(agent_3) |
         is.na(agent_4) | is.na(agent_5)) |>
  select(year, tournament, match_name, map, team, agent_1, agent_2,
         agent_3, agent_4, agent_5) |>
  head(5)

## # A tibble: 5 × 10
##    year tournament        match_name map   team  agent_1 agent_2 agent_3 agent_4
##   <int> <chr>             <chr>      <chr> <chr> <chr>   <chr>   <chr>   <chr>  
## 1  2021 Champions Tour B… Vorax vs … Split Mark… brimst… sage    breach  <NA>   
## 2  2021 Champions Tour B… Liberty v… Bind  INGA… sova    jett    raze    brimst…
## 3  2021 Champions Tour B… mato30epe… Asce… mato… omen    killjoy jett    raze   
## 4  2021 Champions Tour B… mato30epe… Split mato… omen    sage    jett    raze   
## 5  2021 Champions Tour C… GMT Espor… Bind  GMT … breach  sova    cypher  viper  
## # ℹ 1 more variable: agent_5 <chr>

After looking up one of these matches directly on vlr.gg, the cause is clear. The first observation is from 2021 Champions Tour Brazil Stage 1: Challengers 1, specifically Vorax vs. Mark Five on Split. As shown in the screenshot below, Mark Five only had 3 players recorded (dek, vianna1, and e4s) despite Valorant being a 5v5 game. This appears to be an incomplete data entry issue on vlr.gg’s end rather than an error in our tidying.

[Screenshot: vlr.gg match]

Since we cannot reliably reconstruct a full 5-agent composition from incomplete data, these rows are removed.

#Remove the rows with missing values
model_data <- model_data |>
  drop_na()

colSums(is.na(model_data))

##             year       tournament       match_name              map 
##                0                0                0                0 
##             team          agent_1          agent_2          agent_3 
##                0                0                0                0 
##          agent_4          agent_5              won attacker_win_pct 
##                0                0                0                0 
## defender_win_pct   mean_pick_rate 
##                0                0

dim(model_data)

## [1] 53459    14

After removing incomplete rows, all 14 columns now have zero missing values. The dataset still has 53,459 observations, having lost only 215 rows (about 0.4% of the data), so the removal will have minimal impact on our analysis.
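
The share of data lost can be verified with a quick calculation from the two dim() outputs shown above:

```r
# Row counts taken from the dim() outputs before and after drop_na()
rows_before <- 53674
rows_after  <- 53459

rows_dropped <- rows_before - rows_after
pct_dropped  <- 100 * rows_dropped / rows_before

cat(rows_dropped, "rows dropped, about", round(pct_dropped, 2), "% of the data\n")
```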

Duplicates

While investigating the missing values, another issue caught my attention. Looking at the first observation, agent_1 contained “brimstone, jett, raze” rather than a single agent name. This shows that some agent slots contain multiple agent names separated by commas. Let’s verify whether there are other instances of this in our dataset.

#Check if any agent slots contain comma separated values
model_data |>
  filter(str_detect(agent_1, ",") |
         str_detect(agent_2, ",") |
         str_detect(agent_3, ",") |
         str_detect(agent_4, ",") |
         str_detect(agent_5, ",")) |>
  select(tournament, match_name, map, team, agent_1, agent_2,
         agent_3, agent_4, agent_5) |>
  head(5)

## # A tibble: 5 × 9
##   tournament      match_name map   team  agent_1 agent_2 agent_3 agent_4 agent_5
##   <chr>           <chr>      <chr> <chr> <chr>   <chr>   <chr>   <chr>   <chr>  
## 1 Champions Tour… Five Ace … Iceb… Five… sage    viper   killjo… reyna   jett   
## 2 Champions Tour… Five Ace … Iceb… Five… sage    viper   killjo… reyna   jett   
## 3 Champions Tour… Five Ace … Iceb… TRAI… phoeni… omen    killjoy jett    sage, …
## 4 Champions Tour… Five Ace … Iceb… TRAI… phoeni… omen    killjoy jett    sage, …
## 5 Champions Tour… CBT Gamin… Bind  CBT … cypher  sova    brimst… phoenix jett, …

Other observations do contain corrupted slots where multiple agent names are concatenated with a comma (e.g., “killjoy, yoru” in the second observation). After manually verifying on vlr.gg again, this occurs because some rows in overview_all correspond to “All Maps” series summaries rather than the individual maps (in this case Icebox). This means a player who played Killjoy on one map and Yoru on another appears as “killjoy, yoru” in the aggregated row. We cannot use these rows because they don’t represent the actual lineup used on the given map (a player can’t use more than one agent on a single map).

Provided below is a screenshot of the vlr.gg “All Maps” tab showing this issue.

[Screenshot: vlr.gg]

An easy fix is to filter out any row in which an agent slot contains a comma.

#Filtering out duplicates in agent columns
model_data <- model_data |>
  filter(!str_detect(agent_1, ","),
         !str_detect(agent_2, ","),
         !str_detect(agent_3, ","),
         !str_detect(agent_4, ","),
         !str_detect(agent_5, ",")) 
dim(model_data)

## [1] 53445    14

After removing the corrupted rows, the dataset has 53,445 observations. The agent columns are now clean, with each slot containing exactly one agent.
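
As an aside, the same comma filter can be written more compactly with dplyr’s if_any(), which tests a condition across several columns at once. This is only an alternative sketch on invented data, not the code used above.

```r
library(dplyr)
library(stringr)

# Hypothetical rows: the second one has a concatenated agent entry
toy <- tibble(team    = c("Alpha", "Bravo"),
              agent_1 = c("jett", "sage, yoru"),
              agent_2 = c("omen", "viper"))

# Keep rows where no agent column contains a comma
clean <- toy |>
  filter(!if_any(starts_with("agent_"), ~ str_detect(.x, ",")))

print(clean)
```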

Sorting

One aspect of Valorant to consider is that mirror compositions are allowed. A mirror composition is when both teams run the exact same 5 agents on the same map. More generally, the same composition can appear across many different matches. However, since agent slots were assigned alphabetically by player name rather than by agent, the same 5-agent lineup could be encoded differently across rows. For example, two teams running jett, omen, sova, sage, and raze could end up with agent_1 = jett for Team A and agent_1 = omen for Team B even though they have the same composition. This is a problem because our models treat each agent slot as a separate feature, so they would see these as two completely different compositions even though they are identical (only the slot order differs). To keep the encoding consistent, we sort the agents alphabetically within each row.

# Sort agents alphabetically within each row for consistent composition encoding
model_data <- model_data |>
  rowwise() |>
  mutate(
    agents_sorted = list(sort(c(agent_1, agent_2, agent_3, agent_4, agent_5))),
    agent_1 = agents_sorted[[1]],
    agent_2 = agents_sorted[[2]],
    agent_3 = agents_sorted[[3]],
    agent_4 = agents_sorted[[4]],
    agent_5 = agents_sorted[[5]]
  ) |>
  select(-agents_sorted) |>
  ungroup()

dim(model_data)

## [1] 53445    14

Our dataset is now fully cleaned and standardized. With 53,445 observations and now consistent agent encoding, we are ready to move into the Visual EDA and Modeling.
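
To illustrate why sorting matters, the sketch below takes the same five agents in two different slot orders and shows that they produce the same key once sorted. The composition key is a hypothetical helper for illustration; the analysis itself keeps five separate agent columns.

```r
# Same composition, different slot order
team_a <- c("jett", "omen", "sova", "sage", "raze")
team_b <- c("omen", "raze", "jett", "sova", "sage")

# Sorting alphabetically gives a canonical order
key_a <- paste(sort(team_a), collapse = "-")
key_b <- paste(sort(team_b), collapse = "-")

print(key_a)
print(identical(key_a, key_b))
```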

Visual EDA

Outcome

Now that our data is clean, we can begin exploring it visually. We start with a standard check: the balance of our outcome variable won. Since every map played has one winning and one losing team, we should expect this to be roughly balanced.

#Bar plot of outcome distribution
model_data |>
  count(won) |>
  ggplot(aes(x = won, y = n, fill = won)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Win/Loss Distribution",
       x = "Won",
       y = "Count") +
  theme_minimal()

The outcome variable is nearly balanced with 26,817 wins and 26,628 losses. Because every map produces exactly one win row and one loss row, the series format (for example, best of 3) cannot create an imbalance by itself. The gap of 189 rows instead has to come from the tidying steps (the inner join, the removal of incomplete lineups, and the comma filter), which dropped slightly more losing rows than winning ones. This minor imbalance is unlikely to affect our model performance.

#This is the difference
model_data |>
  count(won)

## # A tibble: 2 × 2
##   won       n
##   <fct> <int>
## 1 no    26628
## 2 yes   26817

Map Play Frequency

Next, we can look at how frequently each map appears in our dataset. It is worth noting that Valorant has map rotations, meaning only 7 maps are active in the competitive rotation at any given time. Riot Games periodically adds new maps and removes others between acts and episodes (for example, for map reworks or to fix glitches).

# Bar plot of map play frequency
model_data |>
  #Sorts the data frame by n in descending order
  count(map, sort = TRUE) |>
  #Orders the bars by n on the plot
  ggplot(aes(x = reorder(map, n), y = n)) +
  geom_col(fill = "steelblue") +
  #Flips the plots' x and y axes, allows for easier reading of map names
  coord_flip() +
  labs(title = "Map Play Frequency (2021-2026)",
       x = "Map",
       y = "Times Played") +
  theme_minimal()

As expected, Ascent and Haven appear most frequently, having been in the competitive pool since Valorant’s launch. Newer maps such as Abyss, Sunset, and Corrode have far fewer appearances, which makes sense given their later release. This distribution is worth keeping in mind during modeling because our model will have much more training data for older maps than for newer ones.

Attacker / Defender Win Rate

Now that we know which maps appear more frequently, we can look at whether these maps are balanced. In Valorant, a map is imbalanced when either the attacking or the defending side has a structural advantage. Riot Games regularly patches and updates these maps, so let’s see how attacker win rates have shifted across all maps over the 6 years.

maps_features |>
  ggplot(aes(x = year, y = attacker_win_pct, color = map, group = map)) +
  geom_line() +
  geom_point() +
  geom_hline(yintercept = 50, linetype = "dashed", color = "black") +
  scale_y_continuous(labels = scales::label_number(suffix = "%")) +
  scale_color_brewer(palette = "Paired") +
  labs(title = "Attacker Win Rate by Map Over Time (2021-2026)",
       x = "Year",
       y = "Attacker Win Rate (%)",
       color = "Map") +
  theme_minimal()

The line plot shows how attacker win rates have fluctuated across maps from 2021 to 2026. Most maps hover around the 50% reference line, suggesting that Riot Games has generally succeeded in maintaining competitive balance through regular patches. However, there are some notable outliers. Fracture stands out with a very high attacker win rate upon its release in September 2021. Breeze trended downward over time, becoming increasingly defender-sided. Lastly, newer maps like Abyss (released in 2024) show a sharp spike, with early data suggesting a heavy attacker-side advantage.

Something to note is that we only had to plot the attacker side win rate here because attacker and defender win rates always sum to 100%, meaning the defender win rate is simply the complement.

This feature is particularly important for our model because map side bias can heavily influence outcomes independent of agent composition. For example, imagine going 9-3 in the first half on Fracture in 2021, which we mentioned earlier had a very high attacker sided win rate. Would it be wrong to question whether the performance was driven by the lineup of agents or simply the map favoring the attacking side? The opposing team would have virtually no room for error on their attacking half (when they switch sides), needing to win 9 of the possible 12 rounds just to force overtime. In scenarios like these, map side advantage is just as crucial as the agents picked.
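
The arithmetic behind the 9-3 scenario can be spelled out in a few lines:

```r
# Team trailing 3-9 at halftime; overtime requires a 12-12 tie
rounds_to_tie      <- 12
trailing_score     <- 3
second_half_rounds <- 12

# Rounds the trailing team must win in the second half to force overtime
needed <- rounds_to_tie - trailing_score
share  <- needed / second_half_rounds

cat("Needs", needed, "of", second_half_rounds, "rounds, i.e.", 100 * share, "%\n")
```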

Win Rates

Previously, the line plot helped reveal how map balance has shifted over the years due to Riot’s patches. Now that we have a sense of the map dynamics, we can shift our focus to the agents themselves. It would be interesting to see which agents win most often. Additionally, since there are 29 playable agents in Valorant, we will focus on the agents that have at least 50 appearances so our win rate estimates are more reliable.

#Pivot longer so each agent gets its own row
model_data |>
  pivot_longer(cols = c(agent_1, agent_2, agent_3, agent_4, agent_5),
               names_to = "slot", values_to = "agent") |>
  #Calculating win rate and total picks per agent
  group_by(agent) |>
  summarise(
    win_rate = mean(won == "yes"),
    picks = n(),
    .groups = "drop"
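  ) |>
  #Keep only agents with at least 50 appearances so win rate estimates are reliable
  filter(picks >= 50) |>
  #Order bars by win rate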
  ggplot(aes(x = reorder(agent, win_rate), y = win_rate, fill = win_rate)) +
  geom_col() +
  #Add a reference line at 50% to show the average baseline
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
  #Flipping the axes so agent names are readable on the y-axis
  coord_flip() +
  #Blue = lower win rate, red = higher win rate
  scale_fill_gradient(low = "lightblue", high = "red") +
  #Having the y-axis display percentages
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Win Rate of Agents",
       x = "Agent", y = "Win Rate",
       fill = "Win Rate") +
  theme_minimal()

The bar plot reveals a lot of information about the win rates across the top picked agents. Though the margins are relatively small, most agents fall within the 46% - 52% range, suggesting that there is no single (overpowered) agent that determines the outcome entirely. Additionally, the roles of all agents are scattered in the top, indicating that there is not a single role such as Duelist or Controller that dominates the ranking. This reinforces the idea that this game’s win rate is primarily due to team composition and how agents compliment one another.

At the top, Clove stands out with the highest win rate despite being a relatively new agent introduced in 2024. This is partly explained by recency bias, since teams experimenting with a new agent in professional play can inflate its early win rate. However, looking at it logically, it’s to be expected because Clove’s kit genuinely supports her strong performance. As a Controller, she brings powerful smokes for map control while also functioning more aggressively than a traditional Controller. Her self-heal allows her to win more gunfights and stay alive longer, and more importantly her revival ability gives her team a second chance, allowing her to take on a Duelist-like role when entering bombsites.

On the other end, Deadlock sits at the bottom with the lowest win rate, which is not surprising. As a Sentinel (a more defense-sided role), her kit has a lot of weaknesses. Her barrier wall (meant to slow down the opposing team) can simply be shot down, and her sensors (another slowing ability) can be spotted and silently walked through. While her net grenade provides some stalling utility, these abilities fall short of other strong Sentinel options like Killjoy or Cypher, who offer more reliable and harder-to-counter defensive setups.

Mean Agent Pick Rate by Match Outcome

Now, let’s examine whether running a more meta composition (one made up of agents with higher global pick rates) is associated with winning. The mean_pick_rate feature captures the average pick rate across a team’s five agents. A higher mean value indicates that a team is running widely favored agents, while a lower value suggests a more unconventional lineup.
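As a rough illustration of what a feature like mean_pick_rate measures, the base R sketch below builds it on a tiny made-up data set. It assumes that an agent’s pick rate is the percentage of team rows in which that agent appears; the exact definition used when the feature was created may differ, and the toy data frame and its values are hypothetical.

```r
# Toy data: one row per team per map, five agent slots (hypothetical values)
toy <- data.frame(
  agent_1 = c("Jett", "Jett", "Raze"),
  agent_2 = c("Sova", "Sova", "Fade"),
  agent_3 = c("Omen", "Viper", "Omen"),
  agent_4 = c("Killjoy", "Cypher", "Killjoy"),
  agent_5 = c("Breach", "Breach", "Deadlock"),
  stringsAsFactors = FALSE
)

agent_cols <- paste0("agent_", 1:5)

# Assumed definition: pick rate (%) = share of team rows containing the agent
all_picks <- unlist(toy[agent_cols], use.names = FALSE)
pick_rate <- 100 * table(all_picks) / nrow(toy)

# Mean pick rate across each team's five agents
toy$mean_pick_rate <- vapply(seq_len(nrow(toy)), function(i) {
  agents <- unlist(toy[i, agent_cols])
  mean(as.numeric(pick_rate[agents]))
}, numeric(1))

toy
```

In this toy example the first team runs only the most common agents and gets the highest value, while the third team’s less common picks pull its value down.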

#Box plot of mean pick rate by outcome
model_data |>
  ggplot(aes(x = won, y = mean_pick_rate, fill = won)) +
  geom_boxplot(show.legend = FALSE) +
  scale_fill_manual(values = c("yes" = "orange", "no" = "black")) +
  labs(title = "Mean Agent Pick Rate by Match Outcome",
       x = "Won",
       y = "Mean Pick Rate (%)") +
  theme_bw()

From these box plots, winning and losing teams look nearly identical in terms of the median, spread, and distribution of mean_pick_rate. This tells us that simply running more meta agents does not strongly predict whether a team wins or loses. In pro play, both teams tend to run similarly high pick rate agents since pro players naturally gravitate towards the current meta. This is actually an interesting finding because it suggests that winning in professional Valorant is likely less about picking the most popular agents and more about how those agents are combined as a team.

As mentioned before, it is worth noting that Valorant allows for mirror teams, meaning both teams can run the exact same 5 agents on the same map. In those matches the outcome cannot be decided by which agents are picked and instead comes down to how well the players execute their roles. This further explains why mean_pick_rate shows little separation between winners and losers.
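The mirror-team point can be made concrete with a small base R sketch. The data and scores below are made up: each match contributes two rows with identical predictors and opposite outcomes, and any model has to give both rows the same score.

```r
# Toy mirror matches: two rows per match with the same composition
# (hypothetical labels) but opposite outcomes
mirror <- data.frame(
  match_id    = c(1, 1, 2, 2, 3, 3),
  composition = c("A", "A", "B", "B", "C", "C"),
  won         = c("yes", "no", "no", "yes", "yes", "no")
)

# Identical predictors always receive identical scores; these numbers are
# arbitrary stand-ins for predicted win probabilities
score_by_comp <- c(A = 0.70, B = 0.40, C = 0.55)
mirror$pred_yes <- unname(score_by_comp[mirror$composition])

# Classify as a win when the score exceeds 0.5
mirror$pred_class <- ifelse(mirror$pred_yes > 0.5, "yes", "no")

# Exactly one row of every pair is classified correctly
accuracy <- mean(mirror$pred_class == mirror$won)
accuracy
```

Whatever scores are plugged into score_by_comp, accuracy on these rows stays at 0.5, because each pair shares one prediction but has two different outcomes.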

Preparation for Modeling

Now that we have gotten familiar with the relationships between our variables, we can begin the modeling process. Before fitting any models, we should prepare our data by converting categorical variables to factors, splitting into training and testing sets, and setting up our cross-validation for tuning.

Factor Conversion

Let’s first start by converting our categorical variables to factors.

#Converting categorical variables to factors for modeling
model_data <- model_data |>
  mutate(
    won = as.factor(won),
    map = as.factor(map),
    agent_1 = as.factor(agent_1),
    agent_2 = as.factor(agent_2),
    agent_3 = as.factor(agent_3),
    agent_4 = as.factor(agent_4),
    agent_5 = as.factor(agent_5),
  )

All the categorical variables are now stored as factors. The str() function helps confirm this.

Data Split

Next, let’s split our data into a training and a testing set. The training set is used to teach the model by allowing it to identify patterns and relationships within the data. On the other hand, the testing set is only used after the model has finished learning. This allows us to evaluate how well the model performs on new, unseen data. Keeping these datasets separate also helps us detect overfitting, a problem where the model becomes too tailored to the training data (essentially memorizing it) and struggles to make accurate predictions on unfamiliar data.

For the purpose of our project, we will stratify on won to make sure that both the training and testing sets maintain the same win/loss ratio as the full dataset, which prevents either set from being accidentally skewed toward more wins or losses.

#Setting a random seed for reproducibility
set.seed(2025)

#Splitting data into 80% training and 20% testing, stratified on won
valorant_split <- initial_split(model_data, prop = 0.80, strata = won)
valorant_train <- training(valorant_split)
valorant_test <- testing(valorant_split)

#Check proportions of training vs. testing
dim(valorant_train)

## [1] 42755    14

dim(valorant_test)

## [1] 10690    14

nrow(valorant_train) / nrow(model_data)

## [1] 0.7999813

nrow(valorant_test) / nrow(model_data)

## [1] 0.2000187

With 53,445 total observations, this results in 42,755 training observations (~80%) and 10,690 testing observations (~20%), which is confirmed by the proportions of 0.7999813 and 0.2000187 respectively.
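To show what stratifying on the outcome does, here is a minimal base R sketch on made-up data. It samples 80% of the rows within each outcome class separately; this mirrors the idea behind strata = won, though the internals of initial_split() may differ in detail. The toy data frame is hypothetical.

```r
set.seed(2025)

# Toy outcome with a 60/40 class balance (hypothetical data)
toy <- data.frame(
  id  = 1:100,
  won = rep(c("yes", "no"), times = c(60, 40))
)

# Sample 80% of the row indices within each outcome class
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$won), function(idx) {
  sample(idx, size = round(0.80 * length(idx)))
}))

toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Both sets keep the original 60/40 balance
prop.table(table(toy_train$won))
prop.table(table(toy_test$won))
```

Because the sampling happens inside each class, neither set can end up with a noticeably different win/loss ratio by chance.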

Cross-validation

With our data split, we now set up k-fold cross-validation on the training set. Rather than fitting the models once on all of the training data, k-fold cross-validation divides the training data into k roughly equal folds. The model is then trained on k-1 folds and evaluated on the remaining fold, which is repeated k times so every fold gets a turn as the evaluation set.

Averaging the model’s performance across the k folds gives us a more reliable estimate of how well our model generalizes than a single train/evaluate split would. For our case, we will use 5-fold cross-validation (rather than 10) as our dataset is large enough that 5 folds provide a reliable performance estimate while keeping computation time more manageable.

Similarly, we also stratify the folds based on the won variable. This helps ensure that the win/loss balance is preserved across all folds, preventing any of them from being accidentally skewed toward more wins or losses.

#Set up 5-fold cross-validation (stratified on won)
valorant_folds <- vfold_cv(valorant_train, v = 5, strata = won)
valorant_folds

## #  5-fold cross-validation using stratification 
## # A tibble: 5 × 2
##   splits               id   
##   <list>               <chr>
## 1 <split [34203/8552]> Fold1
## 2 <split [34203/8552]> Fold2
## 3 <split [34204/8551]> Fold3
## 4 <split [34205/8550]> Fold4
## 5 <split [34205/8550]> Fold5
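The fold structure above can be sketched in a few lines of base R. This is only an illustration on a made-up row count, with stratification left out for brevity; fold_id, analysis_rows, and assessment_rows are names chosen for this sketch.

```r
set.seed(2025)

n <- 20   # toy number of training rows (hypothetical)
k <- 5

# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  assessment_rows <- which(fold_id == fold)   # held-out fold
  analysis_rows   <- which(fold_id != fold)   # remaining k - 1 folds
  cat(sprintf("Fold %d: fit on %d rows, evaluate on %d rows\n",
              fold, length(analysis_rows), length(assessment_rows)))
}

# Every row is held out exactly once across the k iterations
table(fold_id)
```

The same pattern explains the split sizes printed above: each of the 5 splits fits on roughly four fifths of the training rows and evaluates on the remaining fifth.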

Creating Our Recipe

With our data split, we can define our recipe. The recipe is a blueprint that specifies how the data should be preprocessed before being put into any model.

In our case, we will be using 9 predictors: map, agent_1, agent_2, agent_3, agent_4, agent_5, attacker_win_pct, defender_win_pct, and mean_pick_rate. The identifier columns such as year, tournament, match_name, and team are removed since they are not useful features for predicting match outcomes. They simply tell us who played and when, not anything about the composition itself (which is our focus).

Since map and all five agent columns are categorical, we will turn them into dummy variables. Finally, we will normalize all numeric predictors by centering and scaling them: attacker_win_pct, defender_win_pct, and mean_pick_rate, along with the dummy columns, which are numeric by the time the normalization step runs.

#Define the recipe
valorant_recipe <- recipe(won ~., data = valorant_train) |>
  #Removing identifier columns (not in the scope of our research question)
  step_rm(any_of(c("year", "tournament", "match_name", "team"))) |>
  #Turning Categorical Variables into Dummy Variables
  step_dummy(all_nominal_predictors()) |>
  #Note that step_nzv() was added after initial modeling produced errors due to zero variance columns
  #Remove zero variance and near-zero variance (very sparse) columns
  step_nzv(all_predictors()) |>
  #Normalizing all numeric predictors
  step_normalize(all_numeric_predictors())
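For intuition about the two main preprocessing steps, the base R sketch below applies the same ideas by hand to a tiny made-up data frame. It uses model.matrix() and scale() as stand-ins, so column names and details will not match what step_dummy() and step_normalize() produce exactly; the values are hypothetical.

```r
# Toy predictors (hypothetical values)
toy <- data.frame(
  map = factor(c("Ascent", "Bind", "Haven", "Ascent")),
  mean_pick_rate = c(40, 50, 60, 50)
)

# Dummy-encode the factor: one 0/1 column per level except the
# reference level ("Ascent"), similar in spirit to step_dummy()
dummies <- model.matrix(~ map, data = toy)[, -1, drop = FALSE]

# Center and scale the numeric column: (x - mean) / sd,
# the same transformation step_normalize() applies
scaled <- as.numeric(scale(toy$mean_pick_rate))

cbind(dummies, mean_pick_rate = scaled)
```

After scaling, the numeric column has mean 0 and standard deviation 1, which keeps distance-based models such as KNN and penalized models such as Elastic Net from being dominated by whichever predictor happens to have the largest raw scale.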

Modeling

Now, the fun starts! We can finally start testing various types of models on our dataset. As previously stated, the goal of this project is to predict whether a team wins or loses a map based on their agent composition. Since won is a binary outcome (yes/no), this is a classification problem. We will be fitting five different models, each taking its own approach to the problem: Logistic Regression, Elastic Net, K-Nearest Neighbors, Random Forest, and Pruned Decision Tree.

Each model will follow the same general process:

1. Define the model, set its engine and mode (classification)
2. Bundle it into a workflow with our recipe
3. Set up a tuning grid with the hyperparameters and levels we want to explore (skip for Logistic Regression)
4. Tune the model across our 5 cross-validation folds (skip for Logistic Regression)
5. Select the best hyperparameter combination by ROC AUC (skip for Logistic Regression)
6. Finalize the workflow with those best parameters (skip for Logistic Regression)
7. Fit the finalized workflow to the full training set

Logistic Regression

#Define logistic regression model using glm engine
log_reg_model <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

#Create workflow
log_reg_wf <- workflow() |>
  add_recipe(valorant_recipe) |>
  add_model(log_reg_model)

#Evaluate using cross-validation folds (no tuning needed for log. reg.)
log_reg_fit <- fit_resamples(
  log_reg_wf,
  resamples = valorant_folds,
  metrics = metric_set(roc_auc)
)

Elastic Net Model

#Define elastic net model with tunable penalty and mixture
elastic_net_model <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet") |>
  set_mode("classification")

#Create workflow
elastic_net_wf <- workflow() |>
  add_recipe(valorant_recipe) |>
  add_model(elastic_net_model)

#Define tuning grid
#Penalty ranges from 1e-4 to 1 (the range is given on the log10 scale), testing weak to strong regularization
#Mixture ranges from 0 to 1: 0 is pure Ridge (shrinking all coefficients),
#1 is pure Lasso (can zero out meaningless coefficients), values in between blend both
#levels = 5 gives a 5x5 grid of 25 combinations
elastic_net_grid <- grid_regular(
  penalty(range = c(-4, 0)),
  mixture(range = c(0, 1)),
  levels = 5
)

#Tune elastic net using cross-validation
elastic_net_res <- tune_grid(
  elastic_net_wf,
  resamples = valorant_folds,
  grid = elastic_net_grid,
  metrics = metric_set(roc_auc)
)
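Because penalty() takes its range on the log10 scale, it can help to see the actual values a 5-level regular grid covers. The base R sketch below reconstructs them by hand; it does not call dials, and penalty_values, mixture_values, and grid are names made up for this illustration.

```r
# penalty(range = c(-4, 0)) is on the log10 scale, so 5 regular levels
# correspond to these penalty values
penalty_values <- 10^seq(-4, 0, length.out = 5)

# mixture(range = c(0, 1)) is on the natural scale
mixture_values <- seq(0, 1, length.out = 5)

# Crossing the two gives the 5 x 5 = 25 candidate combinations
grid <- expand.grid(penalty = penalty_values, mixture = mixture_values)

penalty_values
mixture_values
nrow(grid)
```

Each of the 25 combinations is fit on all 5 folds, so tuning this model involves 125 model fits in total.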

K-Nearest-Neighbors

#Define the model
knn_model <- nearest_neighbor(neighbors = tune()) |>
  set_engine("kknn") |>
  set_mode("classification")

#Workflow
knn_workflow <- workflow() |>
  add_model(knn_model) |>
  add_recipe(valorant_recipe)

#Tuning grid
knn_grid <- grid_regular(
  neighbors(range = c(1, 15)),
  levels = 8
)

#Tune across folds
knn_tune <- tune_grid(
  knn_workflow,
  resamples = valorant_folds,
  grid = knn_grid,
  metrics = metric_set(roc_auc)
)

Random Forest

#Define Random Forest model
rf_model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

#Create Workflow
rf_workflow <- workflow() |>
  add_model(rf_model) |>
  add_recipe(valorant_recipe)

#Tuning grid
rf_grid <- grid_regular(
  mtry(range = c(1, 10)),
  trees(range = c(100, 500)),
  min_n(range = c(2, 20)),
  levels = 3
)

#Ensures reproducible results
set.seed(2025)
#Tune across folds
rf_tune <- tune_grid(
  rf_workflow,
  resamples = valorant_folds,
  grid = rf_grid,
  metrics = metric_set(roc_auc)
)

Pruned Decision Tree

#Define the Decision Tree model
tree_model <- decision_tree(cost_complexity = tune(), tree_depth = tune(),
                            min_n = tune()) |>
  set_engine("rpart") |>
  set_mode("classification")

#Set up workflow
tree_workflow <- workflow() |>
  add_model(tree_model) |>
  add_recipe(valorant_recipe)

#Tuning grid
tree_grid <- grid_regular(
  cost_complexity(range = c(-4,-1)),
  tree_depth(range = c(1,10)),
  min_n(range = c(2,20)),
  levels = 3
)

#Tune across folds
tree_tune <- tune_grid(
  tree_workflow,
  resamples = valorant_folds,
  grid = tree_grid,
  metrics = metric_set(roc_auc)
)

Model Results

Model ROC AUC Summary

With all five models trained and evaluated through cross-validation, we can now compare their performance side by side using ROC AUC as our metric.

model_results <- tibble(
  Model = c("Logistic Regression", "Elastic Net", "KNN", "Random Forest", "Decision Tree"),
  ROC_AUC = c(
    #Logistic regression uses fit_resamples so we pull from collect_metrics
    collect_metrics(log_reg_fit) |> filter(.metric == "roc_auc") |> pull(mean),
    #For the tuned models, show_best(n=1) returns the best hyperparameter combo
    show_best(elastic_net_res, metric = "roc_auc", n = 1) |> pull(mean),
    show_best(knn_tune, metric = "roc_auc", n = 1) |> pull(mean),
    show_best(rf_tune, metric = "roc_auc", n = 1) |> pull(mean),
    show_best(tree_tune, metric = "roc_auc", n = 1) |> pull(mean)
  )
)
model_results

## # A tibble: 5 × 2
##   Model               ROC_AUC
##   <chr>                 <dbl>
## 1 Logistic Regression   0.522
## 2 Elastic Net           0.522
## 3 KNN                   0.493
## 4 Random Forest         0.525
## 5 Decision Tree         0.518

#Let's make a visual showing the difference in ROC AUC values between the models
model_results |>
  ggplot(aes(x = reorder(Model, ROC_AUC), y = ROC_AUC, fill = Model)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
  geom_text(aes(label = round(ROC_AUC, 4)), hjust = -0.1, size = 3.5) +
  coord_flip() +
  ylim(0, 0.6) +
  labs(title = "Cross-Validated ROC AUC by Model",
       x = "Model",
       y = "ROC AUC") +
  theme_minimal()

The bar chart above summarizes the cross-validated ROC AUC scores for all five models. The dashed line at 0.5 represents random chance, meaning a model at that threshold is essentially just guessing. Random Forest performed the best with a ROC AUC of 0.5251, followed closely by Logistic Regression (0.5224) and Elastic Net (0.5222). The Decision Tree came in fourth at 0.5175, while KNN was the only model to fall below random chance at 0.4931. Overall, the margins between models are very small, and no model managed to meaningfully separate winners from losers, suggesting that agent compositions alone may not carry a strong enough signal to predict match outcomes.
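One way to read these numbers: ROC AUC equals the probability that a randomly chosen winning team receives a higher predicted win probability than a randomly chosen losing team. The base R sketch below computes it that way on made-up scores; the vectors are hypothetical and unrelated to the fitted models.

```r
# Toy predicted win probabilities and true outcomes (hypothetical values)
pred_yes <- c(0.62, 0.48, 0.55, 0.51, 0.45, 0.58)
won      <- c("yes", "no", "no", "yes", "no", "yes")

winners <- pred_yes[won == "yes"]
losers  <- pred_yes[won == "no"]

# Compare every winner with every loser: 1 if the winner scored higher,
# 0.5 for a tie, 0 otherwise
pairs <- outer(winners, losers, function(w, l) (w > l) + 0.5 * (w == l))

# ROC AUC is the average over all winner/loser pairs
auc <- mean(pairs)
auc
```

In this toy example the winner outscores the loser in 8 of the 9 pairs. An AUC of about 0.52, as seen for our models, means the winner is ranked above the loser only slightly more often than a coin flip would manage.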

Model Autoplots

To better understand how each tuned model behaved during cross-validation, the plots below show how ROC AUC changed across the hyperparameter combinations explored for each model.

Elastic Net

autoplot(elastic_net_res)

For the Elastic Net tuning plot, we used a penalty range from 1e-04 to 1 and a mixture range from 0 to 1. At the lowest regularization, all five lines start clustered together around 0.520 to 0.522, meaning the blend of Ridge and Lasso did not make much of a difference early on. As regularization increases, however, the lines begin to drop sharply towards 0.500, which is essentially a coin flip. Interestingly, pure Ridge (0.00) holds on the longest before falling off, while the higher Lasso proportions (0.75 and 1.00) fall off quicker, which makes sense since Lasso is more aggressive about eliminating predictors entirely. To sum it all up, no matter how we tuned this model, it could not find a strong enough pattern in the agent compositions to predict wins reliably.

K-Nearest-Neighbors

autoplot(knn_tune)

For KNN, we tuned the number of neighbors ranging from 1 to 15 across 8 levels. The plot shows a clear and consistent downward trend. As the number of neighbors increases, ROC AUC steadily decreases from about 0.493 at k = 1 all the way down to around 0.478 at k = 15. Normally, we’d want to see the opposite, where a larger k smooths out noise and improves the performance, but it’s not the case here. Even at its best (k = 1), the model is already performing below random chance at 0.493, which tells us that even the closest matches in our dataset don’t share reliable patterns. This makes sense given our data: two teams can run the exact same five agents on the same map and still get completely different outcomes, so finding meaningful neighbors based on composition alone is hard.

Random Forest

autoplot(rf_tune)

For Random Forest, we tuned three hyperparameters: mtry (ranging from 1 to 10), trees (ranging from 100 to 500), and min_n (ranging from 2 to 20) across 3 levels each. Looking at the plot, the number of trees doesn’t seem to have much of an effect on performance, with the lines for 100, 300, and 500 trees being pretty close to each other. An interesting thing to note about min_n is that the rightmost panel (min_n = 20) consistently outperforms the others, suggesting that larger node sizes help prevent the trees from over-splitting on noise. As for mtry, it looks like the middle of the range, around 5, tends to do the best within each panel. Overall, even the best combinations only get us to about 0.525, so while Random Forest is our top performer, it’s still not far off from randomly guessing.

Pruned Decision Tree

autoplot(tree_tune)

For the Decision Tree, we tuned three hyperparameters: cost_complexity (ranging from 1e-04 to 0.1), tree_depth (ranging from 1 to 10), and min_n (ranging from 2 to 20). The first thing that stood out to me was that tree depth matters a lot, with the deeper trees (depth = 10, blue) consistently outperforming the shallower ones across all three panels, while depth = 1 (red) sits at the bottom the entire time. This makes sense because a tree with only one split can barely capture any patterns in the data. As cost complexity increases, all three depths drop sharply toward 0.500, which is expected since a higher penalty aggressively prunes the tree down until it’s essentially predicting the same class for everything. Unlike the Random Forest, where min_n made a slight difference in performance across panels, the three panels here look nearly identical, suggesting that minimum node size had little to no impact on performance. The best we got out of this was around 0.518, which again is right in the range we’ve been seeing across all our models.

Final Model Evaluation

With all five models evaluated, Random Forest came out on top with the highest cross-validated ROC AUC of 0.5251, so we will move forward with fitting it on the full training set and evaluating it on the testing set.

Fitting to Training Data + Testing the Model

#Select the best hyperparameters from the tuning results to use for the final fit
best_rf <- select_best(rf_tune, metric = "roc_auc")
#Finalize the workflow with the best hyperparameters
rf_final_workflow <- finalize_workflow(rf_workflow, best_rf)

#Fit to the full training set
set.seed(2025)
rf_fit <- fit(rf_final_workflow, data = valorant_train)

#Predict on test set
rf_preds <- augment(rf_fit, new_data = valorant_test)

#Compare the cross-validated training ROC AUC with the testing ROC AUC
tibble(Set = c("Training", "Testing"),
       ROC_AUC = c(0.5251, 0.4822)) |>
  ggplot(aes(x = Set, y = ROC_AUC, fill = Set)) +
  geom_col(show.legend = FALSE) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
  geom_text(aes(label = round(ROC_AUC, 4)), vjust = -0.5, size = 4) +
  ylim(0, 0.6) +
  labs(title = "Random Forest: Training vs. Testing ROC AUC",
       x = "",
       y = "ROC AUC") +
  theme_bw()

This bar chart compares the Random Forest’s cross-validated training ROC AUC of 0.5251 against the testing ROC AUC of 0.4822. While the training performance was already modest, sitting just barely above random chance, the testing performance dropped below the 0.5 threshold entirely. This gap suggests that the model is slightly overfit to the training data, picking up on patterns that did not generalize to new, unseen compositions. Overall, both values are so close to 0.5 that it is clear that agent composition alone is not a reliable predictor of match outcomes in Valorant, and no amount of tuning was able to overcome that fundamental limitation in the data.

Random Forest ROC Curve (Testing Data)

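#ROC Curve
#as.factor() orders the levels of won alphabetically ("no", "yes"), so "yes" is the second level;
#event_level = "second" tells yardstick that .pred_yes is the probability of the event
roc_curve(rf_preds, truth = won, .pred_yes, event_level = "second") |>
  autoplot()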

Looking at the ROC curve on the testing set, the curve barely lifts off the diagonal dashed line at all, which tells the same story as our 0.482 ROC AUC: the model really struggled to tell winning teams apart from losing ones. No matter where we set the threshold, the model is essentially guessing, which further reinforces the idea that agent composition alone is just not enough information to predict match outcomes reliably.
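One detail worth double-checking whenever a ROC AUC lands below 0.5 is which class is treated as the event. By default yardstick treats the first factor level as the event, and a factor built from "yes"/"no" strings has its levels ordered "no", "yes". If the probability column refers to the other class, the reported AUC becomes one minus the intended value. The base R sketch below shows that mirror effect on made-up numbers; auc_for_event() is a helper written for this illustration, not a yardstick function, and whether this applies to the results above depends on how the levels of won were set.

```r
# Rank-based ROC AUC for a chosen event class (helper written for this sketch)
auc_for_event <- function(score, truth, event) {
  is_event <- truth == event
  n_event  <- sum(is_event)
  n_other  <- sum(!is_event)
  r <- rank(score)   # average ranks handle ties
  (sum(r[is_event]) - n_event * (n_event + 1) / 2) / (n_event * n_other)
}

# Toy data (hypothetical values)
pred_yes <- c(0.62, 0.48, 0.55, 0.51, 0.45, 0.58)
won      <- factor(c("yes", "no", "no", "yes", "no", "yes"))
levels(won)   # "no" comes first alphabetically

# Score and event class agree: .pred_yes evaluated against "yes"
auc_matched <- auc_for_event(pred_yes, won, event = "yes")

# Score and event class disagree: .pred_yes evaluated against "no"
auc_mismatched <- auc_for_event(pred_yes, won, event = "no")

c(matched = auc_matched, mismatched = auc_mismatched)
```

The two values always sum to 1, so a mismatched event level turns a slightly-above-chance AUC into a slightly-below-chance one without changing the underlying predictions.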

Variable Importance

Since Random Forest was our top performer, it would be interesting to dig a little deeper and see which specific agents or maps the model was leaning on the most when making its predictions.

rf_fit |>
  extract_fit_parsnip() |>
  vip(num_features = 20) +
  labs(title = "Random Forest Variable Importance") +
  theme_bw()

Looking at the variable importance plot, the three features that stood out the most were mean_pick_rate, defender_win_pct, and attacker_win_pct. The latter two describe map balance, while mean_pick_rate summarizes how popular a lineup is overall, so none of the top features identifies a specific agent. So ironically, the model was leaning more on map balance and overall popularity than on the actual agent picks we were trying to study. Part of this may also reflect that impurity-based importance tends to favor continuous features over 0/1 dummy columns. Among the agents, Sova and Jett showed up the highest, which honestly makes sense since they have been staples in pro play for years. Everything else further down the list looks pretty similar in importance, with no single agent really standing out. The fact that map characteristics ended up being more useful than the agents themselves actually backs up what we’ve been seeing all along: the agents you pick are much less important than you’d think in professional Valorant.

Conclusion

Out of all five models, Random Forest performed the best with a cross-validated ROC AUC of 0.5251, though when evaluated on the testing set it dropped to 0.4822, falling below random chance. Even as our top performer, the model struggled to find any reliable signal in the agent compositions, which was a consistent theme across all five models we tested. Despite tuning each model carefully across a range of hyperparameters, none of them were able to meaningfully outperform a coin flip, which honestly says more about the nature of the data than the models themselves.

On the other end, KNN was our worst model at 0.4931. Since two teams can run the exact same five agents on the same map and still get completely different results, there really are no meaningful neighbors for the model to learn from, and the more neighbors we considered the worse it got.

Something that stood out was the variable importance plot showing map balance features like attacker_win_pct and defender_win_pct as the most influential predictors rather than the agents themselves. For a project centered around agent compositions, it is a pretty telling sign that the model found map characteristics more useful than the actual drafts.

If I were able to approach this project differently, or build on it, I would incorporate individual player statistics like kill/death ratios, first blood rates, and tournament appearances. As someone who watches pro play regularly, I went into the project with the idea that the players were going to matter a little bit more than the agents, and the results, which show how little composition alone explains, ended up supporting that more strongly than I expected. I think a future model that combines both composition and player performance data would be a much stronger predictor of match outcomes.

To conclude, even though the models did not perform as well as I had hoped, this project gave me a deeper appreciation for just how complex professional Valorant really is. With over 2,300 hours of playtime and countless more hours spent watching professional matches, I came into this project genuinely believing that team composition was one of the most important factors in determining who wins. These results clearly challenged that assumption. Something I like to think about coming out of this project is the sheer preparation that these Valorant professionals go through. Winning at the highest level comes down to so much more than just the agents you draft. Rather, it’s about the preparation, teamwork, and individual skill. If anything, this project helped me view the game through a completely different lens, and made me appreciate just how much goes into every professional match beyond what any model can fully capture. On top of that, working through six years of professional match data and building these models from scratch was an incredible opportunity to grow my machine learning skills and deepen my understanding of the entire modeling pipeline from data cleaning all the way to final evaluation, something I plan to incorporate in my future projects.

Predicting Win Rate Based on Team Composition in Valorant

PSTAT 231 Machine Learning Project

Khoi Huynh

UCSB Spring 2026
