The purpose of this project is to develop a model that will be able
to predict the win rate of various unique team compositions in Valorant.
Valorant is a 5v5 tactical first-person shooter developed by Riot Games where one team starts as Attackers and the other as Defenders, switching roles at halftime after 12 rounds. The Attackers’ objective is to plant a bomb called the Spike at a bomb site, while the Defenders aim to eliminate them or defuse the Spike after it is planted. The first team to reach 13 round wins takes the match, so regulation lasts at most 24 rounds, and a 12-12 tie goes to overtime. What makes Valorant unique is its roster of 29 playable characters called Agents, each with distinct abilities belonging to one of four roles: Duelists (fraggers who create space), Initiators (information gatherers), Controllers (smoke and area denial), and Sentinels (defensive anchors).
At the professional level, Valorant is organized through the Valorant Champions Tour (VCT), Riot Games’ official tournament circuit, whose international leagues include Americas, EMEA, and Pacific. Riot Games frequently rotates maps in and out of the active map pool (the collection of maps currently used in competitive play), so play styles keep shifting and new compositions keep emerging. This is further complicated by the meta, which refers to the trends in Agent selection at any given time and changes constantly due to patches and new Agent releases.
As someone who really enjoys watching pro play, I find it difficult to keep up with which team compositions are actually performing well on any given map given how frequently things change. The purpose of this project is to use the VCT match data from 2021 to 2026 to build a model that predicts a team’s win probability based on their Agent composition and map, giving insight into the strongest compositions without having to manually track the changing meta.
library(tidyverse)
library(tidymodels)
library(janitor)
library(kknn)
library(glmnet)
library(ranger)
library(xgboost)
library(ggplot2)
library(readr)
library(dplyr)
library(vip)
The data used in this project is sourced from the Kaggle dataset “Valorant Champion Tour 2021-2026 Data” by Ryan Luong, scraped from vlr.gg, the primary statistics tracker for pro league Valorant. The data set is available here.
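Of the many files available, we use four per year:
overview - Player-level stats for every map, including the agent each player used
maps_scores - The final score of each map for both teams
maps_stats - Attacker and defender side win percentages for each map
agents_pick_rates - The pick rate of each agent on each map
The first two files provide the structure for building our outcome variable and agent composition. The other two are joined in as additional predictors during the tidying stage.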
#Setting the base path to the folder containing all six years of data (allows for easier access)
base_path <- "~/Desktop/Final Project/archive"
#list.files() allows us to search recursively through the sub folders in "archive" for files matching the pattern. This helps a lot because there are 6 years of data with identically named files.
#full.names = TRUE returns the complete file path so read_csv knows where to look
overview_files <- list.files(base_path, pattern = "overview.*\\.csv",
recursive = TRUE, full.names = TRUE)
score_files <- list.files(base_path, pattern = "maps_scores.*\\.csv",
recursive = TRUE, full.names = TRUE)
maps_stats_files <- list.files(base_path, pattern = "maps_stats.*\\.csv",
recursive = TRUE, full.names = TRUE)
pick_rates_files <- list.files(base_path, pattern = "agents_pick_rates.*\\.csv",
recursive = TRUE, full.names = TRUE)
#clean_names() standardizes column names to lowercase with underscores (Map Name -> map_name)
#str_extract is used to pull a 4 digit number from the file path string allowing us to use the year as a value
#Note, in the original data set, all the files had the same name despite the different years, I had to manually change the file to the corresponding year (e.g., overview.csv -> overview_2021.csv)
overview_all <- read_csv(overview_files, show_col_types = FALSE, id = "source_file") |>
mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
clean_names()
scores_all <- read_csv(score_files, show_col_types = FALSE, id = "source_file") |>
mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
clean_names()
maps_stats_all <- read_csv(maps_stats_files, show_col_types = FALSE, id = "source_file") |>
mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
clean_names()
pick_rates_all <- read_csv(pick_rates_files, show_col_types = FALSE, id = "source_file") |>
mutate(year = as.integer(str_extract(source_file, "\\d{4}"))) |>
clean_names()
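As a quick sanity check on the year extraction, here is a minimal sketch on two made-up file paths (the folder and file names are assumptions for illustration, not the actual layout of the archive). Since str_extract returns the first 4-digit match in the whole path, applying it to basename() is a slightly safer variant if a parent folder ever contains another 4-digit number.

```r
library(stringr)

# Hypothetical paths, only for illustration
toy_paths <- c(
  "archive/vct_2021/overview_2021.csv",
  "archive/build_1234/overview_2024.csv"
)

# First 4-digit match anywhere in the path
year_from_path <- as.integer(str_extract(toy_paths, "\\d{4}"))

# First 4-digit match in the file name only
year_from_file <- as.integer(str_extract(basename(toy_paths), "\\d{4}"))

print(year_from_path)
print(year_from_file)
```

In the second toy path the two approaches disagree, which is exactly the case the basename() variant guards against.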
These are the predictors we are going to be working with.
Outcome Variable
won - Whether the team won the map (yes/no)
Categorical Predictors
map - The name of the map the match was played on
agent_1 - First agent in the team’s sorted composition
agent_2 - Second agent in the team’s sorted composition
agent_3 - Third agent in the team’s sorted composition
agent_4 - Fourth agent in the team’s sorted composition
agent_5 - Fifth agent in the team’s sorted composition
Numeric Predictors
attacker_win_pct - Average attacker side win percentage for that map in that year
defender_win_pct - Average defender side win percentage for that map in that year
mean_pick_rate - Average global pick rate across all five agents on the team
Before building our modeling dataset, let’s take a look at what we’re working with. The four raw files each serve a different purpose, so we’ll first look at their dimensions and relevant columns.
dim(overview_all)
## [1] 1128745 23
dim(scores_all)
## [1] 26977 18
dim(maps_stats_all)
## [1] 16826 9
dim(pick_rates_all)
## [1] 409250 8
colnames(overview_all)
## [1] "source_file" "tournament"
## [3] "stage" "match_type"
## [5] "match_name" "map"
## [7] "player" "team"
## [9] "agents" "rating"
## [11] "average_combat_score" "kills"
## [13] "deaths" "assists"
## [15] "kills_deaths_kd" "kill_assist_trade_survive_percent"
## [17] "average_damage_per_round" "headshot_percent"
## [19] "first_kills" "first_deaths"
## [21] "kills_deaths_fkd" "side"
## [23] "year"
colnames(scores_all)
## [1] "source_file" "tournament" "stage"
## [4] "match_type" "match_name" "map"
## [7] "team_a" "team_a_score" "team_a_attacker_score"
## [10] "team_a_defender_score" "team_a_overtime_score" "team_b"
## [13] "team_b_score" "team_b_attacker_score" "team_b_defender_score"
## [16] "team_b_overtime_score" "duration" "year"
colnames(maps_stats_all)
## [1] "source_file" "tournament"
## [3] "stage" "match_type"
## [5] "map" "total_maps_played"
## [7] "attacker_side_win_percentage" "defender_side_win_percentage"
## [9] "year"
colnames(pick_rates_all)
## [1] "source_file" "tournament" "stage" "match_type" "map"
## [6] "agent" "pick_rate" "year"
It’s apparent that the files vary significantly in size.
overview_all contains 1,128,745 rows across 23 columns (21 from the
raw file plus the source_file and year columns we added), with one row
per player per map for each value of side, meaning every map
contributes at least 10 rows (one per player, since it’s 5v5).
scores_all has 26,977 rows with 18 columns, maps_stats_all has 16,826
rows across 9 columns, and pick_rates_all has 409,250 rows across 8
columns, each again including the two added columns. Clearly, we cannot
model directly from these raw files. Looking at the column names, we can
see that the files share common identifiers: tournament, stage,
match_type, and map appear in all four, and match_name appears in both
overview_all and scores_all.
We can use this to our advantage by joining the files on those keys.
Ideally, we combine them into a single, clean data set
with one row per team per map, reducing the over 1 million rows in
overview_all down to something more workable.
From overview_all, the columns we care about are
tournament, match_name, map,
team, agents and side. Everything
else, such as kills, deaths, ratings, and headshot percentage, is an
individual player performance stat that goes beyond our research
question. We are predicting outcomes from composition alone, not player
skill.
Now that we understand the structure of our raw data, we can start
building our modeling dataset. First, we will reshape
overview_all to get one row per team per map with all five
agents as separate columns, then determine the winner of each map from
scores_all, and finally join the two together!
#Filter to "both" sides to avoid counting attack and defense rows separately
#Group by match + map + team, then number each player 1-5 to create agent slots
agents_wide <- overview_all |>
#Set map!= "All Maps" since "All Maps" rows are the summaries of the series (Best of 3)
filter(side == "both", map != "All Maps") |>
distinct(tournament, match_name, map, team, player, .keep_all = TRUE) |>
group_by(year, tournament, match_name, map, team) |>
arrange(player, .by_group = TRUE) |>
slice_head(n = 5) |>
mutate(slot = paste0("agent_", row_number())) |>
select(year, tournament, match_name, map, team, slot, agents) |>
pivot_wider(names_from = slot, values_from = agents) |>
ungroup()
#Verifying code works
head(agents_wide)
## # A tibble: 6 × 10
## year tournament match_name map team agent_1 agent_2 agent_3 agent_4
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra killjoy sova jett
## 2 2021 Champions Tour A… BOOM Espo… Asce… DAMW… sova astra skye cypher
## 3 2021 Champions Tour A… BOOM Espo… Bind BOOM… astra killjoy skye raze
## 4 2021 Champions Tour A… BOOM Espo… Bind DAMW… sova astra skye viper
## 5 2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra killjoy sova jett
## 6 2021 Champions Tour A… BOOM Espo… Asce… FENN… astra sage jett sova
## # ℹ 1 more variable: agent_5 <chr>
We now have one row per team per map, with columns
agent_1 through agent_5 representing the five
agents that the team played!
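To make the reshaping step easier to follow, here is a minimal sketch on a single made-up team (the team, player, and match names are hypothetical). It applies the same idea as above: order the players within a team, number them 1 to 5, and spread the agents into one column per slot.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per player for one team on one map
toy_overview <- tibble(
  match_name = "Team A vs Team B",
  map        = "Ascent",
  team       = "Team A",
  player     = c("p3", "p1", "p5", "p2", "p4"),
  agents     = c("sova", "jett", "omen", "killjoy", "kayo")
)

toy_wide <- toy_overview |>
  group_by(match_name, map, team) |>
  # Order players alphabetically so slot numbers are deterministic
  arrange(player, .by_group = TRUE) |>
  mutate(slot = paste0("agent_", row_number())) |>
  ungroup() |>
  select(match_name, map, team, slot, agents) |>
  # One column per slot, one row per team per map
  pivot_wider(names_from = slot, values_from = agents)

print(toy_wide)
```

The five player rows collapse into a single row, and the slot order follows the player names rather than the agent names.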
Next, let’s determine the winner of each map using
scores_all. The team with the higher score wins (13 rounds in
regulation, or more if the map goes to overtime). We can do that by
creating a winner column, pivoting the data from wide (one row per map
with two team columns) to long (one row per team per map), and creating
a binary outcome variable won.
#Using if_else statement with mutate to create a column that dictates the winner
scores_long <- scores_all |>
mutate(winner = if_else(team_a_score > team_b_score, team_a, team_b)) |>
select(year, tournament, match_name, map, team_a, team_b, winner) |>
#Pivoting to attach `won` label to each team
pivot_longer(cols = c(team_a, team_b),
names_to = "side",
values_to = "team") |>
#Creating a binary outcome: did this team win? which is stored as a factor for classification
mutate(won = as.factor(if_else(team == winner, "yes", "no"))) |>
select(year, tournament, match_name, map, team, won)
#Printing a few rows for verification
head(scores_long)
## # A tibble: 6 × 6
## year tournament match_name map team won
## <int> <chr> <chr> <chr> <chr> <fct>
## 1 2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Haven Visi… yes
## 2 2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Haven FULL… no
## 3 2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Breeze Visi… yes
## 4 2021 Valorant Champions 2021 Vision Strikers vs FULL SENSE Breeze FULL… no
## 5 2021 Valorant Champions 2021 Team Vikings vs Crazy Raccoon Icebox Team… yes
## 6 2021 Valorant Champions 2021 Team Vikings vs Crazy Raccoon Icebox Craz… no
Each map now has two rows, one for each team, with won
indicating whether that team won the map or not.
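The same labeling logic can be sketched on two made-up maps, one of which goes to overtime (all team names and scores here are hypothetical). The point is that comparing scores, rather than checking for exactly 13, also handles overtime results.

```r
library(dplyr)
library(tidyr)

# Hypothetical scores: a regulation win and an overtime win
toy_scores <- tibble(
  map          = c("Haven", "Bind"),
  team_a       = c("Team A", "Team A"),
  team_a_score = c(13, 12),
  team_b       = c("Team B", "Team B"),
  team_b_score = c(7, 14)
)

toy_long <- toy_scores |>
  mutate(winner = if_else(team_a_score > team_b_score, team_a, team_b)) |>
  select(map, team_a, team_b, winner) |>
  # One row per team per map
  pivot_longer(cols = c(team_a, team_b),
               names_to = "slot", values_to = "team") |>
  mutate(won = factor(if_else(team == winner, "yes", "no"),
                      levels = c("no", "yes"))) |>
  select(map, team, won)

print(toy_long)
```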
Lastly, we can now join the agent compositions with the match
outcomes. We can do that by using inner_join so only maps
that appear in both datasets are kept.
#Joining agents_wide and scores_long
model_data <- agents_wide |>
inner_join(scores_long, by = c("year", "tournament", "match_name", "map", "team"))
#Checking dimensions
dim(model_data)
## [1] 53674 11
#Calling head to verify
head(model_data)
## # A tibble: 6 × 11
## year tournament match_name map team agent_1 agent_2 agent_3 agent_4
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra killjoy sova jett
## 2 2021 Champions Tour A… BOOM Espo… Asce… DAMW… sova astra skye cypher
## 3 2021 Champions Tour A… BOOM Espo… Bind BOOM… astra killjoy skye raze
## 4 2021 Champions Tour A… BOOM Espo… Bind DAMW… sova astra skye viper
## 5 2021 Champions Tour A… BOOM Espo… Asce… BOOM… astra killjoy sova jett
## 6 2021 Champions Tour A… BOOM Espo… Asce… FENN… astra sage jett sova
## # ℹ 2 more variables: agent_5 <chr>, won <fct>
Our observations went from over 1 million to just 53,674 observations!
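One thing worth checking before trusting a join like this is whether each key combination is unique, since two teams could meet more than once in the same tournament on the same map. Below is a minimal sketch of such a check on made-up data (the tournament and team names are hypothetical, and this check is an addition for illustration rather than part of the pipeline above).

```r
library(dplyr)

# Hypothetical long-format scores where "A vs B" on Bind appears twice
toy_scores_long <- tibble(
  year       = 2021L,
  tournament = "Toy Cup",
  match_name = c("A vs B", "A vs B", "A vs B", "A vs B", "C vs D", "C vs D"),
  map        = c("Bind", "Bind", "Bind", "Bind", "Haven", "Haven"),
  team       = c("A", "B", "A", "B", "C", "D")
)

join_keys <- c("year", "tournament", "match_name", "map", "team")

# Key combinations that occur more than once would duplicate rows in a join
dup_keys <- toy_scores_long |>
  count(across(all_of(join_keys)), name = "n_rows") |>
  filter(n_rows > 1)

print(dup_keys)
```

If the same check on scores_long returned any rows, one option would be to add stage and match_type to the join keys, since both columns exist in overview_all and scores_all. Recent versions of dplyr also accept relationship = "one-to-one" in inner_join(), which raises an error when the keys are not unique.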
With our core data set built, we can now add some special features to
it! First, we can join maps_stats_all which provides the
attacker and defender win percentage for each map.
maps_features <- maps_stats_all |>
#Remove "All Maps" rows, these are just summary rows, not real maps
filter(map != "All Maps") |>
#Strip the % sign from win percentage columns so they can be converted to numeric (was originally character)
mutate(attacker_side_win_percentage = as.numeric(str_remove(attacker_side_win_percentage, "%")),
defender_side_win_percentage = as.numeric(str_remove(defender_side_win_percentage, "%"))
) |>
group_by(year, map) |>
#Average attacker and defender win percentage per map per year
summarise(
attacker_win_pct = mean(attacker_side_win_percentage, na.rm = TRUE),
defender_win_pct = mean(defender_side_win_percentage, na.rm = TRUE),
.groups = "drop"
)
#Join map balance features onto model_data by year and map
model_data <- model_data |>
left_join(maps_features, by = c("year", "map"))
#Verifying dimensions
dim(model_data)
## [1] 53674 13
colnames(model_data)
## [1] "year" "tournament" "match_name" "map"
## [5] "team" "agent_1" "agent_2" "agent_3"
## [9] "agent_4" "agent_5" "won" "attacker_win_pct"
## [13] "defender_win_pct"
Next, we can incorporate the global agent pick rates from
pick_rates_all. Instead of joining pick rate for each
individual agent slot, we can compute the average pick rate across all
five agents on a team. This gives us a single numeric feature capturing
how meta a team composition is. A high average pick rate would suggest
that the team is running widely favored agents, while a low average
suggests a more unconventional draft (could be experimenting with new
compositions).
#Summarizing pick rates to one row per agent per year
#Since pick_rate was a character, we have to strip % from it and convert to numeric
pick_rate_lookup <- pick_rates_all |>
mutate(pick_rate = as.numeric(str_remove(pick_rate, "%"))) |>
group_by(year, agent) |>
summarise(avg_pick_rate = mean(pick_rate, na.rm = TRUE), .groups = "drop")
#For each row, get the average pick rate across all 5 agents
agent_pick_rates <- model_data |>
pivot_longer(cols = c(agent_1, agent_2, agent_3, agent_4, agent_5),
names_to = "slot", values_to = "agent") |>
left_join(pick_rate_lookup, by = c("year", "agent")) |>
group_by(year, tournament, match_name, map, team) |>
summarise(mean_pick_rate = mean(avg_pick_rate, na.rm = TRUE), .groups = "drop")
#Join back onto model_data
model_data <- model_data |>
left_join(agent_pick_rates, by = c("year", "tournament", "match_name", "map", "team"))
dim(model_data)
## [1] 53674 14
colnames(model_data)
## [1] "year" "tournament" "match_name" "map"
## [5] "team" "agent_1" "agent_2" "agent_3"
## [9] "agent_4" "agent_5" "won" "attacker_win_pct"
## [13] "defender_win_pct" "mean_pick_rate"
Our dataset is now fully built, containing 53,674 observations across 14 variables. We can now move on to exploratory data analysis to better understand the distributions and relationships in our data before modeling!
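To see what mean_pick_rate actually measures, here is a minimal sketch with made-up pick rates (the agent name "mystery" and all numbers are hypothetical). It also shows a side effect of na.rm = TRUE: if an agent is missing from the lookup, the mean is taken over the remaining agents instead of becoming NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical pick rates (in %), one row per agent
toy_lookup <- tibble(
  agent         = c("jett", "omen", "sova", "killjoy", "kayo"),
  avg_pick_rate = c(60, 40, 50, 30, 20)
)

# Two hypothetical teams; "mystery" has no entry in the lookup
toy_teams <- tibble(
  team    = c("Team A", "Team B"),
  agent_1 = c("jett", "jett"),
  agent_2 = c("omen", "omen"),
  agent_3 = c("sova", "sova"),
  agent_4 = c("killjoy", "killjoy"),
  agent_5 = c("kayo", "mystery")
)

toy_mean <- toy_teams |>
  pivot_longer(cols = starts_with("agent_"),
               names_to = "slot", values_to = "agent") |>
  left_join(toy_lookup, by = "agent") |>
  group_by(team) |>
  summarise(mean_pick_rate = mean(avg_pick_rate, na.rm = TRUE),
            .groups = "drop")

print(toy_mean)
```

Team A averages all five agents, while Team B is averaged over only four, so its value is not directly comparable.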
Before exploring the data visually, we should check for any missing values across all columns. Missing data could cause issues during modeling, so it’s important that we identify any missing values and handle them early.
#Check for missing values in each column
colSums(is.na(model_data))
## year tournament match_name map
## 0 0 0 0
## team agent_1 agent_2 agent_3
## 0 2 23 35
## agent_4 agent_5 won attacker_win_pct
## 55 215 0 0
## defender_win_pct mean_pick_rate
## 0 2
It seems like several agent columns contain missing values.
agent_1 through agent_5 have varying numbers of
missing entries, and mean_pick_rate has 2. Before removing
these rows, let’s investigate why this might be the case.
#Finding specific rows where agent slots are missing to compare manually
model_data |>
filter(is.na(agent_1) | is.na(agent_2) | is.na(agent_3) |
is.na(agent_4) | is.na(agent_5)) |>
select(year, tournament, match_name, map, team, agent_1, agent_2,
agent_3, agent_4, agent_5) |>
head(5)
## # A tibble: 5 × 10
## year tournament match_name map team agent_1 agent_2 agent_3 agent_4
## <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2021 Champions Tour B… Vorax vs … Split Mark… brimst… sage breach <NA>
## 2 2021 Champions Tour B… Liberty v… Bind INGA… sova jett raze brimst…
## 3 2021 Champions Tour B… mato30epe… Asce… mato… omen killjoy jett raze
## 4 2021 Champions Tour B… mato30epe… Split mato… omen sage jett raze
## 5 2021 Champions Tour C… GMT Espor… Bind GMT … breach sova cypher viper
## # ℹ 1 more variable: agent_5 <chr>
After looking up one of these matches directly on vlr.gg, the cause is clear. The first observation is from 2021 Champions Tour Brazil Stage 1: Challengers 1, specifically Vorax vs. Mark Five on Split. As shown in the screenshot below, Mark Five only had 3 players recorded (dek, vianna1, and e4s) despite Valorant being a 5v5 game. This appears to be an incomplete data entry issue on vlr.gg’s end rather than an error in our tidying.
Since we cannot reliably
reconstruct a full 5-agent composition from incomplete data, these rows
are removed.
#Remove the rows with missing values
model_data <- model_data |>
drop_na()
colSums(is.na(model_data))
## year tournament match_name map
## 0 0 0 0
## team agent_1 agent_2 agent_3
## 0 0 0 0
## agent_4 agent_5 won attacker_win_pct
## 0 0 0 0
## defender_win_pct mean_pick_rate
## 0 0
dim(model_data)
## [1] 53459 14
After removing incomplete rows, all 14 columns now have zero missing values. The dataset still has 53,459 observations, losing only 215 rows (about 0.4% of the data), so the removal will have minimal impact on our analysis given the size of the dataset.
While investigating the missing values, another issue caught my
attention. In the first observation,
agent_1 contained “brimstone, jett, raze” rather than a
single agent name, which shows that some agent slots contain multiple
agent names separated by commas. Let’s check whether there are any
other instances of this in our dataset.
#Check if any agent slots contain comma separated values
model_data |>
filter(str_detect(agent_1, ",") |
str_detect(agent_2, ",") |
str_detect(agent_3, ",") |
str_detect(agent_4, ",") |
str_detect(agent_5, ",")) |>
select(tournament, match_name, map, team, agent_1, agent_2,
agent_3, agent_4, agent_5) |>
head(5)
## # A tibble: 5 × 9
## tournament match_name map team agent_1 agent_2 agent_3 agent_4 agent_5
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Champions Tour… Five Ace … Iceb… Five… sage viper killjo… reyna jett
## 2 Champions Tour… Five Ace … Iceb… Five… sage viper killjo… reyna jett
## 3 Champions Tour… Five Ace … Iceb… TRAI… phoeni… omen killjoy jett sage, …
## 4 Champions Tour… Five Ace … Iceb… TRAI… phoeni… omen killjoy jett sage, …
## 5 Champions Tour… CBT Gamin… Bind CBT … cypher sova brimst… phoenix jett, …
Other observations do contain corrupted entries where multiple agent
names are concatenated with a comma (e.g., “killjoy, yoru” in the
second observation). After manually
verifying on vlr.gg again, this occurs because some rows in
overview_all correspond to “All Maps” series summaries
rather than the individual maps (in this case Icebox). This means a
player who played Killjoy on one map and Yoru on another appears as
“killjoy, yoru” in the aggregated row. We cannot use these
rows because they don’t represent the actual
lineup used on the given map (a player can only use one agent per
map).
Provided below is a screenshot of vlr.gg “All Maps” tab showing this
issue.
An easy fix for this is to filter out any rows whose agent slots contain a comma.
#Filtering out rows where any agent slot contains comma-separated names
model_data <- model_data |>
filter(!str_detect(agent_1, ","),
!str_detect(agent_2, ","),
!str_detect(agent_3, ","),
!str_detect(agent_4, ","),
!str_detect(agent_5, ","))
dim(model_data)
## [1] 53445 14
After removing the corrupted rows, the dataset has 53,445 observations. The agent columns are now clean, with each slot containing exactly one agent.
One unique aspect of Valorant to consider is that mirror compositions
are allowed. A mirror composition occurs when both teams run
the exact same 5 agents on the same map. More generally, the
same composition can appear across many different matches.
However, since agent slots were assigned alphabetically by player name
rather than by agent, the same 5-agent lineup could be encoded
differently across rows. For example, two teams running
jett, omen, sova,
sage, and raze could have
agent_1 = jett for Team A and agent_1 = omen
for Team B even though they have the same composition. This would be a
problem because our models treat each agent slot as a separate
feature, so they would see these as two completely different
compositions even though they are identical (only the slot order
differs). To keep the encoding consistent, we sort the agents
alphabetically within each row.
# Sort agents alphabetically within each row for consistent composition encoding
model_data <- model_data |>
rowwise() |>
mutate(
agents_sorted = list(sort(c(agent_1, agent_2, agent_3, agent_4, agent_5))),
agent_1 = agents_sorted[[1]],
agent_2 = agents_sorted[[2]],
agent_3 = agents_sorted[[3]],
agent_4 = agents_sorted[[4]],
agent_5 = agents_sorted[[5]]
) |>
select(-agents_sorted) |>
ungroup()
dim(model_data)
## [1] 53445 14
Our dataset is now fully cleaned and standardized. With 53,445 observations and now consistent agent encoding, we are ready to move into the Visual EDA and Modeling.
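As a small illustration of why the sorting matters, the sketch below builds a single composition key for two made-up teams whose slots are in different orders (the composition column is a hypothetical helper, not a column in model_data).

```r
library(dplyr)

# Two hypothetical teams with the same five agents in different slot orders
toy_comps <- tibble(
  team    = c("Team A", "Team B"),
  agent_1 = c("jett", "omen"),
  agent_2 = c("omen", "sova"),
  agent_3 = c("sova", "sage"),
  agent_4 = c("sage", "raze"),
  agent_5 = c("raze", "jett")
)

toy_sorted <- toy_comps |>
  rowwise() |>
  # Sort the five agents and paste them into one order-independent key
  mutate(composition = paste(
    sort(c(agent_1, agent_2, agent_3, agent_4, agent_5)),
    collapse = "-"
  )) |>
  ungroup()

print(toy_sorted$composition)
```

Both teams end up with the same key, so a count on that key would treat them as one lineup.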
Now that our data is clean, we can begin exploring it visually. We
can start with a standard check: let’s look at the balance of our outcome
variable won. Since every map played has to have one
winning and one losing team, we should expect this to be roughly
balanced.
#Bar plot of outcome distribution
model_data |>
count(won) |>
ggplot(aes(x = won, y = n, fill = won)) +
geom_col(show.legend = FALSE) +
labs(title = "Win/Loss Distribution",
x = "Won",
y = "Count") +
theme_minimal()
The outcome variable is nearly balanced with 26,817 wins and 26,628
losses. Every map produces exactly one winning row and one losing row,
so the raw outcomes are perfectly balanced regardless of series format.
The gap of 189 rows most likely comes from the joining and cleaning
steps above, which add or remove rows one team at a time, so a map can
end up with only one of its two teams in the dataset. This minor
imbalance is unlikely to affect our model performance.
#This is the difference
model_data |>
count(won)
## # A tibble: 2 × 2
## won n
## <fct> <int>
## 1 no 26628
## 2 yes 26817
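One way to see where the two classes diverge is to count how many team rows survive per map. A map that keeps both teams contributes one win and one loss, while a map that kept only one team contributes a single row. Here is a minimal sketch on made-up data (all names are hypothetical).

```r
library(dplyr)

# Hypothetical cleaned data: the losing team on Haven was dropped earlier
toy_model <- tibble(
  match_name = c("A vs B", "A vs B", "C vs D"),
  map        = c("Bind", "Bind", "Haven"),
  team       = c("A", "B", "C"),
  won        = factor(c("yes", "no", "yes"), levels = c("no", "yes"))
)

# Maps that do not have exactly two team rows unbalance the outcome
incomplete_maps <- toy_model |>
  count(match_name, map, name = "teams_kept") |>
  filter(teams_kept != 2)

print(incomplete_maps)
```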
Next, we can look at how frequently each map appears in our dataset. It is worth noting that Valorant has map rotations, meaning only 7 maps are active in the competitive rotation at any given time. Riot Games periodically adds new maps and removes others between acts and episodes (for example, for reworks or to fix glitches).
# Bar plot of map play frequency
model_data |>
#Sorts the data frame by n in descending order
count(map, sort = TRUE) |>
#Orders the bars by n on the plot
ggplot(aes(x = reorder(map, n), y = n)) +
geom_col(fill = "steelblue") +
#Flips the plots' x and y axes, allows for easier reading of map names
coord_flip() +
labs(title = "Map Play Frequency (2021-2026)",
x = "Map",
y = "Times Played") +
theme_minimal()
As expected, Ascent and Haven appear most frequently, having been in the
competitive pool since Valorant’s launch. Newer maps such as Abyss,
Sunset, and Corrode have far fewer appearances which logically makes
sense. This distribution is worth keeping in mind during our modeling
because our model would have much more training data for older maps than
newer ones.
Now that we know which maps appear more frequently, we can look at whether these maps are balanced. In Valorant, a map being imbalanced refers to whether the attacking or defending side has a structural advantage. Something to note is that Riot Games regularly patches/updates these maps, so let’s see how attacker win rates have shifted across all maps over the 6 years.
maps_features |>
ggplot(aes(x = year, y = attacker_win_pct, color = map, group = map)) +
geom_line() +
geom_point() +
geom_hline(yintercept = 50, linetype = "dashed", color = "black") +
scale_y_continuous(labels = scales::label_number(suffix = "%")) +
scale_color_brewer(palette = "Paired") +
labs(title = "Attacker Win Rate by Map Over Time (2021-2026)",
x = "Year",
y = "Attacker Win Rate (%)",
color = "Map") +
theme_minimal()
The line plot shows how attacker win rates have fluctuated
across these maps from 2021 to 2026. Notably, most maps hover around
the 50% reference line, suggesting that Riot Games has generally
succeeded in maintaining competitive balance through regular
patches. However, there are some notable outliers. Fracture, for
example, stands out with a very high attacker
win rate upon its release (September 2021). Breeze, in contrast,
trended downward over time, becoming increasingly defender sided.
Lastly, newer maps like Abyss (released in 2024) show a large spike,
with early data suggesting that the map was heavily attacker sided.
Something to note is that we only had to plot the attacker side win rate here because attacker and defender win rates always sum to 100%, meaning the defender win rate is simply the complement.
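The complement claim is easy to verify directly. The sketch below uses made-up values shaped like maps_features (the numbers are hypothetical) and flags any map where the two percentages do not add up to 100 within a small tolerance.

```r
library(dplyr)

# Hypothetical values shaped like maps_features
toy_maps <- tibble(
  map              = c("Ascent", "Breeze", "Fracture"),
  attacker_win_pct = c(48.5, 45.0, 56.2),
  defender_win_pct = c(51.5, 55.0, 43.8)
)

toy_check <- toy_maps |>
  mutate(
    total       = attacker_win_pct + defender_win_pct,
    # Tolerance allows for rounding in the source percentages
    sums_to_100 = abs(total - 100) < 0.01
  )

print(toy_check)
```

If the same check holds on the real data, the two columns carry the same information, so a linear model such as logistic regression only needs one of them.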
This feature is particularly important for our model because map side bias can heavily influence outcomes independent of agent composition. For example, imagine going 9-3 in the first half on Fracture in 2021, which we mentioned earlier had a very high attacker sided win rate. Would it be wrong to question whether the performance was driven by the lineup of agents or simply the map favoring the attacking side? The opposing team would have virtually no room for error on their attacking half (when they switch sides), needing to win 9 of the possible 12 rounds just to force overtime. In scenarios like these, map side advantage is just as crucial as the agents picked.
Previously, the line plot helped reveal how map balance has shifted over the years due to Riot’s patches. Now that we have a sense of the map dynamics, we can shift our focus onto the agents themselves. It would be interesting to look at which agents are more prone to win. Additionally, since there are 29 playable agents in Valorant, we will focus on the agents that have at least 50 appearances so our win rate estimates are more reliable.
#Pivot longer so each agent gets its own row
model_data |>
pivot_longer(cols = c(agent_1, agent_2, agent_3, agent_4, agent_5),
names_to = "slot", values_to = "agent") |>
#Calculating win rate and total picks per agent
group_by(agent) |>
summarise(
win_rate = mean(won == "yes"),
picks = n(),
.groups = "drop"
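) |>
#Keep only agents with at least 50 appearances so win rate estimates are more reliable
filter(picks >= 50) |>
#Order bars by win rate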
ggplot(aes(x = reorder(agent, win_rate), y = win_rate, fill = win_rate)) +
geom_col() +
#Add a reference line at 50% to show the average baseline
geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
#Flipping the axes so agent names are readable on the y-axis
coord_flip() +
#Blue = lower win rate, red = higher win rate
scale_fill_gradient(low = "lightblue", high = "red") +
#Having the y-axis display percentages
scale_y_continuous(labels = scales::percent) +
labs(title = "Win Rate of Agents",
x = "Agent", y = "Win Rate",
fill = "Win Rate") +
theme_minimal()
The bar plot shows the win rates across the top picked agents. The margins are relatively small: most agents fall within the 46% - 52% range, suggesting that no single (overpowered) agent determines the outcome on its own. Additionally, the roles are mixed throughout the ranking, indicating that no single role such as Duelist or Controller dominates. This is consistent with the idea that winning depends more on team composition and how agents complement one another than on any one pick.
At the top, Clove stands out with the highest win rate despite being a relatively new agent introduced in 2024. This may be partly a novelty effect: teams experimenting with a new agent in professional play can inflate early win rates over a small sample. Still, the result is plausible because Clove’s kit genuinely supports strong performance. As a Controller, Clove brings powerful smokes for map control while playing more aggressively than a traditional Controller. The self-heal helps Clove win more gunfights and stay alive longer, and more importantly the revival ability gives the team a second chance, allowing Clove to play like a Duelist when entering bomb sites.
On the other end, Deadlock sits at the bottom with the lowest win rate, which is not surprising. As a Sentinel (more defense sided), her kit has several weaknesses. Her barrier wall (meant to slow down the opposing team) can simply be shot down, and her sensors (another slowing ability) can be spotted and silently walked through. While her net grenade provides some stalling utility, these abilities fall short of other strong Sentinel options like Killjoy or Cypher, who offer more reliable and harder-to-counter defensive setups.
Now, let’s examine whether running a more meta composition (one with
more widely picked agents) is associated with winning. The
mean_pick_rate feature captures the average pick
rate across a team’s five agents. A higher mean would indicate
that the team is running widely favored agents, while a lower value would
suggest a more unconventional lineup.
#Box plot of mean pick rate by outcome
model_data |>
ggplot(aes(x = won, y = mean_pick_rate, fill = won)) +
geom_boxplot(show.legend = FALSE) +
scale_fill_manual(values = c("yes" = "orange", "no" = "black")) +
labs(title = "Mean Agent Pick Rate by Match Outcome",
x = "Won",
y = "Mean Pick Rate (%)") +
theme_bw()
From these box plots, it seems like winning and losing teams are nearly
identical in terms of median, spread, and distribution of
mean_pick_rate. This tells us that simply running more
meta agents does not strongly predict whether a team wins or loses. In
pro play, both teams tend to run similarly high pick rate agents since
pro players naturally gravitate towards the current meta. This
is an interesting finding because it suggests that winning in professional
Valorant is less about picking the most popular agents and
more about how those agents are combined as a team.
As mentioned before, it is worth noting that Valorant allows mirror
compositions, meaning both teams can run the exact same 5 agents on the same
map. In those cases, the outcome is determined less by which agents are
picked and more by how well the players execute their roles. This
further explains why mean_pick_rate shows little separation
between winners and losers.
Now that we have gotten familiar with the relationships between our variables, we can begin the modeling process. Before fitting any models, we should prepare our data by converting categorical variables to factors, splitting into training and testing sets, and setting up our cross-validation for tuning.
Let’s start by converting our categorical variables to factors.
#Converting categorical variables to factors for modeling
model_data <- model_data |>
mutate(
won = as.factor(won),
map = as.factor(map),
agent_1 = as.factor(agent_1),
agent_2 = as.factor(agent_2),
agent_3 = as.factor(agent_3),
agent_4 = as.factor(agent_4),
agent_5 = as.factor(agent_5),
)
All the categorical variables are now stored as factors. The str() helps confirm this.
Next, let’s split our data into a training set and a testing set. The training set is used to teach the model by allowing it to identify patterns and relationships within the data. The testing set, on the other hand, is only used after the model has finished learning, which allows us to evaluate how well the model performs on new, unseen data. Keeping these datasets separate also helps us detect overfitting, a problem where the model becomes too tailored to the training data (essentially memorizing it) and struggles to make accurate predictions on unfamiliar data.
For the purpose of our project, we will stratify on won
to make sure that both the training and testing sets maintain the same
win/loss ratio as the full dataset, which prevents either set from being
accidentally skewed toward more wins or losses.
#Setting a random seed for reproducibility
set.seed(2025)
#Splitting data into 80% training and 20% testing, stratified on won
valorant_split <- initial_split(model_data, prop = 0.80, strata = won)
valorant_train <- training(valorant_split)
valorant_test <- testing(valorant_split)
#Check proportions of training vs. testing
dim(valorant_train)
## [1] 42755 14
dim(valorant_test)
## [1] 10690 14
nrow(valorant_train) / nrow(model_data)
## [1] 0.7999813
nrow(valorant_test) / nrow(model_data)
## [1] 0.2000187
With 53,445 total observations, this results in 42,755 training observations (~80%) and 10,690 testing observations (~20%), which is confirmed by the proportions of 0.7999813 and 0.2000187 respectively.
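The effect of stratifying can be checked by comparing the outcome balance in the two sets. The sketch below does this on hypothetical toy data (`toy`, with made-up values and a 50/50 outcome), using the same rsample functions as above.

```r
library(rsample)

set.seed(1)
# Hypothetical toy data: 1,000 rows with a 50/50 outcome (values are made up)
toy <- data.frame(
  won = factor(rep(c("no", "yes"), each = 500)),
  x = rnorm(1000)
)

# 80/20 split, stratified on the outcome
toy_split <- initial_split(toy, prop = 0.80, strata = won)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Share of wins and losses in each set; stratification keeps both close to 0.5
prop.table(table(toy_train$won))
prop.table(table(toy_test$won))
```

Running the same two `prop.table()` calls on valorant_train and valorant_test would show whether the real split preserved the win/loss ratio.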
With our data split, we now set up k-fold cross-validation on the training set. Rather than fitting each model once on all of the training data, k-fold cross-validation divides the training data into k roughly equal folds. The model is then trained on k-1 folds and evaluated on the remaining fold, and this is repeated k times so every fold gets a turn as the evaluation set.
Averaging the model’s performance across the k folds gives us a more reliable estimate of how well our model generalizes than a single train/evaluate split. For our case, we will use 5-fold cross-validation (rather than 10) since our dataset is large enough that 5 folds provide a reliable performance estimate while keeping computation time more manageable.
Similarly, we also stratify the folds based on the won
variable. This helps ensure that the win/loss balance is preserved
in every fold, preventing any of them from being accidentally skewed
toward more wins or losses.
#Set up 5-fold cross-validation (stratified on won)
valorant_folds <- vfold_cv(valorant_train, v = 5, strata = won)
valorant_folds
## # 5-fold cross-validation using stratification
## # A tibble: 5 × 2
## splits id
## <list> <chr>
## 1 <split [34203/8552]> Fold1
## 2 <split [34203/8552]> Fold2
## 3 <split [34204/8551]> Fold3
## 4 <split [34205/8550]> Fold4
## 5 <split [34205/8550]> Fold5
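To make the fold mechanics described above concrete, here is a minimal by-hand sketch in base R. It is not what vfold_cv() does internally: the data are made up, the folds are not stratified, and it scores each fold with plain accuracy instead of ROC AUC to keep the loop short.

```r
set.seed(1)
# Hypothetical toy data: one numeric predictor and a binary outcome
n <- 200
toy <- data.frame(x = rnorm(n))
toy$won <- factor(ifelse(toy$x + rnorm(n) > 0, "yes", "no"),
                  levels = c("no", "yes"))

k <- 5
# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

fold_accuracy <- numeric(k)
for (i in seq_len(k)) {
  analysis <- toy[fold_id != i, ]    # the k - 1 folds used for fitting
  assessment <- toy[fold_id == i, ]  # the held-out fold used for scoring
  fit <- glm(won ~ x, data = analysis, family = binomial)
  prob_yes <- predict(fit, newdata = assessment, type = "response")
  pred <- ifelse(prob_yes > 0.5, "yes", "no")
  fold_accuracy[i] <- mean(pred == assessment$won)
}

# The cross-validated estimate is the average over the k held-out folds
mean(fold_accuracy)
```

The split sizes in the output above follow the same logic: each of the 5 folds holds out roughly 42,755 / 5 = 8,551 rows and fits on the remaining ~34,204.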
With our data split and folds in place, we can define our recipe. The recipe is a blueprint that specifies how the data should be preprocessed before being passed to any model.
In our case, we will be using 9 predictors: map,
agent_1, agent_2, agent_3,
agent_4, agent_5,
attacker_win_pct, defender_win_pct, and
mean_pick_rate. The identifier columns
year, tournament, match_name, and
team are removed since they are not useful features for
predicting match outcomes. They simply tell us who played and when,
not anything about the composition itself (which is our focus).
Since map and all five agent columns are categorical, we
will turn them into dummy variables. Finally, we will normalize all
numeric predictors by centering and scaling them. Because this step
runs after the dummy encoding, it applies to the dummy columns as well
as to attacker_win_pct, defender_win_pct, and mean_pick_rate.
#Define the recipe
valorant_recipe <- recipe(won ~ ., data = valorant_train) |>
#Removing identifier columns (not in the scope of our research question)
step_rm(any_of(c("year", "tournament", "match_name", "team"))) |>
#Turning Categorical Variables into Dummy Variables
step_dummy(all_nominal_predictors()) |>
#Note that step_nzv() was added after initial modeling produced errors due to zero variance columns
#Remove zero-variance and near-zero-variance columns (e.g. dummies for rarely picked agents)
step_nzv(all_predictors()) |>
#Normalizing all numeric predictors (this includes the dummy columns created above)
step_normalize(all_numeric_predictors())
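To see what a recipe like this produces, it can be prepped and baked on a small example. The sketch below uses a hypothetical `toy_train` data frame with made-up values and only the dummy and normalize steps. It assumes the same step order as above, so the dummy columns are centered and scaled along with the numeric column.

```r
library(recipes)

# Hypothetical toy training data (values are made up)
toy_train <- data.frame(
  won = factor(c("yes", "no", "no", "yes", "no", "yes")),
  map = factor(c("Ascent", "Bind", "Haven", "Ascent", "Bind", "Haven")),
  attacker_win_pct = c(51.2, 48.9, 50.3, 51.2, 48.9, 50.3)
)

toy_recipe <- recipe(won ~ ., data = toy_train) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# prep() estimates the preprocessing from the training data;
# bake(new_data = NULL) returns the processed training set
baked <- toy_recipe |>
  prep() |>
  bake(new_data = NULL)

names(baked)

# Every numeric predictor, dummy columns included, now has mean ~0
colMeans(baked[, c("map_Bind", "map_Haven", "attacker_win_pct")])
```

The first level of map ("Ascent" here) has no column of its own because step_dummy() uses it as the reference level.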
Now, the fun starts! We can finally begin fitting models to our
dataset. As previously stated, the goal
of this project is to predict whether a team wins or loses a map based
on their agent composition. Since won is a binary outcome
(yes/no), this is a classification problem. We will be fitting
five different models, each taking a different approach to the
problem: Logistic Regression, Elastic Net,
K-Nearest Neighbors, Random Forest, and Pruned Decision Tree.
Each model will follow the same general process:
1. Define the model, set its engine and mode (classification)
2. Bundle it into a workflow with our recipe
3. Set up a tuning grid with the hyperparameters and levels we want to explore (skip for logistic regression)
4. Tune the model across our 5 cross-validation folds (skip for logistic regression)
5. Select the best hyperparameter combination by ROC AUC (skip for logistic regression)
6. Finalize the workflow with those best parameters (skip for logistic regression)
7. Fit the finalized workflow to the full training set
#Define logistic regression model using glm engine
log_reg_model <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
#Create workflow
log_reg_wf <- workflow() |>
add_recipe(valorant_recipe) |>
add_model(log_reg_model)
#Evaluate using cross-validation folds (no tuning needed for log. reg.)
log_reg_fit <- fit_resamples(
log_reg_wf,
resamples = valorant_folds,
metrics = metric_set(roc_auc)
)
#Define elastic net model with tunable penalty and mixture
elastic_net_model <- logistic_reg(penalty = tune(), mixture = tune()) |>
set_engine("glmnet") |>
set_mode("classification")
#Create workflow
elastic_net_wf <- workflow() |>
add_recipe(valorant_recipe) |>
add_model(elastic_net_model)
#Define tuning grid
#Penalty range is given on the log10 scale: c(-4, 0) means 1e-4 to 1, testing weak to strong regularization
#Mixture ranges from 0 to 1: 0 is pure Ridge (shrinks all coefficients)
#1 is pure Lasso (can zero out uninformative coefficients), values in between blend both
#levels = 5 gives a 5x5 grid of 25 combinations
elastic_net_grid <- grid_regular(
penalty(range = c(-4, 0)),
mixture(range = c(0, 1)),
levels = 5
)
#Tune elastic net using cross-validation
elastic_net_res <- tune_grid(
elastic_net_wf,
resamples = valorant_folds,
grid = elastic_net_grid,
metrics = metric_set(roc_auc)
)
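The elastic net grid defined above can be reproduced by hand to see exactly which values get tested. This is a sketch in base R, assuming (as the comments above describe) that the penalty levels are evenly spaced on the log10 scale and the mixture levels on the raw scale. `by_hand_grid` is a hypothetical name, not an object used elsewhere.

```r
# Penalty levels: evenly spaced on the log10 scale from 10^-4 to 10^0
penalty_values <- 10^seq(-4, 0, length.out = 5)

# Mixture levels: evenly spaced from 0 (pure Ridge) to 1 (pure Lasso)
mixture_values <- seq(0, 1, length.out = 5)

# Every penalty paired with every mixture: 5 x 5 = 25 combinations
by_hand_grid <- expand.grid(penalty = penalty_values, mixture = mixture_values)

penalty_values
mixture_values
nrow(by_hand_grid)
```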
#Define the model
knn_model <- nearest_neighbor(neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
#Workflow
knn_workflow <- workflow() |>
add_model(knn_model) |>
add_recipe(valorant_recipe)
#Tuning grid
knn_grid <- grid_regular(
neighbors(range = c(1, 15)),
levels = 8
)
#Tune across folds
knn_tune <- tune_grid(
knn_workflow,
resamples = valorant_folds,
grid = knn_grid,
metrics = metric_set(roc_auc)
)
#Define Random Forest model
rf_model <- rand_forest(mtry = tune(), trees = tune(), min_n = tune()) |>
set_engine("ranger", importance = "impurity") |>
set_mode("classification")
#Create Workflow
rf_workflow <- workflow() |>
add_model(rf_model) |>
add_recipe(valorant_recipe)
#Tuning grid
rf_grid <- grid_regular(
mtry(range = c(1, 10)),
trees(range = c(100, 500)),
min_n(range = c(2, 20)),
levels = 3
)
#Ensures reproducible results
set.seed(2025)
#Tune across folds
rf_tune <- tune_grid(
rf_workflow,
resamples = valorant_folds,
grid = rf_grid,
metrics = metric_set(roc_auc)
)
#Define the Decision Tree model
tree_model <- decision_tree(cost_complexity = tune(), tree_depth = tune(),
min_n = tune()) |>
set_engine("rpart") |>
set_mode("classification")
#Set up workflow
tree_workflow <- workflow() |>
add_model(tree_model) |>
add_recipe(valorant_recipe)
#Tuning grid
tree_grid <- grid_regular(
cost_complexity(range = c(-4,-1)),
tree_depth(range = c(1,10)),
min_n(range = c(2,20)),
levels = 3
)
#Tune across folds
tree_tune <- tune_grid(
tree_workflow,
resamples = valorant_folds,
grid = tree_grid,
metrics = metric_set(roc_auc)
)
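As a rough guide to the computation involved, the tally below counts how many hyperparameter/fold combinations the grids above ask for. It is an upper bound on the work, not an exact count of model fits, since some engines (glmnet, for example) can evaluate several penalty values from a single fit.

```r
v <- 5  # number of cross-validation folds

# Number of hyperparameter combinations in each grid defined above
grid_sizes <- c(
  "Logistic Regression" = 1,      # no tuning, a single specification
  "Elastic Net"         = 5 * 5,  # penalty x mixture
  "KNN"                 = 8,      # neighbors
  "Random Forest"       = 3^3,    # mtry x trees x min_n
  "Decision Tree"       = 3^3     # cost_complexity x tree_depth x min_n
)

# Each combination is evaluated once per fold
evaluations_per_model <- grid_sizes * v
evaluations_per_model
sum(evaluations_per_model)  # 440
```

With 10 folds instead of 5, every one of these counts would double, which is the computational cost the choice of 5-fold cross-validation avoids.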
With all five models trained and evaluated through cross-validation, we can now compare their performance side by side using ROC AUC as our metric.
model_results <- tibble(
Model = c("Logistic Regression", "Elastic Net", "KNN", "Random Forest", "Decision Tree"),
ROC_AUC = c(
#Logistic regression uses fit_resamples so we pull from collect_metrics
collect_metrics(log_reg_fit) |> filter(.metric == "roc_auc") |> pull(mean),
#For the tuned models, show_best(n=1) returns the best hyperparameter combo
show_best(elastic_net_res, metric = "roc_auc", n = 1) |> pull(mean),
show_best(knn_tune, metric = "roc_auc", n = 1) |> pull(mean),
show_best(rf_tune, metric = "roc_auc", n = 1) |> pull(mean),
show_best(tree_tune, metric = "roc_auc", n = 1) |> pull(mean)
)
)
model_results
## # A tibble: 5 × 2
## Model ROC_AUC
## <chr> <dbl>
## 1 Logistic Regression 0.522
## 2 Elastic Net 0.522
## 3 KNN 0.493
## 4 Random Forest 0.525
## 5 Decision Tree 0.518
#Let's make a visual showing the difference in ROC AUC values between the models
model_results |>
ggplot(aes(x = reorder(Model, ROC_AUC), y = ROC_AUC, fill = Model)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
geom_text(aes(label = round(ROC_AUC, 4)), hjust = -0.1, size = 3.5) +
coord_flip() +
ylim(0, 0.6) +
labs(title = "Cross-Validated ROC AUC by Model",
x = "Model",
y = "ROC AUC") +
theme_minimal()
The bar chart above summarizes the cross-validated ROC AUC scores for
all five models. The dashed line at 0.5 represents random chance,
meaning a model at that level is essentially just guessing.
Random Forest performed the best with a ROC AUC of 0.5251, followed
closely by Logistic Regression (0.5224) and Elastic Net (0.5222). The
Decision Tree came in fourth at 0.5175, while KNN was the only model to
fall below random chance at 0.4931. Overall, the margins between models
are very small, and no model managed to meaningfully separate winners
from losers, suggesting that agent compositions alone may not carry a
strong enough signal to predict match outcomes.
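To make the 0.5 baseline concrete: ROC AUC can be read as the probability that a randomly chosen winning team receives a higher predicted win probability than a randomly chosen losing team. The sketch below computes it by hand with the rank-based (Mann-Whitney) formula on made-up scores. `auc_by_rank` is a hypothetical helper written for this illustration, not part of tidymodels.

```r
# Hypothetical helper: ROC AUC from ranks (tied scores receive average ranks)
auc_by_rank <- function(truth, score, positive = "yes") {
  is_pos <- truth == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  rank_sum_pos <- sum(rank(score)[is_pos])
  (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

truth <- c("no", "no", "yes", "yes")

# Scores that rank most winners above most losers
auc_by_rank(truth, c(0.10, 0.40, 0.35, 0.80))  # 0.75

# Scores that carry no information (all tied) land exactly on 0.5
auc_by_rank(truth, c(0.50, 0.50, 0.50, 0.50))  # 0.5
```

Read this way, a cross-validated ROC AUC of about 0.525 means the best model ranks a winner above a loser only slightly more often than a coin flip would.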
To better understand how each tuned model behaved during cross-validation, the plots below show how ROC AUC changed across the hyperparameter combinations explored for each model.
autoplot(elastic_net_res)
For the Elastic Net tuning plot, we used a penalty range from 1e-04 to 1
and a mixture range from 0 to 1. At the lowest regularization, all five
lines start clustered together around 0.520 to 0.522, meaning the blend
of Ridge and Lasso did not make much of a difference early on. As
regularization increases, however, the lines begin to drop sharply
towards 0.500, which is essentially a coin flip. Something that is
interesting is that pure Ridge (mixture = 0.00) holds on the longest
before falling off, while the higher Lasso proportions (0.75 and 1.00)
fall off quicker, which makes sense since Lasso is more aggressive about
eliminating predictors entirely. To sum it all up, no matter how we
tuned this model, it could not find a strong enough pattern in the agent
compositions to predict wins reliably.
autoplot(knn_tune)
For KNN, we tuned the number of neighbors ranging from 1 to 15 across 8
levels. The plot shows a clear and consistent downward trend. As the
number of neighbors increases, ROC AUC steadily decreases from about
0.493 at k = 1 all the way down to around 0.478 at k = 15. Normally,
we’d want to see the opposite, where a larger k smooths out noise and
improves the performance, but that’s not the case here. Even at its best
(k = 1), the model is already performing below random chance at 0.493,
which tells us that even the closest matches in our data set don’t share
reliable patterns. This makes sense given our data: two
teams can run the same exact five agents on the same map and still get
completely different outcomes, so finding meaningful neighbors based
on composition alone is hard.
autoplot(rf_tune)
For Random Forest, we tuned three hyperparameters:
mtry
(ranging from 1 to 10), trees (ranging from 100 to 500),
and min_n (ranging from 2 to 20) across 3 levels each.
Looking at the plot, the number of trees doesn’t seem to have much of an
effect on performance, with the lines for 100, 300, and 500 trees being
pretty close to each other. An interesting thing to take note of is
min_n: the rightmost panel (min_n = 20)
consistently outperforms the others, suggesting that larger node sizes
help prevent the trees from over-splitting on noise. As for
mtry, it looks like the middle range around 5 tends to do
the best within each panel. Overall, even the best combination only
gets us to about 0.525, so while Random Forest is our top performer,
it’s still not far off from randomly guessing.
autoplot(tree_tune)
For the Decision Tree, we tuned three hyperparameters:
cost_complexity (ranging from 1e-04 to 0.1),
tree_depth (ranging from 1 to 10), and min_n
(ranging from 2 to 20). The first thing that stood out to me was that
tree depth matters a lot, with the deeper trees (depth = 10, blue)
consistently outperforming the shallower ones across all three panels,
while depth = 1 (red) sits at the bottom the entire time. This makes
sense because a tree with only one split can barely capture any patterns
in the data. As cost complexity increases, all three depths sharply drop
toward 0.500, which is expected since a higher penalty aggressively
prunes the tree down until it’s essentially predicting the same class
for everything. Unlike the Random Forest, where min_n made a
slight difference in performance across panels, the three panels here
look nearly identical, suggesting that minimum node size had little to no
impact on the performance. The best we got out of this was about 0.518,
which again is right in the range we’ve been seeing across all our
models.
With all five models evaluated, Random Forest came out on top with the highest cross-validated ROC AUC of 0.5251, so we will move forward with fitting it on the full training set and evaluating it on the testing set.
#Select the best hyperparameters from the tuning results to use for the final fit
best_rf <- select_best(rf_tune, metric = "roc_auc")
#Finalize the workflow with the best hyperparameters
rf_final_workflow <- finalize_workflow(rf_workflow, best_rf)
#Fit to the full training set
set.seed(2025)
rf_fit <- fit(rf_final_workflow, data = valorant_train)
#Predict on test set
rf_preds <- augment(rf_fit, new_data = valorant_test)
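#The "Training" bar is the cross-validated ROC AUC on the training set;
#the "Testing" bar is the ROC AUC on the held-out test set
tibble(Set = c("Training", "Testing"),
ROC_AUC = c(0.5251, 0.4822)) |>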
ggplot(aes(x = Set, y = ROC_AUC, fill = Set)) +
geom_col(show.legend = FALSE) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
geom_text(aes(label = round(ROC_AUC, 4)), vjust = -0.5, size = 4) +
ylim(0, 0.6) +
labs(title = "Random Forest: Training vs. Testing ROC AUC",
x = "",
y = "ROC AUC") +
theme_bw()
This bar chart compares the Random Forest’s cross-validated training ROC
AUC of 0.5251 against the testing ROC AUC of 0.4822. While the training
performance was already modest, sitting just barely above random chance,
the testing performance dropped below the 0.5 threshold entirely. This
gap suggests that the model is slightly overfit to the training data,
picking up on patterns that did not generalize to new, unseen
compositions. Overall, both values are so close to 0.5 that it is clear
that agent composition alone is not a reliable predictor of match
outcomes in Valorant, and no amount of tuning was able to overcome that
fundamental limitation in the data.
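#ROC Curve
#Note: yardstick treats the first factor level of truth as the event by default,
#so if levels(won) are c("no", "yes"), add event_level = "second" to pair .pred_yes with "yes"
roc_curve(rf_preds, truth = won, .pred_yes) |>
autoplot()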
Looking at the ROC curve on the testing set, the curve barely lifts off
the diagonal dashed line at all, which tells the same story as our 0.482
ROC AUC: the model really struggled to tell winning teams apart from
losing ones. No matter where we set the threshold, the model is
essentially guessing, which further reinforces the idea that agent
composition alone is just not enough information to predict match
outcomes reliably.
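One thing worth checking whenever a test ROC AUC lands just below 0.5 is which factor level the metric treats as the event. In yardstick the default is the first level, and pairing `.pred_yes` with a first level of "no" yields one minus the intended value. The sketch below shows this on made-up predictions. Whether it applies to the results above depends on how won was encoded, which the code shown here does not settle.

```r
library(yardstick)

# Made-up example: four maps with the true outcome and a predicted win probability
toy_preds <- data.frame(
  won = factor(c("no", "no", "yes", "yes"), levels = c("no", "yes")),
  .pred_yes = c(0.10, 0.40, 0.35, 0.80)
)

# Default: the first level ("no") is the event, so .pred_yes is scored against "no"
auc_first <- roc_auc_vec(toy_preds$won, toy_preds$.pred_yes)

# Treat the second level ("yes") as the event so it matches .pred_yes
auc_second <- roc_auc_vec(toy_preds$won, toy_preds$.pred_yes,
                          event_level = "second")

c(first = auc_first, second = auc_second)  # 0.25 and 0.75
```

The two values always sum to 1, so a quick `levels(valorant_test$won)` check is enough to tell which one a given call reports.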
Since Random Forest was our top performer, it would be interesting to dig a little deeper and see which specific agents or maps the model was leaning on the most when making its predictions.
#vip() comes from the vip package, so it is called with its namespace here
rf_fit |>
extract_fit_parsnip() |>
vip::vip(num_features = 20) +
labs(title = "Random Forest Variable Importance") +
theme_bw()
Looking at the variable importance plot, the three features that stood
out the most were
mean_pick_rate,
defender_win_pct, and attacker_win_pct, which
are all aggregate features (overall agent popularity and map side
balance) rather than indicators for specific agents. So ironically,
the model was leaning more on map balance and general popularity than on
the actual agent compositions we were trying to study. Among the agents,
Sova and Jett showed up the highest, which honestly makes sense since
they have been staples in pro play for years. Everything else further
down the list looks pretty similar in importance, with no single agent
really standing out. The fact that map characteristics ended up being
more useful than the agents themselves actually backs up what we’ve been
seeing all along: the agents you pick are much less important than you’d
think in professional Valorant.
Out of all five models, Random Forest performed the best with a cross-validated ROC AUC of 0.5251, though when evaluated on the testing set it dropped to 0.4822, falling below random chance. Even as our top performer, the model struggled to find any reliable signal in the agent compositions, which was a consistent theme across all five models we tested. Despite tuning each model carefully across a range of hyperparameters, none of them were able to meaningfully outperform a coin flip, which honestly says more about the nature of the data than the models themselves.
On the other end, KNN was our worst model at 0.4931. Since two teams can run the exact same five agents on the same map and still get completely different results, there really are no meaningful neighbors for the model to learn from, and the more neighbors we considered the worse it got.
Something that stood out was the variable importance plot showing map
balance features like attacker_win_pct and
defender_win_pct among the most influential predictors rather
than the agents themselves. For a project centered around agent
compositions, it is a pretty telling sign that the model found map
characteristics more useful than the actual drafts.
If I were to approach this project differently, or build upon it, I would incorporate individual player statistics like kill/death ratios, first blood rates, and tournament appearances. As someone who watches pro play regularly, I suspected going into the project that the players were going to matter a little more than the agents, and the results ended up pointing in that direction more strongly than I expected. I think a future model that combines both composition and player performance data would be a much stronger predictor of match outcomes.
To conclude, even though the models did not perform as well as I had hoped, this project gave me a deeper appreciation for just how complex professional Valorant really is. With over 2,300 hours of playtime and countless more hours spent watching professional matches, I came into this project genuinely believing that team composition was one of the most important factors in determining who wins. These results clearly challenged that assumption. Something I like to take away from this project is the sheer preparation that these Valorant professionals go through. Winning at the highest level comes down to so much more than just the agents you draft. Rather, it’s about the preparation, teamwork, and individual skill. If anything, this project helped me view the game through a completely different lens, and made me appreciate just how much goes into every professional match beyond what any model can fully capture. On top of that, working through six years of professional match data and building these models from scratch was an incredible opportunity to grow my machine learning skills and deepen my understanding of the entire modeling pipeline, from data cleaning all the way to final evaluation, something I plan to incorporate in my future projects.