This code-through applies the clustering techniques we have learned to rhythm game analysis. Using the downloadable data from the rhythm game osu! along with the existing Python parser ‘osrparse’, we will create difficulty clusters that section parts of a song into easy, medium, hard, or very hard. We will then create a heatmap of the clusters over time so we can see the different difficulty sections of a song, overlay the life bars from replay files of the same song, and cross-reference the heatmap and the life bars to see if there are any patterns.
osu! is a popular online rhythm game where players hit notes in time with music. The notes are represented by a combination of circles and sliders mapped to the rhythm of the song, and each level is designed so patterns get harder or easier with the music. Every tap location, accuracy value, and movement is recorded, giving us a unique ability to analyze gameplay data in a meaningful way.
Gameplay Screenshot
include_graphics("C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/osu! interface.jpg")
Results Screenshot
include_graphics("C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/accuracy in osu!.jpg")
Since R gives us useful data analysis tools, we want to take the binary replay data from the .osr file and make it readable in R. First, we need to create a Python environment within R to host the ‘osrparse’ module. The reticulate package acts as the translator between Python and R, and we use Miniconda to create the actual Python environment for our osu! data. See the documentation below:
#INSTALL RETICULATE AND MINICONDA IF NOT ALREADY INSTALLED
#install.packages("reticulate")
#reticulate::install_miniconda()
#ACCEPT TERMS OF SERVICE FOR MINICONDA (Change to your local miniconda file path)
#system('"C:/Users/ajare/AppData/Local/r-miniconda/condabin/conda.bat" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/main')
#system('"C:/Users/ajare/AppData/Local/r-miniconda/condabin/conda.bat" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/r')
#system('"C:/Users/ajare/AppData/Local/r-miniconda/condabin/conda.bat" tos accept --override-channels --channel https://repo.anaconda.com/pkgs/msys2')
#CREATES PYTHON ENVIRONMENT
#reticulate::conda_create("osu_env", packages = "python=3.10", channel = "conda-forge")
#INSTALLS OSRPARSE INTO THE ENVIRONMENT, PIP = TRUE IS CRUCIAL
#reticulate::conda_install("osu_env", packages = "osrparse", pip = TRUE)
#LINKS R TO THE ENVIRONMENT
use_condaenv("osu_env", required = TRUE)
#CHECKS THAT PYTHON RUNS CORRECTLY
py_config()
## python: C:/Users/ajare/AppData/Local/r-miniconda/envs/osu_env/python.exe
## libpython: C:/Users/ajare/AppData/Local/r-miniconda/envs/osu_env/python310.dll
## pythonhome: C:/Users/ajare/AppData/Local/r-miniconda/envs/osu_env
## version: 3.10.19 | packaged by conda-forge | (main, Oct 22 2025, 22:23:22) [MSC v.1944 64 bit (AMD64)]
## Architecture: 64bit
## numpy: [NOT FOUND]
##
## NOTE: Python version was forced by use_python() function
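Before importing the module, we can optionally confirm that reticulate can actually see ‘osrparse’ inside the environment. This is a quick sanity check; if it returns FALSE, revisit the pip install step above.
#OPTIONAL: CHECKS THAT THE OSRPARSE MODULE IS VISIBLE TO RETICULATE
py_module_available("osrparse")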
Now that the Python environment works, we are going to import ‘osrparse’ so that we can extract the data from the osu! replay files (.osr):
osrparse <- import("osrparse")
# Create Replay variable
Replay <- osrparse$Replay
# Create path to Replay directory
TechnoKitty_replay_dir <- "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays"
# List all .osr files
TechnoKitty_replay_files <- list.files(
TechnoKitty_replay_dir,
pattern = ".osr",
full.names = TRUE
)
TechnoKitty_replay_files
## [1] "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays/NghtLife - S3RL feat Sara - Techno Kitty [Extreme] (2025-11-30) Osu-1.osr"
## [2] "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays/NghtLife - S3RL feat Sara - Techno Kitty [Extreme] (2025-11-30) Osu-2.osr"
## [3] "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays/NghtLife - S3RL feat Sara - Techno Kitty [Extreme] (2025-11-30) Osu-3.osr"
## [4] "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays/NghtLife - S3RL feat Sara - Techno Kitty [Extreme] (2025-11-30) Osu-4.osr"
## [5] "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Replays/NghtLife - S3RL feat Sara - Techno Kitty [Extreme] (2025-11-30) Osu-5.osr"
We can confirm there are 5 different replay files in this folder. We are now going to parse each replay and extract its life bar (see the osrparse documentation above), which is a good representation of performance over time:
TechnoKitty_LifeBar_all <- lapply(TechnoKitty_replay_files, function(f) {
# Parse replay
replay_obj <- Replay$from_path(f)
# Extract life bar list
lb_list <- replay_obj$life_bar_graph
# Handle replays with no life bar info
if (length(lb_list) == 0) return(NULL)
# Convert to tibble
tibble(
replay_file = basename(f),
time_ms = sapply(lb_list, function(s) s$time),
life = sapply(lb_list, function(s) s$life)
)
}) %>%
# Stack them
bind_rows() %>%
mutate(
time_sec = time_ms / 1000,
replay_id = factor(replay_file)
)
TechnoKitty_LifeBar_all <- TechnoKitty_LifeBar_all %>%
mutate(
replay_id = factor(as.numeric(factor(replay_id))) # converts 5 files → 1,2,3,4,5
)
TechnoKitty_LifeBar_all %>%
slice_sample(n = 10)
## # A tibble: 10 × 5
## replay_file time_ms life time_sec replay_id
## <chr> <int> <dbl> <dbl> <fct>
## 1 NghtLife - S3RL feat Sara - Techno Kitty [E… 99527 1 99.5 1
## 2 NghtLife - S3RL feat Sara - Techno Kitty [E… 36783 1 36.8 2
## 3 NghtLife - S3RL feat Sara - Techno Kitty [E… 60268 1 60.3 4
## 4 NghtLife - S3RL feat Sara - Techno Kitty [E… 42973 1 43.0 4
## 5 NghtLife - S3RL feat Sara - Techno Kitty [E… 14342 1 14.3 4
## 6 NghtLife - S3RL feat Sara - Techno Kitty [E… 17068 1 17.1 1
## 7 NghtLife - S3RL feat Sara - Techno Kitty [E… 30097 0.8 30.1 5
## 8 NghtLife - S3RL feat Sara - Techno Kitty [E… 14352 1 14.4 1
## 9 NghtLife - S3RL feat Sara - Techno Kitty [E… 71239 1 71.2 5
## 10 NghtLife - S3RL feat Sara - Techno Kitty [E… 62325 1 62.3 1
We will use this data later during visualization.
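As an optional sanity check before the full visualization, a minimal plot of the extracted life bars (one line per replay) confirms the data parsed sensibly:
# Optional preview: life bar over time for each replay
ggplot(TechnoKitty_LifeBar_all,
       aes(x = time_sec, y = life, color = replay_id)) +
  geom_line(linewidth = 0.5) +
  labs(x = "Time (seconds)", y = "Life Bar", color = "Replay #") +
  theme_classic()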
We will be using the song Techno Kitty by S3RL feat. Sara, on the Extreme difficulty, as that is the difficulty our 5 replays were played on. We create a file path to the .osz file, unzip its contents, and locate the correct difficulty:
TechnoKitty_osz <- "C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/345450 S3RL feat Sara - Techno Kitty.osz"
# Create a folder to unzip into
unzip_dir <- file.path(tempdir(), "techno_kitty_map")
if (!dir.exists(unzip_dir)) dir.create(unzip_dir)
# Unzip the .osz contents
unzip(TechnoKitty_osz, exdir = unzip_dir)
# List all .osu difficulty files inside the map
TechnoKitty_files <- list.files(unzip_dir, pattern = "\\.osu$", full.names = TRUE)
TechnoKitty_files
## [1] "C:\\Users\\ajare\\AppData\\Local\\Temp\\Rtmp8cklcW/techno_kitty_map/S3RL feat Sara - Techno Kitty (Bakari) [Average].osu"
## [2] "C:\\Users\\ajare\\AppData\\Local\\Temp\\Rtmp8cklcW/techno_kitty_map/S3RL feat Sara - Techno Kitty (Bakari) [Challenging].osu"
## [3] "C:\\Users\\ajare\\AppData\\Local\\Temp\\Rtmp8cklcW/techno_kitty_map/S3RL feat Sara - Techno Kitty (Bakari) [Extreme].osu"
## [4] "C:\\Users\\ajare\\AppData\\Local\\Temp\\Rtmp8cklcW/techno_kitty_map/S3RL feat Sara - Techno Kitty (Bakari) [Meowcalypse!].osu"
## [5] "C:\\Users\\ajare\\AppData\\Local\\Temp\\Rtmp8cklcW/techno_kitty_map/S3RL feat Sara - Techno Kitty (Bakari) [Simple].osu"
Now we make a new variable for the Extreme difficulty by selecting the 3rd item:
TechnoKitty_extreme <- TechnoKitty_files[3]
basename(TechnoKitty_extreme)
## [1] "S3RL feat Sara - Techno Kitty (Bakari) [Extreme].osu"
Now we need to count the total number of circles and sliders in the map, otherwise known as hit objects. In layman's terms, this is the total number of ‘notes’ that the player has to hit over the course of the song.
We can take a look at the osu! documentation to understand the underlying file structure. Hit objects are the last section of the file: osu! documentation
# Read all lines from the Extreme difficulty file
TechnoKitty_lines_extreme <- readLines(TechnoKitty_extreme, encoding = "UTF-8")
#Check the total length of the file
length(TechnoKitty_lines_extreme)
## [1] 398
# Find the [HitObjects] line
TechnoKitty_HitObjectHeader <- which(trimws(TechnoKitty_lines_extreme) == "[HitObjects]")
TechnoKitty_HitObjectHeader
## [1] 88
# All lines AFTER the [HitObjects] header are hitobject definitions
TechnoKitty_HitLines_extreme <- TechnoKitty_lines_extreme[(TechnoKitty_HitObjectHeader + 1):length(TechnoKitty_lines_extreme)]
# Checks how many objects there are after filtering
length(TechnoKitty_HitLines_extreme)
## [1] 310
sample(TechnoKitty_HitLines_extreme, 10)
## [1] "196,32,86497,1,0,1:0:0:0:"
## [2] "256,216,101926,1,0,1:0:0:0:"
## [3] "96,60,45526,2,0,L|60:232,1,154.000004699707,4|8,1:3|1:2,0:0:0:0:"
## [4] "116,188,101583,1,0,1:0:0:0:"
## [5] "117,101,85640,2,0,L|64:157,1,77.0000023498536,8|0,1:2|1:0,0:0:0:0:"
## [6] "308,104,25126,1,0,0:0:0:0:"
## [7] "416,320,106897,2,0,P|376:320|320:300,1,83.9999974365235,4|0,1:3|1:3,0:0:0:0:"
## [8] "272,188,26155,2,0,P|220:208|168:304,1,140,0|0,3:0|0:0,0:0:0:0:"
## [9] "444,120,100383,2,0,L|260:96,1,167.999994873047,4|8,1:3|1:2,0:0:0:0:"
## [10] "324,268,46555,1,8,1:2:0:0:"
From the documentation we know what the hit object syntax represents.
Hit Object Syntax
include_graphics("C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Hit Object Syntax.png")
We need to split the relevant values into individual fields, then turn that data into a tibble. From the hit object syntax we want to grab the x-position, y-position, and time. We also want to grab the 8th value of the slider syntax, which gives us the slider length. If no pixelLength is available, the object is a circle rather than a slider, and we assign NA for pixel_length on circles.
Slider Syntax
include_graphics("C:/Users/ajare/OneDrive/Desktop/RStudio Desktop/PAF 516/Code-Through/Slider Syntax.png")
#Splits the string into individual fields
TechnoKitty_HitSplit_extreme <- strsplit(TechnoKitty_HitLines_extreme, ",", fixed = TRUE)
#Converts data into a tibble
TechnoKitty_HitDF_extreme <- tibble(
x = as.numeric(sapply(TechnoKitty_HitSplit_extreme, '[', 1)),
y = as.numeric(sapply(TechnoKitty_HitSplit_extreme, '[', 2)),
time_ms = as.numeric(sapply(TechnoKitty_HitSplit_extreme, '[', 3)),
pixel_length = suppressWarnings(as.numeric(sapply(
TechnoKitty_HitSplit_extreme,
function(z) if (length(z) >= 8) z[8] else NA
)))
) %>%
arrange(time_ms)
head(TechnoKitty_HitDF_extreme)
## # A tibble: 6 × 4
## x y time_ms pixel_length
## <dbl> <dbl> <dbl> <dbl>
## 1 24 272 11926 140
## 2 132 88 12440 140
## 3 164 256 12955 NA
## 4 288 368 13297 140
## 5 480 272 13812 NA
## 6 412 220 13983 NA
Jump strain is calculated from the previous note's location and time: the Euclidean distance to the previous object divided by the time between them, giving cursor movement in pixels per second. In formula form, with times in milliseconds:
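$$
\text{jump\_strain}_i = \frac{\sqrt{(x_i - x_{i-1})^2 + (y_i - y_{i-1})^2}}{(t_i - t_{i-1}) / 1000}
$$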
TechnoKitty_HitDF_extreme <- TechnoKitty_HitDF_extreme %>%
arrange(time_ms) %>%
mutate(
# previous object position and time
prev_x = dplyr::lag(x),
prev_y = dplyr::lag(y),
prev_time = dplyr::lag(time_ms),
# time between notes
dt_ms = time_ms - prev_time,
dt_ms = ifelse(dt_ms <= 0, NA_real_, dt_ms),
dt_sec = dt_ms / 1000,
#distance from previous object
dist_from_prev = sqrt((x - prev_x)^2 + (y - prev_y)^2),
# jump strain = movement per second
jump_strain = dist_from_prev / dt_sec
)
head(TechnoKitty_HitDF_extreme)
## # A tibble: 6 × 11
## x y time_ms pixel_length prev_x prev_y prev_time dt_ms dt_sec
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 24 272 11926 140 NA NA NA NA NA
## 2 132 88 12440 140 24 272 11926 514 0.514
## 3 164 256 12955 NA 132 88 12440 515 0.515
## 4 288 368 13297 140 164 256 12955 342 0.342
## 5 480 272 13812 NA 288 368 13297 515 0.515
## 6 412 220 13983 NA 480 272 13812 171 0.171
## # ℹ 2 more variables: dist_from_prev <dbl>, jump_strain <dbl>
We are going to pre-define our clusters for meaningful analysis based on two major variables, notes_per_sec (object density) and mean_jump_strain (average movement difficulty), plus total slider pixels. We will assign equal major weights to the first two variables when considering their impact on difficulty level, and a small weight to sliders. In reality, the difficulty rating for maps is much more complicated and accounts for many elements of gameplay; an entire dev team determines how the ranking system works in actual osu!.
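Concretely, the per-cluster difficulty score we compute later in this code-through is

$$
\text{rough\_diff} = 1.0 \cdot z(\text{mean density}) + 1.0 \cdot z(\text{mean jump strain}) + 0.3 \cdot z(\text{mean slider pixels}),
$$

where $z(\cdot)$ standardizes each variable across the clusters.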
Now we divide the song into 1-second windows where we can look at density, average jump strain, and total slider pixels per window:
# length of each analysis window in milliseconds
window_size <- 1000
TechnoKitty_Windows_1s_base <- TechnoKitty_HitDF_extreme %>%
arrange(time_ms) %>%
mutate(
window_id = floor(time_ms / window_size) + 1
) %>%
group_by(window_id) %>%
summarise(
n_objects = dplyr::n(),
notes_per_sec = n_objects / (window_size / 1000),
mean_jump_strain = mean(jump_strain, na.rm = TRUE),
total_slider_pixels = sum(pixel_length, na.rm = TRUE),
.groups = "drop"
)
# builds a complete sequence of windows (fills gaps where a window has no hit objects/notes)
max_window <- max(TechnoKitty_Windows_1s_base$window_id, na.rm = TRUE)
all_windows <- tibble(window_id = 1:max_window)
TechnoKitty_Windows_extreme <- all_windows %>%
left_join(TechnoKitty_Windows_1s_base, by = "window_id") %>%
mutate(
n_objects = tidyr::replace_na(n_objects, 0L),
notes_per_sec = dplyr::if_else(n_objects == 0, 0, notes_per_sec),
mean_jump_strain = dplyr::if_else(n_objects == 0, NA_real_, mean_jump_strain),
total_slider_pixels = tidyr::replace_na(total_slider_pixels, 0),
start_time = (window_id - 1L) * window_size,
end_time = window_id * window_size - 1L,
start_sec = start_time / 1000,
end_sec = end_time / 1000
)
TechnoKitty_Windows_extreme %>%
dplyr::slice_sample(n = 10)
## # A tibble: 10 × 9
## window_id n_objects notes_per_sec mean_jump_strain total_slider_pixels
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 82 4 4 465. 77.0
## 2 10 0 0 NA 0
## 3 48 3 3 498. 231.
## 4 87 2 2 490. 154.
## 5 106 4 4 678. 168.
## 6 56 4 4 514. 217.
## 7 89 2 2 58.9 84.0
## 8 83 3 3 518. 154.
## 9 57 2 2 290. 70
## 10 46 1 1 455. 154.
## # ℹ 4 more variables: start_time <dbl>, end_time <dbl>, start_sec <dbl>,
## # end_sec <dbl>
Now that we have all the variables we need, we can create our clusters with mclust. We want to exclude windows with no notes from the clustering:
TechnoKitty_Windows_1s <- TechnoKitty_Windows_extreme %>%
arrange(window_id) %>%
mutate(
density_1s = notes_per_sec,
jump_1s = mean_jump_strain,
slider_1s = total_slider_pixels
)
# only clusters windows that contain at least one object
cluster_data <- TechnoKitty_Windows_1s %>%
filter(n_objects > 0) %>%
select(window_id, start_sec, end_sec, density_1s, jump_1s, slider_1s) %>%
mutate(
density_1s = ifelse(is.na(density_1s), 0, density_1s),
jump_1s = ifelse(is.na(jump_1s), 0, jump_1s),
slider_1s = ifelse(is.na(slider_1s), 0, slider_1s)
)
# Mclust
cluster_features <- cluster_data %>%
select(density_1s, jump_1s, slider_1s) %>%
scale() %>%
as.matrix()
set.seed(123)
TechnoKitty_mclust <- Mclust(cluster_features, G = 4)
table(TechnoKitty_mclust$classification)
##
## 1 2 3 4
## 29 38 28 5
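We fix G = 4 because we want exactly four difficulty tiers. As an optional check (a quick sketch, not required for the rest of the analysis), mclust can also select the number of components itself via BIC, which is worth comparing against our fixed choice:
# Optional: let mclust choose the number of components by BIC
TechnoKitty_mclust_free <- Mclust(cluster_features)
summary(TechnoKitty_mclust_free)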
We assign those clusters back to the windows and create simplified but commonly used tiers: Easy, Medium, Hard, Very Hard. Note that the Extreme difficulty we are using refers to this map's difficulty relative to other maps of the same song, while our clusters represent the difficulty of sections within a single map.
# attach cluster labels back to clustered windows
cluster_assign <- cluster_data %>%
mutate(cluster_raw = TechnoKitty_mclust$classification)
# summarize cluster profiles
cluster_profile <- cluster_assign %>%
group_by(cluster_raw) %>%
summarise(
mean_density = mean(density_1s),
mean_jump = mean(jump_1s),
mean_slider = mean(slider_1s),
.groups = "drop"
) %>%
mutate(
# Weighted difficulty score:
rough_diff =
1.0 * scale(mean_density)[, 1] +
1.0 * scale(mean_jump)[, 1] +
0.3 * scale(mean_slider)[, 1], # reduced influence of sliders
difficulty_rank = rank(rough_diff)
)
difficulty_labels <- c("Easy", "Medium", "Hard", "Very Hard")
cluster_lut <- cluster_profile %>%
mutate(
difficulty_tier = factor(
difficulty_labels[difficulty_rank],
levels = difficulty_labels)
) %>%
select(cluster_raw, difficulty_tier)
# attach difficulty tiers back to all windows (including empty ones)
TechnoKitty_Windows_extreme <- TechnoKitty_Windows_extreme %>%
select(-dplyr::any_of(c("cluster_raw", "difficulty_tier"))) %>%
left_join(cluster_assign %>% select(window_id, cluster_raw),
by = "window_id") %>%
left_join(cluster_lut, by = "cluster_raw")
TechnoKitty_Windows_extreme %>%
dplyr::slice_sample(n = 10)
## # A tibble: 10 × 11
## window_id n_objects notes_per_sec mean_jump_strain total_slider_pixels
## <dbl> <int> <dbl> <dbl> <dbl>
## 1 31 3 3 370. 70
## 2 79 3 3 397. 154.
## 3 51 3 3 381. 154.
## 4 14 3 3 469. 140
## 5 67 3 3 362. 147.
## 6 42 5 5 575. 84.0
## 7 50 2 2 359. 154.
## 8 43 4 4 535. 168.
## 9 101 3 3 504. 168.
## 10 108 3 3 564. 168.
## # ℹ 6 more variables: start_time <dbl>, end_time <dbl>, start_sec <dbl>,
## # end_sec <dbl>, cluster_raw <dbl>, difficulty_tier <fct>
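Before plotting, a quick tally shows how many one-second windows landed in each tier (NA corresponds to the empty windows that were excluded from clustering):
# Count windows per difficulty tier (NA = windows with no hit objects)
TechnoKitty_Windows_extreme %>%
  count(difficulty_tier)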
Let’s now create a heatmap of the difficulty over time so we can see how the difficulty changes throughout the song, and overlay the life bars to look for patterns. Theoretically, we could see major performance dips during hard sections and good performance during easy sections.
# Difficulty heatmap
ggplot() +
geom_rect(
data = TechnoKitty_Windows_extreme,
aes(
xmin = start_sec,
xmax = end_sec,
ymin = 0,
ymax = 1,
fill = difficulty_tier
),
color = NA,
alpha = 0.4
) +
geom_line(
data = TechnoKitty_LifeBar_all,
aes(
x = time_sec,
y = life,
group = replay_id,
color = replay_id
),
inherit.aes = FALSE,
linewidth = .5
) +
# Difficulty colors
scale_fill_manual(
values = c(
"Easy" = "#4CAF50",
"Medium" = "#F4D03F",
"Hard" = "#E74C3C",
"Very Hard" = "#8E44AD",
"NA" = "#7F8C8D"
),
na.value = "#7F8C8D"
) +
# Replay colors (1–5)
scale_color_manual(
name = "Replay #",
values = c(
"1" = "#0074D9",
"2" = "#7FDBFF",
"3" = "#FF851B",
"4" = "#8B4513",
"5" = "#111111"
)
) +
theme_classic(base_size = 12) +
theme(
legend.position = "right",
legend.box = "vertical",
plot.title = element_text(hjust = 0.5)
) +
labs(
title = "Techno Kitty [Extreme] – Difficulty Timeline with Replay Life Bars",
x = "Time (seconds)",
y = "Life Bar",
fill = "Difficulty Tier"
) +
guides(
fill = guide_legend(order = 1),
color = guide_legend(order = 2)
)
We can’t expect perfect patterns for a couple of reasons. First, our model of difficulty is severely oversimplified, whereas the actual game’s difficulty ranking is developed by an entire team and has many moving components. Second, replay data is based on player performance, which doesn’t correlate linearly with difficulty: players can do really well on a hard section or poorly on an easy section.
However, this can still give us some insight, as there is a clear hard section near the end of the song, around the 90-second mark, where players repeatedly struggled. There also appears to be a medium/hard section just before 30 seconds where performance started to fall off on each run.
To reiterate, the true ranking system for map difficulty is much more complex; this example is an attempt at using clusters to identify which parts of a song are easy or hard, using a very simplified difficulty calculation.