The only packages we’ll need are ROCR (to measure prediction accuracy) and tidyverse (for everything else — tidyverse.org):

library(ROCR)
library(tidyverse)

 

Marathonbet has started offering bets on Makuuchi bouts. I’ve collected the odds for the January and March tournaments from betmarathon.com/en/betting/Sumo/ (the page is available only during Honbasho):

odds <- read_csv("odds.csv", col_types = "cicdcd")
odds %>% arrange(basho, day)
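
The col_types string is readr's compact column specification, one letter per column: c for character, i for integer, d for double. Here is a minimal sketch on a made-up two-row file. The column names are inferred from the code used later in this post, the basho format and all values are invented, and it assumes readr 2.0 or later, where literal data is wrapped in I():

```r
library(readr)

# Hypothetical file contents: placeholder wrestlers and invented odds,
# not rows from odds.csv
csv_text <- "basho,day,rikishi1,odds1,rikishi2,odds2
2017.01,2,rikishi_a,1.85,rikishi_b,1.95
2017.01,2,rikishi_c,1.40,rikishi_d,2.75"

# "cicdcd" = character, integer, character, double, character, double
toy_odds <- read_csv(I(csv_text), col_types = "cicdcd")
str(toy_odds)
```

Reading basho as character keeps an identifier such as "2017.01" from being parsed as a number.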

 

I missed the first day of the January tournament, so I asked customer service whether there was an archive of historical odds. The answer was no:

Please be aware that this option is not available.
Due to the odds are movement depents of the fluctuations of the market.

Anyway, 541 bouts should be enough for our purposes:

odds %>% count(basho, day)

 

Results for all divisions are easily fetched from Sumo Reference (day 16 stands for play-offs):

results <- read_csv("results.csv", col_types = "cicicic")
results %>% count(basho, day)

 

To join odds with results:

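  1. As they don’t necessarily order wrestlers (rikishi1, rikishi2) in the same way, stack odds on top of a mirrored copy of itself, with the wrestlers and their odds swapped.
  2. Calculate the implied probability of the first wrestler winning, normalized so that the two wrestlers’ probabilities sum to 1.
  3. Filter out forfeits (see fusen).

# merge() joins on the columns the two tables have in common
odds_and_results <- merge(
    # every bout in both orientations, so either ordering in results finds a match
    rbind(
        odds,
        odds %>% rename(
            rikishi1 = rikishi2, odds1 = odds2,
            rikishi2 = rikishi1, odds2 = odds1
        )
    ) %>% mutate(win1_prob = odds2 / (odds1 + odds2)),
    # drop bouts decided by forfeit
    results %>% filter(kimarite != "fusen")
)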
odds_and_results
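
To see why odds2 / (odds1 + odds2) is the implied probability, here is a minimal base R sketch of the mirroring and probability steps on a single made-up bout. It assumes the odds are decimal (European) odds, which is what the formula implies; the wrestler names and numbers are placeholders, not data from odds.csv:

```r
# One hypothetical bout with invented decimal odds
odds <- data.frame(
    rikishi1 = "rikishi_a", odds1 = 1.40,
    rikishi2 = "rikishi_b", odds2 = 2.75
)

# Mirrored copy: the same bout seen from the second wrestler's side
mirrored <- data.frame(
    rikishi1 = odds$rikishi2, odds1 = odds$odds2,
    rikishi2 = odds$rikishi1, odds2 = odds$odds1
)
both <- rbind(odds, mirrored)
both$win1_prob <- both$odds2 / (both$odds1 + both$odds2)

# Raw implied probabilities are inverse odds; they sum to more than 1
# because of the bookmaker's margin (the overround)
raw <- c(1 / odds$odds1, 1 / odds$odds2)
overround <- sum(raw) - 1

# Dividing by the sum removes the margin and gives the same value
# as the shortcut formula
win1_normalized <- raw[1] / sum(raw)

print(both)
print(c(overround = overround, win1_normalized = win1_normalized))
```

With these numbers the overround is about 7.8%, the first wrestler's normalized probability is about 66%, and the two mirrored rows add up to 1.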

 

Sanity check: how is the implied win probability distributed for winners and losers?

ggplot(odds_and_results, aes(factor(win1, labels = c("lost", "won")), win1_prob)) + 
    geom_boxplot(outlier.size = .5) + 
    geom_jitter(size = .5, width = .1) + 
    labs(x = "", y = "implied win probability")

 

To measure prediction accuracy, we’ll use the ROC curve and the area under it (AUC):

pred <- prediction(
    predictions = odds_and_results$win1_prob,
    labels = odds_and_results$win1
)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
ggplot(data.frame(x = perf@x.values[[1]], y = perf@y.values[[1]]), aes(x, y)) + 
    geom_line() + 
    labs(
        title = sprintf("%.1f%%", unlist(performance(pred, "auc")@y.values) * 100),
        subtitle = "area under ROC curve",
        x = "false positive rate",
        y = "true positive rate"
    ) + 
    geom_abline(intercept = 0, slope = 1, linetype = "dotted")
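
The area under the ROC curve has a direct reading: it is the probability that a randomly chosen winner was given a higher implied probability than a randomly chosen loser, with ties counted as half. A minimal base R sketch on invented numbers (not the real odds), which should agree with what performance(pred, "auc") reports for the same inputs:

```r
# Hypothetical implied probabilities and outcomes (1 = first wrestler won)
win1_prob <- c(0.80, 0.65, 0.55, 0.45, 0.35, 0.20)
win1      <- c(1,    1,    0,    1,    0,    0)

pos <- win1_prob[win1 == 1]  # scores given to eventual winners
neg <- win1_prob[win1 == 0]  # scores given to eventual losers

# Compare every winner with every loser:
# 1 for a correct ranking, 0.5 for a tie, 0 otherwise
cmp <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
auc <- mean(cmp)
print(auc)
```

Here 8 of the 9 winner-loser pairs are ranked correctly, so the AUC is 8/9. A value of 50% would mean the odds carry no information, which is the dotted diagonal in the plot.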

 
