Unfortunately, this database is rather inconvenient for data scraping. Data for each rikishi, of which there are over 600 currently competing in each tournament, is kept on its own HTML page. This data includes their hometown, which will be the main focus of my analyses for this project. Luckily, the link to each rikishi page has the form “http://sumodb.sumogames.de/Rikishi.aspx?r=” followed by a numeric ID for that rikishi, such as “12191”. To get these IDs, we can scrape the page listing the results of the most recent tournament, http://sumodb.sumogames.de/Banzuke.aspx. The entry for each rikishi on that page contains a hyperlink to their individual page, so we first collect every such link and then isolate the ID by manipulating the link string.
## [1] "11927" "12191" "12231" "12451" "11985"
## [1] 617
This output shows the first 5 IDs extracted by scraping, as well as the length of the vector of IDs. From this, we can conclude that 617 rikishi competed in the tournament this past March.
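The string manipulation described above can be sketched as follows. This is a Python illustration rather than the code used for this report, and it assumes the links appear either as relative hrefs (`Rikishi.aspx?r=...`) or as full URLs; the helper name `extract_rikishi_id` is hypothetical.

```python
from urllib.parse import urlparse, parse_qs

def extract_rikishi_id(href):
    """Return the value of the `r` query parameter from a link, or None."""
    query = urlparse(href).query
    values = parse_qs(query).get("r")
    return values[0] if values else None

# Example hrefs in the form described above (assumed, not copied from the page).
links = [
    "Rikishi.aspx?r=11927",
    "http://sumodb.sumogames.de/Rikishi.aspx?r=12191",
    "Banzuke.aspx",  # not a rikishi link, so it yields None and is dropped
]
ids = [rid for rid in map(extract_rikishi_id, links) if rid is not None]
print(ids)
```

Parsing the query string instead of slicing a fixed number of characters keeps the extraction working if the ID length or the URL prefix changes.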
For the first part of this project, I will be analyzing how rikishi are distributed across Japan’s 47 prefectures. To get the home prefecture of each rikishi, we need to scrape their individual page. This can be achieved by mapping a function over the rikishi IDs that scrapes each page and extracts the ring name (shikona) and home prefecture. The results of this mapping can then be passed to bind_rows() to create a single tibble. However, doing this for all 617 rikishi in quick succession would be impolite, as it could overload the database’s server. Doing so politely (leaving 5 seconds between requests) takes about 50 minutes. So, instead, I have run this function once and saved its result as a CSV. To show that the function works, the following output comes from running it on three random IDs from the vector obtained earlier.
| shikona | home_prefecture | rikishi_id |
|---|---|---|
| Asashiyu Ryoga | Mie | 12369 |
| Takakento Terutora | Kumamoto | 12114 |
| Wakasa Kengo | Hokkaido | 12745 |
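The polite scraping loop and its run time can be sketched as below. This is a Python illustration, not the function used for this report: `scrape_politely`, `estimated_minutes` and `fake_fetch` are hypothetical names, and the fetcher is a stand-in that makes no network requests.

```python
import time

def scrape_politely(ids, fetch_one, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch_one(id) for each ID, pausing between consecutive requests."""
    rows = []
    for i, rikishi_id in enumerate(ids):
        if i > 0:
            sleep(delay_seconds)  # wait before every request except the first
        rows.append(fetch_one(rikishi_id))
    return rows

def estimated_minutes(n_pages, delay_seconds=5.0):
    """Lower bound on run time: counts only the pauses, not download time."""
    return (n_pages - 1) * delay_seconds / 60

# Stand-in fetcher (hypothetical): a real one would download and parse the page.
def fake_fetch(rikishi_id):
    return {"rikishi_id": rikishi_id, "shikona": None, "home_prefecture": None}

pauses = []  # records the requested pauses instead of actually sleeping
rows = scrape_politely(["12369", "12114", "12745"], fake_fetch, sleep=pauses.append)
print(len(rows), pauses)
print(round(estimated_minutes(617), 1))
```

Passing the sleep function as a parameter makes the delay easy to test without waiting, while the default still pauses for real.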
Reading the CSV provides results in the same format as the random sample above, and saves plenty of time:
| shikona | home_prefecture | rikishi_id |
|---|---|---|
| Terunofuji Haruo | Mongolia | 11927 |
| Takakeisho Mitsunobu | Hyogo | 12191 |
| Wakatakakage Atsushi | Fukushima | 12370 |
| Hoshoryu Tomokatsu | Mongolia | 12451 |
| Wakamotoharu Minato | Fukushima | 11980 |
It should be noted that there are foreign rikishi that are still included in the data above. Their home prefecture has been set as their home country. Now, rikishi can be grouped by their home prefecture/country to derive the number of rikishi from each.
| home_prefecture | number_of_rikishi |
|---|---|
| Tokyo | 52 |
| Saitama | 33 |
| Aichi | 31 |
| Osaka | 30 |
| Hyogo | 28 |
Evidently, Tokyo prefecture is home to the most sumo wrestlers with 52. We can now also use a bar graph to plot how many rikishi are from each prefecture/country.
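The grouping step amounts to counting rows per home prefecture or country. A minimal Python sketch (not the report's own code), using only the five rows shown in the CSV preview earlier, might look like this:

```python
from collections import Counter

# The five rows from the CSV preview: (shikona, home_prefecture).
rows = [
    ("Terunofuji Haruo", "Mongolia"),
    ("Takakeisho Mitsunobu", "Hyogo"),
    ("Wakatakakage Atsushi", "Fukushima"),
    ("Hoshoryu Tomokatsu", "Mongolia"),
    ("Wakamotoharu Minato", "Fukushima"),
]

counts = Counter(home for _shikona, home in rows)

# Sort by descending count, breaking ties alphabetically.
number_of_rikishi = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
for home, n in number_of_rikishi:
    print(home, n)
```

Because it sees only five rows, the counts here do not match the full table above; the point is the shape of the operation.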
Our first analysis will be to see if there are any prefectures that produce abnormal numbers of rikishi relative to their size. In order to do this, we need population data for Japan’s prefectures. This can be obtained by scraping the Wikipedia page for Japan’s prefectures, https://en.wikipedia.org/wiki/Prefectures_of_Japan. We will also append geometry data sourced from https://www.diva-gis.org/datadown to this table in order to be able to create choropleth maps later. The population of each prefecture is plotted below.
This data can now be joined with our rikishi counts from earlier. A byproduct of doing so is that we are able to isolate foreign rikishi. They are dropped from the resulting table as they are not our current focus. For each prefecture, we can add the variables rikishi_pref_prop, which represents the proportion of Japanese rikishi that are from a given prefecture, and pop_pref_prop, which represents the proportion of Japan’s national population that lives in that prefecture. Using these, we can define a third variable, diff_in_prop, which is the difference between these two proportions (rikishi_pref_prop - pop_pref_prop). Lastly, the variable rikishi_per_mil is defined as the number of rikishi per 1,000,000 people from each prefecture, and this variable is mapped to each prefecture in the following figure:
We can now determine which prefectures produce disproportionate numbers of rikishi relative to their population.
| home_prefecture | number_of_rikishi | region | pref_pop | rikishi_pref_prop | pop_pref_prop | diff_in_prop | rikishi_per_mil |
|---|---|---|---|---|---|---|---|
| Kagoshima | 21 | Kyushu | 1650000 | 0.0357143 | 0.0128153 | 0.0228990 | 12.727273 |
| Kanagawa | 26 | Kanto | 9278000 | 0.0442177 | 0.0720610 | -0.0278433 | 2.802328 |
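The derived columns can be reproduced from the counts and populations in these two rows. The Python sketch below is illustrative only; the two totals are assumptions back-calculated from the table's proportions (21 / 0.0357143 and 1650000 / 0.0128153), since neither total is stated directly.

```python
def prefecture_metrics(n_rikishi, pref_pop, total_rikishi, total_pop):
    """Compute the four derived columns for one prefecture."""
    rikishi_prop = n_rikishi / total_rikishi
    pop_prop = pref_pop / total_pop
    return {
        "rikishi_pref_prop": rikishi_prop,
        "pop_pref_prop": pop_prop,
        "diff_in_prop": rikishi_prop - pop_prop,
        "rikishi_per_mil": n_rikishi / pref_pop * 1_000_000,
    }

TOTAL_RIKISHI = 588      # assumption: Japanese rikishi, inferred from the table
TOTAL_POP = 128_752_000  # assumption: national population, inferred from the table

kagoshima = prefecture_metrics(21, 1_650_000, TOTAL_RIKISHI, TOTAL_POP)
kanagawa = prefecture_metrics(26, 9_278_000, TOTAL_RIKISHI, TOTAL_POP)

for name, metrics in (("Kagoshima", kagoshima), ("Kanagawa", kanagawa)):
    print(name, {key: round(value, 7) for key, value in metrics.items()})
```

A positive diff_in_prop means a prefecture supplies a larger share of rikishi than its share of the population, and a negative value means the opposite.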
The table above shows the prefectures that are most over- and under-represented among rikishi relative to their population. Kagoshima prefecture produces disproportionately more rikishi than one would expect: despite making up only 1.28% of Japan’s population, it accounts for 3.57% of Japanese rikishi. Likewise, Kanagawa prefecture produces disproportionately fewer rikishi than one would expect: despite making up 7.21% of Japan’s population, it accounts for only 4.42% of Japanese rikishi. Plotting these differences for each prefecture yields the following figure:
A natural next step would be to perform a chi-squared goodness-of-fit test, using the count of rikishi from each prefecture as the observed values and pop_pref_prop as the expected proportions. This would indicate whether these differences in proportions are statistically significant, which would suggest an underlying reason why certain prefectures produce proportionally more rikishi. However, the assumptions of this test are not met, as the expected count for multiple prefectures would be less than 5. We therefore cannot state whether these differences are statistically significant at the prefecture level. We can, however, go up a level in geography: Japan’s prefectures are grouped into 8 regions, and we can combine prefectures into their respective regions to see if these differences persist.
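The expected-count check behind this decision is simple arithmetic: each expected count is the total number of observed rikishi multiplied by the prefecture's population share. The Python sketch below is illustrative; the total of 588 Japanese rikishi is an assumption back-calculated from the earlier table (21 / 0.0357143), and the third entry is a made-up prefecture included only to show a failing case.

```python
def expected_counts(total_observed, expected_props):
    """Expected count per category under the population-share hypothesis."""
    return {name: total_observed * p for name, p in expected_props.items()}

def below_minimum(expected, minimum=5.0):
    """Names of categories whose expected count falls under the minimum."""
    return sorted(name for name, value in expected.items() if value < minimum)

TOTAL_JAPANESE_RIKISHI = 588  # assumption: inferred from the earlier table

props = {
    "Kagoshima": 0.0128153,  # pop_pref_prop from the earlier table
    "Kanagawa": 0.0720610,   # pop_pref_prop from the earlier table
    "Hypothetical small prefecture": 0.005,  # made-up share for illustration
}

expected = expected_counts(TOTAL_JAPANESE_RIKISHI, props)
flagged = below_minimum(expected)

# Smallest population share that still yields an expected count of 5.
min_share = 5 / TOTAL_JAPANESE_RIKISHI
print(flagged, round(min_share, 4))
```

Under these assumptions, any prefecture holding less than roughly 0.85% of the national population would have an expected count below 5.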
| region | number_of_rikishi | region_pop | rikishi_reg_prop | pop_reg_prop | diff_in_prop | rikishi_per_mil |
|---|---|---|---|---|---|---|
| Kyushu | 96 | 14704000 | 0.1632653 | 0.1142041 | 0.0490613 | 6.528836 |
| Tohoku | 55 | 8968000 | 0.0935374 | 0.0696533 | 0.0238841 | 6.132917 |
| Chubu | 110 | 21601000 | 0.1870748 | 0.1677722 | 0.0193027 | 5.092357 |
| Shikoku | 19 | 3884000 | 0.0323129 | 0.0301665 | 0.0021464 | 4.891864 |
| Chugoku | 31 | 7507000 | 0.0527211 | 0.0583059 | -0.0055848 | 4.129479 |
Once again, we can plot the rikishi per million people. However, this time we are at the regional level.
Likewise, we can plot the disproportionality between the number of rikishi from a region and its population, now at the regional level.
Fortunately, we can proceed with a chi-square test at this regional level.
##
## Chi-squared test for given probabilities
##
## data: Number of Rikishi per Region
## X-squared = 27.3, df = 7, p-value = 0.0002942
From the test above, the extremely low p-value shows that the differences between the regional proportions of rikishi and of population are statistically significant. The natural next step in this analysis is to figure out why these differences occur.
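For readers who want to see how a goodness-of-fit statistic is assembled, here is a Python sketch using the five regions listed in the regional table. Because the other three regions are not shown, they are pooled into a single "other" category, so this is a six-category test with 5 degrees of freedom and will not reproduce the eight-region result reported above. The total of 588 Japanese rikishi is an assumption inferred from the table (96 / 0.1632653).

```python
def chi_squared_statistic(observed, expected_props):
    """Sum of (observed - expected)^2 / expected over all categories."""
    total = sum(observed)
    statistic = 0.0
    for obs, p in zip(observed, expected_props):
        expected = total * p
        statistic += (obs - expected) ** 2 / expected
    return statistic

# Kyushu, Tohoku, Chubu, Shikoku, Chugoku (from the regional table).
observed = [96, 55, 110, 19, 31]
props = [0.1142041, 0.0696533, 0.1677722, 0.0301665, 0.0583059]

TOTAL = 588  # assumption: total Japanese rikishi, inferred from the table
observed.append(TOTAL - sum(observed))  # pooled count for the unlisted regions
props.append(1.0 - sum(props))          # pooled population share

statistic = chi_squared_statistic(observed, props)
df = len(observed) - 1
CRITICAL_5PCT_DF5 = 11.070  # chi-squared critical value, alpha = 0.05, df = 5
print(round(statistic, 2), df, statistic > CRITICAL_5PCT_DF5)
```

Comparing against a tabulated critical value avoids needing a statistics library for the p-value; a full analysis would use a proper chi-squared routine on all eight regions.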
People have always been drawn to join professional sumo for money. Rikishi who are good enough to rise to the top two divisions can make hundreds of thousands of dollars a year. Historically, Japanese citizens in poorer, rural regions of the country saw sumo as a way to succeed. While this association between poverty and sumo wrestling is much weaker today, it may still provide a reason for these differences.
To determine if this is a plausible explanation, we can see if there are more rikishi per capita in regions of Japan with lower GDPs. To get this GDP data for each prefecture, the Wikipedia article https://en.wikipedia.org/wiki/List_of_Japanese_prefectures_by_GDP_per_capita must be scraped. Doing so yields the following table, which pairs each prefecture with its GDP in millions of USD.
| prefecture | gdp_in_usd_mil |
|---|---|
| Tokyo | 1026340 |
| Aichi | 389325 |
| Osaka | 387105 |
| Kanagawa | 343824 |
| Saitama | 226361 |
By joining our prefecture data with this GDP data and grouping by region, we can obtain the following scatter plot:
While the plot above suggests a downward trend, it is based on only eight regions. If we fit a simple linear regression model to this data, the slope is not statistically significant (p = 0.287).
## # A tibble: 2 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 8.71 3.40 2.56 0.0427
## 2 regional_gdp_per_capita -0.000101 0.0000860 -1.17 0.287
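The columns of this output are linked by simple formulas: each statistic is the estimate divided by its standard error. The Python sketch below (not the report's own code) fits a simple least-squares line by hand on made-up points, since the regional GDP values are not listed here, and then checks the estimate / std.error relationship against the reported numbers.

```python
import math

def simple_ols(xs, ys):
    """Least-squares fit of y = intercept + slope * x with the slope's t statistic."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    se_slope = math.sqrt(rss / (n - 2) / sxx)  # residual variance uses n - 2
    return {
        "intercept": intercept,
        "slope": slope,
        "se_slope": se_slope,
        "t_slope": slope / se_slope,
    }

# Made-up points for illustration only (not the regional GDP data).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
fit = simple_ols(xs, ys)
print({key: round(value, 4) for key, value in fit.items()})

# Consistency check on the reported table: statistic = estimate / std.error.
reported_t_intercept = 8.71 / 3.40
reported_t_slope = -0.000101 / 0.0000860
print(round(reported_t_intercept, 2), round(reported_t_slope, 2))
```

With eight regions the fit has only 6 residual degrees of freedom, which is why a slope about 1.2 standard errors from zero gives such a large p-value.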
Plotting the regression line with confidence bands demonstrates that there is no significant relationship between a region’s GDP per capita and the number of rikishi per million people it produces.
However, this isn’t to say that the lure of such a large salary is not relevant anymore. If we return to our original, prefecture-level data with foreign rikishi included, one significant anomaly can be highlighted. First, we must join not only the populations of Japan’s prefectures to this data, but also the populations of the countries that are homes to foreign rikishi. If we isolate the rikishi per million people from countries other than Japan, we can get the following table.
| home_country | population | number_of_rikishi | rikishi_per_mil |
|---|---|---|---|
| Mongolia | 3495306 | 23 | 6.5802536 |
| Georgia | 3688600 | 1 | 0.2711056 |
| Bulgaria | 6519789 | 1 | 0.1533792 |
| Kazakhstan | 19849628 | 1 | 0.0503788 |
| Ukraine | 41130432 | 1 | 0.0243129 |
| Philippines | 110587187 | 1 | 0.0090426 |
| China | 1411750000 | 1 | 0.0007083 |
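The rikishi_per_mil column is the count divided by the population, scaled to one million. As a rough Python sketch (not the report's own code), Mongolia's rate can be recomputed and set against the two prefectures whose counts and populations appear in the earlier prefecture table:

```python
def per_million(n_rikishi, population):
    """Number of rikishi per 1,000,000 inhabitants."""
    return n_rikishi / population * 1_000_000

mongolia = per_million(23, 3_495_306)

# Only the two prefectures listed in the earlier table are available here.
prefectures = {
    "Kagoshima": per_million(21, 1_650_000),
    "Kanagawa": per_million(26, 9_278_000),
}

below_mongolia = sorted(
    name for name, rate in prefectures.items() if rate < mongolia
)
print(round(mongolia, 7), below_mongolia)
```

Applying the same comparison to all 47 prefectures would give the full count of prefectures that fall below Mongolia's rate.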
While the rikishi per million people is extremely low for most countries, since each has produced only one rikishi, Mongolia has produced 23 rikishi that are currently competing. This works out to an astounding 6.58 rikishi per million people from Mongolia. It is even more striking if we plot this alongside the rikishi per million people of Japan’s prefectures:
From the output above, we can see that there are 35 prefectures in Japan that produce proportionately fewer rikishi than Mongolia.
As excitingly weird as this may seem, the reason is quite mundane. Sumo wrestling closely resembles Mongolian wrestling, or Bokh, so many Mongolian wrestlers can make the switch to sumo with relative ease. Given Mongolia’s low GDP and high unemployment, there is ample motivation to pack up and move to Japan to pursue professional sumo.
Of the six sumo tournaments held each year, three are held in Tokyo, one in Osaka, one in Fukuoka, and one in Nagoya (Aichi Prefecture). There is a lot of discussion in sumo media about rikishi being motivated to do well in tournaments that are held in their home prefectures. The purpose of the following analysis is to determine whether or not there is data to back up the assumption that rikishi do better in tournaments held in their home prefectures.
To do so, I will scrape the tournament performance data for each active rikishi whose home prefecture hosts tournaments. This yields the number of wins and losses each rikishi has accumulated in every tournament they have competed in. These wins and losses will be split into two categories depending on whether or not they came from a tournament held in the rikishi’s home prefecture.
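The split described above can be sketched in Python as follows (an illustration, not the report's own code). The month-to-host mapping follows the standard tournament calendar, which is an assumption not spelled out in the text, and it ignores any tournament that was relocated. The career record is made up.

```python
# Standard calendar (assumption): month of the tournament -> host prefecture.
HOST_BY_MONTH = {
    1: "Tokyo", 3: "Osaka", 5: "Tokyo",
    7: "Aichi", 9: "Tokyo", 11: "Fukuoka",
}

def split_record(home_prefecture, tournaments):
    """Aggregate (month, wins, losses) tuples into away and home rows."""
    totals = {True: [0, 0], False: [0, 0]}
    for month, wins, losses in tournaments:
        at_home = HOST_BY_MONTH[month] == home_prefecture
        totals[at_home][0] += wins
        totals[at_home][1] += losses
    rows = []
    for at_home in (False, True):
        wins, losses = totals[at_home]
        bouts = wins + losses
        rows.append({
            "tournament_in_home_pref": at_home,
            "wins": wins,
            "losses": losses,
            "win_pct": wins / bouts if bouts else None,
        })
    return rows

# Made-up career of a hypothetical rikishi from Osaka: (month, wins, losses).
career = [(1, 4, 3), (3, 5, 2), (5, 3, 4), (7, 4, 3)]
rows = split_record("Osaka", career)
for row in rows:
    print(row)
```

Returning `None` for an empty category avoids a division by zero for rikishi who have never competed in their home prefecture.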
Once again, scraping this data takes a significant amount of time. As before, the data has already been scraped once and saved as a CSV file, so only an example of this scraping is performed below, using a random sample of rikishi.
| shikona | home_prefecture | tournament_in_home_pref | wins | losses | win_pct |
|---|---|---|---|---|---|
| Ojiyama Nobunaga | Aichi | FALSE | 83 | 120 | 0.4088670 |
| Ojiyama Nobunaga | Aichi | TRUE | 11 | 24 | 0.3142857 |
| Aozora Daichi | Osaka | FALSE | 304 | 272 | 0.5277778 |
| Aozora Daichi | Osaka | TRUE | 55 | 60 | 0.4782609 |
| Tochikasuga Hiroyuki | Aichi | FALSE | 399 | 425 | 0.4842233 |
| Tochikasuga Hiroyuki | Aichi | TRUE | 80 | 88 | 0.4761905 |
Reading the CSV yields a table of the same format.
| shikona | home_prefecture | tournament_in_home_pref | wins | losses | win_pct |
|---|---|---|---|---|---|
| Tobizaru Masaya | Tokyo | FALSE | 162 | 147 | 0.5242718 |
| Tobizaru Masaya | Tokyo | TRUE | 155 | 123 | 0.5575540 |
| Ura Kazuki | Osaka | FALSE | 238 | 131 | 0.6449864 |
| Ura Kazuki | Osaka | TRUE | 46 | 29 | 0.6133333 |
| Oho Konosuke | Tokyo | FALSE | 104 | 56 | 0.6500000 |
| Oho Konosuke | Tokyo | TRUE | 78 | 79 | 0.4968153 |
If we group rikishi by their home prefecture, we can obtain the win percentages of all rikishi from each prefecture for tournaments hosted in that prefecture, as well as outside of that prefecture.
As can be seen, there is no discernible difference between the two across all prefectures. This seems to suggest that a rikishi’s performance is not impacted by competing in their home prefecture.
We can instead look at individual rikishi from these prefectures and see if there are any rikishi for which competing in their home prefecture does impact them.
| shikona | home_prefecture | total_matches | winpct_in_home_pref | winpct_out_of_home_pref | diff_in_winpct |
|---|---|---|---|---|---|
| Tobizaru Masaya | Tokyo | 587 | 0.5575540 | 0.5242718 | 0.0332821 |
| Ura Kazuki | Osaka | 444 | 0.6133333 | 0.6449864 | -0.0316531 |
| Oho Konosuke | Tokyo | 317 | 0.4968153 | 0.6500000 | -0.1531847 |
| Tsurugisho Momotaro | Tokyo | 713 | 0.5382436 | 0.5222222 | 0.0160214 |
| Tohakuryu Masahito | Tokyo | 262 | 0.5634921 | 0.5147059 | 0.0487862 |
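The per-rikishi columns above come from pairing each rikishi's home and away rows. A Python sketch of that reshaping (not the report's own code), using the six home/away rows shown earlier, could look like this; it assumes every rikishi has exactly one home row and one away row.

```python
# (shikona, home_prefecture, tournament_in_home_pref, wins, losses)
records = [
    ("Tobizaru Masaya", "Tokyo", False, 162, 147),
    ("Tobizaru Masaya", "Tokyo", True, 155, 123),
    ("Ura Kazuki", "Osaka", False, 238, 131),
    ("Ura Kazuki", "Osaka", True, 46, 29),
    ("Oho Konosuke", "Tokyo", False, 104, 56),
    ("Oho Konosuke", "Tokyo", True, 78, 79),
]

def summarise_by_rikishi(records):
    """Combine each rikishi's home and away rows into one summary row."""
    by_rikishi = {}
    for shikona, prefecture, at_home, wins, losses in records:
        entry = by_rikishi.setdefault(shikona, {"home_prefecture": prefecture})
        entry[at_home] = (wins, losses)
    summary = []
    for shikona, entry in by_rikishi.items():
        home_w, home_l = entry[True]
        away_w, away_l = entry[False]
        home_pct = home_w / (home_w + home_l)
        away_pct = away_w / (away_w + away_l)
        summary.append({
            "shikona": shikona,
            "home_prefecture": entry["home_prefecture"],
            "total_matches": home_w + home_l + away_w + away_l,
            "winpct_in_home_pref": home_pct,
            "winpct_out_of_home_pref": away_pct,
            "diff_in_winpct": home_pct - away_pct,
        })
    return summary

summary = summarise_by_rikishi(records)
for row in summary:
    print(row["shikona"], row["total_matches"], round(row["diff_in_winpct"], 7))
```

A positive diff_in_winpct means the rikishi has won a larger share of bouts at home than away.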
If we plot the win percentage of rikishi within their home prefecture against their win percentage outside of their home prefecture, we can obtain the following scatterplot. In order to avoid small sample sizes (rikishi who have competed in few or no tournaments in their home prefecture), only rikishi who have fought more than 300 career matches have been included.
We can see from above that the variation between the win percentages is never that great, and the larger differences between the two often belong to rikishi that have fought smaller numbers of career matches. Following this logic, we can plot the differences between their win percentages against the number of career matches a rikishi has fought.
This provides evidence that the impact of performing within one’s home prefecture averages out over time. This strengthens the claim that competing in one’s home prefecture does not have a significant impact on a rikishi’s performance.
As one final point, we can look at the two rikishi with the largest differences in win percentage among those who have fought over 300 bouts.
| shikona | home_prefecture | total_matches | winpct_in_home_pref | winpct_out_of_home_pref | diff_in_winpct |
|---|---|---|---|---|---|
| Daijo Kotaro | Fukuoka | 662 | 0.6160714 | 0.4600000 | 0.1560714 |
| Sakai Shoichi | Fukuoka | 377 | 0.3333333 | 0.5095541 | -0.1762208 |
Ironically, they are both from the same prefecture. Talk about averaging out…
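The filter-then-rank step behind this last table can be sketched in Python (not the report's own code) using only the seven rikishi whose rows appear in the two summary tables above:

```python
# (shikona, total_matches, diff_in_winpct), copied from the tables above.
rows = [
    ("Tobizaru Masaya", 587, 0.0332821),
    ("Ura Kazuki", 444, -0.0316531),
    ("Oho Konosuke", 317, -0.1531847),
    ("Tsurugisho Momotaro", 713, 0.0160214),
    ("Tohakuryu Masahito", 262, 0.0487862),
    ("Daijo Kotaro", 662, 0.1560714),
    ("Sakai Shoichi", 377, -0.1762208),
]

MIN_MATCHES = 300  # threshold used in the text

# Keep only rikishi with more than MIN_MATCHES career bouts.
eligible = [row for row in rows if row[1] > MIN_MATCHES]

largest_gain = max(eligible, key=lambda row: row[2])  # best at home
largest_drop = min(eligible, key=lambda row: row[2])  # worst at home
print(largest_gain[0], largest_drop[0])
```

Filtering before ranking matters: without the threshold, a rikishi with only a handful of home bouts could top either list by chance.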