SUMO Final Project

Author

Gabriel Castillo Lopez

Source: <https://r.search.yahoo.com/_ylt=AwrEnyquCzRmNQkJYzqjzbkF;_ylu=c2VjA2ZwLWF0dHJpYgRzbGsDcnVybA–/RV=2/RE=1714715694/RO=11/RU=https%3a%2f%2fwww.jrailpass.com%2fblog%2fsumo-wrestling/RK=2/RS=ARLifzt5CkcCAPWAjPZagkJb0N8->

Intro Paragraph :

Source of the dataset: https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html

The data set I chose is about the sumo wrestler’s characteristics since 1957. There are 15760 observations with 10 variables. The data-set variables are as follow :

Variable	Description
Rank	Rank of the wrestler in the Basho
Rikishi	The ring name of a the sumo wrestler
Heya	The establishment where a wrestler trains
Shusshin	The birthplace, or place of origin of the wrestler
Birth Date	The birth date of the wrestler
height_cm	The height of the wrestler at the time of the event in centimeters
weight_kg	The weight of the wrestler at the time of the even in kilograms
wins	Amount of wins in the specific Basho
losses	Amount of losses in the specific Basho
ties	Amount of ties in the specific Basho

The data for this dataset was captured through two websites Sumo Reference - Alexander Nitschke https://sumodb.sumogames.de and The Sumo Matchup Centuries In The Making - FiveThirtyEight https://fivethirtyeight.com/features/the-sumo-matchup-centuries-in-the-making/. The data was collected through the Sumo Tournments and mainly came from the first website. I chose this dataset because I am interested in learning how weight and height and even where the wrestler trained have an influence on their wrestling career. I also started watching Sumo wrestling and found that the sport is more than just pushing the opponent out of the ring. Its a beautiful sport and a great part of culture in japan.

Loading in the Data set:

library(tidyverse)

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggrepel)

Warning: package 'ggrepel' was built under R version 4.3.3

library(ggfortify)

Warning: package 'ggfortify' was built under R version 4.3.3

library(ggthemes)

Warning: package 'ggthemes' was built under R version 4.3.3

library(RColorBrewer)
library(highcharter) # Loading in all the packages

Warning: package 'highcharter' was built under R version 4.3.3

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

setwd("C:/Users/casti/OneDrive/Documents/DATA 110") # setting directory
sumo_since_1957 <- read_csv("sumo_since_1957.csv") # Loading in data

Rows: 15760 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Rank, Rikishi, Heya, Shusshin, Birth Date
dbl (5): height_cm, weight_kg, wins, losses, ties

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Data Tidying and Manipulation:

str(sumo_since_1957) # Checking if the variables are catergorized right

spc_tbl_ [15,760 × 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Rank      : chr [1:15760] "Y1eYO" "O1w" "S1e" "S1w" ...
 $ Rikishi   : chr [1:15760] "Terunofuji" "Takakeisho" "Wakatakakage" "Hoshoryu" ...
 $ Heya      : chr [1:15760] "Isegahama" "Tokiwayama" "Arashio" "Tatsunami" ...
 $ Shusshin  : chr [1:15760] "Mongolia" "Hyogo" "Fukushima" "Mongolia" ...
 $ Birth Date: chr [1:15760] "29.11.1991" "05.08.1996" "06.12.1994" "22.05.1999" ...
 $ height_cm : num [1:15760] 191 175 180 186 187 184 187 188 180 187 ...
 $ weight_kg : num [1:15760] 173 183 129 131 175 170 138 166 151 143 ...
 $ wins      : num [1:15760] 0 12 9 8 1 6 11 8 5 9 ...
 $ losses    : num [1:15760] 0 3 6 7 5 9 4 7 10 6 ...
 $ ties      : num [1:15760] 15 0 0 0 9 0 0 0 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   Rank = col_character(),
  ..   Rikishi = col_character(),
  ..   Heya = col_character(),
  ..   Shusshin = col_character(),
  ..   `Birth Date` = col_character(),
  ..   height_cm = col_double(),
  ..   weight_kg = col_double(),
  ..   wins = col_double(),
  ..   losses = col_double(),
  ..   ties = col_double()
  .. )
 - attr(*, "problems")=<externalptr>

Sumo_Prop <- sumo_since_1957|>
  group_by(Rank,Heya,Shusshin ) |> # filter for the three variables
  summarise(cnt=n()) |> # count 
  mutate(props = round(cnt/sum(cnt),2)) |> # mutate for proportions
  filter(cnt > 1) |>
  filter(props< 1)

`summarise()` has grouped output by 'Rank', 'Heya'. You can override using the
`.groups` argument.

  # Can't be a one of one situation
Sumo_Prop

# A tibble: 3,104 × 5
# Groups:   Rank, Heya [1,143]
   Rank  Heya       Shusshin    cnt props
   <chr> <chr>      <chr>     <int> <dbl>
 1 K1e   Azumazeki  Aomori        2  0.67
 2 K1e   Dewanoumi  Aomori        7  0.27
 3 K1e   Dewanoumi  Hokkaido      3  0.12
 4 K1e   Dewanoumi  Nagano        6  0.23
 5 K1e   Dewanoumi  Nagasaki      2  0.08
 6 K1e   Dewanoumi  Osaka         3  0.12
 7 K1e   Fujishima  Hiroshima     4  0.8 
 8 K1e   Futagoyama Aomori        8  0.44
 9 K1e   Futagoyama Hiroshima     3  0.17
10 K1e   Futagoyama Hyogo         3  0.17
# ℹ 3,094 more rows

 Sumo_Prop <- Sumo_Prop[order(Sumo_Prop$props,decreasing = TRUE),] # To create a descending dataframe
Sumo_Prop

# A tibble: 3,104 × 5
# Groups:   Rank, Heya [1,143]
   Rank  Heya   Shusshin    cnt props
   <chr> <chr>  <chr>     <int> <dbl>
 1 M2w   Izutsu Kagoshima    18  0.95
 2 M5w   Izutsu Kagoshima    20  0.95
 3 M4w   Izutsu Kagoshima    15  0.94
 4 M9w   Izutsu Kagoshima    16  0.94
 5 M7w   Izutsu Kagoshima    14  0.93
 6 M1e   Izutsu Kagoshima    12  0.92
 7 M3e   Izutsu Kagoshima    22  0.92
 8 M9e   Izutsu Kagoshima    12  0.92
 9 M5e   Izutsu Kagoshima    10  0.91
10 M8w   Izutsu Kagoshima    20  0.91
# ℹ 3,094 more rows

Top 10 Proportions

Top10 <- head(Sumo_Prop,10) # Use the head function in order to see the top 10
Top10

# A tibble: 10 × 5
# Groups:   Rank, Heya [10]
   Rank  Heya   Shusshin    cnt props
   <chr> <chr>  <chr>     <int> <dbl>
 1 M2w   Izutsu Kagoshima    18  0.95
 2 M5w   Izutsu Kagoshima    20  0.95
 3 M4w   Izutsu Kagoshima    15  0.94
 4 M9w   Izutsu Kagoshima    16  0.94
 5 M7w   Izutsu Kagoshima    14  0.93
 6 M1e   Izutsu Kagoshima    12  0.92
 7 M3e   Izutsu Kagoshima    22  0.92
 8 M9e   Izutsu Kagoshima    12  0.92
 9 M5e   Izutsu Kagoshima    10  0.91
10 M8w   Izutsu Kagoshima    20  0.91

Bottom 10 Proportions

Last10 <- tail(Sumo_Prop,10) # Use the Tail function in order to see the bottom 10
Last10

# A tibble: 10 × 5
# Groups:   Rank, Heya [4]
   Rank  Heya        Shusshin   cnt props
   <chr> <chr>       <chr>    <int> <dbl>
 1 M7e   Tokitsukaze Oita         2  0.05
 2 M8w   Tokitsukaze Kochi        2  0.05
 3 M8w   Tokitsukaze Mongolia     2  0.05
 4 M8w   Tokitsukaze Nagasaki     2  0.05
 5 M8w   Tokitsukaze Oita         2  0.05
 6 M8w   Tokitsukaze Saga         2  0.05
 7 S1w   Sadogatake  Tottori      3  0.05
 8 S1e   Sadogatake  Bulgaria     2  0.04
 9 S1w   Sadogatake  Bulgaria     2  0.04
10 S1w   Sadogatake  Yamagata     2  0.04

Two proportion Graphs

Top 10 Graph

ggplot(Top10,aes(x = cnt, y = props, col = Rank, label = Shusshin)) + # looking into the Top 10dataset and looking at proportions and count. Labeling for birth place 
  geom_point() + # A scatterplot 
    geom_text_repel(nudge_x = 0.005) +   # To nudge the text to make the labels a little more visible 
xlab("Number of Wrestlers") +  # x title
  ylab("Proportions") + # y title
  ggtitle("Most Common Birth Place in Sumo Wrestling") + # Title of graph 
  theme_economist() +  # Adding a new theme
  scale_color_brewer(palette =  "Accent")+ #Settin color palette 
  labs(caption = "Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html")

Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Accent is 8
Returning the palette you asked for with that many colors

Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).

Explanation :

The graph shows the different ranks and how all the fighters come from kagoshima. This shows that most fighers are born there and dominant the sport in each rank coming from that area. The other interesting fact is that the number of fighters doesn’t exceed 22 for each rank. Its interesting that each of the ranks have several fighters from Kagoshima.

Bottom 10 Graph

ggplot(Last10,aes(x = cnt, y = props, col = Rank, label = Shusshin)) + # looking into the threat dataset and looking at proportions and count. Labeling for race and adding a circle or triangle for true or false bodycam
  geom_point() + # A scatterplot 
    geom_text_repel(nudge_x = 0.1, max.overlaps = 10) +   # To nudge the text to make the labels a little more visible 
xlab("Number of Wrestlers") +  # x title
  ylab("Proportions") + # y title
  ggtitle("Least Common Birth Places in Sumo Wrestling") + # Title of graph 
  theme_solarized() + 
  scale_color_brewer(palette = "Set1") +
   labs(caption = "Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html")

Explanation:

The Graph was made to show the bottom 10 least common birthplaces in Sumo wrestling. Unlike the Top 10 Graph, we see that the birthplaces are far more diverse. The special thing about this graph is that the number of wrestlers for each birthplace doesn’t go over 3. Fewer ranks are also observed due to less wrestlers for each rank.

Mutations

Win_percent <- sumo_since_1957 |> 
  mutate(Total_matches = wins + losses + ties,
         Win_percentage = wins/Total_matches,
         Weight_lbs = weight_kg/0.4536, # Mutating for win percentage, Height in ft, and Weight in pounds
         Height_ft = height_cm/30.48)

Quantitative analysis :

ggplot(Win_percent,aes(x = Win_percentage, y = Weight_lbs )) + # looking into the Win_percent dataset and looking at Height and Weight
  geom_point() + # A scatterplot 
    geom_smooth(method = "lm",se = FALSE, lty = 2, linewidth = 0.3) + # I made each correlation line be dotted and small so it looks more presentable
    xlab("Win Percentage") +
    ylab("Weight in lbs") + # Adding y and x titles 
    ggtitle("Correlation of Weight in lbs to Win Percentage")+ # adding title 
  labs(caption = "Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html") #adding a caption

`geom_smooth()` using formula = 'y ~ x'

Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).

Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_point()`).

Correlation

cor(Win_percent$Win_percentage, Win_percent$Weight_lbs, use = "complete.obs") # a Correlation line

[1] 0.02304655

Little explaination:

There is a very small correlation however not big enough to confirm causation.

Model Summary

fit1 <- lm(Win_percentage ~ Weight_lbs, data = Win_percent)
summary(fit1) # making a summary for fit 1 for the anaylsis


Call:
lm(formula = Win_percentage ~ Weight_lbs, data = Win_percent)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50181 -0.13840  0.04669  0.12222  0.52937 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 4.514e-01  9.119e-03  49.499  < 2e-16 ***
Weight_lbs  8.321e-05  2.877e-05   2.892  0.00383 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1916 on 15740 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.0005311, Adjusted R-squared:  0.0004676 
F-statistic: 8.365 on 1 and 15740 DF,  p-value: 0.003831

Meaning of Model Summary ( Equation)

Win Percentage = 0.001(Weight in lbs) + 6.753

This means for each pound of Weight there is 0.001 increase of Win Percentage.

The 2 asterisks next to the p-value for weight indicates that its a meaniful variable for the the increase of Win Percentage. The P value is also really low which is also a good thing.

The very low adjusted R squared is what is very worrying because it bascially states that around 99% of the data is not likely explained by this model.

Model Summary part 2

fit2 <- lm(Win_percentage ~ Weight_lbs + Height_ft, data = Win_percent )
summary(fit2)


Call:
lm(formula = Win_percentage ~ Weight_lbs + Height_ft, data = Win_percent)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51773 -0.12947  0.03403  0.12162  0.53425 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.123e-01  4.901e-02   2.292   0.0219 *  
Weight_lbs  -1.975e-05  3.224e-05  -0.613   0.5401    
Height_ft    6.187e-02  8.789e-03   7.040    2e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1913 on 15739 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.003669,  Adjusted R-squared:  0.003542 
F-statistic: 28.98 on 2 and 15739 DF,  p-value: 2.748e-13

autoplot(fit2,1:4,nrow=2,ncol = 2)

What does these diagnostic plots mean ?

The residual plot is relatively horizontal so the linear model is appropriate.
QQplot indicates whether the distribution is relatively normal. However there is so many outliers.
The scale location indicates homogenous variance and it shows that only one point shew the data.

Cook ’s Distance indicates outliers have high leverage, meaning that point 12,133 may cause problems to the model.

Removing the two outliers

Win2 <- Win_percent[-c(12133,14527),]
fit3 <- lm(Win_percentage ~ Weight_lbs + Height_ft, data = Win2)
summary(fit3) # making a another summary of Fit3


Call:
lm(formula = Win_percentage ~ Weight_lbs + Height_ft, data = Win2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51820 -0.12930  0.03401  0.12144  0.53415 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.147e-01  4.900e-02   2.341   0.0192 *  
Weight_lbs  -1.238e-05  3.229e-05  -0.383   0.7014    
Height_ft    6.110e-02  8.789e-03   6.952 3.74e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1913 on 15737 degrees of freedom
  (18 observations deleted due to missingness)
Multiple R-squared:  0.003674,  Adjusted R-squared:  0.003547 
F-statistic: 29.02 on 2 and 15737 DF,  p-value: 2.641e-13

autoplot(fit3,1:4, nrow= 2, ncol=2) # Using autoplot

The residual plot stayed the same compared to my other one with the outliers.
The QQplot is the same with the shape of a leaning Letter L.
Scale location also stayed the same.
However the outliers in the cook’s distance lowered due to removing the two main outliers.

Final summary:

The weight and height variables seem to not explain the win percentage in the sumo wrestlers. The adjusted r squared is super low and prompts that the majority of the data is not explained but this model. There isn’t a slight correlation too in weight and win percentage with the correlation line. However the P values are low enough to say they are significant however my claim of weight and height does not change.

Second Graph :

Fiveft_Under <- Win_percent |> 
  filter(Height_ft <= 5.6) # Filtering for wrestlers under or equal  5ft 6

ggplot(Fiveft_Under,aes(x = Weight_lbs, y = Height_ft, col = Shusshin)) + # looking into the FiveFT dataset and looking at Weight and Height. Labeling for birthplace
  geom_point() + # A scatterplot the labels a little more visible 
xlab("Weight(lbs) of Wrestlers" ) +  # x title
  ylab("Height(ft) of Wrestlers") + # y title
  ggtitle("The Height and Weight of Sumo Wrestlers Under 5ft 6") + # Title of graph
  theme_bw() + scale_color_brewer(palette = "Spectral") + # adding a palette
  labs(caption = "Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html") # for the caption

Explanation:

Most of sumo wrestlers were over the height of 5ft and 9 inches. I wanted to see if 300 pound wrestlers existed for heights lower than 5ft 6. There is a couple and its cool to see that a human can weigh so much and have that height. Its also so cool that the graph also shows the birth places of each of the wrestlers. The birthplace Shizuoka had most of the short and heavy wrestlers.

Final Graph :

Win_percent <- Win_percent[order(Win_percent$Weight_lbs,decreasing = TRUE),] # To create a descending dataframe

Top100 <- head(Win_percent,100) # Use the head function in order to see the top 10


cols <- brewer.pal(4,"Set1") # Setting the Palette for the hicharter

unique(Top100$Heya) # To see the unique values for Heya

[1] "Takasago"    "Onoe"        "Azumazeki"   "Musashigawa"

hchart(Top100, "scatter", hcaes(x = Height_ft, y = Weight_lbs, group = Heya, size = Win_percentage)) %>%
  hc_colors(cols) %>% # Adding the color of the data
  hc_title(text = "Top 100 Heaviest Sumo Wrestlers Based On Training Grounds") %>% #Adding Title
  hc_subtitle(text = "Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html ") %>% # Adding the source
  hc_xAxis(title = list(text = ""),
           labels = list(format = "{value} Ft")) %>% 
  hc_yAxis(title = list(text = ""),
           labels = list(format = "{value} lbs")) %>% # Adding the x/y titles with the meaurements
 hc_tooltip(pointFormat = 
               "{point.Heya}:
                {point.Weight_lbs:.2f} lbs / {point.Height_ft:.2f} ft / {point.Win_percentage:.2f}% Win Percentage " ,
             headerFormat = "{point.Heya}",
             shadow = TRUE,
             crosshairs = TRUE,
             borderWidth = 2,
             headerShape = "circle") %>%
  hc_add_theme(hc_theme_darkunica()) # Adding the theme

Explanation:

I wanted to see where the top heaviest sumo wrestlers trained and lived in for majority of their lives. At the same time, I wanted to explore the weight and height of the sumo wrestlers. I wanted to also implement their win percentage. I found out that most of the heaviest Sumo wrestlers trained at Takasago. The also trained the heaviest individual. The least amount of heavy wrestlers trained at the Mushashigawa. However, the tallest sumo wrestlers trained at Azumazeki.

Essay :

B. The Sumo sport orginated in Japan in 300 BCE. Around these times, Sumo Wrestling was seen as a ritual not a sport. Often resulting in death and offering a cash prize for the winner. The ritual revolves around these rules, one is each of the sumo wrestlers would attempt to push each other out of the ring or have the other touch the floor with their bodies not including their feet. To bring intimation to the sport each of the sumo wrestlers would wear a mawashi which is a large belt and have their hair tied in a knot. This practice was done in ancient times to scare enemies . One of the many perks of being a sumo wrestler is that in Japan in become a superstar.

PartB Source: Cartwright, Mark. “Sumo.” World History Encyclopedia, https://www.worldhistory.org#organization, 1 May 2024, www.worldhistory.org/Sumo/#:~:text=Sumo%20%28Ozumo%29%20is%20an%20ancient%20form%20of%20wrestling,last%20much%20longer%20than%20the%20actual%20sporting%20contest.

C. I was very surprised that the weight and height of the sumo wrestlers didn’t have a big impact on the win percentage. The very low adjusted r squared proved that a linear model was not a good representation of the data. I was also surprised about how short a 300 pound sumo wrestler can be. I assumed they all were 5ft 10 and above because 250+ pounds on a person that is 5ft 6 and shorter seems so unhealthy. I wish I could have added the names of the wrestlers however it kept messing up my whole interactive graph. I am proud of myself for getting a hicharter graph to work because in my previous project I couldn’t get it to work. I prefer hicharter then plotly because the interactivity is smoother.

Intro Paragraph :

Loading in the Data set:

Data Tidying and Manipulation:

Top 10 Proportions

Bottom 10 Proportions

Two proportion Graphs

Top 10 Graph

Explanation :

Bottom 10 Graph

Explanation:

Mutations

Quantitative analysis :

Correlation

Little explaination:

Model Summary

Meaning of Model Summary ( Equation)

Model Summary part 2

What does these diagnostic plots mean ?

Removing the two outliers

Final summary:

Second Graph :

Explanation:

Final Graph :

Explanation:

Essay :