The data set I chose is about the sumo wrestler’s characteristics since 1957. There are 15760 observations with 10 variables. The data-set variables are as follow :
Variable
Description
Rank
Rank of the wrestler in the Basho
Rikishi
The ring name of a the sumo wrestler
Heya
The establishment where a wrestler trains
Shusshin
The birthplace, or place of origin of the wrestler
Birth Date
The birth date of the wrestler
height_cm
The height of the wrestler at the time of the event in centimeters
weight_kg
The weight of the wrestler at the time of the even in kilograms
wins
Amount of wins in the specific Basho
losses
Amount of losses in the specific Basho
ties
Amount of ties in the specific Basho
The data for this dataset was captured through two websites Sumo Reference - Alexander Nitschke https://sumodb.sumogames.de and The Sumo Matchup Centuries In The Making - FiveThirtyEight https://fivethirtyeight.com/features/the-sumo-matchup-centuries-in-the-making/. The data was collected through the Sumo Tournments and mainly came from the first website. I chose this dataset because I am interested in learning how weight and height and even where the wrestler trained have an influence on their wrestling career. I also started watching Sumo wrestling and found that the sport is more than just pushing the opponent out of the ring. Its a beautiful sport and a great part of culture in japan.
Loading in the Data set:
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.0 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.3.3
library(ggfortify)
Warning: package 'ggfortify' was built under R version 4.3.3
library(ggthemes)
Warning: package 'ggthemes' was built under R version 4.3.3
library(RColorBrewer)library(highcharter) # Loading in all the packages
Warning: package 'highcharter' was built under R version 4.3.3
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
setwd("C:/Users/casti/OneDrive/Documents/DATA 110") # setting directorysumo_since_1957 <-read_csv("sumo_since_1957.csv") # Loading in data
Rows: 15760 Columns: 10
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): Rank, Rikishi, Heya, Shusshin, Birth Date
dbl (5): height_cm, weight_kg, wins, losses, ties
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Data Tidying and Manipulation:
str(sumo_since_1957) # Checking if the variables are catergorized right
Sumo_Prop <- sumo_since_1957|>group_by(Rank,Heya,Shusshin ) |># filter for the three variablessummarise(cnt=n()) |># count mutate(props =round(cnt/sum(cnt),2)) |># mutate for proportionsfilter(cnt >1) |>filter(props<1)
`summarise()` has grouped output by 'Rank', 'Heya'. You can override using the
`.groups` argument.
ggplot(Top10,aes(x = cnt, y = props, col = Rank, label = Shusshin)) +# looking into the Top 10dataset and looking at proportions and count. Labeling for birth place geom_point() +# A scatterplot geom_text_repel(nudge_x =0.005) +# To nudge the text to make the labels a little more visible xlab("Number of Wrestlers") +# x titleylab("Proportions") +# y titleggtitle("Most Common Birth Place in Sumo Wrestling") +# Title of graph theme_economist() +# Adding a new themescale_color_brewer(palette ="Accent")+#Settin color palette labs(caption ="Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html")
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Accent is 8
Returning the palette you asked for with that many colors
Warning: Removed 2 rows containing missing values or values outside the scale range
(`geom_point()`).
Explanation :
The graph shows the different ranks and how all the fighters come from kagoshima. This shows that most fighers are born there and dominant the sport in each rank coming from that area. The other interesting fact is that the number of fighters doesn’t exceed 22 for each rank. Its interesting that each of the ranks have several fighters from Kagoshima.
Bottom 10 Graph
ggplot(Last10,aes(x = cnt, y = props, col = Rank, label = Shusshin)) +# looking into the threat dataset and looking at proportions and count. Labeling for race and adding a circle or triangle for true or false bodycamgeom_point() +# A scatterplot geom_text_repel(nudge_x =0.1, max.overlaps =10) +# To nudge the text to make the labels a little more visible xlab("Number of Wrestlers") +# x titleylab("Proportions") +# y titleggtitle("Least Common Birth Places in Sumo Wrestling") +# Title of graph theme_solarized() +scale_color_brewer(palette ="Set1") +labs(caption ="Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html")
Explanation:
The Graph was made to show the bottom 10 least common birthplaces in Sumo wrestling. Unlike the Top 10 Graph, we see that the birthplaces are far more diverse. The special thing about this graph is that the number of wrestlers for each birthplace doesn’t go over 3. Fewer ranks are also observed due to less wrestlers for each rank.
Mutations
Win_percent <- sumo_since_1957 |>mutate(Total_matches = wins + losses + ties,Win_percentage = wins/Total_matches,Weight_lbs = weight_kg/0.4536, # Mutating for win percentage, Height in ft, and Weight in poundsHeight_ft = height_cm/30.48)
Quantitative analysis :
ggplot(Win_percent,aes(x = Win_percentage, y = Weight_lbs )) +# looking into the Win_percent dataset and looking at Height and Weightgeom_point() +# A scatterplot geom_smooth(method ="lm",se =FALSE, lty =2, linewidth =0.3) +# I made each correlation line be dotted and small so it looks more presentablexlab("Win Percentage") +ylab("Weight in lbs") +# Adding y and x titles ggtitle("Correlation of Weight in lbs to Win Percentage")+# adding title labs(caption ="Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html") #adding a caption
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 18 rows containing non-finite outside the scale range
(`stat_smooth()`).
Warning: Removed 18 rows containing missing values or values outside the scale range
(`geom_point()`).
Correlation
cor(Win_percent$Win_percentage, Win_percent$Weight_lbs, use ="complete.obs") # a Correlation line
[1] 0.02304655
Little explaination:
There is a very small correlation however not big enough to confirm causation.
Model Summary
fit1 <-lm(Win_percentage ~ Weight_lbs, data = Win_percent)summary(fit1) # making a summary for fit 1 for the anaylsis
Call:
lm(formula = Win_percentage ~ Weight_lbs, data = Win_percent)
Residuals:
Min 1Q Median 3Q Max
-0.50181 -0.13840 0.04669 0.12222 0.52937
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.514e-01 9.119e-03 49.499 < 2e-16 ***
Weight_lbs 8.321e-05 2.877e-05 2.892 0.00383 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1916 on 15740 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.0005311, Adjusted R-squared: 0.0004676
F-statistic: 8.365 on 1 and 15740 DF, p-value: 0.003831
Meaning of Model Summary ( Equation)
Win Percentage = 0.001(Weight in lbs) + 6.753
This means for each pound of Weight there is 0.001 increase of Win Percentage.
The 2 asterisks next to the p-value for weight indicates that its a meaniful variable for the the increase of Win Percentage. The P value is also really low which is also a good thing.
The very low adjusted R squared is what is very worrying because it bascially states that around 99% of the data is not likely explained by this model.
Model Summary part 2
fit2 <-lm(Win_percentage ~ Weight_lbs + Height_ft, data = Win_percent )summary(fit2)
Call:
lm(formula = Win_percentage ~ Weight_lbs + Height_ft, data = Win_percent)
Residuals:
Min 1Q Median 3Q Max
-0.51773 -0.12947 0.03403 0.12162 0.53425
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.123e-01 4.901e-02 2.292 0.0219 *
Weight_lbs -1.975e-05 3.224e-05 -0.613 0.5401
Height_ft 6.187e-02 8.789e-03 7.040 2e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1913 on 15739 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.003669, Adjusted R-squared: 0.003542
F-statistic: 28.98 on 2 and 15739 DF, p-value: 2.748e-13
autoplot(fit2,1:4,nrow=2,ncol =2)
What does these diagnostic plots mean ?
The residual plot is relatively horizontal so the linear model is appropriate.
QQplot indicates whether the distribution is relatively normal. However there is so many outliers.
The scale location indicates homogenous variance and it shows that only one point shew the data.
Cook ’s Distance indicates outliers have high leverage, meaning that point 12,133 may cause problems to the model.
Removing the two outliers
Win2 <- Win_percent[-c(12133,14527),]fit3 <-lm(Win_percentage ~ Weight_lbs + Height_ft, data = Win2)summary(fit3) # making a another summary of Fit3
Call:
lm(formula = Win_percentage ~ Weight_lbs + Height_ft, data = Win2)
Residuals:
Min 1Q Median 3Q Max
-0.51820 -0.12930 0.03401 0.12144 0.53415
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.147e-01 4.900e-02 2.341 0.0192 *
Weight_lbs -1.238e-05 3.229e-05 -0.383 0.7014
Height_ft 6.110e-02 8.789e-03 6.952 3.74e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1913 on 15737 degrees of freedom
(18 observations deleted due to missingness)
Multiple R-squared: 0.003674, Adjusted R-squared: 0.003547
F-statistic: 29.02 on 2 and 15737 DF, p-value: 2.641e-13
autoplot(fit3,1:4, nrow=2, ncol=2) # Using autoplot
The residual plot stayed the same compared to my other one with the outliers.
The QQplot is the same with the shape of a leaning Letter L.
Scale location also stayed the same.
However the outliers in the cook’s distance lowered due to removing the two main outliers.
Final summary:
The weight and height variables seem to not explain the win percentage in the sumo wrestlers. The adjusted r squared is super low and prompts that the majority of the data is not explained but this model. There isn’t a slight correlation too in weight and win percentage with the correlation line. However the P values are low enough to say they are significant however my claim of weight and height does not change.
Second Graph :
Fiveft_Under <- Win_percent |>filter(Height_ft <=5.6) # Filtering for wrestlers under or equal 5ft 6
ggplot(Fiveft_Under,aes(x = Weight_lbs, y = Height_ft, col = Shusshin)) +# looking into the FiveFT dataset and looking at Weight and Height. Labeling for birthplacegeom_point() +# A scatterplot the labels a little more visible xlab("Weight(lbs) of Wrestlers" ) +# x titleylab("Height(ft) of Wrestlers") +# y titleggtitle("The Height and Weight of Sumo Wrestlers Under 5ft 6") +# Title of graphtheme_bw() +scale_color_brewer(palette ="Spectral") +# adding a palettelabs(caption ="Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html") # for the caption
Explanation:
Most of sumo wrestlers were over the height of 5ft and 9 inches. I wanted to see if 300 pound wrestlers existed for heights lower than 5ft 6. There is a couple and its cool to see that a human can weigh so much and have that height. Its also so cool that the graph also shows the birth places of each of the wrestlers. The birthplace Shizuoka had most of the short and heavy wrestlers.
Final Graph :
Win_percent <- Win_percent[order(Win_percent$Weight_lbs,decreasing =TRUE),] # To create a descending dataframeTop100 <-head(Win_percent,100) # Use the head function in order to see the top 10cols <-brewer.pal(4,"Set1") # Setting the Palette for the hicharter
unique(Top100$Heya) # To see the unique values for Heya
[1] "Takasago" "Onoe" "Azumazeki" "Musashigawa"
hchart(Top100, "scatter", hcaes(x = Height_ft, y = Weight_lbs, group = Heya, size = Win_percentage)) %>%hc_colors(cols) %>%# Adding the color of the datahc_title(text ="Top 100 Heaviest Sumo Wrestlers Based On Training Grounds") %>%#Adding Titlehc_subtitle(text ="Source : https://data.scorenetwork.org/wrestling/sumo_wrestingling_since_1957.html ") %>%# Adding the sourcehc_xAxis(title =list(text =""),labels =list(format ="{value} Ft")) %>%hc_yAxis(title =list(text =""),labels =list(format ="{value} lbs")) %>%# Adding the x/y titles with the meaurementshc_tooltip(pointFormat ="{point.Heya}: {point.Weight_lbs:.2f} lbs / {point.Height_ft:.2f} ft / {point.Win_percentage:.2f}% Win Percentage " ,headerFormat ="{point.Heya}",shadow =TRUE,crosshairs =TRUE,borderWidth =2,headerShape ="circle") %>%hc_add_theme(hc_theme_darkunica()) # Adding the theme
Explanation:
I wanted to see where the top heaviest sumo wrestlers trained and lived in for majority of their lives. At the same time, I wanted to explore the weight and height of the sumo wrestlers. I wanted to also implement their win percentage. I found out that most of the heaviest Sumo wrestlers trained at Takasago. The also trained the heaviest individual. The least amount of heavy wrestlers trained at the Mushashigawa. However, the tallest sumo wrestlers trained at Azumazeki.
Essay :
B. The Sumo sport orginated in Japan in 300 BCE. Around these times, Sumo Wrestling was seen as a ritual not a sport. Often resulting in death and offering a cash prize for the winner. The ritual revolves around these rules, one is each of the sumo wrestlers would attempt to push each other out of the ring or have the other touch the floor with their bodies not including their feet. To bring intimation to the sport each of the sumo wrestlers would wear a mawashi which is a large belt and have their hair tied in a knot. This practice was done in ancient times to scare enemies . One of the many perks of being a sumo wrestler is that in Japan in become a superstar.
PartB Source: Cartwright, Mark. “Sumo.” World History Encyclopedia, https://www.worldhistory.org#organization, 1 May 2024, www.worldhistory.org/Sumo/#:~:text=Sumo%20%28Ozumo%29%20is%20an%20ancient%20form%20of%20wrestling,last%20much%20longer%20than%20the%20actual%20sporting%20contest.
C. I was very surprised that the weight and height of the sumo wrestlers didn’t have a big impact on the win percentage. The very low adjusted r squared proved that a linear model was not a good representation of the data. I was also surprised about how short a 300 pound sumo wrestler can be. I assumed they all were 5ft 10 and above because 250+ pounds on a person that is 5ft 6 and shorter seems so unhealthy. I wish I could have added the names of the wrestlers however it kept messing up my whole interactive graph. I am proud of myself for getting a hicharter graph to work because in my previous project I couldn’t get it to work. I prefer hicharter then plotly because the interactivity is smoother.