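library(readr)
library(dplyr)    # provides %>% and mutate(), used below
library(ggplot2)  # provides ggplot(), used for the plots below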
banzuke <- read_csv("banzuke.csv")
## Rows: 170406 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): rank, wrestler, training_qtrs, birthplace, birthdate, prev
## dbl (6): tournament, id, height_cm, weight_kg, prev_w, prev_l
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(banzuke)

The Data

This large dataset (170,406 rows and 12 columns) contains information about sumo wrestlers and tournaments from 1983 to 2017, including each wrestler’s rank, birthplace, height, weight, and previous wins and losses. For this homework, I have chosen to examine the variables “height” and “weight.”

(Limitations of this analysis: This data covers a large span of years, so the height and weight results below are averages over all sumo wrestlers from 1983 to 2017 and hide any change over time. In the future, I hope to isolate smaller portions of the data to analyze.)

The first thing I want to do is convert weight and height from metric to Imperial units, to make the results more reader-friendly for people who do not regularly use the metric system. I do this by creating two new columns, weight_lbs and height_in, computed from weight_kg and height_cm with the approximate factors 2.2 lb per kg and 0.3937 in per cm. This increases the number of columns from 12 to 14, as the output of dim() shows.

banzuke <- banzuke %>%
  mutate(weight_lbs = weight_kg * 2.2,
         height_in = height_cm * 0.3937)

dim(banzuke)
## [1] 170406     14
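
The factors used above are rounded. As a quick check of how much the rounding matters, here is a minimal sketch in base R that compares them with the exact definitions (1 lb = 0.45359237 kg, 1 in = 2.54 cm). The helper names kg_to_lbs() and cm_to_in() and the 150 kg, 180 cm example values are hypothetical and not part of the original analysis.

```r
# Exact definitions: 1 lb = 0.45359237 kg, 1 in = 2.54 cm.
# (Hypothetical helper names, for illustration only.)
kg_to_lbs <- function(kg) kg / 0.45359237
cm_to_in  <- function(cm) cm / 2.54

# Compare exact and rounded conversions for an illustrative 150 kg, 180 cm wrestler.
exact_lbs   <- kg_to_lbs(150)
rounded_lbs <- 150 * 2.2
exact_in    <- cm_to_in(180)
rounded_in  <- 180 * 0.3937

# Relative error introduced by the rounded factors
rel_err_lbs <- (exact_lbs - rounded_lbs) / exact_lbs
rel_err_in  <- (exact_in - rounded_in) / exact_in
print(c(weight = rel_err_lbs, height = rel_err_in))
```

The 2.2 factor understates weights by about 0.2 percent, which is roughly half a pound for a 276 lb wrestler, and the 0.3937 factor is accurate to a few parts per million. Both errors are small next to the spread of the data.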

Now I’m going to make a univariate graph of a single variable, weight, to get an idea of how heavy sumo wrestlers are. Here, weight is on the x axis and density is on the y axis. We can see that the distribution is centered slightly below 300 pounds.

ggplot(banzuke, aes(weight_lbs)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  geom_density(alpha = 0.2, fill = "red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5174 rows containing non-finite values (stat_bin).
## Warning: Removed 5174 rows containing non-finite values (stat_density).
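
The messages above come from defaults: stat_bin() falls back to 30 bins, and rows with a missing weight are dropped with a warning. A minimal sketch of handling both explicitly follows. It uses a small simulated data frame (toy, hypothetical) so that it runs on its own; the mean, standard deviation, and 10 lb bin width are arbitrary choices for illustration, not values taken from the original analysis.

```r
library(ggplot2)

# Simulated stand-in for banzuke: 200 weights plus 5 missing values.
set.seed(1)
toy <- data.frame(weight_lbs = c(rnorm(200, mean = 276, sd = 60), rep(NA, 5)))

# Count missing values up front instead of learning about them from a warning.
n_missing <- sum(is.na(toy$weight_lbs))

# Drop incomplete rows explicitly, then choose the bin width by hand.
toy_complete <- toy[!is.na(toy$weight_lbs), , drop = FALSE]

p <- ggplot(toy_complete, aes(weight_lbs)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 10, alpha = 0.5) +
  geom_density(alpha = 0.2, fill = "red") +
  labs(x = "weight (lbs)", y = "density")
print(p)
```

Filtering first makes the number of excluded rows part of the analysis rather than a side effect of plotting, and a fixed bin width keeps the histogram comparable if the data are later split into smaller portions.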

Here, I run the same code with height as the variable. This visualization shows that the distribution of wrestler heights is centered at about 70 inches (roughly 5 ft 10 in).

ggplot(banzuke, aes(height_in)) +
  # after_stat(density) replaces the deprecated ..density.. notation
  geom_histogram(aes(y = after_stat(density)), alpha = 0.5) +
  geom_density(alpha = 0.2, fill = "blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5174 rows containing non-finite values (stat_bin).
## Warning: Removed 5174 rows containing non-finite values (stat_density).

Now let’s make a bivariate graph, this time using both height and weight as variables.

ggplot(banzuke, aes(weight_lbs, height_in)) +
  geom_point() +
  geom_smooth() +
  theme_minimal() +
  labs(title = "Height and weight of sumo wrestlers", y = "height (in)", x = "weight (lbs)")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 5174 rows containing non-finite values (stat_smooth).
## Warning: Removed 5174 rows containing missing values (geom_point).
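
A scatterplot with this many points shows the overall trend but not its strength. As a rough numeric companion, a Pearson correlation can be computed with base R’s cor(). The sketch below uses a small made-up data frame (toy, hypothetical values) and use = "complete.obs" so that rows with a missing value are skipped, much as ggplot2 removes them before plotting.

```r
# Made-up values for illustration; one weight is missing on purpose.
toy <- data.frame(
  weight_lbs = c(220, 250, 275, 300, 340, NA),
  height_in  = c(68, 69, 71, 71, 73, 70)
)

# Pearson correlation using only rows where both values are present
r <- cor(toy$weight_lbs, toy$height_in, use = "complete.obs")
print(round(r, 2))

# Equivalent: filter the complete rows by hand, then correlate
keep <- complete.cases(toy)
r_manual <- cor(toy$weight_lbs[keep], toy$height_in[keep])
```

On the real data the same call would be cor(banzuke$weight_lbs, banzuke$height_in, use = "complete.obs"). Because the unit conversions are simple rescalings, the result is the same whether metric or Imperial columns are used.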

We can check our reading of these graphs by calculating the average weight and height directly. The output below shows that the sumo wrestlers have an average weight of about 276 pounds and an average height of about 70.7 inches (just under 5 ft 11 in), ignoring missing values. These figures agree with the graphs.

# Mean weight in pounds, ignoring missing values
banzuke %>%
  summarise(weight_lbs = mean(weight_lbs, na.rm = TRUE))
## # A tibble: 1 x 1
##   weight_lbs
##        <dbl>
## 1       276.
# Mean height in inches, ignoring missing values
banzuke %>%
  summarise(height_in = mean(height_in, na.rm = TRUE))
## # A tibble: 1 x 1
##   height_in
##       <dbl>
## 1      70.7
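
The two averages can also be computed in a single call. The following is a minimal sketch using dplyr’s across() (available in dplyr 1.0.0 and later) on a small made-up tibble (toy, hypothetical values), so it runs without the full dataset.

```r
library(dplyr)

# Made-up values for illustration; each column has one missing value.
toy <- tibble(
  weight_lbs = c(250, 300, NA, 280),
  height_in  = c(70, 72, 71, NA)
)

# One summarise() call returns both means, each ignoring its own missing values.
averages <- toy %>%
  summarise(across(c(weight_lbs, height_in), ~ mean(.x, na.rm = TRUE)))
print(averages)
```

Replacing toy with banzuke should give a one-row tibble with both averages side by side, which makes it easier to extend the check to other summaries such as the median.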