# CSV file (the tidyverse is needed below for %>% and ggplot())
library(tidyverse)
colony <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-01-11/colony.csv')
## Rows: 1222 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): months, state
## dbl (8): year, colony_n, colony_max, colony_lost, colony_lost_pct, colony_ad...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colony
skimr::skim(colony)
Zooming in on the y-axis helps when the common bins contain so many observations that the rare bins are too short to see.
The rare bins are much easier to see in the second plot, which limits the y-axis to the range 0 to 50 with coord_cartesian().
colony %>%
  ggplot(aes(colony_n)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 47 rows containing non-finite values (`stat_bin()`).
colony %>%
  ggplot(aes(colony_n)) +
  geom_histogram() +
  coord_cartesian(ylim = c(0, 50))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 47 rows containing non-finite values (`stat_bin()`).
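The rare bins can also be checked numerically rather than visually. The sketch below is only an illustration: `toy` and its values are made up (not the real colony data), and the bin width of 10,000 is an arbitrary assumption. It uses `ggplot2::cut_width()` to assign each value to a bin and `dplyr::count()` to tally the bins.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: many small values, a few large ones, one NA
toy <- tibble(colony_n = c(rep(1000, 200), rep(50000, 3), NA))

# Drop the NA, then count observations per bin of width 10,000
# (the width is an arbitrary choice for this illustration)
bin_counts <- toy %>%
  filter(!is.na(colony_n)) %>%
  count(bin = cut_width(colony_n, 10000))

bin_counts
```

On the real data, the same pattern with `colony` in place of `toy` should list the sparsely populated bins that the zoomed histogram makes visible.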
The second plot shows the result of treating outliers in colony_reno (values above 400,000) as NA. This removes 8 more rows from the plot than before (139 instead of 131).
colony %>%
  ggplot(aes(colony_lost, colony_reno)) +
  geom_point()
## Warning: Removed 131 rows containing missing values (`geom_point()`).
colony %>%
  mutate(colony_reno = ifelse(colony_reno > 4e+05, NA, colony_reno)) %>%
  ggplot(aes(colony_lost, colony_reno)) +
  geom_point()
## Warning: Removed 139 rows containing missing values (`geom_point()`).
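The replacement step can be tried in isolation on a small example. This is a minimal sketch under stated assumptions: the data frame `toy` and the cutoff of 100 are hypothetical, and `dplyr::if_else()` is swapped in as a type-strict alternative to the `ifelse()` call used above.

```r
library(dplyr)

# Hypothetical toy data with one obvious outlier (500) and existing NAs
toy <- tibble(
  colony_lost = c(10, 20, 30, 40, NA),
  colony_reno = c(5, 500, 15, NA, 25)
)

# Replace values above the (assumed) cutoff with NA; existing NAs stay NA
cleaned <- toy %>%
  mutate(colony_reno = if_else(colony_reno > 100, NA_real_, colony_reno))

# Number of values newly turned into NA by the cutoff
n_replaced <- sum(is.na(cleaned$colony_reno)) - sum(is.na(toy$colony_reno))
n_replaced
```

Counting the newly created NAs this way is one option for confirming how many points a cutoff removes before plotting.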
The dataset has only two categorical variables, months and state. With ggplot2's default fill gradient, light blue represents 7 occurrences and dark blue 6.
colony %>%
  count(months, state) %>%
  ggplot(aes(months, state)) +
  geom_tile(aes(fill = n))
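The counts behind the tile colours can also be tabulated directly instead of being read off the legend. The following is a sketch on made-up data: the rows of `toy` are hypothetical and only mimic the months/state layout of the colony data.

```r
library(dplyr)

# Hypothetical toy data: two states, two quarters, unequal repetitions
toy <- tibble(
  months = c(rep("January-March", 3), rep("July-September", 2),
             rep("January-March", 3), rep("July-September", 2)),
  state  = c(rep("Alabama", 5), rep("Arizona", 5))
)

# One row per months/state combination with its number of occurrences
combo_counts <- toy %>%
  count(months, state)

# How many combinations share each count value
count_summary <- combo_counts %>%
  count(n, name = "combinations")

count_summary
```

Applied to `colony`, the second step would show how many month and state combinations occur 6 times and how many occur 7 times.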