Suicide Statistics

Author

Oluwatosin Akinmoladun

Introduction

The topic of this project is suicide rates worldwide, using data from the World Health Organization (WHO) on suicide statistics. This dataset includes variables such as:

Country (categorical)
Year (date/categorical)
Sex (categorical)
Age group (categorical)
Number of suicides (quantitative)
Population (quantitative)

The data comes from the World Health Organization, the global public health agency of the United Nations. According to the WHO, suicide is a significant public health issue worldwide, and this dataset was compiled through national statistical agencies and health ministries submitting annual reports of deaths by suicide, aggregated and standardized by the WHO to enable international comparisons.

I chose this dataset because suicide is a critical, complex issue that affects millions of families globally. Understanding how suicide rates vary by age, sex, and country can help identify vulnerable populations and inform prevention strategies. Personally, I believe bringing attention to this data is important to reduce stigma and encourage policies to address mental health more effectively.

Load Library

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(ggplot2)

Load Dataset

df_suicide <- read_csv("C:/Users/tosin/Downloads/who_suicide_statistics(1).csv")

Rows: 43776 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, sex, age
dbl (3): year, suicides_no, population

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean Dataset

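# Clean and filter the suicide dataset (2008 onward)
df_clean <- df_suicide |>
  select(country, year, sex, age, suicides_no, population) |>
  filter(year >= 2008) |>
  # NOTE: despite its name, rate_per_100k holds the raw suicide count, not a
  # population-adjusted rate (that would be suicides_no / population * 100000)
  mutate(rate_per_100k = suicides_no)

# Summarize the average number of suicides per row by sex and age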
df_summary <- df_clean |>
  group_by(sex, age) |>
  summarise(
    avg_suicide_rate = mean(rate_per_100k, na.rm = TRUE)
  )

`summarise()` has grouped output by 'sex'. You can override using the `.groups`
argument.

# View summary table sorted by highest average rate
df_summary |>
  arrange(desc(avg_suicide_rate))

# A tibble: 12 × 3
# Groups:   sex [2]
   sex    age         avg_suicide_rate
   <chr>  <chr>                  <dbl>
 1 male   35-54 years           608.  
 2 male   55-74 years           412.  
 3 male   25-34 years           277.  
 4 male   15-24 years           191.  
 5 female 35-54 years           167.  
 6 male   75+ years             156.  
 7 female 55-74 years           130.  
 8 female 25-34 years            67.1 
 9 female 75+ years              66.9 
10 female 15-24 years            56.9 
11 male   5-14 years             10.4 
12 female 5-14 years              6.85
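
The averages above are means of raw counts (`suicides_no`), so they depend heavily on population size. As a minimal sketch of a population-adjusted alternative, the example below pools suicides and population within each sex and age group before dividing. The `toy` rows and the `avg_count` and `pooled_rate` column names are made up for illustration and are not WHO values.

```r
library(dplyr)

# Toy rows (made-up numbers, NOT WHO data) with the same columns as df_suicide
toy <- tibble::tibble(
  country     = c("A", "A", "B", "B"),
  year        = c(2010, 2010, 2010, 2010),
  sex         = c("male", "female", "male", "female"),
  age         = rep("35-54 years", 4),
  suicides_no = c(50, 10, 400, 100),
  population  = c(100000, 100000, 2000000, 2000000)
)

rate_summary <- toy |>
  # A rate is undefined without a positive population
  filter(!is.na(suicides_no), !is.na(population), population > 0) |>
  group_by(sex, age) |>
  summarise(
    avg_count   = mean(suicides_no),                            # what the table above reports
    pooled_rate = sum(suicides_no) / sum(population) * 100000,  # suicides per 100,000 people
    .groups = "drop"
  )

rate_summary
```

In this toy case the large country B dominates the average count, while the pooled rate expresses both countries on a common per-100,000 scale.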

Make First Plot

plot3 <- ggplot(df_summary, aes(x = age, y = avg_suicide_rate, fill = sex)) +
  geom_col(position = "dodge") +
  labs(
    title = "Average Number of Suicides by Age Group and Sex (2008 Onward)",
    x = "Age Group",
    y = "Average Number of Suicides",
    fill = "Sex",
    caption = "Data source: WHO Suicide Statistics"
  ) +
  scale_fill_manual(values = c("male" = "#0072B2", "female" = "#CC79A7")) +
  theme_dark() +
  # Label the tallest bar; x = 3 is the third age group on the axis ("35-54 years")
  annotate(
    "text",
    x = 3,
    y = 650,
    label = "Highest average",
    color = "red",
    size = 5,
    fontface = "bold"
  )

plot3

When I looked at this plot, I noticed right away that middle-aged adults had the highest average number of suicides, with men aged 35-54 peaking at about 608. What really stood out to me was the size of the gap between men and women: men had higher averages in every age group. The averages were lower for younger adults and for the oldest group, but the 55-74 age group stayed high before dropping off at 75+. One caveat is that these values are average counts for each country, year, sex, and age group row, not rates per 100,000, so large populations and the wider 20-year age bands weigh heavily. To me, this pattern suggests that the stress and responsibilities of middle age, like work pressure or financial issues, might make things worse for people at that stage of life. I was honestly surprised by how much higher the male averages were, roughly two to four times the female averages from age 15 up, and it made me realize how important it is to address mental health for men, especially in middle age.
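
One detail affects how this chart reads: `age` is a character column, so ggplot2 orders the groups alphabetically and places "5-14 years" after "35-54 years". A minimal sketch of one way to get a youngest-to-oldest axis is to convert `age` to a factor with explicit levels. The data frame below reuses the rounded averages from the summary table above; the names `age_levels`, `age_summary`, and `ordered_plot` are illustrative.

```r
library(ggplot2)

# Desired axis order, youngest to oldest
age_levels <- c("5-14 years", "15-24 years", "25-34 years",
                "35-54 years", "55-74 years", "75+ years")

# Rounded averages copied from the summary table above
age_summary <- data.frame(
  sex = rep(c("male", "female"), each = 6),
  age = rep(age_levels, times = 2),
  avg_suicide_rate = c(10.4, 191, 277, 608, 412, 156,
                       6.85, 56.9, 67.1, 167, 130, 66.9)
)

# An explicit factor fixes the order ggplot2 uses on the x axis
age_summary$age <- factor(age_summary$age, levels = age_levels)

ordered_plot <- ggplot(age_summary, aes(x = age, y = avg_suicide_rate, fill = sex)) +
  geom_col(position = "dodge") +
  labs(x = "Age Group", y = "Average Number of Suicides", fill = "Sex")
```

With this ordering, "35-54 years" becomes the fourth position on the axis, so an annotation placed by number would need `x = 4` (or `x = "35-54 years"`).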

Filter data for the interactive plot

country_comparison <- df_suicide |>
  # Remove rows with missing suicide counts and missing or zero population
  filter(!is.na(suicides_no), !is.na(population), population > 0) |>
  # Calculate suicide rate
  mutate(suicide_rate = (suicides_no / population) * 100000) |>
  # Group by country and calculate average
  group_by(country) |>
  summarise(
    avg_rate = mean(suicide_rate, na.rm = TRUE),
    .groups = 'drop'
  ) |>
  # Remove any countries with NA rates
  filter(!is.na(avg_rate)) |>
  # Sort and take top 8
  arrange(desc(avg_rate)) |>
  slice_head(n = 8)

Make Interactive Plot

median_rate <- median(country_comparison$avg_rate, na.rm = TRUE)

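# with_ties = FALSE keeps one row even if two countries are equally close to the median
median_country <- country_comparison |>
  slice_min(abs(avg_rate - median_rate), n = 1, with_ties = FALSE)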

country_plot <- ggplot(country_comparison, aes(
  x = reorder(country, avg_rate),
  y = avg_rate,
  fill = avg_rate,
  text = paste0(
    "Country: ", country, "\n",
    "Avg Rate: ", round(avg_rate, 2)
  )
)) +
  geom_col(alpha = 0.8) +
  coord_flip() +
  scale_fill_gradient(
    low = "#efedf5",  
    high = "#54278f"  
  ) +
  labs(
    title = "Countries with Highest Average Suicide Rates",
    x = "Country",
    y = "Average Suicide Rate (per 100,000 population)",
    fill = "Avg Rate",
    caption = "Data source: WHO Suicide Statistics"
  ) +
  theme_light() +
  annotate(
    "text",
    x = median_country$country,
    y = median_country$avg_rate,
    label = paste0("Median: ", median_country$country),
    hjust = -0.10,
    color = "black",
    size = 4,
    fontface = "bold"
  )

interactive_country_plot <- ggplotly(country_plot, tooltip = "text")

interactive_country_plot

I started by calculating the median of the average suicide rates in country_comparison, which holds only the eight countries with the highest averages, to get a midpoint that is not distorted by the most extreme values. I then found the country whose average rate is closest to this median and used it as a reference point. I used ggplot2 to draw a horizontal bar chart of the average suicide rate for each country, with a color gradient to emphasize differences in rate. To make the median easy to spot, I placed a text annotation at the end of the bar for the country nearest the median. Finally, I made the static chart interactive by converting it with ggplotly(), so users can hover over each bar to see the country name and its exact average suicide rate.
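
With an even number of countries, the median is the midpoint of the two middle rates, so two countries can be exactly the same distance from it. The sketch below uses made-up rates (not WHO values) to show how `slice_min()` behaves in that case with and without `with_ties = FALSE`; the names `toy_rates`, `with_ties_kept`, and `single_row` are illustrative.

```r
library(dplyr)

# Made-up average rates for four hypothetical countries
toy_rates <- tibble::tibble(
  country  = c("A", "B", "C", "D"),
  avg_rate = c(10, 20, 30, 40)
)

# Median of an even-sized set is the mean of the two middle values
toy_median <- median(toy_rates$avg_rate)

# Default behaviour keeps every row tied for the smallest distance
with_ties_kept <- toy_rates |>
  slice_min(abs(avg_rate - toy_median), n = 1)

# with_ties = FALSE returns exactly one row
single_row <- toy_rates |>
  slice_min(abs(avg_rate - toy_median), n = 1, with_ties = FALSE)

nrow(with_ties_kept)
nrow(single_row)
```

Returning a single row matters for the chart because `annotate()` draws one label per value it receives.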

Statistical Analysis - Multiple Linear Regression

# Prepare data for regression
regression_data <- df_suicide |>
  filter(!is.na(suicides_no), !is.na(population), population > 0) |>
  mutate(
    suicide_rate = (suicides_no / population) * 100000,
    male = ifelse(sex == "male", 1, 0),
    age_numeric = case_when(
      age == "5-14 years" ~ 1,
      age == "15-24 years" ~ 2,
      age == "25-34 years" ~ 3,
      age == "35-54 years" ~ 4,
      age == "55-74 years" ~ 5,
      age == "75+ years" ~ 6
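    ) # ChatGPT assisted with this mapping of age groups to ordinal codes 1-6
  ) |>
  # Drop unmapped age groups and extreme rates (200 or more per 100,000)
  filter(!is.na(suicide_rate), !is.na(age_numeric), suicide_rate < 200)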

# Build model
suicide_model <- lm(suicide_rate ~ male + age_numeric + year, data = regression_data)
summary(suicide_model)


Call:
lm(formula = suicide_rate ~ male + age_numeric + year, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-33.684  -9.419  -1.524   5.194 164.202 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 253.573176  17.229686   14.72   <2e-16 ***
male         14.792772   0.175172   84.45   <2e-16 ***
age_numeric   4.212839   0.051293   82.13   <2e-16 ***
year         -0.131359   0.008619  -15.24   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.63 on 36047 degrees of freedom
Multiple R-squared:  0.2812,    Adjusted R-squared:  0.2812 
F-statistic:  4701 on 3 and 36047 DF,  p-value: < 2.2e-16
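
One way to read the coefficient table is to plug values into the fitted equation. The sketch below copies the four estimates from the output above and computes the predicted rate for an illustrative case: age group code 4 (35-54 years) in 2010. The chosen year and age group are only examples, and `predict_rate()` is a hypothetical helper written for this illustration.

```r
# Estimates copied from the summary(suicide_model) output above
b <- c(
  intercept   = 253.573176,
  male        = 14.792772,
  age_numeric = 4.212839,
  year        = -0.131359
)

# Hypothetical helper: evaluates the fitted linear equation by hand
predict_rate <- function(male, age_numeric, year) {
  unname(
    b["intercept"] +
      b["male"] * male +
      b["age_numeric"] * age_numeric +
      b["year"] * year
  )
}

pred_male   <- predict_rate(male = 1, age_numeric = 4, year = 2010)
pred_female <- predict_rate(male = 0, age_numeric = 4, year = 2010)

pred_male
pred_female
pred_male - pred_female  # the gap equals the male coefficient
```

For this case the model predicts about 21.2 suicides per 100,000 for men and about 6.4 for women. The intercept on its own corresponds to year 0, far outside the data, so it is not meaningful by itself.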

# Diagnostic plots
par(mfrow = c(2, 2))
plot(suicide_model)

par(mfrow = c(1, 1))

The diagnostic plots help assess how well the regression model fits the data and whether the assumptions of linear regression are met. They show the relationship between fitted values and residuals, the distribution of the residuals, and any potential outliers or influential points. From these plots, I can check for non-linearity or unusual data points that might affect the reliability of the model’s results. The residual summary above already hints at one issue: the residuals run from -33.7 to 164.2 with a median of -1.5, so they are strongly right-skewed rather than symmetric.
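
One thing a residuals-versus-fitted plot can reveal is whether coding the age groups as 1 to 6 forces a straight-line age effect that the data do not follow. An alternative, sketched below on simulated data only, is to pass sex and age as categorical variables so that `lm()` estimates a separate coefficient for each age group. Everything in `sim` is generated for illustration, and the effect sizes are arbitrary assumptions, not estimates from the WHO data.

```r
set.seed(42)

age_levels <- c("5-14 years", "15-24 years", "25-34 years",
                "35-54 years", "55-74 years", "75+ years")
n <- 600

# Simulated predictors; age is a factor with an explicit level order
sim <- data.frame(
  sex = sample(c("female", "male"), n, replace = TRUE),
  age = factor(sample(age_levels, n, replace = TRUE), levels = age_levels)
)

# Simulated outcome: a male effect plus an age effect that is NOT a straight line
age_effect <- c(0, 5, 8, 12, 11, 9)
sim$suicide_rate <- 2 + 10 * (sim$sex == "male") +
  age_effect[as.integer(sim$age)] + rnorm(n, sd = 2)

# lm() creates dummy variables automatically for character and factor predictors
factor_model <- lm(suicide_rate ~ sex + age, data = sim)
round(coef(factor_model), 2)
```

Each age coefficient is then read as the difference from the reference group ("5-14 years"), and `sexmale` as the difference between men and women with age held fixed.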

Research

For my research, I focused on why men have a much higher rate of suicide than women, a gap that was clear in my first plot. According to the American Institute for Boys and Men, 39,045 men and 10,270 women in the United States took their own lives in 2023, which means the suicide risk for men is nearly four times higher than for women (American Institute for Boys and Men, 2025). Our World in Data explains that while researchers still debate the exact reasons for this gender gap, several factors likely contribute, including differences in the lethality of the suicide methods chosen, stigma around seeking help, different social pressures on men and women, and higher rates of alcohol and drug abuse among men (Our World in Data, 2024). The same source shows that this pattern holds across most countries, not just in the United States (Our World in Data, 2024). Understanding these factors helped me realize how important it is to address the barriers that prevent men from seeking mental health support and to develop targeted prevention strategies that acknowledge these gender-specific risk factors.

Sources:

“Male Suicide: Patterns and Recent Trends.” American Institute for Boys and Men, 6 June 2025, aibm.org/research/male-suicide/.

Ritchie, Hannah. “Suicide Rates Are Higher in Men than Women.” Our World in Data, ourworldindata.org/data-insights/suicide-rates-are-higher-in-men-than-women. Accessed 4 July 2025.

Reflection

Looking back on this project, I’m really proud of what I was able to put together, but there were a few things I wish I could have done better or included. One thing I wanted to show was a time series plot of how suicide rates changed over time for each country or age group, but I couldn’t get it to work properly because there were too many missing values across different years. When I got to the linear regression section, I found it pretty confusing at first. I wasn’t sure how to set up the model properly with categorical variables like sex and age group, or how to interpret the coefficients and p-values. I spent a lot of time researching online and trying different combinations of variables, but I still couldn’t figure everything out on my own. Even after finishing that section, I still don’t feel completely confident in how it turned out. Another issue was the caption for my Plotly chart: I tried adding it through labs() multiple times, but even though the code runs, it never shows up on the interactive plot.
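
A likely cause of the missing caption is that ggplotly() does not carry over captions set in labs(). A common workaround, sketched here on a toy chart with made-up values, is to add the caption as a Plotly layout annotation positioned in "paper" coordinates. The object names and the exact position and margin values are assumptions that may need adjusting for the real chart.

```r
library(ggplot2)
library(plotly)

# Made-up values for three hypothetical countries
toy <- data.frame(country = c("A", "B", "C"), avg_rate = c(30, 25, 20))

toy_plot <- ggplot(toy, aes(x = reorder(country, avg_rate), y = avg_rate)) +
  geom_col() +
  coord_flip() +
  labs(x = "Country", y = "Average Suicide Rate (per 100,000 population)")

caption_text <- "Data source: WHO Suicide Statistics"

interactive_toy <- ggplotly(toy_plot) |>
  layout(
    margin = list(b = 90),            # extra room below the plot for the caption
    annotations = list(
      list(
        text = caption_text,
        x = 1, y = -0.2,              # bottom-right, just under the plotting area
        xref = "paper", yref = "paper",
        xanchor = "right", yanchor = "top",
        showarrow = FALSE,
        font = list(size = 11)
      )
    )
  )

interactive_toy
```

If the chart has a legend, it is worth checking the result, since replacing the annotations can affect other text that ggplotly() draws as annotations.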