Visualization of suicide data in New Zealand (2) - Age Group
Author
Takafumi Kubota
Published
November 4, 2024
Abstract
This report analyzes suicide trends in Aotearoa New Zealand for 2023, focusing on differences across age groups and sexes. Using “Suspected” case data from all ethnic groups, the study employs data cleaning and transformation to ensure accuracy. Utilizing R and ggplot2, the report presents stacked bar charts showing both the number of suicide deaths and rates per 100,000 population across age categories. The findings provide insights for public health officials and policymakers to identify high-risk groups and develop targeted intervention strategies, contributing to efforts to reduce suicide rates and enhance mental health support in New Zealand.
Keywords
R language, Suicide, New Zealand, Bar Chart
Introduction
This page includes information about numbers and rates of suicide deaths in Aotearoa New Zealand. If at any point you feel worried about harming yourself while viewing the information on this page, or if you think someone else may be in danger, please stop reading and seek help.
Suicide remains a critical public health concern globally, with profound impacts on individuals, families, and communities. In Aotearoa New Zealand, understanding the underlying patterns and demographic disparities in suicide trends is essential for developing effective prevention strategies. This report examines suicide data for the year 2023, using “Suspected” cases across all ethnic groups. The primary objective is to show how suicide numbers and rates vary among age groups and sexes, and so to identify vulnerable populations that may benefit from targeted interventions.
The study uses R to process and visualize the data. The initial steps are data cleaning: converting relevant variables to numeric formats and handling missing or anomalous values. Average population counts (pop_mean) are then calculated for each demographic segment and serve as the denominator for the rate computations. To address gaps in the data, a custom imputation function fills each missing pop_mean value with the previous year's value for the same sex and age group or, if that lookup fails, with the group's average across all years.
The visualization phase uses ggplot2 to create stacked bar charts that show both the absolute number of suicide deaths and the corresponding rates per 100,000 population across age groups and sexes. These charts make patterns and disparities easier to identify. The findings aim to support policymakers and healthcare providers in prioritizing resources and designing interventions that address the specific needs of high-risk groups, with the goal of reducing the incidence of suicide and promoting mental well-being across the nation.
The data on this page is sourced from the Suicide Data Web Tool provided by Health New Zealand, specifically from https://tewhatuora.shinyapps.io/suicide-web-tool/, and is licensed under a Creative Commons Attribution 4.0 International License.
This visualisation shows only calendar years. It also visualises only suspected suicides. The following notes are given on the site of the Suicide Data Web Tool:
Short term year-on-year data are not an accurate indicator by which to measure trends. Trends can only be considered over a five to ten year period, or longer.
Confirmed suicide rates generally follow the same pattern as suspected suicide rates.
For the purpose of visualisation, this page uses suicide rates whose population denominators were extracted or estimated from groups with similar attributes, so the graphs should be interpreted with great care. The technical information page for the Suicide Data Web Tool gives the following cautionary note under ‘Interpreting Numerical Values and Rates’:
For groups where suicide numbers are very low, small changes in the numbers of suicide deaths across years can result in large changes in the corresponding rates. Rates that are based on such small numbers are not reliable and can show large changes over time that may not accurately represent underlying suicide trends. Because of issues with particularly small counts, rates in this web tool are not calculated for groups with fewer than six suicide deaths in a given year.
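To make this caution concrete, here is a minimal sketch with made-up numbers (not taken from the web tool). The helper `rate_per_100k` and its `min_count` argument are hypothetical names introduced for illustration; the threshold of six mirrors the rule quoted above.

```r
# Hypothetical helper: rate per 100,000, returned as NA when the count is
# below a minimum, mirroring the "fewer than six deaths" rule quoted above.
rate_per_100k <- function(number, popcount, min_count = 6) {
  rate <- number / popcount * 100000
  ifelse(number < min_count, NA_real_, rate)
}

# Made-up illustration: a group of 50,000 people with 6 deaths in one year
# and 9 in the next. Three additional deaths move the rate from 12 to 18.
small_group <- rate_per_100k(number = c(6, 9), popcount = c(50000, 50000))
small_group

# A count below the threshold yields NA rather than an unstable rate.
suppressed <- rate_per_100k(number = 4, popcount = 50000)
suppressed
```

The code on this page does not apply such a threshold, which is one more reason to read the rate chart cautiously.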
##1.## Load necessary libraries
library(ggplot2)  # For creating visualizations
library(dplyr)    # For data manipulation
library(readr)    # For reading CSV files
library(zoo)      # For handling missing values and time series data

##2.## Load the data from a CSV file
#suicide_trends <- read_csv("data/suicide-trends-by-ethnicity-by-calendar-year.csv")
suicide_trends <- read_csv("https://takafumikubota.jp/suicide/nz/suicide-trends-by-ethnicity-by-calendar-year.csv")

##3.## Filter and transform data for the line plot
suicide_trends_filtered_age <- suicide_trends %>%
  filter(
    data_status == "Suspected",               # Select rows where data_status is "Suspected"
    sex %in% c("Male", "Female", "All sex"),  # Select specific sex categories
    ethnicity == "All ethnic groups",         # Select rows where ethnicity is "All ethnic groups"
    age_group != "All ages"                   # Exclude rows where age_group is "All ages"
  ) %>%
  mutate(number = as.numeric(number))  # Convert the 'number' column to numeric type

##4.## Group by data_status, year, sex, and age_group, then calculate the average popcount for each group
pop_means <- suicide_trends_filtered_age %>%
  mutate(
    popcount_num = as.numeric(popcount),                             # Convert 'popcount' to numeric
    popcount_num = if_else(popcount == "S", NA_real_, popcount_num)  # Replace "S" with NA in 'popcount_num'
  ) %>%
  group_by(data_status, year, sex, age_group) %>%  # Group by data_status, year, sex, and age_group
  summarise(
    pop_mean = mean(popcount_num, na.rm = TRUE),  # Calculate the mean of popcount_num, ignoring NA values
    .groups = 'drop'                              # Drop the grouping after summarisation
  )

##5.## Arrange the pop_means data frame by sex and age_group for consistency
pop_means <- pop_means %>%
  arrange(sex, age_group)

##6.## Extract unique values for year, sex, and age_group to use in the filling function
year.name <- unique(pop_means$year)
sex.name <- unique(pop_means$sex)
age_group.name <- unique(pop_means$age_group)

##7.## Define a function to fill missing pop_mean values
fillpop <- function(k) {
  # Retrieve the current year for the k-th row
  year.this <- year.name[which(pop_means$year[k] == year.name)]
  # Retrieve the previous year; handle cases where the current year is the first year
  year.last <- year.name[which(pop_means$year[k] == year.name) - 1]
  # Retrieve the current sex for the k-th row
  sex.this <- sex.name[which(pop_means$sex[k] == sex.name)]
  # Retrieve the current age_group for the k-th row
  age_group.this <- age_group.name[which(pop_means$age_group[k] == age_group.name)]

  # Attempt to retrieve the pop_mean from the previous year, same sex, and same age_group
  pop.tmp <- tryCatch({
    pop_means %>%
      filter(year == year.last, sex == sex.this, age_group == age_group.this) %>%  # Filter for previous year, current sex, and current age_group
      pull(pop_mean)  # Extract the pop_mean value
  }, error = function(e) {
    # If an error occurs (e.g., previous year does not exist), return the mean pop_mean across all years for the same sex and age_group
    # Ideally, you would look back 2 or 3 years, but for simplicity, use the average when previous data is unavailable
    pop_means %>%
      filter(sex == sex.this, age_group == age_group.this) %>%  # Filter for current sex and age_group across all years
      summarise(pop.tmp = mean(pop_mean, na.rm = TRUE)) %>%     # Calculate the average pop_mean
      pull(pop.tmp)  # Extract the average pop_mean
  })

  # Return the filled pop_mean value
  return(pop.tmp)
}

##8.## Create a copy of pop_means to store the filled values
pop_means2 <- pop_means

##9.## Loop through each row of pop_means to fill in missing pop_mean values
for (k in 1:nrow(pop_means)) {
  # Check if the pop_mean for the k-th row is NaN
  if (is.nan(pop_means[k, ]$pop_mean)) {
    # If NaN, replace it with the value returned by the fillpop function
    pop_means2[k, ]$pop_mean <- fillpop(k)
  }
}

##10.## Sort suicide_trends_filtered_age by year, sex, and age_group to align with pop_means2
suicide_trends_filtered_age <- suicide_trends_filtered_age %>%
  arrange(year, sex, age_group)

##11.## Sort pop_means2 by year, sex, and age_group to ensure alignment
pop_means2 <- pop_means2 %>%
  arrange(year, sex, age_group)

##12.## Replace the 'popcount' column in suicide_trends_filtered_age with the filled 'pop_mean' values from pop_means2
suicide_trends_filtered_age$popcount <- pop_means2$pop_mean

##13.## Filter and transform data for the bar plot
suicide_trends_age <- suicide_trends_filtered_age %>%
  filter(
    data_status == "Suspected",        # Select rows where data_status is "Suspected"
    sex %in% c("Male", "Female"),      # Select only "Male" and "Female" sexes
    age_group != "All ages",           # Exclude rows where age_group is "All ages"
    ethnicity == "All ethnic groups",  # Select rows where ethnicity is "All ethnic groups"
    year == 2023                       # Select data for the year 2023
  ) %>%
  mutate(number = as.numeric(number)) %>%  # Convert the 'number' column to numeric type
  mutate(rate = as.numeric(number) / as.numeric(popcount) * 100000)  # Calculate the suicide rate per 100,000 population

##14.## Get unique age groups and set factor levels in order
age_levels <- unique(suicide_trends_filtered_age$age_group)  # Extract unique age groups from the data
suicide_trends_age$age_group <- factor(suicide_trends_age$age_group, levels = age_levels)  # Set the order of age_group factors

##15.## Define colors for the bar plot
bar_colors <- c(
  "Female" = rgb(102/255, 102/255, 153/255),  # Define color for "Female"
  "Male" = rgb(255/255, 102/255, 102/255)     # Define color for "Male"
)

##16.## Create the stacked bar plot for the number of suicide deaths
ggplot(suicide_trends_age, aes(x = age_group, y = number, fill = sex)) +
  geom_bar(stat = "identity") +  # Use identity statistic to represent actual values
  labs(
    title = "Number of Suicide by Age Group and Sex in Aotearoa New Zealand, 2023",  # Set plot title
    x = "Age Group",           # Set x-axis label
    y = "Number (Suspected)",  # Set y-axis label
    fill = "Sex"               # Set legend title
  ) +
  scale_fill_manual(values = bar_colors) +  # Apply custom colors to the bars
  theme_minimal() +                         # Use a minimal theme for the plot
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5)  # Adjust x-axis text for better readability
  )

##17.## Create the stacked bar plot for the suicide rate
ggplot(suicide_trends_age, aes(x = age_group, y = rate, fill = sex)) +
  geom_bar(stat = "identity") +  # Use identity statistic to represent actual values
  labs(
    title = "Suicide Rate by Age Group and Sex in Aotearoa New Zealand, 2023",  # Set plot title
    x = "Age Group",         # Set x-axis label
    y = "Rate (Suspected)",  # Set y-axis label
    fill = "Sex"             # Set legend title
  ) +
  scale_fill_manual(values = bar_colors) +  # Apply custom colors to the bars
  theme_minimal() +                         # Use a minimal theme for the plot
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5)  # Adjust x-axis text for better readability
  )
##1.## Load necessary libraries
ggplot2: A package for building static visualizations layer by layer, based on the grammar of graphics.
dplyr: Provides a set of functions (verbs) for data manipulation, such as filtering, selecting, and summarizing data.
readr: Facilitates reading data into R, particularly CSV files, with functions that are faster and more user-friendly than base R functions.
zoo: Offers functions for handling missing values, time series data, and data imputation. It is loaded here, but none of its functions are called in the code on this page.
##2.## Load the data from a CSV file
read_csv: Reads the specified CSV file into a tibble (a modern version of a data frame).
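As a small illustration of why the later steps call as.numeric(), the sketch below reads a made-up CSV snippet from a string rather than from the real file. The column names mirror those used in the code on this page, but the values are invented, and the explicit `col_types` specification is an assumption added for clarity (the original call lets readr guess the types).

```r
library(readr)

# Made-up CSV text; I() tells read_csv to treat the string as literal data.
csv_text <- "year,sex,popcount\n2022,Male,330000\n2023,Male,S\n"

toy <- read_csv(
  I(csv_text),
  col_types = cols(
    year     = col_double(),
    sex      = col_character(),
    popcount = col_character()  # kept as text because of the "S" marker
  )
)

# A column that mixes digits with "S" cannot be numeric as read,
# so it has to be converted explicitly afterwards.
is.character(toy$popcount)
```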
##3.## Filter and transform data for the line plot
filter: Selects rows that meet the specified conditions:
data_status == "Suspected": Keeps rows where data_status is “Suspected”.
sex %in% c("Male", "Female", "All sex"): Keeps rows where sex is either “Male”, “Female”, or “All sex”.
ethnicity == "All ethnic groups": Keeps rows where ethnicity is “All ethnic groups”.
age_group != "All ages": Excludes rows where age_group is “All ages”.
mutate: Creates or transforms columns:
number = as.numeric(number): Converts the number column to numeric data type to ensure proper calculations.
##4.## Group by data_status, year, sex, and age_group, then calculate the average popcount for each group
mutate:
popcount_num = as.numeric(popcount): Converts the popcount column to numeric.
if_else(popcount == "S", NA_real_, popcount_num): Replaces any occurrence of “S” in popcount with NA (missing value).
group_by: Groups the data by data_status, year, sex, and age_group to prepare for summarization.
summarise:
pop_mean = mean(popcount_num, na.rm = TRUE): Calculates the average population count for each group, ignoring NA values.
.groups = 'drop': Removes the grouping after summarization to return an ungrouped data frame.
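The following toy example, using made-up rows and hypothetical age labels, sketches what this step produces when a group contains only an "S" entry: the mean of an empty set is NaN, which is the value the is.nan() check in step 9 looks for.

```r
library(dplyr)

# Made-up rows: the 2010 group has only an "S" population count.
toy <- data.frame(
  year      = c(2009, 2010),
  sex       = c("Female", "Female"),
  age_group = c("15-24", "15-24"),
  popcount  = c("300000", "S")
)

toy_means <- toy %>%
  mutate(
    # as.numeric("S") already yields NA (with a coercion warning);
    # suppressWarnings() only keeps this sketch quiet.
    popcount_num = suppressWarnings(as.numeric(popcount))
  ) %>%
  group_by(year, sex, age_group) %>%
  summarise(pop_mean = mean(popcount_num, na.rm = TRUE), .groups = "drop")

# With na.rm = TRUE nothing is left to average for 2010, so the mean is NaN.
is.nan(toy_means$pop_mean)  # FALSE TRUE
```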
##5.## Arrange the pop_means data frame by sex and age_group for consistency
arrange: Sorts the pop_means data frame first by sex and then by age_group to ensure consistent ordering.
##6.## Extract unique values for year, sex, and age_group to use in the filling function
unique: Extracts unique values from the specified columns to create vectors (year.name, sex.name, age_group.name) that will be used in the fillpop function.
##7.## Define a function to fill missing pop_mean values
fillpop: A custom function designed to handle missing (NaN) values in the pop_mean column.
Parameters:
k: The row index in pop_means where the pop_mean is missing.
Process:
Retrieve Current Values: Extracts the current year, sex, and age_group based on the row index k.
Identify Previous Year: Determines the previous year (year.last) relative to the current year.
Attempt to Retrieve Previous pop_mean:
Uses tryCatch to attempt to filter pop_means for the previous year, same sex, and same age_group to get the pop_mean.
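If successful, pop.tmp will hold the pop_mean from the previous year. Because the lookup reads the unfilled pop_means, this value is itself NaN when the previous year is also missing.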
Handle Errors:
If an error occurs (e.g., previous year does not exist), the function calculates the average pop_mean across all available years for the same sex and age_group.
Return Value: The function returns the filled pop_mean value (pop.tmp), either from the previous year or the average.
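For comparison, the same idea (previous year first, group average as fallback) can be sketched with grouped dplyr verbs instead of a row-by-row function. This is an alternative sketch on made-up data, not the author's implementation: it assumes one row per consecutive year within each sex and age group, the column names `prev_year`, `group_mean`, and `pop_filled` are hypothetical, and unlike fillpop it falls back to the group average when the previous year is also missing.

```r
library(dplyr)

# Made-up pop_mean values: missing in 2009 (no earlier year) and in 2011.
toy_means <- data.frame(
  year      = c(2009, 2010, 2011, 2012),
  sex       = "Male",
  age_group = "25-44",
  pop_mean  = c(NaN, 600000, NaN, 620000)
)

filled <- toy_means %>%
  group_by(sex, age_group) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(
    prev_year  = lag(pop_mean),                 # value from the preceding row (year)
    group_mean = mean(pop_mean, na.rm = TRUE),  # average over all available years
    pop_filled = case_when(
      !is.na(pop_mean)  ~ pop_mean,    # keep observed values
      !is.na(prev_year) ~ prev_year,   # otherwise use the previous year
      TRUE              ~ group_mean   # otherwise use the group average
    )
  ) %>%
  ungroup()

filled$pop_filled
```

Here 2011 takes the 2010 value, while 2009 has no earlier year and receives the average of the two observed years.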
##8.## Create a copy of pop_means to store the filled values
pop_means2: A duplicate of pop_means where the filled pop_mean values will be stored.
##9.## Loop through each row of pop_means to fill in missing pop_mean values
for Loop:
Iterates over each row of pop_means.
Condition: Checks if the pop_mean for the current row (k) is NaN using is.nan().
Action: If pop_mean is NaN, the function fillpop(k) is called to compute a replacement value, which is then assigned to pop_means2[k,]$pop_mean.
##10.## Sort suicide_trends_filtered_age by year, sex, and age_group to align with pop_means2
arrange: Ensures that suicide_trends_filtered_age is sorted by year, sex, and age_group to maintain alignment with pop_means2.
##11.## Sort pop_means2 by year, sex, and age_group to ensure alignment
arrange: Sorts pop_means2 by year, sex, and age_group to ensure it aligns correctly with suicide_trends_filtered_age.
##12.## Replace the ‘popcount’ column in suicide_trends_filtered_age with the filled ‘pop_mean’ values from pop_means2
Assignment: Updates the popcount column in suicide_trends_filtered_age with the filled pop_mean values from pop_means2.
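Steps 10 to 12 rely on both data frames having the same number of rows in the same order. A key-based join is one way to avoid that assumption. The sketch below uses made-up stand-ins (`deaths` and `pops` are hypothetical names, and the values are invented) and is offered as an alternative, not as the method used on this page.

```r
library(dplyr)

# Made-up stand-ins for the filtered suicide data and the filled population means.
deaths <- data.frame(
  year      = c(2023, 2023, 2022),
  sex       = c("Male", "Female", "Male"),
  age_group = c("15-24", "15-24", "15-24"),
  number    = c(40, 15, 38)
)
pops <- data.frame(
  year      = c(2022, 2023, 2023),
  sex       = c("Male", "Female", "Male"),
  age_group = c("15-24", "15-24", "15-24"),
  pop_mean  = c(330000, 320000, 335000)
)

# Match rows on their keys instead of relying on identical sort order.
joined <- deaths %>%
  left_join(pops, by = c("year", "sex", "age_group")) %>%
  rename(popcount = pop_mean)

joined
```

With a join, a missing or duplicated key shows up as an NA or an extra row instead of silently shifting values onto the wrong group.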
##13.## Filter and transform data for the bar plot
filter: Selects rows that meet specific conditions:
data_status == "Suspected": Keeps rows where data_status is “Suspected”.
sex %in% c("Male", "Female"): Only includes “Male” and “Female” categories, excluding “All sex”.
age_group != "All ages": Excludes rows where age_group is “All ages”.
ethnicity == "All ethnic groups": Keeps rows where ethnicity is “All ethnic groups”.
year == 2023: Focuses on data from the year 2023.
mutate:
number = as.numeric(number): Ensures the number column is numeric.
rate = as.numeric(number)/as.numeric(popcount)*100000: Calculates the suicide rate per 100,000 population.
##14.## Get unique age groups and set factor levels in order
unique: Retrieves all unique age groups present in the data.
factor:
Converts the age_group column to a factor with levels ordered according to age_levels. This ensures that the age groups appear in a logical and consistent order in the plots.
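One caveat is worth sketching. Because the data frame was sorted with arrange() in step 10, unique() returns the age labels in character sort order, and character order is not always numeric order. The labels below are hypothetical and chosen only to show the effect; whether the real labels are affected has not been checked here.

```r
# Hypothetical age labels in their intended (numeric) order.
age_group <- c("5-14", "15-24", "45-64")

# Sorting as text compares character by character, so "5-14" lands last.
sorted_as_text <- sort(age_group)
sorted_as_text  # "15-24" "45-64" "5-14"

# Spelling the levels out makes the plotting order explicit.
explicit <- factor(sorted_as_text, levels = c("5-14", "15-24", "45-64"))
levels(explicit)
```

If the chart shows an age group out of place, passing an explicit `levels` vector like this is a simple fix.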
##15.## Define colors for the bar plot
rgb: Specifies colors using red, green, and blue components, each ranging from 0 to 1.
"Female" is assigned a muted purple color.
"Male" is assigned a light red color.
##16.## Create the stacked bar plot for the number of suicide deaths
ggplot: Initializes the plotting with suicide_trends_age as the data source.
aes: Maps aesthetics, setting age_group on the x-axis, number on the y-axis, and filling bars based on sex.
geom_bar(stat = "identity"): Creates bar plots where the heights of the bars represent actual data values.
labs: Adds labels to the plot, including title, x-axis, y-axis, and legend.
scale_fill_manual: Applies the custom colors defined in bar_colors to the bars based on sex.
theme_minimal: Applies a clean, minimalistic theme to the plot.
theme:
axis.text.x = element_text(angle = 0, hjust = 0.5): Keeps the x-axis labels horizontal and centers them.
##17.## Create the stacked bar plot for the suicide rate
ggplot: Initializes the plotting with suicide_trends_age as the data source.
aes: Maps aesthetics, setting age_group on the x-axis, rate on the y-axis, and filling bars based on sex.
geom_bar(stat = "identity"): Creates bar plots where the heights of the bars represent actual data values.
labs: Adds labels to the plot, including title, x-axis, y-axis, and legend.
scale_fill_manual: Applies the custom colors defined in bar_colors to the bars based on sex.
theme_minimal: Applies a clean, minimalistic theme to the plot.
theme:
axis.text.x = element_text(angle = 0, hjust = 0.5): Keeps the x-axis labels horizontal and centers them.
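A possible variation for the rate chart is sketched below with made-up values. In a stacked chart the total bar height is the sum of the female and male rates, which is not the rate for both sexes combined, so placing the bars side by side may be easier to read. The sketch uses geom_col(), which is ggplot2's shorthand for geom_bar(stat = "identity"); the data frame `toy_rates` and its values are hypothetical.

```r
library(ggplot2)

# Made-up rates per 100,000 for two hypothetical age groups.
toy_rates <- data.frame(
  age_group = factor(rep(c("15-24", "25-44"), each = 2),
                     levels = c("15-24", "25-44")),
  sex       = rep(c("Female", "Male"), times = 2),
  rate      = c(8, 20, 7, 22)
)

# Same colors as in step 15.
bar_colors <- c(
  "Female" = rgb(102/255, 102/255, 153/255),
  "Male"   = rgb(255/255, 102/255, 102/255)
)

p <- ggplot(toy_rates, aes(x = age_group, y = rate, fill = sex)) +
  geom_col(position = "dodge") +  # side-by-side bars instead of stacked
  scale_fill_manual(values = bar_colors) +
  labs(x = "Age Group", y = "Rate per 100,000 (made-up values)", fill = "Sex") +
  theme_minimal()

p
```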
Summary of the Workflow
Loading Libraries (##1.##):
Essential libraries for data manipulation, reading, handling missing values, and visualization are loaded.
Data Loading (##2.##):
Suicide trend data is imported from a CSV file into R for analysis.
Data Filtering and Transformation (##3.##):
The data is filtered to keep suspected cases, the three sex categories (“Male”, “Female”, “All sex”), all ethnic groups combined, and the individual age groups (the “All ages” rows are excluded).
The number column is converted to a numeric type to facilitate accurate calculations.
Calculating Average Population Counts (##4.##):
The population counts (popcount) are converted to numeric, with any “S” entries replaced by NA.
The data is grouped by relevant categories to compute the average population count (pop_mean) for each group.
Data Arrangement (##5.##):
The pop_means data frame is sorted by sex and age_group to ensure consistency in subsequent operations.
Extracting Unique Values (##6.##):
Unique values for year, sex, and age_group are extracted to facilitate the filling of missing population counts.
Defining the Filling Function (##7.##):
A custom function fillpop is defined to handle missing (NaN) population counts by attempting to retrieve the value from the previous year or, if unavailable, using the average across available years for the same sex and age_group.
Creating a Copy for Filled Values (##8.##):
A duplicate of pop_means (pop_means2) is created to store the filled pop_mean values.
Filling Missing Values via Loop (##9.##):
A loop iterates through each row of pop_means, checking for NaN values in pop_mean and filling them using the fillpop function.
Sorting for Alignment (##10.## & ##11.##):
Both suicide_trends_filtered_age and pop_means2 are sorted by year, sex, and age_group to ensure proper alignment for data replacement.
Updating the Original Data Frame (##12.##):
The original suicide_trends_filtered_age data frame is updated with the filled pop_mean values from pop_means2.
Preparing Data for Visualization (##13.##):
The data is further filtered to the year 2023 and to the “Male” and “Female” sexes, still excluding the “All ages” rows.
Suicide rates per 100,000 population are calculated to facilitate meaningful comparisons.
Setting Factor Levels (##14.##):
Age groups are ordered consistently to ensure that visualizations display age categories in a logical sequence.
Defining Plot Colors (##15.##):
Custom colors are defined for the “Male” and “Female” categories to enhance the visual appeal and clarity of the plots.
Creating the Number of Suicide Deaths Bar Plot (##16.##):
A stacked bar plot is created to visualize the number of suicide deaths by age group and sex.
Custom labels, colors, and themes are applied for better readability and aesthetics.
Creating the Suicide Rate Bar Plot (##17.##):
Another stacked bar plot is created to visualize the suicide rate per 100,000 population by age group and sex.
Similar customization as the previous plot ensures consistency and clarity.