Visualization of suicide data in New Zealand (3) - Choropleth map

Author

Takafumi Kubota

Published

November 18, 2024

Abstract

This report visualizes suicide data in New Zealand for 2023 using choropleth maps to highlight spatial variations across districts. Leveraging R and the ggplot2 package, the study processes and cleans data from official sources, addressing missing and anomalous values to ensure accuracy. Two maps are generated: one depicting the absolute number of suicides and another illustrating suicide rates per 100,000 population. The visualizations reveal significant disparities between regions, identifying high-risk districts that require targeted intervention. These insights aim to inform public health strategies and policy-making to effectively address and mitigate suicide rates across New Zealand.

Keywords

Suicide Visualization, Choropleth Map, Public Health

Summary

Suicide remains a critical public health issue globally, necessitating effective data analysis and visualization to inform prevention strategies. This report focuses on the visualization of suicide data in New Zealand for the year 2023, employing choropleth maps to elucidate spatial patterns and disparities across various districts. Utilizing the R programming language and the ggplot2 package, the study meticulously processes and cleans the data to ensure reliability and accuracy in the resulting visualizations.

The data sources include district codes and suicide trends segmented by district and calendar year, obtained from official online repositories. Initial data cleaning involves addressing inconsistencies such as non-numeric entries (“S” and “-”) and missing values (NA) in critical columns like rate, number, and popcount. These anomalies are systematically replaced with standardized values or imputed using group means to maintain data integrity. The population counts (popcount) are particularly crucial, as they serve as the denominator in calculating standardized suicide rates per 100,000 population, allowing for meaningful comparisons across districts with varying population sizes.

After cleaning, the dataset is enriched by merging district codes, facilitating the alignment of suicide data with geographical regions. The analysis concentrates on suspected suicide cases in 2023, aggregating the total number of suicides and population counts per district. This aggregation enables the calculation of suicide rates, which are essential for identifying areas with disproportionately high suicide prevalence relative to their population.

The spatial aspect of the analysis leverages the spData package, which provides comprehensive map data for New Zealand. By merging the cleaned suicide data with the spatial polygons representing each district, the study updates a unified dataset (nz) suitable for visualization. Two distinct choropleth maps are generated using ggplot2:

Number of Suicides by District: This map employs a yellow-to-red color gradient to represent the absolute number of suicides in each district. Darker shades indicate higher suicide counts, allowing for the immediate identification of regions with elevated suicide incidents.
Suicide Rate by District: To account for population differences, this map visualizes suicide rates per 100,000 population. The same color gradient is used for consistency, with grey representing districts with missing data. This standardized rate provides a clearer picture of suicide prevalence, independent of district size.

The resulting visualizations reveal significant spatial disparities in suicide occurrences across New Zealand. Certain districts emerge as high-risk areas, exhibiting both high absolute numbers and elevated suicide rates. These insights are invaluable for public health officials and policymakers, as they highlight regions that may benefit from targeted mental health interventions, resource allocation, and preventive measures.

Moreover, the methodological approach demonstrated in this report underscores the importance of rigorous data cleaning and preprocessing in producing reliable visualizations. By addressing data anomalies and ensuring accurate population counts, the study lays a solid foundation for meaningful analysis and interpretation.

In conclusion, the choropleth maps serve as powerful tools for understanding the geographical distribution of suicide in New Zealand. They not only facilitate the identification of high-risk districts but also support the development of informed strategies aimed at reducing suicide rates and enhancing mental health support systems nationwide. Future work may extend this analysis by incorporating temporal trends, demographic factors, and evaluating the effectiveness of implemented interventions over time.

Full code

# Loading required packages
library(sf)
library(terra)
library(dplyr)
library(spData)
library(ggplot2)
library(readr)
library(RColorBrewer)

# Reading data
scode <- read_csv("https://takafumikubota.jp/suicide/nz/scode.csv")
suicide_area <- read_csv("https://takafumikubota.jp/suicide/nz/suicide-trends-by-district-by-calendar-year.csv")
suicide_area <- suicide_area %>% filter(district != "All New Zealand")

# Replace 'S' and NA in 'rate' column with '0' and convert to numeric
suicide_area <- suicide_area %>%
  mutate(
    rate = ifelse(rate == "S" | is.na(rate), "0", rate),
    rate = as.numeric(rate)
  )

# Replace '-' and NA in 'number' column with '0' and convert to numeric
suicide_area <- suicide_area %>%
  mutate(
    number = ifelse(number == "-" | is.na(number), "0", number),
    number = as.numeric(number)
  )

# Replace 'S' in 'popcount' with NA and impute with mean values
suicide_area_clean <- suicide_area %>%
  mutate(
    popcount_num = as.numeric(popcount),
    popcount_num = if_else(popcount == "S", NA_real_, popcount_num)
  )

pop_means <- suicide_area_clean %>%
  group_by(data_status, district) %>%
  summarise(
    pop_mean = mean(popcount_num, na.rm = TRUE),
    .groups = 'drop'
  )

suicide_area_updated <- suicide_area_clean %>%
  left_join(pop_means, by = c("data_status", "district")) %>%
  mutate(
    popcount_final = if_else(is.na(popcount_num), pop_mean, popcount_num)
  ) %>%
  select(-popcount, -popcount_num, -pop_mean) %>%
  rename(popcount = popcount_final)

suicide_area <- suicide_area_updated

# Adding district codes
suicide_area <- suicide_area %>%
  left_join(scode, by = c("district" = "district")) %>%
  rename(area_code = mcode)

# Creating data for the map
mapdata <- suicide_area %>%
  filter(year == 2023, data_status == "Suspected") %>%
  group_by(area_code) %>%
  summarise(
    year = unique(year),
    number = sum(number),
    popcount = sum(popcount),
    rate = (number / popcount) * 100000
  ) %>%
  ungroup()

# Loading New Zealand map data
data("nz", package = "spData")

# Merging map data with spatial data
nz <- nz %>% mutate(area_code = c(1:12,12,13,13,13)) %>%
  left_join(mapdata, by = "area_code")

ggplot(nz) +
  geom_sf(aes(fill = number)) +
  scale_fill_gradientn(
    colors = brewer.pal(5, "YlOrRd"),
    na.value = "grey50"
  ) +
  theme_minimal() +
  labs(
    title = "The number of suicides by district in 2023",
    fill = "Number"
  )

ggplot(nz) +
  geom_sf(aes(fill = rate)) +
  scale_fill_gradientn(
    colors = brewer.pal(5, "YlOrRd"),
    na.value = "grey50"
  ) +
  theme_minimal() +
  labs(
    title = "Suicide rate by district in 2023",
    fill = "Rate"
  )

Loading required packages: Initializes the necessary libraries for spatial data handling, data manipulation, and visualization.
Reading data: Imports district codes and suicide trend data from specified URLs, filtering out aggregated national data to focus on individual districts.
Replace ‘S’ and NA in ‘rate’ column with ‘0’ and convert to numeric: Cleans the rate column by replacing non-numeric entries (“S”) and missing values (NA) with “0” and converting the column to a numeric type for analysis.
Replace ‘-’ and NA in ‘number’ column with ‘0’ and convert to numeric: Cleans the number column by replacing placeholder (“-”) and missing values (NA) with “0” and converting the column to numeric for accurate calculations.
Replace ‘S’ in ‘popcount’ with NA and impute with mean values: Handles the popcount column by replacing “S” with NA to denote missing values and prepares to impute these missing values with the mean population counts.
Adding district codes: Merges district codes into the main dataset to enable accurate mapping of suicide data to specific geographical areas.
Creating data for the map: Filters the data for the year 2023 and “Suspected” cases, then aggregates the number of suicides and population counts by district to calculate suicide rates per 100,000 population.
Loading New Zealand map data: Imports spatial polygon data for New Zealand regions from the spData package to serve as the base map for visualization.
Merging map data with spatial data: Combines the aggregated suicide data with the spatial map data using area_code to prepare for choropleth mapping.

フルコード

# 必要なパッケージの読み込み
library(sf)
library(terra)
library(dplyr)
library(spData)
library(ggplot2)
library(readr)
library(RColorBrewer)

# データの読み込み
scode <- read_csv("https://takafumikubota.jp/suicide/nz/scode.csv")
suicide_area <- read_csv("https://takafumikubota.jp/suicide/nz/suicide-trends-by-district-by-calendar-year.csv")
suicide_area <- suicide_area %>% filter(district != "All New Zealand")

# 'rate'カラムの'S'とNAを'0'に置換し、数値に変換
suicide_area <- suicide_area %>%
  mutate(
    rate = ifelse(rate == "S" | is.na(rate), "0", rate),
    rate = as.numeric(rate)
  )

# 'number'カラムの'-'とNAを'0'に置換し、数値に変換
suicide_area <- suicide_area %>%
  mutate(
    number = ifelse(number == "-" | is.na(number), "0", number),
    number = as.numeric(number)
  )

# 'popcount'の'S'をNAに置換し、平均値で補完
suicide_area_clean <- suicide_area %>%
  mutate(
    popcount_num = as.numeric(popcount),
    popcount_num = if_else(popcount == "S", NA_real_, popcount_num)
  )

pop_means <- suicide_area_clean %>%
  group_by(data_status, district) %>%
  summarise(
    pop_mean = mean(popcount_num, na.rm = TRUE),
    .groups = 'drop'
  )

suicide_area_updated <- suicide_area_clean %>%
  left_join(pop_means, by = c("data_status", "district")) %>%
  mutate(
    popcount_final = if_else(is.na(popcount_num), pop_mean, popcount_num)
  ) %>%
  select(-popcount, -popcount_num, -pop_mean) %>%
  rename(popcount = popcount_final)

suicide_area <- suicide_area_updated

# 地区コードを追加
suicide_area <- suicide_area %>%
  left_join(scode, by = c("district" = "district")) %>%
  rename(area_code = mcode)

# マップ用のデータを作成
mapdata <- suicide_area %>%
  filter(year == 2023, data_status == "Suspected") %>%
  group_by(area_code) %>%
  summarise(
    year = unique(year),
    number = sum(number),
    popcount = sum(popcount),
    rate = (number / popcount) * 100000
  ) %>%
  ungroup()

# ニュージーランドの地図データを読み込み
data("nz", package = "spData")

# 地図データにマップ用データを結合
nz <- nz %>% mutate(area_code = c(1:12,12,13,13,13)) %>%
  left_join(mapdata, by = "area_code")

ggplot(nz) +
  geom_sf(aes(fill = number)) +
  scale_fill_gradientn(
    colors = brewer.pal(5, "YlOrRd"),
    na.value = "grey50"
  ) +
  theme_minimal() +
  labs(
    title = "The number of suicides by district in 2023",
    fill = "Number"
  )

ggplot(nz) +
  geom_sf(aes(fill = rate)) +
  scale_fill_gradientn(
    colors = brewer.pal(5, "YlOrRd"),
    na.value = "grey50"
  ) +
  theme_minimal() +
  labs(
    title = "Suicide rate by district in 2023",
    fill = "Rate"
  )

1. 必要なパッケージの読み込み

最初に、データの処理や可視化に必要な各種パッケージを読み込む。sfは空間データの操作を可能にし、terraはラスターおよびベクター空間データの処理を支援する。dplyrはデータフレームの操作を効率化し、spDataはサンプルの空間データセットを提供する。ggplot2は強力なデータ可視化ツールであり、readrはデータの読み込みを簡素化する。RColorBrewerは視覚的に魅力的なカラーパレットを提供する。

2. データの読み込み

read_csv関数を用いて、指定されたURLから2つのCSVファイルを読み込む。scodeには地区コードが含まれており、suicide_areaには地区別および年度別の自殺傾向データが含まれている。次に、filter関数を用いて「All New Zealand」という集計データを除外し、個々の地区に焦点を当てる。

3. ’rate’カラムの’S’とNAを’0’に置換し、数値に変換

mutate関数を使用して、rateカラム内の「S」や欠損値（NA）を「0」に置き換える。この操作により、非数値データや欠損値を統一し、数値型への変換を可能にする。最終的にrateカラムを数値型に変換することで、後続の計算やプロットが可能となる。

4. ‘number’カラムの’-’とNAを’0’に置換し、数値に変換

同様に、numberカラム内の「-」や欠損値（NA）を「0」に置き換え、数値型に変換する。これにより、データの一貫性を保ち、解析に適した形式に整える。

5. ’popcount’の’S’をNAに置換し、平均値で補完

popcountカラムに含まれる「S」を欠損値（NA）に置き換える。次に、popcount_numという新たな数値型カラムを作成し、popcountの値を数値に変換する。この過程で、「S」はNAとして扱われ、後に平均値で補完する準備を整える。

6. 平均人口数の計算

group_byとsummariseを用いて、data_statusおよびdistrictごとにpopcount_numの平均値を計算する。これにより、欠損している人口数を各グループの平均値で補完する基礎データを作成する。

7. 欠損人口数の補完

計算した平均人口数（pop_mean）を元に、popcount_numが欠損している場合には平均値で補完する。select関数を用いて不要なカラムを除去し、renameで最終的なpopcountカラムを更新する。これにより、全ての地区において人口数が補完されたデータセットが完成する。

8. メインデータフレームの更新

補完後のデータセット（suicide_area_updated）を元のsuicide_areaに再代入し、以降の処理でクリーンなデータを使用できるようにする。

9. 地区コードの追加

left_joinを用いて、scodeデータフレームとsuicide_areaを地区名で結合し、各地区に対応するコード（mcode）を追加する。これをarea_codeと名称変更し、後続の空間データとの統合に備える。

10. マップ用データの作成

filter関数で2023年および「Suspected」ステータスのデータに絞り込み、group_byとsummariseを用いて各area_codeごとに自殺数（number）、人口数（popcount）、および100,000人あたりの自殺率（rate）を計算する。これにより、地理的に標準化された自殺率を算出し、地区間の比較を可能にする。

11. ニュージーランドの地図データの読み込み

spDataパッケージからnzデータセットを読み込み、ニュージーランドの各地区の空間ポリゴンデータを取得する。これが後の可視化の基盤となる。

12. 地図データと自殺データの結合

mutate関数でnzデータにarea_codeを追加し、left_joinを用いてmapdataと結合する。これにより、空間データと属性データが統合された新たなデータフレーム（nz）が更新され、地図上での可視化が可能となる。

13. 自殺数の地区別地図の作成

ggplotを用いて、nzデータフレームから自殺数（number）を基にしたコロプレス地図を作成する。geom_sfで空間データをプロットし、scale_fill_gradientnで「YlOrRd」パレットを用いた色のグラデーションを設定する。theme_minimalでシンプルなテーマを適用し、labsでタイトルと凡例ラベルを追加する。これにより、各地区の自殺数の分布が視覚的に明示される。

14. 自殺率の地区別地図の作成

同様に、ggplotを用いて、自殺率（rate）を基にしたコロプレス地図を作成する。前述の設定と同様に色のグラデーションやテーマを適用し、各地区の自殺率の分布を視覚化する。これにより、人口規模に依存しない自殺の相対的な発生状況が明らかになる。