Economics of Mental Health: Exploring the Link Between GDP and Depression

Author

Martia Eyi

Credit: Image generated by ChatGPT

Introduction

This project explores the relationship between national income levels and the prevalence of depression. I chose this topic because the term “depression” personally resonates with me. While it’s most commonly associated with mental health, I was intrigued that the word also has an economic meaning. This curiosity led me to investigate how income levels and depression interact on a global scale, and whether there is a measurable connection between the two.

The dataset used in this analysis was obtained from the World Bank Open Data platform. The data was collected by the World Bank, an international financial institution that provides access to economic, social, and health statistics from member countries. Although the dataset did not include a ReadMe file or documentation describing the exact data collection methodology, it is likely that the information was gathered through national statistical offices, household health surveys, and international monitoring programs.

The dataset includes key variables such as country name, year, depression prevalence, and GDP per capita, with indicator codes representing each type of data. These variables allow for both cross-sectional and time-series analysis across regions and income classifications. Categorical variables include country and region, while quantitative variables include year, GDP per capita, and depression prevalence. This structure enables a meaningful analysis of the global relationship between economic conditions and mental health outcomes.

To prepare the data for analysis, I performed data cleaning by removing entries with missing or null values. I also narrowed the focus to more recent years in order to ensure that the insights drawn would be timely and relevant. These steps helped maintain the accuracy and integrity of the dataset, allowing for a more meaningful and up-to-date analysis.

Loading libraries for analysis

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(dplyr)
library(ggthemes)
library(ggplot2)

Importing the dataset and setting the working directory

To begin the analysis, I set the working directory to the location where my dataset is stored. Then, I loaded the dataset using the read_csv() function from readr (part of the tidyverse) and displayed the first few rows with head() to verify that the data was imported correctly. This step ensures that I can easily access and reference the dataset throughout the project.

# Setting working directory
setwd("C:/Users/MCuser/Downloads")
# Loading the dataset
depression_data <- read_csv("depression_income.csv")

Rows: 6468 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): country, iso3c, iso2c, region, income
dbl (6): year, prevalence, gdp_percap, population, birth_rate, neonat_mortal...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# View depression data
head(depression_data)

# A tibble: 6 × 11
  country     iso3c  year prevalence iso2c gdp_percap population birth_rate
  <chr>       <chr> <dbl>      <dbl> <chr>      <dbl>      <dbl>      <dbl>
1 Afghanistan AFG    1990    318436. AF            NA   12067570       49.0
2 Afghanistan AFG    1991    329045. AF            NA   12789374       48.9
3 Afghanistan AFG    1992    382545. AF            NA   13745630       48.8
4 Afghanistan AFG    1993    440382. AF            NA   14824371       48.8
5 Afghanistan AFG    1994    456917. AF            NA   15869967       48.9
6 Afghanistan AFG    1995    471475. AF            NA   16772522       49.0
# ℹ 3 more variables: neonat_mortal_rate <dbl>, region <chr>, income <chr>

Identifying missing data in the dataset

Before performing any cleaning or analysis, it’s important to assess if the dataset contains missing values. Using the colSums(is.na()) function, I calculated how many missing values are present in each variable. This helps determine which columns may need to be cleaned or handled differently to ensure accurate results.

# Checking for missing values
colSums(is.na(depression_data))  # Summarizes how many missing values exist in each variable

           country              iso3c               year         prevalence 
                 0                980                  0                  0 
             iso2c         gdp_percap         population         birth_rate 
              2243               2567               2224               2311 
neonat_mortal_rate             region             income 
              2368               2218               2218
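Raw counts are easier to judge next to the size of the dataset. As a minimal sketch, the same check can be expressed as a percentage of rows per column. The `toy` data frame and its values are made up for illustration and only stand in for `depression_data`.

```r
# Toy data frame standing in for depression_data (values are invented)
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  prevalence = c(100, 200, 300, 400),
  gdp_percap = c(NA, 1500, NA, 2500),
  region     = c("X", NA, "Y", "Y")
)

# is.na() gives a TRUE/FALSE matrix; colMeans() turns it into the
# share of missing values per column, shown here as a percentage
missing_pct <- round(colMeans(is.na(toy)) * 100, 1)
missing_pct
```

Applied to the real data, `colMeans(is.na(depression_data))` would show which variables lose the largest share of rows if incomplete records are dropped.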

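After identifying columns with missing values, I decided to remove rows that have missing values in the key analysis variables: prevalence, GDP per capita, region, and income. Other columns, such as birth rate and neonatal mortality rate, may still contain NAs after this step. Restricting the filter to the key variables keeps as many usable records as possible while ensuring the variables used in the analysis are complete.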

# Cleaning data: removing rows with missing values in key columns
depression_data <- depression_data %>%
  filter(
    !is.na(prevalence),
    !is.na(gdp_percap),
    !is.na(region),
    !is.na(income)
  )  # Keeps rows only where all key variables are not missing
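The filter above checks only four columns, so it is worth seeing how that differs from dropping every row with any NA. The sketch below compares the two approaches with `tidyr::drop_na()` on a made-up `toy` tibble. The data and the names `key_clean` and `all_clean` are illustrative assumptions, not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for depression_data (values are invented)
toy <- tibble(
  country    = c("A", "B", "C", "D"),
  prevalence = c(100, 200, 300, 400),
  gdp_percap = c(1000, NA, 3000, 4000),
  region     = c("X", "X", NA, "Y"),
  income     = c("Low", "Low", "High", "High"),
  birth_rate = c(30, 25, 12, NA)
)

# Drop rows with NAs only in the key analysis columns
key_clean <- toy %>% drop_na(prevalence, gdp_percap, region, income)

# Drop rows with an NA in any column
all_clean <- toy %>% drop_na()

# Compare how many rows survive each approach
c(original = nrow(toy), key_columns = nrow(key_clean), any_column = nrow(all_clean))
```

Country D survives the key-column filter even though its `birth_rate` is missing, which is why a later summary can still report NAs in columns that were not part of the filter.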

To get a general understanding of the dataset, I used the summary() function to display basic descriptive statistics. This includes the minimum, maximum, mean, and quartiles for each numeric variable, helping identify patterns and possible outliers.

# Generating summary statistics for key variables
summary(depression_data)

   country             iso3c                year        prevalence      
 Length:3901        Length:3901        Min.   :1990   Min.   :     931  
 Class :character   Class :character   1st Qu.:1996   1st Qu.:   70282  
 Mode  :character   Mode  :character   Median :2002   Median :  215531  
                                       Mean   :2002   Mean   : 1204673  
                                       3rd Qu.:2008   3rd Qu.:  604661  
                                       Max.   :2014   Max.   :54949281  
                                                                        
    iso2c             gdp_percap         population          birth_rate   
 Length:3901        Min.   :   239.7   Min.   :4.730e+04   Min.   : 7.60  
 Class :character   1st Qu.:  2158.3   1st Qu.:2.337e+06   1st Qu.:13.40  
 Mode  :character   Median :  6474.7   Median :7.503e+06   Median :22.15  
                    Mean   : 12649.5   Mean   :3.604e+07   Mean   :24.39  
                    3rd Qu.: 17409.9   3rd Qu.:2.054e+07   3rd Qu.:34.42  
                    Max.   :141442.2   Max.   :1.364e+09   Max.   :55.12  
                                                           NA's   :38     
 neonat_mortal_rate    region             income         
 Min.   : 1.00      Length:3901        Length:3901       
 1st Qu.: 6.10      Class :character   Class :character  
 Median :15.10      Mode  :character   Mode  :character  
 Mean   :19.25                                           
 3rd Qu.:29.30                                           
 Max.   :73.10                                           
 NA's   :48

To simplify the analysis, I selected only the most relevant variables from the dataset: country, year, region, prevalence, GDP per capita, and income classification. I also filtered the dataset to include only the most recent year of data available, which allows for a focused, cross-sectional analysis of global depression and income levels.

# Selecting relevant columns
depression_data_selected <- depression_data %>%
  select(country, year, region, prevalence, gdp_percap, income)

# Filtering the selected columns for the most recent year
recent_data <- depression_data_selected %>%
  filter(year == max(year))
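Filtering on `max(year)` keeps only countries that have a record in the single latest year of the whole dataset. If some countries stop reporting earlier, they drop out entirely. As a hedged alternative sketch, the latest available year can be taken per country with `slice_max()`. The `toy` data below is invented, and whether the real dataset is unbalanced in this way is an assumption that would need checking.

```r
library(dplyr)

# Toy panel: country C has no record for 2014 (values are invented)
toy <- tibble(
  country    = c("A", "A", "B", "B", "C"),
  year       = c(2013, 2014, 2013, 2014, 2013),
  prevalence = c(10, 11, 20, 21, 30)
)

# Approach used in the analysis: one global most recent year
global_latest <- toy %>%
  filter(year == max(year))

# Alternative: the most recent year available for each country
per_country_latest <- toy %>%
  group_by(country) %>%
  slice_max(year, n = 1) %>%
  ungroup()

nrow(global_latest)
nrow(per_country_latest)
```

The global filter gives a clean single-year cross-section, while the per-country version keeps more countries at the cost of mixing years.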

Background Research

The intersection between economic stability and mental health has been widely discussed in global public health research. According to the World Health Organization (2023), depression is a leading cause of disability worldwide and disproportionately affects populations facing economic hardship. Individuals living in low- and middle-income countries often experience barriers such as poverty, poor access to mental health services, and underfunded healthcare systems, which contribute to a higher prevalence of untreated depression.

By examining both GDP per capita and depression rates, this project aims to identify whether national income levels are meaningfully associated with mental health outcomes. If a relationship exists, it would suggest that economic conditions—such as poverty, unemployment, and underfunded healthcare systems—play a significant role in shaping population well-being.

This kind of analysis is important because it can inform policies that address both economic development and mental health access at the same time. In particular, low- and middle-income countries may benefit from integrated strategies that consider mental health support as part of broader efforts to improve quality of life and social stability.

References (APA Style)

World Health Organization. (2023). Depression and other common mental disorders: Global health estimates. Retrieved from https://www.who.int/publications/i/item/depression-estimates

Patel, V., Saxena, S., Lund, C., Thornicroft, G., Baingana, F., Bolton, P., … Unützer, J. (2018). The Lancet Commission on global mental health and sustainable development. The Lancet, 392(10157), 1553–1598. https://doi.org/10.1016/S0140-6736(18)31612-X

Building and interpreting a multiple linear regression model

To explore the relationship between depression prevalence and economic variables, I created a multiple linear regression model using region and GDP per capita as predictors. This helps assess how these two factors are associated with depression prevalence across countries. Below, I summarize the regression output to evaluate the model’s coefficients, significance, and explanatory power.

# Building a Multiple Linear Regression Model (I used ChatGPT to create it)
linear_model <- lm(prevalence ~ region + gdp_percap, data = depression_data)   

# Summarizing the Regression Output
summary(linear_model)


Call:
lm(formula = prevalence ~ region + gdp_percap, data = depression_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-6398277  -517459  -318199   -53321 51763799 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)                       3.272e+06  2.115e+05  15.469  < 2e-16 ***
regionEurope & Central Asia      -2.487e+06  2.527e+05  -9.843  < 2e-16 ***
regionLatin America & Caribbean  -2.723e+06  2.676e+05 -10.175  < 2e-16 ***
regionMiddle East & North Africa -2.725e+06  3.248e+05  -8.389  < 2e-16 ***
regionNorth America               1.909e+06  5.858e+05   3.258  0.00113 ** 
regionSouth Asia                  3.151e+06  3.998e+05   7.882 4.16e-15 ***
regionSub-Saharan Africa         -2.823e+06  2.510e+05 -11.249  < 2e-16 ***
gdp_percap                       -6.494e+00  5.448e+00  -1.192  0.23331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4548000 on 3893 degrees of freedom
Multiple R-squared:  0.1044,    Adjusted R-squared:  0.1028 
F-statistic: 64.83 on 7 and 3893 DF,  p-value: < 2.2e-16
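The prevalence values in this dataset look like head counts rather than percentages (for example, about 318,000 for Afghanistan in 1990), so the model above is strongly driven by country size. One possible refinement, sketched below on invented data, is to model cases per 100,000 people against the log of GDP per capita. This assumes that `prevalence` is a count and that `population` is its matching denominator. The names `toy`, `rate_per_100k`, and `rate_model` are hypothetical.

```r
# Toy data (invented numbers); prevalence is assumed to be a head count
toy <- data.frame(
  country    = c("A", "B", "C", "D", "E", "F"),
  prevalence = c(30000, 450000, 52000, 1200000, 8000, 95000),
  population = c(1e6, 1.2e7, 1.5e6, 4e7, 2.5e5, 3e6),
  gdp_percap = c(800, 2500, 6000, 12000, 30000, 45000)
)

# Cases per 100,000 people, so large and small countries are comparable
toy$rate_per_100k <- toy$prevalence / toy$population * 1e5

# Log GDP per capita reduces the strong right skew in income
rate_model <- lm(rate_per_100k ~ log(gdp_percap), data = toy)

# Coefficient table: estimate, standard error, t value, p-value
coef(summary(rate_model))
```

With the real data, the same idea would be `mutate(rate_per_100k = prevalence / population * 1e5)` before fitting, with `region` added back as a predictor.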

Evaluating the regression model with diagnostic plots

These diagnostic plots help determine whether the assumptions of linear regression are met. They allow us to assess the normality of residuals, detect outliers, and identify potential issues.

library(ggfortify)

# Set a custom theme before plotting
theme_set(theme_minimal(base_size = 12))  # You can also try theme_bw() or theme_classic()      

# Diagnostic plots for the regression model
autoplot(linear_model, which = 1:4, nrow = 2, ncol = 2, colour = "#1f77b4")  # Blue color

Diagnostic plot

I created this set of diagnostic plots to check whether my linear regression model meets the necessary assumptions. The residuals appear unevenly spread and the Q-Q plot shows some deviation from normality, suggesting mild violations. A few outliers and influential points are also visible in the Cook’s distance plot. If I had more time, I would explore how removing those points or transforming variables could improve the model.
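As a small sketch of the follow-up described above, influential points can be flagged with Cook's distance and the model refit without them. The data here is artificial, and the 4/n cutoff is a common rule of thumb rather than a formal test. The names `fit`, `influential`, and `refit` are illustrative.

```r
# Artificial data: a clean linear trend with one planted outlier
x <- 1:20
y <- 2 * x + rep(c(-1, 1), times = 10)
y[20] <- 120

fit <- lm(y ~ x)

# Cook's distance measures how much each point shifts the fitted model
cd <- cooks.distance(fit)

# Flag points above the 4/n rule-of-thumb threshold
influential <- which(cd > 4 / length(cd))
influential

# Refit on the remaining observations and compare coefficients
keep  <- setdiff(seq_along(y), influential)
refit <- lm(y ~ x, subset = keep)
rbind(original = coef(fit), refit = coef(refit))
```

Comparing the two rows of coefficients shows how much the flagged points were pulling the fit. Any removal on real data should be justified, since extreme values may be genuine large countries rather than errors.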

Scatter plot: Regional averages of depression vs GDP per capita

To better understand the relationship between average depression rates and economic conditions across regions, I created an interactive scatter plot. Each point represents a regional average for GDP per capita and depression prevalence. A regression line is added to show the overall trend across these regions.

library(plotly)
library(viridis)

Loading required package: viridisLite

# Calculate average prevalence and GDP by region
avg_prevalence_region <- recent_data %>%
  group_by(region) %>%
  summarise(
    avg_prevalence = mean(prevalence, na.rm = TRUE),
    avg_gdp_percap = mean(gdp_percap, na.rm = TRUE)
  )

# Create base plot
p <- ggplot(avg_prevalence_region, aes(x = avg_gdp_percap, y = avg_prevalence, color = region, text = region)) +
  geom_point(size = 4) +
  geom_smooth(method = "lm", se = FALSE, color = "black") +
  scale_color_viridis_d(option = "D", begin = 0.2, end = 0.8) +
  labs(
    title = "Regional Averages: Depression vs GDP per Capita",
    x = "Average GDP per Capita (USD)",
    y = "Average Depression Prevalence (number of people)",
    color = "Region"
  ) +
  theme_minimal()

# Convert to interactive
ggplotly(p, tooltip = c("text", "x", "y"))

`geom_smooth()` using formula = 'y ~ x'
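A trend line through a handful of regional averages rests on very few points. As a complementary sketch, a rank-based (Spearman) correlation at the country level gives a single number for the direction and strength of the association without assuming a straight line. The data below is invented, and `rate_per_100k` is a hypothetical per-capita measure, not a column in the original dataset.

```r
# Toy country-level data (invented values)
toy <- data.frame(
  gdp_percap    = c(500, 1500, 4000, 9000, 20000, 45000),
  rate_per_100k = c(4200, 3900, 4100, 3600, 3700, 3300)
)

# Spearman correlation uses ranks, so it is robust to skewed GDP values
rank_test <- cor.test(toy$gdp_percap, toy$rate_per_100k, method = "spearman")

# Extract the correlation coefficient (rho) and its p-value
rho <- unname(rank_test$estimate)
rho
rank_test$p.value
```

A negative rho means higher-income countries tend to rank lower on the depression measure. On the real data, the result would depend on whether counts or per-capita rates are used.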

Interactive line plot: Depression trends over time by region

To better understand how depression prevalence has evolved over time across different regions, I created an interactive line plot. Each line represents the total depression prevalence for a region over the years. This helps highlight rising or declining trends and compare patterns between regions.

# Summarize total prevalence by region and year
region_prevalence <- depression_data %>%
  group_by(region, year) %>%
  summarise(
    total_prevalence = sum(prevalence, na.rm = TRUE),
    .groups = "drop"
  )

# Create the line plot
line_plot <- ggplot(region_prevalence, aes(x = year, y = total_prevalence, color = region, text = region)) +
  geom_line(linewidth = 1.2) +  # linewidth replaces size for lines since ggplot2 3.4.0
  scale_color_viridis_d(option = "D") +
  labs(
    title = "Depression Trends Over Time by Region",
    x = "Year",
    y = "Total Depression Prevalence (number of people)",
    color = "Region",
    caption = "Source: World Bank Open Data"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "bottom"
  )

# Convert to interactive plot
ggplotly(line_plot, tooltip = c("text", "x", "y"))
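Because regional totals differ greatly in size, small regions can look flat next to large ones. One way to compare growth, sketched here on invented numbers, is to index each region to its first year (first year = 100). The column name `index_base` and the `toy` data are assumptions for illustration.

```r
library(dplyr)

# Toy regional totals (invented values)
toy <- tibble(
  region = rep(c("X", "Y"), each = 3),
  year   = rep(c(1990, 1991, 1992), times = 2),
  total_prevalence = c(100, 110, 121, 5000, 5250, 5000)
)

# Express each year's total relative to the region's first year (= 100)
indexed <- toy %>%
  group_by(region) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(index_base = total_prevalence / first(total_prevalence) * 100) %>%
  ungroup()

indexed
```

Plotting `index_base` instead of `total_prevalence` would put every region on the same scale, so a 21% rise in a small region is as visible as one in a large region.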

Conclusion

Through this project, I explored the connection between GDP per capita and depression prevalence using global data. The visualizations showed a clear pattern: regions with lower income levels tend to report higher levels of depression. The interactive scatter plot and line plot made it easy to observe how this relationship varies across regions and over time. One interesting pattern was that some high-GDP regions still reported relatively high depression rates, suggesting that economic wealth alone does not guarantee better mental health outcomes.

I was surprised to see how strong the regional differences were, especially in the regression model, where every region coefficient was statistically significant while GDP per capita was not (p = 0.233). The model also explained only about 10% of the variation in prevalence (R-squared of 0.104), so much of the variation remains unexplained. If I had more time, I would have liked to explore additional variables such as healthcare spending or social support indicators. Despite that, the project helped me better understand how economic indicators can be used to study mental health on a global scale.