Project 2 Assignment

Author

Latifah Traore

Depression Income: A Data Analysis

source: alamy.com

Introduction

This project explores the relationship between income levels and economic depression. I chose this topic because the word “depression” resonates with me. The term usually comes up in the context of mental health, but I was curious how it applies to income levels and what economic depression means on a global scale. In particular, I wanted to find out whether there is a significant link between economic downturns and a country’s income level.

The dataset was sourced from World Bank Open Data. It includes the country, year, region, income level, prevalence of economic depression, and GDP per capita, among other variables. Some variables are categorical (country, region, and income level) and others are quantitative (year, prevalence, and GDP per capita). The country variable identifies the country for each observation, and year gives the year of the observation. The region variable groups countries into broader geographical areas. The income variable classifies countries by income level, such as low, middle, or high income. The prevalence variable measures how widespread economic depression is in each country; its values in this dataset are large numbers (for example, about 318,436 for Afghanistan in 1990), so they appear to be counts of people rather than percentages of the population. GDP per capita gives the economic output per person.

To ensure the dataset’s integrity, I cleaned it by removing every row that had a missing (NA) value in any column. For the regression model and the scatter plot, I then kept only the most recent year left in the cleaned data (2014) so that the results reflect the most current information available.
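
As a minimal sketch of what this cleaning step does, the toy example below contrasts dropping rows with a missing value in any column against dropping rows only when chosen columns are missing. The toy table and its values are made up for illustration and are not part of the project dataset; drop_na() comes from the tidyr package.

```r
library(tidyr)

# Toy data (made up for illustration; not taken from the project dataset)
toy <- data.frame(
  country    = c("A", "B", "C", "D"),
  prevalence = c(100, 200, 300, 400),
  gdp_percap = c(1500, NA, 3000, 4500),
  birth_rate = c(NA, 20, 25, 30)
)

# With no column names, drop_na() removes a row if ANY column is missing
all_complete <- drop_na(toy)

# With column names, rows are removed only when those columns are missing
model_complete <- drop_na(toy, prevalence, gdp_percap)

nrow(all_complete)    # 2 rows remain (C and D)
nrow(model_complete)  # 3 rows remain (A, C and D)
```

Dropping on every column is the stricter choice: it keeps the workflow simple, but it also discards rows that are only missing variables the model never uses.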

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(ggfortify)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(readr)
library(viridis)

Loading required package: viridisLite

# Setting working directory
setwd("C:/Users/akais/OneDrive/Documents/Project2 Dataset")
# Loading the dataset
depression_data <- read_csv("depression_income.csv")

Rows: 6468 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): country, iso3c, iso2c, region, income
dbl (6): year, prevalence, gdp_percap, population, birth_rate, neonat_mortal...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#view depression data
head(depression_data)

# A tibble: 6 × 11
  country     iso3c  year prevalence iso2c gdp_percap population birth_rate
  <chr>       <chr> <dbl>      <dbl> <chr>      <dbl>      <dbl>      <dbl>
1 Afghanistan AFG    1990    318436. AF            NA   12067570       49.0
2 Afghanistan AFG    1991    329045. AF            NA   12789374       48.9
3 Afghanistan AFG    1992    382545. AF            NA   13745630       48.8
4 Afghanistan AFG    1993    440382. AF            NA   14824371       48.8
5 Afghanistan AFG    1994    456917. AF            NA   15869967       48.9
6 Afghanistan AFG    1995    471475. AF            NA   16772522       49.0
# ℹ 3 more variables: neonat_mortal_rate <dbl>, region <chr>, income <chr>

# Checking for missing values
colSums(is.na(depression_data))  # Display the count of missing values in each column

           country              iso3c               year         prevalence 
                 0                980                  0                  0 
             iso2c         gdp_percap         population         birth_rate 
              2243               2567               2224               2311 
neonat_mortal_rate             region             income 
              2368               2218               2218

# Removing rows with missing values in any column
depression_data <- depression_data %>%
  drop_na()  # Drops a row if any of its columns is NA

# View summary statistics
summary(depression_data)

   country             iso3c                year        prevalence      
 Length:3793        Length:3793        Min.   :1990   Min.   :    1107  
 Class :character   Class :character   1st Qu.:1996   1st Qu.:   75775  
 Mode  :character   Mode  :character   Median :2002   Median :  230360  
                                       Mean   :2002   Mean   : 1237921  
                                       3rd Qu.:2008   3rd Qu.:  632828  
                                       Max.   :2014   Max.   :54949281  
    iso2c             gdp_percap         population          birth_rate   
 Length:3793        Min.   :   239.7   Min.   :5.140e+04   Min.   : 7.60  
 Class :character   1st Qu.:  2092.8   1st Qu.:2.662e+06   1st Qu.:13.41  
 Mode  :character   Median :  6376.3   Median :7.934e+06   Median :22.24  
                    Mean   : 12480.9   Mean   :3.703e+07   Mean   :24.46  
                    3rd Qu.: 17029.1   3rd Qu.:2.139e+07   3rd Qu.:34.51  
                    Max.   :141442.2   Max.   :1.364e+09   Max.   :55.12  
 neonat_mortal_rate    region             income         
 Min.   : 1.00      Length:3793        Length:3793       
 1st Qu.: 6.00      Class :character   Class :character  
 Median :14.90      Mode  :character   Mode  :character  
 Mean   :19.28                                           
 3rd Qu.:29.50                                           
 Max.   :73.10

# Selecting relevant columns for analysis
depression_data_selected <- depression_data %>%
  select(country, year, region, income, prevalence, gdp_percap)

# Filtering for the most recent year for analysis
recent_data <- depression_data %>%
  filter(year == max(year))
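
The summary output above shows prevalence values reaching about 54.9 million, which suggests that prevalence is stored as a count of people rather than as a percentage. If that reading is right, countries of very different sizes could be compared more fairly by converting counts to a share of the population. The sketch below shows one way to do this on made-up toy data; the column name prevalence_pct is hypothetical and is not used elsewhere in this project.

```r
library(dplyr)

# Toy data (made up for illustration; not taken from the project dataset)
toy <- data.frame(
  country    = c("A", "B"),
  prevalence = c(50000, 2000000),       # assumed to be a count of people
  population = c(1000000, 100000000)
)

# Convert the count to a percentage of each country's population
toy_rates <- toy %>%
  mutate(prevalence_pct = 100 * prevalence / population)

toy_rates$prevalence_pct  # 5 and 2
```

In this toy case, country B has far more affected people but a lower share of its population affected, which is the kind of difference a count alone would hide.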

Background Research

Economic depression refers to a prolonged period of economic downturn characterized by a significant decrease in economic activity, rising unemployment, and declining GDP. While often associated with periods of financial crises or recessions, it can also be seen in countries with slower or stagnating growth. According to a report by the International Monetary Fund (IMF), economic depression can lead to income inequality and social instability, as well as increased rates of poverty and reduced access to essential services like healthcare and education.

In addition, it is important to understand that economic depression impacts countries differently based on their income levels. For instance, low-income countries are often more vulnerable to the effects of economic depression due to limited financial resources and weaker economic infrastructure. On the other hand, high-income countries may have better resilience due to stronger economies and financial systems. According to the World Bank, countries experiencing economic

References:

International Monetary Fund, “Recession: When Bad Times Prevail,” www.imf.org
World Bank, www.worldbank.org

# Creating the linear model
linear_model <- lm(prevalence ~ region + gdp_percap, data = recent_data)

# Checking the linear model summary
summary(linear_model)


Call:
lm(formula = prevalence ~ region + gdp_percap, data = recent_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-6979345  -607120  -380528   -72896 50910757 

Coefficients:
                                   Estimate Std. Error t value Pr(>|t|)   
(Intercept)                       4.102e+06  1.351e+06   3.035  0.00285 **
regionEurope & Central Asia      -3.291e+06  1.559e+06  -2.110  0.03654 * 
regionLatin America & Caribbean  -3.418e+06  1.694e+06  -2.017  0.04548 * 
regionMiddle East & North Africa -3.406e+06  1.983e+06  -1.718  0.08799 . 
regionNorth America               4.333e+06  4.216e+06   1.028  0.30579   
regionSouth Asia                  2.942e+06  2.359e+06   1.247  0.21438   
regionSub-Saharan Africa         -3.491e+06  1.573e+06  -2.220  0.02796 * 
gdp_percap                       -4.772e+00  2.612e+01  -0.183  0.85532   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5555000 on 146 degrees of freedom
Multiple R-squared:  0.1093,    Adjusted R-squared:  0.06656 
F-statistic: 2.559 on 7 and 146 DF,  p-value: 0.01628

# Diagnostic plots for the linear model
autoplot(linear_model, which = 1:4, nrow = 2, ncol = 2)

Linear Regression Model

I created a linear regression model to assess the impact of region and GDP per capita on the prevalence of economic depression.

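Prevalence ≈ 4,102,000 − 3,291,000 × (Europe & Central Asia) − 3,418,000 × (Latin America & Caribbean) − 3,406,000 × (Middle East & North Africa) + 4,333,000 × (North America) + 2,942,000 × (South Asia) − 3,491,000 × (Sub-Saharan Africa) − 4.772 × (GDP per capita)

Each region term is an indicator equal to 1 for countries in that region and 0 otherwise. The region with no coefficient of its own, East Asia & Pacific, is the baseline absorbed into the intercept. The GDP per capita coefficient is negative, which points toward lower depression prevalence at higher GDP per capita, but it is not statistically significant (p = 0.855). The model as a whole explains only about 11% of the variation in prevalence (R-squared = 0.1093), so this relationship should be read as weak.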
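
Because region is a categorical variable, lm() expands it into one 0/1 indicator per region except for a baseline level, which is why the coefficient table lists six separate region rows. The sketch below illustrates this on made-up, noise-free toy data so the coefficients are easy to read; the region names and numbers are illustrative assumptions, not values from the project dataset.

```r
# Toy data (made up for illustration; not taken from the project dataset)
toy <- data.frame(
  region     = c("East", "East", "North", "North", "South", "South"),
  gdp_percap = c(1, 2, 1, 2, 1, 2),
  prevalence = c(10, 12, 20, 22, 5, 7)
)

fit <- lm(prevalence ~ region + gdp_percap, data = toy)

# The first level alphabetically ("East") is the baseline and gets no row
levels(factor(toy$region))

# Intercept = baseline region at gdp_percap = 0; region rows are offsets from it
coef(fit)

# Prediction for a "North" country with gdp_percap = 2:
# intercept + regionNorth + 2 * gdp_percap slope
pred_north <- predict(fit, newdata = data.frame(region = "North", gdp_percap = 2))
```

Each region coefficient is read as the difference from the baseline region at the same GDP per capita, not as an effect of a single "Region" variable.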

Prevalence vs. GDP per capita by Region

I created a scatter plot to examine the average prevalence of depression across regions in relation to GDP per capita:

# Calculate average prevalence by region
avg_prevalence_region <- recent_data %>%
  group_by(region) %>%
  summarise(avg_prevalence = mean(prevalence, na.rm = TRUE),  # Calculate average prevalence
            avg_gdp_percap = mean(gdp_percap, na.rm = TRUE))  # Calculate average GDP per capita

# Creating a scatter plot of avg prevalence vs avg GDP per capita
ggplot(avg_prevalence_region, aes(x = avg_gdp_percap, y = avg_prevalence, color = region)) +
  geom_point(size = 4) +  # Scatter plot
  geom_smooth(method = "lm", se = FALSE, col = "black") +  # Adding regression line 
  labs(title = "Avg Prevalence vs. Avg GDP per Capita by Region",
       x = "Avg GDP per Capita",
       y = "Avg Prevalence") +
  theme_minimal() +
  theme(legend.position = "right")  # Adjusting the legend position

`geom_smooth()` using formula = 'y ~ x'

Prevalence by Region Over Time

I created an interactive stacked area plot to visualize how the prevalence of depression has changed over time across regions:

# Summarizing prevalence by region over time
region_prevalence <- depression_data %>%
  group_by(region, year) %>%
  summarise(total_prevalence = sum(prevalence, na.rm = TRUE), .groups = 'drop')

# Creating the interactive stacked area plot
interactive_area_plot <- ggplot(region_prevalence, aes(x = year, y = total_prevalence, fill = region)) +
  geom_area(color = "white", alpha = 0.7) +  # Adding transparent areas with white borders
  scale_fill_viridis_d(option = "cividis") +  # Non-default palette
  labs(
    title = "Depression Prevalence by Region Over Time",
    x = "Year",
    y = "Total Depression Prevalence",
    fill = "Region",
    caption = "Source: World Bank Open Data"
  ) +
  theme_classic(base_size = 14) +  # Using a different theme for a better look
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5, size = 16),  # Centered and bold title
    axis.title = element_text(size = 12),
    legend.position = "bottom",
    legend.title = element_text(size = 10),
    legend.text = element_text(size = 9)
  )

# Converting to an interactive plot using plotly
interactive_area_plotly <- ggplotly(interactive_area_plot, tooltip = c("x", "y", "fill"))

# Display the interactive plot
interactive_area_plotly

Conclusion

This analysis explored the relationship between economic depression and income levels globally. In the regression, the GDP per capita coefficient was negative, which is in line with the idea that poorer economies face greater challenges in addressing the mental health needs of their populations, but the estimate was not statistically significant (p = 0.855), so the evidence for this link is weak. What surprised me was that the Middle East & North Africa region had the lowest depression prevalence, while East Asia & Pacific had the highest. This was unexpected, given the diverse economic conditions in these regions. The visualizations showed clearly how depression prevalence varies by region and GDP per capita. I also considered incorporating the birth rate to see whether it correlates with depression prevalence, which could be an interesting avenue for future exploration. Overall, the findings provide a basis for further research into economic policies and mental health interventions.