Data 110 Project_1

Author

Shadeja Fuentes

Depression_income Dataset Exploration

Introduction

The depression_income dataset contains 6,468 observations of 11 variables. These include population counts, birth rates, neonatal mortality rates, GDP per capita, regions, income levels, and the prevalence of depression across various countries, with yearly records that begin in 1990 in the raw data. My hypothesis is that individuals with lower income levels may experience higher levels of depression than those with higher income levels. It is also worth considering which other variables might influence the relationship between income and depression.

I loaded the dataset and the appropriate libraries. Afterwards, I used the function str() to return information about the structure of the dataset. There are some NA values for the variable gdp_percap.

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.3     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

depression <- read_csv("depression_income.csv")

Rows: 6468 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): country, iso3c, iso2c, region, income
dbl (6): year, prevalence, gdp_percap, population, birth_rate, neonat_mortal...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(depression)

spc_tbl_ [6,468 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ country           : chr [1:6468] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ iso3c             : chr [1:6468] "AFG" "AFG" "AFG" "AFG" ...
 $ year              : num [1:6468] 1990 1991 1992 1993 1994 ...
 $ prevalence        : num [1:6468] 318436 329045 382545 440382 456917 ...
 $ iso2c             : chr [1:6468] "AF" "AF" "AF" "AF" ...
 $ gdp_percap        : num [1:6468] NA NA NA NA NA NA NA NA NA NA ...
 $ population        : num [1:6468] 12067570 12789374 13745630 14824371 15869967 ...
 $ birth_rate        : num [1:6468] 49 48.9 48.8 48.8 48.9 ...
 $ neonat_mortal_rate: num [1:6468] 52.8 51.9 50.9 49.9 49.1 48.2 47.5 47 46.1 45.6 ...
 $ region            : chr [1:6468] "South Asia" "South Asia" "South Asia" "South Asia" ...
 $ income            : chr [1:6468] "Low income" "Low income" "Low income" "Low income" ...
 - attr(*, "spec")=
  .. cols(
  ..   country = col_character(),
  ..   iso3c = col_character(),
  ..   year = col_double(),
  ..   prevalence = col_double(),
  ..   iso2c = col_character(),
  ..   gdp_percap = col_double(),
  ..   population = col_double(),
  ..   birth_rate = col_double(),
  ..   neonat_mortal_rate = col_double(),
  ..   region = col_character(),
  ..   income = col_character()
  .. )
 - attr(*, "problems")=<externalptr>
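The str() output only shows the first few values of each column, so it does not say how many values are missing. One way to check, sketched here on a small invented data frame called `toy` (a stand-in, not the real data), is to count the NA values per column; the same call should work on `depression`.

```r
# Toy stand-in for the real data; all values are invented for illustration.
toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(1990, 1991, 1990, 1991),
  gdp_percap = c(NA, 512.3, NA, NA),
  population = c(1e6, 1.1e6, 5e6, 5.2e6)
)

# is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```

Columns with a count of zero are complete, which helps show how many rows drop_na() is likely to remove.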

Cleaning the data

First, I drop the NA values in the dataset with the function drop_na(). The dataset includes the variable gdp_percap. Gross domestic product per capita is a financial metric that breaks down a country’s economic output per person and is calculated by dividing the GDP of a nation by its population. I use mutate() to create a new variable, gdp_trillions, in depression_2 by dividing gdp_percap by 10^12, which expresses the per-capita figure in trillions of dollars. Because gdp_percap is already divided by population, this rescales per-capita GDP rather than giving each country's total GDP. After creating gdp_trillions, I used group_by() to group the dataset by region, income, year, and prevalence. Lastly, I summarized the mean GDP in trillions and the mean population for the grouped columns.

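# Drop rows with any missing value, then rescale per-capita GDP to trillions of dollars
depression_2 <- drop_na(depression) %>%
  mutate(gdp_trillions = gdp_percap / 1000000000000)

# Average the rescaled GDP and the population within each group
depression_grouped <- depression_2 %>%
  group_by(region, income, year, prevalence) %>%
  summarise(mean_gdp_trillions = mean(gdp_trillions),
            mean_population = mean(population))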

`summarise()` has grouped output by 'region', 'income', 'year'. You can
override using the `.groups` argument.
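Since gdp_percap is a per-person figure, dividing it by 10^12 does not give a country's total GDP. If total GDP in trillions is the goal, one possible approach is to multiply by population first. The sketch below assumes gdp_percap is measured in dollars per person; the data frame `toy`, its values, and the column name `total_gdp_trillions` are invented for illustration. It groups only by region, income, and year, and sets `.groups = "drop"` so that summarise() returns an ungrouped result without the message shown above.

```r
library(dplyr)

# Invented values for illustration only.
toy <- data.frame(
  region     = c("R1", "R1", "R2", "R2"),
  income     = c("Low income", "Low income", "High income", "High income"),
  year       = c(2000, 2000, 2000, 2000),
  gdp_percap = c(500, 1500, 40000, 60000),
  population = c(2e7, 1e7, 5e7, 1e8)
)

toy_summary <- toy %>%
  # Total GDP = per-capita GDP * population, rescaled to trillions.
  mutate(total_gdp_trillions = gdp_percap * population / 1e12) %>%
  group_by(region, income, year) %>%
  summarise(
    mean_gdp_trillions = mean(total_gdp_trillions),
    mean_population    = mean(population),
    .groups = "drop"  # return an ungrouped tibble and silence the message
  )

print(toy_summary)
```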

The simple bar chart below shows the income levels in the dataset. The height of each bar represents the number of rows in depression_grouped at that income level, so it reflects observations across years rather than distinct countries. The lower middle income category has the most observations.

ggplot(depression_grouped, aes(x = income)) +
  geom_bar(fill = "#f54298") +
  coord_flip() +
  ylab("Count") +
  xlab("Income Level") +
  ggtitle("Income Levels of the Countries in the world") +
  theme_linedraw()
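geom_bar() counts rows, so a country that appears in many years contributes many times to a bar. To count distinct countries per income level instead, something like the following could be used. This is a sketch on an invented data frame `toy`; with the real data it would need to start from depression_2, because depression_grouped has no country column.

```r
library(dplyr)

# Invented values: three countries observed over one to three years.
toy <- data.frame(
  country = c("A", "A", "B", "C", "C", "C"),
  year    = c(2000, 2001, 2000, 2000, 2001, 2002),
  income  = c("Low income", "Low income", "Low income",
              "High income", "High income", "High income")
)

countries_per_income <- toy %>%
  group_by(income) %>%
  summarise(
    n_rows      = n(),                  # what geom_bar() would count
    n_countries = n_distinct(country),  # distinct countries in the group
    .groups = "drop"
  )

print(countries_per_income)
```

The two columns differ whenever countries are observed in more than one year, which is the case in this dataset.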

Depression Prevalence by Region Box Plot

This boxplot visualizes the prevalence of depression across regions. The y-axis shows the regions being compared, and the x-axis shows the prevalence of depression. The boxplot makes it possible to compare the distribution of prevalence across regions, see whether the distributions differ noticeably, and identify potential outliers.

ggplot(depression_grouped, aes(x = prevalence, y = region)) +
  geom_boxplot(fill = "Pink", color = "black", alpha = 0.5) +
  xlab("Prevalence of Depression") +
  ylab("Region") +
  ggtitle("Depression Prevalence by Region")
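The prevalence column appears to hold case counts (for example, 318,436 for Afghanistan in 1990 against a population of about 12 million), so regions with large populations will tend to sit further right regardless of how common depression is. One possible adjustment is to express prevalence as a share of the population before comparing regions. The sketch below uses invented numbers, and `prevalence_pct` is a hypothetical column name.

```r
# Invented values for illustration only.
toy <- data.frame(
  region     = c("R1", "R1", "R2", "R2"),
  prevalence = c(30000, 45000, 400000, 350000),
  population = c(1e6, 1.5e6, 1e7, 1e7)
)

# Share of the population with depression, in percent.
toy$prevalence_pct <- 100 * toy$prevalence / toy$population

# Median share per region, comparable across population sizes.
median_pct <- tapply(toy$prevalence_pct, toy$region, median)
print(median_pct)
```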

Depression Prevalence by Region Histogram

In the histogram, the prevalence variable in the depression_grouped dataset is used for the x-axis, and the region variable is used for the fill color. I used facet_wrap() to create a separate histogram for each level of the income variable, arranged in a grid. The bars represent the number of observations in each bin, and the color of each bar indicates the region the observations belong to. The plot provides a visual representation of the prevalence of depression across income levels and regions and allows for easy comparison between them.

ggplot(depression_grouped, aes(x = prevalence, fill = region)) +
  geom_histogram(alpha = 0.5, color = "black",
                 lwd = 0.55,
                 linetype = 1,
                 position = "identity") +
  facet_wrap(~income) +
  xlab("Prevalence of Depression") +
  ggtitle("Depression Prevalence by Income Level") +
  theme_gray() +
  xlim(20*10^3, 10^5)

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 3139 rows containing non-finite values (`stat_bin()`).

Warning: Removed 30 rows containing missing values (`geom_bar()`).
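The messages above suggest a few possible adjustments: the lwd argument seems to be what triggers the size deprecation warning, no bin width was chosen, and xlim() drops rows outside the range. A sketch of one alternative follows, using invented data in a data frame called `toy`. The bin width of 10,000 is an arbitrary assumption, and coord_cartesian() is swapped in for xlim() so that the plot zooms in without discarding rows. The linewidth argument requires ggplot2 3.4.0 or later.

```r
library(ggplot2)

# Invented data: 60 rows, some of them outside the plotted x-range.
set.seed(1)
toy <- data.frame(
  prevalence = c(runif(50, 2e4, 1e5), runif(10, 1e5, 5e5)),
  region     = rep(c("R1", "R2"), each = 30),
  income     = rep(c("Low income", "High income"), times = 30)
)

p <- ggplot(toy, aes(x = prevalence, fill = region)) +
  geom_histogram(
    binwidth  = 10000,   # explicit bin width (arbitrary choice here)
    alpha     = 0.5,
    color     = "black",
    linewidth = 0.55,    # line thickness of the bar outlines
    position  = "identity"
  ) +
  facet_wrap(~income) +
  coord_cartesian(xlim = c(2e4, 1e5)) +  # zoom without dropping rows
  xlab("Prevalence of Depression") +
  ggtitle("Depression Prevalence by Income Level")

print(p)
```

Because the bins are computed before zooming, every row still contributes to the histogram even if its bar falls outside the visible range.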

The histograms show the prevalence of depression grouped by income level and region. However, it is not possible to draw conclusions about the relationship between depression and income from this plot alone. A regression or correlation analysis would be required to examine the relationship and determine whether there is a significant association between the two variables. Additionally, the plot only shows the distribution of prevalence within each income group; it does not provide information about the direction or strength of the relationship between depression and income. Note also that the x-axis limits exclude 3,139 rows, so the plot covers only part of the data.
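As one possible starting point for the correlation analysis mentioned above, a rank correlation between per-capita GDP and the share of the population with depression could be computed with cor.test(). The numbers below are invented and `prevalence_pct` is a hypothetical column name, so the result says nothing about the real dataset.

```r
# Invented values for illustration only.
toy <- data.frame(
  gdp_percap     = c(500, 1200, 3000, 8000, 20000, 45000),
  prevalence_pct = c(4.1, 3.9, 3.6, 3.7, 3.2, 3.0)
)

# Spearman's rank correlation does not assume a linear relationship.
result <- cor.test(toy$gdp_percap, toy$prevalence_pct, method = "spearman")

print(result$estimate)  # rho: the sign gives direction, the magnitude gives strength
print(result$p.value)
```

Spearman's method is used here because income and prevalence may be related monotonically without being related linearly; Pearson's correlation or a regression model would be alternatives depending on the question.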