Project 2 Depression CSV

Author

Dajana R

The data set I am exploring contains global socioeconomic indicators for different countries and years. It includes both quantitative and categorical variables. The quantitative variables include GDP per capita, population, birth rate, and neonatal mortality rate, which offer insights into the economic prosperity, demographic composition, fertility levels, and newborn health outcomes of different nations. Categorical variables such as country, ISO codes, region, and income level provide additional context by grouping countries by geographic location and economic development. The data was sourced from the World Data Bank. To ensure the data set's integrity, I performed several cleaning steps, such as checking for missing values and verifying the validity of the categorical variables.

Loading packages

## loading dplyr, ggplot2, and plotly
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout

Loading the data set and naming it depression_In / Exploring the data

## reading the CSV file from the working directory and naming the data set
## exploring the data
depression_In <- read.csv("depression_income.csv")

head(depression_In)
      country iso3c year prevalence iso2c gdp_percap population birth_rate
1 Afghanistan   AFG 1990   318435.8    AF         NA   12067570     49.029
2 Afghanistan   AFG 1991   329044.8    AF         NA   12789374     48.896
3 Afghanistan   AFG 1992   382544.6    AF         NA   13745630     48.834
4 Afghanistan   AFG 1993   440381.5    AF         NA   14824371     48.839
5 Afghanistan   AFG 1994   456916.6    AF         NA   15869967     48.898
6 Afghanistan   AFG 1995   471475.2    AF         NA   16772522     48.978
  neonat_mortal_rate     region     income
1               52.8 South Asia Low income
2               51.9 South Asia Low income
3               50.9 South Asia Low income
4               49.9 South Asia Low income
5               49.1 South Asia Low income
6               48.2 South Asia Low income
str(depression_In)
'data.frame':   6468 obs. of  11 variables:
 $ country           : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
 $ iso3c             : chr  "AFG" "AFG" "AFG" "AFG" ...
 $ year              : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
 $ prevalence        : num  318436 329045 382545 440382 456917 ...
 $ iso2c             : chr  "AF" "AF" "AF" "AF" ...
 $ gdp_percap        : num  NA NA NA NA NA NA NA NA NA NA ...
 $ population        : num  12067570 12789374 13745630 14824371 15869967 ...
 $ birth_rate        : num  49 48.9 48.8 48.8 48.9 ...
 $ neonat_mortal_rate: num  52.8 51.9 50.9 49.9 49.1 48.2 47.5 47 46.1 45.6 ...
 $ region            : chr  "South Asia" "South Asia" "South Asia" "South Asia" ...
 $ income            : chr  "Low income" "Low income" "Low income" "Low income" ...

Checking for missing data

## using sum and is.na to check for missing values
sum(is.na(depression_In))
[1] 17129
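The single total above does not show which columns the missing values sit in. A minimal sketch of a per-column count follows; the data frame `toy` and its values are made up for illustration and only stand in for `depression_In`.

```r
## toy data frame standing in for depression_In (hypothetical values)
toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  gdp_percap = c(NA, 1200, 5400, NA),
  birth_rate = c(40.1, 39.8, NA, 12.3)
)

## number of missing values in each column
na_by_column <- colSums(is.na(toy))
na_by_column

## adding the column counts gives the same total that sum(is.na()) reports
sum(na_by_column)
```

Running `colSums(is.na(depression_In))` on the real data would show, for example, how much of the total comes from gdp_percap.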

Removing/Cleaning the missing data

## omitting the rows that contain missing data
depression_In <- na.omit(depression_In)
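Because na.omit() drops every row that has at least one NA, it can help to compare the row count before and after. The sketch below uses a made-up data frame `toy`, not the real data. For the real data, the outputs in this report suggest the drop was large: str() shows 6468 rows before cleaning, while the regression output reports 3790 residual degrees of freedom with three coefficients, which implies 3793 rows after cleaning.

```r
## toy data frame with missing values in two different rows (hypothetical values)
toy <- data.frame(x = c(1, NA, 3, 4),
                  y = c("a", "b", NA, "d"))

rows_before <- nrow(toy)

## na.omit() keeps only the rows with no missing value in any column
toy_clean <- na.omit(toy)
rows_after <- nrow(toy_clean)

## number of rows removed by the cleaning step
rows_before - rows_after
```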

Looking at the summary

## looking at the summary for the variables gdp_percap, population, birth_rate, and neonat_mortal_rate
summary_variables <- depression_In |>
  select(gdp_percap, population, birth_rate, neonat_mortal_rate) |>
  summary()
summary_variables
   gdp_percap         population          birth_rate    neonat_mortal_rate
 Min.   :   239.7   Min.   :5.140e+04   Min.   : 7.60   Min.   : 1.00     
 1st Qu.:  2092.8   1st Qu.:2.662e+06   1st Qu.:13.41   1st Qu.: 6.00     
 Median :  6376.3   Median :7.934e+06   Median :22.24   Median :14.90     
 Mean   : 12480.9   Mean   :3.703e+07   Mean   :24.46   Mean   :19.28     
 3rd Qu.: 17029.1   3rd Qu.:2.139e+07   3rd Qu.:34.51   3rd Qu.:29.50     
 Max.   :141442.2   Max.   :1.364e+09   Max.   :55.12   Max.   :73.10     

Performing Linear regression analysis

## setting up the linear regression model using neonat_mortal_rate, gdp_percap, and birth_rate
lm_model <- lm(neonat_mortal_rate ~ gdp_percap + birth_rate, data = depression_In)
## looking at the regression model
summary(lm_model)

Call:
lm(formula = neonat_mortal_rate ~ gdp_percap + birth_rate, data = depression_In)

Residuals:
    Min      1Q  Median      3Q     Max 
-20.206  -4.851  -0.381   3.786  34.000 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.201e+00  4.042e-01  -5.446 5.48e-08 ***
gdp_percap  -1.710e-04  9.437e-06 -18.122  < 2e-16 ***
birth_rate   9.652e-01  1.262e-02  76.514  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.872 on 3790 degrees of freedom
Multiple R-squared:  0.7345,    Adjusted R-squared:  0.7344 
F-statistic:  5244 on 2 and 3790 DF,  p-value: < 2.2e-16

The linear equation is: Neonatal Mortality Rate = −2.201 + (−0.000171 × gdp_percap) + (0.9652 × birth_rate)

For each additional unit of GDP per capita (a $1 increase), the predicted neonatal mortality rate decreases by 0.000171 deaths per 1,000 live births, holding birth rate constant. A $1,000 increase therefore corresponds to a decrease of about 0.171 deaths per 1,000 live births. For each additional unit of birth rate (one more birth per 1,000 people), the predicted neonatal mortality rate increases by 0.9652 deaths per 1,000 live births, holding GDP per capita constant.
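As a rough check on this interpretation, the coefficients printed by summary(lm_model) can be plugged in by hand. The sketch below copies the three estimates from that output; the example country (GDP per capita of 10000 and birth rate of 20) is hypothetical.

```r
## coefficients copied from the summary(lm_model) output
b_intercept <- -2.201
b_gdp       <- -1.710e-04
b_birth     <- 9.652e-01

## change in the predicted rate for a 1000-unit increase in GDP per capita
effect_per_1000 <- b_gdp * 1000
effect_per_1000

## predicted neonatal mortality rate for a hypothetical country
## with gdp_percap = 10000 and birth_rate = 20
predicted <- b_intercept + b_gdp * 10000 + b_birth * 20
predicted
```

With the fitted model object available, `predict(lm_model, newdata = data.frame(gdp_percap = 10000, birth_rate = 20))` should give nearly the same number, differing only by the rounding of the printed coefficients.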

P-values: The p-value for each coefficient indicates the statistical significance of that predictor. In this model, all coefficients have p-values much smaller than 0.05, suggesting that both GDP per capita and birth rate are statistically significant predictors of neonatal mortality rate.

Adjusted R-squared value: The adjusted R-squared value is 0.7344, indicating that approximately 73.44% of the variance in neonatal mortality rate is explained by the predictors (GDP per capita and birth rate) included in the model. This suggests a good fit of the model to the data.
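The adjusted value can be reproduced from the other numbers in the regression output. A minimal sketch follows, using the standard adjustment formula. The inputs are copied from the printed summary, and the sample size is inferred from the residual degrees of freedom rather than read from the data.

```r
## values copied from the summary(lm_model) output
r_squared <- 0.7345
n_pred    <- 2          # gdp_percap and birth_rate
n_obs     <- 3790 + 3   # residual df plus the three estimated coefficients

## adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_pred - 1)
round(adj_r_squared, 4)
```

Rounded to four decimals, this gives 0.7344, the adjusted R-squared that summary(lm_model) prints. With so many observations and only two predictors, the adjustment is very small.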

Exploring with plots

## making a scatter plot using ggplot
## x axis is gdp and y axis is neonatal mortality rate 

ggplot(depression_In, aes(x = gdp_percap, y= neonat_mortal_rate)) +
  geom_point(color = "pink") +
  labs(x = "GDP per Capita", y = "Neonatal Mortality Rate", 
       title = "Relationship between GDP per Capita and Neonatal Mortality Rate",
       caption = "Data Source: World Data Bank") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0))

## making a histogram using ggplot
## x axis is gdp and y axis is the frequency of it 

ggplot(depression_In, aes(x = gdp_percap)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 20) +
  labs(x = "GDP per Capita", y = "Frequency", 
       title = "Distribution of GDP per Capita",
       caption = "Data Source: World Data Bank") +
  theme_minimal() 

## making a bar plot using ggplot
## x axis is region, and the bars are filled by income
## y axis is the proportion of each income level
ggplot(depression_In, aes(x = region, fill = income)) +
  geom_bar(position = "fill") +
  labs(x = "Region", y = "Proportion", 
       title = "Distribution of Countries by Region and Income",
       caption = "Data Source: World Data Bank") +
  scale_fill_brewer(palette = "Set3") +
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0)) +
  theme(legend.position = "top",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = guide_legend(title = "Income"))

Final Visualization

## filtering the data to include only the year 2008 (middle of the late-2000s recession)
depression_2008 <- depression_In |>
  filter(year == 2008)

## grouping the filtered data by region and income, and calculating the count of countries in each 
region_income_counts_2008 <- depression_2008 |>
  group_by(region, income) |>
  summarise(count = n())
`summarise()` has grouped output by 'region'. You can override using the
`.groups` argument.
## making a bar plot using ggplot: x axis is region, y axis is the count (number of countries), and the bars are filled by income
p <- ggplot(region_income_counts_2008, aes(x = region, y = count, fill = income)) +
  geom_bar(stat = "identity") +
  labs(x = "Region", y = "Number of Countries", 
       title = "Distribution of Countries by Region and Income in 2008",
       caption = "Data Source: World Data Bank") +
  scale_fill_brewer(palette = "Set3") +
  theme_bw() +
  theme(plot.caption = element_text(hjust = 0),
        legend.position = "top",
        axis.text.x = element_text(angle = 45, hjust = 1)) +
  guides(fill = guide_legend(title = "Income"))

## converting the ggplot bar plot to use plotly
ggplotly(p)
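The summarise() message printed in the counting step is informational: the result is still grouped by region. A hedged sketch of two ways to get an ungrouped count table without the message follows; the data frame `toy_2008` is made up for illustration and stands in for `depression_2008`.

```r
library(dplyr)

## toy stand-in for depression_2008 (hypothetical rows)
toy_2008 <- data.frame(
  region = c("East", "East", "East", "West", "West"),
  income = c("High income", "Low income", "Low income",
             "High income", "High income")
)

## option 1: say explicitly that the grouping should be dropped
counts_dropped <- toy_2008 |>
  group_by(region, income) |>
  summarise(count = n(), .groups = "drop")

## option 2: count() groups, tallies, and ungroups in one call
counts_short <- toy_2008 |>
  count(region, income, name = "count")

counts_dropped
```

Either table can be passed to ggplot() in the same way as region_income_counts_2008.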

Essay

The data set I have chosen covers global socioeconomic indicators, offering insights into the economic and demographic trends of various countries over time. It is made up of a mix of quantitative and categorical variables, including GDP per capita, population, birth rate, neonatal mortality rate, country, ISO codes, region, and income level. These variables show the complex interplay between economic development and demographic trends worldwide. The data was gathered from the World Data Bank. To ensure the data's reliability, I cleaned it before the analysis: I checked for missing values (NAs) and omitted them. I also checked the head, structure, and summary. I chose this topic and data set because of its importance in understanding the changing nature of global development and its impact on human well-being. Exploring this data set provides valuable insights into the disparities and challenges faced by different regions and income groups across the globe. By analyzing these indicators, I hope to gain a deeper understanding of the underlying factors driving disparities in our current world.

Socioeconomic factors, including income, education, employment status, and housing conditions, are important determinants of health outcomes, as noted by the Centers for Disease Control and Prevention (CDC). Low socioeconomic status is consistently associated with higher risks of cardiovascular disease (CVD) and other health conditions. Education plays a critical role too, with lower educational attainment correlating with increased CVD risk. Similarly, unstable employment leads to psychosocial stress and barriers to healthcare access, exacerbating health disparities. Income inequality further widens these disparities, with lower-income households experiencing higher rates of illness and premature mortality. Food and housing insecurity, closely tied to income and employment, pose additional risks for chronic diseases such as hypertension and coronary heart disease. Understanding these socioeconomic indicators is essential for developing targeted interventions to address health disparities and promote equitable access to healthcare and resources.

The visualization shows the distribution of countries by region and income level for the year 2008. It highlights the varying economic statuses and development levels across different regions, providing insights into the global distribution of wealth and economic opportunities. One interesting pattern that emerges is the concentration of high-income countries in regions such as North America, Western Europe, and parts of Asia, while low-income countries are more prevalent in Sub-Saharan Africa and parts of South Asia. This shows the persistent disparities in economic development and living standards between regions. One aspect that could have been explored further is the correlation between income levels and health outcomes, such as neonatal mortality rates, to understand how economic development influences population health. Additionally, incorporating further interactive features could enhance the visualization by allowing users to explore the data in more detail and uncover additional insights.

Centers for Disease Control and Prevention. “Socioeconomic Factors | CDC.” Centers for Disease Control and Prevention, 1 Sept. 2023, www.cdc.gov/dhdsp/health_equity/socioeconomic.htm.

*** ChatGPT was used to help fix errors and set different color palettes.