Introduction

In this project, my goal is to create a programming example or “vignette” that showcases the capabilities of a TidyVerse package, along with a dataset from either fivethirtyeight.com or Kaggle. The aim of this example is to demonstrate how to effectively use the selected TidyVerse package to manipulate, analyze, and visualize the selected dataset. By doing so, readers will gain a better understanding of the potential of TidyVerse packages and how they can be used to solve real-world data problems.

Data wrangling

The dataset I used is called “Suicide Rates Overview 1985 to 2021” and it contains information about suicide rates in various countries from 1985 to 2021.

[Suicide data](https://www.kaggle.com/datasets/omkargowda/suicide-rates-overview-1985-to-2021)

Load required packages
library(tidyverse)
Load the dataset
df <- read.csv("master.csv")
glimpse(df)
## Rows: 31,756
## Columns: 12
## $ country            <chr> "Albania", "Albania", "Albania", "Albania", "Albani…
## $ year               <int> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 198…
## $ sex                <chr> "male", "male", "female", "male", "male", "female",…
## $ age                <chr> "15-24 years", "35-54 years", "15-24 years", "75+ y…
## $ suicides_no        <int> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 0, 0, 2, 17, 1, 14…
## $ population         <int> 312900, 308000, 289700, 21800, 274300, 35600, 27880…
## $ suicides.100k.pop  <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1.56, 0.7…
## $ country.year       <chr> "Albania1987", "Albania1987", "Albania1987", "Alban…
## $ HDI.for.year       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gdp_for_year....   <chr> "2,15,66,24,900", "2,15,66,24,900", "2,15,66,24,900…
## $ gdp_per_capita.... <dbl> 796, 796, 796, 796, 796, 796, 796, 796, 796, 796, 7…
## $ generation         <chr> "Generation X", "Silent", "Generation X", "G.I. Gen…
knitr::kable(head(df,5))
| country | year | sex | age | suicides_no | population | suicides.100k.pop | country.year | HDI.for.year | gdp_for_year.... | gdp_per_capita.... | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NA | 2,15,66,24,900 | 796 | Generation X |
| Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NA | 2,15,66,24,900 | 796 | Silent |
| Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NA | 2,15,66,24,900 | 796 | Generation X |
| Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NA | 2,15,66,24,900 | 796 | G.I. Generation |
| Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NA | 2,15,66,24,900 | 796 | Boomers |
# Count missing values in each column (is.na(df) alone prints the full logical matrix)
colSums(is.na(df))

Delete missing data

master <- na.omit(df)
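na.omit() drops every row that has a missing value in any column. The glimpse above shows HDI.for.year starting with a long run of NA values, and the grouped output later in this document has 11,100 rows against 31,756 in the raw data, so this step removes a large share of the dataset. A minimal sketch of a more targeted alternative follows. It uses a small made-up data frame (toy and its values are illustrative, not taken from master.csv) and tidyr's drop_na(), which drops rows only where the named columns are missing.

```r
library(dplyr)
library(tidyr)

# Toy data standing in for df; all values are invented for illustration
toy <- tibble(
  country      = c("A", "A", "B", "B"),
  suicides_no  = c(10L, NA, 3L, 7L),
  HDI.for.year = c(NA, NA, 0.8, NA)
)

# Count missing values in each column
na_counts <- colSums(is.na(toy))

# na.omit() keeps only rows that are complete in every column
complete_rows <- na.omit(toy)

# drop_na() with a column name drops rows only where that column is missing
targeted <- toy %>% drop_na(suicides_no)
```

Whether the targeted version is appropriate depends on which columns the later analysis actually needs.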
Filtering for one year

By filtering data, analysts can focus on a subset of the data that is relevant to a particular research question or analysis, and exclude data that is not needed.

# Filter the master dataset for the year 2020
master_filtered <- master %>% 
  filter(year == 2020)


# datatable() comes from the DT package, which library(tidyverse) does not load
DT::datatable(head(master_filtered),
              options = list(pageLength = 5, scrollX = TRUE, scrollY = "300px"))
Filtering for one country and one year

If you are looking for a particular country, first list the distinct country names to find its exact spelling in the dataset.

# Find all unique values in the country column
distinct_countries <- master %>% 
  distinct(country)

# View the first five unique countries
head(distinct_countries, 5)
##               country
## 1             Albania
## 2 Antigua and Barbuda
## 3           Argentina
## 4             Armenia
## 5           Australia

I was looking for the United States of America in the year 2020.

# Filter for United States of America in 2020
# year is an integer column, so compare it with a number rather than a string
USA <- master %>%
  filter(year == 2020, country == "United States of America")
knitr::kable(head(USA,5))
| country | year | sex | age | suicides_no | population | suicides.100k.pop | country.year | HDI.for.year | gdp_for_year.... | gdp_per_capita.... | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| United States of America | 2020 | male | 5-14 years | 390 | 331501080 | 0.1176467 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Generation X |
| United States of America | 2020 | male | 15-24 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Generation X |
| United States of America | 2020 | male | 25-34 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Boomers |
| United States of America | 2020 | male | 35-54 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Silent |
| United States of America | 2020 | male | 55-74 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | G.I. Generation |
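The same filter() pattern extends to several countries at once. As a small sketch on made-up data (toy and its values are illustrative only), conditions separated by commas are combined with AND, and %in% matches any value in a vector:

```r
library(dplyr)

# Toy data; values are invented for illustration
toy <- tibble(
  country = c("A", "A", "B", "C"),
  year    = c(2019L, 2020L, 2020L, 2020L),
  sex     = c("male", "female", "male", "female")
)

# Comma-separated conditions must all be true for a row to be kept
a_2020 <- toy %>% filter(year == 2020, country == "A")

# %in% keeps rows whose country matches any element of the vector
ab_2020 <- toy %>% filter(year == 2020, country %in% c("A", "B"))
```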
Selecting certain columns from the dataset using select()
master_selected <- master %>%
  select(country, year, sex, age, population)

head(master_selected, 5)
##    country year    sex         age population
## 73 Albania 1995   male 25-34 years     232900
## 74 Albania 1995   male 55-74 years     178000
## 75 Albania 1995 female   75+ years      40800
## 76 Albania 1995 female 15-24 years     283500
## 77 Albania 1995   male 15-24 years     241200
Creating a new variable based on existing variables using mutate()
# Create a new variable: each row's population as a percentage of the
# population summed over all rows (every country, year, sex, and age group)
master_mutated <- master_selected %>%
  mutate(percent_population = population / sum(population) * 100)

# View the first 5 rows of the resulting data frame
head(master_mutated, 5)
##    country year    sex         age population percent_population
## 73 Albania 1995   male 25-34 years     232900       1.996419e-04
## 74 Albania 1995   male 55-74 years     178000       1.525816e-04
## 75 Albania 1995 female   75+ years      40800       3.497376e-05
## 76 Albania 1995 female 15-24 years     283500       2.430162e-04
## 77 Albania 1995   male 15-24 years     241200       2.067566e-04
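The mutate() call above divides by sum(population) over the whole data frame, which is why the percentages are so small. If the goal is each age group's share within its own country and year, one option is to group before mutating. This is a sketch on made-up data (toy and its values are illustrative), not a change to the result shown above:

```r
library(dplyr)

# Toy data; values are invented for illustration
toy <- tibble(
  country    = c("A", "A", "B", "B"),
  year       = c(2020L, 2020L, 2020L, 2020L),
  age        = c("15-24 years", "25-34 years", "15-24 years", "25-34 years"),
  population = c(100, 300, 50, 150)
)

# Inside a grouped mutate(), sum(population) is computed per group,
# so each percentage is relative to that country-year total
shares <- toy %>%
  group_by(country, year) %>%
  mutate(percent_population = population / sum(population) * 100) %>%
  ungroup()
```

Within each country and year the percentages now add up to 100.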

Grouping the data by certain variables using the group_by() function.

# Group the data by country, year, and sex
master_grouped <- master_mutated %>%
  group_by(country, year, sex)

# View the grouped data frame
master_grouped
## # A tibble: 11,100 × 6
## # Groups:   country, year, sex [1,850]
##    country  year sex    age         population percent_population
##    <chr>   <int> <chr>  <chr>            <int>              <dbl>
##  1 Albania  1995 male   25-34 years     232900          0.000200 
##  2 Albania  1995 male   55-74 years     178000          0.000153 
##  3 Albania  1995 female 75+ years        40800          0.0000350
##  4 Albania  1995 female 15-24 years     283500          0.000243 
##  5 Albania  1995 male   15-24 years     241200          0.000207 
##  6 Albania  1995 male   75+ years        25100          0.0000215
##  7 Albania  1995 male   35-54 years     375900          0.000322 
##  8 Albania  1995 female 25-34 years     264000          0.000226 
##  9 Albania  1995 female 35-54 years     356400          0.000306 
## 10 Albania  1995 male   5-14 years      376500          0.000323 
## # … with 11,090 more rows

Summarizing the data to get summary statistics using the summarize() function.

# Summarize the grouped data to get the total population for each combination of country, year, and sex
master_summarized <- master_grouped %>%
  summarize(total_population = sum(population))

# View the resulting data frame
head(master_summarized)
## # A tibble: 6 × 4
## # Groups:   country, year [3]
##   country  year sex    total_population
##   <chr>   <int> <chr>             <dbl>
## 1 Albania  1995 female          1473800
## 2 Albania  1995 male            1429600
## 3 Albania  2000 female          1372400
## 4 Albania  2000 male            1423900
## 5 Albania  2005 female          1399928
## 6 Albania  2005 male            1383392
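The `# Groups: country, year [3]` line in the output above shows that the result is still grouped: by default summarize() removes only the last grouping variable. When an ungrouped result is wanted, the .groups argument can request it. A minimal sketch on made-up data (toy and its values are illustrative):

```r
library(dplyr)

# Toy data; values are invented for illustration
toy <- tibble(
  country    = c("A", "A", "A", "B"),
  year       = c(2020L, 2020L, 2020L, 2020L),
  sex        = c("male", "male", "female", "male"),
  population = c(100, 200, 50, 400)
)

# .groups = "drop" returns a plain, ungrouped tibble
totals <- toy %>%
  group_by(country, year, sex) %>%
  summarize(total_population = sum(population), .groups = "drop")
```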

Analysis

The 10 countries with the highest number of suicides in 2020
library(ggplot2)
library(tidyr)

# Filter for 2020 data
suicides_2020 <- master %>% 
  filter(year == 2020)

# Group by country and calculate total number of suicides
suicides_by_country <- suicides_2020 %>% 
  group_by(country) %>% 
  summarize(total_suicides = sum(suicides_no)) %>%
  replace_na(list(total_suicides = 0))

# Select top 10 countries by total number of suicides
top_10_countries <- suicides_by_country %>%
  top_n(10, total_suicides) %>%
  arrange(desc(total_suicides))

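# Create bar plot of top 10 countries; reorder() sorts the bars from highest
# to lowest total (without it the x-axis is alphabetical)
ggplot(top_10_countries, aes(x = reorder(country, -total_suicides), y = total_suicides, fill = country)) +
  geom_col() +
  labs(x = "Country", y = "Total suicides") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 10 Countries with the Highest Number of Suicides in 2020")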

The 10 countries with the lowest number of suicides in 2020
# Select bottom 10 countries by total number of suicides
bottom_10_countries <- suicides_by_country %>%
  top_n(-10, total_suicides) %>%
  arrange(total_suicides)

# Create bar plot of bottom 10 countries; reorder() sorts the bars from lowest
# to highest total (without it the x-axis is alphabetical)
ggplot(bottom_10_countries, aes(x = reorder(country, total_suicides), y = total_suicides, fill = country)) +
  geom_col() +
  labs(x = "Country", y = "Total suicides") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("10 Countries with the Lowest Number of Suicides in 2020")
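top_n() still works, but recent dplyr versions mark it as superseded in favor of slice_max() and slice_min(), which also return the rows already sorted. A small sketch of the equivalent selection on made-up data (toy and its values are illustrative):

```r
library(dplyr)

# Toy data; values are invented for illustration
toy <- tibble(
  country        = c("A", "B", "C", "D"),
  total_suicides = c(40, 10, 30, 20)
)

# Two highest totals, returned in descending order
top_2 <- toy %>% slice_max(total_suicides, n = 2)

# Two lowest totals, returned in ascending order
bottom_2 <- toy %>% slice_min(total_suicides, n = 2)
```

Like top_n(), both functions keep tied rows by default, so the result can contain more than n rows.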

Comparing Guatemala with the top 10 countries by suicide rate in 2020
library(tidyverse)  # also attaches ggplot2

# Read in the data
master <- read.csv("master.csv")

# Filter for 2020 data only
master_filtered <- master %>% 
  filter(year == 2020)

# Create a subset of the data for the top 10 countries in 2020, ranked by the
# sum of suicides.100k.pop across each country's sex and age groups
top10 <- master_filtered %>%
  group_by(country) %>%
  summarize(suicides_per_100k = sum(suicides.100k.pop, na.rm = TRUE)) %>%
  arrange(desc(suicides_per_100k)) %>%
  top_n(10, suicides_per_100k)

# Create a subset of the data for Guatemala
guatemala <- master_filtered %>% 
  filter(country == "Guatemala")

# Calculate Guatemala's 2020 value the same way as for the top 10 countries
# (sum of the per-group rates), so the bars in the chart are comparable
guatemala_suicide_rate <- sum(guatemala$suicides.100k.pop, na.rm = TRUE)

# Create a data frame with the suicide rate for Guatemala and the top 10 countries with the highest suicide rates
comparison_data <- rbind(
  data.frame(country = "Guatemala", suicides_per_100k = guatemala_suicide_rate),
  top10
)

# Add a column indicating whether each country is in the top 10 or not
comparison_data$highlight <- ifelse(comparison_data$country %in% top10$country, "Yes", "No")

# Create a bar chart showing the suicide rate for Guatemala and the top 10 countries, with highlighting for the top 10 countries
ggplot(comparison_data, aes(x = country, y = suicides_per_100k, fill = highlight)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Comparison of Suicide Rates in Guatemala and Top 10 Countries with Highest Suicide Rates in 2020",
       x = "Country", y = "Suicides per 100,000 people",
       fill = "Highlight") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

Conclusion

I have explored the TidyVerse collection of packages and its capabilities for manipulating, analyzing, and visualizing datasets in the R programming language.

Commonly used TidyVerse packages include dplyr for data manipulation, tidyr for data cleaning, ggplot2 for data visualization, and readr for reading data. The steps in this project relied mainly on the first three.