In this project, my goal is to create a programming example or “vignette” that showcases the capabilities of a TidyVerse package, along with a dataset from either fivethirtyeight.com or Kaggle. The aim of this example is to demonstrate how to effectively use the selected TidyVerse package to manipulate, analyze, and visualize the selected dataset. By doing so, readers will gain a better understanding of the potential of TidyVerse packages and how they can be used to solve real-world data problems.
The dataset I used is called “Suicide Rates Overview 1985 to 2021” and it contains information about suicide rates in various countries from 1985 to 2021.
[Suicide data](https://www.kaggle.com/datasets/omkargowda/suicide-rates-overview-1985-to-2021)
library(tidyverse)
df <- read.csv("master.csv")
glimpse(df)
## Rows: 31,756
## Columns: 12
## $ country <chr> "Albania", "Albania", "Albania", "Albania", "Albani…
## $ year <int> 1987, 1987, 1987, 1987, 1987, 1987, 1987, 1987, 198…
## $ sex <chr> "male", "male", "female", "male", "male", "female",…
## $ age <chr> "15-24 years", "35-54 years", "15-24 years", "75+ y…
## $ suicides_no <int> 21, 16, 14, 1, 9, 1, 6, 4, 1, 0, 0, 0, 2, 17, 1, 14…
## $ population <int> 312900, 308000, 289700, 21800, 274300, 35600, 27880…
## $ suicides.100k.pop <dbl> 6.71, 5.19, 4.83, 4.59, 3.28, 2.81, 2.15, 1.56, 0.7…
## $ country.year <chr> "Albania1987", "Albania1987", "Albania1987", "Alban…
## $ HDI.for.year <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ gdp_for_year.... <chr> "2,15,66,24,900", "2,15,66,24,900", "2,15,66,24,900…
## $ gdp_per_capita.... <dbl> 796, 796, 796, 796, 796, 796, 796, 796, 796, 796, 7…
## $ generation <chr> "Generation X", "Silent", "Generation X", "G.I. Gen…
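The glimpse() output shows that the GDP-for-year column was read as character because its values contain grouping commas. A minimal sketch of one way to turn such strings into numbers, using two hand-typed sample values in the formats that appear in this column rather than the real file (the names `gdp_raw` and `gdp_num` are hypothetical):

```r
# Hand-typed sample values in the two formats seen in this column
gdp_raw <- c("2,15,66,24,900", "2.10E+13")

# Strip the grouping commas, then convert to numeric
gdp_num <- as.numeric(gsub(",", "", gdp_raw, fixed = TRUE))

gdp_num
```

This uses only base R, so it works the same whether the file was read with read.csv() or another reader.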
knitr::kable(head(df, 5))

| country | year | sex | age | suicides_no | population | suicides.100k.pop | country.year | HDI.for.year | gdp_for_year.... | gdp_per_capita.... | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Albania | 1987 | male | 15-24 years | 21 | 312900 | 6.71 | Albania1987 | NA | 2,15,66,24,900 | 796 | Generation X |
| Albania | 1987 | male | 35-54 years | 16 | 308000 | 5.19 | Albania1987 | NA | 2,15,66,24,900 | 796 | Silent |
| Albania | 1987 | female | 15-24 years | 14 | 289700 | 4.83 | Albania1987 | NA | 2,15,66,24,900 | 796 | Generation X |
| Albania | 1987 | male | 75+ years | 1 | 21800 | 4.59 | Albania1987 | NA | 2,15,66,24,900 | 796 | G.I. Generation |
| Albania | 1987 | male | 25-34 years | 9 | 274300 | 3.28 | Albania1987 | NA | 2,15,66,24,900 | 796 | Boomers |
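Before removing rows, it can help to see which columns actually contain missing values, since a single mostly empty column (HDI.for.year is NA in every row previewed above) can cause most of the data to be dropped. A small sketch on a hypothetical toy data frame standing in for df, with made-up values:

```r
library(dplyr)

# Hypothetical toy data standing in for df (values are made up)
toy <- data.frame(
  country      = c("A", "A", "B", "B"),
  suicides_no  = c(3L, NA, 5L, 2L),
  HDI.for.year = c(NA, NA, 0.8, 0.7)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# na.omit() removes every row that has an NA in any column
toy_complete <- na.omit(toy)
nrow(toy_complete)

# To drop rows only when a specific column is missing, filter on that column
toy_partial <- toy %>% filter(!is.na(suicides_no))
nrow(toy_partial)
```

Which of the two approaches is appropriate depends on whether the later analysis needs the sparse column at all.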
is.na(df)
Delete rows with missing data:
master <- na.omit(df)
By filtering data, analysts can focus on the subset that is relevant to a particular research question and exclude data that is not needed.
# Filter the master dataset for the year 2020
master_filtered <- master %>%
  filter(year == 2020)
# datatable() comes from the DT package, which library(tidyverse) does not attach
DT::datatable(head(master_filtered),
              options = list(pageLength = 5, scrollX = TRUE, scrollY = "300px"))
If you are searching for a particular country, you can use the following code.
# Find all unique values in the country column
distinct_countries <- master %>%
  distinct(country)
# View the unique countries
head(distinct_countries, 5)
## country
## 1 Albania
## 2 Antigua and Barbuda
## 3 Argentina
## 4 Armenia
## 5 Australia
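When the exact spelling of a country name is not known, a partial match can narrow the list. A minimal sketch, assuming the stringr package (part of the tidyverse) and a hypothetical toy set of names rather than the full dataset:

```r
library(dplyr)
library(stringr)

# Hypothetical subset of country names for illustration
toy_countries <- data.frame(
  country = c("Albania", "United Kingdom", "United States of America", "Uruguay")
)

# Keep rows whose country name contains the literal text "United"
matches <- toy_countries %>%
  filter(str_detect(country, fixed("United")))

matches$country
```

The exact string found this way can then be used in an equality test such as filter(country == ...).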
I was looking for the United States of America in the year 2020.
# Filter for United States of America in 2020
USA <- master %>%
  filter(year == 2020, country == "United States of America")
knitr::kable(head(USA, 5))

| country | year | sex | age | suicides_no | population | suicides.100k.pop | country.year | HDI.for.year | gdp_for_year.... | gdp_per_capita.... | generation |
|---|---|---|---|---|---|---|---|---|---|---|---|
| United States of America | 2020 | male | 5-14 years | 390 | 331501080 | 0.1176467 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Generation X |
| United States of America | 2020 | male | 15-24 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Generation X |
| United States of America | 2020 | male | 25-34 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Boomers |
| United States of America | 2020 | male | 35-54 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | Silent |
| United States of America | 2020 | male | 55-74 years | 1495 | 331501080 | 0.4509789 | United States of America2020 | 0.9205143 | 2.10E+13 | 63593.44 | G.I. Generation |
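Select the columns of interest using the select() function.
# Keep only the columns needed for the population summaries
master_selected <- master %>%
  select(country, year, sex, age, population)
head(master_selected, 5)
## country year sex age population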
## 73 Albania 1995 male 25-34 years 232900
## 74 Albania 1995 male 55-74 years 178000
## 75 Albania 1995 female 75+ years 40800
## 76 Albania 1995 female 15-24 years 283500
## 77 Albania 1995 male 15-24 years 241200
# Create a new variable: each row's population as a percentage of the population summed over all rows
master_mutated <- master_selected %>%
  mutate(percent_population = population / sum(population) * 100)
# View the first 5 rows of the resulting data frame
head(master_mutated, 5)
## country year sex age population percent_population
## 73 Albania 1995 male 25-34 years 232900 1.996419e-04
## 74 Albania 1995 male 55-74 years 178000 1.525816e-04
## 75 Albania 1995 female 75+ years 40800 3.497376e-05
## 76 Albania 1995 female 15-24 years 283500 2.430162e-04
## 77 Albania 1995 male 15-24 years 241200 2.067566e-04
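Because sum(population) in the code above runs over every row of the data frame, each percentage is a share of the total across all countries and years, which is why the values are so small. If the goal is instead each age group's share within its own country and year, one option is to group before mutating. A sketch on made-up numbers (the toy data frame is hypothetical):

```r
library(dplyr)

# Made-up populations for two country-year combinations
toy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(1995L, 1995L, 1995L, 1995L),
  age        = c("15-24 years", "25-34 years", "15-24 years", "25-34 years"),
  population = c(100, 300, 50, 150)
)

# Share of each row within its own country and year
toy_share <- toy %>%
  group_by(country, year) %>%
  mutate(percent_population = population / sum(population) * 100) %>%
  ungroup()

toy_share$percent_population
```

Inside a grouped mutate(), sum(population) is evaluated once per group, so the percentages add up to 100 within each country and year.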
Group the data by certain variables using the group_by() function.
# Group the data by country, year, and sex
master_grouped <- master_mutated %>%
  group_by(country, year, sex)
# View the grouped data frame
master_grouped
## # A tibble: 11,100 × 6
## # Groups: country, year, sex [1,850]
## country year sex age population percent_population
## <chr> <int> <chr> <chr> <int> <dbl>
## 1 Albania 1995 male 25-34 years 232900 0.000200
## 2 Albania 1995 male 55-74 years 178000 0.000153
## 3 Albania 1995 female 75+ years 40800 0.0000350
## 4 Albania 1995 female 15-24 years 283500 0.000243
## 5 Albania 1995 male 15-24 years 241200 0.000207
## 6 Albania 1995 male 75+ years 25100 0.0000215
## 7 Albania 1995 male 35-54 years 375900 0.000322
## 8 Albania 1995 female 25-34 years 264000 0.000226
## 9 Albania 1995 female 35-54 years 356400 0.000306
## 10 Albania 1995 male 5-14 years 376500 0.000323
## # … with 11,090 more rows
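group_by() does not change the rows or columns; it only records the grouping so that later verbs operate per group. A short sketch on a hypothetical toy data frame that inspects this metadata:

```r
library(dplyr)

# Made-up data: three distinct country/sex combinations in four rows
toy <- data.frame(
  country    = c("A", "A", "A", "B"),
  sex        = c("male", "male", "female", "male"),
  population = c(10, 20, 30, 40)
)

toy_grouped <- toy %>% group_by(country, sex)

# Same number of rows as before grouping
nrow(toy_grouped)

# Names of the grouping variables and number of distinct groups
group_vars(toy_grouped)
n_groups(toy_grouped)

# ungroup() removes the grouping again
toy_ungrouped <- ungroup(toy_grouped)
```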
Summarize the grouped data to get summary statistics using the summarize() function.
# Summarize the grouped data to get the total population for each combination of country, year, and sex
master_summarized <- master_grouped %>%
  summarize(total_population = sum(population))
# View the resulting data frame
head(master_summarized)
## # A tibble: 6 × 4
## # Groups: country, year [3]
## country year sex total_population
## <chr> <int> <chr> <dbl>
## 1 Albania 1995 female 1473800
## 2 Albania 1995 male 1429600
## 3 Albania 2000 female 1372400
## 4 Albania 2000 male 1423900
## 5 Albania 2005 female 1399928
## 6 Albania 2005 male 1383392
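The "Groups: country, year" line in the output above shows that summarize() removes only the last grouping variable by default, so the result is still grouped. A sketch, on made-up data, of using the .groups argument (available in dplyr 1.0.0 and later) to return an ungrouped result; the toy data frame is hypothetical:

```r
library(dplyr)

toy <- data.frame(
  country    = c("A", "A", "A", "A"),
  year       = c(1995L, 1995L, 2000L, 2000L),
  sex        = c("female", "male", "female", "male"),
  population = c(10, 20, 30, 40)
)

# Default behaviour: the result stays grouped by country
by_year_default <- toy %>%
  group_by(country, year) %>%
  summarize(total_population = sum(population))

# .groups = "drop" returns a plain, ungrouped tibble
by_year_flat <- toy %>%
  group_by(country, year) %>%
  summarize(total_population = sum(population), .groups = "drop")

group_vars(by_year_default)
group_vars(by_year_flat)
```

Dropping the groups explicitly avoids surprises when a later mutate() or filter() would otherwise run per country.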
library(ggplot2)
library(tidyr)
# Filter for 2020 data
suicides_2020 <- master %>%
  filter(year == 2020)
# Group by country and calculate total number of suicides
suicides_by_country <- suicides_2020 %>%
  group_by(country) %>%
  summarize(total_suicides = sum(suicides_no)) %>%
  replace_na(list(total_suicides = 0))
# Select top 10 countries by total number of suicides
top_10_countries <- suicides_by_country %>%
  top_n(10, total_suicides) %>%
  arrange(desc(total_suicides))
# Create bar plot of top 10 countries
ggplot(top_10_countries, aes(x = country, y = total_suicides, fill = country)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Top 10 Countries with the Highest Number of Suicides in 2020")
# Select bottom 10 countries by total number of suicides
bottom_10_countries <- suicides_by_country %>%
  top_n(-10, total_suicides) %>%
  arrange(total_suicides)
# Create bar plot of bottom 10 countries
ggplot(bottom_10_countries, aes(x = country, y = total_suicides, fill = country)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ggtitle("Bottom 10 Countries with the Lowest Number of Suicides in 2020")
library(tidyverse)
library(ggplot2)
# Read in the data
master <- read.csv("master.csv")
# Filter for 2020 data only
master_filtered <- master %>%
  filter(year == 2020)
# Create a subset of the data for the top 10 countries with the highest suicide rates in 2020
# Note: this adds up the per-group rates within each country; it is not a population-weighted rate
top10 <- master_filtered %>%
  group_by(country) %>%
  summarize(suicides_per_100k = sum(suicides.100k.pop, na.rm = TRUE)) %>%
  arrange(desc(suicides_per_100k)) %>%
  top_n(10, suicides_per_100k)
# Create a subset of the data for Guatemala
guatemala <- master_filtered %>%
  filter(country == "Guatemala")
# Calculate the total number of suicides in Guatemala in 2020
guatemala_total_suicides <- sum(guatemala$suicides_no, na.rm = TRUE)
# Calculate the total population of Guatemala in 2020
guatemala_total_population <- sum(guatemala$population, na.rm = TRUE)
# Calculate the suicide rate per 100,000 people in Guatemala in 2020
guatemala_suicide_rate <- guatemala_total_suicides / guatemala_total_population * 100000
# Create a data frame with the suicide rate for Guatemala and the top 10 countries with the highest suicide rates
comparison_data <- rbind(
  data.frame(country = "Guatemala", suicides_per_100k = guatemala_suicide_rate),
  top10
)
# Add a column indicating whether each country is in the top 10 or not
comparison_data$highlight <- ifelse(comparison_data$country %in% top10$country, "Yes", "No")
# Create a bar chart showing the suicide rate for Guatemala and the top 10 countries, with highlighting for the top 10 countries
ggplot(comparison_data, aes(x = country, y = suicides_per_100k, fill = highlight)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Comparison of Suicide Rates in Guatemala and Top 10 Countries with Highest Suicide Rates in 2020",
       x = "Country", y = "Suicides per 100,000 people",
       fill = "Highlight") +
  theme(plot.title = element_text(hjust = 0.5),
        legend.position = "bottom",
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
I have explored the TidyVerse packages and their capabilities for manipulating, analyzing, and visualizing datasets using the R programming language.
Some commonly used TidyVerse packages include dplyr for data manipulation, tidyr for data cleaning, ggplot2 for data visualization, and readr for reading data.