Final Project: Global mortality

Introduction

For this assignment, I used the ‘Global mortality’ dataset which can be found here https://github.com/rfordatascience/tidytuesday/blob/master/data/2018/2018-04-16/global_mortality.xlsx .

The original dataset includes information on the shares of 32 causes of death for 226 countries/regions for 1990-2016 (6156 observations). This project tried to answer the following questions using different types of data visualization:

What causes were the most frequent in the world in the latest year available (2016)?
How the share for different causes changed in 2016 compared to 1990?
How did the share for causes that were the most frequent in 2016 change through the years?
In which world regions/countries percentages for the most frequent causes are the highest?

Most frequent causes in the world in 2016

The first graph illustrates the proportion of all deaths for all 32 causes.

# Downloading necessary libraries
library(dplyr)
library(shiny)
library(ggplot2)
library(tidyverse)
library(maps)
library(animation)
library(sf)
library(magrittr)
library(treemapify)

# Defining custom theme
my_theme <- function() {
  theme(
    panel.border = element_rect(colour = "gray", fill = NA),
    panel.background = element_rect(fill = "white"),
    panel.grid.major.x = element_line(colour = "gray", linetype = 3, size = 0.5),
    panel.grid.minor.x = element_blank(),
    panel.grid.major.y =  element_line(colour = "gray", linetype = 3, size = 0.5),
    panel.grid.minor.y = element_blank(),
    axis.text = element_text(colour = "black", face = "italic"),
    axis.title = element_text(colour = "black"),
    axis.ticks = element_line(colour = "black"),
  )
}

# Downloading the data and changing names for the map
mortality <- readxl::read_excel("global_mortality.xlsx")%>% 
  mutate(country = case_when(
    country == 'Antigua and Barbuda' ~ 'Antigua',
    country == 'United States' | country == 'United States and Virgin Islands' ~ 'USA',
    country == 'United Kingdom' ~ 'UK',
    country == 'England' ~ 'UK',
    country == 'Scotland' ~ 'UK',
    country == 'Trinidad and Tobago' ~ 'Trinidad',
    country == 'Congo' ~ 'Republic of Congo',
    country == 'Democratic Republic of Congo' ~ 'Democratic Republic of the Congo',
    TRUE ~ country
  ))

# Deleting (%) from the causes of death
names(mortality) <- names(mortality) %>%
  gsub("\\s\\(%\\)","", .)

# Changing the format to long
mortality.long <- mortality %>% 
  gather(cause, percent, 4:35) %>%
  mutate(cause = as.factor(cause), country = as.factor(country), country_code = as.factor(country_code))

# Creating the dataset for the world
mortality_world <- mortality.long  %>% 
  filter(country_code %in% ("OWID_WRL"), year %in% c(1990, 2016)) %>%
  spread(year, percent) %>%
  mutate(percentchange = `2016` - `1990`)

# Lollipop graph for the most frequent causes in 2016
ggplot (mortality_world, aes(x = reorder(cause, `2016`), y =`2016`)) +
  geom_segment(aes(x = reorder(cause, `2016`), xend = reorder(cause, `2016`), y = 0, yend = `2016`),
               color = "gray", lwd = 1) +
  geom_point(size = 3, pch = 21, bg = "#990000", color="black") +
  coord_flip() +
  geom_text(aes(label = paste0(round(`2016`,2), "%")), color = "black", size = 2.7, hjust = -0.3) +
  scale_y_continuous(limits = c(0, 40), breaks = c(0, 10, 20, 30, 40),
                     label = c("0%", "10%", "20%", 
                               "30%", "40%")) +
  labs(x = "Cause of Death", y = "Percentage of all deaths",
       title = "Percentages by Cause of Death in 2016 (World)") +
  theme(plot.title = element_text(hjust = 0, vjust = 1)) +
  my_theme()

As we can see from the graph, ‘Cardiovascular diseases’ and ‘Cancers’ accounted for around 50% of all deaths in 2016. Other reasons that had relatively high percentages (> 4%) are: ‘Respiratory diseases’, ‘Diabetes’, ‘Dementia’, and ‘Lower respiratory infections’.

Comparison of causes’ world shares in 1990 and 2016

The next plot presents the difference in the shares for 1990 and 2016.

# New columns for the changes in the percentages
mortality_world$percentchange <- round(mortality_world$percentchange, digits = 2)
mortality_world$percent_type <- ifelse(mortality_world$percentchange < 0, "increased", "decreased") 
mortality_world <- mortality_world[order(mortality_world$percentchange),]
mortality_world$cause <- factor(mortality_world$cause, levels = mortality_world$cause)
mortality_world$r_2016 <- round(mortality_world$`2016`)

# Changes in the share 1990-2016
ggplot(mortality_world,aes(x = factor(cause), y = percentchange, label = percentchange)) + 
  geom_bar(stat = "identity", aes(fill = percent_type), width = .5) +
  geom_text(color = "black", size = 2.5, nudge_y = 0.2) +  
  coord_flip() +  
  scale_fill_manual(name="Percent Change", 
                    labels = c("Increased", "Decreased"), 
                    values = c("increased"="#ff9966", "decreased"="#6699ff"))  +  
  labs(y = NULL, x = NULL, title = "Difference in the Shares for 1990 and 2016 (World)") +
  my_theme()

Turned out that ‘Cardiovascular diseases’ and ‘Cancers’ not only had the highest shares in 2016 but also had the highest rise in compassion with 1990. ‘Diabetes’ and ‘Dementia’, which had fairly high values in 2016, have grown quite significantly as well (by 2.4% and 2.29% respectively). Additionally, we can notice that the ‘HIV/AIDS’ is also quite high in this list. Compared to 1990, it grew by 1.27% (the highest rise relative to its value in 2016 which was equal to 1.89%).

Conversely, ‘Neonatal deaths’, ‘Diarrheal diseases’, ‘Lower respiratory diseases’, and ‘Tuberculosis’ had the highest drop in 2016 shares in comparison with 1990.

The changes in the world shares for the top 10 causes in 2016 through years

The graph below shows how the share for the 10 most frequent causes changed through the years.

# Change through years for top 10 diseases
mortality.long %>% filter(country_code=='OWID_WRL') %>%
  filter(cause %in% c('Cardiovascular diseases', 'Cancers', 'Respiratory diseases',
                      'Diabetes', 'Dementia', 'Lower respiratory infections',
                      'Neonatal deaths', 'Diarrheal diseases', 'Road accidents',
                      'Liver diseases')) %>%
  ggplot(aes(x=year, y=percent, color=cause)) +
  geom_line(size=1.2) +
  geom_point(size=3) +
  theme(legend.position="top") +
  guides(color=guide_legend(nrow=2, byrow=TRUE)) +
  labs(x = "Year", y = "Percentage of all deaths",
       title = "Top 10 causes of death through years (World)") +
  scale_x_continuous(breaks = seq(1990, 2016, by = 5)) +
  scale_y_continuous(breaks = c(10, 20, 30),
                     label = c("10%", "20%", "30%")) +
  scale_color_brewer(palette="Set1") +
  my_theme()

We can observe that ‘Cardiovascular diseases’ and ‘Cancers’ steadily rise with each year. As for the other 8 causes, ‘Diabetes’ and ‘Dementia’ are the only ones for which the shares increase consistently. In 2016, they were among the five most frequent causes of death, although in 1990 they were not. It is noticeable how a constant positive trend has lifted ‘Diabetes’ and ‘Dementia’ up the list.

Changes in shares of ‘Cardiovascular diseases’ and ‘Cancers’ for different regions through years

From the previous findings, it is evident that ‘Cardiovascular diseases’ and ‘Cancers’ are the most frequent causes of death on earth with ever-increasing proportions. Therefore, I decided to look at the distribution of their shares in different world regions.

# Creating region variable
mortality.long$region <- case_when(mortality.long$country %in% c("Central Asia","East Asia","Southeast Asia","South Asia") ~ "Asia",
                            mortality.long$country == "Oceania" ~ "Oceania",
                            mortality.long$country == "Latin America and Caribbean" ~ "South America",
                            mortality.long$country == "North America" ~ "North America",
                            mortality.long$country %in% c("Central Europe", "Eastern Europe","Western Europe") ~ "Europe",
                            mortality.long$country %in% c("North Africa and Middle East","Sub-Saharan Africa") ~ "Africa",
                            TRUE ~ "Other")
# Boxplot for Cardiovascular diseases
p1 <- mortality.long %>%
  filter(cause %in% c("Cardiovascular diseases")) %>%
  filter(region != "Other") %>%
  group_by(region, year) %>%
  summarize(percent = mean(percent, na.rm=TRUE)) %>%
  ggplot(aes(x=reorder(region, -percent), y=percent, fill=region)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "RdBu") +
  labs(y = "Percentage of all deaths",
       title = "Cardiovascular diseases different world regions") +
  scale_y_continuous(breaks = c(10, 20, 30, 40 ,50),
                     label = c("10%", "20%", "30%", "40%", "50%")) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
    axis.title.x= element_blank(),
    legend.position = "none",
    text = element_text(size = 20)) +
  my_theme()

# Boxplot for Cancers
p2 <- mortality.long %>%
  filter(cause %in% c("Cancers")) %>%
  filter(region != "Other") %>%
  group_by(region, year) %>%
  summarize(percent = mean(percent, na.rm=TRUE)) %>%
  ggplot(aes(x=reorder(region, -percent), y=percent, fill=region)) +
  geom_boxplot() +
  scale_fill_brewer(palette = "RdBu") +
  labs(y = "Percentage of all deaths",
       title = "Cancers in different world regions") +
  scale_y_continuous(breaks = c(10, 20, 30, 40 ,50),
                     label = c("10%", "20%", "30%", "40%", "50%")) +
  theme(axis.text.x = element_text(angle = 35, hjust = 1),
        axis.title.x= element_blank(),
        legend.position = "none",
        text = element_text(size = 20)) +
  my_theme()
p1
p2

As we can see from the graph, the highest shares of ‘Cardiovascular diseases’ through the years are in Europe with a mean of 50%. The following regions are North America and Asia with an average of about 35%.

As for ‘Cancers’ the highest shares are for North America and Europe with the mean values above 20%.

Noticeably, the shares for these two causes are the lowest in Oceania and Africa.

Lastly, I checked the changes in shares for ‘Cardiovascular diseases’ on a country level using the animated map below.

# Creating dataset for the map
world_map <- st_as_sf(maps::map('world', plot = FALSE, fill = TRUE))
map_years <- world_map %>% 
  full_join(mortality, by = c('ID' = 'country'))

# Function for the animated map
gif_map <- function(data, cause_of_death) {
  fill_lim <- range(data %$% get(cause_of_death), na.rm = T)
  data_list <- split(data, data$year)
  out <- lapply(data_list, function(data, cause_of_death, fill_lim){
    p <- ggplot(data, aes(fill = get(cause_of_death))) +
      geom_sf(size = .2, color = 'black') +
      scale_fill_gradient(low = "#ffefef", high = "#720000", space = "Lab",
                          na.value = "#c6c6c6", guide = "colourbar", 
                          limits = c(fill_lim[1], fill_lim[2]),
                          name = cause_of_death) +
      ggthemes::theme_map() +
      labs(title = paste(cause_of_death,"in", unique(data$year))) +
      theme(panel.grid.major = element_line(colour = 'gray', size = .2, linetype = 'dashed'),
            legend.position = 'bottom')
    print(p)
  }, cause_of_death, fill_lim)
}

img <- magick::image_graph(width = 800, height = 350, res = 96)

# gif for Cardiovascular diseases
gif_map(map_years, "Cardiovascular diseases")
dev.off()
animation <- magick::image_animate(img, fps = 1)

# Savinf gif
magick::image_write(animation, path = 'cardio_map.gif')

The highest values are observed in Russia and Eastern Europe. Moreover, the share did not change much through the years and stayed at a high level for the whole 1990-2016 period. Additionally, it is noticeable that the share of ‘Cardiovascular diseases’ dropped for North America, south part of Latin America and Australia.

Conclusion

The highest shares in 2016 were for ‘Cardiovascular diseases’ and ‘Cancers’ (32.26% and 16.32% respectively). Moreover, these causes had the constant and the biggest rise throughout the whole available period. The highest values of shares for these two causes are observed in Europe and North America.

The project has potential limitations due to the availability of data only until 2016. The pandemic in 2020 could substantially change the results. Therefore, it can be further improved by the addition of updated statistics.