Data Reporting with R: Best Practices from Tables to Time Series

Author

Kaburungo

Demographic Pyramids

Load packages

  • This lesson requires R packages for data manipulation, data visualization, and relative file paths.
# Load packages
if(!require(pacman)) install.packages("pacman")
Loading required package: pacman
pacman::p_load(tidyverse, # to clean, wrangle, and plot data
       here,      # to locate files
       apyramid)   # package dedicated to creating age pyramids

Data Preparation

Intro to the Dataset

  • We’ll be using a simulated HIV dataset of linelist cases in Zimbabwe during 2016.

  • Our focus: the age-related and sex variables needed for the demographic pyramid.

Importing data

  • Let’s import our dataset into RStudio and inspect the variables:
# Import the data from a CSV file
# Assumption: the RStudio project root is the Care_and_Treatment_Analysis folder,
# so here() can build the path relative to it instead of using an absolute path
hiv_data <- read.csv(here("Input", "hiv_zw_linelist_2016.csv"))

# display the data frame
head(hiv_data)
  age_group    sex hiv_status
1     20-24 female   positive
2     35-39   male   positive
3     15-19 female   negative
4     40-44   male   negative
5     45-49   male   positive
6     35-39 female   negative
  • Our dataset has 28,000 cases across 3 columns: age_group, sex, and hiv_status.

  • To create a demographic pyramid of HIV-positive cases, we’ll filter the data to include only HIV-positive individuals.

# Filter HIV-positive individuals and save them as a new dataset
hiv_case <- 
  hiv_data %>% 
  filter(hiv_status == "positive")
# view dataset
head(hiv_case)
  age_group    sex hiv_status
1     20-24 female   positive
2     35-39   male   positive
3     45-49   male   positive
4     50-54   male   positive
5     20-24 female   positive
6     45-49 female   positive
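
As a quick sanity check on the filtering step, the sketch below repeats it on a small made-up data frame and confirms that only positive rows remain. The names toy_data and toy_case are hypothetical and the values are invented for illustration; they are not the real linelist.

```r
library(dplyr)

# Hypothetical toy linelist with the same three columns as hiv_data
toy_data <- data.frame(
  age_group  = c("20-24", "35-39", "15-19", "40-44"),
  sex        = c("female", "male", "female", "male"),
  hiv_status = c("positive", "positive", "negative", "negative")
)

# Keep only HIV-positive individuals, as in the step above
toy_case <- toy_data %>%
  filter(hiv_status == "positive")

# Check 1: every remaining row is positive
all(toy_case$hiv_status == "positive")

# Check 2: no positive row was lost by the filter
nrow(toy_case) == sum(toy_data$hiv_status == "positive")
```

The same two checks can be run on hiv_case and hiv_data once the real file is loaded.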

Data Inspection

  • Next, we’ll look at a summary table for age_group and sex to check that our data is clean.
# summarize the data into a table by age group and sex
hiv_case %>% 
  count(age_group,sex) %>% head(5)
  age_group    sex  n
1     00-04 female  3
2     00-04   male  6
3     05-09 female  8
4     05-09   male 10
5     10-14 female 42
  • Age groups are in ascending order, which is crucial because the pyramid plots the age groups in this order.
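
The labels in this dataset sort correctly because they are zero-padded ("00-04", "05-09"). As a hedged illustration of what could go wrong otherwise, the sketch below uses hypothetical labels without zero padding and shows one way to fix the order by setting factor levels explicitly. The vectors age_raw and age_levels are invented for this example.

```r
# Hypothetical age-group labels without zero padding
age_raw <- c("10-14", "0-4", "5-9", "15-19")

# Alphabetical sorting puts "10-14" and "15-19" before "5-9"
sort(age_raw)

# Setting the levels explicitly fixes the order used on the plot axis
age_levels <- c("0-4", "5-9", "10-14", "15-19")
age_fct <- factor(age_raw, levels = age_levels)

levels(age_fct)
sort(age_fct)
```

With a factor, both count() and ggplot2 follow the level order rather than alphabetical order.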

Grouping and aggregating Data

  • To plot the pyramid bars, we need to summarize the linelist observations into an aggregated table.

  • We want positive female totals and negative male totals, so that the bars plot on opposite sides of the axis.

  • We can use count() and mutate() to get total cases and percents grouped by age group and sex.

# Create new subsets with grouped counts
pyramid_data <- 
  hiv_case %>% 
  # count total cases by age group and sex
  count(age_group, sex, name = "total") %>%
  # create new columns for x axis values on the plot
  mutate(
    # add column with diverging axis values: converts male counts to negative
    axis_count = ifelse(sex == "male", -total, total),
    # add column with percentage axis values
    axis_percent = round(100 * (axis_count / sum(total)), digits = 1)
  )
head(pyramid_data)
  age_group    sex total axis_count axis_percent
1     00-04 female     3          3          0.1
2     00-04   male     6         -6         -0.2
3     05-09 female     8          8          0.2
4     05-09   male    10        -10         -0.3
5     10-14 female    42         42          1.1
6     10-14   male    23        -23         -0.6
  • Now that the data is summarized in an appropriate format, we can use pyramid_data to create a population pyramid with {ggplot2}!
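
Before plotting, it may help to confirm that the aggregated table behaves as intended. The sketch below rebuilds the same columns on a tiny invented linelist (toy_case and toy_pyramid are hypothetical names) and checks the signs and the percentage total.

```r
library(dplyr)

# Hypothetical linelist of positive cases (not the real data)
toy_case <- data.frame(
  age_group = c("00-04", "00-04", "05-09", "05-09", "05-09"),
  sex       = c("female", "male", "male", "male", "female")
)

# Same aggregation steps as for pyramid_data
toy_pyramid <- toy_case %>%
  count(age_group, sex, name = "total") %>%
  mutate(
    axis_count   = ifelse(sex == "male", -total, total),
    axis_percent = round(100 * axis_count / sum(total), digits = 1)
  )

# Male values should be negative, female values positive
all(toy_pyramid$axis_count[toy_pyramid$sex == "male"] < 0)
all(toy_pyramid$axis_count[toy_pyramid$sex == "female"] > 0)

# Absolute percentages should add up to about 100
# (rounding to one decimal can shift the total slightly)
sum(abs(toy_pyramid$axis_percent))
```

Note that sum(total) is taken over all rows, so each percentage is a share of all positive cases, not a share within one sex.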

Alternative to {ggplot2}

The {apyramid} package provides the function age_pyramid(), which allows for the rapid creation of age-sex pyramids directly from a linelist:

# Start with the original linelist (no need to summarize counts)
hiv_case %>% 
  mutate(age_group = factor(age_group)) %>% 
  apyramid::age_pyramid(
    # Required arguments:
    age_group = "age_group",
    split_by = "sex")
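
If the counts are already aggregated, age_pyramid() can, as far as I understand the package documentation, also take a count column and show proportions. The sketch below assumes the count and proportional arguments behave this way in the installed version of {apyramid}; toy_counts is a hypothetical table with invented values.

```r
library(apyramid)

# Hypothetical aggregated counts (not the real data)
toy_counts <- data.frame(
  age_group = factor(c("00-04", "00-04", "05-09", "05-09")),
  sex       = c("female", "male", "female", "male"),
  total     = c(3, 6, 8, 10)
)

toy_apyramid <- apyramid::age_pyramid(
  data         = toy_counts,
  age_group    = "age_group",
  split_by     = "sex",
  count        = "total",  # column holding pre-computed counts
  proportional = TRUE      # show percentages of the total instead of counts
)
toy_apyramid
```

The result is a ggplot object, so further {ggplot2} layers such as labs() can be added to it.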

Plotting Demographic Pyramids with {ggplot2}

  • Remember: a demographic pyramid is a modified version of a stacked bar plot.

  • A basic stacked bar plot with geom_col() needs a categorical variable plotted against a continuous variable (e.g., age_group vs. total), with fill set to a second categorical variable (e.g., sex).

# Basic stacked bar plot: bars stacked on top of each other
# initialize plot
ggplot() +
# Create bar graph using geom_col()
  geom_col(data = pyramid_data, # specify dataset for plotting
           aes(x = age_group,   # Indicate categorical x variable
              y = total,        # Indicate continuous y variable
              fill = sex)) +    # fill by second categorical variable
# Modify theme
  theme_light()+
  coord_flip()

Using geom_col() for demographic pyramids

  • We can build on the stacked bar code above to create a demographic pyramid

  • This time we use axis_count for the y variable instead of total, because axis_count holds negative values for the male counts

# Create and save plot to environment
demo_pyramid <- 
  ggplot() +
  geom_col(data = pyramid_data,
           aes(x = age_group,
               y = axis_count, # indicate special y variable
               fill = sex)) +
  theme_light() +
  # Flip x and y axes
  coord_flip()
demo_pyramid

  • We can also use percentage values (axis_percent) on the y axis
demo_pyramid_percentage <- 
  ggplot() +
  geom_col(data = pyramid_data,
           aes(x = age_group,
               y = axis_percent, # indicate special y variable
               fill = sex)) +
  theme_light() +
  # Flip x and y axes
  coord_flip()
demo_pyramid_percentage

Plot Customization

  • Next we’ll re-scale the case count axis, add informative labels, and edit non-data elements

  • The current graph shows asymmetrical axis limits because the case counts differ between the sexes

Choosing the limits

  • Our goal is an axis that is symmetrical, with the same length on both sides around zero.
# store the highest case count as an object
max_count <- 
  max(pyramid_data$total)
max_count
[1] 318
  • We will edit our continuous axis using the scale_y_continuous() function

  • Set the lower and upper axis limits to -max_count and max_count.

  • Define the spacing between axis breaks.

  • Convert the negative labels left of 0 to their absolute values (remove the minus sign)

# Add scales layers to the previous graph
custom_axes <- 
  demo_pyramid +
  # Scale function for y axis (axis_count)
  scale_y_continuous(
    # Specify upper and lower limits of the same length, for symmetry
    limits = c(-max_count,max_count),
    #Specify the spacing between axis break labels
    breaks = scales::breaks_width(100),
    # Make axis numbers absolute so male counts appear positive
    labels = abs
  )
custom_axes

  • Equal limits on both sides of zero ensure a symmetrical and accurate visual comparison.

  • Negative numbers from axis_count are displayed as positive numbers using labels = abs

  • The spacing between breaks is fixed at 100 using breaks_width(100)
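
With max_count = 318 and a break every 100, the outermost labelled breaks sit at 300, so the longest bar extends past the last label. One optional refinement, which is an assumption of this sketch and not part of the original plot, is to round the limit up to the next multiple of the break width. The names break_width, axis_limit and rounded_scale are hypothetical.

```r
library(ggplot2)

max_count   <- 318  # value reported above
break_width <- 100  # same spacing as breaks_width(100)

# Round the limit up to the next multiple of the break width
axis_limit <- ceiling(max_count / break_width) * break_width
axis_limit

# Hypothetical scale layer using the rounded limit
rounded_scale <- scale_y_continuous(
  limits = c(-axis_limit, axis_limit),
  breaks = scales::breaks_width(break_width),
  labels = abs
)
```

Adding rounded_scale to demo_pyramid in place of the scale layer above would give an axis that ends on a labelled break on both sides.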

Watch Out

  • Although it looks like the x axis, axis_count is actually mapped to the y axis

  • The code to create demo_pyramid had age_group mapped to x and axis_count mapped to y

  • We then flipped the orientation of the plot using coord_flip().

demo_pyramid <- 
  ggplot() +
  geom_col(data = pyramid_data,
           aes(x = age_group,
               y = axis_count, # indicate special y variable
               fill = sex)) +
  theme_light() +
  # Flip x and y axes
  coord_flip()
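
As an alternative sketch, and assuming ggplot2 version 3.3.0 or later (where geom_col() detects the bar orientation from the mapped variables), the counts can be mapped to x directly so that no coord_flip() is needed. The table toy_pyramid is hypothetical and only mimics the columns of pyramid_data.

```r
library(ggplot2)

# Hypothetical aggregated table with the same columns as pyramid_data
toy_pyramid <- data.frame(
  age_group  = c("00-04", "00-04", "05-09", "05-09"),
  sex        = c("female", "male", "female", "male"),
  axis_count = c(3, -6, 8, -10)
)

# Map the counts to x and the age groups to y; no coord_flip() needed
toy_plot <-
  ggplot() +
  geom_col(data = toy_pyramid,
           aes(x = axis_count,
               y = age_group,
               fill = sex)) +
  # The counts are now a true x axis, so the x scale function applies
  scale_x_continuous(labels = abs) +
  theme_light()
toy_plot
```

With this mapping the scale function matches what is seen on screen (scale_x_continuous() for the horizontal axis), which avoids the confusion described above. The rest of this lesson keeps the coord_flip() version.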

Add Custom Labels

  • Use the labs() function to add an informative main title, subtitle, axis titles, and caption.
custome_labels <- 
  # Start with the previous demographic pyramid
  custom_axes +
  # labs() controls the label text
  labs(
    title = "HIV Positive cases by Age and Sex",
    subtitle = "Zimbabwe (2016)",
    x = "Age Group",
    y = "Number of Cases",
    fill = "Sex",
    caption = stringr::str_glue("Data are from simulated linelist \nn = {nrow(hiv_case)}")
  )
custome_labels
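
The caption above is built with str_glue(), which evaluates the R expression inside the curly braces and inserts the result into the string. A minimal sketch of that behaviour, using a hypothetical three-row stand-in for hiv_case:

```r
library(stringr)

# Hypothetical stand-in for hiv_case with three rows
toy_case <- data.frame(sex = c("female", "male", "female"))

# {nrow(toy_case)} is evaluated and inserted into the string;
# "\n" starts a new line in the caption
toy_caption <- str_glue("Data are from simulated linelist \nn = {nrow(toy_case)}")
toy_caption
```

Because the count is computed from the data, the caption stays correct if the filtered dataset changes.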

Enhance Color Scheme and Themes

  • Let’s apply some style adjustments to our demographic pyramid
Custom_color_theme <- 
  # Build upon the previous plot
  custome_labels +
  # Manually specify custom colors for each sex
  scale_fill_manual(values = c("female" = "lightblue",
                               "male" = "darkblue"),
                    # capitalize legend labels
                    labels = c("Female","Male")) +
  # fine-tune theme elements for a cleaner look
  theme(axis.line = element_line(colour = "black"), # make axis lines black
         # Center title and subtitle
         plot.title = element_text(hjust = 0.5),
         plot.subtitle = element_text(hjust = 0.5),
         # Format caption text
         plot.caption = element_text(hjust = 0, #align left
                                     size = 11, #increase font size
                                     face = "italic")) #italicize text
Custom_color_theme

Wrap-Up

  • Demographic pyramids are vital tools for visualizing disease distribution by age and sex.

  • The techniques used here can be extended to other graphs with negative and positive values.

  • You can now take these concepts to visualize cases against baseline populations or the impact of health interventions.

  • This knowledge is invaluable for epidemiological analysis and reporting.