Data Reporting with R: Best Practices from Tables to Time Series

Author

Kaburungo

Bar Charts & Pie Charts

Load packages

In this project we will use the following packages:

  • {tidyverse} for data wrangling and data visualization

  • {here} for project-relative file paths
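
The packages listed above can be attached at the top of the script. A minimal sketch, assuming both are already installed:

```r
# Attach the packages used throughout this lesson.
# Assumption: both are already installed, e.g. with
# install.packages(c("tidyverse", "here")).
library(tidyverse)  # ggplot2, dplyr, and the %>% pipe
library(here)       # project-relative file paths
```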

Data: TB treatment outcomes in Benin

  • Public Health Data Analysis: Compare subgroup metrics and understand data contributions.

  • Focus on Benin’s TB Data: Investigate WHO-provided sub-national TB data from a DHIS2 dashboard.

  • Data composition: Includes new and relapse TB cases started on treatment.

  • Disaggregation categories: Data broken down by time period, health facilities, treatment outcome and diagnosis type.

Import the tb_outcomes data subset.

# Import data from csv
tb_outcomes <- 
  read.csv("C:/Users/Perminus Njiru/OneDrive - LVCT Health/Desktop/Freecodecamp/R/Care_and_Treatment_Analysis/Input/benin_tb.csv")
head(tb_outcomes)
  period period_date        hospital     outcome cases  diagnosis_type
1 2015Q4  2015-10-01 St Jean De Dieu      failed     0 bacteriological
2 2015Q4  2015-10-01 St Jean De Dieu unevaluated     0 bacteriological
3 2015Q4  2015-10-01 St Jean De Dieu        died     0 bacteriological
4 2015Q4  2015-10-01 St Jean De Dieu        lost     0 bacteriological
5 2015Q4  2015-10-01 St Jean De Dieu   completed     0 bacteriological
6 2015Q4  2015-10-01 St Jean De Dieu       cured    11 bacteriological
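
The absolute path above only resolves on one machine. Since {here} is among the packages for this project, the same import could be written relative to the project root. This is only a sketch: it assumes the project root (for example, the folder holding the .Rproj file) contains an Input/ folder with benin_tb.csv in it, which may not match your layout.

```r
library(here)

# Assumption: the project root contains an Input/ folder holding benin_tb.csv.
csv_path <- here("Input", "benin_tb.csv")

# Guard the read so the sketch does not fail where the file is missing.
if (file.exists(csv_path)) {
  tb_outcomes <- read.csv(csv_path)
  head(tb_outcomes)
} else {
  message("File not found: ", csv_path)
}
```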

Data layers

  1. Time Frame Tracking (period and period_date): Quarterly records from 2015Q1 to 2017Q4.

  2. Health Facility Identifier (hospital):

    • Data from St Jean De Dieu, CHPP Akron, CS Abomey-Calavi, Hospital Bethesda, Hospital Savalou, Hospital St Luc.

  3. Treatment Outcome Categories (outcome):

    1. completed: Treatment finished, outcome marked as completed.

    2. cured: Treatment succeeded with sputum smear confirmation.

    3. died: Patient died during treatment.

    4. failed: Treatment did not succeed.

    5. lost: Patient lost to follow-up (this category appears in the data).

    6. unevaluated: Treatment outcome not determined.

  4. Diagnosis Categorization (diagnosis_type):

    1. Bacteriological: Diagnosis confirmed by bacteriological tests.

    2. Clinical: Diagnosis based on clinical symptoms, without bacteriological confirmation.

  5. Case Counts (cases): Quantifies the number of TB cases starting treatment.
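
These layers can be checked directly in R. The sketch below uses a few made-up rows with the same column names as tb_outcomes (the values are illustrative, not taken from the Benin data), converts period_date from text to a Date, and tallies the categories; tb_demo, n_periods and outcome_counts are hypothetical names.

```r
library(dplyr)

# Toy rows with the same columns as tb_outcomes (made-up values, not the Benin data).
tb_demo <- data.frame(
  period = c("2015Q1", "2015Q1", "2015Q2", "2015Q2"),
  period_date = c("2015-01-01", "2015-01-01", "2015-04-01", "2015-04-01"),
  hospital = c("St Jean De Dieu", "CHPP Akron", "St Jean De Dieu", "CHPP Akron"),
  outcome = c("cured", "died", "cured", "completed"),
  cases = c(11L, 1L, 9L, 4L),
  diagnosis_type = c("bacteriological", "clinical", "bacteriological", "clinical")
)

# read.csv() imports period_date as text; convert it to a Date for time-based work.
tb_demo <- tb_demo %>%
  mutate(period_date = as.Date(period_date))

# Count the distinct quarters and the number of rows per outcome category.
n_periods <- n_distinct(tb_demo$period)
outcome_counts <- tb_demo %>%
  count(outcome)
```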

Charts used for visualizing comparisons

A selection of chart types for making comparisons between groups of data.

Bar charts

  • Bar Chart Advantage: Ideal for displaying counts and making categorical comparisons.

  • Optimal Usage:

    • Effective for ordinal categories or time-based data.

    • Best when data is grouped into distinct categories.

  • Comparison Tool: Bar charts make differences between groups easy to see.

  • {ggplot2} Implementation: Use geom_col() to plot a categorical variable against a numerical one.

Visualizing the number of cases per treatment outcome in the tb_outcomes dataset

# Basic bar plot example 1: Frequency of treatment outcomes
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = outcome,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per treatment outcome")
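
Because tb_outcomes holds many rows per outcome, geom_col() stacks them by default, so each bar's height is the total for that category. The sketch below makes that aggregation explicit before plotting. It runs on made-up toy rows, and toy, toy_totals and p_totals are illustrative names, not part of the original lesson.

```r
library(dplyr)
library(ggplot2)

# Toy data: several rows per outcome (made-up values).
toy <- data.frame(
  outcome = c("cured", "cured", "died", "died", "failed"),
  cases = c(11, 7, 1, 2, 0)
)

# Sum the cases per outcome so each category has exactly one row.
toy_totals <- toy %>%
  group_by(outcome) %>%
  summarise(total_cases = sum(cases), .groups = "drop")

# One bar per outcome, with height equal to the summed cases.
p_totals <- ggplot(toy_totals, aes(x = outcome, y = total_cases)) +
  geom_col() +
  labs(title = "Number of cases per treatment outcome (toy data)")
```

Pre-aggregating also gives you a small table of totals that can be printed or reused for labels.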

# Basic bar plot example 2: Number of cases per hospital
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = hospital,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per Hospital")

  • Horizontal Bar Plot Creation: Use coord_flip() to turn a vertical bar chart into a horizontal one.

  • Enhanced Category Visualization: Horizontal orientation gives long category labels, such as hospital names, room to be read without overlapping.

# Basic bar plot example 3: Horizontal bars with coord_flip()
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = hospital,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per Hospital") +
  coord_flip()
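
As an alternative to coord_flip(), recent versions of {ggplot2} (3.3.0 and later, to the best of our knowledge) let you map the category to y directly. The sketch below also ranks the bars with reorder(); the toy hospital names and values are made up, and p_horizontal is an illustrative name.

```r
library(dplyr)
library(ggplot2)

# Toy totals per hospital (made-up values).
toy <- data.frame(
  hospital = c("Hospital A", "Hospital B", "Hospital C"),
  cases = c(40, 95, 12)
)

# reorder() sorts the hospital factor by cases so the bars appear ranked.
toy <- toy %>%
  mutate(hospital = reorder(hospital, cases))

# Mapping the category to y draws horizontal bars without coord_flip().
p_horizontal <- ggplot(toy, aes(x = cases, y = hospital)) +
  geom_col() +
  labs(title = "Number of cases per hospital (toy data)")
```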

Stacked bar charts

  • Stacked Bar Charts: Introduce a second categorical variable for deeper insight.

  • ggplot() Customization: Map the second variable to the fill aesthetic to differentiate categories within the bars.

# Stacked bar plot:
tb_outcomes %>% 
  ggplot(
    # Fill colour of bars by the "outcome" variable
    aes(x = hospital,
        y = cases,
        fill = outcome)) +
  geom_col() +
  coord_flip()

Grouped bar charts

  • Grouped bar plots provide a side-by-side representation of subgroups within each main category.

  • We can set the position argument to "dodge" in geom_col() to display bars side by side:

# Grouped bar plot:
tb_outcomes %>% 
  ggplot(
    aes(x = hospital,
        y = cases,
        fill = outcome)) +
  # Add position argument for side-by-side bars
  geom_col(position = "dodge") +
  coord_flip()
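
One caveat, as far as we understand position = "dodge": rows that share the same hospital and outcome are drawn on top of one another rather than added together, so with un-aggregated data the visible bar may reflect the largest single record instead of the total. A hedged sketch of summing first, using made-up toy rows (toy, toy_grouped and p_grouped are illustrative names):

```r
library(dplyr)
library(ggplot2)

# Toy data with repeated hospital/outcome combinations (made-up values).
toy <- data.frame(
  hospital = c("Hospital A", "Hospital A", "Hospital A", "Hospital B", "Hospital B"),
  outcome = c("cured", "cured", "died", "cured", "died"),
  cases = c(11, 7, 2, 5, 1)
)

# Collapse to one row per hospital and outcome before dodging.
toy_grouped <- toy %>%
  group_by(hospital, outcome) %>%
  summarise(total_cases = sum(cases), .groups = "drop")

# Each dodged bar now represents the summed cases for its subgroup.
p_grouped <- ggplot(toy_grouped,
                    aes(x = hospital, y = total_cases, fill = outcome)) +
  geom_col(position = "dodge") +
  coord_flip()
```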

Adding error bars

  • Showcasing data variability or uncertainty is done effectively with error bars.

  • Error bars help illustrate the reliability of mean scores or the precision of data points.

  • In {ggplot2}, error bars are added with the geom_errorbar() function.

  • The error range is typically defined by standard deviation, standard error, or confidence intervals.

  • Error bars need summary statistics, such as the mean and standard deviation, which must be calculated before plotting.

  • By integrating error bars into our grouped bar plots, we gain a more nuanced understanding of the data.

We first create the summary data, since error bars need some measure of error. Here we compute the standard deviation of the case counts within each group.

hosp_dx_error <- 
  tb_outcomes %>% 
  group_by(period_date,diagnosis_type) %>% 
  summarise(
    total_case = sum(cases, na.rm = TRUE),
    error = sd(cases, na.rm = TRUE)
  )
`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by period_date and diagnosis_type.
ℹ Output is grouped by period_date.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(period_date, diagnosis_type))` for per-operation
  grouping (`?dplyr::dplyr_by`) instead.
hosp_dx_error
# A tibble: 24 × 4
# Groups:   period_date [12]
   period_date diagnosis_type  total_case error
   <chr>       <chr>                <int> <dbl>
 1 2015-01-01  bacteriological        143 11.9 
 2 2015-01-01  clinical                47  4.40
 3 2015-04-01  bacteriological        163 13.0 
 4 2015-04-01  clinical                35  3.84
 5 2015-07-01  bacteriological        146 11.2 
 6 2015-07-01  clinical                34  3.33
 7 2015-10-01  bacteriological        152 10.4 
 8 2015-10-01  clinical                43  3.55
 9 2016-01-01  bacteriological        201 15.7 
10 2016-01-01  clinical                71  7.01
# ℹ 14 more rows

# Recreate grouped bar chart and add error bars
hosp_dx_error %>% 
  ggplot(
    aes(x = period_date,
        y = total_case,
        fill = diagnosis_type)) +
  geom_col(position = "dodge") + # Dodge the bars
  # geom_errorbar() adds error bars
  geom_errorbar(
    # Specify upper and lower limits of the error bars
    aes(ymin = total_case - error,
        ymax = total_case + error),
    # Dodge the error bars to align them with the side-by-side bars
    position = position_dodge(width = 0.9),
    width = 0.25
  )
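
Note that the chart above pairs a sum (total_case) with the standard deviation of the individual records, which are on different scales. An alternative sketch, not the author's method, plots the mean number of cases per record together with its standard error (standard deviation divided by the square root of the number of records). The toy values and the names toy_summary, mean_cases, sd_cases, n_records and se are made up for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data: three records per diagnosis type (made-up values).
toy <- data.frame(
  diagnosis_type = rep(c("bacteriological", "clinical"), each = 3),
  cases = c(10, 12, 14, 3, 5, 7)
)

# Mean, standard deviation, and standard error per diagnosis type.
toy_summary <- toy %>%
  group_by(diagnosis_type) %>%
  summarise(
    mean_cases = mean(cases),
    sd_cases = sd(cases),
    n_records = n(),
    se = sd_cases / sqrt(n_records),
    .groups = "drop"
  )

# Bars show the mean; error bars span one standard error either side.
p_error <- ggplot(toy_summary,
                  aes(x = diagnosis_type, y = mean_cases, fill = diagnosis_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_cases - se, ymax = mean_cases + se),
                width = 0.25)
```

Whichever statistic you choose, state it in the caption so readers know what the error bars represent.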

Visualizing comparison with normalized bar charts, pie charts, and donut charts

  • Let’s explore how compositions illustrate the contributions of individual parts to a whole.

  • Consider using dedicated composition chart types that better represent these relationships.

  • We’ll focus on part-to-whole charts to highlight how each piece fits into the overall picture.

A selection of chart types for visualizing part-to-whole relationships.

Percent-stacked bar chart

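  • To show a composition, we need to identify the parts and the whole they make up.

  • Stacked bar charts, discussed earlier, are a reasonable starting point for visualizing these relationships. Setting position = "fill" in geom_col() rescales every bar to the same height, so each segment shows a proportion rather than a count.

# Percent-stacked bar plot: position = "fill" normalizes each bar to 100%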
tb_outcomes %>% 
  ggplot(aes(x = hospital,
             y = cases,
             fill = outcome)) +
  geom_col(position = "fill")
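
The same normalization can be done by hand, which makes the proportions available for labels or tables. A minimal sketch on made-up toy rows, computing each outcome's share within its hospital and formatting the axis with scales::percent (toy, toy_share and p_percent are illustrative names):

```r
library(dplyr)
library(ggplot2)

# Toy counts per hospital and outcome (made-up values).
toy <- data.frame(
  hospital = c("Hospital A", "Hospital A", "Hospital B", "Hospital B"),
  outcome = c("cured", "died", "cured", "died"),
  cases = c(30, 10, 45, 5)
)

# Share of each outcome within its hospital; shares sum to 1 per hospital.
toy_share <- toy %>%
  group_by(hospital) %>%
  mutate(share = cases / sum(cases)) %>%
  ungroup()

# Stack the shares and label the y axis as percentages.
p_percent <- ggplot(toy_share, aes(x = hospital, y = share, fill = outcome)) +
  geom_col() +
  scale_y_continuous(labels = scales::percent)
```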

Circular plots: Pie and Donut charts

Used with care, pie and donut charts can present a clear snapshot of data proportions.

  • We start by aggregating data to tally total counts for each treatment outcome category.

  • This step ensures each segment of our dataset is clearly represented for visualization.

outcome_totals <- 
  tb_outcomes %>% 
  group_by(outcome) %>% 
  summarise(total_cases = sum(cases, na.rm = TRUE))
outcome_totals
# A tibble: 6 × 2
  outcome     total_cases
  <chr>             <int>
1 completed           573
2 cured              1506
3 died                130
4 failed               30
5 lost                 87
6 unevaluated          15

A pie chart is basically a round version of a single 100% stacked bar.

# Single-bar chart (precursor to pie chart)
ggplot(outcome_totals,
       aes(x = 4, # Set arbitrary x value
           y = total_cases,
           fill = outcome)) +
  geom_col()

# Pie chart: the same single bar drawn in polar coordinates
ggplot(outcome_totals,
       aes(x = 4, # Set arbitrary x value
           y = total_cases,
           fill = outcome)) +
  geom_col() +
  coord_polar(theta = "y") # Map the y axis to the angle, making the bar circular
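
A donut chart can be sketched from the same pie by widening the x range so that empty space opens up in the centre. The totals below are copied from the outcome_totals output above; the xlim() values and theme_void() are illustrative choices, not the author's settings, and donut is a hypothetical name.

```r
library(ggplot2)

# Totals copied from the outcome_totals table shown earlier.
outcome_totals <- data.frame(
  outcome = c("completed", "cured", "died", "failed", "lost", "unevaluated"),
  total_cases = c(573, 1506, 130, 30, 87, 15)
)

donut <- ggplot(outcome_totals,
                aes(x = 4, y = total_cases, fill = outcome)) +
  geom_col() +
  coord_polar(theta = "y") +
  # The bar sits around x = 4; starting the axis near 0 leaves a hole in the middle.
  xlim(c(0.2, 4.5)) +
  # Drop axes and grid lines, which carry no meaning in a circular chart.
  theme_void()
```

Moving the lower xlim() value closer to 4 makes the ring thinner; moving it toward 0 makes the hole smaller.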

Watch out!

  • Approach pie charts with caution: they are visually appealing but can mislead.

  • Bar plots often allow more precise readings of the data.

  • Our brains compare lengths (like bars) more accurately than angles or areas (like pie slices).

  • As the number of categories grows, pie charts become crowded and lose clarity.

  • Pie charts cannot effectively display changes over time, whereas bar plots can.

  • Use pie charts sparingly, keeping in mind their potential to obscure the data’s true story.
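
To illustrate the comparison, the same outcome totals can be drawn as a ranked horizontal bar chart, where small categories such as failed and unevaluated remain distinguishable. A minimal sketch, with totals copied from the outcome_totals output above; outcome_share, share and p_ranked are illustrative names.

```r
library(dplyr)
library(ggplot2)

# Totals copied from the outcome_totals table shown earlier.
outcome_totals <- data.frame(
  outcome = c("completed", "cured", "died", "failed", "lost", "unevaluated"),
  total_cases = c(573, 1506, 130, 30, 87, 15)
)

# Add each outcome's share of all cases and rank the categories by size.
outcome_share <- outcome_totals %>%
  mutate(share = total_cases / sum(total_cases),
         outcome = reorder(outcome, total_cases))

# Horizontal bars, ordered from the smallest to the largest category.
p_ranked <- ggplot(outcome_share, aes(x = total_cases, y = outcome)) +
  geom_col() +
  labs(title = "Total cases per treatment outcome")
```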

Wrap Up!

  • We’ve worked through this {ggplot2} lesson to improve visual comparisons and compositions using bar charts and pie charts.

  • Bar charts were our starting point, highlighting their strength in comparing categories and customization strategies in ggplot2.

  • We then explored pie charts, focusing on their ability to display compositions.

  • Using the tb_outcomes dataset, we applied these techniques to real-world public health data, emphasizing TB treatment outcomes in Benin.

  • Our practical exercises included transforming bar plots to 100% stacked bars and creating pie charts with geom_col() and coord_polar().

  • The lesson aimed to empower you with the knowledge to choose the right chart type and {ggplot2} tools to effectively present your data.