Data Reporting with R: Best Practices from Tables to Time Series

Author

Kaburungo

Bar Charts & Pie Charts

Load packages

In this project we will use the following packages:

{tidyverse} for data wrangling and data visualization
{here} for project-relative file paths
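
The loading code itself is not shown in this section. A minimal sketch, assuming both packages are already installed (for example with install.packages()), could be:

```r
# Attach the packages used in this lesson
library(tidyverse)  # attaches ggplot2, dplyr, readr and other core packages
library(here)       # builds file paths relative to the project root
```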

Data: TB treatment outcomes in Benin

Public Health Data Analysis: Compare subgroup metrics and understand how each subgroup contributes to the total.
Focus on Benin’s TB Data: Investigate WHO-provided sub-national TB data from a DHIS2 dashboard.
Data composition: Includes new and relapse TB cases started on treatment.
Disaggregation categories: Data broken down by time period, health facilities, treatment outcome and diagnosis type.

Import the tb_outcomes data subset.

# Import data from csv
tb_outcomes <- 
  read.csv("C:/Users/Perminus Njiru/OneDrive - LVCT Health/Desktop/Freecodecamp/R/Care_and_Treatment_Analysis/Input/benin_tb.csv")
head(tb_outcomes)

  period period_date        hospital     outcome cases  diagnosis_type
1 2015Q4  2015-10-01 St Jean De Dieu      failed     0 bacteriological
2 2015Q4  2015-10-01 St Jean De Dieu unevaluated     0 bacteriological
3 2015Q4  2015-10-01 St Jean De Dieu        died     0 bacteriological
4 2015Q4  2015-10-01 St Jean De Dieu        lost     0 bacteriological
5 2015Q4  2015-10-01 St Jean De Dieu   completed     0 bacteriological
6 2015Q4  2015-10-01 St Jean De Dieu       cured    11 bacteriological
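
The import above uses an absolute path, which only resolves on the machine it was written on. Since {here} is listed among the packages, a project-relative version might look like the sketch below. It assumes the project root is the folder that contains Input/; that layout is an assumption, so adjust the path pieces to your own project.

```r
library(here)

# Assumption: the project root is the folder that contains "Input/"
tb_path <- here("Input", "benin_tb.csv")

# Read the file only if it exists, so the sketch is safe to run anywhere
if (file.exists(tb_path)) {
  tb_outcomes <- read.csv(tb_path)
  print(head(tb_outcomes))
}
```

Because here() builds the path from the project root, the same script keeps working when the project folder is moved or shared.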

Data layers

Time Frame Tracking (period and period_date): Quarterly records from 2015Q1 to 2017Q4.
Health Facility Identifier (hospital):
- Data from St Jean De Dieu, CHPP Akron, CS Abomey-Calavi, Hospital Bethesda, Hospital Savalou, Hospital St Luc.
Treatment Outcome Categories (outcome):
1. completed: Treatment finished, outcome marked as completed.
2. cured: Treatment succeeded, with sputum smear confirmation.
3. died: Patient died during treatment.
4. failed: Treatment did not succeed.
5. lost: Patient lost to follow-up.
6. unevaluated: Treatment outcome not determined.
Diagnosis Categorization (diagnosis_type):
1. bacteriological: Diagnosis confirmed by bacteriological tests.
2. clinical: Diagnosis based on clinical symptoms, without bacteriological confirmation.
Case Counts (cases): The number of TB cases starting treatment.
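
A quick way to check these layers against the data is to list the distinct values of each column. The sketch below re-enters the six rows printed by head(tb_outcomes) so that it runs without the CSV file; the object names tb_head, outcome_levels, hospital_levels and cases_by_outcome are introduced here for illustration only.

```r
library(dplyr)

# The six rows printed by head(tb_outcomes), re-entered by hand so this
# sketch runs without the CSV file
tb_head <- tibble::tribble(
  ~period,  ~period_date, ~hospital,         ~outcome,      ~cases, ~diagnosis_type,
  "2015Q4", "2015-10-01", "St Jean De Dieu", "failed",          0L, "bacteriological",
  "2015Q4", "2015-10-01", "St Jean De Dieu", "unevaluated",     0L, "bacteriological",
  "2015Q4", "2015-10-01", "St Jean De Dieu", "died",            0L, "bacteriological",
  "2015Q4", "2015-10-01", "St Jean De Dieu", "lost",            0L, "bacteriological",
  "2015Q4", "2015-10-01", "St Jean De Dieu", "completed",       0L, "bacteriological",
  "2015Q4", "2015-10-01", "St Jean De Dieu", "cured",          11L, "bacteriological"
)

# Distinct values of two of the categorical columns
outcome_levels  <- tb_head %>% distinct(outcome)
hospital_levels <- tb_head %>% distinct(hospital)

# Total cases per outcome: count() with wt sums the weights per group
cases_by_outcome <- tb_head %>%
  count(outcome, wt = cases, name = "total_cases")
cases_by_outcome
```

Run on the full tb_outcomes data frame, the same calls should list every outcome and hospital described above.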

Charts used for visualizing Comparisons

A selection of chart types for making comparisons between groups of data.

Bar charts

Bar Charts Advantage: Ideal for displaying counts and making categorical comparisons.
Optimal Usage:
- Effective for ordinal categories or time-based data.
- Best when data is grouped into distinct categories.
Comparison Tool: Bar charts clearly illustrate comparisons among groups.
{ggplot2} Implementation: Use geom_col() to plot a categorical variable against a numerical one.
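
The choice of geom_col() over geom_bar() matters here because the cases column already holds counts. A short sketch of the difference, using only the outcome and cases values from the six rows that head(tb_outcomes) printed earlier (tb_head, p_col and p_bar are names made up for this example):

```r
library(ggplot2)

# outcome and cases from the six rows shown by head(tb_outcomes)
tb_head <- data.frame(
  outcome = c("failed", "unevaluated", "died", "lost", "completed", "cured"),
  cases   = c(0, 0, 0, 0, 0, 11)
)

# geom_col(): bar height is taken from the y variable (cases)
p_col <- ggplot(tb_head, aes(x = outcome, y = cases)) +
  geom_col()

# geom_bar(): bar height is the number of rows per category,
# which is not what we want when counts are already stored in a column
p_bar <- ggplot(tb_head, aes(x = outcome)) +
  geom_bar()

p_col
p_bar
```

In the first plot only the cured bar has any height; in the second every bar has height 1, because each outcome appears in exactly one row.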

Visualizing the number of cases per treatment outcome in the tb_outcomes dataset

# Basic bar plot example 1: Frequency of treatment outcomes
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = outcome,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per treatment outcome")

# Basic bar plot example 2: Number of cases per hospital
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = hospital,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per Hospital")

Horizontal Bar Plot Creation: Use coord_flip() to turn a vertical bar chart into a horizontal one.
Enhanced Category Visualization: Horizontal orientation can improve readability of categories.

# Basic bar plot example 3: Horizontal bars, number of cases per hospital
tb_outcomes %>% 
  #Pass the data to ggplot as a basis for creating the visualization
  ggplot(
    #specify the x and y axis variables
    aes(x = hospital,
        y = cases)) +
  # geom_col() creates a bar plot
  geom_col() +
  labs(title = "Number of cases per Hospital") +
  coord_flip()
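
In ggplot2 3.3.0 and later, the same horizontal layout can also be obtained without coord_flip() by mapping the categorical variable to y. The sketch below uses the per-outcome totals printed later in this lesson, typed in by hand so the example is self-contained; p_horizontal is a name introduced for illustration.

```r
library(ggplot2)

# Per-outcome totals as printed later in the lesson, entered by hand
outcome_totals <- data.frame(
  outcome     = c("completed", "cured", "died", "failed", "lost", "unevaluated"),
  total_cases = c(573, 1506, 130, 30, 87, 15)
)

# Categorical variable on y, numeric variable on x: bars are drawn horizontally
p_horizontal <- ggplot(outcome_totals, aes(x = total_cases, y = outcome)) +
  geom_col() +
  labs(title = "Number of cases per treatment outcome")

p_horizontal
```

Mapping the axes directly keeps x and y meaning what they say in later layers such as labs() or scale functions, which can be easier to reason about than a flipped coordinate system.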

Stacked bar charts

Stacked Bar Charts: Introduce a second categorical variable for deeper insight.
ggplot() Customization: Map the second variable to the fill aesthetic to differentiate categories within the bars.

# Stacked bar plot:
tb_outcomes %>% 
  ggplot(
    # Fill colour of bars by the "outcome" variable
    aes(x = hospital,
        y = cases,
        fill = outcome)) +
  geom_col() +
  coord_flip()

Grouped bar charts

Grouped bar plots provide a side-by-side representation of subgroups within each main category.
We can set the position argument to "dodge" in geom_col() to display bars side by side:

# Grouped bar plot:
tb_outcomes %>% 
  ggplot(
    aes(x = hospital,
        y = cases,
        fill = outcome)) +
  # Add position argument for side-by-side bars
  geom_col(position = "dodge") +
  coord_flip()

Adding Error Bars

Showcasing data variability or uncertainty is done effectively with error bars.
Error bars help illustrate the reliability of mean scores or the precision of data points.
In {ggplot2}, error bars are added with the geom_errorbar() function.
The error range is typically defined by the standard deviation, the standard error, or a confidence interval.
Summary statistics such as the mean and standard deviation must be calculated before error bars can be drawn.
By integrating error bars into our grouped bar plots, we gain a more nuanced understanding of the data.

First we create the summary data, because error bars need some measure of spread. Here we compute the standard deviation of cases within each period and diagnosis type.

hosp_dx_error <- 
  tb_outcomes %>% 
  group_by(period_date, diagnosis_type) %>% 
  summarise(
    total_case = sum(cases, na.rm = TRUE),
    error = sd(cases, na.rm = TRUE)
  )

`summarise()` has regrouped the output.
ℹ Summaries were computed grouped by period_date and diagnosis_type.
ℹ Output is grouped by period_date.
ℹ Use `summarise(.groups = "drop_last")` to silence this message.
ℹ Use `summarise(.by = c(period_date, diagnosis_type))` for per-operation
  grouping (`?dplyr::dplyr_by`) instead.

hosp_dx_error

# A tibble: 24 × 4
# Groups:   period_date [12]
   period_date diagnosis_type  total_case error
   <chr>       <chr>                <int> <dbl>
 1 2015-01-01  bacteriological        143 11.9 
 2 2015-01-01  clinical                47  4.40
 3 2015-04-01  bacteriological        163 13.0 
 4 2015-04-01  clinical                35  3.84
 5 2015-07-01  bacteriological        146 11.2 
 6 2015-07-01  clinical                34  3.33
 7 2015-10-01  bacteriological        152 10.4 
 8 2015-10-01  clinical                43  3.55
 9 2016-01-01  bacteriological        201 15.7 
10 2016-01-01  clinical                71  7.01
# ℹ 14 more rows

# Recreate grouped bar chart and add error bars
hosp_dx_error %>% 
  ggplot(
    aes(x = period_date,
        y = total_case,
        fill = diagnosis_type)) +
  geom_col(position = "dodge") + # Dodge the bars
  # geom_errorbar() adds error bars
  geom_errorbar(
    # Specify upper and lower limits of the error bars
    aes(ymin = total_case - error,
        ymax = total_case + error),
    # Dodge the error bars to align them with the side-by-side bars
    position = position_dodge(width = 0.9),
    width = 0.25
  )
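
Note that the chart above pairs a sum (total_case) with the standard deviation of individual rows, so the error bars describe the spread of single records rather than the uncertainty of the total. A more conventional pairing is a mean with its standard error. The sketch below illustrates that pairing on a small made-up data frame; the numbers are purely illustrative and are not taken from the Benin dataset, and the names toy_cases, toy_summary and p_se are hypothetical.

```r
library(dplyr)
library(ggplot2)

# Made-up illustrative values (NOT from the Benin data)
toy_cases <- tibble::tibble(
  diagnosis_type = rep(c("bacteriological", "clinical"), each = 4),
  cases          = c(10, 12, 14, 16, 2, 4, 4, 6)
)

# Mean and standard error of the mean per group
toy_summary <- toy_cases %>%
  group_by(diagnosis_type) %>%
  summarise(
    mean_cases = mean(cases, na.rm = TRUE),
    se         = sd(cases, na.rm = TRUE) / sqrt(sum(!is.na(cases))),
    .groups    = "drop"
  )

# Bars show the mean; error bars span mean +/- one standard error
p_se <- ggplot(toy_summary,
               aes(x = diagnosis_type, y = mean_cases, fill = diagnosis_type)) +
  geom_col() +
  geom_errorbar(aes(ymin = mean_cases - se, ymax = mean_cases + se),
                width = 0.25)

p_se
```

Setting .groups = "drop" returns an ungrouped tibble, which also avoids the regrouping message shown earlier.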

Visualizing comparison with normalized bar charts, pie charts, and donut charts

Let’s explore how compositions illustrate the contributions of individual parts to a whole.
Consider using dedicated composition chart types that better represent these relationships.
We’ll focus on part-to-whole charts to highlight how each piece fits into the overall picture.

A selection of chart types for visualizing part-to-whole relationship.

Percent-stacked bar chart

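To show a composition, we need to identify the parts and the whole they make up.
Stacked bar charts, discussed earlier, are a reasonable starting point for visualizing these relationships. Setting position = "fill" in geom_col() rescales every bar to the same height, so each segment shows a share of the whole rather than a count.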

# Percent-stacked bar plot: position = "fill" rescales each bar to 100%
tb_outcomes %>% 
  ggplot(aes(x = hospital,
             y = cases,
             fill = outcome)) +
  geom_col(position = "fill")
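
With position = "fill" the y axis runs from 0 to 1. If percentages read better, one option is scales::label_percent() from the {scales} package, which is installed alongside {ggplot2}. A sketch on the six rows shown by head(tb_outcomes), which cover a single hospital (tb_head and p_percent are names introduced for this example):

```r
library(ggplot2)

# hospital, outcome and cases from the six rows shown by head(tb_outcomes)
tb_head <- data.frame(
  hospital = "St Jean De Dieu",
  outcome  = c("failed", "unevaluated", "died", "lost", "completed", "cured"),
  cases    = c(0, 0, 0, 0, 0, 11)
)

p_percent <- ggplot(tb_head, aes(x = hospital, y = cases, fill = outcome)) +
  geom_col(position = "fill") +
  # Show the 0-1 proportion axis as 0%-100%
  scale_y_continuous(labels = scales::label_percent()) +
  labs(y = "Share of cases")

p_percent
```

On the full dataset the same two extra lines can be appended to the percent-stacked plot of all hospitals.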

Circular plots: Pie and Donut charts

Used with care, pie and donut charts can give a clear snapshot of data proportions.

We start by aggregating the data to get the total case count for each treatment outcome category.
This step gives one row per outcome, which becomes one segment in the chart.

outcome_totals <- 
  tb_outcomes %>% 
  group_by(outcome) %>% 
  summarise(total_cases = sum(cases, na.rm = TRUE))
outcome_totals

# A tibble: 6 × 2
  outcome     total_cases
  <chr>             <int>
1 completed           573
2 cured              1506
3 died                130
4 failed               30
5 lost                 87
6 unevaluated          15

A pie chart is basically a round version of a single 100% stacked bar.

# Single-bar chart (precursor to pie chart)
ggplot(outcome_totals,
       aes(x = 4, # Set arbitrary x value
           y = total_cases,
           fill = outcome)) +
  geom_col()

# Pie chart: the same single bar drawn in polar coordinates
ggplot(outcome_totals,
       aes(x = 4, # Set arbitrary x value
           y = total_cases,
           fill = outcome)) +
  geom_col() +
  coord_polar(theta = "y") # Map y to the angle, turning the bar into a circle
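
The section heading also mentions donut charts. One common way to draw one, sketched here on the assumption that the pie is built as above, is to extend the x range below the bar so that the centre of the circle stays empty. The lower limit of 0.2 is an arbitrary choice that controls the size of the hole, and p_donut is a name introduced for illustration. The totals are re-entered from the outcome_totals output above.

```r
library(ggplot2)

# Totals re-entered from the outcome_totals output
outcome_totals <- data.frame(
  outcome     = c("completed", "cured", "died", "failed", "lost", "unevaluated"),
  total_cases = c(573, 1506, 130, 30, 87, 15)
)

p_donut <- ggplot(outcome_totals,
                  aes(x = 4, y = total_cases, fill = outcome)) +
  geom_col() +
  coord_polar(theta = "y") +
  # Extend the x axis below the bar so the centre of the circle stays empty
  xlim(c(0.2, 4.5)) +
  # Drop axes and grid lines, which carry no meaning in a circular plot
  theme_void()

p_donut
```

Raising the lower x limit towards the bar makes the hole larger; lowering it makes the ring thicker.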

Watch out!

Pie charts are visually appealing but can mislead, so treat them with caution.
Bar plots usually allow a more precise reading of the data.
People judge lengths (bars) more accurately than angles or areas (pie slices).
As the number of categories grows, pie charts become crowded and lose clarity.
Pie charts cannot show change over time effectively, whereas bar plots can.
Use pie charts sparingly, and check that they do not obscure what the data show.

Wrap Up!

We’ve worked through this {ggplot2} lesson to improve visual comparisons and compositions using bar charts and pie charts.
Bar charts were our starting point, highlighting their strength in comparing categories and customization strategies in ggplot2.
We then explored pie charts, focusing on their ability to display compositions.
Using the tb_outcomes dataset, we applied these techniques to real-world public health data, emphasizing TB treatment outcomes in Benin.
Our practical exercises included transforming bar plots to 100% stacked bars and creating pie charts with geom_col() and coord_polar().
The lesson aimed to empower you with the knowledge to choose the right chart type and {ggplot2} tools to effectively present your data.