• Data is Not Neutral
    • Setup
  • Collect Demographics Data
  • Transform Data
  • Visualize Data
  • Prop Test

Data is Not Neutral

This R Markdown document is intended to accompany the talk from Paulette McRae, PhD, and Joy Payton, MS, entitled “Data is Not Neutral.”

Let’s boil it down to: human subjects research is characterized by unrepresentative research talent and unrepresentative research subject cohorts.

How can you visualize and quantify these challenges?

Setup

library(tibble)
library(tidyr)
library(dplyr)

Collect Demographics Data

For example:

  • Use the Census Bureau’s American Community Survey to collect population demographics for your county, city, state, etc. (Philadelphia Data)
  • Get clinical counts for your hospital
  • Ask team members to anonymously and optionally self-identify in a survey
  • Obtain department / institutional data from HR

Note: the data below marked “Hospital Patients” and “Department Employees” are fabricated to protect institutional data. But they’re not unrealistic!

race_data <- tibble(group = c("Philadelphia County",
                              "Hospital Patients",
                              "Department Employees",
                              "Webinar Speakers"),
               white_only = c(709661,
                              524000,
                              600,
                              82),
               black_only = c(690652,
                              230000,
                              120,
                              0),
               asian_only = c(123557,
                              46000,
                              190,
                              6),
               everyone_else = c(60194,
                                 244000,
                                 90,
                                 12))

Transform Data

Get percentages. We use [-1] to remove the group label and just leave numeric values.

race_pct <- race_data
race_pct[,-1] <- prop.table(as.matrix(race_pct[,-1]), 1) * 100
race_pct
group white_only black_only asian_only everyone_else
Philadelphia County 44.80002 43.60001 7.800001 3.799973
Hospital Patients 50.19157 22.03065 4.406130 23.371648
Department Employees 60.00000 12.00000 19.000000 9.000000
Webinar Speakers 82.00000 0.00000 6.000000 12.000000

Reshape the data to key-value pairs (long, instead of wide).

race_long <- pivot_longer(race_pct, c("white_only", "black_only", "asian_only",
                                       "everyone_else"))

Refactor so that the visualization doesn’t rely on alphabetical order but rather a logical order – in our case, putting “Everyone Else” at the end.

Also, give better factor labels.

race_long$name <- factor(race_long$name, levels = c("asian_only", "black_only", "white_only", "everyone_else"), labels = c("Asian Only", "Black Only", "White Only", "Everyone Else"))

race_long$group <- factor(race_long$group, levels = c("Philadelphia County",
                                                      "Hospital Patients",
                                                      "Department Employees",
                                                      "Webinar Speakers"))

Visualize Data

library(ggplot2)
  
ggplot(race_long, 
       aes(x=name, y=value, fill=group)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Race Comparison", 
       x = "Race", 
       y = "Percent of Population") +
  scale_fill_manual(values = c("#3c941f", "#054C70","#05C3DE",
                                 "#8ff0ff"))

Prop Test

It sure looks like there is major difference between Philadelphia County data and our (fabricated) Hospital and Department data and an even more startling difference between all of those and (the sadly not fabricated) Webinar Speakers group.

Are these differences statistically significant? We’ll use chi squared. We use [-1] to remove the group label and just leave numeric values.

chisq <- chisq.test(race_data[,-1])
chisq
## 
##  Pearson's Chi-squared test
## 
## data:  race_data[, -1]
## X-squared = 307479, df = 9, p-value < 2.2e-16

Sure looks like overall there are major disparities.

We can also do pairwise chi-squared:

For example, we could do the third and fourth row (department vs webinar speakers):

chisq.test(race_data[3:4,-1])
## 
##  Pearson's Chi-squared test
## 
## data:  race_data[3:4, -1]
## X-squared = 28.602, df = 3, p-value = 2.715e-06