Data is Not Neutral

Data is Not Neutral

Setup

Collect Demographics Data

Transform Data

Visualize Data

Prop Test

Data is Not Neutral

This R Markdown document is intended to accompany the talk from Paulette McRae, PhD, and Joy Payton, MS, entitled “Data is Not Neutral.”

Let’s boil it down to: human subjects research is characterized by unrepresentative research talent and unrepresentative research subject cohorts.

How can you visualize and quantify these challenges?

Setup

library(tibble)
library(tidyr)
library(dplyr)

Collect Demographics Data

For example:

Use the Census Bureau’s American Community Survey to collect population demographics for your county, city, state, etc. (Philadelphia Data)
Get clinical counts for your hospital
Ask team members to anonymously and optionally self-identify in a survey
Obtain department / institutional data from HR

Note: the data below marked “Hospital Patients” and “Department Employees” are fabricated to protect institutional data. But they’re not unrealistic!

race_data <- tibble(group = c("Philadelphia County",
                              "Hospital Patients",
                              "Department Employees",
                              "Webinar Speakers"),
               white_only = c(709661,
                              524000,
                              600,
                              82),
               black_only = c(690652,
                              230000,
                              120,
                              0),
               asian_only = c(123557,
                              46000,
                              190,
                              6),
               everyone_else = c(60194,
                                 244000,
                                 90,
                                 12))

Transform Data

Get percentages. We use [-1] to remove the group label and just leave numeric values.

race_pct <- race_data
race_pct[,-1] <- prop.table(as.matrix(race_pct[,-1]), 1) * 100
race_pct

group	white_only	black_only	asian_only	everyone_else
Philadelphia County	44.80002	43.60001	7.800001	3.799973
Hospital Patients	50.19157	22.03065	4.406130	23.371648
Department Employees	60.00000	12.00000	19.000000	9.000000
Webinar Speakers	82.00000	0.00000	6.000000	12.000000

Reshape the data to key-value pairs (long, instead of wide).

race_long <- pivot_longer(race_pct, c("white_only", "black_only", "asian_only",
                                       "everyone_else"))

Refactor so that the visualization doesn’t rely on alphabetical order but rather a logical order – in our case, putting “Everyone Else” at the end.

Also, give better factor labels.

race_long$name <- factor(race_long$name, levels = c("asian_only", "black_only", "white_only", "everyone_else"), labels = c("Asian Only", "Black Only", "White Only", "Everyone Else"))

race_long$group <- factor(race_long$group, levels = c("Philadelphia County",
                                                      "Hospital Patients",
                                                      "Department Employees",
                                                      "Webinar Speakers"))

Visualize Data

library(ggplot2)
  
ggplot(race_long, 
       aes(x=name, y=value, fill=group)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Race Comparison", 
       x = "Race", 
       y = "Percent of Population") +
  scale_fill_manual(values = c("#3c941f", "#054C70","#05C3DE",
                                 "#8ff0ff"))

Prop Test

It sure looks like there is major difference between Philadelphia County data and our (fabricated) Hospital and Department data and an even more startling difference between all of those and (the sadly not fabricated) Webinar Speakers group.

Are these differences statistically significant? We’ll use chi squared. We use [-1] to remove the group label and just leave numeric values.

chisq <- chisq.test(race_data[,-1])
chisq

## 
##  Pearson's Chi-squared test
## 
## data:  race_data[, -1]
## X-squared = 307479, df = 9, p-value < 2.2e-16

Sure looks like overall there are major disparities.

We can also do pairwise chi-squared:

For example, we could do the third and fourth row (department vs webinar speakers):

chisq.test(race_data[3:4,-1])

## 
##  Pearson's Chi-squared test
## 
## data:  race_data[3:4, -1]
## X-squared = 28.602, df = 3, p-value = 2.715e-06

Data is Not Neutral

Joy Payton

8/28/2020

Data is Not Neutral

Setup

Collect Demographics Data

Transform Data

Visualize Data

Prop Test