This R Markdown document is intended to accompany the talk from Paulette McRae, PhD, and Joy Payton, MS, entitled “Data is Not Neutral.”
Let’s boil it down to: human subjects research is characterized by unrepresentative research talent and unrepresentative research subject cohorts.
How can you visualize and quantify these challenges?
library(tibble)
library(tidyr)
library(dplyr)
For example:
Note: the data below marked “Hospital Patients” and “Department Employees” are fabricated to protect institutional data. But they’re not unrealistic!
race_data <- tibble(group = c("Philadelphia County",
"Hospital Patients",
"Department Employees",
"Webinar Speakers"),
white_only = c(709661,
524000,
600,
82),
black_only = c(690652,
230000,
120,
0),
asian_only = c(123557,
46000,
190,
6),
everyone_else = c(60194,
244000,
90,
12))
Get percentages. We use [-1] to remove the group label and just leave numeric values.
race_pct <- race_data
race_pct[,-1] <- prop.table(as.matrix(race_pct[,-1]), 1) * 100
race_pct
group | white_only | black_only | asian_only | everyone_else |
---|---|---|---|---|
Philadelphia County | 44.80002 | 43.60001 | 7.800001 | 3.799973 |
Hospital Patients | 50.19157 | 22.03065 | 4.406130 | 23.371648 |
Department Employees | 60.00000 | 12.00000 | 19.000000 | 9.000000 |
Webinar Speakers | 82.00000 | 0.00000 | 6.000000 | 12.000000 |
Reshape the data to key-value pairs (long, instead of wide).
race_long <- pivot_longer(race_pct, c("white_only", "black_only", "asian_only",
"everyone_else"))
Refactor so that the visualization doesn’t rely on alphabetical order but rather a logical order – in our case, putting “Everyone Else” at the end.
Also, give better factor labels.
race_long$name <- factor(race_long$name, levels = c("asian_only", "black_only", "white_only", "everyone_else"), labels = c("Asian Only", "Black Only", "White Only", "Everyone Else"))
race_long$group <- factor(race_long$group, levels = c("Philadelphia County",
"Hospital Patients",
"Department Employees",
"Webinar Speakers"))
library(ggplot2)
ggplot(race_long,
aes(x=name, y=value, fill=group)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Race Comparison",
x = "Race",
y = "Percent of Population") +
scale_fill_manual(values = c("#3c941f", "#054C70","#05C3DE",
"#8ff0ff"))
It sure looks like there is major difference between Philadelphia County data and our (fabricated) Hospital and Department data and an even more startling difference between all of those and (the sadly not fabricated) Webinar Speakers group.
Are these differences statistically significant? We’ll use chi squared. We use [-1] to remove the group label and just leave numeric values.
chisq <- chisq.test(race_data[,-1])
chisq
##
## Pearson's Chi-squared test
##
## data: race_data[, -1]
## X-squared = 307479, df = 9, p-value < 2.2e-16
Sure looks like overall there are major disparities.
We can also do pairwise chi-squared:
For example, we could do the third and fourth row (department vs webinar speakers):
chisq.test(race_data[3:4,-1])
##
## Pearson's Chi-squared test
##
## data: race_data[3:4, -1]
## X-squared = 28.602, df = 3, p-value = 2.715e-06