library(pacman); p_load(psych, pROC, ggplot2)
describe(data)
The juicr package was used for data extraction from Figure S9A of Bryc et al. (2015). The values for extracted datapoints obtained by juicr were clearly mostly accurate, but at the limit of 1, they systematically erred and every point of 1 was rounded to .963 instead. To correct for this, all of the points after the first appearance of .963 points were adjusted by dividing by .964, to bring the values of the points at the limit to .999, since using literally 1 would impair estimation, and this epsilon is acceptably small for points that were not at the limit beyond this point. The lower-bound was similarly misestimated, but the figure was not clear enough to correct for subtle differences among the points in this range. The effect here was probably not much. To correct the model coefficients in your head, just make them slightly larger. If weights were available, they would obviously reduce the variance, because the values of not-0/1 for proportions with different SIRE IDs at given ancestry levels between 0 and 100% were not populated as widely as were basically 0/1 (about 15% to 50% African ancestry).
p_load(juicr)
GUI_juicr()
There is a general interest in the correspondence between social group categorization and objective criteria like ancestry. As an example of another categorization, chromosomal sex corresponds almost perfectly with gender. Bryc et al. (2015) provided some data that could be used for assessing the correspondence between genetic ancestry and racial self-identification in African Americans. In their supplementary materials, the graph S9A showed the proportions of people who self-identified as African American at each 1% increment of African ancestry. Explicitly fitting this data to a quasibinomial model allow expressing to what degree the two quantities are linked in precise terms.
SIREFit <- glm(PropSIREAfr ~ PropAfr, family = quasibinomial(link = "logit"), data = data)
summary(SIREFit)
##
## Call:
## glm(formula = PropSIREAfr ~ PropAfr, family = quasibinomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.83555 -0.27217 -0.10272 0.09845 0.99337
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.867736 0.258117 -11.11 <2e-16 ***
## PropAfr 0.092712 0.006931 13.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.136578)
##
## Null deviance: 68.3306 on 99 degrees of freedom
## Residual deviance: 9.0825 on 98 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
PredData <- data.frame(data$PropAfr, data$PropSIREAfr, data$PropSIRECorrected, Ancestry = seq(1, 100, 1))
PredictedNCorr <- predict(SIREFit, newdata = PredData, type = "response")
cbind(Ancestry = PredData$Ancestry, "Predicted Self-ID" = PredictedNCorr)
## Ancestry Predicted Self-ID
## 1 1 0.05868885
## 2 2 0.06402520
## 3 3 0.06981078
## 4 4 0.07607668
## 5 5 0.08285487
## 6 6 0.09017804
## 7 7 0.09807926
## 8 8 0.10659166
## 9 9 0.11574804
## 10 10 0.12558041
## 11 11 0.13611943
## 12 12 0.14739383
## 13 13 0.15942971
## 14 14 0.17224986
## 15 15 0.18587296
## 16 16 0.20031275
## 17 17 0.21557728
## 18 18 0.23166804
## 19 19 0.24857923
## 20 20 0.26629703
## 21 21 0.28479904
## 22 22 0.30405384
## 23 23 0.32402064
## 24 24 0.34464931
## 25 25 0.36588044
## 26 26 0.38764584
## 27 27 0.40986912
## 28 28 0.43246666
## 29 29 0.45534876
## 30 30 0.47842096
## 31 31 0.50158559
## 32 32 0.52474342
## 33 33 0.54779532
## 34 34 0.57064397
## 35 35 0.59319548
## 36 36 0.61536090
## 37 37 0.63705754
## 38 38 0.65821008
## 39 39 0.67875149
## 40 40 0.69862364
## 41 41 0.71777763
## 42 42 0.73617400
## 43 43 0.75378257
## 44 44 0.77058213
## 45 45 0.78656002
## 46 46 0.80171147
## 47 47 0.81603891
## 48 48 0.82955122
## 49 49 0.84226287
## 50 50 0.85419314
## 51 51 0.86536531
## 52 52 0.87580590
## 53 53 0.88554394
## 54 54 0.89461030
## 55 55 0.90303712
## 56 56 0.91085729
## 57 57 0.91810395
## 58 58 0.92481014
## 59 59 0.93100845
## 60 60 0.93673076
## 61 61 0.94200800
## 62 62 0.94687005
## 63 63 0.95134551
## 64 64 0.95546171
## 65 65 0.95924459
## 66 66 0.96271872
## 67 67 0.96590722
## 68 68 0.96883185
## 69 69 0.97151299
## 70 70 0.97396969
## 71 71 0.97621972
## 72 72 0.97827959
## 73 73 0.98016466
## 74 74 0.98188916
## 75 75 0.98346625
## 76 76 0.98490813
## 77 77 0.98622602
## 78 78 0.98743030
## 79 79 0.98853050
## 80 80 0.98953544
## 81 81 0.99045317
## 82 82 0.99129112
## 83 83 0.99205612
## 84 84 0.99275441
## 85 85 0.99339172
## 86 86 0.99397332
## 87 87 0.99450401
## 88 88 0.99498821
## 89 89 0.99542995
## 90 90 0.99583292
## 91 91 0.99620049
## 92 92 0.99653574
## 93 93 0.99684152
## 94 94 0.99712038
## 95 95 0.99737468
## 96 96 0.99760658
## 97 97 0.99781804
## 98 98 0.99801086
## 99 99 0.99818667
## 100 100 0.99834696
SIREFit <- glm(PropSIRECorrected ~ PropAfr, family = quasibinomial(link = "logit"), data = data)
summary(SIREFit)
##
## Call:
## glm(formula = PropSIRECorrected ~ PropAfr, family = quasibinomial(link = "logit"),
## data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.82025 -0.12039 -0.02189 0.07711 0.89715
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.01971 0.26460 -15.19 <2e-16 ***
## PropAfr 0.13561 0.00807 16.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.07286179)
##
## Null deviance: 79.1603 on 99 degrees of freedom
## Residual deviance: 5.0651 on 98 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 7
PredictedCorr <- predict(SIREFit, newdata = PredData, type = "response")
cbind(Ancestry = PredData$Ancestry, "Predicted Self-ID" = PredictedCorr)
## Ancestry Predicted Self-ID
## 1 1 0.02015194
## 2 2 0.02301140
## 3 3 0.02626574
## 4 4 0.02996620
## 5 5 0.03416970
## 6 6 0.03893918
## 7 7 0.04434383
## 8 8 0.05045924
## 9 9 0.05736740
## 10 10 0.06515642
## 11 11 0.07392007
## 12 12 0.08375683
## 13 13 0.09476864
## 14 14 0.10705905
## 15 15 0.12073080
## 16 16 0.13588279
## 17 17 0.15260635
## 18 18 0.17098087
## 19 19 0.19106893
## 20 20 0.21291096
## 21 21 0.23651980
## 22 22 0.26187553
## 23 23 0.28892083
## 24 24 0.31755758
## 25 25 0.34764504
## 26 26 0.37900002
## 27 27 0.41139956
## 28 28 0.44458592
## 29 29 0.47827406
## 30 30 0.51216100
## 31 31 0.54593652
## 32 32 0.57929447
## 33 33 0.61194370
## 34 34 0.64361795
## 35 35 0.67408378
## 36 36 0.70314643
## 37 37 0.73065302
## 38 38 0.75649351
## 39 39 0.78059934
## 40 40 0.80294044
## 41 41 0.82352092
## 42 42 0.84237397
## 43 43 0.85955644
## 44 44 0.87514351
## 45 45 0.88922360
## 46 46 0.90189388
## 47 47 0.91325633
## 48 48 0.92341456
## 49 49 0.93247116
## 50 50 0.94052574
## 51 51 0.94767352
## 52 52 0.95400427
## 53 53 0.95960175
## 54 54 0.96454335
## 55 55 0.96890007
## 56 56 0.97273660
## 57 57 0.97611151
## 58 58 0.97907763
## 59 59 0.98168238
## 60 60 0.98396815
## 61 61 0.98597277
## 62 62 0.98772986
## 63 63 0.98926925
## 64 64 0.99061734
## 65 65 0.99179747
## 66 66 0.99283025
## 67 67 0.99373381
## 68 68 0.99452413
## 69 69 0.99521525
## 70 70 0.99581951
## 71 71 0.99634774
## 72 72 0.99680943
## 73 73 0.99721293
## 74 74 0.99756552
## 75 75 0.99787360
## 76 76 0.99814277
## 77 77 0.99837792
## 78 78 0.99858334
## 79 79 0.99876277
## 80 80 0.99891951
## 81 81 0.99905641
## 82 82 0.99917597
## 83 83 0.99928040
## 84 84 0.99937160
## 85 85 0.99945125
## 86 86 0.99952081
## 87 87 0.99958155
## 88 88 0.99963460
## 89 89 0.99968093
## 90 90 0.99972138
## 91 91 0.99975670
## 92 92 0.99978755
## 93 93 0.99981449
## 94 94 0.99983801
## 95 95 0.99985855
## 96 96 0.99987649
## 97 97 0.99989215
## 98 98 0.99990583
## 99 99 0.99991777
## 100 100 0.99992820
ggplot(data, aes(x = PropAfr, y = PropSIREAfr)) +
geom_point(alpha = .5, color = "steelblue") +
stat_smooth(formula = "y ~ x", method = "glm", se = T, method.args = list(family = quasibinomial(link = "logit")),
col = "steelblue", lty = 2) +
theme_bw() + labs(x = "African Ancestry Proportion", y = "Proportion who Identify as African-American")
ggplot(data, aes(x = PropAfr, y = PropSIRECorrected)) +
geom_point(alpha = .5, color = "steelblue") +
stat_smooth(formula = "y ~ x", method = "glm", se = T, method.args = list(family = quasibinomial(link = "logit")),
col = "steelblue", lty = 2) +
theme_bw() + labs(x = "African Ancestry Proportion", y = "Proportion who Identify as African-American")
The values slightly above the 0 on the x-axis are, as mentioned earlier, wrong, and they should be shifted down somewhat. Careful inspection could allow this to be done precisely, but the result will not qualitatively change: genetic ancestry and self-identification with its corresponding social group are virtually perfectly associated. Treating this as a multiclass problem, the AUCs are
NonCorrFit <- multiclass.roc(data$PropSIREAfr, PredictedNCorr)
CorrFit <- multiclass.roc(data$PropSIRECorrected, PredictedCorr)
NonCorrFit
##
## Call:
## multiclass.roc.default(response = data$PropSIREAfr, predictor = PredictedNCorr)
##
## Data: PredictedNCorr with 52 levels of data$PropSIREAfr: 0.035, 0.036, 0.037, 0.039, 0.04, 0.044, 0.047, 0.063, 0.068, 0.113, 0.154, 0.17, 0.223, 0.234, 0.302, 0.315, 0.346, 0.499, 0.542, 0.577, 0.593, 0.654, 0.67, 0.724, 0.76, 0.802, 0.809, 0.817, 0.829, 0.833, 0.837, 0.838, 0.845, 0.862, 0.881, 0.891, 0.916, 0.92, 0.933, 0.935, 0.943, 0.944, 0.945, 0.947, 0.949, 0.953, 0.957, 0.959, 0.96, 0.961, 0.963, 0.964.
## Multi-class area under the curve: 0.9761
CorrFit
##
## Call:
## multiclass.roc.default(response = data$PropSIRECorrected, predictor = PredictedCorr)
##
## Data: PredictedCorr with 54 levels of data$PropSIRECorrected: 0.035, 0.036, 0.037, 0.039, 0.04, 0.044, 0.047, 0.063, 0.068, 0.113, 0.154, 0.17, 0.223, 0.234, 0.302, 0.315, 0.346, 0.499, 0.542, 0.577, 0.593, 0.654, 0.67, 0.724, 0.76, 0.802, 0.809, 0.817, 0.829, 0.833, 0.837, 0.838, 0.845, 0.862, 0.881, 0.891, 0.916, 0.92, 0.933, 0.935, 0.964, 0.969, 0.978, 0.979, 0.981, 0.982, 0.985, 0.989, 0.993, 0.994, 0.996, 0.997, 0.999, 1.
## Multi-class area under the curve: 0.9872
Even with slight amounts of error due to the automation of data extraction, African ancestry is a near-perfect predictor of self-identifying as African American in the United States. There are notable reasons for exceptions, like individuals identifying as more than two races or being immigrants, but another possible explanation for the intercept shift rather than a smooth increase in self-identification proportions from 0 to 100% African might be social hypodescent. This might operate such that individuals who visually appear to be some portion African are identified and treated by others as if they are African American (see Ho et al., 2011). For this to work out, the visual difference would - quite plausibly - have to appear over a wide range of ancestries between 0 and 100% and people would have to weight those differences rather strongly in their evaluations of a person’s ancestry/race. Only at very extreme appearance shifts towards looking stereotypically European would we begin to see substantial levels of non-African American self-identification for people with considerable amounts of African ancestry. The presence of large numbers of individuals who have both White and Black parents may also explain the shifted smoothness, since these individuals would more plausibly self-identify as one or the other, they may do so according to some aspect of their appearances, rearing, or treatment by others (which could be affected by appearance), and they may do so in a fashion that has changed between generations. This combined with hypodescent and the presence of individuals who have two socially or self-identified Black parents but similar genetic ancestry might be enough as an explanation.
Regardless of what causes self-identification to so precisely related to genetic ancestry, the subject is interesting and deserves further investigation.
Bryc, K., Durand, E. Y., Macpherson, J. M., Reich, D., & Mountain, J. L. (2015). The Genetic Ancestry of African Americans, Latinos, and European Americans across the United States. The American Journal of Human Genetics, 96(1), 37–53. https://doi.org/10.1016/j.ajhg.2014.11.010
Ho, A. K., Sidanius, J., Levin, D. T., & Banaji, M. R. (2011). Evidence for hypodescent and racial hierarchy in the categorization and perception of biracial individuals. Journal of Personality and Social Psychology, 100, 492–506. https://doi.org/10.1037/a0021562
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.6 pROC_1.18.0 psych_2.2.9 pacman_0.5.1
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.33 bslib_0.4.0 splines_4.2.1
## [5] lattice_0.20-45 colorspace_2.0-3 vctrs_0.4.2 generics_0.1.3
## [9] htmltools_0.5.3 yaml_2.3.5 mgcv_1.8-40 utf8_1.2.2
## [13] rlang_1.0.6 jquerylib_0.1.4 pillar_1.8.1 glue_1.6.2
## [17] withr_2.5.0 lifecycle_1.0.3 plyr_1.8.7 stringr_1.4.1
## [21] munsell_0.5.0 gtable_0.3.1 evaluate_0.17 labeling_0.4.2
## [25] knitr_1.40 fastmap_1.1.0 parallel_4.2.1 fansi_1.0.3
## [29] highr_0.9 Rcpp_1.0.9 scales_1.2.1 cachem_1.0.6
## [33] jsonlite_1.8.2 farver_2.1.1 mnormt_2.1.1 digest_0.6.29
## [37] stringi_1.7.8 dplyr_1.0.10 grid_4.2.1 cli_3.4.1
## [41] tools_4.2.1 magrittr_2.0.3 sass_0.4.2 tibble_3.1.8
## [45] pkgconfig_2.0.3 Matrix_1.4-1 rmarkdown_2.17 rstudioapi_0.14
## [49] R6_2.5.1 nlme_3.1-157 compiler_4.2.1