This vignette is part of a larger R package concerning assignment1 for BIOS 621; a GitHub repository containing the package can be found at https://github.com/schifferl/assignment1. Similarly, the built html vignette can be found on RPubs at http://rpubs.com/schifferl/assignment1. This vignette intends to reproduce the results given in the 2011 Chou et al. paper1.
Data related to the paper were provided by Levi Waldron, PhD in Microsoft Excel format (i.e. a .xls file), along with basic formatting code from which appropriate labels could be deduced. Labels were determined and serialized in the R environment as is shown in the subsequent code block.
labels_AgeGroup <- c("18-26", "11-17")
labels_Race <- c("White", "African American", "Hispanic", "Other/Unknown")
labels_Shots <- c("Received 1 Shot", "Received 2 Shots", "Received 3 Shots")
labels_Completed <- c("Noncompleter", "Completer")
labels_InsuranceType <- c("Public", "Private payer", "Hospital–based",
"Military")
labels_MedAssist <- c("Public", "Private")
labels_Location <- c("Johns Hopkins Odenton", "Johns Hopkins White Marsh",
"Johns Hopkins Outpatient Center",
"Johns Hopkins Bayview Medical Offices")
labels_LocationType <- c("Urban", "Suburban")
labels_PracticeType <- c("Pediatrics", "Family practice", "Gynecology")
With the serialization of the variable labels, the Excel file was then read in using the readxl2 package and all the variables, save for age, were interpreted to be categorical. The endomorphism of the data_frame
class was used in the manipulation of the data between readxl and dplyr3 commands. Syntax used in the serialization of data and interpolation of variable labels is shown in the subsequent code block.
gardasil <- read_excel("../inst/extdata/gardasil.xls") %>%
mutate(., AgeGroup = factor(AgeGroup, 1:0, labels_AgeGroup)) %>%
mutate(., Race = factor(Race, 0:3, labels_Race)) %>%
mutate(., Shots = factor(Shots, 1:3, labels_Shots)) %>%
mutate(., Completed = factor(Completed, 0:1, labels_Completed)) %>%
mutate(., InsuranceType = factor(InsuranceType, 0:3, labels_InsuranceType)) %>%
mutate(., MedAssist = factor(MedAssist, 1:0, labels_MedAssist)) %>%
mutate(., Location = factor(Location, 1:4, labels_Location)) %>%
mutate(., LocationType = factor(LocationType, 1:0, labels_LocationType)) %>%
mutate(., PracticeType = factor(PracticeType, 0:2, labels_PracticeType))
In order to reconstruct table 1 of the Chou et al. paper it was necessary to calculate sample standard deviation measures from the counts of categorical variables. The following formula4 was used whereby \(x\) represented a count within a category and \(y\) represented the total of the category itself. Numerous functions were written so as to automate the process and a final reconstruction of table 1 was achieved using dplyr syntax.
\[s=\sqrt{\frac{1}{y} \frac{x}{y}\left(1-\frac{x}{y}\right)}\]
Similarly, the reconstruction of table 2 from the Chou et al. paper required a series of univariate generalized linear models from which the coefficients could be abstracted. The series of models was built as is subsequently specified and the reader may note that there is no variable \(Minority\) previously devised in the serialization syntax. In order to produce results similar to the Chou et al. paper it was necessary to construct the \(Minority\) variable from the \(Race\) variable, considering \(Minority\) to be either "African American"
or "Hispanic"
. In the case where \(Race\) was "Other/Unknown"
it was necessary to exclude these cases from the analysis to obtain a result similar to Chou et al.
\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1\]
\[\ln(OR_{Completed})=\beta_0+MedAssist\ x_1\]
\[\ln(OR_{Completed})=\beta_0+InsuranceType\ x_1\]
\[\ln(OR_{Completed})=\beta_0+LocationType\ x_1\]
\[\ln(OR_{Completed})=\beta_0+PracticeType\ x_1\]
\[\ln(OR_{Completed})=\beta_0+Minority\ x_1\]
\[\ln(OR_{Completed})=\beta_0+Race\ x_1\]
Finally, the reconstruction of table 3 required the construction of two multivariate logistic regression models - one that accounted for the missing cases within the \(Minority\) category and another that excluded them. The results that are presented in the reconstructed table 3 represent coefficients from both models, given that it was necessary for reproduction. All but the \(Minority\) coefficient come from the first model (i.e. where \(\beta_6=Race\)). There was a difference in number of observations between the two models by 186, with the model where \(\beta_6=Race\) having the greater number. It did seem like a dishonest mistake, albeit perhaps committed honestly, and the implications are further discussed herein.
\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1+MedAssist\ x_2+InsuranceType\ x_3+ LocationType\ x_4+PracticeType\ x_5+Race\ x_6\]
\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1+MedAssist\ x_2+InsuranceType\ x_3+ LocationType\ x_4+PracticeType\ x_5+Minority\ x_6\]
It is noted that, in all models, the 95% confidence intervals were calculated using the standard error multiplied by the Z score for a probability density of 0.95 taken to 7 significant figures. In the absence of further information in the Chou et al. paper as to how the confidence intervals were calculated, the method was conservatively selected.
To the exception of sample standard deviation percentages, it was possible to reproduce table 1 exactly as it was presented in the Chou et al. paper. There was a bit of experimentation necessary to determine that the "Received 2 Shots"
category in fact contained anyone who had received 3 shots but had done so outside of the 12 month window and had thus been labeled a "Noncompleter"
, in addition to those who had actually received 2 shots. The reconstructed table 1 is shown below, with sample standard deviation percentages calculated as specified in the methods.
n (%) | Received 1 Shot | Received 2 Shots | Completer | Noncompleter | |
---|---|---|---|---|---|
18-26 | 712 (50.4) | 252 (57.3% ± 3.1%) | 238 (47.2% ± 3.2%) | 222 (47.3% ± 3.4%) | 490 (51.9% ± 2.3%) |
11-17 | 701 (49.6) | 188 (42.7% ± 3.6%) | 266 (52.8% ± 3.1%) | 247 (52.7% ± 3.2%) | 454 (48.1% ± 2.3%) |
Public | 275 (19.5) | 111 (25.2% ± 4.1%) | 109 (21.6% ± 3.9%) | 55 (11.7% ± 4.3%) | 220 (23.3% ± 2.9%) |
Private | 1138 (80.5) | 329 (74.8% ± 2.4%) | 395 (78.4% ± 2.1%) | 414 (88.3% ± 1.6%) | 724 (76.7% ± 1.6%) |
Urban | 450 (31.8) | 168 (38.2% ± 3.7%) | 168 (33.3% ± 3.6%) | 114 (24.3% ± 4%) | 336 (35.6% ± 2.6%) |
Suburban | 963 (68.2) | 272 (61.8% ± 2.9%) | 336 (66.7% ± 2.6%) | 355 (75.7% ± 2.3%) | 608 (64.4% ± 1.9%) |
Pediatrics | 515 (36.4) | 153 (34.8% ± 3.9%) | 200 (39.7% ± 3.5%) | 162 (34.5% ± 3.7%) | 353 (37.4% ± 2.6%) |
Family practice | 365 (25.8) | 117 (26.6% ± 4.1%) | 142 (28.2% ± 3.8%) | 106 (22.6% ± 4.1%) | 259 (27.4% ± 2.8%) |
Gynecology | 533 (37.7) | 170 (38.6% ± 3.7%) | 162 (32.1% ± 3.7%) | 201 (42.9% ± 3.5%) | 332 (35.2% ± 2.6%) |
White | 732 (51.8) | 201 (45.7% ± 3.5%) | 251 (49.8% ± 3.2%) | 280 (59.7% ± 2.9%) | 452 (47.9% ± 2.3%) |
African American | 443 (31.4) | 167 (38% ± 3.8%) | 171 (33.9% ± 3.6%) | 105 (22.4% ± 4.1%) | 338 (35.8% ± 2.6%) |
Hispanic | 52 (3.7) | 14 (3.2% ± 4.7%) | 21 (4.2% ± 4.4%) | 17 (3.6% ± 4.5%) | 35 (3.7% ± 3.2%) |
Other/Unknown | 186 (13.2) | 58 (13.2% ± 4.4%) | 61 (12.1% ± 4.2%) | 67 (14.3% ± 4.3%) | 119 (12.6% ± 3%) |
It was also possible to reproduce table 2 almost identically to the one presented in the Chou et al. paper, with slight divergences happening around the confidence intervals and one of the p values. These differences are perhaps due to the exactness of the calculations and the internal implementations of the statistical methods within software used to calculate the confidence intervals. With the multiplication of the standard error by the Z score a confidence interval more narrow than than presented in the Chou et al. paper was achieved, as is shown in the reconstructed table 2 below.
Group | n | Completed 3 Vaccinations in 12 Mo (%) | OR (95% CI) | P |
---|---|---|---|---|
18-26 | 712 | 222 (31.2) | 1.0 | |
11-17 | 701 | 247 (35.2) | 1.2 (0.98-1.42) | 0.11 |
Public | 275 | 55 (20) | 1.0 | |
Private | 1138 | 414 (36.4) | 2.29 (1.97-2.61) | < 0.001 |
Private payer | 723 | 253 (35) | 2.15 (1.82-2.49) | < 0.001 |
Hospital–based | 84 | 39 (46.4) | 3.47 (2.95-3.99) | < 0.001 |
Military | 331 | 122 (36.9) | 2.33 (1.96-2.71) | < 0.001 |
Urban | 450 | 114 (25.3) | 1.0 | |
Suburban | 963 | 355 (36.9) | 1.72 (1.47-1.97) | < 0.001 |
Pediatrics | 515 | 162 (31.5) | 1.0 | |
Family practice | 365 | 106 (29) | 0.89 (0.6-1.18) | 0.44 |
Gynecology | 533 | 201 (37.7) | 1.32 (1.06-1.57) | 0.03 |
White | 732 | 280 (38.3) | 1.0 | |
Minority | 495 | 122 (24.6) | 0.53 (0.28-0.78) | < 0.001 |
African American | 443 | 105 (23.7) | 0.5 (0.24-0.77) | < 0.001 |
Hispanic | 52 | 17 (32.7) | 0.78 (0.19-1.38) | 0.43 |
The reproduction of table 3 was unique among the others in that it was not possible to reproduce the results of Chou et al., even after significant experimentation. In the absence of sufficient detail for reproducibility, it was necessary to abstract coefficients from multiple models and results were approximately equal to those of Chou et al. but the exact results could not be attained. All coefficients and p values were within about 10% of the published values and are presented in the reproduced table 3 below.
Characteristic | OR (95% CI) | P |
---|---|---|
18-26 | 1.00 | |
11-17 | 1.73 (1.42-2.04) | < 0.001 |
Public | 1.00 | |
Private | 1.94 (1.55-2.33) | 0.001 |
Urban | 1.00 | |
Suburban | 1.49 (1.17-1.8) | 0.014 |
Pediatrics | 1.00 | |
Family practice | 0.79 (0.42-1.17) | 0.233 |
Gynecology | 1.42 (1.06-1.78) | 0.058 |
White | 1.00 | |
African American or Hispanic | 0.64 (0.37-0.9) | 0.001 |
By the end of attempting to reproduce tables 1, 2, and 3 of the Chou et al. paper, it became clear that subtle differences persisted across all three tables and no one table was an exact replica of the published variant. The differences are most likely attributable to use of statistical methods and software, with the reproduced and published results generally showing concordance.
As this and the Chou et al. analysis showed, \(InsuranceType\) was found to be the strongest predictor of Gardasil HPV vaccine series completion. Considering \(InsuranceType\) a modifiable risk factor then, it begs the question of if modification would provide any shift in completion. To a limited extent perhaps it would, but it is rather more important to consider the situation whereby \(InsuranceType\) is merely a proxy for socioeconomic status. In classical studies of socioeconomic status, such as the Whitehall Study5, it becomes clear that much of what dictates personal behavior is based in relation to feelings of autonomy. Furthermore, beyond the issues related to multiple collinearity6, it would be important to consider \(InsuranceType\) in the overall context of epidemiological studies and the multiplicity of complex exposures7. In the absence of further study that might validate the correlation of interest and validate its significance, it would be rash to assume causation on correlation alone.
Chou, B., Krill, L. S., Horton, B. B., Barat, C. E. & Trimble, C. L. Disparities in Human Papillomavirus Vaccine Completion Among Vaccine Initiators. Obstetrics & Gynecology 118, 14–20 (2011).↩
Hadley Wickham (2016). readxl: Read Excel Files. R package version 0.1.1. https://CRAN.R-project.org/package=readxl↩
Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr↩
https://stats.stackexchange.com/questions/ 74797/calculating-standard-deviation-of-percentages↩
Marmot, M. G., Rose, G., Shipley, M. & Hamilton, P. J. Employment grade and coronary heart disease in British civil servants. J Epidemiol Community Health 32, 244–249 (1978).↩
Vatcheva, K. P., Lee, M., McCormick, J. B. & Rahbar, M. H. Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies. Epidemiology (Sunnyvale) 6, (2016).↩
Patel, C. J. & Ioannidis, J. P. A. Placing epidemiological results in the context of multiplicity and typical correlations of exposures. J Epidemiol Community Health 68, 1096–1100 (2014).↩