assignment1

Introduction

This vignette is part of a larger R package concerning assignment1 for BIOS 621; a GitHub repository containing the package can be found at https://github.com/schifferl/assignment1. Similarly, the built html vignette can be found on RPubs at http://rpubs.com/schifferl/assignment1. This vignette intends to reproduce the results given in the 2011 Chou et al. paper¹.

Methods

Data related to the paper were provided by Levi Waldron, PhD in Microsoft Excel format (i.e. a .xls file), along with basic formatting code from which appropriate labels could be deduced. Labels were determined and serialized in the R environment as is shown in the subsequent code block.

labels_AgeGroup <- c("18-26", "11-17")
labels_Race <- c("White", "African American", "Hispanic", "Other/Unknown")
labels_Shots <- c("Received 1 Shot", "Received 2 Shots", "Received 3 Shots")
labels_Completed <- c("Noncompleter", "Completer")
labels_InsuranceType <- c("Public", "Private payer", "Hospital–based",
                          "Military")
labels_MedAssist <- c("Public", "Private")
labels_Location <- c("Johns Hopkins Odenton", "Johns Hopkins White Marsh",
                     "Johns Hopkins Outpatient Center",
                     "Johns Hopkins Bayview Medical Offices")
labels_LocationType <- c("Urban", "Suburban")
labels_PracticeType <- c("Pediatrics", "Family practice", "Gynecology")

With the serialization of the variable labels, the Excel file was then read in using the readxl² package and all the variables, save for age, were interpreted to be categorical. The endomorphism of the data_frame class was used in the manipulation of the data between readxl and dplyr³ commands. Syntax used in the serialization of data and interpolation of variable labels is shown in the subsequent code block.

gardasil <- read_excel("../inst/extdata/gardasil.xls") %>%
mutate(., AgeGroup = factor(AgeGroup, 1:0, labels_AgeGroup)) %>%
mutate(., Race = factor(Race, 0:3, labels_Race)) %>%
mutate(., Shots = factor(Shots, 1:3, labels_Shots)) %>%
mutate(., Completed = factor(Completed, 0:1, labels_Completed)) %>%
mutate(., InsuranceType = factor(InsuranceType, 0:3, labels_InsuranceType)) %>%
mutate(., MedAssist = factor(MedAssist, 1:0, labels_MedAssist)) %>%
mutate(., Location = factor(Location, 1:4, labels_Location)) %>%
mutate(., LocationType = factor(LocationType, 1:0, labels_LocationType)) %>%
mutate(., PracticeType = factor(PracticeType, 0:2, labels_PracticeType))

In order to reconstruct table 1 of the Chou et al. paper it was necessary to calculate sample standard deviation measures from the counts of categorical variables. The following formula⁴ was used whereby \(x\) represented a count within a category and \(y\) represented the total of the category itself. Numerous functions were written so as to automate the process and a final reconstruction of table 1 was achieved using dplyr syntax.

\[s=\sqrt{\frac{1}{y} \frac{x}{y}\left(1-\frac{x}{y}\right)}\]

Similarly, the reconstruction of table 2 from the Chou et al. paper required a series of univariate generalized linear models from which the coefficients could be abstracted. The series of models was built as is subsequently specified and the reader may note that there is no variable \(Minority\) previously devised in the serialization syntax. In order to produce results similar to the Chou et al. paper it was necessary to construct the \(Minority\) variable from the \(Race\) variable, considering \(Minority\) to be either "African American" or "Hispanic". In the case where \(Race\) was "Other/Unknown" it was necessary to exclude these cases from the analysis to obtain a result similar to Chou et al.

\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1\]

\[\ln(OR_{Completed})=\beta_0+MedAssist\ x_1\]

\[\ln(OR_{Completed})=\beta_0+InsuranceType\ x_1\]

\[\ln(OR_{Completed})=\beta_0+LocationType\ x_1\]

\[\ln(OR_{Completed})=\beta_0+PracticeType\ x_1\]

\[\ln(OR_{Completed})=\beta_0+Minority\ x_1\]

\[\ln(OR_{Completed})=\beta_0+Race\ x_1\]

Finally, the reconstruction of table 3 required the construction of two multivariate logistic regression models - one that accounted for the missing cases within the \(Minority\) category and another that excluded them. The results that are presented in the reconstructed table 3 represent coefficients from both models, given that it was necessary for reproduction. All but the \(Minority\) coefficient come from the first model (i.e. where \(\beta_6=Race\)). There was a difference in number of observations between the two models by 186, with the model where \(\beta_6=Race\) having the greater number. It did seem like a dishonest mistake, albeit perhaps committed honestly, and the implications are further discussed herein.

\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1+MedAssist\ x_2+InsuranceType\ x_3+ LocationType\ x_4+PracticeType\ x_5+Race\ x_6\]

\[\ln(OR_{Completed})=\beta_0+AgeGroup\ x_1+MedAssist\ x_2+InsuranceType\ x_3+ LocationType\ x_4+PracticeType\ x_5+Minority\ x_6\]

It is noted that, in all models, the 95% confidence intervals were calculated using the standard error multiplied by the Z score for a probability density of 0.95 taken to 7 significant figures. In the absence of further information in the Chou et al. paper as to how the confidence intervals were calculated, the method was conservatively selected.

Results

To the exception of sample standard deviation percentages, it was possible to reproduce table 1 exactly as it was presented in the Chou et al. paper. There was a bit of experimentation necessary to determine that the "Received 2 Shots" category in fact contained anyone who had received 3 shots but had done so outside of the 12 month window and had thus been labeled a "Noncompleter", in addition to those who had actually received 2 shots. The reconstructed table 1 is shown below, with sample standard deviation percentages calculated as specified in the methods.

Table 1. Demographics of Study Cohort
	n (%)	Received 1 Shot	Received 2 Shots	Completer	Noncompleter
18-26	712 (50.4)	252 (57.3% ± 3.1%)	238 (47.2% ± 3.2%)	222 (47.3% ± 3.4%)	490 (51.9% ± 2.3%)
11-17	701 (49.6)	188 (42.7% ± 3.6%)	266 (52.8% ± 3.1%)	247 (52.7% ± 3.2%)	454 (48.1% ± 2.3%)
Public	275 (19.5)	111 (25.2% ± 4.1%)	109 (21.6% ± 3.9%)	55 (11.7% ± 4.3%)	220 (23.3% ± 2.9%)
Private	1138 (80.5)	329 (74.8% ± 2.4%)	395 (78.4% ± 2.1%)	414 (88.3% ± 1.6%)	724 (76.7% ± 1.6%)
Urban	450 (31.8)	168 (38.2% ± 3.7%)	168 (33.3% ± 3.6%)	114 (24.3% ± 4%)	336 (35.6% ± 2.6%)
Suburban	963 (68.2)	272 (61.8% ± 2.9%)	336 (66.7% ± 2.6%)	355 (75.7% ± 2.3%)	608 (64.4% ± 1.9%)
Pediatrics	515 (36.4)	153 (34.8% ± 3.9%)	200 (39.7% ± 3.5%)	162 (34.5% ± 3.7%)	353 (37.4% ± 2.6%)
Family practice	365 (25.8)	117 (26.6% ± 4.1%)	142 (28.2% ± 3.8%)	106 (22.6% ± 4.1%)	259 (27.4% ± 2.8%)
Gynecology	533 (37.7)	170 (38.6% ± 3.7%)	162 (32.1% ± 3.7%)	201 (42.9% ± 3.5%)	332 (35.2% ± 2.6%)
White	732 (51.8)	201 (45.7% ± 3.5%)	251 (49.8% ± 3.2%)	280 (59.7% ± 2.9%)	452 (47.9% ± 2.3%)
African American	443 (31.4)	167 (38% ± 3.8%)	171 (33.9% ± 3.6%)	105 (22.4% ± 4.1%)	338 (35.8% ± 2.6%)
Hispanic	52 (3.7)	14 (3.2% ± 4.7%)	21 (4.2% ± 4.4%)	17 (3.6% ± 4.5%)	35 (3.7% ± 3.2%)
Other/Unknown	186 (13.2)	58 (13.2% ± 4.4%)	61 (12.1% ± 4.2%)	67 (14.3% ± 4.3%)	119 (12.6% ± 3%)

It was also possible to reproduce table 2 almost identically to the one presented in the Chou et al. paper, with slight divergences happening around the confidence intervals and one of the p values. These differences are perhaps due to the exactness of the calculations and the internal implementations of the statistical methods within software used to calculate the confidence intervals. With the multiplication of the standard error by the Z score a confidence interval more narrow than than presented in the Chou et al. paper was achieved, as is shown in the reconstructed table 2 below.

Table 2. Human Papillomavirus Vaccination Completion Rates, Univariable Analysis (N=1,413)
Group	n	Completed 3 Vaccinations in 12 Mo (%)	OR (95% CI)	P
18-26	712	222 (31.2)	1.0
11-17	701	247 (35.2)	1.2 (0.98-1.42)	0.11
Public	275	55 (20)	1.0
Private	1138	414 (36.4)	2.29 (1.97-2.61)	< 0.001
Private payer	723	253 (35)	2.15 (1.82-2.49)	< 0.001
Hospital–based	84	39 (46.4)	3.47 (2.95-3.99)	< 0.001
Military	331	122 (36.9)	2.33 (1.96-2.71)	< 0.001
Urban	450	114 (25.3)	1.0
Suburban	963	355 (36.9)	1.72 (1.47-1.97)	< 0.001
Pediatrics	515	162 (31.5)	1.0
Family practice	365	106 (29)	0.89 (0.6-1.18)	0.44
Gynecology	533	201 (37.7)	1.32 (1.06-1.57)	0.03
White	732	280 (38.3)	1.0
Minority	495	122 (24.6)	0.53 (0.28-0.78)	< 0.001
African American	443	105 (23.7)	0.5 (0.24-0.77)	< 0.001
Hispanic	52	17 (32.7)	0.78 (0.19-1.38)	0.43

The reproduction of table 3 was unique among the others in that it was not possible to reproduce the results of Chou et al., even after significant experimentation. In the absence of sufficient detail for reproducibility, it was necessary to abstract coefficients from multiple models and results were approximately equal to those of Chou et al. but the exact results could not be attained. All coefficients and p values were within about 10% of the published values and are presented in the reproduced table 3 below.

Table 3. Multivariable Logistic Regression Analysis of Predictors of Human Papillomavirus Vaccine Completion
Characteristic	OR (95% CI)	P
18-26	1.00
11-17	1.73 (1.42-2.04)	< 0.001
Public	1.00
Private	1.94 (1.55-2.33)	0.001
Urban	1.00
Suburban	1.49 (1.17-1.8)	0.014
Pediatrics	1.00
Family practice	0.79 (0.42-1.17)	0.233
Gynecology	1.42 (1.06-1.78)	0.058
White	1.00
African American or Hispanic	0.64 (0.37-0.9)	0.001

By the end of attempting to reproduce tables 1, 2, and 3 of the Chou et al. paper, it became clear that subtle differences persisted across all three tables and no one table was an exact replica of the published variant. The differences are most likely attributable to use of statistical methods and software, with the reproduced and published results generally showing concordance.

Discussion

As this and the Chou et al. analysis showed, \(InsuranceType\) was found to be the strongest predictor of Gardasil HPV vaccine series completion. Considering \(InsuranceType\) a modifiable risk factor then, it begs the question of if modification would provide any shift in completion. To a limited extent perhaps it would, but it is rather more important to consider the situation whereby \(InsuranceType\) is merely a proxy for socioeconomic status. In classical studies of socioeconomic status, such as the Whitehall Study⁵, it becomes clear that much of what dictates personal behavior is based in relation to feelings of autonomy. Furthermore, beyond the issues related to multiple collinearity⁶, it would be important to consider \(InsuranceType\) in the overall context of epidemiological studies and the multiplicity of complex exposures⁷. In the absence of further study that might validate the correlation of interest and validate its significance, it would be rash to assume causation on correlation alone.

References

Chou, B., Krill, L. S., Horton, B. B., Barat, C. E. & Trimble, C. L. Disparities in Human Papillomavirus Vaccine Completion Among Vaccine Initiators. Obstetrics & Gynecology 118, 14–20 (2011).↩
Hadley Wickham (2016). readxl: Read Excel Files. R package version 0.1.1. https://CRAN.R-project.org/package=readxl ↩
Hadley Wickham and Romain Francois (2016). dplyr: A Grammar of Data Manipulation. R package version 0.5.0. https://CRAN.R-project.org/package=dplyr ↩
https://stats.stackexchange.com/questions/ 74797/calculating-standard-deviation-of-percentages↩
Marmot, M. G., Rose, G., Shipley, M. & Hamilton, P. J. Employment grade and coronary heart disease in British civil servants. J Epidemiol Community Health 32, 244–249 (1978).↩
Vatcheva, K. P., Lee, M., McCormick, J. B. & Rahbar, M. H. Multicollinearity in Regression Analyses Conducted in Epidemiologic Studies. Epidemiology (Sunnyvale) 6, (2016).↩
Patel, C. J. & Ioannidis, J. P. A. Placing epidemiological results in the context of multiplicity and typical correlations of exposures. J Epidemiol Community Health 68, 1096–1100 (2014).↩