When answering the Lecture 11 assignment for PPCR, I was introduced to the CAMP dataset. In this dataset, the effect of two different treatments in asthmatic children was evaluated and compared to placebo. In the feedback I received for that assignment, Carla Mori, my TA, suggested the following:

As a secondary analysis, If you would like to go deeper, you could assess how the baseline hemoglobin modifies the effect of one treatment in the pre-bronchodilator peak flow ? How would you do that? Please, let me know if you would like to reflect more on that and engage in further discussion.

In this new analysis, we will examine if the effect of budesonide, nedocromil or placebo has an association with hemoglobin baseline values.

To start, we’re going to read the data. In a previous post, I detailed how the data reading process is done, so, for this one, I’m going to omit it as it is going to be exactly the same.

head(df)
## # A tibble: 6 Ă— 39
##      id    TG age_rz GENDER ETHNIC hemog   wbc agehome anypet woodstove dehumid
##   <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl>     <dbl>   <dbl>
## 1     1     2      5      1      3  12.5    65      50      1         2       2
## 2     2     2     11      1      1  12.5    82      34      1         2       1
## 3     4     2      7      2      4  13.6    54      34      2         2       2
## 4     5     2      5      1      2  13.8    66      14      1         2       2
## 5     6     2     13      1      4  13     114      20      2         2       2
## 6     9     1     12      2      4  12.6    52       5      1         2       2
## # … with 28 more variables: parent_smokes <dbl>, any_smokes <dbl>, ...14 <lgl>,
## #   PREFEV_BAS <dbl>, PREFVC_BAS <dbl>, PREFF_BAS <dbl>, PREPF_BAS <dbl>,
## #   POSFEV_BAS <dbl>, POSFVC_BAS <dbl>, POSFF_BAS <dbl>, POSPF_BAS <dbl>,
## #   PREFEVPP_BAS <dbl>, PREFVCPP_BAS <dbl>, POSFEVPP_BAS <dbl>,
## #   POSFVCPP_BAS <dbl>, ...27 <lgl>, PREFEV_72MO <dbl>, PREFVC_72MO <dbl>,
## #   PREFF_72MO <dbl>, PREPF_72MO <dbl>, POSFEV_72MO <dbl>, POSFVC_72MO <dbl>,
## #   POSFF_72MO <dbl>, POSPF_72MO <dbl>, PREFEVPP_72MO <dbl>, …

With the data now read, it’s worth it to examine what each variable represents. Description of Variables in the CAMP Data set Further information can be encountered in the documentation.

The image doesn’t describe this, but we can see also included a variable called PREFEVPP_BAS which is a measure of the Pre-bronchodilator Forced Expiratory Volume in one second, represented as a percentage of the predicted value for the patient (which takes into consideration parameters such as height and age), on the baseline, that is at randomization, before any treatment was assigned.

Also, the documentation and image describe the TG variable as using A, B and C. As the data set provided includes TG as a numeric variable, I suspect this has been edited by the PPCR staff to facilitate STATA usage with this data set. Unfortunately, I couldn’t get access to the original data set in time (one must request it from the National Heart, Lung, and Blood Institute division of the NIH), and, therefore, wasn’t able to confirm the exact transformation. I’m going to assume that 1=A=Budesonide, 2=B=Nedocromil,3=C=Placebo.

Let’s transform the TG variable into a factor and use the respective names, as this will allow for the creation of more informative plots in an easier fashion.

library("tidyverse")
df %>% mutate(TG = case_when(
        TG == 1 ~ "Budesonide",
        TG == 2 ~ "Nedocromil",
        TG == 3 ~ "Placebo")) %>% mutate(TG = as_factor(TG)) -> df
head(df)
## # A tibble: 6 Ă— 39
##      id TG       age_rz GENDER ETHNIC hemog   wbc agehome anypet woods…¹ dehumid
##   <dbl> <fct>     <dbl>  <dbl>  <dbl> <dbl> <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
## 1     1 Nedocro…      5      1      3  12.5    65      50      1       2       2
## 2     2 Nedocro…     11      1      1  12.5    82      34      1       2       1
## 3     4 Nedocro…      7      2      4  13.6    54      34      2       2       2
## 4     5 Nedocro…      5      1      2  13.8    66      14      1       2       2
## 5     6 Nedocro…     13      1      4  13     114      20      2       2       2
## 6     9 Budeson…     12      2      4  12.6    52       5      1       2       2
## # … with 28 more variables: parent_smokes <dbl>, any_smokes <dbl>, ...14 <lgl>,
## #   PREFEV_BAS <dbl>, PREFVC_BAS <dbl>, PREFF_BAS <dbl>, PREPF_BAS <dbl>,
## #   POSFEV_BAS <dbl>, POSFVC_BAS <dbl>, POSFF_BAS <dbl>, POSPF_BAS <dbl>,
## #   PREFEVPP_BAS <dbl>, PREFVCPP_BAS <dbl>, POSFEVPP_BAS <dbl>,
## #   POSFVCPP_BAS <dbl>, ...27 <lgl>, PREFEV_72MO <dbl>, PREFVC_72MO <dbl>,
## #   PREFF_72MO <dbl>, PREPF_72MO <dbl>, POSFEV_72MO <dbl>, POSFVC_72MO <dbl>,
## #   POSFF_72MO <dbl>, POSPF_72MO <dbl>, PREFEVPP_72MO <dbl>, …

In order to assess the effect of treatment in this most important measurement in Asthma, we must look at the change from baseline to post-treatment in each group. Using the percentage of predicted values gives us a better idea of how patients responded. In order to calculate this change, we’re going to create a new variable EffectPREFEVPP which will be the difference between PREFEVPP and PREFEVPP_BAS.

Let’s look at the distribution of the two variables we’re going to use to construct our new variable to check if there are any inconsistencies.

library("ggplot2")
df %>% ggplot(aes(x = PREFEVPP_BAS, colour = TG)) + geom_density()
## Warning: Removed 9 rows containing non-finite values (`stat_density()`).

df %>% ggplot(aes(x = PREFEVPP_72MO, colour = TG)) + geom_density()
## Warning: Removed 108 rows containing non-finite values (`stat_density()`).

At a glance, it looks like there are no inconsistencies (although there are warnings about missing data), so we can proceed with the creation of EffectPREFEVPP.

We’re then going to select only the baseline characteristics of the subjects in the study and the new variable created.

df %>% mutate (EffectPREFEVPP = PREFEVPP_72MO - PREFEVPP_BAS) %>% select(id,TG,GENDER,ETHNIC,hemog,wbc,agehome,anypet,woodstove,dehumid,parent_smokes,any_smokes,EffectPREFEVPP) -> df2

Let’s check for missing values in our newly created variable

df2 %>% pull(EffectPREFEVPP) %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -58.000  -6.000   3.000   2.926  11.000  50.000     116

We can see that there are 116 missing values. Let’s check which proportion of the total number of values that is.

df2 %>% pull(EffectPREFEVPP) %>% length() -> n
116/n
## [1] 0.1669065

16.7% of our observations are missing values, which could bias our results. We will proceed by removing those observations in our future calculations, but it is important to consider that the observations removed could be systematically different from the ones that remain, and, in this proportion, they could compromise the results.

Next, let’s look at the distribution of this new variable among the treatment groups.

df2 %>% ggplot(aes(x = EffectPREFEVPP, colour = TG)) + geom_density()
## Warning: Removed 116 rows containing non-finite values (`stat_density()`).

df2 %>% select(TG, EffectPREFEVPP) %>% group_by(TG) %>% reframe(mean = mean(EffectPREFEVPP, na.rm=TRUE),
                                                                  sd = sd(EffectPREFEVPP, na.rm = TRUE),
                                                                  n = n())
## # A tibble: 3 Ă— 4
##   TG          mean    sd     n
##   <fct>      <dbl> <dbl> <int>
## 1 Nedocromil  3.90  14.3   210
## 2 Budesonide  3.08  12.8   210
## 3 Placebo     2.10  13.8   275

Everything looks right and it looks like there is no difference between the groups (we could test this using ANOVA but this isn’t the focus of this analysis).

The next step is to remove from the hemog variable the observations which are extreme outliers and can be considered inputation errors, as was described in a previous post.

df2 %>% filter (hemog>2) -> df2

Since we’re dealing with a continuous independent variable (hemoglobin) and a continuous dependent variable (change in percentage point of the predicted pre-bronchodilator FEV after treatment), we’re going to use linear regression to compare these variables. Let’s first look at the data as a whole.

df2 %>% ggplot(aes(x = hemog, y = EffectPREFEVPP)) + geom_point() + geom_rug(position="jitter", linewidth=0.1) + stat_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

df2  %>% lm(EffectPREFEVPP ~ hemog, data =.) %>% summary()
## 
## Call:
## lm(formula = EffectPREFEVPP ~ hemog, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.114  -8.157  -0.294   8.370  45.261 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -18.7213     7.6560  -2.445   0.0148 * 
## hemog         1.6406     0.5822   2.818   0.0050 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.59 on 569 degrees of freedom
##   (114 observations deleted due to missingness)
## Multiple R-squared:  0.01376,    Adjusted R-squared:  0.01203 
## F-statistic: 7.941 on 1 and 569 DF,  p-value: 0.005001

Looking at data as a whole, there seems to be a weak positive association between the baseline hemoglobin value and the effect of treatment (\(R^2=0.0137, p < 0.01\)).

Now let’s take a look at the data separated by treatment groups.

df2 %>% group_by(TG) %>% ggplot(aes(x = hemog, y = EffectPREFEVPP)) + geom_point() + geom_rug(position="jitter", linewidth=0.1) + facet_wrap(vars(TG)) + stat_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'

df2 %>% group_by(TG) %>% do(model = lm(EffectPREFEVPP ~ hemog, data = .)) -> regressions
for(i in 1:3){
        print(summary(regressions$model[[i]]))
}
## 
## Call:
## lm(formula = EffectPREFEVPP ~ hemog, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -35.625  -9.554  -0.748   8.252  44.200 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -19.277     14.265  -1.351    0.178
## hemog          1.754      1.085   1.616    0.108
## 
## Residual standard error: 14.27 on 163 degrees of freedom
##   (42 observations deleted due to missingness)
## Multiple R-squared:  0.01577,    Adjusted R-squared:  0.009727 
## F-statistic: 2.611 on 1 and 163 DF,  p-value: 0.1081
## 
## 
## Call:
## lm(formula = EffectPREFEVPP ~ hemog, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.513  -7.717  -0.161   7.820  44.080 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -36.1199    12.9682  -2.785  0.00595 **
## hemog         2.9632     0.9836   3.013  0.00298 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.49 on 172 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.05012,    Adjusted R-squared:  0.0446 
## F-statistic: 9.076 on 1 and 172 DF,  p-value: 0.002981
## 
## 
## Call:
## lm(formula = EffectPREFEVPP ~ hemog, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -59.900  -7.754   0.521   7.773  38.576 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -4.8082    12.5571  -0.383    0.702
## hemog         0.5282     0.9565   0.552    0.581
## 
## Residual standard error: 13.88 on 230 degrees of freedom
##   (41 observations deleted due to missingness)
## Multiple R-squared:  0.001324,   Adjusted R-squared:  -0.003018 
## F-statistic: 0.3049 on 1 and 230 DF,  p-value: 0.5813

This output shows us that there is no correlation between baseline hemoglobin and the effect of treatment in the Nedocromil and Placebo groups (\(R^2=0.015, p > 0.05\) and \(R^2=0.001, p > 0.05\), respectively), while there is a weak positive correlation in the Budesonide group (\(R^2=0.050, p < 0.01\)).