This week’s assignment asks us to review the CAMP dataset, develop a question, run the analysis on the dataset and report our conclusions. The assignment asked us to do this in STATA, but, as I’ve enjoyed using R much more than STATA and find that publishing R markdown documents makes for a much clearer and reproducible report, I’ll be doing analysis in R first and only later conducting it in STATA.
The CAMP dataset involves the measurement of certain variables in a long-term follow up of asthmatic children randomized to either budesonide, nedocromil or placebo. The primary outcome was FEV1. Secondary outcomes were clinical outcomes.
We’re gonna focus on the 72 month mark and run our analysis using that data.
As the dataset was made available in google sheets and as we’re striving for maximum reproducibility, we’re going to download it with R. To do so, we need a package that helps us get the data that is hosted in google sheets.
if (!require("googlesheets4")){
install.packages("googlesheets4")
}
## Loading required package: googlesheets4
library("googlesheets4")
Next, we read the data from google sheets.
if(!exists("df", envir = .GlobalEnv, inherits = FALSE)){
df <- read_sheet("https://docs.google.com/spreadsheets/d/1qbTKlpERNulSYT94uOjS0qTtxW0eovOgtvURUo6syKk/edit#gid=1639646413")
}
## ! Using an auto-discovered, cached token.
## To suppress this message, modify your code or options to clearly consent to
## the use of a cached token.
## See gargle's "Non-interactive auth" vignette for more details:
## <https://gargle.r-lib.org/articles/non-interactive-auth.html>
## ℹ The googlesheets4 package is using a cached token for
## 'pedro.faria-2023@ppcr.org'.
## ✔ Reading from "Assignment_Case_L11_2023_72mo".
## ✔ Range 'Data'.
## New names:
## • `` -> `...14`
## • `` -> `...27`
head(df)
## # A tibble: 6 × 39
## id TG age_rz GENDER ETHNIC hemog wbc agehome anypet woodstove dehumid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 5 1 3 12.5 65 50 1 2 2
## 2 2 2 11 1 1 12.5 82 34 1 2 1
## 3 4 2 7 2 4 13.6 54 34 2 2 2
## 4 5 2 5 1 2 13.8 66 14 1 2 2
## 5 6 2 13 1 4 13 114 20 2 2 2
## 6 9 1 12 2 4 12.6 52 5 1 2 2
## # ℹ 28 more variables: parent_smokes <dbl>, any_smokes <dbl>, ...14 <lgl>,
## # PREFEV_BAS <dbl>, PREFVC_BAS <dbl>, PREFF_BAS <dbl>, PREPF_BAS <dbl>,
## # POSFEV_BAS <dbl>, POSFVC_BAS <dbl>, POSFF_BAS <dbl>, POSPF_BAS <dbl>,
## # PREFEVPP_BAS <dbl>, PREFVCPP_BAS <dbl>, POSFEVPP_BAS <dbl>,
## # POSFVCPP_BAS <dbl>, ...27 <lgl>, PREFEV_72MO <dbl>, PREFVC_72MO <dbl>,
## # PREFF_72MO <dbl>, PREPF_72MO <dbl>, POSFEV_72MO <dbl>, POSFVC_72MO <dbl>,
## # POSFF_72MO <dbl>, POSPF_72MO <dbl>, PREFEVPP_72MO <dbl>, …
Then, let’s proceed with the questions.
Question 1 - What is your question?
Are hemoglobin levels different between treatment groups on the CAMP dataset at 72 months?
Question 2 – What is the statistical analysis to test your research question?
First, we need to determine if there are missing values in this data.
library("tidyverse")
df %>% pull (hemog) -> hemoglobin.data
hemoglobin.data[is.na(hemoglobin.data)]
## [1] NA NA NA NA NA NA NA NA
The above shows us that there are eight observations with missing values. This represents 1.15% of our observations and, therefore, we can exclude these observations with not too much loss of information.
Then, we need to check if the outcome is normally distributed in order to understand which type of analysis we must conduct. In a previous Rpubs post we discussed how to determine if a variable is normally distributed. We’re going to conduct this analysis based on those previous determinations.
df %>% ggplot(aes(x=hemog, colour = as.factor(TG))) + geom_density()
## Warning: Removed 8 rows containing non-finite values (`stat_density()`).
Looking at the results there are two things that pop-out. First, there is a warning that tells us that the eight observations that have missing values were disregarded. Second, the curve does not look normal. There are some strange outliers close to 0. Let’s take a look at those.
df %>% pull (hemog) %>% sort() %>% head(n=10)
## [1] 1.16 1.26 10.50 10.60 10.70 10.80 11.00 11.00 11.10 11.10
With this, we can see that there are two outliers that are changing the shape of the data. These two values are not compatible with the hemoglobin values that a patient can have while not in a incredibly critical state. It’s likely that this data was incorrectly inputed and lacked a final 0 (meaning that 1.16 meant 11.60 and that 1.26 meant 12.60). Let’s check other variables in these observations to see what is going on.
df %>% select(id,age_rz,GENDER, ETHNIC,hemog,PREFEVPP_72MO, POSFEVPP_72MO) %>%
arrange (hemog) %>% head(n=2)
## # A tibble: 2 × 7
## id age_rz GENDER ETHNIC hemog PREFEVPP_72MO POSFEVPP_72MO
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 875 11 1 2 1.16 103 106
## 2 744 12 2 4 1.26 98 114
I believe we can safely say that a child with hemoglobin of 1 cannot have a predicted FEV1 of close to 100%. We will then remove these two observations from our data, along with the missing values.
df %>% filter (hemog>2) -> df2
df2
## # A tibble: 685 × 39
## id TG age_rz GENDER ETHNIC hemog wbc agehome anypet woodstove dehumid
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 2 5 1 3 12.5 65 50 1 2 2
## 2 2 2 11 1 1 12.5 82 34 1 2 1
## 3 4 2 7 2 4 13.6 54 34 2 2 2
## 4 5 2 5 1 2 13.8 66 14 1 2 2
## 5 6 2 13 1 4 13 114 20 2 2 2
## 6 9 1 12 2 4 12.6 52 5 1 2 2
## 7 10 3 8 1 4 14.4 97 25 1 2 2
## 8 12 3 9 1 4 12.6 60 13 1 2 2
## 9 13 3 5 1 4 11.9 62 86 2 2 2
## 10 14 2 5 2 4 12 64 38 1 2 2
## # ℹ 675 more rows
## # ℹ 28 more variables: parent_smokes <dbl>, any_smokes <dbl>, ...14 <lgl>,
## # PREFEV_BAS <dbl>, PREFVC_BAS <dbl>, PREFF_BAS <dbl>, PREPF_BAS <dbl>,
## # POSFEV_BAS <dbl>, POSFVC_BAS <dbl>, POSFF_BAS <dbl>, POSPF_BAS <dbl>,
## # PREFEVPP_BAS <dbl>, PREFVCPP_BAS <dbl>, POSFEVPP_BAS <dbl>,
## # POSFVCPP_BAS <dbl>, ...27 <lgl>, PREFEV_72MO <dbl>, PREFVC_72MO <dbl>,
## # PREFF_72MO <dbl>, PREPF_72MO <dbl>, POSFEV_72MO <dbl>, POSFVC_72MO <dbl>, …
Let’s take a second look at the hemoglobin data now.
df2 %>% ggplot(aes(x=hemog, colour = as.factor(TG))) + geom_density() +
facet_grid(. ~ as.factor(TG))
This looks much closer to normal distributions, albeit with a small
right-skew. Let’s see the Q-Q Plots.
df2 %>% ggplot(aes(sample = hemog)) + geom_qq() + geom_qq_line() +
facet_grid(. ~ as.factor(TG))
Shapiro-Wilk test of normality can also help us here.
df2 %>% group_by(TG) %>% summarise(p.value.Shapiro.Test = format(shapiro.test(hemog)$p.value, scientific = FALSE))
## # A tibble: 3 × 2
## TG p.value.Shapiro.Test
## <dbl> <chr>
## 1 1 0.107604
## 2 2 0.4087514
## 3 3 0.6261369
Since the null-hypothesis of the test is that there is no difference between the tested distribution and a normal distribution, and the fact that we couldn’t reject the null-hypothesis, we’re going to move forward with parametric tests.
Since we have three different groups, we will use analysis of variance (ANOVA) to test if the three are different. Since we have only one independent variable (TG) we will be using one-way ANOVA. Also, since samples are independent, we’re not going to use repeated measures ANOVA.
Another assumption of ANOVA, besides normality, is homoscedasticity, or equality of variances in the populations from which the samples were derived. We can use boxplots to visually test this as well as the density distributions we used before.
df2 %>% ggplot(aes(x = TG, y = hemog, fill = as.factor(TG))) + geom_boxplot() +
guides(fill="none") + coord_flip()
They look very similar! To formally test this assumption, we can use Bartlett’s Test, in the same way we used Shapiro-Wilk’s test.
bartlett.test(hemog ~ as.factor(TG), data = df2)
##
## Bartlett test of homogeneity of variances
##
## data: hemog by as.factor(TG)
## Bartlett's K-squared = 0.61195, df = 2, p-value = 0.7364
From this, we cannot reject the null-hypothesis, therefore we can proceed with one-way ANOVA.
oneway.test(hemog ~ as.factor(TG), data = df2, var.equal = TRUE)
##
## One-way analysis of means
##
## data: hemog and as.factor(TG)
## F = 0.038838, num df = 2, denom df = 682, p-value = 0.9619
With this p.value we conclude that there is no statistically significant difference between hemoglobin levels in the three treatment group at the 72 months assessment.