This section is going to be repeated from what was seen in the previous assignment, it pertains to reading the data and cleaning it initially.
## this library is used by almost everyone that uses R and it introduces the notion of pipes (%>%)
library(tidyverse)
## this is used to read the excel sheet that was provided
library(readxl)
## this is used to summarize the continuous data in a more neat way than base R does
library(skimr)
Then, let’s read in the data. This data contains multiple entries of
“.” to represent missing values. In order to correctly code this, we
pass the argument na = "." to the function. We then use
-> to assign the data that was read to the variable
df.
read_excel("PPCR_2005_2006.xlsx", na = ".") -> df
Now, as you can see below, there are 19 observations in which all
variables equal NA.
## this command takes the data that is stored in `df` and provides it as the
## first argument for the `filter()` function, which, in this situation
## is evaluating if all (`if_all()` function) observations, across all the
## variables (`everything()` function) are equal to `NA` (`is.na()` function)
df %>% filter(if_all(everything(), ~ is.na(.)))
## # A tibble: 19 × 136
## SEQN Gender Age at Screening Adjud…¹ Race/Ethnicity - Rec…² `Marital Status`
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 NA NA NA NA <NA>
## 2 NA NA NA NA <NA>
## 3 NA NA NA NA <NA>
## 4 NA NA NA NA <NA>
## 5 NA NA NA NA <NA>
## 6 NA NA NA NA <NA>
## 7 NA NA NA NA <NA>
## 8 NA NA NA NA <NA>
## 9 NA NA NA NA <NA>
## 10 NA NA NA NA <NA>
## 11 NA NA NA NA <NA>
## 12 NA NA NA NA <NA>
## 13 NA NA NA NA <NA>
## 14 NA NA NA NA <NA>
## 15 NA NA NA NA <NA>
## 16 NA NA NA NA <NA>
## 17 NA NA NA NA <NA>
## 18 NA NA NA NA <NA>
## 19 NA NA NA NA <NA>
## # ℹ abbreviated names: ¹`Age at Screening Adjudicated - Recode`,
## # ²`Race/Ethnicity - Recode`
## # ℹ 131 more variables: `Education Level - Adults 20+` <chr>,
## # `Annual Family Income` <chr>, `Family PIR` <chr>, `Energy (kcal)` <chr>,
## # `Protein (gm)` <chr>, `Carbohydrate (gm)` <chr>, `Total sugars (gm)` <chr>,
## # `Dietary fiber (gm)` <chr>, `Total fat (gm)` <chr>, `Vitamin C (mg)` <chr>,
## # `Vitamin K (mcg)` <chr>, `Calcium (mg)` <chr>, `Phosphorus (mg)` <chr>, …
Let’s drop those observations.
## this command filters across all columns only those observations in which at least one value isn't NA
df %>% filter(if_any(everything(), ~!is.na(.))) -> df
## if we now repeat the same command as above, we can see that there are no more fully NA observations
df %>% filter(if_all(everything(), ~ is.na(.)))
## # A tibble: 0 × 136
## # ℹ 136 variables: SEQN <dbl>, Gender <dbl>,
## # Age at Screening Adjudicated - Recode <dbl>, Race/Ethnicity - Recode <dbl>,
## # Marital Status <chr>, Education Level - Adults 20+ <chr>,
## # Annual Family Income <chr>, Family PIR <chr>, Energy (kcal) <chr>,
## # Protein (gm) <chr>, Carbohydrate (gm) <chr>, Total sugars (gm) <chr>,
## # Dietary fiber (gm) <chr>, Total fat (gm) <chr>, Vitamin C (mg) <chr>,
## # Vitamin K (mcg) <chr>, Calcium (mg) <chr>, Phosphorus (mg) <chr>, …
We can see that, while How much sleep do you get (hours)? is initially seen as a continuous variable, we need to remove the observations that have values equal to 77 and 99, as those are not the actual values for these observations. Also, the fact that the number 12 doesn’t actually represent 12 hours of sleep, but represents 12 or more hours of sleep, transforms this into an ordinal variable.
## this checks if `How much sleep do you get (hours)?` is 77 or 99, if it is either of those values, it replaces it with NA. If it's not, it keeps the value.
df %>% mutate(`How much sleep do you get (hours)?` = case_when(
(`How much sleep do you get (hours)?` == 77) | (`How much sleep do you get (hours)?` == 99) ~ NA,
.default = `How much sleep do you get (hours)?`)) -> df
For the first question, the command asks the student to perform t-tests and mann-whitney tests, using the dependent variable as something that relates to the primary outcome of the research question and the indepedent variable as a categorical, dichotomous variable that is included in the dataset.
Christel did something really cool, she created a new categorical
variable from the Caffeine (mg) continuous variable, using
the mean as a breakpoint. Let’s do the same in R.
## we first need to transform `Caffeine (mg)` to numeric, as the variables were read as 'chr'
## this is accomplished with the first call of mutate()
## then, we use cut() to create 2 categories from the original variable and
## assign it to the new cat.Caffeine variable
df %>% mutate (`Caffeine (mg)` = as.numeric(`Caffeine (mg)`)) %>%
mutate(cat.Caffeine = cut(`Caffeine (mg)`,
breaks = c(-Inf, mean(`Caffeine (mg)`, na.rm = TRUE), Inf),
labels = c("Low", "High"))) -> df
## let's see the new variable and the old one, side by side
df %>% select(contains("caffeine")) %>% head(10)
## # A tibble: 10 × 2
## `Caffeine (mg)` cat.Caffeine
## <dbl> <fct>
## 1 0 Low
## 2 0 Low
## 3 0 Low
## 4 NA <NA>
## 5 0 Low
## 6 0 Low
## 7 0 Low
## 8 142 High
## 9 0 Low
## 10 NA <NA>
df %>% select(cat.Caffeine) %>% skim()
| Name | Piped data |
| Number of rows | 10348 |
| Number of columns | 1 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| cat.Caffeine | 1160 | 0.89 | FALSE | 2 | Low: 8064, Hig: 1124 |
We can see here, that the number of observations doesn’t match what
Christel found in STATA. My understanding is that, when the command was
run in STATA, that the observations that should’ve been NA
got transformed to 0. This ends up distorting the data,
that’s one of the reasons why results won’t match.
Now, let’s run the first t-test.
## converting the dependent variable to numeric
df %>% mutate(`How much sleep do you get (hours)?` =
as.numeric(`How much sleep do you get (hours)?`)) -> df
t.test(df$`How much sleep do you get (hours)?` ~ df$cat.Caffeine)
##
## Welch Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$cat.Caffeine
## t = 1.6106, df = 1317.3, p-value = 0.1075
## alternative hypothesis: true difference in means between group Low and group High is not equal to 0
## 95 percent confidence interval:
## -0.01807278 0.18384851
## sample estimates:
## mean in group Low mean in group High
## 6.994089 6.911201
The result here is different than the one found in STATA. I don’t
have access to STATA right now, but I’d assume that this is due to the
command that was used or due to lack of transforming the 99
and 77 values to NA (as per the documentation,
see below).
Also, it looks like, when creating the variable in STATA,
observations that were NA were converted to 0
which could’ve influenced the results as well.
If you can, try to correct the dependent variable and the newly created independent variable and then try running the code below in STATA and let me know if the results match the above.
ttest howmuchsleepdoyougethours, by(caffeine_binary)
I want you to consider the following information that can be found in the documentation for the dataset:
This variable is not a continuous variable, but an ordinal one, due to the ceiling placed on it. Consider this when conducting the analysis for the data project. A good idea then, is to use tests that are non-parametric. The question also requires usage of non-parametric tests. Let’s see how this would be done in R.
wilcox.test(`How much sleep do you get (hours)?` ~ cat.Caffeine, data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by cat.Caffeine
## W = 3130001, p-value = 0.05302
## alternative hypothesis: true location shift is not equal to 0
This result shows that there is no statistically significant difference in the numbers of hours slept between those that consume more than the average amount of caffeine and those that consume less than the average.
Next, the same was done for At least one comorbidity and
Gender. Let’s reproduce it here.
# We need to convert those variables to factors, as they are categorical variables
# and they were read either as ´chr´ or ´dbl´
df %>% mutate(Gender = as_factor(Gender),
`At least one comorbidity` =
as_factor(`At least one comorbidity`)) -> df
# We can now conduct the tests
t.test(df$`How much sleep do you get (hours)?` ~ df$`At least one comorbidity`)
##
## Welch Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$`At least one comorbidity`
## t = 0.29368, df = 5333.6, p-value = 0.769
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.05995866 0.08108833
## sample estimates:
## mean in group 0 mean in group 1
## 6.987644 6.977079
wilcox.test(`How much sleep do you get (hours)?` ~ `At least one comorbidity`,
data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by At least one comorbidity
## W = 6694904, p-value = 0.8445
## alternative hypothesis: true location shift is not equal to 0
t.test(df$`How much sleep do you get (hours)?` ~ df$Gender)
##
## Welch Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$Gender
## t = -0.89177, df = 7930.7, p-value = 0.3725
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.09489903 0.03555332
## sample estimates:
## mean in group 1 mean in group 2
## 6.966361 6.996034
wilcox.test(`How much sleep do you get (hours)?` ~ Gender,
data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by Gender
## W = 7862981, p-value = 0.6041
## alternative hypothesis: true location shift is not equal to 0
The interpretation is that, none of the tests above rejected the null hypothesis. The null for the t-test is that there is no difference in means between the groups. The null in the Mann-Whitney is that the sum of ranks between the two groups is not different.
Let’s test the tests without correcting the values for the dependent variable to see if we reach the same results as Tatiana found in her assignment.
## let's read the data again, to have it in a unchanged manner
read_excel("PPCR_2005_2006.xlsx", na = ".") -> df
## let's convert the variables to their appropriate types
df %>% mutate(Gender = as_factor(Gender),
`At least one comorbidity` =
as_factor(`At least one comorbidity`),
`How much sleep do you get (hours)?` =
as.numeric(`How much sleep do you get (hours)?`),
`HH Eat less than should` =
as_factor(`HH Eat less than should`)) -> df
# need to specify var.equal = TRUE, as stata assumes equal variances as the
# default
t.test(df$`How much sleep do you get (hours)?` ~ df$`At least one comorbidity`,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$`At least one comorbidity`
## t = 1.0404, df = 7653, p-value = 0.2982
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.0869017 0.2834865
## sample estimates:
## mean in group 0 mean in group 1
## 7.169193 7.070901
wilcox.test(`How much sleep do you get (hours)?` ~ `At least one comorbidity`,
data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by At least one comorbidity
## W = 6721973, p-value = 0.792
## alternative hypothesis: true location shift is not equal to 0
t.test(df$`How much sleep do you get (hours)?` ~ df$`HH Eat less than should`,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$`HH Eat less than should`
## t = -0.38185, df = 2208, p-value = 0.7026
## alternative hypothesis: true difference in means between group 2 and group 1 is not equal to 0
## 95 percent confidence interval:
## -0.4666302 0.3145237
## sample estimates:
## mean in group 2 mean in group 1
## 7.070485 7.146538
wilcox.test(`How much sleep do you get (hours)?` ~ `HH Eat less than should`,
data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by HH Eat less than should
## W = 508284, p-value = 0.2577
## alternative hypothesis: true location shift is not equal to 0
t.test(df$`How much sleep do you get (hours)?` ~ df$Gender,
var.equal = TRUE)
##
## Two Sample t-test
##
## data: df$`How much sleep do you get (hours)?` by df$Gender
## t = 0.62704, df = 7969, p-value = 0.5307
## alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
## 95 percent confidence interval:
## -0.1160423 0.2251963
## sample estimates:
## mean in group 1 mean in group 2
## 7.153611 7.099034
wilcox.test(`How much sleep do you get (hours)?` ~ Gender,
data = df)
##
## Wilcoxon rank sum test with continuity correction
##
## data: How much sleep do you get (hours)? by Gender
## W = 7895281, p-value = 0.6499
## alternative hypothesis: true location shift is not equal to 0
Fortunately, I was able to reproduce the results found by Tatiana. Keep in mind, then, that one needs to recode the dependent variable to remove the observations that are not number of hours, but codes that represent questionnaire answers. Also, it’s important to remember, for the data project, that the dependent variable here shouldn’t be considered a normal variable, due to the ceiling imposed by the way the questionnaire was built.