Overview: In this lab exercise, you will explore associations between a binary variable and a continuous variable.
Objectives: At the end of this lab you will be able to:
Create a subdirectory named Lab 7 in the
PUBHBIO 2210 Labs directory you created in your OneDrive
folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-07-continuous-blank.htmllab-07-continuous-blank.Rmdlab-07-continuous-worksheet-blank.docxnhanes.RDataIf you have not downloaded all of these files, do so now.
Save the four downloaded files in the
PUBHBIO 2210 Labs/Lab 7 directory (i.e., save the
downloaded files in the Lab 7 directory or folder created).
When working on labs, it is important to keep all related files in the
same directory.
Change the author and date information in the lab header.
Load and print the prepared dataset from nhanes.RData.
Recall labs 1 and 2 on how to load a .RData file and
print.
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region pir yrsEducation
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct> <dbl> <int>
## 1 1 black not hisp… fema… 56 1 metr… midwe… 0.666 4
## 2 2 white not hisp… fema… 73 1 other west 3.45 13
## 3 3 white not hisp… fema… 25 2 metr… south 3.97 12
## 4 4 white mexican-… fema… 53 2 other south NA 0
## 5 5 white mexican-… fema… 68 2 other south NA 6
## 6 6 white not hisp… fema… 44 3 other west 5.50 14
## 7 7 black not hisp… fema… 28 2 metr… south 5.82 15
## 8 8 white not hisp… male 74 2 other midwe… 2.51 12
## 9 9 white not hisp… fema… 65 1 other north… 1.20 12
## 10 10 white other hi… fema… 61 3 metr… west 1.88 1
## # ℹ 90 more rows
## # ℹ 23 more variables: maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>, wine <int>,
## # liquor <int>, everSmoke <fct>, smokeNow <fct>, active <ord>, SBP <int>,
## # DBP <int>, weightKg <dbl>, heightCm <dbl>, waist <dbl>, tricep <dbl>,
## # thigh <dbl>, BMD <dbl>, RBC <dbl>, lead <dbl>, cholesterol <int>,
## # triglyceride <int>, hdl <int>
We are interested what factors might be associated with lead levels
in the blood (lead). We will investigate whether or not if
an individual has ever smoked (everSmoke) is related to
blood lead levels (lead). In other words, we want to see if
the binary predictor variable everSmoke is associated with
the continuous outcome variable lead.
We will first produce a scatterplot of lead by
everSmoke using the gf_point() command.
For example,
# Not evaluated
gf_point(height ~ sex, data = mydata)
will produce a scatterplot of height versus
sex.
In the code chunk below, produce the scatterplot of lead
versus everSmoke.
# Enter code here
gf_point(lead ~ everSmoke, data = nhanes)
## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_point()`).
There are a few problems with this plot:
Most importantly, the scatterplot contains points in just two
columns, which is not very easy to interpret, because there may be data
points overlapping. If we jitter the points—shift them
left and right—it will be easier to see where the data points lie. This
can be done with the gf_jitter() function.
There is a third, empty column, NA. This is
because some subjects have a missing value of everSmoke.
Filtering out such subjects from the dataset will remove this
column.
We also get a warning that 8 rows containing missing values have
been removed. In this case, that means rows with missing values of
lead. This warning won’t hurt the plot, but we could get
rid of it by also filtering out subjects with missing values of
lead.
For example,
# Not evaluated
gf_jitter(height ~ sex,
data = nhanes %>% filter(!is.na(height), !is.na(sex)),
width = 0.25, height = 0)
will remove subjects with missing values of height
and/or sex, then plot height versus
sex with horizontal jitter at 25% of the maximum amount of
jitter, and no vertical jitter.
In the code chunk below, create the jittered scatterplot (at 25% of
the maximum amount of jitter) of lead versus
everSmoke with no warnings and no NA
category (i.e., remove missing values for lead and
everSmoke from the plot).
# Enter code here
gf_jitter(lead ~ everSmoke,
data = nhanes %>% filter(!is.na(lead), !is.na(everSmoke)),
width = 0.25, height = 0)
Two other useful plots to look at are boxplots and histograms. A comparison using boxplots may be produced similarly to a jittered scatterplot.
For example,
# Not evaluated
gf_boxplot(height ~ sex, data = mydata%>% filter(!is.na(sex), !is.na(height)))
would produce side-by-side boxplots of height separated
by sex with no missing values of height and/or
sex.
In the code chucnk below, produce side-by-side boxplots of
lead separated by everSmoke with no missing
values of lead and/or everSmoke.
# Enter code here
gf_boxplot(lead ~ everSmoke, data = nhanes%>% filter(!is.na(lead), !is.na(everSmoke)))
Histograms are a bit different, since we can’t place two histograms on the same plot while sharing axes.
Instead, we produce a panel of histograms, really two separate histograms side-by-side.
For example,
# Not evaluated
gf_histogram( ~ height | sex,
data = mydata%>% filter(!is.na(sex), !is.na(height)),
binwidth=2, color = "black", fill = "firebrick", las = 1,
xlab = "Height", ylab = "Frequency")
produces two separate histograms of height, one for
males and one for females, and places them side-by-side with no missing
values of height and/or sex.
In the code chunk below, create a panel of histograms for
lead separated by everSmoke with no missing
values of lead and/or everSmoke.
# Enter code here
gf_histogram(~ lead | everSmoke,
data = nhanes %>% filter(!is.na(lead), !is.na(everSmoke)),
binwidth = 1, color = "black", fill = "orange",
xlab = "Blood Lead Level", ylab = "Frequency")
Of course, just looking at plots is not enough. The next step will be
to look at summary statistics such as means for each group. The
favstats() commanprovides several statistical summaries,
and can be used to compare groups.
For example,
# Not evaluated
favstats(height ~ sex, data = mydata)
produces the summaries that you have previously used
favstats() to obtain, but does so separately for males and
females.
Use favstats() to summarize lead for each
level of everSmoke.
# Enter code here
favstats(lead ~ everSmoke, data = nhanes)
## everSmoke min Q1 median Q3 max mean sd n missing
## 1 no 0.7 1.55 2.6 4.25 11.1 3.188235 2.364711 51 4
## 2 yes 0.7 2.50 3.7 5.40 13.3 4.321951 2.643909 41 3
We can see that in our sample the people who ever smoked had a higher average lead level. But can we generalize this to the population? Or might this be due to chance? To test if there is a difference in mean lead levels between ever-smokers and never-smokers in the population we will use a two-sample t-test.
To perform the test, we will use the t.test() function
with the var.equal = TRUE argument specified. This will
perform the version of the two-sample t-test that assumes equal
variances in the two groups.
For example,
# Not evaluated
t.test(height ~ sex, data = mydata, conf.level = 0.95,
alternative="two.sided", var.equal = TRUE)
performs an equal-variance (i.e., var.equal = TRUE) t-test to compare
the means of height in the groups defined by
sex. It also generates the 95% confidence interval (which
is the defualt) for the difference in the means of height
in the groups defined by sex .
In the code chunk below, perform an equal-variance t-test (with 95%
confidence interval) to compare the means of lead in the
groups defined by everSmoke.
# Enter code here
t.test(lead ~ everSmoke, data = nhanes, conf.level = 0.95,
alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: lead by everSmoke
## t = -2.1683, df = 90, p-value = 0.03277
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -2.17245672 -0.09497513
## sample estimates:
## mean in group no mean in group yes
## 3.188235 4.321951
There is a lot of output displayed; the test statistic is
t = line in second line of the output and the p-value also
appears in that line. For this test, we can see that the two-sided
p-value is 0.03277.
Notice that on the Two Sample t-test output where it
says sample estimates, the group means are presented. The
difference between these two is the difference in the mean lead values
for ever-smokers and never-smokers. R also produces a 95% confidence
interval for this difference which appears under the
95 percent confidence interval: heading.
Suppose you were worried about the assumption that the two groups had
equal variance, and wanted to perform the t-test that does not assume
equal variance in the two groups. To do this, in the code chunk below,
repeat your code in the last step BUT omit the
var.equal = TRUE argument (or write
var.equal = FALSE).
# Enter code here
t.test(lead ~ everSmoke, data = nhanes, conf.level = 0.95,
alternative = "two.sided", var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lead by everSmoke
## t = -2.142, df = 81.144, p-value = 0.03519
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -2.18679077 -0.08064108
## sample estimates:
## mean in group no mean in group yes
## 3.188235 4.321951
Remember, for this version of the t-test, you are testing the exact same hypotheses as the other version of the t-test but you are just relaxing the assumption of equal variance in the two groups. In general, you can use which ever version of the t-test you think is most appropriate, as long as you clearly state which one you are using.
You can also make the removal of the equal variances assumption
explicit by using the argument var.equal = FALSE.
Look back at your plots that show the distribution of lead by ever smoked status (Question 1). Within each group, the distribution of lead appears to be right-skewed. The assumption of a normal distribution within each group may be violated for these data. The t-test works pretty well even if this assumption is violated, as long as sample size is reasonably large (as it is here). But we can use a transformation to better approximate normality and make sure we draw similar conclusions.
A commonly used transformation for right-skewed data is the natural log transformation. This can only be used if all data values are >0 since the log is not defined for 0 or negative values. Fortunately, the lead values are all positive. We will log-transform the data and then perform a t-test to make sure we draw the same conclusions as we did on the untransformed data.
For example,
# Enter code here
mydata <- mydata %>% mutate(heightInches = height / 2.54)
mydata %>% select(height, heightInches)
uses the mutate() function to create a new variable
heightInches that is equal to height / 2.54.
Then, using the select() funtion, it prints the first few
values in the height and heightInches
variables/columns of the dataset mydata.
In the code chunk below, use the mutate() function to
create a new variable lnlead that is equal to
log(lead). Then, use the select() function to
print the first few values in the id, lead,
and lnlead variables/columns. .
# Enter code here
nhanes <- nhanes %>% mutate(lnlead = log(lead))
nhanes %>% select(id, lead, lnlead)
## # A tibble: 100 × 3
## id lead lnlead
## <int> <dbl> <dbl>
## 1 1 5 1.61
## 2 2 NA NA
## 3 3 1.3 0.262
## 4 4 3.7 1.31
## 5 5 NA NA
## 6 6 0.7 -0.357
## 7 7 0.7 -0.357
## 8 8 9 2.20
## 9 9 4.7 1.55
## 10 10 2.8 1.03
## # ℹ 90 more rows
To convince yourself that R did the right calculation, look at the
first person in the data set – the top row, the subject with ID number
1. Looking under the lead column we see that this person’s lead value
was 5. So that means that the natural-log transformed value should be
ln(5) = 1.60943791. Look under the new column lnlead and
check that this is in fact the value (it may be rounded to fewer decimal
places).
Next, we will look at the distribution of the natural-log transformed
lead values by ever-smoked status. Similar to codes in part 1, in the
code chunk below, create a jittered scatterplot of lnlead
versus everSmoke with no warnings and no
NA category (i.e., remove missing values for
lnlead and everSmoke from the plot). In
addition, produce side-by-side boxplots and side-by-side histograms (use
bins=7 or binwidth=0.5) with no missing values
of lnlead and/or everSmoke. Here, your chunks
of code before you answered Question 1 will be helpful.
# Enter code here
gf_jitter(lnlead ~ everSmoke,
data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)),
width = 0.25, height = 0)
gf_boxplot(lnlead ~ everSmoke,
data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)))
gf_histogram(~ lnlead | everSmoke,
data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)),
binwidth = 0.5, color = "black", fill = "red",
xlab = "Log Blood Lead Level", ylab = "Frequency")
Using the favstats() function, calculate the means and standard
deviations of lnlead for each level ofeverSmoke`.
# Enter code here
favstats(lnlead ~ everSmoke, data = nhanes)
## everSmoke min Q1 median Q3 max mean sd
## 1 no -0.3566749 0.4377344 0.9555114 1.446296 2.406945 0.901485 0.7482624
## 2 yes -0.3566749 0.9162907 1.3083328 1.686399 2.587764 1.294589 0.6007439
## n missing
## 1 51 4
## 2 41 3
Think back to when we talked about summary statistics. If we take the
natural log of data values, then take a mean, and then back-transform
this mean, you get the geometric mean. So, if you exponentiate the means
that were calculated using the ln-transformed data (i.e.,
lnlead) you will get the geometric means.
In the code chunk below, compute the geometric mean blood lead
concentration for never-smokers and for ever-smokers. Here, you will
need to simply use the exp() function on the respective
sample means you computed for lnlead for each level of
everSmoke using the favstats() function above. For example,
if say 0.563729 is the mean of the log measurements for
group 1 after using the favstats() function, then
exp(0.563729) is the geometric mean for group 1.
# Enter code here
meanneverSmoke <- 0.901485
meaneverSmoke <- 1.294589
exp(meanneverSmoke)
## [1] 2.463258
exp(meaneverSmoke)
## [1] 3.649496
Now, in the code chunk below, perform a two-sided two-sample t-test
(that does not assume equal variances, i.e.,
var.equal = FALSE) with 95% confidence interval on these
log-transformed data values (lnlead) using the steps you
used before to compare the means of lnlead in the groups
defined by everSmoke.
Notice that your hypotheses for this test are slightly different than they were before, because we are now talking about log-transformed lead values. Your hypotheses are now:
Ho: population mean log-lead values are the same for ever-smokers and never-smokers
Ha: population mean log-lead values are different for ever-smokers and never-smokers
# Enter code here
t.test(lnlead ~ everSmoke,
data = nhanes %>% filter,
conf.level = 0.95,
alternative = "two.sided",
var.equal = FALSE)
##
## Welch Two Sample t-test
##
## data: lnlead by everSmoke
## t = -2.795, df = 90, p-value = 0.006342
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -0.6725169 -0.1136906
## sample estimates:
## mean in group no mean in group yes
## 0.901485 1.294589
In this case, your conclusions agreed for the two tests – there is a difference in blood lead levels between ever-smokers and never-smokers. The sample size was large enough that the untransformed test was probably okay (and is definitely easier to interpret). But you should always be checking to make sure that the necessary assumptions are met, because it is possible to draw the incorrect conclusion if you violate any of the assumptions.
As a final test, we are interested in whether average cholesterol
levels are different for people who have never smoked compared to people
who have ever smoked. Using the variables cholesterol and
everSmoke, perform the appropriate test (with a 95%
confidence interval) using the assumption of equal variances.
# Enter code here
t.test(cholesterol ~ everSmoke,
data = nhanes %>% filter,
conf.level = 0.95,
alternative = "two.sided",
var.equal = TRUE)
##
## Two Sample t-test
##
## data: cholesterol by everSmoke
## t = 0.92484, df = 90, p-value = 0.3575
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
## -9.766488 26.779401
## sample estimates:
## mean in group no mean in group yes
## 213.8235 205.3171
Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.