Introduction

Overview: In this lab exercise, you will explore associations between a binary variable and a continuous variable.

Objectives: At the end of this lab you will be able to:

Part 0: Download and organize files

Part 1: Scatterplots, boxplots, and histograms by groups

Load and print the prepared dataset from nhanes.RData. Recall labs 1 and 2 on how to load a .RData file and print.

# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region    pir yrsEducation
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct>   <dbl>        <int>
##  1     1 black not hisp… fema…    56          1 metr… midwe…  0.666            4
##  2     2 white not hisp… fema…    73          1 other west    3.45            13
##  3     3 white not hisp… fema…    25          2 metr… south   3.97            12
##  4     4 white mexican-… fema…    53          2 other south  NA                0
##  5     5 white mexican-… fema…    68          2 other south  NA                6
##  6     6 white not hisp… fema…    44          3 other west    5.50            14
##  7     7 black not hisp… fema…    28          2 metr… south   5.82            15
##  8     8 white not hisp… male     74          2 other midwe…  2.51            12
##  9     9 white not hisp… fema…    65          1 other north…  1.20            12
## 10    10 white other hi… fema…    61          3 metr… west    1.88             1
## # ℹ 90 more rows
## # ℹ 23 more variables: maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>, wine <int>,
## #   liquor <int>, everSmoke <fct>, smokeNow <fct>, active <ord>, SBP <int>,
## #   DBP <int>, weightKg <dbl>, heightCm <dbl>, waist <dbl>, tricep <dbl>,
## #   thigh <dbl>, BMD <dbl>, RBC <dbl>, lead <dbl>, cholesterol <int>,
## #   triglyceride <int>, hdl <int>

We are interested what factors might be associated with lead levels in the blood (lead). We will investigate whether or not if an individual has ever smoked (everSmoke) is related to blood lead levels (lead). In other words, we want to see if the binary predictor variable everSmoke is associated with the continuous outcome variable lead.

We will first produce a scatterplot of lead by everSmoke using the gf_point() command.

For example,

# Not evaluated
gf_point(height ~ sex, data = mydata)

will produce a scatterplot of height versus sex.

In the code chunk below, produce the scatterplot of lead versus everSmoke.

# Enter code here
gf_point(lead ~ everSmoke, data = nhanes)
## Warning: Removed 8 rows containing missing values or values outside the scale range
## (`geom_point()`).

There are a few problems with this plot:

  1. Most importantly, the scatterplot contains points in just two columns, which is not very easy to interpret, because there may be data points overlapping. If we jitter the points—shift them left and right—it will be easier to see where the data points lie. This can be done with the gf_jitter() function.

  2. There is a third, empty column, NA. This is because some subjects have a missing value of everSmoke. Filtering out such subjects from the dataset will remove this column.

  3. We also get a warning that 8 rows containing missing values have been removed. In this case, that means rows with missing values of lead. This warning won’t hurt the plot, but we could get rid of it by also filtering out subjects with missing values of lead.

For example,

# Not evaluated
gf_jitter(height ~ sex,
          data = nhanes %>% filter(!is.na(height), !is.na(sex)),
          width = 0.25, height = 0)

will remove subjects with missing values of height and/or sex, then plot height versus sex with horizontal jitter at 25% of the maximum amount of jitter, and no vertical jitter.

In the code chunk below, create the jittered scatterplot (at 25% of the maximum amount of jitter) of lead versus everSmoke with no warnings and no NA category (i.e., remove missing values for lead and everSmoke from the plot).

# Enter code here
gf_jitter(lead ~ everSmoke,
          data = nhanes %>% filter(!is.na(lead), !is.na(everSmoke)),
          width = 0.25, height = 0)

Two other useful plots to look at are boxplots and histograms. A comparison using boxplots may be produced similarly to a jittered scatterplot.

For example,

# Not evaluated
gf_boxplot(height ~ sex, data = mydata%>% filter(!is.na(sex), !is.na(height)))

would produce side-by-side boxplots of height separated by sex with no missing values of height and/or sex.

In the code chucnk below, produce side-by-side boxplots of lead separated by everSmoke with no missing values of lead and/or everSmoke.

# Enter code here
gf_boxplot(lead ~ everSmoke, data = nhanes%>% filter(!is.na(lead), !is.na(everSmoke)))

Histograms are a bit different, since we can’t place two histograms on the same plot while sharing axes.

Instead, we produce a panel of histograms, really two separate histograms side-by-side.

For example,

# Not evaluated
gf_histogram( ~ height | sex, 
              data = mydata%>% filter(!is.na(sex), !is.na(height)),
              binwidth=2, color = "black", fill = "firebrick", las = 1,
              xlab = "Height", ylab = "Frequency")

produces two separate histograms of height, one for males and one for females, and places them side-by-side with no missing values of height and/or sex.

In the code chunk below, create a panel of histograms for lead separated by everSmoke with no missing values of lead and/or everSmoke.

# Enter code here
gf_histogram(~ lead | everSmoke,
             data = nhanes %>% filter(!is.na(lead), !is.na(everSmoke)),
             binwidth = 1, color = "black", fill = "orange",
             xlab = "Blood Lead Level", ylab = "Frequency")

STOP! Answer Question 1 now.

Part 2: Two-sample t-test with equal variance

Of course, just looking at plots is not enough. The next step will be to look at summary statistics such as means for each group. The favstats() commanprovides several statistical summaries, and can be used to compare groups.

For example,

# Not evaluated
favstats(height ~ sex, data = mydata)

produces the summaries that you have previously used favstats() to obtain, but does so separately for males and females.

Use favstats() to summarize lead for each level of everSmoke.

# Enter code here
favstats(lead ~ everSmoke, data = nhanes)
##   everSmoke min   Q1 median   Q3  max     mean       sd  n missing
## 1        no 0.7 1.55    2.6 4.25 11.1 3.188235 2.364711 51       4
## 2       yes 0.7 2.50    3.7 5.40 13.3 4.321951 2.643909 41       3

STOP! Answer Question 2 now.

We can see that in our sample the people who ever smoked had a higher average lead level. But can we generalize this to the population? Or might this be due to chance? To test if there is a difference in mean lead levels between ever-smokers and never-smokers in the population we will use a two-sample t-test.

STOP! Answer Question 3 now.

To perform the test, we will use the t.test() function with the var.equal = TRUE argument specified. This will perform the version of the two-sample t-test that assumes equal variances in the two groups.

For example,

# Not evaluated
t.test(height ~ sex, data = mydata, conf.level = 0.95, 
       alternative="two.sided", var.equal = TRUE)

performs an equal-variance (i.e., var.equal = TRUE) t-test to compare the means of height in the groups defined by sex. It also generates the 95% confidence interval (which is the defualt) for the difference in the means of height in the groups defined by sex .

In the code chunk below, perform an equal-variance t-test (with 95% confidence interval) to compare the means of lead in the groups defined by everSmoke.

# Enter code here
t.test(lead ~ everSmoke, data = nhanes, conf.level = 0.95, 
       alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  lead by everSmoke
## t = -2.1683, df = 90, p-value = 0.03277
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -2.17245672 -0.09497513
## sample estimates:
##  mean in group no mean in group yes 
##          3.188235          4.321951

There is a lot of output displayed; the test statistic is t = line in second line of the output and the p-value also appears in that line. For this test, we can see that the two-sided p-value is 0.03277.

STOP! Answer Question 4 now.

Notice that on the Two Sample t-test output where it says sample estimates, the group means are presented. The difference between these two is the difference in the mean lead values for ever-smokers and never-smokers. R also produces a 95% confidence interval for this difference which appears under the 95 percent confidence interval: heading.

STOP! Answer Question 5 now.

Part 3: Two-sample t-test with unequal variance

Suppose you were worried about the assumption that the two groups had equal variance, and wanted to perform the t-test that does not assume equal variance in the two groups. To do this, in the code chunk below, repeat your code in the last step BUT omit the var.equal = TRUE argument (or write var.equal = FALSE).

# Enter code here
t.test(lead ~ everSmoke, data = nhanes, conf.level = 0.95,
       alternative = "two.sided", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lead by everSmoke
## t = -2.142, df = 81.144, p-value = 0.03519
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -2.18679077 -0.08064108
## sample estimates:
##  mean in group no mean in group yes 
##          3.188235          4.321951

Remember, for this version of the t-test, you are testing the exact same hypotheses as the other version of the t-test but you are just relaxing the assumption of equal variance in the two groups. In general, you can use which ever version of the t-test you think is most appropriate, as long as you clearly state which one you are using.

You can also make the removal of the equal variances assumption explicit by using the argument var.equal = FALSE.

STOP! Answer Question 6 now.

Part 4: Log-transformations and geometric means

Look back at your plots that show the distribution of lead by ever smoked status (Question 1). Within each group, the distribution of lead appears to be right-skewed. The assumption of a normal distribution within each group may be violated for these data. The t-test works pretty well even if this assumption is violated, as long as sample size is reasonably large (as it is here). But we can use a transformation to better approximate normality and make sure we draw similar conclusions.

A commonly used transformation for right-skewed data is the natural log transformation. This can only be used if all data values are >0 since the log is not defined for 0 or negative values. Fortunately, the lead values are all positive. We will log-transform the data and then perform a t-test to make sure we draw the same conclusions as we did on the untransformed data.

For example,

# Enter code here
mydata <- mydata %>% mutate(heightInches = height / 2.54)
mydata %>% select(height, heightInches)

uses the mutate() function to create a new variable heightInches that is equal to height / 2.54. Then, using the select() funtion, it prints the first few values in the height and heightInches variables/columns of the dataset mydata.

In the code chunk below, use the mutate() function to create a new variable lnlead that is equal to log(lead). Then, use the select() function to print the first few values in the id, lead, and lnlead variables/columns. .

# Enter code here
nhanes <- nhanes %>% mutate(lnlead = log(lead))
nhanes %>% select(id, lead, lnlead) 
## # A tibble: 100 × 3
##       id  lead lnlead
##    <int> <dbl>  <dbl>
##  1     1   5    1.61 
##  2     2  NA   NA    
##  3     3   1.3  0.262
##  4     4   3.7  1.31 
##  5     5  NA   NA    
##  6     6   0.7 -0.357
##  7     7   0.7 -0.357
##  8     8   9    2.20 
##  9     9   4.7  1.55 
## 10    10   2.8  1.03 
## # ℹ 90 more rows

To convince yourself that R did the right calculation, look at the first person in the data set – the top row, the subject with ID number 1. Looking under the lead column we see that this person’s lead value was 5. So that means that the natural-log transformed value should be ln(5) = 1.60943791. Look under the new column lnlead and check that this is in fact the value (it may be rounded to fewer decimal places).

Next, we will look at the distribution of the natural-log transformed lead values by ever-smoked status. Similar to codes in part 1, in the code chunk below, create a jittered scatterplot of lnlead versus everSmoke with no warnings and no NA category (i.e., remove missing values for lnlead and everSmoke from the plot). In addition, produce side-by-side boxplots and side-by-side histograms (use bins=7 or binwidth=0.5) with no missing values of lnlead and/or everSmoke. Here, your chunks of code before you answered Question 1 will be helpful.

# Enter code here
gf_jitter(lnlead ~ everSmoke,
          data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)),
          width = 0.25, height = 0)

gf_boxplot(lnlead ~ everSmoke,
           data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)))

gf_histogram(~ lnlead | everSmoke,
             data = nhanes %>% filter(!is.na(lnlead), !is.na(everSmoke)),
             binwidth = 0.5, color = "black", fill = "red",
             xlab = "Log Blood Lead Level", ylab = "Frequency")

STOP! Answer Question 7 now.

Using the favstats() function, calculate the means and standard deviations of lnlead for each level ofeverSmoke`.

# Enter code here
favstats(lnlead ~ everSmoke, data = nhanes)
##   everSmoke        min        Q1    median       Q3      max     mean        sd
## 1        no -0.3566749 0.4377344 0.9555114 1.446296 2.406945 0.901485 0.7482624
## 2       yes -0.3566749 0.9162907 1.3083328 1.686399 2.587764 1.294589 0.6007439
##    n missing
## 1 51       4
## 2 41       3

STOP! Answer Question 8 now.

Think back to when we talked about summary statistics. If we take the natural log of data values, then take a mean, and then back-transform this mean, you get the geometric mean. So, if you exponentiate the means that were calculated using the ln-transformed data (i.e., lnlead) you will get the geometric means.

In the code chunk below, compute the geometric mean blood lead concentration for never-smokers and for ever-smokers. Here, you will need to simply use the exp() function on the respective sample means you computed for lnlead for each level of everSmoke using the favstats() function above. For example, if say 0.563729 is the mean of the log measurements for group 1 after using the favstats() function, then exp(0.563729) is the geometric mean for group 1.

# Enter code here
meanneverSmoke <- 0.901485  
meaneverSmoke <- 1.294589   

exp(meanneverSmoke)  
## [1] 2.463258
exp(meaneverSmoke)
## [1] 3.649496

STOP! Answer Questions 9–10 now.

Now, in the code chunk below, perform a two-sided two-sample t-test (that does not assume equal variances, i.e., var.equal = FALSE) with 95% confidence interval on these log-transformed data values (lnlead) using the steps you used before to compare the means of lnlead in the groups defined by everSmoke.

Notice that your hypotheses for this test are slightly different than they were before, because we are now talking about log-transformed lead values. Your hypotheses are now:

Ho: population mean log-lead values are the same for ever-smokers and never-smokers

Ha: population mean log-lead values are different for ever-smokers and never-smokers

# Enter code here
t.test(lnlead ~ everSmoke,
       data = nhanes %>% filter,
       conf.level = 0.95,
       alternative = "two.sided",
       var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  lnlead by everSmoke
## t = -2.795, df = 90, p-value = 0.006342
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -0.6725169 -0.1136906
## sample estimates:
##  mean in group no mean in group yes 
##          0.901485          1.294589

STOP! Answer Question 11 now.

In this case, your conclusions agreed for the two tests – there is a difference in blood lead levels between ever-smokers and never-smokers. The sample size was large enough that the untransformed test was probably okay (and is definitely easier to interpret). But you should always be checking to make sure that the necessary assumptions are met, because it is possible to draw the incorrect conclusion if you violate any of the assumptions.

As a final test, we are interested in whether average cholesterol levels are different for people who have never smoked compared to people who have ever smoked. Using the variables cholesterol and everSmoke, perform the appropriate test (with a 95% confidence interval) using the assumption of equal variances.

# Enter code here
t.test(cholesterol ~ everSmoke,
       data = nhanes %>% filter,
       conf.level = 0.95,
       alternative = "two.sided",
       var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  cholesterol by everSmoke
## t = 0.92484, df = 90, p-value = 0.3575
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -9.766488 26.779401
## sample estimates:
##  mean in group no mean in group yes 
##          213.8235          205.3171

STOP! Answer Question 12 now.

Please turn in your completed worksheet (DOCX, i.e., word document), and your RMD file and updated HTML file to Carmen by the due date. Here, ensure to upload all the three (3) files before you click on the “Submit Assignment” tab to complete your submission.