Lab 5

Introduction

Overview: In this lab exercise, you will learn how to conduct one-sample tests for means and proportions.

Objectives: At the end of this lab you will be able to:

Conduct a z-test to test if a population proportion is equal to a specific value;
Conduct a t-test to test if a population mean is equal to a specific value;
Interpret the results of these tests and describe the results in “layman’s terms.”

Part 0: Download and organize files

Many tasks and commands that were explained in the first lab will be used here with less direction.

Refer to the first lab (Lab 1) if more direction is needed.

Create a subdirectory named Lab 5 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
1. lab-05-testing-blank.html
2. lab-05-testing-blank.Rmd
3. lab-05-testing-worksheet-blank.docx
4. nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 5 directory you just created. When working on labs, it is important to keep all related files in the same directory.
Change the author and date information in the lab header.
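If you would like to confirm that all four files ended up in the same place, a quick check along these lines can help. This is only a sketch: it assumes your R working directory is the Lab 5 folder, and missing_lab_files() is a hypothetical helper name, not a function from any package.

```r
# The four file names listed in the steps above
lab_files <- c("lab-05-testing-blank.html",
               "lab-05-testing-blank.Rmd",
               "lab-05-testing-worksheet-blank.docx",
               "nhanes.RData")

# Hypothetical helper: return the names of any lab files not found in `dir`
missing_lab_files <- function(dir = ".") {
  lab_files[!file.exists(file.path(dir, lab_files))]
}

# An empty result, character(0), means all four files are present
missing_lab_files()
```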

Part 1: Hypothesis tests for a proportion

In the code chunk below, load the nhanes.RData file and print it. See the latter part of lab 1 and lab 2 for how to load a .RData file and print an object (a dataset in R).

# Enter code here
load("nhanes.RData")
print(nhanes)

## # A tibble: 100 × 33
##       id race  ethnicity sex     age familySize urban region
##    <int> <fct> <fct>     <fct> <int>      <int> <fct> <fct> 
##  1     1 black not hisp… fema…    56          1 metr… midwe…
##  2     2 white not hisp… fema…    73          1 other west  
##  3     3 white not hisp… fema…    25          2 metr… south 
##  4     4 white mexican-… fema…    53          2 other south 
##  5     5 white mexican-… fema…    68          2 other south 
##  6     6 white not hisp… fema…    44          3 other west  
##  7     7 black not hisp… fema…    28          2 metr… south 
##  8     8 white not hisp… male     74          2 other midwe…
##  9     9 white not hisp… fema…    65          1 other north…
## 10    10 white other hi… fema…    61          3 metr… west  
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## #   maritalStatus <fct>, healthStatus <ord>,
## #   heightInSelf <int>, weightLbSelf <int>, beer <int>,
## #   wine <int>, liquor <int>, everSmoke <fct>,
## #   smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## #   weightKg <dbl>, heightCm <dbl>, waist <dbl>, …

First we will make a frequency table using the tally() function. In the code chunk below, make a frequency table for the variable everSmoke, which tells us if a subject reports having ever smoked. See lab 2 for help.

# Enter code here
tally(~ everSmoke, data = nhanes)

## everSmoke
##   no  yes <NA> 
##   55   44    1

Notice that about 44.44% (44/99) of your sample reports having smoked. You would like to test whether 50% (p=0.50) of the population (that this sample came from) has ever smoked, based on your sample. You will perform a 2-sided test because it would be interesting if more than 50% smoked or if less than 50% smoked. By default, you should always conduct a 2-sided test unless you are specifically told otherwise.
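The 44/99 figure comes from dropping the one missing response before dividing. A minimal base R sketch of that arithmetic follows; the vector is rebuilt from the counts in the frequency table above (its ordering is arbitrary and it is not the actual nhanes column), and the object names are made up for illustration.

```r
# Rebuild the responses from the table: 55 "no", 44 "yes", 1 missing
ever_smoke <- c(rep("no", 55), rep("yes", 44), NA)

n_valid <- sum(!is.na(ever_smoke))                  # non-missing responses
n_yes   <- sum(ever_smoke == "yes", na.rm = TRUE)   # "yes" responses
p_hat   <- n_yes / n_valid                          # sample proportion

round(100 * p_hat, 2)   # percentage who report ever smoking
```

Note that the denominator is 99, not 100, because the missing value is excluded.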

STOP! Answer Question 1 now.

Since you are testing whether a proportion is equal to a specific value, you will use a one-sample z-test or an “exact” binomial test. The binom.test() function performs an “exact” binomial test.

For example:

# Not evaluated
binom.test( ~ urban, data = mydata,
            p=0.30, conf.level = 0.95,
            alternative="two.sided",
            ci.method = "Wilson",
            success = "metro area of 1 million")

performs a hypothesis test to see whether 30% (p=0.30) of the population (that this sample came from) lives in a metro area of 1 million, based on your sample. It also creates a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion. Note here that success = "......" determines which level of the variable is counted as a success (in this case "metro area of 1 million").

Note: the name of the success argument is an unfortunate hold-over from introducing proportion tests using coin flips, where success = "head" is intuitive. It is not meant to be normative: in a proportion test on race or sex, for example, “success” simply labels the level being counted.
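For comparison, the one-sample z-test mentioned earlier can be worked out by hand in base R. The sketch below uses hypothetical counts (36 successes out of 100, chosen only for illustration and not taken from any dataset in this lab) and the usual normal approximation, so its p-value will generally differ a little from the “exact” binomial test.

```r
# Hypothetical counts, for illustration only
x  <- 36      # number of "successes" in the sample
n  <- 100     # number of non-missing observations
p0 <- 0.30    # null-hypothesized proportion

p_hat   <- x / n
se0     <- sqrt(p0 * (1 - p0) / n)   # standard error under the null hypothesis
z       <- (p_hat - p0) / se0        # z statistic
p_value <- 2 * pnorm(-abs(z))        # two-sided p-value

c(z = z, p_value = p_value)
```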

In the code chunk below, perform a hypothesis test to see whether 50% (p=0.50) of the population (that this sample came from) has ever smoked, based on your sample (i.e., those who said “yes” to everSmoke in the sample). In addition, create a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion of people who have ever smoked.

# Enter code here
binom.test(~ everSmoke, data = nhanes, p = 0.50, conf.level = 0.95, alternative = "two.sided", ci.method = "Wilson", success = "yes")

## 
##  Exact binomial test (Score CI without continuity
##  correction)
## 
## data:  nhanes$everSmoke  [with success = yes]
## number of successes = 44, number of trials = 99,
## p-value = 0.3149
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.3504607 0.5425786
## sample estimates:
## probability of success 
##              0.4444444

For this test, the p-value is 0.3149, which is larger than the standard significance level of 0.05. Therefore, we fail to reject the null hypothesis (which was that the proportion of ever smokers was equal to 0.5). Also note that the null hypothesized value of 0.50 is included in the 95% confidence interval, (0.3504607, 0.5425786). This supports the decision to fail to reject the null hypothesis. The results of this test could be summarized in one sentence as:

The proportion of subjects reporting having ever smoked was 44.44%, which was not significantly different from 50% (p=0.3149).

Note that in the above sentence the “p” means p-value, not proportion (confusing!). If it is helpful, you can write “p-value=0.3149” instead of “p=0.3149” (as in the above) when you describe results of tests.
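To see where the numbers in the everSmoke output come from, the p-value and the Wilson interval can be reproduced from the counts alone. This is a sketch in base R: it calls base R's binom.test() with counts instead of a formula, and computes the Wilson (score) interval without continuity correction from its textbook formula. The object names are made up for illustration.

```r
x  <- 44     # "yes" responses to everSmoke
n  <- 99     # non-missing responses
p0 <- 0.50   # null-hypothesized proportion

# Exact two-sided p-value from base R's binom.test()
exact_p <- binom.test(x, n, p = p0, alternative = "two.sided")$p.value

# Wilson (score) 95% interval without continuity correction
p_hat  <- x / n
z      <- qnorm(0.975)
center <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- c(lower = center - half, upper = center + half)

round(exact_p, 4)
round(wilson, 7)
```

Both results should agree with the p-value and the 95 percent confidence interval printed in the output above.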

Now you will perform another hypothesis test and describe the results. According to the CDC, 19% of all adults were smokers in 2010. You would like to see if the population your sample came from has a different proportion of smokers.

In the code chunk below, using the variable smokeNow, perform a test to see whether the proportion of smokers in the population (that this sample came from) is different from 19% (p=0.19). In addition create a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion of smokers.

# Enter code here
binom.test(~ smokeNow, data = nhanes, p = 0.19, conf.level = 0.95, alternative = "two.sided", ci.method = "Wilson", success = "yes")

## 
##  Exact binomial test (Score CI without continuity
##  correction)
## 
## data:  nhanes$smokeNow  [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 0.6078
## alternative hypothesis: true probability of success is not equal to 0.19
## 95 percent confidence interval:
##  0.1431354 0.3026135
## sample estimates:
## probability of success 
##              0.2121212

STOP! Answer Questions 2–4 now.

Part 2: Hypothesis tests for a mean

Next you will perform hypothesis tests for a population mean. We will start with the variable cholesterol.

The favstats() command provides some useful summary statistics. In the code chunk below, use favstats() with the formula ~ cholesterol and data nhanes to compute summary statistics. See labs 2 and 4 for help.

# Enter code here
favstats(~ cholesterol, data = nhanes)

##  min     Q1 median    Q3 max     mean       sd  n missing
##  122 179.25    206 239.5 370 210.0326 43.81466 92       8

In the code chunk below, generate a histogram for cholesterol. See labs 2, 3, and 4 for help.

NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth or bins argument to make the histograms look less noisy.

# Enter code here
histogram(~ cholesterol, data = nhanes, bins = 25)

Notice by looking at the histogram that this variable isn’t perfectly bell-curve shaped, but it is pretty close to having a normal distribution.

First you will test whether the average cholesterol in this population is equal to 200 mg/dL.

STOP! Answer Question 5 now.

Since you are testing whether a mean is equal to a specific value, you will use a one-sample z-test or t-test. The one-sample t-test can be performed via the t.test() function.

For example:

# Not evaluated
t.test( ~ age, data = mydata, mu=40,
        conf.level = 0.95, alternative="two.sided")

performs a hypothesis test to see whether the average/mean age of the population (that this sample came from) is equal to 40 (i.e., mu=40), based on your sample. It also creates a two-sided 95% confidence interval for the true population mean age.
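The result of t.test() is an object whose pieces can be pulled out individually, which is handy when you write up results. The sketch below uses base R's vector interface (rather than the formula interface shown above) on simulated ages; the data are hypothetical and generated only for illustration.

```r
set.seed(40)  # arbitrary seed so the sketch is reproducible

# Hypothetical ages, not real data
ages <- round(rnorm(60, mean = 43, sd = 12))

res <- t.test(ages, mu = 40, conf.level = 0.95, alternative = "two.sided")

res$statistic   # t statistic
res$parameter   # degrees of freedom (n - 1)
res$p.value     # two-sided p-value
res$conf.int    # 95% confidence interval for the population mean
```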

In the code chunk below, perform a hypothesis test to see whether the mean cholesterol of the population (that this sample came from) is 200 mg/dL, based on your sample. In addition, create a two-sided 95% confidence interval for the true population mean cholesterol.

# Enter code here
t.test(~ cholesterol, data = nhanes, mu = 200, conf.level = 0.95, alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  cholesterol
## t = 2.1963, df = 91, p-value = 0.03061
## alternative hypothesis: true mean is not equal to 200
## 95 percent confidence interval:
##  200.9589 219.1064
## sample estimates:
## mean of x 
##  210.0326
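The t statistic, p-value, and confidence interval in this output can be reproduced from the favstats() summary alone. The sketch below plugs in the rounded mean, standard deviation, and sample size printed earlier, so the results may differ from the output in the last decimal place; the object names are made up for illustration.

```r
# Summary statistics for cholesterol as printed by favstats() (rounded)
x_bar <- 210.0326   # sample mean
s     <- 43.81466   # sample standard deviation
n     <- 92         # non-missing observations
mu0   <- 200        # null-hypothesized mean

se      <- s / sqrt(n)                  # standard error of the mean
t_stat  <- (x_bar - mu0) / se           # t statistic
df      <- n - 1                        # degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)     # two-sided p-value
ci      <- x_bar + c(-1, 1) * qt(0.975, df) * se   # 95% confidence interval

round(c(t = t_stat, df = df, p_value = p_value), 4)
round(ci, 4)
```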

STOP! Answer Questions 6–8 now.

Finally, you will analyze the variable familySize.

In the code chunk below, use favstats() with the formula ~ familySize and data nhanes to compute summary statistics. See labs 2 and 4 for help.

# Enter code here
favstats(~ familySize, data = nhanes)

##  min Q1 median Q3 max mean       sd   n missing
##    1  2      2  4  10 3.01 1.778065 100       0

In the code chunk below, generate a histogram for familySize. See labs 2, 3, and 4 for help.

NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth or bins argument to make the histograms look less noisy.

# Enter code here
histogram(~ familySize, data = nhanes, bins = 20)

Notice by looking at the histogram that the distribution of this variable appears right-skewed.

Even though the distribution of familySize does not appear to be normal, we can still use a t-test to perform a hypothesis test for its mean. This is because the sample size is large enough (n=100) to rely on the Central Limit Theorem, which says that the sample mean is approximately normally distributed in large samples. We say that the t-test is “robust to departures from normality,” which means that the test is still valid even when your data aren’t perfectly normal.
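A small simulation can make the Central Limit Theorem argument concrete. The sketch below is not based on the nhanes data: it assumes a hypothetical right-skewed population (exponential with mean 3) and compares the skewness of individual draws with the skewness of means of samples of size 100. The skewness() function defined here is a simple moment-based helper written for this sketch, not a function from a package.

```r
set.seed(2210)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population: exponential with mean 3
n_per_sample <- 100
n_samples    <- 5000
sample_means <- replicate(n_samples, mean(rexp(n_per_sample, rate = 1 / 3)))
raw_draws    <- rexp(n_samples, rate = 1 / 3)

# Simple moment-based skewness (close to 0 for a symmetric distribution)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

c(raw = skewness(raw_draws), means = skewness(sample_means))
```

The individual draws are strongly right-skewed, while the sample means are much closer to symmetric, which is why the t-test remains reasonable here.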

According to the 2010 Census, the average household size in the U.S. is 2.6 persons. In the code chunk below, perform a hypothesis test to see if the population your data came from has an average family size that is different from 2.6.

# Enter code here
t.test(~ familySize, data = nhanes, mu = 2.6, conf.level = 0.95, alternative = "two.sided")

## 
##  One Sample t-test
## 
## data:  familySize
## t = 2.3059, df = 99, p-value = 0.0232
## alternative hypothesis: true mean is not equal to 2.6
## 95 percent confidence interval:
##  2.657193 3.362807
## sample estimates:
## mean of x 
##      3.01

STOP! Answer Question 9 now.

Please submit your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click “Submit Assignment” to complete your submission.