Overview: In this lab exercise, you will learn how to conduct one-sample tests for means and proportions.
Objectives: At the end of this lab you will be able to:
Many tasks and commands that were explained in the first lab will be used here with less direction.
Refer to the first lab (Lab 1) if more direction is needed.
Create a subdirectory named Lab 5 in the PUBHBIO 2210 Labs directory you created in your OneDrive folder in Lab 1.
Download the four lab files from Carmen while in the RStudio server:
lab-05-testing-blank.html
lab-05-testing-blank.Rmd
lab-05-testing-worksheet-blank.docx
nhanes.RData
If you have not downloaded all of these files, do so now.
Save the four downloaded files in the PUBHBIO 2210 Labs/Lab 5 directory you just created.
When working on labs, it is important to keep all related files in the same directory.
Change the author and date information in the lab header.
In the code chunk below, load the nhanes.RData file and print the nhanes dataset it contains. See the latter part of Lab 1 and Lab 2 for how to load a .RData file and print an object (a dataset in R).
# Enter code here
load("nhanes.RData")
print(nhanes)
## # A tibble: 100 × 33
## id race ethnicity sex age familySize urban region
## <int> <fct> <fct> <fct> <int> <int> <fct> <fct>
## 1 1 black not hisp… fema… 56 1 metr… midwe…
## 2 2 white not hisp… fema… 73 1 other west
## 3 3 white not hisp… fema… 25 2 metr… south
## 4 4 white mexican-… fema… 53 2 other south
## 5 5 white mexican-… fema… 68 2 other south
## 6 6 white not hisp… fema… 44 3 other west
## 7 7 black not hisp… fema… 28 2 metr… south
## 8 8 white not hisp… male 74 2 other midwe…
## 9 9 white not hisp… fema… 65 1 other north…
## 10 10 white other hi… fema… 61 3 metr… west
## # ℹ 90 more rows
## # ℹ 25 more variables: pir <dbl>, yrsEducation <int>,
## # maritalStatus <fct>, healthStatus <ord>,
## # heightInSelf <int>, weightLbSelf <int>, beer <int>,
## # wine <int>, liquor <int>, everSmoke <fct>,
## # smokeNow <fct>, active <ord>, SBP <int>, DBP <int>,
## # weightKg <dbl>, heightCm <dbl>, waist <dbl>, …
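As a reminder of what load() does, here is a minimal, self-contained sketch. The data frame toy and the temporary file toy.RData are hypothetical stand-ins used only for illustration; they are not part of the lab files.

```r
# Hypothetical toy dataset standing in for nhanes (not part of the lab files)
toy <- data.frame(id = 1:3, everSmoke = factor(c("no", "yes", NA)))

# Save it to a temporary .RData file, then remove it from the workspace
path <- file.path(tempdir(), "toy.RData")
save(toy, file = path)
rm(toy)

# load() restores the saved object under its original name and
# invisibly returns the names of the objects it restored
loaded_names <- load(path)
print(loaded_names)
print(toy)
```

The point to notice is that load() does not need an assignment: the object reappears under the name it had when it was saved, which is why print(nhanes) works right after load("nhanes.RData").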
First we will make a frequency table using the tally() function. In the code chunk below, make a frequency table for the variable everSmoke, which tells us if a subject reports having ever smoked. See lab 2 for help.
# Enter code here
tally(~ everSmoke, data = nhanes)
## everSmoke
## no yes <NA>
## 55 44 1
Notice that about 44.44% (44/99) of the subjects in your sample who answered the question report having ever smoked; the one missing value (<NA>) is left out of the denominator. You would like to test whether 50% (p=0.50) of the population (that this sample came from) has ever smoked, based on your sample. You will perform a 2-sided test because it would be interesting if more than 50% smoked or if less than 50% smoked. By default, you should always conduct a 2-sided test unless you are specifically told otherwise.
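To see where the 44/99 comes from, the sketch below rebuilds a vector with the same counts as the tally() output above and computes the proportion in base R. The vector ever_smoke is reconstructed from the counts for illustration; it is not the actual nhanes column.

```r
# Rebuild a vector with the same counts as the tally() output:
# 55 "no", 44 "yes", and 1 missing value
ever_smoke <- factor(c(rep("no", 55), rep("yes", 44), NA))

# Counts, including the missing value
counts <- table(ever_smoke, useNA = "ifany")
print(counts)

# Proportion of "yes" among the non-missing responses: 44 / 99
n_valid <- sum(!is.na(ever_smoke))
p_hat <- mean(ever_smoke == "yes", na.rm = TRUE)
print(c(n_valid = n_valid, p_hat = p_hat))
```

The denominator is 99 rather than 100 because the subject with a missing response is dropped before the proportion is computed.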
Since you are testing whether a proportion is equal to a specific value, you will use a one-sample z-test or an “exact” binomial test. The binom.test() function performs an “exact” binomial test.
For example:
# Not evaluated
binom.test( ~ urban, data = mydata,
p=0.30, conf.level = 0.95,
alternative="two.sided",
ci.method = "Wilson",
success = "metro area of 1 million")
performs a hypothesis test to see whether 30% (p=0.30) of the population (that this sample came from) lives in a metro area of 1 million, based on your sample. It also creates a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion. Note that success = "......" determines which level of the variable is counted as a success (in this case "metro area of 1 million").
Note: the name of the success argument is an unfortunate hold-over from introducing proportion tests using coin flips, where success = "head" is intuitive. The name is not meant to be normative; it only identifies the level being counted, for example when doing a proportion test on race or sex.
In the code chunk below, perform a hypothesis test to see whether 50% (p=0.50) of the population (that this sample came from) ever smoked, based on your sample (i.e., those who said “yes” to everSmoke in the sample). In addition, create a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion of people who have ever smoked.
# Enter code here
binom.test(~ everSmoke, data = nhanes, p = 0.50, conf.level = 0.95, alternative = "two.sided", ci.method = "Wilson", success = "yes")
##
## Exact binomial test (Score CI without continuity
## correction)
##
## data: nhanes$everSmoke [with success = yes]
## number of successes = 44, number of trials = 99,
## p-value = 0.3149
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
## 0.3504607 0.5425786
## sample estimates:
## probability of success
## 0.4444444
For this test, the p-value is 0.3149, which is larger than the standard significance level of 0.05. Therefore, we would fail to reject the null hypothesis (which was that the proportion of ever smokers was equal to 0.5). Also note that the null hypothesized value of 0.50 is included in the 95% confidence interval, i.e., (0.3504607, 0.5425786). This is consistent with the decision to fail to reject the null hypothesis. The results of this test could be summarized in one sentence as:
The proportion of subjects reporting having ever smoked was 44.44%, which was not significantly different from 50% (p=0.3149).
Note that in the above sentence the “p” means p-value, not proportion (confusing!). If it is helpful, you can write “p-value=0.3149” instead of “p=0.3149” (as in the above) when you describe results of tests.
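The numbers in the output above can be checked from the counts alone. The sketch below uses base R's binom.test() with counts (rather than the formula interface used in the lab) and computes the Wilson score interval by hand from its standard formula; the variable names are illustrative choices, not part of the lab.

```r
# Counts taken from the output above
x  <- 44    # number of "yes" responses
n  <- 99    # number of non-missing responses
p0 <- 0.50  # null hypothesized proportion

# Exact two-sided binomial test (base R interface with counts)
exact   <- binom.test(x, n, p = p0, alternative = "two.sided")
p_value <- exact$p.value

# Wilson score interval computed directly from its formula
z          <- qnorm(0.975)
p_hat      <- x / n
center     <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half_width <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson     <- c(lower = center - half_width, upper = center + half_width)

# Decision at the 0.05 significance level
reject <- p_value < 0.05

print(p_value)
print(wilson)
print(reject)
```

Because the p-value is above 0.05 and the interval contains 0.50, both routes lead to the same decision: fail to reject the null hypothesis.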
Now you will perform another hypothesis test and describe the results. According to the CDC, 19% of all adults were smokers in 2010. You would like to see if the population your sample came from has a different proportion of smokers.
In the code chunk below, using the variable smokeNow, perform a test to see whether the proportion of smokers in the population (that this sample came from) is different from 19% (p=0.19). In addition, create a two-sided 95% confidence interval (using Wilson’s method) for the true population proportion of smokers.
# Enter code here
binom.test(~ smokeNow, data = nhanes, p = 0.19, conf.level = 0.95, alternative = "two.sided", ci.method = "Wilson", success = "yes")
##
## Exact binomial test (Score CI without continuity
## correction)
##
## data: nhanes$smokeNow [with success = yes]
## number of successes = 21, number of trials = 99,
## p-value = 0.6078
## alternative hypothesis: true probability of success is not equal to 0.19
## 95 percent confidence interval:
## 0.1431354 0.3026135
## sample estimates:
## probability of success
## 0.2121212
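The one-sample z-test mentioned earlier as an alternative to the exact test can be sketched by hand from the same counts. This is a normal-approximation calculation, so its p-value will not match the exact binomial p-value above digit for digit; the variable names are illustrative.

```r
# Counts taken from the output above
x  <- 21    # number of current smokers ("yes")
n  <- 99    # number of non-missing responses
p0 <- 0.19  # null hypothesized proportion

# Sample proportion and its standard error under the null hypothesis
p_hat <- x / n
se0   <- sqrt(p0 * (1 - p0) / n)

# z statistic and two-sided p-value from the standard normal distribution
z_stat    <- (p_hat - p0) / se0
p_value_z <- 2 * pnorm(-abs(z_stat))

print(c(p_hat = p_hat, z = z_stat, p_value = p_value_z))
```

Note that the standard error uses the null value p0, not the sample proportion, because the test statistic is computed assuming the null hypothesis is true.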
Next you will perform hypothesis tests for a population mean. We will start with the variable cholesterol.
The favstats() command provides some useful summary statistics. In the code chunk below, use favstats() with the formula ~ cholesterol and data nhanes to compute summary statistics. See labs 2 and 4 for help.
# Enter code here
favstats(~ cholesterol, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 122 179.25 206 239.5 370 210.0326 43.81466 92 8
In the code chunk below, generate a histogram for cholesterol. See labs 2, 3, and 4 for help.
NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth or bins argument to make the histograms look less noisy.
# Enter code here
histogram(~ cholesterol, data = nhanes, bins = 25)
Notice by looking at the histogram that this variable isn’t perfectly bell-curve shaped, but it is pretty close to having a normal distribution.
First you will test whether the average cholesterol in this population is equal to 200 mg/dL.
Since you are testing whether a mean is equal to a specific value, you will use a one-sample z-test or t-test. The one-sample t-test can be performed via the t.test() function.
For example:
# Not evaluated
t.test( ~ age, data = mydata, mu=40,
conf.level = 0.95, alternative="two.sided")
performs a hypothesis test to see whether the average/mean age of the population (that this sample came from) is equal to 40 (i.e., mu=40), based on your sample. It also creates a two-sided 95% confidence interval for the true population mean age.
In the code chunk below, perform a hypothesis test to see whether the mean cholesterol of the population (that this sample came from) is 200 mg/dL, based on your sample. In addition, create a two-sided 95% confidence interval for the true population mean cholesterol.
# Enter code here
t.test(~ cholesterol, data = nhanes, mu = 200, conf.level = 0.95, alternative = "two.sided")
##
## One Sample t-test
##
## data: cholesterol
## t = 2.1963, df = 91, p-value = 0.03061
## alternative hypothesis: true mean is not equal to 200
## 95 percent confidence interval:
## 200.9589 219.1064
## sample estimates:
## mean of x
## 210.0326
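The t statistic, p-value, and confidence interval in this output can be recomputed from the favstats() summary alone. The sketch below plugs the rounded mean, standard deviation, and sample size reported above into the usual one-sample t formulas, so small rounding differences from the t.test() output are possible.

```r
# Summary statistics taken from the favstats() output above
xbar <- 210.0326  # sample mean
s    <- 43.81466  # sample standard deviation
n    <- 92        # number of non-missing values
mu0  <- 200       # null hypothesized mean

# Standard error of the mean and t statistic
se     <- s / sqrt(n)
t_stat <- (xbar - mu0) / se
df     <- n - 1

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df)

# Two-sided 95% confidence interval for the population mean
ci <- xbar + c(-1, 1) * qt(0.975, df) * se

print(c(t = t_stat, df = df, p_value = p_value))
print(ci)
```

The degrees of freedom are 91 rather than 99 because the 8 subjects with missing cholesterol values are not used in the test.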
Finally, you will analyze the variable familySize.
In the code chunk below, use favstats() with the formula ~ familySize and data nhanes to compute summary statistics. See labs 2 and 4 for help.
# Enter code here
favstats(~ familySize, data = nhanes)
## min Q1 median Q3 max mean sd n missing
## 1 2 2 4 10 3.01 1.778065 100 0
In the code chunk below, generate a histogram for familySize. See labs 2, 3, and 4 for help.
NOTE: For all histograms created in this lab, if necessary, adjust the width of “bins” using the binwidth or bins argument to make the histograms look less noisy.
# Enter code here
histogram(~ familySize, data = nhanes, bins = 20)
Notice by looking at the histogram that the distribution of this variable appears right-skewed.
Even though the distribution of familySize does not appear to be normal, we can still use a t-test to perform a hypothesis test for its mean. This is because the sample size is large enough (n=100) to rely on the Central Limit Theorem. We say that the t-test is “robust to departures from normality”, which means that the test is still approximately valid even when your data aren’t perfectly normal.
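The Central Limit Theorem claim can be explored with a small simulation. The sketch below is only an illustration under stated assumptions: it draws samples of size 100 from an exponential distribution, chosen here as a convenient right-skewed stand-in and not as a model of familySize, and looks at how the sample means and the t-based confidence intervals behave.

```r
set.seed(2210)  # arbitrary seed so the simulation is reproducible

n    <- 100   # sample size, matching the lab's n
reps <- 2000  # number of simulated samples

# Exponential(rate = 1) is right-skewed with true mean 1 and true sd 1
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# The sample means should center near 1 with spread near 1 / sqrt(n) = 0.1
mean_of_means <- mean(sample_means)
sd_of_means   <- sd(sample_means)

# Share of 95% t-intervals that contain the true mean of 1
covers <- replicate(reps, {
  x  <- rexp(n, rate = 1)
  ci <- t.test(x, mu = 1, conf.level = 0.95)$conf.int
  ci[1] <= 1 && 1 <= ci[2]
})
coverage <- mean(covers)

print(c(mean_of_means = mean_of_means,
        sd_of_means = sd_of_means,
        coverage = coverage))
```

If the coverage comes out close to the nominal 95%, that is the sense in which the t-test is robust here: the individual observations are skewed, but the sample mean is close enough to normal for the t procedures to work reasonably well.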
According to the 2010 Census, the average household size in the U.S. is 2.6 persons. In the code chunk below, perform a hypothesis test to see if the population your data came from has an average family size that is different from 2.6.
# Enter code here
t.test(~ familySize, data = nhanes, mu = 2.6, conf.level = 0.95, alternative = "two.sided")
##
## One Sample t-test
##
## data: familySize
## t = 2.3059, df = 99, p-value = 0.0232
## alternative hypothesis: true mean is not equal to 2.6
## 95 percent confidence interval:
## 2.657193 3.362807
## sample estimates:
## mean of x
## 3.01
Please turn in your completed worksheet (DOCX, i.e., Word document), your RMD file, and your updated HTML file to Carmen by the due date. Be sure to upload all three (3) files before you click “Submit Assignment” to complete your submission.