This is a practice problem set with solved problems very similar to those in Lab 3. It is based on the CDC dataset, which you used in Lab 1.
You will perform the following tasks.
Construct a confidence interval for a proportion and use proper language to describe it.
Conduct a hypothesis test for a proportion. Distinguish between the null and alternative hypotheses and use proper language to describe your conclusion.
Construct a confidence interval for a mean and use proper language to describe it.
Conduct a hypothesis test for a mean. Distinguish between the null and alternative hypotheses and use proper language to describe your conclusion.
Conduct a hypothesis test for the difference between two means and use proper language to describe your conclusion.
Construct a contingency table to describe the relationship between two categorical variables and use it to test for independence and to compute marginal, joint, and conditional probabilities.
The first step is to load the data and to verify that all of the required packages are installed and loaded.
load("/cloud/project/cdc.Rdata")
package = "gmodels"
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package, character.only=T)
}
package = "tidyverse"
if (!require(package, character.only=T, quietly=T)) {
install.packages(package)
library(package, character.only=T)
}
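The install-and-load pattern above is repeated once per package. A minimal sketch of wrapping it in a helper follows. The function name `ensure_package` is hypothetical and is not part of the course materials. It checks for the package with `requireNamespace()` before attaching it.

```r
# Hypothetical helper (not from the course materials): install a package
# only if it is missing, then attach it with library().
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# Demonstrated on a package that ships with R, so nothing is downloaded.
loaded <- ensure_package("stats")
```

For this problem set, the equivalent calls would be `ensure_package("gmodels")` and `ensure_package("tidyverse")`.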
Run the commands str() and summary() on the dataframe cdc. Then identify the quantitative and categorical variables.
# Place your code in this chunk.
str(cdc)
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
summary(cdc)
## genhlth exerany hlthplan smoke100
## excellent:4657 Min. :0.0000 Min. :0.0000 Min. :0.0000
## very good:6972 1st Qu.:0.0000 1st Qu.:1.0000 1st Qu.:0.0000
## good :5675 Median :1.0000 Median :1.0000 Median :0.0000
## fair :2019 Mean :0.7457 Mean :0.8738 Mean :0.4721
## poor : 677 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## height weight wtdesire age gender
## Min. :48.00 Min. : 68.0 Min. : 68.0 Min. :18.00 m: 9569
## 1st Qu.:64.00 1st Qu.:140.0 1st Qu.:130.0 1st Qu.:31.00 f:10431
## Median :67.00 Median :165.0 Median :150.0 Median :43.00
## Mean :67.18 Mean :169.7 Mean :155.1 Mean :45.07
## 3rd Qu.:70.00 3rd Qu.:190.0 3rd Qu.:175.0 3rd Qu.:57.00
## Max. :93.00 Max. :500.0 Max. :680.0 Max. :99.00
List the quantitative and categorical variables here.
Quantitative variables are: height, weight, wtdesire, and age.
Categorical variables are: gender and genhlth, which are identified as factors as well as exerany, hlthplan and smoke100. The latter three are categorical in meaning (Yes/No) but coded numerically (1/0).
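Because the three indicator variables are stored as numbers, summary() reports quartiles and means for them rather than counts. As a sketch of one way to make R treat such a variable as categorical, the block below converts a 0/1 vector to a factor. The vector holds the first ten values of smoke100 shown in the str() output, and the "No"/"Yes" labels are an assumption about how the levels should be named.

```r
# First ten values of smoke100 as printed by str(cdc) above.
indicator <- c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0)

# Convert to a factor so R treats the variable as categorical.
# The labels "No" and "Yes" are illustrative choices.
indicator_f <- factor(indicator, levels = c(0, 1), labels = c("No", "Yes"))

# table() reports a count per category instead of a numeric summary.
counts <- table(indicator_f)
counts
```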
Create a 95% confidence interval for the proportion of smokers (smoke100 equal to 1) in cdc.
Describe the confidence interval using one of the two prescribed sentences from the Module 6 notes.
# Place the R code you need to answer this question in this chunk.
# Code snippet to compute a confidence interval for a proportion
phat <- mean(cdc$smoke100 == 1) # Estimated proportion
CL <- .95 # Required confidence level
n <- length(cdc$smoke100) # Sample size
zstar <- qnorm(CL+.5*(1-CL))
se.phat <-sqrt(phat*(1-phat)/n)
ME = zstar * se.phat
lb <- phat - ME
ub <- phat + ME
CI <- c(CL,ME,lb,phat,ub)
names(CI) <- c("Confidence Level", "Margin of Error", "lower Bound","phat","Upper Bound")
CI
## Confidence Level Margin of Error lower Bound phat
## 0.950000000 0.006918684 0.465131316 0.472050000
## Upper Bound
## 0.478968684
Here are the acceptable sentences.
We are 95% confident that the population proportion lies between .465 and .479.
We are 95% confident that the true population proportion is within .0069 of the estimated proportion, .472.
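The same interval can be reproduced from summary counts alone. The sketch below wraps the calculation in a function. The name `prop_ci` is hypothetical, and the count of 9441 smokers is inferred from phat = 0.47205 and n = 20000 rather than read directly from the data.

```r
# Hypothetical helper: confidence interval for a proportion computed
# from a success count x and a sample size n (normal approximation).
prop_ci <- function(x, n, CL = 0.95) {
  phat  <- x / n
  zstar <- qnorm(1 - (1 - CL) / 2)   # same value as qnorm(CL + .5*(1-CL))
  ME    <- zstar * sqrt(phat * (1 - phat) / n)
  c(lower = phat - ME, phat = phat, upper = phat + ME, ME = ME)
}

# 9441 smokers out of 20000 is the count implied by phat = 0.47205.
ci_smoke <- prop_ci(9441, 20000)
round(ci_smoke, 4)
```

Changing `CL` to, say, 0.99 widens the interval, since a larger critical value multiplies the same standard error.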
Test the hypothesis that the proportion of smokers is .48.
There are two ways you could do this. One is to use the R function prop.test.
# Place your R code here.
prop.test(sum(cdc$smoke100 == 1),length(cdc$smoke100),.48)
##
## 1-sample proportions test with continuity correction
##
## data: sum(cdc$smoke100 == 1) out of length(cdc$smoke100), null probability 0.48
## X-squared = 5.0325, df = 1, p-value = 0.02488
## alternative hypothesis: true p is not equal to 0.48
## 95 percent confidence interval:
## 0.4651124 0.4789984
## sample estimates:
## p
## 0.47205
You could also use the code snippet I provided in Module 7.
# Here are the inputs which can be changed to reuse the snippet.
n <- length(cdc$smoke100) # Number of trials (sample size)
phat <- mean(cdc$smoke100 == 1) # Proportion of sample cases meeting the definition
p0 <- .48 # The value of p under the null hypothesis
sided <- 2 # Specification of the alternative
# Construct z
z <- (phat - p0)/sqrt( (p0*(1-p0) )/n )
# Compute and display the p-value
pvalue <- sided * pnorm(-abs(z))
pvalue
## [1] 0.02442353
Since the p-value of .024 is below .05, we reject the hypothesis that the true population proportion is .48.
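The two p-values above differ slightly (.02488 versus .02442) because prop.test() applies a continuity correction by default, while the snippet does not. The sketch below shows how the two approaches line up once the correction is turned off. It assumes the success count of 9441 implied by phat = 0.47205 and n = 20000.

```r
x  <- 9441     # number of smokers implied by phat = 0.47205
n  <- 20000
p0 <- 0.48     # value of p under the null hypothesis

# z statistic and two-sided p-value, computed as in the snippet
z <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)
p_snippet <- 2 * pnorm(-abs(z))

# prop.test() without the continuity correction
res <- prop.test(x, n, p = p0, correct = FALSE)

# X-squared equals z squared, and the p-values agree
c(z_squared = z^2, X_squared = unname(res$statistic))
c(p_snippet = p_snippet, p_prop_test = res$p.value)
```

Either way the p-value is below .05, so the conclusion does not depend on which version is used here.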
Create a 95% confidence interval for mean value of weight in cdc.
Describe the confidence interval using one of the two prescribed sentences from the Module 6 notes.
# Place the R code you need to answer this question in this chunk.
# Code snippet to construct a confidence interval for a
# population mean given a known population standard deviation
# and a sample mean from a sample of size n
xbar = mean(cdc$weight) # Sample Mean
sd = sd(cdc$weight) # Sample standard deviation (stands in for the population value)
n = length(cdc$weight) # Sample size
CL = .95 # Required Confidence Level
zstar <- qnorm(CL+.5*(1-CL)) # Obtain Z-score for this confidence level
sd.xbar <- sd/sqrt(n) # Compute standard error of sample mean
ME <- zstar * sd.xbar # Compute margin of error
lb <- xbar - ME # Compute lower bound of CI
ub <- xbar + ME # Compute upper bound of CI
CI <- c(CL,lb,xbar,ub,ME) # Put our results in a vector
names(CI) <- c("Confidence Level","Lower Bound","Xbar","Upper Bound",
"Margin of Error") # Name the vector elements
CI # Display the vector
## Confidence Level Lower Bound Xbar Upper Bound
## 0.9500000 169.1274663 169.6829500 170.2384337
## Margin of Error
## 0.5554837
Here are the two possible sentences.
We are 95% confident that the true population mean is between 169.13 and 170.24.
We are 95% confident that the true population mean is within .56 of the estimated mean, 169.68.
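The snippet above uses a normal critical value, with the sample standard deviation standing in for the population value. With n = 20000 the result is nearly identical to a t-based interval, but at small sample sizes the two differ. The sketch below illustrates this on simulated data. The sample is made up and is not drawn from cdc.

```r
set.seed(1)
x <- rnorm(25, mean = 170, sd = 40)   # made-up sample, not cdc data

n    <- length(x)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)               # standard error of the sample mean
CL   <- 0.95

zstar <- qnorm(1 - (1 - CL) / 2)            # normal critical value
tstar <- qt(1 - (1 - CL) / 2, df = n - 1)   # t critical value

ci_z <- c(xbar - zstar * se, xbar + zstar * se)
ci_t <- c(xbar - tstar * se, xbar + tstar * se)

# The t interval is slightly wider and matches t.test(x)$conf.int
rbind(z = ci_z, t = ci_t)
```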
Test the hypothesis that the mean value of weight is 168.
Here’s how to do this with the R function t.test().
t.test(cdc$weight,mu = 168)
##
## One Sample t-test
##
## data: cdc$weight
## t = 5.9381, df = 19999, p-value = 2.931e-09
## alternative hypothesis: true mean is not equal to 168
## 95 percent confidence interval:
## 169.1274 170.2385
## sample estimates:
## mean of x
## 169.683
Here’s how to do this with the code snippet from Module 7.
# Replace the example values as necessary
xbar <- mean(cdc$weight) # Sample mean
mu <- 168 # Hypothesized value of the mean
sigma <- sd(cdc$weight) # Sample standard deviation, used in place of the population value
n <- length(cdc$weight) # sample size
sided = 2 # Specification of the alternative type
# Now do the work
sd.xbar <- sigma/sqrt(n)
z <- (xbar - mu)/sd.xbar
p.value <- sided * pnorm(-abs(z))
# Display the p-value.
p.value
## [1] 2.883326e-09
Since the p-value is well below .05, we reject the hypothesis that the true mean is 168.
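When the sample standard deviation is used for sigma, the z computed by the snippet and the t reported by t.test() are the same number. Only the reference distribution differs (normal versus t with n - 1 degrees of freedom), which is why the two p-values above, 2.883e-09 and 2.931e-09, are so close. The sketch below checks this on simulated data. The sample is made up and is not drawn from cdc.

```r
set.seed(2)
x  <- rnorm(30, mean = 172, sd = 40)   # made-up sample, not cdc data
mu <- 168                              # hypothesized mean

# Test statistic computed by hand
stat <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))

p_normal <- 2 * pnorm(-abs(stat))                    # snippet's approach
p_t      <- 2 * pt(-abs(stat), df = length(x) - 1)   # t.test()'s approach

tt <- t.test(x, mu = mu)
c(by_hand = stat, t_test = unname(tt$statistic))
c(p_normal = p_normal, p_t = p_t, p_t_test = tt$p.value)
```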
Use the CrossTable command to create a contingency table for the variables gender and genhlth in cdc. Set the optional parameter chisq to TRUE. Based on the value of the Chi-Square statistic, what is your conclusion about the hypothesis that gender and genhlth are independent categorical variables? How did you reach this conclusion?
# Place your code here.
CrossTable(cdc$genhlth,cdc$gender,chisq = TRUE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 20000
##
##
## | cdc$gender
## cdc$genhlth | m | f | Row Total |
## -------------|-----------|-----------|-----------|
## excellent | 2298 | 2359 | 4657 |
## | 2.190 | 2.009 | |
## | 0.493 | 0.507 | 0.233 |
## | 0.240 | 0.226 | |
## | 0.115 | 0.118 | |
## -------------|-----------|-----------|-----------|
## very good | 3382 | 3590 | 6972 |
## | 0.641 | 0.588 | |
## | 0.485 | 0.515 | 0.349 |
## | 0.353 | 0.344 | |
## | 0.169 | 0.179 | |
## -------------|-----------|-----------|-----------|
## good | 2722 | 2953 | 5675 |
## | 0.017 | 0.016 | |
## | 0.480 | 0.520 | 0.284 |
## | 0.284 | 0.283 | |
## | 0.136 | 0.148 | |
## -------------|-----------|-----------|-----------|
## fair | 884 | 1135 | 2019 |
## | 6.959 | 6.384 | |
## | 0.438 | 0.562 | 0.101 |
## | 0.092 | 0.109 | |
## | 0.044 | 0.057 | |
## -------------|-----------|-----------|-----------|
## poor | 283 | 394 | 677 |
## | 5.167 | 4.740 | |
## | 0.418 | 0.582 | 0.034 |
## | 0.030 | 0.038 | |
## | 0.014 | 0.020 | |
## -------------|-----------|-----------|-----------|
## Column Total | 9569 | 10431 | 20000 |
## | 0.478 | 0.522 | |
## -------------|-----------|-----------|-----------|
##
##
## Statistics for All Table Factors
##
##
## Pearson's Chi-squared test
## ------------------------------------------------------------
## Chi^2 = 28.71183 d.f. = 4 p = 8.945012e-06
##
##
##
Since the p-value is well below .05, we reject the null hypothesis that gender and general health are independent.
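The Pearson statistic that CrossTable() prints can also be obtained from base R's chisq.test(). As a sketch, the block below enters the cell counts from the table above by hand, so it runs without the gmodels package or the cdc data. The object names are illustrative.

```r
# Counts copied from the CrossTable output above
# (rows: genhlth, columns: gender).
health_by_gender <- matrix(
  c(2298, 2359,
    3382, 3590,
    2722, 2953,
     884, 1135,
     283,  394),
  ncol = 2, byrow = TRUE,
  dimnames = list(genhlth = c("excellent", "very good", "good", "fair", "poor"),
                  gender  = c("m", "f"))
)

chi_res <- chisq.test(health_by_gender)
chi_res$statistic   # Pearson chi-square statistic
chi_res$parameter   # degrees of freedom: (5 - 1) * (2 - 1)
chi_res$p.value
```

The degrees of freedom come from (rows - 1) times (columns - 1), which is 4 for this 5-by-2 table.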
Answer the following questions based on the table from the preceding problem. Place your answers between the questions. In each case, assume that you have selected a person at random from the dataframe cdc.
What is the probability that the person is male? .478
What is the probability that the genhlth is poor? .034
What is the probability that the person is male and the genhlth is poor? .014
What is the probability that the person is male, given that the genhlth is poor? .418
What is the probability that the genhlth is poor, given that the person is male? .030
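Each of these answers is a ratio of counts from the table: marginal and joint probabilities divide by the table total, while conditional probabilities divide by the relevant row or column total. The sketch below computes them from the counts entered by hand. The variable names are illustrative.

```r
# Counts copied from the CrossTable output (rows: genhlth, columns: gender).
tab <- matrix(
  c(2298, 2359, 3382, 3590, 2722, 2953, 884, 1135, 283, 394),
  ncol = 2, byrow = TRUE,
  dimnames = list(genhlth = c("excellent", "very good", "good", "fair", "poor"),
                  gender  = c("m", "f"))
)
total <- sum(tab)

p_male            <- sum(tab[, "m"]) / total                 # 9569 / 20000
p_poor            <- sum(tab["poor", ]) / total              # 677 / 20000
p_male_and_poor   <- tab["poor", "m"] / total                # 283 / 20000
p_male_given_poor <- tab["poor", "m"] / sum(tab["poor", ])   # 283 / 677
p_poor_given_male <- tab["poor", "m"] / sum(tab[, "m"])      # 283 / 9569
```

The last two lines share a numerator but use different denominators, which is why the two conditional probabilities are not equal.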
Test the hypothesis that the mean value of weight is the same for both genders. State your conclusion and explain how you arrived at it.
# Place your code here.
t.test(cdc$weight~cdc$gender)
##
## Welch Two Sample t-test
##
## data: cdc$weight by cdc$gender
## t = 74.957, df = 19560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 36.67182 38.64122
## sample estimates:
## mean in group m mean in group f
## 189.3227 151.6662
Since the p-value is below .05, we reject the null hypothesis that the mean weight is the same for both genders.
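The output is labeled "Welch" because t.test() does not pool the two group variances by default. As a sketch of what that means, the block below computes the Welch statistic and degrees of freedom by hand and compares them with t.test(). The two groups are simulated and are not drawn from cdc.

```r
set.seed(3)
g1 <- rnorm(40, mean = 190, sd = 35)   # made-up group 1, not cdc data
g2 <- rnorm(50, mean = 150, sd = 30)   # made-up group 2, not cdc data

# Per-group variance of the sample mean; the variances are not pooled
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)

# Welch t statistic for the difference in means
t_stat <- (mean(g1) - mean(g2)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v1 + v2)^2 /
  (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

wt <- t.test(g1, g2)
c(by_hand = t_stat, t_test = unname(wt$statistic))
c(by_hand = df_welch, t_test = unname(wt$parameter))
```

This also explains why the degrees of freedom in the output above (19560) are not a simple function of the two group sizes.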