This is a practice problem set with solved problems very similar to those in Lab 3. It is based on the CDC dataset, which you used in Lab 1.

You will perform the following tasks.

  1. Construct a confidence interval for a proportion and use proper language to describe it.

  2. Conduct a hypothesis test for a proportion. Distinguish between the null and alternative hypotheses and use proper language to describe your conclusion.

  3. Construct a confidence interval for a mean and use proper language to describe it.

  4. Conduct a hypothesis test for a mean. Distinguish between the null and alternative hypotheses and use proper language to describe your conclusion.

  5. Conduct a hypothesis test for the difference between two means and use proper language to describe your conclusion.

  6. Construct a contingency table to describe the relationship between two categorical variables and use it to:

    1. Describe the absolute probability of an event.
    2. Describe the joint probability of two events.
    3. Describe both conditional probabilities between two events.
    4. Test the hypothesis of independence between two events.

Setup

The first step is to load the data and to verify that all of the required packages are installed and loaded.

load("/cloud/project/cdc.Rdata")

package = "gmodels"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

package = "tidyverse"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}
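
The two nearly identical blocks above could be collapsed into one helper that is applied to a vector of package names. This is only a sketch, not part of the original lab code: the name ensure_package is hypothetical, and it assumes a CRAN mirror is reachable whenever a package actually has to be installed.

```r
# Hypothetical helper: attach a package, installing it first if it is missing.
ensure_package <- function(package) {
  if (!require(package, character.only = TRUE, quietly = TRUE)) {
    install.packages(package)
    library(package, character.only = TRUE)
  }
  invisible(package %in% .packages())  # TRUE if the package is now attached
}

# Demonstrated on packages that ship with R, so nothing gets installed here.
# For this lab the vector would be c("gmodels", "tidyverse").
loaded <- sapply(c("stats", "utils"), ensure_package)
loaded
```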

Problem 1

Run the commands str() and summary() on the dataframe cdc. Then identify the quantitative and categorical variables.

# Place your code in this chunk.
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
summary(cdc)
##       genhlth        exerany          hlthplan         smoke100     
##  excellent:4657   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  very good:6972   1st Qu.:0.0000   1st Qu.:1.0000   1st Qu.:0.0000  
##  good     :5675   Median :1.0000   Median :1.0000   Median :0.0000  
##  fair     :2019   Mean   :0.7457   Mean   :0.8738   Mean   :0.4721  
##  poor     : 677   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##      height          weight         wtdesire          age        gender   
##  Min.   :48.00   Min.   : 68.0   Min.   : 68.0   Min.   :18.00   m: 9569  
##  1st Qu.:64.00   1st Qu.:140.0   1st Qu.:130.0   1st Qu.:31.00   f:10431  
##  Median :67.00   Median :165.0   Median :150.0   Median :43.00            
##  Mean   :67.18   Mean   :169.7   Mean   :155.1   Mean   :45.07            
##  3rd Qu.:70.00   3rd Qu.:190.0   3rd Qu.:175.0   3rd Qu.:57.00            
##  Max.   :93.00   Max.   :500.0   Max.   :680.0   Max.   :99.00

List the quantitative and categorical variables here.

Quantitative variables are: height, weight, wtdesire, and age.

Categorical variables are: gender and genhlth, which are stored as factors, and exerany, hlthplan, and smoke100. The latter three are categorical in meaning (Yes/No) but coded numerically (1/0).
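
If you want R to treat the 0/1 indicators as categorical, one option is to recode them as factors. The sketch below is illustrative only: it uses a small toy data frame built from the first ten values shown in the str() output rather than the full cdc data, and the labels "No" and "Yes" are an assumed reading of the 0/1 coding.

```r
# First ten values of three columns, copied from the str() output above.
toy <- data.frame(
  exerany  = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1),
  smoke100 = c(0, 1, 1, 0, 0, 0, 0, 0, 1, 0),
  height   = c(70, 64, 60, 66, 61, 64, 71, 67, 65, 70)
)

# Recode the 0/1 indicators as labeled factors; height stays numeric.
yes_no <- c("exerany", "smoke100")
toy[yes_no] <- lapply(toy[yes_no], factor,
                      levels = c(0, 1), labels = c("No", "Yes"))

str(toy)             # the two indicators now show up as factors
table(toy$smoke100)  # counts by category instead of a numeric summary
```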

Problem 2

Create a 95% confidence interval for the proportion of smokers (smoke100 equal to 1) in cdc.

Describe the confidence interval using one of the two prescribed sentences from the Module 6 notes.

# Place the R code you need to answer this question in this chunk.
# Code snippet to compute a confidence interval for a proportion

phat <- mean(cdc$smoke100 == 1)    # Estimated proportion
CL <- .95      # Required confidence level
n <- length(cdc$smoke100)     # Sample size

zstar <- qnorm(CL+.5*(1-CL))
se.phat <-sqrt(phat*(1-phat)/n)

ME = zstar * se.phat

lb <- phat - ME
ub <- phat + ME

CI <- c(CL,ME,lb,phat,ub)
names(CI) <- c("Confidence Level", "Margin of Error",  "Lower Bound","phat","Upper Bound")

CI
## Confidence Level  Margin of Error      Lower Bound             phat 
##      0.950000000      0.006918684      0.465131316      0.472050000 
##      Upper Bound 
##      0.478968684

Here are the acceptable sentences.

  1. We are 95% confident that the population proportion lies between .465 and .479.

  2. We are 95% confident that the true population proportion is within .0069 of the estimated proportion .472.
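
To reuse the calculation at other confidence levels, the snippet can be wrapped in a function. This is a sketch under assumptions: the function name prop_ci is hypothetical, and the count of 9441 smokers is the value implied by phat = 0.47205 with n = 20000 rather than one read directly from cdc.

```r
# Hypothetical helper wrapping the snippet above.
# x = number of successes, n = sample size, CL = confidence level.
prop_ci <- function(x, n, CL = 0.95) {
  phat  <- x / n
  zstar <- qnorm(1 - (1 - CL) / 2)   # same value as qnorm(CL + .5*(1-CL))
  ME    <- zstar * sqrt(phat * (1 - phat) / n)
  c(lower = phat - ME, phat = phat, upper = phat + ME, ME = ME)
}

# 9441 smokers out of 20000 gives phat = 0.47205, as in the output above.
ci95 <- prop_ci(9441, 20000)
ci99 <- prop_ci(9441, 20000, CL = 0.99)
round(rbind(ci95, ci99), 4)
```

Raising the confidence level widens the interval, because zstar grows while the standard error stays the same.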

Problem 3

Test the hypothesis that the proportion of smokers is .48.

There are two ways you could do this. One is to use the R function prop.test.

# Place your R code here.
prop.test(sum(cdc$smoke100 == 1),length(cdc$smoke100),.48)
## 
##  1-sample proportions test with continuity correction
## 
## data:  sum(cdc$smoke100 == 1) out of length(cdc$smoke100), null probability 0.48
## X-squared = 5.0325, df = 1, p-value = 0.02488
## alternative hypothesis: true p is not equal to 0.48
## 95 percent confidence interval:
##  0.4651124 0.4789984
## sample estimates:
##       p 
## 0.47205

You could also use the code snippet I provided in Module 7.

# Here are the inputs which can be changed to reuse the snippet.
n <- length(cdc$smoke100)      # Number of trials (sample size)
phat <- mean(cdc$smoke100 == 1)    # Proportion of sample cases meeting the definition
p0 <-  .48       # The value of p under the null hypothesis
sided <- 2    # Specification of the alternative  
 
# Construct z  
z <- (phat - p0)/sqrt( (p0*(1-p0) )/n )

# Compute and display the p-value
pvalue <- sided * pnorm(-abs(z))
pvalue
## [1] 0.02442353

Since the p-value of .024 is below .05, we reject the hypothesis that the true population proportion is .48.
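
The two p-values differ slightly (0.02488 from prop.test and 0.02442 from the snippet) because prop.test applies a continuity correction by default. The sketch below illustrates this using the count of 9441 smokers implied by phat = 0.47205 and n = 20000; it is a check on the numbers above, not part of the Module 7 snippet.

```r
x  <- 9441     # smokers implied by phat = 0.47205
n  <- 20000
p0 <- 0.48

# z-test, as in the snippet
z      <- (x / n - p0) / sqrt(p0 * (1 - p0) / n)
pvalue <- 2 * pnorm(-abs(z))

# prop.test with and without the continuity correction
with_cc    <- prop.test(x, n, p = p0)                   # default: correct = TRUE
without_cc <- prop.test(x, n, p = p0, correct = FALSE)

# Without the correction, X-squared equals z^2 and the p-values agree.
c(z_squared     = z^2,
  X2_without_cc = unname(without_cc$statistic),
  X2_with_cc    = unname(with_cc$statistic))
c(snippet    = pvalue,
  without_cc = without_cc$p.value,
  with_cc    = with_cc$p.value)
```

Either way the p-value is below .05, so the conclusion does not change.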

Problem 4

Create a 95% confidence interval for the mean value of weight in cdc.

Describe the confidence interval using one of the two prescribed sentences from the Module 6 notes.

# Place the R code you need to answer this question in this chunk.
# Code snippet to construct a confidence interval for a 
# population mean from a sample of size n. The sample standard
# deviation stands in for the population standard deviation,
# which is reasonable for a sample this large.

  xbar = mean(cdc$weight)  # Sample Mean
  s    = sd(cdc$weight)   # Sample standard deviation (estimates the population value)
  n    = length(cdc$weight)   # Sample size
  CL   = .95  # Required Confidence Level
  
  zstar <- qnorm(CL+.5*(1-CL)) # Obtain Z-score for this confidence level
  sd.xbar <- s/sqrt(n)         # Compute standard error of sample mean 
  ME <- zstar * sd.xbar        # Compute margin of error
  
  lb <- xbar - ME              # Compute lower bound of CI
  ub <- xbar + ME              # Compute upper bound of CI
  
  CI <- c(CL,lb,xbar,ub,ME)    # Put our results in a vector

  names(CI) <- c("Confidence Level","Lower Bound","Xbar","Upper Bound",
                 "Margin of Error") # Name the vector elements
  CI                           # Display the vector
## Confidence Level      Lower Bound             Xbar      Upper Bound 
##        0.9500000      169.1274663      169.6829500      170.2384337 
##  Margin of Error 
##        0.5554837

Here are the two possible sentences.

  1. We are 95% confident that the true population mean is between 169.13 and 170.24.

  2. We are 95% confident that the true population mean is within .56 of the estimated mean, 169.68.
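
The interval above uses a z critical value. An interval based on the t distribution with n - 1 degrees of freedom is slightly wider, though with n = 20000 the difference is tiny. The sketch below compares the two using only numbers printed in the output above (xbar and the margin of error); backing the standard error out of the margin of error is a shortcut taken here so that the example does not need the cdc data.

```r
xbar <- 169.68295        # sample mean, from the output above
ME_z <- 0.5554837        # z-based margin of error, from the output above
n    <- 20000
CL   <- 0.95

zstar <- qnorm(1 - (1 - CL) / 2)
tstar <- qt(1 - (1 - CL) / 2, df = n - 1)   # slightly larger than zstar

se   <- ME_z / zstar                        # recover the standard error
ME_t <- tstar * se

rbind(z_interval = c(lower = xbar - ME_z, upper = xbar + ME_z),
      t_interval = c(lower = xbar - ME_t, upper = xbar + ME_t))
```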

Problem 5

Test the hypothesis that the mean value of weight is 168.

Here’s how to do this with the R function t.test().

t.test(cdc$weight,mu = 168)
## 
##  One Sample t-test
## 
## data:  cdc$weight
## t = 5.9381, df = 19999, p-value = 2.931e-09
## alternative hypothesis: true mean is not equal to 168
## 95 percent confidence interval:
##  169.1274 170.2385
## sample estimates:
## mean of x 
##   169.683

Here’s how to do this with the code snippet from Module 7.

# Replace the example values as necessary

xbar <- mean(cdc$weight)    # Sample mean
mu <- 168      # Hypothesized value of the mean
sigma <- sd(cdc$weight)    # Sample standard deviation, used as an estimate of sigma
n <- length(cdc$weight)       # sample size
sided = 2      # Specification of the alternative type

# Now do the work
sd.xbar <- sigma/sqrt(n)
z <- (xbar - mu)/sd.xbar
p.value <- sided * pnorm(-abs(z))

# Display the p-value.

p.value
## [1] 2.883326e-09

Since the p-value is well below .05, we reject the hypothesis that the true mean is 168.
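
Both methods divide the same difference by the same standard error, so the snippet's z equals the t reported by t.test(). The two p-values differ slightly only because one is read from the standard normal distribution and the other from the t distribution with 19999 degrees of freedom. A quick check, using the rounded statistic printed above (so the results match the displayed p-values only approximately):

```r
stat <- 5.9381     # test statistic, from the t.test output above
dof  <- 19999      # degrees of freedom, n - 1

p_t <- 2 * pt(-abs(stat), df = dof)   # two-sided p-value, t distribution
p_z <- 2 * pnorm(-abs(stat))          # two-sided p-value, standard normal

c(t_based = p_t, z_based = p_z)
```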

Problem 6

Use the CrossTable command to create a contingency table for the variables gender and genhlth in cdc. Set the optional parameter chisq to TRUE. Based on the value of the Chi-Square statistic, what is your conclusion about the hypothesis that gender and genhlth are independent categorical variables? How did you reach this conclusion?

# Place your code here.
CrossTable(cdc$genhlth,cdc$gender,chisq = TRUE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  20000 
## 
##  
##              | cdc$gender 
##  cdc$genhlth |         m |         f | Row Total | 
## -------------|-----------|-----------|-----------|
##    excellent |      2298 |      2359 |      4657 | 
##              |     2.190 |     2.009 |           | 
##              |     0.493 |     0.507 |     0.233 | 
##              |     0.240 |     0.226 |           | 
##              |     0.115 |     0.118 |           | 
## -------------|-----------|-----------|-----------|
##    very good |      3382 |      3590 |      6972 | 
##              |     0.641 |     0.588 |           | 
##              |     0.485 |     0.515 |     0.349 | 
##              |     0.353 |     0.344 |           | 
##              |     0.169 |     0.179 |           | 
## -------------|-----------|-----------|-----------|
##         good |      2722 |      2953 |      5675 | 
##              |     0.017 |     0.016 |           | 
##              |     0.480 |     0.520 |     0.284 | 
##              |     0.284 |     0.283 |           | 
##              |     0.136 |     0.148 |           | 
## -------------|-----------|-----------|-----------|
##         fair |       884 |      1135 |      2019 | 
##              |     6.959 |     6.384 |           | 
##              |     0.438 |     0.562 |     0.101 | 
##              |     0.092 |     0.109 |           | 
##              |     0.044 |     0.057 |           | 
## -------------|-----------|-----------|-----------|
##         poor |       283 |       394 |       677 | 
##              |     5.167 |     4.740 |           | 
##              |     0.418 |     0.582 |     0.034 | 
##              |     0.030 |     0.038 |           | 
##              |     0.014 |     0.020 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |      9569 |     10431 |     20000 | 
##              |     0.478 |     0.522 |           | 
## -------------|-----------|-----------|-----------|
## 
##  
## Statistics for All Table Factors
## 
## 
## Pearson's Chi-squared test 
## ------------------------------------------------------------
## Chi^2 =  28.71183     d.f. =  4     p =  8.945012e-06 
## 
## 
## 

Since the p-value is well below .05, we reject the null hypothesis that gender and general health are independent.
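
To see where the Chi-square statistic comes from, the counts in the table can be entered by hand and the statistic rebuilt from expected counts under independence. This is a sketch of the standard Pearson calculation rather than a description of CrossTable's internals; the object names are chosen here for illustration.

```r
# Cell counts copied from the CrossTable output above.
counts <- matrix(
  c(2298, 2359,
    3382, 3590,
    2722, 2953,
     884, 1135,
     283,  394),
  ncol = 2, byrow = TRUE,
  dimnames = list(genhlth = c("excellent", "very good", "good", "fair", "poor"),
                  gender  = c("m", "f"))
)

# Expected counts under independence: (row total * column total) / N
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

chi2 <- sum((counts - expected)^2 / expected)       # sum of cell contributions
dof  <- (nrow(counts) - 1) * (ncol(counts) - 1)     # (5 - 1) * (2 - 1)
pval <- pchisq(chi2, df = dof, lower.tail = FALSE)

c(chi2 = chi2, dof = dof, p = pval)
chisq.test(counts)   # built-in test on the same counts
```

The largest contributions come from the fair and poor rows, where the split between men and women departs most from the overall split.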

Problem 7

Answer the following questions based on the table from the preceding problem. Place your answers between the questions. In each case, assume that you have selected a person at random from the dataframe cdc.

  1. What is the probability that the person is male? .478

  2. What is the probability that the genhlth is poor? .034

  3. What is the probability that the person is male and the genhlth is poor? .014

  4. What is the probability that the person is male, given that the genhlth is poor? .418

  5. What is the probability that the genhlth is poor, given that the person is male? .030
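
These answers can be read directly off the table, but they can also be computed from the cell counts. Each conditional probability is the joint probability divided by the probability of the conditioning event. The sketch below re-enters the counts from the CrossTable output; the variable names are chosen here for illustration.

```r
# Cell counts copied from the CrossTable output in Problem 6.
counts <- matrix(
  c(2298, 2359, 3382, 3590, 2722, 2953, 884, 1135, 283, 394),
  ncol = 2, byrow = TRUE,
  dimnames = list(genhlth = c("excellent", "very good", "good", "fair", "poor"),
                  gender  = c("m", "f"))
)

joint <- prop.table(counts)              # each cell divided by the table total

p_male          <- sum(joint[, "m"])     # column total for m
p_poor          <- sum(joint["poor", ])  # row total for poor
p_male_and_poor <- joint["poor", "m"]    # joint probability

p_male_given_poor <- p_male_and_poor / p_poor
p_poor_given_male <- p_male_and_poor / p_male

answers <- round(c(male = p_male, poor = p_poor,
                   male_and_poor   = p_male_and_poor,
                   male_given_poor = p_male_given_poor,
                   poor_given_male = p_poor_given_male), 3)
answers
```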

Problem 8

Test the hypothesis that the mean value of weight is the same for both genders. State your conclusion and explain how you arrived at it.

# Place your code here.
t.test(cdc$weight~cdc$gender)
## 
##  Welch Two Sample t-test
## 
## data:  cdc$weight by cdc$gender
## t = 74.957, df = 19560, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  36.67182 38.64122
## sample estimates:
## mean in group m mean in group f 
##        189.3227        151.6662

Since the p-value is below .05, we reject the null hypothesis that the mean weight is the same for both genders.
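
By default, t.test() runs the Welch version of the test, which does not assume equal variances in the two groups. The sketch below rebuilds the Welch statistic and its degrees of freedom (the Welch-Satterthwaite approximation) by hand. It uses simulated data, not the cdc values: the group sizes, means, and standard deviations are made-up numbers chosen only so the example runs on its own.

```r
set.seed(42)  # simulated stand-in data, NOT the cdc values
sim <- data.frame(
  weight = c(rnorm(50, mean = 189, sd = 36), rnorm(60, mean = 152, sd = 34)),
  gender = factor(rep(c("m", "f"), times = c(50, 60)), levels = c("m", "f"))
)

m <- sim$weight[sim$gender == "m"]
f <- sim$weight[sim$gender == "f"]

# Squared standard errors of the two group means
vm <- var(m) / length(m)
vf <- var(f) / length(f)

t_manual  <- (mean(m) - mean(f)) / sqrt(vm + vf)
df_manual <- (vm + vf)^2 / (vm^2 / (length(m) - 1) + vf^2 / (length(f) - 1))
p_manual  <- 2 * pt(-abs(t_manual), df = df_manual)

tt <- t.test(weight ~ gender, data = sim)   # Welch test, first level minus second
rbind(manual = c(t = t_manual, df = df_manual, p = p_manual),
      t.test = c(unname(tt$statistic), unname(tt$parameter), tt$p.value))
```

The data argument used here is an alternative to writing cdc$weight ~ cdc$gender; both forms run the same test.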