library(gt)

Quick tools

z/t values

percent z_value
99% 2.58
95% 1.96
90% 1.65

Confidence Interval function

conf_interval <- function(mean,sd,t_value,n,places){
  places <- as.numeric(places)
  lower <- mean-(t_value*sd/sqrt(n))
  upper <- mean+(t_value*sd/sqrt(n))
  print(
    paste0(
      "(",round(lower,places),",",round(upper,places),")"
    )
  )
}

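CIF for Unknown Population Proportion

Despite the name CIF_unk_pop_mean, this function gives a confidence interval for a population proportion.
Only requires the number of sampled items with the attribute of interest (\(Y\)) and the sample size (\(n\))
z is the z-value
places is the number of decimal places to round to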

library(dipsaus)
CIF_unk_pop_mean <- function(Y,n,z,places) {
  p_hat <- Y/n
  interval <- p_hat%+-%(z*sqrt((p_hat*(1-p_hat))/n))
  print(
    paste0(
      "(",round(min(interval),places),",",round(max(interval),places),")"
    )
  )
}

Sample size function: pop. mean

samp_size_pop_mean <- function(z,s,SE){
  n <- ((z*s)/SE)^2
  print(ceiling(n))
}

Sample size function: pop. proportion

samp_size_pop_prop <- function(z,p,SE){
  print(
    ceiling(((z^2)*p*(1-p))/(SE^2))
  )
}

Z-value Review

95%

\[ \begin{align} \alpha&=1-0.95\\ \frac{\alpha}{2}&=\frac{.05}{2}\\ z_\frac{\alpha}{2}&=z_{.025}\\ P(Z>z_{.025})&=.025\\ 1-P(Z<z_{.025})&=.025\\ P(Z<z_{.025})&=.975 \end{align} \]

qnorm(.975,mean=0,sd=1)
## [1] 1.959964

90%

\[ \begin{align} \alpha&=1-0.90\\ \frac{\alpha}{2}&=\frac{.1}{2}\\ z_\frac{\alpha}{2}&=z_{.05}\\ P(Z>z_{.05})&=.05\\ 1-P(Z<z_{.05})&=.05\\ P(Z<z_{.05})&=.95 \end{align} \]

qnorm(0.95,mean=0,sd=1)
## [1] 1.644854

99%

\[ \begin{align} \alpha&=1-0.99\\ \frac{\alpha}{2}&=\frac{.01}{2}\\ z_\frac{\alpha}{2}&=z_{.005}\\ P(Z>z_{.005})&=.005\\ 1-P(Z<z_{.005})&=.005\\ P(Z<z_{.005})&=.995 \end{align} \]

qnorm(0.995,mean=0,sd=1)
## [1] 2.575829
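
The three computations above follow one pattern: the critical value is the \(1-\frac{\alpha}{2}\) quantile of the standard normal. A minimal sketch of that pattern as a helper; the name z_for_conf is hypothetical and not part of the tools defined earlier in these notes.

```r
# Hypothetical helper: critical value z_{alpha/2} for a two-sided interval
z_for_conf <- function(conf_level) {
  alpha <- 1 - conf_level
  qnorm(1 - alpha / 2, mean = 0, sd = 1)  # P(Z < z) = 1 - alpha/2
}

# qnorm is vectorized, so several confidence levels can be passed at once
z_for_conf(c(0.90, 0.95, 0.99))
## [1] 1.644854 1.959964 2.575829
```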

Notes

Confidence interval for a proportion

\(100(1-\alpha)\%\) confidence interval for the unknown population proportion \(p\): \[ \hat{p}\pm z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \]

CAUTION: This only applies if \(n\hat{p}\geq 15\) and \(n(1-\hat{p})\geq 15\).
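
The caution above can be checked before computing an interval. A small sketch, assuming the threshold of 15 stated in the caution; the function name large_sample_ok and its threshold argument are hypothetical.

```r
# Hypothetical helper: TRUE when both n*p_hat and n*(1 - p_hat) reach the threshold
large_sample_ok <- function(Y, n, threshold = 15) {
  p_hat <- Y / n
  (n * p_hat >= threshold) && (n * (1 - p_hat) >= threshold)
}

large_sample_ok(471, 2200)  # 471 successes and 1729 failures: condition holds
large_sample_ok(5, 100)     # only 5 successes: condition fails
```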

Sample size for est. a pop. mean

\[ n=\left(\frac{z_\frac{\alpha}{2}\times\sigma}{\text{SE}}\right)^2 \]

Sample size for est. a pop. proportion

The \(\text{SE}\) formula comes from the confidence interval for the unknown population proportion \(\hat{p}\pm z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)

Objective: distance between the sample proportion and the population proportion should be no more than \(\text{SE}\) with \(\text{desired}\%\) probability

\(z_\frac{\alpha}{2}\) accounts for the \(\text{desired}\%\)
\[ \begin{align} \text{SE}&=z_\frac{\alpha}{2}\sqrt{\frac{p(1-p)}{n}}\\ (\text{SE})^2&=\left(z_\frac{\alpha}{2}\sqrt{\frac{p(1-p)}{n}}\right)^2\\ &=\frac{\left(z_\frac{\alpha}{2}\right)^2p(1-p)}{n}\\ n&=\frac{\left(z_\frac{\alpha}{2}\right)^2p(1-p)}{(\text{SE})^2} \end{align} \]

p <- seq(from=0,to=1,by=.001)
plot(p,p*(1-p))
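
The plot peaks at \(p=.5\), which is why \(.5\) is the conservative choice when nothing is known about \(p\). The same conclusion follows from a short calculus check:

\[ \begin{align} \frac{d}{dp}\,p(1-p)&=1-2p\\ 1-2p&=0\\ p&=.5\\ \frac{d^2}{dp^2}\,p(1-p)&=-2<0\\ \max_{0\le p\le 1}p(1-p)&=.5\times .5=.25 \end{align} \]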

Examples

Voting candidates

\(n=2,200 \text{ eligible voters}\) have been asked about their voting preferences in an election with two candidates, \(A\) and \(B\). 471 of the 2,200 will vote for \(A\). Let \(p\) be the proportion of people in the population who will vote for candidate \(A\). What is the 95% confidence interval for \(p\)?

  • Let \(A\) be the number of people in the entire population voting for candidate \(A\) and \(\hat{A}\) the number of people in the sample voting for \(A\)
  • Let \(p\) be the population proportion voting for \(A\) and \(\hat{p}\) the sample proportion voting for \(A\)
  • \(n\) is the total number of people in the sample

\[ \begin{align} A&=?\\ \hat{A}&=471\\ n&=2200\\ \hat{p}&=\frac{\hat{A}}{n}\\ &=\frac{471}{2200}\\ &=\color{green}{0.2140909}\\ \end{align} \]

\(\hat{A}\) follows a binomial distribution, so

\[ \begin{align} \text{SD}(\hat{A})&=\sqrt{np(1-p)}\\ &\approx \sqrt{n\hat{p}(1-\hat{p})}\\ &\approx\sqrt{2200\times 0.2140909 \times 0.7859091}\\ &\approx 19.23963\\ \text{SD}(\hat{p})&=\frac{\text{SD}(\hat{A})}{n}\\ &=\frac{19.23963}{2200}\\ &=\color{red}{0.008745286} \end{align} \]

The expected value of \(\hat{p}\) is \(p\).
Remember that \(p\) and \(\hat{p}\) are both proportions, so \(\hat{p}\) is centered on \(p\) and serves as its estimate.
\(\hat{p}\) approximately follows a normal distribution (because \(n\) is sufficiently large), so with \(z_{\frac{\alpha}{2}}=1.96\),

\[ -z_\frac{\alpha}{2}<\frac{\bar{X}-\mu}{\sigma_\bar{X}}<z_\frac{\alpha}{2}\\ -1.96<\frac{\hat{p}-p}{\color{red}{.0087}}<1.96\\ -1.96<\frac{\color{green}{0.2140909}-p}{\color{red}{.0087}}<1.96\\ -1.96\times \color{red}{.0087}<\color{green}{0.2140909}-p<1.96\times \color{red}{.0087}\\ (-1.96\times \color{red}{.0087})-\color{green}{0.2140909}<-p<(1.96\times \color{red}{.0087})-\color{green}{0.2140909}\\ -0.2311429<-p<-0.1970389\\ 0.1970389<p<0.2311429\\ (0.197,0.231) \]

So, we have 95% confidence that the population fraction of people voting for \(A\) is between 19.7% and 23.1%.
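
The hand calculation above can be cross-checked in base R. This is only a sketch that repeats the same arithmetic without rounding the standard deviation to .0087; the variable names are chosen for illustration.

```r
# Reproduce the voting example: 471 of 2200 sampled voters favor A
A_hat <- 471
n <- 2200
p_hat <- A_hat / n
se <- sqrt(p_hat * (1 - p_hat) / n)  # estimated SD of p_hat
ci <- p_hat + c(-1, 1) * 1.96 * se   # lower and upper limits
round(ci, 3)
## [1] 0.197 0.231
```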

Bleak election

Suppose that \(n=10000\) and \(\hat{A}=4700\). What is the 95% confidence interval for the proportion of the population voting for \(A\)?

  • Let \(A\) be the number of people in the entire population voting for candidate \(A\) and \(\hat{A}\) the number of people in the sample voting for \(A\)
  • Let \(p\) be the population proportion voting for \(A\) and \(\hat{p}\) the sample proportion voting for \(A\)
  • \(n\) is the total number of people in the sample

\[ \begin{align} A&=?\\ \hat{A}&=4700\\ n&=\color{blue}{10000}\\ \hat{p}&=\frac{\hat{A}}{n}\\ &=\frac{4700}{10000}\\ &=\color{red}{.47} \end{align} \]

\(n\) is sufficiently large, so the binomial distribution is approximately normal

\[ \begin{align} \hat{p}\pm z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}&= \color{red}{.47}\pm 1.96 \sqrt{\frac{\color{red}{.47}(1-\color{red}{.47})}{\color{blue}{10000}}}\\ &=.47\pm.009782344\\ &\approx .47\pm.01\\ &=(.46,.48) \end{align} \]

So, we have 95% confidence that the proportion of votes \(A\) will receive is between 46% and 48%. The whole interval lies below 50%, so a win for \(A\) looks unlikely.

Estimate wages

Suppose that we would like to estimate the average wage in an industry. How large a sample is needed to be 90% sure that the distance between the sample mean and the population mean is no more than .5? Suppose that \(\sigma=\$4.00\).

\[ \begin{align} z_\frac{\alpha}{2}\frac{4}{\sqrt{n}}&=.5\\ \\ \alpha&=1-.9\\ &=.1\\ z_\frac{\alpha}{2}&=z_{.05}\\ \text{qnorm}(1-.05,0,1)&=1.644854\\ z_{.05}&=1.644854 \end{align} \]
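
The derivation above stops at the critical value. Solving \(z_\frac{\alpha}{2}\frac{4}{\sqrt{n}}=.5\) for \(n\) gives \(n=\left(\frac{z_\frac{\alpha}{2}\times 4}{.5}\right)^2\), the same formula as in the sample size notes. A sketch of the remaining arithmetic in base R, with variable names chosen for illustration:

```r
z <- qnorm(1 - 0.05, 0, 1)  # 1.644854, as derived above
sigma <- 4                  # given population standard deviation
E <- 0.5                    # maximum allowed distance (called SE in these notes)
n <- (z * sigma / E)^2      # about 173.15
ceiling(n)                  # round up to the next whole observation
## [1] 174
```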

Exercises

Market study

A food-products company conducted a market study by randomly sampling and interviewing 1,000 consumers to determine which brand of breakfast cereal they prefer. In this sample 313 consumers were found to prefer the company’s brand. Estimate the true proportion of consumers who prefer the company’s brand using a 95% confidence interval.

\[ \begin{align} \hat{p}\pm z_{\alpha /2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}&= .313\pm 1.96\sqrt{\frac{.313(1-.313)}{1000}}\\ &=(0.284,0.342) \end{align} \]

CIF_unk_pop_mean(313,1000,1.96,3)
## [1] "(0.284,0.342)"

So, we have 95% confidence that the true proportion of consumers who prefer the company's brand is between 28.4% and 34.2%.

Household income

In a marketing study the objective is to estimate the average household income in a population. The researchers want to be 95% confident that the difference between the real population mean and the sample mean is no more than $500. A small pilot study resulted in a sample standard deviation of $2,500. How large a sample is necessary to achieve the above objective?

\[ \begin{align} n&=\left(\frac{z_\frac{\alpha}{2}\times\sigma}{\text{SE}}\right)^2\\ &=\left(\frac{1.96\times 2500}{\text{500}}\right)^2\\ &=96.04\\ n&=97\ \text{(rounding up)} \end{align} \]

samp_size_pop_mean(1.96,2500,500)
## [1] 97

Imperfect boxes

A manufacturer of boxes of candy is concerned about the proportion of imperfect boxes–those containing cracked, broken, or otherwise damaged candies.

  1. How large a sample is needed to be 99% confident that the difference between the sample fraction of imperfect boxes and the population proportion of imperfect boxes is no more than .015? Assume here that we have absolutely no information concerning the true proportion of imperfect boxes.
    We substitute .5 for \(p\) because that value maximizes the expression \(p(1-p)\) in the formula for the sample size. \(\alpha=.01\), and the sampling error is .015.
samp_size_pop_prop(2.58,.5,.015)
## [1] 7397

\[ \begin{align} n&=\frac{\left(z_\frac{\alpha}{2}\right)^2p(1-p)}{(\text{SE})^2}\\ &=\frac{(2.58)^2\times .5 \times .5}{(.015)^2}\\ &=7396 \end{align} \]

In exact arithmetic this is 7396. The R call above prints 7397, most likely because floating-point rounding leaves the quotient marginally above 7396 before ceiling() is applied.

  2. How does your answer to part (a) change if we assume that the population proportion of imperfect boxes is at least .005 and no more than .1?

Substitute .1 for \(p\), since it maximizes \(p(1-p)\) on \([.005,.1]\)

\[ \begin{align} n&=\frac{\left(z_\frac{\alpha}{2}\right)^2p(1-p)}{(\text{SE})^2}\\ &=\frac{(2.57)^2\times .1 \times .9}{(.015)^2}\\ &\approx 2642 \end{align} \]

samp_size_pop_prop(2.57,.1,.015)
## [1] 2642

The information that the true proportion is below \(.1\) reduced the sample size. (The lower bound \(.005\) was useless)
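
To see how the required sample size depends on the assumed \(p\), the formula can be evaluated over a few values. A sketch using the same \(z=2.57\) and sampling error \(.015\) as part (b); the grid of \(p\) values is an arbitrary illustration.

```r
# Required n at z = 2.57 and sampling error .015 for several assumed proportions
p_grid <- c(0.005, 0.05, 0.1, 0.5)
n_needed <- ceiling(2.57^2 * p_grid * (1 - p_grid) / 0.015^2)
data.frame(p = p_grid, n = n_needed)  # n grows as p moves toward .5
```

Because \(p(1-p)\) is increasing on \([0,.5]\), only the upper bound of the assumed range matters for the worst case.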

Textbook exercises

USPS Performance

The USPS reports that 95% of first-class mail within the same city is delivered on time (i.e. within 2 days of the time of mailing). To gauge the USPS performance, Price Waterhouse monitored the delivery of first-class mail items between Dec. 10 and Mar. 3–the most difficult delivery season due to bad weather conditions and holidays. In a sample of 332,000 items, Price Waterhouse determined that 282,200 were delivered on time. Comment on the performance of USPS first-class mail service over this time period.

\[ \begin{align} n&=332000\\ Y&=282200\\ \\ \hat{p}\pm z_\frac{\alpha}{2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}&=.85\pm 1.96 \sqrt{\frac{.85(1-.85)}{332000}} \end{align} \]

CIF_unk_pop_mean(282200,332000,1.96,3)
## [1] "(0.849,0.851)"

Comment: \(\hat{p}=.85\), and the entire 95% interval (0.849, 0.851) lies well below the reported on-time rate of .95, so the sample gives strong evidence that on-time delivery during this period fell short of the USPS figure.

Defective Items

It costs more to produce defective items–because they must be scrapped or reworked–than it does to produce non-defective items. This simple fact suggests that manufacturers should ensure the quality of their products by perfecting their production processes rather than through inspection of finished products (Out of the Crisis, Deming, 1986). In order to better understand a particular metal-stamping process, a manufacturer wishes to estimate the mean length of items produced by the process during the past 24 hours.

  1. How many parts should be sampled in order to estimate the population mean to within .1 millimeter (mm) with 90% confidence? Previous studies of this machine have indicated that the standard deviation of lengths produced by the stamping operation is about 2mm

\[ \begin{align} \sigma&=2\\ \text{SE}&=.1\\ z_{\alpha /2}=z_{.05}&\approx 1.644854\\ n=\left(\frac{z_{\alpha /2}\times \sigma}{\text{SE}}\right)^2&=\left(\frac{1.644854\times 2}{.1}\right)^2\\ &\approx 1082.22\\ n&=1083\ \text{(rounding up)} \end{align} \]

samp_size_pop_mean(1.644854,2,.1)
## [1] 1083
  2. Time permits the use of a sample size no larger than 100. If a 90% confidence interval for \(\mu\) is constructed using \(n=100\), will it be wider or narrower than would have been obtained using the sample size determined in part a? Explain.

    Wider, since the standard error \(\frac{\sigma}{\sqrt{n}}\) is larger when \(n\) is smaller. Consider: \[ \frac{2}{\sqrt{100}} \text{ vs. }\frac{2}{\sqrt{1083}}\\ 0.2\text{ vs. }0.06077371 \]

  3. If management requires that \(\mu\) be estimated to within .1 mm and that a sample size of no more than 100 be used, what is (approximately) the maximum confidence level that could be attained for a confidence interval that meets management’s specifications?

\[ \begin{align} \sigma&=2\\ \text{SE}&=.1\\ n&=100\\ \text{SE}&=z_{\alpha /2}\frac{\sigma}{\sqrt{n}}\\ .1&=z_{\alpha /2}\frac{2}{\sqrt{100}}\\ z_{\alpha /2}&=.5\\ \end{align} \]

Then, find the area to the left of the z-score

pnorm(.5)
## [1] 0.6914625

The cumulative probability of 0.6914625 represents the proportion of the distribution that falls below \(z_{\alpha /2} = .5\), i.e. \(P(Z<0.5)=0.6914625\)

So, \[ \begin{align} P(Z>0.5)&=1-P(Z<0.5)\\ &=1-\text{pnorm}(.5,0,1)\\ &=1-0.6914625\\ &=0.3085375\\ \frac{\alpha}{2}&=0.3085375\\ \alpha&=0.3085375\times 2\\ &=0.617075\\ \text{confidence level}&=1-\alpha\\ &=1-0.617075\\ &\approx 0.383 \end{align} \]

The maximum attainable confidence level is therefore approximately 38.3%.

Going backwards: \[ \begin{align} \alpha&=1-0.383\\ \frac{\alpha}{2}&=\frac{0.617}{2}\\ z_\frac{\alpha}{2}&=z_{.3085}\\ P(Z>z_{.3085})&=.3085\\ 1-P(Z<z_{.3085})&=.3085\\ P(Z<z_{.3085})&=1-.3085\\ &=0.6915 \end{align} \]

qnorm(.6915,0,1)
## [1] 0.5001066

\[ \therefore z_{.3085}=z_\frac{\alpha}{2}=0.5001066 \]
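
The steps of this last part can be collapsed into one calculation: solve \(\text{SE}=z_{\alpha /2}\frac{\sigma}{\sqrt{n}}\) for \(z_{\alpha /2}\), then take the central area \(P(-z<Z<z)=2P(Z<z)-1\). A minimal sketch; the function name max_conf_level and its argument names are hypothetical.

```r
# Hypothetical helper: largest two-sided confidence level attainable
# for sampling error E, standard deviation sigma, and sample size n
max_conf_level <- function(E, sigma, n) {
  z <- E * sqrt(n) / sigma  # solve E = z * sigma / sqrt(n) for z
  2 * pnorm(z) - 1          # central area P(-z < Z < z)
}

max_conf_level(0.1, 2, 100)  # about 0.383, matching the result above
```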