library(dplyr)
library(stats)
library(gt)

R commands

Function                      Description
shapiro.test(column)          Shapiro-Wilk test for a normal distribution
qt({t calculation}*, df)      quantile of the t-distribution

* \(\alpha=1-\text{interval}\), \(\frac{\alpha}{2}=\frac{1-\text{interval}}{2}\), \(\text{t-calculation}=1-\frac{\alpha}{2}\)
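As a small sketch of the footnote above, the probability passed to qt() can be built up from the confidence level. The variable names `level`, `alpha`, and `p` are illustrative only, and the sample size of 20 is an assumption chosen to match the examples further down.

```r
# Illustrative names (not from the notes): level, alpha, p
level <- 0.95              # confidence level ("interval" in the footnote)
alpha <- 1 - level         # total area in both tails
p     <- 1 - alpha / 2     # cumulative probability passed to qt()
t_value <- qt(p, df = 19)  # df = n - 1, assuming a sample of size n = 20
t_value
```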

Confidence Interval function

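# Print a confidence interval "(lower,upper)" for a population mean.
# mean, sd: sample mean and standard deviation; t_value: multiplier from qt();
# n: sample size; places: number of decimal places to round to.
conf_interval <- function(mean,sd,t_value,n,places){
  places <- as.numeric(places)
  margin <- t_value*sd/sqrt(n)   # margin of error
  lower <- mean-margin
  upper <- mean+margin
  print(
    paste0(
      "(",round(lower,places),",",round(upper,places),")"
    )
  )
}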

Sample set statistics

Suppose that we select a sample of size \(n\) from a population of size \(N\). We assume that \(N\) is much larger than \(n\).

For example, the sample size may be \(n = 100\). The population mean and standard deviation, \(\mu\) and \(\sigma\), are unknown parameters. Quantities that we can calculate from the sample are called statistics; two such statistics are the sample mean and the sample standard deviation.

Sample mean

Note: we usually estimate \(\mu\) by \(\bar{X}\)

\[ \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i \]

IMPORTANT: \(\bar{X}\) is a random variable: a different random sample would lead to a different value for \(\bar{X}\).

\(\therefore\bar{X}\) has its own mean and standard deviation.

Remember that we usually estimate \(\mu\) by \(\bar{X}\)? It is not surprising that the mean of \(\bar{X}\) is equal to \(\mu\).

If we take many random samples, then all these sample averages will oscillate around \(\mu\).
(sample mean is approximately the same as the population mean)

\[ \begin{align} \mu_{\bar{x}}&=E[\bar{X}]\\ &=\mu \end{align} \]

Sample standard deviation

\[ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left( X_i-\bar{X}\right )^2} \]

Caution: the formula below for the standard deviation of \(\bar{X}\) is exact only when sampling with replacement (putting each item back after it is drawn). When sampling without replacement, it is still a good approximation as long as the population size \(N\) is much larger than the sample size \(n\), because the finite population correction factor \(\sqrt{\frac{N-n}{N-1}}\) is then close to 1.

\[ \begin{align} \sigma_{\bar{x}}&=\text{SD}\left(\bar{X}\right)\\ &=\frac{\sigma}{\sqrt{n}} \end{align} \]
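The two facts above (the mean of \(\bar{X}\) is \(\mu\), and its standard deviation is \(\sigma/\sqrt{n}\)) can be checked with a short simulation. This is only a sketch: the population values, the normal population, the number of repetitions, and the seed are all assumptions made for illustration.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
mu <- 500; sigma <- 80; n <- 100   # assumed population values
# Draw 10,000 samples of size n and record each sample mean
xbars <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))
mean(xbars)       # should be close to mu
sd(xbars)         # should be close to sigma / sqrt(n)
sigma / sqrt(n)   # theoretical value: 8
```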

Central Limit Theorem

The Central Limit Theorem tells us that, when the sample size is large enough, \(\bar{X}\) follows an approximately normal distribution.
Resource

The central limit theorem says that the sampling distribution of the mean will always be normally distributed, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean will be normal. Source

  • \(\bar{X}\) = the sample mean, whose sampling distribution is described here
  • \(\mathcal{N}\) = normal distribution, written here with its mean and standard deviation

\[ \bar{X}\thicksim \mathcal{N} \left(\mu,\frac{\sigma}{\sqrt{n}}\right) \]
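A simulation can illustrate the theorem for a population that is clearly not normal. The sketch below assumes an exponential population with rate 1 (mean 1, standard deviation 1) and a sample size of 50; these choices, and the seed, are illustrative and not from the notes.

```r
set.seed(2)   # arbitrary seed
n <- 50
# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed
xbars <- replicate(5000, mean(rexp(n, rate = 1)))
# Share of sample means within 1.96 standard errors of the population mean;
# under the normal approximation this should be near 0.95
se <- 1 / sqrt(n)
coverage <- mean(abs(xbars - 1) < 1.96 * se)
coverage
```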

Confidence interval

Range of values that describes the uncertainty surrounding an estimate Source

General 95% confidence

  • Valid if \(n\) is large (\(n>30\)), or the original population is normal.
  • If \(n\) is small, then population must be normal

\[ \left(\bar{X}{\color{red}-}1.96\times\frac{\sigma}{\sqrt{n}}, \bar{X}{\color{red}+}1.96\times\frac{\sigma}{\sqrt{n}}\right) \]

Change confidence level

Let \(\alpha = 1-\text{(confidence level)}\)

\[ \left(\bar{X}{\color{red}-}z_{\frac{\alpha}{2}}\times\frac{\sigma}{\sqrt{n}}, \bar{X}{\color{red}+}z_{\frac{\alpha}{2}}\times\frac{\sigma}{\sqrt{n}}\right)^* \]

* \(t\)-distribution replaces \(z\); see \(t\)-distribution below
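The \(z\)-based interval for a known \(\sigma\) can be sketched as a small function. `z_interval` is a hypothetical helper, not part of the original notes; the call uses the household figures from the examples later in these notes (\(\bar{X}=29000\), \(\sigma=8000\), \(n=200\)).

```r
# z_interval is a hypothetical helper, not part of the original notes
z_interval <- function(xbar, sigma, n, level = 0.95) {
  alpha <- 1 - level
  z <- qnorm(1 - alpha / 2)        # z_{alpha/2}: area alpha/2 to its right
  margin <- z * sigma / sqrt(n)    # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}
ci95 <- z_interval(xbar = 29000, sigma = 8000, n = 200)
ci95
```

Changing `level` changes only the multiplier; the standard error \(\sigma/\sqrt{n}\) stays the same.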

Issue: \(\sigma\) is usually unknown
SOLUTION: replace \(\sigma\) with \(s\), notation for SAMPLE standard deviation.
New issue: have to use the \(t\)-distribution to find the multiplier for the \(\frac{s}{\sqrt{n}}\) factor

CAUTION: When using the sample standard deviation \(s\) instead of \(\sigma\), we need to assume that the original population follows an approximately normal distribution. This is especially important if \(n\) is small, say \(n \leq 30\).

\(t\)-distribution

“The t-distribution is a continuous probability distribution of the z-score when the [sample] standard deviation is used in the denominator rather than the [population] standard deviation” Source

  • Has 1 parameter: degrees of freedom \(d\) (\(n-1\) in this case)
  • if \(d\) is large, then there is very little difference between the normal and the \(t\)-distribution.
  • \(t_{\frac{\alpha}{2}}\) replaces \(z_{\frac{\alpha}{2}}\)

\[ \left(\bar{X}{\color{red}-}\left(t_{\frac{\alpha}{2}}\right)\times\frac{s}{\sqrt{n}}, \bar{X} {\color{red} +} \left(t_{\frac{\alpha}{2}}\right)\times\frac{s}{\sqrt{n}}\right) \]
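To see how quickly the \(t\) multiplier approaches the \(z\) multiplier, the sketch below tabulates \(t_{.025}\) for a few degrees of freedom. The particular values of `df` are chosen only for illustration.

```r
df <- c(5, 19, 99, 999)        # illustrative degrees of freedom
t_mult <- qt(0.975, df = df)   # t_{.025} for each degrees-of-freedom value
z_mult <- qnorm(0.975)         # z_{.025}
data.frame(df = df, t = round(t_mult, 3), z = round(z_mult, 3))
```

The \(t\) multiplier is always larger than the \(z\) multiplier and shrinks toward it as the degrees of freedom grow.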

Finding confidence interval

  1. Check normal distribution using shapiro.test()
  2. Evaluate \(p\)-value
    • If \(p\)-value \(>5\%\), THEN assume population is normal
    • Otherwise, reject the assumption that the population follows a normal distribution
  3. Calculate confidence interval

\[ \begin{align} \text{lower}&=\bar{X}{\color{red}-}t_{\alpha/2}\times \frac{s}{\sqrt{n}}\\ \text{upper}&=\bar{X}{\color{red}+}t_{\alpha/2}\times \frac{s}{\sqrt{n}} \end{align} \]
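The three steps can be sketched end to end on a made-up sample. The data here are simulated purely for illustration, and the manual interval is cross-checked against the one reported by `t.test()` from base R.

```r
set.seed(3)                          # arbitrary seed
x <- rnorm(20, mean = 50, sd = 10)   # made-up sample, for illustration only

# Steps 1-2: Shapiro-Wilk test; a p-value above 0.05 gives no evidence against normality
p_value <- shapiro.test(x)$p.value

# Step 3: t-based 95% confidence interval for the population mean
n <- length(x)
t_value <- qt(0.975, df = n - 1)
manual <- mean(x) + c(-1, 1) * t_value * sd(x) / sqrt(n)

# Cross-check against the interval reported by t.test()
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)
manual
builtin
```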

95% probability Review

Suppose that \(Z\) is a standard normal random variable. Find the value \(w\) so that \(P[-w<Z<+w]=0.95\)

By symmetry, \[ \begin{align} P[-w<Z<+w]&=0.95\\ P(0<Z<w)&=0.95/2\\ &=0.475\\ P(Z<0)&= 0.5\\ P(Z<w)&=P(Z<0)+P(0<Z<w)\\ &=0.5+0.475\\ &=0.975 \end{align} \]

So \(w\) is the 97.5th percentile of the standard normal distribution:

qnorm(0.975,mean=0,sd=1)
## [1] 1.959964
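As a quick check of this result, the area between \(-w\) and \(+w\) can be computed back from the quantile. The variable names `w` and `area` are illustrative.

```r
w <- qnorm(0.975)              # 97.5th percentile of the standard normal
area <- pnorm(w) - pnorm(-w)   # area between -w and +w
area
```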

Examples

Sample mean interval

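Here the population has \(\mu=500\) and \(\sigma=80\), the sample size is \(n=100\), and we want \(P(490<\bar{X}<510)\).

\[ \begin{align} \text{SD}\left(\bar{X}\right)&=\frac{\sigma}{\sqrt{n}}\\ &=\frac{80}{\sqrt{100}}\\ &= 8 \end{align} \]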

Per Central Limit Theorem:

\[ \begin{align} \bar{X}&\thicksim \mathcal{N}(500,8)\\ P(490<\bar{X}<510)&=P(\bar{X}<510)-P(\bar{X}<490)\\ &=0.8943502-0.1056498\\ &=0.7887005 \end{align} \]

pnorm(510,mean=500,sd=8)
## [1] 0.8943502
pnorm(490,mean=500,sd=8)
## [1] 0.1056498
pnorm(510,mean=500,sd=8)-pnorm(490,mean=500,sd=8)
## [1] 0.7887005
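The same probability can also be found by standardizing first and using the standard normal distribution. This is a sketch of that alternative route; the names `se`, `z_low`, `z_high`, and `prob` are illustrative.

```r
se <- 80 / sqrt(100)           # standard deviation of the sample mean
z_low  <- (490 - 500) / se     # standardized lower bound
z_high <- (510 - 500) / se     # standardized upper bound
prob <- pnorm(z_high) - pnorm(z_low)
prob
```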

Confidence interval

200 households are drawn at random from a population, and their incomes are recorded.

  • We have \(\bar{X} = 29000\).
  • Suppose that \(\sigma = 8000\).

What is the 95% confidence interval for \(\mu\)?

The Central Limit Theorem states that

\[ \begin{align} \bar{X}&\thicksim \mathcal{N}\left(\mu,\frac{\sigma}{\sqrt{n}}\right)\\ &= \mathcal{N}(\mu,566) \end{align} \]

By standardization, \[ \frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}=\frac{\bar{X}-\mu}{566} \]

follows a standard normal distribution

\[ P\left[-1.96<\frac{\bar{X}-{\color{blue}\mu}}{566}<1.96\right]=0.95\\ P[(\bar{X}{\color{red}-}1.96\times 566)<{\color{blue}\mu}<(\bar{X}{\color{red}+}1.96\times 566)]=0.95\\ P[(29000{\color{red}-}1.96\times 566)<{\color{blue}\mu}<(29000{\color{red}+}1.96\times 566)]=0.95\\ P[27890<{\color{blue}\mu}<30109]=0.95\\ \therefore (27890,30109) \]

Change confidence level

Find the 90% confidence interval

\[ \begin{align} \alpha&= 1-0.9\\ &=0.1\\ \frac{\alpha}{2}&=0{\color{red}{.05}}\\ z_{\alpha/2}&=z_{\color{red}{.05}}\\ \end{align} \]

Remember, \(z_{.05}\) is the value with area \(.05\) to its right under the standard normal curve. qnorm() takes the area to the left, which is \(.95\) (so we are looking for the 95th percentile).

qnorm(.95,0,1)
## [1] 1.644854

\[ P\left[-1.64<\frac{\bar{X}-{\color{blue}\mu}}{\frac{\sigma}{\sqrt{n}}}<1.64\right] \]

(No explicit answer given, but it replaces 1.96 in the “Confidence interval” example)
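For completeness, the 90% interval can be computed directly. This sketch reuses the household values from the “Confidence interval” example (\(\bar{X}=29000\), \(\sigma=8000\), \(n=200\)) and keeps the unrounded multiplier and standard error, so the result may differ slightly from a hand calculation with 1.64 and 566.

```r
xbar <- 29000; sigma <- 8000; n <- 200   # values from the household example
z <- qnorm(0.95)                         # z_{.05} for a 90% interval
margin <- z * sigma / sqrt(n)            # margin of error
ci90 <- c(lower = xbar - margin, upper = xbar + margin)
round(ci90)
```

The 90% interval is narrower than the 95% interval because the multiplier is smaller.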

\(t\)-distribution

Suppose that the sample size is 20 and we want a 95% confidence interval for the population mean.

\[ \begin{align} \alpha&=1-.95\\ &=.05\\ \frac{\alpha}{2}&=.025\\ \therefore t_{\alpha/2}&=t_{.025} \end{align} \]

To the left of \(t_{.025}\): \(1-.025=.975\)

qt(.975, df=19) #df=19 is the degree of freedom
## [1] 2.093024

\(\therefore t_{.025}=2.09\)

If the sample size is 1000, then:

qt(.975, df=999) #remember: n-1
## [1] 1.962341

Notice: similar to \(z_{.025}\)

Taste-testers

scores <- readr::read_csv("scores.CSV")
attach(scores)
shapiro.test(Scores)
## 
##  Shapiro-Wilk normality test
## 
## data:  Scores
## W = 0.91983, p-value = 0.09835

Confirmed that \(p\)-value \(>5\%\), so assume that the population is normal

conf_interval(mean(Scores),sd(Scores),t_value=qt(.975,df=19),n=20,2)
## [1] "(46.61,63.19)"
detach(scores)

\(\therefore\) we are 95% confident that the average score within the entire population is within (46.61,63.19)

Bottles

A bottling process fills 16-ounce bottles. It is important that the average volume placed in the containers is 16 ounces, that is, overfilling or under-filling is a problem. The quality control inspector selects 20 bottles from the filling process and measures the volume of liquid each contains. The data are available in the file “BOTTLES”. Check for normality and calculate a 95% confidence interval for the population mean.

bottles <- readr::read_csv("bottles.CSV")
volumesnew <- as.numeric(as.character(bottles$Volumes))
shapiro.test(volumesnew)
## 
##  Shapiro-Wilk normality test
## 
## data:  volumesnew
## W = 0.94171, p-value = 0.2583

\(p\)-value is \(>5\%\), so assume that the population is normal

conf_interval(mean(volumesnew),sd(volumesnew),qt(.975,19),20,2)
## [1] "(16.06,16.3)"

Exercises

Fish Filet

Bluefish purchased at the Lime Beach Fishing Terminal produce a filet weight which has a mean of 4.5 pounds with a standard deviation of 0.8 pound. If a restaurant manager purchases 50 such fish, then what is the probability that she will have at least 220 pounds of filets?

Let \(W\) be the total weight of the filets, in pounds.

The total is at least 220 pounds exactly when the average filet weight is at least \(\frac{220}{50}={\color{purple}{4.4}}\) pounds.

Use the Central Limit Theorem: \(\bar{X}\thicksim \mathcal{N}\left(4.5,\frac{.8}{\sqrt{50}}\right)\)

\[ \begin{align} {\color{green}\mu}&=4.5\\ {\color{blue}\sigma}&=0.8\\ {\color{red}n}&=50\\ \\ {\color{green}\mu_{\bar{X}}}&={\color{green}\mu}\\ &=4.5\\ {\color{blue}\sigma_{\bar{X}}}&=\frac{{\color{blue}\sigma}}{\sqrt{{\color{red}n}}}\\ &=\frac{.8}{\sqrt{50}}\\ &\approx.11\\ \\ P(W\geq220)&=P(\bar{X}\geq{\color{purple}{4.4}})\\ &= 1-P(\bar{X}<{\color{purple}{4.4}})\\ &=1-\text{pnorm}({\color{purple}{4.4}},{\color{green}{4.5}},{\color{blue}{0.11}})\\ &=1-0.1816511\\ &=0.8183489\\ &\approx.82 \end{align} \]
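The same probability can be computed in R without rounding the standard error to 0.11, either through the sample mean or through the total weight. This is a sketch; the names `p_mean` and `p_total` are illustrative. Because the standard error is not rounded, the result comes out slightly lower than the hand calculation, at about 0.81.

```r
mu <- 4.5; sigma <- 0.8; n <- 50
# Via the sample mean: P(Xbar >= 220 / n)
p_mean  <- 1 - pnorm(220 / n, mean = mu, sd = sigma / sqrt(n))
# Via the total weight: the sum of n fish has mean n * mu and SD sigma * sqrt(n)
p_total <- 1 - pnorm(220, mean = n * mu, sd = sigma * sqrt(n))
c(p_mean, p_total)
```

Both routes standardize to the same \(z\)-score, so the two probabilities agree.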

Garden equipment

Your quality control department has just analyzed the contents of 20 randomly selected barrels of materials to be used in manufacturing plastic garden equipment. The results found an average of 41.93 gallons of usable materials per barrel. The sample standard deviation has been .1789 gallons. Find the 95% confidence interval for the population mean. Assume that the population distribution is normal.

\[ \begin{align} \alpha&=1-0.95\\ \frac{\alpha}{2}&=0.025\\ t_\frac{\alpha}{2}&=t_{.025}\\ \text{qt}((1-.025),(20-1))&=2.093024\\ \left(\bar{X}-t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}},\bar{X}+t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}\right)&= \left(41.93-2.093024 \times \frac{.1789}{\sqrt{20}}, 41.93+2.093024 \times \frac{.1789}{\sqrt{20}}\right)\\ &=(41.85,42.01) \end{align} \]

conf_interval(41.93,.1789,qt((1-.025),(20-1)),20,2)
## [1] "(41.85,42.01)"

Sick leave

A company is interested in estimating \(\mu\), the mean number of days of sick leave during the last year taken by all its employees. They select a random sample of 100 employees and note the number of sick days taken by each employee in the sample. The following sample statistics are computed: \(\bar{X} = 12.2 \text{ days}\), \(s = 3\text{ days}\). Find a 95% confidence interval for \(\mu\).

\[ \begin{align} \alpha &= 1-0.95\\ \frac{\alpha}{2}&=\frac{0.05}{2}\\ &=.025\\ t_\frac{\alpha}{2}&=t_{.025}\\ \text{qt}((1-0.025),(100-1))&=1.984217\\ \left(\bar{X}-t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}},\bar{X}+t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}\right)&= \left(12.2-1.984217 \times \frac{3}{\sqrt{100}}, 12.2+1.984217 \times \frac{3}{\sqrt{100}}\right)\\ &=(11.6,12.8) \end{align} \]

conf_interval(mean=12.2,sd=3,t_value=qt((1-0.025),(100-1)),n=100,places=2)
## [1] "(11.6,12.8)"