library(dplyr)
library(stats)
library(gt)
| functions | description |
|---|---|
| shapiro.test(column) | Shapiro-Wilk test for normality |
| qt({t calculation}*,df) | quantile of the t-distribution |
* \(\alpha=1-\text{confidence level}\), \(\frac{\alpha}{2}=\frac{1-\text{confidence level}}{2}\), \(\text{t-calculation}=1-\frac{\alpha}{2}\) (the left-tail probability passed to qt())
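As a minimal sketch of the footnote above, the probability passed to qt() can be derived from the confidence level. The helper name `t_multiplier` is hypothetical and not part of the original notes.

```r
# Hypothetical helper: t multiplier for a two-sided confidence interval.
t_multiplier <- function(level, n) {
  alpha <- 1 - level               # e.g. 0.05 for a 95% interval
  qt(1 - alpha / 2, df = n - 1)    # left-tail probability 1 - alpha/2, n - 1 degrees of freedom
}

t_multiplier(0.95, 20)  # same call as qt(.975, df = 19)
```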
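# Print a confidence interval for the mean, rounded to `places` decimals.
conf_interval <- function(mean, sd, t_value, n, places) {
  places <- as.numeric(places)
  margin <- t_value * sd / sqrt(n)  # multiplier times the standard error
  lower <- mean - margin
  upper <- mean + margin
  print(paste0("(", round(lower, places), ",", round(upper, places), ")"))
}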
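The function above takes summary statistics and prints a string. A possible variant, sketched here under the assumption that the raw sample is available, returns the limits as numbers and can be compared with base R's `t.test()`. The name `conf_interval_raw` and the data vector are made up for illustration.

```r
# Hypothetical variant of conf_interval(): takes the raw sample, returns numeric limits.
conf_interval_raw <- function(x, level = 0.95) {
  n <- length(x)
  t_value <- qt(1 - (1 - level) / 2, df = n - 1)  # t multiplier with n - 1 degrees of freedom
  margin <- t_value * sd(x) / sqrt(n)             # multiplier times s / sqrt(n)
  c(lower = mean(x) - margin, upper = mean(x) + margin)
}

# Made-up data, for illustration only.
x <- c(12.1, 9.8, 11.4, 10.6, 13.2, 10.9, 11.7, 9.5)
conf_interval_raw(x)
t.test(x)$conf.int  # base R's one-sample t interval, for comparison
```

Both calls should report the same limits, since `t.test()` uses the same \(t\)-based formula for a one-sample interval.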
Suppose that we select a sample of size \(n\) from a population of size \(N\). We assume that \(N\) is much larger than \(n\).
For example, the sample size may be \(n = 100\). The population mean and standard deviation are \(\mu\) and \(\sigma\); these are unknown parameters. Quantities that we can calculate from the sample are called statistics. Examples are the sample mean \(\bar{X}\) and the sample standard deviation \(s\).
Note: we usually estimate \(\mu\) by \(\bar{X}\):
\[ \bar{X}=\frac{1}{n}\sum_{i=1}^{n} X_i \]
IMPORTANT: \(\bar{X}\) is a random variable: a different random sample would lead to a different value for \(\bar{X}\).
\(\therefore\bar{X}\) has its own mean and standard deviation.
Since we usually estimate \(\mu\) by \(\bar{X}\), it is not surprising that the mean of \(\bar{X}\) is equal to \(\mu\).
If we take many random samples, the sample averages will scatter around \(\mu\) (each sample mean is approximately, not exactly, equal to the population mean).
\[ \begin{align} \mu_{\bar{x}}&=E[\bar{X}]\\ &=\mu \end{align} \]
\[ s=\sqrt{\frac{1}{n-1} \sum_{i=1}^{n}\left( X_i-\bar{X}\right )^2} \]
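Caution: the formula \(\text{SD}(\bar{X})=\sigma/\sqrt{n}\) below is exact only if we are sampling with replacement (putting each item back after it is drawn). As long as the population is much larger than the sample (\(N\) much larger than \(n\)), sampling without replacement makes little difference, because the sampled fraction \(n/N\) shrinks as \(N\) grows. (The rule of thumb \(n>30\) concerns the sample size needed for the Central Limit Theorem, not the population size.)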
\[ \begin{align} \sigma_{\bar{x}}&=\text{SD}\left(\bar{X}\right)\\ &=\frac{\sigma}{\sqrt{n}} \end{align} \]
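A small simulation can make the two results above concrete. This is only a sketch: the values \(\mu=500\), \(\sigma=80\), \(n=100\), the normal population, the seed and the number of repetitions are all assumptions chosen for illustration.

```r
set.seed(1)                          # arbitrary seed so the run is reproducible
mu <- 500; sigma <- 80; n <- 100     # assumed illustrative values

# Draw 10,000 independent samples of size n and record each sample mean.
sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)  # should be close to mu = 500
sd(sample_means)    # should be close to sigma / sqrt(n) = 8
```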
The Central Limit Theorem tells us that, when the sample size is large enough, \(\bar{X}\) approximately follows a normal distribution.
The central limit theorem says that the sampling distribution of the mean will always be normally distributed, as long as the sample size is large enough. Regardless of whether the population has a normal, Poisson, binomial, or any other distribution, the sampling distribution of the mean will be normal. Source
\[ \bar{X}\thicksim \mathcal{N} \left(\mu,\frac{\sigma}{\sqrt{n}}\right) \]
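To see the theorem at work on a clearly non-normal population, one can simulate sample means from an exponential distribution and check how often they fall within 1.96 standard deviations of \(\mu\). The population (rate 1, so \(\mu=\sigma=1\)), the sample size and the seed are assumptions made for this sketch.

```r
set.seed(2)   # arbitrary seed
n <- 50       # assumed sample size

# Exponential population with rate 1: mean 1, standard deviation 1, strongly right-skewed.
sample_means <- replicate(10000, mean(rexp(n, rate = 1)))

se <- 1 / sqrt(n)  # sigma / sqrt(n)

# Share of sample means within 1.96 * se of mu; near 0.95 if the normal approximation is good.
coverage <- mean(abs(sample_means - 1) < 1.96 * se)
coverage
```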
A confidence interval is a range of values that describes the uncertainty surrounding an estimate. Source
\[ \left(\bar{X}{\color{red}-}1.96\times\frac{\sigma}{\sqrt{n}}, \bar{X}{\color{red}+}1.96\times\frac{\sigma}{\sqrt{n}}\right) \]
Let \(\alpha = 1-\text{(confidence level)}\)
\[ \left(\bar{X}{\color{red}-}z_{\frac{\alpha}{2}}\times\frac{\sigma}{\sqrt{n}}, \bar{X}{\color{red}+}z_{\frac{\alpha}{2}}\times\frac{\sigma}{\sqrt{n}}\right)^* \]
* \(t\)-distribution replaces \(z\); see \(t\)-distribution below
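The general \(z\)-based formula can be sketched as a small function. It applies only when \(\sigma\) is known; the name `z_interval` and the example inputs are hypothetical.

```r
# Hypothetical helper: z-based confidence interval, valid when sigma is known.
z_interval <- function(xbar, sigma, n, level = 0.95) {
  alpha <- 1 - level
  z <- qnorm(1 - alpha / 2)        # 1.959964 for a 95% interval
  margin <- z * sigma / sqrt(n)
  c(lower = xbar - margin, upper = xbar + margin)
}

z_interval(xbar = 50, sigma = 10, n = 25)  # made-up inputs: 50 -/+ 1.96 * 2
```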
Issue: \(\sigma\) is usually unknown.
SOLUTION: replace \(\sigma\) with \(s\), the notation for the SAMPLE standard deviation.
New issue: we have to use the \(t\)-distribution to find the multiplier for the \(\frac{s}{\sqrt{n}}\) factor.
CAUTION: when using the sample standard deviation \(s\) instead of \(\sigma\), we need to assume that the original population approximately follows a normal distribution. This is especially important if \(n\) is small, say \(n \leq 30\).
“The t-distribution is a continuous probability distribution of the
z-score when the [sample] standard deviation is used in the denominator
rather than the [population] standard deviation” Source
\[ \left(\bar{X}{\color{red}-}\left(t_{\frac{\alpha}{2}}\right)\times\frac{s}{\sqrt{n}}, \bar{X} {\color{red} +} \left(t_{\frac{\alpha}{2}}\right)\times\frac{s}{\sqrt{n}}\right) \]
After checking normality with shapiro.test(), compute the limits: \[ \begin{align} \text{lower}&=\bar{X}\color{red}{-}t_{\alpha/2}\times \frac{s}{\sqrt{n}}\\ \text{upper}&=\bar{X}\color{red}{+}t_{\alpha/2}\times \frac{s}{\sqrt{n}} \end{align} \]
Suppose that \(Z\) is a standard normal random variable. Find the value \(w\) so that \(P[-w<Z<+w]=0.95\)
By symmetry, \[
\begin{align}
P[-w<Z<+w]&=0.95\\
P(0<Z<w)&=0.95/2\\
&=0.475\\
P(Z<0)&=P(Z>0)\\
&= 0.5\\
P(Z<w)&=P(Z<0)+P(0<Z<w)\\
&=0.5+0.475\\
&=0.975
\end{align}
\]
qnorm(0.975,mean=0,sd=1)
## [1] 1.959964
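As a quick check, using only base R, the area between \(-w\) and \(+w\) can be recomputed from the value just found:

```r
w <- qnorm(0.975)       # mean = 0 and sd = 1 are the defaults
pnorm(w) - pnorm(-w)    # area between -w and +w; should equal 0.95
```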
\[ \begin{align} \text{SD}\left(\bar{X}\right)&=\frac{\sigma}{\sqrt{n}}\\ &=\frac{80}{\sqrt{100}}\\ &= 8 \end{align} \]
Per Central Limit Theorem:
\[ \begin{align} \bar{X}&\thicksim \mathcal{N}(500,8)\\ P(490<\bar{X}<510)&=P(\bar{X}<510)-P(\bar{X}<490)\\ &=0.8943502-0.1056498\\ &=0.7887005 \end{align} \]
pnorm(510,mean=500,sd=8)
## [1] 0.8943502
pnorm(490,mean=500,sd=8)
## [1] 0.1056498
pnorm(510,mean=500,sd=8)-pnorm(490,mean=500,sd=8)
## [1] 0.7887005
200 households are drawn at random from a population, and their
incomes are recorded.
\[ \begin{align} \bar{X}&\thicksim \mathcal{N}\left(\mu,\frac{\sigma}{\sqrt{n}}\right)\\ &= \mathcal{N}(\mu,566) \end{align} \]
By standardization we have that \[ \frac{\bar{X}-\mu}{\sigma/\sqrt{n}}=\frac{\bar{X}-\mu}{566} \]
follows a standard normal distribution.
\[ P\left[-1.96<\frac{\bar{X}-{\color{blue}\mu}}{566}<1.96\right]=0.95\\ P[(\bar{X}{\color{red}-}1.96\times 566)<{\color{blue}\mu}<(\bar{X}{\color{red}+}1.96\times 566)]=0.95\\ P[(29000{\color{red}-}1.96\times 566)<{\color{blue}\mu}<(29000{\color{red}+}1.96\times 566)]=0.95\\ P[27891<{\color{blue}\mu}<30109]=0.95\\ \therefore (27891,30109) \]
Find the 90% confidence interval
\[ \begin{align} \alpha&= 1-0.9\\ &=0.1\\ \frac{\alpha}{2}&=0{\color{red}{.05}}\\ z_{\alpha/2}&=z_{\color{red}{.05}}\\ \end{align} \]
Remember, \(z_{.05}\) is the value with area \(.05\) to its right. qnorm() takes the area to the left, which is \(.95\), so we are looking for the 95th percentile.
qnorm(.95,0,1)
## [1] 1.644854
\[ P\left[-1.64<\frac{\bar{X}-{\color{blue}\mu}}{\frac{\sigma}{\sqrt{n}}}<1.64\right] \]
(No explicit answer is given; 1.64 replaces 1.96 in the “Confidence interval” example.)
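For completeness, here is a sketch of the 90% interval, assuming the sample mean of 29000 and the standard deviation of \(\bar{X}\) of 566 carry over from the household example. The variable names are illustrative.

```r
xbar <- 29000       # sample mean from the household example
se <- 566           # sigma / sqrt(n) from the household example
z <- qnorm(0.95)    # 1.644854, the 90% multiplier

ci_90 <- c(lower = xbar - z * se, upper = xbar + z * se)
ci_90               # roughly (28069, 29931)
```

The 90% interval is narrower than the 95% one because the multiplier drops from 1.96 to about 1.64.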
Suppose that the sample size is 20 and we want a 95% confidence interval for the population mean.
\[ \begin{align} \alpha&=1-.95\\ &=.05\\ \frac{\alpha}{2}&=.025\\ t_{\alpha/2}&=t_{.025} \end{align} \]
The area to the left of \(t_{.025}\) is \(1-.025=.975\):
qt(.975, df=19) # df = n - 1 = 19 degrees of freedom
## [1] 2.093024
\(\therefore t_{.025}=2.09\)
If the sample size is 1000, then:
qt(.975, df=999) #remember: n-1
## [1] 1.962341
Notice: this is similar to \(z_{.025}=1.96\)
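The convergence of the \(t\) multiplier towards the \(z\) multiplier can be tabulated directly. The degrees of freedom below are arbitrary choices for illustration.

```r
dfs <- c(19, 99, 999, 9999)        # arbitrary degrees of freedom (n - 1)
t_vals <- qt(0.975, df = dfs)      # 95% t multipliers

data.frame(df = dfs, t_multiplier = t_vals)
qnorm(0.975)                       # limiting z multiplier, 1.959964
```

The multipliers decrease as the degrees of freedom grow and stay slightly above the \(z\) value.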
scores <- readr::read_csv("scores.CSV")
attach(scores)
shapiro.test(Scores)
##
## Shapiro-Wilk normality test
##
## data: Scores
## W = 0.91983, p-value = 0.09835
The \(p\)-value (0.098) is \(>5\%\), so we do not reject normality.
conf_interval(mean(Scores),sd(Scores),t_value=qt(.975,df=19),n=20,2)
## [1] "(46.61,63.19)"
detach(scores)
\(\therefore\) we are 95% confident that the average score in the entire population lies within (46.61,63.19)
A bottling process fills 16-ounce bottles. It is important that the average volume placed in the containers is 16 ounces, that is, overfilling or under-filling is a problem. The quality control inspector selects 20 bottles from the filling process and measures the volume of liquid each contains. The data are available in the file “BOTTLES”. Check for normality and calculate a 95% confidence interval for the population mean.
bottles <- readr::read_csv("bottles.CSV")
volumesnew <- as.numeric(as.character(bottles$Volumes))
shapiro.test(volumesnew)
##
## Shapiro-Wilk normality test
##
## data: volumesnew
## W = 0.94171, p-value = 0.2583
\(p\)-value is \(>5\%\), so assume that the population is normal
conf_interval(mean(volumesnew),sd(volumesnew),qt(.975,19),20,2)
## [1] "(16.06,16.3)"
Bluefish purchased at the Lime Beach Fishing Terminal produce a filet
weight which has a mean of 4.5 pounds with a standard deviation of 0.8
pound. If a restaurant manager purchases 50 such fish, then what is the
probability that she will have at least 220 pounds of filets?
Let \(F\) be the total weight of the filets in pounds.
The total amount of fish is at least 220 pounds if the average weight per fish is at least \(\frac{220}{50}=\color{purple}{4.4}\) pounds.
Use the Central Limit Theorem: \(\bar{X}\thicksim \mathcal{N}\left(4.5,\frac{.8}{\sqrt{50}}\right)\)
\[ \begin{align} \color{green}\mu&=4.5\\ \color{blue}\sigma&=0.8\\ \color{red}n&=50\\ \\ \color{green}\mu_\bar{X}&=\color{green}\mu\\ &=4.5\\ \color{blue}\sigma_\bar{X}&=\frac{\color{blue}\sigma}{\sqrt{\color{red}n}}\\ &=\frac{.8}{\sqrt{50}}\\ &=.11\\ \\ P(F\geq220)&= 1-P(F\leq220)\\ &=1-\text{pnorm}(\color{purple}{4.4},\color{green}{4.5},\color{blue}{0.11})\\ &=1-0.1816511\\ &=0.8183489\\ &\approx.82 \end{align} \]
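The calculation above rounds the standard deviation of \(\bar{X}\) to .11 before calling pnorm(). As a sketch, the same probability with the unrounded value comes out slightly lower. The variable names are illustrative.

```r
mu <- 4.5; sigma <- 0.8; n <- 50
se <- sigma / sqrt(n)     # 0.1131371 before rounding to .11

# P(average weight >= 220 / 50) under the normal approximation
p_exact <- 1 - pnorm(220 / n, mean = mu, sd = se)
p_exact                   # about 0.81, versus 0.82 with the rounded value
```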
Your quality control department has just analyzed the contents of 20 randomly selected barrels of materials to be used in manufacturing plastic garden equipment. The analysis found an average of 41.93 gallons of usable materials per barrel, with a sample standard deviation of .1789 gallons. Find the 95% confidence interval for the population mean. Assume that the population distribution is normal.
\[ \begin{align} \alpha&=1-0.95\\ \frac{\alpha}{2}&=0.025\\ t_\frac{\alpha}{2}&=t_{.025}\\ \text{qt}((1-.025),(20-1))&=2.093024\\ \left(\bar{X}-t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}},\bar{X}+t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}\right)&= \left(41.93-2.093024 \times \frac{.1789}{\sqrt{20}}, 41.93+2.093024 \times \frac{.1789}{\sqrt{20}}\right)\\ &=(41.85,42.01) \end{align} \]
conf_interval(41.93,.1789,qt((1-.025),(20-1)),20,2)
## [1] "(41.85,42.01)"
A company is interested in estimating \(\mu\), the mean number of days of sick leave during the last year taken by all its employees. They select a random sample of 100 employees and note the number of sick days taken by each employee in the sample. The following sample statistics are computed: \(\bar{X} = 12.2 \text{ days}\), \(s = 3\text{ days}\). Find a 95% confidence interval for \(\mu\).
\[ \begin{align} \alpha &= 1-0.95\\ \frac{\alpha}{2}&=\frac{0.05}{2}\\ &=.025\\ t_\frac{\alpha}{2}&=t_{.025}\\ \text{qt}((1-0.025),(100-1))&=1.984217\\ \left(\bar{X}-t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}},\bar{X}+t_\frac{\alpha}{2} \times \frac{s}{\sqrt{n}}\right)&= \left(12.2-1.984217 \times \frac{3}{\sqrt{100}}, 12.2+1.984217 \times \frac{3}{\sqrt{100}}\right)\\ &=(11.6,12.8) \end{align} \]
conf_interval(mean=12.2,sd=3,t_value=qt((1-0.025),(100-1)),n=100,places=2)
## [1] "(11.6,12.8)"