Classical Probability

We are given $P(C)=0.6$ and $P(C \cap J)=0.2$. We can determine

$P(J|C)=\frac{P(J\cap C)}{P(C)}=\frac{0.2}{0.6}=\frac{1}{3}$
Given $P(C|J)=0.4$, we want $P(J)$. Since

\[0.4 = P(C|J)=\frac{P(C\cap J)}{P(J)}=\frac{0.2}{P(J)},\] we have \[P(J)=\frac{0.2}{0.4}=0.5.\]

Randomly choose a person. Denote “left-handed” by “LH”, “red-green color-blind” by “RG”, “blue-yellow color-blind” by “BY”, “completely color-blind” by “Complete”, and “color-blind of any kind” by “Color”. We are given $P(LH) = 0.11$, $P(RG)=0.0560$, $P(BY)=0.0224$, and $P(Complete)=0.0016$. We can determine

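$P(Color)=0.0560+0.0224+0.0016=0.08$.
$P(No ~color-blindness)=1-0.0560-0.0224-0.0016=0.92$.
$P(LH \cup Color)$, if we further know that there is a 0.91% chance that a person is both left-handed and color-blind: $P(LH\cup Color)=P(LH)+P(Color)-P(LH\cap Color)=0.11 + 0.08-0.0091=0.1809$.
$P(RH\cap Color)$, where “RH” denotes “right-handed”, assuming that handedness and color blindness are independent: $P(RH\cap Color)=P(RH)\cdot P(Color)=(1-0.11)\cdot (0.08)=0.0712.$ (If we instead use $P(LH\cap Color)=0.0091$ from the previous part, then $P(RH\cap Color)=P(Color)-P(LH\cap Color)=0.08-0.0091=0.0709$.)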
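
The arithmetic in the two problems above can be checked in R. This is only a minimal sketch; the variable names (`p_C`, `p_CJ`, `p_color`, and so on) are our own and are not part of the original problems.

```r
# Problem 1: conditional probability, P(A | B) = P(A and B) / P(B)
p_C  <- 0.6                      # P(C)
p_CJ <- 0.2                      # P(C and J)
p_J_given_C <- p_CJ / p_C        # P(J | C)
p_J <- p_CJ / 0.4                # solve 0.4 = P(C and J) / P(J) for P(J)

# Problem 2: the three kinds of color blindness are treated as mutually exclusive
p_LH    <- 0.11
p_color <- 0.0560 + 0.0224 + 0.0016          # P(Color)
p_none  <- 1 - p_color                       # P(No color-blindness)
p_LH_or_color <- p_LH + p_color - 0.0091     # addition rule

round(c(p_J_given_C, p_J, p_color, p_none, p_LH_or_color), 4)
```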

Probability Density Functions

Suppose $f(x)=2e^{-2x}, x>0$ is a probability density function of a random variable $X$. Determine

$P(X< 1)$
$P(X>1)$
$P(1<X<2)$
the mean $\mu = E(X)$, and
the standard deviation $\sigma$

Solution.

$P(X<1)=\int_0^1 2e^{-2x}dx=1-e^{-2}=0.8647$
$P(X>1)=\int_1^{\infty}2e^{-2x}dx=e^{-2}=0.1353$
$P(1<X<2)=\int_1^{2}2e^{-2x}dx=e^{-2}-e^{-4}=0.1170$
Since the distribution is exponential with rate $\lambda=2$, the mean is $\mu = \frac{1}{\lambda}=\frac{1}{2}$, and
the standard deviation equals the mean for an exponential distribution, so $\sigma=\frac{1}{2}$.
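
As a hedged cross-check, the same quantities can be obtained numerically in R, either by integrating the density with `integrate()` or with the built-in exponential CDF `pexp()`. The object names below are our own.

```r
rate <- 2
f <- function(x) rate * exp(-rate * x)    # density f(x) = 2 exp(-2x), x > 0

# Probabilities by numerical integration and, equivalently, by pexp()
p_less_1 <- integrate(f, 0, 1)$value
p_more_1 <- pexp(1, rate, lower.tail = FALSE)
p_1_to_2 <- pexp(2, rate) - pexp(1, rate)

# Mean and standard deviation from the defining integrals
mu    <- integrate(function(x) x * f(x), 0, Inf)$value
ex2   <- integrate(function(x) x^2 * f(x), 0, Inf)$value
sigma <- sqrt(ex2 - mu^2)

round(c(p_less_1, p_more_1, p_1_to_2, mu, sigma), 4)
```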

Suppose $f(x)=\frac{1}{3}x^2, -1<x<2$ is a probability density function of a random variable $X$. Determine

$P(X< 1)$
$P(X>1)$
$P(0<X<1)$
the mean $\mu = E(X)$,
the standard deviation $\sigma$, and
$x$ such that $P(x<X)=0.05$

Solution.

$P(X<1)=\int_{-1}^1 \frac{1}{3}x^2dx=\frac{2}{9}$
$P(X>1)=\int_1^{2}\frac{1}{3}x^2dx=\frac{7}{9}$
$P(0<X<1)=\int_0^{1}\frac{1}{3}x^2dx=\frac{1}{9}$
$\mu = \int_{-1}^2 x\cdot \frac{1}{3}x^2dx=\frac{5}{4}$, and
The variance is $\sigma^2=\int_{-1}^2 x^2\cdot \frac{1}{3}x^2dx-\mu^2=\frac{11}{5}-\frac{25}{16}=\frac{51}{80}$, so the standard deviation is $\sqrt{\frac{51}{80}}\approx 0.7984$.
Since $x$ has to be between $-1$ and 2, $P(x<X)=\int_x^{2}\frac{1}{3}t^2dt=\frac{8-x^3}{9}$. Setting $\frac{8-x^3}{9}=0.05$ yields $x=\sqrt[3]{7.55}\approx 1.9618$.
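
The integrals in this solution can be verified numerically. The sketch below uses `integrate()` for the moments and `uniroot()` to solve for the upper 5% point; the names `lower`, `upper`, `ex2`, and `q` are our own.

```r
f <- function(x) x^2 / 3      # density on (-1, 2)
lower <- -1
upper <- 2

total    <- integrate(f, lower, upper)$value      # a valid density integrates to 1
p_less_1 <- integrate(f, lower, 1)$value          # P(X < 1)
p_more_1 <- integrate(f, 1, upper)$value          # P(X > 1)

# Mean, second moment, and standard deviation
mu    <- integrate(function(x) x * f(x), lower, upper)$value
ex2   <- integrate(function(x) x^2 * f(x), lower, upper)$value
sigma <- sqrt(ex2 - mu^2)

# Solve P(X > q) = 0.05 for q inside the support
q <- uniroot(function(q) integrate(f, q, upper)$value - 0.05,
             interval = c(lower, upper), tol = 1e-8)$root

c(total = total, mu = mu, sigma = sigma, q = q)
```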

Suppose the cumulative distribution function of $X$ is

\[\begin{equation} F(x)= \begin{cases} 0, & x\le -1 \\ \frac{x+1}{3}, & -1<x<2\\ 1, & x\ge 2\\ \end{cases} \end{equation}\]

find

$P(X>1)$
$P(X<1.4)$
$P(0<X<1)$
$P(X>3)$

Solution.

$P(X>1)=1-P(X\le 1)=1-F(1)=1-\frac{1+1}{3}=\frac{1}{3}$
$P(X<1.4)\stackrel{\text{since } X \text{ is a continuous random variable}}= P(X\le 1.4)=F(1.4)=0.8$
$P(0<X<1)\stackrel{\text{since } X \text{ is a continuous random variable}}=F(1)-F(0)=\frac{2}{3}-\frac{1}{3}=\frac{1}{3}$
$P(X>3)=1-P(X\le 3)=1-F(3)=1-1=0$
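
The given $F(x)$ is the cumulative distribution function of a uniform distribution on $(-1, 2)$, so the answers can be reproduced with R's `punif()`. This is a small sketch; the wrapper name `cdf` is our own.

```r
# F(x) = (x + 1) / 3 on (-1, 2) is the CDF of a uniform distribution on (-1, 2)
cdf <- function(x) punif(x, min = -1, max = 2)

p_a <- 1 - cdf(1)          # P(X > 1)
p_b <- cdf(1.4)            # P(X < 1.4)
p_c <- cdf(1) - cdf(0)     # P(0 < X < 1)
p_d <- 1 - cdf(3)          # P(X > 3)

c(p_a, p_b, p_c, p_d)
```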

Suppose $f(x)=2e^{-2x}, x>0$ is a probability density function of a random variable $X$. Determine its cumulative distribution function $F(x)$.

Solution.

For $x \le 0$, $F(x)=0$; for $x >0$, $F(x)=\int_0^x 2e^{-2t}dt=1-e^{-2x}$. Therefore,

\[\begin{equation} F(x)= \begin{cases} 0, & x\le 0 \\ 1-e^{-2x}, & x>0\\ \end{cases} \end{equation}\]

Suppose $X$ is normally distributed with mean 100 and standard deviation 15. Find

$P(X<90)$.
$P(X>120)$

Solution.

$P(X<90)=P(X-\mu<90-\mu)=P(\frac{X-\mu}{\sigma}<\frac{90-\mu}{\sigma})=P(Z<-0.67)=0.2514$.
$P(X>120)=P(Z>\frac{120-\mu}{\sigma})=P(Z>1.33)=1-P(Z\le 1.33)=1-0.9082=0.0918$
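
Normal probabilities can also be computed directly with `pnorm()`, without standardizing first. Note that `pnorm()` uses the exact $z$ value, while a table lookup rounds $z$ to two decimals, so the two approaches can differ slightly in the third decimal place. The variable names below are our own.

```r
mu    <- 100
sigma <- 15

# Exact normal probabilities
p_below_90  <- pnorm(90, mean = mu, sd = sigma)                        # P(X < 90)
p_above_120 <- pnorm(120, mean = mu, sd = sigma, lower.tail = FALSE)   # P(X > 120)

# Table-style calculation: round z to two decimals before looking it up
z_90    <- round((90 - mu) / sigma, 2)
p_table <- pnorm(z_90)

round(c(p_below_90, p_above_120, p_table), 4)
```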

The time between calls is exponentially distributed with mean 15 minutes. Find

the probability that there is no call within 25 minutes.
the probability that there is at least one call within 10 minutes.
the probability that the first call arrives within 15 to 20 minutes after opening.
the length of an interval of time such that the probability of at least one call is in the interval is 0.8. Round to two decimal places.

Solution.

Let $X$ denote the time between calls. The probability density function of $X$ is \[f(x) = \frac{1}{15}e^{-\frac{1}{15}x}, ~~ x>0\] The cumulative distribution function is

\[F(x) = 1-e^{-\frac{1}{15}x}, ~~ x>0\]

$P(X>25)=1-P(X\le 25)=1-F(25)=e^{-\frac{25}{15}}=0.1889$
$P(X\le 10)=F(10)=1-e^{-\frac{10}{15}}=0.4866$
$P(15\le X\le 20)=F(20)-F(15)=e^{-\frac{15}{15}}-e^{-\frac{20}{15}}=0.1043$
Suppose that the interval is $(0, b)$. We need to solve the equation $P(X<b)=0.8$. Since $P(X<b)=F(b)=1-e^{-\frac{1}{15}b}$, solving the equation $1-e^{-\frac{1}{15}b}=0.8$ gives $b=15\ln 5\approx 24.14$ minutes.
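
The four parts can be checked with R's exponential distribution functions `pexp()` and `qexp()`, which are parameterized by the rate (here $1/15$ per minute). A minimal sketch with our own variable names:

```r
rate <- 1 / 15    # mean time between calls is 15 minutes

p_no_call_25 <- pexp(25, rate, lower.tail = FALSE)   # P(X > 25)
p_call_10    <- pexp(10, rate)                       # P(X <= 10)
p_15_to_20   <- pexp(20, rate) - pexp(15, rate)      # P(15 <= X <= 20)
b            <- qexp(0.8, rate)                      # solves F(b) = 0.8

round(c(p_no_call_25, p_call_10, p_15_to_20, b), 4)
```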

Joint Distribution of Two Discrete Random Variables

Given the joint probability mass function

**Joint Distribution of X and Y**
$x$	$y$	$f(x,y)$
-1	0	1/6
2	1	1/3
3	-2	1/4
2	-2	1/4

Determine the marginal probability mass function of $X$.
Determine the marginal probability mass function of $Y$.
Determine the mean and variance of $X$.
Determine the mean and variance of $Y$.
Determine the covariance between $X$ and $Y$.
Determine the correlation between $X$ and $Y$.
Determine $P(X<2.5, Y>-1)$.

Solution.

To determine the marginal distribution of $X$, we first have

$P(X=-1)=1/6$, $P(X=2)=1/3+1/4=7/12$, and $P(X=3)=1/4$. So the marginal probability mass function of $X$ is

**Marginal Distribution of X**
$x$	$f_X(x)$
-1	1/6
2	7/12
3	1/4

To determine the marginal distribution of $Y$, we first have

$P(Y=0)=1/6$, $P(Y=1)=1/3$, and $P(Y=-2)=1/4+1/4=1/2$. So the marginal probability mass function of $Y$ is

D=data.frame(y=c("$y$", 0, 1, -2),"f_Y (y)"=c("$f_Y(y)$", "1/6", "1/3", "1/2"))

names(D)=NULL

kableExtra::kable(D, align='cc', escape = F,
      caption = "<center><strong>Marginal Distribution of Y</strong></center>",
      table.attr = "style='width:60%; '")

**Marginal Distribution of Y**
$y$	$f_Y(y)$
0	1/6
1	1/3
-2	1/2

The mean of $X$ is $E(X)=(-1)(1/6)+(2)(7/12)+(3)(1/4)=1.75$. The variance is $(-1)^2(1/6)+(2)^2(7/12)+(3)^2(1/4)-1.75^2=1.6875$.
The mean of $Y$ is $E(Y)=(0)(1/6)+(1)(1/3)+(-2)(1/2)=-2/3$. The variance is $(0)^2(1/6)+(1)^2(1/3)+(-2)^2(1/2)-(-2/3)^2=17/9$.
To calculate the covariance, we need to find $E(XY)$ first. \[E(XY)=(-1)(0)(1/6)+(2)(1)(1/3)+(3)(-2)(1/4)+(2)(-2)(1/4)=-11/6\]

The covariance is

\[Cov(X,Y)=E(XY)-E(X)E(Y)=-11/6-(1.75)(-2/3)=-2/3\]

The correlation is

\[\rho = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}}=\frac{-2/3}{\sqrt{1.6875}\sqrt{17/9}}=-0.373408\]

There are two pairs of (X, Y) satisfying the condition $X<2.5, Y>-1$. These pairs are $(-1,0)$ and $(2,1)$. So the desired probability is the sum of the corresponding probabilities, or $1/6+1/3=1/2$.
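
The whole solution can be reproduced from the joint probability mass function in a few lines of R. This sketch stores the four support points in a data frame; the last point is taken as $(2,-2)$, which is the pair used throughout the solution above. All object names are our own.

```r
# Joint pmf: one row per support point (x, y) with probability p
joint <- data.frame(
  x = c(-1, 2, 3, 2),
  y = c(0, 1, -2, -2),
  p = c(1/6, 1/3, 1/4, 1/4)
)

# Marginal pmfs: add the joint probabilities over the other variable
f_X <- tapply(joint$p, joint$x, sum)
f_Y <- tapply(joint$p, joint$y, sum)

# Means and variances
E_X <- sum(joint$x * joint$p)
V_X <- sum(joint$x^2 * joint$p) - E_X^2
E_Y <- sum(joint$y * joint$p)
V_Y <- sum(joint$y^2 * joint$p) - E_Y^2

# Covariance and correlation
cov_XY <- sum(joint$x * joint$y * joint$p) - E_X * E_Y
rho    <- cov_XY / sqrt(V_X * V_Y)

# P(X < 2.5, Y > -1): add the probabilities of the qualifying pairs
p_region <- sum(joint$p[joint$x < 2.5 & joint$y > -1])

list(f_X = f_X, f_Y = f_Y, E_X = E_X, V_X = V_X, E_Y = E_Y, V_Y = V_Y,
     cov_XY = cov_XY, rho = rho, p_region = p_region)
```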

Basic Descriptive Statistics

The following R code might be useful.

# Create two data vectors and store them in x and y
x = c(23, 45, 12, 9, 15, 42, 40, 22, 25, 60, 28, 52, 44)
y = c(45, 85, 30, 17, 34, 86, 85, 50, 48, 115, 64, 100, 90)

# Calculate sample mean
mean(x)
# Calculate sample variance
var(x)

# Calculate sample standard deviation
sd(x)

# Calculate median
median(x)

# Calculate sample correlation between x and y
cor(x,y)

# Create histogram
hist(x)

# Create boxplot
boxplot(x)

# Create stem-and-leaf plot
stem(x)

# Create a scatter plot
plot(y~x)

Midterm Exam Review

Some useful formulas:

$P(A \cup B)=P(A)+P(B)-P(A\cap B)$.
For a discrete random variable, the mean $\mu=E(X)=\sum x_i p_i$ and variance $\sigma^2=V(X)=\sum x_i^2 p_i-\mu^2$.
For a continuous random variable, the mean $\mu=E(X)=\int_{-\infty}^{\infty}xf(x)dx$ and variance $\sigma^2=V(X)=\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2$.
For two discrete random variables, the covariance $cov(X,Y)=E(XY)-E(X)E(Y)$ and correlation $\rho=\frac{cov(X,Y)}{\sigma_X\cdot \sigma_Y}$.
$E(aX)=aE(X)$, $E(X+c)=E(X)+c$
$V(aX)=a^2V(X)$, $V(X+c)=V(X)$
If $X$ and $Y$ are two independent random variables and $a~ \& ~b$ are constants, then $V(aX+bY)=a^2V(X)+b^2V(Y)$.
The sample variance of a sample is $s^2 = \frac{\sum_{i=1}^{n}(x_i -\bar{x})^2}{n-1}$.
The sample correlation between two quantitative variables is $r = \frac{\sum_{i=1}^{n}x_i y_i-n\bar{x}\bar{y} }{\sqrt{\sum x_i^2-n\bar{x}^2}\sqrt{\sum y_i^2-n\bar{y}^2}}$.
The binomial distribution has the probability mass function: $P(X=x)=\binom{n}{x}p^x (1-p)^{n-x}, ~~x = 0, 1, 2, \dots, n$
The Poisson distribution has the probability mass function: $P(X=x)=\frac{\lambda^x}{x!} e^{-\lambda}, ~~x = 0, 1, 2, \dots$
The geometric distribution has the probability mass function: $P(X=x)=(1-p)^{x-1}p, ~~x = 1, 2, \dots$
The exponential distribution has the probability density function $f(x)=\lambda e^{-\lambda x}, ~ x>0$. The cumulative distribution function is $F(x)=1-e^{-\lambda x}, ~ x>0$. The mean and the standard deviation are both $\frac{1}{\lambda}$.
The uniform distribution has the probability density function $f(x)=\frac{1}{b-a}, ~ a<x<b$. The mean is $\frac{a+b}{2}$ and the standard deviation is $\frac{b-a}{\sqrt{12}}$.
The conditional probability is defined as $P(B|A)=\frac{P(A\cap B)}{P(A)}$, where $P(A)>0$.
If $X$ is continuous random variable with the probability density function $f(x)$, then $P(a<X<b)=\int_a^b f(x)dx$.
If $X\sim\text{N}(\mu_1, \sigma_1^2)$ and $Y\sim\text{N}(\mu_2, \sigma_2^2)$ are independent, then $X+Y\sim\text{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)$.

Some typical problems:

If events $A$ and $B$ are independent with $P(A)=0.2$ and $P(B)=0.6$, then $P(A \cap B)=(0.2)(0.6)=0.12$.
If events $A$ and $B$ are disjoint (meaning that they can’t happen simultaneously) with $P(A)=0.2$ and $P(B)=0.6$, then $P(A \cup B)=0.2+0.6=0.8$.
If $P(X<10)=0.2$ and $P(X<20)= 0.6$, then $P(10\le X<20)=0.6-0.2=0.4$.
A series system consists of 5 components, each functioning independently with probability 0.9. What is the probability that the system functions? Answer: $0.9^5=0.59049$.
If $X\sim\text{N}(\mu_1=100, \sigma_1^2=200)$ and $Y\sim\text{N}(\mu_2=120, \sigma_2^2=25)$ are independent, then $P(X+Y>250)=?$. Answer: Since $X+Y\sim\text{N}(\mu=220, \sigma^2=225)$, $P(X+Y>250)=P(Z>\frac{250-220}{15})=P(Z>2)=1-P(Z\le 2)=1-0.9772=0.0228.$
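
Two of the typical problems above are easy to check in R. The sketch below computes the series-system reliability and the tail probability for the sum of two independent normal variables; the variable names are our own.

```r
# Series system: all 5 independent components must function
p_series <- 0.9^5

# Sum of independent normals: means add and variances add
mu_sum <- 100 + 120
sd_sum <- sqrt(200 + 25)
p_sum_above_250 <- pnorm(250, mean = mu_sum, sd = sd_sum, lower.tail = FALSE)

c(p_series = p_series, p_sum_above_250 = p_sum_above_250)
```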

One-Sample Hypotheses Testing

We will provide examples

for testing $\mu$ using the 1-sample z test approach when $\sigma$ is known
for testing $\mu$ using the 1-sample t test approach when $\sigma$ is unknown
for testing $p$ using the 1-sample z test approach
for testing goodness of fit using the chi-square approach
for testing independence between two categorical variables using the chi-square approach

Testing $\mu$ using the 1-sample z test approach when $\sigma$ is known

Example 1. A two-sided test for a population mean with $\sigma$ known.

https://www.youtube.com/watch?v=BWJRsY-G8u0

Example 2. A left-sided test for a population mean with $\sigma$ known.

https://www.youtube.com/watch?v=oEW8Hd_xy1k

Example 3. A right-sided test for a population mean with $\sigma$ known.

We want to test whether a population mean is greater than 20. A random sample of size 36 gives a sample mean of 22. If the population standard deviation is 5, test, at level 0.05, whether the population mean exceeds 20.

Solution.

The null and alternative hypotheses are:

\[H_0:\mu = 20 ~~~ vs ~~~ H_a: \mu > 20\] The test statistic value is

\[z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{22-20}{5/\sqrt{36}}=2.4\]

Since a larger sample mean or a larger $z_0$ suggests rejection of the null hypothesis, the rejection region looks like $(c, \infty)$ with the critical value $c=z_{\alpha}$. By the standard normal table or the R code $qnorm(1-\alpha)$, $c=1.645$.

Since the test statistic value falls in the rejection region, reject the null hypothesis.

Equivalently, we can use the $p$-value approach. The $p$-value is the area to the right of the statistic value under the standard normal curve. By the standard normal table or the R code $1-pnorm(2.4)$, the $p$-value is 0.0082. Since the $p$-value is less than the significance level, reject the null hypothesis.
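
The steps of Example 3 can be sketched in R as follows. Base R has no built-in one-sample $z$ test, so the statistic is computed directly from the formula; the variable names are our own.

```r
xbar  <- 22      # sample mean
mu0   <- 20      # hypothesized mean under H0
sigma <- 5       # known population standard deviation
n     <- 36      # sample size
alpha <- 0.05    # significance level

z0      <- (xbar - mu0) / (sigma / sqrt(n))   # test statistic
crit    <- qnorm(1 - alpha)                   # critical value for a right-sided test
p_value <- pnorm(z0, lower.tail = FALSE)      # area to the right of z0

reject_by_critical_value <- z0 > crit
reject_by_p_value        <- p_value < alpha
c(z0 = z0, crit = crit, p_value = p_value)
```

Both decision rules always agree, since $z_0>z_{\alpha}$ exactly when the right-tail area beyond $z_0$ is less than $\alpha$.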

Testing $\mu$ using the 1-sample t test approach when $\sigma$ is unknown

Example 1.

A very useful video: https://www.youtube.com/watch?v=VPd8DOL13Iw

Example 2.

Your company wants to improve sales. Past sales data indicate that the average sale was $100 per transaction. After training your sales force, recent sales data (taken from a random sample of 25 salesmen) indicate an average of $130, with a standard deviation of $15. Did the training work? Test your hypothesis at a 0.05 significance level.

Solution.

The population mean $\mu$ is the parameter of interest. To test whether sales have improved, we should have the null and alternative hypotheses as follows:

\[H_0: \mu=100 ~~~ vs ~~~ H_a: \mu>100\] The value of the test statistic is

\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{130-100}{15/\sqrt{25}}=10\] with $n-1$ or 24 degrees of freedom.

Since larger $\bar{x}$’s or $t_0$’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like $(c, \infty)$, where $c=t_{\alpha, n-1}$. We are given $\alpha=0.05$, so the critical value based on the $t_{24}$ distribution is 1.711, which is obtained by R code $qt(1-\alpha, n-1)$ or by a $t$-table.

Since the test statistic value 10 falls in the rejection region, we reject the null hypothesis.

Equivalently, we can calculate the $p$-value, which is the area under the $t_{24}$ distribution to the right of the test statistic value. Using the $t$ table or the R code $1-pt(10, 24)$, we know the $p$-value is smaller than 0.001 and thus smaller than the significance level 0.05. Again, we reject the null hypothesis.

In conclusion, the data provide sufficient evidence that sales have improved after training.

The following is a video explaining the above procedure:

https://www.youtube.com/watch?v=7ty2bO6VrUI

Example 3.

A firm claims that its product weighs 19 pounds on average. A supervisory authority suspects that the average weight is below 19 pounds, so it collects a random sample of 51 products made by the company from the market. The sample mean is 18.5 pounds with a standard deviation of 3.2 pounds. Test appropriate hypotheses at the significance level 0.01. In order to protect itself from being sued by the company, should the authority use a larger or smaller significance level?

Solution.

The null and alternative hypotheses are:

\[H_0: \mu=19 ~~~ vs ~~~ H_a: \mu<19\] The value of the test statistic is

\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{18.5-19}{3.2/\sqrt{51}}=-1.1158\] with $n-1$ or 50 degrees of freedom.

Since smaller $\bar{x}$’s or $t_0$’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like $(-\infty, c)$, where $c=-t_{\alpha, n-1}$. We are given $\alpha=0.01$, so the critical value based on the $t_{50}$ distribution is $-2.4033$, which is obtained by R code $qt(\alpha, n-1)$ with $\alpha = 0.01, n=51$ or by a $t$-table.

Since the test statistic value $-1.1158$ does not fall in the rejection region, we fail to reject the null hypothesis.

Equivalently, we can calculate the $p$-value, which is the area under the $t_{50}$ distribution to the left of the test statistic value. Using the $t$ table or the R code $pt(-1.1158, 50)$, we know the $p$-value is 0.1349 and thus NOT smaller than the significance level 0.01. Again, we fail to reject the null hypothesis.

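In conclusion, the data do not provide sufficient evidence that the average weight of the firm’s products is below 19 pounds. To reduce the risk of wrongly accusing the firm (a Type I error), the authority should use a smaller significance level.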

The following is a video explaining the above procedure: https://www.youtube.com/watch?v=ZY5XxJ2aJNc
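
Because only summary statistics are given in Examples 2 and 3, `t.test()` cannot be called on raw data. One possible workaround is a small helper that applies the $t_0$ formula directly. The function `one_sample_t()` below is a hypothetical helper written for these notes, not a base R function.

```r
# Hypothetical helper: one-sample t test from summary statistics
one_sample_t <- function(xbar, mu0, s, n,
                         alternative = c("two.sided", "less", "greater")) {
  alternative <- match.arg(alternative)
  df <- n - 1
  t0 <- (xbar - mu0) / (s / sqrt(n))
  p_value <- switch(alternative,
    less      = pt(t0, df),                       # left-tail area
    greater   = pt(t0, df, lower.tail = FALSE),   # right-tail area
    two.sided = 2 * pt(-abs(t0), df)              # both tails
  )
  list(t0 = t0, df = df, p_value = p_value)
}

# Example 2: H_a: mu > 100 at alpha = 0.05
ex2      <- one_sample_t(xbar = 130, mu0 = 100, s = 15, n = 25, alternative = "greater")
crit_ex2 <- qt(1 - 0.05, df = 24)

# Example 3: H_a: mu < 19 at alpha = 0.01
ex3      <- one_sample_t(xbar = 18.5, mu0 = 19, s = 3.2, n = 51, alternative = "less")
crit_ex3 <- qt(0.01, df = 50)

str(list(ex2 = ex2, crit_ex2 = crit_ex2, ex3 = ex3, crit_ex3 = crit_ex3))
```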

Testing $p$ using the 1-sample z test approach

Test each of the following at the 0.05 significance level using the data $n = 36, ~~\hat{p}=0.3$.

\[(a) ~~H_0:p=0.4 ~~ vs ~~ H_a: p<0.40\] \[(b) ~~H_0:p=0.4 ~~ vs ~~ H_a: p>0.40\]

\[(c) ~~H_0:p=0.4 ~~ vs ~~ H_a: p\ne 0.40\]

Solution.

The test statistic is the same for all 3 cases:

\[z_0=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}=\frac{0.3-0.40}{\sqrt{\frac{0.4(1-0.4)}{36}}}=-1.22\]

(a) The rejection region looks like $(-\infty, c)$ with the critical value $c=-1.645$ obtained by using the standard normal table or R code $qnorm(0.05)$. Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the $p$-value approach. The $p$-value is obtained by using the standard normal table or R code $pnorm(-1.22)$, which is 0.11. Since the $p$-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.

(b) The rejection region looks like $(c, \infty)$ with the critical value $c=1.645$ obtained by using the standard normal table or R code $qnorm(0.95)$. Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the $p$-value approach. The $p$-value is obtained by using the standard normal table or R code $1-pnorm(-1.22)$, which is 0.89. Since the $p$-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.

(c) The rejection region looks like $(-\infty, -c)\cup (c, \infty)$ with the critical values $-c=-1.96$ and $c=1.96$ obtained by using the standard normal table or R code $qnorm(0.025)$ and $qnorm(0.975)$. Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the $p$-value approach. The $p$-value is obtained by using the standard normal table or R code $pnorm(-1.22)*2$, which is 0.22. Since the $p$-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.
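
The left-sided, right-sided, and two-sided versions of this test can be sketched in R as below. The code uses the unrounded $z_0$, so the $p$-values may differ from the table-based ones in the third decimal place. The variable names are our own.

```r
n     <- 36
p_hat <- 0.3
p0    <- 0.4
alpha <- 0.05

# Test statistic (the same for all three alternatives)
z0 <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# p-values for the three alternatives
p_left  <- pnorm(z0)                       # H_a: p < p0
p_right <- pnorm(z0, lower.tail = FALSE)   # H_a: p > p0
p_two   <- 2 * pnorm(-abs(z0))             # H_a: p != p0

# Critical values
crit_one_sided <- qnorm(1 - alpha)         # use -crit for the left-sided test
crit_two_sided <- qnorm(1 - alpha / 2)

round(c(z0 = z0, p_left = p_left, p_right = p_right, p_two = p_two,
        crit_one_sided = crit_one_sided, crit_two_sided = crit_two_sided), 4)
```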

Testing goodness of fit using the chi-square approach

A quick video: https://www.youtube.com/watch?v=b3o_hjWKgQw

Example.

An IT specialist claims that

32% of IT hardware failures are mainly due to exposure to extreme temperatures
24% are mainly due to ineffective cleaning routines
20% are mainly due to human error caused by poor training
24% are due to other reasons

Based on the past 5 years of data, she has the following results on IT hardware failures:
30 are mainly due to exposure to extreme temperatures
25 are mainly due to ineffective cleaning routines
18 are mainly due to human error caused by poor training
27 are due to other reasons

Test, at the 0.05 significance level, whether the claim of the IT specialist is supported by the data.

Step 1: Specify the null and alternative hypotheses. \[H_0: p_1=0.32, ~p_2=0.24, ~p_3=0.20, ~p_4=0.24 ~~ vs ~~H_a: \text{at least one proportion is wrongly specified}\]

Step 2: Calculate the expected frequencies under the null hypothesis. Then, calculate the value of the test statistic and determine the number of degrees of freedom. The expected frequencies are: $100\cdot 0.32= 32, 100\cdot 0.24= 24, 100\cdot 0.20= 20, 100\cdot 0.24= 24$, respectively, so the test statistic value is

\[\chi^2=\frac{(30-32)^2}{32}+\frac{(25-24)^2}{24}+\frac{(18-20)^2}{20}+\frac{(27-24)^2}{24}=0.74\]

Step 3: Calculate the critical value and the $p$-value using the chi-square distribution table. The critical region for the chi-square test always lies in the right tail of the chi-square distribution and looks like $(c, \infty)$ with the critical value $c=\chi_{\alpha, k-1}^2$. Here $k=4$, so the number of degrees of freedom is 3. Using the chi-square table or the R code $qchisq(1-\alpha, df)$ with $\alpha=0.05$ and $df=3$, $c=7.81$. Since the test statistic value 0.74 does not fall in the rejection region, we fail to reject the null hypothesis.

Equivalently, the $p$-value is 0.86 obtained by the R code $1-pchisq(0.74, 3)$ or estimated to be larger than 0.05 by the chi-square table.

Step 4: Draw a conclusion. We conclude that the data are consistent with the claimed proportions; there is no evidence against the claim.

R code for all with the $p$-value approach:

x=c(30, 25, 18, 27)

chisq.test(x, p = c(0.32, 0.24, 0.20, 0.24))
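
For comparison with `chisq.test()`, Steps 2 and 3 can also be carried out term by term. This is a sketch with our own variable names; it should agree with the output of the call above.

```r
observed <- c(30, 25, 18, 27)
p_null   <- c(0.32, 0.24, 0.20, 0.24)   # proportions under H0

expected <- sum(observed) * p_null                    # expected frequencies
chi_sq   <- sum((observed - expected)^2 / expected)   # test statistic
df       <- length(observed) - 1                      # k - 1 degrees of freedom

crit    <- qchisq(1 - 0.05, df)                       # critical value
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)     # right-tail area

round(c(chi_sq = chi_sq, df = df, crit = crit, p_value = p_value), 4)
```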

Testing independence between two categorical variables using the chi-square approach

Here is a helpful video: https://www.youtube.com/watch?v=LE3AIyY_cn8

Example.

Refer to the data in the table below:

Test, at the 0.05 significance level, whether there is any gender gap in the choice of college majors.

Solution.

Step 1: Specify the null and alternative hypotheses. \[H_0: \text{there is no gender gap in the choice of college majors (gender and major are independent)}\] \[H_a: \text{there is a gender gap in the choice of college majors (gender and major are NOT independent)}\]

Step 2: Calculate the expected frequencies under the null hypothesis. Then, calculate the value of the test statistic and determine the number of degrees of freedom. The expected frequencies are 5.65, 8.35, 8.47, 12.53, 8.88, 13.12, respectively. The test statistic value is $\chi^2=\sum\frac{(O-E)^2}{E}\approx 2.215$. The number of degrees of freedom is $(3-1)(2-1)=2$.

Step 3: Calculate the critical value and the $p$-value using the chi-square distribution table. The critical value is $c=\chi_{0.05, 2}^2$ and the rejection region is $(c, \infty)$, where $c=5.99$ is obtained by the chi-square distribution table or R code $qchisq(0.95,2)$. The $p$-value is 0.33.

Step 4: Make a decision & draw a conclusion. We fail to reject the null hypothesis. We conclude that the data do not provide sufficient evidence that there is a gender gap in the choice of college major.

R code for all with the $p$-value approach:

M=matrix(c(4,11,8,10,10,14),3)

chisq.test(M)
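
The object returned by `chisq.test()` stores the intermediate quantities from Steps 2 and 3, so they can be inspected directly. In this sketch the result is saved under the name `res`, which is our own choice.

```r
M   <- matrix(c(4, 11, 8, 10, 10, 14), 3)   # 3 x 2 table, filled column by column
res <- chisq.test(M)

expected <- res$expected             # expected frequencies under independence
chi_sq   <- unname(res$statistic)    # test statistic
df       <- unname(res$parameter)    # degrees of freedom
p_value  <- res$p.value

round(expected, 2)
round(c(chi_sq = chi_sq, df = df, p_value = p_value), 4)
```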

Two-Sample Confidence Intervals and Hypotheses Testing

R code

# 1. Confidence interval and test of the difference between means

# 1a. Assuming that both population variances are known
# 2-sample z-test
# Excel

# 1b. Assuming unknown but equal population variances
# 2-sample t-test, df = n1+n2-2
# R function available
x = c(23, 43, 45, 33, 51, 28, 52, 39, 44)
y = c(58, 76, 46, 63, 75, 51)

t.test(x, y, var.equal = TRUE, alternative = "less", conf.level = 0.95)
t.test(x, y, var.equal = TRUE, alternative = "greater", conf.level = 0.95)
t.test(x, y, var.equal = TRUE, alternative = "two.sided", conf.level = 0.95)

# 1c. Assuming unknown and unequal population variances
# 2-sample (Welch) t-test, df from the Welch-Satterthwaite formula
# R function available
x = c(23, 43, 45, 33, 51, 28, 52, 39, 44)
y = c(58, 76, 46, 63, 75, 51)

t.test(x, y, var.equal=FALSE, alternative = "less", conf.level = 0.95)
t.test(x, y, var.equal=FALSE, alternative = "greater", conf.level = 0.95)
t.test(x, y, var.equal=FALSE, alternative = "two.sided", conf.level = 0.95)

# 1d. Paired data
x = c(34, 56, 33, 28, 45, 63, 51) # measurements based on method 1
y = c(35, 57, 30, 26, 40, 59, 48) # measurements based on method 2
t.test(x, y, paired = TRUE, alternative = "less", conf.level = 0.95)
t.test(x, y, paired = TRUE, alternative = "greater", conf.level = 0.95)
t.test(x, y, paired = TRUE, alternative = "two.sided", conf.level = 0.95)


# 2. Confidence interval and test of the difference between proportions

# Sample 1: n = 40, x = 28; Sample 2: n = 50, x = 34
n = c(40, 50)
x = c(28, 34)
prop.test(x, n, alternative = "less", correct = FALSE, conf.level = 0.95)
prop.test(x, n, alternative = "greater", correct = FALSE, conf.level = 0.95)
prop.test(x, n, alternative = "two.sided", correct = FALSE, conf.level = 0.95)

Exercise:

Use a t test to see whether there is a difference in the mean number of attacks before and after installing firewalls.

Attempts before Firewall: 56, 47, 49, 37, 38, 60, 50, 43, 43, 59, 50, 56, 54, 58

Attempts after Firewall: 53, 21, 32, 49, 45, 38, 44, 33, 32, 43, 53, 46, 36, 48, 39, 35, 37, 36, 39, 45
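
One possible way to set up this exercise in R is sketched below. It assumes that the second list of counts is the "after" sample and that the two samples are independent (they have different sizes, so they cannot be paired). The default `t.test()` call does not assume equal variances; add `var.equal = TRUE` for the pooled version.

```r
before <- c(56, 47, 49, 37, 38, 60, 50, 43, 43, 59, 50, 56, 54, 58)
after  <- c(53, 21, 32, 49, 45, 38, 44, 33, 32, 43, 53, 46, 36, 48,
            39, 35, 37, 36, 39, 45)

# Two-sided two-sample t test of H0: mu_before = mu_after
res <- t.test(before, after, alternative = "two.sided", conf.level = 0.95)
res
```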

More Examples on Statistical Inference

The diameter of steel rods manufactured on two different extrusion machines is being investigated. Two random samples of sizes $n_1=15, n_2=17$ are selected, and the sample means and sample variances are $\bar{x}_1=8.73, s^2_1=0.35,\bar{x}_2=8.68, s^2_2=0.40$, respectively. Assume that the population variances are equal and that the data are drawn from a normal distribution.

Is there evidence to support the claim that the two machines produce rods with different mean diameters? Give bounds on the P-value used to make your conclusion. Use only Table V of Appendix A.
Construct a 95% confidence interval for the difference in mean rod diameter. Use only Table V of Appendix A. Round your answer to 3 decimal places.
Interpret this interval.
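
Since only summary statistics are given, `t.test()` cannot be used directly. The pooled two-sample $t$ procedure can instead be sketched from the formulas, as below. The variable names are our own, and the code uses exact $t$ quantiles rather than Table V.

```r
n1 <- 15; xbar1 <- 8.73; s2_1 <- 0.35   # machine 1: size, mean, variance
n2 <- 17; xbar2 <- 8.68; s2_2 <- 0.40   # machine 2: size, mean, variance

df  <- n1 + n2 - 2
sp2 <- ((n1 - 1) * s2_1 + (n2 - 1) * s2_2) / df   # pooled variance
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))              # standard error of the difference

t0      <- (xbar1 - xbar2) / se                   # test statistic for H0: mu1 = mu2
p_value <- 2 * pt(-abs(t0), df)                   # two-sided p-value

# 95% confidence interval for mu1 - mu2
ci <- (xbar1 - xbar2) + c(-1, 1) * qt(0.975, df) * se

round(c(t0 = t0, df = df, p_value = p_value, lower = ci[1], upper = ci[2]), 3)
```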

Two suppliers manufacture a plastic gear used in a laser printer. The impact strength of these gears, measured in foot-pounds, is an important characteristic. A random sample of 10 gears from supplier 1 results in $\bar{x}_1=289.30, s^2_1=22.5$, and another random sample of 16 gears from the second supplier results in $\bar{x}_2=322.10, s^2_2=21$.

Use only Table V of Appendix A.

Is there evidence to support the claim that supplier 2 provides gears with higher mean impact strength? Use $\alpha=0.05$, and assume that both populations are normally distributed but the variances are not equal. Round your answer to 4 decimal places.
Do the data support the claim that the mean impact strength of gears from supplier 2 is at least 25 foot-pounds higher than that of supplier 1? Find bounds on the P-value making the same assumptions as in part (a). Round your answer to 2 decimal places.
Construct an appropriate 95% confidence interval on the difference in mean impact strength. Use only Table V of Appendix A. Round your answers to 3 decimal places.

Does the confidence interval support the claim that the mean impact strength of gears from supplier 2 is at least 25 foot-pounds higher than that of supplier 1?

The manager of a fleet of automobiles is testing two brands of radial tires and assigns one tire of each brand at random to the two rear wheels of eight cars and runs the cars until the tires wear out. The data (in kilometers) follow. Find a 99% confidence interval on the difference in the mean life.

Round your answer to 2 decimal places. Do not use commas.

Does the confidence interval constructed in the previous step indicate that one brand is better than the other?

Solution.

R code:

x=c(36925, 45300, 36218, 32100, 37210, 48360, 38200, 33500)
y = c(34318, 42280, 35497, 31950, 38015, 47800, 37810, 33215)

t.test(x, y, paired = TRUE, conf.level = 0.99)

## 
##  Paired t-test
## 
## data:  x and y
## t = 1.8983, df = 7, p-value = 0.09945
## alternative hypothesis: true mean difference is not equal to 0
## 99 percent confidence interval:
##  -730.4546 2462.4546
## sample estimates:
## mean difference 
##             866

The 99% confidence interval for the mean difference is $(-730.45, 2462.45)$. Since the interval contains 0, the data do not indicate that one brand is better than the other.

An article in the Journal of Aircraft (1986, Vol. 23, pp. 859-864) described a new equivalent plate analysis method formulation that is capable of modeling aircraft structures such as cranked wing boxes and that produces results similar to the more computationally intensive finite element analysis method. Natural vibration frequencies for the cranked wing box structure are calculated using both methods, and results for the first seven natural frequencies follow:

Do the data suggest that the two methods provide the same mean value for natural vibration frequency? Find an interval for P-value.
Find a 95% confidence interval on the mean difference between the two methods and use it to answer the question in part (a).

Round your answer to 3 decimal places.

Does the confidence interval indicate that the two methods provide different mean values for natural vibration frequency?

Solution.

R code:

x=c(14.58, 48.52, 97.21, 113.99, 174.73, 212.72, 277.38)
y = c(14.76, 49.10, 99.96, 117.53, 181.22, 220.14, 294.80)

t.test(x, y, paired = TRUE, conf.level = 0.95)

## 
##  Paired t-test
## 
## data:  x and y
## t = -2.4481, df = 6, p-value = 0.04992
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -10.962958325  -0.002755961
## sample estimates:
## mean difference 
##       -5.482857

The $p$-value is 0.04992, so it lies between 0.02 and 0.05. The 95% confidence interval for the mean difference is $(-10.963, -0.003)$. Since the interval does not contain 0 (though only barely), the data suggest, at the 0.05 level, that the two methods give different mean values.

Two different types of injection-molding machines are used to form plastic parts. A part is considered defective if it has excessive shrinkage or is discolored. Two random samples, each of size 300, are selected, and 15 defective parts are found in the sample from machine 1, while 9 defective parts are found in the sample from machine 2. Is it reasonable to conclude that both machines produce the same proportion of defective parts? Use α = 0.05.

Test the hypothesis $H_0: p_1 = p_2$ versus $H_1: p_1 ≠ p_2$. What is $z_0$, the value of the test statistic? Round your answer to two decimal places (e.g. 98.76).

Is it reasonable to conclude that both machines produce the same proportion of defective parts?

Solution.

R code:

x = c(15, 9)
n = c(300, 300)

prop.test(x, n, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 1.5625, df = 1, p-value = 0.2113
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.01131856  0.05131856
## sample estimates:
## prop 1 prop 2 
##   0.05   0.03

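The prop.test returns a test statistic that equals $z_0^2$. So, to get $z_0$, we take the square root of the output statistic value (which is 1.5625 here): $z_0=\sqrt{1.5625}=1.25$. Why positive? Because the first sample proportion is larger than the second; refer to the $z_0$ formula. Since the $p$-value 0.2113 is larger than 0.05, we fail to reject $H_0$: the data are consistent with both machines producing the same proportion of defective parts.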

A random sample of 500 adult residents of Maricopa County found that 359 were in favor of increasing the highway speed limit to 75 mph, while another sample of 400 adult residents of Pima County found that 295 were in favor of the increased speed limit. Do these data indicate that there is a difference in the support for increasing the speed limit between the residents of the two counties? Use α = 0.05.

Test the hypothesis $H_0: p_1 = p_2$ versus $H_1: p_1 ≠ p_2$. What is $z_0$, the value of the test statistic? Round your answer to two decimal places (e.g. 98.76).
Is it reasonable to conclude that there is a difference in the support for increasing the speed limit between the residents of the two counties?

Solution.

R code:

x = c(359, 295)
n = c(500, 400)

prop.test(x, n, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 0.42543, df = 1, p-value = 0.5142
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  -0.0779364  0.0389364
## sample estimates:
## prop 1 prop 2 
## 0.7180 0.7375

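The prop.test returns a test statistic that equals $z_0^2$. So, to get the magnitude of $z_0$, we take the square root of the output statistic value (which is 0.42543 here): $\sqrt{0.42543}=0.65$. Here the first sample proportion (0.7180) is smaller than the second (0.7375), so $z_0$ is negative: $z_0=-0.65$. Refer to the $z_0$ formula.

The $p$-value is 0.5142, which is larger than 0.05, so we fail to reject $H_0$: the data do not indicate a difference in the support for increasing the speed limit between the residents of the two counties.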
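
To see where the sign of $z_0$ comes from, the statistic can be computed directly from the pooled two-proportion formula and compared with the X-squared value reported by `prop.test()`. The function `two_prop_z()` below is a hypothetical helper written for these notes, not a base R function.

```r
# Hypothetical helper: pooled two-proportion z statistic (sample 1 minus sample 2)
two_prop_z <- function(x, n) {
  p_hat  <- x / n                     # sample proportions
  p_pool <- sum(x) / sum(n)           # pooled proportion under H0: p1 = p2
  se     <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
  (p_hat[1] - p_hat[2]) / se
}

z_machines <- two_prop_z(c(15, 9), c(300, 300))       # injection-molding machines
z_counties <- two_prop_z(c(359, 295), c(500, 400))    # Maricopa vs Pima

# The squared z statistics match the X-squared values from prop.test()
chi_machines <- unname(prop.test(c(15, 9), c(300, 300), correct = FALSE)$statistic)
chi_counties <- unname(prop.test(c(359, 295), c(500, 400), correct = FALSE)$statistic)

round(c(z_machines = z_machines, z_counties = z_counties), 2)
```

The sign of $z_0$ is the sign of $\hat{p}_1-\hat{p}_2$, which the squared statistic cannot show.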

A random sample of 500 adult residents of Maricopa County found that 384 were in favor of increasing the highway speed limit to 75 mph, while another sample of 400 adult residents of Pima County found that 281 were in favor of the increased speed limit. Construct a 95% confidence interval on the difference in the two proportions. Round your answer to four decimal places (e.g. 98.7654).

Solution.

R code:

x = c(384, 281)
n = c(500, 400)

prop.test(x, n, conf.level = 0.95, correct = FALSE)

## 
##  2-sample test for equality of proportions without continuity correction
## 
## data:  x out of n
## X-squared = 4.9416, df = 1, p-value = 0.02622
## alternative hypothesis: two.sided
## 95 percent confidence interval:
##  0.007396526 0.123603474
## sample estimates:
## prop 1 prop 2 
## 0.7680 0.7025

The 95% confidence interval for $p_1-p_2$ is (0.0074, 0.1236).

Still More Examples on Inference

Important contents:

The best point estimate of a population parameter is the sample counterpart (called the statistic which is a function of observations). The precision of a point estimate is measured by the standard error. The accuracy is measured by the bias. If an estimator has 0 bias, it is said to be unbiased.
The sampling distribution of the sample mean: (a) it always has mean that equals the population mean and variance that equals the population variance divided by the sample size; (b) it is normally distributed when the population distribution is normal; (c) it is approximately normally distributed when the sample size is large and the population distribution is not normal.
The sampling distribution of the sample proportion: (a) it always has mean that equals the population proportion and variance that equals $p\cdot(1-p)$ divided by the sample size; (b) it is approximately normally distributed when the sample size is large.
The standard error of the sample mean is $\frac{\sigma}{\sqrt{n}}$ or $\frac{s}{\sqrt{n}}$ when $\sigma$ is unknown.
The standard error of the sample proportion is $\sqrt{\frac{p(1-p)}{n}}$ or $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$ when $p$ is unknown.
Test and find confidence interval about $p$, the $z$ method
Test and find confidence interval about $\mu$ ($\sigma$ known), the $z$ method
Test and find confidence interval about $\mu$ ($\sigma$ unknown), the $t$ method
Test and find confidence interval about $p_1-p_2$, the $z$ method
Test and find confidence interval about $\mu_1-\mu_2$ ($\sigma_1$ and $\sigma_2$ known), the $z$ method
Test and find confidence interval about $\mu_1-\mu_2$ ($\sigma_1$ and $\sigma_2$ unknown), the $t$ method
Paired data $t$ test and $t$ confidence interval
Sample size determination when estimating $p$
Sample size determination when estimating $\mu$ ($\sigma$ known)
Chi-square test for goodness of fit or independence
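As a small illustration of the two sample-size items in the list above, the sketch below applies the standard formulas $n=\left(\frac{z_{\alpha/2}}{E}\right)^2 p(1-p)$ and $n=\left(\frac{z_{\alpha/2}\,\sigma}{E}\right)^2$, rounding up. The margins of error and the value of $\sigma$ are hypothetical planning values, not taken from any problem in this document.

```r
# Hypothetical planning values (assumptions for illustration only)
conf_level <- 0.95
z <- qnorm(1 - (1 - conf_level) / 2)

# Estimating p within margin of error E_p, using the conservative value p = 0.5
E_p <- 0.03
n_p <- ceiling((z / E_p)^2 * 0.5 * (1 - 0.5))

# Estimating mu within margin of error E_mu, with sigma assumed known
sigma <- 11
E_mu <- 2
n_mu <- ceiling((z * sigma / E_mu)^2)

c(n_p = n_p, n_mu = n_mu)
```

Rounding up with ceiling() keeps the margin of error at or below the target.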

Examples

Given $n = 23, \bar{x}=50, s = 11$.

Test $H_0: \mu = 45$ vs $H_a:\mu<45$, using both the critical value method and the $p$-value method.
Test $H_0:\mu = 45$ vs $H_a:\mu>45$, using both the critical value method and the $p$-value method.
Test $H_0:\mu = 45$ vs $H_a:\mu\ne 45$, using both the critical value method and the $p$-value method.
Construct a 95% 2-sided confidence interval.
Construct a 95% confidence interval with an upper bound.
Construct a 95% confidence interval with a lower bound.

Solution

The problems must be done by hand, since we are not given the original sample data.
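Even without the raw data, R can serve as a calculator for the hand computation. The following is a minimal sketch of the one-sample $t$ procedure from the summary statistics; it assumes $\alpha = 0.05$ for the tests (the problem only states the 95% confidence level), and all variable names are illustrative.

```r
n <- 23; xbar <- 50; s <- 11; mu0 <- 45
alpha <- 0.05                      # assumed significance level
df <- n - 1
se <- s / sqrt(n)                  # standard error of the sample mean
t0 <- (xbar - mu0) / se            # test statistic

# p-values for Ha: mu < 45, mu > 45, mu != 45
p_less    <- pt(t0, df)
p_greater <- pt(t0, df, lower.tail = FALSE)
p_two     <- 2 * pt(abs(t0), df, lower.tail = FALSE)

# Critical values: reject if t0 < crit_less, t0 > crit_greater, |t0| > crit_two
crit_less    <- qt(alpha, df)
crit_greater <- qt(1 - alpha, df)
crit_two     <- qt(1 - alpha / 2, df)

# 95% two-sided interval and one-sided bounds
ci_two      <- xbar + c(-1, 1) * qt(1 - alpha / 2, df) * se
upper_bound <- xbar + qt(1 - alpha, df) * se
lower_bound <- xbar - qt(1 - alpha, df) * se
```

The same $t_0$ is used for all three tests; only the rejection region and the tail used for the $p$-value change.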

Given $n_1 = 23, \bar{x}_1=20, s_1 = 5$ and $n_2 = 32, \bar{x}_2=30, s_2 = 8$.

Test $H_0:\mu_1 = \mu_2$ vs. $H_a:\mu_1 < \mu_2$, using both the critical value method and the $p$-value method.
Test $H_0:\mu_1 = \mu_2$ vs. $H_a:\mu_1 > \mu_2$, using both the critical value method and the $p$-value method.
Test $H_0:\mu_1 = \mu_2$ vs. $H_a:\mu_1 \ne \mu_2$, using both the critical value method and the $p$-value method.
Construct a 95% 2-sided confidence interval.

Solution

The problems must be done by hand, since we are not given the original sample data.
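The hand computation can again be checked with R used as a calculator. The sketch below assumes the unequal-variance (Welch) version of the two-sample $t$ method, which is also the default of t.test; a pooled-variance version would use a different standard error and degrees of freedom. It also assumes $\alpha = 0.05$.

```r
n1 <- 23; xbar1 <- 20; s1 <- 5
n2 <- 32; xbar2 <- 30; s2 <- 8
alpha <- 0.05                      # assumed significance level

v1 <- s1^2 / n1
v2 <- s2^2 / n2
se <- sqrt(v1 + v2)                # standard error of xbar1 - xbar2
t0 <- (xbar1 - xbar2) / se         # test statistic

# Welch-Satterthwaite approximate degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# p-values for Ha: mu1 < mu2, mu1 > mu2, mu1 != mu2
p_less    <- pt(t0, df)
p_greater <- pt(t0, df, lower.tail = FALSE)
p_two     <- 2 * pt(abs(t0), df, lower.tail = FALSE)

# 95% two-sided confidence interval for mu1 - mu2
ci <- (xbar1 - xbar2) + c(-1, 1) * qt(1 - alpha / 2, df) * se
```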

Given paired data:

x: 34, 56, 78, 89, 77, 55

y: 44, 65, 70, 80, 75, 50

Test $H_0:\mu_d = 0$ vs. $H_a:\mu_d < 0$, using both the critical value method and the $p$-value method. $\alpha = 0.05$.
Test $H_0:\mu_d = 0$ vs. $H_a:\mu_d > 0$, using both the critical value method and the $p$-value method. $\alpha = 0.05$.
Test $H_0:\mu_d = 0$ vs. $H_a:\mu_d \ne 0$, using both the critical value method and the $p$-value method. $\alpha = 0.05$.
Construct a 95% 2-sided confidence interval.

Solution

The problems can be done by hand or R code, since we are given the original sample data.

R code:

alpha = 0.05
x = c(34, 56, 78, 89, 77, 55)
y = c(44, 65, 70, 80, 75, 50)

t.test(x, y, paired = TRUE, alternative = "less") # for (a)
qt(alpha, df = 6-1) # Critical value for (a)

t.test(x, y, paired = TRUE, alternative = "greater") # for (b)
qt(1-alpha, df = 6-1) # Critical value for (b)

t.test(x, y, paired = TRUE, alternative = "two.sided") # for (c); the output also gives the 95% CI for (d)
qt(alpha/2, df = 6-1); -qt(alpha/2, df = 6-1) # Critical values for (c)
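To see what the paired test computes, the hand calculation can be sketched directly on the differences $d = x - y$. The variable names below are illustrative.

```r
x <- c(34, 56, 78, 89, 77, 55)
y <- c(44, 65, 70, 80, 75, 50)

d <- x - y                         # paired differences
n <- length(d)
se <- sd(d) / sqrt(n)              # standard error of the mean difference
t0 <- mean(d) / se                 # test statistic for H0: mu_d = 0

# Two-sided p-value and 95% confidence interval for mu_d
p_two <- 2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)
ci <- mean(d) + c(-1, 1) * qt(0.975, df = n - 1) * se
```

The values of t0 and ci should agree with the statistic and the confidence interval printed by the two-sided paired t.test call.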

Given data: n = 54 and x = 35,

Test, at significance level $\alpha=0.05$, whether the population proportion is less than 0.7.
Test, at significance level $\alpha=0.05$, whether the population proportion is greater than 0.7.
Test, at significance level $\alpha=0.05$, whether the population proportion is different from 0.7.
Construct a 95% confidence interval for the population proportion.

Solution

The problems can be done by hand or R code, since we are given the original sample data.

R code:

alpha = 0.05
n = 54
x = 35

# p = 0.7 sets the null value; without it prop.test tests against p = 0.5
prop.test(x, n, p = 0.7, alternative = "less", correct = FALSE) # for (a)
qnorm(alpha) # Critical value for (a)

prop.test(x, n, p = 0.7, alternative = "greater", correct = FALSE) # for (b)
qnorm(1-alpha) # Critical value for (b)

prop.test(x, n, p = 0.7, alternative = "two.sided", correct = FALSE) # for (c); the output also gives a 95% CI for (d)
qnorm(alpha/2); -qnorm(alpha/2) # Critical values for (c)
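The $z$ statistic for $H_0: p = 0.7$ can also be computed directly, which makes the link to the X-squared value from prop.test explicit. This is a sketch with illustrative variable names. Note that the interval printed by prop.test for one sample is a score (Wilson) interval, so it differs slightly from the Wald interval $\hat{p}\pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$ computed here.

```r
n <- 54; x <- 35
p0 <- 0.7                          # null value of the proportion
alpha <- 0.05

p_hat <- x / n
z0 <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)   # test statistic

# p-values for Ha: p < 0.7, p > 0.7, p != 0.7
p_less    <- pnorm(z0)
p_greater <- pnorm(z0, lower.tail = FALSE)
p_two     <- 2 * pnorm(abs(z0), lower.tail = FALSE)

# Wald 95% confidence interval (standard error uses p_hat, not p0)
ci_wald <- p_hat + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(p_hat * (1 - p_hat) / n)
```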

Given data: $n_1 = 54, x_1=35$, and $n_2 = 63, x_2=41$,

Test, at significance level $\alpha=0.05$, whether the first population proportion is smaller.
Test, at significance level $\alpha=0.05$, whether the first population proportion is larger.
Test, at significance level $\alpha=0.05$, whether the population proportions are different.
Construct a 95% confidence interval for the difference in population proportions ($p_1-p_2$).

Solution

The problems can be done by hand or R code, since we are given the original sample data.

R code:

alpha = 0.05
n = c(54, 63)
x = c(35, 41)

prop.test(x, n, alternative = "less", correct = FALSE) # for (a)
qnorm(alpha) # Critical value for (a)

prop.test(x, n, alternative = "greater", correct = FALSE) # for (b)
qnorm(1-alpha) # Critical value for (b)

prop.test(x, n, alternative = "two.sided", correct = FALSE) # for (c); the output also gives the 95% CI for (d)
qnorm(alpha/2); -qnorm(alpha/2) # Critical values for (c)

Chi-square test for independence.

Test, at significance level 0.05, that the type of defect does not differ across shifts.

Solution

The problems can be done by hand or R code, since we are given the original sample data.

R code:

M = matrix(c(15, 26, 33, 21, 31, 17, 45, 34, 49, 13, 5, 20), 3, 4)

chisq.test(M)
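The statistic that chisq.test reports can be reproduced from the expected counts, (row total × column total) / grand total. The sketch below uses the same matrix M; the other variable names are illustrative.

```r
M <- matrix(c(15, 26, 33, 21, 31, 17, 45, 34, 49, 13, 5, 20), 3, 4)

# Expected counts under independence
expected <- outer(rowSums(M), colSums(M)) / sum(M)

# Chi-square statistic, degrees of freedom, and p-value
chi_sq <- sum((M - expected)^2 / expected)
df <- (nrow(M) - 1) * (ncol(M) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Rule-of-thumb check: all expected counts should be at least 5
min(expected)
```

Reject the null hypothesis at the 0.05 level if p_value is below 0.05, or equivalently if chi_sq exceeds qchisq(0.95, df).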

Simple Linear Regression

Example 1: Engineering - Heat Transfer

In this example, we’ll consider the relationship between the temperature difference across a heat exchanger and the rate of heat transfer. Let’s assume we have data on the temperature difference (in °C) and the heat transfer rate (in Watts) for 30 heat exchanger experiments.

Data and code:

# Sample data
temperature_diff <- c(10, 15, 20, 25, 30, 35, 40, 45, 50, 55,
                      60, 65, 70, 75, 80, 85, 90, 95, 100, 105,
                      110, 115, 120, 125, 130, 135, 140, 145, 150, 155)
heat_transfer_rate <- c(25, 42, 62, 80, 95, 110, 125, 140, 155, 170,
                        180, 192, 202, 214, 226, 236, 247, 259, 270, 280,
                        290, 302, 313, 325, 335, 346, 358, 370, 380, 390)

# Perform simple linear regression
lm_model <- lm(heat_transfer_rate ~ temperature_diff)

# Print the summary of the regression
summary(lm_model)

## 
## Call:
## lm(formula = heat_transfer_rate ~ temperature_diff)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23.4280  -4.9973   0.2456   6.6170  12.6170 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      24.21572    3.41635   7.088 1.04e-07 ***
## temperature_diff  2.42122    0.03667  66.025  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.692 on 28 degrees of freedom
## Multiple R-squared:  0.9936, Adjusted R-squared:  0.9934 
## F-statistic:  4359 on 1 and 28 DF,  p-value: < 2.2e-16

Insights:

The regression summary provides information about the estimated intercept and slope coefficients.
The “Residual standard error” estimates the standard deviation of the residuals, i.e., the typical size of the difference between observed and predicted values (here 8.692 Watts on 28 degrees of freedom).
The “R-squared” value measures the proportion of variability in the dependent variable explained by the independent variable.
The p-value associated with the slope coefficient tests whether the slope differs from zero, i.e., whether the linear relationship is statistically significant.
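The quantities discussed above can also be pulled out of the fitted model programmatically. The sketch below refits the same model and then extracts the coefficients, their 95% confidence intervals, and a prediction interval. The new temperature difference of 100 °C is a hypothetical value chosen for illustration.

```r
# Same 30 observations as in the example above
temperature_diff <- seq(10, 155, by = 5)
heat_transfer_rate <- c(25, 42, 62, 80, 95, 110, 125, 140, 155, 170,
                        180, 192, 202, 214, 226, 236, 247, 259, 270, 280,
                        290, 302, 313, 325, 335, 346, 358, 370, 380, 390)

lm_model <- lm(heat_transfer_rate ~ temperature_diff)

# Estimated intercept and slope, with 95% confidence intervals
est <- coef(lm_model)
ci <- confint(lm_model, level = 0.95)

# R-squared and residual standard error from the summary object
r2 <- summary(lm_model)$r.squared
rse <- summary(lm_model)$sigma

# Prediction interval at a hypothetical temperature difference of 100 degrees C
new_point <- data.frame(temperature_diff = 100)
pred <- predict(lm_model, newdata = new_point, interval = "prediction")
```

The slope estimates the change in heat transfer rate (in Watts) per 1 °C increase in temperature difference.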

One-Way Analysis of Variance (ANOVA)

What is 1-way ANOVA?

A one-way ANOVA (Analysis of Variance) is a statistical test used to compare means across two or more groups or categories. It assesses whether there are any statistically significant differences between the means of these groups, indicating that at least one group is different from the others in terms of the variable being tested.

The one-way ANOVA is particularly useful when you have one independent variable (also known as a factor) with multiple levels or categories, and you want to determine if those levels have a significant effect on a continuous dependent variable.

Here are the key components and steps of performing a one-way ANOVA:

Null Hypothesis (H0): The null hypothesis states that there is no significant difference between the means of the groups. Mathematically, it can be written as: H0: μ1 = μ2 = … = μk, where μ1, μ2, …, μk are the population means of the different groups.

Alternative Hypothesis (Ha): The alternative hypothesis states that at least one group mean is significantly different from the others.

Assumptions:

The data within each group are normally distributed.
The variances of the groups are approximately equal (homogeneity of variances).
The observations are independent.

ANOVA Test Statistic: The ANOVA test statistic is calculated by comparing the variability between the group means (explained variance) to the variability within each group (unexplained variance).

Calculating the F-Statistic: The F-statistic is calculated as the ratio of the between-group variability to the within-group variability.

P-Value and Significance Level: The F-statistic is used to calculate a p-value. If the p-value is below a chosen significance level (often denoted as α), you can reject the null hypothesis.

Interpretation: If the p-value is below the significance level, you can conclude that at least one group mean is significantly different from the others. However, the ANOVA itself does not indicate which specific groups are different. For that, post-hoc tests like Tukey’s Honestly Significant Difference (HSD) or Bonferroni corrections are often used.

If the p-value is above the significance level, you fail to reject the null hypothesis, indicating that there is not enough evidence to conclude that there are significant differences among the groups.

Overall, the one-way ANOVA helps you determine whether the observed differences among group means are likely due to true differences in the populations or if they could be due to random chance.

Consider an engineering experiment involving the hardness of different types of metals:

# Only base R functions are used below, so no extra packages need to be loaded

# Fake experiment data with assumed hardness values
experiment_data <- data.frame(
  Metal_Type = rep(c("Aluminum", "Steel", "Copper"), times = c(11, 11, 12)),
  Hardness = c(85, 82, 88, 87, 83, 84, 89, 86, 84, 88, 90, 85, 82, 87, 88, 89, 75, 78, 80, 76, 79, 77, 82, 81, 78, 110, 115, 120, 112, 118, 114, 100, 105, 108)
)

# Display the first few rows of the dataset
head(experiment_data)

##   Metal_Type Hardness
## 1   Aluminum       85
## 2   Aluminum       82
## 3   Aluminum       88
## 4   Aluminum       87
## 5   Aluminum       83
## 6   Aluminum       84

# Create a box plot using base R
boxplot(Hardness ~ Metal_Type, data = experiment_data,
        main = "Hardness Comparison by Metal Type",
        xlab = "Metal Type",
        ylab = "Hardness")

# Perform one-way ANOVA to compare hardness across different metal types
anova_result <- aov(Hardness ~ Metal_Type, data = experiment_data)

# Summarize the ANOVA results
summary(anova_result)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Metal_Type   2   3175  1587.5   17.48 8.27e-06 ***
## Residuals   31   2816    90.8                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Here’s what each part of the ANOVA summary means:

Df (Degrees of Freedom): This represents the degrees of freedom associated with the model (Metal_Type) and the residuals (unexplained variability).

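Sum Sq (Sum of Squares): For the Metal_Type row, this is the between-group sum of squares (squared differences between each group mean and the overall mean, weighted by group size). For the Residuals row, it is the within-group sum of squares (squared differences between the observed values and their group means).

Mean Sq (Mean Square): This is the sum of squares divided by its degrees of freedom. The residual mean square estimates the common within-group variance.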

F value: This is the calculated test statistic, which is the ratio of the mean sum of squares between groups to the mean sum of squares within groups. It measures the difference in means relative to the variation within groups.

Pr(>F) (p-value): This is the p-value associated with the F value. It is the probability of observing an F value at least as large as the one calculated, assuming the null hypothesis is true (no group differences). A small p-value suggests that at least one group’s mean is significantly different from the others.

In this example, the p-value for the “Metal_Type” factor is 8.27e-06, which is far below the typical significance level of 0.05. This means that we have enough evidence to reject the null hypothesis and conclude that there are statistically significant differences in hardness among the different metal types.
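The entries of the ANOVA table can be reproduced by hand, which shows where the F value comes from. The sketch below uses the same assumed hardness data; the variable names are illustrative.

```r
metal <- factor(rep(c("Aluminum", "Steel", "Copper"), times = c(11, 11, 12)))
hardness <- c(85, 82, 88, 87, 83, 84, 89, 86, 84, 88, 90, 85, 82, 87, 88, 89,
              75, 78, 80, 76, 79, 77, 82, 81, 78, 110, 115, 120, 112, 118,
              114, 100, 105, 108)

grand_mean  <- mean(hardness)
group_means <- tapply(hardness, metal, mean)
group_sizes <- tapply(hardness, metal, length)

# Between-group and within-group sums of squares
ss_between <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within  <- sum((hardness - ave(hardness, metal))^2)

# Degrees of freedom: k - 1 and N - k
df_between <- nlevels(metal) - 1
df_within  <- length(hardness) - nlevels(metal)

# F statistic and p-value
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
```

The values of ss_between, ss_within, f_value, and p_value correspond to the Sum Sq, F value, and Pr(>F) entries in the table above.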

The “Insight”:

The p-value indicates that there are significant differences in hardness among the metal types. This suggests that the type of metal used affects the hardness of the material.

However, to determine exactly which metal types are significantly different from each other, you might consider using post hoc tests, such as Tukey’s Honestly Significant Difference (HSD) test or pairwise t-tests, to perform multiple comparisons between group means.
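A minimal sketch of such a follow-up with Tukey's HSD, using the same assumed data, is given below. The grouping variable is converted to a factor explicitly, since TukeyHSD works on factors in the fitted model.

```r
experiment_data <- data.frame(
  Metal_Type = factor(rep(c("Aluminum", "Steel", "Copper"), times = c(11, 11, 12))),
  Hardness = c(85, 82, 88, 87, 83, 84, 89, 86, 84, 88, 90, 85, 82, 87, 88, 89,
               75, 78, 80, 76, 79, 77, 82, 81, 78, 110, 115, 120, 112, 118,
               114, 100, 105, 108)
)

fit <- aov(Hardness ~ Metal_Type, data = experiment_data)

# All pairwise comparisons of group means with a 95% family-wise confidence level
tukey <- TukeyHSD(fit, conf.level = 0.95)
tukey
```

Each row of the output gives the difference between two group means, its confidence interval (lwr, upr), and an adjusted p-value; a pair differs significantly when its interval excludes 0.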

In this example, we first generate boxplots to display the distribution of hardness measurements for each metal type. The x-axis represents the metal types, and the y-axis represents the hardness values. We then use the aov() function to perform a one-way ANOVA on the “Hardness” variable across the three metal types. The summary() function provides the ANOVA table with the F-statistic, p-value, and other relevant information.

By running this code, you can assess whether there are significant differences in hardness among the three metal types based on the assumed values. If the p-value is below your chosen significance level, you would conclude that there are statistically significant differences in hardness. If the p-value is not significant, you would not find evidence of significant differences.

Quality Control

Appendix

Some useful datasets: http://fs2.american.edu/baron/www/Book/

Statistical tables: https://read.wiley.com/books/9781119400363/page/152/section/top-of-page

Practice Questions for Stat 353 Homework Assignments

SZ

1/12/2022
