\(P(J|C)=\frac{P(J\cap C)}{P(C)}=\frac{0.2}{0.6}=\frac{1}{3}\)
Given \(P(C|J)=0.4\), want \(P(J)\). Since
\[0.4 = P(C|J)=\frac{P(C\cap J)}{P(J)}=\frac{0.2}{P(J)}\] we get \[P(J)=0.5\]
2. Denote “Left-handed” by “LH”, “Red-Green color-blind” by “RG”, “Blue-Yellow color-blind” by “BY”, “Completely color-blind” by “Complete”, and “Any kind of color-blindness” by “Color”. We are given \(P(LH) = 0.11, P(RG)=0.0560, P(BY)=0.0224\), and \(P(Complete)=0.0016\). We also know \(P(Color)=0.0560+0.0224+0.0016=0.08\).
\(P(No ~color-blindness)=1-0.0560-0.0224-0.0016=0.92\)
Given \(P(LH\cap Color)=0.0091\), want \(P(LH \cup Color)\). By the addition rule, \(P(LH\cup Color)=P(LH)+P(Color)-P(LH\cap Color)=0.11 + 0.08-0.0091=0.1809\).
\(P(RH\cap Color)=P(RH)\cdot P(Color)=(1-0.11)\cdot (0.08)=0.0712.\)
Want \(P(Color|LH)\). Since “Color” and “LH” are independent, \(P(Color|LH)=P(Color)=0.08\).
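The arithmetic in this solution can be checked with a few lines of R. This is only a sketch: the variable names are made up for illustration, and the three kinds of color-blindness are added together exactly as the solution does, which treats them as mutually exclusive.

```r
# Probabilities given in the problem
p_lh <- 0.11
p_rg <- 0.0560
p_by <- 0.0224
p_complete <- 0.0016

# Any kind of color-blindness (the three kinds are assumed not to overlap)
p_color <- p_rg + p_by + p_complete
p_none <- 1 - p_color

# Addition rule, using the given intersection probability
p_lh_and_color <- 0.0091
p_lh_or_color <- p_lh + p_color - p_lh_and_color

# Under independence, an intersection is a product of probabilities
p_rh_and_color <- (1 - p_lh) * p_color

round(c(p_color, p_none, p_lh_or_color, p_rh_and_color), 4)
```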
- \(P(X< 1)\)
- \(P(X>1)\)
- \(P(1<X<2)\)
- the mean \(\mu = E(X)\), and
- the standard deviation \(\sigma\)
Solution.
\(P(X<1)=\int_0^1 2e^{-2x}dx=1-e^{-2}=0.8647\)
\(P(X>1)=\int_1^{\infty}2e^{-2x}dx=e^{-2}=0.1353\)
\(P(1<X<2)=\int_1^{2}2e^{-2x}dx=e^{-2}-e^{-4}=0.1170\)
Since the distribution is exponential with rate \(\lambda=2\), the mean is \(\mu = \frac{1}{\lambda}=\frac{1}{2}\), and
the standard deviation equals the mean for exponential distributions, so \(\sigma=\frac{1}{2}\).
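The same quantities can be obtained in R with `pexp()`. The sketch below assumes the density \(2e^{-2x}\) used in the integrals above, i.e. an exponential distribution with rate 2; the variable names are illustrative.

```r
rate <- 2  # density 2 * exp(-2 * x) for x > 0

p_less_1    <- pexp(1, rate = rate)                      # P(X < 1)
p_greater_1 <- pexp(1, rate = rate, lower.tail = FALSE)  # P(X > 1)
p_between   <- pexp(2, rate = rate) - pexp(1, rate = rate)  # P(1 < X < 2)

# Mean and standard deviation of an exponential distribution are both 1 / rate
mu    <- 1 / rate
sigma <- 1 / rate

# Cross-check P(X < 1) by numerical integration of the density
p_check <- integrate(function(x) rate * exp(-rate * x), lower = 0, upper = 1)$value

round(c(p_less_1, p_greater_1, p_between, mu, sigma), 4)
```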
- \(P(X< 1)\)
- \(P(X>1)\)
- \(P(0<X<1)\)
- the mean \(\mu = E(X)\),
- the standard deviation \(\sigma\), and
- \(x\) such that \(P(x<X)=0.05\)
Solution.
\(P(X<1)=\int_{-1}^1 \frac{1}{3}x^2dx=\frac{2}{9}\)
\(P(X>1)=\int_1^{2}\frac{1}{3}x^2dx=\frac{7}{9}\)
\(P(0<X<1)=\int_0^{1}\frac{1}{3}x^2dx=\frac{1}{9}\)
\(\mu = \int_{-1}^2 x\cdot \frac{1}{3}x^2dx=\frac{5}{4}\), and
The variance \(\sigma^2=\int_{-1}^2 x^2\cdot \frac{1}{3}x^2dx-\mu^2=\frac{11}{5}-\frac{25}{16}=\frac{51}{80}\), so the standard deviation is \(\sqrt{\frac{51}{80}}\approx 0.7984\).
Since \(x\) will have to be between \(-1\) and 2, \(P(x<X)=\int_x^{2}\frac{1}{3}t^2dt=\frac{8-x^3}{9}\). Setting \(\frac{8-x^3}{9}=0.05\) yields \(x\approx 1.9618\).
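These integrals can also be evaluated numerically. The following R sketch assumes the density \(f(x)=\frac{1}{3}x^2\) on \((-1, 2)\) that appears in the integrals above, and uses `integrate()` and `uniroot()` from base R; all object names are illustrative.

```r
# Density: x^2 / 3 on (-1, 2), zero elsewhere
f <- function(x) ifelse(x > -1 & x < 2, x^2 / 3, 0)

p_less_1    <- integrate(f, -1, 1)$value  # P(X < 1)
p_greater_1 <- integrate(f, 1, 2)$value   # P(X > 1)
p_0_to_1    <- integrate(f, 0, 1)$value   # P(0 < X < 1)

# Mean, second moment, and standard deviation
mu    <- integrate(function(x) x * f(x), -1, 2)$value
ex2   <- integrate(function(x) x^2 * f(x), -1, 2)$value
sigma <- sqrt(ex2 - mu^2)

# Solve (8 - x^3) / 9 = 0.05 for x inside the support
x_upper <- uniroot(function(x) (8 - x^3) / 9 - 0.05,
                   interval = c(-1, 2), tol = 1e-10)$root

c(p_less_1, p_greater_1, p_0_to_1, mu, sigma, x_upper)
```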
\[\begin{equation} F(x)= \begin{cases} 0, & x\le -1 \\ \frac{x+1}{3}, & -1<x<2\\ 1, & x\ge 2\\ \end{cases} \end{equation}\]
find
- \(P(X>1)\)
- \(P(X<1.4)\)
- \(P(0<X<1)\)
- \(P(X>3)\)
Solution.
\(P(X>1)=1-P(X\le 1)=1-F(1)=1-\frac{1+1}{3}=\frac{1}{3}\)
\(P(X<1.4)\stackrel{\text{since X is continuous random variable}}= P(X\le 1.4)=F(1.4)=0.8\)
\(P(0<X<1)\stackrel{\text{since X is continuous random variable}}=F(1)-F(0)=\frac{2}{3}-\frac{1}{3}=\frac{1}{3}\)
\(P(X>3)=1-P(X\le 3)=1-F(3)=1-1=0\)
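As a quick check, the piecewise cumulative distribution function can be coded directly. In the sketch below, `cdf` is a hypothetical helper name; the given \(F(x)\) is the CDF of a uniform distribution on \((-1, 2)\), so `punif()` should give the same values.

```r
# Piecewise CDF from the problem statement
cdf <- function(x) ifelse(x <= -1, 0, ifelse(x < 2, (x + 1) / 3, 1))

p_gt_1   <- 1 - cdf(1)        # P(X > 1)
p_lt_1.4 <- cdf(1.4)          # P(X < 1.4)
p_0_1    <- cdf(1) - cdf(0)   # P(0 < X < 1)
p_gt_3   <- 1 - cdf(3)        # P(X > 3)

# Same values from the built-in uniform CDF
p_lt_1.4_unif <- punif(1.4, min = -1, max = 2)

c(p_gt_1, p_lt_1.4, p_0_1, p_gt_3)
```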
Solution.
For \(x \le 0\), \(F(x)=0\); for \(x >0\), \(F(x)=\int_0^x 2e^{-2t}dt=1-e^{-2x}\).
\[\begin{equation} F(x)= \begin{cases} 0, & x\le 0 \\ 1-e^{-2x}, & x>0\\ \end{cases} \end{equation}\]
- \(P(X<90)\).
- \(P(X>120)\)
Solution.
\(P(X<90)=P(X-\mu<90-\mu)=P(\frac{X-\mu}{\sigma}<\frac{90-\mu}{\sigma})=P(Z<-0.67)=0.2514\).
\(P(X>90)=1-P(X\le 90)=1-0.2514=0.7486\)
- the probability that there is no call within 25 minutes.
- the probability that there is at least one call within 10 minutes.
- the probability that the first call arrives within 15 to 20 minutes after opening.
- the length of an interval of time such that the probability of at least one call is in the interval is 0.8. Round to two decimal places.
Solution.
Let \(X\) denote the time between calls. The probability density function of \(X\) is \[f(x) = \frac{1}{15}e^{-\frac{1}{15}x}, ~~ x>0\] The cumulative distribution function is
\[F(x) = 1-e^{-\frac{1}{15}x}, ~~ x>0\]
- \(P(X>25)=1-P(X\le 25)=1-F(25)=e^{-\frac{25}{15}}=0.1889\)
- \(P(X\le 10)=F(10)=1-e^{-\frac{10}{15}}=0.4866\)
- \(P(15\le X\le 20)=F(20)-F(15)=e^{-\frac{15}{15}}-e^{-\frac{20}{15}}=0.1043\)
- Suppose that the interval is (0, b). We need to solve the equation \(P(X<b)=0.8\). Since \(P(X<b)=F(b)=1-e^{-\frac{1}{15}b}\), solving the equation \(1-e^{-\frac{1}{15}b}=0.8\) gives \(b=24.14\)
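The four answers can be reproduced with `pexp()` and `qexp()`. This sketch assumes the exponential model stated in the solution, with rate \(\frac{1}{15}\) per minute; the variable names are illustrative.

```r
rate <- 1 / 15  # calls per minute, so the mean time between calls is 15 minutes

p_no_call_25 <- pexp(25, rate = rate, lower.tail = FALSE)      # P(X > 25)
p_call_10    <- pexp(10, rate = rate)                          # P(X <= 10)
p_15_to_20   <- pexp(20, rate = rate) - pexp(15, rate = rate)  # P(15 <= X <= 20)

# Length b with P(X < b) = 0.8 is the 0.8 quantile
b <- qexp(0.8, rate = rate)

round(c(p_no_call_25, p_call_10, p_15_to_20), 4)
round(b, 2)
```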
\(x\) | \(y\) | \(f(x,y)\) |
-1 | 0 | 1/6 |
2 | 1 | 1/3 |
3 | -2 | 1/4 |
2 | -2 | 1/4 |
Determine the marginal probability mass function of \(X\).
Determine the marginal probability mass function of \(Y\).
Determine the mean and variance of \(X\).
Determine the mean and variance of \(Y\).
Determine the covariance between \(X\) and \(Y\).
Determine the correlation between \(X\) and \(Y\).
Determine \(P(X<2.5, Y>-1)\).
Solution.
\(P(X=-1)=1/6\), \(P(X=2)=1/3+1/4=7/12\), and \(P(X=3)=1/4\). So the marginal probability mass function of \(X\) is
\(x\) | \(f_X(x)\) |
-1 | 1/6 |
2 | 7/12 |
3 | 1/4 |
\(P(Y=0)=1/6\), \(P(Y=1)=1/3\), and \(P(Y=-2)=1/4+1/4=1/2\). So the marginal probability mass function of \(Y\) is
\(y\) | \(f_Y(y)\) |
0 | 1/6 |
1 | 1/3 |
-2 | 1/2 |
The mean of \(X\) is \(E(X)=(-1)(1/6)+(2)(7/12)+(3)(1/4)=1.75\). The variance is \((-1)^2(1/6)+(2)^2(7/12)+(3)^2(1/4)-1.75^2=1.6875\).
The mean of \(Y\) is \(E(Y)=(0)(1/6)+(1)(1/3)+(-2)(1/2)=-2/3\). The variance is \((0)^2(1/6)+(1)^2(1/3)+(-2)^2(1/2)-(-2/3)^2=17/9\).
To calculate the covariance, we need to find \(E(XY)\) first. \[E(XY)=(-1)(0)(1/6)+(2)(1)(1/3)+(3)(-2)(1/4)+(2)(-2)(1/4)=-11/6\]
The covariance is
\[Cov(X,Y)=E(XY)-E(X)E(Y)=-11/6-(1.75)(-2/3)=-2/3\]
\[\rho = \frac{Cov(X,Y)}{\sqrt{Var(X)}\sqrt{Var(Y)}}=\frac{-2/3}{\sqrt{1.6875}\sqrt{17/9}}=-0.373408\]
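The whole solution can be checked in R. The sketch below enters the four support points as they are used in the \(E(XY)\) sum above, namely \((-1,0)\), \((2,1)\), \((3,-2)\) and \((2,-2)\); object names are illustrative. It also evaluates \(P(X<2.5, Y>-1)\) by adding the probabilities of the points that satisfy both conditions.

```r
# Joint pmf: one entry per support point
x <- c(-1, 2, 3, 2)
y <- c(0, 1, -2, -2)
p <- c(1/6, 1/3, 1/4, 1/4)

# Marginal pmfs: add the joint probabilities over the other variable
f_x <- tapply(p, x, sum)
f_y <- tapply(p, y, sum)

# Means and variances
e_x <- sum(x * p); v_x <- sum(x^2 * p) - e_x^2
e_y <- sum(y * p); v_y <- sum(y^2 * p) - e_y^2

# Covariance and correlation
cov_xy <- sum(x * y * p) - e_x * e_y
rho    <- cov_xy / sqrt(v_x * v_y)

# P(X < 2.5, Y > -1): only (-1, 0) and (2, 1) qualify
p_event <- sum(p[x < 2.5 & y > -1])

f_x; f_y
c(e_x, v_x, e_y, v_y, cov_xy, rho, p_event)
```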
The following R code might be useful.
# Create a data vector or array and store the data in x
x = c(23, 45, 12, 9, 15, 42, 40, 22, 25, 60, 28, 52, 44)
y = c(45, 85, 30, 17, 34, 86, 85, 50, 48, 115, 64, 100, 90)
# Calculate sample mean
mean(x)
# Calculate sample variance
var(x)
# Calculate sample standard deviation
sd(x)
# Calculate median
median(x)
# Calculate sample correlation between x and y
cor(x,y)
# Create histogram
hist(x)
# Create boxplot
boxplot(x)
# Create stem-and-leaf plot
stem(x)
# Create a scatter plot
plot(y~x)
Some useful formulas:
\(P(A \cup B)=P(A)+P(B)-P(A\cap B)\).
For a discrete random variable, the mean \(\mu=E(X)=\sum x_i p_i\) and variance \(\sigma^2=V(X)=\sum x_i^2 p_i-\mu^2\).
For a continuous random variable, the mean \(\mu=E(X)=\int_{-\infty}^{\infty}xf(x)dx\) and variance \(\sigma^2=V(X)=\int_{-\infty}^{\infty}x^2f(x)dx-\mu^2\).
For two discrete random variables, the covariance \(cov(X,Y)=E(XY)-E(X)E(Y)\) and correlation \(\rho=\frac{cov(X,Y)}{\sigma_X\cdot \sigma_Y}\).
\(E(aX)=aE(X)\), \(E(X+c)=E(X)+c\)
\(V(aX)=a^2V(X)\), \(V(X+c)=V(X)\)
If \(X\) and \(Y\) are two independent random variables and \(a~ \& ~b\) are constants, then \(V(aX+bY)=a^2V(X)+b^2V(Y)\).
The sample variance of a sample is \(s^2 = \frac{\sum_{i=1}^{n}(x_i -\bar{x})^2}{n-1}\).
The sample correlation between two quantitative variables is \(r = \frac{\sum_{i=1}^{n}x_i y_i-n\bar{x}\bar{y} }{\sqrt{\sum x_i^2-n\bar{x}^2}\sqrt{\sum y_i^2-n\bar{y}^2}}\).
The binomial distribution has the probability mass function: \(P(X=x)=\binom{n}{x}p^x (1-p)^{n-x}, ~~x = 0, 1, 2, \dots, n\)
The Poisson distribution has the probability mass function: \(P(X=x)=\frac{\lambda^x}{x!} e^{-\lambda}, ~~x = 0, 1, 2, \dots\)
The geometric distribution has the probability mass function: \(P(x)=(1-p)^{x-1}p, ~~x = 1, 2, \cdots\)
The exponential distribution has the probability density function \(f(x)=\lambda e^{-\lambda x}, ~ x>0\). The cumulative distribution function is \(F(x)=1-e^{-\lambda x}, ~ x>0\). The mean and the standard deviation are both \(\frac{1}{\lambda}\).
The uniform distribution has the probability density function \(f(x)=\frac{1}{b-a}, ~ a<x<b\). The mean is \(\frac{a+b}{2}\) and the standard deviation is \(\frac{b-a}{\sqrt{12}}\).
The conditional probability is defined as \(P(B|A)=\frac{P(A\cap B)}{P(A)}\), where \(P(A)>0\).
If \(X\) is continuous random variable with the probability density function \(f(x)\), then \(P(a<X<b)=\int_a^b f(x)dx\).
If \(X\sim\text{N}(\mu_1, \sigma_1^2)\) and \(Y\sim\text{N}(\mu_2, \sigma_2^2)\) are independent, then \(X+Y\sim\text{N}(\mu_1+\mu_2, \sigma_1^2+\sigma_2^2)\).
Some typical problems:
If events \(A\) and \(B\) are independent with \(P(A)=0.2\) and \(P(B)=0.6\), then \(P(A \cap B)=(0.2)(0.6)=0.12\).
If events \(A\) and \(B\) are disjoint (meaning that they can’t happen simultaneously) with \(P(A)=0.2\) and \(P(B)=0.6\), then \(P(A \cup B)=0.2+0.6=0.8\).
If \(P(X<10)=0.2\) and \(P(X<20)= 0.6\), then \(P(10\le X<20)=0.6-0.2=0.4\).
A series system consists of 5 components, each functioning independently with probability 0.9. What is the probability that the system functions? Answer: \(0.9^5=0.59049\).
If \(X\sim\text{N}(\mu_1=100, \sigma_1^2=200)\) and \(Y\sim\text{N}(\mu_2=120, \sigma_2^2=25)\) are independent, then \(P(X+Y>250)=?\). Answer: Since \(X+Y\sim\text{N}(\mu=220, \sigma^2=225)\), \(P(X+Y>250)=P(Z>\frac{250-220}{15})=P(Z>2)=1-P(Z\le 2)=1-0.9772=0.0228.\)
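The last problem can be checked with `pnorm()`. This is a minimal sketch with illustrative variable names; note that `pnorm()` takes a standard deviation, not a variance.

```r
# Sum of independent normals: means add and variances add
mu_sum <- 100 + 120
sd_sum <- sqrt(200 + 25)

# Upper-tail probability P(X + Y > 250)
p_upper <- pnorm(250, mean = mu_sum, sd = sd_sum, lower.tail = FALSE)

round(p_upper, 4)
```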
We will provide examples
for testing \(\mu\) using the 1-sample z test approach when \(\sigma\) is known
for testing \(\mu\) using the 1-sample t test approach when \(\sigma\) is unknown
for testing \(p\) using the 1-sample z test approach
for testing goodness of fit using the chi-square approach
for testing independence between two categorical variables using the chi-square approach
Example 1. A two-sided test for a population mean with \(\sigma\) known.
https://www.youtube.com/watch?v=BWJRsY-G8u0
Example 2. A left-sided test for a population mean with \(\sigma\) known.
https://www.youtube.com/watch?v=oEW8Hd_xy1k
Example 3. A right-sided test for a population mean with \(\sigma\) known.
Suppose we want to test whether a population mean is greater than 20. A random sample of size 36 gives a sample mean of 22. If the population standard deviation is 5, test, at level 0.05, whether the population mean exceeds 20.
Solution.
The null and alternative hypotheses are:
\[H_0:\mu = 20 ~~~ vs ~~~ H_a: \mu > 20\] The test statistic value is
\[z_0=\frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}=\frac{22-20}{5/\sqrt{36}}=2.4\]
Since a larger sample mean or a larger \(z_0\) suggests rejection of the null hypothesis, the rejection region looks like \((c, \infty)\) with the critical value \(c=z_{\alpha}\). By the standard normal table or the R code \(qnorm(1-\alpha)\), \(c=1.645\).
Since the test statistic value falls in the rejection region, reject the null hypothesis.
Equivalently, we can use the \(p\)-value approach. The \(p\)-value is the area to the right of the statistic value under the standard normal curve. By the standard normal table or the R code \(1-pnorm(2.4)\), the \(p\)-value is 0.0082. Since the \(p\)-value is less than the significance level, reject the null hypothesis.
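Both approaches for this right-sided \(z\) test can be written out in a few lines of R. This is a sketch using the numbers from the example; the variable names are illustrative.

```r
xbar  <- 22    # sample mean
mu0   <- 20    # hypothesized mean
sigma <- 5     # known population standard deviation
n     <- 36    # sample size
alpha <- 0.05  # significance level

# Test statistic
z0 <- (xbar - mu0) / (sigma / sqrt(n))

# Critical value approach: reject when z0 exceeds the upper alpha quantile
crit <- qnorm(1 - alpha)
reject_by_critical_value <- z0 > crit

# p-value approach: area to the right of z0
p_value <- pnorm(z0, lower.tail = FALSE)
reject_by_p_value <- p_value < alpha

c(z0 = z0, crit = crit, p_value = p_value)
```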
Example 1.
A very useful video: https://www.youtube.com/watch?v=VPd8DOL13Iw
Example 2.
Your company wants to improve sales. Past sales data indicate that the average sales was $100 per transaction. After training your sales force, recent sales data (taken from a random sample of 25 salesmen) indicates an average of $130, with a standard deviation of $15. Did the training work? Test your hypothesis at a 0.05 significance level.
Solution.
The population mean \(\mu\) is the parameter of interest. To test whether sales have improved, we should have the null and alternative hypotheses as follows:
\[H_0: \mu=100 ~~~ vs ~~~ H_a: \mu>100\] The value of the test statistic is
\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{130-100}{15/\sqrt{25}}=10\] with \(n-1\) or 24 degrees of freedom.
Since larger \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like \((c, \infty)\), where \(c=t_{\alpha, n-1}\). We are given \(\alpha=0.05\), so the critical value based on the \(t_{24}\) distribution is 1.711, which is obtained by R code \(qt(1-\alpha, n-1)\) or by a \(t\)-table.
Since the test statistic value 10 falls in the rejection region, we reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the \(t_{24}\) distribution to the right of the test statistic value. Using the \(t\) table or the R code \(1-pt(10, 24)\), we know the \(p\)-value is smaller than 0.001 and thus smaller than the significance level 0.05. Again, we reject the null hypothesis.
In conclusion, the data provide sufficient evidence that sales have improved after training.
The following is a video explaining the above procedure:
https://www.youtube.com/watch?v=7ty2bO6VrUI
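Because only summary statistics are given in the sales example, `t.test()` cannot be applied directly. One way to do the calculation in R is a small helper like the one sketched below; `one_sample_t` is a hypothetical function written for illustration, not part of base R.

```r
# Hypothetical helper: one-sample t test from summary statistics only
one_sample_t <- function(xbar, s, n, mu0, alternative = "greater") {
  t0 <- (xbar - mu0) / (s / sqrt(n))
  df <- n - 1
  p_value <- switch(alternative,
    greater   = pt(t0, df, lower.tail = FALSE),
    less      = pt(t0, df),
    two.sided = 2 * pt(-abs(t0), df)
  )
  list(t0 = t0, df = df, p_value = p_value)
}

# Sales example: xbar = 130, s = 15, n = 25, H0: mu = 100 vs Ha: mu > 100
res  <- one_sample_t(xbar = 130, s = 15, n = 25, mu0 = 100)
crit <- qt(1 - 0.05, df = res$df)  # right-tail critical value at alpha = 0.05

res
res$t0 > crit  # TRUE means the statistic falls in the rejection region
```

The same helper covers left-sided and two-sided alternatives by changing the `alternative` argument.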
Example 3.
A firm claims that its product weighs 19 pounds on average. A supervisory authority suspects that the average weight is below 19 pounds, so it collects a random sample of 51 products made by the company from the market. The sample mean is 18.5 pounds with a standard deviation of 3.2 pounds. Test appropriate hypotheses at the significance level 0.01. In order to protect itself from being sued by the company, should the authority use a larger or smaller significance level?
Solution.
The null and alternative hypotheses are:
\[H_0: \mu=19 ~~~ vs ~~~ H_a: \mu<19\] The value of the test statistic is
\[t_0 = \frac{\bar{x}-\mu_0}{s/\sqrt{n}}=\frac{18.5-19}{3.2/\sqrt{51}}=-1.1158\] with \(n-1\) or 50 degrees of freedom.
Since smaller \(\bar{x}\)’s or \(t_0\)’s suggest rejection of the null hypothesis, the rejection (or critical) region looks like \((-\infty, c)\), where \(c=-t_{\alpha, n-1}\). We are given \(\alpha=0.01\), so the critical value based on the \(t_{50}\) distribution is \(-2.403\), which is obtained by R code \(qt(\alpha, n-1)\) with \(\alpha = 0.01, n=51\) or by a \(t\)-table.
Since the test statistic value \(-1.1158\) does not fall in the rejection region, we fail to reject the null hypothesis.
Equivalently, we can calculate the \(p\)-value, which is the area under the \(t_{50}\) distribution to the left of the test statistic value. Using the \(t\) table or the R code \(pt(-1.1158, 50)\), we know the \(p\)-value is 0.1349 and thus NOT smaller than the significance level 0.01. Again, we fail to reject the null hypothesis.
In conclusion, the data do not provide sufficient evidence that the average weight of the firm’s products is below 19 pounds.
The following is a video explaining the above procedure: https://www.youtube.com/watch?v=ZY5XxJ2aJNc
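To see how the choice of significance level affects the decision in the weight example, the left-tail critical value can be tabulated for several levels. The sketch below uses the summary statistics from the example; the grid of \(\alpha\) values is an illustrative choice. A smaller \(\alpha\) moves the critical value further into the left tail, which makes rejection harder.

```r
xbar <- 18.5; s <- 3.2; n <- 51; mu0 <- 19

# Test statistic and left-tail p-value
t0      <- (xbar - mu0) / (s / sqrt(n))
p_value <- pt(t0, df = n - 1)

# Left-tail critical values for several significance levels
alphas <- c(0.01, 0.05, 0.10)
crit   <- qt(alphas, df = n - 1)

# Reject H0 when t0 falls below the critical value
decisions <- data.frame(alpha = alphas, critical = crit, reject = t0 < crit)

c(t0 = t0, p_value = p_value)
decisions
```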
Test each of the following, at the 0.05 significance level, using data: \(n = 36, ~~\hat{p}=0.3\).
\[(a) ~~H_0:p=0.4 ~~ vs ~~ H_a: p<0.40\] \[(b) ~~H_0:p=0.4 ~~ vs ~~ H_a: p>0.40\]
\[(c) ~~H_0:p=0.4 ~~ vs ~~ H_a: p\ne 0.40\]
Solution.
The test statistic is the same for all 3 cases:
\[z_0=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}=\frac{0.3-0.40}{\sqrt{\frac{0.4(1-0.4)}{36}}}=-1.22\]
(a) The rejection region looks like \((-\infty, c)\) with the critical value \(c=-1.645\) obtained by using the standard normal table or R code \(qnorm(0.05)\). Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the \(p\)-value approach. The \(p\)-value is obtained by using the standard normal table or R code \(pnorm(-1.22)\), which is 0.11. Since the \(p\)-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.
(b) The rejection region looks like \((c, \infty)\) with the critical value \(c=1.645\) obtained by using the standard normal table or R code \(qnorm(0.95)\). Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the \(p\)-value approach. The \(p\)-value is obtained by using the standard normal table or R code \(1-pnorm(-1.22)\), which is 0.89. Since the \(p\)-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.
(c) The rejection region looks like \((-\infty, -c)\cup (c, \infty)\) with the critical values \(-c=-1.96\) and \(c=1.96\) obtained by using the standard normal table or R code \(qnorm(0.025)\) and \(qnorm(0.975)\). Since the test statistic value does not fall in the rejection region, we fail to reject the null hypothesis. Equivalently, we can use the \(p\)-value approach. The \(p\)-value is obtained by using the standard normal table or R code \(pnorm(-1.22)*2\), which is 0.22. Since the \(p\)-value is not smaller than the significance level 0.05, we again fail to reject the null hypothesis.
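The three cases can be computed side by side in R. This sketch works from the summary values \(n=36\) and \(\hat{p}=0.3\) and keeps the unrounded test statistic, so the \(p\)-values may differ from table-based values in the third decimal place; variable names are illustrative.

```r
n <- 36; p_hat <- 0.3; p0 <- 0.4; alpha <- 0.05

# One-sample z statistic for a proportion (same for all three alternatives)
z0 <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# p-values for the left-sided, right-sided and two-sided alternatives
p_left  <- pnorm(z0)
p_right <- pnorm(z0, lower.tail = FALSE)
p_two   <- 2 * pnorm(-abs(z0))

# Matching critical values
crit_left  <- qnorm(alpha)
crit_right <- qnorm(1 - alpha)
crit_two   <- qnorm(1 - alpha / 2)

round(c(z0 = z0, p_left = p_left, p_right = p_right, p_two = p_two), 2)
round(c(crit_left, crit_right, crit_two), 3)
```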
A quick video: https://www.youtube.com/watch?v=b3o_hjWKgQw
Example.
An IT specialist claims that
32% of IT hardware failures are mainly due to exposure to extreme temperatures
24% are mainly due to ineffective cleaning routines
20% are mainly due to human error caused by poor training
24% are due to other reasons
Based on the past 5 years of data, she has the following results on IT hardware failures:
30 are mainly due to exposure to extreme temperatures
25 are mainly due to ineffective cleaning routines
18 are mainly due to human error caused by poor training
27 are due to other reasons
Test, at the 0.05 significance level, whether the claim of the IT specialist is supported by the data.
Step 1: Specify the null and alternative hypotheses. \[H_0: p_1=0.32, ~p_2=0.24, ~p_3=0.20, ~p_4=0.24 ~~ vs ~~H_a: \text{at least one proportion is wrongly specified}\]
Step 2: Calculate the expected frequencies under the null hypothesis. Then, calculate the value of the test statistic and determine the number of degrees of freedom. The expected frequencies are: \(100\cdot 0.32= 32, 100\cdot 0.24= 24, 100\cdot 0.20= 20, 100\cdot 0.24= 24\), respectively, so the test statistic value is
\[\chi^2=\frac{(30-32)^2}{32}+\frac{(25-24)^2}{24}+\frac{(18-20)^2}{20}+\frac{(27-24)^2}{24}=0.74\]
Step 3: Calculate the critical value and the \(p\)-value using the chi-square distribution table. The critical region for the chi-square test always lies in the right tail of the chi-square distribution and looks like \((c, \infty)\) with the critical value \(c=\chi_{\alpha, k-1}^2\). Here \(k=4\), so the number of degrees of freedom is 3. Using the chi-square table or the R code \(qchisq(1-\alpha, df)\) with \(\alpha=0.05\) and \(df=3\), \(c=7.81\). Since the test statistic value 0.74 does not fall in the rejection region, we fail to reject the null hypothesis.
Equivalently, the \(p\)-value is 0.86 obtained by the R code \(1-pchisq(0.74, 3)\) or estimated to be larger than 0.05 by the chi-square table.
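Step 4: Draw a conclusion. The data do not provide sufficient evidence that any of the claimed proportions is wrongly specified; the observed failure counts are consistent with the claim.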
R code for all with the \(p\)-value approach:
x=c(30, 25, 18, 27)
chisq.test(x, p = c(0.32, 0.24, 0.20, 0.24))
Here is a helpful video: https://www.youtube.com/watch?v=LE3AIyY_cn8
Example.
Refer to the data in the table below:
Test, at the 0.05 significance level, whether there is any gender gap in the choice of college majors.
Solution.
Step 1: Specify the null and alternative hypotheses. \[H_0: \text{there is no gender gap in the choice of college majors (gender and major are independent)}\] \[H_a: \text{there is a gender gap in the choice of college majors (gender and major are NOT independent)}\]
Step 2: Calculate the expected frequencies under the null hypothesis. Then, calculate the value of the test statistic and determine the number of degrees of freedom. The expected frequencies are 5.65, 8.35, 8.47, 12.53, 8.88, 13.12, respectively. The test statistic value is \(\chi^2=\sum\frac{(O-E)^2}{E}\approx 2.21\). The number of degrees of freedom is \((3-1)(2-1)=2\).
Step 3: Calculate the critical value and the \(p\)-value using the chi-square distribution table. The critical value is \(c=\chi_{0.05, 2}^2\) and the rejection region is \((c, \infty)\), where \(c=5.99\) obtained by the chi-square distribution table or R code \(qchisq(0.95,2)\). The \(p\)-value is 0.33.
Step 4: Make a decision & draw a conclusion. We fail to reject the null hypothesis. We conclude that the data do not provide sufficient evidence that there is a gender gap in the choice of college major.
R code for all with the \(p\)-value approach:
M=matrix(c(4,11,8,10,10,14),3)
chisq.test(M)
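The intermediate quantities in Steps 2 and 3 can be pulled out of the object returned by `chisq.test()`. The sketch below reuses the matrix from the code above; `res` and `crit` are illustrative names.

```r
# Observed counts: 3 rows and 2 columns, filled column by column
M <- matrix(c(4, 11, 8, 10, 10, 14), nrow = 3)

res <- chisq.test(M)

res$expected   # expected frequencies under independence
res$statistic  # chi-square test statistic
res$parameter  # degrees of freedom
res$p.value    # p-value

# Critical value at the 0.05 significance level
crit <- qchisq(0.95, df = 2)
unname(res$statistic) > crit  # TRUE would mean "reject H0"
```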
R code
# 1. Confidence interval and test of the difference between means
# 1a. Assuming that both population variances are known
# 2-sample z-test
# Excel
# 1b. Assuming unknown but equal population variances
# 2-sample t-test, df = n1+n2-2
# R function available
x = c(23, 43, 45, 33, 51, 28, 52, 39, 44)
y = c(58, 76, 46, 63, 75, 51)
t.test(x, y, var.equal = TRUE, alternative = "less", conf.level = 0.95)
t.test(x, y, var.equal = TRUE, alternative = "greater", conf.level = 0.95)
t.test(x, y, var.equal = TRUE, alternative = "two.sided", conf.level = 0.95)
# 1c. Assuming unknown population variances
# 2-sample t-test, df = formula
# R function available
x = c(23, 43, 45, 33, 51, 28, 52, 39, 44)
y = c(58, 76, 46, 63, 75, 51)
t.test(x, y, var.equal=FALSE, alternative = "less", conf.level = 0.95)
t.test(x, y, var.equal=FALSE, alternative = "greater", conf.level = 0.95)
t.test(x, y, var.equal=FALSE, alternative = "two.sided", conf.level = 0.95)
# 1d. Paired data
x = c(34, 56, 33, 28, 45, 63, 51) # measurements based on method 1
y = c(35, 57, 30, 26, 40, 59, 48) # measurements based on method 2
t.test(x, y, paired = TRUE, alternative = "less", conf.level = 0.95)
t.test(x, y, paired = TRUE, alternative = "greater", conf.level = 0.95)
t.test(x, y, paired = TRUE, alternative = "two.sided", conf.level = 0.95)
# 2. Confidence interval and test of the difference between proportions
# Sample 1: n = 40, x = 28; Sample 2: n = 50, x = 34
n = c(40, 50)
x = c(28, 34)
prop.test(x, n, alternative = "less", correct = FALSE, conf.level = 0.95)
prop.test(x, n, alternative = "greater", correct = FALSE, conf.level = 0.95)
prop.test(x, n, alternative = "two.sided", correct = FALSE, conf.level = 0.95)
Exercise:
Use t test to see whether there is a difference in mean number of attacks before and after installing firewalls.
Attempts before Firewall: 56, 47, 49, 37, 38, 60, 50, 43, 43, 59, 50, 56, 54, 58
Attempts after Firewall: 53, 21, 32, 49, 45, 38, 44, 33, 32, 43, 53, 46, 36, 48, 39, 35, 37, 36, 39, 45
Is there evidence to support the claim that the two machines produce rods with different mean diameters? Give bounds on the P-value used to make your conclusion. Use only Table V of Appendix A.
Construct a 95% confidence interval for the difference in mean rod diameter. Use only Table V of Appendix A. Round your answer to 3 decimal places.
Interpret this interval.
Use only Table V of Appendix A.
Is there evidence to support the claim that supplier 2 provides gears with higher mean impact strength? Use 𝛼=0.05, and assume that both populations are normally distributed but the variances are not equal. Round your answer to 4 decimal places.
Do the data support the claim that the mean impact strength of gears from supplier 2 is at least 25 foot-pounds higher than that of supplier 1? Find bounds on the P-value making the same assumptions as in part (a). Round your answer to 2 decimal places.
Construct an appropriate 95% confidence interval on the difference in mean impact strength. Use only Table V of Appendix A. Round your answers to 3 decimal places.
Does the confidence interval support the claim that the mean impact strength of gears from supplier 2 is at least 25 foot-pounds higher than that of supplier 1?
Round your answer to 2 decimal places. Do not use commas.
Does the confidence interval constructed in the previous step indicate that one brand is better than the other?
Solution.
R code:
##
## Paired t-test
##
## data: x and y
## t = 1.8983, df = 7, p-value = 0.09945
## alternative hypothesis: true mean difference is not equal to 0
## 99 percent confidence interval:
## -730.4546 2462.4546
## sample estimates:
## mean difference
## 866
Do the data suggest that the two methods provide the same mean value for natural vibration frequency? Find an interval for P-value.
Find a 95% confidence interval on the mean difference between the two methods and use it to answer the question in part (a).
Round your answer to 3 decimal places.
Solution.
R code:
##
## Paired t-test
##
## data: x and y
## t = -2.4481, df = 6, p-value = 0.04992
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -10.962958325 -0.002755961
## sample estimates:
## mean difference
## -5.482857
Solution.
R code:
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 1.5625, df = 1, p-value = 0.2113
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.01131856 0.05131856
## sample estimates:
## prop 1 prop 2
## 0.05 0.03
The prop.test returns a test statistic that equals \(z_0^2\). So, to get \(z_0\), we take the square root of the output statistic value (which is 1.5625 here). \(z_0=\sqrt{1.5625}=1.25\). Why positive? It is because the first sample proportion is larger than the second! Refer to the \(z_0\) formula!
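The sign of \(z_0\) can be recovered in code rather than by inspection. In the sketch below the counts are hypothetical: the exercise data are not shown here, so `x` and `n` were chosen only because they reproduce the sample proportions 0.05 and 0.03 in the output above. The pooled two-sample formula is computed alongside as a cross-check.

```r
# Hypothetical counts (assumption): 15 of 300 and 9 of 300
x <- c(15, 9)
n <- c(300, 300)

res   <- prop.test(x, n, alternative = "two.sided", correct = FALSE)
p_hat <- x / n

# prop.test reports z0^2; attach the sign of p_hat[1] - p_hat[2]
z0 <- sign(p_hat[1] - p_hat[2]) * sqrt(unname(res$statistic))

# Cross-check with the pooled two-sample z formula
p_pool   <- sum(x) / sum(n)
z_direct <- (p_hat[1] - p_hat[2]) /
  sqrt(p_pool * (1 - p_pool) * (1 / n[1] + 1 / n[2]))

c(z0 = z0, z_direct = z_direct)
```

Swapping the two samples flips the sign of `z0` but leaves the two-sided \(p\)-value unchanged.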
Test the hypothesis \(H_0: p_1 = p_2\) versus \(H_1: p_1 \ne p_2\). What is \(z_0\), the value of the test statistic? Round your answer to two decimal places (e.g. 98.76).
Is it reasonable to conclude that there is a difference in the support for increasing the speed limit between the residents of the two counties?
Solution.
R code:
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 0.42543, df = 1, p-value = 0.5142
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## -0.0779364 0.0389364
## sample estimates:
## prop 1 prop 2
## 0.7180 0.7375
The prop.test returns a test statistic that equals \(z_0^2\). So, to get the magnitude of \(z_0\), we take the square root of the output statistic value (which is 0.42543 here): \(\sqrt{0.42543}=0.65\). The sign comes from the \(z_0\) formula: the first sample proportion (0.7180) is smaller than the second (0.7375), so \(z_0=-0.65\).
\(p\)-value = 0.5142.
Solution.
R code:
##
## 2-sample test for equality of proportions without continuity correction
##
## data: x out of n
## X-squared = 4.9416, df = 1, p-value = 0.02622
## alternative hypothesis: two.sided
## 95 percent confidence interval:
## 0.007396526 0.123603474
## sample estimates:
## prop 1 prop 2
## 0.7680 0.7025
The 95% confidence interval for \(p_1-p_2\) is (0.0074, 0.1236).
The best point estimate of a population parameter is the sample counterpart (called the statistic which is a function of observations). The precision of a point estimate is measured by the standard error. The accuracy is measured by the bias. If an estimator has 0 bias, it is said to be unbiased.
The sampling distribution of the sample mean: (a) it always has a mean that equals the population mean and a variance that equals the population variance divided by the sample size; (b) it is normally distributed when the population distribution is normal; (c) it is approximately normally distributed when the sample size is large and the population distribution is not normal.
The sampling distribution of the sample proportion: (a) it always has a mean that equals the population proportion and a variance that equals \(p\cdot(1-p)\) divided by the sample size; (b) it is approximately normally distributed when the sample size is large.
The standard error of the sample mean is \(\frac{\sigma}{\sqrt{n}}\) or \(\frac{s}{\sqrt{n}}\) when \(\sigma\) is unknown.
The standard error of the sample proportion is \(\sqrt{\frac{p(1-p)}{n}}\) or \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\) when \(p\) is unknown.
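As a small illustration of the two standard error formulas, the sketch below plugs in summary values that appear in the earlier examples (\(s=15, n=25\) from the sales example and \(\hat{p}=0.3, n=36\) from the proportion example). The variable names are illustrative.

```r
# Estimated standard error of a sample mean: s / sqrt(n)
s <- 15; n_mean <- 25
se_mean <- s / sqrt(n_mean)

# Estimated standard error of a sample proportion: sqrt(p_hat * (1 - p_hat) / n)
p_hat <- 0.3; n_prop <- 36
se_prop <- sqrt(p_hat * (1 - p_hat) / n_prop)

c(se_mean = se_mean, se_prop = se_prop)
```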
Test and find confidence interval about \(p\), the \(z\) method
Test and find confidence interval about \(\mu\) (\(\sigma\) known), the \(z\) method
Test and find confidence interval about \(\mu\) (\(\sigma\) unknown), the \(t\) method
Test and find confidence interval about \(p_1-p_2\), the \(z\) method
Test and find confidence interval about \(\mu_1-\mu_2\) (\(\sigma_1\) and \(\sigma_2\) known), the \(z\) method
Test and find confidence interval about \(\mu_1-\mu_2\) (\(\sigma_1\) and \(\sigma_2\) unknown), the \(t\) method
Paired data \(t\) test and \(t\) confidence interval
Sample size determination when estimating \(p\)
Sample size determination when estimating \(\mu\) (\(\sigma\) known)
Chi-square test for goodness of fit or independence
Test \(H_0: \mu = 45\) vs \(H_a:\mu<45\), using both the critical value method and the \(p\)-value method.
Test \(H_0:\mu = 45\) vs \(H_a:\mu>45\), using both the critical value method and the \(p\)-value method.
Test \(H_0:\mu = 45\) vs \(H_a:\mu\ne 45\), using both the critical value method and the \(p\)-value method.
Construct a 95% 2-sided confidence interval.
Construct a 95% confidence interval with an upper bound.
Construct a 95% confidence interval with a lower-bound.
Solution
The problems must be done by hand, since we are not given the original sample data.
Test \(H_0:\mu_1 = \mu_2\) vs. \(H_a:\mu_1 < \mu_2\), using both the critical value method and the \(p\)-value method.
Test \(H_0:\mu_1 = \mu_2\) vs. \(H_a:\mu_1 > \mu_2\), using both the critical value method and the \(p\)-value method.
Test \(H_0:\mu_1 = \mu_2\) vs. \(H_a:\mu_1 \ne \mu_2\), using both the critical value method and the \(p\)-value method.
Construct a 95% 2-sided confidence interval.
Solution
The problems must be done by hand, since we are not given the original sample data.
x: 34, 56, 78, 89, 77, 55
y: 44, 65, 70, 80, 75, 50
Test \(H_0:\mu_d = 0\) vs. \(H_a:\mu_d < 0\), using both the critical value method and the \(p\)-value method. \(\alpha = 0.05\).
Test \(H_0:\mu_d = 0\) vs. \(H_a:\mu_d > 0\), using both the critical value method and the \(p\)-value method. \(\alpha = 0.05\).
Test \(H_0:\mu_d = 0\) vs. \(H_a:\mu_d \ne 0\), using both the critical value method and the \(p\)-value method. \(\alpha = 0.05\).
Construct a 95% 2-sided confidence interval.
Solution
The problems can be done by hand or R code, since we are given the original sample data.
R code:
x = c(34, 56, 78, 89, 77, 55)
y = c(44, 65, 70, 80, 75, 50)
alpha = 0.05 # significance level
t.test(x, y, paired = TRUE, alternative = "less") # for (a)
qt(alpha, df = 6-1) # Critical value for (a)
t.test(x, y, paired = TRUE, alternative = "greater") # for (b)
qt(1-alpha, df = 6-1) # Critical value for (b)
t.test(x, y, paired = TRUE, alternative = "two.sided") # for (c); the output also reports the 95% confidence interval
qt(alpha/2, df = 6-1); -qt(alpha/2, df = 6-1) # Critical values for (c)
Test, at significance level \(\alpha=0.05\), whether the population proportion is less than 0.7.
Test, at significance level \(\alpha=0.05\), whether the population proportion is greater than 0.7.
Test, at significance level \(\alpha=0.05\), whether the population proportion is different from 0.7.
Construct a 95% confidence interval for the population proportion.
Solution
The problems can be done by hand or R code, since we are given the original sample data.
R code:
n = 54
x = 35
alpha = 0.05 # significance level
# p = 0.7 sets the hypothesized proportion; without it prop.test tests against 0.5
prop.test(x, n, p = 0.7, alternative = "less", correct = FALSE) # for (a)
qnorm(alpha) # Critical value for (a)
prop.test(x, n, p = 0.7, alternative = "greater", correct = FALSE) # for (b)
qnorm(1-alpha) # Critical value for (b)
prop.test(x, n, p = 0.7, alternative = "two.sided", correct = FALSE) # for (c)
qnorm(alpha/2); -qnorm(alpha/2) # Critical values for (c)
Test, at significance level \(\alpha=0.05\), whether the first population proportion is smaller.
Test, at significance level \(\alpha=0.05\), whether the first population proportion is larger.
Test, at significance level \(\alpha=0.05\), whether the population proportions are different.
Construct a 95% confidence interval for the difference in population proportions (\(p_1-p_2\)).
Solution
The problems can be done by hand or R code, since we are given the original sample data.
R code:
n = c(54, 63)
x = c(35, 41)
alpha = 0.05 # significance level
prop.test(x, n, alternative = "less", correct = FALSE) # for (a)
qnorm(alpha) # Critical value for (a)
prop.test(x, n, alternative = "greater", correct = FALSE) # for (b)
qnorm(1-alpha) # Critical value for (b)
prop.test(x, n, alternative = "two.sided", correct = FALSE) # for (c)
qnorm(alpha/2); -qnorm(alpha/2) # Critical values for (c)
Test, at significance level 0.05, that the type of defect does not differ across shifts.
Solution
The problems can be done by hand or R code, since we are given the original sample data.
R code:
M = matrix(c(15, 26, 33, 21, 31, 17, 45, 34, 49, 13, 5, 20), 3, 4)
chisq.test(M)
Some useful datasets: http://fs2.american.edu/baron/www/Book/
Statistical tables: https://read.wiley.com/books/9781119400363/page/152/section/top-of-page