Introduction to Statistics

Spring 2018

Sampling Distribution of the Difference between Two Proportions

Consider the difference between two population proportions: \(p_1 - p_2\).

A reasonable point estimate of \(p_1 - p_2\), based on samples drawn from the two populations, is \(\hat p_1 - \hat p_2\).

The mean or expected value of \(\hat p_1 - \hat p_2\) is \(p_1 - p_2\).

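The standard deviation: \(SD_{\hat p_1 - \hat p_2} = \sqrt{SD^2_{\hat p_1} + SD^2_{\hat p_2}} = \sqrt{\frac{p_1(1- p_1)}{n_1}+\frac{p_2(1-p_2)}{n_2}}\). Because \(p_1\) and \(p_2\) are unknown, it is estimated by the standard error \(SE_{\hat p_1 - \hat p_2} = \sqrt{\frac{\hat p_1(1- \hat p_1)}{n_1}+\frac{\hat p_2(1-\hat p_2)}{n_2}}\).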

\(n_1\) and \(n_2\) represent the sample sizes.
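The claims about the mean and standard deviation of \(\hat p_1 - \hat p_2\) can be checked with a small simulation. The sketch below is only an illustration: the population proportions, sample sizes, number of repetitions, and the function name `simulate_diff_of_proportions` are all hypothetical choices, not values from the lecture.

```python
import math
import random
import statistics

def simulate_diff_of_proportions(p1, n1, p2, n2, reps, seed=0):
    """Draw `reps` pairs of independent samples; return each p1_hat - p2_hat."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # Count successes in each sample (one Bernoulli trial per observation).
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    return diffs

# Hypothetical population values, chosen only for illustration.
p1, n1 = 0.60, 200
p2, n2 = 0.50, 150

diffs = simulate_diff_of_proportions(p1, n1, p2, n2, reps=4000)

# Theoretical SD, computed from the population proportions.
theoretical_sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

print("simulated mean:", round(statistics.mean(diffs), 4),
      "| theory:", round(p1 - p2, 4))
print("simulated SD:  ", round(statistics.stdev(diffs), 4),
      "| theory:", round(theoretical_sd, 4))
```

With a few thousand repetitions, the simulated mean and SD should land close to the theoretical values, though not exactly on them.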

Two-Proportion \(z\)-interval

If the samples are independent and the sample sizes are large, then the sampling distribution of \(\hat p_1 - \hat p_2\) is approximately normal, and the confidence interval for \(p_1 - p_2\) is

\((\hat p_1 - \hat p_2) \pm z^* \times \sqrt{\frac{\hat p_1(1- \hat p_1)}{n_1}+\frac{\hat p_2(1- \hat p_2)}{n_2}}\)

\(z^*\) is the critical value that corresponds to the confidence level \(C\).

Question:

How much difference is there in the proportion of male drivers who wear seat belts when sitting next to a man and the proportion when sitting next to a woman?
With female passengers: \(2777\) wore seat belts, \(1431\) did not.
With male passengers: \(1363\) wore seat belts, \(1400\) did not.

Two-Proportion \(z\)-interval

Solution:

\[ \begin{align} n_F &= 4208, n_M = 2763 \\ \hat p_F &= \frac{2777}{4208} = 0.660, \hat p_M = \frac{1363}{2763} = 0.493 \\ \\ SE_{\hat {p_F} - \hat p_M} &= \sqrt{\frac{\hat p_F(1- \hat p_F)}{n_F}+\frac{\hat p_M(1-\hat p_M)}{n_M}} \\ &= \sqrt{\frac{(0.660)(1-0.660)}{4208}+\frac{0.493(1-0.493)}{2763}} \\ &= 0.012 \\ \\ ME &= z^* \times SE(\hat p_F - \hat p_M) = 1.96 \times 0.012 = 0.024 \\ \hat p_F - \hat p_M &= 0.660 - 0.493 = 0.167 \\ \\ 95\% CI &: 0.167 \pm 0.024 = [0.143, 0.191] \end{align} \]
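The seat belt calculation can be reproduced in a few lines. This is a minimal sketch using only Python's standard library; the function name `two_proportion_z_interval` is a hypothetical helper, and \(z^*\) is taken from `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist

def two_proportion_z_interval(x1, n1, x2, n2, confidence=0.95):
    """Confidence interval for p1 - p2 from success counts and sample sizes."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Unpooled standard error of the difference.
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    # Critical value z* for the requested confidence level.
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p1_hat - p2_hat
    margin = z_star * se
    return diff - margin, diff + margin

# Seat belt data from the question: sample size = wore + did not wear.
lower, upper = two_proportion_z_interval(2777, 2777 + 1431, 1363, 1363 + 1400)
print(f"95% CI for p_F - p_M: [{lower:.3f}, {upper:.3f}]")
```

Because the code keeps full precision throughout, its endpoints may differ in the third decimal place from the hand calculation, which rounds the proportions, the standard error, and the margin of error along the way.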

Two Sample \(z\)-test

The Sleep in America Poll found that \(205\) of \(293\) of Gen-Y and \(235\) of \(469\) of Gen-X use the Internet before sleep. Is this difference real?

Null Model:

mean: \(p_Y - p_X = 0\)

\[ \begin{align} \hat p_{pooled} &= \frac{x_Y + x_X}{n_Y + n_X} = \frac{205+235}{293+469} = 0.5774 \\ \\ SE_{pooled}(\hat p_Y - \hat p_X) &= \sqrt{\frac{\hat p_{pooled}(1- \hat p_{pooled})}{n_Y}+\frac{\hat p_{pooled}(1-\hat p_{pooled})}{n_X}} \\ &= \sqrt{\frac{0.5774 \times (1- 0.5774)}{293}+\frac{0.5774 \times (1-0.5774)}{469}} \\ &= 0.0368 \end{align} \]

Two Sample \(z\)-test

\[ \begin{align} \text{Two-tailed test} \\ H_0: p_Y - p_X &= 0 \\ H_A: p_Y - p_X &\ne 0 \\ \\ \hat p_Y - \hat p_X &= 0.700 - 0.501 = 0.199 \\ \\ z &= \frac{0.199 - 0}{0.0368} = 5.41 \\ \\ \text{p-value} &= 2 \times P(z>5.41) < 0.0001 < 0.05 \end{align} \]

Hence, reject \(H_0\). The proportions of Gen-Y and Gen-X who use the Internet before sleep differ significantly.

Two Sample \(z\)-test

\(62\) of \(325\) girls and \(75\) of \(268\) boys have online profiles. Is there a real difference between all boys and girls?

Null Model:

mean: \(p_B - p_G = 0\)

\[ \begin{align} \hat p_{pooled} &= \frac{x_B + x_G}{n_B + n_G} = \frac{75+62}{268+325} = 0.231 \\ \\ SE_{pooled}(\hat p_B - \hat p_G) &= \sqrt{\frac{\hat p_{pooled}(1- \hat p_{pooled})}{n_B}+\frac{\hat p_{pooled}(1-\hat p_{pooled})}{n_G}} \\ &= \sqrt{\frac{0.231 \times (1- 0.231)}{268}+\frac{0.231 \times (1-0.231)}{325}} \\ &= 0.0348 \\ \\ \end{align} \]

Two Sample \(z\)-test

\[ \begin{align} \text{Two-tailed test} \\ H_0: p_B - p_G &= 0 \\ H_A: p_B - p_G &\ne 0 \\ \\ \hat p_B - \hat p_G &= 0.28 - 0.19 = 0.09 \\ \\ z &= \frac{0.09 - 0}{0.0348} = 2.59 \\ \\ \text{p-value} &= 2 \times P(z>2.59) = 0.0096 \le 0.05 \end{align} \]

Reject \(H_0\). There is strong evidence of a statistically significant difference between the proportions of boys and girls who have online profiles.
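Both \(z\)-tests above follow the same steps: pool the proportions, compute the pooled standard error, and compare the \(z\)-score to the standard normal distribution. A minimal sketch of those steps, assuming a two-sided alternative (the function name `two_proportion_z_test` is hypothetical):

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Return (z, two-sided p-value) for H0: p1 - p2 = 0, using the pooled SE."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Under H0 both samples share one proportion, so pool the counts.
    p_pooled = (x1 + x2) / (n1 + n2)
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    # Two-sided p-value from the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z_gen, p_gen = two_proportion_z_test(205, 293, 235, 469)    # Gen-Y vs Gen-X
z_prof, p_prof = two_proportion_z_test(75, 268, 62, 325)    # boys vs girls

print(f"Gen-Y vs Gen-X: z = {z_gen:.2f}, p-value = {p_gen:.2g}")
print(f"Boys vs girls:  z = {z_prof:.2f}, p-value = {p_prof:.4f}")
```

The code uses unrounded sample proportions, so its \(z\)-scores and p-values differ slightly from the hand calculations, which round the difference before dividing. The conclusions at the \(0.05\) level are the same.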

Sampling Distribution of the Difference between Two Means

Consider the difference between two population means: \(\mu_1 - \mu_2\).

A reasonable point estimate of \(\mu_1 - \mu_2\), based on samples drawn from the two populations, is \(\bar y_1 - \bar y_2\).

The mean or expected value of \(\bar y_1 - \bar y_2\) is \(\mu_1 - \mu_2\).

The standard deviation: \(SD_{\bar y_1 - \bar y_2} = \sqrt{SD^2_{\bar y_1} + SD^2_{\bar y_2}} = \sqrt{\frac{\sigma^2_1}{n_1}+\frac{\sigma^2_2}{n_2}}\)

(\(n_1\) and \(n_2\) represent the sample sizes.)

The sampling distribution of \(\bar y_1 - \bar y_2\) is nearly normal when each sample mean is nearly normal and all observations are independent. Recall that each sample mean will be nearly normal if the population is normal or if the sample size is at least 30.

When the sample size is small or the population SD is unknown, each standardized sample mean is modeled with a \(t\)-distribution with \(df = n-1\), and the standard deviation is estimated by the standard error \(SE_{\bar y_1 - \bar y_2} = \sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\)

Two-Sample \(t\)-interval

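If the samples are independent and the conditions for the \(t\)-distribution hold for each sample, then the sampling distribution of \(\bar y_1 - \bar y_2\) can be modeled with a \(t\)-distribution, and the confidence interval for the difference in population means \(\mu_1 - \mu_2\) is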

\[(\bar y_1 - \bar y_2) \pm t^*_{df} \times \sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}\]

\(t^*_{df}\) is the critical value that corresponds to the confidence level \(C\) and \(df = \frac{({s^2_1}/{n_1}+{s^2_2}/{n_2})^2}{(s^2_1/n_1)^2/(n_1-1)+(s^2_2/n_2)^2/(n_2-1)}\).

Two-Sample \(t\)-interval

Find the \(95\%\) confidence interval for the difference in population means, using the sample summaries below.

\[ \begin{array}{c|c|c} & \text{Sample 1} & \text{Sample 2} \\ \hline n & 27 & 27 \\ \bar y & 8.5 & 14.7 \\ s & 6.1 & 8.4 \end{array} \]

\[ \begin{align} & \bar y_2 - \bar y_1 = 14.7 - 8.5 = 6.2 \\ & t^*_{47.46} = 2.011 \text { at CL} = 95\% \\ \\ & SE = \sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}} = \sqrt{\frac{6.1^2}{27}+\frac{8.4^2}{27}} = 2 \\ \\ & ME = 2.011 \times 2 = 4.02 \\ & CI: 6.2 \pm 4.02 = [2.18, 10.22] \end{align} \]
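The standard error and the degrees of freedom in this example can be checked numerically. The sketch below uses only the standard library; since that library has no \(t\)-distribution, the critical value \(2.011\) is copied from the worked solution rather than computed. The function name `welch_se_and_df` is a hypothetical helper, and the difference is taken as Sample 2 minus Sample 1 so that it is positive, as in the solution.

```python
import math

def welch_se_and_df(s1, n1, s2, n2):
    """Unpooled standard error and approximate df for a two-sample t procedure."""
    v1 = s1**2 / n1   # estimated variance of the first sample mean
    v2 = s2**2 / n2   # estimated variance of the second sample mean
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return se, df

# Summary statistics from the table above.
se, df = welch_se_and_df(6.1, 27, 8.4, 27)

t_star = 2.011        # critical value quoted in the worked solution (assumed, not computed)
diff = 14.7 - 8.5     # Sample 2 mean minus Sample 1 mean
ci = (diff - t_star * se, diff + t_star * se)

print(f"SE = {se:.3f}, df = {df:.2f}")
print(f"95% CI: [{ci[0]:.2f}, {ci[1]:.2f}]")
```

If SciPy is available, `scipy.stats.t.ppf(0.975, df)` could supply the critical value instead of a table lookup.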

Two-Sample \(t\)-test: Testing for the Difference between the Two Means

Is there a statistically significant difference between two sample means?

Generally, use the unpooled \(t\)-test, because the equal-variance assumption is often violated in small samples.

\[ \begin{array}{c|c|c} & \text{Sample 1} & \text{Sample 2} \\ \hline n & 8 & 7 \\ \bar y & 281.88 & 211.43 \\ s & 18.31 & 46.43 \end{array} \]

Two-Sample \(t\)-test: Testing for the Difference between the Two Means

\[ \begin{align} \text{Two-tailed t test} \\ H_0: \mu_1 - \mu_2 &= 0 \\ H_A: \mu_1 - \mu_2 &\ne 0 \\ \\ \bar y_1 - \bar y_2 &= 281.88 - 211.43 = 70.45 \\ \\ SE(\bar y_1 - \bar y_2) &= \sqrt{\frac{18.31^2}{8}+\frac{46.43^2}{7}} = 18.70 \\ t_{7.62} &= \frac{70.45 - 0}{18.70} = 3.77 \\ \\ \text{p-value} &= 2 \times P(t>3.77) = 0.006 \le 0.05 \end{align} \] Hence, we reject \(H_0\).
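The test statistic, degrees of freedom, and p-value of this unpooled test can be approximated without a statistics package. The sketch below is one possible approach, not the method used in the lecture: it integrates the \(t\) density numerically with Simpson's rule to get the tail probability. All function names are hypothetical helpers.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with `df` degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_t_p_value(t, df, steps=2000):
    """Approximate 2 * P(T > |t|) with Simpson's rule (steps must be even)."""
    b = abs(t)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = total * h / 3      # P(0 < T < |t|)
    return 1 - 2 * central_area       # the density is symmetric about 0

def welch_t_test(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df, two-sided p-value) for H0: mu1 - mu2 = 0, unpooled."""
    v1 = s1**2 / n1
    v2 = s2**2 / n2
    se = math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    t = (mean1 - mean2) / se
    return t, df, two_sided_t_p_value(t, df)

# Summary statistics from the table above.
t_stat, df, p_value = welch_t_test(281.88, 18.31, 8, 211.43, 46.43, 7)
print(f"t = {t_stat:.2f}, df = {df:.2f}, p-value = {p_value:.4f}")
```

With SciPy installed, `2 * scipy.stats.t.sf(abs(t_stat), df)` would give the same tail probability directly.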

Pooled \(t\)-test

To use the pooled \(t\)-test, we must make the Equal Variance Assumption: the variances of the two populations from which the samples have been drawn are equal. That is, \(\sigma_1^2 = \sigma_2^2\).

\[ \begin{align} &s^2_{pooled} = \frac{(n_1-1)s^2_1+(n_2-1)s^2_2}{(n_1-1)+(n_2-1)} \\ &SE_{pooled}(\bar y_1-\bar y_2) = \sqrt{(s^2_{pooled}/n_1)+(s^2_{pooled}/n_2)} \end{align} \]

\[ \begin{align} &H_0: \mu_1 - \mu_2 = 0 \\ &H_A: \mu_1 - \mu_2 \ne 0 \\ \\ &\text {pooled t-score, } t = \frac{(\bar y_1 - \bar y_2 )-0}{SE_{pooled}(\bar y_1 - \bar y_2)} \\ &df = (n_1 + n_2 - 2) \\ \\ &\text {Confidence Interval: } (\bar y_1- \bar y_2) \pm t_{df}^* \times SE_{pooled}(\bar y_1 - \bar y_2) \end{align} \]
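The pooled formulas translate directly into code. In the sketch below, the function name `pooled_t_summary` is a hypothetical helper, and the summary statistics are borrowed from the \(95\%\) interval example (means \(14.7\) and \(8.5\), standard deviations \(8.4\) and \(6.1\), \(n = 27\) each) purely as an illustration. Whether the Equal Variance Assumption is reasonable for those samples is not addressed here.

```python
import math

def pooled_t_summary(mean1, s1, n1, mean2, s2, n2):
    """Pooled variance, pooled SE, t-score and df for H0: mu1 - mu2 = 0."""
    df = (n1 - 1) + (n2 - 1)
    # Weighted average of the two sample variances.
    var_pooled = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    se_pooled = math.sqrt(var_pooled / n1 + var_pooled / n2)
    t = ((mean1 - mean2) - 0) / se_pooled
    return var_pooled, se_pooled, t, df

# Illustrative input only: summary statistics reused from the interval example.
var_pooled, se_pooled, t, df = pooled_t_summary(14.7, 8.4, 27, 8.5, 6.1, 27)
print(f"pooled variance = {var_pooled:.3f}, SE = {se_pooled:.3f}, "
      f"t = {t:.2f}, df = {df}")
```

When the two sample sizes are equal, as they are here, the pooled and unpooled standard errors coincide. Only the degrees of freedom differ.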

Paired Data

\[ \begin{array}{c|c|c|c} & \text{Inner Time} & \text{Outer Time} & \text{Diff} \\ \hline 1 & 125.75 & 122.34 & 3.41 \\ 2 & 121.63 & 122.12 & -0.49 \\ 3 & 122.24 & 123.35 & -1.11 \\ 4 & 120.85 & 120.45 & 0.40 \\ \vdots & \vdots & \vdots & \vdots \\ 17 & 122.15 & 122.75 & -0.60 \end{array} \]

Hypothesis Testing for Paired Data

The paired t-test

\[ \begin{align} & H_0:\mu_d = 0 \\ & H_A: \mu_d \ne 0 \\ \\ & n = 17 \text{ pairs }, \bar d = 0.499, s_d = 2.333 \\ \\ & SE(\bar d) = s_d/\sqrt n = 2.333/\sqrt{17} = 0.5658 \\ \\ & t_{16} = \frac{\bar d-0}{SE(\bar d)} = \frac{0.499}{0.5658} = 0.882 \\ \\ & \text{p-value} = 2P(t_{16}>0.882) = 0.39 > 0.05 \end{align} \]

Hence, \(H_0\) cannot be rejected.

Paired \(t\)-Interval

The \(95\%\) confidence interval for the mean paired difference is

\[ \begin{align} & \bar d \pm t_{n-1}^* \times \frac{s_d}{\sqrt n} \\ \\ & = 0.499 \pm 2.12 \times 0.5658 \\ \\ & = [-0.7005, 1.6985] \end{align} \]
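The paired test statistic and interval depend only on the summary of the differences, so they are easy to verify. A minimal sketch, assuming the summary values \(\bar d = 0.499\), \(s_d = 2.333\), \(n = 17\) and the critical value \(2.12\) quoted above (the function name `paired_t_summary` is hypothetical):

```python
import math

def paired_t_summary(d_bar, s_d, n, t_star):
    """SE of the mean difference, t-score for H0: mu_d = 0, and the t-interval."""
    se = s_d / math.sqrt(n)
    t = (d_bar - 0) / se
    margin = t_star * se
    return se, t, (d_bar - margin, d_bar + margin)

# t_star = 2.12 is the df = 16 critical value quoted in the interval above.
se, t, ci = paired_t_summary(0.499, 2.333, 17, t_star=2.12)
print(f"SE = {se:.4f}, t = {t:.3f}")
print(f"95% CI: [{ci[0]:.4f}, {ci[1]:.4f}]")
```

The interval contains \(0\), which agrees with the test's failure to reject \(H_0\) at the \(0.05\) level.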

Next Week

Final Exam