Introduction to Statistics

Spring 2018

2012 Cherry Blossom 10 mile run in DC

Dataset (first 20 observations)

   place   time   pace age gender state divPlace divTot
1   4494  92.25  9.225  38      M    MD      690   1093
2   6298 106.35 10.635  33      M    DC     1322   1490
3   2502  89.33  8.933  55      F    VA       37    236
4   8176 113.50 11.350  24      F    VA      878    974
5   3413  86.52  8.652  54      M    CA      213    483
6   8008 112.30 11.230  42      F    MD      785    974
7   8791 118.45 11.845  36      F    VA     1215   1367
8   3987  95.17  9.517  25      F    VA     1230   2782
9   3451  93.25  9.325  25      F    PA     1074   2782
10  1046  72.37  7.237  43      M    MD      111    931
11  3484  86.90  8.690  55      M    VA      138    375
12  2987  84.47  8.447  30      M    MD      659   1490
13  4427  96.65  9.665  39      F    CA      587   1367
14  6496 108.93 10.893  40      M    VA      846    931
15  5827 101.58 10.158  30      F    DC     1297   2228
16  1224  82.78  8.278  24      F    VA      164    974
17  4942  98.32  9.832  45      F    MD      255    554
18  9579 134.18 13.418  33      F    VA     2189   2228
19  6425 107.98 10.798  41      M    VA      831    931
20  5951 102.08 10.208  36      F    VA      809   1367

[1] 16924

Distribution of Time

Histogram of time for a randomly selected sample of size \(100\).

Sampling Distribution

A histogram of \(1000\) sample means, where the samples are of size \(n = 100\). This histogram approximates the true sampling distribution of the sample mean, with mean \(\mu_{\bar x}\) and standard deviation \(\sigma_{\bar x}\).

Sampling Distribution

The sampling distribution represents the distribution of a point estimate (e.g. the sample mean \(\bar x\)) computed from samples of a fixed size drawn from a given population. It is useful to think of any single point estimate as one draw from this distribution. Understanding the concept of a sampling distribution is central to understanding statistical inference.
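The idea can be sketched with a short simulation in Python. The race data are not reproduced here, so the population below is an assumed stand-in: a normal distribution with mean 94.52 and standard deviation 8.97 minutes (the figures quoted in the run-time problems in these notes). The function name `simulate_sample_means` and the seed are illustrative choices.

```python
import random
import statistics

def simulate_sample_means(mu, sigma, n, reps, seed=0):
    """Draw `reps` samples of size `n` and return the list of sample means."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means

# Assumed population parameters in minutes; not computed from the race data.
MU, SIGMA = 94.52, 8.97

means = simulate_sample_means(MU, SIGMA, n=100, reps=1000)
print("mean of sample means:", round(statistics.fmean(means), 2))
print("SD of sample means:  ", round(statistics.stdev(means), 3))
print("sigma / sqrt(n):     ", round(SIGMA / 100 ** 0.5, 3))
```

The first printed value should land close to the population mean, and the second close to \(\sigma/\sqrt n\); a histogram of `means` would approximate the sampling distribution.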

\(E(\bar X)\) of the Sampling Distribution of \(\bar X\)

Let \(X_1, X_2,...,X_n\) be \(n\) independently drawn observations from a population distribution with mean \(\mu\) and variance \(\sigma^2\).

Let \(\bar X\) be the mean of these \(n\) independent observations:

\[ \begin{align} \bar X &= \frac{X_1 + X_2 + \dots + X_n}{n} \\ \\ E(\bar X) &= E\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) \\ &= \frac{1}{n} E(X_1 + X_2 + \dots + X_n) \\ &= \frac{1}{n} [E(X_1) + E(X_2) + \dots + E(X_n)] \\ &= \frac{1}{n} [\mu + \mu + \dots + \mu] \\ &= \frac{1}{n} [n\mu] \\ &= \mu \\ \mu_{\bar x} &= \mu \end{align} \]

\(Var(\bar X)\) of the Sampling Distribution of \(\bar X\)

Because the observations are independent, the variance of their sum is the sum of their variances:

\[ \begin{align} \bar X &= \frac{X_1 + X_2 + \dots + X_n}{n} \\ \\ Var(\bar X) &= Var\left(\frac{X_1 + X_2 + \dots + X_n}{n}\right) \\ &= \frac{1}{n^2} Var(X_1 + X_2 + \dots + X_n) \\ &= \frac{1}{n^2} [Var(X_1) + Var(X_2) + \dots + Var(X_n)] \\ &= \frac{1}{n^2} [\sigma^2 + \sigma^2 + \dots + \sigma^2] \\ &= \frac{1}{n^2} [n\sigma^2] \\ &= \frac{\sigma^2}{n} \\ \sigma^2_{\bar x} &= \frac{\sigma^2}{n} \\ SD_{\bar x} = \sigma_{\bar x} &= \frac{\sigma}{\sqrt n} \end{align} \]

Sampling Distributions of a Sample Proportion

The mean and standard deviation of the sample proportion describe the center and spread of the distribution of all possible sample proportions \(\hat p\) from random samples of size \(n\) drawn from a population with true proportion \(p\).

\[ \begin{align} \mu_{\hat p} &= p \\ \\ \sigma_{\hat p} &= \sqrt{\frac{p(1-p)}{n}} \end{align} \]
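These two formulas can be checked by simulation. The sketch below is in Python; the values \(p = 0.20\) and \(n = 400\), the number of repetitions, and the function name are illustrative assumptions.

```python
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=0):
    """Return `reps` sample proportions, each from a sample of size `n`."""
    rng = random.Random(seed)
    props = []
    for _ in range(reps):
        # Each observation is a success with probability p.
        successes = sum(rng.random() < p for _ in range(n))
        props.append(successes / n)
    return props

P, N = 0.20, 400  # illustrative values
props = simulate_sample_proportions(P, N, reps=2000)
theory_sd = (P * (1 - P) / N) ** 0.5

print("mean of p-hat:", round(statistics.fmean(props), 4), "vs p =", P)
print("SD of p-hat:  ", round(statistics.stdev(props), 4), "vs formula", round(theory_sd, 4))
```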

Central Limit Theorem

When taking a random sample of independent observations from a population with a fixed mean and standard deviation, the distribution of \(\bar x\) approaches the normal distribution as \(n\) increases.
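A small Python experiment can make this concrete. The population here is an assumption: an exponential distribution with rate 1, chosen only because it is strongly right-skewed. The sketch measures the skewness of the simulated sample means for several sample sizes; values near zero indicate a symmetric, more normal-looking shape.

```python
import random
import statistics

def skewness(values):
    """Sample skewness: the average cubed standardized deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

def sample_mean_skewness(n, reps=5000, seed=0):
    """Skewness of `reps` sample means, each from n exponential draws."""
    rng = random.Random(seed)
    means = []
    for _ in range(reps):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return skewness(means)

results = {n: sample_mean_skewness(n) for n in (1, 5, 30, 100)}
for n, skew in results.items():
    print(f"n = {n:3d}  skewness of sample means = {skew:.2f}")
```

The skewness should shrink toward zero as \(n\) grows, which is the behaviour the theorem describes.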

Normal Approximation for the Sampling Distribution

Three important facts about the distribution of a sample mean \(\bar x\)

The mean of a sample mean is denoted by \(\mu_{\bar x}\), and it is equal to \(\mu\).
The SD of a sample mean is denoted by \(\sigma_{\bar x}\), and it is equal to \({\frac{\sigma}{\sqrt n}}\).
When the population is normal or when \(n \ge 30,\) the sample mean closely follows a normal distribution.

Three important facts about the distribution of a sample proportion \(\hat p\)

The mean of a sample proportion is denoted by \(\mu_{\hat p}\), and it is equal to \(p\).
The SD of a sample proportion is \(\sqrt{\frac{p(1-p)}{n}}\).
When \(np \ge 10\) and \(n(1-p) \ge 10,\) the sample proportion closely follows a normal distribution.
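The two rules of thumb above can be written as simple checks. This is a minimal Python sketch; the function names are hypothetical, and the thresholds are the ones stated in these notes.

```python
def normal_ok_for_mean(n, population_is_normal=False):
    """Rule of thumb for the sample mean: normal population or n >= 30."""
    return population_is_normal or n >= 30

def normal_ok_for_proportion(n, p):
    """Rule of thumb for the sample proportion: np >= 10 and n(1 - p) >= 10."""
    return n * p >= 10 and n * (1 - p) >= 10

print(normal_ok_for_mean(20, population_is_normal=True))  # small n, normal population
print(normal_ok_for_mean(20))                             # small n, unknown shape
print(normal_ok_for_proportion(400, 0.20))                # np = 80, n(1 - p) = 320
print(normal_ok_for_proportion(50, 0.10))                 # np = 5, too small
```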

Sampling Distribution of Mean

Problem:

In the 2012 Cherry Blossom 10 mile run, the average time for all of the runners is \(94.52\) minutes with a standard deviation of \(8.97\) minutes. The distribution of run times is approximately normal. Find the probability that a randomly selected runner completes the run in less than \(90\) minutes.

Solution:

Because the distribution of run times is approximately normal, we can use the normal model. A single runner is a sample of size \(n = 1\), so \(\sigma_{\bar x} = \sigma\).

\[ \begin{align} Z &= \frac{\bar x-\mu_{\bar x}}{\sigma_{\bar x}} \\ &= \frac{90-94.52}{8.97/\sqrt 1} \\ &= -0.504 \\ \\ P(Z < -0.504) &= 0.3072 \end{align} \]

There is a \(30.72\%\) probability that a randomly selected runner will complete the run in less than \(90\) minutes.

Sampling Distribution of Mean

Problem:

Find the probability that the average time of \(20\) randomly selected runners is less than \(90\) minutes.

Solution:

Here, \(n = 20 < 30\), but the distribution of the population, that is, the distribution of run times is stated to be approximately normal. Because of this, the sampling distribution will be normal for any sample size.

\[ \begin{align} \sigma_{\bar x} &= \frac{\sigma}{\sqrt n} = \frac{8.97}{\sqrt {20}} = 2.01 \\ Z &= \frac{\bar x-\mu_{\bar x}}{\sigma_{\bar x}} = \frac{90-94.52}{2.01} = -2.25 \\ P(Z<-2.25) &= 0.0123 \end{align} \]

There is a \(1.23\%\) probability that the average run time of \(20\) randomly selected runners will be less than \(90\) minutes.
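Both run-time problems can be reproduced with Python's standard library, which provides the normal CDF through `statistics.NormalDist`. The helper name `prob_sample_mean_below` is a hypothetical choice. Because the code does not round \(\sigma_{\bar x}\) or \(Z\) along the way, its results may differ from the hand calculations in the last decimal place.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_mean_below(x, mu, sigma, n=1):
    """P(sample mean < x) when the sample mean is normal with SD sigma / sqrt(n)."""
    se = sigma / sqrt(n)
    return NormalDist(mu, se).cdf(x)

MU, SIGMA = 94.52, 8.97  # population mean and SD in minutes, from the problem

p_single = prob_sample_mean_below(90, MU, SIGMA, n=1)   # one runner
p_mean20 = prob_sample_mean_below(90, MU, SIGMA, n=20)  # average of 20 runners

print("one runner under 90 min:      ", round(p_single, 4))
print("mean of 20 runners under 90:  ", round(p_mean20, 4))
```

Averaging over 20 runners shrinks the standard deviation by a factor of \(\sqrt{20}\), which is why the second probability is so much smaller than the first.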

Sampling Distribution of Proportion

Problem:

Find the probability that fewer than \(15\%\) of a sample of \(400\) people are smokers if the true proportion is \(20\%.\)

Solution:

The mean of the sample proportion is the population proportion: \(\mu_{\hat p} = 0.20.\)

The standard deviation of \(\hat p\) is given by the formula for the standard deviation of a sample proportion:

\[\sigma_{\hat p}=\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.2(0.8)}{400}} = 0.02\]

\[ \begin{align} Z &= \frac{\hat p - \mu_{\hat p}}{\sigma_{\hat p}} = \frac{0.15 - 0.20}{0.02} = -2.5 \\ \\ P(Z<-2.5) &= 0.0062 \end{align} \]
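The same calculation, sketched in Python with `statistics.NormalDist`. The function name `prob_sample_proportion_below` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_sample_proportion_below(p_hat, p, n):
    """P(sample proportion < p_hat) under the normal approximation."""
    se = sqrt(p * (1 - p) / n)  # standard deviation of the sample proportion
    return NormalDist(p, se).cdf(p_hat)

prob = prob_sample_proportion_below(0.15, p=0.20, n=400)
print(round(prob, 4))
```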

Sampling Distribution of Proportion

Enough lefty seats?

Problem:

\(13\%\) of the US population is left-handed. If an auditorium has \(15\) lefty seats, what is the probability that there will not be enough lefty seats for a class of \(90\) students (in other words, what is the probability that there will be more than \(15\) lefty students in the group)?

Solution:

\[ \begin{align} \mu_{\hat p} &= 0.13 \\ \hat{p} &= 15/90 = 0.167 \\ \sigma_{\hat p} &= \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.13(0.87)}{90}} = 0.035 \\ \\ Z &= \frac{\hat p - \mu_{\hat p}}{\sigma_{\hat p}} = \frac{0.167 - 0.13}{0.035} = 1.06 \\ P(\hat{p}>0.167) &= P(Z>1.06) = 0.1446 \end{align} \]
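As a cross-check, the Python sketch below computes the normal approximation without intermediate rounding and compares it with an exact binomial calculation. The binomial comparison is an addition for illustration, not part of the solution above, and both function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist

def prob_more_than_normal(k, n, p):
    """Normal approximation to P(X > k), using the sample proportion k / n."""
    se = sqrt(p * (1 - p) / n)
    return 1 - NormalDist(p, se).cdf(k / n)

def prob_more_than_exact(k, n, p):
    """Exact binomial P(X > k) = 1 - P(X <= k)."""
    at_most_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1 - at_most_k

approx = prob_more_than_normal(15, n=90, p=0.13)
exact = prob_more_than_exact(15, n=90, p=0.13)
print("normal approximation:", round(approx, 4))
print("exact binomial:      ", round(exact, 4))
```

The approximate value differs a little from the hand calculation because the rounded figures \(0.167\) and \(0.035\) are not used. Here \(np = 11.7\) is only just above the threshold of \(10\), so some gap between the approximation and the exact value is to be expected.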

Confidence Intervals

A point estimate provides a single plausible value for a parameter. However, a point estimate is rarely perfect; usually there is some error in the estimate. In addition to supplying a point estimate of a parameter, a next logical step would be to provide a plausible range of values for the parameter.

Constructing a \(95\%\) confidence interval

When the sampling distribution of a point estimate can reasonably be modeled as normal, the point estimate we observe will be within \(1.96\) standard errors of the true value of interest about \(95\%\) of the time. Thus, a \(95\%\) confidence interval for such a point estimate can be constructed:

\(\text {point estimate} \pm 1.96 \times SE\)

We can be \(95\%\) confident this interval captures the true value.

Simulating Confidence Intervals

Fifty samples of size \(n = 300\) were simulated when \(p = 0.30\). For each sample, a confidence interval was created to capture the true proportion \(p\). How many did not capture \(p = 0.30?\)
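A simulation of this kind can be sketched in Python. The confidence level is not stated for the exercise above, so \(95\%\) (\(z^\star = 1.96\)) is assumed here; the function name and seed are also illustrative, and the count of misses will vary with the seed.

```python
import random
from math import sqrt

def count_misses(p, n, num_intervals, z_star=1.96, seed=0):
    """Simulate confidence intervals and count how many fail to contain p."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(num_intervals):
        successes = sum(rng.random() < p for _ in range(n))
        p_hat = successes / n
        se = sqrt(p_hat * (1 - p_hat) / n)  # SE estimated from the sample
        lower, upper = p_hat - z_star * se, p_hat + z_star * se
        if not (lower <= p <= upper):
            misses += 1
    return misses

misses = count_misses(p=0.30, n=300, num_intervals=50)
print("intervals that missed p:", misses, "out of 50")
```

With a \(95\%\) level, about \(5\%\) of intervals are expected to miss in the long run, so two or three misses out of fifty would be typical.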

Generalizing Confidence Interval

If the point estimate follows the normal model with standard error \(SE\), then a confidence interval for the population parameter is

\[ \text {point estimate} \pm z^\star \times SE \]

where \(z^\star\) depends on the confidence level selected.
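The value of \(z^\star\) for a given confidence level can be obtained from the inverse normal CDF. A minimal Python sketch, with the hypothetical helper name `z_star`:

```python
from statistics import NormalDist

def z_star(confidence_level):
    """Critical value leaving (1 - level) / 2 in each tail of the standard normal."""
    tail = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail)

for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z* = {z_star(level):.3f}")
```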

Calculating Confidence Intervals

Heart patients who receive stents are estimated to be \(9\%\) more likely to suffer a stroke than those who do not receive them. The standard error \((SE)\) of this estimate is \(0.028\). Construct a \(95\%\) confidence interval for the change in stroke rates from the use of stents.

\[ \begin{align} \text {95% Confidence Interval} &= \text {point estimate} \pm 1.96 \times SE \\ &= 0.090 \pm 1.96 \times 0.028 \\ &=(0.035, 0.145) \end{align} \]

\[ \begin{align} \text {90% Confidence Interval} &= \text {point estimate} \pm 1.645 \times SE \\ &= 0.090 \pm 1.645 \times 0.028 \\ &=(0.044, 0.136) \end{align} \]
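Both intervals can be reproduced in Python. The function name `confidence_interval` is hypothetical, and \(z^\star\) is computed from the normal model instead of being read from a table.

```python
from statistics import NormalDist

def confidence_interval(point_estimate, se, level=0.95):
    """Return (lower, upper) for point_estimate +/- z* x SE."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * se
    return point_estimate - margin, point_estimate + margin

ci95 = confidence_interval(0.090, 0.028, level=0.95)
ci90 = confidence_interval(0.090, 0.028, level=0.90)
print("95%:", tuple(round(v, 3) for v in ci95))
print("90%:", tuple(round(v, 3) for v in ci90))
```

Lowering the confidence level from \(95\%\) to \(90\%\) uses a smaller \(z^\star\), so the interval becomes narrower.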

Margin of Error (ME)

The margin of error (ME) is the distance between the point estimate and the lower or upper bound of a confidence interval.

\[ \begin{align} \text{confidence interval} &= \text {point estimate} \pm z^\star \times SE \\ &= \text {point estimate} \pm \text{margin of error} \end{align} \]

Calculation of Sample Size

A pilot study showed that \(0.5\%\) of credit card offers in the mail end up with the person signing up. To be within \(0.1\%\) of the true rate with \(95\%\) confidence, how big does the test mailing have to be?

\[ \begin{align} ME &= z^\star \times SE \\ ME &= z^\star \times \sqrt \frac{\hat p \hat q}{n} \\ 0.001 &= 1.96 \times \sqrt \frac{(0.005)(0.995)}{n} \\ (0.001)^2 &= (1.96)^2 \times \frac{(0.005)(0.995)}{n} \\ n &= (1.96)^2 \times \frac{(0.005)(0.995)}{(0.001)^2} \\ n &= 19112 \end{align} \]
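The rearranged formula can be sketched as a small Python function. The name `required_sample_size` is hypothetical; the result is rounded up because a sample size must be a whole number at least as large as the computed value.

```python
from math import ceil
from statistics import NormalDist

def required_sample_size(p, margin, level=0.95):
    """Smallest n for which z* x sqrt(p(1 - p) / n) is at most `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil(z**2 * p * (1 - p) / margin**2)

n_needed = required_sample_size(p=0.005, margin=0.001, level=0.95)
print(n_needed)
```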

Next Week

Chapter 17: Testing Hypothesis About Proportions
Chapter 18: Inference About Means