Introduction to Statistics

The Standard Normal Curve

\[ \bbox[yellow,5px] { \color{black}{{\text {Density at z}} = \frac {1}{\sqrt {2\pi}}\exp\left(-\frac{1}{2}z^2\right), \quad -\infty<z<+\infty} } \]
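As a rough illustration, the density formula can be evaluated directly. The sketch below is only an example: the function name `standard_normal_density` is made up for this illustration, and the result is cross-checked against `statistics.NormalDist` from the Python standard library.

```python
import math
from statistics import NormalDist

def standard_normal_density(z: float) -> float:
    """Density of the standard normal curve at z: exp(-z^2 / 2) / sqrt(2*pi)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

# The curve is highest at z = 0 and symmetric about it.
for z in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"z = {z:+.1f}  density = {standard_normal_density(z):.4f}")

# Cross-check the hand-written formula against the standard library.
assert math.isclose(standard_normal_density(1.0), NormalDist().pdf(1.0))
```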

Normal Probability Examples

z-score to percentile

Cumulative SAT scores are approximated by a normal model with \(\mu = 1500 \text { and } \sigma = 300\).

What is the probability that a randomly selected SAT taker scores at least 1630 on the SAT?

\(z = \frac{x-\mu}{\sigma}=\frac{1630-1500}{300}=\frac{130}{300}=0.43\)

\(P(z\ge0.43)=0.3336\)

The probability that a randomly selected SAT taker scores at least 1630 is about 33%.
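The table lookup above can be reproduced in software. This is a minimal sketch using `statistics.NormalDist` (Python 3.8+); the variable names are chosen for this illustration only. It also shows how much the answer moves when the z-score is not rounded to two decimals first.

```python
from statistics import NormalDist

mu, sigma = 1500, 300        # SAT model from the example
x = 1630

z_exact = (x - mu) / sigma   # 0.4333...
z_table = round(z_exact, 2)  # 0.43, as used with a printed z-table

std_normal = NormalDist()    # mean 0, standard deviation 1
p_table = 1 - std_normal.cdf(z_table)   # upper-tail area beyond z = 0.43
p_exact = 1 - std_normal.cdf(z_exact)   # same area without rounding z

print(f"P(Z >= {z_table}) = {p_table:.4f}")
print(f"P(Z >= {z_exact:.4f}) = {p_exact:.4f}")
```

Either way the answer rounds to about 33%; the difference comes only from rounding the z-score.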

Normal Probability Examples

z-score to percentile

Edward earned a 1400 on his SAT. What is his percentile?

\(z = \frac{x-\mu}{\sigma}=\frac{1400-1500}{300}=\frac{-100}{300}=-0.33\)

\(P(z\le-0.33)=0.3707\)

Edward is at the 37th percentile.
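A percentile can also be computed without standardizing by hand, by working with the SAT model directly. The helper `percentile` below is a hypothetical name used only for this sketch. Because it does not round the z-score to \(-0.33\), its result is slightly lower than the table value, but it still rounds to the 37th percentile.

```python
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=300)   # SAT model from the example

def percentile(score: float) -> float:
    """Percent of test takers scoring at or below `score` under the model."""
    return 100 * sat.cdf(score)

print(f"Edward's percentile: {percentile(1400):.1f}")
```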

Normal Probability Examples

percentile to z-score

Carlos believes he can get into his preferred college if he scores at least in the 80th percentile on the SAT. What score should he aim for?

At the 80th percentile, \(z = 0.84\).

\[ \begin{align} z & = \frac{x-\mu}{\sigma} \\ 0.84 & = \frac{x-1500}{300} \\ 0.84 \times 300 + 1500 & = x \\ x & = 1752 \end{align} \]

The 80th percentile on the SAT corresponds to a score of 1752.
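Going from a percentile back to a score uses the inverse of the cumulative distribution. The sketch below assumes `NormalDist.inv_cdf` from the Python standard library in place of a printed table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma = 1500, 300                  # SAT model from the example

z_80 = NormalDist().inv_cdf(0.80)      # z-score with 80% of the area to its left
score_exact = mu + z_80 * sigma        # un-standardize: x = mu + z * sigma
score_table = mu + 0.84 * sigma        # same step with the table value z = 0.84

print(f"z at the 80th percentile: {z_80:.4f}")
print(f"score with unrounded z:   {score_exact:.1f}")
print(f"score with z = 0.84:      {score_table:.1f}")
```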

More Exercises

The U.S. Air Force requires that pilots have heights between 64 in. and 77 in. Heights of women are normally distributed with a mean of \(63.7\) in. and a standard deviation of \(2.9\) in. What percentage of women meet that height requirement?
Suppose that, to recruit more women pilots, the Air Force relaxes the height requirement to admit the middle \(95\%\) of women based on the height distribution \(X \sim N(63.7, 2.9)\). What would be the heights of the shortest and tallest women meeting the new requirement?

2012 Cherry Blossom 10 mile run in DC

Dataset (first 20 observations)

   place   time   pace age gender state divPlace divTot
1   4494  92.25  9.225  38      M    MD      690   1093
2   6298 106.35 10.635  33      M    DC     1322   1490
3   2502  89.33  8.933  55      F    VA       37    236
4   8176 113.50 11.350  24      F    VA      878    974
5   3413  86.52  8.652  54      M    CA      213    483
6   8008 112.30 11.230  42      F    MD      785    974
7   8791 118.45 11.845  36      F    VA     1215   1367
8   3987  95.17  9.517  25      F    VA     1230   2782
9   3451  93.25  9.325  25      F    PA     1074   2782
10  1046  72.37  7.237  43      M    MD      111    931
11  3484  86.90  8.690  55      M    VA      138    375
12  2987  84.47  8.447  30      M    MD      659   1490
13  4427  96.65  9.665  39      F    CA      587   1367
14  6496 108.93 10.893  40      M    VA      846    931
15  5827 101.58 10.158  30      F    DC     1297   2228
16  1224  82.78  8.278  24      F    VA      164    974
17  4942  98.32  9.832  45      F    MD      255    554
18  9579 134.18 13.418  33      F    VA     2189   2228
19  6425 107.98 10.798  41      M    VA      831    931
20  5951 102.08 10.208  36      F    VA      809   1367

[1] 16924

Distribution of Time

Histogram of time for a randomly selected sample of size \(100\).

Sampling Distribution

A histogram of \(1000\) sample means, where the samples are of size \(n = 100\). This histogram approximates the true sampling distribution of the sample mean, with mean \(\mu_{\bar x}\) and standard deviation \(\sigma_{\bar x}\).

Sampling Distribution of a Statistic

The sampling distribution of a statistic is the distribution of all values of the statistic (e.g., the sample mean or the sample proportion) when all possible samples of the same size \(n\) are drawn from the same population.

Understanding the concept of a sampling distribution is central to understanding statistical inference.

Parameter and Statistics

A statistic is a value computed from our observed sample data.

A parameter is a value that describes the population.

\[ \begin{array} {l|c|c} \text{Name} & \text{Statistic} & \text{Parameter} \\ \hline \text {Mean} & \bar y & \mu \\ \text {Std. Deviation} & s & \sigma \\ \text {Correlation} & r & \rho \\ \text {Regression Coefficient} & b & \beta \\ \text {Proportion} & \hat p & p \end{array} \]

\(E(\bar X)\) of the Sampling Distribution of \(\bar X\)

Let \(X_1, X_2,...,X_n\) be \(n\) independently drawn observations from a population distribution with mean \(\mu\) and variance \(\sigma^2\).

Let \(\bar X\) be the mean of these \(n\) independent observations:

\[ \begin{align} \bar X &= \frac{X_1 + X_2 + \cdots + X_n}{n} \\ \\ E(\bar X) &= E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) \\ &= \frac {1}{n}E(X_1 + X_2 + \cdots + X_n) \\ &= \frac {1}{n}[E(X_1) + E(X_2) + \cdots + E(X_n)] \\ &= \frac {1}{n}[\mu + \mu + \cdots + \mu] \\ &= \frac {1}{n}[n\mu] \\ &= \mu \\ \mu_{\bar x} &= \mu \end{align} \]

\(Var\) of the Sampling Distribution of \(\bar X\)

\[ \begin{align} \bar X &= \frac{X_1 + X_2 + \cdots + X_n}{n} \\ \\ Var(\bar X) &= Var\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) \\ &= \frac {1}{n^2}Var(X_1 + X_2 + \cdots + X_n) \\ &= \frac {1}{n^2}[Var(X_1) + Var(X_2) + \cdots + Var(X_n)] \quad \text{(because the observations are independent)} \\ &= \frac {1}{n^2}[\sigma^2 + \sigma^2 + \cdots + \sigma^2] \\ &= \frac {1}{n^2}[n\sigma^2] \\ &= \frac{\sigma^2}{n} \\ \sigma^2_{\bar x} &= \frac{\sigma^2}{n} \\ SD_{\bar x} = \sigma_{\bar x} &= {\frac{\sigma}{\sqrt n}} \end{align} \]
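The two results \(\mu_{\bar x} = \mu\) and \(\sigma_{\bar x} = \sigma/\sqrt n\) can be checked by simulation. The sketch below is not based on the race dataset itself: it assumes a normal population with the run-time mean and standard deviation quoted in the later examples (\(94.52\) and \(8.97\) minutes), and the function name `simulate_sample_means` and the seed are arbitrary choices for this illustration.

```python
import random
import statistics

def simulate_sample_means(population_mean, population_sd, n, reps, seed=1):
    """Draw `reps` samples of size n from a normal population; return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(population_mean, population_sd) for _ in range(n))
        for _ in range(reps)
    ]

mu, sigma, n = 94.52, 8.97, 100          # assumed population values, sample size
means = simulate_sample_means(mu, sigma, n, reps=1000)

print(f"mean of sample means: {statistics.fmean(means):.2f}  (theory: {mu})")
print(f"SD of sample means:   {statistics.stdev(means):.3f}  (theory: {sigma / n ** 0.5:.3f})")
```

With \(1000\) simulated samples the two empirical values should land close to the theoretical ones, though not exactly on them.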

Sampling Distributions of a Sample Proportion

The mean and standard deviation of the sample proportion describe the center and spread of the distribution of all possible sample proportions \(\hat p\) from random samples of size \(n\) drawn from a population with true proportion \(p\).

\[ \begin{align} \mu_{\hat p} &= p \\ \\ \sigma_{\hat p} &= \sqrt{\frac{p(1-p)}{n}} \end{align} \]
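A small simulation can make these two formulas concrete. In this sketch each observation is modeled as an independent yes/no draw with success probability \(p\); the values \(p = 0.20\) and \(n = 400\) are borrowed from the smoker example later in the deck, and `simulate_sample_proportions` is a name invented for the illustration.

```python
import random
import statistics

def simulate_sample_proportions(p, n, reps, seed=1):
    """Return `reps` sample proportions, each from n independent yes/no draws."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) / n for _ in range(reps)]

p, n = 0.20, 400
p_hats = simulate_sample_proportions(p, n, reps=2000)
theory_sd = (p * (1 - p) / n) ** 0.5     # sqrt(p(1-p)/n)

print(f"mean of p-hat: {statistics.fmean(p_hats):.4f}  (theory: {p})")
print(f"SD of p-hat:   {statistics.stdev(p_hats):.4f}  (theory: {theory_sd:.4f})")
```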

Central Limit Theorem

When taking a random sample of independent observations from a population with a fixed mean and standard deviation, the distribution of \(\bar x\) approaches the normal distribution as \(n\) increases.
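One way to watch this happen is to start from a clearly non-normal population. The sketch below assumes an exponential population with mean \(1\), which is right-skewed and is not taken from the deck's data. For a symmetric (normal) sampling distribution, half of all sample means would fall below the population mean, so the printed share should drift toward \(0.5\) as \(n\) grows. The function name and seed are illustrative.

```python
import random
import statistics

def share_below_population_mean(n, reps=2000, seed=1):
    """Fraction of sample means (samples of size n) below the population mean."""
    rng = random.Random(seed)
    population_mean = 1.0   # an exponential population with rate 1 has mean 1
    below = 0
    for _ in range(reps):
        x_bar = statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        below += x_bar < population_mean
    return below / reps

for n in (1, 5, 30, 200):
    print(f"n = {n:>3}: share of sample means below mu = "
          f"{share_below_population_mean(n):.3f}")
```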

Central Limit Theorem

Normal Approximation for the Sampling Distribution

Three important facts about the distribution of a sample mean \(\bar x\)

The mean of a sample mean is denoted by \(\mu_{\bar x}\), and it is equal to \(\mu\).
The SD of a sample mean is denoted by \(\sigma_{\bar x}\), and it is equal to \({\frac{\sigma}{\sqrt n}}\).
When the population is normal or when \(n \ge 30,\) the sample mean closely follows a normal distribution.

Three important facts about the distribution of a sample proportion \(\hat p\)

The mean of a sample proportion is denoted by \(\mu_{\hat p}\), and it is equal to \(p\).
The SD of a sample proportion is denoted by \(\sigma_{\hat p}\), and it is equal to \(\sqrt{\frac{p(1-p)}{n}}\).
When \(np \ge 10\) and \(n(1-p) \ge 10,\) the sample proportion closely follows a normal distribution.
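The two rules of thumb above are easy to encode as checks. The function names below are hypothetical, and the thresholds (\(n \ge 30\), and at least \(10\) expected successes and failures) are simply the ones stated in the lists above.

```python
def mean_is_approx_normal(n: int, population_is_normal: bool = False) -> bool:
    """Rule of thumb for x-bar: normal population, or sample size of at least 30."""
    return population_is_normal or n >= 30

def proportion_is_approx_normal(n: int, p: float) -> bool:
    """Rule of thumb for p-hat: at least 10 expected successes and 10 failures."""
    return n * p >= 10 and n * (1 - p) >= 10

print(mean_is_approx_normal(20, population_is_normal=True))   # small n, normal population
print(mean_is_approx_normal(20))                              # small n, unknown population
print(proportion_is_approx_normal(400, 0.20))                 # 80 and 320 expected
print(proportion_is_approx_normal(50, 0.10))                  # only 5 expected successes
```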

Sampling Distribution of Mean

Problem:

In the 2012 Cherry Blossom 10 mile run, the average time for all of the runners is \(94.52\) minutes with a standard deviation of \(8.97\) minutes. The distribution of run times is approximately normal. Find the probability that a randomly selected runner completes the run in less than \(90\) minutes.

Solution:

Because the distribution of run times is approximately normal, we can use the normal approximation.

\[ \begin{align} Z &= \frac{\bar x-\mu_{\bar x}}{\sigma_{\bar x}} \\ &= \frac{90-94.52}{8.97/\sqrt 1} \\ &= -0.504 \\ \\ P(Z < -0.504) &= 0.3072 \end{align} \]

There is a \(30.72\%\) probability that a randomly selected runner will complete the run in less than \(90\) minutes.

Sampling Distribution of Mean

Problem:

Find the probability that the average time of 20 randomly selected runners is less than 90 minutes.

Solution:

Here, \(n = 20 < 30\), but the population distribution (the distribution of run times) is stated to be approximately normal. Because of this, the sampling distribution of \(\bar x\) will be approximately normal for any sample size.

\[ \begin{align} \sigma_{\bar x} &= \frac{\sigma}{\sqrt n} = \frac{8.97}{\sqrt {20}} = 2.01 \\ Z &= \frac{\bar x-\mu_{\bar x}}{\sigma_{\bar x}} = \frac{90-94.52}{2.01}= -2.25 \\ P(Z<-2.25) &= 0.0122 \end{align} \]

There is a \(1.22\%\) probability that the average run time of 20 randomly selected runners will be less than 90 minutes.
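Both run-time problems follow the same recipe and differ only in \(n\), which a short helper can make explicit. This is a sketch assuming a normal population, as the problem states; `prob_mean_below` is a name invented here. Because it does not round the standard error or the z-score, its results may differ from the table-based answers in the last digit or two.

```python
from math import sqrt
from statistics import NormalDist

def prob_mean_below(x_bar, mu, sigma, n=1):
    """P(mean of n observations < x_bar) when the population is normal."""
    standard_error = sigma / sqrt(n)          # sigma_xbar = sigma / sqrt(n)
    return NormalDist(mu, standard_error).cdf(x_bar)

mu, sigma = 94.52, 8.97                        # run-time mean and SD in minutes
p_one = prob_mean_below(90, mu, sigma, n=1)    # one randomly selected runner
p_twenty = prob_mean_below(90, mu, sigma, n=20)  # average of 20 runners

print(f"single runner under 90 min:    {p_one:.4f}")
print(f"mean of 20 runners under 90:   {p_twenty:.4f}")
```

The much smaller second probability reflects the smaller spread of \(\bar x\): averaging 20 runners shrinks the standard deviation from \(8.97\) to about \(2.01\) minutes.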

Sampling Distribution of Proportion

Problem:

Find the probability that less than \(15\%\) of the sample of \(400\) people will be smokers if the true proportion is \(20\%.\)

Solution:

The mean of the sample proportion is the population proportion: \(\mu_{\hat p} = 0.20.\)

The standard deviation of \(\hat p\) is given by the formula for the standard deviation of a sample proportion:

\[\sigma_{\hat p}=\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.2(0.8)}{400}} = 0.02\]

\[ \begin{align} Z &= \frac{\hat p - \mu_{\hat p}}{\sigma_{\hat p}} = \frac{0.15 - 0.20}{0.02} = -2.5 \\ \\ P(Z<-2.5) &= 0.0062 \end{align} \]

There is a \(0.62\%\) probability that less than \(15\%\) of the \(400\) people sampled will be smokers.

Sampling Distribution of Proportion

enough lefty seats?

Problem:

\(13\%\) of the US population is left-handed. If an auditorium has \(15\) lefty seats, what is the probability that there will not be enough lefty seats for a class of \(90\) students (in other words, what is the probability that there will be more than \(15\) lefty students in the group)?

Solution:

\[ \begin{align} \mu_{\hat p} &= 0.13 \\ \hat{p} &= 15/90 = 0.167 \\ \sigma_{\hat p}&=\sqrt{\frac{p(1- p)}{n}} = \sqrt{\frac{0.13(0.87)}{90}} = 0.035 \\ \\ Z &= \frac{\hat p - \mu_{\hat p}}{\sigma_{\hat p}} = \frac{0.167 - 0.13}{0.035} = 1.06 \\ P(\hat{p}>0.167) &= P(Z>1.06) = 0.1446 \end{align} \]
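The two proportion problems can be handled by one helper that applies the normal approximation to \(\hat p\). This is a sketch with an invented function name, `prob_phat_below`. Note that it carries full precision through the calculation, so for the lefty-seat problem it gives roughly \(0.15\) rather than \(0.1446\); the gap comes from rounding \(\hat p\) and \(\sigma_{\hat p}\) to three decimals in the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def prob_phat_below(p_hat, p, n):
    """Normal-approximation P(sample proportion < p_hat) for true proportion p."""
    sd = sqrt(p * (1 - p) / n)                 # sigma_phat = sqrt(p(1-p)/n)
    return NormalDist(p, sd).cdf(p_hat)

# Smokers: fewer than 15% in a sample of 400 when the true proportion is 20%.
smokers = prob_phat_below(0.15, p=0.20, n=400)

# Lefty seats: more than 15 left-handed students out of 90 (upper tail).
lefties = 1 - prob_phat_below(15 / 90, p=0.13, n=90)

print(f"P(p-hat < 0.15)  = {smokers:.4f}")
print(f"P(p-hat > 15/90) = {lefties:.4f}")
```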

Evaluating the Normal Approximation

The distribution is approximately normal if
(1) the normal curve closely fits the histogram of the data; or
(2) on the QQ plot, the data points fall close to the \(45^\circ\) line.

Next

Chapter 7: Estimating Parameters and Determining Sample Sizes