- Consider a series: \(1, 2, 2, 3\)
- Add 4 to each value: \(5, 6, 6, 7\)
Spring 2018
Consider a list of 5000 temperatures recorded in Celsius (C), approximately normally distributed with mean 27 and standard deviation 3: \(N(27,3)\)
To convert from Celsius to Fahrenheit, we use the following conversion:
\(x_F = \frac{9}{5}x_C+32\)
The unit conversion is a linear transformation of the following form: \(aX+b\)
where \(a=\frac{9}{5}\) and \(b=32\)
\(\text{mean: } \bar x_F = \frac{9}{5}\bar x_C + 32 = \frac{9}{5}(27) + 32 = 80.6\)
\(\text{sd: } \sigma_F=\frac{9}{5}\sigma_C = \frac{9}{5}(3) = 5.4\)
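A quick simulation can check these two results. This is a minimal sketch on simulated data: the seed and the `rnorm` draws are assumptions for illustration, not the recorded temperatures.

```r
# Simulate 5000 Celsius temperatures from N(27, 3); the seed is arbitrary.
set.seed(2018)
temp_c <- rnorm(5000, mean = 27, sd = 3)

# Apply the linear transformation aX + b with a = 9/5 and b = 32.
temp_f <- 9 / 5 * temp_c + 32

# The mean follows a * mean + b; the sd follows a * sd (b drops out).
c(mean_f = mean(temp_f), predicted = 9 / 5 * mean(temp_c) + 32)
c(sd_f = sd(temp_f), predicted = 9 / 5 * sd(temp_c))
```

The sample values should land close to 80.6 and 5.4, differing only by sampling noise.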
Adding shifts the values, multiplying stretches or contracts them.
Adding a constant to every value in a data set shifts the mean by that constant but does not affect the standard deviation. Multiplying every value by a constant \(a\) multiplies the mean by \(a\) and the standard deviation by \(|a|\), so the standard deviation is never negative.
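The series \(1, 2, 2, 3\) makes both rules easy to check. In this sketch, `pop_sd` is a hypothetical helper (not a base R function) and the multiplier \(-2\) is chosen only to show what happens with a negative constant.

```r
x <- c(1, 2, 2, 3)

# Population sd (divide by n), written out because sd() divides by n - 1.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

shifted <- x + 4    # adding a constant
scaled  <- -2 * x   # multiplying by a negative constant

c(mean(x), mean(shifted), mean(scaled))        # 2, 6, -4
c(pop_sd(x), pop_sd(shifted), pop_sd(scaled))  # shift: unchanged; scale: doubled
```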
Consider a normally distributed random variable \(x\) with mean \(\mu\) and sd \(\sigma\): \(x \sim N(\mu, \sigma)\)
Two-step linear transformation of \(x\): first subtract \(\mu\), then divide by \(\sigma\), which gives \(z = \frac{x-\mu}{\sigma}\)
The Z-score of an observation is defined as the number of standard deviations it falls above or below the mean. If the observation is one standard deviation above the mean, its Z-score is 1. If it is 1.5 standard deviations below the mean, then its Z-score is -1.5.
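As a small illustration of the definition, the sketch below uses a hypothetical helper `z_score` and made-up values (mean 100, sd 20) that do not come from the slides.

```r
# Hypothetical helper: how many sds an observation lies from the mean.
z_score <- function(x, mu, sigma) (x - mu) / sigma

z_score(130, mu = 100, sigma = 20)  # 1.5: one and a half sds above the mean
z_score(70, mu = 100, sigma = 20)   # -1.5: one and a half sds below the mean
```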
The normal distribution model describes a symmetric, unimodal, bell-shaped curve. It can be adjusted using two parameters: the mean \((\mu)\) and the standard deviation \((\sigma)\).
\[ \bbox[yellow,5px]
{
\color{black}{{\text {Density at z}} = \frac {1}{\sqrt {2\pi}}\exp\left(-\frac{1}{2}z^2\right), \quad -\infty<z<+\infty}
}
\]
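The density formula can be compared with base R's `dnorm`, which returns the standard normal density by default. The grid of `z` values in this sketch is an arbitrary choice.

```r
z <- c(-2, -1, 0, 1, 2)

# Density written out from the formula above.
by_formula <- 1 / sqrt(2 * pi) * exp(-0.5 * z^2)

# all.equal() compares the two vectors up to floating-point tolerance.
matches <- all.equal(by_formula, dnorm(z))
matches
```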
Probabilities of falling within 1, 2, and 3 standard deviations of the mean in a normal distribution: approximately 68%, 95%, and 99.7%.
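These probabilities can be reproduced from the standard normal CDF, `pnorm`. A minimal sketch:

```r
k <- 1:3

# P(-k <= Z <= k) for k = 1, 2, 3 standard deviations.
within_k <- pnorm(k) - pnorm(-k)
round(within_k, 4)  # 0.6827 0.9545 0.9973
```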
Cumulative SAT scores are approximated by a normal model with \(\mu = 1500 \text { and } \sigma = 300\).
What is the probability that a randomly selected SAT taker scores at least 1630 on the SAT?
\(z = \frac{x-\mu}{\sigma}=\frac{1630-1500}{300}=\frac{130}{300}=0.43\)
\(P(z\ge0.43)=0.3336\)
The probability that a randomly selected SAT taker scores at least 1630 is about 33%.
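The same upper-tail probability can be computed with `pnorm` instead of a table. The variable names in this sketch are illustrative.

```r
mu <- 1500
sigma <- 300
z <- (1630 - mu) / sigma

# Table-style lookup with z rounded to two decimals, as in the worked answer.
p_table <- 1 - pnorm(round(z, 2))

# Same probability without rounding z.
p_exact <- pnorm(1630, mean = mu, sd = sigma, lower.tail = FALSE)

round(c(p_table, p_exact), 4)
```

The unrounded value is slightly smaller (about 0.332) because the worked answer rounds \(z\) to 0.43 before the lookup.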
Edward earned a 1400 on his SAT. What is his percentile?
\(z = \frac{x-\mu}{\sigma}=\frac{1400-1500}{300}=\frac{-100}{300}=-0.33\)
\(P(z\le-0.33)=0.3707\)
Edward is at the 37th percentile.
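A percentile is a lower-tail probability, so `pnorm` gives it directly. A brief sketch, with illustrative variable names:

```r
# Lower-tail probability using the rounded z-score from the worked answer.
p_table <- pnorm(-0.33)

# Same calculation on the raw score, without rounding z.
p_exact <- pnorm(1400, mean = 1500, sd = 300)

round(c(p_table, p_exact), 4)
```

Both values round to the 37th percentile.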
Carlos believes he can get into his preferred college if he scores at least in the 80th percentile on the SAT. What score should he aim for?
At the 80th percentile, \(z = 0.84\)
\[ \begin{align} z & = \frac{x-\mu}{\sigma} \\ 0.84 & = \frac{x-1500}{300} \\ 0.84 \times 300 + 1500 & = x \\ x & = 1752 \end{align} \]
The 80th percentile on the SAT corresponds to a score of 1752.
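Going from a percentile back to a score uses the inverse CDF, `qnorm`. A minimal sketch of the calculation above:

```r
# z-score that cuts off the lowest 80% of a standard normal distribution.
z80 <- qnorm(0.80)
round(z80, 2)  # 0.84

# Undo the standardization: x = mu + z * sigma.
score_table <- 1500 + 300 * round(z80, 2)  # uses the rounded z, as above
score_exact <- 1500 + 300 * z80            # uses the unrounded z
c(score_table, score_exact)
```

Using the unrounded \(z\) moves the answer by less than one point.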
The distribution is approximately normal if
(1) the normal curve closely fits the histogram; or
(2) on the QQ plot of the standardized data, the points fall close to the \(45^\circ\) line
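One way to run the QQ check is sketched below on simulated data. The sample, its size, and the seed are assumptions for illustration; the data are standardized first so that the \(45^\circ\) line \(y = x\) is the right reference.

```r
# Simulated sample (illustrative only); the seed is arbitrary.
set.seed(1)
x <- rnorm(200, mean = 27, sd = 3)

# Standardize so sample and theoretical quantiles share the same scale.
z <- (x - mean(x)) / sd(x)

# qqnorm() draws the plot and returns the plotted quantiles invisibly.
qq <- qqnorm(z)
abline(0, 1)  # the 45-degree reference line

# Rough numeric summary: correlation between the two sets of quantiles.
cor(qq$x, qq$y)
```

A correlation near 1 agrees with points hugging the line, but the plot itself remains the main diagnostic.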
The correlation coefficient \(r\) measures linear association, i.e. how tightly the points are clustered about a straight line.
\(x: 1, 2, 3, 4, 5 \qquad y: 2, 3, 1, 6, 6\)
z_x # Step 1a: calculate z-scores of x (use population sd)
[1] -1.4142136 -0.7071068 0.0000000 0.7071068 1.4142136
z_y # Step 1b: calculate z-scores of y (use population sd)
[1] -0.7770287 -0.2913858 -1.2626716 1.1655430 1.1655430
z_x * z_y # Step 2: multiply corresponding pairs of z-scores
[1] 1.0988845 0.2060408 0.0000000 0.8241634 1.6483268
r # Step 3: calculate the average of the product (z_x * z_y)
[1] 0.7554831
If the data are \((x_i, y_i)\), \(1\le i\le n\), then
\[\bbox[yellow,5px]
{
\color{black}{r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right)}
}
\]
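The three steps can be reproduced in full as follows. This is a sketch rather than the original code behind the output: `pop_sd` is a hypothetical helper for the population sd, since base R's `sd()` divides by \(n-1\).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)

# Population sd (divide by n), as the formula requires.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))

z_x <- (x - mean(x)) / pop_sd(x)  # Step 1a: z-scores of x
z_y <- (y - mean(y)) / pop_sd(y)  # Step 1b: z-scores of y
r <- mean(z_x * z_y)              # Steps 2 and 3: average of the products
r                                 # 0.7554831

cor(x, y)  # base R's built-in correlation returns the same value
```

`cor()` uses \(n-1\) in both the covariance and the standard deviations, so the factors cancel and the two results agree.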
What does \(r\) not tell you?
Association is not causation.
If two variables have a non-zero correlation, then they are related to each other in some way, but that does not mean that one causes the other.
Two variables may appear to be strongly associated while \(r\) is close to \(0\). This happens when the relationship is clearly nonlinear, because \(r\) measures only linear association. Don't use it if the scatter diagram is nonlinear.
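A made-up example shows how this can happen: below, \(y\) is completely determined by \(x\), yet the relationship is a symmetric parabola rather than a line. The data are illustrative and not taken from the slides.

```r
x <- -3:3
y <- x^2   # perfect, but nonlinear, dependence on x

# The positive and negative halves cancel, so r comes out as zero.
r <- cor(x, y)
r
```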