Spring 2018

Linear Transformations of Data

Adding a constant

  1. Consider a series: \(1, 2, 2, 3\)
  2. Add 4: \(5, 6, 6, 7\)

Linear Transformations of Data

Multiplying by a constant

  1. Consider a series: \(1, 2, 2, 3\)
  2. Multiply by 5: \(5, 10, 10, 15\)

Linear Transformations of Data

Celsius to Fahrenheit

Consider a list of 5000 temperatures recorded in Celsius (C), distributed as \(N(27,3)\), i.e. with mean 27 and standard deviation 3.

To convert from Celsius to Fahrenheit, we use the following conversion:

\(x_F = \frac{9}{5}x_C+32\)

The unit conversion is a linear transformation of the following form: \(aX+b\)
where \(a=\frac{9}{5}\) and \(b=32\)

\(\text{mean: } \bar x_F = \frac{9}{5}\bar x_C + 32 = \frac{9}{5}(27) + 32 = 80.6\)

\(\text{sd: } \sigma_F=\frac{9}{5}\sigma_C = \frac{9}{5}(3) = 5.4\)
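The mean and sd calculations above can be checked numerically. The following is a minimal sketch using Python's standard `statistics` module; the simulated readings, the seed, and the variable names are illustrative assumptions, not part of the original data.

```python
from statistics import NormalDist, mean, pstdev

# Simulate 5000 Celsius readings from N(27, 3); the seed is an arbitrary choice.
celsius = NormalDist(mu=27, sigma=3).samples(5000, seed=1)

# Apply the linear transformation x_F = (9/5) * x_C + 32 to every reading.
a, b = 9 / 5, 32
fahrenheit = [a * c + b for c in celsius]

mean_c, sd_c = mean(celsius), pstdev(celsius)
mean_f, sd_f = mean(fahrenheit), pstdev(fahrenheit)

print(f"Celsius:    mean = {mean_c:.2f}, sd = {sd_c:.2f}")
print(f"Fahrenheit: mean = {mean_f:.2f}, sd = {sd_f:.2f}")
# The transformed summaries should match a * mean + b and a * sd.
print(f"a * mean_c + b = {a * mean_c + b:.2f}, a * sd_c = {a * sd_c:.2f}")
```

Because the sample is random, the Celsius mean and sd land near 27 and 3 rather than exactly on them, but the Fahrenheit summaries agree with the transformed Celsius summaries up to floating-point rounding.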

Adding shifts the values, multiplying stretches or contracts them.

Adding a constant \(b\) to every value in a data set shifts the mean by \(b\) but does not affect the standard deviation. Multiplying every value by a constant \(a\) multiplies the mean by \(a\) and the standard deviation by \(|a|\), so the standard deviation never becomes negative.
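As a quick check of this rule, here is a small sketch applied to the series \(1, 2, 2, 3\) from the earlier slides. The helper name `summarize` and the extra case of multiplying by \(-5\) are illustrative additions.

```python
from statistics import mean, pstdev

series = [1, 2, 2, 3]

def summarize(values):
    """Return (mean, population sd) of a list of numbers."""
    return mean(values), pstdev(values)

shifted = [x + 4 for x in series]    # add a constant
scaled = [x * 5 for x in series]     # multiply by a positive constant
flipped = [x * -5 for x in series]   # multiply by a negative constant

for label, values in [("original", series), ("add 4", shifted),
                      ("times 5", scaled), ("times -5", flipped)]:
    m, s = summarize(values)
    print(f"{label:>9}: mean = {m:.4f}, sd = {s:.4f}")
```

The shifted series keeps the original sd, while both scaled series have an sd five times larger; only the mean changes sign when the multiplier is negative.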

Linear Transformation of Normal Curve

Standardizing with Z-Scores

Consider a normally distributed random variable \(x\) with mean \(\mu\) and sd \(\sigma\): \(x \sim N(\mu, \sigma)\)

Two-step linear transformation of \(x\)

  1. subtract \(\mu\) from \(x\)
  2. divide \((x-\mu)\) by \(\sigma\)

\[\bbox[yellow,5px]{\color{black}{\text{standard normal deviate: } z = \frac {x-\mu}{\sigma}}}\]

The Z-score of an observation is defined as the number of standard deviations it falls above or below the mean. If the observation is one standard deviation above the mean, its Z-score is 1. If it is 1.5 standard deviations below the mean, then its Z-score is -1.5.
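The two-step transformation can be written as a one-line function. This is a minimal sketch; the function name `z_score` and the example values (mean 100, sd 10) are hypothetical and chosen only to reproduce the two cases described above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) mu."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical distribution with mean 100 and sd 10.
print(z_score(110, 100, 10))  # one sd above the mean   -> 1.0
print(z_score(85, 100, 10))   # 1.5 sd below the mean   -> -1.5
```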

Linear Transformation

Normal Curve to Standard Normal Curve

The normal distribution model describes a symmetric, unimodal, bell-shaped curve. It can be adjusted using two parameters: mean \((\mu)\) and standard deviation \((\sigma)\).

The Standard Normal Curve

\[ \bbox[yellow,5px] { \color{black}{{\text {Density at z}} = \frac {1}{\sqrt {2\pi}}\exp\left(-\frac{1}{2}z^2\right), \quad -\infty<z<+\infty} } \]
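The density formula translates directly into code. The sketch below (the function name is an illustrative choice) evaluates it at a few points and compares the result with `statistics.NormalDist().pdf` from Python's standard library.

```python
import math
from statistics import NormalDist

def standard_normal_density(z):
    """Height of the standard normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

std_normal = NormalDist()  # defaults to mean 0, sd 1
for z in (-2, -1, 0, 1, 2):
    by_formula = standard_normal_density(z)
    by_library = std_normal.pdf(z)
    print(f"z = {z:+d}: formula = {by_formula:.5f}, library = {by_library:.5f}")
```

The curve is symmetric, so the density at \(-z\) equals the density at \(z\), and the peak sits at \(z = 0\).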

68-95-99.7 Rule

In a normal distribution, about 68%, 95%, and 99.7% of observations fall within 1, 2, and 3 standard deviations of the mean, respectively.
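These three probabilities can be recovered from the standard normal cumulative distribution function. A minimal sketch with Python's standard library (variable names are illustrative):

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, sd 1

# P(-k <= z <= k) = cdf(k) - cdf(-k) for k = 1, 2, 3 standard deviations.
within = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, p in within.items():
    print(f"within {k} sd of the mean: {100 * p:.1f}%")
```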

Normal Probability Examples

z-score to percentile

Cumulative SAT scores are approximated by a normal model with \(\mu = 1500 \text { and } \sigma = 300\).

What is the probability that a randomly selected SAT taker scores at least 1630 on the SAT?


\(z = \frac{x-\mu}{\sigma}=\frac{1630-1500}{300}=\frac{130}{300}=0.43\)

\(P(z\ge0.43)=0.3336\)

The probability that a randomly selected score is at least 1630 on the SAT is 33%.
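The same tail probability can be computed without a z-table. This sketch uses `statistics.NormalDist` with the slide's \(\mu = 1500\) and \(\sigma = 300\); it also shows the value obtained after rounding \(z\) to 0.43, which is what a printed table gives. Variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=300)

# Upper-tail probability P(X >= 1630) from the unrounded z-score.
p_exact = 1 - sat.cdf(1630)

# Table-style calculation: round z to two decimals first, as on the slide.
z_rounded = round((1630 - 1500) / 300, 2)
p_table = 1 - NormalDist().cdf(z_rounded)

print(f"z (rounded) = {z_rounded}")
print(f"P(X >= 1630), unrounded z: {p_exact:.4f}")
print(f"P(z >= {z_rounded}), table style: {p_table:.4f}")
```

The two results differ slightly in the third decimal place because of the rounding of \(z\); both round to about 33%.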

Normal Probability Examples

z-score to percentile

Edward earned a 1400 on his SAT. What is his percentile?

\(z = \frac{x-\mu}{\sigma}=\frac{1400-1500}{300}=\frac{-100}{300}=-0.33\)

\(P(z\le-0.33)=0.3707\)

Edward is at the 37th percentile.
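A percentile is the lower-tail probability expressed as a percentage, so Edward's result can be checked with the cumulative distribution function. This is a sketch under the same normal model; variable names are illustrative.

```python
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=300)

# Lower-tail probability P(X <= 1400): the share of test takers at or below Edward.
p_below = sat.cdf(1400)
percentile = round(100 * p_below)

print(f"P(X <= 1400) = {p_below:.4f}")
print(f"percentile (rounded) = {percentile}")
```

The unrounded z-score gives a probability slightly below the table value of 0.3707, but it rounds to the same percentile.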

Normal Probability Examples

percentile to z-score

Carlos believes he can get into his preferred college if he scores at least in the 80th percentile on the SAT. What score should he aim for?


At the 80th percentile, \(z = 0.84\)

\[ \begin{align} z & = \frac{x-\mu}{\sigma} \\ 0.84 & = \frac{x-1500}{300} \\ 0.84 \times 300 + 1500 & = x \\ x & = 1752 \end{align} \]


The 80th percentile on the SAT corresponds to a score of 1752.
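Going from a percentile back to a score uses the inverse of the cumulative distribution function. A minimal sketch with `statistics.NormalDist.inv_cdf` (variable names are illustrative):

```python
from statistics import NormalDist

sat = NormalDist(mu=1500, sigma=300)

# z-score that cuts off the lowest 80% of the standard normal curve.
z_80 = NormalDist().inv_cdf(0.80)

# Undo the standardization: x = z * sigma + mu.
score_from_z = z_80 * sat.stdev + sat.mean

# Equivalent one-step calculation on the SAT scale.
score_direct = sat.inv_cdf(0.80)

print(f"z at the 80th percentile = {z_80:.4f}")
print(f"score via z-score = {score_from_z:.1f}")
print(f"score via inv_cdf = {score_direct:.1f}")
```

The unrounded \(z\) differs from 0.84 only in the third decimal place, so the score agrees with 1752 to within a point.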

Evaluating the Normal Approximation

The distribution is approximately normal if
(1) the normal curve fits the histogram; or
(2) on the QQ plot, the data points fall close to the \(45^\circ\) line
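The points of a QQ plot can be computed without any plotting library. The sketch below is one possible construction, not necessarily the one used for the slides: it standardizes the data, sorts it, and pairs each value with a theoretical standard normal quantile. The plotting positions \((i + 0.5)/n\), the simulated data, and the seed are assumptions made for illustration.

```python
from statistics import NormalDist, mean, pstdev

def qq_points(data):
    """Pair theoretical standard normal quantiles with sorted standardized data."""
    n = len(data)
    m, s = mean(data), pstdev(data)
    sample_q = sorted((x - m) / s for x in data)
    std_normal = NormalDist()
    # Assumed plotting positions: (i + 0.5) / n for i = 0, ..., n - 1.
    theoretical_q = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical_q, sample_q))

# Simulated, roughly normal data (arbitrary seed).
data = NormalDist(mu=27, sigma=3).samples(1000, seed=7)
points = qq_points(data)

# If the data are close to normal, the points lie near the 45-degree line,
# so the average vertical gap from that line is small.
avg_gap = mean(abs(t - s) for t, s in points)
print(f"average distance from the 45-degree line: {avg_gap:.3f}")
```

Plotting `points` as a scatter diagram gives the QQ plot itself; a systematic bend away from the line suggests skew or heavy tails.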

Linear Association

Univariate Data

Bivariate Data - Scatter Diagram

  • Scatterplots exhibit the relationship between two numeric variables.
  • They are used for detecting patterns, trends, and relationships.

Bivariate Data - Positive Association

  • Unit of observation used in the plot is 'county'.
  • The plot shows positive association between education and income.

Linear Association between Variables

Linear Association: the scatter diagram is clustered around a straight line


  • Positive (Linear) Association: above average values of one variable tend to go with above average values of the other; scatterplot slopes up.
  • Negative (Linear) Association: above average values of one variable tend to go with below average values of the other; scatterplot slopes down.
  • No (Linear) Association: scatterplot shows no direction.

Correlation Coefficient

Correlation coefficient: \(\lbrace r | -1 \le r \le +1 \rbrace\)

It measures linear association, i.e. how tightly the points are clustered about a straight line.

Calculating r

\(x: 1, 2, 3, 4, 5 \qquad y: 2, 3, 1, 6, 6\)

z_x # Step 1a: calculate z-scores of x (use population sd)
[1] -1.4142136 -0.7071068  0.0000000  0.7071068  1.4142136
z_y # Step 1b: calculate z-scores of y (use population sd)
[1] -0.7770287 -0.2913858 -1.2626716  1.1655430  1.1655430
z_x * z_y # Step 2: Multiply corresponding pairs of z-scores
[1] 1.0988845 0.2060408 0.0000000 0.8241634 1.6483268
r # Step 3: calculate the average of the product (z_x * z_y)
[1] 0.7554831

Formula for \(r\)

\(\text {If the data are} \space (x_i, y_i), 1\le i\le n, \text {then}\)
\[\bbox[yellow,5px] { \color{black}{r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right)} } \]
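The formula can be followed step by step in code. The sketch below is written in Python rather than the R session shown earlier, and the function names are illustrative; it uses the population sd, as in the worked steps, on the same \(x\) and \(y\) values.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values using the population standard deviation."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def correlation(xs, ys):
    """Average of the products of paired z-scores."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    products = [zx * zy for zx, zy in zip(z_scores(xs), z_scores(ys))]
    return mean(products)

x = [1, 2, 3, 4, 5]
y = [2, 3, 1, 6, 6]

r = correlation(x, y)
print(f"r = {r:.7f}")
```

The result agrees with the value of \(r\) obtained in the step-by-step calculation, 0.7554831.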

Properties of \(r\)

  1. \(r\) is a pure number with no units
  2. \(-1\le r\le +1\)
  3. Adding a constant to one of the variables does not affect \(r\)
  4. Multiplying one of the variables by a positive constant does not affect \(r\)
  5. Multiplying one of the variables by a negative constant switches the sign of \(r\) but does not affect the absolute value of \(r\)
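Properties 3 to 5 can be checked on the same example data. This is a minimal sketch; the `correlation` helper and the particular constants (10, 2.5, and -2.5) are illustrative choices.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of paired z-scores (population sd)."""
    mx, sx = mean(xs), pstdev(xs)
    my, sy = mean(ys), pstdev(ys)
    return mean(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys))

x = [1, 2, 3, 4, 5]
y = [2, 3, 1, 6, 6]

r = correlation(x, y)
r_shift = correlation([v + 10 for v in x], y)       # property 3: add a constant
r_scale = correlation(x, [v * 2.5 for v in y])      # property 4: positive multiple
r_negate = correlation(x, [v * -2.5 for v in y])    # property 5: negative multiple

print(f"original:          r = {r:.4f}")
print(f"x + 10:            r = {r_shift:.4f}")
print(f"y * 2.5:           r = {r_scale:.4f}")
print(f"y * -2.5:          r = {r_negate:.4f}")
```

These follow from the z-scores: shifting or positively rescaling a variable leaves its z-scores unchanged, while a negative multiplier flips the sign of every z-score.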


What \(r\) does not tell you

Association is not causation.

If two variables have a non-zero correlation, then they are related to each other in some way, but that does not mean that one causes the other.

\(r\) measures linear association

Two variables may appear to be strongly associated even though \(r\) is close to \(0\). This happens when the relationship is clearly nonlinear. \(r\) measures linear association. Don't use it if the scatter diagram is nonlinear.
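A small constructed example makes the point concrete. In the sketch below, \(y\) is completely determined by \(x\) through \(y = x^2\), yet \(r\) is zero because the relationship is symmetric rather than linear. The data and the helper name are made up for illustration.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Average product of paired z-scores (population sd)."""
    mx, sx = mean(xs), pstdev(xs)
    my, sy = mean(ys), pstdev(ys)
    return mean(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(xs, ys))

# Perfect but nonlinear dependence: a parabola centered on zero.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]

r = correlation(x, y)
print(f"r = {r:.4f}")
```

The positive and negative halves of the parabola cancel in the sum of z-score products, which is why a scatter diagram should be inspected before relying on \(r\).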

Next Week


Chapters 7-8: Introduction to Linear Regression