The correlation coefficient \(r\) measures linear association, i.e., how tightly the points are clustered about a straight line.
\(x: 1, 2, 3, 4, 5 \qquad y: 2, 3, 1, 6, 6\)
z_x # Step 1a: calculate z-scores of x (use population sd)
[1] -1.4142136 -0.7071068 0.0000000 0.7071068 1.4142136
z_y # Step 1b: calculate z-scores of y (use population sd)
[1] -0.7770287 -0.2913858 -1.2626716 1.1655430 1.1655430
z_x * z_y # Step 2: Multiply corresponding pairs of z-scores
[1] 1.0988845 0.2060408 0.0000000 0.8241634 1.6483268
r # Step 3: calculate the average of the product (z_x * z_y)
[1] 0.7554831
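As a minimal sketch, the three steps above can be reproduced in R as follows, assuming the five \((x, y)\) pairs listed at the start of this section; the last line checks the hand computation against the built-in cor(), which returns the same value because the \(n\) versus \(n-1\) factors cancel.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))  # population sd (divide by n)
z_x <- (x - mean(x)) / pop_sd(x)   # Step 1a: z-scores of x
z_y <- (y - mean(y)) / pop_sd(y)   # Step 1b: z-scores of y
z_x * z_y                          # Step 2: multiply corresponding pairs of z-scores
r <- mean(z_x * z_y)               # Step 3: average of the products
r                                  # 0.7554831
cor(x, y)                          # same value: the n vs. (n - 1) factors cancel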
\(\text {If the data are} \space (x_i, y_i), 1\le i\le n, \text {then}\)
\[\bbox[yellow,5px]
{
\color{black}{r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right)}
}
\]
What does \(r\) not tell you?
Association is not causation.
If two variables have a non-zero correlation, then they are related to each other in some way, but that does not mean that one causes the other.
Two variables may appear to be strongly associated, yet \(r\) can be close to \(0\). This happens when the relationship is clearly nonlinear: \(r\) measures linear association, so don't use it if the scatter diagram is nonlinear.
Hypotheses: If conducting a formal hypothesis test to determine whether there is a significant linear correlation between two variables, use the following null and alternative hypotheses, where \(\rho\) represents the linear correlation coefficient of the population:
\[ \text{Null Hypothesis } H_0: \rho = 0 \text{ (no correlation)} \\ \text{Alt. Hypothesis } H_a: \rho \ne 0 \text{ (correlation)} \\ \\ t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \\ \\ \text{Then calculate the p-value and} \\ \text{compare it with the 5% significance level to reject or fail to reject the null hypothesis.} \]
\[ x: 5, 6, 4, 4, 5 \\ y: 6, 9, 3, 2, 11 \]
\[ \begin{align} H_0&: \rho = 0 \\ H_a&: \rho \ne 0 \\ \\ r &= 0.795 \\ n &= 5 \\ t &= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} = \frac{0.795}{\sqrt{\frac{1-0.795^2}{5-2}}} \\ &= 2.269 \\ \\ p\text{-value} &= 0.1079 > 0.05 \\ &\text{Fail to reject the null hypothesis.} \end{align} \]
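As a quick check, the same test can be run in R on the five pairs above; cor.test() reports the same t statistic, degrees of freedom, and p-value as the hand calculation (a sketch, not the only way to do it).
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
r <- cor(x, y)                              # 0.795
n <- length(x)
t_stat <- r / sqrt((1 - r^2) / (n - 2))     # 2.269
2 * pt(-abs(t_stat), df = n - 2)            # two-sided p-value: 0.1079 > 0.05
cor.test(x, y)                              # built-in test gives the same result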
Suppose the heights (inches) of a group of people follow \(N(67,3)\). Estimate the height of one of these people.
Say the estimate is \(c\).
estimation error = actual height - \(c\)
The "best" \(c\) is the one that produces the smallest root mean squared (r.m.s.) error.
The r.m.s. of the errors will be smallest if \(c = \mu\).
So the least squares estimate = \(\mu\) = 67, and its r.m.s. error = \(\sigma\) = 3.
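A small simulation sketch of this claim (the sample size and seed are arbitrary): for heights drawn from \(N(67, 3)\), the r.m.s. error of a constant estimate is smallest when the estimate is near the mean, and that smallest error is close to \(\sigma = 3\).
set.seed(1)
heights <- rnorm(10000, mean = 67, sd = 3)          # simulated heights ~ N(67, 3)
rms_error <- function(est) sqrt(mean((heights - est)^2))
sapply(c(60, 64, 67, 70, 74), rms_error)            # error grows as est moves away from 67
rms_error(mean(heights))                            # minimum, approximately sigma = 3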
Given the value of one variable, estimate the value of the other.
Assume both variables are approximately normally distributed.
\[ \begin{align} \text {From bivariate scatter plot in standard units:} \\ z_y & = r.z_x \\ \frac {y-\mu_y}{\sigma_y} & = r. \frac {x-\mu_x}{\sigma_x} \\ y-\mu_y & = r. \frac {\sigma_y}{\sigma_x} (x-\mu_x) \\ y & = r. \frac {\sigma_y}{\sigma_x} (x-\mu_x) + \mu_y\\ y & = (\mu_y - r. \frac {\sigma_y}{\sigma_x}\mu_x) + (r. \frac {\sigma_y}{\sigma_x}).x \\ y & = b_0 + b_1.x \\ \text {Where, } & \begin{cases} slope(b_1) & = r. \frac {\sigma_y}{\sigma_x} \\ intercept(b_0) & = \mu_y - b_1.\mu_x \\ \end{cases} \\ \text {When, } x & = \mu_x, y = \mu_y \\ \end{align} \]
\[ \bbox[yellow,5px] { \color {black} {\implies \text {The regression line passes through the point of averages } (\mu_x, \mu_y).} } \]
[1] "Height (inches) (x): mean = 67 sd = 3"
[1] "Weight (lb) (y): mean = 174 sd = 21"
[1] "r = 0.304"
Find the equation of the regression line for estimating weight based on height.
\[ \begin{align} slope(b_1) & = r. \frac {\sigma_y}{\sigma_x} \\ intercept(b_0) & = \mu_y - b_1.\mu_x \\ \end{align} \]
[1] "slope (b1) = 2.07 lb per inch"
[1] "intercept (b0) = 35 lb"
[1] "Regression Equation: Est. weight = 35 + 2.07.(height)"
[1] "A person who is 60 inches tall is estimated to be 159 lb"
Mathematically, the intercept is described as the mean response \((Y)\) value when all predictor variables \((X)\) are set to zero. Sometimes a zero setting for the predictor variable(s) is nonsensical, which makes the intercept noninterpretable.
For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)
\(Height = 0\) is nonsensical; therefore, the model intercept has no interpretation.
The constant in a regression model guarantees that the residuals have a mean of zero, which is a key assumption in regression analysis. If we don't include the constant, the regression line is forced to go through the origin, meaning the predicted response must be zero when all of the predictors equal zero. If the fitted line doesn't naturally go through the origin, the regression coefficients and predictions will be biased, and the residuals will have an overall positive or negative bias.
The slope of a straight line measures how much the value of \(Y\) changes for every unit of change in \(X\).
For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)
The slope is \(2 \text { lb per inch}\) - meaning that if a group of people is one inch taller than another group, the former group will be on average 2 lb heavier than the latter.
In other words, the slope describes a comparison between groups, not a change within a person.
Remember, the slope should NOT be interpreted as: if one person gets taller by 1 inch, he or she will put on 2 lb of weight.
Which line to use?
Objectively, we want a line that produces the least estimation error.
Residuals are the leftover variation in the data after accounting for the model fit.
Residual: difference between observed and expected
The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed response \((y_i)\) and its predicted value based on model fit \((\hat y_i)\): \(e_i = y_i - \hat y_i\)
In the scatter plot, the residual (in other words, the estimation error) is shown as the vertical distance between the observed point and the line. If an observation is above the line, its residual is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.
Common practice is to choose the line that minimizes the sum of squared residuals: \(e_1^2 + e_2^2 + ... + e_n^2\)
There is only one line that minimizes the sum of squared residuals. It is called the least squares line.
Mathematically, it can be shown that the regression line is the least squares line.
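As a small numerical illustration, reusing the five \((x, y)\) pairs from the hypothesis-test example: the line fitted by lm() in R is the least squares line, and any other line (for example, one with a slightly different slope) has a larger sum of squared residuals.
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)                         # least squares (regression) line
sum(resid(fit)^2)                        # sum of squared residuals for this line
b0 <- coef(fit)[1]; b1 <- coef(fit)[2]
sum((y - (b0 + (b1 + 0.5) * x))^2)       # a perturbed line does worse (larger sum)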
root mean squared (r.m.s.) error of regression = r.m.s. of residuals =
\[ \bbox[yellow,5px] { \color{black}{\sqrt {1-r^2}.\sigma_y} } \]
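A short R check of this identity on the same five \((x, y)\) pairs, using the population sd of \(y\) (divide by \(n\)) to match the formula above; both of the last two lines print the same number.
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)
sqrt(mean(resid(fit)^2))                        # r.m.s. of the residuals
r <- cor(x, y)
sigma_y <- sqrt(mean((y - mean(y))^2))          # population sd of y
sqrt(1 - r^2) * sigma_y                         # same value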
Linearity - The relationship between \(x\) and \(y\) should show a linear trend.
Nearly Normal Residuals - Generally the residuals must be nearly normal with a mean of zero. There should be no linear association between the residuals and \(x\), meaning \(\text{cor}(x, \text{res}) = 0\). When this condition is violated, it is usually because of outliers.
Constant Variability - The variability of points around the least squares line remains roughly constant.
Independent Observations - Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis.
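A hedged sketch of how the first three conditions are commonly checked in R for a fitted model (here `fit <- lm(y ~ x)` is assumed to exist):
fit <- lm(y ~ x)                      # assumes x and y are already defined
plot(x, resid(fit)); abline(h = 0)    # linearity / constant variability: residuals
                                      #   should scatter evenly about the zero line
hist(resid(fit))                      # nearly normal residuals, centred at zero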
\[ \text{total variation = explained variation + unexplained variation} \\ \sum(y - \bar y)^2 = \sum(\hat y - \bar y)^2 + \sum(y - \hat y)^2 \]
\(R^2\) is the proportion of the variance in the dependent variable \(y\) that is explained by the linear relationship between \(x\) and \(y\).
\[ R^2 = \frac{\text{explained variation}}{\text{total variation}} \]
\(R^2 = 58\%\) suggests that \(58\%\) of the variability in \(y\) can be explained by the variability in \(x\).
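A short R check of the decomposition and of \(R^2\), again on the five \((x, y)\) pairs from the earlier example; the explained-to-total ratio equals both \(\text{cor}(x, y)^2\) and the R-squared reported by summary().
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)
y_hat <- fitted(fit)
total <- sum((y - mean(y))^2)              # total variation
explained <- sum((y_hat - mean(y))^2)      # explained variation
unexplained <- sum((y - y_hat)^2)          # unexplained variation
total; explained + unexplained             # equal
explained / total                          # R^2
cor(x, y)^2                                # same value
summary(fit)$r.squared                     # same value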