The correlation coefficient \(r\) measures linear association, i.e., how tightly the points are clustered about a straight line.
\(x: 1, 2, 3, 4, 5 \qquad y: 2, 3, 1, 6, 6\)
z_x # Step 1a: calculate z-scores of x (use population sd)
[1] -1.4142136 -0.7071068 0.0000000 0.7071068 1.4142136
z_y # Step 1b: calculate z-scores of y (use population sd)
[1] -0.7770287 -0.2913858 -1.2626716 1.1655430 1.1655430
z_x * z_y # Step 2: Multiply corresponding pairs of z-scores
[1] 1.0988845 0.2060408 0.0000000 0.8241634 1.6483268
r # Step 3: calculate the average of the product (z_x * z_y)
[1] 0.7554831
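As a minimal sketch, the three steps above can be reproduced in R as follows, assuming the five \((x, y)\) pairs listed at the start of this section; the last line checks the hand computation against the built-in cor(), which returns the same value because the \(n\) versus \(n-1\) factors cancel.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 6, 6)
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))  # population sd (divide by n)
z_x <- (x - mean(x)) / pop_sd(x)   # Step 1a: z-scores of x
z_y <- (y - mean(y)) / pop_sd(y)   # Step 1b: z-scores of y
z_x * z_y                          # Step 2: multiply corresponding pairs of z-scores
r <- mean(z_x * z_y)               # Step 3: average of the products
r                                  # 0.7554831
cor(x, y)                          # same value: the n vs. (n - 1) factors cancel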
\(\text {If the data are} \space (x_i, y_i), 1\le i\le n, \text {then}\)
\[\bbox[yellow,5px]
{
\color{black}{r = \frac{1}{n}\sum_{i=1}^n \left(\frac{x_i-\mu_x}{\sigma_x}\right)\left(\frac{y_i-\mu_y}{\sigma_y}\right)}
}
\]
What does \(r\) not tell you?
Association is not causation.
If two variables have a non-zero correlation, then they are related to each other in some way, but that does not mean that one causes the other.
Two variables may appear to be strongly associated, yet \(r\) can be close to \(0\). This happens when the relationship is clearly nonlinear: \(r\) measures linear association, so don't use it if the scatter diagram is nonlinear.
Hypotheses: If conducting a formal hypothesis test to determine whether there is a significant linear correlation between two variables, use the following null and alternative hypotheses, where \(\rho\) represents the linear correlation coefficient of the population:
\[ \text{Null Hypothesis } H_0: \rho = 0 \text{ (no correlation)} \\ \text{Alt. Hypothesis } H_a: \rho \ne 0 \text{ (correlation)} \\ \\ t = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \\ \\ \text{Then calculate the p-value and} \\ \text{compare it with the 5% significance level to reject or fail to reject the null hypothesis.} \]
\[ x: 5, 6, 4, 4, 5 \\ y: 6, 9, 3, 2, 11 \]
\[ \begin{align} H_0&: \rho = 0 \\ H_a&: \rho \ne 0 \\ \\ r &= 0.795 \\ n &= 5 \\ t &= \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} = \frac{0.795}{\sqrt{\frac{1-0.795^2}{5-2}}} \\ &= 2.269 \\ \\ p\text{-value} &= 0.1079 > 0.05 \\ &\text{Fail to reject the null hypothesis.} \end{align} \]
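As a quick check, the same test can be run in R on the five pairs above; cor.test() reports the same t statistic, degrees of freedom, and p-value as the hand calculation (a sketch, not the only way to do it).
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
r <- cor(x, y)                              # 0.795
n <- length(x)
t_stat <- r / sqrt((1 - r^2) / (n - 2))     # 2.269
2 * pt(-abs(t_stat), df = n - 2)            # two-sided p-value: 0.1079 > 0.05
cor.test(x, y)                              # built-in test gives the same result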
Suppose the heights (inches) of a group of people follow \(N(67,3)\). Estimate the height of one of these people.
Say the estimate is \(c\).
estimation error = actual height - \(c\)
The "best" \(c\) is the one that produces the smallest root mean squared (r.m.s.) error.
The r.m.s. of the errors will be smallest if \(c = \mu\).
So the least squares estimate = \(\mu\) = 67, and its r.m.s. error = \(\sigma\) = 3.
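A small simulation sketch of this claim (the sample size and seed are arbitrary): for heights drawn from \(N(67, 3)\), the r.m.s. error of a constant estimate is smallest when the estimate is near the mean, and that smallest error is close to \(\sigma = 3\).
set.seed(1)
heights <- rnorm(10000, mean = 67, sd = 3)          # simulated heights ~ N(67, 3)
rms_error <- function(est) sqrt(mean((heights - est)^2))
sapply(c(60, 64, 67, 70, 74), rms_error)            # error grows as est moves away from 67
rms_error(mean(heights))                            # minimum, approximately sigma = 3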
Given the value of one variable, estimate the value of the other.
Assume both variables are approximately normally distributed.
\[ \begin{align} \text {From bivariate scatter plot in standard units:} \\ z_y & = r.z_x \\ \frac {y-\mu_y}{\sigma_y} & = r. \frac {x-\mu_x}{\sigma_x} \\ y-\mu_y & = r. \frac {\sigma_y}{\sigma_x} (x-\mu_x) \\ y & = r. \frac {\sigma_y}{\sigma_x} (x-\mu_x) + \mu_y\\ y & = (\mu_y - r. \frac {\sigma_y}{\sigma_x}\mu_x) + (r. \frac {\sigma_y}{\sigma_x}).x \\ y & = b_0 + b_1.x \\ \text {Where, } & \begin{cases} slope(b_1) & = r. \frac {\sigma_y}{\sigma_x} \\ intercept(b_0) & = \mu_y - b_1.\mu_x \\ \end{cases} \\ \text {When, } x & = \mu_x, y = \mu_y \\ \end{align} \]
\[ \bbox[yellow,5px] { \color {black} {\implies \text {The regression line passes through the point of averages } (\mu_x, \mu_y).} } \]
[1] "Height (inches) (x): mean = 67 sd = 3"
[1] "Weight (lb) (y): mean = 174 sd = 21"
[1] "r = 0.304"
Find the equation of the regression line for estimating weight based on height.
\[ \begin{align} slope(b_1) & = r. \frac {\sigma_y}{\sigma_x} \\ intercept(b_0) & = \mu_y - b_1.\mu_x \\ \end{align} \]
[1] "slope (b1) = 2.07 lb per inch"
[1] "intercept (b0) = 35 lb"
[1] "Regression Equation: Est. weight = 35 + 2.07.(height)"
[1] "A person who is 60 inches tall is estimated to be 159 lb"
Mathematically, the intercept is described as the mean response \((Y)\) value when all predictor variables \((X)\) are set to zero. Sometimes a zero setting for the predictor variable(s) is nonsensical, which makes the intercept noninterpretable.
For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)
\(Height = 0\) is nonsensical; therefore, the model intercept has no interpretation.
The constant in a regression model guarantees that the residuals have a mean of zero, which is a key assumption in regression analysis. If we don't include the constant, the regression line is forced to go through the origin, meaning the predicted response must be zero when all of the predictors equal zero. If the fitted line doesn't naturally go through the origin, the regression coefficients and predictions will be biased, and the residuals will have an overall positive or negative bias.
The slope of a straight line measures how much the value of \(Y\) changes for every unit of change in \(X\).
For example, in the following equation: \(\hat {Weight} = 35 + 2*Height\)
The slope is \(2 \text { lb per inch}\) - meaning that if a group of people is one inch taller than another group, the former group will be on average 2 lb heavier than the latter.
In other words, the slope describes a comparison between groups, not a change within a person.
Remember, the slope should NOT be interpreted as: if one person gets taller by 1 inch, he or she will put on 2 lb of weight.
Which line to use?
Objectively, we want a line that produces the least estimation error.
Residuals are the leftover variation in the data after accounting for the model fit.
Residual: difference between observed and expected
The residual of the \(i^{th}\) observation \((x_i, y_i)\) is the difference between the observed response \((y_i)\) and its predicted value based on model fit \((\hat y_i)\): \(e_i = y_i - \hat y_i\)
In the scatter plot, the residual (in other words, the estimation error) is shown as the vertical distance between the observed point and the line. If an observation is above the line, its residual is positive. Observations below the line have negative residuals. One goal in picking the right linear model is for these residuals to be as small as possible.
Common practice is to choose the line that minimizes the sum of squared residuals: \(e_1^2 + e_2^2 + ... + e_n^2\)
There is only one line that minimizes the sum of squared residuals. It is called the least squares line.
Mathematically, it can be shown that the regression line is the least squares line.
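As a small numerical illustration, reusing the five \((x, y)\) pairs from the hypothesis-test example: the line fitted by lm() in R is the least squares line, and any other line (for example, one with a slightly different slope) has a larger sum of squared residuals.
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)                         # least squares (regression) line
sum(resid(fit)^2)                        # sum of squared residuals for this line
b0 <- coef(fit)[1]; b1 <- coef(fit)[2]
sum((y - (b0 + (b1 + 0.5) * x))^2)       # a perturbed line does worse (larger sum)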
root mean squared (r.m.s.) error of regression = r.m.s. of residuals =
\[ \bbox[yellow,5px] { \color{black}{\sqrt {1-r^2}.\sigma_y} } \]
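A short R check of this identity on the same five \((x, y)\) pairs, using the population sd of \(y\) (divide by \(n\)) to match the formula above; both of the last two lines print the same number.
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)
sqrt(mean(resid(fit)^2))                        # r.m.s. of the residuals
r <- cor(x, y)
sigma_y <- sqrt(mean((y - mean(y))^2))          # population sd of y
sqrt(1 - r^2) * sigma_y                         # same value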
Linearity - The relationship between \(x\) and \(y\) should show a linear trend.
Nearly Normal Residuals - Generally the residuals must be nearly normal with a mean of zero. There should be no linear association between the residuals and \(x\), meaning \(\text{cor}(x, \text{res}) = 0\). When this condition is violated, it is usually because of outliers.
Constant Variability - The variability of points around the least squares line remains roughly constant.
Independent Observations - Be cautious about applying regression to time series data, which are sequential observations in time such as a stock price each day. Such data may have an underlying structure that should be considered in a model and analysis.
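A hedged sketch of how the first three conditions are commonly checked in R for a fitted model (here `fit <- lm(y ~ x)` is assumed to exist):
fit <- lm(y ~ x)                      # assumes x and y are already defined
plot(x, resid(fit)); abline(h = 0)    # linearity / constant variability: residuals
                                      #   should scatter evenly about the zero line
hist(resid(fit))                      # nearly normal residuals, centred at zero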
\[ \text{total variation = explained variation + unexplained variation} \\ \sum(y - \bar y)^2 = \sum(\hat y - \bar y)^2 + \sum(y - \hat y)^2 \]
\(R^2\) is the proportion of the variance in the dependent variable \(y\) that is explained by the linear relationship between \(x\) and \(y\).
\[ R^2 = \frac{\text{explained variation}}{\text{total variation}} \]
\(R^2 = 58\%\) suggests that \(58\%\) of the variability in \(y\) can be explained by the variability in \(x\).
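A short R check of the decomposition and of \(R^2\), again on the five \((x, y)\) pairs from the earlier example; the explained-to-total ratio equals both \(\text{cor}(x, y)^2\) and the R-squared reported by summary().
x <- c(5, 6, 4, 4, 5)
y <- c(6, 9, 3, 2, 11)
fit <- lm(y ~ x)
y_hat <- fitted(fit)
total <- sum((y - mean(y))^2)              # total variation
explained <- sum((y_hat - mean(y))^2)      # explained variation
unexplained <- sum((y - y_hat)^2)          # unexplained variation
total; explained + unexplained             # equal
explained / total                          # R^2
cor(x, y)^2                                # same value
summary(fit)$r.squared                     # same value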