M. Drew LaMar
February 7, 2020
Definition:
Regression is the method used to predict values of one numerical variable (response) from values of another (explanatory).
Note: Regression can be done on data from an observational or experimental study.
We will discuss 3 types:
Definition:
Linear regression draws a straight line through the data to predict the response variable from the explanatory variable.
Definition: For the population, the regression line is
\[ Y = \alpha + \beta X, \]
where \( \alpha \) (the intercept) and \( \beta \) (the slope) are population parameters.
Definition: For a sample, the regression line is
\[ Y = a + b X, \]
where \( a \) and \( b \) are estimates of \( \alpha \) and \( \beta \), respectively.
Note: At each value of \( X \), there is a population of \( Y \)-values whose mean lies on the true regression line (this is the linear assumption).
Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).
Definition: The (linear) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical variables in a population.
The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.
The symbol \( r \) denotes a sample estimate of \( \rho \).
\[ r = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ -1 \leq r \leq 1 \]
\[ r = \frac{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sqrt{\frac{1}{n-1}\sum_{i}(X_{i}-\bar{X})^2}\sqrt{\frac{1}{n-1}\sum_{i}(Y_{i}-\bar{Y})^2}} \]
\[ r = \frac{\mathrm{Covariance}(X,Y)}{s_{X}s_{Y}} \]
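The equivalent forms of \( r \) above can be checked numerically. The following is a minimal sketch in plain Python; the function names `pearson_r` and `pearson_r_from_covariance` and the data values are made up for illustration and do not come from the lecture.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation r using the sums-of-products form."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    return sum_xy / (math.sqrt(sum_xx) * math.sqrt(sum_yy))


def pearson_r_from_covariance(xs, ys):
    """Sample correlation r as Covariance(X, Y) / (s_X * s_Y)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample covariance and sample standard deviations all divide by n - 1.
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov_xy / (s_x * s_y)


# Hypothetical illustrative data, not from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r1 = pearson_r(xs, ys)
r2 = pearson_r_from_covariance(xs, ys)
print(f"r (sums form)       = {r1:.4f}")
print(f"r (covariance form) = {r2:.4f}")
```

The two functions should agree up to floating-point rounding, because the factors of \( \frac{1}{n-1} \) cancel between numerator and denominator.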
Technically, the linear regression equation is
\[ \mu_{Y\, |\, X=X^{*}} = \alpha + \beta X^{*}, \]
where \( \mu_{Y\, |\, X=X^{*}} \) is the mean of \( Y \) in the sub-population with \( X=X^{*} \) (called the predicted value).
You are predicting the mean of Y given X.
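One way to see what "predicting the mean of Y given X" means is a small simulation. The sketch below assumes hypothetical parameter values (`ALPHA`, `BETA`, `SIGMA`) and, as a further assumption not stated in these slides, normally distributed scatter around the line. It draws a sub-population of \( Y \)-values at a few chosen \( X^{*} \) and compares the sample mean with \( \alpha + \beta X^{*} \).

```python
import random
import statistics

# Hypothetical population parameters, chosen only for illustration.
ALPHA = 1.0   # intercept
BETA = 2.0    # slope
SIGMA = 1.0   # assumed spread of Y around the line (normal errors assumed)


def simulate_subpopulation(x_star, n, rng):
    """Draw n values of Y from the sub-population with X = x_star."""
    return [ALPHA + BETA * x_star + rng.gauss(0.0, SIGMA) for _ in range(n)]


rng = random.Random(42)  # fixed seed so the run is repeatable
for x_star in (0.0, 1.0, 2.5):
    ys_at_x = simulate_subpopulation(x_star, 10_000, rng)
    line_value = ALPHA + BETA * x_star
    sample_mean = statistics.mean(ys_at_x)
    print(f"X* = {x_star}: line = {line_value:.2f}, mean of Y = {sample_mean:.2f}")
```

Individual \( Y \)-values scatter around the line, but the mean at each \( X^{*} \) should land close to it.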
Method of least squares
Definition: The least-squares regression line is the line for which the sum of all the squared deviations in \( Y \) is smallest.
The method of least-squares leads to the following estimates for intercept and slope:
\[ \begin{align} b & = \frac{\sum_{i}(X_{i}-\bar{X})(Y_{i}-\bar{Y})}{\sum_{i}(X_{i}-\bar{X})^2} \\ a & = \bar{Y}-b\bar{X} \end{align} \]
Note:
\[ b = \frac{\mathrm{Covariance}(X,Y)}{s_{X}^2} = r\frac{s_{Y}}{s_{X}}, \]
where \( r \) is the correlation coefficient!
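The least-squares estimates and the identity \( b = r\, s_{Y}/s_{X} \) can be sketched in plain Python as follows. The function names `least_squares_fit` and `slope_from_correlation` and the data values are hypothetical, chosen for illustration only.

```python
import math


def least_squares_fit(xs, ys):
    """Return (a, b): least-squares intercept and slope for Y = a + b X."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    if sum_xx == 0:
        raise ValueError("all X values are identical; slope is undefined")
    b = sum_xy / sum_xx          # slope
    a = y_bar - b * x_bar        # intercept: the line passes through (x_bar, y_bar)
    return a, b


def slope_from_correlation(xs, ys):
    """Compute the slope as r * s_Y / s_X, for comparison with b."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sum_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sum_xx = sum((x - x_bar) ** 2 for x in xs)
    sum_yy = sum((y - y_bar) ** 2 for y in ys)
    r = sum_xy / math.sqrt(sum_xx * sum_yy)
    s_x = math.sqrt(sum_xx / (n - 1))
    s_y = math.sqrt(sum_yy / (n - 1))
    return r * s_y / s_x


# Hypothetical illustrative data, not from the lecture.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_fit(xs, ys)
print(f"a = {a:.4f}, b = {b:.4f}")
print("b matches r * s_Y / s_X:", math.isclose(b, slope_from_correlation(xs, ys)))
```

Because \( a = \bar{Y} - b\bar{X} \), the fitted line always passes through the point \( (\bar{X}, \bar{Y}) \).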