Linear Regression

M. Drew LaMar
February 4, 2019


https://xkcd.com/605/

Class announcements

  • Updated office hours
    • Wednesday, 10-11 am
    • Wednesday, 2-3 pm
  • Homework #2
    • OpenStats, Chapter 4: 4.6.3 Hypothesis testing (p. 209) - #4.18, 4.20, 4.22, 4.24, 4.28, 4.30
    • OpenStats, Chapter 7: 7.5.1 Line fitting, residuals, and correlation (p. 356) - #7.1-7.10 (even)
    • OpenStats, Chapter 7: 7.5.2 Fitting a line by least squares regression (p. 362) - #7.24, 7.26, 7.30
    • OpenStats, Chapter 7: 7.5.4 Inference for linear regression (p. 367) - #7.36

Errors in Hypothesis Testing

Definition: A Type I error is rejecting a true null hypothesis. The probability of a Type I error is given by \[ \Pr[\text{Reject } H_{0} \mid H_{0} \text{ is true}] = \alpha \]

Definition: A Type II error is failing to reject a false null hypothesis. The probability of a Type II error is given by \[ \Pr[\text{Do not reject } H_{0} \mid H_{0} \text{ is false}] = \beta \]
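
Example (sketch): The two error rates can be estimated by simulation. The code below is a minimal illustration, not part of the original slides. It assumes a two-sided z-test with known standard deviation, and the function names, sample size, true mean under the alternative (0.5), and seed are all hypothetical choices.

```python
import random
from statistics import NormalDist

def rejection_rate(true_mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05,
                   n_sims=10_000, seed=1):
    """Fraction of simulated samples in which H0: mu = mu0 is rejected.

    Assumes a two-sided z-test with known sigma (an illustrative choice).
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    rejections = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
        sample_mean = sum(sample) / n
        z = (sample_mean - mu0) / (sigma / n ** 0.5)
        rejections += abs(z) > z_crit
    return rejections / n_sims

# H0 is true (true mean equals mu0): the rejection rate estimates alpha.
type_i_rate = rejection_rate(true_mu=0.0)

# H0 is false (true mean is 0.5): the non-rejection rate estimates beta.
type_ii_rate = 1 - rejection_rate(true_mu=0.5)

print(f"Estimated Type I error rate:  {type_i_rate:.3f}")
print(f"Estimated Type II error rate: {type_ii_rate:.3f}")
```

The first estimate should land close to the chosen significance level, while the second depends on the assumed true mean, the variability, and the sample size.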

Errors in Hypothesis Testing - Power

Definition: The power of a statistical test (denoted \( 1-\beta \)) is given by \[ \begin{aligned} \Pr[\text{Reject } H_{0} \mid H_{0} \text{ is false}] & = 1-\beta \\ & = 1 - \Pr[\text{Type II error}] \end{aligned} \]

Probability of errors in hypothesis testing

  • \( \alpha \) is the significance level
  • \( 1-\beta \) is the power

Statistical power example
https://qubeshub.org/tools/statpowerviz/

Power analysis

The power of a statistical test is a function of:

  • Significance level \( \alpha \)
  • Variability of data
  • Sample size
  • Effect size
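
Example (sketch): To see how power responds to each of these four quantities, the code below computes power in closed form for one simple case. It is an illustration only: it assumes a two-sided one-sample z-test with known standard deviation, and the function name `z_test_power` and all numeric inputs are hypothetical.

```python
from statistics import NormalDist

def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test (illustrative assumption).

    effect: true difference between the population mean and its null value
    sigma:  known standard deviation of the data (variability)
    n:      sample size
    alpha:  significance level
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect * n ** 0.5 / sigma  # mean of the z statistic when H0 is false
    # Probability that the z statistic falls in either rejection tail
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)

baseline = z_test_power(effect=0.5, sigma=1.0, n=25)
print(f"baseline:            {baseline:.3f}")
print(f"larger alpha (0.10): {z_test_power(0.5, 1.0, 25, alpha=0.10):.3f}")
print(f"more variable data:  {z_test_power(0.5, 2.0, 25):.3f}")
print(f"larger sample (50):  {z_test_power(0.5, 1.0, 50):.3f}")
print(f"larger effect (1.0): {z_test_power(1.0, 1.0, 25):.3f}")
```

Under these assumptions, power rises with a larger significance level, sample size, or effect size, and falls as the data become more variable.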

  • Desired power is set by researcher (typically 80%)
  • Significance level set by researcher
  • Data variability and effect size can be estimated by previous studies or pilot studies
  • Sample size is then calculated to achieve the desired power, given the significance level, variability, and effect size fixed above
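
Example (sketch): The last step can be illustrated with a standard normal-approximation formula. This is not necessarily the method used in the course: it assumes a two-sided one-sample z-test with known standard deviation and ignores the negligible chance of rejecting in the wrong tail. The function name and the input values are hypothetical.

```python
import math
from statistics import NormalDist

def required_sample_size(effect, sigma, alpha=0.05, power=0.80):
    """Smallest n giving at least the desired power (normal approximation).

    Uses n = ((z_{1 - alpha/2} + z_{power}) * sigma / effect)^2, rounded up.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = std_normal.inv_cdf(power)          # quantile for the desired power
    return math.ceil(((z_alpha + z_power) * sigma / effect) ** 2)

# Hypothetical pilot-study estimates: effect size 0.5, standard deviation 1.0
n_needed = required_sample_size(effect=0.5, sigma=1.0)
print(f"Sample size for 80% power at alpha = 0.05: {n_needed}")
```

Halving the effect size in this formula roughly quadruples the required sample size, which is why a realistic effect-size estimate matters.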

Regression

Definition: Regression is the method used to predict values of one numerical variable (response) from values of another (explanatory).

Note: Regression can be done on data from an observational or experimental study.

We will discuss 3 types:

  • Linear regression
  • Nonlinear regression
  • Logistic regression

Linear regression

Definition: Linear regression draws a straight line through the data to predict the response variable from the explanatory variable.

The slope gives the rate of change of the response variable per unit change in the explanatory variable. Example: humans lose 0.076 units of genetic diversity with every 10,000 km of distance from East Africa.

Formula for the line

Definition: For the population, the regression line is

\[ Y = \alpha + \beta X, \]
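where \( \alpha \) (the intercept) and \( \beta \) (the slope) are population parameters. Here \( \alpha \) and \( \beta \) are not the significance level and Type II error probability from hypothesis testing; the symbols are reused.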

Definition: For a sample, the regression line is

\[ Y = a + b X, \]
where \( a \) and \( b \) are estimates of \( \alpha \) and \( \beta \), respectively.
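
Example (sketch): The slides do not state how \( a \) and \( b \) are computed. Assuming the usual least squares estimates, \( b = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sum (X_i - \bar{X})^2 \) and \( a = \bar{Y} - b \bar{X} \), a minimal implementation looks like this. The function name and the data are made up for illustration.

```python
def least_squares_line(xs, ys):
    """Return (a, b): intercept and slope of the least squares line Y = a + b X."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares about the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx          # slope estimate
    a = y_bar - b * x_bar    # intercept estimate: line passes through (x_bar, y_bar)
    return a, b

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
print(f"intercept a = {a:.3f}, slope b = {b:.3f}")
print(f"predicted Y at X = 6: {a + b * 6:.3f}")
```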

Graph of the line

  • \( a \): intercept
  • \( b \): slope

Assumptions of linear regression

Note: At each value of \( X \), there is a population of \( Y \)-values whose mean lies on the true regression line (this is the linearity assumption).

  • Linearity
  • Residuals are normally distributed
  • Constant variance of residuals
  • Independent observations
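
Example (sketch): One way to make these assumptions concrete is to simulate data that satisfy all of them. In the code below, the population intercept, slope, and residual standard deviation are hypothetical values, as are the function name and the seed. At each fixed \( X \), the \( Y \)-values are drawn independently with normally distributed residuals of constant variance, so their mean should sit near the population line.

```python
import random

# Hypothetical population parameters (not from the slides)
ALPHA = 1.0   # population intercept
BETA = 2.0    # population slope
SIGMA = 1.0   # residual standard deviation, the same at every X

def simulate_y(x, n, rng):
    """Draw n independent Y-values at a fixed X under the regression assumptions."""
    return [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for _ in range(n)]

rng = random.Random(42)
mean_gaps = {}
for x in (1.0, 2.0, 3.0):
    ys = simulate_y(x, 2000, rng)
    sample_mean = sum(ys) / len(ys)
    line_value = ALPHA + BETA * x
    mean_gaps[x] = sample_mean - line_value  # gap between simulated mean and the line
    print(f"X = {x}: mean of Y = {sample_mean:.3f}, line = {line_value:.3f}")
```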

Linear regression is a statistical model

Linear regression is a model formulation

Usually (but not always), regression is reserved for situations where you assert evidence of causation (e.g., A causes B).

Correlation, in contrast, describes relationships (e.g., A and B are positively correlated).

Linear correlation coefficient

Variables: For a correlation, our data consist of two numerical variables (continuous or discrete).

Definition: The (linear) correlation coefficient \( \rho \) measures the strength and direction of the association between two numerical variables in a population.

The linear (Pearson) correlation coefficient measures the tendency of two numerical variables to co-vary in a linear way.

The symbol \( r \) denotes a sample estimate of \( \rho \).
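
Example (sketch): The slides do not give a formula for \( r \). Assuming the standard Pearson formula, \( r = \sum (X_i - \bar{X})(Y_i - \bar{Y}) / \sqrt{\sum (X_i - \bar{X})^2 \sum (Y_i - \bar{Y})^2} \), it can be computed as follows. The function name and the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Sample (Pearson) linear correlation coefficient between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical data, for illustration only
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(xs, [2.0, 4.0, 6.0, 8.0, 10.0])     # exact increasing line
r_down = pearson_r(xs, [10.0, 8.0, 6.0, 4.0, 2.0])   # exact decreasing line
r_noisy = pearson_r(xs, [2.1, 3.9, 6.2, 7.8, 10.1])  # nearly linear

print(f"r (increasing line) = {r_up:.3f}")
print(f"r (decreasing line) = {r_down:.3f}")
print(f"r (noisy data)      = {r_noisy:.3f}")
```

The sign of \( r \) gives the direction of the association, and values closer to \( \pm 1 \) indicate a tighter linear relationship.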