Joel Correa da Rosa
June 7th, 2017
One variable (\( Y \)) can be predicted from one or more variables (\( X_1,X_2,...,X_p \)).
\( Y \) is usually called dependent variable or response variable.
\( X_1,X_2,...,X_p \) are called independent variables, predictors, regressors, or explanatory variables, among other names.
In the classical regression framework, \( Y \) is normally distributed with a mean that changes as a function of the predictors.
When there is only one explanatory variable we have the simple linear regression model.
\( Y_i = \beta_0 +\beta_1X_i+\epsilon_i \)
\( \beta_0 \) is the intercept, which represents the mean value of \( Y \) when \( X=0 \).
\( \beta_1 \) is the slope, which represents the increase (or decrease) in the mean value of \( Y \) associated with a one-unit increase in \( X \).
\( \epsilon \) is the random error. For the purpose of making inference, we assume \( \epsilon \) to be normally distributed with mean 0 and variance \( \sigma^2 \).
Given a sample of pairs \( (X_i,Y_i) \), inference in the simple linear regression model consists of finding good estimates of \( \beta_0 \) and \( \beta_1 \).
One popular estimation method is least squares, which consists of finding the values of \( \beta_0 \) and \( \beta_1 \) that minimize the sum of squared deviations:
\( \sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2 \)
The least squares estimators can be derived analytically and are given by:
\( b_1=\frac{\sum xy -(\sum x)(\sum y)/n}{\sum x^2 -(\sum x)^2/n} \)
\( b_0= \bar{y} -b_1\bar{x} \)
Once we have calculated \( b_0 \) and \( b_1 \), given a value of \( X \), the value of \( Y \) can be predicted with the equation
\( \hat{Y}=b_0+b_1X \)
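As an illustration (a minimal sketch with simulated data, not part of the original example), the closed-form estimates can be computed directly in R and compared with lm():
# Simulated data: true intercept 2 and slope 0.5 (values chosen arbitrarily)
set.seed(123)
x <- runif(30, 0, 10)
y <- 2 + 0.5*x + rnorm(30, sd = 1)
n <- length(x)
# Closed-form least squares estimates
b1 <- (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
b0 <- mean(y) - b1*mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))   # should match the hand-computed estimates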
The effect of a factor on the dependent variable may be influenced by the presence of other factors. To take into account the joint influence of \( p \) explanatory variables, multiple linear regression extends the simple linear regression model.
\( Y_i = \beta_0+\sum_{j=1}^p \beta_j X_{ji}+\epsilon_i \)
The slopes \( \beta_j \) have an interpretation similar to the one from simple linear regression: \( \beta_j \) quantifies the effect of a one-unit increase in \( X_j \), assuming that the other variables are held fixed.
\( \beta_j \) is the additional contribution of the variable \( X_j \)
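A minimal sketch of fitting a multiple linear regression in R with simulated data (the variable names and coefficients below are arbitrary, chosen only for illustration):
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2*x1 - 1*x2 + 0*x3 + rnorm(n)
# Each estimated slope is adjusted for the other predictors
summary(lm(y ~ x1 + x2 + x3))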
Consider the multiple regression with two independent variables:
\( Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i}+\beta_3X_{1i}X_{2i}+\epsilon_i \)
\( \beta_3 \) quantifies the effect of the interaction between \( X_1 \) and \( X_2 \).
Consider that \( X_1 \) and \( X_2 \) are two binary variables representing exposure to a risk factor.
(1). If exposed to \( X_2 \) (\( X_2=1 \)):
\( \mu_{y|x} = \beta_0 +\beta_1+\beta_2+\beta_3 \) if exposed to \( X_1 \) (\( X_1=1 \))
\( \mu_{y|x} = \beta_0 +\beta_2 \) if not exposed to \( X_1 \) (\( X_1=0 \))
(2). If not exposed to \( X_2 \) (\( X_2=0 \)):
\( \mu_{y|x} = \beta_0 +\beta_1 \) if exposed to \( X_1 \) (\( X_1=1 \))
\( \mu_{y|x} = \beta_0 \) if not exposed to \( X_1 \) (\( X_1=0 \))
\( \mu_{y|x} \) can be read as the expected value of \( Y \) conditional on the values of \( X_1 \) and \( X_2 \).
When \( \beta_3>0 \) the interaction is called a synergistic effect, and when \( \beta_3<0 \) it is called an antagonistic effect.
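Combining the four conditional means above, \( \beta_3 \) can also be read as a difference of differences: the effect of \( X_1 \) among those exposed to \( X_2 \) minus its effect among those not exposed:
\( \beta_3 = (\mu_{y|X_1=1,X_2=1}-\mu_{y|X_1=0,X_2=1})-(\mu_{y|X_1=1,X_2=0}-\mu_{y|X_1=0,X_2=0}) \)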
If one variable has a quadratic effect, it can be included in the linear regression in the following way:
\( Y_i=\beta_0+\beta_1X_i+\beta_2X_i^2+\epsilon_i \)
The interpretation of \( \beta_1 \) is not the same as before: a one-unit increase in \( X \) changes the mean of \( Y \) by \( \beta_1+\beta_2(2X+1) \), since \( (X+1)^2-X^2=2X+1 \).
Note that the effect on the response variable now depends on the value of \( X \).
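A quadratic term can be included in an R formula with I(); the sketch below uses simulated data for illustration only:
set.seed(42)
x <- runif(50, 0, 5)
y <- 1 + 2*x - 0.3*x^2 + rnorm(50, sd = 0.5)
fit_quad <- lm(y ~ x + I(x^2))   # I() keeps x^2 as a literal squared term
coef(fit_quad)                   # estimates of beta0, beta1 and beta2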
Following the same principle used in the simple linear regression, the parameters can be estimated by minimization of the sum of squared deviations:
minimize \( S=\sum_{i=1}^n(Y_i-\beta_0-\sum_{j=1}^p\beta_jX_{ji})^2 \)
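For readers who prefer the matrix form, the minimizer solves the normal equations, \( b=(X'X)^{-1}X'Y \). A small sketch with simulated data (placeholder names) verifies that this agrees with lm():
set.seed(7)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + 1.5*x1 - 2*x2 + rnorm(n)
X <- cbind(1, x1, x2)              # design matrix with a column of ones
b <- solve(t(X) %*% X, t(X) %*% y) # solves the normal equations
cbind(b, coef(lm(y ~ x1 + x2)))    # the two columns should agree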
The analysis of variance (ANOVA) approach consists of decomposing the total variation, represented by
\( SST=\sum_{i=1}^n(Y_i-\bar{Y})^2 \)
in the following way:
\( SST = SSR+SSE \)
where \( SSR \) stands for sum of squares of regression and \( SSE \) stands for sum of squares of errors.
Based on the analysis of variance approach, the \( R^2 \) statistic is a measure of goodness-of-fit.
\( R^2 = \frac{SSR}{SST} \) , \( R^2 \in [0,1] \)
The closer to 1, the better the fit.
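A short sketch (simulated data, for illustration only) showing how \( R^2 \) can be computed from the sums of squares and checked against summary():
set.seed(11)
x <- rnorm(40)
y <- 1 + x + rnorm(40)
m <- lm(y ~ x)
sst <- sum((y - mean(y))^2)   # total sum of squares
sse <- sum(residuals(m)^2)    # error (residual) sum of squares
ssr <- sst - sse              # regression sum of squares
c(by_hand = ssr/sst, from_summary = summary(m)$r.squared)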
Consider an example where blood pressure is associated with the use of a drug whose effect interacts with obesity. The data are reported below.
# Blood pressure measurements for 20 subjects
bloodpressure <- c(158,163,173,178,168,188,183,198,178,193,186,191,196,181,176,185,190,195,200,180)
# Obesity status: first 10 subjects 'absent', last 10 'present' ('absent' is the reference level)
obesity <- factor(c(rep("absent",10),rep("present",10)), levels=c("absent","present"))
# Drug use: within each obesity group, 5 'present' followed by 5 'absent'
drug <- factor(rep(c(rep("present",5),rep("absent",5)),2), levels=c("absent","present"))
bpdata <- data.frame(bloodpressure, obesity, drug)
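As an optional sanity check (not part of the original analysis), a cross-tabulation confirms the balanced 2 x 2 design with 5 observations per cell:
with(bpdata, table(drug, obesity))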
The ggplot2 package is a useful tool for visualizing data from multifactorial experiments.
require(ggplot2)
ggplot(bpdata,aes(x=drug,y=bloodpressure,fill=obesity))+
geom_boxplot()+
geom_point(position = position_jitter())
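As an optional companion plot (a sketch, assuming ggplot2 >= 3.3 for the fun argument of stat_summary), connecting the group means by lines makes a possible drug-by-obesity interaction easier to see:
ggplot(bpdata, aes(x = drug, y = bloodpressure, colour = obesity, group = obesity)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")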
fit<-lm(bloodpressure~drug*obesity)
summary(fit)
Call:
lm(formula = bloodpressure ~ drug * obesity)
Residuals:
Min 1Q Median 3Q Max
-10 -5 0 5 10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 188.000 3.536 53.174 < 2e-16 ***
drugpresent -20.000 5.000 -4.000 0.00103 **
obesitypresent 2.000 5.000 0.400 0.69445
drugpresent:obesitypresent 16.000 7.071 2.263 0.03792 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.906 on 16 degrees of freedom
Multiple R-squared: 0.6063, Adjusted R-squared: 0.5325
F-statistic: 8.213 on 3 and 16 DF, p-value: 0.001553
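The coefficients reproduce the four group means: 188 for no drug and no obesity, 188 - 20 = 168 for drug only, 188 + 2 = 190 for obesity only, and 188 - 20 + 2 + 16 = 186 when both are present. A quick check against the raw data:
with(bpdata, tapply(bloodpressure, list(drug, obesity), mean))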
anova(fit)
Analysis of Variance Table
Response: bloodpressure
Df Sum Sq Mean Sq F value Pr(>F)
drug 1 720 720.0 11.52 0.003706 **
obesity 1 500 500.0 8.00 0.012109 *
drug:obesity 1 320 320.0 5.12 0.037917 *
Residuals 16 1000 62.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov(fit)
Call:
aov(formula = fit)
Terms:
drug obesity drug:obesity Residuals
Sum of Squares 720 500 320 1000
Deg. of Freedom 1 1 1 16
Residual standard error: 7.905694
Estimated effects may be unbalanced
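The ANOVA table ties back to the \( R^2 \) reported by summary(): SSR = 720 + 500 + 320 = 1540 and SSE = 1000, so SST = 2540 and \( R^2 = 1540/2540 \approx 0.6063 \). The same arithmetic in R:
ss <- anova(fit)[["Sum Sq"]]   # 720, 500, 320, 1000
ssr <- sum(ss[1:3]); sse <- ss[4]
ssr / (ssr + sse)              # matches the Multiple R-squared of 0.6063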