Joel Correa da Rosa
June 7th, 2017
One variable (\( Y \)) can be predicted from one or more variables (\( X_1,X_2,...,X_p \)).
\( Y \) is usually called dependent variable or response variable.
\( X_1,X_2,...,X_p \) are called independent variables, predictors, regressors, or explanatory variables, among other names.
In the classical regression framework, \( Y \) is normally distributed with a mean that changes as a function of the predictors.
When there is only one explanatory variable we have the simple linear regression model.
\( Y_i = \beta_0 +\beta_1X_i+\epsilon_i \)
\( \beta_0 \) is the intercept, which represents the mean value of \( Y \) when \( X=0 \).
\( \beta_1 \) is the slope, which represents the increase (or decrease) in the mean value of \( Y \) associated with a one-unit increase in \( X \).
\( \epsilon \) is the random error. For the purpose of making inference, we assume \( \epsilon \) to be normally distributed with mean 0 and variance \( \sigma^2 \).
Given a sample of pairs \( (X_i,Y_i) \), inference in the simple linear regression model consists of finding good estimates of \( \beta_0 \) and \( \beta_1 \).
One popular estimation method is least squares, which consists of finding the values of \( \beta_0 \) and \( \beta_1 \) that minimize the sum of squared deviations:
\( \sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2 \)
The least squares estimators can be derived analytically and are given by:
\( b_1=\frac{\sum xy -(\sum x)(\sum y)/n}{\sum x^2 -(\sum x)^2/n} \)
\( b_0= \bar{y} -b_1\bar{x} \)
Once we have calculated \( b_0 \) and \( b_1 \), given a value of \( X \), the value of \( Y \) can be predicted with the equation
\( \hat{Y}=b_0+b_1X \)
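As an illustration (a minimal sketch with simulated data, not part of the original example), the closed-form estimates can be computed directly in R and compared with lm():
# Simulated data: true intercept 2 and slope 0.5 (values chosen arbitrarily)
set.seed(123)
x <- runif(30, 0, 10)
y <- 2 + 0.5*x + rnorm(30, sd = 1)
n <- length(x)
# Closed-form least squares estimates
b1 <- (sum(x*y) - sum(x)*sum(y)/n) / (sum(x^2) - sum(x)^2/n)
b0 <- mean(y) - b1*mean(x)
c(b0 = b0, b1 = b1)
coef(lm(y ~ x))   # should match the hand-computed estimates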
The effect of a factor on the dependent variable may be influenced by the presence of other factors. To take into account the joint influence of \( p \) explanatory variables, multiple linear regression extends the simple linear regression model.
\( Y_i = \beta_0+\sum_{j=1}^p \beta_j X_{ji}+\epsilon_i \)
The slopes \( \beta_j \) have an interpretation similar to the one from simple linear regression: \( \beta_j \) quantifies the effect of a one-unit increase in \( X_j \), assuming that the other variables are held fixed.
\( \beta_j \) is the additional contribution of the variable \( X_j \)
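A minimal sketch of fitting a multiple linear regression in R with simulated data (the variable names and coefficients below are arbitrary, chosen only for illustration):
set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)
y <- 1 + 2*x1 - 1*x2 + 0*x3 + rnorm(n)
# Each estimated slope is adjusted for the other predictors
summary(lm(y ~ x1 + x2 + x3))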
Consider the multiple regression with two independent variables:
\( Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i}+\beta_3X_{1i}X_{2i}+\epsilon_i \)
\( \beta_3 \) quantifies the effect of the interaction between \( X_1 \) and \( X_2 \).
Consider that \( X_1 \) and \( X_2 \) are two binary variables representing exposure to a risk factor.
(1). If exposed to \( X_2 \) (\( X_2=1 \)):
\( \mu_{y|x} = \beta_0 +\beta_1+\beta_2+\beta_3 \) if exposed to \( X_1 \) (\( X_1=1 \))
\( \mu_{y|x} = \beta_0 +\beta_2 \) if not exposed to \( X_1 \) (\( X_1=0 \))
(2). If not exposed to \( X_2 \) (\( X_2=0 \)):
\( \mu_{y|x} = \beta_0 +\beta_1 \) if exposed to \( X_1 \) (\( X_1=1 \))
\( \mu_{y|x} = \beta_0 \) if not exposed to \( X_1 \) (\( X_1=0 \))
\( \mu_{y|x} \) can be read as the expected value of \( Y \) conditional on the values of \( X_1 \) and \( X_2 \).
When \( \beta_3>0 \) the interaction is called a synergistic effect, and when \( \beta_3<0 \) it is called an antagonistic effect.
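Combining the four conditional means above, \( \beta_3 \) can also be read as a difference of differences: the effect of \( X_1 \) among those exposed to \( X_2 \) minus its effect among those not exposed:
\( \beta_3 = (\mu_{y|X_1=1,X_2=1}-\mu_{y|X_1=0,X_2=1})-(\mu_{y|X_1=1,X_2=0}-\mu_{y|X_1=0,X_2=0}) \)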
If one variable has a quadratic effect, it can be included in the linear regression in the following way:
\( Y_i=\beta_0+\beta_1X_i+\beta_2X_i^2+\epsilon_i \)
The interpretation of \( \beta_1 \) is not the same as before: a one-unit increase in \( X \) changes the mean of \( Y \) by \( \beta_1+\beta_2(2X+1) \), since \( (X+1)^2-X^2=2X+1 \).
Note that the effect on the response variable now depends on the value of \( X \).
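A quadratic term can be included in an R formula with I(); the sketch below uses simulated data for illustration only:
set.seed(42)
x <- runif(50, 0, 5)
y <- 1 + 2*x - 0.3*x^2 + rnorm(50, sd = 0.5)
fit_quad <- lm(y ~ x + I(x^2))   # I() keeps x^2 as a literal squared term
coef(fit_quad)                   # estimates of beta0, beta1 and beta2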
Following the same principle used in the simple linear regression, the parameters can be estimated by minimization of the sum of squared deviations:
minimize \( S=\sum_{i=1}^n(Y_i-\beta_0-\sum_{j=1}^p\beta_jX_{ji})^2 \)
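For readers who prefer the matrix form, the minimizer solves the normal equations, \( b=(X'X)^{-1}X'Y \). A small sketch with simulated data (placeholder names) verifies that this agrees with lm():
set.seed(7)
n <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y <- 3 + 1.5*x1 - 2*x2 + rnorm(n)
X <- cbind(1, x1, x2)              # design matrix with a column of ones
b <- solve(t(X) %*% X, t(X) %*% y) # solves the normal equations
cbind(b, coef(lm(y ~ x1 + x2)))    # the two columns should agree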
The analysis of variance (ANOVA) approach consists of decomposing the total variation, represented by
\( SST=\sum_{i=1}^n(Y_i-\bar{Y})^2 \)
in the following way:
\( SST = SSR+SSE \)
where \( SSR \) stands for sum of squares of regression and \( SSE \) stands for sum of squares of errors.
Based on the analysis of variance approach, the \( R^2 \) statistic is a measure of goodness-of-fit.
\( R^2 = \frac{SSR}{SST} \) , \( R^2 \in [0,1] \)
The closer to 1, the better the fit.
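A short sketch (simulated data, for illustration only) showing how \( R^2 \) can be computed from the sums of squares and checked against summary():
set.seed(11)
x <- rnorm(40)
y <- 1 + x + rnorm(40)
m <- lm(y ~ x)
sst <- sum((y - mean(y))^2)   # total sum of squares
sse <- sum(residuals(m)^2)    # error (residual) sum of squares
ssr <- sst - sse              # regression sum of squares
c(by_hand = ssr/sst, from_summary = summary(m)$r.squared)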
Consider an example where blood pressure is associated with the use of a drug whose effect interacts with obesity. The data are reported below.
# Blood pressure measurements for 20 subjects
bloodpressure <- c(158,163,173,178,168,188,183,198,178,193,186,191,196,181,176,185,190,195,200,180)
# Obesity status: first 10 subjects 'absent', last 10 'present' ('absent' is the reference level)
obesity <- factor(c(rep("absent",10),rep("present",10)), levels=c("absent","present"))
# Drug use: within each obesity group, 5 'present' followed by 5 'absent'
drug <- factor(rep(c(rep("present",5),rep("absent",5)),2), levels=c("absent","present"))
bpdata <- data.frame(bloodpressure, obesity, drug)
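As an optional sanity check (not part of the original analysis), a cross-tabulation confirms the balanced 2 x 2 design with 5 observations per cell:
with(bpdata, table(drug, obesity))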
The ggplot2 package is a useful tool for visualizing data from multifactorial experiments.
require(ggplot2)
ggplot(bpdata,aes(x=drug,y=bloodpressure,fill=obesity))+
geom_boxplot()+
geom_point(position = position_jitter())
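As an optional companion plot (a sketch, assuming ggplot2 >= 3.3 for the fun argument of stat_summary), connecting the group means by lines makes a possible drug-by-obesity interaction easier to see:
ggplot(bpdata, aes(x = drug, y = bloodpressure, colour = obesity, group = obesity)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")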
fit<-lm(bloodpressure~drug*obesity)
summary(fit)
Call:
lm(formula = bloodpressure ~ drug * obesity)
Residuals:
Min 1Q Median 3Q Max
-10 -5 0 5 10
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 188.000 3.536 53.174 < 2e-16 ***
drugpresent -20.000 5.000 -4.000 0.00103 **
obesitypresent 2.000 5.000 0.400 0.69445
drugpresent:obesitypresent 16.000 7.071 2.263 0.03792 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 7.906 on 16 degrees of freedom
Multiple R-squared: 0.6063, Adjusted R-squared: 0.5325
F-statistic: 8.213 on 3 and 16 DF, p-value: 0.001553
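The coefficients reproduce the four group means: 188 for no drug and no obesity, 188 - 20 = 168 for drug only, 188 + 2 = 190 for obesity only, and 188 - 20 + 2 + 16 = 186 when both are present. A quick check against the raw data:
with(bpdata, tapply(bloodpressure, list(drug, obesity), mean))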
anova(fit)
Analysis of Variance Table
Response: bloodpressure
Df Sum Sq Mean Sq F value Pr(>F)
drug 1 720 720.0 11.52 0.003706 **
obesity 1 500 500.0 8.00 0.012109 *
drug:obesity 1 320 320.0 5.12 0.037917 *
Residuals 16 1000 62.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
aov(fit)
Call:
aov(formula = fit)
Terms:
drug obesity drug:obesity Residuals
Sum of Squares 720 500 320 1000
Deg. of Freedom 1 1 1 16
Residual standard error: 7.905694
Estimated effects may be unbalanced
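The ANOVA table ties back to the \( R^2 \) reported by summary(): SSR = 720 + 500 + 320 = 1540 and SSE = 1000, so SST = 2540 and \( R^2 = 1540/2540 \approx 0.6063 \). The same arithmetic in R:
ss <- anova(fit)[["Sum Sq"]]   # 720, 500, 320, 1000
ssr <- sum(ss[1:3]); sse <- ss[4]
ssr / (ssr + sse)              # matches the Multiple R-squared of 0.6063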