Regression analysis is a statistical tool for determining the relationship between an outcome (dependent) variable and one or more predictor (independent) variables. A simple regression model has a single outcome and a single predictor.
The data for a simple regression model come in pairs:
X: the predictor. \(x_1,x_2,\dots,x_n\) are \(n\) observations of \(X\).
Y: the response. \(y_1,y_2,\dots,y_n\) are \(n\) observations of \(Y\).
Each index represents one case or unit of observation in the data.
A simple linear model is given by:
\[ Y=\beta_0+\beta_1X+\epsilon \]
which is composed of a linear function and an error term that is usually assumed to have a normal distribution. The quantities \(\beta_0\) and \(\beta_1\) are called the model parameters and are estimated from the data. In a simple regression model, for each individual or unit \(i=1,2,\dots,n\) in the study, we have a pair of observations \((x_i,y_i)\).
The method used to obtain the best-fitting linear function is called the Ordinary Least Squares (OLS) method. It estimates the parameters of the model (intercept and slope) by minimizing the sum of squared residuals, and it assumes the following:
- Errors have mean zero.
- Errors are independent.
- Errors have equal variance (homoscedasticity).
- Errors follow a normal distribution.
- The relationship between the dependent variable and the independent variables is linear.
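Under this setup, minimizing the sum of squared residuals in the simple regression model has a closed-form solution, giving the familiar estimates
\[\hat\beta_1=\frac{\sum_{i=1}^n (x_i-\bar x)(y_i-\bar y)}{\sum_{i=1}^n (x_i-\bar x)^2},\qquad \hat\beta_0=\bar y-\hat\beta_1\bar x\]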
The estimated model parameters are denoted \(\hat\beta_0\) and \(\hat\beta_1\), and the estimated model is
\[\hat y=\hat\beta_0+\hat\beta_1 x\]
where \(\hat y\), the predicted mean of the outcome for a given value of \(x\), is also known as the fitted value.
\(\hat y_i=\hat\beta_0+\hat\beta_1 x_i\) predicts the value of the outcome for a given value \(x_i\).
\(e_i=y_i - \hat y_i\) is called the residual.
Regression can be seen as a model that explains the variability of the response variable using the predictor variable(s).
If only the intercept is used to model the response variable (the slope is restricted to zero), the regression line is a horizontal line at the mean of the response variable.
In this intercept-only model, the residual for each data point \(i\) is \(y_i - \bar y\).
The sum of these squared residuals, \(\sum_{i=1}^n (y_i - \bar y)^2\), is the Total Sum of Squares (TSS), representing the total variation in the outcome.
We can decompose the residual from the intercept-only model, \(y_i-\bar y\), into two parts using the fitted value \(\hat y_i\):
\[y_i-\bar y=(y_i - \hat y_i)+(\hat y_i - \bar y)\]
The total variation of the outcome (Total Sum of Squares, TSS) is therefore the sum of the unexplained variation (Residual Sum of Squares, RSS) and the variation explained by the model (Sum of Squares Regression, SSreg).
The ratio of SSreg to TSS is called \(R^2\); it is the proportion of the total variation that can be explained by the model. \(R^2\) lies between 0 and 1, and a higher \(R^2\) means the model explains more of the variation in the outcome.
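Squaring both sides of the decomposition above and summing over all observations (the cross-product term vanishes for an OLS fit with an intercept) gives the sum-of-squares identity and the definition of \(R^2\) in symbols:
\[\underbrace{\sum_{i=1}^n (y_i-\bar y)^2}_{\text{TSS}}=\underbrace{\sum_{i=1}^n (y_i-\hat y_i)^2}_{\text{RSS}}+\underbrace{\sum_{i=1}^n (\hat y_i-\bar y)^2}_{\text{SSreg}}\]
\[R^2=\frac{\text{SSreg}}{\text{TSS}}=1-\frac{\text{RSS}}{\text{TSS}}\]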
Example:
We will use the mtcars data set throughout this example.
data(mtcars)
class(mtcars)
[1] "data.frame"
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
[11] "carb"
dim(mtcars)
[1] 32 11
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
mpg cyl disp hp
Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
Median :19.20 Median :6.000 Median :196.3 Median :123.0
Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
drat wt qsec vs
Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
Median :3.695 Median :3.325 Median :17.71 Median :0.0000
Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
am gear carb
Min. :0.0000 Min. :3.000 Min. :1.000
1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
Median :0.0000 Median :4.000 Median :2.000
Mean :0.4062 Mean :3.688 Mean :2.812
3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
Max. :1.0000 Max. :5.000 Max. :8.000
Variable name and description:
mpg - Miles/(US) gallon
cyl - Number of cylinders
disp- Displacement (cu.in.)
hp - Gross horsepower
drat- Rear axle ratio
wt - Weight (1000 lbs)
qsec- 1/4 mile time
vs - Engine (0 = V-shaped, 1 = straight)
am - Transmission (0 = automatic, 1 = manual)
gear- Number of forward gears
carb- Number of carburetors
Using the linear model function lm(), we study the relationship between gross horsepower and miles per gallon.
d <- lm(hp ~ mpg, data = mtcars)
d
Call:
lm(formula = hp ~ mpg, data = mtcars)
Coefficients:
(Intercept) mpg
324.08 -8.83
Here, the response variable is hp and the predictor variable is mpg.
The estimated linear function is
\[\widehat{\text{hp}}=324.08-8.83\,\text{mpg}\]
The coefficient of mpg is -8.83, which implies that for a one-unit increase in miles per gallon we would expect an 8.83 decrease in horsepower.
We can use the functions ls() or str() to list the components of the object d.
ls(d)
[1] "assign" "call" "coefficients" "df.residual"
[5] "effects" "fitted.values" "model" "qr"
[9] "rank" "residuals" "terms" "xlevels"
To access these components, we can use the $ operator after the name of the lm object.
d$coefficients
(Intercept) mpg
324.082314 -8.829731
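Equivalently, R provides extractor functions for the most commonly used components; a small sketch:
coef(d) # same as d$coefficients
head(fitted(d)) # first few fitted values (d$fitted.values)
head(resid(d)) # first few residuals (d$residuals)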
To get confidence intervals for the model coefficients, we use confint().
confint(d)
2.5 % 97.5 %
(Intercept) 268.05605 380.108579
mpg -11.50426 -6.155202
We can use the function anova() to extract the sums of squares of the model.
anova(d)
Analysis of Variance Table
Response: hp
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 87791 87791 45.46 1.788e-07 ***
Residuals 30 57936 1931
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the ANOVA table, the sum of squares of the regression model appears in the row named "mpg", and the residual sum of squares, with its degrees of freedom, appears in the row below. The table also shows the F-statistic that appears in the output of summary() for the model. Mean squares are the sums of squares divided by their degrees of freedom.
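As a quick check, the R-squared of this simple model can be recovered from the ANOVA sums of squares; a small sketch:
ss <- anova(d)[["Sum Sq"]] # sums of squares: SSreg for mpg, then RSS
ss[1] / sum(ss) # R-squared = SSreg / TSS, about 0.60 here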
Another important function is predict(), which can be used to obtain predicted values for new data points. First we need to create a new data.frame containing all the predictors of the model.
new.data <- data.frame(mpg = c(30, 20, 25))
predict(d, newdata = new.data)
1 2 3
59.19038 147.48769 103.33904
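predict() can also return interval estimates around each prediction via its interval argument; for example, a sketch requesting confidence intervals for the mean response at the new data points:
predict(d, newdata = new.data, interval = "confidence") # adds lwr and upr columns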
plot(hp ~ mpg, mtcars) #scatter plot of hp vs mpg
abline(d, col = "blue") #add regression line to the scatter plot
A multiple regression has one outcome and multiple predictors. To explain more of the variation, we can add more predictors. In our previous example, we add rear axle ratio and weight as predictors.
d <- lm(hp ~ mpg + drat + wt, data = mtcars)
summary(d)
Call:
lm(formula = hp ~ mpg + drat + wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-52.85 -33.08 -15.53 28.29 138.84
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 270.869 135.698 1.996 0.05573 .
mpg -9.860 2.730 -3.611 0.00118 **
drat 19.576 21.803 0.898 0.37690
wt 1.087 17.545 0.062 0.95104
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.8 on 28 degrees of freedom
Multiple R-squared: 0.6144, Adjusted R-squared: 0.5731
F-statistic: 14.87 on 3 and 28 DF, p-value: 5.567e-06
anova(d)
Analysis of Variance Table
Response: hp
Df Sum Sq Mean Sq F value Pr(>F)
mpg 1 87791 87791 43.7470 3.565e-07 ***
drat 1 1738 1738 0.8659 0.3601
wt 1 8 8 0.0038 0.9510
Residuals 28 56190 2007
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that, based on the F-test, the overall model is a significant improvement in fit compared to the intercept-only model. However, only the coefficient of mpg is significantly different from the null value of zero.
The R-squared is 0.6144, meaning that approximately 61% of the variability of hp is accounted for by the variables in the model. The adjusted R-squared shows that, after taking the number of predictors into account, the R-squared is about 0.57.
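The adjusted R-squared can be reproduced from the reported R-squared, the sample size, and the number of predictors; a small arithmetic sketch:
r2 <- 0.6144; n <- 32; p <- 3 # values taken from the summary output above
1 - (1 - r2) * (n - 1) / (n - p - 1) # about 0.573, matching the reported adjusted R-squared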
The coefficient of each variable indicates the expected change in hp for a one-unit increase in that variable, with all other variables in the model held constant. For example, consider the variable drat: we would expect an increase of about 20 in hp for every one-unit increase in drat, assuming all other variables in the model are held constant.
We also see quite a difference in the coefficient of mpg compared to the simple linear regression: before, the coefficient of mpg was -8.83, and now it is -9.86.
Standardized regression coefficients can be obtained by transforming the outcome and predictor variables to their standardized (z) scores. They are used to compare the relative strength of the various predictors within the model when the variables are measured in different units.
d <- lm(scale(hp) ~ scale(mpg) + scale(drat) + scale(wt), data = mtcars)
summary(d)
Call:
lm(formula = scale(hp) ~ scale(mpg) + scale(drat) + scale(wt),
data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.7708 -0.4825 -0.2266 0.4127 2.0251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.196e-17 1.155e-01 0.000 1.00000
scale(mpg) -8.667e-01 2.400e-01 -3.611 0.00118 **
scale(drat) 1.527e-01 1.700e-01 0.898 0.37690
scale(wt) 1.551e-02 2.504e-01 0.062 0.95104
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6534 on 28 degrees of freedom
Multiple R-squared: 0.6144, Adjusted R-squared: 0.5731
F-statistic: 14.87 on 3 and 28 DF, p-value: 5.567e-06
In the standardized regression summary we see that the intercept is zero and that all t-statistics (and p-values) for the other coefficients are exactly the same as in the original model.
Because the coefficients are all in the same standardized units, standard deviations, we can compare them to assess the relative strength of each predictor. In this example, mpg has the largest beta coefficient, -0.8667.
Thus, a one standard deviation increase in mpg leads to a 0.8667 standard deviation decrease in predicted hp, with the other variables held constant.
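A standardized coefficient is the raw coefficient multiplied by the ratio of the predictor's standard deviation to the outcome's standard deviation; a small sketch verifying this for mpg:
b <- coef(lm(hp ~ mpg + drat + wt, data = mtcars)) # raw coefficients from the unscaled fit
b["mpg"] * sd(mtcars$mpg) / sd(mtcars$hp) # about -0.867, matching the scaled model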
Returning to our previous model with hp as the outcome, we now let mpg, wt, and the interaction of mpg and wt be our predictors.
d <- lm(hp ~ mpg + wt + mpg * wt, data = mtcars)
summary(d)
Call:
lm(formula = hp ~ mpg + wt + mpg * wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-53.11 -29.18 -12.92 31.01 140.35
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 311.320 106.789 2.915 0.00692 **
mpg -6.026 3.773 -1.597 0.12146
wt 18.757 24.415 0.768 0.44877
mpg:wt -1.741 1.379 -1.262 0.21723
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 44.2 on 28 degrees of freedom
Multiple R-squared: 0.6247, Adjusted R-squared: 0.5845
F-statistic: 15.53 on 3 and 28 DF, p-value: 3.845e-06
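Note that in R's formula syntax mpg * wt already expands to the main effects plus their interaction, so an equivalent and more compact call is possible (d2 is just a placeholder name for the refit):
d2 <- lm(hp ~ mpg * wt, data = mtcars) # mpg * wt means mpg + wt + mpg:wt
all.equal(coef(d2), coef(d)) # TRUE: the two formulas fit the same model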
Regression diagnostics check how well the data meet the assumptions of OLS regression, such as:
- Homogeneity of variance (homoscedasticity): the error variance should be constant.
- Linearity: the relationships between the predictors and the outcome variable should be linear.
- Independence: the errors associated with one observation are not correlated with the errors of any other observation.
- Normality: the errors should be normally distributed. Technically, normality is necessary only for the hypothesis tests to be valid.
- Model specification: the model should be properly specified (including all relevant variables and excluding irrelevant variables).
Two further issues can arise during the analysis: influence, where individual observations exert undue influence on the coefficients, and collinearity, where predictors that are highly collinear (i.e., linearly related) cause problems in estimating the regression coefficients.
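Base R's plot() method for lm objects gives a quick first look at several of these assumptions; a minimal sketch, refitting the multiple regression model used below:
d <- lm(hp ~ mpg + drat + wt, data = mtcars)
par(mfrow = c(2, 2)) # 2 x 2 grid: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
plot(d)
par(mfrow = c(1, 1)) # reset the plotting layout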
There are three ways in which an observation can be unusual:
- Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or some other problem.
- Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can affect the estimates of the regression coefficients.
- Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.
Now, using our previous example, we examine the multiple regression model of hp on mpg, drat, and wt.
library(car)
Loading required package: carData
library(alr4)
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(faraway)
Attaching package: 'faraway'
The following objects are masked from 'package:alr4':
cathedral, pipeline, twins
The following objects are masked from 'package:car':
logit, vif
d <- lm(hp ~ mpg + drat + wt, data = mtcars)
scatterplotMatrix(~ hp + mpg + drat + wt, data = mtcars)
Observe that the plots of hp against the other variables show some potential problems: in every plot, we see one or more data points that lie far away from the rest of the data.
Standardized (internally studentized) residuals can be used to identify outliers. In R, we use the rstandard() function to compute them.
res.std <- rstandard(d) # standardized residuals stored in the vector res.std
# plot the standardized residuals on the y-axis; the x-axis is the observation index
plot(res.std, ylab = "Standardized Residual", ylim = c(-3.5, 3.5))
# add horizontal lines at -3, 0, and 3 to identify extreme values
abline(h = c(-3, 0, 3), lty = 2)
# find out which data points fall outside the 3-standard-deviation cut-off
# index holds the row numbers of those points
index <- which(res.std > 3 | res.std < -3)
# label the extreme points (here with their hp values, offset to the left along the x-axis)
text(index - 20, res.std[index], labels = mtcars$hp[index])
print(index)
Maserati Bora
31
These results show that Maserati Bora is the most worrisome observation.
outlierTest(d)
rstudent unadjusted p-value Bonferroni p
Maserati Bora 3.999845 0.00044275 0.014168
One way to find points with high leverage is to use a half-normal plot from the package "faraway". First we need to compute the diagonal of the hat matrix as a measure of leverage.
# a vector containing the diagonal of the 'hat' matrix
h <- influence(d)$hat
# half-normal plot of the leverages, from package faraway
halfnorm(h, ylab = "leverage")
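As a rough screen, a common rule of thumb (one of several in use) flags observations whose leverage exceeds 2p/n, where p is the number of estimated coefficients; a small sketch:
p <- length(coef(d)) # number of estimated coefficients
n <- nrow(mtcars)
which(h > 2 * p / n) # observations with relatively high leverage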
Cook's distance is a measure of influence. A point with a high Cook's distance is considered a high-influence point. We can plot Cook's distance using the following code:
# the cut-off value for Cook's distance
cutoff <- 4 / (nrow(mtcars) - length(d$coefficients) - 2)
plot(d, which = 4, cook.levels = cutoff)
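We can also list the observations whose Cook's distance exceeds this cut-off directly:
cooks <- cooks.distance(d) # Cook's distance for each observation
cooks[cooks > cutoff] # observations above the chosen cut-off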
We can use the influencePlot() function in the package "car" to identify influential points. It plots studentized residuals against leverage, with circle size proportional to Cook's distance.
#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(d, main="Influence Plot",
sub="Circle size is proportional to Cook's Distance" )
StudRes Hat CookD
Chrysler Imperial 0.9593046 0.33758022 0.1175805
Toyota Corolla 1.1296378 0.24465796 0.1023229
Lotus Europa 1.7112186 0.19766298 0.1687314
Ford Pantera L 1.6403401 0.22415346 0.1832799
Maserati Bora 3.9998448 0.07792233 0.2201014
Finally, the infIndexPlot() function gives us a series of plots that we need in order to investigate influential points.
# four diagnostic plots to identify influential points
infIndexPlot(d)
# residuals vs. fitted values, to check homoscedasticity
plot(d$residuals ~ d$fitted.values)
# add a horizontal line at 0
abline(h = 0, lty = 2)
# residuals vs. fitted values and each predictor, plus tests for curvature
residualPlots(d)
Test stat Pr(>|Test stat|)
mpg 2.0542 0.04976 *
drat -0.8886 0.38205
wt 0.5860 0.56274
Tukey test 2.5652 0.01031 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# plot the residuals against the observed values of hp
plot(d$residuals ~ mtcars$hp) # residuals vs. observed response
qqnorm(d$residuals) # normal quantile-quantile plot of the residuals
qqline(d$residuals)
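As a complement to the Q-Q plot, a formal check of the normality assumption can be run on the residuals; one option is the Shapiro-Wilk test (a sketch):
shapiro.test(d$residuals) # null hypothesis: the residuals are normally distributed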
The primary concern is that as the degree of multicollinearity
increases, the regression model estimates of the coefficients become
unstable and the standard errors for the coefficients can get wildly
inflated.
VIF, the variance inflation factor, is used to measure the degree of multicollinearity. As a rule of thumb, a variable whose VIF value is greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10; it means that the variable could be considered a linear combination of the other independent variables.
car::vif(d) #variance inflation factor
mpg drat wt
4.182297 2.099243 4.552402
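Since tolerance is defined as 1/VIF, it can be computed from the same output:
1 / car::vif(d) # tolerance for each predictor; values below 0.1 would suggest problematic collinearity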
Model Specification
A model specification error can occur when one or more relevant
variables are omitted from the model or one or more irrelevant variables
are included in the model. If relevant variables are omitted from the
model, the common variance they share with included variables may be
wrongly attributed to those variables, and the error term is inflated.
On the other hand, if irrelevant variables are included in the model,
the common variance they share with included variables may be wrongly
attributed to them. Model specification errors can substantially affect
the estimate of regression coefficients.
There are many methods and statistical tests for checking whether a variable is relevant to the model. To visualize the effect of each variable in the model, we can use an added-variable plot, also called a partial-regression plot. The added-variable plot for a given predictor is a scatter plot of the residuals from a model that excludes that predictor against the residuals from a model that regresses the excluded predictor on the other predictors. The slope of the simple regression between those two sets of residuals equals the coefficient of the excluded variable in the full model. If the slope is not close to zero, we conclude that the variable is relevant to the model.
Added-variable plots are also useful for detecting influential points.
avPlots(d) #added variable plots from package "car"
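To see this construction explicitly, we can build the added-variable plot for mpg by hand; the names res.y and res.x below are just illustrative, and the slope of the residual-on-residual regression reproduces the mpg coefficient from the full model:
res.y <- resid(lm(hp ~ drat + wt, data = mtcars)) # hp with mpg excluded from the model
res.x <- resid(lm(mpg ~ drat + wt, data = mtcars)) # mpg regressed on the other predictors
plot(res.y ~ res.x) # hand-made added-variable plot for mpg
coef(lm(res.y ~ res.x))["res.x"] # about -9.86, the mpg coefficient from the full model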