INTRODUCTION TO REGRESSION IN R

Simple Regression

  • Regression analysis is a statistical tool for determining the relationship between an outcome (dependent) variable and one or more predictor (independent) variables. A simple regression model has a single outcome and a single predictor.

  • The data in a simple regression model are given as pairs.
    X: The predictor. \(x_1,x_2,\dots,x_n\) are \(n\) observations from \(X\).
    Y: The response. \(y_1,y_2,\dots,y_n\) are \(n\) observations from \(Y\).
    Each index represents one case or unit of observation in the data.

  • A simple linear model is given by:
    \[ Y=\beta_0+\beta_1X+\epsilon \] which is composed of a linear function and an error term that is usually assumed to have a normal distribution. The quantities \(\beta_0\) and \(\beta_1\) are called the model parameters and are estimated from the data. In a simple regression model, for each individual or unit \(i=1,2,\dots,n\) in the study, we have a pair of observations, \((x_i,y_i)\).

  • The method used to find the best-fitting linear function is called the Ordinary Least Squares (OLS) method. It estimates the parameters of the model (intercept and slope) by minimizing the sum of squared residuals, and it assumes the following:

    • Errors have mean equal to zero.

    • Errors are independent.

    • Errors have equal variances (Homoscedasticity).

    • Errors follow a normal distribution.

    • The relationship between the dependent variable and the independent variable(s) is linear.

    • The estimated model parameters are denoted \(\hat\beta_0\) and \(\hat\beta_1\), and the estimated model is
      \[\hat y=\hat\beta_0+\hat\beta_1 x\]
      where \(\hat y\) is the predicted mean of the outcome for a given value of \(x\), also known as the fitted value.

    • \(\hat y_i=\hat\beta_0+\hat\beta_1 x_i\) predicts the value of the outcome for a given value of x.

    • \(e_i=y_i - \hat y_i\) is called the residual.

  • Regression can be seen as a model to explain variability of the response variable using the predictor variable(s).

  • If only the intercept is used to model the response variable (slope restricted to be zero), the regression line will be a horizontal line at the mean of the response variable.

  • From this intercept-only model, the residual for each data point \(i\) will be \(y_i - \bar y\).

  • The sum of these squared residuals, \(\sum_{i=1}^n (y_i - \bar y)^2\), is the Total Sum of Squares (TSS), representing the total variation in the outcome.

  • We can decompose the residual from the intercept-only model, \(y_i-\bar y\), into two parts using the fitted value \(\hat y_i\), which is written as
    \[y_i-\bar y=(y_i - \hat y_i)+(\hat y_i - \bar y)\]
    where,

    • \(\bar y\) is the mean response
    • \(\hat y_i\) is the fitted value from the regression model
    • \(y_i\) is the observed response.
  • The total variation of the outcome (Total Sum of Squares (TSS)) is the sum of the unexplained variation (Residual Sum of Squares (RSS)) and the variation explained by the model (Sum of Squares Regression (SSreg)).

  • The ratio of SSreg to TSS is called \(R^2\), the proportion of the total variation that can be explained by the model. \(R^2\) is between 0 and 1; a higher \(R^2\) means the model explains more of the variation in the outcome. A short sketch illustrating these quantities follows this list.
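
The sketch below is only an illustration of the formulas above, using simulated data (not part of the example that follows); it computes the OLS estimates and the TSS = RSS + SSreg decomposition by hand and compares them with lm().

set.seed(1)
n <- 50
x <- rnorm(n)                        # predictor
y <- 2 + 3 * x + rnorm(n)            # response with beta0 = 2, beta1 = 3, normal errors

# OLS estimates: slope = cov(x, y) / var(x), intercept = ybar - slope * xbar
b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)

# Fitted values and residuals
y.hat <- b0 + b1 * x
e     <- y - y.hat

# Variance decomposition: TSS = RSS + SSreg, and R-squared = SSreg / TSS
TSS   <- sum((y - mean(y))^2)
RSS   <- sum(e^2)
SSreg <- sum((y.hat - mean(y))^2)
SSreg / TSS                          # R-squared

# The same fit with lm(); coefficients should match c(b0, b1)
coef(lm(y ~ x))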

Example:
We will use the data “mtcars” throughout the example.

data(mtcars)
class(mtcars)
[1] "data.frame"
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
[11] "carb"
dim(mtcars)
[1] 32 11
  • class() is used to determine the object’s type, names() is used to get the names of the variables in a data.frame, and dim() is used to determine the dimension, i.e., the number of rows and columns.
    Here we have 11 variables with 32 observations. Next, we take a look at the structure of the data and the summary statistics.
str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :5.000   Max.   :8.000  

Variable name and description:

mpg  - Miles/(US) gallon
cyl  - Number of cylinders
disp - Displacement (cu.in.)
hp   - Gross horsepower
drat - Rear axle ratio
wt   - Weight (1000 lbs)
qsec - 1/4 mile time
vs   - Engine (0 = V-shaped, 1 = straight)
am   - Transmission (0 = automatic, 1 = manual)
gear - Number of forward gears
carb - Number of carburetors

Using the linear model function lm(), we study the relationship between gross horsepower and miles per gallon.

d <- lm(hp ~ mpg, data = mtcars)
d

Call:
lm(formula = hp ~ mpg, data = mtcars)

Coefficients:
(Intercept)          mpg  
     324.08        -8.83  

Here, the response variable is hp and the predictor variable is mpg.
The estimated linear function is
\[\widehat{hp}=324.08-8.83\,mpg\]
The coefficient of mpg is -8.83, which implies that for a one-unit increase in miles per gallon, we would expect an 8.83 decrease in horsepower.
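
As a quick check of this interpretation (a small sketch using hypothetical mpg values of 20 and 21), the predicted hp values differ by exactly the estimated slope:

b <- coef(d)
(b[1] + b[2] * 21) - (b[1] + b[2] * 20)   # equals b[2], about -8.83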

We can use functions ls() or str() to list the components of object d.

ls(d)
 [1] "assign"        "call"          "coefficients"  "df.residual"  
 [5] "effects"       "fitted.values" "model"         "qr"           
 [9] "rank"          "residuals"     "terms"         "xlevels"      

To access the above components, we can use the $ sign after the lm object name.

d$coefficients
(Intercept)         mpg 
 324.082314   -8.829731 

To get the confidence intervals for the coefficients of the model, we use confint()

confint(d)
                2.5 %     97.5 %
(Intercept) 268.05605 380.108579
mpg         -11.50426  -6.155202

We can use the function anova() to extract the sums of squares of the model.

anova(d)
Analysis of Variance Table

Response: hp
          Df Sum Sq Mean Sq F value    Pr(>F)    
mpg        1  87791   87791   45.46 1.788e-07 ***
Residuals 30  57936    1931                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the ANOVA table, we have the sum of squares of the regression model in the row named “mpg”, and the sum of squares of the residuals, with their degrees of freedom, in the row below. The table also shows the F-statistic that is reported in the summary of the model. Mean squares are sums of squares divided by their degrees of freedom.
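
As a small sketch relating the ANOVA table to \(R^2\) (assuming d is still the simple model hp ~ mpg fitted above):

ss <- anova(d)[["Sum Sq"]]   # c(SSreg, RSS) = c(87791, 57936)
ss[1] / sum(ss)              # R-squared = SSreg / TSS, about 0.60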

Another important function is predict(), which can be used to get predicted values for new data points. First, we need to create a new data.frame containing all predictors of the model.

new.data <- data.frame(mpg = c(30, 20, 25))
predict(d, newdata = new.data)
        1         2         3 
 59.19038 147.48769 103.33904 
  • Plotting a scatter plot and the regression line
plot(hp ~ mpg, mtcars) #scatter plot of hp vs mpg
abline(d, col = "blue")       #add regression line to the scatter plot

Multiple Regression

A multiple regression model has one outcome and multiple predictors. In order to explain more variation, we can add more predictors. In our previous example, we can add rear axle ratio (drat) and weight (wt) as predictors.

d <- lm(hp ~ mpg + drat + wt, data = mtcars)
summary(d)

Call:
lm(formula = hp ~ mpg + drat + wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-52.85 -33.08 -15.53  28.29 138.84 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  270.869    135.698   1.996  0.05573 . 
mpg           -9.860      2.730  -3.611  0.00118 **
drat          19.576     21.803   0.898  0.37690   
wt             1.087     17.545   0.062  0.95104   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.8 on 28 degrees of freedom
Multiple R-squared:  0.6144,    Adjusted R-squared:  0.5731 
F-statistic: 14.87 on 3 and 28 DF,  p-value: 5.567e-06
anova(d)
Analysis of Variance Table

Response: hp
          Df Sum Sq Mean Sq F value    Pr(>F)    
mpg        1  87791   87791 43.7470 3.565e-07 ***
drat       1   1738    1738  0.8659    0.3601    
wt         1      8       8  0.0038    0.9510    
Residuals 28  56190    2007                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • We see that, based on the F-test, the overall model is a significant improvement in fit compared to the intercept-only model. However, only the coefficient of mpg is significantly different from the null value of zero.

  • The R-squared is 0.6144, meaning that approximately 61% of the variability of hp is accounted for by the variables in the model. The adjusted R-squared shows that, after taking into account the number of predictors, the model explains about 57% of the variability.

  • The coefficient for each variable indicates the amount of change one could expect in hp given a one-unit increase in the value of that variable, with all other variables in the model held constant. For example, consider the variable drat: we would expect an increase of about 20 in hp for every one-unit increase in drat, assuming that all other variables in the model are held constant (see the sketch after this list).

  • We see quite a difference in the coefficient of mpg compared to the simple linear regression. Before, the coefficient for mpg was -8.83; now it is -9.86.
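
The sketch below illustrates the “holding other variables constant” interpretation with hypothetical values (mpg = 20, wt = 3, and drat differing by one unit); the difference in predictions equals the drat coefficient:

nd <- data.frame(mpg = 20, drat = c(3, 4), wt = 3)
diff(predict(d, newdata = nd))   # equals coef(d)["drat"], about 19.6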

  • Standardized regression coefficients can be obtained by transforming the outcome and predictor variables to their standardized scores (z-scores). This is used to compare the relative strength of the various predictors within the model when the variables are measured in different units.

  • We use function scale() to do this for each variable.
d <- lm(scale(hp) ~ scale(mpg) + scale(drat) + scale(wt), data = mtcars)
summary(d)

Call:
lm(formula = scale(hp) ~ scale(mpg) + scale(drat) + scale(wt), 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.7708 -0.4825 -0.2266  0.4127  2.0251 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -1.196e-17  1.155e-01   0.000  1.00000   
scale(mpg)  -8.667e-01  2.400e-01  -3.611  0.00118 **
scale(drat)  1.527e-01  1.700e-01   0.898  0.37690   
scale(wt)    1.551e-02  2.504e-01   0.062  0.95104   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6534 on 28 degrees of freedom
Multiple R-squared:  0.6144,    Adjusted R-squared:  0.5731 
F-statistic: 14.87 on 3 and 28 DF,  p-value: 5.567e-06

In the standardized regression summary, we see that the intercept is zero and the t-statistics (and p-values) for the other coefficients are exactly the same as in the original model.

Because the coefficients are all in the same standardized units (standard deviations), you can compare them to assess the relative strength of each predictor. In this example, mpg has the largest beta coefficient in absolute value, -0.8667.

Thus, a one standard deviation increase in mpg leads to a 0.8667 standard deviation decrease in predicted hp, with the other variables held constant.
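
An alternative sketch (just a check, not required): the standardized coefficients can also be recovered from the unstandardized fit as \(\hat\beta_j \, sd(x_j)/sd(y)\).

fit <- lm(hp ~ mpg + drat + wt, data = mtcars)   # unstandardized fit
sds <- sapply(mtcars[c("mpg", "drat", "wt")], sd)
coef(fit)[-1] * sds / sd(mtcars$hp)              # matches the scale() coefficients above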

Regression Model with a Two-Way Interaction

From our previous model with hp as the outcome, we let mpg, wt, and the interaction of mpg and wt be our predictors.

d <- lm(hp ~ mpg + wt + mpg * wt, data = mtcars)
summary(d)

Call:
lm(formula = hp ~ mpg + wt + mpg * wt, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-53.11 -29.18 -12.92  31.01 140.35 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  311.320    106.789   2.915  0.00692 **
mpg           -6.026      3.773  -1.597  0.12146   
wt            18.757     24.415   0.768  0.44877   
mpg:wt        -1.741      1.379  -1.262  0.21723   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 44.2 on 28 degrees of freedom
Multiple R-squared:  0.6247,    Adjusted R-squared:  0.5845 
F-statistic: 15.53 on 3 and 28 DF,  p-value: 3.845e-06
  • The intercept is interpreted as before: it is the expected hp when both mpg and wt are set to zero.
  • The coefficient of mpg is interpreted as a 6.026 decrease in the mean of hp for a one-unit increase in mpg, given that wt is zero.
  • The coefficient of wt is interpreted as an 18.757 increase in the mean of hp for a one-unit increase in wt (i.e., 1000 lbs), given that mpg is zero.
  • Adding the interaction term means that the effect of each of mpg and wt is no longer constant across values of the other predictor; in this case, however, the interaction effect is not significant (see the sketch below).
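
A small sketch of this point: with the interaction term, the slope of mpg depends on wt as coef(mpg) + coef(mpg:wt) * wt. Using hypothetical weights of 2, 3, and 4 (thousand lbs):

b <- coef(d)
b["mpg"] + b["mpg:wt"] * c(2, 3, 4)   # the mpg slope at wt = 2, 3, 4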

Regression Diagnostics

Regression diagnostics check how well the data meet the assumptions of OLS regression, such as:
- Homogeneity of variance (homoscedasticity): the error variance should be constant
- Linearity: the relationships between the predictors and the outcome variable should be linear
- Independence: the errors associated with one observation are not correlated with the errors of any other observation
- Normality: the errors should be normally distributed; technically, normality is necessary only for hypothesis tests to be valid
- Model specification: the model should be properly specified (including all relevant variables and excluding irrelevant ones)

  • Two additional issues can arise during the analysis: influence, where individual observations exert undue influence on the coefficients, and collinearity, where predictors that are highly collinear (i.e., nearly linearly related) cause problems in estimating the regression coefficients.

  • There are three ways that an observation can be unusual (see the sketch after this list):

    • Outliers: In linear regression, an outlier is an observation with a large residual. In other words, it is an observation whose dependent-variable value is unusual given its values on the predictor variables. An outlier may indicate a sample peculiarity, a data entry error, or another problem.

    • Leverage: An observation with an extreme value on a predictor variable is called a point with high leverage. Leverage is a measure of how far an observation deviates from the mean of that variable. High-leverage points can have an effect on the estimates of the regression coefficients.

    • Influence: An observation is said to be influential if removing it substantially changes the estimates of the coefficients. Influence can be thought of as the product of leverage and outlierness.
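
A minimal sketch of the base R functions that quantify each of these, applied to the multiple regression model that is refit below:

fit <- lm(hp ~ mpg + drat + wt, data = mtcars)
head(rstudent(fit))        # studentized residuals, used to flag outliers
head(hatvalues(fit))       # leverage: diagonal of the hat matrix
head(cooks.distance(fit))  # influence: combines residual size and leverage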

Now, using our previous example, we refit the multiple regression model of hp on mpg, drat, and wt and look at the scatterplot matrix of these variables.

library(car)
Loading required package: carData
library(alr4)
Loading required package: effects
lattice theme set by effectsTheme()
See ?effectsTheme for details.
library(faraway)

Attaching package: 'faraway'
The following objects are masked from 'package:alr4':

    cathedral, pipeline, twins
The following objects are masked from 'package:car':

    logit, vif
d <- lm(hp ~ mpg + drat + wt, data = mtcars)
scatterplotMatrix(~ hp + mpg + drat + wt, data = mtcars)

Observe that the graphs of hp against the other variables show some potential problems. In every plot, we see one or more data points that are far away from the rest of the data.

Standardized residuals can be used to identify outliers. In R, we use the rstandard() function to compute standardized (internally studentized) residuals.

res.std <- rstandard(d) #standardized residuals stored in vector res.std
#plot standardized residuals on the y axis; the x axis is the observation index
plot(res.std, ylab="Standardized Residual", ylim=c(-3.5,3.5))
#add horizontal reference lines at -3, 0, and 3 to identify extreme values
abline(h = c(-3, 0, 3), lty = 2)
#find out which data points fall outside the +/-3 standard deviation cut-off
#index holds the row numbers of those points
index <- which(res.std > 3 | res.std < -3)
#add the car name next to points that have extreme values
text(index, res.std[index], labels = rownames(mtcars)[index], pos = 2)

print(index)
Maserati Bora 
           31 

These results show that Maserati Bora is the most worrisome observation. The outlierTest() function in the package “car” confirms this with a Bonferroni-adjusted test of the largest studentized residual.

outlierTest(d)
              rstudent unadjusted p-value Bonferroni p
Maserati Bora 3.999845         0.00044275     0.014168

One way to find points with high leverage is to use a half-normal plot from the package “faraway”. First, we compute the diagonal of the hat matrix as a measure of leverage.

#a vector containing the diagonal of the 'hat' matrix
h <- influence(d)$hat
#half-normal plot of leverage from package faraway
halfnorm(h, ylab = "leverage")

Cook’s distance is a measure of influence. A point with a high Cook’s distance is considered a high-influence point. We can plot Cook’s distance using the following code:

#the cutoff value for Cook's distance
cutoff <- 4/((nrow(mtcars)-length(d$coefficients)-2))
plot(d, which = 4, cook.levels = cutoff)

We can use the influencePlot() function in the package “car” to identify influential points. It plots studentized residuals against leverage, with circle size proportional to Cook’s distance.

#cook's distance, studentized residuals, and leverage in the same plot
influencePlot(d, main="Influence Plot", 
              sub="Circle size is proportional to Cook's Distance" )

                    StudRes        Hat     CookD
Chrysler Imperial 0.9593046 0.33758022 0.1175805
Toyota Corolla    1.1296378 0.24465796 0.1023229
Lotus Europa      1.7112186 0.19766298 0.1687314
Ford Pantera L    1.6403401 0.22415346 0.1832799
Maserati Bora     3.9998448 0.07792233 0.2201014

Finally, the infIndexPlot() function gives us a series of index plots that we can use to investigate influential points.

#4 diagnostic plots to identify influential points
infIndexPlot(d)

  • Checking Homoscedasticity
    If the model is well fitted, there should be no pattern in the residuals plotted against the fitted values. We can check the homoscedasticity of the residual variance with this plot.
#residual vs. fitted value plot for Homoscedasticity
plot(d$resid ~ d$fitted.values)
#add horizontal line at 0
abline(h = 0, lty = 2)

  • Checking Linearity
    To check linearity, residuals should be plotted against the fitted values as well as against the other predictors. If any of these plots show systematic shapes, then the linear model is not appropriate and some nonlinear terms may need to be added. In the package car, the function residualPlots() produces these plots along with a curvature test for each.
#residual vs. fitted value and all predictors plus test for curvature
residualPlots(d)

           Test stat Pr(>|Test stat|)  
mpg           2.0542          0.04976 *
drat         -0.8886          0.38205  
wt            0.5860          0.56274  
Tukey test    2.5652          0.01031 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Issues of Independence
    A simple visual check is to plot the residuals versus a time or order variable. Here the data are not a time series, so we use the row order of the observations as the ordering variable.
#plotting the residuals against the observation order
plot(d$resid ~ seq_len(nrow(mtcars)))   #residual plot vs. observation order

  • Checking Normality of Residuals
    Normality of residuals is required only for valid hypothesis testing; that is, the normality assumption assures that the p-values for the t-tests and the F-test are valid. Normality is not required in order to obtain unbiased estimates of the regression coefficients. With a large sample size the normality assumption is less critical, but for a small sample size it is required.
qqnorm(d$resid)  #Normal Quantile to Quantile plot
qqline(d$resid)  

  • Checking Multicollinearity
    When there is a perfect linear relationship among the predictors, the estimates for a regression model cannot be uniquely computed. The term collinearity implies that two variables are near perfect linear combinations of one another. When more than two variables are involved it is often called multicollinearity, although the two terms are often used interchangeably.

The primary concern is that as the degree of multicollinearity increases, the regression model estimates of the coefficients become unstable and the standard errors for the coefficients can get wildly inflated.

VIF, variance inflation factor, is used to measure the degree of multicollinearity. As a rule of thumb, a variable whose VIF values are greater than 10 may merit further investigation. Tolerance, defined as 1/VIF, is used by many researchers to check on the degree of collinearity. A tolerance value lower than 0.1 is comparable to a VIF of 10. It means that the variable could be considered as a linear combination of other independent variables.

car::vif(d)  #variance inflation factor
     mpg     drat       wt 
4.182297 2.099243 4.552402 
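
Tolerance is simply the reciprocal of VIF; a quick sketch:

1 / car::vif(d)   # values below 0.1 would suggest problematic collinearity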

  • Model Specification
    A model specification error can occur when one or more relevant variables are omitted from the model or one or more irrelevant variables are included in the model. If relevant variables are omitted, the common variance they share with the included variables may be wrongly attributed to those variables, and the error term is inflated. On the other hand, if irrelevant variables are included, the common variance they share with the included variables may be wrongly attributed to them. Model specification errors can substantially affect the estimates of the regression coefficients.

There are many methods and statistical tests to check whether a variable is relevant to the model. To visualize the effect of each variable in the model, we can use an added-variable plot, also called a partial-regression plot. The added-variable plot is a scatter plot of the residuals of a model that excludes one variable from the full model against the residuals of a model that regresses the excluded variable on the remaining predictors. The slope of the simple regression between those two sets of residuals is the same as the coefficient of the excluded variable in the full model. If the slope is not close to zero, we conclude that the variable is relevant to the model.

The added-variable plot is also useful for detecting influential points.

avPlots(d) #added variable plots from package "car"
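
As a sketch of the slope property described above (for the variable mpg), we can reproduce the added-variable regression by hand; the estimated slope matches the mpg coefficient in the full model:

e.y <- resid(lm(hp ~ drat + wt, data = mtcars))    # hp with mpg excluded
e.x <- resid(lm(mpg ~ drat + wt, data = mtcars))   # mpg regressed on the other predictors
coef(lm(e.y ~ e.x))["e.x"]                         # matches coef(d)["mpg"], about -9.86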