Linear Regression:
It is a basic and commonly used type of predictive analysis. It is a statistical approach for modeling the relationship between a dependent variable and a given set of independent variables. It comes in two types: Simple Linear Regression (a single predictor) and Multiple Linear Regression (several predictors).
Multiple Linear Regression:
It is the most common form of Linear Regression. Multiple Linear Regression describes how a single response variable \(Y\) depends linearly on a number of predictor variables. Typical uses are the two worked examples below: predicting heart-disease rates from biking and smoking rates, and predicting the price index of an economy from interest and unemployment rates.
Estimation of the Model Parameters:
Consider a multiple linear regression model with \(k\) independent predictor variables \(x_1, x_2, \cdots, x_k\) and one response variable \(y\):
\[y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k+\epsilon\] Suppose we have \(n\) observations on the \(k+1\) variables, where \(n\) should be greater than \(k\):
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_kx_{ik}+\epsilon_i, \quad i=1,2,\cdots,n.\]
The basic goal in least-squares regression is to fit a hyperplane in \((k + 1)\)-dimensional space that minimizes the sum of squared residuals. \[\displaystyle \sum^{n}_{i=1} e^2_i= \sum^{n}_{i=1}\left( y_i-\beta_0- \sum^{k}_{j=1}\beta_jx_{ij}\right)^2 \]
To minimize this sum, we take the derivatives with respect to the model parameters, set them equal to zero, and derive the least-squares normal equations that the parameter estimates must fulfill.
These equations are formulated with the help of vectors and matrices.
Let \(\textbf{y}=\begin{bmatrix}y_1\\ y_2 \\ \vdots\\ y_n \end{bmatrix} \qquad \textbf{X}=\begin{bmatrix}1 & x_{11} & x_{12}&\cdots& x_{1k} \\ 1 & x_{21} & x_{22}&\cdots& x_{2k}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2}&\cdots& x_{nk}\end{bmatrix} \qquad \beta=\begin{bmatrix}\beta_0\\\beta_1\\\vdots \\ \beta_k\end{bmatrix}\qquad \epsilon=\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}\)
In matrix form, the linear regression model is written as: \[\textbf{y}=\textbf{X}\beta+\epsilon\]
The least-squares estimate of \(\beta\) is the value that minimizes the sum of squared residuals:
\[\displaystyle\sum^{n}_{i=1}\epsilon^2_i=\epsilon'\epsilon=(\textbf{y}-\textbf{X}\beta)'(\textbf{y}-\textbf{X}\beta)\]
Think of the columns of \(\textbf{X}\) as fixed (they are the data for a specific problem) and of \(\beta\) as variable. We want to find the “best” \(\beta\) in the sense that the sum of squared residuals is minimized.
The smallest that the sum of squares could be is zero. Setting the derivative of \(\epsilon'\epsilon\) with respect to \(\beta\) equal to zero yields the normal equations \[\textbf{X}'\textbf{X}\hat{\beta}=\textbf{X}'\textbf{y},\] and when \(\textbf{X}'\textbf{X}\) is invertible their solution is \(\hat{\beta}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}\). The fitted values are then \[\hat{y}=\textbf{X}\hat{\beta}\] Here \(\hat{y}\) is the estimated response vector.
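As an illustrative sketch (using simulated data rather than a real dataset), the matrix formula for \(\hat{\beta}\) can be checked against R's built-in lm() function:

# Simulated data: two predictors and a known linear relationship
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Least-squares estimate from the normal equations: solve X'X beta = X'y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat

# The same estimates from lm()
coef(lm(y ~ x1 + x2))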
A dataset contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in an imaginary sample of 500 towns.
Download the sample dataset to try it yourself. The data can be downloaded here: https://bit.ly/HeartDataExample.
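Assuming the downloaded file is saved as heart.data.csv in the working directory (the file name here is an assumption; adjust it to match your download), the data can be loaded with read.csv():

heart.data <- read.csv("heart.data.csv")  # file name assumed; use the path of your downloaded file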
Since the variables are quantitative, running summary() produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):
summary(heart.data)
## X biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
cor(heart.data$biking, heart.data$smoking)
## [1] 0.01513618
The correlation between biking and smoking is small (0.015 is only a 1.5% correlation), so we can include both parameters in our model.
Use the hist() function to check visually whether your dependent variable is approximately normally distributed.
hist(heart.data$heart.disease)
The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.
We can check whether the relationship between each independent variable and the dependent variable is linear using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.
plot(heart.disease ~ biking, data=heart.data)
plot(heart.disease ~ smoking, data=heart.data)
Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.
The remaining assumption, homoscedasticity (constant variance of the residuals), will be checked after we make the model.
Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.
Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.
To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.
This means that for every 1% increase in biking to work, there is a correlated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.
The standard errors for these regression coefficients are very small, and the t-statistics are very large (-147 and 50.4, respectively). The p-values reflect these small errors and large t-statistics. For both parameters, there is almost zero probability that this effect is due to chance.
Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!
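To make the coefficient interpretation concrete, here is a quick sketch using predict() on two hypothetical towns that differ only in their biking rate (the values are made up for illustration):

# Two hypothetical towns: same smoking rate, biking rates 1 percentage point apart
new.towns <- data.frame(biking = c(30, 31), smoking = c(15, 15))
predict(heart.disease.lm, newdata = new.towns)
# The second predicted heart-disease rate is about 0.2 percentage points lower,
# matching the estimated biking coefficient of -0.2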
Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
The residuals show no bias, so we can say our model fits the assumption of homoscedasticity.
As a second example, suppose the goal is to predict the index_price (the dependent variable) of a fictitious economy based on two independent/input variables: the interest_rate and the unemployment_rate.
The following code can then be used to capture the data in R:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
Before you apply linear regression models, you’ll need to verify that several assumptions are met. Most notably, you’ll need to make sure that a linear relationship exists between the dependent variable and the independent variable/s.
A quick way to check for linearity is by using scatter plots.
For our example, we’ll check that a linear relationship exists between:
The index_price (dependent variable) and the interest_rate (independent variable); and
The index_price (dependent variable) and the unemployment_rate (independent variable)
Here is the code to plot the relationship between the index_price and the interest_rate:
plot(x = interest_rate, y = index_price)
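And here is the code to plot the relationship between the index_price and the unemployment_rate:

plot(x = unemployment_rate, y = index_price)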
You may now use the following template to perform the multiple linear regression in R:
model <- lm(index_price ~ interest_rate + unemployment_rate)
summary(model)
##
## Call:
## lm(formula = index_price ~ interest_rate + unemployment_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.205 -41.667 -6.248 57.741 118.810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1798.4 899.2 2.000 0.05861 .
## interest_rate 345.5 111.4 3.103 0.00539 **
## unemployment_rate -250.1 117.9 -2.121 0.04601 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.56 on 21 degrees of freedom
## Multiple R-squared: 0.8976, Adjusted R-squared: 0.8879
## F-statistic: 92.07 on 2 and 21 DF, p-value: 4.043e-11
You can use the coefficients in the summary above to build the multiple linear regression equation as follows:
index_price = (Intercept) + (interest_rate coef)X1 + (unemployment_rate coef)X2
And once you plug the numbers from the summary:
index_price = (1798.4) + (345.5)X1 + (-250.1)X2
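Plugging values into this equation gives a predicted index_price; the same result can be obtained with predict(). The example below uses the interest and unemployment rates from the first observation in the data (2.75 and 5.3):

# Predict index_price for interest_rate = 2.75 and unemployment_rate = 5.3
new.data <- data.frame(interest_rate = 2.75, unemployment_rate = 5.3)
predict(model, newdata = new.data)
# With the rounded coefficients above: 1798.4 + 345.5*2.75 - 250.1*5.3 is roughly 1423;
# predict() uses the unrounded coefficients, so its result will differ slightly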
Some additional stats to consider in the summary:
Adjusted R-squared reflects the fit of the model, where a higher value generally indicates a better fit
Intercept coefficient is the Y-intercept
interest_rate coefficient is the change in Y due to a change of one unit in the interest rate (everything else held constant)
unemployment_rate coefficient is the change in Y due to a change of one unit in the unemployment rate (everything else held constant)
Std. Error reflects the precision of the coefficient estimates (smaller values indicate more precise estimates)
Pr(>|t|) is the p-value. A p-value of less than 0.05 is considered to be statistically significant