Linear Regression:
It is a basic and commonly used type of predictive analysis. It is a statistical approach for modeling the relationship between a dependent variable and a given set of independent variables. It comes in two types: Simple Linear Regression (a single predictor) and Multiple Linear Regression (several predictors).
Multiple Linear Regression:
It is the most common form of Linear Regression. Multiple Linear Regression describes how a single response variable \(Y\) depends linearly on a number of predictor variables. Typical uses are the two worked examples below: predicting heart-disease rates from biking and smoking rates, and predicting the price index of an economy from interest and unemployment rates.
Estimation of the Model Parameters:
Consider a multiple linear regression model with \(k\) independent predictor variables \(x_1, x_2, \cdots, x_k\) and one response variable \(y\):
\[y=\beta_0+\beta_1x_1+\beta_2x_2+\cdots+\beta_kx_k+\epsilon\] Suppose we have \(n\) observations on the \(k+1\) variables, where \(n\) should be greater than \(k\):
\[y_i=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}+\cdots+\beta_kx_{ik}+\epsilon_i, \quad i=1,2,\cdots,n.\]
The basic goal in least-squares regression is to fit a hyperplane in \((k + 1)\)-dimensional space that minimizes the sum of squared residuals. \[\displaystyle \sum^{n}_{i=1} e^2_i= \sum^{n}_{i=1}\left( y_i-\beta_0- \sum^{k}_{j=1}\beta_jx_{ij}\right)^2 \]
To minimize this sum, we take the derivatives with respect to the model parameters, set them equal to zero, and derive the least-squares normal equations that the parameter estimates must fulfill.
These equations are formulated with the help of vectors and matrices.
Let \(\textbf{y}=\begin{bmatrix}y_1\\ y_2 \\ \vdots\\ y_n \end{bmatrix} \qquad \textbf{X}=\begin{bmatrix}1 & x_{11} & x_{12}&\cdots& x_{1k} \\ 1 & x_{21} & x_{22}&\cdots& x_{2k}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & x_{n2}&\cdots& x_{nk}\end{bmatrix} \qquad \beta=\begin{bmatrix}\beta_0\\\beta_1\\\vdots \\ \beta_k\end{bmatrix}\qquad \epsilon=\begin{bmatrix}\epsilon_1\\\epsilon_2\\\vdots\\\epsilon_n\end{bmatrix}\)
In matrix form, the linear regression model is written as: \[\textbf{y}=\textbf{X}\beta+\epsilon\]
The least-squares estimate of \(\beta\) is the value that minimizes the sum of squared residuals:
\[\displaystyle\sum^{n}_{i=1}\epsilon^2_i=\epsilon'\epsilon=(\textbf{y}-\textbf{X}\beta)'(\textbf{y}-\textbf{X}\beta)\]
Think of the columns of \(\textbf{X}\) as fixed (they are the data for a specific problem) and of \(\beta\) as variable. We want to find the “best” \(\beta\) in the sense that the sum of squared residuals is minimized.
The smallest that the sum of squares could be is zero. Setting the derivative of \(\epsilon'\epsilon\) with respect to \(\beta\) equal to zero yields the normal equations \[\textbf{X}'\textbf{X}\hat{\beta}=\textbf{X}'\textbf{y},\] and when \(\textbf{X}'\textbf{X}\) is invertible their solution is \(\hat{\beta}=(\textbf{X}'\textbf{X})^{-1}\textbf{X}'\textbf{y}\). The fitted values are then \[\hat{y}=\textbf{X}\hat{\beta}\] Here \(\hat{y}\) is the estimated response vector.
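As an illustrative sketch (using simulated data rather than a real dataset), the matrix formula for \(\hat{\beta}\) can be checked against R's built-in lm() function:

# Simulated data: two predictors and a known linear relationship
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(n)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x1, x2)

# Least-squares estimate from the normal equations: solve X'X beta = X'y
beta.hat <- solve(t(X) %*% X, t(X) %*% y)
beta.hat

# The same estimates from lm()
coef(lm(y ~ x1 + x2))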
A dataset contains observations on the percentage of people biking to work each day, the percentage of people smoking, and the percentage of people with heart disease in an imaginary sample of 500 towns.
Download the sample dataset to try it yourself. The data can be downloaded here: https://bit.ly/HeartDataExample.
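Assuming the downloaded file is saved as heart.data.csv in the working directory (the file name here is an assumption; adjust it to match your download), the data can be loaded with read.csv():

heart.data <- read.csv("heart.data.csv")  # file name assumed; use the path of your downloaded file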
Since the variables are quantitative, running summary() produces a numeric summary of the data for the independent variables (smoking and biking) and the dependent variable (heart disease):
summary(heart.data)
## X biking smoking heart.disease
## Min. : 1.0 Min. : 1.119 Min. : 0.5259 Min. : 0.5519
## 1st Qu.:125.2 1st Qu.:20.205 1st Qu.: 8.2798 1st Qu.: 6.5137
## Median :249.5 Median :35.824 Median :15.8146 Median :10.3853
## Mean :249.5 Mean :37.788 Mean :15.4350 Mean :10.1745
## 3rd Qu.:373.8 3rd Qu.:57.853 3rd Qu.:22.5689 3rd Qu.:13.7240
## Max. :498.0 Max. :74.907 Max. :29.9467 Max. :20.4535
Use the cor() function to test the relationship between your independent variables and make sure they aren’t too highly correlated.
cor(heart.data$biking, heart.data$smoking)
## [1] 0.01513618
The correlation between biking and smoking is small (0.015 is only a 1.5% correlation), so we can include both parameters in our model.
Use the hist() function to check visually whether your dependent variable is approximately normally distributed.
hist(heart.data$heart.disease)
The distribution of observations is roughly bell-shaped, so we can proceed with the linear regression.
We can check whether the relationship between each independent variable and the dependent variable is linear using two scatterplots: one for biking and heart disease, and one for smoking and heart disease.
plot(heart.disease ~ biking, data=heart.data)
plot(heart.disease ~ smoking, data=heart.data)
Although the relationship between smoking and heart disease is a bit less clear, it still appears linear. We can proceed with linear regression.
The remaining assumption, homoscedasticity (constant variance of the residuals), will be checked after we make the model.
Now that you’ve determined your data meet the assumptions, you can perform a linear regression analysis to evaluate the relationship between the independent and dependent variables.
Let’s see if there’s a linear relationship between biking to work, smoking, and heart disease in our imaginary survey of 500 towns. The rates of biking to work range between 1 and 75%, rates of smoking between 0.5 and 30%, and rates of heart disease between 0.5% and 20.5%.
To test the relationship, we first fit a linear model with heart disease as the dependent variable and biking and smoking as the independent variables. Run these two lines of code:
heart.disease.lm <- lm(heart.disease ~ biking + smoking, data = heart.data)
summary(heart.disease.lm)
##
## Call:
## lm(formula = heart.disease ~ biking + smoking, data = heart.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1789 -0.4463 0.0362 0.4422 1.9331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.984658 0.080137 186.99 <2e-16 ***
## biking -0.200133 0.001366 -146.53 <2e-16 ***
## smoking 0.178334 0.003539 50.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.654 on 495 degrees of freedom
## Multiple R-squared: 0.9796, Adjusted R-squared: 0.9795
## F-statistic: 1.19e+04 on 2 and 495 DF, p-value: < 2.2e-16
The estimated effect of biking on heart disease is -0.2, while the estimated effect of smoking is 0.178.
This means that for every 1% increase in biking to work, there is a correlated 0.2% decrease in the incidence of heart disease. Meanwhile, for every 1% increase in smoking, there is a 0.178% increase in the rate of heart disease.
The standard errors for these regression coefficients are very small, and the t-statistics are very large (-147 and 50.4, respectively). The p-values reflect these small errors and large t-statistics. For both parameters, there is almost zero probability that this effect is due to chance.
Remember that these data are made up for this example, so in real life these relationships would not be nearly so clear!
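To make the coefficient interpretation concrete, here is a quick sketch using predict() on two hypothetical towns that differ only in their biking rate (the values are made up for illustration):

# Two hypothetical towns: same smoking rate, biking rates 1 percentage point apart
new.towns <- data.frame(biking = c(30, 31), smoking = c(15, 15))
predict(heart.disease.lm, newdata = new.towns)
# The second predicted heart-disease rate is about 0.2 percentage points lower,
# matching the estimated biking coefficient of -0.2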
Again, we should check that our model is actually a good fit for the data, and that we don’t have large variation in the model error, by running this code:
par(mfrow=c(2,2))
plot(heart.disease.lm)
par(mfrow=c(1,1))
The residuals show no bias, so we can say our model fits the assumption of homoscedasticity.
As a second example, suppose the goal is to predict the index_price (the dependent variable) of a fictitious economy based on two independent/input variables: the interest_rate and the unemployment_rate.
The following code can then be used to capture the data in R:
year <- c(2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2017,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016,2016)
month <- c(12,11,10,9,8,7,6,5,4,3,2,1,12,11,10,9,8,7,6,5,4,3,2,1)
interest_rate <- c(2.75,2.5,2.5,2.5,2.5,2.5,2.5,2.25,2.25,2.25,2,2,2,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75,1.75)
unemployment_rate <- c(5.3,5.3,5.3,5.3,5.4,5.6,5.5,5.5,5.5,5.6,5.7,5.9,6,5.9,5.8,6.1,6.2,6.1,6.1,6.1,5.9,6.2,6.2,6.1)
index_price <- c(1464,1394,1357,1293,1256,1254,1234,1195,1159,1167,1130,1075,1047,965,943,958,971,949,884,866,876,822,704,719)
Before you apply linear regression models, you’ll need to verify that several assumptions are met. Most notably, you’ll need to make sure that a linear relationship exists between the dependent variable and the independent variable/s.
A quick way to check for linearity is by using scatter plots.
For our example, we’ll check that a linear relationship exists between:
The index_price (dependent variable) and the interest_rate (independent variable); and
The index_price (dependent variable) and the unemployment_rate (independent variable)
Here is the code to plot the relationship between the index_price and the interest_rate:
plot(x = interest_rate, y = index_price)
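And here is the code to plot the relationship between the index_price and the unemployment_rate:

plot(x = unemployment_rate, y = index_price)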
You may now use the following template to perform the multiple linear regression in R:
model <- lm(index_price ~ interest_rate + unemployment_rate)
summary(model)
##
## Call:
## lm(formula = index_price ~ interest_rate + unemployment_rate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.205 -41.667 -6.248 57.741 118.810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1798.4 899.2 2.000 0.05861 .
## interest_rate 345.5 111.4 3.103 0.00539 **
## unemployment_rate -250.1 117.9 -2.121 0.04601 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.56 on 21 degrees of freedom
## Multiple R-squared: 0.8976, Adjusted R-squared: 0.8879
## F-statistic: 92.07 on 2 and 21 DF, p-value: 4.043e-11
You can use the coefficients in the summary above to build the multiple linear regression equation as follows:
index_price = (Intercept) + (interest_rate coef)X1 + (unemployment_rate coef)X2
And once you plug the numbers from the summary:
index_price = (1798.4) + (345.5)X1 + (-250.1)X2
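Plugging values into this equation gives a predicted index_price; the same result can be obtained with predict(). The example below uses the interest and unemployment rates from the first observation in the data (2.75 and 5.3):

# Predict index_price for interest_rate = 2.75 and unemployment_rate = 5.3
new.data <- data.frame(interest_rate = 2.75, unemployment_rate = 5.3)
predict(model, newdata = new.data)
# With the rounded coefficients above: 1798.4 + 345.5*2.75 - 250.1*5.3 is roughly 1423;
# predict() uses the unrounded coefficients, so its result will differ slightly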
Some additional stats to consider in the summary:
Adjusted R-squared reflects the fit of the model, where a higher value generally indicates a better fit
Intercept coefficient is the Y-intercept
interest_rate coefficient is the change in Y due to a change of one unit in the interest rate (everything else held constant)
unemployment_rate coefficient is the change in Y due to a change of one unit in the unemployment rate (everything else held constant)
Std. Error reflects the precision of the coefficient estimates (smaller values indicate more precise estimates)
Pr(>|t|) is the p-value. A p-value of less than 0.05 is considered to be statistically significant