Multiple linear regression
Abstract:
Multiple linear regression is a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.
• Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.
Introduction:
Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables using a linear function. With one predictor that function is a straight line; with two or more it is a plane or hyperplane.
• When we want to understand the relationship between multiple predictor variables and a response variable, we can use multiple linear regression.
If we have p predictor variables, then a multiple linear regression model takes the form:
Y = β0 + β1X1 + β2X2 + … + βpXp + ε
Multiple Linear Regression Formula:
- A population model for a multiple linear regression model that relates a y-variable to k x-variables is written as
y_i = β0 + β1·x_{i,1} + β2·x_{i,2} + … + βk·x_{i,k} + ε_i
Here we’re using “k” for the number of predictor variables, which means we have k+1 regression parameters (the β coefficients). Some textbooks use “p” for the number of regression parameters and p–1 for the number of predictor variables.
We assume that the errors ε_i are independent and normally distributed with mean 0 and constant variance σ². These are the same assumptions that we used in simple regression with one x-variable.
The subscript i refers to the ith individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.
The word “linear” in “multiple linear regression” refers to the fact that the model is linear in the parameters, β0, β1, …, βk. This simply means that each parameter multiplies an x-variable, while the regression function is a sum of these “parameter times x-variable” terms. Each x-variable can be a predictor variable or a transformation of predictor variables (such as the square of a predictor variable or two predictor variables multiplied together). Allowing non-linear transformations of predictor variables like this enables the multiple linear regression model to represent non-linear relationships between the response variable and the predictor variables. We’ll explore predictor transformations further in Lesson. Note that even β0 represents a “parameter times x-variable” term if you think of the x-variable that is multiplied by β0 as being the constant function “1.”
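To make the "linear in the parameters" point concrete, here is a minimal Python sketch, assuming NumPy is available. The data are synthetic, and the names x1, x2 and true_beta are illustrative rather than taken from the text. The design matrix holds a column of ones, the two predictors, a squared term and a product term, yet fitting it is still an ordinary linear least-squares problem.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic predictors (illustrative values only).
n = 200
x1 = rng.uniform(0, 10, size=n)
x2 = rng.uniform(0, 5, size=n)

# Coefficients chosen arbitrarily for the demonstration.
true_beta = np.array([2.0, 1.5, -0.8, 0.3, 0.5])

# Each column is one "x-variable": constant 1, x1, x2, x1 squared, x1 times x2.
X = np.column_stack([np.ones(n), x1, x2, x1**2, x1 * x2])

# Response = sum of "parameter times x-variable" terms plus normal errors.
y = X @ true_beta + rng.normal(0.0, 0.5, size=n)

# Ordinary least squares on the transformed columns.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 2))
```

The curved and interaction effects enter only through the columns of X, so the estimates should land near true_beta, up to sampling noise.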
Real-world applications:
1) Researchers might administer various dosages of a certain drug to patients and observe how their blood pressure responds. They might fit a linear regression model using dosage as the predictor variable and blood pressure as the response variable. With a single predictor this is simple linear regression, the one-predictor special case of the multiple regression model. The regression model would take the following form:
blood pressure = β0 + β1(dosage)
• The coefficient β0 would represent the expected blood pressure when dosage is zero.
• The coefficient β1 would represent the average change in blood pressure when dosage is increased by one unit.
2) Scientists might use different amounts of fertilizer and water on different fields and see how this affects crop yield. They might fit a multiple linear regression model using fertilizer and water as the predictor variables and crop yield as the response variable. The regression model would take the following form:
crop yield = β0 + β1(amount of fertilizer) + β2(amount of water)
• The coefficient β0 would represent the expected crop yield with no fertilizer or water.
• The coefficient β1 would represent the average change in crop yield when fertilizer is increased by one unit, assuming the amount of water remains unchanged.
• The coefficient β2 would represent the average change in crop yield when water is increased by one unit, assuming the amount of fertilizer remains unchanged.
3) Data scientists in the NBA might analyze how different numbers of weekly yoga sessions and weightlifting sessions affect the number of points a player scores. They might fit a multiple linear regression model using yoga sessions and weightlifting sessions as the predictor variables and total points scored as the response variable. The regression model would take the following form:
points scored = β0 + β1(yoga sessions) + β2(weightlifting sessions)
• The coefficient β0 would represent the expected points scored for a player who participates in zero yoga sessions and zero weightlifting sessions.
• The coefficient β1 would represent the average change in points scored when the number of weekly yoga sessions is increased by one, assuming the number of weekly weightlifting sessions remains unchanged.
• The coefficient β2 would represent the average change in points scored when the number of weekly weightlifting sessions is increased by one, assuming the number of weekly yoga sessions remains unchanged.
Problem and solution on multiple linear regression:
Suppose we have a dataset of n = 8 observations with one response variable y and two predictor variables X1 and X2.
Use the following steps to fit a multiple linear regression model to this dataset.
Solution:
Step 1: Calculate X1², X2², X1y, X2y and X1X2 for each observation, then total each column.
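Step 1 can be sketched in a few lines of Python. The data table is not shown in this text, so the eight observations below are an assumption: they are illustrative values chosen because they reproduce the column totals used in Step 2 (ΣX1 = 555, ΣX2 = 145, Σy = 1,452, and so on).

```python
# Assumed data: illustrative values that match the totals quoted in Step 2.
# The original data table is not shown, so these are not confirmed source data.
x1 = [60, 62, 67, 70, 71, 72, 75, 78]
x2 = [22, 25, 24, 20, 15, 14, 14, 11]
y = [140, 155, 159, 179, 192, 200, 212, 215]

# One tuple of products per observation: (X1², X2², X1*y, X2*y, X1*X2).
rows = [(a * a, b * b, a * c, b * c, a * b) for a, b, c in zip(x1, x2, y)]

# Column totals, in the same order as the tuple above.
totals = [sum(col) for col in zip(*rows)]

print(sum(x1), sum(x2), sum(y))  # 555 145 1452
print(totals)                    # [38767, 2823, 101895, 25364, 9859]
```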
Step 2: Calculate Regression Sums.
Next, make the following regression sum calculations:
Σx1² = ΣX1² – (ΣX1)² / n = 38,767 – (555)² / 8 = 263.875
Σx2² = ΣX2² – (ΣX2)² / n = 2,823 – (145)² / 8 = 194.875
Σx1y = ΣX1y – (ΣX1)(Σy) / n = 101,895 – (555 × 1,452) / 8 = 1,162.5
Σx2y = ΣX2y – (ΣX2)(Σy) / n = 25,364 – (145 × 1,452) / 8 = -953.5
Σx1x2 = ΣX1X2 – (ΣX1)(ΣX2) / n = 9,859 – (555 × 145) / 8 = -200.375
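The regression sums above can be checked with a short Python sketch that uses only the raw totals quoted in the text. The helper name corrected and the variable names are illustrative choices, not part of the original.

```python
# Raw totals quoted in the worked example (n = 8 observations).
n = 8
sum_x1, sum_x2, sum_y = 555, 145, 1452
sum_x1_sq, sum_x2_sq = 38767, 2823
sum_x1y, sum_x2y, sum_x1x2 = 101895, 25364, 9859

def corrected(raw_total, total_a, total_b):
    """Regression sum: raw total of products minus (Σa)(Σb)/n."""
    return raw_total - total_a * total_b / n

S11 = corrected(sum_x1_sq, sum_x1, sum_x1)   # Σx1²
S22 = corrected(sum_x2_sq, sum_x2, sum_x2)   # Σx2²
S1y = corrected(sum_x1y, sum_x1, sum_y)      # Σx1y
S2y = corrected(sum_x2y, sum_x2, sum_y)      # Σx2y
S12 = corrected(sum_x1x2, sum_x1, sum_x2)    # Σx1x2

print(S11, S22, S1y, S2y, S12)  # 263.875 194.875 1162.5 -953.5 -200.375
```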
Step 3: Calculate b0, b1, and b2.
The formula to calculate b1 is: [(Σx2²)(Σx1y) – (Σx1x2)(Σx2y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]
Thus, b1 = [(194.875)(1162.5) – (-200.375)(-953.5)] / [(263.875)(194.875) – (-200.375)²] = 3.148
The formula to calculate b2 is: [(Σx1²)(Σx2y) – (Σx1x2)(Σx1y)] / [(Σx1²)(Σx2²) – (Σx1x2)²]
Thus, b2 = [(263.875)(-953.5) – (-200.375)(1162.5)] / [(263.875)(194.875) – (-200.375)²] = -1.656
The formula to calculate b0 is: ȳ – b1X̄1 – b2X̄2, where ȳ, X̄1 and X̄2 are the sample means of y, X1 and X2.
Thus, b0 = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867 (computed with the unrounded values of b1 and b2; the rounded values shown give about -6.878).
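As a check on Step 3, the following Python sketch plugs the regression sums and sample means from the text into the three formulas. The variable names are illustrative. Carrying b1 and b2 at full precision into the b0 formula is what gives -6.867.

```python
# Regression sums and sample means taken from the worked example in the text.
S11, S22 = 263.875, 194.875                   # Σx1², Σx2²
S1y, S2y, S12 = 1162.5, -953.5, -200.375      # Σx1y, Σx2y, Σx1x2
y_bar, x1_bar, x2_bar = 181.5, 69.375, 18.125

# Shared denominator of the b1 and b2 formulas.
denom = S11 * S22 - S12 ** 2

b1 = (S22 * S1y - S12 * S2y) / denom
b2 = (S11 * S2y - S12 * S1y) / denom
b0 = y_bar - b1 * x1_bar - b2 * x2_bar        # uses unrounded b1 and b2

print(round(b0, 3), round(b1, 3), round(b2, 3))  # -6.867 3.148 -1.656
```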
Step 4: Place b0, b1, and b2 in the estimated linear regression equation.
The estimated linear regression equation is: ŷ = b0 + b1*x1 + b2*x2
In our example, it is ŷ = -6.867 + 3.148x1 – 1.656x2
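The hand calculation can be cross-checked against a general least-squares solver. The sketch below assumes NumPy and uses illustrative data: the original data table is not shown in this text, so the eight observations are an assumption, chosen because they reproduce the totals quoted in Step 2.

```python
import numpy as np

# Assumed data: illustrative values that match the totals quoted in Step 2.
x1 = np.array([60, 62, 67, 70, 71, 72, 75, 78], dtype=float)
x2 = np.array([22, 25, 24, 20, 15, 14, 14, 11], dtype=float)
y = np.array([140, 155, 159, 179, 192, 200, 212, 215], dtype=float)

# Design matrix with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of X @ coef ≈ y.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1, b2 = coef
print(np.round(coef, 3))
```

Under these assumptions the solver should agree with the hand-computed b0, b1 and b2 to three decimal places.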
Interpret a Multiple Linear Regression Equation
Here is how to interpret this estimated linear regression equation: ŷ = -6.867 + 3.148x1 – 1.656x2
b0 = -6.867. When both predictor variables are equal to zero, the mean value for y is -6.867.
b1 = 3.148. A one unit increase in x1 is associated with a 3.148 unit increase in y, on average, assuming x2 is held constant.
b2 = -1.656. A one unit increase in x2 is associated with a 1.656 unit decrease in y, on average, assuming x1 is held constant.
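These interpretations can be illustrated with a small Python sketch of the fitted equation. The input values 70, 20 and so on are hypothetical, picked only to show that changing one predictor by one unit while holding the other fixed shifts the prediction by the matching coefficient.

```python
def predict(x1, x2):
    """Fitted equation from the worked example: ŷ = -6.867 + 3.148*x1 - 1.656*x2."""
    return -6.867 + 3.148 * x1 - 1.656 * x2

# Hypothetical inputs, used only for illustration.
y_a = predict(70, 20)
y_b = predict(71, 20)   # x1 raised by one unit, x2 held constant
y_c = predict(70, 21)   # x2 raised by one unit, x1 held constant

print(round(y_b - y_a, 3))      # 3.148
print(round(y_c - y_a, 3))      # -1.656
print(round(predict(0, 0), 3))  # -6.867
```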
Conclusion:
Multiple linear regression is used to evaluate predictors for continuously distributed outcome variables. This procedure computes a coefficient for each independent variable (predictor) that best fits the observed data in the sample.
• Multiple linear regression analyses produce several diagnostic and outcome statistics that are important to understand.