Multiple linear regression

Author

Darshan gowda v

Abstract:

Multiple linear regression is a statistical approach used to describe the simultaneous associations of several variables with one continuous outcome. Important steps in using this approach include estimation and inference, variable selection in model building, and assessing model fit. The special cases of regression with interactions among the variables, polynomial regression, regressions with categorical (grouping) variables, and separate slopes models are also covered. Examples in microbiology are used throughout.

• Multiple linear regression is used to estimate the relationship between two or more independent variables and one dependent variable.

Introduction:

Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables by fitting a linear equation to the data.

• When we want to understand the relationship between multiple predictor variables and a response variable, we can use multiple linear regression.

If we have p predictor variables, then a multiple linear regression model takes the form:

Y = β0 + β1X1 + β2X2 + … + βpXp + ε

Multiple Linear Regression Formula:

A population model for a multiple linear regression model that relates a y-variable to k x-variables is written as

y_i = β₀ + β₁x_{i,1} + β₂x_{i,2} + … + βₖx_{i,k} + ε_i

Here we’re using “k” for the number of predictor variables, which means we have k+1 regression parameters (the β coefficients). Some textbooks use “p” for the number of regression parameters and p–1 for the number of predictor variables.
We assume that the errors ε_i have a normal distribution with mean 0 and constant variance σ². These are the same assumptions that we used in simple regression with one x-variable.
The subscript i refers to the ith individual or unit in the population. In the notation for the x-variables, the subscript following i simply denotes which x-variable it is.
The word “linear” in “multiple linear regression” refers to the fact that the model is linear in the parameters, β₀, β₁, …, βₖ. This simply means that each parameter multiplies an x-variable, while the regression function is a sum of these “parameter times x-variable” terms. Each x-variable can be a predictor variable or a transformation of predictor variables (such as the square of a predictor variable or two predictor variables multiplied together). Allowing non-linear transformations of predictor variables like this enables the multiple linear regression model to represent non-linear relationships between the response variable and the predictor variables. Note that even β₀ represents a “parameter times x-variable” term if you think of the x-variable that is multiplied by β₀ as being the constant function “1.”
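The “parameter times x-variable” structure can be sketched in a few lines of Python. This is only an illustration, not a fitting routine: the function name `regression_function` and the coefficient values are made up for the example, and the second x-variable is taken to be the square of a single predictor to show that the model stays linear in the parameters.

```python
def regression_function(betas, x_vars):
    """Return beta0*1 + beta1*x1 + ... + betak*xk.

    betas  : [beta0, beta1, ..., betak]  (k + 1 parameters)
    x_vars : [x1, ..., xk]               (k x-variables)
    """
    if len(betas) != len(x_vars) + 1:
        raise ValueError("need exactly one more parameter than x-variables")
    # Treat the intercept as a parameter multiplying the constant x-variable 1.
    terms = [1.0] + list(x_vars)
    return sum(b * x for b, x in zip(betas, terms))


# Hypothetical coefficients, chosen only for illustration.
betas = [2.0, 0.5, -0.1]

# One predictor u; the x-variables are u and its square, so the curve is
# quadratic in u while the model remains linear in the parameters.
u = 3.0
mean_response = regression_function(betas, [u, u ** 2])
print(mean_response)
```

The error term ε_i is left out here, so the function returns the mean response for the given x-variables rather than an individual observation.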

Real-world applications:

1) Researchers might administer various dosages of a certain drug to patients and observe how their blood pressure responds. They might fit a linear regression model using dosage as the predictor variable and blood pressure as the response variable. With only one predictor, this is simple linear regression, the one-predictor special case of the multiple regression model. The regression model would take the following form:

blood pressure = β₀ + β₁(dosage)

• The coefficient β₀ would represent the expected blood pressure when dosage is zero.

• The coefficient β₁ would represent the average change in blood pressure when dosage is increased by one unit.

2) Scientists might use different amounts of fertilizer and water on different fields and see how they affect crop yield. They might fit a multiple linear regression model using fertilizer and water as the predictor variables and crop yield as the response variable. The regression model would take the following form:

crop yield = β₀ + β₁(amount of fertilizer) + β₂(amount of water)

• The coefficient β₀ would represent the expected crop yield with no fertilizer or water.

• The coefficient β₁ would represent the average change in crop yield when fertilizer is increased by one unit, assuming the amount of water remains unchanged.

• The coefficient β₂ would represent the average change in crop yield when water is increased by one unit, assuming the amount of fertilizer remains unchanged.

3) Data scientists in the NBA might analyze how different numbers of weekly yoga sessions and weightlifting sessions affect the number of points a player scores. They might fit a multiple linear regression model using yoga sessions and weightlifting sessions as the predictor variables and total points scored as the response variable. The regression model would take the following form:

points scored = β₀ + β₁(yoga sessions) + β₂(weightlifting sessions)

• The coefficient β₀ would represent the expected points scored for a player who participates in zero yoga sessions and zero weightlifting sessions.

• The coefficient β₁ would represent the average change in points scored when the number of weekly yoga sessions is increased by one, assuming the number of weekly weightlifting sessions remains unchanged.

• The coefficient β₂ would represent the average change in points scored when the number of weekly weightlifting sessions is increased by one, assuming the number of weekly yoga sessions remains unchanged.

Problem and solution on multiple linear regression:

Suppose we have a dataset of n = 8 observations with one response variable y and two predictor variables X₁ and X₂.

Use the following steps to fit a multiple linear regression model to this dataset.

Solution:

Step 1: Calculate X₁², X₂², X₁y, X₂y and X₁X₂ for each observation, then total every column.
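Step 1 amounts to building five product columns and totalling each one. The sketch below does this in plain Python. The eight rows are an assumption: the data table is not reproduced in this text, and these values were chosen only because they reproduce the column totals quoted in Step 2 (n = 8, ΣX₁ = 555, ΣX₂ = 145, Σy = 1,452, and so on). Substitute your own data as needed.

```python
# Illustrative rows only (an assumption, not the original table): they were
# picked to match the column totals used in Step 2.
y = [140, 155, 159, 179, 192, 200, 212, 215]
x1 = [60, 62, 67, 70, 71, 72, 75, 78]
x2 = [22, 25, 24, 20, 15, 14, 14, 11]

n = len(y)

# Column totals of the raw variables.
sum_x1, sum_x2, sum_y = sum(x1), sum(x2), sum(y)

# Column totals of the squares and cross-products from Step 1.
sum_x1_sq = sum(a * a for a in x1)
sum_x2_sq = sum(b * b for b in x2)
sum_x1_y = sum(a * c for a, c in zip(x1, y))
sum_x2_y = sum(b * c for b, c in zip(x2, y))
sum_x1_x2 = sum(a * b for a, b in zip(x1, x2))

print(n, sum_x1, sum_x2, sum_y)
print(sum_x1_sq, sum_x2_sq, sum_x1_y, sum_x2_y, sum_x1_x2)
```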

Step 2: Calculate Regression Sums.

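Next, make the following regression sum calculations. Lowercase x₁, x₂ and y here denote deviations from the sample means, while uppercase X₁ and X₂ denote the raw values: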

Σx₁² = ΣX₁² – (ΣX₁)² / n = 38,767 – (555)² / 8 = 263.875
Σx₂² = ΣX₂² – (ΣX₂)² / n = 2,823 – (145)² / 8 = 194.875
Σx₁y = ΣX₁y – (ΣX₁Σy) / n = 101,895 – (555*1,452) / 8 = 1,162.5
Σx₂y = ΣX₂y – (ΣX₂Σy) / n = 25,364 – (145*1,452) / 8 = -953.5
Σx₁x₂ = ΣX₁X₂ – (ΣX₁ΣX₂) / n = 9,859 – (555*145) / 8 = -200.375
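Each line of Step 2 follows the same pattern: a raw total minus the product of two column totals divided by n. A minimal Python sketch, assuming only the totals quoted above (the helper name `centered_sum` is made up for this example):

```python
def centered_sum(sum_uv, sum_u, sum_v, n):
    """Sum of (u - mean(u)) * (v - mean(v)), computed from raw totals."""
    return sum_uv - (sum_u * sum_v) / n


# Raw totals as quoted in the Step 2 calculations.
n = 8
sum_x1, sum_x2, sum_y = 555, 145, 1452
sum_x1_sq, sum_x2_sq = 38767, 2823
sum_x1_y, sum_x2_y, sum_x1_x2 = 101895, 25364, 9859

s11 = centered_sum(sum_x1_sq, sum_x1, sum_x1, n)  # Σx₁²
s22 = centered_sum(sum_x2_sq, sum_x2, sum_x2, n)  # Σx₂²
s1y = centered_sum(sum_x1_y, sum_x1, sum_y, n)    # Σx₁y
s2y = centered_sum(sum_x2_y, sum_x2, sum_y, n)    # Σx₂y
s12 = centered_sum(sum_x1_x2, sum_x1, sum_x2, n)  # Σx₁x₂

print(s11, s22, s1y, s2y, s12)
```

Using one helper for all five sums makes it harder to mix up which totals belong together, which is the most common slip when doing this step by hand.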

Step 3: Calculate b₀, b₁, and b₂.

The formula to calculate b₁ is: [(Σx₂²)(Σx₁y) – (Σx₁x₂)(Σx₂y)] / [(Σx₁²) (Σx₂²) – (Σx₁x₂)²]

Thus, b₁ = [(194.875)(1162.5) – (-200.375)(-953.5)] / [(263.875) (194.875) – (-200.375)²] = 3.148

The formula to calculate b₂ is: [(Σx₁²)(Σx₂y) – (Σx₁x₂)(Σx₁y)] / [(Σx₁²) (Σx₂²) – (Σx₁x₂)²]

Thus, b₂ = [(263.875)(-953.5) – (-200.375)(1162.5)] / [(263.875) (194.875) – (-200.375)²] = -1.656

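The formula to calculate b₀ is: ȳ – b₁X̄₁ – b₂X̄₂, where ȳ, X̄₁ and X̄₂ are the sample means of y, X₁ and X₂.

Thus, b₀ = 181.5 – 3.148(69.375) – (-1.656)(18.125) = -6.867 (computed with the unrounded values of b₁ and b₂; the rounded values shown give about -6.878).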
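The Step 3 formulas can be checked with a short Python sketch. It assumes only the regression sums from Step 2 and the sample means (the column totals divided by n = 8). The variable names are made up for the example.

```python
# Regression sums from Step 2.
s11, s22 = 263.875, 194.875  # Σx₁², Σx₂²
s1y, s2y = 1162.5, -953.5    # Σx₁y, Σx₂y
s12 = -200.375               # Σx₁x₂

# Sample means: column totals divided by n = 8.
n = 8
mean_y, mean_x1, mean_x2 = 1452 / n, 555 / n, 145 / n

# Shared denominator of the b₁ and b₂ formulas.
denom = s11 * s22 - s12 ** 2

b1 = (s22 * s1y - s12 * s2y) / denom
b2 = (s11 * s2y - s12 * s1y) / denom

# The intercept uses the unrounded slopes and the sample means.
b0 = mean_y - b1 * mean_x1 - b2 * mean_x2

print(round(b1, 3), round(b2, 3), round(b0, 3))
```

Rounded to three decimals, the printed values should agree with b₁ = 3.148, b₂ = -1.656 and b₀ = -6.867. If the denominator were zero or close to it, the two predictors would be perfectly or almost perfectly collinear and these formulas would not give stable estimates.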

Step 4: Place b₀, b₁, and b₂ in the estimated linear regression equation.

The estimated linear regression equation is: ŷ = b₀ + b₁*x₁ + b₂*x₂

In our example, it is ŷ = -6.867 + 3.148x₁ – 1.656x₂

Interpret a Multiple Linear Regression Equation

Here is how to interpret this estimated linear regression equation: ŷ = -6.867 + 3.148x₁ – 1.656x₂

b₀ = -6.867. When both predictor variables are equal to zero, the mean value for y is -6.867.

b₁ = 3.148. A one unit increase in x₁ is associated with a 3.148 unit increase in y, on average, assuming x₂ is held constant.

b₂ = -1.656. A one unit increase in x₂ is associated with a 1.656 unit decrease in y, on average, assuming x₁ is held constant.
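The “held constant” reading of the coefficients can be illustrated numerically. The sketch below evaluates the fitted equation at one point and then moves a single predictor by one unit. The predictor values 70 and 20 are hypothetical, chosen only for illustration, and the function name `predict` is likewise made up.

```python
def predict(x1, x2):
    """Fitted equation from the worked example (rounded coefficients)."""
    return -6.867 + 3.148 * x1 - 1.656 * x2


# Hypothetical predictor values, chosen only for illustration.
base = predict(70, 20)

# Raise x1 by one unit while holding x2 fixed: the change equals b1.
effect_x1 = predict(71, 20) - base

# Raise x2 by one unit while holding x1 fixed: the change equals b2.
effect_x2 = predict(70, 21) - base

print(round(base, 3), round(effect_x1, 3), round(effect_x2, 3))
```

Whatever starting point is chosen, the two differences stay the same, because the fitted equation has no interaction term between x₁ and x₂.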

Conclusion:

Multiple linear regression is used to evaluate predictors for continuously distributed outcome variables. This procedure computes a coefficient for each independent variable (predictor) that best fits the observed data in the sample.

• Multiple linear regression analyses produce several diagnostic and outcome statistics, which are important to understand when judging how well the model fits.
