Multiple Linear Regression Analysis

An implementation in R Markdown

Ma. Angelika C. Regoso, Angelica T. Masa

07-29-2021

Problem

An article in Optical Engineering, “Operating Curve Extraction of a Correlator’s Filter” (Letendre, Bergeron, and Arsenault 2004), reported on the use of an optical correlator to perform an experiment by varying brightness and contrast. The resulting modulation is characterized by the useful range of gray levels. The data follow:



Table 1.1 Useful Range Data

| Useful Range (ng) \(y_i\) | Brightness % \((x_{i1})\) | Contrast % \((x_{i2})\) |
|---:|---:|---:|
| 96 | 54 | 56 |
| 50 | 61 | 80 |
| 50 | 65 | 70 |
| 112 | 100 | 50 |
| 96 | 100 | 65 |
| 80 | 100 | 80 |
| 155 | 50 | 25 |
| 144 | 57 | 35 |
| 255 | 54 | 26 |
Figure: Matrix of Scatter Plots for the Useful Range Data in Table 1.1

Figure: Interactive Three-Dimensional Scatter Plot of the Useful Range Data
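
For reference, a minimal sketch of how such plots can be produced in R (assuming the Optical.data frame defined in Part 1; the interactive version may additionally rely on another package, so this static version is only an approximation):

# Matrix of scatter plots for all pairs of variables
pairs(Optical.data)

# Static three-dimensional scatter plot
library(scatterplot3d)
scatterplot3d(Optical.data$Brightness, Optical.data$Contrast,
              Optical.data$Useful_Range,
              xlab = "Brightness %", ylab = "Contrast %",
              zlab = "Useful Range (ng)")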



Questions

  1. Fit a multiple linear regression model to these data.
  2. Estimate \(σ^2\).
  3. Compute the standard errors of the regression coefficients.
  4. Predict the useful range when brightness = 80 and contrast = 75.
  5. Test for significance of regression using α=0.05. What is the P-value for this test?
  6. Construct a t-test on each regression coefficient. What conclusions can you draw about the variables in this model? Use α=0.05.

Answer to Part 1:

Task: Fit a multiple linear regression model to these data.

We will fit the linear regression model:

\(y=\beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon\)

where \(y\) = useful range of gray levels, \(x_1\) = brightness percentage and \(x_2\) = contrast percentage.

R Computation

Optical.data <- data.frame(
  Useful_Range = c(96,50,50,112,96,80,155,144,255),
  Brightness   = c(54,61,65,100,100,100,50,57,54),
  Contrast     = c(56,80,70,50,65,80,25,35,26)
)
# Wrapping each assignment in parentheses prints its value
(Eyi_         <- sum(Optical.data$Useful_Range))                         # sum of y_i
(Exi_1        <- sum(Optical.data$Brightness))                           # sum of x_i1
(Exi_2        <- sum(Optical.data$Contrast))                             # sum of x_i2
(Exi_1_xi_2   <- sum(Optical.data$Brightness*Optical.data$Contrast))     # sum of x_i1*x_i2
(Exi_1Squared <- sum(Optical.data$Brightness^2))                         # sum of x_i1^2
(Exi_2Squared <- sum(Optical.data$Contrast^2))                           # sum of x_i2^2
(Exi_1_yi_    <- sum(Optical.data$Brightness*Optical.data$Useful_Range)) # sum of x_i1*y_i
(Exi_2_yi_    <- sum(Optical.data$Contrast*Optical.data$Useful_Range))   # sum of x_i2*y_i
## [1] 1038
## [1] 641
## [1] 487
## [1] 36603
## [1] 49527
## [1] 30087
## [1] 70012
## [1] 46661

Manual Computation

Extending the table with columns derived from the brightness percentage, contrast percentage, and useful range of gray levels:

Table 1.2 Extended Version of Table 1.1 Useful Range Data

| Observation Number | \(y_i\) | \(x_{i1}\) | \(x_{i2}\) | \(x_{i1}x_{i2}\) | \({x_{i1}}^2\) | \({x_{i2}}^2\) | \(x_{i1}y_i\) | \(x_{i2}y_i\) |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 1 | 96 | 54 | 56 | 3024 | 2916 | 3136 | 5184 | 5376 |
| 2 | 50 | 61 | 80 | 4880 | 3721 | 6400 | 3050 | 4000 |
| 3 | 50 | 65 | 70 | 4550 | 4225 | 4900 | 3250 | 3500 |
| 4 | 112 | 100 | 50 | 5000 | 10000 | 2500 | 11200 | 5600 |
| 5 | 96 | 100 | 65 | 6500 | 10000 | 4225 | 9600 | 6240 |
| 6 | 80 | 100 | 80 | 8000 | 10000 | 6400 | 8000 | 6400 |
| 7 | 155 | 50 | 25 | 1250 | 2500 | 625 | 7750 | 3875 |
| 8 | 144 | 57 | 35 | 1995 | 3249 | 1225 | 8208 | 5040 |
| 9 | 255 | 54 | 26 | 1404 | 2916 | 676 | 13770 | 6630 |

Summing the columns of Table 1.2 reproduces the totals obtained in R above: \(\sum y_i=1038\), \(\sum x_{i1}=641\), \(\sum x_{i2}=487\), \(\sum x_{i1}x_{i2}=36603\), \(\sum x_{i1}^2=49527\), \(\sum x_{i2}^2=30087\), \(\sum x_{i1}y_i=70012\), and \(\sum x_{i2}y_i=46661\).

For the model \(Y=\beta_0 + \beta_1x_1 +\beta_2x_2 + \epsilon\), the least squares normal equations (Montgomery and Runger 2010, Chap. 12.1) are:

\(n{\hat{\beta}}_0 + {\hat{\beta}}_1\sum_{i=1}^{n}x_{i1}+ {\hat{\beta}}_2\sum_{i=1}^{n}x_{i2}= \sum_{i=1}^{n}y_{i}\)

\({\hat{\beta}}_0\sum_{i=1}^{n}x_{i1} + {\hat{\beta}}_1\sum_{i=1}^{n}x_{i1}^2+ {\hat{\beta}}_2\sum_{i=1}^{n}x_{i1}x_{i2}= \sum_{i=1}^{n}x_{i1}y_{i}\)

\({\hat{\beta}}_0\sum_{i=1}^{n}x_{i2} + {\hat{\beta}}_1\sum_{i=1}^{n}x_{i1}x_{i2}+ {\hat{\beta}}_2\sum_{i=1}^{n}x_{i2}^2 = \sum_{i=1}^{n}x_{i2}y_{i}\)

Inserting the computed summations into the normal equations, we obtain:

\(9\hat{\beta}_0 + 641\hat{\beta}_1 + 487\hat{\beta}_2 = 1038\)

\(641\hat{\beta}_0 + 49527\hat{\beta}_1 + 36603\hat{\beta}_2 = 70012\)

\(487\hat{\beta}_0 + 36603\hat{\beta}_1 + 30087\hat{\beta}_2 = 46661\)

The solution to this set of equations is:

\({\hat{\beta}}_0=238.5569\)
\({\hat{\beta}}_1=0.3339\)
\({\hat{\beta}}_2=-2.7167\)
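
As a quick sanity check on the hand solution, the same 3×3 system can be solved directly in base R with solve(), a sketch using the summations computed above:

# Normal equations in matrix form: A %*% beta = b
A <- matrix(c(   9,   641,   487,
               641, 49527, 36603,
               487, 36603, 30087), nrow = 3, byrow = TRUE)
b <- c(1038, 70012, 46661)
solve(A, b)   # approximately 238.5569, 0.3339, -2.7167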

Double-Checking with R

model <- lm(Useful_Range ~ Brightness + Contrast, data=Optical.data)
model
## 
## Call:
## lm(formula = Useful_Range ~ Brightness + Contrast, data = Optical.data)
## 
## Coefficients:
## (Intercept)   Brightness     Contrast  
##    238.5569       0.3339      -2.7167

Therefore, the fitted equation is

\(\hat{y}=238.5569\ +\ 0.3339x_1-2.7167x_2\)

Practical Interpretation: The fitted equation above can be used to predict the useful range of gray levels, given that the values of the regressors, the brightness percentage \((x_1)\) and the contrast percentage \((x_2)\), are known.

Answer to Part 2:

Task: Estimate \(\sigma^2\).

The error variance \(\sigma^2\) measures the spread of the useful range observations about the fitted linear model. An unbiased estimator of \(\sigma^2\) is given by:

\(\hat{\sigma}^2=\frac{\sum_{i=1}^{n}{e_i}^2}{n-p}\)

where \(\sum_{i=1}^{n}{e_i}^2\), the sum of squares of the residuals, is also denoted \(SS_E\):

\(\hat{\sigma}^2=\frac{SS_E}{n-p}\)

Note that n is the number of observations and p is the number of regressors plus one. Thus, in this problem, n = 9 and p=2+1=3.

The residual \((e_i)\) is the difference between the observed value \((y_i)\) of the useful range and the corresponding fitted value \((\hat{y_i})\) obtained from the fitted linear equation:

\(\hat{y}=238.5569\ +\ 0.3339x_1-2.7167x_2\)

Manual Computation

Table 1.3 displays all 9 fitted values of the useful range \((\hat{y_i})\) together with the corresponding residuals \(e_i\) and their squares \(e_i^2\).

Table 1.3 Observations, Fitted Values, and Residuals for Useful Range Data

| Observation Number | \(y_i\) | \(\hat{y_i}\) | \(e_i=y_i-\hat{y_i}\) | \(e_i^2\) |
|---:|---:|---:|---:|---:|
| 1 | 96 | 104.4523 | -8.4523 | 71.44138 |
| 2 | 50 | 41.5888 | 8.4112 | 70.74829 |
| 3 | 50 | 70.0914 | -20.0914 | 403.66435 |
| 4 | 112 | 136.1190 | -24.1190 | 581.72616 |
| 5 | 96 | 95.3614 | 0.6386 | 0.40781 |
| 6 | 80 | 54.6109 | 25.3891 | 644.60640 |
| 7 | 155 | 187.3340 | -32.3340 | 1045.48776 |
| 8 | 144 | 162.5047 | -18.5047 | 342.42392 |
| 9 | 255 | 185.9533 | 69.0467 | 4767.44678 |
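
The fitted values and residuals of Table 1.3 can also be reproduced directly from the lm fit of Part 1, as in this sketch (small differences from the hand-rounded values above are expected, since R carries the coefficients at full precision):

# Rebuild Table 1.3 from the fitted model of Part 1
y_hat <- fitted(model)   # fitted values
e     <- resid(model)    # residuals e_i = y_i - y_hat_i
cbind(y = Optical.data$Useful_Range, y_hat = y_hat, e = e, e_sq = e^2)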

From the data in Table 1.3, summing the \(e_i^2\) column gives the residual sum of squares: \(SS_E=\sum_{i=1}^{9}e_i^2=7927.952843\).

We know that n = 9 and p = 3. Substituting the known values:

\(\hat{\sigma}^2=\frac{SS_E}{n-p}\)
\(\hat{\sigma}^2=\frac{7927.952843}{9-3}\)
\(\hat{\sigma}^2=1321.325474\)

Double-Checking with R

lm2 <- lm(Useful_Range~Brightness+Contrast,data=Optical.data)
anova(lm2)
## Analysis of Variance Table
## 
## Response: Useful_Range
##            Df  Sum Sq Mean Sq F value   Pr(>F)   
## Brightness  1  3960.3  3960.3  2.9973 0.134119   
## Contrast    1 20558.1 20558.1 15.5593 0.007585 **
## Residuals   6  7927.6  1321.3                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Practical Interpretation: This implies that the average squared difference between the predicted and the actual values of the useful range, adjusted for the three estimated coefficients, is \(\hat{\sigma}^2\approx1321.3\).


Answer to Part 3:

Task: Compute the standard errors of the regression coefficients.

“The standard deviation of an estimate is called the standard error. The standard error of the coefficient measures how precisely the regression model estimates the coefficient’s unknown value.” (Minitab, n.d.)

This means that the larger the standard error of the regression coefficients, the less precise is the estimation of the coefficient’s value.

Matrix-Approach Computation

In the matrix approach, the variances of the regression coefficients are given by the diagonal elements of the matrix \(\sigma^2\left(X'X\right)^{-1}\).

Our matrix \(X\) is given by:

library(knitr)
x.data <- c(1,54,56,1,61,80,1,65,70,1,100,50,1,100,
            65,1,100,80,1,50,25,1,57,35,1,54,26
            ) 
x <- matrix(x.data,nrow=9,ncol=3,byrow=TRUE) #Matrix of X-Values
x

\(\left[\begin{matrix}1&54&56\\1&61&80\\1&65&70\\1&100&50\\1&100&65\\1&100&80\\1&50&25\\1&57&35\\1&54&26\\\end{matrix}\right]\)


Meanwhile, its transpose \(X'\) is given by:

xp <- t(x)  #Transpose Matrix of X-Values
xp

\(\left[\begin{matrix}1&1&1&1&1&1&1&1&1\\54&61&65&100&100&100&50&57&54\\56&80&70&50&65&80&25&35&26\\\end{matrix}\right]\)


Multiplying these two matrices, \(X'X\):

xxp <- (xp %*% x) #Product of Original and Transposed Matrix
xxp

\(\left[\begin{matrix}1&1&1&1&1&1&1&1&1\\54&61&65&100&100&100&50&57&54\\56&80&70&50&65&80&25&35&26\\\end{matrix}\right]\left[\begin{matrix}1&54&56\\1&61&80\\1&65&70\\1&100&50\\1&100&65\\1&100&80\\1&50&25\\1&57&35\\1&54&26\\\end{matrix}\right]=\left[\begin{matrix}9&641&487\\641&49527&36603\\487&36603&30087\\\end{matrix}\right]\)


Computing the inverse of the product, \(\left(X'X\right)^{-1}\):

xxp_I <- solve(xxp) #Inverse of Two Matrices' Product
xxp_I

\(\left[\begin{matrix}9&641&487\\641&49527&36603\\487&36603&30087\\\end{matrix}\right]^{-1}=\left[\begin{matrix}1.54821553\ &-0.0150363856\ &-0.006767180\\-0.01503639&0.0003461619&-0.000177746\\-0.00676718&-0.0001777460&0.000359014\\\end{matrix}\right]\)


Finally, multiply the inverse \(\left(X'X\right)^{-1}\) by the estimated error variance from Part 2, \(\hat{\sigma}^2=1321.3\):

MSE <- 1321.325474   # sigma^2-hat from Part 2
C   <- xxp_I * MSE   # estimated covariance matrix of the coefficients

\(\left[\begin{matrix}1.54821553\ &-0.0150363856\ &-0.006767180\\-0.01503639&0.0003461619&-0.000177746\\-0.00676718&-0.0001777460&0.000359014\\\end{matrix}\right]\times1321.3=\left[\begin{matrix}2045.696527\ &-19.8679585\ &-8.9416467\\-19.867959&0.4573925&-0.2348603\\-8.941647&-0.2348603&0.4743743\\\end{matrix}\right]\)


Since the variances of the regression coefficients are given by the diagonal elements of \(\hat{\sigma}^2\left(X'X\right)^{-1}\), we read them off the matrix above:

\(V(\hat{\beta}_0)\approx2045.6965,\quad V(\hat{\beta}_1)\approx0.4573925,\quad V(\hat{\beta}_2)\approx0.4743743\)

The standard errors are the square roots of these variances:

\(se(\hat{\beta}_0)=\sqrt{2045.6965}\approx45.2294\)
\(se(\hat{\beta}_1)=\sqrt{0.4573925}\approx0.676308\)
\(se(\hat{\beta}_2)=\sqrt{0.4743743}\approx0.688748\)
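
In R, the same standard errors can be read off in one line, either from the hand-built matrix C above or from the model's estimated covariance matrix via vcov(), a sketch reusing the C and model objects defined earlier:

# Standard errors = square roots of the diagonal of sigma^2 (X'X)^{-1}
sqrt(diag(C))            # from the manual matrix computation
sqrt(diag(vcov(model)))  # equivalent, straight from the lm fit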

R Computation

multiple.regression <- lm(Useful_Range ~ Brightness + Contrast, data = Optical.data)
summary(multiple.regression)

## 
## Call:
## lm(formula = Useful_Range ~ Brightness + Contrast, data = Optical.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.334 -20.090  -8.451   8.413  69.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 238.5569    45.2285   5.274  0.00188 **
## Brightness    0.3339     0.6763   0.494  0.63904   
## Contrast     -2.7167     0.6887  -3.945  0.00759 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.35 on 6 degrees of freedom
## Multiple R-squared:  0.7557,    Adjusted R-squared:  0.6742 
## F-statistic: 9.278 on 2 and 6 DF,  p-value: 0.01459

The standard errors of the regression coefficients (intercept, brightness, and contrast) computed through the matrix approach agree, up to rounding, with the values reported by R's summary() of the fitted model: 45.2285, 0.6763, and 0.6887.

Answer to Part 4:

Task: Predict the useful range when brightness = 80 and contrast = 75.

From what we had solved in part 1, the fitted regression model of the Useful Range Data in Table 1.1 is given by the equation:

\(\hat{y}=238.5569\ +\ 0.3339x_1-2.7167x_2\)

This regression model can be used to predict the useful range \((y)\) for various percentages of brightness \((x_1)\) and contrast \((x_2)\).

For an observation with a brightness \((x_1)\) of 80 and a contrast \((x_2)\) of 75:

\(\hat{y}=238.5569+0.3339(80)-2.7167(75)=238.5569+26.7120-203.7525\approx61.5164\)

This implies that the corresponding fitted value of the useful range is approximately \(61.52\) ng when the brightness percentage \((x_1)\) is \(80\) and the contrast percentage \((x_2)\) is \(75\).
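
The same prediction can be obtained with R's predict(), reusing the model object fitted in Part 1 (a sketch; the exact digits depend on the full-precision coefficients):

# Predicted useful range at Brightness = 80, Contrast = 75
predict(model, newdata = data.frame(Brightness = 80, Contrast = 75))
# expected result: roughly 61.5 ng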

Answer to Part 5:

Task: Test for significance of regression using α=0.05. What is the P-value for this test?

The appropriate hypotheses are:

\(H_0: \beta_1=\beta_2=0\)
\(H_1: \beta_j\neq0\ \text{for at least one}\ j\)

The test statistic is:

\(f_0=\frac{SS_R/k}{SS_E/(n-p)}=\frac{MS_R}{MS_E}\)

Manual Computation

From Part 2, we know that the mean square error \(MS_E\), i.e. \(\hat{\sigma}^2\), is \(1321.3\). We also know that \(k=2\) because there are two regressors. However, we do not yet know \(SS_R\), the regression sum of squares. Since \(SS_R=SS_T-SS_E\), we can obtain it by first calculating \(SS_T\) from the equation:

\({\rm SS}_T=y'y-\frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n}\)

Eyi_ <- sum(Optical.data$Useful_Range)  # total of the observed y-values
Eyi_^2
## [1] 1077444

Based on this R computation, \(\left(\sum_{i=1}^{9}y_i\right)^2=1077444\). We also know that \(n=9\).

To find \(SS_T\), we also need the product \(y'y\).

y <- matrix(c(96,50,50,112,96,80,155,144,255), ncol = 1)  # column vector of y-values
yp <- t(y)        # transpose y'
ypy <- yp %*% y   # the product y'y
ypy
##        [,1]
## [1,] 152162

As computed by R, \(y'y\) is given by:

\(\left[\begin{matrix}96&50&50&112&96&80&155&144&255\\\end{matrix}\right]\times\left[\begin{matrix}96\\50\\50\\112\\96\\80\\155\\144\\255\\\end{matrix}\right]=152162\)

Putting all these values together to find \(SS_T\):

\(SS_T=152162-\frac{1077444}{9}=152162-119716=32446\)

Now that we have the value of \(SS_T\), we can compute \(SS_R\), given that \(SS_E=7927.952843\):

\(SS_R=SS_T-SS_E=32446-7927.952843=24518.047157\)

Computing the test statistic:

\(f_0=\frac{SS_R/k}{SS_E/(n-p)}=\frac{24518.047157/2}{7927.952843/(9-3)}\approx9.278\)
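
The same arithmetic can be reproduced in R from the hand-computed sums (a sketch; minor rounding differences from the exact lm fit are expected):

SST <- 152162 - 1077444/9           # total sum of squares: 32446
SSE <- 7927.952843                  # residual sum of squares from Part 2
SSR <- SST - SSE                    # regression sum of squares: about 24518.05
f0  <- (SSR / 2) / (SSE / (9 - 3))  # F statistic: approximately 9.278
f0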


R Computation

summary(model)

## 
## Call:
## lm(formula = Useful_Range ~ Brightness + Contrast, data = Optical.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.334 -20.090  -8.451   8.413  69.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 238.5569    45.2285   5.274  0.00188 **
## Brightness    0.3339     0.6763   0.494  0.63904   
## Contrast     -2.7167     0.6887  -3.945  0.00759 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.35 on 6 degrees of freedom
## Multiple R-squared:  0.7557,    Adjusted R-squared:  0.6742 
## F-statistic: 9.278 on 2 and 6 DF,  p-value: 0.01459


Through both the manual and the R computation, we have verified that the F-statistic is \(f_0=9.278\) on \(2\) and \(6\) degrees of freedom.


# Find the P-value
pf(9.278, 2, 6, lower.tail = FALSE)
## [1] 0.0145875
# Find the F critical value
qf(0.05, 2, 6, lower.tail = FALSE)
## [1] 5.143253

Since the P-value \((0.0146)\) is smaller than the significance level \(\alpha=0.05\), and \(f_0=9.278>f_{0.05,2,6}=5.143\), we reject the null hypothesis. We can therefore conclude that the useful range is linearly related to the brightness percentage, the contrast percentage, or both.

Practical Interpretation: Rejecting the null hypothesis does not necessarily imply that the fitted model is an accurate function of the brightness percentage \((x_1)\) and the contrast percentage \((x_2)\). Other tests of model adequacy are needed before we can be fully confident in using the regression model.

Answer to Part 6:

Task: Construct a t-test on each regression coefficient. What conclusions can you draw about the variables in this model? Use α=0.05.

T-test on Regression Coefficient for Brightness Percentage \(\hat\beta_1\)

The hypotheses are:

\(H_0: \beta_1=0\)
\(H_1: \beta_1\neq0\)

The test statistic is:

\(t_0=\frac{\hat{\beta}_1-0}{se(\hat{\beta}_1)}\)

Reject \(H_0\): \(\beta_1\) = 0 if \(|t_0|\) > \(t_{\alpha/2, n-p}\).

# Critical value t_{0.025,6}
qt(p = .025, df = 6, lower.tail = FALSE)
## [1] 2.446912

Manual Computation

We know from Part 3 that the standard error of the regression coefficient for the brightness percentage \((x_1)\) is \(se(\hat{\beta}_1)=0.676307992\). We also know from Part 1 that \(\hat{\beta}_1=0.3339\). Thus:

\(t_0=\frac{0.3339-0}{0.676307992}\)

# Test statistic t_0 for the brightness coefficient
B_1  <- 0.3339        # beta_1-hat from Part 1
seB1 <- 0.676307992   # se(beta_1-hat) from Part 3
t_0  <- (B_1 - 0) / seB1
t_0
## [1] 0.49371

\(t_0=0.49371\)
\(t_0\approx0.494\)

# Find the two-tailed P-value for t_0 = 0.49371
2*pt(q = 0.49371, df = 6, lower.tail = FALSE)
## [1] 0.639

\(P\text{-value} = 0.639\)

R Computation

## 
## Call:
## lm(formula = Useful_Range ~ Brightness + Contrast, data = Optical.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.334 -20.090  -8.451   8.413  69.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 238.5569    45.2285   5.274  0.00188 **
## Brightness    0.3339     0.6763   0.494  0.63904   
## Contrast     -2.7167     0.6887  -3.945  0.00759 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.35 on 6 degrees of freedom
## Multiple R-squared:  0.7557, Adjusted R-squared:  0.6742 
## F-statistic: 9.278 on 2 and 6 DF,  p-value: 0.01459

Since \(|t_0| < t_{0.025,6}\), the computed t-statistic does not fall in the rejection region, so we fail to reject \(H_0\): \(\beta_1=0\). This implies that the brightness percentage \((x_1)\) is not statistically significant.

The P-value offers another basis for this conclusion. The P-value for \(t_0=0.494\) is \(P\approx0.639\), which is greater than the significance level \(\alpha=0.05\). Thus, we fail to reject the null hypothesis.

Practical Interpretation: This is a marginal test: it measures the marginal contribution of the brightness percentage \((x_1)\), given that the contrast percentage \((x_2)\) is in the regression model.

T-test on Regression Coefficient for Contrast Percentage \(\hat\beta_2\)

The hypotheses are:

\(H_0: \beta_2=0\)
\(H_1: \beta_2\neq0\)

The test statistic is:

\(t_0=\frac{\hat{\beta}_2-0}{se(\hat{\beta}_2)}\)

Reject \(H_0\): \(\beta_2\) = 0 if \(|t_0|\) > \(t_{\alpha/2, n-p}\).

# Critical value t_{0.025,6}
qt(p = .025, df = 6, lower.tail = FALSE)
## [1] 2.446912

Manual Computation

We know from Part 3 that the standard error of the regression coefficient for the contrast percentage \((x_2)\) is \(se(\hat{\beta}_2)=0.6887483575\). We also know from Part 1 that \(\hat{\beta}_2=-2.7167\). Thus:

\(t_0=\frac{-2.7167-0}{0.6887483575}\)

# Test statistic t_0 for the contrast coefficient
B_2  <- -2.7167        # beta_2-hat from Part 1
seB2 <- 0.6887483575   # se(beta_2-hat) from Part 3
t_0  <- (B_2 - 0) / seB2
t_0
## [1] -3.944401

\(t_0= -3.9444\)
\(|t_0|= 3.9444\)

# Find the two-tailed P-value
2*pt(q = 3.9444, df = 6, lower.tail = FALSE)
## [1] 0.007586218

\(P\text{-value} = 0.00759\)

R Computation

## 
## Call:
## lm(formula = Useful_Range ~ Brightness + Contrast, data = Optical.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.334 -20.090  -8.451   8.413  69.047 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 238.5569    45.2285   5.274  0.00188 **
## Brightness    0.3339     0.6763   0.494  0.63904   
## Contrast     -2.7167     0.6887  -3.945  0.00759 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.35 on 6 degrees of freedom
## Multiple R-squared:  0.7557, Adjusted R-squared:  0.6742 
## F-statistic: 9.278 on 2 and 6 DF,  p-value: 0.01459

Since \(|t_0| > t_{0.025,6}\), we reject \(H_0\): \(\beta_2=0\). This implies that the contrast percentage \((x_2)\) is statistically significant.

The P-value offers another basis for this conclusion. The P-value for \(|t_0|=3.9444\) is \(P=0.00759\), which is less than the significance level \(\alpha=0.05\). Thus, we reject the null hypothesis.

Practical Interpretation: This is a marginal test: it measures the marginal contribution of the contrast percentage \((x_2)\), given that the brightness percentage \((x_1)\) is in the model.
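
Both marginal t-tests can also be read directly off the coefficient table of the fitted model, a sketch reusing the model object from Part 1:

# Estimates, standard errors, t-values, and two-sided P-values in one table
summary(model)$coefficients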


References

Letendre, Mathias, Alain Bergeron, and Henri H Arsenault. 2004. “Operating Curve Extraction of a Correlator’s Filter.” Optical Engineering 43 (11): 2775–79.
Minitab. n.d. “What Is the Standard Error of the Coefficient?” https://support.minitab.com/en-us/minitab/18/help-and-how-to/modeling-statistics/regression/supporting-topics/regression-models/what-is-the-standard-error-of-the-coefficient/.
Montgomery, Douglas C, and George C Runger. 2010. Applied Statistics and Probability for Engineers. John Wiley & Sons.