Simple Linear Regression

R Markdown Implementation


Question 1

An article in the Journal of Sound and Vibration [“Measurement of Noise-Evoked Blood Pressure by Means of Averaging Method: Relation between Blood Pressure Rise and PSL” (1991, Vol. 151(3), pp. 383-394)] described a study investigating the relationship between noise exposure and hypertension. The following data are representative of those reported in the article.

x y
60 1
63 0
65 1
70 2
70 5
70 1
80 4
90 6
80 2
80 3
85 5
89 4
90 6
90 8
90 4
90 5
94 7
100 9
100 7
100 6
Table 1. Sound Pressure Level (dB), x, and Blood Pressure Rise (mm Hg), y.
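The R code in this document refers to a data frame named df, but the chunk that creates it is not shown. A minimal sketch of how Table 1 might be entered, assuming the column names x and y seen in the printed output:

```r
# Data from Table 1: x = sound pressure level (dB), y = blood pressure rise (mm Hg).
# The name `df` is assumed to match the data frame used elsewhere in this document.
df <- data.frame(
  x = c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
        85, 89, 90, 90, 90, 90, 94, 100, 100, 100),
  y = c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
        5, 4, 6, 8, 4, 5, 7, 9, 7, 6)
)

# Quick structural check: 20 observations of two numeric variables.
str(df)
```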


A. Draw a scatter diagram of y (blood pressure rise in millimeters of mercury) versus x (sound pressure level in decibels). Does a simple linear regression model seem reasonable in this situation?

Graph 1.a Scatter plot of y (Blood Pressure Rise in mm Hg) versus x (Sound Pressure Level in dB).

As the scatter plot shows, a simple linear regression model seems reasonable for this situation: the points follow a roughly linear, increasing trend, with blood pressure rise tending to increase as sound pressure level increases.
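The code behind Graph 1.a is not shown. One way such a scatter plot might be drawn in base R is sketched below; the axis labels and plotting symbol are illustrative choices, and the data are re-entered so the block runs on its own. The correlation coefficient is included only as a rough numeric check on the linear trend.

```r
# Data from Table 1 (re-entered here so the example is self-contained).
df <- data.frame(
  x = c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
        85, 89, 90, 90, 90, 90, 94, 100, 100, 100),
  y = c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
        5, 4, 6, 8, 4, 5, 7, 9, 7, 6)
)

# Scatter plot of blood pressure rise against sound pressure level.
plot(df$x, df$y,
     xlab = "Sound Pressure Level (dB)",
     ylab = "Blood Pressure Rise (mm Hg)",
     pch = 19)

# Sample correlation between x and y; a value near 1 supports a linear trend.
r <- cor(df$x, df$y)
r
```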


B. Fit the simple linear regression model using least squares. Find an estimate of \(\sigma^2\).

Simple Linear Regression Model using Least Squares Method

We will fit a simple linear regression for Table 1 using \(\hat{y}=\hat{\beta}_0 +\hat{\beta}_1x\), wherein \(\hat{y}\) represents the blood pressure rise in mmHg, \(x\) represents the sound pressure level in dB, and \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are the estimates for regression coefficients.

The following quantities may be computed: \[ \begin{aligned} n&=20 \\ \sum_{i = 1}^{20}x_i&=1,656\\ \sum_{i = 1}^{20}y_i&=86\\ \bar{x}&=82.8 \\ \bar{y}&=4.3\\ \sum_{i = 1}^{20}x_i^2&=140,176\\ \sum_{i = 1}^{20}y_i^2&=494\\ \sum_{i = 1}^{20}x_iy_i&=7,654\\ \end{aligned} \]
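These summary quantities can be checked in R. The sketch below re-enters the Table 1 data as plain vectors; the variable names (sum_x, sum_xy, and so on) are illustrative and do not appear elsewhere in this document.

```r
# Data from Table 1: x = sound pressure level (dB), y = blood pressure rise (mm Hg).
x <- c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
       85, 89, 90, 90, 90, 90, 94, 100, 100, 100)
y <- c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
       5, 4, 6, 8, 4, 5, 7, 9, 7, 6)

n      <- length(x)   # number of observations
sum_x  <- sum(x)      # sum of x_i
sum_y  <- sum(y)      # sum of y_i
x_bar  <- mean(x)     # sample mean of x
y_bar  <- mean(y)     # sample mean of y
sum_x2 <- sum(x^2)    # sum of x_i^2
sum_y2 <- sum(y^2)    # sum of y_i^2
sum_xy <- sum(x * y)  # sum of x_i * y_i

c(n = n, sum_x = sum_x, sum_y = sum_y, x_bar = x_bar, y_bar = y_bar,
  sum_x2 = sum_x2, sum_y2 = sum_y2, sum_xy = sum_xy)
```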

We then calculate \(S_{xx}\) and \(S_{xy}\) using the following formulas: \[ \begin{aligned} S_{xx}&=\sum_{i = 1}^{n}x_i^2-\frac{(\sum_{i = 1}^{n}x_i)^2}{n}\\ S_{xy}&=\sum_{i = 1}^{n}x_iy_i-\frac{(\sum_{i = 1}^{n}x_i)(\sum_{i = 1}^{n}y_i)}{n}\\ \end{aligned} \] Substituting the known values gives: \[ \begin{aligned} S_{xx}&=\sum_{i = 1}^{n}x_i^2-\frac{(\sum_{i = 1}^{n}x_i)^2}{n}\\ &=140,176-\frac{(1,656)^2}{20}\\ &=140,176-\frac{2,742,336}{20}\\ &=3,059.2\\ \end{aligned} \] \[ \begin{aligned} S_{xy}&=\sum_{i = 1}^{n}x_iy_i-\frac{(\sum_{i = 1}^{n}x_i)(\sum_{i = 1}^{n}y_i)}{n}\\ &=7,654-\frac{(1,656)(86)}{20}\\ &=7,654-\frac{142,416}{20}\\ &=533.2\\ \end{aligned} \]

Therefore, the least squares estimates of the slope and intercept are: \[ \begin{aligned} \hat{\beta}_1&=\frac{S_{xy}}{S_{xx}}\\ &=\frac{533.2}{3,059.2}\\ &=0.1742939\\ \end{aligned} \] and \[ \begin{aligned} \hat{\beta}_0&=\bar{y}-\hat{\beta}_1\bar{x}\\ &=4.3-(0.1742939)(82.8)\\ &=-10.1315349\\ \end{aligned} \]

Thus, the fitted simple linear regression model (with the coefficients reported to only five decimal places) is: \[\hat{y}=-10.13153+0.17429x\]
##      x y
## 1   60 1
## 2   63 0
## 3   65 1
## 4   70 2
## 5   70 5
## 6   70 1
## 7   80 4
## 8   90 6
## 9   80 2
## 10  80 3
## 11  85 5
## 12  89 4
## 13  90 6
## 14  90 8
## 15  90 4
## 16  90 5
## 17  94 7
## 18 100 9
## 19 100 7
## 20 100 6
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##    -10.1315       0.1743

## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8120 -0.9040 -0.1333  0.5023  2.9310 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.13154    1.99490  -5.079 7.83e-05 ***
## x             0.17429    0.02383   7.314 8.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.318 on 18 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.7343 
## F-statistic:  53.5 on 1 and 18 DF,  p-value: 8.567e-07

Graph 1.b Scatter plot of y (Blood Pressure Rise in mm Hg) versus x (Sound Pressure Level in dB) with fitted simple linear regression model \(\hat{y}=-10.13153+0.17429x\).
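The code that produced Graph 1.b is not shown. A minimal sketch of how the scatter plot with the fitted line might be drawn in base R follows; the labels, colour, and line width are illustrative choices, and the data are re-entered so the block is self-contained.

```r
# Data from Table 1 (re-entered here so the example is self-contained).
df <- data.frame(
  x = c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
        85, 89, 90, 90, 90, 90, 94, 100, 100, 100),
  y = c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
        5, 4, 6, 8, 4, 5, 7, 9, 7, 6)
)

# Least squares fit of y on x.
fit <- lm(y ~ x, data = df)

# Scatter plot of the data with the fitted regression line overlaid.
plot(df$x, df$y,
     xlab = "Sound Pressure Level (dB)",
     ylab = "Blood Pressure Rise (mm Hg)",
     pch = 19)
abline(fit, col = "red", lwd = 2)
```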

We can also obtain the fitted simple linear regression model using R as follows:

We first calculate the three sums of squares and cross products, \(S_{xy}\), \(S_{xx}\), and \(S_{yy}\).

x = df$x
y = df$y

Sxy = sum((x - mean(x)) * (y - mean(y)))
Sxx = sum((x - mean(x)) ^ 2)
Syy = sum((y - mean(y)) ^ 2)
c(Sxy, Sxx, Syy)
## [1]  533.2 3059.2  124.2

Then we can calculate \(\hat{\beta}_0\) and \(\hat{\beta}_1\).

x = df$x
y = df$y

beta_1_hat = Sxy / Sxx
beta_0_hat = mean(y) - beta_1_hat * mean(x)
c(beta_0_hat, beta_1_hat)
## [1] -10.1315377   0.1742939
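These values agree with the hand calculation; the intercept differs only in the sixth decimal place because the hand calculation used a rounded slope. Thus, the fitted simple linear regression model is \[\hat{y}=-10.13153+0.17429x\]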

Estimating the Variance \(\sigma^2\)

We can estimate the variance using R by dividing the sum of squared residuals by \(n-2\).

y_hat = beta_0_hat + beta_1_hat * x
e = y - y_hat
n = length(e)
s2_e = sum(e^2) / (n - 2)
s2_e
## [1] 1.737026

The estimate of the variance is \(\hat{\sigma}^2 = 1.737026\).
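As a cross-check by hand, the same estimate can be obtained from the sums of squares reported earlier (\(S_{yy}=124.2\), \(S_{xy}=533.2\)) using the standard identity \(SS_E = S_{yy}-\hat{\beta}_1S_{xy}\). The symbol \(SS_E\) for the error sum of squares is introduced here for illustration and is not used elsewhere in this document. \[ \begin{aligned} SS_E&=S_{yy}-\hat{\beta}_1S_{xy}\\ &=124.2-(0.1742939)(533.2)\\ &\approx 124.2-92.9335\\ &=31.2665\\ \hat{\sigma}^2&=\frac{SS_E}{n-2}\\ &=\frac{31.2665}{18}\\ &\approx 1.7370 \end{aligned} \]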

The lm Function

Now, how can we check that the values we obtained are consistent with the graph we made in R? We use the lm function. The same output already appears with Graph 1.b, but we can check it again.

fit <- lm(y~x, data=df)
summary(fit)
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8120 -0.9040 -0.1333  0.5023  2.9310 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -10.13154    1.99490  -5.079 7.83e-05 ***
## x             0.17429    0.02383   7.314 8.57e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.318 on 18 degrees of freedom
## Multiple R-squared:  0.7483, Adjusted R-squared:  0.7343 
## F-statistic:  53.5 on 1 and 18 DF,  p-value: 8.567e-07
The lm output confirms the fitted simple linear regression model \(\hat{y}=-10.13153+0.17429x\): the estimated intercept and slope are -10.13154 and 0.17429, respectively (the last digit of the intercept differs from the hand calculation only because of rounding). The residual standard error is the estimate of the standard deviation \(\sigma\), so squaring it gives the estimate of the variance: \((1.318)^2 \approx 1.737\), which agrees with the value computed above to three decimal places.
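Rather than reading the values off the printed summary, the coefficients and the variance estimate can be extracted from the fitted object directly. The sketch below assumes the same Table 1 data, re-entered so that it runs on its own; the name sigma2_hat is illustrative.

```r
# Data from Table 1 (re-entered here so the example is self-contained).
df <- data.frame(
  x = c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
        85, 89, 90, 90, 90, 90, 94, 100, 100, 100),
  y = c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
        5, 4, 6, 8, 4, 5, 7, 9, 7, 6)
)

fit <- lm(y ~ x, data = df)

# Estimated intercept and slope as a named numeric vector.
coef(fit)

# summary(fit)$sigma is the residual standard error;
# squaring it gives the estimate of the error variance.
sigma2_hat <- summary(fit)$sigma^2
sigma2_hat
```

Squaring the unrounded residual standard error avoids the small loss of precision that comes from squaring the printed value 1.318.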


C. Find the predicted mean rise in blood pressure level associated with a sound pressure level of 85 decibels.

Now that we have the fitted simple linear regression model, we can use it to make predictions. We are asked to predict the mean rise in blood pressure level associated with a sound pressure level of 85 decibels.

The simple linear regression model is: \[\hat{y}=-10.13153+0.17429x\] Substituting a sound pressure level of \(x=85\) decibels gives the predicted mean rise in blood pressure level.

\[ \begin{aligned} \hat{y}&=-10.13153+0.17429x\\ &=-10.13153+(0.17429)(85)\\ &=4.68312 \end{aligned} \] Computing the same prediction in R gives:

y_hat <- beta_0_hat + beta_1_hat * 85
y_hat
## [1] 4.683447
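R gives 4.683447, which differs slightly from the hand calculation (4.68312) only because R uses the unrounded coefficients.

Both values round to 4.68, so the estimated mean rise in blood pressure level associated with a sound pressure level of 85 decibels is approximately 4.68 mm Hg.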
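The same prediction can also be obtained from a fitted lm object with the predict function, which avoids retyping the coefficients. A minimal sketch, with the Table 1 data re-entered so the block is self-contained and with new_obs and pred as illustrative names:

```r
# Data from Table 1 (re-entered here so the example is self-contained).
df <- data.frame(
  x = c(60, 63, 65, 70, 70, 70, 80, 90, 80, 80,
        85, 89, 90, 90, 90, 90, 94, 100, 100, 100),
  y = c(1, 0, 1, 2, 5, 1, 4, 6, 2, 3,
        5, 4, 6, 8, 4, 5, 7, 9, 7, 6)
)

fit <- lm(y ~ x, data = df)

# New observation at a sound pressure level of 85 dB.
# The column name must match the predictor name used in the model formula.
new_obs <- data.frame(x = 85)

# Predicted mean rise in blood pressure (mm Hg) at x = 85.
pred <- unname(predict(fit, newdata = new_obs))
pred
```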


References

D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers. 2018.

“Chapter 7 Simple Linear Regression.” https://daviddalpiaz.github.io/appliedstats/simple-linear-regression.html#the-lm-function (accessed Jul. 30, 2021).