Introduction

Simple Linear Regression

A simple linear regression model is used to determine the relationship between a predictor (independent) variable and a response (dependent) variable. This is approximated through a best-fit straight line. This model must represent the trend, or the increase/decrease in the response variable as the predictor variable increases. Additionally, the scattering of the points around the straight line is an important part of the model since it determines how good the model is at prediction.

The equation of a simple linear regression line is: \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\) where:

\(y_i\) is the response variable

\(x_i\) is the predictor variable

\(\beta_0\) is the y-intercept

\(\beta_1\) is the slope

\(\epsilon_i\) is the error term

We do not know the exact values of \(\beta_0\) and \(\beta_1\) since we don’t have an infinite data set and must rely on a sample data set for estimation.

This means we have to use point estimates of \(\beta_0\) and \(\beta_1\) (denoted by \(\widehat{\beta_0}\) and \(\widehat{\beta_1}\) respectively).

This leaves us with the following equation: \(\widehat {y_i} = \widehat{\beta_0} + \widehat{\beta_1} x_i\)

Note how the error term \(\epsilon_i\) is eliminated. This is due to assumptions for the linear regression model, which will be explained later.

Data

For example, consider the forbes data set. This set is found in the MASS library, which can be called with the library function. To view the layout of the first few rows of data, use the head function. In order to continue using the data without having to call it again, use the attach function.

library(MASS)
head(forbes)
##      bp  pres
## 1 194.5 20.79
## 2 194.3 20.79
## 3 197.9 22.40
## 4 198.4 22.67
## 5 199.4 23.15
## 6 199.9 23.35
attach(forbes)

This data set shows the effect of barometric pressure (also denoted by pres) on the boiling point of water (also denoted by bp). Boiling point is the response variable because it depends on barometric pressure, whereas barometric pressure is the predictor variable. This would leave us with the following equation: \[\widehat {bp_i} = \widehat{\beta_0} + \widehat{\beta_1} pres_i\]

In order to visualize any data trends that might be found between boiling point and barometric pressure, create a scatter plot of the data by using the plot function, with parameters pres and bp. You can label the y-axis and x-axis by using the ylab="y axis name"and xlab="x axis name" parameters, as well as the title of the plot using main="Title".

plot(pres,bp,ylab="boiling point (degrees Fahrenheit)",xlab="barometric pressure (inches of Hg)",main="Effect of pres on bp")

We can see from the scatter plot that there appears to be a positive trend between barometric pressure and boiling point. A larger data set is more ideal because we can be more sure about this trend. In order to plot the regression line, we first need to find the values of the intercept and slope. To set up the model, you can create a variable called mymod and set it to lm(bp~pres) in order to represent that the linear model is based on boiling point being a function of barometric pressure.

mymod<-lm(bp~pres)
mymod
## 
## Call:
## lm(formula = bp ~ pres)
## 
## Coefficients:
## (Intercept)         pres  
##     155.296        1.902

This gives us a value of 155.296 for the intercept of the line (or \(\widehat{\beta_0}\)), as well as a value of 1.902 of the slope (or \(\widehat{\beta_1}\)).

In this case, the intercept can’t be interpreted since it is extrapolated, as the range of values for the barometric pressure does not include 0. Trying to interpret this intercept or any other extrapolated point would be an extremely poor representation of the data.

Unlike the intercept, the slope can be interpreted. For every inch Hg increase of barometric pressure, the boiling point increases by 1.902 degrees Fahrenheit.

To build the plot with the regression line, the same plot is created, but a straight line is implemented using the abline function. The parameters for the abline function are the intercept (155.296) and the slope (1.902) respectively.

plot(pres,bp,ylab="boiling point (degrees Fahrenheit)",xlab="barometric pressure (inches of Hg)"
,main="Effect of pres on bp")
abline(155.296,1.902)

Prediction Estimate and Residuals

Using what we know from the regression line we created, we should be able to take any barometric pressure from the data points, and be able to find the predicted boiling point. I took the 10th data point’s pressure, which is [10,2] in vector form (i.e. it is the 10th row and 2nd column of data), and saved it to a variable called bpres. The prediction is calculated by multiplying a 2x1 matrix where the intercept is in the first row and the slope in the second, by a 1x2 matrix with the number 1 in the first column and our chosen pressure value (named bpres) in the second. Note that the use of %*% indicates matrix multiplication.

bpres<-forbes[10,2]
bpres
## [1] 24.01
pred<-coef(mymod)%*%c(1,bpres)
pred
##          [,1]
## [1,] 200.9583

With a barometric pressure of 24.01 inches Hg, the prediction estimate for the boiling point is 200.9583 degrees Fahrenheit. We then take the actual value of the boiling point at 24.01 inches Hg ([10,1] in vector form) and subtract the prediction estimate, giving us the residual.

actual<-forbes[10,1]
actual
## [1] 201.3
res<-actual-pred
res
##           [,1]
## [1,] 0.3416941

The residual value at 24.01 inches Hg is 0.3416941, which means that the actual value was 0.3416941 degrees Fahrenheit above the predicted value. This indicates that the model underestimated the boiling point at 24.01 inches Hg.

In order to indicate how far away data points are from the regression line overall, we use the sum of squared residuals, or RSS. The sum of squared residuals will penalize data points that are farther away from the line. The goal here is to be able to find a \(\beta_0\) and \(\beta_1\) that will minimize the sum of squared residuals. In other words, it is better to have the data closer to the regression line, making the model more accurate.

Model Assumptions

The residuals will help us check the four main assumptions for the linear regression model:

  1. Homoscedasticity

  2. Normality

  3. Independence

  4. The mean of errors (\(E(\epsilon\))) equals 0

Homoscedasticity

The first regression assumption is homoscedasticity, or equal variances among error term values. This can be done with the plot function, where you can plot the residuals against boiling points using myresid~mymod$fitted.values. We want to see the distribution around the line y=0. To plot this line, use the abline command with parameters (x=0,y=0).

myresid <- mymod$residuals
plot(myresid~mymod$fitted.values,xlab="Predicted Values",ylab="Residuals",main="Residual Plot for Boiling Points")
abline(0,0)

There seems to be a curve of data points (with exception to 1 outlier), which appears mostly homoscedastic since the vertical spread is pretty consistent. This outlier could be a result of human error, as this is a very scientific experiment. Note that the outlier is exaggerated by the graph since the range of the residual plot is about -1 to 0.5, which is relatively small. Also, the points below the line y=0 indicate overprediction, whereas the points above it indicate underprediction. Homoscedasticity is important because if that assumption doesn’t hold true, the confidence interval is going to be too wide for the narrow, vertically spread areas, and too narrow for the wide, vertically spread areas. We are going to continue with the assumption that the variances are equal for the sake of testing the data.

Normality

Another regression assumption is that the residuals for each data point follow a normal distribution. Instead of looking at a histogram of residuals, it is much easier to identify the normal distribution by converting it into a straight line. The data points can be converted with the qqnorm function. The qqline function is also conducted, which plots the straight line that represents the normal curve.

qqnorm(myresid,main="Normal Q-Q Plot of Boiling Point")
qqline(myresid)

The graph indicates that the points follow the line relatively closely, with exception to the very first point. This shows that it is very similar to a normal distribution. Similar to the residual plot, there is overprediction when the points are below the line, and underprediction when they are above it. We are going to continue to assume that it is distributed normally so that we can carry out the rest of the tests. For a better normal distribution, a larger sample size would definitely help.

Independence

One of the assumptions for simple linear regression is that any error term value is statistically independent from any other error term. In other words, each data point was collected individually, and therefore there is no influence of one point over another. Any sort of dependence between data points can lead to incorrectly underestimating the standard error for the T-test, which increases the test statistic and lowers the p-value. A lower p-value means that the null hypotheses is more likely to be rejected when it shouldn’t be, leading to a Type I error.

\(E(\epsilon)=0\)

Another regression assumption is that the expectation of the error terms is equal to 0. This is because we are assuming quantiles of a normal distribution, in which the mean is 0. To find the expectation of the error term, we take the mean of the residuals. However, this calculation isn’t necessary since R creates the regression line such that the mean of the residuals is essentially 0. Therefore, we can safely assume that this holds true.

Mean Square Error

In order to estimate the variance of residuals (\(\sigma^2\)), we use the mean square error (denoted \(s^2\)). In R, we can use the summary function to find the mean square error without having to use time-consuming formulas.

summary(mymod)
## 
## Call:
## lm(formula = bp ~ pres)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22687 -0.22178  0.07723  0.19687  0.51001 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155.29648    0.92734  167.47   <2e-16 ***
## pres          1.90178    0.03676   51.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 15 degrees of freedom
## Multiple R-squared:  0.9944, Adjusted R-squared:  0.9941 
## F-statistic:  2677 on 1 and 15 DF,  p-value: < 2.2e-16

The output of this function gives us a residual standard error of 0.444, which is equivalent to \(s\). This means that in order to get the value of the mean square error (\(s^2\)), we have to square it. The resulting MSE \(s^2\) is \((0.444)^2\), or 0.1971, on 15 degrees of freedom (where degrees of freedom = sample size-number of beta coefficients, or 17-2=15).

Hypothesis Testing for \(\beta_1\) and \(\beta_0\)

We want to conduct a hypothesis test to see whether or not the slope (\(\beta_1\)) is significant. The linear regression model isn’t useful unless there is a significant relationship between boiling point and barometric pressure. In our tests from this point forward, we will be using a significance level of \(\alpha=0.05\). \[\widehat{y_i} = \widehat {\beta_0} + \widehat {\beta_1} x_i\] The hypotheses for this test are as follows:

\(H_0: \beta_1 = 0\), or the relationship between barometric pressure and boiling point is not significant.

\(H_A: \beta_1 \neq 0\), or the relationship between barometric pressure and boiling point is significant.

Also, we can take a look at whether or not the y-intercept (\(\beta_0\)) is statistically significant. It gives us an idea on whether it is necessary to include the intercept in the model. However, it is generally good practice to keep the intercept in the model no matter its significance. The hypothesis for testing the intercept are as follows:

\(H_0: \beta_0 = 0\), or the intercept is not significant.

\(H_A: \beta_0 \neq 0\), or the intercept is significant.

We can test these hypotheses using one of two different methods:

  1. T-test

    or

  2. F-test

T-Test

In order to conduct a T-test, we need the p-value of the slope so that we can compare it to the level of significance (\(\alpha=0.05\)) and determine whether or not there is sufficient evidence to reject the null hypothesis. We can determine this value by using the summary function on the model.

summary(mymod)
## 
## Call:
## lm(formula = bp ~ pres)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22687 -0.22178  0.07723  0.19687  0.51001 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155.29648    0.92734  167.47   <2e-16 ***
## pres          1.90178    0.03676   51.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 15 degrees of freedom
## Multiple R-squared:  0.9944, Adjusted R-squared:  0.9941 
## F-statistic:  2677 on 1 and 15 DF,  p-value: < 2.2e-16

We find that the p-value of the slope (under column (Pr>|t|)) is \(<2*10^{-16}\) in the row denoted pres. Since the p-value is much smaller than \(\alpha\), we reject the null hypothesis in favor of the alternative, which states that \(\beta_1 \neq 0\). So, with 95% confidence, we can say that \(\beta_1\) is not equal to 0. This means that there is a linear relationship between barometric pressure and boiling point.

The hypothesis testing for the intercept is calculated the same way as the slope, except you look at the row denoted intercept. For the intercept, we find that the p-value (under column (Pr>|t|)) is also \(<2*10^{-16}\). Again, the p-value is much smaller than \(\alpha\), so we reject the null hypothesis in favor of the alternative , which states that \(\beta_0 \neq 0\). So, with 95% confidence, we can say that \(\beta_0\) is not equal to 0. In other words, the intercept of the linear regression model is significant. Note that in this case, testing the intercept isn’t necessary since we can’t interpret the intercept given our data set.

F-Test

We can also conduct an F-test in order to see whether or not there is a linear relationship between barometric pressure and boiling point. Like the t-test, we use a significance level of 0.05, and want to find the p-value in order to compare it to that. We find the p-value of the F-test on the output of the summary function:

summary(mymod)
## 
## Call:
## lm(formula = bp ~ pres)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22687 -0.22178  0.07723  0.19687  0.51001 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155.29648    0.92734  167.47   <2e-16 ***
## pres          1.90178    0.03676   51.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 15 degrees of freedom
## Multiple R-squared:  0.9944, Adjusted R-squared:  0.9941 
## F-statistic:  2677 on 1 and 15 DF,  p-value: < 2.2e-16

The p-value is less than \(2.2*10^{-16}\). Note that the p-value from this F-test is the same as from the T-test. It is not necessary to conduct the F-test if you have already completed a T-test since it gives the same p-value and therefore the same conclusion for the hypotheses. The p-value is much smaller than \(\alpha\), therefore the null hypothesis is rejected in favor of the alternative hypothesis. In other words, we are 95% confident that the linear relationship between barometric pressure and boiling point is significant.

Confidence and Prediction Intervals

Confidence Intervals for Slope and Intercept

Next, we want to look at the confidence intervals for both \(\beta_1\) and \(\beta_0\) in our forbes data set. We use the confint function with the parameter mymod. This is going to give us a 95% confidence interval by default.

confint(mymod)
##                 2.5 %     97.5 %
## (Intercept) 153.31991 157.273055
## pres          1.82344   1.980127

Thus we are 95% confident that the intercept of the model is between 153.31991 and 157.273055, and the slope is in between 1.82344 and 1.980127.

The intercept’s confidence interval can’t be interpreted because the data set doesn’t include a barometric pressure of 0. Interpreting the intercept would be considered extrapolation since it doesn’t exist within the range of barometric pressures in the data.

For the slope however, it means that for every inch of Hg increase in barometric pressure, there is an increase in boiling point between 1.82344 and 1.980127 degrees Fahrenheit 95% of the time.

Prediction and Confidence Intervals for a Specified Value of pres

In addition, we want to be able to evaluate specific values of barometric pressure to find their confidence intervals and prediction intervals. A confidence interval is for the mean value of the boiling point when \(pres=pres_0\), whereas a prediction interval is for a particular value of the boiling point when \(pres=pres_0\).

Note that both the prediction and confidence intervals are going to be centered at the same value. However, the prediction interval is technically wider than the confidence interval. This intuitively makes sense because the confidence interval is for the mean value of the boiling point (averaged), whereas the prediction value is for a particular value of the boiling point, which is going to have more variation.

We want to apply the confidence and prediction intervals to a certain barometric pressure in our data set. You can choose any barometric pressure that falls within the cloud of data. I chose a barometric pressure of 23 inches Hg for this demonstration. You can create a data frame so that the barometric pressure is constantly 23 inches Hg by using the data.frame function with the parameter pres=23.

Prediction Interval for a Specified Value of pres

I created a prediction interval in which the value of the boiling point falls into it with a probability of \(1-\alpha\). To calculate this, I used the predict function with the parameters of the model, the data frame set, and the interval that is specified as "predict", and set it to a variable called predy. To capture the width of the interval, we do matrix multiplication of predy and the vector [0,-1,1].

newdata<-data.frame(pres=23)
(predy<-predict(mymod,newdata,interval="predict"))
##        fit      lwr      upr
## 1 199.0375 198.0504 200.0246
predy%*%c(0,-1,1)
##       [,1]
## 1 1.974265

We are 95% confident that the boiling point at 23 inches Hg falls in between 198.0504 and 200.0246 degrees Fahrenheit. Also, the width of the interval is found to be 1.974265.

Confidence Interval for a Specified Value of pres

The confidence interval is the interval in which the mean value of the boiling falls into it with a probability of \(1-\alpha\). To calculate this I used the predict function with the parameters of the model, the data frame set, and the interval specified as "confidence", and set it to a variable called confy. Similar to finding the width of the prediction interval, in order to find the confidence interval, we do matrix multiplication of confy and the vector [0,-1,1].

(confy<-predict(mymod,newdata,interval="confidence"))
##        fit     lwr      upr
## 1 199.0375 198.757 199.3181
confy%*%c(0,-1,1)
##        [,1]
## 1 0.5610916

We are 95% confident that the mean value of the boiling point falls in between 198.757 and 199.3181 degrees Fahrenheit. Also, the width of the interval is 0.5610916, which is noticeably smaller than the prediction interval. This is due to the fact that the variance of the mean of the boiling point is going to be smaller than any individual boiling point, as the variance of the mean is the variance divided by n=17 observations.

Simple Coefficient of Determination and Correlation

The Simple Coefficient of Determination, or \(r^2\)

The simple coefficient of determination, denotated \(r^2\), is to measure how useful the linear regression model is. To produce the value of \(r^2\) in R, it can be found under the summary function as Multiple R-squared:

summary(mymod)
## 
## Call:
## lm(formula = bp ~ pres)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.22687 -0.22178  0.07723  0.19687  0.51001 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 155.29648    0.92734  167.47   <2e-16 ***
## pres          1.90178    0.03676   51.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.444 on 15 degrees of freedom
## Multiple R-squared:  0.9944, Adjusted R-squared:  0.9941 
## F-statistic:  2677 on 1 and 15 DF,  p-value: < 2.2e-16

The resulting value is 0.9944. The value of \(r^2\) can range from greater than or equal to 0, and also less than or equal to 1. The closer to 1 that \(r^2\) is, the higher the proportion of total variation is explained variation, and therefore there is lower unexplained variation (since \(r^2= \frac{\text{explained variation}}{\text{total variation}}\), where total variation= explained variation + unexplained variation). So, a value near 1 is going to be a more useful model. For our model, the value is 0.9944, which is extremely close to 1, making it very useful in finding the linear relationship between the boiling point of water and barometric pressure.

Correlation, or r

Next, I wanted to check the correlation between boiling point and barometric pressure. This is represented by \(r\). The formula for this is simply the square root of the value of \(r^2\) since the slope is positive. This is useful in finding out how strong the linear relationship is between barometric pressure and boiling point. The value of \(r\) can generally range from -1 and 1 (in this case, since the slope is positive, it can only range from 0 to 1), where a value of -1 indicates a perfectly negative linear relationship and 1 indicates a perfectly positive linear relationship. 0, on the other hand, indicates that there is no correlation, so any \(r\) values near that show a weak linear relationship between boiling point and barometric pressure. In our example, we can find the value of \(r\) using the cor function, with barometric pressure and boiling point as the parameters.

cor(pres,bp)
## [1] 0.9972102

This function outputs a value of 0.9972102 for \(r\). This is a value that is extremely close to 1, so there is a very strong, positive linear relationship between boiling point and barometric pressure.

Conclusion

Through the creation of a simple linear regression model, we were able to find that there is a significant relationship between the barometric pressure and the boiling point of water with either the T-test or F-test. They relate to each other with the following equation: \[\widehat {bp_i} = 155.296 + 1.902 pres_i\] Based on calculations, we were also able to conclude that the model was quite useful in terms of predicting the data due to the high value of \(r^2=0.9944\). Furthermore, barometric pressure and boiling point have a high positive correlation with a value near 1 (\(r=0.9972102\)).

Something to be careful about is the application of these results, since not all of the regression assumptions might be met with some models. When conducting research using regression, it is important to make sure that the data successfully meets all of these assumptions in order to correctly apply the model. It is also good to have a larger sample size so that the model can be better estimate of that of an infinite sample size.