In this guide, you will learn how to do simple linear regression analysis using R. We will be working with the aircraft dataset found in the robustbase library. If you have this library installed already, go on to the next section.
If you do not have this library, you will need to install it before continuing. To install the package, make sure you have access to the internet, then type the following into the console and hit enter:
install.packages(“robustbase”)
R will then download and install the package along with any other necessary files.
We want to be able to explain and interpret our model, so it’s a good idea to know what the variables in our dataset actually mean. In this guide, we will be working with the aircraft dataset from the robustbase library. Before we create a model, let’s take a closer look at the data.
Now that we have the package installed, we need to call our dataset and attach it. Attaching a dataset means R will include it in its search path, so we can refer to variables by name without having to call up the dataset again.
library(robustbase)
## Warning: package 'robustbase' was built under R version 3.4.3
data(aircraft)
attach(aircraft)
We can take a look at a few of the data points now.
head(aircraft)
## X1 X2 X3 X4 Y
## 1 6.3 1.7 8176 4500 2.76
## 2 6.0 1.9 6699 3120 4.76
## 3 5.9 1.5 9663 6300 8.75
## 4 3.0 1.2 12837 9800 7.78
## 5 5.0 1.8 10205 4900 6.18
## 6 6.3 2.0 14890 6500 9.50
The output for head shows us the values for the first six observations within the dataset, as well as the variable names. However, we don’t know what the variables stand for. Let’s refer to the documentation for more information. The question mark retrieves the documentation for the dataset, but it can also be used for different commands and functions in R if you need a quick reminder on how to use them.
?aircraft
## starting httpd help server ... done
The summary of the dataset is given in the description: “Aircraft Data, deals with 23 single-engine aircraft built over the years 1947-1979, from Office of Naval Research. The dependent variable is cost (in units of $100,000) and the explanatory variables are aspect ratio, lift-to-drag ratio, weight of plane (in pounds) and maximal thrust.”
From the documentation, we know that X1 is aspect ratio, X2 is lift-to-drag ratio, X3 is weight, X4 is thrust, and Y is the cost of the aircraft.
In the context of aviation, thrust is a universal measure of engine power for all types of aircraft engines. Let’s see if there is a correlation between how much thrust a single engine plane has and its cost. Which is the explanatory variable and which is the response? We know that cost is the response variable from reading the documentation, but let’s try to think about it intuitively.
Aircraft with higher thrust ratings tend to achieve higher speeds and consequently require stronger parts and airframes to withstand the stresses, driving up costs. Thus, it makes sense that cost is the response variable and thrust is the explanatory variable.
Now that we know our explanatory and response variables, we can start to build our model. Let’s first create a scatterplot of our data.
When creating scatterplots for regression, we want the explanatory variable to be on the x-axis, and the response to be on the y-axis. The syntax for plot is plot(explanatory, response), as shown below.
plot(X4, Y)
The default labels for the axes are the variable names. To make things easier to understand, let’s rename those axes. This can be done by adding the xlab and ylab arguments to plot. Be sure to put axis names in quotations (otherwise you’ll get an error).
plot(X4, Y, xlab = "Maximum Thrust (in pounds)", ylab = "Cost (in $100,000)")
Now take a look at the scatterplot. Are there any trends? If so, is it an upward or downward trend? Is the trend linear, or does it follow another curve? Do the data points closely follow the trend? Are there any points that do not follow the trend? These are general guidelines for looking at scatterplots of data. Remember, if the data doesn’t follow a linear trend, a linear model wouldn’t be very helpful.
From looking at the scatterplot, we can see an upward trend; that is, aircraft with higher maximum thrust ratings tend to cost more. The data points seem to follow a linear relationship. Thus, a linear model would fit the data. Now we can model that relationship by creating a linear regression model, which is also referred to as a line of best fit.
Recall the general form of a regression equation is \[\hat{y}=\beta_0 + \beta_1x\] where \(\hat{y}\) is the estimated value of the response, \(\beta_1\) is the coefficient for the slope and \(\beta_0\) is the intercept.
The function we will use is lm (lowercase “LM”). This function creates the linear model. The syntax is lm(response ~ explanatory). Running this command by itself will directly give the output for the regression model, but it’s a good idea to save it because we will be analyzing our model later on.
ThrustMod <- lm(Y ~ X4)
ThrustMod
##
## Call:
## lm(formula = Y ~ X4)
##
## Coefficients:
## (Intercept) X4
## -3.852546 0.001519
I saved the model as ThrustMod, but any name works. To get the output, we only need to type the name and run it as code. The output gives us the slope and intercept of the regression equation. Try writing out the equation now.
Answer: \(\hat{cost} = 0.001519*thrust - 3.852546\)
In the previous section, we created a regression model for the relationship between the cost of a single-engine plane and its maximum rated thrust. We got an equation for this relationship, but how do we put it into words?
We can interpret our model by first looking at the slope. The slope will tell us how much the response variable is expected to change if we increase the explanatory variable by one unit. For our example, the slope is 0.001519, and our response variable is measured in units of $100,000. Thus, our model tells us that for every additional pound of thrust, we expect the cost of a single engine aircraft to increase by $151.90.
Now let’s look at the intercept. Unlike the slope, we need to be very careful when dealing with intercepts. We can only interpret the intercept if it’s reasonable for our explanatory variable to take on a value of 0 and we have data at or near 0. If both conditions are satisfied, then the intercept is the predicted value of the response when the explanatory has a value of 0. Otherwise, the intercept has no meaning.
In the context of this example, is it reasonable for a plane to have 0 pounds of thrust? Can we interpret the intercept of our regression line?
Answer: A plane without thrust wouldn’t be able to go anywhere, so it’s unreasonable for our explanatory variable to be zero. Additionally, setting thrust to zero would be going outside the scope of our observed data (known as extrapolation), which should be avoided. Therefore, the intercept cannot be interpreted in this model.
When we looked at the scatterplot, one of the questions we wanted to answer was whether the data followed the trend closely. Now that we have the regression equation, we can plot it with the scatterplot to better answer this question.
If we create the scatterplot first and then use the abline function, R will graph the regression line directly on the scatterplot. Abline is used to add a straight line to the current plot, so it’s important to add it immediately after creating the scatterplot of the data.
The syntax is as follows: abline(a,b) where a is the intercept and b is the slope. Plugging in the values of the intercept and slope from our linear regression model graphs the regression line. Let’s try it now.
plot(Y ~ X4, xlab = "Maximum Thrust (in pounds)", ylab = "Cost (in $100,000)")
abline(-3.852546, 0.001519)
Now let’s take another look at our data. It seems the data follows the trend fairly well.
Before we move on to predictions, let’s take a quick look at indexing. Datasets in R are arranged as a maxtrix, with each row representing one observation, and the columns representing variables. If we simply type the name of the dataset and run it as code, R will output the complete maxtrix. This can get long if the dataset has a large number of observations, so we rely on indexing.
Indexing allows us to retrieve one row, column, or specific entry we’re interested in without having to scroll through the entire matrix. Before we dive into indexing, let’s quickly remind ourselves what the columns represent in our dataset.
head(aircraft)
## X1 X2 X3 X4 Y
## 1 6.3 1.7 8176 4500 2.76
## 2 6.0 1.9 6699 3120 4.76
## 3 5.9 1.5 9663 6300 8.75
## 4 3.0 1.2 12837 9800 7.78
## 5 5.0 1.8 10205 4900 6.18
## 6 6.3 2.0 14890 6500 9.50
The columns are labeled using the variable names. The column on the left containing the numbers 1 through 6 is only naming the observations. Some datasets will have specific names for each observation, while others, such as this one, will simply number them.
Suppose we want to know the entry in the i-th row, j-th column of a dataset. The syntax is dataset_name[i,j] and the output is that entry. In our case, if we wanted to know the cost of the plane represented by the 14th row, we would run:
aircraft[14,5]
## [1] 15.73
The value of the entry is 15.73
If we wanted to know all values for an observation in the i-th row, we would run dataset_name [i, ]. Note that we leave a blank space. The output will be the entire row. If we want to know all information on the 14th plane, we would run:
aircraft[14, ]
## X1 X2 X3 X4 Y
## 14 4.3 9.7 13384 17900 15.73
This gives us all 5 variables for this plane.
If we wanted to retrieve the entire j-th column, then we would run dataset_name[ ,j]. Note that we now leave the blank in the space where the row number would normally go. The output is every value in the j-th column for each observation in the dataset. If we wanted a list of costs for all aircraft in our dataset, we would run the following:
aircraft[ ,5]
## [1] 2.76 4.76 8.75 7.78 6.18 9.50 5.14 4.76 16.70 27.68
## [11] 26.64 13.71 12.31 15.73 13.59 51.90 20.78 29.82 32.78 10.12
## [21] 27.84 107.10 11.19
This gives us the costs of all the aircraft, starting with the first one and continuing in order.
Now that we have our model, we want to use it to predict costs of planes based on their thrust ratings, as well as calculate residual values.
Predicting using linear regression is fairly straightforward: plug in a value for the predictor, and the result is a predicted response. However, we need to be careful about what we plug into the regression equation.
The model should only be used to predict within the scope of the observed data. Predicting outside this scope is known as extrapolation, and should be avoided. This is because a regression model only describes the relationship within the scope of our observed data; we don’t know if this relationship still holds outside of the scope.
Let’s work with the 12th plane in the dataset. First, we find how much thrust it has.
aircraft[12,4]
## [1] 16000
This plane has 16,000 pounds of thrust. Now we can plug this value into our regression equation. If you don’t remember the coefficients, you can use the coef function as quick reminder.
coef(ThrustMod)
## (Intercept) X4
## -3.852546291 0.001519477
Plugging in the values:
PredCost12 <- 0.001519477 * 16000 - 3.852546291
PredCost12
## [1] 20.45909
Our prediction is 20.45909. Since the costs are all in units of $100,000, we predict the cost of a single engine aircraft with 16,000 pounds of thrust is $2,045,909.
Note: We can predict responses for all values of the predictor in the scope of our data; we are not restricted to plugging in only observed values. The range for thrust is from 3120 to 37,000, so any value within this range can be used for prediction.
The residual of one observation is simply the observed value minus the predicted value. Let’s continue working with the 12th aircraft. We need to know the actual cost of this plane.
ActCost12 <- aircraft[12, 5]
ActCost12
## [1] 13.71
The actual cost of this plane is 13.71 (in units of $100,000). Now we take the actual cost minus the predicted cost to find the residual.
ActCost12 - PredCost12
## [1] -6.749086
The residual is -6.749086 (in units of $100,000). This means our predicted cost of this plane was $647,908.60 too high.
Check your understanding: predict the cost of the plane on line 9, and calculate the residual. Specify units for your answer.
Answer: predicted cost is $1,863,571.26, actual cost is $1,670,000.00 Residual is -$1,935,713.00
The Mean Square Error, or MSE, is a measure of the average of the squares of the residuals. The MSE is equal to the sum of all the residuals squared divided by (n-2). The standard error, sometimes called the root mean square error, is the square root of the MSE, and can be found in the model summary as Residual standard error.
summary(ThrustMod)
##
## Call:
## lm(formula = Y ~ X4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.705 -7.183 -0.225 3.674 54.732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8525463 6.1904411 -0.622 0.540419
## X4 0.0015195 0.0003274 4.641 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.03 on 21 degrees of freedom
## Multiple R-squared: 0.5063, Adjusted R-squared: 0.4828
## F-statistic: 21.53 on 1 and 21 DF, p-value: 0.0001406
We can see our standard error is 16.03. Our MSE is simply this number squared.
16.03^2
## [1] 256.9609
Our MSE is 256.9606.
There are a few assumptions that apply to linear regression:
Linearity: There is a linear relationship between the predictor and response.
Normality: the residuals have a normal distribution.
Homoscedasticity: The residuals have equal variances for all predictor values.
Independence: Our observations are independent.
We checked linearity when we created scatterplots of our data, and observed a linear relationship. If our scatterplot resembled a parabola or some other curve, then the linearity assumption would be violated.
The independence assumption is important to keep in mind if we use time series data (measurements of the same variable collected over time). Since our data is cross sectional, we can safely assume the independence assumption holds.
The normality and equal variance assumptions can be checked using R.
This assumption can be checked by using either a histogram of the residuals or by using a Q-Q plot. Both methods will require us to find all of the residuals for the data, so let’s do that now.
The code below tells R to find residuals for the model named ThrustMod. The dollar sign indicates a mathematical operation. The output is a string of values, which is saved in AllResid.
AllResid <- ThrustMod$residuals
Now we can create a histogram of the residuals. We use the hist command with the first argument being the string of residuals we just found. xlab, ylab, and main can be added to change x-axis labels, y-axis labels, and title, respectively.
hist(AllResid, xlab="Residuals", main = "Histogram of Residuals")
When we look at the histogram, we can see some slight skewness.
The Q-Q plot is another tool we can use to check for normality. It plots the quantiles of the residuals against the quantiles of a normal distribution.
qqnorm(AllResid)
qqline(AllResid)
If the normality assumption holds, we would see the quantiles closely following the line. Most of the points seem to follow a straight line, but we do have some deviation. Looks like the normality assumption does not hold.
We can refer back to the scatterplot of our original data to check for homoscedasticity.
plot(X4, Y, xlab = "Maximum Thrust (in pounds)", ylab = "Cost (in $100,000)")
We want the vertical spread of the data points to be roughly the same for all thrust values. Our data points are clustered closer together for low thrust values and spread out for higher thrust values. It seems that the equal variance assumption does not hold.
Another way you could check for equal variances is to plot the residuals against the values of thrust. Again we want to look for fairly even vertical spread across all values of thrust.
plot(AllResid ~ X4, xlab= "Thrust" , ylab = "Residuals")
abline (0,0)
Here we can see additional evidence that the equal variance assumption does not hold. We would need to be careful when running hypothesis tests on our model due to the violation of both the normality and equal variance assumptions.
We can test the slope of a regression model to see if there is a significant linear relationship between the response and predictor.
If the slope of the regression line is equal to zero, then it means there is no linear relationship between the predictor and the response. This is our null hypothesis. Since we’re looking for the existence of a significant linear relationship, our alternative hypothesis is that the slope is not equal to zero. Expressing this idea using notation (\(\beta_1\) is the coefficient for slope):
\(H_0\): \(\beta_1 = 0\)
\(H_a\): \(\beta_1 \neq 0\)
All the information we need for this test can be found in the model summary.
summary(ThrustMod)
##
## Call:
## lm(formula = Y ~ X4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.705 -7.183 -0.225 3.674 54.732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8525463 6.1904411 -0.622 0.540419
## X4 0.0015195 0.0003274 4.641 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.03 on 21 degrees of freedom
## Multiple R-squared: 0.5063, Adjusted R-squared: 0.4828
## F-statistic: 21.53 on 1 and 21 DF, p-value: 0.0001406
The test statistic, given as t value for X4, is 4.641. Our sample size is 23 (n=23), and degrees of freedom is \((n-2)\). Thus, our test statistic follows a t-distribution with 21 degrees of freedom.
The p-value is given in the column labeled Pr(>|t|). The values given are for a two-sided t-test. Our p-value is 0.000141. The three stars next to it are just R’s way of saying the p-value is significant.
Since our p-value is significantly smaller than any reasonable level of significance, we reject the null hypothesis in favor of the alternate. There is strong evidence to suggest that there is a linear relationship between the thrust and cost of single engine aircraft.
We can also create a confidence interval for the slope of our regression line. We use the confint function, with the first argument being the model and the second argument specifying our desired confidence level.
confint (ThrustMod, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) -1.672627e+01 9.021180663
## X4 8.385427e-04 0.002200411
The output will give us the interval for both the slope and intercept. The 95% confidence interval for the slope is (8.385427*10^-4, 0.002200411).
We are 95% confident that for each additional pound of thrust, the cost of a single engine plane will see an increase between $0.838527 and $2.200411.
We can also test the intercept of the regression line. The process is similar to the one we followed to test the slope.
The null hypothesis states the intercept is equal to zero, whereas the alternative hypothesis states the intercept is not equal to zero. The test statistic and p-value can both be found in the model summary, in the row titled “(Intercept)”.
Expressed using notation (\(\beta_0\) is the intercept):
\(H_0\): \(\beta_0 = 0\)
\(H_a\): \(\beta_0 \neq 0\)
Our test statistic is -0.622, which gives a p-value of 0.540419 for our two-sided t-test.
We have our p-value, but what does this mean? In our case, nothing. Our intercept is an extrapolated value which we are unable to interpret. In general, whenever the intercept cannot be interpreted, we do not need to test it.
In order to interpret a t-test for the intercept, we must be able to interpret it and it must make sense given the context of the data for \(y\) to be 0 when \(x\) is 0, since this is what the null hypothesis states. If both conditions are met, then we can drop the intercept from the model if we fail to reject \(H_o\). Otherwise, we generally keep the intercept regardless of what the p-value is.
The regression equation \(\hat{y}=\beta_0 + \beta_1x_0\) gives a prediction for the response variable at some given value of the explanatory, denoted as \(x_0\). \(\hat{y}\) in this case is the point estimate of the mean value of the response when the explanatory is \(x_0\).
However, \(\hat{y}\) is rarely going to be equal to the actual value of \(y\) or the actual mean value of \(y\) when \(x\) is \(x_0\). Prediction and confidence intervals can be useful in finding a range for these actual values. The prediction interval is for an individual value of \(y\) given a particular \(x_0\), whereas the confidence interval is for the mean value of \(y\) given a particular \(x_0\).
Both intervals require a given value of the predictor as they are centered at \(\hat{y}\). Let’s go back to our dataset and look at the 19th observation.
aircraft[19, ]
## X1 X2 X3 X4 Y
## 19 2.8 3.7 28539 34000 32.78
This plane has 34000 pounds of thrust. We create a new data frame to hold this given value, then use the predict function to create our interval. The default level is 95% but you can specify any level by changing the level argument.
intervalData <- data.frame(X4 = 34000)
PI <- predict(ThrustMod, intervalData, interval = "predict", level = 0.95)
PI
## fit lwr upr
## 1 47.80967 11.59702 84.02232
The lower and upper bounds are denoted by lwr and upr, respectively. The predicted cost is denoted as fit. Our 95% prediction interval is (11.59702, 84.02232).
We are 95% sure the cost of a single engine aircraft with 34,000 pounds of thrust is between $1,159,702 and $8,402,232.
To create the confidence interval, we use predict again but change the interval argument.
CI <- predict(ThrustMod, intervalData, interval = "confidence", level = 0.95)
CI
## fit lwr upr
## 1 47.80967 33.66739 61.95195
As with the prediction interval, the confidence interval is also centered at the predicted cost. Our 95% confidence interval for the mean cost of an aircraft with 34,000 pounds of thrust is (33.66739, 61.95195).
We are 95% confident the mean cost of a single engine aircraft with 34,000 pounds of thrust is from $3,366,739 to $6,195,195.
Notice how the prediction interval is much wider than the confidence interval. This is because estimating a single response involves a greater degree of uncertainty than estimating the mean.
The response variable in our linear regression model has variation when compared to our observed data. The total variation is composed of explained and unexplained variation.
Explained variation is the portion of total variation that is explained by the model. This is measured by \(r^2\), which is sometimes called the simple coefficient of determination. This is equal to the explained variation divided by the total variation. The model summary gives the \(r^2\) value.
summary(ThrustMod)
##
## Call:
## lm(formula = Y ~ X4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.705 -7.183 -0.225 3.674 54.732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8525463 6.1904411 -0.622 0.540419
## X4 0.0015195 0.0003274 4.641 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.03 on 21 degrees of freedom
## Multiple R-squared: 0.5063, Adjusted R-squared: 0.4828
## F-statistic: 21.53 on 1 and 21 DF, p-value: 0.0001406
The \(r^2\) value is given as “Multiple R-squared” in the summary output. Adjusted R-squared will be covered in a future guide. \(r^2\) is interpreted as the percent of the variability of the response that is explained by its linear relationship with the predictor.
In our example, \(r^2\) = 0.5063. 50.63% of the variability in the cost of a single engine aircraft is explained by the linear relationship between cost and thrust. It seems our linear regression model is not a great fit for our data.
The simple correlation coefficient, or just correlation, is a way to measure the relationship between the response and predictor. It is denoted by \(r\). The correlation can be found by taking the square root of \(r^2\). Note that while \(r^2\) is always positive and between 0 and 1, \(r\) can range from -1 to 1. We need to look at our model to determine whether \(r\) is positive or negative. \(r\) is positive if an increase in the predictor is associated with an increase in the response (when the slope is positive). \(r\) is negative if an increase in the predictor is associated with a decrease in the response (when the slope is negative).
Now let’s calculate the correlation for our model. Let’s first take a quick look at our slope.
coef(ThrustMod)
## (Intercept) X4
## -3.852546291 0.001519477
We can see that the slope is positive, so we know \(r\) should also be positive. Then we just take the square root of \(r^2\).
sqrt (0.5063)
## [1] 0.7115476
\(r\)=0.7115476
There is a faster way to calculate the correlation by using the cor function in R. The syntax is cor(variable1, variable2), and the order of the variables does not matter.
cor(Y, X4)
## [1] 0.7115392
We get a correlation of 0.7115392. Notice that this answer is slightly different than the one we calculated manually. This is due to rounding error and is not cause for concern.
Note: we can run a two-sided hypothesis test on the correlation, where the null hypothesis is that the correlation is equal to zero. Naively this is just another way of testing for a linear relationship between the predictor and the response, which we already did when we tested the slope of the regression line.
The F-test for simple linear regression is just another way of testing for a significant relationship between the predictor and the response. We should get the same p-value and conclusion that we got when testing the slope of the regression line.
\(H_o\): there is no significant linear relationship between the cost of a single engine aircraft and its thrust.
\(H_A\): There is a significant linear relationship between the cost of a single engine aircraft and its thrust.
We can find the F-test statistic and the corresponding p-value in the model summary.
summary(ThrustMod)
##
## Call:
## lm(formula = Y ~ X4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.705 -7.183 -0.225 3.674 54.732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8525463 6.1904411 -0.622 0.540419
## X4 0.0015195 0.0003274 4.641 0.000141 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.03 on 21 degrees of freedom
## Multiple R-squared: 0.5063, Adjusted R-squared: 0.4828
## F-statistic: 21.53 on 1 and 21 DF, p-value: 0.0001406
Our F-statistic has a value of 21.53, with 1 and 21 degrees of freedom. This results in a p-value of 0.0001406, which is the same p-value we got when testing the slope.
Since the p-value is smaller than any reasonable level of significance, we reject the null hypothesis in favor of the alternative. There is strong evidence to suggest there is a significant linear relationship between the cost of a single engine aircraft and its thrust.