Regression is one of the most important tools in your research toolbox. Built on an incredibly simple engine, regression analysis is a powerful tool for prediction, inference, and decision making. Regression, generally speaking, refers to a suite of statistical processes that describe the relationships between a dependent variable \(Y\) and any number of independent variables \(X_i\).
The simplest kind of regression is just the relationship between two random variables, which can be represented by a straight line. Therefore, the very center of regression analysis is the simple linear equation you learned in 6th grade algebra.
\[ Y = \alpha +\beta X + \epsilon \]
Where \(Y\) is the response variable, \(\alpha\) is the intercept, \(\beta\) the coefficient on the predictor variable \(X\), and \(\epsilon\) is the residual error term.
Regression analysis is probably the most flexible statistical technique available to you as a researcher. It is capable of handling multiple independent variables, continuous and categorical data, non-linear responses, interaction effects, and mixed effects. Additionally, the generalized linear model provides a powerful extension of the humble linear model to non-normal data. Given its utility, I think you will find that almost any research question can be analyzed with a regression of some kind.
Performing a regression in R is easy. The base stats package provides all of the basic functionality you will need to perform most forms of regression analysis. Let’s get started with a quick example.
library(Stat2Data)
library(investr)
library(ggplot2)
data("Diamonds")
d <- Diamonds
head(d)
##   Carat Color Clarity Depth PricePerCt TotalPrice
## 1  1.08     E     VS1  68.6     6693.3     7228.8
## 2  0.31     F    VVS1  61.9     3159.0      979.3
## 3  0.31     H     VS1  62.1     1755.0      544.1
## 4  0.32     F    VVS1  60.8     3159.0     1010.9
## 5  0.33     D      IF  60.8     4758.8     1570.4
## 6  0.33     G    VVS1  61.5     2895.8      955.6
The Diamonds data set contains information about price and several quality indicators for a sample of 351 diamonds. Try ?Diamonds for more information about the data.
ggplot(d) +
geom_point(aes(x = Carat, y = TotalPrice), size = 4, alpha = 0.5, color = "goldenrod") +
theme_classic()
ggplot(d) +
geom_point(aes(x = Carat, y = PricePerCt), size = 4, alpha = 0.5, color = "goldenrod") +
theme_classic()
Now that we have the data visualized, we can go ahead with fitting regression models to them. The first plot shows a non-linear relationship, where price climbs faster and faster as carat increases. The second plot shows a more classic linear relationship, because the price has already been adjusted for the number of carats.
Fitting a linear model in R is simple. We just write out our formula inside the lm() function. Fitting a model with lm() uses the ordinary least squares (OLS) method, which finds the straight line through the data that minimizes the sum of squared differences between the observed values and the predicted values.
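Under the hood, simple OLS even has a closed-form solution. As a quick sanity check, the slope and intercept can be computed directly from the sample covariance and variance, and they should match what lm() reports below:
# Closed-form OLS estimates for a one-predictor model:
# slope = cov(x, y) / var(x); intercept = mean(y) - slope * mean(x)
b <- cov(d$Carat, d$PricePerCt) / var(d$Carat)
a <- mean(d$PricePerCt) - b * mean(d$Carat)
c(intercept = a, slope = b)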
Let’s fit a model to describe PricePerCt ~ Carat.
lm1 <- lm(PricePerCt ~ Carat, d)
summary(lm1)
##
## Call:
## lm(formula = PricePerCt ~ Carat, data = d)
##
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4683.7  -867.1  -222.3   504.1  6861.2 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1267.0      184.8   6.857 3.21e-11 ***
## Carat         4977.8      165.8  30.025  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1532 on 349 degrees of freedom
## Multiple R-squared: 0.7209, Adjusted R-squared: 0.7201
## F-statistic: 901.5 on 1 and 349 DF, p-value: < 2.2e-16
By calling summary() on the model, we get an output displaying each of the coefficients and its associated p-value. We also get some global statistics for the model fit, like the residual standard error, the \(R^2\) value, the F-statistic, and the associated p-value.
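Before leaning too hard on those statistics, it is worth a glance at the standard diagnostic plots, which base R produces directly from the fitted model object:
# Base R diagnostics: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(lm1)
par(mfrow = c(1, 1))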
The most important piece of a regression output is the set of coefficients. In fact, I would almost suggest ignoring the p-values on the coefficients altogether, especially for models with multiple predictors. The majority of the information we are after is in the coefficients themselves. In this example, the \(\beta\) on Carat was 4977.8. In a linear regression, the interpretation is simple: for every one-unit increase in Carat, we expect a $4977.8 increase in PricePerCt. The intercept of 1267.0, on the other hand, doesn't mean anything to us in practice - it is the hypothetical PricePerCt of a zero carat diamond, and there is no such thing. The intercept won't always be meaningless, but in this example it simply anchors our prediction line in the right place.
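If you want the coefficients on their own, or interval estimates around them, coef() and confint() extract both directly from the fitted model:
coef(lm1)                   # point estimates for the intercept and slope
confint(lm1, level = 0.95)  # 95% confidence intervals for the coefficients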
Speaking of prediction lines, let’s add that to our plot!
preds <- predict(lm1)
newdat <- cbind(d, preds)
ggplot(newdat, aes(x = Carat, y = PricePerCt)) +
geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
geom_line(aes(y = preds), size = 1) +
theme_classic()
To add this line, I first created a new vector of the predicted values. Then I used cbind(), which stands for column bind, to append the predictions to my original data. Finally, I used geom_line() to draw in the predictions. I could have simply used geom_smooth() to get a similar line. But like I mentioned in a previous post, I strongly urge you to add the predicted values directly from the model yourself. The reason is that geom_smooth() fits its own model to draw the line (by default a loess or GAM smoother, depending on sample size), which may not be the model you defined. By manually adding the predicted values, you ensure that what you display actually came from the model you fit.
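If you do reach for geom_smooth(), at least specify the method and formula explicitly so the displayed line matches the model you fit:
# Pin geom_smooth() to a simple linear fit instead of its default smoother
ggplot(d, aes(x = Carat, y = PricePerCt)) +
  geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "black") +
  theme_classic()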
We can add more information to our plot by including uncertainty intervals. In general, we have two options: confidence or prediction intervals. Confidence and prediction intervals are not the same thing, and you should be careful not to equate these two.
Confidence intervals produce a range of values constructed to contain the parameter of interest a specified proportion of the time under hypothetical repeated experiments. Confidence levels are customarily set at 95%, or 1 - \(\alpha\), circling back to the traditional \(\alpha\) level of 0.05. When interpreting a 95% confidence interval, you can think of it as saying "if we repeated this experiment (or sampling process) 100 times, about 95 of the resulting intervals would contain the true mean". However, it does not guarantee that the true mean lies anywhere within any particular interval! The confidence intervals are derived from your data, therefore they are only as good as your data.
preds <- predict(lm1, interval = "confidence", level = 0.95)
newdat <- cbind(d, preds)
ggplot(newdat, aes(x = Carat, y = PricePerCt)) +
geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
geom_line(aes(y = fit), size = 1) +
geom_line(aes(y = lwr), lty = 2, size = 1) +
geom_line(aes(y = upr), lty = 2, size = 1) +
theme_classic()
The confidence intervals are great at displaying the precision of the estimated mean. Since we have a lot of data here, the confidence intervals fit tightly around the mean prediction line. A more informative interval might be the prediction interval. Unlike the confidence interval, which describes uncertainty about the mean, the prediction interval describes the spread of individual observations. For a 95% prediction interval, you could say that about 95 out of 100 new individual observations are expected to fall within that range. Let's compute and plot the prediction intervals.
preds <- predict(lm1, interval = "prediction", level = 0.95)
## Warning in predict.lm(lm1, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
newdat2 <- cbind(newdat, preds)
colnames(newdat2)[10:12] <- c("fit.pred", "lwr.pred", "upr.pred")
ggplot(newdat2, aes(x = Carat, y = PricePerCt)) +
geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
geom_line(aes(y = fit), size = 1) +
geom_line(aes(y = lwr), lty = 2, size = 1) +
geom_line(aes(y = upr), lty = 2, size = 1) +
geom_line(aes(y = lwr.pred), lty = 3, size = 1) +
geom_line(aes(y = upr.pred), lty = 3, size = 1) +
theme_classic()
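As a stylistic alternative, you could shade the prediction interval with geom_ribbon() instead of drawing dashed boundary lines. Here is a sketch using the same newdat2 columns:
# Draw the ribbon first so the points and fitted line sit on top of it
ggplot(newdat2, aes(x = Carat, y = PricePerCt)) +
  geom_ribbon(aes(ymin = lwr.pred, ymax = upr.pred), fill = "grey85") +
  geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
  geom_line(aes(y = fit), size = 1) +
  theme_classic()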
Suppose we think that the relationship between TotalPrice and Carat is quadratic, meaning that TotalPrice increases non-linearly but with a fixed acceleration. Our model would then be:
\[ Y = \alpha + \beta_1 X + \beta_2 X^2 + \epsilon \]
Let's fit this model using TotalPrice ~ Carat. Here are the data plotted again.
ggplot(d) +
geom_point(aes(x = Carat, y = TotalPrice), size = 4, alpha = 0.5, color = "goldenrod") +
theme_classic()
Fitting a polynomial regression in R takes just a few more bits of text inside the lm() formula. All we need to do is include a squared term on the predictor Carat, wrapped in I() so that the ^ is treated as arithmetic rather than as formula syntax.
lm2 <- lm(TotalPrice ~ Carat + I(Carat^2), d)
summary(lm2)
##
## Call:
## lm(formula = TotalPrice ~ Carat + I(Carat^2), data = d)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10207.4   -711.6   -167.9    355.0  12147.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -522.7      466.3  -1.121  0.26307    
## Carat         2386.0      752.5   3.171  0.00166 ** 
## I(Carat^2)    4498.2      263.0  17.101  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2127 on 348 degrees of freedom
## Multiple R-squared: 0.9257, Adjusted R-squared: 0.9253
## F-statistic: 2168 on 2 and 348 DF, p-value: < 2.2e-16
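An equivalent way to specify the quadratic model is with poly(). Setting raw = TRUE reproduces the coefficients above exactly; the default orthogonal polynomials would change the coefficients but not the fitted values:
# poly(Carat, 2, raw = TRUE) expands to Carat + Carat^2, matching lm2
lm2b <- lm(TotalPrice ~ poly(Carat, 2, raw = TRUE), data = d)
coef(lm2b)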
preds <- predict(lm2, interval = "prediction", level = 0.95)
## Warning in predict.lm(lm2, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
newdat <- cbind(d, preds)
ggplot(newdat, aes(x = Carat, y = TotalPrice)) +
geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
geom_line(aes(y = fit), size = 1) +
geom_line(aes(y = lwr), lty = 2, size = 1) +
geom_line(aes(y = upr), lty = 2, size = 1) +
theme_classic()
## Non-linear least squares
The quadratic growth model looks like a great fit. I would probably stop right there. But suppose we wanted to check a different type of model? What if we hypothesized that the growth should be exponential instead? That is, that the rate of change is proportional to the current value. We could fit a model like this using nls() and our definition of exponential growth.
\[ Y = \alpha e^{\beta X} \]
One hurdle to clear when using nls() is the need to choose good starting values for the algorithm it uses to estimate the parameters. If you have a good guesstimate of what those starting values should be, you can go ahead and use it. But we may have no clue what those starting values should be. One potential solution is to fit a linearized model by wrapping both sides of our formula in log(), and then use the coefficients from that linearized fit as the starting values for the fit on the original scale.
fm0 <- nls(log(TotalPrice) ~ I(log(a * exp(b * Carat))), d, start = c(a = 1, b = 1))
summary(fm0)
##
## Formula: log(TotalPrice) ~ I(log(a * exp(b * Carat)))
##
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a 1.003e+03  3.988e+01   25.16   <2e-16 ***
## b 1.623e+00  3.566e-02   45.52   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3295 on 349 degrees of freedom
##
## Number of iterations to convergence: 8
## Achieved convergence tolerance: 1.114e-08
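As an aside, a plain lm() on the log scale gets you to roughly the same place, since \(\log Y = \log\alpha + \beta X\) is linear in the parameters; back-transforming the intercept recovers a starting value for a:
# Alternative starting values from a linear fit on the log scale:
# log(TotalPrice) = log(a) + b * Carat
lin <- lm(log(TotalPrice) ~ Carat, data = d)
c(a = unname(exp(coef(lin)[1])), b = unname(coef(lin)[2]))
Either way, we now have sensible starting values and can fit the exponential model on the original scale.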
nls1 <- nls(TotalPrice ~ I(a * exp(b * Carat)), start = coef(fm0), data = d)
summary(nls1)
##
## Formula: TotalPrice ~ I(a * exp(b * Carat))
##
## Parameters:
##    Estimate Std. Error t value Pr(>|t|)    
## a 2502.5378    92.8400   26.95   <2e-16 ***
## b    0.9928     0.0162   61.29   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2899 on 349 degrees of freedom
##
## Number of iterations to convergence: 7
## Achieved convergence tolerance: 7.029e-06
Getting the prediction intervals from nls() just requires loading in the investr package and using the predFit() function.
preds <- investr::predFit(nls1, interval = "prediction", level = 0.95)
newdat <- cbind(d, preds)
ggplot(newdat, aes(x = Carat, y = TotalPrice)) +
geom_point(size = 3, alpha = 0.5, color = "goldenrod") +
geom_line(aes(y = fit), size = 1) +
geom_line(aes(y = lwr), lty = 2, size = 1) +
geom_line(aes(y = upr), lty = 2, size = 1) +
theme_classic()
By now it should be pretty clear that the quadratic model fits the sample data a little better.
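If you want a number to back that impression up, both models predict the same response on the same scale, so their AIC values are directly comparable:
# Lower AIC indicates a better trade-off between fit and complexity
AIC(lm2, nls1)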