Why does it matter?

Regression is a statistical technique for estimating the relationship between a dependent variable and one or more independent variables. This can be useful in a variety of ways; increasingly, however, there is a perceived divide between using regression for prediction and using it for interpretation.

When we use regression for prediction, we develop a model that maximizes our ability to accurately estimate the dependent variable on data we haven’t seen. Several metrics can help us determine which model is the most accurate.
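As a rough sketch of what that looks like in practice (synthetic data stands in for anything real here), we can hold out part of the data and score the fitted model on it:

```python
# A minimal sketch: hold out part of the data and score the model on it.
# Synthetic data from make_regression stands in for a real dataset.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

print("RMSE on unseen data:", np.sqrt(mean_squared_error(y_test, y_pred)))
print("R^2 on unseen data:", r2_score(y_test, y_pred))
```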

The definition of interpretation is a little less precise. It overlaps with inference (which involves hypothesis testing to determine how close sample parameters are to population parameters), but more properly refers to our ability to make sense of the relationship between the independent and dependent variables. Interpretability is much fuzzier, and while metrics can enhance our interpretation, they cannot tell us how interpretable our model is.

This matters because we sometimes, though not always, need to make choices that enhance either the predictive accuracy or the interpretability of our models, and those choices can be at odds.

Below we discuss some examples using hypothetical case studies:

Feature selection

Imagine you work for a diabetes clinic which collects a fair amount of data on patients, including their A1c (indicative of diabetes) over time. You decide to run a regression in order to determine the relationship between patient data (demographics, treatments, etc.) and their change in A1c. The type and number of features you select will be partly determined by the use case for the data.

Do you want to predict as accurately as possible whether a new patient will see their A1c increase or decrease? Then you will want to include as much information as possible. Do you want to understand what drives changes in A1c? Then you might want to restrict yourself to a subset of the most important independent variables and understand how they work together. Do you want a handy rule of thumb to give to your patients, such as “men of this age who get this treatment and have these characteristics will reduce their A1c on average by X amount”? In this case you would want even fewer independent variables.
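As a loose illustration of this trade-off, the sketch below compares a model fit on every available feature with one restricted to a few of the strongest predictors. It borrows scikit-learn’s built-in diabetes dataset, which measures disease progression rather than A1c, purely as a stand-in:

```python
# A sketch comparing a "use everything" model with a deliberately small one.
# scikit-learn's diabetes dataset (disease progression, not A1c) is a stand-in.
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)

full_model = LinearRegression()                               # all 10 features
small_model = make_pipeline(SelectKBest(f_regression, k=3),   # top 3 features only
                            LinearRegression())

print("All features, mean CV R^2:", cross_val_score(full_model, X, y, cv=5).mean())
print("Three features, mean CV R^2:", cross_val_score(small_model, X, y, cv=5).mean())
```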

Feature transformation

The same issues appear when we look at feature transformation. Sometimes features must be transformed because otherwise the assumptions of the regression are not met. We might, for example, take the log of the dependent variable to address heteroscedasticity.
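A minimal sketch of that log transform, on made-up data whose noise grows multiplicatively so the raw scale is heteroscedastic:

```python
# A sketch of the log transform on synthetic data whose noise grows with x,
# so the raw residuals fan out but the log-scale residuals do not.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=300)
y = np.exp(0.5 * x) * rng.lognormal(sigma=0.3, size=300)  # multiplicative noise

slope, intercept = np.polyfit(x, np.log(y), deg=1)  # fit log(y) ~ x instead of y ~ x
print("slope on the log scale:", round(slope, 3))   # roughly 0.5, a multiplicative effect
```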

In other cases, however, we perform transformations in order to improve our fit. Particularly when using Box-Cox, these transformations may seriously interfere with our ability to see and understand the relationship between the dependent variable and the independent variables.
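To see why, here is a rough sketch of Box-Cox on synthetic, skewed data; the transform tidies up the outcome, but any coefficient must then be read on the transformed scale:

```python
# A sketch of Box-Cox on synthetic, right-skewed data. The transform helps the fit,
# but coefficients now describe changes in (y**lam - 1) / lam, not in y itself.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
y = rng.lognormal(mean=2.0, sigma=0.6, size=500)  # a skewed, positive outcome

y_transformed, lam = stats.boxcox(y)              # lam chosen to make y_transformed most normal
print("fitted lambda:", round(lam, 3))
# Any regression fit on y_transformed must be read on that transformed scale,
# which is accurate but hard to explain to a non-technical audience.
```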

Imagine you work for a newspaper company interested in increasing circulation and you are entering a new market. If you want to predict how well your newspaper will do in the new market, you would want to include whatever transformations improve the predictive value of the model, however sophisticated. On the other hand, if you want to increase circulation in the markets you are already in, it may be more important to understand precisely which independent variables you can influence to bring in more circulation.

Stability of coefficients

There are a number of reasons why coefficients may vary from test set to test set or from sample to sample. For example, if a data set is marked by strong multicollinearity, coefficients may change, and even flip sign, across samples that are quite similar to each other. For prediction, stability of coefficients is not necessarily important. There may still be reasons to address multicollinearity, but it generally will not adversely affect our predictions. We may, however, use a form of regression (like Ridge regression) that reduces the level of variance found in these models.
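A small sketch of both points, using two deliberately near-duplicate features (everything here is simulated): ordinary least squares coefficients swing widely across bootstrap resamples, while Ridge coefficients stay comparatively stable:

```python
# A sketch of coefficient instability under multicollinearity: two nearly
# identical features, 200 bootstrap resamples, OLS versus Ridge (synthetic data).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(42)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)       # almost a copy of x1
y = 3 * x1 + rng.normal(scale=1.0, size=n)
X = np.column_stack([x1, x2])

for name, model in [("OLS", LinearRegression()), ("Ridge", Ridge(alpha=1.0))]:
    coefs = []
    for _ in range(200):                        # refit on each bootstrap sample
        idx = rng.integers(0, n, size=n)
        coefs.append(model.fit(X[idx], y[idx]).coef_.copy())
    print(name, "coefficient std dev:", np.round(np.std(coefs, axis=0), 2))
```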

On the other hand, if we want to interpret our models, stability of coefficients is crucial. We need to know the direction and the strength of the relationship if we are to say anything at all about the meaning of that particular variable.

Imagine you work for a paleontologist trying to determine the sex of a baboon by its skull. You have 10 measurements per skull and 50 skulls. The measurements on the skull are likely to be highly correlated, as large skulls will have larger measurements across the board. If the coefficients keep changing with every test set, it becomes difficult to develop a simple rule of thumb that allows you to make the determination. This is a situation where both prediction AND interpretation might benefit from a more simplified model.
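One possible simplification, sketched below with simulated measurements rather than real skull data, is to collapse the correlated measurements into a single overall-size score before fitting the rule of thumb:

```python
# A sketch with simulated skull measurements (all names hypothetical): collapse
# ten highly correlated measurements into one "overall size" score, then classify.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n_skulls = 50
size = rng.normal(loc=10, scale=2, size=n_skulls)                # latent overall size
X = size[:, None] + rng.normal(scale=0.3, size=(n_skulls, 10))   # 10 correlated measurements
sex = (size + rng.normal(scale=1.0, size=n_skulls) > 10).astype(int)

size_score = PCA(n_components=1).fit_transform(X)                # one composite feature
rule = LogisticRegression().fit(size_score, sex)
print("coefficient on the composite size score:", np.round(rule.coef_.ravel(), 2))
```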

Meeting assumptions

Because prediction metrics rely mostly on the performance of the model on unseen data, sometimes data scientists pay less attention to whether or not basic assumptions are met. For interpretation, which relies more heavily on statistical metrics, assumptions must be met in order to trust those metrics.

Imagine you work for the Forest Service attempting to get a ballpark estimate of the number of forest fires likely to happen in a given place and season. You run a regression and find heteroscedasticity. This will interfere with your ability to interpret the model, because heteroscedasticity affects metrics that depend on variance (like the p-value and F-statistic), and while your coefficient estimates will remain unbiased, they will no longer be the most efficient. In this case you would do what you could to transform the dependent variable, use weighted least squares, or employ other techniques to reduce the heteroscedasticity.
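A rough sketch of that workflow, with synthetic data standing in for real fire counts and an assumed variance structure for the weights:

```python
# A sketch: detect heteroscedasticity with the Breusch-Pagan test,
# then refit with weighted least squares (all data here is simulated).
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, size=300)
y = 5 + 2 * x + rng.normal(scale=x, size=300)    # error variance grows with x
X = sm.add_constant(x)

ols = sm.OLS(y, X).fit()
_, bp_pvalue, _, _ = het_breuschpagan(ols.resid, ols.model.exog)
print("Breusch-Pagan p-value:", round(bp_pvalue, 4))  # small value signals heteroscedasticity

# Weight observations by the inverse of their assumed error variance (here, x**2).
wls = sm.WLS(y, X, weights=1.0 / x**2).fit()
print("WLS coefficients:", np.round(wls.params, 2))
```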

However, if your prediction metrics tell you that the model is good enough for a ballpark number of forest fires, you would not in this case need to address the heteroscedasticity if you are already satisfied with the model.

It comes down to use case

Prediction tells us what’s going to happen. Interpretation tells us why. Prediction may better prepare us for what’s to come, while interpretation may better allow us to influence what’s to come. There is nothing stopping a data scientist from running more than one model (except perhaps time and computing resources). After all, interpretation helps us build our predictive models, and prediction helps us better establish whether our models are reliable.