Many data scientists would say nothing is great about heteroskedasticity. In fact, some regard it like a skin rash – unexpected, inconvenient and a little embarrassing. So they immediately throw at it whatever random ointments they find in the medicine cabinet and hope it goes away.
But to me, heteroskedasticity isn’t a problem with our data, it’s a feature. It’s interesting, and depending on its origin, it may tell us something important - perhaps even more important than the precision we are losing by not having constant variance.
Heteroskedasticity refers to situations in which the variance of the residuals is unequal over the range of fitted values. While there are specific tests for it (e.g. Breusch-Pagan), when heteroskedasticity is present there is generally a patterning you can see in a scatterplot of residuals against fitted values. For example, we might see a cone shape, which tells us that variance is increasing as fitted values increase. It is precisely this patterning that tells us there may be some relationships to discover within our data.
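To make the cone shape concrete, here is a minimal sketch on simulated data. Nothing in it comes from the AdRevenue data: the variables x and y, the coefficients, and the assumption that the error standard deviation grows in proportion to x are all made up for illustration.

```r
# Simulated data (an assumption for illustration, not the AdRevenue data):
# the error standard deviation grows in proportion to x.
set.seed(42)
n <- 200
x <- runif(n, min = 0.5, max = 5)
y <- 50 + 40 * x + rnorm(n, mean = 0, sd = 8 * x)

fit <- lm(y ~ x)

# Residuals against fitted values: the spread should widen from left to right.
plot(fitted(fit), resid(fit),
     xlab = "Fitted values", ylab = "Residuals",
     main = "Simulated cone-shaped residuals")
abline(h = 0, lty = 2)

# A rough numeric check: residual spread below and above the median fitted value.
upper <- fitted(fit) > median(fitted(fit))
sd_lower <- sd(resid(fit)[!upper])
sd_upper <- sd(resid(fit)[upper])
c(lower = sd_lower, upper = sd_upper)
```

If the cone is there, the second number should be noticeably larger than the first.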
The absence of heteroskedasticity is a fundamental assumption of Ordinary Least Squares regression. When this assumption is violated, those aspects of our regression results that depend on constant variance (p-values, t-tests, F-tests, standard errors) are not trustworthy. This can make statistical inference very difficult. On the other hand, our coefficients remain unbiased.
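A small simulation can illustrate both halves of that claim. This is only a sketch under assumed conditions: the true slope of 40, the fixed set of x values, and an error standard deviation proportional to x squared are all hypothetical choices, not anything estimated from real data. We refit the same model many times and compare the average slope with the true one, and the standard error that lm() reports with the actual spread of the slope estimates.

```r
# Hypothetical setup: true slope of 40, error sd proportional to x^2.
set.seed(1)
n <- 100
true_slope <- 40
x <- runif(n, 0.5, 5)  # predictor values held fixed across replications

one_run <- function() {
  y <- 50 + true_slope * x + rnorm(n, sd = 3 * x^2)
  coefs <- summary(lm(y ~ x))$coefficients
  c(slope = unname(coefs["x", "Estimate"]),
    se    = unname(coefs["x", "Std. Error"]))
}

# Each column of `runs` holds the slope and its reported SE from one replication.
runs <- replicate(2000, one_run())

mean_slope       <- mean(runs["slope", ])  # should sit close to true_slope
empirical_sd     <- sd(runs["slope", ])    # actual sampling variability of the slope
mean_reported_se <- mean(runs["se", ])     # what lm() claims on average

c(mean_slope = mean_slope,
  empirical_sd = empirical_sd,
  mean_reported_se = mean_reported_se)
```

In this particular design the reported standard error should come out smaller than the empirical spread, while the average slope stays close to 40. With a different variance pattern the reported standard error could just as well be too large, so the direction of the error is not something to rely on.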
So when heteroskedasticity is present, we are limited in the claims we can make about the precision of our coefficients. By contrast, we are better placed to draw inferences about shifts in variance within our residuals. Depending on the business question, this may have utility.
We will use the AdRevenue dataset from Simon Sheather (A Modern Approach to Regression with R. Simon Sheather, Springer Texts in Statistics: 2009) to illustrate this point. I’ve removed some outliers to make the point clearer.
Imagine we are working for a relatively small newspaper looking to increase ad revenue by increasing circulation. Management wants a simple “this increase in circulation leads to that increase in ad revenue” formula: not a hard and fast rule, but a reliable ballpark.
Here is a summary of the variables we are examining:
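library(tidyverse)

# Load the data (local path from the original analysis)
dfx <- read.csv("D:\\RStudio\\CUNY_621\\AdRevenue.csv", header = TRUE)

# Keep the two variables of interest and drop the outliers
dfx <- dfx %>%
  dplyr::select(AdRevenue, Circulation) %>%
  filter(Circulation <= 6) %>%
  filter(row_number() != 3)

summary(dfx)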
##    AdRevenue      Circulation
##  Min.   : 61.1   Min.   :0.3310
##  1st Qu.:101.8   1st Qu.:0.9755
##  Median :130.4   Median :1.5420
##  Mean   :137.9   Mean   :1.7463
##  3rd Qu.:154.0   3rd Qu.:2.0520
##  Max.   :291.8   Max.   :4.7410
We regress ad revenue on circulation. The R-squared is high (0.89), but the residual plot shows some patterning, which suggests heteroskedasticity and means we can’t trust our statistical inferences. A Breusch-Pagan (BP) test confirms this: with a p-value of 0.0166, we reject the null hypothesis (that heteroskedasticity is not present) at the 5% level. So what can we do?
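library(lmtest)

dfx1 <- dfx %>%
  dplyr::select(AdRevenue, Circulation)

# Fit the simple linear model and plot residuals against fitted values
p <- lm(AdRevenue ~ Circulation, data = dfx1)
plot(p, which = 1)

# Breusch-Pagan test; the null hypothesis is constant variance
bptest(p)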
##
## studentized Breusch-Pagan test
##
## data: p
## BP = 5.7381, df = 1, p-value = 0.0166
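For readers curious what the studentized Breusch-Pagan statistic measures, here is a minimal sketch on simulated data. The variables x and y and the noise model are assumptions for illustration, not the AdRevenue data. The idea is to regress the squared residuals on the predictor and multiply the R-squared of that auxiliary regression by the sample size. As far as I understand it, this is the statistic bptest() reports by default, and the optional cross-check at the end lets you confirm that if lmtest is installed.

```r
# Simulated heteroskedastic data (illustrative assumption, not the AdRevenue data).
set.seed(7)
n <- 150
x <- runif(n, 0.5, 5)
y <- 50 + 40 * x + rnorm(n, sd = 8 * x)

fit <- lm(y ~ x)

# Auxiliary regression: do the squared residuals vary systematically with x?
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized (Koenker) form: n times the R-squared of the auxiliary regression,
# compared against a chi-squared distribution with one df per regressor.
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
c(BP = bp_stat, p.value = bp_pval)

# Optional cross-check against lmtest, if it is installed.
if (requireNamespace("lmtest", quietly = TRUE)) {
  print(lmtest::bptest(fit))
}
```

A large statistic (small p-value) says the squared residuals are predictable from x, which is just another way of saying the variance is not constant.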
One option is to apply some transformation to the dependent variable and hope it does the trick. Commonly this is the log, so we take the log of ad revenue, run the regression again, and this is what our residual plot looks like now:
# Refit with log-transformed ad revenue and repeat the diagnostics
q <- lm(log(AdRevenue) ~ Circulation, data = dfx)
plot(q, which = 1)
bptest(q)
##
## studentized Breusch-Pagan test
##
## data: q
## BP = 0.031519, df = 1, p-value = 0.8591
It worked! While there are similarities to the previous plot, the tight bunching at lower levels of fitted values is gone. There are a few odd points at the lower left of the plot which throw off the median, and one or two outliers that give the appearance of heteroskedasticity, but in reality this distribution is a lot more random. Indeed, a Breusch-Pagan test does not allow us to reject the null hypothesis (that heteroskedasticity is not present). So with one simple move, we’ve solved our problem. And while management may not understand our “log” formula, at least we can deliver them a reliable model.
But this alone doesn’t really address the patterning in the variance. Unless we understand why the log transformation is effective, applying it only treats the pattern as a problem and buries it.
This time we start with the data. Here is a scatterplot of ad revenue and circulation:
library(EHData)

ggplot(dfx, aes(Circulation, AdRevenue)) +
  geom_point(fill = "navy", color = "navy") +
  geom_smooth(method = "lm", color = "red", fill = "lightcoral") +
  EHData::EHTheme()
## `geom_smooth()` using formula 'y ~ x'
While heteroskedasticity is specifically a patterning of variance within the residuals, we can nonetheless see where some of that variance originates. The scatterplot shows that the relationship between revenue and circulation becomes looser as circulation becomes larger. It should be noted that this pattern of variance is extremely common, and so it would benefit us to know when and why it happens. If this is a feature of our data and we discover that it has meaning and/or consequences, we would want to communicate this to management. There are many reasons why it may occur - here are just four:
a percentage or other size effect: as newspapers get larger, a 10% variation in ad revenue is going to be much larger in absolute terms for the bigger papers than for the smaller ones. Thus, we often see this cone-shaped patterning of heteroskedasticity when there is a very large range of low to high values, and particularly when a percentage increase in the independent variable leads to a percentage increase in the dependent variable.
a missing variable: newspapers may have more revenue options as they get larger. For example, maybe smaller newspapers tend to be free of charge, while the larger ones can rely more on subscription fees, and therefore some have less incentive to pursue ad revenues aggressively. Heteroskedasticity is a common sign of a missing variable.
an interaction effect: the relationship between circulation and ad revenue may be different for small newspapers compared to large. For example, the market for large advertisers may be much more competitive, but also much more lucrative, than for smaller advertisers. When entities at small and large values face different conditions with different types of variance, you will see heteroskedasticity. Sometimes the cone faces the opposite way from the one in this dataset - for example, the relationship between flight departure delays and arrival delays shows much more variance for small departure delays than for large.
a hidden grouping effect: large papers in small markets may exhaust advertising opportunities quickly, while large newspapers in large markets may have increased advertising opportunities.
The fact that a log transformation eliminated the heteroskedasticity may suggest that a percentage effect is in play, but we would want to investigate that further before we reached that conclusion. Without more information, we don’t know whether our model is incomplete or not. And if the issue is a misspecification of the model, using a transformation to hide the impact of the missing effects is inadvisable. In any case, the pattern of the variance tells us that the relationship between circulation and ad revenue is less tight as newspapers get larger, and management should be aware of this, as it suggests that relying on ad revenue may be increasingly rewarding but also increasingly risky. Further, we can see from the scatterplot that no matter what the true standard error is, it is unlikely to negate the effect completely. We can’t say precisely how much ad revenue will increase with an increase in circulation, but we can, in fact, be confident that it will increase, even without a reliable standard error.
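To see why a percentage effect and a log transformation go together, here is a minimal sketch on simulated data. Everything in it is assumed for illustration: the coefficients, the range of x, and above all the premise that the noise is multiplicative, so that y moves by a roughly constant percentage around its expected value. It shows what a pure percentage effect would look like; it does not show that this is what drives the AdRevenue data.

```r
# Simulated multiplicative-error data (an assumption, not the AdRevenue data):
# on the log scale the noise has constant sd, so on the raw scale it scales with y.
set.seed(123)
n <- 200
x <- runif(n, 0.3, 5)
y <- exp(4 + 0.3 * x + rnorm(n, sd = 0.2))

raw_fit <- lm(y ~ x)
log_fit <- lm(log(y) ~ x)

# Ratio of residual spread above vs below the median fitted value.
# A value near 1 suggests roughly constant variance; well above 1 suggests a cone.
spread_ratio <- function(fit) {
  upper <- fitted(fit) > median(fitted(fit))
  sd(resid(fit)[upper]) / sd(resid(fit)[!upper])
}

raw_ratio <- spread_ratio(raw_fit)
log_ratio <- spread_ratio(log_fit)
c(raw = raw_ratio, log = log_ratio)
```

If the raw ratio is well above 1 and the log ratio is close to 1, the data are behaving the way a percentage effect predicts. A missing variable or a grouping effect could produce a similar cone on the raw scale, which is why the log fix on its own does not settle the question.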
So we could report to management:
“Here is a handy formula with a bit of advanced math - we can give you a hand if you need to decipher it”,
or we could report:
“We’ve given you a ballpark estimate, give or take, of the relationship between circulation and ad revenue. We’re confident ad revenue will increase with circulation, but if you need more precision around the ‘give or take’ part, we can provide that with a more complex formula. However, there’s something else you should be aware of. As we increase in size, both the risk and the reward of generating ad revenue through circulation are going to increase. We don’t know if this is simply inherent in the relationship, or if larger companies begin to replace ad revenue with subscription revenue, or if something else is at play. This is something we’d like to study, and we request budget to do so. In any case, this phenomenon should be incorporated into any company risk management strategy.”
I know I would prefer the second.
In sum, heteroskedasticity violates a fundamental assumption about variance for our regression and therefore invalidates inferences based on standard errors and p-values. But the fact is, variance may also tell an important story. Sometimes heteroskedasticity invalidates our entire model, but perhaps sometimes it is our model. After all, a model is just a simplified description of what is significant about our data; it isn’t the regression itself.
A final note - a debate in data science, up there with the Frequentists vs. Bayesians, is how to spell heteroskedasticity (or is it heteroscedasticity?). Because the word’s roots (“hetero” and “skedasis”) are Greek, technically speaking the Greek spelling (with a k) is more correct. However, it does give the appearance that the writer doesn’t know how to spell and just went at it phonetically. So as always, it depends on use case and personal preference. I like heteroskedasticity because I like the way it looks.