What’s up with \(\epsilon\)?

Some things that are hard for me in statistics are understanding how the data we use in our plots and models relates to the assumed, or underlying, things that are happening in the real world, and how both the data and the model relate to the real-world population and to the observations in our dataset.

It’s okay if that sentence is confusing. I tried my best to clearly explain what I mean, but I am also somewhat unclear inside my own head about what it is that is tricky. What we can understand from that sentence, however, is that the confusion expressed pertains to several interrelated pieces of statistical material. Let’s look at each piece more closely to unravel this situation.

A list of the pieces in that sentence:

  • the data we use in our plots and models

  • the assumed, or underlying, things that are happening in the real world

  • the model

  • the real-world population

  • the observations in our dataset

Looking at this big-picture view, we see that statistics starts with something in the real world that we want to know about and need tools to help us learn about (because we don’t yet know the things we want to know!). Accordingly, almost everything we do in statistics has to do with a population that we don’t fully know, and consists of us trying to figure out something about that population. Our “figuring out” can take many forms, including making a guess and trying to get a sense of how reasonable our guess is (e.g. a one-sample t-test for a mean, or a one-sample test for a proportion), or finding a mathematical equation to estimate a relationship between variables (e.g. models). The details of how your statistical tools relate to the real-world population will vary based on the test or model used, but I want to focus on the simple linear regression (SLR) model.


To give us a clear understanding of what we don’t know, what we want to know, and how a simple linear regression equation connects those two things, let’s step into a world where we are statisticians but also have the extra superpower of knowing everything about the population we are looking at.

In this world, we encounter a population of perfect penguins, so called because they grow with unreal accuracy. Since we are all-knowing statisticians, we know that these penguins always grow in perfect proportions: the length of their flippers in millimeters is always exactly 126 + 1.6(bill_length), where bill_length is the length of their bill in millimeters. Now, a group of statisticians (who unfortunately do not share our superpower) want to know if there is a relationship between flipper length and bill length of the perfect penguins. Since it is unrealistic to track down every living perfect penguin for measurements, the statisticians decide to take a random sample of perfect penguins living in a particular area, measure the bill length and flipper length of each penguin, and fit a model to their observed data. When they plot their data and add a line of best fit, every single point will fall exactly on that line.
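
If you want to see this perfect world for yourself, here is a minimal sketch in Python (the bill lengths are made-up values, and I’m using scipy’s linregress function for the line of best fit):

```python
import numpy as np
from scipy import stats

# Made-up bill lengths (mm) for a sample of perfect penguins.
bill_length = np.array([35.0, 38.5, 40.2, 42.7, 45.1, 48.3, 51.6])

# In the perfect-penguin world, flipper length is an exact function of bill length.
flipper_length = 126 + 1.6 * bill_length

# Fit a line of best fit to the "observed" measurements.
fit = stats.linregress(bill_length, flipper_length)

print(fit.intercept, fit.slope)  # 126.0 and 1.6 (up to floating point)
print(fit.rvalue ** 2)           # R^2 = 1.0: every point lies exactly on the line
```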

What they find is that for every perfect penguin in this made-up world, there is an exact linear relationship between bill length and flipper length. (Of course, as the omniscient statisticians we are, we already knew that!) Given the bill length of any perfect penguin, I can tell you exactly how long that penguin’s flipper is using the equation flipper_length = 126 + 1.6(bill_length). If it’s not already clear enough: we know exactly what the relationship between the two variables is because every observation we could possibly take from the population of perfect penguins lies exactly on the line.

To understand the perfect penguins, we only need a linear equation in slope-intercept form, and a linear equation is algebra. It’s mathematics. It is not statistics. And so, we don’t need statistics here. Perfect penguins render statistics useless* because we know exactly what the relationship is. The imaginary perfect penguins emphasize for us that the field of statistics is a big tool that we use to understand, model, and work with the populations that we aren’t sure about— populations that we can only approximate.

And this is why we need epsilon. This is what epsilon is. Without epsilon, the simple linear regression equation \[Y = \beta_0 + \beta_1 x + \epsilon\] is just a linear equation: \[Y = \beta_0 + \beta_1 x.\]
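
Many courses pair this equation with an explicit assumption about \(\epsilon\); a common formulation (stated here for completeness) is \[Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \qquad \epsilon_i \overset{\text{iid}}{\sim} N(0, \sigma^2),\] which says each observation sits some random, normally distributed distance above or below the line.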

Epsilon is our way of saying “we have a population of data that is mostly linear, but not every individual in the population lies perfectly on a line.” Instead, when we plot measurements of bill and flipper length for individual Adelie penguins in the palmerpenguins dataset, sampled from the population of Adelie penguins here on Earth, the points lie in a mostly linear band. Our friend epsilon, \(\epsilon\), describes how the observed data points don’t all fall exactly on the line.
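
Here is a rough sketch of fitting that line in Python (this assumes the palmerpenguins and scipy packages are installed; the exact estimates depend on which rows you keep):

```python
from palmerpenguins import load_penguins
from scipy import stats

# Load the real penguin measurements and keep complete Adelie rows.
penguins = load_penguins().dropna(subset=["bill_length_mm", "flipper_length_mm"])
adelie = penguins[penguins["species"] == "Adelie"]

fit = stats.linregress(adelie["bill_length_mm"], adelie["flipper_length_mm"])

print(fit.intercept, fit.slope)  # estimates of the true intercept and slope
print(fit.rvalue ** 2)           # R^2 < 1: the points scatter around the line
```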

*I know I say that “Perfect penguins render statistics useless”, and I suppose that is a bit harsh on statistics, but I distinctly remember the moment when I realized that the reason we need a random error in our regression equation is because without it, we know the exact relationship between two variables, and so statistics are rendered useless! Luckily for us, there are endless things here in the real world that we can only understand through approximation, so perfect penguins and other populations that render statistics useless are much harder to come by outside of our imaginary world.

Residuals

In statistics, we approximate population parameters (the things about the population that we are trying to learn about) using sample statistics. A true population mean, \(\mu\), is approximated by a sample mean, \(\bar{x}\). (\(\bar{x}\) is found by taking a sample of observations from the population of interest and calculating their mean.) Similarly, residuals approximate \(\epsilon\).
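
Schematically, each unknown population quantity has a sample counterpart (writing \(e_i\) for the \(i\)-th residual, defined below): \[\mu \leftrightarrow \bar{x}, \qquad \beta_0 \leftrightarrow \hat{\beta}_0, \qquad \beta_1 \leftrightarrow \hat{\beta}_1, \qquad \epsilon_i \leftrightarrow e_i.\]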

The residual of an observed data point is the vertical distance between the observed \(y\) value (often denoted \(y_i\), where \(i\) is an index indicating which observation this \(y\) comes from) and the fitted, or approximated, \(y\) value, called \(\hat{y}\) (“y hat”). In symbols, the \(i\)-th residual is \(e_i = y_i - \hat{y}_i\). Visually, residuals are the vertical segments connecting each observed point to the fitted line on a scatterplot.


If we take a point on the fitted line and add a random amount (\(\epsilon\)) to it, then we’ll get a point like one of the observed data points, sitting a bit away from the line.
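
To make that concrete, here is a small simulated sketch that builds points by adding random amounts to a line and then computes the residuals (all values are made up):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Points on a "true" line, each nudged up or down by a random epsilon.
x = np.linspace(35, 52, 40)
epsilon = rng.normal(loc=0, scale=4, size=x.size)
y = 126 + 1.6 * x + epsilon

# Fit a line to the nudged points and compute the residuals.
fit = stats.linregress(x, y)
y_hat = fit.intercept + fit.slope * x  # fitted values, on the fitted line
residuals = y - y_hat                  # e_i = y_i - y_hat_i

print(residuals[:5])     # each residual is a vertical distance from the fitted line
print(residuals.mean())  # least-squares residuals average out to roughly zero
```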

When we fit a linear model, we check several assumptions, some of which can be assessed using (among other things) a residual plot.

One reason we don’t want to see any patterns in a residual plot, and the reason a pattern in a residual plot alerts us that “a linear model might not be appropriate,” is that if the data in our dataset differ predictably (in a pattern) from the model, then maybe the data haven’t been sampled from a population that has a linear relationship.
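
As one illustration (with simulated data, not one of the examples below): if the population relationship is actually curved, fitting a straight line leaves a telltale bend in the residual plot:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)

# Simulated data from a curved (quadratic) relationship.
x = np.linspace(0, 10, 100)
y = 2 + 0.5 * x**2 + rng.normal(scale=2.0, size=x.size)

# Fit a straight line anyway, then look at the residuals.
fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# The residual plot shows a clear U shape: the line sits below the data
# at the ends and above it in the middle, hinting that a linear model
# might not be appropriate.
plt.scatter(fitted, residuals)
plt.axhline(0, color="gray")
plt.xlabel("fitted value")
plt.ylabel("residual")
plt.show()
```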

Let’s look at several examples (Examples A through E):


We can use the residual plot to check two assumptions: whether our two variables are related by a linear relationship, and whether the constant variance assumption holds. The first assumption (linear relationship) is supported when the points in the residual plot are randomly scattered around the horizontal line at residual = 0, as seen in Examples B, C, and maybe E (some people may say there is a bit of a negative linear trend in E’s residual plot). The points in Examples A and D have some patterns; they are not randomly arranged, so these plots do not meet this condition.

For the other assumption (constant variance), we want to see that the spread of the points at any given fitted value (along the \(x\)-axis) is about the same height as the spread at every other fitted value. Example A breaks this: on the left of the plot, around fitted value = 0, all of the residuals are very close to each other and stay within a nice tight band, but over at fitted value = 6,000 and above, there is a much taller spread of residuals (the spread is over 5,000 units tall at about fitted value = 9,000). Examples B and C have more consistent spread as the fitted value changes. Example E is debatable. Some people might say that it has a consistent spread height for most fitted values, but if you look closely there are some spots where the spread is smaller (around fitted value = 1, the spread is about 1.75 or 2 units tall, whereas at fitted value = 1.5, it is about 1.25 units tall). For some people, those spreads are close enough to consider the assumption met. For others, this is a bit more variability than they would like to see. Example D varies from a spread height of 1 unit at fitted value = 0 to a spread height of 175 units at fitted value = 60, so Example D does not meet the constant variance condition.
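
If you would like to generate a funnel shape like Example A’s for yourself, here is a simulated sketch in which the noise grows with \(x\):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(3)

# A genuinely linear relationship, but with noise that grows with x.
x = np.linspace(1, 100, 200)
y = 5 + 2 * x + rng.normal(scale=0.5 * x, size=x.size)

fit = stats.linregress(x, y)
fitted = fit.intercept + fit.slope * x
residuals = y - fitted

# The residuals fan out as the fitted values grow: the trend is linear,
# but the constant variance assumption is not met.
plt.scatter(fitted, residuals)
plt.axhline(0, color="gray")
plt.xlabel("fitted value")
plt.ylabel("residual")
plt.show()
```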

(You may notice that Example B is kind of interesting: the data in the scatterplot are very linear, but the residual plot does show a funny pattern. That pattern has to do with which fitted values actually have observations at them, as well as with the widths of the diamonds. It seems that perhaps diamonds are always cut to one of a set of specified widths, and thus we see some horizontal lines of points on the residual plot.)

I hope that this blog post and set of examples helped you better understand \(\epsilon\)’s role in the simple linear regression equation and the usefulness of residual plots for checking assumptions. Happy stats!