Kirby Arinder
2023/08/09
What are we doing?
Why do we care?
At whom is this aimed?
The core of linear regression is simple.
It's a method of assessing the central tendency of multidimensional data and making certain inferences based on that assessment.
So let's start at the beginning.
A simple set of numbers on a number line has one dimension.
Its central tendency, the average, is a point, i.e., zero-dimensional!
That point serves as an expected value of sorts for, well, entities like whatever this number line represents!
Let's think informally for now: if the numbers on the previous slide represented the output of some process, then, ceteris paribus, you'd expect future outputs of that process to show the same average over the long term!
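To make that concrete, here's a tiny sketch in R (the numbers are invented purely for illustration):

  x <- c(2.1, 3.4, 2.8, 5.0, 4.2, 3.7)   # a one-dimensional set of numbers
  mean(x)                                # its zero-dimensional central tendency
  # If these were outputs of some repeatable process, this average is roughly
  # the value we'd expect future outputs to settle around over the long term.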
We can easily envision a two-dimensional set of points.
It, too, has a central tendency! (Actually many, but…)
That blue line looks wonky, you might say!
This line minimizes the sum of squared distances, measured along the Y axis, from every point in the set to the central tendency (or regression surface; hey, there's our word!).
The central tendency under discussion is identical to the expected value of the set.
That is: the value you'd expect new members of this set, generated by the same process, to assume, on average, in the long term!
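Here's a sketch of that minimization in R, on made-up data (the "true" intercept and slope below are hypothetical, chosen just to have something to fit):

  set.seed(1)
  x <- runif(50, 0, 10)
  y <- 2 + 0.5 * x + rnorm(50)     # a hypothetical data-generating process
  fit <- lm(y ~ x)                 # least squares: minimizes sum((y - (a + b*x))^2)
  coef(fit)                        # the fitted intercept a and slope b
  sum(resid(fit)^2)                # the minimized sum of squared vertical distances
  sum((y - (2.5 + 0.3 * x))^2)     # an arbitrary alternative line does worse

Any other choice of intercept and slope gives a larger sum of squared Y-axis distances than the fitted line does.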
The C.T.U.D. is one dimension less than its set.
High-dimensional predictions aren't typically very useful.
Even a two-dimensional expectation can be more than we need!
By holding a number of dimensions constant, we can reduce the dimensionality of our expected value!
This creates a conditional expectation.
In theory, this practice is very useful:
First, let's introduce the ideas of independent and dependent variables.
Traditionally, in two dimensions, the independent variable is depicted on the x axis and the dependent variable on the y.
With that understanding, let's just hold the value of x constant!
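As a sketch (reusing made-up data like the earlier one), holding x at a single value collapses the line into one number:

  set.seed(1)
  x <- runif(50, 0, 10)
  y <- 2 + 0.5 * x + rnorm(50)                # hypothetical data again
  fit <- lm(y ~ x)
  predict(fit, newdata = data.frame(x = 4))   # the conditional expectation of y given x = 4
  coef(fit)[1] + coef(fit)[2] * 4             # same number, read straight off the fitted line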
What does calling one variable "dependent" actually commit us to? Right now, nothing at all! It's a purely logical stipulation: if P, then Q.
But that changes when we think about regression as an inferential method. Which we will shortly!
You now understand linear regression as a descriptive technique.
It's a high-dimensional average which can be simplified down, by holding variables constant, to give conditional expectations for the output of some process.
But more typically, we want to use linear regression to make inferences.
These often take two forms: forecasting, and causal claims.
Either of these forms of inference requires a little more than what we've seen so far.
We need not just a conditional expectation, but a distribution of error around that expectation.
This is where things like confidence intervals and p-values come in.
Here's the standard 95% confidence interval around our old friend.
We won't cover how those are computed here. Or at least, I'm not going to talk about how to do them.
Which may be unnecessary anyway; if I compute a regression line using R's standard functions, I'm going to get CIs and p-values whether or not I know how to use them or even want them!
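For instance, with the same made-up data as before, R's standard tooling hands all of this back unasked:

  set.seed(1)
  x <- runif(50, 0, 10)
  y <- 2 + 0.5 * x + rnorm(50)     # hypothetical data once more
  fit <- lm(y ~ x)
  summary(fit)                     # coefficient table with standard errors and p-values
  confint(fit)                     # 95% confidence intervals for intercept and slope
  predict(fit, newdata = data.frame(x = 4), interval = "confidence")   # a CI around the line at x = 4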
The standard inferential mechanism for linear regression is based on an assumption of random sampling: it assumes your data are a random sample drawn from some larger population.
Your CIs and p-values are only meaningful in this context.
If your data aren't from a random sample at all, then knowing the probability of a random sample with values at least as extreme as the ones you observed tells you nothing!
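To see what the machinery actually promises, here's a sketch of repeated random sampling from a known, made-up population; under that story, and only under that story, the 95% CI for the slope covers the true slope about 95% of the time:

  set.seed(1)
  true_slope <- 0.5
  covered <- replicate(1000, {
    x <- runif(50, 0, 10)
    y <- 2 + true_slope * x + rnorm(50)       # a fresh random sample each run
    ci <- confint(lm(y ~ x))["x", ]           # 95% CI for the slope
    ci[1] <= true_slope && true_slope <= ci[2]
  })
  mean(covered)                               # roughly 0.95, because the sampling story holds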
With that in mind, let's finally move on to…
A.k.a., what not to do:
Two problems with using linear regression for this purpose:
Before I say anything at all about this, a quote:
“No causal claim can be established by a purely statistical method, be it propensity scores, regression, stratification, or any other distribution-based design.”
-Judea Pearl, “Causality,” p. 350
But disregarding that looming fact, we've still got our more specific problems:
This talk just covers the simplest form of linear regression and its formal interpretation!
There are many, many considerations not mentioned here!
Don't be dissuaded from forecasting, or from making causal claims.
But beware of using a method developed for one specific context, in a different context entirely!
If you want to break the rules, have an argument ready that your method works in your novel context.