Let’s talk about \(y=mx+b\) real quick. I’ve been seeing people online (rightfully) questioning why this formula is burned in their brains when they’ve never had a use for it. “Why do I know how to compute the slope of a line but I was never taught how to do my taxes?!” Education sometimes leaves us so unprepared for, and even misled about, the “real world” in many regards. I definitely hear that, and even agree on a lot of fronts.

Looking back now though, I recognize \(y = mx + b\) as one of the most powerful formulas we learned in our high school math classes. Though not exactly taught in this context, it was probably the first time that many of us were exposed to a foundational concept governing data science, technology, machine learning, and artificial intelligence today: describing relationships between variables with relatively few parameters. I think it’s time we took another look at this formula.

Bear with me – this post sounds like a stretch because it is :)

Recap

As a reminder, \(y=mx + b\) is an equation for a line (in a Cartesian plane). Specifically, it’s known as “slope-intercept form” because \(m\) denotes the slope of the line (rise-over-run), and \(b\) is the y-intercept (where the line crosses the y-axis).

Suppose you have two points, \((x_1, y_1) = (-1,-1)\) and \((x_2, y_2) = (1, 2)\).

We know that any two points define a line, but thanks to \(y=mx+b\), rather than defining the line by 4 numbers – \(x_1, y_1, x_2, y_2\) – we can do it with just 2, \((m, b)\). Cool!

You may even remember the formulas for computing \(m\) and \(b\) from \((x_1, y_1)\) and \((x_2, y_2)\): \[\begin{align} m &= \frac{y_2-y_1}{x_2-x_1}\\ &=\frac{2-(-1)}{1-(-1)} = \frac{3}{2}\\ b &= y_1-mx_1\\ &= -1-\frac{3}{2}(-1)= \frac{1}{2} \end{align}\]

Finally, we get \(y = \frac{3}{2}x + \frac{1}{2}\) as the line connecting \((-1,-1)\) and \((1,2)\).

Put differently, given two data points and our model, \(y=mx+b\), we identified the parameters \((m,b)\) that, in this case, exactly define the relationship between \(y\) and \(x\). Woo!
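If you’d rather see that arithmetic as code, here’s a minimal Python sketch of the same two-point calculation (the function name `slope_intercept` is just mine, nothing standard):

```python
def slope_intercept(p1, p2):
    """Return (m, b) for the line through points p1 and p2."""
    x1, y1 = p1
    x2, y2 = p2
    m = (y2 - y1) / (x2 - x1)  # rise over run
    b = y1 - m * x1            # rearrange y1 = m*x1 + b to solve for b
    return m, b

m, b = slope_intercept((-1, -1), (1, 2))
print(m, b)  # 1.5 0.5
```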

A slightly more general problem

What if we had been presented with 3 data points: \((-1,-1)\), \((1,2)\), and \((x_3, y_3) = (2,2)\)?

These points no longer define a line, but let’s say for my purposes I still wanted to model the relationship between \(y\) and \(x\) as linear, in the form (you guessed it!) \(y=mx+b\). Which line would generally capture the trend of the data? Something like this looks right:

Let’s try to find the line, defined by \((m,b)\), that’s as close to these three points as possible; in other words, the line that is the least “off”. We can measure this “off”-ness at each point as the difference \(\Delta y_i\) between the true y-value, \(y_i\), and the predicted y-value, \(\bar{y}_i = mx_i+b\).

This leads us to an equation for the squared error, \(E\) - the sum of each of these differences, squared:

\[\begin{align} E &= (y_1-\bar{y}_1)^2 + (y_2-\bar{y}_2)^2 + (y_3-\bar{y}_3)^2\\ &=\sum_{i=1}^3(y_i-\bar{y}_i)^2\\ &=\sum_{i=1}^3(y_i-(mx_i+b))^2 \end{align}\]

We want to minimize this error to find the line that fits these points best. Note that we square the differences to make them all positive and to accentuate bigger misses. Not a super important thing, but it makes the math nicer.
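To make that concrete, here’s a tiny Python sketch that evaluates \(E\) for any candidate \((m, b)\) on our three points (the names `squared_error`, `xs`, and `ys` are just illustrative):

```python
xs = [-1, 1, 2]
ys = [-1, 2, 2]

def squared_error(m, b, xs, ys):
    """Sum of squared differences between each y_i and the prediction m*x_i + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# The line through the first two points is no longer a perfect fit:
print(squared_error(1.5, 0.5, xs, ys))  # 2.25 -- it misses (2, 2) by 1.5
```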

This new problem is more generally referred to as a regression, which is a method for understanding relationships between a dependent variable, \(y\) in this case, and one or more independent variables, just \(x\) here. This specific regression is a linear regression, because we want to model the relationship between \(y\) and \(x\) as a line, \(y=mx+b\).

So, how do we determine the values for \(m\) and \(b\) that minimize this error? We’re going to pass this question off to yet another expansive and crucial domain – the field of optimization. These next steps could be the subject of their own blog post, but I can assure you, there is a very straightforward way to find that \[\begin{align} m&=1.0714\\ b&=0.2857 \end{align}\] fits these points best!
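If you just want the punchline in code, here’s one way to get those numbers. NumPy’s `polyfit` solves this least-squares problem for a degree-1 polynomial (i.e., a line); treat this as a sketch of the optimization step, not the only way to do it:

```python
import numpy as np

xs = np.array([-1, 1, 2])
ys = np.array([-1, 2, 2])

# Least-squares fit of a degree-1 polynomial y = m*x + b,
# i.e. the (m, b) that minimize the squared error E from above.
m, b = np.polyfit(xs, ys, deg=1)
print(round(m, 4), round(b, 4))  # 1.0714 0.2857
```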

Why does any of this matter?

Okay let’s take a step back. First of all, it’s super rad that \(y=mx+b\) allows us to represent a whole, infinite line using only 2 parameters, and that that line presumably summarizes the relationship between the variables in our “dataset”… So elegant! So compact! But like I’ve shown above, that simple formula extends far beyond the algebraic exercise of connecting two points in space. It’s at the heart of linear regression, the simplest and most common machine learning task there is.

But why do we care to know the line of best fit from a linear regression? The answer to that is prediction. If given some new point, \(x_4\), we can predict its corresponding y-value using the fitted parameters, \(y_4 \leftarrow mx_4 + b\). More generally, the process of selecting a model [\(y=mx+b\)], “learning” model parameters [\((m,b)\)], then doing a prediction task [\(y_4 \leftarrow mx_4+b\)] underlies all of machine learning & AI, and is the force driving our interactions with most digital platforms we come into contact with.
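Concretely, sticking with the toy example above (the new point \(x_4 = 3\) is just something I made up for illustration):

```python
m, b = 1.0714, 0.2857  # the parameters we "learned" from our three points

def predict(x):
    """Predict a y-value for a new x using the fitted line."""
    return m * x + b

x4 = 3  # a hypothetical new data point
print(round(predict(x4), 4))  # 3.4999 -- roughly 3.5, our prediction for y_4
```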

Of course, there are far more complex models and relationships than a simple line behind your most beloved online platforms; in fact, the art of selecting an apt model is often the biggest challenge in a given domain. But the lessons above remain at the root, all summed up in a simple and catchy formula: \(y=mx+b\).

I don’t wanna see memes like the one above again. Put some respect on \(y=mx+b\)’s name!!

Signed,

An enthusiastic nerd