logologo

Welcome to Atrium University! We are devoted to developing Atrians’ skills at understanding and using machine learning tools. This Linear Regression primer is one of a series on common data science tools and their application in R. See other Atrium University tools in this series to learn about other methods.

Introduction

Far and away, the tools most used in Machine Learning are regression-based. Linear Regression and Logistic Regression are the easiest to understand and most commonly utilized regression tools; however, there are others available as well.

This document covers the basics of linear regression, including:

Before we jump in, note that linear regression is a powerful statistical tool. Entire books have been written solely on this subject - this is just an overview of how you can get started. For a more technical overview of the subject matter, we suggest the following books:

Know any good statistics/ML textbooks? Email paul@atrium.ai and we will post them to the Wookie.

What is linear regression?

Linear regression refers to a statistical tool that relates a response variable, \(Y\), to a suite of predictor variables, called \(x_{1}...x_{p}\). The model that relates these is written as follows: \[Y_{i} = \beta_0 + \beta_1x_1 + ...\beta_px_p + \epsilon_i \] In the model, \(\beta_0\) refers to the intercept, which we interpret as the predicted \(Y\) response value if all of the predictor values are equal to 0. The \(\beta_1...\beta_p\) terms refer to the true estimated change in the response, \(Y\), associated with a one-unit change in a single input variable \(x_1...x_p\). Finally, because this is a true model, we also consider an error term \(\epsilon\) that is assumed to be normally distributed with 0 mean and constant standard deviation \(\sigma\). This means that the model is assumed to miss below the true value about as often as it misses above.

That’s a lot of math! What does this mean for predictive analytics? Here’s some Linear Regression quick hits:

Goals of Linear Regression:

The goal of linear regression is to identify which variables are most strongly related (either positively or negatively) to the response of interest.

  • If two variables are positively related (or “correlated”), this means that large values of one variable tend to relate to large values on the other variable (a positive trend).
  • Negatively related variables mean that large values of the one are related to small values on the other (a negative trend).
  • If two variables do not have much of a linear relationship, large values on one variable may be associated with any value of the other variable. Often, two unrelated variables will result in a ‘blob’ of points with no distinguishable pattern.

The goal of linear regression is to figure out which predictor variables are strongly related to the response of interest.

How do linear regression and logistic regression differ?

Without getting into too many technical details, linear regression is used when the response variable of interest takes on quantitative, continuous values. If you’re interested in predicting Sales Prices, Ages, Times (like, say, for a sales opportunity to close), Heights or Weights, etc., linear regression is a go-to tool for analysis.

Logistic regression works differently - it predicts values between 0 and 1 and should be used when the number of values that the response can take is binary (either 0 or 1). For more information about logistic regression, see the Logistic Regression primer!

Logistic Regression requires binary response values and produces predicted probabilities. Linear Regression requires continuous, numerical response values and can predict an entire spectrum of responses. Foud at: machinelearningplus.com.

Logistic Regression requires binary response values and produces predicted probabilities. Linear Regression requires continuous, numerical response values and can predict an entire spectrum of responses. Foud at: machinelearningplus.com.