Welcome to Atrium University! We are devoted to developing Atrians’ skills at understanding and using machine learning tools. This Linear Regression primer is one of a series on common data science tools and their application in R; see the other primers in the series to learn about additional methods.
Introduction
Far and away, the most widely used tools in machine learning are regression-based. Linear regression and logistic regression are the easiest to understand and the most commonly used regression tools, though others are available as well.
This document covers the basics of linear regression, including:
- What is linear regression?
- How does it differ from logistic regression?
- Common Regression Problems: Model Selection
- Common Regression Problems: Inference
- Common Regression Problems: Prediction
- How do I implement this tool in R?
- How do I use Einstein Discovery’s regression tools?
Before we jump in, note that linear regression is a powerful statistical tool. Entire books have been written solely on this subject - this is just an overview of how you can get started. For a more technical overview of the subject matter, we suggest the following books:
- Applied Predictive Modeling (Max Kuhn and Kjell Johnson) [Found at: link]. Description: A technical overview of how to implement ML methods in R.
- Naked Statistics (Charles Wheelan) [Found at: link]. Description: A high-level overview of regression tools from a social science perspective. Good context.
- The Elements of Statistical Learning (Trevor Hastie, Robert Tibshirani, and Jerome Friedman) [Found at: link]. Description: A thorough and advanced treatment of many ML tools, including linear regression. Available FREE at the link.
Know any good statistics/ML textbooks? Email paul@atrium.ai and we will post them to the Wookie.
What is linear regression?
Linear regression is a statistical tool that relates a response variable, \(Y\), to a suite of predictor variables, \(x_1, \dots, x_p\). The model that relates them is written as follows: \[Y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i\] In the model, \(\beta_0\) is the intercept, which we interpret as the predicted response value when all of the predictors equal 0. Each of the \(\beta_1, \dots, \beta_p\) terms is the true change in the response, \(Y\), associated with a one-unit change in the corresponding predictor \(x_1, \dots, x_p\), holding the others fixed. Finally, because the model will not fit the data perfectly, it includes an error term \(\epsilon_i\), assumed to be normally distributed with mean 0 and constant standard deviation \(\sigma\). This means that the model is assumed to miss below the true value about as often as it misses above.
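To make the notation concrete, here is a minimal sketch of fitting this model in base R. The dataset (the built-in mtcars) and the choice of variables are illustrative assumptions, not part of the primer: mpg plays the role of the response \(Y\), while wt and hp play the roles of the predictors.

```r
# A minimal sketch using base R and the built-in mtcars dataset.
# mpg is the response Y; wt (weight) and hp (horsepower) are the
# predictors x1 and x2.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The fitted coefficients are the estimated beta terms from the model:
# (Intercept) estimates beta_0; wt and hp estimate beta_1 and beta_2.
summary(fit)
```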
That’s a lot of math! What does it mean for predictive analytics? Here are some linear regression quick hits:
- Linear regression models allow us to predict a new response value \(Y\) for a given set of inputs (\(x_1, \dots, x_p\)).
- The model will rarely predict with perfect accuracy (it’s a statistical model), but because its errors are symmetric, it will miss above the true value about as often as it misses below.
- Multiple linear regression (i.e., regression with more than one predictor) allows us to isolate the effect of a single variable (say, \(x_1\)) while controlling for the effects of the other variables in the model. This is a major advantage over, say, making dashboards to diagnose trends. (See the sketch after this list.)
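To make the prediction and “controlling for” ideas concrete, here is a short continuation of the mtcars sketch above. The new-car values are made-up inputs for illustration.

```r
# Continuing the mtcars sketch: predicting the response for new data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Predicted mpg for a hypothetical car weighing 3,000 lbs with 150 hp
# (in mtcars, wt is recorded in units of 1,000 lbs).
new_car <- data.frame(wt = 3.0, hp = 150)
predict(fit, newdata = new_car)

# The coefficient on wt is the estimated change in mpg for a one-unit
# (1,000 lb) increase in weight, holding horsepower constant. That is
# the "controlling for" interpretation described above.
coef(fit)["wt"]
```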
Goals of Linear Regression:
The goal of linear regression is to identify which variables are most strongly related (either positively or negatively) to the response of interest.
- If two variables are positively related (or “correlated”), large values of one variable tend to accompany large values of the other (a positive trend).
- If two variables are negatively related, large values of one tend to accompany small values of the other (a negative trend).
- If two variables do not have much of a linear relationship, large values of one variable may be associated with any value of the other. Often, two unrelated variables will produce a ‘blob’ of points with no distinguishable pattern, as simulated in the sketch below.
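Here is an illustrative sketch of these three patterns using simulated data in R; the sample size, slopes, and seed are arbitrary choices for demonstration, not anything prescribed by the primer.

```r
# Simulated data showing positive, negative, and near-zero
# linear relationships.
set.seed(42)
x <- rnorm(200)

y_pos  <- 2 * x + rnorm(200)    # positive trend
y_neg  <- -2 * x + rnorm(200)   # negative trend
y_none <- rnorm(200)            # no relationship: a 'blob' of points

cor(x, y_pos)    # close to +1
cor(x, y_neg)    # close to -1
cor(x, y_none)   # close to 0

# plot(x, y_none) would show the patternless 'blob' described above.
```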
How do linear regression and logistic regression differ?
Without getting into too many technical details, linear regression is used when the response variable of interest takes on quantitative, continuous values. If you’re interested in predicting sales prices, ages, times (say, the time for a sales opportunity to close), heights, weights, and so on, linear regression is a go-to tool for analysis.
Logistic regression works differently: it predicts probabilities between 0 and 1 and should be used when the response is binary (it takes only two values, coded 0 or 1). For more information about logistic regression, see the Logistic Regression primer!
Logistic regression requires binary response values and produces predicted probabilities; linear regression requires continuous, numerical response values and can predict an entire spectrum of responses. (Image found at: machinelearningplus.com.)
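To make the contrast concrete, here is a hedged sketch in base R. The mtcars variables are illustrative stand-ins (mpg as a continuous response, am as a 0/1 transmission indicator), not part of the original comparison.

```r
# lm() for a continuous response; glm(..., family = binomial) for a
# binary response. Both fits use the built-in mtcars dataset.
continuous_fit <- lm(mpg ~ wt, data = mtcars)    # mpg is continuous
binary_fit <- glm(am ~ wt, data = mtcars,        # am is binary (0/1)
                  family = binomial)

# lm() predictions span a continuous spectrum of mpg values, while
# glm() predictions (type = "response") are probabilities in [0, 1].
predict(continuous_fit, newdata = data.frame(wt = 3.0))
predict(binary_fit, newdata = data.frame(wt = 3.0), type = "response")
```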