In this post, I talk about linear regression: what it means, what it does, and why we care about it so much. Here I go!

What is Linear Regression?

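Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things: whether a set of predictor variables does a good job of predicting an outcome (dependent) variable, and which variables in particular are significant predictors of that outcome.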

The simplest form of the regression equation with one dependent and one independent variable is defined by the formula y = c + b*x, where y = estimated dependent variable score, c = constant, b = regression coefficient, and x = score on the independent variable.
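To make the formula concrete, here is a minimal sketch in Python using NumPy. The data points are made up for illustration (they lie exactly on y = 1 + 2*x), and the estimates use the standard ordinary least squares formulas, which is one common way to fit the line and an assumption on my part, since the formula above does not say how c and b are estimated.

```python
import numpy as np

# Hypothetical data: x is the independent variable, y the dependent variable.
# These points lie exactly on the line y = 1 + 2*x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# Ordinary least squares estimates for y = c + b*x:
#   b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   c = mean(y) - b * mean(x)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()

print(f"c = {c:.2f}, b = {b:.2f}")  # prints "c = 1.00, b = 2.00"
```

Because the points sit perfectly on a line, the fit recovers the constant and the regression coefficient exactly. Real data will scatter around the line instead.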

Naming the Variables. There are many names for a regression’s dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.

Three major uses for regression analysis are:

First, the regression might be used to identify the strength of the effect that the independent variable(s) have on a dependent variable. Typical questions ask how strong the relationship is between dose and effect, sales and marketing spending, or age and income.

Second, it can be used to forecast the effects or impact of changes. That is, the regression analysis helps us to understand how much the dependent variable changes with a change in one or more independent variables. A typical question is, “how much additional sales income do I get for each additional $1000 spent on marketing?”

Third, regression analysis predicts trends and future values. The regression analysis can be used to get point estimates. A typical question is, “what will the price of gold be in 6 months?”
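The three uses above can be sketched with a single fitted line. The spend and sales figures below are invented for illustration (both in thousands of dollars), and the variable names are my own. I use Pearson's r as one possible measure of strength; other measures would work too.

```python
import numpy as np

# Hypothetical figures, both in thousands of dollars.
spend = np.array([10.0, 20.0, 30.0, 40.0, 50.0])       # marketing spend
sales = np.array([120.0, 135.0, 160.0, 170.0, 190.0])  # sales income

# np.polyfit with deg=1 returns the coefficients as [slope, intercept].
b, c = np.polyfit(spend, sales, deg=1)

# Use 1: strength of the relationship, measured here with Pearson's r.
r = np.corrcoef(spend, sales)[0, 1]
print(f"correlation r = {r:.3f}")

# Use 2: effect of a change. The slope b is the additional sales income
# (in thousands) for each additional $1000 spent on marketing.
print(f"extra sales per extra $1000 of spend: {b:.2f} thousand")

# Use 3: a point estimate for a spend level not in the data.
predicted = c + b * 60.0
print(f"predicted sales at a spend of 60: {predicted:.1f} thousand")
```

With these made-up numbers the slope is 1.75, so each extra $1000 of spend goes with about $1750 of extra sales, and the point estimate at a spend of 60 is 207.5. Note that 60 lies outside the observed range, so the prediction assumes the same line keeps holding there.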

Types of Regression Analysis

Simple linear regression

Multiple linear regression

Logistic regression

Ordinal regression

Multinomial regression

Discriminant analysis

When selecting the model for the analysis, an important consideration is model fitting. Adding independent variables to a linear regression model never decreases, and in practice almost always increases, the explained variance of the model (typically expressed as R²), even when the added variables have no real relationship to the outcome. However, adding too many variables can overfit the model, which reduces its generalizability. A simple model is usually preferable to a more complex one. Statistically, if a model includes a large number of variables, some of them are likely to appear statistically significant by chance alone.
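Here is a small sketch of that effect, assuming simulated data in which only one predictor truly matters. The helper `r_squared` is my own illustrative function, not part of any library. It fits by ordinary least squares and computes R² on the same data used for fitting.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 30
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)  # only x truly affects y

def r_squared(X, y):
    """R² of an ordinary least squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Ten extra predictors that are pure noise, unrelated to y.
noise = rng.normal(size=(n, 10))

r2_values = []
for k in range(11):
    # Model with x plus the first k noise predictors.
    X = np.column_stack([x, noise[:, :k]])
    r2_values.append(r_squared(X, y))

for k, r2 in enumerate(r2_values):
    print(f"{k:2d} noise predictors: R² = {r2:.4f}")
```

R² creeps upward as noise predictors are added, even though none of them carries real information about y. That is why a higher R² on the fitting data is not, by itself, evidence of a better model.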

Limits of Simple Linear Regression

Even the best data does not tell a complete story. Regression analysis is commonly used in research to establish that a correlation exists between variables. But correlation is not the same as causation: a relationship between two variables does not mean one causes the other to happen. Even a line in a simple linear regression that fits the data points well does not guarantee a cause-and-effect relationship.

Using a linear regression model will allow you to discover whether a relationship between variables exists at all. To understand exactly what that relationship is, and whether one variable causes another, you will need additional research and statistical analysis.
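As a rough illustration of a good fit without cause and effect, the sketch below simulates one way this can happen: a hidden third variable drives both x and y. The setup, including the variable z, is entirely hypothetical and is only one of several ways a non-causal relationship can arise.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the run is repeatable
n = 500

z = rng.normal(size=n)            # hidden common cause (hypothetical)
x = z + 0.3 * rng.normal(size=n)  # x depends only on z
y = z + 0.3 * rng.normal(size=n)  # y depends only on z, never on x

# Fit y = c + b*x; np.polyfit with deg=1 returns [slope, intercept].
b, c = np.polyfit(x, y, deg=1)
r = np.corrcoef(x, y)[0, 1]

print(f"slope b = {b:.2f}, correlation r = {r:.2f}")
```

The regression reports a clearly positive slope and a strong correlation, yet by construction changing x would do nothing to y. The line tells us the two variables move together, not why.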