Introduction
linear regression is a linear approach for modelling the relationship
between a scalar response variable \(y\) and one or more explanatory variables
(\(x\)). We can also say that linear
regression can predict the value of a variable \(y\) by using another variable(\(x\)) or variables. The variable(s) being
used to predict \(y\) are called
independent variable(s), predictor variables, or features while the
variable \(y\) is often referred to as
the dependent variable, target variable, or the response variable.
Linear Regression is applicable in problems where the response variable
is continous in nature. The predictor variables can be quantitative or
qualitative.
Simple Linear Regression: This is a type of linear
regression problem in which the only one independent variable (only one
predictor variable) is used to predict the response variable. The
relationship can be given mathematically as: \(y = \beta _{0} + \beta _{1}x_{1} +
\epsilon\)
Multivariate Linear Regression: This is an approach
to linear regression that uses more than one predictor variables to
predict the response variable. This can be written mathematically as:
\(y = \beta _{0} + \beta _{1}x_{1} + \beta
_{2}x_{2} + \beta _{3}x_{3} + ...+ \beta _{n}x_{n}
+\epsilon\)
The goal of the linear regression model is to determine the best
parameters \(\beta_{i}\) that best fits
the data points. In other words, Linear Regression seeks to obtain the
line of best fit for the data points.
Assumptions for Linear Regression
Linearity: There must be a linear relationship between
the predictor variable(s) and the response variable. If a linear
relationship does not exists, there is no point using a linear
regression. Non-Linear relationships can be transformed to a linear
relationship and then linear regression can be applied.
Independence of Observations: This means that each
observation in the dataset should be independent of one another. This is
often difficult to obtain by looking at the dataset itself. The process
of data collection is best suited to help us determine whether the data
contains independent observations.
Normality of Residuals: The residuals should follow a
normal distribution. This can be tested using a histogram or a QQ Plot.
Homoscedacity: This means that there should be
constant variance of the residuals. A plot of Fitted Values
vs. Residuals can be used to check if the residuals have a constant
variance. In the case where there is no constant variance of residuals,
the dataset would be said to be heteroscedastic. There are different
ways to deal with heteroscedascity such as log transformation or Box-Cox
transformation.
Evaluating the Linear Regression Model
After obtaining the best line of fit using the linear regression
model, we often use certain metrics to evaluate how well the model
performs. Below are some metrics that can be used to evaluate the
performance of a regression model:
R-Squared and Adjusted R-Squared: This tells us the
percentage of variations in the response variable that is explained by
the predictor variables. This is a relative measure of fit.
Mean Absolute Error (MAE): This is the mean of the
absolute value of the difference between the predicted values and the
actual values. This tells us the amount of error that we can expect from
the predicted values. The MAE is generally less sensitive to outliers.
Also, lower values of MAE indicate a better performing model. This has
the advantage of being in the same unit as the response variable that
makes it easy to interpret.
Mean Square Error (MSE): This is the mean of the
square of errors. It is the mean of the square of the difference between
the predicted values and actual values. It has the disadvantage of not
being in the same unit as the response variable and thus making it hard
to interpret.
Root Mean Square Error (RMSE): This is the square root
of the mean square error. It is the square root of the mean of squared
difference between the predicted value and the actual values. This is an
absolute measure of fit and has the advantage of being in the same unit
as the response variable thereby making it easy to interpret. Just like
MAE, lower values of RMSE indicates a better fit, but it is more
sensitive to outliers.
References
Diez, D., Barr, C. D., & Cetinkaya-Rundel, M. (2019). OpenIntro
statistics 4th Edition.