Regression Analysis: Definition



Definition

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’, ‘covariates’, or ‘features’).

Regression Analysis: Types

Column



Primarily, the following types of regression models are used:

Linear regression
Logistic regression
Polynomial regression
Stepwise regression
Ridge regression
Lasso regression
ElasticNet regression


Linear regression is used for predictive analysis. It is a linear approach to modeling the relationship between a scalar response (the criterion) and one or more explanatory variables (the predictors). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. As with other regression models, there is a danger of overfitting.
The formula for linear regression is:

\[Y' = bX + A\]

Logistic regression is used when the dependent variable is dichotomous. It estimates the parameters of a logistic model and is a form of binomial regression. Logistic regression models the relationship between a two-valued outcome and the predictors. The equation for the log-odds in logistic regression is:

\[l = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}\]
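A minimal sketch in base R, using the built-in mtcars data (the binary transmission indicator am as the outcome) purely for illustration:

```r
# Logistic regression with glm(): binary outcome, binomial family
logit_model <- glm(am ~ mpg, data = mtcars, family = binomial)
summary(logit_model)                           # coefficients are on the log-odds scale
head(predict(logit_model, type = "response"))  # fitted probabilities
```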

Polynomial regression is used for curvilinear data and is typically fit with the method of least squares. The goal is to model the expected value of the dependent variable y as an nth-degree polynomial in the independent variable x. The equation for polynomial regression is:

\[y = \beta_{0}+\beta_{1}x+\beta_{2}x^{2}+\ldots+\beta_{n}x^{n}+\epsilon\]
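As a quick sketch, a second-degree polynomial can be fitted in base R with poly(); the cars data used later in this dashboard serves as the example:

```r
# Quadratic (degree-2) polynomial regression on the cars data
poly_model <- lm(dist ~ poly(speed, 2, raw = TRUE), data = cars)
summary(poly_model)   # coefficients for speed and speed^2
```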

Stepwise regression fits a regression model by selecting the predictor variables through an automatic procedure. At each step, a variable is added to or removed from the set of explanatory variables. The approaches to stepwise regression are forward selection, backward elimination, and bidirectional elimination. The standardized regression coefficient used in stepwise regression is computed as:

\[b_{j.std} = b_{j}(s_{x} * s_{y}^{-1})\]
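Base R provides step() for automated stepwise selection; a small sketch on the built-in mtcars data:

```r
# Bidirectional stepwise selection starting from the full model
full_model <- lm(mpg ~ ., data = mtcars)
step_model <- step(full_model, direction = "both", trace = FALSE)
summary(step_model)   # only the retained predictors appear here
```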


Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The formula for the ridge estimator is:

\[\beta = (X^{T}X + \lambda * I)^{-1}X^{T}y\]
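The closed-form estimator above can be computed directly in a few lines of R; this sketch uses the built-in mtcars data and an arbitrary lambda of 1 just to illustrate the algebra (in practice the predictors are standardized, the intercept is left unpenalized, and lambda is chosen by cross-validation):

```r
# Ridge estimate computed directly from (X'X + lambda*I)^{-1} X'y
X <- model.matrix(mpg ~ wt + hp, data = mtcars)   # design matrix incl. intercept
y <- mtcars$mpg
lambda <- 1
beta_ridge <- solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
beta_ridge
```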

Lasso regression is a regression analysis method that performs both variable selection and regularization. It uses soft thresholding and selects only a subset of the provided covariates for use in the final model. In its general form, lasso regression solves:

\[\min_{\alpha,\beta}\; N^{-1}\sum^{N}_{i=1}f(x_{i}, y_{i}, \alpha, \beta) \quad \text{subject to} \quad \sum^{p}_{j=1}|\beta_{j}| \le t\]

ElasticNet regression is a regularized regression method that linearly combines the L1 penalty of the lasso and the L2 penalty of ridge regression. It has been applied to support vector machines, metric learning, and portfolio optimization. The combined penalty function is given by:

\[\lambda_{1}\sum^{p}_{j=1}|\beta_{j}| + \lambda_{2}\sum^{p}_{j=1}\beta_{j}^{2}\]
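Ridge, lasso, and elastic net fits differ only in the alpha mixing parameter of the glmnet package; a minimal sketch, assuming glmnet is installed and again using mtcars only for illustration:

```r
# Ridge (alpha = 0), lasso (alpha = 1) and elastic net (0 < alpha < 1) via glmnet
library(glmnet)
x <- model.matrix(mpg ~ ., data = mtcars)[, -1]   # predictors without the intercept column
y <- mtcars$mpg
ridge_fit <- cv.glmnet(x, y, alpha = 0)     # pure L2 penalty
lasso_fit <- cv.glmnet(x, y, alpha = 1)     # pure L1 penalty
enet_fit  <- cv.glmnet(x, y, alpha = 0.5)   # elastic net mixture
coef(lasso_fit, s = "lambda.min")           # lasso sets some coefficients exactly to zero
```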

With a few tweaks, other forms of regression models have been formulated for specific applications:

Quantile Regression
Principal Components Regression
Partial Least Square (PLS) Regression
Support Vector Regression
Ordinal Regression
Poisson Regression
Negative Binomial Regression
Quasi Poisson Regression
Cox Regression
Tobit Regression

How to Choose the Right Regression Model



How to choose the correct regression model?

If the dependent variable is continuous and the model suffers from collinearity, or there are many independent variables, you can try PCR, PLS, ridge, lasso, and elastic net regressions. You can select the final model based on adjusted R-squared, RMSE, AIC, and BIC.
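For example, candidate models can be compared on these criteria with a few lines of base R; the cars data is used here only as a placeholder:

```r
# Comparing two candidate models on AIC, BIC and adjusted R-squared
m1 <- lm(dist ~ speed, data = cars)
m2 <- lm(dist ~ poly(speed, 2), data = cars)
AIC(m1, m2)
BIC(m1, m2)
c(adj_r2_m1 = summary(m1)$adj.r.squared,
  adj_r2_m2 = summary(m2)$adj.r.squared)
```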

If you are working on count data, try Poisson, quasi-Poisson, and negative binomial regression. To avoid overfitting, use cross-validation to evaluate the models used for prediction; ridge, lasso, and elastic net regression can also help correct overfitting.
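A minimal sketch of the three count-data models, using the built-in warpbreaks data and assuming the MASS package is available for the negative binomial fit:

```r
# Poisson, quasi-Poisson and negative binomial regression for count data
pois  <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)
quasi <- glm(breaks ~ wool + tension, data = warpbreaks, family = quasipoisson)
library(MASS)
nb    <- glm.nb(breaks ~ wool + tension, data = warpbreaks)
AIC(pois, nb)   # quasi-Poisson has no likelihood, hence no AIC
```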

Try support vector regression when the relationship between the response and the predictors is non-linear.
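A short sketch of support vector regression, assuming the e1071 package is installed; svm() defaults to eps-regression when the response is numeric:

```r
# Support vector regression on the cars data
library(e1071)
svr_model <- svm(dist ~ speed, data = cars)
pred <- predict(svr_model, cars)
sqrt(mean((cars$dist - pred)^2))   # in-sample RMSE of the SVR fit
```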

Univariate Linear Regression Example

Column

Top 10 Observations of the Dataset

   speed dist
1      4    2
2      4   10
3      7    4
4      7   22
5      8   16
6      9   10
7     10   18
8     10   26
9     10   34
10    11   17

Dimensions of the Dataset

[1] 50  2

Summary of the Dataset

     speed           dist       
 Min.   : 4.0   Min.   :  2.00  
 1st Qu.:12.0   1st Qu.: 26.00  
 Median :15.0   Median : 36.00  
 Mean   :15.4   Mean   : 42.98  
 3rd Qu.:19.0   3rd Qu.: 56.00  
 Max.   :25.0   Max.   :120.00  

Visualization

Assumptions of Linear Regression Model


Assumption 1
The regression model is linear in parameters. An example of a model equation that is linear in parameters:
\[Y = a + (β1*X1) + (β2*X2)\]
Assumption 2
The mean of residuals is zero
How to check?
Check the mean of the residuals. If it is zero (or very close to zero), then this assumption holds for that model.
Assumption 3
Homoscedasticity of residuals or equal variance

Let us check the assumptions for the cars dataset.

Linear Regression Model


Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Checking the LM Assumptions: Statistical Methods

[1] 8.65974e-17


The mean of the model residuals is very close to zero, so Assumption 2 is met.


    studentized Breusch-Pagan test

data:  model
BP = 3.2149, df = 1, p-value = 0.07297
Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 4.650233, Df = 1, p = 0.031049

The assumption of homoscedasticity is not met: the NCV test rejects the null hypothesis of constant variance at the 5% level (the Breusch-Pagan test is borderline, with p = 0.073).

Checking the LM Assumptions: Graphical Methods

Multivariate Linear Regression Example

Column

What is Multivariate Linear Regression?



Definition
Multivariate Regression is a method used to measure the degree to which more than one independent variable (the predictors) and more than one dependent variable (the responses) are linearly related.

\[Y' = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+......\beta_{n}x_{n}+\epsilon\]

In the above equation, \(\beta_{0}\) is the intercept, \(\beta_{1}, \beta_{2}, \beta_{3}, \ldots, \beta_{n}\) are the regression coefficients corresponding to the predictors \(x_{1}, x_{2}, x_{3}, \ldots, x_{n}\), and \(Y'\) is the dependent variable.

Top 10 Observations of the Dataset

Dimensions of the Dataset

[1] 210   8



The data contains the following parameters for seeds of three varieties of wheat (Kama = 1, Rosa = 2, and Canadian = 3):

1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient
7. length of kernel groove
8. Wheat Variety

Source: M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, ‘A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images’, in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.
Contributors gratefully acknowledge support of their work by the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

Summary of the Dataset

      Area         Perimeter      Compactness     length_of_kernel
 Min.   :10.59   Min.   :12.41   Min.   :0.8081   Min.   :4.899   
 1st Qu.:12.27   1st Qu.:13.45   1st Qu.:0.8569   1st Qu.:5.262   
 Median :14.36   Median :14.32   Median :0.8734   Median :5.524   
 Mean   :14.85   Mean   :14.56   Mean   :0.8710   Mean   :5.629   
 3rd Qu.:17.30   3rd Qu.:15.71   3rd Qu.:0.8878   3rd Qu.:5.980   
 Max.   :21.18   Max.   :17.25   Max.   :0.9183   Max.   :6.675   
 width_of_kernel asymmetry_coefficient length_of_kernel_groove
 Min.   :2.630   Min.   :0.7651        Min.   :4.519          
 1st Qu.:2.944   1st Qu.:2.5615        1st Qu.:5.045          
 Median :3.237   Median :3.5990        Median :5.223          
 Mean   :3.259   Mean   :3.7002        Mean   :5.408          
 3rd Qu.:3.562   3rd Qu.:4.7687        3rd Qu.:5.877          
 Max.   :4.033   Max.   :8.4560        Max.   :6.550          
 Wheat_Variety
 1:70         
 2:70         
 3:70         
              
              
              

2D Visualization

3D Visualization

Assumptions of Linear Regression Model


Assumption 1
The regression model is linear in parameters. An example of a model equation that is linear in parameters:
\[Y = a + (β1*X1) + (β2*X2)\]
Assumption 2
The mean of residuals is zero
How to check?
Check the mean of the residuals. If it is zero (or very close to zero), then this assumption holds for that model.
Assumption 3
Homoscedasticity of residuals or equal variance

Let us check the assumptions for the seeds dataset.

Linear Multivariate Regression Model


Call:
lm(formula = length_of_kernel_groove ~ ., data = seeds[, -8])

Residuals:
     Min       1Q   Median       3Q      Max 
-0.43176 -0.08756  0.01050  0.09855  0.36292 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           12.191174   2.419963   5.038 1.04e-06 ***
Area                   0.495831   0.083106   5.966 1.07e-08 ***
Perimeter             -0.690882   0.179084  -3.858 0.000154 ***
Compactness           -6.052916   1.756266  -3.446 0.000690 ***
length_of_kernel       0.663013   0.149602   4.432 1.53e-05 ***
width_of_kernel       -0.821043   0.264683  -3.102 0.002196 ** 
asymmetry_coefficient  0.035005   0.007384   4.740 4.01e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1431 on 203 degrees of freedom
Multiple R-squared:  0.9177,    Adjusted R-squared:  0.9152 
F-statistic: 377.1 on 6 and 203 DF,  p-value: < 2.2e-16

Checking the LM Assumptions: Statistical Methods

[1] 6.988691e-18


The mean of the model residuals is very close to zero, so Assumption 2 is met.


    studentized Breusch-Pagan test

data:  model
BP = 7.5698, df = 6, p-value = 0.2713

We can see that the p-value of 0.2713 is much larger than the 0.05 significance level.

Non-constant Variance Score Test 
Variance formula: ~ fitted.values 
Chisquare = 0.3126986, Df = 1, p = 0.57603

The assumption of homoscedasticity is met, as both tests fail to reject the null hypothesis of constant variance.

So the model is robust: all three assumptions are met and the R-squared value exceeds 0.90.

Checking the LM Assumptions: Graphical Methods

Performance Metrics for Linear Regression

Performance Metrics for Linear Regression



The various metrics used to evaluate the results of the prediction are:

Mean Squared Error(MSE)
Root-Mean-Squared-Error(RMSE)
Mean-Absolute-Error(MAE)
R² or Coefficient of Determination
Adjusted R²


Mean Squared Error: MSE, or Mean Squared Error, is one of the most commonly used metrics for regression tasks. It is simply the average of the squared differences between the target values and the values predicted by the regression model. Because it squares the differences, it penalizes even small errors, which can lead to over-estimating how bad the model is. It is often preferred over other metrics because it is differentiable and hence easier to optimize.

Figure: Mean Squared Error Formula
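Written out, with \(y_{i}\) the target and \(\hat{y}_{i}\) the prediction for observation \(i\) of \(N\):

\[MSE = \frac{1}{N}\sum^{N}_{i=1}(y_{i}-\hat{y}_{i})^{2}\]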


Root Mean Squared Error: RMSE is the most widely used metric for regression tasks and is the square root of the average squared difference between the target value and the value predicted by the model. It is preferred in some cases because the errors are squared before averaging, which places a high penalty on large errors. This makes RMSE useful when large errors are undesirable.

Figure: The Formula of Root Mean Squared Error
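In symbols:

\[RMSE = \sqrt{\frac{1}{N}\sum^{N}_{i=1}(y_{i}-\hat{y}_{i})^{2}}\]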


Mean Absolute Error: MAE is the average of the absolute differences between the target values and the values predicted by the model. MAE is more robust to outliers and does not penalize errors as heavily as MSE. MAE is a linear score, which means all the individual differences are weighted equally. It is not suitable for applications where you want to pay more attention to the outliers.

Figure: The Formula of Mean Absolute Error
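In symbols:

\[MAE = \frac{1}{N}\sum^{N}_{i=1}|y_{i}-\hat{y}_{i}|\]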


R² Error: The Coefficient of Determination, or R², is another metric used for evaluating the performance of a regression model. It compares the current model with a constant baseline and tells us how much better the model is. The constant baseline is obtained by taking the mean of the data and drawing a horizontal line at that mean. R² is a scale-free score: no matter how large or small the values are, R² is always less than or equal to 1.

Figure: The Formula for R²
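With \(\bar{y}\) the mean of the observed targets:

\[R^{2} = 1 - \frac{\sum^{N}_{i=1}(y_{i}-\hat{y}_{i})^{2}}{\sum^{N}_{i=1}(y_{i}-\bar{y})^{2}}\]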


Adjusted R²: Adjusted R² conveys the same idea as R² but improves on it. R² suffers from the problem that its value increases as more terms are added, even when the model is not actually improving, which can mislead the researcher. Adjusted R² is always lower than R² because it adjusts for the number of predictors and only increases when there is a real improvement.

Figure: The Formula of Adjusted R²
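With \(n\) observations and \(p\) predictors:

\[\bar{R}^{2} = 1 - (1 - R^{2})\frac{n - 1}{n - p - 1}\]

As a minimal sketch, all five metrics can be computed by hand in base R for the univariate cars model fitted earlier in this dashboard:

```r
# Computing MSE, RMSE, MAE, R-squared and adjusted R-squared by hand
model <- lm(dist ~ speed, data = cars)
res   <- cars$dist - predict(model)

mse    <- mean(res^2)
rmse   <- sqrt(mse)
mae    <- mean(abs(res))
r2     <- 1 - sum(res^2) / sum((cars$dist - mean(cars$dist))^2)
n <- nrow(cars); p <- 1                         # one predictor: speed
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
c(MSE = mse, RMSE = rmse, MAE = mae, R2 = r2, Adj_R2 = adj_r2)
```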


Why is R² Negative?

There is a common misconception that the R² score ranges from 0 to 1; it actually ranges from -∞ to 1. Because of this misconception, people are sometimes alarmed when R² comes out negative, believing that to be impossible.

The following are the reasons for a negative R²:

1. The model does not follow the trend of the data.

2. Because of a large number of outliers, the MSE of the model is larger than the MSE of the baseline.

3. The intercept was mistakenly omitted when building the model.
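A tiny, purely hypothetical illustration in R: predictions that ignore the trend of the data produce an R² below zero because their squared error exceeds that of the mean baseline.

```r
# A deliberately bad "model": predictions that reverse the true trend
y        <- cars$dist
bad_pred <- rev(sort(y))                     # ignores the relationship with speed
ss_res   <- sum((y - bad_pred)^2)
ss_tot   <- sum((y - mean(y))^2)
1 - ss_res / ss_tot                          # negative R-squared
```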

About Us


Dr. Amita Sharma

Post Doc from Erasmus University, the Netherlands, PhD, MBA
Assistant Professor
Institute of Agri-Business Management
Swami Keshwanand Rajasthan Agricultural University
Bikaner (Raj) India
Visit the blog : www.thinkingai.in


Arun Kumar Sharma

Machine Learning Enthusiast, Hobbyist, writer, blogger and S&M Training Professional
Certified Business Analytics Professional
Certified in Predictive Analytics from IIMx Bangalore
Certified in Macroeconomic Forecasting from IMFx
Certified in Text Analytics from openSAP

Contact for How Machine Learning can Transform Your Business:

---
title: "REGRESSION ANALYSIS"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
    social : ["facebook","twitter", "menu"]
    source_code : embed
---

```{r}
library(flexdashboard)
options(rgl.printRglwidget = TRUE)
```


Regression Analysis: Definition  {data-navmenu="MENU"}
=============================================



Definition

In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features').

Regression Analysis: Types {data-navmenu="MENU"}
=======================================

Column
---------------------------------------

Primarily, the following types of regression models are used:

Linear regression
Logistic regression
Polynomial regression
Stepwise regression
Ridge regression
Lasso regression
ElasticNet regression


Linear regression is used for predictive analysis. It is a linear approach to modeling the relationship between a scalar response (the criterion) and one or more explanatory variables (the predictors). Linear regression focuses on the conditional probability distribution of the response given the values of the predictors. As with other regression models, there is a danger of overfitting.
The formula for linear regression is:

$$Y' = bX + A$$

Logistic regression is used when the dependent variable is dichotomous. It estimates the parameters of a logistic model and is a form of binomial regression. Logistic regression models the relationship between a two-valued outcome and the predictors. The equation for the log-odds in logistic regression is:

$$l = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}$$

Polynomial regression is used for curvilinear data and is typically fit with the method of least squares. The goal is to model the expected value of the dependent variable y as an nth-degree polynomial in the independent variable x. The equation for polynomial regression is:

$$y = \beta_{0}+\beta_{1}x+\beta_{2}x^{2}+\ldots+\beta_{n}x^{n}+\epsilon$$

Stepwise regression fits a regression model by selecting the predictor variables through an automatic procedure. At each step, a variable is added to or removed from the set of explanatory variables. The approaches to stepwise regression are forward selection, backward elimination, and bidirectional elimination. The standardized regression coefficient used in stepwise regression is computed as:

$$b_{j.std} = b_{j}(s_{x} * s_{y}^{-1})$$


Ridge regression is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates are unbiased, but their variances are large, so they may be far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces the standard errors. The formula for the ridge estimator is:

$$\beta = (X^{T}X + \lambda * I)^{-1}X^{T}y$$

Lasso regression is a regression analysis method that performs both variable selection and regularization. It uses soft thresholding and selects only a subset of the provided covariates for use in the final model. In its general form, lasso regression solves:

$$\min_{\alpha,\beta}\; N^{-1}\sum^{N}_{i=1}f(x_{i}, y_{i}, \alpha, \beta) \quad \text{subject to} \quad \sum^{p}_{j=1}|\beta_{j}| \le t$$

ElasticNet regression is a regularized regression method that linearly combines the L1 penalty of the lasso and the L2 penalty of ridge regression. It has been applied to support vector machines, metric learning, and portfolio optimization. The combined penalty function is given by:

$$\lambda_{1}\sum^{p}_{j=1}|\beta_{j}| + \lambda_{2}\sum^{p}_{j=1}\beta_{j}^{2}$$

With a few tweaks, other forms of regression models have been formulated for specific applications:

Quantile Regression
Principal Components Regression
Partial Least Square (PLS) Regression
Support Vector Regression
Ordinal Regression
Poisson Regression
Negative Binomial Regression
Quasi Poisson Regression
Cox Regression
Tobit Regression

How to Choose the Right Regression Model {data-navmenu="MENU"}
=====================================

How to choose the correct regression model?

If the dependent variable is continuous and the model suffers from collinearity, or there are many independent variables, you can try PCR, PLS, ridge, lasso, and elastic net regressions. You can select the final model based on adjusted R-squared, RMSE, AIC, and BIC.

If you are working on count data, try Poisson, quasi-Poisson, and negative binomial regression. To avoid overfitting, use cross-validation to evaluate the models used for prediction; ridge, lasso, and elastic net regression can also help correct overfitting.

Try support vector regression when the relationship between the response and the predictors is non-linear.

Univariate Linear Regression Example {data-navmenu="MENU"}
===================================

Column {.tabset}
-----------------------------------------

### Top 10 Observations of the Dataset

```{r}
head(cars, 10)
```

### Dimensions of the Dataset

```{r}
dim(cars)
```

### Summary of the Dataset

```{r}
summary(cars)
```

### Visualization

```{r}
scatter.smooth(x = cars$speed, y = cars$dist, main = "Speed Vs Distance",
               xlab = "Speed of the Car", ylab = "Distance", col = "blue", lwd = 2)
```

### Assumptions of Linear Regression Model
Assumption 1
The regression model is linear in parameters. An example of a model equation that is linear in parameters:
$$Y = a + (β1*X1) + (β2*X2)$$
Assumption 2
The mean of residuals is zero
How to check?
Check the mean of the residuals. If it is zero (or very close to zero), then this assumption holds for that model.
Assumption 3
Homoscedasticity of residuals or equal variance

Let us check the assumptions for the cars dataset.

### Linear Regression Model

```{r echo=TRUE}
model <- lm(dist ~ speed, data = cars)
summary(model)
```

### Checking the LM Assumptions: Statistical Methods

```{r echo=TRUE}
# Calculating the mean of the model residuals
mean(model$residuals)
```
The mean of the model residuals is very close to zero, so Assumption 2 is met.
```{r echo=TRUE}
# Checking for heteroscedasticity
# Ho: Variance of the model residuals is constant
# Ha: Variance of the model residuals is not constant
lmtest::bptest(model)
```

```{r echo=TRUE}
car::ncvTest(model)
```

The assumption of homoscedasticity is not met: the NCV test rejects the null hypothesis of constant variance at the 5% level (the Breusch-Pagan test is borderline, with p = 0.073).

### Checking the LM Assumptions: Graphical Methods

```{r echo=TRUE}
par(mfrow = c(2, 2))   # four diagnostic plots in one panel
plot(model)
```

Multivariate Linear Regression Example {data-navmenu="MENU"}
===================================

Column {.tabset}
-----------------------------------------

### What is Multivariate Linear Regression?

Definition
Multivariate Regression is a method used to measure the degree to which more than one independent variable (the predictors) and more than one dependent variable (the responses) are linearly related.

$$Y' = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\beta_{3}x_{3}+\ldots+\beta_{n}x_{n}+\epsilon$$

In the above equation, $\beta_{0}$ is the intercept, $\beta_{1}, \beta_{2}, \beta_{3}, \ldots, \beta_{n}$ are the regression coefficients corresponding to the predictors $x_{1}, x_{2}, x_{3}, \ldots, x_{n}$, and $Y'$ is the dependent variable.

### Top 10 Observations of the Dataset

```{r}
seeds <- read.delim("C:\\Users\\ARUN SHARMA\\Desktop\\regression analysis\\seeds_dataset.txt", header = FALSE)
names(seeds) <- c("Area", "Perimeter", "Compactness", "length_of_kernel",
                  "width_of_kernel", "asymmetry_coefficient",
                  "length_of_kernel_groove", "Wheat_Variety")
seeds$Wheat_Variety <- as.factor(seeds$Wheat_Variety)
DT::datatable(seeds, filter = "top")
```

### Dimensions of the Dataset

```{r}
dim(seeds)
```

The data contains the following parameters for seeds of three varieties of wheat (Kama = 1, Rosa = 2, and Canadian = 3):

1. area A,
2. perimeter P,
3. compactness C = 4*pi*A/P^2,
4. length of kernel,
5. width of kernel,
6. asymmetry coefficient
7. length of kernel groove
8. Wheat Variety

Source: M. Charytanowicz, J. Niewczas, P. Kulczycki, P.A. Kowalski, S. Lukasik, S. Zak, 'A Complete Gradient Clustering Algorithm for Features Analysis of X-ray Images', in: Information Technologies in Biomedicine, Ewa Pietka, Jacek Kawa (eds.), Springer-Verlag, Berlin-Heidelberg, 2010, pp. 15-24.
Contributors gratefully acknowledge support of their work by the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
### Summary of the Dataset

```{r}
summary(seeds)
```

### 2D Visualization

```{r}
library(ggplot2)
ggplot(seeds, aes(x = Area, y = length_of_kernel_groove, color = Wheat_Variety)) +
  geom_point() +
  ggtitle("Length of Kernel Groove Vs Area of Seed") +
  xlab("Area of Seed") +
  ylab("Length of Kernel Groove")
```

### 3D Visualization

```{r}
library(plot3Drgl)
library(plot3D)
scatter3D(x = seeds$length_of_kernel, y = seeds$length_of_kernel_groove,
          z = seeds$width_of_kernel,
          bty = "u", col.panel = "lightyellow", expand = 1, col.grid = "black",
          col.var = as.integer(seeds$Wheat_Variety),
          col = c("#1B9E77", "#D95F02", "#7570B3"),
          pch = 18, ticktype = "detailed",
          colkey = list(at = c(2, 3, 4), side = 4, addlines = TRUE,
                        length = 0.5, width = 0.5,
                        labels = c("Kama", "Rosa", "Canadian")),
          main = "Length Kernel Groove Vs Width and Length of Kernel")
```

### Assumptions of Linear Regression Model
Assumption 1
The regression model is linear in parameters. An example of a model equation that is linear in parameters:
$$Y = a + (β1*X1) + (β2*X2)$$
Assumption 2
The mean of residuals is zero
How to check?
Check the mean of the residuals. If it is zero (or very close to zero), then this assumption holds for that model.
Assumption 3
Homoscedasticity of residuals or equal variance

Let us check the assumptions for the seeds dataset.

### Linear Multivariate Regression Model

```{r echo=TRUE}
model <- lm(length_of_kernel_groove ~ ., data = seeds[, -8])
summary(model)
```

### Checking the LM Assumptions: Statistical Methods

```{r echo=TRUE}
# Calculating the mean of the model residuals
mean(model$residuals)
```
The mean of the model residuals is very close to zero, so Assumption 2 is met.
```{r echo=TRUE}
# Checking for heteroscedasticity
# Ho: Variance of the model residuals is constant
# Ha: Variance of the model residuals is not constant
lmtest::bptest(model)
```

We can see that the p-value of 0.2713 is much larger than the 0.05 significance level.

```{r echo=TRUE}
car::ncvTest(model)
```

The assumption of homoscedasticity is met, as both tests fail to reject the null hypothesis of constant variance.

So the model is robust: all three assumptions are met and the R-squared value exceeds 0.90.

### Checking the LM Assumptions: Graphical Methods

```{r echo=FALSE}
par(mfrow = c(2, 2))   # four diagnostic plots in one panel
plot(model)
```

Performance Metrics for Linear Regression {data-navmenu="MENU"}
============================================

### Performance Metrics for Linear Regression

The various metrics used to evaluate the results of the prediction are:

Mean Squared Error(MSE)
Root-Mean-Squared-Error(RMSE)
Mean-Absolute-Error(MAE)
R² or Coefficient of Determination
Adjusted R²


Mean Squared Error: MSE, or Mean Squared Error, is one of the most commonly used metrics for regression tasks. It is simply the average of the squared differences between the target values and the values predicted by the regression model. Because it squares the differences, it penalizes even small errors, which can lead to over-estimating how bad the model is. It is often preferred over other metrics because it is differentiable and hence easier to optimize.

Figure: Mean Squared Error Formula


Root Mean Squared Error: RMSE is the most widely used metric for regression tasks and is the square root of the average squared difference between the target value and the value predicted by the model. It is preferred in some cases because the errors are squared before averaging, which places a high penalty on large errors. This makes RMSE useful when large errors are undesirable.

Figure: The Formula of Root Mean Squared Error


Mean Absolute Error: MAE is the average of the absolute differences between the target values and the values predicted by the model. MAE is more robust to outliers and does not penalize errors as heavily as MSE. MAE is a linear score, which means all the individual differences are weighted equally. It is not suitable for applications where you want to pay more attention to the outliers.

Figure: The Formula of Mean Absolute Error


R² Error: The Coefficient of Determination, or R², is another metric used for evaluating the performance of a regression model. It compares the current model with a constant baseline and tells us how much better the model is. The constant baseline is obtained by taking the mean of the data and drawing a horizontal line at that mean. R² is a scale-free score: no matter how large or small the values are, R² is always less than or equal to 1.

Figure: The Formula for R²


Adjusted R²: Adjusted R² conveys the same idea as R² but improves on it. R² suffers from the problem that its value increases as more terms are added, even when the model is not actually improving, which can mislead the researcher. Adjusted R² is always lower than R² because it adjusts for the number of predictors and only increases when there is a real improvement.

Figure: The Formula of Adjusted R²


Why is R² Negative?

There is a common misconception that the R² score ranges from 0 to 1; it actually ranges from -∞ to 1. Because of this misconception, people are sometimes alarmed when R² comes out negative, believing that to be impossible.

The following are the reasons for a negative R²:

1. The model does not follow the trend of the data.

2. Because of a large number of outliers, the MSE of the model is larger than the MSE of the baseline.

3. The intercept was mistakenly omitted when building the model.
About Us {data-navmenu="MENU"}
============================================
Dr. Amita Sharma

Post Doc from Erasmus University, the Netherlands, PhD, MBA
Assistant Professor
Institute of Agri-Business Management
Swami Keshwanand Rajasthan Agricultural University
Bikaner (Raj) India
Visit the blog : www.thinkingai.in


Arun Kumar Sharma

Machine Learning Enthusiast, Hobbyist, writer, blogger and S&M Training Professional
Certified Business Analytics Professional
Certified in Predictive Analytics from IIMx Bangalore
Certified in Macroeconomic Forecasting from IMFx
Certified in Text Analytics from openSAP

Contact for How Machine Learning can Transform Your Business: 9468567418/aks10000@gmail.com