Introduction

Linear regression is one of the most widely used algorithms for modelling data, and it is also one of the most interpretable methods compared to other algorithms. It can take different forms depending on the type of dataset.

Real-world datasets often suffer from multicollinearity and may contain a large number of predictors, so a shrinkage mechanism can help simplify both the data and the model. In this article, we will take a sample dataset, apply L1/L2 regularization to it, and interpret the resulting models.

Dataset

The dataset we are going to use is the Big Mart sales dataset [https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/]. It has several columns; below is a snapshot of the data.

# Read the data, treating blank cells and 'NA' strings as missing values
df_original = read.csv('./data/Train_UWu5bXk.csv', skipNul = TRUE,
                       na.strings = c(' ', '', 'NA', NA))

# Drop rows containing missing values
df_original = na.omit(df_original)

# Draw two random samples of 1,000 rows each: one for training, one for testing
df = df_original[sample(nrow(df_original), 1000), ]

df_test = df_original[sample(nrow(df_original), 1000), ]
head(df)
##      Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1917           FDQ44       20.50          Low Fat      0.03628752
## 4848           FDU16       19.25          Regular      0.05822661
## 6912           FDF08       14.30          Regular      0.06515329
## 6530           DRK39        7.02          Low Fat      0.04986540
## 3783           FDQ40       11.10          Regular      0.03608354
## 346            DRI37       15.85          Low Fat      0.10776516
##                  Item_Type Item_MRP Outlet_Identifier
## 1917 Fruits and Vegetables 121.2756            OUT018
## 4848          Frozen Foods  85.2908            OUT013
## 6912 Fruits and Vegetables  88.4856            OUT013
## 6530                 Dairy  82.9250            OUT046
## 3783          Frozen Foods 175.4712            OUT049
## 346            Soft Drinks  59.5904            OUT049
##      Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1917                      2009      Medium               Tier 3
## 4848                      1987        High               Tier 3
## 6912                      1987        High               Tier 3
## 6530                      1997       Small               Tier 1
## 3783                      1999      Medium               Tier 1
## 346                       1999      Medium               Tier 1
##            Outlet_Type Item_Outlet_Sales
## 1917 Supermarket Type2         2181.1608
## 4848 Supermarket Type1          671.1264
## 6912 Supermarket Type1         1406.1696
## 6530 Supermarket Type1         1165.1500
## 3783 Supermarket Type1         2988.1104
## 346  Supermarket Type1          703.0848

There are a total of 11 predictors and one dependent variable, Item_Outlet_Sales, which we need to predict. The data also contains categorical variables. For the sake of simplicity, we will omit the categorical variables for now and use only the quantitative ones.
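For reference, one quick way to see which columns are quantitative (a minimal sketch, assuming the df sample from above):

# Illustrative only: list the numeric (quantitative) columns in df
num_cols = sapply(df, is.numeric)
names(df)[num_cols]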

Simple models

Let's start with a simple model using a few quantitative predictors (Outlet_Establishment_Year, Item_MRP, and Item_Weight).

Multiple Linear Regression

Let's fit a multiple linear regression with Outlet_Establishment_Year, Item_MRP, and Item_Weight as predictors of Item_Outlet_Sales.

multiple_model = lm(Item_Outlet_Sales ~ Outlet_Establishment_Year + Item_MRP + Item_Weight, data = df)

summary(multiple_model)
## 
## Call:
## lm(formula = Item_Outlet_Sales ~ Outlet_Establishment_Year + 
##     Item_MRP + Item_Weight, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3577.0  -639.9   -53.4   546.1  5610.7 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               9546.7766  9537.3082   1.001    0.317    
## Outlet_Establishment_Year   -4.8140     4.7686  -1.010    0.313    
## Item_MRP                    16.0690     0.5555  28.926   <2e-16 ***
## Item_Weight                  2.5517     7.3416   0.348    0.728    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1088 on 996 degrees of freedom
## Multiple R-squared:  0.4576, Adjusted R-squared:  0.456 
## F-statistic: 280.1 on 3 and 996 DF,  p-value: < 2.2e-16

The simple model above gives an adjusted R-squared of about 0.46, i.e., roughly 46% of the variability in the dependent variable is explained by the model. Next, we will evaluate some other metrics.

# Compute MSE and RMSE of a model's predictions on a data frame
evaluation <- function(df, model){
  pred = predict(model, df)
  mse = round(mean((df$Item_Outlet_Sales - pred)^2), 2)
  rmse = round(sqrt(mean((df$Item_Outlet_Sales - pred)^2)), 2)
  
  print(paste0("MSE: ", mse))
  print(paste0("RMSE: ", rmse))  
}

#Training Error
evaluation(df,multiple_model)
## [1] "MSE: 1179087.88"
## [1] "RMSE: 1085.86"
#Testing Error
evaluation(df_test,multiple_model)
## [1] "MSE: 1234270.36"
## [1] "RMSE: 1110.98"

Polynomial Linear Regression

Multiple linear regression improved the performance a bit, but not drastically. Let's try polynomial regression and see whether it does better.

polynomial_model = lm(Item_Outlet_Sales ~ poly(Outlet_Establishment_Year, 4) + poly(Item_MRP, 6) + poly(Item_Weight, 3), data = df)

summary(polynomial_model)
## 
## Call:
## lm(formula = Item_Outlet_Sales ~ poly(Outlet_Establishment_Year, 
##     4) + poly(Item_MRP, 6) + poly(Item_Weight, 3), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3668.0  -647.0   -26.0   534.9  5412.2 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)                          2240.59      34.27  65.385   <2e-16
## poly(Outlet_Establishment_Year, 4)1 -1049.76    1088.85  -0.964   0.3352
## poly(Outlet_Establishment_Year, 4)2 -2212.70    1090.01  -2.030   0.0426
## poly(Outlet_Establishment_Year, 4)3 -2378.51    1086.65  -2.189   0.0288
## poly(Outlet_Establishment_Year, 4)4  1752.89    1088.42   1.610   0.1076
## poly(Item_MRP, 6)1                  31443.86    1086.71  28.935   <2e-16
## poly(Item_MRP, 6)2                    189.67    1090.87   0.174   0.8620
## poly(Item_MRP, 6)3                    892.92    1092.70   0.817   0.4140
## poly(Item_MRP, 6)4                   2036.70    1088.77   1.871   0.0617
## poly(Item_MRP, 6)5                    860.29    1087.09   0.791   0.4289
## poly(Item_MRP, 6)6                    290.25    1087.29   0.267   0.7896
## poly(Item_Weight, 3)1                 633.44    1096.86   0.578   0.5637
## poly(Item_Weight, 3)2                 814.57    1090.39   0.747   0.4552
## poly(Item_Weight, 3)3                -881.82    1089.36  -0.809   0.4184
##                                        
## (Intercept)                         ***
## poly(Outlet_Establishment_Year, 4)1    
## poly(Outlet_Establishment_Year, 4)2 *  
## poly(Outlet_Establishment_Year, 4)3 *  
## poly(Outlet_Establishment_Year, 4)4    
## poly(Item_MRP, 6)1                  ***
## poly(Item_MRP, 6)2                     
## poly(Item_MRP, 6)3                     
## poly(Item_MRP, 6)4                  .  
## poly(Item_MRP, 6)5                     
## poly(Item_MRP, 6)6                     
## poly(Item_Weight, 3)1                  
## poly(Item_Weight, 3)2                  
## poly(Item_Weight, 3)3                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1084 on 986 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4604 
## F-statistic: 66.56 on 13 and 986 DF,  p-value: < 2.2e-16

As we add more and more polynomial terms to the predictors, the model starts to overfit. Let's digress briefly with a modified dataset to illustrate this and the bias-variance tradeoff.

# Altering the data a bit: drop high-MRP items with low sales
df_changed = df[!(df$Item_MRP > 150 & df$Item_Outlet_Sales < 2000), ]

# Fit polynomial models of increasing degree on the altered data
fit_1 = lm(Item_Outlet_Sales ~ 1, data = df_changed)
fit_2 = lm(Item_Outlet_Sales ~ Item_MRP, data = df_changed)
fit_3 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 5), data = df_changed)
fit_4 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 10), data = df_changed)
fit_5 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 15), data = df_changed)
fit_6 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 20), data = df_changed)
fit_7 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 22), data = df_changed)

# Plot the data and overlay each fitted curve with a distinct color
plot(Item_Outlet_Sales ~ Item_MRP, df_changed)
grid = seq(from = min(df_changed$Item_MRP), to = max(df_changed$Item_MRP), by = 1)
lines(grid, predict(fit_1, newdata = data.frame(Item_MRP = grid)), 
      col = "red", lwd = 2, lty = 2)
lines(grid, predict(fit_2, newdata = data.frame(Item_MRP = grid)), 
      col = "blue", lwd = 2, lty = 3)
lines(grid, predict(fit_3, newdata = data.frame(Item_MRP = grid)), 
      col = "green", lwd = 2, lty = 4)
lines(grid, predict(fit_4, newdata = data.frame(Item_MRP = grid)), 
      col = "orange", lwd = 2, lty = 5)
lines(grid, predict(fit_5, newdata = data.frame(Item_MRP = grid)), 
      col = "purple", lwd = 2, lty = 6)
lines(grid, predict(fit_6, newdata = data.frame(Item_MRP = grid)), 
      col = "brown", lwd = 2, lty = 1)
lines(grid, predict(fit_7, newdata = data.frame(Item_MRP = grid)), 
      col = "black", lwd = 2, lty = 5)

#Fit 2
evaluation(df_changed,fit_2)
## [1] "MSE: 887622.87"
## [1] "RMSE: 942.14"
#fit 4
evaluation(df_changed,fit_4)
## [1] "MSE: 858701.66"
## [1] "RMSE: 926.66"
#Fit 6
evaluation(df_changed,fit_6)
## [1] "MSE: 849811.34"
## [1] "RMSE: 921.85"
#Fit7
evaluation(df_changed,fit_7)
## [1] "MSE: 842595.97"
## [1] "RMSE: 917.93"

From the plot above, as the polynomial degree increases, the model increasingly overfits the data.

As we add more and more parameters to our model, its complexity increases, which increases variance and decreases bias, i.e., the model overfits. So we need to find an optimum point where the decrease in bias is balanced by the increase in variance. In practice, there is no analytical way to find this point.
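A practical way to look for that point is to compare training and test error as model complexity grows. Below is a minimal sketch, assuming the df and df_test samples from above, that sweeps the polynomial degree of Item_MRP and plots both errors:

# Sketch: training vs. test RMSE as the polynomial degree of Item_MRP grows.
# Training error keeps falling, while test error eventually flattens or rises.
degrees = 1:10
err = sapply(degrees, function(d) {
  fit = lm(Item_Outlet_Sales ~ poly(Item_MRP, d), data = df)
  c(train = sqrt(mean((df$Item_Outlet_Sales - predict(fit, df))^2)),
    test  = sqrt(mean((df_test$Item_Outlet_Sales - predict(fit, df_test))^2)))
})
matplot(degrees, t(err), type = "l", lty = 1:2, col = c("blue", "red"),
        xlab = "Polynomial degree", ylab = "RMSE")
legend("topright", legend = c("train", "test"), lty = 1:2, col = c("blue", "red"))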

To overcome underfitting (high bias), we can add new parameters to the model so that its complexity increases, thereby reducing the bias.

To overcome overfitting (high variance), we use regularization. There are two common types of regularization, covered below.

Before that, let's convert the categorical variables to dummy variables.

# dummy.data.frame() from the 'dummies' package expands the factor columns
# into 0/1 indicator (dummy) variables
library(dummies)

df_dummies = dummy.data.frame(df[,c("Item_Weight",
                                    "Item_Fat_Content","Item_Visibility",
                                    "Item_Type","Item_MRP",
                                    "Outlet_Identifier","Outlet_Establishment_Year",
                                    "Outlet_Size","Outlet_Location_Type",
                                    "Outlet_Type","Item_Outlet_Sales")])

#head(df_dummies)
overall_fit = lm(Item_Outlet_Sales ~ ., data = df_dummies)

summary(overall_fit)
## 
## Call:
## lm(formula = Item_Outlet_Sales ~ ., data = df_dummies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3580.2  -617.7    -5.1   519.0  5607.4 
## 
## Coefficients: (12 not defined because of singularities)
##                                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)                        290.5415   326.4653   0.890 0.373707
## Item_Weight                          3.2180     7.4200   0.434 0.664606
## Item_Fat_ContentLF                -103.4499   217.4722  -0.476 0.634400
## `Item_Fat_Contentlow fat`          753.8568   417.2739   1.807 0.071130
## `Item_Fat_ContentLow Fat`           49.4246    82.5787   0.599 0.549635
## Item_Fat_Contentreg               -114.4669   250.5972  -0.457 0.647934
## Item_Fat_ContentRegular                  NA         NA      NA       NA
## Item_Visibility                  -1179.3281   801.7284  -1.471 0.141619
## `Item_TypeBaking Goods`           -215.2064   302.9904  -0.710 0.477704
## Item_TypeBreads                      4.3815   344.6660   0.013 0.989860
## Item_TypeBreakfast                 405.7509   391.0738   1.038 0.299747
## Item_TypeCanned                   -184.3354   306.2266  -0.602 0.547343
## Item_TypeDairy                    -169.3139   304.3043  -0.556 0.578068
## `Item_TypeFrozen Foods`           -269.6639   302.0346  -0.893 0.372172
## `Item_TypeFruits and Vegetables`  -142.5656   294.3011  -0.484 0.628196
## `Item_TypeHard Drinks`             187.1223   346.4534   0.540 0.589246
## `Item_TypeHealth and Hygiene`      -78.1734   318.2942  -0.246 0.806043
## Item_TypeHousehold                -125.2685   299.6131  -0.418 0.675966
## Item_TypeMeat                     -211.0127   330.1058  -0.639 0.522825
## Item_TypeOthers                   -615.2010   391.5181  -1.571 0.116433
## Item_TypeSeafood                  1193.6593   560.7252   2.129 0.033523
## `Item_TypeSnack Foods`            -185.1607   292.9980  -0.632 0.527567
## `Item_TypeSoft Drinks`            -290.5898   322.2989  -0.902 0.367484
## `Item_TypeStarchy Foods`                 NA         NA      NA       NA
## Item_MRP                            15.9359     0.5599  28.460  < 2e-16
## Outlet_IdentifierOUT013           -180.0761   108.5173  -1.659 0.097353
## Outlet_IdentifierOUT018           -361.7711   107.6836  -3.360 0.000811
## Outlet_IdentifierOUT035            -83.4523   106.6997  -0.782 0.434333
## Outlet_IdentifierOUT046           -224.2009   106.3994  -2.107 0.035359
## Outlet_IdentifierOUT049                  NA         NA      NA       NA
## Outlet_Establishment_Year                NA         NA      NA       NA
## Outlet_SizeHigh                          NA         NA      NA       NA
## Outlet_SizeMedium                        NA         NA      NA       NA
## Outlet_SizeSmall                         NA         NA      NA       NA
## `Outlet_Location_TypeTier 1`             NA         NA      NA       NA
## `Outlet_Location_TypeTier 2`             NA         NA      NA       NA
## `Outlet_Location_TypeTier 3`             NA         NA      NA       NA
## `Outlet_TypeSupermarket Type1`           NA         NA      NA       NA
## `Outlet_TypeSupermarket Type2`           NA         NA      NA       NA
##                                     
## (Intercept)                         
## Item_Weight                         
## Item_Fat_ContentLF                  
## `Item_Fat_Contentlow fat`        .  
## `Item_Fat_ContentLow Fat`           
## Item_Fat_Contentreg                 
## Item_Fat_ContentRegular             
## Item_Visibility                     
## `Item_TypeBaking Goods`             
## Item_TypeBreads                     
## Item_TypeBreakfast                  
## Item_TypeCanned                     
## Item_TypeDairy                      
## `Item_TypeFrozen Foods`             
## `Item_TypeFruits and Vegetables`    
## `Item_TypeHard Drinks`              
## `Item_TypeHealth and Hygiene`       
## Item_TypeHousehold                  
## Item_TypeMeat                       
## Item_TypeOthers                     
## Item_TypeSeafood                 *  
## `Item_TypeSnack Foods`              
## `Item_TypeSoft Drinks`              
## `Item_TypeStarchy Foods`            
## Item_MRP                         ***
## Outlet_IdentifierOUT013          .  
## Outlet_IdentifierOUT018          ***
## Outlet_IdentifierOUT035             
## Outlet_IdentifierOUT046          *  
## Outlet_IdentifierOUT049             
## Outlet_Establishment_Year           
## Outlet_SizeHigh                     
## Outlet_SizeMedium                   
## Outlet_SizeSmall                    
## `Outlet_Location_TypeTier 1`        
## `Outlet_Location_TypeTier 2`        
## `Outlet_Location_TypeTier 3`        
## `Outlet_TypeSupermarket Type1`      
## `Outlet_TypeSupermarket Type2`      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1079 on 973 degrees of freedom
## Multiple R-squared:  0.4786, Adjusted R-squared:  0.4646 
## F-statistic: 34.35 on 26 and 973 DF,  p-value: < 2.2e-16
# Model coefficients
barplot(coefficients(overall_fit))

L2 Regularization (Ridge Regression)

Ridge regression adds a penalty proportional to the square of the coefficients' magnitude.

\(\min_{\theta}(||Y-X\theta||^2 + \lambda||\theta||^2)\)

The plot below, for a lambda value of 10, shows that ridge clearly shrinks the coefficient weights; however, it does not drive them all the way to zero. The ridge coefficients have the closed-form solution below.

\(\hat{\beta} = (X^TX+\lambda I)^{-1}X^Ty\)

library(glmnet)  # glmnet provides ridge (alpha = 0) and lasso (alpha = 1)
ridge_model = glmnet(x = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales))),
                     y = df_dummies[, c('Item_Outlet_Sales')], alpha = 0, lambda = 10)

barplot(ridge_model$beta[,1])
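For intuition, here is a minimal sketch of the closed-form solution shown above, computed directly in base R. This is illustrative only: glmnet standardizes the predictors and scales the penalty differently, and the sketch omits an intercept, so the numbers will not match ridge_model exactly.

# Closed-form ridge coefficients: (X'X + lambda*I)^{-1} X'y (no intercept).
# Adding lambda*I makes X'X invertible even though some dummy columns are collinear.
X = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales)))
y = df_dummies$Item_Outlet_Sales
lambda = 10
beta_ridge = solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y
head(beta_ridge)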

L1 Regularization (Lasso Regression)

Lasso regression adds a penalty proportional to the absolute value of the coefficients' magnitude.

\(\min_{\theta}(||Y-X\theta||^2 + \lambda||\theta||_1)\)

In the plot below, lasso clearly drives many of the coefficients all the way to zero.

# Lasso regression: alpha = 1 selects the L1 penalty
lasso_model = glmnet(x = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales))),
                     y = df_dummies[, c('Item_Outlet_Sales')], alpha = 1, lambda = 10)

barplot(lasso_model$beta[,1])
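As a quick check of the difference between the two penalties, we can count how many coefficients each model sets exactly to zero. And since lambda = 10 was chosen arbitrarily here, the sketch below also shows how one would normally pick it by cross-validation with cv.glmnet (using the same x and y as above):

# Compare sparsity: lasso typically zeroes out many coefficients, ridge almost none
sum(ridge_model$beta[, 1] == 0)
sum(lasso_model$beta[, 1] == 0)

# Choosing lambda by 10-fold cross-validation (the usual approach in practice)
x = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales)))
y = df_dummies$Item_Outlet_Sales
cv_lasso = cv.glmnet(x, y, alpha = 1)
cv_lasso$lambda.min                 # lambda with the lowest cross-validated error
coef(cv_lasso, s = "lambda.min")    # coefficients at that lambda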

Summary

L1 Regularization:

  1. Adds a penalty proportional to the absolute value of the coefficients.
  2. In the case of multicollinearity, it tends to pick one of the correlated variables and drive the others' weights to zero.
  3. Useful when there are many predictor variables, since it performs variable selection by shrinking some coefficients all the way to zero.
  4. Computationally more expensive than L2, since there is no closed-form solution.

L2 Regularization:

  1. Adds a penalty proportional to the square of the coefficients.
  2. In the case of multicollinearity between variables, it balances out the weights across the correlated predictors, which generally makes it the safer choice for multicollinearity (see the toy sketch after this list).
  3. Useful when most predictors are expected to be relevant, since it keeps every variable in the model instead of performing variable selection.
  4. Computationally cheaper than L1, thanks to the closed-form solution shown earlier.
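Point 2 above can be seen in a tiny simulation (toy data, not the Big Mart set): with two nearly identical predictors, ridge splits the weight between them, while lasso tends to keep one and zero out the other.

# Toy sketch of how the two penalties treat nearly collinear predictors
set.seed(1)
x1 = rnorm(200)
x2 = x1 + rnorm(200, sd = 0.01)               # x2 almost identical to x1
y  = 3 * x1 + rnorm(200)
X  = cbind(x1, x2)
coef(glmnet(X, y, alpha = 0, lambda = 0.5))   # ridge: weight shared between x1, x2
coef(glmnet(X, y, alpha = 1, lambda = 0.5))   # lasso: one coefficient (near) zero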

Both are used to avoid overfitting.