Linear regression is one of the most powerful algorithms for modeling data, and it is also one of the most interpretable methods compared to other algorithms. It can take different forms depending on the type of dataset.
Real-world datasets may suffer from multicollinearity or contain a large number of predictors, and shrinkage mechanisms can be applied to simplify the model. In this article, we take a sample dataset, apply L1/L2 regularization to it, and interpret the output.
The dataset we are going to use is the Big Mart Sales dataset [https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/]. It has several columns; below is a snapshot of the dataset.
# Read the Big Mart Sales training file, treating blanks and 'NA' as missing values
df_original = read.csv('./data/Train_UWu5bXk.csv', skipNul = TRUE, na.strings = c(' ', '', 'NA', NA))
# Drop rows with missing values
df_original = na.omit(df_original)
# Draw 1000-row samples for training and testing
# (note: the two samples are drawn independently, so they may overlap)
df = df_original[sample(nrow(df_original), 1000), ]
df_test = df_original[sample(nrow(df_original), 1000), ]
head(df)
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 1917 FDQ44 20.50 Low Fat 0.03628752
## 4848 FDU16 19.25 Regular 0.05822661
## 6912 FDF08 14.30 Regular 0.06515329
## 6530 DRK39 7.02 Low Fat 0.04986540
## 3783 FDQ40 11.10 Regular 0.03608354
## 346 DRI37 15.85 Low Fat 0.10776516
## Item_Type Item_MRP Outlet_Identifier
## 1917 Fruits and Vegetables 121.2756 OUT018
## 4848 Frozen Foods 85.2908 OUT013
## 6912 Fruits and Vegetables 88.4856 OUT013
## 6530 Dairy 82.9250 OUT046
## 3783 Frozen Foods 175.4712 OUT049
## 346 Soft Drinks 59.5904 OUT049
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 1917 2009 Medium Tier 3
## 4848 1987 High Tier 3
## 6912 1987 High Tier 3
## 6530 1997 Small Tier 1
## 3783 1999 Medium Tier 1
## 346 1999 Medium Tier 1
## Outlet_Type Item_Outlet_Sales
## 1917 Supermarket Type2 2181.1608
## 4848 Supermarket Type1 671.1264
## 6912 Supermarket Type1 1406.1696
## 6530 Supermarket Type1 1165.1500
## 3783 Supermarket Type1 2988.1104
## 346 Supermarket Type1 703.0848
There are a total of 11 predictors and one dependent variable, Item_Outlet_Sales, which we need to predict. The data also contains categorical variables. For the sake of simplicity, we will omit the categorical variables for now and use the quantitative ones (one way to pull them out is sketched below).
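As a minimal sketch (df_quant is an illustrative name and is not used by the models that follow), the quantitative columns can be selected like this:
# Keep only the numeric (quantitative) columns of the sampled data
df_quant = df[, sapply(df, is.numeric)]
names(df_quant)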
Let's start with a simple model using a few quantitative predictors (Outlet_Establishment_Year, Item_MRP, Item_Weight).
Let's fit a multiple linear regression with these predictors and Item_Outlet_Sales as the response.
multiple_model = lm(Item_Outlet_Sales ~ Outlet_Establishment_Year + Item_MRP + Item_Weight, data = df)
summary(multiple_model)
##
## Call:
## lm(formula = Item_Outlet_Sales ~ Outlet_Establishment_Year +
## Item_MRP + Item_Weight, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3577.0 -639.9 -53.4 546.1 5610.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9546.7766 9537.3082 1.001 0.317
## Outlet_Establishment_Year -4.8140 4.7686 -1.010 0.313
## Item_MRP 16.0690 0.5555 28.926 <2e-16 ***
## Item_Weight 2.5517 7.3416 0.348 0.728
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1088 on 996 degrees of freedom
## Multiple R-squared: 0.4576, Adjusted R-squared: 0.456
## F-statistic: 280.1 on 3 and 996 DF, p-value: < 2.2e-16
The model above gives an adjusted R-squared of about 0.46, i.e., roughly 46% of the variability in the dependent variable is explained by the model. Now we will evaluate other metrics.
# Report MSE and RMSE of a fitted model on a given data frame
evaluation <- function(df, model){
  pred = predict(model, df)
  mse = round(mean((df$Item_Outlet_Sales - pred)^2), 2)
  rmse = round(sqrt(mean((df$Item_Outlet_Sales - pred)^2)), 2)
  print(paste0("MSE: ", mse))
  print(paste0("RMSE: ", rmse))
}
#Training Error
evaluation(df,multiple_model)
## [1] "MSE: 1179087.88"
## [1] "RMSE: 1085.86"
#Testing Error
evaluation(df_test,multiple_model)
## [1] "MSE: 1234270.36"
## [1] "RMSE: 1110.98"
Multiple linear regression improved performance a bit, but not drastically, so let's try polynomial regression and see whether it does better.
polynomial_model = lm(Item_Outlet_Sales ~ poly(Outlet_Establishment_Year, 4) + poly(Item_MRP, 6) + poly(Item_Weight, 3), data = df)
summary(polynomial_model)
##
## Call:
## lm(formula = Item_Outlet_Sales ~ poly(Outlet_Establishment_Year,
## 4) + poly(Item_MRP, 6) + poly(Item_Weight, 3), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3668.0 -647.0 -26.0 534.9 5412.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2240.59 34.27 65.385 <2e-16
## poly(Outlet_Establishment_Year, 4)1 -1049.76 1088.85 -0.964 0.3352
## poly(Outlet_Establishment_Year, 4)2 -2212.70 1090.01 -2.030 0.0426
## poly(Outlet_Establishment_Year, 4)3 -2378.51 1086.65 -2.189 0.0288
## poly(Outlet_Establishment_Year, 4)4 1752.89 1088.42 1.610 0.1076
## poly(Item_MRP, 6)1 31443.86 1086.71 28.935 <2e-16
## poly(Item_MRP, 6)2 189.67 1090.87 0.174 0.8620
## poly(Item_MRP, 6)3 892.92 1092.70 0.817 0.4140
## poly(Item_MRP, 6)4 2036.70 1088.77 1.871 0.0617
## poly(Item_MRP, 6)5 860.29 1087.09 0.791 0.4289
## poly(Item_MRP, 6)6 290.25 1087.29 0.267 0.7896
## poly(Item_Weight, 3)1 633.44 1096.86 0.578 0.5637
## poly(Item_Weight, 3)2 814.57 1090.39 0.747 0.4552
## poly(Item_Weight, 3)3 -881.82 1089.36 -0.809 0.4184
##
## (Intercept) ***
## poly(Outlet_Establishment_Year, 4)1
## poly(Outlet_Establishment_Year, 4)2 *
## poly(Outlet_Establishment_Year, 4)3 *
## poly(Outlet_Establishment_Year, 4)4
## poly(Item_MRP, 6)1 ***
## poly(Item_MRP, 6)2
## poly(Item_MRP, 6)3
## poly(Item_MRP, 6)4 .
## poly(Item_MRP, 6)5
## poly(Item_MRP, 6)6
## poly(Item_Weight, 3)1
## poly(Item_Weight, 3)2
## poly(Item_Weight, 3)3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1084 on 986 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4604
## F-statistic: 66.56 on 13 and 986 DF, p-value: < 2.2e-16
As we add higher and higher polynomial degrees to the predictors, the model starts overfitting. Let's deviate a bit, using a slightly altered dataset, to illustrate this and the bias-variance tradeoff.
# Altering the data a bit: drop high-MRP / low-sales rows to sharpen the pattern
df_changed = df[!(df$Item_MRP > 150 & df$Item_Outlet_Sales < 2000), ]
# Fit polynomial models of increasing degree in Item_MRP
fit_1 = lm(Item_Outlet_Sales ~ 1, data = df_changed)
fit_2 = lm(Item_Outlet_Sales ~ Item_MRP, data = df_changed)
fit_3 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 5), data = df_changed)
fit_4 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 10), data = df_changed)
fit_5 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 15), data = df_changed)
fit_6 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 20), data = df_changed)
fit_7 = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree = 22), data = df_changed)
# Plot the data and overlay each fitted curve with a distinct color and line type
plot(Item_Outlet_Sales ~ Item_MRP, df_changed)
grid = seq(from = min(df_changed$Item_MRP), to = max(df_changed$Item_MRP), by = 1)
lines(grid, predict(fit_1, newdata = data.frame(Item_MRP = grid)), col = "red", lwd = 2, lty = 2)
lines(grid, predict(fit_2, newdata = data.frame(Item_MRP = grid)), col = "blue", lwd = 2, lty = 3)
lines(grid, predict(fit_3, newdata = data.frame(Item_MRP = grid)), col = "green", lwd = 2, lty = 4)
lines(grid, predict(fit_4, newdata = data.frame(Item_MRP = grid)), col = "orange", lwd = 2, lty = 5)
lines(grid, predict(fit_5, newdata = data.frame(Item_MRP = grid)), col = "purple", lwd = 2, lty = 6)
lines(grid, predict(fit_6, newdata = data.frame(Item_MRP = grid)), col = "brown", lwd = 2, lty = 1)
lines(grid, predict(fit_7, newdata = data.frame(Item_MRP = grid)), col = "black", lwd = 2, lty = 5)
#Fit 2
evaluation(df_changed,fit_2)
## [1] "MSE: 887622.87"
## [1] "RMSE: 942.14"
#Fit 4
evaluation(df_changed,fit_4)
## [1] "MSE: 858701.66"
## [1] "RMSE: 926.66"
#Fit 6
evaluation(df_changed,fit_6)
## [1] "MSE: 849811.34"
## [1] "RMSE: 921.85"
#Fit 7
evaluation(df_changed,fit_7)
## [1] "MSE: 842595.97"
## [1] "RMSE: 917.93"
From the above plot, as the polynomial degree increases, the model increasingly overfits the data.
As we add more and more parameters to our model, its complexity increases, which results in increasing variance and decreasing bias, i.e., overfitting. So we need to find the point where the reduction in bias is balanced by the increase in variance, which is where the total test error is lowest. In practice, there is no analytical way to find this point, but we can estimate it empirically, as sketched below.
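As a minimal sketch (the degree grid is illustrative and reuses the evaluation() helper defined earlier), we can compare training and test error as the polynomial degree of Item_MRP grows; the gap between the two is a practical signal of where overfitting begins:
# Compare training vs. test error as model complexity (polynomial degree) grows
for (degree in c(1, 3, 6, 10)) {
  fit = lm(Item_Outlet_Sales ~ poly(Item_MRP, degree), data = df)
  print(paste0("Degree ", degree, " -- training error:"))
  evaluation(df, fit)
  print(paste0("Degree ", degree, " -- test error:"))
  evaluation(df_test, fit)
}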
To overcome underfitting, or high bias, we can add new parameters to our model so that its complexity increases, thus reducing the bias.
To overcome overfitting, or high variance, we use regularization. There are two types of regularization: L1 (Lasso) and L2 (Ridge).
Before that, let's convert the categorical variables to dummy variables.
library(dummies)   # for dummy.data.frame()
library(glmnet)    # for the ridge/lasso models fitted below
# One-hot encode the categorical columns; numeric columns pass through unchanged
df_dummies = dummy.data.frame(df[, c("Item_Weight",
                                     "Item_Fat_Content", "Item_Visibility",
                                     "Item_Type", "Item_MRP",
                                     "Outlet_Identifier", "Outlet_Establishment_Year",
                                     "Outlet_Size", "Outlet_Location_Type",
                                     "Outlet_Type", "Item_Outlet_Sales")])
#head(df_dummies)
overall_fit = lm(Item_Outlet_Sales ~ ., data = df_dummies)
summary(overall_fit)
##
## Call:
## lm(formula = Item_Outlet_Sales ~ ., data = df_dummies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3580.2 -617.7 -5.1 519.0 5607.4
##
## Coefficients: (12 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 290.5415 326.4653 0.890 0.373707
## Item_Weight 3.2180 7.4200 0.434 0.664606
## Item_Fat_ContentLF -103.4499 217.4722 -0.476 0.634400
## `Item_Fat_Contentlow fat` 753.8568 417.2739 1.807 0.071130
## `Item_Fat_ContentLow Fat` 49.4246 82.5787 0.599 0.549635
## Item_Fat_Contentreg -114.4669 250.5972 -0.457 0.647934
## Item_Fat_ContentRegular NA NA NA NA
## Item_Visibility -1179.3281 801.7284 -1.471 0.141619
## `Item_TypeBaking Goods` -215.2064 302.9904 -0.710 0.477704
## Item_TypeBreads 4.3815 344.6660 0.013 0.989860
## Item_TypeBreakfast 405.7509 391.0738 1.038 0.299747
## Item_TypeCanned -184.3354 306.2266 -0.602 0.547343
## Item_TypeDairy -169.3139 304.3043 -0.556 0.578068
## `Item_TypeFrozen Foods` -269.6639 302.0346 -0.893 0.372172
## `Item_TypeFruits and Vegetables` -142.5656 294.3011 -0.484 0.628196
## `Item_TypeHard Drinks` 187.1223 346.4534 0.540 0.589246
## `Item_TypeHealth and Hygiene` -78.1734 318.2942 -0.246 0.806043
## Item_TypeHousehold -125.2685 299.6131 -0.418 0.675966
## Item_TypeMeat -211.0127 330.1058 -0.639 0.522825
## Item_TypeOthers -615.2010 391.5181 -1.571 0.116433
## Item_TypeSeafood 1193.6593 560.7252 2.129 0.033523
## `Item_TypeSnack Foods` -185.1607 292.9980 -0.632 0.527567
## `Item_TypeSoft Drinks` -290.5898 322.2989 -0.902 0.367484
## `Item_TypeStarchy Foods` NA NA NA NA
## Item_MRP 15.9359 0.5599 28.460 < 2e-16
## Outlet_IdentifierOUT013 -180.0761 108.5173 -1.659 0.097353
## Outlet_IdentifierOUT018 -361.7711 107.6836 -3.360 0.000811
## Outlet_IdentifierOUT035 -83.4523 106.6997 -0.782 0.434333
## Outlet_IdentifierOUT046 -224.2009 106.3994 -2.107 0.035359
## Outlet_IdentifierOUT049 NA NA NA NA
## Outlet_Establishment_Year NA NA NA NA
## Outlet_SizeHigh NA NA NA NA
## Outlet_SizeMedium NA NA NA NA
## Outlet_SizeSmall NA NA NA NA
## `Outlet_Location_TypeTier 1` NA NA NA NA
## `Outlet_Location_TypeTier 2` NA NA NA NA
## `Outlet_Location_TypeTier 3` NA NA NA NA
## `Outlet_TypeSupermarket Type1` NA NA NA NA
## `Outlet_TypeSupermarket Type2` NA NA NA NA
##
## (Intercept)
## Item_Weight
## Item_Fat_ContentLF
## `Item_Fat_Contentlow fat` .
## `Item_Fat_ContentLow Fat`
## Item_Fat_Contentreg
## Item_Fat_ContentRegular
## Item_Visibility
## `Item_TypeBaking Goods`
## Item_TypeBreads
## Item_TypeBreakfast
## Item_TypeCanned
## Item_TypeDairy
## `Item_TypeFrozen Foods`
## `Item_TypeFruits and Vegetables`
## `Item_TypeHard Drinks`
## `Item_TypeHealth and Hygiene`
## Item_TypeHousehold
## Item_TypeMeat
## Item_TypeOthers
## Item_TypeSeafood *
## `Item_TypeSnack Foods`
## `Item_TypeSoft Drinks`
## `Item_TypeStarchy Foods`
## Item_MRP ***
## Outlet_IdentifierOUT013 .
## Outlet_IdentifierOUT018 ***
## Outlet_IdentifierOUT035
## Outlet_IdentifierOUT046 *
## Outlet_IdentifierOUT049
## Outlet_Establishment_Year
## Outlet_SizeHigh
## Outlet_SizeMedium
## Outlet_SizeSmall
## `Outlet_Location_TypeTier 1`
## `Outlet_Location_TypeTier 2`
## `Outlet_Location_TypeTier 3`
## `Outlet_TypeSupermarket Type1`
## `Outlet_TypeSupermarket Type2`
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1079 on 973 degrees of freedom
## Multiple R-squared: 0.4786, Adjusted R-squared: 0.4646
## F-statistic: 34.35 on 26 and 973 DF, p-value: < 2.2e-16
# Plot the coefficients of the unregularized model
barplot(coefficients(overall_fit))
Ridge regression adds a penalty equivalent to the square of the magnitude of the coefficients:
\(\min_{\theta}\big(\|Y - X\theta\|^2 + \lambda\|\theta\|_2^2\big)\)
The plot below, for a lambda value of 10, shows that ridge clearly shrinks the coefficient weights; however, it does not bring them down to exactly zero. The ridge coefficients have the closed-form solution:
\(\hat{\theta} = (X^TX + \lambda I)^{-1}X^Ty\)
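As a minimal sketch of this closed form (illustration only: the predictors are not standardized here, and glmnet standardizes them and scales the penalty differently, so these numbers will not exactly match glmnet's coefficients; X, y, and theta_ridge are just local names):
# Closed-form ridge coefficients: (X'X + lambda*I)^(-1) X'y  -- illustration only
X = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales)))
y = df_dummies$Item_Outlet_Sales
lambda = 10
theta_ridge = solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
head(theta_ridge)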
# Ridge regression: alpha = 0 selects the L2 penalty in glmnet
ridge_model = glmnet(x = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales))),
                     y = df_dummies$Item_Outlet_Sales, alpha = 0, lambda = 10)
barplot(ridge_model$beta[, 1])
Lasso regression adds a penalty equivalent to the absolute value of the magnitude of the coefficients:
\(\min_{\theta}\big(\|Y - X\theta\|^2 + \lambda\|\theta\|_1\big)\)
In the plot below, lasso clearly brings many of the coefficients down to exactly zero.
# Lasso regression: alpha = 1 selects the L1 penalty in glmnet
lasso_model = glmnet(x = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales))),
                     y = df_dummies$Item_Outlet_Sales, alpha = 1, lambda = 10)
barplot(lasso_model$beta[, 1])
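A quick way to see the contrast with ridge, using the ridge_model and lasso_model objects fitted above, is to count how many coefficients each model sets exactly to zero:
# Ridge only shrinks coefficients; lasso pushes many of them to exactly zero
sum(ridge_model$beta[, 1] == 0)
sum(lasso_model$beta[, 1] == 0)
# Predictors that lasso keeps (non-zero coefficients)
rownames(lasso_model$beta)[lasso_model$beta[, 1] != 0]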
L1 Regularization (Lasso): penalizes the absolute value of the coefficients and can shrink some of them exactly to zero, effectively performing feature selection.
L2 Regularization (Ridge): penalizes the square of the coefficients and shrinks them toward zero, but not exactly to zero.
Both are used to avoid overfitting.
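One final note: the lambda value of 10 used above was fixed arbitrarily for illustration. In practice lambda is usually chosen by cross-validation; a minimal sketch with cv.glmnet (x_mat, y_vec, and cv_lasso are illustrative names) looks like this:
# Choose lambda by cross-validation (cv.glmnet defaults to 10 folds)
x_mat = as.matrix(subset(df_dummies, select = -c(Item_Outlet_Sales)))
y_vec = df_dummies$Item_Outlet_Sales
set.seed(1)
cv_lasso = cv.glmnet(x = x_mat, y = y_vec, alpha = 1)
cv_lasso$lambda.min   # lambda with the lowest cross-validated error
cv_lasso$lambda.1se   # largest lambda within one standard error of the minimum
plot(cv_lasso)        # cross-validated error across the lambda path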