Introduction

Regression is one of the most powerful algorithms for modeling data, and it is among the most interpretable methods compared to other algorithms. It can take different forms depending on the type of dataset.

In theory we often use OLS to find the coefficients. But in real life, this becomes computationally expensive when we have a very large dataset with many predictors. In this blog, we are going to look at the gradient descent algorithm for linear and logistic regression using MLE.

Dataset

The dataset we are going to use is the Big Mart Sales dataset [https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/]. It has several columns; below is a snapshot of the dataset.

For simplicity, we will drop the NAs and sample 1000 rows.

# Read the data, treating blanks and the string 'NA' as missing values
df_original = read.csv('./data/Train_UWu5bXk.csv', skipNul = TRUE,
                       na.strings = c(' ', '', 'NA', NA))

# Drop rows with missing values, then sample 1000 rows
df_original = na.omit(df_original)
df = df_original[sample(nrow(df_original), 1000), ]

head(df)
##      Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 3177           FDL46      20.350          Low Fat      0.05404671
## 380            FDY28       7.470          Regular      0.00000000
## 5574           NCE31       7.670          Low Fat      0.18559655
## 4820           FDX21       7.050          Low Fat      0.08496602
## 92             DRG27       8.895          Low Fat      0.10527411
## 6570           FDM20      10.000          Low Fat      0.03865361
##                  Item_Type Item_MRP Outlet_Identifier
## 3177           Snack Foods 119.5466            OUT035
## 380           Frozen Foods 214.3218            OUT035
## 5574             Household  35.7216            OUT018
## 4820           Snack Foods 111.1912            OUT046
## 92                   Dairy  39.9138            OUT049
## 6570 Fruits and Vegetables 245.4144            OUT013
##      Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 3177                      2004       Small               Tier 2
## 380                       2004       Small               Tier 2
## 5574                      2009      Medium               Tier 3
## 4820                      1997       Small               Tier 1
## 92                        1999      Medium               Tier 1
## 6570                      1987        High               Tier 3
##            Outlet_Type Item_Outlet_Sales
## 3177 Supermarket Type1          353.5398
## 380  Supermarket Type1         4274.4360
## 5574 Supermarket Type2           69.2432
## 4820 Supermarket Type1         1965.4416
## 92   Supermarket Type1          690.4346
## 6570 Supermarket Type1         7105.4176

There are a total of 11 predictors and one dependent variable, sales, which we need to predict. The data also contains categorical variables. For the sake of simplicity, we will omit the categorical variables and use one quantitative predictor together with the dependent variable.

Simple models

Let's start with a simple model using a quantitative predictor (Item_MRP).

Model 1 - Null Model

Below is the null model: we use no predictors at all and simply predict the average of the sales for every observation.

First, define a function for the mean squared error (MSE):

mse = function(y, y_hat){
  y = c(y)
  y_hat = c(y_hat)
  return(mean((y - y_hat)^2))
}
mse(df$Item_Outlet_Sales,mean(df$Item_Outlet_Sales))
## [1] 2500816

That is a very large error if we use only the null model, i.e. the plain average.

Model 2 - Simple Linear Regression

Let's try a simple linear regression with Item_MRP as the predictor of Item_Outlet_Sales.

simple_model = lm(Item_Outlet_Sales ~  Item_MRP,df)

summary(simple_model)
## 
## Call:
## lm(formula = Item_Outlet_Sales ~ Item_MRP, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3905.2  -620.4   -45.6   569.3  4569.3 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -133.2649    84.9224  -1.569    0.117    
## Item_MRP      17.3568     0.5556  31.242   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1126 on 998 degrees of freedom
## Multiple R-squared:  0.4944, Adjusted R-squared:  0.4939 
## F-statistic:   976 on 1 and 998 DF,  p-value: < 2.2e-16

# For readability, plot only the first 100 points
plot(Item_Outlet_Sales ~ Item_MRP, df[seq(100),])

abline(simple_model)


This is our model using R's built-in functions. Now we will use other methods to calculate the coefficients (beta_hat).

Model 3 - Manual Slope and Intercept

The coefficients can be calculated from simple summary statistics.

\[\text{slope}(m) = \mathrm{cor}(x,y) \cdot \frac{\mathrm{sd}(y)}{\mathrm{sd}(x)} \\ \text{intercept}(b) = \bar{y} - m\,\bar{x}\]

x = as.matrix(df$Item_MRP)
y = as.matrix(df$Item_Outlet_Sales)

slope <- cor(x, y) * (sd(y) / sd(x))
intercept <- mean(y) - (slope * mean(x))

print(paste0("Slope: ",round(slope,2)))
## [1] "Slope: 17.36"
print(paste0("Intercept: ",round(intercept,2)))
## [1] "Intercept: -133.26"

This method is easy when we have one or two predictor variables, but it becomes unwieldy when there are many predictors.

Model 4 - Ordinary Least Squares (OLS)

This is one of the best-known methods for calculating the parameters of a linear regression. Mathematically, it minimizes the sum of squared errors.

\(\hat{\beta} = (X'X)^{-1} \, X'Y\)

x = as.matrix(cbind(1, df$Item_MRP))
y = as.matrix(df$Item_Outlet_Sales)

## Choose beta-hat to minimize the sum of squared residuals,
## resulting in a matrix of estimated coefficients:
bh = round(solve(t(x) %*% x) %*% t(x) %*% y, digits = 2)

## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Beta"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat
##      Coeff.     Est
## 1 Intercept -133.26
## 2      Beta   17.36

This is the most commonly used method when we have a small number of predictors. With many predictors and a large number of data points, however, it becomes too expensive. That is when we turn to cost optimization using the gradient descent algorithm.

Model 5 - Gradient Descent using optim

This is one of the most widely used methods in machine learning problems. It scales well to many predictors and many observations, and it uses the gradient of the cost function to find the coefficients. Below is the mathematical form of gradient descent. The cost function is

\(J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\)

The update below is repeated until the parameters converge:

\(\theta_j := \theta_j - \alpha \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}\)

In vectorized form, the cost and the update become

\(J(\theta) = \frac{1}{2m}(X\theta - Y)'(X\theta - Y), \qquad \theta := \theta - \frac{\alpha}{m}\, X'(X\theta - Y)\)
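
Before handing the problem to optim, it may help to see the update rule itself in code. Below is a minimal hand-rolled gradient-descent sketch; the learning rate alpha, the number of iterations, and the standardisation of Item_MRP are illustrative choices (standardising the predictor lets a simple fixed learning rate converge), not part of the original analysis.

# Minimal gradient-descent sketch (alpha and n_iter are illustrative, not tuned)
grad_descent = function(X, y, alpha = 0.1, n_iter = 1000) {
  m = nrow(X)
  theta = rep(0, ncol(X))                       # start from beta = 0
  for (i in seq_len(n_iter)) {
    gradient = t(X) %*% (X %*% theta - y) / m   # (1/m) * X'(X theta - y)
    theta = theta - alpha * gradient            # take a step against the gradient
  }
  theta
}

x_scaled = scale(df$Item_MRP)                   # standardise the predictor
X = cbind(1, x_scaled)
y = as.matrix(df$Item_Outlet_Sales)

theta_hat = grad_descent(X, y)

# Convert back to the original Item_MRP scale to compare with lm()
slope_gd = theta_hat[2] / attr(x_scaled, "scaled:scale")
intercept_gd = theta_hat[1] - slope_gd * attr(x_scaled, "scaled:center")
c(intercept_gd, slope_gd)

Once rescaled, the intercept and slope recovered this way should be close to the lm() estimates.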

Now we create a function that returns the residual sum of squares and minimize it with R's optim function. Beta is initialized to 0.

fnRSS = function(vBeta, vY, mX) {
  return(sum((vY - mX %*% vBeta)^2))
}

mX = cbind(1, df$Item_MRP)            # design matrix with an intercept column
vY = as.matrix(df$Item_Outlet_Sales)

vBeta0 = rep(0, ncol(mX))             # initial beta of zeros

optimLinReg = optim(vBeta0, fnRSS,
                    mX = mX, vY = vY, method = 'BFGS', 
                    hessian=TRUE)

print(optimLinReg$par)
## [1] -133.26493   17.35676

So we have tried several different methods to find the coefficients of a linear regression model.
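
As a quick sanity check (assuming the objects created above, simple_model, slope, intercept, bh and optimLinReg, are still in the workspace), we can put the estimates side by side:

# Put the intercept and slope estimates from each approach side by side
data.frame(
  method = c("lm()", "manual", "OLS (normal equations)", "optim (BFGS)"),
  intercept = as.numeric(c(coef(simple_model)[1], intercept, bh[1], optimLinReg$par[1])),
  slope = as.numeric(c(coef(simple_model)[2], slope, bh[2], optimLinReg$par[2]))
)

All four approaches agree on essentially the same estimates (an intercept of about -133.26 and a slope of about 17.36), as expected, since they all minimize the same least-squares criterion.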