Regression is one of the most powerful algorithms for modelling data, and one of the most interpretable compared to other methods. It can take different forms depending on the type of dataset.
In theory we often use OLS to find the coefficients, but in real life this becomes complex when the dataset is very large and has many predictors. In this blog, we are going to look at the gradient descent algorithm for linear and logistic regression using MLE.
The dataset we are going to use is the Big Mart sales dataset [https://datahack.analyticsvidhya.com/contest/practice-problem-big-mart-sales-iii/]. It has a number of columns; below is a snapshot of the data.
For simplicity, we will drop the NA's and sample 1000 rows.
# Read the training data, treating blanks and 'NA' strings as missing
df_original = read.csv('./data/Train_UWu5bXk.csv', skipNul = TRUE, na.strings = c(' ','','NA',NA))
df_original = na.omit(df_original)                    # drop rows with missing values
df = df_original[sample(nrow(df_original), 1000),]    # sample 1000 rows
head(df)
## Item_Identifier Item_Weight Item_Fat_Content Item_Visibility
## 3177 FDL46 20.350 Low Fat 0.05404671
## 380 FDY28 7.470 Regular 0.00000000
## 5574 NCE31 7.670 Low Fat 0.18559655
## 4820 FDX21 7.050 Low Fat 0.08496602
## 92 DRG27 8.895 Low Fat 0.10527411
## 6570 FDM20 10.000 Low Fat 0.03865361
## Item_Type Item_MRP Outlet_Identifier
## 3177 Snack Foods 119.5466 OUT035
## 380 Frozen Foods 214.3218 OUT035
## 5574 Household 35.7216 OUT018
## 4820 Snack Foods 111.1912 OUT046
## 92 Dairy 39.9138 OUT049
## 6570 Fruits and Vegetables 245.4144 OUT013
## Outlet_Establishment_Year Outlet_Size Outlet_Location_Type
## 3177 2004 Small Tier 2
## 380 2004 Small Tier 2
## 5574 2009 Medium Tier 3
## 4820 1997 Small Tier 1
## 92 1999 Medium Tier 1
## 6570 1987 High Tier 3
## Outlet_Type Item_Outlet_Sales
## 3177 Supermarket Type1 353.5398
## 380 Supermarket Type1 4274.4360
## 5574 Supermarket Type2 69.2432
## 4820 Supermarket Type1 1965.4416
## 92 Supermarket Type1 690.4346
## 6570 Supermarket Type1 7105.4176
There are a total of 11 predictors and one dependent variable, the sales (Item_Outlet_Sales), which we need to predict. The data also contains categorical variables. For the sake of simplicity, we will omit the categorical variables and use a single quantitative predictor together with the dependent variable.
Let's start with a simple model using the quantitative predictor Item_MRP.
First, the null model: if we do not use any model at all, the best we can do is predict the average of the sales for every observation.
Defining a function for the MSE:
mse = function(y, y_hat){
  # Mean squared error between observed and predicted values
  y = c(y)
  y_hat = c(y_hat)
  return(mean((y - y_hat)^2))
}
mse(df$Item_Outlet_Sales,mean(df$Item_Outlet_Sales))
## [1] 2500816
That is a lot of error (an MSE of about 2.5 million, i.e. an RMSE of roughly 1581) if we just use the null model and predict the average. Let's try a simple linear regression with Item_MRP as a predictor of Item_Outlet_Sales.
simple_model = lm(Item_Outlet_Sales ~ Item_MRP,df)
summary(simple_model)
##
## Call:
## lm(formula = Item_Outlet_Sales ~ Item_MRP, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3905.2 -620.4 -45.6 569.3 4569.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -133.2649 84.9224 -1.569 0.117
## Item_MRP 17.3568 0.5556 31.242 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1126 on 998 degrees of freedom
## Multiple R-squared: 0.4944, Adjusted R-squared: 0.4939
## F-statistic: 976 on 1 and 998 DF, p-value: < 2.2e-16
#plot(simple_model$coefficients*df$Item_MRP,df$Item_Outlet_Sales)
# For simplicity, plot only 100 points
plot(Item_Outlet_Sales ~ Item_MRP, df[seq(100),])
abline(simple_model)
This is our model using R's built-in functions. Now we will use other methods to calculate the coefficients (beta_hat).
First, the coefficients can be calculated from simple summary statistics:
\[\text{slope}(m) = \mathrm{cor}(x,y) \cdot \frac{\mathrm{sd}(y)}{\mathrm{sd}(x)} \\ \text{intercept}(b) = \bar{y} - m \cdot \bar{x}\]
x = as.matrix(df$Item_MRP)
y = as.matrix(df$Item_Outlet_Sales)
slope <- cor(x, y) * (sd(y) / sd(x))
intercept <- mean(y) - (slope * mean(x))
print(paste0("Slope: ",round(slope,2)))
## [1] "Slope: 17.36"
print(paste0("Intercept: ",round(intercept,2)))
## [1] "Intercept: -133.26"
This method is easy when we have one or two predictor variables, but it gets hard when there are many predictors.
The normal equation is one of the best-known methods for calculating the parameters of a linear regression. Mathematically, it minimizes the sum of squared errors:
\(\hat{\beta} = (X'X)^{-1} \, X'Y\)
x = as.matrix(cbind (1, df$Item_MRP))
y = as.matrix(df$Item_Outlet_Sales)
## Choose beta-hat to minimize the sum of squared residuals
## resulting in matrix of estimated coefficients:
bh = round(solve(t(x)%*%x)%*%t(x)%*%y, digits=2)
## Label and organize results into a data frame
beta.hat = as.data.frame(cbind(c("Intercept","Beta"),bh))
names(beta.hat) = c("Coeff.","Est")
beta.hat
## Coeff. Est
## 1 Intercept -133.26
## 2 Beta 17.36
The normal equation is the most commonly used method when we have a small number of predictors. With many predictors and a large number of data points, however, it becomes too expensive; that is when we look to cost optimization using the gradient descent algorithm.
Gradient descent is one of the most widely used methods in machine learning. It handles many predictors and many observations well, and it simply follows the gradient of the cost function to find the coefficients. Below is the mathematical form of gradient descent. The cost function is
\(J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2\)
and the following update is repeated until convergence:
\(\theta_j := \theta_j - \alpha \, \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}\)
In vector form, the cost and its gradient are
\(J(\theta) = \frac{1}{2m}(X\theta - y)'(X\theta - y), \qquad \nabla_\theta J(\theta) = \frac{1}{m}\,X'(X\theta - y)\)
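To make the update rule concrete, here is a minimal hand-rolled batch gradient descent sketch; the function name gradient_descent, the learning rate alpha and the iteration count are illustrative assumptions, not part of the original analysis. The predictor is standardized so that a single learning rate converges, and the coefficients are converted back to the original Item_MRP scale afterwards.
# Minimal batch gradient descent sketch (illustrative names and settings)
gradient_descent = function(X, y, alpha = 0.1, iters = 5000){
  theta = rep(0, ncol(X))                      # start from beta = 0
  m = nrow(X)
  for(i in seq_len(iters)){
    grad = t(X) %*% (X %*% theta - y) / m      # (1/m) * X'(X theta - y)
    theta = theta - alpha * grad               # step against the gradient
  }
  theta
}
x_scaled = scale(df$Item_MRP)                  # standardize so one learning rate works
theta = gradient_descent(cbind(1, x_scaled), df$Item_Outlet_Sales)
# Convert the estimates back to the original Item_MRP scale
slope_gd = theta[2] / attr(x_scaled, "scaled:scale")
intercept_gd = theta[1] - slope_gd * attr(x_scaled, "scaled:center")
c(intercept = intercept_gd, slope = slope_gd)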
In practice we can also hand the same least-squares objective to R's built-in optim function and let it do the minimization. Here we create a function for the residual sum of squares and minimize it with optim; beta is initialized at 0.
fnRSS = function(vBeta, vY, mX) {
  # Residual sum of squares for coefficient vector vBeta
  return(sum((vY - mX %*% vBeta)^2))
}
mX = cbind(1, df$Item_MRP)                  # design matrix with intercept column
vY = as.matrix(df$Item_Outlet_Sales)
vBeta0 = rep(0, ncol(mX))                   # start from beta = 0
optimLinReg = optim(vBeta0, fnRSS,
                    mX = mX, vY = vY, method = 'BFGS',
                    hessian = TRUE)
print(optimLinReg$par)
## [1] -133.26493 17.35676
So we have tried different methods to find the coefficients of a linear regression model: R's lm, summary statistics, the normal equation and numerical optimization, and they all agree on an intercept of about -133.26 and a slope of about 17.36.
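As a quick check (a sketch that assumes the objects created above are still in the workspace), the estimates can be printed side by side:
coef(simple_model)     # built-in lm fit
c(intercept, slope)    # summary-statistics formulas
t(bh)                  # normal equation
optimLinReg$par        # numerical optimization with optim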