In my last post, we walked through the construction of a two-layer neural network and used it to classify the MNIST dataset. Today, I will show how to build a logistic regression model from scratch (spoiler: it’s much simpler than a neural network). Linear and logistic regression are probably the simplest yet most useful models in many fields. They are fast to implement and easy to interpret.
In this post, I will first go over the basics of logistic regression, then construct and train the model on my NBA shot log dataset, and try to interpret the model through statistical inference. Finally, I will compare it with the built-in glm() function.
Logistic regression is a generalized linear model with a binomial distribution and a logit link function. The outcome \(Y\) is either 1 or 0. What we are interested in is the expected value of \(Y\), \(E(Y)\), which in this case can also be thought of as the probability of getting 1, \(p\). However, because \(p\) is bounded between 0 and 1, it is hard to apply the same approach we used for linear regression directly. So instead of predicting \(p\), we predict the log of the odds (the logit), which takes values from \(-\infty\) to \(\infty\). Now the function is \(\log(\frac{p}{1-p})=\theta_0 + \theta_1x_1 + \theta_2x_2 + ...\); let’s denote the RHS as \(x\cdot\theta\). The task is then to find \(\theta\).
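To make the mapping concrete, here is a quick sketch (the logit() helper below is just for illustration and is not used later):
#logit: maps a probability in (0, 1) to a log odds in (-Inf, Inf)
logit <- function(p){log(p/(1-p))}
logit(c(0.1, 0.5, 0.9))
## [1] -2.197225  0.000000  2.197225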
In logistic regression, the cost function is defined as \(J=-\frac{1}{m}\sum_{i=1}^m(y^{(i)}\log(h(x^{(i)}))+(1-y^{(i)})\log(1-h(x^{(i)})))\), where \(h(x)=\frac{1}{1+e^{-x\cdot\theta}}\) is the sigmoid function, the inverse of the logit function. We can use gradient descent to find the optimal \(\theta\) that minimizes \(J\). That is basically the whole process of constructing the model, and it is simpler than you might think once you start coding.
Now let’s build our logistic regression. First I will define some useful functions. Note that %*% is the matrix multiplication (dot product) operator in R.
library(ggplot2)
library(dplyr)
#sigmoid function, inverse of logit
sigmoid <- function(z){1/(1+exp(-z))}
#cost function
cost <- function(theta, X, y){
  m <- length(y) # number of training examples
  h <- sigmoid(X%*%theta)
  J <- (t(-y)%*%log(h)-t(1-y)%*%log(1-h))/m
  J
}
#gradient function
grad <- function(theta, X, y){
  m <- length(y)
  h <- sigmoid(X%*%theta)
  grad <- (t(X)%*%(h - y))/m
  grad
}
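As an optional sanity check (purely illustrative, on made-up toy data; X.toy, y.toy, and theta.toy are just names I picked), the analytic gradient can be compared against a finite-difference approximation of the cost:
#toy data just for checking the gradient, not the NBA data
set.seed(1)
X.toy <- cbind(1, matrix(rnorm(20), ncol = 2))
y.toy <- matrix(rbinom(10, 1, 0.5))
theta.toy <- matrix(0, nrow = 3)
eps <- 1e-6
numGrad <- sapply(1:3, function(j){
  e <- matrix(0, nrow = 3); e[j] <- eps
  (cost(theta.toy + e, X.toy, y.toy) - cost(theta.toy - e, X.toy, y.toy))/(2*eps)
})
#the difference should be essentially zero
max(abs(numGrad - grad(theta.toy, X.toy, y.toy)))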
Here comes the logistic regression function, which takes the training data frame X and the label y as inputs. It returns a column vector storing the coefficients theta. One thing to pay attention to is that the input X usually does not include a bias term (a leading column of 1s), so I add this column inside the function.
logisticReg <- function(X, y){
  #remove NA rows
  temp <- na.omit(cbind(y, X))
  #add bias term and convert to matrix
  X <- mutate(temp[, -1], bias = 1)
  X <- as.matrix(X[, c(ncol(X), 1:(ncol(X)-1))])
  y <- as.matrix(temp[, 1])
  #initialize theta as a column vector of zeros
  theta <- matrix(rep(0, ncol(X)), nrow = ncol(X))
  #use optim() to find the theta that minimizes the cost function
  costOpti <- optim(theta, cost, grad, X = X, y = y)
  #return coefficients
  return(costOpti$par)
}
Now let’s train our model on the NBA shot log dataset. Specifically, I am interested in how the shot clock, shot distance, and defender distance affect shooting performance. Naively, we would think that more time remaining on the shot clock, a shorter distance to the basket, and a farther defender will all increase the probability of making a field goal. Shortly, we will see whether we can confirm that statistically.
#load the dataset
shot <- read.csv('2014-2015shot.csv', header = T, stringsAsFactors = F)
shot.df <- select(shot, FGM, SHOT_CLOCK, SHOT_DIST, CLOSE_DEF_DIST)
head(shot.df)
## FGM SHOT_CLOCK SHOT_DIST CLOSE_DEF_DIST
## 1 1 10.6 6.5 2.1
## 2 0 6.2 4.0 1.9
## 3 1 5.2 2.8 2.2
## 4 1 16.0 8.4 4.4
## 5 1 1.4 4.5 4.4
## 6 0 14.1 1.3 1.9
shot.X <- shot.df[, -1]
shot.y <- shot.df[, 1]
mod <- logisticReg(shot.X, shot.y)
mod
## [,1]
## [1,] -0.06444830
## [2,] 0.01886594
## [3,] -0.05952669
## [4,] 0.10610004
How do we interpret the model?
The first number is the intercept: the log odds of a made FG when all other predictors are 0. Note that a log odds of 0 corresponds to a probability of 0.5, so the negative intercept means the probability is below 50%.
The next three numbers are the coefficients for SHOT_CLOCK, SHOT_DIST, and CLOSE_DEF_DIST. Each coefficient is the change in log odds for a one-unit increase in that predictor, holding the other predictors constant.
For example, look at the last number: holding everything else the same, if the defender moves 1 ft farther away, the log odds of making that shot increase by 0.106. If the original FG% is 50% (log odds 0), the new FG% with the defender 1 ft farther will be about 52.6%.
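We can double-check that arithmetic with the sigmoid() function defined earlier:
sigmoid(0 + 0.106) #log odds 0 (i.e. 50%) plus 0.106 gives ~0.526, about 52.6%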
Now, looking at the signs of the coefficients, we can conclude that an increase in SHOT_CLOCK or CLOSE_DEF_DIST and a decrease in SHOT_DIST all have a positive impact on FG%.
Next, let’s compare our self-built model with the glm() function.
mod1 <- glm(as.factor(FGM) ~ SHOT_CLOCK + SHOT_DIST + CLOSE_DEF_DIST,
            family = binomial(link = "logit"), data = shot.df)
summary(mod1)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.06663629 0.0144138120 -4.623086 3.780737e-06
## SHOT_CLOCK 0.01889689 0.0008333007 22.677151 7.530998e-114
## SHOT_DIST -0.05942813 0.0006771956 -87.756216 0.000000e+00
## CLOSE_DEF_DIST 0.10605534 0.0022038402 48.122972 0.000000e+00
We did a pretty good job: the coefficients are nearly identical down to the third decimal place.
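For a quick side-by-side view (just a sketch; the column names here are labels I chose):
cbind(scratch = as.vector(mod), glm = coef(mod1))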
Finally, I will write a prediction function that outputs the predicted probability of getting 1 from the logistic regression.
logisticPred <- function(mod, X){
  X <- na.omit(X)
  #add bias term and convert to matrix
  X <- mutate(X, bias = 1)
  X <- as.matrix(X[, c(ncol(X), 1:(ncol(X)-1))])
  return(sigmoid(X%*%mod))
}
Generate a new data grid to see how FG% changes with predictors.
newdata <- expand.grid(SHOT_CLOCK = 10,
                       SHOT_DIST = seq(2.5, 37.5, by = 5),
                       CLOSE_DEF_DIST = seq(1, 7, by = 2))
FG <- logisticPred(mod, newdata)
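As an optional cross-check (a sketch), the glm fit should give nearly the same probabilities on the same grid:
#difference should be small, since the two sets of coefficients are nearly identical
FG.glm <- predict(mod1, newdata, type = "response")
max(abs(FG - FG.glm))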
Shot clock seems to have the least impact, so I will exclude it from this plot.
shot.pred <- mutate(newdata, FG = FG)
ggplot(shot.pred, aes(x = factor(SHOT_DIST), y = FG, fill = factor(CLOSE_DEF_DIST))) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.8) +
  ylab('FG%') + xlab('Shot Distance (ft)') + ylim(0, 0.8) +
  scale_x_discrete(labels = c('0-5', '5-10', '10-15', '15-20', '20-25', '25-30', '30-35', '35+')) +
  theme_bw(base_size = 12) +
  scale_fill_discrete(name = "Defender\nDistance (ft)",
                      labels = c('0-2', '2-4', '4-6', '6+')) +
  theme(legend.position = c(0.65, 0.8))
Indeed, wide-open shots in the paint have the highest probability and contested long 3s have the lowest. This plot conveys very similar information to the one in my Shiny app. However, the regression smooths things out (regression to the mean?) and loses some important features. For example, the predictions for extreme cases (shot distance < 5 ft or > 35 ft) are all less drastic than reality. One remedy is to add higher-order terms to the regression.
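For instance, a quadratic term for shot distance could be added with glm() (just a sketch; I have not tuned or evaluated this model):
#possible extension: allow a curved relationship between FG% and shot distance
mod2 <- glm(FGM ~ SHOT_CLOCK + SHOT_DIST + I(SHOT_DIST^2) + CLOSE_DEF_DIST,
            family = binomial(link = "logit"), data = shot.df)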
There you have it: it is not that hard to build a regression model from scratch (as we also demonstrated with the neural network). If you have followed this post, hopefully by now you have a better understanding of logistic regression. One last note: although logistic regression is often described as a classifier, it can also be used for regression, i.e., to estimate the probability itself.