library(ggplot2)
library("scatterplot3d")

Logistic Regression from the ground up

In this third blog we will discuss logistic regression. As in our other blogs, we will start with simple data and first try to build a model using a technique we have already covered, linear regression. While attempting this we will find the shortcomings of that approach and discover that what we have at hand is what is called a classification problem. With the issues of linear regression understood, we will turn to logistic regression and go into enough detail to see how to use it and how to interpret it. To cement these concepts we will cast a long line and finish with a multiple class classification example.

Linear Regression

Let’s suppose we have a dataset in which our dependent variable Y has only two possible values, 0 and 1, along with data that maps to these values, a predictor variable. We could have more than one predictor, but for starters, and to make data visualization simpler, let’s look at the case where we have only one; let’s call it X.

df<-data.frame(Y=c(0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1),X=c(1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19))

Using linear regression, which we have discussed in previous blogs, we could build a model to predict the values of Y.

linreg<-lm(Y~X,df)
plot(df$X,df$Y)
abline(linreg)

With this model we can now predict values for our dependent variable.

yhat<-predict(linreg,newdata = data.frame(X=df$X))
plot(df$X,yhat)

As can be seen, the predictions follow our model line, as expected. But this yhat and Y don’t exactly look alike. If we recognize this as a classification problem, where using X we want to determine whether the dependent variable is a 1 or a 0, we can add an extra step and introduce a threshold. If we set that cutoff at 0.5, our yhat takes one of two values, just as our real data Y does.

yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
plot(df$X,yhat)

Generating a confusion matrix, we can also see that all our Y and yhat values match.

table(yhat,df$Y)
##     
## yhat 0 1
##    0 9 0
##    1 0 9

But there are some issues with this approach. Let’s consider a slightly different dataset in which our 0 and 1 data isn’t as nicely segregated.

df<-data.frame(Y=c(0,0,0,0,1,1,0,0,0,1,0,1,1,1,1,1,1,1),X=c(1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19))
linreg<-lm(Y~X,df)
yhat<-predict(linreg,newdata = data.frame(X=df$X))
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
plot(df$X,yhat)

So, following the same approach, we set the cutoff at 0.5 and end up with a classification. But we are not classifying every point correctly, which we can see more clearly in the confusion matrix.

table(yhat,df$Y)
##     
## yhat 0 1
##    0 7 2
##    1 1 8

On top of that, what does the 0.5 threshold really mean? Or the original linear regression line before we applied this cutoff? Is there a way to tell how certain we are that a particular point corresponds to one classification or the other? Well, it turns out that if the predicted variable from our model represented the probability of being a 0 or a 1, then we would have an answer to these questions. If the model gave us the probability of each point being a 1, then we could set the threshold at the probability above which we want a point classified as a 1.

Is this possible with linear regression? Not really: the yhat from the linear regression is not a probability. This is easy to see in the first linear model, where yhat goes above 1 and below 0. In fact, if we try predicting for X=19 with our last model, we see the result is not a probability, as it is greater than 1.

predict(linreg,newdata = data.frame(X=c(19)))
##        1 
## 1.092398

Logistic Regression

In logistic regression we still use a linear combination of the predictors, but with a transformation. Here the “linear regression” maps to the log-odds of the outcome rather than to the dependent variable Y directly.

\[\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1\]
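
Solving this expression for p gives the familiar sigmoid (inverse logit) form, which is what guarantees the model’s output stays between 0 and 1:

\[p = \frac{e^{\beta_0 + \beta_1 x_1}}{1 + e^{\beta_0 + \beta_1 x_1}} = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1)}}\]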

This is a form of generalized linear model, so in R we use the glm function. Let’s do so for the dataset at hand.

logreg<-glm(Y~X,data=df,family=binomial(link="logit"))
yhat<-predict(logreg,newdata = data.frame(X=df$X),type="response")
plot(df$X,yhat)

Now our yhat shows the probability of any one point being a 1. For the X=19 case from before, we now see a probability of less than one.

predict(logreg,newdata = data.frame(X=c(19)),type="response")
##         1 
## 0.9745744
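
To connect this back to the formula above, a quick sketch: asking predict for the linear predictor itself with type="link" returns the log-odds, and base R’s plogis (the inverse logit) turns that back into the same probability that type="response" reports.

predict(logreg,newdata = data.frame(X=c(19)),type="link")          # the log-odds, beta0 + beta1*19
plogis(predict(logreg,newdata = data.frame(X=c(19)),type="link"))  # inverse logit of the log-odds, same as type="response"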

Now setting a threshold of 0.5 has a meaning: anything with a probability of more than 50% of being a 1 will be classified as a 1. We can also change the threshold depending on the application; if wrongly labelling a point as a 1 is costly, maybe a 75% threshold makes better sense. We can also do ROC and AUC analysis.
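
As a quick sketch of what that ROC and AUC analysis could look like, assuming the pROC package is installed (it is not part of base R), we feed it the observed classes and the predicted probabilities before any threshold is applied.

library(pROC)
roc_obj<-roc(df$Y,predict(logreg,type="response"))  # ROC curve from observed classes and fitted probabilities
plot(roc_obj)                                       # sensitivity vs specificity across all possible thresholds
auc(roc_obj)                                        # area under the ROC curve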

Multiple Logistic Regression

Now that we have presented the very basics of logistic regression and how it can step in for linear regression when we encounter a classification problem, let’s look at more interesting data with more than one predictor. Here, as before, we have the same transformation of the dependent variable, but now, instead of having just two coefficients, we add the new predictors with their corresponding coefficients.

\[\log \frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \dots\]

Segregated groups

Again, let’s look at a simple dataset.

df<-data.frame(Y=c(rep(0,10),rep(1,10)),
               X1=c(runif(10,0,5),runif(10,6,11)),
               X2=c(runif(10,0,5),runif(10,6,11)))
ggplot(df,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

From the scatter plot, one thing to note is that we can draw a straight line between our two classes. Intuitively we can deduce that there is a logistic regression which perfectly classifies these two groups. As before we can use the glm function in R, selecting a binomial family with a logit link; the formula Y~. is shorthand for regressing Y on all the other columns of the data frame. Here we will include the final step and use a threshold of 0.5 to assign yhat to either 0 or 1.

logreg<-glm(Y~.,data=df,family=binomial(link="logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat<-predict(logreg,type="response")
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat))) + geom_point()

The plot shows what looks like a perfect classification. We can double check this with a confusion table.

table(yhat,df$Y)
##     
## yhat  0  1
##    0 10  0
##    1  0 10
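
We can also make the “draw a line between the groups” statement concrete and pull that line straight out of the fitted coefficients: at the 0.5 threshold the log-odds are zero, so the boundary is where β0 + β1X1 + β2X2 = 0. A minimal sketch, assuming the coefficient names (Intercept), X1 and X2 that coef returns for this model:

b<-coef(logreg)  # (Intercept), X1, X2
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat))) + geom_point() +
  geom_abline(intercept=-b["(Intercept)"]/b["X2"],slope=-b["X1"]/b["X2"])  # the p=0.5 decision boundary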

Overlapping groups

Now, what happens if the two classification groups are not neatly separated, as in the dataset below?

df<-data.frame(Y=c(rep(0,10),rep(1,10)),
               X1=c(runif(10,0,8),runif(10,5,11)),
               X2=c(runif(10,0,8),runif(10,5,11)))
ggplot(df,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

Because we can no longer draw a straight line between the two classes, we would not expect our regression to perform perfectly. This does not mean the regression is not a valid model, but the exercise does illustrate how the quality of a model depends on the nature of the dataset.

logreg<-glm(Y~.,data=df,family=binomial(link="logit"))
yhat<-predict(logreg,type="response")
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat))) + geom_point()

It is actually easy to see how a straight line separates both groups on the scatter plot of the model’s predicted classes. Comparing this plot with the previous one highlights the points that were misclassified by our regression model. The confusion table shows the same.

table(yhat,df$Y)
##     
## yhat 0 1
##    0 9 1
##    1 1 9
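
If we want a single number summarizing the confusion matrix, a simple sketch is the overall accuracy, the fraction of points whose predicted class matches the observed one:

mean(yhat==df$Y)  # overall accuracy: proportion of correctly classified points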

For datasets where the two groups overlap or are interwoven even more, a logistic regression classifier will perform more poorly. But again, it might still be a valid, usable model. Here is an example where we would expect a lower performing model.

df<-data.frame(Y=c(rep(0,10),rep(1,10)),
               X1=c(runif(10,0,9),runif(10,2,11)),
               X2=c(runif(10,0,9),runif(10,2,11)))
ggplot(df,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

This data shows groups whose ranges overlap heavily. When we run the logistic regression, the resulting model still tries to find a line separating the two groups.

logreg<-glm(Y~.,data=df,family=binomial(link="logit"))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat<-predict(logreg,type="response")
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat))) + geom_point()

Again a straight line separates the two predicted groups. In this particular run, the randomly drawn points happened to remain linearly separable despite the overlapping ranges, which is also what the glm warnings about non-convergence and fitted probabilities of 0 or 1 are signalling, so the confusion matrix below still comes out perfect. Because the data comes from runif without a fixed seed, other draws with this much overlap will typically produce some misclassified points and a lower performing model.

table(yhat,df$Y)
##     
## yhat  0  1
##    0 10  0
##    1  0 10

More than two predictors

Of course what we just described also holds in higher dimensions, that is, in situations where we have more than two predictors. Plotting a 4-dimensional space is tricky to say the least, but let’s look at a three predictor case. We will use a dataset with segregated classes to make the graphs easier to see, but model performance will follow the same behavior as described above.

df<-data.frame(Y=c(rep(0,10),
                   rep(1,10)),
               X1=c(runif(10,0,5),runif(10,6,11)),
               X2=c(runif(10,0,5),runif(10,6,11)),
               X3=c(runif(10,0,5),runif(10,6,11)))
colors <- c("#56B4E9", "#E69F00")
colors <- colors[as.numeric(df$Y+1)]
scatterplot3d(df$X1,df$X2,df$X3,color=colors, pch = 16)

Running a logistic regression and plotting the yhat classification shows the same classification as the original dataset. Here, instead of drawing a line between the two groups, we can draw a plane between them, hence the good performance of the model.

logreg<-glm(Y~.,data=df,family=binomial(link="logit"))
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat<-predict(logreg,type="response")
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
colors <- c("#56B4E9", "#E69F00")
colors <- colors[as.numeric(yhat+1)]
scatterplot3d(df$X1,df$X2,df$X3,color=colors, pch = 16)

We can confirm this with the confusion matrix.

table(yhat,df$Y)
##     
## yhat  0  1
##    0 10  0
##    1  0 10
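
Just as we drew the p = 0.5 line in two dimensions, we can draw the p = 0.5 plane here. A sketch, assuming the coefficient names (Intercept), X1, X2 and X3 that coef returns for this model, and using the plane3d helper included in the object that scatterplot3d returns:

b<-coef(logreg)  # (Intercept), X1, X2, X3
s3d<-scatterplot3d(df$X1,df$X2,df$X3,color=colors, pch = 16)  # colors reflect the predicted classes from above
s3d$plane3d(-b["(Intercept)"]/b["X3"],-b["X1"]/b["X3"],-b["X2"]/b["X3"])  # the p=0.5 decision plane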

If we can’t draw a plane between the two groups, then the model won’t perform as nicely.

df<-data.frame(Y=c(rep(0,10),
                   rep(1,10)),
               X1=c(runif(10,0,9),runif(10,3,11)),
               X2=c(runif(10,0,9),runif(10,3,11)),
               X3=c(runif(10,0,9),runif(10,3,11)))
colors <- c("#56B4E9", "#E69F00")
colors <- colors[as.numeric(df$Y+1)]
scatterplot3d(df$X1,df$X2,df$X3,color=colors, pch = 16)

logreg<-glm(Y~.,data=df,family=binomial(link="logit"))
yhat<-predict(logreg,type="response")
yhat[yhat<0.5]<-0
yhat[yhat>=0.5]<-1
colors <- c("#56B4E9", "#E69F00")
colors <- colors[as.numeric(yhat+1)]
scatterplot3d(df$X1,df$X2,df$X3,color=colors, pch = 16)

As in the two predictor case, but now with a plane instead of a line, we can easily separate the two classes predicted by the model. The confusion matrix shows that this classification isn’t perfect.

table(yhat,df$Y)
##     
## yhat 0 1
##    0 8 2
##    1 2 8

As mentioned before, for more than three predictors what was described still stands; instead of separating the data with a plane, we would separate it with a hyperplane. Unfortunately we only have three visual dimensions to show.

More than two classes

We have presented the case with one, two and three predictors. We have also stated what happens with more than three predictors. But up to now we have been looking at only two classification groups. Could we extend what we have discussed so far to a dataset with three classes, say 0, 1 and 2? Let’s look at that case. For simplicity, let’s again consider two predictors and nicely segregated groups.

df<-data.frame(Y=c(rep(0,10),
                   rep(1,10),
                   rep(2,10)),
               X1=c(runif(10,0,5),runif(10,6,11),runif(10,12,17)),
               X2=c(runif(10,0,5),runif(10,6,11),runif(10,0,5)))
ggplot(df,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

So how can we solve this using logistic regression? We can see how straight lines would separate these three groups, but we would need two lines to do it. One approach is to use the concept of one-vs-all: we split the problem into multiple binary logistic regression problems. This is simple to understand now that we have a good grasp of what logistic regression is.

Say that for starters we want a model that identifies the class of 0’s. To do this we can simply convert our dataset into a two class dataset: the class of 0’s and the class of non-0’s, which is the result of grouping the 1’s and 2’s together. We can do this for all three classes, which results in three different models, one for each class.

df0<-df
df0$Y[df0$Y==2]<-1
ggplot(df0,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

We start with a model for the 0’s.

logreg0<-glm(Y~.,data=df0,family=binomial(link="logit"))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat0<-predict(logreg0,type="response")
yhat0[yhat0<0.5]<-0
yhat0[yhat0>=0.5]<-1
ggplot(df0,aes(x=X1,y=X2,color=as.factor(yhat0))) + geom_point()

The table shows how we classify all the 0’s, while the 1’s and 2’s are bundled into the same classification.

table(yhat0,df0$Y)
##      
## yhat0  0  1
##     0 10  0
##     1  0 20

We do the same for the 1’s.

df1<-df
df1$Y[df1$Y==2]<-0
ggplot(df1,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

logreg1<-glm(Y~.,data=df1,family=binomial(link="logit"))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat1<-predict(logreg1,type="response")
yhat1[yhat1<0.5]<-0
yhat1[yhat1>=0.5]<-1
ggplot(df1,aes(x=X1,y=X2,color=as.factor(yhat1))) + geom_point()

Again, the table shows how we classify one group, the 1’s, and bundle the 0’s and 2’s into one class.

table(yhat1,df1$Y)
##      
## yhat1  0  1
##     0 20  0
##     1  0 10

Last we build a model for the 2’s.

df2<-df
df2$Y[df2$Y==1]<-0
df2$Y[df2$Y==2]<-1
ggplot(df2,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

Before running the model for the 2’s, we needed to relabel the 2’s as 1’s (and the original 1’s as 0’s) so that the glm function sees a binary outcome.

logreg2<-glm(Y~.,data=df2,family=binomial(link="logit"))
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
yhat2<-predict(logreg2,type="response")
yhat2[yhat2<0.5]<-0
yhat2[yhat2>=0.5]<-1
ggplot(df2,aes(x=X1,y=X2,color=as.factor(yhat2))) + geom_point()

The table again gives us more detail on the performance of the model. We have to keep in mind that we replaced the 2’s with 1’s in order to build the model, so we undo that relabeling first.

df2$Y[df2$Y==1]<-2
yhat2[yhat2==1]<-2
table(yhat2,df2$Y)
##      
## yhat2  0  2
##     0 20  0
##     2  0 10

Now we have a model for each class. Let’s see how we would use them to classify a new dataset. Here we will take the models we have just built and see how they behave against new data.

df<-data.frame(Y=c(rep(0,10),
                   rep(1,10),
                   rep(2,10)),
               X1=c(runif(10,0,8),runif(10,6,12),runif(10,12,17)),
               X2=c(runif(10,0,8),runif(10,6,12),runif(10,0,5)))
ggplot(df,aes(x=X1,y=X2,color=as.factor(Y))) + geom_point()

We start by using our models to identify the 0’s. We could run all three models and compare how each scores every point, but here we will go straight to the model we calibrated for the 0’s.

yhat0.new<-predict(logreg0,newdata=df,type="response")
yhat0.new[yhat0.new<0.5]<-0
yhat0.new[yhat0.new>=0.5]<-1
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat0.new))) + geom_point()

The table helps us better understand the model’s performance. As we can see, some points are misclassified, but overall the model is able to label the 0’s correctly.

table(yhat0.new,df$Y)
##          
## yhat0.new  0  1  2
##         0  8  0  0
##         1  2 10 10

In the plot of our model’s predictions for the 0’s, we again see how a straight line can be drawn between the two classes.

We continue with the 1’s.

yhat1.new<-predict(logreg1,newdata=df,type="response")
yhat1.new[yhat1.new<0.5]<-0
yhat1.new[yhat1.new>=0.5]<-1
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat1.new))) + geom_point()

Once again the table gives us more insight into model performance. We need to keep in mind that anything that isn’t a 1, that is, the 0’s and 2’s, will be classified as 0.

table(yhat1.new,df$Y)
##          
## yhat1.new  0  1  2
##         0  6  0 10
##         1  4 10  0

And finally the 2’s.

yhat2.new<-predict(logreg2,newdata=df,type="response")
yhat2.new[yhat2.new<0.5]<-0
yhat2.new[yhat2.new>=0.5]<-2
ggplot(df,aes(x=X1,y=X2,color=as.factor(yhat2.new))) + geom_point()

table(yhat2.new,df$Y)
##          
## yhat2.new  0  1  2
##         0 10 10  0
##         2  0  0 10

The keen eye will notice that some points might be assigned to more than one class by the different models, or to none at all. One way to make the final selection for these points is to use the classification given by the model that performs best at labelling its respective class.
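
Another common way to combine one-vs-all models, sketched below under the assumption that we keep the raw probabilities instead of thresholding them, is to assign each point to the class whose model reports the highest probability. Remember that logreg0 was trained to predict “not a 0”, so its probability has to be flipped.

prob0<-1-predict(logreg0,newdata=df,type="response")  # logreg0 models P(not 0), so flip it to get P(class 0)
prob1<-predict(logreg1,newdata=df,type="response")    # P(class 1)
prob2<-predict(logreg2,newdata=df,type="response")    # P(class 2)
yhat.all<-max.col(cbind(prob0,prob1,prob2))-1         # most probable class per point, columns 1:3 map to classes 0:2
table(yhat.all,df$Y)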

Final Comments

From what was presented, it becomes easier to reason about a rather complex situation where we have several predictors and several classes in our dependent variable. By understanding the basics of logistic regression, we can more easily understand how these more elaborate models work, what situations they are best suited for, and what a model’s performance refers to.