Logistic regression is one of the most common classification methods. In this project, I study the method and explain why logistic regression is more appropriate than linear regression for classification problems.

Why not linear regression?

Linear regression takes the form \(f(x)=w^Tx\), where \(w\) is the vector of coefficients and \(x\) is the input vector. The response \(y=f(x)\) is usually a quantitative value.

df<-data.frame(weight=c(39,45,33,60,80,78,55,100,70,46,99,66),
               type=c(0,0,0,0,1,1,0,1,1,0,1,1))
library(ggplot2)
# Straight-line (least-squares) fit; lm() takes no family argument
ggplot(df,aes(x=weight,y=type))+geom_point(alpha=0.5)+stat_smooth(method="lm", se=FALSE, col="red")

In the example above, we have two groups of people, fat and thin, together with their weights in kilograms. Fat is coded as 1 and Thin as 0.

The response has only two possible values, 0 and 1, so this is a binary classification problem. Linear regression is not suitable for predicting a person's class from their weight: its fitted values are unbounded, so for sufficiently large or small weights the predicted response will be larger than 1 or smaller than 0.

For example, the fitted intercept is \(B_0=-0.78375\) and the fitted slope is \(B_1=0.01998\). Suppose a new person weighs 120 kg. The linear regression model then gives the response

\(y=-0.78375+0.01998\cdot Weight=-0.78375+0.01998\cdot 120\approx 1.614\).
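The hand calculation can be checked directly in R. The following is a minimal sketch that refits the straight line on the toy data from this section; the names `lin_fit`, `new_person` and `pred_120` are chosen here for illustration only.

```r
# Toy data from the example above: weight in kg, type 1 = Fat, 0 = Thin
df <- data.frame(weight = c(39, 45, 33, 60, 80, 78, 55, 100, 70, 46, 99, 66),
                 type   = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1))

# Ordinary least-squares fit of type on weight
lin_fit <- lm(type ~ weight, data = df)
coef(lin_fit)   # intercept and slope of the straight line

# Prediction for a hypothetical new person weighing 120 kg
new_person <- data.frame(weight = 120)
pred_120   <- predict(lin_fit, newdata = new_person)
pred_120        # roughly 1.61, which lies outside the interval [0, 1]
```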

This result is meaningless here: it is larger than 1, so it cannot be read as a probability and does not tell us which group the new person belongs to. Logistic regression addresses this problem. It uses the same linear combination \(w^Tx\) as linear regression, but passes it through the logistic function so that the output always lies between 0 and 1.

\(p(x)=\frac{e^{w^Tx}}{1+e^{w^Tx}}\)

After manipulation, we find that \(\frac{p(x)}{1-p(x)}=e^{w^Tx}\)
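The manipulation can be spelled out in two steps, starting from the definition of \(p(x)\) above. This is the standard algebra, written out only to fill in the omitted step:

```latex
\begin{aligned}
1-p(x) &= 1-\frac{e^{w^Tx}}{1+e^{w^Tx}} = \frac{1}{1+e^{w^Tx}},\\
\frac{p(x)}{1-p(x)} &= \frac{e^{w^Tx}}{1+e^{w^Tx}}\cdot\left(1+e^{w^Tx}\right) = e^{w^Tx}.
\end{aligned}
```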

The quantity \(\frac{p(x)}{1-p(x)}\) is called the odds and can take any value between 0 and \(\infty\). Taking the logarithm of both sides, we obtain \(\log\big(\frac{p(x)}{1-p(x)}\big)=w^Tx\). The left-hand side of this equation is called the log odds or logit.
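The relationship between probability, odds and log odds can also be checked numerically. The sketch below uses a few made-up values of the linear predictor \(w^Tx\) (stored in the hypothetical vector `eta`) and is only meant to illustrate the three quantities.

```r
# Hypothetical values of the linear predictor w^T x
eta <- c(-3, 0, 2)

p        <- exp(eta) / (1 + exp(eta))  # logistic function, always in (0, 1)
odds     <- p / (1 - p)                # equals exp(eta)
log_odds <- log(odds)                  # recovers eta

data.frame(eta, p, odds, log_odds)

# Base R's plogis() and qlogis() are the logistic function and the logit
all.equal(p, plogis(eta))
all.equal(log_odds, qlogis(p))
```

Whatever value the linear predictor takes, `p` stays strictly between 0 and 1, which is the property that the straight-line model lacks.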

# Blue: logistic regression fit; red: straight-line fit (lm() takes no family argument)
ggplot(df,aes(x=weight,y=type))+geom_point(alpha=0.5)+stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial),col="blue")+stat_smooth(method="lm", se=FALSE, col="red")

The plot shows why logistic regression is more suitable for binary classification: the logistic curve (blue) stays between 0 and 1, while the straight line (red) does not. Usually, we set the classification threshold to 0.5 and check whether

\(p(Group=Fat|Weight)>0.5\). If the estimated probability for a person with the given weight is larger than 0.5, the person is classified as Fat; otherwise, the person is classified as Thin.
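The thresholding rule can be sketched as follows on the toy data. The two new weights are hypothetical and chosen for illustration. Note that in this toy data set every Thin weight is at most 60 kg and every Fat weight is at least 66 kg, so the classes are perfectly separated and `glm()` emits warnings about the fit; they are suppressed in the sketch.

```r
# Toy data from the example above: weight in kg, type 1 = Fat, 0 = Thin
df <- data.frame(weight = c(39, 45, 33, 60, 80, 78, 55, 100, 70, 46, 99, 66),
                 type   = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1))

# Logistic regression fit; warnings caused by the perfectly separated
# toy data are suppressed here for readability
logit_fit <- suppressWarnings(glm(type ~ weight, data = df, family = binomial))

# Hypothetical new weights (kg), chosen well away from the 60-66 kg gap
new_people <- data.frame(weight = c(50, 90))
prob_fat   <- predict(logit_fit, newdata = new_people, type = "response")

# Apply the 0.5 threshold
pred_group <- ifelse(prob_fat > 0.5, "Fat", "Thin")
data.frame(new_people, prob_fat, pred_group)
```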

Logistic regression

Now we apply this method to a new problem.

data("PimaIndiansDiabetes2", package = "mlbench")
df<-PimaIndiansDiabetes2
head(df)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74      NA      NA 25.6    0.201  30      neg
str(df)
## 'data.frame':    768 obs. of  9 variables:
##  $ pregnant: num  6 1 8 1 0 5 3 10 2 8 ...
##  $ glucose : num  148 85 183 89 137 116 78 115 197 125 ...
##  $ pressure: num  72 66 64 66 40 74 50 NA 70 96 ...
##  $ triceps : num  35 29 NA 23 35 NA 32 NA 45 NA ...
##  $ insulin : num  NA NA NA 94 168 NA 88 NA 543 NA ...
##  $ mass    : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
##  $ pedigree: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ age     : num  50 31 32 21 33 30 26 29 53 54 ...
##  $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...

We use the diabetes data available in the mlbench package. We build a model that predicts whether a person tests negative or positive for diabetes based on their health information.

First, we prepare the data by omitting all rows that contain NA values.

df<-na.omit(df)

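After that, we split the data into training and testing sets. Each row is assigned to the training set with probability 0.8 and to the testing set with probability 0.2, so the split is roughly 80% for training and 20% for testing.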

set.seed(23)
index<-sample(2,nrow(df),replace=TRUE,prob = c(0.8,0.2))
training<-df[index==1,]
testing<-df[index==2,]
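Because the assignment above is random row by row, the realised proportions are only approximately 80/20. If an exact split is preferred, one alternative is to sample row numbers without replacement. The sketch below is not the method used in this project; it uses a small stand-in data frame called `toy` (hypothetical) so that it runs on its own.

```r
set.seed(23)

# Stand-in data frame (hypothetical); in this project it would be df
toy <- data.frame(id = 1:100, x = rnorm(100))

# Draw exactly 80% of the row numbers without replacement
n         <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))

toy_train <- toy[train_idx, ]    # rows selected for training
toy_test  <- toy[-train_idx, ]   # all remaining rows

c(train = nrow(toy_train), test = nrow(toy_test))   # 80 and 20
```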

We will use the training data to train our logistic regression model.

Single logistic regression

To keep the presentation simple, we first predict the diabetes group from the glucose predictor alone.

model<-glm(diabetes~glucose,data=training,family = "binomial")
model$coef
## (Intercept)     glucose 
##  -5.9304336   0.0406278

Based on these estimates, the fitted model is

\(p(x)=\frac{e^{-5.930+0.0406\cdot Glucose}}{1+e^{-5.930+0.0406\cdot Glucose}}\)
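To see how the fitted model turns a glucose value into a probability, the calculation can be done by hand. The sketch below copies the two coefficients from the `model$coef` output above; the glucose value of 150 is a hypothetical input chosen for illustration.

```r
# Coefficients copied from the model$coef output above
b0 <- -5.9304336
b1 <-  0.0406278

# Hypothetical glucose value chosen for illustration
glucose <- 150

eta <- b0 + b1 * glucose   # linear predictor (log odds)
p   <- plogis(eta)         # same as exp(eta) / (1 + exp(eta))
p                          # about 0.54, so "pos" at the 0.5 threshold

# Glucose level at which the predicted probability is exactly 0.5
boundary <- -b0 / b1
boundary                   # about 146

# Multiplicative change in the odds per one-unit increase in glucose
odds_ratio <- exp(b1)
odds_ratio                 # about 1.04
```

With a 0.5 threshold, this single-predictor model therefore classifies a person as "pos" when their glucose value is above roughly 146.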

#Apply the model to the testing data
predicted_value<-predict(model,testing,type='response')
#Set the threshold equals 0.5
predicted_group<-ifelse(predicted_value>0.5,"pos","neg")
predicted_group
##    14    17    19    20    21    25    32    41    54    57    60    86    89 
## "pos" "neg" "neg" "neg" "neg" "neg" "pos" "pos" "pos" "pos" "neg" "neg" "neg" 
##    92   112   115   120   126   128   140   145   153   159   163   166   172 
## "neg" "pos" "pos" "neg" "neg" "neg" "neg" "pos" "pos" "neg" "neg" "neg" "neg" 
##   174   175   182   209   225   242   245   253   274   289   299   309   310 
## "neg" "neg" "neg" "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" 
##   319   329   330   335   365   366   369   370   381   384   385   389   396 
## "neg" "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg" 
##   397   416   422   429   433   443   494   498   501   512   516   520   528 
## "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "pos" "neg" "neg" 
##   529   533   539   547   548   566   570   575   576   577   585   600   648 
## "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "pos" 
##   651   653   673   681   723   731   733   748   754 
## "neg" "neg" "neg" "neg" "pos" "neg" "pos" "neg" "pos"
#Confusion matrix: predicted vs. actual class
tab<-table(Predicted=predicted_group,Actual=testing$diabetes)
print(tab)
##          Actual
## Predicted neg pos
##       neg  52  17
##       pos   4  14
#Accuracy on the testing data
sum(diag(tab))/sum(tab)
## [1] 0.7586207

The accuracy of our model on the testing data is about 75.9%.
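Accuracy alone does not show how the errors are distributed between the two classes. As a supplementary sketch, the confusion matrix printed above can be re-entered by hand to compute the sensitivity and specificity as well; these two measures are not part of the original analysis.

```r
# Confusion matrix copied from the output above (filled column by column)
tab <- matrix(c(52, 4, 17, 14), nrow = 2,
              dimnames = list(Predicted = c("neg", "pos"),
                              Actual    = c("neg", "pos")))

accuracy    <- sum(diag(tab)) / sum(tab)
sensitivity <- tab["pos", "pos"] / sum(tab[, "pos"])  # true positives / actual positives
specificity <- tab["neg", "neg"] / sum(tab[, "neg"])  # true negatives / actual negatives

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

Here the model identifies most negative cases correctly (52 of 56) but finds fewer than half of the positive cases (14 of 31).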

df$prob<-ifelse(df$diabetes=="pos",1,0)
ggplot(df,aes(x=glucose,y=prob))+ geom_point() + geom_smooth(method = "glm", method.args = list(family = "binomial")) +
  labs(
    title = "Logistic Regression Model", 
    x = "Plasma Glucose Concentration",
    y = "Probability of being diabetes-positive"
    )

Multiple logistic regression

Glucose alone is not enough to classify who is diabetic, so we add more predictors from the dataset to improve the model.

model<-glm(diabetes~glucose+pressure+pregnant+mass,data=training,family = "binomial")
model$coef
##   (Intercept)       glucose      pressure      pregnant          mass 
## -8.3732894381  0.0364138778 -0.0003777669  0.1331721036  0.0741780756

Our model now has the form

\(p(x)=\frac{e^{-8.373+0.0364\cdot Glucose-0.0004\cdot Pressure+0.133\cdot Pregnant+0.0742\cdot Mass}}{1+e^{-8.373+0.0364\cdot Glucose-0.0004\cdot Pressure+0.133\cdot Pregnant+0.0742\cdot Mass}}\)
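With several predictors, the linear part is simply the inner product \(w^Tx\) with a leading 1 for the intercept. The sketch below copies the coefficients from the `model$coef` output above and applies them to one hypothetical patient; the patient's values are made up for illustration.

```r
# Coefficients copied from the model$coef output above
w <- c(intercept = -8.3732894381,
       glucose   =  0.0364138778,
       pressure  = -0.0003777669,
       pregnant  =  0.1331721036,
       mass      =  0.0741780756)

# Hypothetical patient; the leading 1 multiplies the intercept
x <- c(1, glucose = 150, pressure = 70, pregnant = 2, mass = 35)

eta <- sum(w * x)    # linear predictor w^T x (log odds)
p   <- plogis(eta)   # predicted probability of "pos"
p                    # about 0.48

pred_class <- ifelse(p > 0.5, "pos", "neg")
pred_class           # "neg" at the 0.5 threshold
```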

predicted_prob<-predict(model,testing,type="response")
predicted_class<-ifelse(predicted_prob>0.5,'pos','neg')

#Accuracy table
tab<-table(Predicted=predicted_class,Actual=testing$diabetes)
tab
##          Actual
## Predicted neg pos
##       neg  52  16
##       pos   4  15
#Accuracy
sum(diag(tab))/sum(tab)
## [1] 0.7701149

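After adding more predictors to the model, the accuracy on the testing data increased slightly, from about 76% to about 77% (67 instead of 66 of the 87 test cases classified correctly).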

Conclusion

Logistic regression is well suited to binary classification problems. For problems with more than two classes, there is an extension called multinomial logistic regression.