Logistic regression is one of the most common classification methods. In this project, I study the method and explain why logistic regression is more appropriate than linear regression for classification problems.
Linear regression takes the form \(f(x)=w^Tx\), where \(w\) is the vector of coefficients and \(x\) is the input vector. The response \(y=f(x)\) is usually a quantitative value.
df<-data.frame(weight=c(39,45,33,60,80,78,55,100,70,46,99,66),
type=c(0,0,0,0,1,1,0,1,1,0,1,1))
library(ggplot2)
#lm() has no family argument, so method.args is dropped for the linear fit
ggplot(df,aes(x=weight,y=type))+geom_point(alpha=0.5)+stat_smooth(method="lm", se=FALSE, col="red")
In the example above, we have two groups of people, fat and thin, together with their weights in kilograms. The value 1 denotes Fat and 0 denotes Thin.
The response takes only two values, 0 and 1, so this is a binary classification problem. Linear regression is not suitable for predicting a person's class from their weight: its output is unbounded, so for some weights the predicted value will be larger than 1 or smaller than 0.
For example, the fitted coefficients are \(B_0=-0.78375\) and \(B_1=0.01998\). Suppose a new person weighs 120 kg. The linear regression model then gives the response value
\(y=-0.78375+0.01998\cdot Weight=-0.78375+0.01998\cdot 120\approx 1.614\).
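The calculation can be reproduced in R. The following is a minimal sketch that refits the straight line on the toy data and predicts at two weights; the names `df_toy`, `fit_lm` and `new_people`, and the 20 kg case, are illustrative and not part of the original analysis.

```r
# Toy data from the example above
df_toy <- data.frame(
  weight = c(39, 45, 33, 60, 80, 78, 55, 100, 70, 46, 99, 66),
  type   = c(0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1)
)

# Ordinary least-squares fit of the 0/1 label on weight
fit_lm <- lm(type ~ weight, data = df_toy)
coef(fit_lm)  # intercept about -0.784, slope about 0.0200

# Predictions for a very light and a very heavy (hypothetical) person
new_people <- data.frame(weight = c(20, 120))
pred <- predict(fit_lm, newdata = new_people)
pred  # first value is below 0, second is above 1
```

Both predictions fall outside the interval from 0 to 1, so neither can be read as a probability.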
The result is meaningless here since it is larger than 1, so we cannot say which group the new person belongs to. Logistic regression addresses this problem. It uses the same linear combination \(w^Tx\) as linear regression, but passes it through the logistic function so that the model output lies between 0 and 1 and can be read as a probability.
\(p(x)=\frac{e^{w^Tx}}{1+e^{w^Tx}}\)
After manipulation, we find that \(\frac{p(x)}{1-p(x)}=e^{w^Tx}\)
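The manipulation is short enough to spell out. Starting from the definition of \(p(x)\), the complement is
\(1-p(x)=1-\frac{e^{w^Tx}}{1+e^{w^Tx}}=\frac{1}{1+e^{w^Tx}}\)
and dividing \(p(x)\) by this quantity cancels the common denominator:
\(\frac{p(x)}{1-p(x)}=\frac{e^{w^Tx}}{1+e^{w^Tx}}\cdot\left(1+e^{w^Tx}\right)=e^{w^Tx}\)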
The quantity \(\frac{p(x)}{1-p(x)}\) is called the odds and can take any value between 0 and \(\infty\). Taking the logarithm of both sides, we obtain \(\log\big(\frac{p(x)}{1-p(x)}\big)=w^Tx\). The left-hand side of this equation is called the log odds, or logit.
#Blue: logistic fit (glm with binomial family). Red: straight-line fit (lm takes no family argument)
ggplot(df,aes(x=weight,y=type))+geom_point(alpha=0.5)+stat_smooth(method="glm", se=FALSE, method.args = list(family=binomial),col="blue")+stat_smooth(method="lm", se=FALSE, col="red")
The plot shows why logistic regression is more suitable for the binary classification problem: the logistic curve (blue) stays between 0 and 1, while the straight line (red) does not. Usually, we set the threshold to 0.5, that is, we check whether
\(p(Group=Fat|Weight)>0.5\). If the estimated probability for a person of a given weight is larger than 0.5, they are classified as Fat; otherwise they are classified as Thin.
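As a small illustration of the threshold rule, the sketch below applies base R's `plogis()` (the logistic function) to a few made-up values of the linear predictor \(w^Tx\); the values in `eta` are hypothetical and chosen only to show the behaviour.

```r
# Hypothetical values of the linear predictor w^T x
eta <- c(-5, -1, 0, 1, 5)

# plogis() computes exp(eta) / (1 + exp(eta)), so every output lies in (0, 1)
p <- plogis(eta)

# Apply the 0.5 threshold: p > 0.5 exactly when eta > 0
label <- ifelse(p > 0.5, "Fat", "Thin")
data.frame(eta, p = round(p, 3), label)
```

Because \(p(x)>0.5\) is equivalent to \(w^Tx>0\), the 0.5 threshold corresponds to a single cut-off on the linear predictor.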
Now we apply this method to a new problem.
data("PimaIndiansDiabetes2", package = "mlbench")
df<-PimaIndiansDiabetes2
head(df)
##   pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 1        6     148       72      35      NA 33.6    0.627  50      pos
## 2        1      85       66      29      NA 26.6    0.351  31      neg
## 3        8     183       64      NA      NA 23.3    0.672  32      pos
## 4        1      89       66      23      94 28.1    0.167  21      neg
## 5        0     137       40      35     168 43.1    2.288  33      pos
## 6        5     116       74      NA      NA 25.6    0.201  30      neg
str(df)
## 'data.frame': 768 obs. of 9 variables:
## $ pregnant: num 6 1 8 1 0 5 3 10 2 8 ...
## $ glucose : num 148 85 183 89 137 116 78 115 197 125 ...
## $ pressure: num 72 66 64 66 40 74 50 NA 70 96 ...
## $ triceps : num 35 29 NA 23 35 NA 32 NA 45 NA ...
## $ insulin : num NA NA NA 94 168 NA 88 NA 543 NA ...
## $ mass : num 33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 NA ...
## $ pedigree: num 0.627 0.351 0.672 0.167 2.288 ...
## $ age : num 50 31 32 21 33 30 26 29 53 54 ...
## $ diabetes: Factor w/ 2 levels "neg","pos": 2 1 2 1 2 1 2 1 2 2 ...
We use the diabetes data available in the mlbench package and build a model to predict whether a person is diabetes-negative or diabetes-positive based on their health information.
First, we prepare the data by omitting all rows that contain NA values.
df<-na.omit(df)
After that, we split the data into training and test sets for our model. Roughly 80% of the rows are used for training and roughly 20% for testing.
set.seed(23)
index<-sample(2,nrow(df),replace=TRUE,prob = c(0.8,0.2))
training<-df[index==1,]
testing<-df[index==2,]
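Because `sample(2, n, replace=TRUE, prob=c(0.8,0.2))` assigns each row independently, the split above is only approximately 80/20. If an exact split is preferred, one alternative is to sample row numbers without replacement. The sketch below uses a hypothetical data frame `toy` as a stand-in for `df`; all names in it are illustrative.

```r
set.seed(23)

# Hypothetical stand-in for the cleaned data frame
toy <- data.frame(id = 1:100, x = rnorm(100))
n <- nrow(toy)

# Draw exactly 80% of the row numbers, without replacement
train_rows <- sample(n, size = floor(0.8 * n))
toy_train <- toy[train_rows, ]
toy_test  <- toy[-train_rows, ]

# Sizes of the two sets: 80 and 20 rows
c(train = nrow(toy_train), test = nrow(toy_test))
```

Either approach is fine for this walkthrough; the exact split only makes the set sizes predictable.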
We will use the training data to train our logistic regression model.
## Single logistic regression
To keep the illustration simple, we first predict the diabetes group from the glucose predictor alone.
model<-glm(diabetes~glucose,data=training,family = "binomial")
model$coef
## (Intercept) glucose
## -5.9304336 0.0406278
Based on these coefficients, the fitted model is
\(p(x)=\frac{e^{-5.930+0.0406\cdot Glucose}}{1+e^{-5.930+0.0406\cdot Glucose}}\)
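To see what the fitted model implies, the sketch below plugs the coefficients printed by `model$coef` into the logistic formula by hand. The glucose values are hypothetical and chosen only for illustration; `b0`, `b1` and `boundary` are names introduced here.

```r
# Coefficients as printed by model$coef above
b0 <- -5.9304336
b1 <- 0.0406278

# Hypothetical glucose values, for illustration only
glucose <- c(100, 150, 200)
eta <- b0 + b1 * glucose
p <- exp(eta) / (1 + exp(eta))
round(p, 3)

# Glucose level where p = 0.5, i.e. where the linear predictor is zero (about 146)
boundary <- -b0 / b1
boundary

# Factor by which the odds are multiplied for one extra unit of glucose (about 1.04)
exp(b1)
```

With a 0.5 threshold, this single-predictor model therefore labels a person "pos" once their glucose value exceeds roughly 146.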
#Apply the model to the testing data; type='response' returns probabilities
predicted_value<-predict(model,testing,type='response')
#Set the threshold to 0.5
predicted_group<-ifelse(predicted_value>0.5,"pos","neg")
predicted_group
## 14 17 19 20 21 25 32 41 54 57 60 86 89
## "pos" "neg" "neg" "neg" "neg" "neg" "pos" "pos" "pos" "pos" "neg" "neg" "neg"
## 92 112 115 120 126 128 140 145 153 159 163 166 172
## "neg" "pos" "pos" "neg" "neg" "neg" "neg" "pos" "pos" "neg" "neg" "neg" "neg"
## 174 175 182 209 225 242 245 253 274 289 299 309 310
## "neg" "neg" "neg" "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg"
## 319 329 330 335 365 366 369 370 381 384 385 389 396
## "neg" "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg"
## 397 416 422 429 433 443 494 498 501 512 516 520 528
## "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "pos" "neg" "neg"
## 529 533 539 547 548 566 570 575 576 577 585 600 648
## "neg" "neg" "neg" "pos" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "neg" "pos"
## 651 653 673 681 723 731 733 748 754
## "neg" "neg" "neg" "neg" "pos" "neg" "pos" "neg" "pos"
#Accuracy table (predicted vs. actual classes)
tab<-table(Predicted=predicted_group,Actual=testing$diabetes)
print(tab)
##          Actual
## Predicted neg pos
##       neg  52  17
##       pos   4  14
#Accuracy on testing data
sum(diag(tab))/sum(tab)
## [1] 0.7586207
The accuracy of our model on the testing data is around 75.9%.
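Accuracy alone hides how the errors are distributed between the two classes. The sketch below rebuilds the table from the printed counts and, as an addition not computed in the original analysis, also derives the per-class rates usually called sensitivity and specificity; the object name `tab_glucose` is illustrative.

```r
# Counts copied from the printed table above (filled column by column)
tab_glucose <- matrix(
  c(52, 4, 17, 14), nrow = 2,
  dimnames = list(Predicted = c("neg", "pos"), Actual = c("neg", "pos"))
)

# Share of all test cases that were classified correctly: 66 / 87
accuracy <- sum(diag(tab_glucose)) / sum(tab_glucose)

# Share of actual positives that were predicted positive: 14 / 31
sensitivity <- tab_glucose["pos", "pos"] / sum(tab_glucose[, "pos"])

# Share of actual negatives that were predicted negative: 52 / 56
specificity <- tab_glucose["neg", "neg"] / sum(tab_glucose[, "neg"])

round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```

The model identifies most negative cases but misses 17 of the 31 positive cases in the testing data.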
df$prob<-ifelse(df$diabetes=="pos",1,0)
ggplot(df,aes(x=glucose,y=prob))+ geom_point() + geom_smooth(method = "glm", method.args = list(family = "binomial")) +
labs(
title = "Logistic Regression Model",
x = "Plasma Glucose Concentration",
y = "Probability of being diabetes-positive"
)
Glucose alone is not enough to classify who is diabetic, so we add more predictors from the dataset to improve the model.
model<-glm(diabetes~glucose+pressure+pregnant+mass,data=training,family = "binomial")
model$coef
## (Intercept) glucose pressure pregnant mass
## -8.3732894381 0.0364138778 -0.0003777669 0.1331721036 0.0741780756
Our model now has the form
\(p(x)=\frac{e^{-8.373+0.0364\cdot Glucose-0.0004\cdot Pressure+0.133\cdot Pregnant+0.0742\cdot Mass}}{1+e^{-8.373+0.0364\cdot Glucose-0.0004\cdot Pressure+0.133\cdot Pregnant+0.0742\cdot Mass}}\)
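The formula is the same \(w^Tx\) construction as before, now with several predictors. As a minimal sketch, the code below evaluates it by hand for one hypothetical patient, using the coefficients printed by `model$coef`; the patient values and the names `w`, `x` and `new_class` are assumptions made for illustration.

```r
# Coefficients as printed by model$coef above
w <- c(intercept = -8.3732894381, glucose = 0.0364138778,
       pressure = -0.0003777669, pregnant = 0.1331721036,
       mass = 0.0741780756)

# Hypothetical patient; the leading 1 multiplies the intercept
x <- c(1, glucose = 150, pressure = 70, pregnant = 2, mass = 30)

# Linear predictor w^T x, then the logistic function
eta <- sum(w * x)
p <- plogis(eta)

# Classify with the 0.5 threshold
new_class <- ifelse(p > 0.5, "pos", "neg")
c(eta = eta, p = p)
new_class
```

For these made-up values the linear predictor is negative, so the estimated probability is below 0.5 and the patient is labelled "neg".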
predicted_prob<-predict(model,testing,type="response")
predicted_class<-ifelse(predicted_prob>0.5,'pos','neg')
#Accuracy table
tab<-table(Predicted=predicted_class,Actual=testing$diabetes)
tab
##          Actual
## Predicted neg pos
##       neg  52  16
##       pos   4  15
#Accuracy
sum(diag(tab))/sum(tab)
## [1] 0.7701149
After adding more predictors to our model, the accuracy on the testing data increased to about 77%.
Logistic regression is suitable for binary classification problems. For multiclass classification problems, there is a related method called multinomial logistic regression.
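Multinomial logistic regression is not covered in this project, but its link to the binary case can be sketched briefly. In its usual formulation, each class gets a score and the scores are turned into probabilities with the softmax function. The helper `softmax()` and the scores below are illustrative assumptions, not part of the analysis above.

```r
# Softmax: turns a vector of class scores into probabilities that sum to 1
softmax <- function(scores) {
  z <- exp(scores - max(scores))  # subtracting the max avoids overflow
  z / sum(z)
}

# Hypothetical scores for three classes
probs <- softmax(c(2, 1, 0.1))
sum(probs)  # the probabilities sum to 1

# With two classes and the first score fixed at 0,
# softmax gives the same value as the logistic function
eta <- 0.7
all.equal(softmax(c(0, eta))[2], plogis(eta))
```

The last line shows that binary logistic regression is the two-class special case of this construction.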