Logistic Regression

Introduction

Logistic regression is a statistical method used to predict the probability of a binary outcome.
It is a supervised learning algorithm, which means that it learns from a dataset of labeled examples.
Logistic regression is a generalized linear model, which means that it uses a linear function to model the log odds of the outcome.
Logistic regression is a popular choice for binary classification and prediction tasks because it is easy to understand, interpret, and implement.

Applications

Predicting customer churn: Logistic regression can be used to predict which customers are likely to churn, so that companies can take steps to retain them.
Diagnosing diseases: Logistic regression can be used to estimate the probability that a patient has a disease based on the patient’s symptoms and medical history.
Predicting credit risk: Logistic regression can be used to predict which borrowers are likely to default on their loans, so that lenders can make better lending decisions.
Predicting election results: Logistic regression can be used to predict the outcome of elections based on polls and other data.

How logistic regression works

Logistic regression models the log odds of the outcome as a linear function of the predictors.
The odds of an event are the ratio of the probability that the event occurs to the probability that it does not occur, i.e. p / (1 - p).
The log odds (also called the logit) are the natural logarithm of the odds.
Modeling the log odds is convenient because, unlike a probability (restricted to values between 0 and 1) or the odds (restricted to non-negative values), the log odds can take any real value, which matches the range of a linear function.
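
The relationship between probability, odds and log odds can be checked numerically. The following is a minimal sketch in base R; the probabilities in `p` are made-up values chosen only for illustration, and `qlogis()` and `plogis()` are base R's log-odds function and its inverse.

```r
# Convert probabilities to odds and log odds, then back again.
p <- c(0.1, 0.5, 0.9)        # illustrative probabilities (assumed values)
odds <- p / (1 - p)          # odds = P(event) / P(no event)
log_odds <- log(odds)        # natural logarithm of the odds

# qlogis() computes the same log odds directly.
matches_qlogis <- isTRUE(all.equal(log_odds, qlogis(p)))

# plogis() is the inverse: it maps any real number back into (0, 1).
p_back <- plogis(log_odds)

print(data.frame(p, odds, log_odds, p_back))
```

A probability of 0.5 corresponds to odds of 1 and log odds of 0; probabilities below 0.5 give negative log odds and probabilities above 0.5 give positive log odds.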

Further Example

Next, we look at logistic regression in more detail through a worked application: predicting whether or not someone has diabetes using a logistic regression model.

We use a dataset from Kaggle, loaded and inspected with some initial analysis steps in R. Below is a brief view of what the dataset looks like:

## 'data.frame':    768 obs. of  9 variables:
##  $ Pregnancies             : int  6 1 8 1 0 5 3 10 2 8 ...
##  $ Glucose                 : int  148 85 183 89 137 116 78 115 197 125 ...
##  $ BloodPressure           : int  72 66 64 66 40 74 50 0 70 96 ...
##  $ SkinThickness           : int  35 29 0 23 35 0 32 0 45 0 ...
##  $ Insulin                 : int  0 0 0 94 168 0 88 0 543 0 ...
##  $ BMI                     : num  33.6 26.6 23.3 28.1 43.1 25.6 31 35.3 30.5 0 ...
##  $ DiabetesPedigreeFunction: num  0.627 0.351 0.672 0.167 2.288 ...
##  $ Age                     : int  50 31 32 21 33 30 26 29 53 54 ...
##  $ Outcome                 : int  1 0 1 0 1 0 1 0 1 1 ...

Understanding relationships

First we compute the correlation between each pair of features to understand which variables are likely to be useful for predicting the binary outcome. The correlation is given by the formula:

\[ {SS}_{xy} = \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac {(\sum x)(\sum y)}{n} \]

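\[ r = \frac {{SS}_{xy}}{\sqrt {{SS}_{xx}{SS}_{yy}}} \]

Here r is the correlation between two variables x and y, and \({SS}_{xx}\) and \({SS}_{yy}\) are defined in the same way as \({SS}_{xy}\), with both factors taken from x or both from y.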
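
As a check on the formula, the sketch below computes r from the sums of squares and compares it with R's built-in `cor()`. The vectors `x` and `y` are made-up values used only for illustration, not columns from the diabetes data.

```r
# Pearson correlation from the sums of squares, compared with cor().
x <- c(1, 2, 3, 4, 5)                  # illustrative values (assumed)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)       # illustrative values (assumed)
n <- length(x)

ss_xy <- sum(x * y) - sum(x) * sum(y) / n
ss_xx <- sum(x^2) - sum(x)^2 / n
ss_yy <- sum(y^2) - sum(y)^2 / n

r_manual <- ss_xy / sqrt(ss_xx * ss_yy)
r_builtin <- cor(x, y)                 # Pearson correlation by default

print(c(manual = r_manual, builtin = r_builtin))
```

Because every column in the dataset is numeric, calling `cor()` on the whole data frame should return the full correlation matrix in one step.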

Correlations

All of the correlations are shown in the plot below. Choosing only the variables with the highest correlations, we use Age, Glucose, Skin Thickness and Diabetes Pedigree Function to predict our binary outcome. In the following plot:

1 = Glucose vs Blood Pressure
2 = Skin Thickness vs Insulin
3 = BMI vs DiabetesPedigreeFunction
4 = Age vs Glucose
5 = BMI vs Age
6 = Insulin vs Age
7 = Blood Pressure vs Insulin

Plot of correlation values

Creating the model

After these first steps in understanding the correlations, we can see that a few variables play a role in predicting the outcome, so we create a model that includes each of them as a predictor.

On paper, the logistic regression formula looks like this, in a form that we could calculate by hand and understand:

\[ Pr(Y_i=1|X_i) = \frac{\exp(\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4})}{1 + \exp(\beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \beta_3X_{i3} + \beta_4X_{i4})} \]

Since we are using R to create the model, we fit it with the glm function and the binomial family.
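
To see how `glm()` with the binomial family relates to the formula, the following sketch simulates data from a known logistic model and checks that the fit roughly recovers the coefficients. Everything here is an assumption made for illustration (the seed, the sample size, the single predictor `x` and the values in `true_beta`); none of it comes from the diabetes data.

```r
# Simulate a binary outcome from a known logistic model, then fit it.
set.seed(42)                        # arbitrary seed for reproducibility
n <- 5000
x <- rnorm(n)                       # one simulated predictor
true_beta <- c(-1, 2)               # assumed intercept and slope

# Probability of Y = 1 from the logistic formula, then draw 0/1 outcomes.
p <- exp(true_beta[1] + true_beta[2] * x) /
  (1 + exp(true_beta[1] + true_beta[2] * x))
y <- rbinom(n, size = 1, prob = p)

# family = binomial uses the logit link by default.
fit <- glm(y ~ x, family = binomial)
est <- unname(coef(fit))

print(rbind(true = true_beta, estimated = est))
```

The estimates should land close to the assumed values, which illustrates that `glm()` is estimating the betas in the formula rather than fitting a straight line to the 0/1 outcome.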

Creating the model in R

Feature selection

Most of the correlations are positive, with one exception. We therefore fit a model on the variables with the highest correlations and inspect the z value and p-value of each coefficient to see which ones are useful for prediction. The results are shown below:

## 
## Call:
## glm(formula = Outcome ~ Age + Glucose + SkinThickness + Insulin, 
##     family = "binomial", data = diabetes)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -6.3053188  0.5008493 -12.589  < 2e-16 ***
## Age            0.0261142  0.0075595   3.454 0.000551 ***
## Glucose        0.0369103  0.0035444  10.414  < 2e-16 ***
## SkinThickness  0.0142718  0.0061999   2.302 0.021338 *  
## Insulin       -0.0012551  0.0008665  -1.448 0.147513    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 993.48  on 767  degrees of freedom
## Residual deviance: 791.82  on 763  degrees of freedom
## AIC: 801.82
## 
## Number of Fisher Scoring iterations: 4
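
The coefficients in this output are on the log-odds scale, so they are easier to read after exponentiating. The sketch below copies the estimates from the summary above, converts them to odds ratios, and works through the probability formula by hand for the first row shown in the dataset overview (Age 50, Glucose 148, SkinThickness 35, Insulin 0). It applies to this four-predictor model only, and the variable names `beta` and `patient` are introduced here for illustration.

```r
# Estimates copied from the glm summary output above.
beta <- c(Intercept = -6.3053188, Age = 0.0261142, Glucose = 0.0369103,
          SkinThickness = 0.0142718, Insulin = -0.0012551)

# Odds ratio: multiplicative change in the odds for a one-unit increase
# in a predictor, holding the other predictors fixed.
odds_ratios <- exp(beta[-1])
print(round(odds_ratios, 4))

# Predictor values of the first observation in the dataset overview.
patient <- c(Age = 50, Glucose = 148, SkinThickness = 35, Insulin = 0)

# Linear predictor (log odds), then the logistic formula for the probability.
log_odds <- beta[["Intercept"]] + sum(beta[names(patient)] * patient)
prob <- exp(log_odds) / (1 + exp(log_odds))
print(c(log_odds = log_odds, probability = prob))
```

An odds ratio above 1 (as for Glucose) means the odds of a positive outcome rise as the predictor increases, while a value below 1 (as for Insulin) means they fall.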

Creating the model in R

From this output we see that Insulin is not a useful predictor (its p-value of 0.148 is above 0.05), so we now create a new model using only the three remaining variables, i.e. Age, Glucose and Skin Thickness. The model:

# Fit the reduced model; the binomial family uses the logit link by default
logistic_reg <- glm(Outcome ~ Age + Glucose + SkinThickness,
                    family = "binomial", data = diabetes)

Predicting our outcome

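After the model is created, we can use it to predict, i.e. to estimate the probability that a person has diabetes. Here we take the fitted probabilities stored in the model (logistic_reg$fitted.values), which for the training data match what the predict function in R returns with type = "response".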

predicted <- data.frame(probability.of.Outcome=logistic_reg$fitted.values, 
                          Outcome=diabetes$Outcome)
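
The sketch below illustrates the two ways of getting predicted probabilities from a fitted `glm` object. It uses a small simulated data frame (`toy`, with assumed coefficients and seed) rather than the diabetes data, so the numbers are illustrative only.

```r
# Simulated stand-in data; names and coefficients are assumptions.
set.seed(1)
toy <- data.frame(x = rnorm(200))
toy$y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.5 * toy$x))
fit <- glm(y ~ x, family = binomial, data = toy)

# On the training data, the stored fitted values equal predict() on the
# response (probability) scale.
p_fitted <- unname(fit$fitted.values)
p_predict <- unname(predict(fit, type = "response"))
same <- isTRUE(all.equal(p_fitted, p_predict))

# For new observations, pass a data frame with the same column names.
new_obs <- data.frame(x = c(-2, 0, 2))
p_new <- predict(fit, newdata = new_obs, type = "response")
print(p_new)
```

Without `type = "response"`, `predict()` returns values on the log-odds scale instead of probabilities.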

Data using the selected features

Below is a 3D scatterplot showing how the observations are distributed across the three selected features.

Results

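The code below sorts our predictions by predicted probability in ascending order and shows the first six rows, i.e. the six lowest predicted probabilities. The graph shows the actual outcome given in the table against the results from our prediction model.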

predicted <- predicted[order(predicted$probability.of.Outcome,
                             decreasing = FALSE), ]
head(predicted)

##     probability.of.Outcome Outcome
## 183            0.004614746       0
## 76             0.004741092       0
## 343            0.005518290       0
## 350            0.008021256       1
## 503            0.009780777       1
## 63             0.026046260       0
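
Rows with a low predicted probability but an Outcome of 1, such as 350 and 503 above, are cases the model gets wrong. One common way to summarize this over the whole dataset is to turn probabilities into classes with a cutoff and tabulate them against the actual outcome. The sketch below does this on simulated data; the 0.5 threshold, the seed and the `toy` data frame are assumptions for illustration, not part of the original analysis.

```r
# Simulated stand-in data; names and coefficients are assumptions.
set.seed(7)
toy <- data.frame(x = rnorm(300))
toy$y <- rbinom(300, size = 1, prob = plogis(-0.5 + 2 * toy$x))
fit <- glm(y ~ x, family = binomial, data = toy)

# Classify as 1 when the predicted probability exceeds the threshold.
threshold <- 0.5
pred_class <- as.integer(fitted(fit) > threshold)

# Confusion matrix: rows are predicted classes, columns are actual classes.
confusion <- table(Predicted = factor(pred_class, levels = 0:1),
                   Actual = factor(toy$y, levels = 0:1))
accuracy <- sum(diag(confusion)) / sum(confusion)

print(confusion)
print(accuracy)
```

The same steps would apply to the `predicted` data frame above by comparing `probability.of.Outcome` with a chosen threshold against `Outcome`.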

Predicted Probability plotted

This plot shows our end result of how we used logistic regression to predict the probability of a person having diabetes.

Conclusion

Overall, logistic regression is a powerful and versatile tool that can be used to predict a wide range of binary outcomes. In our case, it predicts the probability of having diabetes.

It is a relatively simple model that is easy to understand and implement.

It has many applications and can benefit multiple industries as a whole.