Logistic Regression

Sameer Mathur

Demonstration using mtcars

---

Logistic Regression

What is Logistic Regression?

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).

Like all regression analyses, the logistic regression is a predictive analysis.

Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Regression versus Logistic Regression

Regression:

The objective is to estimate the expected (mean) value of the dependent variable, given the independent variables.

Logistic Regression:

The objective is to find the probability of an event occurring, given the independent variables.

Application of Logistic Regression

  • Email: Spam / Not Spam

  • Online Transaction: Fraudulent / Not Fraudulent (Yes / No)

  • HR Status: Joining / Not Joining

  • Credit Scoring: Defaulter / Non-defaulter

Major Assumptions of Logistic Regression

  • The dependent variable should be dichotomous in nature (e.g., presence vs. absence).

  • There should be no outliers in the data.

  • There should be no high correlations (multicollinearity) among the predictors (a quick informal check for the example below follows this list).
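
As a sketch of that last check (not a formal diagnostic; variance inflation factors would be the usual formal tool), the pairwise correlation between the two predictors used later in this demonstration can be inspected directly:

# quick informal check of collinearity between the predictors used below
cor(mtcars$hp, mtcars$wt)   # roughly 0.66 in mtcars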

Logistic Regression Equation

We use the logistic regression equation to predict the probability of the dependent variable taking one of the two dichotomous values, 0 or 1.

Logistic Regression Equation

Suppose \( x_1, x_2, x_3, \ldots, x_p \) are the independent variables, \( \alpha \) and \( \beta_k \; (k = 1, 2, \ldots, p) \) are the parameters, and \( E(y) \) is the expected value of the dependent variable \( y \). Then the logistic regression equation is

\[ E(y) = \frac{1}{1 + e^{-(\alpha + \sum_{k} \beta_k x_k)}} \]
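
As a small illustrative sketch (the function name logistic below is purely illustrative, not part of the worked example), the right-hand side is simply the logistic (sigmoid) function applied to a linear predictor, which in R can be written as:

# logistic (sigmoid) function: maps a linear predictor to a probability in (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))
logistic(0)             # 0.5
logistic(c(-4, 0, 4))   # approximately 0.018, 0.500, 0.982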

Example

For example, in the built-in data set mtcars, the data column am represents the transmission type of the automobile model (0 = automatic, 1 = manual).

With the logistic regression equation, we can model the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.

\[ P(\text{manual transmission}) = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot \text{horsepower} + \beta_2 \cdot \text{weight})}} \]

Problem

Using the logistic regression model of vehicle transmission fitted to the data set mtcars, estimate the probability of a vehicle being fitted with a manual transmission if it has a 120 hp engine and weighs 2800 lbs.

First Few Rows of the Dataset

# first few rows of the dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Fitting the Logistic Regression Model

We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.

# fitting logistic regression model
am.glm <- glm(am ~ hp + wt, 
              data = mtcars, 
              family = binomial)
# summary of the model
summary(am.glm)

Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8
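
As an optional aside (a sketch, not part of the original problem), the fitted coefficients can be exponentiated to read them as odds ratios; for example, each additional 1000 lbs of weight multiplies the odds of a manual transmission by exp(-8.08), i.e. reduces them sharply:

# express the fitted coefficients as odds ratios (optional aside)
exp(coef(am.glm))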

Create a Single Value Dataframe

We then wrap the test parameters inside a data frame named newdata. Note that wt in mtcars is recorded in units of 1000 lbs, so a 2800 lbs vehicle corresponds to wt = 2.8.

# create a single value dataframe
newdata <- data.frame(hp = 120, wt = 2.8)
newdata
   hp  wt
1 120 2.8

Prediction of the Generalized Linear Model

Now we apply the function predict to the generalized linear model am.glm along with newdata. We have to set the prediction type to "response" in order to obtain the predicted probability (the default type, "link", returns the linear predictor on the log-odds scale).

# prediction of glm
predict(am.glm, newdata, type = "response") 
        1 
0.6418125 

Answer

For an automobile with a 120 hp engine and a weight of 2800 lbs, the probability of it being fitted with a manual transmission is about 64%.
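
As a cross-check (a sketch only), the same probability can be reproduced by plugging the fitted coefficients into the logistic regression equation by hand:

# verify the prediction by plugging the fitted coefficients into the equation
b <- coef(am.glm)                                        # (Intercept), hp, wt
eta <- b["(Intercept)"] + b["hp"] * 120 + b["wt"] * 2.8  # linear predictor
1 / (1 + exp(-eta))                                      # about 0.64, as above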

Significance Test for Logistic Regression

We can test whether there is a statistically significant relationship between the dependent variable \( y \) and the independent variables \( x_k \; (k = 1, 2, \ldots, p) \) in the logistic regression equation.

In particular, if the null hypothesis that \( \beta_k = 0 \) cannot be rejected for some \( k \), then the corresponding \( x_k \) is statistically insignificant in the logistic regression model.

Problem

At the 0.05 significance level, decide whether any of the independent variables in the logistic regression model of vehicle transmission in the data set mtcars is statistically insignificant.

Solution

We print out the summary of the generalized linear model (with summary(am.glm), as above) and check the p-values of the hp and wt variables.


Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.2537  -0.1568  -0.0168   0.1543   1.3449  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
hp           0.03626    0.01773   2.044  0.04091 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8
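
The same p-values can also be pulled out of the coefficient table programmatically (a small sketch, equivalent to reading the Pr(>|z|) column above):

# extract the Wald-test p-values from the coefficient table
coef(summary(am.glm))[, "Pr(>|z|)"]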

Answer

As the p-values of the hp and wt variables are both less than 0.05, neither hp nor wt is statistically insignificant in the logistic regression model; both predictors are significant at the 0.05 level.