Sameer Mathur
Demonstration using mtcars
---
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).
Like all regression analyses, logistic regression is a predictive analysis.
Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
In linear regression, the objective is to estimate the expected (mean) value of the dependent variable given the independent variables.
In logistic regression, the objective is to find the probability of an event given the independent variables.
Email: Spam / Not Spam
Online Transaction: Fraudulent / Not Fraudulent (Yes / No)
HR Status: Joining / Not Joining
Credit Scoring: Defaulter / Non-defaulter
The dependent variable should be dichotomous in nature (e.g., present vs. absent).
There should be no outliers in the data.
There should be no high correlations (multicollinearity) among the predictors.
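As a rough illustration, these assumptions might be checked along the following lines. This is a minimal sketch, assuming the built-in mtcars data with am as the dependent variable and hp and wt as predictors. The 1.5 * IQR boxplot rule used to flag outliers is only one common convention, and the names is_binary, outliers_hp, outliers_wt and predictor_cor are hypothetical.

```r
# Assumption checks on the built-in mtcars data (illustrative only)

# 1. The dependent variable should take exactly two distinct values
table(mtcars$am)
is_binary <- length(unique(mtcars$am)) == 2

# 2. Flag potential outliers in each predictor with the 1.5 * IQR boxplot rule
outliers_hp <- boxplot.stats(mtcars$hp)$out
outliers_wt <- boxplot.stats(mtcars$wt)$out

# 3. Correlation between the predictors; values close to +1 or -1
#    would suggest multicollinearity
predictor_cor <- cor(mtcars$hp, mtcars$wt)
predictor_cor
```

Flagged points are candidates for inspection rather than automatic removal, and what counts as a "high" correlation is a judgment call.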
We use the logistic regression equation to predict the probability that a dichotomous dependent variable takes the value 1 rather than 0.
Suppose \( x_1, x_2, x_3, \ldots, x_p \) are the independent variables, \( \alpha \) and \( \beta_k; k = 1, 2, \ldots, p \) are the parameters, and \( E(y) \) is the expected value of the dependent variable \( y \), then the logistic regression equation is
\[ E(y) = \frac{1}{1 + e^{-(\alpha + \sum_{k} \beta_k x_k)}} \]
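Dividing \( E(y) \) by \( 1 - E(y) \) and taking logarithms gives the equivalent log-odds (logit) form of the same equation. It shows that the model is linear in the parameters on the log-odds scale:
\[ \log \frac{E(y)}{1 - E(y)} = \alpha + \sum_{k} \beta_k x_k \]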
For example, in the built-in data set mtcars, the data column am represents the transmission type of the automobile model (0 = automatic, 1 = manual).
With the logistic regression equation, we can model the probability of a manual transmission in a vehicle based on its engine horsepower and weight data.
\[ P(\text{manual transmission}) = \frac{1}{1 + e^{-(\alpha + \beta_1 \cdot \text{Horsepower} + \beta_2 \cdot \text{Weight})}} \]
Using the logistic regression equation of vehicle transmission in the data set mtcars, estimate the probability that a vehicle is fitted with a manual transmission if it has a 120 hp engine and weighs 2800 lbs.
# first few rows of the dataset
head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
We apply the function glm to a formula that describes the transmission type (am) by the horsepower (hp) and weight (wt). This creates a generalized linear model (GLM) in the binomial family.
# fitting logistic regression model
am.glm <- glm(am ~ hp + wt,
data = mtcars,
family = binomial)
# summary of the model
summary(am.glm)
Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2537 -0.1568 -0.0168 0.1543 1.3449
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.86630    7.44356   2.535  0.01126 *
hp           0.03626    0.01773   2.044  0.04091 *
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 10.059 on 29 degrees of freedom
AIC: 16.059
Number of Fisher Scoring iterations: 8
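The estimates in the summary are on the log-odds scale, so one way to read them is to exponentiate them into odds ratios. The following is a minimal sketch of that reading, not part of the original demonstration. The name odds_ratios is hypothetical, and the model is refitted so the block runs on its own.

```r
# Refit the model so this block is self-contained
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Exponentiated coefficients: multiplicative change in the odds of a
# manual transmission per one-unit increase in each predictor
# (wt is measured in units of 1000 lbs)
odds_ratios <- exp(coef(am.glm))
odds_ratios
```

An odds ratio above 1 (hp) means the odds of a manual transmission rise as the predictor increases, while a ratio below 1 (wt) means they fall.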
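We then wrap the test parameters inside a data frame newdata. Because wt is recorded in units of 1000 lbs in mtcars, a weight of 2800 lbs is entered as wt = 2.8.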
# create a single value dataframe
newdata <- data.frame(hp = 120, wt = 2.8)
newdata
hp wt
1 120 2.8
Now we apply the function predict to the generalized linear model am.glm along with newdata. We set type = "response" to obtain the predicted probability; the default prediction type returns the value on the log-odds scale instead.
# prediction of glm
predict(am.glm, newdata, type = "response")
1
0.6418125
For an automobile with a 120 hp engine that weighs 2800 lbs, the probability of it being fitted with a manual transmission is about 64%.
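The same probability can be reproduced by plugging the fitted coefficients into the logistic regression equation by hand. The sketch below does this under the assumption that the model is refitted exactly as above. The names b, log_odds, p_manual and p_predict are hypothetical.

```r
# Refit the model so this block is self-contained
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Linear predictor (log-odds) for hp = 120 and wt = 2.8 (2800 lbs)
b <- coef(am.glm)
log_odds <- b[["(Intercept)"]] + b[["hp"]] * 120 + b[["wt"]] * 2.8

# Logistic regression equation: 1 / (1 + exp(-log_odds))
p_manual <- 1 / (1 + exp(-log_odds))
p_manual

# Cross-check against predict() with type = "response"
p_predict <- predict(am.glm, data.frame(hp = 120, wt = 2.8), type = "response")
all.equal(p_manual, unname(p_predict))
```

The hand computation and predict should agree up to floating-point error, which confirms that type = "response" simply applies the logistic equation to the fitted linear predictor.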
We can decide whether there is any significant relationship between the dependent variable \( y \) and the independent variables \( x_k (k = 1, 2, \ldots, p) \) in the logistic regression equation.
In particular, if the null hypothesis \( \beta_k = 0 \) cannot be rejected for some \( k \) \( (k = 1, 2, \ldots, p) \), then \( x_k \) is statistically insignificant in the logistic regression model.
At the 0.05 significance level, decide whether any of the independent variables in the logistic regression model of vehicle transmission in the data set mtcars is statistically insignificant.
We print the summary of the generalized linear model and check the p-values of the hp and wt variables.
Call:
glm(formula = am ~ hp + wt, family = binomial, data = mtcars)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2537 -0.1568 -0.0168 0.1543 1.3449
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 18.86630    7.44356   2.535  0.01126 *
hp           0.03626    0.01773   2.044  0.04091 *
wt          -8.08348    3.06868  -2.634  0.00843 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 10.059 on 29 degrees of freedom
AIC: 16.059
Number of Fisher Scoring iterations: 8
As the p-values of the hp and wt variables are both less than 0.05, neither hp nor wt is insignificant in the logistic regression model.
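Instead of reading the p-values off the printed summary, they can also be extracted from the coefficient table and compared with the significance level directly. This is a minimal sketch. The names coef_table, p_values and alpha are hypothetical, and the model is refitted so the block runs on its own.

```r
# Refit the model so this block is self-contained
am.glm <- glm(am ~ hp + wt, data = mtcars, family = binomial)

# Coefficient table from the model summary, as a numeric matrix
coef_table <- coef(summary(am.glm))

# p-values of the two predictors (column name as printed by summary)
p_values <- coef_table[c("hp", "wt"), "Pr(>|z|)"]

# TRUE means the variable is significant at the chosen level
alpha <- 0.05
p_values < alpha
```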