Logit Implementation

Author

Arvind Sharma

rm(list = ls()) # clear the workspace

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 

Data

Let's load the mtcars data.

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

It is a data frame with 32 observations on 11 (numeric) variables.

  • mpg Miles/(US) gallon

  • cyl Number of cylinders

  • disp Displacement (cu.in.)

  • hp Gross horsepower

  • drat Rear axle ratio

  • wt Weight (1000 lbs)

  • qsec 1/4 mile time

  • vs Engine (0 = V-shaped, 1 = straight)

  • am Transmission (0 = automatic, 1 = manual)

  • gear Number of forward gears

  • carb Number of carburetors

?mtcars # see variables

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, n = 10)
                   mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
stargazer(mtcars, 
          type='text')

============================================
Statistic N   Mean   St. Dev.  Min     Max  
--------------------------------------------
mpg       32 20.091   6.027   10.400 33.900 
cyl       32  6.188   1.786     4       8   
disp      32 230.722 123.939  71.100 472.000
hp        32 146.688  68.563    52     335  
drat      32  3.597   0.535   2.760   4.930 
wt        32  3.217   0.978   1.513   5.424 
qsec      32 17.849   1.787   14.500 22.900 
vs        32  0.438   0.504     0       1   
am        32  0.406   0.499     0       1   
gear      32  3.688   0.738     3       5   
carb      32  2.812   1.615     1       8   
--------------------------------------------

We will model am as the dependent variable. Note that it is binary: Transmission (0 = automatic, 1 = manual).

sort(table(mtcars$am), 
     decreasing = TRUE)

 0  1 
19 13 

OLS

Linear regression can generate fitted values and/or predictions outside the [0, 1] range; in this case, some are negative.

#fit linear regression model
model_lm <- lm(formula =  am ~ disp + hp, 
               data    =  mtcars
               )

#view model summary
summary(model_lm)

Call:
lm(formula = am ~ disp + hp, data = mtcars)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.66396 -0.33221  0.08811  0.29402  0.59786 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.7559286  0.1578183   4.790 4.55e-05 ***
disp        -0.0042892  0.0008776  -4.887 3.47e-05 ***
hp           0.0043626  0.0015865   2.750   0.0102 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3706 on 29 degrees of freedom
Multiple R-squared:  0.4841,    Adjusted R-squared:  0.4485 
F-statistic:  13.6 on 2 and 29 DF,  p-value: 6.8e-05

Vehicles with higher gross horsepower and smaller displacement tend to have manual transmissions.

In Sample Fit / Training Data

The linear probability model gives us predicted probabilities below 0 (though not above 1) in this case. Marginal effects are constant everywhere.

my_mtcars <- mtcars

my_mtcars$predicted_ols <- predict(object = model_lm, 
                                   newdata = mtcars
                                   )

plot(x = my_mtcars$predicted_ols, 
     y = mtcars$am, 
     main="Best-Fit Line and Data Plot", 
     ylim = c(-.5,1.2),
     ylab = "Transmission (1=Automatic)",
     xlab = "Fitted Values"
     )

# Add a line based on the fitted values
lines(x = sort(my_mtcars$predicted_ols), 
      y = sort(model_lm$fitted.values), 
      col = "red", 
      lty = 2
      )

Another way is to simply visualize the fitted values (in sample) and inspect their range. Fitted values (predictions on the original data) are the same as predicted values (generally computed on new data) if we pass the original data to the predict command. We will go with predict() here (it is more general). Let's plot the fitted values; you can see that some of them are below 0.
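
As a quick check (a small sketch, not part of the original code), we can verify that the in-sample predictions equal the fitted values:

# Sanity check: predictions on the original data equal the fitted values
all.equal(model_lm$fitted.values,
          predict(object = model_lm, newdata = mtcars))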

hist(my_mtcars$predicted_ols, 
     main = "Histogram (OLS)", 
     xlab = "Predicted Probabilities"
     )

library(psych) # provides describe()

describe(my_mtcars$predicted_ols)
   vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
X1    1 32 0.41 0.35   0.48    0.43 0.33 -0.37 0.93   1.3 -0.6    -0.64 0.06

Let's look at the distribution of residuals. The four residual diagnostic plots do not suggest a good fit.

par(mfrow = c(2, 2))
plot(model_lm)

Confusion matrix from the model predictions, using the median fitted value (roughly 0.48) as the cutoff.

table_ols <- 
  table(mtcars$am, 
       ifelse(test = model_lm$fitted.values > .48, # roughly the median fitted value
             yes = 1, 
             no = 0)
      )

table_ols
   
     0  1
  0 15  4
  1  1 12

Logistic Regression

glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

Also read up on family option in glm.

?glm 
?family
#fit logistic regression model
model_logit <- glm(formula = am ~ disp + hp, 
                   data    = mtcars, 
                   family  = binomial(link = "logit") # a description of the error distribution and link function to be used in the model. 
             )

#view model summary
summary(model_logit)

Call:
glm(formula = am ~ disp + hp, family = binomial(link = "logit"), 
    data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  1.40342    1.36757   1.026   0.3048  
disp        -0.09518    0.04800  -1.983   0.0474 *
hp           0.12170    0.06777   1.796   0.0725 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 16.713  on 29  degrees of freedom
AIC: 22.713

Number of Fisher Scoring iterations: 8

Interpretation of Output

The glm() function in R is used to fit generalized linear models (GLMs). The call glm(formula = am ~ disp + hp, family = binomial(link = "logit")) is fitting a logistic regression model with the following specifications:

  1. Formula:

    am ~ disp + hp: This specifies the formula for the model. It indicates that the binary response variable am (transmission type: 0 = automatic, 1 = manual) is modeled as a function of the predictor variables disp (engine displacement) and hp (gross horsepower).

  2. Family:

    family = binomial: This specifies the family of the distribution for the response variable. In this case, it’s a binomial distribution, indicating that the response variable is binary (0 or 1). Binomial distribution is commonly used for binary outcomes. The binomial distribution is appropriate for modeling binary outcomes because it represents the number of successes (1s) in a fixed number of independent Bernoulli trials.

  3. Link Function:

    link = "logit": This specifies the link function used to relate the linear predictor (1) to the mean of the distribution (2). For logistic regression, the default link function is the logit function (link = "logit"). The logit link function is the natural logarithm of the odds of success, and it’s suitable for binary outcomes.

Putting it all together, the glm() call fits a logistic regression model for the probability of having a manual transmission (am = 1) as a function of engine displacement (disp) and horsepower (hp). The model estimates the coefficients of the predictors and uses the logistic link function to transform the linear predictor into probabilities between 0 and 1.

You would typically interpret the coefficients of this model in terms of log-odds. The logistic regression model estimates how the log-odds of having a manual transmission change with changes in engine displacement and horsepower. If you want to interpret the results in terms of odds ratios or probabilities, additional transformations are needed, as shown below.
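
For example, one common transformation (a minimal sketch, not part of the original notes) is to exponentiate the coefficients to obtain odds ratios:

# Odds ratios: exponentiate the log-odds coefficients
exp(coef(model_logit))

# Approximate (Wald) 95% confidence intervals on the odds-ratio scale
exp(confint.default(model_logit))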

Estimation with MLE

The message "Number of Fisher Scoring iterations: 8" typically appears in the output of a logistic regression model fitted using maximum likelihood estimation (MLE). This message provides information about the optimization process used to find the parameter estimates for the logistic regression model.

Here’s what each part of the message means:

  • Fisher Scoring:

Fisher Scoring is an iterative optimization algorithm used for fitting generalized linear models (GLMs) such as logistic regression, Poisson regression, and others. It aims to maximize the likelihood function, finding the parameter values that best explain the observed data. Some software lets you change the solver, for example to a Newton-Raphson algorithm.

  • Number of Iterations (8):

The "Number of Fisher Scoring iterations" indicates how many times the optimization algorithm iterated through the process of updating parameter estimates to maximize the likelihood. In logistic regression, this is done to find the values of coefficients that maximize the likelihood of observing the given data.

In this case, the algorithm took 8 iterations to converge to a solution. Convergence means that the parameter estimates are stable and do not change significantly between iterations. The algorithm stops either when a maximum number of iterations is reached or when the change in parameter estimates falls below a certain threshold (a small sketch of how to inspect and adjust this follows below).

A higher number of iterations might suggest that the optimization algorithm needed more steps to converge, possibly because the model is complex, or the data require more refinement to find the maximum likelihood estimates.
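
If you want to watch the iterations or adjust the convergence settings, glm() takes a control argument (a sketch using the standard glm.control() options; the specific values are illustrative):

# Trace the Fisher Scoring iterations and raise the iteration limit
model_logit_trace <- glm(formula = am ~ disp + hp,
                         data    = mtcars,
                         family  = binomial(link = "logit"),
                         control = glm.control(maxit = 50, trace = TRUE)
                         )

model_logit_trace$converged # TRUE if the algorithm converged
model_logit_trace$iter      # number of iterations actually used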

predict.glm

Read up on predict.glm - the type option in particular.

type: The type of prediction required.

The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities.
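
As a small illustration (not part of the original code), applying the inverse-logit function plogis() to the default log-odds predictions reproduces type = "response":

# Default scale: linear predictor (log-odds)
log_odds <- predict(model_logit, newdata = mtcars)

# Response scale: predicted probabilities
probs <- predict(model_logit, newdata = mtcars, type = "response")

# The inverse logit of the log-odds equals the predicted probabilities
all.equal(plogis(log_odds), probs)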

Notes on converting probabilities into (binary) outcomes

Deciding on a cutoff value for logistic probabilities is a crucial step in converting predicted probabilities into binary outcomes (0 or 1). Typically, the default cutoff is 0.5, but depending on your specific objectives and the characteristics of your data, exploring and tuning this cutoff value can lead to improved model performance.

Let’s discuss the importance of exploring and tuning cutoff values:

  1. Default Cutoff (0.5): The default cutoff of 0.5 is commonly used, but it may not always be the most appropriate choice, especially if the dataset is imbalanced or if the costs of false positives and false negatives differ significantly.

  2. Explore Different Cutoff Values: Start by exploring a range of cutoff values beyond 0.5. Try the mean (imbalanced data), or the median (balanced data). You can systematically test cutoff values from 0 to 1 to understand how they affect the model’s performance metrics.

    • Plot performance metrics such as accuracy, precision, recall, and F1 score as functions of the cutoff value. This exploration helps you understand the trade-offs between different types of errors and identify the cutoff that optimizes the metric most relevant to your application.
  3. Consider Domain Knowledge: Incorporate domain knowledge and the specific requirements of your problem when choosing cutoff values. For example, in medical diagnosis, the costs associated with false positives (misdiagnosing a healthy individual) and false negatives (missing a disease) may vary, influencing the optimal cutoff.

  4. Stratified Sampling: If your dataset is imbalanced, consider using stratified sampling techniques to ensure that each class is represented proportionally in your training and validation sets. This ensures that the performance evaluation is not biased towards the majority class.

  5. Cross-Validation: Use cross-validation to assess the robustness of different cutoff values. By splitting your data into multiple folds and evaluating performance across different subsets, you can obtain more reliable estimates of model performance for various cutoff values.

  6. Tune Cutoff Value: Once you’ve explored a range of cutoff values and identified promising candidates, fine-tune the cutoff value using techniques like grid search or random search. These methods systematically search the space of possible cutoff values to find the one that optimizes your chosen performance metric.

  7. Evaluate on Holdout Set: Finally, evaluate the performance of your model with the selected cutoff value on a holdout set or test set that was not used during hyperparameter tuning. This provides an unbiased estimate of the model’s generalization performance.

By exploring and tuning cutoff values, you can optimize the performance of your logistic regression model, ensuring it meets the specific requirements of your application and delivers the best possible results.
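
Here is a minimal sketch of such an exploration on the training data (base R only; the grid of cutoffs and the focus on accuracy are illustrative choices, not the only reasonable ones):

# Sweep cutoff values and record in-sample accuracy for the logit model
cutoffs  <- seq(from = 0.05, to = 0.95, by = 0.05)
accuracy <- sapply(cutoffs, function(c) {
  pred_class <- ifelse(model_logit$fitted.values > c, 1, 0)
  mean(pred_class == mtcars$am)
})

# Cutoff with the highest in-sample accuracy
cutoffs[which.max(accuracy)]

plot(x = cutoffs, y = accuracy, type = "b",
     main = "Accuracy by Cutoff (Training Data)",
     xlab = "Cutoff", ylab = "Accuracy")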

Notes on Datasets for Exploring and Tuning Cutoff Values:

  1. Training Dataset:

    • The training dataset serves as the foundation for building the logistic regression model. It comprises labeled examples, where both the features (independent variables) and the target variable (dependent variable) are known. During model development, this dataset is utilized to train the model’s parameters.
  2. Validation Dataset:

    • The validation dataset is a subset of the training data that facilitates model evaluation and hyperparameter tuning. It plays a crucial role in assessing the model’s performance with various hyperparameter configurations, including cutoff values for logistic probabilities.

    • By exploring different cutoff values on this dataset, one can gauge how effectively the model generalizes to unseen data and select the optimal cutoff value for deployment.

Evaluation Dataset for Model Validation:

  1. Test Dataset:

    • The test dataset remains entirely separate from both the training and validation datasets during model development. Its purpose is to provide an independent assessment of the model’s performance after hyperparameter tuning.

    • By evaluating the model with the chosen cutoff value on the test dataset, one obtains an unbiased estimation of the model’s ability to generalize to new, unseen data, ensuring the model’s reliability and effectiveness in real-world applications.

  • It is fine in Economics if you just have train and test data (70%-30%), as in your homework, though in Data Science (big data) people often split their data three ways (e.g., 70%-15%-15%). These proportions need not be exact (this is flexible, as it is an art as well); a small split sketch follows below.
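
A minimal sketch of a 70%-30% split on mtcars (the seed, proportions, and 0.5 cutoff are illustrative):

# Simple 70/30 train-test split
set.seed(123)
train_idx <- sample(x = nrow(mtcars), size = round(0.7 * nrow(mtcars)))

train_data <- mtcars[train_idx, ]
test_data  <- mtcars[-train_idx, ]

# Fit on the training set, then evaluate on the held-out test set
model_split <- glm(am ~ disp + hp, data = train_data,
                   family = binomial(link = "logit"))
test_prob   <- predict(model_split, newdata = test_data, type = "response")
table(test_data$am, ifelse(test_prob > 0.5, 1, 0))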

In Sample Fit / Training Data

type = "response" gives the predicted probabilities, which we will use.

?predict.glm
predicted_logit <- predict(object  = model_logit, 
                           newdata = mtcars, # the same (training) data
                           type    = "response"
                          )

hist(predicted_logit, 
     main = "Histogram (Logitic Regression)", 
     xlab = "Predicted Probabilities"
     )

summary(predicted_logit)  
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
0.0000000 0.0005646 0.3921834 0.4062500 0.8594700 0.9999987 

Now, fitted values do not have negative probabilities.

table_logit <- 
table(mtcars$am, 
      ifelse(test = model_logit$fitted.values > 0.39, # median 
             yes = 1, 
             no = 0)
      )

table_logit
   
     0  1
  0 15  4
  1  0 13
table_ols
   
     0  1
  0 15  4
  1  1 12

Out of Sample Fit / Testing Data

This is what you would do in your assignment: apply the model to new data.

newdata = data.frame(disp=c(200, 180, 160, 140, 120, 120, 100, 160),
                     hp  =c(100, 90 , 108, 90 , 80 , 90 , 80 , 90 ),
                     am  =c(0  , 0  , 0  , 1  , 0  , 1  , 1  , 1  )
                     )

#view data frame
newdata
  disp  hp am
1  200 100  0
2  180  90  0
3  160 108  0
4  140  90  1
5  120  80  0
6  120  90  1
7  100  80  1
8  160  90  1
#use model to predict value of am for all new cars
newdata$am_prob <- predict(model_logit, 
                           newdata = newdata, 
                           type="response"
                          )

describe(newdata$am_prob)
   vars n mean   sd median trimmed  mad min  max range skew kurtosis   se
X1    1 8 0.33 0.32   0.31    0.33 0.41   0 0.84  0.83 0.37    -1.59 0.11
#view updated data frame
newdata
  disp  hp am     am_prob
1  200 100  0 0.004225640
2  180  90  0 0.008361069
3  160 108  0 0.335916069
4  140  90  1 0.275162866
5  120  80  0 0.429961894
6  120  90  1 0.718090728
7  100  80  1 0.835013994
8  160  90  1 0.053546152

Again, no negative probabilities.

Probit Regression

The probit link function is based on the cumulative distribution function (CDF) of the standard normal distribution. It is defined as \(\Phi^{-1}(p)\), where \(\Phi\) is the standard normal CDF and p is the probability of success. The corresponding response curve (the normal CDF) is S-shaped like the logistic curve, and logit and probit often give similar answers.
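
In R the two link functions and their inverses are available directly, which is a quick way to see the relationship (a small illustration, not part of the original code):

p <- 0.75

# Logit link and its inverse
qlogis(p)         # log(p / (1 - p)), the log-odds
plogis(qlogis(p)) # back to the probability 0.75

# Probit link and its inverse
qnorm(p)          # Phi^{-1}(p), the standard normal quantile
pnorm(qnorm(p))   # back to the probability 0.75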

Just change the link function in logistic regression from logit to probit.

?make.link
model_probit <- glm(formula = am ~ disp + hp, 
                   data    = mtcars, 
                   family  = binomial(link = "probit") # a description of the error distribution and link function to be used in the model. 
             )
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#view model summary
summary(model_probit)

Call:
glm(formula = am ~ disp + hp, family = binomial(link = "probit"), 
    data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  0.85931    0.81028   1.061   0.2889  
disp        -0.05634    0.02683  -2.099   0.0358 *
hp           0.07169    0.03870   1.853   0.0639 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 16.493  on 29  degrees of freedom
AIC: 22.493

Number of Fisher Scoring iterations: 10
library(stargazer)

stargazer(model_lm, model_logit, model_probit,
          type = "text"
          )

============================================================
                              Dependent variable:           
                    ----------------------------------------
                                       am                   
                             OLS           logistic  probit 
                             (1)             (2)      (3)   
------------------------------------------------------------
disp                      -0.004***        -0.095** -0.056**
                           (0.001)         (0.048)  (0.027) 
                                                            
hp                         0.004**          0.122*   0.072* 
                           (0.002)         (0.068)  (0.039) 
                                                            
Constant                   0.756***         1.403    0.859  
                           (0.158)         (1.368)  (0.810) 
                                                            
------------------------------------------------------------
Observations                  32              32       32   
R2                          0.484                           
Adjusted R2                 0.448                           
Log Likelihood                              -8.356   -8.247 
Akaike Inf. Crit.                           22.713   22.493 
Residual Std. Error    0.371 (df = 29)                      
F Statistic         13.605*** (df = 2; 29)                  
============================================================
Note:                            *p<0.1; **p<0.05; ***p<0.01

Logit is often computationally simpler, and statistical packages typically default to logit in logistic regression models. Probit models may involve additional computational complexities related to the standard normal distribution.

Also, interpretation of coefficients may be more intuitive in logit models, as they directly relate to changes in log-odds. Interpretation in probit models involves changes in the standard normal deviate associated with a one-unit change in the predictor.
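
One way to compare the two models on a more interpretable scale (a sketch; the evaluation points are arbitrary) is to look at predicted probabilities rather than raw coefficients:

# Predicted probability for a one-unit change in hp, holding disp at its mean
at_vals <- data.frame(disp = mean(mtcars$disp), hp = c(100, 101))

predict(model_logit,  newdata = at_vals, type = "response")
predict(model_probit, newdata = at_vals, type = "response")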

Useful Readings

  1. UCLA
  2. Statology