rm(list = ls()) # clear the workspace

library(stargazer)
## 
## Please cite as:
##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(psych)

1 Data

Let's load the mtcars data.

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

It is a data frame with 32 observations on 11 (numeric) variables.

  • mpg Miles/(US) gallon

  • cyl Number of cylinders

  • disp Displacement (cu.in.)

  • hp Gross horsepower

  • drat Rear axle ratio

  • wt Weight (1000 lbs)

  • qsec 1/4 mile time

  • vs Engine (0 = V-shaped, 1 = straight)

  • am Transmission (0 = automatic, 1 = manual)

  • gear Number of forward gears

  • carb Number of carburetors

?mtcars # see variables

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, n = 10)
##                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
stargazer(mtcars, 
          type='text')
## 
## ============================================
## Statistic N   Mean   St. Dev.  Min     Max  
## --------------------------------------------
## mpg       32 20.091   6.027   10.400 33.900 
## cyl       32  6.188   1.786     4       8   
## disp      32 230.722 123.939  71.100 472.000
## hp        32 146.688  68.563    52     335  
## drat      32  3.597   0.535   2.760   4.930 
## wt        32  3.217   0.978   1.513   5.424 
## qsec      32 17.849   1.787   14.500 22.900 
## vs        32  0.438   0.504     0       1   
## am        32  0.406   0.499     0       1   
## gear      32  3.688   0.738     3       5   
## carb      32  2.812   1.615     1       8   
## --------------------------------------------

We will model am as the dependent variable. Note that it is binary: transmission type (0 = automatic, 1 = manual).

sort(table(mtcars$am), 
     decreasing = TRUE)
## 
##  0  1 
## 19 13
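
As a quick check of class balance, we can also look at proportions (a small sketch using base R's prop.table()):

prop.table(table(mtcars$am))
## 
##       0       1 
## 0.59375 0.40625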

2 OLS

Linear regression can generate fitted values and/or predictions that fall outside the [0, 1] range; in this case, some are negative.

#fit linear regression model
model_lm <- lm(formula =  am ~ disp + hp, 
               data    =  mtcars
               )

#view model summary
summary(model_lm)
## 
## Call:
## lm(formula = am ~ disp + hp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.66396 -0.33221  0.08811  0.29402  0.59786 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.7559286  0.1578183   4.790 4.55e-05 ***
## disp        -0.0042892  0.0008776  -4.887 3.47e-05 ***
## hp           0.0043626  0.0015865   2.750   0.0102 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3706 on 29 degrees of freedom
## Multiple R-squared:  0.4841, Adjusted R-squared:  0.4485 
## F-statistic:  13.6 on 2 and 29 DF,  p-value: 6.8e-05

Vehicles with higher gross horsepower and smaller displacement tend to have manual transmissions.

2.0.1 In Sample Fit / Training Data

The linear probability model gives us predicted probabilities below 0 (though not above 1) in this case. Marginal effects are constant everywhere.

my_mtcars <- mtcars

my_mtcars$predicted_ols <- predict(object = model_lm, 
                                   newdata = mtcars
                                   )

plot(x = my_mtcars$predicted_ols, 
     y = mtcars$am, 
     main="Best-Fit Line and Data Plot", 
     ylim = c(-.5,1.2),
     ylab = "Transmission (1=Automatic)",
     xlab = "Fitted Values"
     )

# Add a 45-degree reference line through the sorted fitted values
# (in sample, predicted values equal fitted values)
lines(x = sort(my_mtcars$predicted_ols), 
      y = sort(model_lm$fitted.values), 
      col = "red", 
      lty = 2
      )

Another way is to simply visualize the fitted values (in sample) directly. Fitted values (predictions on the original data) are identical to predicted values (generally computed on new data) when we pass the original data to the predict command, as the check below confirms. We will go with predict here since it is more general. Plotting the predicted values, you can see that some fall below 0.
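
A quick sanity check (a minimal sketch) confirms that the two are identical here:

all.equal(unname(model_lm$fitted.values), 
          unname(my_mtcars$predicted_ols)
          )
## [1] TRUE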

hist(my_mtcars$predicted_ols, 
     main = "Histogram (OLS)", 
     xlab = "Predicted Probabilities"
     )

describe(my_mtcars$predicted_ols)
##    vars  n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 32 0.41 0.35   0.48    0.43 0.33 -0.37 0.93   1.3 -0.6    -0.64 0.06

Let's look at the distribution of the residuals. The four diagnostic plots do not suggest a good fit.

par(mfrow = c(2, 2))
plot(model_lm)

Next, a confusion matrix from the model predictions, using a cutoff of 0.48 (the median predicted value).

table_ols <- 
  table(mtcars$am, 
       ifelse(test = model_lm$fitted.values >.48, 
             yes = 1, 
             no = 0)
      )

table_ols
##    
##      0  1
##   0 15  4
##   1  1 12

3 Logistic Regression

glm is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.

Also read up on the family option in glm.

?glm 
?family
#fit logistic regression model
model_logit <- glm(formula = am ~ disp + hp, 
                   data    = mtcars, 
                   family  = binomial(link = "logit") # a description of the error distribution and link function to be used in the model. 
             )

#view model summary
summary(model_logit)
## 
## Call:
## glm(formula = am ~ disp + hp, family = binomial(link = "logit"), 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9665  -0.3090  -0.0017   0.3934   1.3682  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  1.40342    1.36757   1.026   0.3048  
## disp        -0.09518    0.04800  -1.983   0.0474 *
## hp           0.12170    0.06777   1.796   0.0725 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 16.713  on 29  degrees of freedom
## AIC: 22.713
## 
## Number of Fisher Scoring iterations: 8

The glm() function in R is used to fit generalized linear models (GLMs). The call glm(formula = am ~ disp + hp, family = binomial(link = "logit")) is fitting a logistic regression model with the following specifications:

Formula:

am ~ disp + hp: This specifies the formula for the model. It indicates that the binary response variable am (transmission type, 0 = automatic and 1 = manual) is modeled as a function of the predictor variables disp (engine displacement) and hp (horsepower).

Family:

family = binomial: This specifies the family of the distribution for the response variable. In this case, it’s a binomial distribution, indicating that the response variable is binary (0 or 1). Binomial distribution is commonly used for binary outcomes.

Link Function:

link = "logit": This specifies the link function used to relate the linear predictor to the mean of the distribution. For logistic regression, the default link function is the logit function (link = "logit"). The logit link function is the natural logarithm of the odds of success, and it’s suitable for binary outcomes.

Putting it all together, the glm() call fits a logistic regression model for the probability of a manual transmission (am = 1) as a function of engine displacement (disp) and horsepower (hp). The model estimates the coefficients of the predictors and uses the logistic link function to transform the linear predictor into probabilities between 0 and 1.

You would typically interpret the coefficients of this model in terms of log-odds: the model estimates how the log-odds of a manual transmission change with engine displacement and horsepower. To interpret the results in terms of odds ratios or probabilities, additional transformations are needed.
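
For example (a minimal sketch), exponentiating the coefficients gives odds ratios:

# odds ratios: multiplicative change in the odds of a manual
# transmission for a one-unit increase in each predictor
exp(coef(model_logit))

Here exp(-0.095) ≈ 0.91 for disp and exp(0.122) ≈ 1.13 for hp: each extra cubic inch of displacement multiplies the odds of a manual transmission by roughly 0.91, while each extra unit of horsepower multiplies them by roughly 1.13.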

3.0.0.1 MLE

The message "Number of Fisher Scoring iterations: 8" typically appears in the output of a logistic regression model fitted using maximum likelihood estimation (MLE). This message provides information about the optimization process used to find the parameter estimates for the logistic regression model.

Here’s what each part of the message means:

  • Fisher Scoring:

Fisher Scoring is an iterative optimization algorithm used for fitting generalized linear models (GLMs), such as logistic regression and Poisson regression. It aims to maximize the likelihood function, finding the parameter values that best explain the observed data.

  • Number of Iterations (8):

The "Number of Fisher Scoring iterations" indicates how many times the optimization algorithm iterated through the process of updating parameter estimates to maximize the likelihood. In logistic regression, this is done to find the values of coefficients that maximize the likelihood of observing the given data.

In our case, the algorithm took 8 iterations to converge to a solution. Convergence means that the parameter estimates are stable and do not change significantly between iterations. The algorithm stops either when a maximum number of iterations is reached or when the change in parameter estimates falls below a certain threshold.

A higher number of iterations might suggest that the optimization algorithm needed more steps to converge, possibly because the model is complex, or the data require more refinement to find the maximum likelihood estimates.

You can control the optimization with glm.control(), in particular its maxit (maximum number of iterations) and epsilon (convergence tolerance) arguments.
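
A minimal sketch (model_logit_ctrl is just an illustrative name; trace = TRUE prints the deviance at each Fisher scoring step):

model_logit_ctrl <- glm(formula = am ~ disp + hp, 
                        data    = mtcars, 
                        family  = binomial(link = "logit"), 
                        control = glm.control(maxit = 50, trace = TRUE) # raise the iteration cap and trace convergence
                        )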

3.0.1 predict.glm

Read up on predict.glm - the type option in particular.

type: The type of prediction required.

The default is on the scale of the linear predictors; the alternative "response" is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response" gives the predicted probabilities.
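
A minimal sketch contrasting the two scales with the model fitted above (log_odds and probs are illustrative names; plogis() is the inverse logit):

log_odds <- predict(model_logit, newdata = mtcars)                     # default: log-odds scale
probs    <- predict(model_logit, newdata = mtcars, type = "response")  # probability scale

# applying the inverse logit to the log-odds recovers the probabilities
all.equal(unname(plogis(log_odds)), unname(probs))
## [1] TRUE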

3.0.2 In Sample Fit / Training Data

type = "response" gives the predicted probabilities, which we will use.

?predict.glm
predicted_logit <- predict(object  = model_logit, 
                           newdata = mtcars, # the same (training) data
                           type    = "response"
                          )

hist(predicted_logit, 
     main = "Histogram (Logitic Regression)", 
     xlab = "Predicted Probabilities"
     )

summary(predicted_logit)  
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0005646 0.3921834 0.4062500 0.8594700 0.9999987

Now the fitted values contain no negative probabilities. Below, we build the logit confusion matrix using a cutoff of 0.39 (close to the median predicted probability) and compare it with the OLS one.

table_logit <- 
table(mtcars$am, 
      ifelse(test = model_logit$fitted.values > 0.39, # median 
             yes = 1, 
             no = 0)
      )

table_logit
##    
##      0  1
##   0 15  4
##   1  0 13
table_ols
##    
##      0  1
##   0 15  4
##   1  1 12
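
A quick way to compare the two tables (a small sketch): in-sample accuracy, i.e. the share of cars on the diagonal of each table.

sum(diag(table_ols))   / sum(table_ols)   # (15 + 12) / 32
## [1] 0.84375
sum(diag(table_logit)) / sum(table_logit) # (15 + 13) / 32
## [1] 0.875

At these cutoffs, the logit model classifies one more car correctly than OLS.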

3.1 Out of Sample Fit / Testing Data

This is what you would do in your assignment: apply the model to new data.

newdata = data.frame(disp=c(200, 180, 160, 140, 120, 120, 100, 160),
                     hp  =c(100, 90 , 108, 90 , 80 , 90 , 80 , 90 ),
                     am  =c(0  , 0  , 0  , 1  , 0  , 1  , 1  , 1  )
                     )

#view data frame
newdata
##   disp  hp am
## 1  200 100  0
## 2  180  90  0
## 3  160 108  0
## 4  140  90  1
## 5  120  80  0
## 6  120  90  1
## 7  100  80  1
## 8  160  90  1
#use model to predict value of am for all new cars
newdata$am_prob <- predict(model_logit, 
                           newdata = newdata, 
                           type="response"
                          )

describe(newdata$am_prob)
##    vars n mean   sd median trimmed  mad min  max range skew kurtosis   se
## X1    1 8 0.33 0.32   0.31    0.33 0.41   0 0.84  0.83 0.37    -1.59 0.11
#view updated data frame
newdata
##   disp  hp am     am_prob
## 1  200 100  0 0.004225640
## 2  180  90  0 0.008361069
## 3  160 108  0 0.335916069
## 4  140  90  1 0.275162866
## 5  120  80  0 0.429961894
## 6  120  90  1 0.718090728
## 7  100  80  1 0.835013994
## 8  160  90  1 0.053546152

Again, no negative probabilities.
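
A quick bounds check (a sketch):

range(newdata$am_prob) # about 0.004 to 0.835, comfortably within [0, 1]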

4 Probit Regression

The probit link function is based on the cumulative distribution function (CDF) of the standard normal distribution. It is defined as \(\Phi^{-1}(p)\), where \(\Phi\) is the standard normal CDF and \(p\) is the probability of success. Like the logit function, it is S-shaped, and logit and probit often give similar answers, as illustrated below.
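
A small illustration (a sketch, separate from the model fitting): plot the standard normal CDF pnorm() against the logistic CDF plogis(), rescaling the logit argument by a factor of about 1.6 (a common rule of thumb) so the spreads of the two curves match.

z <- seq(from = -4, to = 4, by = 0.1)

plot(x = z, 
     y = pnorm(z), # probit: standard normal CDF
     type = "l", 
     col  = "blue", 
     ylab = "Probability", 
     main = "Probit vs. Rescaled Logit CDF"
     )
lines(x = z, 
      y = plogis(1.6 * z), # logistic CDF, rescaled
      col = "red", 
      lty = 2
      )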

Just change the link function in logistic regression from logit to probit.

?make.link
model_probit <- glm(formula = am ~ disp + hp, 
                   data    = mtcars, 
                   family  = binomial(link = "probit") # a description of the error distribution and link function to be used in the model. 
             )
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#view model summary
summary(model_probit)
## 
## Call:
## glm(formula = am ~ disp + hp, family = binomial(link = "probit"), 
##     data = mtcars)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9567  -0.2598   0.0000   0.3735   1.3645  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  0.85931    0.81028   1.061   0.2889  
## disp        -0.05634    0.02683  -2.099   0.0358 *
## hp           0.07169    0.03870   1.853   0.0639 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 16.493  on 29  degrees of freedom
## AIC: 22.493
## 
## Number of Fisher Scoring iterations: 10

stargazer(model_lm, model_logit, model_probit,
          type = "text"
          )
## 
## ============================================================
##                               Dependent variable:           
##                     ----------------------------------------
##                                        am                   
##                              OLS           logistic  probit 
##                              (1)             (2)      (3)   
## ------------------------------------------------------------
## disp                      -0.004***        -0.095** -0.056**
##                            (0.001)         (0.048)  (0.027) 
##                                                             
## hp                         0.004**          0.122*   0.072* 
##                            (0.002)         (0.068)  (0.039) 
##                                                             
## Constant                   0.756***         1.403    0.859  
##                            (0.158)         (1.368)  (0.810) 
##                                                             
## ------------------------------------------------------------
## Observations                  32              32       32   
## R2                          0.484                           
## Adjusted R2                 0.448                           
## Log Likelihood                              -8.356   -8.247 
## Akaike Inf. Crit.                           22.713   22.493 
## Residual Std. Error    0.371 (df = 29)                      
## F Statistic         13.605*** (df = 2; 29)                  
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01

Logit is often computationally simpler, and statistical packages typically default to the logit link for binary outcomes. Probit models involve additional computational complexity related to the standard normal distribution.

Also, interpretation of coefficients may be more intuitive in logit models, as they directly relate to changes in log-odds. Interpretation in probit models involves changes in the standard normal deviate associated with a one-unit change in the predictor.
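
To put the two on a comparable scale, one option (a hand-rolled sketch, with no dedicated marginal-effects package; xb_logit and xb_probit are illustrative names) is the average marginal effect: the coefficient times the mean of the link distribution's density evaluated at each observation's linear predictor.

xb_logit  <- predict(model_logit,  newdata = mtcars) # linear predictors (link scale)
xb_probit <- predict(model_probit, newdata = mtcars)

# average marginal effect: mean density at the linear predictor times the coefficient
mean(dlogis(xb_logit)) * coef(model_logit)[c("disp", "hp")]
mean(dnorm(xb_probit)) * coef(model_probit)[c("disp", "hp")]

Computed this way, the logit and probit marginal effects are typically close to each other, and often close to the OLS coefficients as well.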

5 Readings

  1. UCLA
  2. Statology