Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
Lets load mtcars
data.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
It is a data frame with 32 observations on 11 (numeric) variables.
mpg
Miles/(US) gallon
cyl
Number of cylinders
disp
Displacement (cu.in.)
hp
Gross horsepower
drat
Rear axle ratio
wt
Weight (1000 lbs)
qsec
1/4 mile time
vs
Engine (0 = V-shaped, 1 = straight)
`am`Transmission (0 = automatic, 1 = manual)
gear
Number of forward gears
carb
Number of carburetors
?mtcars # see variables
str(mtcars)
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, n = 10)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
stargazer(mtcars,
type='text')
============================================
Statistic N Mean St. Dev. Min Max
--------------------------------------------
mpg 32 20.091 6.027 10.400 33.900
cyl 32 6.188 1.786 4 8
disp 32 230.722 123.939 71.100 472.000
hp 32 146.688 68.563 52 335
drat 32 3.597 0.535 2.760 4.930
wt 32 3.217 0.978 1.513 5.424
qsec 32 17.849 1.787 14.500 22.900
vs 32 0.438 0.504 0 1
am 32 0.406 0.499 0 1
gear 32 3.688 0.738 3 5
carb 32 2.812 1.615 1 8
--------------------------------------------
We will try to model am
as dependent variable. Note it is binary: Transmission (0 = automatic, 1 = manual).
Linear regression will generate fitted values and/or predictions that have negative values.
#fit linear regression model
model_lm <- lm(formula = am ~ disp + hp,
data = mtcars
)
#view model summary
summary(model_lm)
Call:
lm(formula = am ~ disp + hp, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-0.66396 -0.33221 0.08811 0.29402 0.59786
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.7559286 0.1578183 4.790 4.55e-05 ***
disp -0.0042892 0.0008776 -4.887 3.47e-05 ***
hp 0.0043626 0.0015865 2.750 0.0102 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3706 on 29 degrees of freedom
Multiple R-squared: 0.4841, Adjusted R-squared: 0.4485
F-statistic: 13.6 on 2 and 29 DF, p-value: 6.8e-05
Vehicles with higher gross power and smaller in size tend to have manual transmission.
Linear probability gives us predicted probabilities below 0 (though not above 1) in this case. Marginal effects are the same everywhere.
my_mtcars <- mtcars
my_mtcars$predicted_ols <- predict(object = model_lm,
newdata = mtcars
)
plot(x = my_mtcars$predicted_ols,
y = mtcars$am,
main="Best-Fit Line and Data Plot",
ylim = c(-.5,1.2),
ylab = "Transmission (1=Automatic)",
xlab = "Fitted Values"
)
# Add a line based on the fitted values
lines(x = sort(my_mtcars$predicted_ols),
y = sort(model_lm$fitted.values),
col = "red",
lty = 2
)
Another way is to simply visualize the fitted values (on sample) and see their values. Fitted values (predicted values on original data) will be the same as predicted values (generally on new data) if we use the original data in the predict
command. We will go with predicted here (more general). Lets plot the fitted value, and you can see the values are below 0.
hist(my_mtcars$predicted_ols,
main = "Histogram (OLS)",
xlab = "Predicted Probabilities"
)
describe(my_mtcars$predicted_ols)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 32 0.41 0.35 0.48 0.43 0.33 -0.37 0.93 1.3 -0.6 -0.64 0.06
Lets look at the distribution of residuals. The 4 residual charts do not suggest a good fit.
Confusion matrix from the model predictions.
glm
is used to fit generalized linear models, specified by giving a symbolic description of the linear predictor and a description of the error distribution.
Also read up on family
option in glm.
?glm
?family
#fit logistic regression model
model_logit <- glm(formula = am ~ disp + hp,
data = mtcars,
family = binomial(link = "logit") # a description of the error distribution and link function to be used in the model.
)
#view model summary
summary(model_logit)
Call:
glm(formula = am ~ disp + hp, family = binomial(link = "logit"),
data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.40342 1.36757 1.026 0.3048
disp -0.09518 0.04800 -1.983 0.0474 *
hp 0.12170 0.06777 1.796 0.0725 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.713 on 29 degrees of freedom
AIC: 22.713
Number of Fisher Scoring iterations: 8
The glm()
function in R is used to fit generalized linear models (GLMs). The call glm(formula = am ~ disp + hp, family = binomial(link = "logit"))
is fitting a logistic regression model with the following specifications:
Formula:
am ~ disp + hp
: This specifies the formula for the model. It indicates that the binary response variable am
(assumed to represent an “automatic” or “manual” transmission type) is modeled as a function of the predictor variables disp
(engine displacement) and hp
(horsepower).
Family:
family = binomial
: This specifies the family of the distribution for the response variable. In this case, it’s a binomial distribution, indicating that the response variable is binary (0 or 1). Binomial distribution is commonly used for binary outcomes. The binomial distribution is appropriate for modeling binary outcomes because it represents the number of successes (1s) in a fixed number of independent Bernoulli trials.
Link Function:
link = "logit"
: This specifies the link function used to relate the linear predictor (1) to the mean of the distribution (2). For logistic regression, the default link function is the logit function (link = "logit"
). The logit link function is the natural logarithm of the odds of success, and it’s suitable for binary outcomes.
Putting it all together, the glm()
call is fitting a logistic regression model that models the probability of having an automatic transmission (am
) as a function of the engine displacement (disp
) and horsepower (hp
). The model estimates the coefficients of the predictors and uses the logistic link function to transform the linear predictor to probabilities between 0 and 1.
You would typically interpret the coefficients of this model in terms of log-odds. The logistic regression model estimates how the log-odds of having an automatic transmission change with changes in engine displacement and horsepower. If you want to interpret the results in terms of odds ratios or probabilities, additional transformations may be needed.
The message "Number of Fisher Scoring iterations: 8"
typically appears in the output of a logistic regression model fitted using maximum likelihood estimation (MLE). This message provides information about the optimization process used to find the parameter estimates for the logistic regression model.
Here’s what each part of the message means:
Fisher Scoring is an optimization algorithm used for fitting generalized linear models (GLMs) like logistic regression. GLMs include logistic regression, Poisson regression, and others. Fisher Scoring is an iterative optimization algorithm that aims to maximize the likelihood function, finding the parameter values that best explain the observed data. You can sometimes chnage the solver, such as to Newton-Ralph optimization algorithm.
The "Number of Fisher Scoring iterations"
indicates how many times the optimization algorithm iterated through the process of updating parameter estimates to maximize the likelihood. In logistic regression, this is done to find the values of coefficients that maximize the likelihood of observing the given data.
In your case, the algorithm took 8 iterations to converge to a solution. Convergence means that the parameter estimates are stable and do not change significantly between iterations. The algorithm stops when either a maximum number of iterations is reached or when the change in parameter estimates is below a certain threshold.
A higher number of iterations might suggest that the optimization algorithm needed more steps to converge, possibly because the model is complex, or the data require more refinement to find the maximum likelihood estimates.
Read up on predict.glm
- the type
option in particular.
type
: The type of prediction required.
The default is on the scale of the linear predictors; the alternative "response"
is on the scale of the response variable. Thus for a default binomial model the default predictions are of log-odds (probabilities on logit scale) and type = "response"
gives the predicted probabilities.
Deciding on a cutoff value for logistic probabilities is a crucial step in converting predicted probabilities into binary outcomes (0 or 1). Typically, the default cutoff is 0.5, but depending on your specific objectives and the characteristics of your data, exploring and tuning this cutoff value can lead to improved model performance.
Let’s discuss the importance of exploring and tuning cutoff values:
Default Cutoff (0.5): The default cutoff of 0.5 is commonly used, but it may not always be the most appropriate choice, especially if the dataset is imbalanced or if the costs of false positives and false negatives differ significantly.
Explore Different Cutoff Values: Start by exploring a range of cutoff values beyond 0.5. Try the mean (imbalanced data), or the median (balanced data). You can systematically test cutoff values from 0 to 1 to understand how they affect the model’s performance metrics.
Consider Domain Knowledge: Incorporate domain knowledge and the specific requirements of your problem when choosing cutoff values. For example, in medical diagnosis, the costs associated with false positives (misdiagnosing a healthy individual) and false negatives (missing a disease) may vary, influencing the optimal cutoff.
Stratified Sampling: If your dataset is imbalanced, consider using stratified sampling techniques to ensure that each class is represented proportionally in your training and validation sets. This ensures that the performance evaluation is not biased towards the majority class.
Cross-Validation: Use cross-validation to assess the robustness of different cutoff values. By splitting your data into multiple folds and evaluating performance across different subsets, you can obtain more reliable estimates of model performance for various cutoff values.
Tune Cutoff Value: Once you’ve explored a range of cutoff values and identified promising candidates, fine-tune the cutoff value using techniques like grid search or random search. These methods systematically search the space of possible cutoff values to find the one that optimizes your chosen performance metric.
Evaluate on Holdout Set: Finally, evaluate the performance of your model with the selected cutoff value on a holdout set or test set that was not used during hyperparameter tuning. This provides an unbiased estimate of the model’s generalization performance.
By exploring and tuning cutoff values, you can optimize the performance of your logistic regression model, ensuring it meets the specific requirements of your application and delivers the best possible results.
Training Dataset:
Validation Dataset:
The validation dataset is a subset of the training data that facilitates model evaluation and hyperparameter tuning. It plays a crucial role in assessing the model’s performance with various hyperparameter configurations, including cutoff values for logistic probabilities.
By exploring different cutoff values on this dataset, one can gauge how effectively the model generalizes to unseen data and select the optimal cutoff value for deployment.
Evaluation Dataset for Model Validation:
Test Dataset:
The test dataset remains entirely separate from both the training and validation datasets during model development. Its purpose is to provide an independent assessment of the model’s performance after hyperparameter tuning.
By evaluating the model with the chosen cutoff value on the test dataset, one obtains an unbiased estimation of the model’s ability to generalize to new, unseen data, ensuring the model’s reliability and effectiveness in real-world applications.
type = "response"
gives the predicted probabilities, which we will use.
?predict.glm
predicted_logit <- predict(object = model_logit,
newdata = mtcars, # it the same data
type = "response"
)
hist(predicted_logit,
main = "Histogram (Logitic Regression)",
xlab = "Predicted Probabilities"
)
summary(predicted_logit)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000000 0.0005646 0.3921834 0.4062500 0.8594700 0.9999987
Now, fitted values do not have negative probabilities.
This is what you would do in your assignment. Apply the model on a different data.
newdata = data.frame(disp=c(200, 180, 160, 140, 120, 120, 100, 160),
hp =c(100, 90 , 108, 90 , 80 , 90 , 80 , 90 ),
am =c(0 , 0 , 0 , 1 , 0 , 1 , 1 , 1 )
)
#view data frame
newdata
disp hp am
1 200 100 0
2 180 90 0
3 160 108 0
4 140 90 1
5 120 80 0
6 120 90 1
7 100 80 1
8 160 90 1
#use model to predict value of am for all new cars
newdata$am_prob <- predict(model_logit,
newdata = newdata,
type="response"
)
describe(newdata$am_prob)
vars n mean sd median trimmed mad min max range skew kurtosis se
X1 1 8 0.33 0.32 0.31 0.33 0.41 0 0.84 0.83 0.37 -1.59 0.11
#view updated data frame
newdata
disp hp am am_prob
1 200 100 0 0.004225640
2 180 90 0 0.008361069
3 160 108 0 0.335916069
4 140 90 1 0.275162866
5 120 80 0 0.429961894
6 120 90 1 0.718090728
7 100 80 1 0.835013994
8 160 90 1 0.053546152
Again, no negative probabilities.
The probit link function is based on the cumulative distribution function (CDF) of the standard normal distribution. It is defined as \(\Phi^{-1}(p)\) where \(\Phi\) is the standard normal CDF and p
is the probability of success. It is also S shaped like the “logit” function, and often logit and probit give similar answers.
Just change the link function in logistic regression from logit
to probit
.
?make.link
model_probit <- glm(formula = am ~ disp + hp,
data = mtcars,
family = binomial(link = "probit") # a description of the error distribution and link function to be used in the model.
)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#view model summary
summary(model_probit)
Call:
glm(formula = am ~ disp + hp, family = binomial(link = "probit"),
data = mtcars)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.85931 0.81028 1.061 0.2889
disp -0.05634 0.02683 -2.099 0.0358 *
hp 0.07169 0.03870 1.853 0.0639 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 43.230 on 31 degrees of freedom
Residual deviance: 16.493 on 29 degrees of freedom
AIC: 22.493
Number of Fisher Scoring iterations: 10
============================================================
Dependent variable:
----------------------------------------
am
OLS logistic probit
(1) (2) (3)
------------------------------------------------------------
disp -0.004*** -0.095** -0.056**
(0.001) (0.048) (0.027)
hp 0.004** 0.122* 0.072*
(0.002) (0.068) (0.039)
Constant 0.756*** 1.403 0.859
(0.158) (1.368) (0.810)
------------------------------------------------------------
Observations 32 32 32
R2 0.484
Adjusted R2 0.448
Log Likelihood -8.356 -8.247
Akaike Inf. Crit. 22.713 22.493
Residual Std. Error 0.371 (df = 29)
F Statistic 13.605*** (df = 2; 29)
============================================================
Note: *p<0.1; **p<0.05; ***p<0.01
Logit is often computationally simpler, and statistical packages typically default to logit in logistic regression models. Probit models may involve additional computational complexities related to the standard normal distribution.
Also, interpretation of coefficients may be more intuitive in logit models, as they directly relate to changes in log-odds. Interpretation in probit models involves changes in the standard normal deviate associated with a one-unit change in the predictor.