rm(list = ls()) # clear the workspace
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(psych)
Let's load the mtcars data.
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
It is a data frame with 32 observations on 11 (numeric) variables:
mpg Miles/(US) gallon
cyl Number of cylinders
disp Displacement (cu.in.)
hp Gross horsepower
drat Rear axle ratio
wt Weight (1000 lbs)
qsec 1/4 mile time
vs Engine (0 = V-shaped, 1 = straight)
am Transmission (0 = automatic, 1 = manual)
gear Number of forward gears
carb Number of carburetors
?mtcars # see variables
str(mtcars)
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
head(mtcars, n = 10)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
stargazer(mtcars,
type = "text")
##
## ============================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------
## mpg 32 20.091 6.027 10.400 33.900
## cyl 32 6.188 1.786 4 8
## disp 32 230.722 123.939 71.100 472.000
## hp 32 146.688 68.563 52 335
## drat 32 3.597 0.535 2.760 4.930
## wt 32 3.217 0.978 1.513 5.424
## qsec 32 17.849 1.787 14.500 22.900
## vs 32 0.438 0.504 0 1
## am 32 0.406 0.499 0 1
## gear 32 3.688 0.738 3 5
## carb 32 2.812 1.615 1 8
## --------------------------------------------
We will try to model am as the dependent variable. Note that it
is binary: Transmission (0 = automatic, 1 = manual).
sort(table(mtcars$am),
decreasing = TRUE)
##
## 0 1
## 19 13
Linear regression will generate fitted values and/or predictions outside the [0, 1] range; here, some are negative.
#fit linear regression model
model_lm <- lm(formula = am ~ disp + hp,
data = mtcars
)
#view model summary
summary(model_lm)
##
## Call:
## lm(formula = am ~ disp + hp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.66396 -0.33221 0.08811 0.29402 0.59786
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.7559286 0.1578183 4.790 4.55e-05 ***
## disp -0.0042892 0.0008776 -4.887 3.47e-05 ***
## hp 0.0043626 0.0015865 2.750 0.0102 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3706 on 29 degrees of freedom
## Multiple R-squared: 0.4841, Adjusted R-squared: 0.4485
## F-statistic: 13.6 on 2 and 29 DF, p-value: 6.8e-05
Vehicles with higher gross horsepower and smaller engine displacement tend to have manual transmissions.
The linear probability model gives us predicted probabilities below 0 (though not above 1) in this case, and its marginal effects are the same everywhere.
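To see what "marginal effects are the same everywhere" means, here is a minimal sketch (the two baseline cars are made-up values): raising hp by 10 shifts the linear probability prediction by exactly 10 times the hp coefficient, no matter where you start.
# Two hypothetical cars at different baselines
base <- data.frame(disp = c(150, 300), hp = c(100, 200))
bump <- transform(base, hp = hp + 10) # same cars, hp raised by 10
# Both differences equal 10 * coef(model_lm)["hp"]
predict(model_lm, newdata = bump) - predict(model_lm, newdata = base)
10 * coef(model_lm)["hp"]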
my_mtcars <- mtcars
my_mtcars$predicted_ols <- predict(object = model_lm,
newdata = mtcars
)
plot(x = my_mtcars$predicted_ols,
y = mtcars$am,
main="Best-Fit Line and Data Plot",
ylim = c(-.5,1.2),
ylab = "Transmission (1=Automatic)",
xlab = "Fitted Values"
)
# Add a reference line through the sorted fitted values (identical to the sorted predictions here, so this traces the 45-degree line)
lines(x = sort(my_mtcars$predicted_ols),
y = sort(model_lm$fitted.values),
col = "red",
lty = 2
)
Another way is to simply visualize the fitted values (in sample) and
inspect their range. Fitted values (predictions on the original data) are
the same as predicted values (generally computed on new data) when we pass
the original data to predict(); we will work with the predicted values here,
since that is the more general workflow. Let's verify the equivalence and
then plot the values; you can see that some fall below 0.
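A one-line sanity check of that equivalence (dropping names so only the numeric values are compared):
# Fitted values and predictions on the original data coincide
all.equal(as.numeric(fitted(model_lm)),
as.numeric(my_mtcars$predicted_ols))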
hist(my_mtcars$predicted_ols,
main = "Histogram (OLS)",
xlab = "Predicted Probabilities"
)
describe(my_mtcars$predicted_ols)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 32 0.41 0.35 0.48 0.43 0.33 -0.37 0.93 1.3 -0.6 -0.64 0.06
Let's look at the distribution of the residuals. The four diagnostic plots below do not suggest a good fit.
par(mfrow = c(2, 2))
plot(model_lm)
Confusion matrix from the model predictions, using 0.48 (roughly the median fitted value) as the classification cutoff.
table_ols <-
table(mtcars$am,
ifelse(test = model_lm$fitted.values > 0.48,
yes = 1,
no = 0)
)
table_ols
##
## 0 1
## 0 15 4
## 1 1 12
glm() is used to fit generalized linear models, specified
by giving a symbolic description of the linear predictor and a
description of the error distribution.
Also read up on the family argument of glm().
?glm
?family
#fit logistic regression model
model_logit <- glm(formula = am ~ disp + hp,
data = mtcars,
family = binomial(link = "logit") # a description of the error distribution and link function to be used in the model.
)
#view model summary
summary(model_logit)
##
## Call:
## glm(formula = am ~ disp + hp, family = binomial(link = "logit"),
## data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9665 -0.3090 -0.0017 0.3934 1.3682
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.40342 1.36757 1.026 0.3048
## disp -0.09518 0.04800 -1.983 0.0474 *
## hp 0.12170 0.06777 1.796 0.0725 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 16.713 on 29 degrees of freedom
## AIC: 22.713
##
## Number of Fisher Scoring iterations: 8
The glm() function in R is used to fit generalized
linear models (GLMs). The call
glm(formula = am ~ disp + hp, family = binomial(link = "logit"))
is fitting a logistic regression model with the following
specifications:
Formula:
am ~ disp + hp: This specifies the formula for the
model. It indicates that the binary response variable am
(0 = automatic, 1 = manual) is
modeled as a function of the predictor variables disp
(engine displacement) and hp (horsepower).
Family:
family = binomial: This specifies the family of the
distribution for the response variable. In this case, it’s a binomial
distribution, indicating that the response variable is binary (0 or 1).
Binomial distribution is commonly used for binary outcomes.
Link Function:
link = "logit": This specifies the link function used to
relate the linear predictor to the mean of the distribution. For
logistic regression, the default link function is the logit function
(link = "logit"). The logit link function is the natural
logarithm of the odds of success, and it’s suitable for binary
outcomes.
Putting it all together, the glm() call fits a
logistic regression model for the probability of having a
manual transmission (am = 1) as a function of the engine
displacement (disp) and horsepower (hp). The
model estimates the coefficients of the predictors and uses the logistic
link function to transform the linear predictor into probabilities between
0 and 1.
You would typically interpret the coefficients of this model in terms of log-odds. The logistic regression model estimates how the log-odds of having an automatic transmission change with changes in engine displacement and horsepower. If you want to interpret the results in terms of odds ratios or probabilities, additional transformations may be needed.
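As a sketch of those transformations: base R's qlogis() and plogis() implement the logit link and its inverse, and exponentiating the coefficients gives odds ratios.
# The logit link and its inverse
qlogis(0.75) # log(0.75 / 0.25), the log-odds of p = 0.75
plogis(qlogis(0.75)) # back to the probability 0.75
# Odds ratios: each coefficient exponentiates to the multiplicative
# change in the odds of a manual transmission per one-unit increase
exp(coef(model_logit))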
The message "Number of Fisher Scoring iterations: 8"
typically appears in the output of a logistic regression model fitted
using maximum likelihood estimation (MLE). This message provides
information about the optimization process used to find the parameter
estimates for the logistic regression model.
Here’s what each part of the message means:
Fisher Scoring is an optimization algorithm used for fitting generalized linear models (GLMs) like logistic regression. GLMs include logistic regression, Poisson regression, and others. Fisher Scoring is an iterative optimization algorithm that aims to maximize the likelihood function, finding the parameter values that best explain the observed data.
The "Number of Fisher Scoring iterations" indicates how
many times the optimization algorithm iterated through the process of
updating parameter estimates to maximize the likelihood. In logistic
regression, this is done to find the values of coefficients that
maximize the likelihood of observing the given data.
In your case, the algorithm took 8 iterations to converge to a solution. Convergence means that the parameter estimates are stable and do not change significantly between iterations. The algorithm stops when either a maximum number of iterations is reached or when the change in parameter estimates is below a certain threshold.
A higher number of iterations might suggest that the optimization algorithm needed more steps to converge, possibly because the model is complex, or the data require more refinement to find the maximum likelihood estimates.
Control this with the maxit argument of glm.control(), passed to glm() via its control argument.
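A short sketch of how to do that (maxit defaults to 25; trace = TRUE prints the deviance at each iteration; model_logit_ctrl is just an illustrative name):
# Refit with a higher iteration cap and per-iteration output
model_logit_ctrl <- glm(formula = am ~ disp + hp,
data = mtcars,
family = binomial(link = "logit"),
control = glm.control(maxit = 50, trace = TRUE)
)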
Read up on predict.glm, the type argument in particular.
type: The type of prediction required.
The default is on the scale of the linear predictors; the alternative
"response" is on the scale of the response variable. Thus
for a default binomial model the default predictions are of
log-odds (probabilities on logit scale) and
type = "response" gives the predicted
probabilities.
type = "response" gives the predicted probabilities,
which we will use.
?predict.glm
predicted_logit <- predict(object = model_logit,
newdata = mtcars, # the same data used to fit the model
type = "response"
)
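A quick check that the two scales agree: mapping the default link-scale predictions through the inverse logit (plogis) reproduces the response-scale probabilities.
pred_link <- predict(model_logit, newdata = mtcars) # type = "link" by default
all.equal(as.numeric(plogis(pred_link)),
as.numeric(predicted_logit))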
hist(predicted_logit,
main = "Histogram (Logitic Regression)",
xlab = "Predicted Probabilities"
)
summary(predicted_logit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000000 0.0005646 0.3921834 0.4062500 0.8594700 0.9999987
Now the fitted values contain no negative probabilities.
table_logit <-
table(mtcars$am,
ifelse(test = model_logit$fitted.values > 0.39, # median
yes = 1,
no = 0)
)
table_logit
##
## 0 1
## 0 15 4
## 1 0 13
table_ols
##
## 0 1
## 0 15 4
## 1 1 12
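A simple way to compare the two tables is in-sample accuracy (correctly classified cars over all 32); a quick sketch:
# Diagonal entries are the correct classifications
sum(diag(table_logit)) / sum(table_logit) # logit accuracy
sum(diag(table_ols)) / sum(table_ols) # OLS (linear probability) accuracy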
This is what you would do in your assignment: apply the model to different data.
newdata <- data.frame(disp = c(200, 180, 160, 140, 120, 120, 100, 160),
hp =c(100, 90 , 108, 90 , 80 , 90 , 80 , 90 ),
am =c(0 , 0 , 0 , 1 , 0 , 1 , 1 , 1 )
)
#view data frame
newdata
## disp hp am
## 1 200 100 0
## 2 180 90 0
## 3 160 108 0
## 4 140 90 1
## 5 120 80 0
## 6 120 90 1
## 7 100 80 1
## 8 160 90 1
#use model to predict value of am for all new cars
newdata$am_prob <- predict(model_logit,
newdata = newdata,
type="response"
)
describe(newdata$am_prob)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 8 0.33 0.32 0.31 0.33 0.41 0 0.84 0.83 0.37 -1.59 0.11
#view updated data frame
newdata
## disp hp am am_prob
## 1 200 100 0 0.004225640
## 2 180 90 0 0.008361069
## 3 160 108 0 0.335916069
## 4 140 90 1 0.275162866
## 5 120 80 0 0.429961894
## 6 120 90 1 0.718090728
## 7 100 80 1 0.835013994
## 8 160 90 1 0.053546152
Again, no negative probabilities.
The probit link function is based on the cumulative distribution function (CDF) of the standard normal distribution. It is defined as \(\Phi^{-1}(p)\), where \(\Phi\) is the standard normal CDF and \(p\) is the probability of success. Like the logit function, it is S-shaped, and logit and probit often give similar answers.
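In R, \(\Phi\) and \(\Phi^{-1}\) are pnorm() and qnorm(), so the probit link is easy to illustrate directly; a short sketch:
# Probit link: qnorm maps a probability to a z-score, pnorm maps back
qnorm(0.75) # Phi^{-1}(0.75)
pnorm(qnorm(0.75)) # back to 0.75
# Both inverse links are S-shaped; compare them on a grid
z <- seq(-4, 4, by = 0.5)
cbind(z, probit = pnorm(z), logit = plogis(z))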
Just change the link function in the glm() call from "logit" to "probit".
?make.link
model_probit <- glm(formula = am ~ disp + hp,
data = mtcars,
family = binomial(link = "probit") # a description of the error distribution and link function to be used in the model.
)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
#view model summary
summary(model_probit)
##
## Call:
## glm(formula = am ~ disp + hp, family = binomial(link = "probit"),
## data = mtcars)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9567 -0.2598 0.0000 0.3735 1.3645
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.85931 0.81028 1.061 0.2889
## disp -0.05634 0.02683 -2.099 0.0358 *
## hp 0.07169 0.03870 1.853 0.0639 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 43.230 on 31 degrees of freedom
## Residual deviance: 16.493 on 29 degrees of freedom
## AIC: 22.493
##
## Number of Fisher Scoring iterations: 10
library(stargazer)
stargazer(model_lm, model_logit, model_probit,
type = "text"
)
##
## ============================================================
## Dependent variable:
## ----------------------------------------
## am
## OLS logistic probit
## (1) (2) (3)
## ------------------------------------------------------------
## disp -0.004*** -0.095** -0.056**
## (0.001) (0.048) (0.027)
##
## hp 0.004** 0.122* 0.072*
## (0.002) (0.068) (0.039)
##
## Constant 0.756*** 1.403 0.859
## (0.158) (1.368) (0.810)
##
## ------------------------------------------------------------
## Observations 32 32 32
## R2 0.484
## Adjusted R2 0.448
## Log Likelihood -8.356 -8.247
## Akaike Inf. Crit. 22.713 22.493
## Residual Std. Error 0.371 (df = 29)
## F Statistic 13.605*** (df = 2; 29)
## ============================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Logit is often computationally simpler, and statistical packages typically default to the logit link for binary-outcome models. Probit models involve additional computation related to the standard normal distribution.
Interpretation is also arguably more intuitive for logit models, where coefficients directly represent changes in log-odds; in probit models, a coefficient is the change in the standard normal deviate (z-score) associated with a one-unit change in the predictor.
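As a quick empirical check of the "similar answers" point above, we can compare the two models' fitted probabilities on this data; a sketch:
# Logit and probit fitted probabilities track each other closely here
cor(fitted(model_logit), fitted(model_probit))
plot(x = fitted(model_logit),
y = fitted(model_probit),
main = "Logit vs. Probit Fitted Probabilities",
xlab = "Logit",
ylab = "Probit"
)
abline(a = 0, b = 1, lty = 2) # 45-degree reference line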