# Load the dataset
data(mtcars)
# Fit the logistic regression model
model <- glm(vs ~ mpg + hp + am, data = mtcars, family = binomial)
# Print the model summary
summary(model)
##
## Call:
## glm(formula = vs ~ mpg + hp + am, family = binomial, data = mtcars)
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept)   9.96351    9.33802   1.067   0.2860
## mpg           0.26884    0.25554   1.052   0.2928
## hp           -0.10668    0.05455  -1.955   0.0505 .
## am           -5.17703    2.82366  -1.833   0.0667 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.860  on 31  degrees of freedom
## Residual deviance: 10.844  on 28  degrees of freedom
## AIC: 18.844
##
## Number of Fisher Scoring iterations: 8
# Interpret coefficients
coefficients <- coef(model)
intercept <- coefficients[1]
mpg_coeff <- coefficients[2]
hp_coeff <- coefficients[3]
am_coeff <- coefficients[4]
# Interpretation
cat("Intercept:", intercept, "\n")
## Intercept: 9.963508
cat("Coefficient for mpg:", mpg_coeff, "\n")
## Coefficient for mpg: 0.2688446
cat("Coefficient for hp:", hp_coeff, "\n")
## Coefficient for hp: -0.1066805
cat("Coefficient for am:", am_coeff, "\n")
## Coefficient for am: -5.177027
We fit a logistic regression model to predict the vs variable (engine type: 0 = V-shaped, 1 = straight) from the predictors mpg (miles per gallon), hp (horsepower), and am (transmission: 0 = automatic, 1 = manual). The glm() function fits the model with the formula vs ~ mpg + hp + am, and the family argument is set to binomial so that glm() estimates a logistic regression (binomial errors with a logit link).
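To see what the fitted model actually produces, the log-odds can be converted to probabilities with predict(). A minimal sketch using the model fitted above (this step is not in the original code):

# Predicted probability of a straight engine (vs = 1) for each car
probs <- predict(model, type = "response")
head(round(probs, 3))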
Interpreting the coefficients:
The intercept (9.963508) is the log-odds of a straight engine (vs = 1) when all predictors (mpg, hp, and am) are zero. Since no car has zero mpg or zero horsepower, this value has no meaningful real-world interpretation here, and it is not statistically significant in any case (p-value = 0.2860).
The coefficient for mpg (0.2688446) indicates that, holding the other variables constant, a one-unit increase in mpg is associated with a 0.2688446 increase in the log-odds of a straight engine. This coefficient is not statistically significant (p-value = 0.2928).
The coefficient for hp (-0.1066805) suggests that, holding the other variables constant, a one-unit increase in hp is associated with a 0.1066805 decrease in the log-odds of a straight engine. This coefficient is marginally significant (p-value = 0.0505, denoted by '.').
The coefficient for am (-5.177027) implies that, holding the other variables constant, a manual transmission (am = 1) is associated with a 5.177027 decrease in the log-odds of a straight engine relative to an automatic transmission (am = 0). This coefficient is also only marginally significant (p-value = 0.0667, denoted by '.').
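Because log-odds are awkward to reason about directly, it often helps to exponentiate the coefficients into odds ratios. A minimal sketch, again using the model above (the approximate values in the comments follow arithmetically from the estimates printed by summary()):

# Exponentiate the log-odds coefficients to obtain odds ratios
exp(coef(model))
# Approximate results from the estimates above:
#   mpg: exp(0.2688)  ~ 1.31   (each extra mpg multiplies the odds of a straight engine by ~1.31)
#   hp:  exp(-0.1067) ~ 0.90   (each extra horsepower multiplies the odds by ~0.90)
#   am:  exp(-5.1770) ~ 0.0057 (a manual transmission multiplies the odds by ~0.0057)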
Why not run a multivariate regression, i.e., include every available predictor? Three common reasons:
Limited sample size: When the dataset is small relative to the number of predictors, a model with many variables risks overfitting. With only 32 cars in mtcars, a focused model with a few predictors is preferable to a sprawling one.
High collinearity: When the predictors are strongly correlated with each other, multicollinearity makes the individual effects hard to interpret and destabilizes the coefficient estimates; a quick diagnostic is sketched below.
Theory or prior knowledge: When there is a strong theoretical or substantive basis for focusing on specific variables, restricting the regression to those variables gives a more focused and interpretable analysis, emphasizing the variables of greatest theoretical interest.
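To illustrate the collinearity point, two quick diagnostics are a correlation matrix of the predictors and the variance inflation factors. A minimal sketch, assuming the car package is installed (it is not used anywhere in the original code):

# Pairwise correlations among the predictors
cor(mtcars[, c("mpg", "hp", "am")])
# Variance inflation factors for the fitted model
library(car)
vif(model)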
This snippet involves the comparison of logistic regression models for predicting hospital bankruptcy. The models being compared are named "New," "Altman," and "Ohlson," and the summary output reports several evaluation metrics for each. The code also includes custom functions for model evaluation and diagnostics, such as calculating Nagelkerke's R-squared and the F1 score, generating confusion matrices, assessing linearity, identifying outliers, and extracting model coefficients. Additionally, the code uses the glmnet function to fit a logistic regression with L1 (lasso) regularization. Techniques such as merging datasets, correlation analysis, the Box-Cox transformation, and splitting the data into training and testing sets are applied to improve the model's predictive performance.
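The hospital-bankruptcy data themselves are not shown here, so as an illustration of the L1-regularized fit mentioned above, here is a minimal glmnet sketch on the same mtcars variables; the predictor choice is a placeholder, not the original bankruptcy model:

library(glmnet)
# Design matrix without the intercept column, plus the 0/1 response
x <- model.matrix(vs ~ mpg + hp + am, data = mtcars)[, -1]
y <- mtcars$vs
# Cross-validated logistic lasso: alpha = 1 selects the L1 penalty
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
# Coefficients at the penalty that minimizes cross-validated deviance
coef(cv_fit, s = "lambda.min")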
Theoretical Understanding:
Statistical Concepts: Data analysis requires understanding statistical concepts such as probability, hypothesis testing, confidence intervals, and regression analysis.
Data Types and Variables: Learning about different types of data (e.g., categorical, numerical) and variables (e.g., independent, dependent) and their implications for analysis.
Data Collection: Understanding the principles and techniques of data collection, including sampling methods, survey design, experimental design, and data ethics.
Empirical Understanding:
Practical Application: Gaining hands-on experience by working with real-world datasets, applying statistical techniques, and interpreting the results.
Data Exploration: Learning how to explore and visualize data to identify patterns, trends, and relationships.
Model Selection and Evaluation: Understanding how to choose appropriate models and how to validate and evaluate them with metrics such as accuracy, precision, and recall, or with techniques such as cross-validation (a concrete example follows below).
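As the concrete example referenced above, the mtcars model fitted earlier can be evaluated with a confusion matrix and accuracy; the 0.5 cutoff is an assumed convention, not something prescribed by the original text:

# Classify each car with a 0.5 probability cutoff
pred <- as.integer(predict(model, type = "response") > 0.5)
# Confusion matrix: predicted class versus actual class
table(Predicted = pred, Actual = mtcars$vs)
# Accuracy: share of cars classified correctly
mean(pred == mtcars$vs)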