Exercise 1: Conceptual Questions

  1. What is the main issue with fitting an MLR on a binary response coded as 0 or 1?
  2. What must we do to a logistic regression coefficient to interpret it as an odds ratio?
  3. If we wanted to interpret an odds ratio for a 20-unit increase (rather than a one-unit increase) of a continuous predictor, what is the general formula to obtain this estimate?
  4. A logistic regression model was fitted to a data set with the following equation: \(\log\left(\frac{\pi}{1-\pi}\right) = -4 + 0.8X\). What is the predicted probability of the positive class (event) occurring for a new observation at \(X = 3.2\)?

ANSWER EXERCISE 1:

A. Issues with MLR on Binary Responses:

Multiple linear regression (MLR) assumes several things that make it unsuitable for binary responses:

1. Linearity:

MLR assumes a linear relationship between the independent variables and the mean of the dependent variable. With a binary response (0 or 1), the mean is a probability bounded between 0 and 1, so a straight line cannot describe the relationship across the full range of the predictors.

2. Normal Errors:

MLR assumes that the error terms in the model are normally distributed. With a binary response, the error at any given predictor value can take only two possible values, so the errors follow a two-point distribution rather than a normal one.

3. Constant Variance:

MLR assumes constant variance of errors (homoscedasticity). With a binary response the variance is \(\pi(1-\pi)\), which changes with the predicted probability, so this assumption cannot hold.

4. Predicted Values:

MLR can predict outcome values outside the range of 0 and 1, which is nonsensical for probabilities.
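To see the predicted-values problem concretely, here is a minimal simulation sketch (all names and numbers are made up for illustration):

set.seed(1)
x <- seq(0, 10, length.out = 200)
y <- rbinom(200, size = 1, prob = plogis(-3 + 0.6 * x))  # binary response with a true logistic relationship
lpm <- lm(y ~ x)     # fitting MLR directly to the 0/1 response
range(fitted(lpm))   # the fitted "probabilities" typically extend below 0 or above 1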

B. Coefficients and Odds Ratios:

1. Odds Ratios:

In logistic regression, an odds ratio represents the multiplicative change in the odds of the outcome associated with a one-unit increase in a predictor variable (holding all other variables constant).

2. Exponentiation:

To obtain the odds ratio from a logistic regression coefficient, we need to exponentiate it. If the coefficient is ‘β’, the odds ratio is calculated as e^β.
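As a quick numeric sketch (the coefficient value is hypothetical):

beta <- 0.8    # hypothetical logistic regression coefficient
exp(beta)      # odds ratio for a one-unit increase, about 2.23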

C. Odds Ratio for a k-Unit Increase in a Continuous Predictor:

To find the odds ratio associated with a ‘k-unit’ change in a continuous predictor:

1. Multiply the coefficient by k:

The coefficient \(\beta\) is on the log-odds scale, so a k-unit change in the predictor changes the log odds by \(k\beta\).

2. Exponentiate the result:

\(e^{k\beta} = (e^{\beta})^k\), that is, the one-unit odds ratio raised to the power k.

This result represents the multiplicative change in odds associated with a k-unit increase in the predictor.
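The same calculation in R (hypothetical values):

beta <- 0.05     # hypothetical one-unit coefficient
k <- 20
exp(k * beta)    # odds ratio for a 20-unit increase: e^1, about 2.72
exp(beta)^k      # identical result, (e^beta)^k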

D. Calculating Predicted Probability:

1. Logit:

The left-hand side of the equation represents the log odds (or logit) of the event occurring.

2. Solving for Probability:

a. Calculate the linear predictor:

-4 + (0.8 * 3.2) = -1.44

b. Convert logit to odds:

e^(-1.44) = 0.237

c. Calculate probability:

odds / (1 + odds) = 0.237 / (1 + 0.237) = 0.192

Therefore, the predicted probability of the positive class occurring for an observation at X=3.2 is approximately 0.192.
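The same arithmetic in R (a short sketch; plogis() is the inverse logit):

eta <- -4 + 0.8 * 3.2        # linear predictor (log odds) = -1.44
plogis(eta)                  # predicted probability, about 0.192
exp(eta) / (1 + exp(eta))    # equivalent calculation by hand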

Exercise 2: Coronary Artery Disease and Age

  1. In the pre-live session, we examined the effects of sex and ECG on whether a person has coronary artery disease. Age most certainly has an impact as well. Go back and provide an exploratory analysis on age using the CAD data set. Include a loess plot, summary statistics for age by disease status, and a box plot of age by disease status. Comment on your findings.

  2. Perform a logistic regression analysis using age. Provide a formal write-up of the hypothesis test and interpret the coefficient. Provide a 95 percent confidence interval as well.

ANSWER EXERCISE 2:

A. EDA with Loess Plot, Box Plot & Summary Statistics

library(tidyverse)  
library(gridExtra)  
cad_data <- read.csv("coronary.csv")
# Loess Plot (CAD converted to 0/1 so the loess smoother has a numeric response)
ggplot(cad_data, aes(x = AGE, y = as.numeric(CAD == "Yes"))) + 
  geom_point(aes(color = CAD)) +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal() +
  labs(x = "Age", y = "Coronary Artery Disease (CAD)") 
## `geom_smooth()` using formula = 'y ~ x'

# Summary Statistics
cad_data %>% 
    group_by(CAD) %>% 
    summarize(mean_age = mean(AGE),
              sd_age = sd(AGE),
              n = n())
## # A tibble: 2 × 4
##   CAD   mean_age sd_age     n
##   <chr>    <dbl>  <dbl> <int>
## 1 No        44.1   7.90    37
## 2 Yes       49.4   8.41    41


# Boxplot
ggplot(cad_data, aes(x = CAD, y = AGE)) +
  geom_boxplot() +
  theme_minimal() +
  labs(x = "Coronary Artery Disease (CAD)", y = "Age") 

Alternative EDA

CAD_data <- read.csv("coronary.csv")
# Summary statistics for age by disease status
age_summary <- aggregate(AGE ~ CAD, data = CAD_data, function(x) c(mean = mean(x), sd = sd(x)))
print(age_summary)
##   CAD  AGE.mean    AGE.sd
## 1  No 44.081081  7.899994
## 2 Yes 49.439024  8.405501
# Box plot of age by disease status
ggplot(CAD_data, aes(x = CAD, y = AGE)) + 
  geom_boxplot() +
  labs(title = "Box Plot of Age by Disease Status", x = "Coronary Artery Disease", y = "Age")

# Loess plot of age by disease status (with CAD status converted to numeric for plotting)
CAD_data$CAD_numeric <- as.numeric(CAD_data$CAD == "Yes")
ggplot(CAD_data, aes(x = AGE, y = CAD_numeric)) +
  geom_point() +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(title = "Loess Plot of Age vs. Disease Status", x = "Age", y = "Probability of CAD")

Observations from output and summary statistics:

These statistics show that individuals with CAD are older on average (mean 49.4 years, SD 8.4) than those without CAD (mean 44.1 years, SD 7.9).

Looking at the box plot, we see that the age distribution for those with CAD has a higher median as well as higher 25th and 75th percentiles compared to those without CAD. This aligns with the summary statistics and reinforces the indication that age may be a factor in CAD prevalence.

The loess plot provides a non-parametric way to visualize the relationship between age and the probability of having CAD. From the plot, there appears to be an increase in the probability of CAD with age. The curve is smoother and shows the trend without assuming a specific form for the relationship between age and CAD.

B. Logistic Regression:

# CAD variable factor levels ordered for logistic regression
CAD_data$CAD <- factor(CAD_data$CAD, levels = c("No", "Yes"))

# Convert the CAD factor to numeric (0 = No, 1 = Yes); glm() also accepts the factor directly, modeling the second level
CAD_data$CAD_numeric <- as.numeric(CAD_data$CAD) - 1

# Fit the logistic regression model using the numeric CAD variable
CAD_logit_model <- glm(CAD_numeric ~ AGE, data = CAD_data, family = binomial())

# Summary of the model
summary(CAD_logit_model)
## 
## Call:
## glm(formula = CAD_numeric ~ AGE, family = binomial(), data = CAD_data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -3.64310    1.41838  -2.568  0.01021 * 
## AGE          0.08006    0.02993   2.675  0.00748 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 107.926  on 77  degrees of freedom
## Residual deviance:  99.902  on 76  degrees of freedom
## AIC: 103.9
## 
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficient for age to interpret it as an odds ratio
exp(coef(CAD_logit_model))
## (Intercept)         AGE 
##  0.02617108  1.08334844
# 95% CI for the age coefficient
exp(confint(CAD_logit_model))
## Waiting for profiling to be done...
##                   2.5 %    97.5 %
## (Intercept) 0.001369886 0.3759326
## AGE         1.024192318 1.1529970

Observations from logistic regression model:

Formal hypothesis test: \(H_0: \beta_{AGE} = 0\) (age is not associated with the odds of CAD) versus \(H_a: \beta_{AGE} \neq 0\). The Wald statistic is \(z = 0.08006/0.02993 = 2.675\) with a p-value of 0.00748. Since \(p < 0.05\), we reject \(H_0\) and conclude that age is significantly associated with the odds of CAD.

Output Interpretation

  • Coefficients:

    • Intercept: The estimated log-odds of having CAD when ‘AGE’ is zero. Since age zero is far outside the observed data, this value is not practically meaningful on its own.
    • AGE: For each one-unit increase in age, the log-odds of having CAD increase by 0.08006.
  • P-value (AGE): 0.00748 (** indicates significance at the 0.01 level). This suggests a significant association between age and CAD.

  • Odds Ratio (AGE): 1.0833. With each one-year increase in age, the odds of having CAD are multiplied by approximately 1.08, i.e., about an 8.3% increase in the odds per year.

  • 95% Confidence Interval (AGE): (1.024, 1.153). We are 95% confident that the true odds ratio for a one-year increase in age lies within this range. Since the interval does not include 1, it further supports the significance of the age effect.

Conclusion

Based on this analysis, there’s a statistically significant positive relationship between age and coronary artery disease (CAD). As age increases, the odds of having CAD also increase.
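As a quick follow-up sketch using the fitted model above (the ages are chosen arbitrarily for illustration):

# Predicted probability of CAD at a few representative ages
new_ages <- data.frame(AGE = c(40, 50, 60))
predict(CAD_logit_model, newdata = new_ages, type = "response")
# roughly 0.39, 0.59, and 0.76 given the estimates above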

Exercise 3: Gas Mileage and Engine Size

The following code modifies the miles per gallon data set by converting the response to a binary outcome and cleans up the data set a bit, including dropping the three- and five-cylinder vehicles.

##   mpg cylinders displacement horsepower weight acceleration year origin
## 1 Low         8          307        130   3504         12.0   70      1
## 2 Low         8          350        165   3693         11.5   70      1
## 3 Low         8          318        150   3436         11.0   70      1
## 4 Low         8          304        150   3433         12.0   70      1
## 5 Low         8          302        140   3449         10.5   70      1
## 6 Low         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
  1. Provide an exploratory analysis of the relationship of the categorical Miles per gallon variable and the number of cylinders. Include at least one graphic and summary proportions. Explain what your results are saying.

  2. Perform a logistic regression model using just the single predictor cylinders. Provide a good interpretation of the coefficients that makes the most sense for this problem.

  3. Provide odds ratios for all three comparisons involving the cylinder groups (four vs. six, four vs. eight, and six vs. eight).

ANSWER EXERCISE 3:

A. EDA:

library(ggplot2) 

# Summary Proportions (Table)
prop.table(table(newAuto$cylinders, newAuto$mpg), margin = 1) 
##    
##           High        Low
##   4 0.89949749 0.10050251
##   6 0.13253012 0.86746988
##   8 0.02912621 0.97087379
# Visual Exploration (Bar Plot)
ggplot(newAuto, aes(x = cylinders, fill = mpg)) + 
  geom_bar(position = "fill") +  # Position = "fill" shows proportions
  labs(x = "Cylinders", y = "Proportion", title = "Proportion of MPG Levels by Cylinders") +
  theme_minimal() 

# Visualizations
ggplot(newAuto, aes(x = cylinders, fill = mpg)) + 
  geom_bar(position = "dodge") +
  labs(x = "Number of Cylinders", y = "Count", 
       title = "Distribution of High/Low Mileage by Cylinders") +
  theme_minimal() 

Insights and Observations from EDA: The exploratory analysis provides some clear insights into the relationship between the number of cylinders in a vehicle and its fuel efficiency as categorized into “High” and “Low” mpg.

The count bar chart (position = "dodge"): This displays raw counts rather than proportions and visually communicates the trends. There are many more 4-cylinder vehicles classified as “High” mpg than “Low”, and conversely, there are more 6- and 8-cylinder vehicles classified as “Low” mpg.

Summary proportions table: The table quantifies these observations:

  • About 89.95% of 4-cylinder vehicles have “High” mpg, and only about 10.05% have “Low” mpg.
  • For 6-cylinder vehicles, only about 13.25% have “High” mpg, while the majority, about 86.75%, fall into the “Low” mpg category.
  • Even more pronounced, about 97.09% of 8-cylinder vehicles are in the “Low” mpg category, with only 2.91% being classified as “High” mpg.

Normalized bar chart (position = "fill"): In this chart, we see the proportion of vehicles with “High” and “Low” mpg within each cylinder category. This visualization shows that:

  • The majority of vehicles with 4 cylinders fall into the “High” mpg category.
  • As the number of cylinders increases to 6 and 8, the proportion of vehicles classified as “High” mpg decreases significantly.
  • Vehicles with 8 cylinders are predominantly in the “Low” mpg category, indicating lower fuel efficiency.

These results suggest a strong negative relationship between the number of cylinders and fuel efficiency as measured by mpg. Vehicles with fewer cylinders are more likely to be fuel-efficient (“High” mpg), whereas those with more cylinders tend to be less fuel-efficient (“Low” mpg). This could be due to several factors, including the larger engine size associated with more cylinders which typically results in higher fuel consumption.
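To attach a formal test to these proportions (a quick sketch using the newAuto data frame from the code above):

# Chi-square test of independence between cylinders and the mpg category
chisq.test(table(newAuto$cylinders, newAuto$mpg))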

B & C: Logistic Regression & Odds Ratios:

model <- glm(mpg ~ cylinders, data = newAuto, family = binomial())
summary(model)
## 
## Call:
## glm(formula = mpg ~ cylinders, family = binomial(), data = newAuto)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.1917     0.2358  -9.296   <2e-16 ***
## cylinders6    4.0704     0.4005  10.164   <2e-16 ***
## cylinders8    5.6982     0.6316   9.022   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 533.72  on 384  degrees of freedom
## Residual deviance: 221.88  on 382  degrees of freedom
## AIC: 227.88
## 
## Number of Fisher Scoring iterations: 6
# Getting the coefficients
coefs <- coef(model)

# Calculating the odds ratios from the coefficients
odds_ratios <- exp(coefs)

# Odds ratio for 6 vs 4 cylinders, read directly from the model
odds_ratio_6_vs_4 <- odds_ratios['cylinders6']

# Odds ratio for 8 vs 4 cylinders, read directly from the model
odds_ratio_8_vs_4 <- odds_ratios['cylinders8']

# For 6 vs 8 cylinders, take the ratio of the two odds ratios
odds_ratio_6_vs_8 <- odds_ratio_6_vs_4 / odds_ratio_8_vs_4

# Output the odds ratios for interpretation
print(odds_ratios)
## (Intercept)  cylinders6  cylinders8 
##   0.1117318  58.5818182 298.3333333
print(odds_ratio_6_vs_8)
## cylinders6 
##  0.1963636
# Equivalently, compute the 6 vs 8 odds ratio from the difference of the log-odds coefficients
log_odds_ratio_6_vs_8 <- coefs['cylinders6'] - coefs['cylinders8']
odds_ratio_6_vs_8 <- exp(log_odds_ratio_6_vs_8)
print(odds_ratio_6_vs_8)
## cylinders6 
##  0.1963636
library(emmeans)

# Estimated marginal means for the model
emm <- emmeans(model, ~ cylinders)

# Pairwise comparisons
pairwise_comps <- pairs(emm)

# Convert to odds ratios
odds_ratios <- summary(pairwise_comps, type = "response")

# Print the results
print(odds_ratios)
##  contrast                odds.ratio      SE  df null z.ratio p.value
##  cylinders4 / cylinders6    0.01707 0.00684 Inf    1 -10.164  <.0001
##  cylinders4 / cylinders8    0.00335 0.00212 Inf    1  -9.022  <.0001
##  cylinders6 / cylinders8    0.19636 0.13145 Inf    1  -2.432  0.0399
## 
## P value adjustment: tukey method for comparing a family of 3 estimates 
## Tests are performed on the log odds ratio scale

Interpretation of results:

At first glance the coefficients look backwards: with 4-cylinder vehicles as the baseline, the positive coefficients for cylinders6 (4.0704) and cylinders8 (5.6982) appear to say that vehicles with more cylinders are more likely to have “High” mpg, which contradicts what we know about engine size and fuel economy.

The resolution lies in how R codes the response. When the response is a factor, glm() models the probability of the second factor level. The levels of mpg are ordered alphabetically (“High”, “Low”), so the model is estimating the probability of “Low” mpg, not “High”. Read that way, the results match expectations:

  • Intercept (-2.1917):

For the 4-cylinder baseline, the log-odds of “Low” mpg are -2.1917, corresponding to odds of \(e^{-2.1917} \approx 0.112\). Four-cylinder vehicles are very unlikely to fall into the “Low” mpg group.

  • Coefficient for 6 cylinders (4.0704):

Relative to 4-cylinder vehicles, 6-cylinder vehicles have substantially higher log-odds of “Low” mpg, consistent with larger engines being less fuel-efficient.

  • Coefficient for 8 cylinders (5.6982):

Relative to 4-cylinder vehicles, 8-cylinder vehicles have even higher log-odds of “Low” mpg.

Odds Ratios:

  • 6 vs. 4 cylinders:

An odds ratio of 58.58 means 6-cylinder vehicles have about 58.6 times the odds of being in the “Low” mpg category compared to 4-cylinder vehicles.

  • 8 vs. 4 cylinders:

An odds ratio of 298.33 means 8-cylinder vehicles have about 298 times the odds of “Low” mpg compared to 4-cylinder vehicles.

  • 6 vs. 8 cylinders:

An odds ratio of 0.196 means 6-cylinder vehicles have about one-fifth the odds of “Low” mpg compared to 8-cylinder vehicles; equivalently, 8-cylinder vehicles have roughly 5.1 times the odds.

The emmeans pairwise comparisons tell the same story from the other direction: the odds of “Low” mpg for 4-cylinder vehicles are a small fraction of those for 6-cylinder (0.017) and 8-cylinder (0.003) vehicles, and all three contrasts are statistically significant after the Tukey adjustment.

In summary, the number of cylinders is strongly and negatively associated with fuel efficiency: relative to 4-cylinder vehicles, 6- and 8-cylinder vehicles have dramatically higher odds of falling into the “Low” mpg category. The apparent contradiction was purely a matter of which factor level the model predicts, which is worth verifying whenever a binary response is created from a continuous variable (here, by a median split of mpg).
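One way to avoid this ambiguity in the first place (a sketch, assuming the newAuto data frame as above) is to check the factor coding and relevel the response so the model predicts “High” directly:

# glm() models the probability of the second factor level
levels(factor(newAuto$mpg))    # "High" "Low", so P(Low) is what the model above estimates
mpg_releveled <- relevel(factor(newAuto$mpg), ref = "Low")
model_high <- glm(mpg_releveled ~ cylinders, data = newAuto, family = binomial())
exp(coef(model_high))          # odds ratios for High mpg; the coefficient signs flip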

Exercise 4: Quick Investigation

Suppose a friend came to you with a data set that you have pinned as a logistic regression problem. Suppose he gives you the data set to analyze. The response is binary and the predictor is continuous. Explore this data set to determine whether logistic regression will indeed work for this data set and if so, complete the analysis and provide a generic interpretation of the coefficients. If it is not appropriate, explain.

investigate <-read.csv("Investigate.csv",stringsAsFactors = T)
head(investigate)

ANSWER EXERCISE 4:

library(ggplot2)
ggplot(investigate, aes(x = x, y = y)) +
  geom_jitter(width = 0.1, height = 0.1, alpha = 0.5) +
  geom_smooth(method = "glm", method.args = list(family = "binomial"), se = FALSE) +
  labs(title = "Scatterplot with Logistic Regression Line",
       x = "Predictor (x)", y = "Response (y)")

model <- glm(y ~ x, data = investigate, family = binomial())

Residuals vs. Fitted plot

residuals_plot <- ggplot(investigate, aes(x = fitted(model), y = residuals(model, type = "deviance"))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(title = "Residuals vs. Fitted Plot", x = "Fitted Values", y = "Residuals")

print(residuals_plot)

summary(model)
## 
## Call:
## glm(formula = y ~ x, family = binomial(), data = investigate)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -0.4442     0.6463  -0.687   0.4919  
## x             0.5362     0.2648   2.025   0.0429 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 50.446  on 39  degrees of freedom
## Residual deviance: 45.862  on 38  degrees of freedom
## AIC: 49.862
## 
## Number of Fisher Scoring iterations: 3

Alternative EDA

library(ggplot2)
investigate2 <- read.csv("Investigate.csv", stringsAsFactors = T)

head(investigate2)
##           x y
## 1 0.1540348 0
## 2 3.8089079 1
## 3 2.5328041 1
## 4 3.0170302 1
## 5 3.0093886 0
## 6 3.6670790 1
ggplot(investigate2, aes(x = x, y = y)) + 
  geom_point(alpha=0.5) +
  labs(title="Scatter plot of Continuous Predictor vs. Binary Response",
       x="Continuous Predictor",
       y="Binary Response")

Analysis and Response:

Looking at the scatterplot with the logistic regression line, I noticed a trend where as x, a continuous predictor, increases, the chance of y being 1 goes up too. The scatterplot reveals that data points with y = 1 tend to gather at higher x values, while points for y = 0 are more common at lower x values. The logistic regression line seems to reflect this pattern well, suggesting a logistic regression model fits this data appropriately.

The “Residuals vs. Fitted Plot” didn’t show any clear patterns or deviations from zero, meaning the model fits the data well without breaking any logistic regression assumptions.

From the model summary:

The coefficient for x is 0.5362 with a p-value of 0.0429, significant at the 0.05 level. Generically interpreted, each one-unit increase in x raises the log-odds of y = 1 by about 0.54, which multiplies the odds by \(e^{0.5362} \approx 1.71\). The intercept (-0.4442, p = 0.4919) is not significantly different from zero. The model's AIC is 49.862; this metric is mainly useful for comparing candidate models for the same dataset, so without other models we can't do much with it here, but generally a lower AIC indicates a better fit.

Based on the initial data analysis and the logistic regression results, using logistic regression for this dataset appears valid. The predictor x has a significant link with the binary outcome y, and the model checks don’t show any significant problems with this method.