Multiple linear regression (MLR) assumes several things that make it unsuitable for binary responses:
1. Linearity:
MLR assumes a linear relationship between the independent variables and the dependent variable. A binary response variable (0 or 1) cannot have a perfectly linear relationship with predictors.
2. Normal Errors:
MLR assumes that the error terms in the model are normally distributed. With a binary response, the errors can’t be perfectly normally distributed.
3. Constant Variance:
MLR assumes constant variance of errors (homoscedasticity). This doesn’t hold with a binary response.
4. Predicted Values:
MLR can predict outcome values outside the range of 0 and 1, which is nonsensical for probabilities.
1. Odds Ratios:
In logistic regression, an odds ratio represents the change in odds of the outcome occurring associated with a one-unit increase in a predictor variable (holding all other variables constant).
2. Exponentiation:
To obtain the odds ratio from a logistic regression coefficient, we need to exponentiate it. If the coefficient is ‘β’, the odds ratio is calculated as e^β.
To find the odds ratio associated with a ‘k-unit’ change in a continuous predictor:
1.Original odds ratio:
Exponentiate the coefficient (e^β) – this is the odds ratio for a one-unit change. Simply multiply: Multiply the coefficient (β) by k
2.Raise to the power of ‘k’:
(eβ)k(exponentiate the result)
This result represents the change in odds associated with a k-unit increase in the predictor.
1. Logit:
The left-hand side of the equation represents the log odds (or logit) of the event occurring.
2. Solving for Probability:
a.Calculate the linear predictor:
-4 + (0.8 * 3.2) = -1.44
b. Convert logit to odds:
e^(-1.44) = 0.237
c. Calculate probability:
odds / (1 + odds) = 0.237 / (1 + 0.237) = 0.192
Therefore, the predicted probability of the positive class occurring for an observation at X=3.2 is approximately 0.192.
In the pre-live session, we examined the effects of sex and ECG on whether a person has coronary artery disease. Age most certainly has an impact as well. Go back and provide an exploratory analysis on age using the CAD data set. Include a loess plot, summary statistics for age by disease status, and a box plot of age by disease status. Comment on your findings.
Perform a logistic regression analysis using age. Provide a formal write-up of the hypothesis test and interpret the coefficient. Provide a 95 percent confidence interval as well.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(gridExtra)
##
## Attaching package: 'gridExtra'
##
## The following object is masked from 'package:dplyr':
##
## combine
cad_data <- read.csv("coronary.csv")
# Loess Plot
ggplot(cad_data, aes(x = AGE, y = CAD)) +
geom_point(aes(color = CAD)) +
geom_smooth(method = "loess", se = FALSE, aes(color = CAD)) +
theme_minimal() +
labs(x = "Age", y = "Coronary Artery Disease (CAD)")
## `geom_smooth()` using formula = 'y ~ x'
# Summary Statistics
cad_data %>%
group_by(CAD) %>%
summarize(mean_age = mean(AGE),
sd_age = sd(AGE),
n = n())
## # A tibble: 2 × 4
## CAD mean_age sd_age n
## <chr> <dbl> <dbl> <int>
## 1 No 44.1 7.90 37
## 2 Yes 49.4 8.41 41
CAD mean_age sd_age n
# Boxplot
ggplot(cad_data, aes(x = CAD, y = AGE)) +
geom_boxplot() +
theme_minimal() +
labs(x = "Coronary Artery Disease (CAD)", y = "Age")
Alternative EDA
CAD_data <- read.csv("coronary.csv")
# Summary statistics for age by disease status
age_summary <- aggregate(AGE ~ CAD, data = CAD_data, function(x) c(mean = mean(x), sd = sd(x)))
print(age_summary)
## CAD AGE.mean AGE.sd
## 1 No 44.081081 7.899994
## 2 Yes 49.439024 8.405501
# Box plot of age by disease status
ggplot(CAD_data, aes(x = CAD, y = AGE)) +
geom_boxplot() +
labs(title = "Box Plot of Age by Disease Status", x = "Coronary Artery Disease", y = "Age")
# Loess plot of age by disease status (with CAD status converted to numeric for plotting)
CAD_data$CAD_numeric <- as.numeric(CAD_data$CAD == "Yes")
ggplot(CAD_data, aes(x = AGE, y = CAD_numeric)) +
geom_point() +
geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
labs(title = "Loess Plot of Age vs. Disease Status", x = "Age", y = "Probability of CAD")
Observations from output and summary statistics:
These statistics suggest that there is a trend where individuals with CAD tend to be older on average than those without CAD.
Looking at the box plot, we see that the age distribution for those with CAD has a higher median as well as higher 25th and 75th percentiles compared to those without CAD. This aligns with the summary statistics and reinforces the indication that age may be a factor in CAD prevalence.
The loess plot provides a non-parametric way to visualize the relationship between age and the probability of having CAD. From the plot, there appears to be an increase in the probability of CAD with age. The curve is smoother and shows the trend without assuming a specific form for the relationship between age and CAD.
# CAD variable factor level ordered for logistic ression
CAD_data$CAD <- factor(CAD_data$CAD, levels = c("No", "Yes"))
# Convert the CAD factor to numeric (0 and 1) for the logistic regression
CAD_data$CAD_numeric <- as.numeric(CAD_data$CAD) - 1
# Fit the logistic regression model using the numeric CAD variable
CAD_logit_model <- glm(CAD_numeric ~ AGE, data = CAD_data, family = binomial())
# Summary of the model
summary(CAD_logit_model)
##
## Call:
## glm(formula = CAD_numeric ~ AGE, family = binomial(), data = CAD_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.64310 1.41838 -2.568 0.01021 *
## AGE 0.08006 0.02993 2.675 0.00748 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 107.926 on 77 degrees of freedom
## Residual deviance: 99.902 on 76 degrees of freedom
## AIC: 103.9
##
## Number of Fisher Scoring iterations: 4
# Exponentiated coefficient for age to interpret it as an odds ratio
exp(coef(CAD_logit_model))
## (Intercept) AGE
## 0.02617108 1.08334844
# 95% CI for the age coefficient
exp(confint(CAD_logit_model))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.001369886 0.3759326
## AGE 1.024192318 1.1529970
Observations from logistic regression model:
Output Interpretation
Coefficients:
P-value (AGE): 0.00748 (** indicates significance at the 0.01 level). This suggests a significant association between age and CAD.
Odds Ratio (AGE): 1.0833. With each year increase in age, the odds of having CAD are multiplied by approximately 1.08.
95% Confidence Interval (AGE): (1.024, 1.153). We are 95% confident that the true odds ratio for a one-year increase in age lies within this range. Since the interval does not include 1, it further supports the significance of the age effect.
Conclusion
Based on this analysis, there’s a statistically significant positive relationship between age and coronary artery disease (CAD). As age increases, the odds of having CAD also increase.
The following code modifies the miles per gallon data set by converting the response to a binary outcome and cleans up the data set a little bit including dropping the three and five cylinder vehicles.
## mpg cylinders displacement horsepower weight acceleration year origin
## 1 Low 8 307 130 3504 12.0 70 1
## 2 Low 8 350 165 3693 11.5 70 1
## 3 Low 8 318 150 3436 11.0 70 1
## 4 Low 8 304 150 3433 12.0 70 1
## 5 Low 8 302 140 3449 10.5 70 1
## 6 Low 8 429 198 4341 10.0 70 1
## name
## 1 chevrolet chevelle malibu
## 2 buick skylark 320
## 3 plymouth satellite
## 4 amc rebel sst
## 5 ford torino
## 6 ford galaxie 500
Provide an exploratory analysis of the relationship of the categorical Miles per gallon variable and the number of cylinders. Include at least one graphic and summary proportions. Explain what your results are saying.
Perform a logistic regression model using just the single predictor cylinders. Provide a good interpretation of the coefficients that makes the most sense for this problem.
C. Provide an odds ratio for all three comparisons involving the cylinder groups? (four vs. six, four vs. eight, and six vs. eight)
library(ggplot2)
# Summary Proportions (Table)
prop.table(table(newAuto$cylinders, newAuto$mpg), margin = 1)
##
## High Low
## 4 0.89949749 0.10050251
## 6 0.13253012 0.86746988
## 8 0.02912621 0.97087379
# Visual Exploration (Bar Plot)
ggplot(newAuto, aes(x = cylinders, fill = mpg)) +
geom_bar(position = "fill") + # Position = "fill" shows proportions
labs(x = "Cylinders", y = "Proportion", title = "Proportion of MPG Levels by Cylinders") +
theme_minimal()
# Summary Proportions 2
prop.table(table(newAuto$cylinders, newAuto$mpg), margin = 1)
##
## High Low
## 4 0.89949749 0.10050251
## 6 0.13253012 0.86746988
## 8 0.02912621 0.97087379
# Visualizations
ggplot(newAuto, aes(x = cylinders, fill = mpg)) +
geom_bar(position = "dodge") +
labs(x = "Number of Cylinders", y = "Count",
title = "Distribution of High/Low Mileage by Cylinders") +
theme_minimal()
Insights and Observations from EDA: The exploratory analysis provides some clear insights into the relationship between the number of cylinders in a vehicle and its fuel efficiency as categorized into “High” and “Low” mpg.
The 1st bar chart: This displays the counts rather than proportions, visually communicates the trends. There are many more 4-cylinder vehicles classified as “High” mpg than “Low”, and conversely, there are more 6 and 8-cylinder vehicles classified as “Low” mpg.
Summary proportions table:: The table quantifies these observations:
Normalized bar chart: In this chart, we see the proportion of vehicles with “High” and “Low” mpg within each cylinder category. This visualization shows that:
These results suggest a strong negative relationship between the number of cylinders and fuel efficiency as measured by mpg. Vehicles with fewer cylinders are more likely to be fuel-efficient (“High” mpg), whereas those with more cylinders tend to be less fuel-efficient (“Low” mpg). This could be due to several factors, including the larger engine size associated with more cylinders which typically results in higher fuel consumption.
model <- glm(mpg ~ cylinders, data = newAuto, family = binomial())
summary(model)
##
## Call:
## glm(formula = mpg ~ cylinders, family = binomial(), data = newAuto)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.1917 0.2358 -9.296 <2e-16 ***
## cylinders6 4.0704 0.4005 10.164 <2e-16 ***
## cylinders8 5.6982 0.6316 9.022 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 533.72 on 384 degrees of freedom
## Residual deviance: 221.88 on 382 degrees of freedom
## AIC: 227.88
##
## Number of Fisher Scoring iterations: 6
# Getting the coefficients
coefs <- coef(model)
# Calculating the odds ratios from the coefficients
odds_ratios <- exp(coefs)
# To compare the odds of 4 vs 6 cylinders directly from the model
odds_ratio_4_vs_6 <- odds_ratios['6']
# To compare the odds of 4 vs 8 cylinders directly from the model
odds_ratio_4_vs_8 <- odds_ratios['8']
# For 6 vs 8 cylinders, calculate the ratio of the odds ratios
odds_ratio_6_vs_8 <- odds_ratio_4_vs_6 / odds_ratio_4_vs_8
# Output the odds ratios for interpretation
print(odds_ratios)
## (Intercept) cylinders6 cylinders8
## 0.1117318 58.5818182 298.3333333
print(odds_ratio_6_vs_8)
## <NA>
## NA
exp(coef(model))
## (Intercept) cylinders6 cylinders8
## 0.1117318 58.5818182 298.3333333
# Adjusting the calculation for odds ratio of 6 vs 8 cylinders
log_odds_ratio_6_vs_8 <- coefs['cylinders6'] - coefs['cylinders8']
odds_ratio_6_vs_8 <- exp(log_odds_ratio_6_vs_8)
# Output the adjusted odds ratio for 6 vs 8 cylinders
print(odds_ratio_6_vs_8)
## cylinders6
## 0.1963636
library(emmeans)
# Estimated marginal means for the model
emm <- emmeans(model, ~ cylinders)
# Pairwise comparisons
pairs <- pairs(emm)
# Convert to odds ratios
odds_ratios <- summary(pairs, type = "response")
# Print the results
print(odds_ratios)
## contrast odds.ratio SE df null z.ratio p.value
## cylinders4 / cylinders6 0.01707 0.00684 Inf 1 -10.164 <.0001
## cylinders4 / cylinders8 0.00335 0.00212 Inf 1 -9.022 <.0001
## cylinders6 / cylinders8 0.19636 0.13145 Inf 1 -2.432 0.0399
##
## P value adjustment: tukey method for comparing a family of 3 estimates
## Tests are performed on the log odds ratio scale
Interpretation of results:
The model starts with a negative intercept of -2.1917, indicating
that with 4-cylinder vehicles as the baseline, their likelihood for
“High” mpg is low. The coefficient for cylinders6
, at
4.0704, suggests that 6-cylinder vehicles are much more likely to have
“High” mpg compared to 4-cylinder ones. This is odd to me because
typically, more cylinders mean more fuel consumption and lower mpg.
For cylinders8
, with a coefficient of 5.6982, it means
8-cylinder vehicles are even more likely to have “High” mpg compared to
4-cylinder vehicles. This also goes against my expectations for the same
reasons as above.
Looking at the odds ratios, 6-cylinder vehicles are about 58.58 times more likely, and 8-cylinder vehicles about 298.33 times more likely, to have “High” mpg compared to 4-cylinder vehicles. These findings don’t align with what I’d normally expect, hinting that “High” mpg might be pointing to the lower mpg values since it’s based on a median split.
Comparing 6-cylinder to 8-cylinder vehicles, the odds for “High” mpg are lower for 6-cylinder vehicles, about a fifth, which fits more with what I understand about engine size and fuel efficiency.
Using emmeans
for pairwise comparisons, the odds are
much lower for 4-cylinder vehicles to be in the “High” mpg category
compared to both 6 and 8-cylinder vehicles. The odds ratios from
emmeans
are in line with the manual calculations.
The analysis showed results that were opposite to what I expected, as vehicles with fewer cylinders usually have better mpg. This makes me think the label “High” mpg might not truly represent higher efficiency due to how the binary variable was formed from a continuous one using a median split. It’s crucial to think about how these variables are coded and named, particularly when they come from splitting continuous variables, as the terms “High” and “Low” might not accurately reflect efficiency categories without considering the distribution of the original mpg values.
So,to get a clear understanding of the coefficients from my logistic
regression model focusing on the cylinders
predictor, I’ll
start with what’s generally expected: vehicles with fewer cylinders tend
to be more fuel-efficient, showing higher mpg, while those with more
cylinders are less fuel-efficient, showing lower mpg.
In this model, the response variable mpg
is binary,
labeled “High” or “Low,” based on whether a vehicle’s mpg is above or
below the dataset’s median mpg.
Considering the surprising direction of the coefficients, they might make more sense if we assume “High” mpg was actually meant to indicate vehicles with mpg below the median, which would be less fuel-efficient.
Starting with 4-cylinder vehicles as the baseline, the negative intercept suggests that the likelihood of these vehicles having mpg above the median (thus, being more fuel-efficient) is low. But if mpg coding is the opposite, it implies 4-cylinder vehicles are likely to have high mpg.
For 6-cylinder vehicles, this coefficient increase indicates they’re less likely to fall into the more fuel-efficient category compared to 4-cylinder vehicles, matching expectations.
This shows 8-cylinder vehicles have even less chance of being more fuel-efficient compared to 4-cylinder ones, which is also what’s typically expected.
Odds Ratios:
An odds ratio of 58.58 suggests 6-cylinder vehicles are much less likely to be seen as having high mpg compared to 4-cylinder ones.
With an odds ratio of 298.33, 8-cylinder vehicles are significantly less likely to be considered as having high mpg than 4-cylinder ones.
An odds ratio of 0.196 shows that, in comparison, 6-cylinder vehicles are more likely to be in the higher mpg category than 8-cylinder vehicles.
In summary, if “High” mpg actually stands for lower mpg values (indicating less fuel efficiency), then the model’s coefficients are logical. It means vehicles with more cylinders are more likely to be labeled “High” mpg, aligning with being less fuel-efficient.
Suppose a friend came to you with a data set that you have pinned as a logistic regression problem. Suppose he gives you the data set to analyze. The response is binary and the predictor is continuous. Explore this data set to determine whether logistic regression will indeed work for this data set and if so, complete the analysis and provide a generic interpretation of the coefficients. If it is not appropriate, explain.
investigate <-read.csv("Investigate.csv",stringsAsFactors = T)
head(data)
library(ggplot2) ggplot(investigate, aes(x = x, y = y)) + geom_jitter(width = 0.1, height = 0.1, alpha = 0.5) + geom_smooth(method = “glm”, method.args = list(family = “binomial”), se = FALSE) + labs(title = “Scatterplot with Logistic Regression Line”, x = “Predictor (x)”, y = “Response (y)”)
model <- glm(y ~ x, data = investigate, family = binomial()) summary(model)
residuals_plot <- ggplot(investigate, aes(x = fitted(model), y = residuals(model, type = “deviance”))) + geom_point() + geom_hline(yintercept = 0, linetype = “dashed”) + labs(title = “Residuals vs. Fitted Plot”, x = “Fitted Values”, y = “Residuals”)
print(residuals_plot)
investigate <-read.csv(“Investigate.csv”,stringsAsFactors = T) head(data) x y 1 0.1540348 0 2 3.8089079 1 3 2.5328041 1 4 3.0170302 1 5 3.0093886 0 6 3.6670790 1 model <- glm(y ~ x, data = investigate, family = binomial()) summary(model)
Call: glm(formula = y ~ x, family = binomial(), data = investigate)
Coefficients: Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.4442 0.6463 -0.687 0.4919
x 0.5362 0.2648 2.025 0.0429 * — Signif. codes: 0 ‘’ 0.001
‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 50.446 on 39 degrees of freedom
Residual deviance: 45.862 on 38 degrees of freedom AIC: 49.862
Number of Fisher Scoring iterations: 3
library(ggplot2)
help<-read.csv("Investigate.csv",stringsAsFactors = T)
head(help)
## x y
## 1 0.1540348 0
## 2 3.8089079 1
## 3 2.5328041 1
## 4 3.0170302 1
## 5 3.0093886 0
## 6 3.6670790 1
ggplot(help, aes(x=x, y=y)) +
geom_point(alpha=0.5) +
labs(title="Scatter plot of Continuous Predictor vs. Binary Response",
x="Continuous Predictor",
y="Binary Response")
Analysis and Response:
Looking at the scatterplot with the logistic regression line, I
noticed a trend where as x
, a continuous predictor,
increases, the chance of y
being 1 goes up too. The
scatterplot reveals that data points with y = 1
tend to
gather at higher x
values, while points for
y = 0
are more common at lower x
values. The
logistic regression line seems to reflect this pattern well, suggesting
a logistic regression model fits this data appropriately.
The “Residuals vs. Fitted Plot” didn’t show any clear patterns or deviations from zero, meaning the model fits the data well without breaking any logistic regression assumptions.
From the model summary:
Intercept (-0.4442): This intercept is the
log-odds of y
being 1 when x
is 0. Its value,
not statistically significant with a p-value of 0.4919, implies that at
x
= 0, there’s weak evidence of an association with the
probability of y
being 1.
Coefficient for x
(0.5362): This
coefficient shows how the log-odds of y
being 1 change with
a one-unit increase in x
, and it’s statistically
significant (p = 0.0429). This means x
is a meaningful
predictor of y
. For every one-unit rise in x
,
the odds of y
being 1 go up by about 1.71, keeping
everything else constant.
The model’s AIC is 49.862. This metric helps judge the quality of statistical models for a given dataset. Without other models for comparison, we can’t do much with it here, but usually, a lower AIC suggests a model fits better.
Based on the initial data analysis and the logistic regression
results, using logistic regression for this dataset appears valid. The
predictor x
has a significant link with the binary outcome
y
, and the model checks don’t show any significant problems
with this method.