Unit 11 Pre-Live Session

Discussion 1: Continuous Predictor Case

As discussed in the videos, visualizing trends when the predictor is continuous and the response is binary takes a little care. Loess curves are one way to examine it but the results will vary depending on the data set and sample size. We will illustrate this using two separate data sets. The first is the credit card default set from the videos and the second takes a look at an experiment evaluating the failure of O-rings during test launches before the Challenger launch, which catastrophically failed due to an O-ring failure.

There are two main steps to visualize using loess curves. The first is setting up your response variable as numeric, and the second is simply plotting the points with a loess curve overlaid. When creating your numeric response variable, code it so the primary event is a 1 so that it is consistent with other output you create.

The ISLR Default data set actually has multiple predictors, but we will just examine the balance predictor, which is continuous. Note that the response is not numeric, so we will add an additional column. Our interest is in when customers default, so the level “Yes” will take the numeric value 1.

library(ISLR)
table(Default$default)

## 
##   No  Yes 
## 9667  333

Default$default.num<-ifelse(Default$default=="Yes",1,0)
#Sanity check
table(Default$default.num,Default$default)

##    
##       No  Yes
##   0 9667    0
##   1    0  333

Once the response is numeric, a simple call can be made using ggplot2:

library(ggplot2)
ggplot(Default,(aes(x=balance,y=default.num)))+geom_point()+geom_smooth(method="loess",span=.4)

## `geom_smooth()` using formula = 'y ~ x'

Note that the loess fit comes with options that modify the fit. The point here is not to get anything exact; we just want to see the trend. Change the span option to 0.1 and observe the difference. Although the two fits differ, both suggest that as the balance on the credit card account increases, the chance of the customer defaulting increases.

Rinse and repeat this strategy using the Challenger data set from the Sleuth3 package to examine the relationship between O-ring failure and the temperature at launch. Make note of the difficulties with this much smaller data set when fitting a loess curve (use the default settings first), and then specify span=1. What does this plot suggest about the relationship?

library(Sleuth3)
head(ex2011)

##   Temperature Failure
## 1          53     Yes
## 2          56     Yes
## 3          57     Yes
## 4          63      No
## 5          66      No
## 6          67      No

The following code fits a logistic regression model to the Challenger data set. Interpret the regression coefficient in terms of an odds ratio. Does this make sense given your previous graph?

mydata<-ex2011
log.model<-glm(Failure~Temperature,data=mydata,family="binomial")
coef(summary(log.model))

##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) 10.8753491 5.70291251  1.906982 0.05652297
## Temperature -0.1713205 0.08343865 -2.053251 0.04004824

Before fitting logistic models for hypothesis testing, it is important to pay attention to the order of the levels of your response. For a two-level factor response, glm treats the first level as the non-event and the last level as the event (positive class) of interest. Note the following example:

mydata$newFail<-factor(ifelse(mydata$Failure=="Yes","Failed","Survived"))
#View(mydata)
levels(mydata$newFail)

## [1] "Failed"   "Survived"

levels(mydata$Failure)

## [1] "No"  "Yes"

newlog.model<-glm(newFail~Temperature,data=mydata,family="binomial")
coef(summary(newlog.model))

##                Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) -10.8753491 5.70291251 -1.906982 0.05652297
## Temperature   0.1713205 0.08343865  2.053251 0.04004824

What is the interpretation of this coefficient for temperature in the model? Note that the p-values and standard errors are the same. Only the signs of the coefficients have changed.

ANSWER DISCUSSION 1:

Continuous Predictors

A.

library(ggplot2)
ggplot(Default, aes(x=balance, y=default.num)) + 
  geom_point() + 
  geom_smooth(method="loess", span=0.1)

## `geom_smooth()` using formula = 'y ~ x'

This creates a plot with a tighter fit to the data points, since a lower span value makes the loess curve more sensitive to local variation, while still showing the trend of increasing default probability with balance. With balance on the x-axis and default.num on the y-axis, there is a distinct sigmoidal trend: as balance rises, so does the probability of defaulting (default.num=1), particularly once the balance exceeds about $1500. This is the classic shape of a logistic regression curve and suggests that higher credit card balances are associated with a higher chance of default. The shaded area around the loess curve is the confidence band for the smoothed estimate, which gives me an idea of how reliable the estimated probabilities are. Notably, the band broadens at higher balances, so my confidence in the estimated default probability diminishes as the balance grows.

Loess Curves: Loess curves help visualize the relationship between a continuous predictor (like balance) and a binary outcome (like defaulting on a credit card). They provide a smoothed, flexible line to show trends that might be obscured in a raw scatterplot.

Span Parameter: The span parameter in loess fitting controls the smoothness of the curve. Smaller values create more wiggly curves that tightly follow the data; larger values produce smoother lines representing broader trends.

Interpretation Challenges: Smaller datasets can make loess curves less reliable. It’s best to see them as exploratory tools highlighting overall trends, not precise predictions.
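The effect of the span parameter can be seen numerically as well as visually. The following is a minimal sketch in base R on simulated data (the data, the seed, and the `roughness` helper are made up for illustration and are not part of the Default data set); it fits `loess()` with a small and a large span and compares how much each fitted curve bends.

```r
# Simulated binary response (illustration only; not the Default data)
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
y <- rbinom(n, 1, plogis(-4 + 0.8 * x))  # event probability rises with x

# Same data, two different spans
fit_wiggly <- loess(y ~ x, span = 0.1)
fit_smooth <- loess(y ~ x, span = 0.9)

# Evaluate both fits on a common interior grid
grid <- data.frame(x = seq(1, 9, length.out = 100))
pred_wiggly <- predict(fit_wiggly, newdata = grid)
pred_smooth <- predict(fit_smooth, newdata = grid)

# Hypothetical helper: sum of squared second differences,
# a rough measure of how much a curve wiggles
roughness <- function(v) sum(diff(v, differences = 2)^2)

c(span_0.1 = roughness(pred_wiggly), span_0.9 = roughness(pred_smooth))
```

The smaller span should give the larger roughness value, which matches what the two ggplot2 fits show: the same upward trend, with more local wiggle at span=0.1.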

B.

Challenger Data and O-Ring Failures: The Challenger data set analysis with a loess plot might present challenges due to its smaller size, affecting the smoothness and reliability of the loess fit. A default span value might produce a curve that doesn’t capture the underlying trend well due to the reduced data points, while specifying span=1 offers a smoother curve that may better reveal the general trend. This plot likely suggests a relationship between lower temperatures and increased likelihood of O-ring failure. The smoother curve with span=1 could highlight the general decrease in reliability of the O-rings as temperatures decrease, aligning with the hypothesis that colder temperatures contributed to the Challenger disaster.

library(Sleuth3)
data(ex2011) # Assuming ex2011 is the Challenger data set

# Convert Failure to numeric if necessary
ex2011$Failure.num <- ifelse(ex2011$Failure == "Yes", 1, 0)

# Plot with default span
ggplot(ex2011, aes(x=Temperature, y=Failure.num)) + 
  geom_point() + 
  geom_smooth(method="loess")

## `geom_smooth()` using formula = 'y ~ x'

# Plot with span=1
ggplot(ex2011, aes(x=Temperature, y=Failure.num)) + 
  geom_point() + 
  geom_smooth(method="loess", span=1)

## `geom_smooth()` using formula = 'y ~ x'

The span=1 makes the curve smoother, potentially oversimplifying the trend, especially in smaller datasets. This will illustrate the general trend of O-ring failure probability with temperature but might not capture local variations well. In this plot, where Temperature is on the x-axis and Failure.num on the y-axis, the relationship appears more inconsistent. The trend seems to show that the risk of O-ring failure (Failure.num=1) climbs as temperatures fall. However, the wide confidence interval highlights a significant level of uncertainty in these predictions, likely due to the limited data or its high variability.

There’s an intriguing dip in the curve at mid-range temperatures, hinting at a non-linear relationship in which the estimated failure probability falls as temperature rises to that point and then ticks back up. With so few launches, this may simply be an artifact of the loess fit, and the wide confidence interval advises me to interpret the trend cautiously.

C.

Logistic Regression Coefficient: The logistic regression model on the Challenger data examines the relationship between temperature and O-ring failure. The coefficient for temperature is the change in the log-odds of an O-ring failure associated with a one-degree increase in temperature. To interpret it as an odds ratio, we exponentiate the coefficient: exp(-0.1713205) ≈ 0.843. Interpretation: for every one-degree increase in temperature, the odds of O-ring failure decrease by roughly 16% (1 - 0.843).

Odds Ratio: An odds ratio of 1 means no effect. Values greater than 1 mean the odds of the outcome increase as the predictor increases; values less than 1 mean the odds decrease. Here, the negative coefficient for temperature (odds ratio below 1) implies that lower temperatures increase the odds of an O-ring failure. The graph supports this, even if the loess fit is less reliable due to the small dataset: it also points to lower temperatures increasing the probability of O-ring failure.
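The arithmetic above can be checked in a few lines of base R. This is a minimal sketch that hard-codes the temperature estimate from the printed coefficient table rather than refitting the model; the variable names and the 10-degree comparison are illustrative assumptions, not part of the original exercise.

```r
# Temperature estimate copied from coef(summary(log.model)) above
b_temp <- -0.1713205

# Odds ratio for a one-degree increase in temperature
or_1deg  <- exp(b_temp)           # about 0.843
pct_drop <- 100 * (1 - or_1deg)   # roughly a 16% drop in the odds of failure

# Effects multiply on the odds scale, so a 10-degree increase
# corresponds to exp(10 * b), not 10 * exp(b)
or_10deg <- exp(10 * b_temp)

round(c(or_1deg = or_1deg, pct_drop = pct_drop, or_10deg = or_10deg), 3)
```

Working on the log-odds scale first and exponentiating last is what keeps multi-degree comparisons correct.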

D.

Response Level Ordering: In logistic regression, the order of the response levels (e.g., “No” vs. “Yes”) matters, because glm models the log-odds of the last level. Flipping the order flips the sign of every coefficient but does not change the underlying relationship. In the new model the levels of newFail are “Failed” and “Survived”, so the event being modeled is now “Survived”, not “Failed”. The temperature coefficient therefore changes sign to 0.1713205: for each one-degree increase in temperature, the odds of the O-rings surviving are multiplied by exp(0.1713205) ≈ 1.187, an increase of roughly 19%. This is the reciprocal of the earlier odds ratio (1/0.843 ≈ 1.187) and tells the same story as the first model: lower temperatures raise the odds of failure, which agrees with the physical understanding of the Challenger disaster. The p-values and standard errors remain the same, so the statistical significance and precision of the estimate are unchanged; only the event that the coefficient refers to differs.
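The level-ordering behavior is easy to confirm on a small example. The sketch below uses simulated data (the temperatures, the seed, and the coefficients used to generate the outcome are all made up for illustration; this is not the ex2011 data) and fits the same model under three codings of the response.

```r
# Simulated launches (illustration only; not the ex2011 data)
set.seed(42)
temp   <- runif(200, 50, 85)
failed <- rbinom(200, 1, plogis(10 - 0.17 * temp))

# Levels "No", "Yes": glm models the odds of "Yes" (failure)
out_a <- factor(ifelse(failed == 1, "Yes", "No"))
# Levels "Failed", "Survived": glm models the odds of "Survived"
out_b <- factor(ifelse(failed == 1, "Failed", "Survived"))
# relevel() makes "Survived" the first level, so "Failed" is the event again
out_c <- relevel(out_b, ref = "Survived")

fit_a <- glm(out_a ~ temp, family = "binomial")
fit_b <- glm(out_b ~ temp, family = "binomial")
fit_c <- glm(out_c ~ temp, family = "binomial")

# fit_b has the same magnitudes as fit_a with opposite signs;
# fit_c matches fit_a
rbind(a = coef(fit_a), b = coef(fit_b), c = coef(fit_c))
levels(out_c)
```

Checking `levels()` before fitting, and using `relevel()` when needed, avoids interpreting a coefficient for the wrong event.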

Discussion 2: Categorical Predictors

Dr. Turner showed in the videos that when you have a categorical predictor with two levels, the logistic regression model produces essentially the same analysis as a chi-square test and odds ratio from a 2x2 table. When you have more than two levels, the interpretation approach is similar to a one-way ANOVA.

Consider the following coronary artery disease data set which tries to assess important risk factors such as age, sex, and echo cardiogram results on coronary artery disease. Sex and age are self explanatory. The ECG variable stands for echo cardiogram. Although the variable is categorical, Low stands for Normal, Medium stands for Mild, and High stands for Severe in terms of the echo cardiogram’s general reading.

cad<-read.csv("coronary.csv",stringsAsFactors = T)
head(cad)

##   X    Sex    ECG AGE CAD
## 1 1 Female    Low  28  No
## 2 2   Male    Low  42 Yes
## 3 3 Female Medium  46  No
## 4 4   Male Medium  45  No
## 5 5 Female    Low  34  No
## 6 6   Male    Low  44 Yes

To explore categorical variables we typically make use of barplots. Below are bar plots examining the relationship of sex and ECG on disease status. Describe what trends you see with the predictors and their potential impact on having coronary artery disease.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## `summarise()` has grouped output by 'ECG'. You can override using the `.groups`
## argument.

## # A tibble: 6 × 4
## # Groups:   ECG [3]
##   ECG    CAD     cnt  perc
##   <fct>  <fct> <int> <dbl>
## 1 High   Yes      10 0.769
## 2 Low    No       21 0.636
## 3 Medium Yes      19 0.594
## 4 Medium No       13 0.406
## 5 Low    Yes      12 0.364
## 6 High   No        3 0.231

## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.

## # A tibble: 4 × 4
## # Groups:   Sex [2]
##   Sex    CAD     cnt  perc
##   <fct>  <fct> <int> <dbl>
## 1 Female No       21 0.636
## 2 Female Yes      12 0.364
## 3 Male   No       16 0.356
## 4 Male   Yes      29 0.644

Below is a logistic regression model using ECG alone. Interpret the regression coefficients as odds ratios for this setting. Remember, if an odds ratio is less than 1, you can always take the reciprocal and interpret it the other way around. Do your best and let’s see what you come up with.

##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept)  1.2039728  0.6582801  1.828967 0.06740450
## ECGLow      -1.7635886  0.7511891 -2.347729 0.01888825
## ECGMedium   -0.8244832  0.7502582 -1.098933 0.27179746

There is no work for this problem, but it’s an important piece of information. The focus of the discussion is interpreting the coefficients, which is the most important thing. To get CIs for odds ratios (regardless of whether the predictors are categorical or continuous), you can run the following code on a glm object. As the output shows, log.model here is the ECG model above.

exp(confint(log.model))

## Waiting for profiling to be done...

##                  2.5 %    97.5 %
## (Intercept) 1.01973260 14.867917
## ECGLow      0.03330912  0.684391
## ECGMedium   0.08580524  1.766097

ANSWER DISCUSSION 2:

Categorical Predictors

A.

Barplots and Risk Assessment: The trends observed in the bar plots for the coronary artery disease (CAD) data set suggest relationships between the predictors (sex and ECG) and the outcome (presence of CAD). For the ECG predictor, the proportion with CAD is highest in the “High” category (severe echo cardiogram results, 76.9%), followed by “Medium” (mild, 59.4%) and “Low” (normal, 36.4%). This suggests that more severe ECG readings are associated with a higher risk of coronary artery disease. For the sex predictor, a higher proportion of CAD is observed in males (64.4%) than in females (36.4%), indicating that sex may be an important risk factor for CAD, with males at higher risk. The plots alone do not establish statistical significance.
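For the two-level sex predictor, the odds ratio from the 2x2 table and the one from a logistic regression should agree. The sketch below rebuilds the table from the Sex by CAD counts printed above and fits the model to those grouped counts; fitting on grouped counts with `cbind(yes, no)` is an assumption made here because the code behind the original summary is not shown, and the object names are illustrative.

```r
# Counts taken from the Sex x CAD summary above
tab <- matrix(c(12, 21,
                29, 16),
              nrow = 2, byrow = TRUE,
              dimnames = list(Sex = c("Female", "Male"),
                              CAD = c("Yes", "No")))

# Odds ratio (Male vs Female) straight from the table
or_table <- (tab["Male", "Yes"] / tab["Male", "No"]) /
            (tab["Female", "Yes"] / tab["Female", "No"])

# Same comparison via logistic regression on the grouped counts
grouped <- data.frame(Sex = factor(c("Female", "Male")),
                      yes = c(12, 29),
                      no  = c(21, 16))
fit_sex <- glm(cbind(yes, no) ~ Sex, data = grouped, family = binomial)
or_glm  <- exp(coef(fit_sex)[["SexMale"]])

c(table = or_table, glm = or_glm)

# Chi-square test of association on the same table (no continuity correction)
chisq.test(tab, correct = FALSE)
```

Both routes give an odds ratio a little above 3 for males relative to females, in line with the bar plot.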

B.

Logistic Regression w/ Categorical Predictors: The logistic regression model with ECG as the predictor produces coefficients for the “Low” and “Medium” categories of ECG readings. The reference category is “High”: it has no coefficient of its own, and R uses the first level alphabetically as the baseline. The negative coefficient for “Low” (-1.7635886, p ≈ 0.019) means that, compared to individuals with severe (High) readings, individuals with normal (Low) ECG readings have significantly lower odds of having CAD. The odds ratio for “Low” is $e^{-1.7635886} \approx 0.171$, so the odds of CAD in the Low group are about 17% of the odds in the High group; taking the reciprocal, the odds in the High group are about 5.8 times the odds in the Low group. The “Medium” coefficient is also negative (-0.8244832), giving an odds ratio of $e^{-0.8244832} \approx 0.438$, but with p ≈ 0.27 this difference from the “High” category is not statistically significant. The direction of these coefficients supports the trend observed in the bar plots, where more severe ECG readings are associated with a higher risk of CAD.
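These coefficients can be traced back to the counts in the ECG by CAD summary. The sketch below is a hedged reconstruction: it fits the model to grouped counts with `cbind(yes, no)`, which is an assumption (the original model code is hidden and presumably used the row-level data), and the object names are illustrative.

```r
# Counts taken from the ECG x CAD summary above
ecg <- data.frame(ECG = factor(c("High", "Low", "Medium")),
                  yes = c(10, 12, 19),
                  no  = c(3, 21, 13))

# Odds of CAD within each ECG group
odds <- ecg$yes / ecg$no
names(odds) <- as.character(ecg$ECG)

# Odds ratios relative to the reference level "High"
or_vs_high <- odds[c("Low", "Medium")] / odds[["High"]]

# Logistic regression on the grouped counts; "High" is the baseline
# because it is the first factor level alphabetically
fit_ecg <- glm(cbind(yes, no) ~ ECG, data = ecg, family = binomial)

coef(fit_ecg)      # intercept is log(odds["High"])
log(or_vs_high)    # should match the ECGLow and ECGMedium coefficients
```

With a single categorical predictor, each coefficient is just the log of a ratio of group odds, which is why the table and the model tell the same story.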

C.

Looking at coefficients from GLM: Confidence intervals (CIs) for odds ratios from a logistic regression model give a range of plausible values for the true odds ratio at a stated level of confidence (typically 95%). For example, the CI for the “Low” category of ECG readings (0.03330912 to 0.684391) lies entirely below 1, so the odds of CAD are significantly lower for individuals with normal (Low) ECG readings than for those with severe (High) readings. The CI for the “Medium” category (0.08580524 to 1.766097) contains 1, so although the point estimate suggests lower odds of CAD than in the “High” category, the data are also consistent with no difference. These CIs agree with the p-values of the logistic regression coefficients and support the importance of ECG severity, most clearly for Low versus High, in predicting the risk of coronary artery disease.
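As a rough cross-check, Wald intervals can be computed by hand from the estimates and standard errors in the coefficient table. This is only a sketch: `confint()` on a glm object uses profile likelihood (hence the "Waiting for profiling" message), so the Wald limits below will differ somewhat from the printed ones, especially with a sample this small. The object names are illustrative.

```r
# Estimates and standard errors copied from the ECG coefficient table
est <- c(ECGLow = -1.7635886, ECGMedium = -0.8244832)
se  <- c(ECGLow =  0.7511891, ECGMedium =  0.7502582)

# 95% Wald interval on the log-odds scale, then exponentiate
z <- qnorm(0.975)
wald_ci <- exp(cbind(lower = est - z * se, upper = est + z * se))
wald_ci

# An interval that excludes 1 corresponds to a significant odds ratio
excludes_one <- wald_ci[, "upper"] < 1 | wald_ci[, "lower"] > 1
excludes_one
```

The conclusion matches the profile-likelihood intervals: the Low versus High interval excludes 1, while the Medium versus High interval does not.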

MSDS 6372: Jessica McPhaul
