Jessica McPhaul - 6372 _Unit_11_PreLive
As discussed in the videos, visualizing trends when the predictor is continuous and the response is binary takes a little care. Loess curves are one way to examine it but the results will vary depending on the data set and sample size. We will illustrate this using two separate data sets. The first is the credit card default set from the videos and the second takes a look at an experiment evaluating the failure of O-rings during test launches before the Challenger launch, which catastrophically failed due to an O-ring failure.
There are two main steps to visualize using loess curves. The first is setting up your response variable as numeric, and the second is to plot the points with a loess curve overlaid. When creating your numeric response variable, code it so the primary event is a 1 so that it makes sense with other output you create.
The ISLR Default data set actually has multiple predictors, but we will just examine the balance predictor, which is continuous. Note that the response is not numeric, so we will add an additional column. Our interest is in when customers default, so the level “Yes” will take the numeric value 1.
library(ISLR)
table(Default$default)
##
##   No  Yes
## 9667  333
Default$default.num<-ifelse(Default$default=="Yes",1,0)
#Sanity check
table(Default$default.num,Default$default)
##
##       No  Yes
##   0 9667    0
##   1    0  333
Once the response is numeric a simple call can be made using ggplot2
library(ggplot2)
ggplot(Default,(aes(x=balance,y=default.num)))+geom_point()+geom_smooth(method="loess",span=.4)
## `geom_smooth()` using formula = 'y ~ x'
Just note that the loess fit comes with some options to modify the fit. The point here is to not get anything exact. We just want to see the trend. Change the span option to 0.1 and observe the difference. Although the two fits are different, they both suggest that as the balance on the credit card account increases, the chance of the customer defaulting on their credit card increases.
Rinse and repeat this strategy using the Challenger data set from the Sleuth3 package to examine the relationship between when an O-ring failure occurred and temperature of the launch. Make note of the difficulties with this much smaller data set when it comes to fitting a loess plot (use default setting) and then specify the span=1. What does this plot suggest when it comes to the relationship?
library(Sleuth3)
head(ex2011)
##   Temperature Failure
## 1          53     Yes
## 2          56     Yes
## 3          57     Yes
## 4          63      No
## 5          66      No
## 6          67      No
Fit a logistic regression model to the Challenger data set. Interpret the regression coefficient in terms of an odds ratio. Does this make sense given your previous graph?
mydata<-ex2011
log.model<-glm(Failure~Temperature,data=mydata,family="binomial")
coef(summary(log.model))
##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) 10.8753491 5.70291251  1.906982 0.05652297
## Temperature -0.1713205 0.08343865 -2.053251 0.04004824
mydata$newFail<-factor(ifelse(mydata$Failure=="Yes","Failed","Survived"))
#View(mydata)
levels(mydata$newFail)
## [1] "Failed" "Survived"
levels(mydata$Failure)
## [1] "No" "Yes"
newlog.model<-glm(newFail~Temperature,data=mydata,family="binomial")
coef(summary(newlog.model))
##                Estimate Std. Error   z value   Pr(>|z|)
## (Intercept) -10.8753491 5.70291251 -1.906982 0.05652297
## Temperature   0.1713205 0.08343865  2.053251 0.04004824
What is the interpretation of this coefficient for temperature in the model? Note the p-values and SE’s are the same. It is only the coefficients that have changed.
Continuous Predictors
library(ggplot2)
ggplot(Default, aes(x=balance, y=default.num)) +
geom_point() +
geom_smooth(method="loess", span=0.1)
## `geom_smooth()` using formula = 'y ~ x'
This creates a plot with a tighter fit to the data points: a lower span value makes the loess curve more sensitive to local variation, so the curve is wigglier but still shows default probability increasing with balance. With balance on the x-axis and default.num on the y-axis, there is a distinct sigmoidal trend. As balance rises, so does the probability of defaulting (default.num=1), particularly once the balance exceeds about $1500. The shape is typical of a logistic regression curve and suggests that higher credit card balances are associated with a higher chance of default. The shaded area around my loess curve is the confidence band for the smoothed estimate, which gives me an idea of how reliable the estimate is. Notably, the band broadens at higher balances, where the data are sparser, so I am less certain about the estimated default probability there.
Loess Curves: Loess curves help visualize the relationship between a continuous predictor (like balance) and a binary outcome (like defaulting on a credit card). They provide a smoothed, flexible line to show trends that might be obscured in a raw scatterplot.
Span Parameter: The span parameter in loess fitting controls the smoothness of the curve. Smaller values create more wiggly curves that tightly follow the data; larger values produce smoother lines representing broader trends.
Interpretation Challenges: Smaller data sets can make loess curves less reliable. It’s best to see them as exploratory tools highlighting overall trends, not precise predictions.
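To make the effect of the span parameter concrete, here is a minimal sketch using base R's loess() on simulated binary data. The simulated data, the grid, and the roughness() helper are assumptions made for illustration; they are not the Default data or part of the original analysis.

```r
# Simulated data (an assumption for illustration, not the Default data set)
set.seed(1)
n <- 500
x <- runif(n, 0, 10)
p <- plogis(-5 + x)        # true probability of the event rises with x
y <- rbinom(n, 1, p)       # binary response coded 0/1

# Evaluate both fits on a common grid inside the range of x
grid <- data.frame(x = seq(0.5, 9.5, length.out = 100))

# Fit loess with a small and a large span
fit_small <- loess(y ~ x, span = 0.1)
fit_large <- loess(y ~ x, span = 0.75)
pred_small <- predict(fit_small, newdata = grid)
pred_large <- predict(fit_large, newdata = grid)

# Hypothetical helper: sum of squared second differences of a fitted curve
roughness <- function(f) sum(diff(f, differences = 2)^2)
rough_small <- roughness(pred_small)
rough_large <- roughness(pred_large)
c(span_0.1 = rough_small, span_0.75 = rough_large)
```

The smaller span should give the larger roughness value, which is the numeric counterpart of the wigglier curve seen in the plots.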
Challenger Data and O-Ring Failures: Fitting a loess curve to the Challenger data set is harder because the data set is much smaller, which affects the smoothness and reliability of the fit. The default span can produce a curve that does not capture the underlying trend well because so few points fall in each local window, while specifying span=1 gives a smoother curve that better reveals the general trend. The plot suggests that lower temperatures go with a higher likelihood of O-ring failure. The smoother curve with span=1 highlights the general decline in O-ring reliability as temperature decreases, consistent with the hypothesis that colder temperatures contributed to the Challenger disaster.
library(Sleuth3)
data(ex2011) # Assuming ex2011 is the Challenger data set
# Convert Failure to numeric if necessary
ex2011$Failure.num <- ifelse(ex2011$Failure == "Yes", 1, 0)
# Plot with default span
ggplot(ex2011, aes(x=Temperature, y=Failure.num)) +
geom_point() +
geom_smooth(method="loess")
## `geom_smooth()` using formula = 'y ~ x'
# Plot with span=1
ggplot(ex2011, aes(x=Temperature, y=Failure.num)) +
geom_point() +
geom_smooth(method="loess", span=1)
## `geom_smooth()` using formula = 'y ~ x'
There’s an intriguing dip in the curve at mid-range temperatures, hinting at a possible non-linear relationship in which the failure probability drops with rising temperature up to a certain point and then climbs again as temperature continues to rise. Yet the wide confidence band advises me to interpret this trend cautiously.
Logistic Regression Coefficient: The logistic regression model on the Challenger data examines the relationship between temperature and O-ring failure. The coefficient for temperature is the change in the log-odds of an O-ring failure associated with a one-unit increase in temperature. To interpret it as an odds ratio, we exponentiate the coefficient: exp(-0.1713205) ≈ 0.843.
Interpretation: For every one-degree increase in temperature, the odds of O-ring failure decrease by roughly 16% (1 - 0.843).
Odds Ratio: An odds ratio of 1 means no effect. Values greater than 1 mean the odds of the outcome increase as the predictor increases; values less than 1 mean the odds decrease. Here, the negative coefficient for temperature implies that lower temperatures increase the odds of an O-ring failure. The graph supports this, even though the loess fit is less reliable with such a small data set.
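The arithmetic behind this interpretation can be sketched in a few lines of R. The estimate and standard error are copied from the coefficient table for log.model; the Wald interval is an added illustration and is not something the original analysis reports.

```r
# Estimate and standard error copied from the Temperature row of the output
beta_temp <- -0.1713205
se_temp   <- 0.08343865

odds_ratio <- exp(beta_temp)            # multiplicative change in odds per degree
pct_change <- (odds_ratio - 1) * 100    # percent change in odds per degree

# Approximate 95% Wald interval on the odds-ratio scale (added illustration)
wald_ci <- exp(beta_temp + c(-1, 1) * qnorm(0.975) * se_temp)

round(c(odds_ratio = odds_ratio, pct_change = pct_change), 3)
round(wald_ci, 3)
```

An interval that stays below 1 agrees with the small p-value for Temperature in the coefficient table.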
Response Level Ordering: In logistic regression, the order of the response levels (e.g., “No” vs. “Yes”) matters. With a two-level factor response, glm in R treats the first level as the baseline and models the log-odds of the second level. Flipping the order flips the sign of the coefficient but not the underlying conclusion. In the first model the levels of Failure are “No” and “Yes”, so the model describes the odds of failure. In the second model the levels of newFail are “Failed” and “Survived”, so the model describes the odds of survival, and the temperature coefficient changes sign to 0.1713205. For each one-degree increase in temperature, the odds of the O-rings surviving are multiplied by exp(0.1713205) ≈ 1.187, an increase of about 19%. This is the reciprocal of the odds ratio from the first model (1/0.843) and says the same thing: lower temperatures raise the odds of failure, which matches the physical understanding of the Challenger disaster. The p-values and standard errors remain the same because the two models are the same fit with the response labels swapped; only the signs of the coefficients differ.
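The sign flip can be checked on any binary response. The sketch below uses simulated data (an assumption for illustration, not the ex2011 values) and mirrors the Failure/newFail recoding with hypothetical variable names.

```r
# Simulated launches (an assumption for illustration, not the ex2011 data)
set.seed(2)
temp <- round(runif(60, 50, 85))
fail_prob <- plogis(10 - 0.17 * temp)   # failure more likely when cold
failure <- factor(ifelse(rbinom(60, 1, fail_prob) == 1, "Yes", "No"))
flipped <- factor(ifelse(failure == "Yes", "Failed", "Survived"))

levels(failure)   # first level is the baseline; glm models the second level
levels(flipped)

fit_fail    <- glm(failure ~ temp, family = "binomial")
fit_survive <- glm(flipped ~ temp, family = "binomial")

# Same magnitudes and standard errors, opposite signs
coef(summary(fit_fail))
coef(summary(fit_survive))
```

Checking levels() before fitting is a cheap way to confirm which outcome the coefficients describe.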
Dr. Turner showed in the videos that when you have a categorical predictor for two levels the logistic regression models produce essentially the same analysis as a Chi square test and odds ratio from a 2x2 table. When you have more than two levels, the interpretation approach is similar to a one way ANOVA.
Consider the following coronary artery disease data set which tries to assess important risk factors such as age, sex, and echo cardiogram results on coronary artery disease. Sex and age are self explanatory. The ECG variable stands for echo cardiogram. Although the variable is categorical, Low stands for Normal, Medium stands for Mild, and High stands for Severe in terms of the echo cardiogram’s general reading.
cad<-read.csv("coronary.csv",stringsAsFactors = T)
head(cad)
##   X    Sex    ECG AGE CAD
## 1 1 Female    Low  28  No
## 2 2   Male    Low  42 Yes
## 3 3 Female Medium  46  No
## 4 4   Male Medium  45  No
## 5 5 Female    Low  34  No
## 6 6   Male    Low  44 Yes
## `summarise()` has grouped output by 'ECG'. You can override using the `.groups`
## argument.
## # A tibble: 6 × 4
## # Groups:   ECG [3]
##   ECG    CAD     cnt  perc
##   <fct>  <fct> <int> <dbl>
## 1 High   Yes      10 0.769
## 2 Low    No       21 0.636
## 3 Medium Yes      19 0.594
## 4 Medium No       13 0.406
## 5 Low    Yes      12 0.364
## 6 High   No        3 0.231
## `summarise()` has grouped output by 'Sex'. You can override using the `.groups`
## argument.
## # A tibble: 4 × 4
## # Groups:   Sex [2]
##   Sex    CAD     cnt  perc
##   <fct>  <fct> <int> <dbl>
## 1 Female No       21 0.636
## 2 Female Yes      12 0.364
## 3 Male   No       16 0.356
## 4 Male   Yes      29 0.644
##               Estimate Std. Error   z value   Pr(>|z|)
## (Intercept)  1.2039728  0.6582801  1.828967 0.06740450
## ECGLow      -1.7635886  0.7511891 -2.347729 0.01888825
## ECGMedium   -0.8244832  0.7502582 -1.098933 0.27179746
exp(confint(log.model))
## Waiting for profiling to be done...
##                  2.5 %    97.5 %
## (Intercept) 1.01973260 14.867917
## ECGLow      0.03330912  0.684391
## ECGMedium   0.08580524  1.766097
Categorical Predictors
Barplots and Risk Assessment: The bar plots for the coronary artery disease (CAD) data set suggest that both predictors (sex and ECG) are related to the outcome (presence of CAD). For the ECG predictor, the proportion with CAD is higher in the “High” category (severe reading, about 77%) than in the “Medium” (mild, about 59%) and “Low” (normal, about 36%) categories. This suggests that more severe ECG readings are associated with a higher risk of coronary artery disease. For the sex predictor, the proportion with CAD is higher in males (about 64%) than in females (about 36%), suggesting that sex may be a risk factor for CAD, with males at higher risk.
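For a two-level predictor such as sex, the odds ratio from a 2x2 table and the one from logistic regression agree. Here is a minimal sketch that rebuilds one row per subject from the Sex-by-CAD counts printed in the summary output above; the object names are hypothetical, and the original coronary.csv file is not used.

```r
# Cell counts copied from the Sex-by-CAD summary table in this document
counts <- data.frame(
  Sex = factor(c("Female", "Female", "Male", "Male")),
  CAD = factor(c("No", "Yes", "No", "Yes")),
  cnt = c(21, 12, 16, 29)
)
# Expand to one row per subject
cad_sex <- counts[rep(seq_len(nrow(counts)), counts$cnt), c("Sex", "CAD")]

# Proportion with and without CAD within each sex (rows sum to 1)
prop_cad <- prop.table(table(cad_sex$Sex, cad_sex$CAD), margin = 1)
round(prop_cad, 3)

# Odds ratio (Male vs. Female) straight from the 2x2 table
or_table <- (29 / 16) / (12 / 21)

# Same odds ratio from logistic regression with Sex as the only predictor
fit_sex <- glm(CAD ~ Sex, data = cad_sex, family = "binomial")
or_glm <- unname(exp(coef(fit_sex)["SexMale"]))

c(table = or_table, glm = or_glm)
```

Both values should agree up to numerical tolerance, which is the sense in which the two analyses are essentially the same.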
Logistic Regression w/ Categorical Predictors: The logistic regression model with ECG as the predictor produces coefficients for the “Low” and “Medium” categories. The reference category is “High”, since R uses the first factor level alphabetically as the baseline by default. The negative coefficient for “Low” (-1.7635886, p ≈ 0.019) indicates that individuals with normal (Low) ECG readings have significantly lower odds of CAD than those with severe (High) readings. The odds ratio is \(e^{-1.7635886} \approx 0.171\), so the odds of CAD in the Low group are about 83% lower than in the High group. The “Medium” coefficient is also negative (-0.8244832), giving an odds ratio of \(e^{-0.8244832} \approx 0.438\), but with p ≈ 0.27 this difference from the “High” category is not statistically significant. The direction of both coefficients agrees with the bar plots, where more severe ECG readings go with a higher risk of CAD.
Looking at coefficients from GLM: Confidence intervals (CIs) for odds ratios from a logistic regression model give a range of plausible values for the true odds ratio at a stated confidence level (typically 95%). For example, the CI for the “Low” category of ECG readings (0.03330912 to 0.684391) lies entirely below 1, so the odds of CAD are significantly lower for individuals with normal (Low) ECG readings than for those with severe (High) readings. The CI for the “Medium” category (0.08580524 to 1.766097) contains 1, so although the point estimate suggests lower odds of CAD than in the “High” category, the data do not rule out no difference. Both intervals are wide, in part because the High group has only 13 subjects. These CIs are consistent with the logistic regression coefficients and point to ECG severity as a useful predictor of coronary artery disease risk.
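The code that produced the ECG coefficient table is not shown in this document, so the following is only a sketch of one way to reproduce it. It rebuilds one row per subject from the ECG-by-CAD counts in the summary output and assumes the model was CAD ~ ECG with the default factor ordering; the object names are hypothetical.

```r
# Cell counts copied from the ECG-by-CAD summary table in this document
ecg_counts <- data.frame(
  ECG = factor(c("High", "High", "Low", "Low", "Medium", "Medium")),
  CAD = factor(c("No", "Yes", "No", "Yes", "No", "Yes")),
  cnt = c(3, 10, 21, 12, 13, 19)
)
cad_ecg <- ecg_counts[rep(seq_len(nrow(ecg_counts)), ecg_counts$cnt),
                      c("ECG", "CAD")]

levels(cad_ecg$ECG)   # the first level is the reference category

# Assumed model: CAD on ECG alone
fit_ecg <- glm(CAD ~ ECG, data = cad_ecg, family = "binomial")
coef(summary(fit_ecg))

# Odds ratios relative to the reference level, with 95% intervals
odds_ratios <- exp(coef(fit_ecg))
ci <- exp(confint(fit_ecg))
cbind(OR = odds_ratios, ci)

# Does each interval exclude 1 (no difference from the reference level)?
excludes_one <- ci[, 1] > 1 | ci[, 2] < 1
excludes_one
```

If relative comparisons against “Low” were preferred, relevel() on the ECG factor would change the baseline without changing the fitted probabilities.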