This material corresponds to Week 10 of the labs and demos for Bio 4259F (Undergraduate) and Bio 9915A (Graduate) at Western University.

It documents logistic regression activities using R. The code loads necessary libraries, simulates data from simulation, fits loggistic distributions to simulated and real data, and performs logistic regression analyses on example datasets with glm().

First thing you need to know is the two different types of data you can analyze with loggistic regressions.

Brief intro

  1. Binary Response Data, these are datasets where each observation corresponds to a single trial with a binary outcome (often coded as 0/1, No/Yes, False/True). For example, consider a medical trial where each patient either experiences a side effect (1) or does not (0). Each patient’s outcome is a separate binary response. In R, such data might be stored as a numeric vector of 0/1 or as a factor with two levels (e.g., “no” and “yes”). Logistic regression can model the probability of outcome = 1 as a function of predictors in these cases.

  2. Binary Count Data, here each observation aggregates multiple binary trials. Instead of one outcome per row, we have a count of successes (and often failures) out of a number of trials. For example, in an agriculture experiment you might plant 20 seeds in each of several pots and record how many seeds germinated in each pot. Each pot gives a number of “successes” (germinated seeds) out of 20. Such data are modeled with the binomial distribution, which generalizes the Bernoulli (0/1) case to \(n\) trials. In R, you can input this data by giving the model a two-column matrix of successes and failures using cbind(successes, failures) as the response. This tells glm() to treat each row as \(n\) independent trials with the given number of successes, rather than a single 0/1 observation.

packages…

library(MASS)   
library(arm)    # needed for the binnedplot() function
## Loading required package: Matrix
## Loading required package: lme4
## 
## arm (Version 1.14-4, built: 2024-4-1)
## Working directory is D:/1_MisDocumentos/Documents/R_Class/4th_Class/4class
library(Sleuth3)
library(ggplot2)
library(ggh4x) # Required for stat_theodensity()
library(carData)
When to use which: If your data are naturally grouped or you only have aggregate counts, use the two-column count format. If you have individual-level data, use the 0/1 format. In R, both are handled by glm(…, family=binomial) appropriately.
Just remember that if using the 0/1 format, the response should be numeric (0/1) or a two-level factor (by default, the first level is treated as failure = 0, second as success = 1). If your binary response is stored as logical (TRUE/FALSE) or a character factor, convert it to 0/1 or factor before modeling.

The Binomial Distribution and Simulation in R

Simulating Binomial Data in R, we can use rbinom() to generate random outcomes from a binomial distribution. This is useful for building intuition. For example, let’s simulate 10,000 experiments of 10 coin flips each, with \(p=0.3\) (30% chance of success on each trial), and tally the distribution of successes:

set.seed(123)                      # for reproducibility
# Simulate 10000 experiments of 10 trials each, success probability 0.3
sim_results <- rbinom(10000, size = 10, prob = 0.3)
table(sim_results)                 # frequency of each number of successes
## sim_results
##    0    1    2    3    4    5    6    7    8    9 
##  300 1180 2326 2734 2010 1001  341   91   16    1

If you run this, you will get frequencies for 0, 1, 2, …, 10 successes. We expect the most common outcomes to be around \(np = 10 * 0.3 = 3\) successes. Let’s visualize the approximate distribution:

hist(sim_results, breaks = -0.5:10.5, col="lightblue", xlab="Number of successes in 10 trials",
     main="Simulation: Binomial(10, p=0.3) outcomes")

Number of successes in 10 trials (10000 simulations, p = 0.3). The distribution centers around 3 successes, which is \(np = 3\), with a spread determined by \(np(1-p) = 2.1\). Higher or lower counts (e.g. 0 or 10) are relatively rare.

You can see that 3 successes is the most likely result, and extremes like 0 or 10 successes are very rare. This matches our intuition for 10 trials with 30% chance each. Simulating can help confirm our understanding of the binomial probabilities.

Connecting to logistic regression Logistic regression uses the binomial distribution for its likelihood. If we have binary response data, each observation \(Y_i\) is like \(\text{Binomial}(1, p_i)\) (either 0 or 1 success). If we have count data of \(n_i\) trials in row \(i\) (successes out of \(n_i\)), then \(Y_i \sim \text{Binomial}(n_i, p_i)\). The logistic model will estimate \(p_i\) for each observation given predictors. In practice, we don’t usually simulate from the binomial during modeling, but understanding this distribution is key to understanding why we use logistic regression for binary outcomes.

Try to… Using rbinom(), simulate 1000 draws from a Binomial distribution with \(n=20\) and \(p=0.1\). Plot the distribution of the results (you can use a histogram or table to see frequencies). Does it look symmetric or skewed? Why?

Logistic Regression on Simulated Data “Basic Example”

Logistic regression is a method to model the relationship between one or more predictors \(X\) and the probability of a binary outcome \(Y=1\). Instead of modeling \(Y\) directly, we model the log-odds (logit) of the outcome as a linear function of the predictors:

Simulate a simple scenario

Let’s create a toy dataset where we know the true relationship. Suppose a person’s chance of having a certain disease increases with their exposure to a chemical “Coca-Cola. We define a predictor exposure and simulate a binary outcome based on it.

set.seed(42)
n <- 200                           # sample size
exposure <- runif(n, min=0, max=100)        # exposure level between 0 and 100
# True relationship: log-odds of disease = -3 + 0.05 * exposure
true_beta0 <- -3
true_beta1 <- 0.05
p <- 1 / (1 + exp(-(true_beta0 + true_beta1 * exposure)))  # logistic function

Simulate disease outcome: 1 = has disease, 0 = no disease

disease <- rbinom(n, size = 1, prob = p)
sim_data <- data.frame(exposure, disease)
head(sim_data)
##   exposure disease
## 1 91.48060       0
## 2 93.70754       1
## 3 28.61395       1
## 4 83.04476       1
## 5 64.17455       1
## 6 51.90959       0

Here, the true intercept is \(-3\) and slope is \(0.05\). This means at exposure = 0, the log-odds of disease is \(-3\) (so \(P(\text{disease}) = \exp(-3)/(1+\exp(-3)) \approx 0.05\), or 5%). For each 1-unit increase in exposure, the log-odds increases by 0.05 (i.e., odds multiply by \(e^{0.05} \approx 1.05\)). Over a 100-unit range, that’s a substantial increase in risk.

Fit a logistic regression model Now we pretend we don’t know the true parameters and fit a model using glm() with family binomial.

fit <- glm(disease ~ exposure, family = binomial, data = sim_data)

And then, please never forget to use The summary(). Its output will show an intercept and slope estimate. They should be in the ballpark of -3 and 0.05 (though not exact, due to random noise). For example, you might see something like:

summary(fit)
## 
## Call:
## glm(formula = disease ~ exposure, family = binomial, data = sim_data)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.557290   0.412231  -6.204 5.52e-10 ***
## exposure     0.044252   0.006702   6.602 4.04e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 275.98  on 199  degrees of freedom
## Residual deviance: 216.26  on 198  degrees of freedom
## AIC: 220.26
## 
## Number of Fisher Scoring iterations: 4

These estimates suggest \(\hat{\beta}_0 \approx -2.56\), \(\hat{\beta}_1 \approx 0.044\). The true values were approx. -3 and 0.05, and indeed our estimates are quite close. The model found a highly significant effect of exposure (very small p-value), meaning exposure is strongly associated with disease outcome in our data.

Interpreting… The coefficient \(\hat{\beta}_1 \approx 0.044\) means each 1-unit increase in exposure multiplies the odds of disease by \(e^{0.044} \approx 1.045\) (about a 4.5% increase in odds). The intercept \(\hat{\beta}_0 \approx -2.56\) corresponds to the log-odds when exposure = 0. If exposure = 0, the odds of disease are \(\exp(-2.56) \approx 0.0775\), which is a 7.8% chance—close to our design of 5%.

We can also get confidence intervals for these coefficients. For logistic models, it’s common to use profile likelihood confidence intervals via confint():

confint(fit)
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept) -3.41721619 -1.79354585
## exposure     0.03175082  0.05813687
exp(cbind(OddsRatio = coef(fit), confint(fit)))
## Waiting for profiling to be done...
##              OddsRatio      2.5 %    97.5 %
## (Intercept) 0.07751451 0.03280363 0.1663692
## exposure    1.04524586 1.03226026 1.0598601

The second line above uses exp() to convert log-odds coefficients to odds ratios (OR). For example, the 95% CI for \(\beta_1\) was \((0.0318, 0.0581)\). Exponentiating gives OR \(= e^{0.044} \approx 1.045\) with 95% CI \((e^{0.0318}, e^{0.0581}) \approx (1.032, 1.060)\). This says each unit of exposure increases the odds of disease by between ~3.2% and ~6.0%, with 95% confidence.

Visualizing the fit

# Add model-predicted probability to dataframe for plotting
sim_data$pred_prob <- predict(fit, type="response")

ggplot(sim_data, aes(x=exposure, y=disease)) +
  geom_jitter(height=0.05, width=0, alpha=0.5) +
  geom_line(aes(y=pred_prob), color="red", size=1) +
  labs(title="Simulated Data: Disease vs Exposure",
       y="Disease (0/1 outcome)", x="Exposure level") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

The graph will produce a scatterplot of the binary data (jittered vertically for visibility) with a smooth S-shaped red curve showing how predicted probability of disease increases with exposure. We expect the curve to rise from near 0 to near 1 as exposure goes from 0 to 100.

Checking model fit: For simulated data, we know the model form is correct. In real cases, we might check goodness-of-fit. One simple diagnostic is a binned residual plot. We can use the arm package’s binnedplot() or create it manually. The idea is to group predictions into bins and see if average residuals are near 0. Here, since we simulated from the true model, the fit should be good. We won’t delve into residual diagnostics deeply in this toy example, but we will revisit model checking with real data below.

Try to… modify the simulation above so that the true relationship has \(\beta_1 = -0.1\) (i.e. higher exposure decreases the disease risk). Simulate a dataset and fit the same logistic model. Do the estimates capture the negative effect? Plot the fitted curve – does it slope downward?

Bumpus’s House Sparrows Survival

This example comes from a classic study by Hermon Bumpus in 1898. After a severe winter storm, 87 house sparrows were found, of which some survived and some perished. Bumpus’s question: did natural selection favor certain physical characteristics? In other words, were some sparrows more likely to survive based on their traits.This dataset contains various measurements of each bird. We will focus on a few key variables such as Status, TL (total length in mm), WT (weight in grams), HL(humerus length in inches), and AG (age).

Status is our Y (Y=1) and means survival status of the bird (factor with levels “Perished” and “Survived”). We will treat “Survived” as the outcome of interest.

For this case study, let’s load the data and inspect it. (The data is available in some R packages and online sources; here we assume we have it as bumpus data frame.). In this case we use the package “Sleuth3”. Just in case you have a issue with this package please go to owl and download the dataset… You already know how to imported to R ;)

data("ex2016")          # Bumpus data from Sleuth3, 3rd ed.
bumpus <- ex2016
str(bumpus)
## 'data.frame':    87 obs. of  11 variables:
##  $ Status: Factor w/ 2 levels "Perished","Survived": 2 2 2 2 2 2 2 2 2 2 ...
##  $ AG    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ TL    : int  154 160 155 154 156 161 157 159 158 158 ...
##  $ AE    : int  241 252 243 245 247 253 251 247 247 252 ...
##  $ WT    : num  24.5 26.9 26.9 24.3 24.1 26.5 24.6 24.2 23.6 26.2 ...
##  $ BH    : num  31.2 30.8 30.6 31.7 31.5 31.8 31.1 31.4 29.8 32 ...
##  $ HL    : num  0.69 0.74 0.73 0.74 0.71 0.78 0.74 0.73 0.7 0.75 ...
##  $ FL    : num  0.67 0.71 0.7 0.69 0.71 0.74 0.74 0.72 0.67 0.74 ...
##  $ TT    : num  1.02 1.18 1.15 1.15 1.13 1.14 1.15 1.13 1.08 1.15 ...
##  $ SK    : num  0.59 0.6 0.6 0.58 0.57 0.61 0.61 0.61 0.6 0.61 ...
##  $ KL    : num  0.83 0.84 0.85 0.84 0.82 0.89 0.86 0.79 0.82 0.86 ...
table(bumpus$Status) ## take a look at Y
## 
## Perished Survived 
##       36       51

Please answer… How many observations are? Is The breakdown of Survived vs Perished balanced?

Logistic regression.

Single predictor, as a first attempt, let’s see if a single measurement, say total length (TL), predicts survival. We fit a logistic model:

# Ensure the response is 0/1 numeric: 1 for Survived, 0 for Perished
bumpus$Survived01 <- ifelse(bumpus$Status == "Survived", 1, 0)
fit1 <- glm(Survived01 ~ TL, family = binomial, data = bumpus)

Ask yourself, Why Survived01 and not Survived?

and then… summary noooooo forget Giving us the estimate for (Intercept) and TL.

summary(fit1)
## 
## Call:
## glm(formula = Survived01 ~ TL, family = binomial, data = bumpus)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) 54.49312   14.57866   3.738 0.000186 ***
## TL          -0.33698    0.09062  -3.719 0.000200 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  99.788  on 85  degrees of freedom
## AIC: 103.79
## 
## Number of Fisher Scoring iterations: 4

The slope for TL is negative and significant. It suggests that longer birds had lower odds of survival. In fact, \(\exp(-0.337) \approx 0.714\), so each 1 mm increase in total length multiplies the odds of survival by about 0.714 (a ~29% reduction). This is evidence that smaller birds were more likely to survive the storm, perhaps because extremely large birds were at a disadvantage (e.g. higher weight or wing loading). The intercept is very large (54.49) because it corresponds to log-odds of survival at TL = 0 (which is far outside any realistic range of sparrow lengths). The intercept alone isn’t meaningful here, except in combination with the slope to form the curve.

We can plug in a range of lengths and see predicted survival chances

# Predict survival probability over a range of total lengths
newdata <- data.frame(TL = seq(min(bumpus$TL), max(bumpus$TL), length.out=100))
newdata$surv_prob <- predict(fit1, newdata, type="response")
head(newdata)
##         TL surv_prob
## 1 153.0000 0.9495720
## 2 153.1414 0.9472406
## 3 153.2828 0.9448077
## 4 153.4242 0.9422695
## 5 153.5657 0.9396219
## 6 153.7071 0.9368611

Plotting

ggplot(bumpus, aes(x=TL, y=Survived01)) +
  geom_jitter(height=0.05, width=0, alpha=0.7) +
  geom_line(data=newdata, aes(x=TL, y=surv_prob), color="blue", size=1) +
  labs(x="Total Length (mm)", y="Survival (0=Perished, 1=Survived)",
       title="Bumpus Sparrows: Survival vs Total Length") +
  theme_minimal()

The downward slope confirms what the coefficient told us: longer birds have lower predicted probability of survival.

Multiple predictors

That was cool, ain´t it? Then we can extend the model to include more measurements. Perhaps survival depends on a combination of traits. Let’s try using total length and weight together:

fit2 <- glm(Survived01 ~ TL + WT, family=binomial, data=bumpus)

and ofcourse our summary not forget it…

summary(fit2)
## 
## Call:
## glm(formula = Survived01 ~ TL + WT, family = binomial, data = bumpus)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  53.3898    14.7203   3.627 0.000287 ***
## TL           -0.3140     0.1012  -3.105 0.001906 ** 
## WT           -0.1000     0.2050  -0.488 0.625567    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 118.008  on 86  degrees of freedom
## Residual deviance:  99.546  on 84  degrees of freedom
## AIC: 105.55
## 
## Number of Fisher Scoring iterations: 4

No academic interpretation… TL (Total Length) is significant and negative. Each extra mm in length reduces the log-odds of survival by 0.314. That means longer birds are less likely to survive. Odds ratio: exp(-0.3140) ≈ 0.730 → About 27% lower odds of survival per extra mm in TL. Strong effect, statistically significant (p = 0.0019). WT (Weight) is rather negative but not significant (p = 0.63). This means there’s no evidence that weight explains survival after accounting for TL. The effect size (-0.10) is small, and the uncertainty (Std. Error = 0.205) is large. Weight could still matter biologically, but in this model, length is the only clear predictor.

The we can say… We found that total length was a strong negative predictor of survival (OR ≈ 0.73 per mm, p < 0.01), while weight showed no significant effect after controlling for length. This suggests that longer birds were less likely to survive the storm, but weight on its own did not explain much once length was accounted for.

We should check confidence intervals to understand the precision as follows

exp(cbind(OddsRatio = coef(fit2), confint(fit2)))
## Waiting for profiling to be done...
##                OddsRatio        2.5 %       97.5 %
## (Intercept) 1.537801e+23 3.431193e+11 7.172300e+36
## TL          7.304924e-01 5.896227e-01 8.800215e-01
## WT          9.048107e-01 5.966969e-01 1.343263e+00

So we can infer that out fist model was right and we don’t need to add more variables here.

Safe-Wells in Bangladesh (Multiple Predictors)

Our second case study comes from a public health campaign in Bangladesh. In the 1990s, many tube wells used for drinking water were found to have unsafe levels of arsenic. Households using an unsafe well were encouraged to switch to a safe well (either a new well or a neighbor’s well). Researchers collected data on which households actually switched wells over a period of time. We will analyze what factors influenced the decision to switch.

Our Y (Y=1) is “switch”, which is the household switch to a different well (“yes” or “no”; we will treat “yes” as outcome = 1).

Then we have arsenic, distance (in meters distance nearest safe well), education (Years of education), and association (active in community organizations, “yes” or “no”) as ours Xs…

Let’s load and inspect the data:

data("Wells")
str(Wells)
## 'data.frame':    3020 obs. of  5 variables:
##  $ switch     : Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 2 ...
##  $ arsenic    : num  2.36 0.71 2.07 1.15 1.1 3.9 2.97 3.24 3.28 2.52 ...
##  $ distance   : num  16.8 47.3 21 21.5 40.9 ...
##  $ education  : int  0 0 10 12 14 9 4 10 0 0 ...
##  $ association: Factor w/ 2 levels "no","yes": 1 1 1 1 2 2 2 1 2 2 ...
table(Wells$switch)
## 
##   no  yes 
## 1283 1737

Please asnwer.. How many observations observations? The switch variable is likely a factor ?

Let’s convert it to numeric 0/1 and similarly association to 0/1 for modeling convenience,

wells <- Wells  # copy to a data frame
wells$switch01 <- ifelse(wells$switch == "yes", 1, 0)
wells$assoc01 <- ifelse(wells$association == "yes", 1, 0)

Small exploratory analysis

For instance, compare arsenic levels between those who switched and those who didn’t:

tapply(wells$arsenic, wells$switch, mean)
##       no      yes 
## 1.420055 1.831894

Is it balance ?

tapply(wells$distance, wells$switch, mean)
##       no      yes 
## 53.61148 44.43218

Is it balance ?

We might find that those who switched have, on average, higher arsenic and shorter distance than those who did not switch. For example, in one analysis: households that switched had higher arsenic concentrations and were closer to safe wells on average. This aligns with intuition: a strong reason to switch is high contamination, and an impediment to switching is having to go far. Education and association can also be checked similarly.

Fit the logistic model

In this case we want to include distance, arsenic, education, and association as predictors for switch. To make interpretation easier, we will rescale distance. The raw distance is in meters; the effect per meter might be very small. It’s common to divide by 100 (so a unit = 100 m) because a difference of 100 m is more meaningful in context. Similarly, arsenic is already in 100s of µg/L, so an increase of 1 in arsenic equals 100 µg/L more arsenic. We’ll use those scales:

wells$dist100 <- wells$distance / 100    # distance in hundreds of meters


fit_wells <- glm(switch01 ~ dist100 + arsenic + education + assoc01, 
                 family=binomial, data=wells)

And then hahaha not forget to summary your models…

summary(fit_wells)
## 
## Call:
## glm(formula = switch01 ~ dist100 + arsenic + education + assoc01, 
##     family = binomial, data = wells)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.156712   0.099601  -1.573    0.116    
## dist100     -0.896110   0.104576  -8.569  < 2e-16 ***
## arsenic      0.467022   0.041602  11.226  < 2e-16 ***
## education    0.042447   0.009588   4.427 9.55e-06 ***
## assoc01     -0.124300   0.076966  -1.615    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4118.1  on 3019  degrees of freedom
## Residual deviance: 3907.8  on 3015  degrees of freedom
## AIC: 3917.8
## 
## Number of Fisher Scoring iterations: 4

No academic interpretation would be,

Distance: For each 100 m increase, odds of switching are multiplied by exp(-0.896) ≈ 0.408. That’s a 59% drop in odds per 100 m. Strong negative effect.

Arsenic: Each 100 µg/L increase raises the odds by exp(0.467) ≈ 1.595, or about 60% higher odds of switching. Strong positive effect.

Education: Each year adds exp(0.042) ≈ 1.043, or ~4.3% higher odds of switching. Positive and significant.

Association: Coefficient is negative but not significant (p = 0.106). Odds ratio is exp(-0.1243) ≈ 0.883. No clear evidence of an effect.

Please base on the previous example create your academic interpretation.

Predicting probability for a typical household

What could happen at 60 m distance

predict(fit_wells, data.frame(dist100=0.6, arsenic=2.0, education=4, assoc01=0), type="response")
##         1 
## 0.6009564

60% chance of switching

Try 150 m distance,

predict(fit_wells, data.frame(dist100=1.5, arsenic=2.0, education=4, assoc01=0), type="response")
##        1 
## 0.402022

40% chance of switching

Try 10 m distance,

predict(fit_wells, data.frame(dist100=0.1, arsenic=2.0, education=4, assoc01=0), type="response")
##         1 
## 0.7021395

wofff…. 70% hahahaha

Now change other variable for example, a distance of 60m but with high arsenic,

predict(fit_wells, data.frame(dist100=0.6, arsenic=3.0, education=4, assoc01=0), type="response")
##         1 
## 0.7060909

same distance low arsenic,

predict(fit_wells, data.frame(dist100=0.6, arsenic=0.5, education=4, assoc01=0), type="response")
##         1 
## 0.4277394

Visualization

Visualizing effect of distance at two arsenic levels.

dist_seq <- seq(0, 1.5, by=0.01)
lowAs <- data.frame(dist100 = dist_seq, arsenic = 1.0, education = 4, assoc01 = 0)
highAs <- data.frame(dist100 = dist_seq, arsenic = 3.0, education = 4, assoc01 = 0)
lowAs$pred <- predict(fit_wells, lowAs, type="response")
highAs$pred <- predict(fit_wells, highAs, type="response")


plot(lowAs$dist100*100, lowAs$pred, type="l", col="red", lwd=2,
xlab="Distance to safe well (m)", ylab="Probability of switching",
ylim=c(0,1))
lines(highAs$dist100*100, highAs$pred, col="blue", lwd=2)
legend("topright", legend=c("Arsenic = 100 µg/L","Arsenic = 300 µg/L"),
col=c("red","blue"), lwd=2)

This shows how switching probability falls as distance increases, and rises as arsenic increases.

Model fit check

For logistic regression, a common approach is to check calibration (are predicted probabilities in line with observed proportions?) and discrimination (does the model assign higher probabilities to those who switched vs not?). A quick check: use binnedplot again:

binnedplot(fitted(fit_wells), resid(fit_wells, type="response"), nclass=20,
xlab="Predicted probability", main="Binned Residuals: Wells")

If the model is well-calibrated, we expect the points to hover around 0. If there’s a systematic curve, it suggests maybe a nonlinear effect or missing interaction. For instance, sometimes one might notice that at very high predicted probabilities, residuals show a pattern – which could hint at needing an arsenic*distance interaction or perhaps a nonlinear term for distance (maybe the effect of distance tapers off at large values). If diagnostics suggest, one could refine the model. For brevity, we’ll assume this model is a reasonable first approximation.

Try to… Eliminate the factor “assoc01”, is it right to do so? Also try to create a new factor variable dist_group that categorizes distance into “Near” (e.g. < 50 m), “Medium” (50–100 m), and “Far” (>100 m). Fit a logistic model with dist_group (factor) and arsenic. How do the odds of switching compare in these distance categories (use one as baseline)? Does this factor-based result agree with the continuous distance model?

Best Practices for Logistic Regression Modeling

let’s summarize some best practices and additional tips for performing and interpreting logistic regression:

Interpret coefficients on the odds scale: Use exp(Estimate) to interpret logistic regression coefficients as odds ratios (OR). For example, in the wells model, the coefficient for dist100 was -0.896, which means the odds of switching wells are multiplied by exp(-0.896) ≈ 0.41 for each additional 100 meters — a 59% decrease. Always include the unit (e.g., per 100 meters) to make your interpretation clear.

Predict probabilities for context: Odds ratios are helpful, but predicted probabilities are usually more intuitive. For instance, a household 10 m from a safe well with average arsenic (200 µg/L) and 4 years of education had about an 80% chance of switching, while one 150 m away had only ~20% chance, despite the same arsenic level. This conveys the real-world meaning of the model.

Check model fit: Use diagnostic tools like binnedplot() from the arm package to inspect residuals. A flat band of residuals around zero suggests a good fit. If you see curved patterns or outliers, consider adding interaction terms or transforming predictors. In the wells model, binnedplot() showed a well-centered distribution of residuals, supporting model fit.

Assess discrimination: To check how well the model separates outcomes (0 vs 1), use metrics like AUC (via the pROC package) or confusion matrices. Our model for well-switching had a reasonable deviance drop and strong predictor significance, indicating good discrimination.

Avoid linearity assumptions pitfalls: Logistic regression assumes a linear relationship in log-odds. If the effect of a predictor like distance flattens or changes direction, consider squaring the variable or using a spline. For example, the effect of distance on switching might weaken at very high distances; a quadratic term could help.

Multicollinearity: Strong correlation between predictors inflates standard errors. Use car::vif() to check. In the Bumpus model, total length (TL) and weight (WT) were correlated. TL had a strong effect, but WT did not add much — likely due to collinearity or redundancy. Consider removing redundant predictors or using PCA.

Confidence intervals: Always report confidence intervals with estimates. For example, in Bumpus, TL had an odds ratio of exp(-0.314) ≈ 0.73, with 95% CI showing this effect was meaningful. Use confint() and exp() to convert intervals to OR scale. This is more informative than a p-value alone.

Present clearly: Explain results in plain terms. Example: “For each additional 100 meters to the nearest safe well, the odds of switching decrease by about 59% (OR = 0.41, 95% CI …).” Tie numerical results to real-world context.

Model summaries in R: Use summary() for a quick overview. Use exp(coef(fit)) and exp(confint(fit)) to get odds ratios and their intervals. Tools like broom::tidy() or stargazer can produce clean tables.

Model validation: With large data (like the wells data), consider splitting into training and test sets to avoid overfitting. Fit on 80%, test on 20%. Check prediction accuracy on the test set. This is especially important if adding interaction or polynomial terms.

Balanced vs. imbalanced outcomes: In our data, about 48% of households switched wells and 60% of sparrows survived, so logistic regression is stable. But if switching were rare (e.g., <5%), we’d consider alternate methods or adjust thresholds. For very rare events, consider Firth correction or penalized logistic regression.