Exercises for Categorical
Data Analysis - Part I
(Last compiled: Sep 10, 2025)
Instructions
This page contains the exercises for the module Categorical Data Analysis (part I) in the course Biostatistics II in the master Metodologia Epidemiologica e Biostatistica per la Ricerca Clinica.
Try to answer the questions without looking at the solutions.
The assignment is based on the lowbwt data set. The Child Health and Development Data Set was designed to assess factors related to low birth weight in children. The outcome is low (binary outcome, birth weight \(<\) 2500 grams or \(\ge\) 2500 grams) and the main exposure is maternal smoking (smoke, binary: No/Yes). Weight of the mother at the last menstrual period (lwt, in pounds) is considered as a confounding variable or possible effect modifier (interaction analysis).
Reference: Hosmer, D.W., Lemeshow, S. and Sturdivant, R.X. (2013) Applied Logistic Regression: Third Edition.
To load the data, run the following:
load(url("http://alecri.github.io/downloads/data/lowbwt.Rdata"))
Structure of the data
str(lowbwt)
Classes 'tbl_df', 'tbl' and 'data.frame': 189 obs. of 11 variables:
$ id : num 4 10 11 13 15 16 17 18 19 20 ...
..- attr(*, "label")= chr "Identification Code"
..- attr(*, "format.stata")= chr "%8.0g"
$ low : Factor w/ 2 levels ">= 2500 g","< 2500 g": 2 2 2 2 2 2 2 2 2 2 ...
$ age : num 28 29 34 25 25 27 23 24 24 21 ...
..- attr(*, "label")= chr "Age of Mother (years)"
..- attr(*, "format.stata")= chr "%8.0g"
$ lwt : num 120 130 187 105 85 150 97 128 132 165 ...
..- attr(*, "label")= chr "Weight of Mother at Last Menstrual Period (pounds) "
..- attr(*, "format.stata")= chr "%8.0g"
$ race : Factor w/ 3 levels "White","Black",..: 3 1 2 3 3 3 3 2 3 1 ...
$ smoke: Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 2 ...
$ ptl : num 1 0 0 1 0 0 0 1 0 0 ...
..- attr(*, "label")= chr "History of Premature Labor"
..- attr(*, "format.stata")= chr "%8.0g"
$ ht : Factor w/ 2 levels "No","Yes": 1 1 2 2 1 1 1 1 2 2 ...
$ ui : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 2 1 1 1 ...
$ ftv : num 0 2 0 0 0 0 1 1 0 1 ...
..- attr(*, "label")= chr "Physician Visits First Trimester"
..- attr(*, "format.stata")= chr "%8.0g"
$ bwt : num 709 1021 1135 1330 1474 ...
..- attr(*, "label")= chr "Birth Weight (grams)"
..- attr(*, "format.stata")= chr "%8.0g"
- attr(*, "label")= chr "Low birth weight data"
The following packages are used in the solutions. Many others could be used, and some may not be necessary to answer the questions.
Packages
pacman::p_load(tidyverse, Epi, scales, gridExtra, ResourceSelection, epiDisplay)
# useful functions
# inverse logit: maps log odds to probabilities
invlogit <- function(x) exp(x)/(1+exp(x))
# theme for ggplots
theme_set(theme_classic())
Intro to Categorical Data Analysis
Inference on one proportion
Question 1
What is the proportion of mothers with a low birth weight child (low)? And the odds? Are they different?
Solution
p_odds <- summarise(lowbwt,
cases = sum(low == "< 2500 g"),
n = sum(!is.na(low)),
p = cases/n,
odds = p/(1-p))
p_odds
# A tibble: 1 × 4
cases n p odds
<int> <int> <dbl> <dbl>
1 59 189 0.312 0.454
The odds of delivering a low birth weight baby are 0.45: for every 100 women who delivered a normal weight baby, we expect on average 45 who delivered a low birth weight baby.
The two measures differ (0.31 vs. 0.45) because low birth weight is not rare in this sample; the odds approximate the proportion only when the proportion is small.
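As a quick check of the relationship between the two measures, the odds can be recomputed by hand from the counts reported above (59 cases out of 189). This is a minimal sketch; the object names are illustrative and not part of the original solution.

```r
# Counts taken from the summarise() output above
cases <- 59
n <- 189

p <- cases / n        # proportion (risk) of low birth weight
odds <- p / (1 - p)   # odds = cases / non-cases

# Back-transformation: the proportion is recovered as odds / (1 + odds)
p_back <- odds / (1 + odds)

round(c(p = p, odds = odds, p_back = p_back), 3)
```

The odds always exceed the proportion, and the gap widens as the proportion grows.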
Question 2
Estimate and interpret the risk of low birth weight, as well as the corresponding 95% confidence intervals.
Solution
ci_risk <- prop.test(x = 59, n = 189)
ci_risk
1-sample proportions test with continuity correction
data: 59 out of 189, null probability 0.5
X-squared = 25.926, df = 1, p-value = 3.548e-07
alternative hypothesis: true p is not equal to 0.5
95 percent confidence interval:
0.2479596 0.3841585
sample estimates:
p
0.3121693
The estimated risk of low birth weight is 0.31. We are 95% confident that the true risk lies between 0.25 and 0.38.
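The interval reported by prop.test() is a score interval with continuity correction. As a hedged illustration of how much the choice of method matters here, the sketch below compares it with the exact interval from binom.test() and a hand-computed Wald interval; the object names are illustrative.

```r
x <- 59    # low birth weight cases
n <- 189   # total births

# Score interval with continuity correction (same call as in the solution)
ci_score <- as.numeric(prop.test(x, n)$conf.int)

# Exact (Clopper-Pearson) interval
ci_exact <- as.numeric(binom.test(x, n)$conf.int)

# Wald interval: estimate +/- z * standard error
p_hat <- x / n
se_hat <- sqrt(p_hat * (1 - p_hat) / n)
ci_wald <- p_hat + c(-1, 1) * qnorm(0.975) * se_hat

rbind(score = ci_score, exact = ci_exact, wald = ci_wald)
```

With a sample of this size and a proportion far from 0 or 1, the three intervals should be close to each other.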
2 x 2 table
Question 3
Construct a two-by-two table to examine the possible association between low birth weight and maternal cigarette smoking (smoke). State the null hypothesis and run an appropriate test to answer the question.
Solution
tab <- table(lowbwt$smoke, lowbwt$low)
prop.table(tab, margin = 1)
>= 2500 g < 2500 g
No 0.7478261 0.2521739
Yes 0.5945946 0.4054054
The null hypothesis is independence between maternal smoke and the risk of low birth weight.
\(H_0: p(\textrm{low}| \textrm{smoke} = 0) = p(\textrm{low}| \textrm{smoke} = 1)\)
chisq.test(tab)
Pearson's Chi-squared test with Yates' continuity correction
data: tab
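X-squared = 4.2359, df = 1, p-value = 0.03958
Since the p-value (0.04) is below 0.05, we reject the null hypothesis of independence at the 5% level: there is evidence of an association between maternal smoke and low birth weight.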
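The same test can be reproduced without the data set, starting from the cell counts. The sketch below assumes the counts 86/29 (non-smokers) and 44/30 (smokers), which are consistent with the row proportions shown above; it also inspects the expected counts, since the chi-squared approximation is usually considered adequate when they are all above 5.

```r
# 2 x 2 table rebuilt from the cell counts (rows: smoke, columns: low)
tab_counts <- matrix(c(86, 29,
                       44, 30),
                     nrow = 2, byrow = TRUE,
                     dimnames = list(smoke = c("No", "Yes"),
                                     low = c(">= 2500 g", "< 2500 g")))

chi_res <- chisq.test(tab_counts)   # Yates' continuity correction by default
chi_res$expected                    # expected counts under independence
chi_res$statistic
chi_res$p.value
```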
Question 4
Estimate and interpret the risk of low birth weight separately for smokers and non-smokers mothers. Provide at least one measure to quantify the possible association.
Solution
tab_sl <- with(lowbwt, twoby2(exposure = relevel(smoke, 2), outcome = relevel(low, 2)))
2 by 2 table analysis:
------------------------------------------------------
Outcome : < 2500 g
Comparing : Yes vs. No
< 2500 g >= 2500 g P(< 2500 g) 95% conf. interval
Yes 30 44 0.4054 0.3001 0.5203
No 29 86 0.2522 0.1812 0.3394
95% conf. interval
Relative Risk: 1.6076 1.0578 2.4433
Sample Odds Ratio: 2.0219 1.0807 3.7831
Conditional MLE Odds Ratio: 2.0141 1.0288 3.9649
Probability difference: 0.1532 0.0176 0.2871
Exact P-value: 0.0362
Asymptotic P-value: 0.0276
------------------------------------------------------
tab_sl
$table
< 2500 g >= 2500 g P(< 2500 g) 95% conf. interval
Yes 30 44 0.4054054 0.3000509 0.5202566
No 29 86 0.2521739 0.1812469 0.3393526
$measures
95% conf. interval
Relative Risk: 1.6076421 1.05781242 2.4432623
Sample Odds Ratio: 2.0219436 1.08065960 3.7831115
Conditional MLE Odds Ratio: 2.0141372 1.02878038 3.9649039
Probability difference: 0.1532315 0.01757833 0.2871171
$p.value
[1] 0.02761968 0.03617650
The risk of low birth weight is 0.41 (95% CI: 0.30, 0.52) for smoker mothers and 0.25 (95% CI: 0.18, 0.34) for non-smoker mothers.
One possible measure of association is the relative risk RR = 1.61: the risk of low birth weight for smoker mothers is 61% higher than for non-smoker mothers.
Another measure is the odds ratio OR = 2.02: the odds of low birth weight for smoker mothers are 2.02 times the odds of low birth weight for non-smoker mothers.
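The measures reported by twoby2() can be verified by hand from the four cell counts in the table above. The following is a minimal sketch with illustrative object names; the confidence interval for the odds ratio uses the usual Wald formula on the log scale.

```r
# Cell counts from the 2 x 2 table above
a <- 30; b <- 44   # smokers: low birth weight, normal weight
c <- 29; d <- 86   # non-smokers: low birth weight, normal weight

risk_smokers <- a / (a + b)
risk_nonsmokers <- c / (c + d)

rr <- risk_smokers / risk_nonsmokers   # relative risk
rd <- risk_smokers - risk_nonsmokers   # risk (probability) difference
or_hat <- (a * d) / (b * c)            # sample odds ratio

# Wald 95% CI for the odds ratio, computed on the log scale
se_log_or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_or <- exp(log(or_hat) + c(-1, 1) * qnorm(0.975) * se_log_or)

round(c(RR = rr, RD = rd, OR = or_hat, lower = ci_or[1], upper = ci_or[2]), 4)
```

Because the outcome is common (about 31% overall), the odds ratio is noticeably further from 1 than the relative risk.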
Logistic regression (empty model)
Question 5
Specify the equation of a logistic model to estimate the quantities in question 2. Estimate and interpret the model coefficient.
Solution
\[\log\left( \textrm{odds}(\textrm{low}) \right) = \beta_0\]
mod_0 <- glm(low ~ 1, data = lowbwt, family = "binomial")
ci.lin(mod_0)
 Estimate StdErr z P 2.5% 97.5%
(Intercept) -0.789997 0.1569759 -5.0326 4.83873e-07 -1.097664 -0.4823298
ci.exp(mod_0)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.4538462 0.3336495 0.6173434
\(\exp(\beta_0)\) = 0.45 is the odds of low birth weight (see solution to question 1 for interpretation).
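In the empty model the intercept is simply the log of the observed odds, so the estimates above can be reproduced from the counts alone (59 cases, 130 non-cases). A minimal sketch with illustrative object names:

```r
cases <- 59
noncases <- 189 - 59

beta0 <- log(cases / noncases)              # log odds, the model intercept
se_beta0 <- sqrt(1 / cases + 1 / noncases)  # standard error of a log odds

# 95% Wald confidence interval, back-transformed to the odds scale
ci_odds <- exp(beta0 + c(-1, 1) * qnorm(0.975) * se_beta0)

round(c(beta0 = beta0, se = se_beta0, lower = ci_odds[1], upper = ci_odds[2]), 4)
```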
Question 6
What is the risk and 95% confidence intervals of having low birth weight?
Solution
p_ci <- invlogit(ci.lin(mod_0)[c(1, 5, 6)])
p_ci
[1] 0.3121693 0.2501778 0.3817021
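Base R already provides the inverse logit as plogis(), so the same back-transformation can be sketched without a helper function. The values below are copied from the ci.lin(mod_0) output; the object names are illustrative.

```r
# Log odds and 95% limits copied from ci.lin(mod_0)
log_odds <- c(estimate = -0.789997, lower = -1.097664, upper = -0.4823298)

# plogis(x) = exp(x) / (1 + exp(x)), the inverse of the logit
risk <- plogis(log_odds)
round(risk, 3)

# qlogis() is the logit itself and undoes the transformation
all.equal(qlogis(risk), log_odds)
```

The interval is computed on the log odds scale and then transformed, which guarantees limits between 0 and 1.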
Simple Logistic Regression
Binary predictor
Question 7
Specify a simple logistic regression model to quantify and test the association between maternal smoke and the odds of having low birth weight. Estimate and interpret the model coefficients. Is there an association between the predictor and the outcome?
Solution
\[\log\left(\textrm{odds}(\textrm{low}| \textrm{smoke}) \right) = \beta_0 + \beta_1\textrm{smoke}\]
mod_s <- glm(low ~ smoke, data = lowbwt, family = "binomial")
summary(mod_s)
Call:
glm(formula = low ~ smoke, family = "binomial", data = lowbwt)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0871 0.2147 -5.062 4.14e-07 ***
smokeYes 0.7041 0.3196 2.203 0.0276 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 229.80 on 187 degrees of freedom
AIC: 233.8
Number of Fisher Scoring iterations: 4
ci.exp(mod_s)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.3372093 0.2213695 0.5136665
smokeYes 2.0219436 1.0806599 3.7831106
\(\exp(\beta_0)\) = 0.34 is the odds of low birth weight for non-smoker mothers. \(\exp(\beta_1)\) = 2.02 is the odds ratio of low birth weight for smoker mothers compared with non-smoker mothers. It indicates a positive association (higher odds for smokers) and is statistically significant at the 5% level (p = 0.028; the 95% CI excludes 1).
Question 8
Using the coefficients of the estimated model, calculate the quantities in question 4.
Solution
invlogit(coef(mod_s)[1])
(Intercept)
0.2521739
invlogit(coef(mod_s)[1] + coef(mod_s)[2])
(Intercept)
0.4054054
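An equivalent route is predict() with type = "response", which applies the inverse logit internally. The sketch below is self-contained: instead of loading lowbwt, it rebuilds an equivalent data frame from the 2 x 2 counts (an assumption made only so the example runs on its own), and all object names are illustrative.

```r
# Rebuild individual-level data from the 2 x 2 counts
d <- data.frame(
  smoke = factor(rep(c("No", "Yes"), times = c(115, 74))),
  low = c(rep(c(0, 1), times = c(86, 29)),   # non-smokers
          rep(c(0, 1), times = c(44, 30)))   # smokers
)

fit <- glm(low ~ smoke, data = d, family = binomial)

# Predicted risks for the two exposure groups
new_d <- data.frame(smoke = factor(c("No", "Yes"), levels = levels(d$smoke)))
risk <- predict(fit, newdata = new_d, type = "response")
cbind(new_d, risk)
```

Because the model has one parameter per group, the predicted risks coincide with the observed proportions in each group.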
Continuous predictor
Question 9
Specify a simple logistic regression model to predict the risk of low birth weight as a function of lwt, the weight of the mother at the last menstrual period. Estimate and interpret the model coefficients. Is there an association between the predictor and the outcome?
NB lwt is measured in pounds; convert it to kilograms if you want to communicate the results (1 pound \(\approx\) 0.45 kg).
Tip: To ease interpretation of the results, center the continuous variable around 54.5 kg (the median value).
Solution
lowbwt$lwtk <- lowbwt$lwt*0.45
\[\log\left(\textrm{odds}(\textrm{low}| \textrm{lwtk}) \right) = \beta_0 + \beta_1(\textrm{lwtk} - 54.5)\]
mod_l <- glm(low ~ I(lwtk - 54.5), data = lowbwt, family = "binomial")
summary(mod_l)
Call:
glm(formula = low ~ I(lwtk - 54.5), family = "binomial", data = lowbwt)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.70430 0.16082 -4.379 1.19e-05 ***
I(lwtk - 54.5) -0.03124 0.01371 -2.279 0.0227 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 228.69 on 187 degrees of freedom
AIC: 232.69
Number of Fisher Scoring iterations: 4
ci.exp(mod_l)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.4944559 0.3607763 0.6776683
I(lwtk - 54.5) 0.9692424 0.9435442 0.9956404
\(\exp(\beta_0)\) = 0.49 is the odds of low birth weight for mothers weighing 54.5 kg. \(\exp(\beta_1)\) = 0.97: every 1 kg increase in maternal weight is associated with a 3% decrease in the odds of low birth weight. It indicates a negative association (lower odds for higher values of maternal weight) and is statistically significant at the 5% level (p = 0.023).
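A 1 kg contrast is small, so it can help to report the odds ratio for a larger difference. Because the model is linear on the log odds scale, the coefficient and its confidence limits are simply multiplied by the size of the contrast before exponentiating. The sketch below uses the estimate and standard error printed by summary(mod_l); the 10 kg contrast is an arbitrary choice for illustration.

```r
# Estimate and standard error copied from summary(mod_l)
beta1 <- -0.03124
se1 <- 0.01371

k <- 10  # contrast of interest, in kg (illustrative choice)

or_k <- exp(k * beta1)
ci_k <- exp(k * (beta1 + c(-1, 1) * qnorm(0.975) * se1))

round(c(OR = or_k, lower = ci_k[1], upper = ci_k[2]), 2)
```

With the fitted model available, ci.exp(mod_l, ctr.mat = cbind(0, 10)) should give the same contrast directly.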
Question 10
Graph the predicted risk and 95% confidence intervals as a function of maternal weight.
Solution
cbind(lowbwt, ci.pred(mod_l)) %>%
ggplot(aes(lwtk, Estimate, ymin = `2.5%`, ymax = `97.5%`)) +
geom_line() +
geom_ribbon(alpha = .2) +
labs(x = "Maternal weight (kg)", y = "Predicted risk")
Question 11
Provide a graphical presentation of the odds ratio of low birth weight with 95% CI as a function of maternal weight, using 54.5 kg (the median) as the reference value.
Solution
lwtk <- seq(40, 60, .5)
ci.exp(mod_l, cbind(0, lwtk - 54.5)) %>%
data.frame() %>%
ggplot(aes(lwtk, `exp.Est..`, ymin = `X2.5.`, ymax = `X97.5.`)) +
geom_line() +
geom_ribbon(alpha = .2) +
scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
labs(x = "Maternal weight (kg)", y = "OR")
Multivariable Logistic Regression
Confounding
Question 12
We might think that maternal weight may be a confounder of the association between maternal smoke and odds of having low birth weight. Specify the equation of a multivariable logistic model to test the previous associations. Estimate and interpret the model coefficients.
Solution
\[\log\left(\textrm{odds}(\textrm{low}| \textrm{smoke}, \textrm{lwtk}) \right) = \beta_0 + \beta_1\textrm{smoke} + \beta_2(\textrm{lwtk} - 54.5)\]
mod_sl <- glm(low ~ smoke + I(lwtk - 54.5), data = lowbwt, family = "binomial")
summary(mod_sl)
Call:
glm(formula = low ~ smoke + I(lwtk - 54.5), family = "binomial",
data = lowbwt)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.99173 0.21879 -4.533 5.82e-06 ***
smokeYes 0.67667 0.32470 2.084 0.0372 *
I(lwtk - 54.5) -0.02961 0.01353 -2.188 0.0287 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 224.34 on 186 degrees of freedom
AIC: 230.34
Number of Fisher Scoring iterations: 4
ci.exp(mod_sl)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.3709354 0.2415835 0.5695467
smokeYes 1.9673220 1.0410977 3.7175724
I(lwtk - 54.5) 0.9708245 0.9454138 0.9969181
\(\exp(\beta_1)\) = 1.97 is the odds ratio of low birth weight for smoker mothers compared with non-smoker mothers, holding maternal weight constant. It indicates a positive association (higher odds for smokers).
\(\exp(\beta_2)\) = 0.97: every 1 kg increase in maternal weight is associated with a 3% decrease in the odds of low birth weight, adjusting for maternal smoke. It indicates a negative association (lower odds for higher values of maternal weight).
Both predictors are statistically significant at the 5% level.
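One informal way to gauge how much maternal weight confounds the smoking association is to compare the crude odds ratio (question 7) with the adjusted one. The sketch below uses the values printed by ci.exp(); the 10% threshold sometimes quoted for this comparison is a rule of thumb, not a formal test.

```r
# Odds ratios for smoke copied from ci.exp(mod_s) and ci.exp(mod_sl)
or_crude <- 2.0219436
or_adjusted <- 1.9673220

# Relative change in the odds ratio after adjusting for maternal weight
pct_change <- 100 * (or_adjusted - or_crude) / or_crude
round(pct_change, 1)
```

A change of this size suggests that maternal weight confounds the association only slightly in these data.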
Question 13
Present the results in terms of predicted (log) odds and probabilities in a graphical format.
Solution
pred_sl <- expand.grid(
smoke = c("No", "Yes"),
lwtk = lwtk
) %>%
mutate(
prob = predict(mod_sl, newdata = ., type = "response"),
logodds = predict(mod_sl, newdata = ., type = "link")
)
grid.arrange(
ggplot(pred_sl, aes(lwtk, prob, col = smoke)) +
geom_line() +
labs(x = "Maternal weight (kg)", y = "Probability", col = "Maternal smoke"),
ggplot(pred_sl, aes(lwtk, logodds, col = smoke)) +
geom_line() +
labs(x = "Maternal weight (kg)", y = "Log odds", col = "Maternal smoke"),
ncol = 2
)
# Alternative plot
gather(pred_sl, measure, pred, prob, logodds) %>%
ggplot(aes(lwtk, pred, col = smoke)) +
geom_line() +
facet_grid(measure ~ ., scales = "free") +
labs(x = "Maternal weight (kg)", y = "", col = "Maternal smoke")
Interaction
Question 14
Is there any evidence that the association between maternal smoke and the odds of low birth weight varies according to maternal weight? Specify the equation of a multivariable logistic model to test the previous hypothesis. Estimate and interpret the model coefficients.
Solution
\[\log\left(\textrm{odds}(\textrm{low}| \textrm{smoke}, \textrm{lwtk}) \right) = \beta_0 + \beta_1\textrm{smoke} + \beta_2(\textrm{lwtk} - 54.5) + \beta_3\textrm{smoke} \cdot (\textrm{lwtk} - 54.5)\]
mod_sl_int <- glm(low ~ smoke*I(lwtk - 54.5), data = lowbwt, family = "binomial")
summary(mod_sl_int)
Call:
glm(formula = low ~ smoke * I(lwtk - 54.5), family = "binomial",
data = lowbwt)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.96057 0.22113 -4.344 1.4e-05 ***
smokeYes 0.61680 0.32721 1.885 0.0594 .
I(lwtk - 54.5) -0.05308 0.02309 -2.299 0.0215 *
smokeYes:I(lwtk - 54.5) 0.03904 0.02843 1.373 0.1697
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 222.37 on 185 degrees of freedom
AIC: 230.37
Number of Fisher Scoring iterations: 4
ci.exp(mod_sl_int)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.3826753 0.2480860 0.5902807
smokeYes 1.8529854 0.9757701 3.5188156
I(lwtk - 54.5) 0.9483034 0.9063424 0.9922070
smokeYes:I(lwtk - 54.5) 1.0398121 0.9834564 1.0993973
\(\exp(\beta_0)\) = 0.38 is the odds of low birth weight for non-smoker mothers with a median weight (54.5 kg).
\(\exp(\beta_1)\) = 1.85 is the odds ratio of low birth weight for smoker compared with non-smoker mothers with a median weight (54.5 kg).
\(\exp(\beta_2)\) = 0.95: every 1 kg increase in maternal weight for non-smoker mothers is associated with a 5% decrease in the odds of low birth weight.
\(\exp(\beta_3)\) = 1.04 is the multiplicative interaction term: the ratio between the per-kg odds ratio of maternal weight in smokers and the same odds ratio in non-smokers.
The interaction term is not statistically significant (p = 0.17), so there is no strong evidence that the association between maternal smoke and the odds of low birth weight varies according to maternal weight.
Question 15
Present the association (OR) between maternal weight and the odds of low birth weight separately for smoker and non-smoker women.
Solution
ci.exp(mod_sl_int, cbind(0, 0, 1, 0))
 exp(Est.) 2.5% 97.5%
[1,] 0.9483034 0.9063424 0.992207
ci.exp(mod_sl_int, cbind(0, 0, 1, 1))
 exp(Est.) 2.5% 97.5%
[1,] 0.9860573 0.9545186 1.018638
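The two contrasts above pick out \(\beta_2\) for non-smokers and \(\beta_2 + \beta_3\) for smokers. As a check, the point estimates can be recomputed from the coefficients printed by summary(mod_sl_int); this sketch covers the point estimates only, since the confidence interval for the sum also needs the covariance between the two coefficients, which ci.exp() takes from the fitted model.

```r
# Coefficients copied from summary(mod_sl_int)
b2 <- -0.05308   # maternal weight (per kg) in non-smokers
b3 <- 0.03904    # interaction: extra slope in smokers

or_nonsmokers <- exp(b2)        # OR per 1 kg among non-smokers
or_smokers <- exp(b2 + b3)      # OR per 1 kg among smokers

round(c(non_smokers = or_nonsmokers, smokers = or_smokers), 3)
```

Each additional kilogram is associated with about 5% lower odds among non-smokers and about 1% lower odds among smokers, with a confidence interval for smokers that includes 1.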
Non-linearity
Question 16
Maternal weight may be associated in a non-linear fashion with the odds of low birth weight. Specify the equation of a multivariable logistic model to test the previous hypothesis, which takes into account the effect of maternal smoke.
Solution
\[\log\left(\textrm{odds}(\textrm{low}| \textrm{smoke}, \textrm{lwtk}) \right) = \beta_0 + \beta_1\textrm{smoke} + \beta_2(\textrm{lwtk} - 54.5) + \beta_3(\textrm{lwtk} - 54.5)^2\]
mod_sl2 <- glm(low ~ smoke + I(lwtk - 54.5) + I((lwtk - 54.5)^2),
data = lowbwt, family = "binomial")
summary(mod_sl2)
Call:
glm(formula = low ~ smoke + I(lwtk - 54.5) + I((lwtk - 54.5)^2),
family = "binomial", data = lowbwt)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.0139842 0.2247579 -4.511 6.44e-06 ***
smokeYes 0.6580473 0.3277372 2.008 0.0447 *
I(lwtk - 54.5) -0.0355892 0.0191917 -1.854 0.0637 .
I((lwtk - 54.5)^2) 0.0002745 0.0006154 0.446 0.6556
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.67 on 188 degrees of freedom
Residual deviance: 224.15 on 185 degrees of freedom
AIC: 232.15
Number of Fisher Scoring iterations: 4
ci.exp(mod_sl2)
 exp(Est.) 2.5% 97.5%
(Intercept) 0.3627707 0.2335167 0.5635683
smokeYes 1.9310179 1.0158179 3.6707664
I(lwtk - 54.5) 0.9650366 0.9294109 1.0020279
I((lwtk - 54.5)^2) 1.0002745 0.9990687 1.0014817
\(\beta_0\) and \(\beta_1\) have a similar interpretation to those in question 12.
The coefficients of the non-linear association (\(\beta_2\) and \(\beta_3\)) are not directly interpretable.
lrtest(mod_sl2, mod_sl)
Likelihood ratio test for MLE method
Chi-squared 1 d.f. = 0.1899367 , P value = 0.6629693
The likelihood ratio test (p = 0.66) gives no evidence of a departure from linearity: the quadratic term does not improve the fit of the model.
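The likelihood ratio statistic is the difference between the residual deviances of the two nested models, compared with a chi-squared distribution whose degrees of freedom equal the number of extra parameters. The sketch below recomputes it from the deviances printed by summary(); because those are rounded to two decimals, the result only approximately matches the lrtest() output.

```r
# Residual deviances and degrees of freedom copied from the summaries
dev_linear <- 224.34      # mod_sl, 186 residual df
dev_quadratic <- 224.15   # mod_sl2, 185 residual df

lr_stat <- dev_linear - dev_quadratic
lr_df <- 186 - 185

p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
round(c(statistic = lr_stat, df = lr_df, p = p_value), 3)
```

With the fitted models available, anova(mod_sl, mod_sl2, test = "LRT") should perform the same comparison.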
Question 17
Provide a graphical presentation of the odds ratio of low birth weight with 95% CI as a function of maternal weight using 54.5 kg as reference value.
Solution
ci.exp(mod_sl2, ctr.mat = cbind(0, 0, lwtk - 54.5, (lwtk - 54.5)^2)) %>%
data.frame() %>%
ggplot(aes(lwtk, `exp.Est..`, ymin = `X2.5.`, ymax = `X97.5.`)) +
geom_line() +
geom_ribbon(alpha = .2) +
scale_y_continuous(trans = "log", breaks = pretty_breaks()) +
labs(x = "Maternal weight (kg)", y = "Smoke-adjusted OR")