Advanced Data Analysis - MSc Mathematics in Epidemiology
Author
Rabecca Kanini KATING’U
Published
22 March 2026
Introduction
This assignment focuses on applying Generalized Linear Models (GLM) to epidemiological data. We will explore Malaria prevalence, Food Insecurity, and Tuberculosis transmission using simulated datasets representative of Sub-Saharan African settings.
Part 1: Logistic Regression (Malaria Prevalence)
We are investigating the association between household characteristics and Malaria infection in children under five in a rural district.
Data dictionary
This dataset (stored in the file malaria_data.csv) simulates a cross-sectional survey of children under five years old in a malaria-endemic region.
Variable Name
Data Type
Units / Levels
Description
infected
Factor
No, Yes
Indicates whether the child tested positive for malaria parasites via RDT or microscopy.
age
Numeric
Months
The age of the child in months (Range: 6–60 months).
bednet
Factor
No, Yes
Indicates if the child slept under an insecticide-treated net (ITN) the previous night.
wealth
Integer
1 to 5
A composite index of household assets (1 = Poorest quintile, 5 = Richest quintile).
Question 1.1: Exploratory Analysis
Calculate the percentage of children infected with malaria in this sample. Create a plot showing the proportion of infection by bednet usage.
# Load the datasetmalaria_data <-read.csv("malaria_data.csv")# Converting 'infected' to binary (Yes = 1, No = 0)malaria_data$infected_binary <-ifelse(malaria_data$infected =="Yes", 1, 0)# Percentage of children infected with malariainfected_percentage <-mean(malaria_data$infected_binary) *100print(paste("Percentage of children infected with malaria:", round(infected_percentage, 2), "%"))
[1] "Percentage of children infected with malaria: 16.8 %"
# Plot showing proportion of infection by bednet usageggplot(malaria_data, aes(x = bednet, y = infected_binary, fill = bednet)) +geom_bar(stat ="summary", fun ="mean", color ="black") +scale_fill_manual(values =c("No"="orange", "Yes"="yellow")) +labs(title ="Proportion of Malaria Infection by Bednet Usage",x ="Bednet Usage",y ="Proportion of Infection",fill ="Bednet Use") +theme_minimal()
Question 1.2: Model Fitting
Fit a logistic regression model to predict infected status using age, bednet, and wealth.
# Fit a logistic regression modellogit_model <-glm(infected_binary ~ age + bednet + wealth, data = malaria_data, family = binomial)# Display the model summarysummary(logit_model)
Call:
glm(formula = infected_binary ~ age + bednet + wealth, family = binomial,
data = malaria_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.107962 0.414190 2.675 0.00747 **
age -0.026148 0.008705 -3.004 0.00267 **
bednetYes -1.591694 0.269076 -5.915 3.31e-09 ***
wealth -0.404101 0.096300 -4.196 2.71e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 452.70 on 499 degrees of freedom
Residual deviance: 390.06 on 496 degrees of freedom
AIC: 398.06
Number of Fisher Scoring iterations: 5
Question 1.3: Interpretation
Exponetiate the coefficients to obtain Odds Ratios (OR). Interpret the OR for bednet. Is the use of a bednet a protective factor in this simulation?
coef_est <-coef(logit_model)conf_int <-confint.default(logit_model)# Exponentiate to get Odds Ratios and CIOR <-exp(coef_est)CI_lower <-exp(conf_int[,1])CI_upper <-exp(conf_int[,2])results_table <-cbind(OR, CI_lower, CI_upper)print(round(results_table, 3))
OR CI_lower CI_upper
(Intercept) 3.028 1.345 6.819
age 0.974 0.958 0.991
bednetYes 0.204 0.120 0.345
wealth 0.668 0.553 0.806
The odds ratio for bednet use is 0.204. Since this value is less than 1, it indicates that bednet use reduces the odds of malaria infection. Children who use a bednet have 79.6% lower odds of being infected compared to those who do not use a bednet. The 95% confidence interval (0.12, 0.345) does not include 1, indicating that this effect is statistically significant. Therefore, bednet use is a protective factor in this simulation.
Part 2: Ordinal Logistic Regression (Food Insecurity)
In this section, we assess household food insecurity levels, categorized as: 1: Secure, 2: Mildly Insecure, 3: Moderately Insecure, 4: Severely Insecure.
Data dictionary
This dataset (stored in the file food_insecurity_data.csv) measures the severity of food insecurity using an ordinal scale, common in nutrition and livelihoods assessments in Sub-Saharan Africa.
The geographic classification of the household’s location.
size
Integer
People
The total number of individuals currently living in the household.
Question 2.1: Proportional Odds Model
Fit an ordinal logistic regression model to assess the impact of living in a rural area and household size on food insecurity levels.
library(MASS)# Load the datasetfood_insecurity_data <-read.csv("food_insecurity_data.csv")# Converting status variable to an ordered factorfood_insecurity_data$status <-factor(food_insecurity_data$status, ordered =TRUE, levels =c(1, 2, 3, 4))# Fit an ordinal logistic regression modelordinal_model <-polr(status ~ rural + size, data = food_insecurity_data, Hess =TRUE)# Display the model summarysummary(ordinal_model)
Call:
polr(formula = status ~ rural + size, data = food_insecurity_data,
Hess = TRUE)
Coefficients:
Value Std. Error t value
ruralUrban -1.6926 0.21274 -7.956
size 0.2982 0.04449 6.703
Intercepts:
Value Std. Error t value
1|2 -1.0112 0.2959 -3.4177
2|3 0.7406 0.2855 2.5943
3|4 2.4524 0.3087 7.9438
Residual Deviance: 961.2149
AIC: 971.2149
Question 2.2: Proportional Odds Assumption
Use the brant() test to check the proportional odds assumption. What is your conclusion?
library(brant)brant_test <-brant(ordinal_model)
--------------------------------------------
Test for X2 df probability
--------------------------------------------
Omnibus 10.42 4 0.03
ruralUrban 4.77 2 0.09
size 5.92 2 0.05
--------------------------------------------
H0: Parallel Regression Assumption holds
From the results of the Brant test, the Omnibus test has a p-value of 0.03, which is less than 0.05, so we reject the null hypothesis that the proportional odds assumption holds for the overall model. For the individual predictors, the assumption holds for ruralUrban (p = 0.09) and size (p = 0.05), but since size is close to 0.05, we can say there is a marginal violation. Therefore, we reject the proportional odds assumption for the overall model, and alternative models like partial proportional odds models or multinomial logistic regression may be more appropriate.
Part 3: Count Data Models (TB Case Notification)
We are looking at the number of new Tuberculosis (TB) cases notified per month in 100 health facility catchments.
Data dictionary
This dataset (stored in the file tb_cases_data.csv) represents aggregate health facility data, where each row is a specific health facility catchment area.
Variable Name
Data Type
Units / Levels
Description
cases
Integer (Count)
Counts
The number of new bacteriologically confirmed Tuberculosis cases notified to the facility in a single month.
hiv
Numeric
Percentage (%)
Estimated HIV prevalence among the adult population within the health facility catchment area.
distance
Numeric
Kilometers (km)
The average travel distance (population-weighted) for residents of the catchment area to reach the health facility.
Question 3.1: Distribution Assessment
Compare the mean and the variance of the cases variable. Based on this, which model (Poisson or Negative Binomial) do you expect to fit better?
# Load the datasettb_cases_data <-read.csv("tb_cases_data.csv")# Calculate the mean and variance of the 'cases' variablemean_cases <-mean(tb_cases_data$cases)variance_cases <-var(tb_cases_data$cases)print(paste("Mean of the cases:", round(mean_cases, 2)))
[1] "Mean of the cases: 5.61"
print(paste("Variance of the cases:", round(variance_cases, 2)))
[1] "Variance of the cases: 29.86"
From the above output, the variance (29.86) is much greater than the mean (5.61), which indicates overdispersion in the data. The Negative Binomial distribution is a better fit for count data with overdispersion, as it allows for variance to be greater than the mean.
Therefore, the Negative Binomial model would fit better than the Poisson model for this data.
Question 3.2: Model Comparison
Fit both a Poisson and a Negative Binomial model, having cases as the outcome, and hiv and distance as explanatory variables. Use the AIC() function to determine the best-fitting model.
library(MASS)# Fitting Poisson modelpoisson_model <-glm(cases ~ hiv + distance, data = tb_cases_data, family = poisson)poisson_model
Call: glm(formula = cases ~ hiv + distance, family = poisson, data = tb_cases_data)
Coefficients:
(Intercept) hiv distance
1.108459 0.045482 -0.008505
Degrees of Freedom: 99 Total (i.e. Null); 97 Residual
Null Deviance: 459.9
Residual Deviance: 424.3 AIC: 736.5
# Fitting Negative Binomial modelnb_model <-glm.nb(cases ~ hiv + distance, data = tb_cases_data)nb_model
Call: glm.nb(formula = cases ~ hiv + distance, data = tb_cases_data,
init.theta = 1.615176238, link = log)
Coefficients:
(Intercept) hiv distance
1.059592 0.046370 -0.004041
Degrees of Freedom: 99 Total (i.e. Null); 97 Residual
Null Deviance: 119.5
Residual Deviance: 111.5 AIC: 557.8
# Comparing the models using AICpoisson_aic <-AIC(poisson_model)nb_aic <-AIC(nb_model)
print(paste("AIC of Poisson model:", round(poisson_aic, 2)))
[1] "AIC of Poisson model: 736.47"
print(paste("AIC of Negative Binomial model:", round(nb_aic, 2)))
[1] "AIC of Negative Binomial model: 557.85"
From the results above, the AIC of the Negative Binomial model (557.85) is lower than the AIC of the Poisson model (736.47), therefore we conclude that the Negative Binomial model is a better fit for this data. This suggests that the Negative Binomial distribution, which accounts for overdispersion (variance greater than the mean), provides a more accurate representation of the TB case counts in the health facility catchment areas.
Question 3.3: Rate Ratios
Calculate the Rate Ratio (RR) for HIV prevalence from your chosen model. Interpret what a 1-unit increase in HIV prevalence does to the expected number of TB cases.
hiv_coef <-coef(nb_model)["hiv"]rate_ratio_hiv <-exp(hiv_coef)print(paste("Rate Ratio for HIV prevalence:", round(rate_ratio_hiv, 2)))
[1] "Rate Ratio for HIV prevalence: 1.05"
The Rate Ratio (RR) for HIV prevalence is 1.05. This means that for every 1-unit increase in HIV prevalence, the expected number of TB cases increases by 5%, holding other variables constant. Therefore, higher HIV prevalence is positively associated with an increase in the expected number of TB cases in the health facility catchment areas.