Assignment 1

Advanced Data Analysis - MSc Mathematics in Epidemiology

Author

Rabecca Kanini KATING’U

Published

22 March 2026

Introduction

This assignment focuses on applying Generalized Linear Models (GLM) to epidemiological data. We will explore Malaria prevalence, Food Insecurity, and Tuberculosis transmission using simulated datasets representative of Sub-Saharan African settings.

Part 1: Logistic Regression (Malaria Prevalence)

We are investigating the association between household characteristics and Malaria infection in children under five in a rural district.

Data dictionary

This dataset (stored in the file malaria_data.csv) simulates a cross-sectional survey of children under five years old in a malaria-endemic region.

Variable Name	Data Type	Units / Levels	Description
`infected`	Factor	No, Yes	Indicates whether the child tested positive for malaria parasites via RDT or microscopy.
`age`	Numeric	Months	The age of the child in months (Range: 6–60 months).
`bednet`	Factor	No, Yes	Indicates if the child slept under an insecticide-treated net (ITN) the previous night.
`wealth`	Integer	1 to 5	A composite index of household assets (1 = Poorest quintile, 5 = Richest quintile).

Question 1.1: Exploratory Analysis

Calculate the percentage of children infected with malaria in this sample. Create a plot showing the proportion of infection by bednet usage.

# Load the dataset
malaria_data <- read.csv("malaria_data.csv")
# Converting 'infected' to binary (Yes = 1, No = 0)
malaria_data$infected_binary <- ifelse(malaria_data$infected == "Yes", 1, 0)

# Percentage of children infected with malaria
infected_percentage <- mean(malaria_data$infected_binary) * 100
print(paste("Percentage of children infected with malaria:", round(infected_percentage, 2), "%"))

[1] "Percentage of children infected with malaria: 16.8 %"

# Plot showing proportion of infection by bednet usage
ggplot(malaria_data, aes(x = bednet, y = infected_binary, fill = bednet)) +
  geom_bar(stat = "summary", fun = "mean", color = "black") +
  scale_fill_manual(values = c("No" = "orange", "Yes" = "yellow")) +
  labs(title = "Proportion of Malaria Infection by Bednet Usage",
       x = "Bednet Usage",
       y = "Proportion of Infection",
       fill = "Bednet Use") +
  theme_minimal()

Question 1.2: Model Fitting

Fit a logistic regression model to predict infected status using age, bednet, and wealth.

# Fit a logistic regression model
logit_model <- glm(infected_binary ~ age + bednet + wealth, data = malaria_data, family = binomial)

# Display the model summary
summary(logit_model)


Call:
glm(formula = infected_binary ~ age + bednet + wealth, family = binomial, 
    data = malaria_data)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.107962   0.414190   2.675  0.00747 ** 
age         -0.026148   0.008705  -3.004  0.00267 ** 
bednetYes   -1.591694   0.269076  -5.915 3.31e-09 ***
wealth      -0.404101   0.096300  -4.196 2.71e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 452.70  on 499  degrees of freedom
Residual deviance: 390.06  on 496  degrees of freedom
AIC: 398.06

Number of Fisher Scoring iterations: 5

Question 1.3: Interpretation

Exponetiate the coefficients to obtain Odds Ratios (OR). Interpret the OR for bednet. Is the use of a bednet a protective factor in this simulation?

coef_est <- coef(logit_model)


conf_int <- confint.default(logit_model)

# Exponentiate to get Odds Ratios and CI
OR <- exp(coef_est)
CI_lower <- exp(conf_int[,1])
CI_upper <- exp(conf_int[,2])


results_table <- cbind(OR, CI_lower, CI_upper)

print(round(results_table, 3))

               OR CI_lower CI_upper
(Intercept) 3.028    1.345    6.819
age         0.974    0.958    0.991
bednetYes   0.204    0.120    0.345
wealth      0.668    0.553    0.806

The odds ratio for bednet use is 0.204. Since this value is less than 1, it indicates that bednet use reduces the odds of malaria infection. Children who use a bednet have 79.6% lower odds of being infected compared to those who do not use a bednet. The 95% confidence interval (0.12, 0.345) does not include 1, indicating that this effect is statistically significant. Therefore, bednet use is a protective factor in this simulation.

Part 2: Ordinal Logistic Regression (Food Insecurity)

In this section, we assess household food insecurity levels, categorized as: 1: Secure, 2: Mildly Insecure, 3: Moderately Insecure, 4: Severely Insecure.

Data dictionary

This dataset (stored in the file food_insecurity_data.csv) measures the severity of food insecurity using an ordinal scale, common in nutrition and livelihoods assessments in Sub-Saharan Africa.

Variable Name	Data Type	Units / Levels	Description
`status`	Ordered Factor	1, 2, 3, 4	Household Food Insecurity Access Scale (HFIAS) category.
`rural`	Factor	Urban, Rural	The geographic classification of the household’s location.
`size`	Integer	People	The total number of individuals currently living in the household.

Question 2.1: Proportional Odds Model

Fit an ordinal logistic regression model to assess the impact of living in a rural area and household size on food insecurity levels.

library(MASS)
# Load the  dataset
food_insecurity_data <- read.csv("food_insecurity_data.csv")

# Converting  status variable to an ordered factor
food_insecurity_data$status <- factor(food_insecurity_data$status, ordered = TRUE, levels = c(1, 2, 3, 4))


# Fit an ordinal logistic regression model
ordinal_model <- polr(status ~ rural + size, data = food_insecurity_data, Hess = TRUE)

# Display the model summary
summary(ordinal_model)

Call:
polr(formula = status ~ rural + size, data = food_insecurity_data, 
    Hess = TRUE)

Coefficients:
             Value Std. Error t value
ruralUrban -1.6926    0.21274  -7.956
size        0.2982    0.04449   6.703

Intercepts:
    Value   Std. Error t value
1|2 -1.0112  0.2959    -3.4177
2|3  0.7406  0.2855     2.5943
3|4  2.4524  0.3087     7.9438

Residual Deviance: 961.2149 
AIC: 971.2149

Question 2.2: Proportional Odds Assumption

Use the brant() test to check the proportional odds assumption. What is your conclusion?

library(brant)

brant_test <- brant(ordinal_model)

-------------------------------------------- 
Test for    X2  df  probability 
-------------------------------------------- 
Omnibus     10.42   4   0.03
ruralUrban  4.77    2   0.09
size        5.92    2   0.05
-------------------------------------------- 

H0: Parallel Regression Assumption holds

From the results of the Brant test, the Omnibus test has a p-value of 0.03, which is less than 0.05, so we reject the null hypothesis that the proportional odds assumption holds for the overall model. For the individual predictors, the assumption holds for ruralUrban (p = 0.09) and size (p = 0.05), but since size is close to 0.05, we can say there is a marginal violation. Therefore, we reject the proportional odds assumption for the overall model, and alternative models like partial proportional odds models or multinomial logistic regression may be more appropriate.

Part 3: Count Data Models (TB Case Notification)

We are looking at the number of new Tuberculosis (TB) cases notified per month in 100 health facility catchments.

Data dictionary

This dataset (stored in the file tb_cases_data.csv) represents aggregate health facility data, where each row is a specific health facility catchment area.

Variable Name	Data Type	Units / Levels	Description
`cases`	Integer (Count)	Counts	The number of new bacteriologically confirmed Tuberculosis cases notified to the facility in a single month.
`hiv`	Numeric	Percentage (%)	Estimated HIV prevalence among the adult population within the health facility catchment area.
`distance`	Numeric	Kilometers (km)	The average travel distance (population-weighted) for residents of the catchment area to reach the health facility.

Question 3.1: Distribution Assessment

Compare the mean and the variance of the cases variable. Based on this, which model (Poisson or Negative Binomial) do you expect to fit better?

# Load the dataset
tb_cases_data <- read.csv("tb_cases_data.csv")

# Calculate the mean and variance of the 'cases' variable
mean_cases <- mean(tb_cases_data$cases)
variance_cases <- var(tb_cases_data$cases)

print(paste("Mean of the cases:", round(mean_cases, 2)))

[1] "Mean of the cases: 5.61"

print(paste("Variance of the cases:", round(variance_cases, 2)))

[1] "Variance of the cases: 29.86"

From the above output, the variance (29.86) is much greater than the mean (5.61), which indicates overdispersion in the data. The Negative Binomial distribution is a better fit for count data with overdispersion, as it allows for variance to be greater than the mean.
Therefore, the Negative Binomial model would fit better than the Poisson model for this data.

Question 3.2: Model Comparison

Fit both a Poisson and a Negative Binomial model, having cases as the outcome, and hiv and distance as explanatory variables. Use the AIC() function to determine the best-fitting model.

library(MASS)

#  Fitting  Poisson model
poisson_model <- glm(cases ~ hiv + distance, data = tb_cases_data, family = poisson)
poisson_model


Call:  glm(formula = cases ~ hiv + distance, family = poisson, data = tb_cases_data)

Coefficients:
(Intercept)          hiv     distance  
   1.108459     0.045482    -0.008505  

Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
Null Deviance:      459.9 
Residual Deviance: 424.3    AIC: 736.5

#  Fitting  Negative Binomial model
nb_model <- glm.nb(cases ~ hiv + distance, data = tb_cases_data)
nb_model


Call:  glm.nb(formula = cases ~ hiv + distance, data = tb_cases_data, 
    init.theta = 1.615176238, link = log)

Coefficients:
(Intercept)          hiv     distance  
   1.059592     0.046370    -0.004041  

Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
Null Deviance:      119.5 
Residual Deviance: 111.5    AIC: 557.8

#  Comparing the models using AIC
poisson_aic <- AIC(poisson_model)
nb_aic <- AIC(nb_model)

print(paste("AIC of Poisson model:", round(poisson_aic, 2)))

[1] "AIC of Poisson model: 736.47"

print(paste("AIC of Negative Binomial model:", round(nb_aic, 2)))

[1] "AIC of Negative Binomial model: 557.85"

From the results above, the AIC of the Negative Binomial model (557.85) is lower than the AIC of the Poisson model (736.47), therefore we conclude that the Negative Binomial model is a better fit for this data. This suggests that the Negative Binomial distribution, which accounts for overdispersion (variance greater than the mean), provides a more accurate representation of the TB case counts in the health facility catchment areas.

Question 3.3: Rate Ratios

Calculate the Rate Ratio (RR) for HIV prevalence from your chosen model. Interpret what a 1-unit increase in HIV prevalence does to the expected number of TB cases.

hiv_coef <- coef(nb_model)["hiv"]

rate_ratio_hiv <- exp(hiv_coef)

print(paste("Rate Ratio for HIV prevalence:", round(rate_ratio_hiv, 2)))

[1] "Rate Ratio for HIV prevalence: 1.05"

The Rate Ratio (RR) for HIV prevalence is 1.05. This means that for every 1-unit increase in HIV prevalence, the expected number of TB cases increases by 5%, holding other variables constant. Therefore, higher HIV prevalence is positively associated with an increase in the expected number of TB cases in the health facility catchment areas.