# Hide all code but keep outputs 
knitr::opts_chunk$set( message = FALSE, warning = FALSE,results = 'asis' )

Executive Summary

This report documents the end-to-end analytical workflow implemented in R for a cross-sectional study assessing community knowledge of Mpox in five provinces of Zambia. The analysis formed the basis of a manuscript currently under peer review; however, this document focuses specifically on demonstrating intermediate-level competencies in data management, statistical analysis, visualization, and reproducible reporting using R.

The project involved importing raw survey data from Excel, performing structured data cleaning, re-coding categorical variables, engineering derived variables (including age categories, income groups, and occupation classifications), and constructing a composite knowledge score using Bloom’s cut-off criteria. Analysis was restricted to participants who had heard of Mpox, and a binary outcome variable was created to classify knowledge as good or poor.

Exploratory data analysis was conducted using the tidyverse framework, including summary statistics, frequency distributions, and graphical visualizations to assess distributions and identify outliers. Categorical associations were evaluated using Chi-square tests, with formatted summary tables generated using the gtsummary package and exported to Word.

To assess factors associated with good knowledge, uni-variable and multivariable logistic regression models were fitted using generalized linear models with a binomial link function. Odds ratios and 95% confidence intervals were obtained through coefficient exponentiation. Model selection and refinement were guided by theoretical relevance and comparison of Akaike and Bayesian Information Criteria (AIC and BIC), demonstrating structured model evaluation and parsimony.

Advanced features of R were also utilized, including:

All analyses were conducted within an R Markdown environment to ensure full reproducibility of outputs, including tables, figures, and model results.

This project demonstrates applied proficiency in data wrangling, variable engineering, regression modeling, model comparison, spatial analysis, and reproducible reporting -reflecting intermediate-level competence in R for public health data analysis.

Analytical Environment Setup and Package Management

The analytical environment was configured by loading packages required for data import, manipulation, visualization, statistical modeling, and reproducible reporting. The tidyverse framework was used for data wrangling and graphical analysis, while additional packages supported regression diagnostics, formatted table generation, and automated export of results. Organizing package loading at the beginning of the document enhances reproducibility and transparency.

library(tidyverse)      # Includes dplyr, ggplot2, tidyr, forcats, etc.
library(readxl)         # Reading Excel files
library(gtsummary)      # Summary tables
library(flextable)      # Table formatting for Word
library(officer)        # Export to Word
library(scales)         # Percent formatting, axis scaling
library(janitor)        # Cleaning column names
library(gt)             # Table formatting
library(sjPlot)         # Data visualization of tables
library(sjmisc)         # Data manipulation utilities
library(pROC)           # ROC curves (if used)
library(elrm)           # Exact logistic regression (if used)
library(logistf)        # Firth logistic regression
library(knitr)          # Printing tables in RMarkdown
library(car)            # Recoding, regression utilities
library(DiagrammeR)     # Diagrams/flowcharts
library(DiagrammeRsvg)  # Export DiagrammeR to SVG
library(rsvg)           # Render SVG images
library(sf)             # Spatial data handling
library(ggspatial)      # Spatial visualization in ggplot2
library(here)

Data Import and Initial Cleaning

Raw survey data exported from the electronic data collection platform contained system-generated metadata and label-heavy column names. These were programmatically cleaned to ensure analysis-ready variable names and improve reproducibility.

# Suppress warnings and messages when reading Excel file Corrected version

Mpox <- suppressWarnings(
  suppressMessages(
    readxl::read_excel("Data/KAP_Mpox_2025_-_all_versions_-_labels_-_2025-10-08-18-00-59.xlsx")
  )
)

Remove some columns

Purpose: This step identifies all survey and metadata columns that are not needed for analysis. By defining cols_to_remove, we prepare the dataset for cleaner, focused analysis.

names(Mpox)

[1] “start”
[2] “end”
[3] “deviceid”
[4] “Introduction: We are conducting this survey to understand community members’ knowledge, attitudes, and practices regarding Mpox. Your participation is voluntary, your responses will be confidential, and you may stop at any time. You must be 18 years or older and have lived in this area for at least 6 months to participate.” [5] “Do you agree to participate in this survey?”
[6] “Respondent did not consent. End the interview and thank them.”
[7] “A2. Date (YYYY-MM-DD)”
[8] “A3a. Province”
[9] “A3b. District”
[10] “A4. Facility”
[11] “A5. Interviewer Name”
[12] “GPS coordinate”
[13] “_GPS coordinate_latitude”
[14] “_GPS coordinate_longitude”
[15] “_GPS coordinate_altitude”
[16] “_GPS coordinate_precision”
[17] “B0. How many months have you lived in this area?”
[18] “B1. Sex”
[19] “B2. Age in years”
[20] “B3. Marital status”
[21] “B4. Residence”
[22] “B5. Education level”
[23] “B6. Occupation”
[24] “Specify occupation (Other)”
[25] “B7. Religion”
[26] “B8. Monthly household income (ZMW)…26”
[27] “eligible”
[28] “Not eligible (must be ≥18 years and resident ≥6 months). End interview.”
[29] “C1. Have you heard of Mpox before today?”
[30] “C2. Where did you hear about Mpox? (multiple)/Radio/TV”
[31] “C2. Where did you hear about Mpox? (multiple)/Newspaper”
[32] “C2. Where did you hear about Mpox? (multiple)/Internet/social media”
[33] “C2. Where did you hear about Mpox? (multiple)/Health worker”
[34] “C2. Where did you hear about Mpox? (multiple)/Friends/family”
[35] “C2. Where did you hear about Mpox? (multiple)/Posters/community meetings”
[36] “C2. Where did you hear about Mpox? (multiple)/Other (specify)”
[37] “Specify other source”
[38] “C3. Which source do you trust the most?”
[39] “Specify other trusted source”
[40] “C4. How frequently do you receive information about Mpox?”
[41] “D1. What causes the disease Mpox?”
[42] “D2. How is Mpox transmitted? (multiple)/From animals (such as monkeys, rodents)”
[43] “D2. How is Mpox transmitted? (multiple)/From person-to-person through close contact”
[44] “D2. How is Mpox transmitted? (multiple)/Through sexual contact”
[45] “D2. How is Mpox transmitted? (multiple)/Through respiratory droplets”
[46] “D2. How is Mpox transmitted? (multiple)/From contaminated materials (clothes, beddings)”
[47] “D2. How is Mpox transmitted? (multiple)/Don’t know”
[48] “D3. Mpox can be transmitted through… (multiple)/Direct contact with rash/lesions”
[49] “D3. Mpox can be transmitted through… (multiple)/Respiratory droplets during close contact”
[50] “D3. Mpox can be transmitted through… (multiple)/Contact with contaminated objects (bedding, clothing)”
[51] “D3. Mpox can be transmitted through… (multiple)/Bite from infected animals”
[52] “D3. Mpox can be transmitted through… (multiple)/Sexual contact”
[53] “D4. Common symptoms of Mpox (Yes/No/Don’t know)”
[54] “Fever”
[55] “Headache”
[56] “Swollen lymph nodes”
[57] “Skin rash/lesions”
[58] “Back pain”
[59] “Muscle aches”
[60] “D5. Mpox symptoms can resemble chickenpox”
[61] “D6. There is a vaccine that can protect against Mpox.”
[62] “D7. Being vaccinated provides some protection against Mpox.”
[63] “D8a. Should someone with suspected Mpox seek care at a health facility?”
[64] “D8b. Should someone with suspected Mpox isolate from others?”
[65] “D9. Does handwashing and cleaning surfaces help prevent Mpox?”
[66] “D10. Mpox can be prevented by avoiding contact with wild animals.”
[67] “D11. Pregnant women can transmit Mpox to their unborn child.”
[68] “D12. Mpox can cause serious illness in children and immunocompromised people.”
[69] “D13. Mpox incubation period is 5–21 days.”
[70] “D14. Mpox can be spread via contaminated surfaces.”
[71] “D16. Maternal-to-foetal transmission is possible.”
[72] “E1. Community members can work with health workers to reduce Mpox spread.”
[73] “E2. Awareness programs about Mpox are needed in my district.”
[74] “E3. I trust health authorities to manage an Mpox outbreak.”
[75] “E4. I would take an Mpox vaccine if recommended.”
[76] “E5. Mpox is a serious public health threat.”
[77] “E6. I am concerned about Mpox affecting my family.”
[78] “E7. Mpox is more dangerous than other diseases in my community.”
[79] “E8. I believe I can protect myself from Mpox if I follow recommended measures.”
[80] “F1. In the last 30 days, have you discussed Mpox prevention with others?”
[81] “F2. If you had fever and rash, would you seek care at a health facility immediately?”
[82] “F3. Would you avoid close contact with a person with Mpox symptoms?”
[83] “F4. Would you avoid sharing bedding/utensils with someone suspected of Mpox?”
[84] “F5. Have you avoided crowded gatherings during an outbreak?”
[85] “F6. Would you accept Mpox vaccination if available?”
[86] “F7. In the past month, have you reported a suspected Mpox case?”
[87] “F8. In the past month, have you avoided contact with wild or sick animals?”
[88] “F9. In the past month, have you handled bushmeat?”
[89] “F10. In the past month, have you attended large community gatherings during an outbreak?”
[90] “F11. In the past month, have you ever cared for some with Mpox ?”
[91] “F12. In the past month, have you used protective equipment when caring for a sick person?”
[92] “k_d1”
[93] “k_d2”
[94] “k_d3”
[95] “k_d4”
[96] “k_d5”
[97] “k_d6”
[98] “k_d7”
[99] “k_d8a”
[100] “k_d8b”
[101] “k_d9”
[102] “k_d10”
[103] “k_d11”
[104] “k_d12”
[105] “k_d13”
[106] “k_d14”
[107] “k_d16”
[108] “knowledge_score”
[109] “knowledge_total_max”
[110] “knowledge_pct”
[111] “knowledge_good”
[112] “attitude_score”
[113] “attitude_total_max”
[114] “attitude_pct”
[115] “attitude_positive”
[116] “p_f1”
[117] “p_f2”
[118] “p_f3”
[119] “p_f4”
[120] “p_f5”
[121] “p_f6”
[122] “p_f7”
[123] “p_f8”
[124] “p_f9”
[125] “p_f10”
[126] “p_f11”
[127] “practice_score”
[128] “practice_total_max”
[129] “practice_pct”
[130] “practice_good”
[131] “End of questionnaire. Thank you for your time.”
[132] “A1. Interview ID”
[133] “B8. Monthly household income (ZMW)…133”
[134] “_id”
[135] “_uuid”
[136] “_submission_time”
[137] “_validation_status”
[138] “_notes”
[139] “_status”
[140] “_submitted_by”
[141] “version
[142] “_tags”
[143] “_index”

cols_to_remove <- c( "start" ,"end","deviceid","Do you agree to participate in this survey?","A5.", "Interviewer  Name", "k_d1","k_d2","k_d3","k_d4","k_d5","k_d6","k_d7","k_d8a","k_d8a","k_d8b","k_d9", "k_d10","k_d11","k_d12","k_d13","k_d14","k_d16","knowledge_score","knowledge_total_max","knowledge_pct","knowledge_good","attitude_score","attitude_total_max","attitude_pct","attitude_positive","p_f1","p_f2","p_f3","p_f4","p_f5","p_f6","p_f7","p_f8","p_f9","p_f11","practice_score","practice_total_max","practice_pct","practice_good","End of questionnaire. Thank you for your time.","A1. Interview ID","_submission_time","_validation_status","_notes","_status","_submitted_by","__version__","_tags","_index","Introduction: We are conducting this survey to understand community members’ knowledge, attitudes, and practices regarding Mpox. Your participation is voluntary, your responses will be confidential, and you may stop at any time. You must be 18 years or older and have lived in this area for at least 6 months to participate.","Respondent did not consent. End the interview and thank them.", "_GPS coordinate_altitude", "_GPS coordinate_precision","A5. Interviewer  Name","GPS coordinate","B0. How many months have you lived in this area?","eligible","Not eligible (must be ≥18 years and resident ≥6 months). End interview.")
Mpox<- Mpox[, !names(Mpox) %in% cols_to_remove]

Renaming Columns

Purpose: This step standardizes and simplifies column names in the Mpox dataset for readability and consistency.

Mpox <- Mpox %>% 
  rename(
    Province = `A3a. Province`, 
    District = `A3b. District`, 
    Facility = `A4. Facility`, 
    Latitude = `_GPS coordinate_latitude`, 
    Longitude = `_GPS coordinate_longitude`,
    Sex=`B1. Sex`,Age=`B2. Age in years`,
    Marital_status=`B3. Marital status`,
    Residence=`B4. Residence`,
    Education_level=`B5. Education level`,
    Occupation=`B6. Occupation`,
    Religion=`B7. Religion`,
    Monthly_income=`B8. Monthly household income (ZMW)...26`)

Renaming Columns for Mpox Awareness Sources

Purpose: This step further cleans and standardizes column names, focusing specifically on Mpox awareness and information sources.

Mpox<- Mpox %>% 
  rename(
    heard_of_mpox = `C1. Have you heard of Mpox before today?`,
    mpox_info_radio_tv = `C2. Where did you hear about Mpox? (multiple)/Radio/TV`,
    mpox_info_newspaper = `C2. Where did you hear about Mpox? (multiple)/Newspaper`,
    mpox_info_internet = `C2. Where did you hear about Mpox? (multiple)/Internet/social media`,
    mpox_info_health_worker = `C2. Where did you hear about Mpox? (multiple)/Health worker`,
    mpox_info_friends_family = `C2. Where did you hear about Mpox? (multiple)/Friends/family`,
    mpox_info_posters_meetings = `C2. Where did you hear about Mpox? (multiple)/Posters/community meetings`,
    mpox_info_other = `C2. Where did you hear about Mpox? (multiple)/Other (specify)`)

Renaming Columns for Mpox Knowledge, Transmission, and Symptoms

This chunk is focused on renaming survey columns that cover knowledge about Mpox causes, transmission routes, symptoms, and prevention. It takes the original long survey questions and converts them into short, clear, analysis-ready variable names.

Mpox<- Mpox %>% 
  rename(freq_mpox_info=`C4. How frequently do you receive information about Mpox?`,mpox_cause=`D1. What causes the disease Mpox?`,mpox_transmitted_animal=`D2. How is Mpox transmitted? (multiple)/From animals (such as monkeys, rodents)`,mpox_transmitted_person=`D2. How is Mpox transmitted? (multiple)/From person-to-person through close contact`,mpox_transmitted_sexual=`D2. How is Mpox transmitted? (multiple)/Through sexual contact`,mpox_transmitted_droplets=`D2. How is Mpox transmitted? (multiple)/Through respiratory droplets`,mpox_transmitted_materials=`D2. How is Mpox transmitted? (multiple)/From contaminated materials (clothes, beddings)`,mpox_transmission_Dont_know=`D2. How is Mpox transmitted? (multiple)/Don't know`,mpox_transsmission_direct=`D3. Mpox can be transmitted through… (multiple)/Direct contact with rash/lesions`,mpox_transsmission_respiratory=`D3. Mpox can be transmitted through… (multiple)/Respiratory droplets during close contact`,mpox_transsmission_contact=`D3. Mpox can be transmitted through… (multiple)/Contact with contaminated objects (bedding, clothing)`,mpox_transsmission_bite=`D3. Mpox can be transmitted through… (multiple)/Bite from infected animals`, mpox_transsmission_sex_contact=`D3. Mpox can be transmitted through… (multiple)/Sexual contact`,signs_symptoms=`D4. Common symptoms of Mpox (Yes/No/Don't know)`,
Fever=Fever,Headache=Headache,Swollen_lymph_nodes=`Swollen lymph nodes`,Skin_rash=`Skin rash/lesions`,Back_pain=`Back pain`, Muscle_aches=`Muscle aches`,
Mpox_resemble_chickenpox=`D5. Mpox symptoms can resemble chickenpox`, mpox_vaccine_available=`D6. There is a vaccine that can protect against Mpox.`,vax_mpox_protection=`D7. Being vaccinated  provides some protection against Mpox.`,mpox_seek_care=`D8a. Should someone with suspected Mpox seek care at a health facility?`,mpox_isolate=`D8b. Should someone with suspected Mpox isolate from others?`,handwahsing_prevent=`D9. Does handwashing and cleaning surfaces help prevent Mpox?`,avoiding_contact=`D10. Mpox can be prevented by avoiding contact with wild animals.`,pregant_women=`D11. Pregnant women can transmit Mpox to their unborn child.`,mpox_severe_children=`D12. Mpox can cause serious illness in children and immunocompromised people.`,mpox_incubation=`D13. Mpox incubation period is 5–21 days.`,mpox_spread_surfaces=`D14. Mpox can be spread via contaminated surfaces.`,maternal_foetal=`D16. Maternal-to-foetal transmission is possible.`)

Renaming Columns for Attitudes Toward Mpox

This chunk focuses on renaming survey columns that capture respondents’ attitudes and perceptions about Mpox and public health response.

Mpox<- Mpox %>% 
  rename(community_HCWs=`E1. Community members can work with health workers to reduce Mpox spread.`,awareness_pox=`E2. Awareness programs about Mpox are needed in my district.`,trust_HCWs=`E3. I trust health authorities to manage an Mpox outbreak.`,mpox_vaccination=`E4. I would take an Mpox vaccine if recommended.`,public_health_threat=`E5. Mpox is a serious public health threat.`,concerned=`E6. I am concerned about Mpox affecting my family.`)

Renaming Columns for Practices Toward Mpox

Purpose: This chunk renames survey columns that capture respondents’ actual or intended behaviors (practices) related to Mpox prevention.

Mpox<- Mpox %>% 
  rename(discussed_Mpox_last_30_days=`F1. In the last 30 days, have you discussed Mpox prevention with others?`,visit_clinic_fever_rash=`F2. If you had fever and rash, would you seek care at a health facility immediately?`,avoid_close_contact=`F3. Would you avoid close contact with a person with Mpox symptoms?`,avoid_sharing_contaminated_materials=`F4. Would you avoid sharing bedding/utensils with someone suspected of Mpox?`,avoided_crowded_gatherings=`F5. Have you avoided crowded gatherings during an outbreak?`,get_mpox_vaccine=`F6. Would you accept Mpox vaccination if available?`)

Recoding Occupational Status & Marital Status

This chunk simplifies categorical variables to make analysis and reporting easier, particularly for KAP tables or mentor-ready summaries.

#Occupatonal status 
Mpox <- Mpox %>%
  mutate(
    Occupation = case_when(
      Occupation %in% c("Business/vendor",
                         "Civil servant/professional",
                         "Farmer/manual") ~ "Employed",
      Occupation == "Student" ~ "Student",
      Occupation == "Unemployed" ~ "Unemployed",
      Occupation == "Other" ~ "Unemployed",
      TRUE ~ NA_character_))

#marital stautus 

Mpox<- Mpox%>%
  mutate(
    Marital_status= case_when(
      Marital_status == "Married" ~ "Married",
      TRUE ~ "Unmarried"
    ))

Adding Variable Labels- table 1

This chunk uses the labelled package to assign descriptive labels to variables. These labels help make tables, summaries, and outputs more readable for mentors or reports.

pacman::p_load(labelled)

var_label(Mpox) <- list(Province= "Provinces",District="Districts",Facility="Facilities",Sex="Sex",Age="Age (years)",Marital_status="Marital status",Residence="Residence",Education_level="Education level", Occupation="Occupation",Religion="Religion") 

Labeling Mpox Awareness Variables

This chunk continues adding descriptive labels to your Mpox dataset, now focusing on awareness and information sources.

pacman::p_load(labelled)

var_label(Mpox) <- list(heard_of_mpox="Heard of Mpox before",
  mpox_info_radio_tv = "Radio/TV",
  mpox_info_newspaper = "Newspaper",
  mpox_info_internet = "Internet/social media",
  mpox_info_health_worker = "Health care worker",
  mpox_info_friends_family = "Family/friends",
  mpox_info_posters_meetings = "Posters/Meetings",
  mpox_info_other = "Other"
)

Labeling Knowledge Variables

You are assigning descriptive labels to all variables related to knowledge of Mpox, including causes, transmission, symptoms, prevention, and vaccination.

pacman::p_load(labelled)

var_label(Mpox) <- list(freq_mpox_info="How frequently do you receive information about Mpox?",mpox_cause="What causes the disease Mpox?",mpox_transmitted_animal="How is Mpox transmitted:From animals (such as monkeys, rodents)?", mpox_transmitted_person= "How is Mpox transmitted:From person-to-person through close contact?",mpox_transmitted_sexual="How is Mpox transmitted:Through sexual contact?", mpox_transmitted_droplets="How is Mpox transmitted:Through respiratory droplets?", mpox_transmitted_materials="How is Mpox transmitted:From contaminated materials (clothes, beddings)?",mpox_transmission_Dont_know="How is Mpox transmitted:Don't know? ",mpox_transsmission_direct="Mpox can be transmitted through: Direct contact with rash/lesions?",mpox_transsmission_respiratory="Mpox can be transmitted through:Respiratory droplets during close contact",mpox_transsmission_contact="Mpox can be transmitted through: Contact with contaminated objects (bedding, clothing)",mpox_transsmission_bite="Mpox can be transmitted through:Bite from infected animals",mpox_transsmission_sex_contact="Mpox can be transmitted through:Sexual contact",Fever="Fever",Headache="Headache",Swollen_lymph_nodes="Swollen lymph nodes",Skin_rash="Skin rash/lesions",Back_pain="Back pain", Muscle_aches="Muscle aches", Mpox_resemble_chickenpox="Mpox symptoms can resemble chickenpox?",mpox_vaccine_available="There is a vaccine that can protect against Mpox?",vax_mpox_protection="Being vaccinated  provides some protection against Mpox?",mpox_seek_care="Should someone with suspected Mpox seek care at a health facility?",mpox_isolate="Should someone with suspected Mpox isolate from others?",handwahsing_prevent="Does handwashing and cleaning surfaces help prevent Mpox?",avoiding_contact="Mpox can be prevented by avoiding contact with wild animals?", pregant_women="Pregnant women can transmit Mpox to their unborn child?", mpox_severe_children="Mpox can cause serious illness in children and immunocompromised people?",mpox_incubation="Mpox incubation period is 5–21 days?",mpox_spread_surfaces="Mpox can be spread via contaminated surfaces?",maternal_foetal="Maternal-to-foetal transmission is possible?")

Labeling of variables-attitude questions

pacman::p_load(labelled)

var_label(Mpox) <- list(community_HCWs= "Community members can work with health workers to reduce Mpox spread?",awareness_pox= "Awareness programs about Mpox are needed in my district?",trust_HCWs= "I trust health authorities to manage an Mpox outbreak?",mpox_vaccination= "I would take an Mpox vaccine if recommended?",public_health_threat= "Mpox is a serious public health threat?",concerned= "I am concerned about Mpox affecting my family?")

Labeling of variables-practices questions

pacman::p_load(labelled)

var_label(Mpox) <- list(discussed_Mpox_last_30_days= "Have you discussed Mpox prevention with others in the last 30 days?",visit_clinic_fever_rash= "seek care at a health facility immediately if fever or rash?",avoid_close_contact="avoid close contact with a person with Mpox symptoms?",avoid_sharing_contaminated_materials= "avoid sharing bedding/utensils with someone suspected of Mpox?",avoided_crowded_gatherings= "avoided crowded gatherings during an outbreak?",get_mpox_vaccine=" accept Mpox vaccination if available?")

Assessing the distribution of age and monthly household income

summary(Mpox$Age)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s 18.00 27.00 34.00 37.93 47.25 90.00 3

Filter the NAs for age. these where entered during the practice sessions

Mpox <- Mpox %>% filter(!is.na(Age))

Categorizing age

  • Age was categorized into three groups: 18–34, 35–54, and 55+ years to reflect young, middle-aged, and older adults.
# updated age categories 
Mpox$Age_categories <- cut(
  Mpox$Age,
  breaks = c(18, 35, 55, 85),   # updated boundaries
  labels = c("18–34", "35–54", "55+"),  # new categories
  right = FALSE, include.lowest = TRUE)
Mpox <- Mpox %>% filter(!is.na(Age_categories))
ggplot(Mpox, aes(x = Age)) +
  geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "lightblue") +
  stat_function(fun = dnorm, args = list(mean = mean(Mpox$Age, na.rm = TRUE), sd = sd(Mpox$Age, na.rm = TRUE)), color = "red") +
  labs(title = "Age distribution", x = "Age (Years)", y = "Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# save as an image
ggsave("age_distribution.png", width = 10, height = 6, dpi = 300)
summary(Mpox$Monthly_income)

Min. 1st Qu. Median Mean 3rd Qu. Max. 0 700 1650 2876 3000 400000

Remove some out liers

Mpox <- Mpox[Mpox$Monthly_income != 400000, ]
  • incase of income being reported as continous, check its distribution below
ggplot(Mpox, aes(x =Monthly_income)) +
  geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "lightblue") +
  stat_function(fun = dnorm, args = list(mean = mean(Mpox$Monthly_income, na.rm = TRUE), sd = sd(Mpox$Monthly_income, na.rm = TRUE)), color = "red") +
  labs(title = "Income distribution", x = "Monthly_income(kwacha)", y = "Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# save as an image
ggsave("Monthly_income_distribution.png", width = 10, height = 6, dpi = 300)
  • I decided to report as categorical and therefore i changed below
  • Monthly income was dichotomized as <3000 and ≥3000 to reflect lower- and higher-income respondents.
Mpox$Income_category<- ifelse(
  Mpox$Monthly_income < 3000, 
  "< 3000", 
  "≥ 3000")

Descriptive analysis

This section is building a descriptive summary table of your survey population and stratifying it by whether respondents have heard of Mpox. The goal is to show the distribution of key demographics and compare groups.

# Step 1: Recode outcome with desired order
Mpox <- Mpox %>% 
  mutate(heard_of_mpox = factor(heard_of_mpox, levels = c("Yes", "No"))) # Yes first, No second

# Step 2: Build stratified table (categorical only)
t1 <- Mpox %>% 
  select(heard_of_mpox, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category) %>% 
  tbl_summary(
    by = heard_of_mpox,
    percent = "row",
    missing_text = "Missing",
    label = list(
      Province ~ "Province",
      Age_categories ~ "Age group",
      Sex ~ "Sex",
      Marital_status ~ "Marital status",
      Residence ~ "Residence (Urban/Rural)",
      Education_level  ~ "Education level",
      Occupation ~ "Occupation",
      Income_category ~ "Income category"
    ),
    digits = list(all_categorical() ~ c(0, 1)) # format counts to 0 decimal points and percentages to 1 decimal point
  ) %>% 
  add_overall(last = FALSE) %>% 
  add_p(
    test = list(all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  ) %>% 
  bold_labels() %>% 
  modify_header(label ~ "**Population Characteristics**") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Heard of Mpox**")

print(t1)
Population Characteristics Overall
N = 809
1
Heard of Mpox
p-value2
Yes
N = 372
1
No
N = 437
1
Province


<0.001
    Central 120 (100.0%) 71 (59.2%) 49 (40.8%)
    Copperbelt 120 (100.0%) 58 (48.3%) 62 (51.7%)
    Lusaka 444 (100.0%) 151 (34.0%) 293 (66.0%)
    Muchinga 32 (100.0%) 25 (78.1%) 7 (21.9%)
    Western 93 (100.0%) 67 (72.0%) 26 (28.0%)
Age group


0.335
    18–34 415 (100.0%) 181 (43.6%) 234 (56.4%)
    35–54 282 (100.0%) 139 (49.3%) 143 (50.7%)
    55+ 112 (100.0%) 52 (46.4%) 60 (53.6%)
Sex


<0.001
    Female 457 (100.0%) 236 (51.6%) 221 (48.4%)
    Male 352 (100.0%) 136 (38.6%) 216 (61.4%)
Marital status


0.089
    Married 462 (100.0%) 200 (43.3%) 262 (56.7%)
    Unmarried 347 (100.0%) 172 (49.6%) 175 (50.4%)
Residence (Urban/Rural)


0.422
    Rural 64 (100.0%) 33 (51.6%) 31 (48.4%)
    Urban 745 (100.0%) 339 (45.5%) 406 (54.5%)
Education level


<0.001
    None 51 (100.0%) 20 (39.2%) 31 (60.8%)
    Primary 260 (100.0%) 89 (34.2%) 171 (65.8%)
    Secondary 395 (100.0%) 191 (48.4%) 204 (51.6%)
    Tertiary 103 (100.0%) 72 (69.9%) 31 (30.1%)
Occupation


0.021
    Employed 564 (100.0%) 257 (45.6%) 307 (54.4%)
    Student 28 (100.0%) 20 (71.4%) 8 (28.6%)
    Unemployed 217 (100.0%) 95 (43.8%) 122 (56.2%)
Income category


0.008
    < 3000 542 (100.0%) 231 (42.6%) 311 (57.4%)
    ≥ 3000 267 (100.0%) 141 (52.8%) 126 (47.2%)
1 n (%)
2 Pearson’s Chi-squared test
# Convert gtsummary object to flextable
ft <- as_flex_table(t1)

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(ft)

print(doc, target = "table1.docx")

Ever heard of Mpox

This section creates a pie chart showing the proportion of respondents who have or have not heard of Mpox. It also saves the plot as a PNG for use in reports or presentations.

# Prepare data
heard_df <- as.data.frame(table(Mpox$heard_of_mpox))
colnames(heard_df) <- c("Response", "Count")

# Add percentage and formatted label with 1 decimal place
heard_df$Percent <- heard_df$Count / sum(heard_df$Count) * 100
heard_df$Label <- sprintf("%s\n%d (%.1f%%)", heard_df$Response, heard_df$Count, heard_df$Percent)

# Plot
library(ggplot2)
ggplot(heard_df, aes(x = "", y = Count, fill = Response)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  theme_void() +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 5) +
  labs(title = "") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4"))

# Save the last plot as a PNG
ggsave("heard_of_mpox_pie.png", width = 6, height = 6, dpi = 300)

Sub analysis

I filtered the dataset to include only participants who reported having heard of Mpox, so that subsequent analyses focus on this subgroup with prior awareness.

sub_set <- Mpox %>% filter(heard_of_mpox == "Yes")
sub_set$Age_categories <- cut(
  sub_set$Age,
  breaks = c(18, 35, 55, 85),   # updated boundaries
  labels = c("18–34", "35–54", "55+"),  # new categories
  right = FALSE, include.lowest = TRUE)
sub_set$Income_category<- ifelse(
  sub_set$Monthly_income < 3000, 
  "< 3000", 
  "≥ 3000")

Source of Information on Mpox

I created a new variable, Sources_of_information_on_Mpox, to categorize where participants who had heard of Mpox obtained their information. I grouped responses into main channels such as ‘Health care worker’, ‘Radio/TV’, ‘Internet/social media’, ‘Family/friends’, and a combined ‘Print/community sources’ category for newspapers, posters, meetings, or other sources. This simplification allowed for easier interpretation and comparison of information sources.

sub_set <- sub_set %>%
  mutate(
    Sources_of_information_on_Mpox = case_when(
      mpox_info_health_worker == 1 ~ "Health care worker",
      mpox_info_radio_tv == 1 ~ "Radio/TV",
      mpox_info_internet == 1 ~ "Internet/social media",
      mpox_info_friends_family == 1 ~ "Family/friends",
      
      # Collapsed category
      mpox_info_newspaper == 1 ~ "Print/community sources",
      mpox_info_posters_meetings == 1 ~ "Print/community sources",
      mpox_info_other == 1 ~ "Print/community sources",
      
      TRUE ~ NA_character_
    )
  )

Analyze Sources of Mpox Information by Province

I corrected a minor typo in the Sources_of_information_on_Mpox variable and then organized the data to improve readability. I ordered provinces by the proportion of respondents and information sources by their frequency. Using this ordering, I created a horizontal stacked bar chart showing the distribution of Mpox information sources across provinces. This visualization allowed me to quickly identify which channels were most commonly used. Finally, I exported the chart and embedded it in a Word document for reporting purposes.

# Fix typo in source categories
sub_set <- sub_set %>%
  mutate(Sources_of_information_on_Mpox = case_when(
    Sources_of_information_on_Mpox == "Family/freinds" ~ "Family/friends",
    TRUE ~ Sources_of_information_on_Mpox
  ))

# Order provinces by overall respondent percentage
province_order <- sub_set %>%
  count(Province) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%
  pull(Province)

sub_set$Province <- factor(sub_set$Province, levels = province_order)

# Order sources by frequency
source_order <- sub_set %>%
  count(Sources_of_information_on_Mpox) %>%
  arrange(desc(n)) %>%
  pull(Sources_of_information_on_Mpox)

source_order

[1] “Health care worker” “Radio/TV”
[3] “Internet/social media” “Family/friends”
[5] “Print/community sources”

sub_set <- sub_set |> mutate(Sources_of_information_on_Mpox = factor(Sources_of_information_on_Mpox,
                                                                     levels = c("Health care worker","Radio/TV",
                                                                               "Internet/social media","Family/friends",
                                                                                "Print/community sources")))

library(RColorBrewer)

# Plot chart (provinces only, no "Overall")
chart <- ggplot(sub_set, aes(y = Province, fill = Sources_of_information_on_Mpox)) + 
  geom_bar(position = "fill", color = "#636363") +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  # "Blues" is a common sequential RColorBRewer palette; "Set1" or "Set2" are good for categories 
  # n must match the number of levels in the data (5 in this case)
  scale_fill_manual(values = rev(brewer.pal(5, "Blues")), limits = source_order # Using color brewer pallet with reversed order
    # values = c(
    #   "Health care worker" = "#045a8d",
    #   "Radio/TV" = "#2b8cbe",
    #   "Internet/social media" = "#74a9cf",
    #   "Family/friends" = "#bdc9e1",
    #   "Print/community sources" = "#f1eef6"), #manually specificing the colors
    # limits = source_order  
  ) +
  labs(
    x = "Proportion of respondents",
    y = "Province",
    fill = "Source category"
  ) +
  theme_classic(base_size = 14) +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12, colour = "black"),
    legend.text = element_text(size = 11, colour = "black"),
    axis.text.y = element_text(size = 12, colour = "black"),
    axis.text.x = element_text(size = 11, colour = "black")
  )

# print
print(chart)

# Save the chart
ggsave("mpox_info_sources.png", plot = chart, height = 5.84, width = 13.8, dpi = 300)

# Create Word document and embed the image
doc <- read_docx() %>% 
  body_add_par("Figure: Sources of Mpox Information Among Participants", style = "heading 2") %>% 
  body_add_img(src = "mpox_info_sources.png", width = 6, height = 4)

print(doc, target = "Mpox_Info_Sources.docx")

Create Table of Mpox Information Sources by Province

I summarized the number and proportion of respondents reporting each source of Mpox information, stratified by province. Percentages were calculated for each source within a province and formatted alongside counts for clarity (e.g., n (x%)). Using pivot_wider, I transformed the data into a table format suitable for reporting. Finally, I created a flextable for a polished table layout and exported it to a Word document for inclusion in the report

# Step 1: Summarise counts and percentages
mpox_table <- sub_set %>%
  group_by(Province, Sources_of_information_on_Mpox) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Province) %>%
  mutate(
    total = sum(n),
    percent = round((n / total) * 100, 1),
    label = paste0(n, " (", percent, "%)")
  ) %>%
  select(Province, Sources_of_information_on_Mpox, label, total) %>%
  pivot_wider(
    names_from = Sources_of_information_on_Mpox,
    values_from = label,
    values_fill = list(label = "0 (0%)")   # <-- ensures zeros are shown
  ) %>%
  relocate(total, .after = last_col())     # put totals at the end

# Step 2: Create flextable for Word
ft <- flextable(mpox_table) %>%
  autofit() %>%
  set_header_labels(
    Province = "Province",
    `Health care worker` = "Health care worker",
    `Radio/TV` = "Radio/TV",
    `Internet/social media` = "Internet/social media",
    `Family/friends` = "Family/friends",
    Others = "Others",
    total = "Total (n)"
  )
ft

Province

Health care worker

Radio/TV

Internet/social media

Family/friends

Print/community sources

Total (n)

Lusaka

45 (29.8%)

38 (25.2%)

40 (26.5%)

18 (11.9%)

10 (6.6%)

151

Central

35 (49.3%)

19 (26.8%)

4 (5.6%)

6 (8.5%)

7 (9.9%)

71

Western

39 (58.2%)

14 (20.9%)

6 (9%)

7 (10.4%)

1 (1.5%)

67

Copperbelt

30 (51.7%)

12 (20.7%)

7 (12.1%)

2 (3.4%)

7 (12.1%)

58

Muchinga

19 (76%)

4 (16%)

0 (0%)

0 (0%)

2 (8%)

25

# Step 3: Save as Word document
doc <- read_docx() %>%
  body_add_par("Table: Sources of Mpox information by province", style = "heading 1") %>%
  body_add_flextable(ft)

print(doc, target = "mpox_info_sources_by_province.docx")

Summarize Knowledge of Mpox Transmission

I recoded the transmission variables so that 0 = No and 1 = Yes, ensuring consistency across the dataset. I then selected the key transmission modes—animal-to-human, person-to-person, sexual, droplet, and contaminated materials—and labeled them clearly. Using tbl_summary, I calculated counts and percentages of respondents aware of each mode of Mpox transmission. The resulting summary table was exported as a Word document for reporting purposes.

# Step 1: Recode correctly (0 = No, 1 = Yes)
sub_set <- sub_set %>%
  mutate(
    mpox_transmitted_animal    = factor(mpox_transmitted_animal,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_person    = factor(mpox_transmitted_person,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_sexual    = factor(mpox_transmitted_sexual,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_droplets  = factor(mpox_transmitted_droplets,  levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_materials = factor(mpox_transmitted_materials, levels = c(0,1), labels = c("No","Yes"))
  )

# Step 2: Define variable list and labels
transmission <- c(
  "mpox_transmitted_animal",
  "mpox_transmitted_person",
  "mpox_transmitted_sexual",
  "mpox_transmitted_droplets",
  "mpox_transmitted_materials"
)

transmission_labels <- list(
  mpox_transmitted_animal    = "Animal-to-human",
  mpox_transmitted_person    = "Person-to-person",
  mpox_transmitted_sexual    = "Sexual transmission",
  mpox_transmitted_droplets  = "Droplet transmission",
  mpox_transmitted_materials = "Contaminated materials"
)

# Step 3: Create summary table
kap_table <- sub_set %>%
  select(all_of(transmission)) %>%
  tbl_summary(
    type      = all_categorical() ~ "categorical",
    label     = transmission_labels,
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits    = all_categorical() ~ c(0, 1),
    missing   = "no"   # ensures 0 (0%) is shown instead of blank
  )

# Step 4: Print inline in R Markdown
kap_table
Characteristic N = 3721
Animal-to-human
    No 183 (49.2%)
    Yes 189 (50.8%)
Person-to-person
    No 152 (40.9%)
    Yes 220 (59.1%)
Sexual transmission
    No 251 (67.5%)
    Yes 121 (32.5%)
Droplet transmission
    No 264 (71.0%)
    Yes 108 (29.0%)
Contaminated materials
    No 252 (67.7%)
    Yes 120 (32.3%)
1 n (%)
# Step 5: Export to Word (optional)
ft <- as_flex_table(kap_table)
doc <- read_docx() %>%
  body_add_par("Table: Knowledge of Mpox Transmission Modes", style = "heading 2") %>%
  body_add_flextable(ft)
print(doc, target = "Mpox_Transmission_Table.docx")

Define Transmission Variables and Labels

I created a vector transmission_vars containing the key Mpox transmission variables to streamline subsequent analysis. I also defined transmission_labels to assign clear, descriptive labels to each transmission mode (e.g., Animal-to-human, Person-to-person, Sexual transmission, Droplet transmission, Contaminated materials). This ensures that tables and plots are easily interpretable for stakeholders

transmission_vars <- c(
  "mpox_transmitted_animal",
  "mpox_transmitted_person",
  "mpox_transmitted_sexual",
  "mpox_transmitted_droplets",
  "mpox_transmitted_materials"
)

transmission_labels <- c(
  mpox_transmitted_animal = "Animal-to-human",
  mpox_transmitted_person = "Person-to-person",
  mpox_transmitted_sexual = "Sexual transmission",
  mpox_transmitted_droplets = "Droplet transmission",
  mpox_transmitted_materials = "Contaminated materials"
)

Prepare Data for Knowledge of Transmission Plot

I reshaped the Mpox transmission data into a long format suitable for plotting. For each transmission mode, I calculated the count and percentage of respondents reporting ‘Yes’ or ‘No’. I then created a variable label using our descriptive transmission labels and reordered them by the percentage of respondents answering ‘Yes’, so the plot will display the most recognized transmission modes first.

# Step 1: Build the full dataset
kap_plot_data <- sub_set %>%
  select(all_of(transmission_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = transmission_labels[variable])

# Step 2: Calculate Yes percentages for ordering
yes_order <- kap_plot_data %>%
  filter(response == "Yes") %>%
  group_by(label) %>%
  summarise(yes_percent = sum(percent), .groups = "drop")

# Step 3: Join back and reorder labels
kap_plot_data <- kap_plot_data %>%
  left_join(yes_order, by = "label") %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))

Visualize Knowledge of Mpox Transmission Modes

I created a horizontal stacked bar chart showing the proportion of respondents who answered ‘Yes’ or ‘No’ for each mode of Mpox transmission. The y-axis lists the transmission modes, ordered by the percentage of respondents who correctly recognized them. The bars are color-coded to distinguish ‘Yes’ from ‘No’, and percentages are displayed on the x-axis for clarity

ggplot(kap_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363", width = 0.8) +
  scale_fill_manual(values = c("Yes" = "#1f77b4", "No" = "#bdc9e1")) +
  labs(
    title = "",
    x = "Percentage",
    y = "Modes of transmission",   # ← updated axis label
    fill = "Response"
  ) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_classic(base_size = 14) +
  theme(
    plot.title   = element_text(face = "bold", hjust = 0.5),
    axis.text.y  = element_text(face = "bold"),
    legend.position = "right",
    panel.grid = element_blank()
  )

# Alternative use sjPlot
trans_dat <- sub_set |> dplyr::select(mpox_transmitted_animal, mpox_transmitted_person, mpox_transmitted_sexual,
                            mpox_transmitted_droplets, mpox_transmitted_materials)

# Set theme
set_theme(base = theme_classic())

# plot_stackfrq(trans_dat, 
#               show.total = F,
#               axis.labels = c("",
#                               "",
#                               "",
#                               "",
#                               "",
#                               ""),
#               geom.colors = "Blues")
ggsave("mpox_transmission_chart.png", dpi = 300, width = 8, height = 5)

Summarize Knowledge of Mpox Signs and Symptoms

I recoded respondents’ answers for common Mpox symptoms into ‘Yes’ or ‘No’ categories, treating ‘Don’t know’ as ‘No’. Then, I created a summary table showing counts and percentages of respondents recognizing each symptom. This table helps identify which symptoms are well-known and which may require additional awareness efforts. Finally, I exported the table to Word for reporting.

# Define symptom variables (removed Mpox_resemble_chickenpox)
symptoms <- c("Fever", "Headache", "Swollen_lymph_nodes",
              "Skin_rash", "Back_pain", "Muscle_aches")

# Recode variables
sub_set <- sub_set %>%
  mutate(across(all_of(symptoms),
                ~ factor(ifelse(.x %in% c("Don't know", "No"), "No", "Yes"),
                         levels = c("No", "Yes"))))

# Create the summary table with a caption
symptoms_tbl <- sub_set %>%
  select(all_of(symptoms)) %>%
  tbl_summary(
    type = all_categorical() ~ "categorical",
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits = all_categorical() ~ c(0, 1),
    label = list(
      Fever = "Fever",
      Headache = "Headache",
      Swollen_lymph_nodes = "Swollen lymph nodes",
      Skin_rash = "Skin rash/lesions",
      Back_pain = "Back pain",
      Muscle_aches = "Muscle aches"
    ),
    missing = "no"
  ) %>%
  modify_caption("**Knowledge of Mpox Signs and Symptoms**")

print(symptoms_tbl)
Knowledge of Mpox Signs and Symptoms
Characteristic N = 3721
Fever
    No 119 (32.0%)
    Yes 253 (68.0%)
Headache
    No 147 (39.5%)
    Yes 225 (60.5%)
Swollen lymph nodes
    No 195 (52.4%)
    Yes 177 (47.6%)
Skin rash/lesions
    No 71 (19.1%)
    Yes 301 (80.9%)
Back pain
    No 243 (65.3%)
    Yes 129 (34.7%)
Muscle aches
    No 203 (54.6%)
    Yes 169 (45.4%)
1 n (%)
# Save to Word
doc <- read_docx() %>%
  body_add_par("Table 1. Knowledge of Mpox Signs and Symptoms", style = "heading 2") %>%
  body_add_flextable(value = as_flex_table(symptoms_tbl))

print(doc, target = "symptoms.docx")

Visualize Knowledge of Mpox Symptoms

I created a bar chart showing the percentage of respondents who correctly recognized each Mpox symptom. Responses of ‘No’ and ‘Don’t know’ were combined to simplify interpretation. The symptoms were ordered from most to least recognized based on the proportion of ‘Yes’ responses. This visual highlights which symptoms are widely known and where awareness campaigns may need to focus.

# Define the six symptoms
symptoms <- c("Fever", "Headache", "Swollen_lymph_nodes",
              "Skin_rash", "Back_pain", "Muscle_aches")

# Clean labels for plotting
symptom_labels <- c(
  Fever = "Fever",
  Headache = "Headache",
  Swollen_lymph_nodes = "Swollen lymph nodes",
  Skin_rash = "Skin rash/lesions",
  Back_pain = "Back pain",
  Muscle_aches = "Muscle aches"
)

# Recode Yes/No responses (combine No + Don't know)
sub_set <- sub_set %>%
  mutate(across(
    all_of(symptoms),
    ~ factor(
        ifelse(.x %in% c("Don't know", "No"), "No / Don't know", "Yes"),
        levels = c("No / Don't know", "Yes")
      )
  ))



# new and ordered by percentage of yes

symptom_plot_data <- sub_set %>%
  select(all_of(symptoms)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = symptom_labels[variable]) %>%
  # calculate Yes percentage per symptom
  group_by(label) %>%
  mutate(yes_percent = ifelse(response == "Yes", percent, 0)) %>%
  summarise(yes_percent = sum(yes_percent), .groups = "drop") %>%
  right_join(
    sub_set %>%
      select(all_of(symptoms)) %>%
      pivot_longer(cols = everything(),
                   names_to = "variable",
                   values_to = "response") %>%
      group_by(variable, response) %>%
      summarise(n = n(), .groups = "drop") %>%
      group_by(variable) %>%
      mutate(percent = round(n / sum(n) * 100, 1)) %>%
      ungroup() %>%
      mutate(label = symptom_labels[variable]),
    by = "label"
  ) %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))


# Plot
ggplot(symptom_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363") +
  scale_fill_manual(
    values = c("Yes" = "#1f77b4", "No / Don't know" = "#bdc9e1")
  ) +
  labs(
    title = "",
    x = "Percentage",
    y = NULL,
    fill = "Response"
  ) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.y = element_text(face = "bold"),
    legend.position = "right",
    legend.justification = c(1, 1),
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

`

ggsave("mpox_symptoms_chart_selected.png", dpi = 300, width = 8, height = 5)

Summarize Knowledge on Mpox Prevention

I focused on prevention-related variables, including awareness of the Mpox vaccine, protective behaviors (e.g., handwashing, isolation), and knowledge of transmission risks. Responses of ‘No’ and ‘Don’t know’ were combined to simplify interpretation. I then created a summary table showing the counts and percentages of participants who correctly recognized each preventive measure. This highlights areas where community knowledge is strong versus where targeted education may be needed

# Define prevention-focused variables
prevention_vars <- c(
  "mpox_vaccine_available",
  "vax_mpox_protection",
  "mpox_seek_care",
  "mpox_isolate",
  "handwahsing_prevent",   # spelling check
  "avoiding_contact",
  "pregant_women",         # spelling check
  "mpox_severe_children",
  "mpox_spread_surfaces"
)

# Recode Yes/No responses
sub_set <- sub_set %>%
  mutate(across(all_of(prevention_vars),
                ~ factor(ifelse(.x %in% c("Don't know", "No"), "No / Don't know", "Yes"),
                         levels = c("No / Don't know", "Yes"))))

# Create the summary table
prevention_tbl <- sub_set %>%
  select(all_of(prevention_vars)) %>%
  tbl_summary(
    type = all_categorical() ~ "categorical",
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits = all_categorical() ~ c(0, 1),
    label = list(
      mpox_vaccine_available = "Vaccine available for Mpox?",
      vax_mpox_protection    = "Vaccination protects against Mpox?",
      mpox_seek_care         = "Suspected Mpox: seek health care?",
      mpox_isolate           = "Suspected Mpox: isolate from others?",
      handwahsing_prevent    = "Handwashing/surface cleaning prevents Mpox?",
      avoiding_contact       = "Avoiding wild animals prevents Mpox?",
      pregant_women          = "Pregnant women can transmit Mpox to unborn child?",
      mpox_severe_children   = "Mpox is severe in children/immunocompromised?",
      mpox_spread_surfaces   = "Mpox spreads via contaminated surfaces?"
    ),
    missing = "no"
  ) %>%
  modify_caption("**Knowledge of Mpox Prevention**")

print(prevention_tbl)
Knowledge of Mpox Prevention
Characteristic N = 3721
Vaccine available for Mpox?
    No / Don’t know 210 (56.5%)
    Yes 162 (43.5%)
Vaccination protects against Mpox?
    No / Don’t know 127 (34.1%)
    Yes 245 (65.9%)
Suspected Mpox: seek health care?
    No / Don’t know 38 (10.2%)
    Yes 334 (89.8%)
Suspected Mpox: isolate from others?
    No / Don’t know 75 (20.2%)
    Yes 297 (79.8%)
Handwashing/surface cleaning prevents Mpox?
    No / Don’t know 110 (29.6%)
    Yes 262 (70.4%)
Avoiding wild animals prevents Mpox?
    No / Don’t know 108 (29.0%)
    Yes 264 (71.0%)
Pregnant women can transmit Mpox to unborn child?
    No / Don’t know 221 (59.4%)
    Yes 151 (40.6%)
Mpox is severe in children/immunocompromised?
    No / Don’t know 123 (33.1%)
    Yes 249 (66.9%)
Mpox spreads via contaminated surfaces?
    No / Don’t know 133 (35.8%)
    Yes 239 (64.2%)
1 n (%)
# Save to Word
doc <- read_docx() %>%
  body_add_par("Table 2. Knowledge of Mpox Prevention", style = "heading 2") %>%
  body_add_flextable(value = as_flex_table(prevention_tbl))

print(doc, target = "prevention.docx")

Visualizing Knowledge of Mpox Prevention

I created a stacked bar chart to show participants’ knowledge of Mpox prevention. First, I cleaned the variable labels for clarity. Then, I reshaped the dataset to calculate the percentage of respondents answering ‘Yes’ versus ‘No / Don’t know’ for each prevention measure. Finally, I ordered the prevention measures by the proportion of ‘Yes’ responses and plotted them as horizontal stacked bars with percentages labeled, allowing a clear visual comparison of awareness across different preventive actions

# Clean labels for plotting
prevention_labels <- c(
  mpox_vaccine_available = "Vaccine available for Mpox?",
  vax_mpox_protection    = "Vaccination protects against Mpox?",
  mpox_seek_care         = "Suspected Mpox: seek health care?",
  mpox_isolate           = "Suspected Mpox: isolate from others?",
  handwahsing_prevent    = "Handwashing/surface cleaning prevents Mpox?",
  avoiding_contact       = "Avoiding wild animals prevents Mpox?",
  pregant_women          = "Pregnant women can transmit Mpox to unborn child?",
  mpox_severe_children   = "Mpox is severe in children/immunocompromised?",
  mpox_spread_surfaces   = "Mpox spreads via contaminated surfaces?"
)

# Build dataset
prevention_plot_data <- sub_set %>%
  select(all_of(prevention_vars)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = prevention_labels[variable]) %>%
  # calculate Yes percentage per variable
  group_by(label) %>%
  mutate(yes_percent = ifelse(response == "Yes", percent, 0)) %>%
  summarise(yes_percent = sum(yes_percent), .groups = "drop") %>%
  right_join(
    sub_set %>%
      select(all_of(prevention_vars)) %>%
      pivot_longer(cols = everything(),
                   names_to = "variable",
                   values_to = "response") %>%
      group_by(variable, response) %>%
      summarise(n = n(), .groups = "drop") %>%
      group_by(variable) %>%
      mutate(percent = round(n / sum(n) * 100, 1)) %>%
      ungroup() %>%
      mutate(label = prevention_labels[variable]),
    by = "label"
  ) %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))

# Plot stacked bars
ggplot(prevention_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363") +
  geom_text(aes(label = paste0(percent, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Yes" = "#1f77b4", "No / Don't know" = "#bdc9e1")) +
  labs(x = "Percentage", y = NULL, fill = "Response") +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_minimal(base_size = 14) +
  theme(axis.text.y = element_text(face = "bold"),
        legend.position = "right",
        panel.grid = element_blank(),
        axis.line = element_line(color = "black", size = 0.6))

# Save prevention chart to file
ggsave("prevention_chart.png",
       width = 8, height = 6, dpi = 300)   # high resolution for manuscripts

Calculating Overall Knowledge of Mpox

I combined all knowledge-related variables—including symptoms, transmission, and prevention—into a single dataset. Responses were recoded to 0/1 (No = 0, Yes = 1), and a total knowledge score was calculated for each participant. Participants scoring ≥80% of the total possible points were classified as having ‘Good knowledge,’ and others as ‘Poor knowledge.’ I summarized the overall knowledge with counts and percentages, then visualized it as a bar chart showing the proportion of participants with good versus poor knowledge, with labels for clarity.

# --- Step 1: Combine all knowledge variables ---
knowledge_vars <- c(symptoms, transmission, prevention_vars)

# --- Step 2: Recode responses into 0/1 ---
sub_set <- sub_set %>%
  mutate(across(all_of(knowledge_vars),
                ~ ifelse(. == "Yes", 1, 0)))

# --- Step 3: Calculate total score per participant ---
sub_set <- sub_set %>%
  rowwise() %>%
  mutate(total_score = sum(c_across(all_of(knowledge_vars)), na.rm = TRUE)) %>%
  ungroup()

# --- Step 4: Classify knowledge level (≥80% = Good knowledge) ---
sub_set <- sub_set %>%
  mutate(knowledge_level = ifelse(total_score >= 16, "Good knowledge", "Poor knowledge"))

# --- Step 5: Summarize overall knowledge with numerators + percentages ---
overall_knowledge <- sub_set %>%
  count(knowledge_level) %>%
  mutate(percent = round(n / sum(n) * 100, 1),
         label = paste0(n, " (", percent, "%)"))

print(overall_knowledge)

A tibble: 2 × 4

knowledge_level n percent label

1 Good knowledge 95 25.5 95 (25.5%) 2 Poor knowledge 277 74.5 277 (74.5%)

# --- Step 6: Plot bar chart with labels ---
ggplot(overall_knowledge, aes(x = knowledge_level, y = percent, fill = knowledge_level)) +
  geom_bar(stat = "identity", width = 0.6, color = "#636363") +
  geom_text(aes(label = label), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4", "Poor knowledge" = "#bdc9e1")) +
  labs(
    title = "",
    x = NULL,
    y = "Percentage of participants"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), expand = expansion(mult = c(0, 0.1))) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(face = "bold"),
    legend.position = "none",
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

chart without percentage labels

ggplot(overall_knowledge, aes(x = knowledge_level, y = percent, fill = knowledge_level)) +
  geom_bar(stat = "identity", width = 0.6, color = "#636363") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4", "Poor knowledge" = "#bdc9e1")) +
  labs(
    title = "",
    x = NULL,
    y = "Percentage of participants"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), 
                     expand = expansion(mult = c(0, 0.1))) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(face = "bold"),
    legend.position = "none",
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

# Pie chart of overall knowledge (no labels)
ggplot(overall_knowledge, aes(x = "", y = percent, fill = knowledge_level)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4",
                               "Poor knowledge" = "#ff7f0e")) +
  labs(title = "",
       x = NULL, y = NULL, fill = "Knowledge Level") +
  theme_void(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )

# Save the overall knowledge chart with labels
ggsave("overall_knowledge_chart.png",
       width = 8, height = 6, dpi = 300)
# Save the overall knowledge chart with labels
ggsave("overall_knowledge_pie_chart.png",
       width = 8, height = 6, dpi = 300)

creating a column for the outcome

# Create binary variable for regression

sub_set |> tabyl(knowledge_level)

knowledge_level n percent Good knowledge 95 0.2553763 Poor knowledge 277 0.7446237

sub_set <- sub_set %>%
  mutate(knowledge_binary = ifelse(knowledge_level == "Good knowledge", 1, 0))

# Check distribution
table(sub_set$knowledge_binary)

0 1 277 95

sub_set |> tabyl(knowledge_level, knowledge_binary)

knowledge_level 0 1 Good knowledge 0 95 Poor knowledge 277 0

sub_set |> tabyl(knowledge_binary) |> adorn_pct_formatting(affix_sign = T)

knowledge_binary n percent 0 277 74.5% 1 95 25.5%

Stratified Summary of Participant Characteristics by Knowledge Level

I generated a stratified summary table showing participant demographic characteristics (Province, Age group, Sex, Marital status, Residence, Education level, Occupation, and Income category) by knowledge level (‘Good knowledge’ vs. ‘Poor knowledge’). Counts and row percentages were displayed, with chi-square tests for association and p-values formatted to three decimal places. Labels were bolded, headers clarified, and the resulting table was converted to a Word-friendly flextable for reporting

# Step 2: Build stratified table (categorical only)
t2 <- sub_set %>% 
  select(knowledge_level, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category) %>% 
  tbl_summary(
    by = knowledge_level,
    percent = "row",
    missing_text = "Missing",
    label = list(
      Province ~ "Province",
      Age_categories ~ "Age group",
      Sex ~ "Sex",
      Marital_status ~ "Marital status",
      Residence ~ "Residence (Urban/Rural)",
      Education_level ~ "Education level",
      Occupation ~ "Occupation",
      Income_category ~ "Income category"
    ),
    digits = list(all_categorical() ~ c(0, 1)) # format counts to 0 decimal points and percentages to 1 decimal point
  ) %>% 
  add_overall(last = FALSE) %>% 
  add_p(
    test = list(all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  ) %>% 
  bold_labels() %>% 
  modify_header(label ~ "**Population Characteristics**") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Heard of Mpox**")

print(t2)
Population Characteristics Overall
N = 372
1
Heard of Mpox
p-value2
Good knowledge
N = 95
1
Poor knowledge
N = 277
1
Province


<0.001
    Lusaka 151 (100.0%) 22 (14.6%) 129 (85.4%)
    Central 71 (100.0%) 13 (18.3%) 58 (81.7%)
    Western 67 (100.0%) 45 (67.2%) 22 (32.8%)
    Copperbelt 58 (100.0%) 9 (15.5%) 49 (84.5%)
    Muchinga 25 (100.0%) 6 (24.0%) 19 (76.0%)
Age group


0.107
    18–34 181 (100.0%) 45 (24.9%) 136 (75.1%)
    35–54 139 (100.0%) 42 (30.2%) 97 (69.8%)
    55+ 52 (100.0%) 8 (15.4%) 44 (84.6%)
Sex


0.662
    Female 236 (100.0%) 58 (24.6%) 178 (75.4%)
    Male 136 (100.0%) 37 (27.2%) 99 (72.8%)
Marital status


0.117
    Married 200 (100.0%) 44 (22.0%) 156 (78.0%)
    Unmarried 172 (100.0%) 51 (29.7%) 121 (70.3%)
Residence (Urban/Rural)


0.698
    Rural 33 (100.0%) 7 (21.2%) 26 (78.8%)
    Urban 339 (100.0%) 88 (26.0%) 251 (74.0%)
Education level


0.945
    None 20 (100.0%) 4 (20.0%) 16 (80.0%)
    Primary 89 (100.0%) 23 (25.8%) 66 (74.2%)
    Secondary 191 (100.0%) 50 (26.2%) 141 (73.8%)
    Tertiary 72 (100.0%) 18 (25.0%) 54 (75.0%)
Occupation


<0.001
    Employed 257 (100.0%) 80 (31.1%) 177 (68.9%)
    Student 20 (100.0%) 4 (20.0%) 16 (80.0%)
    Unemployed 95 (100.0%) 11 (11.6%) 84 (88.4%)
Income category


0.005
    < 3000 231 (100.0%) 47 (20.3%) 184 (79.7%)
    ≥ 3000 141 (100.0%) 48 (34.0%) 93 (66.0%)
1 n (%)
2 Pearson’s Chi-squared test
# Convert gtsummary object to flextable
ft <- as_flex_table(t2)

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(ft)

print(doc, target = "table1.docx")

merging the outputs

# Multivariable model
mv_model <- glm(
  knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence +Education_level  + Occupation + Income_category+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
) %>%
  tbl_regression(
    exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)
  ) %>%
  bold_p(t = 0.05)

# Merge univariable and multivariable models
merged_table <- tbl_merge(
  tbls = list(uv_model, mv_model),
  tab_spanner = c("**Crude Odds Ratio**", "**Adjusted Odds Ratio**")
)

print(merged_table)
Characteristic
Crude Odds Ratio
Adjusted Odds Ratio
N Odds Ratio 95% CI P-value OR 95% CI p-value
Province 372





    Lusaka


    Central
1.31 0.61, 2.76 0.477 1.25 0.53, 2.90 0.609
    Western
12.0 6.17, 24.2 <0.001 13.5 5.84, 33.2 <0.001
    Copperbelt
1.08 0.44, 2.44 0.863 0.83 0.31, 2.05 0.689
    Muchinga
1.85 0.62, 4.95 0.238 1.45 0.37, 5.37 0.586
Age_categories 372





    55+


    35–54
2.38 1.08, 5.85 0.042 3.25 1.15, 10.2 0.033
    18–34
1.82 0.83, 4.43 0.155 1.95 0.70, 6.00 0.220
Sex 372





    Female


    Male
1.15 0.71, 1.85 0.576 1.58 0.83, 2.99 0.161
Marital status 372





    Married


    Unmarried
1.49 0.94, 2.39 0.092 1.53 0.80, 2.93 0.193
Residence 372





    Urban


    Rural
0.77 0.30, 1.74 0.552 0.40 0.12, 1.19 0.111
Education level 372





    None


    Primary
1.39 0.45, 5.24 0.586 0.92 0.25, 3.97 0.903
    Secondary
1.42 0.49, 5.13 0.549 0.78 0.22, 3.28 0.716
    Tertiary
1.33 0.42, 5.11 0.644 0.85 0.21, 3.89 0.829
Occupation 372





    Employed


    Student
0.55 0.15, 1.56 0.303 0.65 0.16, 2.29 0.522
    Unemployed
0.29 0.14, 0.55 <0.001 0.42 0.18, 0.92 0.038
Income_category 372





    < 3000


    ≥ 3000
2.02 1.26, 3.25 0.004 1.21 0.62, 2.33 0.567
Sources_of_information_on_Mpox 372





    Health care worker


    Radio/TV
0.17 0.07, 0.35 <0.001 0.13 0.05, 0.30 <0.001
    Internet/social media
0.35 0.16, 0.71 0.005 0.37 0.15, 0.87 0.027
    Family/friends
0.33 0.12, 0.79 0.019 0.23 0.07, 0.66 0.010
    Print/community sources
0.06 0.00, 0.28 0.005 0.08 0.00, 0.42 0.017
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
# Convert to flextable using gtsummary's method
ft <- as_flex_table(merged_table)

# Create and save Word document
doc <- read_docx() %>%
  body_add_flextable(value = ft) %>%
  body_add_par("", style = "Normal")  # optional spacing

print(doc, target = "regression_table.docx")

Stepwise Selection

step_dat <- sub_set |> dplyr::select(knowledge_binary, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category, Sources_of_information_on_Mpox, Age)

# Remove any missing observations if any
step_dat <- na.omit(step_dat)

# Label convert to factor
  step_dat <- step_dat %>%
  mutate(across(where(is.character), as.factor))
  
  var_label(step_dat) <- list(Province= "Provinces",Sex="Sex",Age_categories="Age (years)",Marital_status="Marital status", Residence="Residence",Education_level="Education level", Occupation="Occupation", Income_category = "Income category",
                                Sources_of_information_on_Mpox = "Sources of Mpox information", Age = "Age (years)") 
  

# Fit full model
FitAll <- glm(knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence +Education_level  + Occupation + Income_category + Sources_of_information_on_Mpox,
  data = step_dat,
  family = binomial)


  FitAll

Call: glm(formula = knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox, family = binomial, data = step_dat)

Coefficients: (Intercept)
-1.73359
ProvinceCentral
0.22147
ProvinceWestern
2.60127
ProvinceCopperbelt
-0.19007
ProvinceMuchinga
0.36847
Age_categories35–54
1.17771
Age_categories18–34
0.66700
SexMale
0.45547
Marital_statusUnmarried
0.42746
ResidenceRural
-0.91878
Education_levelPrimary
-0.08517
Education_levelSecondary
-0.24703
Education_levelTertiary
-0.15738
OccupationStudent
-0.43082
OccupationUnemployed
-0.86182
Income_category≥ 3000
0.19155
Sources_of_information_on_MpoxRadio/TV
-2.01974
Sources_of_information_on_MpoxInternet/social media
-0.99959
Sources_of_information_on_MpoxFamily/friends
-1.48030
Sources_of_information_on_MpoxPrint/community sources
-2.54225

Degrees of Freedom: 371 Total (i.e. Null); 352 Residual Null Deviance: 422.7 Residual Deviance: 298.1 AIC: 338.1

  library(jtools)
  summ(FitAll, exp = T)
Observations 372
Dependent variable knowledge_binary
Type Generalized linear model
Family binomial
Link logit
χ²(19) 124.60
p 0.00
Pseudo-R² (Cragg-Uhler) 0.42
Pseudo-R² (McFadden) 0.29
AIC 338.12
BIC 416.49
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.18 0.04 0.81 -2.23 0.03
ProvinceCentral 1.25 0.53 2.92 0.51 0.61
ProvinceWestern 13.48 5.67 32.04 5.89 0.00
ProvinceCopperbelt 0.83 0.33 2.10 -0.40 0.69
ProvinceMuchinga 1.45 0.38 5.44 0.54 0.59
Age_categories35–54 3.25 1.10 9.57 2.13 0.03
Age_categories18–34 1.95 0.67 5.65 1.23 0.22
SexMale 1.58 0.83 2.98 1.40 0.16
Marital_statusUnmarried 1.53 0.81 2.92 1.30 0.19
ResidenceRural 0.40 0.13 1.23 -1.59 0.11
Education_levelPrimary 0.92 0.23 3.59 -0.12 0.90
Education_levelSecondary 0.78 0.21 2.95 -0.36 0.72
Education_levelTertiary 0.85 0.21 3.56 -0.22 0.83
OccupationStudent 0.65 0.17 2.43 -0.64 0.52
OccupationUnemployed 0.42 0.19 0.95 -2.08 0.04
Income_category≥ 3000 1.21 0.63 2.33 0.57 0.57
Sources_of_information_on_MpoxRadio/TV 0.13 0.06 0.32 -4.53 0.00
Sources_of_information_on_MpoxInternet/social media 0.37 0.15 0.89 -2.21 0.03
Sources_of_information_on_MpoxFamily/friends 0.23 0.07 0.71 -2.56 0.01
Sources_of_information_on_MpoxPrint/community sources 0.08 0.01 0.63 -2.39 0.02
Standard errors: MLE
  length(coef(FitAll))

[1] 20

  # Daliso's model with prov, occupation and mpox_info
  
  Fitdaliso <- glm(knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox,
  data = step_dat,
  family = binomial)
  
  summ(Fitdaliso, exp = T)
Observations 372
Dependent variable knowledge_binary
Type Generalized linear model
Family binomial
Link logit
χ²(10) 112.59
p 0.00
Pseudo-R² (Cragg-Uhler) 0.38
Pseudo-R² (McFadden) 0.27
AIC 332.13
BIC 375.23
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.54 0.30 0.97 -2.05 0.04
ProvinceCentral 1.08 0.48 2.45 0.18 0.86
ProvinceWestern 10.29 4.79 22.10 5.97 0.00
ProvinceCopperbelt 0.80 0.32 1.97 -0.49 0.62
ProvinceMuchinga 1.06 0.35 3.23 0.11 0.92
OccupationStudent 0.62 0.18 2.15 -0.75 0.46
OccupationUnemployed 0.33 0.15 0.70 -2.88 0.00
Sources_of_information_on_MpoxRadio/TV 0.13 0.05 0.31 -4.62 0.00
Sources_of_information_on_MpoxInternet/social media 0.38 0.16 0.89 -2.22 0.03
Sources_of_information_on_MpoxFamily/friends 0.25 0.08 0.72 -2.54 0.01
Sources_of_information_on_MpoxPrint/community sources 0.08 0.01 0.62 -2.40 0.02
Standard errors: MLE
  formula(Fitdaliso)

knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox

  length(coef(Fitdaliso))

[1] 11

  #FORWARD SELECTION

  FitStart <- glm(knowledge_binary ~ 1,
                  data = step_dat,
                  family = binomial) #begin with model without predictors - intercept only

  summ(FitStart)
Observations 372
Dependent variable knowledge_binary
Type Generalized linear model
Family binomial
Link logit
χ²(0) -0.00
p NA
Pseudo-R² (Cragg-Uhler) 0.00
Pseudo-R² (McFadden) 0.00
AIC 424.71
BIC 428.63
Est. S.E. z val. p
(Intercept) -1.07 0.12 -9.00 0.00
Standard errors: MLE
  Fitforward <- step(FitStart, direction = "forward", scope=formula(FitAll))

Start: AIC=424.71 knowledge_binary ~ 1

                             Df Deviance    AIC
  • Province 4 355.42 365.42
  • Sources_of_information_on_Mpox 4 380.40 390.40
  • Occupation 2 406.87 412.87
  • Income_category 1 414.24 418.24
  • Marital_status 1 419.87 423.87
  • Age_categories 2 417.99 423.99 422.71 424.71
  • Residence 1 422.35 426.35
  • Sex 1 422.40 426.40
  • Education_level 3 422.32 430.32

Step: AIC=365.42 knowledge_binary ~ Province

                             Df Deviance    AIC
  • Sources_of_information_on_Mpox 4 319.74 337.74
  • Residence 1 347.64 359.64
  • Occupation 2 346.42 360.42
  • Age_categories 2 349.13 363.13
  • Income_category 1 352.63 364.63 355.42 365.42
  • Sex 1 353.50 365.50
  • Marital_status 1 354.78 366.78
  • Education_level 3 355.21 371.21

Step: AIC=337.74 knowledge_binary ~ Province + Sources_of_information_on_Mpox

              Df Deviance    AIC
  • Occupation 2 310.13 332.13
  • Residence 1 314.54 334.54
  • Age_categories 2 313.39 335.39
  • Sex 1 316.54 336.54
  • Income_category 1 317.40 337.40 319.74 337.74
  • Marital_status 1 319.42 339.42
  • Education_level 3 319.52 343.52

Step: AIC=332.13 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation

              Df Deviance    AIC
  • Residence 1 306.62 330.62
  • Sex 1 307.61 331.61 310.13 332.13
  • Age_categories 2 306.38 332.38
  • Marital_status 1 309.20 333.20
  • Income_category 1 309.59 333.59
  • Education_level 3 310.00 338.00

Step: AIC=330.62 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence

              Df Deviance    AIC
  • Age_categories 2 302.39 330.39 306.62 330.62
  • Sex 1 304.64 330.64
  • Marital_status 1 305.98 331.98
  • Income_category 1 306.36 332.36
  • Education_level 3 306.42 336.42

Step: AIC=330.39 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories

              Df Deviance    AIC
  • Sex 1 300.15 330.15 302.39 330.39
  • Marital_status 1 300.54 330.54
  • Income_category 1 302.22 332.22
  • Education_level 3 302.31 336.31

Step: AIC=330.15 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

              Df Deviance    AIC

300.15 330.15 + Marital_status 1 298.59 330.59 + Income_category 1 300.02 332.02 + Education_level 3 300.03 336.03

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  #BACKWARD SELECTION

  Fitback <- step(FitAll, direction = "backward", scope=formula(FitAll))

Start: AIC=338.12 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

                             Df Deviance    AIC
  • Education_level 3 298.37 332.37
  • Income_category 1 298.44 336.44
  • Marital_status 1 299.81 337.81
  • Sex 1 300.08 338.08 298.12 338.12
  • Residence 1 300.81 338.81
  • Occupation 2 302.89 338.89
  • Age_categories 2 303.58 339.58
  • Sources_of_information_on_Mpox 4 332.34 364.34
  • Province 4 350.11 382.11

Step: AIC=332.37 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Occupation + Income_category + Sources_of_information_on_Mpox

                             Df Deviance    AIC
  • Income_category 1 298.59 330.59
  • Marital_status 1 300.02 332.02
  • Sex 1 300.28 332.28 298.37 332.37
  • Residence 1 301.04 333.04
  • Occupation 2 303.08 333.08
  • Age_categories 2 303.91 333.91
  • Sources_of_information_on_Mpox 4 333.23 359.23
  • Province 4 350.70 376.70

Step: AIC=330.59 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Occupation + Sources_of_information_on_Mpox

                             Df Deviance    AIC
  • Marital_status 1 300.15 330.15
  • Sex 1 300.54 330.54 298.59 330.59
  • Residence 1 301.49 331.49
  • Occupation 2 304.02 332.02
  • Age_categories 2 304.22 332.22
  • Sources_of_information_on_Mpox 4 334.09 358.09
  • Province 4 352.50 376.50

Step: AIC=330.15 knowledge_binary ~ Province + Age_categories + Sex + Residence + Occupation + Sources_of_information_on_Mpox

                             Df Deviance    AIC

300.15 330.15 - Sex 1 302.39 330.39 - Age_categories 2 304.64 330.64 - Occupation 2 305.06 331.06 - Residence 1 303.45 331.45 - Sources_of_information_on_Mpox 4 335.41 357.41 - Province 4 359.55 381.55

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  #STEWISE SELECTION

  Fitstep <- step(FitStart, direction = "both", scope=formula(FitAll))

Start: AIC=424.71 knowledge_binary ~ 1

                             Df Deviance    AIC
  • Province 4 355.42 365.42
  • Sources_of_information_on_Mpox 4 380.40 390.40
  • Occupation 2 406.87 412.87
  • Income_category 1 414.24 418.24
  • Marital_status 1 419.87 423.87
  • Age_categories 2 417.99 423.99 422.71 424.71
  • Residence 1 422.35 426.35
  • Sex 1 422.40 426.40
  • Education_level 3 422.32 430.32

Step: AIC=365.42 knowledge_binary ~ Province

                             Df Deviance    AIC
  • Sources_of_information_on_Mpox 4 319.74 337.74
  • Residence 1 347.64 359.64
  • Occupation 2 346.42 360.42
  • Age_categories 2 349.13 363.13
  • Income_category 1 352.63 364.63 355.42 365.42
  • Sex 1 353.50 365.50
  • Marital_status 1 354.78 366.78
  • Education_level 3 355.21 371.21
  • Province 4 422.71 424.71

Step: AIC=337.74 knowledge_binary ~ Province + Sources_of_information_on_Mpox

                             Df Deviance    AIC
  • Occupation 2 310.13 332.13
  • Residence 1 314.54 334.54
  • Age_categories 2 313.39 335.39
  • Sex 1 316.54 336.54
  • Income_category 1 317.40 337.40 319.74 337.74
  • Marital_status 1 319.42 339.42
  • Education_level 3 319.52 343.52
  • Sources_of_information_on_Mpox 4 355.42 365.42
  • Province 4 380.40 390.40

Step: AIC=332.13 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation

                             Df Deviance    AIC
  • Residence 1 306.62 330.62
  • Sex 1 307.61 331.61 310.13 332.13
  • Age_categories 2 306.38 332.38
  • Marital_status 1 309.20 333.20
  • Income_category 1 309.59 333.59
  • Occupation 2 319.74 337.74
  • Education_level 3 310.00 338.00
  • Sources_of_information_on_Mpox 4 346.42 360.42
  • Province 4 363.60 377.60

Step: AIC=330.62 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence

                             Df Deviance    AIC
  • Age_categories 2 302.39 330.39 306.62 330.62
  • Sex 1 304.64 330.64
  • Marital_status 1 305.98 331.98
  • Residence 1 310.13 332.13
  • Income_category 1 306.36 332.36
  • Occupation 2 314.54 334.54
  • Education_level 3 306.42 336.42
  • Sources_of_information_on_Mpox 4 340.91 356.91
  • Province 4 362.83 378.83

Step: AIC=330.39 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories

                             Df Deviance    AIC
  • Sex 1 300.15 330.15 302.39 330.39
  • Marital_status 1 300.54 330.54
  • Age_categories 2 306.62 330.62
  • Occupation 2 307.92 331.92
  • Income_category 1 302.22 332.22
  • Residence 1 306.38 332.38
  • Education_level 3 302.31 336.31
  • Sources_of_information_on_Mpox 4 336.33 356.33
  • Province 4 360.44 380.44

Step: AIC=330.15 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

                             Df Deviance    AIC

300.15 330.15 - Sex 1 302.39 330.39 + Marital_status 1 298.59 330.59 - Age_categories 2 304.64 330.64 - Occupation 2 305.06 331.06 - Residence 1 303.45 331.45 + Income_category 1 300.02 332.02 + Education_level 3 300.03 336.03 - Sources_of_information_on_Mpox 4 335.41 357.41 - Province 4 359.55 381.55

  formula(FitAll)

knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

  summ(FitAll, exp = T)
Observations 372
Dependent variable knowledge_binary
Type Generalized linear model
Family binomial
Link logit
χ²(19) 124.60
p 0.00
Pseudo-R² (Cragg-Uhler) 0.42
Pseudo-R² (McFadden) 0.29
AIC 338.12
BIC 416.49
exp(Est.) 2.5% 97.5% z val. p
(Intercept) 0.18 0.04 0.81 -2.23 0.03
ProvinceCentral 1.25 0.53 2.92 0.51 0.61
ProvinceWestern 13.48 5.67 32.04 5.89 0.00
ProvinceCopperbelt 0.83 0.33 2.10 -0.40 0.69
ProvinceMuchinga 1.45 0.38 5.44 0.54 0.59
Age_categories35–54 3.25 1.10 9.57 2.13 0.03
Age_categories18–34 1.95 0.67 5.65 1.23 0.22
SexMale 1.58 0.83 2.98 1.40 0.16
Marital_statusUnmarried 1.53 0.81 2.92 1.30 0.19
ResidenceRural 0.40 0.13 1.23 -1.59 0.11
Education_levelPrimary 0.92 0.23 3.59 -0.12 0.90
Education_levelSecondary 0.78 0.21 2.95 -0.36 0.72
Education_levelTertiary 0.85 0.21 3.56 -0.22 0.83
OccupationStudent 0.65 0.17 2.43 -0.64 0.52
OccupationUnemployed 0.42 0.19 0.95 -2.08 0.04
Income_category≥ 3000 1.21 0.63 2.33 0.57 0.57
Sources_of_information_on_MpoxRadio/TV 0.13 0.06 0.32 -4.53 0.00
Sources_of_information_on_MpoxInternet/social media 0.37 0.15 0.89 -2.21 0.03
Sources_of_information_on_MpoxFamily/friends 0.23 0.07 0.71 -2.56 0.01
Sources_of_information_on_MpoxPrint/community sources 0.08 0.01 0.63 -2.39 0.02
Standard errors: MLE
  length(coef(FitAll))

[1] 20

  # All these converge on the same variables
  formula(Fitback)

knowledge_binary ~ Province + Age_categories + Sex + Residence + Occupation + Sources_of_information_on_Mpox

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  formula(Fitstep)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  length(coef(Fitstep))

[1] 15

  pacman::p_load(stats, pscl, DescTools)

# Log-likelihood, AIC and BIC
    logstats <- rbind(logLik(FitAll),   logLik(Fitstep), logLik(Fitdaliso)) 
    aicstats <- rbind(AIC(FitAll),  AIC(Fitstep), AIC(Fitdaliso)) 
    bicstats <- rbind(BIC(FitAll),  BIC(Fitstep), BIC(Fitdaliso)) 
    
    fitstats1 <- cbind(logstats, aicstats, bicstats)
    rm(logstats, aicstats, bicstats)
    fitstats1
      [,1]     [,2]     [,3]

[1,] -149.0584 338.1168 416.4947 [2,] -150.0774 330.1548 388.9382 [3,] -155.0633 332.1266 375.2344

    rownames(fitstats1) <- c( "Full model", "Stepwise Selection Model", "Daliso's model")
    colnames(fitstats1) <- c("Log-likelihood", "AIC", "BIC")
    formula(FitAll)

knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

    formula(Fitstep)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

    formula(Fitdaliso)

knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox

    fitstats1
                     Log-likelihood      AIC      BIC

Full model -149.0584 338.1168 416.4947 Stepwise Selection Model -150.0774 330.1548 388.9382 Daliso’s model -155.0633 332.1266 375.2344

    # aic1 <- -2(log-lik) + 2*length(coef(FitAll))
    nobs(FitAll)

[1] 372

    # bic1 <-    -2(log-lik) + 2*log(nobs(FitAll))*length(coef(FitAll))

    
  best_mod <- tbl_regression(Fitstep, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)
  
  daliso_mod <- tbl_regression(Fitdaliso, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)
  
  merged_table <- tbl_merge(
  tbls = list(best_mod, daliso_mod),
  tab_spanner = c("**Best fit**", "**Other**")
)
  
  merged_table
Characteristic
Best fit
Other
OR 95% CI p-value OR 95% CI p-value
Provinces





    Lusaka

    Central 1.18 0.51, 2.67 0.697 1.08 0.47, 2.42 0.857
    Western 14.6 6.47, 35.1 <0.001 10.3 4.88, 22.7 <0.001
    Copperbelt 0.81 0.31, 1.97 0.644 0.80 0.31, 1.93 0.623
    Muchinga 1.72 0.47, 5.99 0.401 1.06 0.33, 3.13 0.916
Sources of Mpox information





    Health care worker

    Radio/TV 0.13 0.05, 0.30 <0.001 0.13 0.05, 0.29 <0.001
    Internet/social media 0.38 0.16, 0.89 0.031 0.38 0.16, 0.87 0.027
    Family/friends 0.24 0.07, 0.67 0.011 0.25 0.08, 0.68 0.011
    Print/community sources 0.08 0.00, 0.41 0.016 0.08 0.00, 0.41 0.016
Occupation





    Employed

    Student 0.76 0.19, 2.55 0.671 0.62 0.16, 2.00 0.455
    Unemployed 0.43 0.19, 0.91 0.034 0.33 0.15, 0.68 0.004
Residence





    Urban



    Rural 0.37 0.12, 1.08 0.079


Age (years)





    55+



    35–54 2.80 1.03, 8.53 0.054


    18–34 1.82 0.67, 5.49 0.261


Sex





    Female



    Male 1.61 0.86, 3.03 0.134


Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  # Let's draw an S-curve for a logistic regression. Works best when we've a continuous variable
  #  I've used age as continuous
  
  mv2_mod <- glm(knowledge_binary ~ Province + Sources_of_information_on_Mpox + 
    Occupation + Residence + Age + Sex,
  data = step_dat,
  family = binomial)
  
  tbl_regression(mv2_mod, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)
Characteristic OR 95% CI p-value
Provinces


    Lusaka
    Central 1.10 0.47, 2.48 0.825
    Western 13.0 5.89, 30.5 <0.001
    Copperbelt 0.82 0.32, 2.00 0.670
    Muchinga 1.68 0.46, 5.86 0.421
Sources of Mpox information


    Health care worker
    Radio/TV 0.13 0.05, 0.31 <0.001
    Internet/social media 0.39 0.16, 0.90 0.031
    Family/friends 0.26 0.08, 0.72 0.015
    Print/community sources 0.07 0.00, 0.40 0.015
Occupation


    Employed
    Student 0.63 0.16, 2.09 0.471
    Unemployed 0.36 0.16, 0.76 0.011
Residence


    Urban
    Rural 0.39 0.12, 1.13 0.094
Age (years) 1.00 0.97, 1.02 0.737
Sex


    Female
    Male 1.53 0.83, 2.82 0.175
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  #define new data frame that contains predictor variable
  library(ciTools)
  step_dat <- add_probs(step_dat, mv2_mod, q = 0, comparison = "=")
 
  
 #plot logistic regression curve
ggplot(step_dat, aes(x = pred, y = knowledge_binary)) + 
  geom_point(colour = "dodgerblue4",
             size = 5,
             alpha = 0.8) +
  stat_smooth(method="glm", se=T, method.args = list(family=binomial),
              colour = "darkred",
              size = 1) +
  labs(x = "Predicted probabilities",
       y = "Probability",
       title = "Logistic Regression Curve") +
  theme_classic() +
    theme(
    axis.title = element_text(size = 20, face = "bold"),   # X-axis title
    axis.text.x  = element_text(size = 18, colour = "black"),   
    axis.text  = element_text(size = 18, colour = "black") 
    ) +
  scale_x_continuous(breaks = seq(min(step_dat$Age), 
                                  max(step_dat$Age), 
                                  by = 5)) 

newdat <- step_dat|> dplyr::select(Age, pred) |> mutate(outcome = ifelse(pred <0.5, 0,1)) |> arrange(Age)

investigator led step wise regression

In this study, i only kept variables that were stattistically significant at p < 0.05

I removed variables following this order

  1. education

  2. income

  3. marital status

  4. sex

  5. age

  6. residence

mv_model <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)
Characteristic OR 95% CI p-value
Province


    Lusaka
    Central 1.08 0.47, 2.42 0.857
    Western 10.3 4.88, 22.7 <0.001
    Copperbelt 0.80 0.31, 1.93 0.623
    Muchinga 1.06 0.33, 3.13 0.916
Occupation


    Employed
    Student 0.62 0.16, 2.00 0.455
    Unemployed 0.33 0.15, 0.68 0.004
Sources_of_information_on_Mpox


    Health care worker
    Radio/TV 0.13 0.05, 0.29 <0.001
    Internet/social media 0.38 0.16, 0.87 0.027
    Family/friends 0.25 0.08, 0.68 0.011
    Print/community sources 0.08 0.00, 0.41 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Compare a full model with a model with Residence

# full model 
model1  <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox+Residence,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)
Characteristic OR 95% CI p-value
Province


    Lusaka
    Central 1.08 0.47, 2.42 0.857
    Western 10.3 4.88, 22.7 <0.001
    Copperbelt 0.80 0.31, 1.93 0.623
    Muchinga 1.06 0.33, 3.13 0.916
Occupation


    Employed
    Student 0.62 0.16, 2.00 0.455
    Unemployed 0.33 0.15, 0.68 0.004
Sources_of_information_on_Mpox


    Health care worker
    Radio/TV 0.13 0.05, 0.29 <0.001
    Internet/social media 0.38 0.16, 0.87 0.027
    Family/friends 0.25 0.08, 0.68 0.011
    Print/community sources 0.08 0.00, 0.41 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
model1

Call: glm(formula = knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox + Residence, family = binomial, data = sub_set)

Coefficients: (Intercept)
-0.60079
ProvinceCentral
0.02556
ProvinceWestern
2.47543
ProvinceCopperbelt
-0.27043
ProvinceMuchinga
0.57964
OccupationStudent
-0.40658
OccupationUnemployed
-1.04250
Sources_of_information_on_MpoxRadio/TV
-1.97892
Sources_of_information_on_MpoxInternet/social media
-0.97572
Sources_of_information_on_MpoxFamily/friends
-1.33189
Sources_of_information_on_MpoxPrint/community sources
-2.60987
ResidenceRural
-1.00895

Degrees of Freedom: 371 Total (i.e. Null); 360 Residual Null Deviance: 422.7 Residual Deviance: 306.6 AIC: 330.6

# without residence  
model2 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)
Characteristic OR 95% CI p-value
Province


    Lusaka
    Central 1.08 0.47, 2.42 0.857
    Western 10.3 4.88, 22.7 <0.001
    Copperbelt 0.80 0.31, 1.93 0.623
    Muchinga 1.06 0.33, 3.13 0.916
Occupation


    Employed
    Student 0.62 0.16, 2.00 0.455
    Unemployed 0.33 0.15, 0.68 0.004
Sources_of_information_on_Mpox


    Health care worker
    Radio/TV 0.13 0.05, 0.29 <0.001
    Internet/social media 0.38 0.16, 0.87 0.027
    Family/friends 0.25 0.08, 0.68 0.011
    Print/community sources 0.08 0.00, 0.41 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
model2

Call: glm(formula = knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox, family = binomial, data = sub_set)

Coefficients: (Intercept)
-0.62299
ProvinceCentral
0.07552
ProvinceWestern
2.33071
ProvinceCopperbelt
-0.22756
ProvinceMuchinga
0.05972
OccupationStudent
-0.47158
OccupationUnemployed
-1.11892
Sources_of_information_on_MpoxRadio/TV
-2.03969
Sources_of_information_on_MpoxInternet/social media
-0.96022
Sources_of_information_on_MpoxFamily/friends
-1.40486
Sources_of_information_on_MpoxPrint/community sources
-2.58732

Degrees of Freedom: 371 Total (i.e. Null); 361 Residual Null Deviance: 422.7 Residual Deviance: 310.1 AIC: 332.1

model selection

# --- Full model (with residence) ---
model1 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox+Residence,
  data = sub_set,
  family = binomial)

# Create formatted regression table
model1_tbl <- tbl_regression(
  model1,
  exponentiate = TRUE,   # show Odds Ratios
  pvalue_fun = ~ style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)


# --- Reduced model (without residence) ---
model2 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial)

# Create formatted regression table
model2_tbl <- tbl_regression(
  model2,
  exponentiate = TRUE,
  pvalue_fun = ~ style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)


# --- Model comparison using AIC ---
AIC(model1, model2)
   df      AIC

model1 12 330.6191 model2 11 332.1266

# --- Display regression tables ---
model1_tbl
Characteristic OR 95% CI p-value
Province


    Lusaka
    Central 1.03 0.44, 2.30 0.951
    Western 11.9 5.49, 27.2 <0.001
    Copperbelt 0.76 0.30, 1.84 0.559
    Muchinga 1.79 0.49, 6.18 0.364
Occupation


    Employed
    Student 0.67 0.17, 2.11 0.517
    Unemployed 0.35 0.16, 0.74 0.008
Sources_of_information_on_Mpox


    Health care worker
    Radio/TV 0.14 0.06, 0.31 <0.001
    Internet/social media 0.38 0.15, 0.86 0.025
    Family/friends 0.26 0.08, 0.74 0.016
    Print/community sources 0.07 0.00, 0.41 0.016
Residence


    Urban
    Rural 0.36 0.12, 1.05 0.070
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
model2_tbl
Characteristic OR 95% CI p-value
Province


    Lusaka
    Central 1.08 0.47, 2.42 0.857
    Western 10.3 4.88, 22.7 <0.001
    Copperbelt 0.80 0.31, 1.93 0.623
    Muchinga 1.06 0.33, 3.13 0.916
Occupation


    Employed
    Student 0.62 0.16, 2.00 0.455
    Unemployed 0.33 0.15, 0.68 0.004
Sources_of_information_on_Mpox


    Health care worker
    Radio/TV 0.13 0.05, 0.29 <0.001
    Internet/social media 0.38 0.16, 0.87 0.027
    Family/friends 0.25 0.08, 0.68 0.011
    Print/community sources 0.08 0.00, 0.41 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
# --- Model comparison using BIC ---
BIC(model1, model2)
   df      BIC

model1 12 377.6458 model2 11 375.2344

# Collect AIC and BIC values
model_comp <- data.frame(
  Model = c("Model 1: Full (with residence)", "Model 2: Reduced (without residence)"),
  df    = c(12, 11),
  AIC   = c(330.6191, 332.1266  ),
  BIC   = c(377.6458, 375.2344)
)

# Option 1: Simple table for manuscript (inline in R Markdown/console)
knitr::kable(model_comp, caption = "Comparison of AIC and BIC for Logistic Regression Models")
Comparison of AIC and BIC for Logistic Regression Models
Model df AIC BIC
Model 1: Full (with residence) 12 330.6191 377.6458
Model 2: Reduced (without residence) 11 332.1266 375.2344
# Option 2: Polished table with gt (for HTML/PDF output)
model_comp %>%
  gt() %>%
  tab_header(
    title = "Comparison of Model Fit Statistics",
    subtitle = "Logistic regression predicting overall Mpox knowledge"
  ) %>%
  fmt_number(columns = c(AIC, BIC), decimals = 3)
Comparison of Model Fit Statistics
Logistic regression predicting overall Mpox knowledge
Model df AIC BIC
Model 1: Full (with residence) 12 330.619 377.646
Model 2: Reduced (without residence) 11 332.127 375.234
# --- Export to Word using flextable ---
ft <- flextable(model_comp) %>%
  autofit() %>%
  set_header_labels(
    Model = "Model",
    df    = "Degrees of freedom",
    AIC   = "AIC",
    BIC   = "BIC"
  )

doc <- read_docx() %>%
  body_add_par("Table 4. Comparison of Model Fit Statistics", style = "heading 2") %>%
  body_add_flextable(ft)

print(doc, target = "Model_Fit_Comparison.docx")

merged

# Univariable model
uv_model <- tbl_uvregression(
  data = sub_set,
  y =  knowledge_binary,
  method = glm,
  method.args = list(family = binomial),
  include = c("Province","Occupation", "Sources_of_information_on_Mpox"),
  exponentiate = TRUE
) %>%
  bold_p(t = 0.05) %>%
  modify_header(
    estimate ~ "**UOR**",  # Unadjusted Odds Ratio
    p.value ~ "**P-value**"
  ) %>%
  modify_fmt_fun(p.value ~ function(x) style_pvalue(x, digits = 3)) %>%
  bold_labels()

# Multivariable model
model2 <- glm(
   knowledge_binary ~ Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

mv_model <- tbl_regression(
  model2,
  exponentiate = TRUE,
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05) %>%
  modify_header(estimate ~ "**AOR**") %>%
  bold_labels()

# Merge both models
final_model <- tbl_merge(
  tbls = list(uv_model, mv_model),
  tab_spanner = c("**Univariable Model**", "**Multivariable Model**")
) %>%
  modify_caption("**Factors Associated with Mpox Knowledge (Binary Outcome)**")

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(value = as_flex_table(final_model)) %>%
  body_add_par(" ", style = "Normal")

print(doc, target = "Final_Knowledge_Model.docx")

# Display the table
final_model
Factors Associated with Mpox Knowledge (Binary Outcome)
Characteristic
Univariable Model
Multivariable Model
N UOR 95% CI P-value AOR 95% CI p-value
Province 372





    Lusaka


    Central
1.31 0.61, 2.76 0.477 1.08 0.47, 2.42 0.857
    Western
12.0 6.17, 24.2 <0.001 10.3 4.88, 22.7 <0.001
    Copperbelt
1.08 0.44, 2.44 0.863 0.80 0.31, 1.93 0.623
    Muchinga
1.85 0.62, 4.95 0.238 1.06 0.33, 3.13 0.916
Occupation 372





    Employed


    Student
0.55 0.15, 1.56 0.303 0.62 0.16, 2.00 0.455
    Unemployed
0.29 0.14, 0.55 <0.001 0.33 0.15, 0.68 0.004
Sources_of_information_on_Mpox 372





    Health care worker


    Radio/TV
0.17 0.07, 0.35 <0.001 0.13 0.05, 0.29 <0.001
    Internet/social media
0.35 0.16, 0.71 0.005 0.38 0.16, 0.87 0.027
    Family/friends
0.33 0.12, 0.79 0.019 0.25 0.08, 0.68 0.011
    Print/community sources
0.06 0.00, 0.28 0.005 0.08 0.00, 0.41 0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
  • Based on the BIC, model 2 best predicts the outcome of interest well.

Analysis flow diagram

library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(knitr)

# Build diagram
diagram <- grViz("
digraph flow {
  graph [rankdir = TB, fontsize = 12]
  node [shape = box, style = filled, fillcolor = white, fontname = Helvetica]

  A [label = 'Individuals enrolled and interviewed\\n n = 815']
  B [label = 'Records removed during data cleaning\\n n = 6\\n - Missing critical data (consent/filter question)']
  C [label = 'Final analytic sample for AWARENESS analysis\\n N = 809']
  D [label = 'All participants answered filter question:\\n Have you ever heard of Mpox?']
  E [label = 'YES\\n n = 372']
  F [label = 'NO\\n n = 437']
  G [label = 'Completed full knowledge questionnaire']
  H [label = 'Included only in analyses of\\n awareness prevalence and its correlates']
  I [label = 'Primary analytic sample for KNOWLEDGE analysis\\n n = 372\\n (Assessed for knowledge level, sources of info, and associated factors)']

  A -> B -> C -> D
  D -> E
  D -> F
  E -> G -> I
  F -> H
}
")

# Export static PNG for R Markdown
svg <- export_svg(diagram)
rsvg_png(charToRaw(svg), "flow_diagram.png", width = 2400, height = 3200)

# Include the image in the knitted document
knitr::include_graphics("flow_diagram.png")

Mapping the sources of Mpox information among study participants

# Step 1: Summarise counts and percentages
mpox_map <- sub_set %>%
  group_by(Province, Sources_of_information_on_Mpox) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Province) %>%
  mutate(total = sum(n),
         percent = round((n / total) * 100, 1)) %>%
  ungroup()

# Step 2: Read shapefile
provinces <- st_read("C:/Users/USER1/OneDrive/Documents/Office Work_ ZNPHI/Daliso_Mpox/Data/Shapefiles. Admin boundaries/zmb_admbnda_adm1_dmmu_20201124.shp")

Reading layer zmb_admbnda_adm1_dmmu_20201124' from data sourceC:Work_ ZNPHI_Mpox. Admin boundaries_admbnda_adm1_dmmu_20201124.shp’ using driver `ESRI Shapefile’ Simple feature collection with 10 features and 12 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976 Geodetic CRS: WGS 84

# Step 3: Join shapefile with source data
map_radio <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Radio/TV"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

# Step 4: Plot full Zambia map with proportions shaded
ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +   # base map
  geom_sf(data = map_radio, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_radio, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#FFE5B4", high = "#FF8C00", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Radio/TV")

map_internet <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Internet/social media"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +
  geom_sf(data = map_internet, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_internet, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#D0E6FF", high = "#0073E6", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Internet/Social Media")

  map_family <- provinces %>%
    right_join(filter(mpox_map, Sources_of_information_on_Mpox == "Family/friends"),
               by = c("ADM1_EN" = "Province")) %>%
    mutate(percent = ifelse(is.na(percent), 0, percent))
  
  ggplot() +
    geom_sf(data = provinces, fill = "grey95", color = "black") +
    geom_sf(data = map_family, aes(fill = percent), color = "black") +
    geom_sf_text(data = filter(map_family, percent > 0),
                 aes(label = paste0(percent, "%")), size = 3) +
    scale_fill_gradient(low = "#FAD0E6", high = "#E7298A", na.value = "grey95") +
    annotation_scale(location = "bl", width_hint = 0.3) +
    annotation_north_arrow(location = "tl", which_north = "true",
                           style = north_arrow_fancy_orienteering) +
    theme_minimal() +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_blank()) +
    labs(fill = "Proportion (%)",
         title = "Proportion of Mpox Information from Family/Friends")

# Example for Health care worker
map_health <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Health care worker"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +
  geom_sf(data = map_health, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_health, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#E0F8E0", high = "#00B050", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Health Care Workers")

study seeting map

# Load required packages
library(sf)        # spatial data
library(ggplot2)   # plotting
library(dplyr)     # data wrangling
library(ggspatial) # scale bar, north arrow

# 1. Read Zambia ADM2 shapefile
zmb_adm2 <- st_read("C:/Users/USER1/OneDrive/Documents/Office Work_ ZNPHI/Daliso_Mpox/Data/Shapefiles. Admin boundaries/zmb_admbnda_adm2_dmmu_20201124.shp")

Reading layer zmb_admbnda_adm2_dmmu_20201124' from data sourceC:Work_ ZNPHI_Mpox. Admin boundaries_admbnda_adm2_dmmu_20201124.shp’ using driver `ESRI Shapefile’ Simple feature collection with 115 features and 14 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976 Geodetic CRS: WGS 84

# 2. Define sampled districts
district_sampled <- data.frame(
  District = c("Chilanga","Kabwe","Lusaka","Mkushi","Mongu","Mpika","Ndola","Senanga"),
  sampled = c("Chilanga","Kabwe","Lusaka","Mkushi","Mongu","Mpika","Ndola","Senanga")
)

# 3. Join sampled districts to shapefile
zmb_adm2_flag <- zmb_adm2 %>%
  left_join(district_sampled, by = c("ADM2_EN" = "District"))

# 4. Plot map
ggplot() +
  # Plain background for non-sampled districts
  geom_sf(data = zmb_adm2_flag %>% filter(is.na(sampled)),
          fill = "white", color = "grey80", size = 0.2) +
  
  # Highlight sampled districts in distinct colors
  geom_sf(data = zmb_adm2_flag %>% filter(!is.na(sampled)),
          aes(fill = sampled), color = "black", size = 0.3) +
  
  # Assign colors for each sampled district
  scale_fill_manual(values = c(
    "Chilanga" = "red",
    "Kabwe"    = "blue",
    "Lusaka"   = "green",
    "Mkushi"   = "purple",
    "Mongu"    = "orange",
    "Mpika"    = "brown",
    "Ndola"    = "pink",
    "Senanga"  = "cyan"
  )) +
  
  # Standard map elements
  annotation_scale(location = "bl", width_hint = 0.5) +
  annotation_north_arrow(location = "tl", style = north_arrow_fancy_orienteering) +
  theme_void() +   # removes grid lines and axes
  labs(title = "",
       subtitle = "",
       fill = "Sampled Districts")

# 5. Export high-resolution image
ggsave("study_area_map_multicolor.png", width = 8, height = 6, dpi = 300)