# Hide all code but keep outputs 
knitr::opts_chunk$set( message = FALSE, warning = FALSE,results = 'asis' )

Executive Summary

This report documents the end-to-end analytical workflow implemented in R for a cross-sectional study assessing community knowledge of Mpox in five provinces of Zambia. The analysis formed the basis of a manuscript currently under peer review; however, this document focuses specifically on demonstrating intermediate-level competencies in data management, statistical analysis, visualization, and reproducible reporting using R.

The project involved importing raw survey data from Excel, performing structured data cleaning, re-coding categorical variables, engineering derived variables (including age categories, income groups, and occupation classifications), and constructing a composite knowledge score using Bloom’s cut-off criteria. Analysis was restricted to participants who had heard of Mpox, and a binary outcome variable was created to classify knowledge as good or poor.

Exploratory data analysis was conducted using the tidyverse framework, including summary statistics, frequency distributions, and graphical visualizations to assess distributions and identify outliers. Categorical associations were evaluated using Chi-square tests, with formatted summary tables generated using the gtsummary package and exported to Word.

To assess factors associated with good knowledge, uni-variable and multivariable logistic regression models were fitted using generalized linear models with a binomial link function. Odds ratios and 95% confidence intervals were obtained through coefficient exponentiation. Model selection and refinement were guided by theoretical relevance and comparison of Akaike and Bayesian Information Criteria (AIC and BIC), demonstrating structured model evaluation and parsimony.

Advanced features of R were also utilized, including:

Automated generation of publication-ready regression tables
Export of high-resolution figures
Creation of a workflow diagram
Spatial visualization using shapefiles and the sf package

All analyses were conducted within an R Markdown environment to ensure full reproducibility of outputs, including tables, figures, and model results.

This project demonstrates applied proficiency in data wrangling, variable engineering, regression modeling, model comparison, spatial analysis, and reproducible reporting -reflecting intermediate-level competence in R for public health data analysis.

Analytical Environment Setup and Package Management

The analytical environment was configured by loading packages required for data import, manipulation, visualization, statistical modeling, and reproducible reporting. The tidyverse framework was used for data wrangling and graphical analysis, while additional packages supported regression diagnostics, formatted table generation, and automated export of results. Organizing package loading at the beginning of the document enhances reproducibility and transparency.

library(tidyverse)      # Includes dplyr, ggplot2, tidyr, forcats, etc.
library(readxl)         # Reading Excel files
library(gtsummary)      # Summary tables
library(flextable)      # Table formatting for Word
library(officer)        # Export to Word
library(scales)         # Percent formatting, axis scaling
library(janitor)        # Cleaning column names
library(gt)             # Table formatting
library(sjPlot)         # Data visualization of tables
library(sjmisc)         # Data manipulation utilities
library(pROC)           # ROC curves (if used)
library(elrm)           # Exact logistic regression (if used)
library(logistf)        # Firth logistic regression
library(knitr)          # Printing tables in RMarkdown
library(car)            # Recoding, regression utilities
library(DiagrammeR)     # Diagrams/flowcharts
library(DiagrammeRsvg)  # Export DiagrammeR to SVG
library(rsvg)           # Render SVG images
library(sf)             # Spatial data handling
library(ggspatial)      # Spatial visualization in ggplot2
library(here)

Data Import and Initial Cleaning

Raw survey data exported from the electronic data collection platform contained system-generated metadata and label-heavy column names. These were programmatically cleaned to ensure analysis-ready variable names and improve reproducibility.

# Suppress warnings and messages when reading Excel file Corrected version

Mpox <- suppressWarnings(
  suppressMessages(
    readxl::read_excel("Data/KAP_Mpox_2025_-_all_versions_-_labels_-_2025-10-08-18-00-59.xlsx")
  )
)

Remove some columns

Purpose: This step identifies all survey and metadata columns that are not needed for analysis. By defining cols_to_remove, we prepare the dataset for cleaner, focused analysis.

names(Mpox)

[1] “start”
[2] “end”
[3] “deviceid”
[4] “Introduction: We are conducting this survey to understand community members’ knowledge, attitudes, and practices regarding Mpox. Your participation is voluntary, your responses will be confidential, and you may stop at any time. You must be 18 years or older and have lived in this area for at least 6 months to participate.” [5] “Do you agree to participate in this survey?”
[6] “Respondent did not consent. End the interview and thank them.”
[7] “A2. Date (YYYY-MM-DD)”
[8] “A3a. Province”
[9] “A3b. District”
[10] “A4. Facility”
[11] “A5. Interviewer Name”
[12] “GPS coordinate”
[13] “_GPS coordinate_latitude”
[14] “_GPS coordinate_longitude”
[15] “_GPS coordinate_altitude”
[16] “_GPS coordinate_precision”
[17] “B0. How many months have you lived in this area?”
[18] “B1. Sex”
[19] “B2. Age in years”
[20] “B3. Marital status”
[21] “B4. Residence”
[22] “B5. Education level”
[23] “B6. Occupation”
[24] “Specify occupation (Other)”
[25] “B7. Religion”
[26] “B8. Monthly household income (ZMW)…26”
[27] “eligible”
[28] “Not eligible (must be ≥18 years and resident ≥6 months). End interview.”
[29] “C1. Have you heard of Mpox before today?”
[30] “C2. Where did you hear about Mpox? (multiple)/Radio/TV”
[31] “C2. Where did you hear about Mpox? (multiple)/Newspaper”
[32] “C2. Where did you hear about Mpox? (multiple)/Internet/social media”
[33] “C2. Where did you hear about Mpox? (multiple)/Health worker”
[34] “C2. Where did you hear about Mpox? (multiple)/Friends/family”
[35] “C2. Where did you hear about Mpox? (multiple)/Posters/community meetings”
[36] “C2. Where did you hear about Mpox? (multiple)/Other (specify)”
[37] “Specify other source”
[38] “C3. Which source do you trust the most?”
[39] “Specify other trusted source”
[40] “C4. How frequently do you receive information about Mpox?”
[41] “D1. What causes the disease Mpox?”
[42] “D2. How is Mpox transmitted? (multiple)/From animals (such as monkeys, rodents)”
[43] “D2. How is Mpox transmitted? (multiple)/From person-to-person through close contact”
[44] “D2. How is Mpox transmitted? (multiple)/Through sexual contact”
[45] “D2. How is Mpox transmitted? (multiple)/Through respiratory droplets”
[46] “D2. How is Mpox transmitted? (multiple)/From contaminated materials (clothes, beddings)”
[47] “D2. How is Mpox transmitted? (multiple)/Don’t know”
[48] “D3. Mpox can be transmitted through… (multiple)/Direct contact with rash/lesions”
[49] “D3. Mpox can be transmitted through… (multiple)/Respiratory droplets during close contact”
[50] “D3. Mpox can be transmitted through… (multiple)/Contact with contaminated objects (bedding, clothing)”
[51] “D3. Mpox can be transmitted through… (multiple)/Bite from infected animals”
[52] “D3. Mpox can be transmitted through… (multiple)/Sexual contact”
[53] “D4. Common symptoms of Mpox (Yes/No/Don’t know)”
[54] “Fever”
[55] “Headache”
[56] “Swollen lymph nodes”
[57] “Skin rash/lesions”
[58] “Back pain”
[59] “Muscle aches”
[60] “D5. Mpox symptoms can resemble chickenpox”
[61] “D6. There is a vaccine that can protect against Mpox.”
[62] “D7. Being vaccinated provides some protection against Mpox.”
[63] “D8a. Should someone with suspected Mpox seek care at a health facility?”
[64] “D8b. Should someone with suspected Mpox isolate from others?”
[65] “D9. Does handwashing and cleaning surfaces help prevent Mpox?”
[66] “D10. Mpox can be prevented by avoiding contact with wild animals.”
[67] “D11. Pregnant women can transmit Mpox to their unborn child.”
[68] “D12. Mpox can cause serious illness in children and immunocompromised people.”
[69] “D13. Mpox incubation period is 5–21 days.”
[70] “D14. Mpox can be spread via contaminated surfaces.”
[71] “D16. Maternal-to-foetal transmission is possible.”
[72] “E1. Community members can work with health workers to reduce Mpox spread.”
[73] “E2. Awareness programs about Mpox are needed in my district.”
[74] “E3. I trust health authorities to manage an Mpox outbreak.”
[75] “E4. I would take an Mpox vaccine if recommended.”
[76] “E5. Mpox is a serious public health threat.”
[77] “E6. I am concerned about Mpox affecting my family.”
[78] “E7. Mpox is more dangerous than other diseases in my community.”
[79] “E8. I believe I can protect myself from Mpox if I follow recommended measures.”
[80] “F1. In the last 30 days, have you discussed Mpox prevention with others?”
[81] “F2. If you had fever and rash, would you seek care at a health facility immediately?”
[82] “F3. Would you avoid close contact with a person with Mpox symptoms?”
[83] “F4. Would you avoid sharing bedding/utensils with someone suspected of Mpox?”
[84] “F5. Have you avoided crowded gatherings during an outbreak?”
[85] “F6. Would you accept Mpox vaccination if available?”
[86] “F7. In the past month, have you reported a suspected Mpox case?”
[87] “F8. In the past month, have you avoided contact with wild or sick animals?”
[88] “F9. In the past month, have you handled bushmeat?”
[89] “F10. In the past month, have you attended large community gatherings during an outbreak?”
[90] “F11. In the past month, have you ever cared for some with Mpox ?”
[91] “F12. In the past month, have you used protective equipment when caring for a sick person?”
[92] “k_d1”
[93] “k_d2”
[94] “k_d3”
[95] “k_d4”
[96] “k_d5”
[97] “k_d6”
[98] “k_d7”
[99] “k_d8a”
[100] “k_d8b”
[101] “k_d9”
[102] “k_d10”
[103] “k_d11”
[104] “k_d12”
[105] “k_d13”
[106] “k_d14”
[107] “k_d16”
[108] “knowledge_score”
[109] “knowledge_total_max”
[110] “knowledge_pct”
[111] “knowledge_good”
[112] “attitude_score”
[113] “attitude_total_max”
[114] “attitude_pct”
[115] “attitude_positive”
[116] “p_f1”
[117] “p_f2”
[118] “p_f3”
[119] “p_f4”
[120] “p_f5”
[121] “p_f6”
[122] “p_f7”
[123] “p_f8”
[124] “p_f9”
[125] “p_f10”
[126] “p_f11”
[127] “practice_score”
[128] “practice_total_max”
[129] “practice_pct”
[130] “practice_good”
[131] “End of questionnaire. Thank you for your time.”
[132] “A1. Interview ID”
[133] “B8. Monthly household income (ZMW)…133”
[134] “_id”
[135] “_uuid”
[136] “_submission_time”
[137] “_validation_status”
[138] “_notes”
[139] “_status”
[140] “_submitted_by”
[141] “version”
[142] “_tags”
[143] “_index”

cols_to_remove <- c( "start" ,"end","deviceid","Do you agree to participate in this survey?","A5.", "Interviewer  Name", "k_d1","k_d2","k_d3","k_d4","k_d5","k_d6","k_d7","k_d8a","k_d8a","k_d8b","k_d9", "k_d10","k_d11","k_d12","k_d13","k_d14","k_d16","knowledge_score","knowledge_total_max","knowledge_pct","knowledge_good","attitude_score","attitude_total_max","attitude_pct","attitude_positive","p_f1","p_f2","p_f3","p_f4","p_f5","p_f6","p_f7","p_f8","p_f9","p_f11","practice_score","practice_total_max","practice_pct","practice_good","End of questionnaire. Thank you for your time.","A1. Interview ID","_submission_time","_validation_status","_notes","_status","_submitted_by","__version__","_tags","_index","Introduction: We are conducting this survey to understand community members’ knowledge, attitudes, and practices regarding Mpox. Your participation is voluntary, your responses will be confidential, and you may stop at any time. You must be 18 years or older and have lived in this area for at least 6 months to participate.","Respondent did not consent. End the interview and thank them.", "_GPS coordinate_altitude", "_GPS coordinate_precision","A5. Interviewer  Name","GPS coordinate","B0. How many months have you lived in this area?","eligible","Not eligible (must be ≥18 years and resident ≥6 months). End interview.")

Mpox<- Mpox[, !names(Mpox) %in% cols_to_remove]

Renaming Columns

Purpose: This step standardizes and simplifies column names in the Mpox dataset for readability and consistency.

Mpox <- Mpox %>% 
  rename(
    Province = `A3a. Province`, 
    District = `A3b. District`, 
    Facility = `A4. Facility`, 
    Latitude = `_GPS coordinate_latitude`, 
    Longitude = `_GPS coordinate_longitude`,
    Sex=`B1. Sex`,Age=`B2. Age in years`,
    Marital_status=`B3. Marital status`,
    Residence=`B4. Residence`,
    Education_level=`B5. Education level`,
    Occupation=`B6. Occupation`,
    Religion=`B7. Religion`,
    Monthly_income=`B8. Monthly household income (ZMW)...26`)

Renaming Columns for Mpox Awareness Sources

Purpose: This step further cleans and standardizes column names, focusing specifically on Mpox awareness and information sources.

Mpox<- Mpox %>% 
  rename(
    heard_of_mpox = `C1. Have you heard of Mpox before today?`,
    mpox_info_radio_tv = `C2. Where did you hear about Mpox? (multiple)/Radio/TV`,
    mpox_info_newspaper = `C2. Where did you hear about Mpox? (multiple)/Newspaper`,
    mpox_info_internet = `C2. Where did you hear about Mpox? (multiple)/Internet/social media`,
    mpox_info_health_worker = `C2. Where did you hear about Mpox? (multiple)/Health worker`,
    mpox_info_friends_family = `C2. Where did you hear about Mpox? (multiple)/Friends/family`,
    mpox_info_posters_meetings = `C2. Where did you hear about Mpox? (multiple)/Posters/community meetings`,
    mpox_info_other = `C2. Where did you hear about Mpox? (multiple)/Other (specify)`)

Renaming Columns for Mpox Knowledge, Transmission, and Symptoms

This chunk is focused on renaming survey columns that cover knowledge about Mpox causes, transmission routes, symptoms, and prevention. It takes the original long survey questions and converts them into short, clear, analysis-ready variable names.

Mpox<- Mpox %>% 
  rename(freq_mpox_info=`C4. How frequently do you receive information about Mpox?`,mpox_cause=`D1. What causes the disease Mpox?`,mpox_transmitted_animal=`D2. How is Mpox transmitted? (multiple)/From animals (such as monkeys, rodents)`,mpox_transmitted_person=`D2. How is Mpox transmitted? (multiple)/From person-to-person through close contact`,mpox_transmitted_sexual=`D2. How is Mpox transmitted? (multiple)/Through sexual contact`,mpox_transmitted_droplets=`D2. How is Mpox transmitted? (multiple)/Through respiratory droplets`,mpox_transmitted_materials=`D2. How is Mpox transmitted? (multiple)/From contaminated materials (clothes, beddings)`,mpox_transmission_Dont_know=`D2. How is Mpox transmitted? (multiple)/Don't know`,mpox_transsmission_direct=`D3. Mpox can be transmitted through… (multiple)/Direct contact with rash/lesions`,mpox_transsmission_respiratory=`D3. Mpox can be transmitted through… (multiple)/Respiratory droplets during close contact`,mpox_transsmission_contact=`D3. Mpox can be transmitted through… (multiple)/Contact with contaminated objects (bedding, clothing)`,mpox_transsmission_bite=`D3. Mpox can be transmitted through… (multiple)/Bite from infected animals`, mpox_transsmission_sex_contact=`D3. Mpox can be transmitted through… (multiple)/Sexual contact`,signs_symptoms=`D4. Common symptoms of Mpox (Yes/No/Don't know)`,
Fever=Fever,Headache=Headache,Swollen_lymph_nodes=`Swollen lymph nodes`,Skin_rash=`Skin rash/lesions`,Back_pain=`Back pain`, Muscle_aches=`Muscle aches`,
Mpox_resemble_chickenpox=`D5. Mpox symptoms can resemble chickenpox`, mpox_vaccine_available=`D6. There is a vaccine that can protect against Mpox.`,vax_mpox_protection=`D7. Being vaccinated  provides some protection against Mpox.`,mpox_seek_care=`D8a. Should someone with suspected Mpox seek care at a health facility?`,mpox_isolate=`D8b. Should someone with suspected Mpox isolate from others?`,handwahsing_prevent=`D9. Does handwashing and cleaning surfaces help prevent Mpox?`,avoiding_contact=`D10. Mpox can be prevented by avoiding contact with wild animals.`,pregant_women=`D11. Pregnant women can transmit Mpox to their unborn child.`,mpox_severe_children=`D12. Mpox can cause serious illness in children and immunocompromised people.`,mpox_incubation=`D13. Mpox incubation period is 5–21 days.`,mpox_spread_surfaces=`D14. Mpox can be spread via contaminated surfaces.`,maternal_foetal=`D16. Maternal-to-foetal transmission is possible.`)

Renaming Columns for Attitudes Toward Mpox

This chunk focuses on renaming survey columns that capture respondents’ attitudes and perceptions about Mpox and public health response.

Mpox<- Mpox %>% 
  rename(community_HCWs=`E1. Community members can work with health workers to reduce Mpox spread.`,awareness_pox=`E2. Awareness programs about Mpox are needed in my district.`,trust_HCWs=`E3. I trust health authorities to manage an Mpox outbreak.`,mpox_vaccination=`E4. I would take an Mpox vaccine if recommended.`,public_health_threat=`E5. Mpox is a serious public health threat.`,concerned=`E6. I am concerned about Mpox affecting my family.`)

Renaming Columns for Practices Toward Mpox

Purpose: This chunk renames survey columns that capture respondents’ actual or intended behaviors (practices) related to Mpox prevention.

Mpox<- Mpox %>% 
  rename(discussed_Mpox_last_30_days=`F1. In the last 30 days, have you discussed Mpox prevention with others?`,visit_clinic_fever_rash=`F2. If you had fever and rash, would you seek care at a health facility immediately?`,avoid_close_contact=`F3. Would you avoid close contact with a person with Mpox symptoms?`,avoid_sharing_contaminated_materials=`F4. Would you avoid sharing bedding/utensils with someone suspected of Mpox?`,avoided_crowded_gatherings=`F5. Have you avoided crowded gatherings during an outbreak?`,get_mpox_vaccine=`F6. Would you accept Mpox vaccination if available?`)

Recoding Occupational Status & Marital Status

This chunk simplifies categorical variables to make analysis and reporting easier, particularly for KAP tables or mentor-ready summaries.

#Occupatonal status 
Mpox <- Mpox %>%
  mutate(
    Occupation = case_when(
      Occupation %in% c("Business/vendor",
                         "Civil servant/professional",
                         "Farmer/manual") ~ "Employed",
      Occupation == "Student" ~ "Student",
      Occupation == "Unemployed" ~ "Unemployed",
      Occupation == "Other" ~ "Unemployed",
      TRUE ~ NA_character_))

#marital stautus 

Mpox<- Mpox%>%
  mutate(
    Marital_status= case_when(
      Marital_status == "Married" ~ "Married",
      TRUE ~ "Unmarried"
    ))

Adding Variable Labels- table 1

This chunk uses the labelled package to assign descriptive labels to variables. These labels help make tables, summaries, and outputs more readable for mentors or reports.

pacman::p_load(labelled)

var_label(Mpox) <- list(Province= "Provinces",District="Districts",Facility="Facilities",Sex="Sex",Age="Age (years)",Marital_status="Marital status",Residence="Residence",Education_level="Education level", Occupation="Occupation",Religion="Religion")

Labeling Mpox Awareness Variables

This chunk continues adding descriptive labels to your Mpox dataset, now focusing on awareness and information sources.

pacman::p_load(labelled)

var_label(Mpox) <- list(heard_of_mpox="Heard of Mpox before",
  mpox_info_radio_tv = "Radio/TV",
  mpox_info_newspaper = "Newspaper",
  mpox_info_internet = "Internet/social media",
  mpox_info_health_worker = "Health care worker",
  mpox_info_friends_family = "Family/friends",
  mpox_info_posters_meetings = "Posters/Meetings",
  mpox_info_other = "Other"
)

Labeling Knowledge Variables

You are assigning descriptive labels to all variables related to knowledge of Mpox, including causes, transmission, symptoms, prevention, and vaccination.

pacman::p_load(labelled)

var_label(Mpox) <- list(freq_mpox_info="How frequently do you receive information about Mpox?",mpox_cause="What causes the disease Mpox?",mpox_transmitted_animal="How is Mpox transmitted:From animals (such as monkeys, rodents)?", mpox_transmitted_person= "How is Mpox transmitted:From person-to-person through close contact?",mpox_transmitted_sexual="How is Mpox transmitted:Through sexual contact?", mpox_transmitted_droplets="How is Mpox transmitted:Through respiratory droplets?", mpox_transmitted_materials="How is Mpox transmitted:From contaminated materials (clothes, beddings)?",mpox_transmission_Dont_know="How is Mpox transmitted:Don't know? ",mpox_transsmission_direct="Mpox can be transmitted through: Direct contact with rash/lesions?",mpox_transsmission_respiratory="Mpox can be transmitted through:Respiratory droplets during close contact",mpox_transsmission_contact="Mpox can be transmitted through: Contact with contaminated objects (bedding, clothing)",mpox_transsmission_bite="Mpox can be transmitted through:Bite from infected animals",mpox_transsmission_sex_contact="Mpox can be transmitted through:Sexual contact",Fever="Fever",Headache="Headache",Swollen_lymph_nodes="Swollen lymph nodes",Skin_rash="Skin rash/lesions",Back_pain="Back pain", Muscle_aches="Muscle aches", Mpox_resemble_chickenpox="Mpox symptoms can resemble chickenpox?",mpox_vaccine_available="There is a vaccine that can protect against Mpox?",vax_mpox_protection="Being vaccinated  provides some protection against Mpox?",mpox_seek_care="Should someone with suspected Mpox seek care at a health facility?",mpox_isolate="Should someone with suspected Mpox isolate from others?",handwahsing_prevent="Does handwashing and cleaning surfaces help prevent Mpox?",avoiding_contact="Mpox can be prevented by avoiding contact with wild animals?", pregant_women="Pregnant women can transmit Mpox to their unborn child?", mpox_severe_children="Mpox can cause serious illness in children and immunocompromised people?",mpox_incubation="Mpox incubation period is 5–21 days?",mpox_spread_surfaces="Mpox can be spread via contaminated surfaces?",maternal_foetal="Maternal-to-foetal transmission is possible?")

Labeling of variables-attitude questions

pacman::p_load(labelled)

var_label(Mpox) <- list(community_HCWs= "Community members can work with health workers to reduce Mpox spread?",awareness_pox= "Awareness programs about Mpox are needed in my district?",trust_HCWs= "I trust health authorities to manage an Mpox outbreak?",mpox_vaccination= "I would take an Mpox vaccine if recommended?",public_health_threat= "Mpox is a serious public health threat?",concerned= "I am concerned about Mpox affecting my family?")

Labeling of variables-practices questions

pacman::p_load(labelled)

var_label(Mpox) <- list(discussed_Mpox_last_30_days= "Have you discussed Mpox prevention with others in the last 30 days?",visit_clinic_fever_rash= "seek care at a health facility immediately if fever or rash?",avoid_close_contact="avoid close contact with a person with Mpox symptoms?",avoid_sharing_contaminated_materials= "avoid sharing bedding/utensils with someone suspected of Mpox?",avoided_crowded_gatherings= "avoided crowded gatherings during an outbreak?",get_mpox_vaccine=" accept Mpox vaccination if available?")

Assessing the distribution of age and monthly household income

summary(Mpox$Age)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA’s 18.00 27.00 34.00 37.93 47.25 90.00 3

Filter the NAs for age. these where entered during the practice sessions

Mpox <- Mpox %>% filter(!is.na(Age))

Categorizing age

Age was categorized into three groups: 18–34, 35–54, and 55+ years to reflect young, middle-aged, and older adults.

# updated age categories 
Mpox$Age_categories <- cut(
  Mpox$Age,
  breaks = c(18, 35, 55, 85),   # updated boundaries
  labels = c("18–34", "35–54", "55+"),  # new categories
  right = FALSE, include.lowest = TRUE)

Mpox <- Mpox %>% filter(!is.na(Age_categories))

ggplot(Mpox, aes(x = Age)) +
  geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "lightblue") +
  stat_function(fun = dnorm, args = list(mean = mean(Mpox$Age, na.rm = TRUE), sd = sd(Mpox$Age, na.rm = TRUE)), color = "red") +
  labs(title = "Age distribution", x = "Age (Years)", y = "Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# save as an image
ggsave("age_distribution.png", width = 10, height = 6, dpi = 300)

summary(Mpox$Monthly_income)

Min. 1st Qu. Median Mean 3rd Qu. Max. 0 700 1650 2876 3000 400000

Remove some out liers

Mpox <- Mpox[Mpox$Monthly_income != 400000, ]

incase of income being reported as continous, check its distribution below

ggplot(Mpox, aes(x =Monthly_income)) +
  geom_histogram(aes(y = ..density..), bins = 30, color = "black", fill = "lightblue") +
  stat_function(fun = dnorm, args = list(mean = mean(Mpox$Monthly_income, na.rm = TRUE), sd = sd(Mpox$Monthly_income, na.rm = TRUE)), color = "red") +
  labs(title = "Income distribution", x = "Monthly_income(kwacha)", y = "Density") +
  theme_classic() +
  theme(plot.title = element_text(hjust = 0.5))

# save as an image
ggsave("Monthly_income_distribution.png", width = 10, height = 6, dpi = 300)

I decided to report as categorical and therefore i changed below
Monthly income was dichotomized as <3000 and ≥3000 to reflect lower- and higher-income respondents.

Mpox$Income_category<- ifelse(
  Mpox$Monthly_income < 3000, 
  "< 3000", 
  "≥ 3000")

Descriptive analysis

This section is building a descriptive summary table of your survey population and stratifying it by whether respondents have heard of Mpox. The goal is to show the distribution of key demographics and compare groups.

# Step 1: Recode outcome with desired order
Mpox <- Mpox %>% 
  mutate(heard_of_mpox = factor(heard_of_mpox, levels = c("Yes", "No"))) # Yes first, No second

# Step 2: Build stratified table (categorical only)
t1 <- Mpox %>% 
  select(heard_of_mpox, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category) %>% 
  tbl_summary(
    by = heard_of_mpox,
    percent = "row",
    missing_text = "Missing",
    label = list(
      Province ~ "Province",
      Age_categories ~ "Age group",
      Sex ~ "Sex",
      Marital_status ~ "Marital status",
      Residence ~ "Residence (Urban/Rural)",
      Education_level  ~ "Education level",
      Occupation ~ "Occupation",
      Income_category ~ "Income category"
    ),
    digits = list(all_categorical() ~ c(0, 1)) # format counts to 0 decimal points and percentages to 1 decimal point
  ) %>% 
  add_overall(last = FALSE) %>% 
  add_p(
    test = list(all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  ) %>% 
  bold_labels() %>% 
  modify_header(label ~ "**Population Characteristics**") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Heard of Mpox**")

print(t1)

Population Characteristics	Overall N = 809¹	Heard of Mpox		p-value²
Population Characteristics	Overall N = 809¹	Yes N = 372¹	No N = 437¹	p-value²
Province				<0.001
Central	120 (100.0%)	71 (59.2%)	49 (40.8%)
Copperbelt	120 (100.0%)	58 (48.3%)	62 (51.7%)
Lusaka	444 (100.0%)	151 (34.0%)	293 (66.0%)
Muchinga	32 (100.0%)	25 (78.1%)	7 (21.9%)
Western	93 (100.0%)	67 (72.0%)	26 (28.0%)
Age group				0.335
18–34	415 (100.0%)	181 (43.6%)	234 (56.4%)
35–54	282 (100.0%)	139 (49.3%)	143 (50.7%)
55+	112 (100.0%)	52 (46.4%)	60 (53.6%)
Sex				<0.001
Female	457 (100.0%)	236 (51.6%)	221 (48.4%)
Male	352 (100.0%)	136 (38.6%)	216 (61.4%)
Marital status				0.089
Married	462 (100.0%)	200 (43.3%)	262 (56.7%)
Unmarried	347 (100.0%)	172 (49.6%)	175 (50.4%)
Residence (Urban/Rural)				0.422
Rural	64 (100.0%)	33 (51.6%)	31 (48.4%)
Urban	745 (100.0%)	339 (45.5%)	406 (54.5%)
Education level				<0.001
None	51 (100.0%)	20 (39.2%)	31 (60.8%)
Primary	260 (100.0%)	89 (34.2%)	171 (65.8%)
Secondary	395 (100.0%)	191 (48.4%)	204 (51.6%)
Tertiary	103 (100.0%)	72 (69.9%)	31 (30.1%)
Occupation				0.021
Employed	564 (100.0%)	257 (45.6%)	307 (54.4%)
Student	28 (100.0%)	20 (71.4%)	8 (28.6%)
Unemployed	217 (100.0%)	95 (43.8%)	122 (56.2%)
Income category				0.008
< 3000	542 (100.0%)	231 (42.6%)	311 (57.4%)
≥ 3000	267 (100.0%)	141 (52.8%)	126 (47.2%)
¹ n (%)
² Pearson’s Chi-squared test

# Convert gtsummary object to flextable
ft <- as_flex_table(t1)

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(ft)

print(doc, target = "table1.docx")

Ever heard of Mpox

This section creates a pie chart showing the proportion of respondents who have or have not heard of Mpox. It also saves the plot as a PNG for use in reports or presentations.

# Prepare data
heard_df <- as.data.frame(table(Mpox$heard_of_mpox))
colnames(heard_df) <- c("Response", "Count")

# Add percentage and formatted label with 1 decimal place
heard_df$Percent <- heard_df$Count / sum(heard_df$Count) * 100
heard_df$Label <- sprintf("%s\n%d (%.1f%%)", heard_df$Response, heard_df$Count, heard_df$Percent)

# Plot
library(ggplot2)
ggplot(heard_df, aes(x = "", y = Count, fill = Response)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y") +
  theme_void() +
  geom_text(aes(label = Label), position = position_stack(vjust = 0.5), size = 5) +
  labs(title = "") +
  scale_fill_manual(values = c("#F8766D", "#00BFC4"))

# Save the last plot as a PNG
ggsave("heard_of_mpox_pie.png", width = 6, height = 6, dpi = 300)

Sub analysis

I filtered the dataset to include only participants who reported having heard of Mpox, so that subsequent analyses focus on this subgroup with prior awareness.

sub_set <- Mpox %>% filter(heard_of_mpox == "Yes")

sub_set$Age_categories <- cut(
  sub_set$Age,
  breaks = c(18, 35, 55, 85),   # updated boundaries
  labels = c("18–34", "35–54", "55+"),  # new categories
  right = FALSE, include.lowest = TRUE)

sub_set$Income_category<- ifelse(
  sub_set$Monthly_income < 3000, 
  "< 3000", 
  "≥ 3000")

Source of Information on Mpox

I created a new variable, Sources_of_information_on_Mpox, to categorize where participants who had heard of Mpox obtained their information. I grouped responses into main channels such as ‘Health care worker’, ‘Radio/TV’, ‘Internet/social media’, ‘Family/friends’, and a combined ‘Print/community sources’ category for newspapers, posters, meetings, or other sources. This simplification allowed for easier interpretation and comparison of information sources.

sub_set <- sub_set %>%
  mutate(
    Sources_of_information_on_Mpox = case_when(
      mpox_info_health_worker == 1 ~ "Health care worker",
      mpox_info_radio_tv == 1 ~ "Radio/TV",
      mpox_info_internet == 1 ~ "Internet/social media",
      mpox_info_friends_family == 1 ~ "Family/friends",
      
      # Collapsed category
      mpox_info_newspaper == 1 ~ "Print/community sources",
      mpox_info_posters_meetings == 1 ~ "Print/community sources",
      mpox_info_other == 1 ~ "Print/community sources",
      
      TRUE ~ NA_character_
    )
  )

Analyze Sources of Mpox Information by Province

I corrected a minor typo in the Sources_of_information_on_Mpox variable and then organized the data to improve readability. I ordered provinces by the proportion of respondents and information sources by their frequency. Using this ordering, I created a horizontal stacked bar chart showing the distribution of Mpox information sources across provinces. This visualization allowed me to quickly identify which channels were most commonly used. Finally, I exported the chart and embedded it in a Word document for reporting purposes.

# Fix typo in source categories
sub_set <- sub_set %>%
  mutate(Sources_of_information_on_Mpox = case_when(
    Sources_of_information_on_Mpox == "Family/freinds" ~ "Family/friends",
    TRUE ~ Sources_of_information_on_Mpox
  ))

# Order provinces by overall respondent percentage
province_order <- sub_set %>%
  count(Province) %>%
  mutate(prop = n / sum(n)) %>%
  arrange(desc(prop)) %>%
  pull(Province)

sub_set$Province <- factor(sub_set$Province, levels = province_order)

# Order sources by frequency
source_order <- sub_set %>%
  count(Sources_of_information_on_Mpox) %>%
  arrange(desc(n)) %>%
  pull(Sources_of_information_on_Mpox)

source_order

[1] “Health care worker” “Radio/TV”
[3] “Internet/social media” “Family/friends”
[5] “Print/community sources”

sub_set <- sub_set |> mutate(Sources_of_information_on_Mpox = factor(Sources_of_information_on_Mpox,
                                                                     levels = c("Health care worker","Radio/TV",
                                                                               "Internet/social media","Family/friends",
                                                                                "Print/community sources")))

library(RColorBrewer)

# Plot chart (provinces only, no "Overall")
chart <- ggplot(sub_set, aes(y = Province, fill = Sources_of_information_on_Mpox)) + 
  geom_bar(position = "fill", color = "#636363") +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1)) +
  # "Blues" is a common sequential RColorBRewer palette; "Set1" or "Set2" are good for categories 
  # n must match the number of levels in the data (5 in this case)
  scale_fill_manual(values = rev(brewer.pal(5, "Blues")), limits = source_order # Using color brewer pallet with reversed order
    # values = c(
    #   "Health care worker" = "#045a8d",
    #   "Radio/TV" = "#2b8cbe",
    #   "Internet/social media" = "#74a9cf",
    #   "Family/friends" = "#bdc9e1",
    #   "Print/community sources" = "#f1eef6"), #manually specificing the colors
    # limits = source_order  
  ) +
  labs(
    x = "Proportion of respondents",
    y = "Province",
    fill = "Source category"
  ) +
  theme_classic(base_size = 14) +
  theme(
    legend.position = "right",
    legend.title = element_text(size = 12, colour = "black"),
    legend.text = element_text(size = 11, colour = "black"),
    axis.text.y = element_text(size = 12, colour = "black"),
    axis.text.x = element_text(size = 11, colour = "black")
  )

# print
print(chart)

# Save the chart
ggsave("mpox_info_sources.png", plot = chart, height = 5.84, width = 13.8, dpi = 300)

# Create Word document and embed the image
doc <- read_docx() %>% 
  body_add_par("Figure: Sources of Mpox Information Among Participants", style = "heading 2") %>% 
  body_add_img(src = "mpox_info_sources.png", width = 6, height = 4)

print(doc, target = "Mpox_Info_Sources.docx")

Create Table of Mpox Information Sources by Province

I summarized the number and proportion of respondents reporting each source of Mpox information, stratified by province. Percentages were calculated for each source within a province and formatted alongside counts for clarity (e.g., n (x%)). Using pivot_wider, I transformed the data into a table format suitable for reporting. Finally, I created a flextable for a polished table layout and exported it to a Word document for inclusion in the report

# Step 1: Summarise counts and percentages
mpox_table <- sub_set %>%
  group_by(Province, Sources_of_information_on_Mpox) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Province) %>%
  mutate(
    total = sum(n),
    percent = round((n / total) * 100, 1),
    label = paste0(n, " (", percent, "%)")
  ) %>%
  select(Province, Sources_of_information_on_Mpox, label, total) %>%
  pivot_wider(
    names_from = Sources_of_information_on_Mpox,
    values_from = label,
    values_fill = list(label = "0 (0%)")   # <-- ensures zeros are shown
  ) %>%
  relocate(total, .after = last_col())     # put totals at the end

# Step 2: Create flextable for Word
ft <- flextable(mpox_table) %>%
  autofit() %>%
  set_header_labels(
    Province = "Province",
    `Health care worker` = "Health care worker",
    `Radio/TV` = "Radio/TV",
    `Internet/social media` = "Internet/social media",
    `Family/friends` = "Family/friends",
    Others = "Others",
    total = "Total (n)"
  )
ft

Province	Health care worker	Radio/TV	Internet/social media	Family/friends	Print/community sources	Total (n)
Lusaka	45 (29.8%)	38 (25.2%)	40 (26.5%)	18 (11.9%)	10 (6.6%)	151
Central	35 (49.3%)	19 (26.8%)	4 (5.6%)	6 (8.5%)	7 (9.9%)	71
Western	39 (58.2%)	14 (20.9%)	6 (9%)	7 (10.4%)	1 (1.5%)	67
Copperbelt	30 (51.7%)	12 (20.7%)	7 (12.1%)	2 (3.4%)	7 (12.1%)	58
Muchinga	19 (76%)	4 (16%)	0 (0%)	0 (0%)	2 (8%)	25

# Step 3: Save as Word document
doc <- read_docx() %>%
  body_add_par("Table: Sources of Mpox information by province", style = "heading 1") %>%
  body_add_flextable(ft)

print(doc, target = "mpox_info_sources_by_province.docx")

Summarize Knowledge of Mpox Transmission

I recoded the transmission variables so that 0 = No and 1 = Yes, ensuring consistency across the dataset. I then selected the key transmission modes—animal-to-human, person-to-person, sexual, droplet, and contaminated materials—and labeled them clearly. Using tbl_summary, I calculated counts and percentages of respondents aware of each mode of Mpox transmission. The resulting summary table was exported as a Word document for reporting purposes.

# Step 1: Recode correctly (0 = No, 1 = Yes)
sub_set <- sub_set %>%
  mutate(
    mpox_transmitted_animal    = factor(mpox_transmitted_animal,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_person    = factor(mpox_transmitted_person,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_sexual    = factor(mpox_transmitted_sexual,    levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_droplets  = factor(mpox_transmitted_droplets,  levels = c(0,1), labels = c("No","Yes")),
    mpox_transmitted_materials = factor(mpox_transmitted_materials, levels = c(0,1), labels = c("No","Yes"))
  )

# Step 2: Define variable list and labels
transmission <- c(
  "mpox_transmitted_animal",
  "mpox_transmitted_person",
  "mpox_transmitted_sexual",
  "mpox_transmitted_droplets",
  "mpox_transmitted_materials"
)

transmission_labels <- list(
  mpox_transmitted_animal    = "Animal-to-human",
  mpox_transmitted_person    = "Person-to-person",
  mpox_transmitted_sexual    = "Sexual transmission",
  mpox_transmitted_droplets  = "Droplet transmission",
  mpox_transmitted_materials = "Contaminated materials"
)

# Step 3: Create summary table
kap_table <- sub_set %>%
  select(all_of(transmission)) %>%
  tbl_summary(
    type      = all_categorical() ~ "categorical",
    label     = transmission_labels,
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits    = all_categorical() ~ c(0, 1),
    missing   = "no"   # ensures 0 (0%) is shown instead of blank
  )

# Step 4: Print inline in R Markdown
kap_table

Characteristic	N = 372¹
Animal-to-human
No	183 (49.2%)
Yes	189 (50.8%)
Person-to-person
No	152 (40.9%)
Yes	220 (59.1%)
Sexual transmission
No	251 (67.5%)
Yes	121 (32.5%)
Droplet transmission
No	264 (71.0%)
Yes	108 (29.0%)
Contaminated materials
No	252 (67.7%)
Yes	120 (32.3%)
¹ n (%)

# Step 5: Export to Word (optional)
ft <- as_flex_table(kap_table)
doc <- read_docx() %>%
  body_add_par("Table: Knowledge of Mpox Transmission Modes", style = "heading 2") %>%
  body_add_flextable(ft)
print(doc, target = "Mpox_Transmission_Table.docx")

Define Transmission Variables and Labels

I created a vector transmission_vars containing the key Mpox transmission variables to streamline subsequent analysis. I also defined transmission_labels to assign clear, descriptive labels to each transmission mode (e.g., Animal-to-human, Person-to-person, Sexual transmission, Droplet transmission, Contaminated materials). This ensures that tables and plots are easily interpretable for stakeholders

transmission_vars <- c(
  "mpox_transmitted_animal",
  "mpox_transmitted_person",
  "mpox_transmitted_sexual",
  "mpox_transmitted_droplets",
  "mpox_transmitted_materials"
)

transmission_labels <- c(
  mpox_transmitted_animal = "Animal-to-human",
  mpox_transmitted_person = "Person-to-person",
  mpox_transmitted_sexual = "Sexual transmission",
  mpox_transmitted_droplets = "Droplet transmission",
  mpox_transmitted_materials = "Contaminated materials"
)

Prepare Data for Knowledge of Transmission Plot

I reshaped the Mpox transmission data into a long format suitable for plotting. For each transmission mode, I calculated the count and percentage of respondents reporting ‘Yes’ or ‘No’. I then created a variable label using our descriptive transmission labels and reordered them by the percentage of respondents answering ‘Yes’, so the plot will display the most recognized transmission modes first.

# Step 1: Build the full dataset
kap_plot_data <- sub_set %>%
  select(all_of(transmission_vars)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = transmission_labels[variable])

# Step 2: Calculate Yes percentages for ordering
yes_order <- kap_plot_data %>%
  filter(response == "Yes") %>%
  group_by(label) %>%
  summarise(yes_percent = sum(percent), .groups = "drop")

# Step 3: Join back and reorder labels
kap_plot_data <- kap_plot_data %>%
  left_join(yes_order, by = "label") %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))

Visualize Knowledge of Mpox Transmission Modes

I created a horizontal stacked bar chart showing the proportion of respondents who answered ‘Yes’ or ‘No’ for each mode of Mpox transmission. The y-axis lists the transmission modes, ordered by the percentage of respondents who correctly recognized them. The bars are color-coded to distinguish ‘Yes’ from ‘No’, and percentages are displayed on the x-axis for clarity

ggplot(kap_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363", width = 0.8) +
  scale_fill_manual(values = c("Yes" = "#1f77b4", "No" = "#bdc9e1")) +
  labs(
    title = "",
    x = "Percentage",
    y = "Modes of transmission",   # ← updated axis label
    fill = "Response"
  ) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_classic(base_size = 14) +
  theme(
    plot.title   = element_text(face = "bold", hjust = 0.5),
    axis.text.y  = element_text(face = "bold"),
    legend.position = "right",
    panel.grid = element_blank()
  )

# Alternative use sjPlot
trans_dat <- sub_set |> dplyr::select(mpox_transmitted_animal, mpox_transmitted_person, mpox_transmitted_sexual,
                            mpox_transmitted_droplets, mpox_transmitted_materials)

# Set theme
set_theme(base = theme_classic())

# plot_stackfrq(trans_dat, 
#               show.total = F,
#               axis.labels = c("",
#                               "",
#                               "",
#                               "",
#                               "",
#                               ""),
#               geom.colors = "Blues")

ggsave("mpox_transmission_chart.png", dpi = 300, width = 8, height = 5)

Summarize Knowledge of Mpox Signs and Symptoms

I recoded respondents’ answers for common Mpox symptoms into ‘Yes’ or ‘No’ categories, treating ‘Don’t know’ as ‘No’. Then, I created a summary table showing counts and percentages of respondents recognizing each symptom. This table helps identify which symptoms are well-known and which may require additional awareness efforts. Finally, I exported the table to Word for reporting.

# Define symptom variables (removed Mpox_resemble_chickenpox)
symptoms <- c("Fever", "Headache", "Swollen_lymph_nodes",
              "Skin_rash", "Back_pain", "Muscle_aches")

# Recode variables
sub_set <- sub_set %>%
  mutate(across(all_of(symptoms),
                ~ factor(ifelse(.x %in% c("Don't know", "No"), "No", "Yes"),
                         levels = c("No", "Yes"))))

# Create the summary table with a caption
symptoms_tbl <- sub_set %>%
  select(all_of(symptoms)) %>%
  tbl_summary(
    type = all_categorical() ~ "categorical",
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits = all_categorical() ~ c(0, 1),
    label = list(
      Fever = "Fever",
      Headache = "Headache",
      Swollen_lymph_nodes = "Swollen lymph nodes",
      Skin_rash = "Skin rash/lesions",
      Back_pain = "Back pain",
      Muscle_aches = "Muscle aches"
    ),
    missing = "no"
  ) %>%
  modify_caption("**Knowledge of Mpox Signs and Symptoms**")

print(symptoms_tbl)

**Knowledge of Mpox Signs and Symptoms**
Characteristic	N = 372¹
Fever
No	119 (32.0%)
Yes	253 (68.0%)
Headache
No	147 (39.5%)
Yes	225 (60.5%)
Swollen lymph nodes
No	195 (52.4%)
Yes	177 (47.6%)
Skin rash/lesions
No	71 (19.1%)
Yes	301 (80.9%)
Back pain
No	243 (65.3%)
Yes	129 (34.7%)
Muscle aches
No	203 (54.6%)
Yes	169 (45.4%)
¹ n (%)

# Save to Word
doc <- read_docx() %>%
  body_add_par("Table 1. Knowledge of Mpox Signs and Symptoms", style = "heading 2") %>%
  body_add_flextable(value = as_flex_table(symptoms_tbl))

print(doc, target = "symptoms.docx")

Visualize Knowledge of Mpox Symptoms

I created a bar chart showing the percentage of respondents who correctly recognized each Mpox symptom. Responses of ‘No’ and ‘Don’t know’ were combined to simplify interpretation. The symptoms were ordered from most to least recognized based on the proportion of ‘Yes’ responses. This visual highlights which symptoms are widely known and where awareness campaigns may need to focus.

# Define the six symptoms
symptoms <- c("Fever", "Headache", "Swollen_lymph_nodes",
              "Skin_rash", "Back_pain", "Muscle_aches")

# Clean labels for plotting
symptom_labels <- c(
  Fever = "Fever",
  Headache = "Headache",
  Swollen_lymph_nodes = "Swollen lymph nodes",
  Skin_rash = "Skin rash/lesions",
  Back_pain = "Back pain",
  Muscle_aches = "Muscle aches"
)

# Recode Yes/No responses (combine No + Don't know)
sub_set <- sub_set %>%
  mutate(across(
    all_of(symptoms),
    ~ factor(
        ifelse(.x %in% c("Don't know", "No"), "No / Don't know", "Yes"),
        levels = c("No / Don't know", "Yes")
      )
  ))



# new and ordered by percentage of yes

symptom_plot_data <- sub_set %>%
  select(all_of(symptoms)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = symptom_labels[variable]) %>%
  # calculate Yes percentage per symptom
  group_by(label) %>%
  mutate(yes_percent = ifelse(response == "Yes", percent, 0)) %>%
  summarise(yes_percent = sum(yes_percent), .groups = "drop") %>%
  right_join(
    sub_set %>%
      select(all_of(symptoms)) %>%
      pivot_longer(cols = everything(),
                   names_to = "variable",
                   values_to = "response") %>%
      group_by(variable, response) %>%
      summarise(n = n(), .groups = "drop") %>%
      group_by(variable) %>%
      mutate(percent = round(n / sum(n) * 100, 1)) %>%
      ungroup() %>%
      mutate(label = symptom_labels[variable]),
    by = "label"
  ) %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))


# Plot
ggplot(symptom_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363") +
  scale_fill_manual(
    values = c("Yes" = "#1f77b4", "No / Don't know" = "#bdc9e1")
  ) +
  labs(
    title = "",
    x = "Percentage",
    y = NULL,
    fill = "Response"
  ) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.y = element_text(face = "bold"),
    legend.position = "right",
    legend.justification = c(1, 1),
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

ggsave("mpox_symptoms_chart_selected.png", dpi = 300, width = 8, height = 5)

Summarize Knowledge on Mpox Prevention

I focused on prevention-related variables, including awareness of the Mpox vaccine, protective behaviors (e.g., handwashing, isolation), and knowledge of transmission risks. Responses of ‘No’ and ‘Don’t know’ were combined to simplify interpretation. I then created a summary table showing the counts and percentages of participants who correctly recognized each preventive measure. This highlights areas where community knowledge is strong versus where targeted education may be needed

# Define prevention-focused variables
prevention_vars <- c(
  "mpox_vaccine_available",
  "vax_mpox_protection",
  "mpox_seek_care",
  "mpox_isolate",
  "handwahsing_prevent",   # spelling check
  "avoiding_contact",
  "pregant_women",         # spelling check
  "mpox_severe_children",
  "mpox_spread_surfaces"
)

# Recode Yes/No responses
sub_set <- sub_set %>%
  mutate(across(all_of(prevention_vars),
                ~ factor(ifelse(.x %in% c("Don't know", "No"), "No / Don't know", "Yes"),
                         levels = c("No / Don't know", "Yes"))))

# Create the summary table
prevention_tbl <- sub_set %>%
  select(all_of(prevention_vars)) %>%
  tbl_summary(
    type = all_categorical() ~ "categorical",
    statistic = all_categorical() ~ "{n} ({p}%)",
    digits = all_categorical() ~ c(0, 1),
    label = list(
      mpox_vaccine_available = "Vaccine available for Mpox?",
      vax_mpox_protection    = "Vaccination protects against Mpox?",
      mpox_seek_care         = "Suspected Mpox: seek health care?",
      mpox_isolate           = "Suspected Mpox: isolate from others?",
      handwahsing_prevent    = "Handwashing/surface cleaning prevents Mpox?",
      avoiding_contact       = "Avoiding wild animals prevents Mpox?",
      pregant_women          = "Pregnant women can transmit Mpox to unborn child?",
      mpox_severe_children   = "Mpox is severe in children/immunocompromised?",
      mpox_spread_surfaces   = "Mpox spreads via contaminated surfaces?"
    ),
    missing = "no"
  ) %>%
  modify_caption("**Knowledge of Mpox Prevention**")

print(prevention_tbl)

**Knowledge of Mpox Prevention**
Characteristic	N = 372¹
Vaccine available for Mpox?
No / Don’t know	210 (56.5%)
Yes	162 (43.5%)
Vaccination protects against Mpox?
No / Don’t know	127 (34.1%)
Yes	245 (65.9%)
Suspected Mpox: seek health care?
No / Don’t know	38 (10.2%)
Yes	334 (89.8%)
Suspected Mpox: isolate from others?
No / Don’t know	75 (20.2%)
Yes	297 (79.8%)
Handwashing/surface cleaning prevents Mpox?
No / Don’t know	110 (29.6%)
Yes	262 (70.4%)
Avoiding wild animals prevents Mpox?
No / Don’t know	108 (29.0%)
Yes	264 (71.0%)
Pregnant women can transmit Mpox to unborn child?
No / Don’t know	221 (59.4%)
Yes	151 (40.6%)
Mpox is severe in children/immunocompromised?
No / Don’t know	123 (33.1%)
Yes	249 (66.9%)
Mpox spreads via contaminated surfaces?
No / Don’t know	133 (35.8%)
Yes	239 (64.2%)
¹ n (%)

# Save to Word
doc <- read_docx() %>%
  body_add_par("Table 2. Knowledge of Mpox Prevention", style = "heading 2") %>%
  body_add_flextable(value = as_flex_table(prevention_tbl))

print(doc, target = "prevention.docx")

Visualizing Knowledge of Mpox Prevention

I created a stacked bar chart to show participants’ knowledge of Mpox prevention. First, I cleaned the variable labels for clarity. Then, I reshaped the dataset to calculate the percentage of respondents answering ‘Yes’ versus ‘No / Don’t know’ for each prevention measure. Finally, I ordered the prevention measures by the proportion of ‘Yes’ responses and plotted them as horizontal stacked bars with percentages labeled, allowing a clear visual comparison of awareness across different preventive actions

# Clean labels for plotting
prevention_labels <- c(
  mpox_vaccine_available = "Vaccine available for Mpox?",
  vax_mpox_protection    = "Vaccination protects against Mpox?",
  mpox_seek_care         = "Suspected Mpox: seek health care?",
  mpox_isolate           = "Suspected Mpox: isolate from others?",
  handwahsing_prevent    = "Handwashing/surface cleaning prevents Mpox?",
  avoiding_contact       = "Avoiding wild animals prevents Mpox?",
  pregant_women          = "Pregnant women can transmit Mpox to unborn child?",
  mpox_severe_children   = "Mpox is severe in children/immunocompromised?",
  mpox_spread_surfaces   = "Mpox spreads via contaminated surfaces?"
)

# Build dataset
prevention_plot_data <- sub_set %>%
  select(all_of(prevention_vars)) %>%
  pivot_longer(cols = everything(),
               names_to = "variable",
               values_to = "response") %>%
  group_by(variable, response) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(variable) %>%
  mutate(percent = round(n / sum(n) * 100, 1)) %>%
  ungroup() %>%
  mutate(label = prevention_labels[variable]) %>%
  # calculate Yes percentage per variable
  group_by(label) %>%
  mutate(yes_percent = ifelse(response == "Yes", percent, 0)) %>%
  summarise(yes_percent = sum(yes_percent), .groups = "drop") %>%
  right_join(
    sub_set %>%
      select(all_of(prevention_vars)) %>%
      pivot_longer(cols = everything(),
                   names_to = "variable",
                   values_to = "response") %>%
      group_by(variable, response) %>%
      summarise(n = n(), .groups = "drop") %>%
      group_by(variable) %>%
      mutate(percent = round(n / sum(n) * 100, 1)) %>%
      ungroup() %>%
      mutate(label = prevention_labels[variable]),
    by = "label"
  ) %>%
  mutate(label = fct_reorder(label, yes_percent, .desc = TRUE))

# Plot stacked bars
ggplot(prevention_plot_data, aes(x = percent, y = label, fill = response)) +
  geom_bar(stat = "identity", position = "stack", color = "#636363") +
  geom_text(aes(label = paste0(percent, "%")),
            position = position_stack(vjust = 0.5),
            color = "white", size = 4, fontface = "bold") +
  scale_fill_manual(values = c("Yes" = "#1f77b4", "No / Don't know" = "#bdc9e1")) +
  labs(x = "Percentage", y = NULL, fill = "Response") +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  theme_minimal(base_size = 14) +
  theme(axis.text.y = element_text(face = "bold"),
        legend.position = "right",
        panel.grid = element_blank(),
        axis.line = element_line(color = "black", size = 0.6))

# Save prevention chart to file
ggsave("prevention_chart.png",
       width = 8, height = 6, dpi = 300)   # high resolution for manuscripts

Calculating Overall Knowledge of Mpox

I combined all knowledge-related variables—including symptoms, transmission, and prevention—into a single dataset. Responses were recoded to 0/1 (No = 0, Yes = 1), and a total knowledge score was calculated for each participant. Participants scoring ≥80% of the total possible points were classified as having ‘Good knowledge,’ and others as ‘Poor knowledge.’ I summarized the overall knowledge with counts and percentages, then visualized it as a bar chart showing the proportion of participants with good versus poor knowledge, with labels for clarity.

# --- Step 1: Combine all knowledge variables ---
knowledge_vars <- c(symptoms, transmission, prevention_vars)

# --- Step 2: Recode responses into 0/1 ---
sub_set <- sub_set %>%
  mutate(across(all_of(knowledge_vars),
                ~ ifelse(. == "Yes", 1, 0)))

# --- Step 3: Calculate total score per participant ---
sub_set <- sub_set %>%
  rowwise() %>%
  mutate(total_score = sum(c_across(all_of(knowledge_vars)), na.rm = TRUE)) %>%
  ungroup()

# --- Step 4: Classify knowledge level (≥80% = Good knowledge) ---
sub_set <- sub_set %>%
  mutate(knowledge_level = ifelse(total_score >= 16, "Good knowledge", "Poor knowledge"))

# --- Step 5: Summarize overall knowledge with numerators + percentages ---
overall_knowledge <- sub_set %>%
  count(knowledge_level) %>%
  mutate(percent = round(n / sum(n) * 100, 1),
         label = paste0(n, " (", percent, "%)"))

print(overall_knowledge)

A tibble: 2 × 4

knowledge_level n percent label

1 Good knowledge 95 25.5 95 (25.5%) 2 Poor knowledge 277 74.5 277 (74.5%)

# --- Step 6: Plot bar chart with labels ---
ggplot(overall_knowledge, aes(x = knowledge_level, y = percent, fill = knowledge_level)) +
  geom_bar(stat = "identity", width = 0.6, color = "#636363") +
  geom_text(aes(label = label), vjust = -0.5, size = 5, fontface = "bold") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4", "Poor knowledge" = "#bdc9e1")) +
  labs(
    title = "",
    x = NULL,
    y = "Percentage of participants"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), expand = expansion(mult = c(0, 0.1))) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(face = "bold"),
    legend.position = "none",
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

chart without percentage labels

ggplot(overall_knowledge, aes(x = knowledge_level, y = percent, fill = knowledge_level)) +
  geom_bar(stat = "identity", width = 0.6, color = "#636363") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4", "Poor knowledge" = "#bdc9e1")) +
  labs(
    title = "",
    x = NULL,
    y = "Percentage of participants"
  ) +
  scale_y_continuous(labels = function(x) paste0(x, "%"), 
                     expand = expansion(mult = c(0, 0.1))) +
  theme_minimal(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    axis.text.x = element_text(face = "bold"),
    legend.position = "none",
    panel.grid = element_blank(),
    axis.line = element_line(color = "black", size = 0.6)
  )

# Pie chart of overall knowledge (no labels)
ggplot(overall_knowledge, aes(x = "", y = percent, fill = knowledge_level)) +
  geom_col(width = 1, color = "white") +
  coord_polar(theta = "y") +
  scale_fill_manual(values = c("Good knowledge" = "#1f77b4",
                               "Poor knowledge" = "#ff7f0e")) +
  labs(title = "",
       x = NULL, y = NULL, fill = "Knowledge Level") +
  theme_void(base_size = 14) +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    legend.position = "right"
  )

# Save the overall knowledge chart with labels
ggsave("overall_knowledge_chart.png",
       width = 8, height = 6, dpi = 300)

# Save the overall knowledge chart with labels
ggsave("overall_knowledge_pie_chart.png",
       width = 8, height = 6, dpi = 300)

creating a column for the outcome

# Create binary variable for regression

sub_set |> tabyl(knowledge_level)

knowledge_level n percent Good knowledge 95 0.2553763 Poor knowledge 277 0.7446237

sub_set <- sub_set %>%
  mutate(knowledge_binary = ifelse(knowledge_level == "Good knowledge", 1, 0))

# Check distribution
table(sub_set$knowledge_binary)

0 1 277 95

sub_set |> tabyl(knowledge_level, knowledge_binary)

knowledge_level 0 1 Good knowledge 0 95 Poor knowledge 277 0

sub_set |> tabyl(knowledge_binary) |> adorn_pct_formatting(affix_sign = T)

knowledge_binary n percent 0 277 74.5% 1 95 25.5%

Stratified Summary of Participant Characteristics by Knowledge Level

I generated a stratified summary table showing participant demographic characteristics (Province, Age group, Sex, Marital status, Residence, Education level, Occupation, and Income category) by knowledge level (‘Good knowledge’ vs. ‘Poor knowledge’). Counts and row percentages were displayed, with chi-square tests for association and p-values formatted to three decimal places. Labels were bolded, headers clarified, and the resulting table was converted to a Word-friendly flextable for reporting

# Step 2: Build stratified table (categorical only)
t2 <- sub_set %>% 
  select(knowledge_level, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category) %>% 
  tbl_summary(
    by = knowledge_level,
    percent = "row",
    missing_text = "Missing",
    label = list(
      Province ~ "Province",
      Age_categories ~ "Age group",
      Sex ~ "Sex",
      Marital_status ~ "Marital status",
      Residence ~ "Residence (Urban/Rural)",
      Education_level ~ "Education level",
      Occupation ~ "Occupation",
      Income_category ~ "Income category"
    ),
    digits = list(all_categorical() ~ c(0, 1)) # format counts to 0 decimal points and percentages to 1 decimal point
  ) %>% 
  add_overall(last = FALSE) %>% 
  add_p(
    test = list(all_categorical() ~ "chisq.test"),
    pvalue_fun = function(x) style_pvalue(x, digits = 3)
  ) %>% 
  bold_labels() %>% 
  modify_header(label ~ "**Population Characteristics**") %>% 
  modify_spanning_header(c("stat_1", "stat_2") ~ "**Heard of Mpox**")

print(t2)

Population Characteristics	Overall N = 372¹	Heard of Mpox		p-value²
Population Characteristics	Overall N = 372¹	Good knowledge N = 95¹	Poor knowledge N = 277¹	p-value²
Province				<0.001
Lusaka	151 (100.0%)	22 (14.6%)	129 (85.4%)
Central	71 (100.0%)	13 (18.3%)	58 (81.7%)
Western	67 (100.0%)	45 (67.2%)	22 (32.8%)
Copperbelt	58 (100.0%)	9 (15.5%)	49 (84.5%)
Muchinga	25 (100.0%)	6 (24.0%)	19 (76.0%)
Age group				0.107
18–34	181 (100.0%)	45 (24.9%)	136 (75.1%)
35–54	139 (100.0%)	42 (30.2%)	97 (69.8%)
55+	52 (100.0%)	8 (15.4%)	44 (84.6%)
Sex				0.662
Female	236 (100.0%)	58 (24.6%)	178 (75.4%)
Male	136 (100.0%)	37 (27.2%)	99 (72.8%)
Marital status				0.117
Married	200 (100.0%)	44 (22.0%)	156 (78.0%)
Unmarried	172 (100.0%)	51 (29.7%)	121 (70.3%)
Residence (Urban/Rural)				0.698
Rural	33 (100.0%)	7 (21.2%)	26 (78.8%)
Urban	339 (100.0%)	88 (26.0%)	251 (74.0%)
Education level				0.945
None	20 (100.0%)	4 (20.0%)	16 (80.0%)
Primary	89 (100.0%)	23 (25.8%)	66 (74.2%)
Secondary	191 (100.0%)	50 (26.2%)	141 (73.8%)
Tertiary	72 (100.0%)	18 (25.0%)	54 (75.0%)
Occupation				<0.001
Employed	257 (100.0%)	80 (31.1%)	177 (68.9%)
Student	20 (100.0%)	4 (20.0%)	16 (80.0%)
Unemployed	95 (100.0%)	11 (11.6%)	84 (88.4%)
Income category				0.005
< 3000	231 (100.0%)	47 (20.3%)	184 (79.7%)
≥ 3000	141 (100.0%)	48 (34.0%)	93 (66.0%)
¹ n (%)
² Pearson’s Chi-squared test

# Convert gtsummary object to flextable
ft <- as_flex_table(t2)

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(ft)

print(doc, target = "table1.docx")

Objective 3 : To identify socio-demographic factors associated with variations in knowledge related to Mpox among community members in selected districts of Zambia

sub_set$Sources_of_information_on_Mpox <- factor(sub_set$Sources_of_information_on_Mpox)

sub_set$Sources_of_information_on_Mpox <- relevel(
  sub_set$Sources_of_information_on_Mpox,
  ref = "Health care worker"
)

sub_set <- sub_set %>%
  mutate(Residence = factor(Residence, levels = c("Urban", "Rural")))

sub_set <- sub_set%>%
  mutate(Age_categories = factor(Age_categories,
                                 levels = c("55+", "35–54", "18–34")))

Univariate analysis

uv_model <- tbl_uvregression(
  data = sub_set,
  y = knowledge_binary,
  method = glm,
  method.args = list(family = binomial),
  include = c(
    "Province",
    "Age_categories",
    "Sex",
    "Marital_status",
    "Residence",
    "Education_level",
    "Occupation",
    "Income_category","Sources_of_information_on_Mpox"
  ),
  exponentiate = TRUE # gives ORs instead of log-odds
) %>%
  bold_p(t = 0.05) %>%
  modify_header(
    estimate ~ "**Odds Ratio**",
    p.value ~ "**P-value**"
  ) %>%
  modify_fmt_fun(p.value ~ function(x) style_pvalue(x, digits = 3)) %>%
  bold_labels()

print(uv_model)

Characteristic	N	Odds Ratio	95% CI	P-value
Province	372
Lusaka		—	—
Central		1.31	0.61, 2.76	0.477
Western		12.0	6.17, 24.2	<0.001
Copperbelt		1.08	0.44, 2.44	0.863
Muchinga		1.85	0.62, 4.95	0.238
Age_categories	372
55+		—	—
35–54		2.38	1.08, 5.85	0.042
18–34		1.82	0.83, 4.43	0.155
Sex	372
Female		—	—
Male		1.15	0.71, 1.85	0.576
Marital status	372
Married		—	—
Unmarried		1.49	0.94, 2.39	0.092
Residence	372
Urban		—	—
Rural		0.77	0.30, 1.74	0.552
Education level	372
None		—	—
Primary		1.39	0.45, 5.24	0.586
Secondary		1.42	0.49, 5.13	0.549
Tertiary		1.33	0.42, 5.11	0.644
Occupation	372
Employed		—	—
Student		0.55	0.15, 1.56	0.303
Unemployed		0.29	0.14, 0.55	<0.001
Income_category	372
< 3000		—	—
≥ 3000		2.02	1.26, 3.25	0.004
Sources_of_information_on_Mpox	372
Health care worker		—	—
Radio/TV		0.17	0.07, 0.35	<0.001
Internet/social media		0.35	0.16, 0.71	0.005
Family/friends		0.33	0.12, 0.79	0.019
Print/community sources		0.06	0.00, 0.28	0.005
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

mv_model <- glm(
  knowledge_binary ~ 
    Province+
    Age_categories+
    Sex + 
    Marital_status+
    Residence+
    Education_level+ 
    Occupation+
    Income_category+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

# Check listwise deletion
na.action(mv_model) # No deletion made

NULL

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.25	0.53, 2.90	0.609
Western	13.5	5.84, 33.2	<0.001
Copperbelt	0.83	0.31, 2.05	0.689
Muchinga	1.45	0.37, 5.37	0.586
Age_categories
55+	—	—
35–54	3.25	1.15, 10.2	0.033
18–34	1.95	0.70, 6.00	0.220
Sex
Female	—	—
Male	1.58	0.83, 2.99	0.161
Marital status
Married	—	—
Unmarried	1.53	0.80, 2.93	0.193
Residence
Urban	—	—
Rural	0.40	0.12, 1.19	0.111
Education level
None	—	—
Primary	0.92	0.25, 3.97	0.903
Secondary	0.78	0.22, 3.28	0.716
Tertiary	0.85	0.21, 3.89	0.829
Occupation
Employed	—	—
Student	0.65	0.16, 2.29	0.522
Unemployed	0.42	0.18, 0.92	0.038
Income_category
< 3000	—	—
≥ 3000	1.21	0.62, 2.33	0.567
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.13	0.05, 0.30	<0.001
Internet/social media	0.37	0.15, 0.87	0.027
Family/friends	0.23	0.07, 0.66	0.010
Print/community sources	0.08	0.00, 0.42	0.017
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

merging the outputs

# Multivariable model
mv_model <- glm(
  knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence +Education_level  + Occupation + Income_category+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
) %>%
  tbl_regression(
    exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)
  ) %>%
  bold_p(t = 0.05)

# Merge univariable and multivariable models
merged_table <- tbl_merge(
  tbls = list(uv_model, mv_model),
  tab_spanner = c("**Crude Odds Ratio**", "**Adjusted Odds Ratio**")
)

print(merged_table)

Characteristic	Crude Odds Ratio				Adjusted Odds Ratio
Characteristic	N	Odds Ratio	95% CI	P-value	OR	95% CI	p-value
Province	372
Lusaka		—	—		—	—
Central		1.31	0.61, 2.76	0.477	1.25	0.53, 2.90	0.609
Western		12.0	6.17, 24.2	<0.001	13.5	5.84, 33.2	<0.001
Copperbelt		1.08	0.44, 2.44	0.863	0.83	0.31, 2.05	0.689
Muchinga		1.85	0.62, 4.95	0.238	1.45	0.37, 5.37	0.586
Age_categories	372
55+		—	—		—	—
35–54		2.38	1.08, 5.85	0.042	3.25	1.15, 10.2	0.033
18–34		1.82	0.83, 4.43	0.155	1.95	0.70, 6.00	0.220
Sex	372
Female		—	—		—	—
Male		1.15	0.71, 1.85	0.576	1.58	0.83, 2.99	0.161
Marital status	372
Married		—	—		—	—
Unmarried		1.49	0.94, 2.39	0.092	1.53	0.80, 2.93	0.193
Residence	372
Urban		—	—		—	—
Rural		0.77	0.30, 1.74	0.552	0.40	0.12, 1.19	0.111
Education level	372
None		—	—		—	—
Primary		1.39	0.45, 5.24	0.586	0.92	0.25, 3.97	0.903
Secondary		1.42	0.49, 5.13	0.549	0.78	0.22, 3.28	0.716
Tertiary		1.33	0.42, 5.11	0.644	0.85	0.21, 3.89	0.829
Occupation	372
Employed		—	—		—	—
Student		0.55	0.15, 1.56	0.303	0.65	0.16, 2.29	0.522
Unemployed		0.29	0.14, 0.55	<0.001	0.42	0.18, 0.92	0.038
Income_category	372
< 3000		—	—		—	—
≥ 3000		2.02	1.26, 3.25	0.004	1.21	0.62, 2.33	0.567
Sources_of_information_on_Mpox	372
Health care worker		—	—		—	—
Radio/TV		0.17	0.07, 0.35	<0.001	0.13	0.05, 0.30	<0.001
Internet/social media		0.35	0.16, 0.71	0.005	0.37	0.15, 0.87	0.027
Family/friends		0.33	0.12, 0.79	0.019	0.23	0.07, 0.66	0.010
Print/community sources		0.06	0.00, 0.28	0.005	0.08	0.00, 0.42	0.017
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

# Convert to flextable using gtsummary's method
ft <- as_flex_table(merged_table)

# Create and save Word document
doc <- read_docx() %>%
  body_add_flextable(value = ft) %>%
  body_add_par("", style = "Normal")  # optional spacing

print(doc, target = "regression_table.docx")

Stepwise Selection

step_dat <- sub_set |> dplyr::select(knowledge_binary, Province, Age_categories, Sex, Marital_status, Residence, Education_level, Occupation, Income_category, Sources_of_information_on_Mpox, Age)

# Remove any missing observations if any
step_dat <- na.omit(step_dat)

# Label convert to factor
  step_dat <- step_dat %>%
  mutate(across(where(is.character), as.factor))
  
  var_label(step_dat) <- list(Province= "Provinces",Sex="Sex",Age_categories="Age (years)",Marital_status="Marital status", Residence="Residence",Education_level="Education level", Occupation="Occupation", Income_category = "Income category",
                                Sources_of_information_on_Mpox = "Sources of Mpox information", Age = "Age (years)") 
  

# Fit full model
FitAll <- glm(knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence +Education_level  + Occupation + Income_category + Sources_of_information_on_Mpox,
  data = step_dat,
  family = binomial)


  FitAll

Call: glm(formula = knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox, family = binomial, data = step_dat)

Coefficients: (Intercept)
-1.73359
ProvinceCentral
0.22147
ProvinceWestern
2.60127
ProvinceCopperbelt
-0.19007
ProvinceMuchinga
0.36847
Age_categories35–54
1.17771
Age_categories18–34
0.66700
SexMale
0.45547
Marital_statusUnmarried
0.42746
ResidenceRural
-0.91878
Education_levelPrimary
-0.08517
Education_levelSecondary
-0.24703
Education_levelTertiary
-0.15738
OccupationStudent
-0.43082
OccupationUnemployed
-0.86182
Income_category≥ 3000
0.19155
Sources_of_information_on_MpoxRadio/TV
-2.01974
Sources_of_information_on_MpoxInternet/social media
-0.99959
Sources_of_information_on_MpoxFamily/friends
-1.48030
Sources_of_information_on_MpoxPrint/community sources
-2.54225

Degrees of Freedom: 371 Total (i.e. Null); 352 Residual Null Deviance: 422.7 Residual Deviance: 298.1 AIC: 338.1

  library(jtools)
  summ(FitAll, exp = T)

Observations	372
Dependent variable	knowledge_binary
Type	Generalized linear model
Family	binomial
Link	logit

χ²(19)	124.60
p	0.00
Pseudo-R² (Cragg-Uhler)	0.42
Pseudo-R² (McFadden)	0.29
AIC	338.12
BIC	416.49

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.18	0.04	0.81	-2.23	0.03
ProvinceCentral	1.25	0.53	2.92	0.51	0.61
ProvinceWestern	13.48	5.67	32.04	5.89	0.00
ProvinceCopperbelt	0.83	0.33	2.10	-0.40	0.69
ProvinceMuchinga	1.45	0.38	5.44	0.54	0.59
Age_categories35–54	3.25	1.10	9.57	2.13	0.03
Age_categories18–34	1.95	0.67	5.65	1.23	0.22
SexMale	1.58	0.83	2.98	1.40	0.16
Marital_statusUnmarried	1.53	0.81	2.92	1.30	0.19
ResidenceRural	0.40	0.13	1.23	-1.59	0.11
Education_levelPrimary	0.92	0.23	3.59	-0.12	0.90
Education_levelSecondary	0.78	0.21	2.95	-0.36	0.72
Education_levelTertiary	0.85	0.21	3.56	-0.22	0.83
OccupationStudent	0.65	0.17	2.43	-0.64	0.52
OccupationUnemployed	0.42	0.19	0.95	-2.08	0.04
Income_category≥ 3000	1.21	0.63	2.33	0.57	0.57
Sources_of_information_on_MpoxRadio/TV	0.13	0.06	0.32	-4.53	0.00
Sources_of_information_on_MpoxInternet/social media	0.37	0.15	0.89	-2.21	0.03
Sources_of_information_on_MpoxFamily/friends	0.23	0.07	0.71	-2.56	0.01
Sources_of_information_on_MpoxPrint/community sources	0.08	0.01	0.63	-2.39	0.02
Standard errors: MLE

  length(coef(FitAll))

[1] 20

  # Daliso's model with prov, occupation and mpox_info
  
  Fitdaliso <- glm(knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox,
  data = step_dat,
  family = binomial)
  
  summ(Fitdaliso, exp = T)

Observations	372
Dependent variable	knowledge_binary
Type	Generalized linear model
Family	binomial
Link	logit

χ²(10)	112.59
p	0.00
Pseudo-R² (Cragg-Uhler)	0.38
Pseudo-R² (McFadden)	0.27
AIC	332.13
BIC	375.23

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.54	0.30	0.97	-2.05	0.04
ProvinceCentral	1.08	0.48	2.45	0.18	0.86
ProvinceWestern	10.29	4.79	22.10	5.97	0.00
ProvinceCopperbelt	0.80	0.32	1.97	-0.49	0.62
ProvinceMuchinga	1.06	0.35	3.23	0.11	0.92
OccupationStudent	0.62	0.18	2.15	-0.75	0.46
OccupationUnemployed	0.33	0.15	0.70	-2.88	0.00
Sources_of_information_on_MpoxRadio/TV	0.13	0.05	0.31	-4.62	0.00
Sources_of_information_on_MpoxInternet/social media	0.38	0.16	0.89	-2.22	0.03
Sources_of_information_on_MpoxFamily/friends	0.25	0.08	0.72	-2.54	0.01
Sources_of_information_on_MpoxPrint/community sources	0.08	0.01	0.62	-2.40	0.02
Standard errors: MLE

  formula(Fitdaliso)

knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox

  length(coef(Fitdaliso))

[1] 11

  #FORWARD SELECTION

  FitStart <- glm(knowledge_binary ~ 1,
                  data = step_dat,
                  family = binomial) #begin with model without predictors - intercept only

  summ(FitStart)

Observations	372
Dependent variable	knowledge_binary
Type	Generalized linear model
Family	binomial
Link	logit

χ²(0)	-0.00
p	NA
Pseudo-R² (Cragg-Uhler)	0.00
Pseudo-R² (McFadden)	0.00
AIC	424.71
BIC	428.63

	Est.	S.E.	z val.	p
(Intercept)	-1.07	0.12	-9.00	0.00
Standard errors: MLE

  Fitforward <- step(FitStart, direction = "forward", scope=formula(FitAll))

Start: AIC=424.71 knowledge_binary ~ 1

                             Df Deviance    AIC

Province 4 355.42 365.42
Sources_of_information_on_Mpox 4 380.40 390.40
Occupation 2 406.87 412.87
Income_category 1 414.24 418.24
Marital_status 1 419.87 423.87
Age_categories 2 417.99 423.99 422.71 424.71
Residence 1 422.35 426.35
Sex 1 422.40 426.40
Education_level 3 422.32 430.32

Step: AIC=365.42 knowledge_binary ~ Province

                             Df Deviance    AIC

Sources_of_information_on_Mpox 4 319.74 337.74
Residence 1 347.64 359.64
Occupation 2 346.42 360.42
Age_categories 2 349.13 363.13
Income_category 1 352.63 364.63 355.42 365.42
Sex 1 353.50 365.50
Marital_status 1 354.78 366.78
Education_level 3 355.21 371.21

Step: AIC=337.74 knowledge_binary ~ Province + Sources_of_information_on_Mpox

              Df Deviance    AIC

Occupation 2 310.13 332.13
Residence 1 314.54 334.54
Age_categories 2 313.39 335.39
Sex 1 316.54 336.54
Income_category 1 317.40 337.40 319.74 337.74
Marital_status 1 319.42 339.42
Education_level 3 319.52 343.52

Step: AIC=332.13 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation

              Df Deviance    AIC

Residence 1 306.62 330.62
Sex 1 307.61 331.61 310.13 332.13
Age_categories 2 306.38 332.38
Marital_status 1 309.20 333.20
Income_category 1 309.59 333.59
Education_level 3 310.00 338.00

Step: AIC=330.62 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence

              Df Deviance    AIC

Age_categories 2 302.39 330.39 306.62 330.62
Sex 1 304.64 330.64
Marital_status 1 305.98 331.98
Income_category 1 306.36 332.36
Education_level 3 306.42 336.42

Step: AIC=330.39 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories

              Df Deviance    AIC

Sex 1 300.15 330.15 302.39 330.39
Marital_status 1 300.54 330.54
Income_category 1 302.22 332.22
Education_level 3 302.31 336.31

Step: AIC=330.15 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

              Df Deviance    AIC

300.15 330.15 + Marital_status 1 298.59 330.59 + Income_category 1 300.02 332.02 + Education_level 3 300.03 336.03

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  #BACKWARD SELECTION

  Fitback <- step(FitAll, direction = "backward", scope=formula(FitAll))

Start: AIC=338.12 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

                             Df Deviance    AIC

Education_level 3 298.37 332.37
Income_category 1 298.44 336.44
Marital_status 1 299.81 337.81
Sex 1 300.08 338.08 298.12 338.12
Residence 1 300.81 338.81
Occupation 2 302.89 338.89
Age_categories 2 303.58 339.58
Sources_of_information_on_Mpox 4 332.34 364.34
Province 4 350.11 382.11

Step: AIC=332.37 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Occupation + Income_category + Sources_of_information_on_Mpox

                             Df Deviance    AIC

Income_category 1 298.59 330.59
Marital_status 1 300.02 332.02
Sex 1 300.28 332.28 298.37 332.37
Residence 1 301.04 333.04
Occupation 2 303.08 333.08
Age_categories 2 303.91 333.91
Sources_of_information_on_Mpox 4 333.23 359.23
Province 4 350.70 376.70

Step: AIC=330.59 knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Occupation + Sources_of_information_on_Mpox

                             Df Deviance    AIC

Marital_status 1 300.15 330.15
Sex 1 300.54 330.54 298.59 330.59
Residence 1 301.49 331.49
Occupation 2 304.02 332.02
Age_categories 2 304.22 332.22
Sources_of_information_on_Mpox 4 334.09 358.09
Province 4 352.50 376.50

Step: AIC=330.15 knowledge_binary ~ Province + Age_categories + Sex + Residence + Occupation + Sources_of_information_on_Mpox

                             Df Deviance    AIC

300.15 330.15 - Sex 1 302.39 330.39 - Age_categories 2 304.64 330.64 - Occupation 2 305.06 331.06 - Residence 1 303.45 331.45 - Sources_of_information_on_Mpox 4 335.41 357.41 - Province 4 359.55 381.55

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  #STEWISE SELECTION

  Fitstep <- step(FitStart, direction = "both", scope=formula(FitAll))

Start: AIC=424.71 knowledge_binary ~ 1

                             Df Deviance    AIC

Province 4 355.42 365.42
Sources_of_information_on_Mpox 4 380.40 390.40
Occupation 2 406.87 412.87
Income_category 1 414.24 418.24
Marital_status 1 419.87 423.87
Age_categories 2 417.99 423.99 422.71 424.71
Residence 1 422.35 426.35
Sex 1 422.40 426.40
Education_level 3 422.32 430.32

Step: AIC=365.42 knowledge_binary ~ Province

                             Df Deviance    AIC

Sources_of_information_on_Mpox 4 319.74 337.74
Residence 1 347.64 359.64
Occupation 2 346.42 360.42
Age_categories 2 349.13 363.13
Income_category 1 352.63 364.63 355.42 365.42
Sex 1 353.50 365.50
Marital_status 1 354.78 366.78
Education_level 3 355.21 371.21
Province 4 422.71 424.71

Step: AIC=337.74 knowledge_binary ~ Province + Sources_of_information_on_Mpox

                             Df Deviance    AIC

Occupation 2 310.13 332.13
Residence 1 314.54 334.54
Age_categories 2 313.39 335.39
Sex 1 316.54 336.54
Income_category 1 317.40 337.40 319.74 337.74
Marital_status 1 319.42 339.42
Education_level 3 319.52 343.52
Sources_of_information_on_Mpox 4 355.42 365.42
Province 4 380.40 390.40

Step: AIC=332.13 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation

                             Df Deviance    AIC

Residence 1 306.62 330.62
Sex 1 307.61 331.61 310.13 332.13
Age_categories 2 306.38 332.38
Marital_status 1 309.20 333.20
Income_category 1 309.59 333.59
Occupation 2 319.74 337.74
Education_level 3 310.00 338.00
Sources_of_information_on_Mpox 4 346.42 360.42
Province 4 363.60 377.60

Step: AIC=330.62 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence

                             Df Deviance    AIC

Age_categories 2 302.39 330.39 306.62 330.62
Sex 1 304.64 330.64
Marital_status 1 305.98 331.98
Residence 1 310.13 332.13
Income_category 1 306.36 332.36
Occupation 2 314.54 334.54
Education_level 3 306.42 336.42
Sources_of_information_on_Mpox 4 340.91 356.91
Province 4 362.83 378.83

Step: AIC=330.39 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories

                             Df Deviance    AIC

Sex 1 300.15 330.15 302.39 330.39
Marital_status 1 300.54 330.54
Age_categories 2 306.62 330.62
Occupation 2 307.92 331.92
Income_category 1 302.22 332.22
Residence 1 306.38 332.38
Education_level 3 302.31 336.31
Sources_of_information_on_Mpox 4 336.33 356.33
Province 4 360.44 380.44

Step: AIC=330.15 knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

                             Df Deviance    AIC

300.15 330.15 - Sex 1 302.39 330.39 + Marital_status 1 298.59 330.59 - Age_categories 2 304.64 330.64 - Occupation 2 305.06 331.06 - Residence 1 303.45 331.45 + Income_category 1 300.02 332.02 + Education_level 3 300.03 336.03 - Sources_of_information_on_Mpox 4 335.41 357.41 - Province 4 359.55 381.55

  formula(FitAll)

knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

  summ(FitAll, exp = T)

Observations	372
Dependent variable	knowledge_binary
Type	Generalized linear model
Family	binomial
Link	logit

χ²(19)	124.60
p	0.00
Pseudo-R² (Cragg-Uhler)	0.42
Pseudo-R² (McFadden)	0.29
AIC	338.12
BIC	416.49

	exp(Est.)	2.5%	97.5%	z val.	p
(Intercept)	0.18	0.04	0.81	-2.23	0.03
ProvinceCentral	1.25	0.53	2.92	0.51	0.61
ProvinceWestern	13.48	5.67	32.04	5.89	0.00
ProvinceCopperbelt	0.83	0.33	2.10	-0.40	0.69
ProvinceMuchinga	1.45	0.38	5.44	0.54	0.59
Age_categories35–54	3.25	1.10	9.57	2.13	0.03
Age_categories18–34	1.95	0.67	5.65	1.23	0.22
SexMale	1.58	0.83	2.98	1.40	0.16
Marital_statusUnmarried	1.53	0.81	2.92	1.30	0.19
ResidenceRural	0.40	0.13	1.23	-1.59	0.11
Education_levelPrimary	0.92	0.23	3.59	-0.12	0.90
Education_levelSecondary	0.78	0.21	2.95	-0.36	0.72
Education_levelTertiary	0.85	0.21	3.56	-0.22	0.83
OccupationStudent	0.65	0.17	2.43	-0.64	0.52
OccupationUnemployed	0.42	0.19	0.95	-2.08	0.04
Income_category≥ 3000	1.21	0.63	2.33	0.57	0.57
Sources_of_information_on_MpoxRadio/TV	0.13	0.06	0.32	-4.53	0.00
Sources_of_information_on_MpoxInternet/social media	0.37	0.15	0.89	-2.21	0.03
Sources_of_information_on_MpoxFamily/friends	0.23	0.07	0.71	-2.56	0.01
Sources_of_information_on_MpoxPrint/community sources	0.08	0.01	0.63	-2.39	0.02
Standard errors: MLE

  length(coef(FitAll))

[1] 20

  # All these converge on the same variables
  formula(Fitback)

knowledge_binary ~ Province + Age_categories + Sex + Residence + Occupation + Sources_of_information_on_Mpox

  formula(Fitforward)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  formula(Fitstep)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

  length(coef(Fitstep))

[1] 15

  pacman::p_load(stats, pscl, DescTools)

# Log-likelihood, AIC and BIC
    logstats <- rbind(logLik(FitAll),   logLik(Fitstep), logLik(Fitdaliso)) 
    aicstats <- rbind(AIC(FitAll),  AIC(Fitstep), AIC(Fitdaliso)) 
    bicstats <- rbind(BIC(FitAll),  BIC(Fitstep), BIC(Fitdaliso)) 
    
    fitstats1 <- cbind(logstats, aicstats, bicstats)
    rm(logstats, aicstats, bicstats)
    fitstats1

      [,1]     [,2]     [,3]

[1,] -149.0584 338.1168 416.4947 [2,] -150.0774 330.1548 388.9382 [3,] -155.0633 332.1266 375.2344

    rownames(fitstats1) <- c( "Full model", "Stepwise Selection Model", "Daliso's model")
    colnames(fitstats1) <- c("Log-likelihood", "AIC", "BIC")
    formula(FitAll)

knowledge_binary ~ Province + Age_categories + Sex + Marital_status + Residence + Education_level + Occupation + Income_category + Sources_of_information_on_Mpox

    formula(Fitstep)

knowledge_binary ~ Province + Sources_of_information_on_Mpox + Occupation + Residence + Age_categories + Sex

    formula(Fitdaliso)

knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox

    fitstats1

                     Log-likelihood      AIC      BIC

Full model -149.0584 338.1168 416.4947 Stepwise Selection Model -150.0774 330.1548 388.9382 Daliso’s model -155.0633 332.1266 375.2344

    # aic1 <- -2(log-lik) + 2*length(coef(FitAll))
    nobs(FitAll)

[1] 372

    # bic1 <-    -2(log-lik) + 2*log(nobs(FitAll))*length(coef(FitAll))

    
  best_mod <- tbl_regression(Fitstep, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)
  
  daliso_mod <- tbl_regression(Fitdaliso, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)
  
  merged_table <- tbl_merge(
  tbls = list(best_mod, daliso_mod),
  tab_spanner = c("**Best fit**", "**Other**")
)
  
  merged_table

Characteristic	Best fit			Other
Characteristic	OR	95% CI	p-value	OR	95% CI	p-value
Provinces
Lusaka	—	—		—	—
Central	1.18	0.51, 2.67	0.697	1.08	0.47, 2.42	0.857
Western	14.6	6.47, 35.1	<0.001	10.3	4.88, 22.7	<0.001
Copperbelt	0.81	0.31, 1.97	0.644	0.80	0.31, 1.93	0.623
Muchinga	1.72	0.47, 5.99	0.401	1.06	0.33, 3.13	0.916
Sources of Mpox information
Health care worker	—	—		—	—
Radio/TV	0.13	0.05, 0.30	<0.001	0.13	0.05, 0.29	<0.001
Internet/social media	0.38	0.16, 0.89	0.031	0.38	0.16, 0.87	0.027
Family/friends	0.24	0.07, 0.67	0.011	0.25	0.08, 0.68	0.011
Print/community sources	0.08	0.00, 0.41	0.016	0.08	0.00, 0.41	0.016
Occupation
Employed	—	—		—	—
Student	0.76	0.19, 2.55	0.671	0.62	0.16, 2.00	0.455
Unemployed	0.43	0.19, 0.91	0.034	0.33	0.15, 0.68	0.004
Residence
Urban	—	—
Rural	0.37	0.12, 1.08	0.079
Age (years)
55+	—	—
35–54	2.80	1.03, 8.53	0.054
18–34	1.82	0.67, 5.49	0.261
Sex
Female	—	—
Male	1.61	0.86, 3.03	0.134
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

  # Let's draw an S-curve for a logistic regression. Works best when we've a continuous variable
  #  I've used age as continuous
  
  mv2_mod <- glm(knowledge_binary ~ Province + Sources_of_information_on_Mpox + 
    Occupation + Residence + Age + Sex,
  data = step_dat,
  family = binomial)
  
  tbl_regression(mv2_mod, exponentiate = TRUE,
    pvalue_fun = ~ style_pvalue(.x, digits = 3)) %>%
  bold_p(t = 0.05)

Characteristic	OR	95% CI	p-value
Provinces
Lusaka	—	—
Central	1.10	0.47, 2.48	0.825
Western	13.0	5.89, 30.5	<0.001
Copperbelt	0.82	0.32, 2.00	0.670
Muchinga	1.68	0.46, 5.86	0.421
Sources of Mpox information
Health care worker	—	—
Radio/TV	0.13	0.05, 0.31	<0.001
Internet/social media	0.39	0.16, 0.90	0.031
Family/friends	0.26	0.08, 0.72	0.015
Print/community sources	0.07	0.00, 0.40	0.015
Occupation
Employed	—	—
Student	0.63	0.16, 2.09	0.471
Unemployed	0.36	0.16, 0.76	0.011
Residence
Urban	—	—
Rural	0.39	0.12, 1.13	0.094
Age (years)	1.00	0.97, 1.02	0.737
Sex
Female	—	—
Male	1.53	0.83, 2.82	0.175
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

  #define new data frame that contains predictor variable
  library(ciTools)
  step_dat <- add_probs(step_dat, mv2_mod, q = 0, comparison = "=")
 
  
 #plot logistic regression curve
ggplot(step_dat, aes(x = pred, y = knowledge_binary)) + 
  geom_point(colour = "dodgerblue4",
             size = 5,
             alpha = 0.8) +
  stat_smooth(method="glm", se=T, method.args = list(family=binomial),
              colour = "darkred",
              size = 1) +
  labs(x = "Predicted probabilities",
       y = "Probability",
       title = "Logistic Regression Curve") +
  theme_classic() +
    theme(
    axis.title = element_text(size = 20, face = "bold"),   # X-axis title
    axis.text.x  = element_text(size = 18, colour = "black"),   
    axis.text  = element_text(size = 18, colour = "black") 
    ) +
  scale_x_continuous(breaks = seq(min(step_dat$Age), 
                                  max(step_dat$Age), 
                                  by = 5))

newdat <- step_dat|> dplyr::select(Age, pred) |> mutate(outcome = ifelse(pred <0.5, 0,1)) |> arrange(Age)

investigator led step wise regression

In this study, i only kept variables that were stattistically significant at p < 0.05

I removed variables following this order

education
income
marital status
sex
age
residence

mv_model <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.08	0.47, 2.42	0.857
Western	10.3	4.88, 22.7	<0.001
Copperbelt	0.80	0.31, 1.93	0.623
Muchinga	1.06	0.33, 3.13	0.916
Occupation
Employed	—	—
Student	0.62	0.16, 2.00	0.455
Unemployed	0.33	0.15, 0.68	0.004
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.13	0.05, 0.29	<0.001
Internet/social media	0.38	0.16, 0.87	0.027
Family/friends	0.25	0.08, 0.68	0.011
Print/community sources	0.08	0.00, 0.41	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Compare a full model with a model with Residence

# full model 
model1  <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox+Residence,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.08	0.47, 2.42	0.857
Western	10.3	4.88, 22.7	<0.001
Copperbelt	0.80	0.31, 1.93	0.623
Muchinga	1.06	0.33, 3.13	0.916
Occupation
Employed	—	—
Student	0.62	0.16, 2.00	0.455
Unemployed	0.33	0.15, 0.68	0.004
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.13	0.05, 0.29	<0.001
Internet/social media	0.38	0.16, 0.87	0.027
Family/friends	0.25	0.08, 0.68	0.011
Print/community sources	0.08	0.00, 0.41	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

model1

Call: glm(formula = knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox + Residence, family = binomial, data = sub_set)

Coefficients: (Intercept)
-0.60079
ProvinceCentral
0.02556
ProvinceWestern
2.47543
ProvinceCopperbelt
-0.27043
ProvinceMuchinga
0.57964
OccupationStudent
-0.40658
OccupationUnemployed
-1.04250
Sources_of_information_on_MpoxRadio/TV
-1.97892
Sources_of_information_on_MpoxInternet/social media
-0.97572
Sources_of_information_on_MpoxFamily/friends
-1.33189
Sources_of_information_on_MpoxPrint/community sources
-2.60987
ResidenceRural
-1.00895

Degrees of Freedom: 371 Total (i.e. Null); 360 Residual Null Deviance: 422.7 Residual Deviance: 306.6 AIC: 330.6

# without residence  
model2 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

tbl_regression(
  mv_model, 
  exponentiate = TRUE,  # show odds ratios instead of log-odds
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.08	0.47, 2.42	0.857
Western	10.3	4.88, 22.7	<0.001
Copperbelt	0.80	0.31, 1.93	0.623
Muchinga	1.06	0.33, 3.13	0.916
Occupation
Employed	—	—
Student	0.62	0.16, 2.00	0.455
Unemployed	0.33	0.15, 0.68	0.004
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.13	0.05, 0.29	<0.001
Internet/social media	0.38	0.16, 0.87	0.027
Family/friends	0.25	0.08, 0.68	0.011
Print/community sources	0.08	0.00, 0.41	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

model2

Call: glm(formula = knowledge_binary ~ Province + Occupation + Sources_of_information_on_Mpox, family = binomial, data = sub_set)

Coefficients: (Intercept)
-0.62299
ProvinceCentral
0.07552
ProvinceWestern
2.33071
ProvinceCopperbelt
-0.22756
ProvinceMuchinga
0.05972
OccupationStudent
-0.47158
OccupationUnemployed
-1.11892
Sources_of_information_on_MpoxRadio/TV
-2.03969
Sources_of_information_on_MpoxInternet/social media
-0.96022
Sources_of_information_on_MpoxFamily/friends
-1.40486
Sources_of_information_on_MpoxPrint/community sources
-2.58732

Degrees of Freedom: 371 Total (i.e. Null); 361 Residual Null Deviance: 422.7 Residual Deviance: 310.1 AIC: 332.1

model selection

# --- Full model (with residence) ---
model1 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox+Residence,
  data = sub_set,
  family = binomial)

# Create formatted regression table
model1_tbl <- tbl_regression(
  model1,
  exponentiate = TRUE,   # show Odds Ratios
  pvalue_fun = ~ style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)


# --- Reduced model (without residence) ---
model2 <- glm(
  knowledge_binary ~ 
    Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial)

# Create formatted regression table
model2_tbl <- tbl_regression(
  model2,
  exponentiate = TRUE,
  pvalue_fun = ~ style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05)


# --- Model comparison using AIC ---
AIC(model1, model2)

   df      AIC

model1 12 330.6191 model2 11 332.1266

# --- Display regression tables ---
model1_tbl

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.03	0.44, 2.30	0.951
Western	11.9	5.49, 27.2	<0.001
Copperbelt	0.76	0.30, 1.84	0.559
Muchinga	1.79	0.49, 6.18	0.364
Occupation
Employed	—	—
Student	0.67	0.17, 2.11	0.517
Unemployed	0.35	0.16, 0.74	0.008
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.14	0.06, 0.31	<0.001
Internet/social media	0.38	0.15, 0.86	0.025
Family/friends	0.26	0.08, 0.74	0.016
Print/community sources	0.07	0.00, 0.41	0.016
Residence
Urban	—	—
Rural	0.36	0.12, 1.05	0.070
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

model2_tbl

Characteristic	OR	95% CI	p-value
Province
Lusaka	—	—
Central	1.08	0.47, 2.42	0.857
Western	10.3	4.88, 22.7	<0.001
Copperbelt	0.80	0.31, 1.93	0.623
Muchinga	1.06	0.33, 3.13	0.916
Occupation
Employed	—	—
Student	0.62	0.16, 2.00	0.455
Unemployed	0.33	0.15, 0.68	0.004
Sources_of_information_on_Mpox
Health care worker	—	—
Radio/TV	0.13	0.05, 0.29	<0.001
Internet/social media	0.38	0.16, 0.87	0.027
Family/friends	0.25	0.08, 0.68	0.011
Print/community sources	0.08	0.00, 0.41	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

# --- Model comparison using BIC ---
BIC(model1, model2)

   df      BIC

model1 12 377.6458 model2 11 375.2344

# Collect AIC and BIC values
model_comp <- data.frame(
  Model = c("Model 1: Full (with residence)", "Model 2: Reduced (without residence)"),
  df    = c(12, 11),
  AIC   = c(330.6191, 332.1266  ),
  BIC   = c(377.6458, 375.2344)
)

# Option 1: Simple table for manuscript (inline in R Markdown/console)
knitr::kable(model_comp, caption = "Comparison of AIC and BIC for Logistic Regression Models")

Comparison of AIC and BIC for Logistic Regression Models
Model	df	AIC	BIC
Model 1: Full (with residence)	12	330.6191	377.6458
Model 2: Reduced (without residence)	11	332.1266	375.2344

# Option 2: Polished table with gt (for HTML/PDF output)
model_comp %>%
  gt() %>%
  tab_header(
    title = "Comparison of Model Fit Statistics",
    subtitle = "Logistic regression predicting overall Mpox knowledge"
  ) %>%
  fmt_number(columns = c(AIC, BIC), decimals = 3)

Model	df	AIC	BIC
Comparison of Model Fit Statistics
Logistic regression predicting overall Mpox knowledge
Model 1: Full (with residence)	12	330.619	377.646
Model 2: Reduced (without residence)	11	332.127	375.234

# --- Export to Word using flextable ---
ft <- flextable(model_comp) %>%
  autofit() %>%
  set_header_labels(
    Model = "Model",
    df    = "Degrees of freedom",
    AIC   = "AIC",
    BIC   = "BIC"
  )

doc <- read_docx() %>%
  body_add_par("Table 4. Comparison of Model Fit Statistics", style = "heading 2") %>%
  body_add_flextable(ft)

print(doc, target = "Model_Fit_Comparison.docx")

merged

# Univariable model
uv_model <- tbl_uvregression(
  data = sub_set,
  y =  knowledge_binary,
  method = glm,
  method.args = list(family = binomial),
  include = c("Province","Occupation", "Sources_of_information_on_Mpox"),
  exponentiate = TRUE
) %>%
  bold_p(t = 0.05) %>%
  modify_header(
    estimate ~ "**UOR**",  # Unadjusted Odds Ratio
    p.value ~ "**P-value**"
  ) %>%
  modify_fmt_fun(p.value ~ function(x) style_pvalue(x, digits = 3)) %>%
  bold_labels()

# Multivariable model
model2 <- glm(
   knowledge_binary ~ Province+ 
    Occupation+Sources_of_information_on_Mpox,
  data = sub_set,
  family = binomial
)

mv_model <- tbl_regression(
  model2,
  exponentiate = TRUE,
  pvalue_fun = ~style_pvalue(.x, digits = 3)
) %>%
  bold_p(t = 0.05) %>%
  modify_header(estimate ~ "**AOR**") %>%
  bold_labels()

# Merge both models
final_model <- tbl_merge(
  tbls = list(uv_model, mv_model),
  tab_spanner = c("**Univariable Model**", "**Multivariable Model**")
) %>%
  modify_caption("**Factors Associated with Mpox Knowledge (Binary Outcome)**")

# Export to Word
doc <- read_docx() %>%
  body_add_flextable(value = as_flex_table(final_model)) %>%
  body_add_par(" ", style = "Normal")

print(doc, target = "Final_Knowledge_Model.docx")

# Display the table
final_model

**Factors Associated with Mpox Knowledge (Binary Outcome)**
Characteristic	Univariable Model				Multivariable Model
Characteristic	N	UOR	95% CI	P-value	AOR	95% CI	p-value
Province	372
Lusaka		—	—		—	—
Central		1.31	0.61, 2.76	0.477	1.08	0.47, 2.42	0.857
Western		12.0	6.17, 24.2	<0.001	10.3	4.88, 22.7	<0.001
Copperbelt		1.08	0.44, 2.44	0.863	0.80	0.31, 1.93	0.623
Muchinga		1.85	0.62, 4.95	0.238	1.06	0.33, 3.13	0.916
Occupation	372
Employed		—	—		—	—
Student		0.55	0.15, 1.56	0.303	0.62	0.16, 2.00	0.455
Unemployed		0.29	0.14, 0.55	<0.001	0.33	0.15, 0.68	0.004
Sources_of_information_on_Mpox	372
Health care worker		—	—		—	—
Radio/TV		0.17	0.07, 0.35	<0.001	0.13	0.05, 0.29	<0.001
Internet/social media		0.35	0.16, 0.71	0.005	0.38	0.16, 0.87	0.027
Family/friends		0.33	0.12, 0.79	0.019	0.25	0.08, 0.68	0.011
Print/community sources		0.06	0.00, 0.28	0.005	0.08	0.00, 0.41	0.016
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Based on the BIC, model 2 best predicts the outcome of interest well.

Analysis flow diagram

library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(knitr)

# Build diagram
diagram <- grViz("
digraph flow {
  graph [rankdir = TB, fontsize = 12]
  node [shape = box, style = filled, fillcolor = white, fontname = Helvetica]

  A [label = 'Individuals enrolled and interviewed\\n n = 815']
  B [label = 'Records removed during data cleaning\\n n = 6\\n - Missing critical data (consent/filter question)']
  C [label = 'Final analytic sample for AWARENESS analysis\\n N = 809']
  D [label = 'All participants answered filter question:\\n Have you ever heard of Mpox?']
  E [label = 'YES\\n n = 372']
  F [label = 'NO\\n n = 437']
  G [label = 'Completed full knowledge questionnaire']
  H [label = 'Included only in analyses of\\n awareness prevalence and its correlates']
  I [label = 'Primary analytic sample for KNOWLEDGE analysis\\n n = 372\\n (Assessed for knowledge level, sources of info, and associated factors)']

  A -> B -> C -> D
  D -> E
  D -> F
  E -> G -> I
  F -> H
}
")

# Export static PNG for R Markdown
svg <- export_svg(diagram)
rsvg_png(charToRaw(svg), "flow_diagram.png", width = 2400, height = 3200)

# Include the image in the knitted document
knitr::include_graphics("flow_diagram.png")

Mapping the sources of Mpox information among study participants

# Step 1: Summarise counts and percentages
mpox_map <- sub_set %>%
  group_by(Province, Sources_of_information_on_Mpox) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(Province) %>%
  mutate(total = sum(n),
         percent = round((n / total) * 100, 1)) %>%
  ungroup()

# Step 2: Read shapefile
provinces <- st_read("C:/Users/USER1/OneDrive/Documents/Office Work_ ZNPHI/Daliso_Mpox/Data/Shapefiles. Admin boundaries/zmb_admbnda_adm1_dmmu_20201124.shp")

Reading layer zmb_admbnda_adm1_dmmu_20201124' from data sourceC:Work_ ZNPHI_Mpox. Admin boundaries_admbnda_adm1_dmmu_20201124.shp’ using driver `ESRI Shapefile’ Simple feature collection with 10 features and 12 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976 Geodetic CRS: WGS 84

# Step 3: Join shapefile with source data
map_radio <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Radio/TV"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

# Step 4: Plot full Zambia map with proportions shaded
ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +   # base map
  geom_sf(data = map_radio, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_radio, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#FFE5B4", high = "#FF8C00", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Radio/TV")

map_internet <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Internet/social media"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +
  geom_sf(data = map_internet, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_internet, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#D0E6FF", high = "#0073E6", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Internet/Social Media")

  map_family <- provinces %>%
    right_join(filter(mpox_map, Sources_of_information_on_Mpox == "Family/friends"),
               by = c("ADM1_EN" = "Province")) %>%
    mutate(percent = ifelse(is.na(percent), 0, percent))
  
  ggplot() +
    geom_sf(data = provinces, fill = "grey95", color = "black") +
    geom_sf(data = map_family, aes(fill = percent), color = "black") +
    geom_sf_text(data = filter(map_family, percent > 0),
                 aes(label = paste0(percent, "%")), size = 3) +
    scale_fill_gradient(low = "#FAD0E6", high = "#E7298A", na.value = "grey95") +
    annotation_scale(location = "bl", width_hint = 0.3) +
    annotation_north_arrow(location = "tl", which_north = "true",
                           style = north_arrow_fancy_orienteering) +
    theme_minimal() +
    theme(panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.text = element_blank(),
          axis.ticks = element_blank(),
          axis.title = element_blank()) +
    labs(fill = "Proportion (%)",
         title = "Proportion of Mpox Information from Family/Friends")

# Example for Health care worker
map_health <- provinces %>%
  left_join(filter(mpox_map, Sources_of_information_on_Mpox == "Health care worker"),
            by = c("ADM1_EN" = "Province")) %>%
  mutate(percent = ifelse(is.na(percent), 0, percent))

ggplot() +
  geom_sf(data = provinces, fill = "grey95", color = "black") +
  geom_sf(data = map_health, aes(fill = percent), color = "black") +
  geom_sf_text(data = filter(map_health, percent > 0),
               aes(label = paste0(percent, "%")), size = 3) +
  scale_fill_gradient(low = "#E0F8E0", high = "#00B050", na.value = "grey95") +
  annotation_scale(location = "bl", width_hint = 0.3) +
  annotation_north_arrow(location = "tl", which_north = "true",
                         style = north_arrow_fancy_orienteering) +
  theme_minimal() +
  theme(panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank()) +
  labs(fill = "Proportion (%)",
       title = "Proportion of Mpox Information from Health Care Workers")

study seeting map

# Load required packages
library(sf)        # spatial data
library(ggplot2)   # plotting
library(dplyr)     # data wrangling
library(ggspatial) # scale bar, north arrow

# 1. Read Zambia ADM2 shapefile
zmb_adm2 <- st_read("C:/Users/USER1/OneDrive/Documents/Office Work_ ZNPHI/Daliso_Mpox/Data/Shapefiles. Admin boundaries/zmb_admbnda_adm2_dmmu_20201124.shp")

Reading layer zmb_admbnda_adm2_dmmu_20201124' from data sourceC:Work_ ZNPHI_Mpox. Admin boundaries_admbnda_adm2_dmmu_20201124.shp’ using driver `ESRI Shapefile’ Simple feature collection with 115 features and 14 fields Geometry type: POLYGON Dimension: XY Bounding box: xmin: 21.98004 ymin: -18.07918 xmax: 33.71244 ymax: -8.271976 Geodetic CRS: WGS 84

# 2. Define sampled districts
district_sampled <- data.frame(
  District = c("Chilanga","Kabwe","Lusaka","Mkushi","Mongu","Mpika","Ndola","Senanga"),
  sampled = c("Chilanga","Kabwe","Lusaka","Mkushi","Mongu","Mpika","Ndola","Senanga")
)

# 3. Join sampled districts to shapefile
zmb_adm2_flag <- zmb_adm2 %>%
  left_join(district_sampled, by = c("ADM2_EN" = "District"))

# 4. Plot map
ggplot() +
  # Plain background for non-sampled districts
  geom_sf(data = zmb_adm2_flag %>% filter(is.na(sampled)),
          fill = "white", color = "grey80", size = 0.2) +
  
  # Highlight sampled districts in distinct colors
  geom_sf(data = zmb_adm2_flag %>% filter(!is.na(sampled)),
          aes(fill = sampled), color = "black", size = 0.3) +
  
  # Assign colors for each sampled district
  scale_fill_manual(values = c(
    "Chilanga" = "red",
    "Kabwe"    = "blue",
    "Lusaka"   = "green",
    "Mkushi"   = "purple",
    "Mongu"    = "orange",
    "Mpika"    = "brown",
    "Ndola"    = "pink",
    "Senanga"  = "cyan"
  )) +
  
  # Standard map elements
  annotation_scale(location = "bl", width_hint = 0.5) +
  annotation_north_arrow(location = "tl", style = north_arrow_fancy_orienteering) +
  theme_void() +   # removes grid lines and axes
  labs(title = "",
       subtitle = "",
       fill = "Sampled Districts")

# 5. Export high-resolution image
ggsave("study_area_map_multicolor.png", width = 8, height = 6, dpi = 300)

KAP-2025

DALISO NGULUBE

2025-10-01

Executive Summary

Analytical Environment Setup and Package Management

Data Import and Initial Cleaning

Remove some columns

Renaming Columns

Renaming Columns for Mpox Awareness Sources

Renaming Columns for Mpox Knowledge, Transmission, and Symptoms

Renaming Columns for Attitudes Toward Mpox

Renaming Columns for Practices Toward Mpox

Recoding Occupational Status & Marital Status

Adding Variable Labels- table 1

Labeling Mpox Awareness Variables

Labeling Knowledge Variables

Labeling of variables-attitude questions

Labeling of variables-practices questions

Assessing the distribution of age and monthly household income

Filter the NAs for age. these where entered during the practice sessions

Categorizing age

Remove some out liers

Descriptive analysis

Ever heard of Mpox

Sub analysis

Source of Information on Mpox

Analyze Sources of Mpox Information by Province

Create Table of Mpox Information Sources by Province

Summarize Knowledge of Mpox Transmission

Define Transmission Variables and Labels

Prepare Data for Knowledge of Transmission Plot

Visualize Knowledge of Mpox Transmission Modes

Summarize Knowledge of Mpox Signs and Symptoms

Visualize Knowledge of Mpox Symptoms

Summarize Knowledge on Mpox Prevention

Visualizing Knowledge of Mpox Prevention

Calculating Overall Knowledge of Mpox

A tibble: 2 × 4

chart without percentage labels

creating a column for the outcome

Stratified Summary of Participant Characteristics by Knowledge Level

Objective 3 : To identify socio-demographic factors associated with variations in knowledge related to Mpox among community members in selected districts of Zambia

merging the outputs

Stepwise Selection

investigator led step wise regression

Compare a full model with a model with Residence

model selection

merged

Analysis flow diagram

Mapping the sources of Mpox information among study participants

study seeting map