Rows (households): 409
Columns (variables): 147
Exploratory Data Analysis for Brucellosis
This is an exploratory data analysis (EDA) of the brucellosis knowledge, attitudes, and practices (KAP) survey conducted across six locations in Isiolo County, Kenya: garbatulla_reserve, sericho_reserve, kina_reserve, kina_main, sericho_main, and garbatulla_main. The survey was administered to 409 livestock-keeping households.
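The purpose of this document is not just to describe the data but to check whether the dataset can actually answer the four research objectives set out in the analysis plan: awareness and knowledge (Objective 1), management practices (Objective 2), perceptions and attitudes (Objective 3), and socioeconomic status (SES) and access (Objective 4).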
A large share of “missingness” in this survey is not random; it is structural, caused by the collection tool’s skip logic. For example, only respondents who answered “Yes” to “Are you aware of brucellosis?” (brucella_aware) were shown the entire knowledge section (Section B) and several subsequent sections. Low coverage on a variable is therefore often a feature of the questionnaire design, not a data quality failure. Each section below makes clear which kind of missingness is in play.
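The distinction between structural and unexplained missingness can be checked mechanically. The following is a minimal sketch in plain Python, assuming missing answers are stored as `None` and using a hypothetical item name `know_item`; only `brucella_aware` comes from the survey itself.

```python
# Toy records for illustration only; None marks a missing answer.
# "know_item" is a hypothetical knowledge-section variable.
records = [
    {"brucella_aware": "Yes", "know_item": 1},
    {"brucella_aware": "Yes", "know_item": None},  # gate passed, answer missing
    {"brucella_aware": "No", "know_item": None},   # skipped by design
    {"brucella_aware": "No", "know_item": None},   # skipped by design
]


def classify_missing(rows, gate, item, gate_pass="Yes"):
    """Count answered, structurally missing, and unexplained missing values."""
    counts = {"answered": 0, "structural": 0, "unexplained": 0}
    for row in rows:
        if row[item] is not None:
            counts["answered"] += 1
        elif row[gate] != gate_pass:
            # The respondent never saw the question because of skip logic.
            counts["structural"] += 1
        else:
            # The respondent should have seen the question but has no answer.
            counts["unexplained"] += 1
    return counts


result = classify_missing(records, "brucella_aware", "know_item")
print(result)
```

Only the "unexplained" count would point to a potential data quality problem; the "structural" count simply mirrors the questionnaire routing.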
Before looking at brucellosis-specific responses, it’s worth understanding who was surveyed, since every objective will use these demographic and livelihood variables as explanatory factors.
| ward | n_households | pct |
|---|---|---|
| Garbatulla | 174 | 42.5 |
| Kinna | 144 | 35.2 |
| Sericho | 91 | 22.2 |
The survey spans three wards (Garbatulla, Kinna, Sericho), each split into a “main” town area and a “reserve” (more remote pastoral) area, giving six location strata in total. Garbatulla ward has the largest share of the sample, Sericho the smallest.
| Statistic | Value |
|---|---|
| Mean age | 43.6 |
| Median age | 40.0 |
| % Female | 44.0 |
| Mean household size | 7.0 |
| Mean years schooling | 4.8 |
| % No formal education | 51.5 |
The sample is mostly pastoralist, which fits the research context well since brucellosis is fundamentally a disease of livestock-keeping populations.
For Objectives 1 and 2 (factors associated with knowledge, and management practices), age_group, edu_cat, sex, income_source, and no_hh_members are all fully populated (100% or near-100% coverage) and ready to be used as explanatory variables. There is enough spread across age groups and education categories to support meaningful comparisons.
This section asks two questions: how many households have heard of brucellosis, and, among those who have, how much do they actually know about it?
84.1% of respondents (344 of 409) reported having heard of brucellosis before. This is the key skip-logic gate for the rest of the knowledge section: only these 344 respondents went on to answer the detailed knowledge items below.
Awareness (“have you heard of it”) is a much lower bar than knowledge (“do you know how it spreads, what it looks like, and how it affects people”). The questionnaire probed four separate knowledge domains, each captured as a set of multi-select items now expanded into 0/1 indicator columns. The table below summarises the resulting knowledge score (knowledge_score_pct, out of 100) among aware respondents.
| Min | 25th pct | Median | Mean | 75th pct | Max |
|---|---|---|---|---|---|
| 2.5 | 19.375 | 22.5 | 24.1 | 27.5 | 75 |
The distribution is right-skewed and clustered at the low end: most aware respondents correctly identify only a small fraction of the full set of signs and transmission routes. A mean knowledge score in the low-to-mid 20s (out of 100) suggests that simply being “aware” of brucellosis does not translate into a detailed understanding of the disease.
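The exact scoring rule is not stated in this report. One plausible construction, consistent with the minimum of 2.5 being one item out of 40, is the percentage of the 40 knowledge indicators that a respondent selected. The sketch below assumes that rule and should be read as an illustration, not as the actual derivation of knowledge_score_pct.

```python
N_ITEMS = 40  # knowledge indicators across the four domains


def knowledge_score_pct(indicators):
    """Percentage of 0/1 knowledge indicators equal to 1 (assumed scoring rule)."""
    if len(indicators) != N_ITEMS:
        raise ValueError(f"expected {N_ITEMS} indicators, got {len(indicators)}")
    return 100 * sum(indicators) / N_ITEMS


# Toy respondents: 1, 9, and 30 items selected out of 40.
lowest = knowledge_score_pct([1] * 1 + [0] * 39)
typical = knowledge_score_pct([1] * 9 + [0] * 31)
highest = knowledge_score_pct([1] * 30 + [0] * 10)
print(lowest, typical, highest)
```

Under this assumed rule every score is a multiple of 2.5, so the reported median of 22.5 would correspond to 9 items and the maximum of 75 to 30 items.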
The next comparison asks whether knowledge varies systematically by age, sex, or education.
Awareness itself is recorded for all 409 respondents with zero missingness, and the detailed knowledge items (40 indicator variables across four domains) are consistently populated for all 344 aware respondents (84% of the sample). The data therefore support both halves of Objective 1. The descriptive half (how aware people are, and what they actually know) is fully answerable with the figures above. The inferential half (what predicts higher knowledge) has a complete, non-missing set of explanatory variables (age_group, sex, edu_cat, income_source, yrs_keep_livestock, rec_ext_services, grp_member) against which to regress knowledge_score_pct or aware_binary.
Of the three demographic splits shown, education shows the clearest separation.
This ordering (education > age > sex) gives a working hypothesis for the regression: education is the demographic variable most likely to retain a significant, independent effect on knowledge score once the other covariates are controlled for.
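The demographic splits discussed above could be tabulated with a small grouping helper before any regression is fitted. This is a sketch only: the category labels and scores below are invented placeholders, not survey results.

```python
from collections import defaultdict
from statistics import mean, median


def summarise_by_group(rows, group_key, value_key):
    """Return n, mean, and median of value_key within each level of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {
        level: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for level, vals in groups.items()
    }


# Invented toy rows; "level_a" and "level_b" are placeholder edu_cat labels.
toy_rows = [
    {"edu_cat": "level_a", "knowledge_score_pct": 20.0},
    {"edu_cat": "level_a", "knowledge_score_pct": 25.0},
    {"edu_cat": "level_b", "knowledge_score_pct": 30.0},
    {"edu_cat": "level_b", "knowledge_score_pct": 40.0},
]

summary = summarise_by_group(toy_rows, "edu_cat", "knowledge_score_pct")
print(summary)
```

The same helper could be applied to age_group or sex by changing the grouping key.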
This objective asks what households actually do to prevent or manage brucellosis risk, as opposed to what they know. The questionnaire captured this through a multi-select list of specific prevention practices, plus a smaller sub-section on individual risk behaviours.
Eight of the twelve prevention-practice options in the questionnaire (restricting movement, farm sanitation, slaughtering positive animals, isolating animals during parturition, disposing of fetal material safely, disinfecting, public education, and seeking veterinary advice) were selected by zero respondents across all 409 households.
This suggests a real and large gap between the practices considered “textbook” prevention and what pastoralist households in this part of Isiolo are actually doing.
| adopted_any_practice | n | pct |
|---|---|---|
| Adopted none | 124 | 30.3 |
| Adopted ≥1 practice | 285 | 69.7 |
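The two derived outcomes used in this section can be built directly from the 0/1 practice indicators. The sketch below assumes hypothetical column names for the four practices and shows how the counts in the table translate into its percentages.

```python
# Hypothetical column names for the practice indicators (assumption).
PRACTICE_COLS = ["vaccination", "testing", "isolate_infected", "test_new_animals"]


def practice_outcomes(household):
    """Derive the practice count and any-adoption flag from 0/1 indicators."""
    count = sum(household.get(col, 0) for col in PRACTICE_COLS)
    return {"prev_practice_count": count, "adopted_any_practice": int(count >= 1)}


def tabulate(flags):
    """Return {flag_value: (n, pct)} for a list of 0/1 flags."""
    total = len(flags)
    return {
        value: (flags.count(value), round(100 * flags.count(value) / total, 1))
        for value in sorted(set(flags))
    }


example = practice_outcomes(
    {"vaccination": 1, "testing": 0, "isolate_infected": 1, "test_new_animals": 0}
)
# Rebuild the table's percentages from its counts (124 none, 285 at least one).
table = tabulate([0] * 124 + [1] * 285)
print(example, table)
```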
A smaller sub-section of the questionnaire asked about specific risk behaviours during animal handling. This block was only shown to a subset of respondents (the human-illness branch of the questionnaire), so coverage is much lower, at roughly 15–17% of the sample (60–69 households).
A key analytical question for Objective 2: do households with higher knowledge scores actually report more prevention practices?
The self-reported adoption question (adopt_prev_ctrl) turned out to have zero variance: every respondent who answered it said “Yes”, so it cannot serve as a regression outcome. The practice count and practice indicator variables (prev_practice_count, adopted_any_practice, and the four practices with real variation: vaccination, testing, isolating infected animals, testing new animals) should anchor this objective’s analysis instead. The individual risk-behaviour items (raw milk consumption, glove use, etc.) are real and usable, but only for a descriptive sub-analysis; at n ≈ 69 the subsample is too small to support a separate regression model.
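A zero-variance screen of this kind is easy to automate before choosing regression outcomes. The sketch below uses toy values, not survey data, and treats `None` as a missing answer.

```python
def has_variation(values):
    """True if at least two distinct non-missing values are observed."""
    observed = {v for v in values if v is not None}
    return len(observed) >= 2


# Toy values for illustration: every non-missing answer is "Yes".
adopt_prev_ctrl = ["Yes", "Yes", None, "Yes"]
# Toy 0/1 indicator with both values present.
vaccination = [1, 0, 1, 0]

usable = {
    "adopt_prev_ctrl": has_variation(adopt_prev_ctrl),
    "vaccination": has_variation(vaccination),
}
print(usable)
```

A variable that fails this check carries no information for a model, whatever its coverage.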
The boxplot above shows:
Because the data are cross-sectional, any association that does emerge cannot establish whether knowledge drives adoption or adoption (and the experience that comes with it) builds knowledge. This directionality should be flagged as a limitation regardless of the regression result.
This objective covers Section E of the questionnaire: Likert-scale items capturing how serious, preventable, and threatening respondents perceive brucellosis to be. Unlike the knowledge section, the item blocks in this section fall into two groups with very different coverage:
| Block | Example items | n (coverage) | % of sample |
|---|---|---|---|
| seqb / seqc (transmission & prevention attitudes) | Risk from consuming milk; vaccination effectiveness | 275 | 67.2 |
| seq1–seq16 (general severity & risk perception) | Brucellosis is a serious threat to animals/humans | 69 | 16.9 |
| comm_* (community-level perception) | Shared grazing; shared water points | 275 | 67.2 |
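One common way to summarise a block of Likert items is a per-respondent mean of the answered items. The sketch below assumes the items are coded numerically (for example 1 to 5) with `None` for unanswered items; neither the coding nor the minimum-answered rule is confirmed by this report.

```python
def perception_index(responses, min_answered=1):
    """Mean of the answered Likert items, or None if too few were answered."""
    answered = [r for r in responses if r is not None]
    if len(answered) < min_answered:
        return None
    return sum(answered) / len(answered)


# Toy respondents: fully answered, partially answered, and skipped.
full = perception_index([4, 5, 3, 4])
partial = perception_index([5, None, 3, None], min_answered=2)
skipped = perception_index([None, None, None, None])
print(full, partial, skipped)
```

Averaging rather than summing keeps the index on the original Likert scale even when a respondent left some items blank.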
The seq1–seq16 items were only shown to a subset of respondents (n = 69), so that block should be reported as a descriptive table; it is not large enough for inferential analysis.
The seqb/seqc/comm_* block (16 items, n = 275, 67% coverage) can be used for inferential analysis. It is enough to support both descriptive summaries and a perception-index regression (e.g. summing or averaging Likert scores and regressing the result against knowledge score and SES).
Objective 4 characterizes household wealth and access, both as a descriptive picture of the sample and as an explanatory variable for knowledge and practice in the inferential models.
ses_index is a composite asset index built from ten components: seven binary assets (radio, bicycle, motorbike, car, house ownership, piped water, electricity), phone count (capped at 3), house wall material, and toilet type, each rescaled to a 0–1 contribution and averaged.
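The construction described above can be sketched as follows. The field names and the 0–1 scores assigned to wall material and toilet type are assumptions made for illustration; the report does not state the actual category mappings.

```python
# Hypothetical field names for the seven binary assets (assumption).
BINARY_ASSETS = [
    "radio", "bicycle", "motorbike", "car",
    "owns_house", "piped_water", "electricity",
]
# Hypothetical 0-1 rescalings for the two categorical components (assumption).
WALL_SCORE = {"mud": 0.0, "iron_sheet": 0.5, "brick": 1.0}
TOILET_SCORE = {"none": 0.0, "pit_latrine": 0.5, "flush": 1.0}
PHONE_CAP = 3  # phone count is capped at 3 before rescaling


def ses_index(household):
    """Average of ten components, each rescaled to the 0-1 range."""
    parts = [float(household[asset]) for asset in BINARY_ASSETS]
    parts.append(min(household["phone_count"], PHONE_CAP) / PHONE_CAP)
    parts.append(WALL_SCORE[household["wall_material"]])
    parts.append(TOILET_SCORE[household["toilet_type"]])
    return sum(parts) / len(parts)


# Toy household: two binary assets, five phones, mid-range housing.
toy_household = {
    "radio": 1, "bicycle": 0, "motorbike": 1, "car": 0,
    "owns_house": 0, "piped_water": 0, "electricity": 0,
    "phone_count": 5, "wall_material": "iron_sheet", "toilet_type": "pit_latrine",
}
index_value = ses_index(toy_household)
print(index_value)
```

Because every component lies between 0 and 1 and all ten carry equal weight, the index itself is bounded between 0 and 1.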
| Households missing SES index | All from brucella_aware = No? | % of full sample |
|---|---|---|
| 56 | TRUE | 13.7 |
The 56 households missing an SES index are exactly the households who were not asked Section I (because they were routed past it after answering “No” to brucellosis awareness). This is structural missingness, not random, and should not be imputed.
The ten-component index, plus all of its raw inputs (assets, house construction, toilet type, phone count) and the separate livestock-holding variables, are populated for 353 of 409 households (86%), enough for both a descriptive wealth profile and use as an explanatory variable in the knowledge and practice regressions. The missingness pattern is fully understood and structural (tied to the brucellosis-aware skip gate), so it should be reported as such rather than imputed.
The scatter plot above shows whether wealthier households tend to know more about brucellosis (n = 344).
The bivariate association is weak (r = 0.06, p = 0.306), with the fitted trend rising only about 3 percentage points across the full SES range (0–0.85). The confidence band also widens noticeably above SES ≈ 0.6, where data are sparser, so the apparent upward tilt at the high end of the scale should be read cautiously rather than as a strong trend.
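To see why a correlation of this size is not significant even with n = 344, the standard test statistic t = r·sqrt(n − 2)/sqrt(1 − r²) can be computed by hand. The sketch below uses the rounded r from the text and a normal approximation for the p-value, so it would be expected to differ slightly from the reported p, which presumably comes from the unrounded r and the t distribution.

```python
import math
from statistics import NormalDist


def correlation_t(r, n):
    """t statistic for testing a Pearson correlation against zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def approx_two_sided_p(t):
    """Two-sided p-value using a normal approximation (reasonable for large n)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


t_stat = correlation_t(0.06, 344)  # rounded r and n as reported in the text
p_approx = approx_two_sided_p(t_stat)
print(round(t_stat, 2), round(p_approx, 2))
```

A t statistic of about 1.1 is well short of the roughly 1.97 needed for significance at the 5% level with this sample size.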
| Objective | Topic | Key variables | Coverage | Verdict |
|---|---|---|---|---|
| Obj 1 | Awareness & knowledge | brucella_aware, knowledge_score_pct, 40 knowledge dummies | 100% awareness / 84% knowledge (n=344) | Fully answerable |
| Obj 2 | Management practices | prev_practice_count, adopted_any_practice, 4 practice dummies | 100% practice dummies / 66%* self-report (unusable) | Answerable, reframed outcome variable |
| Obj 3 | Perceptions & attitudes | seqb/seqc (16 items), comm_* (4 items); seq1-16 descriptive only | 67% (n=275) main block / 17% (n=69) general block | Answerable in two tiers |
| Obj 4 | SES & access | ses_index, 10 asset/housing inputs, livestock holdings | 86% (n=353) | Fully answerable |
Taken together, this dataset can answer all four objectives, though two of them require a small adjustment in framing compared to how they might originally have been conceived:
For Objective 2, the self-reported “did you adopt any prevention practice” question turned out to have no variation in the data (everyone who answered said yes), so the practice count and practice indicator variables built from the multi-select responses should be used as the outcome instead.
For Objective 3, the perception data need to be presented in two tiers: the seqb/seqc questions (n = 275), which are suitable for both description and regression, and the seq1-16 questions (n = 69), which are genuinely useful for context but should be explicitly labelled as descriptive only.
Objectives 1 and 4 are fully supported by the data as originally framed, with large, well-covered variable sets and no structural barriers.