Preambule

R code and outpus are in boxes. Code typed at the command line is in boxes with grey background color. Outputs of these commands are in boxes with white background color. You can copy and paste these input code to build your own R script. Be careful however in the process not to copy the > and + prompts!

Packages

The packages currently installed on your system:

> pkgs <- rownames(installed.packages())

The packages we need from CRAN and installing those that are not already installed:

> cran <- c("survey", "devtools", "dplyr", "sp", "raster", "countrycode", "stringr")
> sel <- !cran %in% pkgs
> if (any(sel)) install.packages(cran[sel])

The packages we need from GitHub and installing those that are not already installed:

> github <- c("larmarange/labelled", "ajdamico/lodown",
+             paste0("choisy/", c("mclabelled", "mcsurvey")))
> sel <- !sub(".*/", "", github) %in% pkgs
> if (any(sel)) devtools::install_github(github[sel])

Loading the packages we need for the analysis:

> library(magrittr)   # %>%, %<>% 
> library(labelled)   # to_factor(), remove_attributes()
> library(lodown)     # get_catalog(), lodown()
> library(survey)     # svydesign(), svyby(), svymean(), svyglm()
> library(dplyr)      # filter(), select(), mutate(), mutate_if(), recode()
> library(sp)         # plot(), spplot()

Getting the data and reshaping them

Define your DHS’s project information. For the project, enter no more than the first 35 characters of the project name. Also, replace the second and third fields by your login email address and your password:

> profile <- function(fct, ...) {
+   fct(data_name     = "dhs",
+       your_email    = "your_email_used_as_login_on_DHS", 
+       your_password = "your_DHS_password", 
+       your_project  = "Antibiotics usage amongst under fiv", ...)
+ }

Retrieving the catalogue of data sets of the project:

> catalogue <- profile(get_catalog)

Selecting the children under 5 year-old for 2015, and also changing the output directory:

> catalogue <- subset(catalogue, year == 2015 & grepl("KR", full_url))
> catalogue$output_folder <- getwd()

Downloading the data:

> profile(lodown, catalog = catalogue)

Loading the data:

> tz <- readRDS(grep("TZKR.*rds", dir(), value = TRUE))

Converting into factors in order to get an native R object that will take much less space in memory:

> print(object.size(tz), units = "auto")
99.9 Mb
> tz %<>% to_factor()
> print(object.size(tz), units = "auto")
57.6 Mb

Almost half the size: 57.6 Mb instead of 99.9 Mb! Let’s, in addition, trim out the format attributes (it doesn’t really save much memory but it makes the object contain no useless information):

> tz %<>% remove_attributes("format.stata")

Selecting the variables we are interesting in, out of the 1253 available:

> tz %<>% select(v001, v002, v005, v021, v022, v023, v024, v106, v113, v116, v161,
+                v140, v463a, v463b, v190, b4, h3, h5, h7, h51, h52, h53, h54, h55,
+                h56, h57, h58, h59, h11, h22, h31, h31b, h31c, h12a, h12b, h12c,
+                h12d, h12e, h12f, h12g, h12h, h12j, h12k, h12m, h12n, h12o, h12p,
+                h12s, h12t, h12u, h12x, h32a, h32b, h32c, h32d, h32e, h32f, h32g,
+                h32h, h32j, h32k, h32m, h32n, h32o, h32p, h32s, h32t, h32u, h32x,
+                hw1, h15, h15b, h37i, h37j, h44a, h46a, s630aa, s630ab, s630ac,
+                s630ad, s630ae, s630af, s630ag, s630ah, s630ai, s630aj, s630ak,
+                s630al, s630am, s630an, s630ao, s630ap, s630aq, s630ar, s630as,
+                s630at, s630au, s630av, s630ax) 

We now have only 1253 variables:

> ncol(tz)
[1] 76

Which make the data set much more smaller:

> print(object.size(tz), units = "auto")
3.3 Mb

This will be very easy to handle.

Dividing the weights by 1,000,000 (because the weights have been multiplied by 1,000,000 in order to save space on disk):

> tz$v005 <- tz$v005 / 1e6

Here, it doesn’t change the size in memory, good news! (actually it makes it slightly smaller!).

Attaching the data set to the below-defined var_def and var_def_ functions for easier use:

> var_def <- function(...) mclabelled::var_def(tz, ...) # the NSE version
> var_def_ <- function(x) mclabelled::var_def(tz[x]) # the character vector version

Let’s look at the definitions of the variables we have selected:

> var_def()
v001  : cluster number
v002  : household number
v005  : women's individual sample weight (6 decimals)
v021  : primary sampling unit
v022  : sample strata for sampling errors
v023  : stratification used in sample design
v024  : region
v106  : highest educational level
v113  : source of drinking water
v116  : type of toilet facility
v161  : type of cooking fuel
v140  : de jure type of place of residence
v463a : smokes cigarettes
v463b : smokes pipe full of tobacco
v190  : wealth index
b4    : sex of child
h3    : received dpt 1
h5    : received dpt 2
h7    : received dpt 3
h51   : received pentavalent 1
h52   : received pentavalent 2
h53   : received pentavalent 3
h54   : received pneumococcal 1
h55   : received pneumococcal 2
h56   : received pneumococcal 3
h57   : received rotavirus 1
h58   : received rotavirus 2
h59   : na - received rotavirus 3
h11   : had diarrhea recently
h22   : had fever in last two weeks
h31   : had cough in last two weeks
h31b  : short, rapid breaths
h31c  : problem in the chest or blocked or running nose
h12a  : diarrhea: government hospital
h12b  : diarrhea: regional referral hospital(government/parastatal)
h12c  : diarrhea: regional  hospital(government/parastatal)
h12d  : diarrhea: district hospital(government/parastatal)
h12e  : diarrhea: health center(government/parastatal
h12f  : diarrhea: dispensary(government/parastatal)
h12g  : diarrhea: clinic (government/parastatal)
h12h  : diarrhea: chw(government/parastatal)
h12j  : diarrhea: private hospital/clinic
h12k  : diarrhea: private pharmacy
h12m  : diarrhea: specialized hospital (private)
h12n  : diarrhea: health centre (private)
h12o  : diarrhea: dispensary(private)
h12p  : diarrhea: clinic (private)
h12s  : diarrhea: addo
h12t  : diarrhea:  ngo
h12u  : diarrhea: referal/ spec. voluntary
h12x  : diarrhea: other
h32a  : fever/cough: government hospital
h32b  : fever/cough: regional referral hospital
h32c  : fever/cough: regional hospital
h32d  : fever/cough: district hospital
h32e  : fever/cough: health center
h32f  : fever/cough: dispensary
h32g  : fever/cough: clinic
h32h  : fever/cough: chw
h32j  : fever/cough: private hospital/ clinic
h32k  : fever/cough: private pharmacy
h32m  : fever/cough: specialized hospital
h32n  : fever/cough: health centre
h32o  : fever/cough: private dispensary
h32p  : fever/cough: private  clinic
h32s  : fever/cough: addo
h32t  : fever/cough: ngo
h32u  : fever/cough: religious/voluntary medical
h32x  : fever/cough: other
hw1   : child's age in months
h15   : given antibiotic pills or syrups
h15b  : given antibiotic injection
h37i  : antibiotic pill/syrup taken for fever
h37j  : antibiotic injection taken for fever
h44a  : place first sought treatment for diarrhea
h46a  : place first sought treatment for fever

Defining new variables and recoding some other

> vaccination <- c("vaccination date on card", "reported by mother")
> tz %<>%
+   mutate_if(~ identical(levels(.), c("no", "yes")), ~ . == "yes") %>% 
+   mutate(nbea      = 106642, # total number of enumeration areas (EA) in the country
+          nbhh      = 86,     # number of households per EA
+          count     = 1,      # counting variable
+          age_class = cut(tz$hw1, c(0, 6, 12, 24, 36, 48, 60), right = FALSE),
+          diarrhea  = h11  == "yes, last two weeks",
+          ari       = h31b == "yes" & h31c %in% c("chest only", "both"),
+          fever     = h22 == "yes",
+          amu_fever = h37i | h37j,
+          amu_diarr = h15 == "yes" | h15b == "yes",
+          res       = factor(recode(v140,
+                                    `rural`  = "rural",
+                                    `urban`  = "urban",
+                                    .default =  "NA")),
+          toilet    = factor(recode(v116,
+                                   `flush toilet`                           = "improved",
+                                   `flush to piped sewer system`            = "improved",
+                                   `flush to septic tank`                   = "improved",
+                                   `flush to pit latrine`                   = "improved",
+                                   `pit toilet latrine`                     = "improved",
+                                   `ventilated improved pit latrine (vip)`  = "improved",
+                                   `pit latrine with slab`                  = "improved",
+                                   `composting toilet`                      = "improved", 
+                                   .default                                 = "unimproved")),
+          facility  = case_when(grepl("government|public",   h44a) ~ "public",
+                                grepl("private|specialized", h44a) ~ "private",
+                                grepl("pharmacy|addo",       h44a) ~ "pharmacy",
+                                TRUE                               ~ "other"),
+          education = factor(recode(v106,
+                                    `no education` = "no education",
+                                    `primary`      = "primary",
+                                    `secondary`    = "secondary or higher",
+                                    `higher`       = "secondary or higher")),
+         penta      = h51 %in% vaccination,
+         pcv        = h54 %in% vaccination,
+         rota       = h57 %in% vaccination)

form the report, “ARI symptoms consist of cough (h31) accompanied by (1) short, rapid breathing that is chest-related (h31b), and/or (2) difficult breathing that is chest-related (h31c)”. However it seems to work better defining ARI on h31b and h31c only.

The first line of the above code recodes all the variables that have answer “no” or “yes” only into boolean.

Defining the sampling desing

Defining the sampling design:

> dtz <- svydesign(ids     = ~ v001 + v002, # the PSU and SSU clusters
+                  strata  = ~ v023,        # the strata
+                  fpc     = ~ nbea + nbhh, # finite population correction (see below)
+                  weights = ~ v005,        # sampling weights
+                  data    = tz)

where the meaning of the variables is

> var_def(v001, v002, v023, v005)
v001 : cluster number
v002 : household number
v023 : stratification used in sample design
v005 : women's individual sample weight (6 decimals)

Note that all the variables starting with a v are women standard variables. Finite population corrections are the total numbers of units from which PSU and SSU are sampled.

Exploring the survey design and setting options

The distribution of the number of PSUs per stratum:

> plot(table(colSums(with(tz, table(v001, v023)) > 0)),
+      xlab = "number of PSUs per stratum", ylab = "number of strata")

Looks OK since the minimum number of PSUs per stratum is 2. Let’s now look at the distribution of the number of households (SSUs) per PSU:

> plot(table(with(tz, table(v001, v023)))[-1],
+      xlab = "number of households (SSUs) per PSU", ylab = "number of PSUs")

That does not seem OK since there is a PSU with only one SSU (household). It concerns the cluster 142 of stratum 13:

> mcsurvey::check_replicates(tz, v023, v001, v002)
[[1]]
v023 v001 
  13  142 

Because one PSU has only 1 SSU, we need to change the lonely PSU management option from default value fail:

> options("survey.lonely.psu")
$survey.lonely.psu
[1] "fail"

to one of these alternative options:

> options(survey.lonely.psu = "remove")
> options(survey.lonely.psu = "certainty")
> options(survey.lonely.psu = "adjust")
> options(survey.lonely.psu = "average")

see here for more information. Following advice on this page, we’ll set up the option as

> options(survey.lonely.psu = "certainty")

for “compatibility with other softwares” (in practice, it doesn’t make much difference anyway, which option you choose, at least on this data set).

Checking the data

The weighted estimate of the total number of children:

> svytotal(~ count, dtz, na.rm = T)
      total     SE
count 10052 257.86

The percentage of children who had diarrhea recently, by sex:

> var_def(h11, b4)
h11 : had diarrhea recently
b4  : sex of child

Note that all the variables starting with a h are immunization and health variables and all the variables starting with a b are birth history variables. To see the levels of a variable:

> levels(tz$h11)
[1] "no"                  "yes, last 24 hours"  "yes, last two weeks"
[4] "don't know"         

A weighted estimate of the total numbers of boys and girls:

> svyby(~ count, ~ b4, dtz, svytotal, na.rm = TRUE)
           b4    count       se
male     male 5098.177 142.0301
female female 4953.686 139.4720

A weighted estimate of the percentages of boys and girls who had diarrhea within the last 2 weeks:

> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE)
           b4 h11 == "yes, last two weeks"FALSE
male     male                          0.882918
female female                          0.881215
       h11 == "yes, last two weeks"TRUE
male                           0.117082
female                         0.118785
       se.h11 == "yes, last two weeks"FALSE
male                            0.006233980
female                          0.006148465
       se.h11 == "yes, last two weeks"TRUE
male                           0.006233980
female                         0.006148465

We can put the above result in nicer shape by adding:

> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE) %>% 
+   select(contains("TRUE")) %>% # selecting the columns we are interested in
+   `names<-`(c("mean", "SE"))   # renaming these columns
           mean          SE
male   0.117082 0.006233980
female 0.118785 0.006148465

Which is to compare with the values 11.7 and 11.9 of Table 10.7.1 of the report, page 213. Seems to work OK! Other variables to check.

Do we need SSU?

Official Stata code for the analysis of DHS does not define SSU. Let’s try without and compare with previous results:

> dtz2 <- svydesign(ids     = ~ v001,
+                   strata  = ~ v023,
+                   fpc     = ~ nbea,
+                   weights = ~ v005,
+                   data    = tz)
> svytotal(~ count, dtz2, na.rm = T)
      total     SE
count 10052 257.86

to compare with

> svytotal(~ count, dtz, na.rm = T)
      total     SE
count 10052 257.86

and

> svyby(~ count, ~ b4, dtz2, svytotal, na.rm = TRUE)
           b4    count       se
male     male 5098.177 142.0279
female female 4953.686 139.4696

to compare with

> svyby(~ count, ~ b4, dtz, svytotal, na.rm = TRUE)
           b4    count       se
male     male 5098.177 142.0301
female female 4953.686 139.4720

and, finally,

> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz2, svymean, na.rm = TRUE) %>% 
+   select(contains("TRUE")) %>%
+   `names<-`(c("mean", "SE"))
           mean          SE
male   0.117082 0.006233653
female 0.118785 0.006148120

to compare with

> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE) %>% 
+   select(contains("TRUE")) %>%
+   `names<-`(c("mean", "SE"))
           mean          SE
male   0.117082 0.006233980
female 0.118785 0.006148465

This shows that the results with and without SSU are very similar. Defining SSU only very slightly decrease the SE.

Summarizing the data

Summarizing highest educational level with a table

> table(tz$v106)

no education      primary    secondary       higher 
        2199         6170         1775           89 

or a barplot:

> barplot(table(tz$v106))

To compare with:

> table(tz$education)

       no education             primary secondary or higher 
               2199                6170                1864 

and barplot:

> barplot(table(tz$education))

The initial educational variable:

> svyby(~ count, ~ v106, dtz, svytotal, na.rm = TRUE)
                     v106      count        se
no education no education 2103.40024 126.84417
primary           primary 6516.73840 185.13632
secondary       secondary 1343.58042  59.59380
higher             higher   88.14375  16.35962

and the one we just created:

> svyby(~ count, ~ education, dtz, svytotal, na.rm = TRUE)
                              education    count        se
no education               no education 2103.400 126.84417
primary                         primary 6516.738 185.13632
secondary or higher secondary or higher 1431.724  64.60883

AMU as function of age

Let’s try to reproduce the link between AMU and age observe on Table 10.5:

> svyby(~ amu_fever, ~ age_class, subset(dtz, ari), svymean, na.rm = TRUE) %>%
+   select(contains("TRUE")) %>%
+   `names<-`(c("percentage", "SE"))
        percentage         SE
[0,6)    0.2879797 0.13954440
[6,12)   0.4733787 0.10666608
[12,24)  0.3236933 0.07069466
[24,36)  0.4014607 0.07569719
[36,48)  0.2310347 0.10626940
[48,60)  0.2322670 0.09108236

Diarrhea

Proportion of children with diarrhea by facility:

> svytable(~ facility, subset(dtz, diarrhea))
facility
    other  pharmacy   private    public 
409.33408 279.34900  62.33795 370.62437 
> svytable(~ facility, subset(dtz, diarrhea), 1)
facility
     other   pharmacy    private     public 
0.36494072 0.24905287 0.05557724 0.33042918 

Relationship between diarrhea and facility:

> svytable(~ diarrhea + facility, dtz)
        facility
diarrhea      other   pharmacy    private     public
   FALSE 8389.87596    0.00000    0.00000    0.00000
   TRUE   409.33408  279.34900   62.33795  370.62437
> svytable(~ diarrhea + facility, dtz, 1)
        facility
diarrhea       other    pharmacy     private      public
   FALSE 0.882075079 0.000000000 0.000000000 0.000000000
   TRUE  0.043035605 0.029369540 0.006553941 0.038965834
> svytable(~ diarrhea + amu_diarr, dtz)
        amu_diarr
diarrhea    FALSE     TRUE
    TRUE 749.6177 372.0277
> svytable(~ diarrhea + amu_diarr, dtz, 1)
        amu_diarr
diarrhea     FALSE      TRUE
    TRUE 0.6683196 0.3316804

Number of children with dirrhea by facility, with SE:

> svyby(~ count, ~ facility, subset(dtz, diarrhea), svytotal, na.rm = TRUE)
         facility     count       se
other       other 409.33408 28.05320
pharmacy pharmacy 279.34900 30.92891
private   private  62.33795 10.99203
public     public 370.62437 25.11904
> summary(svyglm(amu_diarr ~ facility, subset(dtz, diarrhea), family = quasibinomial))

Call:
svyglm(formula = amu_diarr ~ facility, subset(dtz, diarrhea), 
    family = quasibinomial)

Survey design:
subset(dtz, diarrhea)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.9489     0.1751 -11.127  < 2e-16 ***
facilitypharmacy   2.1126     0.2304   9.171  < 2e-16 ***
facilityprivate    1.8324     0.4000   4.582 6.17e-06 ***
facilitypublic     1.4560     0.2230   6.528 2.01e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.00089)

Number of Fisher Scoring iterations: 4
> model <- svyglm(amu_diarr ~ facility, dtz, family = quasibinomial)
> summary(model)

Call:
svyglm(formula = amu_diarr ~ facility, dtz, family = quasibinomial)

Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh, 
    weights = ~v005, data = tz)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       -1.9489     0.1751 -11.127  < 2e-16 ***
facilitypharmacy   2.1126     0.2304   9.171  < 2e-16 ***
facilityprivate    1.8324     0.4000   4.582 6.17e-06 ***
facilitypublic     1.4560     0.2230   6.528 2.01e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 1.015888)

Number of Fisher Scoring iterations: 4

Odds ratio:

> exp(cbind(coef(model), confint(model))) 
                               2.5 %    97.5 %
(Intercept)      0.1424352 0.1010499  0.200770
facilitypharmacy 8.2700080 5.2652450 12.989525
facilityprivate  6.2490444 2.8534379 13.685441
facilitypublic   4.2889369 2.7701370  6.640459

Some models

> model1 <- svyglm(fever ~ age_class + b4 + v190 + penta + pcv + rota, design = dtz, family = quasibinomial)

And then:

> summary(model1)

Call:
svyglm(formula = fever ~ age_class + b4 + v190 + penta + pcv + 
    rota, design = dtz, family = quasibinomial)

Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh, 
    weights = ~v005, data = tz)

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.50352    0.16318 -15.342  < 2e-16 ***
age_class[6,12)   0.61974    0.15511   3.995 7.36e-05 ***
age_class[12,24)  0.72288    0.15080   4.794 2.12e-06 ***
age_class[24,36)  0.66406    0.14923   4.450 1.05e-05 ***
age_class[36,48)  0.80855    0.17475   4.627 4.66e-06 ***
age_class[48,60)  0.67048    0.16007   4.189 3.28e-05 ***
b4female         -0.10343    0.06471  -1.598  0.11053    
v190poorer        0.18115    0.10044   1.804  0.07187 .  
v190middle        0.06185    0.11642   0.531  0.59543    
v190richer        0.08493    0.11061   0.768  0.44291    
v190richest       0.18050    0.11639   1.551  0.12154    
pentaTRUE         0.83965    0.25909   3.241  0.00127 ** 
pcvTRUE          -0.21717    0.27491  -0.790  0.42991    
rotaTRUE         -0.08266    0.28781  -0.287  0.77408    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.9942134)

Number of Fisher Scoring iterations: 4

Another model:

> model2 <- svyglm(diarrhea ~ age_class + res + toilet + v190 + education + rota, design = dtz, family = quasibinomial)

And then:

> summary(model2)

Call:
svyglm(formula = diarrhea ~ age_class + res + toilet + v190 + 
    education + rota, design = dtz, family = quasibinomial)

Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh, 
    weights = ~v005, data = tz)

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  -2.87090    0.33520  -8.565  < 2e-16 ***
age_class[6,12)               1.30686    0.19088   6.846 2.09e-11 ***
age_class[12,24)              1.26345    0.18575   6.802 2.78e-11 ***
age_class[24,36)              0.55642    0.19650   2.832  0.00481 ** 
age_class[36,48)              0.41330    0.20609   2.005  0.04542 *  
age_class[48,60)             -0.11273    0.23625  -0.477  0.63342    
resurban                     -0.25484    0.21560  -1.182  0.23772    
resrural                     -0.48132    0.20314  -2.369  0.01817 *  
toiletunimproved             -0.03212    0.15810  -0.203  0.83909    
v190poorer                    0.24828    0.12951   1.917  0.05576 .  
v190middle                    0.27709    0.13992   1.980  0.04818 *  
v190richer                    0.26650    0.16465   1.619  0.10612    
v190richest                   0.20113    0.23355   0.861  0.38952    
educationprimary              0.18147    0.11061   1.641  0.10148    
educationsecondary or higher  0.22474    0.16004   1.404  0.16083    
rotaTRUE                      0.38465    0.16367   2.350  0.01913 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasibinomial family taken to be 0.9865525)

Number of Fisher Scoring iterations: 5

Mapping

You can download shapefiles for free from the GADM project:

> tz1 <- "tanzania" %>%
+   countrycode::countrycode("country.name", "iso3c") %>%
+   raster::getData("GADM", country = ., level = 1)

Level 1 here corresponds to the regions we are interested in. This tz1 is a complex structure object of class S4. Without going too much into detail, it contains a number of slots, one of them is a regular data frame (slot data), all the other ones are geometric objects used to draw the map (pretty much like the internal structure of a shapefile if you’re familiar with that) that we can see when we plot that object:

> plot(tz1)

The cool thing about that object is that there is a good correspondance between the data frame frame of the data slot and the other slots. This data frame contains all what we call the “attributes” in GIS, basically all the information that we’d like to map. For example, so far, the data frame of attributes of tz1 looks like this:

> tz1@data
   OBJECTID ID_0 ISO   NAME_0 ID_1                     NAME_1 HASC_1 CCN_1
1         1  227 TZA Tanzania    1                     Arusha  TZ.AS    NA
2         2  227 TZA Tanzania    2              Dar es Salaam  TZ.DS    NA
3         3  227 TZA Tanzania    3                     Dodoma  TZ.DO    NA
4         4  227 TZA Tanzania    4                      Geita  TZ.GE    NA
5         5  227 TZA Tanzania    5                     Iringa  TZ.IG    NA
6         6  227 TZA Tanzania    6                     Kagera  TZ.KG    NA
7         7  227 TZA Tanzania    7                     Katavi  TZ.KA    NA
8         8  227 TZA Tanzania    8                     Kigoma  TZ.KM    NA
9         9  227 TZA Tanzania    9                Kilimanjaro  TZ.KL    NA
10       10  227 TZA Tanzania   10                      Lindi  TZ.LI    NA
11       11  227 TZA Tanzania   11                    Manyara  TZ.MY    NA
12       12  227 TZA Tanzania   12                       Mara  TZ.MA    NA
13       13  227 TZA Tanzania   13                      Mbeya  TZ.MB    NA
14       14  227 TZA Tanzania   14                   Morogoro  TZ.MO    NA
15       15  227 TZA Tanzania   15                     Mtwara  TZ.MT    NA
16       16  227 TZA Tanzania   16                     Mwanza  TZ.MZ    NA
17       17  227 TZA Tanzania   17                     Njombe  TZ.NJ    NA
18       18  227 TZA Tanzania   18                Pemba North  TZ.PN    NA
19       19  227 TZA Tanzania   19                Pemba South  TZ.PS    NA
20       20  227 TZA Tanzania   20                      Pwani  TZ.PW    NA
21       21  227 TZA Tanzania   21                      Rukwa  TZ.RU    NA
22       22  227 TZA Tanzania   22                     Ruvuma  TZ.RV    NA
23       23  227 TZA Tanzania   23                  Shinyanga  TZ.SY    NA
24       24  227 TZA Tanzania   24                     Simiyu  TZ.SI    NA
25       25  227 TZA Tanzania   25                    Singida  TZ.SD    NA
26       26  227 TZA Tanzania   26                     Tabora  TZ.TB    NA
27       27  227 TZA Tanzania   27                      Tanga  TZ.TN    NA
28       28  227 TZA Tanzania   28             Zanzibar North  TZ.ZN    NA
29       29  227 TZA Tanzania   29 Zanzibar South and Central  TZ.ZS    NA
30       30  227 TZA Tanzania   30              Zanzibar West  TZ.ZW    NA
   CCA_1 TYPE_1 ENGTYPE_1 NL_NAME_1           VARNAME_1
1     02   Mkoa    Region                              
2     07   Mkoa    Region           Dar-es-salaam|Pwani
3     01   Mkoa    Region                              
4     25   Mkoa    Region                              
5     11   Mkoa    Region                              
6     18   Mkoa    Region                              
7     23   Mkoa    Region                              
8     16   Mkoa    Region                              
9     03   Mkoa    Region                              
10    08   Mkoa    Region                              
11    21   Mkoa    Region                              
12    20   Mkoa    Region                              
13    12   Mkoa    Region                              
14    05   Mkoa    Region                              
15    09   Mkoa    Region                              
16    19   Mkoa    Region                              
17    22   Mkoa    Region                              
18    54   Mkoa    Region               Kaskazini Pemba
19    55   Mkoa    Region                  Kusini Pemba
20    06   Mkoa    Region                              
21    15   Mkoa    Region                              
22    10   Mkoa    Region                              
23    17   Mkoa    Region                              
24    24   Mkoa    Region                              
25    13   Mkoa    Region                              
26    14   Mkoa    Region                              
27    04   Mkoa    Region                              
28    51   Mkoa    Region              Kaskazini Unguja
29    52   Mkoa    Region                 Kusini Unguja
30    53   Mkoa    Region               Mjini Magharibi

You can see that the variable NAME_1 of this data frame contains the name of the regions. You can also see that alternative names of the regions are contained in the VARNAME_1 variable. In order to be consistent with the names used in tz we need to compute a region variable using both NAME_1 and VARNAME_1:

> tz1@data %<>%
+   mutate(region = ifelse(VARNAME_1 == "", NAME_1, VARNAME_1) %>%
+                     sub("Dar-es-salaam.*$", "Dar Es Salaam", .))

In order to map some data, you first need to add these data to this data frame, in the form of a merging. A obvious way to do the merging here is to do it by regions’s names. But for that we first need to make sure that the regions’ names in the tz1 and the tz object are the same. Let’s check that:

> sort(levels(tz$v024))
 [1] " arusha"          " dar es salaam"   " kilimanjaro"    
 [4] " morogoro"        " pwani"           " tanga"          
 [7] "dodoma"           "geita"            "iringa"          
[10] "kagera"           "kaskazini pemba"  "kaskazini unguja"
[13] "katavi"           "kigoma"           "kusini pemba"    
[16] "kusini unguja"    "lindi"            "manyara"         
[19] "mara"             "mbeya"            "mjini magharibi" 
[22] "mtwara"           "mwanza"           "njombe"          
[25] "rukwa"            "ruvuma"           "shinyanga"       
[28] "simiyu"           "singida"          "tabora"          

We can see that there are 2 problems:

> tz$v024 <- factor(stringr::str_to_title(trimws(as.character(tz$v024))))

And now, let’s check that everything is in order:

> data.frame(sort(unique(tz1@data$region)), sort(as.character(levels(tz$v024))))
   sort.unique.tz1.data.region.. sort.as.character.levels.tz.v024...
1                         Arusha                              Arusha
2                  Dar Es Salaam                       Dar Es Salaam
3                         Dodoma                              Dodoma
4                          Geita                               Geita
5                         Iringa                              Iringa
6                         Kagera                              Kagera
7                Kaskazini Pemba                     Kaskazini Pemba
8               Kaskazini Unguja                    Kaskazini Unguja
9                         Katavi                              Katavi
10                        Kigoma                              Kigoma
11                   Kilimanjaro                         Kilimanjaro
12                  Kusini Pemba                        Kusini Pemba
13                 Kusini Unguja                       Kusini Unguja
14                         Lindi                               Lindi
15                       Manyara                             Manyara
16                          Mara                                Mara
17                         Mbeya                               Mbeya
18               Mjini Magharibi                     Mjini Magharibi
19                      Morogoro                            Morogoro
20                        Mtwara                              Mtwara
21                        Mwanza                              Mwanza
22                        Njombe                              Njombe
23                         Pwani                               Pwani
24                         Rukwa                               Rukwa
25                        Ruvuma                              Ruvuma
26                     Shinyanga                           Shinyanga
27                        Simiyu                              Simiyu
28                       Singida                             Singida
29                        Tabora                              Tabora
30                         Tanga                               Tanga
> identical(sort(unique(tz1@data$region)), sort(as.character(levels(tz$v024))))
[1] TRUE

That’s perfect, now we can do some mapping! Let’s say you have the following data that you want to map:

> fake_data <- data.frame(region = sort(unique(tz1@data$region)),
+                         prevalence = runif(30))

That looks like:

> fake_data
             region prevalence
1            Arusha 0.71728459
2     Dar Es Salaam 0.41555970
3            Dodoma 0.64026440
4             Geita 0.71475903
5            Iringa 0.04204099
6            Kagera 0.77923156
7   Kaskazini Pemba 0.28061829
8  Kaskazini Unguja 0.70913978
9            Katavi 0.51097488
10           Kigoma 0.91588271
11      Kilimanjaro 0.64929853
12     Kusini Pemba 0.32436069
13    Kusini Unguja 0.90943005
14            Lindi 0.54887290
15          Manyara 0.22672338
16             Mara 0.01194075
17            Mbeya 0.74136891
18  Mjini Magharibi 0.87527528
19         Morogoro 0.10128811
20           Mtwara 0.65647817
21           Mwanza 0.26264764
22           Njombe 0.38125004
23            Pwani 0.65449194
24            Rukwa 0.74027565
25           Ruvuma 0.16806927
26        Shinyanga 0.77780351
27           Simiyu 0.78997908
28          Singida 0.33727974
29           Tabora 0.44200809
30            Tanga 0.32160170

The first step is to merge this data to the tz1 object:

> tz1b <- merge(tz1, fake_data, by = "region")

You can check now that the prevalence data has been added to the data frame slot of the tz1 object:

> head(tz1b@data)
         region OBJECTID ID_0 ISO   NAME_0 ID_1        NAME_1 HASC_1 CCN_1
1        Arusha        1  227 TZA Tanzania    1        Arusha  TZ.AS    NA
2 Dar Es Salaam        2  227 TZA Tanzania    2 Dar es Salaam  TZ.DS    NA
3        Dodoma        3  227 TZA Tanzania    3        Dodoma  TZ.DO    NA
4         Geita        4  227 TZA Tanzania    4         Geita  TZ.GE    NA
5        Iringa        5  227 TZA Tanzania    5        Iringa  TZ.IG    NA
6        Kagera        6  227 TZA Tanzania    6        Kagera  TZ.KG    NA
  CCA_1 TYPE_1 ENGTYPE_1 NL_NAME_1           VARNAME_1 prevalence
1    02   Mkoa    Region                               0.71728459
2    07   Mkoa    Region           Dar-es-salaam|Pwani 0.41555970
3    01   Mkoa    Region                               0.64026440
4    25   Mkoa    Region                               0.71475903
5    11   Mkoa    Region                               0.04204099
6    18   Mkoa    Region                               0.77923156

Once this is done, you can use the spplot function to make a map of the prevalence this way:

> spplot(subset(tz1b, , prevalence))

You can change the color scale this way:

> spplot(subset(tz1b, , prevalence), col.regions = rev(heat.colors(30)))

Read the help of the spplot function for more functionalities.

An alternative way

Here we show how you could use plot instead of spplot. On this example, I also show how you can add labels to the polygons. This is not dependent of the use of plot and could very well be used with spplot too.

First, we need to extract the information we need:

> labpt <- tz1b@polygons %>%
+   sapply(function(x) x@labpt) %>% 
+   t() %>% 
+   data.frame() %>% 
+   setNames(c("x", "y")) %>% 
+   cbind(tz1b@data[, c("region", "prevalence")])

Then, we need to define a function that add the label the way we want:

> add_info <- function(df, eps) {
+   with(df, {
+     text(x, y + eps, region)
+     text(x, y - eps,
+          paste0(100 * round(prevalence, 2), "%"))
+   })
+ }

where the eps argument is half the vertical distance (in latitude decimal degree) we want between the polygon’s name and the prevalence value. Also, for the map to be easy to read, we need to plot the mainland and the islands separately. Here is the pattern that targets the islands:

> islands <- "Pemba|Unguja|Mjini"

The number of color categories we want is

> n <- 12

And, finally, the coordinates of the box:

> add_box <- function()
+   polygon(rep(c(38.9, 40.15), 2)[c(1, 2, 4, 3)], rep(c(-6.54, -4.8), each = 2))

Now, we have all the ingredients to draw the two maps. First the mainland:

> plot(tz1b, col = rev(heat.colors(n))[cut(tz1b@data$prevalence, n)])
> labpt %>%
+   filter(region %>% grepl(pattern = islands) %>% `!`) %>%
+   add_info(.2)
> add_box()

And now, the islands:

> sel <- grep(islands, labpt$region)
> plot(tz1b[sel, ], col = rev(heat.colors(n))[cut(tz1b@data$prevalence, n)][sel])
> labpt %>%
+   filter(region %>% grepl(pattern = islands)) %>%
+   add_info(.06)
> add_box()