R code and outpus are in boxes. Code typed at the command line is in boxes with grey background color. Outputs of these commands are in boxes with white background color. You can copy and paste these input code to build your own R script. Be careful however in the process not to copy the > and + prompts!
The packages currently installed on your system:
> pkgs <- rownames(installed.packages())
The packages we need from CRAN and installing those that are not already installed:
> cran <- c("survey", "devtools", "dplyr", "sp", "raster", "countrycode", "stringr")
> sel <- !cran %in% pkgs
> if (any(sel)) install.packages(cran[sel])
The packages we need from GitHub and installing those that are not already installed:
> github <- c("larmarange/labelled", "ajdamico/lodown",
+ paste0("choisy/", c("mclabelled", "mcsurvey")))
> sel <- !sub(".*/", "", github) %in% pkgs
> if (any(sel)) devtools::install_github(github[sel])
Loading the packages we need for the analysis:
> library(magrittr) # %>%, %<>%
> library(labelled) # to_factor(), remove_attributes()
> library(lodown) # get_catalog(), lodown()
> library(survey) # svydesign(), svyby(), svymean(), svyglm()
> library(dplyr) # filter(), select(), mutate(), mutate_if(), recode()
> library(sp) # plot(), spplot()
Define your DHS’s project information. For the project, enter no more than the first 35 characters of the project name. Also, replace the second and third fields by your login email address and your password:
> profile <- function(fct, ...) {
+ fct(data_name = "dhs",
+ your_email = "your_email_used_as_login_on_DHS",
+ your_password = "your_DHS_password",
+ your_project = "Antibiotics usage amongst under fiv", ...)
+ }
Retrieving the catalogue of data sets of the project:
> catalogue <- profile(get_catalog)
Selecting the children under 5 year-old for 2015, and also changing the output directory:
> catalogue <- subset(catalogue, year == 2015 & grepl("KR", full_url))
> catalogue$output_folder <- getwd()
Downloading the data:
> profile(lodown, catalog = catalogue)
Loading the data:
> tz <- readRDS(grep("TZKR.*rds", dir(), value = TRUE))
Converting into factors in order to get an native R object that will take much less space in memory:
> print(object.size(tz), units = "auto")
99.9 Mb
> tz %<>% to_factor()
> print(object.size(tz), units = "auto")
57.6 Mb
Almost half the size: 57.6 Mb instead of 99.9 Mb! Let’s, in addition, trim out the format attributes (it doesn’t really save much memory but it makes the object contain no useless information):
> tz %<>% remove_attributes("format.stata")
Selecting the variables we are interesting in, out of the 1253 available:
> tz %<>% select(v001, v002, v005, v021, v022, v023, v024, v106, v113, v116, v161,
+ v140, v463a, v463b, v190, b4, h3, h5, h7, h51, h52, h53, h54, h55,
+ h56, h57, h58, h59, h11, h22, h31, h31b, h31c, h12a, h12b, h12c,
+ h12d, h12e, h12f, h12g, h12h, h12j, h12k, h12m, h12n, h12o, h12p,
+ h12s, h12t, h12u, h12x, h32a, h32b, h32c, h32d, h32e, h32f, h32g,
+ h32h, h32j, h32k, h32m, h32n, h32o, h32p, h32s, h32t, h32u, h32x,
+ hw1, h15, h15b, h37i, h37j, h44a, h46a, s630aa, s630ab, s630ac,
+ s630ad, s630ae, s630af, s630ag, s630ah, s630ai, s630aj, s630ak,
+ s630al, s630am, s630an, s630ao, s630ap, s630aq, s630ar, s630as,
+ s630at, s630au, s630av, s630ax)
We now have only 1253 variables:
> ncol(tz)
[1] 76
Which make the data set much more smaller:
> print(object.size(tz), units = "auto")
3.3 Mb
This will be very easy to handle.
Dividing the weights by 1,000,000 (because the weights have been multiplied by 1,000,000 in order to save space on disk):
> tz$v005 <- tz$v005 / 1e6
Here, it doesn’t change the size in memory, good news! (actually it makes it slightly smaller!).
Attaching the data set to the below-defined var_def and var_def_ functions for easier use:
> var_def <- function(...) mclabelled::var_def(tz, ...) # the NSE version
> var_def_ <- function(x) mclabelled::var_def(tz[x]) # the character vector version
Let’s look at the definitions of the variables we have selected:
> var_def()
v001 : cluster number
v002 : household number
v005 : women's individual sample weight (6 decimals)
v021 : primary sampling unit
v022 : sample strata for sampling errors
v023 : stratification used in sample design
v024 : region
v106 : highest educational level
v113 : source of drinking water
v116 : type of toilet facility
v161 : type of cooking fuel
v140 : de jure type of place of residence
v463a : smokes cigarettes
v463b : smokes pipe full of tobacco
v190 : wealth index
b4 : sex of child
h3 : received dpt 1
h5 : received dpt 2
h7 : received dpt 3
h51 : received pentavalent 1
h52 : received pentavalent 2
h53 : received pentavalent 3
h54 : received pneumococcal 1
h55 : received pneumococcal 2
h56 : received pneumococcal 3
h57 : received rotavirus 1
h58 : received rotavirus 2
h59 : na - received rotavirus 3
h11 : had diarrhea recently
h22 : had fever in last two weeks
h31 : had cough in last two weeks
h31b : short, rapid breaths
h31c : problem in the chest or blocked or running nose
h12a : diarrhea: government hospital
h12b : diarrhea: regional referral hospital(government/parastatal)
h12c : diarrhea: regional hospital(government/parastatal)
h12d : diarrhea: district hospital(government/parastatal)
h12e : diarrhea: health center(government/parastatal
h12f : diarrhea: dispensary(government/parastatal)
h12g : diarrhea: clinic (government/parastatal)
h12h : diarrhea: chw(government/parastatal)
h12j : diarrhea: private hospital/clinic
h12k : diarrhea: private pharmacy
h12m : diarrhea: specialized hospital (private)
h12n : diarrhea: health centre (private)
h12o : diarrhea: dispensary(private)
h12p : diarrhea: clinic (private)
h12s : diarrhea: addo
h12t : diarrhea: ngo
h12u : diarrhea: referal/ spec. voluntary
h12x : diarrhea: other
h32a : fever/cough: government hospital
h32b : fever/cough: regional referral hospital
h32c : fever/cough: regional hospital
h32d : fever/cough: district hospital
h32e : fever/cough: health center
h32f : fever/cough: dispensary
h32g : fever/cough: clinic
h32h : fever/cough: chw
h32j : fever/cough: private hospital/ clinic
h32k : fever/cough: private pharmacy
h32m : fever/cough: specialized hospital
h32n : fever/cough: health centre
h32o : fever/cough: private dispensary
h32p : fever/cough: private clinic
h32s : fever/cough: addo
h32t : fever/cough: ngo
h32u : fever/cough: religious/voluntary medical
h32x : fever/cough: other
hw1 : child's age in months
h15 : given antibiotic pills or syrups
h15b : given antibiotic injection
h37i : antibiotic pill/syrup taken for fever
h37j : antibiotic injection taken for fever
h44a : place first sought treatment for diarrhea
h46a : place first sought treatment for fever
> vaccination <- c("vaccination date on card", "reported by mother")
> tz %<>%
+ mutate_if(~ identical(levels(.), c("no", "yes")), ~ . == "yes") %>%
+ mutate(nbea = 106642, # total number of enumeration areas (EA) in the country
+ nbhh = 86, # number of households per EA
+ count = 1, # counting variable
+ age_class = cut(tz$hw1, c(0, 6, 12, 24, 36, 48, 60), right = FALSE),
+ diarrhea = h11 == "yes, last two weeks",
+ ari = h31b == "yes" & h31c %in% c("chest only", "both"),
+ fever = h22 == "yes",
+ amu_fever = h37i | h37j,
+ amu_diarr = h15 == "yes" | h15b == "yes",
+ res = factor(recode(v140,
+ `rural` = "rural",
+ `urban` = "urban",
+ .default = "NA")),
+ toilet = factor(recode(v116,
+ `flush toilet` = "improved",
+ `flush to piped sewer system` = "improved",
+ `flush to septic tank` = "improved",
+ `flush to pit latrine` = "improved",
+ `pit toilet latrine` = "improved",
+ `ventilated improved pit latrine (vip)` = "improved",
+ `pit latrine with slab` = "improved",
+ `composting toilet` = "improved",
+ .default = "unimproved")),
+ facility = case_when(grepl("government|public", h44a) ~ "public",
+ grepl("private|specialized", h44a) ~ "private",
+ grepl("pharmacy|addo", h44a) ~ "pharmacy",
+ TRUE ~ "other"),
+ education = factor(recode(v106,
+ `no education` = "no education",
+ `primary` = "primary",
+ `secondary` = "secondary or higher",
+ `higher` = "secondary or higher")),
+ penta = h51 %in% vaccination,
+ pcv = h54 %in% vaccination,
+ rota = h57 %in% vaccination)
form the report, “ARI symptoms consist of cough (h31) accompanied by (1) short, rapid breathing that is chest-related (h31b), and/or (2) difficult breathing that is chest-related (h31c)”. However it seems to work better defining ARI on h31b and h31c only.
The first line of the above code recodes all the variables that have answer “no” or “yes” only into boolean.
Defining the sampling design:
> dtz <- svydesign(ids = ~ v001 + v002, # the PSU and SSU clusters
+ strata = ~ v023, # the strata
+ fpc = ~ nbea + nbhh, # finite population correction (see below)
+ weights = ~ v005, # sampling weights
+ data = tz)
where the meaning of the variables is
> var_def(v001, v002, v023, v005)
v001 : cluster number
v002 : household number
v023 : stratification used in sample design
v005 : women's individual sample weight (6 decimals)
Note that all the variables starting with a v are women standard variables. Finite population corrections are the total numbers of units from which PSU and SSU are sampled.
The distribution of the number of PSUs per stratum:
> plot(table(colSums(with(tz, table(v001, v023)) > 0)),
+ xlab = "number of PSUs per stratum", ylab = "number of strata")
Looks OK since the minimum number of PSUs per stratum is 2. Let’s now look at the distribution of the number of households (SSUs) per PSU:
> plot(table(with(tz, table(v001, v023)))[-1],
+ xlab = "number of households (SSUs) per PSU", ylab = "number of PSUs")
That does not seem OK since there is a PSU with only one SSU (household). It concerns the cluster 142 of stratum 13:
> mcsurvey::check_replicates(tz, v023, v001, v002)
[[1]]
v023 v001
13 142
Because one PSU has only 1 SSU, we need to change the lonely PSU management option from default value fail:
> options("survey.lonely.psu")
$survey.lonely.psu
[1] "fail"
to one of these alternative options:
> options(survey.lonely.psu = "remove")
> options(survey.lonely.psu = "certainty")
> options(survey.lonely.psu = "adjust")
> options(survey.lonely.psu = "average")
see here for more information. Following advice on this page, we’ll set up the option as
> options(survey.lonely.psu = "certainty")
for “compatibility with other softwares” (in practice, it doesn’t make much difference anyway, which option you choose, at least on this data set).
The weighted estimate of the total number of children:
> svytotal(~ count, dtz, na.rm = T)
total SE
count 10052 257.86
The percentage of children who had diarrhea recently, by sex:
> var_def(h11, b4)
h11 : had diarrhea recently
b4 : sex of child
Note that all the variables starting with a h are immunization and health variables and all the variables starting with a b are birth history variables. To see the levels of a variable:
> levels(tz$h11)
[1] "no" "yes, last 24 hours" "yes, last two weeks"
[4] "don't know"
A weighted estimate of the total numbers of boys and girls:
> svyby(~ count, ~ b4, dtz, svytotal, na.rm = TRUE)
b4 count se
male male 5098.177 142.0301
female female 4953.686 139.4720
A weighted estimate of the percentages of boys and girls who had diarrhea within the last 2 weeks:
> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE)
b4 h11 == "yes, last two weeks"FALSE
male male 0.882918
female female 0.881215
h11 == "yes, last two weeks"TRUE
male 0.117082
female 0.118785
se.h11 == "yes, last two weeks"FALSE
male 0.006233980
female 0.006148465
se.h11 == "yes, last two weeks"TRUE
male 0.006233980
female 0.006148465
We can put the above result in nicer shape by adding:
> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE) %>%
+ select(contains("TRUE")) %>% # selecting the columns we are interested in
+ `names<-`(c("mean", "SE")) # renaming these columns
mean SE
male 0.117082 0.006233980
female 0.118785 0.006148465
Which is to compare with the values 11.7 and 11.9 of Table 10.7.1 of the report, page 213. Seems to work OK! Other variables to check.
Official Stata code for the analysis of DHS does not define SSU. Let’s try without and compare with previous results:
> dtz2 <- svydesign(ids = ~ v001,
+ strata = ~ v023,
+ fpc = ~ nbea,
+ weights = ~ v005,
+ data = tz)
> svytotal(~ count, dtz2, na.rm = T)
total SE
count 10052 257.86
to compare with
> svytotal(~ count, dtz, na.rm = T)
total SE
count 10052 257.86
and
> svyby(~ count, ~ b4, dtz2, svytotal, na.rm = TRUE)
b4 count se
male male 5098.177 142.0279
female female 4953.686 139.4696
to compare with
> svyby(~ count, ~ b4, dtz, svytotal, na.rm = TRUE)
b4 count se
male male 5098.177 142.0301
female female 4953.686 139.4720
and, finally,
> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz2, svymean, na.rm = TRUE) %>%
+ select(contains("TRUE")) %>%
+ `names<-`(c("mean", "SE"))
mean SE
male 0.117082 0.006233653
female 0.118785 0.006148120
to compare with
> svyby(~ h11 == "yes, last two weeks", ~ b4, dtz, svymean, na.rm = TRUE) %>%
+ select(contains("TRUE")) %>%
+ `names<-`(c("mean", "SE"))
mean SE
male 0.117082 0.006233980
female 0.118785 0.006148465
This shows that the results with and without SSU are very similar. Defining SSU only very slightly decrease the SE.
Summarizing highest educational level with a table
> table(tz$v106)
no education primary secondary higher
2199 6170 1775 89
or a barplot:
> barplot(table(tz$v106))
To compare with:
> table(tz$education)
no education primary secondary or higher
2199 6170 1864
and barplot:
> barplot(table(tz$education))
The initial educational variable:
> svyby(~ count, ~ v106, dtz, svytotal, na.rm = TRUE)
v106 count se
no education no education 2103.40024 126.84417
primary primary 6516.73840 185.13632
secondary secondary 1343.58042 59.59380
higher higher 88.14375 16.35962
and the one we just created:
> svyby(~ count, ~ education, dtz, svytotal, na.rm = TRUE)
education count se
no education no education 2103.400 126.84417
primary primary 6516.738 185.13632
secondary or higher secondary or higher 1431.724 64.60883
Let’s try to reproduce the link between AMU and age observe on Table 10.5:
> svyby(~ amu_fever, ~ age_class, subset(dtz, ari), svymean, na.rm = TRUE) %>%
+ select(contains("TRUE")) %>%
+ `names<-`(c("percentage", "SE"))
percentage SE
[0,6) 0.2879797 0.13954440
[6,12) 0.4733787 0.10666608
[12,24) 0.3236933 0.07069466
[24,36) 0.4014607 0.07569719
[36,48) 0.2310347 0.10626940
[48,60) 0.2322670 0.09108236
Proportion of children with diarrhea by facility:
> svytable(~ facility, subset(dtz, diarrhea))
facility
other pharmacy private public
409.33408 279.34900 62.33795 370.62437
> svytable(~ facility, subset(dtz, diarrhea), 1)
facility
other pharmacy private public
0.36494072 0.24905287 0.05557724 0.33042918
Relationship between diarrhea and facility:
> svytable(~ diarrhea + facility, dtz)
facility
diarrhea other pharmacy private public
FALSE 8389.87596 0.00000 0.00000 0.00000
TRUE 409.33408 279.34900 62.33795 370.62437
> svytable(~ diarrhea + facility, dtz, 1)
facility
diarrhea other pharmacy private public
FALSE 0.882075079 0.000000000 0.000000000 0.000000000
TRUE 0.043035605 0.029369540 0.006553941 0.038965834
> svytable(~ diarrhea + amu_diarr, dtz)
amu_diarr
diarrhea FALSE TRUE
TRUE 749.6177 372.0277
> svytable(~ diarrhea + amu_diarr, dtz, 1)
amu_diarr
diarrhea FALSE TRUE
TRUE 0.6683196 0.3316804
Number of children with dirrhea by facility, with SE:
> svyby(~ count, ~ facility, subset(dtz, diarrhea), svytotal, na.rm = TRUE)
facility count se
other other 409.33408 28.05320
pharmacy pharmacy 279.34900 30.92891
private private 62.33795 10.99203
public public 370.62437 25.11904
> summary(svyglm(amu_diarr ~ facility, subset(dtz, diarrhea), family = quasibinomial))
Call:
svyglm(formula = amu_diarr ~ facility, subset(dtz, diarrhea),
family = quasibinomial)
Survey design:
subset(dtz, diarrhea)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9489 0.1751 -11.127 < 2e-16 ***
facilitypharmacy 2.1126 0.2304 9.171 < 2e-16 ***
facilityprivate 1.8324 0.4000 4.582 6.17e-06 ***
facilitypublic 1.4560 0.2230 6.528 2.01e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 1.00089)
Number of Fisher Scoring iterations: 4
> model <- svyglm(amu_diarr ~ facility, dtz, family = quasibinomial)
> summary(model)
Call:
svyglm(formula = amu_diarr ~ facility, dtz, family = quasibinomial)
Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh,
weights = ~v005, data = tz)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.9489 0.1751 -11.127 < 2e-16 ***
facilitypharmacy 2.1126 0.2304 9.171 < 2e-16 ***
facilityprivate 1.8324 0.4000 4.582 6.17e-06 ***
facilitypublic 1.4560 0.2230 6.528 2.01e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 1.015888)
Number of Fisher Scoring iterations: 4
Odds ratio:
> exp(cbind(coef(model), confint(model)))
2.5 % 97.5 %
(Intercept) 0.1424352 0.1010499 0.200770
facilitypharmacy 8.2700080 5.2652450 12.989525
facilityprivate 6.2490444 2.8534379 13.685441
facilitypublic 4.2889369 2.7701370 6.640459
> model1 <- svyglm(fever ~ age_class + b4 + v190 + penta + pcv + rota, design = dtz, family = quasibinomial)
And then:
> summary(model1)
Call:
svyglm(formula = fever ~ age_class + b4 + v190 + penta + pcv +
rota, design = dtz, family = quasibinomial)
Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh,
weights = ~v005, data = tz)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.50352 0.16318 -15.342 < 2e-16 ***
age_class[6,12) 0.61974 0.15511 3.995 7.36e-05 ***
age_class[12,24) 0.72288 0.15080 4.794 2.12e-06 ***
age_class[24,36) 0.66406 0.14923 4.450 1.05e-05 ***
age_class[36,48) 0.80855 0.17475 4.627 4.66e-06 ***
age_class[48,60) 0.67048 0.16007 4.189 3.28e-05 ***
b4female -0.10343 0.06471 -1.598 0.11053
v190poorer 0.18115 0.10044 1.804 0.07187 .
v190middle 0.06185 0.11642 0.531 0.59543
v190richer 0.08493 0.11061 0.768 0.44291
v190richest 0.18050 0.11639 1.551 0.12154
pentaTRUE 0.83965 0.25909 3.241 0.00127 **
pcvTRUE -0.21717 0.27491 -0.790 0.42991
rotaTRUE -0.08266 0.28781 -0.287 0.77408
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.9942134)
Number of Fisher Scoring iterations: 4
Another model:
> model2 <- svyglm(diarrhea ~ age_class + res + toilet + v190 + education + rota, design = dtz, family = quasibinomial)
And then:
> summary(model2)
Call:
svyglm(formula = diarrhea ~ age_class + res + toilet + v190 +
education + rota, design = dtz, family = quasibinomial)
Survey design:
svydesign(ids = ~v001 + v002, strata = ~v023, fpc = ~nbea + nbhh,
weights = ~v005, data = tz)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.87090 0.33520 -8.565 < 2e-16 ***
age_class[6,12) 1.30686 0.19088 6.846 2.09e-11 ***
age_class[12,24) 1.26345 0.18575 6.802 2.78e-11 ***
age_class[24,36) 0.55642 0.19650 2.832 0.00481 **
age_class[36,48) 0.41330 0.20609 2.005 0.04542 *
age_class[48,60) -0.11273 0.23625 -0.477 0.63342
resurban -0.25484 0.21560 -1.182 0.23772
resrural -0.48132 0.20314 -2.369 0.01817 *
toiletunimproved -0.03212 0.15810 -0.203 0.83909
v190poorer 0.24828 0.12951 1.917 0.05576 .
v190middle 0.27709 0.13992 1.980 0.04818 *
v190richer 0.26650 0.16465 1.619 0.10612
v190richest 0.20113 0.23355 0.861 0.38952
educationprimary 0.18147 0.11061 1.641 0.10148
educationsecondary or higher 0.22474 0.16004 1.404 0.16083
rotaTRUE 0.38465 0.16367 2.350 0.01913 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for quasibinomial family taken to be 0.9865525)
Number of Fisher Scoring iterations: 5
You can download shapefiles for free from the GADM project:
> tz1 <- "tanzania" %>%
+ countrycode::countrycode("country.name", "iso3c") %>%
+ raster::getData("GADM", country = ., level = 1)
Level 1 here corresponds to the regions we are interested in. This tz1 is a complex structure object of class S4. Without going too much into detail, it contains a number of slots, one of them is a regular data frame (slot data), all the other ones are geometric objects used to draw the map (pretty much like the internal structure of a shapefile if you’re familiar with that) that we can see when we plot that object:
> plot(tz1)
The cool thing about that object is that there is a good correspondance between the data frame frame of the data slot and the other slots. This data frame contains all what we call the “attributes” in GIS, basically all the information that we’d like to map. For example, so far, the data frame of attributes of tz1 looks like this:
> tz1@data
OBJECTID ID_0 ISO NAME_0 ID_1 NAME_1 HASC_1 CCN_1
1 1 227 TZA Tanzania 1 Arusha TZ.AS NA
2 2 227 TZA Tanzania 2 Dar es Salaam TZ.DS NA
3 3 227 TZA Tanzania 3 Dodoma TZ.DO NA
4 4 227 TZA Tanzania 4 Geita TZ.GE NA
5 5 227 TZA Tanzania 5 Iringa TZ.IG NA
6 6 227 TZA Tanzania 6 Kagera TZ.KG NA
7 7 227 TZA Tanzania 7 Katavi TZ.KA NA
8 8 227 TZA Tanzania 8 Kigoma TZ.KM NA
9 9 227 TZA Tanzania 9 Kilimanjaro TZ.KL NA
10 10 227 TZA Tanzania 10 Lindi TZ.LI NA
11 11 227 TZA Tanzania 11 Manyara TZ.MY NA
12 12 227 TZA Tanzania 12 Mara TZ.MA NA
13 13 227 TZA Tanzania 13 Mbeya TZ.MB NA
14 14 227 TZA Tanzania 14 Morogoro TZ.MO NA
15 15 227 TZA Tanzania 15 Mtwara TZ.MT NA
16 16 227 TZA Tanzania 16 Mwanza TZ.MZ NA
17 17 227 TZA Tanzania 17 Njombe TZ.NJ NA
18 18 227 TZA Tanzania 18 Pemba North TZ.PN NA
19 19 227 TZA Tanzania 19 Pemba South TZ.PS NA
20 20 227 TZA Tanzania 20 Pwani TZ.PW NA
21 21 227 TZA Tanzania 21 Rukwa TZ.RU NA
22 22 227 TZA Tanzania 22 Ruvuma TZ.RV NA
23 23 227 TZA Tanzania 23 Shinyanga TZ.SY NA
24 24 227 TZA Tanzania 24 Simiyu TZ.SI NA
25 25 227 TZA Tanzania 25 Singida TZ.SD NA
26 26 227 TZA Tanzania 26 Tabora TZ.TB NA
27 27 227 TZA Tanzania 27 Tanga TZ.TN NA
28 28 227 TZA Tanzania 28 Zanzibar North TZ.ZN NA
29 29 227 TZA Tanzania 29 Zanzibar South and Central TZ.ZS NA
30 30 227 TZA Tanzania 30 Zanzibar West TZ.ZW NA
CCA_1 TYPE_1 ENGTYPE_1 NL_NAME_1 VARNAME_1
1 02 Mkoa Region
2 07 Mkoa Region Dar-es-salaam|Pwani
3 01 Mkoa Region
4 25 Mkoa Region
5 11 Mkoa Region
6 18 Mkoa Region
7 23 Mkoa Region
8 16 Mkoa Region
9 03 Mkoa Region
10 08 Mkoa Region
11 21 Mkoa Region
12 20 Mkoa Region
13 12 Mkoa Region
14 05 Mkoa Region
15 09 Mkoa Region
16 19 Mkoa Region
17 22 Mkoa Region
18 54 Mkoa Region Kaskazini Pemba
19 55 Mkoa Region Kusini Pemba
20 06 Mkoa Region
21 15 Mkoa Region
22 10 Mkoa Region
23 17 Mkoa Region
24 24 Mkoa Region
25 13 Mkoa Region
26 14 Mkoa Region
27 04 Mkoa Region
28 51 Mkoa Region Kaskazini Unguja
29 52 Mkoa Region Kusini Unguja
30 53 Mkoa Region Mjini Magharibi
You can see that the variable NAME_1 of this data frame contains the name of the regions. You can also see that alternative names of the regions are contained in the VARNAME_1 variable. In order to be consistent with the names used in tz we need to compute a region variable using both NAME_1 and VARNAME_1:
> tz1@data %<>%
+ mutate(region = ifelse(VARNAME_1 == "", NAME_1, VARNAME_1) %>%
+ sub("Dar-es-salaam.*$", "Dar Es Salaam", .))
In order to map some data, you first need to add these data to this data frame, in the form of a merging. A obvious way to do the merging here is to do it by regions’s names. But for that we first need to make sure that the regions’ names in the tz1 and the tz object are the same. Let’s check that:
> sort(levels(tz$v024))
[1] " arusha" " dar es salaam" " kilimanjaro"
[4] " morogoro" " pwani" " tanga"
[7] "dodoma" "geita" "iringa"
[10] "kagera" "kaskazini pemba" "kaskazini unguja"
[13] "katavi" "kigoma" "kusini pemba"
[16] "kusini unguja" "lindi" "manyara"
[19] "mara" "mbeya" "mjini magharibi"
[22] "mtwara" "mwanza" "njombe"
[25] "rukwa" "ruvuma" "shinyanga"
[28] "simiyu" "singida" "tabora"
We can see that there are 2 problems:
tz some regions names contain heading spaces that should not be there;tz are in small caps whereas first letters of names in tz1 are large cap. Let’s fix it:> tz$v024 <- factor(stringr::str_to_title(trimws(as.character(tz$v024))))
And now, let’s check that everything is in order:
> data.frame(sort(unique(tz1@data$region)), sort(as.character(levels(tz$v024))))
sort.unique.tz1.data.region.. sort.as.character.levels.tz.v024...
1 Arusha Arusha
2 Dar Es Salaam Dar Es Salaam
3 Dodoma Dodoma
4 Geita Geita
5 Iringa Iringa
6 Kagera Kagera
7 Kaskazini Pemba Kaskazini Pemba
8 Kaskazini Unguja Kaskazini Unguja
9 Katavi Katavi
10 Kigoma Kigoma
11 Kilimanjaro Kilimanjaro
12 Kusini Pemba Kusini Pemba
13 Kusini Unguja Kusini Unguja
14 Lindi Lindi
15 Manyara Manyara
16 Mara Mara
17 Mbeya Mbeya
18 Mjini Magharibi Mjini Magharibi
19 Morogoro Morogoro
20 Mtwara Mtwara
21 Mwanza Mwanza
22 Njombe Njombe
23 Pwani Pwani
24 Rukwa Rukwa
25 Ruvuma Ruvuma
26 Shinyanga Shinyanga
27 Simiyu Simiyu
28 Singida Singida
29 Tabora Tabora
30 Tanga Tanga
> identical(sort(unique(tz1@data$region)), sort(as.character(levels(tz$v024))))
[1] TRUE
That’s perfect, now we can do some mapping! Let’s say you have the following data that you want to map:
> fake_data <- data.frame(region = sort(unique(tz1@data$region)),
+ prevalence = runif(30))
That looks like:
> fake_data
region prevalence
1 Arusha 0.71728459
2 Dar Es Salaam 0.41555970
3 Dodoma 0.64026440
4 Geita 0.71475903
5 Iringa 0.04204099
6 Kagera 0.77923156
7 Kaskazini Pemba 0.28061829
8 Kaskazini Unguja 0.70913978
9 Katavi 0.51097488
10 Kigoma 0.91588271
11 Kilimanjaro 0.64929853
12 Kusini Pemba 0.32436069
13 Kusini Unguja 0.90943005
14 Lindi 0.54887290
15 Manyara 0.22672338
16 Mara 0.01194075
17 Mbeya 0.74136891
18 Mjini Magharibi 0.87527528
19 Morogoro 0.10128811
20 Mtwara 0.65647817
21 Mwanza 0.26264764
22 Njombe 0.38125004
23 Pwani 0.65449194
24 Rukwa 0.74027565
25 Ruvuma 0.16806927
26 Shinyanga 0.77780351
27 Simiyu 0.78997908
28 Singida 0.33727974
29 Tabora 0.44200809
30 Tanga 0.32160170
The first step is to merge this data to the tz1 object:
> tz1b <- merge(tz1, fake_data, by = "region")
You can check now that the prevalence data has been added to the data frame slot of the tz1 object:
> head(tz1b@data)
region OBJECTID ID_0 ISO NAME_0 ID_1 NAME_1 HASC_1 CCN_1
1 Arusha 1 227 TZA Tanzania 1 Arusha TZ.AS NA
2 Dar Es Salaam 2 227 TZA Tanzania 2 Dar es Salaam TZ.DS NA
3 Dodoma 3 227 TZA Tanzania 3 Dodoma TZ.DO NA
4 Geita 4 227 TZA Tanzania 4 Geita TZ.GE NA
5 Iringa 5 227 TZA Tanzania 5 Iringa TZ.IG NA
6 Kagera 6 227 TZA Tanzania 6 Kagera TZ.KG NA
CCA_1 TYPE_1 ENGTYPE_1 NL_NAME_1 VARNAME_1 prevalence
1 02 Mkoa Region 0.71728459
2 07 Mkoa Region Dar-es-salaam|Pwani 0.41555970
3 01 Mkoa Region 0.64026440
4 25 Mkoa Region 0.71475903
5 11 Mkoa Region 0.04204099
6 18 Mkoa Region 0.77923156
Once this is done, you can use the spplot function to make a map of the prevalence this way:
> spplot(subset(tz1b, , prevalence))
You can change the color scale this way:
> spplot(subset(tz1b, , prevalence), col.regions = rev(heat.colors(30)))
Read the help of the spplot function for more functionalities.
Here we show how you could use plot instead of spplot. On this example, I also show how you can add labels to the polygons. This is not dependent of the use of plot and could very well be used with spplot too.
First, we need to extract the information we need:
> labpt <- tz1b@polygons %>%
+ sapply(function(x) x@labpt) %>%
+ t() %>%
+ data.frame() %>%
+ setNames(c("x", "y")) %>%
+ cbind(tz1b@data[, c("region", "prevalence")])
Then, we need to define a function that add the label the way we want:
> add_info <- function(df, eps) {
+ with(df, {
+ text(x, y + eps, region)
+ text(x, y - eps,
+ paste0(100 * round(prevalence, 2), "%"))
+ })
+ }
where the eps argument is half the vertical distance (in latitude decimal degree) we want between the polygon’s name and the prevalence value. Also, for the map to be easy to read, we need to plot the mainland and the islands separately. Here is the pattern that targets the islands:
> islands <- "Pemba|Unguja|Mjini"
The number of color categories we want is
> n <- 12
And, finally, the coordinates of the box:
> add_box <- function()
+ polygon(rep(c(38.9, 40.15), 2)[c(1, 2, 4, 3)], rep(c(-6.54, -4.8), each = 2))
Now, we have all the ingredients to draw the two maps. First the mainland:
> plot(tz1b, col = rev(heat.colors(n))[cut(tz1b@data$prevalence, n)])
> labpt %>%
+ filter(region %>% grepl(pattern = islands) %>% `!`) %>%
+ add_info(.2)
> add_box()
And now, the islands:
> sel <- grep(islands, labpt$region)
> plot(tz1b[sel, ], col = rev(heat.colors(n))[cut(tz1b@data$prevalence, n)][sel])
> labpt %>%
+ filter(region %>% grepl(pattern = islands)) %>%
+ add_info(.06)
> add_box()