In this tutorial, we’ll learn how to analyze basic survey data and plot the results on a map using R, using the March PCM data as an example. This illustrates the use of R as a front-to-end solution for reading, exploring, and cleaning the data, performing a statistical analysis, and summarizing the results.

In addition to the packages loaded in this script, you will need the PCM results and shapefiles.

other resources

This tutorial is not a complete explanation of R syntax or it’s plotting functions, or on survey statistics. A great resource for learning about R and plotting maps is the spatial.ly blog. The survey analysis is done with the survey package, which has an accompanying journal article, a nice companion book, with very few equations, and even a online course for those who want to learn more about how to analyze survey data.

Reading in the data

First, we’ll load the neccesary packages and data into R:

# I'm setting the working directory to where I've stored my shape files and survey results
setwd("~/Dropbox/Pakistan/markdown") 

library(maptools) # a package for dealing with shapefiles/spatial data
library(survey) # a package for analyzing clustered survey data
library(ggplot2) # a for making pretty plots
library(RColorBrewer) # a package that has some pretty colors for the maps
library(dplyr) # a useful package for data manipulation

pcm <- read.csv("apex_pcm_march_2015.csv", header=TRUE,stringsAsFactors=FALSE)
map <- readShapePoly("pakistan_district.shp")

Taking a look at the data

First, lets take a look at the data, using the str() command. The data consists of location info (province, district, uc, village) for 63,758 children, along with whether the child was vaccinated in the previous campaign according to caregiver recall (vaccinated) or by finger marked (finger)

str(pcm)
## 'data.frame':    63737 obs. of  12 variables:
##  $ X         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ serial    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ form      : chr  "39301" "39301" "39302" "39302" ...
##  $ province  : chr  "AJK" "AJK" "AJK" "AJK" ...
##  $ district  : chr  "bagh" "bagh" "bagh" "bagh" ...
##  $ tehsil    : chr  "345" "345" "345" "345" ...
##  $ uc        : int  1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
##  $ village   : int  2621 2621 2621 2621 2621 2621 2621 2621 2621 2621 ...
##  $ age       : chr  "6" "54" "59" "36" ...
##  $ vaccinated: chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ finger    : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ reason    : chr  "." "." "." "." ...

The table() function will count how many times a variable takes on each value. We can use it to calculate how many children are in each province, or in each district

table(pcm$province)
## 
##          28           6         AJK BALOCHISTAN         KPK      PUNJAB       SINDH 
##           1        2945        5021        6278       14005       18964       16523
#I dodn't want to print all the districts, so take a subset of the first 10, using the subset `[1:10]`, where and `1:10` indicating I wanted the first through the 10th entry.
table(pcm$district)[1:10] 
## 
##        347        367   abotabad     astore     attock      badin       bagh bahawalpur bahwlnagar 
##          2          1        438        562        442        543        474        512        498 
##      bannu 
##        461
#removing numeric districts
pcm <- filter(pcm, district != 347, district != 367) 
# equvalent to
#pcm <- subset(pcm,!(pcm$district %in% c(347,367) ))
# or
#pcm <- pcm[(pcm$district != 347) & (pcm$district != 347),] 

Note that some districts and provinces are given as numbers in the above. At this point I would ask the person who supplied the data for the names of these districts, or I might try to find out what district they correspond to by looking at the tehsil or UC. Here, I just removed them.

We can use the summary function to tell us more about a numeric variable:

#notince age is stored as a character in the above, I'll first change it to a numeric variable:
pcm$age <- as.numeric(pcm$age)

#Some NAs are returned when I convert age to a character. This means some of the ages in the file were not numbers.
summary(pcm$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    0.00   15.00   30.00   29.55   45.00   59.00       7

Analyzing survey results

Now, we want to analyze the results from the survey. There are a number of packages available to do this, but the best in my opinion is the survey package. You can see some examples of this package by looking at the vignettes and help pages (help(package="survey") is a good place to start).

This is more of an R tutorial than a survey course, so I won’t dwell on the details.

#changing an option of the package, so it removes districts with only 1 cluster, rather than throw an error
options(survey.lonely.psu="remove") 

# change `finger' to a 0/1 numeric variable
pcm$finger <- as.numeric(pcm$finger=="Yes")

## Create a `survey.design` object. Basically telling R about the design of the survey, so it knows how to analyze it.
## PCM is a cluster survey, which (I believe) is stratified by district. I choose `probs~1', since PCM uses PPS to select districts.
pcm.svy <- svydesign(~village, strata=~district, nest=TRUE, probs = ~1,data = pcm)

## svymean calculates the proportion of fingers marked.
## To apply svymean to each district, I wrap it in `svyby':
result <- svyby(~finger, ~district,design=pcm.svy,deff="replace",svymean,vartype=c("ci","se"))

#result is a dataframe which gives coverage and corresponding standard error, confidence interval, and design effect for each district
head(result)
##              district    finger          se      ci_l      ci_u DEff.finger
## abotabad     abotabad 0.8105023 0.061254286 0.6904461 0.9305585  10.6756955
## astore         astore 0.7330961 0.045549223 0.6438212 0.8223709   5.9485205
## attock         attock 0.9932127 0.004900455 0.9836080 1.0028174   1.5709783
## badin           badin 0.9963168 0.002572840 0.9912741 1.0013594   0.9776808
## bagh             bagh 0.9662447 0.016118830 0.9346524 0.9978371   3.7679001
## bahawalpur bahawalpur 0.9863281 0.007929576 0.9707864 1.0018698   2.3827111

Plotting in R (in general)

Now, we can save the result of the analyisis and stop here:

write.csv(result, file="apex_pcm_march_2015_coverage_by_district.csv")

From there, we could plot the data in, e.g. Excell, or ArcGIS. Here, I’ll show you how to plot the data in R.

There are LOTS of ways to plot data in R. R comes out of the box with `base’ graphics:

#histogram of coverage
hist(x = result$finger, xlab="Coverage")

##boxplot:
boxplot(result$finger)

##scatter plot of the `design effect' vs standard error:
plot(x=result$DEff,y=result$se)

## we can even plot the map:
plot(map)

R also comes with add-on plotting packages, the most popular of which is ggplot2. gg stands for the grammer of graphics, and is a structured language to describe graphs. ggplot works by mapping variables in a data frame to aesthetic properties on a graph. All ggplots start with the ggplot() function, and are built up by adding layers to that plot:

ggplot(aes(x=finger),data=result) + geom_histogram()

ggplot(aes(x=DEff.finger,y=se),data=result) + geom_point() + geom_smooth()

Plotting maps in ggplot with geom_map()

Here, we want to make a map. This is done by adding a geom_map() layer to a ggplot, in which we have to specify a map_id which links the survey results to places on a maps.

#first need to change the `map' object in to a data frame, and rename long and lat to x and y:
# Note, here I'm setting region="district", which matches the district variable in the survey results:
map.df <- fortify(map,region="district")
map.df <- rename(map.df,x=long,y=lat)

g <- ggplot(data=result) + geom_map(aes(fill=finger, map_id=district),map=map.df) + expand_limits(map.df)
g

Now, in order to make this pretty, we need to do a lot of tweaking. The good thing though, is that once you do this once, you can save the code and re-use it to make many similar types of plots. First, let’s change the color scheme and background. Note, we can take the graph (which I called `g’) and just add layers/options to it.

mycolors <- brewer.pal(9,"RdYlGn")
g + theme_bw() + scale_fill_gradientn(name="Coverage", colours = mycolors)

Now, a number of districts are missing from the plot - this is because they aren’t included in the survey. We can include them in the plot by adding them into the results with NA as their coverage value:

result <- merge(result,map@data,all.y = T)
g <- ggplot(data=result) + 
  geom_map(aes(fill=finger, map_id=district),map=map.df) + expand_limits(map.df)+ 
  theme_bw() + 
  scale_fill_gradientn(name="Coverage", colours = mycolors,na.value = "grey",limits=c(.4,1))
g

I often add a bunch of other options to make it prettier:

pak_cities <- data.frame(
  city= c("Islamabad","Lahore","Quetta","Peshawar","Karachi"),
  long=c(73.066667,  74.343611, 67.000000, 71.583333,67.01),
  lat=c(33.716667, 31.549722, 30.183333, 34.016667,24.86),
  hjust=c(0,0,1,1,1),
  vjust=c(-0.5,-0.5,-0.5,-0.5,0)
)
pak_format <- list(
  xlab(""), 
  ylab(""),
  coord_map("polyconic"),
  geom_point(aes(long,lat),data=pak_cities,size=2),
  geom_text(aes(x=long,y=lat,label=city,hjust=hjust,vjust=vjust),face="bold",size=3,data=pak_cities),
  theme_bw(),
  theme(legend.justification=c(1,0),legend.position=c(1,0), legend.background=element_rect(colour="black"))
)

g+ pak_format + ggtitle("Coverage, based on Finger Marking")

Now, for instance, if I want to map the lower confidence limit, rather than the coverage, I can just copy my code from above, and change fill=finger to fill=ci_l.

ggplot(data=result) + 
  geom_map(aes(fill=ci_l, map_id=district),map=map.df) + expand_limits(map.df)+ 
  theme_bw() + 
  scale_fill_gradientn(name="Coverage", colours = mycolors,na.value = "grey",limits=c(.4,1)) + 
  pak_format + ggtitle("Coverage, based on lower confidence limit")

We could also look at the design effect by district, which gives evidence for sub-district heterogeneity. Again, I just do this by replacing fill=ci_l with fill=DEff.finger:

ggplot(data=result) + 
  geom_map(aes(fill=DEff.finger, map_id=district),map=map.df) + 
  expand_limits(map.df)+ 
  theme_bw()+
  scale_fill_gradientn(name="Coverage", colours = rev(mycolors),na.value = "grey") + 
  pak_format + ggtitle("Design Effect, by district")

Khushab in Punjab stands out in the map. Drilling down,we can find that there are villages where nearly every child was missed, while most villages had 100% coverage.

We can get at this with a few dplyr commands, which filter the pcm data to the right district, separate it into groups defined by village, and summarise the coverage in each of those groups, and the arrange the results by coverage:

pcm %>% filter(district == "khushab") %>% 
  group_by(village) %>% 
  summarise(coverage = mean(finger),
            vaccinated=sum(finger),
            total=length(finger)) %>% 
  arrange(coverage)
## Source: local data frame [20 x 4]
## 
##    village   coverage vaccinated total
## 1      302 0.00000000          0    27
## 2      301 0.03703704          1    27
## 3      320 0.37931034         11    29
## 4      305 0.70000000         14    20
## 5      303 1.00000000         30    30
## 6      304 1.00000000         24    24
## 7      306 1.00000000         22    22
## 8      307 1.00000000         24    24
## 9      308 1.00000000         22    22
## 10     309 1.00000000         24    24
## 11     310 1.00000000         24    24
## 12     311 1.00000000         27    27
## 13     312 1.00000000         26    26
## 14     313 1.00000000         24    24
## 15     314 1.00000000         23    23
## 16     315 1.00000000         22    22
## 17     316 1.00000000         25    25
## 18     317 1.00000000         21    21
## 19     318 1.00000000         19    19
## 20     319 1.00000000         23    23

Extra credit: demographic characteristics of survey population, and biases in vaccination

Here, I’m calculating the average age in each district, which might tell us about systematic biases in the survey. This is done in nearly an identical way to the analysis of coverage.

result2 <- svyby(~age, ~district,design=pcm.svy,deff="replace",svymean,vartype=c("se","ci"))
result2 <- merge(result2,map@data,all.y=T)

mycolors <- brewer.pal(9,"BrBG")
ggplot(data=result2) + 
  geom_map(aes(fill=age, map_id=district),map=map.df) + expand_limits(map.df)+ 
  theme_bw() + 
  scale_fill_gradientn(name="Average age", colours = mycolors,na.value = "grey") + 
  pak_format

From this plot, we can see areas where the survey is biased toward either younger children (brown), or older children (blue). The survey looks especially biased towards younger children in Baldia (Sindh), Washuk, and dbugti (Baloch.), where children are about 6 months younger than other children in the survey

To see where this is systematic, we would want to look at the confidence intervals, and compare them to the overall age distribution of the survey.

Pehaps more interesting is whether coverage varies with age, or other demographic variables, as this could help target efforts for improvement:

mod <- svyglm(finger~age,design=pcm.svy,family="binomial")
summary(mod)
## 
## Call:
## svyglm(formula = finger ~ age, design = pcm.svy, family = "binomial")
## 
## Survey design:
## svydesign(~village, strata = ~district, nest = TRUE, probs = ~1, 
##     data = pcm)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.7706077  0.0567256  48.842   <2e-16 ***
## age         0.0004209  0.0011571   0.364    0.716    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1.000016)
## 
## Number of Fisher Scoring iterations: 5

The p-value for age, given in the variable Pr(>|t|) is large, suggesting there is no overall evidence that coverage varies with age in this survey.