In this tutorial, we’ll learn how to analyze basic survey data and plot the results on a map using R, using the March PCM data as an example. This illustrates the use of R as a front-to-end solution for reading, exploring, and cleaning the data, performing a statistical analysis, and summarizing the results.
In addition to the packages loaded in this script, you will need the PCM results and shapefiles.
other resources
This tutorial is not a complete explanation of R syntax or it’s plotting functions, or on survey statistics. A great resource for learning about R and plotting maps is the spatial.ly blog. The survey analysis is done with the survey
package, which has an accompanying journal article, a nice companion book, with very few equations, and even a online course for those who want to learn more about how to analyze survey data.
First, we’ll load the neccesary packages and data into R:
# I'm setting the working directory to where I've stored my shape files and survey results
setwd("~/Dropbox/Pakistan/markdown")
library(maptools) # a package for dealing with shapefiles/spatial data
library(survey) # a package for analyzing clustered survey data
library(ggplot2) # a for making pretty plots
library(RColorBrewer) # a package that has some pretty colors for the maps
library(dplyr) # a useful package for data manipulation
pcm <- read.csv("apex_pcm_march_2015.csv", header=TRUE,stringsAsFactors=FALSE)
map <- readShapePoly("pakistan_district.shp")
First, lets take a look at the data, using the str()
command. The data consists of location info (province, district, uc, village
) for 63,758 children, along with whether the child was vaccinated in the previous campaign according to caregiver recall (vaccinated
) or by finger marked (finger
)
str(pcm)
## 'data.frame': 63737 obs. of 12 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ serial : int 1 2 3 4 5 6 7 8 9 10 ...
## $ form : chr "39301" "39301" "39302" "39302" ...
## $ province : chr "AJK" "AJK" "AJK" "AJK" ...
## $ district : chr "bagh" "bagh" "bagh" "bagh" ...
## $ tehsil : chr "345" "345" "345" "345" ...
## $ uc : int 1952 1952 1952 1952 1952 1952 1952 1952 1952 1952 ...
## $ village : int 2621 2621 2621 2621 2621 2621 2621 2621 2621 2621 ...
## $ age : chr "6" "54" "59" "36" ...
## $ vaccinated: chr "Yes" "Yes" "Yes" "Yes" ...
## $ finger : chr "Yes" "Yes" "Yes" "Yes" ...
## $ reason : chr "." "." "." "." ...
The table()
function will count how many times a variable takes on each value. We can use it to calculate how many children are in each province, or in each district
table(pcm$province)
##
## 28 6 AJK BALOCHISTAN KPK PUNJAB SINDH
## 1 2945 5021 6278 14005 18964 16523
#I dodn't want to print all the districts, so take a subset of the first 10, using the subset `[1:10]`, where and `1:10` indicating I wanted the first through the 10th entry.
table(pcm$district)[1:10]
##
## 347 367 abotabad astore attock badin bagh bahawalpur bahwlnagar
## 2 1 438 562 442 543 474 512 498
## bannu
## 461
#removing numeric districts
pcm <- filter(pcm, district != 347, district != 367)
# equvalent to
#pcm <- subset(pcm,!(pcm$district %in% c(347,367) ))
# or
#pcm <- pcm[(pcm$district != 347) & (pcm$district != 347),]
Note that some districts and provinces are given as numbers in the above. At this point I would ask the person who supplied the data for the names of these districts, or I might try to find out what district they correspond to by looking at the tehsil or UC. Here, I just removed them.
We can use the summary
function to tell us more about a numeric variable:
#notince age is stored as a character in the above, I'll first change it to a numeric variable:
pcm$age <- as.numeric(pcm$age)
#Some NAs are returned when I convert age to a character. This means some of the ages in the file were not numbers.
summary(pcm$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00 15.00 30.00 29.55 45.00 59.00 7
Now, we want to analyze the results from the survey. There are a number of packages available to do this, but the best in my opinion is the survey
package. You can see some examples of this package by looking at the vignettes and help pages (help(package="survey")
is a good place to start).
This is more of an R
tutorial than a survey course, so I won’t dwell on the details.
#changing an option of the package, so it removes districts with only 1 cluster, rather than throw an error
options(survey.lonely.psu="remove")
# change `finger' to a 0/1 numeric variable
pcm$finger <- as.numeric(pcm$finger=="Yes")
## Create a `survey.design` object. Basically telling R about the design of the survey, so it knows how to analyze it.
## PCM is a cluster survey, which (I believe) is stratified by district. I choose `probs~1', since PCM uses PPS to select districts.
pcm.svy <- svydesign(~village, strata=~district, nest=TRUE, probs = ~1,data = pcm)
## svymean calculates the proportion of fingers marked.
## To apply svymean to each district, I wrap it in `svyby':
result <- svyby(~finger, ~district,design=pcm.svy,deff="replace",svymean,vartype=c("ci","se"))
#result is a dataframe which gives coverage and corresponding standard error, confidence interval, and design effect for each district
head(result)
## district finger se ci_l ci_u DEff.finger
## abotabad abotabad 0.8105023 0.061254286 0.6904461 0.9305585 10.6756955
## astore astore 0.7330961 0.045549223 0.6438212 0.8223709 5.9485205
## attock attock 0.9932127 0.004900455 0.9836080 1.0028174 1.5709783
## badin badin 0.9963168 0.002572840 0.9912741 1.0013594 0.9776808
## bagh bagh 0.9662447 0.016118830 0.9346524 0.9978371 3.7679001
## bahawalpur bahawalpur 0.9863281 0.007929576 0.9707864 1.0018698 2.3827111
Now, we can save the result of the analyisis and stop here:
write.csv(result, file="apex_pcm_march_2015_coverage_by_district.csv")
From there, we could plot the data in, e.g. Excell, or ArcGIS. Here, I’ll show you how to plot the data in R.
There are LOTS of ways to plot data in R. R comes out of the box with `base’ graphics:
#histogram of coverage
hist(x = result$finger, xlab="Coverage")
##boxplot:
boxplot(result$finger)
##scatter plot of the `design effect' vs standard error:
plot(x=result$DEff,y=result$se)
## we can even plot the map:
plot(map)
R also comes with add-on plotting packages, the most popular of which is ggplot2
. gg
stands for the grammer of graphics
, and is a structured language to describe graphs. ggplot works by mapping variables in a data frame to aesthetic properties on a graph. All ggplots start with the ggplot()
function, and are built up by adding layers to that plot:
ggplot(aes(x=finger),data=result) + geom_histogram()
ggplot(aes(x=DEff.finger,y=se),data=result) + geom_point() + geom_smooth()
geom_map()
Here, we want to make a map. This is done by adding a geom_map()
layer to a ggplot, in which we have to specify a map_id
which links the survey results to places on a maps.
#first need to change the `map' object in to a data frame, and rename long and lat to x and y:
# Note, here I'm setting region="district", which matches the district variable in the survey results:
map.df <- fortify(map,region="district")
map.df <- rename(map.df,x=long,y=lat)
g <- ggplot(data=result) + geom_map(aes(fill=finger, map_id=district),map=map.df) + expand_limits(map.df)
g
Now, in order to make this pretty, we need to do a lot of tweaking. The good thing though, is that once you do this once, you can save the code and re-use it to make many similar types of plots. First, let’s change the color scheme and background. Note, we can take the graph (which I called `g’) and just add layers/options to it.
mycolors <- brewer.pal(9,"RdYlGn")
g + theme_bw() + scale_fill_gradientn(name="Coverage", colours = mycolors)
Now, a number of districts are missing from the plot - this is because they aren’t included in the survey. We can include them in the plot by adding them into the results with NA
as their coverage value:
result <- merge(result,map@data,all.y = T)
g <- ggplot(data=result) +
geom_map(aes(fill=finger, map_id=district),map=map.df) + expand_limits(map.df)+
theme_bw() +
scale_fill_gradientn(name="Coverage", colours = mycolors,na.value = "grey",limits=c(.4,1))
g
I often add a bunch of other options to make it prettier:
pak_cities <- data.frame(
city= c("Islamabad","Lahore","Quetta","Peshawar","Karachi"),
long=c(73.066667, 74.343611, 67.000000, 71.583333,67.01),
lat=c(33.716667, 31.549722, 30.183333, 34.016667,24.86),
hjust=c(0,0,1,1,1),
vjust=c(-0.5,-0.5,-0.5,-0.5,0)
)
pak_format <- list(
xlab(""),
ylab(""),
coord_map("polyconic"),
geom_point(aes(long,lat),data=pak_cities,size=2),
geom_text(aes(x=long,y=lat,label=city,hjust=hjust,vjust=vjust),face="bold",size=3,data=pak_cities),
theme_bw(),
theme(legend.justification=c(1,0),legend.position=c(1,0), legend.background=element_rect(colour="black"))
)
g+ pak_format + ggtitle("Coverage, based on Finger Marking")
Now, for instance, if I want to map the lower confidence limit, rather than the coverage, I can just copy my code from above, and change fill=finger
to fill=ci_l
.
ggplot(data=result) +
geom_map(aes(fill=ci_l, map_id=district),map=map.df) + expand_limits(map.df)+
theme_bw() +
scale_fill_gradientn(name="Coverage", colours = mycolors,na.value = "grey",limits=c(.4,1)) +
pak_format + ggtitle("Coverage, based on lower confidence limit")
We could also look at the design effect by district, which gives evidence for sub-district heterogeneity. Again, I just do this by replacing fill=ci_l
with fill=DEff.finger
:
ggplot(data=result) +
geom_map(aes(fill=DEff.finger, map_id=district),map=map.df) +
expand_limits(map.df)+
theme_bw()+
scale_fill_gradientn(name="Coverage", colours = rev(mycolors),na.value = "grey") +
pak_format + ggtitle("Design Effect, by district")
Khushab in Punjab stands out in the map. Drilling down,we can find that there are villages where nearly every child was missed, while most villages had 100% coverage.
We can get at this with a few dplyr
commands, which filter
the pcm data to the right district, separate it into groups defined by village, and summarise
the coverage in each of those groups, and the arrange
the results by coverage:
pcm %>% filter(district == "khushab") %>%
group_by(village) %>%
summarise(coverage = mean(finger),
vaccinated=sum(finger),
total=length(finger)) %>%
arrange(coverage)
## Source: local data frame [20 x 4]
##
## village coverage vaccinated total
## 1 302 0.00000000 0 27
## 2 301 0.03703704 1 27
## 3 320 0.37931034 11 29
## 4 305 0.70000000 14 20
## 5 303 1.00000000 30 30
## 6 304 1.00000000 24 24
## 7 306 1.00000000 22 22
## 8 307 1.00000000 24 24
## 9 308 1.00000000 22 22
## 10 309 1.00000000 24 24
## 11 310 1.00000000 24 24
## 12 311 1.00000000 27 27
## 13 312 1.00000000 26 26
## 14 313 1.00000000 24 24
## 15 314 1.00000000 23 23
## 16 315 1.00000000 22 22
## 17 316 1.00000000 25 25
## 18 317 1.00000000 21 21
## 19 318 1.00000000 19 19
## 20 319 1.00000000 23 23
Here, I’m calculating the average age in each district, which might tell us about systematic biases in the survey. This is done in nearly an identical way to the analysis of coverage.
result2 <- svyby(~age, ~district,design=pcm.svy,deff="replace",svymean,vartype=c("se","ci"))
result2 <- merge(result2,map@data,all.y=T)
mycolors <- brewer.pal(9,"BrBG")
ggplot(data=result2) +
geom_map(aes(fill=age, map_id=district),map=map.df) + expand_limits(map.df)+
theme_bw() +
scale_fill_gradientn(name="Average age", colours = mycolors,na.value = "grey") +
pak_format
From this plot, we can see areas where the survey is biased toward either younger children (brown), or older children (blue). The survey looks especially biased towards younger children in Baldia (Sindh), Washuk, and dbugti (Baloch.), where children are about 6 months younger than other children in the survey
To see where this is systematic, we would want to look at the confidence intervals, and compare them to the overall age distribution of the survey.
Pehaps more interesting is whether coverage varies with age, or other demographic variables, as this could help target efforts for improvement:
mod <- svyglm(finger~age,design=pcm.svy,family="binomial")
summary(mod)
##
## Call:
## svyglm(formula = finger ~ age, design = pcm.svy, family = "binomial")
##
## Survey design:
## svydesign(~village, strata = ~district, nest = TRUE, probs = ~1,
## data = pcm)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.7706077 0.0567256 48.842 <2e-16 ***
## age 0.0004209 0.0011571 0.364 0.716
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1.000016)
##
## Number of Fisher Scoring iterations: 5
The p-value for age, given in the variable Pr(>|t|)
is large, suggesting there is no overall evidence that coverage varies with age in this survey.