Analysis
Read in data.
mydata <- read.csv("https://raw.githubusercontent.com/hgeiger-CUNY/Data608_Spring2019/master/izb_odp_final_03262019.csv",header=TRUE,stringsAsFactors=FALSE)
#Separate lines where county="California" (statewide totals) from the remainder of the data.
statewide_totals <- mydata[mydata$county == "California",]
mydata <- mydata[mydata$county != "California",]
Let’s start with a simple look at the number of cases per disease, over the whole time span.
sum_per_var <- mydata %>% group_by(disease) %>% summarise(total = sum(count))
sum_per_var <- data.frame(sum_per_var,stringsAsFactors=FALSE)
sum_per_var[,1] <- factor(sum_per_var[,1],levels=as.vector(sum_per_var[,1])[order(sum_per_var[,2])])
ggplot(sum_per_var,
aes(x=disease,y=total)) +
geom_bar(stat="identity",fill="lightgrey",col="black") +
ylab("Total cases") +
xlab("") +
ggtitle("Statewide cases of vaccine-preventable diseases (VPDs)\nTotal 2001-2017") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) +
coord_flip()

Pertussis makes up the overwhelming majority of cases. So, it seems like just looking at the sum of cases for all VPDs together is the best approach. Let’s proceed that way for the remainder of the analysis.
Plot by year and by county for all VPDs combined.
sum_per_var <- statewide_totals %>% group_by(year) %>% summarise(total = sum(count))
ggplot(sum_per_var,
aes(x=year,y=total)) +
geom_point(shape=21) +
geom_line() +
xlab("Year") +
ylab("Total cases") +
ggtitle("Statewide cases of all VPDs, 2001-2017")

sum_per_var <- mydata %>% group_by(county) %>% summarise(total = sum(count))
sum_per_var <- data.frame(sum_per_var,stringsAsFactors=FALSE)
sum_per_var$county <- as.vector(sum_per_var$county)
sum_per_var <- sum_per_var[order(sum_per_var$total,decreasing=TRUE),]
sum_per_var <- sum_per_var[1:20,]
sum_per_var$county <- factor(sum_per_var$county,levels=rev(sum_per_var$county))
ggplot(sum_per_var,
aes(x=county,y=total)) +
geom_bar(stat="identity",fill="lightgrey",col="black") +
ylab("Total cases") +
xlab("") +
ggtitle("Statewide cases of all VPDs in top 20 counties, 2001-2017") +
coord_flip()

Top counties are generally highly populated (e.g. Los Angeles).
This highlights the need to normalize by population.
Let’s get the cases per county per year as a rate per 10,000 residents.
population <- read.csv("https://raw.githubusercontent.com/hgeiger-CUNY/Data608_Spring2019/master/California_county_populations_2001_to_2017.csv",header=TRUE,check.names=FALSE,stringsAsFactors=FALSE)
colnames(population)[1] <- "county"
population <- population[population$county != "California",]
population <- gather(population,key="year",value="population",-county)
population$year <- as.numeric(population$year)
mydata <- mydata %>% group_by(county,year) %>% summarise(cases = sum(count))
mydata <- merge(mydata,population,by=c("county","year"))
mydata <- mydata %>% mutate(cases.per.10k = cases/(population/10000))
For each year, plot the median cases per 10,000 residents rate across the 58 counties.
If we still see increases in 2010 and 2014, that would suggest that the increase in VPD cases was a trend across multiple counties. Whereas if the trend in this plot is relatively flat, it would instead suggest that the increase in VPD cases may have come from one or two very populous counties.
median_normalized_case_rate_per_year <- mydata %>% group_by(year) %>% summarise(median.cases.per.10k = median(cases.per.10k))
ggplot(median_normalized_case_rate_per_year,
aes(x=year,y=median.cases.per.10k)) +
geom_line() +
geom_point(shape=21) +
xlab("Year") +
ylab("Median yearly countywide VPD cases per 10k residents")

Looks like after normalizing for county population differences, we still see huge spikes in VPD cases in 2010 and 2014 across multiple counties.
For the remainder of the analysis, I will focus on the incidence of VPDs in 2014 specifically. This year is chosen because of its high rate of vaccine-preventable diseases combined with its proximity to the available data on immunization rates.
Read in California geographic boundaries.
regions_OGR <- readOGR(dsn="CA_Counties")
map_regions <- spTransform(regions_OGR,CRS("+proj=longlat +datum=WGS84"))
map_regions_fortified <- fortify(map_regions)
map_regions_fortified <- merge(map_regions_fortified,data.frame(id = rownames(map_regions@data),county = map_regions$NAME),by="id")
Select VPD case data from 2014 only.
Then exclude counties with <50,000 residents, as just a handful of cases can inflate the VPD case rate for these.
yearly_cases_per_county_select_years <- mydata[mydata$year == 2014,]
yearly_cases_per_county_select_years[yearly_cases_per_county_select_years$population < 50000,"cases.per.10k"] <- NA #I checked and the same counties with population < 50,000 are found using either 2010 or 2014 population.
Read in data on immunization rates from the 2013-2014 school year.
Remove schools with a blank for enrollment, which means there is no data because there are less than 10 students.
Then, focus on counts for enrollment vs. PBE (personal belief exemption).
immunizations <- read.csv("https://raw.githubusercontent.com/hgeiger-CUNY/Data608_Spring2019/master/school_immunizations_in_kindergarten_by_academic_year_2013_2014_school_year.csv",header=TRUE,stringsAsFactors=FALSE)
immunizations <- immunizations[which(is.na(immunizations$Enrollment) == FALSE),]
immunizations$Count <- as.numeric(immunizations$Count)
immunizations <- spread(immunizations[,setdiff(colnames(immunizations),"Percent")],key="Category",value="Count")
immunizations <- immunizations[,c(colnames(immunizations)[1:10],"PBE")]
Summarize PBE rates by county.
colnames(immunizations)[3] <- "county"
immunizations <- immunizations %>% group_by(county) %>% summarise(total_enrolled = sum(Enrollment),total_PBE = sum(PBE)) %>% mutate(PBE_percent_enrolled = total_PBE*100/total_enrolled)
immunizations <- data.frame(immunizations,stringsAsFactors=FALSE)
Combine with population estimates. Let’s use 2013.
pop2013 <- population[population$year == 2013,c("county","population")]
pop2013$county <- toupper(pop2013$county)
immunizations <- merge(immunizations,pop2013,by="county")
Again, NA out low population counties.
immunizations$PBE_percent_enrolled[immunizations$population < 50000] <- NA
Plot the yearly VPD rate as a function of the PBE rate.
yearly_cases_per_county_select_years$county <- toupper(yearly_cases_per_county_select_years$county)
rate_2014_vs_PBE <- merge(yearly_cases_per_county_select_years,immunizations[,c("county","PBE_percent_enrolled")],by="county")
ggplot(rate_2014_vs_PBE,
aes(PBE_percent_enrolled,cases.per.10k)) +
geom_point(shape=21) +
xlab("Percent of enrolled kindergarteners in county with PBE") +
ylab("Yearly VPD cases per 10,000 residents, 2014") +
ggtitle("VPD vs. personal belief exemption rates per county") +
geom_vline(xintercept = 5,linetype=2) +
geom_hline(yintercept = 5,linetype=2)

We find that many counties with high PBE rates have low disease incidence rates, and vice versa.
However, this is not super informative without looking in geographic context. If for example all 4 counties with 2014 outbreaks that had relatively low vaccine exemption rates were directly adjacent to counties with high vaccine exemption rates, that could be a possible explanation.
Let’s map VPD case rates vs. PBE rates side-by-side, highlighting in each map counties that have high values for the other variable.
map_regions_fortified$county <- toupper(map_regions_fortified$county)
map_regions_fortified_add_data <- merge(map_regions_fortified,rate_2014_vs_PBE,by="county")
colnames(map_regions_fortified_add_data) <- plyr::mapvalues(colnames(map_regions_fortified_add_data),
from=c("cases.per.10k","PBE_percent_enrolled"),
to=c("VPDs/10k residents","% enrolled w/ PBE"))
blue_yellow_red_gradient <- colorRampPalette(rev(brewer.pal(n = 7, name ="RdYlBu")))(100)
county_centroids <- getSpPPolygonsLabptSlots(map_regions)
county_centroids <- data.frame(long = county_centroids[,1],lat = county_centroids[,2],county = map_regions$NAME,row.names=map_regions$NAME,group = as.numeric(rownames(map_regions@data)) + 0.1,cases.per.10k = rep(0,times=58),stringsAsFactors=FALSE)
county_centroids <- data.frame(long = county_centroids[,1],
lat = county_centroids[,2],
county = toupper(map_regions$NAME),
group = as.numeric(rownames(map_regions@data)) + 0.1,
row.names=toupper(map_regions$NAME),
stringsAsFactors=FALSE)
county_centroids$group <- factor(county_centroids$group,levels=levels(map_regions_fortified$group))
county_centroids <- merge(county_centroids,rate_2014_vs_PBE[,c("county","cases.per.10k","PBE_percent_enrolled")],by="county")
county_centroids_to_highlight <- county_centroids[which(is.na(county_centroids$cases.per.10k) == FALSE & county_centroids$cases.per.10k >= 5),]
colnames(county_centroids_to_highlight) <- plyr::mapvalues(colnames(county_centroids_to_highlight),
from=c("cases.per.10k","PBE_percent_enrolled"),
to=c("VPDs/10k residents","% enrolled w/ PBE"))
county_centroids_to_highlight <- data.frame(county_centroids_to_highlight,Dummy.var = "5+ VPDs/10k residents, 2014",stringsAsFactors=FALSE,check.names=FALSE)
PBE_plot <- ggplot(map_regions_fortified_add_data,
aes(x = long,y = lat,group = group,fill=`% enrolled w/ PBE`)) +
geom_polygon(colour="black") +
xlab("") +
ylab("") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.ticks.x = element_blank(),axis.ticks.y = element_blank(),axis.text.x = element_blank(),axis.text.y = element_blank()) +
scale_fill_gradientn(colours = blue_yellow_red_gradient,limits=c(0,10),oob=squish) +
geom_point(data=county_centroids_to_highlight,aes(x = long,y = lat,colour=Dummy.var),shape=8) +
scale_colour_manual(values="black") + labs(colour="")
county_centroids_to_highlight <- county_centroids[which(is.na(county_centroids$cases.per.10k) == FALSE & county_centroids$PBE_percent_enrolled >= 5),]
colnames(county_centroids_to_highlight) <- plyr::mapvalues(colnames(county_centroids_to_highlight),
from=c("cases.per.10k","PBE_percent_enrolled"),
to=c("VPDs/10k residents","% enrolled w/ PBE"))
county_centroids_to_highlight <- data.frame(county_centroids_to_highlight,Dummy.var = "5%+ enrolled w/ PBEs, 2013-2014",stringsAsFactors=FALSE,check.names=FALSE)
VPD_plot <- ggplot(map_regions_fortified_add_data,
aes(x = long,y = lat,group = group,fill=`VPDs/10k residents`)) +
geom_polygon(colour="black") +
xlab("") +
ylab("") +
theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.ticks.x = element_blank(),axis.ticks.y = element_blank(),axis.text.x = element_blank(),axis.text.y = element_blank()) +
scale_fill_gradientn(colours = blue_yellow_red_gradient,limits=c(0,10),oob=squish) +
geom_point(data=county_centroids_to_highlight,aes(x = long,y = lat,colour=Dummy.var),shape=17) +
scale_colour_manual(values="black") + labs(colour="")
grid.arrange(VPD_plot,PBE_plot,ncol=2,top="2013-2014 school year immunization data, PBE = personal belief exemption\n2014 disease incidence data, VPD = vaccine preventable disease\nExclude counties with <50,000 residents")

We find that thankfully, many counties (especially in the northeastern/eastern central part of the state) managed to get through 2014 with relatively low incidence of vaccine-preventable diseases, despite high rates of students whose parents claimed personal belief exemptions in lieu of vaccination.
On the flip side, though, we find that nearly all of the counties with high rates of vaccine-preventable diseases either have high vaccine exemption rates themselves, or are near other counties with high vaccine exemption rates. This is especially noticeable in the northern and northwestern part of the state.
The only county where this trend does not completely hold is San Diego (in the southwest part of the state). This county in particular may highlight the limits of the disease incidence data only being available countywide. San Diego is a very populous county that is also fairly large in land area. Mapping personal belief exemption rates at a more granular level shows that anti-vaccine sentiment may be more concentrated in certain areas of the county vs. others (https://www.washingtonpost.com/news/wonk/wp/2015/01/27/californias-epidemic-of-vaccine-denial-mapped/?noredirect=on&utm_term=.c3a42b9d3feb). Perhaps if we had disease incidence data mapped to smaller sub-sections of the counties, we would see more of a correlation here.