Vaccines have helped save millions of lives. In the 19th century, before herd immunization was achieved through vaccination programs, deaths from infectious diseases, like smallpox and polio, were common. However, today, despite all the scientific evidence for their importance, vaccination programs have become somewhat controversial.
The controversy started with a paper published in 1988 and lead by Andrew Wakefield claiming there was a link between the administration of the measles, mumps and rubella (MMR) vaccine, and the appearance of autism and bowel disease. Despite much science contradicting this finding, sensationalists media reports and fear mongering from conspiracy theorists, led parts of the public to believe that vaccines were harmful. Some parents stopped vaccinating their children. This dangerous practice can be potentially disastrous given that the Center for Disease Control (CDC) estimates that vaccinations will prevent more than 21 million hospitalizations and 732,000 deaths among children born in the last 20 years (see Benefits from Immunization during the Vaccines for Children Program Era — United States, 1994-2013, MMWR).
Effective communication of data is a strong antidote to misinformation and fear mongering. In this homework you are going to prepare a report to have ready in case you need to help a family member, friend or acquaintance that is not aware of the positive impact vaccines have had for public health.
The data used for these plots were collected, organized and distributed by the Tycho Project. They include weekly reported counts data for seven diseases from 1928 to 2011, from all fifty states. We include the yearly totals in the dslabs
package:
library(dslabs)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
data(us_contagious_diseases)
us_contagious_disease
and dplyr
tools to create an object called dat
that stores only the Measles data, includes a per 100,000 people rate, and removes Alaska and Hawaii since they only became states in the late 50s. Note that there is a weeks_reporting
column. Take that into account when computing the rate.To accurately calcualte the rate, the count column is divded by the weeks reporting column, which is then multiplied by 52 as there are 52 weeks in one year. This number is then dividied by a modified population measure that is calculated by the total population divided by 100,000 to yield a per 100,000 people rate. Also note that Alaska and Hawaii are removed from the dataframe for future analyses.
us_contagious_diseases$rate <- (((us_contagious_diseases$count/us_contagious_diseases$weeks_reporting))*52)/(us_contagious_diseases$population/100000)
dat <- us_contagious_diseases %>%
filter(disease == "Measles", state != "Hawaii", state != "Alaska")
The Measles vaccine was first introduced in 1963 by John Enders and colleagues who transformed their Edmonston-B strain of Measels virus into a vaccine and licensed it in the United States. In the following plot, a grey vertical line is added to indicate the year the Measles vaccine was introduced.
dat %>% filter(state == "California") %>% filter(!is.na(rate)) %>%
ggplot(aes(x = year, y = rate)) + geom_line()+ geom_vline(xintercept = 1963, col = "grey") + annotate("text", x = 1963, y= 1000, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5) + scale_y_continuous(limits = c(0,1500), expand = c(0,0)) + theme_classic() + ggtitle("Measles disease rate per year in California") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Measles disease rate")
dat1 <- us_contagious_diseases %>% select(disease, rate, state, year, count) %>% filter(disease == "Measles", state != "Alaska", state != "Hawaii", !is.na(rate))
p1a <- dat1 %>% filter(year == 1950) %>% ggplot(aes(x = count)) + geom_histogram(bins = 15, fill = "grey", col = "black") + facet_grid(~year) + theme_classic() + xlab("Total Measles Cases") + ylab("Frequency by state") + scale_y_continuous(expand = c(0,0))
p1b <- dat1 %>% filter(year == 1950) %>% ggplot(aes(x = sqrt(count))) + geom_histogram(bins = 15 ,fill = "grey", col = "black") + facet_grid(~year) + theme_classic()+ xlab("Total Measles Cases (square-rooted) ") + ylab("Frequency by state") + scale_y_continuous(expand = c(0,0))
p2a <- dat1 %>% filter(year == 1960) %>% ggplot(aes(x = count)) + geom_histogram(bins = 15, fill = "grey", col = "black") + facet_grid(~year) + theme_classic()+ xlab("Total Measles Cases") + ylab("Frequency by state") + scale_y_continuous(expand = c(0,0))
p2b <- dat1 %>% filter(year == 1960) %>% ggplot(aes(x = sqrt(count))) + geom_histogram(bins = 15 ,fill = "grey", col = "black") + facet_grid(~year) + theme_classic()+ xlab("Total Measles Cases (square-rooted) ") + ylab("Frequency by state")+ scale_y_continuous(expand = c(0,0))
p3a <- dat1 %>% filter(year == 1970) %>% ggplot(aes(x = count)) + geom_histogram(bins = 15, fill = "grey", col = "black") + facet_grid(~year) + theme_classic()+ xlab("Total Measles Cases") + ylab("Frequency by state")+ scale_y_continuous(expand = c(0,0))
p3b <- dat1 %>% filter(year == 1970) %>% ggplot(aes(x = sqrt(count))) + geom_histogram(bins = 15 ,fill = "grey", col = "black") + facet_grid(~year) + theme_classic()+ xlab("Total Measles Cases (square-rooted) ") + ylab("Frequency by state")+ scale_y_continuous(expand = c(0,0))
grid.arrange(p1a,p1b,p2a,p2b,p3a,p3b, nrow = 3, ncol = 2)
The histrograms without square root transformation are informative; however, there seems to be more variability for larger and smaller counts. One way to overcome this issue is transforming the data, which stabilizes the variability so that the data is closer across levels. One can see that the issue with variability is better addressed when a square root transformation is used as the data is now closer across levels.
years <- c(1950, 1960, 1970)
# Without transformation - barplot for indicidual state
dat %>%
filter(year %in% years) %>% mutate(state = reorder(state, rate, FUN = mean)) %>% filter(!is.na(rate)) %>%
ggplot(aes(x = state, y = rate)) + geom_histogram(binwidth = 1 ,color = "grey", stat = "identity")+ theme_classic() + scale_y_continuous(expand=c(0,0)) + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Histogram of Measles Distribution") + theme(plot.title = element_text(hjust = 0.5)) + facet_wrap(~year, ncol = 1) + xlab("State")
# Square root transformatipn - barplot for individual state
dat %>%
filter (year %in% years) %>% mutate(state = reorder(state, rate, FUN = mean)) %>% filter(!is.na(rate)) %>%
ggplot(aes(x = state, y = rate)) + geom_histogram(binwidth = 1 ,color = "grey", stat = "identity")+ theme_classic() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Histogram of Measles Distribution (square root transformation)") + theme(plot.title = element_text(hjust = 0.5)) + scale_y_sqrt() + facet_wrap(~year, ncol = 1) + xlab("State")
Effects of the square root transformation can also be seen by looking at the bar plot of each state for 1950, 1960, and 1970. A sqaure root transformation was performed on the y-axis to overcome issues with the variability for larger and smaller rates. By transforming the data, the variability was stabilized acroess levels in the bar plot.
From question 2, we confirmed that the Measles vaccine was introduced in 1963.
dat %>%
filter(state == "California") %>% filter(!is.na(rate)) %>%
ggplot(aes(x = year, y = rate)) + geom_line() + geom_vline(xintercept = 1963, col = "grey") + annotate("text", x = 1963, y= 1000, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5) + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Measles disease rate per year for California") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Measles Rate- square root transformed")
dat %>% filter (!is.na(rate)) %>% mutate(state = reorder(state, rate, FUN = mean)) %>% ggplot(aes(x = factor(year), y = rate)) + theme_classic() + geom_boxplot() + scale_y_sqrt()+ theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 6.5)) + ggtitle("Boxplot of Measles Distribution") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Measles rate- square root transformed") + xlab("Year")
Although the boxplot helps one see composite U.S trends of Measles distribution in different years, the boxplot does not illustrate state specific trends. It appears that the overall rate decreases over time, but this does not illustrate anything about state-specific trends.
dat %>% filter (!is.na(rate)) %>% mutate(state = reorder(state, rate, FUN = mean)) %>% ggplot(aes(x = factor(year), y = rate, color = state)) + theme_classic() + geom_boxplot() + scale_y_sqrt()+ theme(axis.text.x = element_text(angle = 90, hjust = 1)) + ggtitle("Boxplot of Measles Distribution") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Measles rate- square root transformed") + xlab("Year") + scale_x_discrete(breaks = seq(1928,2010,20))
This plot demonstrates how the boxplot does not illustrate state-specific trends. Boxplots disappear once the color function is added to stratify the data to see state-specific trends.
# This replaced num 5 but with better trends
dat %>%
filter(!is.na(rate)) %>% mutate(state = reorder(state, rate, FUN = mean)) %>%
ggplot(aes(year,rate, color = state)) + geom_line(alpha = 0.5) + theme_classic() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + scale_y_sqrt() + stat_summary(fun.y=mean, geom="line",lwd=0.7,col="black") + annotate("text", x = 1950, y = 490, label = "US average", size = 3) + ggtitle("US Measles Distribution") + annotate("text", x = 1963, y= 2000, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5) + geom_vline(xintercept = 1963, col = "grey")
Line graphs are used to see state-specific trends. Using the color function, one can see state specific trends; however, because there are too many lines in one graph, it is still difficult to distinguish states from each other.
geom_tile
to tile the plot with colors representing disease rates.dat %>% filter(!is.na(rate)) %>%
ggplot(aes(year, state)) + geom_tile(aes(fill = rate), color = "white") + scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + ggtitle("Measles disease rate per year in the United States") + theme(plot.title = element_text(hjust = 0.5)) + geom_vline(xintercept = 1963, col = "black")
The geometry to title the plot with colors representing disease rates helps one better visualize state-specific trends that are diffcult to analyze using box plots and line graphs. A square root transformation is used to avoid having high counts that can possibly dominate the plot. Note that Grey squares simply denote missing values.
Autism was conceptualized and discovered in 1943 by a doctor named Leo Kanner. At the time, treatment for autism was extremely limited. However, much has changed in the past seven decades; now we know that autism is mostly caused by a combination of environmental and genetic factors. However, as vaccines were introduced, Autism rate was not reduced. In fact, the prevalence of autism increased, not coinciding with the introduction of vaccines that reduced disease rates. The data used in this plot is acquired from the Centers for Disease Control and Prevention. CDC began tracking the prevalence of autism and characteristics of children with autism in the United States in 1998; therefore, the timeline of the autism data does not perfectly align with that of the vaccine data; however, acquiring the 20th century autism data is difficult, and the rationale behind my decision to work with this data is because this data focuses on the U.S population. If one is more curious to see the historical autism data for a longer timeline, one might be interested in another data that is also provided by the CDC. This data clearly presents how autism kept increasing over time. There are also other plots that are externally available that depect the trends of the autism prevalnce like the one from Statista.
This data was imported from the CDC website. Initially, I saved the data as an excel file and modified to conveniently manipulate the dataframe in R. However, for an easier knitting process for BST260 team, I manually created the dataframe in R especially since there are not many data points in the autism data that I am using for this analysis. One must note that this is an extremely simplified analysis, and more detailed analyses can be facilitaed through sophisticated web-scraping from different data sources.
setwd("/Users/yeongjunpark1/Desktop")
autism <- read.csv("autism.csv") # excel data
autism <- autism[!is.na(autism),][c(1:8),]
colnames(autism) <- c("year", "prevalence")
# Manual autism dataframe
autism <- as.data.frame(cbind(seq(2000,2014,2), c(6.7, 6.6, 8.0, 9.0, 11.3, 14.7, 14.6, 16.8)))
colnames(autism) <- c("year", "prevalence")
ggplot(autism, aes(year, prevalence)) + geom_histogram(stat = "identity", color = "black", binwidth = 10) + theme_classic() + ylab("Prevalence per 1000 Children") + ggtitle("Autism Prevalencee in the United States") + theme(plot.title = element_text(hjust = 0.5)) + scale_y_continuous(expand = c(0,0))
This plot demonstrates how autism prevalence increased over time in the United States, not coincidinng with the positive outcomes accomplished by the introduction of vaccinnes. Such trends are also observed and substantiated in other studies in the autism literature. This plot suggests that autism prevalence among children aged 8 years in multiple U.S communities have almost consistetnly increased. This observed increase in autism prevalence emphasizes the need for continued survelliance using consistent methods to facilitate a long-term monitoing in the population.
Polio is excluded from the analysis because of the duplicate data between Polio and Pertussis. Also, Alaska and Hawaii are still excluded from this analysis.
# It's really hard to see with the tile graph
us_contagious_diseases %>% filter (!is.na(rate)) %>% filter(disease != "Polio", state != "Hawaii", state != "Alaska", disease == "Hepatitis A") %>% ggplot(aes(year,state)) + geom_tile(aes(fill = rate), color = "white")+ scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + facet_wrap(~disease) + geom_vline(xintercept = 1995, col = "black")
us_contagious_diseases %>% filter (!is.na(rate)) %>% filter(disease != "Polio", state != "Hawaii", state != "Alaska", disease == "Mumps") %>% ggplot(aes(year,state)) + geom_tile(aes(fill = rate), color = "white")+ scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + facet_wrap(~disease) + geom_vline(xintercept = 1969, col = "black")
us_contagious_diseases %>% filter (!is.na(rate)) %>% filter(disease != "Polio", state != "Hawaii", state != "Alaska", disease == "Pertussis") %>% ggplot(aes(year,state)) + geom_tile(aes(fill = rate), color = "white")+ scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + facet_wrap(~disease)+ geom_vline(xintercept = 1948, col = "black")
us_contagious_diseases %>% filter (!is.na(rate)) %>% filter(disease != "Polio", state != "Hawaii", state != "Alaska", disease == "Rubella") %>% ggplot(aes(year,state)) + geom_tile(aes(fill = rate), color = "white")+ scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + facet_wrap(~disease)+ geom_vline(xintercept = 1971, col = "black")
us_contagious_diseases %>% filter (!is.na(rate)) %>% filter(disease != "Polio", state != "Hawaii", state != "Alaska", disease == "Smallpox") %>% ggplot(aes(year,state)) + geom_tile(aes(fill = rate), color = "white")+ scale_fill_gradient(low = "white", high = "blue", trans = "sqrt") + scale_x_continuous(expand = c(0,0)) + facet_wrap(~disease)
Remember how a solid black line indicates the year vaccine was introduced. I initially used the geometry ‘geom_tile’ to tile the plots with colors representing diseases rates for each disease. They all seem to show similar patterns as those of Measles because the colors of the tiles become lighter over time especailly after the introduction of vaccine. However, small Pox showed somewhat dubious observations where some states have higher disease rates than they did in the beginning. To clearly assess the trends, I generated line plots because the line plots showed the most clear analyses in previous problems.
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line() + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("All diseases rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Disease Rate- square root transformed")
When the line plots of all diseases are combined, the analysis becomes much harder although all diseases seem to demonstrate decreasing trends. Therefore, I decided to undergo individual examination of each disease with an additional incorporation of fitting generalized linear models.
# All diseases combined
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa", disease == "Hepatitis A") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line(col = "grey") + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Hepatitis A disease rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Hepatitis A Rate- square root transformed") + stat_smooth(method = "glm") + geom_vline(xintercept = 1995, col = "black") + annotate("text", x = 1995, y= 500, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5)
# Hepatitus A vaccine introduced in 1995
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa", disease == "Mumps") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line(col = "grey") + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Mumps disease rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Mumps Rate- square root transformed") + stat_smooth(method = "glm") + geom_vline(xintercept = 1969, col = "black") + annotate("text", x = 1969, y= 400, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5)
# Mumps vaccine was introduced in 1969
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa", disease == "Pertussis") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line(col = "grey") + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Pertussis disease rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Pertussis Rate- square root transformed") + stat_smooth(method = "glm") + geom_vline(xintercept = 1948, col = "black") + annotate("text", x = 1948, y= 500, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5)
# vaccine introduced in late 1940's- line in 1948
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa", disease == "Rubella") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line(col = "grey") + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Rubella disease rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Rubella Rate- square root transformed") + stat_smooth(method = "glm") + geom_vline(xintercept = 1971, col = "black") + annotate("text", x = 1971, y= 250, angle = 90, label = paste("paste(Vaccine)", collapse = "_"), vjust = 1.2, parse = TRUE, size = 2.5)
# vaccine introduced in 1971
us_contagious_diseases %>%
filter(!is.na(rate)) %>%
filter(disease != "Polio", state != "Hawaii", state != "Alaksa", disease == "Smallpox") %>%
ggplot(aes(x = year, y = rate, color = disease)) + geom_line(col = "grey") + scale_y_sqrt(breaks = (seq(0,100,2))^2) + theme_classic()+ ggtitle("Smallpox disease rate per year for all states") + theme(plot.title = element_text(hjust = 0.5)) + ylab("Smallpox Rate- square root transformed") + stat_smooth(method = "glm") + annotate("text", x = 1930, y = 200, label = "Vaccine introduced in 1796", size = 2.5)
# Smallpox vaccine introduced in 1796
The slopes of the fitted lines are all negative for different diseases, and they are all statistically significant (p-value < 0.05). Small Pox vaccine was introduced in 1796, and the given data set does not contain any Small Pox data from the pre-vaccine era; therefore, it is harder to see the effect of Small Pox vacciine by only looking at the given dataset. However, other diseases besides small pox have sufficient data to explore the conducive effects of vaccines. These plots substantiate how the use of vaccines is extremely valuable and beneficial.