Week 10 - Looking at Autism and MMR through correlation and regression

Introduction

In today’s lab, we will be going through a case study that examines Autism, measles, and a infamous paper in vaccination studies. We will go through the case in several parts, and learn correlation and regression analysis along the way. We will also re-visit old topics as a reminder that science isn’t about single tests. Furthermore, we will briefly discuss data ethics, and the importance of using data and statistics correctly.

Part 1. Using publically available datasets
Part 2. The Measles, Mumps, Rubella (MMR) vaccine
Part 3. Wakefield study retracted
Part 4. Your assignment

Part 1. Using publically available datasets

Part 1a. Fighting for your children

Kristen typically loved the monthly Sunday brunches with her mother Anne and older sister Carly. The monthly “girls’ day out” was an excuse for the three women to celebrate the love and commitment that helped them through the difficult times when Anne was working 80 hours a week and Carly was helping care for Kristen. Following Kristen’s graduation from college, Anne courageously tackled her lifelong dream and enrolled in a nursing program. After the first year of course work, Anne was near the top of her class. Anne’s increasing expertise in health issues made her, once again, a resource for the entire clan, as Ian, Carly’s son, had been diagnosed with Autism Spectrum Disorder (ASD) just four months after his first birthday.

Autism can take a heavy toll on a family, especially those with children exhibiting the most extreme form of the disease. Ian unfortunately was in this group—regressing from a happy and interactive one-year-old to a toddler who was less responsive to his parents, and whose long bouts of repetitive rocking were interspersed with brief but intense periods of aggression. Anne’s genuine love for her daughters and grandchildren, coupled with her access to the medical school’s library, made her a knowledgeable and compassionate counsel for Carly.

One Sunday, Kristen was running late. When she was directed to a patio table to join her mother and sister, she found them engaged in the same battle left unresolved a month ago.

[Carly] “Mom, Jenny McCarthy is just one of many celebrities speaking out against the measles vaccine—and she has the right to—her own son is autistic, and Jenny is convinced that the combined measles, mumps, and rubella (MMR) vaccine is the cause. She made a very convincing case on the Larry King show, while the scientists looked to be in bed with big pharmaceutical companies to make money selling vaccines.”

[Carly] continued, “It’s a huge business. And there has to be something behind this epidemic of autism. Did you know that the MMR vaccine replaced the old, simple measles vaccine in 1988? And did you know that during the 20 years following MMR’s introduction, autism rates increased by almost 600%? And the rates are still increasing—one in every 110 kids today develops some form of ASD.”

Part 1b. Analyzing freely available datasets.

Is Carly right about ASD rise? CDC and over government sites collect a lot of data, including the prevalence of disorders like ASD. Thus, we can do a lot of analysis, if we know where to find data and collect data. I think Carly, Anne and Kristen would want to see the data for the follow: What is the prevalence of ASD in the united states?

[https://www.cdc.gov/ncbddd/autism/data/index.html#data]

Explore the autism data presented on the website. At the bottom of the page, you can actually access the entire dataset. I did this for you, and trimmed the dataset for more easy analysis. This can be found on your Canvas page for this week’s story.

I trimmed the dataset for you so that we can use it in class. The file is: autism_us.csv

autism <- read.csv("autism_us.csv")
# View(autism)
str(autism)

## 'data.frame':    853 obs. of  7 variables:
##  $ State      : chr  "AL" "AK" "AZ" "AR" ...
##  $ Denominator: int  674044 122469 804559 412602 5512592 658092 510125 106279 55111 2203889 ...
##  $ Prevalence : num  1.1 1.6 1.3 1.6 1.8 0.7 2.3 2.3 1.8 1.5 ...
##  $ Lower.CI   : num  1 1.4 1.2 1.5 1.8 0.6 2.2 2 1.5 1.5 ...
##  $ Upper.CI   : num  1.2 1.8 1.4 1.7 1.8 0.7 2.4 2.6 2.2 1.6 ...
##  $ Year       : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ Source     : chr  "sped" "sped" "sped" "sped" ...

This is a rich dataset. It has every states’ autism prevalence (per 100000) per year, from 2000-2016.

Now, we can plot US Autism prevalence from 2000 - 2016. To do this, we will make a scatter plot. I will have you do this with the ggscatter() function, in the package ggpubr.

ggscatter(autism, x = "Year", y = "Prevalence")

Woah, that a lot of dots! We can still clearly see a trend of increasing prevalance over time. But, perhaps are visualization would be better if we summarize the data; let’s look at Autism prevalance (per 1000 children) by year.

autismSUM <- summarySE(data=autism, 
                       measurevar = "Prevalence", 
                       groupvars = "Year")
autismSUM

##    Year  N Prevalence        sd         se        ci
## 1  2000 51   1.662745 0.6994171 0.09793799 0.1967142
## 2  2001 51   2.035294 0.8019535 0.11229595 0.2255531
## 3  2002 51   2.452941 0.9609064 0.13455381 0.2702593
## 4  2003 51   2.876471 1.1211759 0.15699602 0.3153358
## 5  2004 48   3.427083 1.3476125 0.19451111 0.3913059
## 6  2005 50   3.898000 1.5690228 0.22189334 0.4459114
## 7  2006 49   4.553061 1.7915429 0.25593470 0.5145912
## 8  2007 48   5.220833 2.0606711 0.29743225 0.5983565
## 9  2008 50   5.876000 2.2768184 0.32199075 0.6470646
## 10 2009 49   6.706122 2.4460349 0.34943356 0.7025833
## 11 2010 51   7.311765 2.5931947 0.36311989 0.7293478
## 12 2011 51   8.000000 2.7858212 0.39009300 0.7835249
## 13 2012 51   8.703922 2.9173934 0.40851679 0.8205301
## 14 2013 51   9.315686 3.0323174 0.42460937 0.8528530
## 15 2014 50   9.934000 3.1304046 0.44270606 0.8896511
## 16 2015 51  10.572549 3.1364361 0.43918890 0.8821369
## 17 2016 50  11.182000 3.2332261 0.45724722 0.9188727

Now, let’s graph the summed data, with the prevalence being the dependent, or response variable and the year being the independent or exploratory variable.

library(ggpubr)
ggscatter(autismSUM, x = "Year", y = "Prevalence")

Great! As you can see, clearly the prevalence of cases of Autism has increased since 2000. Carly is right. Let’s continue.

We can take advantage of R to do some nicer graphics. Visualization is important for science communication. So let’s clean up the graph a bit now. I can make the dots bigger (size parameter), different colors, and i can add a best fit line (add parameter).

ggscatter(autismSUM, x = "Year", y = "Prevalence", 
          ylab = "Prevalence (per 1000 children)",
          add = "reg.line", add.params = list(color="red"), # add a regression line
          col="blue", # change the color of the dots
          size = 3 # make dots bigger
          )

## `geom_smooth()` using formula 'y ~ x'

Nice! That looks very clear now. You can learn more about cool ggscatter plots here.

Part 2. The Measles, Mumps, Rubella (MMR) vaccine

[Kristen] Anne tried not to sound smug or motherly in her response to Carly, but a recent topic in her microbiology course in nursing school dealt with the amazing benefits of vaccines, including the shot (or “jab” as it was known in Great Britain) against measles.

[Anne] “Carly, both the original and the more effective MMR vaccines have nearly eliminated a serious and deadly disease. The measles virus kills. A paper we had to read for class showed that during the 20 years following its release in 1963 the measles shot prevented fifty million cases of measles and saved at least 5,200 lives in the United States alone. And that’s not all. A common complication from measles is encephalitis, a massive swelling of the brain, which can lead to brain damage and mental retardation. During that same 20-year period, at least 17,000 American kids were spared that tragedy. And I don’t believe the link between MMR and autism has been firmly made.”

[Carly] “I know the link has been made,” she said with more than a hint of anger in her voice. “Just look at Ian. He was a happy and interactive one-year-old. I take him in for his MMR shot at 15 months, just like my pediatrician said to, and a week later he smiles less, he talks less, and he stopped looking me in my eyes.”

[Carly] continued, “This doctor, Dr. Andrew Wakefield, published a paper in Lancet and called a press conference announcing that he found a link between a routine childhood vaccine and autism. Children in their study developed neurological disorder within a few days of receiving the MMR “jab.” Somehow, the vaccine was causing intestinal inflammation, allowing harmful proteins to enter the child’s bloodstream, eventually traveling to the brain, causing the disorder.” Did Carly want to be convinced by this data?

Part 2a. Visualizing the data

Let’s explore the data more.

In 1963 the measles vaccine was developed, and by the late 1960s, vaccines were also available to protect against mumps (1967) and rubella (1969). These three vaccines were combined into the MMR vaccine in 1971. Let’s look at the data to see measles deaths over time.

# read the dataset
measles <- read.csv("measles_history_death.csv")
str(measles)

## 'data.frame':    63 obs. of  5 variables:
##  $ Year               : int  1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 ...
##  $ Measles.Case.Rate  : num  331 420 272 406 324 ...
##  $ Measles.Death.Rate : num  0.427 0.38 0.28 0.308 0.202 ...
##  $ World.Population   : chr  "2,584,034,261" "2,630,861,562" "2,677,608,960" "2,724,846,741" ...
##  $ Meat.consumption.kg: num  NA NA NA NA NA NA NA NA NA NA ...

The dataset has 5 variables. Year, Reported cases (per 100,000), Death Rate, World population that year, and meat consumption that year (kg/capita). First, let’s look at the measles cases by year.

ggscatter(data=measles, x = "Year", y = "Measles.Case.Rate", 
          ylab = "Measles cases (per 100,000)")

# we can draw a line along the data
ggscatter(data=measles, x = "Year", y = "Measles.Case.Rate",
          ylab = "Measles cases (per 100,000)",
          add = "reg.line", add.params = list(color="red"))

## `geom_smooth()` using formula 'y ~ x'

So, as you can see, measles cases has dropped dramatically, and is attributed to the introduction of the measles vaccine and teh MMR coctail. Similarly, we can look at death rate and the measles vaccine decreased death rate.

ggscatter(data=measles, x = "Year", y = "Measles.Death.Rate",
          ylab = "Measles cases (per 100,000)",
          add = "loess", add.params = list(color="red"))

## `geom_smooth()` using formula 'y ~ x'

# a loess line fits the data, and a reg.line fits the regression

Let’s do one more thing. Let’s make this plot interactive. For this, we need to use a new package, called plotly.

library(plotly)

## 
## Attaching package: 'plotly'

## The following objects are masked from 'package:plyr':
## 
##     arrange, mutate, rename, summarise

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

coolfig <- plot_ly(data = measles, x = ~Year, y = ~Measles.Death.Rate)
coolfig

## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter

## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

Awesome! Now we have an interactive plot where we can easily see death rates at specific years. Learn more about the power of plotly here.

Part 2b. Correlation tests

Now, let’s ask a simple question with our data. We want to know whether a change in one variable (independent) leads to a change in the next variable (dependent). For example,

As measles cases go up, do death rates also go up?

Let’s say the independent variable is Measles cases (per 100,000), and the dependent variable is deaths (per 100,000).

For correlation type tests, the null hypothesis is that there is no relationship between your two variables. Or, if measles cases go up, death cases may go not change.

First, let’s look at the graph to help us visualize this data.

ggscatter(data=measles, x = "Measles.Case.Rate", y = "Measles.Death.Rate",
          add = "reg.line", add.params = list(color="red"))

## `geom_smooth()` using formula 'y ~ x'

Now, we can make it even more informative. We can add variation in our regression line.

ggscatter(data=measles, x = "Measles.Case.Rate", y = "Measles.Death.Rate",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"),
          xlab = "Measles cases (per 100,000)", ylab = "Measles deaths (per 100,000")

## `geom_smooth()` using formula 'y ~ x'

Looking at the strength of the relationship with correlation analyses

As you can see, it looks like there might be a relationship there, such that as cases go up, so do death rates. But let’s put some statistics and determine whether there is a correlation between our two variables, using the cor() function. A correlation analyses provides an r value, which tells us whether one measurement variable is associated (relationship) with another measurement variable (case rate and death rate). r values range from -1 to 1.

We can also look at the strength of the association, or relationship by examining a r² value, which ranges from 0-1. Thus, values close to 0 represent weak associations and values close to 1 is a strong association.

cor(measles$Measles.Case.Rate, measles$Measles.Death.Rate)

## [1] 0.975001

You get 0.975, which is really strong. It is also a positive number, which suggests that as cases go up, so do death cases. If the number was negative, it would suggest that if cases go up, deaths would go down. Is this association significant? For that, we use the cor.test() function.

cor.test(measles$Measles.Case.Rate, measles$Measles.Death.Rate)

## 
##  Pearson's product-moment correlation
## 
## data:  measles$Measles.Case.Rate and measles$Measles.Death.Rate
## t = 34.271, df = 61, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.9588718 0.9848538
## sample estimates:
##      cor 
## 0.975001

As you can see, we get a p value < 2.2e-16, which is really small! Thus, we would reject a null of there not being a relationship.

We can actually add the the r value and the p value to our previous graphs. We just need to tell R, and tell it to run a Pearson’s correlation (one type of correlation test).

ggscatter(data=measles, x = "Measles.Case.Rate", y = "Measles.Death.Rate",
          add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"),
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Measles cases (per 100,000)", ylab = "Measles deaths (per 100,000")

## `geom_smooth()` using formula 'y ~ x'

Part 2c. Do regression analyses to predict the data

Finally, we can also do regression analyses. Regression allows us to predict information. It essentially gives us the equation for our regression line, shown in the above figure, which allows us to plug in an “x” value and predict a “y” value. To do this, we use the lm() function (which we should be familiar with from ANOVA).

measles_deaths <- lm(measles$Measles.Death.Rate ~ measles$Measles.Case.Rate)
measles_deaths

## 
## Call:
## lm(formula = measles$Measles.Death.Rate ~ measles$Measles.Case.Rate)
## 
## Coefficients:
##               (Intercept)  measles$Measles.Case.Rate  
##                 0.0030362                  0.0008674

From the lm function, we gat a y-intercept, and the slope of a line. Thus, our equation (y = mx+b) would be this:

deaths = 0.0008674*cases + 0.0030362

Now, you can use this equation to solve for unknown death rates, if you know the case rate. This might be helpful for future years. For example, the CDC reported 971 Measles cases in the US for 2019. Using this formula, we would predict:

deaths = 0.0008674*971 + 0.0030362 = 0.8452816

So we might predict that there was one death. Certainly, factors such as medical treatments, preexisting conditions, and other factors would play in (that’s multiple regression next week!).

Part 2d. Correlation and causation

As another example, let’s look at this comparison. Measles death rate by meat consumption.

ggscatter(data=measles, x = "Meat.consumption.kg", y = "Measles.Death.Rate",
           add = "reg.line", conf.int = TRUE, 
          add.params = list(color="red"))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 10 rows containing non-finite values (stat_smooth).

## Warning: Removed 10 rows containing missing values (geom_point).

If we do the correlation statistics, we get this:

cor.test(measles$Meat.consumption.kg, measles$Measles.Death.Rate)

## 
##  Pearson's product-moment correlation
## 
## data:  measles$Meat.consumption.kg and measles$Measles.Death.Rate
## t = -6.0403, df = 51, p-value = 1.774e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7799422 -0.4548998
## sample estimates:
##        cor 
## -0.6457895

A significant result! Interesting, so we can use meat consumption to predict measles death. Wait…really? This is an example where if two things correlate, it DOES NOT MEAN they are causal. So eating more meat does not cause a decrease in measles death, but they are correlated.

Part 3. Wakefield study retracted

So why did “we” even link Autism to the MMR vaccine. MMR may contain a chemical that damages the digestive system, leading to infections that impact the brain during development. That’s the theory. Let’s continue the case and look at the infamous Wakefield study that Carly was talking about.

Below, you will see the data collected by Wakefield on 12 patients. In particular, he examined the time from MMR vaccine to behavioral diagnosis. There are issues here, but really there are bigger statistical fish to fry in terms of what’s wrong with this paper.

Wakefield also took urinary samples and measured methylmalonic acid (MMA), which is raised in chronic enterocolitis. According to their theory, the MMR vaccine damaged the intestinal lining and lead to encephalopathic proteins and brain damage. He found that the eight Autistic patients had greater concentrations of MMA (5.20+/-2.93mg/mmol (+/- s.d.)) than controls (0.62+/-0.56mg/mmol (+/- s.d.), P=0.003).

data

First, since MMR vaccinations were in greater than 90%, over 9/10 Autism patients would have been vaccinated just by percentages. So his finding is not that surprising.

Let’s look at more data. Remember when I said there were 12 total patients. That’s not that many. Also, If you took a random person, and got a MMA value of 0.62mg/mmol, do you expect it to be within 95% of the values observed in ASD patients (assume the data is normal)?

If you recall, 1 sd is 68.3% of the data around the mean, and 2 sd is ~95%. Thus, 5.2 +/- 2*2.93 = -0.66 to 11.06. Thus, based on the sd of the data, all the controls fall within 95% of the mean for ASD patients. Thus, the small sample size really skews this data.

Finally, there were large ethical issues. As we think about Marian’s Franciscan values and proper experimental design in biology, reflect on the fact that dignity of the individual and responsible stewardship are strongly embedded in the principles of responsible conduct of research (RCR). It turns out that Wakefield broke several principles: * 1) In the period leading up to his 1998 Lancet publication, Andrew Wakefield was paid £435,643 in fees, plus £3,910 for expenses (totaling over $750,000) by Richard Barr, an attorney attempting to demonstrate a connection between MMR and autism in order to sue the vaccine manufacturer (Deer 2006). * 2) The 12 children in the original study were not randomly selected; parents of nearly half of the subjects were actually litigants in Richard Barr’s lawsuit against the MMR manufacturer (Cox 2010). * 3) Andrew Wakefield was developing an alternative measles vaccine he hoped to patent and, thus, he stood to benefit financially if the safety of the MMR vaccine was suspect (Deer 2004). * 4) Deer (2011) showed that Wakefield forged some medical records. For example, only 2 of 12 children showed any symptoms after a MMR vaccine treatment, and parent interviews showed that many were worried of autism-like symptoms in the children of this study, before the MMR vaccine treatment (Autism-like symptoms show as early as 1 year).

It was later discovered that Wakefield violated several tenants of RCR, and even co-authors later separated themselves from this study. Here are some other chilling aftereffects. * On 28 January 2010, the General Medical Council of Great Britain found Andrew Wakefield guilty of unethical and irresponsible behavior in the original Lancet study (Salzberg 2010). * Five days later, on 2 February 2010, the Lancet formally expunged the 1998 paper—finding it “fatally flawed” (Editors of the Lancet 2010). * On 24 May 2010, the General Medical Council stripped Dr. Wakefield of his license to practice medicine in Great Britain

image of the week