Distinguishing Effects and Spurious Effects

This is one of the main challenges in statistics. All practitioners will continually need to evaluate their analyses and question if the effects they find are real.

It is all too common for practitioners to fit overly complicated models to small data sets, find small (but ‘significant’) effects, and try to publish these results with minimal self-criticism.

avoid doing this

This is part of the debate about how p-values are often contrived through over-fitting models. A “significant result” is not necessarily significant. Vice-versa, a biologically meaningful difference (effect size) is not always significant to the p<0.05 value.

The following example is mostly taken from Richard McElreath’s book,“Statistical Rethinking” (Chapter 5)

If I can recommend you one book for using statistics in science - I recommend you this book.

Part 1: The utility of power analysis to check your sample and effect sizes

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.2.1.9000 ──

## ✔ ggplot2 3.2.0          ✔ purrr   0.3.2     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.3.9000
## ✔ tidyr   0.8.3          ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

dat <- read_csv("WaffleDivorce.csv") # read data in

## Parsed with column specification:
## cols(
##   Location = col_character(),
##   Loc = col_character(),
##   Population = col_double(),
##   MedianAgeMarriage = col_double(),
##   Marriage = col_double(),
##   Marriage.SE = col_double(),
##   Divorce = col_double(),
##   Divorce.SE = col_double(),
##   WaffleHouses = col_double(),
##   South = col_double(),
##   Slaves1860 = col_double(),
##   Population1860 = col_double(),
##   PropSlaves1860 = col_double()
## )

glimpse(dat) # quickly see how the data was read

## Observations: 50
## Variables: 13
## $ Location          <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "C…
## $ Loc               <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"…
## $ Population        <dbl> 4.78, 0.71, 6.33, 2.92, 37.25, 5.03, 3.57, 0.9…
## $ MedianAgeMarriage <dbl> 25.3, 25.2, 25.8, 24.3, 26.8, 25.7, 27.6, 26.6…
## $ Marriage          <dbl> 20.2, 26.0, 20.3, 26.4, 19.1, 23.5, 17.1, 23.1…
## $ Marriage.SE       <dbl> 1.27, 2.93, 0.98, 1.70, 0.39, 1.24, 1.06, 2.89…
## $ Divorce           <dbl> 12.7, 12.5, 10.8, 13.5, 8.0, 11.6, 6.7, 8.9, 6…
## $ Divorce.SE        <dbl> 0.79, 2.05, 0.74, 1.22, 0.24, 0.94, 0.77, 1.39…
## $ WaffleHouses      <dbl> 128, 0, 18, 41, 0, 11, 0, 3, 0, 133, 381, 0, 0…
## $ South             <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0…
## $ Slaves1860        <dbl> 435080, 0, 0, 111115, 0, 0, 0, 1798, 0, 61745,…
## $ Population1860    <dbl> 964201, 0, 0, 435450, 379994, 34277, 460147, 1…
## $ PropSlaves1860    <dbl> 4.5e-01, 0.0e+00, 0.0e+00, 2.6e-01, 0.0e+00, 0…

Let’s plot a hypothesized relationship

(This ‘hypothesis’ is problematic and really classist hypothesis, but it will demonstrate an important point about doing this kind of thing).

dat %>% 
  ggplot(data=., aes(WaffleHouses, Divorce))+
  geom_point()+
  geom_smooth(method='lm')

Here Divorce is the response variable and WaffleHouses is the covariate.

Obviously the underlying drivers of divorce are complicated, but let’s continue

with this ‘strawman’ example

What can we do to examine this Waffle House effect more coherently? A big populated state will likely have more Waffle Houses than a small state, so we can divide by the population to standardize it.

dat <- dat %>%  # shortcut for pipe symbol in Rstudio: ctrl + shift + m
  mutate(wh_per_million = WaffleHouses/Population)

dat %>% 
  ggplot(data=., aes(wh_per_million, Divorce))+
  geom_point()+
  geom_smooth(method='lm')

fit <- lm(Divorce~wh_per_million, data=dat)
summary(fit)

## 
## Call:
## lm(formula = Divorce ~ wh_per_million, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5343 -1.2448 -0.0718  1.0552  3.6802 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     9.31980    0.27723  33.617  < 2e-16 ***
## wh_per_million  0.07442    0.02730   2.726  0.00892 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.712 on 48 degrees of freedom
## Multiple R-squared:  0.1341, Adjusted R-squared:  0.116 
## F-statistic: 7.431 on 1 and 48 DF,  p-value: 0.008921

OK, so let’s check our assumptions

par(mfrow=c(2,2)) # this specifies a 2x2 grid for the plots to go in
plot(fit) # this will plot model diagnostis

Residuals vs. Fitted: Assumption of homoscedasticity. Does the variance of the residuals change along the scale of fitted values? If so, the data is heteoscedastic and probably should not be modeled with a linear model.

Normal Q-Q: This acts as a check about the model distribution assumptions. The data is consistent with the model assumptions when the standardized residuals for a 1:1 relationship with the theoretical quantiles of the distribution.

Residuals vs Leverage: These show especially influential observations. So here the 11th observation is exerting vastly more leverage on the model fit. You might consider applying a transform (sqrt, log10, etc) to the covariates if the distribution is especially skewed.

HOLD ON! Did we really have enough data to isolate an effect this small?

Under what assumptions?

We also assume the observations are “independently and identically distributed”.

hist(dat$Divorce)

If we look back at the data, perhaps only a few observations (states) are driving this percieved trend.

Let’s see who is driving this relationship.

ggplot(dat, aes(wh_per_million, Divorce))+
  geom_point()+
  geom_smooth(method='lm')+
  geom_label(aes(label=Location))

Here we apply a power analysis to examine what the sample size would need to be to identify the effect.

library(pwr)

# Does power analysis indicate we've done something wrong? 
pwr.r.test(n = NULL, 
           r = 0.074,        # the effect size from our waffle divorce model
           
           sig.level = .05, # NOT PROBABILITY; 
           # https://en.wikipedia.org/wiki/Statistical_significance
           
           power = 0.95,    # ranges from 0-1. 
            # Roughly, the probability of finding an effect that is real
           # https://en.wikipedia.org/wiki/Power_(statistics) 
           
           alternative = "two.sided")

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 2366.44
##               r = 0.074
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided

We can visualize how large a sample size would be required to identify an effect. Taken from: https://www.statmethods.net/stats/power.html

library(tidyverse)
library(pwr)

set.seed(3)
population_1 <- rnorm(10000, mean = 50, sd=25)


nobs <- 5
sample_1 <- population_1[sample.int(10000, nobs)]
sample_2 <- population_1[sample.int(10000, nobs)]

t.test(sample_1, sample_2)

## 
##  Welch Two Sample t-test
## 
## data:  sample_1 and sample_2
## t = 0.85889, df = 7.2497, p-value = 0.4179
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -31.80723  68.49382
## sample estimates:
## mean of x mean of y 
##  47.09334  28.75004

# Plot sample size curves for detecting correlations of
# various sizes.

library(pwr)

# range of correlations
r <- seq(0.05,.5,.01)
nr <- length(r)

# power values
p <- seq(.4,.9,.1)
np <- length(p)

# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
  for (j in 1:nr){
    result <- pwr.r.test(n = NULL, r = r[j],
                         sig.level = .05, power = p[i],
                         alternative = "two.sided")
    samsize[j,i] <- ceiling(result$n)
  }
}

# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
     xlab="Correlation Coefficient (r)",
     ylab="Sample Size (n)" )

# add power curves
for (i in 1:np){
  lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}

# add annotation (grid lines, title, legend) 
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
       col="grey89")
title("Sample Size Estimation for Correlation Studies\n
  Sig=0.05 (Two-tailed)")
legend("topright", title="Power", as.character(p),
       fill=colors)

Part 2: Simpson’s Paradox

Read about it: https://en.wikipedia.org/wiki/Simpson%27s_paradox This example comes from: https://blog.revolutionanalytics.com/2015/11/fun-with-simpsons-paradox-simulating-confounders.html

Here we will simulate a large number of observations with a seemingly straight- forward trend.

point_line_distance <- function(b, m, x, y){
  (y - (m*x + b))/sqrt(m^2 + 1)}

confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # avoid 0 and 1
  data.frame(x=x, y=y, 
             group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

find_group_coefficients <- function(data){
  coef <- t(sapply(levels(data$group), 
                   function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  coef[!is.na(coef[,1]) & ! is.na(coef[,2]),]
}

set.seed(1)    # reproducibilty with the random number generator
N <- 3000      # number of simulated observations
x <- rnorm(N)  # generate random numbers for the covariate
m <- -0.5555556 # The trend effect
b <- 8.3333333  # The intercept
y <- m * x + b + rnorm(length(x))  # simulate the response variable
plot(x, y, col="gray", pch=20, asp=1) # plot the data
fit <- lm(y ~ x)  # Fit a linear model
abline(fit, lty=2, lwd=2) # Use the linear model fit to add a trend-line to the plot

summary(fit) # Inspect the summary statistics about the fit

## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6481 -0.6682 -0.0088  0.6665  3.0759 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.32840    0.01830   455.1   <2e-16 ***
## x           -0.53582    0.01768   -30.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.002 on 2998 degrees of freedom
## Multiple R-squared:  0.2344, Adjusted R-squared:  0.2342 
## F-statistic: 918.1 on 1 and 2998 DF,  p-value: < 2.2e-16

Does power analysis indicate something wrong with this analysis?

pwr.r.test(n = NULL, 
           r = -0.535,        # the effect size from our waffle divorce model
           
           sig.level = .001, # NOT PROBABILITY; 
           # https://en.wikipedia.org/wiki/Statistical_significance
           
           power = 0.999,    # ranges from 0-1. 
            # Roughly, the probability of finding an effect that is real
           # https://en.wikipedia.org/wiki/Power_(statistics) 
           
           alternative = "two.sided")

## 
##      approximate correlation power calculation (arctangh transformation) 
## 
##               n = 116.9737
##               r = 0.535
##       sig.level = 0.001
##           power = 0.999
##     alternative = two.sided

Ok so we have enough observations to satisfy our power analysis. So this trend seems robust, correct?

Well what if the data is not independent? What if the data was sampled in regional groups?

# This function plots groups of data by color
striped_scatterplot <- function(formula, grouped_data){
  # 
  colors <- rev(viridis::viridis(length(levels(grouped_data$group)), end=0.95, option='A'))
  plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficents get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef)) 
    abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}


m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below

Now we will fit a linear model with a varying intercept. By allowing the intercept to term to vary by group, we will partition variation away from the x term that we are examining.

cdf10 <- confounded_data_frame(x, y, m_new, num_grp=10)
# without confounder
coefficients(lm(y ~ x, cdf10))['x']

##          x 
## -0.5358175

##          x 
## -0.5358175
# with confounder
coefficients(lm(y ~ x + group, cdf10))['x']

##         x 
## 0.8193693

j0 <- lme4::lmer(y~x+(1|group), data=cdf10)

So time simple model that does not acknowledge differences between groups suggests that x has an -0.5358175 effect upon y, per unit of x.

The linear model with a group varying intercept shows that the effect of x upon y within each group, is actually positive (0.996). This is close to the true value used to simulate the data.

These kinds of models are very useful and a little beyond the scope of an intro seminar like this. But let’s go over what the formula means:

y~x+(1|group)

y is the response variable (aka, dependent variable)

x is the covariate (aka, the independent variable)

(1|group) means that ‘1’ the intercept, is conditional upon the group.

If we wanted to fit a model where the effect of x dendends on the group, how might we write it?

Some stats caveats

11/25/2019

R package Requirements: tidyverse, viridis, lme4