You can install these packages as follows: install.packages(c('tidyverse','viridis','lme4','pwr'))
This is one of the main challenges in statistics. All practitioners will continually need to evaluate their analyses and question whether the effects they find are real.
It is all too common for practitioners to fit overly complicated models to small data sets, find small (but ‘significant’) effects, and try to publish these results with minimal self-criticism.
Avoid doing this.
This is part of the broader debate about how p-values are often contrived by over-fitting models. A “significant” result is not necessarily meaningful. Conversely, a biologically meaningful difference (effect size) does not always reach p < 0.05.
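To make this concrete, here is a quick simulation (illustrative only, not part of the seminar data): with enough observations, even a trivially small difference between two groups comes out “significant”.

set.seed(42)
a <- rnorm(1e5, mean = 0)    # group A
b <- rnorm(1e5, mean = 0.02) # group B: shifted by a practically meaningless amount
t.test(a, b) # with n this large, p will almost surely be < 0.05 despite the tiny effect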
If I could recommend one book on using statistics in science, it would be this one.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────── tidyverse 1.2.1.9000 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3.9000
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
dat <- read_csv("WaffleDivorce.csv") # read data in
## Parsed with column specification:
## cols(
## Location = col_character(),
## Loc = col_character(),
## Population = col_double(),
## MedianAgeMarriage = col_double(),
## Marriage = col_double(),
## Marriage.SE = col_double(),
## Divorce = col_double(),
## Divorce.SE = col_double(),
## WaffleHouses = col_double(),
## South = col_double(),
## Slaves1860 = col_double(),
## Population1860 = col_double(),
## PropSlaves1860 = col_double()
## )
glimpse(dat) # quickly see how the data was read
## Observations: 50
## Variables: 13
## $ Location <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "C…
## $ Loc <chr> "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE"…
## $ Population <dbl> 4.78, 0.71, 6.33, 2.92, 37.25, 5.03, 3.57, 0.9…
## $ MedianAgeMarriage <dbl> 25.3, 25.2, 25.8, 24.3, 26.8, 25.7, 27.6, 26.6…
## $ Marriage <dbl> 20.2, 26.0, 20.3, 26.4, 19.1, 23.5, 17.1, 23.1…
## $ Marriage.SE <dbl> 1.27, 2.93, 0.98, 1.70, 0.39, 1.24, 1.06, 2.89…
## $ Divorce <dbl> 12.7, 12.5, 10.8, 13.5, 8.0, 11.6, 6.7, 8.9, 6…
## $ Divorce.SE <dbl> 0.79, 2.05, 0.74, 1.22, 0.24, 0.94, 0.77, 1.39…
## $ WaffleHouses <dbl> 128, 0, 18, 41, 0, 11, 0, 3, 0, 133, 381, 0, 0…
## $ South <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0…
## $ Slaves1860 <dbl> 435080, 0, 0, 111115, 0, 0, 0, 1798, 0, 61745,…
## $ Population1860 <dbl> 964201, 0, 0, 435450, 379994, 34277, 460147, 1…
## $ PropSlaves1860 <dbl> 4.5e-01, 0.0e+00, 0.0e+00, 2.6e-01, 0.0e+00, 0…
(This ‘hypothesis’ is problematic and frankly classist, but it will demonstrate an important point about doing this kind of analysis.)
dat %>%
ggplot(data=., aes(WaffleHouses, Divorce))+
geom_point()+
geom_smooth(method='lm')
Here Divorce is the response variable and WaffleHouses is the covariate.
What can we do to examine this Waffle House effect more coherently? A large, populous state will likely have more Waffle Houses than a small one, so we can divide by the population to standardize it.
dat <- dat %>% # shortcut for pipe symbol in Rstudio: ctrl + shift + m
mutate(wh_per_million = WaffleHouses/Population)
dat %>%
ggplot(data=., aes(wh_per_million, Divorce))+
geom_point()+
geom_smooth(method='lm')
fit <- lm(Divorce~wh_per_million, data=dat)
summary(fit)
##
## Call:
## lm(formula = Divorce ~ wh_per_million, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.5343 -1.2448 -0.0718 1.0552 3.6802
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.31980 0.27723 33.617 < 2e-16 ***
## wh_per_million 0.07442 0.02730 2.726 0.00892 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.712 on 48 degrees of freedom
## Multiple R-squared: 0.1341, Adjusted R-squared: 0.116
## F-statistic: 7.431 on 1 and 48 DF, p-value: 0.008921
par(mfrow=c(2,2)) # this specifies a 2x2 grid for the plots to go in
plot(fit) # this will plot model diagnostics
Residuals vs. Fitted: checks the assumption of homoscedasticity. Does the variance of the residuals change along the scale of fitted values? If so, the data are heteroscedastic and probably should not be modeled with a simple linear model.
Normal Q-Q: checks the model’s distributional assumptions. The data are consistent with those assumptions when the standardized residuals follow a 1:1 relationship with the theoretical quantiles of the distribution.
Scale-Location: another view of homoscedasticity; the spread of the square root of the standardized residuals should be roughly constant across fitted values.
Residuals vs Leverage: highlights especially influential observations. Here the 11th observation is exerting vastly more leverage on the model fit than the rest. You might consider applying a transform (sqrt, log10, etc.) to the covariates if their distribution is especially skewed.
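For example, if wh_per_million turned out to be strongly right-skewed, a transformed fit might look like this (a minimal sketch; log_wh is a name introduced here, and the +1 guards against taking the log of zero):

dat %>%
  mutate(log_wh = log10(wh_per_million + 1)) %>% # +1 because some states have zero Waffle Houses
  ggplot(aes(log_wh, Divorce))+
  geom_point()+
  geom_smooth(method='lm')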
Under what assumptions?
We also assume the observations are “independently and identically distributed”.
hist(dat$Divorce)
If we look back at the data, perhaps only a few observations (states) are driving this perceived trend.
Let’s see who is driving this relationship.
ggplot(dat, aes(wh_per_million, Divorce))+
geom_point()+
geom_smooth(method='lm')+
geom_label(aes(label=Location))
Here we apply a power analysis to examine what sample size would be needed to detect an effect of this size.
library(pwr)
# Does power analysis indicate we've done something wrong?
pwr.r.test(n = NULL,
           r = 0.074, # the slope estimate from our waffle divorce model, used as the effect size
           sig.level = .05, # the significance level (alpha, the Type I error rate), NOT a p-value;
           # https://en.wikipedia.org/wiki/Statistical_significance
           power = 0.95, # ranges from 0-1.
           # Roughly, the probability of detecting an effect that truly exists
           # https://en.wikipedia.org/wiki/Power_(statistics)
           alternative = "two.sided")
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 2366.44
## r = 0.074
## sig.level = 0.05
## power = 0.95
## alternative = two.sided
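A hypothetical follow-up (not part of the original analysis): pwr.r.test() expects a correlation coefficient, and for a simple regression |r| = sqrt(R^2), i.e. sqrt(0.1341) ≈ 0.37 here, rather than the raw slope. We can ask how much power our 50 states gave us to detect that correlation:

pwr.r.test(n = 50, # the 50 states we observed
           r = sqrt(0.1341), # the correlation implied by the model's R-squared
           sig.level = .05,
           alternative = "two.sided") # leaving power unspecified solves for power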
We can visualize how large a sample size would be required to identify an effect. Taken from: https://www.statmethods.net/stats/power.html
library(tidyverse)
library(pwr)
set.seed(3)
population_1 <- rnorm(10000, mean = 50, sd=25) # simulate a population of 10,000 individuals
nobs <- 5 # number of observations to draw per sample
sample_1 <- population_1[sample.int(10000, nobs)] # draw a small random sample
sample_2 <- population_1[sample.int(10000, nobs)] # draw a second, independent sample
t.test(sample_1, sample_2)
##
## Welch Two Sample t-test
##
## data: sample_1 and sample_2
## t = 0.85889, df = 7.2497, p-value = 0.4179
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -31.80723 68.49382
## sample estimates:
## mean of x mean of y
## 47.09334 28.75004
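To see how unreliable n = 5 really is, we can repeat the comparison many times (an illustrative sketch). Both samples come from the same population, so any ‘significant’ result is a false positive:

pvals <- replicate(1000, {
  s1 <- population_1[sample.int(10000, nobs)] # redraw two small samples
  s2 <- population_1[sample.int(10000, nobs)]
  t.test(s1, s2)$p.value
})
mean(pvals < 0.05) # the false-positive rate; should sit near the nominal 0.05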
# Plot sample size curves for detecting correlations of
# various sizes.
library(pwr)
# range of correlations
r <- seq(0.05,.5,.01)
nr <- length(r)
# power values
p <- seq(.4,.9,.1)
np <- length(p)
# obtain sample sizes
samsize <- array(numeric(nr*np), dim=c(nr,np))
for (i in 1:np){
for (j in 1:nr){
result <- pwr.r.test(n = NULL, r = r[j],
sig.level = .05, power = p[i],
alternative = "two.sided")
samsize[j,i] <- ceiling(result$n)
}
}
# set up graph
xrange <- range(r)
yrange <- round(range(samsize))
colors <- rainbow(length(p))
plot(xrange, yrange, type="n",
xlab="Correlation Coefficient (r)",
ylab="Sample Size (n)" )
# add power curves
for (i in 1:np){
lines(r, samsize[,i], type="l", lwd=2, col=colors[i])
}
# add annotation (grid lines, title, legend)
abline(v=0, h=seq(0,yrange[2],50), lty=2, col="grey89")
abline(h=0, v=seq(xrange[1],xrange[2],.02), lty=2,
col="grey89")
title("Sample Size Estimation for Correlation Studies\n
Sig=0.05 (Two-tailed)")
legend("topright", title="Power", as.character(p),
fill=colors)
Read about it: https://en.wikipedia.org/wiki/Simpson%27s_paradox This example comes from: https://blog.revolutionanalytics.com/2015/11/fun-with-simpsons-paradox-simulating-confounders.html
Here we will simulate a large number of observations with a seemingly straightforward trend.
# signed distance from each point (x, y) to the line y = m*x + b
point_line_distance <- function(b, m, x, y){
  (y - (m*x + b))/sqrt(m^2 + 1)}

# assign each observation to a group based on its distance from a reference line
confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0 # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d)) # rescale to avoid 0 and 1
  data.frame(x=x, y=y,
             group=as.factor(sprintf("grp%02d", ceiling(num_grp*(d_scaled)))))
}

# fit a separate lm() within each group and collect the coefficients
find_group_coefficients <- function(data){
  coef <- t(sapply(levels(data$group),
                   function(grp) coefficients(lm(y ~ x, data=data[data$group==grp,]))))
  coef[!is.na(coef[,1]) & ! is.na(coef[,2]),]
}
set.seed(1) # reproducibilty with the random number generator
N <- 3000 # number of simulated observations
x <- rnorm(N) # generate random numbers for the covariate
m <- -0.5555556 # The trend effect
b <- 8.3333333 # The intercept
y <- m * x + b + rnorm(length(x)) # simulate the response variable
plot(x, y, col="gray", pch=20, asp=1) # plot the data
fit <- lm(y ~ x) # Fit a linear model
abline(fit, lty=2, lwd=2) # Use the linear model fit to add a trend-line to the plot
summary(fit) # Inspect the summary statistics about the fit
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6481 -0.6682 -0.0088 0.6665 3.0759
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.32840 0.01830 455.1 <2e-16 ***
## x -0.53582 0.01768 -30.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.002 on 2998 degrees of freedom
## Multiple R-squared: 0.2344, Adjusted R-squared: 0.2342
## F-statistic: 918.1 on 1 and 2998 DF, p-value: < 2.2e-16
pwr.r.test(n = NULL,
           r = -0.535, # the slope from the simulated model above (not the waffle divorce model)
           sig.level = .001, # the significance level (alpha, the Type I error rate), NOT a p-value;
           # https://en.wikipedia.org/wiki/Statistical_significance
           power = 0.999, # ranges from 0-1.
           # Roughly, the probability of detecting an effect that truly exists
           # https://en.wikipedia.org/wiki/Power_(statistics)
           alternative = "two.sided")
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 116.9737
## r = 0.535
## sig.level = 0.001
## power = 0.999
## alternative = two.sided
OK, so we have more than enough observations to satisfy our power analysis. This trend seems robust, correct?
Well what if the data is not independent? What if the data was sampled in regional groups?
# This function plots groups of data by color
striped_scatterplot <- function(formula, grouped_data){
# one color per group
colors <- rev(viridis::viridis(length(levels(grouped_data$group)), end=0.95, option='A'))
plot(formula, grouped_data, bg=colors[grouped_data$group], pch=21, asp=1)
grp_coef <- find_group_coefficients(grouped_data)
# if some coefficients get dropped, colors won't match exactly
for (r in 1:nrow(grp_coef))
abline(grp_coef[r,1], grp_coef[r,2], col=colors[r], lwd=2)
}
m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # function defined above
striped_scatterplot(y ~ x, cdf) # also defined above
Now we will fit a linear model with a varying intercept. By allowing the intercept term to vary by group, we partition variation away from the x term that we are examining.
cdf10 <- confounded_data_frame(x, y, m_new, num_grp=10)
# ignoring the grouping (confounded estimate)
coefficients(lm(y ~ x, cdf10))['x']
## x
## -0.5358175
# conditioning on the group (confounder included)
coefficients(lm(y ~ x + group, cdf10))['x']
## x
## 0.8193693
j0 <- lme4::lmer(y~x+(1|group), data=cdf10)
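We can pull the group-adjusted slope out of this fit with fixef() (this is where the estimate quoted below comes from):

lme4::fixef(j0)['x'] # the fixed-effect estimate for x, adjusted for group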
So the simple model that does not acknowledge differences between groups suggests that x has a -0.5358175 effect upon y, per unit of x.
The linear model with a group-varying intercept shows that the effect of x upon y within each group is actually positive (0.996). This is close to the value (m_new = 1) used to construct the groups.
These kinds of models are very useful and a little beyond the scope of an intro seminar like this. But let’s go over what the formula means:
y~x+(1|group)
y is the response variable (aka, dependent variable)
x is the covariate (aka, the independent variable)
(1|group) means that the ‘1’ (the intercept) is conditional upon the group; each group gets its own intercept.
If we wanted to fit a model where the effect of x depends on the group, how might we write it?
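One possibility, sketched here with lme4’s random-slope syntax (j1 is a new name introduced for this sketch):

j1 <- lme4::lmer(y ~ x + (1 + x | group), data=cdf10) # each group gets its own slope and intercept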
Let’s apply what we learned about “Simpson’s paradox”, i.e. the influence of lurking confounders, to the waffle data.
dat <- dat %>% # standardize MedianAgeMarriage (z-score: subtract the mean, divide by the sd)
  mutate(MedianAgeMarriage_s = (MedianAgeMarriage - mean(MedianAgeMarriage)) /
           sd(MedianAgeMarriage))
fit2 <- lm(Divorce~wh_per_million+MedianAgeMarriage_s, data=dat)
summary(fit2)
##
## Call:
## lm(formula = Divorce ~ wh_per_million + MedianAgeMarriage_s,
## data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0147 -0.7782 0.0195 0.9208 3.8617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.41692 0.22874 41.169 < 2e-16 ***
## wh_per_million 0.05479 0.02279 2.404 0.0202 *
## MedianAgeMarriage_s -1.00126 0.20420 -4.903 1.17e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.407 on 47 degrees of freedom
## Multiple R-squared: 0.4271, Adjusted R-squared: 0.4027
## F-statistic: 17.52 on 2 and 47 DF, p-value: 2.063e-06
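As before, it is worth re-running the diagnostic plots on this expanded model:

par(mfrow=c(2,2)) # 2x2 grid, as before
plot(fit2) # model diagnostics for the two-covariate fit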