Gamma distributions for Hilary White

Goal is to describe the shape of E/I data with a better distribution than the normal distribution since the data are clearly non-normal. Gamma distribution visually seems to do the job.

Create some data

Create some data that approximate real world data. Mean and variance are easy things to use so start with them and convert them to shape and scale (or shape and rate) for use with the gamma function. Depending on how the probability density function is formulated the shape and scale (or shape and rate) are used as the two required parameters (just as intercept and slope are the two parameters in a linear regression.) See both formulations in https://en.wikipedia.org/wiki/Gamma_distribution.

Test some code: 1. Generate 10,000 data points in a gamma distribution to get a good idea of the shape 2. Do the initial calculations required to fit those data to prove we can do it

# Mean and variance are handy values to use
mean <- 0.1
var <- 0.003

#Convert to shape and scale for the gamma function
shape <- mean^2/var
scale <- var/mean

# Generate data
set.seed(100)
x = rgamma(1e4, shape = shape, scale = scale)

Plot the data to check that it looks the way we expect.

library(ggplot2)
qplot(x, bins = 100)

Fit these data with a gamma function to show that we can get the correct parameters back (recall that we set the parameters above.)

library(MASS)
mod <- fitdistr(x, "gamma")

Recreate the mean (first number, 0.1) and variance (second number, 0.003) from the model fit to double check.

as.numeric(mod$estimate["shape"] / mod$estimate["rate"])
## [1] 0.09890784
as.numeric(mod$estimate["shape"] / mod$estimate["rate"]^2)
## [1] 0.002916718

These look good. So we can fit a gamma model.

Now the fun part

Calculate the 68th and 95th percentiles (akin to +1SD and +2SD) by using the gamma cumulative density function and the results of the model fit.

qgamma(c(0.68, 0.95), mod$estimate["shape"], mod$estimate["rate"])
## [1] 0.1152203 0.2011226

This is handy. So draw those lines on the plot:

library(ggplot2)
qplot(x, bins = 100) +
  geom_vline(xintercept = qgamma(c(0.68, 0.95), mod$estimate["shape"], mod$estimate["rate"]))

Make a function

Collect all these commands and turn them into a function to automate this process. Pass the data (dat) and a title for the plot (plotTitle) to the function hilz. Make the output a little easier to read by using the cat() function

hilz <- function(dat, plotTitle){
  mod <- fitdistr(dat, "gamma")
  print(qplot(dat) + stat_function(fun = dgamma, args = list(shape = mod$estimate["shape"], rate = mod$estimate["rate"]), colour = "red") + 
          geom_density() + 
          geom_vline(xintercept = qgamma(c(0.68, 0.95), mod$estimate["shape"], mod$estimate["rate"])) + 
          labs(x = "E/I", title = plotTitle))
  cat("mean:", as.numeric(mod$estimate["shape"]/mod$estimate["rate"]), "\n")
  cat("variance:", as.numeric(mod$estimate["shape"]/mod$estimate["rate"]^2), "\n")
  cat("+1SD 68% and +2SD 95%:", qgamma(c(0.68, 0.95), mod$estimate["shape"], mod$estimate["rate"]), "\n")
}

Test the function with a small dataset

mean <- 0.1
var <- 0.003
shape <- mean^2/var
scale <- var/mean
dat2 <- rgamma(30, shape = shape, scale = scale)
hilz(dat2, "Ma Sitez")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## mean: 0.09953554 
## variance: 0.002215883 
## +1SD 68% and +2SD 95%: 0.1149115 0.1874249

Bootstrap the estimates

Four steps in the function:

  1. Bootstrap the data 1000 times (with sample), model each of these (with fistdist), collect the resulting model parameters
  2. Prepare the code needed to plot any of the models (create a stat_function() line for each of the 1000 bootstrapped models with alply())
  3. Plot the data with some of the 1000 bootstrapped models and the mean of all 1000 models
  4. Report the mean of the 1000 bootstrapped datasets and 68th and 95th percentiles of the modelled datasets
library(fitdistrplus)
library(plyr)

hilz_boot <- function(dat, plotTitle){
  # bootstrap the data 1000 times
  # model all 1000 bootstrapped datasets and return the parameters of the model
  cat("bootstrapping and modelling", "\n")
  boot_parms <- sapply(1:1000, function(i) {
    est <- fitdist(sample(dat, size = length(dat), replace = TRUE), distr = "gamma")  
    return(est$estimate)
  })
  
  # prepare the stat_function for each of the 1000 models for plotting
  # this is a giant list of stat_function, one for each model
  cat("prepping lines for plot", "\n")
  gamma_lines <-
    alply(as.matrix(setNames(as.data.frame(t(boot_parms)), c("rate", "shape"))), 1, function(i) {
      stat_function(fun = dgamma, args = list(shape = i[1], rate = i[2]), colour="grey", size = 0.1)
    })
  
  # plot the data, defaults to a histogram
  # add some of the modelled functions from gamma_lines
  # add vertical lines for the 68th and 95th percentiles from the mean of all 1000 models
  cat("plotting", "\n")
  print(qplot(dat) + gamma_lines[1:200] + 
          stat_function(fun=dgamma, args = list(shape = rowMeans(boot_parms)[1], rate = rowMeans(boot_parms)[2]), colour="red") + 
          geom_vline(xintercept = qgamma(c(0.68, 0.95), rowMeans(boot_parms)[1], rowMeans(boot_parms)[2]), colour="blue") +
          labs(x = "E/I", title = plotTitle))
  
  # output the mean value of the modelled datasets
  # output the 68th and 95th percentiles of the modelled datasets
  cat("mean:", as.numeric(rowMeans(boot_parms)[1]/rowMeans(boot_parms)[2]), "\n")
  cat("+1SD 68% and +2SD 95%:", qgamma(c(0.68, 0.95), rowMeans(boot_parms)[1], rowMeans(boot_parms)[2]), "\n")
}

Test the bootstrapping function:

hilz_boot(dat2, "Ma Sitez")
## bootstrapping and modelling 
## prepping lines for plot 
## plotting
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## mean: 0.09794465 
## +1SD 68% and +2SD 95%: 0.1126297 0.179365