Parallel Computing: Monte Carlo Study for t-test

Packages

The following packages are used in this project:

scales: formats and labels scales nicely for better visualization
tidyverse: includes collections of useful packages like dplyr (data manipulation), tidyr (tidying data), ggplots (creating graphs), etc.
glue: offers interpreted string literals for easy creation of dynamic messages and labels
parallel: parallel computation, including by forking (taken from package multicore), by sockets (taken from package snow) and random-number generation

if (!require("pacman")) utils::install.packages("pacman", dependencies = TRUE)

## Loading required package: pacman

pacman::p_load(
  scales,
    tidyverse,
    glue,
    parallel
)

Implemente helper functions to do a t-test

Write a function to calculate the test statistic.

Inputs should be a vector of numeric data and the mean value to compare against (\(\mu_0\))

The output should be the calculated \(t_{\text{obs}}\) value

# a helper function to calculate t statistic
t_statistic <- function(x, mu){
  (mean(x) - mu) / ( sd(x)/sqrt(length(x)) )
}

Write a function to determine whether you reject \(H_0\) for fail to reject it.

Inputs should be the test statistic value, the sample size, the significance level (\(\alpha\)), and the direction of the alternative hypothesis (left, right, or two-sided).

The output should be a TRUE or FALSE value depending on whether or not you reject the null hypothesis. Note: qt() will give quantiles from the t-distribution.

In an one-sample t test, whether the null hypothesis is rejected or not can be determined either by comparing the p-value to the significance level or by comparing the observed and critical statistics. In the function definition below, we decide what type of the test is (1: left, 2: right and 3: two-sided) and then calculate p-value and statistics critical value accordingly.

A wrapper function is also defined to perform the t test easier.

# a helper function to determine whether to reject the null hypothesis
t_test_reject_H0 <- 
  function(t_statistic, size, direction = "two-sided", alpha = 0.05, verbose = FALSE){
  
  # check if test direction is valid
  directions <- c("left", "right", "two-sided")
  if(!(direction %in% directions)) {
    # invalid test direction
    print("Invalid test direction!")
    result(NA)
  } 
  
  # degrees of freedom for 1-sample t-test
  df <-  size - 1
  
  # test type based on direction: (1: left, 2: right and 3: two-sided)
  test_type <- which(direction == directions)
  
  # how many sides (tails)
  sides <- c(1, 1, 2)[test_type]
  
  # is the test comparison on the lower tail?
  lower.tail = c(T, F, t_statistic < 0)[test_type]
  
  # get p-value
  p_value <- pt(t_statistic, df, lower.tail = lower.tail) * sides
  
  # get critical t score
  t_critical <- qt(alpha/sides, df, lower.tail = lower.tail)
  
  # determine whether or not to reject the null hypothesis
  #   method 1: compare p-value to alpha
  reject_p <- p_value <= alpha
  #   method 2: compare t statistics to critical value
  reject <- ifelse(lower.tail, t_statistic < t_critical, t_statistic > t_critical)
  
  # print out test infomation and result if required
  if(verbose){
    # get a proper test name for print-out
    test_name <- direction %>% 
    paste0("-sided") %>% 
    str_sub(end = str_locate(., "-sided")[2])
  
    # print out test info and result
    print(glue(
      'One sample t-test: {test_name} with alpha = {alpha}\n',
      't = {round(t_statistic, 3)}, df = {size-1}, ',
      'p-value = {format(p_value, digits = 4, scientific = T)}\n',
      't* = {round(t_critical, 3)} for confidence level = {1-alpha}\n',
      'Result: {ifelse(reject, "Reject", "Fail to reject")} the null hypothese H0'
    ))
  }
  
  return(reject)
}

# a wrapper function to perform a t test: whether to reject the null hypothesis
t_test <- function(x, mu, ...){
  t_test_reject_H0(t_statistic(x, mu), length(x), ...)
}

Test how well the function works! Use the built-in iris data set and run a few hypothesis tests. Using the data, determine if the true mean

Test Case 1:

sepal length differs from 5.5 (i.e. test \(H_0: \mu=5.5\) vs \(H_A: \mu \neq 5.5\) and report whether or not we reject \(H_0\))

# use our helper function to do the t-test
t_test(iris$Sepal.Length, 5.5, verbose = TRUE)

## One sample t-test: two-sided with alpha = 0.05
## t = 5.078, df = 149, p-value = 1.123e-06
## t* = 1.976 for confidence level = 0.95
## Result: Reject the null hypothese H0

## [1] TRUE

# compare to t.test()
t.test(iris$Sepal.Length, mu = 5.5)

## 
##  One Sample t-test
## 
## data:  iris$Sepal.Length
## t = 5.078, df = 149, p-value = 1.123e-06
## alternative hypothesis: true mean is not equal to 5.5
## 95 percent confidence interval:
##  5.709732 5.976934
## sample estimates:
## mean of x 
##  5.843333

Test Case 2:

sepal width is greater than 3.5

# use our helper function to do the t-test
t_test(iris$Sepal.Width, 3.5, direction = "right", verbose = TRUE)

## One sample t-test: right-sided with alpha = 0.05
## t = -12.439, df = 149, p-value = 1e+00
## t* = 1.655 for confidence level = 0.95
## Result: Fail to reject the null hypothese H0

## [1] FALSE

# compare to t.test()
t.test(iris$Sepal.Width, mu = 3.5, alternative = "greater")

## 
##  One Sample t-test
## 
## data:  iris$Sepal.Width
## t = -12.439, df = 149, p-value = 1
## alternative hypothesis: true mean is greater than 3.5
## 95 percent confidence interval:
##  2.998429      Inf
## sample estimates:
## mean of x 
##  3.057333

Test Case 3:

petal length is less than 4

# use our helper function to do the t-test
t_test(iris$Petal.Length, 4, direction = "left", verbose = TRUE)

## One sample t-test: left-sided with alpha = 0.05
## t = -1.679, df = 149, p-value = 4.763e-02
## t* = -1.655 for confidence level = 0.95
## Result: Reject the null hypothese H0

## [1] TRUE

# compare to t.test()
t.test(iris$Petal.Length, mu = 4, alternative = "less")

## 
##  One Sample t-test
## 
## data:  iris$Petal.Length
## t = -1.679, df = 149, p-value = 0.04763
## alternative hypothesis: true mean is less than 4
## 95 percent confidence interval:
##      -Inf 3.996566
## sample estimates:
## mean of x 
##     3.758

Monte Carlo Study

A Monte Carlo Simulation is one where we generate (pseudo) random values using a random number generator and use those random values to judge properties of tests, intervals, algorithms, etc.

Generate data from a gamma distribution using different shape parameters and sample sizes. We’ll then apply the t-test functions from above and see how well the \(\alpha\) level is controlled under the incorrect assumption about the distribution our data is generated from.

Pseudocode:

generate a random sample of size n from the gamma distribution with given \(\text{shape}\) and \(\text{rate}\) parameters specified (see rgamma())

find the test statistic using the actual mean of the randomly generated data for \(\mu_0\) (i.e \(\mu_0 = \text{shape} ∗ \text{rate}\) - implying the null hypothesis \(H_0\) is true)

using the given alternative (left, right, or both) determine whether you reject or not

repeat many times

observe the proportion of rejected null hypothesis from your repetitions (these are all incorrectly rejected since the null hypothesis is true). This proportion is our Monte Carlo estimate of the \(\alpha\) value under this data generating process.

Write code to do the above when sampling from a gamma distribution with \(\text{shape} = 0.5\) and \(\text{rate} = 1\) with a sample size of \(n = 10\) for a two-sided test. Use the replicate() function (a wrapper for the sapply() function) to do the data generation and determination of reject/fail to reject with relative ease (with the number of replications being 10000)! Then find the mean of the resulting TRUE/FALSE values as your estimated \(\alpha\) level under these assumptions.

We first set up the parameters including sample size, shape and rate to generate random values from the gamma distribution. Then we perform a one-sample two-sided t test with the significance level of \(\alpha=0.05\) and repeat this process for 10,000 times to get an estimated \(\alpha\).

# the simulation parameters: sample size, shape and rate of the Gamma distribution
X <- list(n = 10, shape = 0.5, rate = 1)

# execute simulation for 10000 times
alpha_estimated <-
  replicate(10000, t_test(rgamma(X$n, X$shape, X$rate), mu =  X$shape * X$rate)) %>% 
  mean()

print(glue(
  "The estimated alpha is {alpha_estimated}\n",
  "The significance level used in the t test is 0.05."
))

## The estimated alpha is 0.1363
## The significance level used in the t test is 0.05.

An incorrect assumption about the distribution results in big estimated alpha.

Parallel Computing

Do the above Monte Carlo simulation for several settings of sample size, shape, and rate parameter. As it turns out, we really don’t need to worry about the rate parameter, it just scales the distribution.

create a vector of sample size values (10, 20, 30, 50, 100)

create a vector of shape values (0.5, 1, 2, 5, 10, 20)

create a rate vector with the value 1

We want to be able to use parLapply() to execute the above Monte Carlo study for combinations of sample sizes and shape values (with rate always 1). To use parLapply(), we need to pass

our cluster set up via makeCluster()

We must also use clusterExport() function to pass a list of our functions that we created

If you use any packages, you’ll need to use clusterEvalQ() to pass the package

X a list we will parallelize our computations over (each list element would be a combination of sample size and shape parameter)

fun a function that does the replication of gamma samples, finds the decisions for each sample, and calculates and returns the proportion

Create a function to do this that just uses the replicate() and mean() code above

any other arguments needed for our functions to run (for instance \(\text{rate} = 1\))

Set up the parameter grid as a list in which each list element is a combination of sample size and shape parameter:

# parameters 
sample_sizes <- c(10, 20, 30, 50, 100)
shape_values <- c(0.5, 1, 2, 5, 10, 20)
rate_values <- c(1)

# set up a list of the parameter grid
X <- expand_grid(sample_sizes, shape_values, rate_values) %>% 
  `names<-`(c("n", "shape", "rate")) %>% 
  split(., 1:nrow(.)) %>% 
  unname() %>% 
  lapply(as.list)

# check if the number of combinations is correct
length(X) == prod(sapply(list(sample_sizes, shape_values, rate_values), length))

## [1] TRUE

# take a look to ensure the setup is correct
X[1:3]

## [[1]]
## [[1]]$n
## [1] 10
## 
## [[1]]$shape
## [1] 0.5
## 
## [[1]]$rate
## [1] 1
## 
## 
## [[2]]
## [[2]]$n
## [1] 10
## 
## [[2]]$shape
## [1] 1
## 
## [[2]]$rate
## [1] 1
## 
## 
## [[3]]
## [[3]]$n
## [1] 10
## 
## [[3]]$shape
## [1] 2
## 
## [[3]]$rate
## [1] 1

Define the function to run the simulation and return the results:

getResults <- function(X){
  replicate(10000, t_test(rgamma(X$n, X$shape, X$rate), mu =  X$shape * X$rate)) %>% 
  mean()
}

Run Monte Carlo simulation with serial processing, and save results and elapsed time:

# run tests - serial processing
time_ser <-
  system.time({
    results_ser <- lapply(X, getResults)}) %>% 
  print()

##    user  system elapsed 
##    6.89    0.11    7.04

Run Monte Carlo simulation with parallel processing, and save results and elapsed time:

# Set up cores and clusters
cores <- detectCores()
cluster <- makeCluster(cores - 1)

# send packages and functions to cores on Windows
clusterExport(cluster, list("getResults", "t_test", "t_test_reject_H0", "t_statistic"))
clusterEvalQ(cluster, library(tidyverse)) %>% invisible()
clusterEvalQ(cluster, library(glue)) %>% invisible()

# run tests - parallel processing
time_par <-
  system.time({
    results_par <- parLapply(cluster, X = X, fun = getResults)}) %>% 
  print()

##    user  system elapsed 
##    0.00    0.00    1.33

# close the cluster
stopCluster(cluster)

Comparison of the results (estimated alpha) in a summary table and a faceted plot:

# combine results and the run parameters into one data frame
result_summary <- bind_cols(
  as_tibble(bind_rows(X)), 
  tibble(alpha_serial = unlist(results_ser)),
  tibble(alpha_parallel = unlist(results_par)))
result_summary

result_summary %>% 
  pivot_longer(
    cols = starts_with("alpha"),
    names_to = "processing",
    names_prefix = "alpha_",
    values_to = "alpha") %>% 
  mutate(
    processing = factor(processing, levels = c("serial", "parallel"))
  ) %>% 
  ggplot(aes(x = processing, y = alpha, fill = processing)) +
  geom_col(show.legend = FALSE) +
  geom_text(
    aes(label = format(alpha, digits = 2)), 
    color = "white",
    size = 3, 
    position = position_stack(vjust = 0.5)) +
  facet_grid(
    n ~ shape, 
    labeller = labeller(.rows = label_both, .cols = label_both)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1)) + 
  labs(
    title = "Comparison of Estimated alpha (10,000 Replications)",
    subtitle = glue(
      "1-sample 2-sided t test (alpha = 0.05) ",
      "on Gamma distributed random values (rate = 1)"),
    x = "Processing Method",
    y = "Estimated alpha",
    fill = NULL)

The estimated alpha decreases to 0.05 when sample size increases which is explained by CLT. It also decreases with larger shape value. This agrees with the fact that the Gamma distribution converges to the normal distribution for a large shape parameter.

At last, let’s compare the computing time used (seconds) by both methods:

time_summary <- bind_rows(
  c(seq_or_par = "Sequential", time_ser), 
  c(seq_or_par = "Parallel", time_par)) %>% 
  mutate_at(2:4, as.numeric)
time_summary

print(glue(
  "In this study, the parallel processing is ",
  "{percent((time_ser[3]- time_par[3])/time_ser[3], 0.01)} ",
  "faster than the sequential processing."
))

## In this study, the parallel processing is 81.11% faster than the sequential processing.