Bayesian independent group comparison power

Author

Jamie Elsey (edits/notes by David Reinstein)

Published

November 17, 2021

1 Introduction

This guided example is a supplement to the more general chapter offering a framework and workflow for power analyses. It follows the general steps outlined there. In this example, we will use the package brms for Bayesian estimation of parameter values. To keep things computationally tractable, we will focus on a simple comparison between two independent samples. The general steps are applicable to other Bayesian designs.

Pre-requisite: You will need to set up RStan on your computer for the Bayesian analyses in this example to work. See mc-stan.org and the RStan wiki.

I never managed to install the “macOS R toolchain” directly, but I did manage to get RStan up and running thanks to the Coatless Professor’s workaround.
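Once RStan is installed, one common way to confirm that the C++ toolchain works (a check suggested in the RStan getting-started instructions) is to compile and sample from the example model that ships with the package:

Code
# if this compiles and samples without errors, the toolchain is working
library(rstan)
example(stan_model, package = "rstan", run.dontrun = TRUE)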

2 Step 1: Generate a large number of data sets for the analysis

We will run a comparison of two independent samples, and seek to assess our capacity to detect1 a range of possible effect sizes, considering Cohen’s \(d\) values of 0, .15, and .3.

Code
#We will also increase the memory limit in R to allow for the large objects that are likely to be created:
memory.limit(100000)
[1] Inf
Code
#This is a Windows-specific function. Some possible ways to do this on macOS are suggested here: https://stackoverflow.com/questions/51295402/r-on-macos-error-vector-memory-exhausted-limit-reached
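For example, one suggestion from that thread for macOS (a sketch only; the appropriate value depends on your machine) is to raise the vector memory limit via an environment variable:

Code
# add a line such as the following to ~/.Renviron
# (e.g. opened with usethis::edit_r_environ()), then restart R:
# R_MAX_VSIZE=100Gb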
The data-making function below:

  • creates a data frame with two groups,
    • each with the same group.size (easy to adjust this to allow two different sizes),
    • each drawn from a standard normal distribution,
    • with group b’s mean differing from group a’s according to the specified effect.size;
  • sim is just a counter for ‘which simulation this is’;
  • the output has one row per ‘observation in group’… so 2*group.size rows.
Code
#| label: data-making-function
#| code-summary: "function to make simulated data"

t.data.maker <- function(sim, effect.size, group.size) {

  # create a data frame with two groups, standard normal distribution
  # group b differs from group a according to the specified effect size
  data <- tibble(
    a.group = rnorm(n = group.size, mean = 0, sd = 1),  #create `group.size` rows of a standard normal,
    b.group = rnorm(n = group.size, mean = effect.size, sd = 1), #... of a std normal shifted by `effect` for 'treatment group' (or after-treatment if this is within subject)
    n = seq(1:group.size), #counter
    'effect.size' = effect.size,
    'sim' = sim
  ) %>%
    pivot_longer(cols = a.group:b.group, #make this one row per 'observation in  group' (rather than 1 per observation with 'both groups') ... so 2*group.size rows
      names_to = 'group',
      values_to = 'response')

  return(data)

}
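As a quick illustration (not part of the full run), here is what a single small simulated data set looks like; this assumes the tidyverse is loaded:

Code
# one simulation, effect size .3, 5 observations per group:
# 10 rows in long format, one per observation
t.data.maker(sim = 1, effect.size = .3, group.size = 5)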

We will run 250 analyses2 for each of the three effect sizes of interest. We use maximum group sizes of 1500.3 We put these into a format to pass to a map function, and then run the data-making function over the different effect sizes to generate the many data sets.

In the code below, we create this simulated data, creating a list of tibbles, one tibble for each simulated data set, considering the three effect sizes.

Code
#| label: iterate-datamaker
# set up furrr to run iterations more quickly ... this is (optional) parallel processing stuff
library(furrr)
plan(multisession)
options <- furrr_options(seed = 1234)

# make a tibble with our intended effect sizes (DR: for 250 simulations each) 
effect.sizes <- tibble(
    effect.size = c(rep(0, 250), rep(.15, 250), rep(.3, 250)),
    sim = seq(1:750)
    )

# run t.data.maker over all the requested data sets
# Map the effect sizes (and sim counters) into `t.data.maker`
#... to generate 750 simulated data sets (250 per effect size)
#... each with `group.size` = 1500 observations per group (3000 rows), where the 'a' group is standard normal and the 'b' group's mean is shifted by the given effect size

sim.data <- future_map2_dfr(
  .x = effect.sizes$sim,
  .y = effect.sizes$effect.size,
  .f = t.data.maker,
  group.size = 1500,
  .options = options
)

# split the simulated data sets (into a list of tibbles, one tibble per simulation) for analysis
sim.data <- sim.data %>% group_by(sim) %>% group_split()
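A quick sanity check on the result (a sketch, assuming the chunk above has run): we expect 750 data sets (250 per effect size), each with 2 × 1500 = 3000 rows.

Code
length(sim.data)                                   # should be 750
nrow(sim.data[[1]])                                # should be 3000
count(sim.data[[1]], group, name = "n.per.group")  # 1500 per group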

3 Step 2: Run the proposed analysis over the many data sets and return the estimands of interest

We set up a Bayesian regression model to run over the many simulated data sets.

It is important to ensure that the estimands we are interested in can be returned at this step, because this process can be time-consuming. We don’t want to have to repeat it due to small mistakes. Hence, first carefully check that the code returns all the parameters you need, and then put it into a function that runs over the many data sets.

In any case, we need to run the model at least once outside of the function.4 This initial regression compiles the underlying Stan model; the analyses that follow can then reuse the compiled model (via update()) rather than recompiling it each time, saving a lot of time.

Our formula for this two-group (t-test-style) comparison is a simple linear model:

Code
reg.form <- response ~ 1 + group
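For intuition (a hypothetical quick check, not part of the workflow), fitting the same formula by ordinary least squares on one simulated data set shows that the groupb.group coefficient is simply the difference between the two group means:

Code
# frequentist sanity check of the same linear model
summary(lm(reg.form, data = sim.data[[501]]))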

In the code below, we consider the priors we can set for this model. We pass brms::get_prior our data, the family of response and link we want to use in the model, and the formula we are fitting. It returns the parameters that can be given priors in this model, along with brms’s default priors for each.

We aim for ‘weakly informative priors’, which discourage the model from entertaining highly unlikely parameter estimates. Setting reasonable priors on a model also helps speed up the MCMC sampling.

Code
# load in requisite packages
library(brms)
library(tidybayes)

# check what priors can be defined in the model
get_prior(formula = reg.form,
          data = sim.data[[501]],
          family = gaussian())
                  prior     class         coef group resp dpar nlpar lb ub       source
                 (flat)         b                                                default
                 (flat)         b groupb.group                             (vectorized)
 student_t(3, 0.2, 2.5) Intercept                                                default
   student_t(3, 0, 2.5)     sigma                                     0          default

Next we set our own priors (in the code below).5

Code
# set our own priors
(
  reg.prior <- c(set_prior("normal(0 , 1)", class = "Intercept"),
               set_prior("normal(0 , .8)", class = "b", coef = "groupb.group"),
               set_prior("normal(1, 1)", class = "sigma"))
)
          prior     class         coef group resp dpar nlpar   lb   ub source
  normal(0 , 1) Intercept                                    <NA> <NA>   user
 normal(0 , .8)         b groupb.group                       <NA> <NA>   user
   normal(1, 1)     sigma                                    <NA> <NA>   user

Note that we have specified:

  • A standard normal prior over the intercept6
  • A normal prior with mean 0 and standard deviation 0.8 over the group-b adjustment (the difference between groups)
  • A normal(1, 1) prior over sigma, the residual standard deviation of the outcome
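To get a sense of how weak these priors are (a minimal sketch, not part of the original analysis), we can plot the prior density over the group difference; it places the bulk of its mass on differences well within ±2 standard deviations of the outcome:

Code
# density of the normal(0, 0.8) prior on the group-b adjustment;
# in Stan/brms the second argument of normal() is the standard deviation
curve(dnorm(x, mean = 0, sd = 0.8), from = -3, to = 3,
      xlab = "difference between groups", ylab = "prior density")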


Now we want to check that the model runs properly, and that we can extract the estimands of interest. We run the regression with the brm() command, and we extract and manipulate the estimands of interest by pulling out the posterior distribution from the model with posterior_samples():

Code
# grab a test data set from the many data sets

test.data <- sim.data[[501]]

# run the regression model
base.reg <- brm(formula = reg.form,
                data = test.data,
                family = gaussian(),
                chains = 2,
                cores = 2,
                iter = 1200,
                warmup = 200,
                prior = reg.prior)

(base.reg)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: response ~ 1 + group 
   Data: test.data (Number of observations: 3000) 
  Draws: 2 chains, each with iter = 1200; warmup = 200; thin = 1;
         total post-warmup draws = 2000

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept        0.01      0.03    -0.04     0.06 1.00     1732     1459
groupb.group     0.30      0.04     0.23     0.37 1.00     1457     1362

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     1.00      0.01     0.98     1.03 1.00     1894     1610

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).
Code
posterior <- as_tibble(posterior_samples(base.reg)) %>%
  # I use the names of the posterior sample columns
  # to calculate group means and Cohen's d
  mutate(a.group = `b_Intercept`,
           b.group = `b_Intercept` + `b_groupb.group`,
           d = `b_groupb.group` / sigma)

(posterior)
# A tibble: 2,000 × 8
   b_Intercept b_groupb.group sigma lprior   lp__  a.group b.group     d
         <dbl>          <dbl> <dbl>  <dbl>  <dbl>    <dbl>   <dbl> <dbl>
 1    -0.0122           0.345 0.997  -2.47 -4258. -0.0122    0.333 0.346
 2    -0.0138           0.349 0.995  -2.47 -4258. -0.0138    0.335 0.350
 3     0.0270           0.286 1.00   -2.44 -4258.  0.0270    0.313 0.286
 4     0.0180           0.292 0.990  -2.44 -4258.  0.0180    0.310 0.295
 5     0.0254           0.263 1.01   -2.43 -4258.  0.0254    0.288 0.260
 6     0.0296           0.275 0.997  -2.43 -4258.  0.0296    0.305 0.276
 7     0.0148           0.250 1.01   -2.42 -4259.  0.0148    0.265 0.248
 8    -0.00824          0.377 1.00   -2.49 -4260. -0.00824   0.369 0.377
 9     0.0140           0.314 0.991  -2.45 -4258.  0.0140    0.328 0.317
10     0.00134          0.321 1.00   -2.45 -4258.  0.00134   0.322 0.320
# … with 1,990 more rows
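As an aside (a minimal sketch using tidybayes, not part of the original write-up), the posterior for Cohen’s d from this test analysis can be summarised directly, for example with a posterior mean and 95% HDI:

Code
# point estimate and 95% highest density interval for Cohen's d
posterior %>% mean_hdi(d, .width = .95)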

We can see that our model ran smoothly (decent effective sample sizes, Rhat of 1), and that we can return the parameters of interest. This approach applies to essentially any Bayesian model - all one needs to do is identify which part (or parts) of the posterior one is interested in and extract it. We can now move on to running the analysis over the many data sets. On my computer, this set of analyses took about 10 minutes to complete. (3 Feb 2023 update: it seems faster now.)

Code
reg.maker <- function(simulated.data, base.reg,
                      breaks = seq(from = 100, to = 1500, by = 100)) {

  data <- simulated.data

  # this function allows the sample size to be cut
  # so that we can see power for many different sample sizes
  cut.samples <- function(break.point, data) {
    cut.data <- filter(data, n <= break.point)
    cut.data <- mutate(cut.data,
                       sample.size = break.point)
    return(cut.data)
  }

  data.cuts <- map_dfr(.x = breaks, .f = cut.samples, data = data)

  data.cuts <- data.cuts %>% group_by(sample.size) %>% group_split()

  # this function runs the regression model over the different sample sizes
  run.reg <- function(data, base.reg) {

    # seed each analysis with its simulation number, for reproducibility
    set.seed(as.numeric(data[1, "sim"]))

    # note use of the 'update' function rather than brm
    # to prevent need to recompile the model
    output <- update(base.reg,
                     newdata = data,
                     chains = 1,
                     cores = 1,
                     iter = 1750,
                     warmup = 250)

    output <- as_tibble(posterior_samples(output))

    output <- mutate(output,
                     sample.size = nrow(data) / 2,
                     effect.size = pull(data[1, "effect.size"]),
                     sim = pull(data[1, "sim"]),
                     a.group = `b_Intercept`,
                     b.group = `b_Intercept` + `b_groupb.group`,
                     cohen.d = `b_groupb.group` / `sigma`)

    return(output)
  }

  samples <- map_df(.x = data.cuts, .f = run.reg, base.reg = base.reg)

  return(samples)

}

t1 <- Sys.time()
regression.samples <- future_map_dfr(.x = sim.data,
                                     .f = reg.maker,
                                     base.reg = base.reg,
                                     .options = options)
t2 <- Sys.time()
t2 - t1

head(regression.samples)

4 Step 3: Summarise the output returned in Step 2 to determine likelihood of achieving various inferential goals

The regression.samples object now contains posterior distributions for analyses of various effect sizes and sample sizes. For this example, our key outcome of interest is the effect size, measured as Cohen’s d. We can use the posterior distribution for Cohen’s d to draw various inferences about it. Some examples of possible ‘tests’ one might like to do with regards to a posterior distribution are presented below.

For the purposes of demonstration, we will focus on two possible metrics of interest. The first is whether the 95% Highest Density Interval (HDI: the 95% most probable values from the posterior) for Cohen’s d is entirely above 0 (this is somewhat like a two-sided t-test); the second is whether we achieve a certain level of precision in the posterior estimate for Cohen’s d (here, a 95% HDI with a width of .25 or less).

To do this, we simply need to make a function that takes each analysis (each posterior distribution for a particular simulation, sample size, and effect size) and determines whether the posterior for Cohen’s d meets our decision criteria; we can then count, across simulations, how often each criterion is met.

Code
# first split the posterior output according to sample size, effect size, and simulation
regression.samples.split <-
regression.samples %>%
  group_by(effect.size, sample.size, sim) %>%
  group_split()

# make a function to summarise the posterior
posterior.inference <- function(posterior) {

  # take the posterior and retrieve the upper and lower
  # 95% hdi bounds for the effect size
  summary <- dplyr::summarise(posterior,
                             lower95 = hdi(cohen.d, .95)[1],
                             upper95 = hdi(cohen.d, .95)[2])

  # determine the width of the hdi and whether it is above 0
  # and add in meta information
  summary <- mutate(summary,
                    width = abs(lower95 - upper95),
                    width.25 = case_when(width <= .25 ~ 1,
                                         TRUE ~ 0),
                    above.0 = case_when(lower95 > 0 ~ 1,
                                        TRUE ~ 0),
                    sample.size = posterior[[1, 'sample.size']],
                    effect.size = posterior[[1, 'effect.size']],
                    sim = posterior[[1, 'sim']])

  return(summary)

}

# feed the different posteriors to the summariser
posterior.inferences <- future_map_dfr(.x = regression.samples.split,
                            .f = posterior.inference, .options = options)

We can now count and plot the proportion of times we achieve our inference goals, according to different effect sizes, sample sizes, and goals. Note that in the case of determining whether the 95% HDI for Cohen’s d is above 0, the line for the effect size of 0 represents a false positive rate (because the effect size is in fact not above 0).

Code
above0 <- posterior.inferences %>% group_by(sample.size, effect.size) %>%
          summarise(.groups = 'keep',
              n = sum(above.0)) %>%
  mutate('Proportion of times achieved' = n/250)

precision25 <- posterior.inferences %>% group_by(sample.size, effect.size) %>%
  summarise(.groups = 'keep',
            n = sum(width.25)) %>%
  mutate('Proportion of times achieved' = n/250)

ggplot(data = above0) +
  scale_x_continuous(limits = c(50, 1550), breaks = seq(from = 100, to = 1500, by = 200)) +
  geom_path(aes(x = sample.size, color = as.factor(effect.size),
                group = as.factor(effect.size), y = `Proportion of times achieved`)) +
  labs(x = 'n per group', color = 'Effect size', title = '95% HDI is above 0') +
  theme_minimal() +
  theme(aspect.ratio = 1)

Code
ggplot(data = precision25) +
  scale_x_continuous(limits = c(50, 1550), breaks = seq(from = 100, to = 1500, by = 200)) +
  geom_path(aes(x = sample.size, color = as.factor(effect.size),
                group = as.factor(effect.size), y = `Proportion of times achieved`)) +
  labs(x = 'n per group', color = 'Effect size', title = 'Width of 95% HDI is .25 or less') +
  theme_minimal() +
  theme(aspect.ratio = 1)

Note that, as would be expected, the chance that we achieve our goal of concluding the effect is above zero varies with the true effect size. When we consider the precision of our estimates, the actual effect size is not relevant: the more data we gather, the more precise our estimates become, irrespective of the true effect size.7
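Finally, as a sketch of how one might read a target sample size off these results (using an illustrative threshold of achieving the goal in at least 90% of simulations; this threshold is not from the original analysis):

Code
# smallest per-group sample size at which the '95% HDI above 0' goal
# is met in at least 90% of simulations, for each non-zero effect size
above0 %>%
  filter(effect.size > 0,
         `Proportion of times achieved` >= 0.9) %>%
  group_by(effect.size) %>%
  summarise(min.n.per.group = min(sample.size), .groups = 'drop')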

Footnotes

  1. DR: Or rule out?↩︎

  2. DR: simulations?↩︎

  3. DR: IIRC this comes from a real-world case where we are considering a maximum survey size of 1500 per group.↩︎

  4. DR: what do you mean ‘outside the function’?↩︎

  5. DR: how did we decide on these?↩︎

  6. DR: Isn’t this simply a known parameter?↩︎

  7. DR: Is this getting at the desired goal we discussed earlier of ‘power to have a precise interval estimate?’ Also, does the statement about ‘actual effect size is irrelevant for this’ hold modeling frameworks, or only for simple and linear ones?↩︎