Generate data using `simstudy` package in `R` - Part 1

Mark Bounthavong

30 April 2025

Introduction

There are many places where you can find data to use for practice. For example, the Agency for Healthcare Research and Quality (AHRQ) provides the Medical Expenditure Panel Survey (MEPS) data, and R has its own set of built-in datasets that come with the software. You can type data() to view the datasets available in R.

Figure 1. Available data in R.
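As a small sketch of how to browse the built-in datasets from the console (rather than through the viewer that data() opens), the dataset names can be pulled from the object that data() returns. The variable name builtin is only for illustration.

```r
# List the names of the datasets shipped with the `datasets` package
builtin <- data(package = "datasets")$results[, "Item"]

# Show the first few dataset names
head(builtin)

# Preview one of the built-in datasets
head(mtcars)
```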

However, if none of these data meet your needs, you can generate data using R.

In this article, I’ll review a simple method to generate data for your needs using R. You will need to load the simstudy package. You can read about the simstudy package on the developers’ GitHub site.

### Load `simstudy` package
# install.packages("simstudy") # Install once if `simstudy` hasn't been installed
library("simstudy")

### Load other packages for this tutorial:
library("tidyverse")

Generating data

The simstudy package contains the defData() function, which we will use to define the variables, and the genData() function, which generates the data from those definitions.

To replicate the results of the generated data, you can set a seed using the set.seed() function.
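Here is a minimal base R sketch of what setting a seed does. The variable names first_draw and second_draw are only for illustration.

```r
# Draw three random numbers after setting the seed
set.seed(12345)
first_draw <- rnorm(3)

# Reset the same seed and draw again
set.seed(12345)
second_draw <- rnorm(3)

# TRUE: both draws are identical because the seed was the same
identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would almost certainly differ.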

Example 1 - Generate data with two groups

Let’s generate a dataframe with two groups using the defData() function. We’ll have a total sample size of 100, and each subject will have a 50% probability of being assigned to either group (formula = 0.5). We will use dist = "binary" to generate two possible values (0 and 1). Lastly, we will call this variable group.

### Set seed
set.seed(12345)

### Set up the data (Group = 0/1)
df1 <- defData(varname = "group",
               dist = "binary",
               formula = 0.5)

### Generate the dataframe with N = 100 subjects
df1 <- genData(100, df1)

### Inspect the number in each group
table(df1$group)

## 
##  0  1 
## 48 52

We have 48 subjects in group == 0 and 52 subjects in group == 1. (Note: Since we set the seed, we can replicate this random result.)
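To see what a binary variable with a 50% probability amounts to, here is a rough base R sketch that draws 100 values with rbinom(). This only illustrates the idea; it is not meant to show how simstudy works internally, and the counts need not match the table above even with the same seed. The name group_sketch is hypothetical.

```r
set.seed(12345)

# 100 independent draws, each equal to 1 with probability 0.5 and 0 otherwise
group_sketch <- rbinom(n = 100, size = 1, prob = 0.5)

# Count the subjects in each group; the split is usually close to,
# but not exactly, 50/50
table(group_sketch)

# Proportion of subjects in each group
prop.table(table(group_sketch))
```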

Example 2 - Generate data with two groups and a continuous outcome variable

Next, we can generate a dataframe with two groups and a continuous outcome variable. We can set some parameters for what kind of continuous outcome variable we would like. Let’s suppose we want a continuous outcome data type that is normally distributed with a mean of 10 and a variance of 2. We can set the dist = "normal" and the formula = 10 and add the variance = 2.

### Set seed
set.seed(12345)

### Set up the data (Group = 0/1)
df2 <- defData(varname = "group",
               dist = "binary",
               formula = 0.5)
df2 <- defData(df2,
               varname = "outcome",
               dist = "normal",
               formula = 10,
               variance = 2)

### Generate the dataframe with N = 100 subjects
df2 <- genData(100, df2)

### Compare the means between the groups
df2 %>%
  group_by(group) %>%
  summarise(n_distinct(id),
            endpoint = mean(outcome),
            sd(outcome))

## # A tibble: 2 × 4
##   group `n_distinct(id)` endpoint `sd(outcome)`
##   <int>            <int>    <dbl>         <dbl>
## 1     0               48     10.2          1.47
## 2     1               52     10.2          1.75

The mean for group 0 is 10.2 with a standard deviation (SD) of 1.47. The mean for group 1 is 10.2 with an SD of 1.75.

The means are the same because we didn’t differentiate the outcome variable based on the groups.
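One way to see this is to make the outcome depend on group. As a hedged sketch, simstudy accepts a formula written as a character string that refers to a previously defined variable. The coefficient of 6 and the names def_sketch and dt_sketch are my own illustrative choices.

```r
library("simstudy")

set.seed(12345)

# Define the group variable as before
def_sketch <- defData(varname = "group",
                      dist = "binary",
                      formula = 0.5)

# Let the outcome mean depend on group: 10 when group = 0, 16 when group = 1
def_sketch <- defData(def_sketch,
                      varname = "outcome",
                      dist = "normal",
                      formula = "10 + 6 * group",
                      variance = 2)

# Generate 100 subjects from the definitions
dt_sketch <- genData(100, def_sketch)

# Mean outcome by group; these should now differ by roughly 6
tapply(dt_sketch$outcome, dt_sketch$group, mean)
```

Note that this sketch keeps the same variance in both groups. Example 3 takes a different route and builds each group separately.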

Example 3 - Generate two dataframes and append using `bind_rows()` function

If you want the groups to have different outcomes, you can generate two separate dataframes and append them using the bind_rows() function from the dplyr package, which is loaded as part of the tidyverse.

Step 1 - Set up the data rules

Let’s suppose we want the outcome variable to be normally distributed with mean = 10 and variance = 2 for group 0, and mean = 16 and variance = 3 for group 1. We will set up two sets of data rules, df0 and df1.

Step 2 - Convert to dataframes

We’ll generate a dataframe from each set of rules and call them df0.data and df1.data, with 100 subjects each.

Step 3 - Generate a group variable

Since these are two separate dataframes, we will need to generate a group variable. We will assign group = 0 and group = 1.

Step 4 - Create unique identifiers

Once these are generated, we will need to ensure the identifiers are unique. Because we generated the two dataframes separately, the id variable runs from 1 to 100 in both of them. To ensure that each identifier is unique, we create a new identifier, patientid, and add 100 to the id values of the second dataframe so that its sequence begins at 101.

Step 5 - Append data

Next, we append the data using the bind_rows() function.

Step 6 - Rearrange data

Since the new unique identifier patientid is the last column, we want to re-arrange the dataframe so that it is in the first column. Additionally, we don’t need the old identifier id, so we can drop it. We’ll do this using the select() function.

### Set seed
set.seed(12345)

###### STEP 1 - Set up the data rules
### Generate data for group 0
df0 <- defData(varname = "outcome",
               dist = "normal",
               formula = 10,
               variance = 2)

### Generate data for group 1
df1 <- defData(varname = "outcome",
               dist = "normal",
               formula = 16,
               variance = 3)

###### STEP 2 - Convert to dataframes
df0.data <- genData(100, df0)
df1.data <- genData(100, df1)

###### STEP 3 - Generate a group variable
### Add a grouping variable
df0.data$group <- 0 # add group variable
df1.data$group <- 1 # add group variable

###### STEP 4 - Create unique identifiers
### Generate unique patient identifier `patientid`
df0.data$patientid <- df0.data$id
df1.data$patientid <- df1.data$id+100 # add patientid variable

###### STEP 5 - Append the tables
data.bind <- bind_rows(df0.data, df1.data)

###### STEP 6 - Rearrange data
### Re-arrange the data (patientid in the first column)
final.data <- data.bind %>% select(patientid, group, outcome)

### Compare the means between the groups
final.data %>%
  group_by(group) %>%
  summarise(n_distinct(patientid),
            endpoint = mean(outcome),
            sd(outcome))

## # A tibble: 2 × 4
##   group `n_distinct(patientid)` endpoint `sd(outcome)`
##   <dbl>                   <int>    <dbl>         <dbl>
## 1     0                     100     10.3          1.58
## 2     1                     100     16.1          1.75

Now, the mean for group 0 is 10.3 with a SD of 1.58. The mean for group 1 is 16.1 with a SD of 1.75.
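The SDs are in line with the variances set in the code (2 for group 0 and 3 for group 1), since the SD is the square root of the variance. A quick check in base R, with illustrative variable names:

```r
# Variances specified for group 0 and group 1 in the data rules
variances <- c(group0 = 2, group1 = 3)

# Expected SDs are the square roots: about 1.41 and 1.73
expected_sd <- sqrt(variances)
round(expected_sd, 2)
```

The observed SDs (1.58 and 1.75) differ somewhat from these values because each group is a random sample of only 100 subjects.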

Conclusions

Generating data with R is possible using the simstudy package. This can be helpful when you are looking for a specific kind of data for your exercise or simulation. In future articles, I will expand on other features of the simstudy package to generate data for other uses.

Acknowledgements

The developers of the simstudy package include Keith Goldfeld and Jacob Wujciak-Jen. The simstudy package GitHub site is located here.

Disclaimers

This is a work in progress. Thus, the content will be subject to changes and updates.

This is for educational purposes only.
