Introduction
There are many places where you can find data to use for practice. For
example, the Agency for Healthcare Research and Quality (AHRQ) has the
Medical Expenditure Panel Survey (MEPS) data, and R has its own set of
built-in datasets that come with the software. You can type data() to
view all the datasets in R.
Figure 1. Available data in R.
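As a quick illustration, the sketch below captures the dataset listing as an object and then loads one built-in dataset. The choice of mtcars is only an example (it ships with base R in the datasets package), and the object name builtin is made up for this illustration.

```r
### Capture the listing of available datasets instead of opening the viewer
builtin <- data()

### The listing is stored as a character matrix; show the first few entries
head(builtin$results[, c("Package", "Item", "Title")])

### Load one built-in dataset by name and inspect it
data("mtcars")
head(mtcars)
dim(mtcars)  # 32 rows, 11 columns
```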
However, if none of these datasets meets your needs, you can generate
data using R.
In this article, I’ll review a simple method to generate data for
your needs using R. You will need to load the simstudy package. You can
read about the simstudy package on the developers’ GitHub site.
### Load `simstudy` package
# install.packages("simstudy") # Install once if `simstudy` hasn't been installed
library("simstudy")
### Load other packages for this tutorial:
library("tidyverse")
Generating data
The simstudy package contains the defData() function, which we will use
to define the variables, and the genData() function, which generates
the data from those definitions.
To replicate the results of the generated data, you can set a seed
using the set.seed() function.
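To see what setting a seed does, here is a minimal base R sketch. It uses rnorm() rather than simstudy, and the variable names are made up for the illustration.

```r
### Same seed, same draws
set.seed(12345)
first_draw <- rnorm(3)

set.seed(12345)
second_draw <- rnorm(3)

identical(first_draw, second_draw)  # TRUE

### Without resetting the seed, the next draw differs
third_draw <- rnorm(3)
identical(first_draw, third_draw)   # FALSE
```

This is why each example in this article starts by calling set.seed() with the same value.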
Example 1 - Generate data with two groups
Let’s generate a dataframe with two groups using the defData()
function. We’ll have a total sample size of 100, and we will assign
subjects to the two groups at random using a 50% probability of
assignment (formula = 0.5). We will use dist = "binary" to generate two
outcomes (0 and 1). Lastly, we will call this variable group.
### Set seed
set.seed(12345)
### Set up the data (Group = 0/1)
df1 <- defData(varname = "group",
dist = "binary",
formula = 0.5)
### Generate the dataframe with N = 100 subjects
df1 <- genData(100, df1)
### Inspect the number in each group
table(df1$group)
##
## 0 1
## 48 52
We have 48 subjects in group == 0 and 52 subjects in group == 1.
(Note: Since we set the seed, we can replicate this random result.)
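For intuition, a binary variable with a 0.5 probability behaves like 100 independent coin flips. The sketch below is a rough base R analogue using rbinom(). It is not meant to describe how simstudy works internally, and the counts it produces are not guaranteed to match the table above.

```r
### 100 draws of a 0/1 variable with a 50% chance of being 1
set.seed(12345)
group <- rbinom(n = 100, size = 1, prob = 0.5)

### Counts and proportions in each group
table(group)
prop.table(table(group))
```

The split is rarely exactly 50/50 in a sample of 100, which is why the groups above have 48 and 52 subjects.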
Example 2 - Generate data with two groups and a continuous outcome variable
Next, we can generate a dataframe with two groups and a continuous
outcome variable. We can set some parameters for what kind of continuous
outcome variable we would like. Let’s suppose we want a continuous
outcome that is normally distributed with a mean of 10 and a variance
of 2. We can set dist = "normal" and formula = 10, and add
variance = 2.
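Note that the definition asks for a variance, whereas summaries usually report a standard deviation. The base R sketch below shows the conversion. It uses rnorm(), which takes an SD rather than a variance, and the variable names are made up for the illustration.

```r
### A variance of 2 corresponds to an SD of sqrt(2)
target_variance <- 2
target_sd <- sqrt(target_variance)
round(target_sd, 2)  # 1.41

### A rough base R analogue of the outcome variable
set.seed(12345)
outcome <- rnorm(n = 100, mean = 10, sd = target_sd)
mean(outcome)
sd(outcome)
```

When checking simulated data, the sample SD should therefore land near 1.41, not near 2.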
### Set seed
set.seed(12345)
### Set up the data (Group = 0/1)
df2 <- defData(varname = "group",
dist = "binary",
formula = 0.5)
df2 <- defData(df2,
varname = "outcome",
dist = "normal",
formula = 10,
variance = 2)
### Generate the dataframe with N = 100 subjects
df2 <- genData(100, df2)
### Compare the means between the groups
df2 %>%
group_by(group) %>%
summarise(n_distinct(id),
endpoint = mean(outcome),
sd(outcome))
## # A tibble: 2 × 4
## group `n_distinct(id)` endpoint `sd(outcome)`
## <int> <int> <dbl> <dbl>
## 1 0 48 10.2 1.47
## 2 1 52 10.2 1.75
The mean for group 0 is 10.2 with a standard deviation (SD) of 1.47. The mean for group 1 is 10.2 with a SD of 1.75.
The means are the same because we didn’t differentiate the outcome
variable based on the groups.
Example 3 - Generate two dataframes and append using the bind_rows() function
Suppose you want to have different outcomes for the groups. You can
generate two separate dataframes and append them using the bind_rows()
function from the dplyr package, which is loaded as part of the
tidyverse.
Step 1 - Set up the data rules
Let’s suppose we want the outcome variable to be normally distributed
with mean = 10 and variance = 2 for group 0, and mean = 16 and
variance = 3 for group 1. We will set up two sets of data rules, df0
and df1.
Step 2 - Convert to dataframes
We’ll generate a dataframe from each set of rules, with 100 subjects
each, and call them df0.data and df1.data.
Step 3 - Generate a group variable
Since these are two separate dataframes, we will need to generate a
group variable. We will assign group = 0 to the first dataframe and
group = 1 to the second.
Step 4 - Create unique identifiers
Once these are generated, we will need to ensure the identifiers are
unique. Because we are generating two separate dataframes, the id
variable will run from 1 to 100 in both of them. To ensure that the
identifier is unique, we need to generate a new identifier patientid
and, for one of the dataframes, add 100 to its id variable so that the
sequence begins with 101.
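The reason for the offset can be seen in a small base R sketch. The toy dataframes toy0 and toy1 are made up for this illustration and stand in for the two generated dataframes.

```r
### Two toy dataframes whose `id` columns both run from 1 to 100
toy0 <- data.frame(id = 1:100)
toy1 <- data.frame(id = 1:100)

### Stacking the raw ids produces duplicates
anyDuplicated(c(toy0$id, toy1$id)) > 0  # TRUE

### Offsetting the second set by the size of the first removes the overlap
toy0$patientid <- toy0$id
toy1$patientid <- toy1$id + nrow(toy0)

anyDuplicated(c(toy0$patientid, toy1$patientid))  # 0, meaning no duplicates
range(toy1$patientid)                             # 101 200
```

Using nrow() for the offset instead of a hard-coded 100 is a design choice for this sketch; it keeps the identifiers unique if the sample size changes.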
Step 5 - Append data
Next, we append the data using the bind_rows() function.
Step 6 - Rearrange data
Since the new unique identifier patientid is the last column, we want
to re-arrange the dataframe so that it is in the first column.
Additionally, we don’t need the old identifier id, so we can drop it.
We’ll do both using the select() function.
### Set seed
set.seed(12345)
###### STEP 1 - Set up the data rules
### Generate data for group 0
df0 <- defData(varname = "outcome",
dist = "normal",
formula = 10,
variance = 2)
### Generate data for group 1
df1 <- defData(varname = "outcome",
dist = "normal",
formula = 16,
variance = 3)
###### STEP 2 - Convert to dataframes
df0.data <- genData(100, df0)
df1.data <- genData(100, df1)
###### STEP 3 - Generate a group variable
### Add a grouping variable
df0.data$group <- 0 # add group variable
df1.data$group <- 1 # add group variable
###### STEP 4 - Create unique identifiers
### Generate unique patient identifier `patientid`
df0.data$patientid <- df0.data$id # ids 1 to 100
df1.data$patientid <- df1.data$id + 100 # offset by 100 so ids run 101 to 200
###### STEP 5 - Append the tables
data.bind <- bind_rows(df0.data, df1.data)
###### STEP 6 - Rearrange data
### Re-arrange the data (patientid in the first column)
final.data <- data.bind %>% select(patientid, group, outcome)
### Compare the means between the groups
final.data %>%
group_by(group) %>%
summarise(n_distinct(patientid),
endpoint = mean(outcome),
sd(outcome))
## # A tibble: 2 × 4
## group `n_distinct(patientid)` endpoint `sd(outcome)`
## <dbl> <int> <dbl> <dbl>
## 1 0 100 10.3 1.58
## 2 1 100 16.1 1.75
Now, the mean for group 0 is 10.3 with a SD of 1.58. The mean for group 1 is 16.1 with a SD of 1.75.
Conclusions
Generating data with R is possible using the simstudy package. This can
be helpful when you are looking for a specific kind of data for your
exercise or simulation. In future articles, I will expand on other
features of the simstudy package to generate data for other uses.
Acknowledgements
The developers of the simstudy package include Keith Goldfeld and Jacob
Wujciak-Jen. The simstudy package GitHub site is located here.
Disclaimers
This is a work in progress. Thus, the content will be subject to changes and updates.
This is for educational purposes only.