knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(cache=TRUE)
require(tidyverse)
require(ggplot2)

This lab is about simulating data in advance of your experiment. Doing this is incredibly helpful when planning the types of analyses that you will perform on your data.

“Simulating data” means that you are probabalistically generating fake data. You deliberately set the probability distributions in advance, so that the resulting data has properties that you believe might hold in the real world. Then, you can run analyses on the simulated data, to assess whether your experimental design and statistical methods will be sensitive enough to capture the important properties of that data.

Preliminaries. In your Lab R Markdown Template, you should make sure that tidyverse and ggplot2 are loaded up in the setup section.

1 An example - height at different ages

In America, the mean recumbent length (i.e., the height) of a male baby, under 2 months, is 57.5cm, and the standard deviation of that distribution is 3.55cm. The mean height of a female baby is 55.7cm, with standard deviation 2.04cm. The mean height of an 18-year-old male is 175.6cm, with standard deviation 10.8cm. The mean height of an 18-year-old female is 161.7cm, with standard deviation 7.1cm.

Let’s assume that height/length are normally distributed. We can then simulate what we expect the distributions of these properties should look like in the general population. To do this, we will use the function rnorm(), which produces random numbers drawn from a normal distribution.

In the code chunk generate_baby_data I want you to create a tibble (a dplyr version of a dataframe), that contains simulated values of the heights of male and female babies/adults. The code chunk below shows a tibble that has only 4 rows. One row shows the height of a male baby, one shows the height of a female baby, one shows the height of a male adult, one shows the height of a female adult. Use the incomplete code below in your code chunk, replacing x and y so that the simulated data come from the correct distributions (e.g., the Male baby height data should have Mean of 57.5 and SD of 3.55).

demographics = tibble(
  Gender = c("Male",
             "Female",
             "Male",
             "Female"),
  Age = c("Baby",
          "Baby",
          "Adult",
          "Adult"),
  data = c(rnorm(1,x,y),
           rnorm(1,x,y),
           rnorm(1,x,y),
           rnorm(1,x,y))
)
demographics

Once you have down that, you can expand the tibble. In particular, you can use the function rep(), which repeats objects, in order to create lots of rows for each combination of the different factors. For instance, if you use Gender = rep(c("Male","Female","Male","Female"), each = 5), then you will produce 20 rows in total: 5 Male rows, 5 Female rows, 5 more Male rows, 5 more Female rows.

Remember, the tibble won’t be created unless all the columns (Gender, Measure and data) have the same length, so you need to make sure that these are matched.

First, try to make a table that contains five individuals per cell (i.e., per combination of factors, meaning five male babies, five female babies, etc). Once you have got the hang of this, try creating a tibble with 10000 entries for each combination of factors (e.g., 10,000 male babies, 10,000 female babies, etc.).

Now, you can plot the data using ggplot2. Modify the code below and add it to your chunk, in order to display a density plot. In this case, the density plot shows the probability of finding an individual of that particular height.

ggplot(your_data_file, aes(x = the_measure_you_are_interested_in, color = the_measure_you_are_splitting_by))+
  geom_density()+
  facet_grid(.~the_measure_you_are_faceting_by)+
  xlab("A helpful x axis label")

At this point, things should look pretty good, except that maybe the scale on your plots is odd (ggplot automatically uses the same scale for all facets). You can change that, so that the scales for the facets vary freely, by modifying facet_grid() as so: facet_grid(.~the_measure_you_are_faceting_by, scales = "free").

1.1 What have we shown?

From your distribution plot, you should be able to see that Male babies and adults tend to be a little bit taller than Female babies and adults, but that the difference is more pronounced for adults than it is for children.

1.2 What would we like to know?

Let’s imagine that we are demographers, who want to go and confirm that males are taller than females in a field test. Before we do that, we should ask “How many people do we need to measure, in order to show that males are taller than females? And does this differ depending on whether we are measuring adults or children?”

As a preliminary, let us test if males are statistically significantly taller than females in the dataset that you just simulated, with 10,000 observations per group. In the code chunk test_withthousands_of_observations, create two linear regressions, one of which regresses height against gender for babies, and one of which regresses height against gender for adults. Use the code below for inspiration, and modify it appropriately.

summary(lm(
  measured_variable ~ predictor_variable, 
  data = subset(demographics,
                AGE_VARIABLE == "AGE_NAME")
))

summary(lm(
  measured_variable ~ predictor_variable, 
  data = subset(demographics,
                AGE_VARIABLE == "AGE_NAME")
))

If you look at the output of the model, you should see that, with this many observations, it is certainly true that – in both Adults and Babies – males are taller than females. To see this, make sure to look at the coefficients for Gender, (as shown, e.g., below), and examine the relevant t and p values.

1.2.0.1 Adults

## 
## Call:
## lm(formula = data ~ Gender, data = subset(demographics, Age == 
##     "Adult"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.978  -5.880  -0.073   5.863  36.856 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 161.73080    0.09158  1766.1   <2e-16 ***
## GenderMale   13.92503    0.12951   107.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.158 on 19998 degrees of freedom
## Multiple R-squared:  0.3663, Adjusted R-squared:  0.3663 
## F-statistic: 1.156e+04 on 1 and 19998 DF,  p-value: < 2.2e-16

1.2.0.2 Babies

## 
## Call:
## lm(formula = data ~ Gender, data = subset(demographics, Age != 
##     "Adult"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.8226  -1.7503  -0.0102   1.7615  13.3028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 55.69714    0.02891 1926.87   <2e-16 ***
## GenderMale   1.77610    0.04088   43.45   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.891 on 19998 degrees of freedom
## Multiple R-squared:  0.08626,    Adjusted R-squared:  0.08621 
## F-statistic:  1888 on 1 and 19998 DF,  p-value: < 2.2e-16

However, most of the time we do not have the money to go and measure 10,000 people. So, what about if we only measured a few people per group? In the next code chunk, create_tibble_with_fewer_observations, copy your code from before (or edit my code below) and change it to create a new Tibble that only has 7 observations for each combination of factors.

small_demographics = tibble(
  Gender = rep(c("Male","Female","Male","Female"), each = x),
  Measure = rep(c("Baby","Baby","Adult","Adult"), each = x),
  data = c(rnorm(7,x,y),
           rnorm(7,x,y),
           rnorm(7,x,y),
           rnorm(7,x,y))
)

Then, in the code chunk after that, test_with_fewer_observations, copy over your two linear regressions from above, but edited to analyze this new database. Is the result still statistically significant? Try Re-Running the code that produced small_demographics, and then run the test again. Try this a few times, to see how often the results are significant. You should find that, sometimes, the result is significant, and sometimes it is not. Probably, it will more often be significant for adults than for babies.

1.3 How many people do we need to measure?

It should be clear at this point that if we measure 10,000 people, then we will see a statistically significant difference between males and females, but if we only measure 7, then we are less likely to see this significant difference. But how many people should we plan to measure, in order to be reasonably confident that – if a statistical difference exists – then we will find it? One way to assess this is to simulate lots of datasets, with different numbers of participants, and test how often those datasets produce a statistically significant result.

We can do this using a For loop. With a For loop, we repeat a series of tasks again and again, until some condition is reached. For instance, a For loop might stop once it has run 100 times.

We will start quite simply, and focus on babies initially. Imagine if we only measure 7 boys and 7 girls. How often would we expect to get a significant result (p <.05), if we went out and randomly measured 14 children 100 times?

To do this, we are going to need three things. First, we will need the code to create a tibble, which you already created above (using the template that I have repeated below). Copy this to the new chunk create_first_for_loop.

small_demographics = tibble(
  Gender = rep(c("Male","Female","Male","Female"), each = x),
  Measure = rep(c("Baby","Baby","Adult","Adult"), each = x),
  data = c(rnorm(7,x,y),
           rnorm(7,x,y),
           rnorm(7,x,y),
           rnorm(7,x,y))
)

Then, we will need to copy the code to run the relevant regression, which you also already created above (again, using the template that I have repeated below).

regression_result = summary(lm(
                      measured_variable ~ predictor_variable, 
                      data = subset(demographics_small,
                        AGE_VARIABLE == "Baby")
                      ))

Finally, you will need code to extract the p value from your regression, which you can do using the example code below. In this code, $coefficient extracts the table of coefficients from the regression, and [2,4] extracts the position of the critical p value – the 2nd row down, and the 4th column across. This is used in the ifelse() function; if the p value is smaller than 0.05, it prints "Significant", else it prints "Not Significant".

  ifelse(regression_result$coefficient[2,4] < 0.05,
         print("Significant"),
         print("Not significant")
  )

These things should be put together in the following order.

1. Create tibble.
2. Run regression.
3. Check if p value is less than 0.05

Then, you insert them in a For loop. Following the code below, edit your code chunk so that it contains a For loop that carries out our tasks in the right order, repeatedly. In this For loop, the code within the curly braces is executed 10 times (i.e., from 1 to 10).

for (i in 1:10){
  # Create table
  small_demographics = tibble(
    Gender = rep(c("Male","Female","Male","Female"), each = x),
    Measure = rep(c("Baby","Baby","Adult","Adult"), each = x),
    data = c(rnorm(7,x,y),
             rnorm(7,x,y),
             rnorm(7,x,y),
             rnorm(7,x,y))
  )
  
  # Run regression
  regression_result = summary(lm(
    measured_variable ~ predictor_variable, 
    data = subset(demographics_small,
                    AGE_VARIABLE == "Baby")
      ))
  
  # Test if regression if significant
  ifelse(regression_result$coefficient[2,4] <0.05,
         print("Significant"),
         print("Not significant"))  
}

You should get output that looks like the text below:

## [1] "Not significant"
## [1] "Not significant"
## [1] "Not significant"
## [1] "Not significant"
## [1] "Not significant"
## [1] "Not significant"
## [1] "Not significant"
## [1] "Significant"
## [1] "Significant"
## [1] "Not significant"

1.4 Improving your For loop

The For loop that you have created thus far can illustrate when the regression will and will not be significant, but we want to be able to quantify that estimate. To do this, we want to see the proportion of regressions that have a significant result.

Calculating this is very easy. Before, we used an ifelse() call to print whether or not a result was significant. Now, we will use ifelse() to count the number of significant simulated datasets, and then we will calculate the proportion of significant results over the total number of datasets.

In the next code chunk, create_for_loop_with_proportion, copy your For loop over from the prior chunk, and edit the ifelse() as below:

  ifelse(regression_result$coefficient[2,4] <0.05,
         1,
         0)

This code returns 1 if the regression is significant, and 0 if the regression is not significant.

However, we still need to add code to count these numbers. In the code below, we begin by creating a variable called significant_regressions which has the value 0. Then, on each round of the For loop, we add 1 to the variable every time the coefficient is significant (in particular, we say that on each round of the For loop, the value of significant_regressions becomes the previous value of significant_regressions plus either 1 or 0.). Then, the proportion of significant regressions is calculated as the number of significant regressions divided by the total number of regressions

significant_regressions = 0

for (i in 1:10){
    # Create table
    # run regression
    # Count significant regressions
    significant_regressions = significant_regressions +   
      ifelse(regression_result$coefficient[2,4] <0.05,
         1,
         0)
 
}

proportion_significant = significant_regressions/10
print(proportion_significant)

Add this to your code chunk and run it. You should get an output like the below; for these 10 simulations, somewhere between 0% and 40% of the regressions will be significant when you are measuring only 7 participants per gender. The problem here is that you only have 10 simulations, so your estimates will not be very stable. We need to use more simulations to get a more stable estimate.

## [1] 0.2

Now let’s try with more than 10 simulations. Edit the code in your For loop, so that you now use 1000 simulations, rather than 10. The easiest way to do that is to create a new variable (e.g., called N_Simulations), assign it a value (e.g., 1000), and then use that variable in all the places where you had previously written just the number 10. You can see some pseudo-code below; this will give you a more precise estimate of the proportion significant.

N_Simulations = 1000
for (i in 1:N_Simulations){
  #Do stuff in the For loop.
}
proportion_significant = significant_regressions/N_Simulations
print(proportion_significant)

## [1] 0.198

1.5 Systematically simulating data

Now that we’ve got this going, we could run a bunch of different simulations for both kids and adults, varying the number of people that are measured. But rather than create lots and lots of For loops, it would be nice to do this in a more systematic way, so that we can keep all the results together and tidy.

We will do this now using dplyr, just like last week.

First, we will create a dataframe that states all the different combinations of factors, as well as how many participants are present. In the code chunk efficient_simulation1, start building code as below. This will create a dataframe that crosses Age and Gender, and says that there are N = 7 participants per combination.

n_participants = 7
efficient_demographics = expand.grid(
            N = n_participants,
            Age = c("Baby","Adult"),
            Gender = c("Male","Female"))
efficient_demographics

Now, we will use the mutate function from dplyr, plus some other functions, to add the Mean and SD of the relevant normal distributions to this dataframe.Here, we use ifelse() to subset efficient_demographics into its component condition (Age and Gender) and then specify what the mean and SD should be for each of those.

efficient_demographics = 
  efficient_demographics %>%
  mutate(mean = ifelse(Age == "Baby",
                ifelse(Gender == "Male",57.5,55.7),
                ifelse(Gender == "Male", 175.6,161.7)),
         sd =  ifelse(Age == "Baby",
                      ifelse(Gender == "Male", 3.55, 2.04),
                      ifelse(Gender == "Male", 10.8,7.1))
  )
efficient_demographics

Finally, use the code below to generate the actual data for each condition, and then run the linear models on each of the two age conditions.

summarized_demographics = efficient_demographics %>%
  group_by(Age,N,Gender) %>% # Get the variables to group by -- Age, N (number of subjects), and Gender
  do(data = rnorm(.$N,.$mean,.$sd) ) %>% # Generate the random data (N samples, with mean of mean and sd of SD). 
                                          #The .$ here means that we are using columns from the data piped from the previous line.
  unnest(data) %>% # This turns the data generated by rnorm() into a numeric format.
  select(Age,Gender,N,data) %>% # select the relevant columns of the tibble
  group_by(Age,N) %>%  # Select the columns to group by
  do(p_value = summary(lm(data ~ Gender, data = .))$coefficient[2,4]) %>%  # Run the linear model and return the p value
  unnest(p_value) %>%   # (turn the p value a numeric format)
  mutate(significant = ifelse(p_value < 0.05,1,0))   # check if the p value is significant or not.

summarized_demographics

Make sure that this is working, and it produces an output as below. You can see that, because we are using dplyr’s group_by() function, we can simultaneously run the calculations for both adults and children.

## # A tibble: 2 x 4
##      Age     N     p_value significant
##   <fctr> <dbl>       <dbl>       <dbl>
## 1   Baby     7 0.740070640           0
## 2  Adult     7 0.006693651           1

1.6 Simulate lots of data

Now, we will slightly change that code to create up to 1000s of simulations.

In the next code chunk (efficient_simulation2), augment your previous code for the dataframe efficient_demographics from before, so that it has a new variable called Simulations, which will run from 1 to 10.

This should create a big new dataframe, which has 7 participants for each Age and Gender, for each of 10 simulations.

n_participants = 7
efficient_demographics = expand.grid(
            N = n_participants,
            Age = c("Baby","Adult"),
            Gender = c("Male","Female"),
            Simulations = # some code to run from 1 to 10
              )

efficient_demographics

Then, augment this with the same code you used before to subset the dataframe and specify the means and SDs.

efficient_demographics = 
  efficient_demographics %>%
  mutate(mean = ifelse(Age == "Baby",
                ifelse(Gender == "Male",57.5,55.7),
                ifelse(Gender == "Male", 175.6,161.7)),
         sd =  ifelse(Age == "Baby",
                      ifelse(Gender == "Male", 3.55, 2.04),
                      ifelse(Gender == "Male", 10.8,7.1))
  )
efficient_demographics

Finally, augment your code with more code for generating data and calculating p values

summarized_demographics = efficient_demographics %>%
  group_by(Age,N,Gender) %>% # Make sure that you reference the simulations column here
  do(data = rnorm(.$N,.$mean,.$sd) ) %>%
  unnest(data) %>%
  select(Age,Gender,N,data) %>% # Make sure to select the simulations column
  group_by(Age,N) %>% # Make sure to select the simulations column
  do(p_value = summary(lm(data ~ Gender, data = .))$coefficient[2,4]) %>%
  unnest(p_value) %>%
  mutate(significant = ifelse(p_value < 0.05,1,0)) 

summarized_demographics

This should now produce a p value for each of your 10 simulations.

## # A tibble: 20 x 5
##       Age     N Simulations      p_value significant
##    <fctr> <dbl>       <int>        <dbl>       <dbl>
##  1   Baby     7           1 0.3672215112           0
##  2   Baby     7           2 0.8764295124           0
##  3   Baby     7           3 0.1489226766           0
##  4   Baby     7           4 0.0346474018           1
##  5   Baby     7           5 0.5208075909           0
##  6   Baby     7           6 0.8810366110           0
##  7   Baby     7           7 0.5541274007           0
##  8   Baby     7           8 0.7628346941           0
##  9   Baby     7           9 0.0455873014           1
## 10   Baby     7          10 0.1602105669           0
## 11  Adult     7           1 0.2691767836           0
## 12  Adult     7           2 0.0104616187           1
## 13  Adult     7           3 0.0347455804           1
## 14  Adult     7           4 0.0232465195           1
## 15  Adult     7           5 0.0003532002           1
## 16  Adult     7           6 0.0011865257           1
## 17  Adult     7           7 0.0009834268           1
## 18  Adult     7           8 0.0219900993           1
## 19  Adult     7           9 0.0135256939           1
## 20  Adult     7          10 0.0304857893           1

The next thing on your place, is to summarize this new dataframe, to see what proportion of the values are significant.

Start by adding a pipe %>% to the end of the previous line of code, and then carry out the following instructions. Use the function group_by() to group the data appropriately, and then pipe %>% that into a summarise() function that calculates three things. First, the sum of the significant results, Second, the sum number of total regressions, and Third, the proportion of significant regressions over total regressions.

Your final output should look like this:

## # A tibble: 2 x 4
## # Groups:   Age [?]
##      Age     N n_sims proportion_significant
##   <fctr> <dbl>  <int>                  <dbl>
## 1   Baby     7     10                    0.5
## 2  Adult     7     10                    0.6

Once you have produced a table that looks just like this, change the code so that you now run 1000 simulations. The result should look like this:

## # A tibble: 2 x 4
## # Groups:   Age [?]
##      Age     N n_sims proportion_significant
##   <fctr> <dbl>  <int>                  <dbl>
## 1   Baby     7   1000                  0.197
## 2  Adult     7   1000                  0.718

Check that the data from this efficient analysis matches the baby data from your previous analysis that used For loops.

1.7 Vary the number of participants

Your simulations have now shown that, if we randomly sample 7 babies per gender, we’ll find a positive result about 1 time in 5 for babies, and that this holds even though it is true that male babies tend to be larger than female babies.

Now, let’s find out how many babies we would need to test, such that we get a positive result about 4 times in 5, which is much more reasonable. Move to code chunk how_many1.

To solve our task, we will create a very big dataframe which, for each given number of participants, simulates lots of experimental samples. Based off of the code below, create such a dataframe, so that the smallest number of participants is 5 and the largest is 50, incrementing in steps of 1.

Note that, in this dataframe, we have gone back to having only a small number of simulations, because it will be faster when testing the code.

lowest_number_of_participants = ??
highest_number_of_participants = ??


  efficient_demographics = expand.grid(
            N = WHAT_SHOULD_GO_HERE?,
            Age = c("Baby","Adult"),
            Gender = c("Male","Female"),
            Simulations = 1:10)

Now, augment this with the code from the previous section, in order to generate a dataframe as below.

## # A tibble: 92 x 4
## # Groups:   Age [?]
##       Age     N n_sims proportion_significant
##    <fctr> <int>  <int>                  <dbl>
##  1   Baby     5     10                    0.0
##  2   Baby     6     10                    0.1
##  3   Baby     7     10                    0.4
##  4   Baby     8     10                    0.0
##  5   Baby     9     10                    0.5
##  6   Baby    10     10                    0.4
##  7   Baby    11     10                    0.1
##  8   Baby    12     10                    0.3
##  9   Baby    13     10                    0.2
## 10   Baby    14     10                    0.3
## # ... with 82 more rows

You may have noticed that the Markdown page takes a long time to Knit at this point. That’s because it has to do the calculations anew, each time you press Knit. To save (cache) the previously done calculations, add this code to your setup chunk: knitr::opts_chunk$set(cache=TRUE)

Once you have got the above section working, increase the number of simulations to 500.

## # A tibble: 92 x 4
## # Groups:   Age [?]
##       Age     N n_sims proportion_significant
##    <fctr> <int>  <int>                  <dbl>
##  1   Baby     5    500                  0.164
##  2   Baby     6    500                  0.162
##  3   Baby     7    500                  0.188
##  4   Baby     8    500                  0.218
##  5   Baby     9    500                  0.230
##  6   Baby    10    500                  0.282
##  7   Baby    11    500                  0.306
##  8   Baby    12    500                  0.312
##  9   Baby    13    500                  0.324
## 10   Baby    14    500                  0.350
## # ... with 82 more rows

Now, in the chunk graph_how_many1, use ggplot to graph the results. Remember that your data should be summarized_demographics, your x axis should be the number of samples, N, your y axis should be the proportion significant samples, and you can color the lines by Age.

2 Simulate data from a simple experiment

Now that you have learned how to do some complex simulations about height, it is time to transfer your skills, in order to simulate the type of experiment that you have been creating.

The basic logic behind this is exactly the same as before, except that now we are not simulating random normally distributed data, but randomly distributed binomial data (i.e., choices on each trial, 1 or 0). Moreover, in your experiments, each participant responds to multiple trials, so we have multiple measurements per participant.

2.1 What should the data look like?

We will simulate one of the comparisons from the Byers-Heinlen paper, which compared whether children preferred to be friends with a monolingual speaker versus a bilingual speaker, and whether this differed between children who are monolingual or bilingual.

In Experiment 2 of that paper, Byers-Heimlein and colleagues compared monolinguals from Montreal, and bilinguals from Montreal, on their task.

Here is the critical paragraph from the results section. Byers-Heimlein and colleagues found that…

There are a few different results reported here. For instance, Monolingual children pick the monolingual friend above chance, and Bilingual children pick their friend at chance. But the most important result is the comparison between monolinguals and bilinguals, which produces a significant result. That is the result that we will try to simulate.

2.2 How do we simulate this data?

To simulate this experiment, we have to do something slightly different than before.

In our earlier simulations, we assumed that each person’s height was drawn from an underlying normal distribution, and it was this statistic that was entered into our simulations. There’s an important assumption here, which is that we can perfectly measure people’s heights. So, if the height is drawn from an underlying distribution, we can simulate heights from that distribution, and then enter them into our equations.

However, we cannot do that in this case. Here, the underlying distribution is “children’s preference for monolingual speakers”. We assume that this parameter is latent in the mind of the child, but we can’t measure it directly. We can only estimate it based on the small number of trials that children complete in this experiment. As such, our estimate will be quite noisy (to make it less noisy, we need to increase the number of trials that children perform).

So, we have to simulate this extra step. First, we simulate what each child’s underlying preference for monolingual speakers is, and then we use that parameter to simulate how they might perform on the choice task.

2.2.1 The Experimental Parameters

In the new code chunk byers1, start building a description of the experiment, as follows. We’ll first do a single simulation of the data in the original experiment, and then we will do simulations for power.

n_trials = 8 # Each child took part in 8 trials
n_children = 20 # There were 20 monolingual and 20 bilingual children

m_mono = 0.65
sd_mono = 0.19
m_bi = 0.53
sd_bi = 0.28

byers = expand.grid(
  n_trials = n_trials,
  N = n_children,
  Language = c("Monolingual","Bilingual"),
  Simulations = 1
  )

byers = 
  byers %>%
  mutate(mean = ifelse(Language == "Monolingual",m_mono,m_bi),
         sd =  ifelse(Language == "Bilingual", sd_mono, sd_bi)
  )

print(byers)

##   n_trials  N    Language Simulations mean   sd
## 1        8 20 Monolingual           1 0.65 0.28
## 2        8 20   Bilingual           1 0.53 0.19

Running this code should create a simple table with means, sd, and number of subjects.

Now, we will expand on this code, similarly to how we did in the previous section. We will assign each participant a simulated probability of choosing the monolingual speaker.

Note. You can either add the code below as written, or remove the first line and pipe %>% the previous code into this version.

byers = byers %>%
  group_by(Language,n_trials,N,Simulations) %>% 
  do(subject_prob = rnorm(.$N,.$mean,.$sd)) %>%
  unnest(subject_prob)

Once you have run the code, take a look at the column subject_prob. This column should contain a probability between 0 and 1 for each participant. However, because we have generated it from a normal distribution, it may be the case that some values lie outside of this range. So, we will have to tidy it up. Using mutate() and replace(), as below, tidy it so that values below 0 get assigned to 0.01, and values above 1 get assigned to 0.99.

byers = byers %>%
  group_by(Language,n_trials,N,Simulations) %>% 
  do(subject_prob = rnorm(.$N,.$mean,.$sd)) %>%
  unnest(subject_prob) %>%
  mutate(subject_prob = replace(subject_prob ,subject_prob <0,0.01),
         subject_prob = replace(subject_prob ,subject_prob >1,0.99))

Now, each subject has been assigned a simulated probability of choosing the monolingual speaker. We will use that probability to generate each subject’s performance on the 8 test trials. On those test trials, the subject has to choose between two options, so we can treat each test trial as a binomial event – 1 = Choose Monolingual, 0 = Choose Bilingual, with probability defined by the subject mean.

To do this, we use the function rbinom(), which takes three arguments:

n (the number of observations, here it is 1 as we do this for each single participant).
size (the number of trials, here that number is 8).
prob (here, the simulated accuracy of the subject)

Note that the variable data, below, is generated by rbinom() divided by the number of trials, in order to create a proportion.

byers = byers %>%
group_by(Language,n_trials,N,Simulations,subject_prob) %>% 
  do(data = rbinom(n = 1, size = .$n_trials, prob = .$subject_prob)/.$n_trials) %>%
  unnest(data)
byers

## # A tibble: 40 x 6
##       Language n_trials     N Simulations subject_prob  data
##         <fctr>    <dbl> <dbl>       <dbl>        <dbl> <dbl>
##  1 Monolingual        8    20           1   0.09678578 0.250
##  2 Monolingual        8    20           1   0.31705082 0.125
##  3 Monolingual        8    20           1   0.33281988 0.000
##  4 Monolingual        8    20           1   0.39974881 0.375
##  5 Monolingual        8    20           1   0.42867141 0.625
##  6 Monolingual        8    20           1   0.43632658 0.625
##  7 Monolingual        8    20           1   0.46175192 0.625
##  8 Monolingual        8    20           1   0.48618641 0.375
##  9 Monolingual        8    20           1   0.50362407 0.500
## 10 Monolingual        8    20           1   0.53614736 0.625
## # ... with 30 more rows

Now, we have reached a position from where we can start running analyses. Each child has a response variable (the data column), and that response variable can be predicted by whether the child is Bilingual or Monolingual.

Using the code below, generate the mean and standard deviation for the two conditions in your simulated dataset, and see if you can graph the results using ggplot() (make sure to add error bars based on the standard error). You may well find that, in your simulated dataset, the difference between conditions is much, much smaller! Don’t worry, you probably are not wrong!.

byers_means = byers %>%
  group_by(Language) %>%
  summarize(Mean = mean(data), SE = sd(data)/sqrt(n()))

2.3 Run multiple simulations.

In the final code chunk, and building on all the code you have written so far for this experiment and for the previous section, write code that simulates 500 datasets (i.e., Simulations = 1:500), for different numbers of children per condition varying from 10 to 100, in blocks of 10 (use the function seq() to do this, check ?seq for help if you need it).

Then, based on some of the code in the section before (repeated below), use a regression to test if there is a difference between monolingual and bilingual children on each simulation. Finally, use ggplot() to plot the probability of getting a significant result by the number of children tested.

# REMEMBER TO CHANGE VARIABLE NAMES FOR YOUR NEW EXPERIMENT
summarized_demographics = efficient_demographics %>%
  group_by(Age,N,Simulations,Gender) %>% # Make sure that you reference the simulation column here
  do(data = rnorm(.$N,.$mean,.$sd) ) %>%
  unnest(data) %>%
  select(Age,Gender,Simulations,N,data) %>% # Make sure to select the simulation column
  group_by(Age,N,Simulations) %>% # Make sure to select the simulation column
  do(p_value = summary(lm(data ~ Gender, data = .))$coefficient[2,4]) %>%
  unnest(p_value) %>%
  mutate(significant = ifelse(p_value < 0.05,1,0))  %>%
  group_by(Age,N) %>%
  summarize(n_sims = length(significant), proportion_significant = sum(significant)/n_sims)

3 Now what?

If you have completed the above, then you should have a pretty good idea how to simulate data for your own experiments. Your homework for this week is to do that; to try and simulate the key results from the experiments that you replicated.

Aim to:

Simulate one dataset that more-or-less matches one of the figures in the manuscript, and then display it using a bar chart, as we did for the Byers-Heinlein paper above.
Create a series of simulations that allow you to say how many subjects you probably need to get 80% power, for the key theoretical question in the paper you are replicating.

Remember that one key result may be an interaction between age and condition, and so you will have to edit the lm() model in order to account for that.

Another issue to bear in mind, is that published work tends to overestimate effect sizes. So,

Run a second series of simulations, which assess how many participants you would need if the difference between the conditions were only 75% as large.

Lab 8 - Simulating Data

Hugh Rabagliati

07/11/2017