Simulation of the distribution of the mean of a random sample

Make sure that you have necessary packages in your libary

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

library(ggplot2)

Let???s say we have a uniform distribution over [0,1]. We are going to create a matrix (similar to a data frame, but easier to manipulate in this context) with random values from a uniform distribution. The matrix mymatrix will have 100 columns (variables) and 200 rows (observations).

mymatrix <- matrix(runif(20000,0,1),ncol=100,nrow=200)

#We already encountered runif last week to draw random values from a uniform distribution. runif(20000,0,1) draws 20000 values from a uniform distribution between 0 and 1. Why do we need 20,000 values? Because 100 columns * 200 rows = 20,000, so this is how many values we need to fill our matrix of 100 columns by 200 rows.

Now we can easily create the mean of each of our 100 variables:

means <- colMeans(mymatrix)

#The line above creates a vector with the means of the variables (=columns) in mymatrix. This is why it???s called colMeans: it takes the means of the columns in a matrix.

Using what we learned in chapter 1 (see week 1 of lab notes), we can plot the distribution of this means and verify that it is normal. Remember that since we drew from a uniform distribution between 0 and 1, the mean of this distribution is 0.5.

Here is a histogram of the distribution of means that we created above:

ggplot (data=NULL) + 
    geom_histogram(mapping = aes(x = means), bins = 30)

#data=NULL tells R that means is not in a data frame (since means is in the global environment).

Let???s do a QQ plot (see week 2 of lab notes) for means to check that it???s normal.

qqnorm(means)
qqline(means, col="red")

Illustrate the law of large numbers

We first create an empty (full of zeros) vector (=column) with 20000 zeroes:

myvector <- numeric(20000)
table(myvector)

## myvector
##     0 
## 20000

We then calculate the average of the sample drawn in mymatrix for each possible sample size i, i going from 1 to 20000, and we put this average in myvector:

for(i in 1:20000) { 
   myvector[i] <- mean(mymatrix[1:i])
}

#The line above is a loop that repeats the command in curly brackets 20000 times, one for each value of i, going from 1 to 20000.

Now we can plot our estimate of the mean of the sample if the sample contains 1 observation, 2, 3, etc., up to 20000:

ggplot (data=NULL) + 
     geom_line(mapping = aes(x = 1:20000,y= myvector))

#The x=1:20000 above simply indicates that the x axis should be labeled with integers 1 to 20000. Note that the mean converges to the true distribution mean of  0.5 for large sample sizes.

Simulation of the distribution of the mean of a random sample

Seongho An

9/29/2017

Simulation of the distribution of the mean of a random sample

Illustrate the law of large numbers