Make sure that you have necessary packages in your libary
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(ggplot2)
Let???s say we have a uniform distribution over [0,1]. We are going to create a matrix (similar to a data frame, but easier to manipulate in this context) with random values from a uniform distribution. The matrix mymatrix will have 100 columns (variables) and 200 rows (observations).
mymatrix <- matrix(runif(20000,0,1),ncol=100,nrow=200)
#We already encountered runif last week to draw random values from a uniform distribution. runif(20000,0,1) draws 20000 values from a uniform distribution between 0 and 1. Why do we need 20,000 values? Because 100 columns * 200 rows = 20,000, so this is how many values we need to fill our matrix of 100 columns by 200 rows.
Now we can easily create the mean of each of our 100 variables:
means <- colMeans(mymatrix)
#The line above creates a vector with the means of the variables (=columns) in mymatrix. This is why it???s called colMeans: it takes the means of the columns in a matrix.
Using what we learned in chapter 1 (see week 1 of lab notes), we can plot the distribution of this means and verify that it is normal. Remember that since we drew from a uniform distribution between 0 and 1, the mean of this distribution is 0.5.
Here is a histogram of the distribution of means that we created above:
ggplot (data=NULL) +
geom_histogram(mapping = aes(x = means), bins = 30)
#data=NULL tells R that means is not in a data frame (since means is in the global environment).
Let???s do a QQ plot (see week 2 of lab notes) for means to check that it???s normal.
qqnorm(means)
qqline(means, col="red")
We first create an empty (full of zeros) vector (=column) with 20000 zeroes:
myvector <- numeric(20000)
table(myvector)
## myvector
## 0
## 20000
We then calculate the average of the sample drawn in mymatrix for each possible sample size i, i going from 1 to 20000, and we put this average in myvector:
for(i in 1:20000) {
myvector[i] <- mean(mymatrix[1:i])
}
#The line above is a loop that repeats the command in curly brackets 20000 times, one for each value of i, going from 1 to 20000.
Now we can plot our estimate of the mean of the sample if the sample contains 1 observation, 2, 3, etc., up to 20000:
ggplot (data=NULL) +
geom_line(mapping = aes(x = 1:20000,y= myvector))
#The x=1:20000 above simply indicates that the x axis should be labeled with integers 1 to 20000. Note that the mean converges to the true distribution mean of 0.5 for large sample sizes.