Sampling Distribution Simulation

Introduction

This R Markdown document presents R code for simulating the sampling distribution of the sample mean in a hypothetical situation. Each sample consists of 30 observations drawn from a normal population with a mean of 50 and a standard deviation of 5, so the sample means are themselves normally distributed with a mean of 50 and a standard error of 5/sqrt(30), or about 0.91. A corresponding animated plot demonstrates how a large number of sample means builds up a sampling distribution.

A final static plot of the simulated sampling distribution is given (based on the 1,000 means simulated in the code), along with two corresponding color-coded plots that reflect two specific probability questions about the sample mean.

This document is intended to supplement lecture notes on the sampling distribution of the mean and associated probability statements.
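As a quick check on this setup, the spread of simulated means can be compared with the theoretical standard error. The sketch below is not part of the original code. It assumes, as in the simulation code in the next section, that each mean comes from 30 draws via rnorm(30, 50, 5), taking the 5 as the standard deviation of individual observations. The names `n`, `mu`, `sigma`, `se_theory`, and `check_means`, as well as the seed, are illustrative choices.

```r
set.seed(1)  # assumed seed, only so the check is reproducible

n     <- 30   # observations per sample (matches rnorm(30, 50, 5))
mu    <- 50   # population mean
sigma <- 5    # population standard deviation

# Theoretical standard error of the sample mean: sigma / sqrt(n).
se_theory <- sigma / sqrt(n)

# Simulate 1,000 sample means in one call instead of a loop.
check_means <- replicate(1000, mean(rnorm(n, mu, sigma)))

se_theory          # about 0.91
sd(check_means)    # should be close to se_theory
mean(check_means)  # should be close to 50
```

The standard deviation of the simulated means should land near 0.91 rather than 5, which is the spread to expect in the histograms later in the document.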

Initial Sampling Distribution Simulation and Data Frame Preparation

The first block of code accomplishes the following:

Load appropriate R packages.
Simulate 1,000 sample means, each computed from a sample of 30 drawn from a hypothetical normal distribution.
Use a loop to create a data frame appropriate for animation purposes.

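library(dplyr)
library(ggplot2)
library(gganimate)

# First sample: 30 draws from a normal with mean 50 and sd 5; keep its mean.
tempvec <- rnorm(30,50,5)
meanval <- mean(tempvec)
meanset <- meanval
meanset <- matrix(meanset,nrow=1,ncol=1)
meanset <- as.data.frame(meanset)
# newset stacks every cumulative set of means, one set per animation state.
newset <- meanset

for (i in 2:1000) {
  tempvec <- rnorm(30,50,5)
  meanval <- mean(tempvec)
  meanval <- matrix(meanval,nrow=1,ncol=1)
  meanval <- as.data.frame(meanval)
  # meanset now holds the first i sample means.
  meanset <- rbind(meanset,meanval)
  # Append all i means so that dataset i shows the distribution so far.
  newset <- rbind(newset,meanset)
}

colnames(newset) <- "Sample_Mean"

# Build the dataset label: label i is repeated i times to line up with newset.
vec <- 1
for (i in 2:1000) {
  tempvec <- rep(i,i)
  vec <- c(vec,tempvec)
}

# 1 + 2 + ... + 1000 = 500500 rows in total.
vec <- matrix(vec,nrow=500500,ncol=1)
vec <- as.data.frame(vec)

colnames(vec) <- "Dataset"
animset <- cbind(newset,vec)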
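The loops above grow the data frame one step at a time, which is easy to follow but slow. One possible alternative, offered as a sketch rather than as the method used in this document, builds the same cumulative structure with base R. The names `n_sets`, `means_all`, and `animset_alt`, the seed, and the smaller number of datasets are assumptions made for illustration.

```r
set.seed(1)       # assumed seed for reproducibility
n_sets <- 100     # use 1000 to match the document; smaller here for speed

# One sample mean per simulated sample of 30 observations.
means_all <- replicate(n_sets, mean(rnorm(30, 50, 5)))

# Dataset k contains the first k means, so it contributes k rows.
animset_alt <- data.frame(
  Sample_Mean = unlist(lapply(seq_len(n_sets), function(k) means_all[seq_len(k)])),
  Dataset     = rep(seq_len(n_sets), times = seq_len(n_sets))
)

nrow(animset_alt)  # n_sets * (n_sets + 1) / 2 rows
```

With `n_sets` set to 1000 this yields the same 500,500-row layout as `animset`, although the simulated values differ because the random draws are made in a different way.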

Creating an Animated Plot to Visualize Construction of the Sampling Distribution

The data frame for the animation is created by keeping every 25th dataset (Dataset = 25, 50, ..., 1000) from the previous “animset.” After some trial and error, this was found to give suitable transitions between states in the animation.

Once the filtering is complete, the animated plot is produced.

animnew <- animset %>% filter(Dataset %% 25 == 0)

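# Histogram of the sample means, animated across the retained datasets.
anim1 <- animnew %>% ggplot(aes(x = Sample_Mean)) +
  geom_histogram(fill="blue",color="black") +
  transition_states(Dataset, transition_length = 1, state_length = 10)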
anim1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
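To make the effect of the filter concrete, the following base R sketch counts how many datasets and rows survive `Dataset %% 25 == 0`. It relies only on the structure built earlier (dataset k has k rows). The variable names `labels` and `kept` are illustrative.

```r
# Dataset labels run from 1 to 1000, and dataset k has k rows.
labels <- 1:1000

# Labels kept by the filter Dataset %% 25 == 0.
kept <- labels[labels %% 25 == 0]

length(kept)  # 40 states in the animation
sum(kept)     # 20500 rows remain in animnew
```

The animation therefore steps through 40 states, each adding 25 new sample means to the histogram.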

A Simple Plot of the Simulated Sampling Distribution of the Sample Mean

The final simulated sampling distribution is then presented as a simple histogram. This will be used as a reference for two probability-focused plots.

# The density count is scaled by the bin width (.2) to match the histogram's count axis.
samp_dist_1 <- animnew %>% filter(Dataset == 1000) %>% ggplot(aes(x = Sample_Mean)) +
  geom_histogram(fill="blue",color="black",binwidth=.2) + geom_density(aes(y=.2*..count..))
samp_dist_1
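The factor of .2 in `geom_density()` above converts the density curve to the count scale of the histogram: with n points in total, a bin of width w is expected to hold about n * w * density points. The base R sketch below illustrates the same idea with `hist()` and `dnorm()`. It is an illustration only, assuming a normal sampling distribution with mean 50 and standard error 5/sqrt(30). All object names and the seed are hypothetical.

```r
set.seed(1)  # assumed seed
means <- replicate(1000, mean(rnorm(30, 50, 5)))

binwidth <- 0.2
breaks   <- seq(floor(min(means)), ceiling(max(means)), by = binwidth)
h        <- hist(means, breaks = breaks, plot = FALSE)

# Expected count in each bin: n * binwidth * density at the bin midpoint.
expected <- length(means) * binwidth * dnorm(h$mids, mean = 50, sd = 5 / sqrt(30))

# Observed and expected counts side by side for the five fullest bins.
top <- order(h$counts, decreasing = TRUE)[1:5]
data.frame(mid = h$mids[top],
           observed = h$counts[top],
           expected = round(expected[top], 1))
```

Without the bin-width factor, the curve would be on the scale of counts per unit of the x-axis and would sit well above the bars.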

Create Data Frames to Use in Probability Plots

To create two plots visualizing different probability statements, two data frames are created, each with a binary indicator of whether a sample mean falls in the region defined by a threshold.

The specific probability statements are:

Probability the sample mean is less than 49.5.
Probability the sample mean is greater than 52.

animnew2 <- animnew
animnew3 <- animnew

animnew2 <- animnew2 %>% mutate(flag = ifelse(Sample_Mean < 49.5,1,0))
animnew3 <- animnew3 %>% mutate(flag = ifelse(Sample_Mean > 52,1,0))

animnew2$flag <- as.factor(animnew2$flag)
animnew3$flag <- as.factor(animnew3$flag)
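Because each flag is a 0/1 indicator, the proportion of flagged means within the final dataset (Dataset 1000) estimates the corresponding probability. As a sketch that is not part of the original code, the simulated proportions can be compared with normal-theory values from `pnorm()`, assuming the sample means are normal with mean 50 and standard error 5/sqrt(30). The vector `means` stands in for the 1,000 values in Dataset 1000, and the seed and other names are arbitrary choices.

```r
set.seed(1)  # assumed seed
means <- replicate(1000, mean(rnorm(30, 50, 5)))
se    <- 5 / sqrt(30)

# Simulated estimates: share of means on the relevant side of each threshold.
p_less_sim    <- mean(means < 49.5)
p_greater_sim <- mean(means > 52)

# Normal-theory values for comparison.
p_less_theory    <- pnorm(49.5, mean = 50, sd = se)                    # roughly 0.29
p_greater_theory <- pnorm(52, mean = 50, sd = se, lower.tail = FALSE)  # roughly 0.014

round(c(sim = p_less_sim, theory = p_less_theory), 3)
round(c(sim = p_greater_sim, theory = p_greater_theory), 3)
```

The simulated proportions vary from run to run, but with 1,000 means they should fall close to the theoretical values.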

Visualizing Probability Statement About Sample Means (#1)

A plot is constructed where the histogram is shaded either pink or teal depending on whether the sample mean satisfies the probability statement or not. Teal denotes the section of the histogram that satisfies the probability statement: “Sample mean is less than 49.5.”

It should be noted that the bins of the histogram are not necessarily aligned with the threshold of interest. Therefore, one of the histogram bins may contain values that fall on either side of the threshold. This is seen as an overlap of the shaded colors.
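One way to avoid a bin that straddles the threshold, offered here as a suggestion rather than as part of the original analysis, is to give `geom_histogram()` an explicit `binwidth` together with `boundary = 49.5`, so that a bin edge falls on the threshold. In the sketch below, the data frame `demo`, the plot name `aligned_plot`, the bin width of 0.2, and the seed are assumptions. `demo` stands in for the Dataset 1000 subset of `animnew2`.

```r
library(ggplot2)

set.seed(1)  # assumed seed
demo <- data.frame(Sample_Mean = replicate(1000, mean(rnorm(30, 50, 5))))
demo$flag <- factor(ifelse(demo$Sample_Mean < 49.5, 1, 0))

# boundary = 49.5 places a bin edge on the threshold, so no bin mixes flags.
aligned_plot <- ggplot(demo, aes(x = Sample_Mean, color = flag, fill = flag)) +
  geom_histogram(binwidth = 0.2, boundary = 49.5,
                 position = "identity", alpha = .5)
aligned_plot
```

Setting `binwidth` explicitly also suppresses the `stat_bin()` message about the default of 30 bins.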

prob_plot1 <- animnew2 %>% filter(Dataset == 1000) %>% ggplot(aes(x = Sample_Mean, color = flag, fill = flag)) + geom_histogram(aes(y=..count..),position="identity",alpha=.5)
prob_plot1

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Visualizing Probability Statement About Sample Means (#2)

A plot is constructed where the histogram is shaded either pink or teal depending on whether the sample mean satisfies the probability statement or not. Teal denotes the section of the histogram that satisfies the probability statement: “Sample mean is greater than 52.”