This is an RMarkdown document displaying R code for simulating a sampling distribution for the sample mean in a hypothetical situation. Each sample consists of 30 observations drawn from a normal population with a mean of 50 and a standard deviation of 5, so the sample means are normally distributed with a mean of 50 and a standard error of 5/√30 ≈ 0.91. A corresponding animated plot is constructed to demonstrate how a large number of sample means build up a sampling distribution.
A final static plot of the simulated sampling distribution is given (based on 1,000 simulated means), as well as two corresponding color-coded plots that reflect two specific probability questions regarding the sample mean.
This was created with the intention of supplementing lecture notes regarding the sampling distribution of the mean and associated probability statements.
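As a quick check on the setup, the theoretical standard error of the mean can be compared with the spread of a batch of simulated means. This is a minimal sketch, assuming samples of size 30 from a normal population with mean 50 and standard deviation 5 (the values passed to `rnorm` in the code that follows). The seed and the names `pop_mean`, `pop_sd`, `n`, `se_theory`, and `sim_means` are illustrative and not part of the original code.

```r
# Assumed setup, matching the arguments of rnorm(30, 50, 5)
pop_mean <- 50
pop_sd <- 5
n <- 30

# Theoretical standard error of the sample mean: sigma / sqrt(n)
se_theory <- pop_sd / sqrt(n)

# Simulate 1,000 sample means (seed chosen only for reproducibility)
set.seed(2024)
sim_means <- replicate(1000, mean(rnorm(n, pop_mean, pop_sd)))

# The standard deviation of the simulated means should be close to se_theory
c(theory = se_theory, simulated = sd(sim_means))
```

The two numbers should agree to within simulation noise, which is why the histograms later in the document span only a few units around 50.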
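The first block of code accomplishes the following: it loads the required packages, simulates 1,000 sample means (each from a sample of size 30), stacks the cumulative sets of means (the first mean, the first two means, and so on up to all 1,000) into one data frame, and labels each row with the index of the dataset it belongs to.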
library(dplyr)
library(ggplot2)
library(gganimate)
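# First sample: 30 draws from Normal(mean = 50, sd = 5), reduced to its mean
tempvec <- rnorm(30,50,5)
meanval <- mean(tempvec)
meanset <- meanval
meanset <- matrix(meanset,nrow=1,ncol=1)
meanset <- as.data.frame(meanset)
newset <- meanset
# Each iteration adds one new sample mean to meanset (all means so far)
# and stacks that cumulative set onto newset (1 + 2 + ... + 1000 rows in total)
for (i in 2:1000) {
  tempvec <- rnorm(30,50,5)
  meanval <- mean(tempvec)
  meanval <- matrix(meanval,nrow=1,ncol=1)
  meanval <- as.data.frame(meanval)
  meanset <- rbind(meanset,meanval)
  newset <- rbind(newset,meanset)
}
colnames(newset) <- "Sample_Mean"
# Dataset labels: 1 once, 2 twice, ..., 1000 one thousand times
vec <- 1
for (i in 2:1000) {
  tempvec <- rep(i,i)
  vec <- c(vec,tempvec)
}
vec <- matrix(vec,nrow=500500,ncol=1)
vec <- as.data.frame(vec)
colnames(vec) <- "Dataset"
# Attach the dataset label to every stacked sample mean
animset <- cbind(newset,vec)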
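The loops above grow data frames with repeated `rbind` calls, which becomes slow as the number of rows increases. The following is an alternative sketch that builds a data frame with the same structure without loops. It is not the original author's method. The seed and the names `n_sims`, `sample_means`, and `animset_alt` are assumptions introduced here for illustration.

```r
set.seed(1)  # illustrative seed; the original code does not set one
n_sims <- 1000

# One sample mean per simulated sample of 30 draws from Normal(50, 5)
sample_means <- replicate(n_sims, mean(rnorm(30, 50, 5)))

# Dataset k holds the first k sample means, so the row indices are
# 1, 1 2, 1 2 3, ..., which is what sequence() generates
animset_alt <- data.frame(
  Sample_Mean = sample_means[sequence(seq_len(n_sims))],
  Dataset = rep(seq_len(n_sims), times = seq_len(n_sims))
)

nrow(animset_alt)  # 1 + 2 + ... + 1000 rows
```

The result has the same two columns, `Sample_Mean` and `Dataset`, so it could be passed to the filtering and plotting steps in place of `animset`.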
The dataset for the animation is created by keeping every 25th dataset from the previous “animset” (Datasets 25, 50, and so on up to 1000). After some trial and error, this was found to give a suitable transition between states in the animation.
Once the filtering is completed, code is executed to produce the desired animated plot.
animnew <- animset %>% filter(Dataset %% 25 == 0)
anim1 <- animnew %>%
  ggplot(aes(x = Sample_Mean)) +
  geom_histogram(fill="blue",color="black") +
  transition_states(Dataset,transition_length=1,state_length=10)
anim1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
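To see what the `Dataset %% 25 == 0` filter keeps, the retained dataset indices and the resulting row count can be worked out directly. This is a small sketch in base R. The names `n_sims`, `step`, `kept`, `n_states`, and `n_rows` are illustrative.

```r
n_sims <- 1000
step <- 25

# Dataset indices that survive the filter: 25, 50, ..., 1000
kept <- seq_len(n_sims)[seq_len(n_sims) %% step == 0]

# Each retained dataset becomes one state in the animation
n_states <- length(kept)   # 40

# Dataset k contains k sample means, so the filtered data frame has sum(kept) rows
n_rows <- sum(kept)        # 20500
```

A smaller `step` would give more states and a longer animation, while a larger one would give fewer and more abrupt changes between frames.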
The final simulated sampling distribution is then presented as a simple histogram. This will be used as a reference for two probability-focused plots.
samp_dist_1 <- animnew %>%
  filter(Dataset == 1000) %>%
  ggplot(aes(x = Sample_Mean)) +
  geom_histogram(fill="blue",color="black",binwidth=.2) +
  geom_density(aes(y=.2*after_stat(count)))
samp_dist_1
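The factor of `.2` in the density layer matches the histogram's bin width: a density curve on the count scale, multiplied by the bin width, covers the same area as the histogram bars. The sketch below checks this numerically in base R. It assumes the same simulation settings as above, and the seed and variable names are illustrative.

```r
set.seed(1)  # illustrative seed
sim_means <- replicate(1000, mean(rnorm(30, 50, 5)))
binwidth <- 0.2

# Histogram area: each bar has height = count and width = binwidth
breaks <- seq(floor(min(sim_means)), ceiling(max(sim_means)), by = binwidth)
h <- hist(sim_means, breaks = breaks, plot = FALSE)
hist_area <- sum(h$counts) * binwidth

# A kernel density integrates to about 1 (trapezoid rule over its grid);
# multiplying by n puts it on the count scale, and by binwidth on the bar scale
d <- density(sim_means)
dens_area <- sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)
curve_area <- dens_area * length(sim_means) * binwidth

c(histogram = hist_area, scaled_density = curve_area)
```

Both areas come out near 1000 × 0.2 = 200, which is why the curve sits on top of the bars rather than hugging the horizontal axis.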
In order to create two plots visualizing different probability statements, two data frames are created, each with a binary indicator of whether or not a sample mean falls in a particular region defined by a threshold.
The specific probability statements are “Sample mean is less than 49.5” and “Sample mean is greater than 52.”
animnew2 <- animnew
animnew3 <- animnew
animnew2 <- animnew2 %>% mutate(flag = ifelse(Sample_Mean < 49.5,1,0))
animnew3 <- animnew3 %>% mutate(flag = ifelse(Sample_Mean > 52,1,0))
animnew2$flag <- as.factor(animnew2$flag)
animnew3$flag <- as.factor(animnew3$flag)
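The flags identify which simulated means satisfy each statement, so the proportion of flagged means is an empirical estimate of the corresponding probability. The sketch below compares such estimates with the theoretical values from a normal sampling distribution with mean 50 and standard error 5/√30. It re-simulates the means with an illustrative seed rather than reusing `animnew`, and all variable names are assumptions made for this example.

```r
set.seed(1)  # illustrative seed
sim_means <- replicate(1000, mean(rnorm(30, 50, 5)))
se <- 5 / sqrt(30)

# Empirical estimates: proportion of simulated means in each region
p_less_emp <- mean(sim_means < 49.5)
p_greater_emp <- mean(sim_means > 52)

# Theoretical probabilities under Normal(mean = 50, sd = se)
p_less_theory <- pnorm(49.5, mean = 50, sd = se)
p_greater_theory <- pnorm(52, mean = 50, sd = se, lower.tail = FALSE)

rbind(
  less_than_49.5 = c(empirical = p_less_emp, theory = p_less_theory),
  greater_than_52 = c(empirical = p_greater_emp, theory = p_greater_theory)
)
```

The theoretical values are roughly 0.29 and 0.014, so the flagged region in the first plot should be far larger than in the second.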
A plot is constructed where the histogram is shaded either pink or teal depending on whether the sample mean satisfies the probability statement or not. Teal denotes the section of the histogram that satisfies the probability statement: “Sample mean is less than 49.5.”
It should be noted that the bins of the histogram are not necessarily constructed to reflect the threshold of interest. Therefore, one of the histogram bins may contain values that fall on either side of the threshold. This is seen as an overlap of the shaded colors.
prob_plot1 <- animnew2 %>%
  filter(Dataset == 1000) %>%
  ggplot(aes(x = Sample_Mean, color = flag, fill = flag)) +
  geom_histogram(aes(y=after_stat(count)),position="identity",alpha=.5)
prob_plot1
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
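One way to avoid the overlapping bin is to place a bin edge exactly at the threshold, using the `binwidth` and `boundary` arguments of `geom_histogram()`. The following is a sketch under assumed settings, not the original plot: the bin width of 0.25, the seed, and the names `sim` and `aligned_plot` are illustrative choices.

```r
library(ggplot2)

set.seed(1)  # illustrative seed
sim <- data.frame(Sample_Mean = replicate(1000, mean(rnorm(30, 50, 5))))
sim$flag <- factor(ifelse(sim$Sample_Mean < 49.5, 1, 0))

# boundary = 49.5 forces a bin edge at the threshold, so no bin
# mixes means from both sides of it
aligned_plot <- ggplot(sim, aes(x = Sample_Mean, color = flag, fill = flag)) +
  geom_histogram(binwidth = 0.25, boundary = 49.5,
                 position = "identity", alpha = 0.5)
aligned_plot
```

The same idea applies to the second probability statement by setting `boundary = 52`. Specifying `binwidth` also silences the `stat_bin()` message about the default of 30 bins.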
A plot is constructed where the histogram is shaded either pink or teal depending on whether the sample mean satisfies the probability statement or not. Teal denotes the section of the histogram that satisfies the probability statement: “Sample mean is greater than 52.”
It should be noted that the bins of the histogram are not necessarily constructed to reflect the threshold of interest. Therefore, one of the histogram bins may contain values that fall on either side of the threshold. This is seen as an overlap of the shaded colors.
prob_plot3 <- animnew3 %>%
  filter(Dataset == 1000) %>%
  ggplot(aes(x = Sample_Mean, color = flag, fill = flag)) +
  geom_histogram(aes(y=after_stat(count)),position="identity",alpha=.5)
prob_plot3
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.