The data set used is from the openintro library. More info about the data set could be found at http://www.openintro.org/stat/extras.php.
The repot covers sampling distribution of means and standard deviation. The topics covered ranges from summary statistics to Central Limit Theorem (CLT), Law of Large Number (LLN), Bootstrapping simulation and confidence interval.
library(openintro)
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
##
## The following object(s) are masked from 'package:datasets':
##
## cars
nrow(run10)
## [1] 16924
nrow(run10Samp)
## [1] 100
str(run10)
## 'data.frame': 16924 obs. of 9 variables:
## $ place : num 4494 6298 2502 8176 3413 ...
## $ time : num 92.2 106.3 89.3 113.5 86.5 ...
## $ pace : num 9.22 10.63 8.93 11.35 8.65 ...
## $ age : num 38 33 55 24 54 42 36 25 25 43 ...
## $ gender : Factor w/ 2 levels "F","M": 2 2 1 1 2 1 1 1 1 2 ...
## $ location: Factor w/ 1663 levels " ","11221 NY",..: 301 1535 920 24 1314 312 503 63 1171 1149 ...
## $ state : Factor w/ 62 levels "","AE","AK","AL",..: 28 13 57 57 9 28 57 57 47 28 ...
## $ divPlace: num 690 1322 37 878 213 ...
## $ divTot : num 1093 1490 236 974 483 ...
# Population mean of time and age
attach(run10)
mean(time)
## [1] 94.52
mean(age, na.rm = T)
## [1] 35.52
# Subset of female runners
females.run <- run10[gender == "F", ]
# Subset of male runners
males.run <- run10[gender == "M", ]
# Mean of male and females runners
mean(males.run$time)
## [1] 88.43
mean(females.run$time)
## [1] 99.02
# Percantage of male runners
(nrow(males.run)/nrow(run10)) * 100
## [1] 42.5
# Number of female runners
(nrow(females.run)/nrow(run10)) * 100
## [1] 57.5
par(mfrow = c(3, 1))
hist(run10$time, xlab = "Time(minutes)", main = "Distribution of time, All Runners",
col = "blue", breaks = 10)
hist(males.run$time, xlab = "Time(minutes)", main = "Distribution of time, Male Runners",
col = "blue", breaks = 10)
hist(females.run$time, xlab = "Time(minutes)", main = "Distribution of time, Female Runners",
col = "blue", breaks = 10)
mean(run10Samp$time)
## [1] 95.61
mean(run10Samp$age)
## [1] 35.05
# Running mean of time
mean.time.running <- cumsum(run10Samp$time)/(1:100)
# Take a sample of 5000 runners
set.seed(1)
smp.1000 = sample(run10$time, 1000, repl = T)
mean.time.running.1000 = cumsum(smp.1000)/(1:1000)
Plot the running mean time. As the number of samples increases, the estimated sample mean approaches the population mean
par(mfrow = c(2, 1))
plot(seq(1:length(mean.time.running)), mean.time.running, type = "l", col = "blue",
xlab = "Sample Size", ylab = "Running Mean (minutes)", main = "A Sample of 100 runners")
plot(seq(1:length(mean.time.running.1000)), mean.time.running.1000, type = "l",
col = "blue", xlab = "Sample Size", ylab = "Running Mean (minutes)", main = "A Sample of 1000 runners")
lines(c(0, 1000), c(mean(time), mean(time)))
Sample point estimates only approximate the population parameter, and they vary from one sample to another. If we took another simple random sample of the Cherry Blossom runners, we would find that the sample mean for the run time would be a little different. It will be useful to quantify how variable an estimate is from one sample to another. If this variability is small (i.e. the sample mean doesn’t change much from one sample to another) then that estimate is probably very accurate. If it varies widely from one sample to another, then we should not expect our estimate to be very good.
set.seed(1)
m <- 1000
n <- 10000
x <- numeric(m)
sd.err <- numeric(m)
y = numeric(m)
for (i in 1:m) {
smp <- sample(run10$time, n, replace = T)
x[i] <- mean(smp)
sd.err[i] <- sd(smp)/sqrt(n)
}
# Mean of the 1000 sample means
mean(x)
## [1] 94.52
# Population mean
mean(time)
## [1] 94.52
# Standard error of sampled means
mean(sd.err)
## [1] 0.1592
# Standard deviation of the population mean
sd(time)/sqrt(n)
## [1] 0.1593
A histogram of 1000 sample means for run time, where the samples are of size n = 100.
par(mfrow = c(2, 1))
hist(x, main = "Sampling Distribution of the Sample means for Run Time", xlab = "Time(minutes)",
col = "blue")
hist(sd.err, main = "Sampling Distribution of the Sample Standard Errors for Run Time",
xlab = "Time(minutes)", col = "blue")
Our point estimate is the most plausible value of the parameter, so it makes sense to build the confidence interval around the point estimate. The standard error, which is a measure of the uncertainty associated with the point estimate, provides a guide for how large we should make the confidence interval. The standard error represents the standard deviation associated with the estimate, and roughly 95% of the time the estimate will be within 2 standard errors of the parameter. If the interval spreads out 2 standard errors from the point estimate, we can be roughly 95% confident that we have captured the true parameter.
set.seed(1)
# Sample size
samsize <- 100
# Number of Replicates
replicates <- 50
# Significance level
pval <- 0.05
# Samples
samples <- replicate(replicates, sample(run10$time, samsize))
# T values
confint <- t(apply(samples, 2, function(x) c(mean(x) - qt(1 - pval/2, df = samsize -
1) * sd(x)/sqrt(samsize), mean(x) + qt(1 - pval/2, df = samsize - 1) * sd(x)/sqrt(samsize))))
# Confidence interval plot. Use red if mean outside interval
outside <- ifelse(confint[, 1] > mean(time) | confint[, 2] < mean(time), 2,
1)
plot(c(mean(time), mean(time)), c(1, replicates), col = "black", typ = "l",
ylab = "Samples", xlab = "Confidence Interval", xlim = c(80, 110))
segments(confint[, 1], 1:replicates, confint[, 2], 1:replicates, col = outside)