I pledge on my honor that i have not given or recieved any unauthorized assistance on an assignment or examination
In this assignment we will explore the technique of bootstrap estimation.
Load the iris dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(iris)
Use the filter command to get the dataframe iris_set, corresponding to the species Setosa.
iris_set <- filter(iris, Species == "setosa")
Extract the four variables of iris_set into the vectors: pet_len, pet_wid, sep_len, sep_wid
pet_len <- iris_set$Petal.Length
pet_wid <- iris_set$Petal.Width
sep_len <- iris_set$Sepal.Length
sep_wid <- iris_set$Sepal.Width
Recall that a bootstrap sample is obtained by sampling with replacement a sample of size n from a sample of n data values. A bootstrap point estimate will then be the point estimator evaluate on a given bootstrap sample. We will calculate bootstrap estimates for some standard estimators, calculate the empirical expected values, and the bootstrap standard error.
Calculate the bootstrap estimates for the sample mean, store these in the vector “boot_means_pet_len” etc and then calculate the standard deviation of this vector to estimate the standard error for sample mean as the point estimator (compare this with the standard error estimated by the Central limit theorem). Note down your observations.
set.seed(2020)
B <- 100000
boot_means_pet_len <- replicate(B, mean(sample(pet_len, size = length(pet_len), replace = TRUE)))
mean(pet_len)
## [1] 1.462
sd(pet_len)/sqrt(50) #standard error from Central Limit Theorem
## [1] 0.0245598
mean(boot_means_pet_len)
## [1] 1.461951
sd(boot_means_pet_len) #bootstrap standard error
## [1] 0.02433344
set.seed(2020)
B <- 100000
boot_means_pet_wid <- replicate(B, mean(sample(pet_wid, size = length(pet_wid), replace = TRUE)))
mean(pet_wid)
## [1] 0.246
sd(pet_wid)/sqrt(50) #standard error from Central Limit Theorem
## [1] 0.01490377
mean(boot_means_pet_wid)
## [1] 0.246004
sd(boot_means_pet_wid) #bootstrap standard error
## [1] 0.01474575
set.seed(2020)
B <- 100000
boot_means_sep_len <- replicate(B, mean(sample(sep_len, size = length(sep_len), replace = TRUE)))
mean(sep_len)
## [1] 5.006
sd(sep_len)/sqrt(50) #standard error from Central Limit Theorem
## [1] 0.04984957
mean(boot_means_sep_len)
## [1] 5.005718
sd(boot_means_sep_len) #bootstrap standard error
## [1] 0.04940281
set.seed(2020)
B <- 100000
boot_means_sep_wid <- replicate(B, mean(sample(sep_wid, size = length(sep_wid), replace = TRUE)))
mean(sep_wid)
## [1] 3.428
sd(sep_wid)/sqrt(50) #standard error from Central Limit Theorem
## [1] 0.0536078
mean(boot_means_sep_wid)
## [1] 3.427784
sd(boot_means_sep_wid) #bootstrap standard error
## [1] 0.05318353
In all cases bootstrap standard errors and standard error estimates from the Central Limit Theorem matched up to three decimal points. Bootstrap mean and sample mean were also very close.
Calculate the bootstrap estimates for the sample standard deviation, store these in the vector “boot_sd_pet_len” etc and then calculate the standard deviation of this vector to estimate the standard error for sample standard deviation as the point estimator. Note down your observations.
set.seed(2020)
B <- 100000
boot_sd_pet_len <- replicate(B, sd(sample(pet_len, size = length(pet_len), replace = TRUE)))
sd(pet_len)
## [1] 0.173664
mean(boot_sd_pet_len)
## [1] 0.1706416
sd(boot_sd_pet_len)
## [1] 0.02061946
set.seed(2020)
B <- 100000
boot_sd_pet_wid <- replicate(B, sd(sample(pet_wid, size = length(pet_wid), replace = TRUE)))
sd(pet_wid)
## [1] 0.1053856
mean(boot_sd_pet_wid)
## [1] 0.1034229
sd(boot_sd_pet_wid)
## [1] 0.01384665
set.seed(2020)
B <- 100000
boot_sd_sep_len <- replicate(B, sd(sample(sep_len, size = length(sep_len), replace = TRUE)))
sd(sep_len)
## [1] 0.3524897
mean(boot_sd_sep_len)
## [1] 0.3474428
sd(boot_sd_sep_len)
## [1] 0.0322138
set.seed(2020)
B <- 100000
boot_sd_sep_wid <- replicate(B, sd(sample(sep_wid, size = length(sep_wid), replace = TRUE)))
sd(sep_wid)
## [1] 0.3790644
mean(boot_sd_sep_wid)
## [1] 0.3727383
sd(boot_sd_sep_wid)
## [1] 0.04404392
Bootstrap estimates of standard deviations and sample standard deviations were close, in all cases we got sample standard deviation a little bit higher than the bootstrap estimate. We also observed that the estimated standard deviations and their estimated standard errors were higher for sepals, probably because of a larger scale of this part of the Iris Setosa flowers.
Recall that the area of an ellipse with major axis \(a\) and minor axis \(b\) is given by \[ \pi a b.\] Under the assumption that the shape of the petal of the iris flower is approximately an ellipse with the major and minor axes given by petal length and petal width (or vice verse), we can estimate the petal surface area (only one side) of iris of Setosa species by the following estimator: \[ \hat\theta = \pi \cdot \overline{X}\cdot\overline{Y},\] where \(\overline{X}\) and \(\overline{Y}\) are the sample means of petal length and petal width respectively. \ Estimate the surface area of the petal of a iris Setosa flower using the given dataset.
petal_area_set <- pi*mean(pet_len)*mean(pet_wid)
petal_area_set
## [1] 1.12988
Now use bootstrap sampling to estimate the standard error in your estimated petal surface area. Make sure you identify the correct output.
set.seed(2020)
B <- 100000
boot_pet_area_set <- replicate(B, {
x <- mean(sample(pet_len, size = length(pet_len), replace = TRUE))
y <- mean(sample(pet_wid, size = length(pet_wid), replace = TRUE))
pi*x*y
}
)
mean(boot_pet_area_set)
## [1] 1.129927
sd(boot_pet_area_set)
## [1] 0.07033764
The bootstrap estimate of the standard error of my estimated petal surface area is 0.07033764.