Vignette: autoboot

Load in the autoboot package:

library(autoboot)

Bootstrapping - Creating Bootstrapped Distributions & Confidence Intervals

Oftentimes, researchers are interested in modeling distributions. However, they cannot always do this, especially for smaller data sets. Thankfully, there is a way to still work with small data sets and get statistics - bootstrapping. The create_boot_dist function is useful for creating a univariate bootstrap distribution using a user-chosen statistic of choice.

Below, we have data containing information from the mtcars package that we will be using to demonstrate our package’s capabilities.

x <- mtcars$mpg

However, the vector’s size is limited (32) and researchers may not always want to use parametric statistics since statistical inference assumptions may be violated. Bootstrapping provides an alternative method of calculating statistics compared to R’s built-in functions by sampling from the sample to create a distribution of \(B\) observations of the statistic. The autoboot package focuses on using bootstrap to make inferential statistics easy by calculating parameter estimates by doing bootstrapping on univariate data.

create_boot_dist(x)

## $boot_vals
##  [1] 20.25937 20.15938 21.37187 20.05937 19.81875 19.73750 20.34375 20.23438
##  [9] 20.78125 20.20312 20.46562 20.35000 18.80937 19.37187 20.56250 21.02187
## [17] 18.75000 22.47188 20.14062 21.91250 19.40625 20.26250 19.06875 21.26875
## [25] 20.73125 18.97188 20.26875 18.84375 18.46562 20.59062 21.26875 20.92188
## [33] 18.04375 21.03438 20.62500 19.59688 19.53125 20.71562 22.10938 20.03125
## [41] 19.01250 20.27500 19.08438 20.45000 20.77187 20.30312 20.99063 20.60938
## [49] 19.79688 20.52188
## 
## $pt_est
## [1] 20.07236
## 
## $obs_stat
## [1] 20.09062
## 
## $sd
## [1] 0.02771939

This function outputs a list containing bootstrapped values, point estimates, observed statistics, and standard deviation via a function call. While there are many ways to use this, our package has a few helpful built-in functions, especially for creating confidence intervals for a bootstrapped distribution. Additionally, this function also automatically removes NA values for ease of us and provides a warning if \(B\) is a low value.

mpg2 <- c(NA, mtcars$mpg)
create_boot_dist(mpg2, 50, median) # Fifty of B values will appear in output

## Warning in create_boot_dist(mpg2, 50, median): Missing data removed from given
## data vector

## Warning in create_boot_dist(mpg2, 50, median): B has a low number of replicates.
## Consider increasing B for a more reliable bootstrap distribution

## $boot_vals
##  [1] 21.00 21.45 21.20 21.00 19.20 18.25 17.95 19.45 18.10 18.65 18.10 18.10
## [13] 20.35 18.40 17.80 21.20 19.20 18.95 18.70 18.95 19.20 18.40 19.45 17.95
## [25] 19.20 17.50 19.70 19.20 18.95 21.00 17.95 20.35 18.40 19.70 18.65 17.55
## [37] 16.40 21.00 19.20 17.80 21.40 17.30 18.95 15.50 18.50 20.35 19.45 18.40
## [49] 21.00 19.70
## 
## $pt_est
## [1] 19.082
## 
## $obs_stat
## [1] 19.2
## 
## $sd
## [1] 0.08828571

Having a warning message for low counts for \(B\) is intended to make users aware any inference they do is risky, but if it is what is desired for application, the warning can be ignored.

Next, we create bootstrapped confidence intervals using four different methods.

# CI using percentile method
create_percentile_ci(create_boot_dist(mpg2)$boot_vals)

## Warning in create_boot_dist(mpg2): Missing data removed from given data vector

##     2.5%    97.5% 
## 18.63133 22.53133

# CI using standard normal
create_stand_norm_ci(create_boot_dist(mpg2)$boot_vals)

## Warning in create_boot_dist(mpg2): Missing data removed from given data vector

## [1] 20.39617 20.39816

# CI using basic method
create_basic_ci(create_boot_dist(mpg2)$boot_vals)

## Warning in create_boot_dist(mpg2): Missing data removed from given data vector

## [1] 20.12819 20.70404

# CI using Bootstrap t method
create_boot_t_ci(create_boot_dist(mpg2)$boot_vals)

## Warning in create_boot_dist(mpg2): Missing data removed from given data vector

## [1] 19.81118 20.42986

We caution users that the create_standard_norm_ci has narrow intervals due to its mathematical properties, but it is good to use to get a rough estimate of what ‘acceptable’ parameter estimates are possible at given confidence levels. The other methods to create CIs give more reasonable estimates (generally speaking) and users should keep Type I errors in mind when looking at each CI.

For histograms, researchers and analysts are often interested in calculating the bandwidth to determine an optimal number of bins for an effective histogram. Thus, we have created a function to do that when we visualize histograms in our automated_boot function.

calc_bw(mtcars$mpg)

## [1] 3.916667

automated_boot(mtcars$mpg, B = 500, show_hist = TRUE, ci.method = "all")

## Percentile: 18.68188 21.87648 
## Standard Normal: 20.22837 20.23202 
## T:  19.98681 20.46126 
## Basic:  18.19328 22.19289

## $ci
## NULL
## 
## $hist

The automated_boot function does a lot of common bootstrapping work for univariate data on its own. It allows for CI calculations using a specified method (percentile, standard_normal, boot_t, and basic). Using the all method, four types of confidence interval can be returned. Additionally, the automated_boot function uses color palettes from the RColorBrewer package to create a ggplot-based histogram with nicer colors. The default palette used is Set3. However, users may sometimes want different colors. A comprehensive list of color palettes is available at https://www.r-graph-gallery.com/38-rcolorbrewers-palettes.html, and all that is needed is the name of the color palette, given as a string, for a histogram to be created using that color palette.

MLE Estimation via Bootstrapping

Often, people are interested in a maximum likelihood estimate (MLE) for certain distributions. Hence, we calculate MLE for well-known discrete and continuous distributions. MLE formulas were encoded in for these commonly known distributions and were implemented as functions where an argument is a bootstrapped sample. Then, a MLE estimator is returned.

# function to automate MLE
automated_MLE(data = mtcars$mpg, MLE.method = "normal", B = 500)

## $mu_MLE
## [1] 20.08834
## 
## $sigma_square_MLE
## [1] 33.9289

In just one line of code, we were able to compute \(\mu_{MLE}\) and \(\sigma^{2}_{MLE}\). We can change MLE.method’s input to find bernoulli, poisson, normal, uniform, exponential, or gamma-based MLEs via bootstrapping. Another example is provided using bernoulli instead of normal.

# Bernoulli data is 1 or 0
bernData <- c(0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
automated_MLE(data = bernData, MLE.method = "bernoulli", B = 500)

## [1] "p_MLE is: 0.4636"

Bootstrapping with the same distribution several times, or comparing bootstrapping methods with two similar distributions, can lead to estimates that are very similar, but domain knowledge can sometimes lead to certain distributions being more realistic than others. autoboot helps to streamline that process and arrive at decisions of appropriate distributions sooner.

Data Visualization

This package will help to create different graphs in R. These graphs are highly useful for data exploration and the best part is that most the functions inside this package work on entire data sets. It will automatically select the appropriate column types.

To create a histogram of a data column or vector we can use Histo_graph with x_label, y_label, and title arguments. We can assign label names and title.

Histo_graph(mtcars$hp, x_label="Horsepower",y_label="Frequency", title = "Histogram graph")

## NULL

If you want to find out correlation of all columns(numeric) in the data, use Correlation_Plot. This function will select numeric columns and create a correlation plot. With x_label, y_label, and title, we can insert axis names and title.

Correlation_Plot(mtcars)

## NULL

The Violin_plot function can create a violin plot; the user needs to specify the data column. It depends on the ggplot2 package and uses its functioning. Here with x_label,y_label, and title we can define the labels and title of the graph.

Violin_plot(mtcars,mtcars$cyl, mtcars$mpg)

Scatter_graph creates a scatter plot of two numeric vectors, and inside that plot we can insert the axis labels by using the x_label and y_label arguments. Also, we can add a title to the graph by using the Name argument.

Scatter_graph(mtcars$hp,mtcars$wt, x_label = "HorsePower", y_label = "Weight", Name = "Scatterplot of Horsepower and Weight")

## NULL

This function can create box plots of a data set that gets passed in, and it will select the numeric columns itself. Further, users can pass a factor argument as well through the BY argument. This argument will create box plots for the variables based on factor.

All_boxplot(mtcars, BY = "gear")

## NULL

To plot density charts for the data, we can use All_density. It will create a density chart for all columns in the data, and this will be useful to detect the normality of data columns. This function depends on the DataExplorer package.

All_density(mtcars)

## NULL

With this function we can create a QQ-Plot, by passing the data column (or vector) and with the x_label, y_label, and title arguments, we can define axis names and a title.

QQ_Plot(mtcars$hp)

## NULL

Bootstrapping for Distribution Parameter Estimates and Other Population Parameters

The bootstrap is brought into analysis when a population cannot be resampled for practical reasons such as cost, or such as a distribution not having a closed-form or analytic distribution function. In bootstrapping, the sample data at hand is treated as the population, and it is resampled with replacement to obtain a sample of size m, for a fixed number of times B (so m cannot exceed B). Bootstrapping can provide point estimates of parameters of interest, as well as provide confidence intervals for said parameters.

The bootCorr, bootGammaParameters, bootKurtosis, and bootMean functions were coded as bootstrap implementations of formulas of point estimators for well-known parameters like correlation, kurtosis, and mean. For the gamma distribution, which requires two parameters, the formulas coded to generate estimates are based on the method of moments.

Let’s explore the correlation between city mileage and highway mileage in cars. The bootCorr() method takes in two numeric vectors of equal length, combines them into a data.frame object, and then samples the rows and takes correlations in the usual bootstrapping manner. The output is a message with the name of the parameter of interest \(\rho\), the lower and upper bounds for the confidence interval for \(\rho\), and the point estimate for \(\rho\).

bootCorr(mpg$cty, mpg$hwy, 100, 500, 0.95)

## [1] "rho CI:  0.9562 ,  0.9645  with point estimate of  0.9603 ."

Let’s sample from Gamma(2, 4)! The bootGammaParameters() method takes in a numeric vector and then resamples from it in order to calculate method of moments (MOM) estimates for the Gamma distribution parameters. The output is a message with the name of the parameters of interest (\(\alpha\) and \(\lambda\)), the lower and upper bounds for both parameters, and the point estimates for \(\alpha\) and \(\lambda\).

gammaData <- rgamma(500, shape = 2, rate = 4)
bootGammaParameters(gammaData, 250, 1000, 0.9)

## [1] "alpha_MOM CI:  1.5717  ,  2.2186 with point estimate of  1.9362 ; lambda_MOM CI:  3.1517  ,  4.6125  with point estimate of  3.9749 ."

How skewed is an approximately standard normal distribution? Let’s explore this by bootstrapping for kurtosis coefficient \(\kappa\). The bootKurtosis() method takes in a numeric vector and then resamples from it in order to estimate the kurtosis coefficient \(\kappa\). The output is a message with the name of the parameter of interest (\(\kappa\)), the lower and upper bounds of the confidence interval for \(\kappa\), and the point estimate for \(\kappa\).

kurtData <- rnorm(250, 0, 1)
bootKurtosis(kurtData, 100, 1000, 0.9)

## [1] "kappa CI:  1.9625 ,  4.3433  with point estimate of  3.1839 ."

We bootstrap from a small, 15-row data set built into R, women, where the women observed are American and aged between 30 and 39. Let’s bootstrap for women’s average height. The bootMean() method takes in a numeric vector and then resamples from it in order to estimate the population mean \(\mu\). The output is a message with the name of the parameter of interest (\(\mu\)), the lower and upper bounds for the confidence interval for \(\mu\), and the point estimate for \(\mu\).

bootMean(women$height, 12, 1000, 0.95)

## [1] "mu CI:  62.5803 ,  67.3303  with point estimate of  65.0803 ."

Vignette: autoboot

Group 7: Razmig Zeitounian, Cristian Mojica, Yan Ke, Juan Flores

12/15/2021

Bootstrapping - Creating Bootstrapped Distributions & Confidence Intervals

MLE Estimation via Bootstrapping

Data Visualization

Bootstrapping for Distribution Parameter Estimates and Other Population Parameters