1 Introduction

While massive data bring many statistical issues to the fore, including issues in exploratory data analysis and data visualization, there remains the core inferential need to assess the quality of estimators. Indeed, the uncertainty and biases in estimates based on large data can remain quite significant, as large datasets are often high dimensional, are frequently used to fit complex models with large numbers of parameters, and can have many potential sources of bias. [A Scalable Bootstrap for Massive Data]

Procedures that resample fewer than \(n\) points, such as subsampling and the \(m\) out of \(n\) bootstrap, are cheaper to compute than the bootstrap. However, because the variability of an estimator on a subsample differs from its variability on the full dataset, these procedures must rescale their output, and this rescaling requires knowledge and explicit use of the convergence rate of the estimator in question; these methods are thus less automatic and less easily deployable than the bootstrap. [A Scalable Bootstrap for Massive Data]
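As a minimal illustration of the rescaling described above, assume the estimator is the sample mean, which converges at rate \(\sqrt{n}\). Under that assumption, a standard deviation estimated from subsamples of size \(b\) has to be multiplied by \(\sqrt{b/n}\) before it applies to the full dataset. The function name, the simulated data, and the sizes below are hypothetical and chosen only for illustration.

```python
import math
import random
import statistics

def subsample_sd_of_mean(data, b, n_subsamples, rng):
    """SD of the sample mean across random subsamples of size b (no replacement)."""
    means = [statistics.fmean(rng.sample(data, b)) for _ in range(n_subsamples)]
    return statistics.stdev(means)

rng = random.Random(0)
n, b = 10_000, 500
data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # simulated data, not from the report

raw_sd = subsample_sd_of_mean(data, b, n_subsamples=200, rng=rng)
# Assumed sqrt(n) convergence rate: SD at size n = SD at size b * sqrt(b / n).
rescaled_sd = raw_sd * math.sqrt(b / n)
print(raw_sd, rescaled_sd)
```

The factor \(\sqrt{b/n}\) is specific to the assumed rate. An estimator with a different convergence rate would need a different factor, which is why these procedures are less automatic than the bootstrap.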

1.1 Bootstrapping

Bootstrapping uses the sample data to estimate relevant characteristics of the population. The sampling distribution of a statistic is then constructed empirically by resampling from the sample. The resampling procedure is designed to parallel the process by which sample observations were drawn from the population. For example, if the data represent an independent random sample of size n (or a simple random sample of size n from a much larger population), then each bootstrap sample selects n observations with replacement from the original sample. The key bootstrap analogy is the following: The population is to the sample as the sample is to the bootstrap samples.
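The resampling scheme described above can be sketched as follows for a bootstrap standard error. This is a minimal sketch under stated assumptions: the function name `bootstrap_se`, the simulated sample, and the number of resamples are hypothetical and are not taken from the report.

```python
import random
import statistics

def bootstrap_se(sample, estimator, n_boot, rng):
    """Bootstrap standard error: SD of the estimator over resamples of size n."""
    n = len(sample)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the original sample.
        resample = rng.choices(sample, k=n)
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(400)]  # simulated sample
se = bootstrap_se(sample, statistics.fmean, n_boot=500, rng=rng)
# The theoretical SE of the mean for this simulation is 2 / sqrt(400) = 0.1.
print(se)
```

Each resample plays the role of the sample, and the original sample plays the role of the population, which is the analogy stated above.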

1.2 The Bag of Little Bootstraps

For estimating \(SD(\hat{\theta})\):

1. Let \(\hat{F}\) denote the empirical probability distribution of the data (i.e., placing mass \(1/n\) at each of the \(n\) data points).

2. Select \(s\) subsets of size \(b\) from the full data (i.e., randomly sample a set of \(b\) indices \(\mathcal{I}_{j}=\left\{i_{1},\ldots,i_{b}\right\}\) from \(\left\{1,2,\ldots,n\right\}\) without replacement, and repeat \(s\) times).

3. For each of the \(s\) subsets (\(j=1,\ldots,s\)):

   (a) Repeat the following steps \(r\) times (\(k=1,\ldots,r\)):

       - Resample a bootstrap dataset \(X^{*}_{j,k}\) of size \(n\) from subset \(j\), i.e., sample \((n_{1},\ldots,n_{b})\sim{}\textrm{Multinomial}\left(n,(1/b,\ldots,1/b)\right)\), where \((n_{1},\ldots,n_{b})\) denotes the number of times each data point of the subset occurs in the bootstrapped dataset.

       - Compute and store the estimator \(\hat{\theta}^{*}_{j,k}\) on \(X^{*}_{j,k}\).

   (b) Compute the bootstrap SE of \(\hat{\theta}\) based on the \(r\) bootstrap datasets for subset \(j\), i.e., compute:

       \[ \xi^{*}_{j}=\textrm{SD}\left\{\hat{\theta}^{*}_{j,1},\ldots,\hat{\theta}^{*}_{j,r}\right\} . \]

4. Average the \(s\) bootstrap SEs, \(\xi^{*}_{1},\ldots,\xi^{*}_{s}\), to obtain an estimate of \(SD(\hat{\theta})\), i.e.,

   \[ \hat{SD}(\hat{\theta}) = \frac{1}{s}\sum_{j=1}^{s}\xi^{*}_{j} . \]
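The procedure above can be sketched in a few lines. This is a sketch under stated assumptions, not the code behind the results in this report: the names `blb_sd` and `weighted_mean`, the simulated data, and the subset size \(b = n^{0.6}\) are hypothetical. The multinomial counts are drawn as \(n\) uniform picks among the \(b\) subset indices, and the estimator receives the subset together with the counts, so a dataset of size \(n\) is never built explicitly.

```python
import random
import statistics
from collections import Counter

def blb_sd(data, estimator, b, s, r, rng):
    """Bag of Little Bootstraps estimate of SD(theta_hat).

    estimator(points, counts) takes the b subset points and their
    multiplicities in the size-n resample.
    """
    n = len(data)
    xis = []
    for _ in range(s):
        # Subset j: b data points sampled without replacement.
        subset = rng.sample(data, b)
        thetas = []
        for _ in range(r):
            # Multinomial(n, (1/b, ..., 1/b)) counts via n uniform index picks.
            picks = Counter(rng.choices(range(b), k=n))
            counts = [picks.get(i, 0) for i in range(b)]
            thetas.append(estimator(subset, counts))
        # xi*_j: SD of the r bootstrap estimates for subset j.
        xis.append(statistics.stdev(thetas))
    # Average the s per-subset bootstrap SEs.
    return statistics.fmean(xis)

def weighted_mean(points, counts):
    """Sample mean of the resample, computed from points and multiplicities."""
    return sum(x * c for x, c in zip(points, counts)) / sum(counts)

rng = random.Random(2)
n = 5_000
data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # simulated data
b = round(n ** 0.6)  # hypothetical subset size, assumed for illustration
sd_hat = blb_sd(data, weighted_mean, b=b, s=5, r=20, rng=rng)
# The theoretical SD of the mean for this simulation is 1 / sqrt(5000), about 0.014.
print(sd_hat)
```

Passing counts instead of a materialised resample is the design choice that keeps each inner step at a cost proportional to \(b\) for estimators that accept weighted data.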

1.2.1 Jaccard (b=0.4 R=10 S=10) X (b=0.6 R=10 S=10)

1.2.1.1 Average Jaccard Distance

##        b=0.4 S=10   b=0.6 S=10
## [1,] 5.027871e-05 5.616129e-05

1.2.2 Frobenius (b=0.4 R=10 S=10) X (b=0.6 R=10 S=10)

1.2.2.1 Average Frobenius Distance

##        b=0.4 S=10   b=0.6 S=10
## [1,] 1.098333e-07 1.460337e-07

1.2.3 Jaccard (b=0.4 R=10 S=30) X (b=0.6 R=10 S=30)

1.2.3.1 Average Jaccard Distance

##        b=0.4 S=30   b=0.6 S=30
## [1,] 5.066064e-05 5.515175e-05

1.2.4 Frobenius (b=0.4 R=10 S=30) X (b=0.6 R=10 S=30)

1.2.4.1 Average Frobenius Distance

##        b=0.4 S=30   b=0.6 S=30
## [1,] 3.092863e-06 1.385951e-07

1.2.5 Jaccard (b=0.6 R=10 S=10) X (b=0.7 R=10 S=10)

1.2.5.1 Average Jaccard Distance

##        b=0.6 S=10   b=0.7 S=10
## [1,] 5.616129e-05 2.395781e-05

1.2.6 Frobenius (b=0.6 R=10 S=10) X (b=0.7 R=10 S=10)

1.2.6.1 Average Frobenius Distance

##        b=0.6 S=10   b=0.7 S=10
## [1,] 1.460337e-07 2.378013e-08

1.2.7 Jaccard (b=0.6 R=10 S=30) X (b=0.7 R=10 S=30)

1.2.7.1 Average Jaccard Distance

##        b=0.6 S=30  b=0.7 S=30
## [1,] 5.515175e-05 2.37764e-05

1.2.8 Frobenius (b=0.6 R=10 S=30) X (b=0.7 R=10 S=30)

1.2.8.1 Average Frobenius Distance

##        b=0.6 S=30   b=0.7 S=30
## [1,] 1.385951e-07 2.343538e-08

1.2.9 Jaccard (b=0.6 R=10 S=60) X (b=0.7 R=10 S=60)

1.2.9.1 Average Jaccard Distance

##        b=0.6 S=60   b=0.7 S=60
## [1,] 5.613607e-05 2.417312e-05

1.2.10 Frobenius (b=0.6 R=10 S=60) X (b=0.7 R=10 S=60)

1.2.10.1 Average Frobenius Distance

##        b=0.6 S=60  b=0.7 S=60
## [1,] 1.446439e-07 2.32966e-08

2 References

References, 14/09/2015
http://gregable.com/2007/10/reservoir-sampling.html
http://web.stanford.edu/~thairu/07_184.Guest.1sts.pdf
https://cran.r-project.org/web/packages/stream/stream.pdf
http://www.geeksforgeeks.org/reservoir-sampling/
http://engineering.bloomreach.com/mapreduce-fun-sampling-for-large-data-set/
http://codereview.stackexchange.com/questions/30764/review-of-reservoir-sampling
https://en.wikipedia.org/wiki/Reservoir_sampling
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/

References, 21/09/2015 (morning)
http://blog.cloudera.com/blog/2013/04/hadoop-stratified-randosampling-algorithm/
https://en.wikipedia.org/wiki/Reservoir_sampling
https://svn.apache.org/repos/asf/flume/branches/branch-0.9.5/flume-core/src/main/java/com/cloudera/util/ReservoirSampler.java

References, 21/09/2015 (evening)
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4076480/pdf/inj-18-50.pdf
http://arxiv.org/pdf/1502.07989v1.pdf
http://www.cs.umb.edu/~ding/papers/TKDE2013.pdf
http://www.cbs.nl/NR/rdonlyres/457A097A-DA43-4006-AFE0-A8E8316CFEF0/0/201411x10pub.pdf
file:///Users/gustavo/Downloads/21092015/[Agostino_Di_Ciaccio,_Mauro_Coli,_Jose_Miguel_Angu(BookFi.org).pdf
http://www.amazon.com/Statistical-Machine-Learning-Data-Mining-Techniques/dp/1439860912
http://simplystatistics.org/2014/05/07/why-big-data-is-in-trouble-they-forgot-about-applied-statistics/
http://sea.ucar.edu/sites/default/files/StatLearnBigData20130401.pdf
http://ac.els-cdn.com/S0268401214001066/1-s2.0-S0268401214001066-main.pdf?_tid=dfc9eb14-60b4-11e5-aa9f-00000aacb360&acdnat=1442876747_2e161cbfc445a566489d6c9c2e820e98
http://bigdata-madesimple.com/26-popular-techniques-for-analysing-big-data/
https://www.google.com.br/search?q=Big+Data+sampling+mechanisms&oq=Big+Data+sampling+mechanisms&aqs=chrome..69i57j69i60.391j0j7&sourceid=chrome&es_sm=119&ie=UTF-8#q=statistical+methods+big+data
https://www.researchgate.net/publication/271836790_Statistical_methods_for_big_data_A_scenic_tour

References, 23/09/2015
http://www.jstatsoft.org/article/view/v055i14
http://arxiv.org/pdf/1112.5016v2.pdf
https://github.com/tesseradata/datadr/blob/master/R/recombine_transforms.R
http://arxiv.org/pdf/1502.07989v1.pdf
http://www.sciencedirect.com/science/article/pii/S0268401214001066

References, 25/09/2015
http://statistics.unl.edu/faculty/bilder/categorical/Chapter3/Multinomial.R
http://www.uvm.edu/~dhowell/methods9/Supplements/R-Programs/multinomial.R
https://www.linkedin.com/pulse/bootstrap-big-data-ingenious-idea-bag-little-oscar-cassetti
http://pt.slideshare.net/WayneLee9/bagoflittlebootstrap
http://www.eecs.berkeley.edu/~ameet/blb_workshop.pdf
http://sta250.github.io/Stuff/lectures
http://sta250.github.io/Stuff/Lecture_08#22
http://sta250.github.io/Stuff/notes/Lecture_08_Notes_Fan.pdf

References, 05/10/2015
http://www.stat.harvard.edu/NRC2014/MichaelJordan.pdf
http://www.cs.berkeley.edu/~jegonzal/
http://www.cs.berkeley.edu/~jegonzal/jegonzal_thesis.pdf
http://stanford.edu/~rezab/dao/
http://artax.karlin.mff.cuni.cz/r-help/library/fitdistrplus/html/plotdist.html

References, 09/10/2015
http://artax.karlin.mff.cuni.cz/r-help/library/ks/html/kde.html

References, 12/10/2015
http://www.sagepub.com/sites/default/files/upm-binaries/21122_Chapter_21.pdf
http://data.princeton.edu/wws509/notes/c6.pdf
http://rpackages.ianhowson.com/rforge/bigmlogit/
http://arxiv.org/pdf/1311.6139.pdf
http://faculty.washington.edu/heagerty/Courses/b571/handouts/MultModels.pdf
https://cran.r-project.org/web/packages/glmnet/index.html
http://arxiv.org/pdf/1402.4089v1.pdf
http://arxiv.org/abs/1403.5693
https://github.com/HIPS/firefly-monte-carlo/blob/master/examples/toy_dataset.py
http://arxiv.org/pdf/1403.5693v1.pdf
http://jmlr.org/proceedings/papers/v32/bardenet14.pdf
http://gaoyang10.github.io/index.html/add_MCMC.pdf
https://github.com/BigBayes/bigbayes.github.io/wiki/BIBiD-2015
http://babaks.github.io/ScalableMonteCarlo/
https://nips2015.sched.org/event/2208561c74e3b6798df560c76f192e7e#.VhxuURNViko
http://cbl.eng.cam.ac.uk/pub/Intranet/MLG/ResearchAndCommunicationClub/scalableMCMC.pdf
http://machinelearning.wustl.edu/mlpapers/paper_files/icml2015_betancourt15.pdf

References, 17/10/2015
http://www.mathstat.helsinki.fi/msm/banocoss/2011/Presentations/Thompson_web.pdf
http://bigdatawg.nist.gov/FrontiersInMassiveDataAnalysisPrepub.pdf
http://arxiv.org/pdf/1308.1479v2.pdf
http://www.nap.edu/download.php?record_id=11098
http://infolab.stanford.edu/~ullman/mmds/bookL.pdf
https://users.soe.ucsc.edu/~optas/papers/kpca.pdf

A Scalable Bootstrap for Massive Data

Gustavo Lacerda, gustavolacerdas@ufmg.br

23/10/2015