Problem 1

write a function to compute the expected value and standard deviation of an array of values. Compare your results with that of R’s mean and std functions. Now, consider that instead of being able to neatly fit the values in memory in an array, you have an infinite stream of numbers coming by.How would you estimate the mean and standard deviation of such a stream?

Welford’s Method is an ideal way to compute means and variances of a dynamic, streaming set of numbers. It uses only one pass per addition of a stream to calculate both the mean and variance.

The stream, after it is calculated, does not need to be stored in memory while under normal methods, the original set of numbers would need to be appended to the existing set in memory before it can be calculated again.

Below is a class implementation for Welford’s Method using R’s R6 library, which allows for OO styled class objects and methods. The class is stripped down as much as possible (e.g. - portable = FALSE) due to the apparent increased overhead in run time relative to the base stat function in R. A list of random numbers needs to be stored in memory in order to compare the base function against Welford’s, but the reader should be clear that the StatStream class does not need the previous number stream to obtain new statistical properties of an addition number stream.

StatStream <- R6Class("StatSream",
            portable = FALSE,
            class = FALSE,
            cloneable = FALSE,

            private = list(
                mean_ = NA_real_,
                sd_ = NA_real_,
                var_ = NA_real_,
                n_ = NA_real_
            ),

            public = list(
                initialize = function() {
                    mean_ <<- 0.0
                    sd_ <<- 0.0
                    var_ <<- 0.0
                    n_ <<- 0
                    },
                
                get_mean = function() mean_,
                get_var = function() var_/(n_ - 1),
                get_sd = function() sqrt(var_/(n_ - 1)),
                get_size = function() n_,
                
                stream = function(x) {
                    for(i in 1:length(x)) {
                        old_mean <- mean_
                        
                        n_ <<- n_ + 1
                        mean_ <<- mean_ + (x[i] - mean_)/n_
                        var_ <<- var_ + (x[i] - mean_)*(x[i] - old_mean)
                    }
                    
                    cat("The current mean: ", mean_, "\n",
                        "The current standard deviaiton: ", sqrt(var_/(n_ - 1)), "\n", sep = "")
                    }
                )
)

Below we will instantiate a new StatSream object and stream a vector to it and compare it with the traditional methods in R.

set.seed(8675309)

x1 <- runif(1e6, 0,1)
x2 <- runif(1e6, 0,1)

ss_test <- StatStream$new()

ss_test$stream(x1)
## The current mean: 0.5000019
## The current standard deviaiton: 0.2888236
all.equal(mean(x1), ss_test$get_mean())
## [1] TRUE
all.equal(sd(x1), ss_test$get_sd())
## [1] TRUE

The methods above give the same result. Now lets introduce a new stream into our StatStream method and compare the results with R’s base functions.

ss_test$stream(x2)
## The current mean: 0.5001375
## The current standard deviaiton: 0.2887481
all.equal(mean(c(x1, x2)), ss_test$get_mean())
## [1] TRUE
all.equal(sd(c(x1, x2)), ss_test$get_sd())
## [1] TRUE

Once again, we get similar results. It is important to note that the run-time of the base mean and sd functions far surpass that of the class implementation of statStream; even though its as light weight as I could make it. Discovering why this is would be out of the scope of the assignment.

The main idea of the Welford Method in this implementation is to avoid storing these arrays in memory; the 1e6 set implemented above is ~ 8 Mb and a 1e10 array grows to ~75 Gb. Not having to deal with issues of memory management when working with data on this scale, or in a near infinite stream, would force the implementation akin to the method outlined above.