Problem Set 1

Please write a function to compute the expected value and standard deviation of an array of values. Compare your results with that of R’s mean and std functions. Please document your work in an RMarkdown file and ensure that you have good comments to help the reader follow your work.

Create a function to calculate the expected value.

Expected value = sum(x) / length(x) where x is a vector of numeric values

expected_value <- function(vector){
    
    return (sum(vector)/length(vector))
}

Create some test data to test the expected value function.

vector.test <- runif(500, 1, 10000)

Calculate and compare against the R built-in function mean.

(ev <- expected_value(vector.test))
## [1] 4993.608
mean(vector.test)
## [1] 4993.608
mean(vector.test) == ev
## [1] TRUE

Create a function to calculate the standard deviation.

For this function, we’ll be using the sample standard deviation calculation which is given by:

\(s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}\)

where

s = sample standard deviation
\(\sum\) = sum of…
\(\bar{x}\) = sample mean
n = number of scores in sample

Using the sample standard deviation definition seems to be more in line with R’s sd built-in command.

sample_std_dev <- function(vector) {

  mean <- expected_value(vector)
 
  return (sqrt(sum((vector-mean)^2)/(length(vector) - 1))) 
 
}

Calculate the standard deviation and compare against the R built-in function sd.

(ssd <- sample_std_dev(vector.test))
## [1] 2894.351
sd(vector.test)
## [1] 2894.351
sd(vector.test) == ssd
## [1] TRUE

Combining these two functions together into one function to calculate the mean and standard deviation.

get_mean_sd <- function(vector) {

  ev <- expected_value(vector)
  ssd  <- sample_std_dev(vector)

  return (c("mean" = ev, "sd" = ssd))


}

(get_mean_sd(vector.test))
##     mean       sd 
## 4993.608 2894.351

Now, consider that instead of being able to neatly fit the values in memory in an array, you have an infinite stream of numbers coming by. How would you estimate the mean and standard deviation of such a stream? Your function should be able to return the current estimate of the mean and standard deviation at any time it is asked. Your program should maintain these current estimates and return them back at any invocation of these functions. (Hint: You can maintain a rolling estimate of the mean and standard deviation and allow these to slowly change over time as you see more and more new values).

Create the R code to maintain the current estimates of the mean and standard deviation

options(scipen = 999)

global_length    <<- 0
global_sum       <<- 0
global_sum_sq    <<- 0

global_mean      <<- 0
global_std       <<- 0

init_globals <- function() {
    
    global_length    <<- 0
    global_sum       <<- 0
    global_sum_sq    <<- 0
    
    global_mean      <<- 0
    global_sd        <<- 0
    
}

## 
## Function: print_rolling_stats
##     Desc: when called, prints the current global values -- n, mean, and standard deviation
##   Output: none
##
print_rolling_stats <- function() {
    
  print(sprintf("n = %s; mean = %s;  sd = %s",   global_length,  global_mean, global_sd))
    
}

## 
## Function: rolling_stats
##    Input: a vector of numeric values, in theory an infinite number as a data stream
##     Desc: calculates an estimated running mean and standard deviation based on the vector data stream
##
rolling_stats <- function(vector) {
    
    
    for (element in vector) {
     
        global_length <<-    global_length + 1
        global_sum    <<-    global_sum + element  
        
        global_sum_sq <<- global_sum_sq + element^2  
        
        global_mean <<-   global_sum/global_length
        
        global_sd  <<- sqrt((global_length)* global_sum_sq - global_sum^2)/(global_length)

    
        if (global_length %% 250000 == 0) print_rolling_stats()
    }
    
    return (c("n" = global_length, "mean" = global_mean, "sd" = global_sd))

}

Let’s simulate a scenario of an infinite stream of numbers. We’ll create a large array of numbers for processing through the rolling_stats function above. While processing, the current estimates of the mean and standard deviation will be printed out at major intervals.

Test 1

  • Use rnorm to calculate a vector of 2,500,000 numbers with a mean of 25 and a standard deviation of 4.5
init_globals()
 
vector <- rnorm(n = 2500000, mean = 250, sd = 4.5 )

rolling_stats(vector)
## [1] "n = 250000; mean = 250.0023433335;  sd = 4.49142797451857"
## [1] "n = 500000; mean = 250.001822803191;  sd = 4.49776181048574"
## [1] "n = 750000; mean = 250.000817480524;  sd = 4.4993585010804"
## [1] "n = 1000000; mean = 250.002689314691;  sd = 4.49939683171245"
## [1] "n = 1250000; mean = 250.000816517315;  sd = 4.49820279656921"
## [1] "n = 1500000; mean = 250.000838282663;  sd = 4.49703823714844"
## [1] "n = 1750000; mean = 249.999083161066;  sd = 4.49827909742603"
## [1] "n = 2000000; mean = 249.998282447034;  sd = 4.49810309809102"
## [1] "n = 2250000; mean = 249.99793299028;  sd = 4.49794695747453"
## [1] "n = 2500000; mean = 249.999229167233;  sd = 4.49748285245656"
##              n           mean             sd 
## 2500000.000000     249.999229       4.497483

Calculated mean and standard deviation using R test vector:

mean(vector)
## [1] 249.9992
sd(vector)
## [1] 4.497484

Test 2

  • Use 5 million values between 0.1 and 100000
init_globals()

vector <- runif(5000000, 0.1, 100000)

rolling_stats(vector)
## [1] "n = 250000; mean = 50043.4766201461;  sd = 28879.4072491196"
## [1] "n = 500000; mean = 50063.9747707321;  sd = 28878.8348262797"
## [1] "n = 750000; mean = 50036.1598729163;  sd = 28886.9271088124"
## [1] "n = 1000000; mean = 50021.2343409425;  sd = 28897.4340716867"
## [1] "n = 1250000; mean = 50008.4971136104;  sd = 28884.5596843954"
## [1] "n = 1500000; mean = 50005.0155334842;  sd = 28881.3921008235"
## [1] "n = 1750000; mean = 49993.9565390894;  sd = 28878.8397340913"
## [1] "n = 2000000; mean = 49996.7895009039;  sd = 28875.9801724134"
## [1] "n = 2250000; mean = 50003.6041052044;  sd = 28876.9623120679"
## [1] "n = 2500000; mean = 50007.3828341462;  sd = 28877.8284553011"
## [1] "n = 2750000; mean = 50008.7457277896;  sd = 28873.1159488839"
## [1] "n = 3000000; mean = 50003.5725038927;  sd = 28871.2175449197"
## [1] "n = 3250000; mean = 49999.9444904926;  sd = 28871.2726029017"
## [1] "n = 3500000; mean = 50006.1668584592;  sd = 28870.8579000389"
## [1] "n = 3750000; mean = 50005.2929409367;  sd = 28875.4975473185"
## [1] "n = 4000000; mean = 49998.0941268072;  sd = 28878.2961125891"
## [1] "n = 4250000; mean = 50001.4245133331;  sd = 28880.543753124"
## [1] "n = 4500000; mean = 50005.7251745774;  sd = 28877.6730371267"
## [1] "n = 4750000; mean = 50003.1313168644;  sd = 28875.2187320867"
## [1] "n = 5000000; mean = 49999.9836411155;  sd = 28874.908385731"
##          n       mean         sd 
## 5000000.00   49999.98   28874.91

Calculated mean and standard deviation using R on the test vector:

mean(vector)
## [1] 49999.98
sd(vector)
## [1] 28874.91