Please write a function to compute the expected value and standard deviation of an array of values. Compare your results with that of R’s mean and std functions. Please document your work in an RMarkdown file and ensure that you have good comments to help the reader follow your work.
Expected value = sum(x) / length(x) where x is a vector of numeric values
expected_value <- function(vector){
return (sum(vector)/length(vector))
}
vector.test <- runif(500, 1, 10000)
mean.(ev <- expected_value(vector.test))
## [1] 4993.608
mean(vector.test)
## [1] 4993.608
mean(vector.test) == ev
## [1] TRUE
For this function, we’ll be using the sample standard deviation calculation which is given by:
\(s = \sqrt{\frac{1}{N-1} \sum_{i=1}^N (x_i - \overline{x})^2}\)
where
s = sample standard deviation
\(\sum\) = sum of…
\(\bar{x}\) = sample mean
n = number of scores in sample
Using the sample standard deviation definition seems to be more in line with R’s sd built-in command.
sample_std_dev <- function(vector) {
mean <- expected_value(vector)
return (sqrt(sum((vector-mean)^2)/(length(vector) - 1)))
}
sd.(ssd <- sample_std_dev(vector.test))
## [1] 2894.351
sd(vector.test)
## [1] 2894.351
sd(vector.test) == ssd
## [1] TRUE
get_mean_sd <- function(vector) {
ev <- expected_value(vector)
ssd <- sample_std_dev(vector)
return (c("mean" = ev, "sd" = ssd))
}
(get_mean_sd(vector.test))
## mean sd
## 4993.608 2894.351
Now, consider that instead of being able to neatly fit the values in memory in an array, you have an infinite stream of numbers coming by. How would you estimate the mean and standard deviation of such a stream? Your function should be able to return the current estimate of the mean and standard deviation at any time it is asked. Your program should maintain these current estimates and return them back at any invocation of these functions. (Hint: You can maintain a rolling estimate of the mean and standard deviation and allow these to slowly change over time as you see more and more new values).
options(scipen = 999)
global_length <<- 0
global_sum <<- 0
global_sum_sq <<- 0
global_mean <<- 0
global_std <<- 0
init_globals <- function() {
global_length <<- 0
global_sum <<- 0
global_sum_sq <<- 0
global_mean <<- 0
global_sd <<- 0
}
##
## Function: print_rolling_stats
## Desc: when called, prints the current global values -- n, mean, and standard deviation
## Output: none
##
print_rolling_stats <- function() {
print(sprintf("n = %s; mean = %s; sd = %s", global_length, global_mean, global_sd))
}
##
## Function: rolling_stats
## Input: a vector of numeric values, in theory an infinite number as a data stream
## Desc: calculates an estimated running mean and standard deviation based on the vector data stream
##
rolling_stats <- function(vector) {
for (element in vector) {
global_length <<- global_length + 1
global_sum <<- global_sum + element
global_sum_sq <<- global_sum_sq + element^2
global_mean <<- global_sum/global_length
global_sd <<- sqrt((global_length)* global_sum_sq - global_sum^2)/(global_length)
if (global_length %% 250000 == 0) print_rolling_stats()
}
return (c("n" = global_length, "mean" = global_mean, "sd" = global_sd))
}
Let’s simulate a scenario of an infinite stream of numbers. We’ll create a large array of numbers for processing through the rolling_stats function above. While processing, the current estimates of the mean and standard deviation will be printed out at major intervals.
init_globals()
vector <- rnorm(n = 2500000, mean = 250, sd = 4.5 )
rolling_stats(vector)
## [1] "n = 250000; mean = 250.0023433335; sd = 4.49142797451857"
## [1] "n = 500000; mean = 250.001822803191; sd = 4.49776181048574"
## [1] "n = 750000; mean = 250.000817480524; sd = 4.4993585010804"
## [1] "n = 1000000; mean = 250.002689314691; sd = 4.49939683171245"
## [1] "n = 1250000; mean = 250.000816517315; sd = 4.49820279656921"
## [1] "n = 1500000; mean = 250.000838282663; sd = 4.49703823714844"
## [1] "n = 1750000; mean = 249.999083161066; sd = 4.49827909742603"
## [1] "n = 2000000; mean = 249.998282447034; sd = 4.49810309809102"
## [1] "n = 2250000; mean = 249.99793299028; sd = 4.49794695747453"
## [1] "n = 2500000; mean = 249.999229167233; sd = 4.49748285245656"
## n mean sd
## 2500000.000000 249.999229 4.497483
Calculated mean and standard deviation using R test vector:
mean(vector)
## [1] 249.9992
sd(vector)
## [1] 4.497484
init_globals()
vector <- runif(5000000, 0.1, 100000)
rolling_stats(vector)
## [1] "n = 250000; mean = 50043.4766201461; sd = 28879.4072491196"
## [1] "n = 500000; mean = 50063.9747707321; sd = 28878.8348262797"
## [1] "n = 750000; mean = 50036.1598729163; sd = 28886.9271088124"
## [1] "n = 1000000; mean = 50021.2343409425; sd = 28897.4340716867"
## [1] "n = 1250000; mean = 50008.4971136104; sd = 28884.5596843954"
## [1] "n = 1500000; mean = 50005.0155334842; sd = 28881.3921008235"
## [1] "n = 1750000; mean = 49993.9565390894; sd = 28878.8397340913"
## [1] "n = 2000000; mean = 49996.7895009039; sd = 28875.9801724134"
## [1] "n = 2250000; mean = 50003.6041052044; sd = 28876.9623120679"
## [1] "n = 2500000; mean = 50007.3828341462; sd = 28877.8284553011"
## [1] "n = 2750000; mean = 50008.7457277896; sd = 28873.1159488839"
## [1] "n = 3000000; mean = 50003.5725038927; sd = 28871.2175449197"
## [1] "n = 3250000; mean = 49999.9444904926; sd = 28871.2726029017"
## [1] "n = 3500000; mean = 50006.1668584592; sd = 28870.8579000389"
## [1] "n = 3750000; mean = 50005.2929409367; sd = 28875.4975473185"
## [1] "n = 4000000; mean = 49998.0941268072; sd = 28878.2961125891"
## [1] "n = 4250000; mean = 50001.4245133331; sd = 28880.543753124"
## [1] "n = 4500000; mean = 50005.7251745774; sd = 28877.6730371267"
## [1] "n = 4750000; mean = 50003.1313168644; sd = 28875.2187320867"
## [1] "n = 5000000; mean = 49999.9836411155; sd = 28874.908385731"
## n mean sd
## 5000000.00 49999.98 28874.91
Calculated mean and standard deviation using R on the test vector:
mean(vector)
## [1] 49999.98
sd(vector)
## [1] 28874.91