1. Problem Set 1
Please write a function to compute the expected value and standard deviation of an array of values.
Compare your results with those of R’s mean and sd functions.
Solution:
# Expected value: sum(x) / length(x)
exp_value <- function(v) {
  # Returns NULL for an empty vector, where the mean is undefined
  if (length(v) != 0)
    return(sum(v) / length(v))
}
# validating the mean using mean()
a<- c(1,2,3,4,5)
(exp_value(a) == mean(a))
## [1] TRUE
Using the formula for the sample standard deviation:
s = sqrt( sum( (x - mean(x))^2 ) / (N - 1) )
where:
s = the sample standard deviation
x = each value in the sample
mean(x) = the mean of the values
N = the sample size
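As a quick illustration of the formula, the intermediate quantities can be computed step by step for a small sample. This is only a sketch; the names x_bar and sq_dev are chosen here for illustration and do not appear in the solution.

```r
# Worked example of the sample standard deviation for 1..5
x <- c(1, 2, 3, 4, 5)
x_bar <- sum(x) / length(x)                 # mean of the sample: 3
sq_dev <- (x - x_bar)^2                     # squared deviations: 4 1 0 1 4
s <- sqrt(sum(sq_dev) / (length(x) - 1))    # sqrt(10 / 4), about 1.581139
```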
std_deviation <- function(v) {
  # The sample standard deviation needs at least two values (N - 1 > 0)
  if (length(v) > 1)
    return(sqrt(sum((v - exp_value(v))^2) / (length(v) - 1)))
}
# validating the standard deviation using sd()
a<- c(1,2,3,4,5)
(std_deviation(a) == sd(a))
## [1] TRUE
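The checks above compare floating-point results with ==, which happens to succeed here but can fail when two mathematically equal results differ in their last bits. A more tolerant comparison, sketched below with made-up example values, uses all.equal() wrapped in isTRUE().

```r
# Example values chosen for illustration only
x <- c(0.1, 0.2, 0.3)
mean_manual <- sum(x) / length(x)

# Exact comparison: the result depends on floating-point rounding
exact_match <- mean_manual == mean(x)

# Tolerant comparison: TRUE when the values agree within a small tolerance
close_match <- isTRUE(all.equal(mean_manual, mean(x)))
```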
Now, consider that instead of being able to neatly fit the values in memory in an array,
you have an infinite stream of numbers coming by. How would you estimate the mean and
standard deviation of such a stream? Your function should be able to return the current
estimate of the mean and standard deviation at any time it is asked. Your program should
maintain these current estimates and return them back at any invocation of these functions.
(Hint: You can maintain a rolling estimate of the mean and standard deviation and allow
these to slowly change over time as you see more and more new values).
# Global buffer holding every value seen so far; NA marks an empty stream.
# Note: all values are kept, so memory use grows with the stream.
stream <- NA
rollingfunc <- function(x) {
  # On the first call, drop the NA placeholder so the buffer starts empty
  if (is.na(stream[1])) stream <<- stream[-1]
  # Append the new values to the global buffer
  stream <<- c(stream, x)
  # Print every value seen so far
  print(data.frame(stream))
  # Return the mean and standard deviation of everything seen so far
  return(data.frame(mean = exp_value(stream),
                    std = std_deviation(stream)))
}
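The function above stores every value it has seen, which cannot work for a truly infinite stream. One common alternative, which is not the approach used in this solution, is Welford's online algorithm. It keeps only a count, a running mean, and a running sum of squared deviations. The sketch below assumes a hypothetical helper named make_running_stats() that returns a pair of closures.

```r
# Sketch: running mean and sample standard deviation in constant memory
# (Welford's online algorithm). make_running_stats() is a hypothetical name.
make_running_stats <- function() {
  n <- 0    # number of values seen so far
  mu <- 0   # running mean
  m2 <- 0   # running sum of squared deviations from the mean

  update <- function(x) {
    # Fold each new value into the running totals, one at a time
    for (xi in x) {
      n <<- n + 1
      delta <- xi - mu
      mu <<- mu + delta / n
      m2 <<- m2 + delta * (xi - mu)
    }
    invisible(NULL)
  }

  current <- function() {
    # The mean needs at least one value, the sample sd at least two
    m <- if (n > 0) mu else NA_real_
    s <- if (n > 1) sqrt(m2 / (n - 1)) else NA_real_
    data.frame(mean = m, std = s)
  }

  list(update = update, current = current)
}

# Usage: feed the stream in chunks and ask for the estimates at any time
rs <- make_running_stats()
rs$update(c(1, 2, 3, 4, 5))
rs$update(c(10, 55, 22))
est <- rs$current()
```

Because the state lives inside the closure rather than in a global variable, several independent streams can be tracked at once by calling make_running_stats() more than once.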
Testing
stream<<-NA
a<- c(1,2,3,4,5)
b<- c(10,55,22)
c<- c(11,22,33,44,55,10,55,22, 66)
d<- 1:20
e<- c(a,b,c,d)
# initial test
rollingfunc(a)
## stream
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## mean std
## 1 3 1.581139
mean(a)
## [1] 3
sd(a)
## [1] 1.581139
rollingfunc(b)
## stream
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 10
## 7 55
## 8 22
## mean std
## 1 12.75 18.37506
rollingfunc(c)
## stream
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 10
## 7 55
## 8 22
## 9 11
## 10 22
## 11 33
## 12 44
## 13 55
## 14 10
## 15 55
## 16 22
## 17 66
## mean std
## 1 24.70588 22.23107
k<- rollingfunc(d)
## stream
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 10
## 7 55
## 8 22
## 9 11
## 10 22
## 11 33
## 12 44
## 13 55
## 14 10
## 15 55
## 16 22
## 17 66
## 18 1
## 19 2
## 20 3
## 21 4
## 22 5
## 23 6
## 24 7
## 25 8
## 26 9
## 27 10
## 28 11
## 29 12
## 30 13
## 31 14
## 32 15
## 33 16
## 34 17
## 35 18
## 36 19
## 37 20
# Verification: e is the concatenation of a, b, c, and d, so its mean and
# standard deviation should match the rolling results returned in k.
mean(e) == k$mean
## [1] TRUE
sd(e) == k$std
## [1] TRUE