Problem Set 1

Given Array

The function below takes an vector and returns its expected value and standard deviation.

sum_stats <- function(input) {
  # get unique values in vector and count, convert to number
  input_tbl <- as.data.frame(table(input))
  names(input_tbl)[1] <- "x"
  input_tbl$x <- as.numeric(as.character(input_tbl$x))
  # get probability for each value of x
  input_tbl$p <- input_tbl$Freq / sum(input_tbl$Freq)
  #return E(X) & sd(X)
  E_x <- sum(input_tbl$x * input_tbl$p)
  var_x <- sum(input_tbl$x * input_tbl$x * input_tbl$p) - E_x^2
  c("E(x)" = E_x, "sd(x)" = sqrt(var_x), "N" = length(input))
}

Testing with an array and comparing to the builtin mean and sd functions:

set.seed(42)
a <- round(runif(n = 100, min = 0, max = 20))
sum_stats(a)
      E(x)      sd(x)          N 
 10.530000   6.035652 100.000000 
c("mean" = mean(a), "sd" = sd(a))
     mean        sd 
10.530000  6.066059 

The expected value calculated is equivalent to the mean. The standard deviation differs by a small amount. Investigation reveals that the sd function calculates the sample standard deviation – that is, it divides by \(\sqrt{n-1}\) in the denominator. Adjusting this to calculate the population standard deviation (i.e. dividing by \(\sqrt{n}\)) returns the same value as the designed function:

(pop_sd <- sd(a) * sqrt(length(a) - 1) / sqrt(length(a)))
[1] 6.035652
round(pop_sd, 6) == round(sum_stats(a)[[2]], 6)
[1] TRUE

Stream of Values

In order to calculate the exected value and standard deviation of a stream of values, running values for these figures are tracked and updated as additional data is added. Each time a new array of data \(Y\) is added to the existing data \(X\), the expected value and variance of the combined data are calculated by \[\begin{array} E(X \cup Y) &= \frac{N_X E(X) + N_Y E(Y)}{N_X + N_Y} \\ Var(X \cup Y) &= \frac{N_X Var(X) + N_Y Var(Y) + N_X (E(X) - E(X \cup Y))^2 + N_Y (E(Y) - E(X \cup Y))^2}{N_X + N_Y} \end{array}\]

running_stats <- function(newinput) {
  #read in existing stats; 0 if not defined
  E_X <- ifelse(exists('E'), E, 0)
  var_X <- ifelse(exists('SD'), SD^2, 0)
  N_X <- ifelse(exists('N'), N, 0)
  
  #get figures for new array
  E_Y <- sum_stats(newinput)[[1]]
  var_Y <- sum_stats(newinput)[[2]]^2
  N_Y  <- sum_stats(newinput)[[3]]
  
  #calculate & store new stats
  E_tot <- (N_X * E_X + N_Y * E_Y) / (N_X + N_Y)
  var_tot <- (N_X * var_X + N_Y * var_Y + N_X * (E_X - E_tot)^2 + N_Y * (E_Y - E_tot)^2) / (N_X + N_Y)
  E <<- E_tot
  SD <<- sqrt(var_tot)
  N <<- N_X + N_Y
  
  #return values
  c("E" = E, "SD" = SD, "N" = N)
}

This can be testing using the previous vector a and a new vector b.

set.seed(24)
b <- round(runif(n = 20, min = 0, max = 20))
running_stats(a)
         E         SD          N 
 10.530000   6.035652 100.000000 
running_stats(b)
         E         SD          N 
 10.391667   5.896745 120.000000 

The values returned can be seen to be equivalent to both the sum_stats function of the joined arrays, as well as the calculation of the build-in mean and sd (for population) functions on the joined arrays.

sum_stats(c(a, b))
      E(x)      sd(x)          N 
 10.391667   5.896745 120.000000 
mean(c(a, b))
[1] 10.39167
sd(c(a, b)) * sqrt(length(c(a, b)) - 1) / sqrt(length(c(a, b)))
[1] 5.896745