Why in sample SD we divide by (N-1)

Step 1

Ever wondered why we divide by ‘N’ for a population standard deviation (SD) and divide by ‘n-1’ for sample SD? We try to show you that dividing the sample values by (n-1) allows us to better approximate the population SD.

We will use a bit a simulation here. So, along with learning about SD, you will also learn about simulation, creating functions in R and creating loops.

To start we create a population of 5000 numbers, uniformly varying between 20 and 120.

pop <- runif(5000,20,120)
mean(pop)

## [1] 70.12015

The mean had to be around 70 [=(20+120)/2]

We do not have to generate uniformly varying random numbers. You can try different distributions. Pop1 creates normally distributed random numbers with mean 100 and SD as 25. Pop2 creates exponentially varying random numbers with mean 100. Feel free to change the population distribution to check the outcomes. We will use the uniformly distributed set of numbers (pop)

pop1 <- rnorm(5000,100,25)
pop2 <- rexp(5000,1/100)

Step 2

We will now create functions to find the population and sample standard deviation. sdp is the function for population standard deviation and sds is the function for sample standard deviation

sdp <- function(x){
  sqrt(sum((x-mean(x))^2)/length(x))
}
sds <- function(x){
  sqrt(sum((x-mean(x))^2)/(length(x)-1))
}

Since pop is the population we should find the SD by using sdp

sdp(pop)

## [1] 28.58355

Our sample SD should come as close to this value of population SD as possible.

Step 3

We will now create a sample of 40 randomly selected numbers from pop and find the SD of the sample using the sdp and sds functions. Ideally, the value returned by the sds function should be closer to the population SD value.

samp <- sample(pop, 40, replace = FALSE)
sdp(samp)

## [1] 27.35996

sds(samp)

## [1] 27.70851

As you can see, the value from sds is closer to the population SD than sdp. This means that when we divide the sample values by ‘n-1’ we get a better approximation of the population values.

Step 4

But wait!! What if this was a random chance that this happened? Will this work every time? Lets check. We will repeat this step a 1000 times and then take the average of SD calculated using sds and sdp. Ideally, the sds average should be closer to population SD.

We create a matrix to store the values that we generate. We need 1000 rows and 2 columns - first column for sds and second for sdp. We initially assume that all values are 0.

sdmatrix <- matrix(0,1000,2)

We create a 1000 samples from the population, each with sample size of 40 and find SD with sds and sdp

for (i in 1:1000){
  samp <- sample(pop, 40, replace = FALSE)
  sdmatrix[i,1] <- sds(samp)
  sdmatrix[i,2] <- sdp(samp)
}

c(mean(sdmatrix[,1]),mean(sdmatrix[,2]))

## [1] 28.54355 28.18450

Here the first value represents the mean of SD values created by dividing by ‘n-1’ and the second value represents mean of SD values created by dividing by ‘n’. And as you can see, even in averages, the ‘n-1’ based formula is a better approximation of the population SD.

When we divide by ‘n-1’, the value of SD we obtain from the sample is thus a better approximation of the population SD.

Step 5

I have assumed a sample size of 40 here. Try this with smaller samples, or larger samples. Does sds always give values closer to the population SD?

Try different population distributions. Is sds a better approximation for all kinds of populations?