Recently I’ve been interested in developing a basic Risk Index which would be leveraged to easily assess the changing risk profile of a large portfolio of fixed assets. Based on some internal prioritization, a subset of these assets are replaced annually as they age, become degraded, or otherwise become obsolete. The thing is, the total pool of these assets is huge, with the assets themselves spread across over 2700 locations nationally. As such, it’s difficult to get a sense of aggregate risk, and of the impact capital investment has on it.

One solution is to develop a basic risk audit which would be conducted periodically (say semi-annually), producing a risk score we would then convert to an index number. This would allow us to easily assess the percentage change in aggregate risk over time. The only issue is, it would be way too time consuming to do this for every location. The logical choice is to take a representative random sample, and only conduct the audit on those sampled sites (with a new sample taken for each audit). This would effectively constitute a cross-sectional study.

Here I’ll talk a bit about sample size determination, balancing our need for confidence with real-world considerations such as cost.

The bigger the sample size, the better

It’s first important to understand what you want out of your sample. In my case, I simply want to infer something about the broader population, specifically the mean level of risk (and how that risk level is changing over time). Naturally, I want to have some level of confidence in my inference.

Assuming a sample of size n which is independent and identically distributed, we know the standard error of the sample mean is simply the standard deviation of the population, divided by the square root of the sample size. Since we don’t actually know the population SD, we’ll estimate this with the sample SD instead. In R this is sd(x)/sqrt(length(x)) where x is a vector of length n. Since the parameter we’re estimating is the mean, this is actually called the Standard Error of the Mean, and we’ll see it gets smaller the larger we make our sample.

Let’s start by modelling what my population of risk scores might look like. We’ll create a function for computing the SEM, then draw 100 samples of increasing size, calculate the SEM for each, and plot the results.

library(ggplot2)

set.seed(2000)

#model the population using 2700 uniformly drawn integers
population <- ceiling(runif(2700, 150, 250))
samp <- list()
samples <- list()
sem <- numeric()
conf <- list()

#function to compute the SEM
calc_sem <- function(x) {
  sd(x)/sqrt(length(x))
}

#Calculate SEM for samples ranging in size from 1 to 100 observations
#(the SEM is NA for i = 1, since sd() is undefined for a single value)
for (i in 1:100) {
  samp[[i]] <- sample(population, i)
  sem[i] <- calc_sem(samp[[i]])
}

#creating a DF and adding a col for the sample size
sem_df<-data.frame(sem)
sem_df["size"]<-c(1:100)

#plot the SEM
ggplot(sem_df,aes(x = size, y = sem))+geom_point()+xlab("Sample Size")+ylab("Standard Error of the Mean")

In the above plot we see exactly what we’d expect: that the SEM drops as our sample size increases. But why do we care about SEM?
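The points in the plot are noisy because each SEM is estimated from a different random sample. As a rough sketch of the underlying trend, assuming we are allowed to peek at the SD of the whole modelled population and ignoring any finite population correction, the theoretical SEM can be computed directly (`theo_sem` and `sem_theory` are names I made up for illustration):

```r
set.seed(2000)
population <- ceiling(runif(2700, 150, 250))  # same modelled population as above

# theoretical SEM based on the SD of the full modelled population
# (assumption: no finite population correction is applied)
theo_sem <- function(n) sd(population) / sqrt(n)

sizes <- c(25, 50, 60, 100)
sem_theory <- data.frame(size = sizes, sem = round(theo_sem(sizes), 2))
sem_theory
```

Because the SEM scales with one over the square root of n, quadrupling the sample size (25 to 100) only halves the SEM, which is why the gains flatten out as the sample grows.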

Computing and Interpreting Confidence Intervals

With large enough sample sizes, the trusty Central Limit Theorem tells us the sampling distribution of the mean will be approximately normal, meaning we can calculate meaningful confidence intervals without much concern for the underlying distribution of our data. In loose terms, confidence intervals allow us to communicate how confident we are that the parameter lies in the interval. A more technically accurate interpretation is that if we were to take many repeated samples and calculate a 95% confidence interval from each, about 95% of those intervals would contain the true population mean. A rougher, related check is to fix one interval and see how many repeated sample means land inside it. That proportion is close to 95% only when the interval happens to be centred near the population mean, but it is easy to show with some old school R.

#calculate the 95% CI for a sample of size n=100 drawn from our population
ci_test<-t.test(sample(population, 100),conf.level=0.95)

#take 500 repeated samples of size n=100
for (i in 1:500) {
  samples[[i]]<-sample(population, 100)
}

#calculate the means for each sample
sample_means <- data.frame(sapply(samples, function(x){mean(x)}))
sample_means$in_CI <- NA

#loop through each mean and check if it lands in our 95% CI. Flag as 1 if so, 0 if it falls outside
for (i in 1:500) {
  if ((sample_means[i,1]<ci_test$conf.int[[2]]) & (sample_means[i,1]>ci_test$conf.int[[1]])) {
    sample_means[i,2] <- 1
  }
  else {
    sample_means[i,2] <- 0
  }
}

sum(sample_means$in_CI)/length(sample_means$in_CI)

## [1] 0.962

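So, with 500 repeated samples and a 95% confidence interval of (194.34, 206.06), we see that 0.962 of our sample means landed in the interval. That is close to 95% because this particular interval evidently sits close to the centre of the sampling distribution. Strictly, the 95% refers to the share of intervals, each built from its own sample, that would capture the true population mean.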
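The interval-based reading of a 95% confidence level (about 95% of intervals built from repeated samples should capture the true mean) can also be checked directly, since in this simulation we know the population mean. This is a minimal sketch under the same modelled population. The names `true_mean`, `covers`, and `coverage` are mine, not from the original analysis:

```r
set.seed(2000)
population <- ceiling(runif(2700, 150, 250))  # same modelled population as above
true_mean <- mean(population)                 # known only because we simulated it

# build 500 separate 95% intervals, each from its own sample of 100,
# and record whether each one contains the true population mean
covers <- replicate(500, {
  ci <- t.test(sample(population, 100), conf.level = 0.95)$conf.int
  ci[1] <= true_mean && true_mean <= ci[2]
})

coverage <- mean(covers)  # proportion of intervals that captured the true mean
coverage
```

The proportion should land in the neighbourhood of 0.95, and unlike the single-interval check it does not depend on one interval happening to be well centred.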

In any case, if I make a claim regarding the mean risk level of the population, it would be useful to know how confident I am that the population mean is indeed within a given interval.

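For an unknown population standard deviation, the confidence interval is defined as follows:

$$\bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $t_{\alpha/2,\,n-1}$ is the critical value of the t distribution with $n-1$ degrees of freedom.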

Note that you can recognize the SEM in the above equation. The Confidence Interval, SEM, and Margin of Error are all linked together. As our sample size increases, the SEM will decrease, which in turn will tighten our confidence interval, and decrease our margin of error.
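To make the link between the SEM, the margin of error, and the confidence interval concrete, here is a small sketch that builds a 95% interval by hand (sample mean plus or minus a t critical value times the SEM) and compares it with what `t.test` reports. The sample size of 50 and the variable names are just illustrative choices:

```r
set.seed(2000)
population <- ceiling(runif(2700, 150, 250))  # same modelled population as above

x <- sample(population, 50)  # illustrative sample size
n <- length(x)

sem    <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value
margin <- t_crit * sem             # margin of error (half the interval width)

manual_ci  <- c(mean(x) - margin, mean(x) + margin)
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

# the hand-built interval should agree with t.test
all.equal(manual_ci, builtin_ci)
```

Since the margin of error is just the critical value times the SEM, anything that shrinks the SEM (a bigger n) shrinks the interval with it.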

We could use either traditional or bootstrapped (resampled) methods to produce our confidence intervals. We’ll use a traditional t-test method here. We’ll produce Confidence Intervals for a given confidence level (95%), and then graph the size of the intervals. Note that for this to be meaningful, we’ll start the sample size off at 10.

#Calculate CI for samples ranging in size from 10 to 100 observations
for (i in 10:100) {
  conf[[i-9]] <- t.test(samp[[i]], conf.level=0.95)
}

#create a vector of the size of each interval using sapply then convert to a dataframe for graphing
intervals<-sapply(conf, function(x){x$conf.int[2]-x$conf.int[1]})
intervals_df<-data.frame(intervals)
intervals_df["size"]<-c(10:100)

#plot the interval lengths
ggplot(intervals_df,aes(x = size, y = intervals))+geom_bar(stat="identity")+xlab("Sample Size")+ylab("Confidence Interval Length")

It’s a bit tricky to understand, but if I want a confidence level of 95%, this means (again, loosely) that I’m 95% confident the interval contains the true population mean. But to achieve this level of confidence with a small sample size, I must accept a larger margin of error, which is the half-width of the interval: the furthest the population mean is expected to sit from my sample mean at that confidence level. We can see this in the fact that my confidence interval is wider the smaller my sample gets, for a given level of confidence.

So the short of it is, just like our SEM shrinks as the sample size gets larger, so too does our confidence interval (for a given confidence level). The thing is that in real life, we have other considerations than strictly confidence levels.

Balancing marginal costs and benefits

If I head over to a tool such as Survey Monkey’s Sample Size Calculator, I see that for a 95% confidence level and a 5% margin of error (based on our population size of 2700), I need a sample size of 337. This simply won’t work for me from a business standpoint. If a given audit takes 4 hours to complete, at a labour rate of $100/hr, I’m looking at $100 × 4 × 337 = $134,800 in cost for each round of audits, to be conducted semi-annually. It won’t fly.
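For reference, the 337 figure can be reproduced with the standard sample size formula for a proportion plus a finite population correction. I’m assuming that is roughly what the calculator does behind the scenes, with the most conservative proportion of 0.5. The `sample_size` helper below is a hypothetical function of my own, not part of any package:

```r
# hypothetical helper: sample size for estimating a proportion,
# adjusted for a finite population of size N
sample_size <- function(N, conf_level = 0.95, margin = 0.05, p = 0.5) {
  z  <- qnorm(1 - (1 - conf_level) / 2)   # two-sided z critical value
  n0 <- z^2 * p * (1 - p) / margin^2      # sample size for an infinite population
  ceiling(n0 / (1 + (n0 - 1) / N))        # finite population correction
}

n_required <- sample_size(2700)
audit_cost <- n_required * 4 * 100  # 4 hours per audit at $100/hr

n_required
audit_cost
```

Note that this formula targets a margin of error on a proportion, which is not quite the same thing as a margin of error on a mean risk score, so it is best read as a ballpark for what "theory" would ask for.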

The idea, in my case anyway, is to balance incremental cost with incremental tightening of the confidence interval (or reduction of the margin of error) at a 95% confidence level. Let’s look at a few different sample sizes.

##     n  SEM Margin_Error    Cost
## 1  40 5.10        10.32 $16,000
## 2  50 4.13         8.29 $20,000
## 3  60 4.00         8.01 $24,000
## 4 100 2.91         5.77 $40,000
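The code behind this table isn’t shown, so the following is only a sketch of how a comparison like it might be produced: for each candidate n, draw a sample, compute the SEM, multiply by the t critical value to get the margin of error at 95% confidence, and cost the audits at 4 hours and $100/hr each. The function and column names are my own, and because the samples are random the figures will not match the table exactly:

```r
set.seed(2000)
population <- ceiling(runif(2700, 150, 250))  # same modelled population as above

hours_per_audit <- 4
labour_rate     <- 100  # dollars per hour

# hypothetical helper: one row of the comparison for a given sample size
compare_size <- function(n) {
  x   <- sample(population, n)
  sem <- sd(x) / sqrt(n)
  data.frame(
    n            = n,
    SEM          = round(sem, 2),
    Margin_Error = round(qt(0.975, df = n - 1) * sem, 2),
    Cost         = n * hours_per_audit * labour_rate
  )
}

size_table <- do.call(rbind, lapply(c(40, 50, 60, 100), compare_size))
size_table
```

Laying cost next to margin of error this way makes the trade-off explicit: each extra $4,000 buys a smaller and smaller reduction in uncertainty.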

While decidedly un-scientific, my reasoning is this: going from 50 to 60 observations yields only a minor SEM improvement (4.13 to 4.00) for an extra $4,000 per round of audits. Given the additional cost (and especially the attention paid to additional opex), I’m happy to go with a sample size of 50, which is far less than we would target if applying theory alone. It will fit my budget while still providing insight into the broader aggregate risk with some level of confidence.

My blog posts always align with a real project I happen to be working on at the time. I find that writing out my thought process helps me think through things more clearly. I hope this post has helped you gain a better understanding about sample size determination and confidence intervals.

Exploring Sample Size and Confidence in R

Phil Jette

December 18, 2018
