Let's say we wanted to know how much the diameter at breast height (dbh) varies among the trees in a research plot in the Appalachians (our data are in livetrees.csv, from Clark 2007). We could compute the standard deviation using sd() and visualize the distribution using hist():
## Load Data
trees <- read.csv("/Users/seth/Dropbox/GEOG 5023/GEOG 5023 - Spring 2013/Data/livetrees.csv")
## SD
sd(trees$dbh, na.rm = T) #remove missing values
## [1] 16.02
## Histogram
hist(trees$dbh)
We could create a crude map of the data using:
symbols(trees$x, trees$y, circles = trees$dbh/20, inches = F)
We can take two random samples of these trees, one of 100 and one of 500:
samp.100.rows <- sample(1:nrow(trees), 100) #randomly sample rows of the table
trees.samp1 <- trees[samp.100.rows, ] #slice selected rows and all columns from the data.frame
samp.500.rows <- sample(1:nrow(trees), 500) #randomly sample rows of the table
trees.samp2 <- trees[samp.500.rows, ] #slice selected rows and all columns from the data.frame
# use points to overlay the 100 sampled locations onto original map
symbols(trees$x, trees$y, circles = trees$dbh/20, inches = F)
points(trees.samp1$x, trees.samp1$y, pch = 16, cex = 1, col = "red")
Each sample has a mean and a standard deviation, which we expect to be close to those of the population. The size of the sample doesn't affect this expectation; it affects how far from the population values a given sample is likely to fall:
# MEANS
mean(trees$dbh, na.rm = T) #mean of the population
## [1] 11.72
mean(trees.samp1$dbh, na.rm = T) #mean of the 100 tree sample
## [1] 13.11
mean(trees.samp2$dbh, na.rm = T) #mean of the 500 tree sample
## [1] 11.21
# SD
sd(trees$dbh, na.rm = T) #SD of the population
## [1] 16.02
sd(trees.samp1$dbh, na.rm = T) #SD of the 100 tree sample
## [1] 18.52
sd(trees.samp2$dbh, na.rm = T) #SD of the 500 tree sample
## [1] 15.78
The means and the standard deviations above are properties of groups of real trees. We could go and visit these trees. However…
We want to do something more abstract with our sample: we want to use its mean to make an inference about the larger population mean. The trees physically exist in the world; our inference about the trees does not. The standard deviation describes the variability in the real trees. We could measure 100, 500, or all of the trees in the forest and the standard deviation of dbh would not change much. The standard error, on the other hand, relates to our inference: it describes our uncertainty about the true population mean. As our sample grows, our uncertainty about the population mean decreases and our confidence in our estimate increases. This reminds me of the famous quote by Richard Rorty:
“The world is out there, but descriptions of the world are not. Only descriptions of the world can be true or false. The world on its own unaided by the describing activities of humans cannot.”
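To make the distinction between the standard deviation and the standard error concrete, here is a minimal sketch. It does not use livetrees.csv: `dbh_pop` is a hypothetical, simulated stand-in for `trees$dbh`, and the lognormal parameters are arbitrary assumptions chosen only to give a skewed set of diameters. The standard error is estimated with the usual s / sqrt(n) formula, which ignores the small correction for sampling without replacement from a finite population.

```r
# Hypothetical stand-in for trees$dbh: a simulated, skewed "population" of diameters
set.seed(42)
dbh_pop <- rlnorm(2000, meanlog = 2, sdlog = 1)

# Draw one random sample without replacement, as in the text
n <- 100
dbh_samp <- sample(dbh_pop, n)

samp_sd <- sd(dbh_samp)       # variability among the sampled trees themselves
samp_se <- samp_sd / sqrt(n)  # estimated uncertainty about the population mean

c(sd = samp_sd, se = samp_se)
```

The standard deviation estimates a property of the trees, so it settles near the population value as n grows. The standard error shrinks toward zero, because it divides by sqrt(n).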
We know that the characteristics of our randomly selected set of sample trees will not exactly match the characteristics of the population: our sample contains chance error.
# Chance error
mean(trees$dbh, na.rm = T) - mean(trees.samp1$dbh, na.rm = T)
## [1] -1.391
The average magnitude of these chance errors depends upon sample size. In the extreme case, if our sample included 100% of the trees, the chance error would be zero.
We can illustrate this via re-sampling:
# create an empty data frame to hold results
results <- data.frame(sampleMean = numeric(), sampleSD = numeric(), sampleSize = numeric())
# choose a range of sample sizes to explore
ss <- c(10, 50, 100, 500, 1000, 1100)
nReps <- 10000  # number of samples to draw at each sample size
for (size in ss) {
  sMeans <- numeric(nReps)
  sSDs <- numeric(nReps)
  for (j in 1:nReps) {
    aSample <- sample(trees$dbh, size)  # draw without replacement
    sMeans[j] <- mean(aSample, na.rm = TRUE)
    sSDs[j] <- sd(aSample, na.rm = TRUE)
  }
  newRows <- data.frame(sampleMean = sMeans, sampleSD = sSDs, sampleSize = rep(size, nReps))
  results <- rbind(results, newRows)
}
results$sampleSize <- as.factor(results$sampleSize)
# each box shows the spread of the sample means drawn at one sample size
boxplot(results$sampleMean ~ results$sampleSize, xlab = "Sample Size", ylab = "Sample Mean of dbh",
  main = "Standard Errors and Sample Size")
The plot above shows how the standard error declines as a function of sample size: the spread of the sample means narrows as the samples get larger. If we believe that the "truth" lies somewhere within the range of sample means we observed, we can interpret this plot as showing that uncertainty about the population mean decreases as sample size increases. The amount of variation in the forest does not change just because our sample size changes. However, changes in sample size change our level of confidence in our knowledge of the forest.
This plot has all sorts of substantive scientific implications. For example, studies that use small samples to estimate the characteristics of a population are harder to replicate and less reliable than larger studies…
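The narrowing seen in the re-sampling experiment can also be checked numerically. The sketch below is an illustration rather than a reanalysis of the tree data: `dbh_pop` is again a hypothetical simulated population, and the number of replicates (2000) is an arbitrary choice to keep it fast. It compares the empirical standard error (the standard deviation of many sample means) with the textbook value for sampling without replacement, sigma / sqrt(n) multiplied by the finite population correction sqrt((N - n) / (N - 1)).

```r
# Hypothetical simulated population standing in for trees$dbh
set.seed(1)
dbh_pop <- rlnorm(2000, meanlog = 2, sdlog = 1)
N <- length(dbh_pop)
sizes <- c(10, 50, 100, 500, 1000)

# Empirical standard error: SD of many sample means at each sample size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(2000, mean(sample(dbh_pop, n))))
})

# Theoretical standard error for simple random sampling without replacement
pop_sd <- sqrt(mean((dbh_pop - mean(dbh_pop))^2))  # population SD (divisor N)
theoretical_se <- pop_sd / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(sampleSize = sizes, empirical_se, theoretical_se)
```

The two columns should agree closely, and both fall as the sample size rises. The correction factor is also why the chance error reaches zero when the sample includes every tree: at n = N the factor is exactly zero.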