Point Estimate

In statistics, we often want to know a characteristic of a large group (the population), but measuring every member of that group is usually infeasible. Instead, we can measure a small portion of the group (the sample) and use it to estimate the characteristic.

We will use R to demonstrate how a small sample can provide a surprisingly accurate estimate of the true population average.

In this demonstration, we will use R's built-in rivers dataset: the lengths (in miles) of 141 major rivers in North America, compiled by the US Geological Survey. We treat these 141 rivers as the population.

Point Estimate Formula

\[\bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i\]

\(\bar{x}\): The sample mean

\(n\): The number of rivers in our sample

\(x_i\): The length of each individual river sampled

\(\bar{x}\) is an unbiased estimator of the population mean \(\mu\): averaged over all possible random samples, it equals \(\mu\), even though the estimate from any single sample may fall above or below it.
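The formula and the unbiasedness claim can both be checked directly in R. The following is a minimal sketch, not the code behind the original slides: the seed, the sample size of 10, and the 5000 repetitions are arbitrary choices made for illustration.

```r
# rivers ships with base R (datasets package): lengths in miles of 141 rivers.
set.seed(42)  # arbitrary seed, used only so the run is reproducible

n <- 10
river_sample <- sample(rivers, n)  # simple random sample, without replacement

# Point estimate: sum of the sampled lengths divided by n
x_bar <- sum(river_sample) / n

# The built-in mean() computes the same quantity
stopifnot(isTRUE(all.equal(x_bar, mean(river_sample))))

# "Population" mean, known here only because we hold all 141 values
mu <- mean(rivers)
print(c(estimate = x_bar, truth = mu))

# Unbiasedness: the average of many independent estimates sits close to mu,
# even though individual estimates scatter widely around it
estimates <- replicate(5000, mean(sample(rivers, n)))
print(c(mean_of_estimates = mean(estimates), truth = mu))
```

A single estimate can miss by a wide margin, while the average over many repeated samples lands very near the population mean. That is what unbiasedness promises, and all it promises.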

Population Distribution

Sampling Error

A point estimate is not a perfect estimate. The reliability of the estimate depends on the Standard Error (\(SE\)), which decreases as the sample size (\(n\)) grows.

\[SE = \frac{\sigma}{\sqrt{n}}\]

\(\sigma\): The population standard deviation

With too small a sample, the point estimate may land far from the true value. As \(n\) grows, the sampling distribution of \(\bar{x}\) concentrates more tightly around the true mean \(\mu\).
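To make the standard error formula concrete, the sketch below evaluates it for samples of 10 and 50 rivers and compares each value with the spread of simulated sample means. The helper names se and sim_se, the seed, and the 5000 repetitions are assumptions made for this illustration. The simulation samples with replacement, because that matches the independence assumption behind \(\sigma / \sqrt{n}\).

```r
# Population standard deviation (divides by N; sd() would divide by N - 1)
sigma <- sqrt(mean((rivers - mean(rivers))^2))

# Standard error of the sample mean for a sample of size n
se <- function(n) sigma / sqrt(n)

# Empirical check: standard deviation of many simulated sample means
sim_se <- function(n, reps = 5000) {
  sd(replicate(reps, mean(sample(rivers, n, replace = TRUE))))
}

set.seed(1)  # arbitrary seed, used only for reproducibility
comparison <- c(formula_10 = se(10), simulated_10 = sim_se(10),
                formula_50 = se(50), simulated_50 = sim_se(50))
print(round(comparison, 1))
```

Going from 10 to 50 rivers shrinks the standard error by a factor of \(\sqrt{5} \approx 2.24\). When sampling without replacement from a finite population of size \(N\), which is the default behaviour of sample(), the standard error is smaller still, by the factor \(\sqrt{(N - n)/(N - 1)}\).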

Random Sample (small)

In this visualization, we take a random sample of just 10 rivers.

Random Sample (large)

In this visualization, we take a larger sample of 50 rivers.
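The two samples can be drawn and compared along the following lines. This is a hedged sketch using base R graphics, not necessarily how the original figures were produced; the seed and the plot styling are illustrative choices.

```r
set.seed(2024)  # arbitrary seed, used only for reproducibility

mu <- mean(rivers)            # population mean (all 141 rivers)
small <- sample(rivers, 10)   # small random sample
large <- sample(rivers, 50)   # larger random sample

# Point estimate and signed error for each sample
result <- data.frame(n        = c(length(small), length(large)),
                     estimate = c(mean(small), mean(large)))
result$error <- result$estimate - mu
print(result)

# Side-by-side histograms: dashed line = sample mean, solid line = population mean
op <- par(mfrow = c(1, 2))
for (s in list(small, large)) {
  hist(s, main = paste("n =", length(s)), xlab = "Length (miles)")
  abline(v = mean(s), lty = 2)
  abline(v = mu)
}
par(op)
```

A single draw of 50 is not guaranteed to beat a single draw of 10; the larger sample is only more likely to land close to the population mean.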

Visualizing the Law of Large Numbers

To visualize the relationship between sample size and point-estimate accuracy, we compute the sample mean 20 times at each sample size \(n\) from 1 to 140, store the results in a matrix, and draw it as a 3D surface. The spikes at small sample sizes show how widely the estimate varies there; as the sample size increases, the surface flattens toward the true mean.

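library(plotly)  # provides plot_ly(), add_surface(), layout(), and the %>% pipe

n_values <- 1:140  # sample sizes to try
trials <- 1:20     # repetitions per sample size

# z_matrix[i, j]: mean length of a random sample of n_values[i] rivers in trial j
z_matrix <- outer(n_values, trials,
                  Vectorize(function(n, t) mean(sample(rivers, n))))

# Surface plot: rows of z map to y (sample size), columns map to x (trial)
plot_ly(x = trials, y = n_values, z = z_matrix) %>%
  add_surface() %>%
  layout(title = "Accuracy vs Sample Size",
         scene = list(xaxis = list(title = "Trial #"),
                      yaxis = list(title = "Sample size (n)"),
                      zaxis = list(title = "Mean estimate")))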
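The flattening of the surface can also be summarised numerically. The sketch below rebuilds the same matrix of estimates and computes, for each sample size, the standard deviation of the 20 estimates. The name spread and the seed are assumptions made for this illustration.

```r
set.seed(7)  # arbitrary seed, used only for reproducibility

n_values <- 1:140
trials <- 1:20

# Same construction as the surface plot: one row per sample size, one column per trial
z_matrix <- outer(n_values, trials,
                  Vectorize(function(n, t) mean(sample(rivers, n))))

# Spread of the 20 estimates at each sample size
spread <- apply(z_matrix, 1, sd)

# Average spread at the ten smallest and ten largest sample sizes
print(round(c(smallest_n = mean(spread[1:10]),
              largest_n  = mean(spread[131:140])), 1))

plot(n_values, spread, type = "l",
     xlab = "Sample size (n)", ylab = "SD of estimates across trials")
```

Because sample() draws without replacement here, the spread falls almost to zero as \(n\) approaches 141: a sample that contains nearly every river leaves very little room for the mean to vary.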

Convergence