Accuracy versus Precision
The dictionary gets it substantially wrong:
precision |priˈsi zh ən| noun the quality, condition, or fact of being exact and accurate : the deal was planned and executed with military precision. • [as adj. ] marked by or adapted for accuracy and exactness : a precision instrument. • technical refinement in a measurement, calculation, or specification, esp. as represented by the number of digits given : this has brought an unprecedented degree of precision to the business of dating rocks | a precision of six decimal figures. Compare with accuracy . ORIGIN mid 18th cent.: from French précision or Latin praecisio(n-), from praecidere ‘cut off’ (see precise ).
Thesaurus synonyms: tools crafted with precision: exactness, exactitude, accuracy, correctness, preciseness; care, carefulness, meticulousness, scrupulousness, punctiliousness, methodicalness, rigor, rigorousness.
Today is all about Precision and how to measure and convey it. For us, the meaning of precision is “the degree of repeatability” or “reproducibility.” Sometimes this is conveyed by the “number of digits”, which is to say “the position of the last digit that matters”.
One wikipedia definition is useless: precision (statistics)http://en.wikipedia.org/wiki/Precision_(statistics).
Another one, accuracy and precision is more on target.
We saw that the choice of parameters in a model could be automated: fit the model, thereby choosing the parameters that eliminate bias in the residuals and minimize the variance.
Of course, the chosen parameters depend on the data to which the model is fitted. Insofar as this data is a random sample from a population, the parameters themselves are random. The question for use today is, “How random?”
Sample 100 runners from the runners data set and find the mean running time. Because there is some missing data, you'll need to do something like this:
options(na.rm = TRUE)
options(na.rm = TRUE)
run = fetchData("repeat-runners.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/repeat-runners.csv
nrow(run)
## [1] 24334
mysamp = sample(run, size = 100)
mymod = mm(net ~ sex, data = mysamp)
mymod
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 94.3 84.3
Across the class, comparing the different students, see what the range of values is.
Do it several times on the instructor's computer.
do(5) * mm(net ~ sex, data = sample(run, size = 100))
## F M sigma r.squared
## 1 95.23 85.31 NA 0.09653
## 2 92.96 85.37 NA 0.07690
## 3 91.59 78.51 NA 0.23614
## 4 96.15 85.71 NA 0.11462
## 5 95.73 82.05 NA 0.20665
Do it many times with do and plot out the sampling distribution.
trials = do(100) * mm(net ~ sex, data = sample(run, size = 100))
densityplot(~F, data = trials)
densityplot(~M, data = trials)
Or, to put them both on one plot, which I want to do to compare to the population distribution, here's a more esoteric command:
require(reshape2)
## Loading required package: reshape2
densityplot(~value, groups = variable, data = melt(trials[, 1:2]))
## Using as id variables
How do we measure the width of the sampling distribution. With the standard deviation of the distribution.
sd(trials)
## F M sigma r.squared
## 2.3135 2.2143 NA 0.0619
This is called the standard error of the distribution.
IMPORTANT the standard error of the distribution is not at all the same as the distribution of the population. Here's that distribution of individual net running times:
densityplot(~net, groups = sex, data = run)
You don't typically have the population at hand. (If you did … )
So we simulate the sampling distribution by constructing a simulated population that consists of many copies of the original. You could think of this as creating a pot of data, pouring into the pot many copies of the sample so that you have a huge set of cases to choose from.
In practice, this is implemented by sampling with replacement. To contrast sampling with resampling:
set = c(1, 2, 3, 4)
sample(set)
## [1] 4 2 1 3
sample(set)
## [1] 4 2 3 1
sample(set)
## [1] 3 4 1 2
resample(set)
## [1] 4 3 4 4
resample(set)
## [1] 3 1 3 1
resample(set)
## [1] 2 4 2 4
The resample() function is set up, by default, to create a simulated sample that's the same size as the original.
Take your sample of size 100 and fit the model to a resampled set of data:
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 93.6 80.8
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 98.7 82.6
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 91.9 79.7
Do this many times to create the resampling distribution:
rs = do(100) * mm(net ~ sex, data = resample(mysamp))
What's the width of this distribution?
sd(rs)
## F M sigma r.squared
## 2.29298 2.61794 NA 0.07289
That constitutes an estimate of the standard error of the coefficients of this model.
summary(mymod)
## Groupwise Model
## Call: net ~ sex
##
## group 2.5 % 97.5 %
## 1 F 89.71 98.93
## 2 M 80.67 87.88
##
## sigma: NA
## r-squared: 0.106
## Adj. r-squared: 0.0974
QUESTION: Where does the information to do this come from, since mymod doesn't see anything but the particular sample that we took?
The standard error depends on the size of the sample. More data means a more precise estimate. Let's quantify this.
Student Activity Use sampling and iteration for 200 times to calculate the standard error for different sample sizes. Assign a sample size to each student: 25, 50, 100, 200, 400, 800, 1600, 3200
Example:
trials = do(200) * mm(net ~ sex, data = sample(run, size = 800))
sd(trials)
## F M sigma r.squared
## 0.91891 0.74530 NA 0.02579
Draw a plot of the standard error versus sample size. Show that the standard error gets smaller as \( 1/\sqrt{n} \). You need 4 times as much data to get an estimate that's twice as precise.
Two equivalent formats
Three components: 1. Point estimate \( p \) 2. Margin of error \( m \) 3. Confidence level \( c \) typically 95\%
Two main methods:
The confint program will do this whichever way you ask:
confint(trials, method = "stderr")
## Warning: NaNs produced
## name lower upper
## 1 F 92.72121 96.3453
## 2 M 82.54898 85.4884
## 3 sigma NA NA
## 4 r.squared 0.07049 0.1722
confint(trials, method = "quantile")
## name 2.5 % 97.5 %
## 1 F 92.74472 96.3573
## 2 M 82.70897 85.3933
## 3 sigma NA NA
## 4 r.squared 0.07235 0.1708
Simulation
Have students generate a random sample of 100,000 from a normal distribution with a specied mean and standard deviation. Then calculate the 95% coverage interval. Note that it extends from the mean about \( \pm 2 \) standard deviations.
Example:
samp = rnorm(1e+05, mean = 20, sd = 5)
qdata(c(0.025, 0.975), samp) # should be about 10 to 30
## 2.5% 97.5%
## 10.15 29.82
The second method, using the standard error, is very common, because it's easier to get a good estimate of the standard error than to get the 2.5 or 97.5 percentile.
Simple visual method: do the confidence intervals overlap? Later on we'll see a more sophisticated method called hypothesis testing.
Have everyone construct a groupwise model of the wage data against some grouping variable and pick some group to examine.
Example:
popParameter = mm(net ~ sex, data = run)
popParameter
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 94.6 84.1
mymod = mm(net ~ sex, data = sample(run, size = 100))
confint(mymod, level = 0.5)
## group 25 % 75 %
## 1 F 92.87 96.01
## 2 M 80.43 82.54
Is the population parameter value within your confidence interval?
Have fun with the students whose interval doesn't cover their population parameter. Lead them to understand that with a confidence level of 50%, you'll cover the population parameter only half the time.
In any given study, you can't know if your confidence interval includes the population parameter. But if you follow the right techniques, your interval will include the population parameter 95% (or whatever the confidence level is) of the time.
Why not 100% confidence? This would have to extend to infinity.
A simple method: Report two digits of the (doubled) standard error. Then report the point estimate to the same digit as the smallest digit in your standard error.
Examples:
Round doubled sd to two digits: 130000. Report point estimate to the same digit: 3420000 \( \pm \) 130000 with 95% confidence.
Don't confusion precision with accuracy. The confidence interval tells you the precision of your estimate, based on the variability that comes about from the sampling process. But however precise your estimate is, it may still be inaccurate. Some reasons:
We'll be moving on to build models involving multiple explanatory variables in the next class. Part of doing that will be to expand the notion of models across groupwise means.
We saw an example of this in the first class, with the early studies of smoking and mortality. Breaking things up into groups by smoking status, without taking into account age, gave a result that we now know to be wrong.