We saw that the choice of parameters in a model could be automated: fit the model, thereby choosing the parameters that eliminate bias in the residuals and minimize the variance.
Of course, the chosen parameters depend on the data to which the model is fitted. Insofar as this data is a random sample from a population, the parameters themselves are random. The question for use today is, “How random?”
Sample 100 runners from the runners data set and find the mean running time. Because there is some missing data, you'll need to do something like this:
options(na.rm = TRUE)
options(na.rm = TRUE)
run = fetchData("repeat-runners.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/repeat-runners.csv
nrow(run)
## [1] 24334
mysamp = sample(run, size = 100)
mymod = mm(net ~ sex, data = mysamp)
mymod
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 95.8 86.5
##
Across the class, comparing the different students, see what the range of values is.
Do it several times on the instructor's computer.
do(5) * mm(net ~ sex, data = sample(run, size = 100))
## F M sigma r.squared
## 1 93.75 81.56 NA 0.2355246
## 2 90.21 90.53 NA 0.0001373
## 3 93.41 84.89 NA 0.0668309
## 4 96.65 82.63 NA 0.1739183
## 5 97.42 86.97 NA 0.0791569
Do it many times with do and plot out the sampling distribution.
trials = do(100) * mm(net ~ sex, data = sample(run, size = 100))
densityplot(~F, data = trials)
densityplot(~M, data = trials)
Or, to put them both on one plot, which I want to do to compare to the population distribution, here's a more esoteric command:
require(reshape2)
## Loading required package: reshape2
densityplot(~value, groups = variable, data = melt(trials[, 1:2]))
## Using as id variables
How do we measure the width of the sampling distribution. With the standard deviation of the distribution.
sd(trials)
## F M sigma r.squared
## 2.18674 2.06732 NA 0.05963
This is called the standard error of the distribution.
IMPORTANT the standard error of the distribution is not at all the same as the distribution of the population. Here's that distribution of individual net running times:
densityplot(~net, groups = sex, data = run)
You don't typically have the population at hand. (If you did … )
So we simulate the sampling distribution by constructing a simulated population that consists of many copies of the original. You could think of this as creating a pot of data, pouring into the pot many copies of the sample so that you have a huge set of cases to choose from.
In practice, this is implemented by sampling with replacement. To contrast sampling with resampling:
set = c(1, 2, 3, 4)
sample(set)
## [1] 1 3 2 4
sample(set)
## [1] 4 1 3 2
sample(set)
## [1] 4 3 1 2
resample(set)
## [1] 2 2 2 4
resample(set)
## [1] 4 3 1 3
resample(set)
## [1] 2 1 4 1
The resample() function is set up, by default, to create a simulated sample that's the same size as the original.
Take your sample of size 100 and fit the model to a resampled set of data:
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 95.1 85.6
##
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 92.9 87.9
##
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 95.7 91.4
##
Do this many times to create the resampling distribution:
rs = do(100) * mm(net ~ sex, data = resample(mysamp))
What's the width of this distribution?
sd(rs)
## F M sigma r.squared
## 2.74415 2.23523 NA 0.07084
That constitutes an estimate of the standard error of the coefficients of this model.
summary(mymod)
## Groupwise Model
## Call: net ~ sex
##
## group 2.5 % 97.5 %
## 1 F 91.27 100.31
## 2 M 83.02 89.95
##
## sigma: NA
## r-squared: 0.1
## Adj. r-squared: 0.091
QUESTION: Where does the information to do this come from, since mymod doesn't see anything but the particular sample that we took?
The standard error depends on the size of the sample. More data means a more precise estimate. Let's quantify this.
Student Activity Use sampling and iteration for 200 times to calculate the standard error for different sample sizes. Assign a sample size to each student: 25, 50, 100, 200, 400, 800, 1600, 3200
Example:
trials = do(200) * mm(net ~ sex, data = sample(run, size = 800))
sd(trials)
## F M sigma r.squared
## 0.82647 0.69632 NA 0.02437
Draw a plot of the standard error versus sample size. Show that the standard error gets smaller as \( 1/\sqrt{n} \). You need 4 times as much data to get an estimate that's twice as precise.
Two equivalent formats
Three components:
1. Point estimate \( p \)
2. Margin of error \( m \)
3. Confidence level \( c \) typically 95\%
Two main methods:
The confint program will do this whichever way you ask:
confint(trials, method = "stderr")
## Warning: NaNs produced
## name lower upper
## 1 F 92.95436 96.2139
## 2 M 82.60163 85.3478
## 3 sigma NA NA
## 4 r.squared 0.07628 0.1724
confint(trials, method = "quantile")
## name 2.5 % 97.5 %
## 1 F 93.037 96.2908
## 2 M 82.741 85.3623
## 3 sigma NA NA
## 4 r.squared 0.079 0.1675
Simulation
Have students generate a random sample of 100,000 from a normal distribution with a specied mean and standard deviation. Then calculate the 95% coverage interval. Note that it extends from the mean about \( \pm 2 \) standard deviations.
Example:
samp = rnorm(1e+05, mean = 20, sd = 5)
qdata(c(0.025, 0.975), samp) # should be about 10 to 30
## 2.5% 97.5%
## 10.30 29.76
The second method, using the standard error, is very common, because it's easier to get a good estimate of the standard error than to get the 2.5 or 97.5 percentile.
Simple visual method: do the confidence intervals overlap? Later on we'll see a more sophisticated method called hypothesis testing.
Have everyone construct a groupwise model of the wage data against some grouping variable and pick some group to examine.
Example:
popParameter = mm(net ~ sex, data = run)
popParameter
## Groupwise Model.
## Call:
## net ~ sex
##
## Coefficients:
## F M
## 94.6 84.1
##
mymod = mm(net ~ sex, data = sample(run, size = 100))
confint(mymod, level = 0.5)
## group 25 % 75 %
## 1 F 89.97 93.23
## 2 M 83.60 86.16
Is the population parameter value within your confidence interval?
Have fun with the students whose interval doesn't cover their population parameter. Lead them to understand that with a confidence level of 50%, you'll cover the population parameter only half the time.
In any given study, you can't know if your confidence interval includes the population parameter. But if you follow the right techniques, your interval will include the population parameter 95% (or whatever the confidence level is) of the time.
Why not 100% confidence? This would have to extend to infinity.
A simple method: Report two digits of the (doubled) standard error. Then report the point estimate to the same digit as the smallest digit in your standard error.
Examples:
Round doubled sd to two digits: 130000.
Report point estimate to the same digit: 3420000 \( \pm \) 130000 with 95% confidence.