In-Class Notes: 2012-09-19

Review from last class

We saw that the choice of parameters in a model could be automated: fit the model, thereby choosing the parameters that eliminate bias in the residuals and minimize the variance.

Of course, the chosen parameters depend on the data to which the model is fitted. Insofar as this data is a random sample from a population, the parameters themselves are random. The question for use today is, “How random?”

The sampling process

Sample 100 runners from the runners data set and find the mean running time. Because there is some missing data, you'll need to do something like this:

options(na.rm = TRUE)
options(na.rm = TRUE)
run = fetchData("repeat-runners.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/repeat-runners.csv
nrow(run)
## [1] 24334
mysamp = sample(run, size = 100)
mymod = mm(net ~ sex, data = mysamp)
mymod
## Groupwise Model.
## Call:
## net ~ sex
## 
## Coefficients:
##    F     M  
## 95.8  86.5  
## 

Across the class, comparing the different students, see what the range of values is.

Do it several times on the instructor's computer.

do(5) * mm(net ~ sex, data = sample(run, size = 100))
##       F     M sigma r.squared
## 1 93.75 81.56    NA 0.2355246
## 2 90.21 90.53    NA 0.0001373
## 3 93.41 84.89    NA 0.0668309
## 4 96.65 82.63    NA 0.1739183
## 5 97.42 86.97    NA 0.0791569

Do it many times with do and plot out the sampling distribution.

trials = do(100) * mm(net ~ sex, data = sample(run, size = 100))
densityplot(~F, data = trials)

plot of chunk unnamed-chunk-5

densityplot(~M, data = trials)

plot of chunk unnamed-chunk-5

Or, to put them both on one plot, which I want to do to compare to the population distribution, here's a more esoteric command:

require(reshape2)
## Loading required package: reshape2
densityplot(~value, groups = variable, data = melt(trials[, 1:2]))
## Using as id variables

plot of chunk unnamed-chunk-6

How do we measure the width of the sampling distribution. With the standard deviation of the distribution.

sd(trials)
##         F         M     sigma r.squared 
##   2.18674   2.06732        NA   0.05963 

This is called the standard error of the distribution.

IMPORTANT the standard error of the distribution is not at all the same as the distribution of the population. Here's that distribution of individual net running times:

densityplot(~net, groups = sex, data = run)

plot of chunk unnamed-chunk-8

Resampling

You don't typically have the population at hand. (If you did … )

So we simulate the sampling distribution by constructing a simulated population that consists of many copies of the original. You could think of this as creating a pot of data, pouring into the pot many copies of the sample so that you have a huge set of cases to choose from.

In practice, this is implemented by sampling with replacement. To contrast sampling with resampling:

set = c(1, 2, 3, 4)
sample(set)
## [1] 1 3 2 4
sample(set)
## [1] 4 1 3 2
sample(set)
## [1] 4 3 1 2
resample(set)
## [1] 2 2 2 4
resample(set)
## [1] 4 3 1 3
resample(set)
## [1] 2 1 4 1

The resample() function is set up, by default, to create a simulated sample that's the same size as the original.

Student task

Take your sample of size 100 and fit the model to a resampled set of data:

mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
## 
## Coefficients:
##    F     M  
## 95.1  85.6  
## 
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
## 
## Coefficients:
##    F     M  
## 92.9  87.9  
## 
mm(net ~ sex, data = resample(mysamp))
## Groupwise Model.
## Call:
## net ~ sex
## 
## Coefficients:
##    F     M  
## 95.7  91.4  
## 

Do this many times to create the resampling distribution:

rs = do(100) * mm(net ~ sex, data = resample(mysamp))

What's the width of this distribution?

sd(rs)
##         F         M     sigma r.squared 
##   2.74415   2.23523        NA   0.07084 

That constitutes an estimate of the standard error of the coefficients of this model.

summary(mymod)
## Groupwise Model
## Call:  net ~ sex 
## 
##   group 2.5 % 97.5 %
## 1     F 91.27 100.31
## 2     M 83.02  89.95
## 
## sigma:  NA 
## r-squared: 0.1 
## Adj. r-squared: 0.091 

QUESTION: Where does the information to do this come from, since mymod doesn't see anything but the particular sample that we took?

The Standard Error

The standard error depends on the size of the sample. More data means a more precise estimate. Let's quantify this.

Student Activity Use sampling and iteration for 200 times to calculate the standard error for different sample sizes. Assign a sample size to each student: 25, 50, 100, 200, 400, 800, 1600, 3200

Example:

trials = do(200) * mm(net ~ sex, data = sample(run, size = 800))
sd(trials)
##         F         M     sigma r.squared 
##   0.82647   0.69632        NA   0.02437 

Draw a plot of the standard error versus sample size. Show that the standard error gets smaller as \( 1/\sqrt{n} \). You need 4 times as much data to get an estimate that's twice as precise.

From the Sampling/Resampling Distribution to a Confidence Interval

Two equivalent formats

Three components:
1. Point estimate \( p \)
2. Margin of error \( m \)
3. Confidence level \( c \) typically 95\%

Two main methods:

The confint program will do this whichever way you ask:

confint(trials, method = "stderr")
## Warning: NaNs produced
##        name    lower   upper
## 1         F 92.95436 96.2139
## 2         M 82.60163 85.3478
## 3     sigma       NA      NA
## 4 r.squared  0.07628  0.1724
confint(trials, method = "quantile")
##        name  2.5 %  97.5 %
## 1         F 93.037 96.2908
## 2         M 82.741 85.3623
## 3     sigma     NA      NA
## 4 r.squared  0.079  0.1675

Simulation

Have students generate a random sample of 100,000 from a normal distribution with a specied mean and standard deviation. Then calculate the 95% coverage interval. Note that it extends from the mean about \( \pm 2 \) standard deviations.

Example:

samp = rnorm(1e+05, mean = 20, sd = 5)
qdata(c(0.025, 0.975), samp)  # should be about 10 to 30
##  2.5% 97.5% 
## 10.30 29.76 

The second method, using the standard error, is very common, because it's easier to get a good estimate of the standard error than to get the 2.5 or 97.5 percentile.

Interpreting the Confidence Interval

Differences between groups

Simple visual method: do the confidence intervals overlap? Later on we'll see a more sophisticated method called hypothesis testing.

Coverage

Have everyone construct a groupwise model of the wage data against some grouping variable and pick some group to examine.

Example:

popParameter = mm(net ~ sex, data = run)
popParameter
## Groupwise Model.
## Call:
## net ~ sex
## 
## Coefficients:
##    F     M  
## 94.6  84.1  
## 
mymod = mm(net ~ sex, data = sample(run, size = 100))
confint(mymod, level = 0.5)
##   group  25 %  75 %
## 1     F 89.97 93.23
## 2     M 83.60 86.16

Is the population parameter value within your confidence interval?

Have fun with the students whose interval doesn't cover their population parameter. Lead them to understand that with a confidence level of 50%, you'll cover the population parameter only half the time.

In any given study, you can't know if your confidence interval includes the population parameter. But if you follow the right techniques, your interval will include the population parameter 95% (or whatever the confidence level is) of the time.

Why not 100% confidence? This would have to extend to infinity.

Significant digits and the confidence interval

A simple method: Report two digits of the (doubled) standard error. Then report the point estimate to the same digit as the smallest digit in your standard error.

Examples:

Round doubled sd to two digits: 130000.
Report point estimate to the same digit: 3420000 \( \pm \) 130000 with 95% confidence.