2012-09-14 Class Notes

Sampling Bias

Continued from Wednesday

Library book sampling simulation

Continued from Wednesday

Descriptions of Distributions

Main points

Example of Standard Deviation

Osteopenia.

Here's a quote from Wikipedia:

Osteopenia is a condition where bone mineral density is lower than normal. It is considered by many doctors to be a precursor to osteoporosis. However, not every person diagnosed with osteopenia will develop osteoporosis. More specifically, osteopenia is defined as a bone mineral density T-score between -1.0 and -2.5.

The article goes on:

The definition has been controversial. Steven R. Cummings, of the University of California, San Francisco, said in 2003 that “There is no basis, no biological, social, economic or treatment basis, no basis whatsoever” for using one standard deviation. Cummings added that “As a consequence, though, more than half of the population is told arbitrarily that they have a condition they need to worry about.” (Quoted from this Gina Kolata article.)

bone density and osteopenia from surgeongeneral.gov

Bone density falls with age. The T-score is really just a Z-score, but it compares a person to the distribution of young people rather than to people of their own age. Some graphs

Osteopenia is defined so that about 1/6 of young people have it and much larger fractions of old people will have it.
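Here's a small sketch of the T-score arithmetic, using made-up reference numbers for the young-adult mean and standard deviation of bone mineral density (the actual reference values depend on the measurement site and population):

young.mean = 1.0   # hypothetical young-adult mean bone mineral density
young.sd = 0.12    # hypothetical young-adult standard deviation
patient.bmd = 0.85
(patient.bmd - young.mean)/young.sd  # T-score of -1.25, in the osteopenia range
pnorm(-1)  # fraction more than 1 SD below the mean under a normal model: about 1/6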

Means and Models

Important components of statistical models: the variation in the quantity being modeled, and the explanatory variables that might account for some of that variation.

Example: wage varies from person to person.

Example: wage might depend on sector of the economy or sex or something else.

cps = fetchData("CPS85")
## Data CPS85 found in package.

We can quantify the amount of variation using the standard deviation:

sd(wage, data = cps)
## [1] 5.139

Example: We might model wage by sex as “Men make $3 per hour, women make $5 per hour”

You'll never have to do this, but here's one way to implement that model:

fetchData("m155development.R")  # some new software
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE
mymod = listFun(M = 3, F = 5)

This is a function that takes “M” or “F” as an input and returns the model value.
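A quick check of how it behaves (assuming listFun, which comes from the m155development.R file, returns the matching value for each input):

mymod("M")  # should give 3
mymod(c("M", "F", "F"))  # should give 3 5 5, one model value per person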

Let's add these model values to the data as a variable called modvals, and then compute the residuals:

cps = transform(cps, modvals = mymod(sex))
resids = with(data = cps, wage - modvals)

One way to quantify the “goodness” of a model is by how far the residuals are from their ideal value, zero. The mean square residual is a good measure of this:

mean(resids^2)
## [1] 55.53

As with the arrows, there are two components to this: bias and variance.

mean(resids)  # bias, the systematic error
## [1] 5.106
var(resids)  # the random error
## [1] 29.51
sd(resids)  # the square root of the variance
## [1] 5.432
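These two pieces account for the mean square residual: bias squared plus the variance adds up (almost exactly) to mean(resids^2). The small discrepancy is only because var() divides by n - 1 rather than n.

mean(resids)^2 + var(resids)  # compare to mean(resids^2) above, about 55.5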

The above model has a big bias. We want to choose the model values to make the sum of squared errors as small as possible; ideally, we can make the bias zero. This is actually easy: we do it by fitting the model, that is, adjusting the parameters to match the data as closely as possible. For a groupwise model like this one, fitting amounts to using the mean wage within each group. First by hand, and then with the mm() function, which will fit the model values to the data for us.

mean(wage ~ sex, data = cps)
##     F     M 
## 7.879 9.995 
f2 = listFun(M = 9.995, F = 7.879)
cps = transform(cps, means = f2(sex))
resids = with(cps, wage - means)
mean(resids^2)  # This model is better than the earlier one!
## [1] 25.25
mean(resids)  # No bias!
## [1] -0.0001124
sd(resids)
## [1] 5.03
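Here's a rough check, just for illustration with the cps data and f2 from above, that the group means really are the best choice. Vary the model value for men while holding the women's value at their group mean; the mean square residual is smallest when the men's value sits at the men's mean.

trial.M = seq(8, 12, by = 0.5)
mse = sapply(trial.M, function(m) {
    vals = ifelse(cps$sex == "M", m, 7.879)
    mean((cps$wage - vals)^2)
})
cbind(trial.M, mse)  # smallest near trial.M = 10, close to the men's mean of 9.995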

The mm() function does these calculations for us in a convenient way, allowing us to extract the residuals and fitted model values without all the work.

mod = mm(wage ~ sex, data = cps)
mean(resid(mod)^2)
## [1] 25.25
mean(resid(mod))  # No bias!
## [1] -1.608e-16
sd(resid(mod))
## [1] 5.03

Any difference between the “by hand” version and the mm version is due to rounding off when typing in the mean values.
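The fitted model values can be extracted from the model object as well. Assuming the standard fitted() extractor works on what mm() returns, it would look like this:

head(fitted(mod))  # the model value (group mean wage) for the first few people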

QUIZ (last 15 minutes)