February 8, 2013 Class Notes

Basics about models

  1. A model is a representation for a purpose. What the best model is depends on your purpose for constructing the model. This requires judgment.
  2. Statistical models (a) attempt to account for case-to-case variation in a variable, (b) are based on data, and (c) put the variation that's accounted for in the context of the variation that remains unexplained.
  3. The usual form of a statistical model is \( \text{response} = f(\text{explanatory}) \). That is, we imagine that for each value of the explanatory variables there is a unique model output. Reality generally doesn't work this way.

A mathematical model of human mood with two parameters.

These pictures are representations of mood. They capture some aspects of mood but not every nuance.

[Figure: cartoon faces illustrating the mood model]

I call my model a mathematical model since it is governed by two mathematical quantities: the curvature of the mouth and the slope of the (left) eyebrow.
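
To make the idea of a parameter concrete, here is a small sketch (not part of the original notes) of such a two-parameter face: one number sets the curvature of the mouth, the other the slope of the eyebrows.

mood_face = function(mouth_curve = 1, brow_slope = 0.3) {
    # Blank canvas with a circular head
    plot(0, 0, type = "n", xlim = c(-2, 2), ylim = c(-2, 2),
         asp = 1, axes = FALSE, xlab = "", ylab = "")
    symbols(0, 0, circles = 1.8, inches = FALSE, add = TRUE)
    points(c(-0.7, 0.7), c(0.4, 0.4), pch = 19, cex = 2)   # eyes
    x = seq(-0.8, 0.8, length.out = 40)
    lines(x, mouth_curve * x^2 - 0.9)                      # mouth: a parabola
    segments(-1.1, 0.9, -0.4, 0.9 + 0.7 * brow_slope)      # left eyebrow
    segments(0.4, 0.9 + 0.7 * brow_slope, 1.1, 0.9)        # right eyebrow (mirrored)
}
mood_face(1, -0.3)   # one setting of the parameters
mood_face(-1, 0.5)   # a different mood

Changing the two numbers changes the mood the face expresses; that is all "parameter" means here.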

A word describing a mood is also a model. That we can communicate the mood without the word is evidence that the cartoon face really is representing something.

More skilled modelers can add more details — more parameters, more levels to the parameters — to get a better representation.

Important components of statistical models:

  1. Variation in the response variable from case to case. Example: wage varies from person to person.
  2. Explanatory variables that might account for some of that variation. Example: wage might depend on sector of the economy or sex or something else.

cps = fetchData("CPS85")
## Data CPS85 found in package.

We can quantify the amount of variation using the standard deviation:

sd(wage, data = cps)
## [1] 5.139

Example: We might model wage by sex as “Men make $3 per hour, women make $5 per hour”

You'll never have to do this, but here's one way to implement that model:

fetchData("m155development.R")  # some new software
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
## [1] TRUE
mymod = listFun(M = 3, F = 5)

This is a function that takes “M” or “F” as an input and returns the model value.
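
listFun() comes from the m155development.R file rather than base R. As a rough sketch of the kind of thing it might be doing (an assumption, not the actual source code), it builds a lookup function from named model values:

# A guess at a listFun-style helper (not the actual m155development.R code)
listFunSketch = function(...) {
    vals = c(...)                       # named model values, e.g. c(M = 3, F = 5)
    function(level) vals[as.character(level)]
}
mymod2 = listFunSketch(M = 3, F = 5)
mymod2("F")                             # the model value for females, 5
mymod2(c("M", "F", "M"))                # works on a whole column of sexes, too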

Let's add these model values to the data as a variable called modvals:

cps = transform(cps, modvals = mymod(sex))
resids = with(data = cps, wage - modvals)

One way to quantify the “goodness” of a model is with how far off the residuals are from their ideal value: zero. The mean square residual is a good measure of this:

mean(resids^2)
## [1] 55.53

As with arrows aimed at a target, there are two components to this: bias and variance.

mean(resids)  # bias, the systematic error
## [1] 5.106
var(resids)  # the random error
## [1] 29.51
sd(resids)  # the square root of the variance
## [1] 5.432
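
These pieces fit together: up to the n versus n - 1 convention in var(), the mean square residual is the bias squared plus the variance. A quick check with the objects just computed:

mean(resids)^2 + var(resids)  # bias^2 + variance: close to mean(resids^2) above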

The above model has a big bias. We want to choose the model values to make the sum of squared errors as small as possible. Ideally, we can make the bias zero. This is actually easy. We do it by fitting the model, adjusting the parameters to match the data as closely as possible. The mm() function will fit model values to the data.

mean(wage ~ sex, data = cps)
##     F     M 
## 7.879 9.995
f2 = listFun(M = 9.995, F = 7.879)
cps = transform(cps, means = f2(sex))
resids = with(cps, wage - means)
mean(resids^2)  # This model is better than the earlier one!
## [1] 25.25
mean(resids)  # No bias!
## [1] -0.0001124
sd(resids)
## [1] 5.03

The mm() function does these calculations for us in a convenient way, allowing us to extract the residuals and fitted model values without all the work.

mod = mm(wage ~ sex, data = cps)
mean(resid(mod)^2)
## [1] 25.25
mean(resid(mod))  # No bias!
## [1] -1.608e-16
sd(resid(mod))
## [1] 5.03
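
The notes above mention the fitted model values as well. Assuming mm() objects respond to fitted() the way lm() objects do (an assumption about the m155development.R software), you could pull those out directly:

head(fitted(mod))  # assumption: mm objects support fitted(); these are the groupwise model values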

Any difference between the “by hand” version and the mm() version is due to round-off when typing in the mean values.

Constructing some models

A handout for students

We're going to build some groupwise models. Later on, we're going to fit models, that is, let the data dictate the model parameters in some optimal way.

We'll be using some new software, so make sure to load it:

fetchData("m155development.R")

Construct each of these models, but make up your own parameters; don't just use the ones here. (Of course, you won't have a solid idea of what the parameters should be. Just use your best judgment.) Test out your function by giving it the level, in quotes, as an input.

height = listFun(F = 63, M = 68)
height("F")  # testing it for females
##  F 
## 63
age = listFun(Black = 25, White = 28, Hispanic = 22)
age("Hispanic")
## Hispanic 
##       22
cost = listFun(F = 20, M = 24)
cost("M")
##  M 
## 24
dead = listFun(Yes = 0.8, No = 0.4)
dead("Yes")
## Yes 
## 0.8

INSTRUCTIONS TO STUDENTS: Make your own model for each of these. Then we're going to have a competition.

Evaluating your model against data

We happen to have some data that reflects each of these situations. The data sets are:

g = fetchData("Galton")
## Data Galton found in package.
m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
i = fetchData("AARP-insurance.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/AARP-insurance.csv
w = fetchData("whickham.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/whickham.csv

Look at the data sets

Find the model values for these data

Add another variable to each data frame that gives the model value. Call it modval so that we can remember it.

Your statements will look like this:

g = transform(g, modval = height(sex))
m = transform(m, modval = age(Race))
i = transform(i, modval = cost(Sex))
w = transform(w, modval = dead(smoker))

Do a scatter plot of the model value against the response value.

xyplot(modval ~ height, data = g)  # model values vs. actual heights in the Galton data

Explain the pattern you see.

What's special about the w data and its model values?

xyplot(modval ~ smoker, data = w)


Evaluate your models

Take the difference between the response value and the model value, and assign that to a variable called resid

g = transform(g, resid = height - modval)
m = transform(m, resid = Age - modval)
i = transform(i, resid = Cost - modval)
w = transform(w, resid = outcome - modval)
## Warning: - not meaningful for factors

What's special about w?

For each of the first three settings, evaluate your residuals for bias and variance. Let's find who has the best model.

For my g model:

mean(resid, data = g)
## [1] 1.172
var(resid, data = g)
## [1] 6.292

Write down the bias and variance for each person's model in sorted order (order by variance).

Fitting models

The models for each setting differ from person to person in the class depending on the parameters that we choose.

We can pick the “best” model for the class by picking the one with the smallest bias and variance.

Fitting is the process of doing this automatically.

The mm() function will fit models in one step. You specify the response and explanatory variables and the data you want to use in fitting the model.

mod1 = mm(height ~ sex, data = g)
mod2 = mm(Age ~ Race, data = m)
mod3 = mm(Cost ~ Age, data = i)

We can't fit the smoking data this way because the outcomes aren't quantitative. So the subtraction that we use in defining the residuals doesn't apply.
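
If you do want a numerical version, one workaround (a sketch; it assumes the outcome variable has a level named "Dead") is to recode the outcome as 0/1, so that the groupwise model values become proportions:

# Sketch: recode the two-level outcome as 0/1 (assumes a level named "Dead")
w = transform(w, dead01 = as.numeric(outcome == "Dead"))
mod4 = mm(dead01 ~ smoker, data = w)
mean(resid(mod4)^2)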

How do the bias and variance of the residuals in the fitted models compare to those from the models that you just made up?

CONCLUSION: It makes sense to fit models rather than just make them up when you have a choice.

Effect Size

Comparing the fitted parameters gives us a way to quantify the effect size.
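
For instance, with the groupwise-mean idiom used earlier, the effect of sex on height in the Galton data is just the difference between the two group means:

diff(mean(height ~ sex, data = g))  # difference between the F and M mean heights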

Another way to quantify the effect size is to look at the variance of the fitted model values.

See if you can find a relationship between the variance of the fitted model values, the variance of the residuals, and the variance of the response variable.
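
Here is one way to check, using the by-hand CPS85 objects from above (the small discrepancy comes from the rounded-off means that were typed in):

var(means, data = cps) + var(resids)  # variance of model values + variance of residuals
var(wage, data = cps)                 # compare: variance of the response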

Crossing Groups

Use mm() to fit wage against both sex and sector in the CPS85 data. Explain the structure of the results.
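
A sketch of what that might look like, assuming mm() accepts the same & notation for crossing grouping variables that other mosaic formula functions use:

mod5 = mm(wage ~ sex & sector, data = cps)  # assumption: & crosses the two groupings
mean(resid(mod5)^2)
sd(resid(mod5))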

Going further

We'll be able to fit models beyond the groupwise models, including quantitative explanatory variables. For instance:

lm(Cost ~ Sex + Age + Coverage, data = i)
## 
## Call:
## lm(formula = Cost ~ Sex + Age + Coverage, data = i)
## 
## Coefficients:
## (Intercept)         SexM          Age     Coverage  
##     -200.54        10.23         3.35         1.82

Try adding more variables, and show that the variance of the residuals goes down as you add more terms.
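
One way to see this, reusing the AARP insurance data and the variables from the lm() example above:

# The variance of the residuals should shrink as explanatory terms are added
var(resid(lm(Cost ~ Age, data = i)))
var(resid(lm(Cost ~ Age + Sex, data = i)))
var(resid(lm(Cost ~ Sex + Age + Coverage, data = i)))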