- A model is a representation for a purpose. What the best model is depends on your purpose for constructing the model. This requires judgment.
- Statistical models (a) attempt for account for case-to-case variation in a variable (b) are based on data and © put the variation that's accounted for in the context of the variation that remains unexplained.
- The usual form of a statistical model is response \( = f \)(explanatory). That is, we imagine that for each value of the explanatory variables there is a unique model output. Reality generally doesn't work this way.

These pictures are representations of mood. They capture some aspects of mood but not every nuance.

- Which one of these is the best representation of your mood today?
- Which word would you use to label each graph?

I call my model a mathematical model since it is governed by two mathematical quantities: the curvature of the mouth and the slope of the (left) eyebrow.

The word is also a model. That we can communicate without the word is evidence that the cartoon face really is representing something.

More skilled modelers can add more details — more parameters, more levels to the parameters — to get a better representation.

Important components of statistical models:

- The
**response variable**. The quantity that varies from case to case whose spread is something we want to account for.

Example: **wage** varies from person to person.

**Explanatory variables**. Other variables that we think may account for the response. For the present, we'll work with categorical variables. Each case falls into one group, and our model is that the groups are different from one another but the individuals within the group are all the same.

Example: wage might depend on sector of the economy or sex or something else.

```
cps = fetchData("CPS85")
```

```
## Data CPS85 found in package.
```

We can quantify the amount of variation using the standard deviation:

```
sd(wage, data = cps)
```

```
## [1] 5.139
```

- The
**model values**. For each group, there will be one model value. That is, according to our model, each case has a response value that is the value for that case's group.

Example: We might model wage by sex as “Men make $3 per hour, women make $5 per hour”

You'll never have to do this, but here's one way to implement that model:

```
fetchData("m155development.R") # some new software
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
```

```
## [1] TRUE
```

```
mymod = listFun(M = 3, F = 5)
```

This is a function that takes “M” or “F” as an input and returns the model value.

Let's add these as variable `modvals`

to the data

```
cps = transform(cps, modvals = mymod(sex))
```

- The
**residuals**. These tell how far the model values are from the actual values;

```
resids = with(data = cps, wage - modvals)
```

One way to quantify the “goodness” of a model is with how far off the residuals are from their ideal value: zero. The mean square residual is a good measure of this:

```
mean(resids^2)
```

```
## [1] 55.53
```

As with the arrows, there are two components to this: bias and variance.

```
mean(resids) # bias, the systematic error
```

```
## [1] 5.106
```

```
var(resids) # the random error
```

```
## [1] 29.51
```

```
sd(resids) # the square root of the variance
```

```
## [1] 5.432
```

The above model has a big bias. We want to choose the model values to make the sum of squared errors as small as possible. Ideally, we can make the bias zero. This is actually easy. We do it by **fitting** the model, adjusting the parameters to match the data as closely as possible. The `mm()`

function will fit model values to the data.

- The
**fitted model values**. Let's use as model values the means of the groups.

```
mean(wage ~ sex, data = cps)
```

```
## F M
## 7.879 9.995
```

```
f2 = listFun(M = 9.995, F = 7.879)
cps = transform(cps, means = f2(sex))
resids = with(cps, wage - means)
mean(resids^2) # This model is better than the earlier one!
```

```
## [1] 25.25
```

```
mean(resids) # No bias!
```

```
## [1] -0.0001124
```

```
sd(resids)
```

```
## [1] 5.03
```

The `mm()`

function does these calculations for us in a convenient way, allowing us to extract the residuals and fitted model values without all the work.

```
mod = mm(wage ~ sex, data = cps)
mean(resid(mod)^2)
```

```
## [1] 25.25
```

```
mean(resid(mod)) # No bias!
```

```
## [1] -1.608e-16
```

```
sd(resid(mod))
```

```
## [1] 5.03
```

Any difference between the “by hand” version and the mm version is due to rounding off when typing in the mean values. sd(resid)

We're going to build some groupwise models. Later on, we're going to **fit** models, that is, let the data dictate the model parameters in some *optimal* way.

We'll be using some new software, so make sure to load it:

```
fetchData("m155development.R")
```

Construct each of these models but make up **your own parameters**, don't just use the ones here. (Of course, you won't have a solid idea of what the parameters should be. Just use your best judgment.) Test out your function by giving it, in quotes,
the level as an input.

- A model of height versus sex. Levels for sex: M and F.

```
height = listFun(F = 63, M = 68)
height("F") # testing it for females
```

```
## F
## 63
```

- Age at marriage as s function of race. Levels of Race: Black, Hispanic, White

```
age = listFun(Black = 25, White = 28, Hispanic = 22)
age("Hispanic")
```

```
## Hispanic
## 22
```

- Monthly cost of buying $10-50K of life insurance from the AARP versus sex. Level for sex: F and M

```
cost = listFun(F = 20, M = 24)
cost("M")
```

```
## M
## 24
```

- Probability of dying in the next 30 years versus smoking status. Levels for smoker: Yes and No

```
dead = listFun(Yes = 0.8, No = 0.4)
dead("Yes")
```

```
## Yes
## 0.8
```

INSTRUCTIONS TO STUDENTS: Make your own model for each of these. Then we're going to have a competition.

We happen to have some data that reflects each of these situations. The data sets are:

```
g = fetchData("Galton")
```

```
## Data Galton found in package.
```

```
m = fetchData("marriage.csv")
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
```

```
i = fetchData("AARP-insurance.csv")
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/AARP-insurance.csv
```

```
w = fetchData("whickham.csv")
```

```
## Retrieving from http://www.mosaic-web.org/go/datasets/whickham.csv
```

- See what the variables are and what they are named. In particular, note the names of the relevant response variable and explanatory variable for each of the 4 cases.
- Make sure you understand what the variables stand for and their levels.
- Pick out one or more other variables that might play a valuable role in accounting for variation in the response variable.

Add another variable to each data frame that gives the model value. Call it `modval`

so that we can remember it.

Your statements will look like this:

```
g = transform(g, modval = height(sex))
m = transform(m, modval = age(Race))
i = transform(i, modval = cost(Sex))
w = transform(w, modval = dead(smoker))
```

Do a scatter plot of the model value against the response value.

```
xyplot(modval ~ height, data = m)
```

```
## Error: object of type 'closure' is not subsettable
```

Explain the pattern you see.

What's special about the `w`

data and it's model values:

```
xyplot(modval ~ smoker, data = w)
```

Take the difference between the response value and the model value, and assign that to a variable called resid

```
g = transform(g, resid = height - modval)
m = transform(m, resid = Age - modval)
i = transform(i, resid = Cost - modval)
w = transform(w, resid = outcome - modval)
```

```
## Warning: - not meaningful for factors
```

What's special about `w`

?

For each of the first three settings, evaluate your residuals for bias and variance. Let's find who has the best model.

For my `g`

model:

```
mean(resid, data = g)
```

```
## [1] 1.172
```

```
var(resid, data = g)
```

```
## [1] 6.292
```

Write down the bias and variance for each person's model in sorted order (order by variance).

The models for each setting differ from person to person in the class depending on the **parameters** that we choose.

We can pick the “best” model for the class by picking the one with the smallest bias and variance.

**Fitting** is the process of doing this automatically.

The `mm()`

function will fit models in one step. You specify the response and explanatory variables and the data you want to use in fitting the model.

```
mod1 = mm(height ~ sex, data = g)
mod2 = mm(Age ~ Race, data = m)
mod3 = mm(Cost ~ Age, data = i)
```

We can't fit the smoking data this way because the outcomes aren't quantitative. So the subtraction that we use in defining the residuals doesn't apply.

How does the bias and variance of the residuals in the fitted models compare to the models that you just made up.

CONCLUSION: It makes sense to fit models rather than just make them up when you have a choice.

Comparing the fitted parameters gives us a way to quantify the **effect size**.

Another way to quantify the effect size is to look at the variance of the **fitted model values**.

See if you can find a relationship between the variance of the fitted model values, the variance of the residuals, and the variance of the response variable.

Use `mm()`

to fit wage against both sex and sector in the CPS85 data. Explain the structure of the results.

We'll be able to fit models beyond the groupwise models, including quantitative explanatory variables. For instance:

```
lm(Cost ~ Sex + Age + Coverage, data = i)
```

```
##
## Call:
## lm(formula = Cost ~ Sex + Age + Coverage, data = i)
##
## Coefficients:
## (Intercept) SexM Age Coverage
## -200.54 10.23 3.35 1.82
```

Show that the variance of residuals goes down as you add more terms.