These pictures are representations of mood. They capture some aspects of mood but not every nuance.
I call my model a mathematical model since it is governed by two mathematical quantities: the curvature of the mouth and the slope of the (left) eyebrow.
The word is also a model. That we can communicate without the word is evidence that the cartoon face really is representing something.
More skilled modelers can add more details — more parameters, more levels to the parameters — to get a better representation.
Important components of statistical models:
Example: wage varies from person to person.
Example: wage might depend on sector of the economy or sex or something else.
cps = fetchData("CPS85")
## Data CPS85 found in package.
We can quantify the amount of variation using the standard deviation:
sd(wage, data = cps)
##  5.139
Example: We might model wage by sex as “Men make $3 per hour, women make $5 per hour”
You'll never have to do this, but here's one way to implement that model:
fetchData("m155development.R") # some new software
## Retrieving from http://www.mosaic-web.org/go/datasets/m155development.R
##  TRUE
mymod = listFun(M = 3, F = 5)
This is a function that takes “M” or “F” as an input and returns the model value.
Let's add these as variable
modvals to the data
cps = transform(cps, modvals = mymod(sex))
resids = with(data = cps, wage - modvals)
One way to quantify the “goodness” of a model is with how far off the residuals are from their ideal value: zero. The mean square residual is a good measure of this:
##  55.53
As with the arrows, there are two components to this: bias and variance.
mean(resids) # bias, the systematic error
##  5.106
var(resids) # the random error
##  29.51
sd(resids) # the square root of the variance
##  5.432
The above model has a big bias. We want to choose the model values to make the sum of squared errors as small as possible. Ideally, we can make the bias zero. This is actually easy. We do it by fitting the model, adjusting the parameters to match the data as closely as possible. The
mm() function will fit model values to the data.
mean(wage ~ sex, data = cps)
## F M ## 7.879 9.995
f2 = listFun(M = 9.995, F = 7.879) cps = transform(cps, means = f2(sex)) resids = with(cps, wage - means) mean(resids^2) # This model is better than the earlier one!
##  25.25
mean(resids) # No bias!
##  -0.0001124
##  5.03
mm() function does these calculations for us in a convenient way, allowing us to extract the residuals and fitted model values without all the work.
mod = mm(wage ~ sex, data = cps) mean(resid(mod)^2)
##  25.25
mean(resid(mod)) # No bias!
##  -1.608e-16
##  5.03
Any difference between the “by hand” version and the mm version is due to rounding off when typing in the mean values. sd(resid)
A handout for students
We're going to build some groupwise models. Later on, we're going to fit models, that is, let the data dictate the model parameters in some optimal way.
We'll be using some new software, so make sure to load it:
Construct each of these models but make up your own parameters, don't just use the ones here. (Of course, you won't have a solid idea of what the parameters should be. Just use your best judgment.) Test out your function by giving it, in quotes, the level as an input.
height = listFun(F = 63, M = 68) height("F") # testing it for females
## F ## 63
age = listFun(Black = 25, White = 28, Hispanic = 22) age("Hispanic")
## Hispanic ## 22
cost = listFun(F = 20, M = 24) cost("M")
## M ## 24
dead = listFun(Yes = 0.8, No = 0.4) dead("Yes")
## Yes ## 0.8
INSTRUCTIONS TO STUDENTS: Make your own model for each of these. Then we're going to have a competition.
We happen to have some data that reflects each of these situations. The data sets are:
g = fetchData("Galton")
## Data Galton found in package.
m = fetchData("marriage.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/marriage.csv
i = fetchData("AARP-insurance.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/AARP-insurance.csv
w = fetchData("whickham.csv")
## Retrieving from http://www.mosaic-web.org/go/datasets/whickham.csv
Add another variable to each data frame that gives the model value. Call it
modval so that we can remember it.
Your statements will look like this:
g = transform(g, modval = height(sex)) m = transform(m, modval = age(Race)) i = transform(i, modval = cost(Sex)) w = transform(w, modval = dead(smoker))
Do a scatter plot of the model value against the response value.
xyplot(modval ~ height, data = m)
## Error: object of type 'closure' is not subsettable
Explain the pattern you see.
What's special about the
w data and it's model values:
xyplot(modval ~ smoker, data = w)
Take the difference between the response value and the model value, and assign that to a variable called resid
g = transform(g, resid = height - modval) m = transform(m, resid = Age - modval) i = transform(i, resid = Cost - modval) w = transform(w, resid = outcome - modval)
## Warning: - not meaningful for factors
What's special about
For each of the first three settings, evaluate your residuals for bias and variance. Let's find who has the best model.
mean(resid, data = g)
##  1.172
var(resid, data = g)
##  6.292
Write down the bias and variance for each person's model in sorted order (order by variance).
The models for each setting differ from person to person in the class depending on the parameters that we choose.
We can pick the “best” model for the class by picking the one with the smallest bias and variance.
Fitting is the process of doing this automatically.
mm() function will fit models in one step. You specify the response and explanatory variables and the data you want to use in fitting the model.
mod1 = mm(height ~ sex, data = g) mod2 = mm(Age ~ Race, data = m) mod3 = mm(Cost ~ Age, data = i)
We can't fit the smoking data this way because the outcomes aren't quantitative. So the subtraction that we use in defining the residuals doesn't apply.
How does the bias and variance of the residuals in the fitted models compare to the models that you just made up.
CONCLUSION: It makes sense to fit models rather than just make them up when you have a choice.
Comparing the fitted parameters gives us a way to quantify the effect size.
Another way to quantify the effect size is to look at the variance of the fitted model values.
See if you can find a relationship between the variance of the fitted model values, the variance of the residuals, and the variance of the response variable.
mm() to fit wage against both sex and sector in the CPS85 data. Explain the structure of the results.
We'll be able to fit models beyond the groupwise models, including quantitative explanatory variables. For instance:
lm(Cost ~ Sex + Age + Coverage, data = i)
## ## Call: ## lm(formula = Cost ~ Sex + Age + Coverage, data = i) ## ## Coefficients: ## (Intercept) SexM Age Coverage ## -200.54 10.23 3.35 1.82
Show that the variance of residuals goes down as you add more terms.