Teaching Statistical Modeling with the mosaic Package

Daniel Kaplan

http://www.mosaic-web.org/StatisticalModeling

Starting in Fall 2012, the software for Statistical Modeling: A Fresh Approach has been integrated into the mosaic package. This means that the core software is now part of the standard distribution channel for R. It also provides new and better ways of performing some common statistical operations.

The main student-oriented feature of the mosaic package is an orientation around modeling syntax, even for operations that traditionally have not used this syntax. For the instructor, this means that many of the patterns you are used to with operators such as lm() and glm() are now extended down to operators such as mean(). Students will see a substantially unified syntax that should make it easier for them to move from basic statistical operations (e.g. tallys, group means, etc.) to modeling operations.

This document describes some of the ways that instructors might want to teach differently with mosaic than with the original software for Statistical Modeling. It's written with the instructor's point of view in mind, particularly the transition that an instructor may want to make between the old and the new way of doing things.

Installing the Package

The package can be installed in the standard way. You load it in the standard way as well:

require(mosaic)

Major New Features of mosaic

The mosaic includes almost all of the functionality of the original Statistical Modeling software, such as do(). Here are the major additions:

  1. Use of the modeling syntax systematically for a wide range of calculations.
  2. Avoiding need for operators such as $ for accessing individual variables. (This has some gotchas for people used to the $ way of doing things. See below.)
  3. Inclusion of many data sets as part of the package and inter-operability with the fetchData() function that makes it easy for students to access new data sets from instructors' web sites.
  4. Implementation of calculus operations including symbolic differentiation and integration. (This is likely not of direct interest to most statistics instructors, but is worth pointing out for those interested in providing greater unity in the student experience.)

The Modeling Syntax

In the past, simple calculations such as means were performed on individual data frames, like this calculation of the mean wage in the CPS85 data.

mean(CPS85$wage)
## [1] 9.024

The mosaic package allows the user to write this in a way that's more like the syntax of lm():

mean(wage, data = CPS85)
## [1] 9.024

This is not a big deal in itself (and could be performed in any event with the built-in with function, as in with(CPS85, mean(wage))), but mosaic enables the calculation to be easily extended to group-wise means.

mean(wage ~ sex, data = CPS85)
##     F     M 
## 7.879 9.995 
mean(wage ~ sex + married, data = CPS85)
## F.Married M.Married  F.Single  M.Single 
##     7.684    10.876     8.260     8.355 

This makes it easy for beginning students to explore the idea of “breaking things down by ….'' It also provides a short route to resampling-based inference:

trials <- do(500) * mean(wage ~ sex, data = resample(CPS85))
sd(trials)
##      F      M 
## 0.3048 0.3043 
confint(trials)
##   name lower  upper
## 1    F 7.299  8.497
## 2    M 9.400 10.596

mosaic adds the same modeling syntax to median(), sd(), var(), min(), and max().

When you use the modeling syntax for some operation that has not been provided by mosaic, you will often get an error message.

IQR(wage ~ sex, data = CPS85)
## Error: unused argument(s) (data = CPS85)

If you find it frustrating that an operator hasn't yet been implemented with the formula interface, that's a sign that you find the formula interface more natural than the non-formula interface. In a pinch, you can approximate the formula syntax like this (although the format of the output isn't ideally suited to use in do().)

aggregate(wage ~ sex, data = CPS85, FUN = IQR)
##   sex wage
## 1   F 5.25
## 2   M 7.00

Note that the default way of handling missing data in mosaic is to ignore it, the opposite of the behavior of the mean and related operators in base R, but consistent with lm(). This makes it easier for students to get started even in the presence of missing data.

Groupwise Means as Models

To help students develop the idea of explanatory and response variables, model values, residuals, \( R^2 \), and so on, mosaic adds a new modeling function mm() that calculates groupwise means but allows the standard operations such as fitted() and resid() to be applied directly. For instance,

mod = mm(wage ~ sex, data = CPS85)
head(fitted(mod))
## [1] 9.995 9.995 7.879 7.879 9.995 7.879
summary(mod)
## Groupwise Model
## Call:  wage ~ sex 
## 
##   group 2.5 % 97.5 %
## 1     F 7.247  8.511
## 2     M 9.413 10.577
## 
## sigma:  5.03 
## r-squared: 0.0422 
## Adj. r-squared: 0.0404 

When there are multiple explantory variables, mm() crosses all of the factors involved, as in

mod2 = mm(wage ~ sex + married, data = CPS85)
summary(mod2)
## Groupwise Model
## Call:  wage ~ sex + married 
## 
##       group  2.5 % 97.5 %
## 1 F.Married  6.918  8.450
## 2 M.Married 10.165 11.587
## 3  F.Single  7.190  9.329
## 4  M.Single  7.385  9.325
## 
## sigma:  4.96 
## r-squared: 0.0731 
## Adj. r-squared: 0.0679 

This is equivalent to including the interaction terms in a linear model. Instructors may want to contrast the behavior of mm() and lm() to reinforce the idea of main effects and interaction terms. That is, compare the above to

# Now with a linear model
mod3 = lm(wage ~ sex + married, data = CPS85)
summary(mod3)
## 
## Call:
## lm(formula = wage ~ sex + married, data = CPS85)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -9.38  -3.45  -1.14   2.22  37.36 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.255      0.356   23.22  < 2e-16 ***
## sexM             2.128      0.435    4.89  1.3e-06 ***
## marriedSingle   -1.112      0.456   -2.44    0.015 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 5.01 on 531 degrees of freedom
## Multiple R-squared: 0.0528,  Adjusted R-squared: 0.0492 
## F-statistic: 14.8 on 2 and 531 DF,  p-value: 5.62e-07 
## 

Tables and Tallies

mosaic introduces a new function, tally() which unifies the operation of the base table(), proptable() and perctable() around a formula interface. For instance,

tally(~sex, data = CPS85)
## 
##     F     M Total 
##   245   289   534 
tally(~sex, data = CPS85, format = "proportion")
## 
##      F      M  Total 
## 0.4588 0.5412 1.0000 

Notice that the variable to break down by follows the ~ sign. This is for consistency with the lattice graphics operators, the developers' reasoning being that tables are in some sense a graphical presentation.

To get a joint tally, put both grouping variables after the ~, as in

tally(~sex + married, data = CPS85)
##        married
## sex     Married Single Total
##   F         162     83   245
##   M         188    101   289
##   Total     350    184   534

The advantage of the formula-based syntax comes when doing conditional tallys.

tally(sex ~ married, data = CPS85)
##        married
## sex     Married Single
##   F      0.4629 0.4511
##   M      0.5371 0.5489
##   Total  1.0000 1.0000
tally(married ~ sex, data = CPS85)
##          sex
## married        F      M
##   Married 0.6612 0.6505
##   Single  0.3388 0.3495
##   Total   1.0000 1.0000

Or, if you prefer, use | to signify the given” variable even more explicitly.

tally(~sex | married, data = CPS85)
##        married
## sex     Married Single
##   F      0.4629 0.4511
##   M      0.5371 0.5489
##   Total  1.0000 1.0000
tally(~married | sex, data = CPS85)
##          sex
## married        F      M
##   Married 0.6612 0.6505
##   Single  0.3388 0.3495
##   Total   1.0000 1.0000

This makes it easier to sort out the conditioning variable.

Prediction and Model Values

Experienced R users will be familiar with the predict() family of functions which allows you to find the model values for given levels of input.

mosaic introduces a new interface, makeFun(), that produces a function that implements the model formula. For example:

mod = lm(wage ~ poly(age, 2) * sex, data = CPS85)
fmod = makeFun(mod)

The resulting function has a standard mathematical format and can be used in a standard way, e.g. plotting

plotFun(fmod(age, sex = "M") ~ age, age.lim = range(20, 60))
plotFun(fmod(age, sex = "F") ~ age, add = TRUE, col = "red")

plot of chunk unnamed-chunk-17

You can even construct new functions from the fitted model functions.

FMratio = makeFun(fmod(age, sex = "F")/fmod(age, sex = "M") ~ age)
plotFun(FMratio(age) ~ age, age.lim = range(20, 60))

plot of chunk unnamed-chunk-18

Adding Variables to a Data Frame

Many experienced R users will use $ to add a new variable to a data frame, e.g.

CPS85$euros = with(CPS85, wage/1.234)  # exchange rate on Aug 20, 2012

You can see here that I'm trying to avoid the use of $ in the right side of the assignment, which of course could have been written CPS85$wage/1.234. In generally, I am trying to avoid any use of $. As such, I prefer the following syntax, which is standard in R but unfamiliar to many instructors:

CPS85 = transform(CPS85, euros = wage/1.234)

You may sometimes be tempted to write

euros = with(CPS85, wage/1.234)

The problem here is that euros is now an independent variable, subject to being accidentally reassigned and not available to operators such as predict().

As always … Avoid attach()

Here is yet aother plea for instructors to avoid the use of attach(). Although attach() may seem easy to use, ultimately it leads to ambiguity and confusion. Data frames and the formula syntax are the right mechanism to bring variables together for a set of cases. Unpackaging a data frame and creating separate variables creates problems with re-assignment and documentation.

Maintenance, Changes, and the Developers

The mosaic package has been developed mainly by Randall Pruim (at Calvin College), Danny Kaplan (at Macalester College), and Nicholas Horton (at Smith College). If you're unhappy with how something works, or have found a bug, or have a suggestion for some useful, new functionality, please contact
any of the developers. My email is dtkaplan@gmail.com.

The source files are available at https://github.com/rpruim/mosaic.

We gratefully acknowledge the support of the US National Science Foundation for Project MOSAIC, NSF-DUE-0920350