mosaic Packagehttp://www.mosaic-web.org/StatisticalModeling
Starting in Fall 2012, the software for Statistical Modeling: A Fresh Approach has been integrated into the mosaic package. This means that the core software is now part of the standard distribution channel for R. It also provides new and better ways of performing some common statistical operations.
The main student-oriented feature of the mosaic package is an orientation around modeling syntax, even for operations that traditionally have not used this syntax. For the instructor, this means that many of the patterns you are used to with operators such as lm() and glm() are now extended down to operators such as mean(). Students will see a substantially unified syntax that should make it easier for them to move from basic statistical operations (e.g. tallys, group means, etc.) to modeling operations.
This document describes some of the ways that instructors might want to teach differently with mosaic than with the original software for Statistical Modeling. It's written with the instructor's point of view in mind, particularly the transition that an instructor may want to make between the old and the new way of doing things.
The package can be installed in the standard way. You load it in the standard way as well:
require(mosaic)
mosaicThe mosaic includes almost all of the functionality of the original Statistical Modeling software, such as do(). Here are the major additions:
$ for accessing individual variables. (This has some gotchas for people used to the $ way of doing things. See below.)fetchData() function that makes it easy for students to access new data sets from instructors' web sites.In the past, simple calculations such as means were performed on individual data frames, like this calculation of the mean wage in the CPS85 data.
mean(CPS85$wage)
## [1] 9.024
The mosaic package allows the user to write this in a way that's more like the syntax of lm():
mean(wage, data = CPS85)
## [1] 9.024
This is not a big deal in itself (and could be performed in any event with the built-in with function, as in with(CPS85, mean(wage))), but mosaic enables the calculation to be easily extended to group-wise means.
mean(wage ~ sex, data = CPS85)
## F M
## 7.879 9.995
mean(wage ~ sex + married, data = CPS85)
## F.Married M.Married F.Single M.Single
## 7.684 10.876 8.260 8.355
This makes it easy for beginning students to explore the idea of “breaking things down by ….'' It also provides a short route to resampling-based inference:
trials <- do(500) * mean(wage ~ sex, data = resample(CPS85))
sd(trials)
## F M
## 0.3048 0.3043
confint(trials)
## name lower upper
## 1 F 7.299 8.497
## 2 M 9.400 10.596
mosaic adds the same modeling syntax to median(), sd(), var(), min(), and max().
When you use the modeling syntax for some operation that has not been provided by mosaic, you will often get an error message.
IQR(wage ~ sex, data = CPS85)
## Error: unused argument(s) (data = CPS85)
If you find it frustrating that an operator hasn't yet been implemented with the formula interface, that's a sign that you find the formula interface more natural than the non-formula interface. In a pinch, you can approximate the formula syntax like this (although the format of the output isn't ideally suited to use in do().)
aggregate(wage ~ sex, data = CPS85, FUN = IQR)
## sex wage
## 1 F 5.25
## 2 M 7.00
Note that the default way of handling missing data in mosaic is to ignore it, the opposite of the behavior of the mean and related operators in base R, but consistent with lm(). This makes it easier for students to get started even in the presence of missing data.
To help students develop the idea of explanatory and response variables, model values, residuals, \( R^2 \), and so on, mosaic adds a new modeling function mm() that calculates groupwise means but allows the standard operations such as fitted() and resid() to be applied directly. For instance,
mod = mm(wage ~ sex, data = CPS85)
head(fitted(mod))
## [1] 9.995 9.995 7.879 7.879 9.995 7.879
summary(mod)
## Groupwise Model
## Call: wage ~ sex
##
## group 2.5 % 97.5 %
## 1 F 7.247 8.511
## 2 M 9.413 10.577
##
## sigma: 5.03
## r-squared: 0.0422
## Adj. r-squared: 0.0404
When there are multiple explantory variables, mm() crosses all of the factors involved, as in
mod2 = mm(wage ~ sex + married, data = CPS85)
summary(mod2)
## Groupwise Model
## Call: wage ~ sex + married
##
## group 2.5 % 97.5 %
## 1 F.Married 6.918 8.450
## 2 M.Married 10.165 11.587
## 3 F.Single 7.190 9.329
## 4 M.Single 7.385 9.325
##
## sigma: 4.96
## r-squared: 0.0731
## Adj. r-squared: 0.0679
This is equivalent to including the interaction terms in a linear model. Instructors may want to contrast the behavior of mm() and lm() to reinforce the idea of main effects and interaction terms. That is, compare the above to
# Now with a linear model
mod3 = lm(wage ~ sex + married, data = CPS85)
summary(mod3)
##
## Call:
## lm(formula = wage ~ sex + married, data = CPS85)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.38 -3.45 -1.14 2.22 37.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.255 0.356 23.22 < 2e-16 ***
## sexM 2.128 0.435 4.89 1.3e-06 ***
## marriedSingle -1.112 0.456 -2.44 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.01 on 531 degrees of freedom
## Multiple R-squared: 0.0528, Adjusted R-squared: 0.0492
## F-statistic: 14.8 on 2 and 531 DF, p-value: 5.62e-07
##
mosaic introduces a new function, tally() which unifies the operation of the base table(), proptable() and perctable() around a formula interface. For instance,
tally(~sex, data = CPS85)
##
## F M Total
## 245 289 534
tally(~sex, data = CPS85, format = "proportion")
##
## F M Total
## 0.4588 0.5412 1.0000
Notice that the variable to break down by follows the ~ sign. This is for consistency with the lattice graphics operators, the developers' reasoning being that tables are in some sense a graphical presentation.
To get a joint tally, put both grouping variables after the ~, as in
tally(~sex + married, data = CPS85)
## married
## sex Married Single Total
## F 162 83 245
## M 188 101 289
## Total 350 184 534
The advantage of the formula-based syntax comes when doing conditional tallys.
tally(sex ~ married, data = CPS85)
## married
## sex Married Single
## F 0.4629 0.4511
## M 0.5371 0.5489
## Total 1.0000 1.0000
tally(married ~ sex, data = CPS85)
## sex
## married F M
## Married 0.6612 0.6505
## Single 0.3388 0.3495
## Total 1.0000 1.0000
Or, if you prefer, use | to signify the given” variable even more explicitly.
tally(~sex | married, data = CPS85)
## married
## sex Married Single
## F 0.4629 0.4511
## M 0.5371 0.5489
## Total 1.0000 1.0000
tally(~married | sex, data = CPS85)
## sex
## married F M
## Married 0.6612 0.6505
## Single 0.3388 0.3495
## Total 1.0000 1.0000
This makes it easier to sort out the conditioning variable.
Experienced R users will be familiar with the predict() family of functions which allows you to find the model values for given levels of input.
mosaic introduces a new interface, makeFun(), that produces a function that implements the model formula. For example:
mod = lm(wage ~ poly(age, 2) * sex, data = CPS85)
fmod = makeFun(mod)
The resulting function has a standard mathematical format and can be used in a standard way, e.g. plotting
plotFun(fmod(age, sex = "M") ~ age, age.lim = range(20, 60))
plotFun(fmod(age, sex = "F") ~ age, add = TRUE, col = "red")
You can even construct new functions from the fitted model functions.
FMratio = makeFun(fmod(age, sex = "F")/fmod(age, sex = "M") ~ age)
plotFun(FMratio(age) ~ age, age.lim = range(20, 60))
Many experienced R users will use $ to add a new variable to a data frame, e.g.
CPS85$euros = with(CPS85, wage/1.234) # exchange rate on Aug 20, 2012
You can see here that I'm trying to avoid the use of $ in the right side of the assignment, which of course could have been written CPS85$wage/1.234. In generally, I am trying to avoid any use of $. As such, I prefer the following syntax, which is standard in R but unfamiliar to many instructors:
CPS85 = transform(CPS85, euros = wage/1.234)
You may sometimes be tempted to write
euros = with(CPS85, wage/1.234)
The problem here is that euros is now an independent variable, subject to being accidentally reassigned and not available to operators such as predict().
attach()Here is yet aother plea for instructors to avoid the use of attach(). Although attach() may seem easy to use, ultimately it leads to ambiguity and confusion. Data frames and the formula syntax are the right mechanism to bring variables together for a set of cases. Unpackaging a data frame and creating separate variables creates problems with re-assignment and documentation.
The mosaic package has been developed mainly by Randall Pruim (at Calvin College), Danny Kaplan (at Macalester College), and Nicholas Horton (at Smith College). If you're unhappy with how something works, or have found a bug, or have a suggestion for some useful, new functionality, please contact
any of the developers. My email is dtkaplan@gmail.com.
The source files are available at https://github.com/rpruim/mosaic.
We gratefully acknowledge the support of the US National Science Foundation for Project MOSAIC, NSF-DUE-0920350