On June 5-6, 2014, eleven members of the Kenyon College faculty, from six departments in the natural and social sciences divisions, participated in a workshop to learn how to use the statistical computing language R in the classroom. Through instruction and extensive hands-on activities, we learned not only the basics of the R language itself, but also its place in a larger “software ecosystem” that includes web-based document and data distribution applications like Google Drive and Dropbox, as well as document-authoring tools that are fully integrated into the R Studio development environment.
Our workshop leader and “R Coach” was Dr. Daniel Kaplan, DeWitt-Wallace Professor of Mathematics, Statistics, and Computer Science at Macalester College. Danny is the leader of Project MOSAIC, a community of educators developing new ways of teaching mathematics, statistics, computation, and modeling to undergraduates. He provided a fast-paced and exciting set of pathways for bringing R to our students at Kenyon. Given the varied experience with R that the participants brought to the workshop, from nearly complete newbie to old-hat, Danny did a remarkable job of providing guidance and instruction pitched at a number of different levels. By having us use workshop time to actually develop course materials, Danny helped us overcome the “activation energy” that sometimes prevents instructors from adopting new technological tools.
Below is a summary of the workshop, with some additional remarks and examples of work. While I start with the beginning of the workshop, I later combine material across the two days by topic, as we circled back to particular topics repeatedly to address different levels of knowledge. Simon Garcia also generated minutes for the workshop that are more clearly chronological.
After giving us a brief overview of the workshop, Danny started with a quiz to assess both our level of familiarity with R and the varied ways in which we process and distribute our work. He wanted us to examine wage data from a file that he provided and send him a small summary. Here is how mine went:
library(mosaic)  # needed for the formula interface to mean()
cps <- read.csv("~/Downloads/cps.csv")  # import the wage data
mean(wage ~ sector, data = cps)  # mean wage by economic sector
## clerical const manag manuf other prof sales service
## 7.423 9.502 12.704 8.036 8.501 11.947 7.593 6.537
attach(cps)  # make the cps columns available by name
plot(wage ~ age)  # scatterplot of wage against age
lines(lowess(wage ~ age))  # add a smoothed (lowess) trend line
The quiz asked us to: 1. import data, 2. summarize mean values by groups, 3. plot a relationship, and 4. provide him (Danny) with a report via email. These are all basic things that we want our students to be able to do, but what was striking was that we all accomplished them in very different ways, and many of us faced significant procedural challenges.
After reviewing the difficulties that many of us ran into, Danny then said that he was going to start not by teaching us R, but by teaching us word processing. This was perhaps a risky move, but it paid off, because what he showed us was Rmarkdown, a markup language that is fully integrated into the R Studio environment. It allows you to insert executable R code directly into a document, as you can see above, and to publish the code and its resulting output as an HTML, PDF, or Word document.
Rmarkdown is remarkably easy to use, and it provides a fantastic means of sharing R code with both students and colleagues. In R Studio, when you open a new markdown document, it even includes some sample text and code, as a reminder of how to do it, and the Markdown Quick Reference (from the ? menu in the editor window) is very helpful.
As mentioned above, the key feature of Rmarkdown is that “chunks” of R code can be inserted directly into documents, so students (and others) can see both the code and the product of its execution. Chunks can be run individually in the “Source” pane of R Studio to check them; then the whole document can be compiled using the knitr package and published to any URL, the default being the publicly accessible RPubs server, which is where you are reading this report now.
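As a minimal example (reusing the cps wage data from the quiz above), a chunk in an Rmarkdown source file is set off by lines of three backticks; when the document is knit, both the code and its output appear in the finished document:

```{r}
mean(wage ~ sector, data = cps)
```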
One of the core activities of the workshop was the development of our own course materials, and we used Rmarkdown to generate and share these materials with one another at the close of the workshop. Several examples are linked below.
After an introduction to Rmarkdown, we began to explore the R language itself. We focused on many of the basic mathematical and statistical commands that our students will be using regularly in class. We also discussed the extensive array of packages that are available for R. In particular, we worked with the mosaic package, which Danny helped develop to facilitate education in mathematics, statistics, computation, and modeling.
One of the key innovations of the mosaic package is that it normalizes the syntax for calls to all of the basic statistical functions, putting them all in the same formula, data form:
goal(y ~ x, data=mydata)
Here, goal() is what you want R to do: the function that you call, whether computing a mean() or fitting an arbitrarily complex statistical model, e.g., using lm(). The formula is the y ~ x part of the statement, in which the tilde ~ specifies that the variable y should be considered as statistically dependent on the variable x. The data=mydata part of the statement simply tells R the data.frame where it can find the variables that the function needs to accomplish its goal().
While this syntax may seem abstract, it is incredibly useful pedagogically because it can help students to translate conventional statements like “On average, wages are higher in the managerial and professional sectors of the economy” into more abstract statistical hypotheses such as “mean wages depend on economic sector,” or, more concisely, mean(wage~sector, data=cps). Thinking in terms of statistical formulae also helps students to consider the different sources of variation in the measurements that they examine, because the formula statements can incorporate combinations of multiple variables. The utility of the mosaic package is that it normalizes almost all of the basic mathematical, statistical, and graphical functions so that they adhere to the formula, data syntax. This makes R easier to use and provides a pedagogically exciting way to help students think about data!
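To see the payoff, compare a few different goals applied with the identical template (a minimal sketch reusing the cps wage data from the quiz, and assuming the mosaic package is loaded):

library(mosaic)
mean(wage ~ sector, data = cps)    # group means
sd(wage ~ sector, data = cps)      # group standard deviations
bwplot(wage ~ sector, data = cps)  # box-and-whisker plots by group
lm(wage ~ sector, data = cps)      # a linear model of the same relationship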
With the formula syntax and some basic commands in our toolboxes, we began to explore some of the data sets that are included in R through the datasets and mosaic packages. The following commands will open text files in R Studio, detailing the available data sets. Here we will not print the output.
data(package="mosaic")
data(package="datasets")
We focused on the dataset Galton which provides the parent and child height measurements that Francis Galton used in the development of the correlation coefficient and linear regression. Because height is presumably heritable from both parents, but fathers are systematically taller than mothers, Galton developed a “midparent” measure of height that took into account the heights of both fathers and mothers. After loading the Galton data, we added the new variable to our data.frame.
data(Galton)
Galton = transform(Galton, midparent = (father + 1.08*mother)/2)  # Galton's midparent height: the average of the father's height and 1.08 times the mother's; midparent is added as a variable to our working data frame
This activity also gave us the chance to talk about data.frame objects in R and how tabular data are stored and accessed by the program. Further, a discussion of data.frames also provides the opportunity to discuss (with students) data collection, curation, and management. Derived, transformed variables like midparent can be calculated during the analysis of data, but they are distinct from the raw, measured observations. In line with this perspective, when you view a data.frame in R, it is not editable. Usefully, this separates the data collection/curation step from the data analysis step, which is in line with statistical best practices.
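A few quick commands let students inspect how a data.frame stores tabular data without editing it (a minimal sketch; these are standard base R functions):

str(Galton)   # structure: one column per variable, with types
head(Galton)  # the first six rows of observations
nrow(Galton)  # the number of observations (rows)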
Yet data collection is still part of the process. Danny gave us the idea of using a shared GoogleDoc spreadsheet to have students add their data simultaneously. They can (all) watch each other do it on the classroom screen and react to one another’s data editing strategies, which provides yet another opportunity to discuss best practices, from variable names (case sensitive, no spaces, brief) to collecting data in a proper tabular form as described above, to collecting ancillary information and collector identifiers. The instructor can share the GoogleDoc link with students for editing (or students can install a GoogleDoc folder on their computer; Drew needs to talk to LBIS about this option further), then publish the data as a .csv file and share that link with students as well. Students can then access the file via the “Import Dataset” menu in the “Environment” pane of R Studio, or they can simply read.csv("http://url.of.our.data").
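As a sketch, the whole import step then fits in two lines (the URL below is the same placeholder used above, standing in for whatever link the published GoogleDoc provides):

classData <- read.csv("http://url.of.our.data")  # placeholder for the published .csv link
head(classData)                                  # always check that the import worked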
Returning to the Galton dataset, we explored some of the graphical capabilities of R. Danny emphasized that there are actually three separate graphical platforms for R: base, lattice, and ggplot2 (the “gg” stands for “grammar of graphics”). We focused on lattice graphics, with some extensions from the mosaic package. For example, we made a plot of child height as a function of midparent height, separated by the sex of the child:
# xyplot() is from the lattice package, loaded automatically with mosaic
xyplot(height ~ midparent, groups = sex, data = Galton,
       xlab = "Midparent height (inches)", ylab = "Child height (inches)",
       auto.key = TRUE)
Using the | operator in the formula statement, we can also make separate “facets” (or panels) for the two sexes:
xyplot(height ~ midparent | sex, data = Galton,
       xlab = "Midparent height (inches)", ylab = "Child height (inches)",
       auto.key = TRUE)
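Since these are the data Galton used in developing linear regression, it is natural to add the fitted line as well; a minimal sketch using lattice's built-in regression line type:

lm(height ~ midparent, data = Galton)  # Galton's regression: intercept and slope
xyplot(height ~ midparent, data = Galton,
       type = c("p", "r"))             # points plus the least-squares line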
Later in the workshop, we also explored the development of interactive graphics using the manipulate package, which can incorporate, for example, “drop-down” selection of variables for different axes, or “sliders” for different parameter values. We also, very briefly, touched on the Shiny package, which allows R Studio users to generate fully interactive, graphical web applications. It looks like a great area for further exploration.
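For instance, here is a minimal manipulate sketch (it runs only inside R Studio, and the bin-count control is our own illustrative choice, not an exercise from the workshop) that attaches a slider to a histogram of the Galton heights:

library(manipulate)  # interactive controls; works only within R Studio
library(mosaic)      # for the Galton data and lattice's histogram()
manipulate(
  histogram(~ height, data = Galton, nint = bins),
  bins = slider(5, 60, initial = 20, label = "Number of bins")
)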
Working collaboratively, we “translated” an example Biology lab exercise from Macalester College using the new tools that we had learned over the course of the workshop. Rather than producing a particular product, this activity allowed us to integrate the knowledge we had gained, and to anticipate issues that students would run into, from the need to reshape() or melt() data (using the reshape2 package), to pedagogical issues in presenting statistical summaries and inferences.
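As an illustration of that reshaping step (a sketch with made-up numbers; the column names are hypothetical, not from the Macalester lab), reshape2's melt() converts a wide table with one column per measurement into the long, one-row-per-observation form that the formula interface expects:

library(reshape2)
# hypothetical wide-format class data: one row per plant, one column per week
wide <- data.frame(plant = c("A", "B", "C"),
                   week1 = c(2.1, 1.8, 2.4),
                   week2 = c(3.0, 2.6, 3.5))
long <- melt(wide, id.vars = "plant",
             variable.name = "week", value.name = "height")
long  # one row per plant-week combination, ready for height ~ week formulas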
For most of both afternoons, we worked in small groups to discuss and develop course materials, preparing and publishing them as Rmarkdown to share with one another. Rather than describing individual projects, we will let them speak for themselves by linking to them here. Not all the links below are active, but we will add them as they become available.
Chris Gillen worked on a demo on biological scaling and a template for a follow-up exercise applying scaling principles to fiddler crab morphology.
Paula Turner and Simon Garcia revised an exercise on histograms for the KEEP program.
Chris Bickford, Ryn Edwards, Jennifer Smith, and Drew Kerkhoff worked on materials for the Introduction to Experimental Biology (BIOL 109-110) course, including an introduction to basic commands and a revision of an experiment on oviposition choice in bean beetles (more links coming soon!).
Katie Corker, Paul Curran, and Andrew Engell devised a regression primer using personality data.
P.J. Glandon drafted an exercise using time series analysis for an upper level economics course.
Overall, the workshop was a tremendous success. I think all of the participants learned quite a bit about applying R, sometimes in ways that we really did not anticipate. The quality of Danny’s instruction and the care and excitement he brought to the project were phenomenal, and more than compensated for the occasional dropped detail or lapse in organization.
Collectively, we look forward to consolidating and expanding the Kenyon R community. Many of us have already had further conversations about applications of R in our courses, and many of us have produced further materials.
Respectfully submitted,
Drew Kerkhoff