Danny recommended that we ultimately form pairs, with each pair deciding on a suitable topic or activity, assembling data, choosing appropriate methods and models for students, and writing instructions for preparing deliverables. He said that tomorrow we will put together a place to store and present these materials.
With a reference to a Space Family Robinson skeuomorphism (the robot who paces because human sentries do this to avoid falling asleep), he warned against replicating unnecessary patterns in our own pedagogy, and noted the importance of evaluating why we teach in the way that we do as technology and situations progress.
He also advised packaging up processes developed in R rather than worrying about general programming principles; in particular, “learn what effective computation looks like.” He advocated not just the use of templates, but also the process of searching for and studying templates.
Danny started us with a sample assignment to experience the different aspects of the process of using R. He asked us to download a data set, analyze it using our choice of software, create an illustration, and write about it.
As the exercise progressed, we encountered plenty of issues with the process. Generally, there were issues with actually getting data into the software (whether R or Minitab) and into a proper structure. We also found that struggles with just importing the data prevented many of us from getting to the fun part of the assignment. For this reason, there is a strong motivation to use import processes that avoid downloading and pre-processing data, or even to avoid importing altogether if appropriate.
For installing the packages needed for R Markdown, Danny recommended either creating packages that install the other packages for students, or running R off a server where the installation is already done and updates can be managed.
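One minimal sketch of the first idea is a helper function that installs a list of packages if they are missing; the function name and package list here are just examples (mosaic is the package we used in the workshop):

install_course_packages <- function(pkgs = c("knitr", "mosaic")) {
  # install any of the listed packages that are not already present
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
install_course_packages()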
Creating a new R Markdown document creates a new tab in the upper left section. Strangely, it is named “Untitled.txt” even though it is an R Markdown document. Clicking “Knit html” creates a tab in the console section and then pops up a preview of the document rendered in html. Clicking Publish prompts you to create an account on RPubs, uploads the html, and then shows you the url. He pointed out that students can submit their assignments this way, and that instructors can publish tutorials the same way.
Another advantage is that instructors can structure parts of the R Markdown document for students, which can help scaffold their development of coding skills and systematize reporting.
We found that R Studio will not offer to republish a document upon knitting unless the file has first been saved in R Studio.
Danny demonstrated common operations using the console. He noted that in many cases, you don’t need to think in terms of loops in order to generate sequences or repeated calculations.
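For example, these console lines produce sequences and element-wise results with no loop (the numbers are arbitrary):

1:10                  # the integers 1 through 10
seq(0, 1, by = 0.1)   # a sequence in steps of 0.1
sqrt(1:10)            # square roots of all ten values at once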
He explained the use of functions, which take the form of a function name followed by an argument list in parentheses. He noted that if you type a function name without the parentheses, you get the internal source code of the function; warn students not to freak out about it. He demonstrated using the help function to learn about a function.
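For example, using sqrt purely as an illustration:

sqrt(16)     # a function call: name, then arguments in parentheses
sqrt         # the bare name prints the function's internal code
help(sqrt)   # opens the documentation in the Help pane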
One really nice feature is using tab to auto complete on a partially written function.
Danny warned that the R interpreter is case-sensitive, which might trip up students. Another surprising feature is that periods are not reserved characters, so they can appear in names. In addition, formulas are written with “~”. (A formula defines a computation for a name without actually doing the computation; it allows for updating. In other words, formulas are like function references.) He also noted that “=” is equivalent to “<-” for assignment, and that it’s generally safe to use as long as you don’t accidentally type “==”, which tests equality instead.
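A few console lines illustrating these points (the names are arbitrary):

my.data <- c(60, 65, 70)   # periods are fine in names; <- assigns
My.Data = my.data + 2      # '=' also assigns; this is a different object, because case matters
f <- height ~ mother       # a formula: nothing is computed until it is used somewhere
my.data == My.Data         # '==' compares values instead of assigning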
He showed that you can install packages easily using the Packages tab. As an example, we installed “mosaic”. But installing a package does not mean the interpreter has loaded it; you load it by using the library() function.
To determine which packages are loaded, one just checks the Packages tab; there, checking and unchecking a box will load and detach a package, respectively.
He noted that some packages include data sets. You can find out what’s in a package by invoking data(package="<package name>"). Applying data() to a specific data set loads it and brings it up in the Environment pane, showing properties of the data set.
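For example, with the mosaic package and its Galton data set (used later in these notes):

data(package = "mosaic")   # list the data sets that ship with mosaic
data(Galton)               # load the Galton data set into the workspace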
He also noted that viewing a data set in a table doesn’t allow editing, because this is considered bad practice in data analysis. In other words, changes made to a data set during analysis should be documented through R commands.
Danny recommended not teaching students to create fake data sets in R. Pre-processing, importing, and analyzing data are separate processes.
As an example, we created a file to hold data. R does allow one to enter individual data values directly, but Danny warned that normal practice is not to do this in R. Instead, set up columns in Excel to collect the data. As a side note, he encouraged instructing students to enter covariates so that they get used to hypothesizing variable relationships (i.e., be data-oriented instead of model-oriented).
He recommended saving files as csv. A tricky point here is finding the file. Instead of starting a set-working-directory nightmare, he recommended using the file.choose() function, which brings up the file picker for the operating system. This returns an ugly file path to copy and paste, so we brainstormed ways to get the data in:
silly = file.choose()

There are a variety of commands for learning about a data set:

- data(): promises the data set (used in programming)
- View(): pops up a table in the source pane
- help(): pops up a help file in the help pane
- names(): returns the variables in the data frame
- nrow(): returns the number of rows in the data frame
- summary(): gives a basic summary of each variable

As a side note, Danny showed us the trick of using the up arrow to access previous commands.
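A minimal sketch of how these can be chained together after importing the csv; the data-frame name kids is just a placeholder:

silly = file.choose()    # pick the csv file with the operating system's dialog
kids = read.csv(silly)   # read it into a data frame (the name kids is hypothetical)
names(kids)
nrow(kids)
summary(kids)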
To plot or correlate two variables in a data set, you need to specify the data set itself, using the data argument. In addition, you use the formula syntax (<y variable> ~ <x variable>) to specify the variables of interest. For example:

xyplot(data=Galton, height~mother)
cor(data=Galton, height~mother)
tally(mother>66~sex, data=Galton, margins=TRUE)

Danny demonstrated adding variables using the transform() function. Of course, it’s not good practice to modify the original data set, so define a new data frame newG containing a new variable called midparent:
newG=transform(Galton,midparent=father+1.08*mother)
newG contains all the same variables but with an additional variable midparent.
From here, we pulled out the boys using the logical operator “==”, again creating a new data frame, Boys:
Boys = subset(newG,sex=='M')
With this new data set, we could do different types of analysis, including xyplot, cor, and lm, as before.
lm(height ~ midparent,data=Boys)
##
## Call:
## lm(formula = height ~ midparent, data = Boys)
##
## Coefficients:
## (Intercept) midparent
## 19.991 0.356
Danny also demonstrated a box plot using the original data set, using sex as the categorical variable.
bwplot(height~sex, data=Galton)
Danny shared a Google spreadsheet link by inserting it into an R Markdown document, publishing the document publicly, and then pulling out the url.
To read in the data, we then used fetchGoogle() on the url to the spreadsheet:
kdata = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dHJBRFR1V0ZaVHZSaUNsMWhHWFZMdFE&single=true&gid=0&output=csv")
This works for old Google sheets. For spreadsheets in Google Drive, another method is to use the “Publish” command and select csv as the format.
An R Markdown document allows you to “Insert a Chunk”, which creates a (syntax-colored) light gray section (indicated in R Markdown between triple-back-tick fences). Within this section, you can insert R code. When knit, the content will be formatted to look like code:
# <code here>
We then worked individually on each creating a short R-markdown-based article walking students through simple analysis of a data set. Examples:
Danny pointed out that there are three major systems for graphing: base, lattice, and ggplot. Each one makes a different set of tasks easier.
Invoking help(package=lattice) shows a list of functions included in the package.
help(package=lattice)
We started with an example using wage data, called CPS85, and plotted wage against education.
data(CPS85)
xyplot(wage~educ, data=CPS85)
histogram(~educ, data=CPS85)
The scatter plot could use a way to distinguish between groups, such as by sex.
xyplot(wage~educ, groups=sex, data=CPS85)
To make the density of dots easier to see, we added some jitter to the education values, along with a little transparency:
xyplot(wage~jitter(educ), groups=sex, data=CPS85, alpha=0.5)
A variety of vignettes are available in mosaic, and they demonstrate some of the functionality of ‘lattice’.
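To see the list from the console:

vignette(package = "mosaic")   # shows the vignettes that ship with mosaic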
Danny showed us a neat way to build an interactive graph, which used a dropdown to change the x axis and checkboxes to specify different interaction terms. fetchData() allows you to search a number of web directories for the named data file or source file.
fetchData("mLM.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R
## [1] TRUE
# Does not knit :: mLM(wage~sex+age+sector*sex+educ,data=CPS85)
The ‘mLM’ function accomplishes this by using the manipulate package. manipulate() takes a graph call and a user-interface element as arguments. Doing so requires putting variables in place of fixed values for the parameters you want to control. Examples:
# hist(scores$raw, 10)
# manipulate(hist(scores$raw, num.bins), num.bins = slider(1, 10, label = "number of bins"))
R packages have a standard, required structure, including a DESCRIPTION file, a “data” folder, and so on. The easiest way to develop your own is to copy an existing package folder and then modify files as necessary. Danny demonstrated compiling a package within R Studio by opening the files in the “Files” pane and then using “Build” to compile the package.
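As a complementary starting point (Danny's demonstration was to copy and modify an existing folder), base R's package.skeleton() can also generate the required structure; a sketch using the newG data frame created earlier, with a hypothetical package name:

package.skeleton(name = "courseModules", list = c("newG"))
# writes a courseModules/ folder containing DESCRIPTION, NAMESPACE, and skeleton data and help files to edit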
We examined a sample lab manual experiment and asked some questions about how it was written. We pointed out that the sample data were not in canonical form. Danny pointed out that the format was motivated by the plate-reader layout. He suggested that students should be instructed to start with authentic data, and then to use another table to label the meaning of each row or column. Then the data need to be arranged in canonical form.
How to do this? R can do this with a variety of functions, including ‘melt()’, ‘reshape()’, and ‘merge()’. The desired process should be thought of, generally, as a database inner-join operation.
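For instance, a minimal sketch (with made-up plate-reader numbers) of how melt() from the reshape2 package and merge() can move such data into canonical form:

library(reshape2)
# a tiny plate-reader-style table: one row per plate row, one column per well column
plate <- data.frame(row = c("A", "B"), X1 = c(0.12, 0.33), X2 = c(0.15, 0.41))
# a second table labeling the meaning of each well
key <- data.frame(row = c("A", "A", "B", "B"),
                  column = c("X1", "X2", "X1", "X2"),
                  treatment = c("control", "drug", "control", "drug"))
long <- melt(plate, id.vars = "row", variable.name = "column", value.name = "absorbance")
canonical <- merge(long, key)   # effectively an inner join on the shared columns, row and column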
Danny described a case study involving a hypothetical experiment that exposes students to the concept of sampling (fish in polluted and non-polluted lakes). He noted that compiling the R code within an R Markdown document can be time-consuming, so he suggested using different types of “Run” commands to debug the code before knitting.
What to do? He suggested drawing random samples from each lake, and repeating this process with different sample sizes. In essence, R allows one to do statistical, hypothetical experiments to develop an intuition for the stability of an estimator.
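A minimal sketch of such an experiment in base R, with a made-up population of fish lengths (all names and numbers are hypothetical):

lake <- rnorm(1000, mean = 30, sd = 8)                    # a hypothetical lake's fish lengths
means10 <- replicate(500, mean(sample(lake, size = 10)))  # 500 samples of size 10
means50 <- replicate(500, mean(sample(lake, size = 50)))  # 500 samples of size 50
sd(means10)                                               # the small-sample estimates vary more
sd(means50)                                               # larger samples give a more stable estimator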
We spent the rest of the time in teams working on R-based modules for our courses.