Danny recommended that we ultimately form pairs, with each pair deciding on a suitable topic or activity, assembling data, choosing appropriate methods and models for students, and writing instructions for preparing deliverables. He said that tomorrow we will put together a place to store and present these materials.
With a reference to a Space Family Robinson skeuomorphism (the robot who paces because human sentries do this to avoid falling asleep), he warned against replicating unnecessary patterns in our own pedagogy, and noted the importance of evaluating why we teach in the way that we do as technology and situations progress.
He also advised packaging up processes developed in R rather than worrying about general programming principles; in particular, “learn what effective computation looks like.” He advocated not just the use of templates, but also the process of searching for and studying templates.
Danny started us with a sample assignment to experience the different aspects of the process of using R. He asked us to download a data set, analyze it using our choice of software, create an illustration, and write about it.
As the exercise progressed, we encountered plenty of issues with the process. Generally, there were issues with actually getting data into the software (whether R or Minitab) and into a proper structure. We also found that struggles with just importing the data prevented many of us from getting to the fun part of the assignment. For this reason, there is a strong motivation to use import processes that avoid downloading and pre-processing data, or even to avoid importing altogether if appropriate.
For installing the packages needed for R Markdown, Danny recommended either creating packages that install the other packages for students, or running R off a server where the installation is already done and updates can be managed.
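One minimal sketch of the first idea is a helper function that installs a list of packages if they are missing; the function name and package list here are just examples (mosaic is the package we used in the workshop):

install_course_packages <- function(pkgs = c("knitr", "mosaic")) {
  # install any of the listed packages that are not already present
  missing <- pkgs[!pkgs %in% rownames(installed.packages())]
  if (length(missing) > 0) install.packages(missing)
  invisible(missing)
}
install_course_packages()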
Creating a new R Markdown document creates a new tab in the upper left section. Strangely, it is named “Untitled.txt” even though it is an R Markdown document. Clicking “Knit html” creates a tab in the console section and then pops up a preview of the document rendered in html. Clicking Publish prompts you to create an account on RPubs, uploads the html, and then shows you the url. He pointed out that students can submit their assignments this way, and that instructors can publish tutorials the same way.
Another advantage is that instructors can structure parts of the R Markdown document for students, which can help scaffold their development of coding skills and systematize reporting.
We found that R Studio will not offer to republish a document upon knitting unless the file has first been saved in R Studio.
Danny demonstrated common operations using the console. He noted that in many cases, you don’t need to think in terms of loops in order to generate sequences or repeated calculations.
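For example, these console lines produce sequences and element-wise results with no loop (the numbers are arbitrary):

1:10                  # the integers 1 through 10
seq(0, 1, by = 0.1)   # a sequence in steps of 0.1
sqrt(1:10)            # square roots of all ten values at once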
He explained the use of functions, which take the form of a function name followed by an argument list in parentheses. He noted that if you type a function name without the parentheses, you get the internal source code of the function; warn students not to freak out about it. He demonstrated using the help function to learn about a function.
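For example, using sqrt purely as an illustration:

sqrt(16)     # a function call: name, then arguments in parentheses
sqrt         # the bare name prints the function's internal code
help(sqrt)   # opens the documentation in the Help pane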
One really nice feature is using tab to auto complete on a partially written function.
Danny warned that the R interpreter is case-sensitive, which might trip up students. Another surprising feature is that periods are not reserved characters, so they can appear in names. In addition, formulas are written with “~”. (A formula defines a computation for a name without actually doing the computation; it allows for updating. In other words, formulas are like function references.) He also noted that “=” is equivalent to “<-” for assignment, and that it’s generally safe to use as long as you don’t accidentally type “==”, which tests equality instead.
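A few console lines illustrating these points (the names are arbitrary):

my.data <- c(60, 65, 70)   # periods are fine in names; <- assigns
My.Data = my.data + 2      # '=' also assigns; this is a different object, because case matters
f <- height ~ mother       # a formula: nothing is computed until it is used somewhere
my.data == My.Data         # '==' compares values instead of assigning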
He showed that you can install packages easily using the Packages tab. As an example, we installed “mosaic”. But installing a package does not mean the interpreter has loaded it; you load it by using the library() function.
To determine which packages are loaded, one just checks the Packages tab; there, checking and unchecking a box will load and detach a package, respectively.
He noted that some packages include data sets. You can find out what’s in a package by invoking data(package="<package name>"). Applying data() to a specific data set loads it and brings it up in the Environment pane, showing properties of the data set.
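For example, with the mosaic package and its Galton data set (used later in these notes):

data(package = "mosaic")   # list the data sets that ship with mosaic
data(Galton)               # load the Galton data set into the workspace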
He also noted that viewing a data set in a table doesn’t allow editing, because this is considered bad practice in data analysis. In other words, changes made to a data set during analysis should be documented through R commands.
Danny recommended not teaching students to create fake data sets in R. Pre-processing, importing, and analyzing data are separate processes.
As an example, we created a file to hold data. R does allow one to enter individual data values directly, but Danny warned that normal practice is not to do this in R. Instead, set up columns in Excel to collect the data. As a side note, he encouraged instructing students to enter covariates so that they get used to hypothesizing variable relationships (i.e., be data-oriented instead of model-oriented).
He recommended saving files as csv. A tricky point here is finding the file. Instead of starting a set-working-directory nightmare, he recommended using the file.choose() function, which brings up the file picker for the operating system. This returns an ugly file path to copy and paste, so we brainstormed ways to get the data in:
silly = file.choose()

There are a variety of commands for learning about a data set:

- data(): promises the data set (used in programming)
- View(): pops up a table in the source pane
- help(): pops up a help file in the help pane
- names(): returns the variables in the data frame
- nrow(): returns the number of rows in the data frame
- summary(): gives a basic summary of each variable

As a side note, Danny showed us the trick of using the up arrow to access previous commands.
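A minimal sketch of how these can be chained together after importing the csv; the data-frame name kids is just a placeholder:

silly = file.choose()    # pick the csv file with the operating system's dialog
kids = read.csv(silly)   # read it into a data frame (the name kids is hypothetical)
names(kids)
nrow(kids)
summary(kids)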
To plot or correlate two variables in a data set, you need to specify the data set itself, using the data argument. In addition, you use the formula syntax (<y variable> ~ <x variable>) to specify the variables of interest. For example:

xyplot(data=Galton, height~mother)
cor(data=Galton, height~mother)
tally(mother>66~sex, data=Galton, margins=TRUE)

Danny demonstrated adding variables using the transform() function. Of course, it’s not good practice to modify the original data set, so define a new data frame newG containing a new variable called midparent:
newG=transform(Galton,midparent=father+1.08*mother)
newG contains all the same variables but with an additional variable midparent.
From here, we pulled out the boys using the logical operator “==”, again creating a new data frame, Boys:
Boys = subset(newG,sex=='M')
With this new data set, we could do different types of analysis, including xyplot, cor, and lm, as before.
lm(height ~ midparent,data=Boys)
##
## Call:
## lm(formula = height ~ midparent, data = Boys)
##
## Coefficients:
## (Intercept) midparent
## 19.991 0.356
Danny also demonstrated a box plot using the original data set, using sex as the categorical variable.
bwplot(height~sex, data=Galton)
Danny shared a Google spreadsheet link by inserting it into an R Markdown document, publishing the document publicly, and then pulling out the url.
To read in the data, we then used fetchGoogle() on the url to the spreadsheet:
kdata = fetchGoogle("https://docs.google.com/spreadsheet/pub?key=0Am13enSalO74dHJBRFR1V0ZaVHZSaUNsMWhHWFZMdFE&single=true&gid=0&output=csv")
This works for old Google sheets. For spreadsheets in Google Drive, another method is to use the “Publish” command and select csv as the format.
An R Markdown document allows you to “Insert a Chunk”, which creates a (syntax-colored) light gray section (indicated in R Markdown between triple-back-tick fences). Within this section, you can insert R code. When knit, the content will be formatted to look like code:
# <code here>
We then worked individually on each creating a short R-markdown-based article walking students through simple analysis of a data set. Examples:
Danny pointed out that there are three major systems for graphing: base, lattice, and ggplot. Each one makes a different set of tasks easier.
Invoking help(package=lattice) shows a list of functions included in the package.
help(package=lattice)
We started with an example using wage data, called CPS85, and plotted wage against education.
data(CPS85)
xyplot(wage~educ, data=CPS85)
histogram(~educ, data=CPS85)
The scatter plot could use a way to distinguish between groups, such as by sex.
xyplot(wage~educ, groups=sex, data=CPS85)
To make the density of dots easier to see, we added some jitter to the education values, along with a little transparency:
xyplot(wage~jitter(educ), groups=sex, data=CPS85, alpha=0.5)
A variety of vignettes are available in mosaic, and they demonstrate some of the functionality of ‘lattice’.
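To see the list from the console:

vignette(package = "mosaic")   # shows the vignettes that ship with mosaic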
Danny showed us a neat way to build an interactive graph, which used a dropdown to change the x axis and checkboxes to specify different interaction terms. fetchData() allows you to search a number of web directories for the named data file or source file.
fetchData("mLM.R")
## Retrieving from http://www.mosaic-web.org/go/datasets/mLM.R
## [1] TRUE
# Does not knit :: mLM(wage~sex+age+sector*sex+educ,data=CPS85)
The ‘mLM’ function accomplishes this by using the manipulate package. manipulate() takes a graph call and a user-interface element as arguments. Doing so requires putting variables in place of fixed values for the parameters you want to control. Examples:
# hist(scores$raw, 10)
# manipulate(hist(scores$raw, num.bins), num.bins = slider(1, 10, label = "number of bins"))
R packages have a standard, required structure, including a DESCRIPTION file, a “data” folder, and so on. The easiest way to develop your own is to copy an existing package folder and then modify files as necessary. Danny demonstrated compiling a package within R Studio by opening the files in the “Files” pane and then using “Build” to compile the package.
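As a complementary starting point (Danny's demonstration was to copy and modify an existing folder), base R's package.skeleton() can also generate the required structure; a sketch using the newG data frame created earlier, with a hypothetical package name:

package.skeleton(name = "courseModules", list = c("newG"))
# writes a courseModules/ folder containing DESCRIPTION, NAMESPACE, and skeleton data and help files to edit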
We examined a sample lab manual experiment and asked some questions about how it was written. We pointed out that the sample data were not in canonical form. Danny pointed out that the format was motivated by the plate-reader layout. He suggested that students should be instructed to start with authentic data, and then to use another table to label the meaning of each row or column. Then the data need to be arranged in canonical form.
How to do this? R can do this with a variety of functions, including ‘melt()’, ‘reshape()’, and ‘merge()’. The desired process should be thought of, generally, as a database inner-join operation.
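For instance, a minimal sketch (with made-up plate-reader numbers) of how melt() from the reshape2 package and merge() can move such data into canonical form:

library(reshape2)
# a tiny plate-reader-style table: one row per plate row, one column per well column
plate <- data.frame(row = c("A", "B"), X1 = c(0.12, 0.33), X2 = c(0.15, 0.41))
# a second table labeling the meaning of each well
key <- data.frame(row = c("A", "A", "B", "B"),
                  column = c("X1", "X2", "X1", "X2"),
                  treatment = c("control", "drug", "control", "drug"))
long <- melt(plate, id.vars = "row", variable.name = "column", value.name = "absorbance")
canonical <- merge(long, key)   # effectively an inner join on the shared columns, row and column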
Danny described a case study involving a hypothetical experiment that exposes students to the concept of sampling (fish in polluted and non-polluted lakes). He noted that compiling the R code within an R Markdown document can be time-consuming, so he suggested using different types of “Run” commands to debug the code before knitting.
What to do? He suggested drawing random samples from each lake, and repeating this process with different sample sizes. In essence, R allows one to do statistical, hypothetical experiments to develop an intuition for the stability of an estimator.
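A minimal sketch of such an experiment in base R, with a made-up population of fish lengths (all names and numbers are hypothetical):

lake <- rnorm(1000, mean = 30, sd = 8)                    # a hypothetical lake's fish lengths
means10 <- replicate(500, mean(sample(lake, size = 10)))  # 500 samples of size 10
means50 <- replicate(500, mean(sample(lake, size = 50)))  # 500 samples of size 50
sd(means10)                                               # the small-sample estimates vary more
sd(means50)                                               # larger samples give a more stable estimator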
We spent the rest of the time in teams working on R-based modules for our courses.