The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the textbook and also to analyze real data and come to informed conclusions. To straighten out which is which: R is the name of the programming language itself and RStudio is a convenient interface.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

The panel in the upper right contains your workspace as well as a history of the commands that you’ve previously entered. Any plots that you generate will show up in the panel in the lower right corner.

The panel on the left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request, a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

To get you started, enter each command that this lab shows in a grey block at the R prompt (i.e. right after > on the console). You can either type the commands in manually or copy and paste them from this document.

R has a number of additional packages that help you perform specific tasks. In order to use these packages, they must be installed (you only have to do this once) and loaded (you have to do this every time you start a new R session).
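The install-once, load-every-session distinction can be checked from the console. The sketch below is only an illustration: missing_packages is a hypothetical helper name (it is not part of R or of this lab), and it reports which of the named packages are not yet installed without installing anything.

```r
# Hypothetical helper (not part of the lab): list the packages that
# still need to be installed on this computer.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE when a package is installed and loadable
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages used in this lab; anything printed here still needs installing
missing_packages(c("dplyr", "ggplot2", "oilabs"))
```

If the result is empty, everything is already installed and you only need the library() calls.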

If you have not installed the packages already, run this command (this may take a few minutes):

install.packages("Rcpp")

Then go to the panel with the Packages tab, click on Install, enter dplyr, and click Install (this may take a few minutes). Then run the following commands:

install.packages(c("ggplot2", "devtools"))
devtools::install_github("andrewpbray/oilabs")

Phew! Fortunately, you only need to run the commands above once. Now we can load the packages

library(oilabs)
library(dplyr)
library(ggplot2)

and finally, load in some data:

data(arbuthnot)

The data are Arbuthnot baptism counts for boys and girls. You should see that the environment area in the upper righthand corner of the RStudio window now lists a data set called arbuthnot that has 82 observations on 3 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.

## The Data: Dr. Arbuthnot’s Baptism Records

The Arbuthnot data set refers to Dr. John Arbuthnot, an 18th century physician, writer, and mathematician. He was interested in the ratio of newborn boys to newborn girls, so he gathered the baptism records for children born in London for every year from 1629 to 1710. We can take a look at the data by typing its name into the console.

arbuthnot

What you should see are four columns of numbers, each row representing a different year: the first entry in each row is simply the row number (an index we can use to access the data from individual years if we want), the second is the year, and the third and fourth are the numbers of boys and girls baptized that year, respectively. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of Arbuthnot’s data. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored Arbuthnot’s data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame by typing:

dim(arbuthnot)
## [1] 82  3

This command should output [1] 82 3, indicating that there are 82 rows and 3 columns, just as it says next to the object in your environment. You can see the names of these columns (or variables) by typing:

names(arbuthnot)
## [1] "year"  "boys"  "girls"

You should see that the data frame contains the columns year, boys, and girls. At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The dim and names commands, for example, each took a single argument, the name of a data frame.
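To make the "function with arguments" idea concrete, here is a small sketch on a made-up data frame. The name toy and all of its values are invented for illustration; they are not Arbuthnot’s records.

```r
# Made-up counts for illustration only; these are not Arbuthnot's records.
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  boys  = c(520, 480, 450),
  girls = c(470, 490, 410)
)

dim(toy)      # one argument: number of rows, then number of columns
names(toy)    # one argument: the column (variable) names
head(toy, 2)  # two arguments: the data frame and how many rows to show
```

The same pattern, a function name followed by its arguments in parentheses, applies to every command in this lab.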

One advantage of RStudio is that it comes with a built-in data viewer. Click on the name arbuthnot in the Environment pane (upper right window) that lists the objects in your environment. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner. Alternatively, you can open the same display with the View() command:

View(arbuthnot)

## Some Exploration

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

arbuthnot %>%
  select(boys)

This command will only show the number of boys baptized each year.

Notice that the special operator %>% is being used here. The %>% is called the ‘pipe’ operator; it becomes available when you load the dplyr package (it originates in the magrittr package). Many languages have pipe operators, and they allow functions to pass things along a chain of commands. In R, we often read %>% as the English word ‘then.’ So, we could read this code as saying, ‘take the arbuthnot data, then select the boys variable.’
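One way to see what the pipe does is to write the same call with and without it. This is a sketch on a made-up data frame (toy and its values are invented, not Arbuthnot’s records) and assumes dplyr is installed.

```r
library(dplyr)

# Made-up counts for illustration only; these are not Arbuthnot's records.
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  boys  = c(520, 480, 450),
  girls = c(470, 490, 410)
)

piped  <- toy %>% select(boys)  # "take toy, then select boys"
nested <- select(toy, boys)     # the same call written without the pipe

identical(piped, nested)        # both forms give the same data frame
```

The pipe simply passes the object on its left as the first argument of the function on its right, which is why the two forms agree.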

1. What command would you use to extract just the counts of girls baptized? Try it!

There is an additional package for R called mosaic that streamlines all of the commands that you will need in this course. This package is specifically designed by a team of NSF-funded educators to make R more accessible to introductory statistics students like you. The mosaic package doesn’t provide new functionality so much as it makes existing functionality more logical and consistent, all the while emphasizing important concepts in statistics.

To use the package, you will first need to install it (this may take a few minutes).

install.packages("mosaic")

Note that you will only have to do this once. However, once the package is installed, you will have to load it into the current workspace before it can be used.

library(mosaic)

Note that you will have to do this every time you start a new R session.

R has some powerful functions for making graphics.

The centerpiece of the mosaic syntax is the use of the modeling language. This involves the use of a tilde (~), which can be read as “is a function of”.

We can create a simple plot of the number of girls baptized per year with the command

xyplot(girls ~ year, data=arbuthnot)

By default, R creates a scatterplot with each x,y pair indicated by an open circle. The plot itself should appear under the Plots tab of the lower right panel of RStudio. Notice that the command above is a function called xyplot that takes two arguments separated by a comma. The first argument to the xyplot function is a formula of the form y ~ x that specifies which variable should go on the vertical (y) axis and which should go on the horizontal (x) axis. The data argument tells R where to find the girls and year variables; in this case, they are inside the data frame called arbuthnot. You will find that you can perform most of the operations that you will see in this class using this simple template. [Please see the Less Volume, More Creativity vignette for more information about this.]
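A formula is an ordinary R object, so you can inspect it on its own. The sketch below uses only base R; the name plot_formula is a hypothetical label chosen for illustration.

```r
# plot_formula is a hypothetical name; the formula is the one used above
plot_formula <- girls ~ year

class(plot_formula)     # the kind of object R has created
all.vars(plot_formula)  # variables named in the formula, left side first
plot_formula[[2]]       # left of the tilde: goes on the vertical (y) axis
plot_formula[[3]]       # right of the tilde: goes on the horizontal (x) axis
```

Note that the formula only names the variables; the data argument is what tells the plotting function where their values live.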

If we wish to plot lines instead of points, we can add the type argument.

xyplot(girls ~ year, data=arbuthnot, type = "l")

You might wonder how you are supposed to know that it was possible to add that third argument. Thankfully, R documents all of its functions extensively. To read what a function does and learn the arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following.

?xyplot

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.
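As one thing to try after reading the help page, the sketch below varies the type argument on a small made-up data frame. It assumes the lattice package (where xyplot is defined) is installed; the toy data frame and its values are invented for illustration and are not Arbuthnot’s records.

```r
library(lattice)  # xyplot() is defined in the lattice package

# Made-up counts for illustration only; these are not Arbuthnot's records.
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  girls = c(470, 490, 410)
)

points_plot <- xyplot(girls ~ year, data = toy, type = "p")  # points only
lines_plot  <- xyplot(girls ~ year, data = toy, type = "l")  # lines only
both_plot   <- xyplot(girls ~ year, data = toy, type = "b")  # points joined by lines
```

Each plot is saved as an object here; running print(both_plot), or typing its name at the prompt, draws it in the Plots tab.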

1. Is there an apparent trend in the number of girls baptized over the years? How would you describe it?

Now, suppose we want to plot the total number of baptisms. To compute this, we could use the fact that R is really just a big calculator. We can type in mathematical expressions like

5218 + 4683

to see the total number of baptisms in 1629. We could repeat this once for each year, but there is a faster way. If we add the variable for baptisms for boys to that of girls, R will compute all sums simultaneously.

arbuthnot %>%
  mutate(total = boys + girls)

Again, we can read this chunk of code by saying, ‘take the arbuthnot data, and then mutate it so there is an additional column called total, which is the sum of the boys and girls columns.’

You’ll see that there is now a new column called “total” that has been tacked on to the data frame. What “total” contains is a column of 82 numbers, each one representing the sum we’re after. Take a look at a few of them and verify that they are right.
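The element-by-element addition that mutate relies on can be seen with two plain vectors. This is a sketch with made-up numbers (not Arbuthnot’s records); c() combines values into a vector and the arrow <- stores the result under a name.

```r
# Made-up counts for illustration only; these are not Arbuthnot's records.
boys  <- c(520, 480, 450)
girls <- c(470, 490, 410)

# R adds the first entries together, then the second, then the third
boys + girls
# [1] 990 970 860
```

Checking a few entries by hand, as suggested above, amounts to redoing individual sums such as 520 + 470.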

While this output looks like our original dataset with an additional column added, it was just temporarily shown to us in the console. In order to use that column again, we have to save it.

arbuthnot <-
  arbuthnot %>%
  mutate(total = boys + girls)

The special symbol <- performs an assignment, taking the output of a line of code or series of piped commands, and saving it into an object in your environment. In this case, you already have an object called arbuthnot, so this command updates that data set with the new mutated column.
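The difference between printing a result and saving it can be checked directly. This sketch uses a made-up data frame (toy and names_before are hypothetical names, and the values are not Arbuthnot’s records) and assumes dplyr is installed.

```r
library(dplyr)

# Made-up counts for illustration only; these are not Arbuthnot's records.
toy <- data.frame(
  year  = c(2001, 2002, 2003),
  boys  = c(520, 480, 450),
  girls = c(470, 490, 410)
)

toy %>% mutate(total = boys + girls)  # printed to the console, not saved
names_before <- names(toy)            # toy still has only year, boys, girls

toy <- toy %>% mutate(total = boys + girls)  # saved back into toy
names(toy)                                   # now includes total
```

Only the second mutate call changes toy, because only that call uses the assignment arrow.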

We can make a plot of the total number of baptisms per year with the command

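# Plot the saved total column (boys + girls ~ year would instead draw boys and girls as two separate lines)
xyplot(total ~ year, data=arbuthnot, type = "l")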

Similarly to how we computed the total number of baptisms, we can compute the ratio of the number of boys to the number of girls baptized in 1629 with

5218 / 4683

or we can act on the complete columns with the expression

arbuthnot %>%
  mutate(ratio = boys / girls)

We can also compute the proportion of newborns that are boys in 1629

5218 / (5218 + 4683)

or this may also be computed for all years simultaneously:

arbuthnot %>%
  mutate(prop = boys / (boys + girls))

Note that with R as with your calculator, you need to be conscious of the order of operations. Here, we want to divide the number of boys by the total number of newborns, so we have to use parentheses. Without them, R will first do the division, then the addition, giving you something that is not a proportion.
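The effect of the parentheses can be seen by running both versions with the 1629 counts used above. The names without_parens and with_parens are hypothetical labels for this sketch.

```r
without_parens <- 5218 / 5218 + 4683    # division first: 1 + 4683
with_parens    <- 5218 / (5218 + 4683)  # addition first: 5218 / 9901

without_parens  # 4684, which cannot be a proportion
with_parens     # a value between 0 and 1, roughly 0.527
```

A quick sanity check for any proportion is that it falls between 0 and 1; the version without parentheses fails that check immediately.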

1. Now, mutate the data frame once more to include a new column for the proportion of newborns that are boys, then generate a plot of that proportion over time. What do you see? Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, greater than or equal, >=, less than, <, less than or equal, <=, and equality, ==. For example, we can ask if boys outnumber girls in each year with the expression

arbuthnot %>%
  mutate(boys > girls)

This command returns a data frame with an additional variable tacked on. The values of this variable are either TRUE if that year had more boys than girls, or FALSE if that year did not (the answer may surprise you). This output shows a different kind of data than we have considered so far. All other columns in the arbuthnot data frame have values that are numerical (the year, the number of boys and girls). Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
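Logical data can be explored on plain vectors as well. The sketch below uses made-up numbers (not Arbuthnot’s records, so it does not give away the answer above); more_boys is a hypothetical name.

```r
# Made-up counts for illustration only; these are not Arbuthnot's records.
boys  <- c(520, 480, 450)
girls <- c(470, 490, 410)

more_boys <- boys > girls  # one comparison per pair of entries
more_boys
# [1]  TRUE FALSE  TRUE
class(more_boys)           # the data type of the result
# [1] "logical"
sum(more_boys)             # TRUE counts as 1 and FALSE as 0
# [1] 2
```

Because TRUE counts as 1, summing a logical vector is a quick way to count how many comparisons came out TRUE.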

This seems like a fair bit for your first lab, so let’s stop here. To exit RStudio you can click on the log out link at the upper right corner of the screen. Go ahead and log out and then sign back in. You’ll find your R session is just as you left it.

## Practice Exercises

In the previous few pages, you recreated some of the displays and preliminary analysis of Arbuthnot’s baptism data. Your assignment involves repeating these steps, but for present day birth records in the United States. Load up the present day data with the following command.

data(present)

The data are stored in a data frame called present.

• What years are included in this data set? What are the dimensions of the data frame and what are the variable or column names?

• How do the counts of boys and girls in the present day birth records compare to Arbuthnot’s? Are they on a similar scale?

• Make a plot that displays the boy-to-girl ratio for every year in the data set. What do you see? Does Arbuthnot’s observation about boys being born in greater proportion than girls hold up in the U.S.? Include the plot in your response.

• In what year did we see the largest total number of births in the U.S.?

These data come from a report by the Centers for Disease Control. Check it out if you would like to read more about an analysis of sex ratios at birth in the United States.

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses. Feel free to browse around the websites for R and RStudio.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.