Getting Started

Before beginning this lab, make sure you’ve:

The video below will walk you through starting a new R Markdown file for the first time, finding the file path for the dataset you’re using, and making sure the R Markdown file is in the same place as the data.

The RStudio Interface

The goal of this lab is to introduce you to R and RStudio, which you’ll be using throughout the course both to learn the statistical concepts discussed in the course and to analyze real data and come to informed conclusions. To clarify which is which: R is the name of the programming language itself and RStudio is a convenient interface.

As the labs progress, you are encouraged to explore beyond what the labs dictate; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R. Today we begin with the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

Go ahead and launch RStudio. You should see a window that looks like the image shown below.

The panel on the lower left is where the action happens. It’s called the console. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running. Below that information is the prompt. As its name suggests, this prompt is really a request: a request for a command. Initially, interacting with R is all about typing commands and interpreting the output. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The panel in the upper right contains your environment as well as a history of the commands that you’ve previously entered.

Any plots that you generate will show up in the panel in the lower right corner. This is also where you can browse your files, access help, manage packages, etc.

R Packages

R is an open-source programming language, meaning that users can contribute packages that make our lives easier, and we can use them for free. For this lab, and many others in the future, we will use the following R package:

  • The suite of tidyverse packages: for data wrangling and data visualization

If this package is not already available in your R environment, install it by typing the following line of code into the console of your RStudio session and pressing the enter/return key. Note that you can check to see which packages (and which versions) are installed by inspecting the Packages tab in the lower right panel of RStudio.

install.packages("tidyverse")

You may need to select a server from which to download; any of them will work. Next, you need to load the tidyverse package in your working environment. We do this with the library function. Run the following line in your console.

library(tidyverse)

You only need to install packages once, but you need to load them each time you relaunch RStudio.
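One common pattern, sketched here as a convenience rather than a course requirement, is to install a package only when it is missing and then load it. The check uses requireNamespace(), a base R function that returns FALSE when a package is not installed; the other two calls are the same ones shown above.

```r
# Install tidyverse only if it is not already installed.
# requireNamespace() returns FALSE (instead of stopping with an error)
# when the package cannot be found.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}

# Load the package for the current R session.
library(tidyverse)
```

Running this at the start of a session is safe to repeat: the install step is skipped once the package is present.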

The tidyverse packages share common philosophies and are designed to work together. You can find more about the packages in the tidyverse at tidyverse.org.

Creating a reproducible lab report

We will be using R Markdown to create reproducible lab reports. See the following video describing why and how:

Why use R Markdown for Lab Reports?

A template lab report for you to use is available for download on the course WordPress site.

Going forward you should refrain from typing your code directly in the console, and instead type any code (final correct answer, or anything you’re just trying out) in the R Markdown file and run the chunk using either the Run button on the chunk (green sideways triangle) or by highlighting the code and clicking Run on the top right corner of the R Markdown editor. If at any point you need to start over, you can Run All Chunks above the chunk you’re working in by clicking on the down arrow in the code chunk.

The video below will walk you through the process of knitting your R Markdown file for Lab 1:

The data

Today we’ll be looking at a subset of the National Prisoner Statistics (NPS) dataset. This subset also includes population statistics from the US Census. Note that because population counts are taken from the US Census, we only have them on a 10-year basis. We round each year down to the start of its decade and use that decade’s population statistics. For example, for years 1980-1989, we assign the 1980 population count.
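The rounding rule can be sketched in a few lines of R. This is only an illustration of the idea; the example years are arbitrary, the name census_year is made up, and the dataset you load below already has the population columns filled in.

```r
# A few example years (not the full dataset).
years <- c(1978, 1979, 1980, 1985, 1989, 1990)

# Round each year down to the start of its decade:
# divide by 10, drop the fractional part, multiply by 10 again.
census_year <- floor(years / 10) * 10

census_year
# [1] 1970 1970 1980 1980 1980 1990
```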

To get started, let’s open the data. Remember to set your working directory!

load("NPS_Pop.rda")

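You can run the command by clicking the Run button on the code chunk or by highlighting the code and clicking Run in the top right corner of the R Markdown editor.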

This command instructs R to load some data: the subset of the National Prisoner Statistics dataset. You should see that the environment area in the upper righthand corner of the RStudio window now lists a data set called NPS_Pop that has 39 observations on 9 variables. As you interact with R, you will create a series of objects. Sometimes you load them as we have done here, and sometimes you create them yourself as the byproduct of a computation or some analysis you have performed.
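To see how load() creates objects, here is a small self-contained sketch. The data frame toy and the file name toy_example.rda are made up for illustration and have nothing to do with the NPS data.

```r
# Build a small made-up data frame and save it to a temporary .rda file.
toy <- data.frame(Year = 1978:1980, Count = c(10, 12, 15))
toy_path <- file.path(tempdir(), "toy_example.rda")
save(toy, file = toy_path)

# Remove the object from the environment, then restore it with load().
rm(toy)
restored <- load(toy_path)

restored       # names of the objects that load() created: "toy"
exists("toy")  # TRUE: the data frame is back in the environment
dim(toy)       # 3 rows and 2 columns
```

Note that load() restores objects under the names they were saved with, which is why the NPS file produces an object called NPS_Pop without any assignment on your part.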

A brief introduction to the NPS dataset from the US Department of Justice before we get started (edited for brevity):

“The National Prisoner Statistics (NPS) data collection began in 1926 in response to a congressional mandate to gather information on persons incarcerated in state and federal prisons. The NPS provides an enumeration of persons in state and federal prisons and collects data on key characteristics of the nation’s prison population. NPS has been adapted over time to keep pace with the changing information needs of the public, researchers, and federal, state, and local governments.”

We can take a look at the dataset by running the following line of code:

NPS_Pop

However, printing the whole dataset in the console is not that useful. One advantage of RStudio is that it comes with a built-in data viewer. Click on the name NPS_Pop in the Environment pane (upper right window) that lists the objects in your environment. This will bring up an alternative display of the data set in the Data Viewer (upper left window). You can close the data viewer by clicking on the x in the upper lefthand corner.

What you should see are 9 columns of numbers, each row representing a different year. Use the scrollbar on the right side of the console window to examine the complete data set.

Note that the row numbers in the first column are not part of the NPS_Pop dataset. R adds them as part of its printout to help you make visual comparisons. You can think of them as the index that you see on the left side of a spreadsheet. In fact, the comparison to a spreadsheet will generally be helpful. R has stored the data in a kind of spreadsheet or table called a data frame.

You can see the dimensions of this data frame as well as the names of the variables and the first few observations by typing:

glimpse(NPS_Pop)

It is better practice to type this command into your console, since it is not necessary code to include in your solution file.

This command should output the following

Rows: 39
Columns: 9
$ Year        1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990,…
$ WhiteM      438491, 452711, 475450, 534758, 600355, 629894, 666511, 723483, 756520, 80029…
$ WhiteF      16295, 16829, 17933, 21168, 24839, 27184, 29213, 32935, 37495, 41279, 46625, …
$ BlackM      400255, 407563, 423397, 473197, 533249, 563375, 589138, 636968, 690568, 73569…
$ BlackF      18475, 19043, 18980, 22392, 25587, 26827, 29095, 31377, 36067, 38534, 43366, …
$ WhitePop    178119221, 178119221, 189035012, 189035012, 189035012, 189035012, 189035012, …
$ BlackPop    22539362, 22539362, 26482349, 26482349, 26482349, 26482349, 26482349, 2648234…
$ IncarcTotal 873516, 896146, 935760, 1051515, 1184030, 1247280, 1313957, 1424763, 1520650,…
$ TotalPop    200658583, 200658583, 215517361, 215517361, 215517361, 215517361, 215517361, …

We can see that there are 39 observations and 9 variables in this dataset. The variable names are Year, WhiteM, WhiteF, BlackM, BlackF, WhitePop, BlackPop, IncarcTotal, and TotalPop.

At this point, you might notice that many of the commands in R look a lot like functions from math class; that is, invoking R commands means supplying a function with some number of arguments. The glimpse command, for example, took a single argument, the name of a data frame.
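The function-and-arguments idea applies to other inspection commands too. The sketch below uses a made-up data frame called toy (standing in for NPS_Pop) and a few base R functions alongside glimpse(); the column names are invented for the example.

```r
library(dplyr)

# A made-up data frame standing in for NPS_Pop.
toy <- data.frame(
  Year   = c(2001, 2002, 2003),
  CountA = c(120, 135, 150),
  CountB = c(80, 95, 90)
)

glimpse(toy)      # dimensions, column names, and the first few values
dim(toy)          # number of rows, then number of columns
names(toy)        # column names only
head(toy, n = 2)  # first two rows; n is a second, named argument
```

Here head() takes two arguments, the data frame and the number of rows to show, separated by a comma.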

Some Exploration

Let’s start to examine the data a little more closely. We can access the data in a single column of a data frame separately using a command like

NPS_Pop$WhiteM

This command will only show the number of White males incarcerated in a given year. The dollar sign basically says “go to the data frame that comes before me, and find the variable that comes after me”.

1. What command would you use to extract just the number of White females incarcerated in a given year? The number of Black males? Of Black females? Try extracting these counts on your own now.

Notice that the way R has printed these data is different. When we looked at the complete data frame, we saw 39 rows, one on each line of the display. These data are no longer structured in a table with other variables, so they are displayed one right after another. Objects that print out in this way are called vectors; they represent a set of numbers. R has added numbers in [brackets] along the left side of the printout to indicate locations within the vector. For example, 438491 follows [1], indicating that 438491 is the first entry in the vector. And if [15] starts a line, then that would mean the first number on that line would represent the 15th entry in the vector.
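The bracketed positions can also be used to pull out individual entries. As a small sketch, the vector below is typed in by hand from the first five WhiteM values in the glimpse() output earlier in this lab; the name white_m is made up for the example.

```r
# The first five WhiteM values from the glimpse() output.
white_m <- c(438491, 452711, 475450, 534758, 600355)

white_m[1]       # first entry: 438491
white_m[2:3]     # second and third entries: 452711 475450
length(white_m)  # number of entries: 5
```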

Data visualization

R has some powerful functions for making graphics. We can create a simple plot of the number of Black males in incarceration each year with the command

ggplot(data = NPS_Pop, aes(x = Year, y = BlackM)) + 
  geom_point()

We use the ggplot() function to build plots. If you run the plotting code in your console, you should see the plot appear under the Plots tab of the lower right panel of RStudio. Notice that the command above again looks like a function, this time with arguments separated by commas.

With ggplot():

  • The first argument is always the dataset.
  • Next, you provide the variables from the dataset to be assigned to aesthetic elements of the plot, e.g. the x and the y axes.
  • Finally, you use another layer, separated by a +, to specify the geometric object for the plot. Since we want a scatterplot, we use geom_point().

For instance, if you wanted to visualize the above plot using a line graph, you would replace geom_point() with geom_line().

ggplot(data = NPS_Pop, aes(x = Year, y = BlackM)) + 
  geom_line()
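Because layers are added with +, more than one geometric object can share the same data and aesthetics. The sketch below uses made-up data and stores the plot in an object called p; both the data and the object name are only for illustration.

```r
library(ggplot2)

# Made-up data, used only to illustrate layering.
toy <- data.frame(Year = 2001:2005, Count = c(12, 15, 14, 18, 21))

# Points and a connecting line drawn from the same data and aesthetics.
p <- ggplot(data = toy, aes(x = Year, y = Count)) +
  geom_point() +
  geom_line()

p  # printing the object draws the plot
```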

You might wonder how you are supposed to know the syntax for the ggplot function. Thankfully, R documents all of its functions extensively. To learn what a function does and its arguments that are available to you, just type in a question mark followed by the name of the function that you’re interested in. Try the following in your console:

?ggplot

Notice that the help file replaces the plot in the lower right panel. You can toggle between plots and help files using the tabs at the top of that panel.

2. Is there an apparent trend in the number of Black males in incarceration over the years? How would you describe it?

3. Build a scatter or line plot showing the number of White males in incarceration over the years. How does this trend compare to the trend for Black males?

If we want to visualize both trends at the same time, we can add one plot to the other with the following code.

ggplot(data = NPS_Pop, aes(x = Year, y = BlackM)) + 
  geom_line(color ="coral") +
  geom_line(data = NPS_Pop, aes(x = Year, y = WhiteM), color = "blue") +
  ylab("Number of Incarcerated Individuals") 

Don’t worry too much about the specifics of this code; we’ll go into more detail about making plots in future labs. Note that the coral line represents the number of Black males in incarceration, and the blue line represents the number of White males in incarceration. We’ve also changed the title of the y-axis with the ylab function.

Combining vectors

In order to make a fair comparison in incarceration trends, we need to account for the difference in Black and White population size in the United States. One way of doing this is to calculate the rate of incarceration per 100,000 people for each group. For example, the per 100,000 rate of incarceration for Whites is equivalent to:

  • Total number of White individuals in incarceration (WhiteM + WhiteF) divided by the total population of Whites in the US (WhitePop), multiplied by 100,000.

Let’s start by calculating the rate of incarceration per 100,000 for Whites. If we add the vector of incarcerations for White males (WhiteM) to that of White females (WhiteF), R will compute all sums simultaneously.

NPS_Pop$WhiteM + NPS_Pop$WhiteF

What you will see are 39 numbers (in that packed display, because we aren’t looking at a data frame here), each one representing the sum we’re after: the total number of Whites incarcerated each year. Take a look at a few of them and verify that they are right.

Instead of performing each operation separately, we can combine multiple operations in a single line. Just be careful with your parentheses!

((NPS_Pop$WhiteM + NPS_Pop$WhiteF) / NPS_Pop$WhitePop) * 100000
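To check the arithmetic by hand, here is the same calculation on just the first three years. The values are copied from the glimpse() output earlier in this lab; the vector names (white_m, white_f, white_pop, and so on) are made up for the example.

```r
# First three values of each column, copied from the glimpse() output.
white_m   <- c(438491, 452711, 475450)
white_f   <- c(16295, 16829, 17933)
white_pop <- c(178119221, 178119221, 189035012)

# Element-wise addition: one sum per year.
white_total <- white_m + white_f
white_total
# [1] 454786 469540 493383

# Rate per 100,000, again computed for every year at once.
white_rate <- (white_total / white_pop) * 100000
round(white_rate, 1)
```

Each operation is applied position by position, so the first rate uses the first sum and the first population count, and so on.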

Adding a new variable to the data frame

We’ll be using this new vector to generate some plots, so we’ll want to save it as a permanent column in our data frame. Note what we are doing here (saving over raw data) is not good practice. We are only doing it here so that you focus on the process of manipulating data. After this lab, we will never save over raw data.

NPS_Pop <- NPS_Pop %>%
  mutate(White_per100000 = ((WhiteM + WhiteF) / WhitePop) * 100000)

The %>% operator is called the piping operator. It takes the output of the previous expression and pipes it into the first argument of the function in the following one. To continue our analogy with mathematical functions, x %>% f(y) is equivalent to f(x, y).
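A few small cases make the equivalence concrete. This sketch uses a made-up vector x and ordinary base R functions on the right-hand side of the pipe; loading dplyr is one way to make the operator available.

```r
library(dplyr)  # loading dplyr makes the %>% operator available

x <- c(4, 9, 16)

# x %>% f() is the same as f(x).
x %>% sqrt()            # same as sqrt(x)

# x %>% f(y) is the same as f(x, y).
3.14159 %>% round(2)    # same as round(3.14159, 2)

# Pipes can be chained, reading left to right.
x %>% sqrt() %>% sum()  # same as sum(sqrt(x)): 2 + 3 + 4 = 9
```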

A note on piping: we can read these two lines of code as follows:

"Take the NPS_Pop dataset and pipe it into the mutate function. Mutate the NPS_Pop data set by creating a new variable called White_per100000 that is the incarceration rate for Whites per 100,000. Then assign the resulting dataset to the object called NPS_Pop."

This is equivalent to going through each row and calculating the incarceration rate for Whites per 100,000 for that year and recording that value in a new column called White_per100000.

Where is the new variable? When you make changes to variables in your dataset, click on the name of the dataset again to update it in the data viewer.

You’ll see that there is now a new column called White_per100000 that has been tacked onto the data frame. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your environment.
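The role of the assignment is easiest to see by running mutate() with and without it. The sketch below uses a made-up data frame toy with invented columns a and b.

```r
library(dplyr)

toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# Without <-, the result is printed but toy itself is unchanged.
toy %>% mutate(total = a + b)
names(toy)  # still "a" "b"

# With <-, the result is saved back into toy.
toy <- toy %>% mutate(total = a + b)
names(toy)  # now "a" "b" "total"
```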

You can make a line plot of the incarceration rate for Whites per 100,000 over time with the command

ggplot(data = NPS_Pop, aes(x = Year, y = White_per100000)) + 
  geom_line()

4. Mutate a new column called Black_per100000 which represents incarceration rate for Blacks per 100,000. Make sure to include your code in your lab report. What do you see?

5. Add the plot for Black_per100000 to the plot we made for White_per100000. Hint: you can use our code from the Data Visualization section as a starting point. What do you see?

Tip: If you use the up and down arrow keys, you can scroll through your previous commands, your so-called command history. You can also access it by clicking on the history tab in the upper right panel. This will save you a lot of typing in the future.

Comparing variables using mathematical operators

Finally, in addition to simple mathematical operators like subtraction and division, you can ask R to make comparisons like greater than, >, less than, <, and equality, ==. For example, we can ask whether the number of incarcerated Black males exceeded that of White males in each year with the expression

NPS_Pop <- NPS_Pop %>%
  mutate(more_Black_inc = BlackM > WhiteM)

This command adds a new variable to the NPS_Pop dataframe containing the values of either TRUE if that year had more Black males incarcerated than White males, or FALSE if that year did not. This variable contains a different kind of data than we have encountered so far. All other columns in the NPS_Pop data frame have values that are numerical. Here, we’ve asked R to create logical data, data where the values are either TRUE or FALSE. In general, data analysis will involve many different kinds of data types, and one reason for using R is that it is able to represent and compute with many of them.
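Comparisons work element by element, just like arithmetic. Here is a minimal sketch with two made-up vectors, a and b:

```r
a <- c(3, 8, 5)
b <- c(4, 6, 5)

a > b         # FALSE  TRUE FALSE
a == b        # FALSE FALSE  TRUE
class(a > b)  # "logical"
```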

If we want to know the number of years in which incarcerated Black males outnumbered incarcerated White males, we can use the count function.

NPS_Pop %>%
  count(more_Black_inc == TRUE)

The number of incarcerated Black males exceeds that of White males in 27 of the 39 years in our dataset.
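As a sketch of how counting logical values works, the example below uses a made-up data frame with a single logical column called flag. It also shows sum() as an alternative: R treats TRUE as 1 and FALSE as 0, so summing a logical vector counts the TRUE values.

```r
library(dplyr)

toy <- data.frame(flag = c(TRUE, FALSE, TRUE, TRUE))

# count() tabulates each distinct value of the column.
toy %>% count(flag)

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of TRUE values.
sum(toy$flag)  # 3
```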

6. Mutate a new column that reports whether the per 100,000 rate of incarceration is greater for Blacks than for Whites.

7. Count the number of years for which the per 100,000 rate of incarceration is greater for Blacks than for Whites. How does this compare to the previous count we ran?

Minimum and Maximum values

To find the minimum and maximum values of columns, you can use the functions min and max within a summarize() call, which you will learn more about in the following lab. Here’s an example of how to find the minimum and maximum number of Black males incarcerated in a year:

NPS_Pop %>%
  summarize(min = min(BlackM), max = max(BlackM))
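The same pattern on a small made-up data frame shows what the output looks like: one row, with one column per summary. The base R function range() is included as a quick cross-check, not as a replacement for summarize().

```r
library(dplyr)

toy <- data.frame(Year = 2001:2004, Count = c(12, 7, 15, 9))

# One row of output, with one column per summary.
toy %>%
  summarize(min = min(Count), max = max(Count))

# Base R cross-check: range() returns the minimum and maximum together.
range(toy$Count)  # 7 15
```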

8. Find the minimum and maximum incarceration rate per 100,000 for Blacks and Whites.

Resources for learning R and working in RStudio

That was a short introduction to R and RStudio, but we will provide you with more functions and a more complete sense of the language as the course progresses.

In this course we will be using the suite of R packages from the tidyverse. The book R For Data Science by Grolemund and Wickham is a fantastic resource for data analysis in R with the tidyverse. If you are googling for R code, make sure to also include these package names in your search query. For example, instead of googling “scatterplot in R”, google “scatterplot in R with the tidyverse”.

These cheatsheets may come in handy throughout the semester:

Note that some of the code on these cheatsheets may be too advanced for this course. However the majority of it will become useful throughout the semester.


Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License and was adapted from OpenIntro.org.