Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:

RStudio > Preferences > Pane Layout (Mac)
Tools > Global Options > Pane Layout (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to keyboard shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or by pressing Option + Shift + K on a Mac (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

Projects and Working Directories

Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.
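
Opening a project also sets R's working directory to the project folder, which is what lets relative paths such as the ./data/original/gapminder.csv used later in this tutorial resolve. A minimal sketch for checking this from the console, using only base R (the file may not exist on your machine, in which case the last line simply returns FALSE):

```r
# Print the current working directory; inside an RStudio project this is
# normally the project folder.
getwd()

# Build a path relative to the working directory and check whether the file is there.
data_path <- file.path("data", "original", "gapminder.csv")
file.exists(data_path)
```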

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int) 
cos(4)
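
Variables can hold more than a single value. As a small illustration (the names new_vec and doubled are made up for this example), a vector created with c() is stored and used in the same way:

```r
new_int <- 4

# A variable and the value it holds give the same result.
identical(cos(new_int), cos(4))

# c() combines several values into one vector.
new_vec <- c(1, 2, 3)

# Arithmetic is applied to every element of the vector.
doubled <- new_vec * 2
doubled
```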

Working Environments

You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the objects R is currently holding in memory. This environment isn’t necessarily persistent, so don’t rely on it lasting between R sessions.
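
The same information shown in the “Environment” pane is available from the console. A short sketch using base R functions (scratch_value is a throwaway name used only for illustration):

```r
scratch_value <- 10

# List the names of the objects in the current environment.
ls()

# Check for a specific object by name.
exists("scratch_value")

# Remove it; it disappears from the Environment pane as well.
rm(scratch_value)
exists("scratch_value")
```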

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R, after which library() loads a package for the current session:

install.packages(c("tidyverse", "rmarkdown", "flexdashboard"))

library(tidyverse)
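
Installing only needs to happen once per machine, while library() has to be called in every new session. One way to avoid reinstalling packages you already have is to check for them first. This is a sketch using base R's requireNamespace(), not a required step, and the install line is left commented out on purpose:

```r
pkgs <- c("tidyverse", "rmarkdown", "flexdashboard")

# requireNamespace() returns TRUE when a package is installed and can be loaded.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed yet: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
} else {
  message("All packages are already installed.")
}
```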

Read in Data

We’ll be using the gapminder data again. The original data can be found here, and the version we’re using comes from Jenny Bryan’s gapminder package.

df <- read_csv("./data/original/gapminder.csv")

glimpse(df)
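
If you don't have the CSV file at that path, comparable data is available directly from the gapminder package mentioned above. This sketch assumes that package has been installed (it is not part of the install step earlier). Note that the package stores country and continent as factors, whereas read_csv() reads them as character columns.

```r
library(dplyr)

# The gapminder package exports a data frame that is also called gapminder.
df <- gapminder::gapminder

# One row per country per year, with six columns.
glimpse(df)
```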

Manipulating Data

We can use a number of functions built into the “tidyverse” set of packages to manipulate and subset our data. For instance, if we just wanted data from the Americas, we could use the filter() function to subset by rows using some condition.

df_americas <- filter(df, continent == "Americas")

If we wanted just the population for each country and year within that, we could use the select() function to choose which columns to keep.

df_americas_subset <- select(df_americas, country, year, pop)
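
filter() also accepts more than one condition, and a row must satisfy all of them to be kept. The sketch below uses a tiny made-up table (toy, with invented values) rather than the real data, so the result is easy to check by eye:

```r
library(dplyr)

# Invented values, for illustration only.
toy <- tibble(
  country   = c("A", "A", "B", "C"),
  continent = c("Americas", "Americas", "Americas", "Europe"),
  year      = c(2002, 2007, 2007, 2007),
  pop       = c(10, 12, 50, 30)
)

# Keep rows that are in the Americas AND from 2007, then keep two columns.
toy_2007 <- filter(toy, continent == "Americas", year == 2007)
toy_2007 <- select(toy_2007, country, pop)
toy_2007
```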

Pipes

The pipe operator from the tidyverse, %>%, allows us to chain commands together so that we can perform several data manipulations in sequence, with the result of each step passed to the next. If we wanted to perform the previous two operations, here are two ways to do so. The first doesn’t use the pipe, and the second does. They produce the exact same result.

df_americas_subset <- select(filter(df, continent == "Americas"), country, year, pop)

df_americas_subset <- df %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)
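
To confirm that the nested and piped versions really are equivalent, you can compare their results. This sketch uses a small made-up table (toy, with invented values) so that it runs without the CSV file:

```r
library(dplyr)

# Invented values, for illustration only.
toy <- tibble(
  country   = c("A", "B", "C"),
  continent = c("Americas", "Europe", "Americas"),
  year      = c(2007, 2007, 2007),
  pop       = c(12, 30, 50)
)

nested <- select(filter(toy, continent == "Americas"), country, year, pop)

piped <- toy %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

# Both approaches return the same table.
identical(nested, piped)
```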

Adding columns

We can add variables to our data based on existing variables using the mutate() function:

df_pop_millions <- df %>%
  mutate(pop_millions = pop / 1000000)

How would you create a column to calculate GDP by multiplying population and GDP Per Capita?

df_gdp <- df %>%
  mutate(gdp = gdpPercap * pop)
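
mutate() can also create several columns in one call, and later columns can refer to ones created earlier in the same call. A sketch with made-up values (toy and gdp_billions are names chosen for this example):

```r
library(dplyr)

# Invented values, for illustration only.
toy <- tibble(
  country   = c("A", "B"),
  pop       = c(2000000, 5000000),
  gdpPercap = c(1000, 3000)
)

toy_gdp <- toy %>%
  mutate(
    gdp = gdpPercap * pop,      # total GDP
    gdp_billions = gdp / 1e9    # uses the gdp column defined just above
  )
toy_gdp
```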

Grouping / Summarizing

One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean life expectancy for each continent, we could do so like this:

df_avg <- df %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

You can create multiple columns this way, too:

df_summed <- df %>%
  group_by(continent) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )

You can even group by multiple variables:

df_summed <- df %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )
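
When grouping by more than one variable, summarize() drops only the last grouping level by default, so the result above is still grouped by continent (dplyr prints a message about this). If you would rather get an ungrouped result, one option is the .groups argument, available in dplyr 1.0.0 and later. A sketch with made-up values:

```r
library(dplyr)

# Invented values, for illustration only.
toy <- tibble(
  continent = c("X", "X", "X", "Y"),
  year      = c(2002, 2002, 2007, 2007),
  lifeExp   = c(60, 70, 72, 80)
)

toy_summed <- toy %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    n_rows = n(),        # number of rows in each group
    .groups = "drop"     # return a plain, ungrouped tibble
  )
toy_summed
```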

Visualizing Data

Try running this code:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

data
a coordinate system
geometry to display the data

We can see this in action by building a plot up one component at a time, starting with only the data:

ggplot(df)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(df, aes(x = gdpPercap, y = lifeExp))

Fantastic! The data is in there, and the two variables now describe a 2D plane (the most we can really have in ggplot2; there are no “traditional” 3D visualizations for it). Every row of our dataset corresponds to a point on this plane. We still need to decide how those points will be rendered, though. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data exists on this plane.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
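
Because each step adds a layer to what came before, a plot can also be stored in a variable and built up gradually. A minimal sketch with made-up values (toy, base_plot, and scatter are names chosen for this example):

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(
  gdpPercap = c(500, 2000, 8000, 30000),
  lifeExp   = c(45, 58, 70, 80)
)

# Data and coordinate system only: printing this draws empty axes.
base_plot <- ggplot(toy, aes(x = gdpPercap, y = lifeExp))

# Adding a geometry returns a new plot; base_plot itself is unchanged.
scatter <- base_plot + geom_point()
scatter
```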

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_text(aes(label = country))

Or we could try this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Or this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col()

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col(aes(color = country))

Let’s go back to our points for now, though. Though we’re limited to “2D” visualizations, there’s a technique called “small multiples” (“facets” in ggplot) that allows us to show this data along other dimensions. For instance, we can separate things out by year:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(year~.)

Activity

Using what you know so far, create a plot that lays out GDP Per Capita and Life Expectancy by Continent.

Solution

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent~.)

We could also write this as:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent)
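
facet_grid() can also lay panels out by two variables at once, with one on the rows and one on the columns. A sketch with made-up values (toy is an invented table for illustration):

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  year      = c(2002, 2007, 2002, 2007),
  gdpPercap = c(500, 800, 20000, 25000),
  lifeExp   = c(50, 55, 76, 79)
)

# Rows are continents, columns are years.
p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent ~ year)
p
```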

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of df by continent.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

What if I want to make all the points blue?

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))

That doesn’t work. It’s because we set the color inside of the aes() function. We use aes() when we’re referring to something inside the data, such as gdpPercap, country, or continent. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")
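
To see what happened in the version that didn't work: inside aes(), "blue" is treated as a data value (a category with a single level), so ggplot assigns it a color from its default palette and adds a legend. One way to inspect this is ggplot2's layer_data(), which returns the values actually used for drawing. A sketch with made-up values (toy, mapped, and fixed are names chosen for this example):

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(gdpPercap = c(500, 8000), lifeExp = c(45, 70))

mapped <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))   # "blue" is treated as data

fixed <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")        # "blue" is a fixed setting

# Colours computed for the points in the first layer of each plot.
unique(layer_data(mapped)$colour)   # a palette colour, not "blue"
unique(layer_data(fixed)$colour)    # "blue"
```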

Scales

Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.

We can use a logarithmic X axis to make this image clearer:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10()
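
Other scales work the same way: each scale function controls how one encoding is displayed. As a sketch with made-up values, the following adjusts the range of point sizes and swaps the color palette. The specific palette and size range are arbitrary choices for illustration:

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  gdpPercap = c(500, 800, 20000, 25000),
  lifeExp   = c(50, 55, 76, 79),
  pop       = c(1e6, 5e6, 2e7, 6e7)
)

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +                         # logarithmic x axis
  scale_size_continuous(range = c(1, 8)) +  # smallest and largest point sizes
  scale_color_brewer(palette = "Set2")      # a ColorBrewer palette
p
```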

Adding more geometries

We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to this to show a regression line on the plot:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()

Encodings can be inherited from the top-level ggplot() function, so if we wanted geom_smooth() to use the same colors as the points (which gives one regression line per continent), we could do so like this:

ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()

Labels

We can use labs() to set the labels for just about anything.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  )

Themes

ggplot2 has some built-in themes that can improve the look of your charts:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()
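
A built-in theme can be adjusted further by adding a theme() call after it. The order matters, because a complete theme added later would overwrite the adjustment. A sketch with made-up values, where theme_minimal() and the legend position are arbitrary choices for illustration:

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(
  continent = c("X", "X", "Y", "Y"),
  gdpPercap = c(500, 800, 20000, 25000),
  lifeExp   = c(50, 55, 76, 79)
)

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  theme_minimal() +                    # a complete built-in theme
  theme(legend.position = "bottom")    # then a single adjustment on top of it
p
```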

Saving an Image

We can use the ggsave() function to save images that we’ve created as local files.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()

ggsave("images/gdp_lifeExp.png")
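
By default ggsave() writes the most recently displayed plot, and the target folder needs to exist before saving. A slightly more explicit sketch stores the plot in a variable, creates the folder first, and sets the output size. The data, dimensions, resolution, and file name here are arbitrary example values:

```r
library(ggplot2)

# Invented values, for illustration only.
toy <- data.frame(gdpPercap = c(500, 2000, 8000), lifeExp = c(45, 58, 70))

p <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# Make sure the output folder exists before saving.
dir.create("images", showWarnings = FALSE)

# Passing the plot explicitly avoids relying on "the last plot displayed".
# Width and height are in inches by default.
ggsave("images/toy_plot.png", plot = p, width = 8, height = 6, dpi = 300)
```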

Using RMarkdown

Basic RMarkdown Example

RMarkdown is a package that allows you to create documents that include narrative, images, code, and any other pieces of information that are helpful for presenting your data. It is a flavor of markdown, a popular markup language for creating documents.

Regular RMarkdown can be used to create html, pdf, or Word documents, and there are a number of extensions that allow for presentations, dashboards, and more.
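
In RStudio, an RMarkdown file is usually rendered with the Knit button. The same thing can be done from the console with rmarkdown::render(). The sketch below writes a tiny throwaway document to a temporary file so that it is self-contained; it assumes rmarkdown and pandoc (which RStudio bundles) are available, and the document contents are invented for illustration:

```r
# Write a minimal RMarkdown document to a temporary file.
rmd_path <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Throwaway example\"",
  "output: html_document",
  "---",
  "",
  "Some narrative text.",
  "",
  "```{r}",
  "1 + 1",
  "```"
), rmd_path)

# render() knits the file and returns the path of the output document.
out_path <- rmarkdown::render(rmd_path, quiet = TRUE)
out_path
```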

---
title: "RMarkdown Example"
author: "James L. Adams"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Here's a first-level header
## And a second-level
### And a third-level
This could be some narrative text for framing my code and analysis. We'll begin with loading packages and data.

```{r message=FALSE}
library(tidyverse)
df <- read_csv("./data/original/gapminder.csv")
```

We included "message=FALSE" in the code chunk options to avoid some console printouts that interrupt the flow of the document.

Next, we'll print the data in a nice-looking way using the `kable` function from `knitr`, which was downloaded automatically as part of `rmarkdown`. We'll use `head(df)` to only print the first few rows rather than the entire data set.
```{r}
knitr::kable(head(df))
```

Next, we'll embed a plot we can make with the data

```{r}
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()
```

We'll include a link to the data source, as well. You can find the original data from [Gapminder](https://www.gapminder.org/) or the processed version we're using from Jenny Bryan's
[gapminder package](https://github.com/jennybc/gapminder).

Dashboard Example

There are ways to use RMarkdown to produce specialized documents, such as a dashboard. One popular dashboard package is flexdashboard.

Here is a basic example of using flexdashboard with our gapminder data:

---
title: "flexdashboard example"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(plotly)

df <- read_csv("./data/original/gapminder.csv")
```

Column {data-width=650}
-----------------------------------------------------------------------

### Chart A

```{r}
# Using the plotly package and ggplotly to add basic interactivity.
# The plot is stored in a variable so that only the interactive version is displayed.

p <- ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10()

ggplotly(p)
```

Column {data-width=350}
-----------------------------------------------------------------------

### Chart B

```{r}
ggplot(df, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  theme(
    legend.position = "none"
  )
```

### Chart C

```{r}
tmp <- df %>%
  group_by(continent, year) %>%
  summarize(lifeExp = mean(lifeExp))

ggplot(tmp, aes(x = year, y = lifeExp, color = continent)) +
  geom_line() +
  geom_point()
```

Sources

Bryan, Jennifer. 2017. Gapminder: Data from Gapminder. https://CRAN.R-project.org/package=gapminder.

Chang, Winston. n.d. R Graphics Cookbook, 2nd Edition. Accessed August 21, 2021. https://r-graphics.org.

“Gapminder.” n.d. Accessed September 22, 2021. https://www.gapminder.org/data/.

Iannone, Richard, J. J. Allaire, Barbara Borges, RStudio, Keen IO (Dashboard CSS), Abdullah Almsaeed (Dashboard CSS), Jonas Mosbech (StickyTableHeaders), et al. 2020. “Flexdashboard: R Markdown Format for Flexible Dashboards.” https://CRAN.R-project.org/package=flexdashboard.

R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy, Salim Brüggemann, and Plotly Technologies Inc. 2021. “Plotly: Create Interactive Web Graphics via ’Plotly.js’.” https://CRAN.R-project.org/package=plotly.

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://dx.doi.org/10.18637/jss.v059.i10.

———. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, and Dewey Dunnington. 2021. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://CRAN.R-project.org/package=ggplot2.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1 edition. Sebastopol, CA: O’Reilly Media.

Zimmerman, Naupaka, Greg Wilson, Raniere Silva, Scott Ritchie, François Michonneau, Jeffrey Oliver, Harriet Dashnow, et al. 2019. “Swcarpentry/r-Novice-Gapminder: Software Carpentry: R for Reproducible Scientific Analysis, June 2019.” https://doi.org/10.5281/zenodo.3265164.

Visualization with R and RStudio

James L. Adams