Visualization with R and RStudio

Author

James L. Adams

Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable, and lets you change its appearance and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog and choose Pane Layout:

  • RStudio > Preferences (Mac)
  • Tools > Global Options (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing Option + Shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

Projects and Working Directories

Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int
[1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int) 
[1] -0.6536436
cos(4)
[1] -0.6536436
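Variables aren't limited to single numbers. As a small sketch (the names life_exp, mean_life_exp, and small_df are made up for illustration, and the values are the first three lifeExp figures from the gapminder data used later on), the same assignment operator stores a vector or a whole table:

```r
# A numeric vector: several values stored under one name
life_exp <- c(28.801, 30.332, 31.997)

# Functions work on the stored values just as they would on the literals
mean_life_exp <- mean(life_exp)

# A data frame: a table whose columns are vectors of equal length
small_df <- data.frame(
  year = c(1952, 1957, 1962),
  lifeExp = life_exp
)

# Number of rows in the table
nrow(small_df)
```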

Working Environments

You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the data that R holds in active memory. This environment isn’t necessarily persistent, so it may not carry over between R sessions.
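The same information is available from the console. The sketch below is one possible way to list, save, remove, and restore an object; the file location (a temporary folder) and the name restored are assumptions made for the example, not part of the workshop materials.

```r
new_int <- 4

# ls() returns the names of the objects in the current environment
"new_int" %in% ls()

# saveRDS() writes a single object to disk so it can outlive the session
rds_path <- file.path(tempdir(), "new_int.rds")
saveRDS(new_int, rds_path)

# rm() removes the object from the environment
rm(new_int)
exists("new_int")

# readRDS() reads the saved object back in under any name we like
restored <- readRDS(rds_path)
restored
```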

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:

install.packages(c("tidyverse", "rmarkdown", "flexdashboard"))
library(tidyverse)
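Packages only need to be installed once, but library() has to be called in every new session. If you aren't sure what is already installed, something like the following sketch can check first; missing_pkgs is just an illustrative name, and the install call is left commented out so nothing is downloaded by accident.

```r
pkgs <- c("tidyverse", "rmarkdown", "flexdashboard")

# requireNamespace() returns TRUE when a package is installed and loadable
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Keep only the names of packages that could not be found
missing_pkgs <- pkgs[!installed]
missing_pkgs

# Uncomment to install whatever is missing
# install.packages(missing_pkgs)
```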

Read in Data

We’ll be using the gapminder data again. The original data can be found here, and the version we’re using comes from Jenny Bryan’s gapminder package.

df <- read_csv("./data/original/gapminder.csv")

head(df)
country continent year lifeExp pop gdpPercap
Afghanistan Asia 1952 28.801 8425333 779.4453
Afghanistan Asia 1957 30.332 9240934 820.8530
Afghanistan Asia 1962 31.997 10267083 853.1007
Afghanistan Asia 1967 34.020 11537966 836.1971
Afghanistan Asia 1972 36.088 13079460 739.9811
Afghanistan Asia 1977 38.438 14880372 786.1134
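Beyond head(), it can help to check the size and column types of a freshly read file. This is a minimal sketch that assumes readr 2.0 or later, where I() marks a string as literal data instead of a file path; csv_text and sample_df are illustrative names, and the three rows are copied from the output above so the example runs without the CSV file.

```r
library(readr)  # loaded automatically by library(tidyverse)

# Three rows from the gapminder output above, written as literal CSV text
csv_text <- "country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.8530
Afghanistan,Asia,1962,31.997,10267083,853.1007"

# I() tells read_csv() to treat the string as data rather than a path
sample_df <- read_csv(I(csv_text), show_col_types = FALSE)

# Column names, then the number of rows and columns
names(sample_df)
dim(sample_df)

# The column types readr guessed while reading
spec(sample_df)
```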

Manipulating Data

We can use a number of functions built into the “tidyverse” set of packages to manipulate and subset our data. For instance, if we just wanted data from the Americas, we could use the filter() function to subset by rows using some condition.

df_americas <- filter(df, continent == "Americas")

head(df_americas)
country continent year lifeExp pop gdpPercap
Argentina Americas 1952 62.485 17876956 5911.315
Argentina Americas 1957 64.399 19610538 6856.856
Argentina Americas 1962 65.142 21283783 7133.166
Argentina Americas 1967 65.634 22934225 8052.953
Argentina Americas 1972 67.065 24779799 9443.039
Argentina Americas 1977 68.481 26983828 10079.027
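filter() also accepts more than one condition. In the sketch below, sample_df is a small stand-in built from a few rows of the output above (so the example runs on its own), and the result names are made up for illustration.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  country = c("Afghanistan", "Afghanistan", "Argentina", "Argentina", "Argentina"),
  continent = c("Asia", "Asia", "Americas", "Americas", "Americas"),
  year = c(1952, 1957, 1952, 1957, 1962),
  pop = c(8425333, 9240934, 17876956, 19610538, 21283783)
)

# Conditions separated by commas must all be TRUE for a row to be kept
americas_recent <- filter(sample_df, continent == "Americas", year >= 1957)

# %in% keeps rows whose value matches any entry in a set
early_years <- filter(sample_df, year %in% c(1952, 1957))

americas_recent
early_years
```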

If we wanted just the population for each country and year within that, we could use the select() function to choose which columns to keep.

df_americas_subset <- select(df_americas, country, year, pop)

head(df_americas_subset)
country year pop
Argentina 1952 17876956
Argentina 1957 19610538
Argentina 1962 21283783
Argentina 1967 22934225
Argentina 1972 24779799
Argentina 1977 26983828
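select() can drop or rename columns as well as keep them. A short sketch, again using a hand-built sample_df with rows from the output above; no_continent and renamed are illustrative names.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  country = c("Argentina", "Argentina", "Argentina"),
  continent = c("Americas", "Americas", "Americas"),
  year = c(1952, 1957, 1962),
  pop = c(17876956, 19610538, 21283783)
)

# A minus sign drops a column and keeps everything else
no_continent <- select(sample_df, -continent)

# new_name = old_name renames a column while selecting it
renamed <- select(sample_df, country, year, population = pop)

names(no_continent)
names(renamed)
```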

Pipes

The pipe operator from the tidyverse, %>%, allows us to chain commands together so that we can perform several data manipulations in sequence. The previous two operations can be combined in two ways. The first doesn’t use the pipe, and the second does. They produce exactly the same result.

df_americas_subset <- select(filter(df, continent == "Americas"), country, year, pop)

df_americas_subset <- df %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

head(df_americas_subset)
country year pop
Argentina 1952 17876956
Argentina 1957 19610538
Argentina 1962 21283783
Argentina 1967 22934225
Argentina 1972 24779799
Argentina 1977 26983828
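R 4.1 and later also ship a built-in pipe, |>, which behaves the same way for simple chains like this one. The sketch below assumes R 4.1 or newer and uses a small hand-built sample_df (rows taken from the output above) to compare the two.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  country = c("Afghanistan", "Argentina", "Argentina"),
  continent = c("Asia", "Americas", "Americas"),
  year = c(1952, 1952, 1957),
  pop = c(8425333, 17876956, 19610538)
)

# The tidyverse (magrittr) pipe
with_magrittr <- sample_df %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

# The native pipe built into R 4.1 and later
with_native <- sample_df |>
  filter(continent == "Americas") |>
  select(country, year, pop)

# Both chains return the same data frame
identical(with_magrittr, with_native)
```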

Adding columns

We can add variables to our data based on existing variables using the mutate() function.

df_pop_millions <- df %>%
  mutate(pop_millions = pop / 1000000)

head(df_pop_millions)
country continent year lifeExp pop gdpPercap pop_millions
Afghanistan Asia 1952 28.801 8425333 779.4453 8.425333
Afghanistan Asia 1957 30.332 9240934 820.8530 9.240934
Afghanistan Asia 1962 31.997 10267083 853.1007 10.267083
Afghanistan Asia 1967 34.020 11537966 836.1971 11.537966
Afghanistan Asia 1972 36.088 13079460 739.9811 13.079460
Afghanistan Asia 1977 38.438 14880372 786.1134 14.880372

How would you create a column to calculate GDP by multiplying population and GDP Per Capita?

df_gdp <- df %>%
  mutate(gdp = gdpPercap * pop)

head(df_gdp)
country continent year lifeExp pop gdpPercap gdp
Afghanistan Asia 1952 28.801 8425333 779.4453 6567086330
Afghanistan Asia 1957 30.332 9240934 820.8530 7585448670
Afghanistan Asia 1962 31.997 10267083 853.1007 8758855797
Afghanistan Asia 1967 34.020 11537966 836.1971 9648014150
Afghanistan Asia 1972 36.088 13079460 739.9811 9678553274
Afghanistan Asia 1977 38.438 14880372 786.1134 11697659231
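A single mutate() call can create several columns, and later columns can use ones defined earlier in the same call. In this sketch, sample_df holds two rows from the output above and gdp_billions is a made-up column name.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  country = c("Afghanistan", "Afghanistan"),
  year = c(1952, 1957),
  pop = c(8425333, 9240934),
  gdpPercap = c(779.4453, 820.8530)
)

df_more <- sample_df %>%
  mutate(
    gdp = gdpPercap * pop,     # total GDP
    gdp_billions = gdp / 1e9   # uses the gdp column created on the line above
  )

df_more
```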

Grouping / Summarizing

One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean life expectancy for each continent, we could do so like this:

df_avg <- df %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

df_avg
continent mean_lifeExp
Africa 48.86533
Americas 64.65874
Asia 60.06490
Europe 71.90369
Oceania 74.32621

You can create multiple columns this way, too:

df_summed <- df %>%
  group_by(continent) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )

You can even group by multiple variables:

df_summed <- df %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )
`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.
head(df_summed)
continent year mean_lifeExp sum_lifeExp median_lifeExp
Africa 1952 39.13550 2035.046 38.8330
Africa 1957 41.26635 2145.850 40.5925
Africa 1962 43.31944 2252.611 42.6305
Africa 1967 45.33454 2357.396 44.6985
Africa 1972 47.45094 2467.449 47.0315
Africa 1977 49.58042 2578.182 49.2725
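The `summarise()` message above is a reminder that, after summarizing by two variables, the result is still grouped by the first one. The sketch below shows both options for the .groups argument on a tiny hand-built sample_df (values copied from earlier output); df_last and df_dropped are illustrative names.

```r
library(dplyr)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  continent = c("Asia", "Asia", "Americas", "Americas"),
  year = c(1952, 1957, 1952, 1957),
  lifeExp = c(28.801, 30.332, 62.485, 64.399)
)

# "drop_last" is the default behavior, stated explicitly: year is dropped
# from the grouping and continent is kept
df_last <- sample_df %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop_last")

# "drop" removes all grouping from the result
df_dropped <- sample_df %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop")

group_vars(df_last)
is_grouped_df(df_dropped)
```

Dropping the groups is often the safer choice when the summary will be passed on to other functions, since leftover grouping can change how later steps behave.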

Visualizing Data

Try running this code:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

  • data
  • a coordinate system
  • geometry to display the data

We can see this in action:

ggplot(df)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(df, aes(x = gdpPercap, y = lifeExp))

Fantastic! The data is in there, and it now describes a 2D plane (the most we can really have in ggplot2; there are no “traditional” 3D visualizations for it). For every row of our dataset, there is a point on this plane. We still need to decide how those points will be rendered, though. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data exists on this plane.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_text(aes(label = country))

Or we could try this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Or this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col()

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col(aes(color = country))

Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot2) allows us to show additional dimensions of the data. For instance, we can separate things out by year:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(year~.)

Activity

Using what you know so far, create a plot that lays out GDP Per Capita and Life Expectancy by Continent.

Solution

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent~.)

We could also write this as:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent)
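facet_grid() can also split on two variables at once, with one on the rows and one on the columns. A minimal sketch, using a hand-built sample_df of four rows from the earlier output so it runs without the CSV file:

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  continent = c("Asia", "Asia", "Americas", "Americas"),
  year = c(1952, 1957, 1952, 1957),
  lifeExp = c(28.801, 30.332, 62.485, 64.399),
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856)
)

# Rows of panels by continent, columns of panels by year
p <- ggplot(sample_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent ~ year)

p
```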

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of df by continent.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

What if I want to make all the points blue?

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))

That doesn’t work. Because we put "blue" inside the aes() function, ggplot2 treats it as a data value rather than a color name, so every point gets the same default palette color and a legend entry labeled “blue”. We use aes() when we’re referring to something inside the data, such as gdpPercap, country, or continent. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")
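Mapped and fixed aesthetics can be combined in the same layer. In this sketch (sample_df is a four-row stand-in built from earlier output), color comes from the data while size and transparency are set to the same value for every point.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  continent = c("Asia", "Asia", "Americas", "Americas"),
  lifeExp = c(28.801, 30.332, 62.485, 64.399),
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856)
)

# color is mapped to a column, so it goes inside aes();
# size and alpha are constants, so they go outside aes()
p <- ggplot(sample_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), size = 3, alpha = 0.6)

p
```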

Scales

Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.

We can use a logarithmic X axis to make this image clearer:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10()
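Scale functions work the same way for the other encodings. As a sketch, the plot below picks specific colors for each continent and widens the range of point sizes; the color choices and the size range are arbitrary values chosen for illustration, and sample_df is a small stand-in built from earlier output.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  continent = c("Asia", "Asia", "Americas", "Americas"),
  lifeExp = c(28.801, 30.332, 62.485, 64.399),
  pop = c(8425333, 9240934, 17876956, 19610538),
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856)
)

p <- ggplot(sample_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10() +
  # Assign a named color to each value of continent
  scale_color_manual(values = c(Americas = "darkorange", Asia = "steelblue")) +
  # Smallest and largest point sizes used for pop
  scale_size_continuous(range = c(1, 8))

p
```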

Adding more geometries

We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to this to show a regression line on the plot.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'

Encodings can be inherited from the top-line ggplot() function, so if we wanted the geom_smooth() layer to have the same color as the points, we could do so like this:

ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
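One detail worth knowing: ggplot2 applies scale transformations before it computes statistics, so with scale_x_log10() the line from geom_smooth(method = lm) is fitted against log10(gdpPercap), not the raw values. The sketch below fits the equivalent model directly with base R on a handful of rows from the earlier output; sample_df and fit are illustrative names.

```r
sample_df <- data.frame(
  lifeExp = c(28.801, 30.332, 31.997, 62.485, 64.399),
  gdpPercap = c(779.4453, 820.8530, 853.1007, 5911.315, 6856.856)
)

# The model geom_smooth(method = lm) fits once the x axis is log10-scaled
fit <- lm(lifeExp ~ log10(gdpPercap), data = sample_df)

# Intercept and slope; the slope is the change in life expectancy
# for each tenfold increase in GDP per capita
coef(fit)
```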

Labels

We can use labs() to set the labels for just about anything.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  )

Themes

ggplot2 has some built-in themes that can improve your charts:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()

Saving an Image

We can use the ggsave() function to save images that we’ve created as local files.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()

ggsave("images/gdp_lifeExp.png")
Saving 7 x 5 in image
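By default, ggsave() saves the last plot displayed at the size of the current graphics device, which is where the “Saving 7 x 5 in image” message comes from. To be explicit, we can pass the plot and the dimensions ourselves. This sketch writes to a temporary folder and creates it first, since the target folder needs to exist; the paths, the 300 dpi setting, and sample_df are assumptions made for the example.

```r
library(ggplot2)  # loaded automatically by library(tidyverse)

sample_df <- data.frame(
  lifeExp = c(28.801, 30.332, 62.485, 64.399),
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856)
)

p <- ggplot(sample_df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

# Make sure the output folder exists before saving into it
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "gdp_lifeExp.png")

# Name the plot and set the size so the result doesn't depend on the device
ggsave(out_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```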

Using RMarkdown

Basic RMarkdown Example

RMarkdown is a package that allows you to create documents that include narrative, images, code, and any other pieces of information that are helpful for presenting your data. It is a flavor of markdown, a popular markup language for creating documents.

Regular RMarkdown can be used to create HTML, PDF, or Word documents, and there are a number of extensions that allow for presentations, dashboards, and more.
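Once a document like the example in this section is saved as an .Rmd file, it can be turned into its output format from the console with rmarkdown::render(). In the sketch below, the file name rmarkdown_example.Rmd is hypothetical, and the call is skipped if the file or the package isn't available.

```r
# Hypothetical file name; replace with the path to your own .Rmd file
rmd_file <- "rmarkdown_example.Rmd"

# Only render when the rmarkdown package and the file are both present
if (requireNamespace("rmarkdown", quietly = TRUE) && file.exists(rmd_file)) {
  # output_format overrides the format named in the document's header
  rmarkdown::render(rmd_file, output_format = "html_document")
}
```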

---
title: "RMarkdown Example"
author: "James L. Adams"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Here's a first-level header
## And a second-level
### And a third-level
This could be some narrative text for framing my code and analysis. We'll begin with loading packages and data.

```{r message=FALSE}
library(tidyverse)
df <- read_csv("./data/original/gapminder.csv")
```

We included "message=FALSE" in the code chunk options to avoid some console printouts that interrupt the flow of the document.

Next, we'll print the data in a nice-looking way using the `kable` function from `knitr`, which was downloaded automatically as part of `rmarkdown`. We'll use `head(df)` to only print the first few rows rather than the entire data set.
```{r}
knitr::kable(head(df))
```

Next, we'll embed a plot we can make with the data

```{r}
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()
```

We'll include a link to the data source, as well. You can find the original data from [Gapminder](https://www.gapminder.org/) or the processed version we're using from Jenny Bryan's
[gapminder package](https://github.com/jennybc/gapminder).

Dashboard Example

There are ways to use RMarkdown to produce specialized documents, such as a dashboard. One popular dashboard package is flexdashboard.

Here is a basic example of using flexdashboard with our gapminder data:

---
title: "flexdashboard example"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(plotly)

df <- read_csv("./data/original/gapminder.csv")
```

Column {data-width=650}
-----------------------------------------------------------------------

### Chart A

```{r}
# Using the plotly package and ggplotly() to add basic interactivity.
# Storing the plot in p keeps the static version from being printed as well.

p <- ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10()

ggplotly(p)
```

Column {data-width=350}
-----------------------------------------------------------------------

### Chart B

```{r}
ggplot(df, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  theme(
    legend.position = "none"
  )
```

### Chart C

```{r}
tmp <- df %>%
  group_by(continent, year) %>%
  summarize(lifeExp = mean(lifeExp), .groups = "drop")

ggplot(tmp, aes(x = year, y = lifeExp, color = continent)) +
  geom_line() +
  geom_point()
```

Sources

Bryan, Jennifer. 2017. “Gapminder: Excerpt from the Gapminder Data, as an R Data Package and in Plain Text Delimited Form.” https://github.com/jennybc/gapminder.
Chang, Winston. n.d. R Graphics Cookbook, 2nd Edition. Accessed August 21, 2021. https://r-graphics.org.
“Gapminder.” n.d. Accessed September 22, 2021. https://www.gapminder.org/data/.
Iannone, Richard, J. J. Allaire, Barbara Borges, RStudio, Keen IO (Dashboard CSS), Abdullah Almsaeed (Dashboard CSS), Jonas Mosbech (StickyTableHeaders), et al. 2020. “Flexdashboard: R Markdown Format for Flexible Dashboards.” https://CRAN.R-project.org/package=flexdashboard.
R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy, Salim Brüggemann, and Plotly Technologies Inc. 2021. “Plotly: Create Interactive Web Graphics via ’Plotly.js’.” https://CRAN.R-project.org/package=plotly.
Wickham, Hadley. 2009. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.
———. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://dx.doi.org/10.18637/jss.v059.i10.
Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1 edition. Sebastopol, CA: O’Reilly Media.
Zimmerman, Naupaka, Greg Wilson, Raniere Silva, Scott Ritchie, François Michonneau, Jeffrey Oliver, Harriet Dashnow, et al. 2019. “Swcarpentry/r-Novice-Gapminder: Software Carpentry: R for Reproducible Scientific Analysis, June 2019.” https://doi.org/10.5281/zenodo.3265164.