Visualization with R and RStudio

Author

James L. Adams

Using RStudio

RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.

RStudio is highly customizable and lets you change its appearance and general layout without much fuss. For instance, to rearrange the order of panes, open the options dialog from one of the menus below and choose Pane Layout:

RStudio > Preferences (Mac)
Tools > Global Options (Windows)

There are a lot of other features to take advantage of in RStudio, and many of them are bound to keyboard shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing Option + Shift + K (Alt + Shift + K on Windows). You can also find a handy cheat sheet here.

Projects and Working Directories

Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.
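Because opening a project sets the working directory to the project folder, file paths can be written relative to it. The following is a minimal sketch for checking this from the console. The data/original/gapminder.csv path is the one this tutorial reads from, and it may not exist on your machine.

```r
# Print the current working directory (the project folder when a project is open)
getwd()

# Build a relative path in a platform-independent way
csv_path <- file.path("data", "original", "gapminder.csv")
csv_path

# Check whether the file is reachable from the working directory
file.exists(csv_path)
```

If file.exists() returns FALSE, the working directory is probably not the project folder, or the file has not been downloaded yet.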

R Basics

Creating Variables

You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.

new_int <- 4 
new_int

[1] 4

Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).

cos(new_int)

[1] -0.6536436

cos(4)

[1] -0.6536436
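As a small illustration of variables holding more than a single value, the sketch below stores a numeric vector and reuses it. The names new_vec and doubled are made up for this example.

```r
new_int <- 4

# A variable and the value it holds are interchangeable
identical(cos(new_int), cos(4))

# Variables can also hold larger structures, such as a numeric vector
new_vec <- c(1, 2, 3)

# Arithmetic is applied to every element of the vector
doubled <- new_vec * 2
doubled
```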

Working Environments

You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the data that R is holding in active memory. This environment isn’t necessarily persistent, so don’t count on it lasting between R sessions.
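The environment can also be inspected and cleaned up from the console. A minimal sketch, using a throwaway variable named example_value (a name chosen only for this illustration):

```r
# Create a variable, then confirm it is listed in the current environment
example_value <- 10
"example_value" %in% ls()

# Remove it again
rm(example_value)

# The variable is no longer defined in the current environment
exists("example_value", inherits = FALSE)
```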

Packages

People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:

install.packages(c("tidyverse", "rmarkdown", "flexdashboard"))

library(tidyverse)
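Packages only need to be installed once, but they must be loaded with library() in every new session. One possible way to check which of the packages above are still missing before installing anything is sketched below. The names pkgs and missing_pkgs are illustrative, and the install line is commented out so that nothing is downloaded by accident.

```r
pkgs <- c("tidyverse", "rmarkdown", "flexdashboard")

# requireNamespace() returns TRUE when a package is installed and can be loaded
is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
missing_pkgs <- pkgs[!is_installed]
missing_pkgs

# Uncomment to install only what is missing
# install.packages(missing_pkgs)
```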

Read in Data

We’ll be using the gapminder data again. The original data can be found here, and the version we’re using comes from Jenny Bryan’s gapminder package.

df <- read_csv("./data/original/gapminder.csv")

head(df)

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134
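After reading a file, it is worth confirming that each column was parsed as the expected type. The sketch below uses base R's read.csv() on a short inline string (two rows copied from the output above) so that it runs without the CSV file. With the real data, the same checks could be applied to df.

```r
# Two rows of the gapminder data written out as CSV text
csv_text <- "country,continent,year,lifeExp,pop,gdpPercap
Afghanistan,Asia,1952,28.801,8425333,779.4453
Afghanistan,Asia,1957,30.332,9240934,820.8530"

# read.csv() can parse a string directly through its text argument
df_small <- read.csv(text = csv_text)

# Compact overview of column names, types, and first values
str(df_small)

# Column types as a named character vector
sapply(df_small, class)
```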

Manipulating Data

We can use a number of functions built into the “tidyverse” set of packages to manipulate and subset our data. For instance, if we just wanted data from the Americas, we could use the filter() function to subset by rows using some condition.

df_americas <- filter(df, continent == "Americas")

head(df_americas)

country	continent	year	lifeExp	pop	gdpPercap
Argentina	Americas	1952	62.485	17876956	5911.315
Argentina	Americas	1957	64.399	19610538	6856.856
Argentina	Americas	1962	65.142	21283783	7133.166
Argentina	Americas	1967	65.634	22934225	8052.953
Argentina	Americas	1972	67.065	24779799	9443.039
Argentina	Americas	1977	68.481	26983828	10079.027
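filter() also accepts several conditions at once. A minimal sketch, using a four-row data frame named df_small (a stand-in for df, with rows copied from the outputs above) so that it runs without the CSV file:

```r
library(dplyr)

# Small stand-in for df
df_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan", "Argentina", "Argentina"),
  continent = c("Asia", "Asia", "Americas", "Americas"),
  year      = c(1952, 1957, 1952, 1957),
  lifeExp   = c(28.801, 30.332, 62.485, 64.399)
)

# Conditions separated by commas must all be true for a row to be kept
recent_americas <- filter(df_small, continent == "Americas", year > 1952)
recent_americas
```

Only the Argentina row for 1957 satisfies both conditions.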

If we wanted just the population for each country and year within that subset, we could use the select() function to choose which columns to keep.

df_americas_subset <- select(df_americas, country, year, pop)

head(df_americas_subset)

country	year	pop
Argentina	1952	17876956
Argentina	1957	19610538
Argentina	1962	21283783
Argentina	1967	22934225
Argentina	1972	24779799
Argentina	1977	26983828
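select() can also drop or rename columns. A minimal sketch with a two-row stand-in for df_americas (the data frame df_small and the new column name population are chosen only for this illustration):

```r
library(dplyr)

# Small stand-in for df_americas
df_small <- data.frame(
  country   = c("Argentina", "Argentina"),
  continent = c("Americas", "Americas"),
  year      = c(1952, 1957),
  pop       = c(17876956, 19610538)
)

# A minus sign drops a column and keeps everything else
no_continent <- select(df_small, -continent)
names(no_continent)

# Columns can be renamed while they are selected (new_name = old_name)
renamed <- select(df_small, country, year, population = pop)
names(renamed)
```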

Pipes

The pipe operator from the tidyverse, %>%, allows us to chain commands together so that we can perform several data manipulations in one sequence. There are two ways to perform the previous two operations: the first doesn’t use the pipe, and the second does. They produce exactly the same result.

df_americas_subset <- select(filter(df, continent == "Americas"), country, year, pop)

df_americas_subset <- df %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

head(df_americas_subset)

country	year	pop
Argentina	1952	17876956
Argentina	1957	19610538
Argentina	1962	21283783
Argentina	1967	22934225
Argentina	1972	24779799
Argentina	1977	26983828
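The claim that both forms give the same result can be checked directly with identical(). A minimal sketch on a small stand-in data frame (df_small, nested, and piped are names made up for this example):

```r
library(dplyr)

# Small stand-in for df
df_small <- data.frame(
  country   = c("Afghanistan", "Argentina", "Argentina"),
  continent = c("Asia", "Americas", "Americas"),
  year      = c(1952, 1952, 1957),
  pop       = c(8425333, 17876956, 19610538)
)

# Nested calls: read from the inside out
nested <- select(filter(df_small, continent == "Americas"), country, year, pop)

# Piped calls: read from top to bottom
piped <- df_small %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

# TRUE when both approaches produce exactly the same data frame
identical(nested, piped)
```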

Adding columns

We can add variables to our data based on existing variables using the mutate() function:

df_pop_millions <- df %>%
  mutate(pop_millions = pop / 1000000)

head(df_pop_millions)

country	continent	year	lifeExp	pop	gdpPercap	pop_millions
Afghanistan	Asia	1952	28.801	8425333	779.4453	8.425333
Afghanistan	Asia	1957	30.332	9240934	820.8530	9.240934
Afghanistan	Asia	1962	31.997	10267083	853.1007	10.267083
Afghanistan	Asia	1967	34.020	11537966	836.1971	11.537966
Afghanistan	Asia	1972	36.088	13079460	739.9811	13.079460
Afghanistan	Asia	1977	38.438	14880372	786.1134	14.880372

How would you create a column to calculate GDP by multiplying population and GDP Per Capita?

df_gdp <- df %>%
  mutate(gdp = gdpPercap * pop)

head(df_gdp)

country	continent	year	lifeExp	pop	gdpPercap	gdp
Afghanistan	Asia	1952	28.801	8425333	779.4453	6567086330
Afghanistan	Asia	1957	30.332	9240934	820.8530	7585448670
Afghanistan	Asia	1962	31.997	10267083	853.1007	8758855797
Afghanistan	Asia	1967	34.020	11537966	836.1971	9648014150
Afghanistan	Asia	1972	36.088	13079460	739.9811	9678553274
Afghanistan	Asia	1977	38.438	14880372	786.1134	11697659231
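A single mutate() call can create several columns, and later columns may refer to ones created earlier in the same call. A minimal sketch on a two-row stand-in for df (the gdp_billions column is an addition made only for illustration):

```r
library(dplyr)

# Small stand-in for df
df_small <- data.frame(
  country   = c("Afghanistan", "Afghanistan"),
  year      = c(1952, 1957),
  pop       = c(8425333, 9240934),
  gdpPercap = c(779.4453, 820.8530)
)

df_small_gdp <- df_small %>%
  mutate(
    gdp = gdpPercap * pop,
    gdp_billions = gdp / 1e9  # uses the gdp column defined on the line above
  )

df_small_gdp
```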

Grouping / Summarizing

One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean life expectancy for each continent, we could do so like this:

df_avg <- df %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

df_avg

continent	mean_lifeExp
Africa	48.86533
Americas	64.65874
Asia	60.06490
Europe	71.90369
Oceania	74.32621

You can create multiple columns this way, too:

df_summed <- df %>%
  group_by(continent) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )

You can even group by multiple variables:

df_summed <- df %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )

`summarise()` has grouped output by 'continent'. You can override using the
`.groups` argument.

head(df_summed)

continent	year	mean_lifeExp	sum_lifeExp	median_lifeExp
Africa	1952	39.13550	2035.046	38.8330
Africa	1957	41.26635	2145.850	40.5925
Africa	1962	43.31944	2252.611	42.6305
Africa	1967	45.33454	2357.396	44.6985
Africa	1972	47.45094	2467.449	47.0315
Africa	1977	49.58042	2578.182	49.2725
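The message shown above appears because summarize() removes only the last grouping variable, so the result is still grouped by continent. A minimal sketch of making that choice explicit with the .groups argument follows. The data frame df_small contains made-up values and stands in for df.

```r
library(dplyr)

# Made-up values, for illustration only
df_small <- data.frame(
  continent = c("Asia", "Asia", "Americas", "Americas"),
  year      = c(1952, 1952, 1952, 1957),
  lifeExp   = c(30, 40, 60, 70)
)

summed <- df_small %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    n_rows = n(),       # number of rows in each group
    .groups = "drop"    # return an ungrouped result and silence the message
  )

summed

# FALSE, because the grouping was dropped
is_grouped_df(summed)
```

Dropping the grouping is often the safer default, since later mutate() or summarize() calls on a grouped result are silently applied per group.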

Visualizing Data

Try running this code:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()

ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:

data
a coordinate system
geometry to display the data

We can see this in action, starting with the data alone:

ggplot(df)

It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.

ggplot(df, aes(x = gdpPercap, y = lifeExp))

Fantastic! The data is now mapped onto a 2D plane, which is the most ggplot2 really supports (there are no “traditional” 3D visualizations for it). Every row of our dataset corresponds to a position on this plane, but we still need to decide how those positions will be rendered. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data sits on this plane.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
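Because a plot is built up in layers, it can be stored in a variable and extended step by step. A minimal sketch, using a four-row stand-in for df (df_small, p_base, and p_points are names chosen for this example):

```r
library(ggplot2)

# Small stand-in for df
df_small <- data.frame(
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856),
  lifeExp   = c(28.801, 30.332, 62.485, 64.399)
)

# Data and coordinate system only: no geometry layers yet
p_base <- ggplot(df_small, aes(x = gdpPercap, y = lifeExp))
length(p_base$layers)

# Adding geom_point() appends one layer to the stored plot
p_points <- p_base + geom_point()
length(p_points$layers)

# Typing the variable name draws the plot
p_points
```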

The points are just one way of rendering this. We could do it another way if we wanted to.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_text(aes(label = country))

Or we could try this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot()

Or this:

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col()

ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col(aes(color = country))

Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, a technique called “small multiples” (“facets” in ggplot2) allows us to show the data along additional dimensions. For instance, we can separate things out by year:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(year~.)

Activity

Using what you know so far, create a plot that lays out GDP Per Capita and Life Expectancy by Continent.

Solution

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent~.)

We could also write this as:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent)

Adding color

We can also encode variables as color, shape, or size. Let’s try coloring the points of df by continent.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))

What if I want to make all the points blue?

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))

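That doesn’t work: the points are not blue, and a legend with a single entry, “blue”, appears. That’s because we set the color inside the aes() function, where ggplot2 treats "blue" as a data value to be mapped to a color rather than as the color itself. We use aes() when we’re referring to something inside the data, such as gdpPercap, country, or continent. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().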

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")
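The difference between mapping and setting a color can be checked by looking at the colors ggplot2 actually computes for each point. A minimal sketch using layer_data() on a two-row stand-in for df (df_small, p_mapped, and p_fixed are illustrative names):

```r
library(ggplot2)

# Small stand-in for df
df_small <- data.frame(
  gdpPercap = c(779.4453, 5911.315),
  lifeExp   = c(28.801, 62.485)
)

# "blue" inside aes(): treated as a data value and mapped through the color scale
p_mapped <- ggplot(df_small, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))

# "blue" outside aes(): used directly as the color of every point
p_fixed <- ggplot(df_small, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")

# Colors assigned to the points in each version
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_fixed)$colour)
```

In the mapped version the reported color is whatever the default palette assigns to the single category "blue". In the fixed version it is the string "blue" itself.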

Scales

Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.

We can use a logarithmic X axis to make this image clearer:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10()
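To see why a logarithmic axis helps with values that span several orders of magnitude, compare the spacing of a few values before and after the transformation. The numbers below are made up for illustration and are not taken from the data.

```r
# Illustrative GDP per capita values, each ten times the previous one
gdp_values <- c(100, 1000, 10000, 100000)

# On a linear axis the gaps between neighbours grow rapidly
diff(gdp_values)
# [1]   900  9000 90000

# After a log10 transformation every tenfold step takes up the same distance
log10(gdp_values)
# [1] 2 3 4 5
diff(log10(gdp_values))
# [1] 1 1 1
```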

Adding more geometries

We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to show a regression line on the plot:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()

`geom_smooth()` using formula = 'y ~ x'
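Scale transformations are applied before statistics are computed, so with scale_x_log10() the line drawn by geom_smooth(method = lm) corresponds to a linear model of lifeExp on log10(gdpPercap), not on the raw values. A minimal sketch of fitting that model directly, on a four-row stand-in for df (so the coefficients will differ from those of the full data):

```r
# Small stand-in for df
df_small <- data.frame(
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856),
  lifeExp   = c(28.801, 30.332, 62.485, 64.399)
)

# Same formula shape as the plot: y ~ x, where x is on the log10 scale
fit <- lm(lifeExp ~ log10(gdpPercap), data = df_small)

# Intercept and slope of the fitted line
coef(fit)
```

The slope can be read as the change in life expectancy associated with a tenfold increase in GDP per capita.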

Encodings can be inherited from the top-line ggplot() function, so if we wanted the geom_smooth() layer to use the same colors as the points, we could do so like this:

ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()

`geom_smooth()` using formula = 'y ~ x'

Labels

We can use labs() to set the labels for just about anything.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  )

Themes

ggplot2 has some built-in themes that can improve your charts:

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()

Saving an Image

We can use the ggsave() function to save images that we’ve created as local files.

ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()

ggsave("images/gdp_lifeExp.png")

Saving 7 x 5 in image
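By default, ggsave() saves the most recent plot at the size of the current graphics device, which is where the “Saving 7 x 5 in image” message comes from. A minimal sketch of making the plot, size, and resolution explicit follows. It writes to a temporary folder and uses a four-row stand-in for df, so the file path and data are assumptions made for illustration.

```r
library(ggplot2)

# Small stand-in for df
df_small <- data.frame(
  gdpPercap = c(779.4453, 820.8530, 5911.315, 6856.856),
  lifeExp   = c(28.801, 30.332, 62.485, 64.399)
)

# Store the plot so it can be passed to ggsave() by name
p <- ggplot(df_small, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10()

# ggsave() does not create missing folders, so make sure the target exists
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "gdp_lifeExp.png")

# Width and height are in inches by default
ggsave(out_file, plot = p, width = 7, height = 5, dpi = 300)

file.exists(out_file)
```

Fixing the dimensions in code keeps the saved image the same regardless of how the plot pane happens to be sized.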

Using RMarkdown

Basic RMarkdown Example

RMarkdown is a package that allows you to create documents that include narrative, images, code, and any other pieces of information that are helpful for presenting your data. It is a flavor of markdown, a popular markup language for creating documents.

Regular RMarkdown can be used to create HTML, PDF, or Word documents, and there are a number of extensions that allow for presentations, dashboards, and more.

---
title: "RMarkdown Example"
author: "James L. Adams"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# Here's a first-level header
## And a second-level
### And a third-level
This could be some narrative text for framing my code and analysis. We'll begin with loading packages and data.

```{r message=FALSE}
library(tidyverse)
df <- read_csv("./data/original/gapminder.csv")
```

We included "message=FALSE" in the code chunk options to avoid some console printouts that interrupt the flow of the document.

Next, we'll print the data in a nice-looking way using the `kable` function from `knitr`, which was downloaded automatically as part of `rmarkdown`. We'll use `head(df)` to only print the first few rows rather than the entire data set.
```{r}
knitr::kable(head(df))
```

Next, we'll embed a plot we can make with the data

```{r}
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.org"
  ) +
  theme_bw()
```

We'll include a link to the data source, as well. You can find the original data from [Gapminder](https://www.gapminder.org/) or the processed version we're using from Jenny Bryan's
[gapminder package](https://github.com/jennybc/gapminder).

Dashboard Example

There are ways to use RMarkdown to produce specialized documents, such as a dashboard. One popular dashboard package is flexdashboard.

Here is a basic example of using flexdashboard with our gapminder data:

---
title: "flexdashboard example"
output: 
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---

```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(plotly)

df <- read_csv("./data/original/gapminder.csv")
```

Column {data-width=650}
-----------------------------------------------------------------------

### Chart A

```{r}
# Using the plotly package and ggplotly to add basic interactivity

# Store the plot instead of printing it, so only the interactive version is shown
p <- ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10()

ggplotly(p)
```

Column {data-width=350}
-----------------------------------------------------------------------

### Chart B

```{r}
ggplot(df, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  theme(
    legend.position = "none"
  )
```

### Chart C

```{r}
tmp <- df %>%
  group_by(continent, year) %>%
  summarize(lifeExp = mean(lifeExp), .groups = "drop")

ggplot(tmp, aes(x = year, y = lifeExp, color = continent)) +
  geom_line() +
  geom_point()
```

Sources

Bryan, Jennifer. 2017. “Gapminder: Excerpt from the Gapminder Data, as an R Data Package and in Plain Text Delimited Form.” https://github.com/jennybc/gapminder.

Chang, Winston. n.d. R Graphics Cookbook, 2nd Edition. Accessed August 21, 2021. https://r-graphics.org.

“Gapminder.” n.d. Accessed September 22, 2021. https://www.gapminder.org/data/.

Iannone, Richard, J. J. Allaire, Barbara Borges, RStudio, Keen IO (Dashboard CSS), Abdullah Almsaeed (Dashboard CSS), Jonas Mosbech (StickyTableHeaders), et al. 2020. “Flexdashboard: R Markdown Format for Flexible Dashboards.” https://CRAN.R-project.org/package=flexdashboard.

R Core Team. 2017. “R: A Language and Environment for Statistical Computing.” Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Sievert, Carson, Chris Parmer, Toby Hocking, Scott Chamberlain, Karthik Ram, Marianne Corvellec, Pedro Despouy, Salim Brüggemann, and Plotly Technologies Inc. 2021. “Plotly: Create Interactive Web Graphics via ’Plotly.js’.” https://CRAN.R-project.org/package=plotly.

Wickham, Hadley. 2009. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. http://ggplot2.org.

———. 2014. “Tidy Data.” Journal of Statistical Software 59 (10). http://dx.doi.org/10.18637/jss.v059.i10.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st edition. Sebastopol, CA: O’Reilly Media.

Zimmerman, Naupaka, Greg Wilson, Raniere Silva, Scott Ritchie, François Michonneau, Jeffrey Oliver, Harriet Dashnow, et al. 2019. “Swcarpentry/r-Novice-Gapminder: Software Carpentry: R for Reproducible Scientific Analysis, June 2019.” https://doi.org/10.5281/zenodo.3265164.