RStudio is an Integrated Development Environment (IDE) for using the R language. It has a lot of useful tools for writing code, managing projects, viewing plots, and more.
RStudio is highly customizable, and allows you to change appearances and general layout without much fuss. For instance, to rearrange the order of panes, you can do the following:
RStudio > Preferences (Mac)
Tools > Options (Windows)
There are a lot of other features to take advantage of in RStudio, and many of them are bound to shortcuts. You can see those by going to Tools > Keyboard Shortcuts Help or pressing option + shift + K. You can also find a handy cheat sheet here.
Projects are an RStudio feature to help keep your code and working environments contained and organized, which comes in handy when you start to have multiple projects. To start a new project, you can click on the dropdown in the upper right-hand corner of RStudio and choose to begin a new project.
You can use R to create variables. Variables can contain all kinds of information, from single values to entire data structures. We create variables using <-, called the assignment operator.
new_int <- 4
new_int
[1] 4
Variables will behave just like the values they represent. When we run cos() on new_int, it returns the same value as cos(4).
cos(new_int)
[1] -0.6536436
cos(4)
[1] -0.6536436
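As a small sketch of the same idea, a variable can also feed into further calculations and into new variables. The names `radius`, `area`, and `same_result` are made up for illustration.

```r
# Hypothetical variable names, used only for illustration
radius <- 4

# A variable can be used anywhere its value could be used
area <- pi * radius^2

# cos() gives the same answer for the variable and for the literal value
same_result <- identical(cos(radius), cos(4))
same_result
```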
You may have noticed that we have a few new things in our “Environment” pane in RStudio. These variables and functions make up our working environment: the data that R is holding in active memory. This environment isn’t necessarily persistent, so it may not carry over between R sessions.
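The Environment pane has console equivalents. As a minimal sketch, `ls()` lists the objects in the current environment and `rm()` removes one of them. The name `scratch_value` is made up for illustration.

```r
# Create a throwaway variable (hypothetical name)
scratch_value <- 10

# ls() returns the names of the objects in the current environment
in_env_before <- "scratch_value" %in% ls()

# rm() removes the object, just like clearing it from the Environment pane
rm(scratch_value)
in_env_after <- "scratch_value" %in% ls()

c(before = in_env_before, after = in_env_after)
```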
People in different disciplines may write functions or create datasets that will be useful to others in their field. They can bundle these functions and datasets into packages that can be downloaded and used by others. There is a simple way to install packages with R:
install.packages(c("tidyverse", "rmarkdown", "flexdashboard"))
library(tidyverse)
We’ll be using the gapminder data again. The original data can be found here, and the version we’re using comes from Jenny Bryan’s gapminder package.
df <- read_csv("./data/original/gapminder.csv")
head(df)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 |
We can use a number of functions built into the “tidyverse” set of packages to manipulate and subset our data. For instance, if we just wanted data from the Americas, we could use the filter() function to subset by rows using some condition.
df_americas <- filter(df, continent == "Americas")
head(df_americas)
| country | continent | year | lifeExp | pop | gdpPercap |
|---|---|---|---|---|---|
| Argentina | Americas | 1952 | 62.485 | 17876956 | 5911.315 |
| Argentina | Americas | 1957 | 64.399 | 19610538 | 6856.856 |
| Argentina | Americas | 1962 | 65.142 | 21283783 | 7133.166 |
| Argentina | Americas | 1967 | 65.634 | 22934225 | 8052.953 |
| Argentina | Americas | 1972 | 67.065 | 24779799 | 9443.039 |
| Argentina | Americas | 1977 | 68.481 | 26983828 | 10079.027 |
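The condition passed to filter() can be more than a single comparison. The sketch below uses a tiny made-up table (`toy`, with invented countries and values) rather than the gapminder file, so it runs on its own; it assumes only that dplyr is installed.

```r
library(dplyr)

# Toy data with made-up values, standing in for the gapminder file
toy <- tibble(
  country   = c("Country A", "Country A", "Country B", "Country B"),
  continent = c("Americas", "Americas", "Europe", "Europe"),
  year      = c(1952, 2007, 1952, 2007),
  pop       = c(100, 150, 200, 260)
)

# Conditions separated by commas must all be TRUE for a row to be kept
recent_americas <- filter(toy, continent == "Americas", year > 2000)

# %in% keeps rows whose value matches any element of a vector
either_continent <- filter(toy, continent %in% c("Americas", "Europe"))

recent_americas
```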
If we wanted just the population for each country and year within that, we could use the select() function to choose which columns to keep.
df_americas_subset <- select(df_americas, country, year, pop)
head(df_americas_subset)
| country | year | pop |
|---|---|---|
| Argentina | 1952 | 17876956 |
| Argentina | 1957 | 19610538 |
| Argentina | 1962 | 21283783 |
| Argentina | 1967 | 22934225 |
| Argentina | 1972 | 24779799 |
| Argentina | 1977 | 26983828 |
The pipe operator from the tidyverse, %>%, allows us to chain commands together so that we can perform several data manipulations in sequence. If we wanted to perform the previous two operations, here are two ways to do so. The first nests one function call inside the other, and the second uses the pipe. They produce exactly the same result.
df_americas_subset <- select(filter(df, continent == "Americas"), country, year, pop)
df_americas_subset <- df %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)
head(df_americas_subset)
| country | year | pop |
|---|---|---|
| Argentina | 1952 | 17876956 |
| Argentina | 1957 | 19610538 |
| Argentina | 1962 | 21283783 |
| Argentina | 1967 | 22934225 |
| Argentina | 1972 | 24779799 |
| Argentina | 1977 | 26983828 |
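The pipe passes whatever is on its left as the first argument of the function on its right, so `x %>% f(y)` is another way of writing `f(x, y)`. As a minimal check of that equivalence, the sketch below runs both styles on a small made-up table (`toy`) and compares the results.

```r
library(dplyr)

# Toy data with made-up values, standing in for the gapminder file
toy <- tibble(
  country   = c("Country A", "Country A", "Country B"),
  continent = c("Americas", "Americas", "Europe"),
  year      = c(1952, 2007, 2007),
  pop       = c(100, 150, 260)
)

# Nested style: read from the inside out
nested <- select(filter(toy, continent == "Americas"), country, year, pop)

# Piped style: read from top to bottom
piped <- toy %>%
  filter(continent == "Americas") %>%
  select(country, year, pop)

# Compare the two results
results_match <- isTRUE(all.equal(nested, piped))
results_match
```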
We can add variables to our data based on existing variables using the mutate() function:
df_pop_millions <- df %>%
  mutate(pop_millions = pop / 1000000)
head(df_pop_millions)
| country | continent | year | lifeExp | pop | gdpPercap | pop_millions |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 8.425333 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 9.240934 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 10.267083 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 11.537966 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 13.079460 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 14.880372 |
How would you create a column to calculate GDP by multiplying population and GDP Per Capita?
df_gdp <- df %>%
  mutate(gdp = gdpPercap * pop)
head(df_gdp)
| country | continent | year | lifeExp | pop | gdpPercap | gdp |
|---|---|---|---|---|---|---|
| Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6567086330 |
| Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 7585448670 |
| Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 8758855797 |
| Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 9648014150 |
| Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 9678553274 |
| Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 11697659231 |
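A single mutate() call can create several columns, and a later column can use one created earlier in the same call. The sketch below shows this on a small made-up table; `toy` and `gdp_billions` are names invented for illustration.

```r
library(dplyr)

# Toy data with made-up values
toy <- tibble(
  country   = c("Country A", "Country B"),
  pop       = c(2000000, 500000),
  gdpPercap = c(1000, 4000)
)

toy_gdp <- toy %>%
  mutate(
    gdp = gdpPercap * pop,     # total GDP
    gdp_billions = gdp / 1e9   # uses the gdp column created just above
  )

toy_gdp
```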
One of the most powerful features of the tidyverse is the ability to group and summarize data. If we wanted the mean life expectancy for each continent, we could do so like this:
df_avg <- df %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))
df_avg
| continent | mean_lifeExp |
|---|---|
| Africa | 48.86533 |
| Americas | 64.65874 |
| Asia | 60.06490 |
| Europe | 71.90369 |
| Oceania | 74.32621 |
You can create multiple columns this way, too:
df_summed <- df %>%
  group_by(continent) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )
You can even group by multiple variables:
df_summed <- df %>%
  group_by(continent, year) %>%
  summarize(
    mean_lifeExp = mean(lifeExp),
    sum_lifeExp = sum(lifeExp),
    median_lifeExp = median(lifeExp)
  )
`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
head(df_summed)
| continent | year | mean_lifeExp | sum_lifeExp | median_lifeExp |
|---|---|---|---|---|
| Africa | 1952 | 39.13550 | 2035.046 | 38.8330 |
| Africa | 1957 | 41.26635 | 2145.850 | 40.5925 |
| Africa | 1962 | 43.31944 | 2252.611 | 42.6305 |
| Africa | 1967 | 45.33454 | 2357.396 | 44.6985 |
| Africa | 1972 | 47.45094 | 2467.449 | 47.0315 |
| Africa | 1977 | 49.58042 | 2578.182 | 49.2725 |
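The message printed above means that the result is still grouped by continent: when grouping by several variables, summarize() drops only the last one by default. As a sketch on made-up data, the `.groups` argument makes that choice explicit, and group_vars() reports which grouping remains.

```r
library(dplyr)

# Toy data with made-up values
toy <- tibble(
  continent = c("Africa", "Africa", "Africa", "Asia", "Asia", "Asia"),
  year      = c(1952, 1952, 1957, 1952, 1957, 1957),
  lifeExp   = c(40, 42, 44, 50, 52, 54)
)

# "drop_last" spells out the default behavior: the result stays grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop_last")

# "drop" removes all grouping from the result
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(mean_lifeExp = mean(lifeExp), .groups = "drop")

group_vars(still_grouped)
group_vars(ungrouped)
```

Dropping the grouping is often the safer choice when the summary will be passed on to further mutate() or summarize() steps that should operate on the whole table.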
Try running this code:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
ggplot2 is built on the idea of the Grammar of Graphics, a system described by Leland Wilkinson (see readings). There are a lot of nuances to explore in the book and in the ggplot2 package, but one of the core concepts to keep in mind is that a visualization has three critical components:
We can see this in action:
ggplot(df)
It doesn’t mean much yet. Let’s give it a coordinate system and see what happens.
ggplot(df, aes(x = gdpPercap, y = lifeExp))
Fantastic! The data is in there, and the aesthetics describe a 2D plane (the most we can really have in ggplot2, which has no “traditional” 3D visualizations). For every row of our dataset, there is a position on this plane. We still need to decide how those rows will be rendered, though. For these two continuous variables, a scatterplot is a good place to start. We can add another “layer” to our plot containing geom_point() to show where the data exists on this plane.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
The points are just one way of rendering this. We could do it another way if we wanted to.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_text(aes(label = country))
Or we could try this:
ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_boxplot()
Or this:
ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col()
ggplot(df, aes(x = continent, y = lifeExp)) +
  geom_col(aes(color = country))
Let’s go back to our points for now, though. Although we’re limited to “2D” visualizations, there’s a technique called “small multiples” (“facets” in ggplot) that allows us to show this data along other dimensions. For instance, we can separate things out by year:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(year~.)
Using what you know so far, create a plot that lays out GDP Per Capita and Life Expectancy by Continent.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_grid(continent~.)
We could also use facet_wrap(), which wraps the panels into a grid instead of stacking them in rows:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  facet_wrap(~continent)
We can also encode variables as color, shape, or size. Let’s try coloring the points of df by continent.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent))
What if I want to make all the points blue?
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = "blue"))
That doesn’t work, because we set the color inside of the aes() function. We use aes() when we’re referring to something inside the data, such as gdpPercap, country, or continent, so ggplot treats "blue" as a data value rather than as the name of a color. If we’re just trying to assign a characteristic to every point, we do that inside of geom_point() but outside of aes().
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(color = "blue")
Sometimes scales can be modified to improve the readability of an image. Scale functions can be used to manipulate X and Y axes, size, shape, color, fill, alpha, and just about any encoding ggplot uses for geometries.
We can use a logarithmic X axis to make this image clearer:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  scale_x_log10()
We aren’t limited to a single geometry to represent our data within a plot. If we’d like, we can add a geom_smooth() layer to this to show a regression line on the plot:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent, size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
Encodings can be inherited from the top-level ggplot() function, so if we wanted the geom_smooth() layer to have the same color as the points, we could do so like this:
ggplot(df, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point(aes(size = pop)) +
  geom_smooth(method = lm) +
  scale_x_log10()
`geom_smooth()` using formula = 'y ~ x'
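The difference between the last two plots comes down to where the color mapping lives. As a sketch on made-up data (the `toy` data frame and its values are invented), a mapping placed inside one layer stays local to that layer, while a mapping placed in ggplot() is shared by every layer. With a shared color mapping, geom_smooth() fits one line per group instead of a single overall line.

```r
library(ggplot2)

# Toy data with made-up values for two groups
toy <- data.frame(
  gdpPercap = c(500, 1000, 2000, 4000, 800, 1600, 3200, 6400),
  lifeExp   = c(40, 45, 52, 58, 55, 60, 66, 71),
  continent = rep(c("Group A", "Group B"), each = 4)
)

# Color mapped only inside geom_point(): the smooth layer does not see it
p_local <- ggplot(toy, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  geom_smooth(method = "lm", formula = y ~ x)

# Color mapped in ggplot(): every layer inherits it
p_inherited <- ggplot(toy, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# The plot-level mappings show the difference
names(p_local$mapping)
names(p_inherited$mapping)
```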
We can use labs() to set the labels for just about anything.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  )
ggplot has some built-in themes that can improve your charts:
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()
We can use the ggsave() function to save images that we’ve created as local files.
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()
ggsave("images/gdp_lifeExp.png")
Saving 7 x 5 in image
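Called with only a file name, ggsave() saves the most recently displayed plot at the size of the current graphics device, which is where the "Saving 7 x 5 in image" message comes from. As a sketch, the plot and its dimensions can also be passed explicitly. The toy data, the temporary output folder, and the file name below are assumptions made so the example runs anywhere.

```r
library(ggplot2)

# Toy plot with made-up values
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 8))
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Hypothetical output location: a temporary folder instead of images/
out_dir <- file.path(tempdir(), "images")
dir.create(out_dir, showWarnings = FALSE)
out_file <- file.path(out_dir, "toy_plot.png")

# Naming the plot and the size makes the saved file reproducible
ggsave(out_file, plot = p, width = 7, height = 5, units = "in", dpi = 300)

file.exists(out_file)
```

Creating the folder first avoids an error when the target directory does not exist yet.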
RMarkdown is a package that allows you to create documents that include narrative, images, code, and any other pieces of information that are helpful for presenting your data. It is a flavor of markdown, a popular markup language for creating documents.
Regular RMarkdown can be used to create html, pdf, or Word documents, and there are a number of extensions that allow for presentations, dashboards, and more.
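Documents are usually built with the Knit button in RStudio, but the same step can be run from the console with rmarkdown::render(). The sketch below writes a tiny made-up document to a temporary file and renders it to HTML; the file contents and variable names are invented for illustration, and the rendering step is skipped if pandoc cannot be found.

```r
library(rmarkdown)

# Write a tiny, made-up RMarkdown document to a temporary file
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(
  c(
    "---",
    "title: \"Render sketch\"",
    "output: html_document",
    "---",
    "",
    "Two plus two is `r 2 + 2`."
  ),
  rmd_file
)

# render() needs pandoc, which RStudio bundles; skip if it is missing
out_file <- NULL
if (pandoc_available()) {
  # Swap in "word_document" or "pdf_document" for other formats
  # (PDF output also requires a LaTeX installation)
  out_file <- render(rmd_file, output_format = "html_document", quiet = TRUE)
}

out_file
```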
---
title: "RMarkdown Example"
author: "James L. Adams"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Here's a first-level header
## And a second-level
### And a third-level
This could be some narrative text for framing my code and analysis. We'll begin with loading packages and data.
```{r message=FALSE}
library(tidyverse)
df <- read_csv("./data/original/gapminder.csv")
```
We included "message=FALSE" in the code chunk options to avoid some console printouts that interrupt the flow of the document.
Next, we'll print the data in a nice-looking way using the `kable` function from `knitr`, which was downloaded automatically as part of `rmarkdown`. We'll use `head(df)` to only print the first few rows rather than the entire data set.
```{r}
knitr::kable(head(df))
```
Next, we'll embed a plot we can make with the data
```{r}
ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, color = continent)) +
  scale_x_log10() +
  facet_wrap(~year) +
  labs(
    title = "Life Expectancy and GDP Per Capita",
    subtitle = "1952 - 2007",
    x = "GDP Per Capita (USD)",
    y = "Life Expectancy",
    color = "Continent",
    size = "Population",
    caption = "Data from gapminder.com"
  ) +
  theme_bw()
```
We'll include a link to the data source, as well. You can find the original data from [Gapminder](https://www.gapminder.org/) or the processed version we're using from Jenny Bryan's
[gapminder package](https://github.com/jennybc/gapminder).
There are ways to use RMarkdown to produce specialized documents, such as a dashboard. One popular dashboard package is flexdashboard.
Here is a basic example of using flexdashboard with our gapminder data:
---
title: "flexdashboard example"
output:
  flexdashboard::flex_dashboard:
    orientation: columns
    vertical_layout: fill
---
```{r setup, include=FALSE}
library(flexdashboard)
library(tidyverse)
library(plotly)
df <- read_csv("./data/original/gapminder.csv")
```
Column {data-width=650}
-----------------------------------------------------------------------
### Chart A
```{r}
# Using the plotly package and ggplotly() to add basic interactivity.
# Storing the plot first means only the interactive version is displayed.
p <- ggplot(df, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent)) +
  scale_x_log10()
ggplotly(p)
```
Column {data-width=350}
-----------------------------------------------------------------------
### Chart B
```{r}
ggplot(df, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  theme(
    legend.position = "none"
  )
```
### Chart C
```{r}
# .groups = "drop" returns an ungrouped result and silences the grouping message
tmp <- df %>%
  group_by(continent, year) %>%
  summarize(lifeExp = mean(lifeExp), .groups = "drop")
ggplot(tmp, aes(x = year, y = lifeExp, color = continent)) +
  geom_line() +
  geom_point()
```