This is just a taster of some of the capabilities of R. It is not intended to cover all of the fundamentals of R; that’s for other sessions. Instead, we’ll run through the process of importing, tidying, transforming, visualising and reporting data using some of the packages developed by Hadley Wickham and others. These packages help lower the barrier to entry for newbies and will hopefully inspire you to code regularly in R.

Learning outcomes

By the end of the session participants will:

  • Gain a working knowledge of RStudio software
  • Become familiar with the process of importing, tidying, transforming, querying, visualising and reporting data
  • Create and publish a data visualisation

Introduction

What is R?

R is an open source programming language for statistical analysis and data visualisation. It was developed by Ross Ihaka and Robert Gentleman of the University of Auckland and released in 1995 (see this NYT piece from 2009). R is widely used in academia and becoming increasingly important in business and government.

Why use R?

  • It’s the leading tool for statistical analysis, forecasting and machine learning
  • Cutting edge analytics: over 8,000 user-contributed packages available on finance, genomics, animal tracking, crime analysis, and much more
  • Powerful graphics and data visualisations: used by the New York Times and FiveThirtyEight
  • Open source: no vendor lock-in
  • Reproducibility: code can be shared and the results repeated
  • Transparency: explicitly documents all the steps of your analyses
  • Automation: analyses can be run and re-run with new and existing datasets
  • Support network: worldwide community of developers and users

Are there any disadvantages?

R has a steep learning curve, and the transition from a graphical user interface like Excel or SPSS to one that is command driven can be unsettling. However, you’ll soon find that working with a command line is much more efficient than pointing and clicking. After all, you can replicate, automate and share your R scripts.

Setting up

Installing R and RStudio

Download and install R and RStudio. RStudio is an integrated development environment for R which includes syntax highlighting, code completion, debugging tools and an in-built browser. RStudio makes R a much more user-friendly experience.

Packages

Packages are collections of R functions and data. There are over 8,000 user-contributed packages available to install from CRAN. Just type install.packages() in the console with the name of the package in inverted commas. When you install a package in R for the first time you will be asked to select a CRAN mirror. A mirror is a distribution site for R source code, manuals, and contributed packages. Just pick the mirror that is closest to you.

The code below will install the dplyr package which is a set of tools for manipulating dataframes. It will also install all its dependent packages.

install.packages("dplyr", dependencies = TRUE)

Once installed the package can be loaded into your R session with the library() function.

library(dplyr)

A helpful list of R packages hand-picked by RStudio is available at this link: https://github.com/rstudio/RStartHere
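
To avoid reinstalling packages that are already present, a script can check for them first. The snippet below is a minimal sketch of that pattern using base R's requireNamespace(). The `needed` vector is just an illustrative selection of packages, and the install line is left commented out so that nothing is downloaded until you opt in.

```r
# Packages used in this tutorial (an illustrative selection)
needed <- c("dplyr", "tidyr", "readr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

# Uncomment to install only what is missing
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
missing_pkgs
```

If `missing_pkgs` is empty, everything in `needed` is already installed.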

Useful tips

  • If you need information on a function just type ? or help(), e.g. ?mean or help(mean).
  • If you have a more complicated question try trawling through the answers on stackoverflow.com with [r] in the search field.
  • The # symbol can be used to add comments to your code. R will ignore everything after the # symbol.
  • Many packages on CRAN come with one or more vignettes, which can be listed by typing browseVignettes(package = "") with the name of the package entered in inverted commas.
  • There are a couple of useful style guides from Google and Hadley Wickham.

Importing data

R can handle a range of data formats: .xlsx, .csv, .txt, .sav, .shp etc. Some data formats require specific packages.

To import a .csv file you can use the function read_csv() from the readr package.

library(readr)
df <- read_csv("world_prison_population_list_11th_edition_wide.csv")
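
If you don't have the CSV file to hand, the import step can be tried on a small stand-in. The sketch below writes a tiny file with the same wide layout (a few values borrowed from the dataset; the file itself is made up) and reads it back. The col_types argument is optional; it is set here only to make explicit that every column is read as text, since each column mixes codes, names and numbers.

```r
library(readr)

# Write a tiny CSV with the same wide layout as the prison dataset
path <- tempfile(fileext = ".csv")
writeLines(c(
  "Name,Afghanistan,Albania",
  "iso3,AFG,ALB",
  "Prison population total,26519,5455"
), path)

# Read every column as character
toy <- read_csv(path, col_types = cols(.default = col_character()))
dim(toy)
```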

The data that we’ve imported derive from the World Prison Population List which is published by the International Centre for Prison Studies. The most recent report, the 11th edition, uses data from 223 countries and is accurate up to October 2015.

Let’s have a look at the first few columns of data using the select() function in dplyr. The %>% operator ‘chains’ lines of code together and can be read as ‘then’. So, first we call the dataframe ‘df’ and then select the first 3 columns.

df %>% select(1:3)
## Source: local data frame [4 x 3]
## 
##                            Name        Afghanistan         Albania
##                           (chr)              (chr)           (chr)
## 1                          iso3                AFG             ALB
## 2                        Region South Central Asia Southern Europe
## 3 Estimated national population           35600000         2890000
## 4       Prison population total              26519            5455
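
To see what the %>% operator is doing, it can help to compare a piped call with the equivalent nested call. This is a small sketch on a made-up data frame (`toy` and its values are purely illustrative).

```r
library(dplyr)

# A small stand-in data frame
toy <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(TRUE, FALSE, TRUE))

# These two calls do the same thing:
piped  <- toy %>% select(1:2)   # take toy, then select columns 1 and 2
nested <- select(toy, 1:2)      # the same call written without the pipe

identical(piped, nested)
```

The pipe simply passes whatever is on its left as the first argument of the function on its right.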

You’ll notice that the data structure is messy*. Variables are stored in rows and values are column headers. This type of tabular data structure is very common but not particularly helpful for data analysis.

Tidying data

According to Hadley Wickham, tidy data are structured for use in R and satisfy three rules (Grolemund and Wickham 2016):

  • variables are stored upright in columns;
  • observations are lined up in rows, and;
  • values are placed in their own cells.

Assigning variables to columns ensures that values are paired with other values in the same row of observations.

The tidyr package has helpful tools called gather() and spread() which will fix this for us.

The code below gathers all of the columns except the first into pairs of ‘country’ and ‘value’, and then spreads the entries of ‘Name’ back out so that each variable has its own column.

library(tidyr)
df <- df %>%
  gather(country, value, 2:ncol(df)) %>%
  spread(Name, value)
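
In tidyr 1.0.0 and later, gather() and spread() still work but have been superseded by pivot_longer() and pivot_wider(). The sketch below shows the same reshaping with the newer functions on a made-up miniature of the dataset (`messy` is illustrative, with a few values copied from the output above).

```r
library(dplyr)
library(tidyr)

# A miniature version of the wide, messy layout
messy <- data.frame(
  Name        = c("iso3", "Region", "Prison population total"),
  Afghanistan = c("AFG", "South Central Asia", "26519"),
  Albania     = c("ALB", "Southern Europe", "5455"),
  stringsAsFactors = FALSE
)

tidy <- messy %>%
  # stack every column except Name into country/value pairs
  pivot_longer(cols = -Name, names_to = "country", values_to = "value") %>%
  # give each entry of Name its own column
  pivot_wider(names_from = Name, values_from = value)

tidy
```

The result has one row per country and one column per variable.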

The function glimpse() from the dplyr package prints each variable in the dataframe along with its data type and first few values. Our data are now tidy, but the variable names are too long and in the wrong order, ‘region’ needs to be a factor variable, and the population values should be integers rather than characters.

glimpse(df) 
## Observations: 221
## Variables: 5
## $ country                       (chr) "Afghanistan", "Albania", "Alger...
## $ Estimated national population (chr) "35600000", "2890000", "37280000...
## $ iso3                          (chr) "AFG", "ALB", "DZA", "WSM", "AND...
## $ Prison population total       (chr) "26519", "5455", "60220", "214",...
## $ Region                        (chr) "South Central Asia", "Southern ...

We’ll use the select() and mutate() functions from the dplyr package to remedy this. The variables are selected and renamed using the select() function and re-classed using mutate().

df <- df %>% 
  select(country,
         iso3,
         region = Region,
         national_pop = `Estimated national population`,
         prison_pop = `Prison population total`) %>% 
  mutate(region = factor(region),
         national_pop = as.integer(national_pop),
         prison_pop = as.integer(prison_pop))
glimpse(df)
## Observations: 221
## Variables: 5
## $ country      (chr) "Afghanistan", "Albania", "Algeria", "American Sa...
## $ iso3         (chr) "AFG", "ALB", "DZA", "WSM", "AND", "AGO", "AIA", ...
## $ region       (fctr) South Central Asia, Southern Europe, Northern Af...
## $ national_pop (int) 35600000, 2890000, 37280000, 56000, 76250, 214500...
## $ prison_pop   (int) 26519, 5455, 60220, 214, 55, 22826, 46, 343, 6906...
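
Two caveats are worth knowing when converting text to integers. The sketch below uses made-up values (apart from 26519, which appears above) to show that text which is not a clean whole number becomes NA, and that summing integers beyond roughly 2.1 billion also gives NA. The second point is why the answers at the end of this tutorial wrap national_pop in as.numeric() before summing.

```r
# Text that is a clean whole number converts without trouble
clean <- as.integer("26519")

# Text with a thousands separator does not: the result is NA (with a warning)
with_comma <- suppressWarnings(as.integer("26,519"))

# The largest value an integer can hold
.Machine$integer.max

# Summing integers past that limit gives NA (with a warning)...
big <- c(2000000000L, 2000000000L)
int_total <- suppressWarnings(sum(big))

# ...whereas converting to numeric first gives the expected total
num_total <- sum(as.numeric(big))
```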

Transforming data

Creating new variables by transforming data is straightforward. The following code uses the mutate() function in dplyr to create a new variable representing the rate of incarceration per 100,000 people. The resulting values are rounded using the base R round() function.

df <- mutate(df, rate = round((prison_pop / national_pop) * 100000, 0))

Let’s use the slice() function to view the first 5 rows of the ‘rate’ variable.

df %>% slice(1:5) %>% select(rate)
## Source: local data frame [5 x 1]
## 
##    rate
##   (dbl)
## 1    74
## 2   189
## 3   162
## 4   382
## 5    72
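
As a check on the formula, the first two rates can be reproduced by hand from the figures for Afghanistan and Albania shown earlier. This is only a worked example in base R; the vector names are chosen for illustration.

```r
# Figures for Afghanistan and Albania, taken from the output shown earlier
prison_pop   <- c(26519, 5455)
national_pop <- c(35600000, 2890000)

# Prisoners per 100,000 people, rounded to the nearest whole number
rate <- round(prison_pop / national_pop * 100000, 0)
rate
```

The unrounded values are roughly 74.49 and 188.75, which round to the 74 and 189 shown in the first two rows above.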

Querying data

Arranging values is possible using the arrange() function in dplyr. The code below sorts the ‘rate’ values in ascending order and then prints the first 5 rows.

arrange(df, rate) %>% head(5)
## Source: local data frame [5 x 6]
## 
##             country  iso3          region national_pop prison_pop  rate
##               (chr) (chr)          (fctr)        (int)      (int) (dbl)
## 1     Guinea Bissau   GNB  Western Africa      1725000         92     5
## 2        San Marino   SMR Southern Europe        33700          2     6
## 3     Liechtenstein   LIE  Western Europe        37370          8    21
## 4 Faeroes (Denmark)   FRO Northern Europe        48480         11    23
## 5  Guinea (Rep. of)   GUY  Western Africa     12120000       3110    26

In descending order:

arrange(df, desc(rate)) %>% head(5)
## Source: local data frame [5 x 6]
## 
##            country  iso3         region national_pop prison_pop  rate
##              (chr) (chr)         (fctr)        (int)      (int) (dbl)
## 1       Seychelles   SYC Eastern Africa        92000        735   799
## 2           U.S.A.   USA  North America    317760000    2217000   698
## 3 St Kitts & Nevis   KNA      Caribbean        55000        334   607
## 4     Turkmenistan   TKM   Central Asia      5240000      30568   583
## 5 Virgin Is. (USA)   VIR      Caribbean       106700        577   541

The data can also be sorted by ‘region’ (ascending) and then by ‘rate’ (descending) within each region:

arrange(df, region, desc(rate)) %>% head(5)
## Source: local data frame [5 x 6]
## 
##            country  iso3    region national_pop prison_pop  rate
##              (chr) (chr)    (fctr)        (int)      (int) (dbl)
## 1 St Kitts & Nevis   KNA Caribbean        55000        334   607
## 2 Virgin Is. (USA)   VIR Caribbean       106700        577   541
## 3             Cuba   CUB Caribbean     11250000      57337   510
## 4  Virgin Is. (UK)   VGB Caribbean        28000        119   425
## 5          Grenada   GRD Caribbean       106500        424   398
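
Rows can also be picked out by condition with dplyr's filter() function, which is used in the answers at the end of this tutorial. Here is a brief sketch on a made-up data frame (`toy` holds a few rows copied from the outputs above).

```r
library(dplyr)

# A few rows copied from the outputs above
toy <- data.frame(
  country = c("Seychelles", "U.S.A.", "St Kitts & Nevis", "Cuba"),
  region  = c("Eastern Africa", "North America", "Caribbean", "Caribbean"),
  rate    = c(799, 698, 607, 510),
  stringsAsFactors = FALSE
)

# Keep rows that match a single condition
caribbean <- filter(toy, region == "Caribbean")

# Conditions separated by commas must all be true
high_caribbean <- filter(toy, region == "Caribbean", rate > 600)

high_caribbean$country
```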

The group_by() function allows you to run operations on groups of data. The following code groups the data by ‘region’, calculates the total prison population, and then sorts the results in descending order.

df %>% 
  group_by(region) %>% 
  summarise(total = sum(prison_pop)) %>% 
  arrange(desc(total))
## Source: local data frame [20 x 2]
## 
##                        region   total
##                        (fctr)   (int)
## 1               North America 2255210
## 2                Eastern Asia 1850147
## 3               South America 1036812
## 4          South Eastern Asia  881634
## 5          South Central Asia  859473
## 6                 Europe/Asia  851674
## 7              Eastern Africa  395974
## 8             Central America  368027
## 9  Central and Eastern Europe  266752
## 10            Northern Africa  247194
## 11            Southern Europe  172817
## 12            Southern Africa  172316
## 13               Western Asia  171696
## 14             Western Europe  163674
## 15               Central Asia  134847
## 16             Western Africa  133753
## 17                  Caribbean  120479
## 18             Central Africa   89498
## 19            Northern Europe   80381
## 20                    Oceania   54726
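
summarise() can compute several statistics per group in one call. The sketch below, on a made-up subset of four countries (values copied from the outputs above), counts the countries in each region with n() and adds a median rate alongside the total.

```r
library(dplyr)

# Four countries copied from the outputs above
toy <- data.frame(
  country    = c("St Kitts & Nevis", "Cuba", "Grenada", "Seychelles"),
  region     = c("Caribbean", "Caribbean", "Caribbean", "Eastern Africa"),
  prison_pop = c(334, 57337, 424, 735),
  rate       = c(607, 510, 398, 799),
  stringsAsFactors = FALSE
)

by_region <- toy %>%
  group_by(region) %>%
  summarise(countries   = n(),               # number of rows in each group
            total       = sum(prison_pop),   # total prison population
            median_rate = median(rate)) %>%  # middle value of the rates
  arrange(desc(total))

by_region
```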

Exercises

Try and find the answers to these questions by using the data wrangling tools provided by dplyr:

  1. How many people are held in penal institutions worldwide?

  2. What is the prison population of the U.S.A.?

  3. Which country has the highest prison population rate?

  4. Which country has the second highest prison population rate?

  5. Which country has the lowest prison population rate?

  6. What is the world prison population rate?

  7. What is the median incarceration rate for Oceania?

  8. If the U.S.A. has 4% of the world’s population what percent of the world’s prison population does it have?

Visualising data

There are several packages that will allow you to visualise data both statically and interactively. Two of the most popular packages are ggplot2 for static outputs and highcharter which is an R wrapper for the Highcharts JavaScript library.

Static plots

First, we’ll attempt to create a static plot in ggplot2 by subsetting the top 10 countries by incarceration rate.

temp <- df %>% 
  arrange(desc(rate)) %>% 
  slice(1:10) 

Then we load the ggplot2 package and build up our plot.

library(ggplot2)
library(ggthemes)  # provides theme_tufte()

ggplot(temp, aes(reorder(country, rate), rate)) +
  theme_tufte(base_size = 14, ticks = FALSE) +
  geom_bar(width = 0.25, fill = "gray", stat = "identity") +
  theme(axis.title = element_blank()) +
  scale_y_continuous(breaks = seq(100, 800, 100)) +
  # white lines drawn over the bars stand in for gridlines
  geom_hline(yintercept = seq(100, 800, 100), col = "white", lwd = 1) +
  labs(x = "", y = "\nRate per 100,000 population") +
  coord_flip() +  # horizontal bars
  ggtitle("Top 10 countries by incarceration rate")

You’ll notice that we also loaded the ggthemes package. The theme_tufte() function allowed us to style the plot in a manner similar to those adopted in Edward Tufte’s graphics.

You can save the plot with the ggsave() function.

ggsave("plot.png", scale = 1, dpi = 300)
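
By default ggsave() saves the last plot that was displayed. A slightly more explicit pattern is to store the plot in an object and pass it in, along with the dimensions. The sketch below uses a made-up data frame (`toy`, holding the top five rates shown earlier) and only ggplot2, and writes to a temporary folder. The file name and sizes are arbitrary choices.

```r
library(ggplot2)

# Top five rates, copied from the output shown earlier
toy <- data.frame(
  country = c("Seychelles", "U.S.A.", "St Kitts & Nevis",
              "Turkmenistan", "Virgin Is. (USA)"),
  rate    = c(799, 698, 607, 583, 541)
)

# geom_col() is shorthand for geom_bar(stat = "identity")
p <- ggplot(toy, aes(reorder(country, rate), rate)) +
  geom_col(width = 0.25, fill = "gray") +
  coord_flip() +
  labs(x = NULL, y = "Rate per 100,000 population")

# Pass the plot explicitly and set the size in inches;
# the file extension decides the output format
out <- file.path(tempdir(), "toy_plot.pdf")
ggsave(out, plot = p, width = 7, height = 4)
```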


Interactive plots

The next plot is interactive and uses the highcharter package. The theme adopts the in-house style of fivethirtyeight.com.

library(highcharter)

hc <- highchart(height = 400, width = 700) %>%
  hc_title(text = "Top 10 countries by incarceration rate") %>% 
  hc_subtitle(text = "Source: International Centre for Prison Studies") %>% 
  hc_xAxis(categories = temp$country) %>% 
  hc_add_series(name = "Incarceration rate", data = temp$rate, 
                type = 'bar', color = "#f16913") %>% 
  hc_yAxis(title = list(text = "Rate per 100,000 population")) %>% 
  hc_legend(enabled = FALSE) %>% 
  hc_exporting(enabled = TRUE)
hc %>% hc_add_theme(hc_theme_538())


The final plot also uses the highcharter package to create an interactive map.

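library(highcharter)
library(RColorBrewer)

data(worldgeojson)

# five colour stops, evenly spaced between 0 and 1
# (tibble() and list_parse2() replace the deprecated data_frame() and list.parse2())
n <- 4
dclass <- tibble(to = 0:n/n, color = brewer.pal(5, "PuRd"))
dclass <- list_parse2(dclass)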

highchart() %>% 
  hc_title(text = "Incarceration rates") %>% 
  hc_subtitle(text = "Source:  International Centre for Prison Studies") %>%
  hc_add_series_map(worldgeojson, df, name = "Rate per 100,000 pop.", 
                    value = "rate", joinBy = "iso3") %>% 
  hc_colorAxis(stops = dclass) %>% 
  hc_legend(enabled = TRUE) %>% 
  hc_mapNavigation(enabled = TRUE)


Communicating data

There are a number of ways to share the results of your data analysis. For example, the Shiny package allows you to create interactive visualisations in a web browser. There are examples of great Shiny apps in RStudio’s showcase of users’ apps.

R Markdown is another package that helps you to author dynamic documents, presentations, and reports within R. There are a range of possible output formats including MS Word, PDF, and HTML.

To publish one of your visualisations online on rpubs.com just install and load the knitr package, create a new R Markdown document, insert the code into a chunk, click Knit to HTML, and press the Publish button in the preview window.

Useful references



* The messiness is a result of my scraping the data from a PDF file.

Answers
  1. How many people are held in penal institutions worldwide?
summarise(df, total = sum(prison_pop))
## Source: local data frame [1 x 1]
## 
##      total
##      (int)
## 1 10307084
  2. What is the prison population of the U.S.A.?
filter(df, country == "U.S.A.") %>% select(prison_pop)
## Source: local data frame [1 x 1]
## 
##   prison_pop
##        (int)
## 1    2217000
  3. Which country has the highest prison population rate?
arrange(df, desc(rate)) %>% slice(1)
## Source: local data frame [1 x 6]
## 
##      country  iso3         region national_pop prison_pop  rate
##        (chr) (chr)         (fctr)        (int)      (int) (dbl)
## 1 Seychelles   SYC Eastern Africa        92000        735   799
  4. Which country has the second highest prison population rate?
arrange(df, desc(rate)) %>% slice(2)
## Source: local data frame [1 x 6]
## 
##   country  iso3        region national_pop prison_pop  rate
##     (chr) (chr)        (fctr)        (int)      (int) (dbl)
## 1  U.S.A.   USA North America    317760000    2217000   698
  5. Which country has the lowest prison population rate?
arrange(df, rate) %>% slice(1)
## Source: local data frame [1 x 6]
## 
##         country  iso3         region national_pop prison_pop  rate
##           (chr) (chr)         (fctr)        (int)      (int) (dbl)
## 1 Guinea Bissau   GNB Western Africa      1725000         92     5
  6. What is the world prison population rate?
# as.numeric() avoids integer overflow when summing the national populations
summarise(df, total_rate = round(sum(prison_pop) / sum(as.numeric(national_pop)) * 100000, 0))
## Source: local data frame [1 x 1]
## 
##   total_rate
##        (dbl)
## 1        144
  7. What is the median incarceration rate for Oceania?
filter(df, region == "Oceania") %>% 
  summarise(median = median(rate))
## Source: local data frame [1 x 1]
## 
##   median
##    (dbl)
## 1    155
  8. If the U.S.A. has 4% of the world’s population what percent of the world’s prison population does it have?
df %>%
  mutate(total_national_pop = sum(as.numeric(national_pop)),
         total_prison_pop = sum(prison_pop)) %>%
  filter(country == "U.S.A.") %>%
  summarise(usa_national_pop_percent = round(national_pop/total_national_pop*100, 1),
    usa_prison_pop_percent = round(prison_pop/total_prison_pop*100, 1))
## Source: local data frame [1 x 2]
## 
##   usa_national_pop_percent usa_prison_pop_percent
##                      (dbl)                  (dbl)
## 1                      4.4                   21.5