This is just a taster of some of the capabilities of R. It is not intended to cover all of the fundamentals of R; that’s for other sessions. Instead, we’ll run through the process of importing, tidying, transforming, visualising and reporting data using some of the packages developed by Hadley Wickham and others. These packages help lower the barrier to entry for newbies and will hopefully inspire you to code regularly in R.
By the end of the session participants will:
R is an open source programming language for statistical analysis and data visualisation. It was developed by Ross Ihaka and Robert Gentleman of the University of Auckland and released in 1995 (see this NYT piece from 2009). R is widely used in academia and becoming increasingly important in business and government.
R has a steep learning curve, and the transition from a graphical user interface like Excel or SPSS to one that is command driven can be unsettling. However, you’ll soon find that working with a command line is much more efficient than pointing and clicking. After all, you can replicate, automate and share your R scripts.
Download and install R and RStudio. RStudio is an integrated development environment for R which includes syntax highlighting, code completion, debugging tools and an in-built browser. RStudio makes R a much more user-friendly experience.
Packages are collections of R functions and data. There are over 8,000 user-contributed packages available to install from CRAN. Just type install.packages() in the console with the name of the package in inverted commas. When you install a package in R for the first time you will be asked to select a CRAN mirror. A mirror is a distribution site for R source code, manuals, and contributed packages. Just pick the mirror that is closest to you.
The code below will install the dplyr package, which is a set of tools for manipulating dataframes. It will also install the packages that dplyr depends on.
install.packages("dplyr", dependencies = TRUE)
Once installed, the package can be loaded into your R session with the library() function.
library(dplyr)
A helpful list of R packages hand-picked by RStudio is available at this link: https://github.com/rstudio/RStartHere
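If you re-run a script often, you may prefer to install only the packages that are missing. The following is a minimal sketch in base R; the helper name missing_packages() is hypothetical (it is not part of R or any package), and the install step is left commented out so that nothing is downloaded unless you choose to.

```r
# Hypothetical helper: return the names in `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c("dplyr", "tidyr", "readr"))
print(to_install)

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

requireNamespace() only checks whether a package can be loaded; it does not attach it, so you still need library() afterwards.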
To get help on a function use ? or help(), e.g. ?mean or help(mean).
The # symbol can be used to add comments to your code. R will ignore everything after the # symbol on that line.
To browse the vignettes for a package use browseVignettes(package = "") with the name of the package entered in inverted commas.
R can handle a range of data formats: .xlsx, .csv, .txt, .sav, .shp etc. Some data formats require specific packages.
To import a .csv file you can use the function read_csv() from the readr package.
library(readr)
df <- read_csv("world_prison_population_list_11th_edition_wide.csv")
The data that we’ve imported derive from the World Prison Population List, which is published by the International Centre for Prison Studies. The most recent report, the 11th edition, uses data from 223 countries and is accurate up to October 2015.
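If you do not have the file to hand, the import step can be tried on a small stand-in. This sketch writes a hypothetical three-column extract (the object names csv_lines, tmp and sample_df are made up for illustration) to a temporary file and reads it back with read_csv(), forcing every column to character as in the imported data.

```r
library(readr)

# Hypothetical extract in the same wide layout as the real file.
csv_lines <- c(
  "Name,Afghanistan,Albania",
  "iso3,AFG,ALB",
  "Prison population total,26519,5455"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Read every column as character rather than letting readr guess types.
sample_df <- read_csv(tmp, col_types = cols(.default = col_character()))
print(sample_df)
```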
Let’s have a look at the first few columns of data using the select() function in dplyr. The %>% operator ‘chains’ lines of code together and can be read as ‘then’. So, first we call the dataframe ‘df’ and then select the first 3 columns.
df %>% select(1:3)
## Source: local data frame [4 x 3]
##
## Name Afghanistan Albania
## (chr) (chr) (chr)
## 1 iso3 AFG ALB
## 2 Region South Central Asia Southern Europe
## 3 Estimated national population 35600000 2890000
## 4 Prison population total 26519 5455
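To see what the pipe is doing, it can help to compare a piped call with the equivalent call written without it. The sketch below uses a made-up data frame called toy; it is only an illustration of how %>% passes its left-hand side as the first argument of the next function.

```r
library(dplyr)

# Hypothetical toy data frame, used only to illustrate the pipe.
toy <- data.frame(a = 1:3, b = 4:6, c = 7:9)

piped  <- toy %>% select(1:2)   # "take toy, then select columns 1 to 2"
nested <- select(toy, 1:2)      # the same call written without the pipe

print(identical(piped, nested))
```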
You’ll notice that the data structure is messy*. Variables are stored in rows and values are column headers. This type of tabular data structure is very common but not particularly helpful for data analysis.
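According to Hadley Wickham, tidy data are structured for use in R and satisfy three rules (Grolemund and Wickham 2016):
Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.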
Assigning variables to columns ensures that values are paired with other values in the same row of observations.
The tidyr package has helpful tools called gather() and spread() which will fix this for us.
The code below gathers all of the columns except the first and then spreads the gathered columns.
library(tidyr)
df <- df %>%
  gather(country, value, 2:ncol(df)) %>%
  spread(Name, value)
The function glimpse() from the dplyr package prints out the variables from the dataframe and the first few values of each. Additional information on data types is also provided. Our data are now structured tidily, but the variable names are too long and in the wrong order, ‘region’ needs to be a factor variable, and the population values should be integers, not characters.
glimpse(df)
## Observations: 221
## Variables: 5
## $ country (chr) "Afghanistan", "Albania", "Alger...
## $ Estimated national population (chr) "35600000", "2890000", "37280000...
## $ iso3 (chr) "AFG", "ALB", "DZA", "WSM", "AND...
## $ Prison population total (chr) "26519", "5455", "60220", "214",...
## $ Region (chr) "South Central Asia", "Southern ...
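The gather-then-spread step is easier to follow on a tiny example where the intermediate result can be printed. This sketch uses a hypothetical two-country extract (the object names wide, long and tidy are made up) with values taken from the output shown earlier.

```r
library(dplyr)
library(tidyr)

# Hypothetical extract: variables in rows, countries as column headers.
wide <- data.frame(
  Name = c("iso3", "Prison population total"),
  Afghanistan = c("AFG", "26519"),
  Albania = c("ALB", "5455"),
  stringsAsFactors = FALSE
)

# Step 1: stack the country columns into country/value pairs.
long <- wide %>% gather(country, value, 2:ncol(wide))
print(long)

# Step 2: turn each entry of Name into its own column.
tidy <- long %>% spread(Name, value)
print(tidy)
```

In more recent versions of tidyr, pivot_longer() and pivot_wider() are the recommended replacements for gather() and spread(), although the older functions still work.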
We’ll use the select() and mutate() functions from the dplyr package to remedy this. The variables are selected and renamed using the select() function and re-classed using mutate().
df <- df %>%
  select(country,
         iso3,
         region = Region,
         national_pop = `Estimated national population`,
         prison_pop = `Prison population total`) %>%
  mutate(region = factor(region),
         national_pop = as.integer(national_pop),
         prison_pop = as.integer(prison_pop))
glimpse(df)
## Observations: 221
## Variables: 5
## $ country (chr) "Afghanistan", "Albania", "Algeria", "American Sa...
## $ iso3 (chr) "AFG", "ALB", "DZA", "WSM", "AND", "AGO", "AIA", ...
## $ region (fctr) South Central Asia, Southern Europe, Northern Af...
## $ national_pop (int) 35600000, 2890000, 37280000, 56000, 76250, 214500...
## $ prison_pop (int) 26519, 5455, 60220, 214, 55, 22826, 46, 343, 6906...
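One thing to watch when re-classing character columns: as.integer() returns NA (with a warning) for any text it cannot parse. The sketch below shows this with a made-up vector; the "n/a" entry is a hypothetical placeholder and is not taken from the dataset.

```r
# "n/a" is a hypothetical placeholder, not a value from the dataset.
raw <- c("26519", "5455", "n/a")

# as.integer() warns and returns NA for text it cannot parse.
converted <- suppressWarnings(as.integer(raw))
print(converted)

n_missing <- sum(is.na(converted))       # how many values failed to convert
total <- sum(converted, na.rm = TRUE)    # total of the values that did
```

Counting the NA values after a conversion is a quick way to check whether anything was lost.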
Creating new variables by transforming data is straightforward. The following code uses the mutate() function in dplyr to create a new variable representing the rate of incarceration per 100,000 people. The resulting values are rounded using the base R round() function.
df <- mutate(df, rate = round((prison_pop / national_pop) * 100000, 0))
Let’s use the slice() function to view the first 5 rows of the ‘rate’ variable.
df %>% slice(1:5) %>% select(rate)
## Source: local data frame [5 x 1]
##
## rate
## (dbl)
## 1 74
## 2 189
## 3 162
## 4 382
## 5 72
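As a check on the arithmetic, the rate calculation can be reproduced for the first two countries by hand. This sketch rebuilds a two-row data frame (sample_df is a made-up name) from the figures shown earlier and applies the same mutate() call.

```r
library(dplyr)

# Figures for Afghanistan and Albania copied from the output above.
sample_df <- data.frame(
  country = c("Afghanistan", "Albania"),
  national_pop = c(35600000L, 2890000L),
  prison_pop = c(26519L, 5455L)
)

# Prisoners per 100,000 inhabitants, rounded to a whole number.
sample_df <- mutate(sample_df,
                    rate = round((prison_pop / national_pop) * 100000, 0))
print(sample_df)
```

For Afghanistan this is 26519 / 35600000 * 100000, roughly 74.5, which rounds to 74.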
Arranging values is possible using the arrange() function in dplyr. The code below sorts the ‘rate’ values in ascending order and then prints the first 5 rows.
arrange(df, rate) %>% head(5)
## Source: local data frame [5 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 Guinea Bissau GNB Western Africa 1725000 92 5
## 2 San Marino SMR Southern Europe 33700 2 6
## 3 Liechtenstein LIE Western Europe 37370 8 21
## 4 Faeroes (Denmark) FRO Northern Europe 48480 11 23
## 5 Guinea (Rep. of) GUY Western Africa 12120000 3110 26
In descending order:
arrange(df, desc(rate)) %>% head(5)
## Source: local data frame [5 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 Seychelles SYC Eastern Africa 92000 735 799
## 2 U.S.A. USA North America 317760000 2217000 698
## 3 St Kitts & Nevis KNA Caribbean 55000 334 607
## 4 Turkmenistan TKM Central Asia 5240000 30568 583
## 5 Virgin Is. (USA) VIR Caribbean 106700 577 541
Sorted by ‘region’ (ascending) and then by ‘rate’ (descending):
arrange(df, region, desc(rate)) %>% head(5)
## Source: local data frame [5 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 St Kitts & Nevis KNA Caribbean 55000 334 607
## 2 Virgin Is. (USA) VIR Caribbean 106700 577 541
## 3 Cuba CUB Caribbean 11250000 57337 510
## 4 Virgin Is. (UK) VGB Caribbean 28000 119 425
## 5 Grenada GRD Caribbean 106500 424 398
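Sorting on two keys can be checked on a handful of rows. The sketch below uses four countries and rates taken from the outputs above (sample_df and by_region are made-up names): rows are ordered by region first, and ties within a region are broken by the highest rate.

```r
library(dplyr)

# Four rows copied from the sorted outputs above.
sample_df <- data.frame(
  country = c("Cuba", "Seychelles", "St Kitts & Nevis", "U.S.A."),
  region  = c("Caribbean", "Eastern Africa", "Caribbean", "North America"),
  rate    = c(510, 799, 607, 698),
  stringsAsFactors = FALSE
)

# Region ascending, then rate descending within each region.
by_region <- arrange(sample_df, region, desc(rate))
print(by_region)
```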
The group_by() function allows you to run operations on groups of data. The following code groups the data by ‘region’, calculates the total prison population, and then sorts the results in descending order.
df %>%
  group_by(region) %>%
  summarise(total = sum(prison_pop)) %>%
  arrange(desc(total))
## Source: local data frame [20 x 2]
##
## region total
## (fctr) (int)
## 1 North America 2255210
## 2 Eastern Asia 1850147
## 3 South America 1036812
## 4 South Eastern Asia 881634
## 5 South Central Asia 859473
## 6 Europe/Asia 851674
## 7 Eastern Africa 395974
## 8 Central America 368027
## 9 Central and Eastern Europe 266752
## 10 Northern Africa 247194
## 11 Southern Europe 172817
## 12 Southern Africa 172316
## 13 Western Asia 171696
## 14 Western Europe 163674
## 15 Central Asia 134847
## 16 Western Africa 133753
## 17 Caribbean 120479
## 18 Central Africa 89498
## 19 Northern Europe 80381
## 20 Oceania 54726
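summarise() can return more than one value per group. This sketch groups four rows taken from the outputs above (sample_df and by_region are made-up names) and reports both the number of countries, using dplyr's n(), and the total prison population in each region.

```r
library(dplyr)

# Four rows copied from the outputs above.
sample_df <- data.frame(
  country    = c("St Kitts & Nevis", "Virgin Is. (USA)", "Cuba", "U.S.A."),
  region     = c("Caribbean", "Caribbean", "Caribbean", "North America"),
  prison_pop = c(334L, 577L, 57337L, 2217000L),
  stringsAsFactors = FALSE
)

by_region <- sample_df %>%
  group_by(region) %>%
  summarise(countries = n(),              # rows in each group
            total = sum(prison_pop)) %>%  # prison population per group
  arrange(desc(total))
print(by_region)
```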
Try to find the answers to these questions using the data wrangling tools provided by dplyr:
How many people are held in penal institutions worldwide?
What is the prison population of the U.S.A.?
Which country has the highest prison population rate?
Which country has the second highest prison population rate?
Which country has the lowest prison population rate?
What is the world prison population rate?
What is the median incarceration rate for Oceania?
If the U.S.A. has 4% of the world’s population, what percentage of the world’s prison population does it have?
There are several packages that will allow you to visualise data both statically and interactively. Two of the most popular packages are ggplot2 for static outputs and highcharter, which is an R wrapper for the Highcharts JavaScript library.
First, we’ll attempt to create a static plot in ggplot2 by subsetting the top 10 countries by incarceration rate.
temp <- df %>%
  arrange(desc(rate)) %>%
  slice(1:10)
Then we load the ggplot2 library and build up our plot.
library(ggplot2) ; library(ggthemes)
ggplot(temp, aes(reorder(country, rate), rate)) +
  theme_tufte(base_size = 14, ticks = FALSE) +
  geom_bar(width = 0.25, fill = "gray", stat = "identity") +
  theme(axis.title = element_blank()) +
  scale_y_continuous(breaks = seq(100, 800, 100)) +
  geom_hline(yintercept = seq(100, 800, 100), col = "white", lwd = 1) +
  labs(x = "", y = "\nRate per 100,000 population") +
  coord_flip() +
  ggtitle("Top 10 countries by incarceration rate")
You’ll notice that we also loaded the ggthemes package. The theme_tufte() function allowed us to style the plot in a manner similar to Edward Tufte’s graphics.
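If the layered call above feels like a lot at once, here is a stripped-down sketch of the same bar chart that needs only ggplot2. It uses geom_col(), which draws bars at the heights given in the data, in place of geom_bar(stat = "identity"). The data frame top5 and the object p are made-up names, with rates taken from the sorted output earlier.

```r
library(ggplot2)

# Top five rates copied from the sorted output earlier in the session.
top5 <- data.frame(
  country = c("Seychelles", "U.S.A.", "St Kitts & Nevis",
              "Turkmenistan", "Virgin Is. (USA)"),
  rate = c(799, 698, 607, 583, 541)
)

# reorder() sorts the bars by rate; coord_flip() makes them horizontal.
p <- ggplot(top5, aes(x = reorder(country, rate), y = rate)) +
  geom_col(fill = "gray") +
  coord_flip() +
  labs(x = NULL, y = "Rate per 100,000 population")
```

Storing the plot in an object also lets you say exactly which plot to save, for example by passing plot = p to ggsave().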
You can save the plot with the ggsave() function.
ggsave("plot.png", scale = 1, dpi = 300)
The next plot is interactive and uses the highcharter package. The theme adopts the in-house style of fivethirtyeight.com.
library(highcharter)
hc <- highchart(height = 400, width = 700) %>%
  hc_title(text = "Top 10 countries by incarceration rate") %>%
  hc_subtitle(text = "Source: International Centre for Prison Studies") %>%
  hc_xAxis(categories = temp$country) %>%
  hc_add_series(name = "Incarceration rate", data = temp$rate,
                type = 'bar', color = "#f16913") %>%
  hc_yAxis(title = list(text = "Rate per 100,000 population")) %>%
  hc_legend(enabled = FALSE) %>%
  hc_exporting(enabled = TRUE)
hc %>% hc_add_theme(hc_theme_538())
The final plot also uses the highcharter package to create an interactive map.
library(highcharter) ; library(RColorBrewer)
data(worldgeojson)
n <- 4
dclass <- data_frame(to = 0:n/n, color = brewer.pal(5, "PuRd"))
dclass <- list.parse2(dclass)
highchart() %>%
  hc_title(text = "Incarceration rates") %>%
  hc_subtitle(text = "Source: International Centre for Prison Studies") %>%
  hc_add_series_map(worldgeojson, df, name = "Rate per 100,000 pop.",
                    value = "rate", joinBy = "iso3") %>%
  hc_colorAxis(stops = dclass) %>%
  hc_legend(enabled = TRUE) %>%
  hc_mapNavigation(enabled = TRUE)
There are a number of ways to share the results of your data analysis. For example, the Shiny package allows you to create interactive visualisations in a web browser. There are examples of great Shiny apps in RStudio’s showcase of users’ apps.
R Markdown is another package that helps you to author dynamic documents, presentations, and reports within R. There are a range of possible output formats including MS Word, PDF, and HTML.
To publish one of your visualisations online on rpubs.com, just install and load the knitr package, create a new R Markdown document, insert the code into a chunk, click Knit to HTML, and press the Publish button in the preview window.
Garrett Grolemund and Hadley Wickham’s R for Data Science
1: The messiness is a result of my scraping the data from a PDF file.
summarise(df, total = sum(prison_pop))
## Source: local data frame [1 x 1]
##
## total
## (int)
## 1 10307084
filter(df, country == "U.S.A.") %>% select(prison_pop)
## Source: local data frame [1 x 1]
##
## prison_pop
## (int)
## 1 2217000
arrange(df, desc(rate)) %>% slice(1)
## Source: local data frame [1 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 Seychelles SYC Eastern Africa 92000 735 799
arrange(df, desc(rate)) %>% slice(2)
## Source: local data frame [1 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 U.S.A. USA North America 317760000 2217000 698
arrange(df, rate) %>% slice(1)
## Source: local data frame [1 x 6]
##
## country iso3 region national_pop prison_pop rate
## (chr) (chr) (fctr) (int) (int) (dbl)
## 1 Guinea Bissau GNB Western Africa 1725000 92 5
summarise(df, total_rate = sum(prison_pop) / sum(as.numeric(df$national_pop)) * 100000) %>%
  round(0)
## Source: local data frame [1 x 1]
##
## total_rate
## (dbl)
## 1 144
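The answer above converts national_pop with as.numeric() before summing. A likely reason (this is an assumption; the text does not say) is that R integers are 32-bit, and the world’s total population is larger than the biggest integer R can hold, 2,147,483,647. The sketch below shows that limit with plain addition. Whether sum() itself returns NA or quietly switches to a double depends on the R version, so converting first is the safe choice.

```r
# The largest value an R integer can hold.
int_max <- .Machine$integer.max
print(int_max)

# Adding 1 in integer arithmetic overflows: R warns and returns NA.
overflowed <- suppressWarnings(int_max + 1L)
print(overflowed)

# Converting to double first avoids the problem.
safe <- as.numeric(int_max) + 1
print(safe)
```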
filter(df, region == "Oceania") %>%
  summarise(median = median(rate))
## Source: local data frame [1 x 1]
##
## median
## (dbl)
## 1 155
df %>%
  mutate(total_national_pop = sum(as.numeric(national_pop)),
         total_prison_pop = sum(prison_pop)) %>%
  filter(country == "U.S.A.") %>%
  summarise(usa_national_pop_percent = round(national_pop/total_national_pop*100, 1),
            usa_prison_pop_percent = round(prison_pop/total_prison_pop*100, 1))
## Source: local data frame [1 x 2]
##
## usa_national_pop_percent usa_prison_pop_percent
## (dbl) (dbl)
## 1 4.4 21.5
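The prison population share can be checked by hand from two figures reported in the earlier answers: the U.S.A. prison population and the worldwide total. This is plain arithmetic in base R; the variable names are made up for illustration.

```r
# Both figures are copied from the outputs of the earlier answers.
usa_prison   <- 2217000
world_prison <- 10307084

# U.S.A. share of the world prison population, as a percentage.
usa_share <- round(usa_prison / world_prison * 100, 1)
print(usa_share)
```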