This document is a work in progress – give me feedback

1 Introduction

1.1 Before we begin

Do you have:

  • R and RStudio downloaded from Software Center?
  • the folder I sent this morning containing the data?

Good.

Hello.

  • Who am I?
  • Who are you?
  • Why are we here?
  • What can you do with R?
  • What do you want to be able to do?

(Note: these are not existential questions.)

1.2 Who is this guide for?

Follow this guide if:

  • you’ve never heard of R
  • you’ve used SQL but not R
  • you’ve never coded before
  • you want to use R but don’t consider yourself a ‘programmer’

Dip into this guide if you:

  • have used R but never made use of something called RStudio
  • want to learn more about RStudio and how it works
  • have used R but never made use of something called the ‘tidyverse’
  • want to learn the best way to set up an R project to improve reproducibility

1.3 Disclaimers:

  • this guide is a basic introduction and is in no way exhaustive
  • there’s usually more than one way to do things in R – I’ve kept things simple here
  • there are probably errors and spelling mistakes, etc

This document was originally written with a very specific public sector audience in mind and may contain references not relevant to you. See the Further Reading section at the bottom of this document if you want to find some other resources.

2 What’s the problem?

2.1 Workflow

A typical analytical workflow in our department might involve SQL, Excel and Word. Typical steps might be:

  1. Query a database with SQL code using SQL Server Management Studio
  2. QA this code
  3. Copy and paste the output into Excel
  4. Process the data in Excel
  5. Produce outputs (tables, plots, etc) manually in Excel
  6. QA your Excel file(s)
  7. Copy and paste outputs into a Word document
  8. QA the Word document
  9. You notice an error
  10. Debug somehow (go back to step 1?)

There are three main reasons why this isn’t ideal. It’s:

  • got a high chance of producing errors
  • difficult to reproduce your work (what order were the steps in your workflow?)
  • time consuming (many steps, lots of wasted time)

So, let’s discuss what we mean by ‘errors’. This is mostly a problem with spreadsheets and with moving data in and out of them.

In terms of reproducibility, you don’t have a record of the order in which you did things, so it’s not easy to backtrack on mistakes. A lot of documentation and commenting is required within and across multiple files to ensure that the workflow can be replicated. In practice, this often doesn’t happen. If you write reproducible code, it may also be easier to automate it. This in turn can help free up time for other, perhaps less trivial, tasks. For example, the Reproducible Analytical Pipeline (RAP) approach helps reduce error and speed up the process of producing official statistics.

Obviously the process takes time because you have to copy-paste values from place to place and perform quality assurance across all the files in your workflow. But there’s also the time needed to work out how you did the analysis when you’re asked to make changes long after you’ve forgotten how the process works.

2.2 The bottom line

Our analytical work has a direct impact on policy decisions and therefore it affects young people, parents, learners, schools, teachers and many others.

Above all, humans cannot be trusted. Let’s minimise the chance of errors, speed things up and make it easy on our future selves by minimising the chance of doing it wrong in the first place. This means breaking away from spreadsheet addiction.

2.3 R is the answer

What might an optimal analytical workflow look like in R?

  1. Run your code

This is simple. R is end-to-end: you can get data in at one end from files or a database and pump it out the other in a report or app, while also having automated testing built in. All from the same script. You also have the opportunity to more easily version your work using tools such as Git and GitHub.

2.4 But what is R?

R is just another tool for data analysis, in the same way that Excel and SQL are tools for data analysis.

Put simply, R lets you read, wrangle and analyse data and create outputs such as graphics, documents and interactive apps. R is a coding language, which means you use it to write instructions for the computer to perform. This allows for fine control of what you want to do.

You can think of R as a place where data is abstracted away and the instructions are brought to the forefront, whereas spreadsheets are where data is at the forefront and the instructions are abstracted away (I heard this somewhere but can’t remember the source; let me know).

RStudio is simply a very useful interface for R that provides a whole bunch of useful bells and whistles.

What’s great about R? It’s:

  • free
  • available on our work laptops via Software Center
  • open-source and cross-platform (you can download it for Windows, Mac and Linux machines)
  • established and has many high-quality extensions available (‘packages’)
  • supported by a big and active community, both in the department (e.g. Coffee & Coding) and online (e.g. the RStudio Community)
  • got a lot of in-built help files
  • got a wealth of articles and help online (e.g. the R bloggers feed and via StackOverflow)
  • got excellent statistical and graphics capabilities in particular
  • backed by the suite of RStudio tools, which make documentation, teaching and dissemination much easier

I could go on.

2.5 Should I stop using all other tools?

R is not always the answer. I’m not telling you that we must do things in any particular way. For example, if you have an urgent request for the minister due in five minutes and you don’t have the experience to do it in R, Excel may be good enough. That’s absolutely fine. The argument here is that we should move towards a more reproducible model, so that when the minister comes back wanting to tweak your calculation you can be confident that you can remember what you did and how you did it.

3 Project working

Let’s assume you’re starting a new piece of work. Your life will be much easier if you manage the structure of your project from the start, rather than creating a horrible file dump of various data sets, code and documentation that you have no chance of untangling in a few months’ time.

3.1 RStudio Projects

We’re going to start by creating an ‘RStudio Project’ (capital ‘P’).

Why do this? Well, it makes your work more:

  • organised – all the code, data, outputs, etc, are stored in one place (a single project folder)
  • reproducible – your code can be re-run from scratch to produce the same outputs every time
  • transferable – you can pass the entire project folder to someone else and they’ll be able to run it on their own machine; the filepaths you specify in your code assume the home folder is the project folder, so you can write something like data/dataset.csv rather than file/path/on/my/personal/machine/that/you/cannot/access.csv
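
As a rough illustration of the point about filepaths, the sketch below builds a path relative to the project folder. The file name dataset.csv is only the placeholder used in the bullet above, not a real file supplied with this guide.

```r
# Build a path relative to the project folder (the 'home' folder).
# 'dataset.csv' is a placeholder name, not a file supplied with this guide.
relative_path <- file.path("data", "dataset.csv")
print(relative_path)
## [1] "data/dataset.csv"
```

With a project open, running getwd() in the console should show the project folder, which is why a short relative path like this is enough.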

To set up an RStudio Project:

  1. Open RStudio (the icon is a white R inside a blue circle; see top of this document)
  2. File > New Project…
  3. New Directory > New Project
  4. Give your project a meaningful name in the ‘Directory Name’ box
  5. Browse for the filepath where your R Project folder will be placed
  6. Click ‘Create Project’ and RStudio will open your project (note the project name in the top right)

This process creates a directory – a folder on your machine or shared drive that you choose – containing an RStudio Project file with the extension (suffix) ‘.Rproj’. This directory is the ‘home’ of your project and will house all the files and code that you need. Opening the .Rproj file will open your RStudio Project as you last left it, with the scripts you were working on.

To access your R Project in future, navigate to the project folder and double-click your R Project file, which has the .Rproj extension (e.g. your-project.Rproj).

3.2 Directory layout

So, your project directory contains an RStudio Project (.Rproj) file, but let’s now fill it with some basic folders that we’ll need to complete our project. This helps keep things organised and can help prevent mishaps like accidentally deleting raw data.

This way of organising a project is based on something like ‘Designing projects’ by Rich Fitzjohn at Macquarie University.

The basic arrangement would be something like:

The files and folders are:

  • data for raw, untouched, read-only data sets
  • figs for any graphics you produce (could also be maps or something else)
  • output for data files processed from the raw data
  • separate script files (with extension .R) to be executed in the labelled order (more on this in the next section)
  • the .Rproj file
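
If you’d rather create these folders with code than by hand, something like the sketch below should work. It assumes your project is open, so that the folders are created inside the project folder; the object name folders is just for illustration.

```r
# Folder names follow the list above. Run this with your project open so
# that the folders are created inside the project folder.
folders <- c("data", "figs", "output")

for (folder in folders) {
  if (!dir.exists(folder)) {
    dir.create(folder)  # create the folder only if it isn't there already
  }
}

dir.exists(folders)  # should return TRUE for each folder
```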

4 The RStudio interface

Don’t be alarmed by the RStudio interface. There’s lots of buttons and tabs, but we’ll be restricting ourselves to a relatively small subset of these to begin with.

4.1 Layout

RStudio is split into three panes when you first open it, each of which has a few tabs. We care about a few of these tabs right now:

Left pane:

  • the console tab where outputs are displayed (you can also directly type code into the console, but your code won’t be saved)

Upper-right pane:

  • the Environment tab that fills when you create saved objects
  • the History tab for seeing and rerunning any previous commands

Lower-right pane:

  • the Files tab from which you can open files (it defaults to your home folder where the .Rproj is stored for this project)
  • the Plots tab for viewing plot outputs that you’ve created
  • the Help tab for searching for help with R packages and functions

4.2 Start a script

Open a new file with File > New File > R script, or in the top left of RStudio click the button with a ‘+’ in a green circle on a white square, then click ‘R Script’:

A new pane will appear with a new scripting tab. It’s blank. You type the code into this space and run it. The inputs and results are displayed in the console below once the script has been executed. This is not too dissimilar to what you get in SQL Server Management Studio, for example.

You can have more than one scripting tab open at once. Usually you would have one script per process. For example, one for reading and manipulating data (e.g. 01_read-data), one for modelling (e.g. 02_model) and one for plotting (e.g. 03_plot), i.e. sensible names with a number that indicates the order to execute the code. This will improve reproducibility.

Start your script with some useful information. Anything prefixed with a hash (#) will be recognised as a comment and won’t be executed as code. For example:

# Title: Sensational training script
# Purpose: To inspire new R users
# Name: Matt Dray
# Date: Jan 2018

You can copy-paste or type the code from this document into your R script as we go along. Remember to add comments with # to say what you’re doing and to break your script up into sections.

4.3 Execute code

Type 1 + 1 into your scripting window (upper left pane). To ‘run’ the code, make sure your cursor is on the line containing the code and use the keyboard shortcut ‘Control + Enter’ to execute it (alternatively, click the ‘Run’ button in the top right of the scripting window). This will only run the line your cursor is on (or any code you’ve highlighted); it won’t continue running the whole script.

1 + 1  # add one number to another
## [1] 2

You should have got the answer 2. The number in square brackets is the position of the first item shown on that line of output; here the answer is the first (and only) item returned.
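
The bracketed number is easier to understand with a longer output. In this small sketch, the output should wrap over more than one line in the console, and each line should begin with the position of its first value in square brackets. Exactly where the lines break depends on how wide your console is.

```r
101:140  # all the whole numbers from 101 to 140
```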


CHALLENGE!

Save your script with a sensible name.

Hint: File > Save, or Control + S. You’ll be prompted to save the file in your home folder (the one containing your R Project file).


This is good, but ideally we want to store values to help simplify our code. We do this by making ‘objects’. An object can be a single number, a list of strings, a table of data, a plot, or many other things. You create an object by assigning a name to your values. You do this with the ‘assignment arrow’, <-, which is basically akin to “into an object named the thing on the left, save the thing on the right”.

For example, we can assign 1 + 1 to the object name my_num with <-. Execute the following code:

my_num <- 1 + 1  # store 1 + 1 as an object

Hm. Nothing printed out in the console. Instead the object is now in your environment – see the top right pane in RStudio. You are now free to refer to this object by name in your script. For example, you can now print the contents of this object to the console with the line print(my_num) or explore it with the environment pane.

print(my_num)  # print the contents of the object to the console
## [1] 2
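
Once an object exists you can use it in further calculations. A minimal sketch (the name my_num_doubled is made up for illustration):

```r
my_num <- 1 + 1                # as above
my_num_doubled <- my_num * 2   # use the stored value in a new calculation
print(my_num_doubled)
## [1] 4
```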

Storing one value is fine. But objects can be used to store more than that. This next chunk of code creates a ‘vector’, where several values in the brackets have been combined together with the c() command. In this example I’ve created some character strings, each bound within a pair of quotation marks (""). Numbers don’t need to be in quotation marks (unless they’ve been stored as text).

my_vector <- c("Pichu", "Pikachu", "Raichu")  # combine some values
print(my_vector)  # have a look at what the object contains
## [1] "Pichu"   "Pikachu" "Raichu"

You can see what ‘class’ your vector is at any time with the class() function.

class(my_num)
## [1] "numeric"

The vector my_num is composed of numbers only and so is ‘numeric’, but my_vector is composed entirely of character strings:

class(my_vector)
## [1] "character"
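
To illustrate the earlier point about numbers stored as text, here is a small sketch. The object names text_numbers and real_numbers are made up for this example.

```r
text_numbers <- c("1", "2", "3")  # quotation marks make these character strings
class(text_numbers)
## [1] "character"

real_numbers <- as.numeric(text_numbers)  # convert the text to numbers
class(real_numbers)
## [1] "numeric"
```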

So we’ve created objects composed of both single values and vectors. You can think of these as being zero-dimensional and one-dimensional. The next step would be two dimensions: a table. Tables of data with rows and columns are called ‘data frames’ in R and are effectively a bunch of vectors of the same length stuck together. Consider this:

my_df <- data.frame(
  species = c("Pichu", "Pikachu", "Raichu"),
  number = c(172, 25, 26),
  location = c("Johto", "Kanto", "Kanto")
)

print(my_df)
##   species number location
## 1   Pichu    172    Johto
## 2 Pikachu     25    Kanto
## 3  Raichu     26    Kanto

Can you see how this is three vectors (species, number and location) of the same length (3 values) arranged into columns? The function data.frame() binds these vectors together into (surprise) a data frame.

class(my_df)
## [1] "data.frame"

Aha!
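
Because a data frame is a set of vectors stuck together, you can pull a single column back out as a vector with the dollar symbol, and count the rows and columns. A short sketch using the same my_df as above:

```r
my_df <- data.frame(
  species = c("Pichu", "Pikachu", "Raichu"),
  number = c(172, 25, 26),
  location = c("Johto", "Kanto", "Kanto")
)

my_df$number  # pull out one column as a vector
## [1] 172  25  26

nrow(my_df)  # number of rows
## [1] 3

ncol(my_df)  # number of columns
## [1] 3
```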


CHALLENGE!

Create a sensibly-named data frame object with three sensibly-named columns:

  • one for animals
  • one for a cuteness score
  • one for a ferocity score

Now print it.


4.4 Functions

You’ve been using functions already: print(), class(), data.frame() and c().

Theory: a function is a reproducible unit of code that performs a given task, such as reading a data file or fitting a model. Functions prevent you from copy-pasting your code multiple times, which could lead to errors and makes for unwieldy, unreadable code. If you can help it, Don’t Repeat Yourself.

Functions are written as the function name followed by brackets. The brackets contain the arguments – the items you need to provide to the function for it to work. One argument might be a filepath to some data, another might describe the colour of points to be plotted. They’re separated by commas.

So a generic function might look like this:

# don't run this, it doesn't do anything!
function_name(
  data = my_data,
  colour = "red",
  option = 5
)  

Note that you can break the function over several lines. You can put your cursor on any of these lines and run it. You don’t have to highlight the whole thing.

You can type a question mark followed by a function name to learn about its arguments in a help file that will appear in the bottom right pane. For example, ?plot. Try it, but don’t worry about the content for now.

Aside: you don’t necessarily need to write the argument name and an equals sign. For example, if the first argument expected by example_function() is data (you can find out by running ?example_function) you can write example_function(my_data) instead of example_function(data = my_data). It’s good practice to write the argument names though: it’ll help you and others to understand your code and avoid confusion. For example, specifying the arguments x = vector_x and y = vector_y in a plot function might make it clearer which axis is which when checking your code.
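
Here is the same idea with a real function rather than a made-up one. This sketch uses base R’s round(), whose first two arguments are x (the number to round) and digits (the number of decimal places).

```r
round(x = 3.14159, digits = 2)  # arguments named explicitly
## [1] 3.14

round(3.14159, 2)  # same result, relying on the order of the arguments
## [1] 3.14
```

The second version is shorter, but the first makes it obvious what the 2 is for.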


CHALLENGE!

It’s good practice to reset R every so often.

Why might we do this?

Hit the keyboard shortcut Control + Shift + F10 for RStudio to reset.


4.5 Packages

Functions can be bundled into packages. A bunch of packages are pre-installed with R, but there are thousands more available for download. These packages extend the basic capabilities of R.

Packages can be installed to your computer using the install.packages() function. This automatically fetches and downloads packages from the Comprehensive R Archive Network (CRAN).

Here are two packages that we’re going to use in this session:

install.packages(pkgs = "readr")  # for reading data into R
install.packages(pkgs = "dplyr")  # for manipulating data

You only need to run the installation function once for each package. The package is installed to your computer once you’ve done this and you only need to ‘remind’ RStudio where to find the package using the library() function in future.
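
If you can’t remember whether a package is already installed, you can check before installing it again. This is one possible sketch using base R’s requireNamespace(); the object names packages_needed and is_installed are made up for illustration.

```r
packages_needed <- c("readr", "dplyr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE if not
is_installed <- vapply(
  packages_needed,
  requireNamespace,
  FUN.VALUE = logical(1),
  quietly = TRUE
)

print(is_installed)  # TRUE or FALSE for each package name
```

Any package showing FALSE still needs install.packages() to be run for it.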

So now we have the readr and dplyr packages installed we can call them with the library() function so we can use them.

library(package = "readr")
library(package = "dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Sometimes a package will print a message in the console after loading. This is usually fine and only a problem in very specific circumstances. For example, you might be told that the package was developed using a newer version of R, or perhaps that a function from that package ‘conflicts’ with another already-installed function (usually because the two functions have the same name).
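
When two packages have a function with the same name, you can say exactly which one you mean by writing the package name and two colons in front of the function. A minimal sketch, assuming dplyr is installed (small_df is made-up data, and filter() itself is covered later in this guide):

```r
library(dplyr)

small_df <- data.frame(score = c(1, 5, 10))  # made-up data for illustration

# package::function() says exactly which package's function you mean
dplyr::filter(small_df, score > 3)  # keeps the rows where score is above 3
```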

Okay, let’s get hold of some data!

5 Get data in and look at it

5.1 Read the data

Let’s use a dataset that I collected myself. It contains information about organisms I collected from exotic locations spanning the globe from Napoli to Hastings. It’s a file containing a data set of about 700 Pokemon – captured on the Pokemon Go app – with their characteristics data. It’s the very best dataset, like no dataset ever was.

If I haven’t given you the dataset already as a Comma-Separated Values (CSV) file, you can download it from the internet via GitHub. Save it to a folder named ‘data’ in your R Project folder like so:

download.file(
  url = "https://raw.githubusercontent.com/mwdray/datasets/master/pokemon_go_captures.csv",
  destfile = "data/pokemon_go_captures.csv"  # where to save it to (the 'data' folder must already exist)
)

And then you can read it in as follows:

pokemon <- read_csv(file = "data/pokemon_go_captures.csv")
## Parsed with column specification:
## cols(
##   species = col_character(),
##   combat_power = col_integer(),
##   hit_points = col_integer(),
##   weight_kg = col_double(),
##   weight_bin = col_character(),
##   height_m = col_double(),
##   height_bin = col_character(),
##   fast_attack = col_character(),
##   charge_attack = col_character()
## )

This function loads the data from the CSV file at the filepath provided (it’s in our ‘data’ folder). It prints a note to the console to tell you the columns that have been read in and also what the data type of each one is. For example, combat_power = col_integer() tells us that the data in this column has been read as integers.
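
If you don’t want to rely on the guessed column types, read_csv() also lets you state them yourself with the col_types argument. The sketch below writes a tiny made-up CSV to a temporary file (so it runs without the real data) and reads it back with explicit types; the names example_file and example_data are assumptions for illustration only.

```r
library(readr)

# Write a tiny stand-in CSV to a temporary file so the example is self-contained
example_file <- tempfile(fileext = ".csv")
writeLines(
  c("species,combat_power,weight_kg", "krabby,51,5.82", "geodude,85,20.88"),
  example_file
)

# State the type of each column rather than letting read_csv() guess
example_data <- read_csv(
  file = example_file,
  col_types = cols(
    species = col_character(),
    combat_power = col_integer(),
    weight_kg = col_double()
  )
)

print(example_data)
```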

But where is this data? How do we know it’s actually been read in?

If you look at the ‘Environment’ tab in the top-right pane of RStudio, you’ll see our object ‘pokemon’ is there. Helpfully, we’re told it has dimensions of 696 rows and 9 columns.

5.2 Data inspection

The first thing we should do is look at the data to check for anomalies. There are a number of ways to do this.

You can take a look at information about your data frame using:

glimpse(pokemon)
## Observations: 696
## Variables: 9
## $ species       <chr> "krabby", "geodude", "venonat", "parasect", "eev...
## $ combat_power  <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 12...
## $ hit_points    <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38,...
## $ weight_kg     <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 36...
## $ weight_bin    <chr> "normal", "normal", "extra_small", "extra_small"...
## $ height_m      <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80, ...
## $ height_bin    <chr> "normal", "normal", "normal", "normal", "normal"...
## $ fast_attack   <chr> "mud_shot", "rock_throw", "confusion", "bug_bite...
## $ charge_attack <chr> "vice_grip", "rock_tomb", "poison_fang", "x-scis...

Immediately this tells us that there are 696 rows and 9 columns. Column names are then listed with the data type and first few examples. This information is also available in the Environment tab in the upper-right pane. Click the little blue arrow to have this information drop down.

Another way of expressing this is to simply print() to the console. The output is displayed in table format, but is truncated to fit the console window (this prevents you from printing millions of rows to the console!).

print(pokemon)
## # A tibble: 696 x 9
##         species combat_power hit_points weight_kg  weight_bin height_m
##           <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1       krabby           51         15      5.82      normal     0.36
##  2      geodude           85         23     20.88      normal     0.37
##  3      venonat          129         38     20.40 extra_small     0.92
##  4     parasect          171         32     19.20 extra_small     0.87
##  5        eevee          172         37      4.18 extra_small     0.25
##  6      voltorb          131        320     11.20      normal     0.48
##  7     shellder           96         21      3.49      normal     0.27
##  8       staryu           11         10     36.41      normal     0.80
##  9 nidoran_male          112         30      9.49      normal     0.51
## 10      poliwag          156         35     11.24      normal     0.58
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

If you want to see the whole dataset you could use the View() function:

View(pokemon) # note the capital 'V'

This opens up a read-only tab in the script window that displays your data in full. You can scroll around and order the columns by clicking the headers. This doesn’t affect the underlying data at all.

You can also access this by clicking the little image of a table to the right of the object in the environment pane (upper-right).
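
A few base R functions are also handy for a quick look. The sketch below uses a small stand-in data frame (pokemon_sample, built from the first three rows shown above) so that it runs on its own; the same functions should work on the full pokemon object.

```r
# A stand-in for the real data, using the first three rows printed above
pokemon_sample <- data.frame(
  species = c("krabby", "geodude", "venonat"),
  combat_power = c(51L, 85L, 129L),
  hit_points = c(15L, 23L, 38L)
)

dim(pokemon_sample)  # number of rows, then number of columns
## [1] 3 3

names(pokemon_sample)  # the column names

head(pokemon_sample, n = 2)  # the first two rows only
```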

5.3 Quick summary

You can get very quick summary statistics with the summary() function. The function provides a quick summary of each column depending on its data type (integer, character, etc). This is pretty basic, but we’ll do something more impressive later.

summary(pokemon)
##    species           combat_power      hit_points       weight_kg      
##  Length:696         Min.   :  10.0   Min.   : 10.00   Min.   :  0.050  
##  Class :character   1st Qu.:  76.0   1st Qu.: 23.00   1st Qu.:  2.795  
##  Mode  :character   Median : 160.0   Median : 33.00   Median :  6.440  
##                     Mean   : 206.1   Mean   : 37.42   Mean   : 15.053  
##                     3rd Qu.: 286.0   3rd Qu.: 47.00   3rd Qu.: 20.163  
##                     Max.   :1636.0   Max.   :320.00   Max.   :492.040  
##   weight_bin           height_m       height_bin        fast_attack       
##  Length:696         Min.   :0.2000   Length:696         Length:696        
##  Class :character   1st Qu.:0.3100   Class :character   Class :character  
##  Mode  :character   Median :0.5050   Mode  :character   Mode  :character  
##                     Mean   :0.6544                                        
##                     3rd Qu.:0.8900                                        
##                     Max.   :9.5200                                        
##  charge_attack     
##  Length:696        
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
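
Summary statistics can also hint at anomalies. As a small sketch, here are the hit_points values from the first ten rows printed earlier (stored in a made-up object called hit_points_sample). One value of 320 is much larger than the rest, and it drags the mean well above the median.

```r
# hit_points for the first ten rows of the data, as printed earlier
hit_points_sample <- c(15, 23, 38, 32, 37, 320, 21, 10, 30, 35)

mean(hit_points_sample)  # pulled upwards by the single value of 320
## [1] 56.1

median(hit_points_sample)  # the middle value is less affected
## [1] 31
```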

6 Manipulating the data frame

We’re going to use a number of sensibly-named functions from the dplyr package to do our data manipulation. These functions take verbs – not too dissimilar to SQL verbs – as their names. This makes it easy to understand what they’re doing.

dplyr is part of a suite of packages within what is called ‘the Tidyverse’. These packages are all written with the same thoughts in mind (e.g. the first argument of all the functions is the data, function names are sensible and written in snake_case, the code is optimised to run quickly, etc).

The tidyverse aims to make things simpler and faster for R coders.

6.1 Select

Firstly, we can select() columns of interest. There’s the first sensible function name. You’ll notice that a lot of them are verbs to make it clear that the code is actively doing something.

# save as an object for later
pokemon_hp <- select( 
  pokemon,  # the first argument is always the data
  hit_points,  # the other arguments are column names you want to keep
  species
)  

print(pokemon_hp)
## # A tibble: 696 x 2
##    hit_points      species
##         <int>        <chr>
##  1         15       krabby
##  2         23      geodude
##  3         38      venonat
##  4         32     parasect
##  5         37        eevee
##  6        320      voltorb
##  7         21     shellder
##  8         10       staryu
##  9         30 nidoran_male
## 10         35      poliwag
## # ... with 686 more rows

Note that the order you select the columns is the order they’ll appear in when they print.
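
You can check the column order without printing the whole table by looking at the column names. A minimal sketch, assuming dplyr is installed and using a small stand-in data frame (pokemon_sample) so that it runs on its own:

```r
library(dplyr)

# A stand-in for the real data, using the first two rows printed above
pokemon_sample <- data.frame(
  species = c("krabby", "geodude"),
  combat_power = c(51L, 85L),
  hit_points = c(15L, 23L)
)

reordered <- select(pokemon_sample, hit_points, species)
names(reordered)  # columns appear in the order they were selected
## [1] "hit_points" "species"
```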

And we can choose not to include certain columns by prefixing with - (hyphen/minus).

select(
  pokemon,  # data frame first
  -hit_points, -combat_power, -fast_attack, -weight_bin  # columns to drop
)
## # A tibble: 696 x 5
##         species weight_kg height_m height_bin charge_attack
##           <chr>     <dbl>    <dbl>      <chr>         <chr>
##  1       krabby      5.82     0.36     normal     vice_grip
##  2      geodude     20.88     0.37     normal     rock_tomb
##  3      venonat     20.40     0.92     normal   poison_fang
##  4     parasect     19.20     0.87     normal     x-scissor
##  5        eevee      4.18     0.25     normal     body_slam
##  6      voltorb     11.20     0.48     normal     discharge
##  7     shellder      3.49     0.27     normal   bubble_beam
##  8       staryu     36.41     0.80     normal   bubble_beam
##  9 nidoran_male      9.49     0.51     normal     body_slam
## 10      poliwag     11.24     0.58     normal     body_slam
## # ... with 686 more rows

That can be quite laborious, so there are some special functions we can use inside the select function to help us out.

For example, selecting columns starting with a particular string:

select(pokemon, starts_with("weight"))
## # A tibble: 696 x 2
##    weight_kg  weight_bin
##        <dbl>       <chr>
##  1      5.82      normal
##  2     20.88      normal
##  3     20.40 extra_small
##  4     19.20 extra_small
##  5      4.18 extra_small
##  6     11.20      normal
##  7      3.49      normal
##  8     36.41      normal
##  9      9.49      normal
## 10     11.24      normal
## # ... with 686 more rows

Or any columns containing a given string.

select(pokemon, contains("bin"))
## # A tibble: 696 x 2
##     weight_bin height_bin
##          <chr>      <chr>
##  1      normal     normal
##  2      normal     normal
##  3 extra_small     normal
##  4 extra_small     normal
##  5 extra_small     normal
##  6      normal     normal
##  7      normal     normal
##  8      normal     normal
##  9      normal     normal
## 10      normal     normal
## # ... with 686 more rows

CHALLENGE!

Create an object called my_selection that uses the select() function to store from pokemon the species column and any columns that end with "attack".


There’s more information in the help file if you type ?select.

6.2 Filter

Now for subsetting the data by its rows.

We’re going to make use of some common logical operators for subsetting our data by certain conditions:

  • == – equals
  • != – not equals
  • %in% – match to several things listed with c()
  • >, <, <=, >= – greater/less than (or equal to)
  • & – ‘and’
  • | – ‘or’

Let’s start by filtering for one particular species.

filter(pokemon, species == "jigglypuff")
## # A tibble: 11 x 9
##       species combat_power hit_points weight_kg  weight_bin height_m
##         <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1 jigglypuff          221         93      7.04 extra_large     0.56
##  2 jigglypuff          156         80      6.83      normal     0.55
##  3 jigglypuff          349        119      3.57 extra_small     0.42
##  4 jigglypuff           10         22      4.92      normal     0.44
##  5 jigglypuff          188         94      6.56      normal     0.52
##  6 jigglypuff           33         39      7.14 extra_large     0.58
##  7 jigglypuff           56         51      5.55      normal     0.49
##  8 jigglypuff           66         51      8.13 extra_large     0.60
##  9 jigglypuff          289        111      5.02      normal     0.44
## 10 jigglypuff          348        119      4.91      normal     0.47
## 11 jigglypuff          486        146      4.90      normal     0.44
## # ... with 3 more variables: height_bin <chr>, fast_attack <chr>,
## #   charge_attack <chr>

Now everything except for one species.

filter(pokemon, species != "pidgey")  # not equals to
## # A tibble: 610 x 9
##         species combat_power hit_points weight_kg  weight_bin height_m
##           <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1       krabby           51         15      5.82      normal     0.36
##  2      geodude           85         23     20.88      normal     0.37
##  3      venonat          129         38     20.40 extra_small     0.92
##  4     parasect          171         32     19.20 extra_small     0.87
##  5        eevee          172         37      4.18 extra_small     0.25
##  6      voltorb          131        320     11.20      normal     0.48
##  7     shellder           96         21      3.49      normal     0.27
##  8       staryu           11         10     36.41      normal     0.80
##  9 nidoran_male          112         30      9.49      normal     0.51
## 10      poliwag          156         35     11.24      normal     0.58
## # ... with 600 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

Now filtering to include three species only.

filter(
  pokemon,
  species %in% c("staryu", "psyduck", "charmander")
)
## # A tibble: 39 x 9
##    species combat_power hit_points weight_kg  weight_bin height_m
##      <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1  staryu           11         10     36.41      normal     0.80
##  2 psyduck           97         26     26.05 extra_large     0.90
##  3 psyduck           41         17     23.63      normal     0.91
##  4  staryu          225         25     36.39      normal     0.73
##  5  staryu          154         23     18.76 extra_small     0.59
##  6  staryu           11         10     18.91 extra_small     0.68
##  7  staryu          260         29     44.21 extra_large     0.85
##  8 psyduck           44         19     23.41      normal     0.72
##  9  staryu          112         19     28.08      normal     0.78
## 10  staryu          144         23     50.42 extra_large     0.97
## # ... with 29 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

We can work with numbers too.

filter(
  pokemon,
  combat_power > 900 & hit_points < 100  # two conditions
)  # note the '&'
## # A tibble: 7 x 9
##      species combat_power hit_points weight_kg  weight_bin height_m
##        <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
## 1   gyarados          955         94    176.98      normal     5.58
## 2     magmar          936         70     40.36      normal     1.16
## 3     magmar          991         73     31.13 extra_small     1.23
## 4     magmar          963         75     42.48      normal     1.28
## 5     pinsir         1184         84     68.13      normal     1.61
## 6     fearow          954         83     40.55      normal     1.20
## 7 electabuzz          962         74     39.04 extra_large     1.26
## # ... with 3 more variables: height_bin <chr>, fast_attack <chr>,
## #   charge_attack <chr>
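Conditions can also be combined with | (meaning ‘or’). Here’s a minimal sketch of how brackets change the result when you mix & and |. The toy dataframe is made up for illustration and is not the real pokemon data.

```r
library(dplyr)

# Toy data invented for this example (not the real pokemon dataset)
toy <- tibble(
  species = c("abra", "abra", "onix", "onix"),
  combat_power = c(50, 150, 50, 150),
  hit_points = c(200, 20, 200, 20)
)

# R evaluates '&' before '|', so this reads as:
# species is abra OR (combat_power > 100 AND hit_points < 100)
no_brackets <- filter(
  toy,
  species == "abra" | combat_power > 100 & hit_points < 100
)

# Brackets make a different grouping explicit:
# (species is abra OR combat_power > 100) AND hit_points < 100
with_brackets <- filter(
  toy,
  (species == "abra" | combat_power > 100) & hit_points < 100
)

nrow(no_brackets)    # the abra with 200 hit_points is kept here...
nrow(with_brackets)  # ...but dropped here
```

If in doubt, add the brackets: they cost nothing and make your intention obvious to whoever QAs your code.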

CHALLENGE!

Filter the pokemon dataframe to include rows that:

  • are the species “abra”, “chansey”, or “bellsprout”
  • and have greater than 100 combat_power
  • or less than 100 hit_points

6.3 Mutate

Now to create new columns. We use mutate() because we’re mutating our dataframe – we’re budding a new column where there wasn’t one before. Often you’ll be creating new columns based on the content of columns that already exist, or you can fill the entire column with one thing.

For now, we’re going to create column names without spaces. It’s easier.

# we're going to subset by columns first
pokemon_power_hp <- select(  # create new object by subsetting our data set
  pokemon,  # data
  species, combat_power, hit_points  # columns to keep
)

# now to mutate with some extra information
mutate(
  pokemon_power_hp,  # our new, subsetted data frame
  power_index = combat_power * hit_points,  # new column from old ones
  caught = 1,  # new column will fill entirely with number
  area = "kanto"  # will fill entirely with this text 
)
## # A tibble: 696 x 6
##         species combat_power hit_points power_index caught  area
##           <chr>        <int>      <int>       <int>  <dbl> <chr>
##  1       krabby           51         15         765      1 kanto
##  2      geodude           85         23        1955      1 kanto
##  3      venonat          129         38        4902      1 kanto
##  4     parasect          171         32        5472      1 kanto
##  5        eevee          172         37        6364      1 kanto
##  6      voltorb          131        320       41920      1 kanto
##  7     shellder           96         21        2016      1 kanto
##  8       staryu           11         10         110      1 kanto
##  9 nidoran_male          112         30        3360      1 kanto
## 10      poliwag          156         35        5460      1 kanto
## # ... with 686 more rows

So we have a column caught filled with 1 for every row and a column area filled with kanto for every row. R ‘recycles’ the single value you supply across all the rows. Note that mutate() only recycles single values: if you gave the argument a vector of three numbers, e.g. caught = c(1:3), you would get an error, because the vector is neither length one nor the same length as the dataframe.
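One caveat on recycling, sketched on a made-up six-row tibble: recent versions of dplyr recycle a single value in mutate() but refuse a vector of any other length that doesn’t match the number of rows. If you want a repeating pattern, one option is to build the full-length vector yourself with rep(). The object names here are hypothetical.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  species = c("krabby", "geodude", "venonat", "parasect", "eevee", "voltorb")
)

# A single value is recycled to every row
filled <- mutate(toy, caught = 1, area = "kanto")

# A vector of length 3 is not recycled by mutate(); we catch the error
# so the script carries on
length_three <- tryCatch(
  mutate(toy, caught = 1:3),
  error = function(e) "error"
)

# To repeat a pattern, build a vector as long as the dataframe;
# n() gives the number of rows
cycled <- mutate(toy, caught = rep(1:3, length.out = n()))
```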

You can also fill a column conditionally with an if_else() statement:

mutate(
  pokemon_hp,
  common = if_else(
    condition = species %in% c(  # if this condition is met...
      "pidgey", "rattata", "drowzee", 
      "spearow", "magikarp", "weedle", 
      "staryu", "psyduck", "eevee"
    ),
    true = "yes",  # ...fill column with this string
    false = "no"  # ...otherwise fill it with this string
  )
)
## # A tibble: 696 x 3
##    hit_points      species common
##         <int>        <chr>  <chr>
##  1         15       krabby     no
##  2         23      geodude     no
##  3         38      venonat     no
##  4         32     parasect     no
##  5         37        eevee    yes
##  6        320      voltorb     no
##  7         21     shellder     no
##  8         10       staryu    yes
##  9         30 nidoran_male     no
## 10         35      poliwag     no
## # ... with 686 more rows

And we can get more nuanced by using a case_when() statement (you may have seen this in SQL). This prevents us writing nested if_else() statements to specify multiple conditions.

mutate(
  pokemon_hp,  # data
  common = case_when(
    species %in% c("pidgey", "rattata", "drowzee") ~ "very_common",
    species == "spearow" ~ "pretty_common",
    species %in% c("magikarp", "weedle", "staryu", "psyduck") ~ "common",
    species == "eevee" ~ "less_common",
    TRUE ~ "no"
  )
)
## # A tibble: 696 x 3
##    hit_points      species      common
##         <int>        <chr>       <chr>
##  1         15       krabby          no
##  2         23      geodude          no
##  3         38      venonat          no
##  4         32     parasect          no
##  5         37        eevee less_common
##  6        320      voltorb          no
##  7         21     shellder          no
##  8         10       staryu      common
##  9         30 nidoran_male          no
## 10         35      poliwag          no
## # ... with 686 more rows
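One thing worth knowing about case_when() is that the conditions are checked from top to bottom and the first one that matches wins. Here’s a small sketch with invented numbers showing what happens if a broad condition comes before a narrow one.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(combat_power = c(50, 250, 950))

# Narrowest condition first: each row lands in the band we expect
ordered <- mutate(
  toy,
  band = case_when(
    combat_power > 900 ~ "high",
    combat_power > 200 ~ "medium",
    TRUE ~ "low"
  )
)

# Broad condition first: 950 is caught by '> 200' and never reaches '> 900'
misordered <- mutate(
  toy,
  band = case_when(
    combat_power > 200 ~ "medium",
    combat_power > 900 ~ "high",
    TRUE ~ "low"
  )
)
```

So put your most specific conditions at the top and leave the catch-all TRUE until last.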

CHALLENGE!

Create a new dataframe object that takes the pokemon data and adds a column containing Pokemon body-mass index (BMI).

Hint: BMI is weight over height squared (you can square a number by writing ^2 after it).

Now use a case_when() to categorise Pokemon:

  • Underweight = <18.5
  • Normal weight = 18.5–24.9
  • Overweight = 25–29.9
  • Obesity = BMI of 30 or greater

Note that these are BMI groups for humans. And that BMI has many limitations!


6.4 Arrange

This does what it says on the tin: it alters the order of the rows in your table according to the column (or columns) you specify.

arrange(
  pokemon,  # again, data first
  height_m  # column to order by
)
## # A tibble: 696 x 9
##    species combat_power hit_points weight_kg  weight_bin height_m
##      <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1 diglett           79         10      0.79      normal     0.20
##  2  pidgey          254         44      0.82 extra_small     0.21
##  3 rattata           23         11      1.52 extra_small     0.22
##  4  pidgey          229         43      0.85 extra_small     0.22
##  5  weedle           17         13      2.25 extra_small     0.22
##  6 spearow          296         47      0.69 extra_small     0.22
##  7 spearow           89         26      1.06 extra_small     0.22
##  8  pidgey          256         46      0.82 extra_small     0.23
##  9 rattata           64         17      2.70      normal     0.23
## 10 diglett           64         10      1.05 extra_large     0.23
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>

And in reverse order (tallest first):

arrange(pokemon, desc(height_m))  # descending
## # A tibble: 696 x 9
##     species combat_power hit_points weight_kg  weight_bin height_m
##       <chr>        <int>      <int>     <dbl>       <chr>    <dbl>
##  1     onix          299         38    192.45      normal     9.52
##  2 gyarados          955         94    176.98      normal     5.58
##  3   pidgey           76         26      1.25 extra_small     2.50
##  4    ekans          206         35     11.57 extra_large     2.46
##  5   lapras         1636        161    162.88 extra_small     2.22
##  6  snorlax          300         85    492.04      normal     2.11
##  7  dratini          298         42      4.40 extra_large     2.08
##  8  dratini          332         44      4.75 extra_large     1.99
##  9    ekans           95         24      5.20      normal     1.93
## 10  dratini          316         40      3.13      normal     1.91
## # ... with 686 more rows, and 3 more variables: height_bin <chr>,
## #   fast_attack <chr>, charge_attack <chr>
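You can give arrange() more than one column. Later columns are used to break ties in the earlier ones. A minimal sketch on a made-up tibble:

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  species = c("diglett", "pidgey", "diglett", "pidgey"),
  hit_points = c(10, 44, 10, 44),
  combat_power = c(64, 229, 79, 254)
)

# Order by hit_points (lowest first), then within ties
# by combat_power (highest first)
sorted <- arrange(toy, hit_points, desc(combat_power))
```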

CHALLENGE!

What happens if you arrange by a column containing characters rather than numbers? For example, the species column.


6.5 Join

Again, another verb that mirrors what you can find in SQL. There are several types of join, but we’re going to focus on the most common one: the left_join(). This joins information from one table (y) to another (x) by some key matching variable of our choice. Every row of x is kept.

Let’s start by reading in a lookup table that provides some extra information about our species.

pokedex <- read_csv("data/pokedex_simple.csv")
## Parsed with column specification:
## cols(
##   species = col_character(),
##   pokedex_number = col_integer(),
##   type1 = col_character(),
##   type2 = col_character()
## )
glimpse(pokedex)  # let's inspect its contents
## Observations: 801
## Variables: 4
## $ species        <chr> "bulbasaur", "ivysaur", "venusaur", "charmander...
## $ pokedex_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...
## $ type1          <chr> "grass", "grass", "grass", "fire", "fire", "fir...
## $ type2          <chr> "poison", "poison", "poison", NA, NA, "flying",...

Now we’re going to join this new data to our pokemon data. The key for matching these is the species column, which exists in both datasets.

pokemon_join <- left_join(
  x = pokemon,  # to this table...
  y = pokedex,   # ...join this table
  by = "species"  # on this key
)

glimpse(pokemon_join)
## Observations: 696
## Variables: 12
## $ species        <chr> "krabby", "geodude", "venonat", "parasect", "ee...
## $ combat_power   <int> 51, 85, 129, 171, 172, 131, 96, 11, 112, 156, 1...
## $ hit_points     <int> 15, 23, 38, 32, 37, 320, 21, 10, 30, 35, 26, 38...
## $ weight_kg      <dbl> 5.82, 20.88, 20.40, 19.20, 4.18, 11.20, 3.49, 3...
## $ weight_bin     <chr> "normal", "normal", "extra_small", "extra_small...
## $ height_m       <dbl> 0.36, 0.37, 0.92, 0.87, 0.25, 0.48, 0.27, 0.80,...
## $ height_bin     <chr> "normal", "normal", "normal", "normal", "normal...
## $ fast_attack    <chr> "mud_shot", "rock_throw", "confusion", "bug_bit...
## $ charge_attack  <chr> "vice_grip", "rock_tomb", "poison_fang", "x-sci...
## $ pokedex_number <int> 98, 74, 48, 47, 133, 100, 90, 120, 32, 60, 46, ...
## $ type1          <chr> "water", "rock", "bug", "bug", "normal", "elect...
## $ type2          <chr> NA, "ground", "poison", "grass", NA, NA, NA, NA...
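It can help to see what left_join() does when the keys don’t match up perfectly. This is a sketch on two tiny invented tables (the species ‘missingno’ is made up so that it has no match in the lookup).

```r
library(dplyr)

# Toy tables invented for this example
captures <- tibble(
  species = c("eevee", "staryu", "missingno"),  # 'missingno' has no match
  hit_points = c(37, 10, 33)
)

lookup <- tibble(
  species = c("eevee", "staryu", "onix"),  # 'onix' was never captured
  type1 = c("normal", "water", "rock")
)

joined <- left_join(
  x = captures,  # every row of this table is kept...
  y = lookup,    # ...and matching information is pulled in from this one
  by = "species"
)

joined
```

Rows in x with no match in y get NA in the new columns, and rows that only exist in y are left out.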

CHALLENGE!

Try right_join() instead of left_join(). What happens? And what about anti_join()?


6.6 Other verbs

This document does not contain an exhaustive list of other functions within the same family as select(), filter(), mutate(), arrange() and *_join(). There are other functions that will be useful for your work and other ways of manipulating your data. For example, the stringr package helps with dealing with data in strings (text, for example).
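As a small taste of stringr, here’s a sketch that tidies up species names inside a mutate(). The toy tibble and the new column names are invented for illustration.

```r
library(dplyr)
library(stringr)

# Toy data invented for this example
toy <- tibble(species = c("nidoran_male", "eevee", "staryu"))

tidied <- mutate(
  toy,
  has_underscore = str_detect(species, "_"),  # TRUE where the pattern is found
  label = str_to_title(                       # capitalise each word...
    str_replace_all(species, "_", " ")        # ...after swapping '_' for a space
  )
)
```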

6.7 Pipes

Alright great, we’ve seen how to manipulate our dataframe a bit. But we’ve been doing it one discrete step at a time, so your script might end up looking something like this:

pokemon <- read_csv(file = "data/pokemon_go_captures.csv")

pokemon_select <- select(pokemon, -height_bin, -weight_bin)

pokemon_filter <- filter(pokemon_select, weight_kg > 15)

pokemon_mutate <- mutate(pokemon_filter, organism = "pokemon")

In other words, you might end up creating lots of intermediate variables and cluttering up your workspace and filling up memory.

You could do all this in one step by nesting each function inside the others, but that would be super messy and hard to read. Instead we’re going to ‘pipe’ data from one function to the next. The pipe operator – %>% – says ‘take what’s on the left and pass it through to the next function’.

So you can do it all in one step:

pokemon_piped <- read_csv(file = "data/pokemon_go_captures.csv") %>% 
  select(-height_bin, -weight_bin) %>% 
  filter(weight_kg > 15) %>% 
  mutate(organism = "pokemon")
## Parsed with column specification:
## cols(
##   species = col_character(),
##   combat_power = col_integer(),
##   hit_points = col_integer(),
##   weight_kg = col_double(),
##   weight_bin = col_character(),
##   height_m = col_double(),
##   height_bin = col_character(),
##   fast_attack = col_character(),
##   charge_attack = col_character()
## )
glimpse(pokemon_piped)
## Observations: 204
## Variables: 8
## $ species       <chr> "geodude", "venonat", "parasect", "staryu", "ven...
## $ combat_power  <int> 85, 129, 171, 11, 137, 256, 234, 157, 140, 246, ...
## $ hit_points    <int> 23, 38, 32, 10, 38, 64, 33, 49, 56, 42, 45, 34, ...
## $ weight_kg     <dbl> 20.88, 20.40, 19.20, 36.41, 41.23, 30.20, 73.81,...
## $ height_m      <dbl> 0.37, 0.92, 0.87, 0.80, 1.26, 0.84, 1.52, 0.94, ...
## $ fast_attack   <chr> "rock_throw", "confusion", "bug_bite", "water_gu...
## $ charge_attack <chr> "rock_tomb", "poison_fang", "x-scissor", "bubble...
## $ organism      <chr> "pokemon", "pokemon", "pokemon", "pokemon", "pok...

This reads as:

  • for the object named pokemon_piped, assign (<-) the contents of a CSV file read with read_csv()
  • then select out some columns
  • then filter on a variable
  • then add a column

See how this is like a recipe?

Did you notice something? We didn’t have to keep calling the dataframe object in each function call. For example, we used filter(weight_kg > 15) rather than filter(pokemon, weight_kg > 15) because the data argument was piped in. The functions mentioned above all accept the data that’s being passed into them because they’re part of the Tidyverse. (Note that this is not true for all functions, but we can talk about that later.)
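To make that concrete, here’s a sketch on an invented tibble showing that nesting and piping give the same answer. It also shows one way to handle a function that doesn’t take data as its first argument: with the %>% pipe you can use a dot as a placeholder to say where the data should go. lm() from base R is used purely as an example of such a function.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  species = c("geodude", "krabby", "staryu"),
  weight_kg = c(20.88, 5.82, 36.41),
  height_m = c(0.37, 0.36, 0.80)
)

# Nested: read from the inside out
nested <- filter(select(toy, species, weight_kg), weight_kg > 15)

# Piped: read from top to bottom
piped <- toy %>%
  select(species, weight_kg) %>%
  filter(weight_kg > 15)

# lm() expects a formula first, so the dot marks where the data goes
model <- toy %>%
  lm(weight_kg ~ height_m, data = .)
```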

Here’s another simple example using the dataframe we built earlier:

my_df <- data.frame(
  species = c("Pichu", "Pikachu", "Raichu"),
  number = c(172, 25, 26),
  location = c("Johto", "Kanto", "Kanto")
)

my_df %>%  # take the dataframe object...
  select(species, number) %>%   # ...then select these columns...
  filter(number %in% c(172, 26))  # ...then filter on these values
##   species number
## 1   Pichu    172
## 2  Raichu     26

Nice and easy to read.


CHALLENGE!

Write a pipe recipe that creates a new dataframe called my_poke that takes the pokemon dataframe and:

  • select()s only the species and combat_power columns
  • left_join()s the pokedex dataframe by species
  • filter()s by those with a type1 that’s ‘normal’

7 Summaries

Assuming we’ve now wrangled our data using the dplyr functions, we can do some quick, readable summarisation that’s way better than the summary() function.

So let’s use our knowledge – and some new functions – to get the top 5 pokemon by count.

pokemon %>%  # take the dataframe
  group_by(species) %>%   # group it by species
  tally() %>%   # tally up (count) the number of instances
  arrange(desc(n)) %>%  # arrange in descending order
  slice(1:5)  # and slice out the first five rows
## # A tibble: 5 x 2
##   species     n
##     <chr> <int>
## 1  pidgey    86
## 2 rattata    78
## 3 drowzee    64
## 4 spearow    42
## 5   zubat    35

The order of your functions is important – remember it’s like a recipe. Don’t crack the eggs on your cake just before serving. Do it near the beginning somewhere, I guess (I’m not much of a cake maker).
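As an aside, the group_by(), tally() and arrange() steps in a recipe like the one above can be squeezed into a single call to count(), which is also part of dplyr. A sketch on invented data:

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  species = c("pidgey", "rattata", "pidgey", "zubat", "pidgey", "rattata")
)

# The long way, as in the recipe above
long_way <- toy %>%
  group_by(species) %>%
  tally() %>%
  arrange(desc(n))

# The short way: count() groups and tallies, and sort = TRUE
# puts the biggest counts first
short_way <- toy %>%
  count(species, sort = TRUE)
```

Both are fine. The longer version is arguably easier to read when you’re starting out because each step is spelled out.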

There’s also a specific summarise() function that allows you to, well… summarise.

pokemon_join %>%  # take the dataframe
  group_by(type1) %>%   # group by variable
  summarise( # summarise it by...
    count = n(),  # counting the number
    mean_cp = round(mean(combat_power), 1)  # and taking a mean to 1 dp
  ) %>% 
  arrange(desc(mean_cp))  # then organise in descending order of this column
## # A tibble: 16 x 3
##       type1 count mean_cp
##       <chr> <int>   <dbl>
##  1     fire    16   510.4
##  2    fairy     5   412.4
##  3     <NA>     3   389.7
##  4 electric    12   373.2
##  5 fighting     1   358.0
##  6    grass    17   357.0
##  7   dragon     4   325.8
##  8  psychic    70   300.8
##  9      ice     7   274.9
## 10   ground     7   214.4
## 11    water   157   191.5
## 12     rock     9   190.3
## 13      bug    63   184.6
## 14    ghost    12   170.3
## 15   poison    59   167.6
## 16   normal   254   157.4

Note that you can group by more than one thing as well. We can group on the weight_bin category within the type1 category, for example.

pokemon_join %>%
  group_by(type1, weight_bin) %>% 
  summarise(
    mean_weight = mean(weight_kg),
    count = n()
  )
## # A tibble: 40 x 4
## # Groups:   type1 [?]
##       type1  weight_bin mean_weight count
##       <chr>       <chr>       <dbl> <int>
##  1      bug extra_large    29.10444     9
##  2      bug extra_small     8.98375    16
##  3      bug      normal    10.59868    38
##  4   dragon extra_large     4.57500     2
##  5   dragon      normal     2.95000     2
##  6 electric extra_large    18.73000     3
##  7 electric extra_small     5.73500     2
##  8 electric      normal    18.45571     7
##  9    fairy extra_large     9.47000     2
## 10    fairy      normal     7.96000     3
## # ... with 30 more rows
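Notice the ‘Groups’ line in the output above: after summarising by two variables, the result is still grouped by the first one. Here’s a sketch on invented data showing how you might check this with group_vars() and remove the grouping with ungroup() before carrying on. Recent versions of dplyr may also print a message about the grouping of the output when summarise() runs.

```r
library(dplyr)

# Toy data invented for this example
toy <- tibble(
  type1 = c("bug", "bug", "bug", "water", "water"),
  weight_bin = c("normal", "normal", "extra_small", "normal", "extra_large"),
  weight_kg = c(10, 12, 8, 30, 50)
)

summary_tbl <- toy %>%
  group_by(type1, weight_bin) %>%
  summarise(
    mean_weight = mean(weight_kg),
    count = n()
  )

group_vars(summary_tbl)  # the last grouping variable has been dropped

# Remove any leftover grouping so later steps act on the whole table
ungrouped <- ungroup(summary_tbl)
```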

8 Plot the data

We’re going to keep this very short and dangle it like a rare candy in front of your nose. We’ll revisit this in more depth in a later session. For now, we’re going to use a package called ggplot2 to create some simple charts.


CHALLENGE!

Remember how to use packages? Install ggplot2 and load it from the library.


The ‘gg’ in ‘ggplot2’ stands for ‘grammar of graphics’. This is a way of thinking about plotting as having a ‘grammar’: elements that can be applied in succession to create a plot. This is ‘the idea that you can build every graph from the same few components’: a data set, geoms (marks representing data points), a co-ordinate system and some other things.

The ggplot() function from the ggplot2 package is how you create these plots. You build up the graphical elements using the + rather than a pipe. Think about it as placing down a canvas and then adding layers on top.

pokemon %>%
  ggplot() +
  geom_boxplot(aes(x = weight_bin, y = combat_power))

ggplot plays nicely with the pipe – it’s part of the Tidyverse – so we can create recipes that combine data reading, data manipulation and plotting all in one go. Let’s do some manipulation before plotting and then introduce some new elements to our plot that simplify the theme and change the labels.

pokemon_join %>%
  filter(type1 %in% c("fire", "water", "grass")) %>% 
  ggplot() +
  geom_violin(aes(x = type1, y = combat_power)) +
  theme_bw() +
  labs(
    title = "CP by type",
    x = "Primary type",
    y = "Combat power"
  )

How about a scatter plot? Coloured by type1?

pokemon_join %>%
  filter(type1 %in% c("fire", "water", "grass")) %>% 
  ggplot() +
  geom_point(aes(x = pokedex_number, y = height_m, colour = type1))
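Because a plot is built up in layers, you can also store it as an object and add to it later. This is a sketch on an invented dataframe, and the object names are hypothetical.

```r
library(ggplot2)

# Toy data invented for this example
toy <- data.frame(
  type1 = c("fire", "fire", "water", "water", "grass", "grass"),
  combat_power = c(510, 480, 190, 210, 350, 360)
)

# Lay down the canvas and one layer of points, and store the result
base_plot <- ggplot(toy) +
  geom_point(aes(x = type1, y = combat_power))

# Add a theme and labels to the stored plot
final_plot <- base_plot +
  theme_bw() +
  labs(
    title = "CP by type",
    x = "Primary type",
    y = "Combat power"
  )

# Typing final_plot in the console draws it in the Plots pane
```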


CHALLENGE!

Create a boxplot of hit_points for Pokemon with a type1 of ‘normal’, ‘poison’, ‘ground’ or ‘water’.


Simple, but relatively effective. We’ll look next time at plotting in more depth. For example, yes: you can use Pokemon sprites as your plotting points. And why stop there? You can also use specific Pokemon typing colours, sprite colour palettes and theme your barplot like a Pokemon first generation HP bar. Cool, eh?

9 Further reading

9.1 Tutorials

9.2 Help/tips and tricks