2016-06-13

Context

See the thesis-reproducible repo

One of the first maps I made for my thesis

What we'll cover: Day I

  • Building strong foundations
  • Manipulating non-spatial data
  • Plotting paradigms

Day II

  • R as a GIS

Day III

  • Advanced R for spatial data

Course home

https://github.com/Robinlovelace/cs7

  • A site linking together materials for this course
  • Updated content / data / resources will go here
  • Feel free to ask any questions here by creating issues
  • Will hopefully be useful in the future

An introduction to R and RStudio

Pages 1:5 of Introduction to visualising spatial data in R

What R can do - Propensity to Cycle Tool

A bit about R

  • Developed by statisticians Ross Ihaka and Robert Gentleman
  • De-facto standard for advanced statistical analysis
  • A programming language in its own right
  • The power of the command line
  • Used by an increasing number of organisations

Why R?

  • Performance: stable, light and fast
  • Support network: documentation, community, developers
  • Reproducibility: anyone anywhere can reproduce results, and it enables dissemination (RPubs, RMarkdown, .RPres) - this presentation is a .Rmd file!
  • Versatility: unified solution to almost any numerical problem, graphical capabilities
  • Ethics: removes economic barriers to statistics, is open and democratic

R is up and coming I

II - Increasing popularity in academia

III - R vs Python

IV - employment market

What is RStudio and why use it?

What is RStudio?

  • An IDE
  • A project management system
  • A document preparation system
  • An online publication portal

Installing and using R and RStudio

Productivity with RStudio

Key shortcuts in RStudio:

Command            Action
Alt + Shift + K    Show shortcuts
Ctrl + Enter       Run current line of code
Ctrl + R           Run all lines of code in the script
Tab                Autocomplete*

Code used to generate that table

a mini demo of RMarkdown

RStudio is not just about R - it's a productivity suite!

library(knitr) # provides kable()
shortcuts <- data.frame(Command = c(
  "Alt + Shift + K", 
  "Ctrl + Enter",
  "Ctrl + R",
  "Tab"),
  Action = c("Show shortcuts",
    "Run current line of code",
    "Run all lines of code in the script",
    "Autocomplete*"))
kable(shortcuts) # render the data frame as a table

R packages

There are 7,000+ 'add-on' packages to 'supercharge' R.

Easiest way to install them, from RStudio:

Tools -> Install Packages

or using keyboard shortcuts:

Alt + T ... then k

Installing and loading many packages

Can be installed and loaded in 6 lines of code:

pkgs <- c("devtools", "shiny", "rgdal", "rgeos", "ggmap", "raster")
install.packages(pkgs) # install the official packages!
library(devtools) # enables installation of leaflet
gh_pkgs <- c("rstudio/leaflet", "robinlovelace/stplanr") 
install_github(gh_pkgs) # install packages on github
lapply(c(pkgs, "leaflet", "stplanr"), library, character.only = TRUE)
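
Before installing, it can help to check which packages are already available. The sketch below uses a hypothetical helper, missing_pkgs(), which is not part of the course materials. It relies only on base R's requireNamespace():

```r
# Hypothetical helper (an illustration, not from the course materials):
# return the packages in `pkgs` that cannot be loaded on this machine.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages shipped with R are always available, so none are reported
missing_pkgs(c("stats", "utils"))

# Possible use with packages from this slide (commented out because it
# downloads from the internet):
# to_install <- missing_pkgs(c("devtools", "shiny", "ggmap", "raster"))
# if (length(to_install) > 0) install.packages(to_install)
```

This avoids re-installing packages that are already present.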

Features of RStudio

  • Flexible window pane layouts to optimise use of screen space and enable fast interactive visual feedback.
  • Intelligent auto-completion of function names, packages and R objects.
  • A wide range of keyboard shortcuts.
  • Visual display of objects, including a searchable data display table.
  • Real-time code checking and error detection.
  • Menus to install and update packages.
  • Project management and integration with version control.

RStudio panes

RStudio has four main window 'panes':

  • The Source pane, for editing, saving, and dispatching R code to the console (top left).

  • The Console pane. Any code entered here is processed by R, line by line (bottom left).

  • The Environment pane (top right) contains information about the current objects loaded in the workspace including their class, dimension (if they are a data frame) and name.

  • The Files pane (bottom right) contains a simple file browser, a Plots tab, Help and Package tabs and a Viewer.

Exercises

You are developing a project to visualise data. Test out the multi-panel RStudio workflow by following the steps below:

  1. Create a new folder for the input data using the Files pane.

  2. Type in read.c in the Source pane and hit Enter to make the function read.csv() autocomplete. Then type ", which will autocomplete to "".

  3. Execute the full command with Ctrl-Enter:

url = "https://www.census.gov/2010census/csv/pop_change.csv"
pop_change = read.csv(url, skip = 2)
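
To see what the skip argument does without needing a download, here is a minimal sketch using made-up CSV text (the column names and values are invented for illustration):

```r
# Made-up CSV text standing in for a file with two lines of notes on top
csv_text <- "Table 1. Example data
Source: invented for illustration
region,pop_1910,pop_1960
North,100,250
South,80,300"

# skip = 2 tells read.csv() to ignore the first two lines,
# so the third line is used as the header row
toy <- read.csv(text = csv_text, skip = 2)
names(toy)
nrow(toy)
```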

Exercises: bonus

  1. Use the Environment pane to click on the data object pop_change.

  2. Use the Console to test different plot commands to visualise the data, saving the code you want to keep back into the Source pane, as pop_change.R.

Project management

In the far top-right of RStudio there is a diminutive drop-down menu illustrated with R inside a transparent box.

Projects

  • Set the working directory automatically. setwd(), a common source of error for R users, is rarely if ever needed.

  • The most recently open file is loaded into the Source pane.

  • The Files tab displays the associated files and folders in the project, allowing you to quickly find your previous work.

  • Any settings associated with the project, such as Git settings, are loaded. This assists with collaboration and project-specific set-up.
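
Because a project sets the working directory, files can be referred to with paths relative to the project folder. A minimal sketch, assuming a hypothetical data folder and file name:

```r
# With a project open, this is the project folder
getwd()

# Build a path relative to the working directory instead of calling setwd().
# "data" and "pop_change.csv" are hypothetical names.
data_path <- file.path("data", "pop_change.csv")
data_path

# Check that the file is there before trying to read it
file.exists(data_path)
```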

Setting up a project for the remainder of this course

Objects, functions and concepts for efficient R programming

Practical handout

Data can be found here

A few words on style

  • Code style does not usually affect results
  • But good code style has a number of advantages: ease of reading, consistency, collaboration

  • Recommendation: pick a style guide and stick with it
  • Example: <- vs = for assignment (mostly interchangeable)

Basic data types

Anything that exists in R is an object. Let's create some with the <- symbol (= does the same job, before you ask!)

vector_logical <- c(TRUE, TRUE, FALSE)
vector_character <- c("yes", "yes", "Hello!")
vector_numeric <- c(1, 3, 9.9)

class(vector_logical) # what are the other object classes?
## [1] "logical"

Use the "Environment tab" (top right in RStudio) to see these

Intermediate data types

R has a hierarchy of data classes. When classes are mixed, R coerces to the one furthest down this list:

  • Logical (binary)
  • Integer (numeric)
  • Double (numeric)
  • Character

Examples of data types

a <- TRUE
b <- 1:5
c <- pi
d <- "Hello Leeds"
class(a)
class(b)
class(c)
class(d)

Class coercion I

ab <- c(a, b)
ab
## [1] 1 1 2 3 4 5
class(ab)
## [1] "integer"

Class coercion II

  • R automatically forces some objects into the class it thinks is best
  • Demo:
x = 1:5
class(x)
## [1] "integer"
x = c(x, 6.1)
class(x)
## [1] "numeric"
  • Test: what is the class of x = c(x, "hello")?
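
Coercion can also be requested explicitly with the as.*() family of functions. A short sketch with illustrative values:

```r
as.integer(TRUE)        # logical to integer
as.numeric("3.5")       # character to number
as.character(1:3)       # integer to character
as.logical(c(0, 1, 2))  # zero is FALSE, anything else is TRUE

# Text that is not a number becomes NA (R also gives a warning,
# silenced here with suppressWarnings())
suppressWarnings(as.numeric("hello"))
```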

Dimensions of objects

  • Dimensionality is key to understanding R data
  • You cannot do the same thing with a square as you can a line
  • R is 'vectorised', meaning it deals with many numbers at once

Test: what is the dimension of objects we created in the last slide?
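
One way to query dimensions is with length() and dim(). The objects below are new toy examples, so they do not give away the test:

```r
v <- 1:6
length(v)  # number of elements
dim(v)     # NULL: a plain vector has no dim attribute

m <- matrix(v, nrow = 2)
dim(m)     # rows, then columns
length(m)  # still six elements in total

toy_df <- data.frame(x = 1:3, y = c("a", "b", "c"))
dim(toy_df)
```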

Vectorised code I

Python is not vectorised by default, hence:

a = [1,2,3]
b = [9,8,6]
print(a + b)
## [1, 2, 3, 9, 8, 6]

Vectorised code II

R is vectorised, meaning that it adds the two vectors element by element:

a = c(1,2,3)
b = c(9,8,6)
a + b
## [1] 10 10  9
  • The same applies to matrices - R understands matrix algebra
  • See ?matmult for more on matrix multiplication
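
A small sketch of the difference between element-wise and matrix multiplication, using an illustrative 2 x 2 matrix:

```r
m <- matrix(1:4, nrow = 2)  # filled column by column
m

m * m    # element-wise: each cell is squared
m %*% m  # matrix multiplication: rows times columns
t(m)     # transpose
```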

Vectorised code III

x <- c(1, 2, 5)
for(i in x){
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 25

Creating a new vector based on x

for(i in seq_along(x)){
  if(i == 1) x2 <- x[i]^2
  else x2 <- c(x2, x[i]^2)
}
x2
## [1]  1  4 25
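
For comparison, the loop above can be replaced by a single vectorised expression. A sketch showing that the two approaches give the same result (x2_loop and x2_vec are names chosen for this example):

```r
x <- c(1, 2, 5)

# Loop version: grow the result one element at a time
x2_loop <- c()
for (i in seq_along(x)) {
  x2_loop <- c(x2_loop, x[i]^2)
}

# Vectorised version: the operator is applied to every element at once
x2_vec <- x^2

identical(x2_loop, x2_vec)
```

The vectorised form is shorter and usually faster, which is why it tends to be preferred in R.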

Test on data types

class(c(a, b))
## [1] "numeric"
class(c(a, c))
## [1] "numeric"
class(c(b, d))
## [1] "character"

Sequences

x <- 1:5
y <- 2:6
plot(x, y)

Sequences with seq

x <- seq(1,2, by = 0.2)
length(x)
## [1] 6
x <- seq(1, 2, length.out = 5)
length(x)
## [1] 5

The data frame

The fundamental data object in R.

Create them with data.frame()

data.frame(vector_logical, vector_character, n = vector_numeric)
##   vector_logical vector_character   n
## 1           TRUE              yes 1.0
## 2           TRUE              yes 3.0
## 3          FALSE           Hello! 9.9

Oops - we forgot to assign that. Press Up or Ctrl-Up in the console, then enter:

df <- data.frame(vector_logical, vector_character, n = vector_numeric)
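
Once assigned, the data frame can be inspected. A minimal sketch that recreates df so it runs on its own (stringsAsFactors = FALSE is added here so the result is the same across R versions):

```r
vector_logical <- c(TRUE, TRUE, FALSE)
vector_character <- c("yes", "yes", "Hello!")
vector_numeric <- c(1, 3, 9.9)
df <- data.frame(vector_logical, vector_character, n = vector_numeric,
                 stringsAsFactors = FALSE)

dim(df)            # rows, then columns
names(df)          # column names
str(df)            # compact summary of the structure
sapply(df, class)  # class of each column
```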

Common dimensions

      Homogeneous      Heterogeneous
1d    Atomic vector    List
2d    Matrix           Data frame
nd    Array

Source: Wickham (2014)
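
The homogeneous/heterogeneous split can be illustrated with a short sketch (values invented for the example): an atomic vector forces everything into one class, while a list keeps each element as it is.

```r
# Atomic vector: everything is coerced to a single class
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)

# List: each element keeps its own class
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, class)
```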

Automating things

To ask R what objects it has, we can use ls().

(Everything that happens in R is a function call)

ls()
##  [1] "a"                "ab"               "b"               
##  [4] "c"                "d"                "df"              
##  [7] "i"                "pkgs"             "pop_change"      
## [10] "shortcuts"        "url"              "vector_character"
## [13] "vector_logical"   "vector_numeric"   "x"               
## [16] "x2"               "y"

Now we can automate the question: what class?

obs <- ls(pattern = "ve") # names of objects containing "ve"
sapply(X = mget(obs), FUN = class)
## vector_character   vector_logical   vector_numeric 
##      "character"        "logical"        "numeric"
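
sapply() has close relatives that differ mainly in what they return. A sketch with a made-up list:

```r
vals <- list(a = 1:3, b = c(2.5, 3.5))

lapply(vals, mean)  # always returns a list
sapply(vals, mean)  # simplifies to a named vector where possible

# vapply() is like sapply() but checks each result against a template
vapply(vals, mean, FUN.VALUE = numeric(1))
```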

Getting help in R

To find out what just happened, we can use R's internal help

The most commonly used help functions are:

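help(apply) # get help on apply
?apply      # shorthand for help(apply)
?sapply     # help on sapply (shares a page with lapply)
??apply     # search all installed help pages for "apply"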

The *apply family of functions are R's internal for loops. What about get()?

?get
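
In short, get() looks up one object by its name, and mget() does the same for several names at once, returning a list. A sketch with two hypothetical objects:

```r
# Two hypothetical objects for the demonstration
height <- 1.8
label <- "tall"

# get() returns the object whose name is given as a character string
get("height")

# mget() takes several names and returns a named list of the objects
both <- mget(c("height", "label"))
names(both)
sapply(both, class)
```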

Data manipulation and plotting paradigms

Subsetting data in R

Square brackets, [], placed after the object name subset data.

A comma separates each dimension; nothing means everything:

df[1,] # all of the 1st line of df
##   vector_logical vector_character n
## 1           TRUE              yes 1

In a 2d dataset, the following selects the value in the 3rd row of the 3rd column:

df[3,3]
## [1] 9.9
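
Rows and columns can also be selected by name or by a logical condition. A self-contained sketch that recreates df (stringsAsFactors = FALSE is an addition for consistent behaviour across R versions):

```r
df <- data.frame(vector_logical = c(TRUE, TRUE, FALSE),
                 vector_character = c("yes", "yes", "Hello!"),
                 n = c(1, 3, 9.9),
                 stringsAsFactors = FALSE)

df[, "n"]        # a column by name
df$n             # the same column with the $ shortcut
df[df$n > 2, ]   # rows where n is greater than 2

# Rows where vector_logical is TRUE, keeping one column
df[df$vector_logical, "vector_character"]
```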

Manipulating columns

New columns can be created as follows:

df$new_col = NA

Or as a function of old ones:

df$new_col = df$vector_logical + df$n
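
Note that adding a logical column to a numeric one works because TRUE and FALSE are coerced to 1 and 0. A sketch on a cut-down copy of df, which also shows how to remove a column again:

```r
df <- data.frame(vector_logical = c(TRUE, TRUE, FALSE),
                 n = c(1, 3, 9.9))

# TRUE counts as 1 and FALSE as 0 in arithmetic
total <- df$vector_logical + df$n
df$new_col <- total
df

# Assigning NULL removes the column
df$new_col <- NULL
names(df)
```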

Plotting data in R

plot() is polymorphic. Try plot(df) and ?plot:

## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   raster                /home/robin/R/x86_64-pc-linux-gnu-library/3.3
##   graphics              /usr/lib/R/library
## 
## 
## Using the first match ...
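
'Polymorphic' means that the function behaves differently depending on the class of its input. The sketch below is a toy illustration of the same mechanism, using a hypothetical generic called describe() (not a real R function):

```r
# A hypothetical generic: the method is chosen from the class of x
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) {
  paste("an object of class", class(x)[1])
}

# Method used for data frames
describe.data.frame <- function(x, ...) {
  paste("a data frame with", nrow(x), "rows and", ncol(x), "columns")
}

describe(1:3)
describe(data.frame(a = 1:2))
```

To see which classes plot() has methods for on your system, try methods(plot).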

Exercise: play + plot pop_change

Experiment with plot arguments

u = "https://www.census.gov/2010census/csv/pop_change.csv"
pop_change = read.csv(u, skip = 2)
plot(pop_change$X1910_POPULATION, pop_change$X1960_POPULATION,
     xlim = c(0, 10e6), ylim = c(0, 10e6))

Making it look nice

Hint: refer to ?plot

Plots with ggplot2

library(ggplot2)
ggplot(pop_change) +
  geom_point(aes(X1910_POPULATION, X1960_POPULATION, color = STATE_OR_REGION))

Practical: Pages 20 - 24 in Intro. to vis. sp. data in R

  • Work through the handout
  • Ask if anything doesn't make sense

Materials from Lex Comber

Evaluation form for day 2

References