Slides: rpubs.com/RobinLovelace

Introduction

Context

cdrc

This course is brought to you by the Consumer Data Research Centre (CDRC) and is funded by the ESRC's Big Data Network.

What R can do - Propensity to Cycle Tool

A bit about R

  • Developed by statisticians Ross Ihaka and Robert Gentleman
  • De-facto standard for advanced statistical analysis
  • A programming language in its own right
  • The power of the command line
  • Used by an increasing number of organisations

Why R?

  • Performance: stable, light and fast
  • Support network: documentation, community, developers
  • Reproducibility: anyone anywhere can reproduce results; enables dissemination (RPubs, RMarkdown, .RPres) - this presentation is a .Rmd file!
  • Versatility: unified solution to almost any numerical problem, graphical capabilities
  • Ethics: removes economic barriers to statistics, is open and democratic

R is up and coming I

II - Increasing popularity in academia

III - R vs Python

IV - employment market

Visualisation

  • R's visualisation capabilities have evolved over time
  • Used to create plots in the best academic journals
  • ggplot2 has revolutionised the visualisation of quantitative information in R, and (possibly) overall
  • There are different camps with different preferences when it comes to maps in R

Demonstration 1: The RStudio IDE

R as a giant calculator

5 * 5
1 + 4 * 5
4 * 5 ^ 2
sin(90)
sin(0.5 * pi)
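
The last two lines give different answers because R's trigonometric functions take angles in radians, not degrees. A minimal sketch of the conversion follows; the helper name `deg2rad` is hypothetical and not part of base R.

```r
# Hypothetical helper: convert degrees to radians (not a base R function)
deg2rad <- function(deg) deg * pi / 180

sin(90)            # 90 radians, not 90 degrees
sin(deg2rad(90))   # 90 degrees, expressed in radians

# The converted value matches the sin(0.5 * pi) line above
all.equal(sin(deg2rad(90)), sin(0.5 * pi))
```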

Objects

a <- 1
b <- 2
c <- "c"
x_thingy <- 4
a + b
a * b
a + c
a / x_thingy

Adding and removing objects

ls()
## [1] "a"        "b"        "c"        "x_thingy"
x <- x_thingy
rm(x_thingy)
x
## [1] 4
ls()
## [1] "a" "b" "c" "x"

Harmonograph example

Practical 1: Getting used to RStudio and R

  • Open RStudio and have a look around
  • Create a new project
  • Create a new R Script: pass code to the console with Ctrl-Enter
  • Use R as a calculator: what is:

\[ \pi * 9.15^2 \]

  • Explore each of the 'panes'
  • Find and write down some useful shortcuts (Alt-Shift-K on Windows/Linux)

Basic R functions and behaviour

Functions and objects

In R:

  • Everything that exists is an object
  • Everything that happens is a function
# Assignment of x
x <- 5
x
## [1] 5
# A trick to print x
(x <- 5)
(y <- x)

Functions

sin(x)
## [1] -0.9589243
exp(x)
## [1] 148.4132
factorial(x)
## [1] 120
sinx <- sin(x)

Assignment

x = 5 # the same as x <- 5
(x = x + 1)
## [1] 6

R is vector based

x <- c(1, 2, 5)
x
## [1] 1 2 5
x^2
## [1]  1  4 25
x + 2
## [1] 3 4 7
x + rev(x)

The classic programming way: verbose

x <- c(1, 2, 5)
for(i in x){
  print(i^2)
}
## [1] 1
## [1] 4
## [1] 25

Creating a new vector based on x

for(i in 1:length(x)){
  if(i == 1) x2 <- x[i]^2
  else x2 <- c(x2, x[i]^2)
}
x2
## [1]  1  4 25
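
The loop above grows `x2` one element at a time with `c()`. Because R is vector based, the same result can usually be obtained in a single step. The sketch below compares the vectorised form with a loop that fills a pre-allocated vector; the names `x2_vec` and `x2_loop` are illustrative only.

```r
x <- c(1, 2, 5)

# Vectorised: the operator is applied to every element at once
x2_vec <- x^2

# Loop with a pre-allocated result vector, avoiding repeated c() calls
x2_loop <- numeric(length(x))
for (i in seq_along(x)) {
  x2_loop[i] <- x[i]^2
}

# Both approaches give the same vector
identical(x2_vec, x2_loop)
```

`seq_along(x)` is used instead of `1:length(x)` because it also behaves sensibly when `x` is empty.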

Data types

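R has a hierarchy of data classes. When values of different classes are combined in one vector, R coerces them to whichever class is lowest in this list:

  • Logical (binary: TRUE/FALSE)
  • Integer (numeric)
  • Double (numeric)
  • Character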

Examples of data types

a <- TRUE
b <- 1:5
c <- pi
d <- "Hello Leeds"
class(a)
class(b)
class(c)
class(d)

Data type switching

ab <- c(a, b)
ab
## [1] 1 1 2 3 4 5
class(ab)
## [1] "integer"

Test on data types

class(c(a, b))
## [1] "integer"
class(c(a, c))
## [1] "numeric"
class(c(b, d))
## [1] "character"
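
The results above come from implicit coercion when vectors are combined. As a rough sketch, the underlying storage type can be inspected with `typeof()`, and conversions can be made explicit with the `as.*()` family of functions:

```r
a <- TRUE
b <- 1:5
d <- "Hello Leeds"

# typeof() reports the underlying storage type
typeof(a)
typeof(b)
typeof(pi)
typeof(d)

# Explicit conversion with the as.*() family
as.integer(a)        # TRUE becomes 1
as.character(b)      # integers become strings
as.numeric("3.14")   # a numeric string becomes a double
```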

Sequences

x <- 1:5
y <- 2:6
plot(x, y)

Sequences with seq

x <- seq(1,2, by = 0.2)
length(x)
## [1] 6
x <- seq(1, 2, length.out = 5)
length(x)
## [1] 5
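
To see why the two calls above differ in length, it can help to print the sequences themselves. The sketch below also shows `seq_along()`, which is not used in the slides but is a common companion to `seq()`; the vector `v` is made up for illustration.

```r
# Fixed step size: the number of elements follows from `by`
seq(1, 2, by = 0.2)

# Fixed number of elements: the step size follows from `length.out`
seq(1, 2, length.out = 5)

# seq_along() gives the index positions of an existing vector
v <- c(10, 20, 30)
seq_along(v)
```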

Practical 2

  • Work through Chapter 2 in Colin Gillespie's Introduction to R handout
  • Optional: install the R package
install.packages("drat")
drat::addRepo("rcourses")
install.packages("nclRintroduction", type="source")
library(nclRintroduction)
vignette(package = "nclRintroduction", "practical1")

Reading and writing data (13:30 - 14:30)

Reading data into R can be tricky

  • There are dozens of data formats
  • Some data formats require additional packages
  • Text encoding can cause additional problems

Inequality data from the World Bank

Importing a simple .csv

# Download from the internet:
# url <- "http://www.mas.ncl.ac.uk/~ncsg3/Rcourse/movie.txt"
# download.file(url, "movie.txt")
movie <- read.csv("../movie.txt", header=TRUE, stringsAsFactors=FALSE)

Practical 3: work through Chapter 3

  • What is the size of the data saved as a .csv?
  • How big is the data saved as .Rds?
  • What about if you save it as .xlsx?
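
One possible way to approach the file size questions is sketched below. It assumes the built-in `mtcars` dataset as a stand-in for the course data and writes to temporary files; the `.xlsx` case is left out because it needs an additional package.

```r
# Stand-in data: the built-in mtcars dataset (an assumption, not the course data)
df <- mtcars

csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".Rds")

write.csv(df, csv_file, row.names = FALSE)
saveRDS(df, rds_file)

# File sizes in bytes
file.size(csv_file)
file.size(rds_file)

# Check that the .Rds round trip returns the same object
identical(readRDS(rds_file), df)
```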

Practical 3 - Reading data into R in the wild

  • Identify a medium/large dataset of interest
  • Read it into R
  • If you finish early: go back to the subsetting stage
  • If you finish that: find more datasets!
  • Try saving the data with different file formats

Data manipulation (14:30 - 15:00)

An example dataset

  • It's good to start small before going BIG
  • A quick exploration with the diamonds dataset:
library(ggplot2)
diamonds[1:3,]
##   carat     cut color clarity depth table price    x    y    z
## 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31

Selecting rows

The base R way

summary(diamonds$color)
##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808
head(diamonds[diamonds$color == "I",])
##    carat       cut color clarity depth table price    x    y    z
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 17  0.30     Ideal     I     SI2  62.0    54   348 4.31 4.34 2.68
## 21  0.30      Good     I     SI2  63.3    56   351 4.26 4.30 2.71
## 27  0.24   Premium     I     VS1  62.5    57   355 3.97 3.94 2.47
## 40  0.33     Ideal     I     SI2  61.8    55   403 4.49 4.51 2.78
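
The base R selection above works in two steps: the comparison produces a logical vector with one value per row, and that vector is then used as the row index. A minimal sketch on a small made-up data frame (`gems` and its values are invented for illustration):

```r
# Made-up data for illustration only
gems <- data.frame(
  carat = c(0.23, 0.29, 0.24, 0.31),
  color = c("E", "I", "I", "J")
)

# Step 1: a logical vector, one value per row
sel <- gems$color == "I"
sel

# Step 2: use it as the row index; the empty column index keeps all columns
gems[sel, ]

# subset() is a base R shorthand for the same selection
subset(gems, color == "I")
```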

The dplyr way

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
filter(diamonds, color == "I")
## Source: local data frame [5,422 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 2   0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47
## 3   0.30     Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4   0.30      Good      I     SI2  63.3    56   351  4.26  4.30  2.71
## 5   0.24   Premium      I     VS1  62.5    57   355  3.97  3.94  2.47
## 6   0.33     Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 7   0.33     Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75
## 8   0.32     Ideal      I     SI1  60.9    55   404  4.45  4.48  2.72
## 9   0.30     Ideal      I     SI2  61.0    59   405  4.30  4.33  2.63
## 10  0.30 Very Good      I     SI1  62.6    57   405  4.25  4.28  2.67
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...

dplyr is clever

sum(diamonds$color == "I" | diamonds$color == "J")
## [1] 8230
dfs <- filter(diamonds, color == "I" | color == "J")
nrow(dfs)
## [1] 8230
arrange(diamonds, desc(z))
## Source: local data frame [53,940 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.51 Very Good      E     VS1  61.8  54.7  1970  5.12  5.15 31.80
## 2   2.00   Premium      H     SI2  58.9  57.0 12210  8.09 58.90  8.06
## 3   5.01      Fair      J      I1  65.5  59.0 18018 10.74 10.54  6.98
## 4   4.50      Fair      J      I1  65.8  58.0 18531 10.23 10.16  6.72
## 5   4.13      Fair      H      I1  64.8  61.0 17329 10.00  9.85  6.43
## 6   3.65      Fair      H      I1  67.1  53.0 11668  9.53  9.48  6.38
## 7   4.00 Very Good      I      I1  63.3  58.0 15984 10.01  9.94  6.31
## 8   3.40      Fair      D      I1  66.8  52.0 15964  9.42  9.34  6.27
## 9   4.01   Premium      J      I1  62.5  62.0 15223 10.02  9.94  6.24
## 10  4.01   Premium      I      I1  61.0  61.0 15223 10.14 10.10  6.17
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...
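
For comparison, the filter and arrange steps above can be approximated in base R. The sketch below uses a small made-up data frame (`gems` is invented for illustration) and only runs the dplyr version if that package is installed.

```r
# Made-up data for illustration only
gems <- data.frame(
  color = c("E", "I", "J", "I", "D"),
  z     = c(2.43, 2.63, 2.75, 2.47, 2.31)
)

# Base R: %in% replaces the two comparisons joined by |
base_sel <- gems[gems$color %in% c("I", "J"), ]

# Base R: order() with decreasing = TRUE plays the role of arrange(desc(z))
base_sorted <- base_sel[order(base_sel$z, decreasing = TRUE), ]
base_sorted$z

# The same steps with dplyr, run only if the package is installed
if (requireNamespace("dplyr", quietly = TRUE)) {
  dplyr_sorted <- dplyr::arrange(
    dplyr::filter(gems, color %in% c("I", "J")),
    dplyr::desc(z)
  )
  print(all.equal(base_sorted$z, dplyr_sorted$z))
}
```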

Work through data processing using the dplyr package

Summary statistics (15:00 - 15:30)

Practical 4

Plotting with R (15:45 - 16:15)

Feedback