Introduction to R and RStudio

Slides: rpubs.com/RobinLovelace

Introduction

Context

This course is brought to you by the Consumer Data Research Centre (CDRC) and is funded by the ESRC's Big Data Network.

What R can do - Propensity to Cycle Tool

See pct.bike

A bit about R

Developed by statisticians Ross Ihaka and Robert Gentleman
De-facto standard for advanced statistical analysis
A programming language in its own right
The power of the command line
Used by an increasing number of organisations

Why R?

Performance: stable, light and fast
Support network: documentation, community, developers
Reproducibility: anyone anywhere can reproduce results; enables dissemination (RPubs, RMarkdown, .RPres) - this presentation is a .Rmd file!
Versatility: unified solution to almost any numerical problem, graphical capabilities
Ethics: removes economic barriers to statistics, is open and democratic

R is up and coming I

scholar-searches1

Source: r4stats.com

II - Increasing popularity in academia

scholar-searches2

Source: r4stats.com

III - R vs Python

Source: DataCamp

IV - employment market

jobs

Source: Revolution Analytics

Visualisation

R's visualisation capabilities have evolved over time
Used to create plots in the best academic journals
ggplot2 has revolutionised the visualisation of quantitative information in R, and (possibly) overall
Thus there are different camps with different preferences when it comes to maps in R

Demonstration 1: The RStudio IDE

R as a giant calculator

5 * 5

1 + 4 * 5

4 * 5 ^ 2

sin(90)

sin(0.5 * pi)
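
The last two lines differ because R's trigonometric functions work in radians, not degrees. A minimal sketch of the conversion and of operator precedence follows; `deg2rad` is a hypothetical helper written for this example, not a base R function.

```r
# Hypothetical helper (not part of base R): convert degrees to radians
deg2rad <- function(deg) deg * pi / 180

sin(deg2rad(90))   # sine of 90 degrees: 1
sin(90)            # sine of 90 radians: about 0.894

# Operator precedence: ^ is applied before *, and * before +
4 * 5^2            # same as 4 * (5^2): 100
(4 * 5)^2          # brackets change the order: 400
1 + 4 * 5          # same as 1 + (4 * 5): 21
```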

Objects

a <- 1
b <- 2
c <- "c"
x_thingy <- 4

a + b
a * b
a + c # error: c holds a character string, so it cannot be added to a number
a / x_thingy

Adding and removing objects

ls()

## [1] "a"        "b"        "c"        "x_thingy"

x <- x_thingy
rm(x_thingy)
x

## [1] 4

ls()

## [1] "a" "b" "c" "x"

Harmonograph example

Practical 1: Getting used to RStudio and R

Open RStudio and have a look around
Create a new project
Create a new R Script: pass code to the console with Ctrl-Enter
Use R as a calculator: what is:

\[ \pi * 9.15^2 \]

Explore each of the 'panes'
Find and write down some useful shortcuts (Alt-Shift-K on Windows/Linux)

Basic R functions and behaviour

Functions and objects

In R:

Everything that exists is an object
Everything that happens is a function

# Assignment of x
x <- 5
x

## [1] 5

# A trick to print x
(x <- 5)

(y <- x)
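
To make "everything that happens is a function" concrete, here is a small sketch showing that operators and even assignment can be called like any other function. The function `square` is a made-up example, not something from the course material.

```r
x <- 5

# Operators are functions too: backticks let us call them by name
`+`(x, 1)          # same as x + 1

# Even assignment is a function call
`<-`(y, x * 2)     # same as y <- x * 2
y

# Functions are themselves objects, so they can be stored and passed around
square <- function(v) v^2   # hypothetical example function
class(square)               # "function"
sapply(1:3, square)         # applies square to 1, 2 and 3 in turn
```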

Functions

sin(x)

## [1] -0.9589243

exp(x)

## [1] 148.4132

factorial(x)

## [1] 120

sinx <- sin(x)

Assignment

x = 5 # the same as x <- 5
(x = x + 1)

## [1] 6
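
At the top level `=` and `<-` behave the same, but inside a function call they do not: `=` names an argument, while `<-` creates an object. The sketch below illustrates this with a hypothetical function `total` and a hypothetical object name `vals`.

```r
total <- function(vals) sum(vals)    # hypothetical example function

# '=' inside a call only names the argument; no object 'vals' is created
total(vals = c(2, 4, 6))
after_equals <- exists("vals", inherits = FALSE)

# '<-' inside a call creates 'vals' in the workspace, then passes it on
total(vals <- c(2, 4, 6))
after_arrow <- exists("vals", inherits = FALSE)

c(after_equals, after_arrow)         # FALSE TRUE in a fresh session
```

This is one reason many R style guides prefer `<-` for assignment and keep `=` for arguments.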

R is vector based

x <- c(1, 2, 5)
x

## [1] 1 2 5

x^2

## [1]  1  4 25

x + 2

## [1] 3 4 7

x + rev(x)

The classic programming way: verbose

x <- c(1, 2, 5)
for(i in x){
  print(i^2)
}

## [1] 1
## [1] 4
## [1] 25

Creating a new vector based on x

for(i in 1:length(x)){
  if(i == 1) x2 <- x[i]^2
  else x2 <- c(x2, x[i]^2)
}
x2

## [1]  1  4 25
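
For comparison, a sketch of the same task done two other ways: a loop that fills a pre-allocated vector instead of growing it with c(), and the vectorised one-liner. The names `x2_loop` and `x2_vec` are chosen for this example only.

```r
x <- c(1, 2, 5)

# Loop version, pre-allocating the result instead of growing it with c()
x2_loop <- numeric(length(x))
for (i in seq_along(x)) {
  x2_loop[i] <- x[i]^2
}

# Vectorised version: one expression, no explicit loop
x2_vec <- x^2

identical(x2_loop, x2_vec)   # TRUE: both give 1 4 25
```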

Data types

R has a hierarchy of data classes. When values of different classes are combined, the result is coerced to the class lowest in this list:

Logical (binary)
Integer (numeric)
Double (numeric)
Character

Examples of data types

a <- TRUE
b <- 1:5
c <- pi
d <- "Hello Leeds"

class(a)
class(b)
class(c)
class(d)

Data type switching

ab <- c(a, b)
ab

## [1] 1 1 2 3 4 5

class(ab)

## [1] "integer"

Test on data types

class(c(a, b))

## [1] "integer"

class(c(a, c))

## [1] "numeric"

class(c(b, d))

## [1] "character"
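
The implicit switching above has explicit counterparts in the as.* family of functions. A short sketch, using made-up values:

```r
# Implicit coercion when combining: the result takes the most general class
class(c(TRUE, 1L))        # logical + integer gives "integer"
class(c(1L, 2.5))         # integer + double gives "numeric"
class(c(2.5, "a"))        # double + character gives "character"

# Explicit coercion with the as.* functions
as.integer(TRUE)          # logical to integer
as.numeric("3.14")        # character to number
as.character(1:3)         # integer to character
as.logical(c(0, 1, 2))    # zero is FALSE, any other number is TRUE

# Coercion that cannot succeed yields NA (normally with a warning)
suppressWarnings(as.numeric("Hello Leeds"))
```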

Sequences

x <- 1:5
y <- 2:6
plot(x, y)

Sequences with seq

x <- seq(1,2, by = 0.2)
length(x)

## [1] 6

x <- seq(1, 2, length.out = 5)
length(x)

## [1] 5
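
A few related base R helpers are worth knowing alongside seq. The sketch below also shows why seq_along is often a safer way to index a vector than 1:length(x); the object `empty` is a made-up example.

```r
seq_len(5)                    # 1 2 3 4 5, same as 1:5
seq_along(c("a", "b", "c"))   # the indices of an existing vector: 1 2 3
seq(10, 1, by = -3)           # descending: 10 7 4 1

rep(1:2, times = 3)           # 1 2 1 2 1 2
rep(1:2, each = 3)            # 1 1 1 2 2 2

# Pitfall: with an empty vector, 1:length(x) counts down instead of stopping
empty <- numeric(0)
1:length(empty)               # 1 0
seq_along(empty)              # integer(0), so a loop over it never runs
```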

Practical 2

Work through Chapter 2 in Colin Gillespie's Introduction to R handout

Optional: install the R package

install.packages("drat")
drat::addRepo("rcourses")
install.packages("nclRintroduction", type="source")
library(nclRintroduction)
vignette(package = "nclRintroduction", "practical1")

Reading and writing data (13:30 - 14:30)

Reading data into R can be tricky

There are dozens of data formats
Some data formats require additional packages
Text encoding can cause additional problems

Inequality data from the World Bank

Download data from http://www.mas.ncl.ac.uk/~ncsg3/Rcourse/movie.txt
Open the file with Notepad++
Read it in with read.csv("movie.txt").

Importing a simple .csv

# Download from the internet (uncomment to run):
# url <- "http://www.mas.ncl.ac.uk/~ncsg3/Rcourse/movie.txt"
# download.file(url, "movie.txt")
# Read the file, keeping text columns as character rather than factor
movie <- read.csv("movie.txt", header = TRUE, stringsAsFactors = FALSE)
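
If the download is not available, the same read and write functions can be tried on a small made-up data frame. This is only a sketch: `films` and its values are invented for illustration and are not the course data.

```r
# Small made-up data frame standing in for a real dataset
films <- data.frame(
  title  = c("Film A", "Film B", "Film C"),
  rating = c(7.1, 6.4, 8.0),
  year   = c(1999L, 2005L, 2012L),
  stringsAsFactors = FALSE
)

# Temporary files, so nothing is left behind in the project folder
csv_file <- tempfile(fileext = ".csv")
rds_file <- tempfile(fileext = ".Rds")

# Text format: readable in any editor, but column types are guessed on reading
write.csv(films, csv_file, row.names = FALSE)
films_csv <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)

# R's own binary format: column types are stored exactly
saveRDS(films, rds_file)
films_rds <- readRDS(rds_file)

str(films_csv)
identical(films, films_rds)   # TRUE
```

Calling file.size(csv_file) and file.size(rds_file) afterwards gives the size of each file in bytes.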

Practical 3: work through Chapter 3

What is the size of the data saved as a .csv?
How big is the data saved as .Rds?
What about if you save it as .xlsx?

Practical 3 - Reading R in the wild

Identify a medium/large dataset of interest
Read it into R
If you finish early: go back to the subsetting stage
If you finish that: find more datasets!
Try saving the data with different file formats

Data manipulation (14:30 - 15:00)

An example dataset

It's good to start small before going BIG
A quick exploration with the diamonds dataset:

library(ggplot2)
diamonds[1:3,]

##   carat     cut color clarity depth table price    x    y    z
## 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23    Good     E     VS1  56.9    65   327 4.05 4.07 2.31

Selecting rows

The base R way

summary(diamonds$color)

##     D     E     F     G     H     I     J 
##  6775  9797  9542 11292  8304  5422  2808

head(diamonds[diamonds$color == "I",])

##    carat       cut color clarity depth table price    x    y    z
## 4   0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 7   0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47
## 17  0.30     Ideal     I     SI2  62.0    54   348 4.31 4.34 2.68
## 21  0.30      Good     I     SI2  63.3    56   351 4.26 4.30 2.71
## 27  0.24   Premium     I     VS1  62.5    57   355 3.97 3.94 2.47
## 40  0.33     Ideal     I     SI2  61.8    55   403 4.49 4.51 2.78
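
The expression inside the square brackets is just a logical vector, one TRUE or FALSE per row. A sketch on a tiny made-up data frame (`gems`, invented here so the example runs without ggplot2) makes this visible:

```r
# Made-up data with the same flavour as diamonds
gems <- data.frame(
  carat = c(0.23, 0.29, 0.24, 0.31, 0.30),
  color = c("E", "I", "I", "J", "I"),
  price = c(326, 334, 336, 335, 348),
  stringsAsFactors = FALSE
)

# The condition is a logical vector with one element per row
is_I <- gems$color == "I"
is_I

# Rows go before the comma, columns after; an empty slot means "all"
gems[is_I, ]

# Conditions can be combined, and columns selected by name
gems[gems$color %in% c("I", "J") & gems$price > 334, c("carat", "price")]

# subset() is a base R shorthand that looks up column names for you
subset(gems, color == "I" & carat >= 0.3)
```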

The dplyr way

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

filter(diamonds, color == "I")

## Source: local data frame [5,422 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 2   0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47
## 3   0.30     Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4   0.30      Good      I     SI2  63.3    56   351  4.26  4.30  2.71
## 5   0.24   Premium      I     VS1  62.5    57   355  3.97  3.94  2.47
## 6   0.33     Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 7   0.33     Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75
## 8   0.32     Ideal      I     SI1  60.9    55   404  4.45  4.48  2.72
## 9   0.30     Ideal      I     SI2  61.0    59   405  4.30  4.33  2.63
## 10  0.30 Very Good      I     SI1  62.6    57   405  4.25  4.28  2.67
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...

dplyr is clever

sum(diamonds$color == "I" | diamonds$color == "J")

## [1] 8230

dfs <- filter(diamonds, color == "I" | color == "J")
nrow(dfs)

## [1] 8230

arrange(diamonds, desc(z))

## Source: local data frame [53,940 x 10]
## 
##    carat       cut  color clarity depth table price     x     y     z
##    (dbl)    (fctr) (fctr)  (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1   0.51 Very Good      E     VS1  61.8  54.7  1970  5.12  5.15 31.80
## 2   2.00   Premium      H     SI2  58.9  57.0 12210  8.09 58.90  8.06
## 3   5.01      Fair      J      I1  65.5  59.0 18018 10.74 10.54  6.98
## 4   4.50      Fair      J      I1  65.8  58.0 18531 10.23 10.16  6.72
## 5   4.13      Fair      H      I1  64.8  61.0 17329 10.00  9.85  6.43
## 6   3.65      Fair      H      I1  67.1  53.0 11668  9.53  9.48  6.38
## 7   4.00 Very Good      I      I1  63.3  58.0 15984 10.01  9.94  6.31
## 8   3.40      Fair      D      I1  66.8  52.0 15964  9.42  9.34  6.27
## 9   4.01   Premium      J      I1  62.5  62.0 15223 10.02  9.94  6.24
## 10  4.01   Premium      I      I1  61.0  61.0 15223 10.14 10.10  6.17
## ..   ...       ...    ...     ...   ...   ...   ...   ...   ...   ...
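
The dplyr verbs can also be chained with the pipe operator, so each step reads from top to bottom. The sketch below assumes dplyr is installed and uses a small made-up data frame (`gems`) rather than diamonds; the object name `top` is also just for this example.

```r
library(dplyr)

# Made-up data with the same flavour as diamonds
gems <- data.frame(
  carat = c(0.23, 0.29, 0.24, 0.31, 0.30),
  color = c("E", "I", "I", "J", "I"),
  price = c(326, 334, 336, 335, 348),
  stringsAsFactors = FALSE
)

# Keep colours I and J, sort by price (highest first), keep three columns
top <- gems %>%
  filter(color %in% c("I", "J")) %>%
  arrange(desc(price)) %>%
  select(color, carat, price)

top

# Number of rows per colour
count(gems, color)
```

Here `color %in% c("I", "J")` is an alternative to writing `color == "I" | color == "J"`.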

Data processing with the dplyr package

Work through the worked example here:

http://r4ds.had.co.nz/transform.html

Summary statistics (15:00 - 15:30)

Practical 4

Plotting with R (15:45 - 16:15)

Feedback

Your feedback is greatly appreciated.

Please complete the online form here:

https://leeds.onlinesurveys.ac.uk/rstudiojan16