0.1 Introduction

This documents serves as a reference sheet with example for the basics of R programming. It is more of brain dump of functions and examples I find to be useful.

1 Data Types and Basic Operations

1.1 Basic Operators

+ - * /

1.1.1 Logical Operators

< <= > >= == != | &

1.1.2 Assignment Operators

<- is the assignment operator

x <- 1

1.2 Atomic classes of objects

  • Character
  • Numeric
  • Integer
  • Complex
  • Logical

1.2.1 character

Use " "

c("My", "name", "is")
## [1] "My"   "name" "is"

Combining elements

my_char <- c("My", "name", "is")
paste(my_char, collapse = " ")
## [1] "My name is"
# Vectorized
paste(1:3, c("X", "Y", "Z"), sep = "")
## [1] "1X" "2Y" "3Z"
# recycling
paste(LETTERS, 1:4, sep = "-")
##  [1] "A-1" "B-2" "C-3" "D-4" "E-1" "F-2" "G-3" "H-4" "I-1" "J-2" "K-3"
## [12] "L-4" "M-1" "N-2" "O-3" "P-4" "Q-1" "R-2" "S-3" "T-4" "U-1" "V-2"
## [23] "W-3" "X-4" "Y-1" "Z-2"

1.2.2 numeric

  • Generally treated as a double real number or “numeric”
class(1)
## [1] "numeric"

1.2.3 integer

class (1L)
## [1] "integer"
  • Inf represents infinity
1/0
## [1] Inf
  • Inf can be used in calculations
1/Inf
## [1] 0
  • NaN represent undefined value or missing value
0/0
## [1] NaN

1.2.4 complex

z <- complex(real = 1, imaginary = 2)
z
## [1] 1+2i

1.2.5 logical

my_vector <-1:10
my_logical <- my_vector == 5
my_logical
##  [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE

1.3 Attributes

Sample attributes:

  • names, dimnames
  • dimentions
  • class
  • length
temp <- 1:10
length (temp)
## [1] 10
  • other meta data

Attributes can be used using attributes()

1.4 Sequences

Simplest way is using :

temp <- 1:10
temp
##  [1]  1  2  3  4  5  6  7  8  9 10
temp <- pi:10
temp
## [1] 3.141593 4.141593 5.141593 6.141593 7.141593 8.141593 9.141593
temp <- 15:1
temp
##  [1] 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1

Alternatively you can use seq()

my_seq <- seq(0, 20)
my_seq <- seq(0, 10, by=0.5)
my_seq <- seq(5, 10, length=30)

1.4.1 seq_along

seq_along (my_seq)
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
## [24] 24 25 26 27 28 29 30

1.4.2 along.with

my_seq <- seq(along.with = my_seq)

1.4.3 rep()

rep(0, times = 40)
##  [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0
rep(c(0, 1, 2), times = 10)
##  [1] 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2 0 1 2
rep(c(0, 1, 2), each = 10)
##  [1] 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

1.5 Vectors

  • Vectors can only contain objects of same class

  • List is a vector that can contain objects of different classes

1.5.1 Creating vectors

  • Vector can be created using vector() function
x <- vector("numeric", length = 10)

x
##  [1] 0 0 0 0 0 0 0 0 0 0
  • c() function can be used to create vectors
x <- c(0.5, 0.6)
x <- c(TRUE, FALSE) ## logical
x <- c(T, F) ## logical
x <- c("a", "b", "c") ## character
x <- 9:29 ## integer
x <- c(1+0i, 2+4i) ## complex

1.5.2 Coercion

Coercion occurs when mixing objects

y <- c(1.7, "a") ## character
y
## [1] "1.7" "a"
y <- c(TRUE, 2) ## numeric
y
## [1] 1 2
y <- c("a", TRUE) ## character
y
## [1] "a"    "TRUE"

1.5.3 Explicit Coercion

use the as.* functions

x <- 0:6
class(x)
## [1] "integer"
as.numeric(x)
## [1] 0 1 2 3 4 5 6
as.logical(x)
## [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
as.character(x)
## [1] "0" "1" "2" "3" "4" "5" "6"

Note: nonsensical coercion results in NA

x <- c("a", "b", "c")
as.numeric(x)
## Warning: NAs introduced by coercion
## [1] NA NA NA

1.5.4 Subsetting Vectors

x <- rnorm(100)
x[1:10]
##  [1] -0.09395277 -0.57027407  1.94268854 -0.68255566  0.59790498
##  [6]  0.91508766  0.26957541 -0.81158391 -0.23726188 -1.50521598
#subset everything except 2 and 10
x[c(-2, -10)]
##  [1] -0.09395277  1.94268854 -0.68255566  0.59790498  0.91508766
##  [6]  0.26957541 -0.81158391 -0.23726188  0.25465048  2.81339326
## [11] -1.19823288  1.49524733  0.04258432 -0.55912845  1.95869557
## [16]  0.04002838 -0.75245094 -0.12954895 -0.02649436  1.01778248
## [21] -1.65050478  0.45054472  1.42448902 -0.72963672  1.07730775
## [26] -0.29340742  0.55711576 -0.09962152 -0.43695608 -0.34377805
## [31]  0.36087754 -1.29022680 -0.30197948  0.23639093 -0.15399597
## [36]  0.30102203  0.29552396 -0.84844166  0.74993891 -0.39342129
## [41]  0.62084805  0.68010494  0.48228051 -0.52695683 -1.40116875
## [46] -0.68463419 -2.29923708  0.71177792 -1.12863548 -0.91979920
## [51]  0.51470227 -1.05691883 -0.11445467 -0.75333934  0.59201581
## [56]  1.37231180  1.34272400 -0.00330700 -0.15438057  0.95481515
## [61] -1.10379628 -1.55768665  1.10785064  1.41759311  1.18387602
## [66]  0.46881241 -2.35489957 -1.35077875 -0.57805577  0.62606079
## [71] -1.56333473  0.93642731 -1.10128476  1.54773458  1.12321609
## [76] -0.74726358  1.00009303 -0.98599075 -0.37965946  1.58991656
## [81] -0.08269716 -0.07365312  0.84002278 -0.31310774  0.35137289
## [86] -1.32260399 -1.45120209 -0.47992977 -1.09667338 -1.21224085
## [91]  0.58221614 -1.16962665  1.05707835  0.22598321  0.08169949
## [96]  1.11492505 -0.90819208  2.73815977
x[-c(2, 10)]
##  [1] -0.09395277  1.94268854 -0.68255566  0.59790498  0.91508766
##  [6]  0.26957541 -0.81158391 -0.23726188  0.25465048  2.81339326
## [11] -1.19823288  1.49524733  0.04258432 -0.55912845  1.95869557
## [16]  0.04002838 -0.75245094 -0.12954895 -0.02649436  1.01778248
## [21] -1.65050478  0.45054472  1.42448902 -0.72963672  1.07730775
## [26] -0.29340742  0.55711576 -0.09962152 -0.43695608 -0.34377805
## [31]  0.36087754 -1.29022680 -0.30197948  0.23639093 -0.15399597
## [36]  0.30102203  0.29552396 -0.84844166  0.74993891 -0.39342129
## [41]  0.62084805  0.68010494  0.48228051 -0.52695683 -1.40116875
## [46] -0.68463419 -2.29923708  0.71177792 -1.12863548 -0.91979920
## [51]  0.51470227 -1.05691883 -0.11445467 -0.75333934  0.59201581
## [56]  1.37231180  1.34272400 -0.00330700 -0.15438057  0.95481515
## [61] -1.10379628 -1.55768665  1.10785064  1.41759311  1.18387602
## [66]  0.46881241 -2.35489957 -1.35077875 -0.57805577  0.62606079
## [71] -1.56333473  0.93642731 -1.10128476  1.54773458  1.12321609
## [76] -0.74726358  1.00009303 -0.98599075 -0.37965946  1.58991656
## [81] -0.08269716 -0.07365312  0.84002278 -0.31310774  0.35137289
## [86] -1.32260399 -1.45120209 -0.47992977 -1.09667338 -1.21224085
## [91]  0.58221614 -1.16962665  1.05707835  0.22598321  0.08169949
## [96]  1.11492505 -0.90819208  2.73815977

1.5.5 Named elements

vect <- c(foo = 11, bar = 2, norf = NA)
#view names
names (vect)
## [1] "foo"  "bar"  "norf"

Named vector after creation

vect2 <- c(11,2,NA)
names(vect2)<- c("foo", "bar", "norf")

View names

vect[c("foo", "bar")]
## foo bar 
##  11   2

1.6 Matrices

  • are vectors with dimension attribute (a integer vector of lenth 2)
m <- matrix(nrow = 2, ncol = 3)
m
##      [,1] [,2] [,3]
## [1,]   NA   NA   NA
## [2,]   NA   NA   NA
dim(m)
## [1] 2 3
attributes(m)
## $dim
## [1] 2 3
  • matricies are column wise
m <- matrix (1:6, nrow = 2, ncol =3)
m
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
  • can be created from vectors by adding dimension attribute
m<-1:10
m
##  [1]  1  2  3  4  5  6  7  8  9 10
dim(m)<- c(2,5)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

1.6.1 column bind and row bind

  • Create matrices by column-binding (cbind) or row-binding (rbind)
x <- 1:3
y <- 10:12
cbind (x,y)
##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
rbind (x,y)
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12

1.6.2 Subsetting a Matrix

Matrices can be subsetted in the usual way with (i,j) type indices.

x <- matrix(1:6, 2, 3)
x[1, 2]
x[2, 1]

Subsetting a single element resutls in returning a vector of length 1 rather than a 1x1 matrix. This can be changed using ‘drop = FALSE’

x <- matrix(1:6, 2, 3)
x[1, 2]
x[2, 1]

Similarly, subsetting a single column or a single row will give you a vector, not a matrix (by default).

x <- matrix(1:6, 2, 3)
x[1, ]
x[1, , drop = FALSE]

1.7 Lists

As mentioned before, lists are a vector which contain elements of different classes

x <- list(1, "a", TRUE, 1 + 4i)
x
## [[1]]
## [1] 1
## 
## [[2]]
## [1] "a"
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] 1+4i

1.7.1 Subsetting Lists

x <- list(foo = 1:4, bar = 0.6)
x$foo
## [1] 1 2 3 4
x$bar
## [1] 0.6
x[["bar"]]
## [1] 0.6
x["bar"]
## $bar
## [1] 0.6
x <- list(foo = 1:4, bar = 0.6, baz = "hello")
x[c(1, 3)]
## $foo
## [1] 1 2 3 4
## 
## $baz
## [1] "hello"

The [[ operator can be used with computed indices; $ can only be used with literal names.

x <- list(foo = 1:4, bar = 0.6, baz = "hello")
name <- "foo"
x[[name]]  ## computed index for ‘foo’
x$name     ## element ‘name’ doesn’t exist!
x$foo

1.7.2 Subsetting Nested Elements of a List

The [[ can take an integer sequence.

x <- list(a = list(10, 12, 14), b = c(3.14, 2.81))
x[[c(1, 3)]]
x[[1]][[3]]
x[[c(2, 1)]]

1.8 Factors

Used to represent categorical data (ie male and female)

x <- factor(c("yes", "yes", "no", "yes", "no"))
x
## [1] yes yes no  yes no 
## Levels: no yes
table (x)
## x
##  no yes 
##   2   3
unclass(x)
## [1] 2 2 1 2 1
## attr(,"levels")
## [1] "no"  "yes"

1.8.1 Setting order of levels

  • Very important in linear modeling since first level is baseline level
x<- factor(c("yes", "yes", "no", "yes", "no"),
levels = c("yes", "no"))
x
## [1] yes yes no  yes no 
## Levels: yes no

1.9 Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations.

  • is.na() is used to test objects if they are NA

  • is.nan() is used to test for NaN

  • NA values have a class also, so there are integer NA, character NA, etc.

  • A NaN value is also NA but the converse is not true

  • Be very careful when combining missing values with logical expressions

x <- c(1, 2, NA, 10, 3)
is.na(x)
## [1] FALSE FALSE  TRUE FALSE FALSE
is.nan(x)
## [1] FALSE FALSE FALSE FALSE FALSE
x <- c(1, 2, NaN, NA, 4)
is.na(x)
## [1] FALSE FALSE  TRUE  TRUE FALSE
is.nan(x)
## [1] FALSE FALSE  TRUE FALSE FALSE

1.9.1 Selecting everything with NA

x[is.na(x)]
## [1] NaN  NA
x[!is.na(x)]
## [1] 1 2 4
x[is.na(x)] <-0

1.9.2 Subset with no missing values

x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
good
x[good]
y[good]

1.9.3 Set all NA to zeros

1.10 Data Frames

-Used to save tabular data

-List of same length

-attribute called ‘row.names’

  • can be created using ‘read.table()’ and ‘read.csv()’

  • be careful when converting to matrix ‘data.matrix()’

x <- data.frame(foo = 1:4, bar = c(T, T, F, F))
x
##   foo   bar
## 1   1  TRUE
## 2   2  TRUE
## 3   3 FALSE
## 4   4 FALSE
nrow(x)
## [1] 4
ncol(x)
## [1] 2

1.11 Names

1.11.1 Objects

R objects can have names as shown before.

x <- 1:3
names(x)
## NULL
names(x) <- c("foo", "bar", "norf")
x
##  foo  bar norf 
##    1    2    3
names(x)
## [1] "foo"  "bar"  "norf"

1.11.2 Lists

x <- list (a =1, b = 2, c = 3)

1.11.3 Matrices

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))
m
##   c d
## a 1 3
## b 2 4

2 Reading and Writing Data

2.1 Reading Data

There are a few principal functions reading data into R. - read.table, read.csv, for reading tabular data

  • readLines, for reading lines of a text file

  • source, for reading in R code files (inverse of dump)

  • dget, for reading in R code files (inverse of dput)

  • load, for reading in saved workspaces

  • unserialize, for reading single R objects in binary form

2.1.1 Reading Data Files with read.table

The read.table function is one of the most commonly used functions for reading data.

For small to moderately sized datasets, you can usually call read.table without specifying any other arguments

data <- read.table("foo.txt")

2.1.2 Reading in Larger Datasets with read.table

With much larger datasets, doing the following things will make your life easier and will prevent R from choking.

  • Read the help page for read.table, which contains many hints

  • Make a rough calculation of the memory required to store your dataset. If the dataset is larger than the amount of RAM on your computer, you can probably stop right here.

  • Set comment.char = "" if there are no commented lines in your file.

  • Use the colClasses argument. Specifying this option instead of using the default can make ’read.table’ run MUCH faster, often twice as fast. In order to use this option, you have to know the class of each column in your data frame. If all of the columns are “numeric”, for example, then you can just set colClasses = "numeric". A quick an dirty way to figure out the classes of each column is the following:

initial <- read.table("datatable.txt", nrows = 100)
classes <- sapply(initial, class)
tabAll <- read.table("datatable.txt",
                     colClasses = classes)

2.1.3 Reading Lines of a Text File

con <- gzfile("words.gz")
x <- readLines(con, 10)
x

writeLines takes a character vector and writes each element one line at a time to a text file.

readLines can be useful for reading in lines of webpages

## This might take time
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
> head(x)

2.2 Writing Data

There are analogous functions for writing data to files - write.table - writeLines - dump - dput - save - serialize

2.2.1 Textual Formats

  • dumping and dputing are useful because the resulting textual format is edit-able, and in the case of corruption, potentially recoverable.

  • Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn’t have to specify it all over again.

  • Textual formats can work much better with version control programs like subversion or git which can only track changes meaningfully in text files

  • Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem

  • Textual formats adhere to the “Unix philosophy”

  • Downside: The format is not very space-efficient

2.2.2 dput-ting R Objects

Another way to pass data around is by deparsing the R object with dput and reading it back in using dget.

y <- data.frame(a = 1, b = "a")
dput(y)
structure(list(a = 1,
               b = structure(1L, .Label = "a",
                             class = "factor")),
          .Names = c("a", "b"), row.names = c(NA, -1L),
          class = "data.frame")
dput(y, file = "y.R")
new.y <- dget("y.R")
new.y

2.2.3 Dumping R Objects

Multiple objects can be deparsed using the dump function and read back in using source.

x <- "foo"
y <- data.frame(a = 1, b = "a")
dump(c("x", "y"), file = "data.R")
rm(x, y)
source("data.R")
y
x

2.2.4 Interfaces to the Outside World

Data are read in using connection interfaces. Connections can be made to files (most common) or to other more exotic things.

  • file, opens a connection to a file

  • gzfile, opens a connection to a file compressed with gzip

  • bzfile, opens a connection to a file compressed with bzip2

  • url, opens a connection to a webpage

2.2.5 File Connections

> str(file)
function (description = "", open = "", blocking = TRUE,
          encoding = getOption("encoding"))
  • description is the name of the file
  • open is a code indicating
  • “r” read only
  • “w” writing (and initializing a new file)
  • “a” appending
  • “rb”, “wb”, “ab” reading, writing, or appending in binary mode (Windows)

2.2.6 Connections

In general, connections are powerful tools that let you navigate files or other external objects. In practice, we often don’t need to deal with the connection interface directly.

con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)

is the same as

data <- read.csv("foo.txt")

3 Control Structures

Control structures in R allow you to control the flow of execution of the program, depending on runtime conditions. Common structures are

3.1 if

Example of how to use if:

if() { ## do something } else if() { ## do something different } else { ## do something different }

3.1.1 Setting variables using if

y <- if(x > 3) {
        10
} else {
        0
}

3.2 for

for loops take an interator variable and assign it successive values from a sequence or vector. For loops are most commonly used for iterating over the elements of an object (list, vector, etc.)

for(i in 1:10) {
        print(i)
}

3.2.1 Nested for loops

for loops can be nested.

x <- matrix(1:6, 2, 3)

for(i in seq_len(nrow(x))) {
        for(j in seq_len(ncol(x))) {
                print(x[i, j])
        }
}

3.3 while

While loops begin by testing a condition. If it is true, then they execute the loop body. Once the loop body is executed, the condition is tested again, and so forth.

count <- 0
while(count < 10) {
        print(count)
        count <- count + 1
}

3.3.1 multi conditional while

Sometimes there will be more than one condition in the test.

z <- 5

while(z >= 3 && z <= 10) {
        print(z)
        coin <- rbinom(1, 1, 0.5)

        if(coin == 1) {  ## random walk
                z <- z + 1
        } else {
                z <- z - 1
        }
}

3.4 repeat

Repeat initiates an infinite loop; these are not commonly used in statistical applications but they do have their uses. The only way to exit a repeat loop is to call break.

x0 <- 1
tol <- 1e-8

repeat {
        x1 <- computeEstimate()

        if(abs(x1 - x0) < tol) {
                break
        } else {
                x0 <- x1
        }
}

The previous loop is a bit dangerous because there’s no guarantee it will stop. Better to set a hard limit on the number of iterations (e.g. using a for loop) and then report whether convergence was achieved or not.

3.5 Special control functions

3.5.1 Next, return

next is used to skip an iteration of a loop

for(i in 1:100) {
        if(i <= 20) {
                ## Skip the first 20 iterations
                next
        }
        ## Do something here
}

return signals that a function should exit and return a given value

3.6 lapply and sapply

3.7 Functions to add:

rnorm
sample
print () prints values identical(vect, vect2)