Introduction to Statistical Programming in R

David Robinson and Neo Christopher Chung

How to Read These Slides

In these slides, we show blocks of R code, which are immediately followed by their output:

print("hello world")
[1] "hello world"

The gray box shows the original R code, which you can copy and paste into your own R console to try yourself. The white box shows the code's output: you can compare it to your own results (or just trust us that that's the output).

Numeric variables

Assigning a variable

You store a value in a variable using the = operator:

x = 42

This gives the variable x a value of 42. You can show the value of x with:

print(x)
[1] 42

You can also assign a variable with <-: this is equivalent.

x <- 42

Variable names

Variable names consist of letters, digits, periods, and underscores (_), and cannot start with a digit or an underscore. A common convention is to use periods in place of spaces.

Legal variable names include:

  • my.variable
  • my_variable

Illegal names include:

  • my-variable
  • dave's.variable
  • 2ndvariable
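One way to check a candidate name is base R's make.names function, which rewrites a string into a syntactically valid name. A name that comes back unchanged was already legal. A small sketch (the variable names candidates, fixed, and is.legal are only for illustration):

```r
# Candidate names taken from the lists above
candidates = c("my.variable", "my_variable", "my-variable", "2ndvariable")

# make.names() converts each string into a syntactically valid name
fixed = make.names(candidates)

# A name that make.names() leaves unchanged was already legal
is.legal = fixed == candidates
print(is.legal)
```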

Using R like a scientific calculator

You can perform mathematical operations using +, -, *, and /:

x = 6 + 4
print(x)
[1] 10
x / 2
[1] 5
y = 4
x / y
[1] 2.5
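As on a calculator, multiplication and division are evaluated before addition and subtraction, and parentheses change the order. The sketch below also shows R's integer-division (%/%) and remainder (%%) operators, which are not used elsewhere in these slides; the variable names are illustrative.

```r
# Multiplication binds tighter than addition
a = 2 + 3 * 4      # 3 * 4 is evaluated first
b = (2 + 3) * 4    # parentheses force the addition first

# Integer division and remainder
q = 7 %/% 2        # how many whole times 2 fits into 7
r = 7 %% 2         # what is left over

print(c(a, b, q, r))
```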

Using R like a scientific calculator

You can use exponentiation with ^, or calculate the natural log:

x^2
[1] 100
y^3
[1] 64
log(x)
[1] 2.303
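log computes the natural log by default. A short sketch of a few related base R functions, in case you need another base or want to undo the log:

```r
x = 10

log10(x)            # base-10 logarithm
log2(8)             # base-2 logarithm
log(8, base = 2)    # the same thing, using the base argument of log
exp(log(x))         # exp() undoes the natural log
sqrt(x^2)           # square root
```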

Assigning variables: FAQ

  • What is the difference between <- and =?
    • In 99% of cases, they act exactly the same, so it's personal preference. They differ only in rare cases: for example, inside a function call, = names an argument, while <- assigns a variable.
  • When do you need print(x) to display a variable, and when x?
    • When working in the R interactive terminal, the result of each line is displayed after it is evaluated, so print is unnecessary. When you source a .R file, you need print(x) on the line or the value won't be displayed.
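One of the rare cases where <- and = behave differently is inside a function call. The sketch below wraps the comparison in a small helper function (check.assignment is a made-up name) so that it does not touch any variables you already have:

```r
check.assignment = function() {
  # = inside a call only names the argument x; no variable is created
  mean(x = c(1, 2, 3))
  after.equals = exists("x", envir = environment(), inherits = FALSE)

  # <- inside a call assigns x as a side effect, then passes it to mean
  mean(x <- c(1, 2, 3))
  after.arrow = exists("x", envir = environment(), inherits = FALSE)

  c(equals = after.equals, arrow = after.arrow)
}

result = check.assignment()
print(result)
```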

Assigning variables: FAQ

  • Why is there a [1] before each result?
    • You'll find out in the next section!

Vectors

You may have noticed the [1] at the start of each result. That's because a single number in R is actually represented as a vector of length 1. The [1] shows the index of the first element printed on that line of output.
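You can check this yourself. A minimal sketch (the exact line breaks in the last output depend on the width of your console):

```r
x = 42

is.vector(x)   # a single number is already a vector
length(x)      # of length 1
x[1]           # so it can be indexed like any other vector

# With a longer vector, each output line starts with
# the index of the first element on that line
1:30
```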

Creating and combining vectors

Create a vector using the function c, and display it by simply typing its variable name:

v1 = c(1, 5.5, 1e2)
v1
[1]   1.0   5.5 100.0

You can also use c to combine two vectors:

v2 = c(.14, 0, -2)
v3 = c(v1, v2)

Quiz

What numbers are stored in v3?

Character vectors

Not all values you could want to store in R are numeric. You could store:

  • subject names
  • gene sequences (GCAT)

Character values (strings) are written inside either single or double quotation marks.

chv1 = "hello"
chv2 = c(chv1, 'world')
chv2
[1] "hello" "world"

Subsetting a vector

Use square brackets to retrieve a value from a vector, or multiple values:

chv2[1]
[1] "hello"
v3[2]
[1] 5.5
v3[2:5]
[1]   5.50 100.00   0.14   0.00

Of course, you can store the result in another variable:

v3_sub = v3[2:5]
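Square brackets accept more than a single index or a range. A short sketch of two other common patterns, rebuilding v3 so the block runs on its own:

```r
v1 = c(1, 5.5, 1e2)
v2 = c(.14, 0, -2)
v3 = c(v1, v2)

picked = v3[c(1, 6)]      # a vector of indices picks those elements
dropped = v3[-1]          # a negative index drops that element
last = v3[length(v3)]     # length() gives the index of the last element

print(picked)
print(dropped)
print(last)
```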

Operations on vectors

Mathematical operations on a vector apply to all elements:

v1 + 2
[1]   3.0   7.5 102.0
sin(v1)
[1]  0.8415 -0.7055 -0.5064

You can also apply an operation to a subset of a vector:

sin(v1[2])
[1] -0.7055
sin(v1)[2]
[1] -0.7055

Operations on vectors

Similarly, you can perform operations between two vectors:

v1 * v2
[1]    0.14    0.00 -200.00
v1 / v2
[1]   7.143     Inf -50.000

A dot product (or inner product), which is the sum of the products of corresponding elements in two vectors, can be computed with the %*% operator:

v1 %*% v2
       [,1]
[1,] -199.9
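To see that %*% really is "multiply element by element, then sum", you can compute the same number by hand. A small sketch (by.hand and by.operator are illustrative names):

```r
v1 = c(1, 5.5, 1e2)
v2 = c(.14, 0, -2)

# Element-wise products, then their sum
by.hand = sum(v1 * v2)

# %*% returns a 1 x 1 matrix; as.numeric() turns it into a plain number
by.operator = as.numeric(v1 %*% v2)

print(c(by.hand, by.operator))
```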

Operations on vectors

For certain operations, such as a dot product, the two vectors must have the same length. Otherwise, you will get an error:

v1 %*% v3
Error: non-conformable arguments

Operations on vectors

When in doubt, check the length of a vector:

length(v1)
[1] 3
length(v3)
[1] 6
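Checking lengths matters for element-wise operators too, because they do not raise an error the way %*% does. Going slightly beyond the slide: when the lengths differ, R recycles (repeats) the shorter vector, and only warns if the longer length is not a multiple of the shorter one. A sketch:

```r
v1 = c(1, 5.5, 1e2)
v2 = c(.14, 0, -2)
v3 = c(v1, v2)

# v1 has length 3 and v3 has length 6, so v1 is used twice
recycled = v1 + v3
print(recycled)

# Writing the repetition out explicitly gives the same result
explicit = c(v1, v1) + v3
print(explicit)
```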

Operations on vectors

Similarly, you can't use math operations on a character vector:

chv2 + 10
Error: non-numeric argument to binary operator

You can verify the class of a variable:

class(v2)
[1] "numeric"
class(chv2)
[1] "character"

Operations on vectors

It is possible to store numeric values in a character vector (e.g., patient numbers):

v4 = c("10", "42")
class(v4)
[1] "character"

And since R thinks v4 is a character vector, an attempt to apply a math operation will give an error:

v4 / 10
Error: non-numeric argument to binary operator

Operations on vectors

If you would like to do math with the numeric values in a character vector, you can tell R to treat v4 as numeric:

as.numeric(v4) + 10
[1] 20 52
v4 = as.numeric(v4)
class(v4)
[1] "numeric"

Quiz

What would happen if you try to force chv2 into a numeric vector by applying as.numeric to chv2?

Operations on vectors

You can obtain summary statistics of a vector:

summary(v3)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  -2.00    0.04    0.57   17.40    4.38  100.00 

Similarly, R includes a number of built-in functions with intuitive names to compute various statistics about a vector:

mean(v3)
[1] 17.44
sum(v3)
[1] 104.6
var(v3)
[1] 1642
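A few more base R functions in the same family, shown as a sketch. Note that sd (the standard deviation) is simply the square root of var:

```r
v1 = c(1, 5.5, 1e2)
v2 = c(.14, 0, -2)
v3 = c(v1, v2)

median(v3)            # middle value
sd(v3)                # standard deviation
sqrt(var(v3))         # the same number, computed from the variance
range(v3)             # smallest and largest value
quantile(v3, 0.25)    # first quartile, as reported by summary
```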

Operations on vectors

Elements in a vector can have names, which you can access with:

names(v2)
NULL

NULL means that the elements in v2 currently do not have names. We can assign names:

names(v2) = c("Cat", "Dog", "Rat")
names(v2)
[1] "Cat" "Dog" "Rat"
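Once a vector has names, you can use them inside square brackets instead of numeric indices. A minimal sketch:

```r
v2 = c(.14, 0, -2)
names(v2) = c("Cat", "Dog", "Rat")

v2["Dog"]              # one element, selected by name
v2[c("Rat", "Cat")]    # several elements, in the order requested
v2                     # printing the vector now shows the names too
```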

Matrices

Matrices are like two-dimensional vectors, organizing values into rows and columns.

Creating and combining a matrix

To create one, use the intuitively named function matrix:

ma = matrix(1:6, nrow=3, ncol=2)
ma
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
mb = matrix(7:9, nrow=3, ncol=1)

cbind stands for column bind (and rbind for row bind); it merges two matrices into one by placing them side by side:

m = cbind(ma, mb)
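rbind works the same way, but stacks matrices on top of each other, so they need the same number of columns. The sketch below also uses the byrow argument of matrix, which fills the values row by row instead of column by column (mr and stacked are illustrative names):

```r
ma = matrix(1:6, nrow = 3, ncol = 2)

# byrow = TRUE fills the matrix one row at a time
mr = matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)
mr

# rbind stacks ma on top of mr
stacked = rbind(ma, mr)
dim(stacked)
```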

Quiz

What value is stored in the 2nd row and 3rd column of the matrix m?

Subsetting a matrix

To extract one value from a matrix, use the structure matrix[index of row,index of column].

m
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9
m[1, 3]
[1] 7

Retrieving a row or column

Leaving the “column” spot empty extracts an entire row; leaving the “row” spot empty extracts an entire column.

m[1, ]
[1] 1 4 7
m[, 2]
[1] 4 5 6

You will get an error if you enter a row or column index that is too large:

m[5, ]
Error: subscript out of bounds
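You can also extract several rows and columns at once. A sketch, rebuilding the same 3 x 3 matrix with a single call; the drop = FALSE argument is an extra that keeps a single row as a matrix rather than a vector:

```r
m = matrix(1:9, nrow = 3)

# Rows 1 and 2, columns 1 and 3
block = m[1:2, c(1, 3)]
block

# By default a single row comes back as a plain vector;
# drop = FALSE keeps it as a 1 x 3 matrix
row.as.matrix = m[1, , drop = FALSE]
dim(row.as.matrix)
```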

Dimensions of a matrix

You can get the number of rows, the number of columns, or both:

nrow(m)
[1] 3
ncol(m)
[1] 3
dim(m)
[1] 3 3

Matrix arithmetic

You can add a single value to a matrix, or multiply a matrix by a single value; the operation is applied to every element:

m + 3
     [,1] [,2] [,3]
[1,]    4    7   10
[2,]    5    8   11
[3,]    6    9   12
m * 2
     [,1] [,2] [,3]
[1,]    2    8   14
[2,]    4   10   16
[3,]    6   12   18

Transpose and diagonal

Use the t function to transpose a matrix:

t(m)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

Use diag to extract the diagonal:

diag(m)
[1] 1 5 9
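Two related facts, shown as a sketch: transposing a non-square matrix swaps its dimensions, and diag called with a single number builds an identity matrix instead of extracting a diagonal (wide is an illustrative name):

```r
wide = matrix(1:6, nrow = 2)   # 2 rows, 3 columns

dim(wide)
dim(t(wide))                   # 3 rows, 2 columns after transposing

# Transposing twice gives back the original matrix
identical(t(t(wide)), wide)

# diag(3) creates a 3 x 3 identity matrix
diag(3)
```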

Matrix multiplication

You can also perform traditional matrix multiplication with the %*% operator:

m2 = matrix(21:32, nrow=3)
m3 = m %*% m2
m3
     [,1] [,2] [,3] [,4]
[1,]  270  306  342  378
[2,]  336  381  426  471
[3,]  402  456  510  564

Note that each element in m3 is a dot product between a row in m and a column in m2.
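You can verify this for a single element. The sketch below recomputes the value in row 2, column 3 of m3 by hand, and contrasts %*% with the element-wise * operator:

```r
m = matrix(1:9, nrow = 3)
m2 = matrix(21:32, nrow = 3)
m3 = m %*% m2

# Row 2 of m times column 3 of m2, summed
by.hand = sum(m[2, ] * m2[, 3])
c(by.hand, m3[2, 3])

# By contrast, * multiplies matching elements and keeps the shape
m * m
```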

Logical vectors

Another type of variable is a logical value: TRUE or FALSE. Like numbers and characters, logical values are always stored in vectors (sometimes of length 1).

x = TRUE
y = c(TRUE, FALSE, TRUE)

Logical operators

Logical vectors are useful because they are the result of logical operators, such as

  • > : greater than
  • < : less than
  • == : equal to
  • != : not equal to
  • & : and
  • | : or

Logical operators: comparison

You can compare a numeric value stored in a variable with any value you choose:

x = 2  # assignment
x > 0
[1] TRUE

If you apply a logical operator to a matrix, it will work on each element:

m >= 5
      [,1]  [,2] [,3]
[1,] FALSE FALSE TRUE
[2,] FALSE  TRUE TRUE
[3,] FALSE  TRUE TRUE
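The same element-by-element behavior holds for vectors, and for & and |. A sketch with a made-up vector v; because TRUE counts as 1 and FALSE as 0, sum counts how many elements pass a test:

```r
v = c(3, 8, 12, 5)

both = v > 4 & v < 10     # TRUE where both comparisons hold
either = v < 4 | v > 10   # TRUE where at least one holds
negated = !(v > 4)        # ! flips TRUE and FALSE

print(both)
print(either)
print(negated)

sum(v > 4)     # number of elements greater than 4
any(v > 10)    # is at least one element greater than 10?
all(v > 0)     # are all elements positive?
```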

Quiz

Assume that you have a character vector mq:

mq = "TRUE"
class(mq)
[1] "character"

What function could you use to tell R that mq should be treated as a logical value (similar to as.numeric)?

Logical operators FAQ

  • Why is the logical operator for equals == and not =?
    • Because = is already reserved for assignment.

Data frames

Data frames store multiple columns of information together. Unlike a matrix, different columns in a data frame can store different kinds of information (numbers, factors, character vectors, etc). In that sense, a data frame is a list of columns of equal length, each with its own class.
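A minimal sketch of building a small data frame by hand with the data.frame function. The data here are invented purely for illustration:

```r
# Three columns of different classes, all of length 3
df = data.frame(
  subject = c("A", "B", "C"),
  age = c(34, 51, 28),
  treated = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep subject as character, not factor
)

df
sapply(df, class)   # the class of each column
is.list(df)         # a data frame is a special kind of list
```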

Built-in Datasets

R comes with built-in datasets that can be retrieved by name, using the data function. In this class, we are going to use mtcars:

data(mtcars)

mtcars contains statistics about 32 cars from 1974, including miles per gallon, weight, number of cylinders, etc. Each row is one car, and each column is one characteristic.

View data frame in RStudio

View(mtcars)

See details and documentation about the data with:

?mtcars

or

help(mtcars)

See first rows of data frame

One of the most useful functions is head, which shows the first 6 rows of a data frame (a good way to get an idea of its contents):

head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02
Datsun 710        22.8   4  108  93 3.85 2.320 18.61
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02
Valiant           18.1   6  225 105 2.76 3.460 20.22
                  vs am gear carb
Mazda RX4          0  1    4    4
Mazda RX4 Wag      0  1    4    4
Datsun 710         1  1    4    1
Hornet 4 Drive     1  0    3    1
Hornet Sportabout  0  0    3    2
Valiant            1  0    3    1
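A few companions to head, shown as a sketch: you can ask for a different number of rows, look at the end of the data frame with tail, or get a compact overview of every column with str.

```r
data(mtcars)

head(mtcars, 3)   # only the first 3 rows
tail(mtcars, 2)   # the last 2 rows
str(mtcars)       # one line per column: its class and first few values
```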

Information about a data frame

Get the number of rows, columns or both:

nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11
dim(mtcars)
[1] 32 11

Access a column by name

You can find out the column names in two ways:

colnames(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec"
 [8] "vs"   "am"   "gear" "carb"
names(mtcars)
 [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec"
 [8] "vs"   "am"   "gear" "carb"

Access a column by name

Similarly, there are two ways to access a column by name:

mtcars$mpg
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
mtcars[,"mpg"]
 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

Each column is a vector once it is extracted.
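A third way to extract a column is with double square brackets. The sketch below checks that all three forms return the same vector, and shows the drop = FALSE argument, which keeps the result as a one-column data frame instead:

```r
data(mtcars)

# All three give the same numeric vector
identical(mtcars$mpg, mtcars[, "mpg"])
identical(mtcars$mpg, mtcars[["mpg"]])

# drop = FALSE keeps the data frame structure (and the row names)
mpg.df = mtcars[, "mpg", drop = FALSE]
head(mpg.df, 3)
```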

Access one row or value

You can use square brackets with a comma to access a single row of a data frame:

mtcars[1, ]
          mpg cyl disp  hp drat   wt  qsec vs am gear
Mazda RX4  21   6  160 110  3.9 2.62 16.46  0  1    4
          carb
Mazda RX4    4

Access one row or value

Or you can give row, column to get a single value at a particular position:

mtcars[3, 2]
[1] 4
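Because mtcars has row names and column names, you can use those instead of positions. A sketch that retrieves the same value as above by name:

```r
data(mtcars)

mtcars["Datsun 710", ]         # a whole row, selected by row name
mtcars["Datsun 710", "cyl"]    # a single value, by row and column name
mtcars[3, "cyl"]               # positions and names can be mixed
```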

Filtering a data frame

One common operation on data is to filter out rows based on some criterion.

Subsetting rows of a data frame

You can get a set of rows using their indices:

mtcars[1:2, ]
              mpg cyl disp  hp drat    wt  qsec vs am
Mazda RX4      21   6  160 110  3.9 2.620 16.46  0  1
Mazda RX4 Wag  21   6  160 110  3.9 2.875 17.02  0  1
              gear carb
Mazda RX4        4    4
Mazda RX4 Wag    4    4

However, what if you want “all automatic cars” or “all cars with mpg > 20”?

Logical operations on a column

Logical operators can be applied just as easily to a column of mtcars:

mtcars$mpg > 20
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE
 [9]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
[25] FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE

Filtering a data frame logically

This logical vector can be used to subset rows of the data frame: TRUE means “keep the row”, FALSE means “drop it”. Place it before the comma in the square brackets:

v = mtcars$mpg > 20
efficient.cars = mtcars[v, ]

or just:

efficient.cars = mtcars[mtcars$mpg > 20, ]
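A quick sanity check after filtering, as a sketch: the number of rows kept should equal the number of TRUE values in the logical vector, and every remaining row should satisfy the condition.

```r
data(mtcars)

v = mtcars$mpg > 20
efficient.cars = mtcars[v, ]

sum(v)                          # TRUE counts as 1: number of rows to keep
nrow(efficient.cars)            # number of rows actually kept
all(efficient.cars$mpg > 20)    # every remaining car passes the test
```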

Filtering on multiple conditions

You can combine multiple conditions using & (and) or | (or), such as looking for cars with an automatic transmission (am equal to 0) and mpg > 20:

efficient.auto = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]
head(efficient.auto, 3)
                mpg cyl  disp  hp drat    wt  qsec vs
Hornet 4 Drive 21.4   6 258.0 110 3.08 3.215 19.44  1
Merc 240D      24.4   4 146.7  62 3.69 3.190 20.00  1
Merc 230       22.8   4 140.8  95 3.92 3.150 22.90  1
               am gear carb
Hornet 4 Drive  0    3    1
Merc 240D       0    4    2
Merc 230        0    4    2
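Base R also offers the subset function as an alternative way to write the same filter. It is not used elsewhere in these slides; inside subset you refer to columns by name without the mtcars$ prefix. A sketch (efficient.auto2 is an illustrative name):

```r
data(mtcars)

# Square-bracket version, as above
efficient.auto = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]

# The same filter written with subset()
efficient.auto2 = subset(mtcars, mpg > 20 & am == 0)

# Both approaches select the same rows
all.equal(efficient.auto, efficient.auto2)
```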

Quiz

What one-line R command would return the cars whose mpg is less than 15 or greater than 30?

Next Time

[Figure: ggplot2 preview plot]

Appendix

About these slides

These slides were created as an RStudio presentation, which integrates R code using knitr.

Session Info

sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
[1] ggplot2_0.9.3.1 knitr_1.5      

loaded via a namespace (and not attached):
 [1] codetools_0.2-8    colorspace_1.2-4  
 [3] dichromat_2.0-0    digest_0.6.4      
 [5] evaluate_0.5.1     formatR_0.10      
 [7] grid_3.0.2         gtable_0.1.2      
 [9] labeling_0.2       MASS_7.3-29       
[11] munsell_0.4.2      plyr_1.8          
[13] proto_0.3-10       RColorBrewer_1.0-5
[15] reshape2_1.2.2     scales_0.2.3      
[17] stringr_0.6.2      tools_3.0.2