David Robinson and Neo Christopher Chung
In these slides, we show blocks of R code, which are immediately followed by their output:
print("hello world")
[1] "hello world"
The gray box shows the original R code, which you can copy and paste into your own R console to try yourself. The white box shows the code's output: you can compare it to your own results (or just trust us that that's the output).
You store a value in a variable using the = operator:
x = 42
This gives the variable x a value of 42. You can show the value of x with:
print(x)
[1] 42
You can also assign a variable with <-: this is equivalent.
x <- 42
Variable names consist of letters, digits, periods and underscores (_), and cannot start with a digit or an underscore. A common convention is to use periods in place of spaces.
Legal variable names include:
Illegal names include:
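The example lists from the original slide are not shown here, so the following is only a minimal sketch with hypothetical names (none of them come from the slides). It relies on base R's make.names, which repairs a string into a syntactically valid name; a candidate that comes back unchanged was already legal.

```r
# Hypothetical candidate names, chosen only for illustration.
candidates = c("my.variable", "my_variable2", "2fast", "_start", "has space")

# make.names() returns each string repaired into a valid name;
# a string that is returned unchanged was already a legal name.
is.legal = make.names(candidates) == candidates
names(is.legal) = candidates
is.legal
```

The first two candidates pass; the other three fail because they start with a digit, start with an underscore, or contain a space.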
You can perform mathematical operations using +, -, *, and /:
x = 6 + 4
print(x)
[1] 10
x / 2
[1] 5
y = 4
x / y
[1] 2.5
You can use exponentiation with ^, or calculate the natural log:
x^2
[1] 100
y^3
[1] 64
log(x)
[1] 2.303
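If you need a logarithm in another base, base R has a few close relatives of log. A short sketch, assuming x holds 10 as it does at this point in the slides:

```r
x = 10            # same value as x in the slides at this point
log10(x)          # base-10 logarithm
log(x, base = 2)  # logarithm in any base you choose
exp(log(x))       # exp() is the inverse of the natural log
```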
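Why are there two assignment operators, <- and =?
For the assignments shown here the two are equivalent, so you can use either one.
When do you need print(x) to display a variable, and when is x alone enough?
At the interactive console, print is unnecessary. When you source a .R file, you need print(x) in the line or it won't display.
Why is there a [1] before each result?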
You may have noticed the [1] at the start of each result. That's because a single number in R is actually stored as a vector of length 1. The [1] is the index of the first element printed on that line of output.
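One way to see what the bracketed prefix does is to print a vector long enough to wrap onto several lines: every output line then starts with the position of its first element. A minimal sketch (long.vector is a name introduced here, and where the lines break depends on the width of your console):

```r
# A vector long enough to wrap across several output lines.
long.vector = 101:140
long.vector
```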
Create a vector using the function c, and display it by simply typing its variable name:
v1 = c(1, 5.5, 1e2)
v1
[1] 1.0 5.5 100.0
You can also use c to combine two vectors:
v2 = c(.14, 0, -2)
v3 = c(v1, v2)
What numbers are stored in v3?
Not all values you could want to store in R are numeric. You could store:
Character vectors are surrounded by either single or double quotation marks.
chv1 = "hello"
chv2 = c(chv1, 'world')
chv2
[1] "hello" "world"
Use square brackets to retrieve a value from a vector, or multiple values:
chv2[1]
[1] "hello"
v3[2]
[1] 5.5
v3[2:5]
[1] 5.50 100.00 0.14 0.00
Of course, you can store the output in another variable:
v3_sub = v3[2:5]
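Square brackets accept more than a single position or a range. As a small sketch using the same values as v3, you can pass any vector of positions, or a negative position to leave an element out:

```r
v3 = c(1, 5.5, 1e2, .14, 0, -2)  # same values as v3 in the slides
v3[c(1, 6)]   # first and last elements: 1 and -2
v3[-1]        # everything except the first element
```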
Mathematical operations on a vector apply to all elements:
v1 + 2
[1] 3.0 7.5 102.0
sin(v1)
[1] 0.8415 -0.7055 -0.5064
Apply an operation to a subset of a vector:
sin(v1[2])
[1] -0.7055
sin(v1)[2]
[1] -0.7055
Similarly, you can perform operations between two vectors:
v1 * v2
[1] 0.14 0.00 -200.00
v1 / v2
[1] 7.143 Inf -50.000
A dot product (or inner product), which is a sum of the products of corresponding elements in two vectors, can be computed by
v1 %*% v2
[,1]
[1,] -199.9
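To connect the definition to the operator, the same number can be computed by hand. This sketch multiplies element by element and then adds up the products (dot is a name introduced here):

```r
v1 = c(1, 5.5, 1e2)
v2 = c(.14, 0, -2)
dot = sum(v1 * v2)   # multiply element by element, then add up
dot
```

The only difference is the shape of the result: %*% returns a 1-by-1 matrix, while sum returns a plain number.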
For certain operations such as a dot product, the two vectors must have the same length. Otherwise, you will get an error:
v1 %*% v3
Error: non-conformable arguments
When in doubt, check the length of a vector:
length(v1)
[1] 3
length(v3)
[1] 6
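Checking lengths matters even more for element-wise operators such as +, because they do not stop when the lengths differ: R recycles the shorter vector instead. A minimal sketch using the same values as v1 and v3 (recycled is a name introduced here):

```r
v1 = c(1, 5.5, 1e2)
v3 = c(v1, c(.14, 0, -2))

# v1 is reused from the start once it runs out: 1, 5.5, 100, 1, 5.5, 100
recycled = v1 + v3
recycled
```

R only warns when the longer length is not a multiple of the shorter one, so here the recycling happens silently.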
Similarly, you can't use math operations on a character vector:
chv2 + 10
Error: non-numeric argument to binary operator
You can verify the class of a variable:
class(v2)
[1] "numeric"
class(chv2)
[1] "character"
It is possible to store numeric values in a character vector (e.g., patient numbers):
v4 = c("10", "42")
class(v4)
[1] "character"
And since R thinks v4 is a character vector, an attempt to apply a math operation will give an error:
v4 / 10
Error: non-numeric argument to binary operator
If you would like to use the numeric values stored in a character vector, you can tell R to treat v4 as numeric:
as.numeric(v4) + 10
[1] 20 52
v4 = as.numeric(v4)
class(v4)
[1] "numeric"
What would happen if you try to force chv2 into a numeric vector by applying as.numeric to chv2?
You can obtain summary statistics of a vector:
summary(v3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.00 0.04 0.57 17.40 4.38 100.00
Similarly, R includes a number of built-in functions with intuitive names to compute various statistics about a vector:
mean(v3)
[1] 17.44
sum(v3)
[1] 104.6
var(v3)
[1] 1642
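A few more functions of the same kind, sketched on the same values as v3. The standard deviation is the square root of the variance, so sd and var should agree with each other:

```r
v3 = c(1, 5.5, 1e2, .14, 0, -2)
median(v3)   # middle value, as reported by summary
sd(v3)       # standard deviation, the square root of var(v3)
range(v3)    # smallest and largest value
```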
Elements in a vector can have names, which you can access with:
names(v2)
NULL
NULL implies that elements in v2 currently do not have names. We can assign names:
names(v2) = c("Cat", "Dog", "Rat")
names(v2)
[1] "Cat" "Dog" "Rat"
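Once a vector has names, you can use them inside square brackets in place of positions. A short sketch with the same v2:

```r
v2 = c(.14, 0, -2)
names(v2) = c("Cat", "Dog", "Rat")
v2["Rat"]            # look up one element by name
v2[c("Cat", "Rat")]  # or several at once
```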
Matrices are like two-dimensional vectors, organizing values into rows and columns.
To create one, use the aptly named matrix function:
ma = matrix(1:6, nrow=3, ncol=2)
ma
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
mb = matrix(7:9, nrow=3, ncol=1)
cbind stands for column bind (and rbind for row bind). It merges two matrices into one by placing their columns side by side:
m = cbind(ma, mb)
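For comparison, here is a sketch of rbind, which stacks matrices on top of each other and therefore needs the same number of columns in both. The extra.row matrix is a hypothetical example, not part of the slides:

```r
ma = matrix(1:6, nrow=3, ncol=2)
extra.row = matrix(7:8, nrow=1, ncol=2)  # hypothetical extra row
taller = rbind(ma, extra.row)            # stacks extra.row under ma
dim(taller)
```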
What value is stored at 2nd row and 3rd column in the matrix m?
To extract one value from a matrix, use the structure matrix[index of row,index of column].
m
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
m[1, 3]
[1] 7
Leaving the “row” spot or the “column” spot empty will extract, respectively, an entire column or an entire row.
m[1, ]
[1] 1 4 7
m[, 2]
[1] 4 5 6
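The row and column spots also accept several indices at once, which extracts a smaller matrix. A minimal sketch on a matrix with the same values as m:

```r
m = matrix(1:9, nrow=3)   # same values as m in the slides
m[1:2, ]                  # first two rows, all columns
m[c(1, 3), 2:3]           # rows 1 and 3, columns 2 and 3
```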
You will get an error if you enter a row or column index that is too large:
m[5, ]
Error: subscript out of bounds
You can get the number of rows, the number of columns, or both:
nrow(m)
[1] 3
ncol(m)
[1] 3
dim(m)
[1] 3 3
You can add a single value to a matrix, or multiply a matrix by a single value (element-wise operation):
m + 3
[,1] [,2] [,3]
[1,] 4 7 10
[2,] 5 8 11
[3,] 6 9 12
m * 2
[,1] [,2] [,3]
[1,] 2 8 14
[2,] 4 10 16
[3,] 6 12 18
Use the t function to transpose a matrix:
t(m)
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 4 5 6
[3,] 7 8 9
Use diag to extract the diagonal:
diag(m)
[1] 1 5 9
You can also perform traditional matrix multiplication with the %*% operator:
m2 = matrix(21:32, nrow=3)
m3 = m %*% m2
m3
[,1] [,2] [,3] [,4]
[1,] 270 306 342 378
[2,] 336 381 426 471
[3,] 402 456 510 564
Note that each element in m3 is a dot product between a row in m and a column in m2.
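You can check this for one element by hand. The sketch below recomputes the value in row 2, column 3 of m3 from row 2 of m and column 3 of m2 (by.hand is a name introduced here):

```r
m = matrix(1:9, nrow=3)
m2 = matrix(21:32, nrow=3)
m3 = m %*% m2
by.hand = sum(m[2, ] * m2[, 3])   # row 2 of m times column 3 of m2
by.hand == m3[2, 3]
```

This is also why the number of columns in m has to match the number of rows in m2.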
Another type of variable is a logical value: TRUE or FALSE. Like numbers and characters, logical values are always stored in vectors (sometimes of length 1).
x = TRUE
y = c(TRUE, FALSE, TRUE)
Logical vectors are useful because they are the result of logical operators, such as
> : greater than
< : less than
== : equal to
!= : not equal to
& : and
| : or
You can compare a numeric value stored in a variable with any value you choose:
x = 2 # assignment
x > 0
[1] TRUE
If you apply a logical operator to a matrix, it will work on each element:
m >= 5
[,1] [,2] [,3]
[1,] FALSE FALSE TRUE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
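A logical result like this is rarely the end goal; it is usually fed into something else. As a sketch, you can count the matching elements or pull them out:

```r
m = matrix(1:9, nrow=3)   # same values as m in the slides
sum(m >= 5)    # TRUE counts as 1, so this counts matching elements
m[m >= 5]      # extracts the matching elements as a vector
```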
Assume that you have a character vector mq:
mq = "TRUE"
class(mq)
[1] "character"
What function could you use to tell R that mq should be treated as a logical value (similar to as.numeric)?
Why is equality tested with == and not =?
= is already reserved for assignment.
Data frames store multiple columns of information together. Unlike a matrix, different columns in a data frame can store different kinds of information (numbers, factors, character vectors, etc). In that sense, a data frame is a list of equal-length columns, each with its own class.
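To make the mixed-column idea concrete, here is a minimal sketch that builds a small data frame by hand. The patients table and its values are made up for illustration. stringsAsFactors = FALSE is set explicitly because older versions of R (including the one used for these slides) would otherwise turn the text column into a factor.

```r
# Hypothetical patient table, made up for illustration.
patients = data.frame(
  id = c("p10", "p42", "p7"),
  age = c(34, 51, 29),
  smoker = c(FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)
sapply(patients, class)   # one class per column
```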
R comes with built-in datasets that can be retrieved by name, using the data function. In this class, we are going to use mtcars:
data(mtcars)
mtcars contains statistics about 32 cars in 1974, including miles per gallon, weight, number of cylinders, etc. Each row is one car, and each column is one characteristic.
View(mtcars)
See details and documentation about the data with:
?mtcars
or
help(mtcars)
One of the most useful functions is head, which shows the first 6 rows of a data frame (a good way to get an idea of its contents):
head(mtcars)
mpg cyl disp hp drat wt qsec
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02
Datsun 710 22.8 4 108 93 3.85 2.320 18.61
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02
Valiant 18.1 6 225 105 2.76 3.460 20.22
vs am gear carb
Mazda RX4 0 1 4 4
Mazda RX4 Wag 0 1 4 4
Datsun 710 1 1 4 1
Hornet 4 Drive 1 0 3 1
Hornet Sportabout 0 0 3 2
Valiant 1 0 3 1
Get the number of rows, columns or both:
nrow(mtcars)
[1] 32
ncol(mtcars)
[1] 11
dim(mtcars)
[1] 32 11
You can find out the column names in two ways:
colnames(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec"
[8] "vs" "am" "gear" "carb"
names(mtcars)
[1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec"
[8] "vs" "am" "gear" "carb"
Similarly, there are two ways to access a column by name:
mtcars$mpg
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
mtcars[,"mpg"]
[1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
[11] 17.8 16.4 17.3 15.2 10.4 10.4 14.7 32.4 30.4 33.9
[21] 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4
Each column is a vector once it is extracted.
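A third spelling that you may also come across is double square brackets. This sketch checks that it returns the same plain vector as the $ form (by.dollar and by.double.bracket are names introduced here):

```r
data(mtcars)
by.dollar = mtcars$mpg
by.double.bracket = mtcars[["mpg"]]
identical(by.dollar, by.double.bracket)
is.vector(by.dollar)
```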
You can use square brackets with a comma to access a single row of a data frame:
mtcars[1, ]
mpg cyl disp hp drat wt qsec vs am gear
Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4
carb
Mazda RX4 4
Or you can give both a row and a column to get the single value at a particular position:
mtcars[3, 2]
[1] 4
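Because mtcars has row names as well as column names, the same cell can be addressed by name, which is often easier to read than positions. A short sketch:

```r
data(mtcars)
mtcars["Datsun 710", "cyl"]             # same cell as mtcars[3, 2]
mtcars["Datsun 710", c("mpg", "cyl")]   # two columns of the same row
```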
One common operation on data is to filter out rows based on some criterion.
You can get a set of rows using their indices:
mtcars[1:2, ]
mpg cyl disp hp drat wt qsec vs am
Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1
Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1
gear carb
Mazda RX4 4 4
Mazda RX4 Wag 4 4
However, what if you want “all automatic cars” or “all cars with mpg > 20”?
Logical operators can just as easily be applied to a column of mtcars:
mtcars$mpg > 20
[1] TRUE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
[9] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[17] FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
[25] FALSE TRUE TRUE TRUE FALSE FALSE FALSE TRUE
This logical vector can be used to subset rows of the data frame: TRUE means “keep the row”, FALSE means “drop it”. Place it before the comma in the square brackets:
v = mtcars$mpg > 20
efficient.cars = mtcars[v, ]
or just:
efficient.cars = mtcars[mtcars$mpg > 20, ]
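A quick sanity check after filtering is to compare the number of TRUE values with the number of rows you got back. A minimal sketch:

```r
data(mtcars)
v = mtcars$mpg > 20
sum(v)                            # number of cars with mpg > 20
efficient.cars = mtcars[v, ]
nrow(efficient.cars) == sum(v)    # the subset has exactly that many rows
```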
You can combine multiple conditions using & (and) or | (or), such as looking for automatic gearshift cars with mpg > 20:
efficient.auto = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]
head(efficient.auto, 3)
mpg cyl disp hp drat wt qsec vs
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1
am gear carb
Hornet 4 Drive 0 3 1
Merc 240D 0 4 2
Merc 230 0 4 2
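Base R also provides a subset function that lets you write the condition without repeating mtcars$. The slides do not use it; it is sketched here only as an alternative, and the comparison checks that both forms select the same cars (with.brackets and with.subset are names introduced here):

```r
data(mtcars)
with.brackets = mtcars[mtcars$mpg > 20 & mtcars$am == 0, ]
with.subset = subset(mtcars, mpg > 20 & am == 0)
identical(rownames(with.brackets), rownames(with.subset))
```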
What one-line R command would return the cars whose mpg is less than 15 or greater than 30?
These slides were created as an RStudio presentation, which integrates R code using knitr.
sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets
[6] methods base
other attached packages:
[1] ggplot2_0.9.3.1 knitr_1.5
loaded via a namespace (and not attached):
[1] codetools_0.2-8 colorspace_1.2-4
[3] dichromat_2.0-0 digest_0.6.4
[5] evaluate_0.5.1 formatR_0.10
[7] grid_3.0.2 gtable_0.1.2
[9] labeling_0.2 MASS_7.3-29
[11] munsell_0.4.2 plyr_1.8
[13] proto_0.3-10 RColorBrewer_1.0-5
[15] reshape2_1.2.2 scales_0.2.3
[17] stringr_0.6.2 tools_3.0.2