(myVector <- c(7,11,13)) # Create a vector called 'myVector'
## [1] 7 11 13
myVector * 3
## [1] 21 33 39
myVector[1] # Get the first element
## [1] 7
myVector[3] # Get the third element
## [1] 13
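Square brackets also accept a vector of positions, a negative index, or a logical condition. The following is a small sketch on the same myVector; the variable names first_two, without_1 and above_ten are chosen here purely for illustration.

```r
myVector <- c(7, 11, 13)

first_two <- myVector[c(1, 2)]         # elements at positions 1 and 2
without_1 <- myVector[-1]              # every element except the first
above_ten <- myVector[myVector > 10]   # elements satisfying a condition
```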
- length(): returns the length of a vector.
- sort(): sorts the elements of a vector, in increasing or decreasing order.
- rev(): rearranges the elements of a vector in reverse order.
- rank(): returns the vector of ranks of the elements.
- head(): returns the first few elements of a vector.
- tail(): returns the last few elements of a vector.
x <- c(1,3,6,2,7,4,8,1,0,8)
length(x)
## [1] 10
sort(x)
## [1] 0 1 1 2 3 4 6 7 8 8
sort(x, decreasing=TRUE)
## [1] 8 8 7 6 4 3 2 1 1 0
rev(x)
## [1] 8 0 1 8 4 7 2 6 3 1
rank(x)
## [1] 2.5 5.0 7.0 4.0 8.0 6.0 9.5 2.5 1.0 9.5
head(x, 3)
## [1] 1 3 6
tail(x, 2)
## [1] 0 8
Note: The largest value of rank(x) is not always equal
to length(x), because the largest values may be tied.
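As a quick check of this note, the sketch below reuses the vector x from above, in which the value 8 appears twice. By default rank() averages tied ranks; the ties.method argument changes that. The names top_average and top_first are illustrative only.

```r
x <- c(1, 3, 6, 2, 7, 4, 8, 1, 0, 8)

# Default: tied values receive the average of the ranks they occupy
top_average <- max(rank(x))                         # 9.5, not length(x)

# ties.method = "first": ties are broken by order of appearance
top_first <- max(rank(x, ties.method = "first"))    # 10, equal to length(x)
```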
A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
is.element(C,A)
## [1] TRUE FALSE TRUE
is.element(A,C)
## [1] FALSE FALSE TRUE TRUE
intersect(A,B)
## [1] 2 7
union(A,B)
## [1] 4 5 2 7 1 3
setdiff(A,B)
## [1] 4 5
Given that
A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
Calculate
# the collection of elements of A and B that only belong to one set
union(setdiff(A,B), setdiff(B,A))
## [1] 4 5 1 3
# the number of elements that belong to A, B and C
length(intersect(intersect(A,B), C))
## [1] 2
(X <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE))
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
(X <- matrix(1:12, nrow=4, ncol=3, byrow=FALSE))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
(X <- array(1:24, dim=c(2,3,4)))
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
##
## , , 3
##
## [,1] [,2] [,3]
## [1,] 13 15 17
## [2,] 14 16 18
##
## , , 4
##
## [,1] [,2] [,3]
## [1,] 19 21 23
## [2,] 20 22 24
[Figure: visualization of a three-dimensional array]
When an operation involves two vectors, matrices or arrays of different lengths, R extends the shorter one by repeating its elements from the beginning. This behaviour is called ‘recycling’:
x <- c(1,2,3,4,5,6)
y <- c(1,2,3)
x + y
## [1] 2 4 6 5 7 9
Another example is below, where the vector 1:3 is
repeated to fill in a matrix:
matrix(1:3, ncol=3, nrow=4)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 2 3 1
## [3,] 3 1 2
## [4,] 1 2 3
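In both examples above the longer length is a multiple of the shorter one. When it is not, R still recycles but issues a warning. The sketch below shows this for a vector y of length 4, chosen here for illustration; the warning text is captured with tryCatch() so it can be inspected.

```r
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1, 2, 3, 4)

# y is recycled as 1 2 3 4 1 2; the sum is still computed
z <- suppressWarnings(x + y)

# Capture the warning that R would normally print for x + y
msg <- tryCatch(x + y, warning = function(w) conditionMessage(w))
```

Here z holds 2 4 6 8 6 8, and msg holds the warning about the longer length not being a multiple of the shorter one.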
You can merge vectors or matrices together to create a new matrix
with functions cbind() and rbind().
(B <- cbind(1:4,5:8))
## [,1] [,2]
## [1,] 1 5
## [2,] 2 6
## [3,] 3 7
## [4,] 4 8
(C <- cbind(B, 9:12))
## [,1] [,2] [,3]
## [1,] 1 5 9
## [2,] 2 6 10
## [3,] 3 7 11
## [4,] 4 8 12
(B <- rbind(1:4,5:8))
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
(C <- rbind(B, 9:12))
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
(A <- matrix(c(2,3,5,4), nrow=2, ncol=2, byrow=TRUE))
## [,1] [,2]
## [1,] 2 3
## [2,] 5 4
(B <- matrix(c(1,2,8,7), nrow=2, ncol=2, byrow=FALSE))
## [,1] [,2]
## [1,] 1 8
## [2,] 2 7
(I2 <- diag(nrow=2)) # identity matrix of size 2x2
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
A+B
## [,1] [,2]
## [1,] 3 11
## [2,] 7 11
A*B
## [,1] [,2]
## [1,] 2 24
## [2,] 10 28
A/B
## [,1] [,2]
## [1,] 2.0 0.3750000
## [2,] 2.5 0.5714286
A%*%I2
## [,1] [,2]
## [1,] 2 3
## [2,] 5 4
A%*%B
## [,1] [,2]
## [1,] 8 37
## [2,] 13 68
t(B)
## [,1] [,2]
## [1,] 1 2
## [2,] 8 7
Note: the diag() function can also be used for other
purposes (see the R help file, ?diag).
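A brief sketch of those other uses, based on the documented behaviour of diag(): given a matrix it extracts the diagonal, given a vector it builds a diagonal matrix, and given a single number n it returns the n x n identity matrix. The object names are illustrative.

```r
A <- matrix(c(2, 3, 5, 4), nrow = 2, ncol = 2, byrow = TRUE)

d  <- diag(A)         # extracts the diagonal of A: 2 4
D  <- diag(c(2, 4))   # builds a 2x2 diagonal matrix with 2 and 4 on the diagonal
I3 <- diag(3)         # 3x3 identity matrix
```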
- The solve(A,b) function can be used to solve \(A \mathbf{x} = \mathbf{b}\) for \(\mathbf{x}\). Here \(\mathbf{b}\) can be a vector or a matrix.
- If solve() is used with only one argument, e.g. solve(A), it will return the inverse of the matrix (if it exists).
(A <- matrix(1:4, ncol=2))
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
(x <- solve(A, c(1,1)))
## [1] -0.5 0.5
A%*%x
## [,1]
## [1,] 1
## [2,] 1
solve(A) %*% A
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
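The inverse does not always exist. As a sketch of what happens in that case, the matrix S below (a made-up example whose second column is twice the first) is singular, so solve(S) stops with an error. Wrapping the call in tryCatch() is one possible way to handle this.

```r
S <- matrix(c(1, 2, 2, 4), ncol = 2)   # columns are linearly dependent

det_S <- det(S)                        # zero (up to rounding): no inverse exists

# solve(S) raises an error; catch it and return a message instead
result <- tryCatch(solve(S), error = function(e) "S is singular")
```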
The function apply() is often quite handy. It applies a
given function to the elements of all rows (MARGIN=1) or
all columns (MARGIN=2) of a matrix.
(X <- matrix(c(1:4, 1, 6:8), nrow = 2))
## [,1] [,2] [,3] [,4]
## [1,] 1 3 1 7
## [2,] 2 4 6 8
apply(X, MARGIN=1, FUN=sum)
## [1] 12 20
apply(X, MARGIN=2, FUN=mean)
## [1] 1.5 3.5 3.5 7.5
Other functions you could use: rowSums(),
colSums(), rowMeans(),
colMeans().
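These dedicated functions give the same results as the corresponding apply() calls. The short check below uses the same matrix X as above; the names same_rows and same_cols are illustrative.

```r
X <- matrix(c(1:4, 1, 6:8), nrow = 2)

# rowSums() matches apply() over rows with sum
same_rows <- all.equal(rowSums(X), apply(X, MARGIN = 1, FUN = sum))

# colMeans() matches apply() over columns with mean
same_cols <- all.equal(colMeans(X), apply(X, MARGIN = 2, FUN = mean))
```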
Given a \(3\times 3\) matrix X
X <- matrix(1:9, nrow = 3)
- Use apply to create a vector called row.sums containing the row marginal sums of X (i.e. the sum of elements within each row).
- Create a vector called col.sums containing the column marginal sums of X.
- Check that sum(row.sums) = sum(col.sums) = sum(X). Make sure you see why that should be the case.
(X <- matrix(1:9, nrow = 3))
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
(row.sums <- apply(X, 1, sum))
## [1] 12 15 18
(col.sums <- apply(X, 2, sum))
## [1] 6 15 24
(sum(row.sums) == sum(col.sums)) & (sum(X) == sum(row.sums))
## [1] TRUE
myVector <- c(1,2,"A", TRUE)
myVector
## [1] "1" "2" "A" "TRUE"
typeof(myVector)
## [1] "character"
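The result above arises because all elements of a vector must share one type, so c() converts them to the most general type present. A small sketch of the usual promotion order (logical, then integer, then double, then character), with illustrative variable names:

```r
t1 <- typeof(c(TRUE, 2L))    # "integer": the logical is promoted to integer
t2 <- typeof(c(2L, 3.5))     # "double": the integer is promoted to double
t3 <- typeof(c(3.5, "A"))    # "character": everything becomes a string
```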
myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "R is my friend")
myList
## [[1]]
## [1] TRUE
##
## $my.matrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[3]]
## [1] 1+2i 3+0i
##
## [[4]]
## [1] "R is my friend"
Consider ‘myList’ given before as
myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "R is my friend")
myList
## [[1]]
## [1] TRUE
##
## $my.matrix
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## [[3]]
## [1] 1+2i 3+0i
##
## [[4]]
## [1] "R is my friend"
typeof(myList[[1]])
## [1] "logical"
typeof(myList[[2]])
## [1] "integer"
typeof(myList[[3]])
## [1] "complex"
typeof(myList[[4]])
## [1] "character"
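The calls above extract elements by position with [[ ]]. When an element has a name, as shown by $my.matrix in the printed list, it can also be extracted by that name. A minimal sketch:

```r
myList <- list(TRUE, my.matrix = matrix(1:4, nrow = 2), c(1+2i, 3), "R is my friend")

element_names <- names(myList)     # "" "my.matrix" "" ""

by_dollar <- myList$my.matrix      # access by name with $
by_string <- myList[["my.matrix"]] # access by name with [[ ]]
by_index  <- myList[[2]]           # access by position
```

All three of by_dollar, by_string and by_index refer to the same 2x2 matrix.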
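Only the second element of myList is named, my.matrix, and the others
are unnamed. Note that naming elements makes it easier to read the data
from a list; we will discuss this further in later sessions.
A data.frame in R is a table where the columns can be of different natures but must all have the same length.
Data frames are widely used in R.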
BMI <- data.frame(
Gender=c("M","F","M","F"),
Height=c(1.83,1.78,1.80,1.60),
Weight=c(77,68,66,48),
row.names=c("Ben","Katja","Anthony","Jinxia"))
BMI
## Gender Height Weight
## Ben M 1.83 77
## Katja F 1.78 68
## Anthony M 1.80 66
## Jinxia F 1.60 48
str(BMI) # display internal structure of data frame
## 'data.frame': 4 obs. of 3 variables:
## $ Gender: chr "M" "F" "M" "F"
## $ Height: num 1.83 1.78 1.8 1.6
## $ Weight: num 77 68 66 48
You can access a specific variable using the $ operator:
BMI$Gender
## [1] "M" "F" "M" "F"
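The $ operator can also be used on the left of an assignment to add a new column. The sketch below computes a body mass index, assuming (this is not stated in the data) that Height is in metres and Weight in kilograms. The column name Index is made up for illustration.

```r
BMI <- data.frame(
  Gender = c("M", "F", "M", "F"),
  Height = c(1.83, 1.78, 1.80, 1.60),
  Weight = c(77, 68, 66, 48),
  row.names = c("Ben", "Katja", "Anthony", "Jinxia"))

# Assumed units: weight in kg divided by squared height in m
BMI$Index <- BMI$Weight / BMI$Height^2

jinxia_index <- BMI["Jinxia", "Index"]   # 48 / 1.6^2 = 18.75
```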
Data frames that share identifying columns can be combined with the merge() function (but be
careful of what happens when not all subjects are present in both
datasets).
X <- data.frame(GENDER=c("F","M","M","F"),
ID=c(123,234,345,456),
NAME=c("Mary","James","James","Olivia"),
Height=c(170,180,185,160))
Y <- data.frame(GENDER=c("M","F","F","M"),
ID=c(345,456,123,234),
NAME=c("James","Olivia","Mary","James"),
Weight=c(80,50,70,60))
X
## GENDER ID NAME Height
## 1 F 123 Mary 170
## 2 M 234 James 180
## 3 M 345 James 185
## 4 F 456 Olivia 160
Y
## GENDER ID NAME Weight
## 1 M 345 James 80
## 2 F 456 Olivia 50
## 3 F 123 Mary 70
## 4 M 234 James 60
cbind(X,Y) # Not very useful here
## GENDER ID NAME Height GENDER ID NAME Weight
## 1 F 123 Mary 170 M 345 James 80
## 2 M 234 James 180 F 456 Olivia 50
## 3 M 345 James 185 F 123 Mary 70
## 4 F 456 Olivia 160 M 234 James 60
merge(X,Y) # This is what we want
## GENDER ID NAME Height Weight
## 1 F 123 Mary 170 70
## 2 F 456 Olivia 160 50
## 3 M 234 James 180 60
## 4 M 345 James 185 80
Any individual not present in both datasets will be lost.
Z <- data.frame(GENDER=c("M","F","F","F"),
ID=c(345,456,123,999),
NAME=c("James","Olivia","Mary","Jennifer"),
Age=c(21,19,23,99))
Z
## GENDER ID NAME Age
## 1 M 345 James 21
## 2 F 456 Olivia 19
## 3 F 123 Mary 23
## 4 F 999 Jennifer 99
merge(X,Z)
## GENDER ID NAME Height Age
## 1 F 123 Mary 170 23
## 2 F 456 Olivia 160 19
## 3 M 345 James 185 21
You can use the all.x or all.y arguments to
force the inclusion of all the subjects of a dataset.
merge(X,Z, all.x = TRUE) # all subjects of first dataset are kept
## GENDER ID NAME Height Age
## 1 F 123 Mary 170 23
## 2 F 456 Olivia 160 19
## 3 M 234 James 180 NA
## 4 M 345 James 185 21
merge(X,Z, all.y = TRUE) # all subjects of second dataset are kept
## GENDER ID NAME Height Age
## 1 F 123 Mary 170 23
## 2 F 456 Olivia 160 19
## 3 F 999 Jennifer NA 99
## 4 M 345 James 185 21
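To keep the subjects of both datasets at once, merge() also has an all argument, which is shorthand for setting all.x and all.y together. A sketch with the same X and Z (the name full is illustrative):

```r
X <- data.frame(GENDER = c("F", "M", "M", "F"),
                ID = c(123, 234, 345, 456),
                NAME = c("Mary", "James", "James", "Olivia"),
                Height = c(170, 180, 185, 160))
Z <- data.frame(GENDER = c("M", "F", "F", "F"),
                ID = c(345, 456, 123, 999),
                NAME = c("James", "Olivia", "Mary", "Jennifer"),
                Age = c(21, 19, 23, 99))

# all = TRUE keeps every subject from both X and Z, filling gaps with NA
full <- merge(X, Z, all = TRUE)
```

The result has five rows: the three subjects common to both datasets, plus ID 234 (no Age) and ID 999 (no Height).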
A factor can be used to store character strings that take a limited set of distinct values (categories)
x <- factor(c("blue","green","blue","red","blue","green","green"))
levels(x)
## [1] "blue" "green" "red"
class(x)
## [1] "factor"
Note: The function levels() shows all the distinct values (levels) of a factor,
which can be handy.
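Two related functions are often useful alongside levels(): table() counts how many times each level occurs, and as.integer() reveals the integer codes that a factor stores internally. A short sketch on the same x, with illustrative variable names:

```r
x <- factor(c("blue", "green", "blue", "red", "blue", "green", "green"))

counts <- table(x)        # blue: 3, green: 3, red: 1
codes  <- as.integer(x)   # position of each value in levels(x): 1 2 1 3 1 2 2
n_lev  <- nlevels(x)      # number of levels: 3
```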
| Data structure | Instruction in R | Description |
|---|---|---|
| vector | c() | Sequence of elements of the same nature |
| matrix | matrix() | Two-dimensional table of elements of the same nature |
| array | array() | More general than a matrix; table with several dimensions |
| list | list() | Sequence of R structures of any (and possibly different) nature. |
| data frame | data.frame() | Two-dimensional table. The columns can be of different natures, but must have the same length. |
| factor | factor() | Vector of character strings associated with a modality table |
| dates | as.Date() | Vector of dates |
| time series | ts() | Values of a variable observed at several time points |
The read.table and read.csv functions are widely used to import data from text and csv files (including data exported from Excel).
| Function name | Description |
|---|---|
| read.table() | best suited for data sets presented as tables, as is often the case in statistics |
| read.ftable() | reads contingency tables |
| scan() | much more flexible and powerful. Use this in all other cases |
| Argument name | Description |
|---|---|
| file=path/to/file | Location and name of the file to be read |
| header=TRUE | Indicates whether the variable names are given on the first line of the file |
| sep="" | This is the field separator character. Values on each line of the file are separated by this character. (e.g. "" = whitespace, "," = comma, "\t" = tabulation) |
| dec="." | Decimal mark for numbers ("." or ",") |
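To see how these arguments interact, the sketch below reads a small semicolon-separated table with decimal commas. The data are supplied inline through the text argument of read.table(), instead of file, so the example runs without any file on disk. The content of raw is made up for illustration.

```r
# Hypothetical data: header line, ";" as separator, "," as decimal mark
raw <- "name;height
Ben;1,83
Katja;1,78"

d <- read.table(text = raw, header = TRUE, sep = ";", dec = ",")

str(d)   # height is read as numeric thanks to dec = ","
```

Without dec = ",", the height column would be read as character strings instead of numbers.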
data <- read.table(file="mydata.txt")
This imports the data from a .txt file ("mydata.txt"). You can then inspect the result with head(data) or tail(data).
In the Ed ‘Exercise Space’ for this week we have placed a dataset in
the .txt format called danishfire.txt. It
contains claim amounts for three different categories of insurance
losses, together with the dates at which the losses occurred.
1. Use readLines("danishfire.txt", n=5) to visualize the beginning of the data. Note that at this stage you are just looking at the data; you have not imported it.
2. Import the data into a data frame called danish_fire.
3. Look at the first rows of danish_fire by using the head() function.
4. Apply class() and str() to danish_fire. What information does this provide you?
5. Carry out some numerical analysis, for example with mean() and cor().
# Visualize data using readLines
readLines("danishfire.txt", n=5)
# import data, note that the header argument is TRUE
danish_fire <- read.table(file="danishfire.txt", header=TRUE)
head(danish_fire)
class(danish_fire)
str(danish_fire)
# some numerical analysis
mean(danish_fire$building)
cor(danish_fire$building, danish_fire$contents)
Note that function readLines() is useful as it allows
you to visualize the beginning of the data file before
you import the data, so you know the structure of the data and can
determine the arguments of the function read.table().
If you are using R installed on your computer (not in Ed), you can specify a file that is not in your working directory:
my.data <- read.table(file="C:/MyFolder/somedata.txt")
Note that on Windows you can use a double backslash \\ instead of /.
Use getwd() to display your current working directory:
getwd()
With standard formats, most of the arguments of the function
read.table() are fixed. Some functions are
designed to be equivalent to read.table() with several
arguments preset:
- read.csv(): .csv format (csv stands for comma-separated values)
- read.delim(): tab-separated data
Now try to export the danish fire data you have imported to R.
# Exporting data to a text file
write.table(danish_fire, file = "myfile.txt", sep = "\t")
# Exporting data to a csv file
write.csv(danish_fire, file = "myfile.csv")
Note that the commands above save your data into the working directory.
If you are using R on your computer (not in Ed!), you can specify a
different path if you wish, by using
file = "path/to/data/myfile.txt".
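One detail worth knowing when exporting: write.csv() writes the row names as an extra first column by default, which then reappears as an additional column when the file is read back. The sketch below shows a round trip that avoids this with row.names = FALSE. It uses a temporary file and a tiny made-up data frame (only the column names building and contents come from the exercise above; the values are invented).

```r
# Made-up values, for illustration only
df <- data.frame(building = c(1.1, 2.5), contents = c(0.3, 0.7))

path <- tempfile(fileext = ".csv")              # temporary file, removed below
write.csv(df, file = path, row.names = FALSE)   # do not write the row names

back <- read.csv(path)                          # read the file back in
round_trip_ok <- all.equal(df, back)            # TRUE if nothing was lost

unlink(path)                                    # clean up the temporary file
```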