Julius Schmid
At the beginning of the chapter, I said that a matrix is just a
vector but with two additional attributes: the number of rows and the
number of columns. Here, we’ll take a closer look at the vector nature
of matrices. Consider this example:
z <- matrix(1:15,nrow=5)
z
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
Since z is still a vector, we can query its length with the length()
function:
length(z)
[1] 15
Let us print out the class of z:
class(z)
[1] "matrix" "array"
We get two values. Since R 4.0.0, a matrix is treated as a special case
of an array (one with exactly two dimensions), so class() reports both.
The dim() function returns the dimensions of z, here 5 rows and
3 columns:
dim(z)
[1] 5 3
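To make the vector nature of z concrete, here is a small sketch. It assumes only base R; the name v is introduced purely for illustration. It shows that a single index walks through the matrix column by column, and that attaching a dim attribute to a plain vector gives the same object as calling matrix().

```r
z <- matrix(1:15, nrow = 5)

# Elements are stored column by column, so a single index runs down
# column 1 first, then column 2, and so on.
z[7]                       # the same element as z[2, 2]
identical(z[7], z[2, 2])   # TRUE

# Build the same matrix by giving a plain vector a dim attribute.
v <- 1:15                  # v is a hypothetical name used for illustration
dim(v) <- c(5L, 3L)
identical(v, z)            # TRUE
```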
Avoiding Unintended Dimension Reduction
In the world of statistics, dimension reduction is a good thing, with many
statistical procedures aimed at doing it well. If we are working with, say,
10 variables and can reduce that number to 3 that still capture the
essence of our data, we’re happy. However, in R, something else might
merit the name dimension reduction that we may sometimes wish to avoid.
Say we have a five-row matrix and extract a row from it.
First, let us display the matrix z again:
z
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
Now, extract the second row of z and save the result as a new object
called r:
r <- z[2,]
r
[1] 2 7 12
This seems innocuous, but note the format in which R has displayed r.
It’s a vector format, not a matrix format. In other words, r is a vector
of length 3, rather than a 1-by-3 matrix. We can confirm this in a
couple of ways:
The function attributes() returns the attributes of an object, which
for a matrix include its dimensions. When we pass the matrix z as input,
we get its number of rows and columns:
attributes(z)
$dim
[1] 5 3
However, when we pass r as the argument, there are no attributes at
all, and in particular no dimensions:
attributes(r)
NULL
In a similar way, we can confirm our assumptions by using the
structure function str().
Input z:
str(z)
int [1:5, 1:3] 1 2 3 4 5 6 7 8 9 10 ...
We can access values of z with
two-dimensional indexing.
On the other hand, let us use r as the input:
str(r)
int [1:3] 2 7 12
We can access values of r with
one-dimensional indexing.
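The loss of dimensions matters because code written for matrices may no longer behave as expected. The following sketch, which uses only base R functions, shows what a few common queries report for r:

```r
z <- matrix(1:15, nrow = 5)
r <- z[2, ]      # the row is returned as a plain vector

is.matrix(r)     # FALSE
nrow(r)          # NULL, because r has no dim attribute
length(r)        # 3
```

A function that relies on nrow() or ncol() would receive NULL here instead of a number.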
Fortunately, R has a way to suppress this dimension reduction: the
drop argument. Here’s an example, using the matrix z from above:
r <- z[2,, drop=FALSE]
r
[,1] [,2] [,3]
[1,] 2 7 12
With drop=FALSE, the result keeps its dimensions, so r is a matrix. We
can confirm this by applying more functions to our new r:
First, apply the dim() function:
dim(r)
[1] 1 3
Now, r is a matrix with 1 row and 3 columns.
Apply the class() function next:
class(r)
[1] "matrix" "array"
This is exactly the output that we got above when we applied the
class() function to z. Indeed, r is now interpreted as a matrix as
well.
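The drop argument is most useful when the number of selected rows is not known in advance. The sketch below assumes a hypothetical helper, keep_rows(), that is not part of R and is written only for this illustration. It filters rows by a threshold on the first column and returns a matrix whether the filter matches several rows, one row, or none.

```r
z <- matrix(1:15, nrow = 5)

# Hypothetical helper: keep the rows whose first column exceeds a threshold.
# drop = FALSE guarantees a matrix result for any number of matching rows.
keep_rows <- function(m, threshold) {
  m[m[, 1] > threshold, , drop = FALSE]
}

dim(keep_rows(z, 2))   # 3 3
dim(keep_rows(z, 4))   # 1 3, still a matrix
dim(keep_rows(z, 5))   # 0 3, an empty matrix
```

Without drop=FALSE, the second call would return a vector of length 3, and any later call to nrow() or ncol() on the result would return NULL.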
Naming Matrix Rows and Columns
The natural way to refer to rows and columns in a matrix is via the
row and column numbers. However, you can also give names to these
entities. Here’s an example:
Let us display z again:
z
[,1] [,2] [,3]
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
Apply the function colnames():
colnames(z)
NULL
We observe that z has no column names. We can add them by assigning a
character vector to colnames(z):
colnames(z) <- c("First Column","Second Column", "Third Column")
Let us see what happens if we print z now:
z
First Column Second Column Third Column
[1,] 1 6 11
[2,] 2 7 12
[3,] 3 8 13
[4,] 4 9 14
[5,] 5 10 15
Instead of the indices [,1], [,2], and [,3], our assigned column
names are now displayed above each column.
We can also return just the column names without the specific column
values of the matrix:
colnames(z)
[1] "First Column" "Second Column" "Third Column"
We can also extract a specific column by using its column name
rather than its index 1, 2, or 3:
z[,"Third Column"]
[1] 11 12 13 14 15
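Rows can be named in the same way with rownames(). In the sketch below, the labels obs1 to obs5 are made up for illustration. Once both rows and columns have names, a single element can be addressed by name, and dimnames() returns both sets of names as a list.

```r
z <- matrix(1:15, nrow = 5)
colnames(z) <- c("First Column", "Second Column", "Third Column")
rownames(z) <- paste0("obs", 1:5)   # hypothetical row labels

z["obs2", "Third Column"]   # 12, the same element as z[2, 3]
dimnames(z)                 # a list: row names first, then column names
```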
Higher-Dimensional Arrays
In a statistical context, a typical matrix in R has rows
corresponding to observations, say on various people, and columns
corresponding to variables, such as weight and blood pressure. The
matrix is then a two-dimensional data structure. But suppose we also
have data taken at different times, one data point per person per
variable per time. Time then becomes the third dimension, in addition to
rows and columns. In R, such data sets are called arrays. As a simple
example, consider students and test scores. Say each test consists of
four parts, so we record four scores for a student for each test. Now
suppose that we have two tests, and to keep the example small, assume we
have only three students. Here’s the data for the first test:
firsttest <- matrix(nrow=3,ncol=4)
firsttest[1,1] <- 12
firsttest[2,1] <- 24
firsttest[1,2] <- 31
firsttest[2,2] <- 39
firsttest[3,1] <- 21
firsttest[3,2] <- 35
firsttest[1,3] <- 4
firsttest[2,3] <- 2
firsttest[1,4] <- 30
firsttest[2,4] <- 25
firsttest[3,3] <- 5
firsttest[3,4] <- 22
Let us print the newly created matrix:
firsttest
[,1] [,2] [,3] [,4]
[1,] 12 31 4 30
[2,] 24 39 2 25
[3,] 21 35 5 22
Student 1 had scores of 12, 31, 4, and 30 on the first test, student
2 scored 24, 39, 2, and 25, and so on. Here are the scores for the same
students on the second test:
secondtest <- matrix(nrow=3,ncol=4)
secondtest[1,1] <- 14
secondtest[2,1] <- 22
secondtest[1,2] <- 36
secondtest[2,2] <- 37
secondtest[3,1] <- 19
secondtest[3,2] <- 35
secondtest[1,3] <- 5
secondtest[2,3] <- 6
secondtest[1,4] <- 28
secondtest[2,4] <- 22
secondtest[3,3] <- 4
secondtest[3,4] <- 27
Let us print the results for the second test:
secondtest
[,1] [,2] [,3] [,4]
[1,] 14 36 5 28
[2,] 22 37 6 22
[3,] 19 35 4 27
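The element-by-element assignments above can also be written as one call per matrix. This is an alternative sketch rather than the approach used in the text: it passes the scores row by row and sets byrow=TRUE so that each line of the vector becomes one student's row.

```r
# One row of values per student; byrow = TRUE fills the matrix row by row.
firsttest <- matrix(c(12, 31, 4, 30,
                      24, 39, 2, 25,
                      21, 35, 5, 22),
                    nrow = 3, byrow = TRUE)

secondtest <- matrix(c(14, 36, 5, 28,
                       22, 37, 6, 22,
                       19, 35, 4, 27),
                     nrow = 3, byrow = TRUE)

firsttest
secondtest
```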
Now let’s put both tests into one data structure, which we’ll name
tests. We’ll arrange it to have two “layers”—one layer per test—with
three rows and four columns within each layer. We’ll store firsttest in
the first layer and secondtest in the second. In layer 1, there will be
three rows for the three students’ scores on the first test, with four
columns per row for the four portions of a test. We use R’s array
function to create the data structure:
tests <- array(data=c(firsttest,secondtest),dim=c(3,4,2))
In the argument dim=c(3,4,2), we are specifying two layers (this is
the 2 in the third position of the vector), each consisting of three rows
and four columns. This then becomes an attribute of the data structure:
attributes(tests)
$dim
[1] 3 4 2
Each element of tests now has three subscripts, rather than two as in
the matrix case. The first subscript corresponds to the first element in
the $dim vector, the second subscript corresponds to the second element
in the vector, and so on. For instance, the score on the second portion
of test 1 for student 3 is retrieved as follows:
tests[3,2,1]
[1] 35
If we now print the new array tests, we get firsttest in the first
layer [,,1] and secondtest in the second layer [,,2]:
tests
, , 1
[,1] [,2] [,3] [,4]
[1,] 12 31 4 30
[2,] 24 39 2 25
[3,] 21 35 5 22
, , 2
[,1] [,2] [,3] [,4]
[1,] 14 36 5 28
[2,] 22 37 6 22
[3,] 19 35 4 27
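Leaving a subscript empty selects everything along that dimension, just as with matrices. The sketch below rebuilds tests so that it runs on its own, then extracts a whole layer and all scores of one student. The last line uses apply() to sum over the four portions; the name totals is introduced here only for illustration.

```r
firsttest <- matrix(c(12, 31, 4, 30,
                      24, 39, 2, 25,
                      21, 35, 5, 22), nrow = 3, byrow = TRUE)
secondtest <- matrix(c(14, 36, 5, 28,
                       22, 37, 6, 22,
                       19, 35, 4, 27), nrow = 3, byrow = TRUE)
tests <- array(data = c(firsttest, secondtest), dim = c(3, 4, 2))

tests[, , 2]    # the second layer, equal to secondtest
tests[3, , ]    # student 3: a 4-by-2 matrix, one column per test

# Sum over the portions (dimension 2), keeping students and tests.
totals <- apply(tests, c(1, 3), sum)
totals          # 3-by-2 matrix: one row per student, one column per test
```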
Just as we built our three-dimensional array by combining two
matrices, we can build four-dimensional arrays by combining two or more
three-dimensional arrays, and so on.
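As a minimal sketch of that last step, the code below combines two three-dimensional arrays into a four-dimensional one. The values are placeholders, and the names retests and both are hypothetical; the fourth subscript selects which of the two original arrays an element came from.

```r
tests <- array(1:24, dim = c(3, 4, 2))   # placeholder values, not real scores
retests <- tests + 100L                  # hypothetical second 3-D array

# Concatenate the data and add a fourth dimension of length 2.
both <- array(c(tests, retests), dim = c(3, 4, 2, 2))

dim(both)            # 3 4 2 2
both[3, 2, 1, 2]     # the element [3, 2, 1] of retests
```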
