Playing with matrices and matrix-like objects (e.g. data frames)

Matrices are an important data structure in the R programming language because they can store the tabular information that we usually deal with in bioinformatics, for example a table of gene expression values.

R is very good at reading tables, in contrast to other languages such as C.

Note that a data.frame is different from a matrix. Some functions require a data.frame, others require a matrix, and others accept both data types.

Note also that a data frame is a matrix-like data type that can hold a different type of data in every column. Within a column, however, there must be only a single data type (i.e., only integers, only factors or only characters).
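As a small illustration of this point, the sketch below builds a toy data frame and checks the type of each column. The column names (gene, expr, detected) and the values are made up for this example.

```r
## toy data frame: every column has its own type (names and values are hypothetical)
df <- data.frame(gene = c("g1", "g2", "g3"),
                 expr = c(10.5, 3.2, 8.1),
                 detected = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)
col_types <- sapply(df, class)  ## one class per column
col_types
is.data.frame(df)  ## a data frame ...
is.matrix(df)      ## ... is not a matrix
```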

Read in files

FILE –> data.frame

In R you can read in a data file and save its contents in a data.frame.

To read in a file that contains data organized in a table format we use the function read.table. It is worthwhile to read the help page of read.table carefully.

?read.table
read.table {utils} R Documentation

Data Input

Description

Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.

Usage

read.table(file, header = FALSE, sep = "", quote = "\"'",
           dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
           row.names, col.names, as.is = !stringsAsFactors,
           na.strings = "NA", colClasses = NA, nrows = -1,
           skip = 0, check.names = TRUE, fill = !blank.lines.skip,
           strip.white = FALSE, blank.lines.skip = TRUE,
           comment.char = "#",
           allowEscapes = FALSE, flush = FALSE,
           stringsAsFactors = default.stringsAsFactors(),
           fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)

read.csv(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", ...)

read.csv2(file, header = TRUE, sep = ";", quote = "\"",
          dec = ",", fill = TRUE, comment.char = "", ...)

read.delim(file, header = TRUE, sep = "\t", quote = "\"",
           dec = ".", fill = TRUE, comment.char = "", ...)

read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
            dec = ",", fill = TRUE, comment.char = "", ...)

We need to pay attention to the default values of several arguments.

header = FALSE
## By default, read.table assumes that the data do not include a header. Thus, if the file actually has a header and we keep 'header=FALSE', the header line will be read as the first row of the data.
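
The effect of the header argument can be checked on a small in-memory example. The sketch below uses the text argument of read.table instead of a file; the sample names s1 and s2 are made up.

```r
## the same toy text read with and without a header
txt <- "s1 s2\n1 2\n3 4"
no_header <- read.table(text = txt, header = FALSE)
with_header <- read.table(text = txt, header = TRUE)
dim(no_header)         ## 3 rows: the header line became part of the data
dim(with_header)       ## 2 rows
colnames(no_header)    ## default names V1, V2
colnames(with_header)  ## names taken from the first line
```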

sep=""
## It means that the separator is any whitespace (one or more spaces, tabs, newlines or carriage returns).
## The problem is that cells with missing values that are not marked with a special string (e.g. NA)
## cause an error. For example, in a tab-delimited file:
## 1 2 3
## 1   3
## The second element of the second row is missing. However, with the default separator, R **MERGES** consecutive delimiters into one.
## Thus, read.table stops with an error because the data are no longer in table format (the second row has only 2 elements).

## Here, we should have used:
sep="\t"  ## indicates specifically that the delimiter is the tab character.
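
The behaviour described above can be reproduced without a file. This is a minimal sketch on toy data, again using the text argument of read.table; tryCatch is used only to capture the error message instead of stopping.

```r
## toy tab-delimited text; the second field of the second line is empty
txt <- "1\t2\t3\n1\t\t3"
## default sep = "": the two consecutive tabs are merged, so line 2 has only 2 fields
res_default <- tryCatch(read.table(text = txt),
                        error = function(e) conditionMessage(e))
res_default  ## an error message, not a data frame
## sep = "\t": the empty field is kept and read as NA
res_tab <- read.table(text = txt, sep = "\t")
res_tab
```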

quote="\"'" ## The default quoting characters are ' and ". Text between quoting characters is treated as *one entity* and is not split. Also, R recognizes the quotation marks and does not include them in the cell's value.
## Often, when reading GEO datasets, the gene descriptions contain quotes. This may confuse the read.table function. Thus, you may need to specify your own set of quoting characters.

For example, assume that the file is as follows:

less test.file
## there is actually a tab between 'a' and the opening quote of "I am ..." and a tab between 'second' and 'column'.
## a    "I am second          column, '2nd' "
## b    3

R can read this file correctly if we type:

a <- read.table("test.file", quote="\"", sep='\t')
a
a[1,2]
## [1] I am second          column, '2nd' 
## Levels: 3 I am second          column, '2nd'

To proceed to a more ‘bioinformatics’ example, let’s assume a file, ‘expr.txt’, that contains expression data. There is a header giving the name of each sample. Thus, the columns are samples and the rows are genes. For example, the value in the second row and third column is the expression value of the second gene for the third individual. Typical expression datasets contain thousands of rows and dozens of columns. Here, we just have a small part of a dataset.

data <- read.table("expr.txt", header=TRUE)

## the dimensions of the dataset
dim(data)
## [1] 99  4
## the first rows of the dataset
head(data)
## the names of the columns
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
## the names of the rows
rownames(data) ## R gives default names to data frame objects
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
## [99] "99"

Accessing the data

We can access the first sample, i.e., the first column, in various ways.

Accessing the columns
  1. Asking for all the rows and only the first column
firstColumn <- data[,1] ## before the , there is nothing. This means 'take all rows'. After the ',' there are the columns that we want to obtain. 
firstColumn
##  [1] 1124.000  203.300   53.400  245.600    8.228  157.300   30.010
##  [8]   30.970   95.240   81.700  216.300  344.100   29.320   72.190
## [15]  829.700   12.400   18.070   83.340  481.700    7.796    9.573
## [22]   38.300   25.430   65.690   49.880   32.230   96.800   16.860
## [29]   51.610   11.730  118.100  519.800   29.660  139.800   13.300
## [36]   16.080   85.800   19.170   42.700   11.230   16.710   39.130
## [43]   11.330   17.910   14.470   18.690   55.940   46.830  164.900
## [50]   78.940  130.500   18.140   49.930   47.070   51.870  352.800
## [57]    6.948    8.081    7.132   12.760    6.904   12.240   11.390
## [64]  115.600   62.840  133.300   78.990   25.830   15.350   14.860
## [71]   13.570   27.360   53.780  217.100   27.850   19.450   13.780
## [78]   42.670    7.201   31.240   39.280   16.780  213.500   96.240
## [85]    9.344   68.450    6.186    7.127   14.600   58.540   31.710
## [92]    7.416    7.980   37.400   35.950   12.000   32.570    8.385
## [99]    9.606
  2. By using its name
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
firstColumn <- data$GSM447401
firstColumn
## (the same 99 values as shown for data[,1] above)
  3. By using its name but in a matrix-like form
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
firstColumn <- data[, "GSM447401"]
firstColumn
## (the same 99 values as shown for data[,1] above)
  4. By using a TRUE/FALSE vector

This is a useful approach. The idea is to use TRUE for the indexes that you want to obtain and FALSE for the rest.

firstColumn <- data[, c(TRUE, FALSE, FALSE, FALSE)] ## T and F are shorthand for TRUE and FALSE, but they are ordinary variables that can be overwritten, so the full names are safer
firstColumn
## (the same 99 values as shown for data[,1] above)
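
In practice the TRUE/FALSE vector is rarely typed by hand; it is usually computed from a condition. The following is a minimal sketch on a toy data frame that stands in for the expression data (the values are made up; only the column names are borrowed from the dataset above).

```r
## toy data frame standing in for the expression data (values are made up)
toy <- data.frame(GSM447401 = c(1, 2), GSM447411 = c(3, 4), GSM447413 = c(5, 6))
keep <- colnames(toy) == "GSM447401"  ## one TRUE/FALSE per column
keep
firstColumn <- toy[, keep]
firstColumn
## several columns at once: %in% tests membership in a set of names
keep2 <- colnames(toy) %in% c("GSM447401", "GSM447413")
twoColumns <- toy[, keep2]
twoColumns
```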

It’s also easy to obtain more than one column. For example, if we want to get the first and the third column:

  1. Using the column indexes
partData <- data[,c(1,3)]
head(partData)
  2. Using the column names
partData <- data[,c("GSM447401", "GSM447413")]
head(partData)

Accessing the rows

You can access the rows of a data frame in exactly the same way as the columns. Remember that R gives row names to data frames by default.

Thus,

rowData1 <- data[c(1,3), ]
rowData2 <- data[c("1", "3"), ]
rowData1
rowData2

will give access to the first and third rows. Note that the second way, i.e. rowData2, uses the names of the rows of data.
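
Access by name becomes more useful once the rows carry meaningful names. The sketch below assumes a toy data frame whose rows are labelled with hypothetical gene names (geneA, geneB, geneC) and shows that index-based and name-based access return the same rows.

```r
## toy data frame with hypothetical gene names as row names
toy <- data.frame(s1 = c(10, 20, 30), s2 = c(1, 2, 3))
rownames(toy) <- c("geneA", "geneB", "geneC")
by_index <- toy[c(1, 3), ]
by_name <- toy[c("geneA", "geneC"), ]
by_index
by_name
all(by_index == by_name)  ## both ways select the same rows
```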

Accessing distinct cells

The easiest way to access a cell value is to use its indexes. For example, to obtain access to the cell (1,2):

data[1,2]
## [1] 1196

Another way is to use the names of the rows and columns:

data["1", "GSM447401"] ## again remember that "1" is the name of the first row
## [1] 1124
## it's the same as:
data[1, "GSM447401"]
## [1] 1124

A further way is to use the TRUE/FALSE trick. For example, if we want to get all values that are above 1,000:

data[ data > 1000 ]
## [1] 1124 1196 1181 1075 1407

Let’s try to understand the previous statement. The expression data > 1000 gives a matrix of TRUE/FALSE values: TRUE where the condition data > 1000 is satisfied and FALSE otherwise. Thus:

data > 1000
##        GSM447401 GSM447411 GSM447413 GSM447415
##   [1,]      TRUE      TRUE     FALSE      TRUE
##   [2,]     FALSE     FALSE     FALSE     FALSE
##   [3,]     FALSE     FALSE     FALSE     FALSE
##   [4,]     FALSE     FALSE     FALSE     FALSE
##   [5,]     FALSE     FALSE     FALSE     FALSE
##   [6,]     FALSE     FALSE     FALSE     FALSE
##   [7,]     FALSE     FALSE     FALSE     FALSE
##   [8,]     FALSE     FALSE     FALSE     FALSE
##   [9,]     FALSE     FALSE     FALSE     FALSE
##  [10,]     FALSE     FALSE     FALSE     FALSE
##  [11,]     FALSE     FALSE     FALSE     FALSE
##  [12,]     FALSE     FALSE     FALSE     FALSE
##  [13,]     FALSE     FALSE     FALSE     FALSE
##  [14,]     FALSE     FALSE     FALSE     FALSE
##  [15,]     FALSE     FALSE     FALSE     FALSE
##  [16,]     FALSE     FALSE     FALSE     FALSE
##  [17,]     FALSE     FALSE     FALSE     FALSE
##  [18,]     FALSE     FALSE     FALSE     FALSE
##  [19,]     FALSE     FALSE     FALSE     FALSE
##  [20,]     FALSE     FALSE     FALSE     FALSE
##  [21,]     FALSE     FALSE     FALSE     FALSE
##  [22,]     FALSE     FALSE     FALSE     FALSE
##  [23,]     FALSE     FALSE     FALSE     FALSE
##  [24,]     FALSE     FALSE     FALSE     FALSE
##  [25,]     FALSE     FALSE     FALSE     FALSE
##  [26,]     FALSE     FALSE     FALSE     FALSE
##  [27,]     FALSE     FALSE     FALSE     FALSE
##  [28,]     FALSE     FALSE     FALSE     FALSE
##  [29,]     FALSE     FALSE     FALSE     FALSE
##  [30,]     FALSE     FALSE     FALSE     FALSE
##  [31,]     FALSE     FALSE     FALSE     FALSE
##  [32,]     FALSE     FALSE     FALSE     FALSE
##  [33,]     FALSE     FALSE     FALSE     FALSE
##  [34,]     FALSE     FALSE     FALSE     FALSE
##  [35,]     FALSE     FALSE     FALSE     FALSE
##  [36,]     FALSE     FALSE     FALSE     FALSE
##  [37,]     FALSE     FALSE     FALSE     FALSE
##  [38,]     FALSE     FALSE     FALSE     FALSE
##  [39,]     FALSE     FALSE     FALSE     FALSE
##  [40,]     FALSE     FALSE     FALSE     FALSE
##  [41,]     FALSE     FALSE     FALSE     FALSE
##  [42,]     FALSE     FALSE     FALSE     FALSE
##  [43,]     FALSE     FALSE     FALSE     FALSE
##  [44,]     FALSE     FALSE     FALSE     FALSE
##  [45,]     FALSE     FALSE     FALSE     FALSE
##  [46,]     FALSE     FALSE     FALSE     FALSE
##  [47,]     FALSE     FALSE     FALSE     FALSE
##  [48,]     FALSE      TRUE     FALSE      TRUE
##  [49,]     FALSE     FALSE     FALSE     FALSE
##  [50,]     FALSE     FALSE     FALSE     FALSE
##  [51,]     FALSE     FALSE     FALSE     FALSE
##  [52,]     FALSE     FALSE     FALSE     FALSE
##  [53,]     FALSE     FALSE     FALSE     FALSE
##  [54,]     FALSE     FALSE     FALSE     FALSE
##  [55,]     FALSE     FALSE     FALSE     FALSE
##  [56,]     FALSE     FALSE     FALSE     FALSE
##  [57,]     FALSE     FALSE     FALSE     FALSE
##  [58,]     FALSE     FALSE     FALSE     FALSE
##  [59,]     FALSE     FALSE     FALSE     FALSE
##  [60,]     FALSE     FALSE     FALSE     FALSE
##  [61,]     FALSE     FALSE     FALSE     FALSE
##  [62,]     FALSE     FALSE     FALSE     FALSE
##  [63,]     FALSE     FALSE     FALSE     FALSE
##  [64,]     FALSE     FALSE     FALSE     FALSE
##  [65,]     FALSE     FALSE     FALSE     FALSE
##  [66,]     FALSE     FALSE     FALSE     FALSE
##  [67,]     FALSE     FALSE     FALSE     FALSE
##  [68,]     FALSE     FALSE     FALSE     FALSE
##  [69,]     FALSE     FALSE     FALSE     FALSE
##  [70,]     FALSE     FALSE     FALSE     FALSE
##  [71,]     FALSE     FALSE     FALSE     FALSE
##  [72,]     FALSE     FALSE     FALSE     FALSE
##  [73,]     FALSE     FALSE     FALSE     FALSE
##  [74,]     FALSE     FALSE     FALSE     FALSE
##  [75,]     FALSE     FALSE     FALSE     FALSE
##  [76,]     FALSE     FALSE     FALSE     FALSE
##  [77,]     FALSE     FALSE     FALSE     FALSE
##  [78,]     FALSE     FALSE     FALSE     FALSE
##  [79,]     FALSE     FALSE     FALSE     FALSE
##  [80,]     FALSE     FALSE     FALSE     FALSE
##  [81,]     FALSE     FALSE     FALSE     FALSE
##  [82,]     FALSE     FALSE     FALSE     FALSE
##  [83,]     FALSE     FALSE     FALSE     FALSE
##  [84,]     FALSE     FALSE     FALSE     FALSE
##  [85,]     FALSE     FALSE     FALSE     FALSE
##  [86,]     FALSE     FALSE     FALSE     FALSE
##  [87,]     FALSE     FALSE     FALSE     FALSE
##  [88,]     FALSE     FALSE     FALSE     FALSE
##  [89,]     FALSE     FALSE     FALSE     FALSE
##  [90,]     FALSE     FALSE     FALSE     FALSE
##  [91,]     FALSE     FALSE     FALSE     FALSE
##  [92,]     FALSE     FALSE     FALSE     FALSE
##  [93,]     FALSE     FALSE     FALSE     FALSE
##  [94,]     FALSE     FALSE     FALSE     FALSE
##  [95,]     FALSE     FALSE     FALSE     FALSE
##  [96,]     FALSE     FALSE     FALSE     FALSE
##  [97,]     FALSE     FALSE     FALSE     FALSE
##  [98,]     FALSE     FALSE     FALSE     FALSE
##  [99,]     FALSE     FALSE     FALSE     FALSE

Now, we can use this matrix to access the data that we want, i.e., only the elements with value > 1000. The selected values are returned as a vector, read column by column.
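
The TRUE/FALSE matrix tells us which values are selected but, once the values are extracted, their positions are lost. As a possible complement, the sketch below uses which() with arr.ind = TRUE to recover the row and column of each selected cell. The toy data frame expr and its sample names s1 and s2 are assumptions made for this example.

```r
## toy expression table (values chosen for illustration)
expr <- data.frame(s1 = c(1124, 203.3, 53.4), s2 = c(1196, 181.5, 1181))
mask <- expr > 1000                 ## logical matrix, same shape as expr
high <- expr[mask]                  ## selected values, column by column
high
pos <- which(mask, arr.ind = TRUE)  ## one line per TRUE cell: its row and column
pos
```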

Matrix objects

In contrast to data.frame objects, which can handle data of different types, matrices contain only a single type of data, for example only integers, only floating-point numbers or only characters.
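
One practical consequence, sketched below on made-up data, is that converting a data frame with mixed column types to a matrix turns every value into a character string, whereas a purely numeric data frame stays numeric.

```r
## a data frame with a numeric and a character column (toy data)
mixed <- data.frame(id = c(1, 2), label = c("a", "b"), stringsAsFactors = FALSE)
m <- as.matrix(mixed)
typeof(m)  ## everything was converted to character
## a purely numeric data frame keeps its numeric type
numeric_only <- as.matrix(data.frame(x = c(1, 2), y = c(3, 4)))
typeof(numeric_only)
```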

Creating matrices

Using the as.matrix function

It is easy to convert a data.frame to a matrix with the as.matrix function.

data.matrix <- as.matrix(data)

## Note that the rows of the matrix are shown as [1,], [2,]
head(data.matrix)
##      GSM447401 GSM447411 GSM447413 GSM447415
## [1,]  1124.000  1196.000   982.800  1075.000
## [2,]   203.300   181.500   229.600   160.400
## [3,]    53.400    55.600    49.040    39.100
## [4,]   245.600   209.400   252.900   223.300
## [5,]     8.228     8.131     8.994     7.576
## [6,]   157.300   149.800   131.300   127.700
colnames(data.matrix)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
row.names(data.matrix)
## NULL

Note that by default the matrix does not contain row names. However, we can set them: the function row.names can either get or set the row names of the matrix.

row.names(data.matrix) ## get the row names
## NULL
row.names(data.matrix) <- row.names(data) ## set the row names
row.names(data.matrix) ## get the modified row names
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
## [99] "99"

Construction of a matrix using the matrix function

We can construct the matrix by giving its content as well as its dimensions.

mymat <- matrix(0, nrow=5, ncol=6) ## a 5 x 6 matrix. All contents are 0
mymat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    0    0    0    0    0    0
## [2,]    0    0    0    0    0    0
## [3,]    0    0    0    0    0    0
## [4,]    0    0    0    0    0    0
## [5,]    0    0    0    0    0    0
mymat <- matrix(NA, nrow=5, ncol=6) ## a 5 x 6 matrix. All contents are NA
mymat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]   NA   NA   NA   NA   NA   NA
## [2,]   NA   NA   NA   NA   NA   NA
## [3,]   NA   NA   NA   NA   NA   NA
## [4,]   NA   NA   NA   NA   NA   NA
## [5,]   NA   NA   NA   NA   NA   NA

You can also specify the contents of the matrix by using a vector. For example, if you want to create a 4x4 matrix with the numbers 1 to 16, arranged by rows, you do the following:

mymat <- matrix(1:16, nrow=4, ncol=4, byrow=TRUE) ## a 4x4 matrix with the elements 1-16
mymat
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16

The byrow=TRUE argument is important here because it specifies that the numbers 1-16 will be placed in the matrix starting from the cell [1,1], then [1,2] etc. The default behaviour of R is byrow = FALSE, i.e. the matrix is filled by columns: [1,1], then [2,1] etc.
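
To see the two filling orders side by side, here is a small sketch with a 2x3 matrix. It also shows that filling by rows is the same as filling a 3x2 matrix by columns and transposing it with t().

```r
m_col <- matrix(1:6, nrow = 2)                ## default: filled column by column
m_row <- matrix(1:6, nrow = 2, byrow = TRUE)  ## filled row by row
m_col
m_row
identical(m_row, t(matrix(1:6, nrow = 3)))    ## TRUE
```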

NOTE: This is a very useful way to create matrices because you can build the vector from several objects, e.g. the elements of a list. Thus, you can easily convert your data to matrix format.
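
As a rough sketch of this idea, the example below converts a small list of equal-length vectors into a matrix in two ways: by flattening the list with unlist() and filling by rows, and by binding the vectors as rows with do.call(rbind, ...). The list vecs is made up for this example, and both approaches assume that all vectors have the same length.

```r
## a toy list of two vectors of equal length
vecs <- list(c(1, 2, 3), c(4, 5, 6))
## option 1: flatten the list and fill the matrix row by row
m1 <- matrix(unlist(vecs), nrow = length(vecs), byrow = TRUE)
## option 2: bind the vectors together as rows
m2 <- do.call(rbind, vecs)
m1
m2
```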

Exercises - matrices

Exercise 1

Construct a 10x20 matrix as follows:
  - create a list of 10 vectors, each composed of 20 integers. The i-th vector contains the value j * i at its j-th position.
  - convert this list to a matrix, where each vector (element of the list) will be a row of the matrix.

Exercise 2

Create three vectors x, y, z of integers, each with 3 elements. Combine the three vectors into a 3×3 matrix A where each column represents a vector. Change the row names to a, b, c.

Exercise 3

Create a vector with 12 integers. Convert the vector to a 4×3 matrix B using matrix(). Please change the column names to x, y, z and the row names to a, b, c, d. The argument byrow in matrix() is set to FALSE by default. Please change it to TRUE and print B to see the differences.

Exercise 4

Please obtain the transpose of B and name it tB.

Exercise 5

Now tB is a 3×4 matrix. By the rule of matrix multiplication in algebra, can we perform tB*tB in R language? (Is a 3×4 matrix multiplied by a 3×4 allowed?) What result would we get? HINT: the multiplication symbol of matrices is %*%

Exercise 6

Download the dataset GDS3309 from the NCBI.
  - Clean it (remove the lines starting with !, ^ or #)
  - Inspect whether it is normalized
  - For every column find the mean
  - For every column find the standard deviation
  - Create a new matrix where each column is standardized. That is, subtract the mean of the column from the elements of the column and divide by the standard deviation of the column.
  - For each column find the gene with the maximum value
  - For each column find the gene with the minimum value