Matrices are an important data structure in the R programming language because they can store the kind of information that we usually deal with in bioinformatics, for example expression datasets in which the rows are genes and the columns are samples.
R makes it easy to read tables, in contrast to other languages such as C.
Note that a data.frame is different from a matrix. Some functions require a data.frame, other functions require a matrix, and others accept both data types.
Note also that a data frame is a matrix-like data type that can hold a different type of data in every column. Within a column, however, there must be only a single data type (i.e., only integers, or only factors, or only characters).
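The difference between the two data types can be seen on a small, made-up table. This is only a minimal sketch: the column names and values are illustrative assumptions and do not come from any dataset used in this tutorial.

```r
## A small, hypothetical data frame: each column has its own type
df <- data.frame(gene = c("geneA", "geneB", "geneC"),
                 expression = c(10.5, 3.2, 7.8),
                 detected = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)

## One type per column
col_types <- sapply(df, class)
print(col_types)

## A matrix can hold only one type, so all values are converted to character
m <- as.matrix(df)
print(is.character(m))
print(m)
```

Because the data frame contains a character column, the only type that can represent every value in the matrix is character, so the numbers and logical values are stored as text.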
In R you can read in a data file and save its contents in a data.frame.
To read in a file that contains data organized in a table format, we use the function read.table. It is worthwhile to read the help page of read.table carefully.
?read.table
| read.table {utils} | R Documentation |
Reads a file in table format and creates a data frame from it, with cases corresponding to lines and variables to fields in the file.
read.table(file, header = FALSE, sep = "", quote = "\"'",
dec = ".", numerals = c("allow.loss", "warn.loss", "no.loss"),
row.names, col.names, as.is = !stringsAsFactors,
na.strings = "NA", colClasses = NA, nrows = -1,
skip = 0, check.names = TRUE, fill = !blank.lines.skip,
strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#",
allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(),
fileEncoding = "", encoding = "unknown", text, skipNul = FALSE)
read.csv(file, header = TRUE, sep = ",", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
read.csv2(file, header = TRUE, sep = ";", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
read.delim(file, header = TRUE, sep = "\t", quote = "\"",
dec = ".", fill = TRUE, comment.char = "", ...)
read.delim2(file, header = TRUE, sep = "\t", quote = "\"",
dec = ",", fill = TRUE, comment.char = "", ...)
We need to pay attention to the default values of several arguments.
header = FALSE
## By default, R assumes that the data do not include a header (unless the first line has one field fewer than the other lines, in which case R treats it as a header). Thus, if the data actually include a header and we say 'header=FALSE', the header line will become part of the data.
sep=""
## It means that the separator is any whitespace: one or more spaces, tabs or newlines.
## The problem is that a cell with a missing value that is not marked by any special string
## cannot be detected, and read.table stops with an error. For example (columns separated by tabs):
## 1<TAB>2<TAB>3
## 1<TAB><TAB>3
## The second element of the second row is missing. However, with the default separator, R **MERGES** consecutive delimiters into one.
## Thus, read.table stops with an error because the data no longer look like a table (the second row appears to have only 2 elements).
## Here, we should have used:
sep="\t" ## states explicitly that the delimiter is the tab character.
quote="\"'" ## The default quoting characters are ' and ". Everything between a pair of quoting characters is treated as *one entity* and is not split at separators. R also recognizes the quoting characters themselves and does not include them in the cell's value.
## Often, when reading GEO datasets, the gene descriptions contain quotes. This may confuse the read.table function. Thus, you may need to specify your own set of quoting characters.
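The separator problem described above can be reproduced with a two-line file. The sketch below writes a hypothetical tab-separated file to a temporary location (the file and its contents are assumptions made for illustration) and reads it with both the default and the explicit separator.

```r
## Write a small tab-separated file in which the middle cell of row 2 is empty
tmp <- tempfile(fileext = ".txt")
writeLines(c("1\t2\t3", "1\t\t3"), tmp)

## Default separator: the two consecutive tabs are merged, so row 2 seems
## to have only 2 fields and read.table stops. We catch the error message.
default_result <- tryCatch(read.table(tmp),
                           error = function(e) conditionMessage(e))
print(default_result)

## Explicit tab separator: the empty cell is kept and is read as NA
tab_result <- read.table(tmp, sep = "\t")
print(tab_result)

unlink(tmp)
```

With sep = "\t", every tab starts a new field, so the empty field survives and R marks it as a missing value.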
For example, assume that the file is as follows:
less test.file
## Note: there is a tab between 'a' and '"I am ...', and another tab between 'second' and 'column'.
## a "I am second column, '2nd' "
## b 3
R can read this file correctly if we type:
a <- read.table("test.file", quote="\"", sep='\t')
a
a[1,2]
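## [1] I am second column, '2nd'
## Levels: 3 I am second column, '2nd'
## (The 'Levels:' line appears because, before R 4.0.0, read.table converted character columns to factors by default; in newer versions the value is printed as a plain character string.)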
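To try this without creating test.file by hand, the following sketch writes a similar file to a temporary location and reads it back. It is a simplified variant of the file above (an assumption for illustration): there is no tab inside the quoted text and no trailing space, and stringsAsFactors = FALSE is set explicitly so that the result is the same in old and new versions of R.

```r
## A hypothetical two-line, tab-separated file with a double-quoted field
tmp <- tempfile(fileext = ".txt")
writeLines(c("a\t\"I am second column, '2nd'\"",
             "b\t3"), tmp)

## Only the double quote is a quoting character; the single quotes
## inside the field are kept as ordinary characters
x <- read.table(tmp, quote = "\"", sep = "\t", stringsAsFactors = FALSE)
print(x)
print(x[1, 2])

unlink(tmp)
```

Note that the second column is read as character, because not all of its values are numbers.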
To proceed to a more ‘bioinformatics’ example, let’s assume a file with expression data, ‘expr.txt’. The file has a header that gives the name of each sample. Thus, the columns are the samples and the rows are the genes. For example, the value in the second row and third column is the expression value of the second gene in the third sample. Typical expression datasets contain thousands of rows and dozens of columns. Here, we have just a small part of a dataset.
data <- read.table("expr.txt", header=TRUE)
## the dimensions of the dataset
dim(data)
## [1] 99 4
## the first rows of the dataset
head(data)
## the names of the columns
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
## the names of the rows
rownames(data) ## R gives default row names ("1", "2", ...) to data frame objects
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
## [99] "99"
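The role of header=TRUE in the read.table call can be checked on a tiny, made-up expression file. The sample names and values below are assumptions for illustration only; they are not taken from expr.txt.

```r
## A hypothetical whitespace-separated file: one header line, two genes
tmp <- tempfile(fileext = ".txt")
writeLines(c("sampleA sampleB sampleC",
             "10.5 11.2 9.8",
             "3.2 2.9 3.5"), tmp)

## With header = TRUE the first line gives the column names
with_header <- read.table(tmp, header = TRUE)
print(dim(with_header))
print(colnames(with_header))

## With header = FALSE the first line becomes a row of data,
## so the columns can no longer be read as numbers
no_header <- read.table(tmp, header = FALSE)
print(dim(no_header))
print(sapply(no_header, is.numeric))

unlink(tmp)
```

A quick look at dim() and at the column types after reading a file is therefore a cheap way to notice a forgotten header argument.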
We can access the first sample, i.e., the first column, in various ways:
firstColumn <- data[,1] ## there is nothing before the ','. This means 'take all rows'. After the ',' come the columns that we want to obtain.
firstColumn
## [1] 1124.000 203.300 53.400 245.600 8.228 157.300 30.010
## [8] 30.970 95.240 81.700 216.300 344.100 29.320 72.190
## [15] 829.700 12.400 18.070 83.340 481.700 7.796 9.573
## [22] 38.300 25.430 65.690 49.880 32.230 96.800 16.860
## [29] 51.610 11.730 118.100 519.800 29.660 139.800 13.300
## [36] 16.080 85.800 19.170 42.700 11.230 16.710 39.130
## [43] 11.330 17.910 14.470 18.690 55.940 46.830 164.900
## [50] 78.940 130.500 18.140 49.930 47.070 51.870 352.800
## [57] 6.948 8.081 7.132 12.760 6.904 12.240 11.390
## [64] 115.600 62.840 133.300 78.990 25.830 15.350 14.860
## [71] 13.570 27.360 53.780 217.100 27.850 19.450 13.780
## [78] 42.670 7.201 31.240 39.280 16.780 213.500 96.240
## [85] 9.344 68.450 6.186 7.127 14.600 58.540 31.710
## [92] 7.416 7.980 37.400 35.950 12.000 32.570 8.385
## [99] 9.606
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
firstColumn <- data$GSM447401
firstColumn
## [1] 1124.000 203.300 53.400 245.600 8.228 157.300 30.010
## [8] 30.970 95.240 81.700 216.300 344.100 29.320 72.190
## [15] 829.700 12.400 18.070 83.340 481.700 7.796 9.573
## [22] 38.300 25.430 65.690 49.880 32.230 96.800 16.860
## [29] 51.610 11.730 118.100 519.800 29.660 139.800 13.300
## [36] 16.080 85.800 19.170 42.700 11.230 16.710 39.130
## [43] 11.330 17.910 14.470 18.690 55.940 46.830 164.900
## [50] 78.940 130.500 18.140 49.930 47.070 51.870 352.800
## [57] 6.948 8.081 7.132 12.760 6.904 12.240 11.390
## [64] 115.600 62.840 133.300 78.990 25.830 15.350 14.860
## [71] 13.570 27.360 53.780 217.100 27.850 19.450 13.780
## [78] 42.670 7.201 31.240 39.280 16.780 213.500 96.240
## [85] 9.344 68.450 6.186 7.127 14.600 58.540 31.710
## [92] 7.416 7.980 37.400 35.950 12.000 32.570 8.385
## [99] 9.606
colnames(data)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
firstColumn <- data[, "GSM447401"]
firstColumn
## [1] 1124.000 203.300 53.400 245.600 8.228 157.300 30.010
## [8] 30.970 95.240 81.700 216.300 344.100 29.320 72.190
## [15] 829.700 12.400 18.070 83.340 481.700 7.796 9.573
## [22] 38.300 25.430 65.690 49.880 32.230 96.800 16.860
## [29] 51.610 11.730 118.100 519.800 29.660 139.800 13.300
## [36] 16.080 85.800 19.170 42.700 11.230 16.710 39.130
## [43] 11.330 17.910 14.470 18.690 55.940 46.830 164.900
## [50] 78.940 130.500 18.140 49.930 47.070 51.870 352.800
## [57] 6.948 8.081 7.132 12.760 6.904 12.240 11.390
## [64] 115.600 62.840 133.300 78.990 25.830 15.350 14.860
## [71] 13.570 27.360 53.780 217.100 27.850 19.450 13.780
## [78] 42.670 7.201 31.240 39.280 16.780 213.500 96.240
## [85] 9.344 68.450 6.186 7.127 14.600 58.540 31.710
## [92] 7.416 7.980 37.400 35.950 12.000 32.570 8.385
## [99] 9.606
TRUE/FALSE vector
This is a useful approach. The idea is to use TRUE for the indexes that you want to obtain and FALSE otherwise.
firstColumn <- data[, c(T, F, F, F)] ## T and F are shorthand for TRUE and FALSE. Unlike TRUE and FALSE, however, T and F are ordinary variables that can be overwritten, so the full names are safer in scripts
firstColumn
## [1] 1124.000 203.300 53.400 245.600 8.228 157.300 30.010
## [8] 30.970 95.240 81.700 216.300 344.100 29.320 72.190
## [15] 829.700 12.400 18.070 83.340 481.700 7.796 9.573
## [22] 38.300 25.430 65.690 49.880 32.230 96.800 16.860
## [29] 51.610 11.730 118.100 519.800 29.660 139.800 13.300
## [36] 16.080 85.800 19.170 42.700 11.230 16.710 39.130
## [43] 11.330 17.910 14.470 18.690 55.940 46.830 164.900
## [50] 78.940 130.500 18.140 49.930 47.070 51.870 352.800
## [57] 6.948 8.081 7.132 12.760 6.904 12.240 11.390
## [64] 115.600 62.840 133.300 78.990 25.830 15.350 14.860
## [71] 13.570 27.360 53.780 217.100 27.850 19.450 13.780
## [78] 42.670 7.201 31.240 39.280 16.780 213.500 96.240
## [85] 9.344 68.450 6.186 7.127 14.600 58.540 31.710
## [92] 7.416 7.980 37.400 35.950 12.000 32.570 8.385
## [99] 9.606
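The four ways of selecting a column can be compared side by side on a miniature table. This is a sketch on a hypothetical data frame (the names S1 to S3 and the values are made up), which also shows how a comparison can build the TRUE/FALSE vector instead of typing it by hand.

```r
## Hypothetical miniature of an expression table
expr <- data.frame(S1 = c(1, 2, 3),
                   S2 = c(4, 5, 6),
                   S3 = c(7, 8, 9))

by_index   <- expr[, 1]                       # by position
by_dollar  <- expr$S1                         # by name, with $
by_name    <- expr[, "S1"]                    # by name, inside [ ]
by_logical <- expr[, c(TRUE, FALSE, FALSE)]   # by TRUE/FALSE vector

## The logical vector can be computed from the column names
by_test <- expr[, colnames(expr) == "S1"]

print(by_index)
print(identical(by_index, by_dollar))
print(identical(by_index, by_name))
print(identical(by_index, by_logical))
print(identical(by_index, by_test))
```

Computing the logical vector is what makes this approach practical for real datasets, where typing one TRUE or FALSE per column is not feasible.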
It’s also easy to obtain more than one column. For example, if we want to get the first and the third column:
partData <- data[,c(1,3)]
head(partData)
partData <- data[,c("GSM447401", "GSM447413")]
head(partData)
You can access the rows of a data frame in exactly the same way as the columns. Remember that R gives row names to data frames by default.
Thus,
rowData1 <- data[c(1,3), ]
rowData2 <- data[c("1", "3"), ]
rowData1
rowData2
will give access to the first and third rows. Note that the second way, i.e. rowData2, uses the names of the rows of data.
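Row access by name becomes more useful once the rows carry meaningful names. The sketch below uses a hypothetical data frame and made-up gene names (geneA, geneB, geneC) to show both the default and the custom row names.

```r
## Hypothetical data frame with the default row names "1", "2", "3"
expr <- data.frame(S1 = c(1, 2, 3), S2 = c(4, 5, 6))

rows_by_index <- expr[c(1, 3), ]
rows_by_name  <- expr[c("1", "3"), ]
print(rownames(rows_by_index))
print(all(rows_by_index == rows_by_name))

## Replace the default row names with (made-up) gene names
rownames(expr) <- c("geneA", "geneB", "geneC")
gene_rows <- expr[c("geneA", "geneC"), ]
print(gene_rows)
```

Selecting rows of a data frame returns a data frame, and the selected rows keep their original names.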
The easiest way to access a cell value is to use its indexes. For example, to obtain access to the cell (1,2):
data[1,2]
## [1] 1196
Another way is by the names of the rows and columns:
data["1", "GSM447401"] ## again, remember that "1" is the name of the first row
## [1] 1124
## it's the same as:
data[1, "GSM447401"]
## [1] 1124
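The ways of addressing a single cell can be mixed freely. A minimal sketch on a hypothetical two-column data frame (names and values are made up):

```r
## Hypothetical data frame with default row names
expr <- data.frame(S1 = c(1, 2, 3), S2 = c(4, 5, 6))

cell_by_index <- expr[1, 2]        # row 1, column 2
cell_by_names <- expr["1", "S2"]   # row named "1", column named "S2"
cell_mixed    <- expr[1, "S2"]     # row by index, column by name
cell_dollar   <- expr$S2[1]        # first element of column S2

print(c(cell_by_index, cell_by_names, cell_mixed, cell_dollar))
```

All four expressions address the same cell, so they return the same value.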
A further way is to use the TRUE/FALSE trick. For example if we want to get all values that are above 1,000:
data[ data > 1000 ]
## [1] 1124 1196 1181 1075 1407
Let’s try to understand the previous statement. The expression data > 1000 gives a matrix of TRUE/FALSE values: it contains TRUE where the condition data > 1000 is satisfied and FALSE otherwise. Thus:
data > 1000
## GSM447401 GSM447411 GSM447413 GSM447415
## [1,] TRUE TRUE FALSE TRUE
## [2,] FALSE FALSE FALSE FALSE
## [3,] FALSE FALSE FALSE FALSE
## [4,] FALSE FALSE FALSE FALSE
## [5,] FALSE FALSE FALSE FALSE
## [6,] FALSE FALSE FALSE FALSE
## [7,] FALSE FALSE FALSE FALSE
## [8,] FALSE FALSE FALSE FALSE
## [9,] FALSE FALSE FALSE FALSE
## [10,] FALSE FALSE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE
## [12,] FALSE FALSE FALSE FALSE
## [13,] FALSE FALSE FALSE FALSE
## [14,] FALSE FALSE FALSE FALSE
## [15,] FALSE FALSE FALSE FALSE
## [16,] FALSE FALSE FALSE FALSE
## [17,] FALSE FALSE FALSE FALSE
## [18,] FALSE FALSE FALSE FALSE
## [19,] FALSE FALSE FALSE FALSE
## [20,] FALSE FALSE FALSE FALSE
## [21,] FALSE FALSE FALSE FALSE
## [22,] FALSE FALSE FALSE FALSE
## [23,] FALSE FALSE FALSE FALSE
## [24,] FALSE FALSE FALSE FALSE
## [25,] FALSE FALSE FALSE FALSE
## [26,] FALSE FALSE FALSE FALSE
## [27,] FALSE FALSE FALSE FALSE
## [28,] FALSE FALSE FALSE FALSE
## [29,] FALSE FALSE FALSE FALSE
## [30,] FALSE FALSE FALSE FALSE
## [31,] FALSE FALSE FALSE FALSE
## [32,] FALSE FALSE FALSE FALSE
## [33,] FALSE FALSE FALSE FALSE
## [34,] FALSE FALSE FALSE FALSE
## [35,] FALSE FALSE FALSE FALSE
## [36,] FALSE FALSE FALSE FALSE
## [37,] FALSE FALSE FALSE FALSE
## [38,] FALSE FALSE FALSE FALSE
## [39,] FALSE FALSE FALSE FALSE
## [40,] FALSE FALSE FALSE FALSE
## [41,] FALSE FALSE FALSE FALSE
## [42,] FALSE FALSE FALSE FALSE
## [43,] FALSE FALSE FALSE FALSE
## [44,] FALSE FALSE FALSE FALSE
## [45,] FALSE FALSE FALSE FALSE
## [46,] FALSE FALSE FALSE FALSE
## [47,] FALSE FALSE FALSE FALSE
## [48,] FALSE TRUE FALSE TRUE
## [49,] FALSE FALSE FALSE FALSE
## [50,] FALSE FALSE FALSE FALSE
## [51,] FALSE FALSE FALSE FALSE
## [52,] FALSE FALSE FALSE FALSE
## [53,] FALSE FALSE FALSE FALSE
## [54,] FALSE FALSE FALSE FALSE
## [55,] FALSE FALSE FALSE FALSE
## [56,] FALSE FALSE FALSE FALSE
## [57,] FALSE FALSE FALSE FALSE
## [58,] FALSE FALSE FALSE FALSE
## [59,] FALSE FALSE FALSE FALSE
## [60,] FALSE FALSE FALSE FALSE
## [61,] FALSE FALSE FALSE FALSE
## [62,] FALSE FALSE FALSE FALSE
## [63,] FALSE FALSE FALSE FALSE
## [64,] FALSE FALSE FALSE FALSE
## [65,] FALSE FALSE FALSE FALSE
## [66,] FALSE FALSE FALSE FALSE
## [67,] FALSE FALSE FALSE FALSE
## [68,] FALSE FALSE FALSE FALSE
## [69,] FALSE FALSE FALSE FALSE
## [70,] FALSE FALSE FALSE FALSE
## [71,] FALSE FALSE FALSE FALSE
## [72,] FALSE FALSE FALSE FALSE
## [73,] FALSE FALSE FALSE FALSE
## [74,] FALSE FALSE FALSE FALSE
## [75,] FALSE FALSE FALSE FALSE
## [76,] FALSE FALSE FALSE FALSE
## [77,] FALSE FALSE FALSE FALSE
## [78,] FALSE FALSE FALSE FALSE
## [79,] FALSE FALSE FALSE FALSE
## [80,] FALSE FALSE FALSE FALSE
## [81,] FALSE FALSE FALSE FALSE
## [82,] FALSE FALSE FALSE FALSE
## [83,] FALSE FALSE FALSE FALSE
## [84,] FALSE FALSE FALSE FALSE
## [85,] FALSE FALSE FALSE FALSE
## [86,] FALSE FALSE FALSE FALSE
## [87,] FALSE FALSE FALSE FALSE
## [88,] FALSE FALSE FALSE FALSE
## [89,] FALSE FALSE FALSE FALSE
## [90,] FALSE FALSE FALSE FALSE
## [91,] FALSE FALSE FALSE FALSE
## [92,] FALSE FALSE FALSE FALSE
## [93,] FALSE FALSE FALSE FALSE
## [94,] FALSE FALSE FALSE FALSE
## [95,] FALSE FALSE FALSE FALSE
## [96,] FALSE FALSE FALSE FALSE
## [97,] FALSE FALSE FALSE FALSE
## [98,] FALSE FALSE FALSE FALSE
## [99,] FALSE FALSE FALSE FALSE
Now, we can use this matrix inside the square brackets, as in data[ data > 1000 ], to access the data that we want, i.e., only the elements with value > 1000.
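The two steps can be separated to make the mechanism visible. The sketch below uses a small hypothetical data frame; the use of which() with arr.ind = TRUE to recover the row and column of each hit is an addition that the text above does not cover.

```r
## Hypothetical miniature expression table
expr <- data.frame(S1 = c(1124, 203.3, 53.4),
                   S2 = c(1196, 181.5, 55.6))

## Step 1: the comparison gives a logical matrix of the same shape
mask <- expr > 1000
print(mask)

## Step 2: the logical matrix selects the values, column by column
high <- expr[mask]
print(high)

## How many values pass the threshold, and where they are
n_high <- sum(mask)
positions <- which(mask, arr.ind = TRUE)
print(n_high)
print(positions)
```

Note that data[mask] returns a plain vector, so the row and column of each value are lost; which() is one way to keep them.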
In contrast to data.frame objects, which can handle data of different types, a matrix contains only a single type of data: for example, only integers, only floating-point numbers or only characters.
as.matrix function
It is easy to convert a data.frame to a matrix with the as.matrix function.
data.matrix <- as.matrix(data)
## Note that the rows of the matrix are labelled [1,], [2,], ...
head(data.matrix)
## GSM447401 GSM447411 GSM447413 GSM447415
## [1,] 1124.000 1196.000 982.800 1075.000
## [2,] 203.300 181.500 229.600 160.400
## [3,] 53.400 55.600 49.040 39.100
## [4,] 245.600 209.400 252.900 223.300
## [5,] 8.228 8.131 8.994 7.576
## [6,] 157.300 149.800 131.300 127.700
colnames(data.matrix)
## [1] "GSM447401" "GSM447411" "GSM447413" "GSM447415"
row.names(data.matrix)
## NULL
Note that the matrix does not have row names: as.matrix keeps the row names of a data frame only when they are not the automatic ones ("1", "2", ...). However, we can set row names. The function row.names can either get or set the row names of the matrix.
row.names(data.matrix) ## get the row names
## NULL
row.names(data.matrix) <- row.names(data) ## set the row names
row.names(data.matrix) ## get the modified row names
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51" "52" "53" "54" "55" "56"
## [57] "57" "58" "59" "60" "61" "62" "63" "64" "65" "66" "67" "68" "69" "70"
## [71] "71" "72" "73" "74" "75" "76" "77" "78" "79" "80" "81" "82" "83" "84"
## [85] "85" "86" "87" "88" "89" "90" "91" "92" "93" "94" "95" "96" "97" "98"
## [99] "99"
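The same conversion can be tried on a small example. This sketch uses a hypothetical data frame and made-up gene names, and it uses rownames(), which for a matrix behaves like row.names().

```r
## Hypothetical numeric data frame with automatic row names
expr <- data.frame(S1 = c(1, 2, 3), S2 = c(4, 5, 6))

m <- as.matrix(expr)
before <- rownames(m)      # NULL: automatic row names are not carried over
print(before)
print(typeof(m))           # all columns are numeric, so the matrix is numeric

## Set meaningful (made-up) row names and use them for indexing
rownames(m) <- c("geneA", "geneB", "geneC")
print(m["geneB", "S2"])
```

Once the row names are set, the matrix can be indexed by name just like the data frame.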
matrix function
We can construct a matrix by giving its contents as well as its dimensions.
mymat <- matrix(0, nrow=5, ncol=6) ## a 5 x 6 matrix. All contents are 0
mymat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 0 0 0 0 0 0
## [2,] 0 0 0 0 0 0
## [3,] 0 0 0 0 0 0
## [4,] 0 0 0 0 0 0
## [5,] 0 0 0 0 0 0
mymat <- matrix(NA, nrow=5, ncol=6) ## a 5 x 6 matrix. All contents are NA
mymat
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] NA NA NA NA NA NA
## [2,] NA NA NA NA NA NA
## [3,] NA NA NA NA NA NA
## [4,] NA NA NA NA NA NA
## [5,] NA NA NA NA NA NA
You can also specify the contents of the matrix by using a vector. For example, if you want to create a 4x4 matrix containing the numbers 1 to 16, arranged by rows, you do the following:
mymat <- matrix(1:16, nrow=4, ncol=4, byrow=TRUE) ## a 4x4 matrix with the elements 1-16
mymat
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 14 15 16
The byrow=TRUE is important here because it specifies that the numbers 1-16 are placed in the matrix row by row, starting from the cell [1,1], then [1,2], etc. The default behaviour of R is byrow = FALSE, i.e. the matrix is filled column by column: [1,1], then [2,1], etc.
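The effect of byrow is easiest to see when the same vector is used twice. A minimal sketch with a 2x3 matrix (the variable names are chosen for illustration):

```r
## Same contents, filled in two different orders
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(1:6, nrow = 2, ncol = 3)   # byrow = FALSE is the default

print(by_row)
print(by_col)

## Filling by row is the same as filling the transposed shape
## by column and then transposing
same <- all(by_row == t(matrix(1:6, nrow = 3, ncol = 2)))
print(same)
```

In by_row the first row is 1 2 3, whereas in by_col the first column is 1 2.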
NOTE: This is a very important way to create matrices because you can create vectors from several kinds of objects, e.g. lists. Thus, you can easily convert your data to a matrix format.
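As an illustration of this note, a list of equally long vectors can be flattened with unlist() and reshaped with matrix(). The list below is hypothetical, and do.call(rbind, ...) is shown only as one possible alternative, not as the method intended here.

```r
## A hypothetical list of two vectors of equal length
samples <- list(a = c(1, 2, 3), b = c(4, 5, 6))

## Flatten the list into one vector and fill the matrix row by row,
## so that each list element becomes one row
from_unlist <- matrix(unlist(samples), nrow = length(samples), byrow = TRUE)
print(from_unlist)

## Alternative: bind the list elements together as rows;
## the names of the list become the row names
from_rbind <- do.call(rbind, samples)
print(from_rbind)
```

Both approaches assume that all vectors in the list have the same length.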
Construct a 10x20 matrix as follows:
- create a list of 10 vectors, each composed of 20 integers. The i-th vector contains, at its j-th position, the value j * i.
- convert this list to a matrix, where each vector (element of the list) will be a row of the matrix.
Create three vectors x, y, z of integers, each with 3 elements. Combine the three vectors into a 3×3 matrix A in which each column is one of the vectors. Change the row names to a, b, c.
Create a vector with 12 integers. Convert the vector to a 4×3 matrix B using matrix(). Change the column names to x, y, z and the row names to a, b, c, d. The argument byrow in matrix() is FALSE by default. Change it to TRUE and print B to see the difference.
Obtain the transpose of B and name it tB.
Now tB is a 3×4 matrix. By the rules of matrix multiplication in linear algebra, can we multiply tB by tB in R? (Is a 3×4 matrix multiplied by a 3×4 matrix allowed?) What result do we get? HINT: the matrix multiplication operator is %*%
Download the dataset GDS3309 from the NCBI.
- Clean it (remove the lines starting with !, ^ or #)
- Inspect whether it is normalized
- Find the mean of every column
- Find the standard deviation of every column
- Create a new matrix in which each column is standardized. That is, subtract the mean of the column from the elements of the column and divide by the standard deviation of the column.
- For each column, find the gene with the maximum value
- For each column, find the gene with the minimum value