Creating Matrix Tutorial

Overview

Creating Simple Matrices
Names Attributes
Factors
Missing Values
Reading Data in R
Writing Data Using R
Subsetting Data Tables
Vectorized Operations

Introduction

Matrices are a special kind of vector in R: they are vectors with a dimension attribute. When creating a matrix, I can specify the number of rows and columns I want. I can create an empty matrix or fill it with data.

Creating Simple Matrices

Creating an Empty Matrix

rm(list=ls())
a <- matrix(nrow=3, ncol=4)
a #prints the matrix a

     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA

dim(a) #gives the dimension of the matrix a

[1] 3 4

attributes(a) #lists the attributes of the matrix a; here the only one is dim

$dim
[1] 3 4
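
Because a matrix is just a vector with a dim attribute, one can also be built by assigning dimensions to an existing vector. A minimal sketch (the name v is chosen only for illustration):

```r
v <- 1:10          # an ordinary integer vector
dim(v) <- c(2, 5)  # adding a dim attribute turns it into a 2 x 5 matrix
v
is.matrix(v)       # TRUE
```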

Creating a pre-filled Matrix

b <- matrix(1:16, nrow=4, ncol=4)
b

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

dim(b)

[1] 4 4
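
Note that matrix() fills column by column by default, which is why 1 to 4 run down the first column of b. As a small sketch, passing byrow = TRUE fills across the rows instead (b_row is a name chosen here for illustration):

```r
b_row <- matrix(1:16, nrow = 4, ncol = 4, byrow = TRUE)
b_row[1, ]  # first row is 1 2 3 4
b_row[, 1]  # first column is 1 5 9 13
```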

Creating a matrix of 1000 numbers with 500 rows and 2 columns

c <- matrix(1:1000, nrow=500, ncol=2)
head(c)

     [,1] [,2]
[1,]    1  501
[2,]    2  502
[3,]    3  503
[4,]    4  504
[5,]    5  505
[6,]    6  506

Names Attributes

Vectors, lists, matrices, and data frames can all have names attached to their elements. Names are useful for writing readable code and self-describing objects. For example,

name_vector <- LETTERS[1:5]
names(name_vector)

NULL

names(name_vector) <- c("First Letter", "Second Letter", "Third Letter", "Fourth Letter", "Fifth Letter")
names(name_vector)

[1] "First Letter"  "Second Letter" "Third Letter"  "Fourth Letter"
[5] "Fifth Letter"

name_vector

 First Letter Second Letter  Third Letter Fourth Letter  Fifth Letter 
          "A"           "B"           "C"           "D"           "E"

Lists can also have names, for example:

list_names <- list(a = 15, b = 7, c = 56, d = 109)
names(list_names)

[1] "a" "b" "c" "d"

Here a, b, c, and d are the names of the values in the list. Finally, matrices can have names as well: we can assign names to both the rows and the columns of a matrix with dimnames().

name_matrix <- matrix(1:15, nrow = 5, ncol = 3)
dimnames(name_matrix) <- list(c("Row 1", "Row 2", "Row 3", "Row 4", "Row 5"), c("Column 1", "Column 2", "Column 3"))
name_matrix

      Column 1 Column 2 Column 3
Row 1        1        6       11
Row 2        2        7       12
Row 3        3        8       13
Row 4        4        9       14
Row 5        5       10       15

The rows and columns are named exactly as I assigned them.
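
Row and column names can also be set separately with rownames() and colnames(). A minimal sketch, assuming a small hypothetical matrix of heights and weights (the values and the name hw are invented for illustration):

```r
hw <- matrix(c(150, 160, 170, 50, 60, 70), nrow = 3, ncol = 2)
colnames(hw) <- c("Height", "Weight")
rownames(hw) <- c("Person 1", "Person 2", "Person 3")
hw
hw["Person 2", "Weight"]  # names can then be used for subsetting; gives 60
```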

Other Ways of Creating Matrices

Creating Two Vectors of Length 3

d <- c(1,2,3)
e <- c(4,5,6)

Combining by Rows

f <- rbind(d, e)
f

  [,1] [,2] [,3]
d    1    2    3
e    4    5    6

Combining by Columns

g <- cbind(d,e)
g

     d e
[1,] 1 4
[2,] 2 5
[3,] 3 6

Creating a 6 by 2 Matrix

h <- matrix(7:18,ncol = 2)
h

     [,1] [,2]
[1,]    7   13
[2,]    8   14
[3,]    9   15
[4,]   10   16
[5,]   11   17
[6,]   12   18

Creating another 6 by 2 matrix named i and combining it with matrix h

i <- matrix(1:12, ncol = 2)
j <- cbind(i,h)
colnames(j)<-c("SN", "Position", "Place", "Value")
j

     SN Position Place Value
[1,]  1        7     7    13
[2,]  2        8     8    14
[3,]  3        9     9    15
[4,]  4       10    10    16
[5,]  5       11    11    17
[6,]  6       12    12    18

Factors

Factors represent categorical data. They are often used in modeling functions like lm() and glm(). They can be either:

a. Ordered: ordered factors represent things that are ranked, for example assistant professor, associate professor, and professor. They are categories and they are ordered: assistant professors are junior to full professors.

b. Unordered: unordered factors are categories with no inherent ranking. They may be stored as numbers to facilitate analysis, but those numbers carry no order. For example, we can code Male and Female as 0 and 1, which does not mean that males are inferior or junior to females.

Note: Using factors with their original labels is better than representing them by integers because labels are easier to interpret.

We can create factors using the factor() function as below:

m <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"))
m # Prints the factor m

 [1] Agree          Strongly Agree Strongly Agree Disagree       Agree         
 [6] Disagree       Strongly Agree Agree          Disagree       Agree         
[11] Disagree      
Levels: Agree Disagree Strongly Agree

table(m) # gives the tabular representation of the frequency counts in factor m

m
         Agree       Disagree Strongly Agree 
             4              4              3

unclass(m) # strips the factor class, showing the underlying integer codes and the levels attribute

 [1] 1 3 3 2 1 2 3 1 2 1 2
attr(,"levels")
[1] "Agree"          "Disagree"       "Strongly Agree"

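Sometimes we may want to set the baseline category ourselves, especially when we run regression analyses. By default the levels are sorted and the first level is treated as the baseline; we can set the order explicitly with the levels argument: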

n <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"), levels = c("Strongly Agree","Agree", "Disagree"))
n

 [1] Agree          Strongly Agree Strongly Agree Disagree       Agree         
 [6] Disagree       Strongly Agree Agree          Disagree       Agree         
[11] Disagree      
Levels: Strongly Agree Agree Disagree
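
The factors above are unordered. As a sketch of the ordered case described earlier, ordered = TRUE makes R treat the levels as ranked, so comparisons between values become meaningful (the object name rank_f and the sample values are made up for illustration):

```r
rank_f <- factor(c("Assistant", "Professor", "Associate", "Assistant"),
                 levels = c("Assistant", "Associate", "Professor"),
                 ordered = TRUE)
rank_f
rank_f[1] < rank_f[2]  # TRUE: Assistant ranks below Professor
is.ordered(rank_f)     # TRUE
```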

Missing Values

Missing values in R are a special type of object. They are denoted by NA or NaN. We can use the following functions to check for missing values in a data set.

is.na() tests each element for NA
is.nan() tests each element for NaN
NA values have a class too, e.g., there are integer NA and character NA values
NaN counts as NA, but NA does not count as NaN: NaN arises only from undefined mathematical operations such as 0/0, while NA marks a missing value of any type.

I am now going to create a vector with missing (NA) values:

m_value <- c("A", NA, "C", NA, "E", "F", "G", NA, "I")
# Checking if there is any NA Value in m_value
is.na(m_value)

[1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

is.na() returns a logical vector, i.e., the condition is tested for every individual element in the vector or column. The result shows that there are three NAs, in the second, fourth, and eighth positions. Now, creating a vector with both NAs and NaNs:

mi_value <- c(1, 1.5, 3, NA, 6, 0, NaN, NaN, NA, 19, 28, 7)
is.na(mi_value)

 [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE

is.nan(mi_value)

 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

When I tested for the NAs, the NaNs were also marked TRUE. However, when I inquired for the NaN, only the NaNs were marked TRUE.
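
Because TRUE counts as 1 in arithmetic, the logical vectors above can be summed to count missing values. A small sketch reusing the same values as mi_value:

```r
mi_value <- c(1, 1.5, 3, NA, 6, 0, NaN, NaN, NA, 19, 28, 7)
sum(is.na(mi_value))                      # 4: NA and NaN together
sum(is.nan(mi_value))                     # 2: NaN only
sum(is.na(mi_value) & !is.nan(mi_value))  # 2: NA only
which(is.na(mi_value))                    # positions 4 7 8 9
0 / 0                                     # NaN, an undefined operation
```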

Reading Data in R

There are a few different types of functions that are used to read data in R.

read.table, read.csv, read_excel, read.xlsx, read.xlsx2 for reading tabular data: these are the most commonly used. They read data files that store data in rows and columns. read.table and read.csv are part of base R, while read_excel (readxl package) and read.xlsx and read.xlsx2 (xlsx package) come from add-on packages. The following are the arguments of read.table that we need to consider when reading in data files:
- file, the name of a file, or a connection
- header, logical indicating if the file has a header line
- sep, a string indicating how the columns are separated in the data file
- colClasses, a character vector indicating the class of each column in the dataset
- nrows, the number of rows in the data set being read in
- comment.char, a character string indicating the comment character
- skip, the number of lines to skip from the beginning
- stringsAsFactors, should character variables be coded as factors? Defaults to FALSE since R 4.0.0 (it was TRUE in earlier versions)
readLines, for reading lines of a text file: it works on any type of text file and returns a character vector
source, for reading in R code files (inverse of dump)
dget for reading in R code files (inverse of dput)
load for reading in saved workspaces
unserialize for reading single R objects in binary form
file opens a connection to a file
gzfile opens a connection to a file compressed with gzip
url opens a connection to a webpage
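
As a minimal sketch of several of these arguments working together, the following writes a tiny CSV file to a temporary location and reads it back. The file contents and the column names id, score, and group are invented for illustration:

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("# a comment line that will be skipped",
             "id,score,group",
             "1,2.5,a",
             "2,3.0,b",
             "3,4.5,a"), tmp)

dat <- read.table(tmp,
                  header = TRUE,        # first non-comment line holds the column names
                  sep = ",",            # columns are comma separated
                  comment.char = "#",   # lines starting with # are ignored
                  colClasses = c("integer", "numeric", "character"),
                  stringsAsFactors = FALSE)
str(dat)
first_lines <- readLines(tmp, n = 2)  # the first two raw lines, as a character vector
first_lines
unlink(tmp)                           # remove the temporary file
```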

The Analogous Writing Data Functions

write.table
writeLines
dump
Textual Format: dput or dump:
- dumping and dputting are useful because the resulting textual format is editable and, in the case of corruption, potentially recoverable.
- Unlike writing out a table or csv file, dump and dput preserve the metadata (sacrificing some readability), so that another user doesn’t have to specify it all over again
- Textual formats can work much better with version control programs like Subversion or Git, which can only track changes meaningfully in text files
- Textual formats can be longer-lived; if there is corruption somewhere in the file, it can be easier to fix the problem
- Textual formats adhere to the “Unix Philosophy”
- Downside: The format is not very space-efficient

c <- data.frame(a=1, b="a")
dput(c)

structure(list(a = 1, b = "a"), class = "data.frame", row.names = c(NA, 
-1L))

dput(c, file="c.R")
new.c <- dget("c.R")
new.c

  a b
1 1 a
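
dput handles a single object. As a sketch of the multi-object counterpart, dump writes several objects by name and source reads them back. The object names x_num and y_df and the temporary file are chosen here for illustration:

```r
x_num <- c(1.5, 2.5)
y_df <- data.frame(a = 1:2, b = c("u", "v"))

tmp <- tempfile(fileext = ".R")
dump(c("x_num", "y_df"), file = tmp)  # write both objects as R code
rm(x_num, y_df)                        # remove them from the workspace

source(tmp, local = TRUE)              # re-creates x_num and y_df in the current environment
x_num
y_df
unlink(tmp)
```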

save
serialize

Reading Larger Datasets with read.table

Use the colClasses argument. Specifying this option instead of using the default can make read.table run much faster, often twice as fast. In order to use this option, we have to know the class of each column in the data set we are reading in. If all the columns are numeric, for example, then we can just set colClasses = "numeric". A quick and dirty way to figure out the class of each column is to read in only the first few rows with nrows and then apply class() to every column of the result.
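
The quick way of finding the column classes could look like the sketch below. It builds a small temporary file so that it is runnable; the file, its columns, and the names initial, classes, and tab_all are assumptions made for illustration, and in practice the path would point at the real data file:

```r
# Build a stand-in data file (hypothetical contents)
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:200, value = rnorm(200), label = "x"),
            tmp, row.names = FALSE)

initial <- read.table(tmp, header = TRUE, nrows = 100)  # read only a small sample
classes <- sapply(initial, class)                        # class of each column
classes
tab_all <- read.table(tmp, header = TRUE, colClasses = classes)  # full read with known classes
dim(tab_all)
unlink(tmp)
```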

Subsetting in R

Subsetting is one of the most used operations in R. There are three operators we need to remember while subsetting data in R: the single bracket operator, the double bracket operator, and the dollar sign.

[] returns an object of the same class as the original; it can be used to extract more than one element. (One exception appears in the examples below: selecting a single row or column of a matrix returns a plain vector by default.)

h[2:3,] #Gives only rows 2 and 3

     [,1] [,2]
[1,]    8   14
[2,]    9   15

h[1,] #Only gives 1st row

[1]  7 13

h[,2]#Only gives 2nd column

[1] 13 14 15 16 17 18

h[,1]#Only gives 1st column

[1]  7  8  9 10 11 12

h[4,2]#Only the value in the 4th row second column

[1] 16

h[6,1]#only the value in the 6th row first column

[1] 12

h[h > 3]# Only the values that are greater than 3

 [1]  7  8  9 10 11 12 13 14 15 16 17 18

#Creating a logical matrix that gives TRUE or FALSE for every value in the matrix h, based on whether or not it meets the condition that I define
i <- h > 12 # Tests whether each element in h is bigger than 12
i

      [,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,] FALSE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE

#If I subset the matrix h by the logical matrix i, I get all the elements that are bigger than 12
h[i]

[1] 13 14 15 16 17 18
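
One detail worth noting about the single bracket on matrices: selecting a single row or column, as in h[1,] above, simplifies the result to a plain vector. A short sketch of how drop = FALSE keeps the matrix structure (h is re-created here so the snippet runs on its own):

```r
h <- matrix(7:18, ncol = 2)
dim(h[1, ])                # NULL: the result is a plain vector
dim(h[1, , drop = FALSE])  # 1 2: still a matrix with one row
dim(h[, 2, drop = FALSE])  # 6 1: a one-column matrix
```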

[[]] returns elements of a list or a data frame. It can only be used to extract a single element, and the class of the returned object can be anything. We can use the single bracket, the double bracket, or the dollar sign to subset a list or a data frame.
$ returns elements of a list or data frame by name. It behaves much like [[]]. Let’s see some examples:

# creating a list named j
j <- list(cap_letter = LETTERS[5:10], numbers = 5:10, g_list = c("milk", "water", "sugar", "oatmeal"))
# subsetting the first element (the capital letters); single brackets return a list
j[1]

$cap_letter
[1] "E" "F" "G" "H" "I" "J"

# Subsetting all items in the g_list
j$g_list

[1] "milk"    "water"   "sugar"   "oatmeal"

# Subsetting the list of numbers
j[[2]]

[1]  5  6  7  8  9 10

#Or
j[["g_list"]]

[1] "milk"    "water"   "sugar"   "oatmeal"

# Subsetting by name with single brackets returns a list that contains the numbers
j["numbers"]

$numbers
[1]  5  6  7  8  9 10

# Extracting multiple elements
j[c(2,3)] # extracts the second and third list items

$numbers
[1]  5  6  7  8  9 10

$g_list
[1] "milk"    "water"   "sugar"   "oatmeal"

Using the double bracket operator on nested lists can be tricky. We can index into nested elements in the following ways:

k <- list(l = list(102, 103, 104, 105, 106), m = c(5.8, 3.21, 9.86, 101.32))
# Extracting the third item in the list l
k[[c(1,3)]]

[1] 104

# Or
k[[1]][[3]]

[1] 104

# Extracting the second item from the vector m
k[[c(2,2)]]

[1] 3.21

#Or
k[[2]][[2]]

[1] 3.21
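
One practical difference between [[]] and $ is that [[]] accepts a name stored in a variable, while $ only takes a literal name. A small sketch, using a hypothetical list p and variable wanted:

```r
p <- list(numbers = 5:10, g_list = c("milk", "water"))
wanted <- "numbers"
p[[wanted]]  # looks up the element called "numbers"
p$wanted     # NULL: there is no element literally named "wanted"
p$numbers    # same result as p[[wanted]]
```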

Removing Missing Values

Removing missing values is a very common operation in data analysis. Data in the social sciences often have many missing values, and data frames without any are rare.

n <- c(NA, 2, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 3, 3, 4, 4, NA)
n # prints the object n

 [1] NA  2  3 NA NA  4 NA NA NA  2  1 NA  3  3  4  4 NA

r_na <- is.na(n) #logical vector that flags, for every element of n, whether or not it is NA
r_na #prints r_na object

 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE FALSE  TRUE

n[!r_na]# gives us an object without NAs in it.

[1] 2 3 4 2 1 3 3 4 4

If we have multiple vectors and want to keep only the positions where none of them is missing, we can use complete.cases(). Let’s create another object named o and work through an example.

o <- c("a", "b", "d", NA, NA, "g", "h", NA, "i", NA, "k", "l", "m", NA, NA, NA, NA)
no_nas <- complete.cases(n, o) #checking for complete cases
no_nas # Printing the new object

 [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[13]  TRUE FALSE FALSE FALSE FALSE

n[no_nas]#printing object n after removing NAs

[1] 2 3 4 1 3

o[no_nas]#printing object o after removing NAs

[1] "b" "d" "g" "k" "m"

This rule can be extended to a data frame as well. Let’s look at the airquality data frame and apply the same function.

airquality[1:15,] #Checking the values in first 15 rows

   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15

We can see that there are missing values in the fifth, sixth, tenth, and eleventh rows: two in the fifth row and one in each of the others.

rm_missing <- complete.cases(airquality)
airquality[rm_missing,][1:15,] #first 15 rows after dropping every row that contains an NA

   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19

Now, we don’t see any missing values.
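
As an aside, base R also offers shortcuts for the same task: na.omit() drops incomplete rows in one step, and many summary functions accept na.rm = TRUE. A brief sketch on the same built-in data set (aq_clean is a name chosen for illustration):

```r
aq_clean <- na.omit(airquality)  # keeps the same rows as airquality[complete.cases(airquality), ]
nrow(aq_clean) == sum(complete.cases(airquality))  # TRUE
colSums(is.na(airquality))             # number of NAs in each column
mean(airquality$Ozone)                 # NA, because Ozone contains NAs
mean(airquality$Ozone, na.rm = TRUE)   # mean of the non-missing values
```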

Vectorized Operations

Vectorization is one of the features of R that makes it easy to write code without having to create a lot of loops and other complex constructs. So, it is a no-brainer to learn it when we are learning R.

Adding Two Vectors

We can perform many everyday calculations on vectors and other objects just like we do in mathematics. Let’s take some examples.

h <- 3:6
i <- 1:4
#Adding two objects
h+i

[1]  4  6  8 10

# Multiplying two objects
h * i

[1]  3  8 15 24

# Dividing one object by the other
h/i

[1] 3.000000 2.000000 1.666667 1.500000

The values get added accordingly, i.e., 3 + 1 = 4, 4 + 2 = 6, 5 + 3 = 8, 6 + 4 = 10.
The values got multiplied like this: 3 * 1 = 3, 4 * 2 = 8, 5 * 3 = 15, 6 * 4 = 24.
The values got divided this way: 3/1 = 3, 4/2 = 2, 5/3 = 1.67, and 6/4 = 1.5.

How about unequal matrices?

After trying to add a 2X2 and a 2X3 matrix, and a 2X2 and a 3X2 matrix, I got an error saying the arrays are non-conformable, so these operations don’t work. Creating 2X2, 3X2, and 2X3 matrices and checking if we can perform the above calculations:

d <- matrix(1:4, 2, 2)
e <- matrix(3:8, 3, 2)
f <- matrix(4:9, 2, 3)
# Adding the matrices
#d + e # Not Possible
#d + f # Not Possible
#e + f # Not Possible

# Multiplying the matrices
#d * e # Not Possible
#d * f # Not Possible
#e * f # Not Possible

# Dividing the matrices
#d/e # Not Possible
#d/f # Not Possible
#e/f # Not Possible

They don’t work.
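
To see the failure without stopping a script, the error can be caught. A minimal sketch, re-creating d and e from above and using tryCatch() to capture the error message (try_add is a hypothetical helper name):

```r
d <- matrix(1:4, 2, 2)
e <- matrix(3:8, 3, 2)

# Returns the sum, or the error message if the shapes do not match
try_add <- function(x, y) {
  tryCatch(x + y, error = function(err) conditionMessage(err))
}

try_add(d, e)  # an error message about non-conformable arrays
try_add(d, d)  # works: both matrices are 2 x 2
```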

Other Operations

d <- matrix(1:4, 2, 2)
d

     [,1] [,2]
[1,]    1    3
[2,]    2    4

e <- matrix(rep(5, 4), 2, 2) #Creating 2X2 matrix with 5s repeating 4 times
e

     [,1] [,2]
[1,]    5    5
[2,]    5    5

## Adding 
d + e

     [,1] [,2]
[1,]    6    8
[2,]    7    9

The addition was performed like this: 1 + 5 = 6; 2 + 5 = 7; 3 + 5 = 8; 4 + 5 = 9

d * e

     [,1] [,2]
[1,]    5   15
[2,]   10   20

The elementwise multiplication was performed like this: 1 * 5 = 5; 2 * 5 = 10; 3 * 5 = 15; 4 * 5 = 20. However, this is not true matrix multiplication, which is done with the %*% operator:

d %*% e

     [,1] [,2]
[1,]   20   20
[2,]   30   30
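
Each entry of the matrix product is the sum of a row of d multiplied elementwise by a column of e. A short sketch that checks two entries by hand (d and e are re-created so the snippet stands alone, and prod_de is a name chosen for illustration):

```r
d <- matrix(1:4, 2, 2)
e <- matrix(rep(5, 4), 2, 2)
prod_de <- d %*% e

sum(d[1, ] * e[, 1])  # 1*5 + 3*5 = 20, matches prod_de[1, 1]
sum(d[2, ] * e[, 2])  # 2*5 + 4*5 = 30, matches prod_de[2, 2]
```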