Overview

  1. Creating Simple Matrices
  2. Names Attributes
  3. Factors
  4. Missing Values
  5. Reading Data in R
  6. Writing Data Using R
  7. Subsetting Data Tables
  8. Vectorized Operations

Introduction

Matrices are special vectors in R: they are vectors with a dimension attribute. When creating a matrix, I can specify the number of rows and columns I want in it. I can create an empty matrix or one that is pre-filled with data.

Creating Simple Matrices

Creating an Empty Matrix

rm(list=ls())
a <- matrix(nrow=3, ncol=4)
a #prints the matrix a
     [,1] [,2] [,3] [,4]
[1,]   NA   NA   NA   NA
[2,]   NA   NA   NA   NA
[3,]   NA   NA   NA   NA
dim(a) #gives the dimension of the matrix a
[1] 3 4
attributes(a) # lists the attributes of the matrix a; here only the dimension
$dim
[1] 3 4

Creating a pre-filled Matrix

b <- matrix(nrow=4, ncol=4, 1:16)
b
     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
dim(b)
[1] 4 4

Creating a matrix of 1000 numbers with 500 rows and 2 columns

c <- matrix(1:1000, nrow=500, ncol=2)
head(c)
     [,1] [,2]
[1,]    1  501
[2,]    2  502
[3,]    3  503
[4,]    4  504
[5,]    5  505
[6,]    6  506
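
By default, matrix() fills the values column by column, which is why the second column above starts at 501. A minimal sketch of the row-wise alternative, using the byrow argument of matrix(); the object names by_col and by_row are made up for this illustration:

```r
# Column-wise fill (the default) versus row-wise fill
by_col <- matrix(1:6, nrow = 2, ncol = 3)
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]  # first row under column-wise fill: 1 3 5
by_row[1, ]  # first row under row-wise fill: 1 2 3
```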

Names Attributes

R objects such as vectors, lists, matrices, and data frames can have names. Names are useful for writing readable code and self-describing objects. For example,

name_vector <- LETTERS[1:5]
names(name_vector)
NULL
names(name_vector) <- c("First Letter", "Second Letter", "Third Letter", "Fourth Letter", "Fifth Letter")
names(name_vector)
[1] "First Letter"  "Second Letter" "Third Letter"  "Fourth Letter"
[5] "Fifth Letter" 
name_vector
 First Letter Second Letter  Third Letter Fourth Letter  Fifth Letter 
          "A"           "B"           "C"           "D"           "E" 

Lists can also have names. For example,

list_names <- list(a = 15, b = 7, c = 56, d = 109)
names(list_names)
[1] "a" "b" "c" "d"

Here a, b, c, and d are the names of the values in the list. Finally, matrices can have names as well: we can assign names to both the rows and the columns of a matrix.

name_matrix <- matrix(1:15, nrow = 5, ncol = 3)
dimnames(name_matrix) <- list(c("Row 1", "Row 2", "Row 3", "Row 4", "Row 5"), c("Column 1", "Column 2", "Column 3"))
name_matrix
      Column 1 Column 2 Column 3
Row 1        1        6       11
Row 2        2        7       12
Row 3        3        8       13
Row 4        4        9       14
Row 5        5       10       15

The rows and columns are named exactly the way I assigned them with dimnames().
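
Once a matrix has row and column names, those names can also be used to look values up. The following is a small sketch that rebuilds the matrix from above (with paste() as a shortcut for typing the names) and uses the base functions rownames() and colnames():

```r
# Rebuild the named matrix so the example is self-contained
name_matrix <- matrix(1:15, nrow = 5, ncol = 3)
dimnames(name_matrix) <- list(paste("Row", 1:5), paste("Column", 1:3))

rownames(name_matrix)             # the five row names
colnames(name_matrix)             # the three column names
name_matrix["Row 2", "Column 3"]  # value in the second row, third column: 12
```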

Other Ways of Creating Matrices

Creating Two Vectors of Length 3

d <- c(1,2,3)
e <- c(4,5,6)

Combining by Rows

f <- rbind(d, e)
f
  [,1] [,2] [,3]
d    1    2    3
e    4    5    6

Combining by Columns

g <- cbind(d,e)
g
     d e
[1,] 1 4
[2,] 2 5
[3,] 3 6

Creating a 6 X 2 Matrix

h <- matrix(7:18,ncol = 2)
h
     [,1] [,2]
[1,]    7   13
[2,]    8   14
[3,]    9   15
[4,]   10   16
[5,]   11   17
[6,]   12   18

Creating one more 6 X 2 Matrix named i and combining it with Matrix h

i <- matrix(1:12, ncol = 2)
j <- cbind(i,h)
colnames(j)<-c("SN", "Position", "Place", "Value")
j
     SN Position Place Value
[1,]  1        7     7    13
[2,]  2        8     8    14
[3,]  3        9     9    15
[4,]  4       10    10    16
[5,]  5       11    11    17
[6,]  6       12    12    18

Factors

Factors represent categorical data. They are often used in modeling functions such as lm() and glm(). They can be either:

a. Ordered: ordered factors represent categories that are ranked, for example assistant professor, associate professor, and professor. An assistant professor is junior to a full professor.

b. Unordered: unordered factors have no inherent ranking. Any numeric codes attached to them are used only to facilitate analysis. For example, we can code Male and Female as 0 and 1, which doesn’t mean that males are inferior or junior to females.

Note: Using factors with their original labels is better than representing them by integers because labels are easier to interpret.

m <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"))
m # Prints the factor m
 [1] Agree          Strongly Agree Strongly Agree Disagree       Agree         
 [6] Disagree       Strongly Agree Agree          Disagree       Agree         
[11] Disagree      
Levels: Agree Disagree Strongly Agree
table(m) # gives the tabular representation of the frequency counts in factor m
m
         Agree       Disagree Strongly Agree 
             4              4              3 
unclass(m) # strips the class, showing the underlying integer codes and the levels attribute
 [1] 1 3 3 2 1 2 3 1 2 1 2
attr(,"levels")
[1] "Agree"          "Disagree"       "Strongly Agree"
n <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"), levels = c("Strongly Agree","Agree", "Disagree"))
n 
 [1] Agree          Strongly Agree Strongly Agree Disagree       Agree         
 [6] Disagree       Strongly Agree Agree          Disagree       Agree         
[11] Disagree      
Levels: Strongly Agree Agree Disagree
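
The levels argument above only fixes the order in which the levels are listed; the factor itself is still unordered. A sketch of how a ranked (ordered) factor could be created with the ordered argument of factor(); the data and the names responses and likert are hypothetical:

```r
responses <- c("Agree", "Strongly Agree", "Disagree", "Agree")  # hypothetical data
likert <- factor(responses,
                 levels = c("Disagree", "Agree", "Strongly Agree"),
                 ordered = TRUE)

is.ordered(likert)         # TRUE
likert < "Strongly Agree"  # comparisons follow the level order
max(likert)                # the highest level present in the data
```

Comparisons such as < and functions such as max() are only meaningful for ordered factors, which is the main practical difference from the unordered case.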

Missing Values

Missing values in R are special objects. They are denoted by NA (not available) or NaN (not a number, the result of an undefined mathematical operation). We can use is.na() and is.nan() to check for missing values in a data set.

I am now going to create a vector with several NA values:

m_value <- c("A", NA, "C", NA, "E", "F", "G", NA, "I")
# Checking if there is any NA Value in m_value
is.na(m_value)
[1] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE

is.na() returns a logical vector, i.e., the test is applied to every individual element of the vector or column. The result shows that there are three NAs, and they are the second, fourth, and eighth values. Now, creating a vector with both NAs and NaNs:

mi_value <- c(1, 1.5, 3, NA, 6, 0, NaN, NaN, NA, 19, 28, 7)
is.na(mi_value)
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE FALSE
is.nan(mi_value)
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

When I tested for NAs, the NaNs were also marked TRUE. However, when I tested for NaNs, only the NaNs were marked TRUE. In other words, every NaN counts as an NA, but an NA is not a NaN.
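
Because is.na() and is.nan() return logical vectors, they can be summed to count missing values. A short sketch using the same vector, together with the base function anyNA() and the na.rm argument of mean():

```r
mi_value <- c(1, 1.5, 3, NA, 6, 0, NaN, NaN, NA, 19, 28, 7)

anyNA(mi_value)               # TRUE: at least one NA or NaN is present
sum(is.na(mi_value))          # 4: counts NA and NaN together
sum(is.nan(mi_value))         # 2: counts NaN only
mean(mi_value)                # a missing result, because missing values propagate
mean(mi_value, na.rm = TRUE)  # 8.1875: mean of the eight remaining numbers
```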

Reading Data in R

There are a few different types of functions that are used to read data in R.

The Analogous Writing Data Functions

c <- data.frame(a=1, b="a")
dput(c)
structure(list(a = 1, b = "a"), class = "data.frame", row.names = c(NA, 
-1L))
dput(c, file="c.R")
new.c <- dget("c.R")
new.c
  a b
1 1 a
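
For comparison, here is a sketch of a round trip through a plain CSV file with write.csv() and read.csv(). The data frame and the names df, csv_path, and df_back are made up for this illustration. Unlike dput(), the CSV file stores only text, so the column types are inferred again when the file is read:

```r
df <- data.frame(a = 1:3, b = c("x", "y", "z"))  # hypothetical data

csv_path <- tempfile(fileext = ".csv")  # temporary file, removed below
write.csv(df, csv_path, row.names = FALSE)
df_back <- read.csv(csv_path)

str(df_back)      # column types are re-inferred from the text
unlink(csv_path)  # clean up the temporary file
```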

Reading Larger Datasets with read.table

Use the colClasses argument. Specifying this option instead of using the default can make read.table run much faster, often twice as fast. In order to use this option, we have to know the class of each column in the data set we are reading in. If all the columns are numeric, for example, then we can just set colClasses = "numeric". A quick and dirty way to figure out the class of each column is to read only the first few rows with the nrows argument, apply class() to each column of the result, and pass those classes as colClasses when reading the full file.
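
A sketch of one common way to do this, assuming a whitespace-separated file with a header row. The file is generated here only so that the example runs on its own, and the names path, initial, classes, and full are hypothetical:

```r
# Write a small hypothetical table to a temporary file
path <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:200, score = runif(200), grp = "a"),
            path, row.names = FALSE)

# Read only the first rows, record each column's class, then reuse the classes
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
full <- read.table(path, header = TRUE, colClasses = classes)

classes       # one class name per column
unlink(path)  # clean up the temporary file
```

The speed gain matters mostly for large files; on a table this small the difference would not be noticeable.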

Subsetting in R

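Subsetting is one of the most used operations in R. There are three operators we need to remember while subsetting data in R: the single bracket operator [, which returns an object of the same class as the original and can select more than one element; the double bracket operator [[, which extracts a single element of a list or data frame; and the dollar sign $, which extracts an element by name.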

h[2:3,] #Gives only rows 2 and 3
     [,1] [,2]
[1,]    8   14
[2,]    9   15
h[1,] #Only gives 1st row
[1]  7 13
h[,2]#Only gives 2nd column
[1] 13 14 15 16 17 18
h[,1]#Only gives 1st column
[1]  7  8  9 10 11 12
h[4,2]#Only the value in the 4th row second column
[1] 16
h[6,1]#only the value in the 6th row first column
[1] 12
h[h > 3]# Only the values that are greater than 3
 [1]  7  8  9 10 11 12 13 14 15 16 17 18
# Creating a logical matrix that gives TRUE or FALSE for every value in the matrix h, based on whether or not it meets the condition that I define
i <- h > 12 # Tests whether each element of h is greater than 12
i
      [,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,] FALSE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
# If I subset the matrix h by the logical matrix i, I get all the elements that are greater than 12
h[i]
[1] 13 14 15 16 17 18
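
Note that selecting a single row or column of a matrix returns a plain vector, as h[1,] did above. A minimal sketch of how the matrix shape could be kept instead, using the drop argument of the single bracket operator; row_vec and row_mat are names made up for this example:

```r
h <- matrix(7:18, ncol = 2)

row_vec <- h[1, ]                # dimensions dropped: a plain vector of length 2
row_mat <- h[1, , drop = FALSE]  # dimensions kept: a 1 x 2 matrix

dim(row_vec)  # NULL
dim(row_mat)  # 1 2
```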
# Creating a list named j
j <- list(cap_letter = LETTERS[5:10], numbers = 5:10, g_list = c("milk", "water", "sugar", "oatmeal"))
# Subsetting the first element (the capital letters), returned as a list
j[1]
$cap_letter
[1] "E" "F" "G" "H" "I" "J"
# Subsetting all items in the g_list
j$g_list
[1] "milk"    "water"   "sugar"   "oatmeal"
# Subsetting the list of numbers
j[[2]]
[1]  5  6  7  8  9 10
# Double brackets also accept a name
j[["g_list"]]
[1] "milk"    "water"   "sugar"   "oatmeal"
# Subsetting the list of numbers 
j["numbers"]
$numbers
[1]  5  6  7  8  9 10
# Extracting multiple elements
j[c(2,3)] # extracts the second and third list items
$numbers
[1]  5  6  7  8  9 10

$g_list
[1] "milk"    "water"   "sugar"   "oatmeal"
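
The practical difference between the single and double bracket operators is the class of what comes back. A small sketch with a hypothetical list named lst:

```r
lst <- list(cap_letter = LETTERS[5:10], numbers = 5:10)  # hypothetical list

class(lst["numbers"])    # "list": single brackets keep the container
class(lst[["numbers"]])  # "integer": double brackets return the element itself
sum(lst[["numbers"]])    # 45; sum(lst["numbers"]) would fail because it is a list
```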

The double bracket operator can sometimes be tricky. It can also be used in the following ways, where an integer vector as the index extracts nested elements:

k <- list(l = list(102, 103, 104, 105, 106), m = c(5.8, 3.21, 9.86, 101.32))
# Extracting the third item in the list l
k[[c(1,3)]]
[1] 104
# Or
k[[1]][[3]]
[1] 104
# Extracting the second item from the vector m
k[[c(2,2)]]
[1] 3.21
#Or
k[[2]][[2]]
[1] 3.21

Removing Missing Values

Removing missing values is a very common operation in data analytics. Data in the social sciences often have many missing values, and it is rare to find a data frame without any.

n <- c(NA, 2, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 3, 3, 4, 4, NA)
n # prints the object n
 [1] NA  2  3 NA NA  4 NA NA NA  2  1 NA  3  3  4  4 NA
r_na <- is.na(n) # logical vector: TRUE wherever the element of n is NA
r_na #prints r_na object
 [1]  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
[13] FALSE FALSE FALSE FALSE  TRUE
n[!r_na]# gives us an object without NAs in it. 
[1] 2 3 4 2 1 3 3 4 4

If we have multiple vectors and want to keep only the positions where none of them is missing, we can use complete.cases(). Let’s create another object named o and work through an example.

o <- c("a", "b", "d", NA, NA, "g", "h", NA, "i", NA, "k", "l", "m", NA, NA, NA, NA)
no_nas <- complete.cases(n, o)#checking for complete cases
no_nas # Printing the new object
 [1] FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
[13]  TRUE FALSE FALSE FALSE FALSE
n[no_nas]#printing object n after removing NAs
[1] 2 3 4 1 3
o[no_nas]#printing object o after removing NAs
[1] "b" "d" "g" "k" "m"

This rule can be extended to a data frame as well. Let’s look at the airquality data frame and apply the above functions.

airquality[1:15,] #Checking the values in first 15 rows
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
5     NA      NA 14.3   56     5   5
6     28      NA 14.9   66     5   6
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
10    NA     194  8.6   69     5  10
11     7      NA  6.9   74     5  11
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15

We can see that the fifth, sixth, tenth, and eleventh rows contain missing values; the fifth row has two.

rm_missing <- complete.cases(airquality)
airquality[rm_missing,][1:15,] # first 15 rows that have no NAs
   Ozone Solar.R Wind Temp Month Day
1     41     190  7.4   67     5   1
2     36     118  8.0   72     5   2
3     12     149 12.6   74     5   3
4     18     313 11.5   62     5   4
7     23     299  8.6   65     5   7
8     19      99 13.8   59     5   8
9      8      19 20.1   61     5   9
12    16     256  9.7   69     5  12
13    11     290  9.2   66     5  13
14    14     274 10.9   68     5  14
15    18      65 13.2   58     5  15
16    14     334 11.5   64     5  16
17    34     307 12.0   66     5  17
18     6      78 18.4   57     5  18
19    30     322 11.5   68     5  19

Now, we don’t see any missing values.
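
Two related base R tools can help here. The sketch below counts the missing values per column with colSums() and is.na(), and uses na.omit() as an alternative way to drop incomplete rows; the name clean is made up for this illustration:

```r
colSums(is.na(airquality))    # number of NAs in each column
clean <- na.omit(airquality)  # drops every row that has at least one NA

nrow(airquality)              # 153 rows in the built-in data set
nrow(clean)                   # 111 complete rows
nrow(clean) == sum(complete.cases(airquality))  # TRUE: both approaches agree
```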

Vectorized Operations

Vectorized operations are one of the R features that make it easy to write code without explicit loops. So, it is a no-brainer to learn them when we are learning R.

Adding Two Vectors

We can perform many everyday calculations on vectors, just as we do in mathematics. Let’s take some examples.

h <- 3:6
i <- 1:4
#Adding two objects
h+i
[1]  4  6  8 10
# Multiplying two objects
h * i
[1]  3  8 15 24
# Dividing one object by the other
h/i
[1] 3.000000 2.000000 1.666667 1.500000
  • The values are added elementwise, i.e., 3 + 1 = 4, 4 + 2 = 6, 5 + 3 = 8, 6 + 4 = 10.
  • The values are multiplied elementwise: 3 * 1 = 3, 4 * 2 = 8, 5 * 3 = 15, 6 * 4 = 24.
  • The values are divided elementwise: 3/1 = 3, 4/2 = 2, 5/3 ≈ 1.67, and 6/4 = 1.5.
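
To see what the vectorized form saves, here is a sketch of the same addition written as an explicit loop. The names loop_sum, vec_sum, and the loop index k are made up for this illustration:

```r
h <- 3:6
i <- 1:4

# Explicit loop: add the vectors one element at a time
loop_sum <- numeric(length(h))
for (k in seq_along(h)) {
  loop_sum[k] <- h[k] + i[k]
}

vec_sum <- h + i          # vectorized form, same result in one expression
all(loop_sum == vec_sum)  # TRUE
h * 2                     # a single number is recycled: 6 8 10 12
```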

How about unequal matrices?

I tried to add a 2X2 matrix to a 2X3 matrix and to a 3X2 matrix, and found that these operations don’t work. Below, I create 2X2, 3X2, and 2X3 matrices and check whether we can perform the above calculations.

d <- matrix(1:4, 2, 2)
e <- matrix(3:8, 3, 2)
f <- matrix(4:9, 2, 3)
# Adding the matrices
#d + e # Not Possible
#d + f # Not Possible
#e + f # Not Possible

# Multiplying the matrices
#d * e # Not Possible
#d * f # Not Possible
#e * f # Not Possible
# Dividing the matrices

# d/e # Not Possible
# d/f # Not Possible
# e/f # Not Possible

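None of these work: elementwise operators require matrices with identical dimensions, and R stops with a “non-conformable arrays” error otherwise.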
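
Rather than commenting the failing lines out, the error can be captured and inspected. A minimal sketch using tryCatch() and conditionMessage() from base R; the name msg is hypothetical:

```r
d <- matrix(1:4, 2, 2)
e <- matrix(3:8, 3, 2)

# Capture the error message instead of stopping the script
msg <- tryCatch(d + e, error = function(err) conditionMessage(err))
msg  # the text of the error raised by d + e

# Elementwise operators need identical dimensions
identical(dim(d), dim(e))  # FALSE
```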

Other Operations

d <- matrix(1:4, 2, 2)
d
     [,1] [,2]
[1,]    1    3
[2,]    2    4
e <- matrix(rep(5, 4), 2, 2) #Creating 2X2 matrix with 5s repeating 4 times
e
     [,1] [,2]
[1,]    5    5
[2,]    5    5
## Adding 
d + e
     [,1] [,2]
[1,]    6    8
[2,]    7    9

The addition was performed like this: 1 + 5 = 6; 2 + 5 = 7; 3 + 5 = 8; 4 + 5 = 9

d * e
     [,1] [,2]
[1,]    5   15
[2,]   10   20

The elementwise multiplication was performed like this: 1 * 5 = 5; 2 * 5 = 10; 3 * 5 = 15; 4 * 5 = 20. However, this is not true matrix multiplication, which is done with the %*% operator:

d %*% e
     [,1] [,2]
[1,]   20   20
[2,]   30   30
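
True matrix multiplication does not need equal dimensions. It only needs the number of columns of the first matrix to match the number of rows of the second. A short sketch with the 3X2 and 2X3 shapes used earlier; the names e3x2, f2x3, and prod_ef are made up for this example:

```r
e3x2 <- matrix(3:8, 3, 2)  # 3 x 2
f2x3 <- matrix(4:9, 2, 3)  # 2 x 3

prod_ef <- e3x2 %*% f2x3   # inner dimensions match (2 and 2)
dim(prod_ef)               # 3 3
prod_ef[1, 1]              # 3 * 4 + 6 * 5 = 42
dim(f2x3 %*% e3x2)         # 2 2
```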