Overview
Introduction
Matrices are special vectors in R: they are vectors with a dimension attribute. When creating a matrix, I can specify the number of rows and columns I want. I can create an empty matrix or fill it with data.
rm(list=ls())
a <- matrix(nrow=3, ncol=4)
a #prints the matrix a
[,1] [,2] [,3] [,4]
[1,] NA NA NA NA
[2,] NA NA NA NA
[3,] NA NA NA NA
dim(a) #gives the dimension of the matrix a
[1] 3 4
attributes(a) # lists the attributes of the matrix a; here only the dimensions
$dim
[1] 3 4
b <- matrix(nrow=4, ncol=4, 1:16)
b
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
dim(b)
[1] 4 4
c <- matrix(1:1000, nrow=500, ncol=2)
head(c)
[,1] [,2]
[1,] 1 501
[2,] 2 502
[3,] 3 503
[4,] 4 504
[5,] 5 505
[6,] 6 506
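Since a matrix is a vector with a dim attribute, the same structure can be sketched by assigning dimensions to a plain vector. The names v and by_row below are hypothetical, and byrow = TRUE is shown only to contrast with the default column-wise filling used above.

```r
# A plain integer vector (hypothetical name, not used elsewhere in this post)
v <- 1:6

# Assigning a dim attribute turns the vector into a 2 x 3 matrix,
# filled column by column
dim(v) <- c(2, 3)
v

# matrix() fills column-wise by default; byrow = TRUE fills row by row
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_row
```

Column-wise filling is why b above shows 1 to 4 running down the first column rather than across the first row.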
R objects, including data frames and matrices, can be given names, and they can hold a wide variety of elements. Names are useful for writing readable code and self-describing objects. For example,
name_vector <- LETTERS[1:5]
names(name_vector)
NULL
names(name_vector) <- c("First Letter", "Second Letter", "Third Letter", "Fourth Letter", "Fifth Letter")
names(name_vector)
[1] "First Letter" "Second Letter" "Third Letter" "Fourth Letter"
[5] "Fifth Letter"
name_vector
First Letter Second Letter Third Letter Fourth Letter Fifth Letter
"A" "B" "C" "D" "E"
Lists can also have names, for example:
list_names <- list(a = 15, b = 7, c = 56, d = 109)
names(list_names)
[1] "a" "b" "c" "d"
Here a, b, c, and d are the names of the values in the list. Finally, matrices can have names as well: we can assign names to both the rows and the columns of a matrix.
name_matrix <- matrix(1:15, nrow = 5, ncol = 3)
dimnames(name_matrix) <- list(c("Row 1", "Row 2", "Row 3", "Row 4", "Row 5"), c("Column 1", "Column 2", "Column 3"))
name_matrix
Column 1 Column 2 Column 3
Row 1 1 6 11
Row 2 2 7 12
Row 3 3 8 13
Row 4 4 9 14
Row 5 5 10 15
The rows and columns are named exactly the way I assigned them.
The names of the columns are now “Height” and “Weight”.
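A minimal sketch of how column names such as “Height” and “Weight” might be assigned with colnames(). The matrix body_measures and its values are hypothetical and made up purely for illustration.

```r
# Hypothetical measurements; the values are made up for illustration
body_measures <- matrix(c(170, 165, 180, 68, 59, 82), nrow = 3, ncol = 2)

# colnames() sets only the column names; rownames() does the same for rows
colnames(body_measures) <- c("Height", "Weight")
body_measures

# A named column can then be selected by its name
body_measures[, "Height"]
```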
d <- c(1,2,3)
e <- c(4,5,6)
f <- rbind(d, e)
f
[,1] [,2] [,3]
d 1 2 3
e 4 5 6
g <- cbind(d,e)
g
d e
[1,] 1 4
[2,] 2 5
[3,] 3 6
h <- matrix(7:18,ncol = 2)
h
[,1] [,2]
[1,] 7 13
[2,] 8 14
[3,] 9 15
[4,] 10 16
[5,] 11 17
[6,] 12 18
i <- matrix(1:12, ncol = 2)
j <- cbind(i,h)
colnames(j)<-c("SN", "Position", "Place", "Value")
j
SN Position Place Value
[1,] 1 7 7 13
[2,] 2 8 8 14
[3,] 3 9 9 15
[4,] 4 10 10 16
[5,] 5 11 11 17
[6,] 6 12 12 18
Factors represent categorical data. Factors are often used in modeling functions like lm() and glm(). They can be either:
a. Ordered: ordered factors represent things that are ranked, for example assistant professors, associate professors, and professors. They are categories and they are ordered: assistant professors are junior to full professors.
b. Unordered: some factors have no inherent order; their levels are only labels used to facilitate analyses. For example, we can code Male and Female as 0 and 1, which doesn’t mean males are inferior or junior to females.
Note: Using factors with their original labels is better than representing them by integers because the labels are easier to interpret.
m <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"))
m # Prints the factor m
[1] Agree Strongly Agree Strongly Agree Disagree Agree
[6] Disagree Strongly Agree Agree Disagree Agree
[11] Disagree
Levels: Agree Disagree Strongly Agree
table(m) # gives the tabular representation of the frequency counts in factor m
m
Agree Disagree Strongly Agree
4 4 3
unclass(m) # strips the class, showing the underlying integer codes and the levels attribute
[1] 1 3 3 2 1 2 3 1 2 1 2
attr(,"levels")
[1] "Agree" "Disagree" "Strongly Agree"
n <- factor(c("Agree","Strongly Agree", "Strongly Agree", "Disagree", "Agree", "Disagree", "Strongly Agree", "Agree", "Disagree", "Agree","Disagree"), levels = c("Strongly Agree","Agree", "Disagree"))
n
[1] Agree Strongly Agree Strongly Agree Disagree Agree
[6] Disagree Strongly Agree Agree Disagree Agree
[11] Disagree
Levels: Strongly Agree Agree Disagree
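The levels argument above only changes the order in which the levels are stored; the factor itself is still unordered. A minimal sketch of an ordered factor, using hypothetical academic ranks in the spirit of the earlier example (the names rank_levels and ranks are assumptions):

```r
# Hypothetical ranks, listed from most junior to most senior
rank_levels <- c("Assistant Professor", "Associate Professor", "Professor")

ranks <- factor(
  c("Professor", "Assistant Professor", "Associate Professor", "Assistant Professor"),
  levels = rank_levels,
  ordered = TRUE
)
ranks

# Comparisons are meaningful only for ordered factors
ranks < "Professor"

# The highest rank present in the data
max(ranks)
```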
Missing values in R are a special type of object. They are denoted by NA (not available) or NaN (not a number, the result of an undefined mathematical operation). We can use the following command to check for missing values in a data set.
I am now going to create a vector with some NA values:
m_value <- c("A", NA, "C", NA, "E", "F", "G", NA, "I")
# Checking if there is any NA Value in m_value
is.na(m_value)
[1] FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
is.na() returns a logical vector, i.e., the test is applied to every individual element of the vector, column, etc. The result shows that there are three NAs: the second, fourth, and eighth values. Now, creating a vector with both NAs and NaNs:
mi_value <- c(1, 1.5, 3, NA, 6, 0, NaN, NaN, NA, 19, 28, 7)
is.na(mi_value)
[1] FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE FALSE FALSE
is.nan(mi_value)
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
When I tested for NAs, the NaNs were also marked TRUE. However, when I tested for NaNs, only the NaNs were marked TRUE. In other words, every NaN counts as an NA, but not every NA is a NaN.
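Because the results are logical vectors, they can be counted directly. A short sketch, using a hypothetical vector scores, of counting missing values and skipping them in a summary:

```r
# Hypothetical vector containing both NA and NaN
scores <- c(4, NA, 7, NaN, 10)

# TRUE counts as 1, so sum() gives the number of missing values
sum(is.na(scores))   # counts NA and NaN together
sum(is.nan(scores))  # counts only NaN

# Summary functions return a missing value unless told to skip them
mean(scores)
mean(scores, na.rm = TRUE)
```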
There are a few different types of functions that are used to read data in R.
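As a minimal sketch of two such functions, read.csv() reads comma-separated text into a data frame and readLines() reads a file as plain lines of text. The temporary file and its contents below are made up for illustration.

```r
# Write a small, made-up CSV file to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,label", "1,a", "2,b", "3,c"), csv_path)

# read.csv() returns a data frame; the header row supplies the column names
small_df <- read.csv(csv_path)
small_df
str(small_df)

# readLines() reads the same file as plain lines of text instead
readLines(csv_path)

unlink(csv_path)  # clean up the temporary file
```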
The Analogous Writing Data Functions
c <- data.frame(a=1, b="a")
dput(c)
structure(list(a = 1, b = "a"), class = "data.frame", row.names = c(NA,
-1L))
dput(c, file="c.R")
new.c <- dget("c.R")
new.c
a b
1 1 a
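dput() handles one object at a time. For several objects, dump() and source() can be used as a pair in the same way. A short sketch with hypothetical object names and a temporary file:

```r
# Two hypothetical objects to save
x_num <- 1:3
y_df <- data.frame(a = 1, b = "a")

# dump() takes the object names as a character vector
dump_path <- tempfile(fileext = ".R")
dump(c("x_num", "y_df"), file = dump_path)

# Remove the objects, then recreate them by sourcing the file
rm(x_num, y_df)
source(dump_path)
x_num
y_df

unlink(dump_path)  # clean up the temporary file
```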
Reading Larger Datasets with read.table
Use the colClasses argument. Specifying this option instead of using the default can make read.table run much faster, often twice as fast. In order to use this option, we have to know the class of each column in the data set we are reading in. If all the columns are numeric, for example, then we can just set colClasses = "numeric". A quick and dirty way to figure out the class of each column is to read in only the first few rows (with the nrows argument) and apply class() to each column of the result.
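A sketch of the quick way to find the column classes: read only the first rows, record their classes, and reuse them for the full read. The file here is a small temporary one with made-up data standing in for a large dataset, and the names big_path, initial, classes, and full are assumptions.

```r
# Made-up data written to a temporary file, standing in for a large dataset
big_path <- tempfile(fileext = ".txt")
write.table(
  data.frame(id = 1:200, score = rnorm(200), group = rep(c("a", "b"), 100)),
  big_path,
  row.names = FALSE
)

# Read only the first rows and record the class of each column
initial <- read.table(big_path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)
classes

# Reuse those classes when reading the whole file
full <- read.table(big_path, header = TRUE, colClasses = classes)
dim(full)

unlink(big_path)  # clean up the temporary file
```

The first read is cheap because of nrows; the speed benefit only shows on files far larger than this toy example.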
Subsetting is one of the most common operations in R. There are three operators we need to remember while subsetting data in R: the single bracket operator [, the double bracket operator [[, and the dollar sign $.
h[2:3,] #Gives only rows 2 and 3
[,1] [,2]
[1,] 8 14
[2,] 9 15
h[1,] #Only gives 1st row
[1] 7 13
h[,2]#Only gives 2nd column
[1] 13 14 15 16 17 18
h[,1]#Only gives 1st column
[1] 7 8 9 10 11 12
h[4,2]#Only the value in the 4th row second column
[1] 16
h[6,1]#only the value in the 6th row first column
[1] 12
h[h > 3]# Only the values that are greater than 3
[1] 7 8 9 10 11 12 13 14 15 16 17 18
# Creating a logical matrix that gives TRUE or FALSE for every value in the matrix h, based on whether or not it meets the condition that I define
i <- h > 12 # Tests all the elements in h for if they are bigger than 12
i
[,1] [,2]
[1,] FALSE TRUE
[2,] FALSE TRUE
[3,] FALSE TRUE
[4,] FALSE TRUE
[5,] FALSE TRUE
[6,] FALSE TRUE
# If I subset the matrix h by the logical matrix i, I get all the elements that are bigger than 12
h[i]
[1] 13 14 15 16 17 18
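Note that h[1,] above came back as a plain vector rather than a 1 x 2 matrix. A short sketch of the drop argument, which controls this behaviour, using a hypothetical matrix mat with the same contents as h:

```r
# Same values as h above, under a hypothetical name
mat <- matrix(7:18, ncol = 2)

# By default a single row or column is simplified to a vector
row_vec <- mat[1, ]
dim(row_vec)   # NULL, because the result is no longer a matrix

# drop = FALSE keeps the matrix structure
row_mat <- mat[1, , drop = FALSE]
dim(row_mat)
```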
# Creating a list named j
j <- list(cap_letter = LETTERS[5:10], numbers = 5:10, g_list = c("milk", "water", "sugar", "oatmeal"))
# Subsetting the first element, the capital letters (single brackets return a list)
j[1]
$cap_letter
[1] "E" "F" "G" "H" "I" "J"
# Subsetting all items in the g_list
j$g_list
[1] "milk" "water" "sugar" "oatmeal"
# Subsetting the list of numbers
j[[2]]
[1] 5 6 7 8 9 10
#Or
j[["g_list"]]
[1] "milk" "water" "sugar" "oatmeal"
# Subsetting the list of numbers
j["numbers"]
$numbers
[1] 5 6 7 8 9 10
# Extracting multiple elements
j[c(2,3)] # extracts the second and third list items
$numbers
[1] 5 6 7 8 9 10
$g_list
[1] "milk" "water" "sugar" "oatmeal"
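The practical difference between the operators is the type of the result. A minimal sketch with a hypothetical list items shaped like j above:

```r
# Hypothetical list with the same kind of elements as j above
items <- list(cap_letter = LETTERS[5:10], numbers = 5:10)

# Single brackets always return a list
class(items["numbers"])

# Double brackets and $ return the element itself
class(items[["numbers"]])
class(items$cap_letter)

# [[ accepts a computed name; $ only accepts a literal name
wanted <- "numbers"
items[[wanted]]
```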
The double bracket operator can sometimes be tricky. It can also be used in the following ways:
k <- list(l = list(102, 103, 104, 105, 106), m = c(5.8, 3.21, 9.86, 101.32))
# Extracting the third item of the first element, l
k[[c(1,3)]]
[1] 104
# Or
k[[1]][[3]]
[1] 104
# Extracting the second item of the second element, m
k[[c(2,2)]]
[1] 3.21
#Or
k[[2]][[2]]
[1] 3.21
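Another place where the operators differ is name matching: $ matches names partially, while [[ requires an exact match unless told otherwise. A small sketch with a hypothetical list animals:

```r
# Hypothetical list with one long name
animals <- list(aardvark = 1:5)

animals$a                      # $ matches the name partially
animals[["a"]]                 # [[ requires an exact match by default
animals[["a", exact = FALSE]]  # partial matching can be switched on
```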
Removing missing values is a very common operation in data analysis. Data in the social sciences often have many missing values, and a data frame without any missing values is rare.
n <- c(NA, 2, 3, NA, NA, 4, NA, NA, NA, 2, 1, NA, 3, 3, 4, 4, NA)
n # prints the object n
[1] NA 2 3 NA NA 4 NA NA NA 2 1 NA 3 3 4 4 NA
r_na <- is.na(n) # logical vector: TRUE wherever an element of n is NA
r_na # prints the r_na object
[1] TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE
[13] FALSE FALSE FALSE FALSE TRUE
n[!r_na]# gives us an object without NAs in it.
[1] 2 3 4 2 1 3 3 4 4
If we have multiple vectors and want to check them for missing values, we repeat the same process. Let’s create another object named o and work through an example.
o <- c("a", "b", "d", NA, NA, "g", "h", NA, "i", NA, "k", "l", "m", NA, NA, NA, NA)
no_nas <- complete.cases(n, o) # TRUE where neither n nor o is missing
no_nas # printing the new object
[1] FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE
[13] TRUE FALSE FALSE FALSE FALSE
n[no_nas]#printing object n after removing NAs
[1] 2 3 4 1 3
o[no_nas]#printing object o after removing NAs
[1] "b" "d" "g" "k" "m"
This approach can be extended to data frames as well. Let’s look at the airquality data frame and apply the functions above.
airquality[1:15,] #Checking the values in first 15 rows
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
5 NA NA 14.3 56 5 5
6 28 NA 14.9 66 5 6
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
10 NA 194 8.6 69 5 10
11 7 NA 6.9 74 5 11
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
We can see missing values in the fifth, sixth, tenth, and eleventh rows; the fifth row has two of them.
rm_missing <- complete.cases(airquality)
airquality[rm_missing,][1:15,] # first 15 rows after dropping the rows that contain NAs
Ozone Solar.R Wind Temp Month Day
1 41 190 7.4 67 5 1
2 36 118 8.0 72 5 2
3 12 149 12.6 74 5 3
4 18 313 11.5 62 5 4
7 23 299 8.6 65 5 7
8 19 99 13.8 59 5 8
9 8 19 20.1 61 5 9
12 16 256 9.7 69 5 12
13 11 290 9.2 66 5 13
14 14 274 10.9 68 5 14
15 18 65 13.2 58 5 15
16 14 334 11.5 64 5 16
17 34 307 12.0 66 5 17
18 6 78 18.4 57 5 18
19 30 322 11.5 68 5 19
Now, we don’t see any missing values.
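As a cross-check, na.omit() gives the same result as subsetting by complete.cases(). The sketch below uses the built-in airquality data; the name clean_air is hypothetical.

```r
# Number of rows before and after dropping incomplete cases
nrow(airquality)
sum(complete.cases(airquality))

# na.omit() drops every row that contains at least one NA
clean_air <- na.omit(airquality)
nrow(clean_air)

# Count of missing values per column in the original data
colSums(is.na(airquality))
```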
Vectorized operations are one of the R features that make it easy to write code without a lot of loops and other complex constructs, so they are well worth knowing when learning R.
We can perform many everyday calculations on vectors and other objects, just as we do in mathematics. Let’s take some examples.
h <- 3:6
i <- 1:4
#Adding two objects
h+i
[1] 4 6 8 10
# Multiplying two objects
h * i
[1] 3 8 15 24
# Dividing an object by another
h/i
[1] 3.000000 2.000000 1.666667 1.500000
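To make the saving concrete, here is a sketch comparing an explicit loop with the single vectorized expression. The names left, right, loop_sum, and vec_sum are hypothetical; the values match h and i above.

```r
# Same values as h and i above, under hypothetical names
left <- 3:6
right <- 1:4

# Loop version: add the elements one pair at a time
loop_sum <- numeric(length(left))
for (idx in seq_along(left)) {
  loop_sum[idx] <- left[idx] + right[idx]
}

# Vectorized version: one expression, same result
vec_sum <- left + right
all(loop_sum == vec_sum)

# Comparisons are vectorized as well
left > 4
```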
Elementwise arithmetic requires matrices with the same dimensions. When I tried to add a 2X2 and a 2X3 matrix, or a 2X2 and a 3X2 matrix, R stopped with an error. Below I create 2X2, 3X2, and 2X3 matrices and check whether the above calculations can be performed on them.
d <- matrix(1:4, 2, 2)
e <- matrix(3:8, 3, 2)
f <- matrix(4:9, 2, 3)
# Adding the matrices
#d + e # Not Possible
#d + f # Not Possible
#e + f # Not Possible
# Multiplying the matrices
#d * e # Not Possible
#d * f # Not Possible
#e * f # Not Possible
# Dividing the matrices
# d/e # Not Possible
# d/f # Not Possible
# e/f # Not Possible
They don’t work.
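To see the failure without stopping a script, the error can be captured with tryCatch(). This is a sketch with hypothetical matrix names; in an English locale the captured message reads "non-conformable arrays".

```r
# Hypothetical matrices with different dimensions
two_by_two <- matrix(1:4, 2, 2)
three_by_two <- matrix(3:8, 3, 2)

# tryCatch() captures the error message instead of stopping the script
msg <- tryCatch(
  two_by_two + three_by_two,
  error = function(err) conditionMessage(err)
)
msg

# Elementwise arithmetic needs identical dimensions
identical(dim(two_by_two), dim(three_by_two))
```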
d <- matrix(1:4, 2, 2)
d
[,1] [,2]
[1,] 1 3
[2,] 2 4
e <- matrix(rep(5, 4), 2, 2) #Creating 2X2 matrix with 5s repeating 4 times
e
[,1] [,2]
[1,] 5 5
[2,] 5 5
## Adding
d + e
[,1] [,2]
[1,] 6 8
[2,] 7 9
The addition was performed like this: 1 + 5 = 6; 2 + 5 = 7; 3 + 5 = 8; 4 + 5 = 9
d * e
[,1] [,2]
[1,] 5 15
[2,] 10 20
The elementwise multiplication was performed like this: 1 * 5 = 5; 2 * 5 = 10; 3 * 5 = 15; 4 * 5 = 20. However, this is not true matrix multiplication, which uses the %*% operator:
d %*% e
[,1] [,2]
[1,] 20 20
[2,] 30 30
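Unlike elementwise arithmetic, %*% does not need identical dimensions; it needs the number of columns of the left matrix to equal the number of rows of the right one. A minimal sketch with hypothetical matrices m_a and m_b:

```r
# Hypothetical matrices: 2 x 2 and 2 x 3
m_a <- matrix(1:4, 2, 2)
m_b <- matrix(4:9, 2, 3)

# Columns of m_a (2) match rows of m_b (2), so the product is defined
prod_ab <- m_a %*% m_b
dim(prod_ab)   # 2 x 3
prod_ab

# The reverse order is not defined: m_b has 3 columns, m_a has 2 rows
ncol(m_b) == nrow(m_a)
```

Each entry is a row-by-column sum, for example the top-left entry is 1 * 4 + 3 * 5 = 19.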