R has various data containers (e.g., variables, vectors, matrices, factors, data frames, and lists) on which we run our analyses. In this chapter, we examine these basic elements systematically using examples to get you going.
Variables have different types and usually contain a specific piece of information (i.e., a specific characteristic such as height, purchase intention, gender) about cases (i.e., experimental participants, customers, etc.). Some of the variable types/classes are: numeric, character and logical.
The R code below defines one variable of each of these three types, in that order:
A = 2.78
My_text = "Hello"
My_logical = FALSE
Assignments in R can also be done with the <- assignment operator. For example, x <- 2 assigns 2 to the variable x. Another way to assign a value to a variable is to use the assign() function. The name of the variable comes first, always within quotation marks, followed by the value to be given to that variable:
assign("a",2)
a
## [1] 2
After a variable has been assigned a value, we can examine its contents in different ways. For example, we can use the print() function or simply type the name of the variable.
print (A)
## [1] 2.78
A
## [1] 2.78
All elements in R are objects. Objects belong to classes/types, which have their own specific characteristics. To identify the class of an object (e.g., the type of a variable), the class() function can be used as follows:
class(A)
## [1] "numeric"
class(My_text)
## [1] "character"
class(My_logical)
## [1] "logical"
A more comprehensive way to examine the nature of an object is to use the str() function, short for structure. Compared to class(), which only identifies the type of an object (e.g., whether it is numeric, character, etc.), str() provides further information about the object. Because R is case-sensitive, the function name must be typed in lowercase. str() is especially useful for summarizing lists and data frames.
str(A)
## num 2.78
str(My_text)
## chr "Hello"
str(My_logical)
## logi FALSE
Yet another way to summarize an object is to use the summary() function. Numeric variables are summarized by their mean, median, and some quantiles. Categorical and logical vectors are summarized by the counts of each value. Multidimensional objects, such as matrices and data frames, are summarized by column. For example, calling summary() on a data frame works like calling summary() on each individual column.
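As a minimal sketch of the behaviour described above, the following lines apply summary() to a numeric and a logical vector. The vectors heights and passed are made up purely for illustration.

```r
# Hypothetical example vectors (names and values are illustrative only)
heights <- c(160, 172, 181, 168, 175)
passed  <- c(TRUE, FALSE, TRUE, TRUE, NA)

# Numeric vector: minimum, quartiles, median, mean, and maximum
height_summary <- summary(heights)
height_summary

# Logical vector: the mode plus counts of FALSE, TRUE, and missing values
passed_summary <- summary(passed)
passed_summary
```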
Most of the common classes have their own is.*()
functions:
is.character("red lorry, yellow lorry")
## [1] TRUE
is.logical(FALSE)
## [1] TRUE
is.numeric(2)
## [1] TRUE
is.list(list(a = 1, b = 2)) # we learn about lists later in this chapter.
## [1] TRUE
We can see a complete list of all the is.*() functions in the base package using:
ls(pattern = "^is", baseenv())
## [1] "is.array" "is.atomic"
## [3] "is.call" "is.character"
## [5] "is.complex" "is.data.frame"
## [7] "is.double" "is.element"
## [9] "is.environment" "is.expression"
## [11] "is.factor" "is.finite"
## [13] "is.function" "is.infinite"
## [15] "is.integer" "is.language"
## [17] "is.list" "is.loaded"
## [19] "is.logical" "is.matrix"
## [21] "is.na" "is.na.data.frame"
## [23] "is.na.numeric_version" "is.na.POSIXlt"
## [25] "is.na<-" "is.na<-.default"
## [27] "is.na<-.factor" "is.na<-.numeric_version"
## [29] "is.name" "is.nan"
## [31] "is.null" "is.numeric"
## [33] "is.numeric.Date" "is.numeric.difftime"
## [35] "is.numeric.POSIXt" "is.numeric_version"
## [37] "is.object" "is.ordered"
## [39] "is.package_version" "is.pairlist"
## [41] "is.primitive" "is.qr"
## [43] "is.R" "is.raw"
## [45] "is.recursive" "is.single"
## [47] "is.symbol" "is.table"
## [49] "is.unsorted" "is.vector"
## [51] "isatty" "isBaseNamespace"
## [53] "isdebugged" "isIncomplete"
## [55] "isNamespace" "isNamespaceLoaded"
## [57] "isOpen" "isRestart"
## [59] "isS4" "isSeekable"
## [61] "isSymmetric" "isSymmetric.matrix"
## [63] "isTRUE"
In the preceding example, ls() lists the objects in the environment it is given; ^is is a regular expression that matches strings beginning with "is"; and baseenv() is a function that simply returns the environment of the base package.
Note: is.numeric() returns TRUE for integers as well as floating point values.
Sometimes we want to change the type of an object. This is called casting, and most is.*() functions have a corresponding as.*() function to achieve this. For example, in the code below, we first assign a character string to a variable and then convert it into a number.
myvar= "1234"
class(myvar)
as.numeric(myvar) # returns the number 1234, but notice that the class of myvar itself remains character.
class(myvar)
myvar + 10 # returns an error, because myvar is still a character variable.
as.numeric(myvar)+10 # it works.
Note: as.*() functions return a converted copy of their input; they do not change the class of the original variable unless the result is assigned back to it (e.g., myvar = as.numeric(myvar)). The class of a variable can also be changed in place with the class() function, though this is not recommended because class assignments usually have a different and more technical use. For illustration purposes:
myvar= "1234"
myvar + 10 #does not work because myvar is still a character variable.
class(myvar) = "numeric"
myvar + 10 #now it works.
An alternative approach is to assign the result of as.*() to a new variable. For example:
myvar="1234"
new_var= as.numeric(myvar)
new_var+10
## [1] 1244
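Conversion only succeeds when the content can actually be interpreted as the target type. The short sketch below, which uses made-up input strings, shows what happens otherwise: the result is NA and R issues a warning.

```r
# A string that looks like a number converts cleanly
converted <- as.numeric("1234")

# A string that cannot be read as a number becomes NA, and R issues a warning;
# suppressWarnings() is used here only to keep the output quiet
not_a_number <- suppressWarnings(as.numeric("twelve"))
is.na(not_a_number)

# Other as.*() functions work in the same way
as.character(3.14)
as.logical("TRUE")
```

It is therefore worth checking the result with is.na() after converting data that came from outside R.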
To help with arithmetic, R supports four special numeric values: Inf, -Inf, NaN, and NA. The first two are, of course, positive and negative infinity, but the second pair need a little more explanation. NaN is short for not-a-number and means that our calculation either didn’t make mathematical sense or could not be performed properly. NA is short for not available and represents a missing value, a problem all too common in data analysis.
In general, if our calculation involves a missing value, then the results will also be missing:
grade= c(3,5,7, NA)
mean(grade) # results in NA because of the missing value in the vector.
## [1] NA
There are functions to check for these special values. The family of is.*() functions comes in handy again for this purpose.
is.na(grade) # checks, element by element, whether each value in the grade vector is NA (missing).
## [1] FALSE FALSE FALSE TRUE
is.nan(grade) # checks, element by element, whether each value is undefined, that is NaN.
## [1] FALSE FALSE FALSE FALSE
limits = c (Inf, 2, 5) #Inf which represents infinity is not undefined.
is.nan(limits)
## [1] FALSE FALSE FALSE
is.infinite(limits) #this indicates which value in the vector is infinity.
## [1] TRUE FALSE FALSE
impossible = c(Inf/Inf, 9, 0) #Infinity divided by infinity is not defined. So is.nan(impossible) returns a TRUE this time.
is.nan(impossible)
## [1] TRUE FALSE FALSE
Finally, ! is used for not, & is used for and, and | is used for or.
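A brief sketch of these three operators applied to two logical vectors follows. The vectors x and y are made up for illustration; each operator works element by element.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

not_x   <- !x      # negation of each element of x
x_and_y <- x & y   # TRUE only where both x and y are TRUE
x_or_y  <- x | y   # TRUE where at least one of x and y is TRUE

not_x
x_and_y
x_or_y
```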
To control how numerical data are printed, one can use the format() function. Notice that the input should be numeric, but the output is always a character vector or array.
initial = c(1:3)
powered= exp(initial) #generates the e^initials.
powered # see the output numbers.
## [1] 2.718282 7.389056 20.085537
class(powered)
## [1] "numeric"
format(powered, digits = 3, scientific = FALSE) # digits sets the number of significant digits. Enough decimal places are used for the smallest number to show that many, so larger numbers can be longer, as with 20.09 in this example.
## [1] " 2.72" " 7.39" "20.09"
format(powered, digits = 3, scientific = TRUE) # scientific notation.
## [1] "2.72e+00" "7.39e+00" "2.01e+01"
Another basic element in R is the vector, which combines and stores a set of data values of the same type. For example, all of them are numeric, or all of them are character. We can define a vector with the function c(), standing for combine. Moreover, we can assign names to the elements of a vector using the names() function:
some_vector = c("John Doe", "poker player")
names(some_vector) = c("Name", "Profession")
This code first creates a vector some_vector and then gives each of its two elements a name. The first element is labeled Name, while the second element is labeled Profession. Printing the vector leads to the following:
some_vector
## Name Profession
## "John Doe" "poker player"
We can also check the structure of our vector:
str(some_vector)
## Named chr [1:2] "John Doe" "poker player"
## - attr(*, "names")= chr [1:2] "Name" "Profession"
We can also specify names directly when we create a vector, in the form name = value. You can name some elements of a vector and leave others blank.
fruits= c(apple = 1, banana = 2, "kiwi fruit" = 3, 4)
fruits
## apple banana kiwi fruit
## 1 2 3 4
We can also name the elements of a vector after the vector is created using the names() function. Imagine we have two numerical vectors. The first one contains the amounts in dollars that you won at poker from Monday to Friday last week, and the other one contains the amounts in dollars that you won at roulette on those days.
poker_vector = c(140, -50, 20, -120, 240)
roulette_vector = c(-24, -50, 100, -350, 10)
Suppose we want to name the elements of these two vectors after the days of the week. Because the names are the same for both vectors, we can avoid repetition by first creating a vector containing the names and then assigning it as the names of each of the above vectors.
days_vector = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) = days_vector
names(roulette_vector) = days_vector
poker_vector
## Monday Tuesday Wednesday Thursday Friday
## 140 -50 20 -120 240
roulette_vector
## Monday Tuesday Wednesday Thursday Friday
## -24 -50 100 -350 10
To retrieve the names of a vector, we can use the names() function again (when nothing is assigned to it, it returns the current names):
names(poker_vector)
## [1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
Note: Can a vector have different types of data in it? Not quite. We can pass elements of different types to c(), but R then coerces all of them to a single common type. For example, vector a below is created from both numeric and character values. When we check its class, it is a character vector, because R has coerced the numbers into character strings. Vector b, however, has only numeric elements, and its class is numeric.
a = c(2, "Rad", 5)
b= c(4,6,9)
class(a)
## [1] "character"
class(b)
## [1] "numeric"
Because the two vectors are of different classes, we cannot run mathematical operations on them together. For example, it is meaningless to add the two vectors, and R would stop with an error (non-numeric argument to binary operator) in that case.
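The sketch below illustrates both points: what happens when the two vectors are added, and how R picks a common type when values of different types are combined. The use of tryCatch() is only a convenience here, so that the message can be inspected without halting a script.

```r
a <- c(2, "Rad", 5)   # mixed input: every element is coerced to character
b <- c(4, 6, 9)

# Adding a character vector to a numeric vector fails;
# tryCatch() captures the message instead of stopping the script
result <- tryCatch(a + b, error = function(e) conditionMessage(e))
result

# Coercion moves toward the more general type: logical, then numeric, then character
class(c(TRUE, 2))       # logical combined with numeric gives numeric
class(c(TRUE, "yes"))   # anything combined with character gives character
```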
Let’s work with the poker vector we defined in the previous section.
poker_vector <- c(140, -50, 20, -120, 240)
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
Imagine we want to select elements of this vector. In general, to select elements of a vector (and also of matrices, data frames, etc.), we can use square brackets []. Between the square brackets, we indicate which element(s) to select. For example, to select the first and then the fourth element of the vector, we use the following lines of code:
poker_vector[1]
## Monday
## 140
poker_vector[4]
## Thursday
## -120
It is important to notice that the first element in a vector has index of 1, not 0 as in many other programming languages.
Another way to call an element is to call its corresponding name label, instead of using its numeric location. For example, the following code also returns the first element in the vector:
poker_vector["Monday"]
## Monday
## 140
We can also select more than one element of a vector at once. For example, to select elements 2, 3 and 4 from our vector, we can use the following code:
poker_vector[2:4] # the symbol a:b considers all numbers between a and b, including a and b.
## Tuesday Wednesday Thursday
## -50 20 -120
An alternative method is to use a vector of locations we are interested in within the brackets. For example:
poker_vector[c(2,3,4)]
## Tuesday Wednesday Thursday
## -50 20 -120
This method is useful, particularly when we want to call elements in an unordered manner.
poker_vector[c(1,4,5)]
## Monday Thursday Friday
## 140 -120 240
The same method can be coded in more than one line as below:
element_place = c(1,4,5)
poker_vector[(element_place)]
## Monday Thursday Friday
## 140 -120 240
Another way to call the elements using their names is:
poker_vector[c("Monday","Friday")]
## Monday Friday
## 140 240
We should note that (1) we must use the brackets, and (2) within the brackets, we must use only ONE item to index locations. That item could be a single number, a character string, a vector, etc., but it is always ONE item. Therefore, code like poker_vector[1,3,5] is wrong, because within the brackets we have not ONE item but three.
Another useful function is length(). All vectors have a length, which tells us how many elements they contain. This is a non-negative integer (yes, zero-length vectors are allowed). Below you see several examples of this function:
length(1:5)
## [1] 5
length(c(TRUE, FALSE, NA))
## [1] 3
sn <- c("Sheena", "leads", "Sheila", "needs")
length(sn) # returns the number of strings in the vector.
## [1] 4
If, instead of the number of strings in a character vector, we are interested in the length of each string in that vector (i.e., its number of characters), then the nchar() function can be used:
sn <- c("Sheena", "leads", "Sheila", "needs")
nchar(sn)
## [1] 6 5 6 5
While we use the length() function to check the number of elements in a vector, it is also possible to assign a new length to a vector using this function. This is an unusual thing to do, however, and probably indicates bad code. If you shorten a vector, the values at the end will be removed, and if you extend a vector, missing values will be added to the end:
poincare = c(1, 0, 0, 0, 2, 0, 2, 0)
length(poincare) = 3
poincare
## [1] 1 0 0
length(poincare) = 8
poincare
## [1] 1 0 0 NA NA NA NA NA
We can examine the elements of a vector using comparison and logical operators. When we do this, the operation (e.g., a comparison) is conducted on every element of the vector. The result of such an operation is also a vector, consisting of a TRUE or FALSE value for each comparison made. The result is therefore a logical vector of the same length as the original vector.
For example, let’s see if there is any element bigger than 50 in our poker_vector. The simple line below does the job for us:
poker_vector > 50 #when there is a TRUE in the output, it means that its corresponding element in the vector satisfies the condition.
## Monday Tuesday Wednesday Thursday Friday
## TRUE FALSE FALSE FALSE TRUE
Notice that our aim is beyond simply running a logical operation on every element of a vector. For example, we would like to call only those elements in the vector that satisfy a condition, or in the case of our example, those that are bigger than 50. Below, we introduce two ways to do so:
# method 1:
poker_vector[poker_vector>50]
## Monday Friday
## 140 240
# method 2
bigger_than_50_true_vector = poker_vector>50
poker_vector[bigger_than_50_true_vector]
## Monday Friday
## 140 240
Note that when a logical vector, or an expression that produces one, is placed within the brackets (as in the code above), only those elements whose corresponding value is TRUE are returned. This is a great convenience provided by R.
The rep() function is very useful for creating a vector with repeated elements:
rep(1:5, 3)
## [1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4 5
rep(1:5, each = 3) #compare the output with the previous code.
## [1] 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5
rep(1:5, times = 1:5) # each element is repeated as many times as the matching value in times.
## [1] 1 2 2 3 3 3 4 4 4 4 5 5 5 5 5
rep(1:5, length.out = 7) # repeats 1 to 5 until 7 numbers in total are created.
## [1] 1 2 3 4 5 1 2
rep_len(1:5, 7) #an alternative for the above code.
## [1] 1 2 3 4 5 1 2
The paste() function pastes the elements of its first argument to the corresponding elements of its second argument, recycling the shorter vector if needed. Its output is a character vector, not a numeric one, even when its arguments are numeric.
paste(c(1,2,3), c(1:10)) # see the output; remember it is a character vector.
## [1] "1 1" "2 2" "3 3" "1 4" "2 5" "3 6" "1 7" "2 8" "3 9" "1 10"
paste(c("hello", "hi"), c("bye", "bye"))
## [1] "hello bye" "hi bye"
paste(c("hello", "hi"), c("bye", "bye"), sep = "-") #indicating the separator type.
## [1] "hello-bye" "hi-bye"
Create a vector containing 100 randomly generated numbers from a standard normal distribution, that is, with mean equal to 0 and sd of 1. After doing so, (a) check whether there are numbers in that vector bigger than 1, (b) print those numbers, (c) indicate how many numbers in that vector are bigger than 1 using the length() function. What proportion of the numbers are bigger than 1?
My_Vector = rnorm(100, mean=0, sd=1)
#(a)
My_Vector > 1
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [34] FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE
## [45] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [56] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
## [78] TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [89] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE
## [100] TRUE
#(b)
My_Vector[My_Vector>1]
## [1] 2.125449 1.742505 1.518000 1.557337 1.559839 2.933353 1.265785
## [8] 1.681433 1.275373 2.181931 1.297052 1.638294 1.862588 1.123541
## [15] 1.771729 1.589635 1.603114
#(c)
Bigger_than_ONE_Vector = My_Vector[My_Vector>1]
length(Bigger_than_ONE_Vector) #returns the length of a vector.
## [1] 17
percent = length(Bigger_than_ONE_Vector)/100 # the proportion of numbers bigger than 1 in our sample vector (multiply by 100 for a percentage).
percent
## [1] 0.17
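Part (c) can also be written more compactly, because R treats TRUE as 1 and FALSE as 0 in arithmetic. The sketch below is one possible shortcut, not the only way to do it. The seed value is arbitrary and is set only so that the run is repeatable, so the numbers will differ from the output shown above.

```r
set.seed(123)   # arbitrary seed, used only for reproducibility
My_Vector <- rnorm(100, mean = 0, sd = 1)

count_above_1 <- sum(My_Vector > 1)    # number of TRUE values
share_above_1 <- mean(My_Vector > 1)   # proportion of TRUE values

count_above_1
share_above_1
```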
Note: a cool example is to repeat the above code 5000 times. Each time, we can record the proportion of random numbers that lie in (-1, 1), that is, within one standard deviation of the mean. The overall average of those proportions is an estimate of the true proportion of values that lie within +/- 1SD in the normal distribution. The same thing can be done for +/- 2SD and then 3SD. The latter should include about 99.7% of all observations. We will do this exercise when we study loops.
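Ahead of the loops chapter, the idea can be sketched without an explicit loop by using replicate(), which evaluates an expression a given number of times. The helper share_within() and the seed are assumptions made for this illustration only.

```r
set.seed(42)   # arbitrary seed, chosen only so the run is repeatable

# Hypothetical helper: share of n standard normal draws within k standard deviations
share_within <- function(k, n = 100) {
  draws <- rnorm(n)
  mean(abs(draws) < k)
}

# Repeat the experiment 5000 times for k = 1, 2, 3 and average the shares
estimates <- sapply(1:3, function(k) mean(replicate(5000, share_within(k))))
names(estimates) <- c("1SD", "2SD", "3SD")
estimates
```

The three averages should land close to the well-known values of roughly 68%, 95%, and 99.7%.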
We can make a matrix with the matrix() function. Matrices have rows and columns, and therefore we need to specify their arrangement. For example, let’s begin by arranging the numbers 1 to 9 in a 3x3 matrix.
matrix(1:9, byrow = TRUE, ncol = 3) # arrange numbers 1 to 9, in a 3x3 matrix, in a row-based order.
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
matrix(1:9, byrow = FALSE, ncol = 3) # arrange numbers 1 to 9, in a 3x3 matrix, in a column-based order.
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
The output of the matrix() function is a matrix object and can be saved into another object/variable.
my_object = matrix(1:20, byrow = TRUE, ncol = 10) # a matrix of 2x10, that is 2 rows and 10 columns.
str(my_object) #indicates the structure of the matrix object.
## int [1:2, 1:10] 1 11 2 12 3 13 4 14 5 15 ...
my_object
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
Imagine that the following three vectors specify Star Wars sales in the USA and Europe, in millions of dollars, for three episodes of the series: A New Hope, The Empire Strikes Back, and Return of the Jedi. a) Create a new vector that combines all the sales and call it total_sales. b) Create a 3x2 matrix of Star Wars sales, such that the first column is USA sales and the second is Europe sales. Each row should then represent one of the episodes. c) Print the matrix.
new_hope = c(460.998, 314.4)
empire_strikes = c(290.475, 247.900)
return_jedi = c(309.306, 165.8)
total_sales = c(new_hope, empire_strikes, return_jedi)
star_wars_matrix = matrix(total_sales, byrow = TRUE, ncol = 2)
star_wars_matrix
## [,1] [,2]
## [1,] 460.998 314.4
## [2,] 290.475 247.9
## [3,] 309.306 165.8
Now that we have made a matrix, we can label its rows and columns. For naming the rows, we use the rownames() function, and for naming the columns, we use the colnames() function.
First, we define two vectors containing the names for the columns (regions of sales) and the rows (film titles), respectively:
region = c("US", "non-US")
titles = c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
Now we assign these names using the rownames() and colnames() functions as follows:
rownames(star_wars_matrix) = titles
colnames(star_wars_matrix) = region
We could also do this naming more directly, as follows:
rownames(star_wars_matrix) = c("A New Hope", "The Empire Strikes Back", "Return of the Jedi")
colnames(star_wars_matrix) = c("US", "non-US")
We can now print and see the outcome:
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
If a matrix’s rows and columns already have names, we can retrieve them with the same rownames() and colnames() functions.
rownames(star_wars_matrix)
## [1] "A New Hope" "The Empire Strikes Back"
## [3] "Return of the Jedi"
colnames(star_wars_matrix)
## [1] "US" "non-US"
For matrices, the dim() function returns an integer vector of the dimensions of the matrix: the number of rows and then the number of columns.
dim(star_wars_matrix)
## [1] 3 2
For matrices, the functions nrow()
and ncol()
return the number of rows and columns, respectively:
nrow(star_wars_matrix)
## [1] 3
ncol(star_wars_matrix)
## [1] 2
The length()
function that we have previously used with vectors also works on matrices. In this case it returns the product of dimensions:
length(star_wars_matrix)
## [1] 6
We can also use the summary() function on matrices. As said before, summary() works on multidimensional objects by summarizing them column by column.
summary(star_wars_matrix)
## US non-US
## Min. :290.5 Min. :165.8
## 1st Qu.:299.9 1st Qu.:206.8
## Median :309.3 Median :247.9
## Mean :353.6 Mean :242.7
## 3rd Qu.:385.2 3rd Qu.:281.1
## Max. :461.0 Max. :314.4
Imagine we want to see the total sales for each Star Wars movie, as well as the total sales in each region. The functions rowSums() and colSums() do the job for us. The function rowSums() calculates the sum of each row of the matrix, and the function colSums() calculates the sum of each column. The outcome of either function is a vector. Moreover, if the rows or columns of the matrix have names, those names are carried over to the vector that rowSums() or colSums() returns.
movie_sales = rowSums(star_wars_matrix)
movie_sales
## A New Hope The Empire Strikes Back Return of the Jedi
## 775.398 538.375 475.106
region_sales = colSums(star_wars_matrix)
region_sales
## US non-US
## 1060.779 728.100
The c() function turns matrices into vectors before combining them, and so returns a vector. It is therefore not useful for combining matrices. We use the functions cbind() and rbind() instead.
Imagine we would like to add a column to star_wars_matrix that contains the total sales in each row, that is, the total sales for each movie. We can do this using the cbind() function, which stands for column bind. The outcome of this function is a new matrix that can be named and used in further analyses.
Here we take the vector of row sums that we calculated in the previous section and bind it to the matrix.
new_matrix_col = cbind(star_wars_matrix, movie_sales)
new_matrix_col #notice that the name of the vector (i.e., movie_sales) is now the label name of the new column in our matrix. Cool!
## US non-US movie_sales
## A New Hope 460.998 314.4 775.398
## The Empire Strikes Back 290.475 247.9 538.375
## Return of the Jedi 309.306 165.8 475.106
In a similar fashion, the function rbind() adds a row to an already existing matrix. The output is a new matrix that can be named and used.
new_matrix_row = rbind(star_wars_matrix, region_sales)
new_matrix_row # notice that the name of the vector (i.e., region_sales) is now the label of the new row in our matrix.
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
## region_sales 1060.779 728.1
Make two matrices of 3x3, one arranging numbers from 1 to 9, and the other from 10 to 18. Then combine them both into a new matrix of 3x6. Could we also make a matrix of 6x3 out of these two?
A = matrix(1:9, byrow = TRUE, ncol = 3)
B = matrix(10:18, byrow = TRUE, ncol=3)
M = cbind(A, B)
M #matrix of 3x6
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 2 3 10 11 12
## [2,] 4 5 6 13 14 15
## [3,] 7 8 9 16 17 18
N = rbind(A, B)
N #matrix of 6x3
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
## [4,] 10 11 12
## [5,] 13 14 15
## [6,] 16 17 18
NOTE: As in the above example, both cbind() and rbind() can be used for combining matrices. For example, big_matrix = cbind(matrix1, matrix2, ..., matrixN) combines the N matrices, column-wise, into a bigger one.
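As a small sketch of combining more than two matrices, the code below binds three made-up 2x2 matrices (m1, m2, m3 are illustrative names). Note that cbind() needs matrices with the same number of rows, and rbind() needs the same number of columns.

```r
m1 <- matrix(1:4,  ncol = 2)
m2 <- matrix(5:8,  ncol = 2)
m3 <- matrix(9:12, ncol = 2)

wide <- cbind(m1, m2, m3)   # columns placed side by side: 2 rows, 6 columns
tall <- rbind(m1, m2, m3)   # rows stacked on top of each other: 6 rows, 2 columns

dim(wide)
dim(tall)
```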
Similar to vectors, we use square brackets [ ] to index one or multiple elements of a matrix. Whereas vectors are one-dimensional, matrices have two dimensions (rows, columns), and thus a comma is used to separate the row and column coordinates of an element. For example, the following code returns the element in the 3rd row and 2nd column of star_wars_matrix:
star_wars_matrix [3,2]
## [1] 165.8
Also, the code below returns the entire 2nd row, the entire 1st column, and then the elements in rows 1 and 3 of the 1st column of star_wars_matrix, respectively.
star_wars_matrix [2,] #returning 2nd row of the matrix.
## US non-US
## 290.475 247.900
star_wars_matrix [,1] #returning 1st column of the matrix.
## A New Hope The Empire Strikes Back Return of the Jedi
## 460.998 290.475 309.306
star_wars_matrix [c(1,3),1] #returns elements [1,1] and [3,1]
## A New Hope Return of the Jedi
## 460.998 309.306
star_wars_matrix
## US non-US
## A New Hope 460.998 314.4
## The Empire Strikes Back 290.475 247.9
## Return of the Jedi 309.306 165.8
Create a 10 x 10 matrix consisting of the numbers from 1 to 100. Name the matrix my_matrix and do the following:
a) return all elements in rows 2 to 4 and columns 3 to 6.
b) return elements (1,1) and (3,1) with one command.
c) try the code my_matrix[c(1,3,5), c(2,4,6)]. What is the output?
my_matrix = matrix(1:100, byrow = TRUE, ncol = 10)
my_matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 11 12 13 14 15 16 17 18 19 20
## [3,] 21 22 23 24 25 26 27 28 29 30
## [4,] 31 32 33 34 35 36 37 38 39 40
## [5,] 41 42 43 44 45 46 47 48 49 50
## [6,] 51 52 53 54 55 56 57 58 59 60
## [7,] 61 62 63 64 65 66 67 68 69 70
## [8,] 71 72 73 74 75 76 77 78 79 80
## [9,] 81 82 83 84 85 86 87 88 89 90
## [10,] 91 92 93 94 95 96 97 98 99 100
# Part A.
my_matrix[2:4, 3:6]
## [,1] [,2] [,3] [,4]
## [1,] 13 14 15 16
## [2,] 23 24 25 26
## [3,] 33 34 35 36
# Part B.
my_matrix[c(1,3), 1]
## [1] 1 21
# Part C.
my_matrix[c(1,3,5), c(2,4,6)] # it returns a 3x3 matrix consisting of elements (1,2), (1,4), and (1,6) of my_matrix in its first row, then (3,2), (3,4), and (3,6) in its second row, and finally (5,2), (5,4), and (5,6) in its last row.
## [,1] [,2] [,3]
## [1,] 2 4 6
## [2,] 22 24 26
## [3,] 42 44 46
Note: For visualizing two-dimensional variables such as matrices and data frames, the View() function (notice the capital “V”) displays the variable as a (read-only) spreadsheet.
View(my_matrix)
k * Matrix multiplies every element of Matrix by k. In a similar fashion (and contrary to the conventions of matrix algebra), the command Matrix1 * Matrix2 (assuming that the matrices are of the same size) returns a new matrix in which each element is the product of the corresponding elements in Matrix1 and Matrix2.
NOTE: As you may have noticed, multiplying two matrices with * does not return the matrix product as we have learnt it in algebra. To get the algebraic product of two matrices, one should use %*% in R.
Below, we create a matrix consisting of the squares of 1 to 9:
M1 = matrix(1:9, byrow = TRUE, ncol = 3)
Square_Matrix = M1*M1
Square_Matrix
## [,1] [,2] [,3]
## [1,] 1 4 9
## [2,] 16 25 36
## [3,] 49 64 81
Another application of element-wise multiplication arises, for example, when matrix A represents the quantity of product (i,j) sold in a store and matrix B, of exactly the same size and structure, represents the price of product (i,j). A sales matrix for all products can then be created using the command A*B.
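A minimal sketch of this idea follows, with made-up numbers. Here rows are assumed to stand for two stores and columns for three products, which is just one possible layout.

```r
# Hypothetical data: 2 stores (rows) by 3 products (columns)
quantity <- matrix(c(10, 5, 0,
                      3, 8, 2), byrow = TRUE, nrow = 2)
price    <- matrix(c(2.5, 4, 10,
                     2.5, 4, 10), byrow = TRUE, nrow = 2)

sales <- quantity * price   # element-wise product: quantity times price per cell
sales

store_revenue <- rowSums(sales)   # total revenue of each store
store_revenue
```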
Finally, in the following code, we do the mathematical (algebraic) multiplication of two matrices:
My_test = matrix(1:4, byrow=TRUE, nrow=2)
My_test
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
Math_way_multiplication_matrix = My_test%*% My_test
Math_way_multiplication_matrix
## [,1] [,2]
## [1,] 7 10
## [2,] 15 22
To transpose a matrix, we use the function t().
t(My_test) # do not confuse this with the inverse matrix; this is just the transpose.
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
To calculate the inverse of a matrix A, that is A^(-1), we use the function solve():
solve(My_test)
## [,1] [,2]
## [1,] -2.0 1.0
## [2,] 1.5 -0.5
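Recall that A %*% A^(-1) is the identity matrix I. The tiny off-diagonal value in the output below (about 1e-16) is floating-point rounding error and is effectively zero: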
My_test%*%solve(My_test)
## [,1] [,2]
## [1,] 1 1.110223e-16
## [2,] 0 1.000000e+00
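Because of such rounding error, an exact comparison with the identity matrix is not reliable. The sketch below shows one common way to check the result within a small numerical tolerance, using all.equal() and diag().

```r
My_test <- matrix(1:4, byrow = TRUE, nrow = 2)
product <- My_test %*% solve(My_test)

# An exact comparison can fail because of floating-point rounding ...
identical(product, diag(2))

# ... whereas all.equal() compares within a small numerical tolerance
is_identity <- isTRUE(all.equal(product, diag(2)))
is_identity

round(product, 10)   # rounding hides the tiny error when printing
```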
The gender_vector below contains the sex of five different individuals. It is evident that there are two categories, or in R terms ‘factor levels’, in this vector: Male and Female. The function factor() encodes the input vector and turns it into a factor. In other words, the output is still a vector, but its elements now carry meaning: they are levels of a factor.
gender_vector = c("Male","Female","Female","Male","Male") #a typical vector
gender_factor = factor(gender_vector) #turning the input vector into a factor
gender_factor
## [1] Male Female Female Male Male
## Levels: Female Male
Note: Every time we print a factor, not only is the entire original vector of observations returned, but the levels of the factor are also listed below it.
By default, the function factor() transforms a vector into an unordered (nominal) factor, as in the above example. In other words, it does not rank the levels in any meaningful, comparable way. However, factors may be ordinal, with levels that can be compared. To create an ordinal factor, we apply the same function factor() and use the ordered and levels arguments. By setting the argument ordered to TRUE in the function factor(), we indicate that the factor is ordered. With the argument levels, we give the values of the factor in the correct order, from lowest to highest.
temperature_vector = c("High", "Low", "High","Low", "Medium")
temperature_factor = factor(temperature_vector, ordered = TRUE, levels = c("Low", "Medium", "High"))
temperature_factor
## [1] High Low High Low Medium
## Levels: Low < Medium < High
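Note: Another way factor variables are created is when we create a data frame. In versions of R before 4.0.0, all character columns of a data frame are automatically turned into factors; from R 4.0.0 onward, this happens only if we set the argument stringsAsFactors = TRUE. We will see this in an example in the data frames section.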
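As a brief preview, the sketch below builds a tiny data frame with made-up values and asks for the conversion explicitly through the stringsAsFactors argument of data.frame(), so that the result does not depend on the R version in use.

```r
# Hypothetical two-column data frame, used only for illustration
people_default <- data.frame(name = c("Ann", "Bob"), age = c(31, 45))
class(people_default$name)   # depends on the R version and its default setting

# Requesting the conversion explicitly turns the character column into a factor
people_factor <- data.frame(name = c("Ann", "Bob"), age = c(31, 45),
                            stringsAsFactors = TRUE)
class(people_factor$name)
class(people_factor$age)     # numeric columns are left untouched
```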
Sometimes we want to change the names of the levels for clarity or other reasons. R allows us to do this with the function levels(). A good illustration is the raw data from a survey. A standard question in every questionnaire is the gender of the respondent. You remember from the previous example that this is a factor, and when a questionnaire is administered on the street, its levels are often coded as “M” and “F”. Imagine we have a vector indicating the gender of 5 respondents. When you start your data analysis, your main concern is to keep a clear overview of all the variables and what they mean. At that point, you will often want to change the factor levels to “Male” and “Female” instead of “M” and “F” to make your life easier.
gender_vector = c("M", "F", "F", "M", "M")
gender_factor = factor(gender_vector) # creating a factor with the levels "F" and "M".
levels(gender_factor) = c("Female", "Male") # new names follow the alphabetical order of the old levels.
gender_factor
## [1] Male Female Female Male Male
## Levels: Female Male
Note: The levels() function assigns the new names to the old levels in the alphabetical order of the old levels. In other words, the first new name is assigned to “F” and the second one to “M”, because “F” precedes “M” in alphabetical order.
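Because the renaming above depends on alphabetical order, it is easy to swap labels by accident. The following is a minimal sketch of one alternative: stating the old codes and the new labels side by side when the factor is created, using the levels and labels arguments of factor(). The names gender_codes and gender_labelled are made up for this illustration.

```r
# Hypothetical survey codes, as in the example above.
gender_codes <- c("M", "F", "F", "M", "M")

# levels lists the codes found in the data; labels gives the new name
# for each code, in the same position. No alphabetical ordering is involved.
gender_labelled <- factor(gender_codes,
                          levels = c("M", "F"),
                          labels = c("Male", "Female"))

gender_labelled          # the relabelled observations
levels(gender_labelled)  # "Male" "Female"
```

With this form, each label sits next to the code it replaces, so the mapping can be checked by reading the call.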
The function summary()
returns a summary of an object it receives as input. If this input is a factor, it returns useful information regarding its levels, number of observations under each level, etc.
summary(gender_factor)
## Female Male
## 2 3
Note: If the input of the summary() function is a numeric vector, it returns a basic statistical summary of that vector.
numerical_vector = c(1,2,3,4,5)
summary(numerical_vector)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 2 3 3 4 5
Imagine you have 5 data analysts working for you and you have classified them based on their speed at work.
speed_vector = c("fast", "slow", "slow", "fast", "insane")
speed_factor = factor(speed_vector, ordered = TRUE, levels = c("slow", "fast", "insane")) #we turn the above vector into an ordered factor vector.
Remember that a factor is by nature still a vector, except that its elements are classified into meaningful levels, which are either ordered or not. Now imagine that one day the second data analyst complains about the fifth data analyst and claims that he is slow. You want to test whether this claim is true. We can do this with a simple logical comparison that checks whether element 2 is greater than element 5, just as we would for any ordinary vector. This comparison is possible because speed_factor is an ordered factor.
speed_factor[2] > speed_factor[5]
## [1] FALSE
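To see why the ordered argument matters for this comparison, the sketch below builds the same data once as an unordered factor and once as an ordered factor. The names speed, unordered_speed and ordered_speed are chosen only for this illustration.

```r
speed <- c("fast", "slow", "slow", "fast", "insane")

unordered_speed <- factor(speed)
ordered_speed <- factor(speed, ordered = TRUE,
                        levels = c("slow", "fast", "insane"))

# For an unordered factor, > is not meaningful: R issues a warning and
# returns NA. suppressWarnings() is used here only to keep the output clean.
unordered_result <- suppressWarnings(unordered_speed[2] > unordered_speed[5])
unordered_result   # NA

# For an ordered factor, the same comparison is well defined.
ordered_result <- ordered_speed[2] > ordered_speed[5]
ordered_result     # FALSE: "slow" is not greater than "insane"

# is.ordered() reports which kind of factor we have.
is.ordered(unordered_speed)   # FALSE
is.ordered(ordered_speed)     # TRUE
```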
Unlike vectors and matrices, data frames can hold a variety of data types at the same time. Similar to matrices, data frames are two-dimensional structures that consist of rows and columns.
Each column represents a variable and each row is an observation/case/participant. Variables can be of different kinds: numeric, logical, character.
Imagine we have the following five vectors (of the same length) that we would like to combine into a data frame. In other words, we want each of the following vectors to become one column in our final data frame. The function data.frame() does the job for us.
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)
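planets_dataf <- data.frame(name, type, diameter, rotation, rings, stringsAsFactors = TRUE) # from R 4.0.0 onward, stringsAsFactors = TRUE is needed to store character columns as factors.
planets_dataf # notice that the names of the vectors become the labels of the columns in the data frame.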
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
To inspect the structure of a data frame, we can use the function str().
str(planets_dataf)
## 'data.frame': 8 obs. of 5 variables:
## $ name : Factor w/ 8 levels "Earth","Jupiter",..: 4 8 1 3 2 6 7 5
## $ type : Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
## $ diameter: num 0.382 0.949 1 0.532 11.209 ...
## $ rotation: num 58.64 -243.02 1 1.03 0.41 ...
## $ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
Let’s check the type of the name and type columns. As mentioned previously, a character vector placed in a data frame is turned into a factor (by default in R versions before 4.0.0, and otherwise when stringsAsFactors = TRUE is set).
class(planets_dataf$name)
## [1] "factor"
class(planets_dataf$type)
## [1] "factor"
We can now check the levels and the number of levels of each of these factors:
levels(planets_dataf$name)
## [1] "Earth" "Jupiter" "Mars" "Mercury" "Neptune" "Saturn" "Uranus"
## [8] "Venus"
nlevels(planets_dataf$name)
## [1] 8
levels(planets_dataf$type)
## [1] "Gas giant" "Terrestrial planet"
nlevels(planets_dataf$type)
## [1] 2
The function head() returns the first few observations of a data frame. Similarly, the function tail() returns the last few observations of a data frame. Both head() and tail() also return a top line, called the header, which contains the names of the different variables in the data frame.
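As a brief illustration of head() and tail(), the sketch below uses mtcars, a data frame that ships with R, so that it runs on its own. The argument n sets how many rows are returned; when it is omitted, six rows are returned.

```r
# mtcars is a built-in data frame with 32 rows.
head(mtcars, n = 3)   # the first three rows, preceded by the header line
tail(mtcars, n = 2)   # the last two rows, preceded by the header line

# Without n, head() returns six rows.
first_rows <- head(mtcars)
nrow(first_rows)      # 6
```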
Similar to vectors and matrices, we use square brackets to select one or more elements. For example, the code below returns the element in row 1, column 3:
planets_dataf[1,3]
## [1] 0.382
The same job can be done using row and column names. For example, the code below prints the element in the first row under the column called “name”.
planets_dataf[1,"name"]
## [1] Mercury
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
Moreover, the following code returns several elements simultaneously:
planets_dataf[1:3, 2:5] # returning rows 1 to 3, at columns 2 to 5.
## type diameter rotation rings
## 1 Terrestrial planet 0.382 58.64 FALSE
## 2 Terrestrial planet 0.949 -243.02 FALSE
## 3 Terrestrial planet 1.000 1.00 FALSE
planets_dataf[1:3, "rings"] # rows 1 to 3, under the column "rings"
## [1] FALSE FALSE FALSE
planets_dataf[,"name"] # All the rows, but from the name column only.
## [1] Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
planets_dataf[,1] # complete first column (from all rows).
## [1] Mercury Venus Earth Mars Jupiter Saturn Uranus Neptune
## Levels: Earth Jupiter Mars Mercury Neptune Saturn Uranus Venus
planets_dataf[2,] # complete second row (from all columns)
## name type diameter rotation rings
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
planets_dataf[c(1,2),3] # returns elements (1,3) and (2,3) respectively.
## [1] 0.382 0.949
planets_dataf[c(1,2),] # returns the first and second rows.
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
To single out a particular column, we can also use the symbol $. This shortcut can only be used when the columns have names, in the form planets_dataf$name. The $ sign can only be used to refer to columns (variables), not rows.
planets_dataf$diameter[planets_dataf$diameter>1] # focuses on the diameter column only and returns the values from that column.
## [1] 11.209 9.449 4.007 3.883
planets_dataf[planets_dataf$diameter>1, ] # returns all cases where diameter is greater than 1, with all information about those cases; compare it with the previous code.
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
planets_dataf[planets_dataf$diameter>1 | planets_dataf$rings==FALSE, ]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Imagine that we are interested only in the planets that have rings. We want them to be selected from the rest of the data set. This is called subsetting.
The function subset() does the job for us. It has the following structure: subset(my_dataframe, subset = some_condition). For our example, here is the code:
subset(planets_dataf, subset = rings) # because rings is a logical column, only the rows where it is TRUE are returned, so simply writing "subset = rings" does the job.
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
Suppose we are interested in planets that are bigger than Earth, that is, those whose diameter in our data frame is greater than one (under the diameter column). In that case, the following code does the job:
subset(planets_dataf, subset= diameter>1)
## name type diameter rotation rings
## 5 Jupiter Gas giant 11.209 0.41 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 8 Neptune Gas giant 3.883 0.67 TRUE
planets_dataf$diameter[planets_dataf$diameter>1] # compare the output with that of the code above.
## [1] 11.209 9.449 4.007 3.883
Notice that the condition given to the subset = argument refers only to the column of interest. The data frame is already specified in the first argument of subset(), so there is no need to specify it again.
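subset() can also restrict the columns that are returned, through its select argument, and its condition can combine several columns. The sketch below illustrates both on a small, made-up data frame called planets_small, so that it runs independently of the objects defined earlier.

```r
# A reduced, illustrative version of the planets data.
planets_small <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Neptune"),
  diameter = c(0.382, 1, 11.209, 3.883),
  rings    = c(FALSE, FALSE, TRUE, TRUE)
)

# Rows with diameter above 1, keeping only the name and diameter columns.
big_planets <- subset(planets_small, subset = diameter > 1,
                      select = c(name, diameter))
big_planets

# Conditions can be combined with & (and) or | (or); ! negates a logical column.
small_no_rings <- subset(planets_small, subset = diameter <= 1 & !rings)
small_no_rings
```

As with the subset argument, column names inside select are written without the data frame name and without quotation marks.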
Suppose we would like to sort the elements of a numeric vector in ascending order, that is, from minimum to maximum. The function order() returns the positions of the elements of its input vector, listed in the order in which those elements would appear if the vector were sorted from minimum to maximum. Notice that this function returns only the positions, not the sorted vector itself. That is enough, however, for us to rebuild the sorted vector.
test = c(200, -1, 14, -750) #initial unordered vector.
ordered_positions = order(test)
ordered_positions # positions of the elements of test, listed in ascending order of their values.
## [1] 4 2 3 1
test[ordered_positions] #returns the vector test in an ordered manner.
## [1] -750 -1 14 200
In the same fashion as in the previous example, we can sort a data frame. Let’s sort all cases/rows in our data frame planets_dataf by the column/variable diameter, in ascending order.
diameters_ord = order(planets_dataf$diameter) # first, find the ordered positions of the diameter variable.
Once we have the ordered positions of the diameter variable, we have the exact order in which we want to rearrange and present the cases/rows of the data frame. So, we can call the rows of the data frame in that order, followed by all the columns.
planets_dataf[diameters_ord,]
## name type diameter rotation rings
## 1 Mercury Terrestrial planet 0.382 58.64 FALSE
## 4 Mars Terrestrial planet 0.532 1.03 FALSE
## 2 Venus Terrestrial planet 0.949 -243.02 FALSE
## 3 Earth Terrestrial planet 1.000 1.00 FALSE
## 8 Neptune Gas giant 3.883 0.67 TRUE
## 7 Uranus Gas giant 4.007 -0.72 TRUE
## 6 Saturn Gas giant 9.449 0.43 TRUE
## 5 Jupiter Gas giant 11.209 0.41 TRUE
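The same approach works for descending order by passing decreasing = TRUE to order(). The sketch below also shows sort(), which returns the sorted vector directly. The data frame planets_small is a made-up, reduced version of the planets data used only for this illustration.

```r
test <- c(200, -1, 14, -750)

sort(test)                             # -750 -1 14 200, same as test[order(test)]
test[order(test, decreasing = TRUE)]   # 200 14 -1 -750

# Sorting the rows of a data frame from the largest to the smallest diameter.
planets_small <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Neptune"),
  diameter = c(0.382, 1, 11.209, 3.883)
)
largest_first <- planets_small[order(planets_small$diameter, decreasing = TRUE), ]
largest_first
```

Note that sort() works on vectors only; for data frames, indexing the rows with the result of order() is the usual route.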
Where two data frames share columns, they can be joined together using the merge() function. merge() provides a variety of options for doing database-style joins. To join two data frames, you need to specify which columns contain the key values to match up. By default, merge() uses all the columns common to the two data frames, but more commonly you will just want to use a single shared ID column. Note that merge() combines two data frames by their columns.
v1 = c(2,3,5)
v2=c("M", "S", "I")
firstdf =data.frame(v1, v2)
firstdf
v3 = c(6,7,8)
v4=c("T","L","W")
seconddf=data.frame(v3,v4)
seconddf
thirddf = data.frame(v1,v4)
thirddf
try(merge(firstdf, seconddf, by="v1")) # raises an error because there is no v1 in seconddf; try() lets the script continue.
merge(firstdf, thirddf, by="v1") # this works, because both data frames contain v1.
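Two further options of merge() are often useful: by.x and by.y, for key columns that have different names in the two data frames, and all = TRUE, which keeps rows that have no match. The sketch below uses two made-up data frames, customers and orders, purely for illustration.

```r
# Illustrative data: the key is called id in one data frame
# and customer_id in the other.
customers <- data.frame(id = c(1, 2, 3),
                        city = c("Oslo", "Lima", "Cairo"))
orders <- data.frame(customer_id = c(2, 3, 4),
                     amount = c(10, 25, 40))

# Default behaviour: only keys present in both data frames are kept.
inner <- merge(customers, orders, by.x = "id", by.y = "customer_id")
inner

# all = TRUE keeps unmatched rows from both sides and fills the gaps with NA.
outer <- merge(customers, orders, by.x = "id", by.y = "customer_id", all = TRUE)
outer
```

Here inner contains the two customers who placed an order, while outer contains all four key values, with NA wherever one side has no information.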
A Recap:
Vectors (one-dimensional arrays) can hold numeric, character or logical values. The elements in a vector all have the same data type.
Matrices (two-dimensional arrays) can hold numeric, character or logical values. As in vectors, the elements of a matrix all have the same data type.
Data frames (two-dimensional objects) can hold numeric, character or logical values. Within a column/variable all elements have the same data type; however, different columns can have different data types.
A list in R is similar to our notebook or to-do list at work or school. Different items in our notebook most likely differ in length, characteristic, type, etc. A list in R allows us to gather various objects under one umbrella, that is, the name of the list. These objects can be matrices, vectors, data frames, even other lists, etc. They are not even required to be related or similar to each other in any way. You could say that a list is some kind of super data type; you can store practically any piece of information in it. Lists do not have dimensions.
Note: Because lists can contain other lists, they are considered recursive variables. Vectors, matrices, and arrays, by contrast, are atomic. (Variables can be either recursive or atomic, never both.) The functions is.recursive() and is.atomic() let us test which kind a variable is.
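The distinction in the note above can be checked directly. The sketch below applies is.atomic() and is.recursive() to a vector, a matrix and a list; the object names are chosen only for this illustration.

```r
a_vector <- c(1, 2, 3)
a_matrix <- matrix(1:4, nrow = 2)
a_list   <- list(a_vector, list("nested", TRUE))   # a list holding another list

is.atomic(a_vector)      # TRUE
is.atomic(a_matrix)      # TRUE
is.recursive(a_list)     # TRUE

is.atomic(a_list)        # FALSE
is.recursive(a_vector)   # FALSE
```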
We can create a list using the list() function. Below, we create and return a list containing a vector, a matrix and a data frame.
my_vector = c(1:10)
my_matrix = matrix(1:9, ncol = 3, byrow=TRUE)
my_dataframe = mtcars[1:10,] # first 10 rows of the R built-in data frame mtcars
my_list =list(my_vector, my_matrix, my_dataframe) #creating a list
my_list #calling a list
## [[1]]
## [1] 1 2 3 4 5 6 7 8 9 10
##
## [[2]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
##
## [[3]]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Just like with your to-do list, you want to avoid forgetting what the components of your list stand for. That is why you should give them names. Here we introduce two methods to do so, just as with vectors.
Method 1: supply the names when calling list(), in the form list(name1 = component1, name2 = component2). This creates a list and simultaneously names its components with the labels name1, name2, and so on.
Method 2: create the list first and then name its components with the names() function, as you did with vectors. This gives the same result as Method 1.
Note: Once we have named the objects in our list, we can use their names to summon them.
Now, we use both Method 1 and Method 2 to name the objects within the list we created above (i.e., my_list).
# Method 1.
my_list = list("Object1"= my_vector, "Object2"=my_matrix, "Object3"=my_dataframe)
my_list
## $Object1
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Object2
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
##
## $Object3
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
# Method 2. Here we change the names, so we can see the difference between codes below and above.
names(my_list)= c("Object_One", "Object_Two", "Object_Three" )
my_list
## $Object_One
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Object_Two
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
##
## $Object_Three
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
To select objects from a list, use double brackets [[ ]]. Inside the double brackets we can use either the name of an object in quotation marks or a number referring to the position of that object in the list. Another approach is to use $ followed by the name of the object.
Suppose we would like to return my_matrix from my_list. Here are three ways to do it:
my_list[[2]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
my_list$Object_Two
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
my_list[["Object_Two"]]
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
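It may help to contrast double brackets with single brackets, which also work on lists but return something different: a shorter list rather than the object itself. The sketch below uses a small, made-up list called small_list.

```r
small_list <- list(numbers = 1:3, letters_abc = c("a", "b", "c"))

single <- small_list[1]     # single brackets: a list of length 1
double <- small_list[[1]]   # double brackets: the integer vector itself

class(single)   # "list"
class(double)   # "integer"

# Functions that expect a vector need the object, not a one-element list.
sum(small_list[[1]])   # 6
```

Calling sum(small_list[1]) instead would raise an error, because its argument is a list rather than a numeric vector.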
Suppose we want to see the first row of the data frame, which is the 3rd object in the list. Here is how it works:
my_list[[3]][1,]
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21 6 160 110 3.9 2.62 16.46 0 1 4 4
Here is how to call the first element of the vector, which is the first object in the list:
my_list[[1]][1]
## [1] 1
# Or doing the same, using the name of objects:
my_list[["Object_One"]][1]
## [1] 1
We can use the $ sign to call objects by their names and then call elements within them with the usual []. For example, the code below calls the 2nd object (i.e., the matrix named “Object_Two”) and returns its first row.
my_list$Object_Two[1,]
## [1] 1 2 3
Interestingly, one can use the $ sign several times. The code below calls the data frame object from the list, then returns its column called mpg.
my_list$Object_Three$mpg # REALLY COOL!
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2
A list is a box that contains objects. One might therefore try to add a new object to a list using the ordinary c() command.
vector_NEW = c(2,6,8, 9)
my_list = c(my_list, vector_NEW)
my_list
## $Object_One
## [1] 1 2 3 4 5 6 7 8 9 10
##
## $Object_Two
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
##
## $Object_Three
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 6
##
## [[6]]
## [1] 8
##
## [[7]]
## [1] 9
One issue with the above code is that each element of vector_NEW has been added to the list as a separate object. The list previously had 3 objects in it; after the above code, it has 7. In other words, vector_NEW has not been added as a whole, as one vector object, but as four different objects. In such cases, one option is to define the original list again, including the new objects we want it to contain.
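Redefining the list is not the only way around this. The sketch below shows two other common options for adding a vector as a single object: wrapping it in list() before calling c(), or assigning it to a new name with double brackets. The objects my_objects and vector_new, and the name third, are made up for this illustration.

```r
my_objects <- list(first = 1:3, second = "text")
vector_new <- c(2, 6, 8, 9)

# Option 1: wrap the vector in list() so that c() treats it as one object.
appended <- c(my_objects, list(third = vector_new))
length(appended)      # 3, not 6

# Option 2: assign the vector to a new name inside the existing list.
my_objects[["third"]] <- vector_new
length(my_objects)    # 3

appended$third        # the whole vector, stored as one object
```

Option 1 leaves the original list untouched and returns a new one, whereas Option 2 modifies the list in place.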
Note: We can use the str() or summary() functions to get an overview of the structure of a list.
Note: NULL is a special value that represents an empty variable. Its most common use is in lists, but it also crops up with data frames and function arguments. When you create a list, you may wish to specify that an element should exist, but should have no contents. For example, the following list contains UK bank holidays for 2013 by month. Some months have no bank holidays, so we use NULL to represent this absence.
uk_bank_holidays_2013 = list( Jan = "New Year's Day",
Feb = NULL, Mar = "Good Friday",
Apr = "Easter Monday",
May = c("Early May Bank Holiday", "Spring Bank Holiday"),
Jun = NULL,
Jul = NULL,
Aug = "Summer Bank Holiday",
Sep = NULL,
Oct = NULL,
Nov = NULL,
Dec = c("Christmas Day", "Boxing Day") )
uk_bank_holidays_2013
## $Jan
## [1] "New Year's Day"
##
## $Feb
## NULL
##
## $Mar
## [1] "Good Friday"
##
## $Apr
## [1] "Easter Monday"
##
## $May
## [1] "Early May Bank Holiday" "Spring Bank Holiday"
##
## $Jun
## NULL
##
## $Jul
## NULL
##
## $Aug
## [1] "Summer Bank Holiday"
##
## $Sep
## NULL
##
## $Oct
## NULL
##
## $Nov
## NULL
##
## $Dec
## [1] "Christmas Day" "Boxing Day"
It is important to understand the difference between NULL and the special missing value NA. The biggest difference is that NA is a scalar value, whereas NULL takes up no space at all; it has length zero:
length(NULL)
## [1] 0
length(NA)
## [1] 1
You can test for NULL using the function is.null(). Missing values are not null, and null values are not missing either. NULL should be interpreted as nonexistent, not as missing.
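The difference between the two tests can be seen in a short sketch. It uses a shortened, illustrative version of the bank holiday list, here called holidays, and also shows one consequence worth knowing: assigning NULL to an existing list element removes that element.

```r
is.null(NULL)         # TRUE
is.null(NA)           # FALSE
is.na(NA)             # TRUE
length(is.na(NULL))   # 0: NULL has no elements to test

# A shortened version of the bank holiday list.
holidays <- list(Jan = "New Year's Day", Feb = NULL, Mar = "Good Friday")
length(holidays)        # 3: the NULL element still counts as an element
is.null(holidays$Feb)   # TRUE

# Names of the months whose entry is NULL.
empty_months <- names(holidays)[sapply(holidays, is.null)]
empty_months            # "Feb"

# Assigning NULL to an element deletes it from the list.
holidays$Mar <- NULL
names(holidays)         # "Jan" "Feb"
```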