Julius Schmid

In this class, you will learn how to:

Atomic Vectors

We create an atomic vector die that stores 6 elements, each corresponding to one side of the die.

die <- c(1, 2, 3, 4, 5, 6)
die
## [1] 1 2 3 4 5 6

Is it a vector?

is.vector(die)
## [1] TRUE

Yes, it is. All elements of die have the same data type, numeric, so die is an atomic vector.

Now, we create an atomic vector that stores one element, the number 2. Note that we don’t need the concatenate function in this case.

two <- 2
two
## [1] 2

Again, we wonder if the created object two is interpreted as a vector:

is.vector(two)
## [1] TRUE

Again, this is the case! Supplying only one element does not stop two from being a vector. In fact, two is a vector with exactly one element.

The length function gets or sets the length of vectors (including lists) and factors, and of any other R object for which a method has been defined. In simple terms, length returns the number of elements in an atomic vector.

length(two)
## [1] 1
length(die)
## [1] 6

Since two has only one element, the length of two is 1. Since die represents all possible outcomes when rolling a die (1-6), die has a length of 6.

Each atomic vector stores its values as a one-dimensional vector, and each atomic vector can only store one type of data. R recognizes six basic types of atomic vectors: doubles, integers, characters, logicals, complex, and raw.

Let us create example vectors to clarify the difference between them:

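int <- 14L #the L suffix makes this an integer; a bare 14 would be stored as a double
text <- "ace"
do_uble <- 21.25 #64 bits to store
logic <- TRUE

In the code above, we created atomic vectors for four of these basic types: an integer (int), a character (text), a double (do_uble), and a logical (logic).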
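To double-check which type R assigns to a value, we can call typeof on it. The following is only a small sketch: the list examples and its element names are illustrative choices of ours, not part of the class material. Note in particular that a number typed without the L suffix is stored as a double.

```r
# One example value per basic atomic type (names chosen for illustration).
examples <- list(
  integer   = 14L,
  double    = 21.25,
  character = "ace",
  logical   = TRUE,
  complex   = 1 + 1i,
  raw       = raw(1)
)

# typeof() reports the storage type of each value.
types <- vapply(examples, typeof, character(1))
types

# A bare 14 (no L suffix) is a double, not an integer.
bare_type <- typeof(14)
bare_type
```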

Floating-point errors arise because each double is accurate to only about 16 significant digits. This introduces a small rounding error. In most cases, this rounding error will go unnoticed. However, in some situations, it can cause surprising results. For example, you may expect the result of the expression below to be zero, but it is not:

sqrt(2)^2 - 2
## [1] 4.440892e-16

Mathematically, sqrt(2)^2 == 2, so the exact result is 0. But sqrt(2) is irrational, so its decimal expansion never ends, and a double can only store a finite number of digits. R therefore works with sqrt(2) rounded to about 16 significant digits, and squaring that rounded value gives a number that differs slightly from 2.
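Because of this rounding, comparing doubles with == can be misleading. One common way to compare with a small tolerance is all.equal. The sketch below assumes that the default tolerance of all.equal is acceptable for the task; the variable names are ours.

```r
x <- sqrt(2)^2

# Exact comparison is affected by the rounding error.
exact_match <- x == 2
exact_match

# all.equal() compares up to a small tolerance; isTRUE() turns its
# result into a plain TRUE/FALSE.
close_match <- isTRUE(all.equal(x, 2))
close_match
```

Here exact_match is FALSE while close_match is TRUE, which is usually the behaviour we want when testing computed doubles for equality.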

Let us consider the other types complex and raw.

comp <- c(1 + 1i, 1 + 2i, 1 + 3i)
comp
## [1] 1+1i 1+2i 1+3i
r_raw <- raw(3)
r_raw
## [1] 00 00 00

comp is a vector of length 3 with complex components. raw(n), on the other hand, creates a raw vector of n bytes, each initialized to 00. In this case, we call raw(3), and hence get the output 00 00 00.
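A short check can make the distinction between length and dimension concrete, and show that raw vectors hold bytes. This is only a sketch; as.raw is used here as one way to create non-zero bytes and is not part of the code above.

```r
comp <- c(1 + 1i, 1 + 2i, 1 + 3i)
length(comp)        # number of elements
dim(comp)           # NULL: a plain vector has no dim attribute

r_raw <- raw(3)
length(r_raw)       # three bytes
as.integer(r_raw)   # each byte is zero

# as.raw() converts numbers between 0 and 255 to bytes, shown in hexadecimal.
one_byte <- as.raw(255)
one_byte
```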

#Attributes

The most common attributes to give an atomic vector are names, dimensions (dim), and classes. Notice that die has no names right after we create it.

names(die)
## NULL

The output in this case is just NULL. Let us change that by assigning names to the elements.

names(die) <- c("one", "two", "three", "four", "five", "six")
names(die)
## [1] "one"   "two"   "three" "four"  "five"  "six"

We see that each numeric element in the die vector now has an assigned name. In our case, we just named each number with its value spelled out in words.

Let’s check the result with the attributes function.

attributes(die)
## $names
## [1] "one"   "two"   "three" "four"  "five"  "six"

We observe that names is interpreted as an attribute of the die vector.

Names do not affect the values. We could also name each number with its German word, and the values themselves would not change:

names(die) <- c("eins", "zwei", "drei", "vier", "fünf", "sechs")
die
##  eins  zwei  drei  vier  fünf sechs 
##     1     2     3     4     5     6
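To convince ourselves that names are only labels, we can do arithmetic on a named vector. The sketch below recreates die so that it runs on its own; looking up an element by its name with [[ ]] is an extra that is not used elsewhere in these notes.

```r
die <- c(1, 2, 3, 4, 5, 6)
names(die) <- c("one", "two", "three", "four", "five", "six")

# Arithmetic uses the numeric values; the names are simply carried along.
die_plus_one <- die + 1
die_plus_one

# The sum is the same as for the unnamed vector.
total <- sum(die)
total

# A name can be used to look up a single element.
third <- die[["three"]]
third
```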

We can also remove names. We do this by assigning NULL to the names attribute:

names(die) <- NULL

#Creating n dimensional Structures

A vector is a one-dimensional array. A matrix is a two-dimensional array, so in the two-dimensional case an array is the same thing as a matrix. Next, we set the dim attribute of an atomic vector to turn it into either a matrix or an array with three or more dimensions.

For example, you can reorganize die into a 2 × 3 matrix. We can do that since 2 * 3 = 6, which is the number of elements in die. The product of the dimension values must equal the number of elements: if it does not, the dim assignment fails with an error rather than recycling entries. Recycling to fill a larger matrix is something the matrix and array helper functions do, not the dim attribute.

dim(die) <- c(2, 3)
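It is worth checking what happens when the product of the dimensions does not match the number of elements. As far as we know, assigning such a dim stops with an error, whereas matrix() recycles the values. The sketch below wraps the mismatched assignment in tryCatch so that it runs without stopping; the names result and recycled are ours.

```r
die <- c(1, 2, 3, 4, 5, 6)

# 2 * 4 = 8 does not match the 6 elements, so the assignment fails.
result <- tryCatch(
  {
    dim(die) <- c(2, 4)
    "assigned"
  },
  error = function(e) "error"
)
result

# matrix() recycles the 6 values to fill 2 * 6 = 12 cells.
recycled <- matrix(die, nrow = 2, ncol = 6)
recycled
```

After the failed assignment, die is still a plain vector without a dim attribute.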

R will always use the first value in dim for the number of rows and the second value for the number of columns. In general, rows always come first in R operations that deal with both rows and columns.

So if we want to create a matrix with 3 rows and 2 columns, we just switch the input arguments:

dim(die) <- c(3, 2)

Notice how by default R fills up each matrix by columns.
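The column-wise filling can be verified by looking at individual columns and entries. This sketch recreates die so that it can be run independently of the earlier code.

```r
die <- c(1, 2, 3, 4, 5, 6)
dim(die) <- c(3, 2)
die

# The first three values form the first column, the last three the second.
first_col  <- die[, 1]
second_col <- die[, 2]

# Row 2, column 2 therefore holds the fifth value.
entry <- die[2, 2]
entry
```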

#hypercube
dim(die) <- c(1, 2, 3)
class(die)
## [1] "array"

In the code above, we created a three-dimensional array with dimensions 1x2x3, that is, 3 slices that are each a 1x2 matrix. The class of this die is array.

If you’d like more control over how the data is stored, you can use one of R’s helper functions, matrix or array. They do the same thing as changing the dim attribute, but they provide extra arguments to customize the process.

#Matrix Function

m <- matrix(die, nrow = 2)

The matrix() function takes an atomic vector as input and creates a two-dimensional matrix. Since there are several possible shapes for six elements (in this case 1x6, 2x3, 3x2, and 6x1), we need to specify how many rows the matrix should have. By setting nrow = 2, we create a 2x3 matrix.

m <- matrix(die, nrow = 2, byrow = TRUE)

By default, the vector entries fill the matrix column by column, i.e. the first element goes to m[1,1], the second to m[2,1], and so on. However, when we set byrow = TRUE (as in the code above), the matrix is filled row by row. In this case, the second element goes to m[1,2].
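The two filling orders can be compared side by side. In this sketch, by_col and by_row are names we introduce for illustration; the last line uses the transpose function t() as one way to relate the two layouts.

```r
die <- c(1, 2, 3, 4, 5, 6)

by_col <- matrix(die, nrow = 2)                # default: fill column by column
by_row <- matrix(die, nrow = 2, byrow = TRUE)  # fill row by row

by_col
by_row

# The second element of die ends up in different positions.
pos_col <- by_col[2, 1]
pos_row <- by_row[1, 2]

# Filling a 2x3 matrix by row matches transposing a 3x2 matrix filled by column.
same_layout <- identical(by_row, t(matrix(die, nrow = 3)))
same_layout
```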

#Array Function

The array function creates an n-dimensional array.

ar <- array(c(101:104, 201:204, 301:304), dim = c(2, 2, 3))
ar
## , , 1
## 
##      [,1] [,2]
## [1,]  101  103
## [2,]  102  104
## 
## , , 2
## 
##      [,1] [,2]
## [1,]  201  203
## [2,]  202  204
## 
## , , 3
## 
##      [,1] [,2]
## [1,]  301  303
## [2,]  302  304

Here, we create 3 2x2 submatrices, where each submatrix corresponds to one argument of the c() function. The first submatrix contains the values 101-104, the second 201-204, and the third 301-304.
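Individual entries and whole submatrices of such an array can be selected with one index per dimension (row, column, slice). The following sketch rebuilds ar so that it is self-contained; the variable names for the results are ours.

```r
ar <- array(c(101:104, 201:204, 301:304), dim = c(2, 2, 3))

# Row 2, column 1 of the third submatrix.
single <- ar[2, 1, 3]
single

# Leaving the row and column indices empty returns a whole submatrix.
second_slice <- ar[, , 2]
second_slice
dim(second_slice)
```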

Notice that changing the dimensions of your object will not change the type of the object, but it will change the object’s class:

dim(die) <- c(2, 3)
typeof(die)
## [1] "double"
class(die)
## [1] "matrix" "array"

When die is represented as a 2x3 matrix, the class() function returns “matrix” as well as “array” (in R 4.0.0 and later; earlier versions return only “matrix”).

Note that an object’s class attribute will not always appear when you run attributes; you may need to specifically search for it with class:

attributes(die)
## $dim
## [1] 2 3
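One way to see that the class is not stored as an attribute here is to ask for it directly with attr. This is just a sketch; is.matrix is used for the final check so that it does not depend on the R version.

```r
die <- c(1, 2, 3, 4, 5, 6)
dim(die) <- c(2, 3)

# No class attribute is stored on the object ...
stored_class <- attr(die, "class")
is.null(stored_class)

# ... yet class() still reports a class, derived from the dim attribute.
class(die)
is.matrix(die)
```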

You can apply class to objects that do not have a class attribute. class will return a value based on the object’s atomic type. Notice that the “class” of a double is “numeric,” an odd deviation, but one I am thankful for. I think that the most important property of a double vector is that it contains numbers, a property that “numeric” makes obvious:

class("Hello")
## [1] "character"
class(5)
## [1] "numeric"

“Hello” is a string. Hence the class function returns the output character. The number 5 is correctly interpreted as a numeric value.

Now, let us consider a new function, Sys.time(), which returns the current time at the moment you run the code. The time is printed in the format “YYYY-MM-DD HH:MM:SS Timezone”. It is displayed in the time zone of the R session, which was UTC on the machine that produced the output below:

now <- Sys.time()
now
## [1] "2022-11-26 18:35:13 UTC"
typeof(now)
## [1] "double"
class(now)
## [1] "POSIXct" "POSIXt"

POSIXct is a framework for representing dates and times. Time is represented by the number of seconds that have passed between 12:00 AM January 1st 1970 (in the Coordinated Universal Time (UTC) zone) and now. You can see this number by removing the class attribute of now, or by using the unclass function, which does the same thing:

unclass(now)
## [1] 1669487714

R then gives the double vector a class attribute that contains two classes, “POSIXct” and “POSIXt”. This attribute alerts R functions that they are dealing with a POSIXct time, so they can treat it in a special way. For example, R functions will use the POSIXct standard to convert the time into a user-friendly character string before displaying it. You can take advantage of this system by giving the POSIXct class to random R objects. For example, have you ever wondered what day it was a million seconds after 12:00 a.m. Jan. 1, 1970?

To answer this question, we create a variable mil which stores the number 1 million. After that, we assign the POSIXct classes to mil with class(), so that R interprets 1 million as a time. The result is the start date Jan. 1, 1970 12:00 am plus 1 million seconds.

mil <- 1000000
mil
## [1] 1e+06
class(mil) <- c("POSIXct", "POSIXt")
mil
## [1] "1970-01-12 13:46:40 UTC"

We see that a million seconds after Jan. 1 1970, 12:00 am, we arrive at 1:46:40 pm on the 12th of January 1970.
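The same result can also be obtained without modifying the class attribute by hand. The sketch below uses as.POSIXct with an explicit origin and time zone; we pass tz = "UTC" on purpose so that the output does not depend on the local time zone of the session. The name mil_time is ours.

```r
# Interpret 1 million seconds as an offset from the start of 1970 in UTC.
mil_time <- as.POSIXct(1000000, origin = "1970-01-01", tz = "UTC")

# Format the time as text, again in UTC.
mil_text <- format(mil_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")
mil_text

# The underlying number of seconds is unchanged.
as.numeric(mil_time)
```

As a plausibility check: 11 full days are 950,400 seconds, and the remaining 49,600 seconds are 13 hours, 46 minutes, and 40 seconds.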

#Factors

Let us proceed to consider factors. As a first step, we create a factor gender from a character vector with the entries “female”, “male”, “female”, “female”, and “male”. After that, we apply the functions typeof() and attributes() for further analysis.

gender <- factor(c("female", "male", "female", "female", "male"))
typeof(gender)
## [1] "integer"
attributes(gender)
## $levels
## [1] "female" "male"  
## 
## $class
## [1] "factor"

It is surprising that the type of gender is integer! We might expect it to be character instead. The reason is that a factor stores each entry as an integer code that points into its levels. The attributes function, on the other hand, returns the values that we might expect: there are only two levels, female and male, and the class is factor.

Now, let us unclass the factor gender.

unclass(gender)
## [1] 1 2 1 1 2
## attr(,"levels")
## [1] "female" "male"

The output returns the different levels as well. Additionally, in the first row we can see which components of the original input vector correspond to level female (entries 1, 3, and 4, coded as 1) and which correspond to level male (entries 2 and 5, coded as 2).
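The relationship between the integer codes and the levels can be made explicit by rebuilding the original strings from them. This is a small sketch; the names codes, labels, and rebuilt are ours.

```r
gender <- factor(c("female", "male", "female", "female", "male"))

codes  <- as.integer(gender)  # integer code of each entry
labels <- levels(gender)      # the distinct levels, in order

# Using the codes as an index into the levels restores the original entries.
rebuilt <- labels[codes]
rebuilt

identical(rebuilt, as.character(gender))
```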

Let us return the factor gender again:

gender
## [1] female male   female female male  
## Levels: female male

Here, both the original input vector (with duplicates) and the levels (without duplicates) are returned.

We can convert the factor gender into a character vector by applying the as.character() function:

as.character(gender)
## [1] "female" "male"   "female" "female" "male"

The result is a character vector of length 5, and its printed output no longer shows any levels. Note that as.character() returns a new vector; gender itself stays a factor unless we assign the result back to it.

#Coercion

Let us consider logical values. In R, a logical value is either TRUE or FALSE (plus NA for missing values). Let us see what happens when we compute the sum of a logical vector:

sum(c(TRUE, TRUE, FALSE, FALSE))
## [1] 2
#will become:
sum(c(1, 1, 0, 0))
## [1] 2

We observe that when we apply the sum() function to a logical vector, the TRUE values are counted. This is the same as interpreting each TRUE value as 1 and each FALSE value as 0, and then computing the sum of the resulting vector.
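This behaviour is handy for counting how many elements satisfy a condition. The sketch below applies it to the die vector from earlier; the threshold 4 and the variable names are arbitrary choices of ours.

```r
die <- c(1, 2, 3, 4, 5, 6)

# die > 4 is a logical vector; summing it counts the TRUE entries.
n_above <- sum(die > 4)
n_above

# The mean of a logical vector is the share of TRUE entries.
share_above <- mean(die > 4)
share_above
```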

In the following, we perform data type changes.

as.character(1)
## [1] "1"
as.logical(1)
## [1] TRUE
as.numeric(FALSE)
## [1] 0

1 is originally interpreted as a numeric value. When applying as.character(), the number 1 is transformed to the string “1”. When calling as.logical(1), 1 is transformed to the logical value TRUE. This is consistent with the observations that we made when we applied the sum() function to a logical vector. In reverse, we can transform the logical FALSE into a numeric value. FALSE will then become the number 0.
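R also applies such conversions on its own when c() is given values of different types, because an atomic vector can only hold one type. The following sketch relies on the usual ordering (logical, then integer, then double, then character), where the result takes the most flexible type present; the variable names are ours.

```r
# Logical mixed with integer becomes integer.
t1 <- typeof(c(TRUE, 1L))

# Integer mixed with double becomes double.
t2 <- typeof(c(1L, 2.5))

# Anything mixed with a string becomes character.
t3 <- typeof(c(2.5, "a"))
mixed <- c(TRUE, 2.5, "a")

c(t1, t2, t3)
mixed
```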

#Lists

Lists do not group together individual values; lists group together R objects. They are used as building blocks to create many more sophisticated types of R objects.

list1 <- list(240:270, "R", list(TRUE, FALSE))
list1
## [[1]]
##  [1] 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258
## [20] 259 260 261 262 263 264 265 266 267 268 269 270
## 
## [[2]]
## [1] "R"
## 
## [[3]]
## [[3]][[1]]
## [1] TRUE
## 
## [[3]][[2]]
## [1] FALSE

list1 consists of 3 components: an integer vector holding the integers 240 through 270, the string “R”, and another list, which contains the two logical values TRUE and FALSE.
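The components of a list can be inspected one at a time. The sketch below uses double brackets [[ ]] to extract a component itself and single brackets [ ] to obtain a shorter list; this indexing is not covered in the notes above and is shown only to make the structure visible.

```r
list1 <- list(240:270, "R", list(TRUE, FALSE))

length(list1)               # number of components
typeof(list1[[1]])          # the first component is an integer vector
length(list1[[1]])          # with 31 elements

inner <- list1[[3]][[2]]    # second element of the nested list
inner

class(list1[2])             # single brackets keep the list wrapper
class(list1[[2]])           # double brackets return the component itself
```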

#Data Frames

Data frames are the two-dimensional version of a list. They are far and away the most useful storage structure for data analysis, and they provide an ideal way to store an entire deck of cards. You can think of a data frame as R’s equivalent to the Excel spreadsheet because it stores data in a similar format.

Let us create a data frame, representing 3 French playing cards:

df <- data.frame(face = c("ace", "two", "six"),
suit = c("clubs", "hearts", "clubs"), value = c(11, 2, 6))
df

The three drawn cards are ace clubs (value 11), two hearts (value 2), and six clubs (value 6).

Data frames cannot combine columns of different lengths: every column must end up with the same number of rows.
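We can try this out with a column that is one value short. The sketch below wraps the call in tryCatch so that the script keeps running; the name outcome is ours, and the exact wording of the error message may differ between R versions and languages.

```r
# face has 3 values but value has only 2, so the columns do not line up.
outcome <- tryCatch(
  {
    data.frame(face = c("ace", "two", "six"), value = c(11, 2))
    "created"
  },
  error = function(e) conditionMessage(e)
)
outcome
```

Instead of "created", outcome holds the error message that data.frame() raised.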

df <- data.frame(face = c("ace", "two", "six"),
suit = c("clubs", "hearts", "clubs"), value = c(11, 2, 6),
stringsAsFactors = FALSE)
df

In the code above, we set an additional parameter when creating the data frame, namely stringsAsFactors = FALSE. This makes sure that string columns are not converted to factor columns. In R versions before 4.0.0, stringsAsFactors defaulted to TRUE, so it had to be set explicitly; since R 4.0.0 the default is FALSE, and the argument only makes the choice explicit.

Let us see if the setting stringsAsFactors = FALSE did its job:

typeof(df)
## [1] "list"
class(df)
## [1] "data.frame"
str(df)
## 'data.frame':    3 obs. of  3 variables:
##  $ face : chr  "ace" "two" "six"
##  $ suit : chr  "clubs" "hearts" "clubs"
##  $ value: num  11 2 6

First, the typeof() function returns “list” as an output, which may seem to contradict the output of the class() function (“data.frame”). But this is not a mistake: a data frame is stored as a list whose components are the equal-length columns, and the class attribute data.frame tells R to treat that list as a table. The structure function str() confirms that there are no columns with data type factor. Both string columns, face and suit, have the data type character.
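The list nature of a data frame can be checked with ordinary list tools. This sketch rebuilds df so that it runs on its own; col_classes and col_lengths are names we introduce for illustration.

```r
df <- data.frame(face = c("ace", "two", "six"),
                 suit = c("clubs", "hearts", "clubs"),
                 value = c(11, 2, 6),
                 stringsAsFactors = FALSE)

is.list(df)    # a data frame is a list ...
length(df)     # ... with one component per column
names(df)

# Each column is an ordinary vector with its own class and the same length.
col_classes <- vapply(df, class, character(1))
col_lengths <- vapply(df, length, integer(1))
col_classes
col_lengths

# A column can be extracted like a list component.
df[["value"]]
```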