This chapter summarises the most important data structures in base R. You’ve probably used many (if not all) of them before, but you may not have thought deeply about how they are interrelated. In this brief overview, I won’t discuss individual types in depth. Instead, I’ll show you how they fit together as a whole. If you need more details, you can find them in R’s documentation.
R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:
| Dim | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
Almost all other objects are built upon these foundations. Note that R has no 0-dimensional, or scalar types. Individual numbers or strings, which you might think would be scalars, are actually vectors of length one.
Given an object, the best way to understand what data structures it’s composed of is to use str(). str() is short for structure and it gives a compact, human readable description of any R data structure.
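As a quick illustration of what str() reports, here is a minimal sketch using two made-up objects (the names num_vec and mixed are hypothetical and chosen only for this example):

```r
# Hypothetical objects, used only to show what str() prints
num_vec <- c(1.5, 2, 3)
mixed <- list(id = 1L, label = "a", flags = c(TRUE, FALSE))

str(num_vec)  # one line: type, length and values
str(mixed)    # one header line plus one line per list element
```

For the list, str() shows each element's type on its own line, which is usually the fastest way to see how a nested object is put together.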
Vectors introduces you to atomic vectors and lists, R’s 1d data structures.
Attributes takes a small detour to discuss attributes, R’s flexible metadata specification. Here you’ll learn about factors, an important data structure created by setting attributes of an atomic vector.
Matrices and arrays introduces matrices and arrays, data structures for storing 2d and higher dimensional data.
Data frames teaches you about the data frame, the most important data structure for storing data in R. Data frames combine the behaviour of lists and matrices to make a structure ideally suited for the needs of statistical data.
The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists.
Atomic vectors and lists have three common properties:
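Type, typeof(), what it is.
Length, length(), how many elements it contains.
Attributes, attributes(), additional arbitrary metadata.
Atomic vectors and lists have one key difference: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.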
You can use is.atomic(x) to check if an object is an atomic vector, is.list(x) to test if an object is a list, and is.atomic(x) || is.list(x) to check if an object is either an atomic vector or a list, i.e. is a vector of either type.
x <- c(1, 3, 5, 9)
is.atomic(x)
is.list(x)
is.atomic(x) || is.list(x)
## [1] TRUE
## [1] FALSE
## [1] TRUE
Note: is.vector() does not strictly test if an object is a vector. Instead it returns TRUE if the object is a vector or expression that has either no attributes or only the names attribute set.
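A small sketch of this behaviour, assuming an arbitrary attribute name ("my_attribute" is only illustrative):

```r
x <- 1:3
is.vector(x)                       # no attributes at all

named <- c(a = 1, b = 2)
is.vector(named)                   # only the names attribute is set

attr(x, "my_attribute") <- "note"  # hypothetical extra attribute
is.vector(x)                       # FALSE once another attribute is present
is.atomic(x)                       # the object is still an atomic vector
```

This is why is.atomic(x) || is.list(x) is the safer test when you want to know whether something is a vector.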
Because elements of an atomic vector must be the same type, we can describe any atomic vector by describing the type of element it contains.
Atomic vectors which contain logical elements are called logical atomic vectors; those containing integer elements are called integer atomic vectors; those containing double (often called numeric) elements are called double atomic vectors; and finally those containing character elements are called character atomic vectors.
There are two rare types that we will not discuss further, called complex and raw.
Atomic vectors are usually created with c(), short for combine:
dbl_var <- c(1, 2.5, 4.5)
typeof(dbl_var)
# With the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
typeof(int_var)
# Use TRUE and FALSE (or T and F) to create logical atomic vectors
log_var <- c(TRUE, FALSE, T, F)
typeof(log_var)
chr_var <- c("these are", "some strings")
typeof(chr_var)
## [1] "double"
## [1] "integer"
## [1] "logical"
## [1] "character"
Atomic vectors are always flat, even if you nest c()’s within one another:
c(1, c(2, c(3, 4)))
# the same as
c(1, 2, 3, 4)
## [1] 1 2 3 4
## [1] 1 2 3 4
Missing values are specified with NA, which is a logical atomic vector of length 1. NA will always be coerced to the correct type if used inside c(), i.e. it will match the type of the elements inside the atomic vector. Alternatively you can create NA’s of a specific type with NA_real_ (double), NA_integer_ and NA_character_.
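The following sketch checks these claims with typeof(); the values are arbitrary:

```r
typeof(NA)            # the default NA is logical
typeof(c(1.5, NA))    # NA takes on the double type of its neighbours
typeof(c("a", NA))    # NA takes on the character type
typeof(NA_integer_)   # a typed NA, here integer
is.na(c(1.5, NA))     # is.na() finds missing values whatever their type
```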
Given an atomic vector, you can determine its type with typeof(), as demonstrated above, or check if it’s a specific type with an “is” function.
Here are is.character() and is.logical() being used, as well as the more general is.atomic() (again, as shown above).
log_var <- c(TRUE, FALSE, T, F)
is.logical(log_var)
chr_var <- c("these are", "some strings")
is.character(chr_var)
is.numeric(log_var)
## [1] TRUE
## [1] TRUE
## [1] FALSE
is.numeric() will check an atomic vector for “numberliness”, returning TRUE if it is either an integer atomic vector or a double atomic vector. If you want to differentiate further, use is.integer() to check if it’s an integer atomic vector and is.double() to check if it’s a double atomic vector. This is made slightly confusing due to the words double and numeric being often used interchangeably in R - however, now you know the correct terminology!
Note: Unless you use the L suffix when creating an integer atomic vector, as shown above, R stores whole numbers as doubles, so a vector built only from numbers like 1, 2 and 4 is a double atomic vector.
# A group of numbers containing decimals are saved as double
x <- c(4.5, 6, -1.025, 6.8)
typeof(x)
is.double(x)
## [1] "double"
## [1] TRUE
# A set of numbers not containing decimals are saved as double too
y <- c(1, 2, 4, 4)
typeof(y)
is.integer(y)
is.double(y)
## [1] "double"
## [1] FALSE
## [1] TRUE
# Force R to save numbers as integers using L
z <- c(1L, 2L, 10L, 10L)
typeof(z)
is.integer(z)
is.double(z)
## [1] "integer"
## [1] TRUE
## [1] FALSE
# All of the above are numeric
is.numeric(x)
is.numeric(y)
is.numeric(z)
## [1] TRUE
## [1] TRUE
## [1] TRUE
# All of the above are atomic vectors
is.atomic(x)
is.atomic(y)
is.atomic(z)
## [1] TRUE
## [1] TRUE
## [1] TRUE
We’ve already said that all elements of an atomic vector must be the same type. When you attempt to combine different types, they will be coerced to the most flexible type; for example, double is more flexible than integer.
Logical is the least flexible type, followed by integer, double, and then character, which is the most flexible type.
For example, combining a character and an integer yields a character:
str(c("a", 1))
## chr [1:2] "a" "1"
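The rest of the ordering can be checked the same way. This sketch simply combines pairs of types and asks for the type of the result:

```r
typeof(c(TRUE, 1L))    # logical with integer gives integer
typeof(c(1L, 2.5))     # integer with double gives double
typeof(c(2.5, "a"))    # double with character gives character
typeof(c(TRUE, "a"))   # logical with character gives character
```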
When a logical atomic vector is coerced to an integer or double, TRUE becomes 1 and FALSE becomes 0. This is very useful in conjunction with sum() and mean().
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
# Total number of TRUEs
sum(x)
# Proportion that are TRUE
mean(x)
## [1] 0 0 1
## [1] 1
## [1] 0.3333333
When a logical atomic vector is coerced to a character, TRUE becomes "TRUE" and FALSE becomes "FALSE".
y <- c(FALSE, TRUE, TRUE, "apples")
y
typeof(y)
## [1] "FALSE" "TRUE" "TRUE" "apples"
## [1] "character"
Coercion often happens automatically. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc) will coerce to a logical. You will usually get a warning message if the coercion might lose information.
If confusion is likely, explicitly coerce with as.character(), as.double(), as.integer(), or as.logical().
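A minimal sketch of explicit coercion, using made-up input values. The suppressWarnings() call is only there to keep the example quiet; in real code you would normally want to see the warning:

```r
x <- c("1", "2.5", "three")   # hypothetical input

# "three" cannot be read as a number, so it becomes NA (with a warning)
num <- suppressWarnings(as.double(x))
num

as.integer(c(1.9, -1.9))               # drops the fractional part
as.logical(c("TRUE", "false", "yes"))  # unrecognised strings become NA
as.character(c(TRUE, FALSE))
```

Checking for new NA values after an explicit coercion is a simple way to spot inputs that could not be converted.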
Lists are different from atomic vectors because their elements can be of any type, including other lists. You construct lists by using list() instead of c():
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
## List of 4
## $ : int [1:3] 1 2 3
## $ : chr "a"
## $ : logi [1:3] TRUE FALSE TRUE
## $ : num [1:2] 2.3 5.9
Recall that because atomic vectors can contain just one type of element, typeof() on an atomic vector tells you what type of element that atomic vector contains.
As a list can contain lots of different types of element, typeof() simply tells you that the object is a list!
typeof(x)
## [1] "list"
c() used on several lists will combine them all into one list (not an atomic vector - can you think why?).
However, list() used on several lists will keep them separate. This is why lists are sometimes called recursive vectors, because a list can contain other lists.
Atomic vectors are not recursive, because atomic vectors can’t contain other atomic vectors - we said earlier that c(1, c(2, c(3, 4))) is the same object as c(1, 2, 3, 4).
Recursiveness is another fundamental difference between the two types of 1D vector, atomic vectors and lists.
x <- list(list(list(list(9.5))))
str(x)
is.recursive(x)
## List of 1
## $ :List of 1
## ..$ :List of 1
## .. ..$ :List of 1
## .. .. ..$ : num 9.5
## [1] TRUE
If given a combination of atomic vectors and lists, c() will coerce atomic vectors to lists before combining them.
a <- list(1, 4, 5, 2)
str(a)
# A list, containing four elements
## List of 4
## $ : num 1
## $ : num 4
## $ : num 5
## $ : num 2
b <- list(list(1), list(4), list(5), list(2))
str(b)
# A list, containing four elements, each of which is a list, each containing one element
## List of 4
## $ :List of 1
## ..$ : num 1
## $ :List of 1
## ..$ : num 4
## $ :List of 1
## ..$ : num 5
## $ :List of 1
## ..$ : num 2
x <- c(list(1, 4))
str(x)
typeof(x)
# A list, containing two elements
## List of 2
## $ : num 1
## $ : num 4
## [1] "list"
y <- c(list(1), list(4), list(5), list(2))
str(y)
# A list, containing four elements
## List of 4
## $ : num 1
## $ : num 4
## $ : num 5
## $ : num 2
z <- c(list(1, 4), c(5, 2))
str(z)
# A list, containing four elements
## List of 4
## $ : num 1
## $ : num 4
## $ : num 5
## $ : num 2
l <- list(list(1, 4), c(5, 2))
str(l)
# A list, containing two elements, the first a list, the second an atomic vector
## List of 2
## $ :List of 2
## ..$ : num 1
## ..$ : num 4
## $ : num [1:2] 5 2
As mentioned already, typeof() applied to a list returns list. You can test for a list with is.list() and coerce to a list with as.list().
x <- c(1, 4, 5, 2)
y <- as.list(x)
x
y
all.equal(x, y)
## [1] 1 4 5 2
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 5
##
## [[4]]
## [1] 2
##
## [1] "Modes: numeric, list" "target is numeric, current is list"
And you can turn a list into an atomic vector with unlist().
x <- list(1, 4, 5, 2)
y <- unlist(x)
x
y
is.list(x)
is.atomic(x)
is.list(y)
is.atomic(y)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 5
##
## [[4]]
## [1] 2
##
## [1] 1 4 5 2
## [1] TRUE
## [1] FALSE
## [1] FALSE
## [1] TRUE
If the elements of a list have different types, unlist() uses the same coercion rules as c().
x <- list(T, F, F, F, 1, 2, 5)
y <- unlist(x)
y
## [1] 1 0 0 0 1 2 5
Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described below in the data frames section) and linear model objects (as produced by lm()) are lists:
# Data frames are lists
is.list(mtcars)
# Linear model objects
mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
## [1] TRUE
## [1] TRUE
What are the six types of atomic vector? How does a list differ from an atomic vector?
What makes is.vector() and is.numeric() fundamentally different to is.list() and is.character()?
Test your knowledge of vector coercion rules by predicting the output of the following uses of c():
c(1, FALSE)
c("a", 0)
c(list(1), "a")
c(TRUE, 1L)
Try coding them and see if you were right!
Why do you need to use unlist() to convert a list to an atomic vector? Why doesn’t as.vector() work?
Why is 1 == "1" true? Why is -1 < FALSE true? Why is "one" < 2 false?
Why is the default missing value, NA, a logical vector? What’s special about logical vectors? (Hint: think about c(FALSE, NA_character_).)
All objects can have arbitrary additional attributes, used to store metadata about the object.
These attributes can be seen as labeled values you can attach to any object.
For example, both the names and the dimensions of matrices and arrays are stored in R as attributes of the object.
Attributes can be thought of as a named list (with unique names).
Attributes can be accessed individually with attr() or all at once (as a list) with attributes().
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
str(attributes(y))
## [1] "This is a vector"
## List of 1
## $ my_attribute: chr "This is a vector"
The structure() function returns a new object with modified attributes:
structure(1:10, my_attribute = "This is a vector")
## [1] 1 2 3 4 5 6 7 8 9 10
## attr(,"my_attribute")
## [1] "This is a vector"
By default, most attributes are lost when modifying a vector:
attributes(y[1])
attributes(sum(y))
## NULL
## NULL
The only attributes not lost are the three most important:
Names, a character vector giving each element a name, described below.
Dimensions, used to turn vectors into matrices and arrays, described below in the matrices and arrays section.
Class, used to implement the S3 object system.[^1]
Each of these attributes has a specific accessor function to get and set values. When working with these attributes, use names(x), dim(x), and class(x), not attr(x, "names"), attr(x, "dim"), and attr(x, "class").
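To see which attributes survive a modification, here is a small sketch. The attribute name "note" is hypothetical and stands in for any custom metadata:

```r
x <- c(a = 1, b = 2, c = 3)
attr(x, "note") <- "hypothetical custom attribute"

names(attributes(x))        # both names and the custom attribute
names(attributes(x[1:2]))   # after subsetting, only names is left

# The accessor functions read the special attributes directly
names(x)
class(x)   # the implicit class, since no class attribute has been set
```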
You can name a vector in three ways:
When creating it: x <- c(a = 1, b = 2, c = 3).
By modifying an existing vector in place: x <- 1:3; names(x) <- c("a", "b", "c"), or x <- 1:3; names(x)[[1]] <- c("a").
By creating a modified copy of a vector: x <- setNames(1:3, c("a", "b", "c")).
Names don’t have to be unique. However, character subsetting, described in http://adv-r.had.co.nz/Subsetting.html#lookup-tables, is the most important reason to use names and it is most useful when the names are unique.
Not all elements of a vector need to have a name.
- If some names are missing when you create the vector, the names will be set to an empty string for those elements.
- If you assign fewer names than there are elements by assigning to names(), the remaining names will be NA (specifically NA_character_).
- If all names are missing, names() simply returns NULL.
y <- c(a = 1, 2, 3)
names(y)
v <- c(1, 2, 3)
names(v) <- c('a')
names(v)
z <- c(1, 2, 3)
names(z)
## [1] "a" "" ""
## [1] "a" NA NA
## NULL
You can create a new vector without names using unname(x), or remove names in place by assigning names(x) <- NULL.
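A short sketch of the two approaches, using an arbitrary named vector:

```r
x <- c(a = 1, b = 2, c = 3)

y <- unname(x)     # a new vector without names; x is untouched
names(y)
names(x)

names(x) <- NULL   # removes the names from x itself
names(x)
```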
One important use of attributes is to define factors. A factor is an atomic vector that can contain only predefined values, and is used to store categorical data. We met them back in Module 1.
Factors are built on top of integer atomic vectors using two attributes: - the class, “factor”, which makes them behave differently from regular integer vectors, - the levels, which defines the set of allowed values.
There are many classes in R. The most common are matrices, arrays, functions, numerical, logical, lists and factors.
x <- factor(c("horse", "donkey", "donkey", "horse"))
x
class(x)
length(x)
levels(x)
## [1] horse donkey donkey horse
## Levels: donkey horse
## [1] "factor"
## [1] 4
## [1] "donkey" "horse"
So here x is a factor object with two levels.
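To see the integer vector and the two attributes behind a factor, a sketch like the following can help. It rebuilds the same factor under a different name (f, chosen here only to avoid overwriting x):

```r
f <- factor(c("horse", "donkey", "donkey", "horse"))

typeof(f)               # the underlying storage type
attributes(f)           # the levels and the class
unclass(f)              # the integer codes, with the levels attached
levels(f)[unclass(f)]   # the codes index into the levels to give the labels
```

Each code is simply the position of that element's label within levels(f).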
# You can't use values that are not in the levels
x[2] <- "c"
## Warning in `[<-.factor`(`*tmp*`, 2, value = "c"): invalid factor level, NA
## generated
x
# Note: in R before 4.1.0 (shown here), c() on factors combines the integer codes;
# later versions return a factor with the levels merged
y <- c(factor("a"), factor("b"))
y
## [1] horse <NA> donkey horse
## Levels: donkey horse
## [1] 1 1
Factors are useful when you know the possible values a variable may take, even if you don’t see all these values in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))
table(sex_char)
table(sex_factor)
## sex_char
## m
## 3
## sex_factor
## m f
## 3 0
Sometimes when a data frame is read directly from a file, a column you thought would produce a numeric vector instead produces a factor. This is caused by a non-numeric value somewhere in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.)
x <- typeof(as.double(as.character(as.factor("."))))
## Warning in typeof(as.double(as.character(as.factor(".")))): NAs introduced by
## coercion
x
## [1] "double"
Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start.
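# Reading in "text" instead of from a file here.
# stringsAsFactors = TRUE reproduces the default behaviour of R before 4.0.0
z <- read.csv(text = "value\n12\n1\n.\n9", stringsAsFactors = TRUE)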
z$value
typeof(z$value)
as.double(z$value)
# Oops, that's not right: 3 2 1 4 are the levels of a factor, not the values (12, 1, ., 9) that we read in!
class(z$value)
# We can fix it now:
as.double(as.character(z$value))
## Warning: NAs introduced by coercion
# Or change how we read it in:
z <- read.csv(text = "value\n12\n1\n.\n9", na.strings=".")
typeof(z$value)
class(z$value)
z$value
# Perfect! :)
## [1] 12 1 . 9
## Levels: . 1 12 9
## [1] "integer"
## [1] 3 2 1 4
## [1] "factor"
## [1] 12 1 NA 9
## [1] "integer"
## [1] "integer"
## [1] 12 1 NA 9
Unfortunately, in versions of R before 4.0.0, most data loading functions automatically convert character vectors to factors. This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order.
In those versions, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data.[^2] From R 4.0.0 onwards, stringsAsFactors = FALSE is the default for data.frame() and read.csv().
While factors look (and often behave) like character vectors, they are actually integers! Be careful when treating them like strings. Some string functions (like gsub() and grepl(), which we met in Module 2) will coerce factors to strings, while others (like nchar(), which we also met in Module 2) will throw an error, and still others (like c() in versions of R before 4.1.0) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour.
x <- factor(c("milk", "bread", "honey", "water"))
x[3]
# Factors sometimes look like character strings
gsub(pattern = "e", replacement="b", x)
grepl("br", x)
# But they're really not
nchar(x[3])
## Error in nchar(x[3]): 'nchar()' requires a character vector
# c() uses the underlying integer values (in R before 4.1.0; later versions return a factor)
y <- c(x)
y
## [1] honey
## Levels: bread honey milk water
## [1] "milk" "brbad" "honby" "watbr"
## [1] FALSE TRUE FALSE FALSE
## [1] 3 1 2 4
In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case.
Write two paragraphs to explain to someone what attributes are. Make specific reference to names, factors and specific R objects.
Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. We met them in Module 1. Arrays are rarer, but worth being aware of.
Matrices and arrays are created with matrix() and array(), or by using dim() in a variable assignment:
# For a matrix, 2 scalar arguments are needed to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# For an array, use an atomic vector argument to describe all dimensions, in this case 3 dimensions
b <- array(1:12, c(2, 3, 2))
# You can also modify an object in place by setting dim()
c <- 1:6
c
dim(c) <- c(3, 2)
c
dim(c) <- c(2, 3)
c
## [1] 1 2 3 4 5 6
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
length() and names() have high-dimensional generalisations:
length() generalises to nrow() and ncol() for matrices, and more generally dim() for arrays.
names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.
length(a)
dim(a)
nrow(a)
ncol(a)
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")
a
length(b)
dim(b)
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b
## [1] 6
## [1] 2 3
## [1] 2
## [1] 3
## a b c
## A 1 3 5
## B 2 4 6
## [1] 12
## [1] 2 3 2
## , , A
##
## a b c
## one 1 3 5
## two 2 4 6
##
## , , B
##
## a b c
## one 7 9 11
## two 8 10 12
c() generalises to cbind() and rbind() for matrices, and to abind() (provided by the abind package) for arrays. You can transpose a matrix with t(); the generalised equivalent for arrays is aperm().
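The base R functions mentioned here can be sketched as follows, looking only at the dimensions of each result. The objects m and arr are made up for the example, and abind() is left out because it needs an extra package:

```r
m <- matrix(1:6, nrow = 2)

dim(rbind(m, 7:9))   # adds a row
dim(cbind(m, 7:8))   # adds a column
dim(t(m))            # rows and columns swapped

arr <- array(1:24, c(2, 3, 4))
dim(aperm(arr))               # by default the dimensions are reversed
dim(aperm(arr, c(2, 1, 3)))   # swap only the first two dimensions
```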
You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing atomic vector into a matrix or array.
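A brief sketch of these tests and conversions, again with made-up objects:

```r
v <- 1:6
m <- matrix(v, nrow = 2)
arr <- array(1:24, c(2, 3, 4))

is.matrix(m)
is.array(arr)
length(dim(m))     # 2 dimensions
length(dim(arr))   # 3 dimensions

dim(as.matrix(v))  # the vector becomes a one-column matrix
dim(as.array(v))   # the vector becomes a 1d array
```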
Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function (tapply() from Module 2 is a frequent offender). As always, use str() to reveal the differences.
1:3
str(1:3) # 1d vector
matrix(1:3, ncol = 1)
str(matrix(1:3, ncol = 1)) # column vector
matrix(1:3, nrow = 1)
str(matrix(1:3, nrow = 1)) # row vector
array(1:3, 3)
str(array(1:3, 3)) # "array" vector
## [1] 1 2 3
## int [1:3] 1 2 3
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
## int [1:3, 1] 1 2 3
## [,1] [,2] [,3]
## [1,] 1 2 3
## int [1, 1:3] 1 2 3
## [1] 1 2 3
## int [1:3(1d)] 1 2 3
While atomic vectors are most commonly turned into matrices, the dimension attribute can also be set on lists to make list-matrices or list-arrays[^3]:
l <- list(2:11, "a", TRUE, 1.0)
l
class(l)
dim(l) <- c(2, 2)
l
class(l) # in R 4.0.0 and later this also reports "array"
## [[1]]
## [1] 2 3 4 5 6 7 8 9 10 11
##
## [[2]]
## [1] "a"
##
## [[3]]
## [1] TRUE
##
## [[4]]
## [1] 1
##
## [1] "list"
## [,1] [,2]
## [1,] Integer,10 TRUE
## [2,] "a" 1
## [1] "matrix"
What does dim() return when applied to a vector?
If is.matrix(x) is TRUE, what will is.array(x) return?
If is.array(x) is TRUE, what will is.matrix(x) return?
How would you describe the following three objects? What makes them different to 1:5?
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))
Data frames, which we first met in Module 1, are the most common way of storing data in R and, if used systematically, make data analysis easier. Under the hood, a data frame is simply a list of equal-length vectors.
This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list:
A data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing.
The length() of a data frame is the length of the underlying list and so is the same as ncol().
nrow() gives the number of rows.
You can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix).[^4]
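The shared properties listed above can be checked with a small sketch. The data frame here is made up for the example:

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

identical(names(df), colnames(df))   # the same thing for a data frame
length(df) == ncol(df)               # length counts columns
nrow(df)

df[["x"]]    # list-style: extract one column as a vector
df["x"]      # list-style: a one-column data frame
df[2, ]      # matrix-style: the second row
df[2, "x"]   # matrix-style: a single cell
```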
Unsurprisingly, you create a data frame using data.frame(), which takes named vectors as input:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
df
str(df)
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: Factor w/ 3 levels "a","b","c": 1 2 3
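Notice that data.frame() has turned the strings into factors here. This was the default behaviour before R 4.0.0; in later versions strings are kept as character vectors unless you ask otherwise. Use stringsAsFactors = FALSE to suppress the conversion explicitly: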
df <- data.frame(
x = 1:3,
y = c("a", "b", "c"),
stringsAsFactors = FALSE)
df
str(df)
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: chr "a" "b" "c"
A data frame’s type reflects the underlying vector used to build it, the list.[^5] As such, typeof() will tell you a data frame is a list, and both class() and is.data.frame() will confirm that it’s a data frame:
typeof(df)
class(df)
is.data.frame(df)
## [1] "list"
## [1] "data.frame"
## [1] TRUE
You can coerce an object to a data frame with as.data.frame():
as.data.frame() on a vector will create a one-column data frame.
as.data.frame() on a list will create one column for each element, and will throw an error if the elements are not all the same length.
as.data.frame() on a matrix will create a data frame with the same number of columns and rows as the matrix.
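These three cases can be sketched by looking at the dimensions of each result. The inputs are made up, and try() is used only so the failing case does not stop the script:

```r
v <- 1:3
dim(as.data.frame(v))   # 3 rows, 1 column

l <- list(a = 1:3, b = c("x", "y", "z"))
dim(as.data.frame(l))   # one column per list element

m <- matrix(1:6, nrow = 2)
dim(as.data.frame(m))   # same shape as the matrix

# Elements of different lengths cannot form a data frame
res <- try(as.data.frame(list(a = 1:2, b = 1:3)), silent = TRUE)
inherits(res, "try-error")
```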
You can combine data frames using cbind() and rbind().
When combining column-wise, the number of rows must match, but any row names are ignored.
cbind(df, data.frame(z = 3:1))
## x y z
## 1 1 a 3
## 2 2 b 2
## 3 3 c 1
When combining row-wise, both the number and names of columns must match.
rbind(df, data.frame(x = 10, y = "z"))
## x y
## 1 1 a
## 2 2 b
## 3 3 c
## 4 10 z
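It’s a common mistake to try and create a data frame by cbind()ing atomic vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame, and a matrix can hold only one type, so the integers are coerced to strings before data.frame() ever sees them. (The factors in the first output below reflect the pre-4.0.0 default; later versions show character columns.) Instead use data.frame() directly: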
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)
good <- data.frame(a = 1:2, b = c("a", "b"),
stringsAsFactors = FALSE)
str(good)
## 'data.frame': 2 obs. of 2 variables:
## $ a: Factor w/ 2 levels "1","2": 1 2
## $ b: Factor w/ 2 levels "a","b": 1 2
## 'data.frame': 2 obs. of 2 variables:
## $ a: int 1 2
## $ b: chr "a" "b"
The conversion rules for cbind() are complicated and best avoided by ensuring all inputs are of the same type.
Since a data frame is a list of vectors, it is possible for a data frame to have a column that is a list:
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df
## x y
## 1 1 1, 2
## 2 2 1, 2, 3
## 3 3 1, 2, 3, 4
However, when a list is given to data.frame() directly, it tries to put each item of the list into its own column, so this fails:
data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))
## Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4
A workaround is to use I(), which causes data.frame() to treat the list as one unit:
dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)
dfl[2, "y"]
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y:List of 3
## ..$ : int 1 2
## ..$ : int 1 2 3
## ..$ : int 1 2 3 4
## ..- attr(*, "class")= chr "AsIs"
## [[1]]
## [1] 1 2 3
I() adds the AsIs class to its input.[^6]
Similarly, it’s also possible to have a column of a data frame that’s a matrix or array, as long as the number of rows matches the data frame:
dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
str(dfm)
dfm[2, "y"]
## 'data.frame': 3 obs. of 2 variables:
## $ x: int 1 2 3
## $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9
## [,1] [,2] [,3]
## [1,] 2 5 8
Use list and array columns with caution: many functions that work with data frames assume that all columns are atomic vectors.
What attributes does a data frame possess?
What does as.matrix() do when applied to a data frame with columns of different types?
Can you have a data frame with 0 rows? What about 0 columns?
What are the three properties of a vector (atomic vector or list) other than its contents?
What are the four common types of atomic vectors? What are the two rarer types?
What are attributes? How do you get them and set them?
How is a list different from an atomic vector? How is a matrix different from a data frame?
Can you have a list that is a matrix? Can a data frame have a column that is a matrix?
[^1]: Described further in http://adv-r.had.co.nz/OO-essentials.html#s3
[^2]: A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave.
[^3]: These are pretty esoteric data structures, but can be useful if you want to arrange objects into a grid-like structure. For example, if you’re running models on a spatio-temporal grid, it might be natural to preserve the grid structure by storing the models in a 3d array.
[^4]: See http://adv-r.had.co.nz/Subsetting.html#subsetting for more on subsetting.
[^5]: Because it’s an S3 class.
[^6]: But this can usually be safely ignored.