Introduction

Programming

Programming is the way of telling a computer how to do certain things by giving it unambiguous instructions. The person who writes these instructions (you) is called a programmer; the instructions themselves are called a program. Instructions can be written in many different ways, depending on the task at hand. The formats in which instructions are written are called programming languages.

Today, 7117 different living natural languages are known (recognized by ISO 639) and 8945 programming languages (including different versions of the same software; see hopl.info).

If programming languages are explicit and well defined, why do we need so many different ones? Well, depending on the task, different languages have different advantages and drawbacks. Some programming languages have been designed for web applications, others for mobile devices, for solving mathematical systems quickly and efficiently, or for providing tools for data analysis. Some focus on being platform independent (running on different operating systems), others on parallel computing. A “one for all” programming language does not exist and quite likely never will.

The R programming language

What is R?

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R, BIG DATA AND DATA SCIENCE

BIG DATA

Big data refers to significant volumes of data that cannot be processed effectively with the traditional applications that are currently used. The processing of big data begins with raw data that isn’t aggregated and is most often impossible to store in the memory of a single computer.

Gartner provides the following definition of big data: Big data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.

DATA SCIENCE

Dealing with unstructured and structured data, data science is a field that comprises everything that is related to data cleansing, preparation, and analysis.

Data science is the combination of statistics, mathematics, programming, problem-solving, capturing data in ingenious ways, the ability to look at things differently, and the activity of cleansing, preparing, and aligning data. This umbrella term includes various techniques that are used when extracting insights and information from data.

R for BIG DATA and DATA SCIENCE

R is one of the most widely used languages for BIG DATA and DATA SCIENCE.

Source: fossbytes.com

Another important aspect of any programming language is the ability to communicate with the outside world, i.e., to import and export data or to communicate over a network. R is well suited for this.

How R can communicate with the outside world

Installation

Installing R

Depending on your operating system:

  • Windows : Download and install the latest base R version (.exe) from CRAN (same for both, 32/64 bit Windows versions). Furthermore, download and install the Rtools (recommended version).

  • Mac OS X : Download and install the latest base R version (.pkg) from CRAN (e.g., R-3.6.1.pkg). Furthermore, download and install the latest clang and gfortran compilers (which can be found under tools).

  • Linux : Download and install the latest base R version via the Linux package manager (r-base which consists of r-base-core and r-base-devel) or follow the instructions on CRAN. Binaries in various flavours (.deb, .rpm) are available on CRAN.

Install RStudio (IDE)

To make things easier, especially for beginners, we will use a graphical interface, a so-called integrated development environment (IDE). Several IDEs for R exist, e.g., the R Commander, RGui, or RStudio, which we will use in this course.

RStudio is free to use and open source (but developed and maintained by a private company) and is available for all common operating systems (MS Windows, Mac OS X, Linux). To install, visit https://www.rstudio.com/products/rstudio/download/ and download the free RStudio Desktop version which fits your operating system. For Linux users: RStudio is typically also available via the package manager (apt, yum, …).

Vectors

Objects in R

R is an object-oriented programming language with the fundamental design principle: Everything in R is an object. In R, objects can be:

  • Variables (e.g., a, b, result)
  • Functions
  • Connection handlers

Vectors are sometimes also referred to as the nuts & bolts in R as they build the basis for all complex objects such as data frames or fitted regression models. The image below shows a (simplified) schematic overview of how various more complex objects (on the right hand side) are based on vectors (on the left hand side).

Simplified schematic overview of how different R objects are connected

(Atomic) vectors

A vector is nothing else than a sequence of elements of a certain type. R distinguishes vectors with two different modes.

  • Atomic vectors: All elements must have the same basic type (e.g., numeric, character, …).
  • Lists: Special vector mode. Different elements can have different types.

(Atomic) vectors are the most basic objects in R as they can contain only data of one type (e.g., only numeric values, or only character strings, etc.). Six different types of data can be stored in atomic vectors.
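The six atomic types are double, integer, logical, character, complex, and raw. As a minimal sketch (the variable names are illustrative only), the following creates one vector of each type and checks it with typeof():

```r
# One example vector per atomic type (names are illustrative)
v_double    <- c(1.5, 2, 3.25)
v_integer   <- c(1L, 2L, 3L)
v_logical   <- c(TRUE, FALSE, NA)
v_character <- c("a", "b", "c")
v_complex   <- c(1+2i, 3-1i)
v_raw       <- as.raw(c(1, 255))

# Collect the type of each vector
types <- sapply(list(v_double, v_integer, v_logical,
                     v_character, v_complex, v_raw), typeof)
print(types)
```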

Important vector functions

In programming, functions are used to perform a specific task, e.g., manipulate an object, calculate a derived quantity, or investigate existing objects. A few of the most important ones for creating and investigating simple vectors are:

  • c(): Combines multiple elements into one atomic vector.
  • length(): Returns the length (number of elements) of an object.
  • class(): Returns the class of an object.
  • typeof(): Returns the type of an object. There is a small (sometimes important) difference between typeof() and class() as we will see later.
  • attributes(): Returns further metadata of arbitrary type.
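A short sketch of these functions applied to a small numeric vector; the vector and its names are made up for illustration. It also hints at the difference between class() and typeof():

```r
x <- c(1.5, 2, 10)     # combine three numeric values
length(x)              # number of elements: 3
class(x)               # "numeric"
typeof(x)              # "double"
attributes(x)          # NULL: a plain vector has no attributes

names(x) <- c("a", "b", "c")   # adding names creates an attribute
attributes(x)                  # now a list with one element, 'names'
```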

Checking class/type

A range of functions exist to check whether or not an object is of a specific type. These very handy functions return either a logical TRUE if the input is of this specific type, or FALSE if not.

  • is.double()
  • is.numeric()
  • is.integer()
  • is.logical()
  • is.character()
  • is.vector()
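As a quick illustration (the two vectors are arbitrary examples), the checks behave as follows for a double and an integer vector:

```r
x <- c(1.5, 2, 3)   # double
i <- 1:3            # integer

is.double(x)     # TRUE
is.integer(x)    # FALSE
is.numeric(x)    # TRUE
is.numeric(i)    # TRUE: integers also count as numeric
is.character(x)  # FALSE
is.vector(x)     # TRUE
```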

Numeric sequences

Regular sequences can be created using the function seq() and its shortcuts seq.int(), seq_along(), and seq_len().

  • Sequences : A sequence is a set of numeric or integer values between two well-defined points (from and to) with equidistant spacing. By default the spacing/increment is 1L, but it can be specified explicitly using the by argument.

  • Repeat : Repeating allows us to repeat a number (or a set of numbers) in different ways. We can repeat one element several times, repeat a vector multiple times, or repeat the elements of a vector multiple times.

Numeric sequences : We can create numeric sequences using seq() (generic function) or seq.int() (internal/primitive function). You can try it out yourself: seq() and seq.int() do the very same in the examples below. Sometimes it can be beneficial to use seq.int() as it might be much faster, but it is a bit less flexible. To create a sequence, we have to specify at least three input arguments:

  • from : Where to start the sequence.
  • to : Where to end the sequence.
  • length.out or by : Either the length of the resulting vector, or the increment by which the values change.

# Equidistant numeric sequence
seq(from = 1.5, to = 2.5, length.out = 5)  # Specify length
## [1] 1.50 1.75 2.00 2.25 2.50
seq(from = 4, to = -4, by = -0.5)          # Specify increment/interval
##  [1]  4.0  3.5  3.0  2.5  2.0  1.5  1.0  0.5  0.0 -0.5 -1.0 -1.5 -2.0 -2.5 -3.0
## [16] -3.5 -4.0

Technically, seq() can also be used to create integer sequences. However, an integer result is only guaranteed if from, to, and by (default is 1L) are all integers!

# Explicitly define from/to/by as integers
(x <- seq(from = 10L, to = 100L, by = 10L))
##  [1]  10  20  30  40  50  60  70  80  90 100
# we can check up the class
class(x)
## [1] "integer"
# 'from' defined as a numeric value (10.0)
(y <- seq(from = 10, to = 100L, by = 10L))
##  [1]  10  20  30  40  50  60  70  80  90 100
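The two results print identically, but their types differ. A small sketch that re-creates both sequences and compares them:

```r
x <- seq(from = 10L, to = 100L, by = 10L)  # all arguments are integers
y <- seq(from = 10,  to = 100L, by = 10L)  # 'from' is a double

class(x)          # "integer"
class(y)          # "numeric"
all(x == y)       # TRUE: the values are equal
identical(x, y)   # FALSE: the types are not
```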

Integer sequences : There are three distinct functions to create proper integer sequences.

  • from:to : Two values separated by a colon (:); creates a sequence in steps of +1 or -1. If from is an integer (or a numeric value without decimals), an integer sequence is created.
# Sequence from 1:4
(x <- 1:4)
## [1] 1 2 3 4
  • seq_along(x) : Creates a sequence from 1L to length(x) where x is an existing object (e.g., a vector).
# Create character vector 'cities'
cities <- c("Vienna", "Paris", "Berlin", "Rome", "Bern")
seq_along(cities)
## [1] 1 2 3 4 5
  • seq_len(n) : creates a sequence between 1L and n.
seq_len(10)
##  [1]  1  2  3  4  5  6  7  8  9 10

Character Sequences : If you need the letters of the alphabet, there are two convenient vectors which are available globally called LETTERS and letters. LETTERS contains the alphabet (no special characters) in upper case letters (“A”, “B”, …), letters the same in lower case letters (“a”, “b”, …). This can be used if one just wants to have some random character values.

# First 7 letters of the alphabet
LETTERS[1:7]
## [1] "A" "B" "C" "D" "E" "F" "G"
letters[1:7]
## [1] "a" "b" "c" "d" "e" "f" "g"

Replicating elements

Another frequently useful function is rep() (replicate). It can be used for all vectors, no matter whether they are numeric, integer, character, or logical. The rep() function can be used in different ways:

  • Replicate one specific value n times :
rep(2L, 5)                # Case (1)
## [1] 2 2 2 2 2
  • Given an existing vector: Replicate each element n times
cities <- c("Vienna", "Bern", "Rome")
rep(cities, each = 3)     # Case (2)
## [1] "Vienna" "Vienna" "Vienna" "Bern"   "Bern"   "Bern"   "Rome"   "Rome"  
## [9] "Rome"
  • Given an existing vector: Replicate the entire vector n times.
rep(cities, times = 3)    # Case (3)
## [1] "Vienna" "Bern"   "Rome"   "Vienna" "Bern"   "Rome"   "Vienna" "Bern"  
## [9] "Rome"

Alternatively we can specify which element in the original vector (cities) should be replicated how often.

rep(cities, times = c(3, 2, 5))
##  [1] "Vienna" "Vienna" "Vienna" "Bern"   "Bern"   "Rome"   "Rome"   "Rome"  
##  [9] "Rome"   "Rome"

Similar to seq(), there are some shortcuts, namely rep.int() and rep_len().

  • Replicate elements of an existing vector until a specific length is reached

Let us assume we would like to replicate the elements c(4, 5, 6) as often as necessary to get a vector of length 5, or 9:

rep_len(c(4, 5, 6), length.out = 5)    # Resulting vector has length 5
## [1] 4 5 6 4 5
rep_len(c(4, 5, 6), length.out = 9)    # Resulting vector has length 9
## [1] 4 5 6 4 5 6 4 5 6

It repeats the elements (always from left to right) until the new vector reaches the length specified as the second input argument.
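The other shortcut, rep.int(), replicates the entire vector a given number of times. A minimal sketch using the same example values (the object name r1 is illustrative):

```r
r1 <- rep.int(c(4, 5, 6), times = 2)   # whole vector, two times
r1
## [1] 4 5 6 4 5 6
```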

Coercion (mixing objects)

As mentioned earlier, vectors can only contain elements of one type! Thus, if we combine elements of different types (e.g., a numeric value and character values as in the next example), R has to convert all elements into the same type/class.

This is called ‘coercion’.

Numeric and character

# Numeric and character = character
(x <- c(1.7, "1", "A"))
## [1] "1.7" "1"   "A"
class(x)
## [1] "character"

Numeric and logical

The same happens when mixing logical and numeric values. Again, R needs to bring all values to the same type and tries to lose as little information as possible.

# Combine FALSE, 5.5, and 10.0.
(x <- c(FALSE, 5.5, 10.0))
## [1]  0.0  5.5 10.0
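A few more cases may help to see the pattern: implicit coercion moves towards the more flexible type (logical, then integer, then double, then character). The sketch below also shows explicit coercion with the as.*() functions; all values are arbitrary examples.

```r
x <- c(FALSE, TRUE, 5L)       # logical + integer -> integer
typeof(x)                     # "integer"

y <- c(TRUE, 2L, 3.5, "a")    # anything + character -> character
y                             # "TRUE" "2" "3.5" "a"

# Explicit coercion
as.integer(c("1", "2"))       # 1 2
as.numeric(TRUE)              # 1
as.logical(c(0, 1, 5))        # FALSE TRUE TRUE
```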

Mathematical operations

Besides the mathematical operators (see table above), a series of relational and logical operators exists. The table above shows the most important ones.
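As a brief sketch, here are the most common arithmetic, relational, and logical operators applied to two arbitrary example values a and b:

```r
a <- 7
b <- 2

# Arithmetic operators
a + b      # 9
a - b      # 5
a * b      # 14
a / b      # 3.5
a ^ b      # power: 49
a %% b     # modulo (remainder): 1
a %/% b    # integer division: 3

# Relational operators
a > b      # TRUE
a == b     # FALSE
a != b     # TRUE

# Logical operators
(a > 5) & (b > 5)   # AND: FALSE
(a > 5) | (b > 5)   # OR:  TRUE
!(a > 5)            # NOT: FALSE
```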

Vectors and scalars

# Multiply a sequence by 2
x <- 1:10
x * 2
##  [1]  2  4  6  8 10 12 14 16 18 20
x <- c(43, 100, 34, 483, 1000)
x %% 10
## [1] 3 0 4 3 0

Vectors and vectors (matching lengths)

x <- c(500, 400, 600)
y <- c(10, 5, 100)
# Call x + y (addition) and x / y (division)
x + y
## [1] 510 405 700
x / y
## [1] 50 80  6

Vectors and vectors (non-matching lengths)

x <- c(500, 400, 600, 800)
y <- c(100, 2)
# Call x + y (addition) and x / y (division)
x + y
## [1] 600 402 700 802
x / y
## [1]   5 200   6 400
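What happens here is called recycling: the shorter vector is repeated until it matches the length of the longer one. A minimal sketch, where z is an additional made-up vector whose length does not divide the length of x:

```r
x <- c(500, 400, 600, 800)
y <- c(100, 2)

# y is recycled to c(100, 2, 100, 2) before the operation
identical(x + y, x + rep_len(y, length(x)))   # TRUE

# If the longer length is not a multiple of the shorter one,
# R still recycles but issues a warning (suppressed here).
z <- c(1, 2, 3)
res <- suppressWarnings(x + z)   # z is used as c(1, 2, 3, 1)
res
## [1] 501 402 603 801
```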

Subsetting vectors

Creating vectors is one part of the game, the second part is to be able to get information from a vector. Accessing specific elements of a vector (or an object in general) is called subsetting. Vectors can be subsetted in different ways :

  • By index (position in the vector).
  • Based on logical vectors.
  • By name (if set).

Note : It is very important to understand this concept as it works similarly (or identically) for all R objects.

Subsetting by index

As we have learned previously, function calls use round brackets (e.g., class(), length()). Subsetting uses square brackets ([…] or [[…]]).

The simplest way of subsetting a vector is subsetting by index. An index is simply the position of a specific element in a vector. R always starts counting at 1 and we can access the first element by calling x[1] (“give me the 1st element of x”), or the fifth element by using x[5].

# Create a demo vector of length 5
(x <- c(10, 20, 0, 30, 50))
## [1] 10 20  0 30 50
x[1]          # Extract first element by index
## [1] 10
x[5]          # Fifth element
## [1] 50
x[c(1, 5)]   # Get the first and fifth
## [1] 10 50
x[c(5, 1)]   # Or the fifth and first (different order)
## [1] 50 10

Similarly, we can use negative indexes to get all values except some specific ones. E.g., x[-1] gives “all but the first element of x”.

x[-1]         # all but first
## [1] 20  0 30 50
x[-5]         # all but fifth
## [1] 10 20  0 30
x[-c(1, 5)]   # all except first and fifth
## [1] 20  0 30

Out-of-range indexes: What if we try to access elements outside the vector? Our vector (x <- c(10, 20, 0, 30, 50)) only contains 5 elements. Let us subset the elements 4:7, or element 100:

x[4:7]
## [1] 30 50 NA NA
x[100]
## [1] NA

Last/first few elements:

x[1:4]          # First four elements (element 1 - 4)
## [1] 10 20  0 30
x[2:5]          # Last four elements (elements 2 - 5)
## [1] 20  0 30 50

There are convenience functions for that:

head(x, n = 4)
## [1] 10 20  0 30
tail(x, n = 4)
## [1] 20  0 30 50

The two functions head() and tail() are generic functions and will also work for most more complex objects.

Subsetting by name

x <- c(age = 35, height = 1.72, zipcode = 6020)
x["age"]
## age 
##  35
x[c("age", "zipcode")]  # get both, "age" and "zipcode"
##     age zipcode 
##      35    6020

Subsetting by logical vectors

Alternatively, we can use a logical vector to subset vectors. Remember: logical vectors contain either TRUE or FALSE. In the context of subsetting it can be used to get all elements where the logical vector contains TRUE. An example:

x <- c(30, 10, 20, 0, 30, 50)
x[c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)]
## [1] 30 10

As you can see we only get the first two elements, as only the first two elements of the logical vector contain TRUE. The two vectors (vector x to be subsetted and the logical vector) should be of the same length, else the logical vector will be recycled.

As we have already seen in subchapter Mathematical operations R comes with a series of relational and logical operators. They now become very handy to subset vectors. Given the vector above, let us find all elements in x which are larger than 25.

x > 25          # Shows the result of the relational comparison
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
x[x > 25]       # Use 'x > 25' (logical vector) for subsetting
## [1] 30 30 50

Some more examples which you can try yourself given our vector x <- c(30, 10, 20, 0, 30, 50):

  • x[x == 30]: Elements in x where x is (exactly) equal to 30.
  • x[x == 30 | x == 50]: Elements where x is either equal to 30 OR equal to 50.
  • x[x == 30 & x == 50]: Elements where x is equal to 30 AND 50 at the same time. The result should be an empty vector as this is not possible.
  • x[x >= 30 & x < 40]: All elements where x is larger than or equal to 30, but less than 40.
  • x[x < 30 & x > 40]: All elements where x is less than 30 AND larger than 40 at the same time. Again, impossible, the result should be an empty vector.
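Run as code, these comparisons give the following results (the comments show what is returned):

```r
x <- c(30, 10, 20, 0, 30, 50)

x[x == 30]             # 30 30
x[x == 30 | x == 50]   # 30 30 50
x[x == 30 & x == 50]   # numeric(0), an empty vector
x[x >= 30 & x < 40]    # 30 30
x[x < 30 & x > 40]     # numeric(0)
```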

Index of TRUE elements

x <- c(30, 10, 20, 0, 30, 50)
x >= 30
## [1]  TRUE FALSE FALSE FALSE  TRUE  TRUE
which(x >= 30)
## [1] 1 5 6

The expression x >= 30 is TRUE for the elements at position or index 1, 5, 6, which is exactly what which() returns.

min(x)                # The minimum
## [1] 0
x == min(x)           # Comparison
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE
which(x == min(x))    # Index/position
## [1] 4
max(x)                # The maximum
## [1] 50
x == max(x)           # Comparison
## [1] FALSE FALSE FALSE FALSE FALSE  TRUE
which(x == max(x))    # Index/position
## [1] 6

In addition, there are two functions, which.min() and which.max(), which are more reliable for finding the minimum/maximum of floating point numbers. However, they only return the index of the very first occurrence of the minimum/maximum! Consider the following vector z. The minimum is 0.5, which occurs twice, at positions 2 and 5. Now compare the different results of which(z == min(z)) and which.min(z):

z <- c(10, 0.5, 20, 30, 0.5)
which(z == min(z))     # Position/index of all minima
## [1] 2 5
which.min(z)           # Position/index of first occurrence of minimum
## [1] 2

The same is valid for which.max().

Vector functions

Plotting vectors

Well, one key feature of R is its ability to graphically display data (plotting). Visualizing data in some sort or form can often be very helpful to better understand the data and/or identify possible problems such as outliers or unrealistic values.

?plot
## Help on topic 'plot' was found in the following packages:
## 
##   Package               Library
##   graphics              /Library/Frameworks/R.framework/Versions/4.0/Resources/library
##   base                  /Library/Frameworks/R.framework/Resources/library
## 
## 
## Using the first match ...
?barplot
?pie
?hist   
?boxplot    
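As a rough sketch of what these functions do, the following calls plot a small vector in several ways. The data and titles are made up for illustration; each call draws a new figure on the active graphics device.

```r
x <- c(30, 10, 20, 0, 30, 50)   # example data

plot(x, type = "b", main = "Values of x by index")   # points joined by lines
barplot(x, names.arg = seq_along(x), main = "Bar plot of x")
h <- hist(x, main = "Histogram of x")   # hist() also returns the bin counts
boxplot(x, main = "Box plot of x")
pie(c(a = 3, b = 1, c = 2), main = "Pie chart")   # shares of a named vector
```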

Matrices

In the Vectors chapter we have learned about the atomic vectors and that they build the base for more complex objects. The next level of complexity are arrays and matrices. An array is a multi-dimensional extension of a vector. In the special case that the array has only two dimensions, this array is also called a matrix.

Matrix introduction

Simplified relationship between atomic vectors and matrices/arrays

While a vector is a (long) sequence of values, a matrix is a two-dimensional rectangular object with values. Important aspects of matrices in R:

  • Arrays are based on atomic vectors. A matrix is a special array with two dimensions.
  • Matrices can only contain data of one type (like vectors).
  • Matrices always also have a length (number of elements).
  • In contrast to vectors, matrices have an additional dimension attribute (dim()).
  • Like vectors, matrices can have names (row and column names; optional attribute).

Creating matrices

Matrices can be created using the matrix() function. According to the R documentation (see ?matrix or help("matrix")), its usage is matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE, dimnames = NULL).

To create a matrix containing a constant value of 999L (data = 999L) with two rows (nrow = 2) and three columns (ncol = 3) we call:

(x <- matrix(data = 999L, nrow = 2, ncol = 3))
##      [,1] [,2] [,3]
## [1,]  999  999  999
## [2,]  999  999  999

We can now check the dimension of the object using dim() .

dim(x)
## [1] 2 3

Alternatively, we can make use of the two convenience functions nrow() and ncol().

nrow(x)
## [1] 2
ncol(x)
## [1] 3

As mentioned earlier, matrices always have a length. Matrices are based on atomic vectors, the length is nothing else than the number of elements of the underlying vector.

length(x)           # Using length
## [1] 6
nrow(x) * ncol(x)   # Calculate 'by hand'
## [1] 6

Matrix-to-vector

Let us take our matrix x and use explicit coercion to convert it into a vector (as.vector()):

(y <- as.vector(x))
## [1] 999 999 999 999 999 999
length(y)
## [1] 6

Type of data

Matrices (as vectors) can only contain data of one type. We can create numeric matrices, integer matrices, character matrices, and logical matrices by adding the corresponding values in the data argument when creating a matrix.

The following four matrices are all based on vectors of different types (double, integer, character, and logical).

x1 <- matrix(seq(0, 4.5, length.out = 9), nrow = 3)    # double
x2 <- matrix(1:9, nrow = 3)                            # integer
x3 <- matrix(LETTERS[1:9], nrow = 3)                   # character
x4 <- matrix(TRUE, nrow = 3, ncol = 3)                 # logical

Investigate the objects: We can check the type of the objects using the is.*() functions. Take one of the examples above and try it yourself:
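For instance, a few checks on the four matrices defined above (re-created here so that the snippet runs on its own) could look like this:

```r
x1 <- matrix(seq(0, 4.5, length.out = 9), nrow = 3)    # double
x2 <- matrix(1:9, nrow = 3)                            # integer
x3 <- matrix(LETTERS[1:9], nrow = 3)                   # character
x4 <- matrix(TRUE, nrow = 3, ncol = 3)                 # logical

is.matrix(x2)      # TRUE
is.integer(x2)     # TRUE
is.numeric(x1)     # TRUE
is.character(x3)   # TRUE
is.logical(x4)     # TRUE
is.vector(x2)      # FALSE: the dim attribute makes it a matrix
typeof(x1)         # "double"
```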

Order of elements

Let us have a closer look at x2, the integer matrix from above, and how the elements of the vector end up in the matrix.

(x2 <- matrix(1:9, nrow = 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

This matrix is of dimension 3×3. But how does R know how big the matrix must be? If we check the command we can see that we provide an integer vector with 9 elements, and ask for a matrix with 3 rows (nrow = 3). There is only one way to fulfill the requirements: creating a 3×3 matrix.

When we look at the output above we can also see how the values have been filled in. At first, the leftmost column has been filled (with 1, 2, and 3), then the second (4, 5, 6), and last but not least the third column (7, 8, 9). This is called filled by column. The image below shows a sketch of what is happening here:

Sketch of how the data are added to a matrix (by column; default)

This is the default behavior of matrix() as the input argument byrow is set to FALSE. We can change this by setting byrow = TRUE. Instead of filling in the data by column, the top row is now filled first, followed by the second, and so on.

(x <- matrix(data = 1:9, nrow = 3, byrow = TRUE))
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

Sketch of how the data are added to a matrix when setting byrow = TRUE

Matrix functions

As for vectors a series of functions exist to work with matrices. The following list is not a complete list, but contains some useful functions for matrices:

As matrices are based on vectors, we can also use all functions from the table shown in Vector functions like e.g., get the minimum (min()), calculate the logarithm of all elements (log()), or check elements (e.g., all(x > 0)).
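A small sketch of some commonly used matrix functions together with a few of the vector functions mentioned above; the selection is illustrative and not the complete list.

```r
x <- matrix(1:6, nrow = 2)
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

dim(x)         # 2 3
t(x)           # transpose: a 3 x 2 matrix
rowSums(x)     # 9 12
colSums(x)     # 3 7 11
rowMeans(x)    # 3 4
colMeans(x)    # 1.5 3.5 5.5
min(x)         # 1
max(x)         # 6
all(x > 0)     # TRUE
```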

Mathematical operations

Matrices are often used for arithmetic (mathematics; working with numbers) to solve mathematical problems such as solving systems of linear equations, estimating regression models, and many more. The following sections give a brief introduction to mathematical operations in combination with matrices.

Matrices and scalars

One of the most simple operations is to work with a matrix and a scalar (single numeric value). As for vectors we can perform e.g., addition or multiplication as follows:

(x <- matrix(1:4, ncol = 2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
x + 2          # Add 2 to each element
##      [,1] [,2]
## [1,]    3    5
## [2,]    4    6
x * 1.5        # Multiply each element by 1.5
##      [,1] [,2]
## [1,]  1.5  4.5
## [2,]  3.0  6.0

The same is true for all other operations including +, -, /, ^, %%, sin(), cos(), and many more. The operation is applied element by element; the result of these operations is always a matrix of the same dimension with the same attributes.

Matrices and Vectors

x <- matrix(1:4, ncol = 2)  # (Re-)define matrix
y <- c(10, 100)             # Define vector
x * y                       # Multiply
##      [,1] [,2]
## [1,]   10   30
## [2,]  200  400
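The result follows from recycling: a matrix is stored column by column, so the vector y is reused down each column. The sketch below verifies this and shows sweep() as one possible way to scale each column instead; the choice of sweep() is a suggestion and not the only option.

```r
x <- matrix(1:4, ncol = 2)
y <- c(10, 100)

# y is recycled along the underlying vector of x:
# row 1 is multiplied by 10, row 2 by 100.
manual <- matrix(as.vector(x) * rep_len(y, length(x)), ncol = 2)
identical(x * y, manual)   # TRUE

# Scale column 1 by 10 and column 2 by 100 instead
(by_col <- sweep(x, MARGIN = 2, STATS = y, FUN = "*"))
##      [,1] [,2]
## [1,]   10  300
## [2,]   20  400
```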

Matrices and matrices

In the same way, simple ‘matrix and matrix’ operations work. When two matrices are of the same dimension, we can use basic element-wise arithmetic operations.

(x <- matrix(c( 1,  2,  3,  4), ncol = 2, nrow = 2))
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
(y <- matrix(c(10, 20, 30, 40), ncol = 2, nrow = 2))
##      [,1] [,2]
## [1,]   10   30
## [2,]   20   40
x + y
##      [,1] [,2]
## [1,]   11   33
## [2,]   22   44
x / y
##      [,1] [,2]
## [1,]  0.1  0.1
## [2,]  0.1  0.1
x^y
##         [,1]         [,2]
## [1,]       1 2.058911e+14
## [2,] 1048576 1.208926e+24

Besides simple arithmetic, R comes with a wide range of functions for mathematical tasks and can do ‘all’ you need. The following list is incomplete, but gives an idea what we can do beyond the content.
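To give a flavour, here is a short sketch of a few linear algebra functions from base R. The matrix A and the vector b are arbitrary example values.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)   # 2 x 2 coefficient matrix
b <- c(3, 5)                           # right-hand side

A %*% A            # matrix multiplication (not element-wise)
t(A)               # transpose
det(A)             # determinant: 5
solve(A)           # inverse of A
z <- solve(A, b)   # solves the linear system A %*% z = b
z                  # 0.8 1.4
diag(2)            # 2 x 2 identity matrix
```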

For those interested in linear algebra/advanced mathematical topics using R, you may want to check the following sources:

Matrix attributes

Dimension names

  • rownames(): names of the rows of a matrix.
  • colnames(): names of columns of a matrix.
  • dimnames(): returns all dimension names (as a list; works for all arrays).
(x <- matrix(data = 1:9, nrow = 3, ncol = 3))
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9
rownames(x); colnames(x); dimnames(x)   # All returning NULL
## NULL
## NULL
## NULL

rownames() and colnames() allow us to add row names and/or column names to this matrix. Let us add some simple names similar to what you know from Microsoft Excel or comparable spreadsheet applications.

rownames(x) <- c("Row 1", "Row 2", "Row 3")
colnames(x) <- c("Col A", "Col B", "Col C")

When printing the matrix we see that all our rows and columns are now named:

x
##       Col A Col B Col C
## Row 1     1     4     7
## Row 2     2     5     8
## Row 3     3     6     9

Once set, we can get these names by using the two functions again. Note: Row names and column names are always characters, which is why rownames() and colnames() always return character vectors (or NULL if no names are specified).

rownames(x)
## [1] "Row 1" "Row 2" "Row 3"
colnames(x)
## [1] "Col A" "Col B" "Col C"
c(class = class(colnames(x)), type = typeof(colnames(x)), length = length(colnames(x)))
##       class        type      length 
## "character" "character"         "3"

Alternatively we could use dimnames() which returns all dimension names at the same time. dimnames() returns an object of class list of length two (for matrices) where each element in the list itself is a character vector – the very same as rownames() and colnames() return.

dimnames(x)
## [[1]]
## [1] "Row 1" "Row 2" "Row 3"
## 
## [[2]]
## [1] "Col A" "Col B" "Col C"
attributes(x)
## $dim
## [1] 3 3
## 
## $dimnames
## $dimnames[[1]]
## [1] "Row 1" "Row 2" "Row 3"
## 
## $dimnames[[2]]
## [1] "Col A" "Col B" "Col C"

Changing dimension names

At any time we can use rownames() and colnames() to change (or overwrite) existing names. Let us take the matrix from above – instead of having ‘Col A’, ‘Col B’, and ‘Col C’ we would like to have ‘first’, ‘second’, and ‘third’. This can be achieved by assigning a vector with our new names to colnames(x).

colnames(x) <- c("first", "secon", "third")
x
##       first secon third
## Row 1     1     4     7
## Row 2     2     5     8
## Row 3     3     6     9

Oh dear! I have made a typo (secon instead of second). To fix this, we could (of course) overwrite all three column names again. However, there is a smarter way to do this.

colnames(x)[2]              # The one to fix
## [1] "secon"
colnames(x)[2] <- "second"  # Overwrite second column name
x
##       first second third
## Row 1     1      4     7
## Row 2     2      5     8
## Row 3     3      6     9

Creating named matrices

(x <- matrix(data = 1:9, nrow = 3, ncol = 3,
             dimnames = list(c("Row 1", "Row 2", "Row 3"),
                             c("Col A", "Col B", "Col C"))))
##       Col A Col B Col C
## Row 1     1     4     7
## Row 2     2     5     8
## Row 3     3     6     9

If you only want to specify either row names or column names, the other list element can simply be set to NULL.

# Only row names, column names set to NULL
(x <- matrix(data = 1:9, nrow = 3, ncol = 3,
             dimnames = list(c("Row 1", "Row 2", "Row 3"),
                             NULL)))
##       [,1] [,2] [,3]
## Row 1    1    4    7
## Row 2    2    5    8
## Row 3    3    6    9
# Only column names, row names set to NULL
(x <- matrix(data = 1:9, nrow = 3, ncol = 3,
             dimnames = list(NULL,
                             c("Col A", "Col B", "Col C"))))
##      Col A Col B Col C
## [1,]     1     4     7
## [2,]     2     5     8
## [3,]     3     6     9

Combine Objects

?rbind
?cbind
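A minimal sketch of both functions: rbind() combines objects row by row, cbind() column by column. The vectors a and b and the column name total are made up for illustration.

```r
a <- 1:3
b <- 4:6

r <- rbind(a, b)   # 2 x 3 matrix, one row per vector
r
##   [,1] [,2] [,3]
## a    1    2    3
## b    4    5    6

m <- cbind(a, b)   # 3 x 2 matrix, one column per vector
m2 <- cbind(m, total = a + b)   # append a named column
m3 <- rbind(m, c(0L, 0L))       # append a row
```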

Subsetting Matrices

In the previous chapter we have learned how to subset vectors (Subsetting vectors). Matrices can be subsetted with the same/similar techniques. As with vectors, we can use the following for subsetting matrices:

  • Subsetting by index.
  • Subsetting by name (if set).
  • Subsetting by logical vectors.

Subsetting by index

(x <- matrix(sprintf("x[%d, %d]", rep(1:3, 4), rep(1:4, each = 3)), nrow = 3))
##      [,1]      [,2]      [,3]      [,4]     
## [1,] "x[1, 1]" "x[1, 2]" "x[1, 3]" "x[1, 4]"
## [2,] "x[2, 1]" "x[2, 2]" "x[2, 3]" "x[2, 4]"
## [3,] "x[3, 1]" "x[3, 2]" "x[3, 3]" "x[3, 4]"

Extracting single elements :

x[1, 1]
## [1] "x[1, 1]"
x[2, 4]
## [1] "x[2, 4]"
x[3]
## [1] "x[3, 1]"

c(x[3],  x[3, 1])
## [1] "x[3, 1]" "x[3, 1]"
c(x[11], x[2, 4])
## [1] "x[2, 4]" "x[2, 4]"

Extracting multiple elements :

x[c(2, 3), c(3, 1)]
##      [,1]      [,2]     
## [1,] "x[2, 3]" "x[2, 1]"
## [2,] "x[3, 3]" "x[3, 1]"

Extracting rows/columns :

(x <- matrix(1:12, nrow = 3))
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
x[1, ]     # Returns the entire first row
## [1]  1  4  7 10
x[, 3]     # Returns the entire third column
## [1] 7 8 9

Subsetting by name

Extracting single elements

# Construct matrix
countries <- c("United States", "Great Britain", "Canada", "Russia", "Switzerland")
(medals   <- matrix(c(3, 3, 2, 1, 1, 4, 1, 1, 0, 0, 1, 5, 1, 2, 2),
                    ncol = 3, dimnames = list(countries, c("Gold", "Silver", "Bronze"))))
##               Gold Silver Bronze
## United States    3      4      1
## Great Britain    3      1      5
## Canada           2      1      1
## Russia           1      0      2
## Switzerland      1      0      2

If we are interested in the number of gold medals (first column) Canada got (third row), we could of course use subsetting by index:

medals[3, 1]   # Canada, number of gold medals
## [1] 2

However, as we have dimension names, we can also directly use the names instead of indices. To get the same information, we can thus call:

medals["Canada", "Gold"]
## [1] 2

Extracting rows/columns

medals["Canada", ]
##   Gold Silver Bronze 
##      2      1      1
medals[, "Gold"]
## United States Great Britain        Canada        Russia   Switzerland 
##             3             3             2             1             1
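Logical vectors work for matrices as well. The following sketch, which re-creates the medals matrix so that it runs on its own, selects rows based on a condition and shows the drop = FALSE argument that keeps a single row as a matrix:

```r
countries <- c("United States", "Great Britain", "Canada", "Russia", "Switzerland")
medals <- matrix(c(3, 3, 2, 1, 1, 4, 1, 1, 0, 0, 1, 5, 1, 2, 2),
                 ncol = 3, dimnames = list(countries, c("Gold", "Silver", "Bronze")))

# Rows (countries) with more than one gold medal
many_gold <- medals[medals[, "Gold"] > 1, ]

# Bronze medals of all countries without a silver medal
medals[medals[, "Silver"] == 0, "Bronze"]
##      Russia Switzerland 
##           2           2

# drop = FALSE keeps the result a matrix even for a single row
canada <- medals["Canada", , drop = FALSE]
```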

Functions

Functions consist of three key elements:

When Should I Use Functions?

  • Avoid repetitions: Try to avoid copying & pasting chunks of code. Whenever you use copy & paste, it is a good indication that you should think about writing a function.

?mean
?getwd
?length

Declaring functions

myfun <- function() {
    # instructions
  return() 
}
myfun <- function() {
    # instructions
  invisible() 
}

Examples:

# Using return()
hello_return <- function(x) {
    res <- paste("Hi", x)
    return(res)
}

# Using invisible()
hello_invisible <- function(x) {
    res <- paste("Hello", x)
    invisible(res)
}
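
To see the difference between the two functions, we can call them and check what shows up on the console. The following is a minimal sketch; the input "Ana" and the object name msg are arbitrary. withVisible() is a base R function that reports whether a result would be auto-printed.

```r
hello_return <- function(x) {
    res <- paste("Hi", x)
    return(res)
}
hello_invisible <- function(x) {
    res <- paste("Hello", x)
    invisible(res)
}

hello_return("Ana")               # Result is auto-printed on the console
hello_invisible("Ana")            # Nothing is printed
msg <- hello_invisible("Ana")     # ... but the value can still be stored
print(msg)                        # Explicit printing works as usual

withVisible(hello_invisible("Ana"))$visible   # FALSE
```

Both functions return a value; invisible() only suppresses the automatic printing when the result is not assigned.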

Conditional Execution

x <- 10
if (x < 10) {
    print("x is smaller than 10")
} else if (x > 10) {
    print("x is larger than 10")
} else {
    print("x is exactly 10") 
}
## [1] "x is exactly 10"

Vectorized if

There is a special function, ifelse(), which performs an if-else decision element by element for each element of a vector (or matrix).

(x <- 1:6)
## [1] 1 2 3 4 5 6
ifelse(x %% 2 == 0, "even", "odd")
## [1] "odd"  "even" "odd"  "even" "odd"  "even"
ifelse(x %% 2 == 0,  x,  -x)
## [1] -1  2 -3  4 -5  6
(x <- matrix(1:12, nrow = 3,
             dimnames = list(paste("Row", 1:3), paste("Col", LETTERS[1:4]))))
##       Col A Col B Col C Col D
## Row 1     1     4     7    10
## Row 2     2     5     8    11
## Row 3     3     6     9    12
ifelse(x %% 2 == 0, x, -x)
##       Col A Col B Col C Col D
## Row 1    -1     4    -7    10
## Row 2     2    -5     8   -11
## Row 3    -3     6    -9    12
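
ifelse() calls can also be nested to distinguish more than two cases. A minimal sketch; the thresholds and the labels "low", "mid", and "high" are arbitrary choices for this example.

```r
x <- 1:6

# Inner ifelse() is only relevant for elements where the outer test is FALSE
size <- ifelse(x < 3, "low", ifelse(x > 4, "high", "mid"))
size
```

For many categories nested ifelse() calls become hard to read, so this is best kept to two or three cases.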

Loops

for loops

The simplest and most frequently used type of loop is the for loop. For loops in R always iterate over a sequence (a vector), where the length of the vector defines how often the action inside the loop is executed.

Basic usage: for (value in values) { action }

  • value: Current loop variable.
  • values: Set over which the variable iterates. Typically an atomic vector but can also be a list.
  • action: Executed for each value in values.
  • {…}: As for functions or if-statements, necessary when multiple commands are executed, optional for a single command.

for (i in 1:3) {
    print(i)
}
## [1] 1
## [1] 2
## [1] 3
for (i in c("Reto", "Ben", "Lea")) {
    print(i)
}
## [1] "Reto"
## [1] "Ben"
## [1] "Lea"
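
As mentioned, the set to iterate over can also be a list. In this case it is often convenient to loop over the indices and store the results in a pre-allocated vector. A small sketch; the objects items and classes are made up for this example.

```r
items   <- list(1:3, "text", TRUE)
classes <- character(length(items))    # Pre-allocate the result vector

for (i in seq_along(items)) {
    classes[i] <- class(items[[i]])    # Store class of the i-th list element
}
classes
```

seq_along(items) returns 1, 2, 3 here; in contrast to 1:length(items) it also behaves correctly for an empty list, where the loop is simply skipped.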

while loops

The second type of loop is while. In contrast to a for-loop which runs for a fixed number of iterations, a while-loop runs while a condition is true.

Basic usage: while (condition) { action }

  • condition: Logical condition, has to be FALSE or TRUE.
  • action: Executed as long as the condition is TRUE.
  • {…}: Necessary for multiple commands, optional for single ones.

# Start with 0
x <- 0
# Loop until condition is FALSE
while (x^2 < 20) {
  print(x)      # Print x
  x <- x + 1    # Increase x by 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

repeat loops

The last one is a repeat-loop. In contrast to the other two the repeat loop runs forever – until we explicitly stop it by calling break.

Basic usage: repeat { action }

  • action: Executed until the break statement is called. Thus, don’t forget to include break.
  • {…}: Necessary for multiple commands, optional for single ones.

Remarks:

  • More rarely used compared to for and while loops.
  • Not necessary for any task in this course!
  • (But super simple to write).
# Initialization
x <- 0
# Repeat loop
repeat {
    if (x^2 > 20) break     # Break condition (important)
    print(x)                # print(x)
    x <- x + 1              # Increase x by 1
}
## [1] 0
## [1] 1
## [1] 2
## [1] 3
## [1] 4

Loop replacements

Instead of the three basic repetitive control structures (for, while, and repeat), R comes with a series of functions which can be used as replacements. These ‘loop replacements’ are real functions (no longer control statements). The example below uses one of them, apply(), which applies a function to the rows (margin 1) or columns (margin 2) of a matrix.

set.seed(1)
(x <- matrix(rnorm(20), nrow = 4,
             dimnames = list(NULL, LETTERS[1:5])))
##               A          B          C           D           E
## [1,] -0.6264538  0.3295078  0.5757814 -0.62124058 -0.01619026
## [2,]  0.1836433 -0.8204684 -0.3053884 -2.21469989  0.94383621
## [3,] -0.8356286  0.4874291  1.5117812  1.12493092  0.82122120
## [4,]  1.5952808  0.7383247  0.3898432 -0.04493361  0.59390132

For each column of the matrix we would like to calculate the mean and the standard deviation, and count all positive elements.

apply(x, 2, mean)
##           A           B           C           D           E 
##  0.07921043  0.18369829  0.54300434 -0.43898579  0.58569212
apply(x, 2, sd)
##         A         B         C         D         E 
## 1.1021597 0.6902835 0.7489618 1.3889432 0.4266423
npos <- function(x) { sum(x > 0) }
apply(x, 2, npos)
## A B C D E 
## 2 3 3 1 3

Data Frames

Data frame introduction

Data frames – a combination of lists and (most often) atomic vectors

Reminder: So far we have only learned to use matrices to represent rectangular data and learned about the following properties.

  • Matrices are vectors with an additional dimension attribute for rows and columns.
  • Because they are based on vectors, they must be homogeneous (numeric, character, …).
  • An alternative object is needed for empirical data which typically contain heterogeneous columns.

Data frame: Most commonly used data type to represent rectangular data in R.

  • Data frames are always two-dimensional objects.
  • Columns of a data frame correspond to variables, rows to observations (different jargon).
  • Look like a matrix and offer similar subsetting methods, but are not matrices.
  • Internally, a data frame is a list with a series of objects of the same length.
  • The list-elements (variables) are most often vectors of potentially different types, however, more complex objects such as lists or matrices (for specific variables) are possible.
  • Variables (columns) are always named, the names are no longer optional.
  • Observations (rows) also always have names, by default they are set to “1”, “2”, … but can be changed.

As mentioned, data frames look like matrices but are based on a list. This gives them more flexibility than matrices, as different types of data can be stored in different columns/variables, e.g., characters in the first column, integers in the second, and logicals in the third.

Creating data frames

Not surprisingly, data frames are constructed using the function data.frame(). The function works very similarly to list(), with some additional options. Let us have a look at the usage from the manual (see ?data.frame).

?data.frame

Basic usage: Let us create a simple demo data frame with two variables. The first one is an integer sequence of length \(3\) , the second a character vector of length \(3\) as well.

data.frame(1:3, c("a", "b", "c"))
##   X1.3 c..a....b....c..
## 1    1                a
## 2    2                b
## 3    3                c

As we have not named our variables R tries to guess a name. Well, not very useful. What if we pre-specify the two vectors and call them \(x\) and \(y\)?

x <- 1:3
y <- c("a", "b", "c")
data.frame(x, y)
##   x y
## 1 1 a
## 2 2 b
## 3 3 c

Alternatively we can create a data frame with names by using a series of key = value arguments.

(df <- data.frame(age = c(35, 21, 12), height = c(1.72, 1.65, 1.39)))
##   age height
## 1  35   1.72
## 2  21   1.65
## 3  12   1.39

Heterogeneous data: The big advantage is that we can now create heterogeneous rectangular objects. The following example shows a data frame with four variables, all based on vectors of different types.

(df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),     # character
            age = c(35L, 21L, 12L),                       # integer
            height = c(1.72, 1.65, 1.39),                 # numeric
            austrian = c(FALSE, TRUE, TRUE),              # logical
            stringsAsFactors = FALSE                      # default since R 4.0.0
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

Let us now investigate the new object, first using the str() function.

str(df)
## 'data.frame':    3 obs. of  4 variables:
##  $ name    : chr  "Petra" "Jochen" "Alexander"
##  $ age     : int  35 21 12
##  $ height  : num  1.72 1.65 1.39
##  $ austrian: logi  FALSE TRUE TRUE

The output looks similar to the one for a list with a few important differences. The first line tells us that we are dealing with a \(data.frame\) with \(3\) observations (\(obs.\)) and \(4\) variables. Once more: ‘observations’ correspond to rows, variables to columns.

Data frame attributes

Like all other objects, data frames have a type (always “list”) and a length. Besides these, data frames always have at least the following three properties:

  • Class: Data frames are of class “data.frame”.
  • Dimension: 2-dimensional rectangular objects; dim() always returns the number of rows and columns.
  • Names: Data frames must always be named; names are no longer optional (as they are for matrices).

Let us look at the object \(df\) defined above and what class and type it comes with.

head(df, n = 1) # First observation (row)
##    name age height austrian
## 1 Petra  35   1.72    FALSE
class(df)       # Class is data.frame
## [1] "data.frame"
typeof(df)      # The underlying type is a list (based on a list)
## [1] "list"

The class is data.frame while the type is list. This is similar to matrices, which are of class matrix and array while their type is always one of the atomic types (double, integer, …). Again, a function exists to check whether a given object is a data frame:

c("is.data.frame" = is.data.frame(df),
  "is.list"       = is.list(df),
  "is.matrix"     = is.matrix(df),
  "is.atomic"     = is.atomic(df))
## is.data.frame       is.list     is.matrix     is.atomic 
##          TRUE          TRUE         FALSE         FALSE

Note that is.data.frame() and is.list() both return TRUE! However, as shown, a data frame is not a matrix, and it is also not atomic because it is based on a generic vector (a list).

The length of a data frame is the length of its underlying list, which is the same as the second dimension (or number of variables/columns). The first dimension (number of rows) tells us how many observations the data frame contains.

length(df)      # Length of underlying list
## [1] 4
dim(df)         # Like for matrices
## [1] 3 4
nrow(df)
## [1] 3
ncol(df)
## [1] 4

As data frames always have names, let us look at what the three functions names(), colnames(), and rownames() return.

names(df)
## [1] "name"     "age"      "height"   "austrian"
colnames(df)
## [1] "name"     "age"      "height"   "austrian"
rownames(df)
## [1] "1" "2" "3"

names() and colnames() do the very same and return the names of the variables. When working with data frames one typically uses names() (not colnames()). In addition, we can see that R automatically assigned the default row names “1”, “2”, and “3” as we have not specified anything different. As always, these functions return character vectors!
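
The default row names can be changed, and the same functions can be used to rename variables. A minimal sketch on a reduced version of the data frame; the new variable name height_m is a hypothetical choice for this example.

```r
df <- data.frame(age    = c(35L, 21L, 12L),
                 height = c(1.72, 1.65, 1.39))

# Replace the default row names "1", "2", "3"
rownames(df) <- c("Petra", "Jochen", "Alexander")
df["Jochen", "age"]          # Row names can now be used for subsetting

# Rename the second variable
names(df)[2] <- "height_m"
names(df)
```

Row names must be unique; assigning duplicated row names results in an error.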

Subsetting data frames

Subsetting combines the two worlds of lists and matrices. We can basically use all the subsetting techniques learned in the previous chapters. In addition, we can use a function called subset(), which comes in very handy when working with data frames.

Matrix/list-like

(df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),     # character
            age = c(35L, 21L, 12L),                       # integer
            height = c(1.72, 1.65, 1.39),                 # numeric
            austrian = c(FALSE, TRUE, TRUE),              # logical
            stringsAsFactors = FALSE                      # default
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

List-style subsetting: The following subsetting commands all work on the data frame above.

Subsetting by positive or negative indices, using single or double square brackets to get either a subset (still a data frame, but with fewer variables) or the content (element).

df[1]        # Returns sub-data-frame
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander
df[-1]       # Sub-data-frame (all except the first variable; negative index)
##   age height austrian
## 1  35   1.72    FALSE
## 2  21   1.65     TRUE
## 3  12   1.39     TRUE
df[[2]]      # Content of the second element
## [1] 35 21 12

Or use logical vectors for subsetting. This is most often only used with logical expressions.

df[c(TRUE, FALSE, TRUE, FALSE)]
##        name height
## 1     Petra   1.72
## 2    Jochen   1.65
## 3 Alexander   1.39

Subsetting by name using either square brackets or the $ operator.

df["name"]   # Returns sub-data-frame
##        name
## 1     Petra
## 2    Jochen
## 3 Alexander
df$name      # Content of variable 'name'; preferred way
## [1] "Petra"     "Jochen"    "Alexander"
df[["name"]] # Content of variable 'name'; same as df$name
## [1] "Petra"     "Jochen"    "Alexander"

And, last but not least, we could also use the recursive subsetting if needed. For example we could get the third list element (or variable/column) and extract the first element (or observation/row) from that.

df[[c(3, 1)]]     # Nested/recursive subsetting
## [1] 1.72

Matrix-style subsetting: In addition, we can use all matrix subsetting techniques to get specific elements (df[1, 3]), entire rows/observations (df[1, ]), or specific columns/variables (df[, 2]). This can be done using indices, names, or logical vectors (just as for matrices).

df[1, 3]      # Content of first observation (row), third variable (column)
## [1] 1.72
df[1, ]       # First observation (row); returns a data frame with one observation
##    name age height austrian
## 1 Petra  35   1.72    FALSE
df[, 1]       # First variable/column; returns the content
## [1] "Petra"     "Jochen"    "Alexander"
df[1, "name"] # Using names and indices
## [1] "Petra"
# By logical vectors
df[c(TRUE, FALSE, FALSE), c(TRUE, TRUE, FALSE, FALSE)]
##    name age
## 1 Petra  35

Mixed subsetting: We can of course combine subsetting techniques:

df$name[3]
## [1] "Alexander"
df[[2]][3]
## [1] 12
df[1, ]$name
## [1] "Petra"
df[df$height > 1.60, "austrian"]
## [1] FALSE  TRUE

The subset() function

The function subset() is another way to subset rows and columns of a data frame and is often more convenient to use.

Usage

## S3 method for class 'data.frame'
#subset(x, subset, select, drop = FALSE, ...)

Important arguments

  • x: object to be subsetted (not necessarily data frame).
  • subset: logical expression indicating elements or rows to keep: missing values are taken as false.
  • select: expression, indicating columns to select from a data frame.
  • drop: passed on to the ‘[’ indexing operator.

Example: Let us use the data frame from above.

(df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),     # character
            age = c(35L, 21L, 12L),                       # integer
            height = c(1.72, 1.65, 1.39),                 # numeric
            austrian = c(FALSE, TRUE, TRUE),              # logical
            stringsAsFactors = FALSE                      # default
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

Let us assume we are interested in all people who are 18 or older and we only want their name and height. The logical expression therefore is df$age >= 18, and we can use it in combination with matrix-like subsetting as follows:

# All rows where age >= 18, only columns c("name", "height")
df[df$age >= 18, c("name", "height")]
##     name height
## 1  Petra   1.72
## 2 Jochen   1.65

The same can be done using the subset() function.

subset(x = df, subset = age >= 18, select = c(name, height))
##     name height
## 1  Petra   1.72
## 2 Jochen   1.65
subset(df, age >= 18, c(name, height))
##     name height
## 1  Petra   1.72
## 2 Jochen   1.65

What the function does: it takes the first argument x and uses all variables/elements in this object to evaluate the additional arguments subset and select (if specified). As shown, the variable names (age, name, height) are no longer in quotes. This works because subset() uses a technique called ‘non-standard evaluation’.
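
To get a feeling for what evaluating an expression ‘inside’ the data frame means, we can use the base R function with(), which is not part of the text above and only serves as an illustration here. It evaluates an expression using the variables of the data frame, so that age can be written without the df$ prefix.

```r
df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),
            age = c(35L, 21L, 12L),
            height = c(1.72, 1.65, 1.39),
            stringsAsFactors = FALSE
)

# Evaluate the expression using the variables of df
keep <- with(df, age >= 18)
keep

# Identical to the explicit version used for matrix-like subsetting
identical(keep, df$age >= 18)
```

A plain age >= 18 outside of with() or subset() would look for an object age in the workspace, not in the data frame.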

In the same way we could also get all non-Austrians in the data set. If we don’t define ‘select’ all variables/columns will be returned.

# The variable `austrian` is already a logical vector.
# `!austrian` negates the vector (turns every TRUE to FALSE and vice versa).
# Can be used to subset all non-austrians from the data frame `df`.
subset(df, !austrian)
##    name age height austrian
## 1 Petra  35   1.72    FALSE

The return of subset() will be a data frame if the first argument \(x\) is a data frame, except if we select a single column (variable) and set drop = TRUE. In this case we only get a vector; in the example below, a logical vector.

(x <- subset(df, age >= 18, austrian, drop = TRUE))
## [1] FALSE  TRUE
class(x)
## [1] "logical"

Graphical summary/recap

[Figure: data frames as a combination of lists and (most often) atomic vectors.]

Replacing/deleting/adding variables

Element-replacement works the same way as for lists. If we use subsetting and assign the value NULL, the variable will be deleted from the data frame.

df$name <- NULL
df
##   age height austrian
## 1  35   1.72    FALSE
## 2  21   1.65     TRUE
## 3  12   1.39     TRUE

We can also replace an entire variable by assigning a new vector to an existing variable name (take care of the length; shorter vectors are recycled), or add a new variable by assigning to a new (not yet existing) variable name.

# Adds a completely new variable
df$nationality <- ifelse(df$austrian, "AT", NA)
# Replaces an existing column
df$height      <- as.integer(df$height * 100)
# Replace one element
df$age[2]      <- 102
# Print resulting data frame
df
##   age height austrian nationality
## 1  35    172    FALSE        <NA>
## 2 102    165     TRUE          AT
## 3  12    139     TRUE          AT
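
The remark about the length deserves a short demonstration. In the sketch below a single value is recycled to all rows, while a vector whose length does not divide the number of rows is rejected. The variable names group and bad are hypothetical, and tryCatch() is only used so that the example runs through without stopping.

```r
df <- data.frame(age = c(35, 21, 12))

# A single value is recycled for all three observations
df$group <- "A"
df

# Two values do not fit three rows: R throws an error
res <- tryCatch({
    df$bad <- c(1, 2)
    "no error"
}, error = function(e) "error")
res
```

Because the assignment fails, the data frame is left unchanged and does not contain a variable bad afterwards.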

Coercing data frames

To some extent, we can coerce (convert) other R objects into data frames, or vice versa. However, this is only possible in certain situations.

Vector to data frame: A single vector yields a single-column data frame with one variable. The object name will be used as variable name.

name <- c("Jochen", "Martina")
as.data.frame(name)
##      name
## 1  Jochen
## 2 Martina

Matrix to data frame: Each column of the matrix will be converted into a variable. Row and column names will be preserved if specified, else R will add default values.

(mat <- matrix(1:6, nrow = 2, dimnames = list(c("Row 1", "Row 2"), LETTERS[1:3])))
##       A B C
## Row 1 1 3 5
## Row 2 2 4 6
(df <- as.data.frame(mat))
##       A B C
## Row 1 1 3 5
## Row 2 2 4 6
str(df)
## 'data.frame':    2 obs. of  3 variables:
##  $ A: int  1 2
##  $ B: int  3 4
##  $ C: int  5 6

This results in a homogeneous data frame (only contains integer variables). In this example we started with an integer matrix and converted the matrix into a data frame. This allows us to easily convert the data frame back into a matrix.

(mat2 <- as.matrix(df))
##       A B C
## Row 1 1 3 5
## Row 2 2 4 6
typeof(mat2)
## [1] "integer"
identical(mat, mat2)
## [1] TRUE

Going from mat → df → mat2 gives us an object identical to the one we started with. This no longer works for heterogeneous data frames (shown next).

Heterogeneous data frames to matrix: Assume we have the following data frame again.

(df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),     # character
            age = c(35L, 21L, 12L),                       # integer
            height = c(1.72, 1.65, 1.39),                 # numeric
            austrian = c(FALSE, TRUE, TRUE),              # logical
            stringsAsFactors = FALSE                      # default
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

Can we convert this object into a matrix and back into a data frame without losing anything? The following line converts the object df into a matrix (as.matrix()) and directly back into a data frame (as.data.frame()).

# Coerce to matrix and back to data frame
(df2 <- as.data.frame(as.matrix(df)))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE
identical(df, df2)
## [1] FALSE

Well, it seems to work. At least it does something. As we can see, df and df2 are no longer identical. Something important happened which will cause problems when working with this new df2 object as it is. For demonstration, let us calculate the arithmetic mean of the variable age.

mean(df2$age)
## Warning in mean.default(df2$age): argument is not numeric or logical: returning
## NA
## [1] NA

We get an NA and a warning, that our input was not numeric or logical. If we have a second look at the object df2 we can see that all our variables are now characters.

str(df2)
## 'data.frame':    3 obs. of  4 variables:
##  $ name    : chr  "Petra" "Jochen" "Alexander"
##  $ age     : chr  "35" "21" "12"
##  $ height  : chr  "1.72" "1.65" "1.39"
##  $ austrian: chr  "FALSE" "TRUE" "TRUE"

The reason is that when we convert the data frame to a matrix (as.matrix()) R has to convert the information in the data frame into a vector. As a vector (and thus matrices) can only contain data of one type, everything is converted into character in this case. Take care of this!
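
One way to avoid the coercion to character, sketched here as a suggestion and not taken from the text above, is to select only the numeric variables before converting to a matrix. The object name num_mat is chosen for this example.

```r
df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),
            age = c(35L, 21L, 12L),
            height = c(1.72, 1.65, 1.39),
            austrian = c(FALSE, TRUE, TRUE),
            stringsAsFactors = FALSE
)

# Only integer and numeric variables: the matrix stays numeric
num_mat <- as.matrix(df[, c("age", "height")])
typeof(num_mat)
mean(num_mat[, "age"])       # Works, no warning
```

As integers and doubles have to share one type in the matrix, the integer variable age is stored as double, which is harmless for calculations.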

List to data frame: As data frames are based on lists, we can always convert data frames into lists, but also lists into data frames. An example:

(df <- as.data.frame(list(x = c(1, 2, 3, 4), y = c("A", "B"))))
##   x y
## 1 1 A
## 2 2 B
## 3 3 A
## 4 4 B

Note: all variables of a data frame need to be of the same length! Thus, R is recycling the elements of the argument y such that the length matches the length of the longer argument x. When converting the data frame into a list, we will get the following:

as.list(df)
## $x
## [1] 1 2 3 4
## 
## $y
## [1] "A" "B" "A" "B"

Combining data frames

Data frames can also be combined. Let us assume we have the following three data frames: two containing the geographical position of some cities (df1, df2; name, latitude, longitude) and one containing the number of inhabitants of two cities (df3). We would like to combine them into one single object.

(df1 <- data.frame(name = c("Moskow", "Brasilia"),
                   lat  = c(55.8, -15.8),
                   long = c(37.6, -47.8)))
##       name   lat  long
## 1   Moskow  55.8  37.6
## 2 Brasilia -15.8 -47.8
(df2 <- data.frame(name = c("Innsbruck", "Graz"),
                   lat  = c(47.2, 47.4),
                   long = c(11.2, 15.3)))
##        name  lat long
## 1 Innsbruck 47.2 11.2
## 2      Graz 47.4 15.3
(df3 <- data.frame(name = c("Graz", "Innsbruck"),
                   inhabitants = c(294600, 132500)))
##        name inhabitants
## 1      Graz      294600
## 2 Innsbruck      132500

There are different functions to do so, however, they all have their difficulties. As for matrices, we can use cbind() and rbind() (with some limitations) or data.frame() to combine two data frames.

Row-binding: In case of df1 and df2 we could think of using row-binding as the two data frames do have the very same structure.

rbind(df1, df2)
##        name   lat  long
## 1    Moskow  55.8  37.6
## 2  Brasilia -15.8 -47.8
## 3 Innsbruck  47.2  11.2
## 4      Graz  47.4  15.3

This only works if the two data frames have the very same variable names. Coercion will take place if the data types of some variables differ.
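
The following sketch shows what happens if the variable names do not match, and one possible way to fix it. The two small data frames a and b (with the variable city instead of name) are made up for this example; tryCatch() is only used so that the script continues after the error.

```r
a <- data.frame(name = "Graz",      lat = 47.4)
b <- data.frame(city = "Innsbruck", lat = 47.2)

# Variable names differ ("name" vs. "city"): rbind() throws an error
res <- tryCatch(rbind(a, b), error = function(e) "rbind() failed")
res

# After renaming, row-binding works
names(b) <- names(a)
(combined <- rbind(a, b))
```

Renaming is only appropriate if the two variables really contain the same kind of information, as is the case for city and name here.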

Column-binding: In the example of df2 and df3 we might combine the data frames side-by-side (column binding). The problem: R does not care about the content (we can see that we have a mismatch as the two data frames are in a different order; mixing information about Graz and Innsbruck).

cbind(df2, df3)
##        name  lat long      name inhabitants
## 1 Innsbruck 47.2 11.2      Graz      294600
## 2      Graz 47.4 15.3 Innsbruck      132500

Warning: elements will be recycled if the numbers of rows do not match but one is a whole multiple of the other; otherwise R throws an error. In addition, we can see that we now have two variables with the very same name (“name”). This is never a good idea and should be avoided!

Using data.frame(): As an alternative to cbind() we could use data.frame(). It takes care of duplicated variable names, however, has the same problems as cbind() regarding the mapping of the data.

data.frame(df2, df3)
##        name  lat long    name.1 inhabitants
## 1 Innsbruck 47.2 11.2      Graz      294600
## 2      Graz 47.4 15.3 Innsbruck      132500

Still not the best idea.

Merge: The best option for this case is merge(), which merges two data frames. If the same variable exists in both data frames, it is used to match the rows, making sure that we combine the correct values! In our case we have name in both data frames. Check the difference to before:

merge(df2, df3)
##        name  lat long inhabitants
## 1      Graz 47.4 15.3      294600
## 2 Innsbruck 47.2 11.2      132500

R auto-detects that we have one column with the same name. Given the value in “name” the observations of the two data frames will be brought in the same order and then combined. The “name” variable is automatically used as the by argument (by which column the data should be merged). This can also be specified manually, e.g.:

merge(df2, df3, by = "name")
##        name  lat long inhabitants
## 1      Graz 47.4 15.3      294600
## 2 Innsbruck 47.2 11.2      132500

In case the variables are named differently in the two data frames we could also define a by.x (variable name in the first data frame) and by.y (second data frame). The function merge() has a series of arguments, check ?merge for details.
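
A short sketch of by.x and by.y: assume the second data frame stores the city names in a variable called city instead of name. The data frames pos and pop as well as the variable name city are hypothetical and only used for this illustration.

```r
pos <- data.frame(name = c("Innsbruck", "Graz"),
                  lat  = c(47.2, 47.4),
                  long = c(11.2, 15.3))
pop <- data.frame(city = c("Graz", "Innsbruck"),
                  inhabitants = c(294600, 132500))

# Match 'name' (first data frame) against 'city' (second data frame)
(merged <- merge(pos, pop, by.x = "name", by.y = "city"))
```

The key variable keeps the name from the first data frame (name), and the rows are matched by content, not by position.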

Apply functions

In the chapter on loops we discussed so-called loop replacement functions and showed apply(), which can be used to apply a function to specific margins of a matrix. Besides apply(), a series of additional *apply() functions exist which can replace more complicated loops.


Important arguments:

  • X: A vector (atomic or list) or an expression object. Other objects will be coerced by as.list().
  • FUN: The function to be applied to each element of X. In the case of functions like +, %*%, the function name must be back-quoted or quoted.
  • FUN.VALUE: A (generalized) vector, a template for the return value from FUN.
  • …: Optional arguments forwarded to FUN.

Let us take the following data frame to see how the different *apply() functions work.

(df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),     # character
            age = c(35L, 21L, 12L),                       # integer
            height = c(1.72, 1.65, 1.39),                 # numeric
            austrian = c(FALSE, TRUE, TRUE),              # logical
            stringsAsFactors = FALSE                      # default
))
##        name age height austrian
## 1     Petra  35   1.72    FALSE
## 2    Jochen  21   1.65     TRUE
## 3 Alexander  12   1.39     TRUE

Function lapply(): We would like to get the class of all variables of this data frame. We could write a for-loop going over all variables, subset the specific column, and then call the class() function. Alternatively, we use the lapply() function.

lapply() applies a function to each element of an object. In case of a data frame, each element is one of our variables. Thus, when calling lapply(df, class) the function class() is applied once to each variable. The return of lapply() is a named list with the corresponding results.

(res <- lapply(df, class))
## $name
## [1] "character"
## 
## $age
## [1] "integer"
## 
## $height
## [1] "numeric"
## 
## $austrian
## [1] "logical"
class(res)
## [1] "list"
length(res)
## [1] 4

Function sapply(): sapply() works the very same way as lapply(), except that at the end R tries to simplify the result and return a vector or matrix. If simplification is not possible, we still get a list. Let us have a look at the same example again, now using sapply():

(res <- sapply(df, class))
##        name         age      height    austrian 
## "character"   "integer"   "numeric"   "logical"
class(res)
## [1] "character"
length(res)
## [1] 4

The function applied can, of course, be whatever you can think of. For example the length (must be the same for all variables due to the rectangular structure of the data frame) or the arithmetic mean (only works for numeric/logical variables).

sapply(df, length)
##     name      age   height austrian 
##        3        3        3        3
sapply(df, mean)
## Warning in mean.default(X[[i]], ...): argument is not numeric or logical:
## returning NA
##       name        age     height   austrian 
##         NA 22.6666667  1.5866667  0.6666667

Function vapply(): Similar to sapply(), but we pre-specify the type and length of the value returned for each element. This can be safer (the function throws an error if the return is something different) and in some situations also faster than sapply().

vapply(df, class, "")    # Return must be character
##        name         age      height    austrian 
## "character"   "integer"   "numeric"   "logical"
vapply(df, length, vector("integer", 1)) # Return must be integer
##     name      age   height austrian 
##        3        3        3        3
#vapply(df, length, vector("logical", 1)) # <- Throws the error

The last (commented-out) command would throw an error. We would like to get the length() of all elements (variables) in our data frame. As we know, length() returns a single integer value. However, we specified that each result must be a logical value. As the function returns something other than expected, vapply() throws an error and the script stops.
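
If we want to see this behaviour without stopping the script, we can wrap the call in tryCatch(). This is only a sketch for demonstration; the message text returned by the error handler is made up.

```r
df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),
            age = c(35L, 21L, 12L),
            stringsAsFactors = FALSE
)

# length() returns an integer, but we request a logical value
res <- tryCatch(vapply(df, length, logical(1)),
                error = function(e) "vapply() stopped: unexpected type")
res

# With the matching template the call succeeds
vapply(df, length, integer(1))
```

logical(1) and integer(1) are short forms of vector("logical", 1) and vector("integer", 1) as used above.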

Using extra arguments: Let us again calculate the mean() of each variable but this time add some missing values to the data frame first. Maybe not the most realistic example, but shows how we can forward additional arguments to the function.

# Adding some missing values
df[1, 2] <- NA
df[2, 3] <- NA
df[3, 4] <- NA
df
##        name age height austrian
## 1     Petra  NA   1.72    FALSE
## 2    Jochen  21     NA     TRUE
## 3 Alexander  12   1.39       NA

When calling sapply(df, min) we will get NA for the three variables “age”, “height”, and “austrian” as each variable contains a missing value. However, as we know we can use min(x, na.rm = TRUE) to remove these missing values. We can forward this na.rm = TRUE argument to our function min() as follows:

sapply(df, min)
##        name         age      height    austrian 
## "Alexander"          NA          NA          NA
sapply(df, min, na.rm = TRUE)
##        name         age      height    austrian 
## "Alexander"        "12"      "1.39"         "0"

The result is a named character vector – simply because the minimum of the name is “Alexander” (lexicographical minimum) and R performs implicit coercion and converts all values into characters.
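
If we are only interested in the numeric variables, we can avoid the coercion to character by selecting them first. A minimal sketch, assuming the data frame with missing values from above; the object names is_num and mins are chosen for this example.

```r
df <- data.frame(
            name = c("Petra", "Jochen", "Alexander"),
            age = c(NA, 21L, 12L),
            height = c(1.72, NA, 1.39),
            austrian = c(FALSE, TRUE, NA),
            stringsAsFactors = FALSE
)

# Logical vector: which variables are numeric (integer or double)?
is_num <- sapply(df, is.numeric)
is_num

# Apply min() to the numeric variables only; result is a numeric vector
(mins <- sapply(df[is_num], min, na.rm = TRUE))
```

Note that is.numeric() is FALSE for the logical variable austrian, so it is excluded as well.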

Custom function: Instead of using existing functions we can also write custom functions and use them in combination with *apply(). Let us create a simple function which returns c(NA, NA) if the argument x is not numeric, else the range of the data (removing missing values; na.rm = TRUE).

num_range <- function(x, na.rm = TRUE) {
    if (!is.numeric(x)) {
        res <- c(NA, NA)
    } else {
        res <- range(x, na.rm = na.rm)
    }
    return(res)
}
num_range(LETTERS[1:3])
## [1] NA NA
num_range(1:5)
## [1] 1 5

… and apply that to our data frame using lapply() and sapply(). Each variable (column) is again used as input to our new function, which in this case returns a vector with two elements. lapply() keeps these results in a list, the result is a list of length 4 each containing a vector of length 2. sapply() now returns a matrix – simply because the result of each function call is now a vector of length 2.

lapply(df, num_range)
## $name
## [1] NA NA
## 
## $age
## [1] 12 21
## 
## $height
## [1] 1.39 1.72
## 
## $austrian
## [1] NA NA
sapply(df, num_range)
##      name age height austrian
## [1,]   NA  12   1.39       NA
## [2,]   NA  21   1.72       NA

Summary

A data frame is one of the most common objects in R when working with data sets and data driven methods. The image above tries to summarize the three classes matrices, lists, and data frames in a more artistic way :). From left to right:

  • Matrices: Rectangular 2-dimensional homogeneous objects; based on vectors.
  • Lists: Most flexible data structure in R. Allow to store objects of different classes in a recursive way to construct highly complex objects if needed.
  • Data frames: Rectangular 2-dimensional heterogeneous objects. Share some properties with matrices (the form) and lists (heterogeneity); based on lists.

In this chapter we have seen how to construct simple and very tiny data frames. In reality a data frame might contain several hundreds or thousands of observations (rows) and dozens of variables (columns). A quick overview to recap the new content:

  • Creating data frames: Using the function data.frame().
  • Rectangular: Data frames are always 2-dimensional and rectangular.
  • Jargon: Rows refer to ‘observations’, columns to ‘variables’.
  • Name attribute: Data frames must have names (mandatory; both dimensions).
  • Heterogeneity: Variables can contain different data types, most often (but not restricted to) vectors.
  • Subsetting: Matrix- and list-like subsetting or using the subset() function.
  • Replace/remove/modify: Variables can be removed, replaced, or added using subsetting in combination with assigning new values.
  • Apply-functions: The loop replacement functions are handy to apply a function to all variables in the object.