General information on R

software environment for statistical computing and graphics

R is based on S

GPL (GNU General Public License) -> free software

R-1.0.0 was published in 2000; current version: R-3.5.3 (3.6.0 release soon)

R is available for several operating systems (including 64-bit versions)

functionality can be extended by packages

packages are bundles of functions and data (and corresponding documentation)

Comprehensive R Archive Network (CRAN) is the primary source for packages. Packages on CRAN: April 2013: 4,437; October 2014: 5,966; September 2015: 7,084; September 2016: 9,237; April 2019: 14,030

wide dissemination in academia + private industry (see pdfs)

advantages: flexibility (no separation between input and output), transparency (documentation and source code publicly available), extensibility (packages)

web page and further information: https://www.r-project.org/
literature: http://ftp5.gwdg.de/pub/misc/cran/other-docs.html

Using RStudio

RStudio is an all-in-one front-end for R that combines the R console with a GUI script editor and package management. R itself must be installed separately. RStudio can be obtained from https://www.rstudio.org for free.

Write code in script files (File -> New File -> R Script). Execute code by pressing Ctrl + r (Cmd + r on Mac). Execution copies the selected code chunk into the console, where the command is processed.

R as a simple calculator

5 + 5 # summation
5 * 5 # multiplication

5^0.5 # square root; the decimal separator is a point
5^.5

# note the difference:
5^(.5/2)
5^.5/2

Objects & workspace

Objects in R come in different types, ranging from numbers, vectors, and matrices to functions. An object is created by assignment via <- or =.

test <- 101         # assignment
102 -> test2        # not recommended
test3 = 103         # equality sign

Once created, objects are stored in the workspace (see upper-right panel in RStudio). Workspaces can be saved and loaded (Session -> Load Workspace / Save Workspace as).

Objects in the workspace can be listed and removed.

ls()        # list all objects in workspace
rm(test3)   # remove object test3

The workspace is the global environment. Code run in the console can only work with objects that exist in the workspace (or that are provided by loaded packages).

test2       # show object test2
test3       # error as test3 has been removed from workspace
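The error above can be avoided by checking whether an object exists before using it. The following is a minimal sketch using only base R's exists() and rm(); the object name tmp_value is made up for illustration.

```r
tmp_value <- 42                      # hypothetical object, created only for this sketch
before <- exists("tmp_value")        # TRUE: tmp_value is in the workspace
rm(tmp_value)                        # remove it again
after <- exists("tmp_value")         # FALSE: calling tmp_value now would raise an error
c(before = before, after = after)    # show both results
```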

Packages

Any user of R can distribute his/her code, functions, and data via a standardized procedure by bundling the content in packages. Packages are centrally stored on repository servers worldwide. To browse packages visit https://cran.r-project.org/. Before a package (and the corresponding functions and data) can be used, it needs to be downloaded from CRAN once and afterwards loaded into the workspace, e.g.:

install.packages("robustbase")  # download package robustbase
library(robustbase)             # load package robustbase
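Because a package only needs to be downloaded once, a script may first check whether it is already installed. The sketch below is one possible pattern, not the only one; it uses base R's requireNamespace() and, to stay free of side effects, only prints a hint instead of installing anything.

```r
pkg <- "robustbase"                        # package name from the example above

# requireNamespace() returns TRUE if the package is installed (without attaching it)
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Package ", pkg, " is not installed; run install.packages(\"", pkg, "\") once.")
} else {
  library(pkg, character.only = TRUE)      # character.only is needed because pkg is a string
}
```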

Functions

In R, functions are chunks of code bundled in a separate environment. There are predefined functions included in the packages of R and self-written functions. Once a package is loaded or a self-written function is passed to the console, the functions are available for use. Self-written functions appear in the workspace overview, while functions from packages are hidden.

Structure

Functions are written as
<*name*> <- function(<*argument1*>,<*argument2*>,...) {<*code*>}
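A concrete instance of this template might look as follows. The function name power_of and its arguments are invented for illustration; the sketch also shows that an argument can carry a default value and that the value of the last expression is returned.

```r
# name: power_of; arguments: x and p (p has the default value 2)
power_of <- function(x, p = 2) {
  x^p                  # the value of the last expression is returned
}

power_of(3)            # uses the default p = 2
power_of(3, p = 3)     # overrides the default
```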

Help & search

Each function contained in an R package has a help page, which is accessible via help(<*function name*>) or ?<*function name*>, e.g.

help(mean)
?mean

If you don’t know the name of a function, you can search the help files by ??<*search phrase*>, e.g.

??regression
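Two further base R helpers may be handy when only part of a function name is known or when just the argument list is needed. This is a small sketch; the variable name found is arbitrary.

```r
args(mean)                 # print the argument list of a function
found <- apropos("mean")   # names of all visible objects containing "mean"
head(found)                # show the first few matches
```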

Using

Functions are used by passing values to the corresponding arguments. The help pages list the data types required for each argument. E.g.

mean(x = 5)     # there are 3 arguments of mean, see help page, x is the argument of which the mean is calculated 
#mean(y = 5)     # an argument called y is not defined for mean
mean(5)         # if the argument's name is not given, the order matters
mean(y <- 5)    # you can define an object also from within a function
y               # here is y

res <- sqrt(x = y) # results are objects
res2 <- res^2
res2

res3 <- sqrt        # functions are objects too
res3(x = 5)
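Matching arguments by name versus by position can be illustrated with a function that takes more than one argument. The sketch below uses mean() with its na.rm argument and seq(); the vector v is made up for illustration.

```r
v <- c(2, 4, NA, 9)
mean(v)                            # NA, because one value is missing
mean(v, na.rm = TRUE)              # 5: missing values are dropped first
mean(na.rm = TRUE, x = v)          # named arguments may be given in any order

seq(10, 12, 0.5)                   # positional matching: from, to, by
seq(by = 0.5, to = 12, from = 10)  # same sequence, matched by name
```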

Writing

When writing functions, the code can use all objects defined as arguments in the function definition and all objects in the workspace. Good programming practice calls for using only objects defined as function arguments.

foo <- function(x) {x^2}
foo(5)
foo(x = 5)

foo2 <- function(x, mu, sigma) { # define normal density function with mean mu and standard deviation sigma
  1/sqrt(2*pi*sigma^2) * exp(-(x-mu)^2/(2*sigma^2))}

foo2(0,0,1)     # value of standard normal density at 0
dnorm(0,0,1)    # check by comparing with implemented function
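Why relying on workspace objects inside a function is risky can be seen in a small sketch. All names (scale_factor, bad_scale, good_scale) are invented for illustration.

```r
scale_factor <- 10                                         # object in the workspace
bad_scale  <- function(x) x * scale_factor                 # silently depends on the workspace
good_scale <- function(x, scale_factor) x * scale_factor   # everything it needs is an argument

bad_scale(2)                        # 20, but only while scale_factor happens to be 10
scale_factor <- 100
bad_scale(2)                        # 200: same call, different result
good_scale(2, scale_factor = 10)    # 20, independent of the workspace
```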

Modes, types & classes of objects

Any object in R is of a particular type, is stored in a particular way, and belongs to a particular class. Types and storage modes describe how an object is handled in R, and object classes are based on how the objects can be used. The following table shows the functions which can be used to query class, mode and type of objects:

Type	Mode	Class
typeof()	mode()	class()

Note that in most texts the distinction between data types, object types, storage modes, and classes is not clear-cut and depends on the context. Usually, the term data types refers to vectors, matrices, and so on.

Types & storage modes

An atomic object of length one is usually called a scalar. The atomic types are “logical”, “integer”, “double”, “complex”, “character”, and “raw”. Other common types are “list” as well as “closure” and “builtin” (the latter two refer to functions). Here are some examples:

mode(x <- 5)                # the storage mode of a number is numeric 
typeof(x)                   # by default numerics are double-precision floating-point numbers
mode(y <- "test")           # character strings are stored as character strings
typeof(y)                   # ... and are character strings
foo <- function(x) {x^2}  
mode(foo)                   # functions are stored as functions
typeof(foo)                 # ... and declared as encapsulated chunk of code 
typeof(list)                # but there are also functions which are only references to internal procedures (mostly written in C)
mode(list)                  # ... which are nonetheless stored as functions

Here is a comparative table with some examples:

##                   typeof(.)    mode(.)      class(.)    
## NULL              "NULL"       "NULL"       "NULL"      
## 1                 "double"     "numeric"    "numeric"   
## 1:1               "integer"    "numeric"    "integer"   
## 1i                "complex"    "complex"    "complex"   
## list(1)           "list"       "list"       "list"      
## data.frame(x = 1) "list"       "list"       "data.frame"
## foo               "closure"    "function"   "function"  
## c                 "builtin"    "function"   "function"  
## lm                "closure"    "function"   "function"  
## y ~ x             "language"   "call"       "formula"   
## expression((1))   "expression" "expression" "expression"
## 1 < 3             "logical"    "logical"    "logical"
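A comparable overview can be produced by applying the three query functions to a list of objects. The sketch below is one way to do this, not necessarily how the table above was generated; the names objs and info are arbitrary, and only the first class of each object is reported.

```r
# a list can hold objects of any type, including functions
objs <- list(1, 1:1, 1i, "a", TRUE, list(1), data.frame(x = 1), sum, mean)

info <- data.frame(
  typeof = sapply(objs, typeof),
  mode   = sapply(objs, mode),
  class  = sapply(objs, function(o) class(o)[1]),  # first class only
  stringsAsFactors = FALSE
)
info
```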

Logical objects only have the values TRUE or FALSE (or T and F in short-hand). They result from logical evaluations, e.g.

Expression	Meaning
==	equality
!=	inequality
>, >=	greater (or equal)
<, <=	smaller (or equal)
!	not
&	and
|	or

x <- TRUE     # define a logical scalar
typeof(x)     # check type
3 < 1         # logical evaluation
x <- 3 
y <- 1
x < y         # also with objects
z <- x < y  
typeof(z)
z2 <- 1 <= 2   # looks confusing
z & z2         # combine logical expressions
z | z2
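The operators above also work elementwise on vectors, and the resulting logical vectors can be summarized. A short sketch with a made-up vector v, using only base R:

```r
v <- c(3, 8, 1, 6)
big <- v > 2           # elementwise comparison: TRUE TRUE FALSE TRUE
any(big)               # TRUE: at least one entry is larger than 2
all(big)               # FALSE: not every entry is larger than 2
sum(big)               # 3: TRUE counts as 1, FALSE as 0
big & v < 7            # elementwise "and": TRUE FALSE FALSE TRUE
big | v < 7            # elementwise "or":  TRUE TRUE TRUE TRUE
```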

Data structures

Native data types (vectors, matrices, and arrays) consist of scalars of the same storage type only (i.e. only numbers, only characters, or only logicals), while advanced data types (data frames and lists) can contain data objects of different storage modes.

Name	Dimension	Constructor function
vector	1	c(), numeric()
matrix	2	matrix()
array	n	array()
data frame	2	data.frame()
list	1	list()

Vectors

Vectors can be generated by many functions such as

x <- numeric(5)               # initialize a numeric vector of five zeros
y <- c(5, 6, 7)               # generate a vector by connecting scalars via c() ("concatenate")
z <- c(5, "test", 7)          # does not work as intended: all entries are converted to character
typeof(x)
typeof(y)
typeof(z)
is.numeric(y)                 # check whether y is numeric
is.integer(y)                 # ... but it's not integer
y <- as.integer(y)            # unless we declare it as such
is.integer(y)                 
is.numeric(z)                 # check whether z is numeric
# sequences
i <- 5:7                      # short hand for integer vectors
is.integer(i)                 # ... is by definition integer
mode(i)
typeof(i)
i == y                        # are i and y the same (elementwise comparison)?
i == as.numeric(y)            # ... also when comparing integer with double
-5:5                          # also works with negative numbers
5:-5                          # ... or backwards      
seq(from = 10, to = 12.5, by = .5)  # sequence with equal increments
seq(from = 10, to = 12.5, length.out = 10) # sequence with a fixed number of elements
# repetitions
rep(x = 5, times = 3)           # repeats the argument a given number of times
rep(x = c(6,7), times=3)        # the argument can also be a vector
rep(x = c(6,7), each = 3)       # elementwise repetition
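The reason why c(5, "test", 7) does not stay numeric is that c() converts all entries to the most general type involved. The sketch below illustrates this ordering with base R only; the name z_num is arbitrary.

```r
typeof(c(TRUE, 2L))          # "integer": logical is promoted to integer
typeof(c(2L, 3.5))           # "double": integer is promoted to double
typeof(c(3.5, "test"))       # "character": everything becomes a string

z <- c(5, "test", 7)
# converting back yields NA for entries that are not numbers;
# suppressWarnings() hides the warning R would issue for "test"
z_num <- suppressWarnings(as.numeric(z))
z_num                        # 5 NA 7
```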

Matrices & data frames

Matrices can be generated similarly, either directly (via matrix()) or by binding vectors together:

x <- matrix(1:6, ncol=2)        # construct matrix with 2 columns filled columnwise (default)
matrix(1:6, ncol=2, byrow=T)    # ... or row by row
matrix(1:6, nrow=2)             # matrix with 2 rows
ncol(x)                         # number of columns
nrow(x)                         # number of rows
# constructing matrices from vectors 
x <- cbind(1:6, 2:7, 3:8)       # bind vectors column-wise 
dim(x)                          # reports dimensions of matrix
x <- rbind(1:6, 2:7, 3:8)       # row-wise binding
is.matrix(x)                    # check whether x is a matrix
# arrays
x <- array(1:12, dim=c(2,2,3))
dim(x)
# problem: different data types 
x <- cbind(1:3, c("rest", "test", "nest"))
is.matrix(x)                    # still a matrix, but numerics are automatically converted to strings
is.data.frame(x)                # FALSE: cbind() does not create a data frame here

Advanced data types can consist of elements of different storage types. We distinguish data frames and lists.
Data frames can be imagined as matrices composed column-wise (i.e. data frames are always 2-dimensional). Each column is a vector, and all columns have the same length. The difference from a matrix is that each column can have a different data type (e.g. a data frame can consist of numeric, logical, and character columns). This is useful for most real-world data sets.

x <- data.frame(a = 1:3, b = c("rest", "test", "nest"), c = c(T,F,T))     # a data frame with 3 columns named a, b, and c
is.matrix(x)
is.data.frame(x)
dim(x)            # dimension of data frame
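That each column keeps its own type can be checked directly. The sketch below rebuilds the data frame from above under the assumed name df and sets stringsAsFactors = FALSE explicitly, so that column b is a character vector regardless of the R version.

```r
df <- data.frame(a = 1:3, b = c("rest", "test", "nest"), c = c(TRUE, FALSE, TRUE),
                 stringsAsFactors = FALSE)   # keep b as character in every R version
sapply(df, class)    # one class per column: integer, character, logical
str(df)              # compact overview of the structure
m <- as.matrix(df)   # forcing it into a matrix converts everything to strings
typeof(m)            # "character"
```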

Lists

Lists can be imagined as generalized vectors in which the data types of the stored elements are arbitrary, i.e. you can combine matrices, vectors, data frames, and also other lists in a list. Lists are the most flexible and generic data type handled here.

library(tidyverse)          # not needed for lists; loaded here for the tibble examples further below
x <- list(a = 1:3, b = "nest", c = TRUE)  # a quite simple list
x <- list(a = 1:3, b = "nest", c = list(d = "test", e = rep(x = c(TRUE,FALSE), each = 3)))  # a more complicated list
is.list(x)
length(x)
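Elements of a list can be retrieved by name or by position. The following sketch uses base R only and a slightly shortened version of the nested list from above; it also shows the difference between double and single brackets.

```r
x <- list(a = 1:3, b = "nest", c = list(d = "test", e = c(TRUE, FALSE)))
names(x)       # names of the top-level elements: "a" "b" "c"
x$a            # element a (an integer vector)
x[["b"]]       # element b, accessed by name with double brackets
x[[3]]$d       # element d of the nested list stored at position 3
x["a"]         # single brackets return a list of length 1, not the element itself
```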

Other

More sophisticated data types are S4 objects, which are not discussed here. Moreover, note that data types can be defined by R users such that they can be designed to serve specific purposes. E.g. if you are working with dplyr you will stumble upon tibbles, which are basically data frames but with nicer handling. For an overview see https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html.

df1 <- data.frame("first variable" = 1:3, "second variable" = c("nest","test","fest") )   # note the change in variable names
df1[,2]                                 # in R < 4.0.0 the 2nd variable is a factor, not a character vector as expected (since R 4.0.0 it stays character)
tb1 <- tibble("first variable" = 1:3, "second variable" = c("nest","test","fest"))        # note variable names
tb1[,2]                                 # the 2nd variable is now a character
as_tibble(df1)                          # conversion to tibble 
as.data.frame(tb1)                      # ... and conversion to data frame (as_data_frame() is a deprecated alias of as_tibble())
class(df1)                              # data frames are data frames ...
class(tb1)                              # ... but tibbles are data frames, tibbles and tibble data frames

Indexing

Most tasks require extracting, altering, or updating parts of data objects. For this purpose, data objects are indexed by position, name, or logical identifier. Native data types (vectors, matrices, arrays) are indexed by <object name>[<index>]. Advanced data types can also be indexed by names, e.g. <object name>$<index name>.

Vectors

For vectors, indexing allows one to access and extract scalars and sub-vectors.

x <- c(5, 6, 7)
x[1]              # extract 1st entry --> result is a scalar
x[c(1,3)]         # extract 1st & 3rd entry --> result is a vector
x[4]              # a 4th entry does not exist --> result is NA
x[c(T,F,T)]       # indexing via logical vectors requires a logical index vector of the same length of the vector to be indexed
y <- x <= 6       # generate an indexing vector by logic evaluations
z <- which(y)     # retain indices of those entries --> z is integer vector
x[y]              # subvector
x[z]              # subvector
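Indexing is also used to drop entries and to alter or update parts of a vector. A brief sketch with base R, reusing the vector from above:

```r
x <- c(5, 6, 7)
x[-1]              # negative index drops the 1st entry: 6 7
x[-c(1, 3)]        # drops the 1st and 3rd entry: 6
x[2] <- 60         # overwrite the 2nd entry: 5 60 7
x[x > 6] <- 0      # overwrite all entries larger than 6: 5 0 0
x[5] <- 9          # assigning past the end pads with NA: 5 0 0 NA 9
x
```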

Another helpful application of indexing is to sort and rearrange vectors:

x <- sample(x = 1:100, size = 10) # sample 10 values between 1 and 100 randomly
y <- order(x)     # indices of the entries in ascending order (y[1]: index of smallest entry, y[2]: index of 2nd smallest entry, ...)
y[2]              # index of 2nd smallest entry of x
x[y]              # ordered vector
sort(x)           # same result obtained by sort function
z <- rank(x)      # z contains the ranks of the entries of x
which(z == 2)     # index of 2nd smallest entry of x
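The same idea extends to descending order and to sorting the rows of a data frame by one of its columns. The data below (scores, df) are made up for illustration.

```r
scores <- c(b = 12, a = 30, c = 7)            # named numeric vector
order(scores)                                 # 3 1 2: positions from smallest to largest
order(scores, decreasing = TRUE)              # 2 1 3: positions from largest to smallest
scores[order(scores, decreasing = TRUE)]      # values in descending order, names are kept

df <- data.frame(id = c("b", "a", "c"), score = c(12, 30, 7), stringsAsFactors = FALSE)
df_sorted <- df[order(df$score), ]            # reorder rows by ascending score
df_sorted
```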

Matrices

Matrices have two dimensions and, thus, two indices are required to access an entry. The indices are separated by “,”, i.e. <name matrix>[<index row>, <index column>].

x <- cbind(1:6, 2:7, 3:8) # column-wise binding
x[1,1]                    # first row, first column (just one element)
x[1,]                     # first row (a vector)
x[,1]                     # first column (a vector)
x[1:2,]                   # first and second row (a matrix)
y <- rep( x = c(T,F,T), times = 2) # logical index vector
x[y,]                     # same as x[c(1,3,4,6),]
x[y,c(1,3)]               # mix row and column selections

Matrix entries can also be accessed by index tuples in the form of vectors, i.e., <name matrix>[<index vector>]

x <- matrix(1:16, ncol=4) # square matrix
y <- cbind(1:4, 1:4)      # indices of diagonal elements
x[y]                      # extract diagonal elements of x
diag(x)                   # ... or directly
z <- cbind(c(2,4,1), c(1,3,2))    # index tuples (2,1), (4,3), and (1,2)
x[z]
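Index tuples do not have to be typed by hand. As a sketch, base R's which() with arr.ind = TRUE returns them for all entries that satisfy a condition, and the same tuples can be used to overwrite entries; the name pos is arbitrary.

```r
x <- matrix(1:16, ncol = 4)
pos <- which(x > 13, arr.ind = TRUE)   # row/column tuples of all entries larger than 13
pos                                    # (2,4), (3,4), (4,4)
x[pos]                                 # the matching entries: 14 15 16
x[cbind(1:4, 1:4)] <- 0L               # overwrite the diagonal via index tuples
diag(x)                                # 0 0 0 0
```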

When parts of objects are extracted, R automatically tries to simplify the resulting objects as far as possible (unless you instruct otherwise).

x <- matrix(1:16, ncol=4) # square matrix
y <- x[1,]                # extract 1st row
dim(y)                    # y is not a matrix anymore
is.matrix(y)              # check
is.numeric(y)             # but it's a numeric vector, thus ...
y[2]                      # works
y[,2]                     # does not work (error: incorrect number of dimensions)
# but you can say otherwise
y <- x[1, , drop=F]       # extract 1st row as a 1 x 4 matrix
dim(y)                    # so y has two dimensions
is.numeric(y)             # y is still numeric ...
is.matrix(y)              # ... and a matrix, thus ...
y[2]                      # ... works ...
y[,2]                     # ... works too
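The same bracket notation carries over to data frames, which additionally support the $ operator. The sketch below uses a small made-up data frame (the name df and its columns are assumptions for illustration) and shows that the drop argument behaves as it does for matrices.

```r
df <- data.frame(a = 1:3, b = c("rest", "test", "nest"), stringsAsFactors = FALSE)
df$a                       # column a as a vector
df[2, "b"]                 # row 2 of column b: "test"
df[df$a >= 2, ]            # all rows where a is at least 2 (still a data frame)
df[, "a"]                  # a single column is simplified to a vector
df[, "a", drop = FALSE]    # ... unless drop = FALSE keeps the data frame
```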

Arrays

Arrays generalize matrices to more than two dimensions. Indexing works analogously to matrices, but with additional dimensions:

x <- array(1:12, dim=c(2,2,3))  # array in 3 dimensions (cube) with 2*2*3 entries
y <- x[1,,]                     # pick first row in each submatrix -> upper floor of cube
y <- x[,2,]                     # pick second column in each submatrix -> right wing of cube
y <- x[,,3]                     # pick last layer -> top floor
dim(y)                          # y is a matrix
# or you can keep the array type. 
y <- x[,,3, drop=F] 
dim(y)
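Single entries and index tuples work for arrays in the same way as for matrices, with one index per dimension. A minimal sketch using the array from above; the name idx is arbitrary.

```r
x <- array(1:12, dim = c(2, 2, 3))
x[2, 1, 3]                              # one entry: row 2, column 1, layer 3 -> 10
idx <- rbind(c(1, 1, 1), c(2, 2, 3))    # one index tuple per row
x[idx]                                  # entries (1,1,1) and (2,2,3): 1 12
dim(x[, , 2:3])                         # selecting two layers keeps three dimensions: 2 2 2
```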

Data Science for Production & Logistics (Part 1)

Thomas Kirschstein

7 April 2019