Data structure - Vectors - page 51

(myVector <- c(7,11,13)) # Create a vector called 'myVector'

## [1]  7 11 13

myVector * 3

## [1] 21 33 39

myVector[1] # Get the first element

## [1] 7

myVector[3] # Get the third element

## [1] 13

Vector operations - Some basic functions - Page 87

length(): returns the length of a vector.
sort(): sorts the elements of a vector, in increasing or decreasing order.
rev(): rearranges the elements of a vector in reverse order.
rank(): returns the vector of ranks of the elements.
head(): returns the first few elements of a vector.
tail(): returns the last few elements of a vector.

Vector operations - Examples - Page 87-88

x  <- c(1,3,6,2,7,4,8,1,0,8)
length(x)

## [1] 10

sort(x)

##  [1] 0 1 1 2 3 4 6 7 8 8

sort(x, decreasing=TRUE)

##  [1] 8 8 7 6 4 3 2 1 1 0

rev(x)

##  [1] 8 0 1 8 4 7 2 6 3 1

rank(x)

##  [1] 2.5 5.0 7.0 4.0 8.0 6.0 9.5 2.5 1.0 9.5

head(x, 3)

## [1] 1 3 6

tail(x, 2)

## [1] 0 8

Note: The largest value of rank(x) is not always equal to length(x), as there could be a tie of largest values.
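
To make this note concrete, the sketch below reuses the vector x from the examples above. The ties.method argument of rank() is not covered in these slides; it is shown only as one possible way to break ties.

```r
x <- c(1, 3, 6, 2, 7, 4, 8, 1, 0, 8)

# The two largest values (8 and 8) tie for ranks 9 and 10,
# so each receives the average rank 9.5.
max_rank <- max(rank(x))
max_rank == length(x)          # FALSE: 9.5 is not equal to 10

# Breaking ties by order of appearance gives distinct integer ranks.
rank_first <- rank(x, ties.method = "first")
max(rank_first) == length(x)   # TRUE
```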

Set operations - Page 99

A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)
is.element(C,A)

## [1]  TRUE FALSE  TRUE

is.element(A,C)

## [1] FALSE FALSE  TRUE  TRUE

intersect(A,B)

## [1] 2 7

union(A,B)

## [1] 4 5 2 7 1 3

setdiff(A,B)

## [1] 4 5

Exercise

Given that

A <- c(4,5,2,7)
B <- c(2,1,7,3)
C <- c(2,3,7)

Calculate

the collection of elements of A and B that only belong to one set
the number of elements that belong to A and B and C

Solution

# the collection of elements of A and B that only belong to one set
union(setdiff(A,B), setdiff(B,A))

## [1] 4 5 1 3

# the number of elements that belong to A, B and C
length(intersect(intersect(A,B), C))

## [1] 2
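
As an alternative sketch of the same solution, the nested intersect() calls can be replaced by the base R function Reduce(), which is not introduced in these slides. The names common and sym_diff are chosen here for illustration.

```r
A <- c(4, 5, 2, 7)
B <- c(2, 1, 7, 3)
C <- c(2, 3, 7)

# Elements of A or B that belong to exactly one of the two sets:
# everything in the union except what is in the intersection
sym_diff <- setdiff(union(A, B), intersect(A, B))
sym_diff

# Reduce() applies intersect() successively: intersect(intersect(A, B), C)
common <- Reduce(intersect, list(A, B, C))
length(common)
```

This form may be convenient when more than three sets are involved, since the list can simply be extended.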

Data structure - Matrices and arrays - page 51

Matrices and arrays are generalisations of vectors
A matrix has two dimensions (hence you need two indices to access a data point)
An array allows for even more dimensions (hence you need multiple indices)
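
The indexing described above can be sketched as follows. The object names M and Arr are chosen here for illustration only.

```r
# A 4 x 3 matrix filled row by row: two indices are needed
M <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)
M[2, 3]        # row 2, column 3
M[2, ]         # the whole second row
M[, 3]         # the whole third column

# A 2 x 3 x 4 array: three indices are needed
Arr <- array(1:24, dim = c(2, 3, 4))
Arr[1, 2, 3]   # row 1, column 2, third slice
dim(Arr)       # the size of each dimension
```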

Data structure - Matrices and arrays - page 52

(X <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE))

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
## [4,]   10   11   12

(X <- matrix(1:12, nrow=4, ncol=3, byrow=FALSE))

##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

(X <- array(1:24, dim=c(2,3,4)))

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]   13   15   17
## [2,]   14   16   18
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   19   21   23
## [2,]   20   22   24

Data structure - Matrices and arrays - page 53

[Figure: visualization of a three-dimensional array]

Recycling - Pages 86-87

Given an operation on two vectors/matrices/arrays of different lengths, R will extend the shorter data structure by repeating its elements from the beginning. We call this behaviour ‘recycling’:

x <- c(1,2,3,4,5,6)
y <- c(1,2,3)
x + y

## [1] 2 4 6 5 7 9

Another example is below, where the vector 1:3 is repeated to fill in a matrix:

matrix(1:3, ncol=3, nrow=4)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    2    3    1
## [3,]    3    1    2
## [4,]    1    2    3
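
In both examples above the longer length is a multiple of the shorter one. The sketch below, with made-up vectors, illustrates what happens otherwise: R still recycles, but issues a warning.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(1, 2)

# y is recycled as 1, 2, 1, 2, 1. R returns a result but warns that
# the longer length is not a multiple of the shorter one.
z <- suppressWarnings(x + y)
z

# Capture the warning to confirm that it is raised
warned <- tryCatch({ x + y; FALSE }, warning = function(w) TRUE)
warned
```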

Merging - Page 89

You can merge vectors or matrices together to create a new matrix with functions cbind() and rbind().

(B <- cbind(1:4,5:8))

##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8

(C <- cbind(B, 9:12))

##      [,1] [,2] [,3]
## [1,]    1    5    9
## [2,]    2    6   10
## [3,]    3    7   11
## [4,]    4    8   12

(B <- rbind(1:4,5:8))

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8

(C <- rbind(B, 9:12))

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Matrix operations - Pages 315-316-317

(A <- matrix(c(2,3,5,4), nrow=2, ncol=2, byrow=TRUE))

##      [,1] [,2]
## [1,]    2    3
## [2,]    5    4

(B <- matrix(c(1,2,8,7), nrow=2, ncol=2, byrow=FALSE))

##      [,1] [,2]
## [1,]    1    8
## [2,]    2    7

(I2 <- diag(nrow=2)) # identity matrix of size 2x2

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1

A+B

##      [,1] [,2]
## [1,]    3   11
## [2,]    7   11

A*B

##      [,1] [,2]
## [1,]    2   24
## [2,]   10   28

A/B

##      [,1]      [,2]
## [1,]  2.0 0.3750000
## [2,]  2.5 0.5714286

A%*%I2

##      [,1] [,2]
## [1,]    2    3
## [2,]    5    4

A%*%B

##      [,1] [,2]
## [1,]    8   37
## [2,]   13   68

t(B)

##      [,1] [,2]
## [1,]    1    2
## [2,]    8    7

Note: the diag() function can also be used for different purposes (see R help file)
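
As a brief sketch of those other uses, diag() behaves differently depending on what it is given. The names D and I3 are chosen here for illustration.

```r
A <- matrix(c(2, 3, 5, 4), nrow = 2, ncol = 2, byrow = TRUE)

diag(A)                 # a matrix argument: extracts its main diagonal
D <- diag(c(1, 2, 3))   # a vector argument: builds a diagonal matrix
I3 <- diag(3)           # a single number n: the n x n identity matrix
D
I3
```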

Matrix operations - the solve() function - Pages 316-317

The solve(A,b) function can be used to solve $A \mathbf{x} = \mathbf{b}$, for $\mathbf{x}$. Here $\mathbf{b}$ can be a vector or a matrix.
If solve() is used with only one argument, e.g. solve(A), it will return the inverse of a matrix (if it exists).

(A <- matrix(1:4, ncol=2))

##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

(x <- solve(A, c(1,1)))

## [1] -0.5  0.5

A%*%x

##      [,1]
## [1,]    1
## [2,]    1

solve(A) %*% A

##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
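
Because solve() works in floating-point arithmetic, the product of a matrix and its computed inverse may differ from the identity by tiny rounding errors, so a tolerant comparison is safer than ==. The sketch below also shows one possible way of handling a matrix that has no inverse; the matrix S and the use of tryCatch() are illustrative choices, not part of the slides.

```r
A <- matrix(1:4, ncol = 2)
A_inv <- solve(A)

# Compare with the identity matrix up to numerical tolerance
is_identity <- isTRUE(all.equal(A_inv %*% A, diag(2)))
is_identity

# A singular matrix (second column is twice the first) has no inverse:
# solve() stops with an error, which is caught here and replaced by NA
S <- matrix(c(1, 2, 2, 4), ncol = 2)
inverse_or_na <- tryCatch(solve(S), error = function(e) NA)
inverse_or_na
```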

Matrix operations - The function apply() - Page 93

The function apply() is often quite handy. It applies a given function to each row (MARGIN=1) or to each column (MARGIN=2) of a matrix.

(X <- matrix(c(1:4, 1, 6:8), nrow = 2))

##      [,1] [,2] [,3] [,4]
## [1,]    1    3    1    7
## [2,]    2    4    6    8

apply(X, MARGIN=1, FUN=sum)

## [1] 12 20

apply(X, MARGIN=2, FUN=mean)

## [1] 1.5 3.5 3.5 7.5

Other functions you could use: rowSums(), colSums(), rowMeans(), colMeans().
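
A short sketch, reusing the matrix X from the example above, suggests that these dedicated functions give the same results as the corresponding apply() calls. The last line uses max() simply to show that apply() accepts other functions too.

```r
X <- matrix(c(1:4, 1, 6:8), nrow = 2)

# Dedicated functions versus the equivalent apply() calls
same_row_sums  <- isTRUE(all.equal(rowSums(X),  apply(X, MARGIN = 1, FUN = sum)))
same_col_means <- isTRUE(all.equal(colMeans(X), apply(X, MARGIN = 2, FUN = mean)))

# apply() is more general: here, the maximum of each column
col_max <- apply(X, MARGIN = 2, FUN = max)
col_max
```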

Matrix operations - Exercise

Given a $3\times 3$ matrix X

X <- matrix(1:9, nrow = 3)

Use function apply to create a vector called row.sums containing the row marginal sums of X (i.e. the sum of elements within each row)
Do the same to create a vector called col.sums containing the column marginal sums of X
Verify with a relational operation that sum(row.sums) = sum(col.sums) = sum(X). Make sure you see why that should be the case.

Matrix operations - Solution

(X <- matrix(1:9, nrow = 3))

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

(row.sums <- apply(X, 1, sum))

## [1] 12 15 18

(col.sums <- apply(X, 2, sum))

## [1]  6 15 24

(sum(row.sums) == sum(col.sums)) & (sum(X) == sum(row.sums))

## [1] TRUE

Important note on data structures

Do not confuse ‘data structure’ (vector, matrix, array,…) with ‘data type’ (which we saw in Week 1).
A ‘data type’ refers to the type of information (numerical, character, logical, etc.) while a ‘data structure’ refers to how we store (or structure!) the information (in a vector, in a matrix, etc.)
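
The distinction can be sketched with two small objects (v and m are illustrative names): typeof() reports the data type, while functions such as is.vector() and is.matrix() check the data structure.

```r
v <- c(1.5, 2, 3)                              # a vector of numbers
m <- matrix(c("a", "b", "c", "d"), nrow = 2)   # a matrix of character strings

typeof(v)      # data type: "double"
typeof(m)      # data type: "character"
is.vector(v)   # data structure: TRUE, v is a vector
is.matrix(m)   # data structure: TRUE, m is a matrix
is.matrix(v)   # FALSE: a plain vector is not a matrix
```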

Data structure - Lists - page 53

Elements stored in vectors, matrices or arrays need to be of the same type (and R automatically converts them to the same type if they are not)

myVector <- c(1,2,"A", TRUE)
myVector

## [1] "1"    "2"    "A"    "TRUE"

typeof(myVector)

## [1] "character"

Lists can group together, in one structure, data of different types without altering them.

myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "R is my friend")
myList

## [[1]]
## [1] TRUE
## 
## $my.matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
## [1] 1+2i 3+0i
## 
## [[4]]
## [1] "R is my friend"

Data structure - Lists - Exercise

Consider ‘myList’ given before as

myList <- list(TRUE, my.matrix=matrix(1:4, nrow=2), c(1+2i,3), "R is my friend")

How many elements do we have in the object myList?
Do all elements have the same data types?
Does each element have its own name?

Data structure - Lists - Solution

myList

## [[1]]
## [1] TRUE
## 
## $my.matrix
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
## [1] 1+2i 3+0i
## 
## [[4]]
## [1] "R is my friend"

typeof(myList[[1]])

## [1] "logical"

typeof(myList[[2]])

## [1] "integer"

typeof(myList[[3]])

## [1] "complex"

typeof(myList[[4]])

## [1] "character"

There are 4 elements.
No. This is one advantage of using a list. In fact, each element can be a vector, a matrix, an array or even a list.
The second element’s name is my.matrix, and the others are unnamed. Note that naming elements makes it easier to read the data from a list - we will discuss this further in later sessions.
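
As a small preview, and only as a sketch, the answers above can be checked directly with length() and names(), and a named element can be read either by name or by position.

```r
myList <- list(TRUE, my.matrix = matrix(1:4, nrow = 2), c(1+2i, 3), "R is my friend")

length(myList)       # number of elements in the list
names(myList)        # unnamed elements appear as empty strings
myList$my.matrix     # access the second element by name
myList[[2]]          # the same element, accessed by position
```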

Data structure - Data frames - page 54

A data.frame in R is a table where

each row represents a single observation (e.g. an individual)
each column represents a single variable, which must be of the same data type across all rows

Data frames are widely used in R

flexibility of having multiple data types
in many cases, a dataset can be formulated as a data.frame

Data structure - Data frames

BMI <- data.frame(
  Gender=c("M","F","M","F"),
  Height=c(1.83,1.78,1.80,1.60),
  Weight=c(77,68,66,48),
  row.names=c("Ben","Katja","Anthony","Jinxia"))

BMI

##         Gender Height Weight
## Ben          M   1.83     77
## Katja        F   1.78     68
## Anthony      M   1.80     66
## Jinxia       F   1.60     48

str(BMI) # display the internal structure of the data frame

## 'data.frame':    4 obs. of  3 variables:
##  $ Gender: chr  "M" "F" "M" "F"
##  $ Height: num  1.83 1.78 1.8 1.6
##  $ Weight: num  77 68 66 48

You can access a specific variable using the $ operator

BMI$Gender

## [1] "M" "F" "M" "F"
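
The same notation can be used to create a new variable. The sketch below assumes that heights are in metres and weights in kilograms (the slides do not state the units); the column name Index is a hypothetical choice.

```r
BMI <- data.frame(
  Gender = c("M", "F", "M", "F"),
  Height = c(1.83, 1.78, 1.80, 1.60),
  Weight = c(77, 68, 66, 48),
  row.names = c("Ben", "Katja", "Anthony", "Jinxia"))

# New column computed from existing ones (assumed units: kg and m)
BMI$Index <- BMI$Weight / BMI$Height^2

BMI["Katja", "Height"]               # one cell, by row name and column name
females <- BMI[BMI$Gender == "F", ]  # all rows for which Gender is "F"
females
```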

Merging - Merging columns of data frames - Page 89-91

You may need to add to a dataset some variables from another dataset (which has the same subjects).
To do so, you can use the merge() function (but be careful of what happens when not all subjects are present in both datasets)

X <- data.frame(GENDER=c("F","M","M","F"),
ID=c(123,234,345,456),
NAME=c("Mary","James","James","Olivia"),
Height=c(170,180,185,160))

Y <- data.frame(GENDER=c("M","F","F","M"),
ID=c(345,456,123,234),
NAME=c("James","Olivia","Mary","James"),
Weight=c(80,50,70,60))
X

##   GENDER  ID   NAME Height
## 1      F 123   Mary    170
## 2      M 234  James    180
## 3      M 345  James    185
## 4      F 456 Olivia    160

Y

##   GENDER  ID   NAME Weight
## 1      M 345  James     80
## 2      F 456 Olivia     50
## 3      F 123   Mary     70
## 4      M 234  James     60

cbind(X,Y)  # Not very useful here

##   GENDER  ID   NAME Height GENDER  ID   NAME Weight
## 1      F 123   Mary    170      M 345  James     80
## 2      M 234  James    180      F 456 Olivia     50
## 3      M 345  James    185      F 123   Mary     70
## 4      F 456 Olivia    160      M 234  James     60

merge(X,Y)  # This is what we want

##   GENDER  ID   NAME Height Weight
## 1      F 123   Mary    170     70
## 2      F 456 Olivia    160     50
## 3      M 234  James    180     60
## 4      M 345  James    185     80

Any individual not present in both datasets will be lost.

Z <- data.frame(GENDER=c("M","F","F","F"),
ID=c(345,456,123,999),
NAME=c("James","Olivia","Mary","Jennifer"),
Age=c(21,19,23,99))
Z

##   GENDER  ID     NAME Age
## 1      M 345    James  21
## 2      F 456   Olivia  19
## 3      F 123     Mary  23
## 4      F 999 Jennifer  99

merge(X,Z)

##   GENDER  ID   NAME Height Age
## 1      F 123   Mary    170  23
## 2      F 456 Olivia    160  19
## 3      M 345  James    185  21

You can use the all.x or all.y arguments to force the inclusion of all the subjects of a dataset.

merge(X,Z, all.x = TRUE) # all subjects of first dataset are kept

##   GENDER  ID   NAME Height Age
## 1      F 123   Mary    170  23
## 2      F 456 Olivia    160  19
## 3      M 234  James    180  NA
## 4      M 345  James    185  21

merge(X,Z, all.y = TRUE) # all subjects of second dataset are kept

##   GENDER  ID     NAME Height Age
## 1      F 123     Mary    170  23
## 2      F 456   Olivia    160  19
## 3      F 999 Jennifer     NA  99
## 4      M 345    James    185  21
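
To keep the subjects of both datasets at once, merge() also has an all argument. The sketch below rebuilds X and Z from the examples above so that it can be run on its own; the name XZ_all is illustrative.

```r
X <- data.frame(GENDER = c("F", "M", "M", "F"),
                ID = c(123, 234, 345, 456),
                NAME = c("Mary", "James", "James", "Olivia"),
                Height = c(170, 180, 185, 160))
Z <- data.frame(GENDER = c("M", "F", "F", "F"),
                ID = c(345, 456, 123, 999),
                NAME = c("James", "Olivia", "Mary", "Jennifer"),
                Age = c(21, 19, 23, 99))

# Keep every subject from both datasets; missing values are filled with NA
XZ_all <- merge(X, Z, all = TRUE)
XZ_all
nrow(XZ_all)   # 5 subjects: IDs 123, 234, 345, 456 and 999
```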

Data structure - Factors

A factor can be used to store categorical data, typically given as character strings

each element is treated as a category, called a level (even if the input is a real number)
some functions require data structured as a factor

x <- factor(c("blue","green","blue","red","blue","green","green"))
levels(x)

## [1] "blue"  "green" "red"

class(x)

## [1] "factor"

Note: Function levels() shows you all the distinct levels of a factor, which can be handy.
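
A few further base R functions, sketched below, may help to see how a factor is organised. The second factor f is a made-up example of numeric input.

```r
x <- factor(c("blue", "green", "blue", "red", "blue", "green", "green"))

nlevels(x)      # number of distinct levels
table(x)        # number of elements in each level
as.integer(x)   # internal integer codes, which index into levels(x)

# Numbers can also be stored as a factor; the levels are then character strings
f <- factor(c(3, 1, 3, 2))
levels(f)
```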

Data structure - Summary

Data structure	Instruction in R	Description
vector	c()	Sequence of elements of the same nature
matrix	matrix()	Two-dimensional table of elements of the same nature
array	array()	More general than a matrix; table with several dimensions
list	list()	Sequence of R structures of any (and possibly different) nature.
data frame	data.frame()	Two-dimensional table. The columns can be of different natures, but must have the same length.
factor	factor()	Vector of character strings associated with a modality table
dates	as.Date()	Vector of dates
time series	ts()	Values of a variable observed at several time points

Importing data - text files - Page 64

There are several functions to import data from a text file. Here we focus on the read.table() and read.csv() functions, which are widely used to import text and csv files (for instance, data exported from Excel).

Function name	Description
read.table()	best suited for data sets presented as tables, as it is often the case in statistics
read.ftable()	reads contingency tables
scan()	much more flexible and powerful. Use this in all other cases

Important note

For now we are working in the Ed platform, so all the files we import are already in our working directory in Ed.
Of course, if you use R on your own computer (after downloading it), then you can import any data file on your computer.

Importing data - read.table() - Pages 64-66

Argument name	Description
file=path/to/file	Location and name of the file to be read
header=TRUE	Indicates whether the variable names are given on the first line of the file
sep=""	This is the field separator character. Values on each line of the file are separated by this character (e.g. "" = whitespace, "," = comma, "\t" = tab).
dec="."	Decimal mark for numbers ("." or ",")

If a .txt file (e.g. “mydata.txt”) is located in your current working directory, you can access it by simply using:

data <- read.table(file="mydata.txt")

If you do this, ‘data’ will be a data.frame containing the dataset from your .txt file (“mydata.txt”)
Note you can visualise the beginning or end of your data by using head(data) or tail(data)
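
The arguments in the table above can be tried without any existing data file. The sketch below first writes a small made-up file to a temporary location (the file contents and the names tmp and d are purely illustrative) and then reads it back with matching header, sep and dec values.

```r
# Write a small semicolon-separated file with a header line and decimal commas
tmp <- tempfile(fileext = ".txt")
writeLines(c("name;height", "Ben;1,83", "Katja;1,78"), tmp)

# header, sep and dec must match the layout of the file
d <- read.table(file = tmp, header = TRUE, sep = ";", dec = ",")
str(d)
head(d)
```

If dec were left at its default ".", the height column would not be read as numbers, which is one reason to look at the raw file before importing it.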

Importing data - Exercise - Page 66

In the Ed ‘Exercise Space’ for this week we have placed a dataset in the .txt format called danishfire.txt. It contains claim amounts for three different categories of insurance losses, also with the dates at which the losses occurred.

Use the function readLines("danishfire.txt", n=5) to visualize the beginning of the data. Note that at this stage you are just looking at the data, you have not imported it.
Import the data and store it in a data frame called danish_fire.
Display the first few records of danish_fire by using the head() function.
Apply functions class() and str() to danish_fire. What information does this provide you?
Calculate the mean of the building losses, as well as the correlation between the building losses and contents losses. Hint: Use functions mean() and cor().

Importing data - Solution - Page 66

# Visualise data using readLines
readLines("danishfire.txt", n=5)
# import data, note that the header argument is TRUE
danish_fire <- read.table(file="danishfire.txt", header=TRUE)
head(danish_fire)
class(danish_fire)
str(danish_fire)
# some numerical analysis
mean(danish_fire$building)
cor(danish_fire$building, danish_fire$contents)

Note that function readLines() is useful as it allows you to visualize the beginning of the data file before you import the data, so you know the structure of the data and can determine the arguments of the function read.table().

Importing in R

If you are using R installed on your computer (not in Ed), you can specify a file that is not in your working directory

my.data <- read.table(file="C:/MyFolder/somedata.txt")

Note that on Windows you can also write the path with double backslashes (\\) instead of forward slashes (/).

To check what your working directory is, you can use the function getwd()

getwd()

Importing data - Standard formats (e.g. csv) - Page 67

With standard formats, most of the arguments of the function read.table() are fixed. There are some functions that are designed to be equivalent to read.table() with several arguments filled with pre-determined values:

read.csv(): .csv format (csv stands for comma separated values)
read.delim(): Tab-separated data
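
As a sketch of this equivalence, the made-up file below is read once with read.csv() and once with read.table() using header=TRUE and sep=",". For a simple file like this one the two results should agree; read.csv() also presets a few other arguments, so this is not a claim that the two calls match for every file.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,amount", "1,10.5", "2,7.25"), tmp)

d1 <- read.csv(tmp)
d2 <- read.table(tmp, header = TRUE, sep = ",")

same <- isTRUE(all.equal(d1, d2))
same
```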

Exporting data - Pages 72-73

Now try to export the danish fire data you have imported to R.

# Exporting data to a text file
write.table(danish_fire, file = "myfile.txt", sep = "\t")
# Exporting data to a csv file
write.csv(danish_fire, file = "myfile.csv")

Note that the commands above save your data into the working directory.

If you are using R on your computer (not in Ed!), you can specify a different path if you wish, by using file = "path/to/data/myfile.txt".
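
One way to check an export is to read the file back in. The sketch below uses a small made-up data frame and a temporary file rather than the Danish fire data; the names df, out and back are illustrative.

```r
df <- data.frame(building = c(1.5, 2.0), contents = c(0.5, 1.25))
out <- tempfile(fileext = ".csv")

# By default write.csv() also writes the row names as a first column;
# row.names = FALSE leaves them out.
write.csv(df, file = out, row.names = FALSE)

back <- read.csv(out)
round_trip_ok <- isTRUE(all.equal(df, back))
round_trip_ok
```

Without row.names = FALSE, the file read back would contain an extra first column holding the row names.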

Data structures, mathematical operations, importation and exportation in R

Term 2 2023