Data Structures in R

R-Ladies Tunis

21/06/2020

Introduction

This article contains the training material for our second meetup, “Data Structures in R”. The aim of the meetup is to introduce you to the basic data structures used when programming in R.

We cover:

  1. Vectors
  2. Matrices
  3. Data frames
  4. Lists
  5. Arrays

1. Vectors

A vector is the basic data structure in R. It can contain a single element or multiple elements.

1.1 How to create a vector

Single-element vector

## [1] 2

Vectors with multiple elements

Using c() function

Vectors are generally created using the c() function. They can be numeric, logical or character; factors are built on top of vectors with the factor() function.

Let’s create our first vectors

## [1] 1 2 3 6 8
## [1]  TRUE FALSE  TRUE  TRUE
## [1] "Apple"  "Banana" "Orange"
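The vectors printed above could be created as follows. This is a minimal sketch: the variable names are assumptions, since the original code is not shown.

```r
# Variable names are illustrative assumptions
single  <- 2                                # a vector with a single element
numbers <- c(1, 2, 3, 6, 8)                 # numeric vector
flags   <- c(TRUE, FALSE, TRUE, TRUE)       # logical vector
fruits  <- c("Apple", "Banana", "Orange")   # character vector

print(single)
print(numbers)
print(flags)
print(fruits)
```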

Using seq() function

You can also create a vector using the seq() function, which generates a regular sequence. The step is specified with the by argument.

##  [1] 1.0 1.2 1.4 1.6 1.8 2.0 2.2 2.4 2.6 2.8 3.0

Alternatively, you can specify the length of the vector using length.out

## [1] 1.000000 3.333333 5.666667 8.000000
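Both seq() outputs above are consistent with the calls below. The start and end values are read off the printed results; the variable names are assumptions.

```r
# by: fixed step between consecutive values
by_step <- seq(1, 3, by = 0.2)

# length.out: fixed number of evenly spaced values
by_length <- seq(1, 8, length.out = 4)

print(by_step)
print(by_length)
```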

Using rep() function

You can also create a vector using the rep() function, which repeats values: times repeats the whole vector, while each repeats every element in turn.

## [1] 1 2 3 1 2 3 1 2 3
## [1] 1 1 1 2 2 2 3 3 3
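A short sketch of the two rep() variants that would give the outputs above (the variable names are assumptions):

```r
# times: repeat the whole vector three times
whole_repeated <- rep(1:3, times = 3)

# each: repeat every element three times before moving on
each_repeated <- rep(1:3, each = 3)

print(whole_repeated)
print(each_repeated)
```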

1.2 How to know the type and length of a vector

To find out the type of a vector, use the typeof() function; to find out its length, use the length() function.

## [1] "double"
## [1] "character"
## [1] 3
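As an illustration, the calls below give the same three results on two assumed vectors (the names numbers and fruits are not from the original code):

```r
numbers <- c(1, 2, 3, 6, 8)
fruits  <- c("Apple", "Banana", "Orange")

type_numbers <- typeof(numbers)   # numeric literals are stored as "double"
type_fruits  <- typeof(fruits)    # strings are stored as "character"
n_fruits     <- length(fruits)    # number of elements

print(type_numbers)
print(type_fruits)
print(n_fruits)
```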

1.3 How to combine vectors

Vectors can be combined with the c() function. For example, the following two vectors h1 and h2 are combined into a new vector containing the elements of both.

## [1] "banana"     "orange"     "cherry"     "strawberry" "kiwi"
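A sketch of the combination described above. How the five fruits are split between h1 and h2 is an assumption; only the combined result is shown in the original.

```r
# The split between h1 and h2 is assumed
h1 <- c("banana", "orange", "cherry")
h2 <- c("strawberry", "kiwi")

# c() concatenates its arguments in order
combined <- c(h1, h2)
print(combined)
```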

1.4 How to access an element of a vector

Vector indices in R start from 1, unlike many programming languages where indices start from 0.

We can use a vector of integers as an index to access specific elements.

We can also use negative integers to return all elements except those specified.

## [1] "banana"
## [1] 54
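The three indexing styles can be sketched on an assumed vector. This example illustrates the technique and is not meant to reproduce the exact output shown above.

```r
# Assumed example vector
fruits <- c("banana", "orange", "cherry", "strawberry", "kiwi")

first_fruit <- fruits[1]          # single index; counting starts at 1
some_fruits <- fruits[c(1, 3)]    # integer vector selects elements 1 and 3
all_but_2nd <- fruits[-2]         # negative index drops element 2

print(first_fruit)
print(some_fruits)
print(all_but_2nd)
```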

1.5 How to modify an element of a vector

We can modify a vector using the assignment operator.

## [1] "banana"     "strawberry" "hello"
## [1]   1  22   3 100  96
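A possible way to arrive at the two results above. The starting values are assumptions; only the modified vectors appear in the original.

```r
# Starting vectors are assumed
fruits  <- c("banana", "orange", "cherry")
numbers <- c(1, 2, 3, 6, 8)

# Assign to an index to overwrite that element
fruits[2] <- "strawberry"
fruits[3] <- "hello"

# Several positions can be replaced at once
numbers[2]   <- 22
numbers[4:5] <- c(100, 96)

print(fruits)
print(numbers)
```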

1.6 How to delete an element of a vector

## NULL
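One way to obtain the NULL shown above is to assign NULL to the whole vector, which discards its contents. To drop individual elements, negative indexing is the usual approach. A sketch with an assumed vector v:

```r
v <- c(10, 20, 30, 40)

# Drop only the second element by keeping everything else
shorter <- v[-2]
print(shorter)

# Assigning NULL empties the vector entirely
v <- NULL
print(v)
```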

1.7 Vector arithmetic

Arithmetic operations on vectors are performed element by element:

## [1]  3  5 11 14 13
## [1]  2  4  6 10 12
## [1] -1 -1 -5 -4 -1
## [1] 0.5000000 0.6666667 0.3750000 0.5555556 0.8571429
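The four results above are consistent with the two vectors below, which were reconstructed from the printed output. The names a and b are assumptions.

```r
# Values reconstructed from the output; names are assumed
a <- c(1, 2, 3, 5, 6)
b <- c(2, 3, 8, 9, 7)

total    <- a + b   # element-wise addition
doubled  <- a * 2   # the scalar 2 is recycled across a
diffs    <- a - b   # element-wise subtraction
quotient <- a / b   # element-wise division

print(total)
print(doubled)
print(diffs)
print(quotient)
```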

2. Matrices

2.1 How to create a matrix

Matrices are R objects in which the elements are arranged in a two-dimensional array.

A matrix can be thought of as a set of vectors of equal length joined together. Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.

##      x  y
## [1,] 1 10
## [2,] 2 11
## [3,] 3 12
##   [,1] [,2] [,3]
## x    1    2    3
## y   10   11   12
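The two matrices above match what cbind() and rbind() return for the vectors x and y below (values read off the output):

```r
x <- 1:3
y <- 10:12

by_column <- cbind(x, y)   # x and y become the columns (3 x 2)
by_row    <- rbind(x, y)   # x and y become the rows (2 x 3)

print(by_column)
print(by_row)
```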

The basic syntax for creating a matrix in R is : matrix(data, nrow, ncol, byrow, dimnames)

  1. data is the input vector which becomes the data elements of the matrix.
  2. nrow is the number of rows to be created.
  3. ncol is the number of columns to be created.
  4. byrow is a logical value. If TRUE, the input vector elements are arranged by row.
  5. dimnames are the names assigned to the rows and columns.
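To see all five arguments together, here is a small sketch. The data and the row and column labels are made up for illustration.

```r
# Illustrative values; labels are assumptions
m <- matrix(
  data     = 1:6,
  nrow     = 2,
  ncol     = 3,
  byrow    = TRUE,   # fill row by row: 1 2 3 / 4 5 6
  dimnames = list(c("r1", "r2"), c("c1", "c2", "c3"))
)
print(m)
```

With byrow = FALSE (the default) the same data would be filled column by column instead.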

Matrices can be created directly from vectors by adding a dimension attribute.

## [1] 2 6
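A sketch of the dimension-attribute approach. The vector contents are assumed; only the resulting dimensions (2 and 6) come from the output above.

```r
# Assumed contents: any vector of length 12 works for a 2 x 6 matrix
m <- 1:12

# Setting dim turns the plain vector into a matrix, filled column by column
dim(m) <- c(2, 6)

print(dim(m))
```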

Let’s consider a matrix P. Let’s define the column and row names :

Let’s create the matrix :

##       control treatment control treatment
## gene1       1         2       3         4
## gene2       5         6       7         8
## gene3       9        10      11        12
## gene4      13        14      15        16
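The matrix P above can be rebuilt as follows. The values and labels are taken from the printed output; the helper names row_names and col_names are assumptions.

```r
# Helper names are assumed; labels come from the output
col_names <- c("control", "treatment", "control", "treatment")
row_names <- c("gene1", "gene2", "gene3", "gene4")

# byrow = TRUE so that gene1 holds 1 to 4, gene2 holds 5 to 8, and so on
P <- matrix(1:16, nrow = 4, ncol = 4, byrow = TRUE,
            dimnames = list(row_names, col_names))
print(P)
```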

2.2 Access to an element of the matrix

To access the element at 1st row, 3rd column :

## [1] 3

Let’s access all the rows of column 4:

## gene1 gene2 gene3 gene4 
##     4     8    12    16

Let’s access the 2nd row of all the columns:

##   control treatment   control treatment 
##         5         6         7         8
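The three lookups above can be written with the [row, column] syntax. The sketch rebuilds P from the printed values so that it runs on its own.

```r
P <- matrix(1:16, nrow = 4, byrow = TRUE,
            dimnames = list(paste0("gene", 1:4),
                            c("control", "treatment", "control", "treatment")))

element  <- P[1, 3]   # 1st row, 3rd column
column_4 <- P[, 4]    # empty row index: all rows of column 4
row_2    <- P[2, ]    # empty column index: all columns of row 2

print(element)
print(column_4)
print(row_2)
```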

2.3 Matrix Multiplication

R has two multiplication operators for matrices :

  • The first is *, the ordinary multiplication sign. It performs element-by-element multiplication, so both matrices must have the same dimensions.

  • The second operator is denoted by %*% and it performs a matrix multiplication between the two matrices.

Let’s see an example:

M is a matrix of 4 rows and 3 columns.

##      [,1] [,2] [,3]
## [1,]    3    7   11
## [2,]    4    8   12
## [3,]    5    9   13
## [4,]    6   10   14

What are the dimensions of M? NOTE: The dimensions of a matrix are the number of rows by the number of columns. If a matrix has a rows and b columns, it is an a×b matrix.

## [1] 4 3

P is a matrix of 4 rows and 3 columns

##      [,1] [,2] [,3]
## [1,]   20   24   28
## [2,]   21   25   29
## [3,]   22   26   30
## [4,]   23   27   31

NOTE: P and M have same dimensions so we can do element by element multiplication.

##      [,1] [,2] [,3]
## [1,]   60  168  308
## [2,]   84  200  348
## [3,]  110  234  390
## [4,]  138  270  434

Let’s move to the algebraic matrix multiplication in R.

N is a matrix of 3 rows and 4 columns

##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12

What are the dimensions of N?

## [1] 3 4

Let’s do matrix multiplication in R :

##      [,1] [,2] [,3] [,4]
## [1,]   50  113  176  239
## [2,]   56  128  200  272
## [3,]   62  143  224  305
## [4,]   68  158  248  338
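The whole example can be reproduced with the sketch below. The matrix contents are read off the printed output; the result names are assumptions.

```r
M <- matrix(3:14, nrow = 4, ncol = 3)    # 4 x 3
P <- matrix(20:31, nrow = 4, ncol = 3)   # 4 x 3
N <- matrix(1:12, nrow = 3, ncol = 4)    # 3 x 4

# Element-by-element product: M and P must have the same dimensions
elementwise <- M * P

# Matrix product: columns of M (3) must equal rows of N (3); result is 4 x 4
product <- M %*% N

print(dim(M))
print(elementwise)
print(dim(N))
print(product)
```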

3. Data Frames

Data frames are used to store tabular data in R. They are an important type of object in R and are used in a variety of statistical modeling applications.

Data frames are represented as a special type of list where every element of the list has to have the same length. Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.

Unlike matrices, data frames can store a different class of object in each column. In a matrix, every element must be of the same class (e.g. all integers or all numeric).

3.1 How to create a data frame

Data frames are usually created by reading in a dataset using read.table() or read.csv(). However, data frames can also be created explicitly with the data.frame() function.

##    Gender Height Weight Age Married
## 1    Male  152.0     61  22     Yes
## 2    Male  171.5     93  38     Yes
## 3    Male  165.0     78  26     Yes
## 4    Male  170.0     61  60     Yes
## 5  Female  155.0     45  25     Yes
## 6  Female  167.0     57  17     Yes
## 7  Female  172.1     61  26      No
## 8  Female  165.8     55  18      No
## 9  Female  167.0     56  17      No
## 10 Female  181.0     71  55      No
## [1] 10
## [1] 5
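The data frame above can be built explicitly with data.frame(). The values are copied from the printed table; the object name df is an assumption.

```r
# Object name df is assumed; values come from the table above
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4)),
  stringsAsFactors = FALSE   # keep text columns as character
)

print(df)
print(nrow(df))   # number of rows
print(ncol(df))   # number of columns
```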

R objects can have names, which is very useful for writing readable code and self-describing objects.

Note that for data frames there is a separate function for setting the row names, the row.names() function. The column names of a data frame are simply its names (as for a list), so to set the column names just use the names() function.

3.2 How to set column names and row names of a data frame

## [1] "Gender"  "Height"  "Weight"  "Age"     "Married"
##          Gender Height Weight Age Married
## 1st row    Male  152.0     61  22     Yes
## 2nd row    Male  171.5     93  38     Yes
## 3rd row    Male  165.0     78  26     Yes
## 4th row    Male  170.0     61  60     Yes
## 5th row  Female  155.0     45  25     Yes
## 6th row  Female  167.0     57  17     Yes
## 7th row  Female  172.1     61  26      No
## 8th row  Female  165.8     55  18      No
## 9th row  Female  167.0     56  17      No
## 10th row Female  181.0     71  55      No
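A sketch of reading the column names and setting the row names shown above. The object name df is an assumption, and the data frame is rebuilt here so the example runs on its own.

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4))
)

# names() returns (or sets) the column names
print(names(df))

# row.names() sets the row labels; they must be unique
row.names(df) <- c("1st row", "2nd row", "3rd row", paste0(4:10, "th row"))
print(df)
```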

3.3 Header of the data frame

The head() function returns the first rows of the dataset and tail() returns the last rows, which gives a quick look at the variables and their values.

##         Gender Height Weight Age Married
## 1st row   Male  152.0     61  22     Yes
## 2nd row   Male  171.5     93  38     Yes
## 3rd row   Male  165.0     78  26     Yes
## 4th row   Male  170.0     61  60     Yes
## 5th row Female  155.0     45  25     Yes
## 6th row Female  167.0     57  17     Yes
##          Gender Height Weight Age Married
## 5th row  Female  155.0     45  25     Yes
## 6th row  Female  167.0     57  17     Yes
## 7th row  Female  172.1     61  26      No
## 8th row  Female  165.8     55  18      No
## 9th row  Female  167.0     56  17      No
## 10th row Female  181.0     71  55      No
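The two tables above correspond to the default behaviour of head() and tail(), which return six rows each. A sketch on a rebuilt copy of the data (df is an assumed name; row labels are left at their defaults):

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4))
)

first_rows <- head(df)   # first 6 rows by default
last_rows  <- tail(df)   # last 6 rows by default

print(first_rows)
print(last_rows)
```

Both functions accept an n argument, for example head(df, n = 3), to change the number of rows returned.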

3.4 Structure of the data frame

The str() function shows the structure of the dataset (e.g. whether each variable is numeric, character or factor), and summary() gives summary statistics for each variable.

## 'data.frame':    10 obs. of  5 variables:
##  $ Gender : chr  "Male" "Male" "Male" "Male" ...
##  $ Height : num  152 172 165 170 155 ...
##  $ Weight : num  61 93 78 61 45 57 61 55 56 71
##  $ Age    : num  22 38 26 60 25 17 26 18 17 55
##  $ Married: chr  "Yes" "Yes" "Yes" "Yes" ...
##     Gender              Height          Weight           Age      
##  Length:10          Min.   :152.0   Min.   :45.00   Min.   :17.0  
##  Class :character   1st Qu.:165.2   1st Qu.:56.25   1st Qu.:19.0  
##  Mode  :character   Median :167.0   Median :61.00   Median :25.5  
##                     Mean   :166.6   Mean   :63.80   Mean   :30.4  
##                     3rd Qu.:171.1   3rd Qu.:68.50   3rd Qu.:35.0  
##                     Max.   :181.0   Max.   :93.00   Max.   :60.0  
##    Married         
##  Length:10         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 
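The two outputs above have the form produced by str() and summary(). A self-contained sketch, with df as an assumed name:

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4)),
  stringsAsFactors = FALSE
)

str(df)              # type and first values of every column
print(summary(df))   # quartiles and mean for numeric columns
```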

3.5 Missing values in a data frame

Check the data for missing values (NA) using is.na(), counted per column, shown cell by cell, or summed over the whole data frame.

##  Gender  Height  Weight     Age Married 
##       0       0       0       0       0
##          Gender Height Weight   Age Married
## 1st row   FALSE  FALSE  FALSE FALSE   FALSE
## 2nd row   FALSE  FALSE  FALSE FALSE   FALSE
## 3rd row   FALSE  FALSE  FALSE FALSE   FALSE
## 4th row   FALSE  FALSE  FALSE FALSE   FALSE
## 5th row   FALSE  FALSE  FALSE FALSE   FALSE
## 6th row   FALSE  FALSE  FALSE FALSE   FALSE
## 7th row   FALSE  FALSE  FALSE FALSE   FALSE
## 8th row   FALSE  FALSE  FALSE FALSE   FALSE
## 9th row   FALSE  FALSE  FALSE FALSE   FALSE
## 10th row  FALSE  FALSE  FALSE FALSE   FALSE
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [1] 0
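The checks above can be sketched as follows. The data frame name df and the choice of Height for the single-column check are assumptions.

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4))
)

na_per_column <- colSums(is.na(df))   # count of NAs in each column
na_cells      <- is.na(df)            # logical matrix, TRUE where a cell is NA
na_in_height  <- is.na(df$Height)     # one column only (column choice assumed)
na_total      <- sum(is.na(df))       # total number of NAs

print(na_per_column)
print(na_total)
```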

3.6 Adding and removing columns from a data frame

There are many different ways of adding and removing columns from a data frame.

##          Gender Height Weight Married     Status
## 1st row    Male  152.0     61     Yes    student
## 2nd row    Male  171.5     93     Yes   employee
## 3rd row    Male  165.0     78     Yes freelancer
## 4th row    Male  170.0     61     Yes    student
## 5th row  Female  155.0     45     Yes   employee
## 6th row  Female  167.0     57     Yes freelancer
## 7th row  Female  172.1     61      No    student
## 8th row  Female  165.8     55      No   employee
## 9th row  Female  167.0     56      No freelancer
## 10th row Female  181.0     71      No    student
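One way to get from the earlier table to the one above is to drop Age and add Status. This is a sketch; the original may have used a different technique, and df is an assumed name.

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Age     = c(22, 38, 26, 60, 25, 17, 26, 18, 17, 55),
  Married = rep(c("Yes", "No"), times = c(6, 4))
)

# Remove a column by assigning NULL to it
df$Age <- NULL

# Add a column by assigning a vector with one value per row
df$Status <- rep(c("student", "employee", "freelancer"), length.out = nrow(df))

print(df)
```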

3.7 Reordering the columns in a data frame

##          Gender Weight Height Married     Status
## 1st row    Male     61  152.0     Yes    student
## 2nd row    Male     93  171.5     Yes   employee
## 3rd row    Male     78  165.0     Yes freelancer
## 4th row    Male     61  170.0     Yes    student
## 5th row  Female     45  155.0     Yes   employee
## 6th row  Female     57  167.0     Yes freelancer
## 7th row  Female     61  172.1      No    student
## 8th row  Female     55  165.8      No   employee
## 9th row  Female     56  167.0      No freelancer
## 10th row Female     71  181.0      No    student
##          Gender Weight Height Married     Status
## 1st row    Male     61  152.0     Yes    student
## 2nd row    Male     93  171.5     Yes   employee
## 3rd row    Male     78  165.0     Yes freelancer
## 4th row    Male     61  170.0     Yes    student
## 5th row  Female     45  155.0     Yes   employee
## 6th row  Female     57  167.0     Yes freelancer
## 7th row  Female     61  172.1      No    student
## 8th row  Female     55  165.8      No   employee
## 9th row  Female     56  167.0      No freelancer
## 10th row Female     71  181.0      No    student
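Columns can be reordered by indexing with the desired order, either by name or by position. The two identical tables above are consistent with these two approaches, though that is an assumption about the original code.

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Height  = c(152, 171.5, 165, 170, 155, 167, 172.1, 165.8, 167, 181),
  Weight  = c(61, 93, 78, 61, 45, 57, 61, 55, 56, 71),
  Married = rep(c("Yes", "No"), times = c(6, 4)),
  Status  = rep(c("student", "employee", "freelancer"), length.out = 10)
)

# Reorder by column name
by_name <- df[, c("Gender", "Weight", "Height", "Married", "Status")]

# Reorder by column position (swap columns 2 and 3)
by_position <- df[, c(1, 3, 2, 4, 5)]

print(by_name)
```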

3.8 Convert a column to a factor

We can convert a column to a factor with R’s built-in functions. In the result below, the Married column is recoded as 1 (Yes) and 0 (No).

##          Gender Height Weight Married     Status
## 1st row    Male  152.0     61       1    student
## 2nd row    Male  171.5     93       1   employee
## 3rd row    Male  165.0     78       1 freelancer
## 4th row    Male  170.0     61       1    student
## 5th row  Female  155.0     45       1   employee
## 6th row  Female  167.0     57       1 freelancer
## 7th row  Female  172.1     61       0    student
## 8th row  Female  165.8     55       0   employee
## 9th row  Female  167.0     56       0 freelancer
## 10th row Female  181.0     71       0    student
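A possible way to obtain the recoded Married column, using ifelse() and factor(). This is a sketch under the assumption that Yes maps to 1 and No to 0; the original code is not shown.

```r
df <- data.frame(
  Gender  = rep(c("Male", "Female"), times = c(4, 6)),
  Married = rep(c("Yes", "No"), times = c(6, 4)),
  stringsAsFactors = FALSE
)

# Recode Yes/No as 1/0, then store the result as a factor
df$Married <- factor(ifelse(df$Married == "Yes", 1, 0))

print(df)
print(levels(df$Married))
```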

4. Lists

Lists are data structures whose components can have mixed data types. They allow you to store a variety of objects under one name. These objects can be matrices, vectors, data frames or even other lists. The components of a list can differ in length, characteristics and type. We can think of a list as a kind of super data type in R: you can store practically any piece of information in it.

4.1 How to create a list

A list can be created using the list() function. The syntax is list(comp1, comp2, ...), where the arguments are the list components. Remember that these components can be matrices, vectors, data frames or even other lists. The following example creates a list containing strings, numbers and vectors.

## $name
## [1] "Sami"
## 
## $hobbie
## [1] "programming" "cinema"      "statistics"  "mathematics"
## 
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
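The list printed above can be created like this. The name list1 is the one used later in the text; the values and tags come from the output.

```r
list1 <- list(
  name   = "Sami",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA"
)
print(list1)
```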

4.2 How to access components of a list

Lists can be indexed the same way as vectors. Indexing with [ gives us a sublist, not the content of the component.

Let’s see an example

  • indexing using a character vector:
## $age
## [1] 54
## 
## $hobbie
## [1] "programming" "cinema"      "statistics"  "mathematics"
  • indexing using a logical vector:
## $name
## [1] "Sami"
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
  • indexing using an integer vector:
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"

Be aware that [ returns a list.
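The three sublists above are consistent with the following index vectors. The list is rebuilt so the sketch runs on its own; the result names are assumptions.

```r
list1 <- list(
  name   = "Sami",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA"
)

by_name     <- list1[c("age", "hobbie")]              # character vector of tags
by_logical  <- list1[c(TRUE, FALSE, FALSE, TRUE)]     # keep 1st and 4th components
by_position <- list1[c(3, 4)]                         # integer positions

print(by_name)
print(class(by_name))   # still a list
```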

4.3 How to access an element of a component of a list

To retrieve the content, we need to use [[.

However, this approach will allow us to access only a single component at a time. Let’s say we want to see the name:

## [1] "Sami"

Another way of accessing content of a list is the $ operator.

## [1] "Sami"

The main advantage of using $ is the partial matching on tags that it allows. Let’s see an example:

## [1] "Sami"

Besides selecting components, we often need to select specific elements within these components. For example, to select the first element of the second component of the list we run list1[[2]][1]

## [1] "programming"
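The access methods from this section, side by side. The partial tag na is an assumed example of partial matching; any unambiguous prefix of name would behave the same way.

```r
list1 <- list(
  name   = "Sami",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA"
)

with_brackets <- list1[["name"]]   # [[ returns the content itself
with_dollar   <- list1$name        # $ does the same for a named component
with_partial  <- list1$na          # partial matching: "na" uniquely matches "name"
first_hobby   <- list1[[2]][1]     # first element of the second component

print(with_brackets)
print(first_hobby)
```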

4.4 How to modify a list in R

We can change components of a list through reassignment. For that, we can choose any of the component accessing techniques discussed above to modify it.

Note that reassignment changes a component in place; the order of the components is preserved.

or

Let’s check

## $name
## [1] "Ali"
## 
## $hobbie
## [1] "programming" "cinema"      "statistics"  "mathematics"
## 
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
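Either of the two assignments below changes the name to "Ali", as in the output above. This is a sketch of two equivalent forms, which may be what the lone "or" in the text separates.

```r
list1 <- list(
  name   = "Sami",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA"
)

# Reassign through [[ ...
list1[["name"]] <- "Ali"

# ... or, equivalently, through $
list1$name <- "Ali"

print(list1)
```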

4.5 How to add components to a list

To add new components we simply assign values using new tags.

## $name
## [1] "Ali"
## 
## $hobbie
## [1] "programming" "cinema"      "statistics"  "mathematics"
## 
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
## 
## $passed
## [1] FALSE
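Adding the passed component shown above takes a single assignment with a new tag. The list is rebuilt here so the sketch runs on its own.

```r
list1 <- list(
  name   = "Ali",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA"
)

# A tag that does not exist yet is appended to the end of the list
list1$passed <- FALSE

print(names(list1))
```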

4.6 How to delete components from a list

We can delete a component by assigning NULL to it.

## $name
## [1] "Ali"
## 
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
## 
## $passed
## [1] FALSE
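In the output above the hobbie component has disappeared, which is what assigning NULL to it does:

```r
list1 <- list(
  name   = "Ali",
  hobbie = c("programming", "cinema", "statistics", "mathematics"),
  age    = 54,
  genome = "ACGTCGTACAGTGCACGTGCA",
  passed = FALSE
)

# Assigning NULL removes the component and shortens the list
list1$hobbie <- NULL

print(names(list1))
```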

4.7 How to merge Lists

You can merge several lists into one list by combining them with the c() function. (Placing them inside list() would instead create a nested list of lists.) Let’s create a second list.

## $name
## [1] "ahmed"
## 
## $hobbie
## [1] "programming" "music"       "biology"     "mathematics"
## 
## $age
## [1] 65
## 
## $genome
## [1] "AggTTAGTAGATGTAGTCCCGTAA"

Let’s merge the two lists.

## $name
## [1] "Ali"
## 
## $age
## [1] 54
## 
## $genome
## [1] "ACGTCGTACAGTGCACGTGCA"
## 
## $passed
## [1] FALSE
## 
## $name
## [1] "ahmed"
## 
## $hobbie
## [1] "programming" "music"       "biology"     "mathematics"
## 
## $age
## [1] 65
## 
## $genome
## [1] "AggTTAGTAGATGTAGTCCCGTAA"
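The flat, eight-component result above is what c() returns when given two lists. A sketch with both lists rebuilt from the printed values; the names list2 and merged are assumptions.

```r
list1 <- list(name = "Ali", age = 54,
              genome = "ACGTCGTACAGTGCACGTGCA", passed = FALSE)

list2 <- list(
  name   = "ahmed",
  hobbie = c("programming", "music", "biology", "mathematics"),
  age    = 65,
  genome = "AggTTAGTAGATGTAGTCCCGTAA"
)

# c() concatenates the components of both lists into one flat list
merged <- c(list1, list2)

print(names(merged))
```

Because tags may repeat after merging (name appears twice here), access by position such as merged[[5]] is the reliable way to reach the later components.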

5. Arrays

Arrays are R data objects that can store data in more than two dimensions. Be aware that all the values in an array must be of the same data type.

5.1 How to create an array

An array is created using the array() function. It takes a vector as input and uses the values in the dim parameter to shape it into an array. For example, if we create an array of dimension (2, 3, 4), it creates 4 rectangular matrices, each with 2 rows and 3 columns. Arrays can store only one data type.

## [1]  5 10 15 20
## [1] 25 30 35 40 45 50 55 60

The seq() function generates a sequence of numbers: seq(from, to, by)

  • from, to: begin and end number of the sequence
  • by: step, increment (default is 1)
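The two input vectors printed above are consistent with these seq() calls. The names vector1 and vector2 are assumptions.

```r
# Names are assumed; values match the printed vectors
vector1 <- seq(from = 5, to = 20, by = 5)    # 5 10 15 20
vector2 <- seq(from = 25, to = 60, by = 5)   # 25 30 ... 60

print(vector1)
print(vector2)
```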

Let’s create our first array of dimension (2,3,4). The syntax is the following: Array_NAME <- array(data, dim = c(row_Size, column_Size, matrices), dimnames = dimnames)

  • data – the input vector that supplies the values of the array.
  • row_Size – the number of rows in each matrix.
  • column_Size – the number of columns in each matrix.
  • matrices – the number of matrices in the array (the third dimension).
  • dimnames – a list of names that replaces the default row, column and matrix labels.
## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    5   15   25
## [2,]   10   20   30
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]   35   45   55
## [2,]   40   50   60
## 
## , , 3
## 
##      [,1] [,2] [,3]
## [1,]    5   15   25
## [2,]   10   20   30
## 
## , , 4
## 
##      [,1] [,2] [,3]
## [1,]   35   45   55
## [2,]   40   50   60
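A sketch that reproduces the array above. The name myfirstarray is used later in the text; vector1 and vector2 are assumed names. The 12 input values are recycled to fill all 24 cells, which is why matrices 3 and 4 repeat matrices 1 and 2.

```r
vector1 <- seq(5, 20, by = 5)
vector2 <- seq(25, 60, by = 5)

# 2 rows, 3 columns, 4 matrices; data is filled column by column
myfirstarray <- array(c(vector1, vector2), dim = c(2, 3, 4))

print(dim(myfirstarray))
print(myfirstarray)
```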

5.2 How to rename the rows and/or the columns of an array

Let’s name the four matrices array1, array2, array3 and array4 by using array.names. The row names are changed to (“row1”,“row2”) and the column names to (“column1”,“column2”,“column3”). Each matrix has 2 rows and 3 columns.

## , , array1
## 
##      column1 column2 column3
## row1       5      15      25
## row2      10      20      30
## 
## , , array2
## 
##      column1 column2 column3
## row1      35      45      55
## row2      40      50      60
## 
## , , array3
## 
##      column1 column2 column3
## row1       5      15      25
## row2      10      20      30
## 
## , , array4
## 
##      column1 column2 column3
## row1      35      45      55
## row2      40      50      60
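The labels can be supplied through the dimnames argument as a list with one character vector per dimension. In this sketch array.names follows the text, while row_labels and column_labels are assumed names.

```r
row_labels    <- c("row1", "row2")
column_labels <- c("column1", "column2", "column3")
array.names   <- c("array1", "array2", "array3", "array4")

myfirstarray <- array(
  c(seq(5, 20, by = 5), seq(25, 60, by = 5)),
  dim      = c(2, 3, 4),
  dimnames = list(row_labels, column_labels, array.names)
)
print(myfirstarray)
```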

5.3 How to access an element of an array

Let’s see how the elements in the array can be extracted with the following examples. Let’s access the number 55. The number 55 is in the first row of the third column of the second array.

## [1] 55

We can extract an entire matrix from myfirstarray with the syntax myfirstarray[ , ,1]. Leaving the row and column positions empty selects all rows and all columns, and the 1 selects the first matrix of the array.

##      column1 column2 column3
## row1       5      15      25
## row2      10      20      30
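Both lookups use the [row, column, matrix] order. A self-contained sketch, with the array rebuilt from the values printed earlier:

```r
myfirstarray <- array(
  c(seq(5, 20, by = 5), seq(25, 60, by = 5)),
  dim      = c(2, 3, 4),
  dimnames = list(c("row1", "row2"),
                  c("column1", "column2", "column3"),
                  c("array1", "array2", "array3", "array4"))
)

# 1st row, 3rd column, 2nd matrix
value <- myfirstarray[1, 3, 2]

# All rows and columns of the 1st matrix
first_slice <- myfirstarray[, , 1]

print(value)
print(first_slice)
```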

5.4 How to create matrices from this array

Let’s visualize the first matrix

##      column1 column2 column3
## row1       5      15      25
## row2      10      20      30

Let’s visualize the second matrix

##      column1 column2 column3
## row1      35      45      55
## row2      40      50      60
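Each slice along the third dimension is an ordinary matrix and can be stored under its own name. The names matrix1 and matrix2 are assumptions.

```r
myfirstarray <- array(
  c(seq(5, 20, by = 5), seq(25, 60, by = 5)),
  dim      = c(2, 3, 4),
  dimnames = list(c("row1", "row2"),
                  c("column1", "column2", "column3"),
                  c("array1", "array2", "array3", "array4"))
)

# Extract the first two slices as standalone matrices
matrix1 <- myfirstarray[, , 1]
matrix2 <- myfirstarray[, , 2]

print(matrix1)
print(matrix2)
```

Once extracted, the matrices support the usual operations, for example matrix1 + matrix2 for element-wise addition.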