Module 2: Data structures

Objectives

At the end of this tutorial, learners should be able to:

Understand the various data types in R.
Use appropriate R functions to create each of the data types.
Use the typeof and class functions to check the data type.
Understand missing data and other special values in R.
Convert between data types.

Introduction

To make the most of the R programming language, a programmer needs a strong understanding of the basic data types and data structures, and how to operate on them. A clear understanding of R data structures is crucial since these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversion is one of the most common sources of frustration for beginners.

Required library

library(kableExtra) # display table formatting

Create the function below which will be used to display the results in enhanced tabular format.

display_results = function(data, caption){
    # function display_results - enhanced display of results in tabular format
    # Inputs
      # data - the result to display - vector, data frame, matrix, table, etc
      # caption - heading of displayed results
    kbl(data, 
        caption = caption) %>%
        kable_styling(bootstrap_options = "striped", 
                      full_width = FALSE, 
                      position = "left")
}

The R programming language has the following data structures.

Vectors
Factors
Lists
Matrices
Arrays
Data frames
Time series

The functions class() and typeof() will be used to examine the kind of data type or data structure.
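
The two functions answer slightly different questions: typeof() reports how an object is stored internally, while class() reports how R treats the object. A minimal sketch of the difference follows; the objects f, d and m are made up for illustration.

```r
# a factor is stored as integer codes, but its class is "factor"
f = factor(c("Low", "High", "Low"))
typeof(f)    # "integer"
class(f)     # "factor"

# a data frame is stored as a list of equal-length columns
d = data.frame(x = 1:3, y = c("a", "b", "c"))
typeof(d)    # "list"
class(d)     # "data.frame"

# a matrix built from whole numbers generated with the colon operator
m = matrix(1:6, nrow = 2)
typeof(m)    # "integer"
is.matrix(m) # TRUE
```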

Vectors

This is the simplest data structure in the R programming language; it stores elements of the same data type. The concatenating function c() is used to combine an arbitrary number of elements (or other vectors) into a single vector. A vector can be entered with only numeric values, only character values, or a mix of the two; in the mixed case, R converts every element to character so that all elements share one data type. Character values must always be enclosed in quotation marks (double or single).

Creating vectors

There are multiple ways of creating vectors in R. The most common way is to use the function c() which concatenates multiple elements (or other vectors) into a single vector.

Numeric vectors

These are vectors whose elements are all numeric (double and/or integers). Below are a few examples.

a = c(125, 105, 143, 132, 116)
a; typeof(a); class(a)

## [1] 125 105 143 132 116

## [1] "double"

## [1] "numeric"

b = as.integer(c(104, 142, 165, 128))
b; typeof(b); class(b)

## [1] 104 142 165 128

## [1] "integer"

## [1] "integer"

# combine above vectors
c = c(a, b)
c; typeof(c); class(c)

## [1] 125 105 143 132 116 104 142 165 128

## [1] "double"

## [1] "numeric"

Character vectors

These are vectors whose elements are characters or a combination of characters and numeric values. It is important to note that the presence of a single character in a vector causes that vector to be stored as character vector (even if all the other elements are numeric).

a = c("James Kioko", "Mary Achieng", "Daniel Kipkirui", "Christine Lesuda")
a; typeof(a)

## [1] "James Kioko"      "Mary Achieng"     "Daniel Kipkirui"  "Christine Lesuda"

## [1] "character"

b = c(104, 142, 165, 128, "N/A", 148, 120) # presence of "N/A" causes every element to be character
b; typeof(b)

## [1] "104" "142" "165" "128" "N/A" "148" "120"

## [1] "character"

There are multiple other ways of creating vectors. They include the following:

seq()
rep()
gl()
sample()
paste()
paste0()

Below we demonstrate how each of the above functions can be applied to generate vectors.

seq()

Generates a sequence of equally spaced values between the numbers $a$ and $b$ by either specifying the interval $\Delta$ or the number of values $n.$

# specifies the increment
v1 = seq(from = -100, to = 250, by = 25)
v1

##  [1] -100  -75  -50  -25    0   25   50   75  100  125  150  175  200  225  250

v2 = seq(from = -102, to = 250, by = 25) # sequence will not get to 250
v2

##  [1] -102  -77  -52  -27   -2   23   48   73   98  123  148  173  198  223  248

v3 = seq(from = 154, to = 131, by = -3) # decreasing sequence, will not get to 131
v3

## [1] 154 151 148 145 142 139 136 133

# specifies number of values to return, the increment will be calculated (135-113) / (9-1)
v4 = seq(from = 113, to = 135, length.out = 9)
v4

## [1] 113.00 115.75 118.50 121.25 124.00 126.75 129.50 132.25 135.00

rep()

Replicates an object a specified number of times. The first argument is the object to be replicated. The arguments times and each specify, respectively, how many times the whole vector should be replicated and how many times each element of the vector should be replicated.

v = c(2, 6, 3, 0, 4)
# replicate each value of the vector v 2 times
v1 = rep(v, each = 2)
v1

##  [1] 2 2 6 6 3 3 0 0 4 4

# replicate the vector v 3 times
v2 = rep(v, times = 3)
v2

##  [1] 2 6 3 0 4 2 6 3 0 4 2 6 3 0 4

# replicate each value of the vector v 3 times, then replicate the resulting vector 2 times
v3 = rep(v, each = 3, times = 2)
v3

##  [1] 2 2 2 6 6 6 3 3 3 0 0 0 4 4 4 2 2 2 6 6 6 3 3 3 0 0 0 4 4 4

# repeat Female 3 times and Male 5 times, then repeat the resulting vector 4 times
v4 = rep(rep(c("Female", "Male"), times = c(3, 5)), times = 4)
v4

##  [1] "Female" "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Male"  
##  [9] "Female" "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Male"  
## [17] "Female" "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Male"  
## [25] "Female" "Female" "Female" "Male"   "Male"   "Male"   "Male"   "Male"

gl()

This function generates factors by specifying the pattern of their levels. The code below replicates each of the labels Mild, Moderate and Severe four times, then converts the result to an ordered factor (i.e. the argument ordered = TRUE).

v1 = gl(n = 3, k = 4, length = 3 * 4, 
        labels = c("Mild", "Moderate", "Severe"), 
        ordered = TRUE)
v1

##  [1] Mild     Mild     Mild     Mild     Moderate Moderate Moderate Moderate
##  [9] Severe   Severe   Severe   Severe  
## Levels: Mild < Moderate < Severe

sample()

Generates a random sample of numbers from a specified range, or of strings from a specified character vector. The arguments x, size, replace and prob specify, respectively, the object from which to draw the sample, the number of values to draw, whether to sample with or without replacement, and the (approximate) probabilities of inclusion of each element.

# generate 20 values between 18 and 60 inclusive without replacement 
# i.e. none of the values should be repeated - replace = FALSE (it is the default)
v1 = sample(x = 18:60, size = 20, replace = FALSE)
v1

##  [1] 30 49 50 53 42 48 51 38 25 43 35 34 21 52 55 56 33 29 59 47

# generate 15 values of Female and Male with replacement where Females are approximately 2/3 
# and Males are approximately 1/3 (i.e. females are approximately 67% and males 33%)
v2 = sample(x = c("Female","Male"), size = 15,
            replace = TRUE, prob = c(2, 1)/3)
v2

##  [1] "Male"   "Female" "Female" "Female" "Female" "Female" "Male"   "Female"
##  [9] "Male"   "Female" "Male"   "Female" "Female" "Female" "Female"

paste()

This is yet another helpful function for creating vectors by putting different values (numbers, text or special characters) together.

monthnumber = 1:12
monthname = month.name
v1 = paste(monthnumber, monthname, sep = " --> ")
v1

##  [1] "1 --> January"   "2 --> February"  "3 --> March"     "4 --> April"    
##  [5] "5 --> May"       "6 --> June"      "7 --> July"      "8 --> August"   
##  [9] "9 --> September" "10 --> October"  "11 --> November" "12 --> December"

v2 = paste("QTR ", 1:4, ": ", 2018, sep = "")
v2

## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"

paste0()

This function is similar to paste() except that it combines values with no space between them.

# for paste() you will need to add sep = "", otherwise R will put spaces
v1 = paste("QTR ", 1:4, ": ", 2018, sep = "")
v1

## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"

# instead of using paste() with the argument sep = "", just use paste0() as follows
v2 = paste0("QTR ", 1:4, ": ", 2018)
v2

## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"

Vectors can also be generated using probability distribution functions. This will be discussed in a different section.

Operations on vectors

Most operations that occur in mathematics can be applied to vectors to produce new results. These operations and functions are presented in the table below.
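
A few of these operations can be sketched as follows. The vectors x and y are made up for illustration, and the selection of functions is not exhaustive.

```r
x = c(2, 4, 6, 8)
y = c(1, 3, 5, 7)

x + y       # element-wise addition: 3 7 11 15
x * y       # element-wise multiplication: 2 12 30 56
x / 2       # the scalar 2 is recycled: 1 2 3 4
x^2         # element-wise power: 4 16 36 64
x > 4       # logical comparison: FALSE FALSE TRUE TRUE

sum(x)      # 20
mean(x)     # 5
length(x)   # 4
cumsum(x)   # running total: 2 6 12 20
```

Arithmetic operators work element by element, so two vectors of the same length are combined position by position.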

Slicing vectors

Knowing how to slice objects in R is critical when working with most statistical routines. In this section, we briefly demonstrate how to slice vectors.
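
As a minimal sketch, elements are selected with square brackets using positions, negative positions (to exclude) or logical conditions. The vector v below is illustrative.

```r
v = c(43, 62, 50, 45, 35, 28, 31, 34, 38, 41)

v[3]          # third element: 50
v[2:4]        # elements 2 to 4: 62 50 45
v[c(1, 10)]   # first and last elements: 43 41
v[-1]         # everything except the first element
v[v > 40]     # elements greater than 40: 43 62 50 45 41
head(v, 3)    # first three elements: 43 62 50
tail(v, 2)    # last two elements: 38 41
```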

Replacing elements of a vector

More often than not, you will be required to replace values of a vector, either by direct input or from mathematical calculations. R provides advanced ways of doing this; in this section, we consider only the most basic ones.

# create a vector to use
v = c(43, 62, 50, 45, 35, 28, 31, 34, 38, 41)
# replace elements in positions 2, 7 and 4 with 200, 400 and 300 respectively
v1 = v # create a copy of the vector v
v1[c(2, 7, 4)] = c(200, 400, 300)
# combine into data frame
d1 = data.frame(Original = v, Replaced = v1)
d1

##    Original Replaced
## 1        43       43
## 2        62      200
## 3        50       50
## 4        45      300
## 5        35       35
## 6        28       28
## 7        31      400
## 8        34       34
## 9        38       38
## 10       41       41

# multiply even values by 10
v2 = v # create copy of vector v
v2[which(v2 %% 2 == 0)] = v2[v2 %% 2 == 0] * 10
# combine into data frame
d2 = data.frame(Original = v, Replaced = v2)
d2

##    Original Replaced
## 1        43       43
## 2        62      620
## 3        50      500
## 4        45       45
## 5        35       35
## 6        28      280
## 7        31       31
## 8        34      340
## 9        38      380
## 10       41       41

In the above examples, the function data.frame() has been used to combine the vectors into a data frame. Data frames are discussed below.

Sorting values of a vector

Vectors (both numeric and character) can be sorted using the function sort(). The arguments decreasing and na.last can be added to specify the direction of sorting (ascending or descending) and where missing values, if any, should be placed (at the beginning or at the end).

# create a vector
u = c(NA, "A", "B", "F", "D", NA, "K", "M", "D", "C")
# decreasing order, missing values come last
v1 = sort(u, decreasing = TRUE, na.last = TRUE)
v1

##  [1] "M" "K" "F" "D" "D" "C" "B" "A" NA  NA

# create another vector
v = c(26, 35, NA, 38, NA, 31, 40, 28, 35)
# ascending order, missing values come first
v2 = sort(v, decreasing = FALSE, na.last = FALSE)
v2

## [1] NA NA 26 28 31 35 35 38 40

Adding ranks to a vector

Ranks are assigned to elements of a vector using the function rank(), which accepts the argument ties.method to specify how tied values should be treated (min, max, first, last, average, random) and the argument na.last to specify what happens to missing values (TRUE, FALSE, NA, "keep").

# create a vector
v = c(26, 35, NA, 38, NA, 31, 38, 28, 35)
# excludes missing values before ranking
v1 = rank(v, ties.method = "min", na.last = NA)
v1

## [1] 1 4 6 3 6 2 4

# rank the first of the tied values first and retain missing values but do not rank them
v2 = rank(v, ties.method = "first", na.last = "keep")
d1 = data.frame(v = v, Ranks = v2)
d1

##    v Ranks
## 1 26     1
## 2 35     4
## 3 NA    NA
## 4 38     6
## 5 NA    NA
## 6 31     3
## 7 38     7
## 8 28     2
## 9 35     5

# take average ranks, and missing values should be ranked last
v3 = rank(v, ties.method = "average", na.last = TRUE)
d2 = data.frame(v = v, Ranks = v3)
d2

##    v Ranks
## 1 26   1.0
## 2 35   4.5
## 3 NA   8.0
## 4 38   6.5
## 5 NA   9.0
## 6 31   3.0
## 7 38   6.5
## 8 28   2.0
## 9 35   4.5

# take average ranks, and missing values should be ranked first
v4 = rank(v, ties.method = "average", na.last = FALSE)
d3 = data.frame(v = v, Ranks = v4)
d3

##    v Ranks
## 1 26   3.0
## 2 35   6.5
## 3 NA   1.0
## 4 38   8.5
## 5 NA   2.0
## 6 31   5.0
## 7 38   8.5
## 8 28   4.0
## 9 35   6.5

Factors

Factor variables are categorical variables that can either be numeric or string. The function factor() is used to create factor variables - it requires a vector of values which can either be string or numeric among other optional arguments.

# create a random vector of socio-economic status (ses)
ses = sample(x = c("Low", "Middle", "High"), 
             size = 60, replace = TRUE, 
             prob = c(25, 60, 15))
ses

##  [1] "Low"    "Middle" "Middle" "Middle" "Middle" "Middle" "High"   "Low"   
##  [9] "Low"    "Middle" "Low"    "Middle" "High"   "Middle" "High"   "Middle"
## [17] "Middle" "Middle" "Middle" "Middle" "Low"    "High"   "High"   "Middle"
## [25] "Middle" "Middle" "Middle" "Low"    "Middle" "Low"    "Middle" "Middle"
## [33] "Low"    "Middle" "Low"    "Middle" "High"   "Low"    "High"   "Low"   
## [41] "Middle" "Middle" "Middle" "Low"    "High"   "Middle" "Middle" "Low"   
## [49] "Middle" "Middle" "Middle" "High"   "Middle" "Middle" "Middle" "Middle"
## [57] "Middle" "Middle" "Middle" "Low"

Converting string variables to factors is useful in a number of ways. In the code below, we show the result of tabulating a character vector (i.e. not yet converted to a factor).

kbl(table(ses), 
    caption = "Table 1") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1
ses	Freq
High	9
Low	14
Middle	37

As you can see from the above results, the categories are in an order that is undesirable. Ideally, you would want the order in the table to be Low, Middle and High. This is achieved by converting ses to factor and specifying the correct order of the categories using the levels argument as is shown in the code below.

# convert to factor and order the categories
ses1 = factor(ses, levels = c("Low", "Middle", "High"))
ses1

##  [1] Low    Middle Middle Middle Middle Middle High   Low    Low    Middle
## [11] Low    Middle High   Middle High   Middle Middle Middle Middle Middle
## [21] Low    High   High   Middle Middle Middle Middle Low    Middle Low   
## [31] Middle Middle Low    Middle Low    Middle High   Low    High   Low   
## [41] Middle Middle Middle Low    High   Middle Middle Low    Middle Middle
## [51] Middle High   Middle Middle Middle Middle Middle Middle Middle Low   
## Levels: Low Middle High

Now tabulate the results again and note that the socio-economic status categories are presented in the correct order.

kbl(table(ses1), 
    caption = "Table 2") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2
ses1	Freq
Low	14
Middle	37
High	9

We can create ordered factor variables by using the function ordered() which has the same arguments as the factor() function. In the code below, we create an ordered factor from the ses vector created above.

# convert to ordered factor
ses2 = ordered(ses, levels = c("Low", "Middle", "High")) 
# the above is equivalent to factor(ses, levels = c("Low", "Middle", "High"), ordered = TRUE)
# i.e. use the function factor() with the argument ordered = TRUE
ses2

##  [1] Low    Middle Middle Middle Middle Middle High   Low    Low    Middle
## [11] Low    Middle High   Middle High   Middle Middle Middle Middle Middle
## [21] Low    High   High   Middle Middle Middle Middle Low    Middle Low   
## [31] Middle Middle Low    Middle Low    Middle High   Low    High   Low   
## [41] Middle Middle Middle Low    High   Middle Middle Low    Middle Middle
## [51] Middle High   Middle Middle Middle Middle Middle Middle Middle Low   
## Levels: Low < Middle < High

Once an ordered factor has been created, it can easily be converted into a numeric vector using the function as.numeric() as follows. The function as.double() could also be used to convert it into a numeric vector. Note that the function as.integer() converts the ordered factor into a vector whose data type is integer.

# using the function as.numeric()
ses.num = as.numeric(ses2)
ses.num

##  [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1

class(ses.num)

## [1] "numeric"

# using the function as.double()
ses.dou = as.double(ses2)
ses.dou

##  [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1

class(ses.dou)

## [1] "numeric"

# using the function as.integer()
ses.int = as.integer(ses2)
ses.int

##  [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1

class(ses.int)

## [1] "integer"
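
One caution when converting factors to numbers: as.numeric() returns the internal level codes, not the labels. This matters when the labels are themselves numbers. The sketch below uses a hypothetical factor f to show the difference and two common ways of recovering the original values.

```r
# a factor whose labels look like numbers
f = factor(c("10", "20", "10", "50"))

as.numeric(f)                 # level codes: 1 2 1 3
as.numeric(as.character(f))   # original values: 10 20 10 50
as.numeric(levels(f))[f]      # same result: 10 20 10 50
```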

Generate factor vector using the cut() function

The cut() function is used to convert a numeric vector into a factor. The breaks parameter controls how the ranges of the numbers in the numeric vector will be converted to factor values. If a number is provided through the breaks parameter, the resulting factor will be created by dividing the range of the variable into that number of equal length intervals. On the other hand, if a vector of values is provided, the values in the vector are used to determine the breakpoints (number of levels of the resultant factor will be one less than the number of values in the specified vector).

Begin by creating a random numeric vector.

age = sample(x = c(20:70), size = 250, replace = TRUE)

Next, we create a factor corresponding to age with the five age groups “20-29”, “30-39”, “40-49”, “50-59” and “60+”. This is illustrated in the code below.
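
A minimal sketch of this step, using a small fixed vector (age_demo, a made-up name) instead of the random age vector so that the result is reproducible. The argument right = FALSE makes each interval closed on the left, so that 30 falls in “30-39” rather than “20-29”, and Inf is assumed as the upper limit of the last group.

```r
age_demo = c(20, 29, 30, 45, 59, 60, 70)

age_group = cut(age_demo,
                breaks = c(20, 30, 40, 50, 60, Inf),
                labels = c("20-29", "30-39", "40-49", "50-59", "60+"),
                right = FALSE)   # intervals are [20, 30), [30, 40), ...
age_group
table(age_group)
```

The same call can be applied to the vector age created above by replacing age_demo with age.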

To produce factors based on percentiles of your data, the quantile() function can be used to generate the argument for the breaks parameter, ensuring approximately equal number of observations in each of the levels of the factor. This is done as follows.
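
A sketch of the percentile approach, assuming quartiles as the cut points and a small made-up vector x. The argument include.lowest = TRUE keeps the minimum value, which would otherwise fall outside the first interval.

```r
x = c(12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45)

# quartile cut points: minimum, 25%, 50%, 75% and maximum
q = quantile(x, probs = seq(0, 1, by = 0.25))

quartile = cut(x, breaks = q, include.lowest = TRUE,
               labels = c("Q1", "Q2", "Q3", "Q4"))
table(quartile)   # three observations in each level
```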

Create factor from dates

In the following code, we use the function factor() to get the number of days in each month of the year 2020. Results are presented immediately after the code.

days = seq(from = as.Date('2020-1-1'), to = as.Date('2020-12-31'), by = 'day')
months = format(days, '%b')
months = factor(months, levels = unique(months), ordered = TRUE)
# display
display_results(data = table(months), caption = "Table 5")

Table 5
months	Freq
Jan	31
Feb	29
Mar	31
Apr	30
May	31
Jun	30
Jul	31
Aug	31
Sep	30
Oct	31
Nov	30
Dec	31

Lists

Lists are constructed in R using the function list(). Lists are most often used to store objects of different data types and of varying dimensions. One of their common uses is in functions (both built-in and user-defined), where a list is used to return multiple outputs. For example, in regression analysis, the output is a list object. The elements of this named list include coefficients, residuals, fitted values, etc. The code below shows how to create a list in R.

# create the objects mn, gweeks and bweight
mn = c("Jane", "Mary", "Christine", "Rose","Janet", 
       "Martha", "Mercy","Joyce", "Hillary","Joylin")
gweeks = c(35.26, 38.74, 38.40, 36.65, 38.19, 37.04, 37.25, 39.68, 34.09, 35.86)
bweight = c(3.58, 3.56, 3.48, 3.26, 2.64, 2.49, 2.86, 3.26, 2.78, 2.84)
wkswgt = data.frame(gestweeks = gweeks, birthweight = bweight)
# create the list here
births = list (mothersname = mn, weeksweight = wkswgt)
births

## $mothersname
##  [1] "Jane"      "Mary"      "Christine" "Rose"      "Janet"     "Martha"   
##  [7] "Mercy"     "Joyce"     "Hillary"   "Joylin"   
## 
## $weeksweight
##    gestweeks birthweight
## 1      35.26        3.58
## 2      38.74        3.56
## 3      38.40        3.48
## 4      36.65        3.26
## 5      38.19        2.64
## 6      37.04        2.49
## 7      37.25        2.86
## 8      39.68        3.26
## 9      34.09        2.78
## 10     35.86        2.84

As can be seen, the object births has been stored as a list which contains the objects, mothersname which is a vector and weeksweight which is a data frame. Each object in the list can be extracted using the $ or double square brackets [[i]] operation where i is the object index e.g. 1, 2, 3, …, n. For example, to extract the mother’s name use, births$mothersname or births[[1]].

births[1] # returns a list of length one that keeps the element name (mothersname)

## $mothersname
##  [1] "Jane"      "Mary"      "Christine" "Rose"      "Janet"     "Martha"   
##  [7] "Mercy"     "Joyce"     "Hillary"   "Joylin"

births[[1]] # returns the element itself (a character vector), without the name

##  [1] "Jane"      "Mary"      "Christine" "Rose"      "Janet"     "Martha"   
##  [7] "Mercy"     "Joyce"     "Hillary"   "Joylin"

births$weeksweight

##    gestweeks birthweight
## 1      35.26        3.58
## 2      38.74        3.56
## 3      38.40        3.48
## 4      36.65        3.26
## 5      38.19        2.64
## 6      37.04        2.49
## 7      37.25        2.86
## 8      39.68        3.26
## 9      34.09        2.78
## 10     35.86        2.84
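
The earlier remark that regression output is a list can be checked with a small sketch. It fits a simple linear model to the built-in cars dataset, which is an assumption made for illustration and not part of the births example.

```r
# fit a simple linear regression on the built-in cars dataset
fit = lm(dist ~ speed, data = cars)

is.list(fit)        # TRUE - the fitted model is stored as a list
class(fit)          # "lm"
names(fit)          # includes "coefficients", "residuals", "fitted.values"

# elements are extracted exactly as with any other named list
fit$coefficients
fit[["residuals"]][1:3]
```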

Matrices

A matrix is a set of real or complex numbers or elements arranged in rows and columns to form a rectangular array. The R software provides excellent and advanced capabilities of handling and manipulating matrices.

Creating matrices

The function matrix() is used to create matrices. The general syntax is given below.

M = matrix(vector_with_matrix_elements,
           nrow = number_of_rows,
           ncol = number_of_columns,
           byrow = FALSE/TRUE,
           dimnames = list(c("row names"), c("column names")))

Note that some of the parameters in the syntax above are optional. The syntax, M = matrix(vector_with_matrix_elements, nrow = number of rows) is sufficient to generate a matrix with $n$ rows. In this case, the number of columns is calculated as $\dfrac{\text{number of elements}}{\text{number of rows}}$ and the elements are arranged in the matrix by the columns (i.e. downwards). In addition, the matrix created will not have row and column names.

Example

Coughlin et al () examined breast and cervical screening practices of Hispanic and non-Hispanic women in counties that approximate the U.S. southern border region. The study used data from Behavioral Risk Factor Surveillance System surveys of adults aged 18 years or older conducted in 1999 and 2000. The following table shows the number of observations of Hispanic and non-Hispanic women who had received a mammogram in the past 2 years, cross-classified by marital status. Write R code that will capture the contents of the following table.

Solution

The syntax is given below. We have entered the row and column totals manually; the functions rowSums() and colSums() could have been used to calculate the row and column totals respectively.

# create a vector of the matrix elements (by columns)
v = c(319, 130, 88, 41, 578,
      738, 329, 402, 95, 1564,
      1057, 459, 490, 136, 2142)
# create row names (rnames) and column names (cnames)
cnames = c("Hispanic", "non Hispanic", "Total")
rnames = c("Currently married", "Divorced or separated",
           "Widowed","Never married / unmarried couple", "Total")
# create the matrix
M = matrix(v,
           nrow = 5, ncol = 3, byrow = FALSE,
           dimnames = list(rnames, cnames))
M

##                                  Hispanic non Hispanic Total
## Currently married                     319          738  1057
## Divorced or separated                 130          329   459
## Widowed                                88          402   490
## Never married / unmarried couple       41           95   136
## Total                                 578         1564  2142
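
As an alternative to typing the totals by hand, the same matrix can be built from the eight observed counts only. This is a sketch; the names counts, with_row_totals and M2 are introduced here for illustration.

```r
# observed counts only (entered by columns), without totals
counts = matrix(c(319, 130, 88, 41,
                  738, 329, 402, 95),
                nrow = 4, ncol = 2, byrow = FALSE,
                dimnames = list(c("Currently married", "Divorced or separated",
                                  "Widowed", "Never married / unmarried couple"),
                                c("Hispanic", "non Hispanic")))

# append a column of row totals, then a row of column totals
with_row_totals = cbind(counts, Total = rowSums(counts))
M2 = rbind(with_row_totals, Total = colSums(with_row_totals))
M2
```

Computing the totals avoids arithmetic slips and keeps them in step with the counts if any count is later corrected.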

Matrix operators and functions

The following table presents some of the common operators and functions that are applicable to matrices.
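
A few common matrix operators and functions can be sketched as follows. The matrices A and B are made up for illustration, and the selection is not exhaustive.

```r
A = matrix(c(2, 1, 1, 3), nrow = 2)   # columns are (2, 1) and (1, 3)
B = matrix(c(1, 0, 2, 1), nrow = 2)   # columns are (1, 0) and (2, 1)

A + B       # element-wise addition
A * B       # element-wise multiplication
A %*% B     # matrix multiplication
t(A)        # transpose
det(A)      # determinant: 5
solve(A)    # inverse
dim(A)      # dimensions: 2 2
diag(A)     # diagonal elements: 2 3
```

Note that * multiplies element by element, while %*% performs matrix multiplication.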

Matrix slicing

Particular values, rows, columns or a combination of these can be obtained from a matrix by applying the slicing techniques in R. The table below presents a few examples.
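
A minimal sketch of matrix slicing with the [row, column] notation follows. The matrix S and its row and column names are hypothetical.

```r
S = matrix(1:12, nrow = 3,
           dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4")))

S[2, 3]               # element in row 2, column 3: 8
S[1, ]                # first row: 1 4 7 10
S[, 2]                # second column: 4 5 6
S[2:3, c(1, 4)]       # rows 2 to 3 of columns 1 and 4
S["r3", "c2"]         # slicing by names: 6
S[, 2, drop = FALSE]  # keep the result as a one-column matrix
```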

Arrays

Arrays are generalizations of vectors and matrices, where a vector is a one-dimensional array and a matrix is a two-dimensional array. As with vectors and matrices, all elements of an array must be of the same data type. The array() function creates an array from a vector of values and a dim argument that specifies the dimensions. In the example below, a three-dimensional array with the dimensions Outcome, Treatment and Age Group is created from the vector control.

# create a vector
control = c(8, 98, 106, 15, 115, 130, 23, 213, 236, 22, 76, 98, 16, 69, 85, 38, 145, 183)
# create the array
data = array(control, dim = c(3, 3, 2))
# assign array names, row and column names
dimnames(data) = list(Outcome = c("Death", "Survivor", "Total"),
                      Treatment = c("Drug", "Placebo", "Total"),
                      "Age Group" = c("Age < 55", "Age >= 55"))
data

## , , Age Group = Age < 55
## 
##           Treatment
## Outcome    Drug Placebo Total
##   Death       8      15    23
##   Survivor   98     115   213
##   Total     106     130   236
## 
## , , Age Group = Age >= 55
## 
##           Treatment
## Outcome    Drug Placebo Total
##   Death      22      16    38
##   Survivor   76      69   145
##   Total      98      85   183

All basic arithmetic operations that apply to vectors and matrices are also applicable to arrays. What differs slightly is slicing. Below we give a few examples of slicing arrays.

# second row of the first matrix of the array.
data[2, , 1]

##    Drug Placebo   Total 
##      98     115     213

# element in the 3rd row and 2nd column of the 1st matrix.
data[3, 2, 1]

## [1] 130

# 2nd Matrix.
data[, , 2]

##           Treatment
## Outcome    Drug Placebo Total
##   Death      22      16    38
##   Survivor   76      69   145
##   Total      98      85   183

# arithmetic
M1 = data[, , 1] # first matrix (i.e. Age Group: Age < 55)
M2 = data[, , 2] # second matrix (i.e. Age Group: Age >= 55)
M3 = M1 + M2
M3

##           Treatment
## Outcome    Drug Placebo Total
##   Death      30      31    61
##   Survivor  174     184   358
##   Total     204     215   419

Data frames

While vectors, matrices and arrays require that all elements be of the same data type, data frames can store objects of different data types (heterogeneous). This makes data frames convenient data structures for data management, manipulation and analysis in R. This is probably the most common data structure that you will likely encounter in your data management, analysis, visualization and programming tasks. Below we just show one of the most basic tasks in R: creating a data frame from vectors. Other operations / manipulations on data frames can be found elsewhere on this website.

# create vectors with random values
set.seed(1234)
gender = sample(x = c("Female", "Male"), size = 10, 
                replace = TRUE, prob = c(60, 40)/100)
age = sample(18:58, size = length(gender))
income = sample(1000:9000, size = length(gender)) * 5
# create data frame
df = data.frame(Gender = gender, Age = age, "Income" = income)
# display
display_results(data = df, caption = "Table 6: Data frames.")


Time series

In time series, the values of a variable are observed at regular time intervals, which may be daily, weekly, monthly, quarterly, semi-annual, yearly, etc. To create a time series object in R, the function ts() is used as illustrated by the syntax below.

# create some random values
set.seed(1234)
v = sample(200:900, size = 20) * 5
data.ts = ts(data = v,
             start = c(2015, 4), freq = 4) # freq = 4 implies quarterly
data.ts

##      Qtr1 Qtr2 Qtr3 Qtr4
## 2015                2415
## 2016 1500 4110 4220 2995
## 2017 1485 1510 4005 2625
## 2018 1390 2345 2905 1915
## 2019 3865 1015 4300 3755
## 2020 2055 1970 3550

The syntax below gives a time series with frequency of 12, i.e. monthly data.

# create some random values
set.seed(1234)
v = sample(200:900, size = 50) * 5
data.ts = ts(data = v,
             start = c(2015, 4), freq = 12) # freq = 12 implies monthly
data.ts

##       Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
## 2015                2415 1500 4110 4220 2995 1485 1510 4005 2625
## 2016 1390 2345 2905 1915 3865 1015 4300 3755 2055 1970 3550 3390
## 2017 4020 4165 3885 3545 3115 2890 1535 1650 2710 1200 4130 2485
## 2018 2285 4140 4455 1905 2520 2785 2530 4315 2100 3800 2560 1675
## 2019 1720 1610 2165 4035 3470

Note that the display format as displayed in the console by R is for presentation purposes. Internally, the data is stored as a time series object (a single variable). This can be viewed using the command View(data.ts). For a multivariate time series, replace the vector object v by a matrix object say M as is shown in the code below.

# create some random values
set.seed(1234)
M = matrix(sample(200:900, size = 40) * 5, nrow = 8) # matrix
data.ts = ts(data = M,
             start = c(2015, 1), freq = 4)
data.ts

##         Series 1 Series 2 Series 3 Series 4 Series 5
## 2015 Q1     2415     2625     3755     3545     2485
## 2015 Q2     1500     1390     2055     3115     2285
## 2015 Q3     4110     2345     1970     2890     4140
## 2015 Q4     4220     2905     3550     1535     4455
## 2016 Q1     2995     1915     3390     1650     1905
## 2016 Q2     1485     3865     4020     2710     2520
## 2016 Q3     1510     1015     4165     1200     2785
## 2016 Q4     4005     4300     3885     4130     2530

Convert data structures

Data structures can be converted from one to another using the following functions.

Function	Description
`as.vector()`	Convert to a vector
`as.factor()`	Convert to a factor
`as.matrix()`	Convert to a matrix
`as.array()`	Convert to an array
`as.list()`	Convert to a list
`as.data.frame()`	Convert to a data frame
`as.ts()`	Convert to a time series
`as.table()`	Convert to a table
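
A short sketch of some of these conversions follows. The objects m, d, l, v and f are made up for illustration.

```r
# start from a 2 x 3 matrix with named columns
m = matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))

d = as.data.frame(m)   # matrix -> data frame (2 rows, 3 columns)
class(d)               # "data.frame"

l = as.list(d)         # data frame -> list with one element per column
length(l)              # 3

v = as.vector(m)       # matrix -> vector (read column by column): 1 2 3 4 5 6

f = as.factor(c("x", "y", "x"))   # character vector -> factor
levels(f)              # "x" "y"
```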

STEM Research: Technology for Innovation

https://stemrecloud.com

Last edited on: 2022-06-25
