At the end of this tutorial, learners should be able to:
Understand the various data types in R.
Use appropriate R functions to create each of the data types.
Use the typeof and class functions to check the data type.
Understand missing data and other special values in R.
Convert between data types.
To get the most out of the R programming language, a programmer needs a strong understanding of the basic data types and data structures, and how to operate on them. A clear understanding of R data structures is crucial since these are the objects you will manipulate on a day-to-day basis in R. Dealing with object conversion is one of the most common sources of frustration for beginners.
library(kableExtra) # display table formatting
Create the function below which will be used to display the results in enhanced tabular format.
display_results = function(data, caption){
  # function display_results - enhanced display of results in tabular format
  # Inputs
  # data - the result to display - vector, data frame, matrix, table, etc
  # caption - heading of displayed results
  kbl(data,                      # use the data passed in, not a fixed object
      caption = caption) %>%     # use the caption passed in
    kable_styling(bootstrap_options = "striped",
                  full_width = FALSE,
                  position = "left")
}
The R programming language has the following data structures.
Vectors
Factors
Lists
Matrices
Arrays
Data frames
Time series
The functions class() and typeof() will be used to examine the kind of data type or data structure.
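The difference between the two functions can be sketched as follows: typeof() reports how an object is stored internally, while class() reports how R treats it. The objects x, f and d below are hypothetical examples introduced only for illustration.

```r
# typeof() gives the internal storage type; class() gives the object's class
x = c(1.5, 2, 3)                 # numeric vector
f = factor(c("a", "b", "a"))     # factor, stored internally as integer codes
d = data.frame(id = 1:2)         # data frame, stored internally as a list of columns

typeof(x); class(x)   # "double"  and "numeric"
typeof(f); class(f)   # "integer" and "factor"
typeof(d); class(d)   # "list"    and "data.frame"
```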
A vector is the simplest data structure in the R programming language; it stores elements of the same data type. A vector can hold numeric values or character values, but not a true mixture of the two: if numeric and character values are combined, R converts every element to character. Character values must be enclosed in quotation marks (double or single).
There are multiple ways of creating vectors in R. The most common is the concatenating function c(), which combines an arbitrary number of elements (or other vectors) into a single vector.
Numeric vectors
These are vectors whose elements are all numeric (double and/or integers). Below are a few examples.
a = c(125, 105, 143, 132, 116)
a; typeof(a); class(a)
## [1] 125 105 143 132 116
## [1] "double"
## [1] "numeric"
b = as.integer(c(104, 142, 165, 128))
b; typeof(b); class(b)
## [1] 104 142 165 128
## [1] "integer"
## [1] "integer"
# combine above vectors (named ab to avoid masking the function c())
ab = c(a, b)
ab; typeof(ab); class(ab)
## [1] 125 105 143 132 116 104 142 165 128
## [1] "double"
## [1] "numeric"
Character vectors
These are vectors whose elements are character strings. It is important to note that when character and numeric values are combined, the presence of a single character element causes the whole vector to be stored as a character vector (even if all the other elements are numeric).
a = c("James Kioko", "Mary Achieng", "Daniel Kipkirui", "Christine Lesuda")
a; typeof(a)
## [1] "James Kioko" "Mary Achieng" "Daniel Kipkirui" "Christine Lesuda"
## [1] "character"
b = c(104, 142, 165, 128, "N/A", 148, 120) # presence of "N/A" causes every element to be character
b; typeof(b)
## [1] "104" "142" "165" "128" "N/A" "148" "120"
## [1] "character"
There are multiple other ways of creating vectors. They include the following:
seq()
rep()
gl()
sample()
paste()
paste0()
Below we demonstrate how each of the above functions can be applied to generate vectors.
seq()
Generates a sequence of equally spaced values between the numbers \(a\) and \(b\) by specifying either the interval \(\Delta\) or the number of values \(n\).
# specifies the increment
v1 = seq(from = -100, to = 250, by = 25)
v1
## [1] -100 -75 -50 -25 0 25 50 75 100 125 150 175 200 225 250
v2 = seq(from = -102, to = 250, by = 25) # sequence will not get to 250
v2
## [1] -102 -77 -52 -27 -2 23 48 73 98 123 148 173 198 223 248
v3 = seq(from = 154, to = 131, by = -3) # decreasing sequence, will not get to 131
v3
## [1] 154 151 148 145 142 139 136 133
# specifies number of values to return, the increment will be calculated (135-113) / (9-1)
v4 = seq(from = 113, to = 135, length.out = 9)
v4
## [1] 113.00 115.75 118.50 121.25 124.00 126.75 129.50 132.25 135.00
rep()
Replicates an object a specified number of times. The first argument is the object to be replicated. The argument times specifies how many times the whole vector is repeated, and the argument each specifies how many times each element is repeated.
v = c(2, 6, 3, 0, 4)
# replicate each value of the vector v 2 times
v1 = rep(v, each = 2)
v1
## [1] 2 2 6 6 3 3 0 0 4 4
# replicate the vector v 3 times
v2 = rep(v, times = 3)
v2
## [1] 2 6 3 0 4 2 6 3 0 4 2 6 3 0 4
# replicate each value of the vector v 3 times, then replicate the resulting vector 2 times
v3 = rep(v, each = 3, times = 2)
v3
## [1] 2 2 2 6 6 6 3 3 3 0 0 0 4 4 4 2 2 2 6 6 6 3 3 3 0 0 0 4 4 4
# repeat Female 3 times and Male 5 times, then repeat the resulting vector 4 times
v4 = rep(rep(c("Female", "Male"), times = c(3, 5)), times = 4)
v4
## [1] "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Male"
## [9] "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Male"
## [17] "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Male"
## [25] "Female" "Female" "Female" "Male" "Male" "Male" "Male" "Male"
gl()
This function generates factors by specifying the pattern of their levels. The code below replicates each of the labels Mild, Moderate and Severe four times, then returns the result as an ordered factor (i.e. the argument ordered = TRUE).
v1 = gl(n = 3, k = 4, length = 3 * 4,
labels = c("Mild", "Moderate", "Severe"),
ordered = TRUE)
v1
## [1] Mild Mild Mild Mild Moderate Moderate Moderate Moderate
## [9] Severe Severe Severe Severe
## Levels: Mild < Moderate < Severe
sample()
Generates a random sample of numbers from a specified range or of strings from a specified character vector. The arguments x, size, replace and prob specify, respectively, the object to draw the sample from, the number of values to draw, whether sampling is with or without replacement, and the (approximate) probabilities of inclusion in the sample.
# generate 20 values between 18 and 60 inclusive without replacement
# i.e. none of the values should be repeated - replace = FALSE (it is the default)
v1 = sample(x = 18:60, size = 20, replace = FALSE)
v1
## [1] 30 49 50 53 42 48 51 38 25 43 35 34 21 52 55 56 33 29 59 47
# generate 15 values of Female and Male with replacement where Females are approximately 2/3
# and Males are approximately 1/3 (i.e. females are approximately 67% and males 33%)
v2 = sample(x = c("Female","Male"), size = 15,
replace = TRUE, prob = c(2, 1)/3)
v2
## [1] "Male" "Female" "Female" "Female" "Female" "Female" "Male" "Female"
## [9] "Male" "Female" "Male" "Female" "Female" "Female" "Female"
paste()
This is yet another helpful function for creating vectors by putting different values (numbers, text or special characters) together.
monthnumber = 1:12
monthname = month.name
v1 = paste(monthnumber, monthname, sep = " --> ")
v1
## [1] "1 --> January" "2 --> February" "3 --> March" "4 --> April"
## [5] "5 --> May" "6 --> June" "7 --> July" "8 --> August"
## [9] "9 --> September" "10 --> October" "11 --> November" "12 --> December"
v2 = paste("QTR ", 1:4, ": ", 2018, sep = "")
v2
## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"
paste0()
This function is similar to paste() except that it combines values with no space between them.
# for paste() you will need to add sep = "", otherwise R will put spaces
v1 = paste("QTR ", 1:4, ": ", 2018, sep = "")
v1
## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"
# instead of using paste() with the argument sep = "", just use paste0() as follows
v2 = paste0("QTR ", 1:4, ": ", 2018)
v2
## [1] "QTR 1: 2018" "QTR 2: 2018" "QTR 3: 2018" "QTR 4: 2018"
Vectors can also be generated using probability distribution functions. This will be discussed in a different section.
Most operations that occur in mathematics can be applied to vectors to produce new results. These operations and functions are presented in the table below.
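As a brief illustration, arithmetic and comparison operators act element by element on vectors, while summary functions reduce a vector to a single value. The vectors x and y below are hypothetical and chosen only to keep the results easy to check.

```r
x = c(2, 4, 6, 8)
y = c(1, 2, 3, 4)

x + y      # element-wise addition:        3  6  9 12
x * y      # element-wise multiplication:  2  8 18 32
x / y      # element-wise division:        2  2  2  2
x ^ 2      # element-wise power:           4 16 36 64
x > 4      # element-wise comparison:      FALSE FALSE TRUE TRUE
sum(x)     # 20
mean(y)    # 2.5
length(x)  # 4
```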
Knowing how to slice objects in R is critical for most statistical routines. In this section, we briefly demonstrate how to slice vector objects.
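A minimal sketch of the common slicing forms follows, using square brackets with positions, negative positions or a logical condition. The vector v is a hypothetical example.

```r
v = c(43, 62, 50, 45, 35, 28, 31, 34, 38, 41)

v[3]          # single element by position: 50
v[c(1, 5)]    # several positions: 43 35
v[2:4]        # a range of positions: 62 50 45
v[-1]         # everything except the first element
v[v > 40]     # elements satisfying a condition: 43 62 50 45 41
```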
More often than not, you will be required to replace values of a vector, either by direct input or from mathematical calculations. R provides advanced capabilities for achieving this, but in this section we consider only the most basic ones.
# create a vector to use
v = c(43, 62, 50, 45, 35, 28, 31, 34, 38, 41)
# replace elements in positions 2, 7 and 4 with 200, 400 and 300 respectively
v1 = v # create a copy of the vector v
v1[c(2, 7, 4)] = c(200, 400, 300)
# combine into data frame
d1 = data.frame(Original = v, Replaced = v1)
d1
## Original Replaced
## 1 43 43
## 2 62 200
## 3 50 50
## 4 45 300
## 5 35 35
## 6 28 28
## 7 31 400
## 8 34 34
## 9 38 38
## 10 41 41
# multiply even values by 10
v2 = v # create copy of vector v
v2[which(v2 %% 2 == 0)] = v2[v2 %% 2 == 0] * 10
# combine into data frame
d2 = data.frame(Original = v, Replaced = v2)
d2
## Original Replaced
## 1 43 43
## 2 62 620
## 3 50 500
## 4 45 45
## 5 35 35
## 6 28 280
## 7 31 31
## 8 34 340
## 9 38 380
## 10 41 41
In the above examples, the function data.frame() has been used to combine the vectors into a data frame. Data frames are discussed below.
Vectors (both numeric and character) can be sorted using the function sort(). The argument decreasing specifies the direction of sorting (ascending or descending), and na.last specifies where missing values, if any, are placed (TRUE puts them at the end, FALSE at the beginning). By default, na.last = NA, which removes missing values from the result.
# create a vector
u = c(NA, "A", "B", "F", "D", NA, "K", "M", "D", "C")
# decreasing order, missing values come last
v1 = sort(u, decreasing = TRUE, na.last = TRUE)
v1
## [1] "M" "K" "F" "D" "D" "C" "B" "A" NA NA
# create another vector
v = c(26, 35, NA, 38, NA, 31, 40, 28, 35)
# ascending order, missing values come first
v2 = sort(v, decreasing = FALSE, na.last = FALSE)
v2
## [1] NA NA 26 28 31 35 35 38 40
Ranks are assigned to the elements of a vector using the function rank(). The argument ties.method specifies how tied values are treated (average, first, last, random, max, min), and na.last specifies what happens to missing values (TRUE, FALSE, NA, "keep").
# create a vector
v = c(26, 35, NA, 38, NA, 31, 38, 28, 35)
# excludes missing values before ranking
v1 = rank(v, ties.method = "min", na.last = NA)
v1
## [1] 1 4 6 3 6 2 4
# rank the first of the tied values first and retain missing values but do not rank them
v2 = rank(v, ties.method = "first", na.last = "keep")
d1 = data.frame(v = v, Ranks = v2)
d1
## v Ranks
## 1 26 1
## 2 35 4
## 3 NA NA
## 4 38 6
## 5 NA NA
## 6 31 3
## 7 38 7
## 8 28 2
## 9 35 5
# take average ranks, and missing values should be ranked last
v3 = rank(v, ties.method = "average", na.last = TRUE)
d2 = data.frame(v = v, Ranks = v3)
d2
## v Ranks
## 1 26 1.0
## 2 35 4.5
## 3 NA 8.0
## 4 38 6.5
## 5 NA 9.0
## 6 31 3.0
## 7 38 6.5
## 8 28 2.0
## 9 35 4.5
# take average ranks, and missing values should be ranked first
v4 = rank(v, ties.method = "average", na.last = FALSE)
d3 = data.frame(v = v, Ranks = v4)
d3
## v Ranks
## 1 26 3.0
## 2 35 6.5
## 3 NA 1.0
## 4 38 8.5
## 5 NA 2.0
## 6 31 5.0
## 7 38 8.5
## 8 28 4.0
## 9 35 6.5
Factor variables are categorical variables that can either be numeric or string. The function factor() is used to create factor variables - it requires a vector of values which can either be string or numeric among other optional arguments.
# create a random vector of socio-economic status (ses)
ses = sample(x = c("Low", "Middle", "High"),
size = 60, replace = TRUE,
prob = c(25, 60, 15))
ses
## [1] "Low" "Middle" "Middle" "Middle" "Middle" "Middle" "High" "Low"
## [9] "Low" "Middle" "Low" "Middle" "High" "Middle" "High" "Middle"
## [17] "Middle" "Middle" "Middle" "Middle" "Low" "High" "High" "Middle"
## [25] "Middle" "Middle" "Middle" "Low" "Middle" "Low" "Middle" "Middle"
## [33] "Low" "Middle" "Low" "Middle" "High" "Low" "High" "Low"
## [41] "Middle" "Middle" "Middle" "Low" "High" "Middle" "Middle" "Low"
## [49] "Middle" "Middle" "Middle" "High" "Middle" "Middle" "Middle" "Middle"
## [57] "Middle" "Middle" "Middle" "Low"
Converting string variables to factors is useful in a number of ways. In the code below, we show the result of tabulating a character vector (i.e. one not yet converted to a factor).
kbl(table(ses),
caption = "Table 1") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| ses | Freq |
|---|---|
| High | 9 |
| Low | 14 |
| Middle | 37 |
As you can see from the above results, the categories are in an order that is undesirable. Ideally, you would want the order in the table to be Low, Middle and High. This is achieved by converting ses to factor and specifying the correct order of the categories using the levels argument as is shown in the code below.
# convert to factor and order the categories
ses1 = factor(ses, levels = c("Low", "Middle", "High"))
ses1
## [1] Low Middle Middle Middle Middle Middle High Low Low Middle
## [11] Low Middle High Middle High Middle Middle Middle Middle Middle
## [21] Low High High Middle Middle Middle Middle Low Middle Low
## [31] Middle Middle Low Middle Low Middle High Low High Low
## [41] Middle Middle Middle Low High Middle Middle Low Middle Middle
## [51] Middle High Middle Middle Middle Middle Middle Middle Middle Low
## Levels: Low Middle High
Now tabulate the results again and note that the socio-economic status categories are presented in the correct order.
kbl(table(ses1),
caption = "Table 2") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
| ses1 | Freq |
|---|---|
| Low | 14 |
| Middle | 37 |
| High | 9 |
We can create ordered factor variables by using the function ordered() which has the same arguments as the factor() function. In the code below, we create an ordered factor from the ses vector created above.
# convert to ordered factor
ses2 = ordered(ses, levels = c("Low", "Middle", "High"))
# the above is equivalent to factor(ses, levels = c("Low", "Middle", "High"), ordered = TRUE)
# i.e. use the function factor() with the argument ordered = TRUE
ses2
## [1] Low Middle Middle Middle Middle Middle High Low Low Middle
## [11] Low Middle High Middle High Middle Middle Middle Middle Middle
## [21] Low High High Middle Middle Middle Middle Low Middle Low
## [31] Middle Middle Low Middle Low Middle High Low High Low
## [41] Middle Middle Middle Low High Middle Middle Low Middle Middle
## [51] Middle High Middle Middle Middle Middle Middle Middle Middle Low
## Levels: Low < Middle < High
Once an ordered factor has been created, it can easily be converted into a numeric vector using the function as.numeric() as follows. The function as.double() could also be used to convert it into a numeric vector. Note that the function as.integer() converts the ordered factor into a vector whose data type is integer.
# using the function as.numeric()
ses.num = as.numeric(ses2)
ses.num
## [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1
class(ses.num)
## [1] "numeric"
# using the function as.double()
ses.dou = as.double(ses2)
ses.dou
## [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1
class(ses.dou)
## [1] "numeric"
# using the function as.integer()
ses.int = as.integer(ses2)
ses.int
## [1] 1 2 2 2 2 2 3 1 1 2 1 2 3 2 3 2 2 2 2 2 1 3 3 2 2 2 2 1 2 1 2 2 1 2 1 2 3 1
## [39] 3 1 2 2 2 1 3 2 2 1 2 2 2 3 2 2 2 2 2 2 2 1
class(ses.int)
## [1] "integer"
The cut() function is used to convert a numeric vector into a factor. The breaks parameter controls how the ranges of the numbers in the numeric vector will be converted to factor values. If a number is provided through the breaks parameter, the resulting factor will be created by dividing the range of the variable into that number of equal length intervals. On the other hand, if a vector of values is provided, the values in the vector are used to determine the breakpoints (number of levels of the resultant factor will be one less than the number of values in the specified vector).
Begin by creating a random numeric vector.
age = sample(x = c(20:70), size = 250, replace = TRUE)
Next we create a factor corresponding to age, with five equally-spaced age intervals “20-29”, “30-39”, “40-49”, “50-59” and “60+”. This is illustrated in the code below.
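One way this could be done is sketched below. It is an illustration rather than the original code: the seed is an assumption added for reproducibility, and the name age_group is hypothetical. A vector of breakpoints is passed to breaks, with right = FALSE so that each interval includes its lower bound.

```r
set.seed(1234)   # assumed seed, for reproducibility only
age = sample(x = 20:70, size = 250, replace = TRUE)

# six breakpoints give five levels; Inf leaves the last interval open-ended
age_group = cut(age,
                breaks = c(20, 30, 40, 50, 60, Inf),
                labels = c("20-29", "30-39", "40-49", "50-59", "60+"),
                right = FALSE)   # intervals are [20, 30), [30, 40), ...

table(age_group)
```

Note that the number of labels is one less than the number of breakpoints, in line with the description of the breaks parameter above.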
To produce factors based on percentiles of your data, the quantile() function can be used to generate the argument for the breaks parameter, ensuring approximately equal number of observations in each of the levels of the factor. This is done as follows.
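A possible sketch of the percentile approach, using quartiles, is given below. The seed, the name age_quartile and the labels Q1 to Q4 are assumptions made for illustration. The argument include.lowest = TRUE keeps the minimum value from falling outside the first interval.

```r
set.seed(1234)   # assumed seed, for reproducibility only
age = sample(x = 20:70, size = 250, replace = TRUE)

# quartile breakpoints: minimum, 25th, 50th, 75th percentiles and maximum
q = quantile(age, probs = seq(0, 1, by = 0.25))

age_quartile = cut(age,
                   breaks = q,
                   labels = c("Q1", "Q2", "Q3", "Q4"),
                   include.lowest = TRUE)   # keep the minimum in the first level

table(age_quartile)
```

This approach assumes the breakpoints returned by quantile() are all distinct; cut() reports an error if two of them coincide.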
In the following code, we use the function factor() to get the number of days in each month of the year 2020. Results are presented immediately after the code.
days = seq(from = as.Date('2020-1-1'), to = as.Date('2020-12-31'), by = 'day')
months = format(days, '%b')
months = factor(months, levels = unique(months), ordered = TRUE)
# display
display_results(data = table(months), caption = "Table 5")
| months | Freq |
|---|---|
| Jan | 31 |
| Feb | 29 |
| Mar | 31 |
| Apr | 30 |
| May | 31 |
| Jun | 30 |
| Jul | 31 |
| Aug | 31 |
| Sep | 30 |
| Oct | 31 |
| Nov | 30 |
| Dec | 31 |
Lists are constructed in R using the function list(). Lists are often used to store objects of different data types and of varying dimensions. One common use is in functions (both built-in and user-defined), where a list is used to return multiple outputs. For example, in regression analysis, the output is returned as a list object. The objects of this named list include coefficients, residuals, fitted values, etc. The syntax below shows how to create a list in R.
# create the objects mn, gweeks and bweight
mn = c("Jane", "Mary", "Christine", "Rose","Janet",
"Martha", "Mercy","Joyce", "Hillary","Joylin")
gweeks = c(35.26, 38.74, 38.40, 36.65, 38.19, 37.04, 37.25, 39.68, 34.09, 35.86)
bweight = c(3.58, 3.56, 3.48, 3.26, 2.64, 2.49, 2.86, 3.26, 2.78, 2.84)
wkswgt = data.frame(gestweeks = gweeks, birthweight = bweight)
# create the list here
births = list (mothersname = mn, weeksweight = wkswgt)
births
## $mothersname
## [1] "Jane" "Mary" "Christine" "Rose" "Janet" "Martha"
## [7] "Mercy" "Joyce" "Hillary" "Joylin"
##
## $weeksweight
## gestweeks birthweight
## 1 35.26 3.58
## 2 38.74 3.56
## 3 38.40 3.48
## 4 36.65 3.26
## 5 38.19 2.64
## 6 37.04 2.49
## 7 37.25 2.86
## 8 39.68 3.26
## 9 34.09 2.78
## 10 35.86 2.84
As can be seen, the object births has been stored as a list which contains two objects: mothersname, which is a vector, and weeksweight, which is a data frame. Each object in the list can be extracted using the $ operator or double square brackets [[i]], where i is the object index, e.g. 1, 2, 3, …, n. For example, to extract the mother’s name, use births$mothersname or births[[1]].
births[1] # single brackets return a list of length one, keeping the name (mothersname)
## $mothersname
## [1] "Jane" "Mary" "Christine" "Rose" "Janet" "Martha"
## [7] "Mercy" "Joyce" "Hillary" "Joylin"
births[[1]] # double brackets return the element itself (a character vector), without the name
## [1] "Jane" "Mary" "Christine" "Rose" "Janet" "Martha"
## [7] "Mercy" "Joyce" "Hillary" "Joylin"
births$weeksweight
## gestweeks birthweight
## 1 35.26 3.58
## 2 38.74 3.56
## 3 38.40 3.48
## 4 36.65 3.26
## 5 38.19 2.64
## 6 37.04 2.49
## 7 37.25 2.86
## 8 39.68 3.26
## 9 34.09 2.78
## 10 35.86 2.84
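The remark about regression output can be checked with a small sketch. Here lm() is fitted to the gestation and birth weight values used above purely for illustration; the model itself and the name fit are assumptions, not part of the original example.

```r
wkswgt = data.frame(
  gestweeks   = c(35.26, 38.74, 38.40, 36.65, 38.19, 37.04, 37.25, 39.68, 34.09, 35.86),
  birthweight = c(3.58, 3.56, 3.48, 3.26, 2.64, 2.49, 2.86, 3.26, 2.78, 2.84))

# fit a simple linear regression (illustrative only)
fit = lm(birthweight ~ gestweeks, data = wkswgt)

is.list(fit)               # TRUE: the fitted model is stored as a named list
names(fit)                 # includes "coefficients", "residuals", "fitted.values"
fit$coefficients           # extract by name with $
fit[["residuals"]]         # extract by name with double square brackets
```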
A matrix is a set of real or complex numbers or elements arranged in rows and columns to form a rectangular array. The R software provides excellent and advanced capabilities of handling and manipulating matrices.
The function matrix() is used to create matrices. The general syntax is given below.
M = matrix(vector_with_matrix_elements,
nrow = number_of_rows,
ncol = number_of_columns,
byrow = FALSE/TRUE,
dimnames = list(c("row names"), c("column names")))
Note that some of the parameters in the syntax above are optional. The syntax M = matrix(vector_with_matrix_elements, nrow = number_of_rows) is sufficient to generate a matrix with \(n\) rows. In this case, the number of columns is calculated as \(\dfrac{\text{number of elements}}{\text{number of rows}}\) and the elements are arranged in the matrix by columns (i.e. downwards). In addition, the matrix created will not have row and column names.
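A minimal sketch of this default behaviour, using the hypothetical values 1 to 6, is shown below together with the byrow = TRUE alternative.

```r
# only nrow is given: 6 elements / 2 rows = 3 columns, filled column by column
M1 = matrix(1:6, nrow = 2)
M1          # row 1: 1 3 5 ; row 2: 2 4 6
dim(M1)     # 2 3

# byrow = TRUE fills the matrix row by row instead
M2 = matrix(1:6, nrow = 2, byrow = TRUE)
M2          # row 1: 1 2 3 ; row 2: 4 5 6

dimnames(M1)   # NULL: no row or column names were supplied
```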
Example
Coughlin et al () examined breast and cervical screening practices of Hispanic and non-Hispanic women in counties that approximate the U.S. southern border region. The study used data from Behavioral Risk Factor Surveillance System surveys of adults aged 18 years or older conducted in 1999 and 2000. The following table shows the number of Hispanic and non-Hispanic women who had received a mammogram in the past 2 years, cross-classified by marital status. Write R code that captures the contents of the table.
Solution
Below is the syntax. We have entered the row and column totals manually; the functions rowSums() and colSums() could have been used to calculate the row and column totals respectively.
# create a vector of the matrix elements (by columns)
v = c(319, 130, 88, 41, 578,
738, 329, 402, 95, 1564,
1057, 459, 490, 136, 2142)
# create row names (rnames) and column names (cnames)
cnames = c("Hispanic", "non Hispanic", "Total")
rnames = c("Currently married", "Divorced or separated",
"Widowed","Never married / unmarried couple", "Total")
# create the matrix
M = matrix(v,
nrow = 5, ncol = 3, byrow = FALSE,
dimnames = list(rnames, cnames))
M
## Hispanic non Hispanic Total
## Currently married 319 738 1057
## Divorced or separated 130 329 459
## Widowed 88 402 490
## Never married / unmarried couple 41 95 136
## Total 578 1564 2142
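As a sketch of the alternative mentioned in the solution, the totals can be computed rather than typed in. The names core, with_rows and M_totals are hypothetical; only the eight cell counts are entered by hand.

```r
# enter only the cell counts (by columns), without totals
core = matrix(c(319, 130, 88, 41,
                738, 329, 402, 95),
              nrow = 4,
              dimnames = list(c("Currently married", "Divorced or separated",
                                "Widowed", "Never married / unmarried couple"),
                              c("Hispanic", "non Hispanic")))

# append a column of row totals, then a row of column totals
with_rows = cbind(core, Total = rowSums(core))
M_totals  = rbind(with_rows, Total = colSums(with_rows))
M_totals
```

Computing the margins this way avoids transcription errors and keeps the totals consistent if a cell count is later corrected.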
The following table presents some of the common operators and functions that are applicable to matrices.
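A few of these operators and functions are sketched below on two small hypothetical matrices, A and B. Note the difference between * (element-wise product) and %*% (matrix product).

```r
A = matrix(c(2, 1, 1, 3), nrow = 2)   # rows: (2, 1) and (1, 3)
B = matrix(c(1, 0, 2, 1), nrow = 2)   # rows: (1, 2) and (0, 1)

A + B       # element-wise addition
A * B       # element-wise multiplication
A %*% B     # matrix multiplication; rows: (2, 5) and (1, 5)
t(A)        # transpose
det(A)      # determinant: 5
solve(A)    # inverse of A
dim(A)      # dimensions: 2 2
```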
Particular values, rows, columns or a combination of these can be obtained from a matrix by applying the slicing techniques in R. The table below presents a few examples.
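A short sketch of matrix slicing with the [row, column] notation follows. The matrix M and its row and column names are hypothetical.

```r
M = matrix(1:12, nrow = 3,
           dimnames = list(c("r1", "r2", "r3"), c("c1", "c2", "c3", "c4")))

M[2, 3]                      # single element (row 2, column 3): 8
M[1, ]                       # entire first row: 1 4 7 10
M[, 2]                       # entire second column: 4 5 6
M[c(1, 3), c("c1", "c4")]    # rows 1 and 3 of columns c1 and c4
M[, 2, drop = FALSE]         # keep the result as a one-column matrix
```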
Arrays are generalizations of vectors and matrices: a vector is a one-dimensional array and a matrix is a two-dimensional array. As with vectors and matrices, all elements of an array must be of the same data type. The array() function creates an array from a vector of values and a dim argument giving the size of each dimension. The example below creates a three-dimensional array, data, from the vector control, with dimensions named Outcome, Treatment and Age Group.
# create a vector
control = c(8, 98, 106, 15, 115, 130, 23, 213, 236, 22, 76, 98, 16, 69, 85, 38, 145, 183)
# create the array
data = array(control, dim = c(3, 3, 2))
# assign array names, row and column names
dimnames(data) = list(Outcome = c("Death", "Survivor", "Total"),
Treatment = c("Drug", "Placebo", "Total"),
"Age Group" = c("Age < 55", "Age >= 55"))
data
## , , Age Group = Age < 55
##
## Treatment
## Outcome Drug Placebo Total
## Death 8 15 23
## Survivor 98 115 213
## Total 106 130 236
##
## , , Age Group = Age >= 55
##
## Treatment
## Outcome Drug Placebo Total
## Death 22 16 38
## Survivor 76 69 145
## Total 98 85 183
All basic arithmetic operations that apply to vectors and matrices also apply to arrays. Slicing is slightly different, because a third index is needed. Below we give a few examples of slicing arrays.
# second row of the first matrix of the array.
data[2, , 1]
## Drug Placebo Total
## 98 115 213
# element in the 3rd row and 2nd column of the 1st matrix.
data[3, 2, 1]
## [1] 130
# 2nd Matrix.
data[, , 2]
## Treatment
## Outcome Drug Placebo Total
## Death 22 16 38
## Survivor 76 69 145
## Total 98 85 183
# arithmetic
M1 = data[, , 1] # first matrix (i.e. Age Group: Age < 55)
M2 = data[, , 2] # second matrix (i.e. Age Group: Age >= 55)
M3 = M1 + M2
M3
## Treatment
## Outcome Drug Placebo Total
## Death 30 31 61
## Survivor 174 184 358
## Total 204 215 419
While vectors, matrices and arrays require that all elements be of the same data type, data frames can store objects of different data types (heterogeneous). This makes data frames convenient data structures for data management, manipulation and analysis in R. This is probably the most common data structure that you will likely encounter in your data management, analysis, visualization and programming tasks. Below we just show one of the most basic tasks in R: creating a data frame from vectors. Other operations / manipulations on data frames can be found elsewhere on this website.
# create vectors with random values
set.seed(1234)
gender = sample(x = c("Female", "Male"), size = 10,
replace = TRUE, prob = c(60, 40)/100)
age = sample(18:58, size = length(gender))
income = sample(1000:9000, size = length(gender)) * 5
# create data frame
df = data.frame(Gender = gender, Age = age, "Income" = income)
# display
display_results(data = df, caption = "Table 6: Data frames.")
In a time series, the values of a variable are recorded at regular time intervals, for example daily, weekly, monthly, quarterly, semi-annually or yearly. To create a time series object in R, the function ts() is used as illustrated by the syntax below.
# create some random values
set.seed(1234)
v = sample(200:900, size = 20) * 5
data.ts = ts(data = v,
start = c(2015, 4), freq = 4) # freq = 4 implies quarterly
data.ts
## Qtr1 Qtr2 Qtr3 Qtr4
## 2015 2415
## 2016 1500 4110 4220 2995
## 2017 1485 1510 4005 2625
## 2018 1390 2345 2905 1915
## 2019 3865 1015 4300 3755
## 2020 2055 1970 3550
The syntax below gives a time series with frequency of 12, i.e. monthly data.
# create some random values
set.seed(1234)
v = sample(200:900, size = 50) * 5
data.ts = ts(data = v,
start = c(2015, 4), freq = 12) # freq = 12 implies monthly
data.ts
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 2015 2415 1500 4110 4220 2995 1485 1510 4005 2625
## 2016 1390 2345 2905 1915 3865 1015 4300 3755 2055 1970 3550 3390
## 2017 4020 4165 3885 3545 3115 2890 1535 1650 2710 1200 4130 2485
## 2018 2285 4140 4455 1905 2520 2785 2530 4315 2100 3800 2560 1675
## 2019 1720 1610 2165 4035 3470
Note that the layout shown in the console is for presentation purposes only. Internally, the data is stored as a time series object (a single variable). This can be viewed using the command View(data.ts). For a multivariate time series, replace the vector object v with a matrix object, say M, as shown in the code below.
# create some random values
set.seed(1234)
M = matrix(sample(200:900, size = 40) * 5, nrow = 8) # matrix
data.ts = ts(data = M,
start = c(2015, 1), freq = 4)
data.ts
## Series 1 Series 2 Series 3 Series 4 Series 5
## 2015 Q1 2415 2625 3755 3545 2485
## 2015 Q2 1500 1390 2055 3115 2285
## 2015 Q3 4110 2345 1970 2890 4140
## 2015 Q4 4220 2905 3550 1535 4455
## 2016 Q1 2995 1915 3390 1650 1905
## 2016 Q2 1485 3865 4020 2710 2520
## 2016 Q3 1510 1015 4165 1200 2785
## 2016 Q4 4005 4300 3885 4130 2530
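The time attributes of a time series object can be inspected directly, which makes clear that the tabular console layout is only a display format. The sketch below rebuilds the quarterly series from earlier in this section and uses the full argument name frequency.

```r
set.seed(1234)
v = sample(200:900, size = 20) * 5
data.ts = ts(data = v, start = c(2015, 4), frequency = 4)   # quarterly

class(data.ts)       # "ts"
frequency(data.ts)   # 4 observations per year
start(data.ts)       # 2015 4  (fourth quarter of 2015)
end(data.ts)         # 2020 3  (third quarter of 2020)

# extract a sub-series: the four quarters of 2016
window(data.ts, start = c(2016, 1), end = c(2016, 4))
```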
Data structures can be converted from one to another using the following functions.
| Function | Description |
|---|---|
| as.vector() | Convert to a vector |
| as.factor() | Convert to a factor |
| as.matrix() | Convert to a matrix |
| as.array() | Convert to an array |
| as.list() | Convert to a list |
| as.data.frame() | Convert to a data frame |
| as.ts() | Convert to a time series |
| as.table() | Convert to a table |
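A few of these conversions are sketched below with hypothetical objects. One point deserves care: applying as.numeric() directly to a factor returns the internal level codes, not the original labels, so a factor of number-like labels should be converted to character first.

```r
# matrix <-> vector <-> data frame
m  = matrix(1:6, nrow = 2)
as.vector(m)             # 1 2 3 4 5 6 (read column by column)
df = as.data.frame(m)    # columns are named V1, V2, V3
as.matrix(df)            # back to a matrix
as.list(1:3)             # a list with three elements

# factor to numeric: convert via character to recover the labels
f = factor(c("10", "20", "10"))
as.numeric(f)                  # 1 2 1    (level codes)
as.numeric(as.character(f))    # 10 20 10 (the values themselves)
```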
STEM Research: Technology for Innovation
Last edited on: 2022-06-25