R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists.
They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.
In R, the basic data types are: numeric, character and logical.
my_age <- 28
my_name <- "Nicolas"
is_datascientist <- TRUE
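As a quick check of the three basic types, the sketch below inspects the variables defined above with class() and typeof(). The value 28L is an extra illustration that is not part of the original example.

```r
my_age <- 28
my_name <- "Nicolas"
is_datascientist <- TRUE

class(my_age)            # "numeric"
class(my_name)           # "character"
class(is_datascientist)  # "logical"

# A number typed without the L suffix is stored as a double;
# the L suffix requests an integer instead.
typeof(my_age)           # "double"
typeof(28L)              # "integer"
```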
Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors.
Factors are crucial in R because they determine how data is analyzed and presented visually.
The function factor() stores the categorical values as a vector of integers in the range [1, k] (where k is the number of unique values in the nominal variable) and an internal vector of character strings (the original values) mapped to these integers.
For example, assume that you have this vector:
diabetes <- c("Type1", "Type2", "Type1", "Type1")
str(diabetes)
## chr [1:4] "Type1" "Type2" "Type1" "Type1"
We can use factor() to change the data type from character to factor as shown in the code chunk below.
diabetes <- factor(diabetes)
Notice that R stores this vector as (1, 2, 1, 1) and associates it with 1 = Type1 and 2 = Type2 internally (the assignment is alphabetical).
str(diabetes)
## Factor w/ 2 levels "Type1","Type2": 1 2 1 1
Note that any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement.
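The internal representation described above can be inspected directly. The following is a minimal sketch that rebuilds the diabetes factor and looks at its integer codes and its levels.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

levels(diabetes)      # "Type1" "Type2"  (the character labels)
as.integer(diabetes)  # 1 2 1 1          (the integer codes)
nlevels(diabetes)     # 2

# table() counts the observations in each level
table(diabetes)
```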
Given the vector
status <- c("Poor", "Improved", "Excellent", "Poor")
str(status)
## chr [1:4] "Poor" "Improved" "Excellent" "Poor"
The code chunk below will encode the vector as (3, 2, 1, 3) and associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor.
status <- factor(status, ordered=TRUE)
str(status)
## Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
Additionally, any analyses performed on this vector will treat the variable as ordinal and select the statistical methods appropriately.
By default, factor levels for character vectors are created in alphabetical order.
You can override the default by specifying a levels option as shown in the code chunk below.
status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
status
## [1] Poor Improved Excellent Poor
## Levels: Poor < Improved < Excellent
This assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent.
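One practical consequence of an explicit level order is that comparisons become meaningful. The sketch below rebuilds the ordered factor from the original character values and compares it against a level.

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 levels = c("Poor", "Improved", "Excellent"),
                 ordered = TRUE)

as.integer(status)          # 1 2 3 1

# Comparison operators follow the declared level order
status < "Excellent"        # TRUE TRUE FALSE TRUE

# max() returns the highest level that occurs in the data
as.character(max(status))   # "Excellent"
```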
sex <- c(1,1,1,2,2,2,1,2,1,2)
sex
## [1] 1 1 1 2 2 2 1 2 1 2
Here sex is a numeric variable coded as 1 for male and 2 for female. The code chunk below converts it to an unordered factor by using the levels and labels options. Note that the order of the labels must match the order of the levels.
sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))
In this example, sex would be treated as categorical, the labels “Male” and “Female” would appear in the output instead of 1 and 2, and any sex value that wasn’t initially coded as a 1 or 2 would be set to missing.
sex
## [1] Male Male Male Female Female Female Male Female Male Female
## Levels: Male Female
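To illustrate the point about values outside the declared levels, here is a small sketch. The vector sex_codes is hypothetical and contains one invalid code (3), which factor() turns into a missing value.

```r
# Hypothetical input: 3 is not one of the declared levels
sex_codes <- c(1, 2, 3, 1)

sex_f <- factor(sex_codes, levels = c(1, 2), labels = c("Male", "Female"))

sex_f          # Male Female <NA> Male
is.na(sex_f)   # FALSE FALSE TRUE FALSE
```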
A scalar is a single value. R has no separate scalar type; a scalar is stored as a vector of length one.
The following code creates a scalar variable with the numeric value 5:
x <- 5
x
## [1] 5
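The [1] printed in front of the value hints that x is really a one-element vector. A short sketch confirming this:

```r
x <- 5

is.vector(x)  # TRUE
length(x)     # 1
x[1]          # 5, the first (and only) element
```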
Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.
The combine function c() is used to form the vector. Here are examples of each type of vector:
a <- c(1, 2, 5, 3, 6, -2, 4)
a
## [1] 1 2 5 3 6 -2 4
b <- c("one", "two", "three")
b
## [1] "one" "two" "three"
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
## [1] TRUE TRUE TRUE FALSE TRUE FALSE
Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that all the data in a vector must be of a single mode (numeric, character, or logical). You can't mix modes in the same vector; if you try, R silently coerces every element to a common mode.
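The single-mode rule is easiest to see by passing values of different modes to c(). The sketch below uses two made-up vectors, mixed and num_lgl, to show the result.

```r
# Numeric, character, and logical together: everything becomes character
mixed <- c(1, "two", TRUE)
mixed          # "1" "two" "TRUE"
class(mixed)   # "character"

# Numeric and logical together: logicals become 1 and 0
num_lgl <- c(1, TRUE, FALSE)
num_lgl        # 1 1 0
class(num_lgl) # "numeric"
```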
A matrix is a two-dimensional array in which every element has the same mode. Matrices are created with the matrix() function, which has the general form:
mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames))
where:
- vector contains the elements for the matrix,
- nrow and ncol specify the row and column dimensions,
- dimnames contains optional row and column labels stored in character vectors, and
- byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column.
The code chunk below creates a 5 x 4 matrix.
y <- matrix(1:20, nrow=5, ncol=4)
y
## [,1] [,2] [,3] [,4]
## [1,] 1 6 11 16
## [2,] 2 7 12 17
## [3,] 3 8 13 18
## [4,] 4 9 14 19
## [5,] 5 10 15 20
The code below creates a 2 x 2 matrix filled by row.
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
mymatrix
## C1 C2
## R1 1 26
## R2 24 68
The code below creates a 2 x 2 matrix filled by column.
cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))
mymatrix
## C1 C2
## R1 1 24
## R2 26 68
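The two fill orders are related: with the same cells and a square shape, the row-filled matrix is the transpose of the column-filled one. A minimal sketch, using the hypothetical names by_row and by_col and omitting dimnames for brevity:

```r
cells <- c(1, 26, 24, 68)

by_row <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE)
by_col <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE)

by_row[1, ]                    # 1 26  (first two cells fill the first row)
by_col[, 1]                    # 1 26  (first two cells fill the first column)

# t() transposes a matrix
identical(by_row, t(by_col))   # TRUE
```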
You can identify rows, columns, or elements of a matrix by using subscripts and brackets.
X[i,] refers to the ith row of matrix X,
X[,j] refers to the jth column, and
X[i,j] refers to the element in the ith row and jth column.
The subscripts i and j can be numeric vectors in order to select multiple rows or columns.
Now, let us explore subscripts by using a series of short code chunks. First, a 2 x 5 matrix is created.
x <- matrix(1:10, nrow=2)
x
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 3 5 7 9
## [2,] 2 4 6 8 10
Next, the code chunk below selects the elements in the second row.
x[2,]
## [1] 2 4 6 8 10
The code chunk below will select the elements in the second column.
x[,2]
## [1] 3 4
The code chunk below will select the element in the first row and fourth column.
x[1,4]
## [1] 7
The code chunk below will select the elements in the first row and the fourth and fifth columns.
x[1, c(4,5)]
## [1] 7 9
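A few further subscript forms are worth knowing. The sketch below goes slightly beyond the examples above: drop = FALSE keeps the result as a matrix, a negative index excludes a column, and a logical condition selects matching elements.

```r
x <- matrix(1:10, nrow = 2)

# Selecting one row normally returns a plain vector;
# drop = FALSE keeps the matrix structure (1 row, 5 columns)
second_row <- x[2, , drop = FALSE]
dim(second_row)   # 1 5

# A negative index excludes: all columns except the first
dim(x[, -1])      # 2 4

# A logical condition selects the matching elements
x[x > 6]          # 7 8 9 10
```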
Arrays are similar to matrices but can have more than two dimensions. They’re created with an array function of the following form:
myarray <- array(vector, dimensions, dimnames)
where:
- vector contains the data for the array,
- dimensions is a numeric vector giving the maximal index for each dimension, and
- dimnames is an optional list of dimension labels.
Let us use the code below to create an array.
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))
z
## , , C1
##
## B1 B2 B3
## A1 1 3 5
## A2 2 4 6
##
## , , C2
##
## B1 B2 B3
## A1 7 9 11
## A2 8 10 12
##
## , , C3
##
## B1 B2 B3
## A1 13 15 17
## A2 14 16 18
##
## , , C4
##
## B1 B2 B3
## A1 19 21 23
## A2 20 22 24
As you can see, arrays are a natural extension of matrices. They can be useful in programming new statistical methods.
Like matrices, they must be a single mode. Identifying elements follows what you’ve seen for matrices. In the previous example, the z[1,2,3] element is 15.
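Array elements can be addressed by position or by dimension name. A short sketch using the same array z:

```r
z <- array(1:24, c(2, 3, 4),
           dimnames = list(c("A1", "A2"),
                           c("B1", "B2", "B3"),
                           c("C1", "C2", "C3", "C4")))

dim(z)                 # 2 3 4

z[1, 2, 3]             # 15, selected by position
z["A1", "B2", "C3"]    # 15, the same element selected by name

# Leaving an index empty keeps that whole dimension:
# this is the 2 x 3 slice labelled C1
z[, , "C1"]
```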
Lists are the most complex of the R data types.
Basically, a list is an ordered collection of objects (components).
A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.
You create a list using the list() function.
In this example, you create a list with four components: a string, a numeric vector, a matrix, and a character vector. You can combine any number of objects and save them as a list.
g <- "My First List" #a string
h <- c(25, 26, 18, 39) #a numeric vector
j <- matrix(1:10, nrow=5) #a matrix
k <- c("one", "two", "three") #a character vector
mylist <- list(title=g, ages=h, j, k)
The output below shows the structure of the list object created.
mylist
## $title
## [1] "My First List"
##
## $ages
## [1] 25 26 18 39
##
## [[3]]
## [,1] [,2]
## [1,] 1 6
## [2,] 2 7
## [3,] 3 8
## [4,] 4 9
## [5,] 5 10
##
## [[4]]
## [1] "one" "two" "three"
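Components of a list can be retrieved by position with double brackets or, for named components, by name. The sketch below rebuilds mylist and shows the common access forms.

```r
g <- "My First List"
h <- c(25, 26, 18, 39)
j <- matrix(1:10, nrow = 5)
k <- c("one", "two", "three")
mylist <- list(title = g, ages = h, j, k)

mylist[[2]]        # 25 26 18 39, by position
mylist[["ages"]]   # the same component, by name
mylist$ages        # shorthand for a named component

# Single brackets return a sub-list, not the component itself
class(mylist[2])   # "list"

# Unnamed components get an empty string as their name
names(mylist)      # "title" "ages" "" ""
```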
A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.
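A data frame has the following characteristics: the column names should be non-empty, the row names must be unique, every column contains the same number of items, and each column holds a single type of data, although different columns can hold different types (numeric, character, factor, and so on).
A data frame can be created programmatically by using the data.frame() function as shown in the code chunk below.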
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
                         "2015-03-27")),
  stringsAsFactors = FALSE
)
emp.data
## emp_id emp_name salary start_date
## 1 1 Rick 623.30 2012-01-01
## 2 2 Dan 515.20 2013-09-23
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
The structure of the data frame can be seen by using the str() function.
str(emp.data)
## 'data.frame': 5 obs. of 4 variables:
## $ emp_id : int 1 2 3 4 5
## $ emp_name : chr "Rick" "Dan" "Michelle" "Ryan" ...
## $ salary : num 623 515 611 729 843
## $ start_date: Date, format: "2012-01-01" "2013-09-23" ...
The statistical summary and nature of the data can be obtained by applying the summary() function.
summary(emp.data)
## emp_id emp_name salary start_date
## Min. :1 Length:5 Min. :515.2 Min. :2012-01-01
## 1st Qu.:2 Class :character 1st Qu.:611.0 1st Qu.:2013-09-23
## Median :3 Mode :character Median :623.3 Median :2014-05-11
## Mean :3 Mean :664.4 Mean :2014-01-14
## 3rd Qu.:4 3rd Qu.:729.0 3rd Qu.:2014-11-15
## Max. :5 Max. :843.2 Max. :2015-03-27
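Individual columns and rows of a data frame can be extracted with the $ operator and with matrix-style subscripts. The following is a minimal sketch on a reduced copy of emp.data (the start_date column is left out for brevity); the 620 cut-off and the name high_paid are arbitrary choices for illustration.

```r
emp.data <- data.frame(
  emp_id = 1:5,
  emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
  salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
  stringsAsFactors = FALSE
)

# $ extracts one column as a vector
emp.data$salary
mean(emp.data$salary)   # 664.35

# Row condition before the comma, columns after it
high_paid <- emp.data[emp.data$salary > 620, c("emp_name", "salary")]
high_paid$emp_name      # "Rick" "Ryan" "Gary"

# A single row, all columns
emp.data[2, ]
```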
To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.
In the example below we create a new data frame with new rows.
emp.newdata <- data.frame(
  emp_id = 6:8,
  emp_name = c("Rasmi", "Pranab", "Tusar"),
  salary = c(578.0, 722.5, 632.8),
  start_date = as.Date(c("2013-05-21", "2013-07-30", "2014-06-17")),
  stringsAsFactors = FALSE
)
Next, we append it to the existing data frame by using rbind() to create the final data frame.
emp.finaldata <- rbind(emp.data,emp.newdata)
emp.finaldata
## emp_id emp_name salary start_date
## 1 1 Rick 623.30 2012-01-01
## 2 2 Dan 515.20 2013-09-23
## 3 3 Michelle 611.00 2014-11-15
## 4 4 Ryan 729.00 2014-05-11
## 5 5 Gary 843.25 2015-03-27
## 6 6 Rasmi 578.00 2013-05-21
## 7 7 Pranab 722.50 2013-07-30
## 8 8 Tusar 632.80 2014-06-17
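The requirement that new rows share the structure of the existing data frame can be checked with a small sketch. The data frames first, extra, and mismatched below are hypothetical, cut-down versions of the employee data; mismatched deliberately uses different column names.

```r
first <- data.frame(emp_id = 1:2,
                    emp_name = c("Rick", "Dan"),
                    stringsAsFactors = FALSE)
extra <- data.frame(emp_id = 6L,
                    emp_name = "Rasmi",
                    stringsAsFactors = FALSE)

# Same column names: the rows are appended
combined <- rbind(first, extra)
nrow(combined)   # 3

# Different column names: rbind() stops with an error,
# which is caught here and replaced by a marker string
mismatched <- data.frame(id = 7L, name = "Pranab",
                         stringsAsFactors = FALSE)
outcome <- tryCatch(rbind(first, mismatched),
                    error = function(e) "rbind failed")
outcome          # "rbind failed"
```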