1 R Data Types and Structures

  • R has a wide variety of objects for holding data, including scalars, vectors, matrices, arrays, data frames, and lists.

  • They differ in terms of the type of data they can hold, how they’re created, their structural complexity, and the notation used to identify and access individual elements.

1.1 Basic data types of R

In R, the basic data types are: numeric, character and logical.

1.1.1 Numeric object: How old are you?

my_age <- 28

1.1.2 Character object: What’s your name?

my_name <- "Nicolas"

1.1.3 Logical object: Are you a data scientist? (yes/no) <=> (TRUE/FALSE)

is_datascientist <- TRUE
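To check which basic type R has assigned to an object, class() and typeof() can be used. The short sketch below reuses the three objects defined above; the literal 28L is added only for illustration.

```r
# Recreate the three objects from the examples above
my_age <- 28
my_name <- "Nicolas"
is_datascientist <- TRUE

# class() reports the data type of each object
class(my_age)            # "numeric"
class(my_name)           # "character"
class(is_datascientist)  # "logical"

# Numbers typed without an L suffix are stored as doubles, not integers
typeof(my_age)           # "double"
typeof(28L)              # "integer"
```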

1.2 Factors

  • Categorical (nominal) and ordered categorical (ordinal) variables in R are called factors.

  • Factors are crucial in R because they determine how data is analyzed and presented visually.

  • The function factor() stores the categorical values as a vector of integers in the range [1, k] (where k is the number of unique values in the nominal variable), together with an internal vector of character strings (the original values) mapped to these integers.

1.2.1 Factors - Nominal variable

For example, assume that you have this vector:

diabetes <- c("Type1", "Type2", "Type1", "Type1")
str(diabetes)
##  chr [1:4] "Type1" "Type2" "Type1" "Type1"

We can use factor() to change the data type from character to factor as shown in the code chunk below.

diabetes <- factor(diabetes)

Notice that R stores this vector as (1, 2, 1, 1) and associates it with 1 = Type1 and 2 = Type2 internally (the assignment is alphabetical).

str(diabetes)
##  Factor w/ 2 levels "Type1","Type2": 1 2 1 1

Note that any analyses performed on the vector diabetes will treat the variable as nominal and select the statistical methods appropriate for this level of measurement.
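The integer codes and the level labels can be inspected separately. The following is a small sketch using the same diabetes vector; it is one way to look inside a factor, not the only one.

```r
diabetes <- factor(c("Type1", "Type2", "Type1", "Type1"))

levels(diabetes)      # "Type1" "Type2"
nlevels(diabetes)     # 2
as.integer(diabetes)  # 1 2 1 1  (the internal integer codes)
table(diabetes)       # counts per level: Type1 = 3, Type2 = 1
```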

1.2.2 Factors - Ordinal variable

  • For vectors representing ordinal variables, you add the parameter ordered=TRUE to the factor() function.

Given the vector

status <- c("Poor", "Improved", "Excellent", "Poor")
str(status)
##  chr [1:4] "Poor" "Improved" "Excellent" "Poor"

The code chunk below will encode the vector as (3, 2, 1, 3) and associate these values internally as 1 = Excellent, 2 = Improved, and 3 = Poor.

status <- factor(status, ordered=TRUE)
str(status)
##  Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3

Additionally, any analyses performed on this vector will treat the variable as ordinal and select the statistical methods appropriately.

1.2.3 Factor levels

By default, factor levels for character vectors are created in alphabetical order.

You can override the default by specifying a levels option as shown in the code chunk below.

status <- factor(status, ordered=TRUE, levels=c("Poor", "Improved", "Excellent"))
status
## [1] Poor      Improved  Excellent Poor     
## Levels: Poor < Improved < Excellent

This assigns the levels as 1 = Poor, 2 = Improved, 3 = Excellent.

  • Be sure the specified levels match your actual data values. Any data values not in the list will be set to missing.
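Once the levels are ordered from Poor to Excellent, comparison operators follow that order. The sketch below rebuilds the status factor from the raw character values so that it runs on its own; the name at_least_improved is hypothetical and used only for illustration.

```r
status <- factor(c("Poor", "Improved", "Excellent", "Poor"),
                 ordered = TRUE,
                 levels = c("Poor", "Improved", "Excellent"))

# Comparisons between elements use the level order, not the alphabet
status[1] < status[2]   # TRUE, because Poor < Improved

# The highest level present in the data
max(status)             # Excellent

# Keep only the observations rated Improved or better
at_least_improved <- status[status >= "Improved"]
as.character(at_least_improved)   # "Improved" "Excellent"
```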

1.2.4 Factors - Working with Levels and Labels options

  • Numeric variables can be coded as factors using the levels and labels options. If sex was coded as 1 for male and 2 for female in the original data,
sex <- c(1,1,1,2,2,2,1,2,1,2)
sex
##  [1] 1 1 1 2 2 2 1 2 1 2

then the code chunk below would convert the variable to an unordered factor. Note that the order of the labels must match the order of the levels.

sex <- factor(sex, levels=c(1, 2), labels=c("Male", "Female"))

In this example, sex would be treated as categorical, the labels “Male” and “Female” would appear in the output instead of 1 and 2, and any sex value that wasn’t initially coded as a 1 or 2 would be set to missing.

sex
##  [1] Male   Male   Male   Female Female Female Male   Female Male   Female
## Levels: Male Female
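To illustrate the point about values outside the specified levels, here is a minimal sketch. The vector sex_raw and its stray code 3 are hypothetical and are not part of the original data.

```r
# The value 3 is a hypothetical stray code not covered by levels = c(1, 2)
sex_raw <- c(1, 2, 3, 1)
sex <- factor(sex_raw, levels = c(1, 2), labels = c("Male", "Female"))

sex                          # Male Female <NA> Male
table(sex, useNA = "ifany")  # Male 2, Female 1, <NA> 1
```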

2 R Data Objects

2.1 R Data Object - Scalar

  • A scalar is a single value, such as one number. R has no separate scalar type; a scalar is stored as a vector of length one.

  • The following code creates a scalar variable with the numeric value 5:

x <- 5
x
## [1] 5
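The [1] in the printed output hints that x is really a one-element vector. A quick check, offered as a sketch:

```r
x <- 5

is.vector(x)  # TRUE
length(x)     # 1
x[1]          # 5, the first (and only) element
```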

2.2 R Data Object - Vectors

  • Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data.

  • The combine function c() is used to form the vector. Here are examples of each type of vector:

a <- c(1, 2, 5, 3, 6, -2, 4)
a
## [1]  1  2  5  3  6 -2  4
b <- c("one", "two", "three")
b
## [1] "one"   "two"   "three"
c <- c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
c
## [1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Here, a is a numeric vector, b is a character vector, and c is a logical vector. Note that all the data in a vector must be of the same mode (numeric, character, or logical). You can’t mix modes in the same vector: if you try, R silently coerces every element to a single common mode.
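Elements of a vector are selected with square brackets, and combining values of different modes results in coercion rather than an error. The sketch below shows both; the vector named mixed is a hypothetical example.

```r
a <- c(1, 2, 5, 3, 6, -2, 4)

a[3]            # 5
a[c(1, 3, 5)]   # 1 5 6
a[2:6]          # 2 5 3 6 -2

# Combining a number, a string and a logical yields a character vector
mixed <- c(1, "two", TRUE)
class(mixed)    # "character"
mixed           # "1" "two" "TRUE"
```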

2.3 R Data Object - Matrices

  • A matrix is a two-dimensional array in which each element has the same mode (numeric, character, or logical).
  • Matrices are created with the matrix() function.
  • The general format is:

mymatrix <- matrix(vector, nrow=number_of_rows, ncol=number_of_columns, byrow=logical_value, dimnames=list(char_vector_rownames, char_vector_colnames))

where:

  • vector contains the elements for the matrix,

  • nrow and ncol specify the row and column dimensions,

  • dimnames contains optional row and column labels stored in character vectors, and

  • byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE). The default is by column.

2.3.1 Creating a simple matrix

The code chunk below creates a 5 x 4 matrix. Because byrow is not specified, the values 1 to 20 fill the matrix column by column.

y <- matrix(1:20, nrow=5, ncol=4)
y
##      [,1] [,2] [,3] [,4]
## [1,]    1    6   11   16
## [2,]    2    7   12   17
## [3,]    3    8   13   18
## [4,]    4    9   14   19
## [5,]    5   10   15   20

2.3.2 Creating a matrix filled by row

The code below creates a 2 x 2 matrix filled by row.

cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=TRUE, dimnames=list(rnames, cnames))
mymatrix
##    C1 C2
## R1  1 26
## R2 24 68

2.3.3 Creating a matrix filled by column

The code below creates a 2 x 2 matrix filled by column.

cells <- c(1,26,24,68)
rnames <- c("R1", "R2")
cnames <- c("C1", "C2")
mymatrix <- matrix(cells, nrow=2, ncol=2, byrow=FALSE, dimnames=list(rnames, cnames))
mymatrix
##    C1 C2
## R1  1 24
## R2 26 68

2.3.4 Using matrix subscripts

  • You can identify rows, columns, or elements of a matrix by using subscripts and brackets.

  • X[i,] refers to the ith row of matrix X,

  • X[,j] refers to the j th column, and

  • X[i,j] refers to the ij th element, respectively.

  • The subscripts i and j can be numeric vectors in order to select multiple rows or columns.

2.3.5 Working with matrix subscripts

Now, let us explore subscripts by using a series of short code chunks.

  • First let us create a 2 x 5 matrix containing the numbers 1 to 10. By default, the matrix is filled by column.
x <- matrix(1:10, nrow=2)
x
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3    5    7    9
## [2,]    2    4    6    8   10

Next, the code chunk below will select the elements in the second row.

x[2,] 
## [1]  2  4  6  8 10

The code chunk below will select the elements in the second column.

x[,2] 
## [1] 3 4

The code chunk below will select the element in the first row and fourth column.

x[1,4] 
## [1] 7

The code chunk below will select the elements in the first row and the fourth and fifth columns.

x[1, c(4,5)] 
## [1] 7 9
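Note that selecting a single row or column returns a plain vector. As a hedged sketch, the drop argument of the bracket operator keeps the result as a matrix, and row or column names can also serve as subscripts. The labels r1, r2 and c1 to c5 are hypothetical and are added here only for illustration.

```r
x <- matrix(1:10, nrow = 2)

x[, 2]                      # plain vector: 3 4
x[, 2, drop = FALSE]        # still a matrix with 2 rows and 1 column
dim(x[, 2, drop = FALSE])   # 2 1

# Hypothetical row and column labels, then selection by name
dimnames(x) <- list(c("r1", "r2"), paste0("c", 1:5))
x["r1", "c4"]               # 7
```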

2.4 R Data Object - Arrays

Arrays are similar to matrices but can have more than two dimensions. They’re created with the array() function, which has the following form:

myarray <- array(vector, dimensions, dimnames)

where:

  • vector contains the data for the array,

  • dimensions is a numeric vector giving the maximal index for each dimension, and

  • dimnames is an optional list of dimension labels.

2.4.1 Creating an array

Let us use the code below to create an array.

dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames=list(dim1, dim2, dim3))

2.4.2 Structure of an array

z
## , , C1
## 
##    B1 B2 B3
## A1  1  3  5
## A2  2  4  6
## 
## , , C2
## 
##    B1 B2 B3
## A1  7  9 11
## A2  8 10 12
## 
## , , C3
## 
##    B1 B2 B3
## A1 13 15 17
## A2 14 16 18
## 
## , , C4
## 
##    B1 B2 B3
## A1 19 21 23
## A2 20 22 24

As you can see, arrays are a natural extension of matrices. They can be useful in programming new statistical methods.

Like matrices, they must be a single mode. Identifying elements follows what you’ve seen for matrices. In the previous example, the z[1,2,3] element is 15.
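The sketch below rebuilds the same array and shows a few ways of identifying elements, by position and by dimension name.

```r
dim1 <- c("A1", "A2")
dim2 <- c("B1", "B2", "B3")
dim3 <- c("C1", "C2", "C3", "C4")
z <- array(1:24, c(2, 3, 4), dimnames = list(dim1, dim2, dim3))

dim(z)                # 2 3 4
z[1, 2, 3]            # 15
z["A1", "B2", "C3"]   # 15, the same element selected by name

# Leaving a subscript empty selects everything along that dimension
z[, , "C1"]           # the 2 x 3 slice for C1
```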

2.5 R Data Object - Lists

  • Lists are the most complex of the R data types.

  • Basically, a list is an ordered collection of objects (components).

  • A list allows you to gather a variety of (possibly unrelated) objects under one name. For example, a list may contain a combination of vectors, matrices, data frames, and even other lists.

  • You create a list using the list() function.

2.5.1 Creating a list

In this example, you create a list with four components: a string, a numeric vector, a matrix, and a character vector. You can combine any number of objects and save them as a list.

g <- "My First List" #a string 
h <- c(25, 26, 18, 39) #a numeric vector
j <- matrix(1:10, nrow=5) #a matrix
k <- c("one", "two", "three") #a character vector
mylist <- list(title=g, ages=h, j, k)

2.5.2 Structure of a list

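The output below shows the structure of the list object created. Named components are displayed with a $ prefix, while unnamed components are displayed by position in double brackets.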

mylist
## $title
## [1] "My First List"
## 
## $ages
## [1] 25 26 18 39
## 
## [[3]]
##      [,1] [,2]
## [1,]    1    6
## [2,]    2    7
## [3,]    3    8
## [4,]    4    9
## [5,]    5   10
## 
## [[4]]
## [1] "one"   "two"   "three"
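Components of a list can be retrieved by name or by position. The following sketch rebuilds the same list in one call and shows the common access forms.

```r
mylist <- list(title = "My First List",
               ages = c(25, 26, 18, 39),
               matrix(1:10, nrow = 5),
               c("one", "two", "three"))

mylist$ages        # 25 26 18 39
mylist[["ages"]]   # same component, selected by name
mylist[[2]]        # same component, selected by position

# Single brackets return a sub-list rather than the component itself
class(mylist["ages"])   # "list"

# Unnamed components have empty names
names(mylist)      # "title" "ages" "" ""
```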

2.6 R Data Object - Data Frame

  • A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column.

  • Following are the characteristics of a data frame.

    • The column names should be non-empty.
    • The row names should be unique.
    • The data stored in a data frame can be of numeric, factor or character type, among others (the example below also uses a Date column).
    • Each column should contain the same number of data items.

2.6.1 Creating a data frame programmatically

A data frame can be created programmatically by using the data.frame() function as shown in the code chunk below.

emp.data <- data.frame(
   emp_id = 1:5,
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15", "2014-05-11",
      "2015-03-27")),
   stringsAsFactors = FALSE
)
emp.data
##   emp_id emp_name salary start_date
## 1      1     Rick 623.30 2012-01-01
## 2      2      Dan 515.20 2013-09-23
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27

2.6.2 Get the Structure of the Data Frame

The structure of the data frame can be seen by using the str() function.

str(emp.data) 
## 'data.frame':    5 obs. of  4 variables:
##  $ emp_id    : int  1 2 3 4 5
##  $ emp_name  : chr  "Rick" "Dan" "Michelle" "Ryan" ...
##  $ salary    : num  623 515 611 729 843
##  $ start_date: Date, format: "2012-01-01" "2013-09-23" ...
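Individual columns and rows of a data frame can be extracted with $ and bracket notation. The sketch below rebuilds emp.data so that it runs on its own; the name high_paid and the salary threshold of 700 are hypothetical choices for illustration.

```r
emp.data <- data.frame(
   emp_id = 1:5,
   emp_name = c("Rick", "Dan", "Michelle", "Ryan", "Gary"),
   salary = c(623.3, 515.2, 611.0, 729.0, 843.25),
   start_date = as.Date(c("2012-01-01", "2013-09-23", "2014-11-15",
                          "2014-05-11", "2015-03-27")),
   stringsAsFactors = FALSE
)

emp.data$salary          # one column as a vector
emp.data[2, ]            # the second row, all columns
nrow(emp.data)           # 5
ncol(emp.data)           # 4

# Rows meeting a condition, restricted to two columns
high_paid <- emp.data[emp.data$salary > 700, c("emp_name", "salary")]
high_paid                # the rows for Ryan and Gary
```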

2.6.3 Summary of Data in Data Frame

The statistical summary and nature of the data can be obtained by applying the summary() function.

summary(emp.data)
##      emp_id    emp_name             salary        start_date        
##  Min.   :1   Length:5           Min.   :515.2   Min.   :2012-01-01  
##  1st Qu.:2   Class :character   1st Qu.:611.0   1st Qu.:2013-09-23  
##  Median :3   Mode  :character   Median :623.3   Median :2014-05-11  
##  Mean   :3                      Mean   :664.4   Mean   :2014-01-14  
##  3rd Qu.:4                      3rd Qu.:729.0   3rd Qu.:2014-11-15  
##  Max.   :5                      Max.   :843.2   Max.   :2015-03-27

2.6.4 Appending a new data frame onto existing data frame

To add more rows permanently to an existing data frame, we need to bring in the new rows in the same structure as the existing data frame and use the rbind() function.

In the example below we create a new data frame with new rows.

emp.newdata <-  data.frame(
   emp_id = 6:8,
   emp_name = c("Rasmi","Pranab","Tusar"),
   salary = c(578.0,722.5,632.8), 
   start_date = as.Date(c("2013-05-21","2013-07-30","2014-06-17")),
   stringsAsFactors = FALSE
)

Next, we combine it with the existing data frame by using rbind() to create the final data frame.

emp.finaldata <- rbind(emp.data,emp.newdata)
emp.finaldata
##   emp_id emp_name salary start_date
## 1      1     Rick 623.30 2012-01-01
## 2      2      Dan 515.20 2013-09-23
## 3      3 Michelle 611.00 2014-11-15
## 4      4     Ryan 729.00 2014-05-11
## 5      5     Gary 843.25 2015-03-27
## 6      6    Rasmi 578.00 2013-05-21
## 7      7   Pranab 722.50 2013-07-30
## 8      8    Tusar 632.80 2014-06-17
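The requirement that the new rows share the structure of the existing data frame can be seen in a small sketch. The two-column data frames and the mismatched column called name are hypothetical and deliberately simplified; tryCatch() is used here only to capture the error so the script keeps running.

```r
emp.data <- data.frame(emp_id = 1:2,
                       emp_name = c("Rick", "Dan"),
                       stringsAsFactors = FALSE)

# 'name' is a deliberately mismatched, hypothetical column name
bad.newdata <- data.frame(emp_id = 3L,
                          name = "Rasmi",
                          stringsAsFactors = FALSE)

# rbind() stops with an error when the column names differ
result <- tryCatch(
   rbind(emp.data, bad.newdata),
   error = function(e) conditionMessage(e)
)
result   # the captured error message

# With matching column names the rows are appended as expected
good.newdata <- data.frame(emp_id = 3L,
                           emp_name = "Rasmi",
                           stringsAsFactors = FALSE)
combined <- rbind(emp.data, good.newdata)
nrow(combined)   # 3
```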