Introduction

R is a widely used programming language and environment for statistical analysis. This guide provides the foundations of the R programming language. It will be useful for beginners as well as for those who have already started using R in their analysis but have not yet got hold of the programming language underneath. Once you get a grip on the language, you will be able to use the built-in libraries, and the libraries contributed by the R user community, to choose and apply statistical functions to your data. For a comprehensive list of packages available in R, you may visit this link: R Packages

Installation

For installation and use of R and R Studio, please follow these links: R; R Studio

Data Types

There are different kinds of data. They can be quantitative or qualitative. Quantitative data are numeric and can be continuous or discrete. Qualitative data are categorical: when the categories have no inherent order they are called nominal, and when they can be ranked they are called ordinal. R supports the following data types. Note that integer values are suffixed with the letter L, and character values are enclosed within single or double quotes.

datatypes <- data.frame(
   'Data Type'=c('integer','double','numeric','complex','logical','character'),
   Description=c('whole numbers','numbers with decimals',
                 'integer or double (a mode rather than a separate storage type)',
                 'numbers with imaginary parts','binary status',
                 'an empty string or a character or a string of characters'),
   Examples=c('-5L, 0L, 23L, 9976L', '-Inf, 0.1, 12.987, 0.000012, Inf',
              '23L, 0.1, 12.987, 0.000012, Inf', '12i, 4+2i','TRUE, FALSE, T, F',
              "'', 'a', 'xyz', 'small', 'a lot of ice floats on sea'"),
   check.names=FALSE, row.names=NULL
)
knitr::kable(datatypes, caption='Data Types')

Character values are a convenient way of writing date values. Different date formats can be expressed effectively as string literals, such as: '10 Oct 2017', 'Wed 10-01-87 5:30:45 PM'

However, to ease date value manipulations and calculations, these string literals are converted into date objects. For more details, see Date Functions.

Special values

R uses predefined keywords to represent certain special values.

specialVals <- data.frame(
   Keywords=c('Inf','-Inf','NA','NULL','TRUE','FALSE', 'pi'),
   Description=c('positive infinity: larger than any representable number',
                 'negative infinity: smaller than any representable number',
                 'a logical value representing missing values',
                 'the null object: no value at all','logical value; equals to 1; abbreviated as: T',
                 'logical value; equals to 0; abbrev. as: F',
                 'ratio of the circumference of a circle to its diameter; approx. 3.14159'),
   Uses=c('a positive value divided by zero', 'a negative value divided by zero',
         'missing values', 'an empty or undefined object',
         'truth in binary status such as yes or no','absence of an observable state',
         'arcs and angles'), check.names=FALSE, row.names=NULL
)
knitr::kable(specialVals, caption='Special values and constants')
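A few of these special values can be seen in action in the short sketch below. It uses only base R expressions, and the comments describe behaviour that the table only summarises.

```r
# Division by zero yields the infinite values
posInf <- 1 / 0        # Inf
negInf <- -1 / 0       # -Inf

# NA propagates through arithmetic, because the result is also unknown
unknownSum <- NA + 1   # NA

# NULL has length zero and is dropped when combined into a vector
length(NULL)           # 0
length(c(1, NULL, 2))  # 2

# TRUE and FALSE behave as 1 and 0 in arithmetic
TRUE + TRUE            # 2
```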

Variables

To access a value (of any of the Data Types) or a set of values, R allows user-defined symbolic names called variables. Depending on the type of data, a variable is called a numeric variable, a logical variable and so on. Variables can be named using alphanumeric characters. The rules are (a) no spaces are allowed, (b) no special characters are allowed except a dot (.) or an underscore (_), and (c) a name cannot start with a digit or an underscore, nor with a dot followed by a digit.
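The naming rules can be tried out with a few assignments. The variable names below are made up for illustration. The base R function make.names() is used here only to show how R would repair an invalid name, since typing an invalid name directly is a syntax error.

```r
# Valid names: letters, digits, dot and underscore; not starting with a digit
sample_size <- 30
mean.height <- 162.5
trial2 <- TRUE

# Invalid names are given as strings; make.names() turns each into a valid name
fixedDigit <- make.names('2ndTrial')     # "X2ndTrial": cannot start with a digit
fixedSpace <- make.names('my var')       # "my.var": spaces are not allowed
unchanged  <- make.names('mean.height')  # already valid, returned as it is
```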

Operations, Operands and Operators

Operations are tasks performed by scripts. These can be assignments, mathematical expressions, comparisons, function calls and so on. Some operations use special ‘operators’ to perform specific tasks on ‘operands’. The operators in R are represented by special symbols. Operands are the ‘data values’ on which an operator performs its task. R allows variable names in place of these operands (data values), as they point to some value. Similarly, a function call or another operation that returns a value can be used in place of an operand. A line of R code that combines operands and operators is called an expression. In an assignment expression, the target variable in which the value is to be stored appears at the head of the arrow (this side <-) and the value appears at the tail end (<- this side). An expression without an assignment operator is volatile, i.e. its result is not stored anywhere. Relational expressions return TRUE or FALSE. The operands of logical operators are the logical values TRUE or FALSE.

Assume that a and b are two logical variables, then the logical operation “a & b” will return TRUE only when both a and b are TRUE. Similarly, “a | b” will return FALSE only when both a and b are FALSE.
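The two rules can be checked against all four combinations of a and b. The sketch below stores the combinations in two logical vectors (vectors are covered later in this guide); the names andResult and orResult are chosen only for this illustration.

```r
# all four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

andResult <- a & b   # TRUE only where both a and b are TRUE
orResult  <- a | b   # FALSE only where both a and b are FALSE

# a small truth table
data.frame(a, b, 'a & b'=andResult, 'a | b'=orResult, check.names=FALSE)
```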

Except for the exponent and leftward assignment operators, which are evaluated from right to left, operators of equal precedence are evaluated from left to right. If you want to override the default precedence, enclose expressions within parentheses (sometimes referred to as the precedence operator). For instance, 2 + 3 * 4 evaluates to 14, whereas (2 + 3) * 4 evaluates to 20.

Also note that enclosing an expression or a group of expressions within parentheses increases the readability of the code.
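As a quick check of the evaluation order, the following pairs of expressions differ only in their parentheses. This is a minimal sketch using plain numbers; the variable names are illustrative.

```r
leftToRight <- 10 - 4 - 3    # 3: evaluated as (10 - 4) - 3
regrouped   <- 10 - (4 - 3)  # 9

rightToLeft <- 2 ^ 3 ^ 2     # 512: the exponent groups as 2 ^ (3 ^ 2)
squaredCube <- (2 ^ 3) ^ 2   # 64

negSquare <- -2 ^ 2          # -4: the exponent is applied before the unary minus
squareNeg <- (-2) ^ 2        # 4
```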

Be aware that R has special operators other than those covered in this section. For instance, %in% is a vector operator (see Vector operations) and %*% is the true matrix multiplication operator (see Matrix operations). In addition, the two square brackets (open and close, []) together form the array access operator (see Accessing and editing vector elements and Accessing and editing matrix elements).

# Mathematical expressions

#Assume initPop, maxRange and minIterations are numeric variables used in the
#below examples

initPop <- initPop + 34 #addition
range <- maxRange - 1.2 #subtraction
tripleIter <- minIterations * 3 #multiplication
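expOf2 <- 2.718 ** 2 #exponent (** is a synonym of ^); approximates exp(2)
sqrtOf5 <- 5L ^ (1/2) #exponent. Function: sqrt(5L)
infVal <- -100 / 0 #division; a negative value divided by zero gives -Inf
leftOver <- 1000 %% 3 #modulus; the remainder of the division (1)
weightPerBatch <- 349.8 %/% 49.2 #integer division; gets floor value of the result (7)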


# Comparative expressions (relational, logical)

# Assume airTemp and stopValveRange are numeric variables; and isFlying and isRaining are 
#logical variables used in the below examples.

isCold <- (airTemp <= 18) #less than or equal to. Similar: >, >=, <
stopValve <- (stopValveRange == 50) #equality
notInFlight <- !(isFlying) #logical not operation. Other use: a != b

isColdAndFlying <- isCold && isFlying #logical AND
isColdOrisRaining <- isCold || isRaining #logical OR

Functions

A function is a block of code (one or more lines) that carries out a specific operation, optionally on input values. When no input values are specified, a function simply performs its task using its own instructions. The input variables are called parameters. A function can have zero or more parameters. A function may return a result derived from the task performed inside it. Frequently used tasks are wrapped in functions, which helps eliminate redundancy in a script or program. All variables defined inside a function are called local variables. Their values are not visible outside the function in the calling program. However, their values can be returned to the caller. R has many built-in functions that are bundled with the base package, and it also allows the user to define custom functions. Functions are sometimes also called routines.

Creating Functions

The R syntax for defining a custom function is given below with examples.

functionNameHere <- function(parameter1='initial value', parameter2='init val', ...) {
   # <function tasks go here>
   return(returnValue) #return is optional
}

#Example 1: This function takes two parameters (number1 and number2). Both are numeric 
#variables and initialised to zero. Inside the function, a local variable (returnValue)
#is created and assigned the sum of the two numbers. Finally, the sum is returned
#to the caller of the function.
addTwoNumbers <- function(number1=0, number2=0){
   returnValue <- number1 + number2
   return(returnValue)
}

#function calls without input values. This is possible when the parameters are
#initialised in the function definition. 
addTwoNumbers() #returns 0 (= 0 + 0)

#function calls with just one value.
addTwoNumbers(2) #returns 2 (= 2 + 0) (number1 gets the value 2; number2 is the default 0)

#function calls with two values
addTwoNumbers(2, 4) #number1=2; number2=4; returns 6 (= 2 + 4)

#function calls with named parameters; note that the order in which parameters are passed
#can be changed
addTwoNumbers(number2=7, number1=4) #returns 11 (= 4 + 7)


#Example 2: This function just prints a preset error message to the user when called.
#It neither accepts any parameters nor returns any value.
raiseErrorMessage <- function() {
   print('Error occurred during the process. Please contact the admin for troubleshooting!')
}

#<your main code goes here, during which an error occurred...>
raiseErrorMessage() #prints the error message
#<you further proceed with the code...>

Testing global assignment operator using functions

Function parameters, and other variables created inside the function block, are local variables. Local variables are neither visible nor in scope outside the function. The global assignment operator (<<-) is used to give global scope to variables assigned in the function body. Let us understand this using the following demo.

x <- 100 #a global variable x

#now, let us try to modify the value of x from a function
F1 <- function() {   x <- 200 }
#let us call the function to try to modify x, and then check the value of x
F1(); print(x) #prints 100

#the above attempt will fail to change the value in x, because of the rules
#of scope of variable as explained above;

#There are three ways we can achieve this:
#1st way:
F2 <- function() { val <- 200; return(val) }
x <- F2()
#The function F2 returns a value that gets assigned to the outside x

#2nd way:
F3 <- function(val=0) { val <- val+100; return(val) }
x <- F3(x)
#The function F3 accepts the x from outside and modifies it; the modified
#value is then returned to the caller where it is transferred to the x
#defined outside

#3rd way:
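F4 <- function() { x <<- 200 }
F4(); print(x) #prints 200; no assignment is needed at the call
#The function F4 uses the global assignment operator <<-, which looks for its LHS
#operand x in the enclosing environments (here, the global one) and modifies it there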

#Note that if x is not defined outside before it is used in the function 
#call that uses global assignment operator on it, then a global 
#variable x will be created from function call.
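The last note can be verified with a small sketch. The names recordHit and hitCount are hypothetical and are assumed not to exist in the session beforehand.

```r
recordHit <- function() {
   # no local hitCount is created here; <<- assigns outside the function
   hitCount <<- 1
}

existedBefore <- exists('hitCount')  # FALSE in a fresh session
recordHit()
existedAfter <- exists('hitCount')   # TRUE: the function call created it
hitCount                             # 1
```

Because such variables appear as a side effect of a function call, returning a value (the 1st and 2nd ways above) is usually easier to follow.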

Built-in functions

R comes with several built-in functions in its base package. These functions can be called directly from a script with the required parameters. The help on any function can be seen by typing ?<function name>. Here, the most frequently used functions are listed with examples, grouped by their purpose. Some functions have parameters that are not explained here; we focus only on the essential parameters.

Data type checking

The function typeof(x) gives the name of the data type of the variable x. A set of functions whose names follow the pattern is.<datatype>(x) check whether the variable x belongs to the specified data type. These functions return TRUE or FALSE.

#let us create some scalars for demo
x <- -Inf; y <- TRUE; z <- NA

typeof(x) #returns 'double'
typeof(y) #returns 'logical'

is.numeric(x) #returns TRUE
is.logical(y) #returns TRUE
is.integer(x) #returns FALSE
is.double(x) #returns TRUE
is.character(x) #returns FALSE

is.na(z) #checks if the value is NA
is.null(z) #checks if the value is NULL
is.infinite(x) #returns TRUE
is.finite(z) #returns FALSE

Data type conversions

Type conversion functions transform a value from one data type to another, with a few restrictions. These functions follow the naming pattern: as.<datatype>().

as.numeric('87') #converts string numbers to numeric
as.numeric('X98') #returns NA with a warning due to the presence of the letter X
as.numeric(FALSE) #returns 0; returns 1 for TRUE
as.integer(67.6123) # returns 67; converts double to integer by dropping the fraction

as.character(1999) #converts number to string literal '1999'
as.character(FALSE) #returns the character value 'FALSE'

as.logical(0) #=FALSE
as.logical(5065) #any non-zero number (negative numbers too) is TRUE

z <- 1
as.logical(z) #returns TRUE; 1=TRUE; 0=FALSE

Math functions

Some of the frequently used math functions are listed below.

abs(-18723.23) #removes sign
round(pi, digits=2) #pi value rounded off to 2 decimals
signif(pi, digits=3) #pi rounded to 3 significant digits
floor(pi) #nearest integer below (3)
ceiling(pi) #nearest integer above (4)

log(45) #natural log
log10(45) #log base 10; Similar: log2()
log(45, base=pi) #log base of any number

exp(2) #e power 2
sqrt(56) #square root; for a cube root use 56^(1/3)
factorial(3) #3! = 1 * 2 * 3

sin(20) #angle in radians. Similar trigonometric functions: cos(), tan(), asin(), acos(), atan()

Date Functions

We have seen that text is a convenient way of writing date values in different formats. However, date values can also be stored as equivalent numeric values (relative to an arbitrary reference date). Let us see the functions related to date values and their uses.

#let us initialise some date values as character type
programStartsOn <- '11 10 2017' #format: DD MM YYYY
eventEndsAt <- '18 03 21 5:30:00 PM' #format: YY MM DD HH:MM:SS AM/PM

#we can convert the above strings into Date objects, as in:
#pass the date string variable or its literal with format spec.
pso <- as.Date(programStartsOn, '%d %m %Y', tz='Asia/Calcutta') 
eea <- as.Date(eventEndsAt, '%y %m %d %I:%M:%S %p', tz='Asia/Calcutta')

#if you get warnings on timezone, you need to look at zone.tab file in 
#the installation folders of R; the tz specifications are OS specific;
Sys.setenv(TZ='Asia/Calcutta') #setting an appropriate TZ for the session clears the warnings

#once Date objects are created, their formats can be converted to the
#requirements, as in:
format(pso, '%Y %b %d %I:%M %p %z')

#where the format specifications are:
#  %d  - numeric day of month
#  %m  - numeric month of the year
#  %y  - two digit year format
#  %Y  - four digit year format
#  %I  - hour value in 12 hr format
#  %H  - hour value in 24 hr format
#  %M  - minutes of the hour
#  %S  - seconds of the minute
#  %p  - AM/PM for 12 hr format
#  %a  - weekday abbr;     %A  - Full weekday
#  %b  - month abbr;       %B  - Full month
#  %j  - decimal day of the year
#  %U  - decimal week of the year(starting Sunday)
#  %W  - decimal week of the year(starting Monday)
#  %z  - offset from GMT;  %Z  - time zone

#you should have noticed an issue with the previous function call to 
#format(); the time value was ignored with as.Date() conversion;

#a better alternative to as.Date() for handling date-time values is as.POSIXct();
#the format is passed by name, because the second parameter of as.POSIXct() is tz
pso1 <- as.POSIXct(programStartsOn, format='%d %m %Y', tz=Sys.getenv('TZ'))
eea1 <- as.POSIXct(eventEndsAt, format='%y %m %d %I:%M:%S %p', tz=Sys.getenv('TZ'))

#now format() can be executed without any issues, as in:
format(pso1, '%Y %b %d %I:%M %p %z')
format(eea1, '%Y %b %d %I:%M %p %z')

The other date functions and their uses are listed below:

Sys.Date() #system date
Sys.time() #system time
Sys.timezone() #system timezone

#to get difference between two dates
difftime(pso1, eea1, units='auto')
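#to compare the times of day only, parse the 'HH:MM' strings back into date-times first
difftime(as.POSIXct(format(Sys.time(),'%H:%M'), format='%H:%M'),
         as.POSIXct(format(eea1,'%H:%M'), format='%H:%M'), units='auto')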

#specific units of differences can be specified. The options are:
#weeks, days, hours, mins, secs

#difference in months is not available. See 'Functions(Miscellaneous)'.
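The numeric storage of dates mentioned at the start of this section can be inspected directly. In the sketch below the two dates are arbitrary examples. For Date objects R counts days from the reference date 1970-01-01.

```r
d1 <- as.Date('1970-01-10')  # ISO formatted strings need no format specification
d2 <- as.Date('2017-10-11')

daysFromOrigin <- as.numeric(d1)  # 9: days since 1970-01-01

# subtracting two dates gives a difftime in days
daysApart <- d2 - d1
weeksApart <- as.numeric(difftime(d2, d1, units='weeks'))

# adding a number to a Date moves it by that many days
laterDate <- d1 + 30              # "1970-02-09"
```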

Iterable container objects

Container objects are used to store multiple values. Vectors, matrices, lists, and dataframes are unique in their structure and purpose. However, they can be related to each other by some of their properties. R allows sensible conversions of one container to another with some restrictions. One of the common features of container objects is that they are iterable. Iterable means that the elements of a container can be accessed one after another through sequential looping (see Looping statements).
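As a minimal sketch of what iterable means, a for loop can visit the elements of a vector or a list one at a time. The object names bases, visited and mixed are made up for this illustration.

```r
bases <- c('a', 'u', 'g', 'c')   # a character vector
visited <- character(0)
for (base in bases) {
   visited <- c(visited, base)   # elements arrive in their stored order
   print(base)
}

# a list can hold values of different types and is iterated the same way
mixed <- list(1L, 'two', TRUE)
for (item in mixed) print(typeof(item))
```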

Vectors

Vectors are single dimensional arrays. They allow values of a single data type. When you add a logical value to a numeric vector, the logical value is automatically converted into a numeric value (TRUE=1, FALSE=0). On the other hand, when you mix a character value into a numeric vector, all elements are converted into character. You can use functions like typeof(), is.numeric() and as.logical() on vectors as well (see Data type checking, Data type conversions). Each element of a vector can be given a name. Named vectors are useful when you have a few elements. However, the data type rules are the same for named vectors as well.

Built-in vectors

#some built-in vectors
LETTERS #upper case english alphabets
letters #lower case alphabets
month.name #12 full month names
month.abb #12 abbreviations of month names

Creating vectors

1:25 #creates a vector of integers starting from 1 to 25
c(1:25) #same as above

c() #does not create an empty vector; it results in NULL
vector() #creates an empty logical vector

#to create an integer vector of length 15 initialized with 0s
vector(mode='integer', length=15L) 

seq(from=1, to=25) #same as 1:25; default interval=1

c(45, 23, 12, 76, 24, 90, 88) #a vector of double data type
c(45L, 23L, 12L, 76L, 24L, 90L, 88L) #a vector of integer data type

c('UAG', 'UGA', 'UAA') #a character vector
c(amber='UAG', umber='UGA', ochre='UAA') #a named character vector

# Mixing data values does automatic type conversion
c(45, 23, 12, 76, 24, TRUE, 88) #numeric vector; TRUE gets the value of 1
c(45, 'XY', 12, 76, 24, 90, 88) #character vector
#note that all numbers have become character literals

a <- c() #against intuition, this does not create an empty vector; it results in NULL
a <- vector() #creates an empty vector = vector(mode='logical', length=0L)
vector(mode='double', length=10L) #creates double vector of length 10; default values=0

seq(0, 1, by=0.01) #creates numeric vector starting from 0 to 1 with an interval of 0.01

#creates numeric vector with a range of 0 to 1 divided into specified length;
#interval is calculated by length of vector specified
seq(0, 1, length.out=12) 

rep(TRUE, times=10) #creates logical vector of 10 TRUEs
rep(c(TRUE, FALSE), times=10) #creates logical vector of 20 elements: the pair TRUE, FALSE repeated 10 times
rep(c(1,2,3), each=3) #each element in c(1,2,3) is repeated 3 times
rep(c(1,2,3), each=3, length.out=5) #same as above, but limits the length to 5

Vector type checking

The functions listed under Data type checking also work on vectors. The functions is.na(), is.finite(), and is.infinite() check each element and return a logical vector indicating whether the ‘is’ condition matched or not, whereas is.null() checks the vector as a whole and returns a single value.

x <- c(T, NULL, F, F, T, F, NA); y <- seq(1, 10, length.out=20)
z <- letters; k <- c('red', NULL, 'blue'); u <- c(2, 4, NULL, 5, 9)
#note that NULL is ignored completely when a vector is created: the logical vector x
#has 6 elements, the character vector k has 2 and the numeric vector u has 4.

typeof(x) #returns 'logical'
typeof(y) #returns 'double'

is.logical(x); is.logical(y) #returns TRUE and FALSE respectively
is.integer(y); is.numeric(y); is.numeric(z) #returns FALSE, TRUE, FALSE respectively

# to check if any of the elements is NA, the following function is used
is.na(x) #returns a logical vector of FALSE, FALSE, FALSE, FALSE, FALSE, TRUE

#whereas, is.null() just checks if the vector x is NULL or not; it does not check
#each element
is.null(x) #returns FALSE

Vector type conversions

x <- c(12.34, 14.21, 17.06, 14.67) #this is a double vector
as.integer(x) #drops the fractional part of each value in x
as.character(x) #converts the double values to string literals

y <- c('12', '00', '15', '17', '00') #this is a character vector
as.numeric(y) #converts the char vector into numeric vector

#converts the char vector into num vector, and then into logical vector
#all zeros are FALSE, all non-zero values are TRUE
as.logical(as.numeric(y))

z <- c(T, F, F, T, T, F) #this is a logical vector; TRUE and FALSE are abbreviated
as.integer(z) #converts logical vector into integer vector; T=1, F=0
as.character(z) #results in 'TRUE', 'FALSE', 'FALSE', 'TRUE', 'TRUE', 'FALSE'

Accessing and editing vector elements

# Accessing vector element by location index

x <- c(12.34, 14.21, 17.06, 14.67) #this is a double vector
x[1] #extracts the first element from x
x[4] <- 20.00 #modifies the 4th element to 20.00; the vector x is modified
x[-3] #extracts all elements other than the 3rd element
x[-c(1,3)] #extracts all elements other than 1st and 3rd element

#WARNING: the following would overwrite the vector x and replace it with a numeric scalar
#x <- 20.00
x[] <- 20.00 #this replaces all values with 20.00 and keeps the length of x
x[1:3] <- 10.00 #this will replace the first 3 elements with 10.00
x[c(2,5)] <- 5.00 #replaces the 2nd element with 5.00; x has only 4 elements, so a 5th is added


# Accessing vector element by names of elements

stopCodons <- c(amber='UAG', umber='UGA', ochre='UAA') #this is a named vector
stopCodons['umber'] #this extracts the value of element named 'umber'
stopCodons[c('amber','ochre')] #extracts values of elements named 'amber' or 'ochre'


# Logical extraction of vector elements

#this is a sample char vector
randomLetters <- c("z", "r", "o", "n", "g", "j", "s", "v", "i", "f") 

#There are 10 letters in this vector; the following will extract all elements in 
#odd positions;
#Note that in the following expression, the logical vector passed to vector access 
#operator [] has the logical value TRUE at odd positions
randomLetters[c(T, F, T, F, T, F, T, F, T, F)]

#With the same vector, the vowels can be extracted by passing the logical vector
#with TRUE in the positions where vowels are present. In the above example the 3rd and 9th 
#positions have the vowels 'o' and 'i' respectively.
randomLetters[c(F, F, T, F, F, F, F, F, T, F)]

#The above methods are feasible when the size of vectors is manageable, and when 
#the positions of required elements are known; the %in% operator makes this
#easy and scalable;
#To make the above example more robust, the following can be used:
randomLetters[randomLetters %in% c('a', 'e', 'i', 'o', 'u')]

# In the above example, the expression inside [] returns a logical vector by matching each
# element to the reference vector c('a', 'e', 'i', 'o', 'u'); all matching positions get
# the value TRUE and the remaining get FALSE

#The above example can reverse the condition (i.e. all elements that are NOT vowels) using
#negation operator, as given below:
randomLetters[!randomLetters %in% c('a', 'e', 'i', 'o', 'u')]


# Logical extraction of numeric vector elements

randomNumbers <- c(8,6,6,rep(3,3),rep(4,3)) #this is sample numeric vector of 9 elements

# The following expression will extract all even numbers
randomNumbers[c(T, T, T, F, F, F, T, T, T)]

#-which can also be achieved (robust and scalable) using the following conditional expression
randomNumbers[randomNumbers %% 2 == 0]
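When the positions of the matching elements are needed rather than their values, the base R function which() converts a logical vector into position indices. This is a small sketch reusing the sample vector from above; the names isEven and evenPositions are illustrative.

```r
randomNumbers <- c(8,6,6,rep(3,3),rep(4,3))

isEven <- randomNumbers %% 2 == 0  # logical vector, TRUE for even values
evenPositions <- which(isEven)     # 1 2 3 7 8 9
randomNumbers[evenPositions]       # 8 6 6 4 4 4, same as randomNumbers[isEven]
```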

Vector operations

numObservations <- c(2, 3, 2, 4, 2)
numWeights <- c(1.0, 0.5, 0.5, 2.0, 0.1)
numCharge <- c(100, 200, 300)

# scalar operations are performed on every element (applicable to all math operators)
numObservations * 12 #all elements are multiplied by 12

# operations on two vectors of same length (applicable for all operations)
rep(2, 5) * numWeights #elements at matching positions are multiplied pair-wise

#operations on two vectors of differing length (executed with a warning message when
#the longer length is not a multiple of the shorter one);
#The length of the shorter vector is automatically adjusted 
#so as to match the length of the longer one; the elements in the shorter
#vector are repeated (recycled) one or more times to fill the gap
numCharge + numObservations #the numCharge vector is extended to the length 
#of 5 automatically

#the not operator on a logical vector reverses all logical values
movementDetected_d1 <- c(T, T, T, F, F, F, T, T, T)
movementNotDetected <- !(movementDetected_d1)

#the logical AND (&) and OR (|) operators check the elements pair-wise on
#two logical vectors
movementDetected_d2 <- c(T, T, F, T, T, F, T, F, T)
movementDetected_d1 & movementDetected_d2

#note that we don't use the scalar logical operators && and || on vectors; in older
#versions of R they checked only the first pair, and since R 4.3.0 they raise an error
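When a single TRUE or FALSE is needed from a whole logical vector, for example as a condition, the base R functions any() and all() are one way to reduce the element-wise result. The sketch below repeats the two detector vectors under shorter, illustrative names.

```r
d1 <- c(T, T, T, F, F, F, T, T, T)
d2 <- c(T, T, F, T, T, F, T, F, T)

bothDetected <- d1 & d2          # element-wise AND
anyBoth <- any(bothDetected)     # TRUE: at least one position is TRUE in both
allBoth <- all(bothDetected)     # FALSE: not every position is TRUE in both
countBoth <- sum(bothDetected)   # 4 positions are TRUE in both vectors

if (anyBoth) print('movement detected by both detectors at least once')
```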

Generic vector functions (applicable to all types of vectors)

#let's create a char vector of some alphabets
randomLetters <- c("z", "r", "o", "n", "g", "j", "s", "v", "i", "f") 

names(stopCodons) #works only if the vector elements are named
sort(randomLetters) #sorts letters in ascending (alphabetical) order

#sample() is a statistical function that samples a given vector (the population)
#for a given sample size; the parameter 'replace' when set to TRUE allows the sample
#to select the same elements repeatedly; in statistical terms this kind of sampling is called
#sampling with replacement; when this parameter is set to FALSE,
#sampling does not allow an element from the population to be selected more than once;
#when the sample size is larger than the population, then it becomes mandatory to set replace=T
randomNumbers <- sample(1:10, size=50, replace=T) #samples 50 random numbers from 1:10

sort(randomNumbers, decreasing=T) #sorts the randomNumbers sample in descending order
rev(randomNumbers) #flips or reverses the vector elements
length(letters) #counts elements

randomNumbers2 <- c(8,6,6,rep(3,3),rep(4,10),rep(5,4)) #creates a numeric vector

unique(randomNumbers2) #removes duplicates
freqRandNums <- table(randomNumbers2) #frequency of unique values
#the following expression gets the position of the value having maximum frequency 
#from frequency table ('freqRandNums')
which.max(freqRandNums)

freqRandNums[which.max(freqRandNums)] #fetches the frequency of the modal value

#let's extend randomNumbers2 with the digit 5 repeated 6 times; this makes 
#the vector bimodal (4 and 5 both occur 10 times); this is a demo on how to get all the modes
randomNumbers2 <- c(randomNumbers2, rep(5,6))
fr2 <- table(randomNumbers2) #this creates frequency table for the modified vector
#the following expression gets all modes having the same frequency from the new frequency table 'fr2'
as.numeric(names(fr2[fr2 == fr2[which.max(fr2)]]))

Numeric vector functions

sum(randomNumbers) #totaling

sum(movementDetected_d1) #logical vector summing; T=1; F=0
#the logical vector summing will be useful to check how many elements in a vector
#matches a particular condition; it is like taking the count of all TRUEs

#to count the number of randomLetters that are vowels
sum(randomLetters %in% c('a','e','i','o','u'))
#to count the number  of elements that are even
sum(randomNumbers %% 2 == 0)

#the following are some frequently used numeric vector functions;
#the parameter na.rm when set to TRUE, ignores missing values present in a vector;
#when this parameter is FALSE (default) and when the numeric vector contains
#missing values (represented by NA), these functions will return NA as the result

mean(randomNumbers, na.rm=T) #averaging
max(randomNumbers, na.rm=T) 
min(randomNumbers, na.rm=T)
range(randomNumbers, na.rm=T) #gives min to max range
median(randomNumbers, na.rm=T) #the middle value of the sorted data
var(randomNumbers, na.rm=T) #variance
sd(randomNumbers, na.rm=T) #standard deviation

#A mode (= the most frequently observed value) function is not available in
#base R; see 'Functions(Miscellaneous)' for a function definition of mode;

#if you want to classify numeric vector values into groups using 
#levels or breaks or bins, use cut() function, as in:

#(1) specify number of groups by 'breaks' parameter
#cuts 1:10 in two equal groups and labels them as A and B
cut(1:10, breaks=2, labels=c('A','B'))
#does the same trick even if numbers are shuffled
cut(c(10, 3, 6, 2, 4, 7, 1, 9, 5), breaks=2, labels=c('A','B'))

#(2) or specify exact break points using a vector
cut(1:10, breaks=c(0,5,10), labels=c('A','B'))
#Note: to include the minimum value in the grouping, set the lowest
#break point to minimum-1

#Also note that when labels are specified with a vector of labels,
#a factorized vector is returned (See 'Factors' and revisit this...);
#Instead, you can set labels=FALSE to get the numeric ordering of groups as
#a simple vector, as in:
cut(c(1:10), breaks=2, labels=FALSE)
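Combining cut() with table() gives the number of values in each group. The following is a sketch with made-up scores. It uses the include.lowest parameter of cut() as an alternative to lowering the first break point, so that the minimum value 0 falls into the first group.

```r
scores <- c(12, 45, 67, 89, 34, 50, 100, 0)  # hypothetical sample data

# two groups: 0 to 50 and above 50 up to 100; include.lowest=TRUE keeps the 0
grades <- cut(scores, breaks=c(0, 50, 100), labels=c('low','high'),
              include.lowest=TRUE)

gradeCounts <- table(grades)  # low = 5, high = 3
gradeCounts
```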

Character vector functions

#the following is a string variable (in R this is a character vector of length 1);
#in english this is a sentence
myText <- 'You can\'t get out of life alive! Be bold!'
#note that in the string we have used the escape character backslash (\) to 
#let R know that the single quote inside the string is a part of it

#to split the above text by space delimiter:
strsplit(myText, split=' ') #returns a list holding a character vector of words separated by single space

#to split the same by exclamation mark:
strsplit(myText, split='!') #the char vector in the returned list has length 2


#let us create a sample (size=999) of letters denoting nucleotide bases that 
#forms the structure of genes
randomNucleotides <- sample(c('a','u', 'g', 'c'), 999, replace=T)

# paste0 concatenates vector elements using the separator specified by collapse
randomGeneSeq <- paste0(randomNucleotides, collapse='') #note: collapse specifies empty string

#the above expression binds all 999 random base letters (=char vector of length 999) 
#into one string (=char vector of length 1)

length(randomNucleotides) #gives the length of char vector
nchar(randomGeneSeq) #gives the length of the string

randomGeneSeq <- toupper(randomGeneSeq) #converts to upper case

firstCodon <- substr(randomGeneSeq, 1, 3) #extracts part of the string from 1 to 3

# split the gene sequence string by a given length; codons are triplet bases
codons <- substring(randomGeneSeq, first=seq(1, nchar(randomGeneSeq)-2, by=3), 
                 last=seq(3, nchar(randomGeneSeq), by=3))

#the function substring() allows its parameters 'first' and 'last' to
#take a vector of starting positions and ending positions respectively

table(codons) #Frequency of each unique codon in the vector 'codons'


# earlier we had seen strsplit() can be used to split a string using delimiters
# such as a space or a special character; in fact, the delimiter can be 
# any character or a string or a word; it can also be a regular expression;
# some examples are given below

# split vector by a given value
codons1 <- strsplit(randomGeneSeq, split='UAG') #splits the string using a word ('UAG')

#note that strsplit returns the value as a list; hence, codons1 is a list object

#if the split needs to happen at any of a given set of words, then the regular expression
# to be passed to the split parameter is:
splicedSeq <- strsplit(randomGeneSeq, split='(UAG|UGA|UAA)')

#let us create a string literal of special characters;

specialChars <- '~!@#$%^&*()_+-={}[]|\\:;"\'<,>.?/' #there are 31 special chars in this string
#note that some special characters, when written in a string literal,
#need to be preceded by the escape character backslash (\); 
#such special characters are the backslash itself, and a single quote;
#a single quote needs an escape character when it appears inside a string enclosed by two
#single quotes; if a string literal is enclosed within double quotes then the single
#quote does not require the escape character, instead a double quote will;
#try to remove the escape character and see what happens

#suppose your string contains any of these special chars and you want to 
#split your string by one of them, then you can very well
#use them as delimiters EXCEPT for the following, which have a special meaning in
#regular expressions: $ (dollar), ^ (caret), * (asterisk), () (open or close
#parentheses), + (plus), {} (open or close curly brackets), [] (open or close square
#brackets), | (pipe), . (dot), ? (question)

#if you still want to split a string with these exceptional special chars, then
#use the following regular expression:
unlist(strsplit(specialChars, split='\\$')) #just replace $ with any of the exceptional chars 

#in addition to that, if you want to split by backslash, then use:
unlist(strsplit(specialChars, split='\\\\'))
#and for a single quote (also double quote if the string enclosed in double quotes)
unlist(strsplit(specialChars, split='\''))


# search and extract char vector using grep() function and regular expressions

#let us create some char vectors for demo
chemFormulae <- c('NaHCO_3','NaClO','H_2O_2','NaBO_3','Na_2B_4O_7.10 H_2O','S','KHC_4H_4O_6',
                  'MgSO_4.7 H_2O','CF_2Cl_2','PbS','C_2H_5OH','C','CaSO_4.2 H_2O','Na_2S_2O_3',
                  'N_2O','CaO','CaCO_3','NaOH','CaCO_3','CH_3COC_2H_5','Mg(OH)_2','HCl',
                  'H_2SO_4','CaSO_4.2 H_2O','K_2CO_3','FeS_2','SiO_2','Hg','(CH_3)_2CHOH',
                  'NH_4Cl','NaCl','KCl','KNO_3','Ca(OH)_2','C_12H_22O_11','Na_3PO_4',
                  'Na_2CO_3.10 H_2O','CH_3OH')
chemNames <- c('sodium bicarbonate','sodium hypochlorite','hydrogen peroxide',
               'sodium perborate','sodium tetraborate decahydrate','sulfur',
               'potassium hydrogen tartrate','magnesium sulfate heptahydrate',
               'dichlorodifluoromethane','lead (II) sulfide','ethanol','carbon',
               'calcium sulfate dihydrate','sodium thiosulfate','dinitrogen oxide',
               'calcium oxide','calcium carbonate','sodium hydroxide','calcium carbonate',
               'ethyl methyl ketone','magnesium hydroxide','hydrochloric acid',
               'sulfuric acid','calcium sulfate hydrate','potassium carbonate',
               'iron disulfide','silicon dioxide','mercury','isopropyl alcohol',
               'ammonium chloride','sodium chloride','potassium chloride',
               'potassium nitrate','calcium hydroxide','sucrose','sodium phosphate',
               'sodium carbonate decahydrate','methyl alcohol')

# the general syntax of grep() function is:
# grep(pattern='the regular expression is used here', 
#      x=<the char vector that needs to be searched>, 
#      ignore.case=F, value=F)

#the 3rd parameter 'ignore.case' is set to TRUE or FALSE for toggling case sensitivity;
#the 4th parameter 'value' is set to TRUE if you want grep() to return the elements
#in the vector that match the regular expression; by default (FALSE) it returns the 
#index positions of the matching elements

grep('OH', chemFormulae) #returns index locations of compounds having OH group
#in the above expression, grep searches for the presence of OH as is;
#when the search string OH is given within square brackets - '[OH]', then
#it matches every compound whose formula contains the letter O or the letter H;

#note that the above expression returns the index locations as a vector; which can 
#be used to fetch the related elements from chemFormulae or chemNames, as given below:
chemFormulae[grep('OH', chemFormulae)]
#this is equivalent of grep('OH', chemFormulae, value=T)
#but the index locations are useful, if used to get the corresponding chemical names that
#are stored in a different vector (chemNames), as in:
chemNames[grep('OH', chemFormulae)]


#to search for Na or K in the compounds, let us try:
grep('[NaK]', chemFormulae, value=T) #there is a problem in this!

#the above expression will fetch the compounds containing any of the letters N, a or K,
#which is not the intended search; we wanted to consider Na as a separate search option 
#along with K; in order to achieve the expected results, we would use the following:
grep('Na|K', chemFormulae, value=T)

#suppose, we want to search for compounds that have Na or K and also 
#followed by 'OH' somewhere; let us try using the regular expression as given below:
grep('Na|K.*OH', chemFormulae, value=T) #there is a problem here!

#before solving the problem, let us understand the use of .*;
#the dot represents any single character and * indicates that the preceding 
#character may appear zero or more times; here the preceding character is the
#dot, so .* matches any run of characters (including none)

#the problem is that | (alternation) has the lowest precedence in a regular
#expression; the pattern is read as 'Na' OR 'K.*OH', so every compound with Na
#is returned whether or not OH follows it; this problem can be solved by
#grouping the alternatives:
grep('(Na|K).*(OH)', chemFormulae, value=T)
#in the above expression, we have introduced () to limit the alternation to Na or K;
#the pattern now fetches only compounds in which Na or K is followed by OH


#let us now try to extract all hydrated compounds (=the compounds that have
#H_2O group)
grep('.*(H_2O)', chemFormulae, value=T) #why does it fetch H_2O_2?
#the above expression fetches hydrated compounds, but also hydrogen peroxide (H_2O_2),
#because 'H_2O' occurs at the start of 'H_2O_2'; that is not what we want;

#we can solve the above problem by using the $ symbol to anchor the match at the
#end of the string, as in:
grep('.*(H_2O)$', chemFormulae, value=T)


#earlier we have created a vector of triplet codons in the variable 'codons';
#if you want to search for those codons that start with the base A and end with base U
#having any letter in between, then follow the regular expression as given below:
grep('^(A).(U)$', codons, value=T)

#let us say we have a character vector of binary literals, such as:
binaryCode <- c('1100001100','000000','1111111111', '10001100', '0011110000')

#if you want to search for the codes that start with 11, then the expression is:
grep('^11.*', binaryCode, value=T) #.* allows any digits to follow 11

#suppose you want to make the second digit of the starting value 11 optional;
#then add the suffix ? to the digit that you want to make optional, as in:
grep('^11?.*', binaryCode, value=T) #? makes the preceding character optional

#the following fetches the binary codes having four consecutive 1s anywhere within the string
grep('1{4}', binaryCode, value=T) #use of {n} repeats

#whereas, if you want to extract the binary codes having four consecutive 11s anywhere
#within, then enclose 11 within parentheses to group:
grep('(11){4}', binaryCode, value=T) #use of {n} repeats

#you may add a range of values within curly braces as well; for instance the 
#following expression will fetch codes where 1 is repeated 2 to 4 times in a row;
#since the pattern is not anchored, longer runs of 1s also match
grep('1{2,4}', binaryCode, value=T) #use of {n,m} range repeats


#let us create another vector of alpha numeric passwords for demo:
alphanumCodes <- c('XX000AXZER2','X3000AXZERV','YY021AXYOQ5',
                   'X5000AXZERQ','ZZ103AXUIR1', 'X9XX45PORT')

#matching word should start with X followed by a digit; .* for any chars after that
grep('^X[0-9].*', alphanumCodes, value=T) 
#matching word should start with a capital letter followed by a digit
grep('^[A-Z][0-9].*', alphanumCodes, value=T)

#building a complex pattern: should start with exactly 2 capital letters followed 
#by 3 digits and five capital letters and finally should end with a digit
grep('^[A-Z]{2}[0-9]{3}[A-Z]{5}[0-9]$', alphanumCodes, value=T) #how to build a pattern?

#neither of the first two characters may be Z (denoted by ^[^Z]{2}); they must be
#followed by 3 digits and five capital letters, and the code must end with a digit
grep('^[^Z]{2}[0-9]{3}[A-Z]{5}[0-9]$', alphanumCodes, value=T) #'^' within [] negates the set


#gsub() uses the same kind of pattern as grep(), but replaces every match with a replacement string
gsub(pattern='(UAG|UGA|UAA)', replacement='|*|', randomGeneSeq)
#the above expression replaces the words UAG or UGA or UAA with |*| when
#found in the string 'randomGeneSeq'
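
The two regular expression pitfalls discussed in this section can be checked on a small scale. The sketch below uses a short, hypothetical `formulas` vector (not the full `chemFormulae` vector) to show how grouping changes the meaning of `|`. It also shows `fixed=TRUE`, a standard argument of `strsplit()` that treats the delimiter as a literal string, as one possible alternative to escaping.

```r
# Hypothetical sample data for illustration only
formulas <- c('NaCl', 'NaOH', 'KCl', 'KOH', 'Ca(OH)_2')

# Without grouping, '|' splits the whole pattern: 'Na' OR 'K.*OH'
ungrouped <- grep('Na|K.*OH', formulas, value=TRUE)
print(ungrouped)   # "NaCl" "NaOH" "KOH"

# With grouping, the alternation is limited to Na or K, and OH must follow
grouped <- grep('(Na|K).*OH', formulas, value=TRUE)
print(grouped)     # "NaOH" "KOH"

# fixed=TRUE treats '$' as a plain character, so no escaping is needed
parts <- unlist(strsplit('a$b$c', split='$', fixed=TRUE))
print(parts)       # "a" "b" "c"
```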

Matrices

Matrices are an extension of vectors. While vectors are one-dimensional, matrices are two-dimensional (arrays generalize this to more dimensions). Like vectors, matrices hold values of a single data type; data type mixing follows the rules explained for vectors (see Vectors). Matrices can be viewed as a collection of column vectors or row vectors. They have rows and columns like a table, and the number of rows and columns defines the 2D structure. Here, we focus on 2D matrices. As with named vectors, each column or row can be given a name.

Creating matrices

matrix() #creates a 1R X 1C matrix with a value of NA
matrix(nrow=5) #creates a 5R X 1C matrix with default values of NAs
matrix(nrow=5, ncol=5) #creates 5R X 5C matrix (with NAs)
matrix(0, 3, 5) #creates the 3R X 5C matrix with 0 as default values

#any vector can be converted into 2D matrix of different plausible dimensions;
#for instance:
numVector <- 1:36 #a numeric vector of 36 elements starting from 1 to 36

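#the above vector can be converted into matrices of 9 different dimensions (1X36, 2X18,
#3X12, 4X9, 6X6 and the transposed shapes), each of which can be filled columnwise or
#rowwise; a few examples are given below: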
matrix(numVector, nrow=1) #creates 1R X 36C matrix; values filled from numVector
matrix(numVector, ncol=6) #creates 6R X 6C square matrix; values filled columnwise
matrix(numVector, ncol=6, byrow=T) #6R X 6C matrix; but values filled rowwise
matrix(numVector, nrow=4, ncol=9, byrow=T) #4R x 9C matrix; values filled rowwise


#let us create a 4X5 matrix of random numbers
matRandomNum <- matrix(sample(0:10, 20, replace=T), nrow=4, byrow=T)
#let us also create 6x6 matrix of random letters (lower case)
matRandomChar <- matrix(sample(letters, 36, replace=T), ncol=6)
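
As a quick check of the reshaping rules above, the following sketch lists every row/column combination that a 36-element vector can take and compares the two fill orders on a tiny matrix. The variable names (`rowsPossible`, `dims`, `byCol`, `byRow`) are made up for this illustration.

```r
n <- 36
rowsPossible <- which(n %% 1:n == 0)   # row counts that divide 36 evenly
dims <- cbind(rows=rowsPossible, cols=n / rowsPossible)
print(dims)                            # 9 possible dimensions

byCol <- matrix(1:6, nrow=2)              # default: filled down the columns
byRow <- matrix(1:6, nrow=2, byrow=TRUE)  # filled along the rows
print(byCol[1, ])   # 1 3 5
print(byRow[1, ])   # 1 2 3
```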

Matrix type checking and conversions

Like a vector, a matrix holds values of a single type. Hence, type checking follows the rules of vectors. See Vector type checking for more details.

typeof(matRandomChar) #returns 'character'
is.logical(matRandomNum) #returns FALSE

#however, with type conversion there is a problem! The good news is the conversion occurs
#correctly; but it also removes the dimension from the source matrix and converts the 
#result into a vector 
as.logical(matRandomNum) #returns a logical vector (0 = F; any non-zero value = T)
#WARNING: Also, the logical vector results from reading the matrix columnwise

#To solve these problems, you would need to convert the resulting vector into
#matrix, as in:
matrix(as.logical(matRandomNum), nrow=4) #do not use byrow=T in this expression!
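
Rebuilding the matrix with `matrix()` restores the shape but not the row and column names. One possible alternative, sketched below on a small hypothetical matrix `m`, is the replacement form of `storage.mode()`, which changes the type while keeping the dimension and the names.

```r
# Hypothetical 2 x 3 matrix with row and column names
m <- matrix(c(0, 3, 5, 0, 2, 0), nrow=2,
            dimnames=list(c('Obs1', 'Obs2'), c('S1', 'S2', 'S3')))

flat <- as.logical(m)            # plain logical vector; dim and names are lost
print(dim(flat))                 # NULL

kept <- m
storage.mode(kept) <- 'logical'  # converts the type; dim and dimnames are kept
print(kept)
```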

Accessing and editing matrix elements

Vector elements are accessed using [i] operator with either the index location (i) or the name of the element (as string literal within quotes) if it is a named vector. Similarly, matrix also uses the same operator. However, you would need to specify row and column index - [r,c], to handle 2D matrix. Alternatively, you may use column names and/or row names.

dim(matRandomChar) #returns the dimension vector (rows, cols)
nrow(matRandomNum) #returns number of rows
ncol(matRandomChar) #returns number of cols

colnames(matRandomChar) #reads column names vector; by default this is NULL
row.names(matRandomNum) #reads row names vector; by default this is NULL
colnames(matRandomNum) <- sprintf('Series%d', 1:5) #sets column names
row.names(matRandomNum) <- sprintf('Obs%d', 1:4) #sets row names

#accessing by index locations
matRandomChar[3, 5] #value of element at 3rd row and 5th column
matRandomChar[3, ] #values of elements at 3rd row completely (row vector)
matRandomChar[, 5] #values of elements at 5th column completely (column vector)

matRandomChar[1:3, ] #first 3 rows only
matRandomChar[, c(2,4)] #2nd and 4th columns only

#to extract a block of matrix within matrix starting at 2nd row/3rd column and
#ending at 3rd row/5th column
matRandomChar[2:3, 3:5]

#a few examples to access by column and row names
matRandomNum['Obs2', ] #row identified by row name 'Obs2'
matRandomNum[c('Obs3','Obs4'),  ] #multiple rows identified by the given rownames
#the elements in row (Obs2), but only from the columns - Series2 and Series4
matRandomNum['Obs2', c('Series2', 'Series4')]

#When you know the ways to access a single element or a row or a column or a block of
#elements, you may also change them by assignment operator just like how you modified
#elements of a vector!
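
The assignments mentioned above can be sketched as follows. The matrix `m` and its row and column names are hypothetical and chosen only to keep the result easy to follow.

```r
m <- matrix(1:12, nrow=3,
            dimnames=list(c('r1', 'r2', 'r3'), c('a', 'b', 'c', 'd')))

m[2, 3] <- 0L                        # a single element by index
m['r1', ] <- c(10L, 20L, 30L, 40L)   # a whole row by name
m[, 'd'] <- m[, 'd'] * 2L            # a whole column
m[m > 20] <- NA                      # all elements selected by a logical mask
print(m)
```

The last line uses a logical matrix as the index, so every element greater than 20 is replaced in one step.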

Matrix operations

You can do math and logical operations on matrix with scalar, vector, and matrix operands. The scalar operations are simple. The operations using the other two have certain rules and limitations. These are explained with examples below:

#scalar operations (applicable to all math and logical operations)
matRandomNum + 2 #adds two to all elements
(matRandomNum / 2) * 1.5 #divides the elements by 2, then multiplies by 1.5

#the following expression first gets the remainder of division (modulus) by 2 on all 
#elements, then compares if each element equals to 0; returns a logical matrix
#based on the match
(matRandomNum %% 2) == 0

matRandomChar != 'a' #checks if each character differs from 'a'; returns logical matrix

# using vector operands with matrices
# the following expression checks whether each element equals any of the letters given
# in the reference vector (here it is the vowels vector); there is a problem here!
# this operation drops the matrix dimension and returns a vector whose length equals
# the total number of elements in the matrix;
# also note that the elements are read columnwise (keep the default byrow=F when you
# rebuild the corresponding matrix using the matrix() function)
matRandomChar %in% c('a','e','i','o','u')

#if numeric operations are specified with a vector of more than one element, then the 
#vector elements are recycled; it starts at the 1st column 1st row, moves down
#the column, then moves to the next column, and goes on until it fills all positions;
matRandomNum + c(4:1) #this operation creates a virtual matrix of the same dimension and uses
#the vector elements to fill the matrix following the rule explained above; the virtual
#matrix will look like this:
vmat <- matrix(c(4:1), nrow=nrow(matRandomNum), ncol=ncol(matRandomNum))
#now, the math operation will use vmat in place of the vector as given below;
matRandomNum + vmat #the elements at corresponding positions will be added;
#you won't see the above steps when you execute matRandomNum + c(4:1); 
#but they explain the underlying process of such operations;
#the above method is applied to any math operation on a matrix with vector operands of any size

#note that, in the above example, the ultimate operation is applied on two matrices
#of the same size (otherwise known as conformable matrices or arrays), and the
#result is a matrix of that size;
#matrix additions, subtractions, multiplications, divisions and other math operations
#work on an element by element basis; the elements at the corresponding positions
#are added, subtracted, and so on.

#these kinds of math operations will fail on non-conformable matrices; for example:
vmat * matrix(rep(1:2, 3), nrow=3) #will fail due to non-conformity, while...
vmat * vmat #will succeed

#the above matrix multiplication is a simple one to one process and uses the regular
#multiplication operator - asterisk (*);
#while talking about matrix operations, we need to know the true 'matrix multiplication' that
#follows special arithmetic rules; in order to specify this true matrix multiplication,
#R uses a special multiplication operator - %*%; the rules and examples are given below:

#Rules of true matrix multiplication:
#(1) The primary rule (conformity rule) is that the number of columns of the first matrix 
#should be equal to the number of rows of the second matrix.

#for instance, the following is a self-multiplication, which will fail according to the 
#above rule; same rule is applicable for any two matrices of different dimensions.
vmat %*% vmat

#however, the above operation will succeed if vmat is a square matrix

#in order to achieve self-multiplication on non-square matrices, you may need to transpose 
#the matrix, as in:
vmat %*% t(vmat)
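
The conformity rule can be verified on small matrices where the arithmetic is easy to follow by hand. The matrices `A` and `B` below are hypothetical examples; `tryCatch()` is used only so that the failing multiplication does not stop the script.

```r
A <- matrix(1:6, nrow=2)   # 2R X 3C
B <- matrix(1:6, nrow=3)   # 3R X 2C

AB <- A %*% B              # (2 X 3) %*% (3 X 2) gives a 2 X 2 matrix
print(AB)                  # first row: 22 49; second row: 28 64

AAt <- A %*% t(A)          # self-multiplication through the transpose
print(AAt)                 # first row: 35 44; second row: 44 56

# A %*% A is non-conformable (3 columns versus 2 rows); capture the error
res <- tryCatch(A %*% A, error=function(e) 'non-conformable')
print(res)
```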

Matrix functions

colSums(matRandomNum, na.rm=T) #Sum on each column
#colSums returns a vector with one value per column (named, if the columns are named)
#na.rm=T makes sure that the missing values are ignored

rowSums(matRandomNum, na.rm=T) #Sum on each row
#rowSums returns a vector with one value per row

colMeans(matRandomNum, na.rm=T) #Average of each col
round(colMeans(matRandomNum, na.rm=T),2) #nesting the above function with round()

rowMeans(matRandomNum, na.rm=T) #Average of each row

cbind(matRandomNum,rowSums(matRandomNum)) #adds a column to a matrix
rbind(matRandomNum, colSums(matRandomNum)) #adds a row to a matrix

t(matRandomNum) #transpose of matrix
solve(matRandomNum[,1:4]) #inverse of a square matrix; throws an error if the matrix is singular
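
Because `matRandomNum` is random, the result of `solve()` differs from run to run. The sketch below uses two small hand-picked matrices (hypothetical names `M` and `singular`) to show an inverse that can be verified, and a singular matrix for which `solve()` raises an error.

```r
M <- matrix(c(2, 1, 1, 3), nrow=2)   # determinant = 2*3 - 1*1 = 5, so invertible
Minv <- solve(M)
print(M %*% Minv)                    # identity matrix, up to rounding error

# the second column is twice the first one, so no inverse exists
singular <- matrix(c(1, 2, 2, 4), nrow=2)
ok <- tryCatch({ solve(singular); TRUE }, error=function(e) FALSE)
print(ok)                            # FALSE
```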

Lists

Lists are like containers of different kinds of objects. Each object can be named. The objects do not lose their type or class just by being part of a list. In a similar way, all variables and objects that you create in the R workspace are stored in the global environment. The names of these objects can be retrieved using the function ls().

Creating lists

aListOfObjects <- ls() #fetches the names of all available objects in the workspace
#note that ls() returns a character vector of names, not a list

#the following creates a list of combination of numeric, character and logical vectors.
lstGCS <- list(CO2=c(400,600), temp=c(20,30), photoperiod=c('auto','default'), 
               humid=c(40.5, 60.0), 
               isWaterFlowRegulated=c(T,F))

#you may create objects separately and then create a list out of them
#first, let us create a few objects
CO2 <- c(400,600)
isWaterFlowRegulated <- c(T,F)
photoperiod <- c('auto','default')
humid <- c(40.5, 60.0)

#the following is a list of all objects created above, including a character string
lstSettings  <- list('default settings', CO2, isWaterFlowRegulated, photoperiod, humid)

#note that the objects in the above list are not named yet; naming is optional;
#the names can be set like this:
names(lstSettings) <- c('title', 'param1', 'param2', 'param3', 'param4')

#a new list can be merged with an existing list, as in:
miscSettings <- list(intensity=c(32000, 120000))
c(miscSettings, lstGCS) #merges the two lists (any number of list can be merged like this)

#if a vector is added to a list using c(), each vector element will be added as an
#independent object to the list; for instance, the following operation will create
#a list of length 4.
c(miscSettings, c(1,2,3)) #the length of list will be 4

#to keep the vector whole, use the list() function
list(miscSettings, c(1,2,3)) #a list of length 2: miscSettings (nested as a list) and the vector
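
The three ways of combining a list with a vector can be compared side by side. In this sketch the names `base`, `flattened`, `nested`, `appended` and `steps` are hypothetical; the third form, wrapping the vector in `list()` before calling `c()`, keeps the vector whole without nesting the original list.

```r
base <- list(intensity=c(32000, 120000))

flattened <- c(base, c(1, 2, 3))              # each number becomes its own element
nested    <- list(base, c(1, 2, 3))           # 'base' is nested as a sub-list
appended  <- c(base, list(steps=c(1, 2, 3)))  # vector kept whole, list stays flat

print(length(flattened))   # 4
print(length(nested))      # 2
print(names(appended))     # "intensity" "steps"
```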

Accessing list elements

lstSettings[[2]]  #second object in the list
lstSettings$param4 #fetch object using a name

lstGCS$photoperiod #extracts photoperiod of its original type i.e. a vector
#whereas, the following will return photoperiod as a list object
lstGCS['photoperiod'] #gets a vector as a list object

#if you want to extract the first element in the vector photoperiod that is part of a list,
#then you might intuitively use the following syntax:
lstGCS['photoperiod'][1]
#however, the result is the same as the previous command (and does not extract the first 
#element in photoperiod)

#unlisting the result gives the expected outcome, i.e. the 1st element
#(note that unlist() names the elements photoperiod1, photoperiod2)
unlist(lstGCS['photoperiod'])[1]
#or, you may use:
lstGCS$photoperiod[1]

#the list objects can be accessed using numeric index
lstGCS[3] #returns photoperiod as a list
lstGCS[[3]] #returns the 3rd object itself (here, a character vector)
lstGCS[[3]][2] #returns the 2nd element of 3rd object if it is a vector
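
The difference between the single and double bracket operators is easiest to see by checking the class of what each returns. The list `settings` below is a hypothetical, shortened stand-in for `lstGCS`.

```r
settings <- list(CO2=c(400, 600), photoperiod=c('auto', 'default'))

asList   <- settings['photoperiod']     # single brackets: a list of length 1
asVector <- settings[['photoperiod']]   # double brackets: the object itself

print(class(asList))     # "list"
print(class(asVector))   # "character"

# element access works only on the extracted vector
print(settings[['photoperiod']][1])     # "auto"
```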

Factors

A factor represents a categorical vector. Its unique values are known as levels. By default, the levels are sorted alphabetically or numerically (depending on the type of vector) in ascending order, and these ordered levels are numbered from 1 to the number of levels. Each element is stored internally as the number of its level. The order of the levels can be changed while creating a factor from a vector.

Creating factors

isAmorphous <- c('Yes', 'No', 'No', 'Yes', 'No') #just a char vector
is.factor(isAmorphous) #checks if the object is a factor; FALSE at this point

isAmorphous <- factor(isAmorphous) #factorizes a vector
# or use
isAmorphous <- as.factor(isAmorphous) # but levels can't be reordered with this

levels(isAmorphous) #returns levels; two levels: 'No' and 'Yes' where 'No'=1, 'Yes' = 2
labels(isAmorphous) #vector of default names given to each element. chr vec: '1','2',...'5'
str(isAmorphous) #structure of vector with factor

factor(isAmorphous, levels=c('Yes','No')) #changes the order of levels
factor(isAmorphous, levels=c('Yes', 'N')) #values without a matching level become NA

isPresent <- c(1, 0, 0, 1, 1, 0, 0, 0, 0, 1) #just a numeric vector
#Warning! Better not to use factoring when the number of levels is unmanageable
isPresent <- factor(isPresent) #factorizes the numeric vector
#in the above case, 0 will be factored as 1, and 1 will be factored as 2!

labels(isPresent) #chr vec of num seq: '1','2',....'10'
levels(isPresent) # 0 and 1
str(isPresent) #structure of the vector with its levels

#we know that isAmorphous is a character vector, hence any math operation does 
#NOT make any sense; but there are occasions - like specifying weights for
#categories, you might need to convert them into numeric equivalent;
#factoring helps you with the conversion, while retaining the character values;

isAmorphous*2 #will give NA results, but 
#the following will use the numeric equivalent as per factor definition, and the
#math operation will succeed
as.integer(isAmorphous)*2

#you would expect,
isPresent * 2 # to succeed because isPresent is a numeric vector, but
#the above will result in NAs due to factoring;

as.integer(isPresent) * 2
#the above operation multiplies 2 with the level numbers (1 and 2)
#rather than with the actual numeric values of the vector (0 and 1)

#The above case might have given you an idea about why factoring of numeric
#vectors with many unique values (levels) might create confusion!

#Note: 
#(1) you have to refactor the factored vector every time you modify the vector
#(2) the subset of a factored vector retains the levels
#(3) if X is a factored vector, as.vector(X) removes factors, but this process
#    always converts X into a character vector irrespective of its original type;
#    similarly, you may use as.xxxx() functions appropriately; see 'Data type checking
#    and conversions' for as.xxxx() functions.
#(4) the same rules are applicable for logical vectors as well
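
When the original numbers of a factored numeric vector are needed again, one common approach is to convert the factor to character first and then to numeric. The sketch below uses a hypothetical factor `readings` and also shows `droplevels()`, a base R function that removes levels left unused after subsetting.

```r
readings <- factor(c(10, 50, 10, 20))   # levels: "10" "20" "50"

codes  <- as.integer(readings)                  # 1 3 1 2 (level numbers)
values <- as.numeric(as.character(readings))    # 10 50 10 20 (original numbers)
print(codes)
print(values)

sub <- readings[readings != 50]
print(levels(sub))               # "10" "20" "50" (unused level retained)
print(levels(droplevels(sub)))   # "10" "20"
```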

Data frames

Data frames are also containers, holding vectors of different types but of the same length. They are equivalent to a 2D matrix in terms of dimension, but unlike matrices, they can accommodate different data types. That means each row may be a collection of different data types, while each column is a collection of values of the same data type only. They are also equivalent to a data table, like an MS Excel sheet containing tabular data.

Creating dataframes

# Create dataframe using vectors

#let us create vectors of same length
colID <- c(1:10)
colSamp <- sample(200:300, 10, replace=F)
colRandFrac <- runif(10)
colLogical <- c(T, F, T, F, T, F, T, F, T, F)
colQual <- c('ace','mod', 'simp', 'ace', 'cool', 'ruf', 'simp', 'ace','mod', 'top')

#the following will create a dataframe of the above vectors as columns
dfTmp <- data.frame(colID, colSamp, colRandFrac, colLogical, colQual,
                    stringsAsFactors=F) #creates df
names(dfTmp); colnames(dfTmp) #variable names as col names
row.names(dfTmp) #default row names

#Warning! if a column is shorter than the longest one, it is recycled to the
#maximum length, but only when the maximum length is an exact multiple of the
#shorter length; otherwise R will throw an error;


# Creating dataframes using matrices

#let us create a matrix of 6R X 5C of numeric type
growthByDay <- matrix(
  c(5.0,4.0,5.0,5.2,6.0,
    5.5,4.0,6.0,5.2,5.5,
    6.0,5.0,9.0,5.2,6.0,
    6.2,5.5,8.0,5.2,5.5,
    6.3,5.6,7.0,5.2,6.0,
    6.3,5.7,4.0,5.2,5.5),
    ncol=5, nrow=6, byrow=T)
row.names(growthByDay) <- 1:6 #set row names
colnames(growthByDay) <- sprintf('exp%d', 1:5) #give column names

class(growthByDay); dim(growthByDay) #checks the class name and dimension of matrix
dfGrowthByDay <- as.data.frame(growthByDay) #converts matrix into dataframe

names(dfGrowthByDay); colnames(dfGrowthByDay) #col names same as that of matrix
row.names(dfGrowthByDay) #row names same as that of matrix


# Create dataframe using lists provided:
#(1) there are only vectors in the list (scalars are actually single values vectors)
#(2) the vectors in the list should have same length or their lengths are adjusted
as.data.frame(lstGCS)

#In the following list, there are vectors that are of lesser length, but the conversion
#automatically adjusts their length to the maximum length of other vectors
lstGCS1 <- list(CO2=400, temp=20, photoperiod=c('auto','default'), humid=c(40.5, 60.0), 
               isWaterFlowRegulated=T) #df of 2 rows; if humid is of length 3 then fails
as.data.frame(lstGCS1) #succeeds;lengths are adjusted using recycling

lstGCS2 <- list(CO2=NA, temp=20, photoperiod=c('auto','default'), humid=c(40.5, 60.0), 
                isWaterFlowRegulated=T)
as.data.frame(lstGCS2) #succeeds with NA; but for NULL this will fail with error

#Note: if a dataframe is present in the list, the list to dataframe conversion will
#succeed only if the lengths of the vectors present in the list and the number of
#rows of the dataframe objects are the same.
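
The length adjustment (recycling) described in this section can be sketched as follows. The column names are hypothetical, and `tryCatch()` is used only to capture the error message instead of stopping the script.

```r
# lengths 4, 2 and 1: both 2 and 1 divide 4, so the short columns are recycled
ok <- data.frame(id=1:4, grp=c('a', 'b'), flag=TRUE)
print(ok)

# lengths 4 and 3: 3 does not divide 4, so data.frame() raises an error
bad <- tryCatch(data.frame(id=1:4, grp=c('a', 'b', 'c')),
                error=function(e) conditionMessage(e))
print(bad)
```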

Accessing dataframe elements

dfTmp[1:5,] #first 5 rows as df
dfTmp[5:10, c('colID', 'colQual')] #last six rows and only two cols as df

dfTmp[ ,"colLogical"] #just one col as vector
dfTmp$colSamp #just one col as vector
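
Beyond index and name based access, rows are often selected by a condition. The sketch below uses a small hypothetical data frame `df1`; `drop=FALSE` and `subset()` are standard base R features, shown here as optional alternatives.

```r
df1 <- data.frame(id=1:5, score=c(12, 7, 30, 18, 25),
                  grade=c('b', 'c', 'a', 'b', 'a'), stringsAsFactors=FALSE)

colVec <- df1[, 'score']               # one column as a numeric vector
colDf  <- df1[, 'score', drop=FALSE]   # the same column kept as a data frame

high  <- df1[df1$score > 15, ]         # rows where score exceeds 15
highA <- subset(df1, score > 15 & grade == 'a', select=c(id, score))
print(high)
print(highA)
```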

Functions of dataframes

dim(dfTmp); nrow(dfTmp); ncol(dfTmp) #gets dimensions

colSums(dfTmp[,1:3]) #on numeric columns only

dfTmp <- edit(dfTmp) #opens the dataframe with an editor

head(dfTmp); tail(dfTmp) #gets first and last few rows of dataframe
#head and tail also work for matrices and vectors

#Note: Treat each column as a vector: which means that all vector functions can be applied

#the cbind and rbind functions can merge two data frames together 
#horizontally or vertically; 
#however, they have constraints;
#for example, to cbind two dataframes, they should have an equal number of
#rows; 
#similarly, to rbind they should have an equal number of columns with matching
#column names; in addition, the data types of the columns being merged should
#be compatible, otherwise the values are coerced to a common type;

#if two dataframes are associated by a common column (known as key column),
#then they can be merged by their key; the key can be a combination of more
#than one columns;

#let us first construct two dataframes that share a key column:
dfX <- data.frame(ID=c(1:3), col=c('red','blue','green'))
dfY <- data.frame(ID=c(1:3), what=c('apple','berries','leaves'))

#to merge the above two dataframes by the column 'ID':
merge(x=dfX, y=dfY, by='ID')

#if key column names are different as in:
dfX <- data.frame(ID=c(1:3), col=c('red','blue','green'))
dfY <- data.frame(key=c(1:3), what=c('apple','berries','leaves'))
#then,
merge(x=dfX, y=dfY, by.x='ID', by.y='key')

#if one of the tables has missing observations, as in:
dfX <- data.frame(ID=c(1:3), col=c('red','blue','green'))
#the observation with ID value 2 is missing
dfY <- data.frame(ID=c(1,3), what=c('apple','leaves'))
#then, only the matching entries will be merged, as in:
merge(x=dfX, y=dfY, by='ID')

#however, you may specify, for which of the dataframes you want
#to have all entries; (where the missing values of the other, 
#if any, will be replaced with NA values), as in:
merge(x=dfX, y=dfY, by='ID', all.x=T, all.y=F)


#you can sort the dataframe by one or more columns, as in:
dfX[order(dfX$ID, decreasing=T), ] #sorts by ID in desc. order

#you can apply an aggregate function on dataframes, as in:
dfA <- data.frame(vals1=1:21, vals2=runif(21), grp=cut(1:21, breaks=3, labels=F))
aggregate(dfA[,c('vals1','vals2')], by=list(dfA$grp), FUN=mean)

#Note1: you may include more than one column in the list of the 'by' parameter
#to group by a combination of columns;
#Note2: you may also specify user defined function or any of the aggregate
#functions instead of 'FUN=mean'.
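
The effect of the `all.x`, `all.y` and `all` arguments of `merge()` can be compared on two small tables that each miss one key. The data frames `left` and `right` below are hypothetical.

```r
left  <- data.frame(ID=c(1, 2, 3), col=c('red', 'blue', 'green'))
right <- data.frame(ID=c(1, 3, 4), what=c('apple', 'leaves', 'plum'))

inner    <- merge(left, right, by='ID')               # only IDs 1 and 3
leftJoin <- merge(left, right, by='ID', all.x=TRUE)   # IDs 1, 2, 3; 'what' is NA for 2
fullJoin <- merge(left, right, by='ID', all=TRUE)     # IDs 1, 2, 3, 4

print(inner)
print(leftJoin)
print(fullJoin)
```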

Creating empty objects by object name functions

a <- c() #NULL

vec <- vector() #logical empty
veca <- vector(mode='integer', length=10) #10 zeros; mode=>class

lst <- list() #list of 0 objects
fs <- factor() #factor with 0 levels

df <- data.frame() #empty df: 0r x 0c
typeof(df) #list
class(df) #data.frame
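
Empty or pre-sized objects are mostly useful as placeholders that are filled later, for example inside a loop. The following sketch, with hypothetical variable names, shows that pattern with `numeric()` and `vector('list', n)`.

```r
squares <- numeric(5)                  # five zeros, to be overwritten
for (i in seq_along(squares)) squares[i] <- i^2
print(squares)                         # 1 4 9 16 25

results <- vector('list', 3)           # a list with three empty (NULL) slots
results[[2]] <- c('a', 'b')            # fill one slot; the others stay NULL
print(length(results))                 # 3

print(length(character(0)))            # 0: an empty character vector
```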

Flow Control

In programming, operations, function calls and expressions often need to run conditionally. Conditions make the linear flow of the program branch out one or more times. Hence, a program with a single starting point might take different paths before it reaches one of its end points. The conditions are like the nodes of branches in the flow of the program. In R programming, the conditional structures covered here are if(), if()..else, if()..else if()..else and the function ifelse().

#The basic control structures

#Let us create the variables a and b
a <- 10; b <- NA

#To verify if the control statements affect the operation on these
#variables, let us print their values
cat(sprintf('a=%0.2f, b=%0.2f\n', a, b))

if(TRUE) a <- 20 #it is as if you are changing a <- 20 directly
#what will happen with the following check?
if(FALSE) b <- 60

cat(sprintf('a=%0.2f, b=%0.2f\n', a, b))

#let us introduce, dichotomous control
if(a <= 10) a <- a * 2 else a <- a / 2 #one line if..else
#note that the operations inside if..else control are assignments

cat(sprintf('a=%0.2f, b=%0.2f\n', a, b))

#we can also make if..else to return values conditionally
b <- a <- if(a <= 10) a * 2 else a / 2 #if..else returns

cat(sprintf('a=%0.2f, b=%0.2f\n', a, b))

#a compact alternative is the ifelse() function, which is also vectorized
#(it tests each element when the condition is a vector)
d <- ifelse(a <= 10, a*2, a/2)
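
The difference between `if..else` and `ifelse()` shows up when the condition involves a vector: `ifelse()` evaluates the test for every element, whereas `if` expects a single TRUE or FALSE. A minimal sketch with a hypothetical vector `vals`:

```r
vals <- c(4, 10, 16, 25)

# ifelse() tests every element and returns a vector of the same length
halvedOrDoubled <- ifelse(vals <= 10, vals * 2, vals / 2)
print(halvedOrDoubled)   # 8 20 8 12.5

# if..else works on one condition at a time, here the first element only
first <- if (vals[1] <= 10) vals[1] * 2 else vals[1] / 2
print(first)             # 8
```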

Within if or else conditions, we might need to use more than one line of code. R supports that with the if(){}.. else{} code block structure.

message <- ''
#the following is a simple one-line code block
if(a == b) {message <- 'a equals b'} else {message <- 'a and b are different'}
print(message)
#or,
if(a == b) {print('a equals b')} else {print('a and b are different')}

#for a more readable if..else block, use the following syntax and indentations
c <- 0
if(message == 'a equals b'){
  c <- sqrt(a)
  print(sprintf('The value of c is: %0.2f', c))
} else {
  print('The value of c is not changed')
}
#note that within {} multiple lines of code can be included

For complex flow control, you may enclose one if..else block within another. This is called nesting of controls. Multiple “nested if’s” are allowed in R.

randomLetters <- sample(letters, 20, replace=T)
vowels <- c('a', 'e', 'i', 'o', 'u')
selectedNumbers <- c(16, 49, 25, 81, 27)

if (sum(randomLetters %in% vowels) > 0) {
   print('At least one alphabet is a vowel')
   if(any(selectedNumbers %% 6 == 0)) {
      print('At least one number is divisible by 6')
   } else {
      print('None of the numbers is divisible by 6')
   }
} else {
   print('No vowels present in the alphabets selected ')
}

The if..else block is not simply dichotomous. It can have multiple levels of checks in between.

#what will be the output with the following control structure?
habitat <- 'exosphere'
if (habitat == 'water'){
  print('Is it fresh or salt?')
} else if (habitat == 'land') {
  print('Is it elevated or plain?')
} else if (habitat == 'air') {
  print('Does it float or fly?')
} else {
  print('Is it alien?')
}
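
When every branch compares the same variable against a fixed string, base R's `switch()` is a possible alternative to a long else-if chain. The sketch below wraps it in a hypothetical helper function `describeHabitat()`; the unnamed last argument acts as the default branch.

```r
describeHabitat <- function(habitat) {
  switch(habitat,
         water = 'Is it fresh or salt?',
         land  = 'Is it elevated or plain?',
         air   = 'Does it float or fly?',
         'Is it alien?')   # default when no name matches
}

print(describeHabitat('land'))        # "Is it elevated or plain?"
print(describeHabitat('exosphere'))   # "Is it alien?"
```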

Looping structures

Sometimes, you might need to repeat a particular operation for a finite or infinite number of times. Such repetitions are called loops. R has two commonly used looping structures, namely the “for” loop and the “while” loop (a third one, “repeat”, is not covered here). The former is a finite loop by its definition, whereas the while loop can be written as an infinite loop. However, a while loop can be modified to become finite. When the loop is finite, you may want to have control for an early conditional exit. Similarly, you may want to have a conditional exit in infinite loops. The “break” command supports such an exit in R. It can be made conditional by wrapping it in if..else controls.

#a simple for loop
for(i in 1:10) print('hello world!') #prints hello world 10 times

#you can make use of the iterator variable (i) in the for expression
for(i in 1:10) print(i*2)

#the sequence in for loop can be any vector, as in:
for(thisname in c('Aconite','Aster','Aubrieta')) print(thisname)

#the vector can be passed as a variable
diseases <- c('Malaria','Typhoid','HIV')
for(disease in diseases){
   cat('The causal agent is a ')
   if(disease=='Malaria'){
      cat(sprintf('Protozoan\n'))
   } else if(disease=='Typhoid'){
      cat(sprintf('Gram negative bacterium\n'))
   } else {
      cat(sprintf('Virus\n'))
   }
}
#also, note that we enclosed a block of code to be executed in the for loop
#within {}

The for loop can be used on container objects (matrices, lists and data frames) wherever it makes sense, as in:

#let us loop through the rows of the matrix 'growthByDay'
for (r in 1:nrow(growthByDay)){
   print(mean(growthByDay[r,])) #prints mean value of each row vector
}
#Similarly, you can loop through column vectors
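
Looping through the columns follows the same pattern. The sketch below uses a small made-up matrix (demoGrowth, a hypothetical stand-in for 'growthByDay') so that it can be run on its own. seq_len() is used in place of 1:ncol() because it yields an empty sequence, rather than c(1, 0), when there are no columns.

```r
#a made-up 3 x 2 matrix, used only for illustration
demoGrowth <- matrix(c(1, 2, 3, 4, 5, 6), nrow=3,
                     dimnames=list(NULL, c('day1','day2')))
colMeansByLoop <- numeric(ncol(demoGrowth)) #to collect the results
for (j in seq_len(ncol(demoGrowth))){
   colMeansByLoop[j] <- mean(demoGrowth[, j]) #mean of each column vector
}
print(colMeansByLoop)
```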

The for loop can be nested, as in:

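#let us loop through each cell of the matrix, using a nested loop;
#the values of each column are printed on one line, separated by tabs
for (j in 1:ncol(growthByDay)){
  #column name of the matrix (NULL if the matrix has no column names);
  #'j' is used as the iterator so that the function c() is not masked
  print(colnames(growthByDay)[j])
  for (r in 1:nrow(growthByDay)){
    if (r < nrow(growthByDay)){
      cat(sprintf('%0.2f\t', growthByDay[r,j]))
    } else {
      cat(sprintf('%0.2f\n', growthByDay[r,j]))
    }
  }
}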

The for loop works on list objects as well, as in:

i <- 0
for (item in lstGCS) {
  i <- i + 1
  #cat(sprintf('List%d:\t', i))
  #or to start with the name of the list item
  cat(sprintf('%s:\t', names(lstGCS)[i]))
  if (length(item) > 1){
    for (itemval in item) cat(sprintf('%s\t', itemval))
    cat('\n')
  } else {
    print(item)
  }
}

An early exit from a for loop can be achieved using a conditional break, as in:

cnt <- 0
for (i in 1:1000){
   Sys.sleep(0.01) #to pause for 0.01s
   cat(sprintf('\r%d', i)) #to show the trial number on the same line
   samp <- sample(1:20, 100, replace=T) #sample of 100 values between 1 and 20
   #if mean of this random sample equals 10 print message and exit
   if (round(mean(samp),2) == 10) {
      cat(sprintf('\nIt took %d trials to arrive at mean of 10.\n', i))
      cnt <- i
      break
   }
   cnt <- cnt + 1
}

To write an infinite loop, the while loop can be used.

Note 1: press Ctrl+C (in a terminal) or Esc (in the R GUI or R Studio) to exit forcefully from an infinite loop.

Note 2: if you are in R Studio, you can also click on the Stop sign (red filled octagon) at the top right corner of the Console pane.

i <- 0
while(TRUE) {
   Sys.sleep(0.01)
   cat(sprintf('\r%d', i))
   i <- i + 1
   #if(i == 200) break
}
#Note: Uncomment the "#if(i == 200) break" in the above to exit conditionally

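#The while loop is finite when a condition is given to while() and the
#loop body changes the variable being tested, as in:
j <- 1
while(j <= 100){
   print(j)
   j <- j + 1 #without this increment the condition never becomes FALSE
   #the following early exit can also be used
   #if(j*5==25) break
}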
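
Besides break, R has the “next” command, which skips the rest of the current iteration and moves on to the next one, and the “repeat” loop, which has no condition of its own and therefore always needs a break. A minimal sketch of both (the variable names are illustrative):

```r
#collect the odd numbers between 1 and 10 using 'next'
odds <- c()
for (k in 1:10){
   if (k %% 2 == 0) next #skip the even numbers
   odds <- c(odds, k)
}
print(odds)

#'repeat' runs until break is reached: keep doubling until the value
#exceeds 100
val <- 1
repeat {
   val <- val * 2
   if (val > 100) break
}
print(val)
```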

R also has a family of functions, such as apply, sapply, and lapply, that apply a function to each element of a list or vector (or to each row or column of a matrix) without an explicit loop.

#the following sapply() applies the exp() function to each element of the
#sequence 1,2,3,..,10, and returns a vector of the values e^1, e^2, .., e^10.
sapply(1:10, FUN=function(x) exp(x))

#if 'sapply' is replaced with 'lapply', the return values are the same, but
#the values are returned as a list, as in:
lapply(1:10, FUN=function(x) exp(x))
#you need to wrap the above command with unlist() to convert the returned 
#list into a vector

#however, lapply may be useful to loop through column vectors of a dataframe,
#as in:
lapply(dfTmp, FUN=function(x) table(x))
#though, the frequency function (table) works for most kinds of data types, 
#it may not make sense to use it on continuous numeric data and discrete
#numeric vector containing numerous unique values

#Unlike lapply, the 'apply()' has a provision to choose the margin of the 
#matrix or a dataframe on which the function can be applied;
#the margins are either row vectors or column vectors;
apply(growthByDay, 1, FUN=mean) #margin 1 denotes row vectors of the matrix
#the above is equivalent of the following:
rowMeans(growthByDay)

#similarly, margin 2 denotes the column vectors of the matrix
apply(growthByDay, 2, FUN=mean)
#which is equivalent to the following:
colMeans(growthByDay)


#Note that, in all the above apply functions, the vector or list passed 
#can be of any type; only constraint is that an appropriate function 
#needs to be applied based on the data type;

#Also note that, the function applied can be any built-in function in R or
#any user-defined function
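
Two related points may help when choosing among these functions. Many built-in functions, including exp(), are already vectorized, so they can be called on the whole vector directly. And vapply() is a stricter variant of sapply() in which the expected type and length of each result are declared. A short sketch:

```r
x <- 1:10
viaSapply <- sapply(x, FUN=function(v) exp(v))
viaVector <- exp(x) #exp() is vectorized; no loop or apply needed
#vapply() checks that every result is a single numeric value
viaVapply <- vapply(x, FUN=function(v) exp(v), FUN.VALUE=numeric(1))
print(all.equal(viaSapply, viaVector))
print(all.equal(viaSapply, viaVapply))
```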

Data inputs

R has several ways to accept input from the user or external files and output the data to console or files. Some of them are exemplified below.

#to read a line of text typed by the user (ended with the Enter key):
keyIn <- readline()

#the following is a user-defined function that accepts keyboard input
#and prints the numeric (byte) value of each character entered;
#it keeps asking for input until Enter is pressed on an empty line
getValOfChar <- function(){
  keyIn <- readline(prompt='Enter a character (press Enter alone to stop): ')
  
  #readline() returns an empty string, never NA, when nothing is typed
  if (!nzchar(keyIn)) {
    return(invisible(NULL))
  } else {
    keyVal <- as.integer(charToRaw(keyIn))
    cat(sprintf('%s: %s\n', keyIn, paste(keyVal, collapse=' ')))
    getValOfChar()
  }
}
#make a call to the function:
getValOfChar()

Instead of reading one line at a time, the scan() function allows the user to input a definite number of values; the data type can be specified.

#prompt the user
print('Enter five comma separated tree girths (an empty line ends the input)')

#accept the values from the user (comma delimited);
#what=double() makes scan() read the values as numbers
treeGirths <- scan(what=double(), nmax=5, sep=',')

#do whatever with the values read
cat(sprintf('Average±SD tree girth (m): %0.2f ± %0.2f', 
              mean(treeGirths), sd(treeGirths)))
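
To try scan() without typing at the console, the same call can be given its input through the text argument. This is a sketch with made-up values:

```r
#scan() reads from the string given in 'text' instead of the keyboard
demoGirths <- scan(text='1.2,1.5,1.1,1.8,1.4,2.0', what=double(),
                   nmax=5, sep=',', quiet=TRUE)
print(demoGirths) #only the first five values are read
cat(sprintf('Mean: %0.2f, SD: %0.2f\n', mean(demoGirths), sd(demoGirths)))
```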

The scan() function can also be modified to read input from a flat file. For instance, the following is the content of treeGirth.txt file:

12.4,13.2,14.8,12.7,14.1,14.2,15.0

The values are of double type and separated by a comma delimiter. Let us say we want to scan the first five values from the file. Using scan, this can be achieved, as in:

#open the connection to the file in read-only mode
fconTreeGirth <- file('treeGirth.txt', 'r')

#scan the first five values as doubles; what=double() sets the type
treeGirths <- scan(file=fconTreeGirth, what=double(), nmax=5, sep=',')

#do whatever you want to do with the values extracted
cat(sprintf('Average±SD tree girth (m): %0.2f ± %0.2f', 
            mean(treeGirths), sd(treeGirths)))

#it is mandatory to close the connection to the file
close(fconTreeGirth)
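
The same steps can be tried without an existing treeGirth.txt by writing the values to a temporary file first. The sketch below is self-contained; tempfile() supplies a throwaway file name.

```r
#write the example values to a temporary file
tmpFile <- tempfile(fileext='.txt')
writeLines('12.4,13.2,14.8,12.7,14.1,14.2,15.0', tmpFile)

fcon <- file(tmpFile, 'r') #open in read-only mode
firstFive <- scan(file=fcon, what=double(), nmax=5, sep=',', quiet=TRUE)
close(fcon) #close the connection
unlink(tmpFile) #delete the temporary file

print(firstFive)
print(mean(firstFive))
```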

You may also want to read flat files line by line. The “readLines” function (note the capital letter L in the function name) can be used for this purpose. The content of the file ‘Selfish Gene-Richard Dawkins.txt’ is:

The gene-centred view of evolution that Dawkins championed and crystallized is now central both to evolutionary theorizing and to lay commentaries on natural history such as wildlife documentaries.

A bird or a bee risks its life and health to bring its offspring into the world not to help itself, and certainly not to help its species — the prevailing, lazy thinking of the 1960s, even among luminaries of evolution such as Julian Huxley and Konrad Lorenz — but (unconsciously) so that its genes go on.

Genes that cause birds and bees to breed survive at the expense of other genes.

No other explanation makes sense, although some insist that there are other ways to tell the story.

#first open the connection to the file
fconSelfishGene <- file('Selfish Gene-Richard Dawkins.txt', 'r')

linenumber <- 0 #initialize the line number
#read one line at a time; readLines() returns a zero-length vector at the
#end of the file, which ends the loop
while(length(thisLine <- readLines(fconSelfishGene, n=1)) > 0){
   linenumber <- linenumber+1 #increment the line number
   
   #print number of characters in each line
   cat(sprintf('Number of chars in line %d: %d\n', 
               linenumber, nchar(thisLine)))
}
close(fconSelfishGene) #don't forget to close the connection

#Note that length() of an empty line is 1 (it is a character vector holding
#one empty string), whereas nchar() returns the number of characters in
#the line, which is 0 for an empty line.

#Also note that, once the end of file is reached, you can not point back
#to the start of the file with the same connection; for which, you would
#need to close the connection and open it again afresh;

#you can also read the whole content of the file in one shot, as in:
fconSelfishGene <- file('Selfish Gene-Richard Dawkins.txt', 'r')
excerpts <- readLines(fconSelfishGene) #all lines are read
print(excerpts[c(1,3)]) #prints only the 1st and 3rd lines
close(fconSelfishGene)
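
Once all lines are in a character vector, functions such as nchar() can be applied to every line at once, without a loop. The sketch below uses a temporary file with made-up content:

```r
tmpFile <- tempfile(fileext='.txt')
writeLines(c('first line', '', 'third line is longer'), tmpFile)

#a file name can be given in place of a connection
allLines <- readLines(tmpFile)
charsPerLine <- nchar(allLines) #number of characters in each line
nonEmpty <- allLines[nzchar(allLines)] #drop the empty lines
unlink(tmpFile)

print(charsPerLine)
print(nonEmpty)
```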

Many data files are text files in which the data are arranged as a matrix or a table. Within each row, the values are delimited by a comma, a space, a tab or some other special character. The rows are delimited by a new line or carriage return (not visible). Such tables may or may not have a header row with column names.

In order to read these kinds of files into R, the read.table() or read.csv() function is used. The following is the content of the space-delimited data file delimited.txt:

ID exp1 exp2 exp3 exp4 exp5 qual
1 5 4 5 5.2 6 AA
2 5.5 4 6 5.2 5.5 C
3 6 5 9 5.2 6 C
4 6.2 5.5 8 5.2 5.5 A
5 6.3 5.6 7 5.2 6 A
6 6.3 5.7 4 5.2 5.5 B

The read.csv() function reads the content of the file into an R data.frame object, as in:

#sep is one space character, header is present, single quote was used 
#for string values in the rows;
myTable <- read.csv(file='Main Sessions/delimited.csv', sep=' ', 
  header=T, stringsAsFactors=F, quote="\'")
print(myTable)

#By default, read.csv() assumes double quotes around string values; also,
#if any value in a column is non-numeric, the whole column is read as a
#character type column;

#if stringsAsFactors is set to TRUE, then read.csv() converts the character
#type columns into factors (this was the default before R 4.0.0; since
#then the default is FALSE)

#Note that there is no need to open a connection and close it.
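
read.table() is the more general function, and read.csv() is a wrapper around it with comma-oriented defaults. As a sketch, the same table can be read with read.table(); here the content is supplied through the text argument so that the example runs without the file (the names tableText and demoTable are illustrative).

```r
tableText <- 'ID exp1 exp2 exp3 exp4 exp5 qual
1 5 4 5 5.2 6 AA
2 5.5 4 6 5.2 5.5 C
3 6 5 9 5.2 6 C
4 6.2 5.5 8 5.2 5.5 A
5 6.3 5.6 7 5.2 6 A
6 6.3 5.7 4 5.2 5.5 B'
#the default separator of read.table() is any white space
demoTable <- read.table(text=tableText, header=TRUE, stringsAsFactors=FALSE)
str(demoTable) #shows the type detected for each column
```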

Data output

Just as it has data input functions, R also has data output functions, which are used mainly to write content into files for storing results.

The sink() function allows the console outputs to be redirected to a file, as in:

#open the connection to a file
sink('analyticalresults.txt', append=F, split=F)

#'append' is set to FALSE to create a new file (or to overwrite an
#existing one); set it to TRUE to add the new content to an existing file;

#when 'split' is set to FALSE, you won't see any console output; all
#the outputs are sent only to the file;

#as sink() has opened a connection to the file earlier, the outputs 
#(matrices) of the loop are sent to file;
for(i in 1:10){
  print(growthByDay * i)
}

sink() #this is required to stop the redirection and to close the file
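
The following self-contained sketch redirects output to a temporary file and reads it back to confirm what was written. As an alternative for small amounts of output, capture.output() returns the printed lines as a character vector without touching a file.

```r
tmpFile <- tempfile(fileext='.txt')
sink(tmpFile) #start redirecting the console output to the file
print(1:3)
cat('done\n')
sink() #stop redirecting
written <- readLines(tmpFile)
unlink(tmpFile)
print(written)

captured <- capture.output(print(1:3)) #no file involved
print(captured)
```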

The complementary function of read.csv() is write.csv(), which writes a data frame object into a comma-separated text file. To use a different delimiter, use write.table() with its sep argument.

newfile <- 'myTable.csv'
write.csv(dfTmp, file=newfile, row.names=F)

#the default delimiter of values in a row is comma;
#when row.names is set to FALSE, the row IDs of the matrix or data.frame
#will be ignored while writing the content into the file
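
A quick way to check what was written is to read the file back. The sketch below uses a small made-up data frame (demoDf) and temporary files, and also shows write.table() with a tab delimiter:

```r
demoDf <- data.frame(ID=1:3, qual=c('A','B','C'), stringsAsFactors=FALSE)

csvFile <- tempfile(fileext='.csv')
write.csv(demoDf, file=csvFile, row.names=FALSE)
readBack <- read.csv(csvFile, stringsAsFactors=FALSE)
print(all.equal(demoDf, readBack))

#write.table() lets you choose the delimiter, here a tab
tsvFile <- tempfile(fileext='.tsv')
write.table(demoDf, file=tsvFile, sep='\t', row.names=FALSE, quote=FALSE)
tsvLines <- readLines(tsvFile)
print(tsvLines)
unlink(c(csvFile, tsvFile))
```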

Functions (miscellaneous)

Functions of files and directories

getwd() #gets the current working directory of the R session
setwd(getwd()) #sets the working dir

newdir <- 'myFolder/subFolder'
dir.exists(newdir) #checks if the specified folder exists
#if absolute path is not given, it checks in the working dir

dir.create(newdir, recursive=T)  #creates the dir specified
#recursive=T when parent and sub-folders need to be created

file.exists('treeGirth.txt') #checks if the file is present in the current
#working directory

#the following lists all the folders in the path given
list.dirs(path=getwd(), full.names=T, recursive=F)
#if recursive=T, then all subfolders will also be listed

#Similarly, the following lists all files matching a given pattern; the
#pattern is a regular expression, so '\\.R$' matches names ending in '.R'
myfiles <- list.files(path=sprintf('%s/Main Sessions',getwd()), 
                      pattern='\\.R$', full.names=F)

#The function unlink(<dir>, recursive=TRUE) will delete a folder along with
#its contents; similarly, unlink(<file>) or file.remove(<file>) will delete
#the specified file;
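
These functions can be tried safely inside R's temporary directory. The folder and file names in the sketch below are made up for illustration:

```r
baseDir <- file.path(tempdir(), 'myFolder')
subDir <- file.path(baseDir, 'subFolder')
dir.create(subDir, recursive=TRUE) #creates myFolder and subFolder
file.create(file.path(baseDir, c('analysis.R', 'notes.txt', 'plots.R')))

#'\\.R$' is a regular expression: a literal dot, then R, at the end
#of the file name
rFiles <- list.files(path=baseDir, pattern='\\.R$')
print(rFiles)

unlink(baseDir, recursive=TRUE) #deletes the folder and everything in it
print(dir.exists(baseDir))
```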

Some user defined functions

#mode function to obtain the most frequently observed value in a vector
#is not available in base R; let us write a function for it:
myMode <- function(numVec=NULL){
   ft <- table(numVec) #a table of values vs frequencies
   maxFreq <- max(ft) #gets the largest frequency
   #to extract all entries in the table that share the max freq
   allEntriesWithMaxFreq <- ft[ft==maxFreq]
   #because 'ft' is a named vector, where the names are the unique
   #values for which the freq count were obtained; hence, the names that
   #share the maximum freq are the mode values
   modes <- names(allEntriesWithMaxFreq)
   #because the names are of character type, we need to convert them to
   #numeric format
   modes <- as.numeric(modes)
   #finally, if the result is multi-modal (more than one mode), then we
   #take the average of all modal values
   return( mean(modes))
}

#the above function can be condensed into two lines of code by nesting
#function calls, as in:
myMode <- function(numVec=NULL){
   ft <- table(numVec)
   return (mean(as.numeric(names(ft[ft==max(ft)]))))
}

#to use it:
myMode(randomNumbers)
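
Averaging the modal values can return a number that never occurs in the data. If that is not wanted, a variant can return every mode instead. The following is a sketch; the names allModes and demoVec are hypothetical.

```r
allModes <- function(numVec){
   ft <- table(numVec) #a table of values vs frequencies
   as.numeric(names(ft[ft == max(ft)])) #all values sharing the max freq
}
demoVec <- c(1, 2, 2, 3, 3)
print(allModes(demoVec)) #both 2 and 3 occur twice
print(mean(allModes(demoVec))) #the averaged value, as myMode() computes it
```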


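#let us define a function to get the difference between two dates in months
#(approximated here as whole blocks of 30 days)
diffMonths <- function(d1=NA, d2=NA){
   validClasses <- c('Date','POSIXct')
   if(!(inherits(d1, validClasses) && inherits(d2, validClasses))){
      print('The date values should be either Date object or POSIXct object!')
      return(NULL)
   } else {
      diffm <- as.numeric(difftime(d2, d1, units='days')) %/% 30L
      return(diffm)
   }
}
#to use it:
date1 <- as.Date('01 Jan 2010', '%d %b %Y')
#for as.POSIXct() the format must be passed by name, because its second
#positional argument is the time zone (tz); tz='UTC' keeps the comparison
#with the Date value free of time zone offsets
date2 <- as.POSIXct('01 Jan 2010', format='%d %b %Y', tz='UTC')
diffMonths(date1, date2)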
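
Dividing the number of days by 30 is only an approximation, since months have 28 to 31 days. A sketch of an alternative that counts calendar months from the year and month fields is given below. The function name diffCalendarMonths is hypothetical, and the function deliberately ignores the day of the month.

```r
diffCalendarMonths <- function(d1, d2){
   lt1 <- as.POSIXlt(d1) #exposes the fields $year and $mon (0 to 11)
   lt2 <- as.POSIXlt(d2)
   (lt2$year - lt1$year) * 12 + (lt2$mon - lt1$mon)
}
print(diffCalendarMonths(as.Date('2010-01-15'), as.Date('2011-03-01')))
```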
