Copyright Licence: CC BY 4.0

To know step by step guide on downloading and installation of R software click on the link Downloading and Installing R and for RStudio click here on the link Downloading and Installing RStudio.

Declaration:This manual was prepared on Ubuntu 20.4 LTS with RStudio 2022.07.0+548 “Spotted Wakerobin” and R version 4.2.1 (2022-06-23) -- “Funny-Looking Kid”.

You can cite this training manual as following:

Vani, G.K. & Mishra, P. (2022). Training Manual on Programming with R.Jawaharlal Nehru Krishi Vishwa Vidyalaya, Jabalpur (India).

1 Introduction

The R is a programming language which is an interpreted language. This means that unlike Fortran language here code need not be compiled before its output is visible. It is officially runs in R console and can be used in its officially launched IDE (Integrated Development Environment) named RStudio . Code can be directly written and executed in the console and output appears below it. In case a script file is made in console, the code syntax can be executed by placing cursor on the required line and using Ctrl+R. In case of RStudio IDE, Ctrl+Enter after placing cursor on the code syntax to execute it in console window.

A view of R console window in ubuntu operating system

A view of RStudio IDE

It is a case sensitive one. Any thing that works in R is a function and what doesn’t work is an object. An object in R stores the information. A function takes input from the user 3 and provides output. The input provided to the function is called as arguments of the function. A function at time may require one or more arguments. R does not allow two objects with same name. If two objects have name age then the one created later will replace the former. R has some reserved words which are listed below:

  • if

  • else

  • repeat

  • while

  • function

  • for

  • next

  • break

  • TRUE

  • FALSE

  • NULL

  • Inf

  • NaN

  • NA

  • NA_Integer

  • NA_real

  • NA_complex

  • NA_character

    User of this software should not make/keep name of any object create exactly as or after reserved words. Reserved words can be used as non-syntatic names insed the backtick quotes, ``. But reserved words are not allowed as syntatic names.

Anything written after # will be treated as comment not as command. Comments can be inserted after command. The rmarkdown documents provide output of code chunk after ## . This document is also prepared with the help of rmarkdown . In R language, small brackets () are reserved for functions and big brackets [] and [[ ]] are reserved for indexing. Indexing involves accessing the elements from withing the objects. Each time brackets whether small, curly or big brackets are used it has to be closed appropriately otherwise execute command would be incomplete till all brackets are closed in the required manner and software will show a + plus sign instead of prompt character, > , in the beginning of syntax in console.

In R language, assignment operator <- or <<- is used to assign a value or a functional expression to an object. It consists of single or double less than < character and a minus sign - , both the character are written strictly one after the other. In most of the context = sign is also used in place of assignment operator and will work equally well. Instead of assigment operator <- or <<- and = , we can also use assign function to assign a value or functional expression to an object.

x<-10
x
## [1] 10
#OR
x<<-10
x
## [1] 10
y=10
y
## [1] 10
assign("z",10)
z
## [1] 10

The assignment of value(s) to an object can be made with reverse assignment operator also.

10->k
k
## [1] 10
#OR
10->>k
k
## [1] 10

If an command provides values to the console as output, i.e. printed to the console, but is not stored in some object then it is lost and would be no available further.

To simplify code structure, R has pipe operator, %>%, available from dplyr package. An example is given below.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
x %>% +y+z # Add x to y+z
## [1] 30

Unlike other software where semicolon (;)or comma (,)is required to separate two lines of code, R doesn’t necessitates use of ; or ,. But, two different syntax can be written in same line and can be separated by semicolon. An example is provided below.

x<-10; y<-20
x;y
## [1] 10
## [1] 20

Remember that to find all the objects created in local Environment use ls function and to remove any object use rm function. To clear the clutter console use Ctrl+l keys.

ls()
rm(name of the object)

For example to remove object x following code is executed in console.

rm(x)

2 Data Structure

Data structure refers to the specific ways of arranging the data so that it can be efficiently accessed and updated. The data in R is stored in different data structures.

  • Vector (Atomic vector),

  • Matrix,

  • Data frame,

  • Factor,

  • List,

  • Array

    are data structure in R. Although table is a class defined in R but it is not counted here as data structure.

2.1 Vector

Vector is the most simple type of data structure in R. It is the lowest level of data structure in R. It is an one dimensional object. It is a linear data structure4. A vector in R is similar to the vector in mathematics. Vector records data entries only in single mode (homogeneous data structure), either as numbers (real, integer or complex) or logical (true/false) or character (text). Vector in R is of following types or modes:

  1. Numeric (single/double)

  2. Integer

  3. Logical

  4. Character

  5. Complex

2.1.1 Creating a vector

A vector can be formed by using either vector function or the function call c(). Here, c stands for combine or concatenate. The function combines/concatenates different arguments provided to it as different elements of the vector. The function vector creates a null vector which contains empty space(s) or zero(s) or FALSE or a string of FALSE . The arguments to this function involves mode and length.

It can be used as following

vec_numeric=vector(mode="numeric",length = 5) #To create numeric vector
vec_numeric
## [1] 0 0 0 0 0
vec_logical = vector(mode="logical",length = 5)
vec_logical
## [1] FALSE FALSE FALSE FALSE FALSE
vec_character = vector(mode="character",length = 5)
vec_character
## [1] "" "" "" "" ""
vec_integer  =  vector(mode="integer",length = 5)
vec_integer
## [1] 0 0 0 0 0
vec_complex  = vector(mode="complex",length = 5)
vec_complex
## [1] 0+0i 0+0i 0+0i 0+0i 0+0i
# you can check whether an object is vector or not by using is.vector function which produces true or false logical output
is.vector(vec_numeric)
## [1] TRUE
is.vector(vec_character)
## [1] TRUE
is.vector(vec_integer)
## [1] TRUE
is.vector(vec_complex)
## [1] TRUE
is.vector(vec_logical)
## [1] TRUE

In place of numeric, we can use logical, complex, integer and character . Details on each of these modes of vector are provided below.

2.1.2 Numeric Vector

Numeric vector records real numbers, both positive as well as negative. Currently R has adopted double-precision data types system. Hence when we check the numeric vector for its type then R returns “double”. There is no vector with single type.

To create a numeric vector which contains elements 1, 2 and 3 we give following command to R.

c(1,2,3)
## [1] 1 2 3

To save this numeric vector as an object in R by name xyx

xyx<-c(1,2,3)

To print the information contained within the vector just type the name of the vector

xyx
## [1] 1 2 3

In a complex way this could be done by using print command as well

print(xyx)
## [1] 1 2 3

Now let’s create a vector of age of household member and give this vector our own name.

gourav<-c(32,31,1.5)
gourav
## [1] 32.0 31.0  1.5

If we want to name the gourav vector by name age then a new object by name age can be created without altering the object gourav

age=gourav
age
## [1] 32.0 31.0  1.5

To check the class of the vector created

class(age)
## [1] "numeric"

To check the mode of the vector

mode(age)
## [1] "numeric"

To check type of vector

typeof(age)
## [1] "double"

To find the number of elements in the vector, i.e., length of vector

length(age)
## [1] 3

suppose if we want to add more data to vector age then to add the fourth element to vector age

age[4]=1
age
## [1] 32.0 31.0  1.5  1.0

The above output shows how to add one additional data point to the vector. To add or append multiple elements to age vector

age[c(5,6)]=c(0.5,0.4)
age
## [1] 32.0 31.0  1.5  1.0  0.5  0.4

Instead users can also use append function (in base package) to append values to a vector.

append(x, values, after = length(x))

where X is the vector in which to append, values means values to be appended and after is the parameter to set position where to append the values.

Suppose for a vector having values 1,4,5,7,2, and 8, if one wants to add 0 and 9 after 3rd position then append function will work as demonstrated below.

append(c(1,4,5,7,2,8),c(0,9),after=3)
## [1] 1 4 5 0 9 7 2 8
append(c(1,4,5,7,2,8),c(0,9)) ## default position is to the end of original vector
## [1] 1 4 5 7 2 8 0 9

What if we want to replace the value of 6th element in age vector from 0.4 to 0.2

age[6]=0.2

What if we want to remove the sixth element from vector

age=age[-6] ##after remvoing 6th element it again saves the output by name age to update the existing vector
##Let's print age vector again to see if 6th element is removed
age
## [1] 32.0 31.0  1.5  1.0  0.5
##check vector length
length(age)
## [1] 5

To remove multiple elements at once, lets say first and third element,

age=age[-c(1,3)]
age
## [1] 31.0  1.0  0.5

So far we have added multiple elements to vector age but it adds them towards the end of the vector. To add elements to the beginning of the existing vector we create another vector as shown below

age=c(35,23,age,47, 56) ## 35 and 23 are added to the beginning while 47 and 56 are added to the end of vector age
age
## [1] 35.0 23.0 31.0  1.0  0.5 47.0 56.0

Lets say we want to slice the vector from element numbered 3 to 7

# R has a beautiful way to create natural number sequence with gap of 1
# To create sequence from 3 to 7
3:7
## [1] 3 4 5 6 7
age[3:7]
## [1] 31.0  1.0  0.5 47.0 56.0

To add a number (lets say 55) in between element 3 and 4, we have to create a new vector wherein we slice original vector two times, once from 1 to 3 and another time from 4 to 7.

age= c(age[1:3],34,age[4:7])
age
## [1] 35.0 23.0 31.0 34.0  1.0  0.5 47.0 56.0

To create a natural number sequence from 1 to 100

onetohundred=1:100
class(onetohundred)
## [1] "integer"
onetohundred
##   [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
##  [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
##  [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
##  [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
##  [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
##  [91]  91  92  93  94  95  96  97  98  99 100

To see the structure of the object

str(age)
##  num [1:8] 35 23 31 34 1 0.5 47 56

num [1:8] shows that the object age is of class numeric vector and has vector elements numbered from 1 to 8.

The metadata of any R object can be found by using attributes function. Attributes refers to the properties associated with any object in R.

attributes(age)
## NULL

Since we have not set any attribute for age vector, that’s why it is showing NULL.

Attributes includes name of each element or row names and column names, dimension of the object, class etc. In R, all objects can have their own arbitrary attributes which can store metadata about the object. It is like a named list with unique names.

#create new vector with attributes
age1=c("GKV"=32,"AV"=31,"NV"=1.5) 
age1
##  GKV   AV   NV 
## 32.0 31.0  1.5
attributes(age1)
## $names
## [1] "GKV" "AV"  "NV"
# or if you want to set attributes to an existing vector 
names(age)<-c("A","B","C","D","E","F","G","H")
age
##    A    B    C    D    E    F    G    H 
## 35.0 23.0 31.0 34.0  1.0  0.5 47.0 56.0
attributes(age)
## $names
## [1] "A" "B" "C" "D" "E" "F" "G" "H"

2.1.3 Integer Vector

Integer are numbers that does not take numbers after decimal. The difference between two successive integer is always one. Integer numbers in R are always written with L .

myfirstintegervec=c(1L,2L,3L)
myfirstintegervec
## [1] 1 2 3
class(myfirstintegervec)
## [1] "integer"
str(myfirstintegervec)
##  int [1:3] 1 2 3
typeof(myfirstintegervec)
## [1] "integer"

2.1.4 Logical Vector

Logical vector stores information in true and false pattern. It can be a result of comparison or any logical operation of type ,

  • > (GREATER THAN) ,

  • < (LESS THAN),

  • == (EQUAL TO for comparison purposes not as assignment operator),

  • >= (GREATER THAN EQUAL TO),

  • <= (LESS THAN EQUAL TO),

  • & (LOGICAL AND),

  • | (LOGICAL OR),

  • ! (LOGICAL NOT),

  • != (NOT EQUAL TO),

  • && (LOGICAL AND FOR SINGLE STATMENT) and

  • || (LOGICAL OR FOR SINGLE STATEMENT) .

myfirstlogical=c(T,F,F,T)
myfirstlogical
## [1]  TRUE FALSE FALSE  TRUE
class(myfirstlogical)
## [1] "logical"
mode(myfirstlogical)
## [1] "logical"
typeof(myfirstlogical)
## [1] "logical"
logicalvec=c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
logicalvec
## [1]  TRUE FALSE  TRUE FALSE  TRUE  TRUE
class(logicalvec)
## [1] "logical"
mode(logicalvec)
## [1] "logical"
typeof(logicalvec)
## [1] "logical"

As said before, logical vector can arise from logical operations, an example is provided below.

age>10 # logical operation to check for age greater than 10
##     A     B     C     D     E     F     G     H 
##  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
age>10 & age <30 # age greater than 10 and less than 30
##     A     B     C     D     E     F     G     H 
## FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
age<10 | age >30 # age less than 10 OR greater than 30
##     A     B     C     D     E     F     G     H 
##  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
age <=30 # age less than equal to 30
##     A     B     C     D     E     F     G     H 
## FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE
age ==23 | age == 31 # age equal to 23 OR 31
##     A     B     C     D     E     F     G     H 
## FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
 age!=23 # age not equal to 23
##     A     B     C     D     E     F     G     H 
##  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

2.1.5 Character Vector

A character vector only has text. To create a character vector all text has to be in either single or double quotation mark.

myfirstcharactervec=c("RJ","AP","KT","TN")
class(myfirstcharactervec)
## [1] "character"

It is interesting to note that R has inbuilt character vector of alphabets, both lower and upper case, and month names. These are referred to as built-in constants in R. The pi is another built-in constants in R. It also supports conversion to upper and lower case of character vector.

# To get lower case alphabets
letters
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
# To get upper case alphabets
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
# To convert lower case alphabets to upper case 
toupper(letters)
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
# To convert upper case alphabets to lower case 
tolower(LETTERS)
##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
# To get month names
month.name
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
# To get abbreviated month names
month.abb
##  [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"

It is important to note that R has character translation capability also. The inbuilt chartr function is used for this purpose.

chartr(old character string, new character string, a character vector)
named= " He is my husband"
chartr("He","Joseph",named)
## [1] " Jo is my husband"

Don’t forget that character translation doesn’t change the length of the character vector.

2.1.6 Complex vector

Complex numbers involves iota \((\dot\iota)\) which is square root of negative one \(\sqrt -1\) . R software supports complex numbers too. When a numeric vector has presence of complex number(s) then it is of type and class complex.

myfirstcomplex=c(2,6,1+3i,24,8+2i,3+4i)
myfirstcomplex
## [1]  2+0i  6+0i  1+3i 24+0i  8+2i  3+4i
class(myfirstcomplex)
## [1] "complex"
typeof(myfirstcomplex)
## [1] "complex"
mode(myfirstcomplex)
## [1] "complex"

2.1.7 Creating vector from sequence and repeatition

In general, a sequence is created in R by seq function.

seq(from=1,to=2,by=0.1)
##  [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(from=1,to=2,length=9)
## [1] 1.000 1.125 1.250 1.375 1.500 1.625 1.750 1.875 2.000
seq(18) # Treats default from value of 1 and by value of 1, output same as 1:18
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18

To create an vector by repetition of a number, integer, character or logical argument then rep function can be of great help.

rep("A",5) # Create a character vector by repeating `A` five times
## [1] "A" "A" "A" "A" "A"
class(rep("A",5))
## [1] "character"
rep (10,3) # Create a numeric vector by repeating value 10 by three times
## [1] 10 10 10
class(rep(10,3))
## [1] "numeric"
rep(5L,4) # Create an integer vector by repeating value 5 four times
## [1] 5 5 5 5
class(rep(5L,4))
## [1] "integer"
rep(c(T,F),5) # Create a sequence of True False by repeating it five times
##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
class(rep(c(T,F),5))
## [1] "logical"

2.1.8 Mathematical Operations on vector

With vector many mathematical operations can be performed. This includes addition, subtraction, division multiplication (both inner and outer product), different types of transformations and statistical manipulation. All arithmetic operations on vector are performed in R element by element. It is also not necessary to have different vectors to be of same length. If vectors are short of length then short vectors are recycled to perform the given calculation till it matches the desired object length (usually the length of longest vector). In R there is also warning with some function in this regard as shown below.

If some of the values are not vectors of the appropriate length, you deserve whatever you get!

g=c(1,5,9,12)
h=c(2,3,7,11,13)
g-2 # subtracting two from all elements of vector
## [1] -1  3  7 10
g*2 # multiplying two to all elements of the vector
## [1]  2 10 18 24
g/2 # dividing all elements of vector by two
## [1] 0.5 2.5 4.5 6.0
g%%2 # %% operator gives remainder from division 
## [1] 1 1 1 0
c(2,3,5)%%2
## [1] 0 1 1
c(2,3,5)%/%2 # gives quotient from division 
## [1] 1 1 2
v=5*g+h-4+h^2
## Warning in 5 * g + h: longer object length is not a multiple of shorter object
## length
v # v is a vector of elements 5*1+2-4+2^2  5*5+3-4+3^2  5*9+7-4+7^2   5*12+11-4+11^2   5*1+13-4+13^2 , here 1 is recycled in vector g to meet the demand of longest vector h
## [1]   7  33  97 188 183
g/h # Dividing one vector by other vector, all operations are element by element
## Warning in g/h: longer object length is not a multiple of shorter object length
## [1] 0.50000000 1.66666667 1.28571429 1.09090909 0.07692308
log(h) #Use of log function 
## [1] 0.6931472 1.0986123 1.9459101 2.3978953 2.5649494
exp(g) # Use of exponential function
## [1] 2.718282e+00 1.484132e+02 8.103084e+03 1.627548e+05
sqrt (g) # Use of square root function
## [1] 1.000000 2.236068 3.000000 3.464102
sqrt (-18+0i) # While using any negative number with `sqrt` don't forget to put 0i to make it treated as complex number else you will get warning with `NaN`
## [1] 0+4.242641i
sin(g)  # Use of sin function
## [1]  0.8414710 -0.9589243  0.4121185 -0.5365729
cos(g)  # Use of cosine function
## [1]  0.5403023  0.2836622 -0.9111303  0.8438540
tan (g) # Use of tangent function 
## [1]  1.5574077 -3.3805150 -0.4523157 -0.6358599
max(g)  # Use of max function to find maximum value in a vector
## [1] 12
min(g)   # Use of min function to find minimum value in a vector 
## [1] 1
which(g==5) # which function provides position of 5 in vector g
## [1] 2
which.max(g) # which.max provides the position of maximum value in vector g
## [1] 4
which.min(g) # which.min provides the position of minimum value in vector g
## [1] 1
range(g) # Use of range functiono to find both minimum and maximum value together
## [1]  1 12
sum (g)  # Use of sum function provide sum or total of all elements in a vector
## [1] 27
prod(g) # Use of prod function provides product or multiplication of all elements in a vector
## [1] 540
summary(g) # Use of summary function provides statistical summary for a numeric, integer and complex data while for others it simply provides length, class and mode of the vector
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    7.00    6.75    9.75   12.00
summary(myfirstcharactervec)
##    Length     Class      Mode 
##         4 character character
abs(c(-1,1,-2,2,-4,5,4,-6)) # Use of abs function provides absolute values for all elements of the vector 
## [1] 1 1 2 2 4 5 4 6
sort(c(55,1,60,4,2,3,77),decreasing = F) # Use of sort function provides sorted vector in ascending/increasing order 
## [1]  1  2  3  4 55 60 77
unique(c(1,3,1,4,5,6,2,3,4,6)) # use of `unique` function provides unique values or names in a vector
## [1] 1 3 4 5 6 2
t(g) # Use of `t` function to get transpose of vector
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   12
t(g)%*%g # Dot or inner product between a vector and its transpose resulting in scalar
##      [,1]
## [1,]  251
#Power of a vector would result in element by element operation
sum(g^2)  # calculate sum of squares of elements of vector
## [1] 251
g%*%t(g) # Outer or tensor product of one vector with its transpose resulting in a matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   12
## [2,]    5   25   45   60
## [3,]    9   45   81  108
## [4,]   12   60  108  144
t(c(5,4,3,2))%*%g # Dot or inner product between two vector resulting in a scalar
##      [,1]
## [1,]   76
c(5,4,3,2)%*%t(g) # Outer product or tensor product between two vectors resulting in a matrix
##      [,1] [,2] [,3] [,4]
## [1,]    5   25   45   60
## [2,]    4   20   36   48
## [3,]    3   15   27   36
## [4,]    2   10   18   24
rank(c(10,1,16,24,3,45)) # to get the rank of all elements in a vector
## [1] 3 1 4 5 2 6
table(myfirstlogical)#Table function generates frequency distribution
## myfirstlogical
## FALSE  TRUE 
##     2     2
table(myfirstintegervec)
## myfirstintegervec
## 1 2 3 
## 1 1 1
table (c(11,14,15,13,14,12,11,14,16,15,13,12,17))
## 
## 11 12 13 14 15 16 17 
##  2  2  2  3  2  1  1
1:5 %in% c(1,3,5,9,10,11) # %in% is an operator which is used in place of match function and is used to find whether the elements of one vector belongs to elements of other vector
## [1]  TRUE FALSE  TRUE FALSE  TRUE
# the above code results in logical true/false showing that 1, 3 and 5 belongs to other vector but not 2 and 4. The length of resulting logical vector would be equal to length of former vector not latter. 
# outer applies a vectorized function to all combinations of two arguments.
# to create a matrix from a mathematical operations of two vectors outer function can be conveniently used. It must be supplied with argument FUN
outer(g,h,FUN="+")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    3    4    8   12   14
## [2,]    7    8   12   16   18
## [3,]   11   12   16   20   22
## [4,]   14   15   19   23   25
outer(g,h,FUN="^")
##      [,1] [,2]     [,3]         [,4]         [,5]
## [1,]    1    1        1            1 1.000000e+00
## [2,]   25  125    78125     48828125 1.220703e+09
## [3,]   81  729  4782969  31381059609 2.541866e+12
## [4,]  144 1728 35831808 743008370688 1.069932e+14
outer(g,h,FUN="*")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    2    3    7   11   13
## [2,]   10   15   35   55   65
## [3,]   18   27   63   99  117
## [4,]   24   36   84  132  156
outer(g,h,FUN="-")
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   -1   -2   -6  -10  -12
## [2,]    3    2   -2   -6   -8
## [3,]    7    6    2   -2   -4
## [4,]   10    9    5    1   -1
outer(g,h,FUN="/")
##      [,1]      [,2]      [,3]       [,4]       [,5]
## [1,]  0.5 0.3333333 0.1428571 0.09090909 0.07692308
## [2,]  2.5 1.6666667 0.7142857 0.45454545 0.38461538
## [3,]  4.5 3.0000000 1.2857143 0.81818182 0.69230769
## [4,]  6.0 4.0000000 1.7142857 1.09090909 0.92307692
# To get all combinations of values in two vector we can use expand.grid function
expand.grid(g,h)
##    Var1 Var2
## 1     1    2
## 2     5    2
## 3     9    2
## 4    12    2
## 5     1    3
## 6     5    3
## 7     9    3
## 8    12    3
## 9     1    7
## 10    5    7
## 11    9    7
## 12   12    7
## 13    1   11
## 14    5   11
## 15    9   11
## 16   12   11
## 17    1   13
## 18    5   13
## 19    9   13
## 20   12   13

2.1.9 Conversion from one mode of vector to another

If we have a numeric vector it can be converted back to an integer vector.

as.integer(c(1.5,2,3.5))
## [1] 1 2 3

Similarly, an integer vector can be converted into numeric vector.

as.numeric(myfirstintegervec)
## [1] 1 2 3

A numeric/integer/complex/logical vector can be converted into a character vector but vice-versa isn’t true.

as.numeric(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.integer(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(myfirstcharactervec)
## [1] NA NA NA NA
as.complex(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(myfirstintegervec)
## [1] TRUE TRUE TRUE
as.complex(myfirstintegervec)
## [1] 1+0i 2+0i 3+0i
as.numeric(myfirstcomplex)
## Warning: imaginary parts discarded in coercion
## [1]  2  6  1 24  8  3
as.integer(myfirstlogical)
## [1] 1 0 0 1

2.1.10 sub-setting a vector

Sub-setting a vector can be done by using either vector indices or subset function. The first argument to the function must be the name of vector or data and second argument should be an expression or condition to apply for sub-setting.

subset(myfirstintegervec,myfirstintegervec>1)
## [1] 2 3
# alternatively
myfirstintegervec[myfirstintegervec>1]
## [1] 2 3

2.2 Exercise

  1. Create a numeric vector having numbers and save it by name numeric1
  2. Create a character vector having names "Sarah", "Shrikant", "Anaya" and "Mika" , save this object by name charct1
  3. Convert all names in charct1 to lower case
  4. Change "S" in all names wtih "N" in charct1
  5. Subset month names for month numbered
  6. Create a logical vector of length 10 which is empty and fill it with "TRUE" on position number and save this object by name logical1
  7. Create an integer vector of length 5 with numbers and assign it name intvec1
  8. Create anther integer vector by name intvec2 by squaring all elements of intvec1, and then multiply intvec1 with intvec2 (element wise multiplication)
  9. Find the position of elements of vector 1:20 in vector numeric1
  10. Create all combinations of numbers from 1 to and alphabet from "A" to

2.3 Matrix

A Matrix is an arrangement of data in rows and columns. It also contains homogeneous data structure meaning only single mode of data can be used with a given object of class matrix. A matrix is the special case of a two-dimensional array.In R language, matrix data structure can be produced or created in several ways.

2.3.1 Creating a matrix

First method is that we can bind the different vectors either row or column wise. The function rbind and cbind respectively binds the vectors row and column wise. However, for this way of creating matrix, length of each vector must be equal otherwise R will recycle the elements in short vector to match the length of the largest vector.

wheat=c(12,13,14)
chickpea=c(23,44,56)
rbind(wheat,chickpea) 
##          [,1] [,2] [,3]
## wheat      12   13   14
## chickpea   23   44   56

Here, names of vector are automatically taken as names of rows in matrix.

cbind(wheat,chickpea)
##      wheat chickpea
## [1,]    12       23
## [2,]    13       44
## [3,]    14       56

Here, names of vectors used to create matrix are automatically taken as column names. Alternative way to create matrix is to use function matrix.

matrix(c(wheat,chickpea),ncol = 2,nrow=3,byrow = F)
##      [,1] [,2]
## [1,]   12   23
## [2,]   13   44
## [3,]   14   56

Argument ncol refers to number of columns and nrow refers to number of rows. In a matrix, product of ncol and nrow must equal the number of elements in all the vectors supplied to function matrix. If there is mismatch in the sum of length of vectors supplied and product of number of rows and columns then problem of recycling of elements will persist with some warning. The matrix function takes by default values column i.e., byrow=F. Thus, even if this parameter byrow=F is ignored it will give same output.

matrix(c(wheat,chickpea),ncol = 2,nrow=3)
##      [,1] [,2]
## [1,]   12   23
## [2,]   13   44
## [3,]   14   56
matrix(c(wheat,chickpea),ncol = 2,nrow=4) # Example of recycling of elements 
## Warning in matrix(c(wheat, chickpea), ncol = 2, nrow = 4): data length [6] is
## not a sub-multiple or multiple of the number of rows [4]
##      [,1] [,2]
## [1,]   12   44
## [2,]   13   56
## [3,]   14   12
## [4,]   23   13

Instead of supplying arguments ncol and nrow, it is possible to supply them in one go by specifying a vector the first element of which is number of rows and second element is number of columns.

myfirstmatrix=matrix(c(wheat,chickpea),c(3,2),byrow = F)
myfirstmatrix
##      [,1] [,2]
## [1,]   12   23
## [2,]   13   44
## [3,]   14   56
is.matrix(myfirstmatrix) #check whether the object is a matrix or not, output of this function is a logical true or false
## [1] TRUE

In place of a numeric vector, we can use other types of vectors too to create a matrix.

matrix(vector("character",length = 6),nrow = 3)
##      [,1] [,2]
## [1,] ""   ""  
## [2,] ""   ""  
## [3,] ""   ""
matrix(vector("logical",length = 6),nrow = 3)
##       [,1]  [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
matrix(vector("integer",length = 6),nrow = 3)
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
matrix(vector("complex",length = 6),nrow = 3)
##      [,1] [,2]
## [1,] 0+0i 0+0i
## [2,] 0+0i 0+0i
## [3,] 0+0i 0+0i

Existing attributes of the matrix can be obtained by attributes function.

attributes(myfirstmatrix)
## $dim
## [1] 3 2

Only one attribute dimension exists which is given by name dim and is equal to 3 rows and 2 columns. Other than dimension of matrix, row names and column names are also the other attributes. Row and column names can be set by either specifying dimnames in matrix function or by using rownames and colnames function.

rownames(myfirstmatrix)=c("Row1","Row2","Row3")
colnames(myfirstmatrix)=c("col1","col2")
myfirstmatrix
##      col1 col2
## Row1   12   23
## Row2   13   44
## Row3   14   56
class(myfirstmatrix)
## [1] "matrix" "array"
typeof(myfirstmatrix)
## [1] "double"
mode(myfirstmatrix)
## [1] "numeric"

Any element(s) of the matrix can be obtained either by using row and/or column names or by using row and/or column numbers. We can create a null vector and fill the values of various elements afterwards.

nullmatrix=matrix(0,nrow = 2,ncol = 3)
nullmatrix
##      [,1] [,2] [,3]
## [1,]    0    0    0
## [2,]    0    0    0

2.3.2 Indexing a matrix

myfirstmatrix["Row1",] #only specifying row name followed by comma will provide elements for that row from all columns
## col1 col2 
##   12   23
myfirstmatrix[,"col1"] #only specifying column names after comma 
## Row1 Row2 Row3 
##   12   13   14
myfirstmatrix["Row1","col2"] #accessing element of matrix belonging to first row and second column
## [1] 23

Any element value in matrix can be changed as required.

myfirstmatrix["Row1","col2"]=25
myfirstmatrix
##      col1 col2
## Row1   12   25
## Row2   13   44
## Row3   14   56
nullmatrix[1,1]=1
nullmatrix[2,2]=1
nullmatrix
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## To create identity matrix which contains all zeros but one along its diagonal
nullmatrix_square3=matrix(0,nrow = 3,ncol = 3)
diag(nullmatrix_square3)=rep(1,3) # diag function is used to set as well as access diagonal elements in a matrix
nullmatrix_square3
##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

2.3.3 Mathematical operations on matrix

R allows all mathematical operations on matrix data structure.

sqmatrix=matrix(c(2,3,7,1),nrow=2)
secondsqmatrix=matrix(c(10,8,5,12),nrow=2)
sqmatrix+secondsqmatrix #addition of two matrix of same dimension
##      [,1] [,2]
## [1,]   12   12
## [2,]   11   13
sqmatrix-secondsqmatrix # subtraction of two matrix of same order
##      [,1] [,2]
## [1,]   -8    2
## [2,]   -5  -11
sqmatrix/secondsqmatrix # division of matrix by matrix element by element
##       [,1]       [,2]
## [1,] 0.200 1.40000000
## [2,] 0.375 0.08333333
2*sqmatrix #multiplication of matrix with a scalar
##      [,1] [,2]
## [1,]    4   14
## [2,]    6    2
sqmatrix%*%secondsqmatrix # matrix multiplication
##      [,1] [,2]
## [1,]   76   94
## [2,]   38   27
sqmatrix*secondsqmatrix # element wise multiplication of two matrix
##      [,1] [,2]
## [1,]   20   35
## [2,]   24   12
sqmatrix%o%secondsqmatrix # outer or exterior product of two matrix, %o% is binary operator providing a wrapper for outer(x, y, FUN="*").
## , , 1, 1
## 
##      [,1] [,2]
## [1,]   20   70
## [2,]   30   10
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]   16   56
## [2,]   24    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]   10   35
## [2,]   15    5
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]   24   84
## [2,]   36   12
outer(sqmatrix,secondsqmatrix,FUN="*")
## , , 1, 1
## 
##      [,1] [,2]
## [1,]   20   70
## [2,]   30   10
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]   16   56
## [2,]   24    8
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]   10   35
## [2,]   15    5
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]   24   84
## [2,]   36   12
# However, outer function can perform several task with different functions provided for FUN argument.
outer(sqmatrix,secondsqmatrix,FUN="+") 
## , , 1, 1
## 
##      [,1] [,2]
## [1,]   12   17
## [2,]   13   11
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]   10   15
## [2,]   11    9
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]    7   12
## [2,]    8    6
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]   14   19
## [2,]   15   13
outer(sqmatrix,secondsqmatrix,FUN="-") 
## , , 1, 1
## 
##      [,1] [,2]
## [1,]   -8   -3
## [2,]   -7   -9
## 
## , , 2, 1
## 
##      [,1] [,2]
## [1,]   -6   -1
## [2,]   -5   -7
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]   -3    2
## [2,]   -2   -4
## 
## , , 2, 2
## 
##      [,1] [,2]
## [1,]  -10   -5
## [2,]   -9  -11
outer(sqmatrix,secondsqmatrix,FUN="^") 
## , , 1, 1
## 
##       [,1]      [,2]
## [1,]  1024 282475249
## [2,] 59049         1
## 
## , , 2, 1
## 
##      [,1]    [,2]
## [1,]  256 5764801
## [2,] 6561       1
## 
## , , 1, 2
## 
##      [,1]  [,2]
## [1,]   32 16807
## [2,]  243     1
## 
## , , 2, 2
## 
##        [,1]        [,2]
## [1,]   4096 13841287201
## [2,] 531441           1
outer(sqmatrix,secondsqmatrix,FUN="/") 
## , , 1, 1
## 
##      [,1] [,2]
## [1,]  0.2  0.7
## [2,]  0.3  0.1
## 
## , , 2, 1
## 
##       [,1]  [,2]
## [1,] 0.250 0.875
## [2,] 0.375 0.125
## 
## , , 1, 2
## 
##      [,1] [,2]
## [1,]  0.4  1.4
## [2,]  0.6  0.2
## 
## , , 2, 2
## 
##           [,1]       [,2]
## [1,] 0.1666667 0.58333333
## [2,] 0.2500000 0.08333333
sqmatrix%x%secondsqmatrix #Kronecker product of two matrix
##      [,1] [,2] [,3] [,4]
## [1,]   20   10   70   35
## [2,]   16   24   56   84
## [3,]   30   15   10    5
## [4,]   24   36    8   12
sum(diag(myfirstmatrix)) # To find trace of the matrix which is sum of the diagonal elements of the matrix
## [1] 56
solve(sqmatrix) # Inverse of matrix by solve function
##             [,1]       [,2]
## [1,] -0.05263158  0.3684211
## [2,]  0.15789474 -0.1052632
det(sqmatrix) # Determinant of the matrix
## [1] -19
solve(sqmatrix,c(1,2)) # Solving a system of equations Ax=B where we need to solve for x, sqmatrix is set as A and c(1,2) as B
## [1]  0.68421053 -0.05263158
sqmatrix %*% solve(sqmatrix)
##      [,1] [,2]
## [1,]    1    0
## [2,]    0    1
eigen(sqmatrix) # Eigen value decomposition
## eigen() decomposition
## $values
## [1]  6.109772 -3.109772
## 
## $vectors
##           [,1]       [,2]
## [1,] 0.8623579 -0.8076995
## [2,] 0.5062992  0.5895944
svd(sqmatrix) # Singular value decomposition
## $d
## [1] 7.524938 2.524938
## 
## $u
##            [,1]       [,2]
## [1,] -0.9632580 -0.2685778
## [2,] -0.2685778  0.9632580
## 
## $v
##            [,1]       [,2]
## [1,] -0.3630926  0.9317531
## [2,] -0.9317531 -0.3630926
chol(secondsqmatrix) #Cholesky decomposition
##          [,1]     [,2]
## [1,] 3.162278 1.581139
## [2,] 0.000000 3.082207
qr(sqmatrix) #QR decomposition
## $qr
##            [,1]      [,2]
## [1,] -3.6055513 -4.714952
## [2,]  0.8320503 -5.269652
## 
## $rank
## [1] 2
## 
## $qraux
## [1] 1.554700 5.269652
## 
## $pivot
## [1] 1 2
## 
## attr(,"class")
## [1] "qr"
colSums(sqmatrix) #Get column sums for each column of the numeric, complex and integer mode matrix 
## [1] 5 8
rowSums(sqmatrix) #Get row sums for each row of the numeric, complex and integer mode matrix
## [1] 9 4
colMeans(sqmatrix) #Get column means for each column of the numeric, complex and integer mode matrix 
## [1] 2.5 4.0
rowMeans(sqmatrix) #Get row means for each row of the numeric, complex and integer mode matrix
## [1] 4.5 2.0
# Get frequency distribution of numbers in matrix
table(sqmatrix)
## sqmatrix
## 1 2 3 7 
## 1 1 1 1
# To find proportion by row, margin=1 and for column, margin=2
proportions(sqmatrix,margin=1) 
##           [,1]      [,2]
## [1,] 0.2222222 0.7777778
## [2,] 0.7500000 0.2500000
proportions(sqmatrix,margin=2)
##      [,1]  [,2]
## [1,]  0.4 0.875
## [2,]  0.6 0.125

2.4 Array

An array is a multi-dimensional object which is heterogeneous in nature. It is simply a vector which is stored with additional attributes giving the dimensions (attribute dim) and optionally names for those dimensions (attribute dimnames). To create an arry, array function is used. This function takes data in vector form even when matrices are provided, it is provided by vectorizing it. Further the dim argument takes a vector which contains number of rows followed by number of columns and followed by number of matrices to be printed. In the following example, all the three have been set as 2.

2.4.1 Creating an array

myfirstarray=array(data=c(sqmatrix,secondsqmatrix),dim=c(2,2,2))
myfirstarray
## , , 1
## 
##      [,1] [,2]
## [1,]    2    7
## [2,]    3    1
## 
## , , 2
## 
##      [,1] [,2]
## [1,]   10    5
## [2,]    8   12
class(myfirstarray)
## [1] "array"
is.array(myfirstarray) #check whether the given object is an array or not, the output of the function is logical true or false
## [1] TRUE

2.4.2 Indexing an array

An array data structure has three indices to subset the array. First one is row number, followed by column number and third is matrix number. If within [ ] row and column numbers are left blank but matrix number is given then it will return the matrix for which number was provided. When row number is given but column and matrix number are left blank then that particular row from each matrix with all columns would be returned.But in such case rows would be printed as columns of a new matrix. But when only column number is specified then resulting new matrix will have selected column of each matrix as its column.

# To access the first matrix with all rows and columns
myfirstarray[,,2]
##      [,1] [,2]
## [1,]   10    5
## [2,]    8   12
# To access the first row of all matrices in array with all columns
myfirstarray[1,,] 
##      [,1] [,2]
## [1,]    2   10
## [2,]    7    5
# To access the first column of all matrices in array with all rows
myfirstarray[,1,]
##      [,1] [,2]
## [1,]    2   10
## [2,]    3    8
# To access second row from first column in first matrix 
myfirstarray[2,1,1]
## [1] 3

2.5 Data Frame

Data frame is a data structure which consists of vectors as columns and each vector should have a unique name. Thus, data frame is a heterogeneous data structure. Column names should not be left empty. Usually, a data frame has rows as observations and columns as variables. Data frame can be created by using the function data.frame. Each column in data frame has a separate mode.

2.5.1 Creating data frame

Assume some hypothetical data regarding persons and create a data frame.

Name<-c("AKS","AS",
        "SBN","GKV",
        "NK","DR","HOS",
        "AV")
Age<-c(64,62,63,32,
       64,49,64,42)
Gender<-c(rep("M",7),"F")
Education<-c(rep("Ph.D.",3),"M.Sc.",rep("Ph.D.",4))
mydept=data.frame(Name,Age,Gender,Education)
mydept # print data frame
##   Name Age Gender Education
## 1  AKS  64      M     Ph.D.
## 2   AS  62      M     Ph.D.
## 3  SBN  63      M     Ph.D.
## 4  GKV  32      M     M.Sc.
## 5   NK  64      M     Ph.D.
## 6   DR  49      M     Ph.D.
## 7  HOS  64      M     Ph.D.
## 8   AV  42      F     Ph.D.
 #View data frame object in a separate window which gives spreadsheet like feel 
View(mydept)
is.data.frame(mydept) # check whether the object is a data frame, output is logical True or false
## [1] TRUE
class(mydept)#check the class of the object
## [1] "data.frame"
tail(mydept) # tail function provides the last 6 rows of the data frame by default
##   Name Age Gender Education
## 3  SBN  63      M     Ph.D.
## 4  GKV  32      M     M.Sc.
## 5   NK  64      M     Ph.D.
## 6   DR  49      M     Ph.D.
## 7  HOS  64      M     Ph.D.
## 8   AV  42      F     Ph.D.
tail(mydept,3) # In case only last three rows needs to be seen then we have to specify to the tail function
##   Name Age Gender Education
## 6   DR  49      M     Ph.D.
## 7  HOS  64      M     Ph.D.
## 8   AV  42      F     Ph.D.
head(mydept) # head function by default provides first 6 rows of the data frame
##   Name Age Gender Education
## 1  AKS  64      M     Ph.D.
## 2   AS  62      M     Ph.D.
## 3  SBN  63      M     Ph.D.
## 4  GKV  32      M     M.Sc.
## 5   NK  64      M     Ph.D.
## 6   DR  49      M     Ph.D.
head(mydept,3) #In case only first three rows needs to be seen then we have to specify to the head function
##   Name Age Gender Education
## 1  AKS  64      M     Ph.D.
## 2   AS  62      M     Ph.D.
## 3  SBN  63      M     Ph.D.
attributes(mydept) #use attribute function to check the attributes ofthe data frame
## $names
## [1] "Name"      "Age"       "Gender"    "Education"
## 
## $class
## [1] "data.frame"
## 
## $row.names
## [1] 1 2 3 4 5 6 7 8
row.names(mydept)<-c("A Shrivastav",
                     "A Sarawgi",
                     "S B Nahatkar",
                     "GK Vani",
                     "N Khan",
                     "D Rathi", "HO Sharma","A Verma") #Use function rownames or row.names to set the rownames for data frame
mydept
##              Name Age Gender Education
## A Shrivastav  AKS  64      M     Ph.D.
## A Sarawgi      AS  62      M     Ph.D.
## S B Nahatkar  SBN  63      M     Ph.D.
## GK Vani       GKV  32      M     M.Sc.
## N Khan         NK  64      M     Ph.D.
## D Rathi        DR  49      M     Ph.D.
## HO Sharma     HOS  64      M     Ph.D.
## A Verma        AV  42      F     Ph.D.

View window for command View(mydept)

2.5.2 Indexing data frame

Indexing involves using row and column names/numbers to access data from data frame.

# Use row and column names to access dataframe elements
mydept["A Shrivastav","Age"] 
## [1] 64

Providing only row names and leaving column name blank would return the whole row corresponding to the given row name

mydept["A Shrivastav",] 
##              Name Age Gender Education
## A Shrivastav  AKS  64      M     Ph.D.

Instead of row and column names we can use row and column numbers as well.

mydept[2,3] 
## [1] "M"
mydept[3,1]
## [1] "SBN"

By using dollar sign after data frame name we can access columns by their name

mydept$Name 
## [1] "AKS" "AS"  "SBN" "GKV" "NK"  "DR"  "HOS" "AV"

To add the additional column to data frame, we should write the name of new column after dollar sign and should assign it a new vector of values or character

mydept$Income=c(1000000,200000,300000,400000,500000,600000,700000,800000) 
head(mydept)
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05

To add one or more row to the data frame using index features and assign the new row index with a data frame that contains the values for all columns existing in the data frame for equal number of rows added.

mydept[9,]=data.frame(Name="ASR",Age=29,Gender="F",Education="M.Sc.",Income=200000) 
row.names(mydept)[9] = "AS Rajpoot"
mydept
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05

2.5.3 Filtering data in data frame

# To filter all the rows in data frame where Age was less than 50 and Gender was female (F)
mydept[mydept$Age<50 & mydept$Gender=="F",] 
##            Name Age Gender Education Income
## A Verma      AV  42      F     Ph.D.  8e+05
## AS Rajpoot  ASR  29      F     M.Sc.  2e+05
summary(mydept) #To get column wise summary of data frame 
##      Name                Age           Gender           Education        
##  Length:9           Min.   :29.00   Length:9           Length:9          
##  Class :character   1st Qu.:42.00   Class :character   Class :character  
##  Mode  :character   Median :62.00   Mode  :character   Mode  :character  
##                     Mean   :52.11                                        
##                     3rd Qu.:64.00                                        
##                     Max.   :64.00                                        
##      Income       
##  Min.   : 200000  
##  1st Qu.: 300000  
##  Median : 500000  
##  Mean   : 522222  
##  3rd Qu.: 700000  
##  Max.   :1000000

Filtering can also be done by using the subset function.

subset(mydept,Income>400000)
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
subset(mydept,Income<400000 & Age>60)
##              Name Age Gender Education Income
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05

2.5.4 Tables from Data frame

#Table function generates summary over categorical data like male/female, treatment/control, or yes/no
table(mydept$Gender) 
## 
## F M 
## 2 7
# table function can provide contingency tables too for categorical data
table(mydept$Gender,mydept$Education) 
##    
##     M.Sc. Ph.D.
##   F     1     1
##   M     1     6
proportions(table(mydept$Gender,mydept$Education),margin=1) #margin=1 for row
##    
##         M.Sc.     Ph.D.
##   F 0.5000000 0.5000000
##   M 0.1428571 0.8571429
proportions(table(mydept$Gender,mydept$Education),margin=2) #margin=2 for column
##    
##         M.Sc.     Ph.D.
##   F 0.5000000 0.1428571
##   M 0.5000000 0.8571429

2.5.5 Merging and appending data frames

The dplyr package in system library of R is called as grammer of data manipulation in R. It has several functions for manipulation of data. To merge data frames, there are six types of merge function in this package. Each one of these is described below:

  1. inner_join() function joins two data frames which contains only matched columns and rows from both data frames.

  2. full_join() function joins two data frames regardless of matching, i.e., contains all columns and rows from both data frames.

  3. left_join() function joins two data frames only for the columns matched from second data frame and all columns and rows of first data frame. The first data frame is left and second is right.

  4. right_join() function joins two data frames only for the columns matched from first data frame and all columns and rows of second data frame.

  5. semi_join() function provides output of columns and rows matched in left data frame.

  6. anti_join() function doesn’t join anything. It only provides unmatched observations from two data frames.

The inner_join(), left_join(), right_join() and full_join() are all examples of mutating joins wherein columns from right data frame to left data frame. If there are multiple row match in rigth data frame for a row in left data frame then all rows in right data frame would be returned for each unique row in left data frame.

The semi_join() and anti_join() are examples of filtering joins wherein rows from left data frame are filtered for its match in right data frame.

In addition there is a function nest_join which returns all rows and columns in left data frame with nested columns that contains all matched columns from right data frame. This differs from left_join in that new column names wouldn’t be set in former but in latter.

Now lets create another data frame which has some members of mydept data frame with information on some other variables.

mydata_1<-data.frame(Name=c("AKS","GKV"),Skills=c("Oratory","Data Analysis"),Age=c(64,32))
library(dplyr) ## loading dplyr package for using functions inbuilt in it
## perform inner join
inner_join(mydept,mydata_1,by="Name")
##   Name Age.x Gender Education Income        Skills Age.y
## 1  AKS    64      M     Ph.D.  1e+06       Oratory    64
## 2  GKV    32      M     M.Sc.  4e+05 Data Analysis    32
inner_join(mydept,mydata_1,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2  GKV  32      M     M.Sc.  4e+05 Data Analysis
## perform full join
full_join(mydept,mydata_1,by="Name") ## join with single matching key
##   Name Age.x Gender Education Income        Skills Age.y
## 1  AKS    64      M     Ph.D.  1e+06       Oratory    64
## 2   AS    62      M     Ph.D.  2e+05          <NA>    NA
## 3  SBN    63      M     Ph.D.  3e+05          <NA>    NA
## 4  GKV    32      M     M.Sc.  4e+05 Data Analysis    32
## 5   NK    64      M     Ph.D.  5e+05          <NA>    NA
## 6   DR    49      M     Ph.D.  6e+05          <NA>    NA
## 7  HOS    64      M     Ph.D.  7e+05          <NA>    NA
## 8   AV    42      F     Ph.D.  8e+05          <NA>    NA
## 9  ASR    29      F     M.Sc.  2e+05          <NA>    NA
full_join(mydept,mydata_1,by=c("Name","Age")) ## join with two matching keys
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2   AS  62      M     Ph.D.  2e+05          <NA>
## 3  SBN  63      M     Ph.D.  3e+05          <NA>
## 4  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 5   NK  64      M     Ph.D.  5e+05          <NA>
## 6   DR  49      M     Ph.D.  6e+05          <NA>
## 7  HOS  64      M     Ph.D.  7e+05          <NA>
## 8   AV  42      F     Ph.D.  8e+05          <NA>
## 9  ASR  29      F     M.Sc.  2e+05          <NA>
## perform left join
left_join(mydept,mydata_1,by="Name")
##   Name Age.x Gender Education Income        Skills Age.y
## 1  AKS    64      M     Ph.D.  1e+06       Oratory    64
## 2   AS    62      M     Ph.D.  2e+05          <NA>    NA
## 3  SBN    63      M     Ph.D.  3e+05          <NA>    NA
## 4  GKV    32      M     M.Sc.  4e+05 Data Analysis    32
## 5   NK    64      M     Ph.D.  5e+05          <NA>    NA
## 6   DR    49      M     Ph.D.  6e+05          <NA>    NA
## 7  HOS    64      M     Ph.D.  7e+05          <NA>    NA
## 8   AV    42      F     Ph.D.  8e+05          <NA>    NA
## 9  ASR    29      F     M.Sc.  2e+05          <NA>    NA
left_join(mydept,mydata_1,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2   AS  62      M     Ph.D.  2e+05          <NA>
## 3  SBN  63      M     Ph.D.  3e+05          <NA>
## 4  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 5   NK  64      M     Ph.D.  5e+05          <NA>
## 6   DR  49      M     Ph.D.  6e+05          <NA>
## 7  HOS  64      M     Ph.D.  7e+05          <NA>
## 8   AV  42      F     Ph.D.  8e+05          <NA>
## 9  ASR  29      F     M.Sc.  2e+05          <NA>
## perform nest join 
nest_join(mydept,mydata_1,by="Name")
##   Name Age Gender Education Income          mydata_1
## 1  AKS  64      M     Ph.D.  1e+06       Oratory, 64
## 2   AS  62      M     Ph.D.  2e+05                  
## 3  SBN  63      M     Ph.D.  3e+05                  
## 4  GKV  32      M     M.Sc.  4e+05 Data Analysis, 32
## 5   NK  64      M     Ph.D.  5e+05                  
## 6   DR  49      M     Ph.D.  6e+05                  
## 7  HOS  64      M     Ph.D.  7e+05                  
## 8   AV  42      F     Ph.D.  8e+05                  
## 9  ASR  29      F     M.Sc.  2e+05
nest_join(mydept,mydata_1,by=c("Name","Age"))
##   Name Age Gender Education Income      mydata_1
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2   AS  62      M     Ph.D.  2e+05              
## 3  SBN  63      M     Ph.D.  3e+05              
## 4  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 5   NK  64      M     Ph.D.  5e+05              
## 6   DR  49      M     Ph.D.  6e+05              
## 7  HOS  64      M     Ph.D.  7e+05              
## 8   AV  42      F     Ph.D.  8e+05              
## 9  ASR  29      F     M.Sc.  2e+05
## perform right join
right_join(mydept,mydata_1,by="Name")
##   Name Age.x Gender Education Income        Skills Age.y
## 1  AKS    64      M     Ph.D.  1e+06       Oratory    64
## 2  GKV    32      M     M.Sc.  4e+05 Data Analysis    32
right_join(mydept,mydata_1,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2  GKV  32      M     M.Sc.  4e+05 Data Analysis
## perform semi join
semi_join(mydept,mydata_1,by="Name")
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## GK Vani       GKV  32      M     M.Sc.  4e+05
semi_join(mydept,mydata_1,by=c("Name","Age"))
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## GK Vani       GKV  32      M     M.Sc.  4e+05
## perform anti join
anti_join(mydept,mydata_1,by="Name")
##              Name Age Gender Education Income
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
anti_join(mydept,mydata_1,by=c("Name","Age"))
##              Name Age Gender Education Income
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05

Alternatively, one can use merge function from base pacakge also. The base package is loaded by default in R. This function unlike those in dplyr package sorts the data frame columns by default. In case one doesn’t wish to sort columns then additional argument sort should be set equal to FALSE .

##left join with merge function
merge(mydept,mydata_1,all.x = T,all.y = F,by="Name")
##   Name Age.x Gender Education Income        Skills Age.y
## 1  AKS    64      M     Ph.D.  1e+06       Oratory    64
## 2   AS    62      M     Ph.D.  2e+05          <NA>    NA
## 3  ASR    29      F     M.Sc.  2e+05          <NA>    NA
## 4   AV    42      F     Ph.D.  8e+05          <NA>    NA
## 5   DR    49      M     Ph.D.  6e+05          <NA>    NA
## 6  GKV    32      M     M.Sc.  4e+05 Data Analysis    32
## 7  HOS    64      M     Ph.D.  7e+05          <NA>    NA
## 8   NK    64      M     Ph.D.  5e+05          <NA>    NA
## 9  SBN    63      M     Ph.D.  3e+05          <NA>    NA
merge(mydept,mydata_1,all.x = T,all.y = F,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2   AS  62      M     Ph.D.  2e+05          <NA>
## 3  ASR  29      F     M.Sc.  2e+05          <NA>
## 4   AV  42      F     Ph.D.  8e+05          <NA>
## 5   DR  49      M     Ph.D.  6e+05          <NA>
## 6  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 7  HOS  64      M     Ph.D.  7e+05          <NA>
## 8   NK  64      M     Ph.D.  5e+05          <NA>
## 9  SBN  63      M     Ph.D.  3e+05          <NA>
##right join with merge function
merge(mydept,mydata_1,all.x = F,all.y = T,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2  GKV  32      M     M.Sc.  4e+05 Data Analysis
##full join with merge function
merge(mydept,mydata_1,all.x = T,all.y = T,by=c("Name","Age"))
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2   AS  62      M     Ph.D.  2e+05          <NA>
## 3  ASR  29      F     M.Sc.  2e+05          <NA>
## 4   AV  42      F     Ph.D.  8e+05          <NA>
## 5   DR  49      M     Ph.D.  6e+05          <NA>
## 6  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 7  HOS  64      M     Ph.D.  7e+05          <NA>
## 8   NK  64      M     Ph.D.  5e+05          <NA>
## 9  SBN  63      M     Ph.D.  3e+05          <NA>
## for not sorting the joint data frame
merge(mydept,mydata_1,all.x = T,all.y = T,by=c("Name","Age"),sort = FALSE)
##   Name Age Gender Education Income        Skills
## 1  AKS  64      M     Ph.D.  1e+06       Oratory
## 2  GKV  32      M     M.Sc.  4e+05 Data Analysis
## 3   NK  64      M     Ph.D.  5e+05          <NA>
## 4   AS  62      M     Ph.D.  2e+05          <NA>
## 5  SBN  63      M     Ph.D.  3e+05          <NA>
## 6   AV  42      F     Ph.D.  8e+05          <NA>
## 7  ASR  29      F     M.Sc.  2e+05          <NA>
## 8   DR  49      M     Ph.D.  6e+05          <NA>
## 9  HOS  64      M     Ph.D.  7e+05          <NA>

To append data frames in R usually rbind() function is used but it is not an efficient one. Instead, users are suggested to use bind_rows() function from dplyr package.

mydat_2=data.frame(Name=c("SK","RK"),Age=c(37,35),Gender=c("M","M"),Education=c("SSLC","SSLC"),Income=c(25000,30000))
row.names(mydat_2)<-c("Sanjay","Rajesh")
library(dplyr)
bind_rows(mydept,mydat_2)
##              Name Age Gender Education  Income
## A Shrivastav  AKS  64      M     Ph.D. 1000000
## A Sarawgi      AS  62      M     Ph.D.  200000
## S B Nahatkar  SBN  63      M     Ph.D.  300000
## GK Vani       GKV  32      M     M.Sc.  400000
## N Khan         NK  64      M     Ph.D.  500000
## D Rathi        DR  49      M     Ph.D.  600000
## HO Sharma     HOS  64      M     Ph.D.  700000
## A Verma        AV  42      F     Ph.D.  800000
## AS Rajpoot    ASR  29      F     M.Sc.  200000
## Sanjay         SK  37      M      SSLC   25000
## Rajesh         RK  35      M      SSLC   30000

2.5.6 Reshaping data frame: Wide and Long formats

A data frame can be either in long or wide format. Wide format refers to subjects or units in rows and repeated measurement or yearly data on subjects or units in columns of the corresponding row. while long format refers to subject /unit wise data over the years or repeated measurements entered in separate rows.

mydata_2<-data.frame(Subject=rep(LETTERS[1:7],2),time=c(rep(1,7),rep(2,7)),Value=runif(14))
library(dplyr)
# long to wide (direction = "wide") requires idvar and timevar at a minimum
reshape(mydata_2, direction = "wide", idvar = "Subject", timevar = "time")
##   Subject     Value.1    Value.2
## 1       A 0.147468568 0.93928864
## 2       B 0.027665727 0.40437964
## 3       C 0.977817373 0.72287824
## 4       D 0.047472843 0.82415125
## 5       E 0.007562183 0.44510712
## 6       F 0.530911993 0.46650597
## 7       G 0.382360465 0.07535331
## can also explicitly specify name of combined variable
wide <- reshape(mydata_2, direction = "wide", idvar = "Subject",timevar = "time", v.names = "Value", sep= "_")
wide
##   Subject     Value_1    Value_2
## 1       A 0.147468568 0.93928864
## 2       B 0.027665727 0.40437964
## 3       C 0.977817373 0.72287824
## 4       D 0.047472843 0.82415125
## 5       E 0.007562183 0.44510712
## 6       F 0.530911993 0.46650597
## 7       G 0.382360465 0.07535331
## reverse transformation
reshape(wide, direction = "long")
##     Subject time       Value
## A.1       A    1 0.147468568
## B.1       B    1 0.027665727
## C.1       C    1 0.977817373
## D.1       D    1 0.047472843
## E.1       E    1 0.007562183
## F.1       F    1 0.530911993
## G.1       G    1 0.382360465
## A.2       A    2 0.939288641
## B.2       B    2 0.404379635
## C.2       C    2 0.722878236
## D.2       D    2 0.824151246
## E.2       E    2 0.445107116
## F.2       F    2 0.466505970
## G.2       G    2 0.075353314
reshape(wide, idvar = "Subject", varying = list(2:3), v.names = "Value", direction = "long")
##     Subject time       Value
## A.1       A    1 0.147468568
## B.1       B    1 0.027665727
## C.1       C    1 0.977817373
## D.1       D    1 0.047472843
## E.1       E    1 0.007562183
## F.1       F    1 0.530911993
## G.1       G    1 0.382360465
## A.2       A    2 0.939288641
## B.2       B    2 0.404379635
## C.2       C    2 0.722878236
## D.2       D    2 0.824151246
## E.2       E    2 0.445107116
## F.2       F    2 0.466505970
## G.2       G    2 0.075353314

2.5.7 Data entry with data frame

Data frame can be used as data entry spreadsheet by using edit function.

# create an empty data frame first
mydata=data.frame(0)
# Now use edit function and save the output of edited data frame with same object name
mydata=edit(mydata) 
# Now to view updated mydata data frame use view function
View(mydata)
# edit function with data frame would open an text editor 
mydept=edit(mydept)

A view of data editor window.

Screen shot of edit function on mydept data frame

If the user uses edit function without providing object name/argument then it will open the temporary file in the memory. The emacs , xedit , vi , pico and xemacs function would also open the text editor for the object provided. The data.entry , dataentry and de function will open the spreadsheet like window for editing the data.

data.entry function in RStudio

2.6 List

A list is the generic/generalized vector. It is an multidimensional heterogeneous data structure. It can contain within it different types of data structures like vector, data frame, factor, and matrix.

2.6.1 Creating a list

To create a list one can use either vector function or list function. vector function takes two argument as used before, mode of the vector and length of the vector.

myfirstveclist=vector("list",length=2) # creates an empty list
myfirstveclist
## [[1]]
## NULL
## 
## [[2]]
## NULL
class(myfirstveclist)
## [1] "list"
mode(myfirstveclist)
## [1] "list"
typeof(myfirstveclist)
## [1] "list"
is.list(myfirstveclist) # check whether the object is a list or not, the output of this function is a logical True or false
## [1] TRUE
myfirstlist=list(mydataframe=mydept,
     myfirstvector=myfirstcharactervec,
     myfirstmatrix=myfirstmatrix)
myfirstlist
## $mydataframe
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
## 
## $myfirstvector
## [1] "RJ" "AP" "KT" "TN"
## 
## $myfirstmatrix
##      col1 col2
## Row1   12   25
## Row2   13   44
## Row3   14   56
attributes(myfirstlist)
## $names
## [1] "mydataframe"   "myfirstvector" "myfirstmatrix"
class(myfirstlist)
## [1] "list"

2.6.2 Indexing a list

To access the various elements or objects stored in a list, [[]] are used instead of [] used in data frame, matrix, array and vector.

# Accessing the elements in list by their name
myfirstlist$mydataframe
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
# Accessing the first and second element in list
myfirstlist[[1]]
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
myfirstlist[[2]]
## [1] "RJ" "AP" "KT" "TN"

2.7 Factor

A factor is an ordered vector. In a character vector there exists no order among the elements of the vector. But factor data structure provides levels to it.

2.7.1 Creating factor

factor function takes a vector as its argument to create a factor object and level argument.

education=c("Illiterate","Primary","Middle","High School","Higher Secondary","Graduate","Post Graduate","Doctorate","Post Doctorate")
class(education)
## [1] "character"
literacy=c("Post Doctorate","Graduate","Primary","Higher Secondary",
                 "Illiterate","Graduate","Post Doctorate","Doctorate","Graduate","High School","Middle","Higher Secondary")
table(literacy) 
## literacy
##        Doctorate         Graduate      High School Higher Secondary 
##                1                3                1                2 
##       Illiterate           Middle   Post Doctorate          Primary 
##                1                1                2                1

By creating a frequency distribution table before factoring vector literacy, it clearly visible that literacy levels are in alphabetical order.

literacy= factor(x=literacy, levels = education )
literacy
##  [1] Post Doctorate   Graduate         Primary          Higher Secondary
##  [5] Illiterate       Graduate         Post Doctorate   Doctorate       
##  [9] Graduate         High School      Middle           Higher Secondary
## 9 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
table(literacy)
## literacy
##       Illiterate          Primary           Middle      High School 
##                1                1                1                1 
## Higher Secondary         Graduate    Post Graduate        Doctorate 
##                2                3                0                1 
##   Post Doctorate 
##                2

By creating a frequency distribution table after factoring vector literacy, it is clearly visible that literacy levels are now in ascending order.

Only == and != operators can be used for factors. A factor can only be compared to another factor with an identical set of levels (not necessarily in the same ordering) or to a character vector. Ordered factors are compared in the same way, but the general dispatch mechanism precludes comparing ordered and unordered factors.

2.7.2 Sub setting the factor

Sub setting the factor involves dropping the unused levels too.

# Only second and third element are retained, rest are dropped
factor(literacy)[2:3,drop=T]
## [1] Graduate Primary 
## Levels: Primary Graduate
# Retain only 1,4,8 and 9th observation with respective levels
factor(literacy)[c(1,4,8,9),drop=T]
## [1] Post Doctorate   Higher Secondary Doctorate        Graduate        
## Levels: Higher Secondary Graduate Doctorate Post Doctorate
# Subset the factor after dropping "Doctorate" but not its level
subset(literacy,literacy!="Doctorate")
##  [1] Post Doctorate   Graduate         Primary          Higher Secondary
##  [5] Illiterate       Graduate         Post Doctorate   Graduate        
##  [9] High School      Middle           Higher Secondary
## 9 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
# first drop levels 
newfactor=droplevels(literacy,exclude = "Doctorate")
# check the new factor and we find in place of Doctorate "NA" is there
newfactor
##  [1] Post Doctorate   Graduate         Primary          Higher Secondary
##  [5] Illiterate       Graduate         Post Doctorate   <NA>            
##  [9] Graduate         High School      Middle           Higher Secondary
## 7 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
levels(newfactor)
## [1] "Illiterate"       "Primary"          "Middle"           "High School"     
## [5] "Higher Secondary" "Graduate"         "Post Doctorate"

Upon check for levels and we find that not just “Doctorate” but also “Post Graduate” has been dropped because this level was not being used.

Now lets add one new column to data frame object mydept as factor with the help of transform function.

mydept=transform(mydept,Position=factor(c("P","HOD","P","AP","P","P","P","AP","SRF"),levels = c("SRF","AP","P","HOD")))
levels(mydept$Position)
## [1] "SRF" "AP"  "P"   "HOD"
mydept
##              Name Age Gender Education Income Position
## A Shrivastav  AKS  64      M     Ph.D.  1e+06        P
## A Sarawgi      AS  62      M     Ph.D.  2e+05      HOD
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05        P
## GK Vani       GKV  32      M     M.Sc.  4e+05       AP
## N Khan         NK  64      M     Ph.D.  5e+05        P
## D Rathi        DR  49      M     Ph.D.  6e+05        P
## HO Sharma     HOS  64      M     Ph.D.  7e+05        P
## A Verma        AV  42      F     Ph.D.  8e+05       AP
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05      SRF

2.8 Conversion between data structure

Various data structures can be converted to another data structure.

# To convert any data structure to data frame
as.data.frame(sqmatrix)
##   V1 V2
## 1  2  7
## 2  3  1
# To convert any data structure to matrix
as.matrix(mydept)
##              Name  Age  Gender Education Income  Position
## A Shrivastav "AKS" "64" "M"    "Ph.D."   "1e+06" "P"     
## A Sarawgi    "AS"  "62" "M"    "Ph.D."   "2e+05" "HOD"   
## S B Nahatkar "SBN" "63" "M"    "Ph.D."   "3e+05" "P"     
## GK Vani      "GKV" "32" "M"    "M.Sc."   "4e+05" "AP"    
## N Khan       "NK"  "64" "M"    "Ph.D."   "5e+05" "P"     
## D Rathi      "DR"  "49" "M"    "Ph.D."   "6e+05" "P"     
## HO Sharma    "HOS" "64" "M"    "Ph.D."   "7e+05" "P"     
## A Verma      "AV"  "42" "F"    "Ph.D."   "8e+05" "AP"    
## AS Rajpoot   "ASR" "29" "F"    "M.Sc."   "2e+05" "SRF"
# To convert any data structure to list
as.list(mydept)
## $Name
## [1] "AKS" "AS"  "SBN" "GKV" "NK"  "DR"  "HOS" "AV"  "ASR"
## 
## $Age
## [1] 64 62 63 32 64 49 64 42 29
## 
## $Gender
## [1] "M" "M" "M" "M" "M" "M" "M" "F" "F"
## 
## $Education
## [1] "Ph.D." "Ph.D." "Ph.D." "M.Sc." "Ph.D." "Ph.D." "Ph.D." "Ph.D." "M.Sc."
## 
## $Income
## [1] 1e+06 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05 8e+05 2e+05
## 
## $Position
## [1] P   HOD P   AP  P   P   P   AP  SRF
## Levels: SRF AP P HOD
as.list(sqmatrix)
## [[1]]
## [1] 2
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 7
## 
## [[4]]
## [1] 1
# To convert any data structure to array
listtoarray=as.array(myfirstlist)
str(listtoarray)
## List of 3
##  $ mydataframe  :'data.frame':   9 obs. of  5 variables:
##   ..$ Name     : chr [1:9] "AKS" "AS" "SBN" "GKV" ...
##   ..$ Age      : num [1:9] 64 62 63 32 64 49 64 42 29
##   ..$ Gender   : chr [1:9] "M" "M" "M" "M" ...
##   ..$ Education: chr [1:9] "Ph.D." "Ph.D." "Ph.D." "M.Sc." ...
##   ..$ Income   : num [1:9] 1e+06 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05 8e+05 2e+05
##  $ myfirstvector: chr [1:4] "RJ" "AP" "KT" "TN"
##  $ myfirstmatrix: num [1:3, 1:2] 12 13 14 25 44 56
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "Row1" "Row2" "Row3"
##   .. ..$ : chr [1:2] "col1" "col2"
##  - attr(*, "dim")= int 3
##  - attr(*, "dimnames")=List of 1
##   ..$ : chr [1:3] "mydataframe" "myfirstvector" "myfirstmatrix"
# To convert to factor data structure
as.factor(age)
##   A   B   C   D   E   F   G   H 
##  35  23  31  34   1 0.5  47  56 
## Levels: 0.5 1 23 31 34 35 47 56

3 Control Structures in R

Control structures in programming language are meant to allow programmer to control the way code chunk is executed. These are logical statements or loops which execute the code or stop its execution when certain conditions are met with. The use of loops simplifies the task at hand by repeating a task finite number of times as specified. This helps in avoiding the mere repetition of simple task.

3.1 if statement

A if statement evaluates the condition and when the condition provided is met with then it executes the expression. Expression provided for execution can be simple or complex. If statement merely does the task if certain condition is met but not otherwise.

if(condition) {expression}

x=9
if(x>0){
  print("Positive number")
  }
## [1] "Positive number"
x=-9
if(x>0){
  print("Positive number")
  } 
# It will not return any thing because it only prints when condition X>0 is met with.

If you forget to put brackets either end of {} around the expression then R console will not print output and + will be shown constantly and new line with prompt operator will not appear until the missing brackets are put at the appropriate place. In case + sing constantly appear but you want to stop at that point then press esc button until prompt operator appears in console.

A view of R console when missing brackets and inverted comma

3.2 if-else statement

To overcome the limitation with if statement that it does not allow for proceeding further when given condition is not met with, if-else statement is used.

if (condition) {expression} else {alternative expression}

x=9
if(x>0){
  print("Positive number")
} else {
    print("Negative  or zero number")
  }
## [1] "Positive number"
x=-3
if(x>0){
  print("Positive number")
} else {
    print("Negative  or zero number")
  }
## [1] "Negative  or zero number"

3.3 nested if-else statement

Many times it is simply not enough to have simple if -else statement. In the case of #ifstatement it can be observed that it was not possible further to classify number either as zero or negative number. In such cases we need nested if-else statement. Nested means one if-else statement within if-else statement.

x=9
if(x>0){
  print("Positive number")
  } else {if(x<0){
    print("Negative number")
    } else {
    print("Zero")
      }
        }
## [1] "Positive number"
x=0
if(x>0){
  print("Positive number")
  } else {if(x<0){
    print("Negative number")
    } else {
    print("Zero")
      }
        }
## [1] "Zero"
x=-4
if(x>0){
  print("Positive number")
  } else {if(x<0){
    print("Negative number")
    } else {
    print("Zero")
      }
        }
## [1] "Negative number"

A series of nested if-else statements can be used as shown below:

insect="YSB"
if(insect=="YSB"){
  print("Management Practices:Clipping of tips of leaves")
    } else {
      if(insect=="YMV"){
        print("Management Practices:Sprinkler Irrigation is suggested")
          } else {
            print("The named insect is not in our directory")
             }
               }
## [1] "Management Practices:Clipping of tips of leaves"

3.4 switch function

The switch function from base package is a good alternative to using if-else statement or nested if else statement. It evaluates condition and depending on that it chooses the alternative from list of alternatives provided.

switch (expression, list)
# with chage in number the output of function will change
switch(1,"grasshopper","praying mentid","dragonfly")
## [1] "grasshopper"
switch(2,"grasshopper","praying mentid","dragonfly")
## [1] "praying mentid"
switch(3,"grasshopper","praying mentid","dragonfly")
## [1] "dragonfly"
# case of character string as expression with list
switch("democracy","democracy"=list("India","ShriLanka","USA"),
       "autocracy"=list("China","Russia"))
## [[1]]
## [1] "India"
## 
## [[2]]
## [1] "ShriLanka"
## 
## [[3]]
## [1] "USA"

3.5 for loop

A for loop uses index values to iteratively produce output when index values are applied to a function or data structure in expression.

for(variable in sequence) {expression}
# Examples of for loop
# seq_along function creates a sequence of length equal to object length
for(i in seq_along(mydept$Name)){
  print(mydept$Name[i])
    } 
## [1] "AKS"
## [1] "AS"
## [1] "SBN"
## [1] "GKV"
## [1] "NK"
## [1] "DR"
## [1] "HOS"
## [1] "AV"
## [1] "ASR"
# for printing numbers starting from 2 to 11 
for(i in 1:10){
  y=i+1 
    print(y)
      } 
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
# for printing tables of 1 to 10
for(i in 1:10){
    y=i*1:10
     print(y)
      }
##  [1]  1  2  3  4  5  6  7  8  9 10
##  [1]  2  4  6  8 10 12 14 16 18 20
##  [1]  3  6  9 12 15 18 21 24 27 30
##  [1]  4  8 12 16 20 24 28 32 36 40
##  [1]  5 10 15 20 25 30 35 40 45 50
##  [1]  6 12 18 24 30 36 42 48 54 60
##  [1]  7 14 21 28 35 42 49 56 63 70
##  [1]  8 16 24 32 40 48 56 64 72 80
##  [1]  9 18 27 36 45 54 63 72 81 90
##  [1]  10  20  30  40  50  60  70  80  90 100

3.6 nested for loops

Like nested if-else statements, nested for loops are also required on some occasions.

for(i in 1:3){
  for(j in 1:5){
    print(i*j)
      }
        }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15

3.8 while loop

The while loop executes the expression till it finds that the given condition is fulfilled. In case a condition is never fulfilled then while loop will repeat the expression infinitely till it is interrupted manually or system RAM is fully exhausted.

while (condition) {expression}

x=1
while (x<=5) {
  print(x)
    x=x+1
     }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

3.9 repeat loop and break statement

The repeat loop executes a statement repeated till the if statement breaks the repeating sequence based on condition provided. The break statement merely involves writing break within an if statement. It is used to end the loop regardless of what kind of loop it may be within.

x=1
repeat{
  print(x)
   x=x+1
    if(x>5){
      break
        }
          }
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5

3.10 next statement

The next statement is used to skip the current iteration of a loop.

# Print numbers which when divided by 3 gives zero remainder
## remember X %% Y is used to get the remainder while X %/% Y is used for finding the quotient.
for(i in 1:10){
    if(i %% 3 !=  0){
       next
        }
          print(i)
            }
## [1] 3
## [1] 6
## [1] 9

4 Writing custom functions in R

R has some inbuilt functions which are present in various packages of system library. However, users can create custom functions in R . Users must remember that functions can help solve the problem of repeatedly doing the same task. The syntax for a function involves providing arguments to a function within small brackets and within curly brackets writing functional expression which can be any formula or loop and using return function to return object to the console. The objects created within a function remain in the local environment and does not appear in the global environment. To see the user built function’s syntax, one can simply type the name of the function without adding small brackets to its end.

name_of_the_function<-function(argument1,argument2,..etc.){
                        functional expression
                            return(object name) 
                                 }
# function with default value set to 4
func<-function(x=4) {
  if(x>0){
    return("Positive") 
      } else if(x<0){
        return("Negative")
          } else {
            return("Zero")
              }
                }
# to see function's syntax 
func
## function(x=4) {
##   if(x>0){
##     return("Positive") 
##       } else if(x<0){
##         return("Negative")
##           } else {
##             return("Zero")
##               }
##                 }
# use function with various values provided to it as arguments
func(5)
## [1] "Positive"
func(-10)
## [1] "Negative"
func(0)
## [1] "Zero"
func() # even when no argument is provided to the function it returns output due to setting default value for argument. 
## [1] "Positive"
# creating a function that returns the three times object value plus one
algo1<-function(x){
  z=1+3*x
  return(z)
}
algo1(11)
## [1] 34
# creating a function with several arguments
algo2<-function(x,y,z){
  t=4*x+5*y+z*3
  return(t)
}
algo2(1,2,4)
## [1] 26
# creating function by name sumover for summing over object elements
# remember that if you create custom functions by name that exists in global environment or any package then your custom function will replace the built-in function
# for this reason sumover is chosen instead of sum
sumover <- function(x){
  z=x%*%rep(1,length(x))
  return(z)
}
sumover(c(2,3,5))
##      [,1]
## [1,]   10
sum(c(2,3,5)) # cross check with inbuilt function
## [1] 10
sumover(1:100)
##      [,1]
## [1,] 5050
# creating standard deviation function by name stdev 
stdev<-function(x){
  sd=sqrt(sum((x-mean(x))^2)/(length(x)-1))
return(sd)
}
stdev(c(2,8,10,12)) 
## [1] 4.320494
sd(c(2,8,10,12)) # crosscheck with inbuilt function for standard deviation
## [1] 4.320494
## Now to check which function is more efficient we use system.time function
system.time(stdev(1:1000000))
##    user  system elapsed 
##   0.024   0.004   0.028
system.time(sd(1:1000000))
##    user  system elapsed 
##   0.009   0.000   0.009
## now for sumover function against sum
system.time(sum(1:1000000))
##    user  system elapsed 
##       0       0       0
system.time(sumover(1:1000000))
##    user  system elapsed 
##   0.010   0.004   0.013

R has find function to find the the package for a given function. An example to demonstrate how to find package in which cor function is there. Name of the function must be typed within inverted comma, single or double.

find("cor")
## [1] "package:stats"

5 Installing and loading packages

R has collection of over 10,000 packages (Link to cran package list)spanning over different disciplines. Packages are hosted on repository. The default repository is CRAN (Comprehensive R Archive Network). To download packages from repository and install it in R , one has to use either install button in RStudio

or use command line from terminal (in Linux)

or use syntax to execute this task.

install.packages(c("gmp", "car"),dependencies = T,repos =""https://cloud.r-project.org"" )

To use any package from library, one can load the package by library or require function.

require(package name)
library(package name)

To directly use any function or data from any packages in R without having to use require or library function, one can use package name followed by two colon,:: , followed by package or data set name.

x=c(1,2,6,8,2,10,11)
stats::t.test(x,y=NULL,mu=0) ## using t.test function from stats pacakge in R
## 
##  One Sample t-test
## 
## data:  x
## t = 3.6771, df = 6, p-value = 0.01037
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  1.911745 9.516826
## sample estimates:
## mean of x 
##  5.714286

To find the file path of installed package(s) then use find.package function.

find.package("car")
## [1] "/home/gourav/R/x86_64-pc-linux-gnu-library/4.2/car"
find.package(c("car","cluster"))
## [1] "/home/gourav/R/x86_64-pc-linux-gnu-library/4.2/car"
## [2] "/usr/lib/R/library/cluster"

6 Attaching and detaching data sets

R has inbuilt data set in various packages. To see the list of packages use following code.

data()

Upon executing the above code a window will appear. A screenshot of the same is provided below.

Data sets in R

To load data from any packages installed in R use attach function and to unload it use detach function.

attach(cars)
detach(cars)

After using attach function, all variables in the data set can be used without indexing. attach and detach functions package can be used for data structures in local/global environment. The attach function can alter the search path and can easily conflict with existing variables in the environment, however, it gives warnings about masking. It is also observed that after use of the loaded data set is over users forget to detach the data set. Therefore, it is suggested to use with function for loading data sets as it doesn’t require detaching data set. The with function evaluates the expression in the context of data set provided to it.

with(data, expression,...)
with(cars,speed>25 | dist >73)
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
## [49]  TRUE  TRUE

However, using with function is not by side effects as its use overrides the local variables and from programmers perspective it must be used cautiously.

7 Reading and Writing Data

To read and write data in R through RStudio is pretty easy. Users can use the in-built facility by going to File then to select Import Dataset and choosing the type of file to import. There is one package in R System Library by name foreign which provides functions for exporting and importing data to and from various file formats such as ‘Minitab’, ‘SPSS’ , ‘SAS’, ‘S’, ‘STATA’, ‘Weka’, ‘dbase’, and many others.

Reading SPSS files

The download.link function will download the file from internet and save it to the destination provided and by the file name in the destination path provided. Next, read.spss function from package foreign will read the file from the working directory. If file is not in the working directory make sure to provide complete file path in argument file.

download.file("https://stats.oarc.ucla.edu/stat/data/binary.sav",destfile =  "binary.sav")
data_spp=foreign::read.spss(file="binary.sav",to.data.frame = T)
## re-encoding from CP1252
head(data_spp)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2
## Alternatively with haven package
data_spp=haven::read_sav(file="binary.sav")

Reading STATA file

download.file("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta",destfile =  "binary.dta")
data_stata=haven::read_stata(file="binary.dta")
head(data_stata)
## # A tibble: 6 × 4
##   admit   gre   gpa  rank
##   <dbl> <dbl> <dbl> <dbl>
## 1     0   380  3.61     3
## 2     1   660  3.67     3
## 3     1   800  4        1
## 4     1   640  3.19     4
## 5     0   520  2.93     4
## 6     1   760  3        2
## with foreign package
data_dta=foreign::read.dta(file="binary.dta")
head(data_dta)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

Reading and writing CSV file

user doesn’t need any package to load for reading a comma separated file.

## You can download the file and at destination in your computer decided by you and then read it as described following.
# download.file("https://stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv",destfile =  "data_nz.csv") 
data_nz=read.csv("data_nz.csv",sep = ",")
head(data_nz)
##   year industry_code_ANZSIC              industry_name_ANZSIC rme_size_grp
## 1 2011                    A Agriculture, Forestry and Fishing          a_0
## 2 2011                    A Agriculture, Forestry and Fishing          a_0
## 3 2011                    A Agriculture, Forestry and Fishing          a_0
## 4 2011                    A Agriculture, Forestry and Fishing          a_0
## 5 2011                    A Agriculture, Forestry and Fishing          a_0
## 6 2011                    A Agriculture, Forestry and Fishing          a_0
##                                          variable value              unit
## 1                                   Activity unit 46134             COUNT
## 2                          Rolling mean employees     0             COUNT
## 3                         Salaries and wages paid   279 DOLLARS(millions)
## 4 Sales, government funding, grants and subsidies  8187 DOLLARS(millions)
## 5                                    Total income  8866 DOLLARS(millions)
## 6                               Total expenditure  7618 DOLLARS(millions)
#writing a data frame to a csv file
write.csv(mydept,file = "mydept.csv")
## now reading the file created
read.csv("mydept.csv",row.names = 1)
##              Name Age Gender Education Income Position
## A Shrivastav  AKS  64      M     Ph.D.  1e+06        P
## A Sarawgi      AS  62      M     Ph.D.  2e+05      HOD
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05        P
## GK Vani       GKV  32      M     M.Sc.  4e+05       AP
## N Khan         NK  64      M     Ph.D.  5e+05        P
## D Rathi        DR  49      M     Ph.D.  6e+05        P
## HO Sharma     HOS  64      M     Ph.D.  7e+05        P
## A Verma        AV  42      F     Ph.D.  8e+05       AP
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05      SRF

Reading and Writing ASCII and RTF file

## Reading file stored in working directory
mydept=read.table("mydept.txt",sep = ",")
head(mydept)
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
write.table(mydept,file="mydept.txt",sep = ",")
## remember write.table can also write rtf file 
write.table(mydept,file="mydept.rtf",sep = ",")
## how to read rtf files
read.table("mydept.rtf",sep = ",")
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05

Scan function to load data from files

To scan data from any file or output of some analysis from console last run, this function can come very handy. When scan function is not supplied with any argument then it takes data from console.

scan("http://www.econometrics.com/comdata/gujarati/data_2.1.shd",what = list(0,0))
## [[1]]
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18
## 
## [[2]]
##  [1]  4.4567  5.7700  5.9787  7.3317  7.3182  6.5844  7.8182  7.8351 11.0223
## [10] 10.6738 10.8361 13.6150 13.5310

Here, argument what is used to define the list of data types like numeric, character, integer or others. Writing two zeros indicate two numeric columns in the data.

However, apart from R has its own file type for saving objects, i.e., rds and RData. While a .rds file can save only single object, .RData file can save multiple objects at once.

saveRDS(mydept,"mydept.rds")
readRDS("mydept.rds")
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
save(age,mydept,file = "myDept.RData")
## remove the objects saved to later test for it loads properly
rm(mydept); rm(age)
## Now load the objects and find them in environment
load("myDept.RData")
## now tyr to print them to see that they are the same objects saved in .Rdata file
mydept
##              Name Age Gender Education Income
## A Shrivastav  AKS  64      M     Ph.D.  1e+06
## A Sarawgi      AS  62      M     Ph.D.  2e+05
## S B Nahatkar  SBN  63      M     Ph.D.  3e+05
## GK Vani       GKV  32      M     M.Sc.  4e+05
## N Khan         NK  64      M     Ph.D.  5e+05
## D Rathi        DR  49      M     Ph.D.  6e+05
## HO Sharma     HOS  64      M     Ph.D.  7e+05
## A Verma        AV  42      F     Ph.D.  8e+05
## AS Rajpoot    ASR  29      F     M.Sc.  2e+05
age
##    A    B    C    D    E    F    G    H 
## 35.0 23.0 31.0 34.0  1.0  0.5 47.0 56.0
## when saving all objects in the workspace 
save.image() # save the whole current workspace 

8 Statistical functions in R

8.1 Arithmetic mean (non-positional average)

x=c(2,3,5,6,7)
# Calculate arithmetic mean of the vector
mean(x)
## [1] 4.6
# calculate weighted arithmetic mean of the vector
weighted.mean(x,w=c(.3,0.1,0.1,0.05,0.45))
## [1] 4.85

8.2 Positional values of the distribution

# Calculate quantile values like decile, percentile and quartile, 50th percentile = 5st decile = 2nd quartile = Median
quantile(x,0.5)
## 50% 
##   5
# first quartile value= 25th percentile 
quantile(x,0.25)
## 25% 
##   3
# third quartile value = 75th percentile
quantile(x,0.75)
## 75% 
##   6
# there is no function in system library packages which calculates mode in R
# lets create a function for mode by name getmode
getmode<-function(x){
  L=table(x)
  z=names(which.max(L))
  return(z)
}
#Now lets use getmode function
getmode(c(2,3,3,6,4,1,2,1,5))
## [1] "1"
# median function to directly calculate median values
median(x)
## [1] 5
#box plot to see all position values in one box-whisker plot
boxplot.stats(c(2,3,3,6,4,1,2,1,5))
## $stats
## [1] 1 2 3 4 6
## 
## $n
## [1] 9
## 
## $conf
## [1] 1.946667 4.053333
## 
## $out
## numeric(0)
# to calculate Tukey's five numbers summary of the data
fivenum(x)
## [1] 2 3 5 6 7
# summarize the data with summary function
summary(x)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0     3.0     5.0     4.6     6.0     7.0

8.3 Sampling and distributions

The basic sampling from a given set of numbers can be performed using sample function in R. It has defaulted to without replacement sampling.

sample(x=1:500,size = 10,replace = T)
##  [1] 117 232 328  99 191 477  62 289 409 357

To be able to reproduce the same set of randome numbers, set.seed is supplied along with sample function.

sample(x=1:500,size = 10,replace = T,set.seed(123))
##  [1] 415 463 179  14 195 426 306 118 299 229
sample(x=1:500,size = 10,replace = T,set.seed(1234))
##  [1] 284 336 406 101 492 111 393 133 422 400

As the set.seed argument’s value changes, random numbers change too. To sample values from any statistical distribution the user shall have to provide values for parameters of the distribution. To sample from normal distribution, use rnorm function. For normal distribution, mean and standard deviation are parameters.

rnorm(n=10,mean=0,sd=1)
##  [1]  0.1129751  1.4285520  0.9834038 -0.6224568 -0.7315360 -0.5166697
##  [7] -1.7507334  0.8801042  1.3700104 -1.6873268

To sample from student’s t distribution, use rt function. The degree of freedom is parameter for student’s t distribution.

rt(n=10,df=6)
##  [1] -0.683363732  0.971285806  0.173864595  0.005550894 -1.342436221
##  [6] -1.126752596  1.215551218  0.488233793 -1.028252987  0.981762159

To sample from chisquare distribution, use rchisq function. The degree of freedom is parameter for chisquare distribution.

rchisq(n=10,df=6)
##  [1]  2.087914  3.480708  7.266326  5.531569  2.727402  2.305482 17.209371
##  [8]  4.975981  6.142541  4.405026

To sample from Fisher’s F distribution, use rf function. The numerator and denominator degree of freedom are parameter for F distribution.

rf(n=10,df1 = 10,df2 = 6)
##  [1] 1.5218770 1.4788311 1.1300568 2.0183948 0.1943343 1.0240435 0.8167318
##  [8] 0.6252728 0.8048146 0.7135741

To sample from uniform distribution, use runif function. The minimum and maximum values are parameter for uniform distribution.

runif(n=10,min=0,max=100)
##  [1] 95.599261  2.220682 84.171063 63.244245 31.009417 74.256937 63.891131
##  [8] 99.251599 12.826979 88.323958

To sample from binomial distribution, use rbinom function. The probability of success and number of trials are parameter for binomial distribution.

rbinom(n=10,prob = 0.5,size = 100)
##  [1] 49 52 45 46 47 58 52 48 56 45

To sample from poison distribution, use rpois function. The mean/lambda is parameter for poison distribution.

rpois(n=10,lambda = 4)
##  [1] 4 2 5 3 4 9 3 3 2 5

8.4 Test of normality

Shapiro-Wilk test is a popular test for testing normality of the data. Here, null hypothesis is that data follows normal distribution versus alternate hypothesis of data doesn’t follow normal distribution. If p-value is less than chosen level of significance (may be 0.05 or 0.01) then reject the null hypothesis of data following normal distribution.

## testing for normality of data 
shapiro.test(x)
## 
##  Shapiro-Wilk normality test
## 
## data:  x
## W = 0.95235, p-value = 0.754

8.5 Transformation towards normality

When the variable doesn’t follow normality and all values of the variable under consideration are positive then one can use Box-Cox transformation of the variable but it doesn’t always returns a series which follows normality, success rate isn’t 100% in this method. Basically, MASS package from system library has boxcox() for this purpose. It is used for transforming the dependent variable of the linear regression model towards normality if it doesn’t. Box-Cox power transformation tries to select the maximum value of \(\lambda\) such that \(y^\lambda\) follows the normal distribution. The output of this function would be series of x and y where x is the series of lambda values and y is series of log-likelihood values. The optimal \(\lambda\) values are the one which correspond to maximum log-likelihood.

k=MASS::boxcox(x~1) ## first save output of boxcox()

kt=k$x[which.max(k$y)] ## optimal lambda value
kt
## [1] 0.8282828
shapiro.test(x^kt) # now check if transformed variable follows normal distribution
## 
##  Shapiro-Wilk normality test
## 
## data:  x^kt
## W = 0.9496, p-value = 0.7343
y=sin(x)
L=MASS::boxcox(x~y) ## Here x is a linear function of y

tt=L$x[which.max(L$y)] ## optimal lambda value
tt
## [1] 0.4242424
shapiro.test(resid(lm(x^tt~y)))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(lm(x^tt ~ y))
## W = 0.94223, p-value = 0.6817

Another transformation for percent and proportions. It is also referred to as arcsin square root transformation and angular transformation and inverse transformation.

z=runif(100)
shapiro.test(z)
## 
##  Shapiro-Wilk normality test
## 
## data:  z
## W = 0.95905, p-value = 0.003443
par(mfrow=c(1,2))
# Density plot is made to see if after tranformation data follows normal distribution
plot(density(z))
plot(density(asin(sqrt(z))))

shapiro.test(asin(sqrt(z)))
## 
##  Shapiro-Wilk normality test
## 
## data:  asin(sqrt(z))
## W = 0.98285, p-value = 0.2202

8.6 Test of equality or homogeneity of variances

## levene's test for homogeneity of variances
x=c(2,3,3,6,4,1,2,1,5)
y=c(1,3,5,6,4,8,12,11,15)
xydata=data.frame("Values"=c(x,y), "Group"=c(rep(0,length(x)),rep(1,length(y))))
car::leveneTest(y=xydata$Values,group=xydata$Group,center="mean")
## Warning in leveneTest.default(y = xydata$Values, group = xydata$Group, center =
## "mean"): xydata$Group coerced to factor.
## Levene's Test for Homogeneity of Variance (center = "mean")
##       Df F value   Pr(>F)   
## group  1   8.881 0.008839 **
##       16                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Fligner-Killeen test for homogeneity of variances
fligner.test(x=xydata$Values,g=xydata$Group)
## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  xydata$Values and xydata$Group
## Fligner-Killeen:med chi-squared = 4.2516, df = 1, p-value = 0.03921
## Bartlett test for homogeneity of variances
bartlett.test(x=xydata$Values,g=xydata$Group)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  xydata$Values and xydata$Group
## Bartlett's K-squared = 6.343, df = 1, p-value = 0.01178
## F test for equality of variances in two samples
var.test(x,y,alternative = "two.sided")
## 
##  F test to compare two variances
## 
## data:  x and y
## F = 0.1399, num df = 8, denom df = 8, p-value = 0.01166
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.0315561 0.6201970
## sample estimates:
## ratio of variances 
##          0.1398964

8.7 Test of equality of proportions

## On grouped data (case of multiple groups)
sucess=c(40,20,40,10,22,56,28,47) # number of sucess in diferent trials
no._trials=c(55,20,50,29,62,66,29,52) #number of trials 
prop.test(sucess,no._trials)
## 
##  8-sample test for equality of proportions without continuity correction
## 
## data:  sucess out of no._trials
## X-squared = 94.332, df = 7, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4    prop 5    prop 6    prop 7    prop 8 
## 0.7272727 1.0000000 0.8000000 0.3448276 0.3548387 0.8484848 0.9655172 0.9038462
## On single group
prop.test(47,n=100,p=0.5)
## 
##  1-sample proportions test with continuity correction
## 
## data:  47 out of 100, null probability 0.5
## X-squared = 0.25, df = 1, p-value = 0.6171
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.3703535 0.5719775
## sample estimates:
##    p 
## 0.47

8.8 One-sample t-test

##one sample  t-test 
t.test(x,y=NULL,mu=5,alternative = "two.sided")
## 
##  One Sample t-test
## 
## data:  x
## t = -3.4641, df = 8, p-value = 0.008516
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
##  1.668628 4.331372
## sample estimates:
## mean of x 
##         3
t.test(x,y=NULL,mu=5,alternative = "greater")
## 
##  One Sample t-test
## 
## data:  x
## t = -3.4641, df = 8, p-value = 0.9957
## alternative hypothesis: true mean is greater than 5
## 95 percent confidence interval:
##  1.926389      Inf
## sample estimates:
## mean of x 
##         3
t.test(x,y=NULL,mu=5,alternative = "less")
## 
##  One Sample t-test
## 
## data:  x
## t = -3.4641, df = 8, p-value = 0.004258
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
##      -Inf 4.073611
## sample estimates:
## mean of x 
##         3

8.9 Two sample t-test (paired and non-paired)

## two sample t-test 
t.test(x,y,mu=0,paired = F,var.equal = T,alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  x and y
## t = -2.562, df = 16, p-value = 0.02089
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.7159197 -0.7285248
## sample estimates:
## mean of x mean of y 
##  3.000000  7.222222
t.test(x,y,mu=0,paired = F,var.equal = F,alternative = "two.sided")
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = -2.562, df = 10.195, p-value = 0.02788
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.8847773 -0.5596672
## sample estimates:
## mean of x mean of y 
##  3.000000  7.222222
t.test(x,y,mu=0,paired = T,var.equal = T,alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  x and y
## t = -2.5752, df = 8, p-value = 0.03286
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -8.0031350 -0.4413095
## sample estimates:
## mean difference 
##       -4.222222
t.test(x,y,mu=0,paired = T,var.equal = F,alternative = "two.sided")
## 
##  Paired t-test
## 
## data:  x and y
## t = -2.5752, df = 8, p-value = 0.03286
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -8.0031350 -0.4413095
## sample estimates:
## mean difference 
##       -4.222222

8.10 Wilcox Test

Wilcox test is non-parametric alternative to Wlech t-test. When assumptions underlying t-test (normality of data and equality of variance) are not fulfilled then wilcox test is used. It based on rank data.

One sample test

wilcox.test(x,y=NULL,mu=0)
## Warning in wilcox.test.default(x, y = NULL, mu = 0): cannot compute exact p-
## value with ties
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x
## V = 45, p-value = 0.008969
## alternative hypothesis: true location is not equal to 0

Two sample test

wilcox.test(x,y,paired = T,mu=0,alternative = "two.sided") ## paired test/signed rank test
## Warning in wilcox.test.default(x, y, paired = T, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with ties
## Warning in wilcox.test.default(x, y, paired = T, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with zeroes
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  x and y
## V = 1, p-value = 0.05639
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x,y,paired = F,mu=0,alternative = "two.sided") ## non-paired test
## Warning in wilcox.test.default(x, y, paired = F, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  x and y
## W = 16.5, p-value = 0.03679
## alternative hypothesis: true location shift is not equal to 0

8.11 Chi-square test

## Chisquare test of goodness of fit
btw=rbinom(n=100,size=100,prob = 0.6) # observed series
btk=rnorm(n=100,mean=60,sd=36) # expected series
chisq.test(btw,btk)
## Warning in chisq.test(btw, btk): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  btw and btk
## X-squared = 2100, df = 2079, p-value = 0.3688
## chisquare test for association between two factors
mytable =as.table(rbind(c(10,7),c(22,13))) # observed data
chisq.test(mytable)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable
## X-squared = 1.0186e-30, df = 1, p-value = 1
mytable2=as.table(rbind(c(19,12),c(29,10))) ## expected data
chisq.test(mytable,mytable2)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  mytable
## X-squared = 1.0186e-30, df = 1, p-value = 1

8.12 Fisher’s exact probability test

library(dplyr)
table(mydept$Gender,mydept$Education) %>%fisher.test()
## 
##  Fisher's Exact Test for Count Data
## 
## data:  .
## p-value = 0.4167
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    0.03835631 543.93189281
## sample estimates:
## odds ratio 
##   4.582616

8.13 Correlation

# to get correlation between two data series
cor(x,y,method="pearson")
## [1] 0.01558447
cor.test(x,y,method = "pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  x and y
## t = 0.041238, df = 7, p-value = 0.9683
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.6553198  0.6727433
## sample estimates:
##        cor 
## 0.01558447
cor(x,y,method="spearman")
## [1] -0.05063697
cor.test(x,y,method = "spearman")
## Warning in cor.test.default(x, y, method = "spearman"): Cannot compute exact p-
## value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  x and y
## S = 126.08, p-value = 0.8971
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##         rho 
## -0.05063697
cor(x,y,method="kendall")
## [1] 0.02901294
cor.test(x,y,method = "kendall")
## Warning in cor.test.default(x, y, method = "kendall"): Cannot compute exact p-
## value with ties
## 
##  Kendall's rank correlation tau
## 
## data:  x and y
## z = 0.106, p-value = 0.9156
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## 0.02901294
## to get correlation matrix 
normmatrix=matrix(rnorm(100),ncol=5)
cor(normmatrix)
##             [,1]        [,2]        [,3]        [,4]        [,5]
## [1,]  1.00000000  0.03131195  0.06883345 -0.06844665  0.07201019
## [2,]  0.03131195  1.00000000 -0.02023516  0.12929262  0.37113451
## [3,]  0.06883345 -0.02023516  1.00000000 -0.23669943 -0.52873520
## [4,] -0.06844665  0.12929262 -0.23669943  1.00000000  0.46770027
## [5,]  0.07201019  0.37113451 -0.52873520  0.46770027  1.00000000
## Canonical correlation coefficient
cancor(x,y,xcenter=T,ycenter = T) # this correlation coefficient is same as that of pearson product-movement correlation coefficient because of only two variables involved 
## $cor
## [1] 0.01558447
## 
## $xcoef
##           [,1]
## [1,] 0.2041241
## 
## $ycoef
##            [,1]
## [1,] 0.07634799
## 
## $xcenter
## [1] 3
## 
## $ycenter
## [1] 7.222222
cancor(matrix(rnorm(100),ncol = 5),matrix(rnorm(100),ncol = 5),xcenter=T,ycenter = T)
## $cor
## [1] 0.77707078 0.61943548 0.33109628 0.31381387 0.07848493
## 
## $xcoef
##             [,1]         [,2]        [,3]        [,4]         [,5]
## [1,] -0.02860498 -0.224508287  0.02549674 -0.06488900 -0.006796558
## [2,]  0.14981792  0.080590107  0.05677894 -0.13345965  0.119967052
## [3,]  0.04311825  0.060642380  0.05073523 -0.20691642 -0.207465303
## [4,]  0.16437803 -0.048276393 -0.17849364 -0.01354266 -0.019688846
## [5,] -0.16839838 -0.003664575 -0.12380930 -0.08711594  0.074002466
## 
## $ycoef
##             [,1]        [,2]         [,3]        [,4]         [,5]
## [1,]  0.02754389 -0.20457794  0.199938700  0.32743661  0.005693569
## [2,] -0.11605644 -0.05667172  0.141964899 -0.06053659 -0.236624258
## [3,] -0.05788383  0.11812715 -0.082663530  0.13650880 -0.063810872
## [4,] -0.12905245  0.03010167  0.006015531  0.10318323  0.199379184
## [5,]  0.13015360  0.18268755  0.126331340 -0.07662970  0.102261273
## 
## $xcenter
## [1] -0.41842507 -0.20901105 -0.04350534  0.01777745 -0.04364079
## 
## $ycenter
## [1] -0.08832676 -0.30453652 -0.19765487 -0.05939289  0.14062205

8.14 Regression

lm_1=lm(y~x)
summary(lm_1)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.181 -3.264 -1.347  3.861  7.694 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  7.09722    3.45119   2.056   0.0788 .
## x            0.04167    1.01040   0.041   0.9683  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.95 on 7 degrees of freedom
## Multiple R-squared:  0.0002429,  Adjusted R-squared:  -0.1426 
## F-statistic: 0.001701 on 1 and 7 DF,  p-value: 0.9683
anova(lm_1)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq F value Pr(>F)
## x          1   0.042  0.0417  0.0017 0.9683
## Residuals  7 171.514 24.5020

8.15 Cluster Analysis

library(cluster)


## Hierarchical clustering with cutting tree in desired number of groups
## Use dist function to calculate distance matrix between rows of data matrix
clust_1=hclust(dist(agriculture, method = "euclidean"),method = "ward.D")
summary(clust_1)
##             Length Class  Mode     
## merge       22     -none- numeric  
## height      11     -none- numeric  
## order       12     -none- numeric  
## labels      12     -none- character
## method       1     -none- character
## call         3     -none- call     
## dist.method  1     -none- character
plot(clust_1)
## determine the membership in five group scenario
membership=cutree(clust_1,k=5)
## use membership to cut dendrogram in five rectangular shapes
rect.hclust(clust_1,k=5,membership)

## Clustering Large Applications (CLARA)
clara_1=clara(agriculture,2,metric = "euclidean")
summary(clara_1)
## Object of class 'clara' from call:
##  clara(x = agriculture, k = 2, metric = "euclidean") 
## Medoids:
##      x    y
## D 18.7  3.5
## P  7.8 17.4
## Objective function:    3.36061 
## Numerical information per cluster:
##      size max_diss  av_diss isolation
## [1,]    8 5.423099 2.891691 0.3070127
## [2,]    4 7.430343 4.298448 0.4206469
## Average silhouette width per cluster:
## [1] 0.7078449 0.4785275
## Average silhouette width of best sample: 0.6314058 
## 
## Best sample:
##  [1] B   DK  D   GR  E   F   IRL I   L   NL  P   UK 
## Clustering vector:
##   B  DK   D  GR   E   F IRL   I   L  NL   P  UK 
##   1   1   1   2   2   1   2   1   1   1   2   1 
## 
## Silhouette plot information for best sample:
##     cluster neighbor sil_width
## D         1        2 0.7928036
## B         1        2 0.7657461
## NL        1        2 0.7607978
## F         1        2 0.7459605
## L         1        2 0.7442013
## DK        1        2 0.7080748
## UK        1        2 0.6314596
## I         1        2 0.5137157
## P         2        1 0.6516167
## GR        2        1 0.5745566
## IRL       2        1 0.5153727
## E         2        1 0.1725640
## 
## 66 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.649   4.357   7.987   9.594  13.250  24.035 
## Metric :  euclidean 
## Number of objects : 12
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"
par(mfrow=c(1,2))
plot(clara_1)

8.16 Principal Component

library(dplyr)
## Q-type pca
# center=T for centering data from mean, retx=T for getting rotated pca matrix of eignen vectors, scale.=T for making each variable variance equa to one, rank.= 3 for keeping 3 principal components, tol=0.1 for threshold below which components should be omitted.
prcom.1=prcomp(USArrests,center = T,retx = T,scale. = T,rank. = 3,tol = 0.1)
prcom.1$rotation
##                 PC1        PC2        PC3
## Murder   -0.5358995  0.4181809 -0.3412327
## Assault  -0.5831836  0.1879856 -0.2681484
## UrbanPop -0.2781909 -0.8728062 -0.3780158
## Rape     -0.5434321 -0.1673186  0.8177779
head(prcom.1$x) ## pca scores 
##                   PC1        PC2         PC3
## Alabama    -0.9756604  1.1220012 -0.43980366
## Alaska     -1.9305379  1.0624269  2.01950027
## Arizona    -1.7454429 -0.7384595  0.05423025
## Arkansas    0.1399989  1.1085423  0.11342217
## California -2.4986128 -1.5274267  0.59254100
## Colorado   -1.4993407 -0.9776297  1.08400162
summary(prcom.1)
## Importance of first k=3 (out of 4) components:
##                           PC1    PC2     PC3
## Standard deviation     1.5749 0.9949 0.59713
## Proportion of Variance 0.6201 0.2474 0.08914
## Cumulative Proportion  0.6201 0.8675 0.95664
screeplot(prcom.1)

biplot(prcom.1)

## R-type pca
# argument cor =TRUE for using correlation matrix to compute principal components
# scores=TRUE for getting principal component scores upon analysis
prcom.2=princomp(USArrests,cor = T,scores = T)
prcom.2$loadings ## pca loadings or eigen vectors
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4
## Murder    0.536  0.418  0.341  0.649
## Assault   0.583  0.188  0.268 -0.743
## UrbanPop  0.278 -0.873  0.378  0.134
## Rape      0.543 -0.167 -0.818       
## 
##                Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings      1.00   1.00   1.00   1.00
## Proportion Var   0.25   0.25   0.25   0.25
## Cumulative Var   0.25   0.50   0.75   1.00
prcom.2$scores ## pca scores
##                     Comp.1      Comp.2      Comp.3       Comp.4
## Alabama         0.98556588  1.13339238  0.44426879  0.156267145
## Alaska          1.95013775  1.07321326 -2.04000333 -0.438583440
## Arizona         1.76316354 -0.74595678 -0.05478082 -0.834652924
## Arkansas       -0.14142029  1.11979678 -0.11457369 -0.182810896
## California      2.52398013 -1.54293399 -0.59855680 -0.341996478
## Colorado        1.51456286 -0.98755509 -1.09500699  0.001464887
## Connecticut    -1.35864746 -1.08892789  0.64325757 -0.118469414
## Delaware        0.04770931 -0.32535892  0.71863294 -0.881977637
## Florida         3.01304227  0.03922851  0.57682949 -0.096284752
## Georgia         1.63928304  1.27894240  0.34246008  1.076796812
## Hawaii         -0.91265715 -1.57046001 -0.05078189  0.902806864
## Idaho          -1.63979985  0.21097292 -0.25980134 -0.499104101
## Illinois        1.37891072 -0.68184119  0.67749564 -0.122021292
## Indiana        -0.50546136 -0.15156254 -0.22805484  0.424665700
## Iowa           -2.25364607 -0.10405407 -0.16456432  0.017555916
## Kansas         -0.79688112 -0.27016470 -0.02555331  0.206496428
## Kentucky       -0.75085907  0.95844029  0.02836942  0.670556671
## Louisiana       1.56481798  0.87105466  0.78348036  0.454728038
## Maine          -2.39682949  0.37639158  0.06568239 -0.330459817
## Maryland        1.76336939  0.42765519  0.15725013 -0.559069521
## Massachusetts  -0.48616629 -1.47449650  0.60949748 -0.179598963
## Michigan        2.10844115 -0.15539682 -0.38486858  0.102372019
## Minnesota      -1.69268181 -0.63226125 -0.15307043  0.067316885
## Mississippi     0.99649446  2.39379599  0.74080840  0.215508013
## Missouri        0.69678733 -0.26335479 -0.37744383  0.225824461
## Montana        -1.18545191  0.53687437 -0.24688932  0.123742227
## Nebraska       -1.26563654 -0.19395373 -0.17557391  0.015892888
## Nevada          2.87439454 -0.77560020 -1.16338049  0.314515476
## New Hampshire  -2.38391541 -0.01808229 -0.03685539 -0.033137338
## New Jersey      0.18156611 -1.44950571  0.76445355  0.243382700
## New Mexico      1.98002375  0.14284878 -0.18369218 -0.339533597
## New York        1.68257738 -0.82318414  0.64307509 -0.013484369
## North Carolina  1.12337861  2.22800338  0.86357179 -0.954381667
## North Dakota   -2.99222562  0.59911882 -0.30127728 -0.253987327
## Ohio           -0.22596542 -0.74223824  0.03113912  0.473915911
## Oklahoma       -0.31178286 -0.28785421  0.01530979  0.010332321
## Oregon          0.05912208 -0.54141145 -0.93983298 -0.237780688
## Pennsylvania   -0.88841582 -0.57110035  0.40062871  0.359061124
## Rhode Island   -0.86377206 -1.49197842  1.36994570 -0.613569430
## South Carolina  1.32072380  1.93340466  0.30053779 -0.131466685
## South Dakota   -1.98777484  0.82334324 -0.38929333 -0.109571764
## Tennessee       0.99974168  0.86025130 -0.18808295  0.652864291
## Texas           1.35513821 -0.41248082  0.49206886  0.643195491
## Utah           -0.55056526 -1.47150461 -0.29372804 -0.082314047
## Vermont        -2.80141174  1.40228806 -0.84126309 -0.144889914
## Virginia       -0.09633491  0.19973529 -0.01171254  0.211370813
## Washington     -0.21690338 -0.97012418 -0.62487094 -0.220847793
## West Virginia  -2.10858541  1.42484670 -0.10477467  0.131908831
## Wisconsin      -2.07971417 -0.61126862  0.13886500  0.184103743
## Wyoming        -0.62942666  0.32101297  0.24065923 -0.166651801
head(prcom.2$scores)
##                Comp.1     Comp.2      Comp.3       Comp.4
## Alabama     0.9855659  1.1333924  0.44426879  0.156267145
## Alaska      1.9501378  1.0732133 -2.04000333 -0.438583440
## Arizona     1.7631635 -0.7459568 -0.05478082 -0.834652924
## Arkansas   -0.1414203  1.1197968 -0.11457369 -0.182810896
## California  2.5239801 -1.5429340 -0.59855680 -0.341996478
## Colorado    1.5145629 -0.9875551 -1.09500699  0.001464887
summary(prcom.2)
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000
screeplot(prcom.2)

biplot(prcom.2)

8.17 Factor analysis

factanal(x, factors, data = NULL, covmat = NULL, n.obs = NA,
         subset, na.action, start = NULL,
         scores = c("none", "regression", "Bartlett"),
         rotation = "varimax", control = NULL, ...)
# first lets create data for analysis 
m1=matrix(runif(18*6,1,6),ncol=6)
# now data is ready by object name m1
factanal(m1, factors = 3, rotation = "promax",scores = "regression")# varimax is the default
## 
## Call:
## factanal(x = m1, factors = 3, scores = "regression", rotation = "promax")
## 
## Uniquenesses:
## [1] 0.635 0.005 0.837 0.688 0.005 0.825
## 
## Loadings:
##      Factor1 Factor2 Factor3
## [1,]                  0.590 
## [2,]          0.982         
## [3,] -0.193  -0.113   0.346 
## [4,]  0.106           0.550 
## [5,]  0.976  -0.102         
## [6,] -0.306  -0.239  -0.198 
## 
##                Factor1 Factor2 Factor3
## SS loadings      1.115   1.050   0.813
## Proportion Var   0.186   0.175   0.136
## Cumulative Var   0.186   0.361   0.496
## 
## Factor Correlations:
##         Factor1 Factor2 Factor3
## Factor1  1.0000 -0.1190 -0.0654
## Factor2 -0.1190  1.0000  0.0269
## Factor3 -0.0654  0.0269  1.0000
## 
## The degrees of freedom for the model is 0 and the fit was 0.01

9 Plotting Data

9.1 Scatter plot

Plot function is a powerful function for graphics. It has several arguments which allows users to fine tune the plots.

plot(x, y = NULL, type = “p”, xlim = NULL, ylim = NULL, log = ““, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par(”ann”), axes = TRUE, frame.plot = axes, panel.first = NULL, panel.last = NULL, asp = NA, xgap.axis = NA, ygap.axis = NA, …)

Possible types are

  • "p" for points,

  • "l" for lines,

  • "b" for both (points as well as lines),

  • "c" for the lines part alone of "b",

  • "o" for both ‘overplotted’,

  • "h" for ‘histogram’ like (or ‘high-density’) vertical lines,

  • "s" for stair steps,

  • "S" for other steps, see ‘Details’ below,

  • "n" for no plotting.

    All other types give a warning or an error; using, e.g., type = "punkte" being equivalent to type = "p" for S compatibility.

#to plot the data simply use plot function
plot(c(2,3,3,6,4,1,2,1,5))

x=1:10
y=sin(x)
## using parameter option to set various graphs in rows and column like structure
par(mfrow=c(3,3))
## plot one vector against another
plot(x,y, main="Point plot", type="p")
## now to create a line lets plot x and y
plot(x,y, main="Line plot",type = "l")
## create graph with only lines no points
plot(x,y, main="Line plot without points",type =  "c")
## have both points and lines 
plot(x,y, main="Plot having both point and line",type = "b")
## create overploted graph 
plot(x,y, main="Over Plot",type = "o")
## create stair steps/setp graph 
plot(x,y, main="Step plot",type =  "s")
## using "S" insted of "s"
plot(x,y, main="Step plot with type `S`",type = "S")
## create histogram like vertical lines
plot(x,y, main="Histogram like verticle Plot ",type="h")
## Plot nothing 
plot(x,y, main="Plot nothing","n")

## use scatter.smooth to plot scatter diagram along with a smooth regression line
scatter.smooth(x,y, main="Smooth Scatter plot between X and Y")

## to create scatter plot for numeric data you can use pairs function
num_matrix=matrix(rnorm(100),ncol=4)
pairs(num_matrix)

## pairs function using swiss data
pairs(swiss, panel = panel.smooth, pch = ".") 

9.2 Bar plot

optt=tapply(mydept$Income,INDEX = list(as.factor(mydept$Gender),as.factor(mydept$Education)),FUN = "mean")

barplot(optt,col = 1:nrow(optt),legend.text = rownames(optt), args.legend = list(x = "topleft"),ylab = "Mean Income Level",xlab = "Level of Education",main = "Mean income level by edcuation and gender")

##alternatively
subb=aggregate(mydept$Income,list(mydept$Gender,mydept$Education),"mean")
subb
##   Group.1 Group.2      x
## 1       F   M.Sc. 200000
## 2       M   M.Sc. 400000
## 3       F   Ph.D. 800000
## 4       M   Ph.D. 550000
colnames(subb)<-c("Gender","Education","Income")
barplot(Income~Gender+Education,subb,col = 1:2,legend.text=T,args.legend = list(x = "topleft","Gender"),ylab = "Mean Income Level",xlab = "Level of Education",main = "Mean income level by edcuation and gender")

9.3 Histogram

x=rbinom(n=10000,size=1000,prob = 0.5)
hist(x, # histogram
 col=6, # column color
 border="black",prob = TRUE, # show densities instead of frequencies
 main = "Histogram of random number from binomial distribution")
lines(density(x), # density plot
 lwd = 2, # thickness of line
 col = 9)

## To create histogram with class intervals such that it contains equal number of observtions then we use co.intervals function 
hist(x,plot = T,breaks = co.intervals(x),freq = F)

# the co.intervals function's output is provided below
co.intervals(x)
##       [,1]  [,2]
## [1,] 438.5 491.5
## [2,] 482.5 497.5
## [3,] 490.5 503.5
## [4,] 496.5 509.5
## [5,] 502.5 517.5
## [6,] 508.5 558.5

9.4 Box plot

#box plot to see all position values in one box-whisker plot
boxplot(c(2,3,3,6,4,1,2,1,5))

boxplot(Income~Gender,data=mydept,main="Boxplot of income by gender")

9.5 Coplot/Conditioning Plot

require(graphics)
coplot(Income~Age| Gender,data = mydept) ## Plot income against age but also differentiate by a factor called Gender

coplot(len ~ dose | supp, data = ToothGrowth, panel = panel.smooth,
xlab = "ToothGrowth data: length vs dose, given type of supplement") 

9.6 Quantile Comparison plot (QQ Plot)

## Q-Q plot to check for normality
qqnorm(x)
qqline(x)

9.7 Regression diagnostics plot

par(mfrow=c(2,2))
plot(lm_1)

9.8 Density plots

plot(density(rbinom(n=10000,size=1000,prob = 0.5)))

10 Design of Experiment and Randomization

Experimental Design is plan to solve any problem or compare any factors/treatments under the control situation as much as possible. Experimental designs is a plan of layout of an experiment according to which the treatments are allocated for the plots based on theory of probability with a certain set of rules and procedure so as to obtain appropriate observations. In agricultural experiments The objects of a comparison is experiments I known as treatments. For example comparison of different crop variety yield or effect of different insecticide to control insect/pest in plant. In which treatments are applied to collect the observation or data in known as experimental unit. For agriculture experiments field or land is example of experimental unit.

Randomness is the basis on which statistics differs from mathematics. Statistics assumes variables in statistical analysis to be random in nature (also known as stochastic in nature). But random word connnotes several meaning in local parlance; to some this means absence of patten/predictability, or to some that happens accidentally. However, a more precise meaning of random in statistics corresponds to “chosen without regard to any characteristics of the individual members of the population so that each has an equal chance of being selected: random sampling” or “having a value which cannot be determined but only described probabilistically: a random variable” https://www.dictionary.com/browse/random.

The former definition states that selection must be done disregarding the characteristics of individual members of population. This is principle helps in creating several representative/comparable groups from populations for study of population with regard to either effect of some new treatment or characteristics of population itself. If the selection process does not remain random while creating groups then it can lead to biase (both selection and accidental) and groups so created from population would no longer be representative/comparable. Thus ensuring randomness in selection process is crucial to getting unbiased results. This process is the basis for application of statistical inference techniques of mean comparision.

Sometimes question also arises as to how random assignment of treatment differs from the assignment I make. For understanding this lets take a simple game. Take an unbiased coin (a fresh coin which can equally flip on both sides) and mark one side as head and other side as tail. without flipping coin write a sequence of 20 or 30 head-tail. Now if possible evaluate this sequence generated by you with the help of runs test for its randomness. It would be rare to find that some one writing suh sequence without any prior training can produce a random sequence. This is because random sequence has a property that selection of current object or subject does not depend in any way on the previous subjects selected. This is because the process generating such sequence has no memory. But, humans have very good memory and does remember a large amount of information in brain, either consciously or subconsciously. This is the reason why humans can not behave randomly. We humans behave orderly and a person behaving randomly is called mad in our societies. There is one website random.org which provides lot of games on randomness and also demonstrates why humans can not behave randomly.

Based on the above stated property of random sequence, I had created a fraud detection test

where the enumerator conducting survey work would be required to ask subjects a sequence of fruit choice from 1 to 8. If an enumerator does not conduct survey and fakes it then proposed test can detect non-randomness in set of sequences entered by him/her.
In experimental studies, three principles are always adhered to namely,

1. Randomization: As a consequence of treatments are allocated randomly, the variation caused due to soil fertility variation will be systematic in nature is eliminated completely and whatever the error that persist is of random in nature/ chance factor (Mishra and Homa, 2019). 
Advantages:

 Randomization eliminates the systematic error.  Ensures whatever the error component that still persists in the observation is purely random in nature. Types of randomization  Unrestricted randomization or complete randomization: Allocation of treatment randomly to the experiment without any restrictions. Eg: CRD  Restricted randomization: Allocation of treatment randomly to the experiment with certain restriction either one-way blocking or two-way blocking. Eg: RBD, LSD

2. Replication: Treatments are repeated several times to reduce experimental error. This also helps to increase power of statistics used in analysis of data. 

3. Local Control

In Agriculture, however, experiments are performed on fields and instead of subjects in Psychology & medicine or objects in physics, here land/field is the subject or object recieving treatment. Since land/field differs in various characteristics over space, therefore, various blocks are created on land/field, each block with homogeneous characteristics in terms of slope, soil type, soil texture, etc in it but heterogeneous among them. Treatments are assigned randomly over these blocks. Care is to be taken to include each treatment in each block. Treatments must be randomly allocated to plots within each block. This would ensure blocks as replicating units. Thus, one more principle is added to above three, i.e., Blocking. If land is homogeneous across its length and breadth then there is no need for blocking.

Now lets discuss how R software can be used to generate layout for experimental with randomized assignment of treatments to fields.

For this purpose lets install Agricolae package in R by using following command

install.packages("agricolae")

and load it from library where it gets stored.

library(agricolae) 

After this step, next step is to learn how to enter treatment and replication information in R. For this purpose, we use vectors as data structure. Let there be three treatments with control namely, A, B, C and “control” and equal number of replications.

Trt1 = c("A","B","C","Control")  

In above command, a vector named Trt1 is created. The vector starts with letter c which means combine and since it is a character vector therefore, all characters must be in inverted comma.

To create layout for Completey Randomized Design, design.crd function is used. Arguments to this function includes treatment information and number of replications. Since we want only information on layout characteristics therefore, $book is placed at the end of function.

design.crd(trt=Trt1, r=3)$book
##    plots r    Trt1
## 1    101 1       A
## 2    102 1       C
## 3    103 1 Control
## 4    104 2 Control
## 5    105 3 Control
## 6    106 1       B
## 7    107 2       C
## 8    108 2       A
## 9    109 2       B
## 10   110 3       C
## 11   111 3       A
## 12   112 3       B

The above command created the output which includes plot numbers, replication number and treatment allotment. In case Randomized control block design is used design.rcbd function is used.

For other useful designs lets create following treatments.

 Trt1 = c("A","B","C")  
 Trt2 = c("1","2")  
 Trt3 = c("+","-")  
 TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
 TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))  
 TRT.Control = c(TRT, rep("Control", 3)) 
 TRT
##  [1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"

Now layout for designs with above treatments is

#Completely Randomized Design
design.crd(trt=TRT.Control, r=3)$sketch 
## NULL
#Random Block Design  
design.rcbd(trt=TRT.Control, r=3)$sketch  
##      [,1]  [,2]      [,3]      [,4]  [,5]      [,6]      [,7]  [,8]      [,9] 
## [1,] "C1-" "Control" "A1-"     "A1+" "C2+"     "C1+"     "B1-" "Control" "B1+"
## [2,] "B1-" "A1-"     "C1+"     "B1+" "Control" "A2+"     "B2+" "B2-"     "A1+"
## [3,] "B2+" "A1-"     "Control" "A1+" "C2-"     "Control" "B1-" "C1+"     "C1-"
##      [,10]     [,11]     [,12] [,13] [,14]     [,15]    
## [1,] "A2-"     "B2+"     "B2-" "C2-" "Control" "A2+"    
## [2,] "Control" "C2+"     "C1-" "C2-" "A2-"     "Control"
## [3,] "B1+"     "Control" "B2-" "A2-" "C2+"     "A2+"
#Incomplete Block Design  
design.bib(trt=TRT.Control, r=7, k=3)$sketch  
## 
## Parameters BIB
## ==============
## Lambda     : 1
## treatmeans : 15
## Block size : 3
## Blocks     : 35
## Replication: 7 
## 
## Efficiency factor 0.7142857 
## 
## <<< Book >>>
##       [,1]      [,2]      [,3]     
##  [1,] "A2-"     "B1-"     "C1-"    
##  [2,] "C2-"     "A2+"     "Control"
##  [3,] "Control" "C2-"     "A2-"    
##  [4,] "A1-"     "C1-"     "C2-"    
##  [5,] "A2+"     "A2-"     "B2+"    
##  [6,] "B2+"     "Control" "Control"
##  [7,] "B2+"     "B1-"     "A1-"    
##  [8,] "C2+"     "Control" "B1-"    
##  [9,] "C1-"     "A1+"     "B1+"    
## [10,] "A1+"     "C2-"     "Control"
## [11,] "A1-"     "C1+"     "A1+"    
## [12,] "B2+"     "B1+"     "C2-"    
## [13,] "Control" "A2+"     "B2-"    
## [14,] "B2-"     "Control" "A1+"    
## [15,] "B2-"     "B2+"     "C1-"    
## [16,] "C2-"     "B1-"     "C1+"    
## [17,] "A1+"     "B2+"     "Control"
## [18,] "C1+"     "Control" "Control"
## [19,] "B1+"     "A2+"     "C2+"    
## [20,] "C2-"     "B2-"     "C2+"    
## [21,] "B1+"     "Control" "C1+"    
## [22,] "A1-"     "B1+"     "B2-"    
## [23,] "Control" "A1-"     "A2+"    
## [24,] "Control" "B1-"     "B2-"    
## [25,] "A2-"     "C2+"     "A1+"    
## [26,] "A2-"     "Control" "A1-"    
## [27,] "A1+"     "A2+"     "B1-"    
## [28,] "B1-"     "Control" "B1+"    
## [29,] "Control" "B1+"     "A2-"    
## [30,] "B2-"     "C1+"     "A2-"    
## [31,] "Control" "C2+"     "C1-"    
## [32,] "C2+"     "C1+"     "B2+"    
## [33,] "A1-"     "C2+"     "Control"
## [34,] "C1+"     "A2+"     "C1-"    
## [35,] "C1-"     "Control" "Control"
 #Split-Plot Design  
 design.split(Trt1, Trt2, r=3, design=c("crd"))$book
##    plots splots r Trt1 Trt2
## 1    101      1 1    C    2
## 2    101      2 1    C    1
## 3    102      1 1    A    2
## 4    102      2 1    A    1
## 5    103      1 1    B    1
## 6    103      2 1    B    2
## 7    104      1 2    B    1
## 8    104      2 2    B    2
## 9    105      1 2    A    1
## 10   105      2 2    A    2
## 11   106      1 3    A    2
## 12   106      2 3    A    1
## 13   107      1 3    B    1
## 14   107      2 3    B    2
## 15   108      1 2    C    1
## 16   108      2 2    C    2
## 17   109      1 3    C    2
## 18   109      2 3    C    1

11 ANOVA

The ANOVA stands for Analysis of variance. It is actually a dummy variable regression. The dummies in most of agricultural use are treatment dummies, replication dummies and block dummies. The dummies in a data frame should be of type factor. The examples given below are using npk dataset for full factorial NPK experiment from Venables and Ripley (2002) p.165 provided in datasets package .

data(npk) ## from datasets package
head(npk) ## see how data frame is to be prepared for anova
##   block N P K yield
## 1     1 0 1 1  49.5
## 2     1 1 1 0  62.8
## 3     1 0 0 0  46.8
## 4     1 1 0 1  57.0
## 5     2 1 0 0  59.8
## 6     2 1 1 1  58.5
npk_lm=lm(yield ~ block + N*P*K, data=npk)
summary(npk_lm)
## 
## Call:
## lm(formula = yield ~ block + N * P * K, data = npk)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3000 -1.6833  0.1583  1.9979  4.4750 
## 
## Coefficients: (1 not defined because of singularities)
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.8250     2.7785  18.652 3.15e-10 ***
## block2        3.4250     2.7785   1.233  0.24131    
## block3        6.7500     2.7785   2.429  0.03177 *  
## block4       -3.9000     2.7785  -1.404  0.18578    
## block5       -3.5000     2.7785  -1.260  0.23174    
## block6        2.3250     2.7785   0.837  0.41907    
## N1            9.8500     2.7785   3.545  0.00403 ** 
## P1            0.4167     2.7785   0.150  0.88329    
## K1           -1.9167     2.7785  -0.690  0.50344    
## N1:P1        -3.7667     3.2084  -1.174  0.26317    
## N1:K1        -4.7000     3.2084  -1.465  0.16865    
## P1:K1         0.5667     3.2084   0.177  0.86275    
## N1:P1:K1          NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.929 on 12 degrees of freedom
## Multiple R-squared:  0.7886, Adjusted R-squared:  0.5948 
## F-statistic: 4.069 on 11 and 12 DF,  p-value: 0.01156
anova(npk_lm)
## Analysis of Variance Table
## 
## Response: yield
##           Df Sum Sq Mean Sq F value   Pr(>F)   
## block      5 343.29  68.659  4.4467 0.015939 * 
## N          1 189.28 189.282 12.2587 0.004372 **
## P          1   8.40   8.402  0.5441 0.474904   
## K          1  95.20  95.202  6.1657 0.028795 * 
## N:P        1  21.28  21.282  1.3783 0.263165   
## N:K        1  33.13  33.135  2.1460 0.168648   
## P:K        1   0.48   0.482  0.0312 0.862752   
## Residuals 12 185.29  15.441                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Alternatively we can use aov function which is wrapper for lm functioon for fitting linear model  
aov.npk<- aov(yield~block+N*P*K,data = npk)
summary(aov.npk)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## block        5  343.3   68.66   4.447 0.01594 * 
## N            1  189.3  189.28  12.259 0.00437 **
## P            1    8.4    8.40   0.544 0.47490   
## K            1   95.2   95.20   6.166 0.02880 * 
## N:P          1   21.3   21.28   1.378 0.26317   
## N:K          1   33.1   33.13   2.146 0.16865   
## P:K          1    0.5    0.48   0.031 0.86275   
## Residuals   12  185.3   15.44                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.npk)
## (Intercept)      block2      block3      block4      block5      block6 
##  51.8250000   3.4250000   6.7500000  -3.9000000  -3.5000000   2.3250000 
##          N1          P1          K1       N1:P1       N1:K1       P1:K1 
##   9.8500000   0.4166667  -1.9166667  -3.7666667  -4.7000000   0.5666667
## Remember aov fuction is designed for use on balanced designs. In case of missing values, the design loses balance. 
## To check balance use replications function. 
## A design can be balanced when all treatements levels have same number of replications. 
replications(~.-yield,data = npk)
## block     N     P     K 
##     4    12    12    12
## replications function will provide number of replications for each treatement levels

Another dataset inbuilt in R is warpbreaks from datasets package. It has measurement on number of breaks in wool, wool type (A or B) and levels of tension (Low, Medium and High).

aov.warpbreaks=aov(breaks~wool*tension,data=warpbreaks)
summary(aov.warpbreaks)
##              Df Sum Sq Mean Sq F value   Pr(>F)    
## wool          1    451   450.7   3.765 0.058213 .  
## tension       2   2034  1017.1   8.498 0.000693 ***
## wool:tension  2   1003   501.4   4.189 0.021044 *  
## Residuals    48   5745   119.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.warpbreaks)
##    (Intercept)          woolB       tensionM       tensionH woolB:tensionM 
##       44.55556      -16.33333      -20.55556      -20.00000       21.11111 
## woolB:tensionH 
##       10.55556

Latin Square Design

Another dataset is OrchardSprays provided in datasets package. It was obtained from an experiment conducted to assess the efficacy of constituents of sprays done on orchard in repelling honeybees, using a Latin square design. The data has row position, column position, treatment levels as factor and response variable by name decrease.

data("OrchardSprays")
head(OrchardSprays)
##   decrease rowpos colpos treatment
## 1       57      1      1         D
## 2       95      2      1         E
## 3        8      3      1         B
## 4       69      4      1         H
## 5       92      5      1         G
## 6       90      6      1         F
## check if this is balanced design
summary(OrchardSprays$treatment)
## A B C D E F G H 
## 8 8 8 8 8 8 8 8
## after finding balance use aov function
aov.orchardspray=aov(decrease~rowpos+colpos+treatment,data=OrchardSprays)
summary(aov.orchardspray)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## rowpos       1   2605    2605   6.877   0.0113 *  
## colpos       1    511     511   1.350   0.2504    
## treatment    7  56160    8023  21.181 2.07e-13 ***
## Residuals   54  20454     379                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.orchardspray)
## (Intercept)      rowpos      colpos  treatmentB  treatmentC  treatmentD 
##   22.705357   -2.784226   -1.233631    3.000000   20.625000   30.375000 
##  treatmentE  treatmentF  treatmentG  treatmentH 
##   58.500000   64.375000   63.875000   85.625000

One Way ANOVA

The PlantGrowth dataset provided in datasets package has recorded observations on plant yield under control and two treatments.

data("PlantGrowth")
head(PlantGrowth)
##   weight group
## 1   4.17  ctrl
## 2   5.58  ctrl
## 3   5.18  ctrl
## 4   6.11  ctrl
## 5   4.50  ctrl
## 6   4.61  ctrl
## check for balance 
summary(PlantGrowth$group)
## ctrl trt1 trt2 
##   10   10   10
aov.plantgrowth=aov(weight~group,data=PlantGrowth)
summary(aov.plantgrowth)
##             Df Sum Sq Mean Sq F value Pr(>F)  
## group        2  3.766  1.8832   4.846 0.0159 *
## Residuals   27 10.492  0.3886                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.plantgrowth)
## (Intercept)   grouptrt1   grouptrt2 
##       5.032      -0.371       0.494

11.1 TWo Way ANOVA

The ToothGrowth dataset provided in dataset has observations on tooth length, supplement type and doses of vitamin C. The experiment was conducted to find effect of vitamin C on tooth growth of Guinea pigs. Here, two sources of variation exists, one is levels of vitamin C and another is supplement type.

data("ToothGrowth")
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5
summary(ToothGrowth)## observe that supplement has equal no. of observations for its two different levels
##       len        supp         dose      
##  Min.   : 4.20   OJ:30   Min.   :0.500  
##  1st Qu.:13.07   VC:30   1st Qu.:0.500  
##  Median :19.25           Median :1.000  
##  Mean   :18.81           Mean   :1.167  
##  3rd Qu.:25.27           3rd Qu.:2.000  
##  Max.   :33.90           Max.   :2.000
summary(as.factor(ToothGrowth$dose)) ## check if equal number of observations are there for three different levels of doses of vitamin C
## 0.5   1   2 
##  20  20  20
aov.toothgrowth=aov(len~as.factor(dose)+supp,data = ToothGrowth)
summary(aov.toothgrowth)
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(dose)  2 2426.4  1213.2   82.81  < 2e-16 ***
## supp             1  205.4   205.4   14.02 0.000429 ***
## Residuals       56  820.4    14.7                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.toothgrowth)
##      (Intercept) as.factor(dose)1 as.factor(dose)2           suppVC 
##           12.455            9.130           15.495           -3.700

12 Quiz on Programming with R

Page Title

13 Test for training participants

All questions are compulsory. Do not use R or any of the training material during the test.

What would be the result in console of executing following code?

  1. c(1,4,7)

  2. c(3,“T”,9,3+2i)

  3. x=data.frame(Name=c(“A”,“B”,“C”), Subject=c(“EXT 501”,“AGECON 604”,“ABM 510”))

  4. x$Marks=c(50,60,70); x

  5. for( i in 1:10){i=i+1 ;print(i)}

  6. mylist=list(newmatrix=matrix(c(2,3,4,5)),vector=c(2,3,7));lapply(mylist,sum)

  7. newdf=data.frame(A=c(2,5,7,9),B=c(1,3,4,6)); newdf[newdf$A==2,]

  8. newdf[3,2]

  9. newdf[newdf$A>2 & newdf$B<4,]

  10. sapply(newdf,max)

  11. k=c(“T”,“C”,“T”,“C”,“T”); L=c(“R”,“D”,“D”,“R”,“D”);Y=c(1,5,7,3,2);tapply(Y,list(k,L),sum)

  12. table(k,L)

  13. character(length=5)

  14. factor(k,levels= c(“C”,“T”))

  15. if(length(k)>length(L)){print(1)} else {print(0)}

  16. x=1 ; repeat{print(x) ; x=x+1 ; if(x>5){ break } }

  17. alog7=function(x){z=2*x+5; return(z)}; alog7(6)

  18. Y=c(10,Y,20); Y

  19. seq(1,2,0.1)

  20. Sys.Date()

14 Downloading and Installing R

If we plan to use R, then firstly we need to download R and install it on our computer. Steps are provided below:

  1. Search for R software in the your browser.

    Screenshot of google chrome browser with R sotware in search bar

  2. One should download R software from official website of R . From the Google search following results will appear.

    Search result for R software

  3. Go to CRAN link under Download section on the left hand side of the website or click on “download R”. Both the ways directly take you to the CRAN mirror page.

    Screenshot of main page of official R website

  4. Once you reach httpshttps://cran.r-project.org/mirrors.html, choose the appropriate CRAN mirror. It is hereby suggested that “0-Cloud” mirror with url https://cloud.r-project.org/ be selected as its officially sponsored by RStudio.

    Screenshot of CRAN mirrors page

  5. Once you reach https://cloud.r-project.org/. click on “Download R for” (Linux/Windows/Mac depending on your operating system). The next steps would be explained for Windows installation.

    CRAN page for downloding R

  6. Once you click on “Download R for Windows” then it will take you to the “R for Windows” page. Click on the base sub directory link or install R for the first time link. This step will take you to “R-4.2.0 for Windows” page.

    Screenshot of R for Windows page

  7. Click Download R-4.2.0 for Windows. Now downloading will start and process will take time depending on your internet speed. Once downloading ends then click on the downloaded .exe file.

    R-4.2.0 for Windows page

  8. Now a dialogue box will appear asking you to select setup language. Select the desired language and click on OK.

    Screenshot of Select setup language dialogue box

  9. Read the license agreement and click Next.

    Screenshot of read licence agreement

  10. Select the documentation location. Click Next.

    Screenshot of Select Documentation Location

  11. Select the components you wish to install. Click Next.

    Screenshot of select component dialogue box

  12. Enter/browse the folder/path you wish to install R into and then confirm by clicking Next.

    Screenshot of Select Start Menu Folder dialogue box

  13. Select additional tasks like creating desktop shortcuts etc. Then click Next.

    Screenshot of Select Additional Task dialogue box

  14. Wait for the installation process to complete. Click on Finish to complete the installation.

    Screenshot of finish setup dialogue box

  15. R 4.2.0 will get displayed on our desktop.

    Screenshot of desktop home with R software shortcut

15 Downloading and Installing RStudio

If we plan to use RStudio, then firstly we need to download R and install it on our computer (see section Downloading and Installing R) .

  1. Search for RStudio software in browser.

  2. We can download RStudio from official website of RStudio. If you directly click on the first search result (Download the RStudio IDE), it will directly take you to the download page.

  3. Select Free Download option.

  4. Click on the Download RStudio Desktop link (depending on your operating system but further installation guide is provided for windows os).

  5. Follow the installation instructions and click Next on the welcome window.

  6. Enter the path to the installation folder and click Next to proceed.

  7. Select the folder for the start menu shortcut or click on do not create shortcuts and then click Next.

  8. Wait for the installation process to complete.

  9. Click finish to complete the installation.

  10. RStudio shortcut/icon will get displayed on our desktop.


  1. Assistant Professor(Agricultural Economics), JNKVV, Jabalpur, email id:↩︎

  2. Assistant Professor (Agricultural Statistics), JNKVV, Jabalpur, email id:↩︎

  3. Although there are functions which does not take any input from user like Sys.Date(), getwd() and many more.↩︎

  4. In linear data structure, elements are arranged one after the another.↩︎