Copyright Licence: CC BY 4.0
To know step by step guide on downloading and
installation of
R software click on the link Downloading and Installing R
and for RStudio click here on the link Downloading and Installing
RStudio.
Declaration:This manual was prepared on Ubuntu 20.4 LTS with RStudio 2022.07.0+548 “Spotted Wakerobin” and R version 4.2.1 (2022-06-23) -- “Funny-Looking Kid”.
You can cite this training manual as following:
Vani, G.K. & Mishra, P. (2022). Training Manual on Programming with R.Jawaharlal Nehru Krishi Vishwa Vidyalaya, Jabalpur (India).
The R is a programming language which is an interpreted
language. This means that unlike Fortran language here code
need not be compiled before its output is visible. It is officially runs
in R console and can be used in its officially launched IDE
(Integrated Development Environment) named RStudio . Code
can be directly written and executed in the console and output appears
below it. In case a script file is made in console, the code syntax can
be executed by placing cursor on the required line and using Ctrl+R. In
case of RStudio IDE, Ctrl+Enter after placing cursor on the
code syntax to execute it in console window.
A view of R console window in ubuntu operating system
A view of RStudio IDE
It is a case sensitive one. Any thing that works in R is
a function and what doesn’t work is an object. An object in
R stores the information. A function takes input from the
user 3
and provides output. The input provided to the function is called as
arguments of the function. A function at time may require one or more
arguments. R does not allow two objects with same name. If
two objects have name age then the one created later will
replace the former. R has some reserved words
which are listed below:
if
else
repeat
while
function
for
next
break
TRUE
FALSE
NULL
Inf
NaN
NA
NA_Integer
NA_real
NA_complex
NA_character
User of this software should not make/keep name of any object create exactly as or after reserved words. Reserved words can be used as non-syntatic names insed the backtick quotes, ``. But reserved words are not allowed as syntatic names.
Anything written after # will be treated as comment not
as command. Comments can be inserted after command. The
rmarkdown documents provide output of code chunk after
## . This document is also prepared with the help of
rmarkdown . In R language, small brackets
() are reserved for functions and big brackets
[] and [[ ]] are reserved for indexing.
Indexing involves accessing the elements from withing the objects. Each
time brackets whether small, curly or big brackets are used it has to be
closed appropriately otherwise execute command would be incomplete till
all brackets are closed in the required manner and software will show a
+ plus sign instead of prompt character, >
, in the beginning of syntax in console.
In R language, assignment operator <- or
<<- is used to assign a value or a functional
expression to an object. It consists of single or double less
than < character and a minus sign - ,
both the character are written strictly one after the other. In most of
the context = sign is also used in place of assignment
operator and will work equally well. Instead of assigment operator
<- or <<- and = , we can
also use assign function to assign a value or functional
expression to an object.
x<-10
x
## [1] 10
#OR
x<<-10
x
## [1] 10
y=10
y
## [1] 10
assign("z",10)
z
## [1] 10
The assignment of value(s) to an object can be made with reverse assignment operator also.
10->k
k
## [1] 10
#OR
10->>k
k
## [1] 10
If an command provides values to the console as output, i.e. printed to the console, but is not stored in some object then it is lost and would be no available further.
To simplify code structure, R has pipe operator,
%>%, available from dplyr package. An
example is given below.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
x %>% +y+z # Add x to y+z
## [1] 30
Unlike other software where semicolon (;)or comma (,)is required to
separate two lines of code, R doesn’t necessitates use of
; or ,. But, two different syntax can be
written in same line and can be separated by semicolon. An example is
provided below.
x<-10; y<-20
x;y
## [1] 10
## [1] 20
Remember that to find all the objects created in
local Environment use ls function and to
remove any object use rm function. To clear the clutter
console use Ctrl+l keys.
ls()
rm(name of the object)
For example to remove object x following code is
executed in console.
rm(x)
Data structure refers to the specific ways of arranging the data so
that it can be efficiently accessed and updated. The data in
R is stored in different data structures.
Vector (Atomic vector),
Matrix,
Data frame,
Factor,
List,
Array
are data structure in R. Although table is a class
defined in R but it is not counted here as data
structure.
Vector is the most simple type of data structure in R.
It is the lowest level of data structure in R. It is an one
dimensional object. It is a linear data structure4. A vector in
R is similar to the vector in mathematics. Vector records
data entries only in single mode (homogeneous data structure), either as
numbers (real, integer or complex) or logical (true/false) or character
(text). Vector in R is of following types or modes:
Numeric (single/double)
Integer
Logical
Character
Complex
A vector can be formed by using either vector function
or the function call c(). Here, c stands for
combine or concatenate. The function combines/concatenates different
arguments provided to it as different elements of the vector. The
function vector creates a null vector which contains empty
space(s) or zero(s) or FALSE or a string of
FALSE . The arguments to this function involves
mode and length.
It can be used as following
vec_numeric=vector(mode="numeric",length = 5) #To create numeric vector
vec_numeric
## [1] 0 0 0 0 0
vec_logical = vector(mode="logical",length = 5)
vec_logical
## [1] FALSE FALSE FALSE FALSE FALSE
vec_character = vector(mode="character",length = 5)
vec_character
## [1] "" "" "" "" ""
vec_integer = vector(mode="integer",length = 5)
vec_integer
## [1] 0 0 0 0 0
vec_complex = vector(mode="complex",length = 5)
vec_complex
## [1] 0+0i 0+0i 0+0i 0+0i 0+0i
# you can check whether an object is vector or not by using is.vector function which produces true or false logical output
is.vector(vec_numeric)
## [1] TRUE
is.vector(vec_character)
## [1] TRUE
is.vector(vec_integer)
## [1] TRUE
is.vector(vec_complex)
## [1] TRUE
is.vector(vec_logical)
## [1] TRUE
In place of numeric, we can use logical,
complex, integer and character .
Details on each of these modes of vector are provided below.
Numeric vector records real numbers, both positive as well as
negative. Currently R has adopted double-precision data
types system. Hence when we check the numeric vector for its type then
R returns “double”. There is no vector with single
type.
To create a numeric vector which contains elements 1, 2 and 3 we give following command to R.
c(1,2,3)
## [1] 1 2 3
To save this numeric vector as an object in R by name xyx
xyx<-c(1,2,3)
To print the information contained within the vector just type the name of the vector
xyx
## [1] 1 2 3
In a complex way this could be done by using print
command as well
print(xyx)
## [1] 1 2 3
Now let’s create a vector of age of household member and give this vector our own name.
gourav<-c(32,31,1.5)
gourav
## [1] 32.0 31.0 1.5
If we want to name the gourav vector by name
age then a new object by name age can be
created without altering the object gourav
age=gourav
age
## [1] 32.0 31.0 1.5
To check the class of the vector created
class(age)
## [1] "numeric"
To check the mode of the vector
mode(age)
## [1] "numeric"
To check type of vector
typeof(age)
## [1] "double"
To find the number of elements in the vector, i.e., length of vector
length(age)
## [1] 3
suppose if we want to add more data to vector age then
to add the fourth element to vector age
age[4]=1
age
## [1] 32.0 31.0 1.5 1.0
The above output shows how to add one additional data point to the
vector. To add or append multiple elements to age
vector
age[c(5,6)]=c(0.5,0.4)
age
## [1] 32.0 31.0 1.5 1.0 0.5 0.4
Instead users can also use append function (in
base package) to append values to a vector.
append(x, values, after = length(x))
where X is the vector in which to append,
values means values to be appended and after
is the parameter to set position where to append the values.
Suppose for a vector having values 1,4,5,7,2, and 8, if one wants to
add 0 and 9 after 3rd position then append function will
work as demonstrated below.
append(c(1,4,5,7,2,8),c(0,9),after=3)
## [1] 1 4 5 0 9 7 2 8
append(c(1,4,5,7,2,8),c(0,9)) ## default position is to the end of original vector
## [1] 1 4 5 7 2 8 0 9
What if we want to replace the value of 6th element in
age vector from 0.4 to 0.2
age[6]=0.2
What if we want to remove the sixth element from vector
age=age[-6] ##after remvoing 6th element it again saves the output by name age to update the existing vector
##Let's print age vector again to see if 6th element is removed
age
## [1] 32.0 31.0 1.5 1.0 0.5
##check vector length
length(age)
## [1] 5
To remove multiple elements at once, lets say first and third element,
age=age[-c(1,3)]
age
## [1] 31.0 1.0 0.5
So far we have added multiple elements to vector age but
it adds them towards the end of the vector. To add elements to the
beginning of the existing vector we create another vector as shown
below
age=c(35,23,age,47, 56) ## 35 and 23 are added to the beginning while 47 and 56 are added to the end of vector age
age
## [1] 35.0 23.0 31.0 1.0 0.5 47.0 56.0
Lets say we want to slice the vector from element numbered 3 to 7
# R has a beautiful way to create natural number sequence with gap of 1
# To create sequence from 3 to 7
3:7
## [1] 3 4 5 6 7
age[3:7]
## [1] 31.0 1.0 0.5 47.0 56.0
To add a number (lets say 55) in between element 3 and 4, we have to create a new vector wherein we slice original vector two times, once from 1 to 3 and another time from 4 to 7.
age= c(age[1:3],34,age[4:7])
age
## [1] 35.0 23.0 31.0 34.0 1.0 0.5 47.0 56.0
To create a natural number sequence from 1 to 100
onetohundred=1:100
class(onetohundred)
## [1] "integer"
onetohundred
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
## [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
## [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
## [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
## [91] 91 92 93 94 95 96 97 98 99 100
To see the structure of the object
str(age)
## num [1:8] 35 23 31 34 1 0.5 47 56
num [1:8] shows that the object age is of
class numeric vector and has vector elements numbered from 1 to 8.
The metadata of any R object can be found by using
attributes function. Attributes refers to the properties
associated with any object in R.
attributes(age)
## NULL
Since we have not set any attribute for age vector,
that’s why it is showing NULL.
Attributes includes name of each element or row names and column names, dimension of the object, class etc. In R, all objects can have their own arbitrary attributes which can store metadata about the object. It is like a named list with unique names.
#create new vector with attributes
age1=c("GKV"=32,"AV"=31,"NV"=1.5)
age1
## GKV AV NV
## 32.0 31.0 1.5
attributes(age1)
## $names
## [1] "GKV" "AV" "NV"
# or if you want to set attributes to an existing vector
names(age)<-c("A","B","C","D","E","F","G","H")
age
## A B C D E F G H
## 35.0 23.0 31.0 34.0 1.0 0.5 47.0 56.0
attributes(age)
## $names
## [1] "A" "B" "C" "D" "E" "F" "G" "H"
Integer are numbers that does not take numbers after decimal. The
difference between two successive integer is always one. Integer numbers
in R are always written with L .
myfirstintegervec=c(1L,2L,3L)
myfirstintegervec
## [1] 1 2 3
class(myfirstintegervec)
## [1] "integer"
str(myfirstintegervec)
## int [1:3] 1 2 3
typeof(myfirstintegervec)
## [1] "integer"
Logical vector stores information in true and false pattern. It can be a result of comparison or any logical operation of type ,
> (GREATER THAN) ,
< (LESS THAN),
== (EQUAL TO for comparison purposes not as
assignment operator),
>= (GREATER THAN EQUAL TO),
<= (LESS THAN EQUAL TO),
& (LOGICAL AND),
| (LOGICAL OR),
! (LOGICAL NOT),
!= (NOT EQUAL TO),
&& (LOGICAL AND FOR SINGLE STATMENT)
and
|| (LOGICAL OR FOR SINGLE STATEMENT) .
myfirstlogical=c(T,F,F,T)
myfirstlogical
## [1] TRUE FALSE FALSE TRUE
class(myfirstlogical)
## [1] "logical"
mode(myfirstlogical)
## [1] "logical"
typeof(myfirstlogical)
## [1] "logical"
logicalvec=c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
logicalvec
## [1] TRUE FALSE TRUE FALSE TRUE TRUE
class(logicalvec)
## [1] "logical"
mode(logicalvec)
## [1] "logical"
typeof(logicalvec)
## [1] "logical"
As said before, logical vector can arise from logical operations, an example is provided below.
age>10 # logical operation to check for age greater than 10
## A B C D E F G H
## TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
age>10 & age <30 # age greater than 10 and less than 30
## A B C D E F G H
## FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
age<10 | age >30 # age less than 10 OR greater than 30
## A B C D E F G H
## TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
age <=30 # age less than equal to 30
## A B C D E F G H
## FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
age ==23 | age == 31 # age equal to 23 OR 31
## A B C D E F G H
## FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
age!=23 # age not equal to 23
## A B C D E F G H
## TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
A character vector only has text. To create a character vector all text has to be in either single or double quotation mark.
myfirstcharactervec=c("RJ","AP","KT","TN")
class(myfirstcharactervec)
## [1] "character"
It is interesting to note that R has inbuilt character
vector of alphabets, both lower and upper case, and month names. These
are referred to as built-in constants in R. The
pi is another built-in constants in R. It also
supports conversion to upper and lower case of character vector.
# To get lower case alphabets
letters
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
# To get upper case alphabets
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
# To convert lower case alphabets to upper case
toupper(letters)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
# To convert upper case alphabets to lower case
tolower(LETTERS)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
# To get month names
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
# To get abbreviated month names
month.abb
## [1] "Jan" "Feb" "Mar" "Apr" "May" "Jun" "Jul" "Aug" "Sep" "Oct" "Nov" "Dec"
It is important to note that R has character translation
capability also. The inbuilt chartr function is used for
this purpose.
chartr(old character string, new character string, a character vector)
named= " He is my husband"
chartr("He","Joseph",named)
## [1] " Jo is my husband"
Don’t forget that character translation doesn’t change the length of the character vector.
Complex numbers involves iota \((\dot\iota)\) which is square root of
negative one \(\sqrt -1\) .
R software supports complex numbers too. When a numeric
vector has presence of complex number(s) then it is of type and class
complex.
myfirstcomplex=c(2,6,1+3i,24,8+2i,3+4i)
myfirstcomplex
## [1] 2+0i 6+0i 1+3i 24+0i 8+2i 3+4i
class(myfirstcomplex)
## [1] "complex"
typeof(myfirstcomplex)
## [1] "complex"
mode(myfirstcomplex)
## [1] "complex"
In general, a sequence is created in R by
seq function.
seq(from=1,to=2,by=0.1)
## [1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
seq(from=1,to=2,length=9)
## [1] 1.000 1.125 1.250 1.375 1.500 1.625 1.750 1.875 2.000
seq(18) # Treats default from value of 1 and by value of 1, output same as 1:18
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
To create an vector by repetition of a number, integer, character or
logical argument then rep function can be of great
help.
rep("A",5) # Create a character vector by repeating `A` five times
## [1] "A" "A" "A" "A" "A"
class(rep("A",5))
## [1] "character"
rep (10,3) # Create a numeric vector by repeating value 10 by three times
## [1] 10 10 10
class(rep(10,3))
## [1] "numeric"
rep(5L,4) # Create an integer vector by repeating value 5 four times
## [1] 5 5 5 5
class(rep(5L,4))
## [1] "integer"
rep(c(T,F),5) # Create a sequence of True False by repeating it five times
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
class(rep(c(T,F),5))
## [1] "logical"
With vector many mathematical operations can be performed. This
includes addition, subtraction, division multiplication (both inner and
outer product), different types of transformations and statistical
manipulation. All arithmetic operations on vector are performed in
R element by element. It is also not necessary to have
different vectors to be of same length. If vectors are short of length
then short vectors are recycled to perform the given calculation till it
matches the desired object length (usually the length of longest
vector). In R there is also warning with some function in
this regard as shown below.
If some of the values are not vectors of the appropriate length, you deserve whatever you get!
g=c(1,5,9,12)
h=c(2,3,7,11,13)
g-2 # subtracting two from all elements of vector
## [1] -1 3 7 10
g*2 # multiplying two to all elements of the vector
## [1] 2 10 18 24
g/2 # dividing all elements of vector by two
## [1] 0.5 2.5 4.5 6.0
g%%2 # %% operator gives remainder from division
## [1] 1 1 1 0
c(2,3,5)%%2
## [1] 0 1 1
c(2,3,5)%/%2 # gives quotient from division
## [1] 1 1 2
v=5*g+h-4+h^2
## Warning in 5 * g + h: longer object length is not a multiple of shorter object
## length
v # v is a vector of elements 5*1+2-4+2^2 5*5+3-4+3^2 5*9+7-4+7^2 5*12+11-4+11^2 5*1+13-4+13^2 , here 1 is recycled in vector g to meet the demand of longest vector h
## [1] 7 33 97 188 183
g/h # Dividing one vector by other vector, all operations are element by element
## Warning in g/h: longer object length is not a multiple of shorter object length
## [1] 0.50000000 1.66666667 1.28571429 1.09090909 0.07692308
log(h) #Use of log function
## [1] 0.6931472 1.0986123 1.9459101 2.3978953 2.5649494
exp(g) # Use of exponential function
## [1] 2.718282e+00 1.484132e+02 8.103084e+03 1.627548e+05
sqrt (g) # Use of square root function
## [1] 1.000000 2.236068 3.000000 3.464102
sqrt (-18+0i) # While using any negative number with `sqrt` don't forget to put 0i to make it treated as complex number else you will get warning with `NaN`
## [1] 0+4.242641i
sin(g) # Use of sin function
## [1] 0.8414710 -0.9589243 0.4121185 -0.5365729
cos(g) # Use of cosine function
## [1] 0.5403023 0.2836622 -0.9111303 0.8438540
tan (g) # Use of tangent function
## [1] 1.5574077 -3.3805150 -0.4523157 -0.6358599
max(g) # Use of max function to find maximum value in a vector
## [1] 12
min(g) # Use of min function to find minimum value in a vector
## [1] 1
which(g==5) # which function provides position of 5 in vector g
## [1] 2
which.max(g) # which.max provides the position of maximum value in vector g
## [1] 4
which.min(g) # which.min provides the position of minimum value in vector g
## [1] 1
range(g) # Use of range functiono to find both minimum and maximum value together
## [1] 1 12
sum (g) # Use of sum function provide sum or total of all elements in a vector
## [1] 27
prod(g) # Use of prod function provides product or multiplication of all elements in a vector
## [1] 540
summary(g) # Use of summary function provides statistical summary for a numeric, integer and complex data while for others it simply provides length, class and mode of the vector
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 7.00 6.75 9.75 12.00
summary(myfirstcharactervec)
## Length Class Mode
## 4 character character
abs(c(-1,1,-2,2,-4,5,4,-6)) # Use of abs function provides absolute values for all elements of the vector
## [1] 1 1 2 2 4 5 4 6
sort(c(55,1,60,4,2,3,77),decreasing = F) # Use of sort function provides sorted vector in ascending/increasing order
## [1] 1 2 3 4 55 60 77
unique(c(1,3,1,4,5,6,2,3,4,6)) # use of `unique` function provides unique values or names in a vector
## [1] 1 3 4 5 6 2
t(g) # Use of `t` function to get transpose of vector
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 12
t(g)%*%g # Dot or inner product between a vector and its transpose resulting in scalar
## [,1]
## [1,] 251
#Power of a vector would result in element by element operation
sum(g^2) # calculate sum of squares of elements of vector
## [1] 251
g%*%t(g) # Outer or tensor product of one vector with its transpose resulting in a matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 12
## [2,] 5 25 45 60
## [3,] 9 45 81 108
## [4,] 12 60 108 144
t(c(5,4,3,2))%*%g # Dot or inner product between two vector resulting in a scalar
## [,1]
## [1,] 76
c(5,4,3,2)%*%t(g) # Outer product or tensor product between two vectors resulting in a matrix
## [,1] [,2] [,3] [,4]
## [1,] 5 25 45 60
## [2,] 4 20 36 48
## [3,] 3 15 27 36
## [4,] 2 10 18 24
rank(c(10,1,16,24,3,45)) # to get the rank of all elements in a vector
## [1] 3 1 4 5 2 6
table(myfirstlogical)#Table function generates frequency distribution
## myfirstlogical
## FALSE TRUE
## 2 2
table(myfirstintegervec)
## myfirstintegervec
## 1 2 3
## 1 1 1
table (c(11,14,15,13,14,12,11,14,16,15,13,12,17))
##
## 11 12 13 14 15 16 17
## 2 2 2 3 2 1 1
1:5 %in% c(1,3,5,9,10,11) # %in% is an operator which is used in place of match function and is used to find whether the elements of one vector belongs to elements of other vector
## [1] TRUE FALSE TRUE FALSE TRUE
# the above code results in logical true/false showing that 1, 3 and 5 belongs to other vector but not 2 and 4. The length of resulting logical vector would be equal to length of former vector not latter.
# outer applies a vectorized function to all combinations of two arguments.
# to create a matrix from a mathematical operations of two vectors outer function can be conveniently used. It must be supplied with argument FUN
outer(g,h,FUN="+")
## [,1] [,2] [,3] [,4] [,5]
## [1,] 3 4 8 12 14
## [2,] 7 8 12 16 18
## [3,] 11 12 16 20 22
## [4,] 14 15 19 23 25
outer(g,h,FUN="^")
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1.000000e+00
## [2,] 25 125 78125 48828125 1.220703e+09
## [3,] 81 729 4782969 31381059609 2.541866e+12
## [4,] 144 1728 35831808 743008370688 1.069932e+14
outer(g,h,FUN="*")
## [,1] [,2] [,3] [,4] [,5]
## [1,] 2 3 7 11 13
## [2,] 10 15 35 55 65
## [3,] 18 27 63 99 117
## [4,] 24 36 84 132 156
outer(g,h,FUN="-")
## [,1] [,2] [,3] [,4] [,5]
## [1,] -1 -2 -6 -10 -12
## [2,] 3 2 -2 -6 -8
## [3,] 7 6 2 -2 -4
## [4,] 10 9 5 1 -1
outer(g,h,FUN="/")
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.5 0.3333333 0.1428571 0.09090909 0.07692308
## [2,] 2.5 1.6666667 0.7142857 0.45454545 0.38461538
## [3,] 4.5 3.0000000 1.2857143 0.81818182 0.69230769
## [4,] 6.0 4.0000000 1.7142857 1.09090909 0.92307692
# To get all combinations of values in two vector we can use expand.grid function
expand.grid(g,h)
## Var1 Var2
## 1 1 2
## 2 5 2
## 3 9 2
## 4 12 2
## 5 1 3
## 6 5 3
## 7 9 3
## 8 12 3
## 9 1 7
## 10 5 7
## 11 9 7
## 12 12 7
## 13 1 11
## 14 5 11
## 15 9 11
## 16 12 11
## 17 1 13
## 18 5 13
## 19 9 13
## 20 12 13
If we have a numeric vector it can be converted back to an integer vector.
as.integer(c(1.5,2,3.5))
## [1] 1 2 3
Similarly, an integer vector can be converted into numeric vector.
as.numeric(myfirstintegervec)
## [1] 1 2 3
A numeric/integer/complex/logical vector can be converted into a character vector but vice-versa isn’t true.
as.numeric(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.integer(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(myfirstcharactervec)
## [1] NA NA NA NA
as.complex(myfirstcharactervec)
## Warning: NAs introduced by coercion
## [1] NA NA NA NA
as.logical(myfirstintegervec)
## [1] TRUE TRUE TRUE
as.complex(myfirstintegervec)
## [1] 1+0i 2+0i 3+0i
as.numeric(myfirstcomplex)
## Warning: imaginary parts discarded in coercion
## [1] 2 6 1 24 8 3
as.integer(myfirstlogical)
## [1] 1 0 0 1
Sub-setting a vector can be done by using either vector indices or
subset function. The first argument to the function must be
the name of vector or data and second argument should be an expression
or condition to apply for sub-setting.
subset(myfirstintegervec,myfirstintegervec>1)
## [1] 2 3
# alternatively
myfirstintegervec[myfirstintegervec>1]
## [1] 2 3
A Matrix is an arrangement of data in rows and columns. It also
contains homogeneous data structure meaning only single mode of data can
be used with a given object of class matrix. A matrix is the special
case of a two-dimensional array.In R language, matrix data
structure can be produced or created in several ways.
First method is that we can bind the different vectors either row or
column wise. The function rbind and cbind
respectively binds the vectors row and column wise. However, for this
way of creating matrix, length of each vector must be equal otherwise
R will recycle the elements in short vector to match the
length of the largest vector.
wheat=c(12,13,14)
chickpea=c(23,44,56)
rbind(wheat,chickpea)
## [,1] [,2] [,3]
## wheat 12 13 14
## chickpea 23 44 56
Here, names of vector are automatically taken as names of rows in matrix.
cbind(wheat,chickpea)
## wheat chickpea
## [1,] 12 23
## [2,] 13 44
## [3,] 14 56
Here, names of vectors used to create matrix are automatically taken
as column names. Alternative way to create matrix is to use function
matrix.
matrix(c(wheat,chickpea),ncol = 2,nrow=3,byrow = F)
## [,1] [,2]
## [1,] 12 23
## [2,] 13 44
## [3,] 14 56
Argument ncol refers to number of columns and
nrow refers to number of rows. In a matrix, product of
ncol and nrow must equal the number of
elements in all the vectors supplied to function matrix. If
there is mismatch in the sum of length of vectors supplied and product
of number of rows and columns then problem of recycling of elements will
persist with some warning. The matrix function takes by
default values column i.e., byrow=F. Thus, even if this
parameter byrow=F is ignored it will give same output.
matrix(c(wheat,chickpea),ncol = 2,nrow=3)
## [,1] [,2]
## [1,] 12 23
## [2,] 13 44
## [3,] 14 56
matrix(c(wheat,chickpea),ncol = 2,nrow=4) # Example of recycling of elements
## Warning in matrix(c(wheat, chickpea), ncol = 2, nrow = 4): data length [6] is
## not a sub-multiple or multiple of the number of rows [4]
## [,1] [,2]
## [1,] 12 44
## [2,] 13 56
## [3,] 14 12
## [4,] 23 13
Instead of supplying arguments ncol and
nrow, it is possible to supply them in one go by specifying
a vector the first element of which is number of rows and second element
is number of columns.
myfirstmatrix=matrix(c(wheat,chickpea),c(3,2),byrow = F)
myfirstmatrix
## [,1] [,2]
## [1,] 12 23
## [2,] 13 44
## [3,] 14 56
is.matrix(myfirstmatrix) #check whether the object is a matrix or not, output of this function is a logical true or false
## [1] TRUE
In place of a numeric vector, we can use other types of vectors too to create a matrix.
matrix(vector("character",length = 6),nrow = 3)
## [,1] [,2]
## [1,] "" ""
## [2,] "" ""
## [3,] "" ""
matrix(vector("logical",length = 6),nrow = 3)
## [,1] [,2]
## [1,] FALSE FALSE
## [2,] FALSE FALSE
## [3,] FALSE FALSE
matrix(vector("integer",length = 6),nrow = 3)
## [,1] [,2]
## [1,] 0 0
## [2,] 0 0
## [3,] 0 0
matrix(vector("complex",length = 6),nrow = 3)
## [,1] [,2]
## [1,] 0+0i 0+0i
## [2,] 0+0i 0+0i
## [3,] 0+0i 0+0i
Existing attributes of the matrix can be obtained by
attributes function.
attributes(myfirstmatrix)
## $dim
## [1] 3 2
Only one attribute dimension exists which is given by name
dim and is equal to 3 rows and 2 columns. Other than
dimension of matrix, row names and column names are also the other
attributes. Row and column names can be set by either specifying
dimnames in matrix function or by using
rownames and colnames function.
rownames(myfirstmatrix)=c("Row1","Row2","Row3")
colnames(myfirstmatrix)=c("col1","col2")
myfirstmatrix
## col1 col2
## Row1 12 23
## Row2 13 44
## Row3 14 56
class(myfirstmatrix)
## [1] "matrix" "array"
typeof(myfirstmatrix)
## [1] "double"
mode(myfirstmatrix)
## [1] "numeric"
Any element(s) of the matrix can be obtained either by using row and/or column names or by using row and/or column numbers. We can create a null vector and fill the values of various elements afterwards.
nullmatrix=matrix(0,nrow = 2,ncol = 3)
nullmatrix
## [,1] [,2] [,3]
## [1,] 0 0 0
## [2,] 0 0 0
myfirstmatrix["Row1",] #only specifying row name followed by comma will provide elements for that row from all columns
## col1 col2
## 12 23
myfirstmatrix[,"col1"] #only specifying column names after comma
## Row1 Row2 Row3
## 12 13 14
myfirstmatrix["Row1","col2"] #accessing element of matrix belonging to first row and second column
## [1] 23
Any element value in matrix can be changed as required.
myfirstmatrix["Row1","col2"]=25
myfirstmatrix
## col1 col2
## Row1 12 25
## Row2 13 44
## Row3 14 56
nullmatrix[1,1]=1
nullmatrix[2,2]=1
nullmatrix
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## To create identity matrix which contains all zeros but one along its diagonal
nullmatrix_square3=matrix(0,nrow = 3,ncol = 3)
diag(nullmatrix_square3)=rep(1,3) # diag function is used to set as well as access diagonal elements in a matrix
nullmatrix_square3
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
R allows all mathematical operations on matrix data
structure.
sqmatrix=matrix(c(2,3,7,1),nrow=2)
secondsqmatrix=matrix(c(10,8,5,12),nrow=2)
sqmatrix+secondsqmatrix #addition of two matrix of same dimension
## [,1] [,2]
## [1,] 12 12
## [2,] 11 13
sqmatrix-secondsqmatrix # subtraction of two matrix of same order
## [,1] [,2]
## [1,] -8 2
## [2,] -5 -11
sqmatrix/secondsqmatrix # division of matrix by matrix element by element
## [,1] [,2]
## [1,] 0.200 1.40000000
## [2,] 0.375 0.08333333
2*sqmatrix #multiplication of matrix with a scalar
## [,1] [,2]
## [1,] 4 14
## [2,] 6 2
sqmatrix%*%secondsqmatrix # matrix multiplication
## [,1] [,2]
## [1,] 76 94
## [2,] 38 27
sqmatrix*secondsqmatrix # element wise multiplication of two matrix
## [,1] [,2]
## [1,] 20 35
## [2,] 24 12
sqmatrix%o%secondsqmatrix # outer or exterior product of two matrix, %o% is binary operator providing a wrapper for outer(x, y, FUN="*").
## , , 1, 1
##
## [,1] [,2]
## [1,] 20 70
## [2,] 30 10
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 16 56
## [2,] 24 8
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 10 35
## [2,] 15 5
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 24 84
## [2,] 36 12
outer(sqmatrix,secondsqmatrix,FUN="*")
## , , 1, 1
##
## [,1] [,2]
## [1,] 20 70
## [2,] 30 10
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 16 56
## [2,] 24 8
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 10 35
## [2,] 15 5
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 24 84
## [2,] 36 12
# However, outer function can perform several task with different functions provided for FUN argument.
outer(sqmatrix,secondsqmatrix,FUN="+")
## , , 1, 1
##
## [,1] [,2]
## [1,] 12 17
## [2,] 13 11
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 10 15
## [2,] 11 9
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 7 12
## [2,] 8 6
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 14 19
## [2,] 15 13
outer(sqmatrix,secondsqmatrix,FUN="-")
## , , 1, 1
##
## [,1] [,2]
## [1,] -8 -3
## [2,] -7 -9
##
## , , 2, 1
##
## [,1] [,2]
## [1,] -6 -1
## [2,] -5 -7
##
## , , 1, 2
##
## [,1] [,2]
## [1,] -3 2
## [2,] -2 -4
##
## , , 2, 2
##
## [,1] [,2]
## [1,] -10 -5
## [2,] -9 -11
outer(sqmatrix,secondsqmatrix,FUN="^")
## , , 1, 1
##
## [,1] [,2]
## [1,] 1024 282475249
## [2,] 59049 1
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 256 5764801
## [2,] 6561 1
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 32 16807
## [2,] 243 1
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 4096 13841287201
## [2,] 531441 1
outer(sqmatrix,secondsqmatrix,FUN="/")
## , , 1, 1
##
## [,1] [,2]
## [1,] 0.2 0.7
## [2,] 0.3 0.1
##
## , , 2, 1
##
## [,1] [,2]
## [1,] 0.250 0.875
## [2,] 0.375 0.125
##
## , , 1, 2
##
## [,1] [,2]
## [1,] 0.4 1.4
## [2,] 0.6 0.2
##
## , , 2, 2
##
## [,1] [,2]
## [1,] 0.1666667 0.58333333
## [2,] 0.2500000 0.08333333
sqmatrix%x%secondsqmatrix #Kronecker product of two matrix
## [,1] [,2] [,3] [,4]
## [1,] 20 10 70 35
## [2,] 16 24 56 84
## [3,] 30 15 10 5
## [4,] 24 36 8 12
sum(diag(myfirstmatrix)) # To find trace of the matrix which is sum of the diagonal elements of the matrix
## [1] 56
solve(sqmatrix) # Inverse of matrix by solve function
## [,1] [,2]
## [1,] -0.05263158 0.3684211
## [2,] 0.15789474 -0.1052632
det(sqmatrix) # Determinant of the matrix
## [1] -19
solve(sqmatrix,c(1,2)) # Solving a system of equations Ax=B where we need to solve for x, sqmatrix is set as A and c(1,2) as B
## [1] 0.68421053 -0.05263158
sqmatrix %*% solve(sqmatrix)
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
eigen(sqmatrix) # Eigen value decomposition
## eigen() decomposition
## $values
## [1] 6.109772 -3.109772
##
## $vectors
## [,1] [,2]
## [1,] 0.8623579 -0.8076995
## [2,] 0.5062992 0.5895944
svd(sqmatrix) # Singular value decomposition
## $d
## [1] 7.524938 2.524938
##
## $u
## [,1] [,2]
## [1,] -0.9632580 -0.2685778
## [2,] -0.2685778 0.9632580
##
## $v
## [,1] [,2]
## [1,] -0.3630926 0.9317531
## [2,] -0.9317531 -0.3630926
chol(secondsqmatrix) #Cholesky decomposition
## [,1] [,2]
## [1,] 3.162278 1.581139
## [2,] 0.000000 3.082207
qr(sqmatrix) #QR decomposition
## $qr
## [,1] [,2]
## [1,] -3.6055513 -4.714952
## [2,] 0.8320503 -5.269652
##
## $rank
## [1] 2
##
## $qraux
## [1] 1.554700 5.269652
##
## $pivot
## [1] 1 2
##
## attr(,"class")
## [1] "qr"
colSums(sqmatrix) #Get column sums for each column of the numeric, complex and integer mode matrix
## [1] 5 8
rowSums(sqmatrix) #Get row sums for each row of the numeric, complex and integer mode matrix
## [1] 9 4
colMeans(sqmatrix) #Get column means for each column of the numeric, complex and integer mode matrix
## [1] 2.5 4.0
rowMeans(sqmatrix) #Get row means for each row of the numeric, complex and integer mode matrix
## [1] 4.5 2.0
# Get frequency distribution of numbers in matrix
table(sqmatrix)
## sqmatrix
## 1 2 3 7
## 1 1 1 1
# To find proportion by row, margin=1 and for column, margin=2
proportions(sqmatrix,margin=1)
## [,1] [,2]
## [1,] 0.2222222 0.7777778
## [2,] 0.7500000 0.2500000
proportions(sqmatrix,margin=2)
## [,1] [,2]
## [1,] 0.4 0.875
## [2,] 0.6 0.125
An array is a multi-dimensional object which is heterogeneous in
nature. It is simply a vector which is stored with additional attributes
giving the dimensions (attribute dim) and optionally names
for those dimensions (attribute dimnames). To create an
arry, array function is used. This function takes data in
vector form even when matrices are provided, it is provided by
vectorizing it. Further the dim argument takes a vector which contains
number of rows followed by number of columns and followed by number of
matrices to be printed. In the following example, all the three have
been set as 2.
myfirstarray=array(data=c(sqmatrix,secondsqmatrix),dim=c(2,2,2))
myfirstarray
## , , 1
##
## [,1] [,2]
## [1,] 2 7
## [2,] 3 1
##
## , , 2
##
## [,1] [,2]
## [1,] 10 5
## [2,] 8 12
class(myfirstarray)
## [1] "array"
is.array(myfirstarray) #check whether the given object is an array or not, the output of the function is logical true or false
## [1] TRUE
An array data structure has three indices to subset the array. First
one is row number, followed by column number and third is matrix number.
If within [ ] row and column numbers are left blank but
matrix number is given then it will return the matrix for which number
was provided. When row number is given but column and matrix number are
left blank then that particular row from each matrix with all columns
would be returned.But in such case rows would be printed as columns of a
new matrix. But when only column number is specified then resulting new
matrix will have selected column of each matrix as its column.
# To access the first matrix with all rows and columns
myfirstarray[,,2]
## [,1] [,2]
## [1,] 10 5
## [2,] 8 12
# To access the first row of all matrices in array with all columns
myfirstarray[1,,]
## [,1] [,2]
## [1,] 2 10
## [2,] 7 5
# To access the first column of all matrices in array with all rows
myfirstarray[,1,]
## [,1] [,2]
## [1,] 2 10
## [2,] 3 8
# To access second row from first column in first matrix
myfirstarray[2,1,1]
## [1] 3
Data frame is a data structure which consists of vectors as columns
and each vector should have a unique name. Thus, data frame is a
heterogeneous data structure. Column names should not be left empty.
Usually, a data frame has rows as observations and columns as variables.
Data frame can be created by using the function data.frame.
Each column in data frame has a separate mode.
Assume some hypothetical data regarding persons and create a data frame.
Name<-c("AKS","AS",
"SBN","GKV",
"NK","DR","HOS",
"AV")
Age<-c(64,62,63,32,
64,49,64,42)
Gender<-c(rep("M",7),"F")
Education<-c(rep("Ph.D.",3),"M.Sc.",rep("Ph.D.",4))
mydept=data.frame(Name,Age,Gender,Education)
mydept # print data frame
## Name Age Gender Education
## 1 AKS 64 M Ph.D.
## 2 AS 62 M Ph.D.
## 3 SBN 63 M Ph.D.
## 4 GKV 32 M M.Sc.
## 5 NK 64 M Ph.D.
## 6 DR 49 M Ph.D.
## 7 HOS 64 M Ph.D.
## 8 AV 42 F Ph.D.
#View data frame object in a separate window which gives spreadsheet like feel
View(mydept)
is.data.frame(mydept) # check whether the object is a data frame, output is logical True or false
## [1] TRUE
class(mydept)#check the class of the object
## [1] "data.frame"
tail(mydept) # tail function provides the last 6 rows of the data frame by default
## Name Age Gender Education
## 3 SBN 63 M Ph.D.
## 4 GKV 32 M M.Sc.
## 5 NK 64 M Ph.D.
## 6 DR 49 M Ph.D.
## 7 HOS 64 M Ph.D.
## 8 AV 42 F Ph.D.
tail(mydept,3) # In case only last three rows needs to be seen then we have to specify to the tail function
## Name Age Gender Education
## 6 DR 49 M Ph.D.
## 7 HOS 64 M Ph.D.
## 8 AV 42 F Ph.D.
head(mydept) # head function by default provides first 6 rows of the data frame
## Name Age Gender Education
## 1 AKS 64 M Ph.D.
## 2 AS 62 M Ph.D.
## 3 SBN 63 M Ph.D.
## 4 GKV 32 M M.Sc.
## 5 NK 64 M Ph.D.
## 6 DR 49 M Ph.D.
head(mydept,3) #In case only first three rows needs to be seen then we have to specify to the head function
## Name Age Gender Education
## 1 AKS 64 M Ph.D.
## 2 AS 62 M Ph.D.
## 3 SBN 63 M Ph.D.
attributes(mydept) #use attribute function to check the attributes ofthe data frame
## $names
## [1] "Name" "Age" "Gender" "Education"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3 4 5 6 7 8
row.names(mydept)<-c("A Shrivastav",
"A Sarawgi",
"S B Nahatkar",
"GK Vani",
"N Khan",
"D Rathi", "HO Sharma","A Verma") #Use function rownames or row.names to set the rownames for data frame
mydept
## Name Age Gender Education
## A Shrivastav AKS 64 M Ph.D.
## A Sarawgi AS 62 M Ph.D.
## S B Nahatkar SBN 63 M Ph.D.
## GK Vani GKV 32 M M.Sc.
## N Khan NK 64 M Ph.D.
## D Rathi DR 49 M Ph.D.
## HO Sharma HOS 64 M Ph.D.
## A Verma AV 42 F Ph.D.
View window for command View(mydept)
Indexing involves using row and column names/numbers to access data from data frame.
# Use row and column names to access dataframe elements
mydept["A Shrivastav","Age"]
## [1] 64
Providing only row names and leaving column name blank would return the whole row corresponding to the given row name
mydept["A Shrivastav",]
## Name Age Gender Education
## A Shrivastav AKS 64 M Ph.D.
Instead of row and column names we can use row and column numbers as well.
mydept[2,3]
## [1] "M"
mydept[3,1]
## [1] "SBN"
By using dollar sign after data frame name we can access columns by their name
mydept$Name
## [1] "AKS" "AS" "SBN" "GKV" "NK" "DR" "HOS" "AV"
To add the additional column to data frame, we should write the name of new column after dollar sign and should assign it a new vector of values or character
mydept$Income=c(1000000,200000,300000,400000,500000,600000,700000,800000)
head(mydept)
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
To add one or more row to the data frame using index features and assign the new row index with a data frame that contains the values for all columns existing in the data frame for equal number of rows added.
mydept[9,]=data.frame(Name="ASR",Age=29,Gender="F",Education="M.Sc.",Income=200000)
row.names(mydept)[9] = "AS Rajpoot"
mydept
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
# To filter all the rows in data frame where Age was less than 50 and Gender was female (F)
mydept[mydept$Age<50 & mydept$Gender=="F",]
## Name Age Gender Education Income
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
summary(mydept) #To get column wise summary of data frame
## Name Age Gender Education
## Length:9 Min. :29.00 Length:9 Length:9
## Class :character 1st Qu.:42.00 Class :character Class :character
## Mode :character Median :62.00 Mode :character Mode :character
## Mean :52.11
## 3rd Qu.:64.00
## Max. :64.00
## Income
## Min. : 200000
## 1st Qu.: 300000
## Median : 500000
## Mean : 522222
## 3rd Qu.: 700000
## Max. :1000000
Filtering can also be done by using the subset
function.
subset(mydept,Income>400000)
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
subset(mydept,Income<400000 & Age>60)
## Name Age Gender Education Income
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
#Table function generates summary over categorical data like male/female, treatment/control, or yes/no
table(mydept$Gender)
##
## F M
## 2 7
# table function can provide contingency tables too for categorical data
table(mydept$Gender,mydept$Education)
##
## M.Sc. Ph.D.
## F 1 1
## M 1 6
proportions(table(mydept$Gender,mydept$Education),margin=1) #margin=1 for row
##
## M.Sc. Ph.D.
## F 0.5000000 0.5000000
## M 0.1428571 0.8571429
proportions(table(mydept$Gender,mydept$Education),margin=2) #margin=2 for column
##
## M.Sc. Ph.D.
## F 0.5000000 0.1428571
## M 0.5000000 0.8571429
The dplyr package in system library of R is
called as grammer of data manipulation in R. It has several functions
for manipulation of data. To merge data frames, there are six types of
merge function in this package. Each one of these is described
below:
inner_join() function joins two data frames which
contains only matched columns and rows from both data frames.
full_join() function joins two data frames
regardless of matching, i.e., contains all columns and rows from both
data frames.
left_join() function joins two data frames only for
the columns matched from second data frame and all columns and rows of
first data frame. The first data frame is left and second is
right.
right_join() function joins two data frames only for
the columns matched from first data frame and all columns and rows of
second data frame.
semi_join() function provides output of columns and
rows matched in left data frame.
anti_join() function doesn’t join anything. It only
provides unmatched observations from two data frames.
The inner_join(), left_join(),
right_join() and full_join() are all examples
of mutating joins wherein columns from right data frame to left data
frame. If there are multiple row match in rigth data frame for a row in
left data frame then all rows in right data frame would be returned for
each unique row in left data frame.
The semi_join() and anti_join() are
examples of filtering joins wherein rows from left data frame are
filtered for its match in right data frame.
In addition there is a function nest_join which returns
all rows and columns in left data frame with nested columns that
contains all matched columns from right data frame. This differs from
left_join in that new column names wouldn’t be set in
former but in latter.
Now lets create another data frame which has some members of
mydept data frame with information on some other
variables.
mydata_1<-data.frame(Name=c("AKS","GKV"),Skills=c("Oratory","Data Analysis"),Age=c(64,32))
library(dplyr) ## loading dplyr package for using functions inbuilt in it
## perform inner join
inner_join(mydept,mydata_1,by="Name")
## Name Age.x Gender Education Income Skills Age.y
## 1 AKS 64 M Ph.D. 1e+06 Oratory 64
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis 32
inner_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis
## perform full join
full_join(mydept,mydata_1,by="Name") ## join with single matching key
## Name Age.x Gender Education Income Skills Age.y
## 1 AKS 64 M Ph.D. 1e+06 Oratory 64
## 2 AS 62 M Ph.D. 2e+05 <NA> NA
## 3 SBN 63 M Ph.D. 3e+05 <NA> NA
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis 32
## 5 NK 64 M Ph.D. 5e+05 <NA> NA
## 6 DR 49 M Ph.D. 6e+05 <NA> NA
## 7 HOS 64 M Ph.D. 7e+05 <NA> NA
## 8 AV 42 F Ph.D. 8e+05 <NA> NA
## 9 ASR 29 F M.Sc. 2e+05 <NA> NA
full_join(mydept,mydata_1,by=c("Name","Age")) ## join with two matching keys
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 AS 62 M Ph.D. 2e+05 <NA>
## 3 SBN 63 M Ph.D. 3e+05 <NA>
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis
## 5 NK 64 M Ph.D. 5e+05 <NA>
## 6 DR 49 M Ph.D. 6e+05 <NA>
## 7 HOS 64 M Ph.D. 7e+05 <NA>
## 8 AV 42 F Ph.D. 8e+05 <NA>
## 9 ASR 29 F M.Sc. 2e+05 <NA>
## perform left join
left_join(mydept,mydata_1,by="Name")
## Name Age.x Gender Education Income Skills Age.y
## 1 AKS 64 M Ph.D. 1e+06 Oratory 64
## 2 AS 62 M Ph.D. 2e+05 <NA> NA
## 3 SBN 63 M Ph.D. 3e+05 <NA> NA
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis 32
## 5 NK 64 M Ph.D. 5e+05 <NA> NA
## 6 DR 49 M Ph.D. 6e+05 <NA> NA
## 7 HOS 64 M Ph.D. 7e+05 <NA> NA
## 8 AV 42 F Ph.D. 8e+05 <NA> NA
## 9 ASR 29 F M.Sc. 2e+05 <NA> NA
left_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 AS 62 M Ph.D. 2e+05 <NA>
## 3 SBN 63 M Ph.D. 3e+05 <NA>
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis
## 5 NK 64 M Ph.D. 5e+05 <NA>
## 6 DR 49 M Ph.D. 6e+05 <NA>
## 7 HOS 64 M Ph.D. 7e+05 <NA>
## 8 AV 42 F Ph.D. 8e+05 <NA>
## 9 ASR 29 F M.Sc. 2e+05 <NA>
## perform nest join
nest_join(mydept,mydata_1,by="Name")
## Name Age Gender Education Income mydata_1
## 1 AKS 64 M Ph.D. 1e+06 Oratory, 64
## 2 AS 62 M Ph.D. 2e+05
## 3 SBN 63 M Ph.D. 3e+05
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis, 32
## 5 NK 64 M Ph.D. 5e+05
## 6 DR 49 M Ph.D. 6e+05
## 7 HOS 64 M Ph.D. 7e+05
## 8 AV 42 F Ph.D. 8e+05
## 9 ASR 29 F M.Sc. 2e+05
nest_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income mydata_1
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 AS 62 M Ph.D. 2e+05
## 3 SBN 63 M Ph.D. 3e+05
## 4 GKV 32 M M.Sc. 4e+05 Data Analysis
## 5 NK 64 M Ph.D. 5e+05
## 6 DR 49 M Ph.D. 6e+05
## 7 HOS 64 M Ph.D. 7e+05
## 8 AV 42 F Ph.D. 8e+05
## 9 ASR 29 F M.Sc. 2e+05
## perform right join
right_join(mydept,mydata_1,by="Name")
## Name Age.x Gender Education Income Skills Age.y
## 1 AKS 64 M Ph.D. 1e+06 Oratory 64
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis 32
right_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis
## perform semi join
semi_join(mydept,mydata_1,by="Name")
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## GK Vani GKV 32 M M.Sc. 4e+05
semi_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## GK Vani GKV 32 M M.Sc. 4e+05
## perform anti join
anti_join(mydept,mydata_1,by="Name")
## Name Age Gender Education Income
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
anti_join(mydept,mydata_1,by=c("Name","Age"))
## Name Age Gender Education Income
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
Alternatively, one can use merge function from
base pacakge also. The base package is loaded
by default in R. This function unlike those in
dplyr package sorts the data frame columns by default. In
case one doesn’t wish to sort columns then additional argument
sort should be set equal to FALSE .
##left join with merge function
merge(mydept,mydata_1,all.x = T,all.y = F,by="Name")
## Name Age.x Gender Education Income Skills Age.y
## 1 AKS 64 M Ph.D. 1e+06 Oratory 64
## 2 AS 62 M Ph.D. 2e+05 <NA> NA
## 3 ASR 29 F M.Sc. 2e+05 <NA> NA
## 4 AV 42 F Ph.D. 8e+05 <NA> NA
## 5 DR 49 M Ph.D. 6e+05 <NA> NA
## 6 GKV 32 M M.Sc. 4e+05 Data Analysis 32
## 7 HOS 64 M Ph.D. 7e+05 <NA> NA
## 8 NK 64 M Ph.D. 5e+05 <NA> NA
## 9 SBN 63 M Ph.D. 3e+05 <NA> NA
merge(mydept,mydata_1,all.x = T,all.y = F,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 AS 62 M Ph.D. 2e+05 <NA>
## 3 ASR 29 F M.Sc. 2e+05 <NA>
## 4 AV 42 F Ph.D. 8e+05 <NA>
## 5 DR 49 M Ph.D. 6e+05 <NA>
## 6 GKV 32 M M.Sc. 4e+05 Data Analysis
## 7 HOS 64 M Ph.D. 7e+05 <NA>
## 8 NK 64 M Ph.D. 5e+05 <NA>
## 9 SBN 63 M Ph.D. 3e+05 <NA>
##right join with merge function
merge(mydept,mydata_1,all.x = F,all.y = T,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis
##full join with merge function
merge(mydept,mydata_1,all.x = T,all.y = T,by=c("Name","Age"))
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 AS 62 M Ph.D. 2e+05 <NA>
## 3 ASR 29 F M.Sc. 2e+05 <NA>
## 4 AV 42 F Ph.D. 8e+05 <NA>
## 5 DR 49 M Ph.D. 6e+05 <NA>
## 6 GKV 32 M M.Sc. 4e+05 Data Analysis
## 7 HOS 64 M Ph.D. 7e+05 <NA>
## 8 NK 64 M Ph.D. 5e+05 <NA>
## 9 SBN 63 M Ph.D. 3e+05 <NA>
## for not sorting the joint data frame
merge(mydept,mydata_1,all.x = T,all.y = T,by=c("Name","Age"),sort = FALSE)
## Name Age Gender Education Income Skills
## 1 AKS 64 M Ph.D. 1e+06 Oratory
## 2 GKV 32 M M.Sc. 4e+05 Data Analysis
## 3 NK 64 M Ph.D. 5e+05 <NA>
## 4 AS 62 M Ph.D. 2e+05 <NA>
## 5 SBN 63 M Ph.D. 3e+05 <NA>
## 6 AV 42 F Ph.D. 8e+05 <NA>
## 7 ASR 29 F M.Sc. 2e+05 <NA>
## 8 DR 49 M Ph.D. 6e+05 <NA>
## 9 HOS 64 M Ph.D. 7e+05 <NA>
To append data frames in R usually rbind()
function is used but it is not an efficient one. Instead, users are
suggested to use bind_rows() function from
dplyr package.
mydat_2=data.frame(Name=c("SK","RK"),Age=c(37,35),Gender=c("M","M"),Education=c("SSLC","SSLC"),Income=c(25000,30000))
row.names(mydat_2)<-c("Sanjay","Rajesh")
library(dplyr)
bind_rows(mydept,mydat_2)
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1000000
## A Sarawgi AS 62 M Ph.D. 200000
## S B Nahatkar SBN 63 M Ph.D. 300000
## GK Vani GKV 32 M M.Sc. 400000
## N Khan NK 64 M Ph.D. 500000
## D Rathi DR 49 M Ph.D. 600000
## HO Sharma HOS 64 M Ph.D. 700000
## A Verma AV 42 F Ph.D. 800000
## AS Rajpoot ASR 29 F M.Sc. 200000
## Sanjay SK 37 M SSLC 25000
## Rajesh RK 35 M SSLC 30000
A data frame can be either in long or wide format. Wide format refers to subjects or units in rows and repeated measurement or yearly data on subjects or units in columns of the corresponding row. while long format refers to subject /unit wise data over the years or repeated measurements entered in separate rows.
mydata_2<-data.frame(Subject=rep(LETTERS[1:7],2),time=c(rep(1,7),rep(2,7)),Value=runif(14))
library(dplyr)
# long to wide (direction = "wide") requires idvar and timevar at a minimum
reshape(mydata_2, direction = "wide", idvar = "Subject", timevar = "time")
## Subject Value.1 Value.2
## 1 A 0.147468568 0.93928864
## 2 B 0.027665727 0.40437964
## 3 C 0.977817373 0.72287824
## 4 D 0.047472843 0.82415125
## 5 E 0.007562183 0.44510712
## 6 F 0.530911993 0.46650597
## 7 G 0.382360465 0.07535331
## can also explicitly specify name of combined variable
wide <- reshape(mydata_2, direction = "wide", idvar = "Subject",timevar = "time", v.names = "Value", sep= "_")
wide
## Subject Value_1 Value_2
## 1 A 0.147468568 0.93928864
## 2 B 0.027665727 0.40437964
## 3 C 0.977817373 0.72287824
## 4 D 0.047472843 0.82415125
## 5 E 0.007562183 0.44510712
## 6 F 0.530911993 0.46650597
## 7 G 0.382360465 0.07535331
## reverse transformation
reshape(wide, direction = "long")
## Subject time Value
## A.1 A 1 0.147468568
## B.1 B 1 0.027665727
## C.1 C 1 0.977817373
## D.1 D 1 0.047472843
## E.1 E 1 0.007562183
## F.1 F 1 0.530911993
## G.1 G 1 0.382360465
## A.2 A 2 0.939288641
## B.2 B 2 0.404379635
## C.2 C 2 0.722878236
## D.2 D 2 0.824151246
## E.2 E 2 0.445107116
## F.2 F 2 0.466505970
## G.2 G 2 0.075353314
reshape(wide, idvar = "Subject", varying = list(2:3), v.names = "Value", direction = "long")
## Subject time Value
## A.1 A 1 0.147468568
## B.1 B 1 0.027665727
## C.1 C 1 0.977817373
## D.1 D 1 0.047472843
## E.1 E 1 0.007562183
## F.1 F 1 0.530911993
## G.1 G 1 0.382360465
## A.2 A 2 0.939288641
## B.2 B 2 0.404379635
## C.2 C 2 0.722878236
## D.2 D 2 0.824151246
## E.2 E 2 0.445107116
## F.2 F 2 0.466505970
## G.2 G 2 0.075353314
Data frame can be used as data entry spreadsheet by using
edit function.
# create an empty data frame first
mydata=data.frame(0)
# Now use edit function and save the output of edited data frame with same object name
mydata=edit(mydata)
# Now to view updated mydata data frame use view function
View(mydata)
# edit function with data frame would open an text editor
mydept=edit(mydept)
A view of data editor window.
Screen shot of edit function on mydept data frame
If the user uses edit function without providing object
name/argument then it will open the temporary file in the memory. The
emacs , xedit , vi ,
pico and xemacs function would also open the
text editor for the object provided. The data.entry ,
dataentry and de function will open the
spreadsheet like window for editing the data.
data.entry function in RStudio
A list is the generic/generalized vector. It is an multidimensional heterogeneous data structure. It can contain within it different types of data structures like vector, data frame, factor, and matrix.
To create a list one can use either vector function or
list function. vector function takes two
argument as used before, mode of the vector and length of the
vector.
myfirstveclist=vector("list",length=2) # creates an empty list
myfirstveclist
## [[1]]
## NULL
##
## [[2]]
## NULL
class(myfirstveclist)
## [1] "list"
mode(myfirstveclist)
## [1] "list"
typeof(myfirstveclist)
## [1] "list"
is.list(myfirstveclist) # check whether the object is a list or not, the output of this function is a logical True or false
## [1] TRUE
myfirstlist=list(mydataframe=mydept,
myfirstvector=myfirstcharactervec,
myfirstmatrix=myfirstmatrix)
myfirstlist
## $mydataframe
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
##
## $myfirstvector
## [1] "RJ" "AP" "KT" "TN"
##
## $myfirstmatrix
## col1 col2
## Row1 12 25
## Row2 13 44
## Row3 14 56
attributes(myfirstlist)
## $names
## [1] "mydataframe" "myfirstvector" "myfirstmatrix"
class(myfirstlist)
## [1] "list"
To access the various elements or objects stored in a list,
[[]] are used instead of [] used in data
frame, matrix, array and vector.
# Accessing the elements in list by their name
myfirstlist$mydataframe
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
# Accessing the first and second element in list
myfirstlist[[1]]
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
myfirstlist[[2]]
## [1] "RJ" "AP" "KT" "TN"
A factor is an ordered vector. In a character vector there exists no order among the elements of the vector. But factor data structure provides levels to it.
factor function takes a vector as its argument to create
a factor object and level argument.
education=c("Illiterate","Primary","Middle","High School","Higher Secondary","Graduate","Post Graduate","Doctorate","Post Doctorate")
class(education)
## [1] "character"
literacy=c("Post Doctorate","Graduate","Primary","Higher Secondary",
"Illiterate","Graduate","Post Doctorate","Doctorate","Graduate","High School","Middle","Higher Secondary")
table(literacy)
## literacy
## Doctorate Graduate High School Higher Secondary
## 1 3 1 2
## Illiterate Middle Post Doctorate Primary
## 1 1 2 1
By creating a frequency distribution table before factoring vector literacy, it clearly visible that literacy levels are in alphabetical order.
literacy= factor(x=literacy, levels = education )
literacy
## [1] Post Doctorate Graduate Primary Higher Secondary
## [5] Illiterate Graduate Post Doctorate Doctorate
## [9] Graduate High School Middle Higher Secondary
## 9 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
table(literacy)
## literacy
## Illiterate Primary Middle High School
## 1 1 1 1
## Higher Secondary Graduate Post Graduate Doctorate
## 2 3 0 1
## Post Doctorate
## 2
By creating a frequency distribution table after factoring vector literacy, it is clearly visible that literacy levels are now in ascending order.
Only == and != operators can be used for
factors. A factor can only be compared to another factor with an
identical set of levels (not necessarily in the same ordering) or to a
character vector. Ordered factors are compared in the same way, but the
general dispatch mechanism precludes comparing ordered and unordered
factors.
Sub setting the factor involves dropping the unused levels too.
# Only second and third element are retained, rest are dropped
factor(literacy)[2:3,drop=T]
## [1] Graduate Primary
## Levels: Primary Graduate
# Retain only 1,4,8 and 9th observation with respective levels
factor(literacy)[c(1,4,8,9),drop=T]
## [1] Post Doctorate Higher Secondary Doctorate Graduate
## Levels: Higher Secondary Graduate Doctorate Post Doctorate
# Subset the factor after dropping "Doctorate" but not its level
subset(literacy,literacy!="Doctorate")
## [1] Post Doctorate Graduate Primary Higher Secondary
## [5] Illiterate Graduate Post Doctorate Graduate
## [9] High School Middle Higher Secondary
## 9 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
# first drop levels
newfactor=droplevels(literacy,exclude = "Doctorate")
# check the new factor and we find in place of Doctorate "NA" is there
newfactor
## [1] Post Doctorate Graduate Primary Higher Secondary
## [5] Illiterate Graduate Post Doctorate <NA>
## [9] Graduate High School Middle Higher Secondary
## 7 Levels: Illiterate Primary Middle High School Higher Secondary ... Post Doctorate
levels(newfactor)
## [1] "Illiterate" "Primary" "Middle" "High School"
## [5] "Higher Secondary" "Graduate" "Post Doctorate"
Upon check for levels and we find that not just “Doctorate” but also “Post Graduate” has been dropped because this level was not being used.
Now lets add one new column to data frame object mydept
as factor with the help of transform function.
mydept=transform(mydept,Position=factor(c("P","HOD","P","AP","P","P","P","AP","SRF"),levels = c("SRF","AP","P","HOD")))
levels(mydept$Position)
## [1] "SRF" "AP" "P" "HOD"
mydept
## Name Age Gender Education Income Position
## A Shrivastav AKS 64 M Ph.D. 1e+06 P
## A Sarawgi AS 62 M Ph.D. 2e+05 HOD
## S B Nahatkar SBN 63 M Ph.D. 3e+05 P
## GK Vani GKV 32 M M.Sc. 4e+05 AP
## N Khan NK 64 M Ph.D. 5e+05 P
## D Rathi DR 49 M Ph.D. 6e+05 P
## HO Sharma HOS 64 M Ph.D. 7e+05 P
## A Verma AV 42 F Ph.D. 8e+05 AP
## AS Rajpoot ASR 29 F M.Sc. 2e+05 SRF
Various data structures can be converted to another data structure.
# To convert any data structure to data frame
as.data.frame(sqmatrix)
## V1 V2
## 1 2 7
## 2 3 1
# To convert any data structure to matrix
as.matrix(mydept)
## Name Age Gender Education Income Position
## A Shrivastav "AKS" "64" "M" "Ph.D." "1e+06" "P"
## A Sarawgi "AS" "62" "M" "Ph.D." "2e+05" "HOD"
## S B Nahatkar "SBN" "63" "M" "Ph.D." "3e+05" "P"
## GK Vani "GKV" "32" "M" "M.Sc." "4e+05" "AP"
## N Khan "NK" "64" "M" "Ph.D." "5e+05" "P"
## D Rathi "DR" "49" "M" "Ph.D." "6e+05" "P"
## HO Sharma "HOS" "64" "M" "Ph.D." "7e+05" "P"
## A Verma "AV" "42" "F" "Ph.D." "8e+05" "AP"
## AS Rajpoot "ASR" "29" "F" "M.Sc." "2e+05" "SRF"
# To convert any data structure to list
as.list(mydept)
## $Name
## [1] "AKS" "AS" "SBN" "GKV" "NK" "DR" "HOS" "AV" "ASR"
##
## $Age
## [1] 64 62 63 32 64 49 64 42 29
##
## $Gender
## [1] "M" "M" "M" "M" "M" "M" "M" "F" "F"
##
## $Education
## [1] "Ph.D." "Ph.D." "Ph.D." "M.Sc." "Ph.D." "Ph.D." "Ph.D." "Ph.D." "M.Sc."
##
## $Income
## [1] 1e+06 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05 8e+05 2e+05
##
## $Position
## [1] P HOD P AP P P P AP SRF
## Levels: SRF AP P HOD
as.list(sqmatrix)
## [[1]]
## [1] 2
##
## [[2]]
## [1] 3
##
## [[3]]
## [1] 7
##
## [[4]]
## [1] 1
# To convert any data structure to array
listtoarray=as.array(myfirstlist)
str(listtoarray)
## List of 3
## $ mydataframe :'data.frame': 9 obs. of 5 variables:
## ..$ Name : chr [1:9] "AKS" "AS" "SBN" "GKV" ...
## ..$ Age : num [1:9] 64 62 63 32 64 49 64 42 29
## ..$ Gender : chr [1:9] "M" "M" "M" "M" ...
## ..$ Education: chr [1:9] "Ph.D." "Ph.D." "Ph.D." "M.Sc." ...
## ..$ Income : num [1:9] 1e+06 2e+05 3e+05 4e+05 5e+05 6e+05 7e+05 8e+05 2e+05
## $ myfirstvector: chr [1:4] "RJ" "AP" "KT" "TN"
## $ myfirstmatrix: num [1:3, 1:2] 12 13 14 25 44 56
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "Row1" "Row2" "Row3"
## .. ..$ : chr [1:2] "col1" "col2"
## - attr(*, "dim")= int 3
## - attr(*, "dimnames")=List of 1
## ..$ : chr [1:3] "mydataframe" "myfirstvector" "myfirstmatrix"
# To convert to factor data structure
as.factor(age)
## A B C D E F G H
## 35 23 31 34 1 0.5 47 56
## Levels: 0.5 1 23 31 34 35 47 56
Control structures in programming language are meant to allow programmer to control the way code chunk is executed. These are logical statements or loops which execute the code or stop its execution when certain conditions are met with. The use of loops simplifies the task at hand by repeating a task finite number of times as specified. This helps in avoiding the mere repetition of simple task.
A if statement evaluates the condition and when the condition provided is met with then it executes the expression. Expression provided for execution can be simple or complex. If statement merely does the task if certain condition is met but not otherwise.
if(condition) {expression}
x=9
if(x>0){
print("Positive number")
}
## [1] "Positive number"
x=-9
if(x>0){
print("Positive number")
}
# It will not return any thing because it only prints when condition X>0 is met with.
If you forget to put brackets either end of {} around
the expression then R console will not print output and
+ will be shown constantly and new line with prompt
operator will not appear until the missing brackets are put at the
appropriate place. In case + sing constantly appear but you
want to stop at that point then press esc button until
prompt operator appears in console.
A view of R console when missing brackets and inverted comma
To overcome the limitation with if statement that it does not allow for proceeding further when given condition is not met with, if-else statement is used.
if (condition) {expression} else {alternative expression}
x=9
if(x>0){
print("Positive number")
} else {
print("Negative or zero number")
}
## [1] "Positive number"
x=-3
if(x>0){
print("Positive number")
} else {
print("Negative or zero number")
}
## [1] "Negative or zero number"
Many times it is simply not enough to have simple if -else statement. In the case of #ifstatement it can be observed that it was not possible further to classify number either as zero or negative number. In such cases we need nested if-else statement. Nested means one if-else statement within if-else statement.
x=9
if(x>0){
print("Positive number")
} else {if(x<0){
print("Negative number")
} else {
print("Zero")
}
}
## [1] "Positive number"
x=0
if(x>0){
print("Positive number")
} else {if(x<0){
print("Negative number")
} else {
print("Zero")
}
}
## [1] "Zero"
x=-4
if(x>0){
print("Positive number")
} else {if(x<0){
print("Negative number")
} else {
print("Zero")
}
}
## [1] "Negative number"
A series of nested if-else statements can be used as shown below:
insect="YSB"
if(insect=="YSB"){
print("Management Practices:Clipping of tips of leaves")
} else {
if(insect=="YMV"){
print("Management Practices:Sprinkler Irrigation is suggested")
} else {
print("The named insect is not in our directory")
}
}
## [1] "Management Practices:Clipping of tips of leaves"
The switch function from base package is a good alternative to using if-else statement or nested if else statement. It evaluates condition and depending on that it chooses the alternative from list of alternatives provided.
switch (expression, list)
# with chage in number the output of function will change
switch(1,"grasshopper","praying mentid","dragonfly")
## [1] "grasshopper"
switch(2,"grasshopper","praying mentid","dragonfly")
## [1] "praying mentid"
switch(3,"grasshopper","praying mentid","dragonfly")
## [1] "dragonfly"
# case of character string as expression with list
switch("democracy","democracy"=list("India","ShriLanka","USA"),
"autocracy"=list("China","Russia"))
## [[1]]
## [1] "India"
##
## [[2]]
## [1] "ShriLanka"
##
## [[3]]
## [1] "USA"
A for loop uses index values to iteratively produce output when index values are applied to a function or data structure in expression.
for(variable in sequence) {expression}
# Examples of for loop
# seq_along function creates a sequence of length equal to object length
for(i in seq_along(mydept$Name)){
print(mydept$Name[i])
}
## [1] "AKS"
## [1] "AS"
## [1] "SBN"
## [1] "GKV"
## [1] "NK"
## [1] "DR"
## [1] "HOS"
## [1] "AV"
## [1] "ASR"
# for printing numbers starting from 2 to 11
for(i in 1:10){
y=i+1
print(y)
}
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
## [1] 11
# for printing tables of 1 to 10
for(i in 1:10){
y=i*1:10
print(y)
}
## [1] 1 2 3 4 5 6 7 8 9 10
## [1] 2 4 6 8 10 12 14 16 18 20
## [1] 3 6 9 12 15 18 21 24 27 30
## [1] 4 8 12 16 20 24 28 32 36 40
## [1] 5 10 15 20 25 30 35 40 45 50
## [1] 6 12 18 24 30 36 42 48 54 60
## [1] 7 14 21 28 35 42 49 56 63 70
## [1] 8 16 24 32 40 48 56 64 72 80
## [1] 9 18 27 36 45 54 63 72 81 90
## [1] 10 20 30 40 50 60 70 80 90 100
Like nested if-else statements, nested for loops are also required on some occasions.
for(i in 1:3){
for(j in 1:5){
print(i*j)
}
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 2
## [1] 4
## [1] 6
## [1] 8
## [1] 10
## [1] 3
## [1] 6
## [1] 9
## [1] 12
## [1] 15
The while loop executes the expression till it finds that the given condition is fulfilled. In case a condition is never fulfilled then while loop will repeat the expression infinitely till it is interrupted manually or system RAM is fully exhausted.
while (condition) {expression}
x=1
while (x<=5) {
print(x)
x=x+1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
The repeat loop executes a statement repeated till the if statement
breaks the repeating sequence based on condition provided. The break
statement merely involves writing break within an if
statement. It is used to end the loop regardless of what kind of loop it
may be within.
x=1
repeat{
print(x)
x=x+1
if(x>5){
break
}
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
The next statement is used to skip the current iteration of a loop.
# Print numbers which when divided by 3 gives zero remainder
## remember X %% Y is used to get the remainder while X %/% Y is used for finding the quotient.
for(i in 1:10){
if(i %% 3 != 0){
next
}
print(i)
}
## [1] 3
## [1] 6
## [1] 9
R has some inbuilt functions which are present in
various packages of system library. However, users can create custom
functions in R . Users must remember that functions can
help solve the problem of repeatedly doing the same task. The syntax for
a function involves providing arguments to a function within small
brackets and within curly brackets writing functional expression which
can be any formula or loop and using return function to return object to
the console. The objects created within a function remain in the local
environment and does not appear in the global environment. To see the
user built function’s syntax, one can simply type the name of the
function without adding small brackets to its end.
name_of_the_function<-function(argument1,argument2,..etc.){
functional expression
return(object name)
}
# function with default value set to 4
func<-function(x=4) {
if(x>0){
return("Positive")
} else if(x<0){
return("Negative")
} else {
return("Zero")
}
}
# to see function's syntax
func
## function(x=4) {
## if(x>0){
## return("Positive")
## } else if(x<0){
## return("Negative")
## } else {
## return("Zero")
## }
## }
# use function with various values provided to it as arguments
func(5)
## [1] "Positive"
func(-10)
## [1] "Negative"
func(0)
## [1] "Zero"
func() # even when no argument is provided to the function it returns output due to setting default value for argument.
## [1] "Positive"
# creating a function that returns the three times object value plus one
algo1<-function(x){
z=1+3*x
return(z)
}
algo1(11)
## [1] 34
# creating a function with several arguments
algo2<-function(x,y,z){
t=4*x+5*y+z*3
return(t)
}
algo2(1,2,4)
## [1] 26
# creating function by name sumover for summing over object elements
# remember that if you create custom functions by name that exists in global environment or any package then your custom function will replace the built-in function
# for this reason sumover is chosen instead of sum
sumover <- function(x){
z=x%*%rep(1,length(x))
return(z)
}
sumover(c(2,3,5))
## [,1]
## [1,] 10
sum(c(2,3,5)) # cross check with inbuilt function
## [1] 10
sumover(1:100)
## [,1]
## [1,] 5050
# creating standard deviation function by name stdev
stdev<-function(x){
sd=sqrt(sum((x-mean(x))^2)/(length(x)-1))
return(sd)
}
stdev(c(2,8,10,12))
## [1] 4.320494
sd(c(2,8,10,12)) # crosscheck with inbuilt function for standard deviation
## [1] 4.320494
## Now to check which function is more efficient we use system.time function
system.time(stdev(1:1000000))
## user system elapsed
## 0.024 0.004 0.028
system.time(sd(1:1000000))
## user system elapsed
## 0.009 0.000 0.009
## now for sumover function against sum
system.time(sum(1:1000000))
## user system elapsed
## 0 0 0
system.time(sumover(1:1000000))
## user system elapsed
## 0.010 0.004 0.013
R has find function to find the the package
for a given function. An example to demonstrate how to find package in
which cor function is there. Name of the function must be
typed within inverted comma, single or double.
find("cor")
## [1] "package:stats"
R has collection of over 10,000 packages (Link to cran package
list)spanning over different disciplines. Packages are hosted on
repository. The default repository is CRAN (Comprehensive R
Archive Network). To download packages from repository and install it in
R , one has to use either install button in
RStudio
or use command line from terminal (in Linux)
or use syntax to execute this task.
install.packages(c("gmp", "car"),dependencies = T,repos =""https://cloud.r-project.org"" )
To use any package from library, one can load the package by
library or require function.
require(package name)
library(package name)
To directly use any function or data from any packages in
R without having to use require or
library function, one can use package name followed by two
colon,:: , followed by package or data set name.
x=c(1,2,6,8,2,10,11)
stats::t.test(x,y=NULL,mu=0) ## using t.test function from stats pacakge in R
##
## One Sample t-test
##
## data: x
## t = 3.6771, df = 6, p-value = 0.01037
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 1.911745 9.516826
## sample estimates:
## mean of x
## 5.714286
To find the file path of installed package(s) then use
find.package function.
find.package("car")
## [1] "/home/gourav/R/x86_64-pc-linux-gnu-library/4.2/car"
find.package(c("car","cluster"))
## [1] "/home/gourav/R/x86_64-pc-linux-gnu-library/4.2/car"
## [2] "/usr/lib/R/library/cluster"
R has inbuilt data set in various packages. To see the
list of packages use following code.
data()
Upon executing the above code a window will appear. A screenshot of the same is provided below.
Data sets in R
To load data from any packages installed in R use
attach function and to unload it use detach
function.
attach(cars)
detach(cars)
After using attach function, all variables in the data set can be
used without indexing. attach and detach
functions package can be used for data structures in local/global
environment. The attach function can alter the search path
and can easily conflict with existing variables in the environment,
however, it gives warnings about masking. It is also observed that after
use of the loaded data set is over users forget to detach the data set.
Therefore, it is suggested to use with function for loading
data sets as it doesn’t require detaching data set. The
with function evaluates the expression in the context of
data set provided to it.
with(data, expression,...)
with(cars,speed>25 | dist >73)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
## [49] TRUE TRUE
However, using with function is not by side effects as
its use overrides the local variables and from programmers perspective
it must be used cautiously.
To read and write data in R through RStudio
is pretty easy. Users can use the in-built facility by going to
File then to select Import Dataset and
choosing the type of file to import. There is one package in
R System Library by name foreign which
provides functions for exporting and importing data to and from various
file formats such as ‘Minitab’, ‘SPSS’ , ‘SAS’, ‘S’, ‘STATA’, ‘Weka’,
‘dbase’, and many others.
Reading SPSS files
The download.link function will download the file from
internet and save it to the destination provided and by the file name in
the destination path provided. Next, read.spss function
from package foreign will read the file from the working
directory. If file is not in the working directory make sure to provide
complete file path in argument file.
download.file("https://stats.oarc.ucla.edu/stat/data/binary.sav",destfile = "binary.sav")
data_spp=foreign::read.spss(file="binary.sav",to.data.frame = T)
## re-encoding from CP1252
head(data_spp)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
## Alternatively with haven package
data_spp=haven::read_sav(file="binary.sav")
Reading STATA file
download.file("https://stats.idre.ucla.edu/stat/stata/dae/binary.dta",destfile = "binary.dta")
data_stata=haven::read_stata(file="binary.dta")
head(data_stata)
## # A tibble: 6 × 4
## admit gre gpa rank
## <dbl> <dbl> <dbl> <dbl>
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3 2
## with foreign package
data_dta=foreign::read.dta(file="binary.dta")
head(data_dta)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
Reading and writing CSV file
user doesn’t need any package to load for reading a comma separated file.
## You can download the file and at destination in your computer decided by you and then read it as described following.
# download.file("https://stats.govt.nz/assets/Uploads/Annual-enterprise-survey/Annual-enterprise-survey-2020-financial-year-provisional/Download-data/annual-enterprise-survey-2020-financial-year-provisional-size-bands-csv.csv",destfile = "data_nz.csv")
data_nz=read.csv("data_nz.csv",sep = ",")
head(data_nz)
## year industry_code_ANZSIC industry_name_ANZSIC rme_size_grp
## 1 2011 A Agriculture, Forestry and Fishing a_0
## 2 2011 A Agriculture, Forestry and Fishing a_0
## 3 2011 A Agriculture, Forestry and Fishing a_0
## 4 2011 A Agriculture, Forestry and Fishing a_0
## 5 2011 A Agriculture, Forestry and Fishing a_0
## 6 2011 A Agriculture, Forestry and Fishing a_0
## variable value unit
## 1 Activity unit 46134 COUNT
## 2 Rolling mean employees 0 COUNT
## 3 Salaries and wages paid 279 DOLLARS(millions)
## 4 Sales, government funding, grants and subsidies 8187 DOLLARS(millions)
## 5 Total income 8866 DOLLARS(millions)
## 6 Total expenditure 7618 DOLLARS(millions)
#writing a data frame to a csv file
write.csv(mydept,file = "mydept.csv")
## now reading the file created
read.csv("mydept.csv",row.names = 1)
## Name Age Gender Education Income Position
## A Shrivastav AKS 64 M Ph.D. 1e+06 P
## A Sarawgi AS 62 M Ph.D. 2e+05 HOD
## S B Nahatkar SBN 63 M Ph.D. 3e+05 P
## GK Vani GKV 32 M M.Sc. 4e+05 AP
## N Khan NK 64 M Ph.D. 5e+05 P
## D Rathi DR 49 M Ph.D. 6e+05 P
## HO Sharma HOS 64 M Ph.D. 7e+05 P
## A Verma AV 42 F Ph.D. 8e+05 AP
## AS Rajpoot ASR 29 F M.Sc. 2e+05 SRF
Reading and Writing ASCII and RTF file
## Reading file stored in working directory
mydept=read.table("mydept.txt",sep = ",")
head(mydept)
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
write.table(mydept,file="mydept.txt",sep = ",")
## remember write.table can also write rtf file
write.table(mydept,file="mydept.rtf",sep = ",")
## how to read rtf files
read.table("mydept.rtf",sep = ",")
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
Scan function to load data from files
To scan data from any file or output of some analysis from console
last run, this function can come very handy. When scan
function is not supplied with any argument then it takes data from
console.
scan("http://www.econometrics.com/comdata/gujarati/data_2.1.shd",what = list(0,0))
## [[1]]
## [1] 6 7 8 9 10 11 12 13 14 15 16 17 18
##
## [[2]]
## [1] 4.4567 5.7700 5.9787 7.3317 7.3182 6.5844 7.8182 7.8351 11.0223
## [10] 10.6738 10.8361 13.6150 13.5310
Here, argument what is used to define the list of data
types like numeric, character, integer or others. Writing two zeros
indicate two numeric columns in the data.
However, apart from R has its own file type for saving
objects, i.e., rds and RData. While a
.rds file can save only single object, .RData
file can save multiple objects at once.
saveRDS(mydept,"mydept.rds")
readRDS("mydept.rds")
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
save(age,mydept,file = "myDept.RData")
## remove the objects saved to later test for it loads properly
rm(mydept); rm(age)
## Now load the objects and find them in environment
load("myDept.RData")
## now tyr to print them to see that they are the same objects saved in .Rdata file
mydept
## Name Age Gender Education Income
## A Shrivastav AKS 64 M Ph.D. 1e+06
## A Sarawgi AS 62 M Ph.D. 2e+05
## S B Nahatkar SBN 63 M Ph.D. 3e+05
## GK Vani GKV 32 M M.Sc. 4e+05
## N Khan NK 64 M Ph.D. 5e+05
## D Rathi DR 49 M Ph.D. 6e+05
## HO Sharma HOS 64 M Ph.D. 7e+05
## A Verma AV 42 F Ph.D. 8e+05
## AS Rajpoot ASR 29 F M.Sc. 2e+05
age
## A B C D E F G H
## 35.0 23.0 31.0 34.0 1.0 0.5 47.0 56.0
## when saving all objects in the workspace
save.image() # save the whole current workspace
x=c(2,3,5,6,7)
# Calculate arithmetic mean of the vector
mean(x)
## [1] 4.6
# calculate weighted arithmetic mean of the vector
weighted.mean(x,w=c(.3,0.1,0.1,0.05,0.45))
## [1] 4.85
# Calculate quantile values like decile, percentile and quartile, 50th percentile = 5st decile = 2nd quartile = Median
quantile(x,0.5)
## 50%
## 5
# first quartile value= 25th percentile
quantile(x,0.25)
## 25%
## 3
# third quartile value = 75th percentile
quantile(x,0.75)
## 75%
## 6
# there is no function in system library packages which calculates mode in R
# lets create a function for mode by name getmode
getmode<-function(x){
L=table(x)
z=names(which.max(L))
return(z)
}
#Now lets use getmode function
getmode(c(2,3,3,6,4,1,2,1,5))
## [1] "1"
# median function to directly calculate median values
median(x)
## [1] 5
#box plot to see all position values in one box-whisker plot
boxplot.stats(c(2,3,3,6,4,1,2,1,5))
## $stats
## [1] 1 2 3 4 6
##
## $n
## [1] 9
##
## $conf
## [1] 1.946667 4.053333
##
## $out
## numeric(0)
# to calculate Tukey's five numbers summary of the data
fivenum(x)
## [1] 2 3 5 6 7
# summarize the data with summary function
summary(x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 3.0 5.0 4.6 6.0 7.0
The basic sampling from a given set of numbers can be performed using
sample function in R. It has defaulted to
without replacement sampling.
sample(x=1:500,size = 10,replace = T)
## [1] 117 232 328 99 191 477 62 289 409 357
To be able to reproduce the same set of randome numbers, set.seed is
supplied along with sample function.
sample(x=1:500,size = 10,replace = T,set.seed(123))
## [1] 415 463 179 14 195 426 306 118 299 229
sample(x=1:500,size = 10,replace = T,set.seed(1234))
## [1] 284 336 406 101 492 111 393 133 422 400
As the set.seed argument’s value changes, random numbers
change too. To sample values from any statistical distribution the user
shall have to provide values for parameters of the distribution. To
sample from normal distribution, use rnorm
function. For normal distribution, mean and standard deviation are
parameters.
rnorm(n=10,mean=0,sd=1)
## [1] 0.1129751 1.4285520 0.9834038 -0.6224568 -0.7315360 -0.5166697
## [7] -1.7507334 0.8801042 1.3700104 -1.6873268
To sample from student’s t distribution, use
rt function. The degree of freedom is parameter for
student’s t distribution.
rt(n=10,df=6)
## [1] -0.683363732 0.971285806 0.173864595 0.005550894 -1.342436221
## [6] -1.126752596 1.215551218 0.488233793 -1.028252987 0.981762159
To sample from chisquare distribution, use
rchisq function. The degree of freedom is parameter for
chisquare distribution.
rchisq(n=10,df=6)
## [1] 2.087914 3.480708 7.266326 5.531569 2.727402 2.305482 17.209371
## [8] 4.975981 6.142541 4.405026
To sample from Fisher’s F distribution, use
rf function. The numerator and denominator degree of
freedom are parameter for F distribution.
rf(n=10,df1 = 10,df2 = 6)
## [1] 1.5218770 1.4788311 1.1300568 2.0183948 0.1943343 1.0240435 0.8167318
## [8] 0.6252728 0.8048146 0.7135741
To sample from uniform distribution, use
runif function. The minimum and maximum values are
parameter for uniform distribution.
runif(n=10,min=0,max=100)
## [1] 95.599261 2.220682 84.171063 63.244245 31.009417 74.256937 63.891131
## [8] 99.251599 12.826979 88.323958
To sample from binomial distribution, use
rbinom function. The probability of success and number of
trials are parameter for binomial distribution.
rbinom(n=10,prob = 0.5,size = 100)
## [1] 49 52 45 46 47 58 52 48 56 45
To sample from poison distribution, use
rpois function. The mean/lambda is parameter for poison
distribution.
rpois(n=10,lambda = 4)
## [1] 4 2 5 3 4 9 3 3 2 5
Shapiro-Wilk test is a popular test for testing normality of the data. Here, null hypothesis is that data follows normal distribution versus alternate hypothesis of data doesn’t follow normal distribution. If p-value is less than chosen level of significance (may be 0.05 or 0.01) then reject the null hypothesis of data following normal distribution.
## testing for normality of data
shapiro.test(x)
##
## Shapiro-Wilk normality test
##
## data: x
## W = 0.95235, p-value = 0.754
When the variable doesn’t follow normality and all values of the
variable under consideration are positive then one can use Box-Cox
transformation of the variable but it doesn’t always returns a series
which follows normality, success rate isn’t 100% in this method.
Basically, MASS package from system library has boxcox()
for this purpose. It is used for transforming the dependent variable of
the linear regression model towards normality if it doesn’t.
Box-Cox power transformation tries to select the maximum
value of \(\lambda\) such that \(y^\lambda\) follows the normal
distribution. The output of this function would be series of
x and y where x is the series of
lambda values and y is series of log-likelihood values. The
optimal \(\lambda\) values are the one
which correspond to maximum log-likelihood.
k=MASS::boxcox(x~1) ## first save output of boxcox()
kt=k$x[which.max(k$y)] ## optimal lambda value
kt
## [1] 0.8282828
shapiro.test(x^kt) # now check if transformed variable follows normal distribution
##
## Shapiro-Wilk normality test
##
## data: x^kt
## W = 0.9496, p-value = 0.7343
y=sin(x)
L=MASS::boxcox(x~y) ## Here x is a linear function of y
tt=L$x[which.max(L$y)] ## optimal lambda value
tt
## [1] 0.4242424
shapiro.test(resid(lm(x^tt~y)))
##
## Shapiro-Wilk normality test
##
## data: resid(lm(x^tt ~ y))
## W = 0.94223, p-value = 0.6817
Another transformation for percent and proportions. It is also referred to as arcsin square root transformation and angular transformation and inverse transformation.
z=runif(100)
shapiro.test(z)
##
## Shapiro-Wilk normality test
##
## data: z
## W = 0.95905, p-value = 0.003443
par(mfrow=c(1,2))
# Density plot is made to see if after tranformation data follows normal distribution
plot(density(z))
plot(density(asin(sqrt(z))))
shapiro.test(asin(sqrt(z)))
##
## Shapiro-Wilk normality test
##
## data: asin(sqrt(z))
## W = 0.98285, p-value = 0.2202
## levene's test for homogeneity of variances
x=c(2,3,3,6,4,1,2,1,5)
y=c(1,3,5,6,4,8,12,11,15)
xydata=data.frame("Values"=c(x,y), "Group"=c(rep(0,length(x)),rep(1,length(y))))
car::leveneTest(y=xydata$Values,group=xydata$Group,center="mean")
## Warning in leveneTest.default(y = xydata$Values, group = xydata$Group, center =
## "mean"): xydata$Group coerced to factor.
## Levene's Test for Homogeneity of Variance (center = "mean")
## Df F value Pr(>F)
## group 1 8.881 0.008839 **
## 16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Fligner-Killeen test for homogeneity of variances
fligner.test(x=xydata$Values,g=xydata$Group)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: xydata$Values and xydata$Group
## Fligner-Killeen:med chi-squared = 4.2516, df = 1, p-value = 0.03921
## Bartlett test for homogeneity of variances
bartlett.test(x=xydata$Values,g=xydata$Group)
##
## Bartlett test of homogeneity of variances
##
## data: xydata$Values and xydata$Group
## Bartlett's K-squared = 6.343, df = 1, p-value = 0.01178
## F test for equality of variances in two samples
var.test(x,y,alternative = "two.sided")
##
## F test to compare two variances
##
## data: x and y
## F = 0.1399, num df = 8, denom df = 8, p-value = 0.01166
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.0315561 0.6201970
## sample estimates:
## ratio of variances
## 0.1398964
## On grouped data (case of multiple groups)
sucess=c(40,20,40,10,22,56,28,47) # number of sucess in diferent trials
no._trials=c(55,20,50,29,62,66,29,52) #number of trials
prop.test(sucess,no._trials)
##
## 8-sample test for equality of proportions without continuity correction
##
## data: sucess out of no._trials
## X-squared = 94.332, df = 7, p-value < 2.2e-16
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4 prop 5 prop 6 prop 7 prop 8
## 0.7272727 1.0000000 0.8000000 0.3448276 0.3548387 0.8484848 0.9655172 0.9038462
## On single group
prop.test(47,n=100,p=0.5)
##
## 1-sample proportions test with continuity correction
##
## data: 47 out of 100, null probability 0.5
## X-squared = 0.25, df = 1, p-value = 0.6171
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.3703535 0.5719775
## sample estimates:
## p
## 0.47
##one sample t-test
t.test(x,y=NULL,mu=5,alternative = "two.sided")
##
## One Sample t-test
##
## data: x
## t = -3.4641, df = 8, p-value = 0.008516
## alternative hypothesis: true mean is not equal to 5
## 95 percent confidence interval:
## 1.668628 4.331372
## sample estimates:
## mean of x
## 3
t.test(x,y=NULL,mu=5,alternative = "greater")
##
## One Sample t-test
##
## data: x
## t = -3.4641, df = 8, p-value = 0.9957
## alternative hypothesis: true mean is greater than 5
## 95 percent confidence interval:
## 1.926389 Inf
## sample estimates:
## mean of x
## 3
t.test(x,y=NULL,mu=5,alternative = "less")
##
## One Sample t-test
##
## data: x
## t = -3.4641, df = 8, p-value = 0.004258
## alternative hypothesis: true mean is less than 5
## 95 percent confidence interval:
## -Inf 4.073611
## sample estimates:
## mean of x
## 3
## two sample t-test
t.test(x,y,mu=0,paired = F,var.equal = T,alternative = "two.sided")
##
## Two Sample t-test
##
## data: x and y
## t = -2.562, df = 16, p-value = 0.02089
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.7159197 -0.7285248
## sample estimates:
## mean of x mean of y
## 3.000000 7.222222
t.test(x,y,mu=0,paired = F,var.equal = F,alternative = "two.sided")
##
## Welch Two Sample t-test
##
## data: x and y
## t = -2.562, df = 10.195, p-value = 0.02788
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.8847773 -0.5596672
## sample estimates:
## mean of x mean of y
## 3.000000 7.222222
t.test(x,y,mu=0,paired = T,var.equal = T,alternative = "two.sided")
##
## Paired t-test
##
## data: x and y
## t = -2.5752, df = 8, p-value = 0.03286
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -8.0031350 -0.4413095
## sample estimates:
## mean difference
## -4.222222
t.test(x,y,mu=0,paired = T,var.equal = F,alternative = "two.sided")
##
## Paired t-test
##
## data: x and y
## t = -2.5752, df = 8, p-value = 0.03286
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -8.0031350 -0.4413095
## sample estimates:
## mean difference
## -4.222222
Wilcox test is non-parametric alternative to Wlech t-test. When assumptions underlying t-test (normality of data and equality of variance) are not fulfilled then wilcox test is used. It based on rank data.
One sample test
wilcox.test(x,y=NULL,mu=0)
## Warning in wilcox.test.default(x, y = NULL, mu = 0): cannot compute exact p-
## value with ties
##
## Wilcoxon signed rank test with continuity correction
##
## data: x
## V = 45, p-value = 0.008969
## alternative hypothesis: true location is not equal to 0
Two sample test
wilcox.test(x,y,paired = T,mu=0,alternative = "two.sided") ## paired test/signed rank test
## Warning in wilcox.test.default(x, y, paired = T, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with ties
## Warning in wilcox.test.default(x, y, paired = T, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: x and y
## V = 1, p-value = 0.05639
## alternative hypothesis: true location shift is not equal to 0
wilcox.test(x,y,paired = F,mu=0,alternative = "two.sided") ## non-paired test
## Warning in wilcox.test.default(x, y, paired = F, mu = 0, alternative =
## "two.sided"): cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: x and y
## W = 16.5, p-value = 0.03679
## alternative hypothesis: true location shift is not equal to 0
## Chisquare test of goodness of fit
btw=rbinom(n=100,size=100,prob = 0.6) # observed series
btk=rnorm(n=100,mean=60,sd=36) # expected series
chisq.test(btw,btk)
## Warning in chisq.test(btw, btk): Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: btw and btk
## X-squared = 2100, df = 2079, p-value = 0.3688
## chisquare test for association between two factors
mytable =as.table(rbind(c(10,7),c(22,13))) # observed data
chisq.test(mytable)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable
## X-squared = 1.0186e-30, df = 1, p-value = 1
mytable2=as.table(rbind(c(19,12),c(29,10))) ## expected data
chisq.test(mytable,mytable2)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: mytable
## X-squared = 1.0186e-30, df = 1, p-value = 1
library(dplyr)
table(mydept$Gender,mydept$Education) %>%fisher.test()
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.4167
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.03835631 543.93189281
## sample estimates:
## odds ratio
## 4.582616
# to get correlation between two data series
cor(x,y,method="pearson")
## [1] 0.01558447
cor.test(x,y,method = "pearson")
##
## Pearson's product-moment correlation
##
## data: x and y
## t = 0.041238, df = 7, p-value = 0.9683
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.6553198 0.6727433
## sample estimates:
## cor
## 0.01558447
cor(x,y,method="spearman")
## [1] -0.05063697
cor.test(x,y,method = "spearman")
## Warning in cor.test.default(x, y, method = "spearman"): Cannot compute exact p-
## value with ties
##
## Spearman's rank correlation rho
##
## data: x and y
## S = 126.08, p-value = 0.8971
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.05063697
cor(x,y,method="kendall")
## [1] 0.02901294
cor.test(x,y,method = "kendall")
## Warning in cor.test.default(x, y, method = "kendall"): Cannot compute exact p-
## value with ties
##
## Kendall's rank correlation tau
##
## data: x and y
## z = 0.106, p-value = 0.9156
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.02901294
## to get correlation matrix
normmatrix=matrix(rnorm(100),ncol=5)
cor(normmatrix)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.00000000 0.03131195 0.06883345 -0.06844665 0.07201019
## [2,] 0.03131195 1.00000000 -0.02023516 0.12929262 0.37113451
## [3,] 0.06883345 -0.02023516 1.00000000 -0.23669943 -0.52873520
## [4,] -0.06844665 0.12929262 -0.23669943 1.00000000 0.46770027
## [5,] 0.07201019 0.37113451 -0.52873520 0.46770027 1.00000000
## Canonical correlation coefficient
cancor(x,y,xcenter=T,ycenter = T) # this correlation coefficient is same as that of pearson product-movement correlation coefficient because of only two variables involved
## $cor
## [1] 0.01558447
##
## $xcoef
## [,1]
## [1,] 0.2041241
##
## $ycoef
## [,1]
## [1,] 0.07634799
##
## $xcenter
## [1] 3
##
## $ycenter
## [1] 7.222222
cancor(matrix(rnorm(100),ncol = 5),matrix(rnorm(100),ncol = 5),xcenter=T,ycenter = T)
## $cor
## [1] 0.77707078 0.61943548 0.33109628 0.31381387 0.07848493
##
## $xcoef
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.02860498 -0.224508287 0.02549674 -0.06488900 -0.006796558
## [2,] 0.14981792 0.080590107 0.05677894 -0.13345965 0.119967052
## [3,] 0.04311825 0.060642380 0.05073523 -0.20691642 -0.207465303
## [4,] 0.16437803 -0.048276393 -0.17849364 -0.01354266 -0.019688846
## [5,] -0.16839838 -0.003664575 -0.12380930 -0.08711594 0.074002466
##
## $ycoef
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.02754389 -0.20457794 0.199938700 0.32743661 0.005693569
## [2,] -0.11605644 -0.05667172 0.141964899 -0.06053659 -0.236624258
## [3,] -0.05788383 0.11812715 -0.082663530 0.13650880 -0.063810872
## [4,] -0.12905245 0.03010167 0.006015531 0.10318323 0.199379184
## [5,] 0.13015360 0.18268755 0.126331340 -0.07662970 0.102261273
##
## $xcenter
## [1] -0.41842507 -0.20901105 -0.04350534 0.01777745 -0.04364079
##
## $ycenter
## [1] -0.08832676 -0.30453652 -0.19765487 -0.05939289 0.14062205
lm_1=lm(y~x)
summary(lm_1)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.181 -3.264 -1.347 3.861 7.694
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.09722 3.45119 2.056 0.0788 .
## x 0.04167 1.01040 0.041 0.9683
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.95 on 7 degrees of freedom
## Multiple R-squared: 0.0002429, Adjusted R-squared: -0.1426
## F-statistic: 0.001701 on 1 and 7 DF, p-value: 0.9683
anova(lm_1)
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 0.042 0.0417 0.0017 0.9683
## Residuals 7 171.514 24.5020
library(cluster)
## Hierarchical clustering with cutting tree in desired number of groups
## Use dist function to calculate distance matrix between rows of data matrix
clust_1=hclust(dist(agriculture, method = "euclidean"),method = "ward.D")
summary(clust_1)
## Length Class Mode
## merge 22 -none- numeric
## height 11 -none- numeric
## order 12 -none- numeric
## labels 12 -none- character
## method 1 -none- character
## call 3 -none- call
## dist.method 1 -none- character
plot(clust_1)
## determine the membership in five group scenario
membership=cutree(clust_1,k=5)
## use membership to cut dendrogram in five rectangular shapes
rect.hclust(clust_1,k=5,membership)
## Clustering Large Applications (CLARA)
clara_1=clara(agriculture,2,metric = "euclidean")
summary(clara_1)
## Object of class 'clara' from call:
## clara(x = agriculture, k = 2, metric = "euclidean")
## Medoids:
## x y
## D 18.7 3.5
## P 7.8 17.4
## Objective function: 3.36061
## Numerical information per cluster:
## size max_diss av_diss isolation
## [1,] 8 5.423099 2.891691 0.3070127
## [2,] 4 7.430343 4.298448 0.4206469
## Average silhouette width per cluster:
## [1] 0.7078449 0.4785275
## Average silhouette width of best sample: 0.6314058
##
## Best sample:
## [1] B DK D GR E F IRL I L NL P UK
## Clustering vector:
## B DK D GR E F IRL I L NL P UK
## 1 1 1 2 2 1 2 1 1 1 2 1
##
## Silhouette plot information for best sample:
## cluster neighbor sil_width
## D 1 2 0.7928036
## B 1 2 0.7657461
## NL 1 2 0.7607978
## F 1 2 0.7459605
## L 1 2 0.7442013
## DK 1 2 0.7080748
## UK 1 2 0.6314596
## I 1 2 0.5137157
## P 2 1 0.6516167
## GR 2 1 0.5745566
## IRL 2 1 0.5153727
## E 2 1 0.1725640
##
## 66 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.649 4.357 7.987 9.594 13.250 24.035
## Metric : euclidean
## Number of objects : 12
##
## Available components:
## [1] "sample" "medoids" "i.med" "clustering" "objective"
## [6] "clusinfo" "diss" "call" "silinfo" "data"
par(mfrow=c(1,2))
plot(clara_1)
library(dplyr)
## Q-type pca
# center=T for centering data from mean, retx=T for getting rotated pca matrix of eignen vectors, scale.=T for making each variable variance equa to one, rank.= 3 for keeping 3 principal components, tol=0.1 for threshold below which components should be omitted.
prcom.1=prcomp(USArrests,center = T,retx = T,scale. = T,rank. = 3,tol = 0.1)
prcom.1$rotation
## PC1 PC2 PC3
## Murder -0.5358995 0.4181809 -0.3412327
## Assault -0.5831836 0.1879856 -0.2681484
## UrbanPop -0.2781909 -0.8728062 -0.3780158
## Rape -0.5434321 -0.1673186 0.8177779
head(prcom.1$x) ## pca scores
## PC1 PC2 PC3
## Alabama -0.9756604 1.1220012 -0.43980366
## Alaska -1.9305379 1.0624269 2.01950027
## Arizona -1.7454429 -0.7384595 0.05423025
## Arkansas 0.1399989 1.1085423 0.11342217
## California -2.4986128 -1.5274267 0.59254100
## Colorado -1.4993407 -0.9776297 1.08400162
summary(prcom.1)
## Importance of first k=3 (out of 4) components:
## PC1 PC2 PC3
## Standard deviation 1.5749 0.9949 0.59713
## Proportion of Variance 0.6201 0.2474 0.08914
## Cumulative Proportion 0.6201 0.8675 0.95664
screeplot(prcom.1)
biplot(prcom.1)
## R-type pca
# argument cor =TRUE for using correlation matrix to compute principal components
# scores=TRUE for getting principal component scores upon analysis
prcom.2=princomp(USArrests,cor = T,scores = T)
prcom.2$loadings ## pca loadings or eigen vectors
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Murder 0.536 0.418 0.341 0.649
## Assault 0.583 0.188 0.268 -0.743
## UrbanPop 0.278 -0.873 0.378 0.134
## Rape 0.543 -0.167 -0.818
##
## Comp.1 Comp.2 Comp.3 Comp.4
## SS loadings 1.00 1.00 1.00 1.00
## Proportion Var 0.25 0.25 0.25 0.25
## Cumulative Var 0.25 0.50 0.75 1.00
prcom.2$scores ## pca scores
## Comp.1 Comp.2 Comp.3 Comp.4
## Alabama 0.98556588 1.13339238 0.44426879 0.156267145
## Alaska 1.95013775 1.07321326 -2.04000333 -0.438583440
## Arizona 1.76316354 -0.74595678 -0.05478082 -0.834652924
## Arkansas -0.14142029 1.11979678 -0.11457369 -0.182810896
## California 2.52398013 -1.54293399 -0.59855680 -0.341996478
## Colorado 1.51456286 -0.98755509 -1.09500699 0.001464887
## Connecticut -1.35864746 -1.08892789 0.64325757 -0.118469414
## Delaware 0.04770931 -0.32535892 0.71863294 -0.881977637
## Florida 3.01304227 0.03922851 0.57682949 -0.096284752
## Georgia 1.63928304 1.27894240 0.34246008 1.076796812
## Hawaii -0.91265715 -1.57046001 -0.05078189 0.902806864
## Idaho -1.63979985 0.21097292 -0.25980134 -0.499104101
## Illinois 1.37891072 -0.68184119 0.67749564 -0.122021292
## Indiana -0.50546136 -0.15156254 -0.22805484 0.424665700
## Iowa -2.25364607 -0.10405407 -0.16456432 0.017555916
## Kansas -0.79688112 -0.27016470 -0.02555331 0.206496428
## Kentucky -0.75085907 0.95844029 0.02836942 0.670556671
## Louisiana 1.56481798 0.87105466 0.78348036 0.454728038
## Maine -2.39682949 0.37639158 0.06568239 -0.330459817
## Maryland 1.76336939 0.42765519 0.15725013 -0.559069521
## Massachusetts -0.48616629 -1.47449650 0.60949748 -0.179598963
## Michigan 2.10844115 -0.15539682 -0.38486858 0.102372019
## Minnesota -1.69268181 -0.63226125 -0.15307043 0.067316885
## Mississippi 0.99649446 2.39379599 0.74080840 0.215508013
## Missouri 0.69678733 -0.26335479 -0.37744383 0.225824461
## Montana -1.18545191 0.53687437 -0.24688932 0.123742227
## Nebraska -1.26563654 -0.19395373 -0.17557391 0.015892888
## Nevada 2.87439454 -0.77560020 -1.16338049 0.314515476
## New Hampshire -2.38391541 -0.01808229 -0.03685539 -0.033137338
## New Jersey 0.18156611 -1.44950571 0.76445355 0.243382700
## New Mexico 1.98002375 0.14284878 -0.18369218 -0.339533597
## New York 1.68257738 -0.82318414 0.64307509 -0.013484369
## North Carolina 1.12337861 2.22800338 0.86357179 -0.954381667
## North Dakota -2.99222562 0.59911882 -0.30127728 -0.253987327
## Ohio -0.22596542 -0.74223824 0.03113912 0.473915911
## Oklahoma -0.31178286 -0.28785421 0.01530979 0.010332321
## Oregon 0.05912208 -0.54141145 -0.93983298 -0.237780688
## Pennsylvania -0.88841582 -0.57110035 0.40062871 0.359061124
## Rhode Island -0.86377206 -1.49197842 1.36994570 -0.613569430
## South Carolina 1.32072380 1.93340466 0.30053779 -0.131466685
## South Dakota -1.98777484 0.82334324 -0.38929333 -0.109571764
## Tennessee 0.99974168 0.86025130 -0.18808295 0.652864291
## Texas 1.35513821 -0.41248082 0.49206886 0.643195491
## Utah -0.55056526 -1.47150461 -0.29372804 -0.082314047
## Vermont -2.80141174 1.40228806 -0.84126309 -0.144889914
## Virginia -0.09633491 0.19973529 -0.01171254 0.211370813
## Washington -0.21690338 -0.97012418 -0.62487094 -0.220847793
## West Virginia -2.10858541 1.42484670 -0.10477467 0.131908831
## Wisconsin -2.07971417 -0.61126862 0.13886500 0.184103743
## Wyoming -0.62942666 0.32101297 0.24065923 -0.166651801
head(prcom.2$scores)
## Comp.1 Comp.2 Comp.3 Comp.4
## Alabama 0.9855659 1.1333924 0.44426879 0.156267145
## Alaska 1.9501378 1.0732133 -2.04000333 -0.438583440
## Arizona 1.7631635 -0.7459568 -0.05478082 -0.834652924
## Arkansas -0.1414203 1.1197968 -0.11457369 -0.182810896
## California 2.5239801 -1.5429340 -0.59855680 -0.341996478
## Colorado 1.5145629 -0.9875551 -1.09500699 0.001464887
summary(prcom.2)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000
screeplot(prcom.2)
biplot(prcom.2)
factanal(x, factors, data = NULL, covmat = NULL, n.obs = NA,
subset, na.action, start = NULL,
scores = c("none", "regression", "Bartlett"),
rotation = "varimax", control = NULL, ...)
# first lets create data for analysis
m1=matrix(runif(18*6,1,6),ncol=6)
# now data is ready by object name m1
factanal(m1, factors = 3, rotation = "promax",scores = "regression")# varimax is the default
##
## Call:
## factanal(x = m1, factors = 3, scores = "regression", rotation = "promax")
##
## Uniquenesses:
## [1] 0.635 0.005 0.837 0.688 0.005 0.825
##
## Loadings:
## Factor1 Factor2 Factor3
## [1,] 0.590
## [2,] 0.982
## [3,] -0.193 -0.113 0.346
## [4,] 0.106 0.550
## [5,] 0.976 -0.102
## [6,] -0.306 -0.239 -0.198
##
## Factor1 Factor2 Factor3
## SS loadings 1.115 1.050 0.813
## Proportion Var 0.186 0.175 0.136
## Cumulative Var 0.186 0.361 0.496
##
## Factor Correlations:
## Factor1 Factor2 Factor3
## Factor1 1.0000 -0.1190 -0.0654
## Factor2 -0.1190 1.0000 0.0269
## Factor3 -0.0654 0.0269 1.0000
##
## The degrees of freedom for the model is 0 and the fit was 0.01
Plot function is a powerful function for graphics. It has several arguments which allows users to fine tune the plots.
plot(x, y = NULL, type = “p”, xlim = NULL, ylim = NULL, log = ““, main = NULL, sub = NULL, xlab = NULL, ylab = NULL, ann = par(”ann”), axes = TRUE, frame.plot = axes, panel.first = NULL, panel.last = NULL, asp = NA, xgap.axis = NA, ygap.axis = NA, …)
Possible types are
"p" for points,
"l" for lines,
"b" for both (points as well as
lines),
"c" for the lines part alone of
"b",
"o" for both
‘overplotted’,
"h" for ‘histogram’ like (or
‘high-density’) vertical lines,
"s" for stair steps,
"S" for other steps, see ‘Details’
below,
"n" for no plotting.
All other types give a warning or an error; using, e.g.,
type = "punkte" being equivalent to type = "p"
for S compatibility.
#to plot the data simply use plot function
plot(c(2,3,3,6,4,1,2,1,5))
x=1:10
y=sin(x)
## using parameter option to set various graphs in rows and column like structure
par(mfrow=c(3,3))
## plot one vector against another
plot(x,y, main="Point plot", type="p")
## now to create a line lets plot x and y
plot(x,y, main="Line plot",type = "l")
## create graph with only lines no points
plot(x,y, main="Line plot without points",type = "c")
## have both points and lines
plot(x,y, main="Plot having both point and line",type = "b")
## create overploted graph
plot(x,y, main="Over Plot",type = "o")
## create stair steps/setp graph
plot(x,y, main="Step plot",type = "s")
## using "S" insted of "s"
plot(x,y, main="Step plot with type `S`",type = "S")
## create histogram like vertical lines
plot(x,y, main="Histogram like verticle Plot ",type="h")
## Plot nothing
plot(x,y, main="Plot nothing","n")
## use scatter.smooth to plot scatter diagram along with a smooth regression line
scatter.smooth(x,y, main="Smooth Scatter plot between X and Y")
## to create scatter plot for numeric data you can use pairs function
num_matrix=matrix(rnorm(100),ncol=4)
pairs(num_matrix)
## pairs function using swiss data
pairs(swiss, panel = panel.smooth, pch = ".")
optt=tapply(mydept$Income,INDEX = list(as.factor(mydept$Gender),as.factor(mydept$Education)),FUN = "mean")
barplot(optt,col = 1:nrow(optt),legend.text = rownames(optt), args.legend = list(x = "topleft"),ylab = "Mean Income Level",xlab = "Level of Education",main = "Mean income level by edcuation and gender")
##alternatively
subb=aggregate(mydept$Income,list(mydept$Gender,mydept$Education),"mean")
subb
## Group.1 Group.2 x
## 1 F M.Sc. 200000
## 2 M M.Sc. 400000
## 3 F Ph.D. 800000
## 4 M Ph.D. 550000
colnames(subb)<-c("Gender","Education","Income")
barplot(Income~Gender+Education,subb,col = 1:2,legend.text=T,args.legend = list(x = "topleft","Gender"),ylab = "Mean Income Level",xlab = "Level of Education",main = "Mean income level by edcuation and gender")
x=rbinom(n=10000,size=1000,prob = 0.5)
hist(x, # histogram
col=6, # column color
border="black",prob = TRUE, # show densities instead of frequencies
main = "Histogram of random number from binomial distribution")
lines(density(x), # density plot
lwd = 2, # thickness of line
col = 9)
## To create histogram with class intervals such that it contains equal number of observtions then we use co.intervals function
hist(x,plot = T,breaks = co.intervals(x),freq = F)
# the co.intervals function's output is provided below
co.intervals(x)
## [,1] [,2]
## [1,] 438.5 491.5
## [2,] 482.5 497.5
## [3,] 490.5 503.5
## [4,] 496.5 509.5
## [5,] 502.5 517.5
## [6,] 508.5 558.5
#box plot to see all position values in one box-whisker plot
boxplot(c(2,3,3,6,4,1,2,1,5))
boxplot(Income~Gender,data=mydept,main="Boxplot of income by gender")
require(graphics)
coplot(Income~Age| Gender,data = mydept) ## Plot income against age but also differentiate by a factor called Gender
coplot(len ~ dose | supp, data = ToothGrowth, panel = panel.smooth,
xlab = "ToothGrowth data: length vs dose, given type of supplement")
## Q-Q plot to check for normality
qqnorm(x)
qqline(x)
par(mfrow=c(2,2))
plot(lm_1)
plot(density(rbinom(n=10000,size=1000,prob = 0.5)))
Experimental Design is plan to solve any problem or compare any factors/treatments under the control situation as much as possible. Experimental designs is a plan of layout of an experiment according to which the treatments are allocated for the plots based on theory of probability with a certain set of rules and procedure so as to obtain appropriate observations. In agricultural experiments The objects of a comparison is experiments I known as treatments. For example comparison of different crop variety yield or effect of different insecticide to control insect/pest in plant. In which treatments are applied to collect the observation or data in known as experimental unit. For agriculture experiments field or land is example of experimental unit.
Randomness is the basis on which statistics differs from mathematics. Statistics assumes variables in statistical analysis to be random in nature (also known as stochastic in nature). But random word connnotes several meaning in local parlance; to some this means absence of patten/predictability, or to some that happens accidentally. However, a more precise meaning of random in statistics corresponds to “chosen without regard to any characteristics of the individual members of the population so that each has an equal chance of being selected: random sampling” or “having a value which cannot be determined but only described probabilistically: a random variable” https://www.dictionary.com/browse/random.
The former definition states that selection must be done disregarding the characteristics of individual members of population. This is principle helps in creating several representative/comparable groups from populations for study of population with regard to either effect of some new treatment or characteristics of population itself. If the selection process does not remain random while creating groups then it can lead to biase (both selection and accidental) and groups so created from population would no longer be representative/comparable. Thus ensuring randomness in selection process is crucial to getting unbiased results. This process is the basis for application of statistical inference techniques of mean comparision.
Sometimes question also arises as to how random assignment of treatment differs from the assignment I make. For understanding this lets take a simple game. Take an unbiased coin (a fresh coin which can equally flip on both sides) and mark one side as head and other side as tail. without flipping coin write a sequence of 20 or 30 head-tail. Now if possible evaluate this sequence generated by you with the help of runs test for its randomness. It would be rare to find that some one writing suh sequence without any prior training can produce a random sequence. This is because random sequence has a property that selection of current object or subject does not depend in any way on the previous subjects selected. This is because the process generating such sequence has no memory. But, humans have very good memory and does remember a large amount of information in brain, either consciously or subconsciously. This is the reason why humans can not behave randomly. We humans behave orderly and a person behaving randomly is called mad in our societies. There is one website random.org which provides lot of games on randomness and also demonstrates why humans can not behave randomly.
Based on the above stated property of random sequence, I had created a fraud detection test
where the enumerator conducting survey work would be required to ask
subjects a sequence of fruit choice from 1 to 8. If an enumerator does
not conduct survey and fakes it then proposed test can detect
non-randomness in set of sequences entered by him/her.
In experimental studies, three principles are always adhered to
namely,
1. Randomization: As a consequence of treatments are allocated randomly, the variation caused due to soil fertility variation will be systematic in nature is eliminated completely and whatever the error that persist is of random in nature/ chance factor (Mishra and Homa, 2019).
Advantages:
Randomization eliminates the systematic error. Ensures whatever the error component that still persists in the observation is purely random in nature. Types of randomization Unrestricted randomization or complete randomization: Allocation of treatment randomly to the experiment without any restrictions. Eg: CRD Restricted randomization: Allocation of treatment randomly to the experiment with certain restriction either one-way blocking or two-way blocking. Eg: RBD, LSD
2. Replication: Treatments are repeated several times to reduce experimental error. This also helps to increase power of statistics used in analysis of data.
3. Local Control
In Agriculture, however, experiments are performed on fields and instead of subjects in Psychology & medicine or objects in physics, here land/field is the subject or object recieving treatment. Since land/field differs in various characteristics over space, therefore, various blocks are created on land/field, each block with homogeneous characteristics in terms of slope, soil type, soil texture, etc in it but heterogeneous among them. Treatments are assigned randomly over these blocks. Care is to be taken to include each treatment in each block. Treatments must be randomly allocated to plots within each block. This would ensure blocks as replicating units. Thus, one more principle is added to above three, i.e., Blocking. If land is homogeneous across its length and breadth then there is no need for blocking.
Now lets discuss how R software can be used to generate layout for experimental with randomized assignment of treatments to fields.
For this purpose lets install Agricolae package in R by using following command
install.packages("agricolae")
and load it from library where it gets stored.
library(agricolae)
After this step, next step is to learn how to enter treatment and replication information in R. For this purpose, we use vectors as data structure. Let there be three treatments with control namely, A, B, C and “control” and equal number of replications.
Trt1 = c("A","B","C","Control")
In above command, a vector named Trt1 is created. The vector starts
with letter c which means combine and since it is a
character vector therefore, all characters must be in inverted
comma.
To create layout for Completey Randomized Design, design.crd function
is used. Arguments to this function includes treatment information and
number of replications. Since we want only information on layout
characteristics therefore, $book is placed at the end of
function.
design.crd(trt=Trt1, r=3)$book
## plots r Trt1
## 1 101 1 A
## 2 102 1 C
## 3 103 1 Control
## 4 104 2 Control
## 5 105 3 Control
## 6 106 1 B
## 7 107 2 C
## 8 108 2 A
## 9 109 2 B
## 10 110 3 C
## 11 111 3 A
## 12 112 3 B
The above command created the output which includes plot numbers,
replication number and treatment allotment. In case Randomized control
block design is used design.rcbd function is used.
For other useful designs lets create following treatments.
Trt1 = c("A","B","C")
Trt2 = c("1","2")
Trt3 = c("+","-")
TRT.tmp = as.vector(sapply(Trt1, function(x){paste0(x,Trt2)}))
TRT = as.vector(sapply(TRT.tmp, function(x){paste0(x,Trt3)}))
TRT.Control = c(TRT, rep("Control", 3))
TRT
## [1] "A1+" "A1-" "A2+" "A2-" "B1+" "B1-" "B2+" "B2-" "C1+" "C1-" "C2+" "C2-"
Now layout for designs with above treatments is
#Completely Randomized Design
design.crd(trt=TRT.Control, r=3)$sketch
## NULL
#Random Block Design
design.rcbd(trt=TRT.Control, r=3)$sketch
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
## [1,] "C1-" "Control" "A1-" "A1+" "C2+" "C1+" "B1-" "Control" "B1+"
## [2,] "B1-" "A1-" "C1+" "B1+" "Control" "A2+" "B2+" "B2-" "A1+"
## [3,] "B2+" "A1-" "Control" "A1+" "C2-" "Control" "B1-" "C1+" "C1-"
## [,10] [,11] [,12] [,13] [,14] [,15]
## [1,] "A2-" "B2+" "B2-" "C2-" "Control" "A2+"
## [2,] "Control" "C2+" "C1-" "C2-" "A2-" "Control"
## [3,] "B1+" "Control" "B2-" "A2-" "C2+" "A2+"
#Incomplete Block Design
design.bib(trt=TRT.Control, r=7, k=3)$sketch
##
## Parameters BIB
## ==============
## Lambda : 1
## treatmeans : 15
## Block size : 3
## Blocks : 35
## Replication: 7
##
## Efficiency factor 0.7142857
##
## <<< Book >>>
## [,1] [,2] [,3]
## [1,] "A2-" "B1-" "C1-"
## [2,] "C2-" "A2+" "Control"
## [3,] "Control" "C2-" "A2-"
## [4,] "A1-" "C1-" "C2-"
## [5,] "A2+" "A2-" "B2+"
## [6,] "B2+" "Control" "Control"
## [7,] "B2+" "B1-" "A1-"
## [8,] "C2+" "Control" "B1-"
## [9,] "C1-" "A1+" "B1+"
## [10,] "A1+" "C2-" "Control"
## [11,] "A1-" "C1+" "A1+"
## [12,] "B2+" "B1+" "C2-"
## [13,] "Control" "A2+" "B2-"
## [14,] "B2-" "Control" "A1+"
## [15,] "B2-" "B2+" "C1-"
## [16,] "C2-" "B1-" "C1+"
## [17,] "A1+" "B2+" "Control"
## [18,] "C1+" "Control" "Control"
## [19,] "B1+" "A2+" "C2+"
## [20,] "C2-" "B2-" "C2+"
## [21,] "B1+" "Control" "C1+"
## [22,] "A1-" "B1+" "B2-"
## [23,] "Control" "A1-" "A2+"
## [24,] "Control" "B1-" "B2-"
## [25,] "A2-" "C2+" "A1+"
## [26,] "A2-" "Control" "A1-"
## [27,] "A1+" "A2+" "B1-"
## [28,] "B1-" "Control" "B1+"
## [29,] "Control" "B1+" "A2-"
## [30,] "B2-" "C1+" "A2-"
## [31,] "Control" "C2+" "C1-"
## [32,] "C2+" "C1+" "B2+"
## [33,] "A1-" "C2+" "Control"
## [34,] "C1+" "A2+" "C1-"
## [35,] "C1-" "Control" "Control"
#Split-Plot Design
design.split(Trt1, Trt2, r=3, design=c("crd"))$book
## plots splots r Trt1 Trt2
## 1 101 1 1 C 2
## 2 101 2 1 C 1
## 3 102 1 1 A 2
## 4 102 2 1 A 1
## 5 103 1 1 B 1
## 6 103 2 1 B 2
## 7 104 1 2 B 1
## 8 104 2 2 B 2
## 9 105 1 2 A 1
## 10 105 2 2 A 2
## 11 106 1 3 A 2
## 12 106 2 3 A 1
## 13 107 1 3 B 1
## 14 107 2 3 B 2
## 15 108 1 2 C 1
## 16 108 2 2 C 2
## 17 109 1 3 C 2
## 18 109 2 3 C 1
The ANOVA stands for Analysis of variance. It is
actually a dummy variable regression. The dummies in most of
agricultural use are treatment dummies, replication dummies and block
dummies. The dummies in a data frame should be of type factor. The
examples given below are using npk dataset for full
factorial NPK experiment from Venables and Ripley (2002) p.165 provided
in datasets package .
data(npk) ## from datasets package
head(npk) ## see how data frame is to be prepared for anova
## block N P K yield
## 1 1 0 1 1 49.5
## 2 1 1 1 0 62.8
## 3 1 0 0 0 46.8
## 4 1 1 0 1 57.0
## 5 2 1 0 0 59.8
## 6 2 1 1 1 58.5
npk_lm=lm(yield ~ block + N*P*K, data=npk)
summary(npk_lm)
##
## Call:
## lm(formula = yield ~ block + N * P * K, data = npk)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.3000 -1.6833 0.1583 1.9979 4.4750
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.8250 2.7785 18.652 3.15e-10 ***
## block2 3.4250 2.7785 1.233 0.24131
## block3 6.7500 2.7785 2.429 0.03177 *
## block4 -3.9000 2.7785 -1.404 0.18578
## block5 -3.5000 2.7785 -1.260 0.23174
## block6 2.3250 2.7785 0.837 0.41907
## N1 9.8500 2.7785 3.545 0.00403 **
## P1 0.4167 2.7785 0.150 0.88329
## K1 -1.9167 2.7785 -0.690 0.50344
## N1:P1 -3.7667 3.2084 -1.174 0.26317
## N1:K1 -4.7000 3.2084 -1.465 0.16865
## P1:K1 0.5667 3.2084 0.177 0.86275
## N1:P1:K1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.929 on 12 degrees of freedom
## Multiple R-squared: 0.7886, Adjusted R-squared: 0.5948
## F-statistic: 4.069 on 11 and 12 DF, p-value: 0.01156
anova(npk_lm)
## Analysis of Variance Table
##
## Response: yield
## Df Sum Sq Mean Sq F value Pr(>F)
## block 5 343.29 68.659 4.4467 0.015939 *
## N 1 189.28 189.282 12.2587 0.004372 **
## P 1 8.40 8.402 0.5441 0.474904
## K 1 95.20 95.202 6.1657 0.028795 *
## N:P 1 21.28 21.282 1.3783 0.263165
## N:K 1 33.13 33.135 2.1460 0.168648
## P:K 1 0.48 0.482 0.0312 0.862752
## Residuals 12 185.29 15.441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Alternatively we can use aov function which is wrapper for lm functioon for fitting linear model
aov.npk<- aov(yield~block+N*P*K,data = npk)
summary(aov.npk)
## Df Sum Sq Mean Sq F value Pr(>F)
## block 5 343.3 68.66 4.447 0.01594 *
## N 1 189.3 189.28 12.259 0.00437 **
## P 1 8.4 8.40 0.544 0.47490
## K 1 95.2 95.20 6.166 0.02880 *
## N:P 1 21.3 21.28 1.378 0.26317
## N:K 1 33.1 33.13 2.146 0.16865
## P:K 1 0.5 0.48 0.031 0.86275
## Residuals 12 185.3 15.44
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.npk)
## (Intercept) block2 block3 block4 block5 block6
## 51.8250000 3.4250000 6.7500000 -3.9000000 -3.5000000 2.3250000
## N1 P1 K1 N1:P1 N1:K1 P1:K1
## 9.8500000 0.4166667 -1.9166667 -3.7666667 -4.7000000 0.5666667
## Remember aov fuction is designed for use on balanced designs. In case of missing values, the design loses balance.
## To check balance use replications function.
## A design can be balanced when all treatements levels have same number of replications.
replications(~.-yield,data = npk)
## block N P K
## 4 12 12 12
## replications function will provide number of replications for each treatement levels
Another dataset inbuilt in R is warpbreaks
from datasets package. It has measurement on number of
breaks in wool, wool type (A or B) and levels of tension (Low, Medium
and High).
aov.warpbreaks=aov(breaks~wool*tension,data=warpbreaks)
summary(aov.warpbreaks)
## Df Sum Sq Mean Sq F value Pr(>F)
## wool 1 451 450.7 3.765 0.058213 .
## tension 2 2034 1017.1 8.498 0.000693 ***
## wool:tension 2 1003 501.4 4.189 0.021044 *
## Residuals 48 5745 119.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.warpbreaks)
## (Intercept) woolB tensionM tensionH woolB:tensionM
## 44.55556 -16.33333 -20.55556 -20.00000 21.11111
## woolB:tensionH
## 10.55556
Latin Square Design
Another dataset is OrchardSprays provided in
datasets package. It was obtained from an experiment
conducted to assess the efficacy of constituents of sprays done on
orchard in repelling honeybees, using a Latin square design. The data
has row position, column position, treatment levels as factor and
response variable by name decrease.
data("OrchardSprays")
head(OrchardSprays)
## decrease rowpos colpos treatment
## 1 57 1 1 D
## 2 95 2 1 E
## 3 8 3 1 B
## 4 69 4 1 H
## 5 92 5 1 G
## 6 90 6 1 F
## check if this is balanced design
summary(OrchardSprays$treatment)
## A B C D E F G H
## 8 8 8 8 8 8 8 8
## after finding balance use aov function
aov.orchardspray=aov(decrease~rowpos+colpos+treatment,data=OrchardSprays)
summary(aov.orchardspray)
## Df Sum Sq Mean Sq F value Pr(>F)
## rowpos 1 2605 2605 6.877 0.0113 *
## colpos 1 511 511 1.350 0.2504
## treatment 7 56160 8023 21.181 2.07e-13 ***
## Residuals 54 20454 379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.orchardspray)
## (Intercept) rowpos colpos treatmentB treatmentC treatmentD
## 22.705357 -2.784226 -1.233631 3.000000 20.625000 30.375000
## treatmentE treatmentF treatmentG treatmentH
## 58.500000 64.375000 63.875000 85.625000
One Way ANOVA
The PlantGrowth dataset provided in
datasets package has recorded observations on plant yield
under control and two treatments.
data("PlantGrowth")
head(PlantGrowth)
## weight group
## 1 4.17 ctrl
## 2 5.58 ctrl
## 3 5.18 ctrl
## 4 6.11 ctrl
## 5 4.50 ctrl
## 6 4.61 ctrl
## check for balance
summary(PlantGrowth$group)
## ctrl trt1 trt2
## 10 10 10
aov.plantgrowth=aov(weight~group,data=PlantGrowth)
summary(aov.plantgrowth)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 3.766 1.8832 4.846 0.0159 *
## Residuals 27 10.492 0.3886
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.plantgrowth)
## (Intercept) grouptrt1 grouptrt2
## 5.032 -0.371 0.494
The ToothGrowth dataset provided in dataset
has observations on tooth length, supplement type and doses of vitamin
C. The experiment was conducted to find effect of vitamin C on tooth
growth of Guinea pigs. Here, two sources of variation exists, one is
levels of vitamin C and another is supplement type.
data("ToothGrowth")
head(ToothGrowth)
## len supp dose
## 1 4.2 VC 0.5
## 2 11.5 VC 0.5
## 3 7.3 VC 0.5
## 4 5.8 VC 0.5
## 5 6.4 VC 0.5
## 6 10.0 VC 0.5
summary(ToothGrowth)## observe that supplement has equal no. of observations for its two different levels
## len supp dose
## Min. : 4.20 OJ:30 Min. :0.500
## 1st Qu.:13.07 VC:30 1st Qu.:0.500
## Median :19.25 Median :1.000
## Mean :18.81 Mean :1.167
## 3rd Qu.:25.27 3rd Qu.:2.000
## Max. :33.90 Max. :2.000
summary(as.factor(ToothGrowth$dose)) ## check if equal number of observations are there for three different levels of doses of vitamin C
## 0.5 1 2
## 20 20 20
aov.toothgrowth=aov(len~as.factor(dose)+supp,data = ToothGrowth)
summary(aov.toothgrowth)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(dose) 2 2426.4 1213.2 82.81 < 2e-16 ***
## supp 1 205.4 205.4 14.02 0.000429 ***
## Residuals 56 820.4 14.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefficients(aov.toothgrowth)
## (Intercept) as.factor(dose)1 as.factor(dose)2 suppVC
## 12.455 9.130 15.495 -3.700
All questions are compulsory. Do not use R or
any of the training material during the test.
What would be the result in console of executing following code?
c(1,4,7)
c(3,“T”,9,3+2i)
x=data.frame(Name=c(“A”,“B”,“C”), Subject=c(“EXT 501”,“AGECON 604”,“ABM 510”))
x$Marks=c(50,60,70); x
for( i in 1:10){i=i+1 ;print(i)}
mylist=list(newmatrix=matrix(c(2,3,4,5)),vector=c(2,3,7));lapply(mylist,sum)
newdf=data.frame(A=c(2,5,7,9),B=c(1,3,4,6)); newdf[newdf$A==2,]
newdf[3,2]
newdf[newdf$A>2 & newdf$B<4,]
sapply(newdf,max)
k=c(“T”,“C”,“T”,“C”,“T”); L=c(“R”,“D”,“D”,“R”,“D”);Y=c(1,5,7,3,2);tapply(Y,list(k,L),sum)
table(k,L)
character(length=5)
factor(k,levels= c(“C”,“T”))
if(length(k)>length(L)){print(1)} else {print(0)}
x=1 ; repeat{print(x) ; x=x+1 ; if(x>5){ break } }
alog7=function(x){z=2*x+5; return(z)}; alog7(6)
Y=c(10,Y,20); Y
seq(1,2,0.1)
Sys.Date()
If we plan to use R, then firstly we need to download R and install it on our computer. Steps are provided below:
Search for R software in the your browser.
Screenshot of google chrome browser with R sotware in search bar
One should download R software from official website of R
. From the Google search following results will appear.
Search result for R software
Go to CRAN link under Download section on the left hand side of the website or click on “download R”. Both the ways directly take you to the CRAN mirror page.
Screenshot of main page of official R website
Once you reach httpshttps://cran.r-project.org/mirrors.html, choose the appropriate CRAN mirror. It is hereby suggested that “0-Cloud” mirror with url https://cloud.r-project.org/ be selected as its officially sponsored by RStudio.
Screenshot of CRAN mirrors page
Once you reach https://cloud.r-project.org/. click on “Download R for” (Linux/Windows/Mac depending on your operating system). The next steps would be explained for Windows installation.
CRAN page for downloding R
Once you click on “Download R for Windows” then it will take you to the “R for Windows” page. Click on the base sub directory link or install R for the first time link. This step will take you to “R-4.2.0 for Windows” page.
Screenshot of R for Windows page
Click Download R-4.2.0 for Windows. Now
downloading will start and process will take time depending on your
internet speed. Once downloading ends then click on the downloaded
.exe file.
R-4.2.0 for Windows page
Now a dialogue box will appear asking you to select setup language. Select the desired language and click on OK.
Screenshot of Select setup language dialogue box
Read the license agreement and click Next.
Screenshot of read licence agreement
Select the documentation location. Click Next.
Screenshot of Select Documentation Location
Select the components you wish to install. Click Next.
Screenshot of select component dialogue box
Enter/browse the folder/path you wish to install R into and then confirm by clicking Next.
Screenshot of Select Start Menu Folder dialogue box
Select additional tasks like creating desktop shortcuts etc. Then click Next.
Screenshot of Select Additional Task dialogue box
Wait for the installation process to complete. Click on Finish to complete the installation.
Screenshot of finish setup dialogue box
R 4.2.0 will get displayed on our desktop.
Screenshot of desktop home with R software shortcut
If we plan to use RStudio, then firstly we need to
download R and install it on our computer (see section Downloading and Installing R)
.
Search for RStudio software in browser.
We can download RStudio from official website of RStudio. If you directly click on the first search result (Download the RStudio IDE), it will directly take you to the download page.
Select Free Download option.
Click on the Download RStudio Desktop link (depending on your operating system but further installation guide is provided for windows os).
Follow the installation instructions and click Next on the welcome window.
Enter the path to the installation folder and click Next to proceed.
Select the folder for the start menu shortcut or click on do not create shortcuts and then click Next.
Wait for the installation process to complete.
Click finish to complete the installation.
RStudio shortcut/icon will get displayed on our desktop.
Assistant Professor(Agricultural Economics), JNKVV, Jabalpur, email id:gkvani@jnkvv.org↩︎
Assistant Professor (Agricultural Statistics), JNKVV, Jabalpur, email id:pradeepjnkvv@gmail.com↩︎
Although there are functions which does not take any
input from user like Sys.Date(), getwd() and many more.↩︎
In linear data structure, elements are arranged one after the another.↩︎