Loiyumba
To download R - https://cran.r-project.org/
To download RStudio - https://www.rstudio.com/products/rstudio/download/
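RStudio comes with 4 panes: the source editor, the console, the environment/history pane, and the files/plots/packages/help pane.
In the console/command line: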
7 + 9 + 5 + 0 + 0 + 1
[1] 22
7 + 9 * 5
[1] 52
(7 + 9) * 5
[1] 80
c(7, 9, 5, 0, 0, 1)
[1] 7 9 5 0 0 1
1:50
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
2^3
[1] 8
c(7, 9, 5, 0, 0, 1) + c(70, 90, 50, 10, 10, 10)
[1] 77 99 55 10 10 11
c(7, 9, 5, 0, 0, 1) * c(70, 90, 50, 10, 10, 10)
[1] 490 810 250 0 0 10
c(7, 9, 5, 0, 0, 1) - c(70, 90, 50, 10, 10, 10)
[1] -63 -81 -45 -10 -10 -9
c(7, 9, 5, 0, 0, 1) + 100
[1] 107 109 105 100 100 101
1/c(7, 9, 5, 0, 0,1)
[1] 0.1428571 0.1111111 0.2000000 Inf Inf 1.0000000
c(7, 9, 5, 0, 0, 1) + c(10, 100)
[1] 17 109 15 100 10 101
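The last result comes from R's recycling rule: when two vectors differ in length, the shorter one is repeated until it matches the longer one. A minimal sketch that spells the rule out by hand with rep() (the names long_vec, short_vec and recycled are only for illustration):

```r
long_vec  <- c(7, 9, 5, 0, 0, 1)
short_vec <- c(10, 100)

# Repeat the short vector until it has the same length as the long one
recycled <- rep(short_vec, length.out = length(long_vec))
recycled

# The implicit and the explicit versions give the same result
identical(long_vec + short_vec, long_vec + recycled)
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.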
paste("Hello", "World!")
[1] "Hello World!"
"Hello World!"
[1] "Hello World!"
We can add comments to the code with #
2 + 5 # Sum of 2 and 5 will give 7
[1] 7
Assigning to a variable/object
a <- 1 # This is assignment sign (<-)
a
[1] 1
a + 1
[1] 2
To remove an object, use rm(object_name)
R objects can have attributes, which are like metadata for the object. This metadata is useful because it helps to describe the object. Common examples are names, dimensions (dim), dimnames and class.
Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects contain attributes, in which case the attributes() function returns NULL.
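As a small illustration of the paragraph above, the sketch below checks a plain vector, a named vector and a matrix with attributes(). The object names v and m are made up for the example.

```r
v <- c(7, 9, 5)                              # a plain vector carries no attributes
attributes(v)                                # NULL

names(v) <- c("first", "second", "third")    # naming the elements adds a "names" attribute
attributes(v)

m <- matrix(1:6, nrow = 2)                   # a matrix carries a "dim" attribute
attributes(m)
```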
5 -> b
print(b)
[1] 5
a + b
[1] 6
(a + b)/2
[1] 3
c = (a + b)/2 # We can use (=) instead of (<-)
c
[1] 3
w <- c(7, 9, 5, 0, 0, 1)
x <- c(1, 0, 0, 5, 9, 7)
y <- c(w, x)
y
[1] 7 9 5 0 0 1 1 0 0 5 9 7
z <- c(x, w)
z
[1] 1 0 0 5 9 7 7 9 5 0 0 1
d <- 1:10
d
[1] 1 2 3 4 5 6 7 8 9 10
e <- 10:1
e
[1] 10 9 8 7 6 5 4 3 2 1
d + 1
[1] 2 3 4 5 6 7 8 9 10 11
e - 1
[1] 9 8 7 6 5 4 3 2 1 0
d + e
[1] 11 11 11 11 11 11 11 11 11 11
R built-in functions (a.k.a. the base package)
k <- seq(from = 1, to = 10, by = 2)
k
[1] 1 3 5 7 9
j <- seq(from = -1, to = 1, by = 0.2)
j
[1] -1.0 -0.8 -0.6 -0.4 -0.2 0.0 0.2 0.4 0.6 0.8 1.0
m <- rep(2, times = 5)
m
[1] 2 2 2 2 2
p <- rep(1:3, times = 3)
p
[1] 1 2 3 1 2 3 1 2 3
q <- rep(1:3, each = 3)
q
[1] 1 1 1 2 2 2 3 3 3
For information on any function
help("seq")
or, we can simply type
?rep
To list or remove objects/variables:
ls() # check the variables in the current session
[1] "a" "b" "c" "d" "e" "j" "k" "m" "p" "q" "w" "x" "y" "z"
rm(p) # delete single object
rm(k, m) # delete multiple objects
ls()
[1] "a" "b" "c" "d" "e" "j" "q" "w" "x" "y" "z"
rm(list = ls()) # delete everything
ls()
character(0)
g <- c(45, 90, 20)
sum(g) # Sum
[1] 155
mean(g) # Mean
[1] 51.66667
round(51.66667, 2) # Round to n decimal places
[1] 51.67
round(mean(g), 2) # Nested function
[1] 51.67
median(g) # Median
[1] 45
rank(g) # Rank the elements
[1] 2 3 1
var(g) # Variance
[1] 1258.333
max(g) # Largest element
[1] 90
min(g) # Smallest element
[1] 20
log(25) # Natural log
[1] 3.218876
exp(5) # Exponential
[1] 148.4132
sqrt(95) # Square root
[1] 9.746794
abs(-43) # Absolute value
[1] 43
u <- 45:60
quantile(u) # Quantile
0% 25% 50% 75% 100%
45.00 48.75 52.50 56.25 60.00
sd(u) # Standard deviation
[1] 4.760952
Four basic data types in R: numeric, character, logical and factor.
Numeric: any number. Appropriate for math.
1 + 1
[1] 2
100
[1] 100
Character: any text, i.e. any symbols surrounded by quotes.
"hello, this is R"
[1] "hello, this is R"
"Imphal's pin code is 795001"
[1] "Imphal's pin code is 795001"
f <- c("1", "2", "3")
f
[1] "1" "2" "3"
Logical: R's form of binary data, TRUE or FALSE. Useful for logical tests.
100 < 400
[1] TRUE
100 > 400
[1] FALSE
L <- 1:5
L
[1] 1 2 3 4 5
L > 3 # greater than
[1] FALSE FALSE FALSE TRUE TRUE
L >= 3 # greater than or equal to
[1] FALSE FALSE TRUE TRUE TRUE
L < 3 # less than
[1] TRUE TRUE FALSE FALSE FALSE
L <= 3 # less than or equal to
[1] TRUE TRUE TRUE FALSE FALSE
L == 3 # equal to
[1] FALSE FALSE TRUE FALSE FALSE
L != 3 # not equal to
[1] TRUE TRUE FALSE TRUE TRUE
The %in% operator tests whether each element on the left is a member of the group on the right.
"mango" %in% c("mango", "apple", "banana")
[1] TRUE
1 %in% c(2:8)
[1] FALSE
c(12:20) %in% c(15, 17)
[1] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
We can combine logical tests with &, |, xor(), !, any() and all().
x <- 1:10
x > 2 & x < 8
[1] FALSE FALSE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
x > 8 | x < 2
[1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE
x[(x > 2) & (x < 8)]
[1] 3 4 5 6 7
x[(x > 8) | (x < 2)]
[1] 1 9 10
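The examples above only use & and |. A short sketch of the remaining operators mentioned (xor(), !, any() and all()), reusing the same kind of vector:

```r
x <- 1:10

!(x > 5)              # negation: TRUE where x is NOT greater than 5
xor(x > 3, x < 6)     # TRUE where exactly one of the two conditions holds
any(x > 9)            # is at least one element greater than 9?
all(x > 0)            # are all elements greater than 0?
```

Note that any() and all() collapse a whole logical vector into a single TRUE or FALSE, while the other operators work element by element.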
Factor: R's form of categorical data.
dist <- factor(c("Imphal East", "Imphal West", "Senapati", "Imphal West"))
dist
[1] Imphal East Imphal West Senapati Imphal West
Levels: Imphal East Imphal West Senapati
table(dist)
dist
Imphal East Imphal West Senapati
1 2 1
as.numeric(dist)
[1] 1 2 3 2
status_vector <- c("Married", "Not Married")
status_factor <- factor(status_vector)
status_factor <- factor(status_factor, levels = c("Not Married", "Married"))
status_factor
[1] Married Not Married
Levels: Not Married Married
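Because as.numeric() on a factor returns the internal level codes, it can surprise when the labels themselves look like numbers. A minimal sketch, with a made-up factor called pin:

```r
pin <- factor(c("795001", "795004", "795001"))   # hypothetical values stored as a factor

levels(pin)                     # the distinct categories, sorted
as.numeric(pin)                 # the internal integer codes, not the original numbers
as.numeric(as.character(pin))   # go through character to recover the numbers
```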
Some of the most frequently used R data structures are vectors, matrices, lists and data frames.
Vector elements must all have the same mode, which can be integer, numeric (floating-point number), character (string), logical (boolean), complex, object, etc.
A vector combines multiple elements into a one-dimensional array.
x <- 1:10 # integer
x
[1] 1 2 3 4 5 6 7 8 9 10
fruits <- c("Apple", "Banana", "Mango", "Papaya")
fruits # character
[1] "Apple" "Banana" "Mango" "Papaya"
logi <- c(TRUE, FALSE, TRUE) # logical
logi
[1] TRUE FALSE TRUE
com <- c(1+0i, 2+4i) # complex
com
[1] 1+0i 2+4i
What happens if we mix vectors of different classes?
student <- c("Tomba", "Chaoba", "Thoibi", "Bena")
class(student)
[1] "character"
age <- c(24, 26, 25, 22)
class(age)
[1] "numeric"
info <- c(student, age)
info
[1] "Tomba" "Chaoba" "Thoibi" "Bena" "24" "26" "25" "22"
class(info)
[1] "character"
# Do ?class for more detail on class function
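When types are mixed, R coerces everything to the most flexible type present: logical is coerced to numeric, and numeric to character.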
class(c(795001, "Imphal"))
[1] "character"
class(c(TRUE, 795001))
[1] "numeric"
class(c(TRUE, 795001, "Imphal"))
[1] "character"
class(c("TRUE", 795001))
[1] "character"
Objects can be explicitly coerced from one class to another using the as.* functions, if available.
t <- -1:10
class(t)
[1] "integer"
as.numeric(t)
[1] -1 0 1 2 3 4 5 6 7 8 9 10
as.logical(t)
[1] TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[12] TRUE
as.character(t)
[1] "-1" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.
v <- c("Manipur", "Nagaland", "Mizoram")
class(v)
[1] "character"
as.numeric(v)
[1] NA NA NA
as.logical(v)
[1] NA NA NA
as.complex(v)
[1] NA NA NA
NA means Not Available, R's marker for a missing value.
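Once NAs are in a vector, they propagate through calculations. A minimal sketch of detecting and skipping them, using a made-up vector called scores:

```r
scores <- c(45, NA, 20)      # hypothetical data with one missing value

is.na(scores)                # flags the missing element
mean(scores)                 # NA: a calculation involving NA is itself NA
mean(scores, na.rm = TRUE)   # drop the NA first, then average
```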
y <- c(7, 9, 5, 0, 0, 1)
sort(y)
[1] 0 0 1 5 7 9
sort(y, decreasing = TRUE)
[1] 9 7 5 1 0 0
table(y)
y
0 1 5 7 9
2 1 1 1 1
rev(y)
[1] 1 0 0 5 9 7
unique(y)
[1] 7 9 5 0 1
a <- seq(from = 1, to = 20, by = 2)
a
[1] 1 3 5 7 9 11 13 15 17 19
a[5]
[1] 9
a[-5]
[1] 1 3 5 7 11 13 15 17 19
a[3:6]
[1] 5 7 9 11
a[-(3:6)]
[1] 1 3 13 15 17 19
a[c(3, 6)]
[1] 5 11
a[a > 11]
[1] 13 15 17 19
a[a < 11]
[1] 1 3 5 7 9
a[a == 11]
[1] 11
district <- c("Imphal East", "Senapati", "Churachandpur", "Thoubal", "Ukhrul", "Bishenpur")
district[5]
[1] "Ukhrul"
district[c(1, 6)]
[1] "Imphal East" "Bishenpur"
alpha <- letters[1:10]
alpha
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
alpha[c(5,5,5,5)]
[1] "e" "e" "e" "e"
alpha[c(5:1, 1:5)]
[1] "e" "d" "c" "b" "a" "a" "b" "c" "d" "e"
alpha[11] # indexing with out-of-range values
[1] NA
alpha[1:11]
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA
alpha > "d"
[1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
alpha[alpha > "d"]
[1] "e" "f" "g" "h" "i" "j"
selector <- alpha > "d"
alpha[selector]
[1] "e" "f" "g" "h" "i" "j"
which(alpha > "d")
[1] 5 6 7 8 9 10
indexes <- which(alpha > "d")
alpha[indexes]
[1] "e" "f" "g" "h" "i" "j"
g <- 1:10
g
[1] 1 2 3 4 5 6 7 8 9 10
g[2] <- 100
g
[1] 1 100 3 4 5 6 7 8 9 10
g[11] <- 200
g
[1] 1 100 3 4 5 6 7 8 9 10 200
g[c(4,8)] <- -500
g
[1] 1 100 3 -500 5 6 7 -500 9 10 200
g[3] <- g[11]
g
[1] 1 100 200 -500 5 6 7 -500 9 10 200
g <- c(g, 33, 44)
g
[1] 1 100 200 -500 5 6 7 -500 9 10 200 33 44
g <- c(22, 55, g)
g
[1] 22 55 1 100 200 -500 5 6 7 -500 9 10 200 33
[15] 44
g <- c(g[1:5], 111, g[6:15])
g
[1] 22 55 1 100 200 111 -500 5 6 7 -500 9 10 200
[15] 33 44
g <- g[-3:-6]
g
[1] 22 55 -500 5 6 7 -500 9 10 200 33 44
A matrix is a vector with two additional attributes, the number of rows and the number of columns. It combines multiple elements into a two-dimensional array. Create one with the matrix() function.
mat <- matrix(c(1:6), nrow = 2)
mat
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
attributes(mat)
$dim
[1] 2 3
Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.
mat2 <- matrix(c(1:6), nrow = 3)
mat2
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
dim(mat2)
[1] 3 2
However, if we want to fill the matrix row-wise, we can do so with byrow = TRUE.
mat3 <- matrix(c(1:6), nrow = 3, byrow = TRUE)
mat3
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
dim(mat3)
[1] 3 2
Matrices can also be created directly from vectors by adding a dimension attribute.
mat4 <- 1:10
dim(mat4) <- c(2, 5)
mat4
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 5 7 9
[2,] 2 4 6 8 10
Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.
p <- 1:5
q <- 6:10
cbind(p, q)
p q
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
Do ?cbind in the console for more info
rbind(p, q)
[,1] [,2] [,3] [,4] [,5]
p 1 2 3 4 5
q 6 7 8 9 10
Do ?rbind in the console for more info
We can also transpose a matrix (swap its rows and columns) with t()
pq <- rbind(p, q)
pq # 2, 5
[,1] [,2] [,3] [,4] [,5]
p 1 2 3 4 5
q 6 7 8 9 10
t(pq) # 5, 2
p q
[1,] 1 6
[2,] 2 7
[3,] 3 8
[4,] 4 9
[5,] 5 10
x <- matrix(1:6, nrow = 3)
x
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
y <- matrix(rep(5, times = 6), nrow = 3)
y
[,1] [,2]
[1,] 5 5
[2,] 5 5
[3,] 5 5
Element-wise multiplication
x * y
[,1] [,2]
[1,] 5 20
[2,] 10 25
[3,] 15 30
Element-wise division
x/y
[,1] [,2]
[1,] 0.2 0.8
[2,] 0.4 1.0
[3,] 0.6 1.2
3 * x
[,1] [,2]
[1,] 3 12
[2,] 6 15
[3,] 9 18
x + x
[,1] [,2]
[1,] 2 8
[2,] 4 10
[3,] 6 12
z <- x * y
z
[,1] [,2]
[1,] 5 20
[2,] 10 25
[3,] 15 30
z[1, ] # select 1st row & all columns
[1] 5 20
z[, 1] # select all rows & 1st column
[1] 5 10 15
z[2:3, , drop = FALSE] # keeping the matrix style
[,1] [,2]
[1,] 10 25
[2,] 15 30
z[, 2, drop = FALSE]
[,1]
[1,] 20
[2,] 25
[3,] 30
z[2,2] # select an element
[1] 25
z[z[, 1] >= 10, ]
[,1] [,2]
[1,] 10 25
[2,] 15 30
z[z[, 1] > 10 & z[, 2] >= 25, ]
[1] 15 30
colnames(z)
NULL
colnames(z) <- c("A", "B")
z
A B
[1,] 5 20
[2,] 10 25
[3,] 15 30
colnames(z)
[1] "A" "B"
z[, "B"]
[1] 20 25 30
rownames(z)
NULL
rownames(z) <- c("P", "Q", "R")
z
A B
P 5 20
Q 10 25
R 15 30
A list is a one-dimensional group of R objects.
Create lists with the list() function.
a_list <- list(1, "HP", TRUE)
a_list
[[1]]
[1] 1
[[2]]
[1] "HP"
[[3]]
[1] TRUE
The elements of a list can be anything, even vectors or other lists.
students <- list(name = c("Tomba", "Chaoba"), age = c(23, 25), single = c(TRUE, FALSE))
students
$name
[1] "Tomba" "Chaoba"
$age
[1] 23 25
$single
[1] TRUE FALSE
str(students)
List of 3
$ name : chr [1:2] "Tomba" "Chaoba"
$ age : num [1:2] 23 25
$ single: logi [1:2] TRUE FALSE
names(students)
[1] "name" "age" "single"
students$name
[1] "Tomba" "Chaoba"
students[["name"]]
[1] "Tomba" "Chaoba"
students[[1]]
[1] "Tomba" "Chaoba"
students["name"]
$name
[1] "Tomba" "Chaoba"
students[c("name", "single")]
$name
[1] "Tomba" "Chaoba"
$single
[1] TRUE FALSE
students[c(1, 3)]
$name
[1] "Tomba" "Chaoba"
$single
[1] TRUE FALSE
students$education <- c("Graduate", "Master")
students
$name
[1] "Tomba" "Chaoba"
$age
[1] 23 25
$single
[1] TRUE FALSE
$education
[1] "Graduate" "Master"
students$single <- NULL
students
$name
[1] "Tomba" "Chaoba"
$age
[1] 23 25
$education
[1] "Graduate" "Master"
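The difference between single and double brackets is easy to miss in the output above: single brackets return a smaller list, while double brackets (or $) return the element itself. A small sketch with a made-up list called person:

```r
person <- list(name = c("Tomba", "Chaoba"), age = c(23, 25))   # illustrative list

class(person["age"])      # "list": single brackets return a smaller list
class(person[["age"]])    # "numeric": double brackets return the element itself

mean(person[["age"]])     # works, because this is a numeric vector
```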
Data frames group vectors together into a two-dimensional table. Each vector becomes a column in the table. As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data. We can create data frames with 'data.frame' function.
students <- data.frame(rollnum = c(10,42,3),
name = c("Iboyaima", "Tomchou", "Tombi"),
examfailed = c(TRUE, FALSE, TRUE)
)
students
rollnum name examfailed
1 10 Iboyaima TRUE
2 42 Tomchou FALSE
3 3 Tombi TRUE
class(students)
[1] "data.frame"
str(students)
'data.frame': 3 obs. of 3 variables:
$ rollnum : num 10 42 3
$ name : Factor w/ 3 levels "Iboyaima","Tombi",..: 1 3 2
$ examfailed: logi TRUE FALSE TRUE
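The Factor shown for name in the str() output above is the default of R versions before 4.0.0. From R 4.0.0 onwards, data.frame() keeps text columns as character unless told otherwise. A sketch that makes the choice explicit either way (roll and roll_f are illustrative names):

```r
roll <- data.frame(rollnum = c(10, 42, 3),
                   name = c("Iboyaima", "Tomchou", "Tombi"),
                   stringsAsFactors = FALSE)    # keep text as character
class(roll$name)

roll_f <- data.frame(rollnum = c(10, 42, 3),
                     name = c("Iboyaima", "Tomchou", "Tombi"),
                     stringsAsFactors = TRUE)   # convert text to factor
class(roll_f$name)
```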
manipur <- data.frame(Districts = c("ImpWest", "ImpEast", "Ccpur",
"Thoubal", "Tamenglong", "Senapati",
"Chandel", "Ukhrul", "Bishenpur"),
Population = c(700000, 500000, 400000, 300000,
200000, 450000, 400000, 500000, 750000),
Literacy = c(9.5, 9.4, 7.2, 8.6, 5.3, 8.5, 6.8,
6.2, 8.7), stringsAsFactors = FALSE
)
manipur
Districts Population Literacy
1 ImpWest 700000 9.5
2 ImpEast 500000 9.4
3 Ccpur 400000 7.2
4 Thoubal 300000 8.6
5 Tamenglong 200000 5.3
6 Senapati 450000 8.5
7 Chandel 400000 6.8
8 Ukhrul 500000 6.2
9 Bishenpur 750000 8.7
ForestCover <- c(40, 45, 67, 60, 90, 68, 65, 85, 70)
ForestCover
[1] 40 45 67 60 90 68 65 85 70
With 'cbind' function, we will add this vector to manipur.
manipur <- cbind(manipur, ForestCover)
manipur
Districts Population Literacy ForestCover
1 ImpWest 700000 9.5 40
2 ImpEast 500000 9.4 45
3 Ccpur 400000 7.2 67
4 Thoubal 300000 8.6 60
5 Tamenglong 200000 5.3 90
6 Senapati 450000 8.5 68
7 Chandel 400000 6.8 65
8 Ukhrul 500000 6.2 85
9 Bishenpur 750000 8.7 70
We will add another row to manipur
SadarHill <- data.frame(Districts = "SadarHill",
Population = 450000,
Literacy = 8.5,
ForestCover = 66,
stringsAsFactors = FALSE)
SadarHill
Districts Population Literacy ForestCover
1 SadarHill 450000 8.5 66
manipur <- rbind(manipur, SadarHill)
We can drop rows from a data frame like this. These calls only print the selection; the result must be assigned back to keep the change.
manipur[1:9, ] # omitting in selection
Districts Population Literacy ForestCover
1 ImpWest 700000 9.5 40
2 ImpEast 500000 9.4 45
3 Ccpur 400000 7.2 67
4 Thoubal 300000 8.6 60
5 Tamenglong 200000 5.3 90
6 Senapati 450000 8.5 68
7 Chandel 400000 6.8 65
8 Ukhrul 500000 6.2 85
9 Bishenpur 750000 8.7 70
manipur[-(8:10), ] # with - sign
Districts Population Literacy ForestCover
1 ImpWest 700000 9.5 40
2 ImpEast 500000 9.4 45
3 Ccpur 400000 7.2 67
4 Thoubal 300000 8.6 60
5 Tamenglong 200000 5.3 90
6 Senapati 450000 8.5 68
7 Chandel 400000 6.8 65
manipur[-c(2,5,8:10), ] # row selection
Districts Population Literacy ForestCover
1 ImpWest 700000 9.5 40
3 Ccpur 400000 7.2 67
4 Thoubal 300000 8.6 60
6 Senapati 450000 8.5 68
7 Chandel 400000 6.8 65
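Note that none of the three calls above changes manipur itself, since nothing is assigned. A minimal sketch of the difference, on a small made-up data frame called df:

```r
df <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))   # small illustrative data frame

df[-(4:5), ]          # prints rows 1 to 3, but df itself is unchanged
nrow(df)              # still 5

df <- df[-(4:5), ]    # assign the result back to actually drop the rows
nrow(df)              # now 3
```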
Adding a new column to an existing data frame
manipur$AnnualRain <- c(3.2, 3.5, 4.0, 3.8, 4.8, 4.2, 3.8,
4.2, 3.7, 3.8)
str(manipur)
'data.frame': 10 obs. of 5 variables:
$ Districts : chr "ImpWest" "ImpEast" "Ccpur" "Thoubal" ...
$ Population : num 700000 500000 400000 300000 200000 450000 400000 500000 750000 450000
$ Literacy : num 9.5 9.4 7.2 8.6 5.3 8.5 6.8 6.2 8.7 8.5
$ ForestCover: num 40 45 67 60 90 68 65 85 70 66
$ AnnualRain : num 3.2 3.5 4 3.8 4.8 4.2 3.8 4.2 3.7 3.8
Or, delete a column like this.
manipur$Literacy <- NULL
manipur
Districts Population ForestCover AnnualRain
1 ImpWest 700000 40 3.2
2 ImpEast 500000 45 3.5
3 Ccpur 400000 67 4.0
4 Thoubal 300000 60 3.8
5 Tamenglong 200000 90 4.8
6 Senapati 450000 68 4.2
7 Chandel 400000 65 3.8
8 Ukhrul 500000 85 4.2
9 Bishenpur 750000 70 3.7
10 SadarHill 450000 66 3.8
Matrix Style Subsetting
manipur[1, ] # select first row and all the columns
Districts Population ForestCover AnnualRain
1 ImpWest 7e+05 40 3.2
manipur[, 1] # select first column and all the rows
[1] "ImpWest" "ImpEast" "Ccpur" "Thoubal" "Tamenglong"
[6] "Senapati" "Chandel" "Ukhrul" "Bishenpur" "SadarHill"
manipur[, "Districts"]
[1] "ImpWest" "ImpEast" "Ccpur" "Thoubal" "Tamenglong"
[6] "Senapati" "Chandel" "Ukhrul" "Bishenpur" "SadarHill"
manipur[, c("Districts", "AnnualRain")]
Districts AnnualRain
1 ImpWest 3.2
2 ImpEast 3.5
3 Ccpur 4.0
4 Thoubal 3.8
5 Tamenglong 4.8
6 Senapati 4.2
7 Chandel 3.8
8 Ukhrul 4.2
9 Bishenpur 3.7
10 SadarHill 3.8
manipur[2:6, ] # selecting rows 2 to 6 and all columns
Districts Population ForestCover AnnualRain
2 ImpEast 500000 45 3.5
3 Ccpur 400000 67 4.0
4 Thoubal 300000 60 3.8
5 Tamenglong 200000 90 4.8
6 Senapati 450000 68 4.2
manipur[10:7, ] # selected rows 10 to 7 and all columns
Districts Population ForestCover AnnualRain
10 SadarHill 450000 66 3.8
9 Bishenpur 750000 70 3.7
8 Ukhrul 500000 85 4.2
7 Chandel 400000 65 3.8
manipur[5:7, 1:2] # selected rows and columns
Districts Population
5 Tamenglong 200000
6 Senapati 450000
7 Chandel 400000
manipur[c(4, 8, 10), c(1, 2, 4)] # selected rows and columns
Districts Population AnnualRain
4 Thoubal 300000 3.8
8 Ukhrul 500000 4.2
10 SadarHill 450000 3.8
Or, we can use R's built-in functions to inspect the observations and other information about the data frame
head(manipur)
Districts Population ForestCover AnnualRain
1 ImpWest 700000 40 3.2
2 ImpEast 500000 45 3.5
3 Ccpur 400000 67 4.0
4 Thoubal 300000 60 3.8
5 Tamenglong 200000 90 4.8
6 Senapati 450000 68 4.2
tail(manipur)
Districts Population ForestCover AnnualRain
5 Tamenglong 200000 90 4.8
6 Senapati 450000 68 4.2
7 Chandel 400000 65 3.8
8 Ukhrul 500000 85 4.2
9 Bishenpur 750000 70 3.7
10 SadarHill 450000 66 3.8
nrow(manipur) # number of rows
[1] 10
ncol(manipur) # number of columns
[1] 4
dim(manipur) # dimension of the data frame
[1] 10 4
summary(manipur[2:4]) # Summary stats of data frame
Population ForestCover AnnualRain
Min. :200000 Min. :40.00 Min. :3.200
1st Qu.:400000 1st Qu.:61.25 1st Qu.:3.725
Median :450000 Median :66.50 Median :3.800
Mean :465000 Mean :65.60 Mean :3.900
3rd Qu.:500000 3rd Qu.:69.50 3rd Qu.:4.150
Max. :750000 Max. :90.00 Max. :4.800
Selecting districts which have Population more than 500000
manipur[manipur[, 2] > 500000, ]
Districts Population ForestCover AnnualRain
1 ImpWest 700000 40 3.2
9 Bishenpur 750000 70 3.7
Selecting districts which have a population greater than or equal to 500000
manipur[manipur[, 2] >= 500000, ]
Districts Population ForestCover AnnualRain
1 ImpWest 700000 40 3.2
2 ImpEast 500000 45 3.5
8 Ukhrul 500000 85 4.2
9 Bishenpur 750000 70 3.7
Selecting districts with less than 500000 population and forest cover more than 70
manipur[manipur[, 2] < 500000 & manipur[, 3] > 70, ]
Districts Population ForestCover AnnualRain
5 Tamenglong 2e+05 90 4.8
Now, we want only district names with population more than or equal to 500000 and forest cover more than or equal to 70
manipur[manipur[, 2] >= 500000 & manipur[, 3] >= 70, 1]
[1] "Ukhrul" "Bishenpur"
Or, use the subset() function
subset(manipur, ForestCover >= 75)
Districts Population ForestCover AnnualRain
5 Tamenglong 2e+05 90 4.8
8 Ukhrul 5e+05 85 4.2
subset(manipur, Population > 400000 & AnnualRain > 4.0)
Districts Population ForestCover AnnualRain
6 Senapati 450000 68 4.2
8 Ukhrul 500000 85 4.2
subset(manipur, AnnualRain >= 3.5)$AnnualRain
[1] 3.5 4.0 3.8 4.8 4.2 3.8 4.2 3.7 3.8
subset(manipur, AnnualRain >= 4.0)[, -(2:3)]
Districts AnnualRain
3 Ccpur 4.0
5 Tamenglong 4.8
6 Senapati 4.2
8 Ukhrul 4.2
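subset() can also pick columns through its select argument, which avoids the trailing bracket. A sketch on a cut-down, illustrative copy of the data (the data frame rain is made up for this example):

```r
rain <- data.frame(Districts  = c("Ccpur", "Thoubal", "Tamenglong"),
                   Population = c(400000, 300000, 200000),
                   AnnualRain = c(4.0, 3.8, 4.8),
                   stringsAsFactors = FALSE)

# Filter rows and pick columns in one call
by_subset <- subset(rain, AnnualRain >= 4.0, select = c(Districts, AnnualRain))
by_subset

# The equivalent bracket form
by_bracket <- rain[rain$AnnualRain >= 4.0, c("Districts", "AnnualRain")]
by_bracket
```

subset() is convenient for interactive work; the bracket form is the more explicit of the two and is a reasonable choice inside functions.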
Just as colnames() and rownames() are used to set or change column names and row names in matrices, data frames use names() to set or change column names and row.names() to set or change row names.
names(manipur)
[1] "Districts" "Population" "ForestCover" "AnnualRain"
names(manipur) <- c("Dist", "Pop", "ForCov", "AnRain")
names(manipur)
[1] "Dist" "Pop" "ForCov" "AnRain"
row.names(manipur)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10"
row.names(manipur) <- LETTERS[1:10]
row.names(manipur)
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
head(manipur,3)
Dist Pop ForCov AnRain
A ImpWest 7e+05 40 3.2
B ImpEast 5e+05 45 3.5
C Ccpur 4e+05 67 4.0
R ships with many built-in datasets for practice. To list them, call data(); to use one, simply type the dataset name.
str(USArrests)
'data.frame': 50 obs. of 4 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
To know more info about this data, we can do ?USArrests
head(USArrests) # first 6 rows/observations
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
class(USArrests)
[1] "data.frame"
We can use the View() function to see the complete dataset
Summary statistics (uncomment the line below to print them for every column)
# summary(USArrests)
max(USArrests$Murder) # maximum value in Murder
[1] 17.4
USArrests[USArrests$Murder == 17.4, ] # which state?
Murder Assault UrbanPop Rape
Georgia 17.4 211 60 25.8
Another way is to look up the row number with which.max() or which.min(); here, the state with the lowest value:
which.min(USArrests$Murder) # gives the row number
[1] 34
USArrests[34, ] # see the row
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
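The same lookup works for the maximum and avoids typing the value 17.4 by hand. A short sketch using which.max() on the built-in USArrests data (idx is an illustrative name):

```r
idx <- which.max(USArrests$Murder)   # row index of the largest Murder value
idx
rownames(USArrests)[idx]             # the state name for that row
USArrests[idx, ]                     # the full row
```

Both which.max() and which.min() return only the first position when there are ties.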
But I want to see the 10 states with the highest murder rates and the 10 with the lowest. (The crime columns in USArrests are arrests per 100,000 residents, not raw counts.)
Top 10 states with the highest murder rate
head(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10) # use order function. do ?order
Murder Assault UrbanPop Rape
Georgia 17.4 211 60 25.8
Mississippi 16.1 259 44 17.1
Florida 15.4 335 80 31.9
Louisiana 15.4 249 66 22.2
South Carolina 14.4 279 48 22.5
Alabama 13.2 236 58 21.2
Tennessee 13.2 188 59 26.9
North Carolina 13.0 337 45 16.1
Texas 12.7 201 80 25.5
Nevada 12.2 252 81 46.0
Top 10 states with the lowest murder rate
tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)
Murder Assault UrbanPop Rape
Connecticut 3.3 110 77 11.1
Utah 3.2 120 80 22.9
Minnesota 2.7 72 66 14.9
Idaho 2.6 120 54 14.2
Wisconsin 2.6 53 66 10.8
Iowa 2.2 56 57 11.3
Vermont 2.2 48 32 11.2
Maine 2.1 83 51 7.8
New Hampshire 2.1 57 56 9.5
North Dakota 0.8 45 44 7.3
least_murder <- tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)
least_murder[order(least_murder$Murder), ]
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
Maine 2.1 83 51 7.8
New Hampshire 2.1 57 56 9.5
Iowa 2.2 56 57 11.3
Vermont 2.2 48 32 11.2
Idaho 2.6 120 54 14.2
Wisconsin 2.6 53 66 10.8
Minnesota 2.7 72 66 14.9
Utah 3.2 120 80 22.9
Connecticut 3.3 110 77 11.1
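The two-step approach above (take the tail of a decreasing sort, then sort again) can be shortened. Since order() sorts in increasing order by default, head() on that ordering gives the lowest values directly. A sketch, with lowest_murder as an illustrative name:

```r
# order() sorts in increasing order by default, so head() gives the lowest values
lowest_murder <- head(USArrests[order(USArrests$Murder), ], 10)
lowest_murder
```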
Similarly, we can do the same for Assault and Rape.
Top 10 states with the highest assault rate
head(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
Murder Assault UrbanPop Rape
North Carolina 13.0 337 45 16.1
Florida 15.4 335 80 31.9
Maryland 11.3 300 67 27.8
Arizona 8.1 294 80 31.0
New Mexico 11.4 285 70 32.1
South Carolina 14.4 279 48 22.5
California 9.0 276 91 40.6
Alaska 10.0 263 48 44.5
Mississippi 16.1 259 44 17.1
Michigan 12.1 255 74 35.1
Top 10 states with the lowest assault rate
tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
Murder Assault UrbanPop Rape
South Dakota 3.8 86 45 12.8
Maine 2.1 83 51 7.8
West Virginia 5.7 81 39 9.3
Minnesota 2.7 72 66 14.9
New Hampshire 2.1 57 56 9.5
Iowa 2.2 56 57 11.3
Wisconsin 2.6 53 66 10.8
Vermont 2.2 48 32 11.2
Hawaii 5.3 46 83 20.2
North Dakota 0.8 45 44 7.3
least_assault <- tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
least_assault[order(least_assault$Assault), ]
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
Hawaii 5.3 46 83 20.2
Vermont 2.2 48 32 11.2
Wisconsin 2.6 53 66 10.8
Iowa 2.2 56 57 11.3
New Hampshire 2.1 57 56 9.5
Minnesota 2.7 72 66 14.9
West Virginia 5.7 81 39 9.3
Maine 2.1 83 51 7.8
South Dakota 3.8 86 45 12.8
Top 10 states with the highest rape rate
head(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
Murder Assault UrbanPop Rape
Nevada 12.2 252 81 46.0
Alaska 10.0 263 48 44.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
Michigan 12.1 255 74 35.1
New Mexico 11.4 285 70 32.1
Florida 15.4 335 80 31.9
Arizona 8.1 294 80 31.0
Oregon 4.9 159 67 29.3
Missouri 9.0 178 70 28.2
Top 10 states with the lowest rape rate
tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
Murder Assault UrbanPop Rape
South Dakota 3.8 86 45 12.8
Iowa 2.2 56 57 11.3
Vermont 2.2 48 32 11.2
Connecticut 3.3 110 77 11.1
Wisconsin 2.6 53 66 10.8
New Hampshire 2.1 57 56 9.5
West Virginia 5.7 81 39 9.3
Rhode Island 3.4 174 87 8.3
Maine 2.1 83 51 7.8
North Dakota 0.8 45 44 7.3
least_rape <- tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
least_rape[order(least_rape$Rape), ]
Murder Assault UrbanPop Rape
North Dakota 0.8 45 44 7.3
Maine 2.1 83 51 7.8
Rhode Island 3.4 174 87 8.3
West Virginia 5.7 81 39 9.3
New Hampshire 2.1 57 56 9.5
Wisconsin 2.6 53 66 10.8
Connecticut 3.3 110 77 11.1
Vermont 2.2 48 32 11.2
Iowa 2.2 56 57 11.3
South Dakota 3.8 86 45 12.8
So, if I had to live in the USA, which state would you suggest?
tail(least_murder, 3)
Murder Assault UrbanPop Rape
Maine 2.1 83 51 7.8
New Hampshire 2.1 57 56 9.5
North Dakota 0.8 45 44 7.3
tail(least_assault, 3)
Murder Assault UrbanPop Rape
Vermont 2.2 48 32 11.2
Hawaii 5.3 46 83 20.2
North Dakota 0.8 45 44 7.3
tail(least_rape, 3)
Murder Assault UrbanPop Rape
Rhode Island 3.4 174 87 8.3
Maine 2.1 83 51 7.8
North Dakota 0.8 45 44 7.3
How about categorising states as high or low in murder, high or low in assault, and high or low in rape?
We can do so to gain further insight into the data.
UScrime <- USArrests # assign a new object name
head(UScrime) # first 6 rows
Murder Assault UrbanPop Rape
Alabama 13.2 236 58 21.2
Alaska 10.0 263 48 44.5
Arizona 8.1 294 80 31.0
Arkansas 8.8 190 50 19.5
California 9.0 276 91 40.6
Colorado 7.9 204 78 38.7
mean(UScrime$Murder)
[1] 7.788
UScrime$HighMurder <- as.numeric(UScrime$Murder > mean(UScrime$Murder))
str(UScrime)
'data.frame': 50 obs. of 5 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop : int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
$ HighMurder: num 1 1 1 1 1 1 0 0 1 1 ...
table(UScrime$HighMurder)
0 1
27 23
mean(UScrime$Assault)
[1] 170.76
UScrime$HighAssault <- as.numeric(UScrime$Assault > mean(UScrime$Assault))
str(UScrime)
'data.frame': 50 obs. of 6 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop : int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
$ HighMurder : num 1 1 1 1 1 1 0 0 1 1 ...
$ HighAssault: num 1 1 1 1 1 1 0 1 1 1 ...
table(UScrime$HighAssault)
0 1
27 23
table(UScrime$HighMurder, UScrime$HighAssault)
0 1
0 25 2
1 2 21
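This table means (rows are HighMurder, columns are HighAssault; 0 = not above the mean, 1 = above the mean):
25 states are low in both murder and assault,
2 states are low in murder but high in assault,
2 states are high in murder but low in assault,
21 states are high in both.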
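A two-way table is easier to read when its dimensions are labelled. A minimal sketch using the dnn argument of table() and addmargins() for totals. It recomputes the indicators as logical vectors, so the labels are FALSE/TRUE rather than 0/1; the names high_murder, high_assault and tab are illustrative.

```r
high_murder  <- USArrests$Murder  > mean(USArrests$Murder)
high_assault <- USArrests$Assault > mean(USArrests$Assault)

# dnn supplies dimension names, so rows and columns are labelled in the output
tab <- table(high_murder, high_assault, dnn = c("HighMurder", "HighAssault"))
tab

addmargins(tab)   # append row and column totals
```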
mean(UScrime$Rape)
[1] 21.232
UScrime$HighRape <- as.numeric(UScrime$Rape > mean(UScrime$Rape))
str(UScrime)
'data.frame': 50 obs. of 7 variables:
$ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
$ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
$ UrbanPop : int 58 48 80 50 91 78 77 72 80 60 ...
$ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
$ HighMurder : num 1 1 1 1 1 1 0 0 1 1 ...
$ HighAssault: num 1 1 1 1 1 1 0 1 1 1 ...
$ HighRape : num 0 1 1 0 1 1 0 0 1 1 ...
table(UScrime$HighRape)
0 1
29 21
table(UScrime$HighMurder, UScrime$HighRape)
0 1
0 23 4
1 6 17
table(UScrime$HighAssault, UScrime$HighRape)
0 1
0 23 4
1 6 17
We can use the merge() function to merge two data frames. It merges only two at a time.
name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
country <- I(c("Portugal", "Argentina", "England", "Germany", "Sweden"))
players <- data.frame(name, country)
players
name country
1 Ronaldo Portugal
2 Messi Argentina
3 Rooney England
4 Klose Germany
5 Zlatan Sweden
name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
age <- c(28, 27, 29, 36, 24)
cap <- c("Yes", "No", "Yes", "No", "Yes")
players2 <- data.frame(name, age, cap)
players2
name age cap
1 Ronaldo 28 Yes
2 Messi 27 No
3 Rooney 29 Yes
4 Klose 36 No
5 Zlatan 24 Yes
players3 <- merge(players, players2)
players3
name country age cap
1 Klose Germany 36 No
2 Messi Argentina 27 No
3 Ronaldo Portugal 28 Yes
4 Rooney England 29 Yes
5 Zlatan Sweden 24 Yes
name <- c("Messi", "Rooney", "Klose", "Ronaldo", "Drogba")
club <- c("Bayern Munich", "Barcelona", "Real Madrid", "ManU", "Chelsea")
players4 <- data.frame(name, club)
players4
name club
1 Messi Bayern Munich
2 Rooney Barcelona
3 Klose Real Madrid
4 Ronaldo ManU
5 Drogba Chelsea
merge(players3, players4)
name country age cap club
1 Klose Germany 36 No Real Madrid
2 Messi Argentina 27 No Bayern Munich
3 Ronaldo Portugal 28 Yes ManU
4 Rooney England 29 Yes Barcelona
merge(players3, players4, all = TRUE)
name country age cap club
1 Drogba <NA> NA <NA> Chelsea
2 Klose Germany 36 No Real Madrid
3 Messi Argentina 27 No Bayern Munich
4 Ronaldo Portugal 28 Yes ManU
5 Rooney England 29 Yes Barcelona
6 Zlatan Sweden 24 Yes <NA>
merge(players3, players4, all.x = TRUE)
name country age cap club
1 Klose Germany 36 No Real Madrid
2 Messi Argentina 27 No Bayern Munich
3 Ronaldo Portugal 28 Yes ManU
4 Rooney England 29 Yes Barcelona
5 Zlatan Sweden 24 Yes <NA>
merge(players3, players4, all.y = TRUE)
name country age cap club
1 Drogba <NA> NA <NA> Chelsea
2 Klose Germany 36 No Real Madrid
3 Messi Argentina 27 No Bayern Munich
4 Ronaldo Portugal 28 Yes ManU
5 Rooney England 29 Yes Barcelona
Suppose the key column holds the same variable but has a different name in each data frame.
Players <- c("Drogba", "Klose", "Messi", "Ronaldo")
Fees <- c(102, 225, 400, 430)
salary <- data.frame(Players, Fees)
merge(players3, salary, by.x = "name", by.y = "Players")
name country age cap Fees
1 Klose Germany 36 No 225
2 Messi Argentina 27 No 400
3 Ronaldo Portugal 28 Yes 430
merge(players3, salary, by.x = "name", by.y = "Players", all = TRUE)
name country age cap Fees
1 Drogba <NA> NA <NA> 102
2 Klose Germany 36 No 225
3 Messi Argentina 27 No 400
4 Ronaldo Portugal 28 Yes 430
5 Rooney England 29 Yes NA
6 Zlatan Sweden 24 Yes NA
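Since merge() takes two data frames at a time, one common way to combine more is to chain the calls with Reduce(). A sketch on three small made-up data frames (a, b and d), assuming they all share a name column:

```r
a <- data.frame(name = c("Messi", "Klose"), age = c(27, 36), stringsAsFactors = FALSE)
b <- data.frame(name = c("Messi", "Klose"), country = c("Argentina", "Germany"), stringsAsFactors = FALSE)
d <- data.frame(name = c("Messi", "Klose"), cap = c("No", "No"), stringsAsFactors = FALSE)

# Reduce() applies merge() pairwise: merge(merge(a, b), d)
combined <- Reduce(function(x, y) merge(x, y, by = "name"), list(a, b, d))
combined
```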
The apply functions manipulate slices of data from matrices, lists and data frames in a repetitive way. They allow crossing data in a number of ways and avoid the explicit use of loop constructs.
There are 4 commonly used apply functions in R, and a few others that are used less often.
First, let's do ?apply
Usage
apply(X, MARGIN, FUN, …)
Where
X is matrix/array,
MARGIN is 1 = row, 2 = column, c(1,2) = both
FUN = sum, mean, etc
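Before the larger example, a tiny sketch of the three MARGIN settings on a 2 x 3 matrix (m is an illustrative name):

```r
m <- matrix(1:6, nrow = 2)   # 2 rows, 3 columns

apply(m, 1, sum)                        # MARGIN = 1: one result per row
apply(m, 2, sum)                        # MARGIN = 2: one result per column
apply(m, c(1, 2), function(v) v * 10)   # MARGIN = c(1, 2): applied to every cell
```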
tv <- matrix(c(3, 5, 6, 2, 3, 5, 4, 3, 2, 1, 6, 5, 4, 3, 5, 4, 2, 4, 2, 2, 5, 6, 4, 5, 5, 2, 3, 2, 1, 4 ,1, 4, 3, 4, 5), nrow = 7)
colnames(tv) <- c("Oken", "Khagem", "Inao", "Thoi", "Romeo")
rownames(tv) <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
tv
Oken Khagem Inao Thoi Romeo
Sunday 3 3 5 6 1
Monday 5 2 4 4 4
Tuesday 6 1 2 5 1
Wednesday 2 6 4 5 4
Thursday 3 5 2 2 3
Friday 5 4 2 3 4
Saturday 4 3 5 2 5
class(tv)
[1] "matrix"
max(tv[1, ])
[1] 6
max(tv[2, ])
[1] 5
max(tv[3, ])
[1] 6
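We can also use a for loop to get the desired result.
for (i in seq_len(nrow(tv))) {
  weekday <- tv[i, ]           # ratings for day i
  day_max <- max(weekday)      # named day_max so it does not reuse the function name max
  print(day_max)
}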
[1] 6
[1] 5
[1] 6
[1] 6
[1] 5
[1] 5
[1] 5
Instead of writing so much, we can simply use apply(), which loops over the rows or columns for us.
apply(tv, 1, max) # finding maximum value in each row
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
6 5 6 6 5 5 5
apply(tv, 2, max) # finding maximum value in each column
Oken Khagem Inao Thoi Romeo
6 6 5 6 5
Using apply() on a data frame
tv_df <- as.data.frame(tv)
class(tv_df)
[1] "data.frame"
str(tv_df)
'data.frame': 7 obs. of 5 variables:
$ Oken : num 3 5 6 2 3 5 4
$ Khagem: num 3 2 1 6 5 4 3
$ Inao : num 5 4 2 4 2 2 5
$ Thoi : num 6 4 5 5 2 3 2
$ Romeo : num 1 4 1 4 3 4 5
apply(tv_df, 1, mean)
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
3.6 3.8 3.0 4.2 3.0 3.6 3.8
apply(tv_df, 2, mean)
Oken Khagem Inao Thoi Romeo
4.000000 3.428571 3.428571 3.857143 3.142857
How about adding a new variable which is not numeric?
tv_df$Place <- c("Club", "Home", "School", "Home", "School", "Home", "Club")
str(tv_df)
'data.frame': 7 obs. of 6 variables:
$ Oken : num 3 5 6 2 3 5 4
$ Khagem: num 3 2 1 6 5 4 3
$ Inao : num 5 4 2 4 2 2 5
$ Thoi : num 6 4 5 5 2 3 2
$ Romeo : num 1 4 1 4 3 4 5
$ Place : chr "Club" "Home" "School" "Home" ...
apply(tv_df, 1, mean)
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
NA NA NA NA NA NA NA
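The NA results appear because apply() first converts the data frame to a matrix, and a matrix can hold only one type. A single character column turns every value into text, so mean() cannot compute anything. The sketch below shows this on a small hypothetical data frame df and picks out the numeric columns with is.numeric() rather than by position.

```r
# Hypothetical two-row data frame with one character column
df <- data.frame(a = c(1, 2), b = c(3, 4), place = c("Club", "Home"),
                 stringsAsFactors = FALSE)

# apply() works on as.matrix(df), which here is a character matrix
m <- as.matrix(df)
print(class(m[1, 1]))   # "character": the numbers were turned into text

# Keep only the numeric columns before calling apply()
is_num <- sapply(df, is.numeric)
row_means <- apply(df[, is_num], 1, mean)
print(row_means)        # 2 and 3
```

Selecting columns by type keeps working if columns are later added or reordered, which a fixed index such as 1:5 does not.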
apply(tv_df[, 1:5], 1, mean)
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
3.6 3.8 3.0 4.2 3.0 3.6 3.8
apply(tv_df[, 1:5], 2, mean)
Oken Khagem Inao Thoi Romeo
4.000000 3.428571 3.428571 3.857143 3.142857
Besides apply(), we can also get the means with rowMeans() and colMeans().
rowMeans(tv_df[, 1:5]) # mean of rows
Sunday Monday Tuesday Wednesday Thursday Friday Saturday
3.6 3.8 3.0 4.2 3.0 3.6 3.8
colMeans(tv_df[, 1:5]) # mean of columns
Oken Khagem Inao Thoi Romeo
4.000000 3.428571 3.428571 3.857143 3.142857
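Applying a custom function
# Divide each value by the mean of its row or column
# (note: this name hides the built-in stats::ave() for the session)
ave <- function(x){
  x / mean(x)
}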
apply(tv, 2, ave)
Oken Khagem Inao Thoi Romeo
Sunday 0.75 0.8750000 1.4583333 1.5555556 0.3181818
Monday 1.25 0.5833333 1.1666667 1.0370370 1.2727273
Tuesday 1.50 0.2916667 0.5833333 1.2962963 0.3181818
Wednesday 0.50 1.7500000 1.1666667 1.2962963 1.2727273
Thursday 0.75 1.4583333 0.5833333 0.5185185 0.9545455
Friday 1.25 1.1666667 0.5833333 0.7777778 1.2727273
Saturday 1.00 0.8750000 1.4583333 0.5185185 1.5909091
apply(tv, 1, ave)
Sunday Monday Tuesday Wednesday Thursday Friday
Oken 0.8333333 1.3157895 2.0000000 0.4761905 1.0000000 1.3888889
Khagem 0.8333333 0.5263158 0.3333333 1.4285714 1.6666667 1.1111111
Inao 1.3888889 1.0526316 0.6666667 0.9523810 0.6666667 0.5555556
Thoi 1.6666667 1.0526316 1.6666667 1.1904762 0.6666667 0.8333333
Romeo 0.2777778 1.0526316 0.3333333 0.9523810 1.0000000 1.1111111
Saturday
Oken 1.0526316
Khagem 0.7894737
Inao 1.3157895
Thoi 0.5263158
Romeo 1.3157895
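Notice that with MARGIN = 1 the days end up as columns: when FUN returns a vector, apply() stores each result as a column of the output. A minimal sketch of getting the original layout back with t() follows. The matrix m and the helper scale_by_mean() are hypothetical stand-ins for tv and ave().

```r
# Hypothetical 2 x 3 matrix with named rows and columns
m <- matrix(c(2, 4, 6, 8, 10, 12), nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

scale_by_mean <- function(x) x / mean(x)   # same idea as ave() above

by_row <- apply(m, 1, scale_by_mean)
print(dim(by_row))          # 3 2: the rows of m have become columns

by_row_fixed <- t(by_row)   # transpose to restore the original layout
print(dim(by_row_fixed))    # 2 3
print(by_row_fixed)
```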
?lapply
Usage
lapply(X, FUN, …)
Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN
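To illustrate the optional arguments, here is a small sketch in which an extra argument is handed through lapply() to FUN. The list readings is hypothetical, and digits is an argument of round().

```r
# Hypothetical list of numeric vectors
readings <- list(first = c(1.234, 5.678), second = c(9.876))

# Anything after FUN is passed on to FUN for every element,
# so this calls round(x, digits = 1) on each vector
rounded <- lapply(readings, round, digits = 1)

print(rounded)   # first: 1.2 5.7, second: 9.9
```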
One of the big differences between apply() and lapply() is that lapply() always returns a list.
myWorkout <- list(PushUps = c(12, 12, 10, 12, 15, 13, 14),
Biceps = c(20, 22, 20, 24, 25, 22, 24),
Squats = c(30, 33, 29, 30, 32, 33, 28))
myWorkout
$PushUps
[1] 12 12 10 12 15 13 14
$Biceps
[1] 20 22 20 24 25 22 24
$Squats
[1] 30 33 29 30 32 33 28
lapply(myWorkout, mean)
$PushUps
[1] 12.57143
$Biceps
[1] 22.42857
$Squats
[1] 30.71429
Let's use lapply() on a data frame. A data frame is a list of columns, so the function is applied to each column.
myWorkoutDF <- data.frame(PushUps = c(12, 12, 10, 12, 15, 13, 14),
Biceps = c(20, 22, 20, 24, 25, 22, 24),
Squats = c(30, 33, 29, 30, 32, 33, 28))
myWorkoutDF
PushUps Biceps Squats
1 12 20 30
2 12 22 33
3 10 20 29
4 12 24 30
5 15 25 32
6 13 22 33
7 14 24 28
lapply(myWorkoutDF, mean)
$PushUps
[1] 12.57143
$Biceps
[1] 22.42857
$Squats
[1] 30.71429
colMeans(myWorkoutDF)
PushUps Biceps Squats
12.57143 22.42857 30.71429
MyName <- c("My", "name", "is", "Loiyumba")
MyName
[1] "My" "name" "is" "Loiyumba"
lapply(MyName, nchar)
[[1]]
[1] 2
[[2]]
[1] 4
[[3]]
[1] 2
[[4]]
[1] 8
If we don't want our output as a list, we can use sapply(), which simplifies the result to a vector or matrix where possible.
?sapply
Usage
sapply(X, FUN, …)
Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN
sapply(myWorkout, max)
PushUps Biceps Squats
15 25 33
sapply(myWorkoutDF, max)
PushUps Biceps Squats
15 25 33
sapply(MyName, nchar)
My name is Loiyumba
2 4 2 8
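The sketch below puts the two side by side and also shows what "simplifies" means when FUN returns more than one value. The short workout list is a cut-down, hypothetical version of myWorkout, and vapply() is mentioned only as a stricter alternative that this tutorial does not otherwise cover.

```r
words <- c("My", "name", "is", "Loiyumba")

as_list   <- lapply(words, nchar)   # always a list
as_vector <- sapply(words, nchar)   # simplified to a named integer vector

# When FUN returns two values per element, sapply() builds a matrix
workout <- list(PushUps = c(12, 15, 10), Squats = c(30, 33, 28))
ranges <- sapply(workout, range)    # row 1 = minimum, row 2 = maximum
print(ranges)

# vapply() does the same job but checks the type and length of each result
lengths_checked <- vapply(words, nchar, FUN.VALUE = integer(1))
print(lengths_checked)
```

Because vapply() stops with an error when a result has an unexpected shape, it can be a safer choice inside functions and scripts.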
We will now try sapply() on mtcars, one of R's built-in datasets.
Run ?mtcars to read its documentation.
str(mtcars) # R inbuilt data
'data.frame': 32 obs. of 11 variables:
$ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
$ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
$ disp: num 160 160 108 258 360 ...
$ hp : num 110 110 93 110 175 105 245 62 95 123 ...
$ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
$ wt : num 2.62 2.88 2.32 3.21 3.44 ...
$ qsec: num 16.5 17 18.6 19.4 17 ...
$ vs : num 0 0 1 1 0 1 0 1 1 1 ...
$ am : num 1 1 1 0 0 0 0 0 0 0 ...
$ gear: num 4 4 4 3 3 3 3 4 4 4 ...
$ carb: num 4 4 1 1 2 1 4 2 2 4 ...
sapply(mtcars[, c(1, 3:7)], mean)
mpg disp hp drat wt qsec
20.090625 230.721875 146.687500 3.596563 3.217250 17.848750
sapply(mtcars[, c(1, 3:7)], max)
mpg disp hp drat wt qsec
33.900 472.000 335.000 4.930 5.424 22.900
sapply(mtcars[, c(1, 3:7)], min)
mpg disp hp drat wt qsec
10.400 71.100 52.000 2.760 1.513 14.500
?tapply
Usage
tapply(X, INDEX, FUN, …)
Where
X is vector/columns of data frame/elements of a list
INDEX is a factor (or a list of factors) used to split X into groups
FUN = sum, mean, etc
… = optional arguments to FUN
table(mtcars$cyl)
4 6 8
11 7 14
tapply(mtcars$mpg, mtcars$cyl, mean)
4 6 8
26.66364 19.74286 15.10000
tapply(mtcars$mpg, mtcars$cyl, max)
4 6 8
33.9 21.4 19.2
table(mtcars$am)
0 1
19 13
tapply(mtcars$mpg, mtcars$am, mean)
0 1
17.14737 24.39231
tapply(mtcars$mpg, mtcars$am, min)
0 1
10.4 15.0
table(mtcars$gear)
3 4 5
15 12 5
tapply(mtcars$hp, mtcars$gear, max)
3 4 5
245 123 335
tapply(mtcars$hp, mtcars$gear, mean)
3 4 5
176.1333 89.5000 195.6000
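INDEX can also be a list of factors, in which case tapply() returns a table with one dimension per factor. The following is a minimal sketch on mtcars. The object names are chosen for illustration, and the coding of am (0 = automatic, 1 = manual) comes from ?mtcars.

```r
# Mean mpg for every combination of cylinder count and transmission type
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)
print(mpg_by_cyl_am)   # 3 rows (cyl) by 2 columns (am)

# Each cell equals the mean of the matching subset of rows
check <- mean(mtcars$mpg[mtcars$cyl == 4 & mtcars$am == 0])
print(check)
```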
Missing values are indicated by NA in R.
is.na() is the function to check missing values.
x <- c(10, 20, NA, 40, 50)
is.na(x)
[1] FALSE FALSE TRUE FALSE FALSE
!is.na(x) # negate
[1] TRUE TRUE FALSE TRUE TRUE
sum(x)
[1] NA
If a vector contains NA, the result of most computations on it is also NA. To leave the missing values out of a computation, we set the na.rm argument of the function to TRUE.
sum(x, na.rm = TRUE)
[1] 120
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 30
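Setting na.rm = TRUE has the same effect as dropping the missing values before computing. The short sketch below makes that explicit and also shows how to count and locate the NAs. The variable names are illustrative.

```r
x <- c(10, 20, NA, 40, 50)

n_missing  <- sum(is.na(x))     # TRUE counts as 1, so this counts the NAs
where      <- which(is.na(x))   # position of each missing value
x_complete <- x[!is.na(x)]      # keep only the observed values

print(n_missing)          # 1
print(where)              # 3
print(mean(x_complete))   # 30, the same as mean(x, na.rm = TRUE)
```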
patientID <- 1:10
patientName <- c("Keiku", "Bala" ,"Sadananda", "Gokul", "Bonny", "Soma", "Maya", "Abenao", "Artina", "Olen" )
patientGender <- c("Male", "Female", "Male", "Male", "Male",
"Female", "Female", "Female", "Female", "Male")
patientAge <- c(36, 26, 40, 35, 37, 23, 37, 32, 28, 42)
patient <- data.frame(patientID, patientName, patientGender, patientAge)
patient
patientID patientName patientGender patientAge
1 1 Keiku Male 36
2 2 Bala Female 26
3 3 Sadananda Male 40
4 4 Gokul Male 35
5 5 Bonny Male 37
6 6 Soma Female 23
7 7 Maya Female 37
8 8 Abenao Female 32
9 9 Artina Female 28
10 10 Olen Male 42
patient[3, 4] <- NA
patient
patientID patientName patientGender patientAge
1 1 Keiku Male 36
2 2 Bala Female 26
3 3 Sadananda Male NA
4 4 Gokul Male 35
5 5 Bonny Male 37
6 6 Soma Female 23
7 7 Maya Female 37
8 8 Abenao Female 32
9 9 Artina Female 28
10 10 Olen Male 42
missing <- function(x){
sum(is.na(x))
}
sapply(patient, missing)
patientID patientName patientGender patientAge
0 0 0 1
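The same per-column count can be obtained without a helper function, because is.na() on a data frame returns a logical matrix that colSums() can add up. A small sketch on a hypothetical data frame df:

```r
# Hypothetical data frame with missing values in two columns
df <- data.frame(id = 1:4,
                 age = c(36, NA, 40, NA),
                 city = c("Imphal", "Delhi", NA, "Pune"),
                 stringsAsFactors = FALSE)

# is.na(df) is a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(df))
print(na_per_column)   # id 0, age 2, city 1
```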
tapply(patient$patientAge, patient$patientGender, mean)
Female Male
29.2 NA
tapply(patient$patientAge, patient$patientGender, mean, na.rm = TRUE)
Female Male
29.2 37.5
patient$patientTreatment <- c("A", "A", "D", "C", "A", "B", "C", "D", "C", "B")
str(patient)
'data.frame': 10 obs. of 5 variables:
$ patientID : int 1 2 3 4 5 6 7 8 9 10
$ patientName : Factor w/ 10 levels "Abenao","Artina",..: 6 3 9 5 4 10 7 1 2 8
$ patientGender : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 1 1 1 2
$ patientAge : num 36 26 NA 35 37 23 37 32 28 42
$ patientTreatment: chr "A" "A" "D" "C" ...
tapply(patient$patientAge, patient$patientTreatment, max)
A B C D
37 42 37 NA
tapply(patient$patientAge, patient$patientTreatment, max, na.rm = TRUE)
A B C D
37 42 37 32
tapply(patient$patientAge, patient$patientTreatment, min, na.rm = TRUE)
A B C D
26 23 28 32
Or we can remove the observations that contain missing values from the data frame altogether.
head(na.omit(patient))
patientID patientName patientGender patientAge patientTreatment
1 1 Keiku Male 36 A
2 2 Bala Female 26 A
4 4 Gokul Male 35 C
5 5 Bonny Male 37 A
6 6 Soma Female 23 B
7 7 Maya Female 37 C
complete <- complete.cases(patient) # TRUE for rows with no missing values
patient[complete, ]
patientID patientName patientGender patientAge patientTreatment
1 1 Keiku Male 36 A
2 2 Bala Female 26 A
4 4 Gokul Male 35 C
5 5 Bonny Male 37 A
6 6 Soma Female 23 B
7 7 Maya Female 37 C
8 8 Abenao Female 32 D
9 9 Artina Female 28 C
10 10 Olen Male 42 B
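na.omit() and complete.cases() look at every column. If only some columns matter for an analysis, complete.cases() can be given just those columns. The sketch below assumes a hypothetical data frame df with gaps in two different columns.

```r
# Hypothetical data frame with missing values in two columns
df <- data.frame(id = 1:4,
                 age = c(36, NA, 40, 28),
                 weight = c(NA, 60, 72, 55))

# Rows that are complete in every column
all_complete <- df[complete.cases(df), ]

# Rows where age is known, whatever the other columns contain
age_known <- df[complete.cases(df$age), ]

print(all_complete$id)   # 3 4
print(age_known$id)      # 1 3 4
```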
Dates are represented by the Date class.
date <- "01-05-1990"
class(date)
[1] "character"
date <- as.Date(date, format = "%d-%m-%Y")
class(date)
[1] "Date"
date
[1] "1990-05-01"
unclass(date)
[1] 7425
oldDate <- as.Date("1970-01-10")
unclass(oldDate)
[1] 9
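The numbers returned by unclass() are days counted from 1970-01-01, which is how R stores a Date internally. A minimal sketch of going back and forth between the two forms (the variable names are illustrative):

```r
d <- as.Date("01-05-1990", format = "%d-%m-%Y")

# Number of days since 1970-01-01
days_since_epoch <- as.numeric(d)
print(days_since_epoch)   # 7425

# Turn a day count back into a Date by naming the origin
back_again <- as.Date(days_since_epoch, origin = "1970-01-01")
print(back_again)         # "1990-05-01"

# format() writes a Date as text in any layout
as_text <- format(d, "%d/%m/%Y")
print(as_text)            # "01/05/1990"
```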
Times are represented by the POSIXct or the POSIXlt class.
now <- Sys.time()
now
[1] "2016-07-05 13:03:55 IST"
class(now)
[1] "POSIXct" "POSIXt"
unclass(now)
[1] 1467704035
guess <- 220000000
class(guess) <- c("POSIXct", "POSIXt")
guess
[1] "1976-12-21 12:36:40 IST"
guess <- as.POSIXlt(guess)
names(unclass(guess))
[1] "sec" "min" "hour" "mday" "mon" "year" "wday"
[8] "yday" "isdst" "zone" "gmtoff"
guess$wday
[1] 2
now - guess
Time difference of 14441.02 days
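A POSIXct value is a count of seconds since 1970-01-01 00:00:00 UTC, and the time zone only affects how it is printed. The sketch below builds the same instant as guess with an explicit time zone and then uses difftime() to choose the unit of a difference. It assumes the time zone name "Asia/Kolkata" is available on the system.

```r
# 220000000 seconds after 1970-01-01 00:00:00 UTC
guess_utc <- as.POSIXct(220000000, origin = "1970-01-01", tz = "UTC")
print(guess_utc)   # "1976-12-21 07:06:40 UTC"

# The same instant displayed in Indian Standard Time
print(format(guess_utc, tz = "Asia/Kolkata", usetz = TRUE))

# Adding a number to a POSIXct adds seconds
later <- guess_utc + 3600 * 36   # 36 hours later

# difftime() lets us pick the unit instead of letting R choose
gap_hours <- difftime(later, guess_utc, units = "hours")
gap_days  <- difftime(later, guess_utc, units = "days")
print(gap_hours)   # Time difference of 36 hours
print(gap_days)    # Time difference of 1.5 days
```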
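Dates and times come in many different styles. To convert such text into date-time objects, we use the strptime() function and describe the layout with a format string. Run ?strptime to see the available format codes.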
dates <- c("December 25, 2014 11:45", "January 25, 2015 23:30")
dates
[1] "December 25, 2014 11:45" "January 25, 2015 23:30"
class(dates)
[1] "character"
new_dates <- strptime(dates, format = "%B %d, %Y %H:%M")
class(new_dates)
[1] "POSIXlt" "POSIXt"
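One caveat: %B matches month names in the current locale, so the call above may return NA on a system that is not set to English. The sketch below uses a purely numeric layout to avoid that, fixes the time zone to UTC so the result does not depend on the machine, and shows what happens when a string does not match the format. The input strings are made up for illustration.

```r
stamps <- c("2014-12-25 11:45", "2015-01-25 23:30")

# strptime() returns POSIXlt; an explicit tz keeps the result reproducible
parsed <- strptime(stamps, format = "%Y-%m-%d %H:%M", tz = "UTC")

# POSIXct is usually more convenient for storage and arithmetic
parsed_ct <- as.POSIXct(parsed)

gap <- difftime(parsed_ct[2], parsed_ct[1], units = "hours")
print(gap)   # Time difference of 755.75 hours

# A string that does not match the format gives NA rather than an error
bad <- strptime("25/12/2014", format = "%Y-%m-%d", tz = "UTC")
print(is.na(bad))   # TRUE
```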
first_date <- as.Date("1996-09-28")
second_date <- as.Date("1996-10-15")
second_date - first_date
Time difference of 17 days
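Subtracting two dates gives a difftime object. When a plain number is needed, or when we want to step through a range of dates, something like the following sketch can be used (the variable names beyond first_date and second_date are illustrative):

```r
first_date  <- as.Date("1996-09-28")
second_date <- as.Date("1996-10-15")

# Strip the "Time difference of ..." wrapper to get a plain number of days
gap <- as.numeric(second_date - first_date)
print(gap)              # 17

# Adding a number to a Date moves it forward by that many days
one_week_later <- first_date + 7
print(one_week_later)   # "1996-10-05"

# seq() can generate regular dates between two end points
weekly <- seq(first_date, second_date, by = "week")
print(weekly)           # "1996-09-28" "1996-10-05" "1996-10-12"
```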