Starting with R

Loiyumba

What is R?

is an open-source statistical software package
can do data analysis
can visualize data
can fit machine learning models

To download R - https://cran.r-project.org/

What is RStudio?

is the most popular open-source integrated development environment (IDE) for R
can create interactive plots
can create web applications
can create presentations
can write and publish documents, etc.

To download RStudio - https://www.rstudio.com/products/rstudio/download/

RStudio

RStudio comes with 4 panes -

Source/Editor
Console/Command line
Environment/History/Files
Plots/Packages/Help/Viewer
And many other functions

Basic Operations in R

In the console/command line

7 + 9 + 5 + 0 + 0 + 1

[1] 22

7 + 9 * 5

[1] 52

(7 + 9) * 5

[1] 80

c(7, 9, 5, 0, 0, 1)

[1] 7 9 5 0 0 1

1:50

 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50

2^3

[1] 8

c(7, 9, 5, 0, 0, 1) + c(70, 90, 50, 10, 10, 10)

[1] 77 99 55 10 10 11

c(7, 9, 5, 0, 0, 1) * c(70, 90, 50, 10, 10, 10)

[1] 490 810 250   0   0  10

c(7, 9, 5, 0, 0, 1) - c(70, 90, 50, 10, 10, 10)

[1] -63 -81 -45 -10 -10  -9

c(7, 9, 5, 0, 0, 1) + 100

[1] 107 109 105 100 100 101

1/c(7, 9, 5, 0, 0,1)

[1] 0.1428571 0.1111111 0.2000000       Inf       Inf 1.0000000

c(7, 9, 5, 0, 0, 1) + c(10, 100)

[1]  17 109  15 100  10 101
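The last sum works because R recycles the shorter vector until it matches the length of the longer one. A minimal sketch of that rule follows; the names long_vec, short_vec and recycled are illustrative and not part of the original session.

```r
long_vec  <- c(7, 9, 5, 0, 0, 1)
short_vec <- c(10, 100)

# The shorter vector is repeated until it is as long as the longer one
recycled <- rep(short_vec, length.out = length(long_vec))
recycled

# So these two sums give the same result
long_vec + short_vec
long_vec + recycled
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.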

paste("Hello", "World!")

[1] "Hello World!"

"Hello World!"

[1] "Hello World!"

We can add comments with the code

2 + 5 # Sum of 2 and 5 will give 7

[1] 7

Assigning to a variable/object

a <- 1 # This is assignment sign (<-)
a

[1] 1

a + 1

[1] 2

Some notes on Object name

Object names cannot begin with numbers, $, ^, !
Wise to avoid names already in use
R is case sensitive. So it will treat 'a' and 'A' differently
We can remove an object with

rm(object_name)
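The notes above can be tried out with a short sketch. The object names my_total and My_Total are made up for illustration.

```r
my_total <- 10
My_Total <- 20            # a different object: names are case sensitive
my_total == My_Total      # FALSE

exists("my_total")        # TRUE
rm(my_total)              # remove the object called my_total
exists("my_total")        # FALSE
exists("My_Total")        # TRUE, the other object is untouched
```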

R Objects Attributes

R objects can have attributes, which are like metadata for the object. These metadata can be very useful in that they help to describe the object. They are -

names, dimnames
dimensions
class
length

Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects contain attributes, in which case the attributes() function returns NULL.
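As a small illustration of the paragraph above, the sketch below compares a plain vector with a named one. The object names plain_vec and named_vec are assumptions made for this example.

```r
plain_vec <- c(7, 9, 5)
attributes(plain_vec)      # NULL: a plain vector carries no attributes

named_vec <- c(first = 7, second = 9, third = 5)
attributes(named_vec)      # a list containing the names

length(named_vec)          # 3
class(named_vec)           # "numeric"
```

Length and class are reported by their own functions, length() and class(), rather than by attributes().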

5 -> b 
print(b)

[1] 5

a + b

[1] 6

(a + b)/2

[1] 3

c = (a + b)/2 # We can use (=) instead of (<-)
c

[1] 3

w <- c(7, 9, 5, 0, 0, 1)
x <- c(1, 0, 0, 5, 9, 7)

y <- c(w, x)
y

 [1] 7 9 5 0 0 1 1 0 0 5 9 7

z <- c(x, w)
z

 [1] 1 0 0 5 9 7 7 9 5 0 0 1

d <- 1:10
d

 [1]  1  2  3  4  5  6  7  8  9 10

e <- 10:1
e

 [1] 10  9  8  7  6  5  4  3  2  1

d + 1

 [1]  2  3  4  5  6  7  8  9 10 11

e - 1

 [1] 9 8 7 6 5 4 3 2 1 0

d + e

 [1] 11 11 11 11 11 11 11 11 11 11

R Built-in Functions (a.k.a. Base Package)

k <- seq(from = 1, to = 10, by = 2)
k

[1] 1 3 5 7 9

j <- seq(from = -1, to = 1, by = 0.2)
j

 [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0

m <- rep(2, times = 5)
m

[1] 2 2 2 2 2

p <- rep(1:3, times = 3)
p

[1] 1 2 3 1 2 3 1 2 3

q <- rep(1:3, each = 3)
q

[1] 1 1 1 2 2 2 3 3 3

For information on any function

help("seq")

or, we can simply type

?rep

And if we want to check or remove the object/variable

ls() # check the variables in the current session

 [1] "a" "b" "c" "d" "e" "j" "k" "m" "p" "q" "w" "x" "y" "z"

rm(p) # delete single object
rm(k, m) # delete multiple objects
ls()

 [1] "a" "b" "c" "d" "e" "j" "q" "w" "x" "y" "z"

rm(list = ls()) # delete everything 
ls()

character(0)

R Maths Functions

g <- c(45, 90, 20)
sum(g) # Sum

[1] 155

mean(g) # Mean

[1] 51.66667

round(51.66667, 2) # Round to n decimal places

[1] 51.67

round(mean(g), 2) # Nested function

[1] 51.67

median(g) # Median

[1] 45

rank(g) # Rank the elements

[1] 2 3 1

var(g) # Variance

[1] 1258.333

max(g) # Largest element

[1] 90

min(g) # Smallest element

[1] 20

log(25) # Natural log

[1] 3.218876

exp(5) # Exponential

[1] 148.4132

sqrt(95) # Square root

[1] 9.746794

abs(-43) # Absolute value

[1] 43

u <- 45:60 
quantile(u) # Quantile

   0%   25%   50%   75%  100% 
45.00 48.75 52.50 56.25 60.00

sd(u) # Standard deviation

[1] 4.760952

R Data Types

Four basic data types in R -

numbers(numeric)
character string(text)
logical
factor
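One way to check which of these types a value has is the class() function. A quick sketch with made-up values:

```r
class(3.14)                      # "numeric"
class("Imphal")                  # "character"
class(TRUE)                      # "logical"
class(factor(c("yes", "no")))    # "factor"
```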

Numeric

Any number. Appropriate for math.

1 + 1

[1] 2

[1] 100

Character

Any text. Any symbols surrounded by quotes.

"hello, this is R"

[1] "hello, this is R"

"Imphal's pin code is 795001"

[1] "Imphal's pin code is 795001"

f <- c("1", "2", "3")
f

[1] "1" "2" "3"

Logical

R's form of binary data: TRUE or FALSE. Useful for logical tests.

100 < 400

[1] TRUE

100 > 400

[1] FALSE

Logical Comparison

L <- 1:5
L

[1] 1 2 3 4 5

L > 3 # greater than

[1] FALSE FALSE FALSE  TRUE  TRUE

L >= 3 # greater than or equal to

[1] FALSE FALSE  TRUE  TRUE  TRUE

L < 3 # less than

[1]  TRUE  TRUE FALSE FALSE FALSE

L <= 3 # less than or equal to

[1]  TRUE  TRUE  TRUE FALSE FALSE

L == 3 # equal to

[1] FALSE FALSE  TRUE FALSE FALSE

L != 3 # not equal to

[1]  TRUE  TRUE FALSE  TRUE  TRUE

%in% Operator

The %in% operator tests whether each element on the left is a member of the group on the right.

"mango" %in% c("mango", "apple", "banana")

[1] TRUE

1 %in% c(2:8)

[1] FALSE

c(12:20) %in% c(15, 17)

[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

Boolean Operators

We can combine logical tests with &, |, xor, !, any, and all.

x <- 1:10
x > 2 & x < 8

 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE

x > 8 | x < 2

 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

x[(x > 2) & (x < 8)]

[1] 3 4 5 6 7

x[(x > 8) | (x < 2)]

[1]  1  9 10
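The examples above cover & and |. The remaining operators from the list can be sketched the same way; the vector name nums is illustrative.

```r
nums <- 1:10

!(nums > 2)                 # negation: TRUE only for 1 and 2
xor(nums > 2, nums < 8)     # TRUE where exactly one of the two tests is TRUE
any(nums > 8)               # TRUE: at least one element passes the test
all(nums > 8)               # FALSE: not every element passes the test
```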

Factor

R's form of categorical data.

dist <- factor(c("Imphal East", "Imphal West", "Senapati", "Imphal West"))
dist

[1] Imphal East Imphal West Senapati    Imphal West
Levels: Imphal East Imphal West Senapati

table(dist)

dist
Imphal East Imphal West    Senapati 
          1           2           1

as.numeric(dist)

[1] 1 2 3 2

status_vector <- c("Married", "Not Married")
status_factor <- factor(status_vector)
status_factor <- factor(status_factor, levels = c("Not Married", "Married"))
status_factor

[1] Married     Not Married
Levels: Not Married Married

R Data Structures

Some of the most frequently-used R data structures are -

Vectors
Matrices
Lists
Data Frames

Vectors

Vector elements must all have the same mode, which can be integer, numeric (floating-point number), character (string), logical (boolean), complex, object, etc.

A vector combines multiple elements into a one-dimensional array.

x <- 1:10 # integer
x

 [1]  1  2  3  4  5  6  7  8  9 10

fruits <- c("Apple", "Banana", "Mango", "Papaya") 
fruits # character

[1] "Apple"  "Banana" "Mango"  "Papaya"

logi <- c(TRUE, FALSE, TRUE) # logical
logi

[1]  TRUE FALSE  TRUE

com <- c(1+0i, 2+4i) # complex
com

[1] 1+0i 2+4i

What happens if we mix vectors of different classes?

student <- c("Tomba", "Chaoba", "Thoibi", "Bena")
class(student)

[1] "character"

age <- c(24, 26, 25, 22)
class(age)

[1] "numeric"

info <- c(student, age)
info

[1] "Tomba"  "Chaoba" "Thoibi" "Bena"   "24"     "26"     "25"     "22"

class(info)

[1] "character"

# Do ?class for more detail on class function

In coercion between

logical and numeric, class(vector) will be numeric
logical and character, class(vector) will be character
numeric and character, class(vector) will be character
logical, numeric and character, class(vector) will be character

class(c(795001, "Imphal"))

[1] "character"

class(c(TRUE, 795001))

[1] "numeric"

class(c(TRUE, 795001, "Imphal"))

[1] "character"

class(c("TRUE", 795001))

[1] "character"

Explicit Coercion

Objects can be explicitly coerced from one class to another using the as.* functions, if available.

t <- -1:10
class(t)

[1] "integer"

as.numeric(t)

 [1] -1  0  1  2  3  4  5  6  7  8  9 10

as.logical(t)

 [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE

as.character(t)

 [1] "-1" "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.

v <- c("Manipur", "Nagaland", "Mizoram")
class(v)

[1] "character"

as.numeric(v)

[1] NA NA NA

as.logical(v)

[1] NA NA NA

as.complex(v)

[1] NA NA NA

NA means Not Available
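Once NAs are present, they usually need to be detected or skipped. A minimal sketch, assuming a made-up vector states in which only one value looks like a number:

```r
states <- c("Manipur", "12", "Mizoram")

# as.numeric() warns and returns NA for text it cannot convert;
# suppressWarnings() is used here only to keep the output short
converted <- suppressWarnings(as.numeric(states))
converted

is.na(converted)                 # flags the missing values
mean(converted)                  # NA, because NA propagates through the calculation
mean(converted, na.rm = TRUE)    # 12, after dropping the NAs
```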

Some Vector Functions

y <- c(7, 9, 5, 0, 0, 1)
sort(y)

[1] 0 0 1 5 7 9

sort(y, decreasing = TRUE)

[1] 9 7 5 1 0 0

table(y)

y
0 1 5 7 9 
2 1 1 1 1

rev(y)

[1] 1 0 0 5 9 7

unique(y)

[1] 7 9 5 0 1

Selecting Vector Elements

a <- seq(from = 1, to = 20, by = 2)
a

 [1]  1  3  5  7  9 11 13 15 17 19

a[5]

[1] 9

a[-5]

[1]  1  3  5  7 11 13 15 17 19

a[3:6]

[1]  5  7  9 11

a[-(3:6)]

[1]  1  3 13 15 17 19

a[c(3, 6)]

[1]  5 11

a[a > 11]

[1] 13 15 17 19

a[a < 11]

[1] 1 3 5 7 9

a[a == 11]

[1] 11

district <- c("Imphal East", "Senapati", "Churachandpur", "Thoubal", "Ukhrul", "Bishenpur")
district[5]

[1] "Ukhrul"

district[c(1, 6)]

[1] "Imphal East" "Bishenpur"

alpha <- letters[1:10]
alpha

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"

alpha[c(5,5,5,5)]

[1] "e" "e" "e" "e"

alpha[c(5:1, 1:5)]

 [1] "e" "d" "c" "b" "a" "a" "b" "c" "d" "e"

alpha[11] # indexing with out-of-range values

[1] NA

alpha[1:11]

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA

alpha > "d"

 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

alpha[alpha > "d"]

[1] "e" "f" "g" "h" "i" "j"

selector <- alpha > "d"
alpha[selector]

[1] "e" "f" "g" "h" "i" "j"

which(alpha > "d")

[1]  5  6  7  8  9 10

indexes <- which(alpha > "d")
alpha[indexes]

[1] "e" "f" "g" "h" "i" "j"

g <- 1:10
g

 [1]  1  2  3  4  5  6  7  8  9 10

g[2] <- 100
g

 [1]   1 100   3   4   5   6   7   8   9  10

g[11] <- 200
g

 [1]   1 100   3   4   5   6   7   8   9  10 200

g[c(4,8)] <- -500
g

 [1]    1  100    3 -500    5    6    7 -500    9   10  200

g[3] <- g[11]
g

 [1]    1  100  200 -500    5    6    7 -500    9   10  200

g <- c(g, 33, 44)
g

 [1]    1  100  200 -500    5    6    7 -500    9   10  200   33   44

g <- c(22, 55, g)
g

 [1]   22   55    1  100  200 -500    5    6    7 -500    9   10  200   33
[15]   44

g <- c(g[1:5], 111, g[6:15])
g

 [1]   22   55    1  100  200  111 -500    5    6    7 -500    9   10  200
[15]   33   44

g <- g[-3:-6]
g

 [1]   22   55 -500    5    6    7 -500    9   10  200   33   44

Matrices

A matrix is a vector with two additional attributes: the number of rows and the number of columns. It combines multiple elements into a two-dimensional array and is created with the matrix() function.

mat <- matrix(c(1:6), nrow = 2)
mat

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

attributes(mat)

$dim
[1] 2 3

Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.

mat2 <- matrix(c(1:6), nrow = 3)
mat2

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

dim(mat2)

[1] 3 2

However, if we want to fill a matrix row-wise, we can set byrow = TRUE.

mat3 <- matrix(c(1:6), nrow = 3, byrow = TRUE)
mat3

     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6

dim(mat3)

[1] 3 2

Matrices can also be created directly from vectors by adding a dimension attribute.

mat4 <- 1:10
dim(mat4) <- c(2, 5)
mat4

     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.

p <- 1:5
q <- 6:10
cbind(p, q)

     p  q
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10

Do ?cbind in the console for more info

rbind(p, q)

  [,1] [,2] [,3] [,4] [,5]
p    1    2    3    4    5
q    6    7    8    9   10

Do ?rbind in the console for more info

We can also transpose a matrix (swap its rows and columns) with t()

pq <- rbind(p, q)
pq # 2, 5

  [,1] [,2] [,3] [,4] [,5]
p    1    2    3    4    5
q    6    7    8    9   10

t(pq) # 5, 2

     p  q
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10

Vectorized Matrix Operations

x <- matrix(1:6, nrow = 3)
x

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

y <- matrix(rep(5, times = 6), nrow = 3)
y

     [,1] [,2]
[1,]    5    5
[2,]    5    5
[3,]    5    5

Element-wise multiplication

x * y

     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30

Element-wise division

x/y

     [,1] [,2]
[1,]  0.2  0.8
[2,]  0.4  1.0
[3,]  0.6  1.2

3 * x

     [,1] [,2]
[1,]    3   12
[2,]    6   15
[3,]    9   18

x + x

     [,1] [,2]
[1,]    2    8
[2,]    4   10
[3,]    6   12

Subsetting in Matrices

z <- x * y
z

     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30

z[1, ] # select 1st row & all columns

[1]  5 20

z[, 1] # select all rows & 1st column

[1]  5 10 15

z[2:3, , drop = FALSE] # keeping the matrix style

     [,1] [,2]
[1,]   10   25
[2,]   15   30

z[, 2, drop = FALSE]

     [,1]
[1,]   20
[2,]   25
[3,]   30

z[2,2] # select an element

[1] 25
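The effect of drop = FALSE can be checked with dim() and is.matrix(). This is a sketch on a separate illustrative matrix m, not on z itself.

```r
m <- matrix(1:6, nrow = 3)         # illustrative 3 x 2 matrix

dim(m[, 2])                        # NULL: the result was simplified to a vector
dim(m[, 2, drop = FALSE])          # 3 1: the result is still a matrix
is.matrix(m[1, ])                  # FALSE
is.matrix(m[1, , drop = FALSE])    # TRUE
```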

Filtering on Matrices

z[z[, 1] >= 10, ]

     [,1] [,2]
[1,]   10   25
[2,]   15   30

z[z[, 1] > 10 & z[, 2] >= 25, ]

[1] 15 30

Matrix Row and Column Names

colnames(z)

NULL

colnames(z) <- c("A", "B")
z

      A  B
[1,]  5 20
[2,] 10 25
[3,] 15 30

colnames(z)

[1] "A" "B"

z[, "B"]

[1] 20 25 30

rownames(z)

NULL

rownames(z) <- c("P", "Q", "R")
z

   A  B
P  5 20
Q 10 25
R 15 30

Lists

A list is a one dimensional group of R objects.
Create lists with 'list()' function.

a_list <- list(1, "HP", TRUE)
a_list

[[1]]
[1] 1

[[2]]
[1] "HP"

[[3]]
[1] TRUE

The elements of a list can be anything, even vectors or other lists.

students <- list(name = c("Tomba", "Chaoba"), age = c(23, 25), single = c(TRUE, FALSE))
students

$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$single
[1]  TRUE FALSE

str(students)

List of 3
 $ name  : chr [1:2] "Tomba" "Chaoba"
 $ age   : num [1:2] 23 25
 $ single: logi [1:2] TRUE FALSE

names(students)

[1] "name"   "age"    "single"

students$name

[1] "Tomba"  "Chaoba"

students[["name"]]

[1] "Tomba"  "Chaoba"

students[[1]]

[1] "Tomba"  "Chaoba"

students["name"]

$name
[1] "Tomba"  "Chaoba"

students[c("name", "single")]

$name
[1] "Tomba"  "Chaoba"

$single
[1]  TRUE FALSE

students[c(1, 3)]

$name
[1] "Tomba"  "Chaoba"

$single
[1]  TRUE FALSE
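Since list elements can themselves be lists, the same operators can be chained. The sketch below uses a hypothetical nested list called batch to show the difference between single and double brackets.

```r
batch <- list(
  teacher = "Ibemhal",                                   # made-up value
  pupils  = list(name = c("Tomba", "Chaoba"), age = c(23, 25))
)

names(batch["pupils"])        # "pupils": single brackets return a sub-list
names(batch[["pupils"]])      # "name" "age": double brackets return the element itself
batch[["pupils"]][["age"]]    # 23 25
batch$pupils$name[2]          # "Chaoba"
```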

Adding/Deleting List Elements

students$education <- c("Graduate", "Master")
students

$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$single
[1]  TRUE FALSE

$education
[1] "Graduate" "Master"

students$single <- NULL
students

$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$education
[1] "Graduate" "Master"

Data Frames

Data frames group vectors together into a two-dimensional table. Each vector becomes a column in the table. As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data. We can create data frames with 'data.frame' function.

students <- data.frame(rollnum = c(10,42,3),
                       name = c("Iboyaima", "Tomchou", "Tombi"),
                       examfailed = c(TRUE, FALSE, TRUE)
                   )

students

  rollnum     name examfailed
1      10 Iboyaima       TRUE
2      42  Tomchou      FALSE
3       3    Tombi       TRUE

class(students)

[1] "data.frame"

str(students)

'data.frame':   3 obs. of  3 variables:
 $ rollnum   : num  10 42 3
 $ name      : Factor w/ 3 levels "Iboyaima","Tombi",..: 1 3 2
 $ examfailed: logi  TRUE FALSE TRUE
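The name column appears as a Factor in the output above because R versions before 4.0.0 converted strings to factors by default. From R 4.0.0 onwards the default is stringsAsFactors = FALSE, so the same call gives a chr column. Setting the argument explicitly gives the same result on any version. The object names as_text and as_fact below are illustrative.

```r
as_text <- data.frame(name = c("Iboyaima", "Tomchou", "Tombi"),
                      stringsAsFactors = FALSE)
as_fact <- data.frame(name = c("Iboyaima", "Tomchou", "Tombi"),
                      stringsAsFactors = TRUE)

class(as_text$name)     # "character"
class(as_fact$name)     # "factor"
levels(as_fact$name)    # the factor levels, sorted alphabetically
```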

cbind & rbind in Data Frames

manipur <- data.frame(Districts = c("ImpWest", "ImpEast", "Ccpur",
                                    "Thoubal", "Tamenglong", "Senapati",
                                    "Chandel", "Ukhrul", "Bishenpur"),
                      Population = c(700000, 500000, 400000, 300000,
                                     200000, 450000, 400000, 500000, 750000),
                      Literacy = c(9.5, 9.4, 7.2, 8.6, 5.3, 8.5, 6.8,
                                   6.2, 8.7), stringsAsFactors = FALSE
                      )

manipur

   Districts Population Literacy
1    ImpWest     700000      9.5
2    ImpEast     500000      9.4
3      Ccpur     400000      7.2
4    Thoubal     300000      8.6
5 Tamenglong     200000      5.3
6   Senapati     450000      8.5
7    Chandel     400000      6.8
8     Ukhrul     500000      6.2
9  Bishenpur     750000      8.7

ForestCover <- c(40, 45, 67, 60, 90, 68, 65, 85, 70)
ForestCover

[1] 40 45 67 60 90 68 65 85 70

With 'cbind' function, we will add this vector to manipur.

manipur <- cbind(manipur, ForestCover)

manipur

   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65
8     Ukhrul     500000      6.2          85
9  Bishenpur     750000      8.7          70

Adding New Row

We will add another row in manipur

SadarHill <- data.frame(Districts = "SadarHill", 
                        Population = 450000,
                        Literacy = 8.5,
                        ForestCover = 66,
                        stringsAsFactors = FALSE)
SadarHill

  Districts Population Literacy ForestCover
1 SadarHill     450000      8.5          66

manipur <- rbind(manipur, SadarHill)

Deleting Row

We can delete rows in a data frame like this.

manipur[1:9, ] # leave out row 10 by selecting only rows 1 to 9

   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65
8     Ukhrul     500000      6.2          85
9  Bishenpur     750000      8.7          70

manipur[-(8:10), ] # with - sign

   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65

manipur[-c(2,5,8:10), ] # row selection

  Districts Population Literacy ForestCover
1   ImpWest     700000      9.5          40
3     Ccpur     400000      7.2          67
4   Thoubal     300000      8.6          60
6  Senapati     450000      8.5          68
7   Chandel     400000      6.8          65
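Note that the expressions above only print a reduced copy. The data frame itself keeps all its rows unless the result is assigned. A small sketch with a hypothetical data frame towns:

```r
towns <- data.frame(Town = c("A", "B", "C", "D"),       # made-up data
                    Households = c(120, 80, 200, 150))

towns[-2, ]                     # prints the data frame without row 2
nrow(towns)                     # 4: towns itself is unchanged

smaller <- towns[-c(2, 4), ]    # assign the result to keep it
nrow(smaller)                   # 2
```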

Adding New Column

Adding new column to an existing data frame

manipur$AnnualRain <- c(3.2, 3.5, 4.0, 3.8, 4.8, 4.2, 3.8,
                        4.2, 3.7, 3.8)
str(manipur)

'data.frame':   10 obs. of  5 variables:
 $ Districts  : chr  "ImpWest" "ImpEast" "Ccpur" "Thoubal" ...
 $ Population : num  700000 500000 400000 300000 200000 450000 400000 500000 750000 450000
 $ Literacy   : num  9.5 9.4 7.2 8.6 5.3 8.5 6.8 6.2 8.7 8.5
 $ ForestCover: num  40 45 67 60 90 68 65 85 70 66
 $ AnnualRain : num  3.2 3.5 4 3.8 4.8 4.2 3.8 4.2 3.7 3.8

Delete Column

Or, delete a column like this.

manipur$Literacy <- NULL
manipur

    Districts Population ForestCover AnnualRain
1     ImpWest     700000          40        3.2
2     ImpEast     500000          45        3.5
3       Ccpur     400000          67        4.0
4     Thoubal     300000          60        3.8
5  Tamenglong     200000          90        4.8
6    Senapati     450000          68        4.2
7     Chandel     400000          65        3.8
8      Ukhrul     500000          85        4.2
9   Bishenpur     750000          70        3.7
10  SadarHill     450000          66        3.8

Subsetting in Data Frames

Matrix Style Subsetting

manipur[1, ] # select first row and all the columns

  Districts Population ForestCover AnnualRain
1   ImpWest      7e+05          40        3.2

manipur[, 1] # select first column and all the rows

 [1] "ImpWest"    "ImpEast"    "Ccpur"      "Thoubal"    "Tamenglong"
 [6] "Senapati"   "Chandel"    "Ukhrul"     "Bishenpur"  "SadarHill"

manipur[, "Districts"]

 [1] "ImpWest"    "ImpEast"    "Ccpur"      "Thoubal"    "Tamenglong"
 [6] "Senapati"   "Chandel"    "Ukhrul"     "Bishenpur"  "SadarHill"

manipur[, c("Districts", "AnnualRain")]

    Districts AnnualRain
1     ImpWest        3.2
2     ImpEast        3.5
3       Ccpur        4.0
4     Thoubal        3.8
5  Tamenglong        4.8
6    Senapati        4.2
7     Chandel        3.8
8      Ukhrul        4.2
9   Bishenpur        3.7
10  SadarHill        3.8

manipur[2:6, ] # selecting rows 2 to 6 and all columns

   Districts Population ForestCover AnnualRain
2    ImpEast     500000          45        3.5
3      Ccpur     400000          67        4.0
4    Thoubal     300000          60        3.8
5 Tamenglong     200000          90        4.8
6   Senapati     450000          68        4.2

manipur[10:7, ] # selected rows 10 to 7 and all columns

   Districts Population ForestCover AnnualRain
10 SadarHill     450000          66        3.8
9  Bishenpur     750000          70        3.7
8     Ukhrul     500000          85        4.2
7    Chandel     400000          65        3.8

manipur[5:7, 1:2] # selected rows and columns

   Districts Population
5 Tamenglong     200000
6   Senapati     450000
7    Chandel     400000

manipur[c(4, 8, 10), c(1, 2, 4)] # selected rows and columns

   Districts Population AnnualRain
4    Thoubal     300000        3.8
8     Ukhrul     500000        4.2
10 SadarHill     450000        3.8

Or, we can use R's built-in functions to inspect the observations and other information about the data frame

head(manipur)

   Districts Population ForestCover AnnualRain
1    ImpWest     700000          40        3.2
2    ImpEast     500000          45        3.5
3      Ccpur     400000          67        4.0
4    Thoubal     300000          60        3.8
5 Tamenglong     200000          90        4.8
6   Senapati     450000          68        4.2

tail(manipur)

    Districts Population ForestCover AnnualRain
5  Tamenglong     200000          90        4.8
6    Senapati     450000          68        4.2
7     Chandel     400000          65        3.8
8      Ukhrul     500000          85        4.2
9   Bishenpur     750000          70        3.7
10  SadarHill     450000          66        3.8

nrow(manipur) # number of rows

[1] 10

ncol(manipur) # number of columns

[1] 4

dim(manipur) # dimension of the data frame

[1] 10  4

summary(manipur[2:4]) # Summary stats of data frame

   Population      ForestCover      AnnualRain   
 Min.   :200000   Min.   :40.00   Min.   :3.200  
 1st Qu.:400000   1st Qu.:61.25   1st Qu.:3.725  
 Median :450000   Median :66.50   Median :3.800  
 Mean   :465000   Mean   :65.60   Mean   :3.900  
 3rd Qu.:500000   3rd Qu.:69.50   3rd Qu.:4.150  
 Max.   :750000   Max.   :90.00   Max.   :4.800

Filtering on Data Frames

Selecting districts which have Population more than 500000

manipur[manipur[, 2] > 500000, ]

  Districts Population ForestCover AnnualRain
1   ImpWest     700000          40        3.2
9 Bishenpur     750000          70        3.7

Selecting districts which have a population equal to or more than 500000

manipur[manipur[, 2] >= 500000, ]

  Districts Population ForestCover AnnualRain
1   ImpWest     700000          40        3.2
2   ImpEast     500000          45        3.5
8    Ukhrul     500000          85        4.2
9 Bishenpur     750000          70        3.7

Selecting districts with less than 500000 population and forest cover more than 70

manipur[manipur[, 2] < 500000 & manipur[, 3] > 70, ]

   Districts Population ForestCover AnnualRain
5 Tamenglong      2e+05          90        4.8

Now, we want only district names with population more than or equal to 500000 and forest cover more than or equal to 70

manipur[manipur[, 2] >= 500000 & manipur[, 3] >= 70, 1]

[1] "Ukhrul"    "Bishenpur"

Or, use subset() function

subset(manipur, ForestCover >= 75)

   Districts Population ForestCover AnnualRain
5 Tamenglong      2e+05          90        4.8
8     Ukhrul      5e+05          85        4.2

subset(manipur, Population > 400000 & AnnualRain > 4.0)

  Districts Population ForestCover AnnualRain
6  Senapati     450000          68        4.2
8    Ukhrul     500000          85        4.2

subset(manipur, AnnualRain >= 3.5)$AnnualRain

[1] 3.5 4.0 3.8 4.8 4.2 3.8 4.2 3.7 3.8

subset(manipur, AnnualRain >= 4.0)[, -(2:3)]

   Districts AnnualRain
3      Ccpur        4.0
5 Tamenglong        4.8
6   Senapati        4.2
8     Ukhrul        4.2

Names in Data Frame

Just as colnames() and rownames() are used to set or change column and row names in matrices, data frames use names() to set or change column names and row.names() to set or change row names.

names(manipur)

[1] "Districts"   "Population"  "ForestCover" "AnnualRain"

names(manipur) <- c("Dist", "Pop", "ForCov", "AnRain")
names(manipur)

[1] "Dist"   "Pop"    "ForCov" "AnRain"

row.names(manipur)

 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

row.names(manipur) <- LETTERS[1:10]
row.names(manipur)

 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"

head(manipur,3)

     Dist   Pop ForCov AnRain
A ImpWest 7e+05     40    3.2
B ImpEast 5e+05     45    3.5
C   Ccpur 4e+05     67    4.0

R Inbuilt Data

R ships with many built-in datasets for its users to practice on. Calling data() lists them, and typing a dataset's name prints it.

str(USArrests)

'data.frame':   50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

To know more info about this data, we can do ?USArrests
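As a sketch of how the list of built-in datasets can be inspected from code: the example assumes that the object returned by data() has a results component with an Item column, as it does in current R versions. The name available is illustrative.

```r
# data() lists available datasets; here restricted to the datasets package
available <- data(package = "datasets")$results[, "Item"]

length(available)               # how many datasets are listed
"USArrests" %in% available      # TRUE

dim(USArrests)                  # 50 rows and 4 columns
```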

Explore the Data

head(USArrests) # first 6 rows/observations

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

class(USArrests)

[1] "data.frame"

We can use the View() function to see the complete dataset.

Do summary statistics

# summary(USArrests)

max(USArrests$Murder) # maximum value in Murder

[1] 17.4

USArrests[USArrests$Murder == 17.4, ] # which state?

        Murder Assault UrbanPop Rape
Georgia   17.4     211       60 25.8

In the same way, we can find the state with the lowest value. which.min() returns its row number directly

which.min(USArrests$Murder) # gives the row number

[1] 34

USArrests[34, ] # see the row

             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
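which.max() works the same way for the largest value, which avoids typing 17.4 by hand. Combined with rownames(), it gives the state name directly. The names top_row and low_row are illustrative.

```r
top_row <- which.max(USArrests$Murder)    # row number of the largest value
rownames(USArrests)[top_row]              # "Georgia"

low_row <- which.min(USArrests$Murder)    # row number of the smallest value
rownames(USArrests)[low_row]              # "North Dakota"
```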

But I want to see the 10 states with the highest murder rates and the 10 states with the lowest.

Top 10 states with the highest murder rate

head(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10) # use order function. do ?order

               Murder Assault UrbanPop Rape
Georgia          17.4     211       60 25.8
Mississippi      16.1     259       44 17.1
Florida          15.4     335       80 31.9
Louisiana        15.4     249       66 22.2
South Carolina   14.4     279       48 22.5
Alabama          13.2     236       58 21.2
Tennessee        13.2     188       59 26.9
North Carolina   13.0     337       45 16.1
Texas            12.7     201       80 25.5
Nevada           12.2     252       81 46.0
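order() returns the positions that would sort a vector, not the sorted values. That is why it can be used inside the square brackets to rearrange rows. A small sketch on an illustrative vector marks:

```r
marks <- c(45, 90, 20)

order(marks)                       # 3 1 2: positions of the values, smallest first
marks[order(marks)]                # 20 45 90, the same as sort(marks)
order(marks, decreasing = TRUE)    # 2 1 3: positions, largest first
```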

Top 10 states with the lowest murder rate

tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)

              Murder Assault UrbanPop Rape
Connecticut      3.3     110       77 11.1
Utah             3.2     120       80 22.9
Minnesota        2.7      72       66 14.9
Idaho            2.6     120       54 14.2
Wisconsin        2.6      53       66 10.8
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
North Dakota     0.8      45       44  7.3

least_murder <- tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)
least_murder[order(least_murder$Murder), ]

              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Idaho            2.6     120       54 14.2
Wisconsin        2.6      53       66 10.8
Minnesota        2.7      72       66 14.9
Utah             3.2     120       80 22.9
Connecticut      3.3     110       77 11.1
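Because order() sorts in increasing order by default, the two steps above can likely be shortened to one. A sketch, with lowest10 as an illustrative name:

```r
# Sort by Murder from low to high, then keep the first 10 rows
lowest10 <- head(USArrests[order(USArrests$Murder), ], 10)
lowest10
```

The same pattern should work for Assault and Rape by changing the column name.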

Similarly, we can do this for Assault and Rape as well.

Top 10 states with highest number of assaults

head(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)

               Murder Assault UrbanPop Rape
North Carolina   13.0     337       45 16.1
Florida          15.4     335       80 31.9
Maryland         11.3     300       67 27.8
Arizona           8.1     294       80 31.0
New Mexico       11.4     285       70 32.1
South Carolina   14.4     279       48 22.5
California        9.0     276       91 40.6
Alaska           10.0     263       48 44.5
Mississippi      16.1     259       44 17.1
Michigan         12.1     255       74 35.1

Top 10 states with lowest number of assaults

tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)

              Murder Assault UrbanPop Rape
South Dakota     3.8      86       45 12.8
Maine            2.1      83       51  7.8
West Virginia    5.7      81       39  9.3
Minnesota        2.7      72       66 14.9
New Hampshire    2.1      57       56  9.5
Iowa             2.2      56       57 11.3
Wisconsin        2.6      53       66 10.8
Vermont          2.2      48       32 11.2
Hawaii           5.3      46       83 20.2
North Dakota     0.8      45       44  7.3

least_assault <- tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
least_assault[order(least_assault$Assault), ]

              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Hawaii           5.3      46       83 20.2
Vermont          2.2      48       32 11.2
Wisconsin        2.6      53       66 10.8
Iowa             2.2      56       57 11.3
New Hampshire    2.1      57       56  9.5
Minnesota        2.7      72       66 14.9
West Virginia    5.7      81       39  9.3
Maine            2.1      83       51  7.8
South Dakota     3.8      86       45 12.8

Top 10 states with highest number of rapes

head(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)

           Murder Assault UrbanPop Rape
Nevada       12.2     252       81 46.0
Alaska       10.0     263       48 44.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
Michigan     12.1     255       74 35.1
New Mexico   11.4     285       70 32.1
Florida      15.4     335       80 31.9
Arizona       8.1     294       80 31.0
Oregon        4.9     159       67 29.3
Missouri      9.0     178       70 28.2

Top 10 states with lowest number of rapes

tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)

              Murder Assault UrbanPop Rape
South Dakota     3.8      86       45 12.8
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Connecticut      3.3     110       77 11.1
Wisconsin        2.6      53       66 10.8
New Hampshire    2.1      57       56  9.5
West Virginia    5.7      81       39  9.3
Rhode Island     3.4     174       87  8.3
Maine            2.1      83       51  7.8
North Dakota     0.8      45       44  7.3

least_rape <- tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
least_rape[order(least_rape$Rape), ]

              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Maine            2.1      83       51  7.8
Rhode Island     3.4     174       87  8.3
West Virginia    5.7      81       39  9.3
New Hampshire    2.1      57       56  9.5
Wisconsin        2.6      53       66 10.8
Connecticut      3.3     110       77 11.1
Vermont          2.2      48       32 11.2
Iowa             2.2      56       57 11.3
South Dakota     3.8      86       45 12.8
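Sorting in ascending order and taking head() reaches the same ten states in one step. A minimal sketch, assuming the built-in USArrests data; the name least_rape_direct is made up for illustration:

```r
# Sort ascending by Rape and keep the first 10 rows.
# This avoids sorting in descending order, taking tail(), and sorting again.
least_rape_direct <- head(USArrests[order(USArrests$Rape), ], 10)
least_rape_direct
```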

So, if I had to live in the USA, which state would you suggest? Here are the three lowest states for each crime.

tail(least_murder, 3)

              Murder Assault UrbanPop Rape
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
North Dakota     0.8      45       44  7.3

tail(least_assault, 3)

             Murder Assault UrbanPop Rape
Vermont         2.2      48       32 11.2
Hawaii          5.3      46       83 20.2
North Dakota    0.8      45       44  7.3

tail(least_rape, 3)

             Murder Assault UrbanPop Rape
Rhode Island    3.4     174       87  8.3
Maine           2.1      83       51  7.8
North Dakota    0.8      45       44  7.3
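One way to combine the three lists is to look for states that appear among the lowest for every crime. This is only a sketch: it assumes "safest" means being in the bottom 10 for murder, assault and rape at once, and the helper lowest_n is a hypothetical name.

```r
# Hypothetical helper: names of the n states with the lowest value in a column
lowest_n <- function(column, n = 10) {
  rownames(USArrests)[order(USArrests[[column]])][seq_len(n)]
}

# States that are in the bottom 10 for murder, assault and rape at the same time
safest <- Reduce(intersect, list(lowest_n("Murder"),
                                 lowest_n("Assault"),
                                 lowest_n("Rape")))
safest
```

Changing n widens or narrows the shortlist.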

How about categorising the states as high or low for murder, assault and rape?

Doing so gives further insight into the data. We flag a state as high (1) when its rate is above the mean across all 50 states, and low (0) otherwise.

UScrime <- USArrests # assign a new object name
head(UScrime) # first 6 rows

           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7

mean(UScrime$Murder)

[1] 7.788

UScrime$HighMurder <- as.numeric(UScrime$Murder > mean(UScrime$Murder))

str(UScrime)

'data.frame':   50 obs. of  5 variables:
 $ Murder    : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault   : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop  : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape      : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder: num  1 1 1 1 1 1 0 0 1 1 ...

table(UScrime$HighMurder)


 0  1 
27 23

mean(UScrime$Assault)

[1] 170.76

UScrime$HighAssault <- as.numeric(UScrime$Assault > mean(UScrime$Assault))
str(UScrime)

'data.frame':   50 obs. of  6 variables:
 $ Murder     : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault    : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop   : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape       : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder : num  1 1 1 1 1 1 0 0 1 1 ...
 $ HighAssault: num  1 1 1 1 1 1 0 1 1 1 ...

table(UScrime$HighAssault)


 0  1 
27 23

table(UScrime$HighMurder, UScrime$HighAssault)


     0  1
  0 25  2
  1  2 21

This table means (rows are HighMurder, columns are HighAssault; 0 = low, 1 = high) -

First row: 25 states have low murder and low assault, and 2 states have low murder but high assault.

Second row: 2 states have high murder but low assault, and 21 states have high murder and high assault.

mean(UScrime$Rape)

[1] 21.232

UScrime$HighRape <- as.numeric(UScrime$Rape > mean(UScrime$Rape))

str(UScrime)

'data.frame':   50 obs. of  7 variables:
 $ Murder     : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault    : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop   : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape       : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder : num  1 1 1 1 1 1 0 0 1 1 ...
 $ HighAssault: num  1 1 1 1 1 1 0 1 1 1 ...
 $ HighRape   : num  0 1 1 0 1 1 0 0 1 1 ...

table(UScrime$HighRape)


 0  1 
29 21

table(UScrime$HighMurder, UScrime$HighRape)


     0  1
  0 23  4
  1  6 17

table(UScrime$HighAssault, UScrime$HighRape)


     0  1
  0 23  4
  1  6 17
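The three flags can also be built in one pass with readable labels instead of 0 and 1. A sketch, assuming the same "above the mean" rule used above; flag_high and flags are illustrative names, not part of the original code:

```r
# Hypothetical helper: label a value "High" if it is above the column mean, else "Low"
flag_high <- function(x) {
  factor(ifelse(x > mean(x), "High", "Low"), levels = c("Low", "High"))
}

# lapply() runs the helper on each crime column and returns a list of factors
flags <- as.data.frame(lapply(USArrests[, c("Murder", "Assault", "Rape")], flag_high))

# Cross-tabulate murder against assault with labelled rows and columns
murder_vs_assault <- table(Murder = flags$Murder, Assault = flags$Assault)
murder_vs_assault
```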

Merging Data

We can use the merge() function to join two data frames on their common columns. It accepts only two data frames at a time.

name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
country <- I(c("Portugal", "Argentina", "England", "Germany", "Sweden"))
players <- data.frame(name, country)
players

     name   country
1 Ronaldo  Portugal
2   Messi Argentina
3  Rooney   England
4   Klose   Germany
5  Zlatan    Sweden

name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
age <- c(28, 27, 29, 36, 24)
cap <- c("Yes", "No", "Yes", "No", "Yes")
players2 <- data.frame(name, age, cap)
players2

     name age cap
1 Ronaldo  28 Yes
2   Messi  27  No
3  Rooney  29 Yes
4   Klose  36  No
5  Zlatan  24 Yes

players3 <- merge(players, players2)
players3

     name   country age cap
1   Klose   Germany  36  No
2   Messi Argentina  27  No
3 Ronaldo  Portugal  28 Yes
4  Rooney   England  29 Yes
5  Zlatan    Sweden  24 Yes

name <- c("Messi", "Rooney", "Klose", "Ronaldo", "Drogba")
club <- c("Bayern Munich", "Barcelona", "Real Madrid", "ManU", "Chelsea")
players4 <- data.frame(name, club)

players4

     name          club
1   Messi Bayern Munich
2  Rooney     Barcelona
3   Klose   Real Madrid
4 Ronaldo          ManU
5  Drogba       Chelsea

merge(players3, players4)

     name   country age cap          club
1   Klose   Germany  36  No   Real Madrid
2   Messi Argentina  27  No Bayern Munich
3 Ronaldo  Portugal  28 Yes          ManU
4  Rooney   England  29 Yes     Barcelona

merge(players3, players4, all = TRUE)

     name   country age  cap          club
1  Drogba      <NA>  NA <NA>       Chelsea
2   Klose   Germany  36   No   Real Madrid
3   Messi Argentina  27   No Bayern Munich
4 Ronaldo  Portugal  28  Yes          ManU
5  Rooney   England  29  Yes     Barcelona
6  Zlatan    Sweden  24  Yes          <NA>

merge(players3, players4, all.x = TRUE)

     name   country age cap          club
1   Klose   Germany  36  No   Real Madrid
2   Messi Argentina  27  No Bayern Munich
3 Ronaldo  Portugal  28 Yes          ManU
4  Rooney   England  29 Yes     Barcelona
5  Zlatan    Sweden  24 Yes          <NA>

merge(players3, players4, all.y = TRUE)

     name   country age  cap          club
1  Drogba      <NA>  NA <NA>       Chelsea
2   Klose   Germany  36   No   Real Madrid
3   Messi Argentina  27   No Bayern Munich
4 Ronaldo  Portugal  28  Yes          ManU
5  Rooney   England  29  Yes     Barcelona
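Because merge() takes two data frames at a time, a chain of merges can be folded with Reduce(). A sketch with small made-up data frames (df_a, df_b and df_c are illustrative names):

```r
df_a <- data.frame(name = c("Messi", "Klose"), age = c(27, 36),
                   stringsAsFactors = FALSE)
df_b <- data.frame(name = c("Messi", "Klose"), country = c("Argentina", "Germany"),
                   stringsAsFactors = FALSE)
df_c <- data.frame(name = c("Klose", "Drogba"), club = c("Real Madrid", "Chelsea"),
                   stringsAsFactors = FALSE)

# Reduce() merges the first two data frames, then merges that result with the third.
# all = TRUE keeps rows that have no match, filling the gaps with NA.
merged_all <- Reduce(function(x, y) merge(x, y, by = "name", all = TRUE),
                     list(df_a, df_b, df_c))
merged_all
```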

Suppose the key columns have different names in the two data frames but hold the same variable.

Players <- c("Drogba", "Klose", "Messi", "Ronaldo")
Fees <- c(102, 225, 400, 430)
salary <- data.frame(Players, Fees)

merge(players3, salary, by.x = "name", by.y = "Players")

     name   country age cap Fees
1   Klose   Germany  36  No  225
2   Messi Argentina  27  No  400
3 Ronaldo  Portugal  28 Yes  430

merge(players3, salary, by.x = "name", by.y = "Players", all = TRUE)

     name   country age  cap Fees
1  Drogba      <NA>  NA <NA>  102
2   Klose   Germany  36   No  225
3   Messi Argentina  27   No  400
4 Ronaldo  Portugal  28  Yes  430
5  Rooney   England  29  Yes   NA
6  Zlatan    Sweden  24  Yes   NA
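Before merging, it can help to check which keys will fail to match, since those are the rows that get dropped (or filled with NA when all = TRUE). A small sketch using setdiff(), with the key vectors typed out by hand for illustration:

```r
# Keys on each side of the merge (copied here so the example is self-contained)
left_keys  <- c("Klose", "Messi", "Ronaldo", "Rooney", "Zlatan")
right_keys <- c("Drogba", "Klose", "Messi", "Ronaldo")

# Keys present on one side only
only_left  <- setdiff(left_keys, right_keys)
only_right <- setdiff(right_keys, left_keys)

only_left
only_right
```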

The Apply Family Functions of R

These functions apply a function repeatedly to slices of data from matrices, lists and data frames, which avoids writing explicit loops.

There are 4 commonly used apply functions in R -

apply()
lapply()
sapply()
tapply()

Note that there are other, less commonly used apply functions as well. They are -

rapply
mapply
vapply
eapply

apply()

First, let's do ?apply

Usage
apply(X, MARGIN, FUN, …)

Where
X is a matrix/array
MARGIN is 1 = rows, 2 = columns, c(1, 2) = rows and columns
FUN = the function to apply (sum, mean, etc)
… = optional arguments to FUN

tv <- matrix(c(3, 5, 6, 2, 3, 5, 4, 3, 2, 1, 6, 5, 4, 3, 5, 4, 2, 4, 2, 2, 5, 6, 4, 5, 5, 2, 3, 2, 1, 4 ,1, 4, 3, 4, 5), nrow = 7)
colnames(tv) <- c("Oken", "Khagem", "Inao", "Thoi", "Romeo")
rownames(tv) <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
tv

          Oken Khagem Inao Thoi Romeo
Sunday       3      3    5    6     1
Monday       5      2    4    4     4
Tuesday      6      1    2    5     1
Wednesday    2      6    4    5     4
Thursday     3      5    2    2     3
Friday       5      4    2    3     4
Saturday     4      3    5    2     5

class(tv)

[1] "matrix"

max(tv[1, ])

[1] 6

max(tv[2, ])

[1] 5

max(tv[3, ])

[1] 6

We can also use a for loop to get the desired result.

for (i in 1:7) {
  weekday <- tv[i, ]
  day_max <- max(weekday) # named day_max so the variable does not shadow max()
  print(day_max)
}

[1] 6
[1] 5
[1] 6
[1] 6
[1] 5
[1] 5
[1] 5

Instead of writing so much, we can simply let the apply() function do the looping for us.

apply(tv, 1, max) # finding maximum value in each row

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
        6         5         6         6         5         5         5

apply(tv, 2, max) # finding maximum value in each column

  Oken Khagem   Inao   Thoi  Romeo 
     6      6      5      6      5

Using apply() on a data frame

tv_df <- as.data.frame(tv)
class(tv_df)

[1] "data.frame"

str(tv_df)

'data.frame':   7 obs. of  5 variables:
 $ Oken  : num  3 5 6 2 3 5 4
 $ Khagem: num  3 2 1 6 5 4 3
 $ Inao  : num  5 4 2 4 2 2 5
 $ Thoi  : num  6 4 5 5 2 3 2
 $ Romeo : num  1 4 1 4 3 4 5

apply(tv_df, 1, mean)

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8

apply(tv_df, 2, mean)

    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857

How about adding a new variable which is not numeric?

tv_df$Place <- c("Club", "Home", "School", "Home", "School", "Home", "Club")
str(tv_df)

'data.frame':   7 obs. of  6 variables:
 $ Oken  : num  3 5 6 2 3 5 4
 $ Khagem: num  3 2 1 6 5 4 3
 $ Inao  : num  5 4 2 4 2 2 5
 $ Thoi  : num  6 4 5 5 2 3 2
 $ Romeo : num  1 4 1 4 3 4 5
 $ Place : chr  "Club" "Home" "School" "Home" ...

apply(tv_df, 1, mean)

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
       NA        NA        NA        NA        NA        NA        NA

apply(tv_df[, 1:5], 1, mean)

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8

apply(tv_df[, 1:5], 2, mean)

    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857
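The NA results appear because apply() first converts the data frame to a matrix, and a matrix can hold only one type: with a character column present, every value becomes character. Rather than hard-coding 1:5, the numeric columns can be selected programmatically. A sketch on a small made-up data frame (df is illustrative):

```r
df <- data.frame(a = c(1, 2), b = c(3, 4), place = c("Home", "Club"),
                 stringsAsFactors = FALSE)

# One character column turns the whole matrix into character
is.character(as.matrix(df))

# Logical vector marking the numeric columns
num_cols <- sapply(df, is.numeric)

# Row means over the numeric columns only
row_means <- apply(df[, num_cols], 1, mean)
row_means
```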

Besides apply(), we can also compute means with rowMeans() and colMeans()

rowMeans(tv_df[, 1:5]) # mean of rows

   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8

colMeans(tv_df[, 1:5]) # mean of columns

    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857

Applying a custom function

ave <- function(x){
  x/mean(x)
}
apply(tv, 2, ave)

          Oken    Khagem      Inao      Thoi     Romeo
Sunday    0.75 0.8750000 1.4583333 1.5555556 0.3181818
Monday    1.25 0.5833333 1.1666667 1.0370370 1.2727273
Tuesday   1.50 0.2916667 0.5833333 1.2962963 0.3181818
Wednesday 0.50 1.7500000 1.1666667 1.2962963 1.2727273
Thursday  0.75 1.4583333 0.5833333 0.5185185 0.9545455
Friday    1.25 1.1666667 0.5833333 0.7777778 1.2727273
Saturday  1.00 0.8750000 1.4583333 0.5185185 1.5909091

apply(tv, 1, ave)

          Sunday    Monday   Tuesday Wednesday  Thursday    Friday
Oken   0.8333333 1.3157895 2.0000000 0.4761905 1.0000000 1.3888889
Khagem 0.8333333 0.5263158 0.3333333 1.4285714 1.6666667 1.1111111
Inao   1.3888889 1.0526316 0.6666667 0.9523810 0.6666667 0.5555556
Thoi   1.6666667 1.0526316 1.6666667 1.1904762 0.6666667 0.8333333
Romeo  0.2777778 1.0526316 0.3333333 0.9523810 1.0000000 1.1111111
        Saturday
Oken   1.0526316
Khagem 0.7894737
Inao   1.3157895
Thoi   0.5263158
Romeo  1.3157895
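Notice that the row-wise result comes back transposed: when FUN returns a vector, apply() stores each result as a column. Wrapping the call in t() restores the original layout. A sketch on a small made-up matrix; the function is named rel_to_mean here (a hypothetical name) because base R already has a function called ave():

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# Each value divided by the mean of its slice
rel_to_mean <- function(x) x / mean(x)

by_col     <- apply(m, 2, rel_to_mean)  # same shape as m: 2 x 3
by_row_raw <- apply(m, 1, rel_to_mean)  # transposed: 3 x 2
by_row     <- t(by_row_raw)             # back to 2 x 3

dim(by_row_raw)
dim(by_row)
```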

lapply()

?lapply

Usage
lapply(X, FUN, …)

Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN

One of the big differences between apply() and lapply() is that lapply() always returns a list.

myWorkout <- list(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                  Biceps = c(20, 22, 20, 24, 25, 22, 24),
                  Squats = c(30, 33, 29, 30, 32, 33, 28))

myWorkout

$PushUps
[1] 12 12 10 12 15 13 14

$Biceps
[1] 20 22 20 24 25 22 24

$Squats
[1] 30 33 29 30 32 33 28

lapply(myWorkout, mean)

$PushUps
[1] 12.57143

$Biceps
[1] 22.42857

$Squats
[1] 30.71429
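As with apply(), FUN can be an anonymous function, and extra arguments after FUN are passed on to it. A short sketch on a two-element copy of the workout list (workout, spread and sorted_desc are illustrative names):

```r
workout <- list(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                Biceps  = c(20, 22, 20, 24, 25, 22, 24))

# Anonymous function: difference between the best and worst day
spread <- lapply(workout, function(x) max(x) - min(x))

# Extra argument: decreasing = TRUE is handed to sort()
sorted_desc <- lapply(workout, sort, decreasing = TRUE)

spread
sorted_desc
```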

Let's use lapply() on a data frame

myWorkoutDF <- data.frame(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                  Biceps = c(20, 22, 20, 24, 25, 22, 24),
                  Squats = c(30, 33, 29, 30, 32, 33, 28))

myWorkoutDF

  PushUps Biceps Squats
1      12     20     30
2      12     22     33
3      10     20     29
4      12     24     30
5      15     25     32
6      13     22     33
7      14     24     28

lapply(myWorkoutDF, mean)

$PushUps
[1] 12.57143

$Biceps
[1] 22.42857

$Squats
[1] 30.71429

colMeans(myWorkoutDF)

 PushUps   Biceps   Squats 
12.57143 22.42857 30.71429

MyName <- c("My", "name", "is", "Loiyumba")
MyName

[1] "My"       "name"     "is"       "Loiyumba"

lapply(MyName, nchar)

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 2

[[4]]
[1] 8

If we don't want our output as a list, we can use sapply(), which simplifies the result to a vector or matrix where possible.

?sapply

Usage
sapply(X, FUN, …)

Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN

sapply(myWorkout, max)

PushUps  Biceps  Squats 
     15      25      33

sapply(myWorkoutDF, max)

PushUps  Biceps  Squats 
     15      25      33

sapply(MyName, nchar)

      My     name       is Loiyumba 
       2        4        2        8
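How sapply() simplifies depends on what FUN returns: one value per element gives a vector, several values per element give a matrix. When the output type should be fixed in advance, vapply() (one of the less common apply functions listed earlier) lets us declare it. A sketch, with the workout list copied so the example is self-contained:

```r
workout <- list(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                Biceps  = c(20, 22, 20, 24, 25, 22, 24),
                Squats  = c(30, 33, 29, 30, 32, 33, 28))

# range() returns two values per element, so sapply() builds a 2 x 3 matrix
ranges <- sapply(workout, range)

# vapply() checks that every result is a single number
maxima <- vapply(workout, max, numeric(1))

ranges
maxima
```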

We will try sapply() on mtcars, a dataset built into R.
Do ?mtcars

str(mtcars) # R inbuilt data

'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

sapply(mtcars[, c(1, 3:7)], mean)

       mpg       disp         hp       drat         wt       qsec 
 20.090625 230.721875 146.687500   3.596563   3.217250  17.848750

sapply(mtcars[, c(1, 3:7)], max)

    mpg    disp      hp    drat      wt    qsec 
 33.900 472.000 335.000   4.930   5.424  22.900

sapply(mtcars[, c(1, 3:7)], min)

   mpg   disp     hp   drat     wt   qsec 
10.400 71.100 52.000  2.760  1.513 14.500
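The three calls above can be combined by letting FUN return several statistics at once, and the columns can be picked by name instead of position. A sketch (cols and stats are illustrative names):

```r
# The same six columns as c(1, 3:7), selected by name
cols <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# FUN returns a named vector, so sapply() builds a matrix:
# one row per statistic, one column per variable
stats <- sapply(mtcars[, cols], function(x) {
  c(mean = mean(x), min = min(x), max = max(x))
})

round(stats, 2)
```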

tapply()

?tapply
Usage
tapply(X, INDEX, FUN, …)

Where
X is vector/columns of data frame/elements of a list
INDEX is a factor (or a list of factors) used to group X
FUN = sum, mean, etc
… = optional arguments to FUN

table(mtcars$cyl)


 4  6  8 
11  7 14

tapply(mtcars$mpg, mtcars$cyl, mean)

       4        6        8 
26.66364 19.74286 15.10000

tapply(mtcars$mpg, mtcars$cyl, max)

   4    6    8 
33.9 21.4 19.2

table(mtcars$am)


 0  1 
19 13

tapply(mtcars$mpg, mtcars$am, mean)

       0        1 
17.14737 24.39231

tapply(mtcars$mpg, mtcars$am, min)

   0    1 
10.4 15.0

table(mtcars$gear)


 3  4  5 
15 12  5

tapply(mtcars$hp, mtcars$gear, max)

  3   4   5 
245 123 335

tapply(mtcars$hp, mtcars$gear, mean)

       3        4        5 
176.1333  89.5000 195.6000
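INDEX can also be a list of factors, in which case tapply() returns a table with one dimension per factor. A sketch grouping mpg by cylinders and transmission together (the name mpg_by_cyl_am is illustrative):

```r
# Mean mpg for every combination of cyl (rows) and am (columns)
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)
mpg_by_cyl_am
```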

Missing Values

Missing values are indicated by NA in R.
is.na() is the function to check missing values.

x <- c(10, 20, NA, 40, 50)
is.na(x)

[1] FALSE FALSE  TRUE FALSE FALSE

!is.na(x) # negate

[1]  TRUE  TRUE FALSE  TRUE  TRUE

sum(x)

[1] NA

If there is an NA, the result of the computation is NA as well. To leave missing values out of the computation, we set the na.rm = TRUE argument in the function.

sum(x, na.rm = TRUE)

[1] 120

mean(x)

[1] NA

mean(x, na.rm = TRUE)

[1] 30
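Beyond na.rm, it is often useful to locate the missing values, drop them, or fill them in. A minimal sketch; replacing NA with the mean is shown only as an illustration, not as a recommendation for real analyses:

```r
x <- c(10, 20, NA, 40, 50)

na_pos   <- which(is.na(x))   # positions of the missing values
na_count <- sum(is.na(x))     # how many values are missing
x_clean  <- x[!is.na(x)]      # vector without the missing values

# Illustration only: fill each NA with the mean of the observed values
x_filled <- replace(x, is.na(x), mean(x, na.rm = TRUE))

na_pos
x_clean
x_filled
```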

patientID <- 1:10
patientName <- c("Keiku", "Bala" ,"Sadananda", "Gokul", "Bonny", "Soma", "Maya", "Abenao", "Artina", "Olen" )
patientGender <- c("Male", "Female", "Male", "Male", "Male",
                   "Female", "Female", "Female", "Female", "Male")
patientAge <- c(36, 26, 40, 35, 37, 23, 37, 32, 28, 42)
patient <- data.frame(patientID, patientName, patientGender, patientAge)

patient

   patientID patientName patientGender patientAge
1          1       Keiku          Male         36
2          2        Bala        Female         26
3          3   Sadananda          Male         40
4          4       Gokul          Male         35
5          5       Bonny          Male         37
6          6        Soma        Female         23
7          7        Maya        Female         37
8          8      Abenao        Female         32
9          9      Artina        Female         28
10        10        Olen          Male         42

patient[3, 4] <- NA
patient

   patientID patientName patientGender patientAge
1          1       Keiku          Male         36
2          2        Bala        Female         26
3          3   Sadananda          Male         NA
4          4       Gokul          Male         35
5          5       Bonny          Male         37
6          6        Soma        Female         23
7          7        Maya        Female         37
8          8      Abenao        Female         32
9          9      Artina        Female         28
10        10        Olen          Male         42

missing <- function(x){
  sum(is.na(x))
}
sapply(patient, missing)

    patientID   patientName patientGender    patientAge 
            0             0             0             1

tapply(patient$patientAge, patient$patientGender, mean)

Female   Male 
  29.2     NA

tapply(patient$patientAge, patient$patientGender, mean, na.rm = TRUE)

Female   Male 
  29.2   37.5

patient$patientTreatment <- c("A", "A", "D", "C", "A", "B", "C", "D", "C", "B")
str(patient)

'data.frame':   10 obs. of  5 variables:
 $ patientID       : int  1 2 3 4 5 6 7 8 9 10
 $ patientName     : Factor w/ 10 levels "Abenao","Artina",..: 6 3 9 5 4 10 7 1 2 8
 $ patientGender   : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 1 1 1 2
 $ patientAge      : num  36 26 NA 35 37 23 37 32 28 42
 $ patientTreatment: chr  "A" "A" "D" "C" ...

tapply(patient$patientAge, patient$patientTreatment, max)

 A  B  C  D 
37 42 37 NA

tapply(patient$patientAge, patient$patientTreatment, max, na.rm = TRUE)

 A  B  C  D 
37 42 37 32

tapply(patient$patientAge, patient$patientTreatment, min, na.rm = TRUE)

 A  B  C  D 
26 23 28 32

Or, we can completely remove the observations with missing values from the data frame.

head(na.omit(patient))

  patientID patientName patientGender patientAge patientTreatment
1         1       Keiku          Male         36                A
2         2        Bala        Female         26                A
4         4       Gokul          Male         35                C
5         5       Bonny          Male         37                A
6         6        Soma        Female         23                B
7         7        Maya        Female         37                C

complete <- complete.cases(patient) # TRUE for rows without any NA
patient[complete, ]

   patientID patientName patientGender patientAge patientTreatment
1          1       Keiku          Male         36                A
2          2        Bala        Female         26                A
4          4       Gokul          Male         35                C
5          5       Bonny          Male         37                A
6          6        Soma        Female         23                B
7          7        Maya        Female         37                C
8          8      Abenao        Female         32                D
9          9      Artina        Female         28                C
10        10        Olen          Male         42                B
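Both approaches keep the same rows. A small sketch on a made-up data frame (df is illustrative) that also counts the missing values per column with colSums():

```r
df <- data.frame(id = 1:4, age = c(36, NA, 40, 35))

# Missing values per column, without writing a helper function
na_per_col <- colSums(is.na(df))

# TRUE for rows that contain no NA at all
keep <- complete.cases(df)

# na.omit() and complete.cases() keep the same rows
rows_omit <- rownames(na.omit(df))
rows_keep <- rownames(df[keep, ])

na_per_col
rows_keep
```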

Dates and Times in R

Dates are represented by the Date class and are stored internally as the number of days since 1970-01-01.

date <- "01-05-1990"
class(date)

[1] "character"

date <- as.Date(date, format = "%d-%m-%Y")
class(date)

[1] "Date"

date

[1] "1990-05-01"

unclass(date)

[1] 7425

oldDate <- as.Date("1970-01-10")
unclass(oldDate)

[1] 9
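Because a Date is a day count underneath, arithmetic and sequences work directly on it, and format() turns it back into text in any layout. A minimal sketch (d and next_months are illustrative names):

```r
d <- as.Date("01-05-1990", format = "%d-%m-%Y")

as.numeric(d)            # days since 1970-01-01
format(d, "%d/%m/%Y")    # back to text, in a different layout
d + 30                   # 30 days later

# First day of three consecutive months
next_months <- seq(d, by = "month", length.out = 3)
next_months
```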

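Times are represented by the POSIXct or the POSIXlt class. POSIXct stores the number of seconds since 1970-01-01 UTC, while POSIXlt stores the components (seconds, minutes, hours, day and so on) as a list.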

now <- Sys.time()
now

[1] "2016-07-05 13:03:55 IST"

class(now)

[1] "POSIXct" "POSIXt"

unclass(now)

[1] 1467704035

guess <- 220000000
class(guess) <- c("POSIXct", "POSIXt")
guess

[1] "1976-12-21 12:36:40 IST"

guess <- as.POSIXlt(guess)
names(unclass(guess))

 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"  
 [8] "yday"   "isdst"  "zone"   "gmtoff"

guess$wday

[1] 2

now - guess

Time difference of 14441.02 days
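The printed time depends on the time zone of the machine, which is why the output above shows IST. Passing tz explicitly makes the result reproducible. A sketch that uses as.POSIXct() with an origin instead of setting the class by hand (t0 and lt are illustrative names):

```r
# 220000000 seconds after the epoch, interpreted in UTC
t0 <- as.POSIXct(220000000, origin = "1970-01-01", tz = "UTC")

format(t0, "%Y-%m-%d %H:%M:%S", tz = "UTC")

# The same instant shown in Indian time, as in the output above
format(t0, "%Y-%m-%d %H:%M:%S", tz = "Asia/Kolkata")

# POSIXlt components: year counts from 1900, mon from 0, wday from 0 = Sunday
lt <- as.POSIXlt(t0)
c(year = lt$year + 1900, month = lt$mon + 1, wday = lt$wday)
```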

Dates come in different styles. To parse them, we use the strptime() function. Do ?strptime

dates <- c("December 25, 2014 11:45", "January 25, 2015 23:30")
dates

[1] "December 25, 2014 11:45" "January 25, 2015 23:30"

class(dates)

[1] "character"

new_dates <- strptime(dates, format = "%B %d, %Y %H:%M")
class(new_dates)

[1] "POSIXlt" "POSIXt"

first_date <- as.Date("1996-09-28")
second_date <- as.Date("1996-10-15")
second_date - first_date

Time difference of 17 days
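Subtraction picks the units automatically; difftime() lets us choose them, and as.numeric() extracts the plain number. A sketch that parses numeric date strings, since month names such as %B depend on the locale (t_start, t_end and the gap variables are illustrative names):

```r
t_start <- strptime("25-12-2014 11:45", format = "%d-%m-%Y %H:%M", tz = "UTC")
t_end   <- strptime("25-01-2015 23:30", format = "%d-%m-%Y %H:%M", tz = "UTC")

# Ask for the difference in specific units
gap_hours <- as.numeric(difftime(t_end, t_start, units = "hours"))
gap_days  <- as.numeric(difftime(t_end, t_start, units = "days"))

gap_hours
gap_days
```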