Starting with R

Loiyumba

What is R?

  • is an open-source programming language and environment for statistical computing
  • can do data analysis
  • can visualize data
  • can fit machine learning models

To download R - https://cran.r-project.org/

What is RStudio?

  • is the most popular open-source integrated development environment (IDE) for R
  • can create interactive plots
  • can create web applications
  • can create presentations
  • can write and publish documents, etc.

To download RStudio - https://www.rstudio.com/products/rstudio/download/

RStudio

RStudio comes with 4 panes, along with many other functions -

  • Source/Editor
  • Console/Command line
  • Environment/History/Files
  • Plots/Packages/Help/Viewer

Basic Operations in R

In the console/command line

7 + 9 + 5 + 0 + 0 + 1
[1] 22
7 + 9 * 5
[1] 52
(7 + 9) * 5
[1] 80
c(7, 9, 5, 0, 0, 1)
[1] 7 9 5 0 0 1
1:50 
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
[24] 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46
[47] 47 48 49 50
2^3
[1] 8
c(7, 9, 5, 0, 0, 1) + c(70, 90, 50, 10, 10, 10)
[1] 77 99 55 10 10 11
c(7, 9, 5, 0, 0, 1) * c(70, 90, 50, 10, 10, 10)
[1] 490 810 250   0   0  10
c(7, 9, 5, 0, 0, 1) - c(70, 90, 50, 10, 10, 10)
[1] -63 -81 -45 -10 -10  -9
c(7, 9, 5, 0, 0, 1) + 100
[1] 107 109 105 100 100 101
1/c(7, 9, 5, 0, 0,1)
[1] 0.1428571 0.1111111 0.2000000       Inf       Inf 1.0000000
c(7, 9, 5, 0, 0, 1) + c(10, 100)
[1]  17 109  15 100  10 101
paste("Hello", "World!")
[1] "Hello World!"
"Hello World!"
[1] "Hello World!"
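
The addition c(7, 9, 5, 0, 0, 1) + c(10, 100) above works because R recycles the shorter vector until it matches the length of the longer one. The following is a minimal sketch of that behaviour; the names long, short and recycled are only illustrative.

```r
long  <- c(7, 9, 5, 0, 0, 1)
short <- c(10, 100)

# Spell out what recycling does: repeat `short` up to the length of `long`
recycled <- rep(short, length.out = length(long))
recycled

# Both additions give the same result
long + short
long + recycled
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.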

We can add comments to the code with #. R ignores everything after # on a line.

2 + 5 # Sum of 2 and 5 will give 7
[1] 7

Assigning to a variable/object

a <- 1 # <- is the assignment operator
a
[1] 1
a + 1
[1] 2

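Some notes on object names

  • Object names must begin with a letter (or a dot that is not followed by a digit); they cannot begin with a digit or an underscore, and they cannot contain symbols such as $ or !
  • It is wise to avoid names already in use by R
  • R is case sensitive, so it will treat 'a' and 'A' differently
  • We can remove an object with

rm(object_name)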
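
As a small illustration of these naming rules, the sketch below uses two hypothetical objects, marks and Marks, and the base function make.names(), which shows how R would repair names that are not valid.

```r
marks <- 10
Marks <- 20            # a different object, because R is case sensitive
marks == Marks         # FALSE

# make.names() turns invalid names into valid ones
repaired <- make.names(c("2nd", "my var"))
repaired               # "X2nd" "my.var"

rm(marks, Marks)       # remove both objects
```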

R Objects Attributes

R objects can have attributes, which are like metadata for the object. This metadata can be very useful because it helps to describe the object. Common attributes are -

  • names, dimnames
  • dimensions
  • class
  • length

Attributes of an object (if any) can be accessed using the attributes() function. Not all R objects contain attributes, in which case the attributes() function returns NULL.
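
A short sketch of attributes() on three objects follows; the object names plain, named and m are only illustrative.

```r
plain <- 1:3
attributes(plain)          # NULL: a plain vector carries no attributes

named <- c(a = 1, b = 2)
attributes(named)          # a list with one element, $names

m <- matrix(1:6, nrow = 2)
attributes(m)              # a list with one element, $dim
```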

5 -> b 
print(b)
[1] 5
a + b
[1] 6
(a + b)/2
[1] 3
c = (a + b)/2 # We can use (=) instead of (<-)
c
[1] 3
w <- c(7, 9, 5, 0, 0, 1)
x <- c(1, 0, 0, 5, 9, 7)
y <- c(w, x)
y
 [1] 7 9 5 0 0 1 1 0 0 5 9 7
z <- c(x, w)
z
 [1] 1 0 0 5 9 7 7 9 5 0 0 1
d <- 1:10
d
 [1]  1  2  3  4  5  6  7  8  9 10
e <- 10:1
e
 [1] 10  9  8  7  6  5  4  3  2  1
d + 1
 [1]  2  3  4  5  6  7  8  9 10 11
e - 1
 [1] 9 8 7 6 5 4 3 2 1 0
d + e
 [1] 11 11 11 11 11 11 11 11 11 11

R built-in functions (a.k.a. Base Package)

k <- seq(from = 1, to = 10, by = 2)
k
[1] 1 3 5 7 9
j <- seq(from = -1, to = 1, by = 0.2)
j
 [1] -1.0 -0.8 -0.6 -0.4 -0.2  0.0  0.2  0.4  0.6  0.8  1.0
m <- rep(2, times = 5)
m
[1] 2 2 2 2 2
p <- rep(1:3, times = 3)
p
[1] 1 2 3 1 2 3 1 2 3
q <- rep(1:3, each = 3)
q
[1] 1 1 1 2 2 2 3 3 3
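
seq() and rep() accept a few other arguments besides by, times and each. As a brief sketch (the object names are only illustrative), length.out fixes the number of values instead of the step, and times can be a vector giving a separate count for each element.

```r
# Five evenly spaced values from 0 to 1
s <- seq(from = 0, to = 1, length.out = 5)
s                                   # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) is a shortcut for the integers 1 to n
n4 <- seq_len(4)
n4                                  # 1 2 3 4

# Repeat "a" twice and "b" three times
r <- rep(c("a", "b"), times = c(2, 3))
r                                   # "a" "a" "b" "b" "b"
```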

For information on any function

help("seq")

or, we can simply type

?rep

To list the objects/variables in the current session, or to remove them

ls() # check the variables in the current session
 [1] "a" "b" "c" "d" "e" "j" "k" "m" "p" "q" "w" "x" "y" "z"
rm(p) # delete single object
rm(k, m) # delete multiple objects
ls()
 [1] "a" "b" "c" "d" "e" "j" "q" "w" "x" "y" "z"
rm(list = ls()) # delete everything 
ls()
character(0)

R Maths Functions

g <- c(45, 90, 20)
sum(g) # Sum
[1] 155
mean(g) # Mean
[1] 51.66667
round(51.66667, 2) # Round to n decimal places
[1] 51.67
round(mean(g), 2) # Nested function
[1] 51.67
median(g) # Median
[1] 45
rank(g) # Rank the elements
[1] 2 3 1
var(g) # Variance
[1] 1258.333
max(g) # Largest element
[1] 90
min(g) # Smallest element
[1] 20
log(25) # Natural log
[1] 3.218876
exp(5) # Exponential
[1] 148.4132
sqrt(95) # Square root
[1] 9.746794
abs(-43) # Absolute value 
[1] 43
u <- 45:60 
quantile(u) # Quantile
   0%   25%   50%   75%  100% 
45.00 48.75 52.50 56.25 60.00 
sd(u) # Standard deviation
[1] 4.760952

R Data Types

Four basic data types in R -

  • numbers (numeric)
  • character strings (text)
  • logical
  • factor

Numeric

Any number. Appropriate for math.

1 + 1
[1] 2
100
[1] 100

Character

Any text. Any symbols surrounded by quotes.

"hello, this is R"
[1] "hello, this is R"
"Imphal's pin code is 795001"
[1] "Imphal's pin code is 795001"
f <- c("1", "2", "3")
f
[1] "1" "2" "3"

Logical

R's form of binary data: TRUE or FALSE. Useful for logical tests.

100 < 400
[1] TRUE
100 > 400
[1] FALSE

Logical Comparison

L <- 1:5
L
[1] 1 2 3 4 5
L > 3 # greater than
[1] FALSE FALSE FALSE  TRUE  TRUE
L >= 3 # greater than or equal to
[1] FALSE FALSE  TRUE  TRUE  TRUE
L < 3 # less than
[1]  TRUE  TRUE FALSE FALSE FALSE
L <= 3 # less than or equal to
[1]  TRUE  TRUE  TRUE FALSE FALSE
L == 3 # equal to
[1] FALSE FALSE  TRUE FALSE FALSE
L != 3 # not equal to
[1]  TRUE  TRUE FALSE  TRUE  TRUE

%in% Operator

The %in% operator tests whether each element on the left is a member of the group on the right.

"mango" %in% c("mango", "apple", "banana")
[1] TRUE
1 %in% c(2:8)
[1] FALSE
c(12:20) %in% c(15, 17)
[1] FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE

Boolean Operators

We can combine logical tests with &, |, xor, !, any, and all.

x <- 1:10
x > 2 & x < 8
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
x > 8 | x < 2
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
x[(x > 2) & (x < 8)]
[1] 3 4 5 6 7
x[(x > 8) | (x < 2)]
[1]  1  9 10
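
The examples above cover & and |. A minimal sketch of the remaining operators, !, xor(), any() and all(), could look like this (the object names are only illustrative):

```r
x <- 1:10

not_big <- !(x > 5)             # negation: TRUE for 1 to 5
not_big

either <- xor(x > 3, x < 7)     # TRUE when exactly one of the two tests is TRUE
either

any(x > 9)                      # TRUE: at least one element passes the test
all(x > 0)                      # TRUE: every element passes the test
```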

Factor

R's form of categorical data.

dist <- factor(c("Imphal East", "Imphal West", "Senapati", "Imphal West"))
dist
[1] Imphal East Imphal West Senapati    Imphal West
Levels: Imphal East Imphal West Senapati
table(dist)
dist
Imphal East Imphal West    Senapati 
          1           2           1 
as.numeric(dist)
[1] 1 2 3 2
status_vector <- c("Married", "Not Married")
status_factor <- factor(status_vector)
status_factor <- factor(status_factor, levels = c("Not Married", "Married"))
status_factor
[1] Married     Not Married
Levels: Not Married Married
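
As as.numeric(dist) showed, a factor is stored as integer codes that point to its levels. One consequence is worth a short sketch: when the levels look like numbers, as.numeric() returns the codes, not the numbers. The factor f below is a made-up example.

```r
f <- factor(c("10", "20", "10"))
levels(f)                                # "10" "20"

codes  <- as.numeric(f)                  # the integer codes: 1 2 1
values <- as.numeric(as.character(f))    # the original numbers: 10 20 10
codes
values
```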

R Data Structures

Some of the most frequently-used R data structures are -

  • Vectors
  • Matrices
  • Lists
  • Data Frames

Vectors

Vector elements must all have the same mode, which can be integer, numeric (floating-point number), character (string), logical (boolean), complex, object, etc.

A vector combines multiple elements into a one-dimensional array.

x <- 1:10 # integer
x
 [1]  1  2  3  4  5  6  7  8  9 10
fruits <- c("Apple", "Banana", "Mango", "Papaya") 
fruits # character
[1] "Apple"  "Banana" "Mango"  "Papaya"
logi <- c(TRUE, FALSE, TRUE) # logical
logi
[1]  TRUE FALSE  TRUE
com <- c(1+0i, 2+4i) # complex
com
[1] 1+0i 2+4i

What happens if we mix vectors of different classes?

student <- c("Tomba", "Chaoba", "Thoibi", "Bena")
class(student)
[1] "character"
age <- c(24, 26, 25, 22)
class(age)
[1] "numeric"
info <- c(student, age)
info
[1] "Tomba"  "Chaoba" "Thoibi" "Bena"   "24"     "26"     "25"     "22"    
class(info)
[1] "character"
# Do ?class for more detail on class function

When different classes are mixed in one vector, R coerces them as follows -

  • logical and numeric, class(vector) will be numeric
  • logical and character, class(vector) will be character
  • numeric and character, class(vector) will be character
  • logical, numeric and character, class(vector) will be character

class(c(795001, "Imphal"))
[1] "character"
class(c(TRUE, 795001))
[1] "numeric"
class(c(TRUE, 795001, "Imphal"))
[1] "character"
class(c("TRUE", 795001))
[1] "character"
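
Because logical values are coerced to numbers (TRUE becomes 1 and FALSE becomes 0), arithmetic functions can count or average them. A small sketch, using a hypothetical vector called passed:

```r
passed <- c(TRUE, FALSE, TRUE, TRUE)

n_passed <- sum(passed)      # number of TRUE values: 3
share    <- mean(passed)     # proportion of TRUE values: 0.75
n_passed
share

class(c(TRUE, 2L))           # "integer": logical is coerced to integer
class(c(1L, 2.5))            # "numeric": integer is coerced to numeric
```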

Explicit Coercion

Objects can be explicitly coerced from one class to another using the as.* functions, if available.

t <- -1:10
class(t)
[1] "integer"
as.numeric(t)
 [1] -1  0  1  2  3  4  5  6  7  8  9 10
as.logical(t)
 [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[12]  TRUE
as.character(t)
 [1] "-1" "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"

Sometimes, R can’t figure out how to coerce an object and this can result in NAs being produced.

v <- c("Manipur", "Nagaland", "Mizoram")
class(v)
[1] "character"
as.numeric(v)
[1] NA NA NA
as.logical(v)
[1] NA NA NA
as.complex(v)
[1] NA NA NA

NA stands for Not Available. It is R's marker for a missing value.
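
Missing values propagate through most calculations, so it helps to know how to detect and skip them. A minimal sketch with a made-up vector called marks:

```r
marks <- c(45, NA, 90)

is.na(marks)                          # FALSE TRUE FALSE
n_missing <- sum(is.na(marks))        # 1 missing value

mean(marks)                           # NA, because one value is missing
avg <- mean(marks, na.rm = TRUE)      # 67.5, computed without the NA
avg
```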

Some Vector Functions

y <- c(7, 9, 5, 0, 0, 1)
sort(y)
[1] 0 0 1 5 7 9
sort(y, decreasing = TRUE)
[1] 9 7 5 1 0 0
table(y)
y
0 1 5 7 9 
2 1 1 1 1 
rev(y)
[1] 1 0 0 5 9 7
unique(y)
[1] 7 9 5 0 1

Selecting Vector Elements

a <- seq(from = 1, to = 20, by = 2)
a
 [1]  1  3  5  7  9 11 13 15 17 19
a[5]
[1] 9
a[-5]
[1]  1  3  5  7 11 13 15 17 19
a[3:6]
[1]  5  7  9 11
a[-(3:6)]
[1]  1  3 13 15 17 19
a[c(3, 6)]
[1]  5 11
a[a > 11]
[1] 13 15 17 19
a[a < 11]
[1] 1 3 5 7 9
a[a == 11]
[1] 11
district <- c("Imphal East", "Senapati", "Churachandpur", "Thoubal", "Ukhrul", "Bishenpur")
district[5]
[1] "Ukhrul"
district[c(1, 6)]
[1] "Imphal East" "Bishenpur"  
alpha <- letters[1:10]
alpha
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j"
alpha[c(5,5,5,5)]
[1] "e" "e" "e" "e"
alpha[c(5:1, 1:5)]
 [1] "e" "d" "c" "b" "a" "a" "b" "c" "d" "e"
alpha[11] # indexing with out-of-range values
[1] NA
alpha[1:11]
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" NA 
alpha > "d"
 [1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
alpha[alpha > "d"]
[1] "e" "f" "g" "h" "i" "j"
selector <- alpha > "d"
alpha[selector]
[1] "e" "f" "g" "h" "i" "j"
which(alpha > "d")
[1]  5  6  7  8  9 10
indexes <- which(alpha > "d")
alpha[indexes]
[1] "e" "f" "g" "h" "i" "j"
g <- 1:10
g
 [1]  1  2  3  4  5  6  7  8  9 10
g[2] <- 100
g
 [1]   1 100   3   4   5   6   7   8   9  10
g[11] <- 200
g
 [1]   1 100   3   4   5   6   7   8   9  10 200
g[c(4,8)] <- -500
g
 [1]    1  100    3 -500    5    6    7 -500    9   10  200
g[3] <- g[11]
g
 [1]    1  100  200 -500    5    6    7 -500    9   10  200
g <- c(g, 33, 44)
g
 [1]    1  100  200 -500    5    6    7 -500    9   10  200   33   44
g <- c(22, 55, g)
g
 [1]   22   55    1  100  200 -500    5    6    7 -500    9   10  200   33
[15]   44
g <- c(g[1:5], 111, g[6:15])
g
 [1]   22   55    1  100  200  111 -500    5    6    7 -500    9   10  200
[15]   33   44
g <- g[-3:-6]
g
 [1]   22   55 -500    5    6    7 -500    9   10  200   33   44
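
Inserting a value in the middle of a vector, done above with c(g[1:5], 111, g[6:15]), can also be written with the base function append() and its after argument. A short sketch on a smaller, made-up vector v:

```r
v <- c(22, 55, 1, 100, 200)

# Insert 111 after the 2nd element
v_mid <- append(v, 111, after = 2)
v_mid                                # 22 55 111 1 100 200

# Without `after`, the value is added at the end
v_end <- append(v, 111)
v_end                                # 22 55 1 100 200 111
```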

Matrices

A matrix is a vector with two additional attributes: the number of rows and the number of columns. It combines multiple elements into a two-dimensional array. We create one with the matrix() function.

mat <- matrix(c(1:6), nrow = 2)
mat
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
attributes(mat)
$dim
[1] 2 3

Matrices are constructed column-wise, so entries can be thought of starting in the “upper left” corner and running down the columns.

mat2 <- matrix(c(1:6), nrow = 3)
mat2
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
dim(mat2)
[1] 3 2

However, we can fill a matrix row-wise instead by setting byrow = TRUE.

mat3 <- matrix(c(1:6), nrow = 3, byrow = TRUE)
mat3
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
dim(mat3)
[1] 3 2
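
To see how the two fill orders relate, the sketch below (with illustrative names by_row and via_t) builds the same 3 x 2 matrix in two ways: by filling row-wise, and by filling a 2 x 3 matrix column-wise and transposing it with t().

```r
by_row <- matrix(1:6, nrow = 3, byrow = TRUE)   # fill along the rows
via_t  <- t(matrix(1:6, nrow = 2))              # fill down the columns, then transpose

by_row
via_t
all(by_row == via_t)                            # TRUE: the entries match
```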

Matrices can also be created directly from vectors by adding a dimension attribute.

mat4 <- 1:10
dim(mat4) <- c(2, 5)
mat4
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Matrices can be created by column-binding or row-binding with the cbind() and rbind() functions.

p <- 1:5
q <- 6:10
cbind(p, q)
     p  q
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10

Do ?cbind in the console for more info

rbind(p, q)
  [,1] [,2] [,3] [,4] [,5]
p    1    2    3    4    5
q    6    7    8    9   10

Do ?rbind in the console for more info

We can also transpose a matrix with t(), which swaps its rows and columns.

pq <- rbind(p, q)
pq # 2, 5
  [,1] [,2] [,3] [,4] [,5]
p    1    2    3    4    5
q    6    7    8    9   10
t(pq) # 5, 2
     p  q
[1,] 1  6
[2,] 2  7
[3,] 3  8
[4,] 4  9
[5,] 5 10

Vectorized Matrix Operations

x <- matrix(1:6, nrow = 3)
x
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
y <- matrix(rep(5, times = 6), nrow = 3)
y
     [,1] [,2]
[1,]    5    5
[2,]    5    5
[3,]    5    5

Element-wise multiplication

x * y
     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30

Element-wise division

x/y
     [,1] [,2]
[1,]  0.2  0.8
[2,]  0.4  1.0
[3,]  0.6  1.2
3 * x
     [,1] [,2]
[1,]    3   12
[2,]    6   15
[3,]    9   18
x + x
     [,1] [,2]
[1,]    2    8
[2,]    4   10
[3,]    6   12
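
Note that * multiplies matching elements. For the matrix product from linear algebra, R has a separate operator, %*%. The sketch below reuses the same x and y; since both are 3 x 2, x is transposed first so that the dimensions conform.

```r
x <- matrix(1:6, nrow = 3)
y <- matrix(rep(5, times = 6), nrow = 3)

# (2 x 3) %*% (3 x 2) gives a 2 x 2 matrix
prod_xy <- t(x) %*% y
prod_xy
dim(prod_xy)                 # 2 2
```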

Subsetting in Matrices

z <- x * y
z
     [,1] [,2]
[1,]    5   20
[2,]   10   25
[3,]   15   30
z[1, ] # select 1st row & all columns
[1]  5 20
z[, 1] # select all rows & 1st column
[1]  5 10 15
z[2:3, , drop = FALSE] # drop = FALSE keeps the result as a matrix
     [,1] [,2]
[1,]   10   25
[2,]   15   30
z[, 2, drop = FALSE]
     [,1]
[1,]   20
[2,]   25
[3,]   30
z[2,2] # select an element
[1] 25

Filtering on Matrices

z[z[, 1] >= 10, ]
     [,1] [,2]
[1,]   10   25
[2,]   15   30
z[z[, 1] > 10 & z[, 2] >= 25, ]
[1] 15 30
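
The last filter above matched a single row, so R returned a plain vector. If the result should stay a matrix however many rows match, drop = FALSE can be added to the filter as well. A minimal sketch, rebuilding z with the same values:

```r
z <- matrix(c(5, 10, 15, 20, 25, 30), nrow = 3)

as_vector <- z[z[, 1] > 10 & z[, 2] >= 25, ]
as_matrix <- z[z[, 1] > 10 & z[, 2] >= 25, , drop = FALSE]

is.matrix(as_vector)         # FALSE
is.matrix(as_matrix)         # TRUE
dim(as_matrix)               # 1 2
```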

Matrix Row and Column Names

colnames(z)
NULL
colnames(z) <- c("A", "B")
z
      A  B
[1,]  5 20
[2,] 10 25
[3,] 15 30
colnames(z)
[1] "A" "B"
z[, "B"]
[1] 20 25 30
rownames(z)
NULL
rownames(z) <- c("P", "Q", "R")
z
   A  B
P  5 20
Q 10 25
R 15 30

Lists

A list is a one-dimensional group of R objects. Create lists with the list() function.

a_list <- list(1, "HP", TRUE)
a_list
[[1]]
[1] 1

[[2]]
[1] "HP"

[[3]]
[1] TRUE

The elements of a list can be anything, even vectors or other lists.

students <- list(name = c("Tomba", "Chaoba"), age = c(23, 25), single = c(TRUE, FALSE))
students
$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$single
[1]  TRUE FALSE
str(students)
List of 3
 $ name  : chr [1:2] "Tomba" "Chaoba"
 $ age   : num [1:2] 23 25
 $ single: logi [1:2] TRUE FALSE
names(students)
[1] "name"   "age"    "single"
students$name
[1] "Tomba"  "Chaoba"
students[["name"]]
[1] "Tomba"  "Chaoba"
students[[1]]
[1] "Tomba"  "Chaoba"
students["name"]
$name
[1] "Tomba"  "Chaoba"
students[c("name", "single")]
$name
[1] "Tomba"  "Chaoba"

$single
[1]  TRUE FALSE
students[c(1, 3)]
$name
[1] "Tomba"  "Chaoba"

$single
[1]  TRUE FALSE
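
The difference between single and double brackets is easy to miss in the output above: [ ] always returns a list, while [[ ]] and $ return the element itself. A short sketch that checks the classes, using the same students list:

```r
students <- list(name = c("Tomba", "Chaoba"), age = c(23, 25), single = c(TRUE, FALSE))

class(students["name"])       # "list": a list holding one element
class(students[["name"]])     # "character": the element itself
class(students$name)          # "character": same as [["name"]]

# Only the extracted element can be indexed like a vector
students[["name"]][2]         # "Chaoba"
```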

Adding/Deleting List Elements

students$education <- c("Graduate", "Master")
students
$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$single
[1]  TRUE FALSE

$education
[1] "Graduate" "Master"  
students$single <- NULL
students
$name
[1] "Tomba"  "Chaoba"

$age
[1] 23 25

$education
[1] "Graduate" "Master"  

Data Frames

Data frames group vectors together into a two-dimensional table. Each vector becomes a column in the table. As a result, each column of a data frame can contain a different type of data; but within a column, every cell must be the same type of data. We can create data frames with the data.frame() function.

students <- data.frame(rollnum = c(10,42,3),
                       name = c("Iboyaima", "Tomchou", "Tombi"),
                       examfailed = c(TRUE, FALSE, TRUE)
                   )
students
  rollnum     name examfailed
1      10 Iboyaima       TRUE
2      42  Tomchou      FALSE
3       3    Tombi       TRUE
class(students)
[1] "data.frame"
str(students)
'data.frame':   3 obs. of  3 variables:
 $ rollnum   : num  10 42 3
 $ name      : Factor w/ 3 levels "Iboyaima","Tombi",..: 1 3 2
 $ examfailed: logi  TRUE FALSE TRUE
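
In the str() output above, name is a Factor. That is the default in R versions before 4.0.0; from R 4.0.0 onwards, data.frame() keeps text columns as character unless told otherwise. The sketch below sets stringsAsFactors explicitly so the result does not depend on the R version; the object names as_factor and as_text are only illustrative.

```r
as_factor <- data.frame(name = c("Iboyaima", "Tomchou", "Tombi"),
                        stringsAsFactors = TRUE)
as_text   <- data.frame(name = c("Iboyaima", "Tomchou", "Tombi"),
                        stringsAsFactors = FALSE)

class(as_factor$name)        # "factor"
class(as_text$name)          # "character"
```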

cbind & rbind in Data Frames

manipur <- data.frame(Districts = c("ImpWest", "ImpEast", "Ccpur",
                                    "Thoubal", "Tamenglong", "Senapati",
                                    "Chandel", "Ukhrul", "Bishenpur"),
                      Population = c(700000, 500000, 400000, 300000,
                                     200000, 450000, 400000, 500000, 750000),
                      Literacy = c(9.5, 9.4, 7.2, 8.6, 5.3, 8.5, 6.8,
                                   6.2, 8.7), stringsAsFactors = FALSE
                      )
manipur
   Districts Population Literacy
1    ImpWest     700000      9.5
2    ImpEast     500000      9.4
3      Ccpur     400000      7.2
4    Thoubal     300000      8.6
5 Tamenglong     200000      5.3
6   Senapati     450000      8.5
7    Chandel     400000      6.8
8     Ukhrul     500000      6.2
9  Bishenpur     750000      8.7
ForestCover <- c(40, 45, 67, 60, 90, 68, 65, 85, 70)
ForestCover
[1] 40 45 67 60 90 68 65 85 70

With the cbind() function, we can add this vector to manipur as a new column.

manipur <- cbind(manipur, ForestCover)
manipur
   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65
8     Ukhrul     500000      6.2          85
9  Bishenpur     750000      8.7          70

Adding New Row

We will add another row to manipur with rbind().

SadarHill <- data.frame(Districts = "SadarHill", 
                        Population = 450000,
                        Literacy = 8.5,
                        ForestCover = 66,
                        stringsAsFactors = FALSE)
SadarHill
  Districts Population Literacy ForestCover
1 SadarHill     450000      8.5          66
manipur <- rbind(manipur, SadarHill)

Deleting Row

We can drop rows from a data frame by selecting only the rows we want to keep, or by excluding rows with negative indices.

manipur[1:9, ] # omitting in selection
   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65
8     Ukhrul     500000      6.2          85
9  Bishenpur     750000      8.7          70
manipur[-(8:10), ] # with - sign
   Districts Population Literacy ForestCover
1    ImpWest     700000      9.5          40
2    ImpEast     500000      9.4          45
3      Ccpur     400000      7.2          67
4    Thoubal     300000      8.6          60
5 Tamenglong     200000      5.3          90
6   Senapati     450000      8.5          68
7    Chandel     400000      6.8          65
manipur[-c(2,5,8:10), ] # row selection
  Districts Population Literacy ForestCover
1   ImpWest     700000      9.5          40
3     Ccpur     400000      7.2          67
4   Thoubal     300000      8.6          60
6  Senapati     450000      8.5          68
7   Chandel     400000      6.8          65
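
These selections only print a reduced copy; manipur itself still has all 10 rows, as the later examples show. To actually delete rows, the result has to be assigned back. A minimal sketch on a small made-up data frame called df:

```r
df <- data.frame(id = 1:5, value = c(10, 20, 30, 40, 50))

df[-(4:5), ]                 # prints rows 1 to 3, but df is unchanged
rows_before <- nrow(df)      # still 5

df <- df[-(4:5), ]           # assign the result back to keep the change
rows_after <- nrow(df)       # now 3
```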

Adding New Column

Adding new column to an existing data frame

manipur$AnnualRain <- c(3.2, 3.5, 4.0, 3.8, 4.8, 4.2, 3.8,
                        4.2, 3.7, 3.8)
str(manipur)
'data.frame':   10 obs. of  5 variables:
 $ Districts  : chr  "ImpWest" "ImpEast" "Ccpur" "Thoubal" ...
 $ Population : num  700000 500000 400000 300000 200000 450000 400000 500000 750000 450000
 $ Literacy   : num  9.5 9.4 7.2 8.6 5.3 8.5 6.8 6.2 8.7 8.5
 $ ForestCover: num  40 45 67 60 90 68 65 85 70 66
 $ AnnualRain : num  3.2 3.5 4 3.8 4.8 4.2 3.8 4.2 3.7 3.8

Delete Column

We can delete a column by assigning NULL to it.

manipur$Literacy <- NULL
manipur
    Districts Population ForestCover AnnualRain
1     ImpWest     700000          40        3.2
2     ImpEast     500000          45        3.5
3       Ccpur     400000          67        4.0
4     Thoubal     300000          60        3.8
5  Tamenglong     200000          90        4.8
6    Senapati     450000          68        4.2
7     Chandel     400000          65        3.8
8      Ukhrul     500000          85        4.2
9   Bishenpur     750000          70        3.7
10  SadarHill     450000          66        3.8

Subsetting in Data Frames

Matrix Style Subsetting

manipur[1, ] # select first row and all the columns
  Districts Population ForestCover AnnualRain
1   ImpWest      7e+05          40        3.2
manipur[, 1] # select first column and all the rows
 [1] "ImpWest"    "ImpEast"    "Ccpur"      "Thoubal"    "Tamenglong"
 [6] "Senapati"   "Chandel"    "Ukhrul"     "Bishenpur"  "SadarHill" 
manipur[, "Districts"]
 [1] "ImpWest"    "ImpEast"    "Ccpur"      "Thoubal"    "Tamenglong"
 [6] "Senapati"   "Chandel"    "Ukhrul"     "Bishenpur"  "SadarHill" 
manipur[, c("Districts", "AnnualRain")]
    Districts AnnualRain
1     ImpWest        3.2
2     ImpEast        3.5
3       Ccpur        4.0
4     Thoubal        3.8
5  Tamenglong        4.8
6    Senapati        4.2
7     Chandel        3.8
8      Ukhrul        4.2
9   Bishenpur        3.7
10  SadarHill        3.8
manipur[2:6, ] # selecting rows 2 to 6 and all columns
   Districts Population ForestCover AnnualRain
2    ImpEast     500000          45        3.5
3      Ccpur     400000          67        4.0
4    Thoubal     300000          60        3.8
5 Tamenglong     200000          90        4.8
6   Senapati     450000          68        4.2
manipur[10:7, ] # selected rows 10 to 7 and all columns
   Districts Population ForestCover AnnualRain
10 SadarHill     450000          66        3.8
9  Bishenpur     750000          70        3.7
8     Ukhrul     500000          85        4.2
7    Chandel     400000          65        3.8
manipur[5:7, 1:2] # selected rows and columns
   Districts Population
5 Tamenglong     200000
6   Senapati     450000
7    Chandel     400000
manipur[c(4, 8, 10), c(1, 2, 4)] # selected rows and columns
   Districts Population AnnualRain
4    Thoubal     300000        3.8
8     Ukhrul     500000        4.2
10 SadarHill     450000        3.8

Or, we can use R's built-in functions to inspect the observations and other information about the data frame

head(manipur)
   Districts Population ForestCover AnnualRain
1    ImpWest     700000          40        3.2
2    ImpEast     500000          45        3.5
3      Ccpur     400000          67        4.0
4    Thoubal     300000          60        3.8
5 Tamenglong     200000          90        4.8
6   Senapati     450000          68        4.2
tail(manipur)
    Districts Population ForestCover AnnualRain
5  Tamenglong     200000          90        4.8
6    Senapati     450000          68        4.2
7     Chandel     400000          65        3.8
8      Ukhrul     500000          85        4.2
9   Bishenpur     750000          70        3.7
10  SadarHill     450000          66        3.8
nrow(manipur) # number of rows
[1] 10
ncol(manipur) # number of columns
[1] 4
dim(manipur) # dimension of the data frame
[1] 10  4
summary(manipur[2:4]) # Summary stats of data frame
   Population      ForestCover      AnnualRain   
 Min.   :200000   Min.   :40.00   Min.   :3.200  
 1st Qu.:400000   1st Qu.:61.25   1st Qu.:3.725  
 Median :450000   Median :66.50   Median :3.800  
 Mean   :465000   Mean   :65.60   Mean   :3.900  
 3rd Qu.:500000   3rd Qu.:69.50   3rd Qu.:4.150  
 Max.   :750000   Max.   :90.00   Max.   :4.800  

Filtering on Data Frames

Selecting districts with a population of more than 500000

manipur[manipur[, 2] > 500000, ]
  Districts Population ForestCover AnnualRain
1   ImpWest     700000          40        3.2
9 Bishenpur     750000          70        3.7

Selecting districts with a population of 500000 or more

manipur[manipur[, 2] >= 500000, ]
  Districts Population ForestCover AnnualRain
1   ImpWest     700000          40        3.2
2   ImpEast     500000          45        3.5
8    Ukhrul     500000          85        4.2
9 Bishenpur     750000          70        3.7

Selecting districts with less than 500000 population and forest cover more than 70

manipur[manipur[, 2] < 500000 & manipur[, 3] > 70, ]
   Districts Population ForestCover AnnualRain
5 Tamenglong      2e+05          90        4.8

Now, we want only district names with population more than or equal to 500000 and forest cover more than or equal to 70

manipur[manipur[, 2] >= 500000 & manipur[, 3] >= 70, 1]
[1] "Ukhrul"    "Bishenpur"

Or, use the subset() function

subset(manipur, ForestCover >= 75)
   Districts Population ForestCover AnnualRain
5 Tamenglong      2e+05          90        4.8
8     Ukhrul      5e+05          85        4.2
subset(manipur, Population > 400000 & AnnualRain > 4.0)
  Districts Population ForestCover AnnualRain
6  Senapati     450000          68        4.2
8    Ukhrul     500000          85        4.2
subset(manipur, AnnualRain >= 3.5)$AnnualRain
[1] 3.5 4.0 3.8 4.8 4.2 3.8 4.2 3.7 3.8
subset(manipur, AnnualRain >= 4.0)[, -(2:3)]
   Districts AnnualRain
3      Ccpur        4.0
5 Tamenglong        4.8
6   Senapati        4.2
8     Ukhrul        4.2
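
subset() also has a select argument for choosing columns, which avoids the extra [, -(2:3)] step. A short sketch on a three-row excerpt of the data (the name small is only illustrative):

```r
small <- data.frame(Districts   = c("Ccpur", "Thoubal", "Tamenglong"),
                    Population  = c(400000, 300000, 200000),
                    ForestCover = c(67, 60, 90),
                    AnnualRain  = c(4.0, 3.8, 4.8),
                    stringsAsFactors = FALSE)

# Filter rows and pick columns in one call
wet <- subset(small, AnnualRain >= 4.0, select = c(Districts, AnnualRain))
wet
```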

Names in Data Frame

Matrices use colnames() and rownames() to set or change column and row names. Similarly, data frames use names() to set or change column names and row.names() to set or change row names.

names(manipur)
[1] "Districts"   "Population"  "ForestCover" "AnnualRain" 
names(manipur) <- c("Dist", "Pop", "ForCov", "AnRain")
names(manipur)
[1] "Dist"   "Pop"    "ForCov" "AnRain"
row.names(manipur)
 [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
row.names(manipur) <- LETTERS[1:10]
row.names(manipur)
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
head(manipur,3)
     Dist   Pop ForCov AnRain
A ImpWest 7e+05     40    3.2
B ImpEast 5e+05     45    3.5
C   Ccpur 4e+05     67    4.0

R Inbuilt Data

R ships with many built-in datasets for its users to practice on. Running data() lists them, and typing a dataset's name prints it.

str(USArrests)
'data.frame':   50 obs. of  4 variables:
 $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

To learn more about this dataset, we can run ?USArrests

Explore the Data

head(USArrests) # first 6 rows/observations
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
class(USArrests)
[1] "data.frame"

We can use the View() function to see the complete dataset

Do summary statistics

# summary(USArrests)
max(USArrests$Murder) # maximum value in Murder
[1] 17.4
USArrests[USArrests$Murder == 17.4, ] # which state?
        Murder Assault UrbanPop Rape
Georgia   17.4     211       60 25.8

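We can also use which.max() or which.min(), which return the row number of the largest or smallest value. For example, the state with the lowest murder arrest rate: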

which.min(USArrests$Murder) # gives the row number
[1] 34
USArrests[34, ] # see the row
             Murder Assault UrbanPop Rape
North Dakota    0.8      45       44  7.3
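
Since the state names are stored as row names, the row number from which.max() or which.min() can be turned directly into a name. A minimal sketch (the names highest and lowest are only illustrative):

```r
highest <- rownames(USArrests)[which.max(USArrests$Murder)]
lowest  <- rownames(USArrests)[which.min(USArrests$Murder)]

highest                      # "Georgia"
lowest                       # "North Dakota"
```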

Next, let us look at the 10 states with the highest and the 10 states with the lowest murder arrest rates.

Top 10 states with the highest murder arrest rates (per 100,000 residents)

head(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10) # use order function. do ?order
               Murder Assault UrbanPop Rape
Georgia          17.4     211       60 25.8
Mississippi      16.1     259       44 17.1
Florida          15.4     335       80 31.9
Louisiana        15.4     249       66 22.2
South Carolina   14.4     279       48 22.5
Alabama          13.2     236       58 21.2
Tennessee        13.2     188       59 26.9
North Carolina   13.0     337       45 16.1
Texas            12.7     201       80 25.5
Nevada           12.2     252       81 46.0

The 10 states with the lowest murder arrest rates

tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)
              Murder Assault UrbanPop Rape
Connecticut      3.3     110       77 11.1
Utah             3.2     120       80 22.9
Minnesota        2.7      72       66 14.9
Idaho            2.6     120       54 14.2
Wisconsin        2.6      53       66 10.8
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
North Dakota     0.8      45       44  7.3
least_murder <- tail(USArrests[order(USArrests$Murder, decreasing = TRUE), ], 10)
least_murder[order(least_murder$Murder), ]
              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Idaho            2.6     120       54 14.2
Wisconsin        2.6      53       66 10.8
Minnesota        2.7      72       66 14.9
Utah             3.2     120       80 22.9
Connecticut      3.3     110       77 11.1
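
A shorter route to the same ten states is to sort in ascending order, which is the default for order(), and take the first 10 rows with head(). This is a sketch of an alternative, not the approach used above; lowest_murder is an illustrative name.

```r
# order() sorts ascending by default, so the lowest rates come first
lowest_murder <- head(USArrests[order(USArrests$Murder), ], 10)
rownames(lowest_murder)
```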

Similarly, we can do the same for Assault and Rape.

Top 10 states with the highest assault arrest rates

head(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
               Murder Assault UrbanPop Rape
North Carolina   13.0     337       45 16.1
Florida          15.4     335       80 31.9
Maryland         11.3     300       67 27.8
Arizona           8.1     294       80 31.0
New Mexico       11.4     285       70 32.1
South Carolina   14.4     279       48 22.5
California        9.0     276       91 40.6
Alaska           10.0     263       48 44.5
Mississippi      16.1     259       44 17.1
Michigan         12.1     255       74 35.1

The 10 states with the lowest assault arrest rates

tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
              Murder Assault UrbanPop Rape
South Dakota     3.8      86       45 12.8
Maine            2.1      83       51  7.8
West Virginia    5.7      81       39  9.3
Minnesota        2.7      72       66 14.9
New Hampshire    2.1      57       56  9.5
Iowa             2.2      56       57 11.3
Wisconsin        2.6      53       66 10.8
Vermont          2.2      48       32 11.2
Hawaii           5.3      46       83 20.2
North Dakota     0.8      45       44  7.3
least_assault <- tail(USArrests[order(USArrests$Assault, decreasing = TRUE), ], 10)
least_assault[order(least_assault$Assault), ]
              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Hawaii           5.3      46       83 20.2
Vermont          2.2      48       32 11.2
Wisconsin        2.6      53       66 10.8
Iowa             2.2      56       57 11.3
New Hampshire    2.1      57       56  9.5
Minnesota        2.7      72       66 14.9
West Virginia    5.7      81       39  9.3
Maine            2.1      83       51  7.8
South Dakota     3.8      86       45 12.8

Top 10 states with the highest rape arrest rates

head(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
           Murder Assault UrbanPop Rape
Nevada       12.2     252       81 46.0
Alaska       10.0     263       48 44.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
Michigan     12.1     255       74 35.1
New Mexico   11.4     285       70 32.1
Florida      15.4     335       80 31.9
Arizona       8.1     294       80 31.0
Oregon        4.9     159       67 29.3
Missouri      9.0     178       70 28.2

The 10 states with the lowest rape arrest rates

tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
              Murder Assault UrbanPop Rape
South Dakota     3.8      86       45 12.8
Iowa             2.2      56       57 11.3
Vermont          2.2      48       32 11.2
Connecticut      3.3     110       77 11.1
Wisconsin        2.6      53       66 10.8
New Hampshire    2.1      57       56  9.5
West Virginia    5.7      81       39  9.3
Rhode Island     3.4     174       87  8.3
Maine            2.1      83       51  7.8
North Dakota     0.8      45       44  7.3
least_rape <- tail(USArrests[order(USArrests$Rape, decreasing = TRUE), ], 10)
least_rape[order(least_rape$Rape), ]
              Murder Assault UrbanPop Rape
North Dakota     0.8      45       44  7.3
Maine            2.1      83       51  7.8
Rhode Island     3.4     174       87  8.3
West Virginia    5.7      81       39  9.3
New Hampshire    2.1      57       56  9.5
Wisconsin        2.6      53       66 10.8
Connecticut      3.3     110       77 11.1
Vermont          2.2      48       32 11.2
Iowa             2.2      56       57 11.3
South Dakota     3.8      86       45 12.8

So, if I had to live in the USA, which state would you suggest? Let us compare the three states with the lowest rates in each category.

tail(least_murder, 3)
              Murder Assault UrbanPop Rape
Maine            2.1      83       51  7.8
New Hampshire    2.1      57       56  9.5
North Dakota     0.8      45       44  7.3
tail(least_assault, 3)
             Murder Assault UrbanPop Rape
Vermont         2.2      48       32 11.2
Hawaii          5.3      46       83 20.2
North Dakota    0.8      45       44  7.3
tail(least_rape, 3)
             Murder Assault UrbanPop Rape
Rhode Island    3.4     174       87  8.3
Maine           2.1      83       51  7.8
North Dakota    0.8      45       44  7.3

How about categorising the states as high or low on murder, high or low on assault, and high or low on rape?

Doing so gives us further insight into the data. Here a state counts as "high" when its value is above the national mean.

UScrime <- USArrests # assign a new object name
head(UScrime) # first 6 rows
           Murder Assault UrbanPop Rape
Alabama      13.2     236       58 21.2
Alaska       10.0     263       48 44.5
Arizona       8.1     294       80 31.0
Arkansas      8.8     190       50 19.5
California    9.0     276       91 40.6
Colorado      7.9     204       78 38.7
mean(UScrime$Murder)
[1] 7.788
UScrime$HighMurder <- as.numeric(UScrime$Murder > mean(UScrime$Murder))
str(UScrime)
'data.frame':   50 obs. of  5 variables:
 $ Murder    : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault   : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop  : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape      : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder: num  1 1 1 1 1 1 0 0 1 1 ...
table(UScrime$HighMurder)

 0  1 
27 23 
mean(UScrime$Assault)
[1] 170.76
UScrime$HighAssault <- as.numeric(UScrime$Assault > mean(UScrime$Assault))
str(UScrime)
'data.frame':   50 obs. of  6 variables:
 $ Murder     : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault    : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop   : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape       : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder : num  1 1 1 1 1 1 0 0 1 1 ...
 $ HighAssault: num  1 1 1 1 1 1 0 1 1 1 ...
table(UScrime$HighAssault)

 0  1 
27 23 
table(UScrime$HighMurder, UScrime$HighAssault)

     0  1
  0 25  2
  1  2 21

The rows are HighMurder and the columns are HighAssault (0 = low, 1 = high). The table reads as follows -

  • First row: 25 states have low murder and low assault, and 2 states have low murder but high assault.
  • Second row: 2 states have high murder but low assault, and 21 states have high murder and high assault.
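
A two-way table is easier to read when its dimensions are labelled. As a small sketch, naming the arguments passed to table() puts those names on the rows and columns. The data frame is rebuilt here so the example runs on its own, and the name crime_flags is just an illustrative choice.

```r
# Rebuild the two flags so this example is self-contained.
crime_flags <- USArrests
crime_flags$HighMurder  <- as.numeric(crime_flags$Murder  > mean(crime_flags$Murder))
crime_flags$HighAssault <- as.numeric(crime_flags$Assault > mean(crime_flags$Assault))

# Named arguments become the dimension labels of the table.
tab <- table(HighMurder = crime_flags$HighMurder,
             HighAssault = crime_flags$HighAssault)
tab
```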
mean(UScrime$Rape)
[1] 21.232
UScrime$HighRape <- as.numeric(UScrime$Rape > mean(UScrime$Rape))
str(UScrime)
'data.frame':   50 obs. of  7 variables:
 $ Murder     : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
 $ Assault    : int  236 263 294 190 276 204 110 238 335 211 ...
 $ UrbanPop   : int  58 48 80 50 91 78 77 72 80 60 ...
 $ Rape       : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
 $ HighMurder : num  1 1 1 1 1 1 0 0 1 1 ...
 $ HighAssault: num  1 1 1 1 1 1 0 1 1 1 ...
 $ HighRape   : num  0 1 1 0 1 1 0 0 1 1 ...
table(UScrime$HighRape)

 0  1 
29 21 
table(UScrime$HighMurder, UScrime$HighRape)

     0  1
  0 23  4
  1  6 17
table(UScrime$HighAssault, UScrime$HighRape)

     0  1
  0 23  4
  1  6 17

Merging Data

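We can use the merge() function to merge two data frames. merge() takes exactly two data frames at a time, so to combine more we merge the result with the next data frame, as done below.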

name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
country <- I(c("Portugal", "Argentina", "England", "Germany", "Sweden"))
players <- data.frame(name, country)
players
     name   country
1 Ronaldo  Portugal
2   Messi Argentina
3  Rooney   England
4   Klose   Germany
5  Zlatan    Sweden
name <- I(c("Ronaldo", "Messi", "Rooney", "Klose", "Zlatan"))
age <- c(28, 27, 29, 36, 24)
cap <- c("Yes", "No", "Yes", "No", "Yes")
players2 <- data.frame(name, age, cap)
players2
     name age cap
1 Ronaldo  28 Yes
2   Messi  27  No
3  Rooney  29 Yes
4   Klose  36  No
5  Zlatan  24 Yes
players3 <- merge(players, players2)
players3
     name   country age cap
1   Klose   Germany  36  No
2   Messi Argentina  27  No
3 Ronaldo  Portugal  28 Yes
4  Rooney   England  29 Yes
5  Zlatan    Sweden  24 Yes
name <- c("Messi", "Rooney", "Klose", "Ronaldo", "Drogba")
club <- c("Bayern Munich", "Barcelona", "Real Madrid", "ManU", "Chelsea")
players4 <- data.frame(name, club)
players4
     name          club
1   Messi Bayern Munich
2  Rooney     Barcelona
3   Klose   Real Madrid
4 Ronaldo          ManU
5  Drogba       Chelsea
merge(players3, players4)
     name   country age cap          club
1   Klose   Germany  36  No   Real Madrid
2   Messi Argentina  27  No Bayern Munich
3 Ronaldo  Portugal  28 Yes          ManU
4  Rooney   England  29 Yes     Barcelona
merge(players3, players4, all = TRUE)
     name   country age  cap          club
1  Drogba      <NA>  NA <NA>       Chelsea
2   Klose   Germany  36   No   Real Madrid
3   Messi Argentina  27   No Bayern Munich
4 Ronaldo  Portugal  28  Yes          ManU
5  Rooney   England  29  Yes     Barcelona
6  Zlatan    Sweden  24  Yes          <NA>
merge(players3, players4, all.x = TRUE)
     name   country age cap          club
1   Klose   Germany  36  No   Real Madrid
2   Messi Argentina  27  No Bayern Munich
3 Ronaldo  Portugal  28 Yes          ManU
4  Rooney   England  29 Yes     Barcelona
5  Zlatan    Sweden  24 Yes          <NA>
merge(players3, players4, all.y = TRUE)
     name   country age  cap          club
1  Drogba      <NA>  NA <NA>       Chelsea
2   Klose   Germany  36   No   Real Madrid
3   Messi Argentina  27   No Bayern Munich
4 Ronaldo  Portugal  28  Yes          ManU
5  Rooney   England  29  Yes     Barcelona
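
The four calls above differ only in which unmatched rows they keep. The sketch below repeats the idea on two tiny data frames (left_df and right_df are made-up names for illustration) so the row counts are easy to check by hand.

```r
left_df  <- data.frame(name = c("Klose", "Messi", "Zlatan"),
                       age  = c(36, 27, 24))
right_df <- data.frame(name = c("Klose", "Messi", "Drogba"),
                       club = c("Real Madrid", "Bayern Munich", "Chelsea"))

inner_rows <- merge(left_df, right_df)               # names found in both
full_rows  <- merge(left_df, right_df, all = TRUE)   # names from either side
left_rows  <- merge(left_df, right_df, all.x = TRUE) # every row of left_df
right_rows <- merge(left_df, right_df, all.y = TRUE) # every row of right_df

# Number of rows kept by each kind of merge
c(inner = nrow(inner_rows), full = nrow(full_rows),
  left = nrow(left_rows), right = nrow(right_rows))
```

Rows without a match on the other side are filled with NA, as in the outputs above.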

Suppose the key columns have different names in the two data frames but hold the same variable. We can name each key with by.x and by.y.

Players <- c("Drogba", "Klose", "Messi", "Ronaldo")
Fees <- c(102, 225, 400, 430)
salary <- data.frame(Players, Fees)
merge(players3, salary, by.x = "name", by.y = "Players")
     name   country age cap Fees
1   Klose   Germany  36  No  225
2   Messi Argentina  27  No  400
3 Ronaldo  Portugal  28 Yes  430
merge(players3, salary, by.x = "name", by.y = "Players", all = TRUE)
     name   country age  cap Fees
1  Drogba      <NA>  NA <NA>  102
2   Klose   Germany  36   No  225
3   Messi Argentina  27   No  400
4 Ronaldo  Portugal  28  Yes  430
5  Rooney   England  29  Yes   NA
6  Zlatan    Sweden  24  Yes   NA
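
Since merge() works on two data frames at a time, merging several means chaining the calls. One possible sketch, assuming every data frame shares the same key column, is to let Reduce() do the chaining. The three small data frames here are invented for illustration.

```r
countries <- data.frame(name = c("Ronaldo", "Messi"),
                        country = c("Portugal", "Argentina"))
ages      <- data.frame(name = c("Ronaldo", "Messi"), age = c(28, 27))
caps      <- data.frame(name = c("Ronaldo", "Messi"), cap = c("Yes", "No"))

# Reduce(merge, list(a, b, c)) is the same as merge(merge(a, b), c).
all_info <- Reduce(merge, list(countries, ages, caps))
all_info
```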

The Apply Family Functions of R

These functions apply a function repeatedly to slices of data taken from matrices, lists and data frames. They let us work across the data in several ways without writing an explicit loop.

There are 4 commonly used apply functions in R -

  • apply()
  • lapply()
  • sapply()
  • tapply()

Note that there are other apply functions as well, which are less commonly used. They are -

  • rapply
  • mapply
  • vapply
  • eapply

apply()

First, let's run ?apply

Usage
apply(X, MARGIN, FUN, …)

Where
X is a matrix or array
MARGIN is 1 = rows, 2 = columns, c(1, 2) = rows and columns (that is, every cell)
FUN is the function to apply, e.g. sum, mean, etc.

tv <- matrix(c(3, 5, 6, 2, 3, 5, 4, 3, 2, 1, 6, 5, 4, 3, 5, 4, 2, 4, 2, 2, 5, 6, 4, 5, 5, 2, 3, 2, 1, 4 ,1, 4, 3, 4, 5), nrow = 7)
colnames(tv) <- c("Oken", "Khagem", "Inao", "Thoi", "Romeo")
rownames(tv) <- c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")
tv
          Oken Khagem Inao Thoi Romeo
Sunday       3      3    5    6     1
Monday       5      2    4    4     4
Tuesday      6      1    2    5     1
Wednesday    2      6    4    5     4
Thursday     3      5    2    2     3
Friday       5      4    2    3     4
Saturday     4      3    5    2     5
class(tv)
[1] "matrix"
max(tv[1, ])
[1] 6
max(tv[2, ])
[1] 5
max(tv[3, ])
[1] 6

We can also use a for loop to get the desired result.

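for(i in seq_len(nrow(tv))){
  weekday <- tv[i, ]       # viewing hours for day i
  day_max <- max(weekday)  # a variable named 'max' would reuse the function's name
  print(day_max)
}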
[1] 6
[1] 5
[1] 6
[1] 6
[1] 5
[1] 5
[1] 5

Instead of writing the loop ourselves, we can let the apply() function do the looping for us.

apply(tv, 1, max) # finding maximum value in each row
   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
        6         5         6         6         5         5         5 
apply(tv, 2, max) # finding maximum value in each column
  Oken Khagem   Inao   Thoi  Romeo 
     6      6      5      6      5 
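
The usage notes list MARGIN = c(1, 2), which the examples above do not show. As a small sketch on a made-up matrix m, that setting calls the function once per cell, so the result keeps the shape of the input.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

# MARGIN = c(1, 2): the function receives one cell at a time.
doubled <- apply(m, c(1, 2), function(v) v * 2)
doubled
```

For simple arithmetic like this, m * 2 gives the same answer directly; the cell-wise form is useful when the function is not already vectorised.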

Using apply() on a data frame

tv_df <- as.data.frame(tv)
class(tv_df)
[1] "data.frame"
str(tv_df)
'data.frame':   7 obs. of  5 variables:
 $ Oken  : num  3 5 6 2 3 5 4
 $ Khagem: num  3 2 1 6 5 4 3
 $ Inao  : num  5 4 2 4 2 2 5
 $ Thoi  : num  6 4 5 5 2 3 2
 $ Romeo : num  1 4 1 4 3 4 5
apply(tv_df, 1, mean)
   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8 
apply(tv_df, 2, mean)
    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857 

What happens if we add a new variable that is not numeric?

tv_df$Place <- c("Club", "Home", "School", "Home", "School", "Home", "Club")
str(tv_df)
'data.frame':   7 obs. of  6 variables:
 $ Oken  : num  3 5 6 2 3 5 4
 $ Khagem: num  3 2 1 6 5 4 3
 $ Inao  : num  5 4 2 4 2 2 5
 $ Thoi  : num  6 4 5 5 2 3 2
 $ Romeo : num  1 4 1 4 3 4 5
 $ Place : chr  "Club" "Home" "School" "Home" ...
apply(tv_df, 1, mean)
   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
       NA        NA        NA        NA        NA        NA        NA 
apply(tv_df[, 1:5], 1, mean)
   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8 
apply(tv_df[, 1:5], 2, mean)
    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857 
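
The NA results above come from how apply() works: it first converts the data frame to a matrix, and a matrix can hold only one type, so a single character column turns every value into text. The sketch below shows this on a small made-up data frame and picks the numeric columns by type rather than by position (sapply() is covered later in this document).

```r
df <- data.frame(a = c(1, 2), b = c(3, 4), label = c("x", "y"))

# One non-numeric column makes the whole matrix character.
is.character(as.matrix(df))

# Select the numeric columns by type, then take the row means.
num_cols  <- sapply(df, is.numeric)
row_means <- apply(df[, num_cols], 1, mean)
row_means
```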

Besides apply(), we can also get the means with rowMeans() and colMeans()

rowMeans(tv_df[, 1:5]) # mean of rows
   Sunday    Monday   Tuesday Wednesday  Thursday    Friday  Saturday 
      3.6       3.8       3.0       4.2       3.0       3.6       3.8 
colMeans(tv_df[, 1:5]) # mean of columns
    Oken   Khagem     Inao     Thoi    Romeo 
4.000000 3.428571 3.428571 3.857143 3.142857 

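Applying a custom function

# Divide each value by the mean of its row or column.
# Note: the name 'ave' masks stats::ave() for the rest of the session.
ave <- function(x){
  x/mean(x)
}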
apply(tv, 2, ave)
          Oken    Khagem      Inao      Thoi     Romeo
Sunday    0.75 0.8750000 1.4583333 1.5555556 0.3181818
Monday    1.25 0.5833333 1.1666667 1.0370370 1.2727273
Tuesday   1.50 0.2916667 0.5833333 1.2962963 0.3181818
Wednesday 0.50 1.7500000 1.1666667 1.2962963 1.2727273
Thursday  0.75 1.4583333 0.5833333 0.5185185 0.9545455
Friday    1.25 1.1666667 0.5833333 0.7777778 1.2727273
Saturday  1.00 0.8750000 1.4583333 0.5185185 1.5909091
apply(tv, 1, ave)
          Sunday    Monday   Tuesday Wednesday  Thursday    Friday
Oken   0.8333333 1.3157895 2.0000000 0.4761905 1.0000000 1.3888889
Khagem 0.8333333 0.5263158 0.3333333 1.4285714 1.6666667 1.1111111
Inao   1.3888889 1.0526316 0.6666667 0.9523810 0.6666667 0.5555556
Thoi   1.6666667 1.0526316 1.6666667 1.1904762 0.6666667 0.8333333
Romeo  0.2777778 1.0526316 0.3333333 0.9523810 1.0000000 1.1111111
        Saturday
Oken   1.0526316
Khagem 0.7894737
Inao   1.3157895
Thoi   0.5263158
Romeo  1.3157895
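
Notice that the row-wise result above comes back with the days as columns. When the function returns a vector, apply() stores each result as a column, so a row-wise call is effectively transposed. A minimal sketch on a made-up matrix, with scale_by_mean as an illustrative name, shows how t() restores the original layout.

```r
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))
scale_by_mean <- function(x) x / mean(x)

by_row <- apply(m, 1, scale_by_mean)  # 3 x 2: one column per original row
dim(by_row)

restored <- t(by_row)                 # 2 x 3: same layout as m
restored
```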

lapply()

?lapply

Usage
lapply(X, FUN, …)

Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN

One of the big differences between apply() and lapply() is that lapply() always returns a list.

myWorkout <- list(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                  Biceps = c(20, 22, 20, 24, 25, 22, 24),
                  Squats = c(30, 33, 29, 30, 32, 33, 28))
myWorkout
$PushUps
[1] 12 12 10 12 15 13 14

$Biceps
[1] 20 22 20 24 25 22 24

$Squats
[1] 30 33 29 30 32 33 28
lapply(myWorkout, mean)
$PushUps
[1] 12.57143

$Biceps
[1] 22.42857

$Squats
[1] 30.71429

Let's use lapply() on a data frame

myWorkoutDF <- data.frame(PushUps = c(12, 12, 10, 12, 15, 13, 14),
                  Biceps = c(20, 22, 20, 24, 25, 22, 24),
                  Squats = c(30, 33, 29, 30, 32, 33, 28))
myWorkoutDF
  PushUps Biceps Squats
1      12     20     30
2      12     22     33
3      10     20     29
4      12     24     30
5      15     25     32
6      13     22     33
7      14     24     28
lapply(myWorkoutDF, mean)
$PushUps
[1] 12.57143

$Biceps
[1] 22.42857

$Squats
[1] 30.71429
colMeans(myWorkoutDF)
 PushUps   Biceps   Squats 
12.57143 22.42857 30.71429 
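
lapply() works on a data frame because a data frame is itself a list with one element per column. The sketch below checks this on a small made-up data frame and flattens the list result with unlist().

```r
workout <- data.frame(PushUps = c(12, 15), Squats = c(30, 28))

is.list(workout)                     # a data frame is a list of columns

col_ranges <- lapply(workout, range) # one (min, max) pair per column
col_ranges

col_means <- unlist(lapply(workout, mean))  # flatten to a named vector
col_means
```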
MyName <- c("My", "name", "is", "Loiyumba")
MyName
[1] "My"       "name"     "is"       "Loiyumba"
lapply(MyName, nchar)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 2

[[4]]
[1] 8

sapply()

If we don't want our output as a list, we can use sapply(), which simplifies the result to a vector or matrix where it can.

?sapply

Usage
sapply(X, FUN, …)

Where
X is list/vector/data frame
FUN = sum, mean, etc
… = optional arguments to FUN

sapply(myWorkout, max)
PushUps  Biceps  Squats 
     15      25      33 
sapply(myWorkoutDF, max)
PushUps  Biceps  Squats 
     15      25      33 
sapply(MyName, nchar)
      My     name       is Loiyumba 
       2        4        2        8 
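
vapply(), one of the less common functions listed earlier, behaves like sapply() but requires us to state what each result should look like, and it stops with an error if a result does not match. A short sketch comparing the three on the same input:

```r
words <- c("My", "name", "is", "Loiyumba")

as_list <- lapply(words, nchar)   # always a list
as_vec  <- sapply(words, nchar)   # simplified to a named vector
# FUN.VALUE = integer(1) declares "one integer per element".
checked <- vapply(words, nchar, FUN.VALUE = integer(1))

as_vec
checked
```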

Let's try sapply() on mtcars, one of R's built-in datasets.
Run ?mtcars to read about it.

str(mtcars) # R inbuilt data
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
sapply(mtcars[, c(1, 3:7)], mean)
       mpg       disp         hp       drat         wt       qsec 
 20.090625 230.721875 146.687500   3.596563   3.217250  17.848750 
sapply(mtcars[, c(1, 3:7)], max)
    mpg    disp      hp    drat      wt    qsec 
 33.900 472.000 335.000   4.930   5.424  22.900 
sapply(mtcars[, c(1, 3:7)], min)
   mpg   disp     hp   drat     wt   qsec 
10.400 71.100 52.000  2.760  1.513 14.500 

tapply()

?tapply
Usage
tapply(X, INDEX, FUN, …)

Where
X is a vector, e.g. a column of a data frame or an element of a list
INDEX is a factor (or a list of factors) used to split X into groups
FUN = sum, mean, etc
… = optional arguments to FUN

table(mtcars$cyl)

 4  6  8 
11  7 14 
tapply(mtcars$mpg, mtcars$cyl, mean)
       4        6        8 
26.66364 19.74286 15.10000 
tapply(mtcars$mpg, mtcars$cyl, max)
   4    6    8 
33.9 21.4 19.2 
table(mtcars$am)

 0  1 
19 13 
tapply(mtcars$mpg, mtcars$am, mean)
       0        1 
17.14737 24.39231 
tapply(mtcars$mpg, mtcars$am, min)
   0    1 
10.4 15.0 
table(mtcars$gear)

 3  4  5 
15 12  5 
tapply(mtcars$hp, mtcars$gear, max)
  3   4   5 
245 123 335 
tapply(mtcars$hp, mtcars$gear, mean)
       3        4        5 
176.1333  89.5000 195.6000 
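
INDEX can also be a list of factors, in which case tapply() returns one cell per combination of groups. As a sketch, the call below gives the mean mpg for each pairing of cyl and am. The list names cyl and am only label the output.

```r
# Rows are cylinder counts, columns are transmission type (0 or 1).
mpg_by_cyl_am <- tapply(mtcars$mpg,
                        list(cyl = mtcars$cyl, am = mtcars$am),
                        mean)
mpg_by_cyl_am
```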

Missing Values

Missing values are indicated by NA in R.
is.na() is the function that checks for missing values.

x <- c(10, 20, NA, 40, 50)
is.na(x)
[1] FALSE FALSE  TRUE FALSE FALSE
!is.na(x) # negate
[1]  TRUE  TRUE FALSE  TRUE  TRUE
sum(x)
[1] NA

If a vector contains NA, the result of a computation on it is NA as well. To leave the missing values out of the computation, we set the argument na.rm = TRUE in the function.

sum(x, na.rm = TRUE)
[1] 120
mean(x)
[1] NA
mean(x, na.rm = TRUE)
[1] 30
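
Before dropping missing values it helps to know whether there are any, how many, and where. A small sketch using base R functions on the same kind of vector:

```r
x <- c(10, 20, NA, 40, 50)

has_missing <- anyNA(x)         # is there at least one NA?
n_missing   <- sum(is.na(x))    # how many? (TRUE counts as 1)
where       <- which(is.na(x))  # at which positions?

c(has_missing = has_missing, n_missing = n_missing, where = where)
```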
patientID <- 1:10
patientName <- c("Keiku", "Bala" ,"Sadananda", "Gokul", "Bonny", "Soma", "Maya", "Abenao", "Artina", "Olen" )
patientGender <- c("Male", "Female", "Male", "Male", "Male",
                   "Female", "Female", "Female", "Female", "Male")
patientAge <- c(36, 26, 40, 35, 37, 23, 37, 32, 28, 42)
patient <- data.frame(patientID, patientName, patientGender, patientAge)
patient
   patientID patientName patientGender patientAge
1          1       Keiku          Male         36
2          2        Bala        Female         26
3          3   Sadananda          Male         40
4          4       Gokul          Male         35
5          5       Bonny          Male         37
6          6        Soma        Female         23
7          7        Maya        Female         37
8          8      Abenao        Female         32
9          9      Artina        Female         28
10        10        Olen          Male         42
patient[3, 4] <- NA
patient
   patientID patientName patientGender patientAge
1          1       Keiku          Male         36
2          2        Bala        Female         26
3          3   Sadananda          Male         NA
4          4       Gokul          Male         35
5          5       Bonny          Male         37
6          6        Soma        Female         23
7          7        Maya        Female         37
8          8      Abenao        Female         32
9          9      Artina        Female         28
10        10        Olen          Male         42
missing <- function(x){
  sum(is.na(x))
}
sapply(patient, missing)
    patientID   patientName patientGender    patientAge 
            0             0             0             1 
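
The per-column count above can also be obtained without a helper function. As a sketch on a made-up data frame: is.na() on a data frame returns a logical matrix, and colSums() then counts the TRUE values in each column.

```r
df <- data.frame(id = 1:3,
                 age = c(36, NA, 40),
                 name = c("a", NA, NA))

na_counts <- colSums(is.na(df))  # missing values per column
na_counts
```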
tapply(patient$patientAge, patient$patientGender, mean)
Female   Male 
  29.2     NA 
tapply(patient$patientAge, patient$patientGender, mean, na.rm = TRUE)
Female   Male 
  29.2   37.5 
patient$patientTreatment <- c("A", "A", "D", "C", "A", "B", "C", "D", "C", "B")
str(patient)
'data.frame':   10 obs. of  5 variables:
 $ patientID       : int  1 2 3 4 5 6 7 8 9 10
 $ patientName     : Factor w/ 10 levels "Abenao","Artina",..: 6 3 9 5 4 10 7 1 2 8
 $ patientGender   : Factor w/ 2 levels "Female","Male": 2 1 2 2 2 1 1 1 1 2
 $ patientAge      : num  36 26 NA 35 37 23 37 32 28 42
 $ patientTreatment: chr  "A" "A" "D" "C" ...
tapply(patient$patientAge, patient$patientTreatment, max)
 A  B  C  D 
37 42 37 NA 
tapply(patient$patientAge, patient$patientTreatment, max, na.rm = TRUE)
 A  B  C  D 
37 42 37 32 
tapply(patient$patientAge, patient$patientTreatment, min, na.rm = TRUE)
 A  B  C  D 
26 23 28 32 

Or we can remove the observations that have missing values from the data frame altogether.

head(na.omit(patient))
  patientID patientName patientGender patientAge patientTreatment
1         1       Keiku          Male         36                A
2         2        Bala        Female         26                A
4         4       Gokul          Male         35                C
5         5       Bonny          Male         37                A
6         6        Soma        Female         23                B
7         7        Maya        Female         37                C
complete <- complete.cases(patient) # TRUE for rows with no missing values
patient[complete, ]
   patientID patientName patientGender patientAge patientTreatment
1          1       Keiku          Male         36                A
2          2        Bala        Female         26                A
4          4       Gokul          Male         35                C
5          5       Bonny          Male         37                A
6          6        Soma        Female         23                B
7          7        Maya        Female         37                C
8          8      Abenao        Female         32                D
9          9      Artina        Female         28                C
10        10        Olen          Male         42                B
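
Both approaches keep the same rows. The sketch below checks that on a small made-up data frame. One difference is that na.omit() also records which rows it dropped in an attribute called "na.action".

```r
df <- data.frame(id = 1:4, age = c(36, NA, 35, 37))

kept_omit     <- na.omit(df)
kept_complete <- df[complete.cases(df), ]

identical(kept_omit$id, kept_complete$id)  # same rows survive
attr(kept_omit, "na.action")               # the row that was dropped
```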

Dates and Times in R

Dates are represented by the Date class.

date <- "01-05-1990"
class(date)
[1] "character"
date <- as.Date(date, format = "%d-%m-%Y")
class(date)
[1] "Date"
date
[1] "1990-05-01"
unclass(date)
[1] 7425
oldDate <- as.Date("1970-01-10")
unclass(oldDate)
[1] 9
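
As the unclass() output suggests, a Date is stored as the number of days since 1970-01-01. The sketch below goes in both directions and shows that adding a number to a Date moves it by that many days. The name d is just illustrative.

```r
d <- as.Date("01-05-1990", format = "%d-%m-%Y")

days_since_epoch <- as.numeric(d)                          # Date to number
back_to_date     <- as.Date(7425, origin = "1970-01-01")   # number to Date

d + 30                  # thirty days later
format(d, "%Y/%m/%d")   # print the date in another layout
```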

Times are represented by the POSIXct or the POSIXlt class.

now <- Sys.time()
now
[1] "2016-07-05 13:03:55 IST"
class(now)
[1] "POSIXct" "POSIXt" 
unclass(now)
[1] 1467704035
guess <- 220000000
class(guess) <- c("POSIXct", "POSIXt")
guess
[1] "1976-12-21 12:36:40 IST"
guess <- as.POSIXlt(guess)
names(unclass(guess))
 [1] "sec"    "min"    "hour"   "mday"   "mon"    "year"   "wday"  
 [8] "yday"   "isdst"  "zone"   "gmtoff"
guess$wday
[1] 2
now - guess
Time difference of 14441.02 days
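
To summarise the two classes: POSIXct stores a time as the number of seconds since 1970-01-01 00:00:00 UTC, while POSIXlt stores it as a list of components. The sketch below builds the same instant with as.POSIXct() instead of setting the class by hand, and fixes the time zone to UTC so the printed value does not depend on the machine (the outputs above are shown in IST).

```r
# 220000000 seconds after the epoch, displayed in UTC.
t1 <- as.POSIXct(220000000, origin = "1970-01-01", tz = "UTC")
format(t1, "%Y-%m-%d %H:%M:%S")

# POSIXlt components: year counts from 1900, mon from 0, wday from Sunday = 0.
lt <- as.POSIXlt(t1)
c(year = lt$year + 1900, month = lt$mon + 1, day = lt$mday, wday = lt$wday)

# Adding to a POSIXct adds seconds; difftime() lets us choose the units.
t2 <- t1 + 2 * 24 * 60 * 60
difftime(t2, t1, units = "hours")
```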

Dates come in many different styles. To work with them, we use the strptime() function, which takes a format string describing the layout of the text. Run ?strptime to see the format codes.

dates <- c("December 25, 2014 11:45", "January 25, 2015 23:30")
dates
[1] "December 25, 2014 11:45" "January 25, 2015 23:30" 
class(dates)
[1] "character"
new_dates <- strptime(dates, format = "%B %d, %Y %H:%M")
class(new_dates)
[1] "POSIXlt" "POSIXt" 
first_date <- as.Date("1996-09-28")
second_date <- as.Date("1996-10-15")
second_date - first_date
Time difference of 17 days
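
Format codes such as %B (full month name) depend on the locale R is running in, so text like "December" may fail to parse on a machine set to another language. As a hedged sketch, the example below uses numeric codes only and a fixed time zone, then measures the gap in chosen units. The input strings are made up for illustration.

```r
stamps <- strptime(c("25/12/2014 11:45", "25/01/2015 23:30"),
                   format = "%d/%m/%Y %H:%M", tz = "UTC")

# Difference between the two timestamps, expressed in hours.
gap_hours <- difftime(stamps[2], stamps[1], units = "hours")
gap_hours

# Date differences can be turned into plain numbers with as.numeric().
gap_days <- as.numeric(as.Date("1996-10-15") - as.Date("1996-09-28"))
gap_days
```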