Introduction to R Lecture Notes

Mindy Fang

Jan 08, 2019

Installing R

Installing RStudio

Open an R Markdown document

Arithmetic with R

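In its most basic form, R can be used as a simple calculator. Consider the following arithmetic operators: addition (+), subtraction (-), multiplication (*), division (/), exponentiation (^), and modulo (%%).

The last two might need some explaining: ^ raises the number on its left to the power of the number on its right (2^5 is 32), and %% returns the remainder of dividing the number on its left by the number on its right (28 %% 6 is 4).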

Exercise

An addition

5 + 5 
## [1] 10

A subtraction

5 - 5 
## [1] 0

A multiplication

3 * 5
## [1] 15

An exponentiation

2^5
## [1] 32

A modulo operation

28 %% 6 
## [1] 4
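A related operator is integer division, %/%. As a small sketch (the variable names quotient and remainder are made up for this illustration), the quotient and the remainder together recover the original number:

```r
# Integer division (%/%) gives the whole-number quotient,
# modulo (%%) gives the remainder.
quotient  <- 28 %/% 6
remainder <- 28 %% 6
quotient
## [1] 4
remainder
## [1] 4
# The two pieces reconstruct the original value.
quotient * 6 + remainder
## [1] 28
```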

Variable assignment

A basic concept in (statistical) programming is called a variable.

A variable allows you to store a value (e.g. 4) or an object (e.g. a function description) in R. You can then later use this variable’s name to easily access the value or the object that is stored within this variable.

You can assign a value 4 to a variable my_var with the command

my_var <- 4

x <- 42

x
## [1] 42

Variable assignment (2)

You have 5 apples and 6 oranges, and you want to calculate how many pieces of fruit you have in total.

my_apples <- 5
my_oranges <- 6

my_apples + my_oranges
## [1] 11

Basic data types in R

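R works with numerous data types. Some of the most basic types to get started are numerics (decimal values such as 42.5), characters (text values such as "some text"), and logicals (TRUE or FALSE).

Assign a value of each type to a variable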

my_numeric <- 42.5

my_character <- "some text"

my_logical <- TRUE

Check the data type

You can check the data type of a variable with the class() function.

Declare variables of different types

my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE 

Check the class of each variable

class(my_numeric)
## [1] "numeric"
class(my_character)
## [1] "character"
class(my_logical)
## [1] "logical"
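As a sketch of how numerics and integers relate (my_integer is a name made up for this illustration), the L suffix asks R to store a whole number as an integer, which still counts as numeric:

```r
my_integer <- 42L        # the L suffix requests an integer
class(my_integer)
## [1] "integer"
is.numeric(my_integer)   # integers are also numeric
## [1] TRUE
class(42)                # without the suffix, R stores a double
## [1] "numeric"
```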

Create a vector

Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data. In other words, a vector is a simple tool to store data. In R, you create a vector with the combine function c(). You place the vector elements, separated by commas, between the parentheses. For example:

numeric_vector <- c(1, 10, 49)
character_vector <- c("a", "b", "c")

boolean_vector <- c(TRUE, TRUE, FALSE, TRUE)
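A vector holds a single data type. As a sketch of what happens when types are mixed (mixed_vector is a name made up for this illustration), c() converts every element to the most general type, and quoted values are always text:

```r
mixed_vector <- c(1, "a", TRUE)
mixed_vector
## [1] "1"    "a"    "TRUE"
class(mixed_vector)
## [1] "character"
# Quoted "TRUE" is text, unquoted TRUE is a logical value.
class(c("TRUE", "FALSE"))
## [1] "character"
class(c(TRUE, FALSE))
## [1] "logical"
```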

Naming a vector

You can give a name to the elements of a vector with the names() function.

some_vector <- c("mindy fang", "female")
names(some_vector) <- c("Name", "Gender")

This code first creates a vector some_vector and then gives the two elements a name. The first element is assigned the name Name, while the second element is labeled Gender. Printing the contents to the console yields the following output:

print(some_vector)
##         Name       Gender 
## "mindy fang"     "female"
some_vector[1]
##         Name 
## "mindy fang"
some_vector[2]
##   Gender 
## "female"

Arithmetic calculations on vectors

It is important to know that if you sum two vectors in R, it takes the element-wise sum. For example, the following three statements are completely equivalent:

c(1, 2, 3) + c(4, 5, 6)
## [1] 5 7 9
c(1 + 4, 2 + 5, 3 + 6)
## [1] 5 7 9
c(5, 7, 9)
## [1] 5 7 9

You can also do the calculations with variables that represent vectors:

a <- c(1, 2, 3) 
b <- c(4, 5, 6)
total <- a + b   # named total rather than c, to avoid confusion with the c() function
total
## [1] 5 7 9
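The element-wise rule also covers vectors of different lengths. As a sketch (short_vec and long_vec are names made up for this illustration), R recycles the shorter vector until it matches the longer one:

```r
short_vec <- c(10, 20)
long_vec  <- c(1, 2, 3, 4)
long_vec + short_vec   # short_vec is reused as 10, 20, 10, 20
## [1] 11 22 13 24
long_vec * 2           # a single number is recycled for every element
## [1] 2 4 6 8
```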

Vector selections

We can select the elements from a vector directly:

days_vector <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

Time spent (minutes) in housework + kids for me and my husband

me_housework_time <- c(120.5, 110, 130, 110, 100, 230, 300)
husband_housework_time <- c(0, 0, 10, 10.5, 0, 300, 360)
names(me_housework_time) <- days_vector
names(husband_housework_time) <- days_vector

How many minutes have I worked on Thursday and Friday?

me_housework_time[c(4,5)]
## Thu Fri 
## 110 100

How many minutes has my husband worked on Thursday and Friday?

husband_housework_time[c("Thu", "Fri")]
##  Thu  Fri 
## 10.5  0.0

Vector selections by comparison

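The (logical) comparison operators known to R are < (less than), > (greater than), <= (less than or equal to), >= (greater than or equal to), == (equal to), and != (not equal to). Comparing two vectors returns a logical vector, which can then be used to select elements: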

me_housework_time > husband_housework_time
##   Mon   Tue   Wed   Thu   Fri   Sat   Sun 
##  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
selection_vector <- me_housework_time > husband_housework_time
me_housework_time[selection_vector]
##   Mon   Tue   Wed   Thu   Fri 
## 120.5 110.0 130.0 110.0 100.0
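The helper variable is optional. As a sketch (the shortened names days and me are assumptions made to keep the example self-contained), a condition can be placed directly inside the brackets, and which() reports the positions where it holds:

```r
days <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
me <- c(120.5, 110, 130, 110, 100, 230, 300)
names(me) <- days
# Select directly with a condition.
me[me > 200]
## Sat Sun 
## 230 300 
# which() returns the (named) positions where the condition is TRUE.
which(me > 200)
## Sat Sun 
##   6   7 
# names() of the selection gives only the day labels.
names(me[me > 200])
## [1] "Sat" "Sun"
```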

Matrix

In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional. You can construct a matrix in R with the matrix() function.

matrix(1:9, byrow = TRUE, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9
matrix(1:9, byrow = FALSE, nrow = 3)
##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

In the matrix() function, the first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:9, which is a shortcut for c(1, 2, 3, 4, 5, 6, 7, 8, 9). The argument byrow = TRUE indicates that the matrix is filled row by row. If we want the matrix to be filled column by column, we set byrow = FALSE. The third argument, nrow, indicates that the matrix should have three rows.

Create a matrix

We can now combine the two vectors, me_housework_time and husband_housework_time, into a matrix.

matrix_housework_time <- matrix(cbind(me_housework_time,
                                      husband_housework_time),ncol=2)
colnames(matrix_housework_time) <- c("Me", "Husband")
rownames(matrix_housework_time) <- days_vector
matrix_housework_time
##        Me Husband
## Mon 120.5     0.0
## Tue 110.0     0.0
## Wed 130.0    10.0
## Thu 110.0    10.5
## Fri 100.0     0.0
## Sat 230.0   300.0
## Sun 300.0   360.0
sum_housework_time <- matrix_housework_time[,"Me"] + 
  matrix_housework_time[,"Husband"]
sum_housework_time
##   Mon   Tue   Wed   Thu   Fri   Sat   Sun 
## 120.5 110.0 140.0 120.5 100.0 530.0 660.0
sum_housework_time <- rowSums(matrix_housework_time)
sum_housework_time
##   Mon   Tue   Wed   Thu   Fri   Sat   Sun 
## 120.5 110.0 140.0 120.5 100.0 530.0 660.0

Add columns

Add a new column to the above matrix

newmatrix1_housework_time <- cbind(matrix_housework_time,
                                   sum_housework_time)
newmatrix1_housework_time 
##        Me Husband sum_housework_time
## Mon 120.5     0.0              120.5
## Tue 110.0     0.0              110.0
## Wed 130.0    10.0              140.0
## Thu 110.0    10.5              120.5
## Fri 100.0     0.0              100.0
## Sat 230.0   300.0              530.0
## Sun 300.0   360.0              660.0

Add rows

Add a new row to the matrix

newmatrix2_housework_time <- rbind(matrix_housework_time,
                                   colSums(matrix_housework_time))
newmatrix2_housework_time 
##         Me Husband
## Mon  120.5     0.0
## Tue  110.0     0.0
## Wed  130.0    10.0
## Thu  110.0    10.5
## Fri  100.0     0.0
## Sat  230.0   300.0
## Sun  300.0   360.0
##     1100.5   680.5

Selection of matrix elements

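Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example, matrix_housework_time[1, 2] selects the element in the first row and second column.

If you want to select all elements of a row or a column, leave the position after or before the comma empty, respectively: matrix_housework_time[1, ] selects the whole first row and matrix_housework_time[, 1] the whole first column. A range such as 1:5 selects several rows or columns at once: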

matrix_housework_time[1:5,1:2]
##        Me Husband
## Mon 120.5     0.0
## Tue 110.0     0.0
## Wed 130.0    10.0
## Thu 110.0    10.5
## Fri 100.0     0.0
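The selection forms can be sketched on a small matrix (m is a name made up for this illustration):

```r
m <- matrix(1:9, byrow = TRUE, nrow = 3)
m[2, 3]           # single element: row 2, column 3
## [1] 6
m[1, ]            # whole first row
## [1] 1 2 3
m[, 2]            # whole second column
## [1] 2 5 8
m[1:2, c(1, 3)]   # rows 1 and 2, columns 1 and 3
##      [,1] [,2]
## [1,]    1    3
## [2,]    4    6
```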

Calculation with matrices

The standard operators like +, -, /, *, etc. work in an element-wise way on matrices in R. Transform the housework time matrix from minutes to hours (and round to 1 decimal place):

round(matrix_housework_time/60,1)
##      Me Husband
## Mon 2.0     0.0
## Tue 1.8     0.0
## Wed 2.2     0.2
## Thu 1.8     0.2
## Fri 1.7     0.0
## Sat 3.8     5.0
## Sun 5.0     6.0

Note that my_matrix1 * my_matrix2 creates a matrix where each element is the product of the corresponding elements in my_matrix1 and my_matrix2, whilst my_matrix1 %*% my_matrix2 gives the matrix algebra multiplication.

matrixA <- matrix(c(1,2,3,4),nrow=2)
matrixA
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
matrixB <- matrix(c(1,2,3,4),nrow=2)
matrixB
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
matrixA*matrixB
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16
matrixA%*%matrixB
##      [,1] [,2]
## [1,]    7   15
## [2,]   10   22

Factor

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to only a limited number of categories. A continuous variable, on the other hand, can take an infinite number of values. To create factors in R, you use the function factor(). The first thing you have to do is create a vector that contains all the observations that belong to a limited number of categories. For example, sex_vector contains the sex of 5 different individuals:

sex_vector <- c("Male", "Female", "Female", "Male", "Male")
a <- factor(sex_vector)
levels(a) 
## [1] "Female" "Male"
summary(sex_vector)
##    Length     Class      Mode 
##         5 character character
summary(a)
## Female   Male 
##      2      3

Ordered factors

Sometimes you will deal with factors whose categories have a natural ordering. If this is the case, we have to make sure that we pass this information to R. For example, speed_vector should be converted to an ordinal factor since its categories have a natural ordering. By default, the function factor() transforms speed_vector into an unordered factor. To create an ordered factor, you have to add two additional arguments: ordered and levels.

speed_vector <- c("medium", "slow", "slow", "medium", "fast")
factor_speed_vector <- factor(speed_vector, ordered = TRUE, 
                              levels = c("slow", "medium", "fast"))
factor_speed_vector
## [1] medium slow   slow   medium fast  
## Levels: slow < medium < fast

The fact that factor_speed_vector is now ordered enables us to compare different elements.

speed_vector[1] > speed_vector[2]
## [1] FALSE
factor_speed_vector[1] > factor_speed_vector[2]
## [1] TRUE
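As a sketch of why the ordered comparison works (speed_factor is a name made up for this illustration), each value is stored as its position in levels, and an ordered factor can also be compared against a level given by name:

```r
speed_vector <- c("medium", "slow", "slow", "medium", "fast")
speed_factor <- factor(speed_vector, ordered = TRUE,
                       levels = c("slow", "medium", "fast"))
as.integer(speed_factor)    # position of each value in levels
## [1] 2 1 1 2 3
speed_factor >= "medium"    # compare against a level by name
## [1]  TRUE FALSE FALSE  TRUE  TRUE
```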

Data frame

All the elements that you put in a matrix should be of the same type. You will often find yourself working with data sets that contain different data types instead of only one. A data frame has the variables of a data set as columns and the observations as rows. A data frame can contain variables of different data types. Create a data frame:

dataframe_housework_time <- data.frame(days_vector,
                                       matrix_housework_time, 
                                       c("Weekday", "Weekday", 
                                         "Weekday", "Weekday", 
                                         "Weekday", "Weekend", 
                                         "Weekend"))
colnames(dataframe_housework_time) <- c("Day", "Me", "Husband", 
                                        "Weekday")
dataframe_housework_time[1,]
##     Day    Me Husband Weekday
## Mon Mon 120.5       0 Weekday

Data frame elements selection

We can select the elements of the data frame as we did with matrices.

dataframe_housework_time[1:2, 2:3]
##        Me Husband
## Mon 120.5       0
## Tue 110.0       0

We can also select the columns by the shortcut:

dataframe_housework_time$Me
## [1] 120.5 110.0 130.0 110.0 100.0 230.0 300.0

Subsetting and sorting data frames

subset(dataframe_housework_time, Husband==0)
##     Day    Me Husband Weekday
## Mon Mon 120.5       0 Weekday
## Tue Tue 110.0       0 Weekday
## Fri Fri 100.0       0 Weekday
subset(dataframe_housework_time, Weekday=="Weekend")
##     Day  Me Husband Weekday
## Sat Sat 230     300 Weekend
## Sun Sun 300     360 Weekend
subset(dataframe_housework_time, Me>200)
##     Day  Me Husband Weekday
## Sat Sat 230     300 Weekend
## Sun Sun 300     360 Weekend

Sorting

You can sort your data according to a certain variable in the data set. In R, this is done with the help of the function order(). When applied to a variable such as a vector, order() returns the positions of the elements in sorted order, that is, the position of the smallest element first, then the position of the second smallest, and so on:

a <- c(100, 10, 1000)
order(a)
## [1] 2 1 3

10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a is the second smallest element, so 1 comes second in the output of order(a). This means we can use the output of order(a) to reshuffle a:

a[order(a)]
## [1]   10  100 1000
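To sort from largest to smallest, order() accepts a decreasing argument. A short sketch on the same vector, with sort() shown as a shortcut for when only the sorted values are needed:

```r
a <- c(100, 10, 1000)
order(a, decreasing = TRUE)
## [1] 3 1 2
a[order(a, decreasing = TRUE)]
## [1] 1000  100   10
sort(a)    # returns the sorted values directly
## [1]   10  100 1000
```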

Sorting a data frame

You can rearrange a data frame according to a particular column. For example, let us rearrange the housework time data frame according to “Me”.

dataframe_housework_time[order(dataframe_housework_time$Me),]
##     Day    Me Husband Weekday
## Fri Fri 100.0     0.0 Weekday
## Tue Tue 110.0     0.0 Weekday
## Thu Thu 110.0    10.5 Weekday
## Mon Mon 120.5     0.0 Weekday
## Wed Wed 130.0    10.0 Weekday
## Sat Sat 230.0   300.0 Weekend
## Sun Sun 300.0   360.0 Weekend
dataframe_housework_time[order(dataframe_housework_time$Me),"Day"]
## [1] Fri Tue Thu Mon Wed Sat Sun
## Levels: Fri Mon Sat Sun Thu Tue Wed

List

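Why do we need lists? Let us do a quick recap of what we have learned so far: vectors are one-dimensional and hold elements of a single data type, matrices are two-dimensional and also hold a single data type, and data frames are two-dimensional with columns that may have different data types.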

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. It is not even required that these objects are related to each other in any way.

Creating a list

To construct a list you use the function list():

comp1 <- c(1,2)
comp2 <- matrix(1:4, nrow=2)
my_list <- list(comp1, comp2)
my_list
## [[1]]
## [1] 1 2
## 
## [[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

The arguments to the list function are the list components. These components can be matrices, vectors, other lists, etc. You can also name the components in your list:

my_list <- list(name1=comp1, name2=comp2)
my_list
## $name1
## [1] 1 2
## 
## $name2
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Selecting elements from a list

One way to select a component is using the numbered position of that component.

my_list[[1]]
## [1] 1 2

You can also refer to the names of the components, with [[ ]] or with the $ sign.

my_list[["name1"]]
## [1] 1 2
my_list$name1
## [1] 1 2

Adding new components to a list

You can add more components to an existing list with c().

my_newlist <- c(my_list, "new contents")
my_newlist
## $name1
## [1] 1 2
## 
## $name2
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
## 
## [[3]]
## [1] "new contents"
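A new component can also be given a name. The sketch below (name3 and name4 are made-up names) adds one component with $ and one with c(); wrapping the new component in list() matters when it is a vector, because c() would otherwise add each vector element as its own component:

```r
my_list <- list(name1 = c(1, 2), name2 = matrix(1:4, nrow = 2))
# Add a named component with $ ...
my_list$name3 <- "new contents"
# ... or with c(), wrapping the vector in list() so that it is
# added as one component rather than element by element.
my_list <- c(my_list, list(name4 = c(10, 20, 30)))
names(my_list)
## [1] "name1" "name2" "name3" "name4"
length(my_list)
## [1] 4
```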

Conditionals and control flow - Equality

The following statements all evaluate to TRUE. Notice from the last expression that R is case sensitive: M is not equal to m.

3 == (2 + 1)
## [1] TRUE
TRUE != FALSE
## [1] TRUE
"Mindy" != "mindy"
## [1] TRUE

More comparisons

TRUE == FALSE
## [1] FALSE
"useR" == "user"
## [1] FALSE
TRUE == 1
## [1] TRUE
FALSE == 0
## [1] TRUE

Greater and less than

Besides < (less than) and > (greater than), you can add an equal sign to express less than or equal to (<=) or greater than or equal to (>=), respectively. Have a look at the following R expressions, which all evaluate to FALSE:

(1 + 2) > 4
## [1] FALSE
TRUE <= FALSE
## [1] FALSE

More comparisons

"rain" < "raining"
## [1] TRUE
"Mindy" > "mindy"
## [1] TRUE
TRUE > FALSE
## [1] TRUE

Compare vectors

Without having to change anything about the syntax, R’s relational operators also work on vectors.

dataframe_housework_time$Me > dataframe_housework_time$Husband
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
dataframe_housework_time$Husband == 0
## [1]  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
sum(dataframe_housework_time$Husband == 0)
## [1] 3
dataframe_housework_time[,c("Me", "Husband")] > 300
##        Me Husband
## Mon FALSE   FALSE
## Tue FALSE   FALSE
## Wed FALSE   FALSE
## Thu FALSE   FALSE
## Fri FALSE   FALSE
## Sat FALSE   FALSE
## Sun FALSE    TRUE

& (and) and | (or)

Like relational operators, logical operators work perfectly fine with vectors and matrices.

dataframe_housework_time$Me > dataframe_housework_time$Husband &
  dataframe_housework_time$Weekday == "Weekday"
## [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
dataframe_housework_time$Weekday == "Weekday" |
  dataframe_housework_time$Weekday == "Weekend"
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

We can reverse the result by using !.

!TRUE
## [1] FALSE
x <- 5
y <- 7
!(!(x < 4) & !!!(y > 12))
## [1] FALSE
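When a whole logical vector has to be reduced to a single TRUE or FALSE, any() and all() are useful companions to these operators. A minimal sketch, with the husband's minutes copied into a standalone vector (husband is a name made up for this illustration):

```r
husband <- c(0, 0, 10, 10.5, 0, 300, 360)
any(husband == 0)    # is at least one element TRUE?
## [1] TRUE
all(husband >= 0)    # are all elements TRUE?
## [1] TRUE
all(husband > 0)
## [1] FALSE
```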

The if statement

for(i in 1:7){
  if(dataframe_housework_time$Husband[i]==0){
    print(paste("Wife is upset on", dataframe_housework_time$Day[i]))
  }
}
## [1] "Wife is upset on Mon"
## [1] "Wife is upset on Tue"
## [1] "Wife is upset on Fri"

The if else statement

We can also add an else statement:

for(i in 1:7){
  if(dataframe_housework_time$Husband[i]==0){
    print(paste("Wife is upset on", dataframe_housework_time$Day[i], 
                "because husband doesn't help"))
  }
  else{
    print(paste("Husband has helped on", dataframe_housework_time$Day[i]))
  }
}
## [1] "Wife is upset on Mon because husband doesn't help"
## [1] "Wife is upset on Tue because husband doesn't help"
## [1] "Husband has helped on Wed"
## [1] "Husband has helped on Thu"
## [1] "Wife is upset on Fri because husband doesn't help"
## [1] "Husband has helped on Sat"
## [1] "Husband has helped on Sun"

The else if statement

You can add as many else if statements as you like.

for(i in 1:7){
  if(dataframe_housework_time$Husband[i]==0){
    print(paste("Wife is upset on", dataframe_housework_time$Day[i], 
                "because husband doesn't help"))
  }
  else if(dataframe_housework_time$Husband[i] < 30){
    print(paste("Husband has helped very little on", 
                dataframe_housework_time$Day[i]))
  }
  else {
    print(paste("Husband has helped on", 
                dataframe_housework_time$Day[i]))
  }
}
## [1] "Wife is upset on Mon because husband doesn't help"
## [1] "Wife is upset on Tue because husband doesn't help"
## [1] "Husband has helped very little on Wed"
## [1] "Husband has helped very little on Thu"
## [1] "Wife is upset on Fri because husband doesn't help"
## [1] "Husband has helped on Sat"
## [1] "Husband has helped on Sun"

For loop

for(i in dataframe_housework_time$Day){
  print(i)
}
## [1] "Mon"
## [1] "Tue"
## [1] "Wed"
## [1] "Thu"
## [1] "Fri"
## [1] "Sat"
## [1] "Sun"
for(i in 1:7){
  print(dataframe_housework_time$Husband[i])
  if(dataframe_housework_time$Husband[i] > 10) break
}
## [1] 0
## [1] 0
## [1] 10
## [1] 10.5
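While break leaves the loop entirely, next skips only the current iteration. A sketch under the assumption that we want to count the days with any housework (husband and worked_days are made-up names); seq_along() is used so the loop adapts to the vector length instead of hard-coding 1:7:

```r
husband <- c(0, 0, 10, 10.5, 0, 300, 360)
worked_days <- 0
for (i in seq_along(husband)) {
  if (husband[i] == 0) next        # skip days without housework
  worked_days <- worked_days + 1   # count the remaining days
}
worked_days
## [1] 4
```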

While loop

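Let us simulate the interaction between a driver and a driver’s assistant: while the speed is too high, a warning is printed to the console and the speed is reduced. Above 48, “Slow down big time!” is printed and the speed drops by 11; otherwise “Slow down!” is printed and the speed drops by 6. The initial speed is 64.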

speed <- 64
while (speed > 30) {
  print(paste("Your speed is",speed))
  if (speed > 48) {
    print("Slow down big time!")
    speed <- speed - 11
  } else {
    print("Slow down!")
    speed <- speed - 6
  }
}
## [1] "Your speed is 64"
## [1] "Slow down big time!"
## [1] "Your speed is 53"
## [1] "Slow down big time!"
## [1] "Your speed is 42"
## [1] "Slow down!"
## [1] "Your speed is 36"
## [1] "Slow down!"

Functions

All the relevant details such as a description, usage, and arguments for a function can be found in the documentation. For example, you can use one of the following R commands:

help(mean)
?mean

You can also inspect the arguments of the mean() function by

args(mean)

Use the function

mean(dataframe_housework_time$Me)
## [1] 157.2143
sum(dataframe_housework_time$Husband)
## [1] 680.5
median(dataframe_housework_time$Me)
## [1] 120.5

Writing functions

Here is a function template:

my_fun <- function(arg1, arg2...) {
  body
}

Creating a function in R is basically the assignment of a function object to a variable. In the recipe above, you’re creating a new R variable my_fun, which becomes available in the workspace as soon as you execute the definition. From then on, you can use my_fun as a function.

my_fun1 <- function(x){
  x*2
}
my_fun1(10)
## [1] 20
my_fun2 <- function(x,y){
  x + y
}
my_fun2(10, 30)
## [1] 40
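Arguments can be given default values, which are used whenever the caller leaves them out. A minimal sketch (to_hours and its digits argument are made up for this illustration):

```r
# Convert minutes to hours, rounded to `digits` decimal places.
to_hours <- function(minutes, digits = 1) {
  round(minutes / 60, digits)
}
to_hours(110)               # uses the default digits = 1
## [1] 1.8
to_hours(110, digits = 2)   # overrides the default
## [1] 1.83
```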

lapply

From the lapply() help document, the usage section shows the following expression:

lapply(X, FUN, ...)

lapply takes a vector or list X, and applies the function FUN to each of its members. If FUN requires additional arguments, you pass them after you’ve specified X and FUN (...). The output of lapply() is a list, the same length as X, where each element is the result of applying FUN on the corresponding element of X. Here is an exercise to apply our self-defined function my_fun1 on the Husband’s housework time vector.

t(lapply(dataframe_housework_time$Husband, my_fun1))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0    0    20   21   0    600  720

You can also put the function directly into the lapply(). The above is the same as the following:

t(lapply(dataframe_housework_time$Husband, function(x){x*2}))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0    0    20   21   0    600  720

Use lapply with additional arguments

lapply() provides a way to handle functions that require more than one argument, for example:

my_fun3 <- function(x, factor){
  x > factor
}
lapply(dataframe_housework_time$Husband, my_fun3, factor=0)
## [[1]]
## [1] FALSE
## 
## [[2]]
## [1] FALSE
## 
## [[3]]
## [1] TRUE
## 
## [[4]]
## [1] TRUE
## 
## [[5]]
## [1] FALSE
## 
## [[6]]
## [1] TRUE
## 
## [[7]]
## [1] TRUE
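Because lapply() always returns a list, the printed result is long. As a sketch (husband and result are made-up names, and the anonymous function stands in for my_fun3 with factor = 0), unlist() flattens such a list of single values back into a vector:

```r
husband <- c(0, 0, 10, 10.5, 0, 300, 360)
result <- lapply(husband, function(x) x > 0)
class(result)
## [1] "list"
unlist(result)    # flatten the list into a logical vector
## [1] FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
```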

sapply

You can use sapply() similar to how you used lapply(). The first argument of sapply() is the list or vector X over which you want to apply a function, FUN. Potential additional arguments to this function are specified afterwards (...):

sapply(X, FUN, ...)
lapply(dataframe_housework_time[,c("Me", "Husband")], max)
## $Me
## [1] 300
## 
## $Husband
## [1] 360
sapply(dataframe_housework_time[,c("Me", "Husband")], max)
##      Me Husband 
##     300     360

Like lapply(), sapply() allows you to use self-defined functions and apply them over a vector or a list:

sapply(dataframe_housework_time[,c("Me", "Husband")], my_fun1)
##       Me Husband
## [1,] 241       0
## [2,] 220       0
## [3,] 260      20
## [4,] 220      21
## [5,] 200       0
## [6,] 460     600
## [7,] 600     720

sapply with function returning vector

What if the function you’re applying over a list or a vector returns a vector of length greater than 1? For example, now we are to define an extremes() function. It takes a vector of numerical values and returns a vector containing the minimum and maximum values of a given vector, with the names “min” and “max”, respectively.

extremes <- function(x) {
  c(min = min(x), max = max(x))
}
sapply(dataframe_housework_time[,c("Me", "Husband")], extremes)
##      Me Husband
## min 100       0
## max 300     360
lapply(dataframe_housework_time[,c("Me", "Husband")], extremes)
## $Me
## min max 
## 100 300 
## 
## $Husband
## min max 
##   0 360

vapply

The function vapply() has the following syntax:

vapply(X, FUN, FUN.VALUE, ..., USE.NAMES = TRUE)

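The function FUN is applied over the elements of X. The FUN.VALUE argument expects a template for the return value of FUN, for example numeric(4) for a numeric vector of length 4; vapply() raises an error if a result does not match this template. USE.NAMES is TRUE by default.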

basics <- function(x) {
  c(min = min(x), mean = mean(x), median = median(x), max = max(x))
}
vapply(dataframe_housework_time[,c("Me", "Husband")], basics, numeric(4))
##              Me   Husband
## min    100.0000   0.00000
## mean   157.2143  97.21429
## median 120.5000  10.00000
## max    300.0000 360.00000

Another example, with an anonymous function that takes an additional argument y:

vapply(dataframe_housework_time[,c("Me", "Husband")], 
       function(x,y){min(x)>y}, FUN.VALUE = logical(1), y=0)
##      Me Husband 
##    TRUE   FALSE
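The template is what makes vapply() stricter than sapply(). A sketch on made-up data (values and outcome are hypothetical names): results that match the template are returned as usual, while a mismatch stops with an error, caught here with tryCatch() so the example keeps running:

```r
values <- list(a = c(1, 2, 3), b = c(4, 5))
# Each result has length 1, matching the numeric(1) template.
vapply(values, mean, FUN.VALUE = numeric(1))
##   a   b 
## 2.0 4.5 
# range() returns two values, so the same template is rejected.
outcome <- tryCatch(
  vapply(values, range, FUN.VALUE = numeric(1)),
  error = function(e) "vapply stopped with an error"
)
outcome
## [1] "vapply stopped with an error"
```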

Mathematical utilities

Here are some useful math functions that R features, such as abs() (absolute value), round() (rounding to a given number of digits), sum(), and mean():

errors <- c(1.9, -2.6, 4.0, -9.5, -3.4, 7.3)
round(errors, digits = 0)
## [1]   2  -3   4 -10  -3   7
sum(abs(round(errors, digits = 0)))
## [1] 29

Data utilities

R features a number of functions for manipulating data structures, such as seq(), rep(), sort(), rev(), and unlist().
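As an illustrative sketch of a few such helpers (the selection of functions and the inputs are our own choice for this example, not a list taken from the lecture):

```r
seq(1, 10, by = 3)             # regular sequence
## [1]  1  4  7 10
rep(c("a", "b"), times = 2)    # repeat a vector
## [1] "a" "b" "a" "b"
rev(sort(c(3, 1, 2)))          # sort, then reverse
## [1] 3 2 1
unlist(list(1, c(2, 3)))       # flatten a list into a vector
## [1] 1 2 3
```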

grepl & grep

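In their most basic form, regular expressions can be used to see whether a pattern exists inside a character string or a vector of character strings. For this purpose, you can use grepl(), which returns TRUE for each element in which the pattern is found, and grep(), which returns the indices of the elements that contain the pattern: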

emails <- c("john.doe@ivyleague.edu", "education@world.gov", 
            "dalai.lama@peace.org", "invalid.edu", 
            "quant@bigdatacollege.edu", "cookie.monster@sesame.tv")
grepl(pattern = "edu", x = emails)
## [1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE
hits <- grep(pattern = "edu", x = emails)
emails[hits]
## [1] "john.doe@ivyleague.edu"   "education@world.gov"     
## [3] "invalid.edu"              "quant@bigdatacollege.edu"

grepl & grep (2)

You can use the caret, ^, and the dollar sign, $, to match content at the start and the end of a string, respectively. This takes us one step closer to a pattern that matches only the .edu email addresses in our list. The pattern below combines three pieces: @, because a valid email address must contain an at-sign; .*, which matches any character (.) zero or more times (*); and \\.edu$, which matches the literal text ".edu" at the end of the string (the backslashes escape the dot so that it is not read as "any character"):

grepl(pattern = "@.*\\.edu$", x = emails)
## [1]  TRUE FALSE FALSE FALSE  TRUE FALSE
hits <- grep(pattern = "@.*\\.edu$", x = emails)
emails[hits]
## [1] "john.doe@ivyleague.edu"   "quant@bigdatacollege.edu"
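The effect of the two anchors can be sketched in isolation (words is a vector made up for this illustration):

```r
words <- c("education", "bigdata.edu", "edu")
grepl("^edu", words)     # starts with "edu"
## [1]  TRUE FALSE  TRUE
grepl("edu$", words)     # ends with "edu"
## [1] FALSE  TRUE  TRUE
grepl("^edu$", words)    # is exactly "edu"
## [1] FALSE FALSE  TRUE
```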

sub & gsub (1)

While grep() and grepl() were used to simply check whether a regular expression could be matched with a character vector, sub() and gsub() take it one step further: you can specify a replacement argument. If inside the character vector x, the regular expression pattern is found, the matching element(s) will be replaced with replacement. sub() only replaces the first match, whereas gsub() replaces all matches.

sub(pattern = "@.*\\.edu$", replacement = "@kuhs.ac.jp", x = emails)
## [1] "john.doe@kuhs.ac.jp"      "education@world.gov"     
## [3] "dalai.lama@peace.org"     "invalid.edu"             
## [5] "quant@kuhs.ac.jp"         "cookie.monster@sesame.tv"
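In the example above each address contains at most one match, so sub() and gsub() would give the same result. The difference shows up when a string contains several matches, as in this small sketch (fruit is a made-up variable):

```r
fruit <- "banana"
sub("a", "o", fruit)     # replaces the first match only
## [1] "bonana"
gsub("a", "o", fruit)    # replaces every match
## [1] "bonono"
```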

sub & gsub (2)

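Regular expressions are a typical concept that you’ll learn by doing and by seeing other examples. Before you rack your brains over the regular expression in this exercise, have a look at the new things that will be used: .* matches any character zero or more times; \\s matches a whitespace character; [0-9]+ matches one or more digits; and parentheses, as in ([0-9]+), capture the part they enclose so that it can be referenced in the replacement as \\1.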

sub & gsub (3)

awards <- c("Won 1 Oscar.",
  "Won 1 Oscar. Another 9 wins & 24 nominations.",
  "1 win and 2 nominations.",
  "2 wins & 3 nominations.",
  "Nominated for 2 Golden Globes. 1 more win & 2 nominations.",
  "4 wins & 1 nomination.")

sub(".*\\s([0-9]+)\\snomination.*$", "\\1", awards)
## [1] "Won 1 Oscar." "24"           "2"            "3"           
## [5] "2"            "1"

The ([0-9]+) captures the entire number that comes before the word “nomination” in the string, and the entire match is replaced by this number because \\1 refers to the content inside the parentheses. The first element contains no nomination, so the pattern does not match and the string is returned unchanged.

Encoding and locale

Times and dates

In R, dates are represented by Date objects, while times are represented by POSIXct objects. Under the hood, however, these dates and times are simple numerical values.

Get the current date: today

today <- Sys.Date()
today

See what today looks like under the hood

unclass(today)

Get the current time: now

now <- Sys.time()
now

See what now looks like under the hood

unclass(now)
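The numbers count from 1 January 1970: days for Date objects and seconds for POSIXct objects. A sketch with fixed dates so that the values are reproducible (epoch is a made-up name, and tz = "UTC" is set explicitly as an assumption to avoid local time zone offsets):

```r
epoch <- as.Date("1970-01-01")
unclass(epoch)
## [1] 0
unclass(as.Date("1970-01-10"))    # nine days after the origin
## [1] 9
# Going back from a number to a Date requires stating the origin.
as.Date(9, origin = "1970-01-01")
## [1] "1970-01-10"
# POSIXct counts seconds instead of days.
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))
## [1] 60
```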

Create and format dates

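To create a Date object from a simple character string in R, you can use the as.Date() function. The character string has to obey a format that can be defined using a set of symbols (the examples correspond to 13 January, 1982): %Y is the 4-digit year (1982), %y the 2-digit year (82), %m the 2-digit month (01), %d the 2-digit day of the month (13), %A the weekday (Wednesday), %a the abbreviated weekday (Wed), %B the month (January), and %b the abbreviated month (Jan).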

Create and format dates (2)

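The first three R commands below all create the same Date object for the 13th day in January of 1982. The remaining commands parse other dates from compact strings. Note that %b and %B match month names in the current locale, so the examples assume an English locale; as.Date() itself has no locale argument:

as.Date("1982-01-13")
as.Date("Jan1382", format = "%b%d%y")
as.Date("13 January, 1982", format = "%d %B, %Y")
x <- c("1jan1960", "2jan1960", "31mar1960", "30jul1960")
z <- as.Date(x, "%d%b%Y")
as.Date("20150905", format = "%Y%m%d")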

Create and format dates (3)

In addition to creating dates, you can also convert dates to character strings that use a different date notation. For this, you use the format() function.

today <- Sys.Date()
format(Sys.Date(), "%d %B, %Y")
format(Sys.Date(), format = "Today is a %A!")

Create and format dates (4)

Definition of character strings representing dates

str1 <- "May 23, '96"
str2 <- "2012-03-15"
str3 <- "30/January/2006"

Convert the strings to dates: date1, date2, date3

date1 <- as.Date(str1, format = "%b %d, '%y")
date2 <- as.Date(str2, format = "%Y-%m-%d")
date3 <- as.Date(str3, format = "%d/%B/%Y")

Convert dates to formatted strings

format(date1, "%A")
format(date2, "%d")
format(date3, "%b %Y")

Create and format times (5)

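Similar to working with dates, you can use as.POSIXct() to convert from a character string to a POSIXct object, and format() to convert from a POSIXct object to a character string. Again, you have a wide variety of symbols: %H is the hour on a 24-hour clock (00-23), %I the hour on a 12-hour clock (01-12), %M the minutes, %S the seconds, %T a shorthand for %H:%M:%S, and %p the AM/PM indicator.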

Create and format times (6)

Definition of character strings representing times

str1 <- "May 23, '96 hours:23 minutes:01 seconds:45"
str2 <- "2012-3-12 14:23:08"

Convert the strings to POSIXct objects: time1, time2

time1 <- as.POSIXct(str1, 
          format = "%B %d, '%y hours:%H minutes:%M seconds:%S")
time2 <- as.POSIXct(str2, format = "%Y-%m-%d %H:%M:%S")
#format(time1, "%M")
format(time2, "%I:%M %p")

Calculations with Dates

Both Date and POSIXct R objects are represented by simple numerical values under the hood. This makes calculation with time and date objects very straightforward: R performs the calculations using the underlying numerical values, and then converts the result back to human-readable time information again. You can increment and decrement Date objects, or do actual calculations with them:

today <- Sys.Date()
today + 1
today - 1

as.Date("2015-03-12") - as.Date("2015-02-27")

Calculations with Dates (2)

day1 <- as.Date("2018-08-15")
day2 <- as.Date("2018-08-17")
day3 <- as.Date("2018-08-22")
day4 <- as.Date("2018-08-28")
day5 <- as.Date("2018-09-02")
day5 - day1

daylist <- c(day1, day2, day3, day4, day5)
day_diff <- diff(daylist)
mean(day_diff)
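Subtracting two dates yields a difftime object rather than a plain number. As a sketch (gap is a made-up name), as.numeric() extracts the value, optionally in another unit:

```r
gap <- as.Date("2015-03-12") - as.Date("2015-02-27")
gap
## Time difference of 13 days
as.numeric(gap)                     # plain number of days
## [1] 13
as.numeric(gap, units = "weeks")    # the same gap expressed in weeks
## [1] 1.857143
```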

Calculations with Times

Calculations using POSIXct objects are completely analogous to those using Date objects.

now <- Sys.time()
now + 3600          
now - 3600 * 24 

login and logout time

login <- as.POSIXct(c("2018-08-19 10:18:04 UTC", "2018-08-24 09:14:18 UTC", 
                      "2018-08-24 12:21:51 UTC", "2018-08-24 12:37:24 UTC", 
                      "2018-08-26 21:37:55 UTC"))
logout <- as.POSIXct(c("2018-08-19 10:56:29 UTC", "2018-08-24 09:14:52 UTC", 
                       "2018-08-24 12:35:48 UTC", "2018-08-24 13:17:22 UTC", 
                       "2018-08-26 22:08:47 UTC"))
time_online <- logout - login
time_online
mean(time_online)
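With plain subtraction R chooses the unit of the result automatically. A sketch using the first login/logout pair (session is a made-up name, and tz = "UTC" is passed explicitly as an assumption so the strings are not read in the local time zone) shows how difftime() fixes the unit:

```r
login_1  <- as.POSIXct("2018-08-19 10:18:04", tz = "UTC")
logout_1 <- as.POSIXct("2018-08-19 10:56:29", tz = "UTC")
# Ask for the difference in minutes instead of letting R pick a unit.
session <- difftime(logout_1, login_1, units = "mins")
units(session)
## [1] "mins"
as.numeric(session)    # 38 minutes and 25 seconds
## [1] 38.41667
```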

R packages

Statistical modeling

Designing and analysing clinical trials

Analyse survey data

Writing functions in R

Reporting with R Markdown

Text mining

Building dashboards and interactive UI

Bioconductor