R Introduction

Why use R?

Many reasons:

Free and open-source
A large and comprehensive set of packages (>8600) for accessing and cleaning data, analyzing it, and generating reports and visualizations
An active and friendly developer community and an even larger user community

How to write in R

Software used: base R and RStudio. Base R is the core software that contains the R programming language. RStudio is an application that makes programming in R easier. When working in RStudio, you will find four panes: source, console, environment/history and files/plots/help. Normally you will write your code in the source pane.

Figure 1. Four panes of RStudio. Source: https://bookdown.org/.

Set the working environment

Next, tell R where your files are stored. This folder is the “Working Directory”. To set the working directory, pass the path of the folder where your files are located to the following function:

# Set the working directory ("wd"). setwd() returns the previous wd, which is stored here in the object "w"
w <- setwd("C:/yourpath/yourfolder")

# Check where your wd is
getwd()
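
Note that setwd() returns the previous working directory rather than the new one. The following is a minimal sketch of that behaviour; it assumes the session's temporary folder (tempdir()) as a stand-in for your own project folder, so that it runs on any machine:

```r
# Remember where we are now
old_wd <- getwd()

# Switch to a folder that certainly exists (tempdir() is only a stand-in
# for your own project folder); setwd() returns the previous wd invisibly
previous_wd <- setwd(tempdir())

# The returned value is the old location, not the new one
identical(previous_wd, old_wd)

# Go back to where we started
setwd(previous_wd)
```

Storing the returned value is handy when a script should change folders only temporarily and then restore the original location.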

In addition to the previous step, it is also useful to adapt the settings to your needs. In RStudio, under Tools > Global Options > General, you can also change the default working directory. Explore the other options on the left panel to customize the settings to your taste.

Installing packages

To work in R, you use functions. Many of them have already been written by other people and are bundled into packages. An R package is simply a collection of functions, data, examples and documentation, stored in one neat bundle. The first step is to install the packages you need with the function “install.packages”; afterwards you load each package into your session with the command “library”. There will be further information on this topic in later lessons.

To be able to complete this lesson you will need to install the following packages.

install.packages("dplyr")
install.packages("magrittr")

library(dplyr)
library(magrittr)
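
Installing a package is only needed once per computer, while library() is needed in every new session. As a hedged sketch (the object names needed, is_installed and to_install are made up for this example), you could first check which packages are missing and install only those:

```r
# Packages required for this lesson
needed <- c("dplyr", "magrittr")

# requireNamespace() returns TRUE if a package is installed, FALSE otherwise
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still have to be installed
to_install <- needed[!is_installed]

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # Uncomment the next line to install them:
  # install.packages(to_install)
}
```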

TIPS:

Whenever you want to run a line of code, press Ctrl+Enter.
If you want to plot a graph outside of RStudio, open an external window with X11() before calling your plot function.
When copying text from the Windows clipboard, you can type in the console: readClipboard()

And now, explore the R World!

Arithmetic

You can use R as a calculator.

# Calculate 6 + 12
6+12

# A subtraction
5 - 5 

# A multiplication
3 * 5

# A division
(5 + 5) / 2

# Exponentiation. 2 to the power of 5
2^5

# Modulo. The modulo returns the remainder of the division of the number to the left by the number on its right, for example 28 modulo 6 
28 %% 6
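
The modulo operator has a companion, the integer division operator %/%, which returns how many times the divisor fits completely. A small sketch (the variable names are only illustrative) shows how the two fit together:

```r
dividend <- 28
divisor <- 6

# Integer division: how many whole times 6 fits into 28
quotient <- dividend %/% divisor    # 4

# Modulo: what is left over afterwards
remainder <- dividend %% divisor    # 4

# Quotient and remainder rebuild the original number
divisor * quotient + remainder == dividend    # TRUE
```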

Variable assignment

The assignment operator <- tells R to store the value on its right-hand side in the variable named on its left-hand side.

Take into account that R is case sensitive.

# Assign the value 42 to x
x <- 42

# Print out the value of the variable x
x
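
Because R is case sensitive, names that differ only in capitalization refer to different objects. A short sketch with made-up variable names:

```r
# Two different variables: only the first letter differs
my_value <- 42
My_value <- "forty-two"

# Each name keeps its own content
my_value
My_value

# They are not the same object
identical(my_value, My_value)    # FALSE
```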

Types of variables

You will find different types of variables. Three common ones are character, numeric and logical.

Character (or string): text values. Note how the quotation marks on the right-hand side of the assignment indicate that “universe” is a character
Numeric: decimal values (for example: 4.5)
Logical: TRUE or FALSE values

# Declare variables of different types
my_numeric <- 42
my_character <- "universe"
my_logical <- FALSE

Functions for variables

# Check class of my_numeric
class(my_numeric)

## [1] "numeric"

# Check class of my_character
class(my_character)

## [1] "character"

# Check class of my_logical
class(my_logical)

## [1] "logical"
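
Besides checking the class, you can also convert between types with the as.*() family of functions. A brief sketch, using hypothetical variable names:

```r
# A number stored as text is a character, so you cannot calculate with it
my_text_number <- "4.5"
class(my_text_number)      # "character"

# Convert it to numeric before doing arithmetic
my_converted <- as.numeric(my_text_number)
class(my_converted)        # "numeric"
my_converted + 1           # 5.5

# Logical values convert to 1 (TRUE) and 0 (FALSE)
as.numeric(TRUE)           # 1
```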

Types of datasets: vector, matrix, data frame, list

Figure 2. Four types of datasets: vector, matrix, data frame, list. Source: https://mgimond.github.io/ES218/Week02a.html.

Vectors

In R, you create a vector with the combine function c(). You place the vector elements, separated by commas, between the parentheses. Vectors are one-dimensional arrays that can hold numeric data, character data, or logical data, but only one of these types at a time. In other words, a vector is a simple tool to store data. Once you have created vectors in R, you can use them to do calculations.

# Types of vectors
numeric_vector <- c(1, 2, 3)
character_vector <- c("a", "b", "c")

# Calories intake/burn per day
cal_vector <- c(140, -50, 20, -120, 240, 300, -60)

You can assign or change the names (labels) of the elements of a vector with the function names() and a character vector. You can do it in two ways:

First method:

# Assign days as names of cal_vector
names(cal_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
# Check vector
cal_vector

##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       140       -50        20      -120       240       300       -60

Second method:

# Or create a variable for the names of the week
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Assign the names
names(cal_vector) <- days_vector

Another example:

# Assign the same names to a second vector: Calories from a second person
cal2_vector <- c(-24, -50, 100, -350, 10, 80, 420)
names(cal2_vector) <- days_vector

Select vector components and apply some operations to them:

# Note the difference!
# Combined calories per day of both persons (element-wise sum, one value per day)
cal_perday <- cal_vector + cal2_vector
# Total calories of both persons over the whole week (one single value)
cal_total <- sum(cal_vector, cal2_vector)

# Call individual elements of your vectors
# only one, by position or by name
cal_wednesday <- cal_vector[3]
cal_wednesday <- cal_vector["Wednesday"]
# multiple (here Friday to Sunday)
cal_weekend <- cal_vector[c(5, 6, 7)]
cal_weekend <- cal_vector[5:7]

# Math
mean(cal_weekend)
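
To make the difference between the element-wise sum and the total concrete, here is a self-contained sketch that repeats the two calorie vectors from above and shows that both routes lead to the same weekly total:

```r
# Calories per day for two persons (same values as above)
cal_vector  <- c(140, -50, 20, -120, 240, 300, -60)
cal2_vector <- c(-24, -50, 100, -350, 10, 80, 420)

# Element-wise sum: a vector with one value per day
cal_perday <- cal_vector + cal2_vector
cal_perday    # 116 -100 120 -470 250 380 360

# Total: one single number for the whole week
cal_total <- sum(cal_vector, cal2_vector)
cal_total     # 656

# Summing the daily values gives the same total
sum(cal_perday) == cal_total    # TRUE
```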

For certain questions, you will get TRUE or FALSE as an answer for each element. The result is a logical vector.

For example, if you want to know which values are positive:

# Positive values
selection_vector <- cal_vector > 0
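
A logical vector becomes really useful when you place it inside the square brackets: R then keeps only the elements where it is TRUE. A minimal, self-contained sketch with the calorie data from above:

```r
# Calories per day, with the days as names
cal_vector <- c(140, -50, 20, -120, 240, 300, -60)
names(cal_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                       "Friday", "Saturday", "Sunday")

# TRUE for every day with a positive value
selection_vector <- cal_vector > 0

# Keep only the days where the condition is TRUE
cal_positive <- cal_vector[selection_vector]
cal_positive

# TRUE counts as 1, so sum() returns the number of positive days
sum(selection_vector)    # 4
```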

Matrix

In R, a matrix is a two-dimensional collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns.

To find out the number of rows and columns that a matrix has, use:

nrow(my_matrix): number of rows
ncol(my_matrix): number of columns

# Construct a matrix with 3 rows that contain the numbers 1 up to 9
xmat <- matrix(1:9, byrow=TRUE, nrow=3)

# Or construct a matrix by combining 3 vectors:
# Assume that we are testing the effect of light on different bacteria, in presence and absence of light. We measure how much oxygen they consume.
sp_1 <- c(460.998, 314.4)
sp_2 <- c(290.475, 247.900)
sp_3 <- c(309.306, 165.8)

A matrix can be built from independent vectors joined together.

cbind binds all vectors by columns into a matrix
rbind binds all vectors by rows into a matrix

Using the previous example, the box_matrix can be built using the cbind function:

box_matrix <- cbind(sp_1, sp_2, sp_3)
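
The choice between cbind and rbind decides the shape of the result. As a small sketch with the three vectors from above (the names by_columns and by_rows are made up for this example):

```r
sp_1 <- c(460.998, 314.4)
sp_2 <- c(290.475, 247.900)
sp_3 <- c(309.306, 165.8)

# Each vector becomes one column: 2 rows, 3 columns
by_columns <- cbind(sp_1, sp_2, sp_3)
dim(by_columns)    # 2 3

# Each vector becomes one row: 3 rows, 2 columns
by_rows <- rbind(sp_1, sp_2, sp_3)
dim(by_rows)       # 3 2
```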

Another example:

# Add names to columns. You can also replace names with these functions
col_names_vector <- c("light", "shadow", "blank")
colnames(box_matrix) <- col_names_vector
# For rows use: rownames(my_matrix) <- row_names_vector

# Check resultant matrix
View(box_matrix)

Another way is to build the matrix in two steps, using the c and matrix commands.

# The command c combines the vectors into one long vector, which matrix() then arranges into rows and columns. Here is a two-step example
box_vectors <- c(sp_1, sp_2, sp_3)

box_matrix <- matrix(box_vectors, nrow=3, byrow=TRUE)

The cbind and rbind commands can also be used to add an extra column or row to an existing matrix.

# Build the same matrix in one call, adding row and column names with 'dimnames'
box_vectors <- c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8)
box_matrix <- matrix(box_vectors, nrow = 3, byrow = TRUE,
                     dimnames = list(c("E. coli", "Lactobacillus", "Staphylococcus aureus"),
                                     c("light", "shadow")))

# To sum columns: colSums(). To sum rows: rowSums()
# Sum Rows
spcs_oxygen <- rowSums(box_matrix)
spcs_oxygen

# Add the row sums as a last column to the original matrix to have the total oxygen consumption
# For adding columns: cbind. For adding rows: rbind
oxy_matrix <- cbind(box_matrix, spcs_oxygen)
oxy_matrix

Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:

my_matrix[1,2] selects the element at the first row and second column.
my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

my_matrix[,1] selects all elements of the first column.
my_matrix[1,] selects all elements of the first row.

If you want to eliminate one or more columns or rows from a matrix, add a negative sign in front of the corresponding place of column or row:

my_matrix[, -1] eliminates the first column
my_matrix[-1:-5,] OR my_matrix[-(1:5),] eliminates the rows 1 to 5.

# Select the total O2 values (3rd column, 'spcs_oxygen') for E. coli and Lactobacillus
subset_bac <- oxy_matrix[1:2,3]
subset_bac

##       E. coli Lactobacillus 
##       775.398       538.375
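
When a matrix has row and column names, you can also select by name, which is often easier to read than positions. The sketch below rebuilds the matrix so that it runs on its own; the argument drop = FALSE is an addition to the text above and keeps the result as a matrix instead of simplifying it to a vector:

```r
# Rebuild the oxygen matrix from above
box_matrix <- matrix(c(460.998, 314.4, 290.475, 247.900, 309.306, 165.8),
                     nrow = 3, byrow = TRUE,
                     dimnames = list(c("E. coli", "Lactobacillus", "Staphylococcus aureus"),
                                     c("light", "shadow")))
oxy_matrix <- cbind(box_matrix, spcs_oxygen = rowSums(box_matrix))

# Selection by position and by name
by_position <- oxy_matrix[1:2, 3]
by_name <- oxy_matrix[c("E. coli", "Lactobacillus"), "spcs_oxygen"]

# Both selections contain the same values
all.equal(by_position, by_name)    # TRUE

# Keep the matrix structure (2 rows, 1 column) instead of a plain vector
kept_matrix <- oxy_matrix[1:2, 3, drop = FALSE]
dim(kept_matrix)    # 2 1
```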

Operations with matrices:

# Total average
mean(subset_bac) # mean() takes a vector and returns one single value

## [1] 656.8865

# Average by rows (rowMeans() is part of base R, so no extra package is needed)
meanrow <- rowMeans(oxy_matrix, na.rm = FALSE)

# Average by columns
meancol <- colMeans(oxy_matrix, na.rm = FALSE)

You can combine the previous commands and calculate, in one step, the mean of a selection of columns and rows. Pay attention to the na.rm argument:

na.rm = FALSE (the default) keeps NA’s in the calculation. Use it if you do not have NA’s in your dataset, because a single NA makes the result NA
na.rm = TRUE removes NA’s before calculating. Use it if you have NA’s in your dataset and you do not want them to be considered.

# In this case we don't have NAs in the dataset. 
subset_bac <- mean(oxy_matrix[1:2,3], na.rm = FALSE)
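
The effect of na.rm is easiest to see on a tiny vector that does contain a missing value. A minimal sketch with made-up numbers:

```r
# A vector with one missing value
values_with_na <- c(1, NA, 3)

# With the default na.rm = FALSE the result is NA
mean_default <- mean(values_with_na)
mean_default    # NA

# With na.rm = TRUE the NA is dropped first: (1 + 3) / 2
mean_clean <- mean(values_with_na, na.rm = TRUE)
mean_clean      # 2
```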

With the function summary you get an overview of the minimum, maximum, quartiles, mean and median of your data.

summary(oxy_matrix)

##      light           shadow       spcs_oxygen   
##  Min.   :290.5   Min.   :165.8   Min.   :475.1  
##  1st Qu.:299.9   1st Qu.:206.8   1st Qu.:506.7  
##  Median :309.3   Median :247.9   Median :538.4  
##  Mean   :353.6   Mean   :242.7   Mean   :596.3  
##  3rd Qu.:385.2   3rd Qu.:281.1   3rd Qu.:656.9  
##  Max.   :461.0   Max.   :314.4   Max.   :775.4

Finally, you can also use logical statements to detect the values in your matrix that answer your question.

For example, if you want to find out which values are not NA (TRUE marks a value that is present):

new_object <- !is.na(oxy_matrix)
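
is.na() works element by element. To go from single values to whole rows, you can count the missing values per row. The following sketch uses a small made-up matrix (na_matrix), because the oxygen matrix has no missing values:

```r
# A 2 x 2 matrix with one missing value (filled column by column)
na_matrix <- matrix(c(1, NA, 3, 4), nrow = 2)

# TRUE where a value is missing
is_missing <- is.na(na_matrix)

# Rows that contain at least one NA
rows_with_na <- which(rowSums(is_missing) > 0)
rows_with_na    # 2

# Keep only the complete rows
complete_rows <- na_matrix[rowSums(is_missing) == 0, , drop = FALSE]
complete_rows
```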

Factors

The term factor refers to a statistical data type used to store categorical variables. The difference between a categorical variable and a continuous variable is that a categorical variable can belong to a limited number of categories. A continuous variable, on the other hand, can correspond to an infinite number of values.

The function factor() encodes a vector as a factor.

# Create vector that contains observations related to a limited number of categories (Nominal categorical variable)
land_vector <- c("urban", "water", "urban", "grassland", "grassland", "water")

# Encode as factor
land_factor <- factor(land_vector)

# Select elements from a factor
land_factor[2]

# Change the names of the levels (this is also an argument of the function 'factor'). Levels are sorted alphabetically, so here: grassland, urban, water
levels(land_factor) <- c("Class1", "Class2", "Class3")

# Overview of variables
summary(land_factor)

# Ordinal variable: the order of the levels matters (Low < Medium < High)
fire_intensity <- c("High", "Low", "Medium", "Medium", "Low", "High")
fire_factors <- factor(fire_intensity, ordered = TRUE, levels = c("Low", "Medium", "High"))
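
The practical benefit of an ordered factor is that its values can be compared with each other. A short sketch based on the fire intensity example:

```r
fire_intensity <- c("High", "Low", "Medium", "Medium", "Low", "High")
fire_factors <- factor(fire_intensity, ordered = TRUE,
                       levels = c("Low", "Medium", "High"))

# Compare two observations: is the first (High) more intense than the second (Low)?
first_is_higher <- fire_factors[1] > fire_factors[2]
first_is_higher    # TRUE

# Keep all observations of at least medium intensity
at_least_medium <- fire_factors[fire_factors >= "Medium"]
length(at_least_medium)    # 4

# Count the observations per level, in the order of the levels
table(fire_factors)
```

With an unordered (nominal) factor such as the land cover example, these comparisons are not meaningful and R returns NA with a warning.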

Pipes

What are pipes?

A pipe “%>%” is an operator that lets you pass the result of one function directly to the next function, thus allowing chained function calls.

“%>%” is not part of base R; you need to install and load “magrittr” or “dplyr” to use it.

Why use pipes?

As R is a functional language, your code will often contain a lot of parentheses “(” and “)”. In complex code you will often have to nest these, which makes the code hard to read and understand.

Piping helps with that.

A short example:

# initialize 'x'
x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Compute the logarithm of `x`, return suitably lagged and iterated differences, compute the exponential function and round the result

round(exp(diff(log(x))), 1)

With the help of pipe (“%>%”) you can write the above as follows:

x %>% log() %>% diff() %>% exp() %>% round(1)

# OR 

x %>% log() %>%
  diff() %>%
  exp() %>%
  round(1)
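
If you want to convince yourself that the nested and the piped versions really do the same thing, you can store both results and compare them. A minimal sketch, assuming the package magrittr is installed:

```r
library(magrittr)

x <- c(0.109, 0.359, 0.63, 0.996, 0.515, 0.142, 0.017, 0.829, 0.907)

# Nested version: read from the inside out
nested_result <- round(exp(diff(log(x))), 1)

# Piped version: read from left to right
piped_result <- x %>% log() %>% diff() %>% exp() %>% round(1)

# Both give exactly the same values
identical(nested_result, piped_result)    # TRUE
```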

To further your understanding of “%>%” the following exercises will always also include examples of how to do them with “%>%”.

Functions for data frames

Create a data frame

vec1 <- c(1, 2, 3, 4, 5, 6, 7, 8)
vec2 <- vec1[c(1:8)]
mat1 <- c(5,9,11,0,15,57,88,105)

my_df <- data.frame(vec1, mat1, vec2) # creates a df with one column per vector: 'vec1', 'mat1' and 'vec2' (despite its name, 'mat1' is a numeric vector)

View data

View(my_df)

my_df %>% View()

Structure and summary of dataframe

# If your data frame is called: 'my_df'

str(my_df) # Check the structure

summary(my_df) # Make a summary of the whole data frame
summary(my_df[1,]) # Make a summary of the 1st row
summary(my_df[,3]) # Make a summary of the 3rd column
summary(my_df[,"vec1"]) # Make a summary of the column called "vec1"

min(my_df) # returns the minimum value of the whole df
max(my_df) # returns the maximum value of the whole df

ncol(my_df) # number of columns
nrow(my_df) # number of rows

With pipes

my_df %>% str()

my_df %>% summary()
my_df %>% slice(1) %>% summary()     # 'slice()' selects rows by position, use ':' for ranges
my_df %>% select(3) %>% summary()    # 'select()' selects columns by index or name, use ':' for ranges
my_df %>% select(vec1) %>% summary()

my_df %>% min()
my_df %>% max()

my_df %>% ncol()
my_df %>% nrow()

Operations with rows and columns

# number of rows
nrow(my_df)

# number of columns
ncol(my_df) 

# binds the vectors as columns (the result is a matrix, not a df)
cbind(vec1, vec2)

# binds the vectors as rows (the result is a matrix, not a df)
rbind(vec1, vec2)

With pipes

my_df %>% nrow()

my_df %>% ncol()

vec1 %>% cbind(vec2)

vec1 %>% rbind(vec2)

Select elements from the data frame

head(my_df) # First 6 rows
tail(my_df) # Last 6 rows

my_df[1,2]       # selects the element at the first row and second column
my_df[1:3,2:3]   # results in a df with the data on the rows 1, 2, 3 and columns 2, 3
my_df[,1]        # selects all elements of the first column
my_df[1,]        # selects all elements of the first row
my_df[, -1]      # eliminates the first column
my_df[-1:-5,]    # eliminates the rows 1 to 5, OR
my_df[-(1:5),]   # same result with a shorter notation
my_df[,c('vec1','mat1','vec2')]    # selects columns by column name

my_df$mat1 # returns all the values of the column named 'mat1'
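
One detail that often surprises beginners: selecting a single column with [ , ] returns a plain vector, not a data frame. The sketch below rebuilds a small data frame so that it runs on its own; the argument drop = FALSE is an addition to the text above and keeps the data frame structure:

```r
# Same data frame as above, rebuilt so that the example is self-contained
my_df <- data.frame(vec1 = 1:8,
                    mat1 = c(5, 9, 11, 0, 15, 57, 88, 105),
                    vec2 = 1:8)

# A single column is simplified to a vector
col_as_vector <- my_df[, 2]
is.data.frame(col_as_vector)    # FALSE

# drop = FALSE keeps it as a data frame with one column
col_as_df <- my_df[, 2, drop = FALSE]
is.data.frame(col_as_df)        # TRUE

# The $ notation returns the same vector as my_df[, 2]
identical(my_df$mat1, col_as_vector)    # TRUE
```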

With pipes

my_df %>% head()
my_df %>% tail()

my_df %>% slice(1) %>% select(2)
my_df %>% slice(1:3) %>% select(2:3)
my_df %>% select(1)   # selects first column
my_df %>% slice(1)    # selects first row
my_df %>% select(-1)
my_df %>% select(vec1, mat1, vec2)
my_df %>% select(vec1:vec2)

my_df %>% select(mat1)

Assign / Change names

Method A:

# Create vector 'vec1'
vec1 <- c(140, -50, 20, -120, 240, 300, -60)
# Assign days as names of vec1
names(vec1) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")
# Check vector
vec1

##    Monday   Tuesday Wednesday  Thursday    Friday  Saturday    Sunday 
##       140       -50        20      -120       240       300       -60

Method B:

# Or create a variable for the names of the week
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")

# Assign the names
names(vec1) <- days_vector

With pipes

days_vector2 <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday", "Someday")
my_df <- my_df %>% set_rownames(days_vector2)
my_df

##           vec1 mat1 vec2
## Monday       1    5    1
## Tuesday      2    9    2
## Wednesday    3   11    3
## Thursday     4    0    4
## Friday       5   15    5
## Saturday     6   57    6
## Sunday       7   88    7
## Someday      8  105    8

my_df <- my_df %>% rename(data1 = 1,  
                          data2 = 2,
                          data3 = 3)  # 'new_name' = 'old_name' / index
my_df

##           data1 data2 data3
## Monday        1     5     1
## Tuesday       2     9     2
## Wednesday     3    11     3
## Thursday      4     0     4
## Friday        5    15     5
## Saturday      6    57     6
## Sunday        7    88     7
## Someday       8   105     8

#OR
my_df <- my_df %>% rename(col1 = data1,
                          col2 = data2,
                          col3 = data3)
my_df

##           col1 col2 col3
## Monday       1    5    1
## Tuesday      2    9    2
## Wednesday    3   11    3
## Thursday     4    0    4
## Friday       5   15    5
## Saturday     6   57    6
## Sunday       7   88    7
## Someday      8  105    8

Operations

mean(my_df[,3], na.rm = TRUE) # Mean of the 3rd column and all rows
mean(my_df[1:3,2], na.rm = TRUE) # Mean of the first three rows of the second column

# mean() needs a vector; it does not work on a whole data frame!

rowMeans(my_df[1:3,], na.rm = TRUE) # Mean of each of the first three rows
colMeans(my_df["col2"], na.rm = TRUE)   # Mean of all values of the column 'col2'

rowSums(my_df, na.rm=TRUE) # sums up all the values from each row

is.na(my_df[,"col1"]) # shows for each element of 'col1' if it contains NA

subset(my_df,select = col1) # creates a subset of the df using only the column named "col1"
subset(my_df, col1 > 4, select = col2) # creates a subset of the df using only values of "col2" where "col1" is greater than 4

sum(my_df$col1, na.rm = TRUE)     # add all elements of 'col1' together
sum(my_df$col1 > 4)           # count the amount of elements bigger than 4 from the column "col1"

sort(my_df$col2) # sorts all elements of 'col2' in ascending order
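
The following sketch illustrates why mean() cannot be applied to a whole data frame and shows two possible alternatives. The data frame is rebuilt here so that the example runs on its own; using sapply() is one option among several and is not taken from the text above:

```r
# Same values as the data frame above
my_df <- data.frame(col1 = 1:8,
                    col2 = c(5, 9, 11, 0, 15, 57, 88, 105),
                    col3 = 1:8)

# mean() on the whole data frame returns NA (and a warning, silenced here)
whole_df_mean <- suppressWarnings(mean(my_df))
whole_df_mean    # NA

# Alternative 1: apply mean() to every column separately
col_means <- sapply(my_df, mean)
col_means        # col1 = 4.5, col2 = 36.25, col3 = 4.5

# Alternative 2: convert to a matrix first to get one mean over all values
overall_mean <- mean(as.matrix(my_df))
```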

With pipes

my_df %>% select(3) %>% as.matrix() %>% mean(na.rm = TRUE)
my_df %>% select(2) %>% slice(1:3) %>% as.matrix() %>% mean(na.rm = TRUE)

my_df %>% slice(1:3) %>% rowMeans(na.rm = TRUE)
my_df %>% select(col2) %>% colMeans(na.rm = TRUE)

my_df %>% rowSums(na.rm = TRUE)

my_df %>% select(col1) %>% is.na()

my_df %>% subset(select = col1)
my_df %>% subset(col1 > 4, select = col2)

my_df %>% subset(col1 > 4, select = col1) %>% sum(na.rm = TRUE) # sum of elements bigger than 4 in 'col1'
my_df %>% subset(col1 > 4, select = col1) %>% as.matrix() %>% length() # number of elements bigger than 4 in 'col1'

my_df %>% select(col2) %>% as.matrix() %>% sort()

Lists

A list in R allows you to gather a variety of objects under one name (that is, the name of the list) in an ordered way. These objects can be matrices, vectors, data frames, even other lists, etc. The objects do not need to be related to each other in any way.

# Build a list out of 3 types of data: 
# Vector with numerics from 1 up to 10
my_vector <- 1:10 

# Matrix with numerics from 1 up to 9
my_matrix <- matrix(1:9, ncol = 3)

# First 10 rows of the built-in data frame mtcars
my_df <- mtcars[1:10,]

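# Construct a list with these different elements
# Without names:
my_list <- list(my_vector, my_matrix, my_df)
# With names (this overwrites the unnamed list above):
my_list <- list(vec = my_vector, matrix = my_matrix, df = my_df)

print(my_list)

# Select elements from a list
# Components, for example, the matrix
my_list[2]          # single brackets return a list that contains the matrix
my_list$matrix[,2]  # second column of the component named 'matrix'
# Values, for example, in the matrix
my_list[[2]][,2]    # double brackets return the matrix itself, here its second column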
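
The difference between single and double square brackets is easy to check with a small, self-contained sketch (the list here is rebuilt with only two components to keep it short):

```r
# A small named list with a vector and a matrix
my_list <- list(vec = 1:10, matrix = matrix(1:9, ncol = 3))

# Single brackets: the result is still a list (of length 1)
as_list <- my_list[2]
is.list(as_list)        # TRUE
is.matrix(as_list)      # FALSE

# Double brackets: the result is the matrix itself
as_matrix <- my_list[[2]]
is.matrix(as_matrix)    # TRUE

# Therefore only the second form can be indexed like a matrix
as_matrix[, 2]          # 4 5 6
```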

R Introduction

Yrneh Ulloa & Severin Herzprung

October 2021
